DataFrame#

This page gives an overview of all public DataFrame methods.

class polars.DataFrame(data: dict[str, Sequence[Any]] | Sequence[Any] | np.ndarray[Any, Any] | pa.Table | pd.DataFrame | Series | None = None, columns: ColumnsType | None = None, orient: Orientation | None = None)[source]

Two-dimensional data structure representing data as a table with rows and columns.

Parameters:
data : dict, Sequence, ndarray, Series, or pandas.DataFrame

Two-dimensional data in various forms. dict must contain Sequences. Sequence may contain Series or other Sequences.

columns : Sequence of str or (str, DataType) pairs, default None

Column labels to use for resulting DataFrame. If specified, overrides any labels already present in the data. Must match data dimensions.

orient : {‘col’, ‘row’}, default None

Whether to interpret two-dimensional data as columns or as rows. If None, the orientation is inferred by matching the columns and data dimensions. If this does not yield conclusive results, column orientation is used.

Notes

Some methods internally convert the DataFrame into a LazyFrame before collecting the results back into a DataFrame. This can lead to unexpected behavior when using a subclassed DataFrame. For example,

>>> class MyDataFrame(pl.DataFrame):
...     pass
...
>>> isinstance(MyDataFrame().lazy().collect(), MyDataFrame)
False

Examples

Constructing a DataFrame from a dictionary:

>>> data = {"a": [1, 2], "b": [3, 4]}
>>> df = pl.DataFrame(data)
>>> df
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 4   │
└─────┴─────┘

Notice that the dtype is automatically inferred as a polars Int64:

>>> df.dtypes
[<class 'polars.datatypes.Int64'>, <class 'polars.datatypes.Int64'>]

In order to specify dtypes for your columns, initialize the DataFrame with a list of typed Series:

>>> data = [
...     pl.Series("col1", [1, 2], dtype=pl.Float32),
...     pl.Series("col2", [3, 4], dtype=pl.Int64),
... ]
>>> df2 = pl.DataFrame(data)
>>> df2
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f32  ┆ i64  │
╞══════╪══════╡
│ 1.0  ┆ 3    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2.0  ┆ 4    │
└──────┴──────┘

Or set the columns parameter with a list of (name, dtype) pairs (compatible with all of the other valid data parameter types):

>>> data = {"col1": [1, 2], "col2": [3, 4]}
>>> df3 = pl.DataFrame(data, columns=[("col1", pl.Float32), ("col2", pl.Int64)])
>>> df3
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f32  ┆ i64  │
╞══════╪══════╡
│ 1.0  ┆ 3    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2.0  ┆ 4    │
└──────┴──────┘

Constructing a DataFrame from a numpy ndarray, specifying column names:

>>> import numpy as np
>>> data = np.array([(1, 2), (3, 4)], dtype=np.int64)
>>> df4 = pl.DataFrame(data, columns=["a", "b"], orient="col")
>>> df4
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 4   │
└─────┴─────┘

Constructing a DataFrame from a list of lists, row orientation inferred:

>>> data = [[1, 2, 3], [4, 5, 6]]
>>> df4 = pl.DataFrame(data, columns=["a", "b", "c"])
>>> df4
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 5   ┆ 6   │
└─────┴─────┴─────┘

Methods:

apply

Apply a custom/user-defined function (UDF) over the rows of the DataFrame.

cleared

Create an empty copy of the current DataFrame.

clone

Cheap deepcopy/clone.

describe

Summary statistics for a DataFrame.

drop

Remove column(s) from DataFrame and return as new.

drop_in_place

Drop a single column in place.

drop_nulls

Return a new DataFrame where the null values are dropped.

estimated_size

Return an estimation of the total (heap) allocated size of the DataFrame.

explode

Explode DataFrame to long format by exploding a column with Lists.

extend

Extend the memory backed by this DataFrame with the values from other.

fill_nan

Fill floating point NaN values by an Expression evaluation.

fill_null

Fill null values using the specified value or strategy.

filter

Filter the rows in the DataFrame based on a predicate expression.

find_idx_by_name

Find the index of a column by name.

fold

Apply a horizontal reduction on a DataFrame.

frame_equal

Check if DataFrame is equal to other.

get_column

Get a single column as Series by name.

get_columns

Get the DataFrame as a List of Series.

groupby

Start a groupby operation.

groupby_dynamic

Group based on a time value (or index value of type Int32, Int64).

groupby_rolling

Create rolling groups based on a time column.

hash_rows

Hash and combine the rows in this DataFrame.

head

Get the first n rows.

hstack

Return a new DataFrame grown horizontally by stacking multiple Series to it.

insert_at_idx

Insert a Series at a certain column index.

interpolate

Interpolate intermediate values.

is_duplicated

Get a mask of all duplicated rows in this DataFrame.

is_empty

Check if the dataframe is empty.

is_unique

Get a mask of all unique rows in this DataFrame.

join

Join in SQL-like fashion.

join_asof

Perform an asof join.

lazy

Start a lazy query from this point.

limit

Get the first n rows.

max

Aggregate the columns of this DataFrame to their maximum value.

mean

Aggregate the columns of this DataFrame to their mean value.

median

Aggregate the columns of this DataFrame to their median value.

melt

Unpivot a DataFrame from wide to long format.

min

Aggregate the columns of this DataFrame to their minimum value.

n_chunks

Get number of chunks used by the ChunkedArrays of this DataFrame.

n_unique

Return the number of unique rows, or the number of unique row-subsets.

null_count

Create a new DataFrame that shows the null counts per column.

partition_by

Split into multiple DataFrames partitioned by groups.

pearson_corr

Return Pearson product-moment correlation coefficients.

pipe

Offers a structured way to apply a sequence of user-defined functions (UDFs).

pivot

Create a spreadsheet-style pivot table as a DataFrame.

product

Aggregate the columns of this DataFrame to their product values.

quantile

Aggregate the columns of this DataFrame to their quantile value.

rechunk

Rechunk the data in this DataFrame to a contiguous allocation.

rename

Rename column names.

replace

Replace a column by a new Series.

replace_at_idx

Replace a column at an index location.

reverse

Reverse the DataFrame.

row

Get a row as tuple, either by index or by predicate.

rows

Convert columnar data to rows as python tuples.

sample

Sample from this DataFrame.

select

Select columns from this DataFrame.

shift

Shift values by the given period.

shift_and_fill

Shift the values by a given period and fill the resulting null values.

shrink_to_fit

Shrink DataFrame memory usage.

slice

Get a slice of this DataFrame.

sort

Sort the DataFrame by column.

std

Aggregate the columns of this DataFrame to their standard deviation value.

sum

Aggregate the columns of this DataFrame to their sum value.

tail

Get the last n rows.

take_every

Take every nth row in the DataFrame and return as a new DataFrame.

to_arrow

Collect the underlying arrow arrays in an Arrow Table.

to_dict

Convert DataFrame to a dictionary mapping column name to values.

to_dicts

Convert every row to a dictionary.

to_dummies

Get one hot encoded dummy variables.

to_numpy

Convert DataFrame to a 2D NumPy array.

to_pandas

Cast to a pandas DataFrame.

to_series

Select column as Series at index location.

to_struct

Convert a DataFrame to a Series of type Struct.

transpose

Transpose a DataFrame over the diagonal.

unique

Drop duplicate rows from this DataFrame.

unnest

Decompose a struct into its fields.

unstack

Unstack a long table to a wide form without doing an aggregation.

upsample

Upsample a DataFrame at a regular frequency.

var

Aggregate the columns of this DataFrame to their variance value.

vstack

Grow this DataFrame vertically by stacking a DataFrame to it.

with_column

Return a new DataFrame with the column added or replaced.

with_columns

Add or overwrite multiple columns in a DataFrame.

with_row_count

Add a column at index 0 that counts the rows.

write_avro

Write to Apache Avro file.

write_csv

Write to comma-separated values (CSV) file.

write_ipc

Write to Arrow IPC binary stream or Feather file.

write_json

Serialize to JSON representation.

write_ndjson

Serialize to newline delimited JSON representation.

write_parquet

Write to Apache Parquet file.

Attributes:

columns

Get or set column names.

dtypes

Get dtypes of columns in DataFrame.

height

Get the height of the DataFrame.

schema

Get a dict[column name, DataType].

shape

Get the shape of the DataFrame.

width

Get the width of the DataFrame.

apply(f: Callable[[tuple[Any, ...]], Any], return_dtype: Optional[Union[Type[DataType], DataType]] = None, inference_size: int = 256) DF[source]

Apply a custom/user-defined function (UDF) over the rows of the DataFrame.

The UDF will receive each row as a tuple of values: udf(row).

Implementing logic using a Python function is almost always _significantly_ slower and more memory intensive than implementing the same logic using the native expression API because:

  • The native expression engine runs in Rust; UDFs run in Python.

  • Use of Python UDFs forces the DataFrame to be materialized in memory.

  • Polars-native expressions can be parallelised (UDFs cannot).

  • Polars-native expressions can be logically optimised (UDFs cannot).

Wherever possible you should strongly prefer the native expression API to achieve the best performance.

Parameters:
f

Custom function or lambda function.

return_dtype

Output type of the operation. If none given, Polars tries to infer the type.

inference_size

Only used when the custom function returns rows. The first n rows are used to determine the output schema.

Notes

The frame-level apply cannot track column names (as the UDF is a black-box that may arbitrarily drop, rearrange, transform, or add new columns); if you want to apply a UDF such that column names are preserved, you should use the expression-level apply syntax instead.

Examples

>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [-1, 5, 8]})

Return a DataFrame by mapping each row to a tuple:

>>> df.apply(lambda t: (t[0] * 2, t[1] * 3))
shape: (3, 2)
┌──────────┬──────────┐
│ column_0 ┆ column_1 │
│ ---      ┆ ---      │
│ i64      ┆ i64      │
╞══════════╪══════════╡
│ 2        ┆ -3       │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 4        ┆ 15       │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 6        ┆ 24       │
└──────────┴──────────┘

It is better to implement this with an expression:

>>> (
...     df.select([pl.col("foo") * 2, pl.col("bar") * 3])
... )  

Return a Series by mapping each row to a scalar:

>>> df.apply(lambda t: (t[0] * 2 + t[1]))
shape: (3, 1)
┌───────┐
│ apply │
│ ---   │
│ i64   │
╞═══════╡
│ 1     │
├╌╌╌╌╌╌╌┤
│ 9     │
├╌╌╌╌╌╌╌┤
│ 14    │
└───────┘

In this case it is better to use the following expression:

>>> df.select(pl.col("foo") * 2 + pl.col("bar"))  
cleared() DF[source]

Create an empty copy of the current DataFrame.

Returns a DataFrame with identical schema but no data.

See also

clone

Cheap deepcopy/clone.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [None, 2, 3, 4],
...         "b": [0.5, None, 2.5, 13],
...         "c": [True, True, False, None],
...     }
... )
>>> df.cleared()
shape: (0, 3)
┌─────┬─────┬──────┐
│ a   ┆ b   ┆ c    │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ f64 ┆ bool │
╞═════╪═════╪══════╡
└─────┴─────┴──────┘
clone() DF[source]

Cheap deepcopy/clone.

See also

cleared

Create an empty copy of the current DataFrame, with identical schema but no data.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [0.5, 4, 10, 13],
...         "c": [True, True, False, True],
...     }
... )
>>> df.clone()
shape: (4, 3)
┌─────┬──────┬───────┐
│ a   ┆ b    ┆ c     │
│ --- ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╡
│ 1   ┆ 0.5  ┆ true  │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ 4.0  ┆ true  │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3   ┆ 10.0 ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4   ┆ 13.0 ┆ true  │
└─────┴──────┴───────┘
property columns: list[str][source]

Get or set column names.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.columns
['foo', 'bar', 'ham']

Set column names:

>>> df.columns = ["apple", "banana", "orange"]
>>> df
shape: (3, 3)
┌───────┬────────┬────────┐
│ apple ┆ banana ┆ orange │
│ ---   ┆ ---    ┆ ---    │
│ i64   ┆ i64    ┆ str    │
╞═══════╪════════╪════════╡
│ 1     ┆ 6      ┆ a      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2     ┆ 7      ┆ b      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3     ┆ 8      ┆ c      │
└───────┴────────┴────────┘
describe() DF[source]

Summary statistics for a DataFrame.

Examples

>>> from datetime import date
>>> df = pl.DataFrame(
...     {
...         "a": [1.0, 2.8, 3.0],
...         "b": [4, 5, None],
...         "c": [True, False, True],
...         "d": [None, "b", "c"],
...         "e": ["usd", "eur", None],
...         "f": [date(2020, 1, 1), date(2021, 1, 1), date(2022, 1, 1)],
...     }
... )
>>> df.describe()
shape: (7, 7)
┌────────────┬──────────┬──────────┬──────┬──────┬──────┬────────────┐
│ describe   ┆ a        ┆ b        ┆ c    ┆ d    ┆ e    ┆ f          │
│ ---        ┆ ---      ┆ ---      ┆ ---  ┆ ---  ┆ ---  ┆ ---        │
│ str        ┆ f64      ┆ f64      ┆ f64  ┆ str  ┆ str  ┆ str        │
╞════════════╪══════════╪══════════╪══════╪══════╪══════╪════════════╡
│ count      ┆ 3.0      ┆ 3.0      ┆ 3.0  ┆ 3    ┆ 3    ┆ 3          │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null_count ┆ 0.0      ┆ 1.0      ┆ 0.0  ┆ 1    ┆ 1    ┆ 0          │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mean       ┆ 2.266667 ┆ 4.5      ┆ null ┆ null ┆ null ┆ null       │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ std        ┆ 1.101514 ┆ 0.707107 ┆ null ┆ null ┆ null ┆ null       │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ min        ┆ 1.0      ┆ 4.0      ┆ 0.0  ┆ b    ┆ eur  ┆ 2020-01-01 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ max        ┆ 3.0      ┆ 5.0      ┆ 1.0  ┆ c    ┆ usd  ┆ 2022-01-01 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ median     ┆ 2.8      ┆ 4.5      ┆ null ┆ null ┆ null ┆ null       │
└────────────┴──────────┴──────────┴──────┴──────┴──────┴────────────┘
drop(columns: Union[str, Sequence[str]]) DF[source]

Remove column(s) from DataFrame and return as new.

Parameters:
columns

Column(s) to drop.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.drop("ham")
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 6.0 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 7.0 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 8.0 │
└─────┴─────┘
drop_in_place(name: str) Series[source]

Drop a single column in place and return the dropped column as a Series.

Parameters:
name

Column to drop.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.drop_in_place("ham")
shape: (3,)
Series: 'ham' [str]
[
    "a"
    "b"
    "c"
]
drop_nulls(subset: Optional[Union[str, Sequence[str]]] = None) DF[source]

Return a new DataFrame where the null values are dropped.

Parameters:
subset

Subset of column(s) on which drop_nulls will be applied.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, None, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.drop_nulls()
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 8   ┆ c   │
└─────┴─────┴─────┘
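
To restrict the null check to particular columns, pass subset. A hedged sketch with the same frame: checking only the "foo" and "ham" columns (which contain no nulls) keeps every row:

>>> df.drop_nulls(subset=["foo", "ham"])
shape: (3, 3)
┌─────┬──────┬─────┐
│ foo ┆ bar  ┆ ham │
│ --- ┆ ---  ┆ --- │
│ i64 ┆ i64  ┆ str │
╞═════╪══════╪═════╡
│ 1   ┆ 6    ┆ a   │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ null ┆ b   │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 8    ┆ c   │
└─────┴──────┴─────┘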

This method drops a row if any single value of the row is null.

Below are some example snippets that show how you could drop null values based on other conditions:

>>> df = pl.DataFrame(
...     {
...         "a": [None, None, None, None],
...         "b": [1, 2, None, 1],
...         "c": [1, None, None, 1],
...     }
... )
>>> df
shape: (4, 3)
┌──────┬──────┬──────┐
│ a    ┆ b    ┆ c    │
│ ---  ┆ ---  ┆ ---  │
│ f64  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╡
│ null ┆ 1    ┆ 1    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ 2    ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ null ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ 1    ┆ 1    │
└──────┴──────┴──────┘

Drop a row only if all values are null:

>>> df.filter(
...     ~pl.fold(
...         acc=True,
...         f=lambda acc, s: acc & s.is_null(),
...         exprs=pl.all(),
...     )
... )
shape: (3, 3)
┌──────┬─────┬──────┐
│ a    ┆ b   ┆ c    │
│ ---  ┆ --- ┆ ---  │
│ f64  ┆ i64 ┆ i64  │
╞══════╪═════╪══════╡
│ null ┆ 1   ┆ 1    │
├╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ 2   ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ 1   ┆ 1    │
└──────┴─────┴──────┘

Drop a column if all values are null:

>>> df[[s.name for s in df if not (s.null_count() == df.height)]]
shape: (4, 2)
┌──────┬──────┐
│ b    ┆ c    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 1    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2    ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1    ┆ 1    │
└──────┴──────┘
property dtypes: list[Union[Type[polars.datatypes.DataType], polars.datatypes.DataType]][source]

Get dtypes of columns in DataFrame. Dtypes can also be found in column headers when printing the DataFrame.

See also

schema

Returns a {colname:dtype} mapping.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.dtypes
[<class 'polars.datatypes.Int64'>, <class 'polars.datatypes.Float64'>, <class 'polars.datatypes.Utf8'>]
>>> df
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6.0 ┆ a   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 7.0 ┆ b   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 8.0 ┆ c   │
└─────┴─────┴─────┘
estimated_size(unit: SizeUnit = 'b') int | float[source]

Return an estimation of the total (heap) allocated size of the DataFrame.

Estimated size is given in the specified unit (bytes by default).

This estimation is the sum of the size of its buffers and validity bitmaps, including nested arrays. Multiple arrays may share buffers and bitmaps, so the size of two arrays is not the sum of the sizes computed from this function. In particular, a StructArray’s size is an upper bound.

When an array is sliced, its allocated size remains constant because the buffer is unchanged. However, this function will yield a smaller number, because it returns the visible size of the buffer, not its total capacity.

FFI buffers are included in this estimation.

Parameters:
unit : {‘b’, ‘kb’, ‘mb’, ‘gb’, ‘tb’}

Scale the returned size to the given unit.

Examples

>>> df = pl.DataFrame(
...     {
...         "x": list(reversed(range(1_000_000))),
...         "y": [v / 1000 for v in range(1_000_000)],
...         "z": [str(v) for v in range(1_000_000)],
...     },
...     columns=[("x", pl.UInt32), ("y", pl.Float64), ("z", pl.Utf8)],
... )
>>> df.estimated_size()
25888898
>>> df.estimated_size("mb")
24.689577102661133
explode(columns: Union[str, Sequence[str], Expr, Sequence[Expr]]) DataFrame[source]

Explode DataFrame to long format by exploding a column with Lists.

Parameters:
columns

Column(s) of List dtype to explode.

Returns:
DataFrame

Examples

>>> df = pl.DataFrame(
...     {
...         "letters": ["a", "a", "b", "c"],
...         "numbers": [[1], [2, 3], [4, 5], [6, 7, 8]],
...     }
... )
>>> df
shape: (4, 2)
┌─────────┬───────────┐
│ letters ┆ numbers   │
│ ---     ┆ ---       │
│ str     ┆ list[i64] │
╞═════════╪═══════════╡
│ a       ┆ [1]       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ a       ┆ [2, 3]    │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b       ┆ [4, 5]    │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ c       ┆ [6, 7, 8] │
└─────────┴───────────┘
>>> df.explode("numbers")
shape: (8, 2)
┌─────────┬─────────┐
│ letters ┆ numbers │
│ ---     ┆ ---     │
│ str     ┆ i64     │
╞═════════╪═════════╡
│ a       ┆ 1       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ a       ┆ 2       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ a       ┆ 3       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ b       ┆ 4       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ b       ┆ 5       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ c       ┆ 6       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ c       ┆ 7       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ c       ┆ 8       │
└─────────┴─────────┘
extend(other: DF) DF[source]

Extend the memory backed by this DataFrame with the values from other.

Different from vstack, which adds the chunks from other to the chunks of this DataFrame, extend appends the data from other to the underlying memory locations and thus may cause a reallocation.

If this does not cause a reallocation, the resulting data structure will not have any extra chunks and thus will yield faster queries.

Prefer extend over vstack when you want to do a query after a single append. For instance during online operations where you add n rows and rerun a query.

Prefer vstack over extend when you want to append many times before doing a query. For instance, when you read in multiple files and want to store them in a single DataFrame. In the latter case, finish the sequence of vstack operations with a rechunk.
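
A hedged sketch of that vstack-then-rechunk pattern (the file paths are hypothetical):

>>> paths = ["day1.csv", "day2.csv", "day3.csv"]  # hypothetical files
>>> combined = pl.read_csv(paths[0])
>>> for path in paths[1:]:
...     combined = combined.vstack(pl.read_csv(path))
...
>>> combined = combined.rechunk()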

Parameters:
other

DataFrame to vertically add.

Examples

>>> df1 = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
>>> df2 = pl.DataFrame({"foo": [10, 20, 30], "bar": [40, 50, 60]})
>>> df1.extend(df2)
shape: (6, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 4   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 5   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 6   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 10  ┆ 40  │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 20  ┆ 50  │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 30  ┆ 60  │
└─────┴─────┘
fill_nan(fill_value: Expr | int | float | None) DataFrame[source]

Fill floating point NaN values by an Expression evaluation.

Parameters:
fill_value

Value to fill NaN with.

Returns:
DataFrame with NaN replaced with fill_value

Warning

Note that floating point NaNs (Not a Number) are not missing values! To replace missing values, use fill_null().

See also

fill_null

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1.5, 2, float("NaN"), 4],
...         "b": [0.5, 4, float("NaN"), 13],
...     }
... )
>>> df.fill_nan(99)
shape: (4, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ f64  ┆ f64  │
╞══════╪══════╡
│ 1.5  ┆ 0.5  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2.0  ┆ 4.0  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 99.0 ┆ 99.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4.0  ┆ 13.0 │
└──────┴──────┘
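
Per the warning above, only NaN is filled; null values are left untouched. A minimal sketch:

>>> pl.DataFrame({"a": [1.0, None, float("nan")]}).fill_nan(0)
shape: (3, 1)
┌──────┐
│ a    │
│ ---  │
│ f64  │
╞══════╡
│ 1.0  │
├╌╌╌╌╌╌┤
│ null │
├╌╌╌╌╌╌┤
│ 0.0  │
└──────┘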
fill_null(value: Any | None = None, strategy: FillNullStrategy | None = None, limit: int | None = None, matches_supertype: bool = True) DF[source]

Fill null values using the specified value or strategy.

Parameters:
value

Value used to fill null values.

strategy : {None, ‘forward’, ‘backward’, ‘min’, ‘max’, ‘mean’, ‘zero’, ‘one’}

Strategy used to fill null values.

limit

Number of consecutive null values to fill when using the ‘forward’ or ‘backward’ strategy.

matches_supertype

Fill all columns that match the supertype of the fill value.

Returns:
DataFrame with None values replaced by the filling strategy.

See also

fill_nan

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, None, 4],
...         "b": [0.5, 4, None, 13],
...     }
... )
>>> df.fill_null(99)
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ 4.0  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 99  ┆ 99.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4   ┆ 13.0 │
└─────┴──────┘
>>> df.fill_null(strategy="forward")
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ 4.0  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ 4.0  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4   ┆ 13.0 │
└─────┴──────┘
>>> df.fill_null(strategy="max")
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ 4.0  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4   ┆ 13.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4   ┆ 13.0 │
└─────┴──────┘
>>> df.fill_null(strategy="zero")
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ 4.0  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0   ┆ 0.0  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4   ┆ 13.0 │
└─────┴──────┘
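
The limit parameter caps how many consecutive nulls are filled. A hedged sketch:

>>> pl.DataFrame({"a": [1, None, None, None, 5]}).fill_null(strategy="forward", limit=1)
shape: (5, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
├╌╌╌╌╌╌┤
│ 1    │
├╌╌╌╌╌╌┤
│ null │
├╌╌╌╌╌╌┤
│ null │
├╌╌╌╌╌╌┤
│ 5    │
└──────┘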
filter(predicate: Expr | str | Series | list[bool] | numpy.ndarray[Any, Any]) DataFrame[source]

Filter the rows in the DataFrame based on a predicate expression.

Parameters:
predicate

Expression that evaluates to a boolean Series.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )

Filter on one condition:

>>> df.filter(pl.col("foo") < 3)
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 7   ┆ b   │
└─────┴─────┴─────┘

Filter on multiple conditions:

>>> df.filter((pl.col("foo") < 3) & (pl.col("ham") == "a"))
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
└─────┴─────┴─────┘

Filter on an OR condition:

>>> df.filter((pl.col("foo") == 1) | (pl.col("ham") == "c"))
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 8   ┆ c   │
└─────┴─────┴─────┘
find_idx_by_name(name: str) int[source]

Find the index of a column by name.

Parameters:
name

Name of the column to find.

Examples

>>> df = pl.DataFrame(
...     {"foo": [1, 2, 3], "bar": [6, 7, 8], "ham": ["a", "b", "c"]}
... )
>>> df.find_idx_by_name("ham")
2
fold(operation: Callable[[Series, Series], Series]) Series[source]

Apply a horizontal reduction on a DataFrame.

This can be used to effectively determine aggregations on a row level, and can be applied to any DataType that can be supercast (cast to a similar parent type).

Examples of the supercast rules when applying an arithmetic operation on two DataTypes:

  • Int8 + Utf8 = Utf8

  • Float32 + Int64 = Float32

  • Float32 + Float64 = Float64

Parameters:
operation

function that takes two Series and returns a Series.

Examples

A horizontal sum operation:

>>> df = pl.DataFrame(
...     {
...         "a": [2, 1, 3],
...         "b": [1, 2, 3],
...         "c": [1.0, 2.0, 3.0],
...     }
... )
>>> df.fold(lambda s1, s2: s1 + s2)
shape: (3,)
Series: 'a' [f64]
[
    4.0
    5.0
    9.0
]

A horizontal minimum operation:

>>> df = pl.DataFrame({"a": [2, 1, 3], "b": [1, 2, 3], "c": [1.0, 2.0, 3.0]})
>>> df.fold(lambda s1, s2: s1.zip_with(s1 < s2, s2))
shape: (3,)
Series: 'a' [f64]
[
    1.0
    1.0
    3.0
]

A horizontal string concatenation:

>>> df = pl.DataFrame(
...     {
...         "a": ["foo", "bar", 2],
...         "b": [1, 2, 3],
...         "c": [1.0, 2.0, 3.0],
...     }
... )
>>> df.fold(lambda s1, s2: s1 + s2)
shape: (3,)
Series: 'a' [str]
[
    "foo11.0"
    "bar22.0"
    null
]

A horizontal boolean or, similar to a row-wise .any():

>>> df = pl.DataFrame(
...     {
...         "a": [False, False, True],
...         "b": [False, True, False],
...     }
... )
>>> df.fold(lambda s1, s2: s1 | s2)
shape: (3,)
Series: 'a' [bool]
[
        false
        true
        true
]
frame_equal(other: DataFrame, null_equal: bool = True) bool[source]

Check if DataFrame is equal to other.

Parameters:
other

DataFrame to compare with.

null_equal

Consider null values as equal.

Examples

>>> df1 = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df2 = pl.DataFrame(
...     {
...         "foo": [3, 2, 1],
...         "bar": [8.0, 7.0, 6.0],
...         "ham": ["c", "b", "a"],
...     }
... )
>>> df1.frame_equal(df1)
True
>>> df1.frame_equal(df2)
False
get_column(name: str) Series[source]

Get a single column as Series by name.

Parameters:
name : str

Name of the column to retrieve.

Examples

>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
>>> df.get_column("foo")
shape: (3,)
Series: 'foo' [i64]
[
        1
        2
        3
]
get_columns() list[Series][source]

Get the DataFrame as a List of Series.

Examples

>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
>>> df.get_columns()
[shape: (3,)
Series: 'foo' [i64]
[
        1
        2
        3
], shape: (3,)
Series: 'bar' [i64]
[
        4
        5
        6
]]
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [0.5, 4, 10, 13],
...         "c": [True, True, False, True],
...     }
... )
>>> df.get_columns()
[shape: (4,)
Series: 'a' [i64]
[
    1
    2
    3
    4
], shape: (4,)
Series: 'b' [f64]
[
    0.5
    4.0
    10.0
    13.0
], shape: (4,)
Series: 'c' [bool]
[
    true
    true
    false
    true
]]
groupby(by: Union[str, Expr, Sequence[str | Expr]], maintain_order: bool = False) GroupBy[DF][source]

Start a groupby operation.

Parameters:
by

Column(s) to group by.

maintain_order

Ensure that the order of the groups remains consistent. This is more expensive than a default groupby. Note that this only works in expression aggregations.

Examples

Below we group by column “a”, and we sum column “b”.

>>> df = pl.DataFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... )
>>> df.groupby("a").agg(pl.col("b").sum()).sort(by="a")
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 4   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ b   ┆ 11  │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ c   ┆ 6   │
└─────┴─────┘
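
Passing maintain_order=True keeps the groups in the order of first appearance rather than an arbitrary order (a sketch; with this data the result happens to match the sorted output above, so no sort is needed):

>>> df.groupby("a", maintain_order=True).agg(pl.col("b").sum())
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 4   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ b   ┆ 11  │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ c   ┆ 6   │
└─────┴─────┘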

We can also loop over the grouped DataFrame

>>> for sub_df in df.groupby("a"):
...     print(sub_df)  
...
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ b   ┆ 2   ┆ 5   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b   ┆ 4   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b   ┆ 5   ┆ 2   │
└─────┴─────┴─────┘
shape: (1, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘
groupby_dynamic(index_column: str, *, every: str | timedelta, period: str | timedelta | None = None, offset: str | timedelta | None = None, truncate: bool = True, include_boundaries: bool = False, closed: ClosedWindow = 'left', by: str | Sequence[str] | Expr | Sequence[Expr] | None = None, start_by: StartBy = 'window') DynamicGroupBy[DF][source]

Group based on a time value (or index value of type Int32, Int64).

Time windows are calculated and rows are assigned to windows. Different from a normal groupby, a row can be a member of multiple groups. The time/index window could be seen as a rolling window, with a window size determined by dates/times/values instead of slots in the DataFrame.

A window is defined by:

  • every: interval of the window

  • period: length of the window

  • offset: offset of the window

The every, period and offset arguments are created with the following string language:

  • 1ns (1 nanosecond)

  • 1us (1 microsecond)

  • 1ms (1 millisecond)

  • 1s (1 second)

  • 1m (1 minute)

  • 1h (1 hour)

  • 1d (1 day)

  • 1w (1 week)

  • 1mo (1 calendar month)

  • 1y (1 calendar year)

  • 1i (1 index count)

Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds

In case of a groupby_dynamic on an integer column, the windows are defined by:

  • “1i” # length 1

  • “10i” # length 10

Parameters:
index_column

Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order; if not, the output will not make sense.

In case of a dynamic groupby on indices, the dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.

every

interval of the window

period

length of the window; if None, it is equal to ‘every’

offset

offset of the window; if None and period is also None, it will be equal to negative every

truncate

truncate the time value to the window lower bound

include_boundaries

Add the lower and upper bound of the window to the “_lower_boundary” and “_upper_boundary” columns. This will impact performance because it’s harder to parallelize.

closed : {‘right’, ‘left’, ‘both’, ‘none’}

Define whether the temporal window interval is closed or not.

by

Also group by this column/these columns

start_by : {‘window’, ‘datapoint’, ‘monday’}

The strategy to determine the start of the first window by.

  • ‘window’: Truncate the start of the window with the ‘every’ argument.

  • ‘datapoint’: Start from the first encountered data point.

  • ‘monday’: Start the window on the Monday before the first data point.

Examples

>>> from datetime import datetime
>>> # create an example dataframe
>>> df = pl.DataFrame(
...     {
...         "time": pl.date_range(
...             low=datetime(2021, 12, 16),
...             high=datetime(2021, 12, 16, 3),
...             interval="30m",
...         ),
...         "n": range(7),
...     }
... )
>>> df
shape: (7, 2)
┌─────────────────────┬─────┐
│ time                ┆ n   │
│ ---                 ┆ --- │
│ datetime[μs]        ┆ i64 │
╞═════════════════════╪═════╡
│ 2021-12-16 00:00:00 ┆ 0   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2021-12-16 00:30:00 ┆ 1   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2021-12-16 01:00:00 ┆ 2   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2021-12-16 01:30:00 ┆ 3   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2021-12-16 02:00:00 ┆ 4   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2021-12-16 02:30:00 ┆ 5   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2021-12-16 03:00:00 ┆ 6   │
└─────────────────────┴─────┘

Group by windows of 1 hour starting at 2021-12-16 00:00:00.

>>> (
...     df.groupby_dynamic("time", every="1h", closed="right").agg(
...         [
...             pl.col("time").min().alias("time_min"),
...             pl.col("time").max().alias("time_max"),
...         ]
...     )
... )
shape: (4, 3)
┌─────────────────────┬─────────────────────┬─────────────────────┐
│ time                ┆ time_min            ┆ time_max            │
│ ---                 ┆ ---                 ┆ ---                 │
│ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        │
╞═════════════════════╪═════════════════════╪═════════════════════╡
│ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 00:00:00 ┆ 2021-12-16 00:30:00 ┆ 2021-12-16 01:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 01:00:00 ┆ 2021-12-16 01:30:00 ┆ 2021-12-16 02:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:00:00 ┆ 2021-12-16 02:30:00 ┆ 2021-12-16 03:00:00 │
└─────────────────────┴─────────────────────┴─────────────────────┘

The window boundaries can also be added to the aggregation result

>>> (
...     df.groupby_dynamic(
...         "time", every="1h", include_boundaries=True, closed="right"
...     ).agg([pl.col("time").count().alias("time_count")])
... )
shape: (4, 4)
┌─────────────────────┬─────────────────────┬─────────────────────┬────────────┐
│ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ time_count │
│ ---                 ┆ ---                 ┆ ---                 ┆ ---        │
│ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32        │
╞═════════════════════╪═════════════════════╪═════════════════════╪════════════╡
│ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 1          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 2          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2          │
└─────────────────────┴─────────────────────┴─────────────────────┴────────────┘

When closed=”left”, the windows do not include the right end of the interval: [lower_bound, upper_bound)

>>> (
...     df.groupby_dynamic("time", every="1h", closed="left").agg(
...         [
...             pl.col("time").count().alias("time_count"),
...             pl.col("time").list().alias("time_agg_list"),
...         ]
...     )
... )
shape: (4, 3)
┌─────────────────────┬────────────┬─────────────────────────────────────┐
│ time                ┆ time_count ┆ time_agg_list                       │
│ ---                 ┆ ---        ┆ ---                                 │
│ datetime[μs]        ┆ u32        ┆ list[datetime[μs]]                  │
╞═════════════════════╪════════════╪═════════════════════════════════════╡
│ 2021-12-16 00:00:00 ┆ 2          ┆ [2021-12-16 00:00:00, 2021-12-16... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 01:00:00 ┆ 2          ┆ [2021-12-16 01:00:00, 2021-12-16... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:00:00 ┆ 2          ┆ [2021-12-16 02:00:00, 2021-12-16... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 03:00:00 ┆ 1          ┆ [2021-12-16 03:00:00]               │
└─────────────────────┴────────────┴─────────────────────────────────────┘

When closed=”both”, the time values at the window boundaries belong to two groups.

>>> (
...     df.groupby_dynamic("time", every="1h", closed="both").agg(
...         [pl.col("time").count().alias("time_count")]
...     )
... )
shape: (5, 2)
┌─────────────────────┬────────────┐
│ time                ┆ time_count │
│ ---                 ┆ ---        │
│ datetime[μs]        ┆ u32        │
╞═════════════════════╪════════════╡
│ 2021-12-15 23:00:00 ┆ 1          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 00:00:00 ┆ 3          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 01:00:00 ┆ 3          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:00:00 ┆ 3          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 03:00:00 ┆ 1          │
└─────────────────────┴────────────┘

Dynamic groupbys can also be combined with grouping on normal keys

>>> df = pl.DataFrame(
...     {
...         "time": pl.date_range(
...             low=datetime(2021, 12, 16),
...             high=datetime(2021, 12, 16, 3),
...             interval="30m",
...         ),
...         "groups": ["a", "a", "a", "b", "b", "a", "a"],
...     }
... )
>>> df
shape: (7, 2)
┌─────────────────────┬────────┐
│ time                ┆ groups │
│ ---                 ┆ ---    │
│ datetime[μs]        ┆ str    │
╞═════════════════════╪════════╡
│ 2021-12-16 00:00:00 ┆ a      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 00:30:00 ┆ a      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 01:00:00 ┆ a      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 01:30:00 ┆ b      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:00:00 ┆ b      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:30:00 ┆ a      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 03:00:00 ┆ a      │
└─────────────────────┴────────┘
>>> (
...     df.groupby_dynamic(
...         "time",
...         every="1h",
...         closed="both",
...         by="groups",
...         include_boundaries=True,
...     ).agg([pl.col("time").count().alias("time_count")])
... )
shape: (7, 5)
┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬────────────┐
│ groups ┆ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ time_count │
│ ---    ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---        │
│ str    ┆ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32        │
╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪════════════╡
│ a      ┆ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 1          │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a      ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 3          │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 1          │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2          │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a      ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ 1          │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2          │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 1          │
└────────┴─────────────────────┴─────────────────────┴─────────────────────┴────────────┘

Dynamic groupby on an index column

>>> df = pl.DataFrame(
...     {
...         "idx": pl.arange(0, 6, eager=True),
...         "A": ["A", "A", "B", "B", "B", "C"],
...     }
... )
>>> (
...     df.groupby_dynamic(
...         "idx",
...         every="2i",
...         period="3i",
...         include_boundaries=True,
...         closed="right",
...     ).agg(pl.col("A").list().alias("A_agg_list"))
... )
shape: (3, 4)
┌─────────────────┬─────────────────┬─────┬─────────────────┐
│ _lower_boundary ┆ _upper_boundary ┆ idx ┆ A_agg_list      │
│ ---             ┆ ---             ┆ --- ┆ ---             │
│ i64             ┆ i64             ┆ i64 ┆ list[str]       │
╞═════════════════╪═════════════════╪═════╪═════════════════╡
│ 0               ┆ 3               ┆ 0   ┆ ["A", "B", "B"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2               ┆ 5               ┆ 2   ┆ ["B", "B", "C"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4               ┆ 7               ┆ 4   ┆ ["C"]           │
└─────────────────┴─────────────────┴─────┴─────────────────┘
groupby_rolling(index_column: str, *, period: str | timedelta, offset: str | timedelta | None = None, closed: ClosedWindow = 'right', by: str | Sequence[str] | Expr | Sequence[Expr] | None = None) RollingGroupBy[DF][source]

Create rolling groups based on a time column.

Also works for index values of type Int32 or Int64.

Different from groupby_dynamic, the windows are determined by the individual values and are not of constant intervals. For constant intervals, use groupby_dynamic.

The period and offset arguments are created either from a timedelta, or by using the following string language:

  • 1ns (1 nanosecond)

  • 1us (1 microsecond)

  • 1ms (1 millisecond)

  • 1s (1 second)

  • 1m (1 minute)

  • 1h (1 hour)

  • 1d (1 day)

  • 1w (1 week)

  • 1mo (1 calendar month)

  • 1y (1 calendar year)

  • 1i (1 index count)

Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds

In case of a groupby_rolling on an integer column, the windows are defined by:

  • “1i” # length 1

  • “10i” # length 10

Parameters:
index_column

Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order; if not, the output will not make sense.

In case of a rolling groupby on indices, the dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.

period

length of the window

offset

offset of the window. Default is -period

closed : {‘right’, ‘left’, ‘both’, ‘none’}

Define whether the temporal window interval is closed or not.

by

Also group by this column/these columns

See also

groupby_dynamic

Examples

>>> dates = [
...     "2020-01-01 13:45:48",
...     "2020-01-01 16:42:13",
...     "2020-01-01 16:45:09",
...     "2020-01-02 18:12:48",
...     "2020-01-03 19:45:32",
...     "2020-01-08 23:16:43",
... ]
>>> df = pl.DataFrame({"dt": dates, "a": [3, 7, 5, 9, 2, 1]}).with_column(
...     pl.col("dt").str.strptime(pl.Datetime)
... )
>>> out = df.groupby_rolling(index_column="dt", period="2d").agg(
...     [
...         pl.sum("a").alias("sum_a"),
...         pl.min("a").alias("min_a"),
...         pl.max("a").alias("max_a"),
...     ]
... )
>>> assert out["sum_a"].to_list() == [3, 10, 15, 24, 11, 1]
>>> assert out["max_a"].to_list() == [3, 7, 7, 9, 9, 1]
>>> assert out["min_a"].to_list() == [3, 3, 3, 3, 2, 1]
>>> out
shape: (6, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ dt                  ┆ sum_a ┆ min_a ┆ max_a │
│ ---                 ┆ ---   ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ i64   ┆ i64   ┆ i64   │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 13:45:48 ┆ 3     ┆ 3     ┆ 3     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 16:42:13 ┆ 10    ┆ 3     ┆ 7     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 16:45:09 ┆ 15    ┆ 3     ┆ 7     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-02 18:12:48 ┆ 24    ┆ 3     ┆ 9     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-03 19:45:32 ┆ 11    ┆ 2     ┆ 9     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-08 23:16:43 ┆ 1     ┆ 1     ┆ 1     │
└─────────────────────┴───────┴───────┴───────┘
hash_rows(seed: int = 0, seed_1: int | None = None, seed_2: int | None = None, seed_3: int | None = None) Series[source]

Hash and combine the rows in this DataFrame.

The hash value is of type UInt64.

Parameters:
seed

Random seed parameter. Defaults to 0.

seed_1

Random seed parameter. Defaults to seed if not set.

seed_2

Random seed parameter. Defaults to seed if not set.

seed_3

Random seed parameter. Defaults to seed if not set.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, None, 3, 4],
...         "ham": ["a", "b", None, "d"],
...     }
... )
>>> df.hash_rows(seed=42)
shape: (4,)
Series: '' [u64]
[
    10783150408545073287
    1438741209321515184
    10047419486152048166
    2047317070637311557
]
head(n: int = 5) DF[source]

Get the first n rows.

Parameters:
n

Number of rows to return.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3, 4, 5],
...         "bar": [6, 7, 8, 9, 10],
...         "ham": ["a", "b", "c", "d", "e"],
...     }
... )
>>> df.head(3)
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 7   ┆ b   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 8   ┆ c   │
└─────┴─────┴─────┘
property height: int[source]

Get the height of the DataFrame.

Examples

>>> df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]})
>>> df.height
5
hstack(columns: list[Series] | DataFrame, in_place: bool = False) DF[source]

Return a new DataFrame grown horizontally by stacking multiple Series to it.

Parameters:
columns

Series to stack.

in_place

Modify in place.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> x = pl.Series("apple", [10, 20, 30])
>>> df.hstack([x])
shape: (3, 4)
┌─────┬─────┬─────┬───────┐
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ i64 ┆ str ┆ i64   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 6   ┆ a   ┆ 10    │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ 7   ┆ b   ┆ 20    │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3   ┆ 8   ┆ c   ┆ 30    │
└─────┴─────┴─────┴───────┘
insert_at_idx(index: int, series: Series) DF[source]

Insert a Series at a certain column index. This operation is in place.

Parameters:
index

Index at which to insert the new Series column.

series

Series to insert.

Examples

>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
>>> s = pl.Series("baz", [97, 98, 99])
>>> df.insert_at_idx(1, s)
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ baz ┆ bar │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 97  ┆ 4   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 98  ┆ 5   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 99  ┆ 6   │
└─────┴─────┴─────┘
>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [0.5, 4, 10, 13],
...         "c": [True, True, False, True],
...     }
... )
>>> s = pl.Series("d", [-2.5, 15, 20.5, 0])
>>> df.insert_at_idx(3, s)
shape: (4, 4)
┌─────┬──────┬───────┬──────┐
│ a   ┆ b    ┆ c     ┆ d    │
│ --- ┆ ---  ┆ ---   ┆ ---  │
│ i64 ┆ f64  ┆ bool  ┆ f64  │
╞═════╪══════╪═══════╪══════╡
│ 1   ┆ 0.5  ┆ true  ┆ -2.5 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ 4.0  ┆ true  ┆ 15.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3   ┆ 10.0 ┆ false ┆ 20.5 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4   ┆ 13.0 ┆ true  ┆ 0.0  │
└─────┴──────┴───────┴──────┘
interpolate() DF[source]

Interpolate intermediate values. The interpolation method is linear.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, None, 9, 10],
...         "bar": [6, 7, 9, None],
...         "baz": [1, None, None, 9],
...     }
... )
>>> df.interpolate()
shape: (4, 3)
┌─────┬──────┬─────┐
│ foo ┆ bar  ┆ baz │
│ --- ┆ ---  ┆ --- │
│ i64 ┆ i64  ┆ i64 │
╞═════╪══════╪═════╡
│ 1   ┆ 6    ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 5   ┆ 7    ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 9   ┆ 9    ┆ 6   │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 10  ┆ null ┆ 9   │
└─────┴──────┴─────┘
is_duplicated() Series[source]

Get a mask of all duplicated rows in this DataFrame.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 1],
...         "b": ["x", "y", "z", "x"],
...     }
... )
>>> df.is_duplicated()
shape: (4,)
Series: '' [bool]
[
        true
        false
        false
        true
]
is_empty() bool[source]

Check if the dataframe is empty.

Examples

>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
>>> df.is_empty()
False
>>> df.filter(pl.col("foo") > 99).is_empty()
True
is_unique() Series[source]

Get a mask of all unique rows in this DataFrame.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 1],
...         "b": ["x", "y", "z", "x"],
...     }
... )
>>> df.is_unique()
shape: (4,)
Series: '' [bool]
[
        false
        true
        true
        false
]
join(other: DataFrame, left_on: str | Expr | Sequence[str | Expr] | None = None, right_on: str | Expr | Sequence[str | Expr] | None = None, on: str | Expr | Sequence[str | Expr] | None = None, how: JoinStrategy = 'inner', suffix: str = '_right') DataFrame[source]

Join in SQL-like fashion.

Parameters:
other

DataFrame to join with.

left_on

Name(s) of the left join column(s).

right_on

Name(s) of the right join column(s).

on

Name(s) of the join columns in both DataFrames.

how : {‘inner’, ‘left’, ‘outer’, ‘semi’, ‘anti’, ‘cross’}

Join strategy.

suffix

Suffix to append to columns with a duplicate name.

Returns:
Joined DataFrame

See also

join_asof

Notes

For joining on columns with categorical data, see pl.StringCache().
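
A minimal sketch of such a categorical join, assuming both frames are cast to Categorical inside the same string cache context:

>>> with pl.StringCache():
...     left = pl.DataFrame({"key": ["a", "b"]}).with_column(
...         pl.col("key").cast(pl.Categorical)
...     )
...     right = pl.DataFrame({"key": ["b", "c"], "val": [1, 2]}).with_column(
...         pl.col("key").cast(pl.Categorical)
...     )
...     joined = left.join(right, on="key")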

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> other_df = pl.DataFrame(
...     {
...         "apple": ["x", "y", "z"],
...         "ham": ["a", "b", "d"],
...     }
... )
>>> df.join(other_df, on="ham")
shape: (2, 4)
┌─────┬─────┬─────┬───────┐
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ f64 ┆ str ┆ str   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 6.0 ┆ a   ┆ x     │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ 7.0 ┆ b   ┆ y     │
└─────┴─────┴─────┴───────┘
>>> df.join(other_df, on="ham", how="outer")
shape: (4, 4)
┌──────┬──────┬─────┬───────┐
│ foo  ┆ bar  ┆ ham ┆ apple │
│ ---  ┆ ---  ┆ --- ┆ ---   │
│ i64  ┆ f64  ┆ str ┆ str   │
╞══════╪══════╪═════╪═══════╡
│ 1    ┆ 6.0  ┆ a   ┆ x     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2    ┆ 7.0  ┆ b   ┆ y     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ null ┆ d   ┆ z     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3    ┆ 8.0  ┆ c   ┆ null  │
└──────┴──────┴─────┴───────┘
>>> df.join(other_df, on="ham", how="left")
shape: (3, 4)
┌─────┬─────┬─────┬───────┐
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ f64 ┆ str ┆ str   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 6.0 ┆ a   ┆ x     │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ 7.0 ┆ b   ┆ y     │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3   ┆ 8.0 ┆ c   ┆ null  │
└─────┴─────┴─────┴───────┘
>>> df.join(other_df, on="ham", how="semi")
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6.0 ┆ a   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 7.0 ┆ b   │
└─────┴─────┴─────┘
>>> df.join(other_df, on="ham", how="anti")
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 3   ┆ 8.0 ┆ c   │
└─────┴─────┴─────┘
join_asof(other: DataFrame, left_on: str | None = None, right_on: str | None = None, on: str | None = None, by_left: str | Sequence[str] | None = None, by_right: str | Sequence[str] | None = None, by: str | Sequence[str] | None = None, strategy: AsofJoinStrategy = 'backward', suffix: str = '_right', tolerance: str | int | float | None = None, allow_parallel: bool = True, force_parallel: bool = False) DataFrame[source]

Perform an asof join.

This is similar to a left-join except that we match on nearest key rather than equal keys.

Both DataFrames must be sorted by the asof_join key.

For each row in the left DataFrame:

  • A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.

  • A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.

The default is “backward”.

Parameters:
other

DataFrame to join with.

left_on

Join column of the left DataFrame.

right_on

Join column of the right DataFrame.

on

Join column of both DataFrames. If set, left_on and right_on should be None.

by

join on these columns before doing asof join

by_left

join on these columns before doing asof join

by_right

join on these columns before doing asof join

strategy : {‘backward’, ‘forward’}

Join strategy.

suffix

Suffix to append to columns with a duplicate name.

tolerance

Numeric tolerance. By setting this, the join will only be done if the near keys are within this distance. If an asof join is done on columns of dtype “Date”, “Datetime”, “Duration” or “Time”, you can use the following string language:

  • 1ns (1 nanosecond)

  • 1us (1 microsecond)

  • 1ms (1 millisecond)

  • 1s (1 second)

  • 1m (1 minute)

  • 1h (1 hour)

  • 1d (1 day)

  • 1w (1 week)

  • 1mo (1 calendar month)

  • 1y (1 calendar year)

  • 1i (1 index count)

Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds

allow_parallel

Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.

force_parallel

Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.

Examples

>>> from datetime import datetime
>>> gdp = pl.DataFrame(
...     {
...         "date": [
...             datetime(2016, 1, 1),
...             datetime(2017, 1, 1),
...             datetime(2018, 1, 1),
...             datetime(2019, 1, 1),
...         ],  # note record date: Jan 1st (sorted!)
...         "gdp": [4164, 4411, 4566, 4696],
...     }
... )
>>> population = pl.DataFrame(
...     {
...         "date": [
...             datetime(2016, 5, 12),
...             datetime(2017, 5, 12),
...             datetime(2018, 5, 12),
...             datetime(2019, 5, 12),
...         ],  # note record date: May 12th (sorted!)
...         "population": [82.19, 82.66, 83.12, 83.52],
...     }
... )
>>> population.join_asof(
...     gdp, left_on="date", right_on="date", strategy="backward"
... )
shape: (4, 3)
┌─────────────────────┬────────────┬──────┐
│ date                ┆ population ┆ gdp  │
│ ---                 ┆ ---        ┆ ---  │
│ datetime[μs]        ┆ f64        ┆ i64  │
╞═════════════════════╪════════════╪══════╡
│ 2016-05-12 00:00:00 ┆ 82.19      ┆ 4164 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2017-05-12 00:00:00 ┆ 82.66      ┆ 4411 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2018-05-12 00:00:00 ┆ 83.12      ┆ 4566 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2019-05-12 00:00:00 ┆ 83.52      ┆ 4696 │
└─────────────────────┴────────────┴──────┘
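
Setting tolerance bounds how far back (or forward) a match may reach; a hedged sketch using the frames above, where any key pair farther apart than the tolerance yields null instead of a match:

>>> population.join_asof(gdp, on="date", strategy="backward", tolerance="1y")  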
lazy() LazyFrame[source]

Start a lazy query from this point. This returns a LazyFrame object.

Operations on a LazyFrame are not executed until they are requested, for example by calling:

  • collect() (materialize the full query into a DataFrame)

  • fetch() (run the query on a limited number of rows, for quick checks)

Lazy operations are advised because they allow for query optimization and more parallelization.

Returns:
LazyFrame

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [None, 2, 3, 4],
...         "b": [0.5, None, 2.5, 13],
...         "c": [True, True, False, None],
...     }
... )
>>> df.lazy()  
<polars.LazyFrame object at ...>
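
As a brief illustrative sketch (not part of the original example), operations can be chained on the LazyFrame and materialized with collect():

>>> df.lazy().filter(pl.col("a") > 2).collect()
shape: (2, 3)
┌─────┬──────┬───────┐
│ a   ┆ b    ┆ c     │
│ --- ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╡
│ 3   ┆ 2.5  ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4   ┆ 13.0 ┆ null  │
└─────┴──────┴───────┘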
limit(n: int = 5) DF[source]

Get the first n rows.

Alias for DataFrame.head().

Parameters:
n

Number of rows to return.

Examples

>>> df = pl.DataFrame(
...     {"foo": [1, 2, 3, 4, 5, 6], "bar": ["a", "b", "c", "d", "e", "f"]}
... )
>>> df.limit(4)
shape: (4, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ a   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ b   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ c   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ d   │
└─────┴─────┘
max(axis: Literal[0] = 0) DF[source]
max(axis: Literal[1]) Series
max(axis: int = 0) Union[DF, Series]

Aggregate the columns of this DataFrame to their maximum value.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.max()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 3   ┆ 8   ┆ c   │
└─────┴─────┴─────┘
mean(*, axis: Literal[0] = 0, null_strategy: NullStrategy = 'ignore') DF[source]
mean(*, axis: Literal[1], null_strategy: NullStrategy = 'ignore') Series
mean(*, axis: int = 0, null_strategy: NullStrategy = 'ignore') Union[DF, Series]

Aggregate the columns of this DataFrame to their mean value.

Parameters:
axis

Either 0 (aggregate each column) or 1 (aggregate each row).

null_strategy{‘ignore’, ‘propagate’}

This argument is only used if axis == 1.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.mean()
shape: (1, 3)
┌─────┬─────┬──────┐
│ foo ┆ bar ┆ ham  │
│ --- ┆ --- ┆ ---  │
│ f64 ┆ f64 ┆ str  │
╞═════╪═════╪══════╡
│ 2.0 ┆ 7.0 ┆ null │
└─────┴─────┴──────┘

Note: a PanicException is raised with axis = 1 and a string column.

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...     }
... )
>>> df.mean(axis=1)
shape: (3,)
Series: 'foo' [f64]
[
        3.5
        4.5
        5.5
]

Note: the mean of booleans evaluates to null.

>>> df = pl.DataFrame(
...     {
...         "a": [True, True, False],
...         "b": [True, True, True],
...     }
... )
>>> df.mean()
shape: (1, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ bool ┆ bool │
╞══════╪══════╡
│ null ┆ null │
└──────┴──────┘

Instead, cast to a numeric type first:

>>> df.select(pl.all().cast(pl.UInt8)).mean()
shape: (1, 2)
┌──────────┬─────┐
│ a        ┆ b   │
│ ---      ┆ --- │
│ f64      ┆ f64 │
╞══════════╪═════╡
│ 0.666667 ┆ 1.0 │
└──────────┴─────┘
median() DF[source]

Aggregate the columns of this DataFrame to their median value.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.median()
shape: (1, 3)
┌─────┬─────┬──────┐
│ foo ┆ bar ┆ ham  │
│ --- ┆ --- ┆ ---  │
│ f64 ┆ f64 ┆ str  │
╞═════╪═════╪══════╡
│ 2.0 ┆ 7.0 ┆ null │
└─────┴─────┴──────┘
melt(id_vars: Optional[Union[str, Sequence[str]]] = None, value_vars: Optional[Union[str, Sequence[str]]] = None, variable_name: str | None = None, value_name: str | None = None) DF[source]

Unpivot a DataFrame from wide to long format.

Optionally leaves identifiers set.

This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.

Parameters:
id_vars

Columns to use as identifier variables.

value_vars

Columns to use as value variables; these are "unpivoted" into the value column. If value_vars is empty all columns that are not in id_vars will be used.

variable_name

Name to give to the variable column. Defaults to "variable".

value_name

Name to give to the value column. Defaults to "value".

Examples

>>> df = pl.DataFrame(
...     {
...         "a": ["x", "y", "z"],
...         "b": [1, 3, 5],
...         "c": [2, 4, 6],
...     }
... )
>>> df.melt(id_vars="a", value_vars=["b", "c"])
shape: (6, 3)
┌─────┬──────────┬───────┐
│ a   ┆ variable ┆ value │
│ --- ┆ ---      ┆ ---   │
│ str ┆ str      ┆ i64   │
╞═════╪══════════╪═══════╡
│ x   ┆ b        ┆ 1     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ y   ┆ b        ┆ 3     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ z   ┆ b        ┆ 5     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ x   ┆ c        ┆ 2     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ y   ┆ c        ┆ 4     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ z   ┆ c        ┆ 6     │
└─────┴──────────┴───────┘
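
The variable_name and value_name parameters rename the two generated columns; a small sketch on the same frame:

>>> df.melt(
...     id_vars="a", value_vars=["b", "c"], variable_name="key", value_name="val"
... )
shape: (6, 3)
┌─────┬─────┬─────┐
│ a   ┆ key ┆ val │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═════╪═════╪═════╡
│ x   ┆ b   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ y   ┆ b   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ z   ┆ b   ┆ 5   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ x   ┆ c   ┆ 2   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ y   ┆ c   ┆ 4   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ z   ┆ c   ┆ 6   │
└─────┴─────┴─────┘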
min(axis: Literal[0] = 0) DF[source]
min(axis: Literal[1]) Series
min(axis: int = 0) Union[DF, Series]

Aggregate the columns of this DataFrame to their minimum value.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.min()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
└─────┴─────┴─────┘
n_chunks(strategy: Literal['first']) int[source]
n_chunks(strategy: Literal['all']) list[int]
n_chunks(strategy: str = 'first') int | list[int]

Get number of chunks used by the ChunkedArrays of this DataFrame.

Parameters:
strategy{‘first’, ‘all’}

Return the number of chunks of the ‘first’ column, or ‘all’ columns in this DataFrame.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [0.5, 4, 10, 13],
...         "c": [True, True, False, True],
...     }
... )
>>> df.n_chunks()
1
>>> df.n_chunks(strategy="all")
[1, 1, 1]
n_unique(subset: Optional[Union[str, Expr, Sequence[str | Expr]]] = None) int[source]

Return the number of unique rows, or the number of unique row-subsets.

Parameters:
subset

One or more columns/expressions that define what to count; omit to return the count of unique rows.

Notes

This method operates at the DataFrame level; to operate on subsets at the expression level you can make use of struct-packing instead, for example:

>>> expr_unique_subset = pl.struct(["a", "b"]).n_unique()

If instead you want to count the number of unique values per-column, you can also use expression-level syntax to return a new frame containing that result:

>>> df = pl.DataFrame([[1, 2, 3], [1, 2, 4]], columns=["a", "b", "c"])
>>> df_nunique = df.select(pl.all().n_unique())

In aggregate context there is also an equivalent method for returning the unique values per-group:

>>> df_agg_nunique = df.groupby(by=["a"]).n_unique()

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 1, 2, 3, 4, 5],
...         "b": [0.5, 0.5, 1.0, 2.0, 3.0, 3.0],
...         "c": [True, True, True, False, True, True],
...     }
... )
>>> df.n_unique()
5
>>> # simple columns subset
>>> df.n_unique(subset=["b", "c"])
4
>>> # expression subset
>>> df.n_unique(
...     subset=[
...         (pl.col("a") // 2),
...         (pl.col("c") | (pl.col("b") >= 2)),
...     ],
... )
3
null_count() DF[source]

Create a new DataFrame that shows the null counts per column.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, None, 3],
...         "bar": [6, 7, None],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.null_count()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 │
╞═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 0   │
└─────┴─────┴─────┘
partition_by(groups: Union[str, Sequence[str]], maintain_order: bool = False, *, as_dict: Literal[False] = False) list[DF][source]
partition_by(groups: Union[str, Sequence[str]], maintain_order: bool = False, *, as_dict: Literal[True]) dict[Any, DF]
partition_by(groups: Union[str, Sequence[str]], maintain_order: bool, *, as_dict: bool) list[DF] | dict[Any, DF]

Split into multiple DataFrames partitioned by groups.

Parameters:
groups

Groups to partition by.

maintain_order

Keep predictable output order. This is slower as it requires an extra sort operation.

as_dict

If True, return the partitions in a dictionary keyed by the distinct group values instead of a list.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": ["A", "A", "B", "B", "C"],
...         "N": [1, 2, 2, 4, 2],
...         "bar": ["k", "l", "m", "m", "l"],
...     }
... )
>>> df.partition_by(groups="foo", maintain_order=True)
[shape: (2, 3)
 ┌─────┬─────┬─────┐
 │ foo ┆ N   ┆ bar │
 │ --- ┆ --- ┆ --- │
 │ str ┆ i64 ┆ str │
 ╞═════╪═════╪═════╡
 │ A   ┆ 1   ┆ k   │
 ├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
 │ A   ┆ 2   ┆ l   │
 └─────┴─────┴─────┘,
 shape: (2, 3)
 ┌─────┬─────┬─────┐
 │ foo ┆ N   ┆ bar │
 │ --- ┆ --- ┆ --- │
 │ str ┆ i64 ┆ str │
 ╞═════╪═════╪═════╡
 │ B   ┆ 2   ┆ m   │
 ├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
 │ B   ┆ 4   ┆ m   │
 └─────┴─────┴─────┘,
 shape: (1, 3)
 ┌─────┬─────┬─────┐
 │ foo ┆ N   ┆ bar │
 │ --- ┆ --- ┆ --- │
 │ str ┆ i64 ┆ str │
 ╞═════╪═════╪═════╡
 │ C   ┆ 2   ┆ l   │
 └─────┴─────┴─────┘]
>>> df.partition_by(groups="foo", maintain_order=True, as_dict=True)
{'A': shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ N   ┆ bar │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ A   ┆ 1   ┆ k   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ A   ┆ 2   ┆ l   │
└─────┴─────┴─────┘, 'B': shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ N   ┆ bar │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ B   ┆ 2   ┆ m   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ B   ┆ 4   ┆ m   │
└─────┴─────┴─────┘, 'C': shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ N   ┆ bar │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ C   ┆ 2   ┆ l   │
└─────┴─────┴─────┘}
pearson_corr(**kwargs: dict[str, Any]) DataFrame[source]

Return Pearson product-moment correlation coefficients.

See numpy corrcoef for more information.

Parameters:
kwargs

Keyword arguments are passed to numpy.corrcoef.

Notes

This functionality requires numpy to be installed.

Examples

>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [3, 2, 1], "ham": [7, 8, 9]})
>>> df.pearson_corr()
shape: (3, 3)
┌──────┬──────┬──────┐
│ foo  ┆ bar  ┆ ham  │
│ ---  ┆ ---  ┆ ---  │
│ f64  ┆ f64  ┆ f64  │
╞══════╪══════╪══════╡
│ 1.0  ┆ -1.0 ┆ 1.0  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ -1.0 ┆ 1.0  ┆ -1.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1.0  ┆ -1.0 ┆ 1.0  │
└──────┴──────┴──────┘
pipe(func: Callable[[...], Any], *args: Any, **kwargs: Any) Any[source]

Offers a structured way to apply a sequence of user-defined functions (UDFs).

Parameters:
func

Callable; will receive the frame as the first parameter, followed by any given args/kwargs.

args

Arguments to pass to the UDF.

kwargs

Keyword arguments to pass to the UDF.

Notes

It is recommended to use LazyFrame when piping operations, in order to fully take advantage of query optimization and parallelization. See df.lazy().

Examples

>>> def cast_str_to_int(data, col_name):
...     return data.with_column(pl.col(col_name).cast(pl.Int64))
...
>>> df = pl.DataFrame({"a": [1, 2, 3, 4], "b": ["10", "20", "30", "40"]})
>>> df.pipe(cast_str_to_int, col_name="b")
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 10  │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 20  │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 30  │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 40  │
└─────┴─────┘
>>> df = pl.DataFrame({"b": [1, 2], "a": [3, 4]})
>>> df
shape: (2, 2)
┌─────┬─────┐
│ b   ┆ a   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 4   │
└─────┴─────┘
>>> df.pipe(lambda tdf: tdf.select(sorted(tdf.columns)))
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 3   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 2   │
└─────┴─────┘
pivot(values: Sequence[str] | str, index: Sequence[str] | str, columns: Sequence[str] | str, aggregate_fn: PivotAgg | Expr = 'first', maintain_order: bool = True, sort_columns: bool = False) DF[source]

Create a spreadsheet-style pivot table as a DataFrame.

Parameters:
values

Column values to aggregate. Can be multiple columns if the columns argument contains multiple columns as well.

index

One or multiple keys to group by

columns

Columns whose values will be used as the header of the output DataFrame

aggregate_fn{‘first’, ‘sum’, ‘max’, ‘min’, ‘mean’, ‘median’, ‘last’, ‘count’}

A predefined aggregate function str or an expression.

maintain_order

Sort the grouped keys so that the output order is predictable.

sort_columns

Sort the transposed columns by name. Default is by order of discovery.

Returns:
DataFrame

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": ["one", "one", "one", "two", "two", "two"],
...         "bar": ["A", "B", "C", "A", "B", "C"],
...         "baz": [1, 2, 3, 4, 5, 6],
...     }
... )
>>> df.pivot(values="baz", index="foo", columns="bar")
shape: (2, 4)
┌─────┬─────┬─────┬─────┐
│ foo ┆ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ one ┆ 1   ┆ 2   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ two ┆ 4   ┆ 5   ┆ 6   │
└─────┴─────┴─────┴─────┘
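
When an index/columns combination occurs more than once, aggregate_fn determines how the colliding values are combined. A hypothetical sketch using "sum" (df2 is illustrative, not from the original docs):

>>> df2 = pl.DataFrame(
...     {
...         "foo": ["one", "one", "two"],
...         "bar": ["A", "A", "B"],
...         "baz": [1, 2, 3],
...     }
... )
>>> df2.pivot(values="baz", index="foo", columns="bar", aggregate_fn="sum")
shape: (2, 3)
┌─────┬──────┬──────┐
│ foo ┆ A    ┆ B    │
│ --- ┆ ---  ┆ ---  │
│ str ┆ i64  ┆ i64  │
╞═════╪══════╪══════╡
│ one ┆ 3    ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ two ┆ null ┆ 3    │
└─────┴──────┴──────┘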
product() DF[source]

Aggregate the columns of this DataFrame to their product values.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3],
...         "b": [0.5, 4, 10],
...         "c": [True, True, False],
...     }
... )
>>> df.product()
shape: (1, 3)
┌─────┬──────┬─────┐
│ a   ┆ b    ┆ c   │
│ --- ┆ ---  ┆ --- │
│ i64 ┆ f64  ┆ i64 │
╞═════╪══════╪═════╡
│ 6   ┆ 20.0 ┆ 0   │
└─────┴──────┴─────┘
quantile(quantile: float, interpolation: RollingInterpolationMethod = 'nearest') DF[source]

Aggregate the columns of this DataFrame to their quantile value.

Parameters:
quantile

Quantile between 0.0 and 1.0.

interpolation{‘nearest’, ‘higher’, ‘lower’, ‘midpoint’, ‘linear’}

Interpolation method.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.quantile(0.5, "nearest")
shape: (1, 3)
┌─────┬─────┬──────┐
│ foo ┆ bar ┆ ham  │
│ --- ┆ --- ┆ ---  │
│ f64 ┆ f64 ┆ str  │
╞═════╪═════╪══════╡
│ 2.0 ┆ 7.0 ┆ null │
└─────┴─────┴──────┘
rechunk() DF[source]

Rechunk the data in this DataFrame to a contiguous allocation.

This will make sure all subsequent operations have optimal and predictable performance.
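
Examples

A small sketch (not in the original docs): vstack appends the chunks of another frame without copying, and rechunk consolidates them into a single contiguous chunk.

>>> df1 = pl.DataFrame({"a": [1, 2]})
>>> df2 = pl.DataFrame({"a": [3, 4]})
>>> df = df1.vstack(df2)  # two chunks, no data copied
>>> df.n_chunks()
2
>>> df.rechunk().n_chunks()
1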

rename(mapping: dict[str, str]) Union[DF, DataFrame][source]

Rename column names.

Parameters:
mapping

Key value pairs that map from old name to new name.

Examples

>>> df = pl.DataFrame(
...     {"foo": [1, 2, 3], "bar": [6, 7, 8], "ham": ["a", "b", "c"]}
... )
>>> df.rename({"foo": "apple"})
shape: (3, 3)
┌───────┬─────┬─────┐
│ apple ┆ bar ┆ ham │
│ ---   ┆ --- ┆ --- │
│ i64   ┆ i64 ┆ str │
╞═══════╪═════╪═════╡
│ 1     ┆ 6   ┆ a   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2     ┆ 7   ┆ b   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3     ┆ 8   ┆ c   │
└───────┴─────┴─────┘
replace(column: str, new_col: Series) DF[source]

Replace a column by a new Series.

Parameters:
column

Column to replace.

new_col

New column to insert.

Examples

>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
>>> s = pl.Series([10, 20, 30])
>>> df.replace("foo", s)  # works in-place!
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 10  ┆ 4   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 20  ┆ 5   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 30  ┆ 6   │
└─────┴─────┘
replace_at_idx(index: int, series: Series) DF[source]

Replace a column at an index location.

Parameters:
index

Column index.

series

Series that will replace the column.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> s = pl.Series("apple", [10, 20, 30])
>>> df.replace_at_idx(0, s)
shape: (3, 3)
┌───────┬─────┬─────┐
│ apple ┆ bar ┆ ham │
│ ---   ┆ --- ┆ --- │
│ i64   ┆ i64 ┆ str │
╞═══════╪═════╪═════╡
│ 10    ┆ 6   ┆ a   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 20    ┆ 7   ┆ b   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 30    ┆ 8   ┆ c   │
└───────┴─────┴─────┘
reverse() DF[source]

Reverse the DataFrame.

Examples

>>> df = pl.DataFrame(
...     {
...         "key": ["a", "b", "c"],
...         "val": [1, 2, 3],
...     }
... )
>>> df.reverse()
shape: (3, 2)
┌─────┬─────┐
│ key ┆ val │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ c   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ b   ┆ 2   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ a   ┆ 1   │
└─────┴─────┘
row(index: int | None = None, *, by_predicate: Expr | None = None) tuple[Any, ...][source]

Get a row as tuple, either by index or by predicate.

Parameters:
index

Row index.

by_predicate

Select the row according to a given expression/predicate.

Notes

The index and by_predicate params are mutually exclusive. Additionally, to ensure clarity, the by_predicate parameter must be supplied by keyword.

When using by_predicate it is an error condition if anything other than one row is returned; more than one row raises TooManyRowsReturned, and zero rows will raise NoRowsReturned (both inherit from RowsException).

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> # return the row at the given index
>>> df.row(2)
(3, 8, 'c')
>>> # return the row that matches the given predicate
>>> df.row(by_predicate=(pl.col("ham") == "b"))
(2, 7, 'b')
rows() list[tuple[Any, ...]][source]

Convert columnar data to rows as python tuples.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... )
>>> df.rows()
[(1, 2), (3, 4), (5, 6)]
sample(n: int | None = None, frac: float | None = None, with_replacement: bool = False, shuffle: bool = False, seed: int | None = None) DF[source]

Sample from this DataFrame.

Parameters:
n

Number of items to return. Cannot be used with frac. Defaults to 1 if frac is None.

frac

Fraction of items to return. Cannot be used with n.

with_replacement

Allow values to be sampled more than once.

shuffle

Shuffle the order of sampled data points.

seed

Seed for the random number generator. If set to None (default), a random seed is generated using the random module.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.sample(n=2, seed=0)  
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 3   ┆ 8   ┆ c   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 7   ┆ b   │
└─────┴─────┴─────┘
property schema: dict[str, Union[Type[polars.datatypes.DataType], polars.datatypes.DataType]][source]

Get a dict[column name, DataType].

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.schema
{'foo': <class 'polars.datatypes.Int64'>, 'bar': <class 'polars.datatypes.Float64'>, 'ham': <class 'polars.datatypes.Utf8'>}
select(exprs: Union[str, Expr, Series, Sequence[str | Expr | Series | WhenThen | WhenThenThen]]) DF[source]

Select columns from this DataFrame.

Parameters:
exprs

Column or columns to select.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.select("foo")
shape: (3, 1)
┌─────┐
│ foo │
│ --- │
│ i64 │
╞═════╡
│ 1   │
├╌╌╌╌╌┤
│ 2   │
├╌╌╌╌╌┤
│ 3   │
└─────┘
>>> df.select(["foo", "bar"])
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 6   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 7   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 8   │
└─────┴─────┘
>>> df.select(pl.col("foo") + 1)
shape: (3, 1)
┌─────┐
│ foo │
│ --- │
│ i64 │
╞═════╡
│ 2   │
├╌╌╌╌╌┤
│ 3   │
├╌╌╌╌╌┤
│ 4   │
└─────┘
>>> df.select([pl.col("foo") + 1, pl.col("bar") + 1])
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 7   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 8   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 9   │
└─────┴─────┘
>>> df.select(pl.when(pl.col("foo") > 2).then(10).otherwise(0))
shape: (3, 1)
┌─────────┐
│ literal │
│ ---     │
│ i64     │
╞═════════╡
│ 0       │
├╌╌╌╌╌╌╌╌╌┤
│ 0       │
├╌╌╌╌╌╌╌╌╌┤
│ 10      │
└─────────┘
property shape: tuple[int, int][source]

Get the shape of the DataFrame.

Examples

>>> df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]})
>>> df.shape
(5, 1)
shift(periods: int) DF[source]

Shift values by the given period.

Parameters:
periods

Number of places to shift (may be negative).

See also

shift_and_fill

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.shift(periods=1)
shape: (3, 3)
┌──────┬──────┬──────┐
│ foo  ┆ bar  ┆ ham  │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ str  │
╞══════╪══════╪══════╡
│ null ┆ null ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1    ┆ 6    ┆ a    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2    ┆ 7    ┆ b    │
└──────┴──────┴──────┘
>>> df.shift(periods=-1)
shape: (3, 3)
┌──────┬──────┬──────┐
│ foo  ┆ bar  ┆ ham  │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ str  │
╞══════╪══════╪══════╡
│ 2    ┆ 7    ┆ b    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3    ┆ 8    ┆ c    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ null ┆ null │
└──────┴──────┴──────┘
shift_and_fill(periods: int, fill_value: int | str | float) DataFrame[source]

Shift the values by a given period and fill the resulting null values.

Parameters:
periods

Number of places to shift (may be negative).

fill_value

Fill null values with this value.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.shift_and_fill(periods=1, fill_value=0)
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 0   ┆ 0   ┆ 0   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 6   ┆ a   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 7   ┆ b   │
└─────┴─────┴─────┘
shrink_to_fit(in_place: bool = False) DF[source]

Shrink DataFrame memory usage.

Shrinks to fit the exact capacity needed to hold the data.
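
A minimal usage sketch (assumed, not from the original docs); slicing operations such as head() may keep references to buffers larger than the visible data, which shrink_to_fit releases:

>>> df = pl.DataFrame({"a": list(range(1000))}).head(10)
>>> small = df.shrink_to_fit()  # shrunk copy (in_place=False, the default)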

slice(offset: int, length: int | None = None) DF[source]

Get a slice of this DataFrame.

Parameters:
offset

Start index. Negative indexing is supported.

length

Length of the slice. If set to None, all rows starting at the offset will be selected.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.slice(1, 2)
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 2   ┆ 7.0 ┆ b   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 8.0 ┆ c   │
└─────┴─────┴─────┘
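
A negative offset counts from the end of the frame; for example (a sketch on the same data):

>>> df.slice(-2)
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 2   ┆ 7.0 ┆ b   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 8.0 ┆ c   │
└─────┴─────┴─────┘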
sort(by: Union[str, Sequence[str], Expr, Sequence[Expr]], reverse: bool | list[bool] = False, nulls_last: bool = False) Union[DF, DataFrame][source]

Sort the DataFrame by column.

Parameters:
by

Column(s) to sort by; accepts column name strings as well as expression input.

reverse

Reverse/descending sort.

nulls_last

Place null values last. Can only be used if sorted by a single column.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.sort("foo", reverse=True)
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 3   ┆ 8.0 ┆ c   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 7.0 ┆ b   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 6.0 ┆ a   │
└─────┴─────┴─────┘

Sort by multiple columns. For multiple sort criteria, expression input can also be used.

>>> df.sort(
...     [pl.col("foo"), pl.col("bar") ** 2],
...     reverse=[True, False],
... )
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 3   ┆ 8.0 ┆ c   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 7.0 ┆ b   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 6.0 ┆ a   │
└─────┴─────┴─────┘
std(ddof: int = 1) DF[source]

Aggregate the columns of this DataFrame to their standard deviation value.

Parameters:
ddof

Degrees of freedom. The divisor used in the calculation is N - ddof, where N is the number of elements.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.std()
shape: (1, 3)
┌─────┬─────┬──────┐
│ foo ┆ bar ┆ ham  │
│ --- ┆ --- ┆ ---  │
│ f64 ┆ f64 ┆ str  │
╞═════╪═════╪══════╡
│ 1.0 ┆ 1.0 ┆ null │
└─────┴─────┴──────┘
>>> df.std(ddof=0)
shape: (1, 3)
┌──────────┬──────────┬──────┐
│ foo      ┆ bar      ┆ ham  │
│ ---      ┆ ---      ┆ ---  │
│ f64      ┆ f64      ┆ str  │
╞══════════╪══════════╪══════╡
│ 0.816497 ┆ 0.816497 ┆ null │
└──────────┴──────────┴──────┘
sum(*, axis: Literal[0] = 0, null_strategy: NullStrategy = 'ignore') DF[source]
sum(*, axis: Literal[1], null_strategy: NullStrategy = 'ignore') Series
sum(*, axis: int = 0, null_strategy: NullStrategy = 'ignore') Union[DF, Series]

Aggregate the columns of this DataFrame to their sum value.

Parameters:
axis

Either 0 (aggregate each column) or 1 (aggregate each row).

null_strategy{‘ignore’, ‘propagate’}

This argument is only used if axis == 1.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.sum()
shape: (1, 3)
┌─────┬─────┬──────┐
│ foo ┆ bar ┆ ham  │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ i64 ┆ str  │
╞═════╪═════╪══════╡
│ 6   ┆ 21  ┆ null │
└─────┴─────┴──────┘

Note: with axis=1 and mixed dtypes, the values are cast to a common supertype before summing; the string column "ham" here makes the row "sums" string concatenations.

>>> df.sum(axis=1)
shape: (3,)
Series: 'foo' [str]
[
        "16a"
        "27b"
        "38c"
]
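
A short sketch (not in the original examples) of null_strategy when summing horizontally; "ignore" skips nulls, while "propagate" makes any row containing a null produce a null result:

>>> df2 = pl.DataFrame({"a": [1, None], "b": [10, 20]})
>>> df2.sum(axis=1, null_strategy="ignore")
shape: (2,)
Series: 'a' [i64]
[
        11
        20
]
>>> df2.sum(axis=1, null_strategy="propagate")
shape: (2,)
Series: 'a' [i64]
[
        11
        null
]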
tail(n: int = 5) DF[source]

Get the last n rows.

Parameters:
n

Number of rows to return.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3, 4, 5],
...         "bar": [6, 7, 8, 9, 10],
...         "ham": ["a", "b", "c", "d", "e"],
...     }
... )
>>> df.tail(3)
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 3   ┆ 8   ┆ c   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 9   ┆ d   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 5   ┆ 10  ┆ e   │
└─────┴─────┴─────┘
take_every(n: int) DF[source]

Take every nth row in the DataFrame and return as a new DataFrame.

Examples

>>> df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]})
>>> df.take_every(2)
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 5   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 7   │
└─────┴─────┘
to_arrow() Table[source]

Collect the underlying arrow arrays in an Arrow Table.

This operation is mostly zero copy.

Data types that do copy:
  • CategoricalType

Examples

>>> df = pl.DataFrame(
...     {"foo": [1, 2, 3, 4, 5, 6], "bar": ["a", "b", "c", "d", "e", "f"]}
... )
>>> df.to_arrow()
pyarrow.Table
foo: int64
bar: large_string
----
foo: [[1,2,3,4,5,6]]
bar: [["a","b","c","d","e","f"]]
to_dict(as_series: Literal[True] = True) dict[str, polars.internals.series.series.Series][source]
to_dict(as_series: Literal[False]) dict[str, list[Any]]
to_dict(as_series: bool = True) dict[str, polars.internals.series.series.Series] | dict[str, list[Any]]

Convert DataFrame to a dictionary mapping column name to values.

Parameters:
as_series

True -> values are Series. False -> values are lists (List[Any]).

Examples

>>> df = pl.DataFrame(
...     {
...         "A": [1, 2, 3, 4, 5],
...         "fruits": ["banana", "banana", "apple", "apple", "banana"],
...         "B": [5, 4, 3, 2, 1],
...         "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
...         "optional": [28, 300, None, 2, -30],
...     }
... )
>>> df
shape: (5, 5)
┌─────┬────────┬─────┬────────┬──────────┐
│ A   ┆ fruits ┆ B   ┆ cars   ┆ optional │
│ --- ┆ ---    ┆ --- ┆ ---    ┆ ---      │
│ i64 ┆ str    ┆ i64 ┆ str    ┆ i64      │
╞═════╪════════╪═════╪════════╪══════════╡
│ 1   ┆ banana ┆ 5   ┆ beetle ┆ 28       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ banana ┆ 4   ┆ audi   ┆ 300      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ apple  ┆ 3   ┆ beetle ┆ null     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ apple  ┆ 2   ┆ beetle ┆ 2        │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 5   ┆ banana ┆ 1   ┆ beetle ┆ -30      │
└─────┴────────┴─────┴────────┴──────────┘
>>> df.to_dict(as_series=False)
{'A': [1, 2, 3, 4, 5],
'fruits': ['banana', 'banana', 'apple', 'apple', 'banana'],
'B': [5, 4, 3, 2, 1],
'cars': ['beetle', 'audi', 'beetle', 'beetle', 'beetle'],
'optional': [28, 300, None, 2, -30]}
>>> df.to_dict(as_series=True)
{'A': shape: (5,)
Series: 'A' [i64]
[
    1
    2
    3
    4
    5
], 'fruits': shape: (5,)
Series: 'fruits' [str]
[
    "banana"
    "banana"
    "apple"
    "apple"
    "banana"
], 'B': shape: (5,)
Series: 'B' [i64]
[
    5
    4
    3
    2
    1
], 'cars': shape: (5,)
Series: 'cars' [str]
[
    "beetle"
    "audi"
    "beetle"
    "beetle"
    "beetle"
], 'optional': shape: (5,)
Series: 'optional' [i64]
[
    28
    300
    null
    2
    -30
]}
to_dicts() list[dict[str, Any]][source]

Convert every row to a dictionary.

Note that this is slow.

Examples

>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
>>> df.to_dicts()
[{'foo': 1, 'bar': 4}, {'foo': 2, 'bar': 5}, {'foo': 3, 'bar': 6}]
to_dummies(*, columns: Optional[Sequence[str]] = None) DF[source]

Get one hot encoded dummy variables.

Parameters:
columns

A subset of columns to convert to dummy variables. None means “all columns”.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2],
...         "bar": [3, 4],
...         "ham": ["a", "b"],
...     }
... )
>>> df.to_dummies()
shape: (2, 6)
┌───────┬───────┬───────┬───────┬───────┬───────┐
│ foo_1 ┆ foo_2 ┆ bar_3 ┆ bar_4 ┆ ham_a ┆ ham_b │
│ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   │
│ u8    ┆ u8    ┆ u8    ┆ u8    ┆ u8    ┆ u8    │
╞═══════╪═══════╪═══════╪═══════╪═══════╪═══════╡
│ 1     ┆ 0     ┆ 1     ┆ 0     ┆ 1     ┆ 0     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0     ┆ 1     ┆ 0     ┆ 1     ┆ 0     ┆ 1     │
└───────┴───────┴───────┴───────┴───────┴───────┘
to_numpy() ndarray[Any, Any][source]

Convert DataFrame to a 2D NumPy array.

This operation clones data.

Notes

If you’re attempting to convert Utf8 to an array you’ll need to install pyarrow.

Examples

>>> df = pl.DataFrame(
...     {"foo": [1, 2, 3], "bar": [6, 7, 8], "ham": ["a", "b", "c"]}
... )
>>> numpy_array = df.to_numpy()
>>> type(numpy_array)
<class 'numpy.ndarray'>
to_pandas(*args: Any, date_as_object: bool = False, **kwargs: Any) DataFrame[source]

Cast to a pandas DataFrame.

This requires that pandas and pyarrow are installed. This operation clones data.

Parameters:
args

Arguments will be sent to pyarrow.Table.to_pandas().

date_as_object

Cast dates to objects. If False, convert to datetime64[ns] dtype.

kwargs

Arguments will be sent to pyarrow.Table.to_pandas().

Returns:
pandas.DataFrame

Examples

>>> import pandas
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> pandas_df = df.to_pandas()
>>> type(pandas_df)
<class 'pandas.core.frame.DataFrame'>
to_series(index: int = 0) Series[source]

Select column as Series at index location.

Parameters:
index

Location of selection.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.to_series(1)
shape: (3,)
Series: 'bar' [i64]
[
        6
        7
        8
]
to_struct(name: str) Series[source]

Convert a DataFrame to a Series of type Struct.

Parameters:
name

Name for the struct Series

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 4, 5],
...         "b": ["one", "two", "three", "four", "five"],
...     }
... )
>>> df.to_struct("nums")
shape: (5,)
Series: 'nums' [struct[2]]
[
    {1,"one"}
    {2,"two"}
    {3,"three"}
    {4,"four"}
    {5,"five"}
]
transpose(include_header: bool = False, header_name: str = 'column', column_names: Optional[Union[Iterator[str], Sequence[str]]] = None) DF[source]

Transpose a DataFrame over the diagonal.

Parameters:
include_header

If set, the column names will be added as the first column.

header_name

If include_header is set, this determines the name of the column that will be inserted.

column_names

Optional generator/iterator that yields column names. Will be used to replace the columns in the DataFrame.

Returns:
DataFrame

Notes

This is a very expensive operation. Perhaps you can do it differently.

Examples

>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]})
>>> df.transpose(include_header=True)
shape: (2, 4)
┌────────┬──────────┬──────────┬──────────┐
│ column ┆ column_0 ┆ column_1 ┆ column_2 │
│ ---    ┆ ---      ┆ ---      ┆ ---      │
│ str    ┆ i64      ┆ i64      ┆ i64      │
╞════════╪══════════╪══════════╪══════════╡
│ a      ┆ 1        ┆ 2        ┆ 3        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ b      ┆ 1        ┆ 2        ┆ 3        │
└────────┴──────────┴──────────┴──────────┘

Replace the auto-generated column names with a list

>>> df.transpose(include_header=False, column_names=["a", "b", "c"])
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 2   ┆ 3   │
└─────┴─────┴─────┘

Include the header as a separate column

>>> df.transpose(
...     include_header=True, header_name="foo", column_names=["a", "b", "c"]
... )
shape: (2, 4)
┌─────┬─────┬─────┬─────┐
│ foo ┆ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╡
│ a   ┆ 1   ┆ 2   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b   ┆ 1   ┆ 2   ┆ 3   │
└─────┴─────┴─────┴─────┘

Replace the auto-generated column with column names from a generator function

>>> def name_generator():
...     base_name = "my_column_"
...     count = 0
...     while True:
...         yield f"{base_name}{count}"
...         count += 1
...
>>> df.transpose(include_header=False, column_names=name_generator())
shape: (2, 3)
┌─────────────┬─────────────┬─────────────┐
│ my_column_0 ┆ my_column_1 ┆ my_column_2 │
│ ---         ┆ ---         ┆ ---         │
│ i64         ┆ i64         ┆ i64         │
╞═════════════╪═════════════╪═════════════╡
│ 1           ┆ 2           ┆ 3           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1           ┆ 2           ┆ 3           │
└─────────────┴─────────────┴─────────────┘
unique(maintain_order: bool = True, subset: str | Sequence[str] | None = None, keep: UniqueKeepStrategy = 'first') DF[source]

Drop duplicate rows from this DataFrame.

Parameters:
maintain_order

Keep the same order as the original DataFrame. This requires more work to compute.

subset

Subset to use to compare rows.

keep{‘first’, ‘last’}

Which of the duplicate rows to keep (in conjunction with subset).

Returns:
DataFrame with unique rows

Warning

Note that this fails if there is a column of type List in the DataFrame or subset.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 1, 2, 3, 4, 5],
...         "b": [0.5, 0.5, 1.0, 2.0, 3.0, 3.0],
...         "c": [True, True, True, False, True, True],
...     }
... )
>>> df.unique()
shape: (5, 3)
┌─────┬─────┬───────┐
│ a   ┆ b   ┆ c     │
│ --- ┆ --- ┆ ---   │
│ i64 ┆ f64 ┆ bool  │
╞═════╪═════╪═══════╡
│ 1   ┆ 0.5 ┆ true  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ 1.0 ┆ true  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3   ┆ 2.0 ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4   ┆ 3.0 ┆ true  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5   ┆ 3.0 ┆ true  │
└─────┴─────┴───────┘
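
A sketch of subset combined with keep (same frame): duplicates are decided on the (b, c) pairs only, and keep="last" retains the final row of each duplicate group:

>>> df.unique(subset=["b", "c"], keep="last")
shape: (4, 3)
┌─────┬─────┬───────┐
│ a   ┆ b   ┆ c     │
│ --- ┆ --- ┆ ---   │
│ i64 ┆ f64 ┆ bool  │
╞═════╪═════╪═══════╡
│ 1   ┆ 0.5 ┆ true  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ 1.0 ┆ true  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3   ┆ 2.0 ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5   ┆ 3.0 ┆ true  │
└─────┴─────┴───────┘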
unnest(names: Union[str, Sequence[str]]) DF[source]

Decompose a struct into its fields.

The fields will be inserted into the DataFrame on the location of the struct type.

Parameters:
names

Names of the struct columns that will be decomposed into their fields.

Examples

>>> df = pl.DataFrame(
...     {
...         "before": ["foo", "bar"],
...         "t_a": [1, 2],
...         "t_b": ["a", "b"],
...         "t_c": [True, None],
...         "t_d": [[1, 2], [3]],
...         "after": ["baz", "womp"],
...     }
... ).select(["before", pl.struct(pl.col("^t_.$")).alias("t_struct"), "after"])
>>> df
shape: (2, 3)
┌────────┬─────────────────────┬───────┐
│ before ┆ t_struct            ┆ after │
│ ---    ┆ ---                 ┆ ---   │
│ str    ┆ struct[4]           ┆ str   │
╞════════╪═════════════════════╪═══════╡
│ foo    ┆ {1,"a",true,[1, 2]} ┆ baz   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ bar    ┆ {2,"b",null,[3]}    ┆ womp  │
└────────┴─────────────────────┴───────┘
>>> df.unnest("t_struct")
shape: (2, 6)
┌────────┬─────┬─────┬──────┬───────────┬───────┐
│ before ┆ t_a ┆ t_b ┆ t_c  ┆ t_d       ┆ after │
│ ---    ┆ --- ┆ --- ┆ ---  ┆ ---       ┆ ---   │
│ str    ┆ i64 ┆ str ┆ bool ┆ list[i64] ┆ str   │
╞════════╪═════╪═════╪══════╪═══════════╪═══════╡
│ foo    ┆ 1   ┆ a   ┆ true ┆ [1, 2]    ┆ baz   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ bar    ┆ 2   ┆ b   ┆ null ┆ [3]       ┆ womp  │
└────────┴─────┴─────┴──────┴───────────┴───────┘
unstack(step: int, how: UnstackDirection = 'vertical', columns: str | Sequence[str] | None = None, fill_values: list[Any] | None = None) DF[source]

Unstack a long table to a wide form without doing an aggregation.

This can be much faster than a pivot, because it can skip the grouping phase.

Parameters:
step

Number of rows in the unstacked frame.

how{ ‘vertical’, ‘horizontal’ }

Direction of the unstack.

columns

Column(s) to include in the operation.

fill_values

Values used to fill any cells that don't fit the new size.

Warning

This functionality is experimental and may be subject to changes without it being considered a breaking change.

Examples

>>> from string import ascii_uppercase
>>> df = pl.DataFrame(
...     {
...         "col1": ascii_uppercase[0:9],
...         "col2": pl.arange(0, 9, eager=True),
...     }
... )
>>> df
shape: (9, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ str  ┆ i64  │
╞══════╪══════╡
│ A    ┆ 0    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ B    ┆ 1    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ C    ┆ 2    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ D    ┆ 3    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ...  ┆ ...  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ F    ┆ 5    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ G    ┆ 6    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ H    ┆ 7    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ I    ┆ 8    │
└──────┴──────┘
>>> df.unstack(step=3, how="vertical")
shape: (3, 6)
┌────────┬────────┬────────┬────────┬────────┬────────┐
│ col1_0 ┆ col1_1 ┆ col1_2 ┆ col2_0 ┆ col2_1 ┆ col2_2 │
│ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    │
│ str    ┆ str    ┆ str    ┆ i64    ┆ i64    ┆ i64    │
╞════════╪════════╪════════╪════════╪════════╪════════╡
│ A      ┆ D      ┆ G      ┆ 0      ┆ 3      ┆ 6      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ B      ┆ E      ┆ H      ┆ 1      ┆ 4      ┆ 7      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ C      ┆ F      ┆ I      ┆ 2      ┆ 5      ┆ 8      │
└────────┴────────┴────────┴────────┴────────┴────────┘
>>> df.unstack(step=3, how="horizontal")
shape: (3, 6)
┌────────┬────────┬────────┬────────┬────────┬────────┐
│ col1_0 ┆ col1_1 ┆ col1_2 ┆ col2_0 ┆ col2_1 ┆ col2_2 │
│ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    ┆ ---    │
│ str    ┆ str    ┆ str    ┆ i64    ┆ i64    ┆ i64    │
╞════════╪════════╪════════╪════════╪════════╪════════╡
│ A      ┆ B      ┆ C      ┆ 0      ┆ 1      ┆ 2      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ D      ┆ E      ┆ F      ┆ 3      ┆ 4      ┆ 5      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ G      ┆ H      ┆ I      ┆ 6      ┆ 7      ┆ 8      │
└────────┴────────┴────────┴────────┴────────┴────────┘
upsample(time_column: str, *, every: str | timedelta, offset: str | timedelta | None = None, by: Optional[Union[str, Sequence[str]]] = None, maintain_order: bool = False) DF[source]

Upsample a DataFrame at a regular frequency.

Parameters:
time_column

The time column; it will be used to determine a date_range. Note that this column has to be sorted for the output to make sense.

every

Interval at which to upsample; the date_range advances by this duration each step.

offset

Change the start of the date_range by this offset.

by

First group by these columns and then upsample for every group.

maintain_order

Keep the ordering predictable. This is slower.

The every and offset arguments are created with the following string language:

  • 1ns (1 nanosecond)

  • 1us (1 microsecond)

  • 1ms (1 millisecond)

  • 1s (1 second)

  • 1m (1 minute)

  • 1h (1 hour)

  • 1d (1 day)

  • 1w (1 week)

  • 1mo (1 calendar month)

  • 1y (1 calendar year)

  • 1i (1 index count)

Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds

Examples

Upsample a DataFrame by a certain interval.

>>> from datetime import datetime
>>> df = pl.DataFrame(
...     {
...         "time": [
...             datetime(2021, 2, 1),
...             datetime(2021, 4, 1),
...             datetime(2021, 5, 1),
...             datetime(2021, 6, 1),
...         ],
...         "groups": ["A", "B", "A", "B"],
...         "values": [0, 1, 2, 3],
...     }
... )
>>> (
...     df.upsample(
...         time_column="time", every="1mo", by="groups", maintain_order=True
...     ).select(pl.all().forward_fill())
... )
shape: (7, 3)
┌─────────────────────┬────────┬────────┐
│ time                ┆ groups ┆ values │
│ ---                 ┆ ---    ┆ ---    │
│ datetime[μs]        ┆ str    ┆ i64    │
╞═════════════════════╪════════╪════════╡
│ 2021-02-01 00:00:00 ┆ A      ┆ 0      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-03-01 00:00:00 ┆ A      ┆ 0      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-04-01 00:00:00 ┆ A      ┆ 0      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-05-01 00:00:00 ┆ A      ┆ 2      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-04-01 00:00:00 ┆ B      ┆ 1      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-05-01 00:00:00 ┆ B      ┆ 1      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-06-01 00:00:00 ┆ B      ┆ 3      │
└─────────────────────┴────────┴────────┘
var(ddof: int = 1) DF[source]

Aggregate the columns of this DataFrame to their variance value.

Parameters:
ddof

Degrees of freedom. The divisor used in the calculation is N - ddof, where N is the number of elements.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.var()
shape: (1, 3)
┌─────┬─────┬──────┐
│ foo ┆ bar ┆ ham  │
│ --- ┆ --- ┆ ---  │
│ f64 ┆ f64 ┆ str  │
╞═════╪═════╪══════╡
│ 1.0 ┆ 1.0 ┆ null │
└─────┴─────┴──────┘
>>> df.var(ddof=0)
shape: (1, 3)
┌──────────┬──────────┬──────┐
│ foo      ┆ bar      ┆ ham  │
│ ---      ┆ ---      ┆ ---  │
│ f64      ┆ f64      ┆ str  │
╞══════════╪══════════╪══════╡
│ 0.666667 ┆ 0.666667 ┆ null │
└──────────┴──────────┴──────┘
vstack(df: DataFrame, in_place: bool = False) DF[source]

Grow this DataFrame vertically by stacking a DataFrame to it.

Parameters:
df

DataFrame to stack.

in_place

Modify in place.

Examples

>>> df1 = pl.DataFrame(
...     {
...         "foo": [1, 2],
...         "bar": [6, 7],
...         "ham": ["a", "b"],
...     }
... )
>>> df2 = pl.DataFrame(
...     {
...         "foo": [3, 4],
...         "bar": [8, 9],
...         "ham": ["c", "d"],
...     }
... )
>>> df1.vstack(df2)
shape: (4, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 7   ┆ b   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 8   ┆ c   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 9   ┆ d   │
└─────┴─────┴─────┘
property width: int[source]

Get the width of the DataFrame.

Examples

>>> df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]})
>>> df.width
1
with_column(column: Series | Expr) DataFrame[source]

Return a new DataFrame with the column added or replaced.

Parameters:
column

Series or expression that will be added; if its name matches an existing column, that column is replaced.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... )
>>> df.with_column((pl.col("b") ** 2).alias("b_squared"))  # added
shape: (3, 3)
┌─────┬─────┬───────────┐
│ a   ┆ b   ┆ b_squared │
│ --- ┆ --- ┆ ---       │
│ i64 ┆ i64 ┆ f64       │
╞═════╪═════╪═══════════╡
│ 1   ┆ 2   ┆ 4.0       │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ 4   ┆ 16.0      │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 5   ┆ 6   ┆ 36.0      │
└─────┴─────┴───────────┘
>>> df.with_column(pl.col("a") ** 2)  # replaced
shape: (3, 2)
┌──────┬─────┐
│ a    ┆ b   │
│ ---  ┆ --- │
│ f64  ┆ i64 │
╞══════╪═════╡
│ 1.0  ┆ 2   │
├╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 9.0  ┆ 4   │
├╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 25.0 ┆ 6   │
└──────┴─────┘
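
A Series can also be passed directly (a brief sketch); its name determines which column is added or replaced:

>>> df.with_column(pl.Series("c", [7, 8, 9]))
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 7   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 4   ┆ 8   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 5   ┆ 6   ┆ 9   │
└─────┴─────┴─────┘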
with_columns(exprs: Optional[Union[Expr, Series, Sequence[Expr | Series]]] = None, **named_exprs: Expr | Series) DataFrame[source]

Add or overwrite multiple columns in a DataFrame.

Parameters:
exprs

List of Expressions that evaluate to columns.

**named_exprs

Named column Expressions, provided as kwargs.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [0.5, 4, 10, 13],
...         "c": [True, True, False, True],
...     }
... )
>>> df.with_columns(
...     [
...         (pl.col("a") ** 2).alias("a^2"),
...         (pl.col("b") / 2).alias("b/2"),
...         (pl.col("c").is_not()).alias("not c"),
...     ]
... )
shape: (4, 6)
┌─────┬──────┬───────┬──────┬──────┬───────┐
│ a   ┆ b    ┆ c     ┆ a^2  ┆ b/2  ┆ not c │
│ --- ┆ ---  ┆ ---   ┆ ---  ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ bool  ┆ f64  ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╪══════╪══════╪═══════╡
│ 1   ┆ 0.5  ┆ true  ┆ 1.0  ┆ 0.25 ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ 4.0  ┆ true  ┆ 4.0  ┆ 2.0  ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3   ┆ 10.0 ┆ false ┆ 9.0  ┆ 5.0  ┆ true  │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4   ┆ 13.0 ┆ true  ┆ 16.0 ┆ 6.5  ┆ false │
└─────┴──────┴───────┴──────┴──────┴───────┘
>>> # Support for kwarg expressions is considered EXPERIMENTAL.
>>> # Currently requires opt-in via `pl.Config` boolean flag:
>>>
>>> pl.Config.with_columns_kwargs = True
>>> df.with_columns(
...     d=pl.col("a") * pl.col("b"),
...     e=pl.col("c").is_not(),
... )
shape: (4, 5)
┌─────┬──────┬───────┬──────┬───────┐
│ a   ┆ b    ┆ c     ┆ d    ┆ e     │
│ --- ┆ ---  ┆ ---   ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ bool  ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╪══════╪═══════╡
│ 1   ┆ 0.5  ┆ true  ┆ 0.5  ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ 4.0  ┆ true  ┆ 8.0  ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3   ┆ 10.0 ┆ false ┆ 30.0 ┆ true  │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4   ┆ 13.0 ┆ true  ┆ 52.0 ┆ false │
└─────┴──────┴───────┴──────┴───────┘
with_row_count(name: str = 'row_nr', offset: int = 0) DF[source]

Add a column at index 0 that counts the rows.

Parameters:
name

Name of the column to add.

offset

Start the row count at this offset. Default = 0

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... )
>>> df.with_row_count()
shape: (3, 3)
┌────────┬─────┬─────┐
│ row_nr ┆ a   ┆ b   │
│ ---    ┆ --- ┆ --- │
│ u32    ┆ i64 ┆ i64 │
╞════════╪═════╪═════╡
│ 0      ┆ 1   ┆ 2   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1      ┆ 3   ┆ 4   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2      ┆ 5   ┆ 6   │
└────────┴─────┴─────┘
write_avro(file: BinaryIO | BytesIO | str | Path, compression: AvroCompression = 'uncompressed') None[source]

Write to Apache Avro file.

Parameters:
file

File path to which the file should be written.

compression{‘uncompressed’, ‘snappy’, ‘deflate’}

Compression method. Defaults to “uncompressed”.

Examples

>>> import pathlib
>>>
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3, 4, 5],
...         "bar": [6, 7, 8, 9, 10],
...         "ham": ["a", "b", "c", "d", "e"],
...     }
... )
>>> path = pathlib.Path("new_file.avro")
>>> df.write_avro(path)
write_csv(file: None = None, has_header: bool = True, sep: str = ',', quote: str = '"', batch_size: int = 1024, datetime_format: str | None = None, date_format: str | None = None, time_format: str | None = None, float_precision: int | None = None, null_value: str | None = None) str[source]
write_csv(file: TextIO | _io.BytesIO | str | pathlib.Path, has_header: bool = True, sep: str = ',', quote: str = '"', batch_size: int = 1024, datetime_format: str | None = None, date_format: str | None = None, time_format: str | None = None, float_precision: int | None = None, null_value: str | None = None) None

Write to comma-separated values (CSV) file.

Parameters:
file

File path to which the result should be written. If set to None (default), the output is returned as a string instead.

has_header

Whether to include header in the CSV output.

sep

Separate CSV fields with this symbol.

quote

Byte to use as quoting character.

batch_size

Number of rows that will be processed per thread.

datetime_format

A format string, with the specifiers defined by the chrono Rust crate. If no format specified, the default fractional-second precision is inferred from the maximum timeunit found in the frame’s Datetime cols (if any).

date_format

A format string, with the specifiers defined by the chrono Rust crate.

time_format

A format string, with the specifiers defined by the chrono Rust crate.

float_precision

Number of decimal places to write, applied to both Float32 and Float64 datatypes.

null_value

A string representing null values (defaulting to the empty string).

Examples

>>> import pathlib
>>>
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3, 4, 5],
...         "bar": [6, 7, 8, 9, 10],
...         "ham": ["a", "b", "c", "d", "e"],
...     }
... )
>>> path = pathlib.Path("new_file.csv")
>>> df.write_csv(path, sep=",")
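
If file is None, the CSV output is returned as a string instead (a small sketch on the same frame):

>>> df.write_csv()
'foo,bar,ham\n1,6,a\n2,7,b\n3,8,c\n4,9,d\n5,10,e\n'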
write_ipc(file: BinaryIO | BytesIO | str | Path, compression: IpcCompression = 'uncompressed') None[source]

Write to Arrow IPC binary stream or Feather file.

Parameters:
file

File path to which the file should be written.

compression{‘uncompressed’, ‘lz4’, ‘zstd’}

Compression method. Defaults to “uncompressed”.

Examples

>>> import pathlib
>>>
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3, 4, 5],
...         "bar": [6, 7, 8, 9, 10],
...         "ham": ["a", "b", "c", "d", "e"],
...     }
... )
>>> path = pathlib.Path("new_file.arrow")
>>> df.write_ipc(path)
write_json(file: None = None, pretty: bool = False, row_oriented: bool = False, json_lines: bool | None = None, *, to_string: bool | None = None) str[source]
write_json(file: io.IOBase | str | pathlib.Path, pretty: bool = False, row_oriented: bool = False, json_lines: bool | None = None, *, to_string: bool | None = None) None

Serialize to JSON representation.

Parameters:
file

File path to which the result should be written. If set to None (default), the output is returned as a string instead.

pretty

Pretty serialize the JSON output.

row_oriented

Write row-oriented JSON. This is slower, but more common.

json_lines

Deprecated argument. Toggle between JSON and NDJSON format.

to_string

Deprecated argument. Ignore file argument and return a string.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...     }
... )
>>> df.write_json()
'{"columns":[{"name":"foo","datatype":"Int64","values":[1,2,3]},{"name":"bar","datatype":"Int64","values":[6,7,8]}]}'
>>> df.write_json(row_oriented=True)
'[{"foo":1,"bar":6},{"foo":2,"bar":7},{"foo":3,"bar":8}]'
write_ndjson(file: None = None) str[source]
write_ndjson(file: io.IOBase | str | pathlib.Path) None

Serialize to newline delimited JSON representation.

Parameters:
file

File path to which the result should be written. If set to None (default), the output is returned as a string instead.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...     }
... )
>>> df.write_ndjson()
'{"foo":1,"bar":6}\n{"foo":2,"bar":7}\n{"foo":3,"bar":8}\n'
write_parquet(file: str | Path | BytesIO, *, compression: ParquetCompression = 'zstd', compression_level: int | None = None, statistics: bool = False, row_group_size: int | None = None, use_pyarrow: bool = False, pyarrow_options: dict[str, object] | None = None) None[source]

Write to Apache Parquet file.

Parameters:
file

File path to which the file should be written.

compression{‘lz4’, ‘uncompressed’, ‘snappy’, ‘gzip’, ‘lzo’, ‘brotli’, ‘zstd’}

Choose “zstd” for good compression performance. Choose “lz4” for fast compression/decompression. Choose “snappy” for more backwards compatibility guarantees when you deal with older parquet readers.

compression_level

The level of compression to use. Higher compression means smaller files on disk.

  • “gzip” : min-level: 0, max-level: 10.

  • “brotli” : min-level: 0, max-level: 11.

  • “zstd” : min-level: 1, max-level: 22.

statistics

Write statistics to the parquet headers. This requires extra compute.

row_group_size

Size of the row groups in number of rows. If None (default), the chunks of the DataFrame are used. Writing in smaller chunks may reduce memory pressure and improve writing speeds. If None and use_pyarrow=True, the row group size will be the minimum of the DataFrame size and 64 * 1024 * 1024.

use_pyarrow

Use the C++ (pyarrow) parquet implementation instead of the Rust-native implementation. At the moment the C++ implementation supports more features.

pyarrow_options

Arguments passed to pyarrow.parquet.write_table.

Examples

>>> import pathlib
>>>
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3, 4, 5],
...         "bar": [6, 7, 8, 9, 10],
...         "ham": ["a", "b", "c", "d", "e"],
...     }
... )
>>> path = pathlib.Path("new_file.parquet")
>>> df.write_parquet(path)