LazyFrame#

This page gives an overview of all public LazyFrame methods.

class polars.LazyFrame[source]

Representation of a Lazy computation graph/query against a DataFrame.

Notes

LazyFrames are instantiated by calling lazy() on an existing DataFrame; they are also created when calling the various “scan” IO methods, and are the preferred way to operate on data with polars.

>>> ldf = pl.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}).lazy()

Methods:

cache

Cache the result once the execution of the physical plan hits this node.

cleared

Create an empty copy of the current LazyFrame.

clone

Very cheap deepcopy/clone.

collect

Collect into a DataFrame.

describe_optimized_plan

Create a string representation of the optimized query plan.

describe_plan

Create a string representation of the unoptimized query plan.

drop

Remove one or multiple columns from a DataFrame.

drop_nulls

Drop rows with null values from this LazyFrame.

explode

Explode lists to long format.

fetch

Collect a small number of rows for debugging purposes.

fill_nan

Fill floating point NaN values.

fill_null

Fill null values using the specified value or strategy.

filter

Filter the rows in the DataFrame based on a predicate expression.

first

Get the first row of the DataFrame.

from_json

Read a logical plan from a JSON string to construct a LazyFrame.

groupby

Start a groupby operation.

groupby_dynamic

Group based on a time value (or index value of type Int32, Int64).

groupby_rolling

Create rolling groups based on a time column.

head

Get the first n rows.

inspect

Inspect a node in the computation graph.

interpolate

Interpolate intermediate values.

join

Add a join operation to the Logical Plan.

join_asof

Perform an asof join.

last

Get the last row of the DataFrame.

lazy

Return lazy representation, i.e. itself.

limit

Get the first n rows.

map

Apply a custom function.

max

Aggregate the columns in the DataFrame to their maximum value.

mean

Aggregate the columns in the DataFrame to their mean value.

median

Aggregate the columns in the DataFrame to their median value.

melt

Unpivot a DataFrame from wide to long format.

min

Aggregate the columns in the DataFrame to their minimum value.

pipe

Offers a structured way to apply a sequence of user-defined functions (UDFs).

profile

Profile a LazyFrame.

quantile

Aggregate the columns in the DataFrame to their quantile value.

read_json

Read a logical plan from a JSON file to construct a LazyFrame.

rename

Rename column names.

reverse

Reverse the DataFrame.

select

Select columns from this DataFrame.

shift

Shift the values by a given period.

shift_and_fill

Shift the values by a given period and fill the resulting null values.

show_graph

Show a plot of the query plan.

slice

Get a slice of this DataFrame.

sort

Sort the DataFrame.

std

Aggregate the columns in the DataFrame to their standard deviation value.

sum

Aggregate the columns in the DataFrame to their sum value.

tail

Get the last n rows.

take_every

Take every nth row in the LazyFrame and return as a new LazyFrame.

unique

Drop duplicate rows from this DataFrame.

unnest

Decompose a struct into its fields.

var

Aggregate the columns in the DataFrame to their variance value.

with_column

Add or overwrite column in a DataFrame.

with_columns

Add or overwrite multiple columns in a DataFrame.

with_context

Add an external context to the computation graph.

with_row_count

Add a column at index 0 that counts the rows.

write_json

Write the logical plan of this LazyFrame to a file or string in JSON format.

Attributes:

columns

Get or set column names.

dtypes

Get dtypes of columns in LazyFrame.

schema

Get a dict[column name, DataType].

width

Get the width of the LazyFrame.

cache() LDF[source]

Cache the result once the execution of the physical plan hits this node.
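
Examples

For example, an intermediate result that feeds both sides of a self-join can be cached so that it is computed only once. A minimal sketch (output omitted, as it matches an ordinary self-join):

>>> ldf = (
...     pl.DataFrame({"a": [1, 2, 3]})
...     .lazy()
...     .with_column((pl.col("a") * 2).alias("a2"))
...     .cache()
... )
>>> ldf.join(ldf, on="a").collect()  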

cleared() LazyFrame[source]

Create an empty copy of the current LazyFrame.

The copy has an identical schema but no data.

See also

clone

Cheap deepcopy/clone.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [None, 2, 3, 4],
...         "b": [0.5, None, 2.5, 13],
...         "c": [True, True, False, None],
...     }
... ).lazy()
>>> df.cleared().fetch()
shape: (0, 3)
┌─────┬─────┬──────┐
│ a   ┆ b   ┆ c    │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ f64 ┆ bool │
╞═════╪═════╪══════╡
└─────┴─────┴──────┘
clone() LDF[source]

Very cheap deepcopy/clone.

See also

cleared

Create an empty copy of the current LazyFrame, with identical schema but no data.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [None, 2, 3, 4],
...         "b": [0.5, None, 2.5, 13],
...         "c": [True, True, False, None],
...     }
... ).lazy()
>>> (df.clone())  
<polars.LazyFrame object at ...>
collect(*, type_coercion: bool = True, predicate_pushdown: bool = True, projection_pushdown: bool = True, simplify_expression: bool = True, no_optimization: bool = False, slice_pushdown: bool = True, common_subplan_elimination: bool = True, streaming: bool = False) DataFrame[source]

Collect into a DataFrame.

Note: use fetch() if you want to run your query on the first n rows only. This can be a huge time saver in debugging queries.

Parameters:
type_coercion

Do type coercion optimization.

predicate_pushdown

Do predicate pushdown optimization.

projection_pushdown

Do projection pushdown optimization.

simplify_expression

Run simplify expressions optimization.

no_optimization

Turn off (certain) optimizations.

slice_pushdown

Slice pushdown optimization.

common_subplan_elimination

Will try to cache branching subplans that occur on self-joins or unions.

streaming

Run parts of the query in a streaming fashion (this is in an alpha state)

Returns:
DataFrame

Examples

>>> df = pl.DataFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... ).lazy()
>>> df.groupby("a", maintain_order=True).agg(pl.all().sum()).collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 4   ┆ 10  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b   ┆ 11  ┆ 10  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ c   ┆ 6   ┆ 1   │
└─────┴─────┴─────┘
property columns: list[str][source]

Get or set column names.

Examples

>>> df = (
...     pl.DataFrame(
...         {
...             "foo": [1, 2, 3],
...             "bar": [6, 7, 8],
...             "ham": ["a", "b", "c"],
...         }
...     )
...     .lazy()
...     .select(["foo", "bar"])
... )
>>> df.columns
['foo', 'bar']
describe_optimized_plan(type_coercion: bool = True, predicate_pushdown: bool = True, projection_pushdown: bool = True, simplify_expression: bool = True, slice_pushdown: bool = True, common_subplan_elimination: bool = True, streaming: bool = False) str[source]

Create a string representation of the optimized query plan.
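
Examples

Usage mirrors describe_plan(); a minimal sketch (the plan string itself is omitted here, as its exact text depends on the optimizations applied):

>>> df = pl.DataFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...     }
... ).lazy()
>>> df.groupby("a", maintain_order=True).agg(pl.all().sum()).sort(
...     "a"
... ).describe_optimized_plan()  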

describe_plan() str[source]

Create a string representation of the unoptimized query plan.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... ).lazy()
>>> df.groupby("a", maintain_order=True).agg(pl.all().sum()).sort(
...     "a"
... ).describe_plan()  
drop(columns: str | list[str]) LDF[source]

Remove one or multiple columns from a DataFrame.

Parameters:
columns
  • Name of the column that should be removed.

  • List of column names.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... ).lazy()
>>> df.drop("ham").collect()
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═════╡
│ 1   ┆ 6.0 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 7.0 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 8.0 │
└─────┴─────┘
drop_nulls(subset: list[str] | str | None = None) LDF[source]

Drop rows with null values from this LazyFrame.

Parameters:
subset

Subset of column(s) on which drop_nulls will be applied.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, None, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
>>> df.lazy().drop_nulls().collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 8   ┆ c   │
└─────┴─────┴─────┘

This method drops a row whenever any single value of that row is null.

Below are some example snippets that show how you could drop null values based on other conditions:

>>> df = pl.DataFrame(
...     {
...         "a": [None, None, None, None],
...         "b": [1, 2, None, 1],
...         "c": [1, None, None, 1],
...     }
... )
>>> df
shape: (4, 3)
┌──────┬──────┬──────┐
│ a    ┆ b    ┆ c    │
│ ---  ┆ ---  ┆ ---  │
│ f64  ┆ i64  ┆ i64  │
╞══════╪══════╪══════╡
│ null ┆ 1    ┆ 1    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ 2    ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ null ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ 1    ┆ 1    │
└──────┴──────┴──────┘

Drop a row only if all values are null:

>>> df.filter(
...     ~pl.fold(
...         acc=True,
...         f=lambda acc, s: acc & s.is_null(),
...         exprs=pl.all(),
...     )
... )
shape: (3, 3)
┌──────┬─────┬──────┐
│ a    ┆ b   ┆ c    │
│ ---  ┆ --- ┆ ---  │
│ f64  ┆ i64 ┆ i64  │
╞══════╪═════╪══════╡
│ null ┆ 1   ┆ 1    │
├╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ 2   ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ 1   ┆ 1    │
└──────┴─────┴──────┘
property dtypes: list[type[polars.datatypes.DataType]][source]

Get dtypes of columns in LazyFrame.

See also

schema

Returns a {colname:dtype} mapping.

Examples

>>> lf = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... ).lazy()
>>> lf.dtypes
[<class 'polars.datatypes.Int64'>, <class 'polars.datatypes.Float64'>, <class 'polars.datatypes.Utf8'>]
explode(columns: Union[str, Sequence[str], Expr, Sequence[Expr]]) LDF[source]

Explode lists to long format.

Examples

>>> df = pl.DataFrame(
...     {
...         "letters": ["a", "a", "b", "c"],
...         "numbers": [[1], [2, 3], [4, 5], [6, 7, 8]],
...     }
... ).lazy()
>>> df.explode("numbers").collect()
shape: (8, 2)
┌─────────┬─────────┐
│ letters ┆ numbers │
│ ---     ┆ ---     │
│ str     ┆ i64     │
╞═════════╪═════════╡
│ a       ┆ 1       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ a       ┆ 2       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ a       ┆ 3       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ b       ┆ 4       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ b       ┆ 5       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ c       ┆ 6       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ c       ┆ 7       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ c       ┆ 8       │
└─────────┴─────────┘
fetch(n_rows: int = 500, *, type_coercion: bool = True, predicate_pushdown: bool = True, projection_pushdown: bool = True, simplify_expression: bool = True, no_optimization: bool = False, slice_pushdown: bool = True, common_subplan_elimination: bool = True, allow_streaming: bool = False) DataFrame[source]

Collect a small number of rows for debugging purposes.

Fetch is like a collect() operation, but it overwrites the number of rows read by every scan operation. This is a utility that helps debug a query on a smaller number of rows.

Note that fetch does not guarantee the final number of rows in the DataFrame. Filters, join operations, and fewer rows being available in the scanned file all influence the final number of rows.

Parameters:
n_rows

Collect n_rows from the data sources.

type_coercion

Run type coercion optimization.

predicate_pushdown

Run predicate pushdown optimization.

projection_pushdown

Run projection pushdown optimization.

simplify_expression

Run simplify expressions optimization.

no_optimization

Turn off optimizations.

slice_pushdown

Slice pushdown optimization.

common_subplan_elimination

Will try to cache branching subplans that occur on self-joins or unions.

allow_streaming

Run parts of the query in a streaming fashion (this is in an alpha state)

Returns:
DataFrame

Examples

>>> df = pl.DataFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... ).lazy()
>>> df.groupby("a", maintain_order=True).agg(pl.all().sum()).fetch(2)
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ a   ┆ 1   ┆ 6   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ b   ┆ 2   ┆ 5   │
└─────┴─────┴─────┘
fill_nan(fill_value: int | float | Expr | None) LDF[source]

Fill floating point NaN values.

Parameters:
fill_value

Value to fill the NaN values with.

Warning

Note that floating point NaN (Not a Number) values are not missing values! To replace missing values, use fill_null() instead.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1.5, 2, float("NaN"), 4],
...         "b": [0.5, 4, float("NaN"), 13],
...     }
... ).lazy()
>>> df.fill_nan(99).collect()
shape: (4, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ f64  ┆ f64  │
╞══════╪══════╡
│ 1.5  ┆ 0.5  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2.0  ┆ 4.0  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 99.0 ┆ 99.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4.0  ┆ 13.0 │
└──────┴──────┘
fill_null(value: Any | None = None, strategy: FillNullStrategy | None = None, limit: int | None = None, matches_supertype: bool = True) LDF[source]

Fill null values using the specified value or strategy.

Parameters:
value

Value used to fill null values.

strategy{None, ‘forward’, ‘backward’, ‘min’, ‘max’, ‘mean’, ‘zero’, ‘one’}

Strategy used to fill null values.

limit

Number of consecutive null values to fill when using the ‘forward’ or ‘backward’ strategy.

matches_supertype

Fill all matching supertypes of the fill value literal.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, None, 4],
...         "b": [0.5, 4, None, 13],
...     }
... ).lazy()
>>> df.fill_null(99).collect()
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ 4.0  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 99  ┆ 99.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4   ┆ 13.0 │
└─────┴──────┘
>>> df.fill_null(strategy="forward").collect()
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ 4.0  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ 4.0  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4   ┆ 13.0 │
└─────┴──────┘
>>> df.fill_null(strategy="max").collect()
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ 4.0  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4   ┆ 13.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4   ┆ 13.0 │
└─────┴──────┘
>>> df.fill_null(strategy="zero").collect()
shape: (4, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ f64  │
╞═════╪══════╡
│ 1   ┆ 0.5  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ 4.0  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 0   ┆ 0.0  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4   ┆ 13.0 │
└─────┴──────┘
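
The limit argument caps how many consecutive nulls are filled per gap; a minimal sketch on a column containing two consecutive nulls:

>>> pl.DataFrame({"a": [1, None, None, 4]}).lazy().fill_null(
...     strategy="forward", limit=1
... ).collect()
shape: (4, 1)
┌──────┐
│ a    │
│ ---  │
│ i64  │
╞══════╡
│ 1    │
├╌╌╌╌╌╌┤
│ 1    │
├╌╌╌╌╌╌┤
│ null │
├╌╌╌╌╌╌┤
│ 4    │
└──────┘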
filter(predicate: Expr | str | Series | list[bool]) LDF[source]

Filter the rows in the DataFrame based on a predicate expression.

Parameters:
predicate

Expression that evaluates to a boolean Series.

Examples

>>> lf = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... ).lazy()

Filter on one condition:

>>> lf.filter(pl.col("foo") < 3).collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 7   ┆ b   │
└─────┴─────┴─────┘

Filter on multiple conditions:

>>> lf.filter((pl.col("foo") < 3) & (pl.col("ham") == "a")).collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
└─────┴─────┴─────┘

Filter on an OR condition:

>>> lf.filter((pl.col("foo") == 1) | (pl.col("ham") == "c")).collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 8   ┆ c   │
└─────┴─────┴─────┘
first() LDF[source]

Get the first row of the DataFrame.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... ).lazy()
>>> df.first().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 2   │
└─────┴─────┘
classmethod from_json(json: str) LazyFrame[source]

Read a logical plan from a JSON string to construct a LazyFrame.

Parameters:
json

String in JSON format.

See also

read_json
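
Examples

A minimal round-trip sketch; it assumes that write_json(), called without a file argument, returns the serialized logical plan as a string (see write_json()):

>>> lf = pl.DataFrame({"a": [1, 2, 3]}).lazy().select(pl.col("a") * 2)
>>> json_plan = lf.write_json()  # assumed: returns the plan as a JSON string
>>> pl.LazyFrame.from_json(json_plan).collect()  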
groupby(by: Union[str, Sequence[str], Expr, Sequence[Expr]], maintain_order: bool = False) LazyGroupBy[LDF][source]

Start a groupby operation.

Parameters:
by

Column(s) to group by.

maintain_order

Make sure that the order of the groups remains consistent. This is more expensive than a default groupby.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... ).lazy()

The following does NOT work:

# df.groupby("a")["b"].sum().collect()
#                ^^^^ TypeError: 'LazyGroupBy' object is not subscriptable

Instead, use .agg():

>>> df.groupby(by="a", maintain_order=True).agg(pl.col("b").sum()).collect()
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 4   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ b   ┆ 11  │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ c   ┆ 6   │
└─────┴─────┘

groupby_dynamic(index_column: str, *, every: str | timedelta, period: str | timedelta | None = None, offset: str | timedelta | None = None, truncate: bool = True, include_boundaries: bool = False, closed: ClosedWindow = 'left', by: str | Sequence[str] | Expr | Sequence[Expr] | None = None, start_by: StartBy = 'window') LazyGroupBy[LDF][source]

Group based on a time value (or index value of type Int32, Int64).

Time windows are calculated and rows are assigned to windows. Unlike a normal groupby, a row can be a member of multiple groups. The time/index window could be seen as a rolling window, with a window size determined by dates/times/values instead of slots in the DataFrame.

A window is defined by:

  • every: interval of the window

  • period: length of the window

  • offset: offset of the window

The every, period and offset arguments are created with the following string language:

  • 1ns (1 nanosecond)

  • 1us (1 microsecond)

  • 1ms (1 millisecond)

  • 1s (1 second)

  • 1m (1 minute)

  • 1h (1 hour)

  • 1d (1 day)

  • 1w (1 week)

  • 1mo (1 calendar month)

  • 1y (1 calendar year)

  • 1i (1 index count)

Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds

In case of a groupby_dynamic on an integer column, the windows are defined by:

  • “1i” # length 1

  • “10i” # length 10

Parameters:
index_column

Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order; if not, the output will not make sense.

In case of a dynamic groupby on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.

every

Interval of the window.

period

Length of the window; if None, it is equal to ‘every’.

offset

Offset of the window. If None and period is also None, it will be equal to negative every.

truncate

Truncate the time value to the window lower bound.

include_boundaries

Add the lower and upper bound of the window to the “_lower_boundary” and “_upper_boundary” columns. This will impact performance because it’s harder to parallelize.

closed{‘right’, ‘left’, ‘both’, ‘none’}

Define whether the temporal window interval is closed or not.

by

Also group by this column/these columns.

start_by{‘window’, ‘datapoint’, ‘monday’}

The strategy to determine the start of the first window:

  • ‘window’: Truncate the start of the window with the ‘every’ argument.

  • ‘datapoint’: Start from the first encountered data point.

  • ‘monday’: Start the window on the Monday before the first data point.

See also

groupby_rolling

Examples

>>> from datetime import datetime
>>> # create an example dataframe
>>> df = pl.DataFrame(
...     {
...         "time": pl.date_range(
...             low=datetime(2021, 12, 16),
...             high=datetime(2021, 12, 16, 3),
...             interval="30m",
...         ),
...         "n": range(7),
...     }
... )
>>> df
shape: (7, 2)
┌─────────────────────┬─────┐
│ time                ┆ n   │
│ ---                 ┆ --- │
│ datetime[μs]        ┆ i64 │
╞═════════════════════╪═════╡
│ 2021-12-16 00:00:00 ┆ 0   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2021-12-16 00:30:00 ┆ 1   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2021-12-16 01:00:00 ┆ 2   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2021-12-16 01:30:00 ┆ 3   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2021-12-16 02:00:00 ┆ 4   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2021-12-16 02:30:00 ┆ 5   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2021-12-16 03:00:00 ┆ 6   │
└─────────────────────┴─────┘

Group by windows of 1 hour starting at 2021-12-16 00:00:00.

>>> (
...     df.lazy()
...     .groupby_dynamic("time", every="1h", closed="right")
...     .agg(
...         [
...             pl.col("time").min().alias("time_min"),
...             pl.col("time").max().alias("time_max"),
...         ]
...     )
... ).collect()
shape: (4, 3)
┌─────────────────────┬─────────────────────┬─────────────────────┐
│ time                ┆ time_min            ┆ time_max            │
│ ---                 ┆ ---                 ┆ ---                 │
│ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        │
╞═════════════════════╪═════════════════════╪═════════════════════╡
│ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 00:00:00 ┆ 2021-12-16 00:30:00 ┆ 2021-12-16 01:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 01:00:00 ┆ 2021-12-16 01:30:00 ┆ 2021-12-16 02:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:00:00 ┆ 2021-12-16 02:30:00 ┆ 2021-12-16 03:00:00 │
└─────────────────────┴─────────────────────┴─────────────────────┘

The window boundaries can also be added to the aggregation result

>>> (
...     df.lazy()
...     .groupby_dynamic(
...         "time", every="1h", include_boundaries=True, closed="right"
...     )
...     .agg([pl.col("time").count().alias("time_count")])
... ).collect()
shape: (4, 4)
┌─────────────────────┬─────────────────────┬─────────────────────┬────────────┐
│ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ time_count │
│ ---                 ┆ ---                 ┆ ---                 ┆ ---        │
│ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32        │
╞═════════════════════╪═════════════════════╪═════════════════════╪════════════╡
│ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 1          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 2          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2          │
└─────────────────────┴─────────────────────┴─────────────────────┴────────────┘

When closed=”left”, the right end of each interval is not included: windows span [lower_bound, upper_bound).

>>> (
...     df.lazy()
...     .groupby_dynamic("time", every="1h", closed="left")
...     .agg(
...         [
...             pl.col("time").count().alias("time_count"),
...             pl.col("time").list().alias("time_agg_list"),
...         ]
...     )
... ).collect()
shape: (4, 3)
┌─────────────────────┬────────────┬─────────────────────────────────────┐
│ time                ┆ time_count ┆ time_agg_list                       │
│ ---                 ┆ ---        ┆ ---                                 │
│ datetime[μs]        ┆ u32        ┆ list[datetime[μs]]                  │
╞═════════════════════╪════════════╪═════════════════════════════════════╡
│ 2021-12-16 00:00:00 ┆ 2          ┆ [2021-12-16 00:00:00, 2021-12-16... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 01:00:00 ┆ 2          ┆ [2021-12-16 01:00:00, 2021-12-16... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:00:00 ┆ 2          ┆ [2021-12-16 02:00:00, 2021-12-16... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 03:00:00 ┆ 1          ┆ [2021-12-16 03:00:00]               │
└─────────────────────┴────────────┴─────────────────────────────────────┘

When closed=”both”, the time values at the window boundaries belong to 2 groups.

>>> (
...     df.lazy()
...     .groupby_dynamic("time", every="1h", closed="both")
...     .agg([pl.col("time").count().alias("time_count")])
... ).collect()
shape: (5, 2)
┌─────────────────────┬────────────┐
│ time                ┆ time_count │
│ ---                 ┆ ---        │
│ datetime[μs]        ┆ u32        │
╞═════════════════════╪════════════╡
│ 2021-12-15 23:00:00 ┆ 1          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 00:00:00 ┆ 3          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 01:00:00 ┆ 3          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:00:00 ┆ 3          │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-16 03:00:00 ┆ 1          │
└─────────────────────┴────────────┘

Dynamic groupbys can also be combined with grouping on normal keys

>>> df = pl.DataFrame(
...     {
...         "time": pl.date_range(
...             low=datetime(2021, 12, 16),
...             high=datetime(2021, 12, 16, 3),
...             interval="30m",
...         ),
...         "groups": ["a", "a", "a", "b", "b", "a", "a"],
...     }
... )
>>> df
shape: (7, 2)
┌─────────────────────┬────────┐
│ time                ┆ groups │
│ ---                 ┆ ---    │
│ datetime[μs]        ┆ str    │
╞═════════════════════╪════════╡
│ 2021-12-16 00:00:00 ┆ a      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 00:30:00 ┆ a      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 01:00:00 ┆ a      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 01:30:00 ┆ b      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:00:00 ┆ b      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:30:00 ┆ a      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 03:00:00 ┆ a      │
└─────────────────────┴────────┘
>>> (
...     df.lazy()
...     .groupby_dynamic(
...         "time",
...         every="1h",
...         closed="both",
...         by="groups",
...         include_boundaries=True,
...     )
...     .agg([pl.col("time").count().alias("time_count")])
... ).collect()
shape: (7, 5)
┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬────────────┐
│ groups ┆ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ time_count │
│ ---    ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---        │
│ str    ┆ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32        │
╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪════════════╡
│ a      ┆ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 1          │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a      ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 3          │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 1          │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2          │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a      ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ 1          │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2          │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 1          │
└────────┴─────────────────────┴─────────────────────┴─────────────────────┴────────────┘

Dynamic groupby on an index column

>>> df = pl.DataFrame(
...     {
...         "idx": pl.arange(0, 6, eager=True),
...         "A": ["A", "A", "B", "B", "B", "C"],
...     }
... )
>>> (
...     df.lazy()
...     .groupby_dynamic(
...         "idx",
...         every="2i",
...         period="3i",
...         include_boundaries=True,
...         closed="right",
...     )
...     .agg(pl.col("A").list().alias("A_agg_list"))
... ).collect()
shape: (3, 4)
┌─────────────────┬─────────────────┬─────┬─────────────────┐
│ _lower_boundary ┆ _upper_boundary ┆ idx ┆ A_agg_list      │
│ ---             ┆ ---             ┆ --- ┆ ---             │
│ i64             ┆ i64             ┆ i64 ┆ list[str]       │
╞═════════════════╪═════════════════╪═════╪═════════════════╡
│ 0               ┆ 3               ┆ 0   ┆ ["A", "B", "B"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2               ┆ 5               ┆ 2   ┆ ["B", "B", "C"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4               ┆ 7               ┆ 4   ┆ ["C"]           │
└─────────────────┴─────────────────┴─────┴─────────────────┘
groupby_rolling(index_column: str, *, period: str | timedelta, offset: str | timedelta | None = None, closed: ClosedWindow = 'right', by: str | Sequence[str] | Expr | Sequence[Expr] | None = None) LazyGroupBy[LDF][source]

Create rolling groups based on a time column.

Also works for index values of type Int32 or Int64.

Different from groupby_dynamic, the windows are determined by the individual values and are not of constant intervals. For constant intervals, use groupby_dynamic.

The period and offset arguments are created either from a timedelta, or by using the following string language:

  • 1ns (1 nanosecond)

  • 1us (1 microsecond)

  • 1ms (1 millisecond)

  • 1s (1 second)

  • 1m (1 minute)

  • 1h (1 hour)

  • 1d (1 day)

  • 1w (1 week)

  • 1mo (1 calendar month)

  • 1y (1 calendar year)

  • 1i (1 index count)

Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds

In case of a groupby_rolling on an integer column, the windows are defined by:

  • “1i” # length 1

  • “10i” # length 10

Parameters:
index_column

Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order; if not, the output will not make sense.

In case of a rolling groupby on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.

period

Length of the window.

offset

Offset of the window. Default is -period.

closed{‘right’, ‘left’, ‘both’, ‘none’}

Define whether the temporal window interval is closed or not.

by

Also group by this column/these columns.

See also

groupby_dynamic

Examples

>>> dates = [
...     "2020-01-01 13:45:48",
...     "2020-01-01 16:42:13",
...     "2020-01-01 16:45:09",
...     "2020-01-02 18:12:48",
...     "2020-01-03 19:45:32",
...     "2020-01-08 23:16:43",
... ]
>>> df = pl.DataFrame({"dt": dates, "a": [3, 7, 5, 9, 2, 1]}).with_column(
...     pl.col("dt").str.strptime(pl.Datetime)
... )
>>> out = df.groupby_rolling(index_column="dt", period="2d").agg(
...     [
...         pl.sum("a").alias("sum_a"),
...         pl.min("a").alias("min_a"),
...         pl.max("a").alias("max_a"),
...     ]
... )
>>> assert out["sum_a"].to_list() == [3, 10, 15, 24, 11, 1]
>>> assert out["max_a"].to_list() == [3, 7, 7, 9, 9, 1]
>>> assert out["min_a"].to_list() == [3, 3, 3, 3, 2, 1]
>>> out
shape: (6, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ dt                  ┆ sum_a ┆ min_a ┆ max_a │
│ ---                 ┆ ---   ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ i64   ┆ i64   ┆ i64   │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 13:45:48 ┆ 3     ┆ 3     ┆ 3     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 16:42:13 ┆ 10    ┆ 3     ┆ 7     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 16:45:09 ┆ 15    ┆ 3     ┆ 7     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-02 18:12:48 ┆ 24    ┆ 3     ┆ 9     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-03 19:45:32 ┆ 11    ┆ 2     ┆ 9     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-08 23:16:43 ┆ 1     ┆ 1     ┆ 1     │
└─────────────────────┴───────┴───────┴───────┘
head(n: int = 5) LDF[source]

Get the first n rows.

Parameters:
n

Number of rows to return.

Notes

Consider using the fetch() operation if you only want to test your query. The fetch() operation will load the first n rows at the scan level, whereas the head()/limit() are applied at the end.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 4, 5, 6],
...         "b": [7, 8, 9, 10, 11, 12],
...     }
... ).lazy()
>>> df.head().collect()
shape: (5, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 8   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 9   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 10  │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 5   ┆ 11  │
└─────┴─────┘
>>> df.head(2).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 8   │
└─────┴─────┘
inspect(fmt: str = '{}') LDF[source]

Inspect a node in the computation graph.

Print the value that this node in the computation graph evaluates to, and pass on the value.

Examples

>>> df = pl.DataFrame({"foo": [1, 1, -2, 3]}).lazy()
>>> (
...     df.select(
...         [
...             pl.col("foo").cumsum().alias("bar"),
...         ]
...     )
...     .inspect()  # print the node before the filter
...     .filter(pl.col("bar") == pl.col("foo"))
... )  
<polars.LazyFrame object at ...>
interpolate() LDF[source]

Interpolate intermediate values. The interpolation method is linear.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, None, 9, 10],
...         "bar": [6, 7, 9, None],
...         "baz": [1, None, None, 9],
...     }
... ).lazy()
>>> df.interpolate().collect()
shape: (4, 3)
┌─────┬──────┬─────┐
│ foo ┆ bar  ┆ baz │
│ --- ┆ ---  ┆ --- │
│ i64 ┆ i64  ┆ i64 │
╞═════╪══════╪═════╡
│ 1   ┆ 6    ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 5   ┆ 7    ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 9   ┆ 9    ┆ 6   │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 10  ┆ null ┆ 9   │
└─────┴──────┴─────┘
join(other: LazyFrame, left_on: str | Expr | Sequence[str | Expr] | None = None, right_on: str | Expr | Sequence[str | Expr] | None = None, on: str | Expr | Sequence[str | Expr] | None = None, how: JoinStrategy = 'inner', suffix: str = '_right', allow_parallel: bool = True, force_parallel: bool = False) LDF[source]

Add a join operation to the Logical Plan.

Parameters:
other

Lazy DataFrame to join with.

left_on

Join column of the left DataFrame.

right_on

Join column of the right DataFrame.

on

Join column of both DataFrames. If set, left_on and right_on should be None.

how{‘inner’, ‘left’, ‘outer’, ‘semi’, ‘anti’, ‘cross’}

Join strategy.

suffix

Suffix to append to columns with a duplicate name.

allow_parallel

Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.

force_parallel

Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.

See also

join_asof

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... ).lazy()
>>> other_df = pl.DataFrame(
...     {
...         "apple": ["x", "y", "z"],
...         "ham": ["a", "b", "d"],
...     }
... ).lazy()
>>> df.join(other_df, on="ham").collect()
shape: (2, 4)
┌─────┬─────┬─────┬───────┐
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ f64 ┆ str ┆ str   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 6.0 ┆ a   ┆ x     │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ 7.0 ┆ b   ┆ y     │
└─────┴─────┴─────┴───────┘
>>> df.join(other_df, on="ham", how="outer").collect()
shape: (4, 4)
┌──────┬──────┬─────┬───────┐
│ foo  ┆ bar  ┆ ham ┆ apple │
│ ---  ┆ ---  ┆ --- ┆ ---   │
│ i64  ┆ f64  ┆ str ┆ str   │
╞══════╪══════╪═════╪═══════╡
│ 1    ┆ 6.0  ┆ a   ┆ x     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2    ┆ 7.0  ┆ b   ┆ y     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ null ┆ d   ┆ z     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3    ┆ 8.0  ┆ c   ┆ null  │
└──────┴──────┴─────┴───────┘
>>> df.join(other_df, on="ham", how="left").collect()
shape: (3, 4)
┌─────┬─────┬─────┬───────┐
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ f64 ┆ str ┆ str   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 6.0 ┆ a   ┆ x     │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ 7.0 ┆ b   ┆ y     │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3   ┆ 8.0 ┆ c   ┆ null  │
└─────┴─────┴─────┴───────┘
>>> df.join(other_df, on="ham", how="semi").collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6.0 ┆ a   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 7.0 ┆ b   │
└─────┴─────┴─────┘
>>> df.join(other_df, on="ham", how="anti").collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str │
╞═════╪═════╪═════╡
│ 3   ┆ 8.0 ┆ c   │
└─────┴─────┴─────┘
join_asof(other: LazyFrame, left_on: str | None = None, right_on: str | None = None, on: str | None = None, by_left: str | Sequence[str] | None = None, by_right: str | Sequence[str] | None = None, by: str | Sequence[str] | None = None, strategy: AsofJoinStrategy = 'backward', suffix: str = '_right', tolerance: str | int | float | None = None, allow_parallel: bool = True, force_parallel: bool = False) LDF[source]

Perform an asof join.

This is similar to a left-join except that we match on nearest key rather than equal keys.

Both DataFrames must be sorted by the join_asof key.

For each row in the left DataFrame:

  • A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.

  • A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.

The default is “backward”.

Parameters:
other

Lazy DataFrame to join with.

left_on

Join column of the left DataFrame.

right_on

Join column of the right DataFrame.

on

Join column of both DataFrames. If set, left_on and right_on should be None.

by

Join on these columns before doing asof join.

by_left

Join on these columns before doing asof join.

by_right

Join on these columns before doing asof join.

strategy{‘backward’, ‘forward’}

Join strategy.

suffix

Suffix to append to columns with a duplicate name.

tolerance

Numeric tolerance. When set, the join only matches keys that are within this distance of each other. If an asof join is done on columns of dtype “Date”, “Datetime”, “Duration” or “Time”, you can use the following string language:

  • 1ns (1 nanosecond)

  • 1us (1 microsecond)

  • 1ms (1 millisecond)

  • 1s (1 second)

  • 1m (1 minute)

  • 1h (1 hour)

  • 1d (1 day)

  • 1w (1 week)

  • 1mo (1 calendar month)

  • 1y (1 calendar year)

  • 1i (1 index count)

Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds

allow_parallel

Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.

force_parallel

Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.

Examples

>>> from datetime import datetime
>>> gdp = pl.DataFrame(
...     {
...         "date": [
...             datetime(2016, 1, 1),
...             datetime(2017, 1, 1),
...             datetime(2018, 1, 1),
...             datetime(2019, 1, 1),
...         ],  # note record date: Jan 1st (sorted!)
...         "gdp": [4164, 4411, 4566, 4696],
...     }
... ).lazy()
>>> population = pl.DataFrame(
...     {
...         "date": [
...             datetime(2016, 5, 12),
...             datetime(2017, 5, 12),
...             datetime(2018, 5, 12),
...             datetime(2019, 5, 12),
...         ],  # note record date: May 12th (sorted!)
...         "population": [82.19, 82.66, 83.12, 83.52],
...     }
... ).lazy()
>>> population.join_asof(
...     gdp, left_on="date", right_on="date", strategy="backward"
... ).collect()
shape: (4, 3)
┌─────────────────────┬────────────┬──────┐
│ date                ┆ population ┆ gdp  │
│ ---                 ┆ ---        ┆ ---  │
│ datetime[μs]        ┆ f64        ┆ i64  │
╞═════════════════════╪════════════╪══════╡
│ 2016-05-12 00:00:00 ┆ 82.19      ┆ 4164 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2017-05-12 00:00:00 ┆ 82.66      ┆ 4411 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2018-05-12 00:00:00 ┆ 83.12      ┆ 4566 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2019-05-12 00:00:00 ┆ 83.52      ┆ 4696 │
└─────────────────────┴────────────┴──────┘
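
The tolerance argument restricts how far apart matched keys may be. A minimal sketch (output omitted; the "120d" bound is purely illustrative): rows whose nearest earlier gdp date lies more than 120 days back would get a null gdp value.

>>> population.join_asof(
...     gdp, on="date", strategy="backward", tolerance="120d"
... ).collect()  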
last() LDF[source]

Get the last row of the DataFrame.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... ).lazy()
>>> df.last().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 5   ┆ 6   │
└─────┴─────┘
lazy() LDF[source]

Return lazy representation, i.e. itself.

Useful for writing code that expects either a DataFrame or LazyFrame.

Returns:
LazyFrame

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [None, 2, 3, 4],
...         "b": [0.5, None, 2.5, 13],
...         "c": [True, True, False, None],
...     }
... )
>>> df.lazy()  
<polars.LazyFrame object at ...>
limit(n: int = 5) LDF[source]

Get the first n rows.

Alias for LazyFrame.head().

Parameters:
n

Number of rows to return.

Notes

Consider using the fetch() operation if you only want to test your query. The fetch() operation will load the first n rows at the scan level, whereas the head()/limit() are applied at the end.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 4, 5, 6],
...         "b": [7, 8, 9, 10, 11, 12],
...     }
... ).lazy()
>>> df.limit().collect()
shape: (5, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 8   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 9   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 10  │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 5   ┆ 11  │
└─────┴─────┘
>>> df.limit(2).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 7   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 8   │
└─────┴─────┘
map(f: Callable[[DataFrame], DataFrame], predicate_pushdown: bool = True, projection_pushdown: bool = True, slice_pushdown: bool = True, no_optimizations: bool = False, schema: Union[None, Dict[str, Union[Type[DataType], DataType]]] = None, validate_output_schema: bool = True) LDF[source]

Apply a custom function.

It is important that the function returns a Polars DataFrame.

Parameters:
f

Lambda/function to apply.

predicate_pushdown

Allow predicate pushdown optimization to pass this node.

projection_pushdown

Allow projection pushdown optimization to pass this node.

slice_pushdown

Allow slice pushdown optimization to pass this node.

no_optimizations

Turn off all optimizations past this point.

schema

Output schema of the function. If set to None, the schema is assumed to remain unchanged by the applied function.

validate_output_schema

It is paramount that polars’ schema is correct. This flag ensures that the output schema of this function is checked against the expected schema. Setting this to False skips that check, but may lead to hard-to-debug bugs.

Warning

The schema of a LazyFrame must always be correct. It is up to the caller of this function to ensure that this invariant is upheld.

It is important that the optimization flags are correct. If, for instance, the custom function aggregates a column, predicate_pushdown should not be allowed, as predicate pushdown prunes rows and would influence your aggregation results.

Examples

>>> df = pl.DataFrame({"a": [1, 2], "b": [3, 4]}).lazy()
>>> df.map(lambda x: 2 * x).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 6   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 8   │
└─────┴─────┘
max() LDF[source]

Aggregate the columns in the DataFrame to their maximum value.

Examples

>>> df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, 1, 1]}).lazy()
>>> df.max().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 4   ┆ 2   │
└─────┴─────┘
mean() LDF[source]

Aggregate the columns in the DataFrame to their mean value.

Examples

>>> df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, 1, 1]}).lazy()
>>> df.mean().collect()
shape: (1, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ f64 ┆ f64  │
╞═════╪══════╡
│ 2.5 ┆ 1.25 │
└─────┴──────┘
median() LDF[source]

Aggregate the columns in the DataFrame to their median value.

Examples

>>> df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, 1, 1]}).lazy()
>>> df.median().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 2.5 ┆ 1.0 │
└─────┴─────┘
melt(id_vars: str | list[str] | None = None, value_vars: str | list[str] | None = None, variable_name: str | None = None, value_name: str | None = None) LDF[source]

Unpivot a DataFrame from wide to long format.

Optionally leaves identifiers set.

This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.

Parameters:
id_vars

Columns to use as identifier variables.

value_vars

Columns to use as value variables. If value_vars is empty, all columns that are not in id_vars will be used.

variable_name

Name to give to the variable column. Defaults to “variable”.

value_name

Name to give to the value column. Defaults to “value”.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": ["x", "y", "z"],
...         "b": [1, 3, 5],
...         "c": [2, 4, 6],
...     }
... ).lazy()
>>> df.melt(id_vars="a", value_vars=["b", "c"]).collect()
shape: (6, 3)
┌─────┬──────────┬───────┐
│ a   ┆ variable ┆ value │
│ --- ┆ ---      ┆ ---   │
│ str ┆ str      ┆ i64   │
╞═════╪══════════╪═══════╡
│ x   ┆ b        ┆ 1     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ y   ┆ b        ┆ 3     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ z   ┆ b        ┆ 5     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ x   ┆ c        ┆ 2     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ y   ┆ c        ┆ 4     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ z   ┆ c        ┆ 6     │
└─────┴──────────┴───────┘
min() LDF[source]

Aggregate the columns in the DataFrame to their minimum value.

Examples

>>> df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, 1, 1]}).lazy()
>>> df.min().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 1   │
└─────┴─────┘
pipe(func: Callable[[...], Any], *args: Any, **kwargs: Any) Any[source]

Offers a structured way to apply a sequence of user-defined functions (UDFs).

Parameters:
func

Callable; will receive the frame as the first parameter, followed by any given args/kwargs.

args

Arguments to pass to the UDF.

kwargs

Keyword arguments to pass to the UDF.

Examples

>>> def cast_str_to_int(data, col_name):
...     return data.with_column(pl.col(col_name).cast(pl.Int64))
...
>>> df = pl.DataFrame({"a": [1, 2, 3, 4], "b": ["10", "20", "30", "40"]}).lazy()
>>> df.pipe(cast_str_to_int, col_name="b").collect()
shape: (4, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 10  │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 20  │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 30  │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 40  │
└─────┴─────┘
>>> df = pl.DataFrame({"b": [1, 2], "a": [3, 4]})
>>> df
shape: (2, 2)
┌─────┬─────┐
│ b   ┆ a   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 4   │
└─────┴─────┘
>>> df.lazy().pipe(lambda tdf: tdf.select(sorted(tdf.columns))).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 3   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 2   │
└─────┴─────┘
profile(type_coercion: bool = True, predicate_pushdown: bool = True, projection_pushdown: bool = True, simplify_expression: bool = True, no_optimization: bool = False, slice_pushdown: bool = True, common_subplan_elimination: bool = True, show_plot: bool = False, truncate_nodes: int = 0, figsize: tuple[int, int] = (18, 8), allow_streaming: bool = False) tuple[DataFrame, DataFrame][source]

Profile a LazyFrame.

This will run the query and return a tuple containing the materialized DataFrame and a DataFrame that contains profiling information of each node that is executed.

The units of the timings are microseconds.

Parameters:
type_coercion

Do type coercion optimization.

predicate_pushdown

Do predicate pushdown optimization.

projection_pushdown

Do projection pushdown optimization.

simplify_expression

Run simplify expressions optimization.

no_optimization

Turn off (certain) optimizations.

slice_pushdown

Slice pushdown optimization.

common_subplan_elimination

Will try to cache branching subplans that occur on self-joins or unions.

show_plot

Show a Gantt chart of the profiling result.

truncate_nodes

Truncate the label lengths in the gantt chart to this number of characters.

figsize

matplotlib figsize of the profiling plot.

allow_streaming

Run parts of the query in a streaming fashion (this is in an alpha state)

Returns:
tuple[DataFrame, DataFrame]

Examples

>>> df = pl.DataFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... ).lazy()
>>> df.groupby("a", maintain_order=True).agg(pl.all().sum()).sort(
...     "a"
... ).profile()  
(shape: (3, 3)
 ┌─────┬─────┬─────┐
 │ a   ┆ b   ┆ c   │
 │ --- ┆ --- ┆ --- │
 │ str ┆ i64 ┆ i64 │
 ╞═════╪═════╪═════╡
 │ a   ┆ 4   ┆ 10  │
 ├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
 │ b   ┆ 11  ┆ 10  │
 ├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
 │ c   ┆ 6   ┆ 1   │
 └─────┴─────┴─────┘,
 shape: (3, 3)
 ┌────────────────────────┬───────┬──────┐
 │ node                   ┆ start ┆ end  │
 │ ---                    ┆ ---   ┆ ---  │
 │ str                    ┆ u64   ┆ u64  │
 ╞════════════════════════╪═══════╪══════╡
 │ optimization           ┆ 0     ┆ 5    │
 ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
 │ groupby_partitioned(a) ┆ 5     ┆ 470  │
 ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
 │ sort(a)                ┆ 475   ┆ 1964 │
 └────────────────────────┴───────┴──────┘)
quantile(quantile: float, interpolation: RollingInterpolationMethod = 'nearest') LDF[source]

Aggregate the columns in the DataFrame to their quantile value.

Parameters:
quantile

Quantile between 0.0 and 1.0.

interpolation{‘nearest’, ‘higher’, ‘lower’, ‘midpoint’, ‘linear’}

Interpolation method.

Examples

>>> df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, 1, 1]}).lazy()
>>> df.quantile(0.7).collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════╪═════╡
│ 3.0 ┆ 1.0 │
└─────┴─────┘
classmethod read_json(file: str | pathlib.Path | io.IOBase) LazyFrame[source]

Read a logical plan from a JSON file to construct a LazyFrame.

Parameters:
file

Path to a file or a file-like object.
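
Examples

A minimal sketch, assuming the plan was previously serialized with write_json() (the path below is illustrative):

>>> lf = pl.DataFrame({"a": [1, 2, 3]}).lazy().select(pl.col("a") + 1)
>>> lf.write_json("plan.json")  # hypothetical path
>>> pl.LazyFrame.read_json("plan.json").collect()  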

rename(mapping: dict[str, str]) LDF[source]

Rename column names.

Parameters:
mapping

Key value pairs that map from old name to new name.

Examples

>>> df = pl.DataFrame(
...     {"foo": [1, 2, 3], "bar": [6, 7, 8], "ham": ["a", "b", "c"]}
... ).lazy()
>>> df.rename({"foo": "apple"}).collect()
shape: (3, 3)
┌───────┬─────┬─────┐
│ apple ┆ bar ┆ ham │
│ ---   ┆ --- ┆ --- │
│ i64   ┆ i64 ┆ str │
╞═══════╪═════╪═════╡
│ 1     ┆ 6   ┆ a   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2     ┆ 7   ┆ b   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3     ┆ 8   ┆ c   │
└───────┴─────┴─────┘
reverse() LDF[source]

Reverse the DataFrame.

Examples

>>> df = pl.DataFrame(
...     {
...         "key": ["a", "b", "c"],
...         "val": [1, 2, 3],
...     }
... ).lazy()
>>> df.reverse().collect()
shape: (3, 2)
┌─────┬─────┐
│ key ┆ val │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ c   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ b   ┆ 2   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ a   ┆ 1   │
└─────┴─────┘
property schema: Dict[str, Union[Type[DataType], DataType]][source]

Get a dict[column name, DataType].

Examples

>>> lf = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6.0, 7.0, 8.0],
...         "ham": ["a", "b", "c"],
...     }
... ).lazy()
>>> lf.schema
{'foo': <class 'polars.datatypes.Int64'>, 'bar': <class 'polars.datatypes.Float64'>, 'ham': <class 'polars.datatypes.Utf8'>}
select(exprs: Union[str, Expr, Series, Sequence[str | Expr | Series | WhenThen | WhenThenThen]]) LDF[source]

Select columns from this DataFrame.

Parameters:
exprs

Column or columns to select.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... ).lazy()
>>> df.select("foo").collect()
shape: (3, 1)
┌─────┐
│ foo │
│ --- │
│ i64 │
╞═════╡
│ 1   │
├╌╌╌╌╌┤
│ 2   │
├╌╌╌╌╌┤
│ 3   │
└─────┘
>>> df.select(["foo", "bar"]).collect()
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 6   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 7   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 8   │
└─────┴─────┘
>>> df.select(pl.col("foo") + 1).collect()
shape: (3, 1)
┌─────┐
│ foo │
│ --- │
│ i64 │
╞═════╡
│ 2   │
├╌╌╌╌╌┤
│ 3   │
├╌╌╌╌╌┤
│ 4   │
└─────┘
>>> df.select([pl.col("foo") + 1, pl.col("bar") + 1]).collect()
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 7   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 8   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 9   │
└─────┴─────┘
>>> df.select(pl.when(pl.col("foo") > 2).then(10).otherwise(0)).collect()
shape: (3, 1)
┌─────────┐
│ literal │
│ ---     │
│ i64     │
╞═════════╡
│ 0       │
├╌╌╌╌╌╌╌╌╌┤
│ 0       │
├╌╌╌╌╌╌╌╌╌┤
│ 10      │
└─────────┘
shift(periods: int) LDF[source]

Shift the values by a given period.

Parameters:
periods

Number of places to shift (may be negative).

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... ).lazy()
>>> df.shift(periods=1).collect()
shape: (3, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ null ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1    ┆ 2    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3    ┆ 4    │
└──────┴──────┘
>>> df.shift(periods=-1).collect()
shape: (3, 2)
┌──────┬──────┐
│ a    ┆ b    │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 3    ┆ 4    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 5    ┆ 6    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ null │
└──────┴──────┘
shift_and_fill(periods: int, fill_value: Expr | int | str | float) LDF[source]

Shift the values by a given period and fill the resulting null values.

Parameters:
periods

Number of places to shift (may be negative).

fill_value

Fill null values with the result of this expression.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... ).lazy()
>>> df.shift_and_fill(periods=1, fill_value=0).collect()
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 0   ┆ 0   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 2   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 4   │
└─────┴─────┘
>>> df.shift_and_fill(periods=-1, fill_value=0).collect()
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 3   ┆ 4   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 5   ┆ 6   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 0   ┆ 0   │
└─────┴─────┘
show_graph(optimized: bool = True, *, show: bool = True, output_path: str | None = None, raw_output: bool = False, figsize: tuple[float, float] = (16.0, 12.0), type_coercion: bool = True, predicate_pushdown: bool = True, projection_pushdown: bool = True, simplify_expression: bool = True, slice_pushdown: bool = True, common_subplan_elimination: bool = True, streaming: bool = False) str | None[source]

Show a plot of the query plan. Note that you should have graphviz installed.

Parameters:
optimized

Optimize the query plan.

show

Show the figure.

output_path

Write the figure to disk.

raw_output

Return dot syntax. This cannot be combined with show and/or output_path.

figsize

Passed to matplotlib if show == True.

type_coercion

Do type coercion optimization.

predicate_pushdown

Do predicate pushdown optimization.

projection_pushdown

Do projection pushdown optimization.

simplify_expression

Run simplify expressions optimization.

slice_pushdown

Do slice pushdown optimization.

common_subplan_elimination

Will try to cache branching subplans that occur on self-joins or unions.

streaming

Run parts of the query in a streaming fashion (this is in an alpha state).

Examples

>>> df = pl.DataFrame(
...     {
...         "a": ["a", "b", "a", "b", "b", "c"],
...         "b": [1, 2, 3, 4, 5, 6],
...         "c": [6, 5, 4, 3, 2, 1],
...     }
... ).lazy()
>>> df.groupby("a", maintain_order=True).agg(pl.all().sum()).sort(
...     "a"
... ).show_graph()  
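
If only the underlying dot source is needed, raw_output=True returns it as a string instead of rendering; a minimal sketch (show=False is passed explicitly, since raw_output cannot be combined with show):

>>> dot = (
...     df.groupby("a", maintain_order=True)
...     .agg(pl.all().sum())
...     .show_graph(raw_output=True, show=False)
... )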
slice(offset: int, length: int | None = None) LDF[source]

Get a slice of this DataFrame.

Parameters:
offset

Start index. Negative indexing is supported.

length

Length of the slice. If set to None, all rows starting at the offset will be selected.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": ["x", "y", "z"],
...         "b": [1, 3, 5],
...         "c": [2, 4, 6],
...     }
... ).lazy()
>>>
>>> df.slice(1, 2).collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ y   ┆ 3   ┆ 4   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ z   ┆ 5   ┆ 6   │
└─────┴─────┴─────┘
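
Since negative indexing is supported, a slice counting back from the end can be sketched as:

>>> df.slice(-2).collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ y   ┆ 3   ┆ 4   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ z   ┆ 5   ┆ 6   │
└─────┴─────┴─────┘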
sort(by: Union[str, Sequence[str], Expr, Sequence[Expr], Sequence[str | Expr]], reverse: Union[bool, Sequence[bool]] = False, nulls_last: bool = False) LDF[source]

Sort the DataFrame.

Sorting can be done by:

  • A single column name

  • An expression

  • Multiple expressions

Parameters:
by

Column(s) or expression(s) to sort by.

reverse

Sort in descending order.

nulls_last

Place null values last. Can only be used if sorted by a single column.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3, None],
...         "bar": [6.0, 7.0, 8.0, 9.0],
...         "ham": ["a", "b", "c", "d"],
...     }
... ).lazy()
>>> df.sort("foo").collect()
shape: (4, 3)
┌──────┬─────┬─────┐
│ foo  ┆ bar ┆ ham │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ null ┆ 9.0 ┆ d   │
├╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1    ┆ 6.0 ┆ a   │
├╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2    ┆ 7.0 ┆ b   │
├╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3    ┆ 8.0 ┆ c   │
└──────┴─────┴─────┘
>>> df.sort("foo", nulls_last=True).collect()
shape: (4, 3)
┌──────┬─────┬─────┐
│ foo  ┆ bar ┆ ham │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ 1    ┆ 6.0 ┆ a   │
├╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2    ┆ 7.0 ┆ b   │
├╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3    ┆ 8.0 ┆ c   │
├╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ null ┆ 9.0 ┆ d   │
└──────┴─────┴─────┘
>>> df.sort("foo", reverse=True).collect()
shape: (4, 3)
┌──────┬─────┬─────┐
│ foo  ┆ bar ┆ ham │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ 3    ┆ 8.0 ┆ c   │
├╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2    ┆ 7.0 ┆ b   │
├╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1    ┆ 6.0 ┆ a   │
├╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ null ┆ 9.0 ┆ d   │
└──────┴─────┴─────┘

Sort by multiple columns; when doing so, expression syntax can also be used.

>>> df.sort(
...     [pl.col("foo"), pl.col("bar") ** 2],
...     reverse=[True, False],
... ).collect()
shape: (4, 3)
┌──────┬─────┬─────┐
│ foo  ┆ bar ┆ ham │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ f64 ┆ str │
╞══════╪═════╪═════╡
│ 3    ┆ 8.0 ┆ c   │
├╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2    ┆ 7.0 ┆ b   │
├╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1    ┆ 6.0 ┆ a   │
├╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ null ┆ 9.0 ┆ d   │
└──────┴─────┴─────┘
std(ddof: int = 1) LDF[source]

Aggregate the columns in the DataFrame to their standard deviation value.

Parameters:
ddof

"Delta Degrees of Freedom": the divisor used in the calculation is N - ddof, where N is the number of elements. Defaults to 1.

Examples

>>> df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, 1, 1]}).lazy()
>>> df.std().collect()
shape: (1, 2)
┌──────────┬─────┐
│ a        ┆ b   │
│ ---      ┆ --- │
│ f64      ┆ f64 │
╞══════════╪═════╡
│ 1.290994 ┆ 0.5 │
└──────────┴─────┘
>>> df.std(ddof=0).collect()
shape: (1, 2)
┌──────────┬──────────┐
│ a        ┆ b        │
│ ---      ┆ ---      │
│ f64      ┆ f64      │
╞══════════╪══════════╡
│ 1.118034 ┆ 0.433013 │
└──────────┴──────────┘
sum() LDF[source]

Aggregate the columns in the DataFrame to their sum value.

Examples

>>> df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, 1, 1]}).lazy()
>>> df.sum().collect()
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 10  ┆ 5   │
└─────┴─────┘
tail(n: int = 5) LDF[source]

Get the last n rows.

Parameters:
n

Number of rows.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 4, 5, 6],
...         "b": [7, 8, 9, 10, 11, 12],
...     }
... ).lazy()
>>> df.tail().collect()
shape: (5, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 2   ┆ 8   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 9   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 10  │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 5   ┆ 11  │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 6   ┆ 12  │
└─────┴─────┘
>>> df.tail(2).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 5   ┆ 11  │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 6   ┆ 12  │
└─────┴─────┘
take_every(n: int) LDF[source]

Take every nth row in the LazyFrame and return as a new LazyFrame.

Parameters:
n

Gather every n-th row.

Examples

>>> s = pl.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]}).lazy()
>>> s.take_every(2).collect()
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 5   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 7   │
└─────┴─────┘
unique(maintain_order: bool = True, subset: str | list[str] | None = None, keep: UniqueKeepStrategy = 'first') LDF[source]

Drop duplicate rows from this DataFrame.

Note that this fails if there is a column of type List in the DataFrame or subset.

Parameters:
maintain_order

Keep the same order as the original DataFrame. This requires more work to compute.

subset

Subset to use to compare rows.

keep : {'first', 'last'}

Which of the duplicate rows to keep.

Returns:
LazyFrame with unique rows.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3, 1],
...         "bar": ["a", "a", "a", "a"],
...         "ham": ["b", "b", "b", "b"],
...     }
... ).lazy()
>>> df.unique().collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ a   ┆ b   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ a   ┆ b   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ a   ┆ b   │
└─────┴─────┴─────┘
>>> df.unique(subset=["bar", "ham"]).collect()
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ a   ┆ b   │
└─────┴─────┴─────┘
>>> df.unique(keep="last").collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 2   ┆ a   ┆ b   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ a   ┆ b   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ a   ┆ b   │
└─────┴─────┴─────┘
unnest(names: str | list[str]) LDF[source]

Decompose a struct into its fields.

The fields will be inserted into the DataFrame on the location of the struct type.

Parameters:
names

Names of the struct columns that will be decomposed into their fields.

Examples

>>> df = (
...     pl.DataFrame(
...         {
...             "before": ["foo", "bar"],
...             "t_a": [1, 2],
...             "t_b": ["a", "b"],
...             "t_c": [True, None],
...             "t_d": [[1, 2], [3]],
...             "after": ["baz", "womp"],
...         }
...     )
...     .lazy()
...     .select(
...         ["before", pl.struct(pl.col("^t_.$")).alias("t_struct"), "after"]
...     )
... )
>>> df.fetch()
shape: (2, 3)
┌────────┬─────────────────────┬───────┐
│ before ┆ t_struct            ┆ after │
│ ---    ┆ ---                 ┆ ---   │
│ str    ┆ struct[4]           ┆ str   │
╞════════╪═════════════════════╪═══════╡
│ foo    ┆ {1,"a",true,[1, 2]} ┆ baz   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ bar    ┆ {2,"b",null,[3]}    ┆ womp  │
└────────┴─────────────────────┴───────┘
>>> df.unnest("t_struct").fetch()
shape: (2, 6)
┌────────┬─────┬─────┬──────┬───────────┬───────┐
│ before ┆ t_a ┆ t_b ┆ t_c  ┆ t_d       ┆ after │
│ ---    ┆ --- ┆ --- ┆ ---  ┆ ---       ┆ ---   │
│ str    ┆ i64 ┆ str ┆ bool ┆ list[i64] ┆ str   │
╞════════╪═════╪═════╪══════╪═══════════╪═══════╡
│ foo    ┆ 1   ┆ a   ┆ true ┆ [1, 2]    ┆ baz   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ bar    ┆ 2   ┆ b   ┆ null ┆ [3]       ┆ womp  │
└────────┴─────┴─────┴──────┴───────────┴───────┘
var(ddof: int = 1) LDF[source]

Aggregate the columns in the DataFrame to their variance value.

Parameters:
ddof

"Delta Degrees of Freedom": the divisor used in the calculation is N - ddof, where N is the number of elements. Defaults to 1.

Examples

>>> df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, 1, 1]}).lazy()
>>> df.var().collect()
shape: (1, 2)
┌──────────┬──────┐
│ a        ┆ b    │
│ ---      ┆ ---  │
│ f64      ┆ f64  │
╞══════════╪══════╡
│ 1.666667 ┆ 0.25 │
└──────────┴──────┘
>>> df.var(ddof=0).collect()
shape: (1, 2)
┌──────┬────────┐
│ a    ┆ b      │
│ ---  ┆ ---    │
│ f64  ┆ f64    │
╞══════╪════════╡
│ 1.25 ┆ 0.1875 │
└──────┴────────┘
property width: int[source]

Get the width of the LazyFrame.

Examples

>>> lf = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}).lazy()
>>> lf.width
2
with_column(column: Series | Expr) LDF[source]

Add or overwrite column in a DataFrame.

Parameters:
column

Expression that evaluates to a column, or a Series to use.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... ).lazy()
>>> df.with_column((pl.col("b") ** 2).alias("b_squared")).collect()  # added
shape: (3, 3)
┌─────┬─────┬───────────┐
│ a   ┆ b   ┆ b_squared │
│ --- ┆ --- ┆ ---       │
│ i64 ┆ i64 ┆ f64       │
╞═════╪═════╪═══════════╡
│ 1   ┆ 2   ┆ 4.0       │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ 4   ┆ 16.0      │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 5   ┆ 6   ┆ 36.0      │
└─────┴─────┴───────────┘
>>> df.with_column(pl.col("a") ** 2).collect()  # replaced
shape: (3, 2)
┌──────┬─────┐
│ a    ┆ b   │
│ ---  ┆ --- │
│ f64  ┆ i64 │
╞══════╪═════╡
│ 1.0  ┆ 2   │
├╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 9.0  ┆ 4   │
├╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 25.0 ┆ 6   │
└──────┴─────┘
>>> df.with_column(pl.Series("c", [7, 8, 9])).collect()  # add from a Series
shape: (3, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 7   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 4   ┆ 8   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 5   ┆ 6   ┆ 9   │
└─────┴─────┴─────┘
with_columns(exprs: Optional[Union[Expr, Series, Sequence[Expr | Series]]] = None, **named_exprs: Expr | Series | str) LDF[source]

Add or overwrite multiple columns in a DataFrame.

Parameters:
exprs

List of Expressions that evaluate to columns.

**named_exprs

Named column Expressions, provided as kwargs.

Examples

>>> ldf = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 4],
...         "b": [0.5, 4, 10, 13],
...         "c": [True, True, False, True],
...     }
... ).lazy()
>>> ldf.with_columns(
...     [
...         (pl.col("a") ** 2).alias("a^2"),
...         (pl.col("b") / 2).alias("b/2"),
...         (pl.col("c").is_not()).alias("not c"),
...     ]
... ).collect()
shape: (4, 6)
┌─────┬──────┬───────┬──────┬──────┬───────┐
│ a   ┆ b    ┆ c     ┆ a^2  ┆ b/2  ┆ not c │
│ --- ┆ ---  ┆ ---   ┆ ---  ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ bool  ┆ f64  ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╪══════╪══════╪═══════╡
│ 1   ┆ 0.5  ┆ true  ┆ 1.0  ┆ 0.25 ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ 4.0  ┆ true  ┆ 4.0  ┆ 2.0  ┆ false │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3   ┆ 10.0 ┆ false ┆ 9.0  ┆ 5.0  ┆ true  │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4   ┆ 13.0 ┆ true  ┆ 16.0 ┆ 6.5  ┆ false │
└─────┴──────┴───────┴──────┴──────┴───────┘
>>> # Support for kwarg expressions is considered EXPERIMENTAL.
>>> # Currently requires opt-in via `pl.Config` boolean flag:
>>>
>>> pl.Config.with_columns_kwargs = True
>>> ldf.with_columns(
...     d=pl.col("a") * pl.col("b"),
...     e=pl.col("c").is_not(),
...     f="foo",
... ).collect()
shape: (4, 6)
┌─────┬──────┬───────┬──────┬───────┬─────┐
│ a   ┆ b    ┆ c     ┆ d    ┆ e     ┆ f   │
│ --- ┆ ---  ┆ ---   ┆ ---  ┆ ---   ┆ --- │
│ i64 ┆ f64  ┆ bool  ┆ f64  ┆ bool  ┆ str │
╞═════╪══════╪═══════╪══════╪═══════╪═════╡
│ 1   ┆ 0.5  ┆ true  ┆ 0.5  ┆ false ┆ foo │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 4.0  ┆ true  ┆ 8.0  ┆ false ┆ foo │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 10.0 ┆ false ┆ 30.0 ┆ true  ┆ foo │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 13.0 ┆ true  ┆ 52.0 ┆ false ┆ foo │
└─────┴──────┴───────┴──────┴───────┴─────┘
with_context(other: LDF | list[LDF]) LDF[source]

Add an external context to the computation graph.

This allows expressions to also access columns from DataFrames that are not part of this one.

Parameters:
other

Lazy DataFrame(s) to use as external context.

Examples

>>> df_a = pl.DataFrame({"a": [1, 2, 3], "b": ["a", "c", None]}).lazy()
>>> df_other = pl.DataFrame({"c": ["foo", "ham"]})
>>> (
...     df_a.with_context(df_other.lazy()).select(
...         [pl.col("b") + pl.col("c").first()]
...     )
... ).collect()
shape: (3, 1)
┌──────┐
│ b    │
│ ---  │
│ str  │
╞══════╡
│ afoo │
├╌╌╌╌╌╌┤
│ cfoo │
├╌╌╌╌╌╌┤
│ null │
└──────┘

Fill nulls with the median from another DataFrame:

>>> train_df = pl.DataFrame(
...     {"feature_0": [-1.0, 0, 1], "feature_1": [-1.0, 0, 1]}
... ).lazy()
>>> test_df = pl.DataFrame(
...     {"feature_0": [-1.0, None, 1], "feature_1": [-1.0, 0, 1]}
... ).lazy()
>>> (
...     test_df.with_context(train_df.select(pl.all().suffix("_train"))).select(
...         pl.col("feature_0").fill_null(pl.col("feature_0_train").median())
...     )
... ).collect()
shape: (3, 1)
┌───────────┐
│ feature_0 │
│ ---       │
│ f64       │
╞═══════════╡
│ -1.0      │
├╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.0       │
├╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.0       │
└───────────┘

with_row_count(name: str = 'row_nr', offset: int = 0) LDF[source]

Add a column at index 0 that counts the rows.

Parameters:
name

Name of the column to add.

offset

Start the row count at this offset.

Warning

This can have a negative effect on query performance. This may, for instance, block predicate pushdown optimization.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 3, 5],
...         "b": [2, 4, 6],
...     }
... ).lazy()
>>> df.with_row_count().collect()
shape: (3, 3)
┌────────┬─────┬─────┐
│ row_nr ┆ a   ┆ b   │
│ ---    ┆ --- ┆ --- │
│ u32    ┆ i64 ┆ i64 │
╞════════╪═════╪═════╡
│ 0      ┆ 1   ┆ 2   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1      ┆ 3   ┆ 4   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2      ┆ 5   ┆ 6   │
└────────┴─────┴─────┘
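
Both parameters can be combined; a small sketch with a custom column name (the name "id" is purely illustrative) and a count starting at 1:

>>> df.with_row_count(name="id", offset=1).collect()
shape: (3, 3)
┌─────┬─────┬─────┐
│ id  ┆ a   ┆ b   │
│ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 2   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 3   ┆ 4   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3   ┆ 5   ┆ 6   │
└─────┴─────┴─────┘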
write_json(file: None = None, *, to_string: bool | None = None) str[source]
write_json(file: io.IOBase | str | pathlib.Path, *, to_string: bool | None = None) None

Write the logical plan of this LazyFrame to a file or string in JSON format.

Parameters:
file

File path to which the result should be written. If set to None (default), the output is returned as a string instead.

to_string

Deprecated. If set to True, the file argument is ignored and the output is returned as a string.

Examples

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...     }
... ).lazy()
>>> df.write_json()
'{"DataFrameScan":{"df":{"columns":[{"name":"foo","datatype":"Int64","values":[1,2,3]},{"name":"bar","datatype":"Int64","values":[6,7,8]}]},"schema":{"inner":{"foo":"Int64","bar":"Int64"}},"output_schema":null,"projection":null,"selection":null}}'