polars.internals.lazyframe.groupby.LazyGroupBy.apply

LazyGroupBy.apply(f: Callable[[DataFrame], DataFrame], schema: Optional[Dict[str, Union[Type[DataType], DataType]]]) -> LDF

Apply a custom/user-defined function (UDF) over the groups. Each group is passed to the function as a DataFrame, and the function must return a DataFrame.

Implementing logic using a Python function is almost always _significantly_ slower and more memory intensive than implementing the same logic using the native expression API because:

  • The native expression engine runs in Rust; UDFs run in Python.

  • Use of Python UDFs forces the DataFrame to be materialized in memory.

  • Polars-native expressions can be parallelised (UDFs cannot).

  • Polars-native expressions can be logically optimised (UDFs cannot).

Wherever possible you should strongly prefer the native expression API to achieve the best performance.

Parameters:
f

Function to apply over each group of the LazyFrame.

schema

Schema of the function's output. This has to be known statically. If the schema provided is incorrect, this is a bug in the caller's query and may lead to errors. If None is given, polars assumes the schema is unchanged.

Examples

The function is applied by group.

>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3, 1],
...         "bar": ["a", "b", "c", "c"],
...     }
... )
>>> (
...     df.lazy()
...     .groupby("bar", maintain_order=True)
...     .agg(
...         [
...             pl.col("foo").apply(lambda x: x.sum()),
...         ]
...     )
...     .collect()
... )
shape: (3, 2)
┌─────┬─────┐
│ bar ┆ foo │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ b   ┆ 2   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ c   ┆ 4   │
└─────┴─────┘

It is better to implement this with an expression:

>>> (
...     df.groupby("bar", maintain_order=True).agg(
...         pl.col("foo").sum(),
...     )
... )