polars.lazyframe.groupby.LazyGroupBy.apply
- LazyGroupBy.apply(function: Callable[[DataFrame], DataFrame], schema: SchemaDict | None) → LDF
Apply a custom/user-defined function (UDF) over the groups as a new DataFrame.
Implementing logic using a Python function is almost always _significantly_ slower and more memory-intensive than implementing the same logic using the native expression API, because:
- The native expression engine runs in Rust; UDFs run in Python.
- Use of Python UDFs forces the DataFrame to be materialized in memory.
- Polars-native expressions can be parallelised (UDFs cannot).
- Polars-native expressions can be logically optimised (UDFs cannot).
Wherever possible you should strongly prefer the native expression API to achieve the best performance.
- Parameters:
- function
Function to apply over each group of the LazyFrame.
- schema
Schema of the function's output. This has to be known statically. If the given schema is incorrect, this is a bug in the caller's query and may lead to errors. If set to None, polars assumes the schema is unchanged.
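As a mental model of what "apply a function over the groups" means, the following is a minimal pure-Python sketch of the split-apply-combine pattern. It is not how Polars implements this method: `group_apply` is a hypothetical helper, rows are modelled as plain dicts rather than DataFrames, and taking the first two rows is a deterministic stand-in for `group_df.sample(2)`.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Rows = List[Dict[str, Any]]


def group_apply(rows: Rows, key: str, function: Callable[[Rows], Rows]) -> Rows:
    """Conceptual split-apply-combine (hypothetical helper, not Polars code)."""
    # Split: collect the rows of each group, keyed by the grouping column.
    groups: Dict[Any, Rows] = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    # Apply and combine: call the UDF once per group and concatenate the results.
    combined: Rows = []
    for group_rows in groups.values():
        combined.extend(function(group_rows))
    return combined


rows = [
    {"id": 0, "color": "red", "shape": "square"},
    {"id": 1, "color": "green", "shape": "triangle"},
    {"id": 2, "color": "green", "shape": "square"},
    {"id": 3, "color": "red", "shape": "triangle"},
    {"id": 4, "color": "red", "shape": "square"},
]

# The UDF keeps the same columns, which corresponds to the schema=None case.
result = group_apply(rows, "color", lambda group_rows: group_rows[:2])
```

In this sketch the function returns rows with the same keys it received, mirroring the assumption made when `schema` is None. A function that adds, drops, or retypes columns would be the case where an explicit `schema` has to be supplied.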
Examples
>>> df = pl.DataFrame(
...     {
...         "id": [0, 1, 2, 3, 4],
...         "color": ["red", "green", "green", "red", "red"],
...         "shape": ["square", "triangle", "square", "triangle", "square"],
...     }
... )
>>> df
shape: (5, 3)
┌─────┬───────┬──────────┐
│ id  ┆ color ┆ shape    │
│ --- ┆ ---   ┆ ---      │
│ i64 ┆ str   ┆ str      │
╞═════╪═══════╪══════════╡
│ 0   ┆ red   ┆ square   │
│ 1   ┆ green ┆ triangle │
│ 2   ┆ green ┆ square   │
│ 3   ┆ red   ┆ triangle │
│ 4   ┆ red   ┆ square   │
└─────┴───────┴──────────┘
For each color group, sample two rows:
>>> (
...     df.lazy()
...     .groupby("color")
...     .apply(lambda group_df: group_df.sample(2), schema=None)
...     .collect()
... )
shape: (4, 3)
┌─────┬───────┬──────────┐
│ id  ┆ color ┆ shape    │
│ --- ┆ ---   ┆ ---      │
│ i64 ┆ str   ┆ str      │
╞═════╪═══════╪══════════╡
│ 1   ┆ green ┆ triangle │
│ 2   ┆ green ┆ square   │
│ 4   ┆ red   ┆ square   │
│ 3   ┆ red   ┆ triangle │
└─────┴───────┴──────────┘
It is better to implement this with an expression:
>>> (
...     df.lazy()
...     .filter(pl.arange(0, pl.count()).shuffle().over("color") < 2)
...     .collect()
... )