polars.internals.dataframe.groupby.GroupBy.apply

GroupBy.apply(f: Callable[[DataFrame], DataFrame]) → DF

Apply a custom/user-defined function (UDF) to each group, passing the group in as a sub-DataFrame.

Implementing logic using a Python function is almost always _significantly_ slower and more memory intensive than implementing the same logic using the native expression API because:

  • The native expression engine runs in Rust; UDFs run in Python.

  • Use of Python UDFs forces the DataFrame to be materialised in memory.

  • Polars-native expressions can be parallelised (UDFs cannot).

  • Polars-native expressions can be logically optimised (UDFs cannot).

Wherever possible you should strongly prefer the native expression API to achieve the best performance.

Parameters:
f

Custom function. It is called once per group with that group's rows as a DataFrame and must return a DataFrame.

Returns:
DataFrame
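
The contract of `f` can be pictured as split-apply-combine. The following is a minimal plain-Python sketch of that idea, not how Polars implements it: lists of row dicts stand in for DataFrames, and `apply_per_group` and `first_two` are hypothetical names used only for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Row = Dict[str, object]
Frame = List[Row]  # stand-in for a DataFrame: a list of row dicts


def apply_per_group(rows: Frame, by: str, f: Callable[[Frame], Frame]) -> Frame:
    """Split rows by the `by` key, call `f` on each group, concatenate the results."""
    groups: Dict[object, Frame] = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row)  # split
    out: Frame = []
    for group in groups.values():
        out.extend(f(group))  # apply, then combine
    return out


def first_two(group: Frame) -> Frame:
    # Hypothetical UDF: keep at most the first two rows of each group.
    return group[:2]


rows = [
    {"id": 0, "color": "red"},
    {"id": 1, "color": "green"},
    {"id": 2, "color": "green"},
    {"id": 3, "color": "red"},
    {"id": 4, "color": "red"},
]
result = apply_per_group(rows, "color", first_two)
print([r["id"] for r in result])
```

In this model `f` is invoked once per group from Python, so the cost grows with the number of groups. That is one reason the native expression API is preferred where it can express the same logic.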

Examples

>>> df = pl.DataFrame(
...     {
...         "id": [0, 1, 2, 3, 4],
...         "color": ["red", "green", "green", "red", "red"],
...         "shape": ["square", "triangle", "square", "triangle", "square"],
...     }
... )
>>> df
shape: (5, 3)
┌─────┬───────┬──────────┐
│ id  ┆ color ┆ shape    │
│ --- ┆ ---   ┆ ---      │
│ i64 ┆ str   ┆ str      │
╞═════╪═══════╪══════════╡
│ 0   ┆ red   ┆ square   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ green ┆ triangle │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ green ┆ square   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ red   ┆ triangle │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ red   ┆ square   │
└─────┴───────┴──────────┘

For each color group, sample two rows:

>>> (
...     df.groupby("color").apply(lambda group_df: group_df.sample(2))
... )
shape: (4, 3)
┌─────┬───────┬──────────┐
│ id  ┆ color ┆ shape    │
│ --- ┆ ---   ┆ ---      │
│ i64 ┆ str   ┆ str      │
╞═════╪═══════╪══════════╡
│ 1   ┆ green ┆ triangle │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ green ┆ square   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ red   ┆ square   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ red   ┆ triangle │
└─────┴───────┴──────────┘

The same sampling is better implemented with a native expression:

>>> (
...     df.filter(pl.arange(0, pl.count()).shuffle().over("color") < 2)
... )