polars.Expr.apply

Expr.apply(f: Union[Callable[[Series], Series], Callable[[Any], Any]], return_dtype: Optional[Union[Type[DataType], DataType]] = None) -> Expr

Apply a custom/user-defined function (UDF) in a GroupBy or Projection context.

Depending on the context it has the following behavior:

  • Selection

    Expects f to be of type Callable[[Any], Any]. Applies a Python function to each individual value in the column.

  • GroupBy

    Expects f to be of type Callable[[Series], Series]. Applies a Python function to each group.
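
The two calling conventions above can be modelled in plain Python. This is only a conceptual sketch, not how Polars implements them: the helper names `apply_by_value` and `apply_by_group` are hypothetical, and ordinary lists stand in for a Series.

```python
from typing import Any, Callable, Dict, Hashable, List


def apply_by_value(column: List[Any], f: Callable[[Any], Any]) -> List[Any]:
    """Selection-style call: f receives one value at a time."""
    return [f(value) for value in column]


def apply_by_group(
    keys: List[Hashable],
    column: List[Any],
    f: Callable[[List[Any]], Any],
) -> Dict[Hashable, Any]:
    """GroupBy-style call: f receives all values of one group at once."""
    groups: Dict[Hashable, List[Any]] = {}
    for key, value in zip(keys, column):
        # dicts preserve insertion order, so groups keep first-seen order
        groups.setdefault(key, []).append(value)
    return {key: f(values) for key, values in groups.items()}


a = [1, 2, 3, 1]
b = ["a", "b", "c", "c"]

print(apply_by_value(a, lambda x: x * 2))  # [2, 4, 6, 2]
print(apply_by_group(b, a, sum))           # {'a': 1, 'b': 2, 'c': 4}
```

The per-value form calls the Python function once per row, whereas the per-group form calls it once per group with all of that group's values.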

Implementing logic using a Python function is almost always significantly slower and more memory intensive than implementing the same logic using the native expression API, because:

  • The native expression engine runs in Rust; UDFs run in Python.

  • Use of Python UDFs forces the DataFrame to be materialized in memory.

  • Polars-native expressions can be parallelised (UDFs cannot).

  • Polars-native expressions can be logically optimised (UDFs cannot).

Wherever possible you should strongly prefer the native expression API to achieve the best performance.

Parameters:
f

Lambda or function to apply.

return_dtype

Dtype of the output Series. If not set, Polars will assume that the dtype remains unchanged.

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 1],
...         "b": ["a", "b", "c", "c"],
...     }
... )

In a selection context, the function is applied to each value in the column, row by row:

>>> (
...     df.with_column(
...         pl.col("a").apply(lambda x: x * 2).alias("a_times_2"),
...     )
... )
shape: (4, 3)
┌─────┬─────┬───────────┐
│ a   ┆ b   ┆ a_times_2 │
│ --- ┆ --- ┆ ---       │
│ i64 ┆ str ┆ i64       │
╞═════╪═════╪═══════════╡
│ 1   ┆ a   ┆ 2         │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ b   ┆ 4         │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ c   ┆ 6         │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ c   ┆ 2         │
└─────┴─────┴───────────┘

It is better to implement this with an expression:

>>> (
...     df.with_column(
...         (pl.col("a") * 2).alias("a_times_2"),
...     )
... )

In a GroupBy context the function is applied by group:

>>> (
...     df.lazy()
...     .groupby("b", maintain_order=True)
...     .agg(
...         [
...             pl.col("a").apply(lambda x: x.sum()),
...         ]
...     )
...     .collect()
... )
shape: (3, 2)
┌─────┬─────┐
│ b   ┆ a   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ b   ┆ 2   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ c   ┆ 4   │
└─────┴─────┘

It is better to implement this with an expression:

>>> (
...     df.groupby("b", maintain_order=True).agg(
...         pl.col("a").sum(),
...     )
... )