polars.Expr.map_batches#

Expr.map_batches(
function: Callable[[Series], Series | Any],
return_dtype: PolarsDataType | None = None,
*,
agg_list: bool = False,
is_elementwise: bool = False,
) Self[source]#

Apply a custom python function to a whole Series or sequence of Series.

The output of this custom function must be a Series (or a NumPy array, in which case it will be automatically converted into a Series). If you want to apply a custom function elementwise over single values, see map_elements(). A reasonable use case for map functions is transforming the values represented by an expression using a third-party library.

Warning

If you are looking to map a function over a window function or group_by context, refer to map_elements() instead. Read more in the book.

Parameters:
function

Lambda/function to apply.

return_dtype

Dtype of the output Series. If not set, the dtype will be inferred based on the first non-null value that is returned by the function.

is_elementwise

If set to true this can run in the streaming engine, but may yield incorrect results in group-by. Ensure you know what you are doing!

agg_list

Aggregate the values of the expression into a list before applying the function. This parameter only works in a group-by context. The function will be invoked only once on a list of groups, rather than once per group.

Warning

If return_dtype is not provided, this may lead to unexpected results. We allow this, but it is considered a bug in the user’s query.

Examples

>>> df = pl.DataFrame(
...     {
...         "sine": [0.0, 1.0, 0.0, -1.0],
...         "cosine": [1.0, 0.0, -1.0, 0.0],
...     }
... )
>>> df.select(pl.all().map_batches(lambda x: x.to_numpy().argmax()))
shape: (1, 2)
┌──────┬────────┐
│ sine ┆ cosine │
│ ---  ┆ ---    │
│ i64  ┆ i64    │
╞══════╪════════╡
│ 1    ┆ 0      │
└──────┴────────┘

In a group-by context, the agg_list parameter can improve performance if used correctly. The following example has agg_list set to False, which causes the function to be applied once per group. The input of the function is a Series of type Int64. This is less efficient.

>>> df = pl.DataFrame(
...     {
...         "a": [0, 1, 0, 1],
...         "b": [1, 2, 3, 4],
...     }
... )
>>> df.group_by("a").agg(
...     pl.col("b").map_batches(lambda x: x.max(), agg_list=False)
... )  
shape: (2, 2)
┌─────┬───────────┐
│ a   ┆ b         │
│ --- ┆ ---       │
│ i64 ┆ list[i64] │
╞═════╪═══════════╡
│ 1   ┆ [4]       │
│ 0   ┆ [3]       │
└─────┴───────────┘

Using agg_list=True would be more efficient. In this example, the input of the function is a Series of type List(Int64).

>>> df.group_by("a").agg(
...     pl.col("b").map_batches(lambda x: x.list.max(), agg_list=True)
... )  
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 0   ┆ 3   │
│ 1   ┆ 4   │
└─────┴─────┘