Polars Expressions

Polars has a powerful concept called expressions. Expressions can be used in various contexts and are a functional mapping Fn(Series) -> Series: they take a Series as input and produce a Series as output. It follows from this definition that the output of one Expr can serve as the input of another Expr.

That may sound a bit strange, so let's look at an example.

The following is an expression:

pl.col("foo").sort().head(2)

The snippet above says: select column "foo", sort it, and then take the first 2 values of the sorted output. The power of expressions is that every expression produces a new expression, so they can be chained together. You run expressions by passing them to one of the Polars execution contexts. Here we run two expressions with df.select:

df.select([
    pl.col("foo").sort().head(2),
    pl.col("bar").filter(pl.col("foo") == 1).sum()
])

All expressions are run in parallel: separate Polars expressions are embarrassingly parallel. (Within a single expression there may be further parallelization going on.)

Expression examples

In this section we will go through some examples, but first let's create a dataset:

import polars as pl
import numpy as np

np.random.seed(12)

df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", None],
        "random": np.random.rand(5),
        "groups": ["A", "A", "B", "C", "B"],
    }
)
print(df)
shape: (5, 4)
┌──────┬───────┬──────────────────────┬────────┐
│ nrs  ┆ names ┆ random               ┆ groups │
│ ---  ┆ ---   ┆ ---                  ┆ ---    │
│ i64  ┆ str   ┆ f64                  ┆ str    │
╞══════╪═══════╪══════════════════════╪════════╡
│ 1    ┆ foo   ┆ 0.15416284237967237  ┆ A      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2    ┆ ham   ┆ 0.7400496965154048   ┆ A      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3    ┆ spam  ┆ 0.26331501518513467  ┆ B      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ null ┆ egg   ┆ 0.5337393933802977   ┆ C      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5    ┆ null  ┆ 0.014574962485419674 ┆ B      │
└──────┴───────┴──────────────────────┴────────┘

You can do a lot with expressions. They are so expressive that you sometimes have multiple ways to get the same results. To get a feel for them let's go through some examples.

Count unique values

We can count the unique values in a column. Note that we are creating the same result in different ways. To avoid duplicate column names in the DataFrame, we use an alias expression, which renames the output of an expression.

out = df.select(
    [
        pl.col("names").n_unique().alias("unique_names_1"),
        pl.col("names").unique().count().alias("unique_names_2"),
    ]
)
print(out)
shape: (1, 2)
┌────────────────┬────────────────┐
│ unique_names_1 ┆ unique_names_2 │
│ ---            ┆ ---            │
│ u32            ┆ u32            │
╞════════════════╪════════════════╡
│ 5              ┆ 5              │
└────────────────┴────────────────┘

Various aggregations

We can do various aggregations. Below we show some of them, but there are more, such as median, mean, first, etc.

out = df.select(
    [
        pl.sum("random").alias("sum"),
        pl.min("random").alias("min"),
        pl.max("random").alias("max"),
        pl.col("random").max().alias("other_max"),
        pl.std("random").alias("std dev"),
        pl.var("random").alias("variance"),
    ]
)
print(out)
shape: (1, 6)
┌────────────────┬────────────────┬────────────────┬───────────────┬───────────────┬───────────────┐
│ sum            ┆ min            ┆ max            ┆ other_max     ┆ std dev       ┆ variance      │
│ ---            ┆ ---            ┆ ---            ┆ ---           ┆ ---           ┆ ---           │
│ f64            ┆ f64            ┆ f64            ┆ f64           ┆ f64           ┆ f64           │
╞════════════════╪════════════════╪════════════════╪═══════════════╪═══════════════╪═══════════════╡
│ 1.705841909945 ┆ 0.014574962485 ┆ 0.740049696515 ┆ 0.74004969651 ┆ 0.29320870456 ┆ 0.08597134443 │
│ 9292           ┆ 419674         ┆ 4048           ┆ 54048         ┆ 7623          ┆ 422363        │
└────────────────┴────────────────┴────────────────┴───────────────┴───────────────┴───────────────┘

Filter and conditionals

We can also do some pretty complex things. In the next snippet we count all names ending with the string "am".

out = df.select(
    [
        pl.col("names").filter(pl.col("names").str.contains(r"am$")).count(),
    ]
)
print(out)
shape: (1, 1)
┌───────┐
│ names │
│ ---   │
│ u32   │
╞═══════╡
│ 2     │
└───────┘

Binary functions and modification

In the example below we use a conditional to create a new expression with the when -> then -> otherwise construct. The when() function requires a predicate expression (which evaluates to a boolean Series), then() expects an expression to use where the predicate evaluates to true, and otherwise() expects an expression to use where the predicate evaluates to false.

Note that you can pass any expression, or just base expressions like pl.col("foo"), pl.lit(3), pl.lit("bar"), etc.

Finally, we multiply this with the result of a sum expression.

out = df.select(
    [
        pl.when(pl.col("random") > 0.5).then(0).otherwise(pl.col("random")) * pl.sum("nrs"),
    ]
)
print(out)
shape: (5, 1)
┌────────────────────┐
│ literal            │
│ ---                │
│ f64                │
╞════════════════════╡
│ 1.695791266176396  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.0                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2.8964651670364816 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.0                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.1603245873396164 │
└────────────────────┘

Window expressions

A polars expression can also do an implicit GROUPBY, AGGREGATION, and JOIN in a single expression. In the examples below we do a GROUPBY OVER "groups" and AGGREGATE SUM of "random", and in the next expression we GROUPBY OVER "names" and AGGREGATE a LIST of "random". These window functions can be combined with other expressions, and are an efficient way to determine group statistics. See more of those group statistics here.

df = df.select(
    [
        pl.col("*"),  # select all
        pl.col("random").sum().over("groups").alias("sum[random]/groups"),
        pl.col("random").list().over("names").alias("random/name"),
    ]
)
print(df)
shape: (5, 6)
┌──────┬───────┬──────────────────────┬────────┬─────────────────────┬────────────────────────┐
│ nrs  ┆ names ┆ random               ┆ groups ┆ sum[random]/groups  ┆ random/name            │
│ ---  ┆ ---   ┆ ---                  ┆ ---    ┆ ---                 ┆ ---                    │
│ i64  ┆ str   ┆ f64                  ┆ str    ┆ f64                 ┆ list [f64]             │
╞══════╪═══════╪══════════════════════╪════════╪═════════════════════╪════════════════════════╡
│ 1    ┆ foo   ┆ 0.15416284237967237  ┆ A      ┆ 0.8942125388950771  ┆ [0.15416284237967237]  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2    ┆ ham   ┆ 0.7400496965154048   ┆ A      ┆ 0.8942125388950771  ┆ [0.7400496965154048]   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3    ┆ spam  ┆ 0.26331501518513467  ┆ B      ┆ 0.27788997767055434 ┆ [0.26331501518513467]  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ egg   ┆ 0.5337393933802977   ┆ C      ┆ 0.5337393933802977  ┆ [0.5337393933802977]   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5    ┆ null  ┆ 0.014574962485419674 ┆ B      ┆ 0.27788997767055434 ┆ [0.014574962485419674] │
└──────┴───────┴──────────────────────┴────────┴─────────────────────┴────────────────────────┘

Conclusion

This is only the tip of the iceberg in terms of possible expressions. There are many more, and they can be combined in myriad ways.

This page was an introduction to Polars expressions and gave a glimpse of what's possible with them. On the next page, we look at the contexts in which expressions can be used. Later we'll go through expressions in various groupby contexts, all while keeping Polars execution parallel.