Polars Expressions

Polars has a powerful concept called expressions. Polars expressions can be used in various contexts and produce Series. That may sound a bit strange, so let's look at an example.

The following is an expression:

col("foo").sort().head(2)

The snippet above says: select column "foo" -> sort -> take the first 2 values. The power of expressions is that every expression produces a new expression, so they can be piped together. Besides being very expressive, they are also embarrassingly parallel!
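To make "every expression produces a new expression" concrete, here is a toy sketch in plain Python. It is not how Polars is implemented, and ToyExpr and toy_col are hypothetical names invented for this illustration. The only point is that each method returns a new deferred description of work, and nothing runs until the chain is evaluated against data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Column = List[int]
Frame = Dict[str, Column]


@dataclass(frozen=True)
class ToyExpr:
    """Hypothetical stand-in for an expression: a deferred Frame -> Column function."""

    run: Callable[[Frame], Column]

    def sort(self) -> "ToyExpr":
        # Returns a NEW expression that sorts the result of this one.
        return ToyExpr(lambda frame: sorted(self.run(frame)))

    def head(self, n: int) -> "ToyExpr":
        # Returns a NEW expression that keeps the first n values of this one.
        return ToyExpr(lambda frame: self.run(frame)[:n])


def toy_col(name: str) -> ToyExpr:
    # Hypothetical analogue of col("foo"): look the column up when evaluated.
    return ToyExpr(lambda frame: frame[name])


expr = toy_col("foo").sort().head(2)  # only describes the work, computes nothing
frame = {"foo": [3, 1, 2]}
result = expr.run(frame)              # evaluation happens here
print(result)  # [1, 2]
```

Because each step only wraps the previous one, the same chain can be reused on any frame that has a "foo" column, and the input data is left untouched.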

Expression examples

In the next section we will go through some examples, but first create a dataset:

import polars as pl
from polars import col
import numpy as np

np.random.seed(12)

df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", None],
        "random": np.random.rand(5),
        "groups": ["A", "A", "B", "C", "B"],
    }
)
print(df)
shape: (5, 4)
┌──────┬────────┬──────────────────────┬────────┐
│ nrs  ┆ names  ┆ random               ┆ groups │
│ ---  ┆ ---    ┆ ---                  ┆ ---    │
│ i64  ┆ str    ┆ f64                  ┆ str    │
╞══════╪════════╪══════════════════════╪════════╡
│ 1    ┆ "foo"  ┆ 0.15416284237967237  ┆ "A"    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2    ┆ "ham"  ┆ 0.7400496965154048   ┆ "A"    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3    ┆ "spam" ┆ 0.26331501518513467  ┆ "B"    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ null ┆ "egg"  ┆ 0.5337393933802977   ┆ "C"    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5    ┆ null   ┆ 0.014574962485419674 ┆ "B"    │
└──────┴────────┴──────────────────────┴────────┘

Some more examples.

You can do a lot with expressions. They are so expressive that there are sometimes multiple ways to get the same result. To get a feel for them, let's go through some examples.

Count unique values

We can count the unique values in a column. Note that we are creating the same result in two different ways. To avoid duplicate column names in the DataFrame, we use an alias expression, which renames an expression.

# Store the result in `out` so that `df` stays available for the next examples.
out = df.select(
    [
        col("names").n_unique().alias("unique_names_1"),
        col("names").unique().count().alias("unique_names_2"),
    ]
)
print(out)
shape: (1, 2)
┌────────────────┬────────────────┐
│ unique_names_1 ┆ unique_names_2 │
│ ---            ┆ ---            │
│ u32            ┆ u32            │
╞════════════════╪════════════════╡
│ 5              ┆ 5              │
└────────────────┴────────────────┘
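As a cross-check on the table above, the same two routes can be sketched in plain Python on the "names" values. This mirrors the output shown, where the null is counted as one distinct value (4 names plus null gives 5). Whether a null is counted may depend on the Polars version, so treat this as a reading of the table above, not as a statement about the Polars API.

```python
names = ["foo", "ham", "spam", "egg", None]

# Route 1: count the distinct values directly (None is one distinct value here).
unique_names_1 = len(set(names))

# Route 2: first build the distinct values, then count them.
distinct = []
for name in names:
    if name not in distinct:
        distinct.append(name)
unique_names_2 = len(distinct)

print(unique_names_1, unique_names_2)  # 5 5
```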

Various aggregations

We can do various aggregations. Below we show some of them, but there are more, such as median, mean, first etc.

# Store the result in `out` so that `df` stays available for the next examples.
out = df.select(
    [
        pl.sum("random").alias("sum"),
        pl.min("random").alias("min"),
        pl.max("random").alias("max"),
        col("random").max().alias("other_max"),
        pl.std("random").alias("std dev"),
        pl.var("random").alias("variance"),
    ]
)
print(out)
shape: (1, 6)
┌────────────────┬────────────────┬────────────────┬───────────────┬───────────────┬───────────────┐
│ sum            ┆ min            ┆ max            ┆ other_max     ┆ std dev       ┆ variance      │
│ ---            ┆ ---            ┆ ---            ┆ ---           ┆ ---           ┆ ---           │
│ f64            ┆ f64            ┆ f64            ┆ f64           ┆ f64           ┆ f64           │
╞════════════════╪════════════════╪════════════════╪═══════════════╪═══════════════╪═══════════════╡
│ 1.705841909945 ┆ 0.014574962485 ┆ 0.740049696515 ┆ 0.74004969651 ┆ 0.29320870456 ┆ 0.08597134443 │
│ 9292           ┆ 419674         ┆ 4048           ┆ 54048         ┆ 7623          ┆ 422363        │
└────────────────┴────────────────┴────────────────┴───────────────┴───────────────┴───────────────┘
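The numbers in the table above can be reproduced with Python's standard library, which also shows which definitions are in play. In this sketch the "std dev" and "variance" columns match the sample versions (dividing by n - 1), as computed by statistics.stdev and statistics.variance. The values of "random" are copied from the dataset printed earlier.

```python
import statistics

# Values of the "random" column, copied from the dataset above.
random = [
    0.15416284237967237,
    0.7400496965154048,
    0.26331501518513467,
    0.5337393933802977,
    0.014574962485419674,
]

total = sum(random)
smallest = min(random)
largest = max(random)
std_dev = statistics.stdev(random)      # sample standard deviation (n - 1)
variance = statistics.variance(random)  # sample variance (n - 1)

print(total, smallest, largest, std_dev, variance)
```

The printed values agree with the table up to floating-point rounding.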

Filter and conditionals

We can also do more complex things. In the next snippet we count all names ending with the string "am".

# Keep only the names matching the regex "am$", then count them.
out = df.select([col("names").filter(col("names").str.contains(r"am$")).count()])
print(out)
shape: (1, 1)
┌───────┐
│ names │
│ ---   │
│ u32   │
╞═══════╡
│ 2     │
└───────┘
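For readers less familiar with the regex used above, the following plain-Python sketch applies the same pattern, am$ (the text "am" anchored at the end of the string), to the "names" values. The null entry is skipped explicitly here; that is an assumption of this sketch and it agrees with the count of 2 shown above.

```python
import re

names = ["foo", "ham", "spam", "egg", None]
pattern = re.compile(r"am$")  # "am" at the very end of the string

# Filter: keep the non-null names that match the pattern.
matching = [n for n in names if n is not None and pattern.search(n)]

# Aggregate: count what is left.
count = len(matching)
print(matching, count)  # ['ham', 'spam'] 2
```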

Binary functions and modification

In the example below we use a conditional to create a new expression with the when -> then -> otherwise construct. The when() function requires a predicate expression (which evaluates to a boolean Series), then() expects an expression that is used where the predicate evaluates to true, and otherwise() expects an expression that is used where the predicate evaluates to false.

Note that you can pass any expression, or just base expressions like col("foo"), lit(3), lit("bar"), etc.

Finally, we multiply this with the result of a sum expression.

# Replace values above 0.5 with 0, keep the rest, then scale by the sum of "nrs".
out = df.select([pl.when(col("random") > 0.5).then(0).otherwise(col("random")) * pl.sum("nrs")])
print(out)
shape: (5, 1)
┌────────────────────┐
│ literal            │
│ ---                │
│ f64                │
╞════════════════════╡
│ 1.695791266176396  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.0                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2.8964651670364816 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.0                │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0.1603245873396164 │
└────────────────────┘
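The row-by-row logic of the conditional can be written out in plain Python as a sketch. The output above is consistent with the sum of "nrs" skipping the null (1 + 2 + 3 + 5 = 11), so this sketch does the same; the single sum is then applied to every row.

```python
nrs = [1, 2, 3, None, 5]
random = [
    0.15416284237967237,
    0.7400496965154048,
    0.26331501518513467,
    0.5337393933802977,
    0.014574962485419674,
]

# Aggregate: one number for the whole column, ignoring the null.
nrs_sum = sum(n for n in nrs if n is not None)

# when(random > 0.5) -> then(0) -> otherwise(random), times the sum.
result = [(0.0 if r > 0.5 else r) * nrs_sum for r in random]
print(nrs_sum, result)
```

Rows 2 and 4 have "random" above 0.5 and therefore become 0.0; the other rows are the original value multiplied by 11.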

Window expressions (split-apply-combine)

A Polars expression can also do an implicit GROUPBY, AGGREGATION, and JOIN in a single expression. In the examples below we do a GROUPBY OVER "groups" and AGGREGATE SUM of "random", and in the next expression we GROUPBY OVER "names" and AGGREGATE a LIST of "random". These window functions can be combined with other expressions, and are an efficient way to determine group statistics. See more of those group statistics here

# Store the result in `out` so that `df` itself is left unchanged.
out = df.select(
    [
        col("*"),  # select all
        col("random").sum().over("groups").alias("sum[random]/groups"),
        col("random").list().over("names").alias("random/name"),
    ]
)
print(out)
shape: (5, 6)
┌──────┬────────┬──────────────────────┬────────┬─────────────────────┬────────────────────────┐
│ nrs  ┆ names  ┆ random               ┆ groups ┆ sum[random]/groups  ┆ random/name            │
│ ---  ┆ ---    ┆ ---                  ┆ ---    ┆ ---                 ┆ ---                    │
│ i64  ┆ str    ┆ f64                  ┆ str    ┆ f64                 ┆ list [f64]             │
╞══════╪════════╪══════════════════════╪════════╪═════════════════════╪════════════════════════╡
│ 1    ┆ "foo"  ┆ 0.15416284237967237  ┆ "A"    ┆ 0.8942125388950771  ┆ [0.15416284237967237]  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2    ┆ "ham"  ┆ 0.7400496965154048   ┆ "A"    ┆ 0.8942125388950771  ┆ [0.7400496965154048]   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3    ┆ "spam" ┆ 0.26331501518513467  ┆ "B"    ┆ 0.27788997767055434 ┆ [0.26331501518513467]  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ "egg"  ┆ 0.5337393933802977   ┆ "C"    ┆ 0.5337393933802977  ┆ [0.5337393933802977]   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5    ┆ null   ┆ 0.014574962485419674 ┆ "B"    ┆ 0.27788997767055434 ┆ [0.014574962485419674] │
└──────┴────────┴──────────────────────┴────────┴─────────────────────┴────────────────────────┘
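The three implicit steps of the window expression (group, aggregate, join back) can be spelled out in plain Python. This is only a sketch of the semantics for the "sum[random]/groups" column, not of how Polars executes it.

```python
from collections import defaultdict

groups = ["A", "A", "B", "C", "B"]
random = [
    0.15416284237967237,
    0.7400496965154048,
    0.26331501518513467,
    0.5337393933802977,
    0.014574962485419674,
]

# GROUPBY + AGGREGATION: one sum per group.
group_sums = defaultdict(float)
for g, r in zip(groups, random):
    group_sums[g] += r

# JOIN: put the per-group result back on every original row.
sum_over_groups = [group_sums[g] for g in groups]
print(sum_over_groups)
```

The result has one value per input row, and rows that share a group share the same value, which is what the table above shows for groups "A" and "B".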

Conclusion

This is only the tip of the iceberg of possible expressions. There are many more, and they can be combined in myriad ways.

This page was an introduction to Polars expressions and gave a glimpse of what is possible with them. On the next page we will see in which contexts expressions can be used. Later we will go through expressions in various groupby contexts, which lets us keep Polars execution parallel.