Polars Expressions

The following is an expression:

pl.col("foo").sort().head(2)

The snippet above says:

  1. Select column "foo"
  2. Then sort the column
  3. Then take the first two values of the sorted output

The power of expressions is that every expression produces a new expression, and that they can be piped together. You can run an expression by passing it to one of the Polars execution contexts.

Here we run two expressions by passing them to df.select:

df.select([
    pl.col("foo").sort().head(2),
    pl.col("bar").filter(pl.col("foo") == 1).sum()
])

All expressions passed to a context are run in parallel, because separate Polars expressions are embarrassingly parallel. Note that within a single expression there may be further parallelization going on.

Expression examples

In this section we will go through some examples, but first let's create a dataset:

import polars as pl
import numpy as np

np.random.seed(12)

df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", None],
        "random": np.random.rand(5),
        "groups": ["A", "A", "B", "C", "B"],
    }
)
print(df)
shape: (5, 4)
┌──────┬───────┬──────────┬────────┐
│ nrs  ┆ names ┆ random   ┆ groups │
│ ---  ┆ ---   ┆ ---      ┆ ---    │
│ i64  ┆ str   ┆ f64      ┆ str    │
╞══════╪═══════╪══════════╪════════╡
│ 1    ┆ foo   ┆ 0.154163 ┆ A      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2    ┆ ham   ┆ 0.74     ┆ A      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3    ┆ spam  ┆ 0.263315 ┆ B      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ null ┆ egg   ┆ 0.533739 ┆ C      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5    ┆ null  ┆ 0.014575 ┆ B      │
└──────┴───────┴──────────┴────────┘

You can do a lot with expressions. They are so expressive that you sometimes have multiple ways to get the same results. To get a better feel for them let's go through some more examples.

Count unique values

We can count the unique values in a column. Note that we are creating the same result in two different ways. To avoid duplicate column names in the DataFrame, we use alias, which renames the output of an expression.

out = df.select(
    [
        pl.col("names").n_unique().alias("unique_names_1"),
        pl.col("names").unique().count().alias("unique_names_2"),
    ]
)
print(out)
shape: (1, 2)
┌────────────────┬────────────────┐
│ unique_names_1 ┆ unique_names_2 │
│ ---            ┆ ---            │
│ u32            ┆ u32            │
╞════════════════╪════════════════╡
│ 5              ┆ 5              │
└────────────────┴────────────────┘

Various aggregations

We can do various aggregations. Below are examples of some of them, but there are more such as median, mean, first, etc.

out = df.select(
    [
        pl.sum("random").alias("sum"),
        pl.min("random").alias("min"),
        pl.max("random").alias("max"),
        pl.col("random").max().alias("other_max"),
        pl.std("random").alias("std dev"),
        pl.var("random").alias("variance"),
    ]
)
print(out)
shape: (1, 6)
┌──────────┬──────────┬──────┬───────────┬──────────┬──────────┐
│ sum      ┆ min      ┆ max  ┆ other_max ┆ std dev  ┆ variance │
│ ---      ┆ ---      ┆ ---  ┆ ---       ┆ ---      ┆ ---      │
│ f64      ┆ f64      ┆ f64  ┆ f64       ┆ f64      ┆ f64      │
╞══════════╪══════════╪══════╪═══════════╪══════════╪══════════╡
│ 1.705842 ┆ 0.014575 ┆ 0.74 ┆ 0.74      ┆ 0.293209 ┆ 0.085971 │
└──────────┴──────────┴──────┴───────────┴──────────┴──────────┘

Filter and conditionals

We can also do some pretty complex things. In the next snippet we count all names ending with the string "am".

out = df.select(
    [
        pl.col("names").filter(pl.col("names").str.contains(r"am$")).count(),
    ]
)
print(out)
shape: (1, 1)
┌───────┐
│ names │
│ ---   │
│ u32   │
╞═══════╡
│ 2     │
└───────┘

Binary functions and modification

In the example below we use a conditional to create a new expression with the when -> then -> otherwise construct. The when function requires a predicate expression (one that evaluates to a boolean Series). The then function expects an expression that will be used where the predicate evaluates to true, and the otherwise function expects an expression that will be used where the predicate evaluates to false.

Note that you can pass any expression, or just base expressions like pl.col("foo"), pl.lit(3), pl.lit("bar"), etc.

Finally, we multiply this with the result of a sum expression:

out = df.select(
    [
        pl.when(pl.col("random") > 0.5).then(0).otherwise(pl.col("random")) * pl.sum("nrs"),
    ]
)
print(out)
shape: (5, 1)
┌──────────┐
│ literal  │
│ ---      │
│ f64      │
╞══════════╡
│ 1.695791 │
├╌╌╌╌╌╌╌╌╌╌┤
│ 0.0      │
├╌╌╌╌╌╌╌╌╌╌┤
│ 2.896465 │
├╌╌╌╌╌╌╌╌╌╌┤
│ 0.0      │
├╌╌╌╌╌╌╌╌╌╌┤
│ 0.160325 │
└──────────┘

Window expressions

A Polars expression can also do an implicit GROUPBY, AGGREGATION, and JOIN in a single expression. In the example below, the first window expression does a GROUPBY OVER "groups" and an AGGREGATE SUM of "random", and the second does a GROUPBY OVER "names" and AGGREGATEs a LIST of "random". These window functions can be combined with other expressions and are an efficient way to determine group statistics. Group statistics are covered in more detail elsewhere in the guide.

df = df.select(
    [
        pl.col("*"),  # select all
        pl.col("random").sum().over("groups").alias("sum[random]/groups"),
        pl.col("random").list().over("names").alias("random/name"),
    ]
)
print(df)
shape: (5, 6)
┌──────┬───────┬──────────┬────────┬────────────────────┬─────────────┐
│ nrs  ┆ names ┆ random   ┆ groups ┆ sum[random]/groups ┆ random/name │
│ ---  ┆ ---   ┆ ---      ┆ ---    ┆ ---                ┆ ---         │
│ i64  ┆ str   ┆ f64      ┆ str    ┆ f64                ┆ list [f64]  │
╞══════╪═══════╪══════════╪════════╪════════════════════╪═════════════╡
│ 1    ┆ foo   ┆ 0.154163 ┆ A      ┆ 0.894213           ┆ [0.154163]  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2    ┆ ham   ┆ 0.74     ┆ A      ┆ 0.894213           ┆ [0.74]      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3    ┆ spam  ┆ 0.263315 ┆ B      ┆ 0.2778             ┆ [0.263315]  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ egg   ┆ 0.533739 ┆ C      ┆ 0.533739           ┆ [0.533739]  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5    ┆ null  ┆ 0.014575 ┆ B      ┆ 0.2778             ┆ [0.014575]  │
└──────┴───────┴──────────┴────────┴────────────────────┴─────────────┘

Conclusion

This is the tip of the iceberg in terms of possible expressions. There are a ton more, and they can be combined in a variety of ways.

This page was an introduction to Polars expressions, and gave a glimpse of what's possible with them. On the next page we'll discuss the contexts in which expressions can be used. Later in the guide we'll go through expressions in various groupby contexts, all while keeping Polars execution parallel.