Polars Expressions
The following is an expression:
pl.col("foo").sort().head(2)
The snippet above says:
- Select column "foo"
- Then sort the column
- Then take the first two values of the sorted output
The power of expressions is that every expression produces a new expression, and that they
can be piped together. You can run an expression by passing them to one of Polars
execution contexts.
Here we run two expressions by running df.select
:
df.select([
pl.col("foo").sort().head(2),
pl.col("bar").filter(pl.col("foo") == 1).sum()
])
All expressions are ran in parallel, meaning that separate Polars
expressions are embarrassingly
parallel. Note that within an expression there may be more parallelization going on.
Expression examples
In this section we will go through some examples, but first let's create a dataset:
import polars as pl
import numpy as np
np.random.seed(12)
df = pl.DataFrame(
{
"nrs": [1, 2, 3, None, 5],
"names": ["foo", "ham", "spam", "egg", None],
"random": np.random.rand(5),
"groups": ["A", "A", "B", "C", "B"],
}
)
print(df)
shape: (5, 4)
┌──────┬───────┬──────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str │
╞══════╪═══════╪══════════╪════════╡
│ 1 ┆ foo ┆ 0.154163 ┆ A │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ ham ┆ 0.74 ┆ A │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3 ┆ spam ┆ 0.263315 ┆ B │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ null ┆ egg ┆ 0.533739 ┆ C │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5 ┆ null ┆ 0.014575 ┆ B │
└──────┴───────┴──────────┴────────┘
You can do a lot with expressions. They are so expressive that you sometimes have multiple ways to get the same results. To get a better feel for them let's go through some more examples.
Count unique values
We can count the unique values in a column. Note that we are creating the same result in
different ways. To avoid duplicate column names in the DataFrame
, we could use an
alias
expression that can rename the expression.
out = df.select(
[
pl.col("names").n_unique().alias("unique_names_1"),
pl.col("names").unique().count().alias("unique_names_2"),
]
)
print(out)
shape: (1, 2)
┌────────────────┬────────────────┐
│ unique_names_1 ┆ unique_names_2 │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞════════════════╪════════════════╡
│ 5 ┆ 5 │
└────────────────┴────────────────┘
Various aggregations
We can do various aggregations. Below are examples of some of them, but there are more such as
median
, mean
, first
, etc.
out = df.select(
[
pl.sum("random").alias("sum"),
pl.min("random").alias("min"),
pl.max("random").alias("max"),
pl.col("random").max().alias("other_max"),
pl.std("random").alias("std dev"),
pl.var("random").alias("variance"),
]
)
print(out)
shape: (1, 6)
┌──────────┬──────────┬──────┬───────────┬──────────┬──────────┐
│ sum ┆ min ┆ max ┆ other_max ┆ std dev ┆ variance │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞══════════╪══════════╪══════╪═══════════╪══════════╪══════════╡
│ 1.705842 ┆ 0.014575 ┆ 0.74 ┆ 0.74 ┆ 0.293209 ┆ 0.085971 │
└──────────┴──────────┴──────┴───────────┴──────────┴──────────┘
Filter and conditionals
We can also do some pretty complex things. In the next snippet we count all names ending
with the string "am"
.
out = df.select(
[
pl.col("names").filter(pl.col("names").str.contains(r"am$")).count(),
]
)
print(df)
shape: (1, 1)
┌───────┐
│ names │
│ --- │
│ u32 │
╞═══════╡
│ 2 │
└───────┘
Binary functions and modification
In the example below we use a conditional to create a new expression in the following
when -> then -> otherwise
construct. The when
function requires a predicate
expression (and thus leads to a boolean Series
). The then
function expects an
expression that will be used in case the predicate evaluates to true
, and the otherwise
function expects an expression that will be used in case the predicate evaluates to false
.
Note that you can pass any expression, or just base expressions like pl.col("foo")
,
pl.lit(3)
, pl.lit("bar")
, etc.
Finally, we multiply this with the result of a sum
expression:
out = df.select(
[
pl.when(pl.col("random") > 0.5).then(0).otherwise(pl.col("random")) * pl.sum("nrs"),
]
)
print(df)
shape: (5, 1)
┌──────────┐
│ literal │
│ --- │
│ f64 │
╞══════════╡
│ 1.695791 │
├╌╌╌╌╌╌╌╌╌╌┤
│ 0.0 │
├╌╌╌╌╌╌╌╌╌╌┤
│ 2.896465 │
├╌╌╌╌╌╌╌╌╌╌┤
│ 0.0 │
├╌╌╌╌╌╌╌╌╌╌┤
│ 0.160325 │
└──────────┘
Window expressions
A polars expression can also do an implicit GROUPBY, AGGREGATION, and JOIN in a single expression.
In the examples below we do a GROUPBY OVER "groups"
and AGGREGATE SUM of "random"
, and in the next expression
we GROUPBY OVER "names"
and AGGREGATE a LIST of "random"
. These window functions can be combined with other expressions
and are an efficient way to determine group statistics. See more on those group statistics here.
df = df[
[
pl.col("*"), # select all
pl.col("random").sum().over("groups").alias("sum[random]/groups"),
pl.col("random").list().over("names").alias("random/name"),
]
]
print(df)
shape: (5, 6)
┌──────┬───────┬──────────┬────────┬────────────────────┬─────────────┐
│ nrs ┆ names ┆ random ┆ groups ┆ sum[random]/groups ┆ random/name │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str ┆ f64 ┆ list [f64] │
╞══════╪═══════╪══════════╪════════╪════════════════════╪═════════════╡
│ 1 ┆ foo ┆ 0.154163 ┆ A ┆ 0.894213 ┆ [0.154163] │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ ham ┆ 0.74 ┆ A ┆ 0.894213 ┆ [0.74] │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ spam ┆ 0.263315 ┆ B ┆ 0.2778 ┆ [0.263315] │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ egg ┆ 0.533739 ┆ C ┆ 0.533739 ┆ [0.533739] │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ null ┆ 0.014575 ┆ B ┆ 0.2778 ┆ [0.014575] │
└──────┴───────┴──────────┴────────┴────────────────────┴─────────────┘
Conclusion
This is the tip of the iceberg in terms of possible expressions. There are a ton more, and they can be combined in a variety ways.
This page was an introduction to Polars
expressions, and gave a glimpse of what's
possible with them. In the next page we'll discuss in which contexts expressions can be used. Later in the guide we'll go through expressions in various groupby contexts, all while keeping Polars
execution parallel.