# Polars Expressions

The following is an expression:

pl.col("foo").sort().head(2)

df.column("foo")?.sort(false).head(Some(2));


The snippet above says:

1. Select column "foo"
2. Then sort the column (not in reversed order)
3. Then take the first two values of the sorted output

The power of expressions is that every expression produces a new expression, and that they can be piped together. You can run an expression by passing it to one of Polars' execution contexts.

Here we run an expression by passing it to df.select:

df.select([
    pl.col("bar").filter(pl.col("foo") == 1).sum()
])

// collect() belongs to the lazy API, so convert the DataFrame first
df.lazy()
    .select([
        col("bar").filter(col("foo").eq(lit(1))).sum(),
    ])
    .collect()?;


Expressions within the same context are run in parallel: separate Polars expressions do not depend on one another, which makes them embarrassingly parallel. Within a single expression there may be further parallelization.

## Expression examples

In this section we will go through some examples, but first let's create a dataset:

import polars as pl
import numpy as np

np.random.seed(12)

df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", None],
        "random": np.random.rand(5),
        "groups": ["A", "A", "B", "C", "B"],
    }
)
print(df)

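use color_eyre::Result;
use polars::prelude::*;
// Assumption: the rand crate (0.8 API) supplies the random values
use rand::{thread_rng, Rng};

fn main() -> Result<()> {
    // Fill the array with random values for the "random" column.
    // The values are not seeded, so they will differ from the Python output.
    let mut arr = [0f64; 5];
    thread_rng().fill(&mut arr);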

    let df = df! (
        "nrs" => &[Some(1), Some(2), Some(3), None, Some(5)],
        "names" => &[Some("foo"), Some("ham"), Some("spam"), Some("egg"), None],
        "random" => &arr,
        "groups" => &["A", "A", "B", "C", "B"],
    )?;

println!("{}", &df);

shape: (5, 4)
┌──────┬───────┬──────────┬────────┐
│ nrs  ┆ names ┆ random   ┆ groups │
│ ---  ┆ ---   ┆ ---      ┆ ---    │
│ i64  ┆ str   ┆ f64      ┆ str    │
╞══════╪═══════╪══════════╪════════╡
│ 1    ┆ foo   ┆ 0.154163 ┆ A      │
│ 2    ┆ ham   ┆ 0.74005  ┆ A      │
│ 3    ┆ spam  ┆ 0.263315 ┆ B      │
│ null ┆ egg   ┆ 0.533739 ┆ C      │
│ 5    ┆ null  ┆ 0.014575 ┆ B      │
└──────┴───────┴──────────┴────────┘


You can do a lot with expressions. They are so expressive that you sometimes have multiple ways to get the same results. To get a better feel for them let's go through some more examples.

A note for the Rust examples: each of these examples uses the same dataset. Because of Rust's ownership rules, and because all the examples run in the same context, we clone() the dataset for each example so that no prior example affects the behavior of later ones. This is the case for all Rust examples for the remainder of this book. It's worth mentioning that clones in Polars are very efficient and don't result in a "deep copy" of the data. They're implemented using the Rust Arc type (Atomically Reference Counted).

### Count unique values

We can count the unique values in a column. Note that we are creating the same result in two different ways. To avoid duplicate column names in the DataFrame, we use alias to rename each expression.

out = df.select(
    [
        pl.col("names").n_unique().alias("unique_names_1"),
        pl.col("names").unique().count().alias("unique_names_2"),
    ]
)
print(out)

let out = df
    .clone()
    .lazy()
    .select([
        col("names").n_unique().alias("unique_names_1"),
        col("names").unique().count().alias("unique_names_2"),
    ])
    .collect()?;
println!("{}", out);

shape: (1, 2)
┌────────────────┬────────────────┐
│ unique_names_1 ┆ unique_names_2 │
│ ---            ┆ ---            │
│ u32            ┆ u32            │
╞════════════════╪════════════════╡
│ 5              ┆ 5              │
└────────────────┴────────────────┘


### Various aggregations

We can do various aggregations. Below are examples of some of them, but there are more such as median, mean, first, etc.

out = df.select(
    [
        pl.sum("random").alias("sum"),
        pl.min("random").alias("min"),
        pl.max("random").alias("max"),
        pl.col("random").max().alias("other_max"),
        pl.std("random").alias("std dev"),
        pl.var("random").alias("variance"),
    ]
)
print(out)

let out = df
    .clone()
    .lazy()
    .select([
        sum("random").alias("sum"),
        min("random").alias("min"),
        max("random").alias("max"),
        col("random").max().alias("other_max"),
        col("random").std(1).alias("std dev"),
        col("random").var(1).alias("variance"),
    ])
    .collect()?;
println!("{}", out);

shape: (1, 6)
┌──────────┬──────────┬─────────┬───────────┬──────────┬──────────┐
│ sum      ┆ min      ┆ max     ┆ other_max ┆ std dev  ┆ variance │
│ ---      ┆ ---      ┆ ---     ┆ ---       ┆ ---      ┆ ---      │
│ f64      ┆ f64      ┆ f64     ┆ f64       ┆ f64      ┆ f64      │
╞══════════╪══════════╪═════════╪═══════════╪══════════╪══════════╡
│ 1.705842 ┆ 0.014575 ┆ 0.74005 ┆ 0.74005   ┆ 0.293209 ┆ 0.085971 │
└──────────┴──────────┴─────────┴───────────┴──────────┴──────────┘


### Filter and conditionals

We can also do some pretty complex things. In the next snippet we count all names ending with the string "am".

Note that in Rust, the strings feature must be enabled for the str expressions to be available.

out = df.select(
    [
        pl.col("names").filter(pl.col("names").str.contains(r"am$")).count(),
    ]
)
print(out)

let out = df
    .clone()
    .lazy()
    .select([col("names")
        .filter(col("names").str().contains("am$"))
        .count()])
    .collect()?;
println!("{}", out);

shape: (1, 1)
┌───────┐
│ names │
│ ---   │
│ u32   │
╞═══════╡
│ 2     │
└───────┘


### Binary functions and modification

In the example below we use a conditional to create a new expression with the when -> then -> otherwise construct. The when function requires a predicate expression (one that evaluates to a boolean Series). The then function expects an expression that is used when the predicate evaluates to true, and the otherwise function expects an expression that is used when the predicate evaluates to false.

Note that you can pass any expression, or just base expressions like pl.col("foo"), pl.lit(3), pl.lit("bar"), etc.

Finally, we multiply this with the result of a sum expression:

out = df.select(
    [
        pl.when(pl.col("random") > 0.5).then(0).otherwise(pl.col("random")) * pl.sum("nrs"),
    ]
)
print(out)

let out = df
    .clone()
    .lazy()
    .select([when(col("random").gt(0.5)).then(0).otherwise(col("random")) * sum("nrs")])
    .collect()?;
println!("{}", out);

shape: (5, 1)
┌──────────┐
│ literal  │
│ ---      │
│ f64      │
╞══════════╡
│ 1.695791 │
│ 0.0      │
│ 2.896465 │
│ 0.0      │
│ 0.160325 │
└──────────┘


It is also possible to chain multiple when -> then statements together like in the example below. This is similar to the SQL CASE WHEN.

out = df.select(
    pl.when(pl.col("groups") == "A").then(1).when(pl.col("random") > 0.5).then(0).otherwise(pl.col("random"))
)
print(out)

let out = df
    .clone()
    .lazy()
    // The first branch tests the "groups" column, as in the Python example
    .select([when(col("groups").eq(lit("A"))).then(1)
        .when(col("random").gt(0.5)).then(0)
        .otherwise(col("random"))])
    .collect()?;
println!("{}", out);

shape: (5, 1)
┌──────────┐
│ literal  │
│ ---      │
│ f64      │
╞══════════╡
│ 1.0      │
│ 1.0      │
│ 0.263315 │
│ 0.0      │
│ 0.014575 │
└──────────┘


If you are looking to replace the values of a column based on a dictionary, you don't need chained when -> then. You can use map_dict instead. Read more in the reference guide.

### Window expressions

A Polars expression can also do an implicit GROUPBY, AGGREGATION, and JOIN in a single expression. In the example below we do a GROUPBY OVER "groups" and AGGREGATE SUM of "random", and in the next expression we GROUPBY OVER "names" and AGGREGATE a LIST of "random". These window functions can be combined with other expressions and are an efficient way to determine group statistics. See more on those group statistics here.

df = df.select(
    [
        pl.col("*"),  # select all
        pl.col("random").sum().over("groups").alias("sum[random]/groups"),
        pl.col("random").list().over("names").alias("random/name"),
    ]
)
print(df)

let df = df
    .lazy()
    .select([
        col("*"), // Select all
        col("random")
            .sum()
            .over([col("groups")])
            .alias("sum[random]/groups"),
        col("random")
            .list()
            .over([col("names")])
            .alias("random/name"),
    ])
    .collect()?;
println!("{}", df);

shape: (5, 6)
┌──────┬───────┬──────────┬────────┬────────────────────┬─────────────┐
│ nrs  ┆ names ┆ random   ┆ groups ┆ sum[random]/groups ┆ random/name │
│ ---  ┆ ---   ┆ ---      ┆ ---    ┆ ---                ┆ ---         │
│ i64  ┆ str   ┆ f64      ┆ str    ┆ f64                ┆ list[f64]   │
╞══════╪═══════╪══════════╪════════╪════════════════════╪═════════════╡
│ 1    ┆ foo   ┆ 0.154163 ┆ A      ┆ 0.894213           ┆ [0.154163]  │
│ 2    ┆ ham   ┆ 0.74005  ┆ A      ┆ 0.894213           ┆ [0.74005]   │
│ 3    ┆ spam  ┆ 0.263315 ┆ B      ┆ 0.27789            ┆ [0.263315]  │
│ null ┆ egg   ┆ 0.533739 ┆ C      ┆ 0.533739           ┆ [0.533739]  │
│ 5    ┆ null  ┆ 0.014575 ┆ B      ┆ 0.27789            ┆ [0.014575]  │
└──────┴───────┴──────────┴────────┴────────────────────┴─────────────┘


## Conclusion

This is the tip of the iceberg in terms of possible expressions. There are a ton more, and they can be combined in a variety of ways.

This page was an introduction to Polars expressions, and gave a glimpse of what's possible with them. On the next page we'll discuss in which contexts expressions can be used. Later in the guide we'll go through expressions in various groupby contexts, all while keeping Polars execution parallel.