Skip to content

Contexts

Polars has developed its own Domain Specific Language (DSL) for transforming data. The language is very easy to use and allows for complex queries that remain human readable. The two core components of the language are Contexts and Expressions, the latter we will cover in the next section.

A context, as implied by the name, refers to the context in which an expression needs to be evaluated. There are three main contexts 1:

  1. Selection: df.select([..]), df.with_columns([..])
  2. Filtering: df.filter()
  3. Group by / Aggregation: df.group_by(..).agg([..])

The examples below are performed on the following DataFrame:

DataFrame

df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", None],
        "random": np.random.rand(5),
        "groups": ["A", "A", "B", "C", "B"],
    }
)
print(df)

DataFrame

use rand::{thread_rng, Rng};

let mut arr = [0f64; 5];
thread_rng().fill(&mut arr);

let df = df! (
    "nrs" => &[Some(1), Some(2), Some(3), None, Some(5)],
    "names" => &[Some("foo"), Some("ham"), Some("spam"), Some("eggs"), None],
    "random" => &arr,
    "groups" => &["A", "A", "B", "C", "B"],
)?;

println!("{}", &df);

shape: (5, 4)
┌──────┬───────┬──────────┬────────┐
│ nrs  ┆ names ┆ random   ┆ groups │
│ ---  ┆ ---   ┆ ---      ┆ ---    │
│ i64  ┆ str   ┆ f64      ┆ str    │
╞══════╪═══════╪══════════╪════════╡
│ 1    ┆ foo   ┆ 0.154163 ┆ A      │
│ 2    ┆ ham   ┆ 0.74005  ┆ A      │
│ 3    ┆ spam  ┆ 0.263315 ┆ B      │
│ null ┆ egg   ┆ 0.533739 ┆ C      │
│ 5    ┆ null  ┆ 0.014575 ┆ B      │
└──────┴───────┴──────────┴────────┘

Select

In the select context the selection applies expressions over columns. The expressions in this context must produce Series that are all the same length or have a length of 1.

A Series of a length of 1 will be broadcasted to match the height of the DataFrame. Note that a select may produce new columns that are aggregations, combinations of expressions, or literals.

select

out = df.select(
    pl.sum("nrs"),
    pl.col("names").sort(),
    pl.col("names").first().alias("first name"),
    (pl.mean("nrs") * 10).alias("10xnrs"),
)
print(out)

select

let out = df
    .clone()
    .lazy()
    .select([
        sum("nrs"),
        col("names").sort(false),
        col("names").first().alias("first name"),
        (mean("nrs") * lit(10)).alias("10xnrs"),
    ])
    .collect()?;
println!("{}", out);

shape: (5, 4)
┌─────┬───────┬────────────┬────────┐
│ nrs ┆ names ┆ first name ┆ 10xnrs │
│ --- ┆ ---   ┆ ---        ┆ ---    │
│ i64 ┆ str   ┆ str        ┆ f64    │
╞═════╪═══════╪════════════╪════════╡
│ 11  ┆ null  ┆ foo        ┆ 27.5   │
│ 11  ┆ egg   ┆ foo        ┆ 27.5   │
│ 11  ┆ foo   ┆ foo        ┆ 27.5   │
│ 11  ┆ ham   ┆ foo        ┆ 27.5   │
│ 11  ┆ spam  ┆ foo        ┆ 27.5   │
└─────┴───────┴────────────┴────────┘

As you can see from the query the select context is very powerful and allows you to perform arbitrary expressions independent (and in parallel) of each other.

Similarly to the select statement there is the with_columns statement which also is an entrance to the selection context. The main difference is that with_columns retains the original columns and adds new ones while select drops the original columns.

with_columns

df = df.with_columns(
    pl.sum("nrs").alias("nrs_sum"),
    pl.col("random").count().alias("count"),
)
print(df)

with_columns

let out = df
    .clone()
    .lazy()
    .with_columns([
        sum("nrs").alias("nrs_sum"),
        col("random").count().alias("count"),
    ])
    .collect()?;
println!("{}", out);

shape: (5, 6)
┌──────┬───────┬──────────┬────────┬─────────┬───────┐
│ nrs  ┆ names ┆ random   ┆ groups ┆ nrs_sum ┆ count │
│ ---  ┆ ---   ┆ ---      ┆ ---    ┆ ---     ┆ ---   │
│ i64  ┆ str   ┆ f64      ┆ str    ┆ i64     ┆ u32   │
╞══════╪═══════╪══════════╪════════╪═════════╪═══════╡
│ 1    ┆ foo   ┆ 0.154163 ┆ A      ┆ 11      ┆ 5     │
│ 2    ┆ ham   ┆ 0.74005  ┆ A      ┆ 11      ┆ 5     │
│ 3    ┆ spam  ┆ 0.263315 ┆ B      ┆ 11      ┆ 5     │
│ null ┆ egg   ┆ 0.533739 ┆ C      ┆ 11      ┆ 5     │
│ 5    ┆ null  ┆ 0.014575 ┆ B      ┆ 11      ┆ 5     │
└──────┴───────┴──────────┴────────┴─────────┴───────┘

Filter

In the filter context you filter the existing dataframe based on arbitrary expression which evaluates to the Boolean data type.

filter

out = df.filter(pl.col("nrs") > 2)
print(out)

filter

let out = df.clone().lazy().filter(col("nrs").gt(lit(2))).collect()?;
println!("{}", out);

shape: (2, 6)
┌─────┬───────┬──────────┬────────┬─────────┬───────┐
│ nrs ┆ names ┆ random   ┆ groups ┆ nrs_sum ┆ count │
│ --- ┆ ---   ┆ ---      ┆ ---    ┆ ---     ┆ ---   │
│ i64 ┆ str   ┆ f64      ┆ str    ┆ i64     ┆ u32   │
╞═════╪═══════╪══════════╪════════╪═════════╪═══════╡
│ 3   ┆ spam  ┆ 0.263315 ┆ B      ┆ 11      ┆ 5     │
│ 5   ┆ null  ┆ 0.014575 ┆ B      ┆ 11      ┆ 5     │
└─────┴───────┴──────────┴────────┴─────────┴───────┘

Group by / aggregation

In the group_by context, expressions work on groups and thus may yield results of any length (a group may have many members).

group_by

out = df.group_by("groups").agg(
    pl.sum("nrs"),  # sum nrs by groups
    pl.col("random").count().alias("count"),  # count group members
    # sum random where name != null
    pl.col("random").filter(pl.col("names").is_not_null()).sum().suffix("_sum"),
    pl.col("names").reverse().alias("reversed names"),
)
print(out)

group_by

let out = df
    .lazy()
    .group_by([col("groups")])
    .agg([
        sum("nrs"),                           // sum nrs by groups
        col("random").count().alias("count"), // count group members
        // sum random where name != null
        col("random")
            .filter(col("names").is_not_null())
            .sum()
            .suffix("_sum"),
        col("names").reverse().alias("reversed names"),
    ])
    .collect()?;
println!("{}", out);

shape: (3, 5)
┌────────┬─────┬───────┬────────────┬────────────────┐
│ groups ┆ nrs ┆ count ┆ random_sum ┆ reversed names │
│ ---    ┆ --- ┆ ---   ┆ ---        ┆ ---            │
│ str    ┆ i64 ┆ u32   ┆ f64        ┆ list[str]      │
╞════════╪═════╪═══════╪════════════╪════════════════╡
│ A      ┆ 3   ┆ 2     ┆ 0.894213   ┆ ["ham", "foo"] │
│ B      ┆ 8   ┆ 2     ┆ 0.263315   ┆ [null, "spam"] │
│ C      ┆ 0   ┆ 1     ┆ 0.533739   ┆ ["egg"]        │
└────────┴─────┴───────┴────────────┴────────────────┘

As you can see from the result all expressions are applied to the group defined by the group_by context. Besides the standard group_by, group_by_dynamic, and group_by_rolling are also entrances to the group by context.


  1. There are additional List and SQL contexts which are covered later in this guide. But for simplicity, we leave them out of scope for now.