Contexts

Polars has developed its own Domain Specific Language (DSL) for transforming data. The language is very easy to use and allows for complex queries that remain human readable. The two core components of the language are Contexts and Expressions, the latter of which we will cover in the next section.

A context, as implied by the name, refers to the setting in which an expression is evaluated. There are three main contexts 1:

  1. Selection: df.select([..]), df.with_columns([..])
  2. Filtering: df.filter()
  3. Groupby / Aggregation: df.groupby(..).agg([..])

The examples below are performed on the following DataFrame:

DataFrame

import numpy as np
import polars as pl

df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", None],
        "random": np.random.rand(5),
        "groups": ["A", "A", "B", "C", "B"],
    }
)
print(df)

DataFrame

use polars::prelude::*;
use rand::{thread_rng, Rng};

let mut arr = [0f64; 5];
thread_rng().fill(&mut arr);

let df = df! (
    "nrs" => &[Some(1), Some(2), Some(3), None, Some(5)],
    "names" => &[Some("foo"), Some("ham"), Some("spam"), Some("eggs"), None],
    "random" => &arr,
    "groups" => &["A", "A", "B", "C", "B"],
)?;

println!("{}", &df);

DataFrame

const pl = require("nodejs-polars");
const Chance = require("chance");
const chance = new Chance();

const arr = Array.from({ length: 5 }).map((_) =>
  chance.floating({ min: 0, max: 1 }),
);

let df = pl.DataFrame({
  nrs: [1, 2, 3, null, 5],
  names: ["foo", "ham", "spam", "egg", null],
  random: arr,
  groups: ["A", "A", "B", "C", "B"],
});
console.log(df);

shape: (5, 4)
┌──────┬───────┬──────────┬────────┐
│ nrs  ┆ names ┆ random   ┆ groups │
│ ---  ┆ ---   ┆ ---      ┆ ---    │
│ i64  ┆ str   ┆ f64      ┆ str    │
╞══════╪═══════╪══════════╪════════╡
│ 1    ┆ foo   ┆ 0.154163 ┆ A      │
│ 2    ┆ ham   ┆ 0.74005  ┆ A      │
│ 3    ┆ spam  ┆ 0.263315 ┆ B      │
│ null ┆ egg   ┆ 0.533739 ┆ C      │
│ 5    ┆ null  ┆ 0.014575 ┆ B      │
└──────┴───────┴──────────┴────────┘

Select

In the select context, expressions are applied over columns. The expressions in this context must produce Series that are all the same length or have a length of 1.

A Series of length 1 will be broadcast to match the height of the DataFrame. Note that a select may produce new columns that are aggregations, combinations of expressions, or literals.
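
For instance, here is a minimal Python sketch (reusing the DataFrame above) of this broadcasting rule: pl.lit(0.5) produces a Series of length 1, which is broadcast to the full height of the frame.

out = df.select(
    [
        pl.col("names"),            # length 5
        pl.lit(0.5).alias("half"),  # length 1, broadcast to 5
    ]
)
print(out)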

select

out = df.select(
    [
        pl.sum("nrs"),
        pl.col("names").sort(),
        pl.col("names").first().alias("first name"),
        (pl.mean("nrs") * 10).alias("10xnrs"),
    ]
)
print(out)

select

let out = df
    .clone()
    .lazy()
    .select([
        sum("nrs"),
        col("names").sort(false),
        col("names").first().alias("first name"),
        (mean("nrs") * lit(10)).alias("10xnrs"),
    ])
    .collect()?;
println!("{}", out);

select

let out = df.select(
  pl.col("nrs").sum(),
  pl.col("names").sort(),
  pl.col("names").first().alias("first name"),
  pl.mean("nrs").multiplyBy(10).alias("10xnrs"),
);
console.log(out);

shape: (5, 4)
┌─────┬───────┬────────────┬────────┐
│ nrs ┆ names ┆ first name ┆ 10xnrs │
│ --- ┆ ---   ┆ ---        ┆ ---    │
│ i64 ┆ str   ┆ str        ┆ f64    │
╞═════╪═══════╪════════════╪════════╡
│ 11  ┆ null  ┆ foo        ┆ 27.5   │
│ 11  ┆ egg   ┆ foo        ┆ 27.5   │
│ 11  ┆ foo   ┆ foo        ┆ 27.5   │
│ 11  ┆ ham   ┆ foo        ┆ 27.5   │
│ 11  ┆ spam  ┆ foo        ┆ 27.5   │
└─────┴───────┴────────────┴────────┘

As you can see from the query, the select context is very powerful and allows you to evaluate arbitrary expressions independently of (and in parallel with) each other.

Similar to the select statement, there is the with_columns statement, which is also an entry point into the selection context. The main difference is that with_columns retains the original columns and adds new ones, while select drops the original columns.
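
To make the difference concrete, here is a minimal Python sketch (reusing the DataFrame above) that passes the same expression to both entry points and inspects only the resulting column names:

out_select = df.select([pl.sum("nrs").alias("nrs_sum")])
out_with_columns = df.with_columns([pl.sum("nrs").alias("nrs_sum")])
print(out_select.columns)        # ['nrs_sum']
print(out_with_columns.columns)  # ['nrs', 'names', 'random', 'groups', 'nrs_sum']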

with_columns

df = df.with_columns(
    [
        pl.sum("nrs").alias("nrs_sum"),
        pl.col("random").count().alias("count"),
    ]
)
print(df)

with_columns

let out = df
    .clone()
    .lazy()
    .with_columns([
        sum("nrs").alias("nrs_sum"),
        col("random").count().alias("count"),
    ])
    .collect()?;
println!("{}", out);

withColumns

df = df.withColumns(
  pl.col("nrs").sum().alias("nrs_sum"),
  pl.col("random").count().alias("count"),
);

console.log(df);

shape: (5, 6)
┌──────┬───────┬──────────┬────────┬─────────┬───────┐
│ nrs  ┆ names ┆ random   ┆ groups ┆ nrs_sum ┆ count │
│ ---  ┆ ---   ┆ ---      ┆ ---    ┆ ---     ┆ ---   │
│ i64  ┆ str   ┆ f64      ┆ str    ┆ i64     ┆ u32   │
╞══════╪═══════╪══════════╪════════╪═════════╪═══════╡
│ 1    ┆ foo   ┆ 0.154163 ┆ A      ┆ 11      ┆ 5     │
│ 2    ┆ ham   ┆ 0.74005  ┆ A      ┆ 11      ┆ 5     │
│ 3    ┆ spam  ┆ 0.263315 ┆ B      ┆ 11      ┆ 5     │
│ null ┆ egg   ┆ 0.533739 ┆ C      ┆ 11      ┆ 5     │
│ 5    ┆ null  ┆ 0.014575 ┆ B      ┆ 11      ┆ 5     │
└──────┴───────┴──────────┴────────┴─────────┴───────┘

Filter

In the filter context you filter the existing DataFrame based on an arbitrary expression that evaluates to the Boolean data type.

filter

out = df.filter(pl.col("nrs") > 2)
print(out)

filter

let out = df.clone().lazy().filter(col("nrs").gt(lit(2))).collect()?;
println!("{}", out);

filter

out = df.filter(pl.col("nrs").gt(2));
console.log(out);

shape: (2, 6)
┌─────┬───────┬──────────┬────────┬─────────┬───────┐
│ nrs ┆ names ┆ random   ┆ groups ┆ nrs_sum ┆ count │
│ --- ┆ ---   ┆ ---      ┆ ---    ┆ ---     ┆ ---   │
│ i64 ┆ str   ┆ f64      ┆ str    ┆ i64     ┆ u32   │
╞═════╪═══════╪══════════╪════════╪═════════╪═══════╡
│ 3   ┆ spam  ┆ 0.263315 ┆ B      ┆ 11      ┆ 5     │
│ 5   ┆ null  ┆ 0.014575 ┆ B      ┆ 11      ┆ 5     │
└─────┴───────┴──────────┴────────┴─────────┴───────┘

Groupby / Aggregation

In the groupby context expressions work on groups and thus may yield results of any length (a group may have many members).

groupby

out = df.groupby("groups").agg(
    [
        pl.sum("nrs"),  # sum nrs by groups
        pl.col("random").count().alias("count"),  # count group members
        # sum random where name != null
        pl.col("random").filter(pl.col("names").is_not_null()).sum().suffix("_sum"),
        pl.col("names").reverse().alias(("reversed names")),
    ]
)
print(out)

groupby

let out = df
    .lazy()
    .groupby([col("groups")])
    .agg([
        sum("nrs"),                           // sum nrs by groups
        col("random").count().alias("count"), // count group members
        // sum random where name != null
        col("random")
            .filter(col("names").is_not_null())
            .sum()
            .suffix("_sum"),
        col("names").reverse().alias("reversed names"),
    ])
    .collect()?;
println!("{}", out);

groupBy

out = df.groupBy("groups").agg(
  pl.col("nrs").sum(), // sum nrs by groups
  pl.col("random").count().alias("count"), // count group members
  // sum random where name != null
  pl.col("random").filter(pl.col("names").isNotNull()).sum().suffix("_sum"),
  pl.col("names").reverse().alias("reversed names"),
);
console.log(out);

shape: (3, 5)
┌────────┬──────┬───────┬────────────┬────────────────┐
│ groups ┆ nrs  ┆ count ┆ random_sum ┆ reversed names │
│ ---    ┆ ---  ┆ ---   ┆ ---        ┆ ---            │
│ str    ┆ i64  ┆ u32   ┆ f64        ┆ list[str]      │
╞════════╪══════╪═══════╪════════════╪════════════════╡
│ B      ┆ 8    ┆ 2     ┆ 0.263315   ┆ [null, "spam"] │
│ C      ┆ null ┆ 1     ┆ 0.533739   ┆ ["egg"]        │
│ A      ┆ 3    ┆ 2     ┆ 0.894213   ┆ ["ham", "foo"] │
└────────┴──────┴───────┴────────────┴────────────────┘

As you can see from the result, all expressions are applied to the group defined by the groupby context. Besides the standard groupby, groupby_dynamic and groupby_rolling are also entry points into the groupby context.
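
As an illustrative Python sketch of groupby_dynamic (using a small, made-up time series rather than the DataFrame above; in recent Polars versions the method is spelled group_by_dynamic), rows are grouped into dynamic windows of one day and the values are summed per window:

from datetime import datetime

import polars as pl

ts = pl.DataFrame(
    {
        "time": [
            datetime(2021, 1, 1, 0),
            datetime(2021, 1, 1, 12),
            datetime(2021, 1, 2, 0),
            datetime(2021, 1, 2, 12),
            datetime(2021, 1, 3, 0),
        ],
        "value": [1, 2, 3, 4, 5],
    }
)

# groupby_dynamic expects the index column to be sorted, so sort on "time" first,
# then sum "value" over daily windows.
out = ts.sort("time").groupby_dynamic("time", every="1d").agg(pl.col("value").sum())
print(out)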


  1. There are additional List and SQL contexts, which are covered later in this guide. For simplicity, we leave them out of scope for now.