Contexts
Polars has developed its own Domain Specific Language (DSL) for transforming data. The language is very easy to use and allows for complex queries that remain human readable. The two core components of the language are Contexts and Expressions, the latter we will cover in the next section.
A context, as implied by the name, refers to the context in which an expression needs to be evaluated. There are three main contexts 1:
- Selection:
df.select([..])
,df.with_columns([..])
- Filtering:
df.filter()
- Group by / Aggregation:
df.group_by(..).agg([..])
The examples below are performed on the following DataFrame
:
df = pl.DataFrame(
{
"nrs": [1, 2, 3, None, 5],
"names": ["foo", "ham", "spam", "egg", None],
"random": np.random.rand(5),
"groups": ["A", "A", "B", "C", "B"],
}
)
print(df)
use rand::{thread_rng, Rng};
let mut arr = [0f64; 5];
thread_rng().fill(&mut arr);
let df = df! (
"nrs" => &[Some(1), Some(2), Some(3), None, Some(5)],
"names" => &[Some("foo"), Some("ham"), Some("spam"), Some("eggs"), None],
"random" => &arr,
"groups" => &["A", "A", "B", "C", "B"],
)?;
println!("{}", &df);
shape: (5, 4)
┌──────┬───────┬──────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str │
╞══════╪═══════╪══════════╪════════╡
│ 1 ┆ foo ┆ 0.154163 ┆ A │
│ 2 ┆ ham ┆ 0.74005 ┆ A │
│ 3 ┆ spam ┆ 0.263315 ┆ B │
│ null ┆ egg ┆ 0.533739 ┆ C │
│ 5 ┆ null ┆ 0.014575 ┆ B │
└──────┴───────┴──────────┴────────┘
Select
In the select
context the selection applies expressions over columns. The expressions in this context must produce Series
that are all the same length or have a length of 1.
A Series
of a length of 1 will be broadcasted to match the height of the DataFrame
. Note that a select may produce new columns that are aggregations, combinations of expressions, or literals.
out = df.select(
pl.sum("nrs"),
pl.col("names").sort(),
pl.col("names").first().alias("first name"),
(pl.mean("nrs") * 10).alias("10xnrs"),
)
print(out)
let out = df
.clone()
.lazy()
.select([
sum("nrs"),
col("names").sort(false),
col("names").first().alias("first name"),
(mean("nrs") * lit(10)).alias("10xnrs"),
])
.collect()?;
println!("{}", out);
shape: (5, 4)
┌─────┬───────┬────────────┬────────┐
│ nrs ┆ names ┆ first name ┆ 10xnrs │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ f64 │
╞═════╪═══════╪════════════╪════════╡
│ 11 ┆ null ┆ foo ┆ 27.5 │
│ 11 ┆ egg ┆ foo ┆ 27.5 │
│ 11 ┆ foo ┆ foo ┆ 27.5 │
│ 11 ┆ ham ┆ foo ┆ 27.5 │
│ 11 ┆ spam ┆ foo ┆ 27.5 │
└─────┴───────┴────────────┴────────┘
As you can see from the query the select
context is very powerful and allows you to perform arbitrary expressions independent (and in parallel) of each other.
Similarly to the select
statement there is the with_columns
statement which also is an entrance to the selection context. The main difference is that with_columns
retains the original columns and adds new ones while select
drops the original columns.
df = df.with_columns(
pl.sum("nrs").alias("nrs_sum"),
pl.col("random").count().alias("count"),
)
print(df)
let out = df
.clone()
.lazy()
.with_columns([
sum("nrs").alias("nrs_sum"),
col("random").count().alias("count"),
])
.collect()?;
println!("{}", out);
shape: (5, 6)
┌──────┬───────┬──────────┬────────┬─────────┬───────┐
│ nrs ┆ names ┆ random ┆ groups ┆ nrs_sum ┆ count │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str ┆ i64 ┆ u32 │
╞══════╪═══════╪══════════╪════════╪═════════╪═══════╡
│ 1 ┆ foo ┆ 0.154163 ┆ A ┆ 11 ┆ 5 │
│ 2 ┆ ham ┆ 0.74005 ┆ A ┆ 11 ┆ 5 │
│ 3 ┆ spam ┆ 0.263315 ┆ B ┆ 11 ┆ 5 │
│ null ┆ egg ┆ 0.533739 ┆ C ┆ 11 ┆ 5 │
│ 5 ┆ null ┆ 0.014575 ┆ B ┆ 11 ┆ 5 │
└──────┴───────┴──────────┴────────┴─────────┴───────┘
Filter
In the filter
context you filter the existing dataframe based on arbitrary expression which evaluates to the Boolean
data type.
shape: (2, 6)
┌─────┬───────┬──────────┬────────┬─────────┬───────┐
│ nrs ┆ names ┆ random ┆ groups ┆ nrs_sum ┆ count │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str ┆ i64 ┆ u32 │
╞═════╪═══════╪══════════╪════════╪═════════╪═══════╡
│ 3 ┆ spam ┆ 0.263315 ┆ B ┆ 11 ┆ 5 │
│ 5 ┆ null ┆ 0.014575 ┆ B ┆ 11 ┆ 5 │
└─────┴───────┴──────────┴────────┴─────────┴───────┘
Group by / aggregation
In the group_by
context, expressions work on groups and thus may yield results of any length (a group may have many members).
out = df.group_by("groups").agg(
pl.sum("nrs"), # sum nrs by groups
pl.col("random").count().alias("count"), # count group members
# sum random where name != null
pl.col("random").filter(pl.col("names").is_not_null()).sum().suffix("_sum"),
pl.col("names").reverse().alias("reversed names"),
)
print(out)
let out = df
.lazy()
.group_by([col("groups")])
.agg([
sum("nrs"), // sum nrs by groups
col("random").count().alias("count"), // count group members
// sum random where name != null
col("random")
.filter(col("names").is_not_null())
.sum()
.suffix("_sum"),
col("names").reverse().alias("reversed names"),
])
.collect()?;
println!("{}", out);
shape: (3, 5)
┌────────┬─────┬───────┬────────────┬────────────────┐
│ groups ┆ nrs ┆ count ┆ random_sum ┆ reversed names │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u32 ┆ f64 ┆ list[str] │
╞════════╪═════╪═══════╪════════════╪════════════════╡
│ A ┆ 3 ┆ 2 ┆ 0.894213 ┆ ["ham", "foo"] │
│ B ┆ 8 ┆ 2 ┆ 0.263315 ┆ [null, "spam"] │
│ C ┆ 0 ┆ 1 ┆ 0.533739 ┆ ["egg"] │
└────────┴─────┴───────┴────────────┴────────────────┘
As you can see from the result all expressions are applied to the group defined by the group_by
context. Besides the standard group_by
, group_by_dynamic
, and group_by_rolling
are also entrances to the group by context.
-
There are additional List and SQL contexts which are covered later in this guide. But for simplicity, we leave them out of scope for now. ↩