Expression contexts
You cannot use an expression just anywhere. An expression needs a context; the available contexts are:
- selection:
df.select([..])
- groupby aggregation:
df.groupby(..).agg([..])
- hstack / add columns:
df.with_columns([..])
Syntactic sugar
The reason for such a context is that you are actually using the Polars lazy API, even when you work in eager mode. For instance, this snippet:
df.groupby("foo").agg([pl.col("bar").sum()])
actually desugars to:
(df.lazy().groupby("foo").agg([pl.col("bar").sum()])).collect()
This allows Polars to push the expression into the query engine, do optimizations, and cache intermediate results.
Select context
In the select context, the selection applies expressions over columns. The expressions in this context must produce Series that are all the same length or have a length of 1.
A Series of length 1 will be broadcast to match the height of the DataFrame.
Note that a select may produce new columns that are aggregations, combinations of expressions, or literals.
Selection context
out = df.select(
    [
        pl.sum("nrs"),
        pl.col("names").sort(),
        pl.col("names").first().alias("first name"),
        (pl.mean("nrs") * 10).alias("10xnrs"),
    ]
)
print(out)
shape: (5, 4)
┌─────┬───────┬────────────┬────────┐
│ nrs ┆ names ┆ first name ┆ 10xnrs │
│ --- ┆ ---   ┆ ---        ┆ ---    │
│ i64 ┆ str   ┆ str        ┆ f64    │
╞═════╪═══════╪════════════╪════════╡
│ 11  ┆ null  ┆ foo        ┆ 27.5   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 11  ┆ egg   ┆ foo        ┆ 27.5   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 11  ┆ foo   ┆ foo        ┆ 27.5   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 11  ┆ ham   ┆ foo        ┆ 27.5   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 11  ┆ spam  ┆ foo        ┆ 27.5   │
└─────┴───────┴────────────┴────────┘
Add columns
Adding columns to a DataFrame using with_columns is also the selection context.
df = df.with_columns(
    [
        pl.sum("nrs").alias("nrs_sum"),
        pl.col("random").count().alias("count"),
    ]
)
print(df)
shape: (5, 6)
┌──────┬───────┬──────────┬────────┬─────────┬───────┐
│ nrs  ┆ names ┆ random   ┆ groups ┆ nrs_sum ┆ count │
│ ---  ┆ ---   ┆ ---      ┆ ---    ┆ ---     ┆ ---   │
│ i64  ┆ str   ┆ f64      ┆ str    ┆ i64     ┆ u32   │
╞══════╪═══════╪══════════╪════════╪═════════╪═══════╡
│ 1    ┆ foo   ┆ 0.154163 ┆ A      ┆ 11      ┆ 5     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2    ┆ ham   ┆ 0.74     ┆ A      ┆ 11      ┆ 5     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3    ┆ spam  ┆ 0.263315 ┆ B      ┆ 11      ┆ 5     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ egg   ┆ 0.533739 ┆ C      ┆ 11      ┆ 5     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5    ┆ null  ┆ 0.014575 ┆ B      ┆ 11      ┆ 5     │
└──────┴───────┴──────────┴────────┴─────────┴───────┘
Groupby context
In the groupby context, expressions work on groups and thus may yield results of any length (a group may have many members).
out = df.groupby("groups").agg(
    [
        pl.sum("nrs"),  # sum nrs by groups
        pl.col("random").count().alias("count"),  # count group members
        # sum random where name != null
        pl.col("random").filter(pl.col("names").is_not_null()).sum().suffix("_sum"),
        pl.col("names").reverse().alias("reversed names"),
    ]
)
print(out)
shape: (3, 5)
┌────────┬──────┬───────┬────────────┬────────────────┐
│ groups ┆ nrs  ┆ count ┆ random_sum ┆ reversed names │
│ ---    ┆ ---  ┆ ---   ┆ ---        ┆ ---            │
│ str    ┆ i64  ┆ u32   ┆ f64        ┆ list [str]     │
╞════════╪══════╪═══════╪════════════╪════════════════╡
│ B      ┆ 8    ┆ 2     ┆ 0.263315   ┆ [null, "spam"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ C      ┆ null ┆ 1     ┆ 0.533739   ┆ ["egg"]        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ A      ┆ 3    ┆ 2     ┆ 0.894213   ┆ ["ham", "foo"] │
└────────┴──────┴───────┴────────────┴────────────────┘
Besides the standard groupby, groupby_dynamic and groupby_rolling are also entrances to the groupby context.