Expressions
Expressions
are the core strength of Polars
. The expressions
offer a versatile structure that both solves easy queries and is easily extended to complex ones. Below we will cover the basic components that serve as building block (or in Polars
terminology contexts) for all your queries:
select
filter
with_columns
group_by
To learn more about expressions and the context in which they operate, see the User Guide sections: Contexts and Expressions.
Select statement
To select a column we need to do two things. Define the DataFrame
we want the data from. And second, select the data that we need. In the example below you see that we select col('*')
. The asterisk stands for all columns.
shape: (8, 4)
┌─────┬──────────┬─────────────────────┬───────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[μs] ┆ f64 │
╞═════╪══════════╪═════════════════════╪═══════╡
│ 0 ┆ 0.267969 ┆ 2022-12-01 00:00:00 ┆ 1.0 │
│ 1 ┆ 0.900438 ┆ 2022-12-02 00:00:00 ┆ 2.0 │
│ 2 ┆ 0.405714 ┆ 2022-12-03 00:00:00 ┆ NaN │
│ 3 ┆ 0.706133 ┆ 2022-12-04 00:00:00 ┆ NaN │
│ 4 ┆ 0.394016 ┆ 2022-12-05 00:00:00 ┆ 0.0 │
│ 5 ┆ 0.500425 ┆ 2022-12-06 00:00:00 ┆ -5.0 │
│ 6 ┆ 0.493089 ┆ 2022-12-07 00:00:00 ┆ -42.0 │
│ 7 ┆ 0.187685 ┆ 2022-12-08 00:00:00 ┆ null │
└─────┴──────────┴─────────────────────┴───────┘
You can also specify the specific columns that you want to return. There are two ways to do this. The first option is to create a list
of column names, as seen below.
shape: (8, 2)
┌─────┬──────────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪══════════╡
│ 0 ┆ 0.267969 │
│ 1 ┆ 0.900438 │
│ 2 ┆ 0.405714 │
│ 3 ┆ 0.706133 │
│ 4 ┆ 0.394016 │
│ 5 ┆ 0.500425 │
│ 6 ┆ 0.493089 │
│ 7 ┆ 0.187685 │
└─────┴──────────┘
The second option is to specify each column within a list
in the select
statement. This option is shown below.
shape: (3, 2)
┌─────┬──────────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪══════════╡
│ 0 ┆ 0.267969 │
│ 1 ┆ 0.900438 │
│ 2 ┆ 0.405714 │
└─────┴──────────┘
If you want to exclude an entire column from your view, you can simply use exclude
in your select
statement.
shape: (8, 3)
┌──────────┬─────────────────────┬───────┐
│ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- │
│ f64 ┆ datetime[μs] ┆ f64 │
╞══════════╪═════════════════════╪═══════╡
│ 0.267969 ┆ 2022-12-01 00:00:00 ┆ 1.0 │
│ 0.900438 ┆ 2022-12-02 00:00:00 ┆ 2.0 │
│ 0.405714 ┆ 2022-12-03 00:00:00 ┆ NaN │
│ 0.706133 ┆ 2022-12-04 00:00:00 ┆ NaN │
│ 0.394016 ┆ 2022-12-05 00:00:00 ┆ 0.0 │
│ 0.500425 ┆ 2022-12-06 00:00:00 ┆ -5.0 │
│ 0.493089 ┆ 2022-12-07 00:00:00 ┆ -42.0 │
│ 0.187685 ┆ 2022-12-08 00:00:00 ┆ null │
└──────────┴─────────────────────┴───────┘
Filter
The filter
option allows us to create a subset of the DataFrame
. We use the same DataFrame
as earlier and we filter between two specified dates.
df.filter(
pl.col("c").is_between(datetime(2022, 12, 2), datetime(2022, 12, 8)),
)
let start_date = NaiveDate::from_ymd_opt(2022, 12, 2)
.unwrap()
.and_hms_opt(0, 0, 0)
.unwrap();
let end_date = NaiveDate::from_ymd_opt(2022, 12, 8)
.unwrap()
.and_hms_opt(0, 0, 0)
.unwrap();
let out = df
.clone()
.lazy()
.filter(
col("c")
.gt_eq(lit(start_date))
.and(col("c").lt_eq(lit(end_date))),
)
.collect()?;
println!("{}", out);
shape: (7, 4)
┌─────┬──────────┬─────────────────────┬───────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[μs] ┆ f64 │
╞═════╪══════════╪═════════════════════╪═══════╡
│ 1 ┆ 0.900438 ┆ 2022-12-02 00:00:00 ┆ 2.0 │
│ 2 ┆ 0.405714 ┆ 2022-12-03 00:00:00 ┆ NaN │
│ 3 ┆ 0.706133 ┆ 2022-12-04 00:00:00 ┆ NaN │
│ 4 ┆ 0.394016 ┆ 2022-12-05 00:00:00 ┆ 0.0 │
│ 5 ┆ 0.500425 ┆ 2022-12-06 00:00:00 ┆ -5.0 │
│ 6 ┆ 0.493089 ┆ 2022-12-07 00:00:00 ┆ -42.0 │
│ 7 ┆ 0.187685 ┆ 2022-12-08 00:00:00 ┆ null │
└─────┴──────────┴─────────────────────┴───────┘
With filter
you can also create more complex filters that include multiple columns.
shape: (2, 4)
┌─────┬──────────┬─────────────────────┬─────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[μs] ┆ f64 │
╞═════╪══════════╪═════════════════════╪═════╡
│ 0 ┆ 0.267969 ┆ 2022-12-01 00:00:00 ┆ 1.0 │
│ 1 ┆ 0.900438 ┆ 2022-12-02 00:00:00 ┆ 2.0 │
└─────┴──────────┴─────────────────────┴─────┘
With_columns
with_columns
allows you to create new columns for your analyses. We create two new columns e
and b+42
. First we sum all values from column b
and store the results in column e
. After that we add 42
to the values of b
. Creating a new column b+42
to store these results.
df.with_columns([pl.col("b").sum().alias("e"), (pl.col("b") + 42).alias("b+42")])
let out = df
.clone()
.lazy()
.with_columns([
col("b").sum().alias("e"),
(col("b") + lit(42)).alias("b+42"),
])
.collect()?;
println!("{}", out);
shape: (8, 6)
┌─────┬──────────┬─────────────────────┬───────┬─────────┬───────────┐
│ a ┆ b ┆ c ┆ d ┆ e ┆ b+42 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[μs] ┆ f64 ┆ f64 ┆ f64 │
╞═════╪══════════╪═════════════════════╪═══════╪═════════╪═══════════╡
│ 0 ┆ 0.267969 ┆ 2022-12-01 00:00:00 ┆ 1.0 ┆ 3.85547 ┆ 42.267969 │
│ 1 ┆ 0.900438 ┆ 2022-12-02 00:00:00 ┆ 2.0 ┆ 3.85547 ┆ 42.900438 │
│ 2 ┆ 0.405714 ┆ 2022-12-03 00:00:00 ┆ NaN ┆ 3.85547 ┆ 42.405714 │
│ 3 ┆ 0.706133 ┆ 2022-12-04 00:00:00 ┆ NaN ┆ 3.85547 ┆ 42.706133 │
│ 4 ┆ 0.394016 ┆ 2022-12-05 00:00:00 ┆ 0.0 ┆ 3.85547 ┆ 42.394016 │
│ 5 ┆ 0.500425 ┆ 2022-12-06 00:00:00 ┆ -5.0 ┆ 3.85547 ┆ 42.500425 │
│ 6 ┆ 0.493089 ┆ 2022-12-07 00:00:00 ┆ -42.0 ┆ 3.85547 ┆ 42.493089 │
│ 7 ┆ 0.187685 ┆ 2022-12-08 00:00:00 ┆ null ┆ 3.85547 ┆ 42.187685 │
└─────┴──────────┴─────────────────────┴───────┴─────────┴───────────┘
Group by
We will create a new DataFrame
for the Group by functionality. This new DataFrame
will include several 'groups' that we want to group by.
shape: (8, 2)
┌─────┬─────┐
│ x ┆ y │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 0 ┆ A │
│ 1 ┆ A │
│ 2 ┆ A │
│ 3 ┆ B │
│ 4 ┆ B │
│ 5 ┆ C │
│ 6 ┆ X │
│ 7 ┆ X │
└─────┴─────┘
shape: (4, 2)
┌─────┬───────┐
│ y ┆ count │
│ --- ┆ --- │
│ str ┆ u32 │
╞═════╪═══════╡
│ A ┆ 3 │
│ B ┆ 2 │
│ C ┆ 1 │
│ X ┆ 2 │
└─────┴───────┘
shape: (4, 3)
┌─────┬───────┬─────┐
│ y ┆ count ┆ sum │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ i64 │
╞═════╪═══════╪═════╡
│ A ┆ 3 ┆ 3 │
│ B ┆ 2 ┆ 7 │
│ C ┆ 1 ┆ 5 │
│ X ┆ 2 ┆ 13 │
└─────┴───────┴─────┘
Combining operations
Below are some examples on how to combine operations to create the DataFrame
you require.
df_x = df.with_columns((pl.col("a") * pl.col("b")).alias("a * b")).select(
[pl.all().exclude(["c", "d"])]
)
print(df_x)
let out = df
.clone()
.lazy()
.with_columns([(col("a") * col("b")).alias("a * b")])
.select([col("*").exclude(["c", "d"])])
.collect()?;
println!("{}", out);
shape: (8, 3)
┌─────┬──────────┬──────────┐
│ a ┆ b ┆ a * b │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 │
╞═════╪══════════╪══════════╡
│ 0 ┆ 0.267969 ┆ 0.0 │
│ 1 ┆ 0.900438 ┆ 0.900438 │
│ 2 ┆ 0.405714 ┆ 0.811429 │
│ 3 ┆ 0.706133 ┆ 2.118399 │
│ 4 ┆ 0.394016 ┆ 1.576064 │
│ 5 ┆ 0.500425 ┆ 2.502127 │
│ 6 ┆ 0.493089 ┆ 2.958534 │
│ 7 ┆ 0.187685 ┆ 1.313797 │
└─────┴──────────┴──────────┘
df_y = df.with_columns([(pl.col("a") * pl.col("b")).alias("a * b")]).select(
[pl.all().exclude("d")]
)
print(df_y)
let out = df
.clone()
.lazy()
.with_columns([(col("a") * col("b")).alias("a * b")])
.select([col("*").exclude(["d"])])
.collect()?;
println!("{}", out);
shape: (8, 4)
┌─────┬──────────┬─────────────────────┬──────────┐
│ a ┆ b ┆ c ┆ a * b │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[μs] ┆ f64 │
╞═════╪══════════╪═════════════════════╪══════════╡
│ 0 ┆ 0.267969 ┆ 2022-12-01 00:00:00 ┆ 0.0 │
│ 1 ┆ 0.900438 ┆ 2022-12-02 00:00:00 ┆ 0.900438 │
│ 2 ┆ 0.405714 ┆ 2022-12-03 00:00:00 ┆ 0.811429 │
│ 3 ┆ 0.706133 ┆ 2022-12-04 00:00:00 ┆ 2.118399 │
│ 4 ┆ 0.394016 ┆ 2022-12-05 00:00:00 ┆ 1.576064 │
│ 5 ┆ 0.500425 ┆ 2022-12-06 00:00:00 ┆ 2.502127 │
│ 6 ┆ 0.493089 ┆ 2022-12-07 00:00:00 ┆ 2.958534 │
│ 7 ┆ 0.187685 ┆ 2022-12-08 00:00:00 ┆ 1.313797 │
└─────┴──────────┴─────────────────────┴──────────┘