Expressions
Expressions are the core strength of Polars. The expressions offer a versatile structure that both solves easy queries and is easily extended to complex ones. Below we will cover the basic components that serve as building block (or in Polars terminology contexts) for all your queries:
selectfilterwith_columnsgroup_by
To learn more about expressions and the context in which they operate, see the User Guide sections: Contexts and Expressions.
Select statement
To select a column we need to do two things. Define the DataFrame we want the data from. And second, select the data that we need. In the example below you see that we select col('*'). The asterisk stands for all columns.
shape: (8, 4)
┌─────┬──────────┬─────────────────────┬───────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[μs] ┆ f64 │
╞═════╪══════════╪═════════════════════╪═══════╡
│ 0 ┆ 0.267969 ┆ 2022-12-01 00:00:00 ┆ 1.0 │
│ 1 ┆ 0.900438 ┆ 2022-12-02 00:00:00 ┆ 2.0 │
│ 2 ┆ 0.405714 ┆ 2022-12-03 00:00:00 ┆ NaN │
│ 3 ┆ 0.706133 ┆ 2022-12-04 00:00:00 ┆ NaN │
│ 4 ┆ 0.394016 ┆ 2022-12-05 00:00:00 ┆ 0.0 │
│ 5 ┆ 0.500425 ┆ 2022-12-06 00:00:00 ┆ -5.0 │
│ 6 ┆ 0.493089 ┆ 2022-12-07 00:00:00 ┆ -42.0 │
│ 7 ┆ 0.187685 ┆ 2022-12-08 00:00:00 ┆ null │
└─────┴──────────┴─────────────────────┴───────┘
You can also specify the specific columns that you want to return. There are two ways to do this. The first option is to create a list of column names, as seen below.
shape: (8, 2)
┌─────┬──────────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪══════════╡
│ 0 ┆ 0.267969 │
│ 1 ┆ 0.900438 │
│ 2 ┆ 0.405714 │
│ 3 ┆ 0.706133 │
│ 4 ┆ 0.394016 │
│ 5 ┆ 0.500425 │
│ 6 ┆ 0.493089 │
│ 7 ┆ 0.187685 │
└─────┴──────────┘
The second option is to specify each column within a list in the select statement. This option is shown below.
shape: (3, 2)
┌─────┬──────────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪══════════╡
│ 0 ┆ 0.267969 │
│ 1 ┆ 0.900438 │
│ 2 ┆ 0.405714 │
└─────┴──────────┘
If you want to exclude an entire column from your view, you can simply use exclude in your select statement.
shape: (8, 3)
┌──────────┬─────────────────────┬───────┐
│ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- │
│ f64 ┆ datetime[μs] ┆ f64 │
╞══════════╪═════════════════════╪═══════╡
│ 0.267969 ┆ 2022-12-01 00:00:00 ┆ 1.0 │
│ 0.900438 ┆ 2022-12-02 00:00:00 ┆ 2.0 │
│ 0.405714 ┆ 2022-12-03 00:00:00 ┆ NaN │
│ 0.706133 ┆ 2022-12-04 00:00:00 ┆ NaN │
│ 0.394016 ┆ 2022-12-05 00:00:00 ┆ 0.0 │
│ 0.500425 ┆ 2022-12-06 00:00:00 ┆ -5.0 │
│ 0.493089 ┆ 2022-12-07 00:00:00 ┆ -42.0 │
│ 0.187685 ┆ 2022-12-08 00:00:00 ┆ null │
└──────────┴─────────────────────┴───────┘
Filter
The filter option allows us to create a subset of the DataFrame. We use the same DataFrame as earlier and we filter between two specified dates.
df.filter(
pl.col("c").is_between(datetime(2022, 12, 2), datetime(2022, 12, 8)),
)
let start_date = NaiveDate::from_ymd_opt(2022, 12, 2)
.unwrap()
.and_hms_opt(0, 0, 0)
.unwrap();
let end_date = NaiveDate::from_ymd_opt(2022, 12, 8)
.unwrap()
.and_hms_opt(0, 0, 0)
.unwrap();
let out = df
.clone()
.lazy()
.filter(
col("c")
.gt_eq(lit(start_date))
.and(col("c").lt_eq(lit(end_date))),
)
.collect()?;
println!("{}", out);
shape: (7, 4)
┌─────┬──────────┬─────────────────────┬───────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[μs] ┆ f64 │
╞═════╪══════════╪═════════════════════╪═══════╡
│ 1 ┆ 0.900438 ┆ 2022-12-02 00:00:00 ┆ 2.0 │
│ 2 ┆ 0.405714 ┆ 2022-12-03 00:00:00 ┆ NaN │
│ 3 ┆ 0.706133 ┆ 2022-12-04 00:00:00 ┆ NaN │
│ 4 ┆ 0.394016 ┆ 2022-12-05 00:00:00 ┆ 0.0 │
│ 5 ┆ 0.500425 ┆ 2022-12-06 00:00:00 ┆ -5.0 │
│ 6 ┆ 0.493089 ┆ 2022-12-07 00:00:00 ┆ -42.0 │
│ 7 ┆ 0.187685 ┆ 2022-12-08 00:00:00 ┆ null │
└─────┴──────────┴─────────────────────┴───────┘
With filter you can also create more complex filters that include multiple columns.
shape: (2, 4)
┌─────┬──────────┬─────────────────────┬─────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[μs] ┆ f64 │
╞═════╪══════════╪═════════════════════╪═════╡
│ 0 ┆ 0.267969 ┆ 2022-12-01 00:00:00 ┆ 1.0 │
│ 1 ┆ 0.900438 ┆ 2022-12-02 00:00:00 ┆ 2.0 │
└─────┴──────────┴─────────────────────┴─────┘
With_columns
with_columns allows you to create new columns for your analyses. We create two new columns e and b+42. First we sum all values from column b and store the results in column e. After that we add 42 to the values of b. Creating a new column b+42 to store these results.
df.with_columns([pl.col("b").sum().alias("e"), (pl.col("b") + 42).alias("b+42")])
let out = df
.clone()
.lazy()
.with_columns([
col("b").sum().alias("e"),
(col("b") + lit(42)).alias("b+42"),
])
.collect()?;
println!("{}", out);
shape: (8, 6)
┌─────┬──────────┬─────────────────────┬───────┬─────────┬───────────┐
│ a ┆ b ┆ c ┆ d ┆ e ┆ b+42 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[μs] ┆ f64 ┆ f64 ┆ f64 │
╞═════╪══════════╪═════════════════════╪═══════╪═════════╪═══════════╡
│ 0 ┆ 0.267969 ┆ 2022-12-01 00:00:00 ┆ 1.0 ┆ 3.85547 ┆ 42.267969 │
│ 1 ┆ 0.900438 ┆ 2022-12-02 00:00:00 ┆ 2.0 ┆ 3.85547 ┆ 42.900438 │
│ 2 ┆ 0.405714 ┆ 2022-12-03 00:00:00 ┆ NaN ┆ 3.85547 ┆ 42.405714 │
│ 3 ┆ 0.706133 ┆ 2022-12-04 00:00:00 ┆ NaN ┆ 3.85547 ┆ 42.706133 │
│ 4 ┆ 0.394016 ┆ 2022-12-05 00:00:00 ┆ 0.0 ┆ 3.85547 ┆ 42.394016 │
│ 5 ┆ 0.500425 ┆ 2022-12-06 00:00:00 ┆ -5.0 ┆ 3.85547 ┆ 42.500425 │
│ 6 ┆ 0.493089 ┆ 2022-12-07 00:00:00 ┆ -42.0 ┆ 3.85547 ┆ 42.493089 │
│ 7 ┆ 0.187685 ┆ 2022-12-08 00:00:00 ┆ null ┆ 3.85547 ┆ 42.187685 │
└─────┴──────────┴─────────────────────┴───────┴─────────┴───────────┘
Group by
We will create a new DataFrame for the Group by functionality. This new DataFrame will include several 'groups' that we want to group by.
shape: (8, 2)
┌─────┬─────┐
│ x ┆ y │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 0 ┆ A │
│ 1 ┆ A │
│ 2 ┆ A │
│ 3 ┆ B │
│ 4 ┆ B │
│ 5 ┆ C │
│ 6 ┆ X │
│ 7 ┆ X │
└─────┴─────┘
shape: (4, 2)
┌─────┬───────┐
│ y ┆ count │
│ --- ┆ --- │
│ str ┆ u32 │
╞═════╪═══════╡
│ A ┆ 3 │
│ B ┆ 2 │
│ C ┆ 1 │
│ X ┆ 2 │
└─────┴───────┘
shape: (4, 3)
┌─────┬───────┬─────┐
│ y ┆ count ┆ sum │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ i64 │
╞═════╪═══════╪═════╡
│ A ┆ 3 ┆ 3 │
│ B ┆ 2 ┆ 7 │
│ C ┆ 1 ┆ 5 │
│ X ┆ 2 ┆ 13 │
└─────┴───────┴─────┘
Combining operations
Below are some examples on how to combine operations to create the DataFrame you require.
df_x = df.with_columns((pl.col("a") * pl.col("b")).alias("a * b")).select(
[pl.all().exclude(["c", "d"])]
)
print(df_x)
let out = df
.clone()
.lazy()
.with_columns([(col("a") * col("b")).alias("a * b")])
.select([col("*").exclude(["c", "d"])])
.collect()?;
println!("{}", out);
shape: (8, 3)
┌─────┬──────────┬──────────┐
│ a ┆ b ┆ a * b │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 │
╞═════╪══════════╪══════════╡
│ 0 ┆ 0.267969 ┆ 0.0 │
│ 1 ┆ 0.900438 ┆ 0.900438 │
│ 2 ┆ 0.405714 ┆ 0.811429 │
│ 3 ┆ 0.706133 ┆ 2.118399 │
│ 4 ┆ 0.394016 ┆ 1.576064 │
│ 5 ┆ 0.500425 ┆ 2.502127 │
│ 6 ┆ 0.493089 ┆ 2.958534 │
│ 7 ┆ 0.187685 ┆ 1.313797 │
└─────┴──────────┴──────────┘
df_y = df.with_columns([(pl.col("a") * pl.col("b")).alias("a * b")]).select(
[pl.all().exclude("d")]
)
print(df_y)
let out = df
.clone()
.lazy()
.with_columns([(col("a") * col("b")).alias("a * b")])
.select([col("*").exclude(["d"])])
.collect()?;
println!("{}", out);
shape: (8, 4)
┌─────┬──────────┬─────────────────────┬──────────┐
│ a ┆ b ┆ c ┆ a * b │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[μs] ┆ f64 │
╞═════╪══════════╪═════════════════════╪══════════╡
│ 0 ┆ 0.267969 ┆ 2022-12-01 00:00:00 ┆ 0.0 │
│ 1 ┆ 0.900438 ┆ 2022-12-02 00:00:00 ┆ 0.900438 │
│ 2 ┆ 0.405714 ┆ 2022-12-03 00:00:00 ┆ 0.811429 │
│ 3 ┆ 0.706133 ┆ 2022-12-04 00:00:00 ┆ 2.118399 │
│ 4 ┆ 0.394016 ┆ 2022-12-05 00:00:00 ┆ 1.576064 │
│ 5 ┆ 0.500425 ┆ 2022-12-06 00:00:00 ┆ 2.502127 │
│ 6 ┆ 0.493089 ┆ 2022-12-07 00:00:00 ┆ 2.958534 │
│ 7 ┆ 0.187685 ┆ 2022-12-08 00:00:00 ┆ 1.313797 │
└─────┴──────────┴─────────────────────┴──────────┘