Skip to content

Expressions

Expressions are the core strength of Polars. The expressions offer a versatile structure that both solves easy queries and is easily extended to complex ones. Below we will cover the basic components that serve as building block (or in Polars terminology contexts) for all your queries:

  • select
  • filter
  • with_columns
  • groupby

To learn more about expressions and the context in which they operate, see the User Guide sections: Contexts and Expressions.

Select statement

To select a column we need to do two things. Define the DataFrame we want the data from. And second, select the data that we need. In the example below you see that we select col('*'). The asterisk stands for all columns.

select

df.select(pl.col("*"))

select

let out = df.clone().lazy().select([col("*")]).collect()?;
println!("{}",out);

select

df.select(pl.col("*"));

shape: (8, 4)
┌─────┬──────────┬─────────────────────┬───────┐
│ a   ┆ b        ┆ c                   ┆ d     │
│ --- ┆ ---      ┆ ---                 ┆ ---   │
│ i64 ┆ f64      ┆ datetime[μs]        ┆ f64   │
╞═════╪══════════╪═════════════════════╪═══════╡
│ 0   ┆ 0.449507 ┆ 2022-12-01 00:00:00 ┆ 1.0   │
│ 1   ┆ 0.064056 ┆ 2022-12-02 00:00:00 ┆ 2.0   │
│ 2   ┆ 0.012581 ┆ 2022-12-03 00:00:00 ┆ NaN   │
│ 3   ┆ 0.802239 ┆ 2022-12-04 00:00:00 ┆ NaN   │
│ 4   ┆ 0.946181 ┆ 2022-12-05 00:00:00 ┆ 0.0   │
│ 5   ┆ 0.157343 ┆ 2022-12-06 00:00:00 ┆ -5.0  │
│ 6   ┆ 0.117897 ┆ 2022-12-07 00:00:00 ┆ -42.0 │
│ 7   ┆ 0.765403 ┆ 2022-12-08 00:00:00 ┆ null  │
└─────┴──────────┴─────────────────────┴───────┘

You can also specify the specific columns that you want to return. There are two ways to do this. The first option is to create a list of column names, as seen below.

select

df.select(pl.col(["a", "b"]))

select

let out = df.clone().lazy().select([col("a"), col("b")]).collect()?;
println!("{}",out);

select

df.select(pl.col(["a", "b"]));

shape: (8, 2)
┌─────┬──────────┐
│ a   ┆ b        │
│ --- ┆ ---      │
│ i64 ┆ f64      │
╞═════╪══════════╡
│ 0   ┆ 0.449507 │
│ 1   ┆ 0.064056 │
│ 2   ┆ 0.012581 │
│ 3   ┆ 0.802239 │
│ 4   ┆ 0.946181 │
│ 5   ┆ 0.157343 │
│ 6   ┆ 0.117897 │
│ 7   ┆ 0.765403 │
└─────┴──────────┘

The second option is to specify each column within a list in the select statement. This option is shown below.

select

df.select([pl.col("a"), pl.col("b")]).limit(3)

select

let out = df.clone().lazy().select([col("a"), col("b")]).limit(3).collect()?;
println!("{}",out);

select

df.select([pl.col("a"), pl.col("b")]).limit(3);

shape: (3, 2)
┌─────┬──────────┐
│ a   ┆ b        │
│ --- ┆ ---      │
│ i64 ┆ f64      │
╞═════╪══════════╡
│ 0   ┆ 0.449507 │
│ 1   ┆ 0.064056 │
│ 2   ┆ 0.012581 │
└─────┴──────────┘

If you want to exclude an entire column from your view, you can simply use exclude in your select statement.

select

df.select([pl.exclude("a")])

select

let out = df.clone().lazy().select([col("*").exclude(["a"])]).collect()?;
println!("{}",out);

select

df.select([pl.exclude("a")]);

shape: (8, 3)
┌──────────┬─────────────────────┬───────┐
│ b        ┆ c                   ┆ d     │
│ ---      ┆ ---                 ┆ ---   │
│ f64      ┆ datetime[μs]        ┆ f64   │
╞══════════╪═════════════════════╪═══════╡
│ 0.449507 ┆ 2022-12-01 00:00:00 ┆ 1.0   │
│ 0.064056 ┆ 2022-12-02 00:00:00 ┆ 2.0   │
│ 0.012581 ┆ 2022-12-03 00:00:00 ┆ NaN   │
│ 0.802239 ┆ 2022-12-04 00:00:00 ┆ NaN   │
│ 0.946181 ┆ 2022-12-05 00:00:00 ┆ 0.0   │
│ 0.157343 ┆ 2022-12-06 00:00:00 ┆ -5.0  │
│ 0.117897 ┆ 2022-12-07 00:00:00 ┆ -42.0 │
│ 0.765403 ┆ 2022-12-08 00:00:00 ┆ null  │
└──────────┴─────────────────────┴───────┘

Filter

The filter option allows us to create a subset of the DataFrame. We use the same DataFrame as earlier and we filter between two specified dates.

filter

df.filter(
    pl.col("c").is_between(datetime(2022, 12, 2), datetime(2022, 12, 8)),
)

filter

// TODO

filter

df.filter(pl.col("c").gt(new Date(2022, 12, 2)).lt(new Date(2022, 12, 8)));

shape: (7, 4)
┌─────┬──────────┬─────────────────────┬───────┐
│ a   ┆ b        ┆ c                   ┆ d     │
│ --- ┆ ---      ┆ ---                 ┆ ---   │
│ i64 ┆ f64      ┆ datetime[μs]        ┆ f64   │
╞═════╪══════════╪═════════════════════╪═══════╡
│ 1   ┆ 0.064056 ┆ 2022-12-02 00:00:00 ┆ 2.0   │
│ 2   ┆ 0.012581 ┆ 2022-12-03 00:00:00 ┆ NaN   │
│ 3   ┆ 0.802239 ┆ 2022-12-04 00:00:00 ┆ NaN   │
│ 4   ┆ 0.946181 ┆ 2022-12-05 00:00:00 ┆ 0.0   │
│ 5   ┆ 0.157343 ┆ 2022-12-06 00:00:00 ┆ -5.0  │
│ 6   ┆ 0.117897 ┆ 2022-12-07 00:00:00 ┆ -42.0 │
│ 7   ┆ 0.765403 ┆ 2022-12-08 00:00:00 ┆ null  │
└─────┴──────────┴─────────────────────┴───────┘

With filter you can also create more complex filters that include multiple columns.

filter

df.filter((pl.col("a") <= 3) & (pl.col("d").is_not_nan()))

filter

let out = df.clone().lazy().filter(col("a").lt_eq(3).and(col("d").is_not_null())).collect()?;
println!("{}",out);

filter

df.filter(pl.col("a").ltEq(3).and(pl.col("d").isNotNull()));

shape: (2, 4)
┌─────┬──────────┬─────────────────────┬─────┐
│ a   ┆ b        ┆ c                   ┆ d   │
│ --- ┆ ---      ┆ ---                 ┆ --- │
│ i64 ┆ f64      ┆ datetime[μs]        ┆ f64 │
╞═════╪══════════╪═════════════════════╪═════╡
│ 0   ┆ 0.449507 ┆ 2022-12-01 00:00:00 ┆ 1.0 │
│ 1   ┆ 0.064056 ┆ 2022-12-02 00:00:00 ┆ 2.0 │
└─────┴──────────┴─────────────────────┴─────┘

With_columns

with_columns allows you to create new columns for you analyses. We create two new columns e and b+42. First we sum all values from column b and store the results in column e. After that we add 42 to the values of b. Creating a new column b+42 to store these results.

with_columns

df.with_columns([pl.col("b").sum().alias("e"), (pl.col("b") + 42).alias("b+42")])

with_columns

let out = df.clone().lazy().with_columns(
    [
        col("b").sum().alias("e"),
        (col("b") + lit(42)).alias("b+42")
    ]
).collect()?;
println!("{}",out);

withColumns

df.withColumns([
  pl.col("b").sum().alias("e"),
  pl.col("b").plus(42).alias("b+42"),
]);

shape: (8, 6)
┌─────┬──────────┬─────────────────────┬───────┬──────────┬───────────┐
│ a   ┆ b        ┆ c                   ┆ d     ┆ e        ┆ b+42      │
│ --- ┆ ---      ┆ ---                 ┆ ---   ┆ ---      ┆ ---       │
│ i64 ┆ f64      ┆ datetime[μs]        ┆ f64   ┆ f64      ┆ f64       │
╞═════╪══════════╪═════════════════════╪═══════╪══════════╪═══════════╡
│ 0   ┆ 0.449507 ┆ 2022-12-01 00:00:00 ┆ 1.0   ┆ 3.315207 ┆ 42.449507 │
│ 1   ┆ 0.064056 ┆ 2022-12-02 00:00:00 ┆ 2.0   ┆ 3.315207 ┆ 42.064056 │
│ 2   ┆ 0.012581 ┆ 2022-12-03 00:00:00 ┆ NaN   ┆ 3.315207 ┆ 42.012581 │
│ 3   ┆ 0.802239 ┆ 2022-12-04 00:00:00 ┆ NaN   ┆ 3.315207 ┆ 42.802239 │
│ 4   ┆ 0.946181 ┆ 2022-12-05 00:00:00 ┆ 0.0   ┆ 3.315207 ┆ 42.946181 │
│ 5   ┆ 0.157343 ┆ 2022-12-06 00:00:00 ┆ -5.0  ┆ 3.315207 ┆ 42.157343 │
│ 6   ┆ 0.117897 ┆ 2022-12-07 00:00:00 ┆ -42.0 ┆ 3.315207 ┆ 42.117897 │
│ 7   ┆ 0.765403 ┆ 2022-12-08 00:00:00 ┆ null  ┆ 3.315207 ┆ 42.765403 │
└─────┴──────────┴─────────────────────┴───────┴──────────┴───────────┘

Groupby

We will create a new DataFrame for the Groupby functionality. This new DataFrame will include several 'groups' that we want to groupby.

DataFrame

df2 = pl.DataFrame(
    {
        "x": np.arange(0, 8),
        "y": ["A", "A", "A", "B", "B", "C", "X", "X"],
    }
)

DataFrame

let df2: DataFrame = df!("x" => 0..8,
                        "y"=> &["A", "A", "A", "B", "B", "C", "X", "X"],
                    ).expect("should not fail");
println!("{}",df2);

DataFrame

df2 = pl.DataFrame({
  x: [...Array(8).keys()],
  y: ["A", "A", "A", "B", "B", "C", "X", "X"],
});

shape: (8, 2)
┌─────┬─────┐
│ x   ┆ y   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 0   ┆ A   │
│ 1   ┆ A   │
│ 2   ┆ A   │
│ 3   ┆ B   │
│ 4   ┆ B   │
│ 5   ┆ C   │
│ 6   ┆ X   │
│ 7   ┆ X   │
└─────┴─────┘

groupby

df2.groupby("y", maintain_order=True).count()

groupby

let out = df2.clone().lazy().groupby(["y"]).agg([count()]).collect()?;
println!("{}",out);

groupBy

df2.groupBy("y").count();
console.log(df2);

shape: (4, 2)
┌─────┬───────┐
│ y   ┆ count │
│ --- ┆ ---   │
│ str ┆ u32   │
╞═════╪═══════╡
│ A   ┆ 3     │
│ B   ┆ 2     │
│ C   ┆ 1     │
│ X   ┆ 2     │
└─────┴───────┘

groupby

df2.groupby("y", maintain_order=True).agg(
    [
        pl.col("*").count().alias("count"),
        pl.col("*").sum().alias("sum"),
    ]
)

groupby

let out = df2.clone().lazy().groupby(["y"]).agg([
    col("*").count().alias("count"),
    col("*").sum().alias("sum"),
]).collect()?;
println!("{}",out);

groupBy

df2
  .groupBy("y")
  .agg(pl.col("*").sum().alias("count"), pl.col("*").sum().alias("sum"));

shape: (4, 3)
┌─────┬───────┬─────┐
│ y   ┆ count ┆ sum │
│ --- ┆ ---   ┆ --- │
│ str ┆ u32   ┆ i64 │
╞═════╪═══════╪═════╡
│ A   ┆ 3     ┆ 3   │
│ B   ┆ 2     ┆ 7   │
│ C   ┆ 1     ┆ 5   │
│ X   ┆ 2     ┆ 13  │
└─────┴───────┴─────┘

Combining operations

Below are some examples on how to combine operations to create the DataFrame you require.

select · with_columns

df_x = df.with_columns((pl.col("a") * pl.col("b")).alias("a * b")).select(
    [pl.all().exclude(["c", "d"])]
)

print(df_x)

select · with_columns

let out = df.clone().lazy().with_columns([
    (col("a") * col("b")).alias("a * b")
]).select([
    col("*").exclude(["c","d"])
]).collect()?;
println!("{}",out);

select · withColumns

df_x = df
  .withColumns(pl.col("a").mul(pl.col("b")).alias("a * b"))
  .select([pl.all().exclude(["c", "d"])]);

console.log(df_x);

shape: (8, 3)
┌─────┬──────────┬──────────┐
│ a   ┆ b        ┆ a * b    │
│ --- ┆ ---      ┆ ---      │
│ i64 ┆ f64      ┆ f64      │
╞═════╪══════════╪══════════╡
│ 0   ┆ 0.449507 ┆ 0.0      │
│ 1   ┆ 0.064056 ┆ 0.064056 │
│ 2   ┆ 0.012581 ┆ 0.025161 │
│ 3   ┆ 0.802239 ┆ 2.406717 │
│ 4   ┆ 0.946181 ┆ 3.784723 │
│ 5   ┆ 0.157343 ┆ 0.786714 │
│ 6   ┆ 0.117897 ┆ 0.707384 │
│ 7   ┆ 0.765403 ┆ 5.357824 │
└─────┴──────────┴──────────┘

select · with_columns

df_y = df.with_columns([(pl.col("a") * pl.col("b")).alias("a * b")]).select(
    [pl.all().exclude("d")]
)

print(df_y)

select · with_columns

let out = df.clone().lazy().with_columns([
    (col("a") * col("b")).alias("a * b")
]).select([
    col("*").exclude(["d"])
]).collect()?;
println!("{}",out);

select · withColumns

df_y = df
  .withColumns([pl.col("a").mul(pl.col("b")).alias("a * b")])
  .select([pl.all().exclude("d")]);
console.log(df_y);

shape: (8, 4)
┌─────┬──────────┬─────────────────────┬──────────┐
│ a   ┆ b        ┆ c                   ┆ a * b    │
│ --- ┆ ---      ┆ ---                 ┆ ---      │
│ i64 ┆ f64      ┆ datetime[μs]        ┆ f64      │
╞═════╪══════════╪═════════════════════╪══════════╡
│ 0   ┆ 0.449507 ┆ 2022-12-01 00:00:00 ┆ 0.0      │
│ 1   ┆ 0.064056 ┆ 2022-12-02 00:00:00 ┆ 0.064056 │
│ 2   ┆ 0.012581 ┆ 2022-12-03 00:00:00 ┆ 0.025161 │
│ 3   ┆ 0.802239 ┆ 2022-12-04 00:00:00 ┆ 2.406717 │
│ 4   ┆ 0.946181 ┆ 2022-12-05 00:00:00 ┆ 3.784723 │
│ 5   ┆ 0.157343 ┆ 2022-12-06 00:00:00 ┆ 0.786714 │
│ 6   ┆ 0.117897 ┆ 2022-12-07 00:00:00 ┆ 0.707384 │
│ 7   ┆ 0.765403 ┆ 2022-12-08 00:00:00 ┆ 5.357824 │
└─────┴──────────┴─────────────────────┴──────────┘