Column selections

Let's create a dataset to use in this section:

Python Rust

DataFrame

from datetime import date, datetime

df = pl.DataFrame(
    {
        "id": [9, 4, 2],
        "place": ["Mars", "Earth", "Saturn"],
        "date": pl.date_range(date(2022, 1, 1), date(2022, 1, 3), "1d", eager=True),
        "sales": [33.4, 2142134.1, 44.7],
        "has_people": [False, True, False],
        "logged_at": pl.date_range(
            datetime(2022, 12, 1), datetime(2022, 12, 1, 0, 0, 2), "1s", eager=True
        ),
    }
).with_row_count("rn")
print(df)

DataFrame

    use chrono::prelude::*;
    use polars::time::*;

    let df = df!(
            "id" => &[9, 4, 2],
            "place" => &["Mars", "Earth", "Saturn"],
        "date" => date_range("date",
                NaiveDate::from_ymd_opt(2022, 1, 1).unwrap().and_hms_opt(0, 0, 0).unwrap(), NaiveDate::from_ymd_opt(2022, 1, 3).unwrap().and_hms_opt(0, 0, 0).unwrap(), Duration::parse("1d"),ClosedWindow::Both, TimeUnit::Milliseconds, None)?,
            "sales" => &[33.4, 2142134.1, 44.7],
            "has_people" => &[false, true, false],
            "logged_at" => date_range("logged_at",
                NaiveDate::from_ymd_opt(2022, 1, 1).unwrap().and_hms_opt(0, 0, 0).unwrap(), NaiveDate::from_ymd_opt(2022, 1, 1).unwrap().and_hms_opt(0, 0, 2).unwrap(), Duration::parse("1s"),ClosedWindow::Both, TimeUnit::Milliseconds, None)?,
    )?
    .with_row_count("rn", None)?;
    println!("{}", &df);

shape: (3, 7)
┌─────┬─────┬────────┬────────────┬───────────┬────────────┬─────────────────────┐
│ rn  ┆ id  ┆ place  ┆ date       ┆ sales     ┆ has_people ┆ logged_at           │
│ --- ┆ --- ┆ ---    ┆ ---        ┆ ---       ┆ ---        ┆ ---                 │
│ u32 ┆ i64 ┆ str    ┆ date       ┆ f64       ┆ bool       ┆ datetime[μs]        │
╞═════╪═════╪════════╪════════════╪═══════════╪════════════╪═════════════════════╡
│ 0   ┆ 9   ┆ Mars   ┆ 2022-01-01 ┆ 33.4      ┆ false      ┆ 2022-12-01 00:00:00 │
│ 1   ┆ 4   ┆ Earth  ┆ 2022-01-02 ┆ 2142134.1 ┆ true       ┆ 2022-12-01 00:00:01 │
│ 2   ┆ 2   ┆ Saturn ┆ 2022-01-03 ┆ 44.7      ┆ false      ┆ 2022-12-01 00:00:02 │
└─────┴─────┴────────┴────────────┴───────────┴────────────┴─────────────────────┘

Expression expansion

As we've seen in the previous section, we can select specific columns using the pl.col method. It can also select multiple columns - both as a means of convenience, and to expand the expression.

This kind of convenience feature isn't just decorative or syntactic sugar. It allows for a very powerful application of DRY principles in your code: a single expression that specifies multiple columns expands into a list of expressions (depending on the DataFrame schema), resulting in being able to select multiple columns + run computation on them!

Select all, or all but some

We can select all columns in the DataFrame object by providing the argument *:

Python Rust

all

out = df.select(pl.col("*"))

# Is equivalent to
out = df.select(pl.all())
print(out)

let out = df.clone().lazy().select([col("*")]).collect()?;

// Is equivalent to
let out = df.clone().lazy().select([all()]).collect()?;
println!("{}", &out);

shape: (3, 7)
┌─────┬─────┬────────┬────────────┬───────────┬────────────┬─────────────────────┐
│ rn  ┆ id  ┆ place  ┆ date       ┆ sales     ┆ has_people ┆ logged_at           │
│ --- ┆ --- ┆ ---    ┆ ---        ┆ ---       ┆ ---        ┆ ---                 │
│ u32 ┆ i64 ┆ str    ┆ date       ┆ f64       ┆ bool       ┆ datetime[μs]        │
╞═════╪═════╪════════╪════════════╪═══════════╪════════════╪═════════════════════╡
│ 0   ┆ 9   ┆ Mars   ┆ 2022-01-01 ┆ 33.4      ┆ false      ┆ 2022-12-01 00:00:00 │
│ 1   ┆ 4   ┆ Earth  ┆ 2022-01-02 ┆ 2142134.1 ┆ true       ┆ 2022-12-01 00:00:01 │
│ 2   ┆ 2   ┆ Saturn ┆ 2022-01-03 ┆ 44.7      ┆ false      ┆ 2022-12-01 00:00:02 │
└─────┴─────┴────────┴────────────┴───────────┴────────────┴─────────────────────┘

Often, we don't just want to include all columns, but include all while excluding a few. This can be done easily as well:

Python Rust

exclude

out = df.select(pl.col("*").exclude("logged_at", "rn"))
print(out)

let out = df
    .clone()
    .lazy()
    .select([col("*").exclude(["logged_at", "rn"])])
    .collect()?;
println!("{}", &out);

shape: (3, 5)
┌─────┬────────┬────────────┬───────────┬────────────┐
│ id  ┆ place  ┆ date       ┆ sales     ┆ has_people │
│ --- ┆ ---    ┆ ---        ┆ ---       ┆ ---        │
│ i64 ┆ str    ┆ date       ┆ f64       ┆ bool       │
╞═════╪════════╪════════════╪═══════════╪════════════╡
│ 9   ┆ Mars   ┆ 2022-01-01 ┆ 33.4      ┆ false      │
│ 4   ┆ Earth  ┆ 2022-01-02 ┆ 2142134.1 ┆ true       │
│ 2   ┆ Saturn ┆ 2022-01-03 ┆ 44.7      ┆ false      │
└─────┴────────┴────────────┴───────────┴────────────┘

By multiple strings

Specifying multiple strings allows expressions to expand to all matching columns:

Python Rust

dt.to_string

out = df.select(pl.col("date", "logged_at").dt.to_string("%Y-%h-%d"))
print(out)

let out = df
    .clone()
    .lazy()
    .select([cols(["date", "logged_at"]).dt().to_string("%Y-%h-%d")])
    .collect()?;
println!("{}", &out);

shape: (3, 2)
┌─────────────┬─────────────┐
│ date        ┆ logged_at   │
│ ---         ┆ ---         │
│ str         ┆ str         │
╞═════════════╪═════════════╡
│ 2022-Jan-01 ┆ 2022-Dec-01 │
│ 2022-Jan-02 ┆ 2022-Dec-01 │
│ 2022-Jan-03 ┆ 2022-Dec-01 │
└─────────────┴─────────────┘

By regular expressions

Multiple column selection is possible by regular expressions also, by making sure to wrap the regex by ^ and $ to let pl.col know that a regex selection is expected:

Python Rust

out = df.select(pl.col("^.*(as|sa).*$"))
print(out)

let out = df.clone().lazy().select([col("^.*(as|sa).*$")]).collect()?;
println!("{}", &out);

shape: (3, 2)
┌───────────┬────────────┐
│ sales     ┆ has_people │
│ ---       ┆ ---        │
│ f64       ┆ bool       │
╞═══════════╪════════════╡
│ 33.4      ┆ false      │
│ 2142134.1 ┆ true       │
│ 44.7      ┆ false      │
└───────────┴────────────┘

By data type

pl.col can select multiple columns using Polars data types:

Python Rust

n_unique

out = df.select(pl.col(pl.Int64, pl.UInt32, pl.Boolean).n_unique())
print(out)

let out = df
    .clone()
    .lazy()
    .select([dtype_cols([DataType::Int64, DataType::UInt32, DataType::Boolean]).n_unique()])
    .collect()?;
// gives different result than python as the id col is i32 in rust
println!("{}", &out);

shape: (1, 3)
┌─────┬─────┬────────────┐
│ rn  ┆ id  ┆ has_people │
│ --- ┆ --- ┆ ---        │
│ u32 ┆ u32 ┆ u32        │
╞═════╪═════╪════════════╡
│ 3   ┆ 3   ┆ 2          │
└─────┴─────┴────────────┘

Using `selectors`

Polars also allows for the use of intuitive selections for columns based on their name, dtype or other properties; and this is built on top of existing functionality outlined in col used above. It is recommended to use them by importing and aliasing polars.selectors as cs.

By `dtype`

To select just the integer and string columns, we can do:

Python Rust

selectors

import polars.selectors as cs

out = df.select(cs.integer(), cs.string())
print(out)

// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/10594

shape: (3, 3)
┌─────┬─────┬────────┐
│ rn  ┆ id  ┆ place  │
│ --- ┆ --- ┆ ---    │
│ u32 ┆ i64 ┆ str    │
╞═════╪═════╪════════╡
│ 0   ┆ 9   ┆ Mars   │
│ 1   ┆ 4   ┆ Earth  │
│ 2   ┆ 2   ┆ Saturn │
└─────┴─────┴────────┘

Applying set operations

These selectors also allow for set based selection operations. For instance, to select the numeric columns except the first column that indicates row numbers:

Python Rust

cs.first · cs.numeric

out = df.select(cs.numeric() - cs.first())
print(out)

// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/10594

shape: (3, 2)
┌─────┬───────────┐
│ id  ┆ sales     │
│ --- ┆ ---       │
│ i64 ┆ f64       │
╞═════╪═══════════╡
│ 9   ┆ 33.4      │
│ 4   ┆ 2142134.1 │
│ 2   ┆ 44.7      │
└─────┴───────────┘

We can also select the row number by name and any non-numeric columns:

Python Rust

cs.by_name · cs.numeric

out = df.select(cs.by_name("rn") | ~cs.numeric())
print(out)

// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/10594

shape: (3, 5)
┌─────┬────────┬────────────┬────────────┬─────────────────────┐
│ rn  ┆ place  ┆ date       ┆ has_people ┆ logged_at           │
│ --- ┆ ---    ┆ ---        ┆ ---        ┆ ---                 │
│ u32 ┆ str    ┆ date       ┆ bool       ┆ datetime[μs]        │
╞═════╪════════╪════════════╪════════════╪═════════════════════╡
│ 0   ┆ Mars   ┆ 2022-01-01 ┆ false      ┆ 2022-12-01 00:00:00 │
│ 1   ┆ Earth  ┆ 2022-01-02 ┆ true       ┆ 2022-12-01 00:00:01 │
│ 2   ┆ Saturn ┆ 2022-01-03 ┆ false      ┆ 2022-12-01 00:00:02 │
└─────┴────────┴────────────┴────────────┴─────────────────────┘

By patterns and substrings

Selectors can also be matched by substring and regex patterns:

Python Rust

cs.contains · cs.matches

out = df.select(cs.contains("rn"), cs.matches(".*_.*"))
print(out)

// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/1059

shape: (3, 3)
┌─────┬────────────┬─────────────────────┐
│ rn  ┆ has_people ┆ logged_at           │
│ --- ┆ ---        ┆ ---                 │
│ u32 ┆ bool       ┆ datetime[μs]        │
╞═════╪════════════╪═════════════════════╡
│ 0   ┆ false      ┆ 2022-12-01 00:00:00 │
│ 1   ┆ true       ┆ 2022-12-01 00:00:01 │
│ 2   ┆ false      ┆ 2022-12-01 00:00:02 │
└─────┴────────────┴─────────────────────┘

Converting to expressions

What if we want to apply a specific operation on the selected columns (i.e. get back to representing them as expressions to operate upon)? We can simply convert them using as_expr and then proceed as normal:

Python Rust

cs.temporal

out = df.select(cs.temporal().as_expr().dt.to_string("%Y-%h-%d"))
print(out)

// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/10594

shape: (3, 2)
┌─────────────┬─────────────┐
│ date        ┆ logged_at   │
│ ---         ┆ ---         │
│ str         ┆ str         │
╞═════════════╪═════════════╡
│ 2022-Jan-01 ┆ 2022-Dec-01 │
│ 2022-Jan-02 ┆ 2022-Dec-01 │
│ 2022-Jan-03 ┆ 2022-Dec-01 │
└─────────────┴─────────────┘

Debugging `selectors`

Polars also provides two helpful utility functions to aid with using selectors: is_selector and selector_column_names:

Python Rust

is_selector

from polars.selectors import is_selector

out = cs.temporal()
print(is_selector(out))

// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/10594

True

To predetermine the column names that are selected, which is especially useful for a LazyFrame object:

Python Rust

selector_column_names

from polars.selectors import expand_selector

out = cs.temporal().as_expr().dt.to_string("%Y-%h-%d")
print(expand_selector(df, out))

// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/10594

('date', 'logged_at')

Column selections

Expression expansion

Select all, or all but some

By multiple strings

By regular expressions

By data type

Using selectors

By dtype

Applying set operations

By patterns and substrings

Converting to expressions

Debugging selectors

Using `selectors`

By `dtype`

Debugging `selectors`