Skip to content

Column selections

Let's create a dataset to use in this section:

DataFrame

from datetime import date, datetime

df = pl.DataFrame(
    {
        "id": [9, 4, 2],
        "place": ["Mars", "Earth", "Saturn"],
        "date": pl.date_range(date(2022, 1, 1), date(2022, 1, 3), "1d", eager=True),
        "sales": [33.4, 2142134.1, 44.7],
        "has_people": [False, True, False],
        "logged_at": pl.date_range(
            datetime(2022, 12, 1), datetime(2022, 12, 1, 0, 0, 2), "1s", eager=True
        ),
    }
).with_row_count("rn")
print(df)

DataFrame

    use chrono::prelude::*;
    use polars::time::*;

    let df = df!(
            "id" => &[9, 4, 2],
            "place" => &["Mars", "Earth", "Saturn"],
        "date" => date_range("date",
                NaiveDate::from_ymd_opt(2022, 1, 1).unwrap().and_hms_opt(0, 0, 0).unwrap(), NaiveDate::from_ymd_opt(2022, 1, 3).unwrap().and_hms_opt(0, 0, 0).unwrap(), Duration::parse("1d"),ClosedWindow::Both, TimeUnit::Milliseconds, None)?,
            "sales" => &[33.4, 2142134.1, 44.7],
            "has_people" => &[false, true, false],
            "logged_at" => date_range("logged_at",
                NaiveDate::from_ymd_opt(2022, 1, 1).unwrap().and_hms_opt(0, 0, 0).unwrap(), NaiveDate::from_ymd_opt(2022, 1, 1).unwrap().and_hms_opt(0, 0, 2).unwrap(), Duration::parse("1s"),ClosedWindow::Both, TimeUnit::Milliseconds, None)?,
    )?
    .with_row_count("rn", None)?;
    println!("{}", &df);

shape: (3, 7)
┌─────┬─────┬────────┬────────────┬───────────┬────────────┬─────────────────────┐
│ rn  ┆ id  ┆ place  ┆ date       ┆ sales     ┆ has_people ┆ logged_at           │
│ --- ┆ --- ┆ ---    ┆ ---        ┆ ---       ┆ ---        ┆ ---                 │
│ u32 ┆ i64 ┆ str    ┆ date       ┆ f64       ┆ bool       ┆ datetime[μs]        │
╞═════╪═════╪════════╪════════════╪═══════════╪════════════╪═════════════════════╡
│ 0   ┆ 9   ┆ Mars   ┆ 2022-01-01 ┆ 33.4      ┆ false      ┆ 2022-12-01 00:00:00 │
│ 1   ┆ 4   ┆ Earth  ┆ 2022-01-02 ┆ 2142134.1 ┆ true       ┆ 2022-12-01 00:00:01 │
│ 2   ┆ 2   ┆ Saturn ┆ 2022-01-03 ┆ 44.7      ┆ false      ┆ 2022-12-01 00:00:02 │
└─────┴─────┴────────┴────────────┴───────────┴────────────┴─────────────────────┘

Expression expansion

As we've seen in the previous section, we can select specific columns using the pl.col method. It can also select multiple columns - both as a means of convenience, and to expand the expression.

This kind of convenience feature isn't just decorative or syntactic sugar. It allows for a very powerful application of DRY principles in your code: a single expression that specifies multiple columns expands into a list of expressions (depending on the DataFrame schema), resulting in being able to select multiple columns + run computation on them!

Select all, or all but some

We can select all columns in the DataFrame object by providing the argument *:

all

out = df.select(pl.col("*"))

# Is equivalent to
out = df.select(pl.all())
print(out)

let out = df.clone().lazy().select([col("*")]).collect()?;

// Is equivalent to
let out = df.clone().lazy().select([all()]).collect()?;
println!("{}", &out);
shape: (3, 7)
┌─────┬─────┬────────┬────────────┬───────────┬────────────┬─────────────────────┐
│ rn  ┆ id  ┆ place  ┆ date       ┆ sales     ┆ has_people ┆ logged_at           │
│ --- ┆ --- ┆ ---    ┆ ---        ┆ ---       ┆ ---        ┆ ---                 │
│ u32 ┆ i64 ┆ str    ┆ date       ┆ f64       ┆ bool       ┆ datetime[μs]        │
╞═════╪═════╪════════╪════════════╪═══════════╪════════════╪═════════════════════╡
│ 0   ┆ 9   ┆ Mars   ┆ 2022-01-01 ┆ 33.4      ┆ false      ┆ 2022-12-01 00:00:00 │
│ 1   ┆ 4   ┆ Earth  ┆ 2022-01-02 ┆ 2142134.1 ┆ true       ┆ 2022-12-01 00:00:01 │
│ 2   ┆ 2   ┆ Saturn ┆ 2022-01-03 ┆ 44.7      ┆ false      ┆ 2022-12-01 00:00:02 │
└─────┴─────┴────────┴────────────┴───────────┴────────────┴─────────────────────┘

Often, we don't just want to include all columns, but include all while excluding a few. This can be done easily as well:

exclude

out = df.select(pl.col("*").exclude("logged_at", "rn"))
print(out)

let out = df
    .clone()
    .lazy()
    .select([col("*").exclude(["logged_at", "rn"])])
    .collect()?;
println!("{}", &out);
shape: (3, 5)
┌─────┬────────┬────────────┬───────────┬────────────┐
│ id  ┆ place  ┆ date       ┆ sales     ┆ has_people │
│ --- ┆ ---    ┆ ---        ┆ ---       ┆ ---        │
│ i64 ┆ str    ┆ date       ┆ f64       ┆ bool       │
╞═════╪════════╪════════════╪═══════════╪════════════╡
│ 9   ┆ Mars   ┆ 2022-01-01 ┆ 33.4      ┆ false      │
│ 4   ┆ Earth  ┆ 2022-01-02 ┆ 2142134.1 ┆ true       │
│ 2   ┆ Saturn ┆ 2022-01-03 ┆ 44.7      ┆ false      │
└─────┴────────┴────────────┴───────────┴────────────┘

By multiple strings

Specifying multiple strings allows expressions to expand to all matching columns:

dt.to_string

out = df.select(pl.col("date", "logged_at").dt.to_string("%Y-%h-%d"))
print(out)

let out = df
    .clone()
    .lazy()
    .select([cols(["date", "logged_at"]).dt().to_string("%Y-%h-%d")])
    .collect()?;
println!("{}", &out);
shape: (3, 2)
┌─────────────┬─────────────┐
│ date        ┆ logged_at   │
│ ---         ┆ ---         │
│ str         ┆ str         │
╞═════════════╪═════════════╡
│ 2022-Jan-01 ┆ 2022-Dec-01 │
│ 2022-Jan-02 ┆ 2022-Dec-01 │
│ 2022-Jan-03 ┆ 2022-Dec-01 │
└─────────────┴─────────────┘

By regular expressions

Multiple column selection is possible by regular expressions also, by making sure to wrap the regex by ^ and $ to let pl.col know that a regex selection is expected:

out = df.select(pl.col("^.*(as|sa).*$"))
print(out)
let out = df.clone().lazy().select([col("^.*(as|sa).*$")]).collect()?;
println!("{}", &out);
shape: (3, 2)
┌───────────┬────────────┐
│ sales     ┆ has_people │
│ ---       ┆ ---        │
│ f64       ┆ bool       │
╞═══════════╪════════════╡
│ 33.4      ┆ false      │
│ 2142134.1 ┆ true       │
│ 44.7      ┆ false      │
└───────────┴────────────┘

By data type

pl.col can select multiple columns using Polars data types:

n_unique

out = df.select(pl.col(pl.Int64, pl.UInt32, pl.Boolean).n_unique())
print(out)

let out = df
    .clone()
    .lazy()
    .select([dtype_cols([DataType::Int64, DataType::UInt32, DataType::Boolean]).n_unique()])
    .collect()?;
// gives different result than python as the id col is i32 in rust
println!("{}", &out);
shape: (1, 3)
┌─────┬─────┬────────────┐
│ rn  ┆ id  ┆ has_people │
│ --- ┆ --- ┆ ---        │
│ u32 ┆ u32 ┆ u32        │
╞═════╪═════╪════════════╡
│ 3   ┆ 3   ┆ 2          │
└─────┴─────┴────────────┘

Using selectors

Polars also allows for the use of intuitive selections for columns based on their name, dtype or other properties; and this is built on top of existing functionality outlined in col used above. It is recommended to use them by importing and aliasing polars.selectors as cs.

By dtype

To select just the integer and string columns, we can do:

selectors

import polars.selectors as cs

out = df.select(cs.integer(), cs.string())
print(out)

// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/10594
shape: (3, 3)
┌─────┬─────┬────────┐
│ rn  ┆ id  ┆ place  │
│ --- ┆ --- ┆ ---    │
│ u32 ┆ i64 ┆ str    │
╞═════╪═════╪════════╡
│ 0   ┆ 9   ┆ Mars   │
│ 1   ┆ 4   ┆ Earth  │
│ 2   ┆ 2   ┆ Saturn │
└─────┴─────┴────────┘

Applying set operations

These selectors also allow for set based selection operations. For instance, to select the numeric columns except the first column that indicates row numbers:

cs.first · cs.numeric

out = df.select(cs.numeric() - cs.first())
print(out)

// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/10594
shape: (3, 2)
┌─────┬───────────┐
│ id  ┆ sales     │
│ --- ┆ ---       │
│ i64 ┆ f64       │
╞═════╪═══════════╡
│ 9   ┆ 33.4      │
│ 4   ┆ 2142134.1 │
│ 2   ┆ 44.7      │
└─────┴───────────┘

We can also select the row number by name and any non-numeric columns:

cs.by_name · cs.numeric

out = df.select(cs.by_name("rn") | ~cs.numeric())
print(out)

// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/10594
shape: (3, 5)
┌─────┬────────┬────────────┬────────────┬─────────────────────┐
│ rn  ┆ place  ┆ date       ┆ has_people ┆ logged_at           │
│ --- ┆ ---    ┆ ---        ┆ ---        ┆ ---                 │
│ u32 ┆ str    ┆ date       ┆ bool       ┆ datetime[μs]        │
╞═════╪════════╪════════════╪════════════╪═════════════════════╡
│ 0   ┆ Mars   ┆ 2022-01-01 ┆ false      ┆ 2022-12-01 00:00:00 │
│ 1   ┆ Earth  ┆ 2022-01-02 ┆ true       ┆ 2022-12-01 00:00:01 │
│ 2   ┆ Saturn ┆ 2022-01-03 ┆ false      ┆ 2022-12-01 00:00:02 │
└─────┴────────┴────────────┴────────────┴─────────────────────┘

By patterns and substrings

Selectors can also be matched by substring and regex patterns:

cs.contains · cs.matches

out = df.select(cs.contains("rn"), cs.matches(".*_.*"))
print(out)

// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/1059
shape: (3, 3)
┌─────┬────────────┬─────────────────────┐
│ rn  ┆ has_people ┆ logged_at           │
│ --- ┆ ---        ┆ ---                 │
│ u32 ┆ bool       ┆ datetime[μs]        │
╞═════╪════════════╪═════════════════════╡
│ 0   ┆ false      ┆ 2022-12-01 00:00:00 │
│ 1   ┆ true       ┆ 2022-12-01 00:00:01 │
│ 2   ┆ false      ┆ 2022-12-01 00:00:02 │
└─────┴────────────┴─────────────────────┘

Converting to expressions

What if we want to apply a specific operation on the selected columns (i.e. get back to representing them as expressions to operate upon)? We can simply convert them using as_expr and then proceed as normal:

cs.temporal

out = df.select(cs.temporal().as_expr().dt.to_string("%Y-%h-%d"))
print(out)

// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/10594
shape: (3, 2)
┌─────────────┬─────────────┐
│ date        ┆ logged_at   │
│ ---         ┆ ---         │
│ str         ┆ str         │
╞═════════════╪═════════════╡
│ 2022-Jan-01 ┆ 2022-Dec-01 │
│ 2022-Jan-02 ┆ 2022-Dec-01 │
│ 2022-Jan-03 ┆ 2022-Dec-01 │
└─────────────┴─────────────┘

Debugging selectors

Polars also provides two helpful utility functions to aid with using selectors: is_selector and selector_column_names:

is_selector

from polars.selectors import is_selector

out = cs.temporal()
print(is_selector(out))

// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/10594
True

To predetermine the column names that are selected, which is especially useful for a LazyFrame object:

selector_column_names

from polars.selectors import expand_selector

out = cs.temporal().as_expr().dt.to_string("%Y-%h-%d")
print(expand_selector(df, out))

// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/10594
('date', 'logged_at')