Column selections
Let's create a dataset to use in this section:
from datetime import date, datetime
df = pl.DataFrame(
{
"id": [9, 4, 2],
"place": ["Mars", "Earth", "Saturn"],
"date": pl.date_range(date(2022, 1, 1), date(2022, 1, 3), "1d", eager=True),
"sales": [33.4, 2142134.1, 44.7],
"has_people": [False, True, False],
"logged_at": pl.date_range(
datetime(2022, 12, 1), datetime(2022, 12, 1, 0, 0, 2), "1s", eager=True
),
}
).with_row_count("rn")
print(df)
use chrono::prelude::*;
use polars::time::*;
let df = df!(
"id" => &[9, 4, 2],
"place" => &["Mars", "Earth", "Saturn"],
"date" => date_range("date",
NaiveDate::from_ymd_opt(2022, 1, 1).unwrap().and_hms_opt(0, 0, 0).unwrap(), NaiveDate::from_ymd_opt(2022, 1, 3).unwrap().and_hms_opt(0, 0, 0).unwrap(), Duration::parse("1d"),ClosedWindow::Both, TimeUnit::Milliseconds, None)?,
"sales" => &[33.4, 2142134.1, 44.7],
"has_people" => &[false, true, false],
"logged_at" => date_range("logged_at",
NaiveDate::from_ymd_opt(2022, 1, 1).unwrap().and_hms_opt(0, 0, 0).unwrap(), NaiveDate::from_ymd_opt(2022, 1, 1).unwrap().and_hms_opt(0, 0, 2).unwrap(), Duration::parse("1s"),ClosedWindow::Both, TimeUnit::Milliseconds, None)?,
)?
.with_row_count("rn", None)?;
println!("{}", &df);
shape: (3, 7)
┌─────┬─────┬────────┬────────────┬───────────┬────────────┬─────────────────────┐
│ rn ┆ id ┆ place ┆ date ┆ sales ┆ has_people ┆ logged_at │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ str ┆ date ┆ f64 ┆ bool ┆ datetime[μs] │
╞═════╪═════╪════════╪════════════╪═══════════╪════════════╪═════════════════════╡
│ 0 ┆ 9 ┆ Mars ┆ 2022-01-01 ┆ 33.4 ┆ false ┆ 2022-12-01 00:00:00 │
│ 1 ┆ 4 ┆ Earth ┆ 2022-01-02 ┆ 2142134.1 ┆ true ┆ 2022-12-01 00:00:01 │
│ 2 ┆ 2 ┆ Saturn ┆ 2022-01-03 ┆ 44.7 ┆ false ┆ 2022-12-01 00:00:02 │
└─────┴─────┴────────┴────────────┴───────────┴────────────┴─────────────────────┘
Expression expansion
As we've seen in the previous section, we can select specific columns using the pl.col method. It can also select multiple columns - both as a means of convenience, and to expand the expression.
This kind of convenience feature isn't just decorative or syntactic sugar. It allows for a very powerful application of DRY principles in your code: a single expression that specifies multiple columns expands into a list of expressions (depending on the DataFrame schema), resulting in being able to select multiple columns + run computation on them!
Select all, or all but some
We can select all columns in the DataFrame object by providing the argument *:
out = df.select(pl.col("*"))
# Is equivalent to
out = df.select(pl.all())
print(out)
let out = df.clone().lazy().select([col("*")]).collect()?;
// Is equivalent to
let out = df.clone().lazy().select([all()]).collect()?;
println!("{}", &out);
shape: (3, 7)
┌─────┬─────┬────────┬────────────┬───────────┬────────────┬─────────────────────┐
│ rn ┆ id ┆ place ┆ date ┆ sales ┆ has_people ┆ logged_at │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ str ┆ date ┆ f64 ┆ bool ┆ datetime[μs] │
╞═════╪═════╪════════╪════════════╪═══════════╪════════════╪═════════════════════╡
│ 0 ┆ 9 ┆ Mars ┆ 2022-01-01 ┆ 33.4 ┆ false ┆ 2022-12-01 00:00:00 │
│ 1 ┆ 4 ┆ Earth ┆ 2022-01-02 ┆ 2142134.1 ┆ true ┆ 2022-12-01 00:00:01 │
│ 2 ┆ 2 ┆ Saturn ┆ 2022-01-03 ┆ 44.7 ┆ false ┆ 2022-12-01 00:00:02 │
└─────┴─────┴────────┴────────────┴───────────┴────────────┴─────────────────────┘
Often, we don't just want to include all columns, but include all while excluding a few. This can be done easily as well:
out = df.select(pl.col("*").exclude("logged_at", "rn"))
print(out)
let out = df
.clone()
.lazy()
.select([col("*").exclude(["logged_at", "rn"])])
.collect()?;
println!("{}", &out);
shape: (3, 5)
┌─────┬────────┬────────────┬───────────┬────────────┐
│ id ┆ place ┆ date ┆ sales ┆ has_people │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ date ┆ f64 ┆ bool │
╞═════╪════════╪════════════╪═══════════╪════════════╡
│ 9 ┆ Mars ┆ 2022-01-01 ┆ 33.4 ┆ false │
│ 4 ┆ Earth ┆ 2022-01-02 ┆ 2142134.1 ┆ true │
│ 2 ┆ Saturn ┆ 2022-01-03 ┆ 44.7 ┆ false │
└─────┴────────┴────────────┴───────────┴────────────┘
By multiple strings
Specifying multiple strings allows expressions to expand to all matching columns:
out = df.select(pl.col("date", "logged_at").dt.to_string("%Y-%h-%d"))
print(out)
let out = df
.clone()
.lazy()
.select([cols(["date", "logged_at"]).dt().to_string("%Y-%h-%d")])
.collect()?;
println!("{}", &out);
shape: (3, 2)
┌─────────────┬─────────────┐
│ date ┆ logged_at │
│ --- ┆ --- │
│ str ┆ str │
╞═════════════╪═════════════╡
│ 2022-Jan-01 ┆ 2022-Dec-01 │
│ 2022-Jan-02 ┆ 2022-Dec-01 │
│ 2022-Jan-03 ┆ 2022-Dec-01 │
└─────────────┴─────────────┘
By regular expressions
Multiple column selection is possible by regular expressions also, by making sure to wrap the regex by ^ and $ to let pl.col know that a regex selection is expected:
out = df.select(pl.col("^.*(as|sa).*$"))
print(out)
let out = df.clone().lazy().select([col("^.*(as|sa).*$")]).collect()?;
println!("{}", &out);
shape: (3, 2)
┌───────────┬────────────┐
│ sales ┆ has_people │
│ --- ┆ --- │
│ f64 ┆ bool │
╞═══════════╪════════════╡
│ 33.4 ┆ false │
│ 2142134.1 ┆ true │
│ 44.7 ┆ false │
└───────────┴────────────┘
By data type
pl.col can select multiple columns using Polars data types:
out = df.select(pl.col(pl.Int64, pl.UInt32, pl.Boolean).n_unique())
print(out)
let out = df
.clone()
.lazy()
.select([dtype_cols([DataType::Int64, DataType::UInt32, DataType::Boolean]).n_unique()])
.collect()?;
// gives different result than python as the id col is i32 in rust
println!("{}", &out);
shape: (1, 3)
┌─────┬─────┬────────────┐
│ rn ┆ id ┆ has_people │
│ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 │
╞═════╪═════╪════════════╡
│ 3 ┆ 3 ┆ 2 │
└─────┴─────┴────────────┘
Using selectors
Polars also allows for the use of intuitive selections for columns based on their name, dtype or other properties; and this is built on top of existing functionality outlined in col used above. It is recommended to use them by importing and aliasing polars.selectors as cs.
By dtype
To select just the integer and string columns, we can do:
import polars.selectors as cs
out = df.select(cs.integer(), cs.string())
print(out)
// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/10594
shape: (3, 3)
┌─────┬─────┬────────┐
│ rn ┆ id ┆ place │
│ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ str │
╞═════╪═════╪════════╡
│ 0 ┆ 9 ┆ Mars │
│ 1 ┆ 4 ┆ Earth │
│ 2 ┆ 2 ┆ Saturn │
└─────┴─────┴────────┘
Applying set operations
These selectors also allow for set based selection operations. For instance, to select the numeric columns except the first column that indicates row numbers:
out = df.select(cs.numeric() - cs.first())
print(out)
// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/10594
shape: (3, 2)
┌─────┬───────────┐
│ id ┆ sales │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═══════════╡
│ 9 ┆ 33.4 │
│ 4 ┆ 2142134.1 │
│ 2 ┆ 44.7 │
└─────┴───────────┘
We can also select the row number by name and any non-numeric columns:
out = df.select(cs.by_name("rn") | ~cs.numeric())
print(out)
// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/10594
shape: (3, 5)
┌─────┬────────┬────────────┬────────────┬─────────────────────┐
│ rn ┆ place ┆ date ┆ has_people ┆ logged_at │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ date ┆ bool ┆ datetime[μs] │
╞═════╪════════╪════════════╪════════════╪═════════════════════╡
│ 0 ┆ Mars ┆ 2022-01-01 ┆ false ┆ 2022-12-01 00:00:00 │
│ 1 ┆ Earth ┆ 2022-01-02 ┆ true ┆ 2022-12-01 00:00:01 │
│ 2 ┆ Saturn ┆ 2022-01-03 ┆ false ┆ 2022-12-01 00:00:02 │
└─────┴────────┴────────────┴────────────┴─────────────────────┘
By patterns and substrings
Selectors can also be matched by substring and regex patterns:
out = df.select(cs.contains("rn"), cs.matches(".*_.*"))
print(out)
// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/1059
shape: (3, 3)
┌─────┬────────────┬─────────────────────┐
│ rn ┆ has_people ┆ logged_at │
│ --- ┆ --- ┆ --- │
│ u32 ┆ bool ┆ datetime[μs] │
╞═════╪════════════╪═════════════════════╡
│ 0 ┆ false ┆ 2022-12-01 00:00:00 │
│ 1 ┆ true ┆ 2022-12-01 00:00:01 │
│ 2 ┆ false ┆ 2022-12-01 00:00:02 │
└─────┴────────────┴─────────────────────┘
Converting to expressions
What if we want to apply a specific operation on the selected columns (i.e. get back to representing them as expressions to operate upon)? We can simply convert them using as_expr and then proceed as normal:
out = df.select(cs.temporal().as_expr().dt.to_string("%Y-%h-%d"))
print(out)
// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/10594
shape: (3, 2)
┌─────────────┬─────────────┐
│ date ┆ logged_at │
│ --- ┆ --- │
│ str ┆ str │
╞═════════════╪═════════════╡
│ 2022-Jan-01 ┆ 2022-Dec-01 │
│ 2022-Jan-02 ┆ 2022-Dec-01 │
│ 2022-Jan-03 ┆ 2022-Dec-01 │
└─────────────┴─────────────┘
Debugging selectors
Polars also provides two helpful utility functions to aid with using selectors: is_selector and selector_column_names:
from polars.selectors import is_selector
out = cs.temporal()
print(is_selector(out))
// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/10594
True
To predetermine the column names that are selected, which is especially useful for a LazyFrame object:
from polars.selectors import expand_selector
out = cs.temporal().as_expr().dt.to_string("%Y-%h-%d")
print(expand_selector(df, out))
// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/10594
('date', 'logged_at')