Column selections
Let's create a dataset to use in this section:
from datetime import date, datetime
df = pl.DataFrame(
{
"id": [9, 4, 2],
"place": ["Mars", "Earth", "Saturn"],
"date": pl.date_range(date(2022, 1, 1), date(2022, 1, 3), "1d", eager=True),
"sales": [33.4, 2142134.1, 44.7],
"has_people": [False, True, False],
"logged_at": pl.date_range(
datetime(2022, 12, 1), datetime(2022, 12, 1, 0, 0, 2), "1s", eager=True
),
}
).with_row_count("rn")
print(df)
use chrono::prelude::*;
use polars::time::*;
let df = df!(
"id" => &[9, 4, 2],
"place" => &["Mars", "Earth", "Saturn"],
"date" => date_range("date",
NaiveDate::from_ymd_opt(2022, 1, 1).unwrap().and_hms_opt(0, 0, 0).unwrap(), NaiveDate::from_ymd_opt(2022, 1, 3).unwrap().and_hms_opt(0, 0, 0).unwrap(), Duration::parse("1d"),ClosedWindow::Both, TimeUnit::Milliseconds, None)?,
"sales" => &[33.4, 2142134.1, 44.7],
"has_people" => &[false, true, false],
"logged_at" => date_range("logged_at",
NaiveDate::from_ymd_opt(2022, 1, 1).unwrap().and_hms_opt(0, 0, 0).unwrap(), NaiveDate::from_ymd_opt(2022, 1, 1).unwrap().and_hms_opt(0, 0, 2).unwrap(), Duration::parse("1s"),ClosedWindow::Both, TimeUnit::Milliseconds, None)?,
)?
.with_row_count("rn", None)?;
println!("{}", &df);
shape: (3, 7)
┌─────┬─────┬────────┬────────────┬───────────┬────────────┬─────────────────────┐
│ rn ┆ id ┆ place ┆ date ┆ sales ┆ has_people ┆ logged_at │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ str ┆ date ┆ f64 ┆ bool ┆ datetime[μs] │
╞═════╪═════╪════════╪════════════╪═══════════╪════════════╪═════════════════════╡
│ 0 ┆ 9 ┆ Mars ┆ 2022-01-01 ┆ 33.4 ┆ false ┆ 2022-12-01 00:00:00 │
│ 1 ┆ 4 ┆ Earth ┆ 2022-01-02 ┆ 2142134.1 ┆ true ┆ 2022-12-01 00:00:01 │
│ 2 ┆ 2 ┆ Saturn ┆ 2022-01-03 ┆ 44.7 ┆ false ┆ 2022-12-01 00:00:02 │
└─────┴─────┴────────┴────────────┴───────────┴────────────┴─────────────────────┘
Expression expansion
As we've seen in the previous section, we can select specific columns using the pl.col
method. It can also select multiple columns - both as a means of convenience, and to expand the expression.
This kind of convenience feature isn't just decorative or syntactic sugar. It allows for a very powerful application of DRY principles in your code: a single expression that specifies multiple columns expands into a list of expressions (depending on the DataFrame schema), resulting in being able to select multiple columns + run computation on them!
Select all, or all but some
We can select all columns in the DataFrame
object by providing the argument *
:
out = df.select(pl.col("*"))
# Is equivalent to
out = df.select(pl.all())
print(out)
let out = df.clone().lazy().select([col("*")]).collect()?;
// Is equivalent to
let out = df.clone().lazy().select([all()]).collect()?;
println!("{}", &out);
shape: (3, 7)
┌─────┬─────┬────────┬────────────┬───────────┬────────────┬─────────────────────┐
│ rn ┆ id ┆ place ┆ date ┆ sales ┆ has_people ┆ logged_at │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ str ┆ date ┆ f64 ┆ bool ┆ datetime[μs] │
╞═════╪═════╪════════╪════════════╪═══════════╪════════════╪═════════════════════╡
│ 0 ┆ 9 ┆ Mars ┆ 2022-01-01 ┆ 33.4 ┆ false ┆ 2022-12-01 00:00:00 │
│ 1 ┆ 4 ┆ Earth ┆ 2022-01-02 ┆ 2142134.1 ┆ true ┆ 2022-12-01 00:00:01 │
│ 2 ┆ 2 ┆ Saturn ┆ 2022-01-03 ┆ 44.7 ┆ false ┆ 2022-12-01 00:00:02 │
└─────┴─────┴────────┴────────────┴───────────┴────────────┴─────────────────────┘
Often, we don't just want to include all columns, but include all while excluding a few. This can be done easily as well:
out = df.select(pl.col("*").exclude("logged_at", "rn"))
print(out)
let out = df
.clone()
.lazy()
.select([col("*").exclude(["logged_at", "rn"])])
.collect()?;
println!("{}", &out);
shape: (3, 5)
┌─────┬────────┬────────────┬───────────┬────────────┐
│ id ┆ place ┆ date ┆ sales ┆ has_people │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ date ┆ f64 ┆ bool │
╞═════╪════════╪════════════╪═══════════╪════════════╡
│ 9 ┆ Mars ┆ 2022-01-01 ┆ 33.4 ┆ false │
│ 4 ┆ Earth ┆ 2022-01-02 ┆ 2142134.1 ┆ true │
│ 2 ┆ Saturn ┆ 2022-01-03 ┆ 44.7 ┆ false │
└─────┴────────┴────────────┴───────────┴────────────┘
By multiple strings
Specifying multiple strings allows expressions to expand to all matching columns:
out = df.select(pl.col("date", "logged_at").dt.to_string("%Y-%h-%d"))
print(out)
let out = df
.clone()
.lazy()
.select([cols(["date", "logged_at"]).dt().to_string("%Y-%h-%d")])
.collect()?;
println!("{}", &out);
shape: (3, 2)
┌─────────────┬─────────────┐
│ date ┆ logged_at │
│ --- ┆ --- │
│ str ┆ str │
╞═════════════╪═════════════╡
│ 2022-Jan-01 ┆ 2022-Dec-01 │
│ 2022-Jan-02 ┆ 2022-Dec-01 │
│ 2022-Jan-03 ┆ 2022-Dec-01 │
└─────────────┴─────────────┘
By regular expressions
Multiple column selection is possible by regular expressions also, by making sure to wrap the regex by ^
and $
to let pl.col
know that a regex selection is expected:
out = df.select(pl.col("^.*(as|sa).*$"))
print(out)
let out = df.clone().lazy().select([col("^.*(as|sa).*$")]).collect()?;
println!("{}", &out);
shape: (3, 2)
┌───────────┬────────────┐
│ sales ┆ has_people │
│ --- ┆ --- │
│ f64 ┆ bool │
╞═══════════╪════════════╡
│ 33.4 ┆ false │
│ 2142134.1 ┆ true │
│ 44.7 ┆ false │
└───────────┴────────────┘
By data type
pl.col
can select multiple columns using Polars data types:
out = df.select(pl.col(pl.Int64, pl.UInt32, pl.Boolean).n_unique())
print(out)
let out = df
.clone()
.lazy()
.select([dtype_cols([DataType::Int64, DataType::UInt32, DataType::Boolean]).n_unique()])
.collect()?;
// gives different result than python as the id col is i32 in rust
println!("{}", &out);
shape: (1, 3)
┌─────┬─────┬────────────┐
│ rn ┆ id ┆ has_people │
│ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 │
╞═════╪═════╪════════════╡
│ 3 ┆ 3 ┆ 2 │
└─────┴─────┴────────────┘
Using selectors
Polars also allows for the use of intuitive selections for columns based on their name, dtype
or other properties; and this is built on top of existing functionality outlined in col
used above. It is recommended to use them by importing and aliasing polars.selectors
as cs
.
By dtype
To select just the integer and string columns, we can do:
import polars.selectors as cs
out = df.select(cs.integer(), cs.string())
print(out)
// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/10594
shape: (3, 3)
┌─────┬─────┬────────┐
│ rn ┆ id ┆ place │
│ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ str │
╞═════╪═════╪════════╡
│ 0 ┆ 9 ┆ Mars │
│ 1 ┆ 4 ┆ Earth │
│ 2 ┆ 2 ┆ Saturn │
└─────┴─────┴────────┘
Applying set operations
These selectors also allow for set based selection operations. For instance, to select the numeric columns except the first column that indicates row numbers:
out = df.select(cs.numeric() - cs.first())
print(out)
// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/10594
shape: (3, 2)
┌─────┬───────────┐
│ id ┆ sales │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪═══════════╡
│ 9 ┆ 33.4 │
│ 4 ┆ 2142134.1 │
│ 2 ┆ 44.7 │
└─────┴───────────┘
We can also select the row number by name and any non-numeric columns:
out = df.select(cs.by_name("rn") | ~cs.numeric())
print(out)
// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/10594
shape: (3, 5)
┌─────┬────────┬────────────┬────────────┬─────────────────────┐
│ rn ┆ place ┆ date ┆ has_people ┆ logged_at │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ date ┆ bool ┆ datetime[μs] │
╞═════╪════════╪════════════╪════════════╪═════════════════════╡
│ 0 ┆ Mars ┆ 2022-01-01 ┆ false ┆ 2022-12-01 00:00:00 │
│ 1 ┆ Earth ┆ 2022-01-02 ┆ true ┆ 2022-12-01 00:00:01 │
│ 2 ┆ Saturn ┆ 2022-01-03 ┆ false ┆ 2022-12-01 00:00:02 │
└─────┴────────┴────────────┴────────────┴─────────────────────┘
By patterns and substrings
Selectors can also be matched by substring and regex patterns:
out = df.select(cs.contains("rn"), cs.matches(".*_.*"))
print(out)
// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/1059
shape: (3, 3)
┌─────┬────────────┬─────────────────────┐
│ rn ┆ has_people ┆ logged_at │
│ --- ┆ --- ┆ --- │
│ u32 ┆ bool ┆ datetime[μs] │
╞═════╪════════════╪═════════════════════╡
│ 0 ┆ false ┆ 2022-12-01 00:00:00 │
│ 1 ┆ true ┆ 2022-12-01 00:00:01 │
│ 2 ┆ false ┆ 2022-12-01 00:00:02 │
└─────┴────────────┴─────────────────────┘
Converting to expressions
What if we want to apply a specific operation on the selected columns (i.e. get back to representing them as expressions to operate upon)? We can simply convert them using as_expr
and then proceed as normal:
out = df.select(cs.temporal().as_expr().dt.to_string("%Y-%h-%d"))
print(out)
// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/10594
shape: (3, 2)
┌─────────────┬─────────────┐
│ date ┆ logged_at │
│ --- ┆ --- │
│ str ┆ str │
╞═════════════╪═════════════╡
│ 2022-Jan-01 ┆ 2022-Dec-01 │
│ 2022-Jan-02 ┆ 2022-Dec-01 │
│ 2022-Jan-03 ┆ 2022-Dec-01 │
└─────────────┴─────────────┘
Debugging selectors
Polars also provides two helpful utility functions to aid with using selectors: is_selector
and selector_column_names
:
from polars.selectors import is_selector
out = cs.temporal()
print(is_selector(out))
// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/10594
True
To predetermine the column names that are selected, which is especially useful for a LazyFrame object:
from polars.selectors import expand_selector
out = cs.temporal().as_expr().dt.to_string("%Y-%h-%d")
print(expand_selector(df, out))
// Not available in Rust, refer the following link
// https://github.com/pola-rs/polars/issues/10594
('date', 'logged_at')