Introduction
This book is an introduction to the Polars DataFrame library. Its goal is to introduce you to Polars by going through examples and comparing it to other solutions. Some design choices are introduced here. The guide will also introduce you to optimal usage of Polars.
Even though Polars is completely written in Rust (no runtime overhead!) and uses Arrow -- the native arrow2 Rust implementation -- as its foundation, the examples presented in this guide will mostly use its higher-level language bindings. Higher-level bindings only serve as a thin wrapper for functionality implemented in the core library.
For Pandas users, our Python package will offer the easiest way to get started with Polars.
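For example, an existing Pandas DataFrame can be converted with pl.from_pandas; a minimal sketch (the toy frame below is purely illustrative, and the conversion requires pyarrow):
import pandas as pd
import polars as pl

pandas_df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
df = pl.from_pandas(pandas_df)  # converts via Apache Arrow
print(df)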
Goals and non-goals
The goal of Polars is to provide a lightning fast DataFrame library that utilizes all available cores on your machine. Unlike tools such as dask -- which tries to parallelize existing single-threaded libraries like NumPy and Pandas -- Polars is written from the ground up, designed for parallelization of queries on DataFrames.
Polars goes to great lengths to:
- Reduce redundant copies
- Traverse memory cache efficiently
- Minimize contention in parallelism
Polars is lazy and semi-lazy. It allows you to do most of your work eagerly, similar to Pandas, but it also provides a powerful expression syntax that will be optimized and executed within the query engine. In lazy Polars we are able to do query optimization on the entire query, further improving performance and reducing memory pressure.
Polars keeps track of your query in a logical plan. This plan is optimized and reordered before running it. When a result is requested, Polars distributes the available work to different executors that use the algorithms available in the eager API to produce a result. Because the whole query context is known to the optimizer and executors of the logical plan, processes dependent on separate data sources can be parallelized on the fly.
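As a small illustration, the sketch below builds a lazy query and prints its logical plan before anything is executed. It assumes a local iris.csv; describe_optimized_plan() was the plan-inspection helper at the time of writing (newer releases expose explain()).
import polars as pl

lazy_query = (
    pl.scan_csv("iris.csv")                # nothing is read yet
    .filter(pl.col("sepal_length") > 5)
    .groupby("species")
    .agg(pl.all().sum())
)
# inspect the optimized logical plan, then execute it
print(lazy_query.describe_optimized_plan())
df = lazy_query.collect()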
Performance 🚀🚀
Polars is very fast, and in fact is one of the best performing solutions available. See the results in h2oai's db-benchmark. The image below shows the biggest datasets yielding a result.
Polars TPCH Benchmark results are now available on the official website.
Current status
Below is a concise list of the features that allow Polars to meet its goals:
- Copy-on-write (COW) semantics
- "Free" clones
- Cheap appends
- Appending without clones
- Column oriented data storage
- No block manager (i.e. predictable performance)
- Missing values indicated with bitmask
- NaN are different from missing
- Bitmask optimizations
- Efficient algorithms
- Very fast IO
- Its csv and parquet readers are among the fastest in existence
- Query optimizations
- Predicate pushdown
- Filtering at scan level
- Projection pushdown
- Projection at scan level
- Aggregate pushdown
- Aggregations at scan level
- Simplify expressions
- Parallel execution of physical plan
- Cardinality based groupby dispatch
- Different groupby strategies based on data cardinality
- SIMD vectorization
- NumPy universal functions
Acknowledgements
Development of Polars
is proudly powered by
Getting started
Installation
Installing and using Polars is just a simple pip install, cargo add, or yarn add away.
# Installing for python
$ pip install polars
// Installing into a Rust project
$ cargo add polars
// Installing for Node
$ yarn add nodejs-polars
All binaries are pre-built for Python v3.6+.
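To quickly check that the installation succeeded, you can print the installed version (pl.__version__ is part of the Python package):
# Verifying the installed version
$ python -c "import polars as pl; print(pl.__version__)"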
Quick start
Below we show a simple snippet that parses a CSV file, filters it, and finishes with a groupby operation. This example is presented in Python only, as the "eager" API is not the preferred model in Rust.
import polars as pl
df = pl.read_csv("https://j.mp/iriscsv")
print(df.filter(pl.col("sepal_length") > 5)
.groupby("species", maintain_order=True)
.agg(pl.all().sum())
)
The snippet above will output:
shape: (3, 5)
┌────────────┬──────────────┬─────────────┬──────────────┬─────────────┐
│ species ┆ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════════╪══════════════╪═════════════╪══════════════╪═════════════╡
│ setosa ┆ 116.9 ┆ 81.7 ┆ 33.2 ┆ 6.1 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ versicolor ┆ 281.9 ┆ 131.8 ┆ 202.9 ┆ 63.3 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ virginica ┆ 324.5 ┆ 146.2 ┆ 273.1 ┆ 99.6 │
└────────────┴──────────────┴─────────────┴──────────────┴─────────────┘
As we can see, Polars
pretty-prints the output object, including the column name and
datatype as headers.
Lazy quick start
If we want to run this query in lazy Polars
we'd write:
import polars as pl
print(
pl.read_csv("https://j.mp/iriscsv")
.lazy()
.filter(pl.col("sepal_length") > 5)
.groupby("species", maintain_order=True)
.agg(pl.all().sum())
.collect()
)
use color_eyre::{Result};
use polars::prelude::*;
use reqwest::blocking::Client;
use std::io::Cursor;
fn main() -> Result<()> {
let data: Vec<u8> = Client::new()
.get("https://j.mp/iriscsv")
.send()?
.text()?
.bytes()
.collect();
let df = CsvReader::new(Cursor::new(data))
.has_header(true)
.finish()?
.lazy()
.filter(col("sepal_length").gt(5))
.groupby([col("species")])
.agg([col("*").sum()])
.collect()?;
println!("{:?}", df);
Ok(())
}
When the data is stored locally, we can also use scan_csv in Python, or LazyCsvReader in Rust, to run the query in lazy Polars.
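For example, a sketch of the same query against a local file, assuming the iris data was saved as iris.csv:
import polars as pl

print(
    pl.scan_csv("iris.csv")
    .filter(pl.col("sepal_length") > 5)
    .groupby("species", maintain_order=True)
    .agg(pl.all().sum())
    .collect()
)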
Note about Rust usage
Some functionality is not enabled by default and must be added as an additional feature. You can enable it by adding it directly to your Cargo.toml:
[dependencies]
polars = { version = "0.24.3", features = ["lazy"] }
reqwest = { version = "0.11.12", features = ["blocking"] }
color-eyre = "0.6"
References
If you want to dive right into the Python API docs, check the reference docs. Alternatively, the Rust API docs are available on docs.rs.
Lazy API
The lazy API builds a query plan. Nothing is executed until you explicitly ask Polars to execute the query (via LazyFrame.collect() or LazyFrame.fetch()). This provides Polars with the entire context of the query, allowing it to apply optimizations and choose the fastest algorithm for that context.
Going from eager to lazy is often as simple as starting your query with .lazy() and ending with .collect().
So the eager snippet above would become:
(
df.lazy()
.filter(pl.col("sepal_length") > 5)
.groupby("species", maintain_order=True)
.agg(pl.all().sum())
.collect()
)
let df = df
.lazy()
.filter(col("sepal_length").gt(5))
.groupby([col("species")])
.agg([col("*" ).sum()])
.collect()?;
Polars quick exploration guide
This quick exploration guide is written for new users of Polars. The goal is to provide an overview of the most common functions and capabilities of the package. In this guide we will provide several examples. At the end of every part there is a link to relevant parts of the Polars Book with more information and the API reference guide.
In this exploration guide we will go through the following topics:
- Installation and Import
- Object creation
- Viewing data
- Expressions
- Combining dataframes
- Remaining topics
Installation and Import
Install Polars
in your (virtual) environment with the following command:
pip install -U polars
Import Polars
as follows:
import polars as pl
# to enrich the examples in this quickstart with dates
from datetime import datetime, timedelta
# to generate data for the examples
import numpy as np
Object creation
From scratch
Creating a simple Series or DataFrame is easy and will feel familiar to users of other packages. You can create a Series in Polars by providing a list or a tuple.
# with a tuple
series = pl.Series("a", (1, 2, 3, 4, 5))
print(series)
shape: (5,)
Series: 'a' [i64]
[
1
2
3
4
5
]
# with a list
series = pl.Series([1, 2, 3, 4, 5])
print(series)
shape: (5,)
Series: '' [i64]
[
1
2
3
4
5
]
A DataFrame
is created from a dict
or a collection of dicts
.
dataframe = pl.DataFrame({"integer": [1, 2, 3],
"date": [
(datetime(2022, 1, 1)),
(datetime(2022, 1, 2)),
(datetime(2022, 1, 3))
],
"float":[4.0, 5.0, 6.0]})
print(dataframe)
shape: (3, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date ┆ float │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ f64 │
╞═════════╪═════════════════════╪═══════╡
│ 1 ┆ 2022-01-01 00:00:00 ┆ 4.0 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 2022-01-02 00:00:00 ┆ 5.0 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ 2022-01-03 00:00:00 ┆ 6.0 │
└─────────┴─────────────────────┴───────┘
From files
In Polars we can also read files and create a DataFrame
. In the following examples we write the output of the DataFrame
from the previous part to a specific file type (csv
, json
and parquet
). After that we will read it and print the output for inspection.
csv
dataframe.write_csv('output.csv')
df_csv = pl.read_csv('output.csv')
print(df_csv)
shape: (3, 3)
┌─────────┬────────────────────────────┬───────┐
│ integer ┆ date ┆ float │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 │
╞═════════╪════════════════════════════╪═══════╡
│ 1 ┆ 2022-01-01T00:00:00.000000 ┆ 4.0 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 2022-01-02T00:00:00.000000 ┆ 5.0 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ 2022-01-03T00:00:00.000000 ┆ 6.0 │
└─────────┴────────────────────────────┴───────┘
As we can see above, Polars read the datetimes back as strings. We can tell Polars to parse dates when reading the csv, to ensure the date column becomes a datetime. The example can be found below:
df_csv_with_dates = pl.read_csv('output.csv', parse_dates=True)
print(df_csv_with_dates)
shape: (3, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date ┆ float │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ f64 │
╞═════════╪═════════════════════╪═══════╡
│ 1 ┆ 2022-01-01 00:00:00 ┆ 4.0 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 2022-01-02 00:00:00 ┆ 5.0 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ 2022-01-03 00:00:00 ┆ 6.0 │
└─────────┴─────────────────────┴───────┘
json
dataframe.write_json('output.json')
df_json = pl.read_json('output.json')
print(df_json)
shape: (3, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date ┆ float │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ f64 │
╞═════════╪═════════════════════╪═══════╡
│ 1 ┆ 2022-01-01 00:00:00 ┆ 4.0 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 2022-01-02 00:00:00 ┆ 5.0 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ 2022-01-03 00:00:00 ┆ 6.0 │
└─────────┴─────────────────────┴───────┘
parquet
dataframe.write_parquet('output.parquet')
df_parquet = pl.read_parquet('output.parquet')
print(df_parquet)
shape: (3, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date ┆ float │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ f64 │
╞═════════╪═════════════════════╪═══════╡
│ 1 ┆ 2022-01-01 00:00:00 ┆ 4.0 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 2022-01-02 00:00:00 ┆ 5.0 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ 2022-01-03 00:00:00 ┆ 6.0 │
└─────────┴─────────────────────┴───────┘
Viewing data
This part focuses on viewing data in a DataFrame
. We first create a DataFrame
to work with.
df = pl.DataFrame({"a": np.arange(0, 8),
"b": np.random.rand(8),
"c": [datetime(2022, 12, 1) + timedelta(days=idx) for idx in range(8)],
"d": [1, 2.0, np.NaN, np.NaN, 0, -5, -42, None]
})
print(df)
shape: (8, 4)
┌─────┬──────────┬─────────────────────┬───────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[μs] ┆ f64 │
╞═════╪══════════╪═════════════════════╪═══════╡
│ 0 ┆ 0.220182 ┆ 2022-12-01 00:00:00 ┆ 1.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1 ┆ 0.750839 ┆ 2022-12-02 00:00:00 ┆ 2.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 0.634639 ┆ 2022-12-03 00:00:00 ┆ NaN │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ 0.67404 ┆ 2022-12-04 00:00:00 ┆ NaN │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4 ┆ 0.102818 ┆ 2022-12-05 00:00:00 ┆ 0.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5 ┆ 0.896408 ┆ 2022-12-06 00:00:00 ┆ -5.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 6 ┆ 0.062943 ┆ 2022-12-07 00:00:00 ┆ -42.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 7 ┆ 0.108093 ┆ 2022-12-08 00:00:00 ┆ null │
└─────┴──────────┴─────────────────────┴───────┘
The head function shows the first 5 rows of a DataFrame by default. You can specify the number of rows you want to see (e.g. df.head(10)).
df.head(5)
shape: (5, 4)
┌─────┬──────────┬─────────────────────┬─────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[μs] ┆ f64 │
╞═════╪══════════╪═════════════════════╪═════╡
│ 0 ┆ 0.220182 ┆ 2022-12-01 00:00:00 ┆ 1.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 0.750839 ┆ 2022-12-02 00:00:00 ┆ 2.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 0.634639 ┆ 2022-12-03 00:00:00 ┆ NaN │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 0.67404 ┆ 2022-12-04 00:00:00 ┆ NaN │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 0.102818 ┆ 2022-12-05 00:00:00 ┆ 0.0 │
└─────┴──────────┴─────────────────────┴─────┘
The tail
function shows the last 5 rows of a DataFrame
. You can also specify the number of rows you want to see, similar to head
.
df.tail(5)
shape: (5, 4)
┌─────┬──────────┬─────────────────────┬───────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[μs] ┆ f64 │
╞═════╪══════════╪═════════════════════╪═══════╡
│ 3 ┆ 0.67404 ┆ 2022-12-04 00:00:00 ┆ NaN │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4 ┆ 0.102818 ┆ 2022-12-05 00:00:00 ┆ 0.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5 ┆ 0.896408 ┆ 2022-12-06 00:00:00 ┆ -5.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 6 ┆ 0.062943 ┆ 2022-12-07 00:00:00 ┆ -42.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 7 ┆ 0.108093 ┆ 2022-12-08 00:00:00 ┆ null │
└─────┴──────────┴─────────────────────┴───────┘
If you want to get an impression of the data in your DataFrame, you can also use sample. With sample you get n random rows from the DataFrame.
df.sample(n=3)
shape: (3, 4)
┌─────┬──────────┬─────────────────────┬──────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[μs] ┆ f64 │
╞═════╪══════════╪═════════════════════╪══════╡
│ 0 ┆ 0.220182 ┆ 2022-12-01 00:00:00 ┆ 1.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1 ┆ 0.750839 ┆ 2022-12-02 00:00:00 ┆ 2.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 7 ┆ 0.108093 ┆ 2022-12-08 00:00:00 ┆ null │
└─────┴──────────┴─────────────────────┴──────┘
Describe
returns summary statistics of your DataFrame
. It will provide several quick statistics if possible.
df.describe()
shape: (7, 5)
┌────────────┬─────────┬──────────┬────────────────────────────┬───────┐
│ describe ┆ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ str ┆ f64 │
╞════════════╪═════════╪══════════╪════════════════════════════╪═══════╡
│ count ┆ 8.0 ┆ 8.0 ┆ 8 ┆ 8.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null_count ┆ 0.0 ┆ 0.0 ┆ 0 ┆ 1.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ mean ┆ 3.5 ┆ 0.431245 ┆ null ┆ NaN │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ std ┆ 2.44949 ┆ 0.340445 ┆ null ┆ NaN │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ min ┆ 0.0 ┆ 0.062943 ┆ 2022-12-01 00:00:00.000000 ┆ -42.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ max ┆ 7.0 ┆ 0.896408 ┆ 2022-12-08 00:00:00.000000 ┆ 2.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ median ┆ 3.5 ┆ 0.42741 ┆ null ┆ 1.0 │
└────────────┴─────────┴──────────┴────────────────────────────┴───────┘
Additional information
- Link to aggregations on DataFrames in the Reference guide: link
- Link to descriptive DataFrame functions in the Reference guide: link
- Link to DataFrame attributes in the Reference guide: link
Expressions
Expressions are the core strength of Polars. Expressions offer a versatile structure that solves easy queries, but is easily extended to complex analyses. Below we will cover the basic components that serve as building blocks for all your queries:
- select
- filter
- with_columns
Select statement
To select a column we need to do two things: define the DataFrame we want the data from, and select the data that we need. In the example below you see that we select col('*'). The asterisk stands for all columns.
df.select(
pl.col('*')
)
shape: (8, 4)
┌─────┬──────────┬─────────────────────┬───────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[μs] ┆ f64 │
╞═════╪══════════╪═════════════════════╪═══════╡
│ 0 ┆ 0.164545 ┆ 2022-12-01 00:00:00 ┆ 1.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1 ┆ 0.747291 ┆ 2022-12-02 00:00:00 ┆ 2.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 0.889227 ┆ 2022-12-03 00:00:00 ┆ NaN │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ 0.736651 ┆ 2022-12-04 00:00:00 ┆ NaN │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4 ┆ 0.099687 ┆ 2022-12-05 00:00:00 ┆ 0.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5 ┆ 0.965809 ┆ 2022-12-06 00:00:00 ┆ -5.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 6 ┆ 0.93697 ┆ 2022-12-07 00:00:00 ┆ -42.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 7 ┆ 0.848925 ┆ 2022-12-08 00:00:00 ┆ null │
└─────┴──────────┴─────────────────────┴───────┘
You can also specify the specific columns that you want to return. There are two ways to do this. The first option is to create a list
of column names, as seen below.
df.select(
pl.col(['a', 'b'])
)
shape: (8, 2)
┌─────┬──────────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪══════════╡
│ 0 ┆ 0.164545 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 0.747291 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 0.889227 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 0.736651 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 0.099687 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ 0.965809 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 6 ┆ 0.93697 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 7 ┆ 0.848925 │
└─────┴──────────┘
The second option is to specify each column within a list
in the select
statement. This option is shown below.
# in this example we limit the number of rows returned to 3, as the comparison is clear.
# this also shows how easy we can extend our expression to what we need.
df.select([
pl.col('a'),
pl.col('b')
]).limit(3)
shape: (3, 2)
┌─────┬──────────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════╪══════════╡
│ 0 ┆ 0.164545 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 0.747291 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 0.889227 │
└─────┴──────────┘
If you want to exclude an entire column from your view, you can simply use exclude
in your select
statement.
df.select([
pl.exclude('a')
])
shape: (8, 3)
┌──────────┬─────────────────────┬───────┐
│ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- │
│ f64 ┆ datetime[μs] ┆ f64 │
╞══════════╪═════════════════════╪═══════╡
│ 0.220182 ┆ 2022-12-01 00:00:00 ┆ 1.0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0.750839 ┆ 2022-12-02 00:00:00 ┆ 2.0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0.634639 ┆ 2022-12-03 00:00:00 ┆ NaN │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0.67404 ┆ 2022-12-04 00:00:00 ┆ NaN │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0.102818 ┆ 2022-12-05 00:00:00 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0.896408 ┆ 2022-12-06 00:00:00 ┆ -5.0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0.062943 ┆ 2022-12-07 00:00:00 ┆ -42.0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 0.108093 ┆ 2022-12-08 00:00:00 ┆ null │
└──────────┴─────────────────────┴───────┘
Additional information
- Link to select with expressions in the Polars Book: link
Filter
The filter
option allows us to create a subset of the DataFrame
. We use the same DataFrame
as earlier and we filter between two specified dates.
df.filter(
pl.col("c").is_between(datetime(2022, 12, 2), datetime(2022, 12, 8)),
)
shape: (5, 4)
┌─────┬──────────┬─────────────────────┬───────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[μs] ┆ f64 │
╞═════╪══════════╪═════════════════════╪═══════╡
│ 2 ┆ 0.634639 ┆ 2022-12-03 00:00:00 ┆ NaN │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ 0.67404 ┆ 2022-12-04 00:00:00 ┆ NaN │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4 ┆ 0.102818 ┆ 2022-12-05 00:00:00 ┆ 0.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5 ┆ 0.896408 ┆ 2022-12-06 00:00:00 ┆ -5.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 6 ┆ 0.062943 ┆ 2022-12-07 00:00:00 ┆ -42.0 │
└─────┴──────────┴─────────────────────┴───────┘
With filter
you can also create more complex filters that include multiple columns.
df.filter(
(pl.col('a') <= 3) & (pl.col('d').is_not_nan())
)
shape: (2, 4)
┌─────┬──────────┬─────────────────────┬─────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[μs] ┆ f64 │
╞═════╪══════════╪═════════════════════╪═════╡
│ 0 ┆ 0.220182 ┆ 2022-12-01 00:00:00 ┆ 1.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 0.750839 ┆ 2022-12-02 00:00:00 ┆ 2.0 │
└─────┴──────────┴─────────────────────┴─────┘
Additional information
- Link to filtering in expressions in the Polars Book: link
With_columns
with_columns allows you to create new columns for your analyses. We create two new columns, e and b+42. First we sum all values from column b and store the result in column e. After that we add 42 to the values of b and store the result in a new column b+42.
df.with_columns([
pl.col('b').sum().alias('e'),
(pl.col('b') + 42).alias('b+42')
])
shape: (8, 6)
┌─────┬──────────┬─────────────────────┬───────┬──────────┬───────────┐
│ a ┆ b ┆ c ┆ d ┆ e ┆ b+42 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[μs] ┆ f64 ┆ f64 ┆ f64 │
╞═════╪══════════╪═════════════════════╪═══════╪══════════╪═══════════╡
│ 0 ┆ 0.606396 ┆ 2022-12-01 00:00:00 ┆ 1.0 ┆ 4.126554 ┆ 42.606396 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 0.404966 ┆ 2022-12-02 00:00:00 ┆ 2.0 ┆ 4.126554 ┆ 42.404966 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 0.619193 ┆ 2022-12-03 00:00:00 ┆ NaN ┆ 4.126554 ┆ 42.619193 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 0.41586 ┆ 2022-12-04 00:00:00 ┆ NaN ┆ 4.126554 ┆ 42.41586 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 0.35721 ┆ 2022-12-05 00:00:00 ┆ 0.0 ┆ 4.126554 ┆ 42.35721 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ 0.726861 ┆ 2022-12-06 00:00:00 ┆ -5.0 ┆ 4.126554 ┆ 42.726861 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 6 ┆ 0.201782 ┆ 2022-12-07 00:00:00 ┆ -42.0 ┆ 4.126554 ┆ 42.201782 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 7 ┆ 0.794286 ┆ 2022-12-08 00:00:00 ┆ null ┆ 4.126554 ┆ 42.794286 │
└─────┴──────────┴─────────────────────┴───────┴──────────┴───────────┘
Groupby
We will create a new DataFrame for the groupby functionality. This new DataFrame will include several 'groups' that we want to group by.
df2 = pl.DataFrame({
"x": np.arange(0, 8),
"y": ['A', 'A', 'A', 'B', 'B', 'C', 'X', 'X'],
})
print(df2)
shape: (8, 2)
┌─────┬─────┐
│ x ┆ y │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 0 ┆ A │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ A │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ A │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ B │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ B │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 5 ┆ C │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 6 ┆ X │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 7 ┆ X │
└─────┴─────┘
# without maintain_order you will get a random order back.
df2.groupby("y", maintain_order=True).count()
shape: (4, 2)
┌─────┬───────┐
│ y ┆ count │
│ --- ┆ --- │
│ str ┆ u32 │
╞═════╪═══════╡
│ A ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ B ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ C ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ X ┆ 2 │
└─────┴───────┘
df2.groupby("y", maintain_order=True).agg([
pl.col("*").count().alias("count"),
pl.col("*").sum().alias("sum")
])
shape: (4, 3)
┌─────┬───────┬─────┐
│ y ┆ count ┆ sum │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ i64 │
╞═════╪═══════╪═════╡
│ A ┆ 3 ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ B ┆ 2 ┆ 7 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ C ┆ 1 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ X ┆ 2 ┆ 13 │
└─────┴───────┴─────┘
Additional information
- Link to groupby with expressions in the Polars Book: link
Combining operations
Below are some examples of how to combine operations to create the DataFrame you require.
# create a new column that multiplies column `a` and `b` from our DataFrame
# select all the columns, but exclude column `c` and `d` from the final DataFrame
df_x = df.with_column(
(pl.col("a") * pl.col("b")).alias("a * b")
).select([
pl.all().exclude(['c', 'd'])
])
print(df_x)
shape: (8, 3)
┌─────┬──────────┬──────────┐
│ a ┆ b ┆ a * b │
│ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 │
╞═════╪══════════╪══════════╡
│ 0 ┆ 0.220182 ┆ 0.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 0.750839 ┆ 0.750839 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 0.634639 ┆ 1.269277 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 0.67404 ┆ 2.022121 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 0.102818 ┆ 0.41127 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ 0.896408 ┆ 4.482038 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 6 ┆ 0.062943 ┆ 0.377657 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 7 ┆ 0.108093 ┆ 0.756653 │
└─────┴──────────┴──────────┘
# only excluding column `d` in this example
df_y = df.with_columns([
(pl.col("a") * pl.col("b")).alias("a * b")
]).select([
pl.all().exclude('d')
])
print(df_y)
shape: (8, 4)
┌─────┬──────────┬─────────────────────┬──────────┐
│ a ┆ b ┆ c ┆ a * b │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[μs] ┆ f64 │
╞═════╪══════════╪═════════════════════╪══════════╡
│ 0 ┆ 0.220182 ┆ 2022-12-01 00:00:00 ┆ 0.0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 0.750839 ┆ 2022-12-02 00:00:00 ┆ 0.750839 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 0.634639 ┆ 2022-12-03 00:00:00 ┆ 1.269277 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 0.67404 ┆ 2022-12-04 00:00:00 ┆ 2.022121 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 0.102818 ┆ 2022-12-05 00:00:00 ┆ 0.41127 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ 0.896408 ┆ 2022-12-06 00:00:00 ┆ 4.482038 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 6 ┆ 0.062943 ┆ 2022-12-07 00:00:00 ┆ 0.377657 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 7 ┆ 0.108093 ┆ 2022-12-08 00:00:00 ┆ 0.756653 │
└─────┴──────────┴─────────────────────┴──────────┘
Additional information
- Link to contexts in expressions in the Polars Book: link
Combining dataframes
Join
Let's have a closer look at how to join two DataFrames into a single DataFrame.
df = pl.DataFrame({"a": np.arange(0, 8),
"b": np.random.rand(8),
"c": [datetime(2022, 12, 1) + timedelta(days=idx) for idx in range(8)],
"d": [1, 2.0, np.NaN, np.NaN, 0, -5, -42, None]
})
df2 = pl.DataFrame({
"x": np.arange(0, 8),
"y": ['A', 'A', 'A', 'B', 'B', 'C', 'X', 'X'],
})
Our two DataFrames
both have an 'id'-like column: a
and x
. We can use those columns to join
the DataFrames
in this example.
df.join(df2, left_on="a", right_on="x")
shape: (8, 5)
┌─────┬──────────┬─────────────────────┬───────┬─────┐
│ a ┆ b ┆ c ┆ d ┆ y │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[μs] ┆ f64 ┆ str │
╞═════╪══════════╪═════════════════════╪═══════╪═════╡
│ 0 ┆ 0.220182 ┆ 2022-12-01 00:00:00 ┆ 1.0 ┆ A │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 0.750839 ┆ 2022-12-02 00:00:00 ┆ 2.0 ┆ A │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 0.634639 ┆ 2022-12-03 00:00:00 ┆ NaN ┆ A │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 0.67404 ┆ 2022-12-04 00:00:00 ┆ NaN ┆ B │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 0.102818 ┆ 2022-12-05 00:00:00 ┆ 0.0 ┆ B │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 5 ┆ 0.896408 ┆ 2022-12-06 00:00:00 ┆ -5.0 ┆ C │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 6 ┆ 0.062943 ┆ 2022-12-07 00:00:00 ┆ -42.0 ┆ X │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 7 ┆ 0.108093 ┆ 2022-12-08 00:00:00 ┆ null ┆ X │
└─────┴──────────┴─────────────────────┴───────┴─────┘
Concat
We can also concatenate two DataFrames. Vertical concatenation makes the DataFrame longer; horizontal concatenation makes the DataFrame wider. Below you can see the result of a horizontal concatenation of our two DataFrames.
pl.concat([df,df2], how="horizontal")
shape: (8, 6)
┌─────┬──────────┬─────────────────────┬───────┬─────┬─────┐
│ a ┆ b ┆ c ┆ d ┆ x ┆ y │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[μs] ┆ f64 ┆ i64 ┆ str │
╞═════╪══════════╪═════════════════════╪═══════╪═════╪═════╡
│ 0 ┆ 0.220182 ┆ 2022-12-01 00:00:00 ┆ 1.0 ┆ 0 ┆ A │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 0.750839 ┆ 2022-12-02 00:00:00 ┆ 2.0 ┆ 1 ┆ A │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 0.634639 ┆ 2022-12-03 00:00:00 ┆ NaN ┆ 2 ┆ A │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 0.67404 ┆ 2022-12-04 00:00:00 ┆ NaN ┆ 3 ┆ B │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 0.102818 ┆ 2022-12-05 00:00:00 ┆ 0.0 ┆ 4 ┆ B │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 5 ┆ 0.896408 ┆ 2022-12-06 00:00:00 ┆ -5.0 ┆ 5 ┆ C │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 6 ┆ 0.062943 ┆ 2022-12-07 00:00:00 ┆ -42.0 ┆ 6 ┆ X │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 7 ┆ 0.108093 ┆ 2022-12-08 00:00:00 ┆ null ┆ 7 ┆ X │
└─────┴──────────┴─────────────────────┴───────┴─────┴─────┘
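For completeness, here is a minimal sketch of a vertical concatenation. Note that df and df2 above have different columns, so the frames below (df_a and df_b) are hypothetical frames with identical schemas:
df_a = pl.DataFrame({"a": [1, 2], "b": ["x", "y"]})
df_b = pl.DataFrame({"a": [3, 4], "b": ["z", "w"]})

# "vertical" is the default and stacks the frames on top of each other
print(pl.concat([df_a, df_b], how="vertical"))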
Additional information
- Link to concatenation in the Polars Book: link
- More information about concatenation in the Reference guide: link
Remaining topics
This guide was a quick introduction to some of the most used functions within Polars. There is a lot more to explore, both in the Polars Book and in the Reference guide. Below are other common topics, including links to find more information about them.
- Dealing with timeseries link
- Processing missing data link
- Reading data from Pandas DataFrame or Numpy array link
- Working with the Lazy API link
Polars Expressions
Polars
has a powerful concept called expressions that is central to its very fast performance.
Expressions are at the core of many data science operations:
- taking a sample of rows from a column
- multiplying values in a column
- extracting a column of years from dates
- converting a column of strings to lowercase
- and so on!
However, expressions are also used within other operations:
- taking the mean of a group in a groupby operation
- calculating the size of groups in a groupby operation
- taking the sum horizontally across columns
Polars
performs these core data transformations very quickly by:
- automatic query optimization on each expression
- automatic parallelization of expressions on many columns
Polars expressions are a mapping from a series to a series (or, mathematically, Fn(Series) -> Series). Because expressions take a Series as input and produce a Series as output, it is straightforward to chain a sequence of expressions (similar to method chaining in Pandas).
This has all been a bit abstract, so let's start with some examples.
Polars Expressions
The following is an expression:
pl.col("foo").sort().head(2)
df.column("foo")?.sort(false).head(Some(2));
The snippet above says:
- Select column "foo"
- Then sort the column (not in reversed order)
- Then take the first two values of the sorted output
The power of expressions is that every expression produces a new expression, and that they can be piped together. You can run an expression by passing it to one of Polars' execution contexts.
Here we run two expressions by running df.select
:
df.select([
pl.col("foo").sort().head(2),
pl.col("bar").filter(pl.col("foo") == 1).sum()
])
df.select([
col("foo").sort(Default::default()).head(Some(2)),
col("bar").filter(col("foo").eq(lit(1))).sum(),
]).collect()?;
All expressions are run in parallel, meaning that separate Polars
expressions are embarrassingly parallel. Note that within an expression there may be more parallelization going on.
Expression examples
In this section we will go through some examples, but first let's create a dataset:
import polars as pl
import numpy as np
np.random.seed(12)
df = pl.DataFrame(
{
"nrs": [1, 2, 3, None, 5],
"names": ["foo", "ham", "spam", "egg", None],
"random": np.random.rand(5),
"groups": ["A", "A", "B", "C", "B"],
}
)
print(df)
use color_eyre::Result;
use polars::prelude::*;
use rand::{thread_rng, Rng};
fn main() -> Result<()> {
let mut arr = [0f64; 5];
thread_rng().fill(&mut arr);
let df = df! (
"nrs" => &[Some(1), Some(2), Some(3), None, Some(5)],
"names" => &[Some("foo"), Some("ham"), Some("spam"), Some("eggs"), None],
"random" => &arr,
"groups" => &["A", "A", "B", "C", "B"],
)?;
println!("{}", &df);
shape: (5, 4)
┌──────┬───────┬──────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str │
╞══════╪═══════╪══════════╪════════╡
│ 1 ┆ foo ┆ 0.154163 ┆ A │
│ 2 ┆ ham ┆ 0.74005 ┆ A │
│ 3 ┆ spam ┆ 0.263315 ┆ B │
│ null ┆ egg ┆ 0.533739 ┆ C │
│ 5 ┆ null ┆ 0.014575 ┆ B │
└──────┴───────┴──────────┴────────┘
You can do a lot with expressions. They are so expressive that you sometimes have multiple ways to get the same results. To get a better feel for them let's go through some more examples.
A note for the Rust examples: each of these examples uses the same dataset. So, due to Rust's ownership rules, and the fact that all the examples run in the same context, we'll clone() the dataset for each example to ensure that no prior example affects the behavior of later examples. This is the case for all Rust examples for the remainder of this book. It's worth mentioning that clones in Polars are very efficient and don't result in a "deep copy" of the data. They're implemented using the Rust Arc (atomic reference counting) type.
Count unique values
We can count the unique values in a column. Note that we are creating the same result in different ways. To avoid duplicate column names in the DataFrame, we use an alias expression to rename the resulting columns.
out = df.select(
[
pl.col("names").n_unique().alias("unique_names_1"),
pl.col("names").unique().count().alias("unique_names_2"),
]
)
print(out)
let out = df
.clone()
.lazy()
.select([
col("names").n_unique().alias("unique_names_1"),
col("names").unique().count().alias("unique_names_2"),
])
.collect()?;
println!("{}", out);
shape: (1, 2)
┌────────────────┬────────────────┐
│ unique_names_1 ┆ unique_names_2 │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞════════════════╪════════════════╡
│ 5 ┆ 5 │
└────────────────┴────────────────┘
Various aggregations
We can do various aggregations. Below are examples of some of them, but there are more, such as median, mean, first, etc.
out = df.select(
[
pl.sum("random").alias("sum"),
pl.min("random").alias("min"),
pl.max("random").alias("max"),
pl.col("random").max().alias("other_max"),
pl.std("random").alias("std dev"),
pl.var("random").alias("variance"),
]
)
print(out)
let out = df
.clone()
.lazy()
.select([
sum("random").alias("sum"),
min("random").alias("min"),
max("random").alias("max"),
col("random").max().alias("other_max"),
col("random").std(1).alias("std dev"),
col("random").var(1).alias("variance"),
])
.collect()?;
println!("{}", out);
shape: (1, 6)
┌──────────┬──────────┬─────────┬───────────┬──────────┬──────────┐
│ sum ┆ min ┆ max ┆ other_max ┆ std dev ┆ variance │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞══════════╪══════════╪═════════╪═══════════╪══════════╪══════════╡
│ 1.705842 ┆ 0.014575 ┆ 0.74005 ┆ 0.74005 ┆ 0.293209 ┆ 0.085971 │
└──────────┴──────────┴─────────┴───────────┴──────────┴──────────┘
Filter and conditionals
We can also do some pretty complex things. In the next snippet we count all names ending with the string "am".
Note that in Rust, the strings feature must be enabled for str expressions to be available.
out = df.select(
[
pl.col("names").filter(pl.col("names").str.contains(r"am$")).count(),
]
)
print(out)
let out = df
.clone()
.lazy()
.select([col("names")
.filter(col("names").str().contains("am$"))
.count()])
.collect()?;
println!("{}", out);
shape: (1, 1)
┌───────┐
│ names │
│ --- │
│ u32 │
╞═══════╡
│ 2 │
└───────┘
Binary functions and modification
In the example below we use a conditional to create a new expression in the following
when -> then -> otherwise
construct. The when
function requires a predicate
expression (and thus leads to a boolean Series
). The then
function expects an
expression that will be used in case the predicate evaluates to true
, and the otherwise
function expects an expression that will be used in case the predicate evaluates to false
.
Note that you can pass any expression, or just base expressions like pl.col("foo"), pl.lit(3), pl.lit("bar"), etc.
Finally, we multiply this with the result of a sum
expression:
out = df.select(
[
pl.when(pl.col("random") > 0.5).then(0).otherwise(pl.col("random")) * pl.sum("nrs"),
]
)
print(out)
let out = df
.clone()
.lazy()
.select([when(col("random").gt(0.5)).then(0).otherwise(col("random")) * sum("nrs")])
.collect()?;
println!("{}", out);
shape: (5, 1)
┌──────────┐
│ literal │
│ --- │
│ f64 │
╞══════════╡
│ 1.695791 │
│ 0.0 │
│ 2.896465 │
│ 0.0 │
│ 0.160325 │
└──────────┘
Window expressions
A Polars expression can also do an implicit GROUPBY, AGGREGATION, and JOIN in a single expression.
In the examples below we do a GROUPBY OVER "groups"
and AGGREGATE SUM of "random"
, and in the next expression
we GROUPBY OVER "names"
and AGGREGATE a LIST of "random"
. These window functions can be combined with other expressions
and are an efficient way to determine group statistics. See more on those group statistics here.
df = df.select(
[
pl.col("*"), # select all
pl.col("random").sum().over("groups").alias("sum[random]/groups"),
pl.col("random").list().over("names").alias("random/name"),
]
)
print(df)
let df = df
.lazy()
.select([
col("*"), // Select all
col("random")
.sum()
.over([col("groups")])
.alias("sum[random]/groups"),
col("random")
.list()
.over([col("names")])
.alias("random/name"),
])
.collect()?;
println!("{}", df);
shape: (5, 6)
┌──────┬───────┬──────────┬────────┬────────────────────┬─────────────┐
│ nrs ┆ names ┆ random ┆ groups ┆ sum[random]/groups ┆ random/name │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str ┆ f64 ┆ list[f64] │
╞══════╪═══════╪══════════╪════════╪════════════════════╪═════════════╡
│ 1 ┆ foo ┆ 0.154163 ┆ A ┆ 0.894213 ┆ [0.154163] │
│ 2 ┆ ham ┆ 0.74005 ┆ A ┆ 0.894213 ┆ [0.74005] │
│ 3 ┆ spam ┆ 0.263315 ┆ B ┆ 0.27789 ┆ [0.263315] │
│ null ┆ egg ┆ 0.533739 ┆ C ┆ 0.533739 ┆ [0.533739] │
│ 5 ┆ null ┆ 0.014575 ┆ B ┆ 0.27789 ┆ [0.014575] │
└──────┴───────┴──────────┴────────┴────────────────────┴─────────────┘
Conclusion
This is the tip of the iceberg in terms of possible expressions. There are a ton more, and they can be combined in a variety of ways.
This page was an introduction to Polars
expressions, and gave a glimpse of what's
possible with them. In the next page we'll discuss in which contexts expressions can be used. Later in the guide we'll go through expressions in various groupby contexts, all while keeping Polars
execution parallel.
Expression contexts
You cannot use an expression just anywhere. An expression needs a context; the available contexts are:
- selection: df.select([..])
- groupby aggregation: df.groupby(..).agg([..])
- hstack / adding columns: df.with_columns([..])
Syntactic sugar
The reason for such a context is that you are actually using the Polars lazy API, even when you use it eagerly. For instance, this snippet:
df.groupby("foo").agg([pl.col("bar").sum()])
eager.groupby(["foo"])?.agg(&[("bar", &["sum"])])?;
actually desugars to:
(df.lazy().groupby("foo").agg([pl.col("bar").sum()])).collect()
eager.lazy().groupby(["foo"]).agg([col("bar").sum()]).collect()?;
This allows Polars to push the expression into the query engine, do optimizations, and cache intermediate results.
Rust differs from Python somewhat in this respect. Where Python's eager mode is little more than a thin veneer over the lazy API, Rust's eager mode is closer to an implementation detail, and isn't really recommended for end-user use. It is possible that the eager API in Rust will be scoped private sometime in the future. Therefore, for the remainder of this document, assume that the Rust examples are using the lazy API.
Select context
In the select context the selection applies expressions over columns. The expressions in this context must produce Series that are all the same length or have a length of 1.
A Series of length 1 will be broadcast to match the height of the DataFrame.
Note that a select may produce new columns that are aggregations, combinations of expressions, or literals.
Selection context
out = df.select(
[
pl.sum("nrs"),
pl.col("names").sort(),
pl.col("names").first().alias("first name"),
(pl.mean("nrs") * 10).alias("10xnrs"),
]
)
print(out)
let out = df
.clone()
.lazy()
.select([
sum("nrs"),
col("names").sort(false),
col("names").first().alias("first name"),
mean("nrs").mul(lit(10)).alias("10xnrs"),
])
.collect()?;
println!("{}", out);
shape: (5, 4)
┌─────┬───────┬────────────┬────────┐
│ nrs ┆ names ┆ first name ┆ 10xnrs │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ f64 │
╞═════╪═══════╪════════════╪════════╡
│ 11 ┆ null ┆ foo ┆ 27.5 │
│ 11 ┆ egg ┆ foo ┆ 27.5 │
│ 11 ┆ foo ┆ foo ┆ 27.5 │
│ 11 ┆ ham ┆ foo ┆ 27.5 │
│ 11 ┆ spam ┆ foo ┆ 27.5 │
└─────┴───────┴────────────┴────────┘
Add columns
Adding columns to a DataFrame
using with_columns
is also the selection
context.
df = df.with_columns(
[
pl.sum("nrs").alias("nrs_sum"),
pl.col("random").count().alias("count"),
]
)
print(df)
let out = df
.clone()
.lazy()
.with_columns([
sum("nrs").alias("nrs_sum"),
col("random").count().alias("count"),
])
.collect()?;
println!("{}", out);
shape: (5, 6)
┌──────┬───────┬──────────┬────────┬─────────┬───────┐
│ nrs ┆ names ┆ random ┆ groups ┆ nrs_sum ┆ count │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str ┆ i64 ┆ u32 │
╞══════╪═══════╪══════════╪════════╪═════════╪═══════╡
│ 1 ┆ foo ┆ 0.154163 ┆ A ┆ 11 ┆ 5 │
│ 2 ┆ ham ┆ 0.74005 ┆ A ┆ 11 ┆ 5 │
│ 3 ┆ spam ┆ 0.263315 ┆ B ┆ 11 ┆ 5 │
│ null ┆ egg ┆ 0.533739 ┆ C ┆ 11 ┆ 5 │
│ 5 ┆ null ┆ 0.014575 ┆ B ┆ 11 ┆ 5 │
└──────┴───────┴──────────┴────────┴─────────┴───────┘
Groupby context
In the groupby
context expressions work on groups and thus may yield results of any length (a group may have many members).
out = df.groupby("groups").agg(
[
pl.sum("nrs"), # sum nrs by groups
pl.col("random").count().alias("count"), # count group members
# sum random where name != null
pl.col("random").filter(pl.col("names").is_not_null()).sum().suffix("_sum"),
pl.col("names").reverse().alias("reversed names"),
]
)
print(out)
let out = df
.lazy()
.groupby([col("groups")])
.agg([
sum("nrs"), // sum nrs by groups
col("random").count().alias("count"), // count group members
// sum random where name != null
col("random")
.filter(col("names").is_not_null())
.sum()
.suffix("_sum"),
col("names").reverse().alias("reversed names"),
])
.collect()?;
println!("{}", out);
shape: (3, 5)
┌────────┬──────┬───────┬────────────┬────────────────┐
│ groups ┆ nrs ┆ count ┆ random_sum ┆ reversed names │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u32 ┆ f64 ┆ list[str] │
╞════════╪══════╪═══════╪════════════╪════════════════╡
│ C ┆ null ┆ 1 ┆ 0.533739 ┆ ["egg"] │
│ A ┆ 3 ┆ 2 ┆ 0.894213 ┆ ["ham", "foo"] │
│ B ┆ 8 ┆ 2 ┆ 0.263315 ┆ [null, "spam"] │
└────────┴──────┴───────┴────────────┴────────────────┘
Besides the standard groupby, groupby_dynamic and groupby_rolling are also entrances to the groupby context.
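As a small sketch of what groupby_dynamic looks like (the time and value columns below are purely illustrative, and the index column must be sorted):
from datetime import datetime, timedelta
import polars as pl

df_time = pl.DataFrame({
    "time": [datetime(2022, 12, 1) + timedelta(days=i) for i in range(6)],
    "value": [1, 2, 3, 4, 5, 6],
})
# aggregate "value" over dynamic windows of two days
print(df_time.groupby_dynamic("time", every="2d").agg(pl.col("value").sum()))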
GroupBy
The GroupBy page is under construction.
A multithreaded approach
One of the most efficient ways to process tabular data is to parallelize its processing
via the "split-apply-combine" approach. This operation is at the core of the Polars
grouping implementation, allowing it to attain lightning-fast operations. Specifically, both the "split" and "apply" phases are executed in a multi-threaded
fashion.
A simple grouping operation is taken below as an example to illustrate this approach:
For the hashing operations performed during the "split" phase, Polars
uses a
multithreaded lock-free approach that is illustrated on the following schema:
This parallelization allows the grouping and joining operations (for instance) to be blazingly fast!
Check out this blog post for more details.
Do not kill the parallelization!
The following is specific to Python, and doesn't apply to Rust. Within Rust, blocks and closures (lambdas) can, and will, be executed concurrently.
We have all heard that Python is slow, and does "not scale." Besides the overhead of running "slow" bytecode, Python has to remain within the constraints of the Global Interpreter Lock (GIL). This means that if you were to use a lambda or a custom Python function during a parallelized phase, Polars' speed would be capped by running single-threaded Python code, preventing multiple threads from executing the function.
This all feels terribly limiting, especially because we often need those lambda functions in a .groupby() step, for example. This approach is still supported by Polars, but keep in mind that the bytecode and GIL costs still have to be paid.
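To make that concrete, the sketch below contrasts a custom Python function with the equivalent native expression on a small, hypothetical DataFrame. Both produce the same result, but only the expression version can be optimized and executed in parallel:
import polars as pl

df_small = pl.DataFrame({"groups": ["A", "A", "B"], "nrs": [1, 2, 3]})

# a lambda applied per group: runs Python bytecode under the GIL
slow = df_small.groupby("groups").agg(pl.col("nrs").apply(lambda group: group.sum()))

# the native expression: optimized and executed in parallel
fast = df_small.groupby("groups").agg(pl.col("nrs").sum())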
To mitigate this, Polars
implements a powerful syntax defined not only in its lazy API,
but also in its eager API. Let's take a look at what that means.
We can start with the simple
US congress dataset
.
Note to Rust users, the
dtype-categorical
feature must be enabled for the examples in this section.
url = "https://theunitedstates.io/congress-legislators/legislators-historical.csv"
dtypes = {
"first_name": pl.Categorical,
"gender": pl.Categorical,
"type": pl.Categorical,
"state": pl.Categorical,
"party": pl.Categorical,
}
dataset = pl.read_csv(url, dtypes=dtypes).with_column(pl.col("birthday").str.strptime(pl.Date, strict=False))
let url = "https://theunitedstates.io/congress-legislators/legislators-historical.csv";
let mut schema = Schema::new();
schema.with_column("first_name".to_string(), DataType::Categorical(None));
schema.with_column("gender".to_string(), DataType::Categorical(None));
schema.with_column("type".to_string(), DataType::Categorical(None));
schema.with_column("state".to_string(), DataType::Categorical(None));
schema.with_column("party".to_string(), DataType::Categorical(None));
schema.with_column("birthday".to_string(), DataType::Date);
let data: Vec<u8> = Client::new().get(url).send()?.text()?.bytes().collect();
let dataset = CsvReader::new(Cursor::new(data))
.has_header(true)
.with_dtypes(Some(&schema))
.with_parse_dates(true)
.finish()?;
println!("{}", &dataset);
Basic aggregations
You can easily combine different aggregations by adding multiple expressions in a
list
. There is no upper bound on the number of aggregations you can do, and you can
make any combination you want. In the snippet below we do the following aggregations:
Per GROUP "first_name" we
- count the number of rows in the group:
  - short form: pl.count("party")
  - full form: pl.col("party").count()
- aggregate the gender values per group:
  - full form: pl.col("gender")
- get the first value of column "last_name" in the group:
  - short form: pl.first("last_name") (not available in Rust)
  - full form: pl.col("last_name").first()
Besides the aggregation, we immediately sort the result and limit to the top 5
so that
we have a nice summary overview.
import polars as pl
from .dataset import dataset
q = (
dataset.lazy()
.groupby("first_name")
.agg(
[
pl.count(),
pl.col("gender"),
pl.first("last_name"),
]
)
.sort("count", reverse=True)
.limit(5)
)
df = q.collect()
let df = dataset
.clone()
.lazy()
.groupby(["first_name"])
.agg([count(), col("gender").list(), col("last_name").first()])
.sort(
"count",
SortOptions {
descending: true,
nulls_last: true,
},
)
.limit(5)
.collect()?;
println!("{}", df);
shape: (5, 4)
┌────────────┬───────┬─────────────────────┬───────────┐
│ first_name ┆ count ┆ gender ┆ last_name │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ u32 ┆ list[cat] ┆ str │
╞════════════╪═══════╪═════════════════════╪═══════════╡
│ John ┆ 1256 ┆ ["M", "M", ... "M"] ┆ Walker │
│ William ┆ 1022 ┆ ["M", "M", ... "M"] ┆ Few │
│ James ┆ 714 ┆ ["M", "M", ... "M"] ┆ Armstrong │
│ Thomas ┆ 454 ┆ ["M", "M", ... "M"] ┆ Tucker │
│ Charles ┆ 439 ┆ ["M", "M", ... "M"] ┆ Carroll │
└────────────┴───────┴─────────────────────┴───────────┘
Conditionals
It's that easy! Let's turn it up a notch. Let's say we want to know how many delegates of a "state" are "Pro" or "Anti" administration. We can query that directly in the aggregation without the need for a lambda or grooming the DataFrame.
import polars as pl
from .dataset import dataset
q = (
dataset.lazy()
.groupby("state")
.agg(
[
(pl.col("party") == "Anti-Administration").sum().alias("anti"),
(pl.col("party") == "Pro-Administration").sum().alias("pro"),
]
)
.sort("pro", reverse=True)
.limit(5)
)
df = q.collect()
let df = dataset
.clone()
.lazy()
.groupby(["state"])
.agg([
(col("party").eq(lit("Anti-Administration")))
.sum()
.alias("anti"),
(col("party").eq(lit("Pro-Administration")))
.sum()
.alias("pro"),
])
.sort(
"pro",
SortOptions {
descending: true,
nulls_last: false,
},
)
.limit(5)
.collect()?;
println!("{}", df);
shape: (5, 3)
┌───────┬──────┬─────┐
│ state ┆ anti ┆ pro │
│ --- ┆ --- ┆ --- │
│ cat ┆ u32 ┆ u32 │
╞═══════╪══════╪═════╡
│ NJ ┆ 0 ┆ 3 │
│ CT ┆ 0 ┆ 3 │
│ NC ┆ 1 ┆ 2 │
│ MA ┆ 0 ┆ 1 │
│ PA ┆ 1 ┆ 1 │
└───────┴──────┴─────┘
Similarly, this could also be done with a nested GROUPBY, but that doesn't help show off some of these nice features. 😉
import polars as pl
from .dataset import dataset
q = (
dataset.lazy()
.groupby(["state", "party"])
.agg([pl.count("party").alias("count")])
.filter((pl.col("party") == "Anti-Administration") | (pl.col("party") == "Pro-Administration"))
.sort("count", reverse=True)
.limit(5)
)
df = q.collect()
let df = dataset
.clone()
.lazy()
.groupby(["state", "party"])
.agg([col("party").count().alias("count")])
.filter(
col("party")
.eq(lit("Anti-Administration"))
.or(col("party").eq(lit("Pro-Administration"))),
)
.sort(
"count",
SortOptions {
descending: true,
nulls_last: true,
},
)
.limit(5)
.collect()?;
println!("{}", df);
shape: (5, 3)
┌───────┬─────────────────────┬───────┐
│ state ┆ party ┆ count │
│ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 │
╞═══════╪═════════════════════╪═══════╡
│ NJ ┆ Pro-Administration ┆ 3 │
│ VA ┆ Anti-Administration ┆ 3 │
│ CT ┆ Pro-Administration ┆ 3 │
│ NC ┆ Pro-Administration ┆ 2 │
│ DE ┆ Anti-Administration ┆ 1 │
└───────┴─────────────────────┴───────┘
Filtering
We can also filter the groups. Let's say we want to compute a mean per group, but we
don't want to include all values from that group, and we also don't want to filter the
rows from the DataFrame
(because we need those rows for another aggregation).
In the example below we show how that can be done.
Note that we can make Python functions for clarity. These functions don't cost us anything, because we only create Polars expressions; we don't apply a custom function over a Series during the runtime of the query. Of course, you can make functions that return expressions in Rust, too.
from datetime import date
import polars as pl
from .dataset import dataset
def compute_age() -> pl.Expr:
return date(2021, 1, 1).year - pl.col("birthday").dt.year()
def avg_birthday(gender: str) -> pl.Expr:
return compute_age().filter(pl.col("gender") == gender).mean().alias(f"avg {gender} birthday")
q = (
dataset.lazy()
.groupby(["state"])
.agg(
[
avg_birthday("M"),
avg_birthday("F"),
(pl.col("gender") == "M").sum().alias("# male"),
(pl.col("gender") == "F").sum().alias("# female"),
]
)
.limit(5)
)
df = q.collect()
fn compute_age() -> Expr {
lit(2022) - col("birthday").dt().year()
}
fn avg_birthday(gender: &str) -> Expr {
compute_age()
.filter(col("gender").eq(lit(gender)))
.mean()
.alias(&format!("avg {} birthday", gender))
}
let df = dataset
.clone()
.lazy()
.groupby(["state"])
.agg([
avg_birthday("M"),
avg_birthday("F"),
(col("gender").eq(lit("M"))).sum().alias("# male"),
(col("gender").eq(lit("F"))).sum().alias("# female"),
])
.limit(5)
.collect()?;
println!("{}", df);
shape: (5, 5)
┌───────┬────────────────┬────────────────┬────────┬──────────┐
│ state ┆ avg M birthday ┆ avg F birthday ┆ # male ┆ # female │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ f64 ┆ f64 ┆ u32 ┆ u32 │
╞═══════╪════════════════╪════════════════╪════════╪══════════╡
│ OL ┆ 281.0 ┆ null ┆ 2 ┆ 0 │
│ MA ┆ 198.757212 ┆ 102.5 ┆ 423 ┆ 4 │
│ NV ┆ 137.833333 ┆ 85.0 ┆ 54 ┆ 2 │
│ GA ┆ 172.106529 ┆ 93.777778 ┆ 310 ┆ 9 │
│ OH ┆ 171.836735 ┆ 79.444444 ┆ 672 ┆ 9 │
└───────┴────────────────┴────────────────┴────────┴──────────┘
Sorting
It's common to see a DataFrame
being sorted for the sole purpose of managing the ordering during a GROUPBY operation. Let's say that we want to get the names of the oldest and youngest politicians per state. We could SORT and GROUPBY.
import polars as pl
from .dataset import dataset
def get_person() -> pl.Expr:
return pl.col("first_name") + pl.lit(" ") + pl.col("last_name")
q = (
dataset.lazy()
.sort("birthday", reverse=True)
.groupby(["state"])
.agg(
[
get_person().first().alias("youngest"),
get_person().last().alias("oldest"),
]
)
.limit(5)
)
df = q.collect()
fn get_person() -> Expr {
col("first_name") + lit(" ") + col("last_name")
}
let df = dataset
.clone()
.lazy()
.sort(
"birthday",
SortOptions {
descending: true,
nulls_last: true,
},
)
.groupby(["state"])
.agg([
get_person().first().alias("youngest"),
get_person().last().alias("oldest"),
])
.limit(5)
.collect()?;
println!("{}", df);
shape: (5, 3)
┌───────┬─────────────────┬───────────────────┐
│ state ┆ youngest ┆ oldest │
│ --- ┆ --- ┆ --- │
│ cat ┆ str ┆ str │
╞═══════╪═════════════════╪═══════════════════╡
│ SC ┆ Joe Cunningham ┆ Ralph Izard │
│ MS ┆ Steven Palazzo ┆ Narsworthy Hunter │
│ RI ┆ Patrick Kennedy ┆ James Mason │
│ UT ┆ Mia Love ┆ John Bernhisel │
│ AR ┆ Tim Griffin ┆ Archibald Yell │
└───────┴─────────────────┴───────────────────┘
However, if we also want to sort the names alphabetically, this breaks. Luckily we can sort in a groupby
context separate from the DataFrame
.
import polars as pl
from .dataset import dataset
def get_person() -> pl.Expr:
return pl.col("first_name") + pl.lit(" ") + pl.col("last_name")
q = (
dataset.lazy()
.sort("birthday", reverse=True)
.groupby(["state"])
.agg(
[
get_person().first().alias("youngest"),
get_person().last().alias("oldest"),
get_person().sort().first().alias("alphabetical_first"),
]
)
.limit(5)
)
df = q.collect()
let df = dataset
.clone()
.lazy()
.sort(
"birthday",
SortOptions {
descending: true,
nulls_last: true,
},
)
.groupby(["state"])
.agg([
get_person().first().alias("youngest"),
get_person().last().alias("oldest"),
get_person().sort(false).first().alias("alphabetical_first"),
])
.limit(5)
.collect()?;
println!("{}", df);
shape: (5, 4)
┌───────┬─────────────────────┬─────────────────┬────────────────────┐
│ state ┆ youngest ┆ oldest ┆ alphabetical_first │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ str ┆ str ┆ str │
╞═══════╪═════════════════════╪═════════════════╪════════════════════╡
│ SC ┆ Joe Cunningham ┆ Ralph Izard ┆ Abraham Nott │
│ KY ┆ Ben Chandler ┆ John Edwards ┆ Aaron Harding │
│ AK ┆ Mark Begich ┆ Thomas Cale ┆ Anthony Dimond │
│ PR ┆ Aníbal Acevedo-Vilá ┆ Tulio Larrinaga ┆ Antonio Colorado │
│ MI ┆ Peter Meijer ┆ Edward Bradley ┆ Aaron Bliss │
└───────┴─────────────────────┴─────────────────┴────────────────────┘
We can even sort by another column in the groupby
context. If we want to know if the alphabetically sorted name is male or female we could add: pl.col("gender").sort_by("first_name").first().alias("gender")
import polars as pl
from .dataset import dataset
def get_person() -> pl.Expr:
return pl.col("first_name") + pl.lit(" ") + pl.col("last_name")
q = (
dataset.lazy()
.sort("birthday", reverse=True)
.groupby(["state"])
.agg(
[
get_person().first().alias("youngest"),
get_person().last().alias("oldest"),
get_person().sort().first().alias("alphabetical_first"),
pl.col("gender").sort_by("first_name").first().alias("gender"),
]
)
.sort("state")
.limit(5)
)
df = q.collect()
let df = dataset
.clone()
.lazy()
.sort(
"birthday",
SortOptions {
descending: true,
nulls_last: true,
},
)
.groupby(["state"])
.agg([
get_person().first().alias("youngest"),
get_person().last().alias("oldest"),
get_person().sort(false).first().alias("alphabetical_first"),
col("gender")
.sort_by(["first_name"], [false])
.first()
.alias("gender"),
])
.sort("state", SortOptions::default())
.limit(5)
.collect()?;
println!("{}", df);
shape: (5, 5)
┌───────┬──────────────────┬───────────────────┬────────────────────┬────────┐
│ state ┆ youngest ┆ oldest ┆ alphabetical_first ┆ gender │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ str ┆ str ┆ str ┆ cat │
╞═══════╪══════════════════╪═══════════════════╪════════════════════╪════════╡
│ PA ┆ Conor Lamb ┆ Thomas Fitzsimons ┆ Aaron Kreider ┆ M │
│ KY ┆ Ben Chandler ┆ John Edwards ┆ Aaron Harding ┆ M │
│ MD ┆ Frank Kratovil ┆ Benjamin Contee ┆ Albert Blakeney ┆ M │
│ OH ┆ Anthony Gonzalez ┆ John Smith ┆ Aaron Harlan ┆ M │
│ VA ┆ Scott Taylor ┆ William Grayson ┆ A. McEachin ┆ M │
└───────┴──────────────────┴───────────────────┴────────────────────┴────────┘
Conclusion
In the examples above we've seen that we can do a lot by combining expressions. By doing so we avoid the need for custom Python
functions that slow down the queries (due to the slow nature of Python and the GIL).
If you find that an expression is missing, let us know by opening a feature request!
Folds
Polars provides expressions/methods for horizontal aggregations like sum, min, mean, etc. by setting the argument axis=1. However, when you need a more complex aggregation the default methods provided by the Polars library may not be sufficient. That's when folds come in handy.
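For instance, a minimal sketch of the built-in horizontal sum using the axis=1 argument mentioned above (the exact name of the resulting column may differ between versions):
import polars as pl

df = pl.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [10, 20, 30],
    }
)

# row-wise (horizontal) sum across all columns
out = df.sum(axis=1)
print(out)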
The Polars
fold
expression operates on columns for maximum speed. It utilizes the data layout very efficiently and often has vectorized execution.
Let's start with an example by implementing the sum
operation ourselves, with a fold
.
Manual Sum
df = pl.DataFrame(
{
"a": [1, 2, 3],
"b": [10, 20, 30],
}
)
out = df.select(
pl.fold(acc=pl.lit(0), f=lambda acc, x: acc + x, exprs=pl.col("*")).alias("sum"),
)
print(out)
let df = df!(
"a" => &[1, 2, 3],
"b" => &[10, 20, 30],
)?;
let out = df
.lazy()
.select([fold_exprs(lit(0), |acc, x| Ok(acc + x), [col("*")]).alias("sum")])
.collect()?;
println!("{}", out);
shape: (3, 1)
┌─────┐
│ sum │
│ --- │
│ i64 │
╞═════╡
│ 11 │
│ 22 │
│ 33 │
└─────┘
The snippet above recursively applies the function f(acc, x) -> acc
to an accumulator acc
and a new column x
.
The function operates on columns individually and can take advantage of cache efficiency and vectorization.
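To make the mechanics concrete, here is an illustrative sketch (not from the original guide) showing that for the two columns above the fold expands to f(f(acc, a), b), which gives the same result as a plain column sum:
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

out = df.select(
    [
        # the fold evaluates ((0 + a) + b), column by column
        pl.fold(acc=pl.lit(0), f=lambda acc, x: acc + x, exprs=pl.col("*")).alias("fold_sum"),
        # the equivalent explicit expression
        (pl.col("a") + pl.col("b")).alias("plain_sum"),
    ]
)
print(out)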
Conditional
In the case where you'd want to apply a condition/predicate on all columns in a DataFrame
a fold
operation can be
a very concise way to express this.
df = pl.DataFrame(
{
"a": [1, 2, 3],
"b": [0, 1, 2],
}
)
out = df.filter(
pl.fold(
acc=pl.lit(True),
f=lambda acc, x: acc & x,
exprs=pl.col("*") > 1,
)
)
print(out)
let df = df!(
"a" => &[1, 2, 3],
"b" => &[0, 1, 2],
)?;
let out = df
.lazy()
.filter(fold_exprs(
lit(true),
|acc, x| acc.bitand(&x),
[col("*").gt(1)],
))
.collect()?;
println!("{}", out);
shape: (1, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 3 ┆ 2 │
└─────┴─────┘
In the snippet we filter all rows where each column value is >
1
.
Folds and string data
Folds can be used to concatenate string data. However, due to the materialization of intermediate columns, this operation has quadratic complexity.
Therefore, we recommend using the concat_str expression for this.
Note that, in Rust, the concat_str feature must be enabled to use the concat_str expression.
df = pl.DataFrame(
{
"a": ["a", "b", "c"],
"b": [1, 2, 3],
}
)
out = df.select(
[
pl.concat_str(["a", "b"]),
]
)
print(out)
let df = df!(
"a" => &["a", "b", "c"],
"b" => &[1, 2, 3],
)?;
let out = df
.lazy()
.select([concat_str([col("a"), col("b")], "")])
.collect()?;
println!("{:?}", out);
shape: (3, 1)
┌─────┐
│ a │
│ --- │
│ str │
╞═════╡
│ a1 │
│ b2 │
│ c3 │
└─────┘
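The Rust snippet above passes an (empty) separator as the second argument; in Python you can pass a separator as well. A hedged sketch (the sep keyword name is an assumption about this version's Python API):
import polars as pl

df = pl.DataFrame(
    {
        "a": ["a", "b", "c"],
        "b": [1, 2, 3],
    }
)

out = df.select(
    [
        # sep="-" is assumed to be the separator keyword in this version
        pl.concat_str(["a", "b"], sep="-").alias("a_dash_b"),
    ]
)
print(out)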
Window functions 🚀🚀
Window functions are expressions with superpowers. They allow you to perform aggregations on groups in the
select
context. Let's get a feel for what that means. First we create a dataset. The dataset loaded in the
snippet below contains information about pokemon and has the following columns:
['#', 'Name', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary']
import polars as pl
# then let's load some csv data with information about pokemon
df = pl.read_csv(
"https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv"
)
use color_eyre::Result;
use polars::prelude::*;
use reqwest::blocking::Client;
fn main() -> Result<()> {
let data: Vec<u8> = Client::new()
.get("https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv")
.send()?
.text()?
.bytes()
.collect();
let df = CsvReader::new(std::io::Cursor::new(data))
.has_header(true)
.finish()?;
println!("{}", df);
shape: (163, 13)
┌─────┬───────────────────────┬─────────┬────────┬─────┬─────────┬───────┬────────────┬───────────┐
│ # ┆ Name ┆ Type 1 ┆ Type 2 ┆ ... ┆ Sp. Def ┆ Speed ┆ Generation ┆ Legendary │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ ┆ i64 ┆ i64 ┆ i64 ┆ bool │
╞═════╪═══════════════════════╪═════════╪════════╪═════╪═════════╪═══════╪════════════╪═══════════╡
│ 1 ┆ Bulbasaur ┆ Grass ┆ Poison ┆ ... ┆ 65 ┆ 45 ┆ 1 ┆ false │
│ 2 ┆ Ivysaur ┆ Grass ┆ Poison ┆ ... ┆ 80 ┆ 60 ┆ 1 ┆ false │
│ 3 ┆ Venusaur ┆ Grass ┆ Poison ┆ ... ┆ 100 ┆ 80 ┆ 1 ┆ false │
│ 3 ┆ VenusaurMega Venusaur ┆ Grass ┆ Poison ┆ ... ┆ 120 ┆ 80 ┆ 1 ┆ false │
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
│ 147 ┆ Dratini ┆ Dragon ┆ null ┆ ... ┆ 50 ┆ 50 ┆ 1 ┆ false │
│ 148 ┆ Dragonair ┆ Dragon ┆ null ┆ ... ┆ 70 ┆ 70 ┆ 1 ┆ false │
│ 149 ┆ Dragonite ┆ Dragon ┆ Flying ┆ ... ┆ 100 ┆ 80 ┆ 1 ┆ false │
│ 150 ┆ Mewtwo ┆ Psychic ┆ null ┆ ... ┆ 90 ┆ 130 ┆ 1 ┆ true │
└─────┴───────────────────────┴─────────┴────────┴─────┴─────────┴───────┴────────────┴───────────┘
Groupby Aggregations in selection
Below we show how to use window functions to group over different columns and perform an aggregation on them.
Doing so allows us to use multiple groupby operations in parallel, using a single query. The results of the aggregation
are projected back to the original rows. Therefore, a window function will always lead to a DataFrame
with the same size
as the original.
Note how we call .over("Type 1")
and .over(["Type 1", "Type 2"])
. Using window functions we can aggregate over different groups in a single select
call! Note that, in Rust, the type of the argument to over()
must be a collection, so even when you're only using one column, you must provide it in an array.
The best part is, this won't cost you anything. The computed groups are cached and shared between different window
expressions.
out = df.select(
[
"Type 1",
"Type 2",
pl.col("Attack").mean().over("Type 1").alias("avg_attack_by_type"),
pl.col("Defense").mean().over(["Type 1", "Type 2"]).alias("avg_defense_by_type_combination"),
pl.col("Attack").mean().alias("avg_attack"),
]
)
let out = df
.clone()
.lazy()
.select([
col("Type 1"),
col("Type 2"),
col("Attack")
.mean()
.over(["Type 1"])
.alias("avg_attack_by_type"),
col("Defense")
.mean()
.over(["Type 1", "Type 2"])
.alias("avg_defense_by_type_combination"),
col("Attack").mean().alias("avg_attack"),
])
.collect()?;
println!("{}", out);
shape: (163, 5)
┌─────────┬────────┬────────────────────┬─────────────────────────────────┬────────────┐
│ Type 1 ┆ Type 2 ┆ avg_attack_by_type ┆ avg_defense_by_type_combination ┆ avg_attack │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 ┆ f64 │
╞═════════╪════════╪════════════════════╪═════════════════════════════════╪════════════╡
│ Grass ┆ Poison ┆ 72.923077 ┆ 67.8 ┆ 75.349693 │
│ Grass ┆ Poison ┆ 72.923077 ┆ 67.8 ┆ 75.349693 │
│ Grass ┆ Poison ┆ 72.923077 ┆ 67.8 ┆ 75.349693 │
│ Grass ┆ Poison ┆ 72.923077 ┆ 67.8 ┆ 75.349693 │
│ ... ┆ ... ┆ ... ┆ ... ┆ ... │
│ Dragon ┆ null ┆ 94.0 ┆ 55.0 ┆ 75.349693 │
│ Dragon ┆ null ┆ 94.0 ┆ 55.0 ┆ 75.349693 │
│ Dragon ┆ Flying ┆ 94.0 ┆ 95.0 ┆ 75.349693 │
│ Psychic ┆ null ┆ 53.875 ┆ 51.428571 ┆ 75.349693 │
└─────────┴────────┴────────────────────┴─────────────────────────────────┴────────────┘
Operations per group
Window functions can do more than aggregation. They can also be viewed as an operation within a group. If, for instance, you
want to sort
the values within a group
, you can write col("value").sort().over("group")
and voilà! We sorted by group!
Let's filter out some rows to make this more clear.
filtered = df.filter(pl.col("Type 2") == "Psychic").select(
[
"Name",
"Type 1",
"Speed",
]
)
print(filtered)
let filtered = df
.clone()
.lazy()
.filter(col("Type 2").eq(lit("Psychic")))
.select([col("Name"), col("Type 1"), col("Speed")])
.collect()?;
println!("{}", filtered);
shape: (7, 3)
┌─────────────────────┬────────┬───────┐
│ Name ┆ Type 1 ┆ Speed │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═════════════════════╪════════╪═══════╡
│ Slowpoke ┆ Water ┆ 15 │
│ Slowbro ┆ Water ┆ 30 │
│ SlowbroMega Slowbro ┆ Water ┆ 30 │
│ Exeggcute ┆ Grass ┆ 40 │
│ Exeggutor ┆ Grass ┆ 55 │
│ Starmie ┆ Water ┆ 115 │
│ Jynx ┆ Ice ┆ 95 │
└─────────────────────┴────────┴───────┘
Observe that the group Water
of column Type 1
is not contiguous. There are two rows of Grass
in between. Also note
that the pokemon within each group are sorted by Speed
in ascending
order. Unfortunately, for this example we want them sorted in
descending
speed order. Luckily with window functions this is easy to accomplish.
out = filtered.with_columns(
[
pl.col(["Name", "Speed"]).sort(reverse=True).over("Type 1"),
]
)
print(out)
let out = filtered
.lazy()
.with_columns([cols(["Name", "Speed"]).sort(true).over(["Type 1"])])
.collect()?;
println!("{}", out);
shape: (7, 3)
┌─────────────────────┬────────┬───────┐
│ Name ┆ Type 1 ┆ Speed │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═════════════════════╪════════╪═══════╡
│ Starmie ┆ Water ┆ 115 │
│ Slowpoke ┆ Water ┆ 30 │
│ SlowbroMega Slowbro ┆ Water ┆ 30 │
│ Exeggutor ┆ Grass ┆ 55 │
│ Exeggcute ┆ Grass ┆ 40 │
│ Slowbro ┆ Water ┆ 15 │
│ Jynx ┆ Ice ┆ 95 │
└─────────────────────┴────────┴───────┘
Polars
keeps track of each group's location and maps the expressions to the proper row locations. This will also work
over different groups in a single select
.
The power of window expressions is that you often don't need a groupby -> explode
combination, but you can put the logic in a
single expression. It also makes the API cleaner. When properly used (see the sketch below):
- groupby -> marks that groups are aggregated and we expect a DataFrame of size n_groups
- over -> marks that we want to compute something within a group, and doesn't modify the original size of the DataFrame
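A minimal sketch contrasting the two (illustrative only, using a toy DataFrame rather than the congress dataset):
import polars as pl

df = pl.DataFrame(
    {
        "groups": ["a", "a", "b"],
        "values": [1, 2, 3],
    }
)

# groupby: one row per group (n_groups rows)
agg = df.groupby("groups").agg([pl.col("values").mean()])

# over: the aggregate is broadcast back, keeping the original number of rows
win = df.select(
    [
        "groups",
        pl.col("values").mean().over("groups").alias("values_mean"),
    ]
)
print(agg)
print(win)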
Window expression rules
The evaluations of window expressions are as follows (assuming we apply it to a pl.Int32
column):
# aggregate and broadcast within a group
# output type: -> Int32
pl.sum("foo").over("groups")
# sum within a group and multiply with group elements
# output type: -> Int32
(pl.col("x").sum() * pl.col("y")).over("groups")
# sum within a group and multiply with group elements
# and aggregate the group to a list
# output type: -> List(Int32)
(pl.col("x").sum() * pl.col("y")).list().over("groups")
# note that it will require an explicit `list()` call
# sum within a group and multiply with group elements
# and aggregate the group to a list
# the flatten call explodes that list
# This is the fastest method to do things over groups when the groups are sorted
(pl.col("x").sum() * pl.col("y")).list().over("groups").flatten()
// aggregate and broadcast within a group
// output type: -> i32
sum("foo").over([col("groups")]),
// sum within a group and multiply with group elements
// output type: -> i32
(col("x").sum() * col("y"))
.over([col("groups")])
.alias("x1"),
// sum within a group and multiply with group elements
// and aggregate the group to a list
// output type: -> ChunkedArray<i32>
(col("x").sum() * col("y"))
.list()
.over([col("groups")])
.alias("x2"),
// note that it will require an explicit `list()` call
// sum within a group and multiply with group elements
// and aggregate the group to a list
// the flatten call explodes that list
// This is the fastest method to do things over groups when the groups are sorted
(col("x").sum() * col("y"))
.list()
.over([col("groups")])
.flatten()
.alias("x3"),
])
.collect()?;
More examples
For more exercises, below are some window functions for us to compute:
- sort all pokemon by type
- select the first 3 pokemon per type as "Type 1"
- sort the pokemon within a type by speed and select the first 3 as "fastest/group"
- sort the pokemon within a type by attack and select the first 3 as "strongest/group"
- sort the pokemon by name within a type and select the first 3 as "sorted_by_alphabet"
out = df.sort("Type 1").select(
[
pl.col("Type 1").head(3).list().over("Type 1").flatten(),
pl.col("Name").sort_by(pl.col("Speed")).head(3).list().over("Type 1").flatten().alias("fastest/group"),
pl.col("Name").sort_by(pl.col("Attack")).head(3).list().over("Type 1").flatten().alias("strongest/group"),
pl.col("Name").sort().head(3).list().over("Type 1").flatten().alias("sorted_by_alphabet"),
]
)
let out = df
.clone()
.lazy()
.select([
col("Type 1")
.head(Some(3))
.list()
.over(["Type 1"])
.flatten(),
col("Name")
.sort_by(["Speed"], [false])
.head(Some(3))
.list()
.over(["Type 1"])
.flatten()
.alias("fastest/group"),
col("Name")
.sort_by(["Attack"], [false])
.head(Some(3))
.list()
.over(["Type 1"])
.flatten()
.alias("strongest/group"),
col("Name")
.sort(false)
.head(Some(3))
.list()
.over(["Type 1"])
.flatten()
.alias("sorted_by_alphabet"),
])
.collect()?;
println!("{:?}", out);
shape: (43, 4)
┌────────┬─────────────────────┬─────────────────┬─────────────────────────┐
│ Type 1 ┆ fastest/group ┆ strongest/group ┆ sorted_by_alphabet │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str │
╞════════╪═════════════════════╪═════════════════╪═════════════════════════╡
│ Bug ┆ Paras ┆ Metapod ┆ Beedrill │
│ Bug ┆ Metapod ┆ Kakuna ┆ BeedrillMega Beedrill │
│ Bug ┆ Parasect ┆ Caterpie ┆ Butterfree │
│ Dragon ┆ Dratini ┆ Dratini ┆ Dragonair │
│ ... ┆ ... ┆ ... ┆ ... │
│ Rock ┆ Omanyte ┆ Omastar ┆ Geodude │
│ Water ┆ Slowpoke ┆ Magikarp ┆ Blastoise │
│ Water ┆ Slowbro ┆ Tentacool ┆ BlastoiseMega Blastoise │
│ Water ┆ SlowbroMega Slowbro ┆ Horsea ┆ Cloyster │
└────────┴─────────────────────┴─────────────────┴─────────────────────────┘
Flattened window function
If we have a window function that aggregates to a list
like the example above with the following Python expression:
pl.col("Name").sort_by(pl.col("Speed")).head(3).list().over("Type 1")
and in Rust:
col("Name").sort_by(["Speed"], [false]).head(Some(3)).list().over(["Type 1"])
This still works, but that would give us a column type List
which might not be what we want (this would significantly increase our memory usage!).
Instead we could flatten
. This just turns our 2D list into a 1D array and projects that array/column back to our DataFrame
.
This is very fast because the reshape is often free, and adding the column back to the original DataFrame
is also a lot cheaper (since we don't require a join like in a normal window function).
However, for this operation to make sense, it is important that the columns used in over([..])
are sorted!
List context
An expression context we haven't discussed yet is the List
context. This simply means we can apply any expression on the elements of a List.
A note for Rust users: these features require the list feature flag.
Row wise computations
This context is ideal for computing things in row orientation.
Polars expressions work on columns that have the guarantee that they consist of homogeneous data.
Columns have this guarantee, rows in a DataFrame
not so much.
Luckily we have a data type that guarantees the rows are homogeneous: the pl.List data type.
Let's say we have the following data:
grades = pl.DataFrame(
{
"student": ["bas", "laura", "tim", "jenny"],
"arithmetic": [10, 5, 6, 8],
"biology": [4, 6, 2, 7],
"geography": [8, 4, 9, 7],
}
)
print(grades)
let grades = df!(
"student" => &["bas", "laura", "tim", "jenny"],
"arithmetic" => &[10, 5, 6, 8],
"biology" => &[4, 6, 2, 7],
"geography" => &[8, 4, 9, 7],
)?;
println!("{}", grades);
shape: (4, 4)
┌─────────┬────────────┬─────────┬───────────┐
│ student ┆ arithmetic ┆ biology ┆ geography │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪════════════╪═════════╪═══════════╡
│ bas ┆ 10 ┆ 4 ┆ 8 │
│ laura ┆ 5 ┆ 6 ┆ 4 │
│ tim ┆ 6 ┆ 2 ┆ 9 │
│ jenny ┆ 8 ┆ 7 ┆ 7 │
└─────────┴────────────┴─────────┴───────────┘
If we want to compute the rank
of all the columns except for "student"
, we can collect those into a list
data type:
This would give:
out = grades.select([pl.concat_list(pl.all().exclude("student")).alias("all_grades")])
print(out)
let out = grades
.clone()
.lazy()
.select([concat_lst([all().exclude(["student"])]).alias("all_grades")])
.collect()?;
println!("{}", out);
shape: (4, 1)
┌────────────┐
│ all_grades │
│ --- │
│ list[i64] │
╞════════════╡
│ [10, 4, 8] │
│ [5, 6, 4] │
│ [6, 2, 9] │
│ [8, 7, 7] │
└────────────┘
Running polars expression on list elements
We can run any polars expression on the elements of a list with the arr.eval
(arr().eval
in Rust) expression! These expressions run entirely on Polars' query engine and can run in parallel, so they will be very fast.
Let's expand the example from above with something a little more interesting. Pandas allows you to compute the percentages of the rank
values. Polars doesn't provide such a keyword argument. But because expressions are so versatile we can create our own percentage rank expression. Let's try that!
Note that we must select
the list's element from the context. When we apply expressions over list elements, we use pl.element()
to select the element of a list.
Also note that using arr().eval in Rust requires the list_eval feature flag.
# the percentage rank expression
rank_pct = pl.element().rank(reverse=True) / pl.col("").count()
grades.with_column(
# create the list of homogeneous data
pl.concat_list(pl.all().exclude("student")).alias("all_grades")
).select([
# select all columns except the intermediate list
pl.all().exclude("all_grades"),
# compute the rank by calling `arr.eval`
pl.col("all_grades").arr.eval(rank_pct, parallel=True).alias("grades_rank")
])
// the percentage rank expression
let rank_opts = RankOptions {
method: RankMethod::Average,
descending: true,
};
let rank_pct = col("").rank(rank_opts) / col("").count().cast(DataType::Float32);
let grades = grades
.clone()
.lazy()
.with_column(
// create the list of homogeneous data
concat_lst([all().exclude(["student"])]).alias("all_grades"),
)
.select([
// select all columns except the intermediate list
all().exclude(["all_grades"]),
// compute the rank by calling `arr.eval`
col("all_grades")
.arr()
.eval(rank_pct, true)
.alias("grades_rank"),
])
.collect()?;
println!("{}", grades);
This outputs:
shape: (4, 5)
┌─────────┬────────────┬─────────┬───────────┬────────────────────────────────┐
│ student ┆ arithmetic ┆ biology ┆ geography ┆ grades_rank │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ list [f32] │
╞═════════╪════════════╪═════════╪═══════════╪════════════════════════════════╡
│ bas ┆ 10 ┆ 4 ┆ 8 ┆ [0.333333, 1.0, 0.666667] │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ laura ┆ 5 ┆ 6 ┆ 4 ┆ [0.666667, 0.333333, 1.0] │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ tim ┆ 6 ┆ 2 ┆ 9 ┆ [0.666667, 1.0, 0.333333] │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ jenny ┆ 8 ┆ 7 ┆ 7 ┆ [0.333333, 0.833333, 0.833333] │
└─────────┴────────────┴─────────┴───────────┴────────────────────────────────┘
Note that this solution works for any expressions/operation you want to do row wise.
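For example, a minimal sketch (assuming the arr.max method of the list namespace in this Polars version) that computes each student's best grade row-wise:
import polars as pl

grades = pl.DataFrame(
    {
        "student": ["bas", "laura", "tim", "jenny"],
        "arithmetic": [10, 5, 6, 8],
        "biology": [4, 6, 2, 7],
        "geography": [8, 4, 9, 7],
    }
)

out = grades.with_column(
    # pack the numeric columns into a homogeneous list column
    pl.concat_list(pl.all().exclude("student")).alias("all_grades")
).select(
    [
        pl.all().exclude("all_grades"),
        # row-wise maximum over the list elements
        pl.col("all_grades").arr.max().alias("best_grade"),
    ]
)
print(out)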
Numpy interop
Polars
expressions support NumPy
ufuncs. See here
for a list on all supported numpy functions.
This means that if a function is not provided by Polars
, we can use NumPy
and we still have fast columnar operation through
the NumPy
API.
Example
import polars as pl
import numpy as np
df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
out = df.select(
[
np.log(pl.all()).suffix("_log"),
]
)
print(out)
shape: (3, 2)
┌──────────┬──────────┐
│ a_log ┆ b_log │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞══════════╪══════════╡
│ 0.0 ┆ 1.386294 │
│ 0.693147 ┆ 1.609438 │
│ 1.098612 ┆ 1.791759 │
└──────────┴──────────┘
Gotcha's
Read more about the gotcha's here.
Custom functions
You should be convinced by now that Polars expressions are so powerful and flexible that the need for custom Python functions is much smaller than in other libraries.
Still, you sometimes need to pass an expression's state to a third party library or apply your black box function over data in Polars.
For this we provide the following expressions:
map
apply
To map
or to apply
.
These functions have an important distinction in how they operate and consequently what data they will pass to the user.
A map
passes the Series
backed by the expression
as is.
map
follows the same rules in both the select
and the groupby
context; this means that the Series
represents a column in a DataFrame
. Note that in the groupby
context, that column is not yet
aggregated!
Use cases for map
are for instance passing the Series
in an expression to a third party library. Below we show how
we could use map
to pass an expression column to a neural network model.
df.with_column([
pl.col("features").map(lambda s: MyNeuralNetwork.forward(s.to_numpy())).alias("activations")
])
df.with_column([
col("features").map(|s| Ok(my_nn.forward(s))).alias("activations")
])
Use cases for map
in the groupby
context are slim. They are only used for performance reasons, but can quite easily
lead to incorrect results. Let me explain why.
df = pl.DataFrame(
{
"keys": ["a", "a", "b"],
"values": [10, 7, 1],
}
)
out = df.groupby("keys", maintain_order=True).agg(
[
pl.col("values").map(lambda s: s.shift()).alias("shift_map"),
pl.col("values").shift().alias("shift_expression"),
]
)
print(df)
"keys" => &["a", "a", "b"],
"values" => &[10, 7, 1],
)?;
let out = df
.lazy()
.groupby(["keys"])
.agg([
col("values")
.map(|s| Ok(s.shift(1)), GetOutput::default())
.alias("shift_map"),
col("values").shift(1).alias("shift_expression"),
])
.collect()?;
println!("{}", out);
shape: (3, 2)
┌──────┬────────┐
│ keys ┆ values │
│ --- ┆ --- │
│ str ┆ i64 │
╞══════╪════════╡
│ a ┆ 10 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ a ┆ 7 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b ┆ 1 │
└──────┴────────┘
In the snippet above we groupby the "keys"
column. That means we have the following groups:
"a" -> [10, 7]
"b" -> [1]
If we would then apply a shift
operation to the right, we'd expect:
"a" -> [null, 10]
"b" -> [null]
Now, let's print and see what we've got.
print(out)
shape: (2, 3)
┌──────┬────────────┬──────────────────┐
│ keys ┆ shift_map ┆ shift_expression │
│ --- ┆ --- ┆ --- │
│ str ┆ list[i64] ┆ list[i64] │
╞══════╪════════════╪══════════════════╡
│ a ┆ [null, 10] ┆ [null, 10] │
│ b ┆ [7] ┆ [null] │
└──────┴────────────┴──────────────────┘
Ouch.. we clearly get the wrong results here. Group "b"
even got a value from group "a"
😵.
This went horribly wrong, because the map
applies the function before we aggregate! So that means the whole column
[10, 7, 1] got shifted to [null, 10, 7] and was then aggregated.
So my advice is to never use map
in the groupby
context unless you know you need it and know what you are doing.
To apply
Luckily we can fix the previous example with apply
. apply
works on the smallest logical elements for that operation.
That is:
- select context -> single elements
- groupby context -> single groups
So with apply
we should be able to fix our example:
out = df.groupby("keys", maintain_order=True).agg(
[
pl.col("values").apply(lambda s: s.shift()).alias("shift_map"),
pl.col("values").shift().alias("shift_expression"),
]
)
print(out)
let out = df
.clone()
.lazy()
.groupby([col("keys")])
.agg([
col("values")
.apply(|s| Ok(s.shift(1)), GetOutput::default())
.alias("shift_map"),
col("values").shift(1).alias("shift_expression"),
])
.collect()?;
println!("{}", out);
shape: (2, 3)
┌──────┬────────────┬──────────────────┐
│ keys ┆ shift_map ┆ shift_expression │
│ --- ┆ --- ┆ --- │
│ str ┆ list[i64] ┆ list[i64] │
╞══════╪════════════╪══════════════════╡
│ a ┆ [null, 10] ┆ [null, 10] │
│ b ┆ [null] ┆ [null] │
└──────┴────────────┴──────────────────┘
And observe, a valid result! 🎉
apply
in the select
context
In the select
context, the apply
expression passes elements of the column to the Python function.
Note that you are now running Python; this will be slow.
Let's go through some examples to see what to expect. We will continue with the DataFrame
we defined at the start of
this section and show an example with the apply
function and a counter example where we use the expression API to
achieve the same goals.
Adding a counter
In this example we create a global counter
and then add the integer 1
to the global state at every element processed.
Every iteration the result of the increment will be added to the element value.
Note, this example isn't provided in Rust. The reason is that the global
counter
value would lead to data races when this apply is evaluated in parallel. It would be possible to wrap it in a Mutex
to protect the variable, but that would obscure the point of the example. This is a case where the Python Global Interpreter Lock's performance tradeoff provides some safety guarantees.
counter = 0
def add_counter(val: int) -> int:
global counter
counter += 1
return counter + val
out = df.select(
[
pl.col("values").apply(add_counter).alias("solution_apply"),
(pl.col("values") + pl.arange(1, pl.count() + 1)).alias("solution_expr"),
]
)
print(out)
shape: (3, 2)
┌────────────────┬───────────────┐
│ solution_apply ┆ solution_expr │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞════════════════╪═══════════════╡
│ 11 ┆ 11 │
│ 9 ┆ 9 │
│ 4 ┆ 4 │
└────────────────┴───────────────┘
Combining multiple column values
If we want to have access to values of different columns in a single apply
function call, we can create a struct
data
type. This data type collects those columns as fields in the struct
. So if we'd create a struct from the columns
"keys"
and "values"
, we would get the following struct elements:
[
{"keys": "a", "values": 10},
{"keys": "a", "values": 7},
{"keys": "b", "values": 1},
]
In Python, those would be passed as dict
to the calling Python function and can thus be indexed by field: str
. In Rust, you'll get a Series
with the Struct
type. The fields of the struct can then be indexed and downcast.
out = df.select(
[
pl.struct(["keys", "values"]).apply(lambda x: len(x["keys"]) + x["values"]).alias("solution_apply"),
(pl.col("keys").str.lengths() + pl.col("values")).alias("solution_expr"),
]
)
print(out)
let out = df
.lazy()
.select([
// pack to struct to get access to multiple fields in a custom `apply/map`
as_struct(&[col("keys"), col("values")])
// we will compute the len(a) + b
.apply(
|s| {
// downcast to struct
let ca = s.struct_()?;
// get the fields as Series
let s_a = &ca.fields()[0];
let s_b = &ca.fields()[1];
// downcast the `Series` to their known type
let ca_a = s_a.utf8()?;
let ca_b = s_b.i32()?;
// iterate both `ChunkedArrays`
let out: Int32Chunked = ca_a
.into_iter()
.zip(ca_b)
.map(|(opt_a, opt_b)| match (opt_a, opt_b) {
(Some(a), Some(b)) => Some(a.len() as i32 + b),
_ => None,
})
.collect();
Ok(out.into_series())
},
GetOutput::from_type(DataType::Int32),
)
.alias("solution_apply"),
(col("keys").str().count_match(".") + col("values")).alias("solution_expr"),
])
.collect()?;
println!("{}", out);
shape: (3, 2)
┌────────────────┬───────────────┐
│ solution_apply ┆ solution_expr │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞════════════════╪═══════════════╡
│ 11 ┆ 11 │
│ 8 ┆ 8 │
│ 2 ┆ 2 │
└────────────────┴───────────────┘
Return types?
Custom Python functions are black boxes for Polars. We really don't know what kind of black arts you are doing, so we have to infer and try our best to understand what you meant.
As a user it helps to understand what we do to better utilize custom functions.
The data type is automatically inferred. We do that by waiting for the first non-null value. That value will then be used
to determine the type of the Series
.
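If automatic inference does not give you the type you want, you can usually state it explicitly; a hedged sketch (the return_dtype keyword is an assumption about the apply signature in this version):
import polars as pl

df = pl.DataFrame({"values": [10, 7, 1]})

out = df.select(
    [
        # state the output type up front instead of relying on inference
        pl.col("values")
        .apply(lambda v: v / 2, return_dtype=pl.Float64)
        .alias("halved"),
    ]
)
print(out)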
The mapping of Python types to Polars data types is as follows:
- int -> Int64
- float -> Float64
- bool -> Boolean
- str -> Utf8
- list[tp] -> List[tp] (where the inner type is inferred with the same rules)
- dict[str, [tp]] -> struct
- Any -> object (prevent this at all times)
Rust types map as follows:
- i32 or i64 -> Int64
- f32 or f64 -> Float64
- bool -> Boolean
- String or str -> Utf8
- Vec<tp> -> List[tp] (where the inner type is inferred with the same rules)
import polars as pl
Expressions
fn(Series) -> Series
- Lazily evaluated
- Can be optimized
- Gives the library writer context so informed decisions can be made
- Embarrassingly parallel
- Context dependent
  - selection / projection -> Series = COLUMN, LITERAL or VALUE
  - aggregation -> Series = GROUPS
df = pl.DataFrame(
{
"A": [1, 2, 3, 4, 5],
"fruits": ["banana", "banana", "apple", "apple", "banana"],
"B": [5, 4, 3, 2, 1],
"cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
"optional": [28, 300, None, 2, -30],
}
)
df
A | fruits | B | cars | optional |
---|---|---|---|---|
i64 | str | i64 | str | i64 |
1 | "banana" | 5 | "beetle" | 28 |
2 | "banana" | 4 | "audi" | 300 |
3 | "apple" | 3 | "beetle" | null |
4 | "apple" | 2 | "beetle" | 2 |
5 | "banana" | 1 | "beetle" | -30 |
Selection context
# We can select by name
(df.select([
pl.col("A"),
"B", # the col part is inferred
pl.lit("B"), # we must tell polars we mean the literal "B"
pl.col("fruits"),
]))
A | B | literal | fruits |
---|---|---|---|
i64 | i64 | str | str |
1 | 5 | "B" | "banana" |
2 | 4 | "B" | "banana" |
3 | 3 | "B" | "apple" |
4 | 2 | "B" | "apple" |
5 | 1 | "B" | "banana" |
# you can select columns with a regex if it starts with '^' and ends with '$'
(df.select([
pl.col("^A|B$").sum()
]))
A | B |
---|---|
i64 | i64 |
15 | 15 |
# you can select multiple columns by name
(df.select([
pl.col(["A", "B"]).sum()
]))
A | B |
---|---|
i64 | i64 |
15 | 15 |
# We select everything in normal order
# Then we select everything in reversed order
(df.select([
pl.all(),
pl.all().reverse().suffix("_reverse")
]))
A | fruits | B | cars | optional | A_reverse | fruits_reverse | B_reverse | cars_reverse | optional_reverse |
---|---|---|---|---|---|---|---|---|---|
i64 | str | i64 | str | i64 | i64 | str | i64 | str | i64 |
1 | "banana" | 5 | "beetle" | 28 | 5 | "banana" | 1 | "beetle" | -30 |
2 | "banana" | 4 | "audi" | 300 | 4 | "apple" | 2 | "beetle" | 2 |
3 | "apple" | 3 | "beetle" | null | 3 | "apple" | 3 | "beetle" | null |
4 | "apple" | 2 | "beetle" | 2 | 2 | "banana" | 4 | "audi" | 300 |
5 | "banana" | 1 | "beetle" | -30 | 1 | "banana" | 5 | "beetle" | 28 |
# all expressions run in parallel
# single valued `Series` are broadcasted to the shape of the `DataFrame`
(df.select([
pl.all(),
pl.all().sum().suffix("_sum")
]))
A | fruits | B | cars | optional | A_sum | fruits_sum | B_sum | cars_sum | optional_sum |
---|---|---|---|---|---|---|---|---|---|
i64 | str | i64 | str | i64 | i64 | str | i64 | str | i64 |
1 | "banana" | 5 | "beetle" | 28 | 15 | null | 15 | null | 300 |
2 | "banana" | 4 | "audi" | 300 | 15 | null | 15 | null | 300 |
3 | "apple" | 3 | "beetle" | null | 15 | null | 15 | null | 300 |
4 | "apple" | 2 | "beetle" | 2 | 15 | null | 15 | null | 300 |
5 | "banana" | 1 | "beetle" | -30 | 15 | null | 15 | null | 300 |
# there are `str` and `dt` namespaces for specialized functions
predicate = pl.col("fruits").str.contains("^b.*")
(df.select([
predicate
]))
fruits |
---|
bool |
true |
true |
false |
false |
true |
# use the predicate to filter
df.filter(predicate)
A | fruits | B | cars | optional |
---|---|---|---|---|
i64 | str | i64 | str | i64 |
1 | "banana" | 5 | "beetle" | 28 |
2 | "banana" | 4 | "audi" | 300 |
5 | "banana" | 1 | "beetle" | -30 |
# predicate expressions can be used to filter
(df.select([
pl.col("A").filter(pl.col("fruits").str.contains("^b.*")).sum(),
(pl.col("B").filter(pl.col("cars").str.contains("^b.*")).sum() * pl.col("B").sum()).alias("some_compute()"),
]))
A | some_compute() |
---|---|
i64 | i64 |
8 | 165 |
# We can do arithmetic on columns and (literal) values
# can be evaluated to 1 without programmer knowing
some_var = 1
(df.select([
((pl.col("A") / 124.0 * pl.col("B")) / pl.sum("B") * some_var).alias("computed")
]))
computed |
---|
f64 |
0.0 |
0.0 |
0.0 |
0.0 |
0.0 |
# We can combine columns by a predicate
(df.select([
"fruits",
"B",
pl.when(pl.col("fruits") == "banana").then(pl.col("B")).otherwise(-1).alias("b")
]))
fruits | B | b |
---|---|---|
str | i64 | i64 |
"banana" | 5 | 5 |
"banana" | 4 | 4 |
"apple" | 3 | -1 |
"apple" | 2 | -1 |
"banana" | 1 | 1 |
# We can combine columns by a fold operation on column level
(df.select([
"A",
"B",
pl.fold(0, lambda a, b: a + b, [pl.col("A"), "B", pl.col("B")**2, pl.col("A") / 2.0]).alias("fold")
]))
A | B | fold |
---|---|---|
i64 | i64 | f64 |
1 | 5 | 31 |
2 | 4 | 23 |
3 | 3 | 16 |
4 | 2 | 12 |
5 | 1 | 9 |
# even combine all
(df.select([
pl.arange(0, df.height).alias("idx"),
"A",
pl.col("A").shift().alias("A_shifted"),
pl.concat_str(pl.all(), "-").alias("str_concat_1"), # prefer this
pl.fold(pl.col("A"), lambda a, b: a + "-" + b, pl.all().exclude("A")).alias("str_concat_2"), # over this (accidentally O(n^2))
]))
idx | A | A_shifted | str_concat_1 | str_concat_2 |
---|---|---|---|---|
i64 | i64 | i64 | str | str |
0 | 1 | null | "1-banana-5-beetle-28" | "1-banana-5-beetle-28" |
1 | 2 | 1 | "2-banana-4-audi-300" | "2-banana-4-audi-300" |
2 | 3 | 2 | null | null |
3 | 4 | 3 | "4-apple-2-beetle-2" | "4-apple-2-beetle-2" |
4 | 5 | 4 | "5-banana-1-beetle--30" | "5-banana-1-beetle--30" |
Aggregation context
- expressions are applied over groups instead of columns
# we can still combine many expressions
(df.sort("cars").groupby("fruits")
.agg([
pl.col("B").sum().alias("B_sum"),
pl.sum("B").alias("B_sum2"), # syntactic sugar for the first
pl.first("fruits").alias("fruits_first"),
pl.count("A").alias("count"),
pl.col("cars").reverse()
]))
fruits | B_sum | B_sum2 | fruits_first | count | cars |
---|---|---|---|---|---|
str | i64 | i64 | str | u32 | list |
"banana" | 10 | 10 | "banana" | 3 | [beetle, beetle, audi] |
"apple" | 5 | 5 | "apple" | 2 | [beetle, beetle] |
# We can explode the list column "cars"
(df.sort("cars").groupby("fruits")
.agg([
pl.col("B").sum().alias("B_sum"),
pl.sum("B").alias("B_sum2"), # syntactic sugar for the first
pl.first("fruits").alias("fruits_first"),
pl.count("A").alias("count"),
pl.col("cars").reverse()
])).explode("cars")
fruits | B_sum | B_sum2 | fruits_first | count | cars |
---|---|---|---|---|---|
str | i64 | i64 | str | u32 | str |
"apple" | 5 | 5 | "apple" | 2 | "beetle" |
"apple" | 5 | 5 | "apple" | 2 | "beetle" |
"banana" | 10 | 10 | "banana" | 3 | "beetle" |
"banana" | 10 | 10 | "banana" | 3 | "beetle" |
"banana" | 10 | 10 | "banana" | 3 | "audi" |
(df.groupby("fruits")
.agg([
pl.col("B").sum().alias("B_sum"),
pl.sum("B").alias("B_sum2"), # syntactic sugar for the first
pl.first("fruits").alias("fruits_first"),
pl.count(),
pl.col("B").shift().alias("B_shifted")
])
.explode("B_shifted")
)
fruits | B_sum | B_sum2 | fruits_first | count | B_shifted |
---|---|---|---|---|---|
str | i64 | i64 | str | u32 | i64 |
"apple" | 5 | 5 | "apple" | 2 | null |
"apple" | 5 | 5 | "apple" | 2 | 3 |
"banana" | 10 | 10 | "banana" | 3 | null |
"banana" | 10 | 10 | "banana" | 3 | 5 |
"banana" | 10 | 10 | "banana" | 3 | 4 |
# We can explode the list column "cars"
(df.sort("cars").groupby("fruits")
.agg([
pl.col("B").sum(),
pl.sum("B").alias("B_sum2"), # syntactic sugar for the first
pl.first("fruits").alias("fruits_first"),
pl.count("A").alias("count"),
pl.col("cars").reverse()
])).explode("cars")
fruits | B_sum | B_sum2 | fruits_first | count | cars |
---|---|---|---|---|---|
str | i64 | i64 | str | u32 | str |
"apple" | 5 | 5 | "apple" | 2 | "beetle" |
"apple" | 5 | 5 | "apple" | 2 | "beetle" |
"banana" | 10 | 10 | "banana" | 3 | "beetle" |
"banana" | 10 | 10 | "banana" | 3 | "beetle" |
"banana" | 10 | 10 | "banana" | 3 | "audi" |
# we can also get the list of the groups
(df.groupby("fruits")
.agg([
pl.col("B").shift().alias("shift_B"),
pl.col("B").reverse().alias("rev_B"),
]))
fruits | shift_B | rev_B |
---|---|---|
str | list | list |
"apple" | [null, 3] | [2, 3] |
"banana" | [null, 5, 4] | [1, 4, 5] |
# we can do predicates in the groupby as well
(df.groupby("fruits")
.agg([
pl.col("B").filter(pl.col("B") > 1).list().keep_name(),
]))
fruits | B |
---|---|
str | list |
"banana" | [5, 4] |
"apple" | [3, 2] |
# and take the mean of only the values where the predicate is true
(df.groupby("fruits")
.agg([
pl.col("B").filter(pl.col("B") > 1).mean(),
]))
fruits | B_mean |
---|---|
str | f64 |
"banana" | 4.5 |
"apple" | 2.5 |
# Another example
(df.groupby("fruits")
.agg([
pl.col("B").shift_and_fill(1, fill_value=0).alias("shifted"),
pl.col("B").shift_and_fill(1, fill_value=0).sum().alias("shifted_sum"),
]))
fruits | shifted | shifted_sum |
---|---|---|
str | list | i64 |
"apple" | [0, 3] | 3 |
"banana" | [0, 5, 4] | 9 |
Window functions!
- Expression with superpowers.
- Aggregation in selection context
pl.col("foo").aggregation_expression(..).over("column_used_to_group")
# groupby 2 different columns
(df.select([
"fruits",
"cars",
"B",
pl.col("B").sum().over("fruits").alias("B_sum_by_fruits"),
pl.col("B").sum().over("cars").alias("B_sum_by_cars"),
]))
fruits | cars | B | B_sum_by_fruits | B_sum_by_cars |
---|---|---|---|---|
str | str | i64 | i64 | i64 |
"apple" | "beetle" | 3 | 5 | 11 |
"apple" | "beetle" | 2 | 5 | 11 |
"banana" | "beetle" | 5 | 10 | 11 |
"banana" | "audi" | 4 | 10 | 4 |
"banana" | "beetle" | 1 | 10 | 11 |
# reverse B by groups and show the results in original DF
(df.select([
"fruits",
"B",
pl.col("B").reverse().over("fruits").alias("B_reversed_by_fruits")
]))
fruits | B | B_reversed_by_fruits |
---|---|---|
str | i64 | i64 |
"apple" | 3 | 2 |
"apple" | 2 | 3 |
"banana" | 5 | 1 |
"banana" | 4 | 4 |
"banana" | 1 | 5 |
# Lag a column within "fruits"
(df.select([
"fruits",
"B",
pl.col("B").shift().over("fruits").alias("lag_B_by_fruits")
]))
fruits | B | lag_B_by_fruits |
---|---|---|
str | i64 | i64 |
"apple" | 3 | null |
"apple" | 2 | 3 |
"banana" | 5 | null |
"banana" | 4 | 5 |
"banana" | 1 | 4 |
Note: To use this notebook, you must first install the Rust evcxr Jupyter kernel. Also, note that clone() is used fairly often in the examples. This is because we tend to create one dataset for multiple examples. When this dataset is used, the Rust ownership system will move that dataframe, which will make it unavailable to later examples. By cloning, we can keep using it over and over.
:dep polars = { version = "0.23.2", features = ["lazy", "csv-file", "strings", "temporal", "dtype-duration", "dtype-categorical", "concat_str", "list", "list_eval", "rank", "lazy_regex"]}
:dep color-eyre = {version = "0.6.2"}
:dep rand = {version = "0.8.5"}
:dep reqwest = { version = "0.11.11", features = ["blocking"]}
use color_eyre::{Result};
use polars::prelude::*;
Expressions
fn(Series) -> Series
- Lazily evaluated
- Can be optimized
- Gives the library writer context so informed decisions can be made
- Embarrassingly parallel
- Context dependent
  - selection/projection -> Series = COLUMN, LITERAL, or VALUE
  - aggregation -> Series = GROUPS
let df = df! [
"A" => [1, 2, 3, 4, 5],
"fruits" => ["banana", "banana", "apple", "apple", "banana"],
"B" => [5, 4, 3, 2, 1],
"cars" => ["beetle", "audi", "beetle", "beetle", "beetle"],
"optional" => [Some(28), Some(300), None, Some(2), Some(-30)],
]?;
df
shape: (5, 5)
┌─────┬────────┬─────┬────────┬──────────┐
│ A ┆ fruits ┆ B ┆ cars ┆ optional │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ str ┆ i32 ┆ str ┆ i32 │
╞═════╪════════╪═════╪════════╪══════════╡
│ 1 ┆ banana ┆ 5 ┆ beetle ┆ 28 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ banana ┆ 4 ┆ audi ┆ 300 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ apple ┆ 3 ┆ beetle ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ apple ┆ 2 ┆ beetle ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ banana ┆ 1 ┆ beetle ┆ -30 │
└─────┴────────┴─────┴────────┴──────────┘
Selection context
// We can select by name
// We'll be re-using the dataframe a bunch, so we'll clone copies as we go.
df.clone().lazy().select([
col("A"),
col("B"),
lit("B"), // we must tell polars we mean the literal "B"
col("fruits"),
]).collect()?
shape: (5, 4)
┌─────┬─────┬─────────┬────────┐
│ A ┆ B ┆ literal ┆ fruits │
│ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ str ┆ str │
╞═════╪═════╪═════════╪════════╡
│ 1 ┆ 5 ┆ B ┆ banana │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 4 ┆ B ┆ banana │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3 ┆ 3 ┆ B ┆ apple │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 2 ┆ B ┆ apple │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5 ┆ 1 ┆ B ┆ banana │
└─────┴─────┴─────────┴────────┘
// you can select columns with a regex if it starts with '^' and ends with '$'
df.clone().lazy().select([
col("^A|B$").sum()
]).collect()?
shape: (1, 2)
┌─────┬─────┐
│ A ┆ B │
│ --- ┆ --- │
│ i32 ┆ i32 │
╞═════╪═════╡
│ 15 ┆ 15 │
└─────┴─────┘
// you can select multiple columns by name
df.clone().lazy().select([
cols(["A", "B"]).sum()
]).collect()?
shape: (1, 2)
┌─────┬─────┐
│ A ┆ B │
│ --- ┆ --- │
│ i32 ┆ i32 │
╞═════╪═════╡
│ 15 ┆ 15 │
└─────┴─────┘
// We select everything in normal order
// Then we select everything in reversed order
df.clone().lazy().select([
all(),
all().reverse().suffix("_reverse")
]).collect()?
shape: (5, 10)
┌─────┬────────┬─────┬────────┬─────┬────────────────┬───────────┬──────────────┬──────┐
│ A ┆ fruits ┆ B ┆ cars ┆ ... ┆ fruits_reverse ┆ B_reverse ┆ cars_reverse ┆ opti │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ onal │
│ i32 ┆ str ┆ i32 ┆ str ┆ ┆ str ┆ i32 ┆ str ┆ _rev │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ erse │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ --- │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ i32 │
╞═════╪════════╪═════╪════════╪═════╪════════════════╪═══════════╪══════════════╪══════╡
│ 1 ┆ banana ┆ 5 ┆ beetle ┆ ... ┆ banana ┆ 1 ┆ beetle ┆ -30 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ banana ┆ 4 ┆ audi ┆ ... ┆ apple ┆ 2 ┆ beetle ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3 ┆ apple ┆ 3 ┆ beetle ┆ ... ┆ apple ┆ 3 ┆ beetle ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4 ┆ apple ┆ 2 ┆ beetle ┆ ... ┆ banana ┆ 4 ┆ audi ┆ 300 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 5 ┆ banana ┆ 1 ┆ beetle ┆ ... ┆ banana ┆ 5 ┆ beetle ┆ 28 │
└─────┴────────┴─────┴────────┴─────┴────────────────┴───────────┴──────────────┴──────┘
// all expressions run in parallel
// single valued `Series` are broadcasted to the shape of the `DataFrame`
df.clone().lazy().select([
all(),
all().sum().suffix("_sum")
]).collect()?
shape: (5, 10)
┌─────┬────────┬─────┬────────┬─────┬────────────┬───────┬──────────┬──────────────┐
│ A ┆ fruits ┆ B ┆ cars ┆ ... ┆ fruits_sum ┆ B_sum ┆ cars_sum ┆ optional_sum │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ str ┆ i32 ┆ str ┆ ┆ str ┆ i32 ┆ str ┆ i32 │
╞═════╪════════╪═════╪════════╪═════╪════════════╪═══════╪══════════╪══════════════╡
│ 1 ┆ banana ┆ 5 ┆ beetle ┆ ... ┆ null ┆ 15 ┆ null ┆ 300 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ banana ┆ 4 ┆ audi ┆ ... ┆ null ┆ 15 ┆ null ┆ 300 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ apple ┆ 3 ┆ beetle ┆ ... ┆ null ┆ 15 ┆ null ┆ 300 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ apple ┆ 2 ┆ beetle ┆ ... ┆ null ┆ 15 ┆ null ┆ 300 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ banana ┆ 1 ┆ beetle ┆ ... ┆ null ┆ 15 ┆ null ┆ 300 │
└─────┴────────┴─────┴────────┴─────┴────────────┴───────┴──────────┴──────────────┘
// there are `str` and `dt` namespaces for specialized functions
let predicate = col("fruits").str().contains("^b.*");
df.clone().lazy().select([
predicate
]).collect()?
shape: (5, 1)
┌────────┐
│ fruits │
│ --- │
│ bool │
╞════════╡
│ true │
├╌╌╌╌╌╌╌╌┤
│ true │
├╌╌╌╌╌╌╌╌┤
│ false │
├╌╌╌╌╌╌╌╌┤
│ false │
├╌╌╌╌╌╌╌╌┤
│ true │
└────────┘
// use the predicate to filter
let predicate = col("fruits").str().contains("^b.*");
df.clone().lazy().filter(predicate).collect()?
shape: (3, 5)
┌─────┬────────┬─────┬────────┬──────────┐
│ A ┆ fruits ┆ B ┆ cars ┆ optional │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ str ┆ i32 ┆ str ┆ i32 │
╞═════╪════════╪═════╪════════╪══════════╡
│ 1 ┆ banana ┆ 5 ┆ beetle ┆ 28 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ banana ┆ 4 ┆ audi ┆ 300 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ banana ┆ 1 ┆ beetle ┆ -30 │
└─────┴────────┴─────┴────────┴──────────┘
// predicate expressions can be used to filter
df.clone().lazy().select([
col("A").filter(col("fruits").str().contains("^b.*")).sum(),
(col("B").filter(col("cars").str().contains("^b.*")).sum() * col("B").sum()).alias("some_compute()"),
]).collect()?
shape: (1, 2)
┌─────┬────────────────┐
│ A ┆ some_compute() │
│ --- ┆ --- │
│ i32 ┆ i32 │
╞═════╪════════════════╡
│ 8 ┆ 165 │
└─────┴────────────────┘
// We can do arithmetic on columns and (literal) values
// can be evaluated to 1 without programmer knowing
let some_var = 1;
df.clone().lazy().select([
((col("A") / lit(124.0) * col("B")) / sum("B") * lit(some_var)).alias("computed")
]).collect()?
shape: (5, 1)
┌──────────┐
│ computed │
│ --- │
│ f64 │
╞══════════╡
│ 0.002688 │
├╌╌╌╌╌╌╌╌╌╌┤
│ 0.004301 │
├╌╌╌╌╌╌╌╌╌╌┤
│ 0.004839 │
├╌╌╌╌╌╌╌╌╌╌┤
│ 0.004301 │
├╌╌╌╌╌╌╌╌╌╌┤
│ 0.002688 │
└──────────┘
// We can combine columns by a predicate
// This doesn't work as intended: the condition appears to always evaluate to false (every row gets the otherwise value); use .eq() to build the comparison
df.clone().lazy().select([
col("fruits"),
col("B"),
when(col("fruits") == lit("banana")).then(col("B")).otherwise(-1).alias("b when not banana")
]).collect()?
shape: (5, 3)
┌────────┬─────┬───────────────────┐
│ fruits ┆ B ┆ b when not banana │
│ --- ┆ --- ┆ --- │
│ str ┆ i32 ┆ i32 │
╞════════╪═════╪═══════════════════╡
│ banana ┆ 5 ┆ -1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ 4 ┆ -1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple ┆ 3 ┆ -1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple ┆ 2 ┆ -1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ 1 ┆ -1 │
└────────┴─────┴───────────────────┘
// We can combine columns by a fold operation on column level
df.clone().lazy().select([
col("A"),
col("B"),
fold_exprs(lit(0), |a, b| Ok(&a + &b), [
col("A"),
lit("B"),
col("B").pow(lit(2)),
col("A") / lit(2.0)
]).alias("fold")
]).collect()?
shape: (5, 3)
┌─────┬─────┬───────────┐
│ A ┆ B ┆ fold │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ str │
╞═════╪═════╪═══════════╡
│ 1 ┆ 5 ┆ 1B25.00.5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 2B16.01.0 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 3 ┆ 3B9.01.5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 2 ┆ 4B4.02.0 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ 1 ┆ 5B1.02.5 │
└─────┴─────┴───────────┘
// even combine all
use std::convert::TryInto;
let height: i32 = df.height().try_into()?;
df.clone().lazy().select([
range(0i32, height).alias("idx"),
col("A"),
col("A").shift(1).alias("A_shifted"),
concat_str([all()], "").alias("str_concat_1"), // prefer this
fold_exprs(col("A"), |a, b| Ok(a + b), [all().exclude(["A"])]).alias("str_concat_2"), // over this (accidentally O(n^2))
]).collect()?
shape: (5, 5)
┌─────┬─────┬───────────┬───────────────────┬───────────────────┐
│ idx ┆ A ┆ A_shifted ┆ str_concat_1 ┆ str_concat_2 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 ┆ str ┆ str │
╞═════╪═════╪═══════════╪═══════════════════╪═══════════════════╡
│ 0 ┆ 1 ┆ null ┆ 1banana5beetle28 ┆ 1banana5beetle28 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2 ┆ 1 ┆ 2banana4audi300 ┆ 2banana4audi300 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 3 ┆ 2 ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4 ┆ 3 ┆ 4apple2beetle2 ┆ 4apple2beetle2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 5 ┆ 4 ┆ 5banana1beetle-30 ┆ 5banana1beetle-30 │
└─────┴─────┴───────────┴───────────────────┴───────────────────┘
Aggregation context
- expressions are applied over groups instead of columns
// we can still combine many expressions
df.clone().lazy().sort("cars", SortOptions::default()).groupby(["fruits"])
.agg([
col("B").sum().alias("B_sum"),
sum("B").alias("B_sum2"), // syntactic sugar for the first
col("fruits").first().alias("fruits_first"),
col("A").count().alias("count"),
col("cars").reverse()
]).collect()?
shape: (2, 6)
┌────────┬───────┬────────┬──────────────┬───────┬──────────────────────────────┐
│ fruits ┆ B_sum ┆ B_sum2 ┆ fruits_first ┆ count ┆ cars │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i32 ┆ i32 ┆ str ┆ u32 ┆ list[str] │
╞════════╪═══════╪════════╪══════════════╪═══════╪══════════════════════════════╡
│ apple ┆ 5 ┆ 5 ┆ apple ┆ 2 ┆ ["beetle", "beetle"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ 10 ┆ 10 ┆ banana ┆ 3 ┆ ["beetle", "beetle", "audi"] │
└────────┴───────┴────────┴──────────────┴───────┴──────────────────────────────┘
// We can explode the list column "cars"
df.clone().lazy()
.sort("cars", SortOptions { descending: false, nulls_last: false })
.groupby(["fruits"])
.agg([
col("B").sum().alias("B_sum"),
sum("B").alias("B_sum2"), // syntactic sugar for the first
col("fruits").first().alias("fruits_first"),
col("A").count().alias("count"),
col("cars").reverse()
])
.explode(["cars"])
.collect()?
shape: (5, 6)
┌────────┬───────┬────────┬──────────────┬───────┬────────┐
│ fruits ┆ B_sum ┆ B_sum2 ┆ fruits_first ┆ count ┆ cars │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i32 ┆ i32 ┆ str ┆ u32 ┆ str │
╞════════╪═══════╪════════╪══════════════╪═══════╪════════╡
│ banana ┆ 10 ┆ 10 ┆ banana ┆ 3 ┆ beetle │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ banana ┆ 10 ┆ 10 ┆ banana ┆ 3 ┆ beetle │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ banana ┆ 10 ┆ 10 ┆ banana ┆ 3 ┆ audi │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ apple ┆ 5 ┆ 5 ┆ apple ┆ 2 ┆ beetle │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ apple ┆ 5 ┆ 5 ┆ apple ┆ 2 ┆ beetle │
└────────┴───────┴────────┴──────────────┴───────┴────────┘
df.clone().lazy()
.groupby(["fruits"])
.agg([
col("B").sum().alias("B_sum"),
col("fruits").first().alias("fruits_first"),
count(),
col("B").shift(1).alias("B_shifted")
])
.explode(["B_shifted"])
.collect()?
shape: (5, 5)
┌────────┬───────┬──────────────┬───────┬───────────┐
│ fruits ┆ B_sum ┆ fruits_first ┆ count ┆ B_shifted │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i32 ┆ str ┆ u32 ┆ i32 │
╞════════╪═══════╪══════════════╪═══════╪═══════════╡
│ apple ┆ 5 ┆ apple ┆ 2 ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ apple ┆ 5 ┆ apple ┆ 2 ┆ 3 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ 10 ┆ banana ┆ 3 ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ 10 ┆ banana ┆ 3 ┆ 5 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ 10 ┆ banana ┆ 3 ┆ 4 │
└────────┴───────┴──────────────┴───────┴───────────┘
// we can also get the list of the groups
df.clone().lazy()
.groupby(["fruits"])
.agg([
col("B").shift(1).alias("shift_B"),
col("B").reverse().alias("rev_B"),
])
.collect()?
shape: (2, 3)
┌────────┬──────────────┬───────────┐
│ fruits ┆ shift_B ┆ rev_B │
│ --- ┆ --- ┆ --- │
│ str ┆ list[i32] ┆ list[i32] │
╞════════╪══════════════╪═══════════╡
│ banana ┆ [null, 5, 4] ┆ [1, 4, 5] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ apple ┆ [null, 3] ┆ [2, 3] │
└────────┴──────────────┴───────────┘
// we can do predicates in the groupby as well
df.clone().lazy()
.groupby(["fruits"])
.agg([
col("B").filter(col("B").gt(lit(1))).list().keep_name(),
])
.collect()?
shape: (2, 2)
┌────────┬───────────┐
│ fruits ┆ B │
│ --- ┆ --- │
│ str ┆ list[i32] │
╞════════╪═══════════╡
│ apple ┆ [3, 2] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ [5, 4] │
└────────┴───────────┘
// and take the mean of only the values where the predicate is true
df.clone().lazy()
.groupby(["fruits"])
.agg([
col("B").filter(col("B").gt(lit(1))).mean(),
])
.collect()?
shape: (2, 2)
┌────────┬─────┐
│ fruits ┆ B │
│ --- ┆ --- │
│ str ┆ f64 │
╞════════╪═════╡
│ apple ┆ 2.5 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ banana ┆ 4.5 │
└────────┴─────┘
// Another example
df.clone().lazy()
.groupby(["fruits"])
.agg([
col("B").shift_and_fill(1, 0).alias("shifted"),
col("B").shift_and_fill(1, 0).sum().alias("shifted_sum"),
])
.collect()?
shape: (2, 3)
┌────────┬───────────┬─────────────┐
│ fruits ┆ shifted ┆ shifted_sum │
│ --- ┆ --- ┆ --- │
│ str ┆ list[i32] ┆ i32 │
╞════════╪═══════════╪═════════════╡
│ apple ┆ [0, 3] ┆ 3 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ [0, 5, 4] ┆ 9 │
└────────┴───────────┴─────────────┘
Window functions!
- Expression with superpowers.
- Aggregation in selection context
col("foo").aggregation_expression(..).over("column_used_to_group")
// groupby 2 different columns
df.clone().lazy()
.select([
col("fruits"),
col("cars"),
col("B"),
col("B").sum().over(["fruits"]).alias("B_sum_by_fruits"),
col("B").sum().over(["cars"]).alias("B_sum_by_cars"),
])
.collect()?
shape: (5, 5)
┌────────┬────────┬─────┬─────────────────┬───────────────┐
│ fruits ┆ cars ┆ B ┆ B_sum_by_fruits ┆ B_sum_by_cars │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i32 ┆ i32 ┆ i32 │
╞════════╪════════╪═════╪═════════════════╪═══════════════╡
│ banana ┆ beetle ┆ 5 ┆ 10 ┆ 11 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ audi ┆ 4 ┆ 10 ┆ 4 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple ┆ beetle ┆ 3 ┆ 5 ┆ 11 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple ┆ beetle ┆ 2 ┆ 5 ┆ 11 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ beetle ┆ 1 ┆ 10 ┆ 11 │
└────────┴────────┴─────┴─────────────────┴───────────────┘
// reverse B by groups and show the results in original DF
df.clone().lazy()
.select([
col("fruits"),
col("B"),
col("B").reverse().over(["fruits"]).alias("B_reversed_by_fruits")
])
.collect()?
shape: (5, 3)
┌────────┬─────┬──────────────────────┐
│ fruits ┆ B ┆ B_reversed_by_fruits │
│ --- ┆ --- ┆ --- │
│ str ┆ i32 ┆ i32 │
╞════════╪═════╪══════════════════════╡
│ banana ┆ 5 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ 4 ┆ 4 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple ┆ 3 ┆ 2 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple ┆ 2 ┆ 3 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ 1 ┆ 5 │
└────────┴─────┴──────────────────────┘
// Lag a column within "fruits"
df.clone().lazy()
.select([
col("fruits"),
col("B"),
col("B").shift(1).over(["fruits"]).alias("lag_B_by_fruits")
])
.collect()?
shape: (5, 3)
┌────────┬─────┬─────────────────┐
│ fruits ┆ B ┆ lag_B_by_fruits │
│ --- ┆ --- ┆ --- │
│ str ┆ i32 ┆ i32 │
╞════════╪═════╪═════════════════╡
│ banana ┆ 5 ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ 4 ┆ 5 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple ┆ 3 ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple ┆ 2 ┆ 3 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ 1 ┆ 4 │
└────────┴─────┴─────────────────┘
Expression API
The full list of possible expressions is available on the Expr
class definition in the reference guide.
Video Introduction
Don't enjoy reading? Take a look at this introduction video on Polars
and its expressions.
Data types
Polars
is entirely based on Arrow
data types and backed by Arrow
memory arrays. This makes data processing
cache-efficient and well-supported for Inter Process Communication. Most data types follow the exact implementation
from Arrow
, with exception of Utf8
(this is actually LargeUtf8
), Categorical
, and Object
(support is limited).
The data types are:
- Int8: 8-bit signed integer.
- Int16: 16-bit signed integer.
- Int32: 32-bit signed integer.
- Int64: 64-bit signed integer.
- UInt8: 8-bit unsigned integer.
- UInt16: 16-bit unsigned integer.
- UInt32: 32-bit unsigned integer.
- UInt64: 64-bit unsigned integer.
- Float32: 32-bit floating point.
- Float64: 64-bit floating point.
- Boolean: Boolean type, effectively bit packed.
- Utf8: String data (this is actually Arrow LargeUtf8 internally).
- Binary: Stores data as bytes.
- List: A list array contains a child array containing the list values and an offset array (this is actually Arrow LargeList internally).
- Struct: A struct array is represented as Vec<Series> and is useful to pack multiple/heterogeneous values into a single column.
- Object: A data type with limited support that can hold any value.
- Date: Date representation, internally represented as days since the UNIX epoch encoded by a 32-bit signed integer.
- Datetime: Datetime representation, internally represented as microseconds since the UNIX epoch encoded by a 64-bit signed integer.
- Duration: A timedelta type, internally represented as microseconds. Created when subtracting Date/Datetime.
- Time: Time representation, internally represented as nanoseconds since midnight.
To learn more about the internal representation of these data types, check the Arrow
columnar format.
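As a quick, hedged illustration (the column names below are made up for the example), the dtypes of a DataFrame can be inspected and changed with cast:
import polars as pl

# Build a small frame; Polars infers Int64, Float64 and Utf8 for these columns
df = pl.DataFrame(
    {
        "ints": [1, 2, 3],
        "floats": [1.0, 2.5, None],
        "strings": ["a", "b", "c"],
    }
)
print(df.dtypes)

# Casting to another dtype is an expression
df = df.with_column(pl.col("ints").cast(pl.Int32))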
Coming from Pandas
Here we set out the key points that anyone who has experience with Pandas
and wants to
try Polars
should know. We include both differences in the concepts the libraries are
built on and differences in how you should write Polars
code compared to Pandas
code.
Differences in concepts between Polars
and Pandas
Polars
does not have a multi-index/index
Pandas
gives a label to each row with an index. Polars
does not use an index and
each row is indexed by its integer position in the table.
Polars aims to have predictable results and readable queries; as such, we think an index does not help us reach that
objective. We believe the semantics of a query should not change with the state of an index or a reset_index
call.
In Polars a DataFrame will always be a 2D table with heterogeneous data types. The data types may have nesting, but the table itself will not. Operations like resampling will be done by specialized functions or methods that act like 'verbs' on a table, explicitly stating the columns that the 'verb' operates on. As such, it is our conviction that not having indices makes things simpler, more explicit, more readable and less error-prone.
Note that an 'index' data structure, as known from databases, will be used by Polars as an optimization technique.
For more detail on how you select data in Polars
see the selecting data
section.
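A minimal sketch of what this means in practice (the frame below is made up): rows are addressed by integer position, and expressions are the preferred way to select them:
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Rows are addressed by their integer position, not by an index label
print(df[1, :])        # the second row
print(df.slice(1, 2))  # rows 1 and 2

# The expression equivalent, which also works in lazy mode
print(df.filter(pl.col("a") == 2))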
Polars
uses Apache Arrow arrays to represent data in memory while Pandas
uses Numpy
arrays
Polars
represents data in memory with Arrow arrays while Pandas
represents data in
memory in Numpy
arrays. Apache Arrow is an emerging standard for in-memory columnar
analytics that can accelerate data load times, reduce memory usage and accelerate
calculations.
Polars
can convert data to Numpy
format with the to_numpy
method.
Polars
has more support for parallel operations than Pandas
Polars
exploits the strong support for concurrency in Rust to run many operations in
parallel. While some operations in Pandas
are multi-threaded, the core of the library
is single-threaded and an additional library such as Dask
must be used to parallelise
operations.
Polars
can lazily evaluate queries and apply query optimization
Eager evaluation is where code is evaluated as soon as you run the code. Lazy evaluation is where running a line of code means that the underlying logic is added to a query plan rather than being evaluated.
Polars
supports eager evaluation and lazy evaluation whereas Pandas
only supports
eager evaluation. The lazy evaluation mode is powerful because Polars
carries out
automatic query optimization where it examines the query plan and looks for ways to
accelerate the query or reduce memory usage.
Dask
also supports lazy evaluation where it generates a query plan. However, Dask
does not carry out query optimization on the query plan.
Key syntax differences
Users coming from Pandas
generally need to know one thing...
polars != pandas
If your Polars
code looks like it could be Pandas
code, it might run, but it likely
runs slower than it should.
Let's go through some typical Pandas
code and see how we might write that in Polars
.
Selecting data
As there is no index in Polars
there is no .loc
or iloc
method in Polars
- and
there is also no SettingWithCopyWarning
in Polars
.
To learn more about how you select data in Polars
see the selecting data section.
However, the best way to select data in Polars
is to use the expression API. For
example, if you want to select a column in Pandas
you can do one of the following:
df['a']
df.loc[:,'a']
but in Polars
you would use the .select
method:
df.select(['a'])
If you want to select rows based on the values then in Polars
you use the .filter
method:
df.filter(pl.col('a') < 10)
As noted in the section on expressions below, Polars
can run operations in .select
and filter
in parallel and Polars
can carry out query optimization on the full set
of data selection criteria.
Be lazy
Working in lazy evaluation mode is straightforward and should be your default in
Polars
as the lazy mode allows Polars
to do query optimization.
We can run in lazy mode by either using an implicitly lazy function (such as scan_csv
)
or explicitly using the lazy
method.
Take the following simple example where we read a CSV file from disk and do a groupby.
The CSV file has numerous columns but we just want to do a groupby on one of the id
columns (id1
) and then sum by a value column (v1
). In Pandas
this would be:
df = pd.read_csv(csv_file)
grouped_df = df.loc[:,['id1','v1']].groupby('id1').sum('v1')
In Polars
you can build this query in lazy mode with query optimization and evaluate
it by replacing the eager Pandas
function read_csv
with the implicitly lazy Polars
function scan_csv
:
df = pl.scan_csv(csv_file)
grouped_df = df.groupby('id1').agg(pl.col('v1').sum()).collect()
Polars
optimizes this query by identifying that only the id1
and v1
columns are
relevant and so will only read these columns from the CSV. By calling the .collect
method at the end of the second line we instruct Polars
to eagerly evaluate the query.
If you do want to run this query in eager mode you can just replace scan_csv
with
read_csv
in the Polars
code.
Read more about working with lazy evaluation in the lazy API section.
Express yourself
A typical Pandas
script consists of multiple data transformations that are executed
sequentially. However, in Polars
these transformations can be executed in parallel
using expressions.
Column assignment
We have a dataframe df
with a column called value
. We want to add two new columns, a
column called tenXValue
where the value
column is multiplied by 10 and a column
called hundredXValue
where the value
column is multiplied by 100.
In Pandas
this would be:
df["tenXValue"] = df["value"] * 10
df["hundredXValue"] = df["value"] * 100
These column assignments are executed sequentially.
In Polars
we add columns to df
using the .with_columns
method and name them with
the .alias
method:
df.with_columns([
(pl.col("value") * 10).alias("tenXValue"),
(pl.col("value") * 100).alias("hundredXValue"),
])
These column assignments are executed in parallel.
Column assignment based on predicate
In this case we have a dataframe df
with columns a
,b
and c
. We want to re-assign
the values in column a
based on a condition. When the value in column c
is equal to
2 then we replace the value in a
with the value in b
.
In Pandas
this would be:
df.loc[df["c"] == 2, "a"] = df.loc[df["c"] == 2, "b"]
while in Polars
this would be:
df.with_column(
pl.when(pl.col("c") == 2)
.then(pl.col("b"))
.otherwise(pl.col("a")).alias("a")
)
The Polars
way is pure in that the original DataFrame
is not modified. The mask
is
also not computed twice as in Pandas
(you could prevent this in Pandas
, but that
would require setting a temporary variable).
Additionally, Polars
can compute every branch of an if -> then -> otherwise
in
parallel. This is valuable when the branches become more expensive to compute.
Filtering
We want to filter the dataframe df
with housing data based on some criteria.
In Pandas
you filter the dataframe by passing Boolean expressions to the loc
method:
df.loc[(df['sqft_living'] > 2500) & (df['price'] < 300000)]
while in Polars
you call the filter
method:
df.filter(
(pl.col("m2_living") > 2500) & (pl.col("price") < 300000)
)
The query optimizer in Polars
can also detect if you write multiple filters separately
and combine them into a single filter in the optimized plan.
Pandas
transform
The Pandas
documentation demonstrates an operation on a groupby called transform
. In
this case we have a dataframe df
and we want a new column showing the number of rows
in each group.
In Pandas
we have:
df = pd.DataFrame({
"type": ["m", "n", "o", "m", "m", "n", "n"],
"c": [1, 1, 1, 2, 2, 2, 2],
})
df["size"] = df.groupby("c")["type"].transform(len)
Here Pandas
does a groupby on "c"
, takes column "type"
, computes the group length
and then joins the result back to the original DataFrame
producing:
c type size
0 1 m 3
1 1 n 3
2 1 o 3
3 2 m 4
4 2 m 4
5 2 n 4
6 2 n 4
In Polars
the same can be achieved with window
functions:
df.select([
pl.all(),
pl.col("type").count().over("c").alias("size")
])
shape: (7, 3)
┌─────┬──────┬──────┐
│ c ┆ type ┆ size │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ u32 │
╞═════╪══════╪══════╡
│ 1 ┆ m ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1 ┆ n ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1 ┆ o ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ m ┆ 4 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ m ┆ 4 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ n ┆ 4 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ n ┆ 4 │
└─────┴──────┴──────┘
Because we can store the whole operation in a single expression, we can combine several
window
functions and even combine different groups!
Polars
will cache window expressions that are applied over the same group, so storing
them in a single select
is both convenient and optimal. In the following example
we look at a case where we are calculating group statistics over "c"
twice:
df.select([
pl.all(),
pl.col("c").count().over("c").alias("size"),
pl.col("c").sum().over("type").alias("sum"),
pl.col("c").reverse().over("c").flatten().alias("reverse_type")
])
shape: (7, 5)
┌─────┬──────┬──────┬─────┬──────────────┐
│ c ┆ type ┆ size ┆ sum ┆ reverse_type │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ u32 ┆ i64 ┆ i64 │
╞═════╪══════╪══════╪═════╪══════════════╡
│ 1 ┆ m ┆ 3 ┆ 5 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ n ┆ 3 ┆ 5 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ o ┆ 3 ┆ 1 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ m ┆ 4 ┆ 5 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ m ┆ 4 ┆ 5 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ n ┆ 4 ┆ 5 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ n ┆ 4 ┆ 5 ┆ 1 │
└─────┴──────┴──────┴─────┴──────────────┘
Missing data
Pandas
uses NaN
and/or None
values to indicate missing values depending on the dtype of the column. In addition the behaviour in Pandas
varies depending on whether the default dtypes or optional nullable arrays are used. In Polars
missing data corresponds to a null
value for all data types.
For float columns Polars
permits the use of NaN
values. These NaN
values are not considered to be missing data but instead a special floating point value.
In Pandas
an integer column with missing values is cast to be a float column with NaN
values for the missing values (unless using optional nullable integer dtypes). In Polars
any missing values in an integer column are simply null
values and the column remains an integer column.
See the missing data section for more details.
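As a small, hedged sketch of how integer columns with missing values behave differently in the two libraries:
import pandas as pd
import polars as pl

# In Pandas (default dtypes) the missing value forces a cast to float64
s_pd = pd.Series([1, 2, None])
print(s_pd.dtype)  # float64, with NaN for the missing value

# In Polars the column stays an integer column with a null
s_pl = pl.Series("a", [1, 2, None])
print(s_pl.dtype)  # Int64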
Coming from Apache Spark
Column-based API vs. Row-based API
Whereas the Spark
DataFrame
is analogous to a collection of rows, a Polars
DataFrame
is closer to a collection of columns. This means that you can combine columns in Polars
in ways that are not possible in Spark
, because Spark
preserves the relationship of the data in each row.
Consider this sample dataset:
import polars as pl
df = pl.DataFrame({
"foo": ["a", "b", "c", "d", "d"],
"bar": [1, 2, 3, 4, 5],
})
dfs = spark.createDataFrame(
[
("a", 1),
("b", 2),
("c", 3),
("d", 4),
("d", 5),
],
schema=["foo", "bar"],
)
Example 1: Combining head
and sum
In Polars
you can write something like this:
df.select([
pl.col("foo").sort().head(2),
pl.col("bar").filter(pl.col("foo") == "d").sum()
])
Output:
shape: (2, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a ┆ 9 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ 9 │
└─────┴─────┘
The expressions on columns foo
and bar
are completely independent. Since the expression on bar
returns a single value, that value is repeated for each value output by the expression on foo
. But a
and b
have no relation to the data that produced the sum of 9
.
To do something similar in Spark
, you'd need to compute the sum separately and provide it as a literal:
from pyspark.sql.functions import col, sum, lit
bar_sum = (
dfs
.where(col("foo") == "d")
.groupBy()
.agg(sum(col("bar")))
.take(1)[0][0]
)
(
dfs
.orderBy("foo")
.limit(2)
.withColumn("bar", lit(bar_sum))
.show()
)
Output:
+---+---+
|foo|bar|
+---+---+
| a| 9|
| b| 9|
+---+---+
Example 2: Combining Two head
s
In Polars
you can combine two different head
expressions on the same DataFrame, provided that they return the same number of values.
df.select([
pl.col("foo").sort().head(2),
pl.col("bar").sort(reverse=True).head(2),
])
Output:
shape: (2, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ 4 │
└─────┴─────┘
Again, the two head
expressions here are completely independent, and the pairing of a
to 5
and b
to 4
results purely from the juxtaposition of the two columns output by the expressions.
To accomplish something similar in Spark
, you would need to generate an artificial key that enables you to join the values in this way.
from pyspark.sql import Window
from pyspark.sql.functions import row_number
foo_dfs = (
dfs
.withColumn(
"rownum",
row_number().over(Window.orderBy("foo"))
)
)
bar_dfs = (
dfs
.withColumn(
"rownum",
row_number().over(Window.orderBy(col("bar").desc()))
)
)
(
foo_dfs.alias("foo")
.join(bar_dfs.alias("bar"), on="rownum")
.select("foo.foo", "bar.bar")
.limit(2)
.show()
)
Output:
+---+---+
|foo|bar|
+---+---+
| a| 5|
| b| 4|
+---+---+
How can I?
This chapter contains some snippets that will get you up to speed with the most
idiomatic way to get things done in Polars
.
IO
Polars
supports different file types, and the respective parsers are among the fastest
out there.
For instance, it is faster to load a CSV file via Polars
and then hand it to Pandas
than to load it with Pandas
directly. Just run
pl.read_csv("<FILE>", rechunk=False).to_pandas()
to convince yourself!
Character-Separated Values
Read & Write
Reading a CSV file should look familiar:
df = pl.read_csv("path.csv")
use polars::prelude::*;
let df = CsvReader::from_path("path.csv").unwrap().finish().unwrap();
CSV files come in many different flavors, so make sure to check the
read_csv()
API.
Writing to a CSV file can be done with the
write_csv()
method.
df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "bak", "baz"]})
df.write_csv("path.csv")
use polars::prelude::*;
let mut df = df!(
"foo" => &[1, 2, 3],
"bar" => &[None, Some("bak"), Some("baz")],
)
.unwrap();
let mut file = std::fs::File::create("path.csv").unwrap();
CsvWriter::new(&mut file).finish(&mut df).unwrap();
Scan
Polars
allows you to scan a CSV input. Scanning delays the actual parsing of the
file and instead returns a lazy computation holder called a LazyFrame
.
df = pl.scan_csv("path.csv")
use polars::prelude::*;
let df = LazyCsvReader::new("./test.csv").finish().unwrap();
If you want to know why this is desirable, you can read more about those Polars
optimizations here.
The following video shows how to efficiently load CSV files with Polars and how the built-in query optimization makes this much faster.
Parquet
Loading or writing Parquet
files is lightning fast.
Pandas
uses PyArrow
-- Python
bindings
exposed by Arrow
-- to load Parquet
files into memory, but it has to copy that data into
Pandas
memory. With Polars
there is no extra cost due to
copying as we read Parquet
directly into Arrow
memory and keep it there.
Read
df = pl.read_parquet("path.parquet")
let mut file = std::fs::File::open("path.parquet").unwrap();
let df = ParquetReader::new(&mut file).finish().unwrap();
Write
df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "bak", "baz"]})
df.write_parquet("path.parquet")
let mut df = df!(
"foo" => &[1, 2, 3],
"bar" => &[None, Some("bak"), Some("baz")],
)
.unwrap();
let mut file = std::fs::File::create("path.parquet").unwrap();
ParquetWriter::new(&mut file).finish(&mut df).unwrap();
Scan
Polars
allows you to scan a Parquet
input. Scanning delays the actual parsing of the
file and instead returns a lazy computation holder called a LazyFrame
.
df = pl.scan_parquet("path.parquet")
use polars::prelude::*;
let args = ScanArgsParquet::default();
let df = LazyFrame::scan_parquet("./file.parquet",args).unwrap();
If you want to know why this is desirable,
you can read more about those Polars
optimizations here.
Note about Rust usage
Parquet functionality is not enabled by default. It must be added as an additional feature.
This can be enabled via cargo add polars --features parquet
or by directly adding it to your Cargo.toml
[dependencies]
polars = { version = "0.24.3", features = ["parquet"] }
Additionally, scanning of parquet files requires the lazy
feature
[dependencies]
polars = { version = "0.24.3", features = ["parquet", "lazy"] }
JSON files
Read & Write
JSON
Reading a JSON file should look familiar:
df = pl.read_json("path.json")
use polars::prelude::*;
let mut file = std::fs::File::open("path.json").unwrap();
let df = JsonReader::new(&mut file).finish().unwrap();
Newline Delimited JSON
JSON objects that are delimited by newlines can be read into polars in a much more performant way than standard json.
df = pl.read_ndjson("path.json")
use polars::prelude::*;
let mut file = std::fs::File::open("path.json").unwrap();
let df = JsonLineReader::new(&mut file).finish().unwrap();
Write
df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "bak", "baz"]})
# json
df.write_json("path.json")
# ndjson
df.write_ndjson("path.json")
use polars::prelude::*;
let mut df = df!(
"foo" => &[1, 2, 3],
"bar" => &[None, Some("bak"), Some("baz")],
)
.unwrap();
let mut file = std::fs::File::create("path.json").unwrap();
// json
JsonWriter::new(&mut file)
.with_json_format(JsonFormat::Json)
.finish(&mut df)
.unwrap();
// ndjson
JsonWriter::new(&mut file)
.with_json_format(JsonFormat::JsonLines)
.finish(&mut df)
.unwrap();
Scan
Polars
allows you to scan a JSON input only for newline delimited json. Scanning delays the actual parsing of the
file and instead returns a lazy computation holder called a LazyFrame
.
df = pl.scan_ndjson("path.json")
use polars::prelude::*;
let df = LazyJsonLineReader::new("path.json".to_string()).finish().unwrap();
Note about Rust usage
JSON functionality is not enabled by default. It must be added as an additional feature.
This can be enabled via cargo add polars --features json
or by directly adding it to your Cargo.toml
[dependencies]
polars = { version = "0.24.3", features = ["json"] }
Additionally, scanning of json files requires the lazy
feature
[dependencies]
polars = { version = "0.24.3", features = ["json", "lazy"] }
Dealing with multiple files
Polars
can deal with multiple files differently depending on your needs and memory strain.
Let's create some files to give us some context:
import polars as pl
df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "ham", "spam"]})
for i in range(5):
df.write_csv(f"my_many_files_{i}.csv")
Reading into a single DataFrame
To read multiple files into a single DataFrame
, we can use globbing patterns:
df = pl.read_csv("my_many_files_*.csv")
print(df)
shape: (15, 2)
┌─────┬──────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪══════╡
│ 1 ┆ null │
│ 2 ┆ ham │
│ 3 ┆ spam │
│ 1 ┆ null │
│ ... ┆ ... │
│ 3 ┆ spam │
│ 1 ┆ null │
│ 2 ┆ ham │
│ 3 ┆ spam │
└─────┴──────┘
To see how this works we can take a look at the query plan. Below we see that all files are read separately and
concatenated into a single DataFrame
. Polars
will try to parallelize the reading.
pl.scan_csv("my_many_files_*.csv").show_graph()
Reading and processing in parallel
If your files don't have to be in a single table you can also build a query plan for each file and execute them in parallel
on the Polars
thread pool.
All query plan execution is embarrassingly parallel and doesn't require any communication.
import polars as pl
import glob
queries = []
for file in glob.glob("my_many_files_*.csv"):
q = pl.scan_csv(file).groupby("bar").agg([pl.count(), pl.sum("foo")])
queries.append(q)
dataframes = pl.collect_all(queries)
print(dataframes)
[shape: (3, 3)
┌──────┬───────┬─────┐
│ bar ┆ count ┆ foo │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ i64 │
╞══════╪═══════╪═════╡
│ null ┆ 1 ┆ 1 │
│ ham ┆ 1 ┆ 2 │
│ spam ┆ 1 ┆ 3 │
└──────┴───────┴─────┘, shape: (3, 3)
┌──────┬───────┬─────┐
│ bar ┆ count ┆ foo │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ i64 │
╞══════╪═══════╪═════╡
│ spam ┆ 1 ┆ 3 │
│ null ┆ 1 ┆ 1 │
│ ham ┆ 1 ┆ 2 │
└──────┴───────┴─────┘, shape: (3, 3)
┌──────┬───────┬─────┐
│ bar ┆ count ┆ foo │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ i64 │
╞══════╪═══════╪═════╡
│ spam ┆ 1 ┆ 3 │
│ null ┆ 1 ┆ 1 │
│ ham ┆ 1 ┆ 2 │
└──────┴───────┴─────┘, shape: (3, 3)
┌──────┬───────┬─────┐
│ bar ┆ count ┆ foo │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ i64 │
╞══════╪═══════╪═════╡
│ null ┆ 1 ┆ 1 │
│ spam ┆ 1 ┆ 3 │
│ ham ┆ 1 ┆ 2 │
└──────┴───────┴─────┘, shape: (3, 3)
┌──────┬───────┬─────┐
│ bar ┆ count ┆ foo │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ i64 │
╞══════╪═══════╪═════╡
│ spam ┆ 1 ┆ 3 │
│ ham ┆ 1 ┆ 2 │
│ null ┆ 1 ┆ 1 │
└──────┴───────┴─────┘]
Read from MySQL, Postgres, Sqlite, Redshift, Clickhouse
To read from one of the supported databases connector-x
needs to be installed.
$ pip install connectorx>=0.2.0a3
import polars as pl
conn = "postgres://username:password@server:port/database"
query = "SELECT * FROM foo"
pl.read_sql(query, conn)
Interact with AWS
The Interact with AWS page is under construction.
To read from or write to an AWS bucket, additional dependencies are needed:
$ pip install s3fs
$ cargo add aws_sdk_s3 aws_config tokio --features tokio/full
In the next few snippets we'll demonstrate interacting with a Parquet
file
located on an AWS bucket.
Read
Load a .parquet
file using:
import polars as pl
import pyarrow.parquet as pq
import s3fs
fs = s3fs.S3FileSystem()
bucket = "<YOUR_BUCKET>"
path = "<YOUR_PATH>"
dataset = pq.ParquetDataset(f"s3://{bucket}/{path}", filesystem=fs)
df = pl.from_arrow(dataset.read())
use aws_sdk_s3::Region;
use aws_config::meta::region::RegionProviderChain;
use aws_sdk_s3::Client;
use std::borrow::Cow;
use polars::prelude::*;
#[tokio::main]
async fn main() {
let bucket = "<YOUR_BUCKET>";
let path = "<YOUR_PATH>";
let config = aws_config::from_env().load().await;
let client = Client::new(&config);
let req = client.get_object().bucket(bucket).key(path);
let res = req.clone().send().await.unwrap();
let bytes = res.body.collect().await.unwrap();
let bytes = bytes.into_bytes();
let cursor = std::io::Cursor::new(bytes);
let df = ParquetReader::new(cursor).finish().unwrap();
println!("{:?}", df);
}
Write
This content is under construction.
Interact with Google BigQuery
To read or write from GBQ, additional dependencies are needed:
$ pip install google-cloud-bigquery
Read
We can load a query into a DataFrame
like this:
import polars as pl
from google.cloud import bigquery
client = bigquery.Client()
# Perform a query.
QUERY = (
'SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` '
'WHERE state = "TX" '
'LIMIT 100')
query_job = client.query(QUERY) # API request
rows = query_job.result() # Waits for query to finish
df = pl.from_arrow(rows.to_arrow())
Write
This content is under construction.
Interact with Postgres
Read
To read from postgres, additional dependencies are needed:
$ pip install connectorx>=0.2.0a3
import polars as pl
conn = "postgresql://username:password@server:port/database"
query = "SELECT * FROM foo"
pl.read_sql(query, conn)
Write
To write to postgres, additional dependencies are needed:
$ pip install psycopg2-binary
For writing to a postgres database with psycopg2
, we utilize execute_batch
. This will limit round trips needed
to the server.
We first make sure that all our dtypes are in a format that psycopg2
recognizes, and then we use DataFrame.rows
to
easily transform the columnar data to rows that the database driver can work with.
from psycopg2 import sql
import psycopg2.extras
import polars as pl
# let's assume we have a DataFrame with some floats, integers, strings, and date64 columns.
df = pl.read_parquet("somefile.parquet")
# first we convert the polars date64 representation to python datetime objects
for col in df:
# only for date64
if col.dtype == pl.Date64:
df = df.with_column(col.dt.to_python_datetime())
# create sql identifiers for the column names
# we do this to safely insert this into a sql query
columns = sql.SQL(",").join(sql.Identifier(name) for name in df.columns)
# create placeholders for the values. These will be filled later
values = sql.SQL(",").join([sql.Placeholder() for _ in df.columns])
table_id = "mytable"
# prepare the insert query
insert_stmt = sql.SQL("INSERT INTO {} ({}) VALUES({});").format(
sql.Identifier(table_id), columns, values
)
# make a connection
conn = psycopg2.connect()
cur = conn.cursor()
# do the insert
psycopg2.extras.execute_batch(cur, insert_stmt, df.rows())
conn.commit()
Interoperability
Arrow
Arrow
is rapidly becoming the de facto standard for columnar data. This means that
support for Arrow
is growing rapidly (both languages and tools). Due to the amazing
effort behind the format, using Arrow
is now likely the fastest way to:
- Read and write Parquet formatted files
- Read CSV into columnar data
- Exchange columnar data
Polars
uses an Arrow
memory buffer as the most basic building block for the Polars
Series
. This means that we exchange data between Polars
and Arrow
without
copying it. It also means that Polars
shares the same performance gains that Arrow
receives.
Convert a Polars
DataFrame
or Series
to Arrow
using the .to_arrow()
method. Similarly, importing from Arrow
data structures can be performed with the
.from_arrow()
function.
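A minimal sketch of the round trip (the frame below is made up):
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Export the whole DataFrame to a pyarrow.Table (zero-copy where possible)
table = df.to_arrow()

# ... and import it back into Polars
df2 = pl.from_arrow(table)

# The same works per column: Series <-> Arrow array
arr = df["a"].to_arrow()
s = pl.from_arrow(arr)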
NumPy
Polars
Series
have support for NumPy
universal functions (ufuncs).
Element-wise functions such as np.exp()
, np.cos()
, np.divide()
, etc. all work with
almost zero overhead.
However, as a Polars
-specific remark: missing values are stored in a separate bitmask and are not
visible to NumPy
. This can lead to a window function or a np.convolve()
giving
flawed or incomplete results.
Convert a Polars
Series
to a NumPy
array with the .to_numpy()
method.
Missing values will be replaced by np.nan
during the conversion. If the Series
does
not include missing values, or those values are not desired anymore, the .view()
method can be used instead, providing a zero-copy NumPy
array of the data.
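A small sketch of these three conversions, assuming a Series without missing values for the .view() call:
import numpy as np
import polars as pl

s = pl.Series("a", [1.0, 2.0, 3.0])

# NumPy ufuncs operate element-wise on the Series
out = np.exp(s)

# Copying conversion; missing values (if any) would become np.nan
arr = s.to_numpy()

# Zero-copy view, only safe when there are no missing values
arr_view = s.view()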
Selecting data
In this section we show how to select rows and/or columns from a DataFrame
. We can
select data with expressions or select data with square bracket indexing.
The Expression API is key to writing performant queries in Polars
. The simplest way to get started with the Expression API is to get familiar with the filter
and select
methods in this section.
Although they may give the same output, selecting data with expressions or square bracket indexing are not equivalent. The implementation of selecting data with expressions is completely different from the implementation of selecting data with square bracket indexing.
We strongly recommend selecting data with expressions for almost all use cases. Square bracket indexing is perhaps useful when doing exploratory data analysis in a terminal or notebook when you just want a quick look at a subset of data.
For all other use cases we recommend using expressions because:
- expressions can be parallelized
- the expression approach can be used in lazy and eager mode while the indexing approach can only be used in eager mode
- in lazy mode the query optimizer can optimize expressions
Selecting with expressions
In this page we show how to select rows and columns with the preferred expression approach. We cover:
- use of the Expression API via the filter and select methods to select data
- combining these expressions, and
- optimization of these expressions in lazy mode.
To select data with expressions we use:
- the filter method to select rows
- the select method to select columns
For simplicity we deal with DataFrame
examples throughout. The principles are the same for Series
objects except that columns obviously cannot be selected in a Series
. To illustrate the filter
and select
methods we define a simple DataFrame
:
df = pl.DataFrame(
{
"id": [1, 2, 3],
"color": ["blue", "red", "green"],
"size": ["small", "medium", "large"],
}
)
print(df)
shape: (3, 3)
┌─────┬───────┬────────┐
│ id ┆ color ┆ size │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═══════╪════════╡
│ 1 ┆ blue ┆ small │
│ 2 ┆ red ┆ medium │
│ 3 ┆ green ┆ large │
└─────┴───────┴────────┘
Selecting rows with the filter
method
We can select rows by using the filter
method. In the filter
method we pass the condition we are using to select the rows as an expression:
filter_df = df.filter(pl.col("id") <= 2)
print(filter_df)
shape: (2, 3)
┌─────┬───────┬────────┐
│ id ┆ color ┆ size │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═══════╪════════╡
│ 1 ┆ blue ┆ small │
│ 2 ┆ red ┆ medium │
└─────┴───────┴────────┘
We can specify multiple conditions in filter
using the &
operator:
multi_filter_df = df.filter((pl.col("id") <= 2) & (pl.col("size") == "small"))
print(multi_filter_df)
shape: (1, 3)
┌─────┬───────┬───────┐
│ id ┆ color ┆ size │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═══════╪═══════╡
│ 1 ┆ blue ┆ small │
└─────┴───────┴───────┘
Selecting columns with the select
method
We select columns using the select
method. In the select
method we can specify the columns with:
- a (string) column name
- a list of (string) column names
- a boolean list of the same length as the number of columns
- an expression such as a condition on the column name
- a
Series
Select a single column
single_select_df = df.select("id")
print(single_select_df)
shape: (3, 1)
┌─────┐
│ id │
│ --- │
│ i64 │
╞═════╡
│ 1 │
│ 2 │
│ 3 │
└─────┘
Select a list of columns
list_select_df = df.select(["id", "color"])
print(list_select_df)
shape: (3, 2)
┌─────┬───────┐
│ id ┆ color │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═══════╡
│ 1 ┆ blue │
│ 2 ┆ red │
│ 3 ┆ green │
└─────┴───────┘
Select columns with an expression
To select based on a condition on the column name:
condition_select_df = df.select(pl.col("^col.*$"))
print(condition_select_df)
shape: (3, 1)
┌───────┐
│ color │
│ --- │
│ str │
╞═══════╡
│ blue │
│ red │
│ green │
└───────┘
To select based on the dtype of the columns:
dtype_select_df = df.select(pl.col(pl.Int64))
print(dtype_select_df)
shape: (3, 1)
┌─────┐
│ id │
│ --- │
│ i64 │
╞═════╡
│ 1 │
│ 2 │
│ 3 │
└─────┘
Selecting rows and columns
We can combine the filter
and select
methods to select rows and columns
expression_df = df.filter(pl.col("id") <= 2).select(["id", "color"])
print(expression_df)
shape: (2, 2)
┌─────┬───────┐
│ id ┆ color │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═══════╡
│ 1 ┆ blue │
│ 2 ┆ red │
└─────┴───────┘
Query optimization
In lazy mode the query optimizer may be able to optimize the query based on the expressions.
In this example we scan a CSV file with many columns using scan_csv
and then select
a subset of them. The query optimizer creates a query plan that causes only the selected columns to be read from the CSV - see how the Project
part of the query plan below states that only 1 of 2 columns will be read:
lazy_select_df = pl.scan_csv("data/appleStock.csv").select(["Date"])
print(lazy_select_df.describe_optimized_plan())
FAST_PROJECT: [Date]
CSV SCAN data/appleStock.csv
PROJECT 1/2 COLUMNS
If you specify two separate filter conditions the query optimizer will combine them into a single joint condition (see the Selection
part of the query plan below):
lazy_filter_df = (
pl.scan_csv("data/appleStock.csv")
.filter(
pl.col("Date") == datetime(1995, 10, 16),
)
.filter(pl.col("Close") > 100)
)
print(lazy_filter_df.describe_optimized_plan())
CSV SCAN data/appleStock.csv
PROJECT */2 COLUMNS
SELECTION: [([(col("Close")) > (100f64)]) & ([(col("Date")) == (813801600000000i64.strict_cast(Datetime(Microseconds, None)).cast(Utf8))])]
Selecting with indexing
In this page we cover use of square bracket indexing to select data. Square bracket indexing can be used to select rows and/or columns.
Indexing has a limited use case in Polars
There are some use cases in Polars where square bracket indexing is effective. However, there are many use cases where indexing prevents you from using the full power of Polars.
Use cases where indexing is effective:
- to extract a scalar value from a
DataFrame
- to convert a
DataFrame
column to a Series
- for exploratory data analysis and to inspect some rows and/or columns
The first downside of indexing with square brackets is that it only works in eager mode. Any step in your query that involves square bracket indexing cannot be included in a lazy query, which means it cannot be optimised as part of a lazy query and cannot be part of a streaming query that processes larger-than-memory data in batches.
The second downside of indexing with square brackets is that operations on multiple columns are not parallelised.
Outside of the use cases noted above, Polars strongly favours the expression API with select
and filter
over square bracket indexing.
Rules for square bracket indexing
The rules for square bracket indexing are as follows (depending on the datatypes of the values):
- numeric
  - axis 0: row
  - axis 1: column
- numeric + strings
  - axis 0: row (only accepts numbers here)
  - axis 1: column (accepts numeric + string values)
- only strings
  - axis 0: column
  - axis 1: error
- expressions (all expression evaluations are executed in parallel)
  - axis 0: column
  - axis 1: column
  - ...
  - axis n: column
Comparison with pandas
| pandas | polars |
|---|---|
| select row df.iloc[2] | df[2, :] |
| select several rows by their indices df.iloc[[2, 5, 6]] | df[[2, 5, 6], :] |
| select slice of rows df.iloc[2:6] | df[2:6, :] |
| select rows using a boolean mask df.iloc[True, True, False] | df.filter([True, True, False]) |
| select rows by a predicate condition df.loc[df["A"] > 3] | df[df["A"] > 3] |
| select slice of columns df.iloc[:, 1:3] | df[:, 1:3] |
| select slice of columns by string order df.loc[:, "A":"Z"] | df[:, "A":"Z"] |
| select a single value (scalar) df.loc[2, "A"] | df[2, "A"] |
| select a single value (scalar) df.iloc[2, 1] | df[2, 1] |
| select a single value (Series/DataFrame) df.loc[2, ["A"]] | df[2, ["A"]] |
| select a single value (Series/DataFrame) df.iloc[2, [1]] | df[2, [1]] |
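The effective use cases listed above look like this in practice (a minimal sketch, using a made-up frame):
import polars as pl

df = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "color": ["blue", "red", "green"],
    }
)

# Extract a scalar value
value = df[1, "color"]   # "red"

# Convert a DataFrame column to a Series
s = df["color"]

# Quick look at a subset of rows while exploring
print(df[0:2, :])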
Data handling
Process strings
Thanks to its Arrow
backend, Polars
string operations are much faster compared to the
same operations performed with NumPy
or Pandas
. In the latter, strings are stored as
Python
objects. While traversing the np.array
or the pd.Series
the CPU needs to
follow all the string pointers, and jump to many random memory locations -- which
is very cache-inefficient. In Polars
(via the Arrow
data
structure) strings are contiguous in memory. Thus traversing is cache-optimal and
predictable for the CPU.
The string processing functions available in Polars
are available in the
str
namespace.
Below are a few examples. To compute string lengths:
import polars as pl
df = pl.DataFrame({"shakespeare": "All that glitters is not gold".split(" ")})
df = df.with_column(pl.col("shakespeare").str.lengths().alias("letter_count"))
returning:
shape: (6, 2)
┌─────────────┬──────────────┐
│ shakespeare ┆ letter_count │
│ --- ┆ --- │
│ str ┆ u32 │
╞═════════════╪══════════════╡
│ All ┆ 3 │
│ that ┆ 4 │
│ glitters ┆ 8 │
│ is ┆ 2 │
│ not ┆ 3 │
│ gold ┆ 4 │
└─────────────┴──────────────┘
And below is a regex pattern to filter out the articles (the, a) from a
sentence:
import polars as pl
df = pl.DataFrame({"a": "The man that ate a whole cake".split(" ")})
df = df.filter(pl.col("a").str.contains(r"(?i)^the$|^a$").is_not())
yielding:
shape: (5, 1)
┌───────┐
│ a │
│ --- │
│ str │
╞═══════╡
│ man │
│ that │
│ ate │
│ whole │
│ cake │
└───────┘
Timestamp parsing
Polars
offers 4
time datatypes:
- pl.Date, to be used for date objects: the number of days since the UNIX epoch as a 32 bit signed integer.
- pl.Datetime, to be used for datetime objects: the number of nanoseconds since the UNIX epoch as a 64 bit signed integer.
- pl.Time, encoded as the number of nanoseconds since midnight.
- pl.Duration, to be used for timedelta objects: the difference between Date, Datetime or Time as a 64 bit signed integer offering microsecond resolution.
Polars
string (pl.Utf8
) datatypes can be parsed as either of them. You can let
Polars
try to guess the format of the date[time], or explicitly provide a fmt
rule.
For instance (check this link for a comprehensive list):
- "%Y-%m-%d" for "2020-12-31"
- "%Y/%B/%d" for "2020/December/31"
- "%B %y" for "December 20"
Below is a quick example:
import polars as pl
dataset = pl.DataFrame({"date": ["2020-01-02", "2020-01-03", "2020-01-04"], "index": [1, 2, 3]})
q = dataset.lazy().with_column(pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"))
df = q.collect()
returning:
shape: (3, 2)
┌────────────┬───────┐
│ date ┆ index │
│ --- ┆ --- │
│ date ┆ i64 │
╞════════════╪═══════╡
│ 2020-01-02 ┆ 1 │
│ 2020-01-03 ┆ 2 │
│ 2020-01-04 ┆ 3 │
└────────────┴───────┘
All datetime functionality is shown in the dt
namespace.
Missing data
This page sets out how missing data is represented in Polars
and how missing data can be filled.
null
and NaN
values
Each column in a DataFrame
(or equivalently a Series
) is an Arrow array or a collection of Arrow arrays based on the Apache Arrow format. Missing data is represented in Arrow and Polars
with a null
value. This null
missing value applies for all data types including numerical values.
Polars
also allows NotaNumber
or NaN
values for float columns. These NaN
values are considered to be a type of floating point data rather than missing data. We discuss NaN
values separately below.
You can manually define a missing value with the python None
value:
df = pl.DataFrame(
{
"value": [1, None],
},
)
print(df)
shape: (2, 1)
┌───────┐
│ value │
│ --- │
│ i64 │
╞═══════╡
│ 1 │
│ null │
└───────┘
In Pandas the value for missing data depends on the dtype of the column. In Polars missing data is always represented as a null value.
Missing data metadata
Each Arrow array used by Polars
stores two kinds of metadata related to missing data. This metadata allows Polars
to quickly show how many missing values there are and which values are missing.
The first piece of metadata is the null_count
- this is the number of rows with null
values in the column:
null_count_df = df.null_count()
print(null_count_df)
shape: (1, 1)
┌───────┐
│ value │
│ --- │
│ u32 │
╞═══════╡
│ 1 │
└───────┘
The null_count
method can be called on a DataFrame
, a column from a DataFrame
or a Series
. The null_count
method is a cheap operation as null_count
is already calculated for the underlying Arrow array.
The second piece of metadata is an array called a validity bitmap that indicates whether each data value is valid or missing.
The validity bitmap is memory efficient as it is bit encoded - each value is either a 0 or a 1. This bit encoding means the memory overhead per array is only (array length / 8) bytes. The validity bitmap is used by the is_null
method in Polars
.
You can return a Series
based on the validity bitmap for a column in a DataFrame
or a Series
with the is_null
method:
is_null_series = df.select(
pl.col("value").is_null(),
)
print(is_null_series)
shape: (2, 1)
┌───────┐
│ value │
│ --- │
│ bool │
╞═══════╡
│ false │
│ true │
└───────┘
The is_null
method is a cheap operation that does not require scanning the full column for null
values. This is because the validity bitmap already exists and can be returned as a Boolean array.
Filling missing data
Missing data in a Series
can be filled with the fill_null
method. You have to specify how you want the fill_null
method to fill the missing data. The main ways to do this are filling with:
- a literal such as 0 or "0"
- a strategy such as filling forwards
- an expression such as replacing with values from another column
- interpolation
We illustrate each way to fill nulls by defining a simple DataFrame
with a missing value in col2
:
df = pl.DataFrame(
{
"col1": [1, 2, 3],
"col2": [1, None, 3],
},
)
print(df)
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 1 ┆ 1 │
│ 2 ┆ null │
│ 3 ┆ 3 │
└──────┴──────┘
Fill with specified literal value
We can fill the missing data with a specified literal value with pl.lit
:
fill_literal_df = df.with_column(
pl.col("col2").fill_null(
pl.lit(2),
),
)
print(fill_literal_df)
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 1 ┆ 1 │
│ 2 ┆ 2 │
│ 3 ┆ 3 │
└──────┴──────┘
Fill with a strategy
We can fill the missing data with a strategy such as filling forward:
fill_forward_df = df.with_column(
pl.col("col2").fill_null(strategy="forward"),
)
print(fill_forward_df)
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 1 ┆ 1 │
│ 2 ┆ 1 │
│ 3 ┆ 3 │
└──────┴──────┘
See the API docs for other available strategies.
Fill with an expression
For more flexibility we can fill the missing data with an expression. For example, to fill nulls with the median value from that column:
fill_median_df = df.with_column(
pl.col("col2").fill_null(pl.median("col2")),
)
print(fill_median_df)
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞══════╪══════╡
│ 1 ┆ 1.0 │
│ 2 ┆ 2.0 │
│ 3 ┆ 3.0 │
└──────┴──────┘
In this case the column is cast from integer to float because the median is a float statistic.
Fill with interpolation
In addition, we can fill nulls with interpolation (without using the fill_null
function):
fill_interpolation_df = df.with_column(
pl.col("col2").interpolate(),
)
print(fill_interpolation_df)
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 1 ┆ 1 │
│ 2 ┆ 2 │
│ 3 ┆ 3 │
└──────┴──────┘
NotaNumber
or NaN
values
Missing data in a Series
has a null
value. However, you can use NotaNumber
or NaN
values in columns with float datatypes. These NaN
values can be created from Numpy's np.nan
or the native python float('nan')
:
nan_df = pl.DataFrame(
{
"value": [1.0, np.NaN, float("nan"), 3.0],
},
)
print(nan_df)
shape: (4, 1)
┌───────┐
│ value │
│ --- │
│ f64 │
╞═══════╡
│ 1.0 │
│ NaN │
│ NaN │
│ 3.0 │
└───────┘
In Pandas by default a NaN value in an integer column causes the column to be cast to float. This does not happen in Polars - instead an exception is raised.
NaN
values are considered to be a type of floating point data and are not considered to be missing data in Polars
. This means:
- NaN values are not counted with the null_count method
- NaN values are filled when you use the fill_nan method but are not filled with the fill_null method
Polars
has is_nan
and fill_nan
methods which work in a similar way to the is_null
and fill_null
methods. The underlying Arrow arrays do not have a pre-computed validity bitmask for NaN
values so this has to be computed for the is_nan
method.
One further difference between null
and NaN
values is that taking the mean
of a column with null
values excludes the null
values from the calculation but with NaN
values taking the mean results in a NaN
. This behaviour can be avoided by replacing the NaN
values with null
values:
mean_nan_df = nan_df.with_column(
pl.col("value").fill_nan(None).alias("value"),
).mean()
print(mean_nan_df)
shape: (1, 1)
┌───────┐
│ value │
│ --- │
│ f64 │
╞═══════╡
│ 2.0 │
└───────┘
Time Series
Polars has native support for parsing time series data and doing more sophisticated operations such as temporal grouping and resampling.
Parsing dates and times
Datatypes
Polars
has the following datetime datatypes:
- Date: Date representation e.g. 2014-07-08. It is internally represented as days since the UNIX epoch encoded by a 32-bit signed integer.
- Datetime: Datetime representation e.g. 2014-07-08 07:00:00. It is internally represented as a 64-bit integer since the Unix epoch and can have different units such as ns, us, ms.
- Duration: A time delta type that is created when subtracting Date/Datetime. Similar to timedelta in python.
- Time: Time representation, internally represented as nanoseconds since midnight.
Parsing dates from a file
When loading from a CSV file Polars
attempts to parse dates and times if the parse_dates
flag is set to True
:
df = pl.read_csv("data/appleStock.csv", parse_dates=True)
print(df)
shape: (100, 2)
┌────────────┬────────┐
│ Date ┆ Close │
│ --- ┆ --- │
│ date ┆ f64 │
╞════════════╪════════╡
│ 1981-02-23 ┆ 24.62 │
│ 1981-05-06 ┆ 27.38 │
│ 1981-05-18 ┆ 28.0 │
│ 1981-09-25 ┆ 14.25 │
│ ... ┆ ... │
│ 2012-12-04 ┆ 575.85 │
│ 2013-07-05 ┆ 417.42 │
│ 2013-11-07 ┆ 512.49 │
│ 2014-02-25 ┆ 522.06 │
└────────────┴────────┘
On the other hand binary formats such as parquet have a schema that is respected by Polars
.
Casting strings to dates
You can also cast a column of datetimes encoded as strings to a datetime type. You do this by calling the str.strptime
method on the string column and passing the format of the date string:
df = pl.read_csv("data/appleStock.csv", parse_dates=False)
df = df.with_column(pl.col("Date").str.strptime(pl.Date, fmt="%Y-%m-%d"))
print(df)
shape: (100, 2)
┌────────────┬────────┐
│ Date ┆ Close │
│ --- ┆ --- │
│ date ┆ f64 │
╞════════════╪════════╡
│ 1981-02-23 ┆ 24.62 │
│ 1981-05-06 ┆ 27.38 │
│ 1981-05-18 ┆ 28.0 │
│ 1981-09-25 ┆ 14.25 │
│ ... ┆ ... │
│ 2012-12-04 ┆ 575.85 │
│ 2013-07-05 ┆ 417.42 │
│ 2013-11-07 ┆ 512.49 │
│ 2014-02-25 ┆ 522.06 │
└────────────┴────────┘
The strptime date formats can be found here.
Extracting date features from a date column
You can extract data features such as the year or day from a date column using the .dt
namespace on a date column:
df_with_year = df.with_column(pl.col("Date").dt.year().alias("year"))
print(df_with_year)
shape: (100, 3)
┌────────────┬────────┬──────┐
│ Date ┆ Close ┆ year │
│ --- ┆ --- ┆ --- │
│ date ┆ f64 ┆ i32 │
╞════════════╪════════╪══════╡
│ 1981-02-23 ┆ 24.62 ┆ 1981 │
│ 1981-05-06 ┆ 27.38 ┆ 1981 │
│ 1981-05-18 ┆ 28.0 ┆ 1981 │
│ 1981-09-25 ┆ 14.25 ┆ 1981 │
│ ... ┆ ... ┆ ... │
│ 2012-12-04 ┆ 575.85 ┆ 2012 │
│ 2013-07-05 ┆ 417.42 ┆ 2013 │
│ 2013-11-07 ┆ 512.49 ┆ 2013 │
│ 2014-02-25 ┆ 522.06 ┆ 2014 │
└────────────┴────────┴──────┘
See the API docs for more date feature options.
Filtering date columns
Filtering date columns works in the same way as with other types of columns using the .filter
method.
Polars uses Python's native datetime
, date
and timedelta
for equality comparisons between the datatypes
pl.Datetime
, pl.Date
and pl.Duration
.
In the following example we use a time series of Apple stock prices.
import polars as pl
from datetime import datetime
df = pl.read_csv("data/appleStock.csv", parse_dates=True)
print(df)
shape: (100, 2)
┌────────────┬────────┐
│ Date ┆ Close │
│ --- ┆ --- │
│ date ┆ f64 │
╞════════════╪════════╡
│ 1981-02-23 ┆ 24.62 │
│ 1981-05-06 ┆ 27.38 │
│ 1981-05-18 ┆ 28.0 │
│ 1981-09-25 ┆ 14.25 │
│ ... ┆ ... │
│ 2012-12-04 ┆ 575.85 │
│ 2013-07-05 ┆ 417.42 │
│ 2013-11-07 ┆ 512.49 │
│ 2014-02-25 ┆ 522.06 │
└────────────┴────────┘
Filtering by single dates
We can filter by a single date by casting the desired date string to a Date
object
in a filter expression:
filtered_df = df.filter(
pl.col("Date") == datetime(1995, 10, 16),
)
shape: (1, 2)
┌────────────┬───────┐
│ Date ┆ Close │
│ --- ┆ --- │
│ date ┆ f64 │
╞════════════╪═══════╡
│ 1995-10-16 ┆ 36.13 │
└────────────┴───────┘
Note we are using the lowercase datetime
method rather than the uppercase Datetime
data type.
Filtering by a date range
We can filter by a range of dates using the is_between
method in a filter expression with the start and end dates:
filtered_range_df = df.filter(
pl.col("Date").is_between(datetime(1995, 7, 1), datetime(1995, 11, 1)),
)
shape: (2, 2)
┌────────────┬───────┐
│ Date ┆ Close │
│ --- ┆ --- │
│ date ┆ f64 │
╞════════════╪═══════╡
│ 1995-07-06 ┆ 47.0 │
│ 1995-10-16 ┆ 36.13 │
└────────────┴───────┘
Filtering with negative dates
Say you are working with an archeologist and are dealing in negative dates.
Polars can parse and store them just fine, but the Python datetime
library
does not. So for filtering, you should use attributes in the .dt
namespace:
ts = pl.Series(["-1300-05-23", "-1400-03-02"]).str.strptime(pl.Date)
negative_dates_df = pl.DataFrame({"ts": ts, "values": [3, 4]})
negative_dates_filtered_df = negative_dates_df.filter(pl.col("ts").dt.year() < -1300)
shape: (1, 2)
┌─────────────┬────────┐
│ ts ┆ values │
│ --- ┆ --- │
│ date ┆ i64 │
╞═════════════╪════════╡
│ -1400-03-02 ┆ 4 │
└─────────────┴────────┘
Fixed and rolling temporal groupby
Grouping by fixed windows with groupby_dynamic
We can calculate temporal statistics using groupby_dynamic
to group rows into days/months/years etc.
Annual average example
In the following simple example we calculate the annual average closing price of Apple stock. We first load the data from CSV:
df = pl.read_csv("data/appleStock.csv", parse_dates=True)
shape: (100, 2)
┌────────────┬────────┐
│ Date ┆ Close │
│ --- ┆ --- │
│ date ┆ f64 │
╞════════════╪════════╡
│ 1981-02-23 ┆ 24.62 │
│ 1981-05-06 ┆ 27.38 │
│ 1981-05-18 ┆ 28.0 │
│ 1981-09-25 ┆ 14.25 │
│ ... ┆ ... │
│ 2012-12-04 ┆ 575.85 │
│ 2013-07-05 ┆ 417.42 │
│ 2013-11-07 ┆ 512.49 │
│ 2014-02-25 ┆ 522.06 │
└────────────┴────────┘
The dates are sorted in ascending order - if they are not sorted in this way the
groupby_dynamic
output will not be correct!
To get the annual average closing price we tell groupby_dynamic
that we want to:
- group by the Date column on an annual (1y) basis
- take the mean values of the Close column for each year:
annual_average_df = df.groupby_dynamic("Date", every="1y").agg(pl.col("Close").mean())
df_with_year = df.with_column(pl.col("Date").dt.year().alias("year"))
The annual average closing price is then:
shape: (34, 2)
┌────────────┬───────────┐
│ Date ┆ Close │
│ --- ┆ --- │
│ date ┆ f64 │
╞════════════╪═══════════╡
│ 1981-01-01 ┆ 23.5625 │
│ 1982-01-01 ┆ 11.0 │
│ 1983-01-01 ┆ 30.543333 │
│ 1984-01-01 ┆ 27.583333 │
│ ... ┆ ... │
│ 2011-01-01 ┆ 368.225 │
│ 2012-01-01 ┆ 560.965 │
│ 2013-01-01 ┆ 464.955 │
│ 2014-01-01 ┆ 522.06 │
└────────────┴───────────┘
Parameters for groupby_dynamic
A dynamic window is defined by a:
- every: indicates the interval of the window
- period: indicates the duration of the window
- offset: can be used to offset the start of the windows
every
, period
and offset
parameters to control temporal window size
The value for every
sets how often the groups start. The time period values are flexible - for example we could take:
- the average over 2 year intervals by replacing 1y with 2y
- the average over 18 month periods by replacing 1y with 1y6mo
We can also use the period
parameter to set how long the time period for each group is. For example, if we set the every
parameter to be 1y
and the period
parameter to be 2y
then we would get groups at one year intervals where each group spans two years.
If the period
parameter is not specified then it is set equal to the every
parameter so that if the every
parameter is set to be 1y
then each group spans 1y
as well.
Because every does not have to be equal to period, we can create many groups in a very flexible way. They may overlap or leave boundaries between them.
Let's see how the windows for some parameter combinations would look. Let's start out boring. 🥱
- every: 1 day ->
"1d"
- period: 1 day ->
"1d"
this creates adjacent windows of the same size
|--|
|--|
|--|
- every: 1 day ->
"1d"
- period: 2 days ->
"2d"
these windows have an overlap of 1 day
|----|
|----|
|----|
- every: 2 days ->
"2d"
- period: 1 day ->
"1d"
this would leave gaps between the windows
data points that fall in these gaps will not be a member of any group
|--|
|--|
|--|
See the API pages for the full range of time periods.
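As a hedged sketch of overlapping windows, reusing the Apple stock DataFrame from the annual average example above: groups start every year (every="1y") but each one spans two years (period="2y"), so consecutive windows overlap by one year.
overlapping_df = df.groupby_dynamic("Date", every="1y", period="2y").agg(
    pl.col("Close").mean().alias("two_year_mean")
)
print(overlapping_df)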
truncate
parameter to set the start date for each group
The truncate
parameter is a Boolean variable that determines what datetime value is associated with each group in the output. In the example above the first data point is on 23rd February 1981. If truncate = True
(the default) then the date for the first year in the annual average is 1st January 1981. However, if truncate = False
then the date for the first year in the annual average is the date of the first data point on 23rd February 1981.
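A minimal sketch of the difference, again on the Apple stock DataFrame: with truncate=False the label of the first group should be the first observed date (1981-02-23) rather than the window start (1981-01-01).
not_truncated_df = df.groupby_dynamic("Date", every="1y", truncate=False).agg(
    pl.col("Close").mean()
)
print(not_truncated_df)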
Using expressions in groupby_dynamic
We aren't restricted to using simple aggregations like mean
in a groupby operation - we can use the full range of expressions available in Polars.
In the snippet below we create a date range
with every day ("1d"
) in 2021 and turn this into a DataFrame
.
Then in the groupby_dynamic
we create dynamic windows that start every month ("1mo"
) and have a window length of 1
month. The values that match these dynamic windows are then assigned to that group and can be aggregated with the powerful expression API.
Below we show an example where we use groupby_dynamic to compute:
- the number of days until the end of the month
- the number of days in a month
df = pl.date_range(low=datetime(2021, 1, 1), high=datetime(2021, 12, 31), interval="1d", name="time").to_frame()
out = (
df.groupby_dynamic("time", every="1mo", period="1mo", closed="left")
.agg(
[
pl.col("time").cumcount().reverse().head(3).alias("day/eom"),
((pl.col("time") - pl.col("time").first()).last().dt.days() + 1).alias("days_in_month"),
]
)
.explode("day/eom")
)
print(out)
shape: (36, 3)
┌─────────────────────┬─────────┬───────────────┐
│ time ┆ day/eom ┆ days_in_month │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ u32 ┆ i64 │
╞═════════════════════╪═════════╪═══════════════╡
│ 2021-01-01 00:00:00 ┆ 30 ┆ 31 │
│ 2021-01-01 00:00:00 ┆ 29 ┆ 31 │
│ 2021-01-01 00:00:00 ┆ 28 ┆ 31 │
│ 2021-02-01 00:00:00 ┆ 27 ┆ 28 │
│ ... ┆ ... ┆ ... │
│ 2021-11-01 00:00:00 ┆ 27 ┆ 30 │
│ 2021-12-01 00:00:00 ┆ 30 ┆ 31 │
│ 2021-12-01 00:00:00 ┆ 29 ┆ 31 │
│ 2021-12-01 00:00:00 ┆ 28 ┆ 31 │
└─────────────────────┴─────────┴───────────────┘
Grouping by rolling windows with groupby_rolling
The rolling groupby is another entrance to the groupby
context. But unlike groupby_dynamic
, the windows are
not fixed by the every
and period
parameters. In a rolling groupby the windows are not fixed at all! They are determined
by the values in the index_column
.
So imagine having a time column with the values {2021-01-06, 2021-01-10}
and a period="5d"
; this would create the following
windows:
2021-01-01 2021-01-06
|----------|
2021-01-05 2021-01-10
|----------|
Because the windows of a rolling groupby are always determined by the values in the DataFrame
column, the number of
groups is always equal to the number of rows in the original DataFrame
.
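A minimal sketch of a rolling groupby on the two timestamps mentioned above (the values column is made up): each row gets a window that looks back period="5d" from its own timestamp.
import polars as pl
from datetime import datetime

rolling_df = pl.DataFrame(
    {
        "time": [datetime(2021, 1, 6), datetime(2021, 1, 10)],
        "values": [1, 2],
    }
)
out = rolling_df.groupby_rolling(index_column="time", period="5d").agg(
    pl.col("values").sum().alias("sum_values")
)
print(out)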
Combining Groupby and Dynamic / Rolling
Rolling and dynamic groupbys can be combined with normal groupby operations.
Below is an example with a dynamic groupby.
from datetime import datetime
import polars as pl
df = pl.DataFrame(
{
"time": pl.date_range(
low=datetime(2021, 12, 16),
high=datetime(2021, 12, 16, 3),
interval="30m",
),
"groups": ["a", "a", "a", "b", "b", "a", "a"],
}
)
print(df)
shape: (7, 2)
┌─────────────────────┬────────┐
│ time ┆ groups │
│ --- ┆ --- │
│ datetime[μs] ┆ str │
╞═════════════════════╪════════╡
│ 2021-12-16 00:00:00 ┆ a │
│ 2021-12-16 00:30:00 ┆ a │
│ 2021-12-16 01:00:00 ┆ a │
│ 2021-12-16 01:30:00 ┆ b │
│ 2021-12-16 02:00:00 ┆ b │
│ 2021-12-16 02:30:00 ┆ a │
│ 2021-12-16 03:00:00 ┆ a │
└─────────────────────┴────────┘
out = df.groupby_dynamic("time", every="1h", closed="both", by="groups", include_boundaries=True,).agg(
[
pl.count(),
]
)
print(out)
shape: (7, 5)
┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────┐
│ groups ┆ _lower_boundary ┆ _upper_boundary ┆ time ┆ count │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ u32 │
╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪═══════╡
│ a ┆ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 1 │
│ a ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 3 │
│ a ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 1 │
│ a ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2 │
│ a ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ 1 │
│ b ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2 │
│ b ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 1 │
└────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────┘
Resampling
We can resample by either:
- upsampling (moving data to a higher frequency)
- downsampling (moving data to a lower frequency)
- combinations of these e.g. first upsample and then downsample
Downsampling to a lower frequency
Polars
views downsampling as a special case of the groupby operation and you can do this with groupby_dynamic
and groupby_rolling
- see the temporal groupby page for examples.
Upsampling to a higher frequency
Let's go through an example where we generate data at 30 minute intervals:
from datetime import datetime
import polars as pl
df = pl.DataFrame(
{
"time": pl.date_range(low=datetime(2021, 12, 16), high=datetime(2021, 12, 16, 3), interval="30m"),
"groups": ["a", "a", "a", "b", "b", "a", "a"],
"values": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
}
)
print(df)
shape: (7, 3)
┌─────────────────────┬────────┬────────┐
│ time ┆ groups ┆ values │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ f64 │
╞═════════════════════╪════════╪════════╡
│ 2021-12-16 00:00:00 ┆ a ┆ 1.0 │
│ 2021-12-16 00:30:00 ┆ a ┆ 2.0 │
│ 2021-12-16 01:00:00 ┆ a ┆ 3.0 │
│ 2021-12-16 01:30:00 ┆ b ┆ 4.0 │
│ 2021-12-16 02:00:00 ┆ b ┆ 5.0 │
│ 2021-12-16 02:30:00 ┆ a ┆ 6.0 │
│ 2021-12-16 03:00:00 ┆ a ┆ 7.0 │
└─────────────────────┴────────┴────────┘
Upsampling can be done by defining the new sampling interval. By upsampling we are adding in extra rows where we do not have data. As such upsampling by itself gives a DataFrame with nulls. These nulls can then be filled with a fill strategy or interpolation.
Upsampling strategies
In this example we upsample from the original 30 minutes to 15 minutes and then use a forward
strategy to replace the nulls with the previous non-null value:
out1 = df.upsample(time_column="time", every="15m").fill_null("forward")
print(out1)
shape: (13, 3)
┌─────────────────────┬─────────┬────────┐
│ time ┆ groups ┆ values │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ f64 │
╞═════════════════════╪═════════╪════════╡
│ 2021-12-16 00:00:00 ┆ a ┆ 1.0 │
│ 2021-12-16 00:15:00 ┆ forward ┆ null │
│ 2021-12-16 00:30:00 ┆ a ┆ 2.0 │
│ 2021-12-16 00:45:00 ┆ forward ┆ null │
│ ... ┆ ... ┆ ... │
│ 2021-12-16 02:15:00 ┆ forward ┆ null │
│ 2021-12-16 02:30:00 ┆ a ┆ 6.0 │
│ 2021-12-16 02:45:00 ┆ forward ┆ null │
│ 2021-12-16 03:00:00 ┆ a ┆ 7.0 │
└─────────────────────┴─────────┴────────┘
In this example we instead fill the nulls by linear interpolation:
out2 = df.upsample(time_column="time", every="15m").interpolate().fill_null("forward")
print(out2)
shape: (13, 3)
┌─────────────────────┬─────────┬────────┐
│ time ┆ groups ┆ values │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ f64 │
╞═════════════════════╪═════════╪════════╡
│ 2021-12-16 00:00:00 ┆ a ┆ 1.0 │
│ 2021-12-16 00:15:00 ┆ forward ┆ 1.5 │
│ 2021-12-16 00:30:00 ┆ a ┆ 2.0 │
│ 2021-12-16 00:45:00 ┆ forward ┆ 2.5 │
│ ... ┆ ... ┆ ... │
│ 2021-12-16 02:15:00 ┆ forward ┆ 5.5 │
│ 2021-12-16 02:30:00 ┆ a ┆ 6.0 │
│ 2021-12-16 02:45:00 ┆ forward ┆ 6.5 │
│ 2021-12-16 03:00:00 ┆ a ┆ 7.0 │
└─────────────────────┴─────────┴────────┘
Combining data with concat and join
You can combine data from different DataFrames using:
- the concat function, or
- the join method on a DataFrame
Concatenation
There are a number of ways to concatenate data from separate DataFrames:
- two dataframes with the same columns can be vertically concatenated to make a longer dataframe
- two dataframes with the same number of rows and non-overlapping columns can be horizontally concatenated to make a wider dataframe
- two dataframes with different numbers of rows and columns can be diagonally concatenated to make a dataframe which might be longer and/or wider. Where column names overlap, values will be vertically concatenated. Where column names do not overlap, new rows and columns will be added. Missing values will be set as null
Vertical concatenation - getting longer
In a vertical concatenation you combine all of the rows from a list of DataFrames
into a single longer DataFrame
.
df_v1 = pl.DataFrame(
{
"a": [1],
"b": [3],
}
)
df_v2 = pl.DataFrame(
{
"a": [2],
"b": [4],
}
)
df_vertical_concat = pl.concat(
[
df_v1,
df_v2,
],
how="vertical",
)
print(df_vertical_concat)
shape: (2, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 3 │
│ 2 ┆ 4 │
└─────┴─────┘
Vertical concatenation fails when the dataframes do not have the same column names.
Horizontal concatenation - getting wider
In a horizontal concatenation you combine all of the columns from a list of DataFrames
into a single wider DataFrame
.
df_h1 = pl.DataFrame(
{
"l1": [1, 2],
"l2": [3, 4],
}
)
df_h2 = pl.DataFrame(
{
"r1": [5, 6],
"r2": [7, 8],
"r3": [9, 10],
}
)
df_horizontal_concat = pl.concat(
[
df_h1,
df_h2,
],
    how="horizontal",
)
print(df_horizontal_concat)
shape: (2, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ l1 ┆ l2 ┆ r1 ┆ r2 ┆ r3 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╡
│ 1 ┆ 3 ┆ 5 ┆ 7 ┆ 9 │
│ 2 ┆ 4 ┆ 6 ┆ 8 ┆ 10 │
└─────┴─────┴─────┴─────┴─────┘
Horizontal concatenation fails when dataframes have overlapping columns or a different number of rows.
Diagonal concatenation - getting longer, wider and nullier
In a diagonal concatenation you combine all of the row and columns from a list of DataFrames
into a single longer and/or wider DataFrame
.
df_d1 = pl.DataFrame(
{
"a": [1],
"b": [3],
}
)
df_d2 = pl.DataFrame(
{
"a": [2],
"d": [4],
}
)
df_diagonal_concat = pl.concat(
[
df_d1,
df_d2,
],
how="diagonal",
)
print(df_diagonal_concat)
shape: (2, 3)
┌─────┬──────┬──────┐
│ a ┆ b ┆ d │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪══════╪══════╡
│ 1 ┆ 3 ┆ null │
│ 2 ┆ null ┆ 4 │
└─────┴──────┴──────┘
Diagonal concatenation generates nulls when the column names do not overlap.
When the dataframe shapes do not match and we have an overlapping semantic key then we can join the dataframes instead of concatenating them.
Rechunking
Before a concatenation we have two dataframes df1 and df2. Each column in df1 and df2 is stored in one or more chunks in memory. By default, during concatenation the chunks in each column are copied to a single new chunk - this is known as rechunking. Rechunking is an expensive operation, but it is often worth it because future operations will be faster.
If you do not want Polars to rechunk the concatenated DataFrame, you can specify rechunk=False when doing the concatenation.
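For example, a minimal sketch reusing the df_v1 and df_v2 frames from the vertical concatenation example above:
df_no_rechunk = pl.concat(
    [
        df_v1,
        df_v2,
    ],
    how="vertical",
    rechunk=False,
)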
Joins
Join strategies
Polars
supports the following join strategies, selected with the how argument:
- inner
- left
- outer
- cross
- asof
- semi
- anti
The inner
, left
, outer
and cross
join strategies are standard amongst dataframe libraries. We provide more details on the less familiar semi
, anti
and asof
join strategies below.
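As a quick sketch of one of the standard strategies (hypothetical frames df_left and df_right that share a column "key"):
df_left.join(df_right, on="key", how="left")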
Semi join
Consider the following scenario: a car rental company has a DataFrame
showing the cars that it owns with each car having a unique id
.
df_cars = pl.DataFrame(
{
"id": ["a", "b", "c"],
"make": ["ford", "toyota", "bmw"],
}
)
print(df_cars)
shape: (3, 2)
┌─────┬────────┐
│ id ┆ make │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪════════╡
│ a ┆ ford │
│ b ┆ toyota │
│ c ┆ bmw │
└─────┴────────┘
The company has another DataFrame
showing each repair job carried out on a vehicle.
df_repairs = pl.DataFrame(
{
"id": ["c", "c"],
"cost": [100, 200],
}
)
print(df_repairs)
shape: (2, 2)
┌─────┬──────┐
│ id ┆ cost │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪══════╡
│ c ┆ 100 │
│ c ┆ 200 │
└─────┴──────┘
You want to answer this question: which of the cars have had repairs carried out?
An inner join does not answer this question directly as it produces a DataFrame
with multiple rows for each car that has had multiple repair jobs:
df_inner_join = df_cars.join(df_repairs, on="id", how="inner")
print(df_inner_join)
shape: (2, 3)
┌─────┬──────┬──────┐
│ id ┆ make ┆ cost │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═════╪══════╪══════╡
│ c ┆ bmw ┆ 100 │
│ c ┆ bmw ┆ 200 │
└─────┴──────┴──────┘
However, a semi join produces a single row for each car that has had a repair job carried out.
df_semi_join = df_cars.join(df_repairs, on="id", how="semi")
print(df_semi_join)
shape: (1, 2)
┌─────┬──────┐
│ id ┆ make │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪══════╡
│ c ┆ bmw │
└─────┴──────┘
Anti join
Continuing this example, an alternative question might be: which of the cars have not had a repair job carried out? An anti join produces a DataFrame
showing all the cars from df_cars
where the id
is not present in the df_repairs
DataFrame
.
df_anti_join = df_cars.join(df_repairs, on="id", how="anti")
print(df_anti_join)
shape: (2, 2)
┌─────┬────────┐
│ id ┆ make │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪════════╡
│ a ┆ ford │
│ b ┆ toyota │
└─────┴────────┘
Asof join
An asof
join is like a left join except that we match on nearest key rather than equal keys.
In Polars
we can do an asof join with the join
method and specifying how="asof"
. However, for more flexibility we can use the join_asof
method.
Consider the following scenario: a stock market broker has a DataFrame
called df_trades
showing transactions it has made for different stocks.
df_trades = pl.DataFrame(
{
"time": [
datetime(2020, 1, 1, 9, 1, 0),
datetime(2020, 1, 1, 9, 1, 0),
datetime(2020, 1, 1, 9, 3, 0),
datetime(2020, 1, 1, 9, 6, 0),
],
"stock": ["A", "B", "B", "C"],
"trade": [101, 299, 301, 500],
}
)
print(df_trades)
shape: (4, 3)
┌─────────────────────┬───────┬───────┐
│ time ┆ stock ┆ trade │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 │
╞═════════════════════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A ┆ 101 │
│ 2020-01-01 09:01:00 ┆ B ┆ 299 │
│ 2020-01-01 09:03:00 ┆ B ┆ 301 │
│ 2020-01-01 09:06:00 ┆ C ┆ 500 │
└─────────────────────┴───────┴───────┘
The broker has another DataFrame
called df_quotes
showing prices it has quoted for these stocks.
df_quotes = pl.DataFrame(
{
"time": [
datetime(2020, 1, 1, 9, 0, 0),
datetime(2020, 1, 1, 9, 2, 0),
datetime(2020, 1, 1, 9, 4, 0),
datetime(2020, 1, 1, 9, 6, 0),
],
"stock": ["A", "B", "C", "A"],
"quote": [100, 300, 501, 102],
}
)
print(df_quotes)
shape: (4, 3)
┌─────────────────────┬───────┬───────┐
│ time ┆ stock ┆ quote │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 │
╞═════════════════════╪═══════╪═══════╡
│ 2020-01-01 09:00:00 ┆ A ┆ 100 │
│ 2020-01-01 09:02:00 ┆ B ┆ 300 │
│ 2020-01-01 09:04:00 ┆ C ┆ 501 │
│ 2020-01-01 09:06:00 ┆ A ┆ 102 │
└─────────────────────┴───────┴───────┘
You want to produce a DataFrame
showing for each trade the most recent quote provided before the trade. You do this with join_asof
(using the default strategy = "backward"
).
To avoid joining between trades on one stock with a quote on another you must specify an exact preliminary join on the stock column with by="stock"
.
df_asof_join = df_trades.join_asof(df_quotes, on="time", by="stock")
print(df_asof_join)
shape: (4, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ time ┆ stock ┆ trade ┆ quote │
│ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 ┆ i64 │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A ┆ 101 ┆ 100 │
│ 2020-01-01 09:01:00 ┆ B ┆ 299 ┆ null │
│ 2020-01-01 09:03:00 ┆ B ┆ 301 ┆ 300 │
│ 2020-01-01 09:06:00 ┆ C ┆ 500 ┆ 501 │
└─────────────────────┴───────┴───────┴───────┘
If you want to make sure that only quotes within a certain time range are joined to the trades you can specify the tolerance
argument. In this case we want to make sure that the last preceding quote is within 1 minute of the trade so we set tolerance = "1m"
.
df_asof_tolerance_join = df_trades.join_asof(df_quotes, on="time", by="stock", tolerance="1m")
print(df_asof_tolerance_join)
shape: (4, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ time ┆ stock ┆ trade ┆ quote │
│ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 ┆ i64 │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A ┆ 101 ┆ 100 │
│ 2020-01-01 09:01:00 ┆ B ┆ 299 ┆ null │
│ 2020-01-01 09:03:00 ┆ B ┆ 301 ┆ 300 │
│ 2020-01-01 09:06:00 ┆ C ┆ 500 ┆ null │
└─────────────────────┴───────┴───────┴───────┘
Combining Polars with Python's multiprocessing
TLDR: if using Python's built-in multiprocessing module together with Polars results in a Polars error about multiprocessing methods, make sure you are using spawn, not fork, as the start method:
from multiprocessing import get_context
def my_fun(s):
print(s)
with get_context("spawn").Pool() as pool:
pool.map(my_fun, ["input1", "input2", ...])
When not to use multiprocessing
Before we dive into the details, it is important to emphasize that Polars has been built from the start to use all your CPU cores.
It does this by executing computations which can be done in parallel in separate threads.
For example, requesting two expressions in a select
statement can be done in parallel, with the results only being combined at the end.
Another example is aggregating a value within groups using groupby().agg(<expr>); each group can be evaluated separately.
It is very unlikely that the multiprocessing
module can improve your code performance in these cases.
See the optimizations section for more optimizations.
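As a concrete illustration of this built-in parallelism (hypothetical data), both aggregations below are computed by Polars itself, in parallel, without any help from multiprocessing:
import polars as pl
df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
out = df.select(
    [
        pl.col("a").sum(),
        pl.col("b").mean(),
    ]
)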
When to use multiprocessing
Although Polars is multithreaded, other libraries may be single-threaded. When the other library is the bottleneck, and the problem at hand is parallelizable, it makes sense to use multiprocessing to speed up.
The problem with the default multiprocessing config
Summary
The Python multiprocessing documentation lists the three methods by which a process pool can be created:
- spawn
- fork
- forkserver
The description of fork is (as of 2022-10-15):
The parent process uses os.fork() to fork the Python interpreter. The child process, when it begins, is effectively identical to the parent process. All resources of the parent are inherited by the child process. Note that safely forking a multithreaded process is problematic.
Available on Unix only. The default on Unix.
The short summary is: Polars is multithreaded so as to provide strong performance out-of-the-box.
Thus, it cannot be combined with fork
.
If you are on Unix (Linux, BSD, etc), you are using fork
, unless you explicitly override it.
The reason you may not have encountered this before is that pure Python code, and most Python libraries, are (mostly) single threaded.
Alternatively, you are on Windows, where fork is not available at all, or on MacOS, where fork stopped being the default start method in Python 3.8.
Thus one should use spawn
, or forkserver
, instead. spawn
is available on all platforms and is the safest choice, and hence the recommended method.
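If you prefer to set this once per program rather than per pool, a minimal sketch using the standard library:
import multiprocessing
if __name__ == "__main__":
    # Make spawn the default start method for every Pool/Process created later.
    multiprocessing.set_start_method("spawn")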
Example
The problem with fork lies in the copying of the parent process.
Consider the example below, which is a slightly modified example posted on the Polars issue tracker:
import multiprocessing
import polars as pl
def test_sub_process(df: pl.DataFrame, job_id):
df_filtered = df.filter(pl.col("a") > 0)
print(f"Filtered (job_id: {job_id})", df_filtered, sep="\n")
def create_dataset():
return pl.DataFrame({"a": [0, 2, 3, 4, 5], "b": [0, 4, 5, 56, 4]})
def setup():
# some setup work
df = create_dataset()
df.write_parquet("/tmp/test.parquet")
def main():
test_df = pl.read_parquet("/tmp/test.parquet")
for i in range(0, 5):
proc = multiprocessing.get_context("spawn").Process(target=test_sub_process, args=(test_df, i))
proc.start()
proc.join()
print(f"Executed sub process {i}")
if __name__ == "__main__":
setup()
main()
Using fork
as the method, instead of spawn
, will cause a deadlock.
Please note: Polars raises an error about the wrong multiprocessing start method before it even starts, but if that check were not there, the deadlock would occur.
The fork method is equivalent to calling os.fork()
, which is a system call as defined in the POSIX standard:
A process shall be created with a single thread. If a multi-threaded process calls fork(), the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources. Consequently, to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called.
In contrast, spawn
will create a completely fresh Python interpreter, and will not inherit the state of the mutexes.
So what happens in the code example?
For reading the file with pl.read_parquet
the file has to be locked.
Then os.fork()
is called, copying the state of the parent process, including mutexes.
Thus all child processes will copy the file lock in an acquired state, leaving them hanging indefinitely waiting for the file lock to be released, which never happens.
What makes debugging these issues tricky is that fork
can work.
Change the example so that the call to pl.read_parquet is removed:
from .example1 import create_dataset, test_sub_process
import multiprocessing
def main():
test_df = create_dataset()
for i in range(0, 5):
proc = multiprocessing.get_context("fork").Process(target=test_sub_process, args=(test_df, i))
proc.start()
proc.join()
print(f"Executed sub process {i}")
if __name__ == "__main__":
main()
This works fine.
Therefore debugging these issues in larger code bases, i.e. not the small toy examples here, can be a real pain, as a seemingly unrelated change can break your multiprocessing code.
In general, one should therefore never use the fork
start method with multithreaded libraries unless there are very specific requirements that cannot be met otherwise.
Pros and cons of fork
Based on the example, you may think, why is fork
available in Python to start with?
First, probably because of historical reasons: spawn
was added to Python in version 3.4, whilst fork
has been part of Python from the 2.x series.
Second, there are several limitations for spawn
and forkserver
that do not apply to fork
, in particular all arguments should be picklable.
See the Python multiprocessing docs for more information.
Third, because fork is faster at creating new processes than spawn, as spawn is effectively fork plus the creation of a brand new Python process, without the inherited locks, by calling execv.
Hence the warning in the Python docs that it is slower: there is more overhead to spawn
.
However, in almost all cases, one would like to use multiple processes to speed up computations that take multiple minutes or even hours, meaning the overhead is negligible in the grand scheme of things.
And more importantly, it actually works in combination with multithreaded libraries.
Fourth, spawn
starts a new process, and therefore it requires code to be importable, in contrast to fork
.
In particular, this means that when using spawn
the relevant code should not be in the global scope, such as in Jupyter notebooks or in plain scripts.
Hence in the examples above, we define the functions in which we spawn, and run those functions from within a __main__ clause.
This is not an issue for typical projects, but in quick experimentation in notebooks it could fail.
References
[1] https://docs.python.org/3/library/multiprocessing.html
[2] https://pythonspeed.com/articles/python-multiprocessing/
[3] https://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html
[4] https://bnikolic.co.uk/blog/python/parallelism/2019/11/13/python-forkserver-preload.html
Performance
This chapter covers some gotchas you need to know to squeeze maximum performance out of Polars
.
When used properly, Polars
can run at blazing speeds. Take a look at the results in
H2O AI database benchmark.
Strings
Understanding the memory format used by Arrow
and Polars
can really increase performance
of your queries. This is especially true for large string data. The figure below shows
how an Arrow
UTF8
array is laid out in memory.
The array ["foo", "bar", "ham"] is encoded by:
- a concatenated string "foobarham",
- an offset array indicating the start (and end) of each string [0, 3, 6, 9],
- a null bitmap, indicating null values.
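As a tiny illustration of how those offsets come about (plain Python, not a Polars API):
values = ["foo", "bar", "ham"]
data = "".join(values)  # "foobarham"
offsets = [0]
for v in values:
    offsets.append(offsets[-1] + len(v))  # offsets == [0, 3, 6, 9]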
This memory structure is very cache-efficient if we are to read the string values.
Especially if we compare it to a Vec<String>
(an array of heap allocated string data
in Rust
).
However, if we need to reorder the Arrow
UTF8
array, we need to swap around all the
bytes of the string values, which can become very expensive when dealing with large
strings. On the other hand for Vec<String>
, we only need to swap pointers around,
which is only 8 bytes of data that have to be moved, at little cost.
Reordering a DataFrame
embedding a large number of Utf8
Series
due to an operation
(filtering, joining, grouping, etc.) can quickly become quite expensive.
Categorical type
For this reason Polars
has a CategoricalType
. A Categorical
Series
is an array
filled with u32
values that each represent a unique string value. This maintains cache efficiency whilst keeping the values cheap to move around.
In the example below we demonstrate how you can cast a Utf8
Series
column to a
Categorical
Series
.
import polars as pl
df["utf8-column"].cast(pl.Categorical)
use polars::prelude::*;
df.column("utf8-column").unwrap().cast(&DataType::Categorical(None)).unwrap();
Eager join multiple DataFrames on Categorical data
When two DataFrames
need to be joined based on string data the Categorical
data needs
to be synchronized (data in column A
of df1
needs to point to the same underlying
string data as column B
in df2
). One can do so by casting data in the StringCache
context manager. This will synchronize all discoverable string values for the duration of that
context manager. If you want the global string cache to exist for the whole run, you can call toggle_string_cache(True) instead.
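A minimal sketch of the whole-run variant:
import polars as pl
pl.toggle_string_cache(True)  # keep the global string cache active for the rest of the run
The example below instead scopes the cache to the StringCache context manager: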
import polars as pl
df1 = pl.DataFrame({"a": ["foo", "bar", "ham"], "b": [1, 2, 3]})
df2 = pl.DataFrame({"a": ["foo", "spam", "eggs"], "c": [3, 2, 2]})
with pl.StringCache():
    df1 = df1.with_column(pl.col("a").cast(pl.Categorical))
    df2 = df2.with_column(pl.col("a").cast(pl.Categorical))
use polars::{prelude::*, toggle_string_cache};
fn main() {
    // Sketch mirroring the Python example above (exact API names may differ between
    // Polars versions): enable the global string cache, then cast the join column eagerly.
    toggle_string_cache(true);
    let mut df1 = df!("a" => &["foo", "bar", "ham"], "b" => &[1, 2, 3]).unwrap();
    let mut df2 = df!("a" => &["foo", "spam", "eggs"], "c" => &[3, 2, 2]).unwrap();
    df1.try_apply("a", |s| s.cast(&DataType::Categorical(None))).unwrap();
    df2.try_apply("a", |s| s.cast(&DataType::Categorical(None))).unwrap();
    println!("{df1}\n{df2}");
}
Lazy join multiple DataFrames on Categorical data
A lazy query always has a global string cache (unless you opt-out) for the duration of
that query (until .collect()
is called). The example below shows how you could join
two DataFrames
with Categorical
types.
import polars as pl
lf1 = pl.DataFrame({"a": ["foo", "bar", "ham"], "b": [1, 2, 3]}).lazy()
lf2 = pl.DataFrame({"a": ["foo", "spam", "eggs"], "c": [3, 2, 2]}).lazy()
lf1 = lf1.with_column(pl.col("a").cast(pl.Categorical))
lf2 = lf2.with_column(pl.col("a").cast(pl.Categorical))
lf1.join(lf2, on="a", how="inner")
use polars::{prelude::*, toggle_string_cache};
fn main() {
toggle_string_cache(true);
let lf1 = df! (
"a" => &["foo", "bar", "ham"],
"b" => &[1, 2, 3]
)
.unwrap()
.lazy();
let lf2 = df! (
"a" => &["foo", "spam", "eggs"],
"b" => &[3,2,2]
)
.unwrap()
.lazy();
let lf1 = lf1.with_column(col("a").cast(DataType::Categorical(None)));
let lf2 = lf2.with_column(col("a").cast(DataType::Categorical(None)));
let joined = lf1.inner_join(lf2, col("a"), col("a"));
println!("{:?}", joined.collect().unwrap());
}
Optimizations
This chapter will investigate some of the optimizations that are applied by the Polars
query optimizer by going through some examples to see how Polars
modifies the original query plan.
This section is quite outdated. Polars performs many more optimizations than the ones named in this section.
Lazy API
The Lazy API page is under construction.
To demonstrate the lazy Polars
capabilities we'll explore two medium-large
datasets of usernames:
Reddit usernames dataset containing 69+ million rows
import polars as pl
from ..paths import DATA_DIR
dataset = pl.read_csv(f"{DATA_DIR}/reddit.csv", n_rows=10)
shape: (5, 6)
┌─────┬──────────────────────────┬─────────────┬────────────┬───────────────┬────────────┐
│ id ┆ name ┆ created_utc ┆ updated_on ┆ comment_karma ┆ link_karma │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪══════════════════════════╪═════════════╪════════════╪═══════════════╪════════════╡
│ 1 ┆ truman48lamb_jasonbroken ┆ 1397113470 ┆ 1536527864 ┆ 0 ┆ 0 │
│ 2 ┆ johnethen06_jasonbroken ┆ 1397113483 ┆ 1536527864 ┆ 0 ┆ 0 │
│ 3 ┆ yaseinrez_jasonbroken ┆ 1397113483 ┆ 1536527864 ┆ 0 ┆ 1 │
│ 4 ┆ Valve92_jasonbroken ┆ 1397113503 ┆ 1536527864 ┆ 0 ┆ 0 │
│ 5 ┆ srbhuyan_jasonbroken ┆ 1397113506 ┆ 1536527864 ┆ 0 ┆ 0 │
└─────┴──────────────────────────┴─────────────┴────────────┴───────────────┴────────────┘
and the Runescape username dataset containing about 55+ million records.
import polars as pl
from ..paths import DATA_DIR
dataset = pl.read_csv(f"{DATA_DIR}/runescape.csv", has_header=False, n_rows=10)
shape: (5, 1)
┌─────────────┐
│ column_1 │
│ --- │
│ str │
╞═════════════╡
│ a000 │
│ a0000 │
│ a000000 │
│ a0000000 │
│ a0000000000 │
└─────────────┘
Predicate pushdown
The Predicate pushdown page is under construction
Predicate pushdown is an optimization Polars
does that reduces query times and memory
usage. A predicate is database jargon for applying a filter on some table, thereby
reducing the number of rows on that table.
So let's see if we can load some Reddit data and filter on a few predicates.
import polars as pl
from ..paths import DATA_DIR
q1 = (
pl.scan_csv(f"{DATA_DIR}/reddit.csv")
.filter(pl.col("comment_karma") > 0)
.filter(pl.col("link_karma") > 0)
.filter(pl.col("name").str.contains(r"^a")) # filter name that start with an "a"
)
If we were to run this query above nothing would happen! This is due to the lazy evaluation. Nothing will happen until specifically requested. This allows Polars to see the whole context of a query and optimize just in time for execution.
Execution is requested by the .collect
method. This would query all available data.
While you're writing, optimizing, and checking your query, this is often undesirable. Another
method that calls for execution is the .fetch
method. .fetch
takes a parameter
n_rows
and tries to 'fetch' that number of rows at the data source (no guarantees are
given though).
So let's "fetch" ~10 Million rows from the source file and apply the predicates.
q1.fetch(n_rows=int(1e7))
shape: (656, 6)
┌─────────┬─────────────┬─────────────┬────────────┬───────────────┬────────────┐
│ id ┆ name ┆ created_utc ┆ updated_on ┆ comment_karma ┆ link_karma │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪═════════════╪═════════════╪════════════╪═══════════════╪════════════╡
│ 77860 ┆ aquarin ┆ 1137474000 ┆ 1536528294 ┆ 150 ┆ 11 │
│ 77974 ┆ aadvaark ┆ 1137301200 ┆ 1536528294 ┆ 26 ┆ 47 │
│ 78004 ┆ apoisel ┆ 1137301200 ┆ 1536497404 ┆ 42 ┆ 2549 │
│ 78041 ┆ aonic ┆ 1137301200 ┆ 1536497404 ┆ 2931 ┆ 2095 │
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
│ 1192656 ┆ atothedrian ┆ 1162785880 ┆ 1536497412 ┆ 748 ┆ 585 │
│ 1204607 ┆ akbusiness ┆ 1162899425 ┆ 1536532995 ┆ 73 ┆ 512 │
│ 1214809 ┆ aaminics ┆ 1162969322 ┆ 1536533034 ┆ 22 ┆ 6 │
│ 1225341 ┆ antonulrich ┆ 1163110623 ┆ 1536497412 ┆ 9304 ┆ 1782 │
└─────────┴─────────────┴─────────────┴────────────┴───────────────┴────────────┘
Above we see that from the 10 Million rows, 61503 rows match our predicate.
Break it down
In Polars
we can visualize the query plan. Let's take a look.
q1.show_graph(optimized=False)
The astute reader might notice that our query is not very optimal because we have
three separate FILTER nodes. That means that after every FILTER a new DataFrame
is
allocated, which will be input to the next FILTER and then deleted from memory -- that
must be redundant, and you know what... they'd be right. The predicates should be
combined. We should have written this query:
import polars as pl
from ..paths import DATA_DIR
q2 = pl.scan_csv(f"{DATA_DIR}/reddit.csv").filter(
(pl.col("comment_karma") > 0) & (pl.col("link_karma") > 0) & (pl.col("name").str.contains(r"^a"))
)
That would translate to:
q2.show_graph(optimized=False)
As we can see the predicates are combined. This would lead to less copying of data.
In comes optimization
Polars
tries to save that mental overhead from the query writer and combines predicates
for you. Besides that, it pushes predicates down to the scan level! Let's see how our
optimized query looks.
q1.show_graph(optimized=True)
It may be hard to see, but what is clear is that there is only a single node: the CSV SCAN. The predicate filtering is done during the reading of the csv. This means that this query's memory overhead is reduced by the filtering factor! This makes a huge impact.
Memory
As we have seen there were ~62,000 rows left after the FILTER. That means that (aside from some memory overhead of the batch size and filter operations) we use \( \frac{6.2\text{e}4}{1\text{e}7} \approx 0.6\% \) of the memory we would use during an eager evaluation where we first read the whole table into memory before applying a filter.
Performance
At the time of writing this, the predicate pushdown also increased the query time performance.
Without optimization, predicate_pushdown=False
flag:
real 0m2,401s
user 0m5,457s
sys 0m0,894s
With optimization, predicate_pushdown=True
flag:
real 0m1,597s
user 0m6,143s
sys 0m0,647s
Relational algebra
In the visualization of the query plan, you see a \( \sigma \) symbol. This indicates a predicate done at the SCAN level. There is also a \( \pi \) symbol indicating projection (database jargon for column selection), but we'll get to that later.
Cheaper joins
Predicate pushdown optimization will generally also lead to cheaper joins. A join is an expensive operation. The fewer rows we have in a join operation, the cheaper it becomes.
Projection pushdown
The Projection pushdown page is under construction.
Let's expand our query from the previous section by joining the result of the FILTER operation with the runescape data to find popular Reddit users with a username starting with an "a" who also played Runescape. That must be something we are all interested in!
The query would look like this:
import polars as pl
from ..paths import DATA_DIR
reddit = (
pl.scan_csv(f"{DATA_DIR}/reddit.csv")
.filter(pl.col("comment_karma") > 0)
.filter(pl.col("link_karma") > 0)
.filter(pl.col("name").str.contains(r"^a"))
)
runescape = pl.scan_csv(f"{DATA_DIR}/runescape.csv", has_header=False).select(pl.col("column_1").alias("name"))
dataset = reddit.join(runescape, on="name", how="inner").select(["name", "comment_karma", "link_karma"])
df1 = dataset.fetch(int(1e7))
df2 = dataset.fetch(int(1e7), predicate_pushdown=True, projection_pushdown=True)
And yields the following DataFrame.
shape: (0, 3)
┌──────┬───────────────┬────────────┐
│ name ┆ comment_karma ┆ link_karma │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞══════╪═══════════════╪════════════╡
└──────┴───────────────┴────────────┘
Break it down
Again, let's take a look at the query plan.
dataset.show_graph(optimized=False)
Now we are focused on the projections, indicated with π. The first node shows π 3/6,
indicating that we select 3 out of 6 columns in the DataFrame. If we look at the csv scans
we see a wildcard π */6 and π */1, meaning that we select all 6 columns of the
reddit dataset and the one and only column from the runescape dataset respectively.
This query is not very optimal. We select all columns from both datasets and only show
3/6 after join. That means that there were some columns computed during the join
operation that could have been ignored. There were also columns parsed during csv
scanning only to be dropped at the end. When we are dealing with DataFrame
s with a
large number of columns the redundant work that is done can be huge.
Optimized query
Let's see how Polars
optimizes this query.
dataset.show_graph(optimized=True)
The projections are pushed down the join operation all the way to the csv scans. This means that both the scanning and join operation have become cheaper due to the query optimization.
Performance
Let's time the result before and after optimization.
Without optimization, predicate_pushdown=False
and projection_pushdown=False
.
real 0m3,273s
user 0m9,284s
sys 0m1,081s
With optimization, predicate_pushdown
and projection_pushdown
flags both to
True
.
real 0m1,732s
user 0m7,581s
sys 0m0,783s
We can see that we almost reduced the query time by half on this simple query. With real business data often comprising many columns, requiring filtering of missing data, complex groupby operations, and joins, we expect the difference between unoptimized and optimized queries to only grow.
Other optimizations
The Other optimizations page is under construction.
Besides predicate and projection pushdown, Polars
does other optimizations.
One important topic is optional caching and parallelization. It's easy to imagine having two
different DataFrame
computations that lead to a scan of the same file. Polars
may cache
the scanned file to prevent scanning the same file twice. However, if you want to, you
may override this behavior and force Polars
to read the same file twice. This could be faster
because the two scans can then be done in parallel.
Join parallelization
If we look at the previous query, we see that the join operation has as input a
computation path with data/reddit.csv
as root and one path with data/runescape.csv
as root. Polars
can observe that there are no dependencies between the two DataFrame
s and
will read both files in parallel. If other operations are done before the join (e.g.
groupby, filters, etc.) they are also executed in parallel.
Simplify expressions
Some other optimizations that are done are expression simplifications. The impact of these optimizations is less than that of predicate and projection pushdown, but they likely add up. You can track this issue to see the latest status of those.
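One way to see what the optimizer did to your own query is to inspect the optimized logical plan, e.g. (a sketch reusing q1 from the predicate pushdown section above):
print(q1.describe_optimized_plan())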
Static schemas
Do you know that feeling: you are 20 minutes into an ETL job and bam, your pipeline fails because a column turned out to have a different data type than you assumed? With Polars' lazy API we know the data type and column name at any node in the pipeline.
This is very valuable information and can be used to ensure data integrity at any node in the pipeline.
trip_duration = (pl.col("dropoff_datetime") - pl.col("pickup_datetime")).dt.seconds() / 3600
assert (
pl.scan_csv("data/yellow_tripdata_2010-01.csv", parse_dates=True)
.with_column(trip_duration.alias("trip_duration"))
.filter(pl.col("trip_duration") > 0)
.groupby(["vendor_id"])
.agg(
[
(pl.col("trip_distance") / pl.col("trip_duration")).mean().alias("avg_speed"),
(pl.col("tip_amount") / pl.col("passenger_count")).mean().alias("avg_tip_per_person"),
]
)
).schema == {"vendor_id": pl.Utf8, "avg_speed": pl.Float64, "avg_tip_per_person": pl.Float64}
Reference guide
Need to see all available methods/functions of Polars
? We have Rust
and Python
references:
Contributing
See the CONTRIBUTING.md
if you would like to contribute to the Polars
project.
If you're new to this we recommend starting out with contributing examples to the Python API documentation. The Python API docs are generated from the docstrings of the Python wrapper located in polars/py-polars
.
Here is an example commit that adds a docstring.
If you spot any gaps in this User Guide you can submit fixes to the pola-rs/polars-book
repo.
Happy hunting!