Polars logo

Introduction

This book is an introduction to the Polars DataFrame library. Its goal is to introduce you to Polars by going through examples and comparing it to other solutions. Some design choices are introduced here. The guide will also introduce you to optimal usage of Polars.

Even though Polars is completely written in Rust (no runtime overhead!) and uses Arrow -- the native arrow2 Rust implementation -- as its foundation, the examples presented in this guide will be mostly using its higher-level language bindings. Higher-level bindings only serve as a thin wrapper for functionality implemented in the core library.

For Pandas users, our Python package will offer the easiest way to get started with Polars.

Goals and non-goals

The goal of Polars is to provide a lightning fast DataFrame library that utilizes all available cores on your machine. Unlike tools such as dask -- which tries to parallelize existing single-threaded libraries like NumPy and Pandas -- Polars is written from the ground up, designed for parallelization of queries on DataFrames.

Polars goes to great lengths to:

  • Reduce redundant copies
  • Traverse memory cache efficiently
  • Minimize contention in parallelism

Polars is lazy and semi-lazy. It allows you to do most of your work eagerly, similar to Pandas, but it also provides a powerful expression syntax that will be optimized and executed within the query engine.

In lazy Polars we are able to do query optimization on the entire query, further improving performance and reducing memory pressure.

Polars keeps track of your query in a logical plan. This plan is optimized and reordered before running it. When a result is requested, Polars distributes the available work to different executors that use the algorithms available in the eager API to produce a result. Because the whole query context is known to the optimizer and executors of the logical plan, processes dependent on separate data sources can be parallelized on the fly.
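
As a minimal sketch of that flow (the column names are made up for illustration), the snippet below builds a lazy query; nothing runs until collect is called, at which point the optimized plan is executed:

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "x"]})

# Building the query only records a logical plan; no work is done yet.
lazy_query = (
    df.lazy()
    .filter(pl.col("a") > 1)
    .groupby("b")
    .agg(pl.col("a").sum())
)

# Only now is the plan optimized and executed.
print(lazy_query.collect())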

Performance 🚀🚀

Polars is very fast, and in fact is one of the best performing solutions available. See the results in h2oai's db-benchmark. The image below shows the biggest datasets yielding a result.

Polars TPCH Benchmark results are now available on the official website.

Current status

Below is a concise list of the features that allow Polars to meet its goals:

  • Copy-on-write (COW) semantics
    • "Free" clones
    • Cheap appends
  • Appending without clones
  • Column oriented data storage
    • No block manager (i.e. predictable performance)
  • Missing values indicated with bitmask
    • NaN are different from missing
    • Bitmask optimizations
  • Efficient algorithms
  • Very fast IO
    • Its csv and parquet readers are among the fastest in existence
  • Query optimizations
    • Predicate pushdown
      • Filtering at scan level
    • Projection pushdown
      • Projection at scan level
    • Aggregate pushdown
      • Aggregations at scan level
    • Simplify expressions
    • Parallel execution of physical plan
    • Cardinality based groupby dispatch
      • Different groupby strategies based on data cardinality
  • SIMD vectorization
  • NumPy universal functions

Acknowledgements

Development of Polars is proudly powered by

Xomnia

Getting started

Installation

Installing and using Polars is just a simple pip install, cargo add, or yarn add away.

$ # Installing for python
$ pip install polars
$ # Installing into a Rust project
$ cargo add polars
$ # Installing for Node
$ yarn add nodejs-polars

All binaries are pre-built for Python v3.6+.

Quick start

Below we show a simple snippet that parses a CSV file, filters it, and finishes with a groupby operation. This example is presented in Python only, as the "eager" API is not the preferred model in Rust.

import polars as pl

df = pl.read_csv("https://j.mp/iriscsv")
print(df.filter(pl.col("sepal_length") > 5)
      .groupby("species")
      .agg(pl.all().sum())
)

The snippet above will output:

shape: (3, 5)
╭──────────────┬──────────────────┬─────────────────┬──────────────────┬─────────────────╮
│ species      ┆ sepal_length_sum ┆ sepal_width_sum ┆ petal_length_sum ┆ petal_width_sum │
│ ---          ┆ ---              ┆ ---             ┆ ---              ┆ ---             │
│ str          ┆ f64              ┆ f64             ┆ f64              ┆ f64             │
╞══════════════╪══════════════════╪═════════════════╪══════════════════╪═════════════════╡
│ "virginica"  ┆ 324.5            ┆ 146.2           ┆ 273.1            ┆ 99.6            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "versicolor" ┆ 281.9            ┆ 131.8           ┆ 202.9            ┆ 63.3            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "setosa"     ┆ 116.9            ┆ 81.7            ┆ 33.2             ┆ 6.1             │
╰──────────────┴──────────────────┴─────────────────┴──────────────────┴─────────────────╯

As we can see, Polars pretty-prints the output object, including the column names and datatypes as headers.

Lazy quick start

If we want to run this query in lazy Polars we'd write:

import polars as pl

print(
    pl.read_csv("https://j.mp/iriscsv")
    .lazy()
    .filter(pl.col("sepal_length") > 5)
    .groupby("species")
    .agg(pl.all().sum())
    .collect()
)
use color_eyre::Result;
use polars::prelude::*;
use reqwest::blocking::Client;
use std::io::Cursor;

fn main() -> Result<()> {
    let data: Vec<u8> = Client::new()
        .get("https://j.mp/iriscsv")
        .send()?
        .text()?
        .bytes()
        .collect();

    let df = CsvReader::new(Cursor::new(data))
        .has_header(true)
        .finish()?
        .lazy()
        .filter(col("sepal_length").gt(5))
        .groupby([col("species")])
        .agg([col("*").sum()])
        .collect()?;

    println!("{:?}", df);

    Ok(())
}

When the data is stored locally, we can also use scan_csv in Python, or LazyCsvReader in Rust, to run the query in lazy Polars.
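
As a minimal sketch, assuming a hypothetical local copy named iris.csv, the same query with scan_csv stays lazy from the scan onwards, so the filter and aggregation can be optimized into the plan before anything is read:

import polars as pl

# "iris.csv" is a hypothetical local copy of the dataset used above.
q = (
    pl.scan_csv("iris.csv")
    .filter(pl.col("sepal_length") > 5)
    .groupby("species")
    .agg(pl.all().sum())
)

print(q.collect())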

References

If you want to dive right into the Python API docs, check the reference docs. Alternatively, the Rust API docs are available on docs.rs.

Lazy API

The lazy API builds a query plan. Nothing is executed until you explicitly ask Polars to execute the query (via LazyFrame.collect(), or LazyFrame.fetch()). This provides Polars with the entire context of the query, allowing optimizations and choosing the fastest algorithm given that context.

Going from eager to lazy is often as simple as starting your query with .lazy() and ending with .collect().

So the eager snippet above would become:

(
    df.lazy()
    .filter(pl.col("sepal_length") > 5)
    .groupby("species")
    .agg(pl.all().sum())
    .collect()
)
let df = df
    .lazy()
    .filter(col("sepal_length").gt(5))
    .groupby([col("species")])
    .agg([col("*" ).sum()])
    .collect()?;

Polars Expressions

Polars has a powerful concept called expressions that is central to its very fast performance.

Expressions are at the core of many data science operations:

  • taking a sample of rows from a column
  • multiplying values in a column
  • extracting a column of years from dates
  • converting a column of strings to lowercase
  • and so on!

However, expressions are also used within other operations:

  • taking the mean of a group in a groupby operation
  • calculating the size of groups in a groupby operation
  • taking the sum horizontally across columns

Polars performs these core data transformations very quickly by:

  • automatic query optimization on each expression
  • automatic parallelization of expressions on many columns

Polars expressions are a mapping from a Series to a Series (or mathematically Fn(Series) -> Series). Because expressions take a Series as input and produce a Series as output, it is straightforward to chain a sequence of expressions (similar to method chaining in Pandas).

This has all been a bit abstract, so let's start with some examples.

Polars Expressions

The following is an expression:

pl.col("foo").sort().head(2)
df.column("foo")?.sort(false).head(Some(2));

The snippet above says:

  1. Select column "foo"
  2. Then sort the column (not in reversed order)
  3. Then take the first two values of the sorted output

The power of expressions is that every expression produces a new expression, and that they can be piped together. You can run expressions by passing them to one of the Polars execution contexts.

Here we run two expressions by running df.select:

df.select([
    pl.col("foo").sort().head(2),
    pl.col("bar").filter(pl.col("foo") == 1).sum()
])
df.select([
   col("foo").sort(Default::default()).head(Some(2)),
   col("bar").filter(col("foo").eq(lit(1))).sum(),
]).collect()?;
All expressions are run in parallel, meaning that separate `Polars` expressions are **embarrassingly parallel**. Note that within an expression there may be more parallelization going on.

Expression examples

In this section we will go through some examples, but first let's create a dataset:

import polars as pl
import numpy as np

np.random.seed(12)

df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", None],
        "random": np.random.rand(5),
        "groups": ["A", "A", "B", "C", "B"],
    }
)
print(df)
use color_eyre::Result;
use polars::prelude::*;
use rand::{thread_rng, Rng};

fn main() -> Result<()> {
    let mut arr = [0f64; 5];
    thread_rng().fill(&mut arr);

    let df = df! [
        "nrs" => [Some(1), Some(2), Some(3), None, Some(5)],
        "names" => [Some("foo"), Some("ham"), Some("spam"), Some("eggs"), None],
        "random" => arr,
        "groups" => ["A", "A", "B", "C", "B"],
    ]?;

    println!("{}", &df);
shape: (5, 4)
┌──────┬───────┬──────────┬────────┐
│ nrs  ┆ names ┆ random   ┆ groups │
│ ---  ┆ ---   ┆ ---      ┆ ---    │
│ i64  ┆ str   ┆ f64      ┆ str    │
╞══════╪═══════╪══════════╪════════╡
│ 1    ┆ foo   ┆ 0.154163 ┆ A      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2    ┆ ham   ┆ 0.74     ┆ A      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3    ┆ spam  ┆ 0.263315 ┆ B      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ null ┆ egg   ┆ 0.533739 ┆ C      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5    ┆ null  ┆ 0.014575 ┆ B      │
└──────┴───────┴──────────┴────────┘

You can do a lot with expressions. They are so expressive that you sometimes have multiple ways to get the same results. To get a better feel for them let's go through some more examples.

A note for the Rust examples: Each of these examples uses the same dataset. So, due to Rust's ownership rules, and the fact that all the examples run in the same context, we'll clone() the dataset for each example to ensure that no prior example affects the behavior of later examples. This is the case for all Rust examples for the remainder of this book. It's worth mentioning that clones in Polars are very efficient and don't result in a "deep copy" of the data. They're implemented using the Rust Arc type (Atomic Reference Counting).

Count unique values

We can count the unique values in a column. Note that we are creating the same result in different ways. To avoid duplicate column names in the DataFrame, we could use an alias expression that can rename the expression.

out = df.select(
    [
        pl.col("names").n_unique().alias("unique_names_1"),
        pl.col("names").unique().count().alias("unique_names_2"),
    ]
)
print(out)
    let out = df
        .clone()
        .lazy()
        .select([
            col("names").n_unique().alias("unique_names_1"),
            col("names").unique().count().alias("unique_names_2"),
        ])
        .collect()?;
    println!("{}", out);
shape: (1, 2)
┌────────────────┬────────────────┐
│ unique_names_1 ┆ unique_names_2 │
│ ---            ┆ ---            │
│ u32            ┆ u32            │
╞════════════════╪════════════════╡
│ 5              ┆ 5              │
└────────────────┴────────────────┘

Various aggregations

We can do various aggregations. Below are examples of some of them, but there are more such as median, mean, first, etc.

out = df.select(
    [
        pl.sum("random").alias("sum"),
        pl.min("random").alias("min"),
        pl.max("random").alias("max"),
        pl.col("random").max().alias("other_max"),
        pl.std("random").alias("std dev"),
        pl.var("random").alias("variance"),
    ]
)
print(out)
    let out = df
        .clone()
        .lazy()
        .select([
            sum("random").alias("sum"),
            min("random").alias("min"),
            max("random").alias("max"),
            col("random").max().alias("other_max"),
            col("random").std().alias("std dev"),
            col("random").var().alias("variance"),
        ])
        .collect()?;
    println!("{}", out);
shape: (1, 6)
┌──────────┬──────────┬──────┬───────────┬──────────┬──────────┐
│ sum      ┆ min      ┆ max  ┆ other_max ┆ std dev  ┆ variance │
│ ---      ┆ ---      ┆ ---  ┆ ---       ┆ ---      ┆ ---      │
│ f64      ┆ f64      ┆ f64  ┆ f64       ┆ f64      ┆ f64      │
╞══════════╪══════════╪══════╪═══════════╪══════════╪══════════╡
│ 1.705842 ┆ 0.014575 ┆ 0.74 ┆ 0.74      ┆ 0.293209 ┆ 0.085971 │
└──────────┴──────────┴──────┴───────────┴──────────┴──────────┘

Filter and conditionals

We can also do some pretty complex things. In the next snippet we count all names ending with the string "am".

Note that in Rust, the strings feature must be enabled for the str expressions to be available.

out = df.select(
    [
        pl.col("names").filter(pl.col("names").str.contains(r"am$")).count(),
    ]
)
print(out)
    let out = df
        .clone()
        .lazy()
        .select([col("names")
            .filter(col("names").str().contains("am$"))
            .count()])
        .collect()?;
    println!("{}", out);
shape: (1, 1)
┌───────┐
│ names │
│ ---   │
│ u32   │
╞═══════╡
│ 2     │
└───────┘

Binary functions and modification

In the example below we use a conditional to create a new expression in the following when -> then -> otherwise construct. The when function requires a predicate expression (and thus leads to a boolean Series). The then function expects an expression that will be used in case the predicate evaluates to true, and the otherwise function expects an expression that will be used in case the predicate evaluates to false.

Note that you can pass any expression, or just base expressions like pl.col("foo"), pl.lit(3), pl.lit("bar"), etc.

Finally, we multiply this with the result of a sum expression:

out = df.select(
    [
        pl.when(pl.col("random") > 0.5).then(0).otherwise(pl.col("random")) * pl.sum("nrs"),
    ]
)
print(out)
    let out = df
        .clone()
        .lazy()
        .select([when(col("random").gt(0.5)).then(0).otherwise(col("random")) * sum("nrs")])
        .collect()?;
    println!("{}", out);
shape: (5, 1)
┌──────────┐
│ literal  │
│ ---      │
│ f64      │
╞══════════╡
│ 1.695791 │
├╌╌╌╌╌╌╌╌╌╌┤
│ 0.0      │
├╌╌╌╌╌╌╌╌╌╌┤
│ 2.896465 │
├╌╌╌╌╌╌╌╌╌╌┤
│ 0.0      │
├╌╌╌╌╌╌╌╌╌╌┤
│ 0.160325 │
└──────────┘

Window expressions

A Polars expression can also do an implicit GROUPBY, AGGREGATION, and JOIN in a single expression. In the examples below we do a GROUPBY OVER "groups" and AGGREGATE SUM of "random", and in the next expression we GROUPBY OVER "names" and AGGREGATE a LIST of "random". These window functions can be combined with other expressions and are an efficient way to determine group statistics. See more on those group statistics here.

df = df.select(
    [
        pl.col("*"),  # select all
        pl.col("random").sum().over("groups").alias("sum[random]/groups"),
        pl.col("random").list().over("names").alias("random/name"),
    ]
)
print(df)
    let df = df
        .lazy()
        .select([
            col("*"), // Select all
            col("random")
                .sum()
                .over([col("groups")])
                .alias("sum[random]/groups"),
            col("random")
                .list()
                .over([col("names")])
                .alias("random/name"),
        ])
        .collect()?;
    println!("{}", df);
shape: (5, 6)
┌──────┬───────┬──────────┬────────┬────────────────────┬─────────────┐
│ nrs  ┆ names ┆ random   ┆ groups ┆ sum[random]/groups ┆ random/name │
│ ---  ┆ ---   ┆ ---      ┆ ---    ┆ ---                ┆ ---         │
│ i64  ┆ str   ┆ f64      ┆ str    ┆ f64                ┆ list[f64]   │
╞══════╪═══════╪══════════╪════════╪════════════════════╪═════════════╡
│ 1    ┆ foo   ┆ 0.154163 ┆ A      ┆ 0.894213           ┆ [0.154163]  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2    ┆ ham   ┆ 0.74     ┆ A      ┆ 0.894213           ┆ [0.74]      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3    ┆ spam  ┆ 0.263315 ┆ B      ┆ 0.2778             ┆ [0.263315]  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ egg   ┆ 0.533739 ┆ C      ┆ 0.533739           ┆ [0.533739]  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5    ┆ null  ┆ 0.014575 ┆ B      ┆ 0.2778             ┆ [0.014575]  │
└──────┴───────┴──────────┴────────┴────────────────────┴─────────────┘

Conclusion

This is the tip of the iceberg in terms of possible expressions. There are a ton more, and they can be combined in a variety of ways.

This page was an introduction to Polars expressions, and gave a glimpse of what's possible with them. In the next page we'll discuss in which contexts expressions can be used. Later in the guide we'll go through expressions in various groupby contexts, all while keeping Polars execution parallel.

Expression contexts

You cannot use an expression just anywhere. An expression needs a context; the available contexts are:

  • selection: df.select([..])
  • groupby aggregation: df.groupby(..).agg([..])
  • hstack / adding columns: df.with_columns([..])

Syntactic sugar

The reason for such a context is that you are actually using the Polars lazy API, even when you use Polars eagerly. For instance, this snippet:

df.groupby("foo").agg([pl.col("bar").sum()])
eager.groupby(["foo"])?.agg(&[("bar", &["sum"])])?;

actually desugars to:

(df.lazy().groupby("foo").agg([pl.col("bar").sum()])).collect()
eager.lazy().groupby(["foo"]).agg([col("bar").sum()]).collect()?;

This allows Polars to push the expression into the query engine, do optimizations, and cache intermediate results.

Rust differs from Python somewhat in this respect. Where Python's eager mode is little more than a thin veneer over the lazy API, Rust's eager mode is closer to an implementation detail, and isn't really recommended for end-user use. It is possible that the eager API in Rust will be scoped private sometime in the future. Therefore, for the remainder of this document, assume that the Rust examples are using the lazy API.

Select context

In the select context the selection applies expressions over columns. The expressions in this context must produce Series that are all the same length or have a length of 1.

A Series of a length of 1 will be broadcasted to match the height of the DataFrame. Note that a select may produce new columns that are aggregations, combinations of expressions, or literals.

Selection context

out = df.select(
    [
        pl.sum("nrs"),
        pl.col("names").sort(),
        pl.col("names").first().alias("first name"),
        (pl.mean("nrs") * 10).alias("10xnrs"),
    ]
)
print(out)
    let out = df
        .clone()
        .lazy()
        .select([
            sum("nrs"),
            col("names").sort(false),
            col("names").first().alias("first name"),
            mean("nrs").mul(lit(10)).alias("10xnrs"),
        ])
        .collect()?;
    println!("{}", out);
shape: (5, 4)
┌─────┬───────┬────────────┬────────┐
│ nrs ┆ names ┆ first name ┆ 10xnrs │
│ --- ┆ ---   ┆ ---        ┆ ---    │
│ i64 ┆ str   ┆ str        ┆ f64    │
╞═════╪═══════╪════════════╪════════╡
│ 11  ┆ null  ┆ foo        ┆ 27.5   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 11  ┆ egg   ┆ foo        ┆ 27.5   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 11  ┆ foo   ┆ foo        ┆ 27.5   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 11  ┆ ham   ┆ foo        ┆ 27.5   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 11  ┆ spam  ┆ foo        ┆ 27.5   │
└─────┴───────┴────────────┴────────┘

Add columns

Adding columns to a DataFrame using with_columns also uses the selection context.

df = df.with_columns(
    [
        pl.sum("nrs").alias("nrs_sum"),
        pl.col("random").count().alias("count"),
    ]
)
print(df)
    let out = df
        .clone()
        .lazy()
        .with_columns([
            sum("nrs").alias("nrs_sum"),
            col("random").count().alias("count"),
        ])
        .collect()?;
    println!("{}", out);
shape: (5, 6)
┌──────┬───────┬──────────┬────────┬─────────┬───────┐
│ nrs  ┆ names ┆ random   ┆ groups ┆ nrs_sum ┆ count │
│ ---  ┆ ---   ┆ ---      ┆ ---    ┆ ---     ┆ ---   │
│ i64  ┆ str   ┆ f64      ┆ str    ┆ i64     ┆ u32   │
╞══════╪═══════╪══════════╪════════╪═════════╪═══════╡
│ 1    ┆ foo   ┆ 0.154163 ┆ A      ┆ 11      ┆ 5     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2    ┆ ham   ┆ 0.74     ┆ A      ┆ 11      ┆ 5     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3    ┆ spam  ┆ 0.263315 ┆ B      ┆ 11      ┆ 5     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ egg   ┆ 0.533739 ┆ C      ┆ 11      ┆ 5     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5    ┆ null  ┆ 0.014575 ┆ B      ┆ 11      ┆ 5     │
└──────┴───────┴──────────┴────────┴─────────┴───────┘

Groupby context

In the groupby context expressions work on groups and thus may yield results of any length (a group may have many members).

out = df.groupby("groups").agg(
    [
        pl.sum("nrs"),  # sum nrs by groups
        pl.col("random").count().alias("count"),  # count group members
        # sum random where name != null
        pl.col("random").filter(pl.col("names").is_not_null()).sum().suffix("_sum"),
        pl.col("names").reverse().alias("reversed names"),
    ]
)
print(out)
    let out = df
        .lazy()
        .groupby([col("groups")])
        .agg([
            sum("nrs"),                           // sum nrs by groups
            col("random").count().alias("count"), // count group members
            // sum random where name != null
            col("random")
                .filter(col("names").is_not_null())
                .sum()
                .suffix("_sum"),
            col("names").reverse().alias("reversed names"),
        ])
        .collect()?;
    println!("{}", out);
shape: (3, 5)
┌────────┬──────┬───────┬────────────┬────────────────┐
│ groups ┆ nrs  ┆ count ┆ random_sum ┆ reversed names │
│ ---    ┆ ---  ┆ ---   ┆ ---        ┆ ---            │
│ str    ┆ i64  ┆ u32   ┆ f64        ┆ list[str]      │
╞════════╪══════╪═══════╪════════════╪════════════════╡
│ C      ┆ null ┆ 1     ┆ 0.533739   ┆ ["egg"]        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ B      ┆ 8    ┆ 2     ┆ 0.263315   ┆ [null, "spam"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ A      ┆ 3    ┆ 2     ┆ 0.894213   ┆ ["ham", "foo"] │
└────────┴──────┴───────┴────────────┴────────────────┘

Besides the standard groupby, groupby_dynamic and groupby_rolling are also entry points to the groupby context.

GroupBy

The GroupBy page is under construction.

A multithreaded approach

One of the most efficient ways to process tabular data is to parallelize its processing via the "split-apply-combine" approach. This operation is at the core of the Polars grouping implementation, allowing it to attain lightning-fast operations. Specifically, both the "split" and "apply" phases are executed in a multi-threaded fashion.

A simple grouping operation is shown below as an example to illustrate this approach:
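
Here is a minimal sketch of such an operation (the column names are made up), annotated with the three phases:

import polars as pl

df = pl.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# split:   rows are partitioned by the values in "key" (in parallel)
# apply:   the aggregation runs per group (also in parallel)
# combine: the per-group results are gathered into a single DataFrame
print(df.groupby("key").agg(pl.col("value").sum()))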

For the hashing operations performed during the "split" phase, Polars uses a multithreaded lock-free approach that is illustrated in the following diagram:

This parallelization allows the grouping and joining operations (for instance) to be blazingly fast!

Check out this blog post for more details.

Do not kill the parallelization!

The following is specific to Python, and doesn't apply to Rust. Within Rust, blocks and closures (lambdas) can, and will, be executed concurrently.

We have all heard that Python is slow, and does "not scale." Besides the overhead of running "slow" bytecode, Python has to remain within the constraints of the Global Interpreter Lock (GIL). This means that if you were to use a lambda or a custom Python function during a parallelized phase, Polars' speed is capped by the speed of running Python code, which prevents multiple threads from executing the function concurrently.

This all feels terribly limiting, especially because we often need those lambda functions in a .groupby() step, for example. This approach is still supported by Polars, but keep in mind that the bytecode and GIL costs have to be paid.

To mitigate this, Polars implements a powerful syntax defined not only in its lazy API, but also in its eager API. Let's take a look at what that means.

We can start with the simple US congress dataset.

Note to Rust users, the dtype-categorical feature must be enabled for the examples in this section.

import polars as pl

url = "https://theunitedstates.io/congress-legislators/legislators-historical.csv"

dtypes = {
    "first_name": pl.Categorical,
    "gender": pl.Categorical,
    "type": pl.Categorical,
    "state": pl.Categorical,
    "party": pl.Categorical,
}

dataset = pl.read_csv(url, dtype=dtypes).with_column(pl.col("birthday").str.strptime(pl.Date, strict=False))
    let url = "https://theunitedstates.io/congress-legislators/legislators-historical.csv";

    let mut schema = Schema::new();
    schema.with_column("first_name".to_string(), DataType::Categorical(None));
    schema.with_column("gender".to_string(), DataType::Categorical(None));
    schema.with_column("type".to_string(), DataType::Categorical(None));
    schema.with_column("state".to_string(), DataType::Categorical(None));
    schema.with_column("party".to_string(), DataType::Categorical(None));
    schema.with_column("birthday".to_string(), DataType::Date);

    let data: Vec<u8> = Client::new().get(url).send()?.text()?.bytes().collect();

    let dataset = CsvReader::new(Cursor::new(data))
        .has_header(true)
        .with_dtypes(Some(&schema))
        .with_parse_dates(true)
        .finish()?;

    println!("{}", &dataset);

Basic aggregations

You can easily combine different aggregations by adding multiple expressions in a list. There is no upper bound on the number of aggregations you can do, and you can make any combination you want. In the snippet below we do the following aggregations:

Per GROUP "first_name" we

  • count the number of rows in the group:
    • short form: pl.count("party")
    • full form: pl.col("party").count()
  • aggregate the gender values groups to a list:
    • full form: pl.col("gender").list()
  • get the first value of column "last_name" in the group:
    • short form: pl.first("last_name") (not available in Rust)
    • full form: pl.col("last_name").first()

Besides the aggregation, we immediately sort the result and limit to the top 5 so that we have a nice summary overview.

import polars as pl

from .dataset import dataset

q = (
    dataset.lazy()
    .groupby("first_name")
    .agg(
        [
            pl.count(),
            pl.col("gender").list(),
            pl.first("last_name"),
        ]
    )
    .sort("count", reverse=True)
    .limit(5)
)

df = q.collect()
    let df = dataset
        .clone()
        .lazy()
        .groupby(["first_name"])
        .agg([count(), col("gender").list(), col("last_name").first()])
        .sort(
            "count",
            SortOptions {
                descending: true,
                nulls_last: true,
            },
        )
        .limit(5)
        .collect()?;

    println!("{}", df);
shape: (5, 4)
┌────────────┬───────┬─────────────────────┬───────────┐
│ first_name ┆ count ┆ gender              ┆ last_name │
│ ---        ┆ ---   ┆ ---                 ┆ ---       │
│ cat        ┆ u32   ┆ list[cat]           ┆ str       │
╞════════════╪═══════╪═════════════════════╪═══════════╡
│ John       ┆ 1254  ┆ ["M", "M", ... "M"] ┆ Walker    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ William    ┆ 1022  ┆ ["M", "M", ... "M"] ┆ Few       │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ James      ┆ 712   ┆ ["M", "M", ... "M"] ┆ Armstrong │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Thomas     ┆ 453   ┆ ["M", "M", ... "M"] ┆ Tucker    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Charles    ┆ 439   ┆ ["M", "M", ... "M"] ┆ Carroll   │
└────────────┴───────┴─────────────────────┴───────────┘

Conditionals

It's that easy! Let's turn it up a notch. Let's say we want to know how many delegates of a "state" are "Pro" or "Anti" administration. We could directly query that in the aggregation without the need for a lambda or for grooming the DataFrame.

import polars as pl

from .dataset import dataset

q = (
    dataset.lazy()
    .groupby("state")
    .agg(
        [
            (pl.col("party") == "Anti-Administration").sum().alias("anti"),
            (pl.col("party") == "Pro-Administration").sum().alias("pro"),
        ]
    )
    .sort("pro", reverse=True)
    .limit(5)
)

df = q.collect()
    let df = dataset
        .clone()
        .lazy()
        .groupby(["state"])
        .agg([
            (col("party").eq(lit("Anti-Administration")))
                .sum()
                .alias("anti"),
            (col("party").eq(lit("Pro-Administration")))
                .sum()
                .alias("pro"),
        ])
        .sort(
            "pro",
            SortOptions {
                descending: true,
                nulls_last: false,
            },
        )
        .limit(5)
        .collect()?;

    println!("{}", df);
shape: (5, 3)
┌───────┬──────┬─────┐
│ state ┆ anti ┆ pro │
│ ---   ┆ ---  ┆ --- │
│ cat   ┆ u32  ┆ u32 │
╞═══════╪══════╪═════╡
│ CT    ┆ 0    ┆ 3   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ NJ    ┆ 0    ┆ 3   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ NC    ┆ 1    ┆ 2   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ VA    ┆ 3    ┆ 1   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ SC    ┆ 0    ┆ 1   │
└───────┴──────┴─────┘

Similarly, this could also be done with a nested GROUPBY, but that doesn't help show off some of these nice features. 😉

import polars as pl

from .dataset import dataset

q = (
    dataset.lazy()
    .groupby(["state", "party"])
    .agg([pl.count("party").alias("count")])
    .filter((pl.col("party") == "Anti-Administration") | (pl.col("party") == "Pro-Administration"))
    .sort("count", reverse=True)
    .limit(5)
)

df = q.collect()
    let df = dataset
        .clone()
        .lazy()
        .groupby(["state", "party"])
        .agg([col("party").count().alias("count")])
        .filter(
            col("party")
                .eq(lit("Anti-Administration"))
                .or(col("party").eq(lit("Pro-Administration"))),
        )
        .sort(
            "count",
            SortOptions {
                descending: true,
                nulls_last: true,
            },
        )
        .limit(5)
        .collect()?;

    println!("{}", df);
shape: (5, 3)
┌───────┬─────────────────────┬───────┐
│ state ┆ party               ┆ count │
│ ---   ┆ ---                 ┆ ---   │
│ cat   ┆ cat                 ┆ u32   │
╞═══════╪═════════════════════╪═══════╡
│ NJ    ┆ Pro-Administration  ┆ 3     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ CT    ┆ Pro-Administration  ┆ 3     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ VA    ┆ Anti-Administration ┆ 3     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ NC    ┆ Pro-Administration  ┆ 2     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ PA    ┆ Anti-Administration ┆ 1     │
└───────┴─────────────────────┴───────┘

Filtering

We can also filter the groups. Let's say we want to compute a mean per group, but we don't want to include all values from that group, and we also don't want to filter the rows from the DataFrame (because we need those rows for another aggregation).

In the example below we show how that can be done.

Note that we can make Python functions for clarity. These functions don't cost us anything. That is because we only create Polars expressions; we don't apply a custom function over a Series during the runtime of the query. Of course, you can make functions that return expressions in Rust, too.

from datetime import date

import polars as pl

from .dataset import dataset


def compute_age() -> pl.Expr:
    return date(2021, 1, 1).year - pl.col("birthday").dt.year()


def avg_birthday(gender: str) -> pl.Expr:
    return compute_age().filter(pl.col("gender") == gender).mean().alias(f"avg {gender} birthday")


q = (
    dataset.lazy()
    .groupby(["state"])
    .agg(
        [
            avg_birthday("M"),
            avg_birthday("F"),
            (pl.col("gender") == "M").sum().alias("# male"),
            (pl.col("gender") == "F").sum().alias("# female"),
        ]
    )
    .limit(5)
)

df = q.collect()
    fn compute_age() -> Expr {
        lit(2022) - col("birthday").dt().year()
    }

    fn avg_birthday(gender: &str) -> Expr {
        compute_age()
            .filter(col("gender").eq(lit(gender)))
            .mean()
            .alias(&format!("avg {} birthday", gender))
    }

    let df = dataset
        .clone()
        .lazy()
        .groupby(["state"])
        .agg([
            avg_birthday("M"),
            avg_birthday("F"),
            (col("gender").eq(lit("M"))).sum().alias("# male"),
            (col("gender").eq(lit("F"))).sum().alias("# female"),
        ])
        .limit(5)
        .collect()?;

    println!("{}", df);
shape: (5, 5)
┌───────┬────────────────┬────────────────┬────────┬──────────┐
│ state ┆ avg M birthday ┆ avg F birthday ┆ # male ┆ # female │
│ ---   ┆ ---            ┆ ---            ┆ ---    ┆ ---      │
│ cat   ┆ f64            ┆ f64            ┆ u32    ┆ u32      │
╞═══════╪════════════════╪════════════════╪════════╪══════════╡
│ ID    ┆ 137.666667     ┆ 99.0           ┆ 57     ┆ 2        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ WY    ┆ 137.717949     ┆ 75.0           ┆ 39     ┆ 1        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ OR    ┆ 149.892857     ┆ 106.4          ┆ 85     ┆ 5        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ PI    ┆ 145.0          ┆ null           ┆ 13     ┆ 0        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ CO    ┆ 130.8876       ┆ 72.666667      ┆ 89     ┆ 3        │
└───────┴────────────────┴────────────────┴────────┴──────────┘

Sorting

It's common to see a DataFrame being sorted for the sole purpose of managing the ordering during a GROUPBY operation. Let's say that we want to get the names of the oldest and youngest politicians per state. We could SORT and GROUPBY.

import polars as pl

from .dataset import dataset


def get_person() -> pl.Expr:
    return pl.col("first_name") + pl.lit(" ") + pl.col("last_name")


q = (
    dataset.lazy()
    .sort("birthday")
    .groupby(["state"])
    .agg(
        [
            get_person().first().alias("youngest"),
            get_person().last().alias("oldest"),
        ]
    )
    .limit(5)
)

df = q.collect()
    fn get_person() -> Expr {
        col("first_name") + lit(" ") + col("last_name")
    }

    let df = dataset
        .clone()
        .lazy()
        .sort(
            "birthday",
            SortOptions {
                descending: true,
                nulls_last: true,
            },
        )
        .groupby(["state"])
        .agg([
            get_person().first().alias("youngest"),
            get_person().last().alias("oldest"),
        ])
        .limit(5)
        .collect()?;

    println!("{}", df);
shape: (5, 3)
┌───────┬─────────────────────┬──────────────────┐
│ state ┆ youngest            ┆ oldest           │
│ ---   ┆ ---                 ┆ ---              │
│ cat   ┆ str                 ┆ str              │
╞═══════╪═════════════════════╪══════════════════╡
│ GU    ┆ Antonio Won Pat     ┆ Robert Underwood │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ VA    ┆ William Grayson     ┆ Scott Taylor     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ IL    ┆ Benjamin Stephenson ┆ Aaron Schock     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ CA    ┆ Edward Gilbert      ┆ Katie Hill       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ KY    ┆ John Edwards        ┆ Ben Chandler     │
└───────┴─────────────────────┴──────────────────┘

However, if we also want the names sorted alphabetically, the global sort on birthday no longer suffices. Luckily we can sort in a groupby context, separately from the DataFrame.

import polars as pl

from .dataset import dataset


def get_person() -> pl.Expr:
    return pl.col("first_name") + pl.lit(" ") + pl.col("last_name")


q = (
    dataset.lazy()
    .sort("birthday")
    .groupby(["state"])
    .agg(
        [
            get_person().first().alias("youngest"),
            get_person().last().alias("oldest"),
            get_person().sort().first().alias("alphabetical_first"),
        ]
    )
    .limit(5)
)

df = q.collect()
    let df = dataset
        .clone()
        .lazy()
        .sort(
            "birthday",
            SortOptions {
                descending: true,
                nulls_last: true,
            },
        )
        .groupby(["state"])
        .agg([
            get_person().first().alias("youngest"),
            get_person().last().alias("oldest"),
            get_person().sort(false).first().alias("alphabetical_first"),
        ])
        .limit(5)
        .collect()?;

    println!("{}", df);
shape: (5, 4)
┌───────┬───────────────────────┬─────────────────┬────────────────────┐
│ state ┆ youngest              ┆ oldest          ┆ alphabetical_first │
│ ---   ┆ ---                   ┆ ---             ┆ ---                │
│ cat   ┆ str                   ┆ str             ┆ str                │
╞═══════╪═══════════════════════╪═════════════════╪════════════════════╡
│ FL    ┆ Charles Downing       ┆ Patrick Murphy  ┆ Abijah Gilbert     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ OH    ┆ John Smith            ┆ John Boccieri   ┆ Aaron Harlan       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ NJ    ┆ Lambert Cadwalader    ┆ Jon Runyan      ┆ Aaron Kitchell     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ IA    ┆ Bernhart Henn         ┆ Abby Finkenauer ┆ Abby Finkenauer    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ NY    ┆ Cornelius Schoonmaker ┆ Max Rose        ┆ A. Foster          │
└───────┴───────────────────────┴─────────────────┴────────────────────┘

We can even sort by another column in the groupby context. If we want to know whether the alphabetically sorted name is male or female, we could add: pl.col("gender").sort_by("first_name").first().alias("gender")

import polars as pl

from .dataset import dataset


def get_person() -> pl.Expr:
    return pl.col("first_name") + pl.lit(" ") + pl.col("last_name")


q = (
    dataset.lazy()
    .sort("birthday")
    .groupby(["state"])
    .agg(
        [
            get_person().first().alias("youngest"),
            get_person().last().alias("oldest"),
            get_person().sort().first().alias("alphabetical_first"),
            pl.col("gender").sort_by("first_name").first().alias("gender"),
        ]
    )
    .sort("state")
    .limit(5)
)

df = q.collect()
    let df = dataset
        .clone()
        .lazy()
        .sort(
            "birthday",
            SortOptions {
                descending: true,
                nulls_last: true,
            },
        )
        .groupby(["state"])
        .agg([
            get_person().first().alias("youngest"),
            get_person().last().alias("oldest"),
            get_person().sort(false).first().alias("alphabetical_first"),
            col("gender")
                .sort_by(["first_name"], [false])
                .first()
                .alias("gender"),
        ])
        .sort("state", SortOptions::default())
        .limit(5)
        .collect()?;

    println!("{}", df);
shape: (5, 5)
┌───────┬───────────────────┬────────────────┬────────────────────┬────────┐
│ state ┆ youngest          ┆ oldest         ┆ alphabetical_first ┆ gender │
│ ---   ┆ ---               ┆ ---            ┆ ---                ┆ ---    │
│ cat   ┆ str               ┆ str            ┆ str                ┆ cat    │
╞═══════╪═══════════════════╪════════════════╪════════════════════╪════════╡
│ DE    ┆ Samuel White      ┆ John Carney    ┆ Albert Polk        ┆ M      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ VA    ┆ William Grayson   ┆ Scott Taylor   ┆ Abraham Venable    ┆ M      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ SC    ┆ Ralph Izard       ┆ Joe Cunningham ┆ Abraham Nott       ┆ M      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ MD    ┆ Benjamin Contee   ┆ Frank Kratovil ┆ Albert Blakeney    ┆ M      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ PA    ┆ Thomas Fitzsimons ┆ Ryan Costello  ┆ Aaron Kreider      ┆ M      │
└───────┴───────────────────┴────────────────┴────────────────────┴────────┘

Conclusion

In the examples above we've seen that we can do a lot by combining expressions. By doing so we delay the use of custom Python functions that slow down the queries (by the slow nature of Python AND the GIL).

If we are missing a type of expression, let us know by opening a feature request!

Folds

Polars provides expressions/methods for horizontal aggregations like sum, min, mean, etc. by setting the argument axis=1. However, when you need a more complex aggregation, the default methods provided by the Polars library may not be sufficient. That's when folds come in handy.
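
For instance, a minimal sketch of such a built-in horizontal aggregation, assuming the axis=1 argument mentioned above:

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Row-wise sum via the built-in method; no fold needed for this simple case.
print(df.sum(axis=1))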

The Polars fold expression operates on columns for maximum speed. It utilizes the data layout very efficiently and often has vectorized execution.

Let's start with an example by implementing the sum operation ourselves, with a fold.

Manual Sum

df = pl.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [10, 20, 30],
    }
)

out = df.select(
    pl.fold(acc=pl.lit(0), f=lambda acc, x: acc + x, exprs=pl.col("*")).alias("sum"),
)
print(out)
    let df = df![
        "a" => [1, 2, 3],
        "b" => [10, 20, 30],
    ]?;

    let out = df
        .lazy()
        .select([fold_exprs(lit(0), |acc, x| Ok(acc + x), [col("*")]).alias("sum")])
        .collect()?;
    println!("{}", out);
shape: (3, 1)
┌─────┐
│ sum │
│ --- │
│ i64 │
╞═════╡
│ 11  │
├╌╌╌╌╌┤
│ 22  │
├╌╌╌╌╌┤
│ 33  │
└─────┘

The snippet above recursively applies the function f(acc, x) -> acc to an accumulator acc and a new column x. The function operates on columns individually and can take advantage of cache efficiency and vectorization.

Conditional

In the case where you'd want to apply a condition/predicate on all columns in a DataFrame a fold operation can be a very concise way to express this.

df = pl.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [0, 1, 2],
    }
)

out = df.filter(
    pl.fold(
        acc=pl.lit(True),
        f=lambda acc, x: acc & x,
        exprs=pl.col("*") > 1,
    )
)
print(out)
    let df = df![
        "a" => [1, 2, 3],
        "b" => [0, 1, 2],
    ]?;

    let out = df
        .lazy()
        .filter(fold_exprs(
            lit(true),
            |acc, x| acc.bitand(&x),
            [col("*").gt(1)],
        ))
        .collect()?;
    println!("{}", out);
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 3   ┆ 2   │
└─────┴─────┘

In the snippet we filter all rows where each column value is > 1.

Folds and string data

Folds could be used to concatenate string data. However, due to the materialization of intermediate columns, this operation will have quadratic complexity.

Therefore, we recommend using the concat_str expression for this.

Note that, in Rust, the concat_str feature must be enabled to use the concat_str expression.

df = pl.DataFrame(
    {
        "a": ["a", "b", "c"],
        "b": [1, 2, 3],
    }
)

out = df.select(
    [
        pl.concat_str(["a", "b"]),
    ]
)
print(out)
    let df = df![
        "a" => ["a", "b", "c"],
        "b" => [1, 2, 3],
    ]?;

    let out = df
        .lazy()
        .select([concat_str([col("a"), col("b")], "")])
        .collect()?;
    println!("{:?}", out);
shape: (3, 1)
┌─────┐
│ a   │
│ --- │
│ str │
╞═════╡
│ a1  │
├╌╌╌╌╌┤
│ b2  │
├╌╌╌╌╌┤
│ c3  │
└─────┘

Window functions 🚀🚀

Window functions are expressions with superpowers. They allow you to perform aggregations on groups in the select context. Let's get a feel for what that means. First we create a dataset. The dataset loaded in the snippet below contains information about pokemon and has the following columns:

['#', 'Name', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary']

import polars as pl

# then let's load some csv data with information about pokemon
df = pl.read_csv(
    "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv"
)
use color_eyre::Result;
use polars::prelude::*;
use reqwest::blocking::Client;

fn main() -> Result<()> {
    let data: Vec<u8> = Client::new()
        .get("https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv")
        .send()?
        .text()?
        .bytes()
        .collect();

    let df = CsvReader::new(std::io::Cursor::new(data))
        .has_header(true)
        .finish()?;

    println!("{}", df);
shape: (163, 13)
┌─────┬───────────────────────┬─────────┬────────┬─────┬─────────┬───────┬────────────┬───────────┐
│ #   ┆ Name                  ┆ Type 1  ┆ Type 2 ┆ ... ┆ Sp. Def ┆ Speed ┆ Generation ┆ Legendary │
│ --- ┆ ---                   ┆ ---     ┆ ---    ┆     ┆ ---     ┆ ---   ┆ ---        ┆ ---       │
│ i64 ┆ str                   ┆ str     ┆ str    ┆     ┆ i64     ┆ i64   ┆ i64        ┆ bool      │
╞═════╪═══════════════════════╪═════════╪════════╪═════╪═════════╪═══════╪════════════╪═══════════╡
│ 1   ┆ Bulbasaur             ┆ Grass   ┆ Poison ┆ ... ┆ 65      ┆ 45    ┆ 1          ┆ false     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ Ivysaur               ┆ Grass   ┆ Poison ┆ ... ┆ 80      ┆ 60    ┆ 1          ┆ false     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ Venusaur              ┆ Grass   ┆ Poison ┆ ... ┆ 100     ┆ 80    ┆ 1          ┆ false     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ VenusaurMega Venusaur ┆ Grass   ┆ Poison ┆ ... ┆ 120     ┆ 80    ┆ 1          ┆ false     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ...                   ┆ ...     ┆ ...    ┆ ... ┆ ...     ┆ ...   ┆ ...        ┆ ...       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 147 ┆ Dratini               ┆ Dragon  ┆ null   ┆ ... ┆ 50      ┆ 50    ┆ 1          ┆ false     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 148 ┆ Dragonair             ┆ Dragon  ┆ null   ┆ ... ┆ 70      ┆ 70    ┆ 1          ┆ false     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 149 ┆ Dragonite             ┆ Dragon  ┆ Flying ┆ ... ┆ 100     ┆ 80    ┆ 1          ┆ false     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 150 ┆ Mewtwo                ┆ Psychic ┆ null   ┆ ... ┆ 90      ┆ 130   ┆ 1          ┆ true      │
└─────┴───────────────────────┴─────────┴────────┴─────┴─────────┴───────┴────────────┴───────────┘

Groupby Aggregations in selection

Below we show how to use window functions to group over different columns and perform an aggregation on them. Doing so allows us to use multiple groupby operations in parallel, using a single query. The results of the aggregation are projected back to the original rows. Therefore, a window function will always lead to a DataFrame with the same size as the original.

Note how we call .over("Type 1") and .over(["Type 1", "Type 2"]). Using window functions we can aggregate over different groups in a single select call! Note that, in Rust, the type of the argument to over() must be a collection, so even when you're only using one column, you must provide it in an array.

The best part is, this won't cost you anything. The computed groups are cached and shared between different window expressions.

out = df.select(
    [
        "Type 1",
        "Type 2",
        pl.col("Attack").mean().over("Type 1").alias("avg_attack_by_type"),
        pl.col("Defense").mean().over(["Type 1", "Type 2"]).alias("avg_defense_by_type_combination"),
        pl.col("Attack").mean().alias("avg_attack"),
    ]
)
    let out = df
        .clone()
        .lazy()
        .select([
            col("Type 1"),
            col("Type 2"),
            col("Attack")
                .mean()
                .over(["Type 1"])
                .alias("avg_attack_by_type"),
            col("Defense")
                .mean()
                .over(["Type 1", "Type 2"])
                .alias("avg_defense_by_type_combination"),
            col("Attack").mean().alias("avg_attack"),
        ])
        .collect()?;

    println!("{}", out);
shape: (163, 5)
┌─────────┬────────┬────────────────────┬─────────────────────────────────┬────────────┐
│ Type 1  ┆ Type 2 ┆ avg_attack_by_type ┆ avg_defense_by_type_combination ┆ avg_attack │
│ ---     ┆ ---    ┆ ---                ┆ ---                             ┆ ---        │
│ str     ┆ str    ┆ f64                ┆ f64                             ┆ f64        │
╞═════════╪════════╪════════════════════╪═════════════════════════════════╪════════════╡
│ Grass   ┆ Poison ┆ 72.923077          ┆ 67.8                            ┆ 75.349693  │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Grass   ┆ Poison ┆ 72.923077          ┆ 67.8                            ┆ 75.349693  │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Grass   ┆ Poison ┆ 72.923077          ┆ 67.8                            ┆ 75.349693  │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Grass   ┆ Poison ┆ 72.923077          ┆ 67.8                            ┆ 75.349693  │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...     ┆ ...    ┆ ...                ┆ ...                             ┆ ...        │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Dragon  ┆ null   ┆ 94.0               ┆ 55.0                            ┆ 75.349693  │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Dragon  ┆ null   ┆ 94.0               ┆ 55.0                            ┆ 75.349693  │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Dragon  ┆ Flying ┆ 94.0               ┆ 95.0                            ┆ 75.349693  │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Psychic ┆ null   ┆ 53.875             ┆ 51.428571                       ┆ 75.349693  │
└─────────┴────────┴────────────────────┴─────────────────────────────────┴────────────┘

Operations per group

Window functions can do more than aggregation. They can also be viewed as an operation within a group. If, for instance, you want to sort the values within a group, you can write col("value").sort().over("group") and voilà! We sorted by group!

Let's filter out some rows to make this more clear.

filtered = df.filter(pl.col("Type 2") == "Psychic").select(
    [
        "Name",
        "Type 1",
        "Speed",
    ]
)
print(filtered)
    let filtered = df
        .clone()
        .lazy()
        .filter(col("Type 2").eq(lit("Psychic")))
        .select([col("Name"), col("Type 1"), col("Speed")])
        .collect()?;

    println!("{}", filtered);
shape: (7, 3)
┌─────────────────────┬────────┬───────┐
│ Name                ┆ Type 1 ┆ Speed │
│ ---                 ┆ ---    ┆ ---   │
│ str                 ┆ str    ┆ i64   │
╞═════════════════════╪════════╪═══════╡
│ Slowpoke            ┆ Water  ┆ 15    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ Slowbro             ┆ Water  ┆ 30    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ SlowbroMega Slowbro ┆ Water  ┆ 30    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ Exeggcute           ┆ Grass  ┆ 40    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ Exeggutor           ┆ Grass  ┆ 55    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ Starmie             ┆ Water  ┆ 115   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ Jynx                ┆ Ice    ┆ 95    │
└─────────────────────┴────────┴───────┘

Observe that the group Water of column Type 1 is not contiguous. There are two rows of Grass in between. Also note that the pokemon within each group are sorted by Speed in ascending order. Unfortunately, for this example we want them sorted in descending speed order. Luckily, with window functions this is easy to accomplish.

out = filtered.with_columns(
    [
        pl.col(["Name", "Speed"]).sort(reverse=True).over("Type 1"),
    ]
)
print(out)
    let out = filtered
        .lazy()
        .with_columns([cols(["Name", "Speed"]).sort(true).over(["Type 1"])])
        .collect()?;
    println!("{}", out);

shape: (7, 3)
┌─────────────────────┬────────┬───────┐
│ Name                ┆ Type 1 ┆ Speed │
│ ---                 ┆ ---    ┆ ---   │
│ str                 ┆ str    ┆ i64   │
╞═════════════════════╪════════╪═══════╡
│ Starmie             ┆ Water  ┆ 115   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ Slowpoke            ┆ Water  ┆ 30    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ SlowbroMega Slowbro ┆ Water  ┆ 30    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ Exeggutor           ┆ Grass  ┆ 55    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ Exeggcute           ┆ Grass  ┆ 40    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ Slowbro             ┆ Water  ┆ 15    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ Jynx                ┆ Ice    ┆ 95    │
└─────────────────────┴────────┴───────┘

Polars keeps track of each group's location and maps the expressions to the proper row locations. This will also work over different groups in a single select.

The power of window expressions is that you often don't need a groupby -> explode combination; you can put the logic in a single expression. It also makes the API cleaner. When properly used (see the sketch after this list):

  • groupby -> marks that groups are aggregated and we expect a DataFrame of size n_groups
  • over -> marks that we want to compute something within a group, but doesn't modify the original size of the DataFrame
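
A minimal sketch of that distinction, using the pokemon DataFrame loaded above:

# groupby -> agg: one row per group
per_group = df.groupby("Type 1").agg(pl.col("Speed").mean())

# over: the same aggregate, but broadcast back onto the original rows
per_row = df.select(
    [
        "Name",
        "Type 1",
        pl.col("Speed").mean().over("Type 1").alias("mean_speed_by_type"),
    ]
)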

Window expression rules

The evaluations of window expressions are as follows (assuming we apply it to a pl.Int32 column):

# aggregate and broadcast within a group
# output type: -> Int32
pl.sum("foo").over("groups")

# sum within a group and multiply with group elements
# output type: -> Int32
(pl.col("x").sum() * pl.col("y")).over("groups")

# sum within a group and multiply with group elements 
# and aggregate the group to a list
# output type: -> List(Int32)
(pl.col("x").sum() * pl.col("y")).list().over("groups")

# note that it will require an explicit `list()` call
# sum within a group and multiply with group elements 
# and aggregate the group to a list
# the flatten call explodes that list

# This is the fastest method to do things over groups when the groups are sorted
(pl.col("x").sum() * pl.col("y")).list().over("groups").flatten()
    let out = df
        .clone()
        .lazy()
        .select([
            // aggregate and broadcast within a group
            // output type: -> i32
            sum("foo").over([col("groups")]),
            // sum within a group and multiply with group elements
            // output type: -> i32
            (col("x").sum() * col("y"))
                .over([col("groups")])
                .alias("x1"),
            // sum within a group and multiply with group elements
            // and aggregate the group to a list
            // output type: -> ChunkedArray<i32>
            (col("x").sum() * col("y"))
                .list()
                .over([col("groups")])
                .alias("x2"),
            // note that it will require an explicit `list()` call
            // sum within a group and multiply with group elements
            // and aggregate the group to a list
            // the flatten call explodes that list

            // This is the fastest method to do things over groups when the groups are sorted
            (col("x").sum() * col("y"))
                .list()
                .over([col("groups")])
                .flatten()
                .alias("x3"),
        ])
        .collect()?;

More examples

For further practice, below are some window functions for us to compute:

  • sort all pokemon by type
  • select the first 3 pokemon per type as "Type 1"
  • sort the pokemon within a type by speed and select the first 3 as "fastest/group"
  • sort the pokemon within a type by attack and select the first 3 as "strongest/group"
  • sort the pokemon by name within a type and select the first 3 as "sorted_by_alphabet"
out = df.sort("Type 1").select(
    [
        pl.col("Type 1").head(3).list().over("Type 1").flatten(),
        pl.col("Name").sort_by(pl.col("Speed")).head(3).list().over("Type 1").flatten().alias("fastest/group"),
        pl.col("Name").sort_by(pl.col("Attack")).head(3).list().over("Type 1").flatten().alias("strongest/group"),
        pl.col("Name").sort().head(3).list().over("Type 1").flatten().alias("sorted_by_alphabet"),
    ]
)
    let out = df
        .clone()
        .lazy()
        .select([
            col("Type 1")
                .head(Some(3))
                .list()
                .over(["Type 1"])
                .flatten(),
            col("Name")
                .sort_by(["Speed"], [false])
                .head(Some(3))
                .list()
                .over(["Type 1"])
                .flatten()
                .alias("fastest/group"),
            col("Name")
                .sort_by(["Attack"], [false])
                .head(Some(3))
                .list()
                .over(["Type 1"])
                .flatten()
                .alias("strongest/group"),
            col("Name")
                .sort(false)
                .head(Some(3))
                .list()
                .over(["Type 1"])
                .flatten()
                .alias("sorted_by_alphabet"),
        ])
        .collect()?;
    println!("{:?}", out);
shape: (43, 4)
┌────────┬─────────────────────┬─────────────────┬─────────────────────────┐
│ Type 1 ┆ fastest/group       ┆ strongest/group ┆ sorted_by_alphabet      │
│ ---    ┆ ---                 ┆ ---             ┆ ---                     │
│ str    ┆ str                 ┆ str             ┆ str                     │
╞════════╪═════════════════════╪═════════════════╪═════════════════════════╡
│ Bug    ┆ Paras               ┆ Metapod         ┆ Beedrill                │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Bug    ┆ Metapod             ┆ Kakuna          ┆ BeedrillMega Beedrill   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Bug    ┆ Parasect            ┆ Caterpie        ┆ Butterfree              │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Dragon ┆ Dratini             ┆ Dratini         ┆ Dragonair               │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...    ┆ ...                 ┆ ...             ┆ ...                     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Rock   ┆ Omanyte             ┆ Omastar         ┆ Geodude                 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Water  ┆ Slowpoke            ┆ Magikarp        ┆ Blastoise               │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Water  ┆ Slowbro             ┆ Tentacool       ┆ BlastoiseMega Blastoise │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Water  ┆ SlowbroMega Slowbro ┆ Horsea          ┆ Cloyster                │
└────────┴─────────────────────┴─────────────────┴─────────────────────────┘

Flattened window function

If we have a window function that aggregates to a list like the example above with the following Python expression:

pl.col("Name").sort_by(pl.col("Speed")).head(3).list().over("Type 1")

and in Rust:

col("Name").sort_by(["Speed"], [false]).head(Some(3)).list().over(["Type 1"])

This still works, but that would give us a column type List which might not be what we want (this would significantly increase our memory usage!).

Instead, we could flatten. This just turns our 2D list into a 1D array and projects that array/column back to our DataFrame. This is very fast because the reshape is often free, and adding the column back to the original DataFrame is also a lot cheaper (since we don't require a join like in a normal window function).

However, for this operation to make sense, it is important that the columns used in over([..]) are sorted!
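
As a small sketch of that contrast (assuming df is the pokemon DataFrame used above):

# keeps the original number of rows, but "Name" becomes a List column
df.sort("Type 1").with_column(
    pl.col("Name").sort_by(pl.col("Speed")).head(3).list().over("Type 1").alias("fastest/group")
)

# flatten projects the values back as a flat column;
# this only makes sense because we sorted by "Type 1" first
df.sort("Type 1").select(
    pl.col("Name").sort_by(pl.col("Speed")).head(3).list().over("Type 1").flatten().alias("fastest/group")
)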

List context

An expression context we haven't discussed yet is the List context. It simply means that we can apply any expression to the elements of a List.

A note for Rust users, these features require the list feature flag.

Row wise computations

This context is ideal for computing things in row orientation.

Polars expressions work on columns that have the guarantee that they consist of homogeneous data. Columns have this guarantee; rows in a DataFrame do not. Luckily we have a data type with the guarantee that the rows are homogeneous: the pl.List data type.

Let's say we have the following data:

grades = pl.DataFrame(
    {
        "student": ["bas", "laura", "tim", "jenny"],
        "arithmetic": [10, 5, 6, 8],
        "biology": [4, 6, 2, 7],
        "geography": [8, 4, 9, 7],
    }
)
print(grades)
    let grades = df![
        "student" => ["bas", "laura", "tim", "jenny"],
        "arithmetic" => [10, 5, 6, 8],
        "biology" => [4, 6, 2, 7],
        "geography" => [8, 4, 9, 7],
    ]?;
    println!("{}", grades);
shape: (4, 4)
┌─────────┬────────────┬─────────┬───────────┐
│ student ┆ arithmetic ┆ biology ┆ geography │
│ ---     ┆ ---        ┆ ---     ┆ ---       │
│ str     ┆ i64        ┆ i64     ┆ i64       │
╞═════════╪════════════╪═════════╪═══════════╡
│ bas     ┆ 10         ┆ 4       ┆ 8         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ laura   ┆ 5          ┆ 6       ┆ 4         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ tim     ┆ 6          ┆ 2       ┆ 9         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ jenny   ┆ 8          ┆ 7       ┆ 7         │
└─────────┴────────────┴─────────┴───────────┘

If we want to compute the rank of all the columns except for "student", we can first collect those columns into a list data type:

out = grades.select([pl.concat_list(pl.all().exclude("student")).alias("all_grades")])
print(out)
    let out = grades
        .clone()
        .lazy()
        .select([concat_lst([all().exclude(["student"])]).alias("all_grades")])
        .collect()?;
    println!("{}", out);
shape: (4, 1)
┌────────────┐
│ all_grades │
│ ---        │
│ list[i64]  │
╞════════════╡
│ [10, 4, 8] │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [5, 6, 4]  │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [6, 2, 9]  │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [8, 7, 7]  │
└────────────┘

Running polars expression on list elements

We can run any Polars expression on the elements of a list with the arr.eval (arr().eval in Rust) expression! These expressions run entirely on Polars' query engine and can run in parallel, so they are very fast.

Let's expand the example from above with something a little more interesting. Pandas allows you to express the rank values as a percentage. Polars doesn't provide such a keyword argument, but because expressions are so versatile we can create our own percentage rank expression. Let's try that!

Note that we must select the list's element from the context. When we apply expressions over list elements, we use pl.element() to select the element of a list.

Also note that to use arr().eval in Rust requires the list_eval feature flag.

# the percentage rank expression
rank_pct = pl.element().rank(reverse=True) / pl.col("").count()


grades.with_column(
    # create the list of homogeneous data
    pl.concat_list(pl.all().exclude("student")).alias("all_grades")
).select([
    # select all columns except the intermediate list
    pl.all().exclude("all_grades"),
    # compute the rank by calling `arr.eval`
    pl.col("all_grades").arr.eval(rank_pct, parallel=True).alias("grades_rank")
])
    // the percentage rank expression
    let rank_opts = RankOptions {
        method: RankMethod::Average,
        descending: true,
    };
    let rank_pct = col("").rank(rank_opts) / col("").count().cast(DataType::Float32);

    let grades = grades
        .clone()
        .lazy()
        .with_column(
            // create the list of homogeneous data
            concat_lst([all().exclude(["student"])]).alias("all_grades"),
        )
        .select([
            // select all columns except the intermediate list
            all().exclude(["all_grades"]),
            // compute the rank by calling `arr.eval`
            col("all_grades")
                .arr()
                .eval(rank_pct, true)
                .alias("grades_rank"),
        ])
        .collect()?;
    println!("{}", grades);

This outputs:

shape: (4, 5)
┌─────────┬────────────┬─────────┬───────────┬────────────────────────────────┐
│ student ┆ arithmetic ┆ biology ┆ geography ┆ grades_rank                    │
│ ---     ┆ ---        ┆ ---     ┆ ---       ┆ ---                            │
│ str     ┆ i64        ┆ i64     ┆ i64       ┆ list [f32]                     │
╞═════════╪════════════╪═════════╪═══════════╪════════════════════════════════╡
│ bas     ┆ 10         ┆ 4       ┆ 8         ┆ [0.333333, 1.0, 0.666667]      │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ laura   ┆ 5          ┆ 6       ┆ 4         ┆ [0.666667, 0.333333, 1.0]      │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ tim     ┆ 6          ┆ 2       ┆ 9         ┆ [0.666667, 1.0, 0.333333]      │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ jenny   ┆ 8          ┆ 7       ┆ 7         ┆ [0.333333, 0.833333, 0.833333] │
└─────────┴────────────┴─────────┴───────────┴────────────────────────────────┘

Note that this solution works for any expressions/operation you want to do row wise.

Numpy interop

Polars expressions support NumPy ufuncs. See here for a list of all supported NumPy functions.

This means that if a function is not provided by Polars, we can use NumPy and we still have fast columnar operations through the NumPy API.

Example

import polars as pl
import numpy as np

df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

out = df.select(
    [
        np.log(pl.all()).suffix("_log"),
    ]
)
print(out)
shape: (3, 2)
┌──────────┬──────────┐
│ a_log    ┆ b_log    │
│ ---      ┆ ---      │
│ f64      ┆ f64      │
╞══════════╪══════════╡
│ 0.0      ┆ 1.386294 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 0.693147 ┆ 1.609438 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1.098612 ┆ 1.791759 │
└──────────┴──────────┘

Gotchas

Read more about the gotchas here.

Custom functions

You should be convinced by now that Polars expressions are so powerful and flexible that the need for custom Python functions is much smaller than in other libraries.

Still, you need the ability to pass an expression's state to a third party library, or to apply your black box function over data in Polars.

For this we provide the following expressions:

  • map
  • apply

To map or to apply.

These functions have an important distinction in how they operate and consequently what data they will pass to the user.

A map passes the Series backed by the expression as is.

map follows the same rules in both the select and the groupby context: the Series it receives represents a column in a DataFrame. Note that in the groupby context, that column is not yet aggregated!

A typical use case for map is passing the Series backed by an expression to a third party library. Below we show how we could use map to pass an expression column to a neural network model.

df.with_columns([
    pl.col("features").map(lambda s: MyNeuralNetwork.forward(s.to_numpy())).alias("activations")
])
df.lazy().with_columns([
    col("features").map(|s| Ok(my_nn.forward(s)), GetOutput::default()).alias("activations")
]).collect()?

Use cases for map in the groupby context are slim. They are only used for performance reasons, but can quite easily lead to incorrect results. Let me explain why.

df = pl.DataFrame(
    {
        "keys": ["a", "a", "b"],
        "values": [10, 7, 1],
    }
)

out = df.groupby("keys", maintain_order=True).agg(
    [
        pl.col("values").map(lambda s: s.shift()).alias("shift_map"),
        pl.col("values").shift().alias("shift_expression"),
    ]
)
print(df)
        "keys" => ["a", "a", "b"],
        "values" => [10, 7, 1],
    ]?;

    let out = df
        .lazy()
        .groupby(["keys"])
        .agg([
            col("values")
                .map(|s| Ok(s.shift(1)), GetOutput::default())
                .alias("shift_map"),
            col("values").shift(1).alias("shift_expression"),
        ])
        .collect()?;

    println!("{}", out);
shape: (3, 2)
┌──────┬────────┐
│ keys ┆ values │
│ ---  ┆ ---    │
│ str  ┆ i64    │
╞══════╪════════╡
│ a    ┆ 10     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ a    ┆ 7      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b    ┆ 1      │
└──────┴────────┘

In the snippet above we groupby the "keys" column. That means we have the following groups:

"a" -> [10, 7]
"b" -> [1]

If we then apply a shift operation to each group, we'd expect:

"a" -> [null, 10]
"b" -> [null]

Now, let's print and see what we've got.

print(out)
shape: (2, 3)
┌──────┬────────────┬──────────────────┐
│ keys ┆ shift_map  ┆ shift_expression │
│ ---  ┆ ---        ┆ ---              │
│ str  ┆ list[i64]  ┆ list[i64]        │
╞══════╪════════════╪══════════════════╡
│ a    ┆ [null, 10] ┆ [null, 10]       │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b    ┆ [7]        ┆ [null]           │
└──────┴────────────┴──────────────────┘

Ouch.. we clearly get the wrong results here. Group "b" even got a value from group "a" 😵.

This went horribly wrong, because the map applies the function before we aggregate! So that means the whole column [10, 7, 1] got shifted to [null, 10, 7] and was then aggregated.

So my advice is to never use map in the groupby context unless you know you need it and know what you are doing.
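
If you want to see this for yourself (this snippet is not part of the original example), print what the function actually receives:

def show_what_map_gets(s: pl.Series) -> pl.Series:
    # prints the full column [10, 7, 1]; `map` does not receive per-group Series
    print(s)
    return s.shift()


out = df.groupby("keys", maintain_order=True).agg(
    [
        pl.col("values").map(show_what_map_gets).alias("shift_map"),
    ]
)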

To apply

Luckily we can fix the previous example with apply. apply works on the smallest logical elements for that operation.

That is:

  • select context -> single elements
  • groupby context -> single groups

So with apply we should be able to fix our example:

out = df.groupby("keys", maintain_order=True).agg(
    [
        pl.col("values").apply(lambda s: s.shift()).alias("shift_map"),
        pl.col("values").shift().alias("shift_expression"),
    ]
)
print(out)
    let out = df
        .clone()
        .lazy()
        .groupby([col("keys")])
        .agg([
            col("values")
                .apply(|s| Ok(s.shift(1)), GetOutput::default())
                .alias("shift_map"),
            col("values").shift(1).alias("shift_expression"),
        ])
        .collect()?;
    println!("{}", out);
shape: (2, 3)
┌──────┬────────────┬──────────────────┐
│ keys ┆ shift_map  ┆ shift_expression │
│ ---  ┆ ---        ┆ ---              │
│ str  ┆ list[i64]  ┆ list[i64]        │
╞══════╪════════════╪══════════════════╡
│ a    ┆ [null, 10] ┆ [null, 10]       │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b    ┆ [null]     ┆ [null]           │
└──────┴────────────┴──────────────────┘

And observe, a valid result! 🎉

apply in the select context

In the select context, the apply expression passes elements of the column to the Python function.

Note that you are now running Python; this will be slow.

Let's go through some examples to see what to expect. We will continue with the DataFrame we defined at the start of this section and show an example with the apply function and a counterpart example where we use the expression API to achieve the same goals.

Adding a counter

In this example we create a global counter and then add the integer 1 to the global state at every element processed. Every iteration, the result of the increment is added to the element value.

Note, this example isn't provided in Rust. The reason is that the global counter value would lead to data races when this apply is evaluated in parallel. It would be possible to wrap it in a Mutex to protect the variable, but that would be obscuring the point of the example. This is a case where the Python Global Interpreter Lock's performance tradeoff provides some safety guarantees.

counter = 0


def add_counter(val: int) -> int:
    global counter
    counter += 1
    return counter + val


out = df.select(
    [
        pl.col("values").apply(add_counter).alias("solution_apply"),
        (pl.col("values") + pl.arange(1, pl.count() + 1)).alias("solution_expr"),
    ]
)
print(out)
shape: (3, 2)
┌────────────────┬───────────────┐
│ solution_apply ┆ solution_expr │
│ ---            ┆ ---           │
│ i64            ┆ i64           │
╞════════════════╪═══════════════╡
│ 11             ┆ 11            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 9              ┆ 9             │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4              ┆ 4             │
└────────────────┴───────────────┘

Combining multiple column values

If we want to have access to values of different columns in a single apply function call, we can create a struct data type. This data type collects those columns as fields in the struct. So if we create a struct from the columns "keys" and "values", we would get the following struct elements:

[
    {"keys": "a", "values": 10},
    {"keys": "a", "values": 7},
    {"keys": "b", "values": 1},
]

In Python, those are passed as a dict to the calling Python function and can thus be indexed by field: str. In Rust, you'll get a Series with the Struct type. The fields of the struct can then be indexed and downcast.

out = df.select(
    [
        pl.struct(["keys", "values"]).apply(lambda x: len(x["keys"]) + x["values"]).alias("solution_apply"),
        (pl.col("keys").str.lengths() + pl.col("values")).alias("solution_expr"),
    ]
)
print(out)
    let out = df
        .lazy()
        .select([
            // pack to struct to get access to multiple fields in a custom `apply/map`
            as_struct(&[col("keys"), col("values")])
                // we will compute the len(a) + b
                .apply(
                    |s| {
                        // downcast to struct
                        let ca = s.struct_()?;

                        // get the fields as Series
                        let s_a = &ca.fields()[0];
                        let s_b = &ca.fields()[1];

                        // downcast the `Series` to their known type
                        let ca_a = s_a.utf8()?;
                        let ca_b = s_b.i32()?;

                        // iterate both `ChunkedArrays`
                        let out: Int32Chunked = ca_a
                            .into_iter()
                            .zip(ca_b)
                            .map(|(opt_a, opt_b)| match (opt_a, opt_b) {
                                (Some(a), Some(b)) => Some(a.len() as i32 + b),
                                _ => None,
                            })
                            .collect();

                        Ok(out.into_series())
                    },
                    GetOutput::from_type(DataType::Int32),
                )
                .alias("solution_apply"),
            (col("keys").str().count_match(".") + col("values")).alias("solution_expr"),
        ])
        .collect()?;
    println!("{}", out);
shape: (3, 2)
┌────────────────┬───────────────┐
│ solution_apply ┆ solution_expr │
│ ---            ┆ ---           │
│ i64            ┆ i64           │
╞════════════════╪═══════════════╡
│ 11             ┆ 11            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 8              ┆ 8             │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2              ┆ 2             │
└────────────────┴───────────────┘

Return types?

Custom Python functions are black boxes for Polars. We really don't know what kind of black arts you are doing, so we have to infer and try our best to understand what you meant.

As a user it helps to understand what we do to better utilize custom functions.

The data type is automatically inferred. We do that by waiting for the first non-null value. That value will then be used to determine the type of the Series.
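
If you already know the output type, you can also state it explicitly instead of relying on inference. A minimal sketch using apply's return_dtype argument, reusing the "values" column from the DataFrame above:

out = df.select(
    [
        # tell Polars up front that this function returns a float
        pl.col("values").apply(lambda v: v * 0.5, return_dtype=pl.Float64).alias("half_value"),
    ]
)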

The mapping of Python types to Polars data types is as follows:

  • int -> Int64
  • float -> Float64
  • bool -> Boolean
  • str -> Utf8
  • list[tp] -> List[tp] (where the inner type is inferred with the same rules)
  • dict[str, [tp]] -> struct
  • Any -> object (Prevent this at all times)

Rust types map as follows:

  • i32 or i64 -> Int64
  • f32 or f64 -> Float64
  • bool -> Boolean
  • String or str -> Utf8
  • Vec<tp> -> List[tp] (where the inner type is inferred with the same rules)

Expressions

fn(Series) -> Series

  • Lazily evaluated
    • Can be optimized
    • Gives the library writer context and informed decisions can be made
  • Embarrassingly parallel
  • Context dependent
    • selection / projection -> Series = COLUMN, LITERAL or VALUE
    • aggregation -> Series = GROUPS
import polars as pl

df = pl.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "fruits": ["banana", "banana", "apple", "apple", "banana"],
        "B": [5, 4, 3, 2, 1],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
        "optional": [28, 300, None, 2, -30],
    }
)
df
A fruits B cars optional
i64 str i64 str i64
1 "banana" 5 "beetle" 28
2 "banana" 4 "audi" 300
3 "apple" 3 "beetle" null
4 "apple" 2 "beetle" 2
5 "banana" 1 "beetle" -30

Selection context

# We can select by name

(df.select([
    pl.col("A"),
    "B",      # the col part is inferred
    pl.lit("B"),  # we must tell polars we mean the literal "B"
    pl.col("fruits"),
]))
A B literal fruits
i64 i64 str str
1 5 "B" "banana"
2 4 "B" "banana"
3 3 "B" "apple"
4 2 "B" "apple"
5 1 "B" "banana"
# you can select columns with a regex if it starts with '^' and ends with '$'

(df.select([
    pl.col("^A|B$").sum()
]))
A B
i64 i64
15 15
# you can select multiple columns by name

(df.select([
    pl.col(["A", "B"]).sum()
]))
A B
i64 i64
15 15

# We select everything in normal order
# Then we select everything in reversed order

(df.select([
    pl.all(),
    pl.all().reverse().suffix("_reverse")
]))
A fruits B cars optional A_reverse fruits_reverse B_reverse cars_reverse optional_reverse
i64 str i64 str i64 i64 str i64 str i64
1 "banana" 5 "beetle" 28 5 "banana" 1 "beetle" -30
2 "banana" 4 "audi" 300 4 "apple" 2 "beetle" 2
3 "apple" 3 "beetle" null 3 "apple" 3 "beetle" null
4 "apple" 2 "beetle" 2 2 "banana" 4 "audi" 300
5 "banana" 1 "beetle" -30 1 "banana" 5 "beetle" 28
# all expressions run in parallel
# single valued `Series` are broadcasted to the shape of the `DataFrame`

(df.select([
    pl.all(),
    pl.all().sum().suffix("_sum")
]))
A fruits B cars optional A_sum fruits_sum B_sum cars_sum optional_sum
i64 str i64 str i64 i64 str i64 str i64
1 "banana" 5 "beetle" 28 15 null 15 null 300
2 "banana" 4 "audi" 300 15 null 15 null 300
3 "apple" 3 "beetle" null 15 null 15 null 300
4 "apple" 2 "beetle" 2 15 null 15 null 300
5 "banana" 1 "beetle" -30 15 null 15 null 300
# there are `str` and `dt` namespaces for specialized functions

predicate = pl.col("fruits").str.contains("^b.*")

(df.select([
    predicate
]))
fruits
bool
true
true
false
false
true
# use the predicate to filter

df.filter(predicate)
A fruits B cars optional
i64 str i64 str i64
1 "banana" 5 "beetle" 28
2 "banana" 4 "audi" 300
5 "banana" 1 "beetle" -30
# predicate expressions can be used to filter

(df.select([
    pl.col("A").filter(pl.col("fruits").str.contains("^b.*")).sum(),
    (pl.col("B").filter(pl.col("cars").str.contains("^b.*")).sum() * pl.col("B").sum()).alias("some_compute()"),
]))
A some_compute()
i64 i64
8 165
# We can do arithmetic on columns and (literal) values
# can be evaluated to 1 without programmer knowing

some_var = 1

(df.select([
    ((pl.col("A") / 124.0 * pl.col("B")) / pl.sum("B") * some_var).alias("computed")
]))
computed
f64
0.0
0.0
0.0
0.0
0.0
# We can combine columns by a predicate

(df.select([
    "fruits",
    "B",
    pl.when(pl.col("fruits") == "banana").then(pl.col("B")).otherwise(-1).alias("b")
]))
fruits B b
str i64 i64
"banana" 5 5
"banana" 4 4
"apple" 3 -1
"apple" 2 -1
"banana" 1 1
# We can combine columns by a fold operation on column level

(df.select([
    "A",
    "B",
    pl.fold(0, lambda a, b: a + b, [pl.col("A"), "B", pl.col("B")**2, pl.col("A") / 2.0]).alias("fold")
]))
A B fold
i64 i64 f64
1 5 31
2 4 23
3 3 16
4 2 12
5 1 9
# even combine all

(df.select([
    pl.arange(0, df.height).alias("idx"),
    "A",
    pl.col("A").shift().alias("A_shifted"),
    pl.concat_str(pl.all(), "-").alias("str_concat_1"),  # prefer this
    pl.fold(pl.col("A"), lambda a, b: a + "-" + b, pl.all().exclude("A")).alias("str_concat_2"),  # over this (accidentally O(n^2))
]))
idx A A_shifted str_concat_1 str_concat_2
i64 i64 i64 str str
0 1 null "1-banana-5-beetle-28" "1-banana-5-beetle-28"
1 2 1 "2-banana-4-audi-300" "2-banana-4-audi-300"
2 3 2 null null
3 4 3 "4-apple-2-beetle-2" "4-apple-2-beetle-2"
4 5 4 "5-banana-1-beetle--30" "5-banana-1-beetle--30"

Aggregation context

  • expressions are applied over groups instead of columns
# we can still combine many expressions

(df.sort("cars").groupby("fruits")
    .agg([
        pl.col("B").sum().alias("B_sum"),
        pl.sum("B").alias("B_sum2"),  # syntactic sugar for the first
        pl.first("fruits").alias("fruits_first"),
        pl.count("A").alias("count"),
        pl.col("cars").reverse()
    ]))
fruits B_sum B_sum2 fruits_first count cars
str i64 i64 str u32 list
"banana" 10 10 "banana" 3 [beetle, beetle, audi]
"apple" 5 5 "apple" 2 [beetle, beetle]
# We can explode the list column "cars"

(df.sort("cars").groupby("fruits")
    .agg([
        pl.col("B").sum().alias("B_sum"),
        pl.sum("B").alias("B_sum2"),  # syntactic sugar for the first
        pl.first("fruits").alias("fruits_first"),
        pl.count("A").alias("count"),
        pl.col("cars").reverse()
    ])).explode("cars")
fruits B_sum B_sum2 fruits_first count cars
str i64 i64 str u32 str
"apple" 5 5 "apple" 2 "beetle"
"apple" 5 5 "apple" 2 "beetle"
"banana" 10 10 "banana" 3 "beetle"
"banana" 10 10 "banana" 3 "beetle"
"banana" 10 10 "banana" 3 "audi"
(df.groupby("fruits")
    .agg([
        pl.col("B").sum().alias("B_sum"),
        pl.sum("B").alias("B_sum2"),  # syntactic sugar for the first
        pl.first("fruits").alias("fruits_first"),
        pl.count(),
        pl.col("B").shift().alias("B_shifted")
    ])
 .explode("B_shifted")
)
fruits B_sum B_sum2 fruits_first count B_shifted
str i64 i64 str u32 i64
"apple" 5 5 "apple" 2 null
"apple" 5 5 "apple" 2 3
"banana" 10 10 "banana" 3 null
"banana" 10 10 "banana" 3 5
"banana" 10 10 "banana" 3 4
# We can explode the list column "cars"

(df.sort("cars").groupby("fruits")
    .agg([
        pl.col("B").sum(),
        pl.sum("B").alias("B_sum2"),  # syntactic sugar for the first
        pl.first("fruits").alias("fruits_first"),
        pl.count("A").alias("count"),
        pl.col("cars").reverse()
    ])).explode("cars")
fruits B_sum B_sum2 fruits_first count cars
str i64 i64 str u32 str
"apple" 5 5 "apple" 2 "beetle"
"apple" 5 5 "apple" 2 "beetle"
"banana" 10 10 "banana" 3 "beetle"
"banana" 10 10 "banana" 3 "beetle"
"banana" 10 10 "banana" 3 "audi"
# we can also get the list of the groups

(df.groupby("fruits")
    .agg([
         pl.col("B").shift().alias("shift_B"),
         pl.col("B").reverse().alias("rev_B"),
    ]))
fruits shift_B rev_B
str list list
"apple" [null, 3] [2, 3]
"banana" [null, 5, 4] [1, 4, 5]
# we can do predicates in the groupby as well

(df.groupby("fruits")
    .agg([
        pl.col("B").filter(pl.col("B") > 1).list().keep_name(),
    ]))
fruits B
str list
"banana" [5, 4]
"apple" [3, 2]
# and take the mean of only the values where the predicate is true

(df.groupby("fruits")
    .agg([
        pl.col("B").filter(pl.col("B") > 1).mean(),
    ]))
fruits B_mean
str f64
"banana" 4.5
"apple" 2.5
# Another example

(df.groupby("fruits")
    .agg([
        pl.col("B").shift_and_fill(1, fill_value=0).alias("shifted"),
        pl.col("B").shift_and_fill(1, fill_value=0).sum().alias("shifted_sum"),
    ]))
fruits shifted shifted_sum
str list i64
"apple" [0, 3] 3
"banana" [0, 5, 4] 9

Window functions!

  • Expression with superpowers.
  • Aggregation in selection context
pl.col("foo").aggregation_expression(..).over("column_used_to_group")
# groupby 2 different columns

(df.select([
    "fruits",
    "cars",
    "B",
    pl.col("B").sum().over("fruits").alias("B_sum_by_fruits"),
    pl.col("B").sum().over("cars").alias("B_sum_by_cars"),
]))
fruits cars B B_sum_by_fruits B_sum_by_cars
str str i64 i64 i64
"apple" "beetle" 3 5 11
"apple" "beetle" 2 5 11
"banana" "beetle" 5 10 11
"banana" "audi" 4 10 4
"banana" "beetle" 1 10 11
# reverse B by groups and show the results in original DF

(df.select([
    "fruits",
    "B",
    pl.col("B").reverse().over("fruits").alias("B_reversed_by_fruits")
]))
fruits B B_reversed_by_fruits
str i64 i64
"apple" 3 2
"apple" 2 3
"banana" 5 1
"banana" 4 4
"banana" 1 5
# Lag a column within "fruits"

(df.select([
    "fruits",
    "B",
    pl.col("B").shift().over("fruits").alias("lag_B_by_fruits")
]))

fruits B lag_B_by_fruits
str i64 i64
"apple" 3 null
"apple" 2 3
"banana" 5 null
"banana" 4 5
"banana" 1 4

Note: To use this notebook, you must first install the Rust evcxr jupyter kernel. Also, note that clone() is used fairly often in the examples. This is because we tend to create one dataset for multiple examples. When this dataset is used, the Rust ownership system will move that dataframe, which will make it unavailable to later examples. By cloning, we can keep using it over and over.


#![allow(unused)]
fn main() {
:dep polars = { version = "0.23.2", features = ["lazy", "csv-file", "strings", "temporal", "dtype-duration", "dtype-categorical", "concat_str", "list", "list_eval", "rank", "lazy_regex"]}
:dep color-eyre = {version = "0.6.2"}
:dep rand = {version = "0.8.5"}
:dep reqwest = { version = "0.11.11", features = ["blocking"]}

use color_eyre::{Result};
use polars::prelude::*;
}

Expressions

fn(Series) -> Series

  • Lazily evaluated
    • Can be optimized
    • Gives the library writer context and informed decisions can be made
  • Embarrassingly parallel
  • Context dependent
    • selection/projection -> Series = COLUMN, LITERAL, or VALUE
    • aggregation -> Series = GROUPS

#![allow(unused)]
fn main() {
let df = df! [
    "A"        => [1, 2, 3, 4, 5],
    "fruits"   => ["banana", "banana", "apple", "apple", "banana"],
    "B"        => [5, 4, 3, 2, 1],
    "cars"     => ["beetle", "audi", "beetle", "beetle", "beetle"],
    "optional" => [Some(28), Some(300), None, Some(2), Some(-30)],
]?;
df
}
shape: (5, 5)
┌─────┬────────┬─────┬────────┬──────────┐
│ A   ┆ fruits ┆ B   ┆ cars   ┆ optional │
│ --- ┆ ---    ┆ --- ┆ ---    ┆ ---      │
│ i32 ┆ str    ┆ i32 ┆ str    ┆ i32      │
╞═════╪════════╪═════╪════════╪══════════╡
│ 1   ┆ banana ┆ 5   ┆ beetle ┆ 28       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ banana ┆ 4   ┆ audi   ┆ 300      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ apple  ┆ 3   ┆ beetle ┆ null     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ apple  ┆ 2   ┆ beetle ┆ 2        │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 5   ┆ banana ┆ 1   ┆ beetle ┆ -30      │
└─────┴────────┴─────┴────────┴──────────┘

Selection context


#![allow(unused)]
fn main() {
// We can select by name
// We'll be re-using the dataframe a bunch, so we'll clone copies as we go.
df.clone().lazy().select([
    col("A"),
    col("B"),
    lit("B"),  // we must tell polars we mean the literal "B"
    col("fruits"),
]).collect()?
}
shape: (5, 4)
┌─────┬─────┬─────────┬────────┐
│ A   ┆ B   ┆ literal ┆ fruits │
│ --- ┆ --- ┆ ---     ┆ ---    │
│ i32 ┆ i32 ┆ str     ┆ str    │
╞═════╪═════╪═════════╪════════╡
│ 1   ┆ 5   ┆ B       ┆ banana │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2   ┆ 4   ┆ B       ┆ banana │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3   ┆ 3   ┆ B       ┆ apple  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4   ┆ 2   ┆ B       ┆ apple  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5   ┆ 1   ┆ B       ┆ banana │
└─────┴─────┴─────────┴────────┘

#![allow(unused)]
fn main() {
// you can select columns with a regex if it starts with '^' and ends with '$'

df.clone().lazy().select([
    col("^A|B$").sum()
]).collect()?
}
shape: (1, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i32 ┆ i32 │
╞═════╪═════╡
│ 15  ┆ 15  │
└─────┴─────┘

#![allow(unused)]
fn main() {
// you can select multiple columns by name

df.clone().lazy().select([
    cols(["A", "B"]).sum()
]).collect()?
}
shape: (1, 2)
┌─────┬─────┐
│ A   ┆ B   │
│ --- ┆ --- │
│ i32 ┆ i32 │
╞═════╪═════╡
│ 15  ┆ 15  │
└─────┴─────┘

#![allow(unused)]
fn main() {
// We select everything in normal order
// Then we select everything in reversed order

df.clone().lazy().select([
    all(),
    all().reverse().suffix("_reverse")
]).collect()?
}
shape: (5, 10)
┌─────┬────────┬─────┬────────┬─────┬────────────────┬───────────┬──────────────┬──────┐
│ A   ┆ fruits ┆ B   ┆ cars   ┆ ... ┆ fruits_reverse ┆ B_reverse ┆ cars_reverse ┆ opti │
│ --- ┆ ---    ┆ --- ┆ ---    ┆     ┆ ---            ┆ ---       ┆ ---          ┆ onal │
│ i32 ┆ str    ┆ i32 ┆ str    ┆     ┆ str            ┆ i32       ┆ str          ┆ _rev │
│     ┆        ┆     ┆        ┆     ┆                ┆           ┆              ┆ erse │
│     ┆        ┆     ┆        ┆     ┆                ┆           ┆              ┆ ---  │
│     ┆        ┆     ┆        ┆     ┆                ┆           ┆              ┆ i32  │
╞═════╪════════╪═════╪════════╪═════╪════════════════╪═══════════╪══════════════╪══════╡
│ 1   ┆ banana ┆ 5   ┆ beetle ┆ ... ┆ banana         ┆ 1         ┆ beetle       ┆ -30  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ banana ┆ 4   ┆ audi   ┆ ... ┆ apple          ┆ 2         ┆ beetle       ┆ 2    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3   ┆ apple  ┆ 3   ┆ beetle ┆ ... ┆ apple          ┆ 3         ┆ beetle       ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4   ┆ apple  ┆ 2   ┆ beetle ┆ ... ┆ banana         ┆ 4         ┆ audi         ┆ 300  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 5   ┆ banana ┆ 1   ┆ beetle ┆ ... ┆ banana         ┆ 5         ┆ beetle       ┆ 28   │
└─────┴────────┴─────┴────────┴─────┴────────────────┴───────────┴──────────────┴──────┘

#![allow(unused)]
fn main() {
// all expressions run in parallel
// single valued `Series` are broadcasted to the shape of the `DataFrame`

df.clone().lazy().select([
    all(),
    all().sum().suffix("_sum")
]).collect()?
}
shape: (5, 10)
┌─────┬────────┬─────┬────────┬─────┬────────────┬───────┬──────────┬──────────────┐
│ A   ┆ fruits ┆ B   ┆ cars   ┆ ... ┆ fruits_sum ┆ B_sum ┆ cars_sum ┆ optional_sum │
│ --- ┆ ---    ┆ --- ┆ ---    ┆     ┆ ---        ┆ ---   ┆ ---      ┆ ---          │
│ i32 ┆ str    ┆ i32 ┆ str    ┆     ┆ str        ┆ i32   ┆ str      ┆ i32          │
╞═════╪════════╪═════╪════════╪═════╪════════════╪═══════╪══════════╪══════════════╡
│ 1   ┆ banana ┆ 5   ┆ beetle ┆ ... ┆ null       ┆ 15    ┆ null     ┆ 300          │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ banana ┆ 4   ┆ audi   ┆ ... ┆ null       ┆ 15    ┆ null     ┆ 300          │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ apple  ┆ 3   ┆ beetle ┆ ... ┆ null       ┆ 15    ┆ null     ┆ 300          │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ apple  ┆ 2   ┆ beetle ┆ ... ┆ null       ┆ 15    ┆ null     ┆ 300          │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5   ┆ banana ┆ 1   ┆ beetle ┆ ... ┆ null       ┆ 15    ┆ null     ┆ 300          │
└─────┴────────┴─────┴────────┴─────┴────────────┴───────┴──────────┴──────────────┘

#![allow(unused)]
fn main() {
// there are `str` and `dt` namespaces for specialized functions

let predicate = col("fruits").str().contains("^b.*");

df.clone().lazy().select([
    predicate
]).collect()?
}
shape: (5, 1)
┌────────┐
│ fruits │
│ ---    │
│ bool   │
╞════════╡
│ true   │
├╌╌╌╌╌╌╌╌┤
│ true   │
├╌╌╌╌╌╌╌╌┤
│ false  │
├╌╌╌╌╌╌╌╌┤
│ false  │
├╌╌╌╌╌╌╌╌┤
│ true   │
└────────┘

#![allow(unused)]
fn main() {
// use the predicate to filter
let predicate = col("fruits").str().contains("^b.*");
df.clone().lazy().filter(predicate).collect()?
}
shape: (3, 5)
┌─────┬────────┬─────┬────────┬──────────┐
│ A   ┆ fruits ┆ B   ┆ cars   ┆ optional │
│ --- ┆ ---    ┆ --- ┆ ---    ┆ ---      │
│ i32 ┆ str    ┆ i32 ┆ str    ┆ i32      │
╞═════╪════════╪═════╪════════╪══════════╡
│ 1   ┆ banana ┆ 5   ┆ beetle ┆ 28       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ banana ┆ 4   ┆ audi   ┆ 300      │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 5   ┆ banana ┆ 1   ┆ beetle ┆ -30      │
└─────┴────────┴─────┴────────┴──────────┘

#![allow(unused)]
fn main() {
// predicate expressions can be used to filter

df.clone().lazy().select([
    col("A").filter(col("fruits").str().contains("^b.*")).sum(),
    (col("B").filter(col("cars").str().contains("^b.*")).sum() * col("B").sum()).alias("some_compute()"),
]).collect()?
}
shape: (1, 2)
┌─────┬────────────────┐
│ A   ┆ some_compute() │
│ --- ┆ ---            │
│ i32 ┆ i32            │
╞═════╪════════════════╡
│ 8   ┆ 165            │
└─────┴────────────────┘

#![allow(unused)]
fn main() {
// We can do arithmetic on columns and (literal) values
// can be evaluated to 1 without programmer knowing

let some_var = 1;

df.clone().lazy().select([
    ((col("A") / lit(124.0) * col("B")) / sum("B") * lit(some_var)).alias("computed")
]).collect()?
}
shape: (5, 1)
┌──────────┐
│ computed │
│ ---      │
│ f64      │
╞══════════╡
│ 0.002688 │
├╌╌╌╌╌╌╌╌╌╌┤
│ 0.004301 │
├╌╌╌╌╌╌╌╌╌╌┤
│ 0.004839 │
├╌╌╌╌╌╌╌╌╌╌┤
│ 0.004301 │
├╌╌╌╌╌╌╌╌╌╌┤
│ 0.002688 │
└──────────┘

#![allow(unused)]
fn main() {
// We can combine columns by a predicate
// Note: this doesn't work as intended. The `==` here does not build a Polars equality
// expression, so the condition is never true and every row falls into the `otherwise` branch;
// use `col("fruits").eq(lit("banana"))` instead.
df.clone().lazy().select([
    col("fruits"),
    col("B"),
    when(col("fruits") == lit("banana")).then(col("B")).otherwise(-1).alias("b when not banana")
]).collect()?
}
shape: (5, 3)
┌────────┬─────┬───────────────────┐
│ fruits ┆ B   ┆ b when not banana │
│ ---    ┆ --- ┆ ---               │
│ str    ┆ i32 ┆ i32               │
╞════════╪═════╪═══════════════════╡
│ banana ┆ 5   ┆ -1                │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ 4   ┆ -1                │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple  ┆ 3   ┆ -1                │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple  ┆ 2   ┆ -1                │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ 1   ┆ -1                │
└────────┴─────┴───────────────────┘

#![allow(unused)]
fn main() {
// We can combine columns by a fold operation on column level

df.clone().lazy().select([
    col("A"),
    col("B"),
    fold_exprs(lit(0), |a, b| Ok(&a + &b), [
        col("A"),
        lit("B"),
        col("B").pow(lit(2)),
        col("A") / lit(2.0)
    ]).alias("fold")
]).collect()?
}
shape: (5, 3)
┌─────┬─────┬───────────┐
│ A   ┆ B   ┆ fold      │
│ --- ┆ --- ┆ ---       │
│ i32 ┆ i32 ┆ str       │
╞═════╪═════╪═══════════╡
│ 1   ┆ 5   ┆ 1B25.00.5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 4   ┆ 2B16.01.0 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ 3   ┆ 3B9.01.5  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ 2   ┆ 4B4.02.0  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 5   ┆ 1   ┆ 5B1.02.5  │
└─────┴─────┴───────────┘

#![allow(unused)]
fn main() {
// even combine all
use std::convert::TryInto;
let height: i32 = df.height().try_into()?;
df.clone().lazy().select([
    range(0i32, height).alias("idx"),
    col("A"),
    col("A").shift(1).alias("A_shifted"),
    concat_str([all()], "").alias("str_concat_1"),  // prefer this
    fold_exprs(col("A"), |a, b| Ok(a + b), [all().exclude(["A"])]).alias("str_concat_2"), // over this (accidentally O(n^2))
]).collect()?
}
shape: (5, 5)
┌─────┬─────┬───────────┬───────────────────┬───────────────────┐
│ idx ┆ A   ┆ A_shifted ┆ str_concat_1      ┆ str_concat_2      │
│ --- ┆ --- ┆ ---       ┆ ---               ┆ ---               │
│ i32 ┆ i32 ┆ i32       ┆ str               ┆ str               │
╞═════╪═════╪═══════════╪═══════════════════╪═══════════════════╡
│ 0   ┆ 1   ┆ null      ┆ 1banana5beetle28  ┆ 1banana5beetle28  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ 2   ┆ 1         ┆ 2banana4audi300   ┆ 2banana4audi300   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ 3   ┆ 2         ┆ null              ┆ null              │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ 4   ┆ 3         ┆ 4apple2beetle2    ┆ 4apple2beetle2    │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ 5   ┆ 4         ┆ 5banana1beetle-30 ┆ 5banana1beetle-30 │
└─────┴─────┴───────────┴───────────────────┴───────────────────┘

Aggregation context

  • expressions are applied over groups instead of columns

#![allow(unused)]
fn main() {
// we can still combine many expressions

df.clone().lazy().sort("cars", SortOptions::default()).groupby(["fruits"])
    .agg([
        col("B").sum().alias("B_sum"),
        sum("B").alias("B_sum2"),  // syntactic sugar for the first
        col("fruits").first().alias("fruits_first"),
        col("A").count().alias("count"),
        col("cars").reverse()
    ]).collect()?
}
shape: (2, 6)
┌────────┬───────┬────────┬──────────────┬───────┬──────────────────────────────┐
│ fruits ┆ B_sum ┆ B_sum2 ┆ fruits_first ┆ count ┆ cars                         │
│ ---    ┆ ---   ┆ ---    ┆ ---          ┆ ---   ┆ ---                          │
│ str    ┆ i32   ┆ i32    ┆ str          ┆ u32   ┆ list[str]                    │
╞════════╪═══════╪════════╪══════════════╪═══════╪══════════════════════════════╡
│ apple  ┆ 5     ┆ 5      ┆ apple        ┆ 2     ┆ ["beetle", "beetle"]         │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ 10    ┆ 10     ┆ banana       ┆ 3     ┆ ["beetle", "beetle", "audi"] │
└────────┴───────┴────────┴──────────────┴───────┴──────────────────────────────┘

#![allow(unused)]
fn main() {
// We can explode the list column "cars"

df.clone().lazy()
    .sort("cars", SortOptions { descending: false, nulls_last: false })
    .groupby(["fruits"])
    .agg([
        col("B").sum().alias("B_sum"),
        sum("B").alias("B_sum2"),  // syntactic sugar for the first
        col("fruits").first().alias("fruits_first"),
        col("A").count().alias("count"),
        col("cars").reverse()
    ])
    .explode(["cars"])
    .collect()?
}
shape: (5, 6)
┌────────┬───────┬────────┬──────────────┬───────┬────────┐
│ fruits ┆ B_sum ┆ B_sum2 ┆ fruits_first ┆ count ┆ cars   │
│ ---    ┆ ---   ┆ ---    ┆ ---          ┆ ---   ┆ ---    │
│ str    ┆ i32   ┆ i32    ┆ str          ┆ u32   ┆ str    │
╞════════╪═══════╪════════╪══════════════╪═══════╪════════╡
│ banana ┆ 10    ┆ 10     ┆ banana       ┆ 3     ┆ beetle │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ banana ┆ 10    ┆ 10     ┆ banana       ┆ 3     ┆ beetle │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ banana ┆ 10    ┆ 10     ┆ banana       ┆ 3     ┆ audi   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ apple  ┆ 5     ┆ 5      ┆ apple        ┆ 2     ┆ beetle │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ apple  ┆ 5     ┆ 5      ┆ apple        ┆ 2     ┆ beetle │
└────────┴───────┴────────┴──────────────┴───────┴────────┘

#![allow(unused)]
fn main() {
df.clone().lazy()
    .groupby(["fruits"])
    .agg([
        col("B").sum().alias("B_sum"),
        col("fruits").first().alias("fruits_first"),
        count(),
        col("B").shift(1).alias("B_shifted")
    ])
    .explode(["B_shifted"])
    .collect()?
}
shape: (5, 5)
┌────────┬───────┬──────────────┬───────┬───────────┐
│ fruits ┆ B_sum ┆ fruits_first ┆ count ┆ B_shifted │
│ ---    ┆ ---   ┆ ---          ┆ ---   ┆ ---       │
│ str    ┆ i32   ┆ str          ┆ u32   ┆ i32       │
╞════════╪═══════╪══════════════╪═══════╪═══════════╡
│ apple  ┆ 5     ┆ apple        ┆ 2     ┆ null      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ apple  ┆ 5     ┆ apple        ┆ 2     ┆ 3         │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ 10    ┆ banana       ┆ 3     ┆ null      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ 10    ┆ banana       ┆ 3     ┆ 5         │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ 10    ┆ banana       ┆ 3     ┆ 4         │
└────────┴───────┴──────────────┴───────┴───────────┘

#![allow(unused)]
fn main() {
// we can also get the list of the groups

df.clone().lazy()
    .groupby(["fruits"])
    .agg([
        col("B").shift(1).alias("shift_B"),
        col("B").reverse().alias("rev_B"),
    ])
    .collect()?
}
shape: (2, 3)
┌────────┬──────────────┬───────────┐
│ fruits ┆ shift_B      ┆ rev_B     │
│ ---    ┆ ---          ┆ ---       │
│ str    ┆ list[i32]    ┆ list[i32] │
╞════════╪══════════════╪═══════════╡
│ banana ┆ [null, 5, 4] ┆ [1, 4, 5] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ apple  ┆ [null, 3]    ┆ [2, 3]    │
└────────┴──────────────┴───────────┘

#![allow(unused)]
fn main() {
// we can do predicates in the groupby as well

df.clone().lazy()
    .groupby(["fruits"])
    .agg([
        col("B").filter(col("B").gt(lit(1))).list().keep_name(),
    ])
    .collect()?
}
shape: (2, 2)
┌────────┬───────────┐
│ fruits ┆ B         │
│ ---    ┆ ---       │
│ str    ┆ list[i32] │
╞════════╪═══════════╡
│ apple  ┆ [3, 2]    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ [5, 4]    │
└────────┴───────────┘

#![allow(unused)]
fn main() {
// and take the mean of only the values where the predicate is true

df.clone().lazy()
    .groupby(["fruits"])
    .agg([
        col("B").filter(col("B").gt(lit(1))).mean(),
    ])
    .collect()?
}
shape: (2, 2)
┌────────┬─────┐
│ fruits ┆ B   │
│ ---    ┆ --- │
│ str    ┆ f64 │
╞════════╪═════╡
│ apple  ┆ 2.5 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ banana ┆ 4.5 │
└────────┴─────┘

#![allow(unused)]
fn main() {
// Another example

df.clone().lazy()
    .groupby(["fruits"])
    .agg([
        col("B").shift_and_fill(1, 0).alias("shifted"),
        col("B").shift_and_fill(1, 0).sum().alias("shifted_sum"),
    ])
    .collect()?
}
shape: (2, 3)
┌────────┬───────────┬─────────────┐
│ fruits ┆ shifted   ┆ shifted_sum │
│ ---    ┆ ---       ┆ ---         │
│ str    ┆ list[i32] ┆ i32         │
╞════════╪═══════════╪═════════════╡
│ apple  ┆ [0, 3]    ┆ 3           │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ [0, 5, 4] ┆ 9           │
└────────┴───────────┴─────────────┘

Window functions!

  • Expression with superpowers.
  • Aggregation in selection context

#![allow(unused)]
fn main() {
col("foo").aggregation_expression(..).over("column_used_to_group")
}

#![allow(unused)]
fn main() {
// groupby 2 different columns

df.clone().lazy()
    .select([
        col("fruits"),
        col("cars"),
        col("B"),
        col("B").sum().over(["fruits"]).alias("B_sum_by_fruits"),
        col("B").sum().over(["cars"]).alias("B_sum_by_cars"),
    ])
    .collect()?
}
shape: (5, 5)
┌────────┬────────┬─────┬─────────────────┬───────────────┐
│ fruits ┆ cars   ┆ B   ┆ B_sum_by_fruits ┆ B_sum_by_cars │
│ ---    ┆ ---    ┆ --- ┆ ---             ┆ ---           │
│ str    ┆ str    ┆ i32 ┆ i32             ┆ i32           │
╞════════╪════════╪═════╪═════════════════╪═══════════════╡
│ banana ┆ beetle ┆ 5   ┆ 10              ┆ 11            │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ audi   ┆ 4   ┆ 10              ┆ 4             │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple  ┆ beetle ┆ 3   ┆ 5               ┆ 11            │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple  ┆ beetle ┆ 2   ┆ 5               ┆ 11            │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ beetle ┆ 1   ┆ 10              ┆ 11            │
└────────┴────────┴─────┴─────────────────┴───────────────┘

#![allow(unused)]
fn main() {
// reverse B by groups and show the results in original DF

df.clone().lazy()
    .select([
        col("fruits"),
        col("B"),
        col("B").reverse().over(["fruits"]).alias("B_reversed_by_fruits")
    ])
    .collect()?
}
shape: (5, 3)
┌────────┬─────┬──────────────────────┐
│ fruits ┆ B   ┆ B_reversed_by_fruits │
│ ---    ┆ --- ┆ ---                  │
│ str    ┆ i32 ┆ i32                  │
╞════════╪═════╪══════════════════════╡
│ banana ┆ 5   ┆ 1                    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ 4   ┆ 4                    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple  ┆ 3   ┆ 2                    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple  ┆ 2   ┆ 3                    │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ 1   ┆ 5                    │
└────────┴─────┴──────────────────────┘

#![allow(unused)]
fn main() {
// Lag a column within "fruits"

df.clone().lazy()
    .select([
        col("fruits"),
        col("B"),
        col("B").shift(1).over(["fruits"]).alias("lag_B_by_fruits")
    ])
    .collect()?
}
shape: (5, 3)
┌────────┬─────┬─────────────────┐
│ fruits ┆ B   ┆ lag_B_by_fruits │
│ ---    ┆ --- ┆ ---             │
│ str    ┆ i32 ┆ i32             │
╞════════╪═════╪═════════════════╡
│ banana ┆ 5   ┆ null            │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ 4   ┆ 5               │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple  ┆ 3   ┆ null            │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple  ┆ 2   ┆ 3               │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ banana ┆ 1   ┆ 4               │
└────────┴─────┴─────────────────┘


Expression API

The full list of possible expressions is available on the Expr class definition in the reference guide.

Video Introduction

Don't enjoy reading? Take a look at this introduction video on Polars and its expressions.

Data types

Polars is entirely based on Arrow data types and backed by Arrow memory arrays. This makes data processing cache-efficient and well suited for Inter Process Communication. Most data types follow the exact implementation from Arrow, with the exception of Utf8 (this is actually LargeUtf8), Categorical, and Object (support is limited).

The data types are:

  • Int8: 8-bit signed integer.
  • Int16: 16-bit signed integer.
  • Int32: 32-bit signed integer.
  • Int64: 64-bit signed integer.
  • UInt8: 8-bit unsigned integer.
  • UInt16: 16-bit unsigned integer.
  • UInt32: 32-bit unsigned integer.
  • UInt64: 64-bit unsigned integer.
  • Float32: 32-bit floating point.
  • Float64: 64-bit floating point.
  • Boolean: Boolean type effectively bit packed.
  • Utf8: String data (this is actually Arrow LargeUtf8 internally).
  • List: A list array contains a child array containing the list values and an offset array (this is actually Arrow LargeList internally).
  • Struct: A struct array is represented as Vec<Series> and is useful to pack multiple/heterogeneous values in a single column.
  • Date: Date representation, internally represented as days since UNIX epoch encoded by a 32-bit signed integer.
  • Datetime: Datetime representation, internally represented as nanoseconds since UNIX epoch encoded by a 64-bit signed integer.
  • Duration: A timedelta type. Created when subtracting Date/Datetime.
  • Time: Time representation, internally represented as nanoseconds since midnight.
  • Object: A limited supported data type that can be any value.

To learn more about the internal representation of these data types, check the Arrow columnar format.
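
As a quick illustration of requesting and inspecting these data types from Python (a minimal sketch):

import polars as pl

s = pl.Series("small_ints", [1, 2, 3], dtype=pl.Int8)
print(s.dtype)    # the Series holds Int8 values

df = pl.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})
print(df.dtypes)  # one data type per column, here Float64 and Utf8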

Coming from Pandas

Here we set out the key points that anyone who has experience with Pandas and wants to try Polars should know. We include both differences in the concepts the libraries are built on and differences in how you should write Polars code compared to Pandas code.

Differences in concepts between Polars and Pandas

Polars does not have an index

Pandas gives a label to each row with an index. Polars does not use an index and each row is indexed by its integer position in the table.

Indexes are not needed! Not having them makes things easier - convince us otherwise!

For more detail on how you select data in Polars see the selecting data section.

Polars uses Apache Arrow arrays to represent data in memory while Pandas uses Numpy arrays

Polars represents data in memory with Arrow arrays while Pandas represents data in memory in Numpy arrays. Apache Arrow is an emerging standard for in-memory columnar analytics that can accelerate data load times, reduce memory usage and accelerate calculations.

Polars can convert data to Numpy format with the to_numpy method.
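
A minimal sketch of that conversion:

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
arr = df.to_numpy()        # 2D numpy ndarray, one row per DataFrame row
col = df["a"].to_numpy()   # 1D numpy ndarray for a single column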

Polars has more support for parallel operations than Pandas

Polars exploits the strong support for concurrency in Rust to run many operations in parallel. While some operations in Pandas are multi-threaded, the core of the library is single-threaded and an additional library such as Dask must be used to parallelise operations.

Polars can lazily evaluate queries and apply query optimization

Eager evaluation is where code is evaluated as soon as you run the code. Lazy evaluation is where running a line of code means that the underlying logic is added to a query plan rather than being evaluated.

Polars supports eager evaluation and lazy evaluation whereas Pandas only supports eager evaluation. The lazy evaluation mode is powerful because Polars carries out automatic query optimization where it examines the query plan and looks for ways to accelerate the query or reduce memory usage.

Dask also supports lazy evaluation where it generates a query plan. However, Dask does not carry out query optimization on the query plan.

Key syntax differences

Users coming from Pandas generally need to know one thing...

polars != pandas

If your Polars code looks like it could be Pandas code, it might run, but it likely runs slower than it should.

Let's go through some typical Pandas code and see how we might write that in Polars.

Selecting data

As there is no index in Polars, there is no .loc or .iloc method - and there is also no SettingWithCopyWarning in Polars.

To learn more about how you select data in Polars see the selecting data section.

However, the best way to select data in Polars is to use the expression API. For example, if you want to select a column in Pandas you can do one of the following:

df['a']
df.loc[:,'a']

but in Polars you would use the .select method:

df.select(['a'])

If you want to select rows based on the values then in Polars you use the .filter method:

df.filter(pl.col('a')<10)

As noted in the section on expressions below, Polars can run operations in .select and filter in parallel and Polars can carry out query optimization on the full set of data selection criteria.

Be lazy

Working in lazy evaluation mode is straightforward and should be your default in Polars as the lazy mode allows Polars to do query optimization.

We can run in lazy mode by either using an implicitly lazy function (such as scan_csv) or explicitly using the lazy method.
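
A small sketch of the explicit route, assuming an eager DataFrame df with the id1 and v1 columns used in the example below:

    lazyDf = df.lazy()                                                    # nothing is executed yet
    groupedDf = lazyDf.groupby('id1').agg(pl.col('v1').sum()).collect()   # the optimized query runs here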

Take the following simple example where we read a CSV file from disk and do a groupby. The CSV file has numerous columns but we just want to do a groupby on one of the id columns (id1) and then sum by a value column (v1). In Pandas this would be:

    df = pd.read_csv(csvFile)
    groupedDf = df.loc[:,['id1','v1']].groupby('id1').sum()

In Polars you can build this query in lazy mode with query optimization and evaluate it by replacing the eager Pandas function read_csv with the implicitly lazy Polars function scan_csv:

    df = pl.scan_csv(csvFile)
    groupedDf = df.groupby('id1').agg(pl.col('v1').sum()).collect()

Polars optimizes this query by identifying that only the id1 and v1 columns are relevant and so will only read these columns from the CSV. By calling the .collect method at the end of the second line we instruct Polars to eagerly evaluate the query.

If you do want to run this query in eager mode you can just replace scan_csv with read_csv in the Polars code.
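For reference, a sketch of that eager variant; no collect call is needed because the query is evaluated immediately:

df = pl.read_csv(csvFile)
groupedDf = df.groupby("id1").agg(pl.col("v1").sum())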

Read more about working with lazy evaluation in the lazy API section.

Express yourself

A typical Pandas script consists of multiple data transformations that are executed sequentially. However, in Polars these transformations can be executed in parallel using expressions.

Column assignment

We have a dataframe df with a column called value. We want to add two new columns, a column called tenXValue where the value column is multiplied by 10 and a column called hundredXValue where the value column is multiplied by 100.

In Pandas this would be:

df["tenXValue"] = df["value"] * 10
df["hundredXValue"] = df["value"] * 100

These column assignments are executed sequentially.

In Polars we add columns to df using the .with_columns method and name them with the .alias method:

df.with_columns([
    (pl.col("value") * 10).alias("tenXValue"),
    (pl.col("value") * 100).alias("hundredXValue"),
])

These column assignments are executed in parallel.

Column assignment based on predicate

In this case we have a dataframe df with columns a, b and c. We want to re-assign the values in column a based on a condition. When the value in column c is equal to 2 then we replace the value in a with the value in b.

In Pandas this would be:

df.loc[df["c"] == 2, "a"] = df.loc[df["c"] == 2, "b"]

while in Polars this would be:

df.with_column(
    pl.when(pl.col("c") == 2)
    .then(pl.col("b"))
    .otherwise(pl.col("a")).alias("a")
)

The Polars way is pure in that the original DataFrame is not modified. The mask is also not computed twice as in Pandas (you could prevent this in Pandas, but that would require setting a temporary variable).

Additionally Polars can compute every branch of an if -> then -> otherwise in parallel. This is valuable when the branches get more expensive to compute.

Filtering

We want to filter the dataframe df with housing data based on some criteria.

In Pandas you filter the dataframe by passing Boolean expressions to the loc method:

df.loc[(df['sqft_living'] > 2500) & (df['price'] < 300000)]

while in Polars you call the filter method:

df.filter(
    (pl.col("m2_living") > 2500) & (pl.col("price") < 300000)
)

The query optimizer in Polars can also detect if you write multiple filters separately and combine them into a single filter in the optimized plan.
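As a small sketch (the "housing.csv" path is a placeholder), writing the two conditions as separate filters in lazy mode still results in a single combined selection in the optimized plan:

lazy_query = (
    pl.scan_csv("housing.csv")
    .filter(pl.col("sqft_living") > 2500)
    .filter(pl.col("price") < 300000)
)
# inspect the optimized plan to see the merged predicate
print(lazy_query.describe_optimized_plan())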

Pandas transform

The Pandas documentation demonstrates an operation on a groupby called transform. In this case we have a dataframe df and we want a new column showing the number of rows in each group.

In Pandas we have:

df = pd.DataFrame({
    "type": ["m", "n", "o", "m", "m", "n", "n"],
    "c": [1, 1, 1, 2, 2, 2, 2],
})

df["size"] = df.groupby("c")["type"].transform(len)

Here Pandas does a groupby on "c", takes column "type", computes the group length and then joins the result back to the original DataFrame producing:

   c type size
0  1    m    3
1  1    n    3
2  1    o    3
3  2    m    4
4  2    m    4
5  2    n    4
6  2    n    4

In Polars the same can be achieved with window functions:

df.select([
    pl.all(),
    pl.col("type").count().over("c").alias("size")
])
shape: (7, 3)
┌─────┬──────┬──────┐
│ c   ┆ type ┆ size │
│ --- ┆ ---  ┆ ---  │
│ i64 ┆ str  ┆ u32  │
╞═════╪══════╪══════╡
│ 1   ┆ m    ┆ 3    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1   ┆ n    ┆ 3    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1   ┆ o    ┆ 3    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ m    ┆ 4    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ m    ┆ 4    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ n    ┆ 4    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ n    ┆ 4    │
└─────┴──────┴──────┘

Because we can store the whole operation in a single expression, we can combine several window functions and even combine different groups!

Polars will cache window expressions that are applied over the same group, so storing them in a single select is both convenient and optimal. In the following example we look at a case where we are calculating group statistics over "c" twice:

df.select([
    pl.all(),
    pl.col("c").count().over("c").alias("size"),
    pl.col("c").sum().over("type").alias("sum"),
    pl.col("c").reverse().over("c").flatten().alias("reverse_type")
])
shape: (7, 5)
┌─────┬──────┬──────┬─────┬──────────────┐
│ c   ┆ type ┆ size ┆ sum ┆ reverse_type │
│ --- ┆ ---  ┆ ---  ┆ --- ┆ ---          │
│ i64 ┆ str  ┆ u32  ┆ i64 ┆ i64          │
╞═════╪══════╪══════╪═════╪══════════════╡
│ 1   ┆ m    ┆ 3    ┆ 5   ┆ 2            │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ n    ┆ 3    ┆ 5   ┆ 2            │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ o    ┆ 3    ┆ 1   ┆ 2            │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ m    ┆ 4    ┆ 5   ┆ 2            │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ m    ┆ 4    ┆ 5   ┆ 1            │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ n    ┆ 4    ┆ 5   ┆ 1            │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ n    ┆ 4    ┆ 5   ┆ 1            │
└─────┴──────┴──────┴─────┴──────────────┘

Missing data

Pandas uses NaN and/or None values to indicate missing values depending on the dtype of the column. In addition the behaviour in Pandas varies depending on whether the default dtypes or optional nullable arrays are used. In Polars missing data corresponds to a null value for all data types.

For float columns Polars permits the use of NaN values. These NaN values are not considered to be missing data but instead a special floating point value.

In Pandas an integer column with missing values is cast to be a float column with NaN values for the missing values (unless using optional nullable integer dtypes). In Polars any missing values in an integer column are simply null values and the column remains an integer column.

See the missing data section for more details.

Coming from Apache Spark

Column-based API vs. Row-based API

Whereas the Spark DataFrame is analogous to a collection of rows, a Polars DataFrame is closer to a collection of columns. This means that you can combine columns in Polars in ways that are not possible in Spark, because Spark preserves the relationship of the data in each row.

Consider this sample dataset:

import polars as pl

df = pl.DataFrame({
    "foo": ["a", "b", "c", "d", "d"],
    "bar": [1, 2, 3, 4, 5],
})

dfs = spark.createDataFrame(
    [
        ("a", 1),
        ("b", 2),
        ("c", 3),
        ("d", 4),
        ("d", 5),
    ],
    schema=["foo", "bar"],
)

Example 1: Combining head and sum

In Polars you can write something like this:

df.select([
    pl.col("foo").sort().head(2),
    pl.col("bar").filter(pl.col("foo") == "d").sum()
])

Output:

shape: (2, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 9   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ b   ┆ 9   │
└─────┴─────┘

The expressions on columns foo and bar are completely independent. Since the expression on bar returns a single value, that value is repeated for each value output by the expression on foo. But a and b have no relation to the data that produced the sum of 9.

To do something similar in Spark, you'd need to compute the sum separately and provide it as a literal:

from pyspark.sql.functions import col, sum, lit

bar_sum = (
    dfs
    .where(col("foo") == "d")
    .groupBy()
    .agg(sum(col("bar")))
    .take(1)[0][0]
)

(
    dfs
    .orderBy("foo")
    .limit(2)
    .withColumn("bar", lit(bar_sum))
    .show()
)

Output:

+---+---+
|foo|bar|
+---+---+
|  a|  9|
|  b|  9|
+---+---+

Example 2: Combining Two heads

In Polars you can combine two different head expressions on the same DataFrame, provided that they return the same number of values.

df.select([
    pl.col("foo").sort().head(2),
    pl.col("bar").sort(reverse=True).head(2),
])

Output:

shape: (2, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 5   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ b   ┆ 4   │
└─────┴─────┘

Again, the two head expressions here are completely independent, and the pairing of a to 5 and b to 4 results purely from the juxtaposition of the two columns output by the expressions.

To accomplish something similar in Spark, you would need to generate an artificial key that enables you to join the values in this way.

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

foo_dfs = (
    dfs
    .withColumn(
        "rownum",
        row_number().over(Window.orderBy("foo"))
    )
)

bar_dfs = (
    dfs
    .withColumn(
        "rownum",
        row_number().over(Window.orderBy(col("bar").desc()))
    )
)

(
    foo_dfs.alias("foo")
    .join(bar_dfs.alias("bar"), on="rownum")
    .select("foo.foo", "bar.bar")
    .limit(2)
    .show()
)

Output:

+---+---+
|foo|bar|
+---+---+
|  a|  5|
|  b|  4|
+---+---+

How can I?

This chapter contains some snippets that will get you up to speed with the most idiomatic way to get things done in Polars.

IO

Polars supports different file types, and their respective parsers are amongst the fastest out there.

For instance, it is faster to load a CSV file via Polars and hand it to Pandas than to load it with Pandas directly. Just run pl.read_csv("<FILE>", rechunk=False).to_pandas() to convince yourself!

Character-Separated Values

Read & Write

Reading a CSV file should look familiar:

df = pl.read_csv("path.csv")

CSV files come in many different flavors, so make sure to check the read_csv() API.

Writing to a CSV file can be done with the write_csv() method.

df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "bak", "baz"]})
df.write_csv("path.csv")

Scan

Polars allows you to scan a CSV input. Scanning delays the actual parsing of the file and instead returns a lazy computation holder called a LazyFrame.

df = pl.scan_csv("path.csv")

If you want to know why this is desirable, you can read more about those Polars optimizations here.

The following video shows how to efficiently load CSV files with Polars and how the built-in query optimization makes this much faster.

Parquet

Loading or writing Parquet files is lightning fast. Pandas uses PyArrow (the Python bindings for Arrow) to load Parquet files into memory, but it has to copy that data into its own Pandas memory representation. With Polars there is no extra cost due to copying as we read Parquet directly into Arrow memory and keep it there.

Read & write

df = pl.read_parquet("path.parquet")
df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "bak", "baz"]})
df.write_parquet("path.parquet")

Scan

Polars allows you to scan a Parquet input. Scanning delays the actual parsing of the file and instead returns a lazy computation holder called a LazyFrame.

df = pl.scan_parquet("path.parquet")

If you want to know why this is desirable, you can read more about those Polars optimizations here.

Dealing with multiple files

Polars can deal with multiple files differently depending on your needs and memory strain.

Let's create some files to give us some context:

import polars as pl

df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "ham", "spam"]})

for i in range(5):
    df.write_csv(f"my_many_files_{i}.csv")

Reading into a single DataFrame

To read multiple files into a single DataFrame, we can use globbing patterns:

df = pl.read_csv("my_many_files_*.csv")
print(df)
shape: (15, 2)
┌─────┬──────┐
│ foo ┆ bar  │
│ --- ┆ ---  │
│ i64 ┆ str  │
╞═════╪══════╡
│ 1   ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ ham  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3   ┆ spam │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1   ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ... ┆ ...  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3   ┆ spam │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1   ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ ham  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3   ┆ spam │
└─────┴──────┘

To see how this works we can take a look at the query plan. Below we see that all files are read separately and concatenated into a single DataFrame. Polars will try to parallelize the reading.

pl.scan_csv("my_many_files_*.csv").show_graph()

single_df_graph

Reading and processing in parallel

If your files don't have to be in a single table you can also build a query plan for each file and execute them in parallel on the Polars thread pool.

All query plan execution is embarrassingly parallel and doesn't require any communication.

import polars as pl
import glob

queries = []
for file in glob.glob("my_many_files_*.csv"):
    q = pl.scan_csv(file).groupby("bar").agg([pl.count(), pl.sum("foo")])
    queries.append(q)

dataframes = pl.collect_all(queries)
print(dataframes)
[shape: (3, 3)
┌──────┬───────┬─────┐
│ bar  ┆ count ┆ foo │
│ ---  ┆ ---   ┆ --- │
│ str  ┆ u32   ┆ i64 │
╞══════╪═══════╪═════╡
│ ham  ┆ 1     ┆ 2   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ spam ┆ 1     ┆ 3   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ null ┆ 1     ┆ 1   │
└──────┴───────┴─────┘, shape: (3, 3)
┌──────┬───────┬─────┐
│ bar  ┆ count ┆ foo │
│ ---  ┆ ---   ┆ --- │
│ str  ┆ u32   ┆ i64 │
╞══════╪═══════╪═════╡
│ null ┆ 1     ┆ 1   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ spam ┆ 1     ┆ 3   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ ham  ┆ 1     ┆ 2   │
└──────┴───────┴─────┘, shape: (3, 3)
┌──────┬───────┬─────┐
│ bar  ┆ count ┆ foo │
│ ---  ┆ ---   ┆ --- │
│ str  ┆ u32   ┆ i64 │
╞══════╪═══════╪═════╡
│ null ┆ 1     ┆ 1   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ ham  ┆ 1     ┆ 2   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ spam ┆ 1     ┆ 3   │
└──────┴───────┴─────┘, shape: (3, 3)
┌──────┬───────┬─────┐
│ bar  ┆ count ┆ foo │
│ ---  ┆ ---   ┆ --- │
│ str  ┆ u32   ┆ i64 │
╞══════╪═══════╪═════╡
│ spam ┆ 1     ┆ 3   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ null ┆ 1     ┆ 1   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ ham  ┆ 1     ┆ 2   │
└──────┴───────┴─────┘, shape: (3, 3)
┌──────┬───────┬─────┐
│ bar  ┆ count ┆ foo │
│ ---  ┆ ---   ┆ --- │
│ str  ┆ u32   ┆ i64 │
╞══════╪═══════╪═════╡
│ spam ┆ 1     ┆ 3   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ null ┆ 1     ┆ 1   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ ham  ┆ 1     ┆ 2   │
└──────┴───────┴─────┘]

Read from MySQL, Postgres, Sqlite, Redshift, Clickhouse

To read from one of the supported databases connector-x needs to be installed.

$ pip install "connectorx>=0.2.0a3"

import polars as pl

conn = "postgres://username:password@server:port/database"
query = "SELECT * FROM foo"

pl.read_sql(query, conn)

Interact with AWS

The Interact with AWS page is under construction.

To read from or write to an AWS bucket, additional dependencies are needed:

$ pip install s3fs

In the next few snippets we'll demonstrate interacting with a Parquet file located on an AWS bucket.

Read

Load a .parquet file using:

import polars as pl
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()
bucket = "<YOUR_BUCKET>"
path = "<YOUR_PATH>"

dataset = pq.ParquetDataset(f"s3://{bucket}/{path}", filesystem=fs)
df = pl.from_arrow(dataset.read())

Write

This content is under construction.

Interact with Google BigQuery

To read or write from GBQ, additional dependencies are needed:

$ pip install google-cloud-bigquery

Read

We can load a query into a DataFrame like this:

import polars as pl
from google.cloud import bigquery

client = bigquery.Client()

# Perform a query.
QUERY = (
    'SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` '
    'WHERE state = "TX" '
    'LIMIT 100')
query_job = client.query(QUERY)  # API request
rows = query_job.result()  # Waits for query to finish

df = pl.from_arrow(rows.to_arrow())

Write

This content is under construction.

Interact with Postgres

Read

To read from postgres, additional dependencies are needed:

$ pip install "connectorx>=0.2.0a3"

import polars as pl

conn = "postgresql://username:password@server:port/database"
query = "SELECT * FROM foo"

pl.read_sql(query, conn)

Write

To write to postgres, additional dependencies are needed:

$ pip install psycopg2-binary

For writing to a postgres database with psycopg2, we use execute_batch. This limits the number of round trips to the server.

We first make sure that all our dtypes are in a format that psycopg2 recognizes, and then we use DataFrame.rows to easily transform the columnar data to rows that the database driver can work with.

from psycopg2 import sql
import psycopg2.extras
import polars as pl

# let's assume we have a DataFrame with some floats, integers, strings, and date64 columns.
df = pl.read_parquet("somefile.parquet")

# first we convert the polars date64 representation to python datetime objects
for col in df:
    # only for date64
    if col.dtype == pl.Date64:
        df = df.with_column(col.dt.to_python_datetime())

# create sql identifiers for the column names
# we do this to safely insert this into a sql query
columns = sql.SQL(",").join(sql.Identifier(name) for name in df.columns)

# create placeholders for the values. These will be filled later
values = sql.SQL(",").join([sql.Placeholder() for _ in df.columns])

table_id = "mytable"

# prepare the insert query
insert_stmt = sql.SQL("INSERT INTO {} ({}) VALUES({});").format(
    sql.Identifier(table_id), columns, values
)

# make a connection
conn = psycopg2.connect()
cur = conn.cursor()

# do the insert
psycopg2.extras.execute_batch(cur, insert_stmt, df.rows())
conn.commit()

Interoperability

Arrow

Arrow is rapidly becoming the de facto standard for columnar data. This means that support for Arrow is growing rapidly (both languages and tools). Due to the amazing effort behind the format, using Arrow is now likely the fastest way to:

  • Read and write Parquet formatted files
  • Read CSV into columnar data
  • Exchange columnar data

Polars uses an Arrow memory buffer as the most basic building block for the Polars Series. This means that we exchange data between Polars and Arrow without copying it. It also means that Polars shares the same performance gains that Arrow receives.

Convert a Polars DataFrame or Series to Arrow using the .to_arrow() method. Similarly, importing from an Arrow data structure can be done with the pl.from_arrow() function.
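A minimal round-trip sketch (the column name is a placeholder):

import polars as pl

df = pl.DataFrame({"foo": [1, 2, 3]})

# export to a pyarrow Table without copying the underlying buffers
arrow_table = df.to_arrow()

# and build a Polars DataFrame from Arrow data again
df_from_arrow = pl.from_arrow(arrow_table)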

NumPy

Polars Series have support for NumPy universal functions (ufuncs). Element-wise functions such as np.exp(), np.cos(), np.divide(), etc. all work with almost zero overhead.

However, one Polars-specific remark: missing values are stored in a separate bitmask that is not visible to NumPy. This can lead to a window function or np.convolve() giving flawed or incomplete results.

Convert a Polars Series to a NumPy array with the .to_numpy() method. Missing values will be replaced by np.nan during the conversion. If the Series does not include missing values, or those values are not desired anymore, the .view() method can be used instead, providing a zero-copy NumPy array of the data.
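A small sketch (the Series content is arbitrary; the view() call is only safe here because there are no missing values):

import polars as pl

s = pl.Series("a", [1.0, 2.0, 3.0])

# copies the data; missing values would become np.nan
arr = s.to_numpy()

# zero-copy view of the underlying buffer; only valid without missing values
arr_view = s.view()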

Selecting data

In this section we show how to select rows and/or columns from a DataFrame. We can select data with expressions or select data with square bracket indexing.

The Expression API is key to writing performant queries in Polars. The simplest way to get started with the Expression API is to get familiar with the filter and select methods in this section.

Although they may give the same output, selecting data with expressions or square bracket indexing are not equivalent. The implementation of selecting data with expressions is completely different from the implementation of selecting data with square bracket indexing.

We strongly recommend selecting data with expressions for almost all use cases. Square bracket indexing is perhaps useful when doing exploratory data analysis in a terminal or notebook when you just want a quick look at a subset of data.

For all other use cases we recommend using expressions because:

  • expressions can be parallelized
  • the expression approach can be used in lazy and eager mode while the indexing approach can only be used in eager mode
  • in lazy mode the query optimizer can optimize expressions

Selecting with expressions

In this page we show how to select rows and columns with the preferred expression approach. We cover:

  • use of the Expression API via the filter and select methods to select data
  • combining these expressions and
  • optimization of these expression in lazy mode.

To select data with expressions we use:

  • the filter method to select rows
  • the select method to select columns

For simplicity we deal with DataFrame examples throughout. The principles are the same for Series objects except that columns obviously cannot be selected in a Series. To illustrate the filter and select methods we define a simple DataFrame:

df = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "color": ["blue", "red", "green"],
        "size": ["small", "medium", "large"],
    }
)
print(df)
shape: (3, 3)
┌─────┬───────┬────────┐
│ id  ┆ color ┆ size   │
│ --- ┆ ---   ┆ ---    │
│ i64 ┆ str   ┆ str    │
╞═════╪═══════╪════════╡
│ 1   ┆ blue  ┆ small  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2   ┆ red   ┆ medium │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3   ┆ green ┆ large  │
└─────┴───────┴────────┘

Selecting rows with the filter method

We can select rows by using the filter method. In the filter method we pass the condition we are using to select the rows as an expression:

filter_df = df.filter(pl.col("id") <= 2)
print(filter_df)
shape: (2, 3)
┌─────┬───────┬────────┐
│ id  ┆ color ┆ size   │
│ --- ┆ ---   ┆ ---    │
│ i64 ┆ str   ┆ str    │
╞═════╪═══════╪════════╡
│ 1   ┆ blue  ┆ small  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2   ┆ red   ┆ medium │
└─────┴───────┴────────┘

We can specify multiple conditions in filter using the & operator:

multi_filter_df = df.filter((pl.col("id") <= 2) & (pl.col("size") == "small"))
print(multi_filter_df)
shape: (1, 3)
┌─────┬───────┬───────┐
│ id  ┆ color ┆ size  │
│ --- ┆ ---   ┆ ---   │
│ i64 ┆ str   ┆ str   │
╞═════╪═══════╪═══════╡
│ 1   ┆ blue  ┆ small │
└─────┴───────┴───────┘

Selecting columns with the select method

We select columns using the select method. In the select method we can specify the columns with:

  • a (string) column name
  • a list of (string) column names
  • a boolean list of the same length as the number of columns
  • an expression such as a condition on the column name
  • a Series

Select a single column

single_select_df = df.select("id")
print(single_select_df)
shape: (3, 1)
┌─────┐
│ id  │
│ --- │
│ i64 │
╞═════╡
│ 1   │
├╌╌╌╌╌┤
│ 2   │
├╌╌╌╌╌┤
│ 3   │
└─────┘

Select a list of columns

list_select_df = df.select(["id", "color"])
print(list_select_df)
shape: (3, 2)
┌─────┬───────┐
│ id  ┆ color │
│ --- ┆ ---   │
│ i64 ┆ str   │
╞═════╪═══════╡
│ 1   ┆ blue  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ red   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3   ┆ green │
└─────┴───────┘

Select columns with an expression

To select based on a condition on the column name:

condition_select_df = df.select(pl.col("^col.*$"))
print(condition_select_df)
shape: (3, 1)
┌───────┐
│ color │
│ ---   │
│ str   │
╞═══════╡
│ blue  │
├╌╌╌╌╌╌╌┤
│ red   │
├╌╌╌╌╌╌╌┤
│ green │
└───────┘

To select based on the dtype of the columns:

dtype_select_df = df.select(pl.col(pl.Int64))
print(dtype_select_df)
shape: (3, 1)
┌─────┐
│ id  │
│ --- │
│ i64 │
╞═════╡
│ 1   │
├╌╌╌╌╌┤
│ 2   │
├╌╌╌╌╌┤
│ 3   │
└─────┘

Selecting rows and columns

We can combine the filter and select methods to select rows and columns

expression_df = df.filter(pl.col("id") <= 2).select(["id", "color"])
print(expression_df)
shape: (2, 2)
┌─────┬───────┐
│ id  ┆ color │
│ --- ┆ ---   │
│ i64 ┆ str   │
╞═════╪═══════╡
│ 1   ┆ blue  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ red   │
└─────┴───────┘

Query optimization

In lazy mode the query optimizer may be able to optimize the query based on the expressions.

In this example we scan a CSV file with many columns using scan_csv and then select a subset of them. The query optimizer creates a query plan that causes only the selected columns to be read from the CSV - see how the Project part of the query plan below states that only 1 of 2 columns will be read:

lazy_select_df = pl.scan_csv("data/appleStock.csv").select(["Date"])
print(lazy_select_df.describe_optimized_plan())
FAST PROJECTION 
CSV SCAN data/appleStock.csv; PROJECT 1/2 COLUMNS; SELECTION: None

If you specify two separate filter conditions the query optimizer will combine them into a single joint condition (see the Selection part of the query plan below):

lazy_filter_df = (
    pl.scan_csv("data/appleStock.csv")
    .filter(
        pl.col("Date") == datetime(1995, 10, 16),
    )
    .filter(pl.col("Close") > 100)
)
print(lazy_filter_df.describe_optimized_plan())
CSV SCAN data/appleStock.csv; PROJECT 2/2 COLUMNS; SELECTION: Some([([(col("Close")) > (100f64)]) & ([(col("Date")) == (813801600000000i64.strict_cast(Datetime(Microseconds, None)).cast(Utf8))])])

Selecting with indexing

In this page we cover use of square bracket indexing to select data. Square bracket indexing can be used to select rows and/or columns.

Indexing is an anti-pattern in Polars

Indexing Polars with square brackets is considered an anti-pattern and the functionality may be removed in the future. Polars strongly favours the expression API with select and filter over square bracket indexing. See the introduction to this section for more information.

Indexing does not work in lazy mode

Selecting data by indexing only works with a DataFrame in eager mode. If you try to select data by indexing on a LazyFrame it will raise an exception that a LazyFrame is not subscriptable. Instead you need to select data using expressions.
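For instance, a rough sketch (the path and column name are placeholders): instead of indexing a LazyFrame, express the subset with select/filter and collect at the end:

lf = pl.scan_csv("path.csv")

# lf[:, "a"] would raise an exception; use expressions instead
df = lf.select(pl.col("a")).collect()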

Rules for square bracket indexing

The rules for square bracket indexing are as follows (depending on the datatypes of the values):

  • numeric

    • axis 0: row
    • axis 1: column
  • numeric + strings

    • axis 0: row (only accept numbers here)
    • axis 1: column (accept numeric + string values)
  • only strings

    • axis 0: column
    • axis 1: error
  • expressions

    All expression evaluations are executed in parallel

    • axis 0: column
    • axis 1: column
    • ..
    • axis n: column

Comparison with pandas

                                            pandas                        polars
select row                                  df.iloc[2]                    df[2, :]
select several rows by their indices        df.iloc[[2, 5, 6]]            df[[2, 5, 6], :]
select slice of rows                        df.iloc[2:6]                  df[2:6, :]
select rows using a boolean mask            df.loc[[True, True, False]]   df.filter([True, True, False])
select rows by a predicate condition        df.loc[df["A"] > 3]           df[df["A"] > 3]
select slice of columns                     df.iloc[:, 1:3]               df[:, 1:3]
select slice of columns by string order     df.loc[:, "A":"Z"]            df[:, "A":"Z"]
select a single value (scalar)              df.loc[2, "A"]                df[2, "A"]
select a single value (scalar)              df.iloc[2, 1]                 df[2, 1]
select a single value (Series/DataFrame)    df.loc[2, ["A"]]              df[2, ["A"]]
select a single value (Series/DataFrame)    df.iloc[2, [1]]               df[2, [1]]

Data handling

Process strings

Thanks to its Arrow backend, Polars string operations are much faster compared to the same operations performed with NumPy or Pandas. In the latter, strings are stored as Python objects. While traversing the np.array or the pd.Series the CPU needs to follow all the string pointers, and jump to many random memory locations -- which is very cache-inefficient. In Polars (via the Arrow data structure) strings are contiguous in memory. Thus traversing is cache-optimal and predictable for the CPU.

The string processing functions in Polars are available in the str namespace.

Below are a few examples. To compute string lengths:

import polars as pl

df = pl.DataFrame({"shakespeare": "All that glitters is not gold".split(" ")})

df = df.with_column(pl.col("shakespeare").str.lengths().alias("letter_count"))

returning:

shape: (6, 2)
┌─────────────┬──────────────┐
│ shakespeare ┆ letter_count │
│ ---         ┆ ---          │
│ str         ┆ u32          │
╞═════════════╪══════════════╡
│ All         ┆ 3            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ that        ┆ 4            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ glitters    ┆ 8            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ is          ┆ 2            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ not         ┆ 3            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ gold        ┆ 4            │
└─────────────┴──────────────┘

And below a regex pattern to filter out articles (the, a, and, etc.) from a sentence:

import polars as pl

df = pl.DataFrame({"a": "The man that ate a whole cake".split(" ")})

df = df.filter(pl.col("a").str.contains(r"(?i)^the$|^a$").is_not())

yielding:

shape: (5, 1)
┌───────┐
│ a     │
│ ---   │
│ str   │
╞═══════╡
│ man   │
├╌╌╌╌╌╌╌┤
│ that  │
├╌╌╌╌╌╌╌┤
│ ate   │
├╌╌╌╌╌╌╌┤
│ whole │
├╌╌╌╌╌╌╌┤
│ cake  │
└───────┘

Timestamp parsing

Polars offers 4 time datatypes:

  • pl.Date, to be used for date objects: the number of days since the UNIX epoch as a 32-bit signed integer.
  • pl.Datetime, to be used for datetime objects: the number of nanoseconds since the UNIX epoch as a 64-bit signed integer.
  • pl.Duration, a time delta type created when subtracting Date/Datetime values.
  • pl.Time, encoded as the number of nanoseconds since midnight.

Polars string (pl.Utf8) datatypes can be parsed as either of them. You can let Polars try to guess the format of the date[time], or explicitly provide a fmt rule.

For instance (check this link for a comprehensive list):

  • "%Y-%m-%d" for "2020-12-31"
  • "%Y/%B/%d" for "2020/December/31"
  • "%B %y" for "December 20"

Below a quick example:

import polars as pl

dataset = pl.DataFrame({"date": ["2020-01-02", "2020-01-03", "2020-01-04"], "index": [1, 2, 3]})

q = dataset.lazy().with_column(pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"))

df = q.collect()

returning:

shape: (3, 2)
┌────────────┬───────┐
│ date       ┆ index │
│ ---        ┆ ---   │
│ date       ┆ i64   │
╞════════════╪═══════╡
│ 2020-01-02 ┆ 1     │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-03 ┆ 2     │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-04 ┆ 3     │
└────────────┴───────┘

All datetime functionality is shown in the dt namespace.

Missing data

This page sets out how missing data is represented in Polars and how missing data can be filled.

null and NaN values

Each column in a DataFrame (or equivalently a Series) is an Arrow array or a collection of Arrow arrays based on the Apache Arrow format. Missing data is represented in Arrow and Polars with a null value. This null missing value applies for all data types including numerical values.

Polars also allows NotaNumber or NaN values for float columns. These NaN values are considered to be a type of floating point data rather than missing data. We discuss NaN values separately below.

You can manually define a missing value with the python None value:

df = pl.DataFrame(
    {
        "value": [1, None],
    },
)
print(df)
shape: (2, 1)
┌───────┐
│ value │
│ ---   │
│ i64   │
╞═══════╡
│ 1     │
├╌╌╌╌╌╌╌┤
│ null  │
└───────┘

In Pandas the value for missing data depends on the dtype of the column. In Polars missing data is always represented as a null value.

Missing data metadata

Each Arrow array used by Polars stores two kinds of metadata related to missing data. This metadata allows Polars to quickly show how many missing values there are and which values are missing.

The first piece of metadata is the null_count - this is the number of rows with null values in the column:

null_count_df = df.null_count()
print(null_count_df)
shape: (1, 1)
┌───────┐
│ value │
│ ---   │
│ u32   │
╞═══════╡
│ 1     │
└───────┘

The null_count method can be called on a DataFrame, a column from a DataFrame or a Series. The null_count method is a cheap operation as null_count is already calculated for the underlying Arrow array.

The second piece of metadata is an array called a validity bitmap that indicates whether each data value is valid or missing. The validity bitmap is memory efficient as it is bit encoded - each value is either a 0 or a 1. This bit encoding means the memory overhead per array is only (array length / 8) bytes. The validity bitmap is used by the is_null method in Polars.

You can return a Series based on the validity bitmap for a column in a DataFrame or a Series with the is_null method:

is_null_series = df.select(
    pl.col("value").is_null(),
)
print(is_null_series)
shape: (2, 1)
┌───────┐
│ value │
│ ---   │
│ bool  │
╞═══════╡
│ false │
├╌╌╌╌╌╌╌┤
│ true  │
└───────┘

The is_null method is a cheap operation that does not require scanning the full column for null values. This is because the validity bitmap already exists and can be returned as a Boolean array.

Filling missing data

Missing data in a Series can be filled with the fill_null method. You have to specify how you want the fill_null method to fill the missing data. The main ways to do this are filling with:

  • a literal such as 0 or "0"
  • a strategy such as filling forwards
  • an expression such as replacing with values from another column
  • interpolation

We illustrate each way to fill nulls by defining a simple DataFrame with a missing value in col2:

df = pl.DataFrame(
    {
        "col1": [1, 2, 3],
        "col2": [1, None, 3],
    },
)
print(df)
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 1    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2    ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3    ┆ 3    │
└──────┴──────┘

Fill with specified literal value

We can fill the missing data with a specified literal value with pl.lit:

fill_literal_df = df.with_column(
    pl.col("col2").fill_null(
        pl.lit(2),
    ),
)
print(fill_literal_df)
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 1    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2    ┆ 2    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3    ┆ 3    │
└──────┴──────┘

Fill with a strategy

We can fill the missing data with a strategy such as filling forward:

fill_forward_df = df.with_column(
    pl.col("col2").fill_null(strategy="forward"),
)
print(fill_forward_df)
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 1    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2    ┆ 1    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3    ┆ 3    │
└──────┴──────┘

See the API docs for other available strategies.

Fill with an expression

For more flexibility we can fill the missing data with an expression. For example, to fill nulls with the median value from that column:

fill_median_df = df.with_column(
    pl.col("col2").fill_null(pl.median("col2")),
)
print(fill_median_df)
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ f64  │
╞══════╪══════╡
│ 1    ┆ 1.0  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2    ┆ 2.0  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3    ┆ 3.0  │
└──────┴──────┘

In this case the column is cast from integer to float because the median is a float statistic.

Fill with interpolation

In addition, we can fill nulls with interpolation (without using the fill_null function):

fill_interpolation_df = df.with_column(
    pl.col("col2").interpolate(),
)
print(fill_interpolation_df)
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 1    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2    ┆ 2    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3    ┆ 3    │
└──────┴──────┘

NotaNumber or NaN values

Missing data in a Series has a null value. However, you can use NotaNumber or NaN values in columns with float datatypes. These NaN values can be created from Numpy's np.nan or the native python float('nan'):

import numpy as np

nan_df = pl.DataFrame(
    {
        "value": [1.0, np.NaN, float("nan"), 3.0],
    },
)
print(nan_df)
shape: (4, 1)
┌───────┐
│ value │
│ ---   │
│ f64   │
╞═══════╡
│ 1.0   │
├╌╌╌╌╌╌╌┤
│ NaN   │
├╌╌╌╌╌╌╌┤
│ NaN   │
├╌╌╌╌╌╌╌┤
│ 3.0   │
└───────┘

In Pandas by default a NaN value in an integer column causes the column to be cast to float. This does not happen in Polars - instead an exception is raised.

NaN values are considered to be a type of floating point data and are not considered to be missing data in Polars. This means:

  • NaN values are not counted with the null_count method
  • NaN values are filled when you use fill_nan method but are not filled with the fill_null method

Polars has is_nan and fill_nan methods which work in a similar way to the is_null and fill_null methods. The underlying Arrow arrays do not have a pre-computed validity bitmask for NaN values so this has to be computed for the is_nan method.
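As a small sketch using the nan_df defined above, we can count the NaN values by summing the Boolean output of is_nan:

nan_count_df = nan_df.select(
    pl.col("value").is_nan().sum().alias("nan_count"),
)
print(nan_count_df)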

One further difference between null and NaN values is that taking the mean of a column with null values excludes the nulls from the calculation, whereas taking the mean of a column with NaN values results in NaN. This behaviour can be avoided by replacing the NaN values with null values:

mean_nan_df = nan_df.with_column(
    pl.col("value").fill_nan(None).alias("value"),
).mean()
print(mean_nan_df)
shape: (1, 1)
┌───────┐
│ value │
│ ---   │
│ f64   │
╞═══════╡
│ 2.0   │
└───────┘

Time Series

Polars has native support for parsing time series data and doing more sophisticated operations such as temporal grouping and resampling.

Parsing dates and times

Datatypes

Polars has the following datetime datatypes:

  • Date: Date representation e.g. 2014-07-08. It is internally represented as days since UNIX epoch encoded by a 32-bit signed integer.
  • Datetime: Datetime representation e.g. 2014-07-08 07:00:00. It is internally represented as a 64-bit integer since the UNIX epoch and can have different units such as ns, us, ms.
  • Duration: A time delta type that is created when subtracting Date/Datetime. Similar to timedelta in python.
  • Time: Time representation, internally represented as nanoseconds since midnight.

Parsing dates from a file

When loading from a CSV file Polars attempts to parse dates and times if the parse_dates flag is set to True:

df = pl.read_csv("data/appleStock.csv", parse_dates=True)
print(df)
shape: (100, 2)
┌────────────┬────────┐
│ Date       ┆ Close  │
│ ---        ┆ ---    │
│ date       ┆ f64    │
╞════════════╪════════╡
│ 1981-02-23 ┆ 24.62  │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1981-05-06 ┆ 27.38  │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1981-05-18 ┆ 28.0   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1981-09-25 ┆ 14.25  │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ...        ┆ ...    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2012-12-04 ┆ 575.85 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2013-07-05 ┆ 417.42 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2013-11-07 ┆ 512.49 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2014-02-25 ┆ 522.06 │
└────────────┴────────┘

On the other hand binary formats such as parquet have a schema that is respected by Polars.

Casting strings to dates

You can also cast a column of datetimes encoded as strings to a datetime type. You do this by calling the string str.strptime method and passing the format of the date string:

df = pl.read_csv("data/appleStock.csv", parse_dates=False)

df = df.with_column(pl.col("Date").str.strptime(pl.Date, fmt="%Y-%m-%d"))
print(df)
shape: (100, 2)
┌────────────┬────────┐
│ Date       ┆ Close  │
│ ---        ┆ ---    │
│ date       ┆ f64    │
╞════════════╪════════╡
│ 1981-02-23 ┆ 24.62  │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1981-05-06 ┆ 27.38  │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1981-05-18 ┆ 28.0   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1981-09-25 ┆ 14.25  │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ...        ┆ ...    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2012-12-04 ┆ 575.85 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2013-07-05 ┆ 417.42 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2013-11-07 ┆ 512.49 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2014-02-25 ┆ 522.06 │
└────────────┴────────┘

The strptime date formats can be found here.

Extracting date features from a date column

You can extract date features such as the year or day from a date column using the .dt namespace:

df_with_year = df.with_column(pl.col("Date").dt.year().alias("year"))
print(df_with_year)
shape: (100, 3)
┌────────────┬────────┬──────┐
│ Date       ┆ Close  ┆ year │
│ ---        ┆ ---    ┆ ---  │
│ date       ┆ f64    ┆ i32  │
╞════════════╪════════╪══════╡
│ 1981-02-23 ┆ 24.62  ┆ 1981 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1981-05-06 ┆ 27.38  ┆ 1981 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1981-05-18 ┆ 28.0   ┆ 1981 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1981-09-25 ┆ 14.25  ┆ 1981 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ...        ┆ ...    ┆ ...  │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2012-12-04 ┆ 575.85 ┆ 2012 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2013-07-05 ┆ 417.42 ┆ 2013 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2013-11-07 ┆ 512.49 ┆ 2013 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2014-02-25 ┆ 522.06 ┆ 2014 │
└────────────┴────────┴──────┘

See the API docs for more date feature options.

Filtering date columns

Filtering date columns works in the same way as with other types of columns using the .filter method.

Polars uses Python's native datetime, date and timedelta objects for equality comparisons with the datatypes pl.Datetime, pl.Date and pl.Duration.

In the following example we use a time series of Apple stock prices.

import polars as pl
from datetime import datetime

df = pl.read_csv("data/appleStock.csv", parse_dates=True)
print(df)
shape: (100, 2)
┌────────────┬────────┐
│ Date       ┆ Close  │
│ ---        ┆ ---    │
│ date       ┆ f64    │
╞════════════╪════════╡
│ 1981-02-23 ┆ 24.62  │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1981-05-06 ┆ 27.38  │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1981-05-18 ┆ 28.0   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1981-09-25 ┆ 14.25  │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ...        ┆ ...    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2012-12-04 ┆ 575.85 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2013-07-05 ┆ 417.42 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2013-11-07 ┆ 512.49 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2014-02-25 ┆ 522.06 │
└────────────┴────────┘

Filtering by single dates

We can filter by a single date by comparing the date column to a Python datetime object in a filter expression:

filtered_df = df.filter(
    pl.col("Date") == datetime(1995, 10, 16),
)
shape: (1, 2)
┌────────────┬───────┐
│ Date       ┆ Close │
│ ---        ┆ ---   │
│ date       ┆ f64   │
╞════════════╪═══════╡
│ 1995-10-16 ┆ 36.13 │
└────────────┴───────┘

Note that we are using the lowercase datetime object from Python's standard library rather than the uppercase Polars Datetime data type.

Filtering by a date range

We can filter by a range of dates using the is_between method in a filter expression with the start and end dates:

filtered_range_df = df.filter(
    pl.col("Date").is_between(datetime(1995, 7, 1), datetime(1995, 11, 1)),
)
shape: (2, 2)
┌────────────┬───────┐
│ Date       ┆ Close │
│ ---        ┆ ---   │
│ date       ┆ f64   │
╞════════════╪═══════╡
│ 1995-07-06 ┆ 47.0  │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1995-10-16 ┆ 36.13 │
└────────────┴───────┘

Fixed and rolling temporal groupby

Grouping by fixed windows with groupby_dynamic

We can calculate temporal statistics using groupby_dynamic to group rows into days/months/years etc.

Annual average example

In the following simple example we calculate the annual average closing price of Apple stock prices. We first load the data from CSV:

df = pl.read_csv("data/appleStock.csv", parse_dates=True)
shape: (100, 2)
┌────────────┬────────┐
│ Date       ┆ Close  │
│ ---        ┆ ---    │
│ date       ┆ f64    │
╞════════════╪════════╡
│ 1981-02-23 ┆ 24.62  │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1981-05-06 ┆ 27.38  │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1981-05-18 ┆ 28.0   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1981-09-25 ┆ 14.25  │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ...        ┆ ...    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2012-12-04 ┆ 575.85 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2013-07-05 ┆ 417.42 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2013-11-07 ┆ 512.49 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2014-02-25 ┆ 522.06 │
└────────────┴────────┘

The dates are sorted in ascending order - if they are not sorted in this way the groupby_dynamic output will not be correct!

To get the annual average closing price we tell groupby_dynamic that we want to:

  • group by the Date column on an annual (1y) basis
  • take the mean values of the Close column for each year:
annual_average_df = df.groupby_dynamic("Date", every="1y").agg(pl.col("Close").mean())

The annual average closing price is then:

shape: (34, 2)
┌────────────┬───────────┐
│ Date       ┆ Close     │
│ ---        ┆ ---       │
│ date       ┆ f64       │
╞════════════╪═══════════╡
│ 1981-01-01 ┆ 23.5625   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1982-01-01 ┆ 11.0      │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1983-01-01 ┆ 30.543333 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1984-01-01 ┆ 27.583333 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ...        ┆ ...       │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2011-01-01 ┆ 368.225   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2012-01-01 ┆ 560.965   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2013-01-01 ┆ 464.955   │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2014-01-01 ┆ 522.06    │
└────────────┴───────────┘

Parameters for groupby_dynamic

A dynamic window is defined by a:

  • every: indicates the interval of the window
  • period: indicates the duration of the window
  • offset: can be used to offset the start of the windows

every, period and offset parameters to control temporal window size

The value for every sets how often the groups start. The time period values are flexible - for example we could take:

  • the average over 2 year intervals by replacing 1y with 2y
  • the average over 18 month periods by replacing 1y with 1y6mo

We can also use the period parameter to set how long the time period for each group is. For example, if we set the every parameter to be 1y and the period parameter to be 2y then we would get groups at one year intervals where each group spans two years.

If the period parameter is not specified then it is set equal to the every parameter so that if the every parameter is set to be 1y then each group spans 1y as well.

Because every does not have to be equal to period, we can create many groups in a very flexible way. They may overlap or leave boundaries between them.

Let's see how the windows for some parameter combinations would look. Let's start out boring. 🥱

  • every: 1 day -> "1d"
  • period: 1 day -> "1d"
this creates adjacent windows of the same size
|--|
   |--|
      |--|
  • every: 1 day -> "1d"
  • period: 2 days -> "2d"
these windows have an overlap of 1 day
|----|
   |----|
      |----|
  • every: 2 days -> "2d"
  • period: 1 day -> "1d"
this would leave gaps between the windows
data points that fall in these gaps will not be a member of any group
|--|
       |--|
              |--|

See the API pages for the full range of time periods.
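As a sketch of the overlapping case using the Apple stock data above, windows that start every year but span two years could be written as:

overlapping_df = df.groupby_dynamic("Date", every="1y", period="2y").agg(
    pl.col("Close").mean()
)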

truncate parameter to set the start date for each group

The truncate parameter is a Boolean variable that determines what datetime value is associated with each group in the output. In the example above the first data point is on 23rd February 1981. If truncate = True (the default) then the date for the first year in the annual average is 1st January 1981. However, if truncate = False then the date for the first year in the annual average is the date of the first data point on 23rd February 1981.
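A sketch of the annual average with truncate=False, so that each group is labelled with its first data point rather than with the start of the window:

annual_average_untruncated_df = df.groupby_dynamic(
    "Date", every="1y", truncate=False
).agg(pl.col("Close").mean())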

Using expressions in groupby_dynamic

We aren't restricted to using simple aggregations like mean in a groupby operation - we can use the full range of expressions available in Polars.

In the snippet below we create a date range with every day ("1d") in 2021 and turn this into a DataFrame.

Then in the groupby_dynamic we create dynamic windows that start every month ("1mo") and have a window length of 1 month. The values that match these dynamic windows are then assigned to that group and can be aggregated with the powerful expression API.

Below we show an example where we use groupby_dynamic to compute:

  • the number of days until the end of the month
  • the number of days in a month

df = pl.date_range(low=datetime(2021, 1, 1), high=datetime(2021, 12, 31), interval="1d", name="time").to_frame()

out = (
    df.groupby_dynamic("time", every="1mo", period="1mo", closed="left")
    .agg(
        [
            pl.col("time").cumcount().reverse().head(3).alias("day/eom"),
            ((pl.col("time") - pl.col("time").first()).last().dt.days() + 1).alias("days_in_month"),
        ]
    )
    .explode("day/eom")
)
print(out)
shape: (36, 3)
┌─────────────────────┬─────────┬───────────────┐
│ time                ┆ day/eom ┆ days_in_month │
│ ---                 ┆ ---     ┆ ---           │
│ datetime[μs]        ┆ u32     ┆ i64           │
╞═════════════════════╪═════════╪═══════════════╡
│ 2021-01-01 00:00:00 ┆ 30      ┆ 31            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-01-01 00:00:00 ┆ 29      ┆ 31            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-01-01 00:00:00 ┆ 28      ┆ 31            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-02-01 00:00:00 ┆ 27      ┆ 28            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...                 ┆ ...     ┆ ...           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-11-01 00:00:00 ┆ 27      ┆ 30            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-01 00:00:00 ┆ 30      ┆ 31            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-01 00:00:00 ┆ 29      ┆ 31            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-01 00:00:00 ┆ 28      ┆ 31            │
└─────────────────────┴─────────┴───────────────┘

Grouping by rolling windows with groupby_rolling

The rolling groupby is another entrance to the groupby context. Unlike groupby_dynamic, the windows are not fixed by the every and period parameters. In a rolling groupby the windows are not fixed at all! They are determined by the values in the index_column.

So imagine having a time column with the values {2021-01-06, 2021-01-10} and a period="5d"; this would create the following windows:


2021-01-01   2021-01-06
    |----------|

       2021-01-05   2021-01-10
             |----------|

Because the windows of a rolling groupby are always determined by the values in the DataFrame column, the number of groups is always equal to the number of rows in the original DataFrame.
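A minimal sketch of a rolling groupby, reusing the Apple stock DataFrame from the previous examples (the 10 day period is an arbitrary choice):

rolling_df = df.groupby_rolling(index_column="Date", period="10d").agg(
    # mean closing price over the 10 days up to and including each row's date
    pl.col("Close").mean().alias("rolling_mean_close")
)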

Combining Groupby and Dynamic / Rolling

Rolling and dynamic groupbys can be combined with normal groupby operations.

Below is an example with a dynamic groupby.

from datetime import datetime

import polars as pl

df = pl.DataFrame(
    {
        "time": pl.date_range(
            low=datetime(2021, 12, 16),
            high=datetime(2021, 12, 16, 3),
            interval="30m",
        ),
        "groups": ["a", "a", "a", "b", "b", "a", "a"],
    }
)
print(df)
shape: (7, 2)
┌─────────────────────┬────────┐
│ time                ┆ groups │
│ ---                 ┆ ---    │
│ datetime[μs]        ┆ str    │
╞═════════════════════╪════════╡
│ 2021-12-16 00:00:00 ┆ a      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 00:30:00 ┆ a      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 01:00:00 ┆ a      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 01:30:00 ┆ b      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:00:00 ┆ b      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:30:00 ┆ a      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 03:00:00 ┆ a      │
└─────────────────────┴────────┘

out = df.groupby_dynamic("time", every="1h", closed="both", by="groups", include_boundaries=True,).agg(
    [
        pl.count(),
    ]
)
print(out)
shape: (7, 5)
┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────┐
│ groups ┆ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ count │
│ ---    ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---   │
│ str    ┆ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32   │
╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪═══════╡
│ a      ┆ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 00:00:00 ┆ 1     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a      ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 3     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 1     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a      ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ 1     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 1     │
└────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────┘

Resampling

We can resample by either:

  • upsampling (moving data to a higher frequency)
  • downsampling (moving data to a lower frequency)
  • combinations of these e.g. first upsample and then downsample

Downsampling to a lower frequency

Polars views downsampling as a special case of the groupby operation and you can do this with groupby_dynamic and groupby_rolling - see the temporal groupby page for examples.

Upsampling to a higher frequency

Let's go through an example where we generate data at 30 minute intervals:

from datetime import datetime

import polars as pl

df = pl.DataFrame(
    {
        "time": pl.date_range(low=datetime(2021, 12, 16), high=datetime(2021, 12, 16, 3), interval="30m"),
        "groups": ["a", "a", "a", "b", "b", "a", "a"],
        "values": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    }
)
print(df)
shape: (7, 3)
┌─────────────────────┬────────┬────────┐
│ time                ┆ groups ┆ values │
│ ---                 ┆ ---    ┆ ---    │
│ datetime[μs]        ┆ str    ┆ f64    │
╞═════════════════════╪════════╪════════╡
│ 2021-12-16 00:00:00 ┆ a      ┆ 1.0    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 00:30:00 ┆ a      ┆ 2.0    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 01:00:00 ┆ a      ┆ 3.0    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 01:30:00 ┆ b      ┆ 4.0    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:00:00 ┆ b      ┆ 5.0    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:30:00 ┆ a      ┆ 6.0    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 03:00:00 ┆ a      ┆ 7.0    │
└─────────────────────┴────────┴────────┘

Upsampling can be done by defining the new sampling interval. By upsampling we are adding in extra rows where we do not have data. As such upsampling by itself gives a DataFrame with nulls. These nulls can then be filled with a fill strategy or interpolation.

Upsampling strategies

In this example we upsample from the original 30 minutes to 15 minutes and then use a forward strategy to replace the nulls with the previous non-null value:

out1 = df.upsample("time", "15m").fill_null(strategy="forward")
print(out1)
shape: (13, 3)
┌─────────────────────┬────────┬────────┐
│ time                ┆ groups ┆ values │
│ ---                 ┆ ---    ┆ ---    │
│ datetime[μs]        ┆ str    ┆ f64    │
╞═════════════════════╪════════╪════════╡
│ 2021-12-16 00:00:00 ┆ a      ┆ 1.0    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 00:15:00 ┆ a      ┆ 1.0    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 00:30:00 ┆ a      ┆ 2.0    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 00:45:00 ┆ a      ┆ 2.0    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ...                 ┆ ...    ┆ ...    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:15:00 ┆ b      ┆ 5.0    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:30:00 ┆ a      ┆ 6.0    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:45:00 ┆ a      ┆ 6.0    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 03:00:00 ┆ a      ┆ 7.0    │
└─────────────────────┴────────┴────────┘

In this example we instead fill the nulls by linear interpolation:

out2 = df.upsample("time", "15m").interpolate().fill_null(strategy="forward")
print(out2)
shape: (13, 3)
┌─────────────────────┬────────┬────────┐
│ time                ┆ groups ┆ values │
│ ---                 ┆ ---    ┆ ---    │
│ datetime[μs]        ┆ str    ┆ f64    │
╞═════════════════════╪════════╪════════╡
│ 2021-12-16 00:00:00 ┆ a      ┆ 1.0    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 00:15:00 ┆ a      ┆ 1.5    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 00:30:00 ┆ a      ┆ 2.0    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 00:45:00 ┆ a      ┆ 2.5    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ ...                 ┆ ...    ┆ ...    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:15:00 ┆ b      ┆ 5.5    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:30:00 ┆ a      ┆ 6.0    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:45:00 ┆ a      ┆ 6.5    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 03:00:00 ┆ a      ┆ 7.0    │
└─────────────────────┴────────┴────────┘

Combining data with concat and join

You can combine data from different DataFrames using:

Concatenation

There are a number of ways to concatenate data from separate DataFrames:

  • two dataframes with the same columns can be vertically concatenated to make a longer dataframe
  • two dataframes with the same number of rows and non-overlapping columns can be horizontally concatenated to make a wider dataframe
  • two dataframes with different numbers of rows and columns can be diagonally concatenated to make a dataframe that might be longer and/or wider. Where column names overlap, values will be vertically concatenated. Where column names do not overlap, new rows and columns will be added. Missing values will be set as null

Vertical concatenation - getting longer

In a vertical concatenation you combine all of the rows from a list of DataFrames into a single longer DataFrame.

df_v1 = pl.DataFrame(
    {
        "a": [1],
        "b": [3],
    }
)
df_v2 = pl.DataFrame(
    {
        "a": [2],
        "b": [4],
    }
)
df_vertical_concat = pl.concat(
    [
        df_v1,
        df_v2,
    ],
    how="vertical",
)
print(df_vertical_concat)
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 4   │
└─────┴─────┘

Vertical concatenation fails when the dataframes do not have the same column names.

Horizontal concatenation - getting wider

In a horizontal concatenation you combine all of the columns from a list of DataFrames into a single wider DataFrame.

df_h1 = pl.DataFrame(
    {
        "l1": [1, 2],
        "l2": [3, 4],
    }
)
df_h2 = pl.DataFrame(
    {
        "r1": [5, 6],
        "r2": [7, 8],
        "r3": [9, 10],
    }
)
df_horizontal_concat = pl.concat(
    [
        df_h1,
        df_h2,
    ],
    how="horizontal",
print(df_horizontal_concat)
shape: (2, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ l1  ┆ l2  ┆ r1  ┆ r2  ┆ r3  │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╡
│ 1   ┆ 3   ┆ 5   ┆ 7   ┆ 9   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 4   ┆ 6   ┆ 8   ┆ 10  │
└─────┴─────┴─────┴─────┴─────┘

Horizontal concatenation fails when dataframes have overlapping columns or a different number of rows.

Diagonal concatenation - getting longer, wider and nullier

In a diagonal concatenation you combine all of the rows and columns from a list of DataFrames into a single longer and/or wider DataFrame.

df_d1 = pl.DataFrame(
    {
        "a": [1],
        "b": [3],
    }
)
df_d2 = pl.DataFrame(
    {
        "a": [2],
        "d": [4],
    }
)

df_diagonal_concat = pl.concat(
    [
        df_d1,
        df_d2,
    ],
    how="diagonal",
)
print(df_diagonal_concat)
shape: (2, 3)
┌─────┬──────┬──────┐
│ a   ┆ b    ┆ d    │
│ --- ┆ ---  ┆ ---  │
│ i64 ┆ i64  ┆ i64  │
╞═════╪══════╪══════╡
│ 1   ┆ 3    ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ null ┆ 4    │
└─────┴──────┴──────┘

Diagonal concatenation generates nulls when the column names do not overlap.

When the dataframe shapes do not match and we have an overlapping semantic key then we can join the dataframes instead of concatenating them.

Rechunking

Before a concatenation we have two dataframes df1 and df2. Each column in df1 and df2 is in one or more chunks in memory. By default, during concatenation the chunks in each column are copied to a single new chunk - this is known as rechunking. Rechunking is an expensive operation, but it is often worth it because future operations will be faster. If you do not want Polars to rechunk the concatenated DataFrame, specify rechunk=False when doing the concatenation, as in the sketch below.
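A minimal sketch, reusing the df_v1 and df_v2 frames from the vertical concatenation example above:

df_no_rechunk = pl.concat(
    [
        df_v1,
        df_v2,
    ],
    how="vertical",
    rechunk=False,  # keep the existing chunks instead of copying them into one new chunk
)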

Joins

Join strategies

Polars supports the following join strategies, which you select via the how argument:

  • inner
  • left
  • outer
  • cross
  • asof
  • semi
  • anti

The inner, left, outer and cross join strategies are standard amongst dataframe libraries. We provide more details on the less familiar semi, anti and asof join strategies below.
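For reference, a minimal sketch of the join syntax with two hypothetical frames df_a and df_b:

import polars as pl

df_a = pl.DataFrame({"key": [1, 2, 3], "x": ["a", "b", "c"]})
df_b = pl.DataFrame({"key": [2, 3, 4], "y": [20, 30, 40]})

# keep every row of df_a; rows without a match in df_b get null for "y"
df_left_join = df_a.join(df_b, on="key", how="left")
print(df_left_join)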

Semi join

Consider the following scenario: a car rental company has a DataFrame showing the cars that it owns with each car having a unique id.

df_cars = pl.DataFrame(
    {
        "id": ["a", "b", "c"],
        "make": ["ford", "toyota", "bmw"],
    }
)
print(df_cars)
shape: (3, 2)
┌─────┬────────┐
│ id  ┆ make   │
│ --- ┆ ---    │
│ str ┆ str    │
╞═════╪════════╡
│ a   ┆ ford   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b   ┆ toyota │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ c   ┆ bmw    │
└─────┴────────┘

The company has another DataFrame showing each repair job carried out on a vehicle.

df_repairs = pl.DataFrame(
    {
        "id": ["c", "c"],
        "cost": [100, 200],
    }
)
print(df_repairs)
shape: (2, 2)
┌─────┬──────┐
│ id  ┆ cost │
│ --- ┆ ---  │
│ str ┆ i64  │
╞═════╪══════╡
│ c   ┆ 100  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ c   ┆ 200  │
└─────┴──────┘

You want to answer this question: which of the cars have had repairs carried out?

An inner join does not answer this question directly as it produces a DataFrame with multiple rows for each car that has had multiple repair jobs:

df_inner_join = df_cars.join(df_repairs, on="id", how="inner")
print(df_inner_join)
shape: (2, 3)
┌─────┬──────┬──────┐
│ id  ┆ make ┆ cost │
│ --- ┆ ---  ┆ ---  │
│ str ┆ str  ┆ i64  │
╞═════╪══════╪══════╡
│ c   ┆ bmw  ┆ 100  │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ c   ┆ bmw  ┆ 200  │
└─────┴──────┴──────┘

However, a semi join produces a single row for each car that has had a repair job carried out.

df_semi_join = df_cars.join(df_repairs, on="id", how="semi")
print(df_semi_join)
shape: (1, 2)
┌─────┬──────┐
│ id  ┆ make │
│ --- ┆ ---  │
│ str ┆ str  │
╞═════╪══════╡
│ c   ┆ bmw  │
└─────┴──────┘

Anti join

Continuing this example, an alternative question might be: which of the cars have not had a repair job carried out? An anti join produces a DataFrame showing all the cars from df_cars where the id is not present in the df_repairs DataFrame.

df_anti_join = df_cars.join(df_repairs, on="id", how="anti")
print(df_anti_join)
shape: (2, 2)
┌─────┬────────┐
│ id  ┆ make   │
│ --- ┆ ---    │
│ str ┆ str    │
╞═════╪════════╡
│ a   ┆ ford   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b   ┆ toyota │
└─────┴────────┘

Asof join

An asof join is like a left join except that we match on the nearest key rather than equal keys. In Polars we can do an asof join with the join method by specifying the "asof" strategy. However, for more flexibility we can use the join_asof method.

Consider the following scenario: a stock market broker has a DataFrame called df_trades showing transactions it has made for different stocks.

df_trades = pl.DataFrame(
    {
        "time": [
            datetime(2020, 1, 1, 9, 1, 0),
            datetime(2020, 1, 1, 9, 1, 0),
            datetime(2020, 1, 1, 9, 3, 0),
            datetime(2020, 1, 1, 9, 6, 0),
        ],
        "stock": ["A", "B", "B", "C"],
        "trade": [101, 299, 301, 500],
    }
)
print(df_trades)
shape: (4, 3)
┌─────────────────────┬───────┬───────┐
│ time                ┆ stock ┆ trade │
│ ---                 ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ str   ┆ i64   │
╞═════════════════════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A     ┆ 101   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:01:00 ┆ B     ┆ 299   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:03:00 ┆ B     ┆ 301   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:06:00 ┆ C     ┆ 500   │
└─────────────────────┴───────┴───────┘

The broker has another DataFrame called df_quotes showing prices it has quoted for these stocks.

df_quotes = pl.DataFrame(
    {
        "time": [
            datetime(2020, 1, 1, 9, 0, 0),
            datetime(2020, 1, 1, 9, 2, 0),
            datetime(2020, 1, 1, 9, 4, 0),
            datetime(2020, 1, 1, 9, 6, 0),
        ],
        "stock": ["A", "B", "C", "A"],
        "quote": [100, 300, 501, 102],
    }
)
print(df_quotes)
shape: (4, 3)
┌─────────────────────┬───────┬───────┐
│ time                ┆ stock ┆ quote │
│ ---                 ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ str   ┆ i64   │
╞═════════════════════╪═══════╪═══════╡
│ 2020-01-01 09:00:00 ┆ A     ┆ 100   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:02:00 ┆ B     ┆ 300   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:04:00 ┆ C     ┆ 501   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:06:00 ┆ A     ┆ 102   │
└─────────────────────┴───────┴───────┘

You want to produce a DataFrame showing, for each trade, the most recent quote provided before the trade. You do this with join_asof (using the default strategy="backward"). To avoid joining a trade on one stock to a quote on another stock, you must also specify an exact match on the stock column with by="stock".

df_asof_join = df_trades.join_asof(df_quotes, on="time", by="stock")
print(df_asof_join)
shape: (4, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ time                ┆ stock ┆ trade ┆ quote │
│ ---                 ┆ ---   ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ str   ┆ i64   ┆ i64   │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A     ┆ 101   ┆ 100   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:01:00 ┆ B     ┆ 299   ┆ null  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:03:00 ┆ B     ┆ 301   ┆ 300   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:06:00 ┆ C     ┆ 500   ┆ 501   │
└─────────────────────┴───────┴───────┴───────┘

If you want to make sure that only quotes within a certain time range are joined to the trades you can specify the tolerance argument. In this case we want to make sure that the last preceding quote is within 1 minute of the trade so we set tolerance = "1m".


df_asof_tolerance_join = df_trades.join_asof(df_quotes, on="time", by="stock", tolerance="1m")
print(df_asof_tolerance_join)
shape: (4, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ time                ┆ stock ┆ trade ┆ quote │
│ ---                 ┆ ---   ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ str   ┆ i64   ┆ i64   │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A     ┆ 101   ┆ 100   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:01:00 ┆ B     ┆ 299   ┆ null  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:03:00 ┆ B     ┆ 301   ┆ 300   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:06:00 ┆ C     ┆ 500   ┆ null  │
└─────────────────────┴───────┴───────┴───────┘

Performance

This chapter covers some gotchas you need to know to squeeze maximum performance out of Polars. When used properly, Polars can run at blazing speeds. Take a look at the results in the H2O AI database benchmark.

Strings

Understanding the memory format used by Arrow and Polars can really increase the performance of your queries. This is especially true for large string data. An Arrow UTF8 array is laid out in memory as follows.

The array ["foo", "bar", "ham"] is encoded by :

  • a concatenated string "foobarham",
  • an offset array indicating the start (and end) of each string [0, 2, 5, 8],
  • a null bitmap, indicating null values.

This memory structure is very cache-efficient when reading the string values, especially compared to a Vec<String> (an array of heap-allocated string data in Rust).

However, if we need to reorder the Arrow UTF8 array, we need to swap around all the bytes of the string values, which can become very expensive when dealing with large strings. For a Vec<String>, on the other hand, we only need to swap the pointers around, moving just 8 bytes per value at little cost.

Reordering a DataFrame embedding a large number of Utf8 Series due to an operation (filtering, joining, grouping, etc.) can quickly become quite expensive.
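To make the layout concrete, here is a small illustrative sketch (plain Python, not Polars API) that builds the value buffer and offset array for the example above and slices individual strings out of them:

# illustrative only: mimics the Arrow UTF8 layout described above
strings = ["foo", "bar", "ham"]

values = "".join(strings)  # "foobarham"
offsets = [0]
for s in strings:
    offsets.append(offsets[-1] + len(s))  # [0, 3, 6, 9]

# reading string i is a cheap slice into one contiguous buffer
print(values[offsets[1]:offsets[2]])  # "bar"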

Categorical type

For this reason Polars has a CategoricalType. A Categorical Series is an array filled with u32 values that each represent a unique string value, which maintains cache efficiency while keeping values cheap to move around.

In the example below we demonstrate how you can cast a Utf8 Series column to a Categorical Series.

import polars as pl

df["utf8-column"].cast(pl.Categorical)

Eager join multiple DataFrames on Categorical data

When two DataFrames need to be joined based on string data, the Categorical data needs to be synchronized (the data in column A of df1 needs to point to the same underlying string data as column B in df2). One can do so by casting the data inside the StringCache context manager, which synchronizes all discoverable string values for the duration of that context manager. If you want the global string cache to exist during the whole run, you can set toggle_string_cache to True.

import polars as pl

df1 = pl.DataFrame({"a": ["foo", "bar", "ham"], "b": [1, 2, 3]})
df2 = pl.DataFrame({"a": ["foo", "spam", "eggs"], "c": [3, 2, 2]})

with pl.StringCache():
    df1 = df1.with_column(pl.col("a").cast(pl.Categorical))
    df2 = df2.with_column(pl.col("a").cast(pl.Categorical))
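With both columns cast under the same string cache, the two frames can then be joined on column "a" like any other join; a minimal follow-up sketch:

# join on the synchronized Categorical column
df_joined = df1.join(df2, on="a", how="inner")
print(df_joined)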

Lazy join multiple DataFrames on Categorical data

A lazy query always has a global string cache (unless you opt out) for the duration of that query (until .collect() is called). The example below shows how you could join two DataFrames with Categorical types.

import polars as pl

lf1 = pl.DataFrame({"a": ["foo", "bar", "ham"], "b": [1, 2, 3]}).lazy()
lf2 = pl.DataFrame({"a": ["foo", "spam", "eggs"], "c": [3, 2, 2]}).lazy()

lf1 = lf1.with_column(pl.col("a").cast(pl.Categorical))
lf2 = lf2.with_column(pl.col("a").cast(pl.Categorical))

lf1.join(lf2, on="a", how="inner")
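Since this is a lazy query, nothing is executed until the result is requested, for example:

df = lf1.join(lf2, on="a", how="inner").collect()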

Optimizations

This chapter will investigate some of the optimizations that are applied by the Polars query optimizer by going through some examples to see how Polars modifies the original query plan.

Lazy API

The Lazy API page is under construction.

To demonstrate the lazy Polars capabilities we'll explore two medium-large datasets of usernames:

Reddit usernames dataset containing 69+ million rows

import polars as pl

from ..paths import DATA_DIR

dataset = pl.read_csv(f"{DATA_DIR}/reddit.csv", stop_after_n_rows=10)
shape: (5, 6)
┌─────┬──────────────────────────┬─────────────┬────────────┬───────────────┬────────────┐
│ id  ┆ name                     ┆ created_utc ┆ updated_on ┆ comment_karma ┆ link_karma │
│ --- ┆ ---                      ┆ ---         ┆ ---        ┆ ---           ┆ ---        │
│ i64 ┆ str                      ┆ i64         ┆ i64        ┆ i64           ┆ i64        │
╞═════╪══════════════════════════╪═════════════╪════════════╪═══════════════╪════════════╡
│ 1   ┆ truman48lamb_jasonbroken ┆ 1397113470  ┆ 1536527864 ┆ 0             ┆ 0          │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ johnethen06_jasonbroken  ┆ 1397113483  ┆ 1536527864 ┆ 0             ┆ 0          │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ yaseinrez_jasonbroken    ┆ 1397113483  ┆ 1536527864 ┆ 0             ┆ 1          │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ Valve92_jasonbroken      ┆ 1397113503  ┆ 1536527864 ┆ 0             ┆ 0          │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5   ┆ srbhuyan_jasonbroken     ┆ 1397113506  ┆ 1536527864 ┆ 0             ┆ 0          │
└─────┴──────────────────────────┴─────────────┴────────────┴───────────────┴────────────┘

and the Runescape usernames dataset containing 55+ million records.

import polars as pl

from ..paths import DATA_DIR

dataset = pl.read_csv(f"{DATA_DIR}/runescape.csv", has_headers=False, stop_after_n_rows=10)
shape: (5, 1)
┌─────────────┐
│ column_1    │
│ ---         │
│ str         │
╞═════════════╡
│ a000        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a0000       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a000000     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a0000000    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a0000000000 │
└─────────────┘

Predicate pushdown

The Predicate pushdown page is under construction

Predicate pushdown is an optimization Polars does that reduces query times and memory usage. A predicate is database jargon for applying a filter on some table, thereby reducing the number of rows on that table.

So let's see if we can load some Reddit data and filter on a few predicates.

import polars as pl

from ..paths import DATA_DIR

q1 = (
    pl.scan_csv(f"{DATA_DIR}/reddit.csv")
    .filter(pl.col("comment_karma") > 0)
    .filter(pl.col("link_karma") > 0)
    .filter(pl.col("name").str.contains(r"^a"))  # filter name that start with an "a"
)

If we were to run the query above, nothing would happen! This is due to lazy evaluation: nothing happens until it is specifically requested. This allows Polars to see the whole context of a query and optimize just in time for execution.

Execution is requested by the .collect method, which would query all available data. While you're writing, optimizing, and checking your query, this is often undesirable. Another method that triggers execution is .fetch. .fetch takes a parameter n_rows and tries to 'fetch' that number of rows from the data source (no guarantees are given though).

So let's "fetch" ~10 Million rows from the source file and apply the predicates.

q1.fetch(n_rows=int(1e7))
shape: (656, 6)
┌─────────┬─────────────┬─────────────┬────────────┬───────────────┬────────────┐
│ id      ┆ name        ┆ created_utc ┆ updated_on ┆ comment_karma ┆ link_karma │
│ ---     ┆ ---         ┆ ---         ┆ ---        ┆ ---           ┆ ---        │
│ i64     ┆ str         ┆ i64         ┆ i64        ┆ i64           ┆ i64        │
╞═════════╪═════════════╪═════════════╪════════════╪═══════════════╪════════════╡
│ 77860   ┆ aquarin     ┆ 1137474000  ┆ 1536528294 ┆ 150           ┆ 11         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 77974   ┆ aadvaark    ┆ 1137301200  ┆ 1536528294 ┆ 26            ┆ 47         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 78004   ┆ apoisel     ┆ 1137301200  ┆ 1536497404 ┆ 42            ┆ 2549       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 78041   ┆ aonic       ┆ 1137301200  ┆ 1536497404 ┆ 2931          ┆ 2095       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...     ┆ ...         ┆ ...         ┆ ...        ┆ ...           ┆ ...        │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1192656 ┆ atothedrian ┆ 1162785880  ┆ 1536497412 ┆ 748           ┆ 585        │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1204607 ┆ akbusiness  ┆ 1162899425  ┆ 1536532995 ┆ 73            ┆ 512        │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1214809 ┆ aaminics    ┆ 1162969322  ┆ 1536533034 ┆ 22            ┆ 6          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1225341 ┆ antonulrich ┆ 1163110623  ┆ 1536497412 ┆ 9304          ┆ 1782       │
└─────────┴─────────────┴─────────────┴────────────┴───────────────┴────────────┘

Above we see that of the 10 million rows, 61,503 rows matched our predicates.

Break it down

In Polars we can visualize the query plan. Let's take a look.

q1.show_graph(optimized=False)

The astute reader might notice that our query is not very optimal because we have three separate FILTER nodes. That means that after every FILTER a new DataFrame is allocated, which will be the input to the next FILTER and then deleted from memory -- that must be redundant, and you know what... they'd be right. The predicates should be combined. We should have written this query:

import polars as pl

from ..paths import DATA_DIR

q2 = pl.scan_csv(f"{DATA_DIR}/reddit.csv").filter(
    (pl.col("comment_karma") > 0) & (pl.col("link_karma") > 0) & (pl.col("name").str.contains(r"^a"))
)

That would translate to:

q2.show_graph(optimized=False)

As we can see the predicates are combined. This would lead to less copying of data.

In comes optimization

Polars tries to spare the query writer that mental overhead and combines the predicates for you. Besides that, it pushes the predicates down to the scan level! Let's see how our optimized query looks.

q1.show_graph(optimized=True)

It may be hard to see, but what is clear is that there is only a single node: the CSV SCAN. The predicate filtering is done during the reading of the csv. This means that this query's memory overhead is reduced by the filtering factor! This makes a huge impact.

Memory

As we have seen, there were ~62,000 rows left after the FILTER. That means that (aside from some memory overhead of the batch size and filter operations) we use \( \frac{6.2\text{e}4}{1\text{e}7} \sim 0.6 \text{%} \) of the memory we would use during an eager evaluation, where we would first read the whole table into memory before applying a filter.

Performance

At the time of writing this, the predicate pushdown also improved the query's runtime performance.

Without optimization, predicate_pushdown=False flag:

real	0m2,401s
user	0m5,457s
sys	0m0,894s

With optimization, predicate_pushdown=True flag:

real	0m1,597s
user	0m6,143s
sys	0m0,647s

Relational algebra

In the visualization of the query plan, you see a \( \sigma \) symbol. This indicates a predicate done at the SCAN level. There is also a \( \pi \) symbol indicating projection (database jargon for column selection), but we'll get to that later.

Cheaper joins

Predicate pushdown optimization will generally also lead to cheaper joins. A join is an expensive operation; the fewer rows involved in a join, the cheaper it becomes.

Projection pushdown

The Projection pushdown page is under construction.

Let's expand our query from the previous section by joining the result of the FILTER operation with the runescape data to find popular Reddit users whose username starts with an "a" and who also played Runescape. That must be something we are all interested in!

The query would look like this:

import polars as pl

from ..paths import DATA_DIR

reddit = (
    pl.scan_csv(f"{DATA_DIR}/reddit.csv")
    .filter(pl.col("comment_karma") > 0)
    .filter(pl.col("link_karma") > 0)
    .filter(pl.col("name").str.contains(r"^a"))
)

runescape = pl.scan_csv("data/runescape.csv", has_headers=False).select(pl.col("column_1").alias("name"))

dataset = reddit.join(runescape, on="name", how="inner").select(["name", "comment_karma", "link_karma"])

df1 = dataset.fetch(int(1e7))
df2 = dataset.fetch(int(1e7), predicate_pushdown=True, projection_pushdown=True)

And yields the following DataFrame.

shape: (0, 3)
┌──────┬───────────────┬────────────┐
│ name ┆ comment_karma ┆ link_karma │
│ ---  ┆ ---           ┆ ---        │
│ str  ┆ i64           ┆ i64        │
╞══════╪═══════════════╪════════════╡
└──────┴───────────────┴────────────┘

Break it down

Again, let's take a look at the query plan.

dataset.show_graph(optimized=False)

Now we're focused on the projections, indicated with π. The first node shows π 3/6, indicating that we select 3 out of 6 columns in the DataFrame. If we look at the csv scans we see a wildcard π */6 and π */1, meaning that we select all 6 columns of the reddit dataset and the one and only column from the runescape dataset respectively.

This query is not very optimal. We select all columns from both datasets and only keep 3/6 after the join. That means some columns were computed during the join operation that could have been ignored. There were also columns parsed during csv scanning only to be dropped at the end. When we are dealing with DataFrames with a large number of columns, the redundant work that is done can be huge. The sketch below shows what the projection would look like if we pushed it down by hand.
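A rough hand-written equivalent of the pushed-down projection (a sketch that reuses DATA_DIR and the column names from the query above):

# select only the columns the rest of the query actually needs, at the scan level
reddit_projected = pl.scan_csv(f"{DATA_DIR}/reddit.csv").select(
    ["name", "comment_karma", "link_karma"]
)
runescape_projected = pl.scan_csv("data/runescape.csv", has_headers=False).select(
    pl.col("column_1").alias("name")
)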

Optimized query

Let's see how Polars optimizes this query.

dataset.show_graph(optimized=True)

The projections are pushed down past the join operation all the way to the csv scans. This means that both the scanning and the join operation have become cheaper due to the query optimization.

Performance

Let's time the result before and after optimization.

Without optimization, predicate_pushdown=False and projection_pushdown=False.

real	0m3,273s
user	0m9,284s
sys	0m1,081s

With optimization, the predicate_pushdown and projection_pushdown flags both set to True.

real	0m1,732s
user	0m7,581s
sys	0m0,783s

We can see that we almost halved the query time on this simple query. With real business data often comprising many columns, filters on missing data, complex groupby operations, and joins, we expect the difference between unoptimized and optimized queries to only grow.

Other optimizations

The Other optimizations page is under construction.

Besides predicate and projection pushdown, Polars does other optimizations.

One important topic is optional caching and parallelization. It's easy to imagine having two different DataFrame computations that lead to a scan of the same file. Polars may cache the scanned file to prevent scanning the same file twice. However, if you want to, you may override this behavior and force Polars to read the same file twice. This could be faster because the scan can then be done in parallel.

Join parallelization

If we look at the previous query, we see that the join operation has as input a computation path with data/reddit.csv as root and one path with data/runescape.csv as root. Polars can observe that there are no dependencies between the two DataFrames and will read both files in parallel. If other operations are done before the join (e.g. groupby, filters, etc.) they are also executed in parallel.

Simplify expressions

Some other optimizations that are applied are expression simplifications. The impact of these optimizations is smaller than that of predicate and projection pushdown, but they likely add up. You can track this issue to see the latest status of those.

Static schemas

Do you know that feeling: you are 20 minutes into an ETL job and, bam, your pipeline fails because a column has a different data type than you assumed? With Polars' lazy API we know the data type and column name at any node in the pipeline.

This is very valuable information and can be used to ensure data integrity at any node in the pipeline.

trip_duration = (pl.col("dropoff_datetime") - pl.col("pickup_datetime")).dt.seconds() / 3600

assert (
    pl.scan_csv("data/yellow_tripdata_2010-01.csv", parse_dates=True)
    .with_column(trip_duration.alias("trip_duration"))
    .filter(pl.col("trip_duration") > 0)
    .groupby(["vendor_id"])
    .agg(
        [
            (pl.col("trip_distance") / pl.col("trip_duration")).mean().alias("avg_speed"),
            (pl.col("tip_amount") / pl.col("passenger_count")).mean().alias("avg_tip_per_person"),
        ]
    )
).schema == {"vendor_id": pl.Utf8, "avg_speed": pl.Float64, "avg_tip_per_person": pl.Float64}

Reference guide

Need to see all available methods/functions of Polars? We have Rust and Python references:

Contributing

See the CONTRIBUTING.md if you would like to contribute to the Polars project.

If you're new to this we recommend starting out with contributing examples to the Python API documentation. The Python API docs are generated from the docstrings of the Python wrapper located in polars/py-polars.

Here is an example commit that adds a docstring.

If you spot any gaps in this User Guide you can submit fixes to the pola-rs/polars-book repo.

Happy hunting!