Polars logo

Introduction

This book is an introduction to the Polars DataFrame library. Its goal is to familiarize you with Polars by walking through examples and comparing it to other solutions. Some design choices are discussed along the way, and the guide also covers optimal usage of Polars.

Even though Polars is completely written in Rust (no runtime overhead!) and uses Arrow -- the native arrow2 Rust implementation -- as its foundation, the examples presented in this guide will be mostly using its higher-level language bindings. Higher-level bindings only serve as a thin wrapper for functionality implemented in the core library.

For Pandas users, our Python package will offer the easiest way to get started with Polars.

Goals and non-goals

The goal of Polars is to provide a lightning fast DataFrame library that utilizes all available cores on your machine. Unlike tools such as dask -- which tries to parallelize existing single-threaded libraries like NumPy and Pandas -- Polars is written from the ground up, designed for parallelization of queries on DataFrames.

Polars goes to great lengths to:

  • Reduce redundant copies
  • Traverse memory cache efficiently
  • Minimize contention in parallelism

Polars is lazy and semi-lazy. It allows you to do most of your work eagerly, similar to Pandas, but it also provides a powerful expression syntax that is optimized and executed within its query engine.

In lazy Polars we are able to do query optimization on the entire query, further improving performance and reducing memory pressure.

Polars keeps track of your query in a logical plan. This plan is optimized and reordered before running it. When a result is requested, Polars distributes the available work to different executors that use the algorithms available in the eager API to produce a result. Because the whole query context is known to the optimizer and executors of the logical plan, processes dependent on separate data sources can be parallelized on the fly.

Performance 🚀🚀

Polars is very fast, and in fact is one of the best performing solutions available. See the results in h2oai's db-benchmark, in particular for the largest datasets that yield a result.

Current status

Below is a concise list of the features that allow Polars to meet its goals:

  • Copy-on-write (COW) semantics
    • "Free" clones
    • Cheap appends
  • Appending without clones
  • Column oriented data storage
    • No block manager (i.e. predictable performance)
  • Missing values indicated with bitmask
    • NaN are different from missing
    • Bitmask optimizations
  • Efficient algorithms
  • Very fast IO
    • Its csv and parquet readers are among the fastest in existence
  • Query optimizations
    • Predicate pushdown
      • Filtering at scan level
    • Projection pushdown
      • Projection at scan level
    • Aggregate pushdown
      • Aggregations at scan level
    • Simplify expressions
    • Parallel execution of physical plan
    • Cardinality based groupby dispatch
      • Different groupby strategies based on data cardinality
  • SIMD vectorization
  • NumPy universal functions

Acknowledgements

Development of Polars is proudly powered by

Xomnia

Getting started

Installation

Installing Polars is just a simple pip install away.

$ pip install polars

All binaries are pre-built for Python v3.6+.
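
You can quickly verify the installation by importing the package and printing its version:

import polars as pl

print(pl.__version__)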

Quick start

Below we show a simple snippet that parses a CSV file, filters it, and finishes with a groupby operation.

import polars as pl

df = pl.read_csv("https://j.mp/iriscsv")
print(df.filter(pl.col("sepal_length") > 5)
      .groupby("species")
      .agg(pl.all().sum())
)

The snippet above will output:

shape: (3, 5)
╭──────────────┬──────────────────┬─────────────────┬──────────────────┬─────────────────╮
│ species      ┆ sepal_length_sum ┆ sepal_width_sum ┆ petal_length_sum ┆ petal_width_sum │
│ ---          ┆ ---              ┆ ---             ┆ ---              ┆ ---             │
│ str          ┆ f64              ┆ f64             ┆ f64              ┆ f64             │
╞══════════════╪══════════════════╪═════════════════╪══════════════════╪═════════════════╡
│ "virginica"  ┆ 324.5            ┆ 146.2           ┆ 273.1            ┆ 99.6            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "versicolor" ┆ 281.9            ┆ 131.8           ┆ 202.9            ┆ 63.3            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ "setosa"     ┆ 116.9            ┆ 81.7            ┆ 33.2             ┆ 6.1             │
╰──────────────┴──────────────────┴─────────────────┴──────────────────┴─────────────────╯

As we can see, Polars pretty-prints the output object, including the column name and datatype as headers.

Lazy quick start

If we want to run this query in lazy Polars we'd write:

import polars as pl

print(
    pl.read_csv("https://j.mp/iriscsv")
    .lazy()
    .filter(pl.col("sepal_length") > 5)
    .groupby("species")
    .agg(pl.all().sum())
    .collect()
)

When the data is stored locally, we can also use scan_csv to run the query in lazy polars.
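
For example, here is a sketch that assumes the iris data has been saved locally as iris.csv:

import polars as pl

print(
    pl.scan_csv("iris.csv")  # lazily scan a local copy of the file instead of reading it eagerly
    .filter(pl.col("sepal_length") > 5)
    .groupby("species")
    .agg(pl.all().sum())
    .collect()
)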

References

If you want to dive right into the Python API docs, check the reference docs.

Lazy API

The lazy API builds a query plan. Nothing is executed until you explicitly ask Polars to execute the query (via LazyFrame.collect(), or LazyFrame.fetch()). This provides Polars with the entire context of the query, allowing optimizations and choosing the fastest algorithm given that context.
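
While developing a query, it can be handy to materialize only a small subset of the data instead of the full result. A minimal sketch (the DataFrame and column names here are made up for illustration):

import polars as pl

df = pl.DataFrame({"a": list(range(100)), "b": list(range(100))})

q = df.lazy().filter(pl.col("a") > 50).groupby("b").agg([pl.col("a").sum()])

subset = q.fetch(10)  # run the plan on at most 10 rows of the source data
full = q.collect()    # run the plan on all rows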

Going from eager to lazy is often as simple as starting your query with .lazy() and ending with .collect().

So the eager snippet above would become:

(
    df.lazy()
    .filter(pl.col("sepal_length") > 5)
    .groupby("species")
    .agg(pl.all().sum())
    .collect()
)

Polars Expressions

Polars has a powerful concept called expressions. Polars expressions can be used in various contexts and are a functional mapping of Fn(Series) -> Series, meaning that they have a Series as an input and a Series as an output. By looking at this functional definition, we can see that the output of an Expr also can serve as the input of an Expr.

That may sound a bit strange, so let's start with an example.

The following is an expression:

pl.col("foo").sort().head(2)

The snippet above says:

  1. Select column "foo"
  2. Then sort the column
  3. Then take the first two values of the sorted output

The power of expressions is that every expression produces a new expression, and that they can be piped together. You can run expressions by passing them to one of the Polars execution contexts.

Here we run two expressions by running df.select:

df.select([
    pl.col("foo").sort().head(2),
    pl.col("bar").filter(pl.col("foo") == 1).sum()
])

All expressions are run in parallel, meaning that separate Polars expressions are embarrassingly parallel. Note that within an expression there may be more parallelization going on.

Expression examples

In this section we will go through some examples, but first let's create a dataset:

import polars as pl
import numpy as np

np.random.seed(12)

df = pl.DataFrame(
    {
        "nrs": [1, 2, 3, None, 5],
        "names": ["foo", "ham", "spam", "egg", None],
        "random": np.random.rand(5),
        "groups": ["A", "A", "B", "C", "B"],
    }
)
print(df)
shape: (5, 4)
┌──────┬───────┬──────────┬────────┐
│ nrs  ┆ names ┆ random   ┆ groups │
│ ---  ┆ ---   ┆ ---      ┆ ---    │
│ i64  ┆ str   ┆ f64      ┆ str    │
╞══════╪═══════╪══════════╪════════╡
│ 1    ┆ foo   ┆ 0.154163 ┆ A      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2    ┆ ham   ┆ 0.74     ┆ A      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3    ┆ spam  ┆ 0.263315 ┆ B      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ null ┆ egg   ┆ 0.533739 ┆ C      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 5    ┆ null  ┆ 0.014575 ┆ B      │
└──────┴───────┴──────────┴────────┘

You can do a lot with expressions. They are so expressive that you sometimes have multiple ways to get the same results. To get a better feel for them let's go through some more examples.

Count unique values

We can count the unique values in a column. Note that we are creating the same result in different ways. To avoid duplicate column names in the DataFrame, we use alias to rename the expressions.

out = df.select(
    [
        pl.col("names").n_unique().alias("unique_names_1"),
        pl.col("names").unique().count().alias("unique_names_2"),
    ]
)
print(out)
shape: (1, 2)
┌────────────────┬────────────────┐
│ unique_names_1 ┆ unique_names_2 │
│ ---            ┆ ---            │
│ u32            ┆ u32            │
╞════════════════╪════════════════╡
│ 5              ┆ 5              │
└────────────────┴────────────────┘

Various aggregations

We can do various aggregations. Below are examples of some of them, but there are more such as median, mean, first, etc.

out = df.select(
    [
        pl.sum("random").alias("sum"),
        pl.min("random").alias("min"),
        pl.max("random").alias("max"),
        pl.col("random").max().alias("other_max"),
        pl.std("random").alias("std dev"),
        pl.var("random").alias("variance"),
    ]
)
print(out)
shape: (1, 6)
┌──────────┬──────────┬──────┬───────────┬──────────┬──────────┐
│ sum      ┆ min      ┆ max  ┆ other_max ┆ std dev  ┆ variance │
│ ---      ┆ ---      ┆ ---  ┆ ---       ┆ ---      ┆ ---      │
│ f64      ┆ f64      ┆ f64  ┆ f64       ┆ f64      ┆ f64      │
╞══════════╪══════════╪══════╪═══════════╪══════════╪══════════╡
│ 1.705842 ┆ 0.014575 ┆ 0.74 ┆ 0.74      ┆ 0.293209 ┆ 0.085971 │
└──────────┴──────────┴──────┴───────────┴──────────┴──────────┘

Filter and conditionals

We can also do some pretty complex things. In the next snippet we count all names ending with the string "am".

out = df.select(
    [
        pl.col("names").filter(pl.col("names").str.contains(r"am$")).count(),
    ]
)
print(out)
shape: (1, 1)
┌───────┐
│ names │
│ ---   │
│ u32   │
╞═══════╡
│ 2     │
└───────┘

Binary functions and modification

In the example below we use a conditional to create a new expression in the following when -> then -> otherwise construct. The when function requires a predicate expression (and thus leads to a boolean Series). The then function expects an expression that will be used in case the predicate evaluates to true, and the otherwise function expects an expression that will be used in case the predicate evaluates to false.

Note that you can pass any expression, or just base expressions like pl.col("foo"), pl.lit(3), pl.lit("bar"), etc.

Finally, we multiply this with the result of a sum expression:

out = df.select(
    [
        pl.when(pl.col("random") > 0.5).then(0).otherwise(pl.col("random")) * pl.sum("nrs"),
    ]
)
print(out)
shape: (5, 1)
┌──────────┐
│ literal  │
│ ---      │
│ f64      │
╞══════════╡
│ 1.695791 │
├╌╌╌╌╌╌╌╌╌╌┤
│ 0.0      │
├╌╌╌╌╌╌╌╌╌╌┤
│ 2.896465 │
├╌╌╌╌╌╌╌╌╌╌┤
│ 0.0      │
├╌╌╌╌╌╌╌╌╌╌┤
│ 0.160325 │
└──────────┘

Window expressions

A Polars expression can also do an implicit GROUPBY, AGGREGATION, and JOIN in a single expression. In the examples below we do a GROUPBY OVER "groups" and AGGREGATE SUM of "random", and in the next expression we GROUPBY OVER "names" and AGGREGATE a LIST of "random". These window functions can be combined with other expressions and are an efficient way to determine group statistics. See more on those group statistics here.

df = df.select(
    [
        pl.col("*"),  # select all
        pl.col("random").sum().over("groups").alias("sum[random]/groups"),
        pl.col("random").list().over("names").alias("random/name"),
    ]
)
print(df)
shape: (5, 6)
┌──────┬───────┬──────────┬────────┬────────────────────┬─────────────┐
│ nrs  ┆ names ┆ random   ┆ groups ┆ sum[random]/groups ┆ random/name │
│ ---  ┆ ---   ┆ ---      ┆ ---    ┆ ---                ┆ ---         │
│ i64  ┆ str   ┆ f64      ┆ str    ┆ f64                ┆ list [f64]  │
╞══════╪═══════╪══════════╪════════╪════════════════════╪═════════════╡
│ 1    ┆ foo   ┆ 0.154163 ┆ A      ┆ 0.894213           ┆ [0.154163]  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2    ┆ ham   ┆ 0.74     ┆ A      ┆ 0.894213           ┆ [0.74]      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3    ┆ spam  ┆ 0.263315 ┆ B      ┆ 0.2778             ┆ [0.263315]  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ egg   ┆ 0.533739 ┆ C      ┆ 0.533739           ┆ [0.533739]  │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5    ┆ null  ┆ 0.014575 ┆ B      ┆ 0.2778             ┆ [0.014575]  │
└──────┴───────┴──────────┴────────┴────────────────────┴─────────────┘

Conclusion

This is the tip of the iceberg in terms of possible expressions. There are a ton more, and they can be combined in a variety of ways.

This page was an introduction to Polars expressions, and gave a glimpse of what's possible with them. In the next page we'll discuss in which contexts expressions can be used. Later in the guide we'll go through expressions in various groupby contexts, all while keeping Polars execution parallel.

Expression contexts

You cannot use an expression just anywhere. An expression needs a context. The available contexts are:

  • selection: df.select([..])
  • groupby aggregation: df.groupby(..).agg([..])
  • hstack / adding columns: df.with_columns([..])

Syntactic sugar

The reason for such a context is that you are actually using the Polars lazy API, even when you use Polars eagerly. For instance this snippet:

df.groupby("foo").agg([pl.col("bar").sum()])

actually desugars to:

(df.lazy().groupby("foo").agg([pl.col("bar").sum()])).collect()

This allows Polars to push the expression into the query engine, do optimizations, and cache intermediate results.
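
If you are curious what the optimizer does with such a query, here is a small sketch of inspecting the logical plan, assuming the describe_plan and describe_optimized_plan methods on LazyFrame:

import polars as pl

df = pl.DataFrame({"foo": ["a", "a", "b"], "bar": [1, 2, 3]})

q = df.lazy().groupby("foo").agg([pl.col("bar").sum()])

# print the logical plan as written and the plan after optimization
print(q.describe_plan())
print(q.describe_optimized_plan())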

Select context

In the select context the selection applies expressions over columns. The expressions in this context must produce Series that are all the same length or have a length of 1.

A Series of a length of 1 will be broadcasted to match the height of the DataFrame. Note that a select may produce new columns that are aggregations, combinations of expressions, or literals.

Selection context

out = df.select(
    [
        pl.sum("nrs"),
        pl.col("names").sort(),
        pl.col("names").first().alias("first name"),
        (pl.mean("nrs") * 10).alias("10xnrs"),
    ]
)
print(out)
shape: (5, 4)
┌─────┬───────┬────────────┬────────┐
│ nrs ┆ names ┆ first name ┆ 10xnrs │
│ --- ┆ ---   ┆ ---        ┆ ---    │
│ i64 ┆ str   ┆ str        ┆ f64    │
╞═════╪═══════╪════════════╪════════╡
│ 11  ┆ null  ┆ foo        ┆ 27.5   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 11  ┆ egg   ┆ foo        ┆ 27.5   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 11  ┆ foo   ┆ foo        ┆ 27.5   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 11  ┆ ham   ┆ foo        ┆ 27.5   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 11  ┆ spam  ┆ foo        ┆ 27.5   │
└─────┴───────┴────────────┴────────┘

Add columns

Adding columns to a DataFrame using with_columns also uses the selection context.

df = df.with_columns(
    [
        pl.sum("nrs").alias("nrs_sum"),
        pl.col("random").count().alias("count"),
    ]
)
print(df)
shape: (5, 6)
┌──────┬───────┬──────────┬────────┬─────────┬───────┐
│ nrs  ┆ names ┆ random   ┆ groups ┆ nrs_sum ┆ count │
│ ---  ┆ ---   ┆ ---      ┆ ---    ┆ ---     ┆ ---   │
│ i64  ┆ str   ┆ f64      ┆ str    ┆ i64     ┆ u32   │
╞══════╪═══════╪══════════╪════════╪═════════╪═══════╡
│ 1    ┆ foo   ┆ 0.154163 ┆ A      ┆ 11      ┆ 5     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2    ┆ ham   ┆ 0.74     ┆ A      ┆ 11      ┆ 5     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3    ┆ spam  ┆ 0.263315 ┆ B      ┆ 11      ┆ 5     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ egg   ┆ 0.533739 ┆ C      ┆ 11      ┆ 5     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5    ┆ null  ┆ 0.014575 ┆ B      ┆ 11      ┆ 5     │
└──────┴───────┴──────────┴────────┴─────────┴───────┘

Groupby context

In the groupby context expressions work on groups and thus may yield results of any length (a group may have many members).

out = df.groupby("groups").agg(
    [
        pl.sum("nrs"),  # sum nrs by groups
        pl.col("random").count().alias("count"),  # count group members
        # sum random where name != null
        pl.col("random").filter(pl.col("names").is_not_null()).sum().suffix("_sum"),
        pl.col("names").reverse().alias(("reversed names")),
    ]
)
print(out)
shape: (3, 5)
┌────────┬──────┬───────┬────────────┬────────────────┐
│ groups ┆ nrs  ┆ count ┆ random_sum ┆ reversed names │
│ ---    ┆ ---  ┆ ---   ┆ ---        ┆ ---            │
│ str    ┆ i64  ┆ u32   ┆ f64        ┆ list [str]     │
╞════════╪══════╪═══════╪════════════╪════════════════╡
│ B      ┆ 8    ┆ 2     ┆ 0.263315   ┆ [null, "spam"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ C      ┆ null ┆ 1     ┆ 0.533739   ┆ ["egg"]        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ A      ┆ 3    ┆ 2     ┆ 0.894213   ┆ ["ham", "foo"] │
└────────┴──────┴───────┴────────────┴────────────────┘

Besides the standard groupby, groupby_dynamic, and groupby_rolling are also entrances to the groupby context.
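
As an illustration, here is a minimal sketch of the dynamic variant, assuming a Polars version that exposes groupby_dynamic with an every argument and a sorted datetime column:

from datetime import datetime

import polars as pl

df = pl.DataFrame(
    {
        "time": [datetime(2021, 1, 1, h) for h in range(6)],  # sorted datetimes
        "value": [1, 2, 3, 4, 5, 6],
    }
)

# aggregate over dynamic windows of two hours instead of over a key column
out = df.groupby_dynamic("time", every="2h").agg([pl.col("value").sum()])
print(out)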

GroupBy

The GroupBy page is under construction.

A multithreaded approach

One of the most efficient ways to process tabular data is to parallelize its processing via the "split-apply-combine" approach. This operation is at the core of the Polars grouping implementation, allowing it to attain lightning-fast operations. Specifically, both the "split" and "apply" phases are executed in a multi-threaded fashion.

A simple grouping operation serves as an example to illustrate this approach.

For the hashing operations performed during the "split" phase, Polars uses a multithreaded lock-free approach.

This parallelization allows the grouping and joining operations (for instance) to be blazingly fast!

Check out this blog post for more details.

Do not kill the parallelization!

We have all heard that Python is slow and does "not scale." Besides the overhead of running "slow" bytecode, Python has to remain within the constraints of the Global Interpreter Lock (GIL). This means that if you were to use a lambda or a custom Python function during a parallelized phase, Polars' speed would be capped at the speed of running Python code, and the GIL would prevent multiple threads from executing the function.

This all feels terribly limiting, especially because we often need those lambda functions in a .groupby() step, for example. This approach is still supported by Polars, but keep in mind that the bytecode and GIL costs still have to be paid.

To mitigate this, Polars implements a powerful syntax defined not only in its lazy API, but also in its eager API.

Polars Expressions

In the introduction on the previous page we discussed that using custom Python functions kills parallelization, and that we can use the expressions of the lazy API to mitigate this. Let's take a look at what that means.

We can start with the simple US congress dataset.

import polars as pl

from .dataset import dataset

q = (
    dataset.lazy()
    .groupby("first_name")
    .agg(
        [
            pl.count(),
            pl.col("gender").list(),
            pl.first("last_name"),
        ]
    )
    .sort("count", reverse=True)
    .limit(5)
)

df = q.collect()
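
The dataset module above comes from the guide's repository. A minimal sketch of what it might contain, assuming the public legislators-historical CSV from the congress-legislators project (the guide also reads several columns as Categorical, which is why later outputs show the cat dtype; that detail is omitted here):

# dataset.py -- hypothetical stand-in for the guide's helper module
import polars as pl

url = "https://theunitedstates.io/congress-legislators/legislators-historical.csv"

dataset = pl.read_csv(url).with_column(
    # parse the birthday column into a Date so .dt can be used on it later
    pl.col("birthday").str.strptime(pl.Date, strict=False)
)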

Basic aggregations

You can easily combine different aggregations by adding multiple expressions in a list. There is no upper bound on the number of aggregations you can do, and you can make any combination you want. In the snippet below we do the following aggregations:

Per GROUP "first_name" we

  • count the number of rows in the group:
    • short form: pl.count("party")
    • full form: pl.col("party").count()
  • aggregate the gender values groups to a list:
    • full form: pl.col("gender").list()
  • get the first value of column "last_name" in the group:
    • short form: pl.first("last_name")
    • full form: pl.col("last_name").first()

Besides the aggregation, we immediately sort the result and limit to the top 5 so that we have a nice summary overview.

import polars as pl

from .dataset import dataset

q = (
    dataset.lazy()
    .groupby("first_name")
    .agg(
        [
            pl.count(),
            pl.col("gender").list(),
            pl.first("last_name"),
        ]
    )
    .sort("count", reverse=True)
    .limit(5)
)

df = q.collect()
shape: (5, 4)
┌────────────┬───────┬─────────────────────┬───────────┐
│ first_name ┆ count ┆ gender              ┆ last_name │
│ ---        ┆ ---   ┆ ---                 ┆ ---       │
│ cat        ┆ u32   ┆ list [cat]          ┆ str       │
╞════════════╪═══════╪═════════════════════╪═══════════╡
│ John       ┆ 1254  ┆ ["M", "M", ... "M"] ┆ Walker    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ William    ┆ 1022  ┆ ["M", "M", ... "M"] ┆ Few       │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ James      ┆ 712   ┆ ["M", "M", ... "M"] ┆ Armstrong │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Thomas     ┆ 453   ┆ ["M", "M", ... "M"] ┆ Tucker    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Charles    ┆ 439   ┆ ["M", "M", ... "M"] ┆ Carroll   │
└────────────┴───────┴─────────────────────┴───────────┘

Conditionals

It's that easy! Let's turn it up a notch. Let's say we want to know how many delegates of a "state" are "Pro" or "Anti" administration. We could directly query that in the aggregation without the need for a lambda or for grooming the DataFrame.

import polars as pl

from .dataset import dataset

q = (
    dataset.lazy()
    .groupby("state")
    .agg(
        [
            (pl.col("party") == "Anti-Administration").sum().alias("anti"),
            (pl.col("party") == "Pro-Administration").sum().alias("pro"),
        ]
    )
    .sort("pro", reverse=True)
    .limit(5)
)

df = q.collect()
shape: (5, 3)
┌───────┬──────┬─────┐
│ state ┆ anti ┆ pro │
│ ---   ┆ ---  ┆ --- │
│ cat   ┆ u32  ┆ u32 │
╞═══════╪══════╪═════╡
│ CT    ┆ 0    ┆ 3   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ NJ    ┆ 0    ┆ 3   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ NC    ┆ 1    ┆ 2   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ SC    ┆ 0    ┆ 1   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ PA    ┆ 1    ┆ 1   │
└───────┴──────┴─────┘

Similarly, this could also be done with a nested GROUPBY, but that doesn't help show off some of these nice features. 😉

import polars as pl

from .dataset import dataset

q = (
    dataset.lazy()
    .groupby(["state", "party"])
    .agg([pl.count("party").alias("count")])
    .filter((pl.col("party") == "Anti-Administration") | (pl.col("party") == "Pro-Administration"))
    .sort("count", reverse=True)
    .limit(5)
)

df = q.collect()
shape: (5, 3)
┌───────┬─────────────────────┬───────┐
│ state ┆ party               ┆ count │
│ ---   ┆ ---                 ┆ ---   │
│ cat   ┆ cat                 ┆ u32   │
╞═══════╪═════════════════════╪═══════╡
│ CT    ┆ Pro-Administration  ┆ 3     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ NJ    ┆ Pro-Administration  ┆ 3     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ VA    ┆ Anti-Administration ┆ 3     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ NC    ┆ Pro-Administration  ┆ 2     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ PA    ┆ Pro-Administration  ┆ 1     │
└───────┴─────────────────────┴───────┘

Filtering

We can also filter the groups. Let's say we want to compute a mean per group, but we don't want to include all values from that group, and we also don't want to filter the rows from the DataFrame (because we need those rows for another aggregation).

In the example below we show how that can be done. Note that we can make Python functions for clarity. These functions don't cost us anything, because we only create Polars expressions; we don't apply a custom function over a Series during the runtime of the query.

from datetime import date

import polars as pl

from .dataset import dataset


def compute_age() -> pl.Expr:
    return date(2021, 1, 1).year - pl.col("birthday").dt.year()


def avg_birthday(gender: str) -> pl.Expr:
    return compute_age().filter(pl.col("gender") == gender).mean().alias(f"avg {gender} birthday")


q = (
    dataset.lazy()
    .groupby(["state"])
    .agg(
        [
            avg_birthday("M"),
            avg_birthday("F"),
            (pl.col("gender") == "M").sum().alias("# male"),
            (pl.col("gender") == "F").sum().alias("# female"),
        ]
    )
    .limit(5)
)

df = q.collect()
shape: (5, 5)
┌───────┬────────────────┬────────────────┬────────┬──────────┐
│ state ┆ avg M birthday ┆ avg F birthday ┆ # male ┆ # female │
│ ---   ┆ ---            ┆ ---            ┆ ---    ┆ ---      │
│ cat   ┆ f64            ┆ f64            ┆ u32    ┆ u32      │
╞═══════╪════════════════╪════════════════╪════════╪══════════╡
│ MO    ┆ 164.338558     ┆ 84.571429      ┆ 327    ┆ 7        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ CO    ┆ 130.8876       ┆ 72.666667      ┆ 89     ┆ 3        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ PA    ┆ 180.238144     ┆ 91.857143      ┆ 1046   ┆ 7        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ PI    ┆ 145.0          ┆ null           ┆ 13     ┆ 0        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ OL    ┆ 281.0          ┆ null           ┆ 2      ┆ 0        │
└───────┴────────────────┴────────────────┴────────┴──────────┘

Sorting

It's common to see a DataFrame being sorted for the sole purpose of managing the ordering during a GROUPBY operation. Let's say that we want to get the names of the oldest and youngest politicians per state. We could SORT and GROUPBY.

import polars as pl

from .dataset import dataset


def get_person() -> pl.Expr:
    return pl.col("first_name") + pl.lit(" ") + pl.col("last_name")


q = (
    dataset.lazy()
    .sort("birthday")
    .groupby(["state"])
    .agg(
        [
            get_person().first().alias("youngest"),
            get_person().last().alias("oldest"),
        ]
    )
    .limit(5)
)

df = q.collect()
shape: (5, 3)
┌───────┬────────────────────┬─────────────────┐
│ state ┆ youngest           ┆ oldest          │
│ ---   ┆ ---                ┆ ---             │
│ cat   ┆ str                ┆ str             │
╞═══════╪════════════════════╪═════════════════╡
│ IA    ┆ Bernhart Henn      ┆ Abby Finkenauer │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ DK    ┆ John Todd          ┆ George Mathews  │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ UT    ┆ John Bernhisel     ┆ Mia Love        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ NJ    ┆ Lambert Cadwalader ┆ Jon Runyan      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ CO    ┆ Allen Bradford     ┆ Jared Polis     │
└───────┴────────────────────┴─────────────────┘

However, if we also want to sort the names alphabetically, this breaks. Luckily we can sort in a groupby context separate from the DataFrame.

import polars as pl

from .dataset import dataset


def get_person() -> pl.Expr:
    return pl.col("first_name") + pl.lit(" ") + pl.col("last_name")


q = (
    dataset.lazy()
    .sort("birthday")
    .groupby(["state"])
    .agg(
        [
            get_person().first().alias("youngest"),
            get_person().last().alias("oldest"),
            get_person().sort().first().alias("alphabetical_first"),
        ]
    )
    .limit(5)
)

df = q.collect()
shape: (5, 4)
┌───────┬────────────────────┬─────────────────────┬────────────────────┐
│ state ┆ youngest           ┆ oldest              ┆ alphabetical_first │
│ ---   ┆ ---                ┆ ---                 ┆ ---                │
│ cat   ┆ str                ┆ str                 ┆ str                │
╞═══════╪════════════════════╪═════════════════════╪════════════════════╡
│ WA    ┆ Columbia Lancaster ┆ Randy Tate          ┆ Addison Foster     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ AR    ┆ Archibald Yell     ┆ Tim Griffin         ┆ Albert Rust        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ WY    ┆ Stephen Nuckolls   ┆ Barbara Cubin       ┆ Alan Simpson       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ KY    ┆ John Edwards       ┆ Ben Chandler        ┆ Aaron Harding      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ PR    ┆ Tulio Larrinaga    ┆ Aníbal Acevedo-Vilá ┆ Antonio Colorado   │
└───────┴────────────────────┴─────────────────────┴────────────────────┘

We can even sort by another column in the groupby context. If we want to know if the alphabetically sorted name is male or female we could add: pl.col("gender").sort_by("first_name").first().alias("gender")

import polars as pl

from .dataset import dataset


def get_person() -> pl.Expr:
    return pl.col("first_name") + pl.lit(" ") + pl.col("last_name")


q = (
    dataset.lazy()
    .sort("birthday")
    .groupby(["state"])
    .agg(
        [
            get_person().first().alias("youngest"),
            get_person().last().alias("oldest"),
            get_person().sort().first().alias("alphabetical_first"),
            pl.col("gender").sort_by("first_name").first().alias("gender"),
        ]
    )
    .sort("state")
    .limit(5)
)

df = q.collect()
shape: (5, 5)
┌───────┬───────────────────┬────────────────┬────────────────────┬────────┐
│ state ┆ youngest          ┆ oldest         ┆ alphabetical_first ┆ gender │
│ ---   ┆ ---               ┆ ---            ┆ ---                ┆ ---    │
│ cat   ┆ str               ┆ str            ┆ str                ┆ cat    │
╞═══════╪═══════════════════╪════════════════╪════════════════════╪════════╡
│ DE    ┆ Samuel White      ┆ John Carney    ┆ Albert Polk        ┆ M      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ VA    ┆ William Grayson   ┆ Scott Taylor   ┆ Abraham Venable    ┆ M      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ SC    ┆ Ralph Izard       ┆ Joe Cunningham ┆ Abraham Nott       ┆ M      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ MD    ┆ Benjamin Contee   ┆ Frank Kratovil ┆ Albert Blakeney    ┆ M      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ PA    ┆ Thomas Fitzsimons ┆ Ryan Costello  ┆ Aaron Kreider      ┆ M      │
└───────┴───────────────────┴────────────────┴────────────────────┴────────┘

Conclusion

In the examples above we've seen that we can do a lot by combining expressions. By doing so we delay the use of custom Python functions that slow down the queries (by the slow nature of Python AND the GIL).

If we are missing a type of expression, let us know by opening a feature request!

Folds

Polars provides expressions/methods for horizontal aggregations like sum, min, mean, etc. by setting the argument axis=1. However, when you need a more complex aggregation the default methods provided by the Polars library may not be sufficient. That's when folds come in handy.
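
For example, a quick sketch of such a built-in horizontal reduction (the axis=1 reductions are methods on the DataFrame itself):

import polars as pl

df = pl.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [10, 20, 30],
    }
)

# built-in horizontal aggregation: sum the values across each row
print(df.sum(axis=1))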

The Polars fold expression operates on columns for maximum speed. It utilizes the data layout very efficiently and often has vectorized execution.

Let's start with an example by implementing the sum operation ourselves, with a fold.

Manual Sum

df = pl.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [10, 20, 30],
    }
)

out = df.select(
    pl.fold(acc=pl.lit(0), f=lambda acc, x: acc + x, exprs=pl.col("*")).alias("sum"),
)
print(out)
shape: (3, 1)
┌─────┐
│ sum │
│ --- │
│ i64 │
╞═════╡
│ 11  │
├╌╌╌╌╌┤
│ 22  │
├╌╌╌╌╌┤
│ 33  │
└─────┘

The snippet above recursively applies the function f(acc, x) -> acc to an accumulator acc and a new column x. The function operates on columns individually and can take advantage of cache efficiency and vectorization.
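
The accumulator does not have to be a running sum; f can return any expression. As a sketch, here is a horizontal maximum over the same df, using a when/then/otherwise expression inside the fold (this assumes the values are non-negative, since the accumulator starts at 0):

out = df.select(
    pl.fold(
        acc=pl.lit(0),
        f=lambda acc, x: pl.when(acc > x).then(acc).otherwise(x),
        exprs=pl.col("*"),
    ).alias("row_max"),
)
print(out)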

Conditional

In the case where you'd want to apply a condition/predicate on all columns in a DataFrame a fold operation can be a very concise way to express this.

df = pl.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [0, 1, 2],
    }
)

out = df.filter(
    pl.fold(
        acc=pl.lit(True),
        f=lambda acc, x: acc & x,
        exprs=pl.col("*") > 1,
    )
)
print(out)
shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 3   ┆ 2   │
└─────┴─────┘

In the snippet we filter all rows where each column value is > 1.

Folds and string data

Folds could be used to concatenate string data. However, due to the materialization of intermediate columns, this operation has quadratic complexity.

Therefore, we recommend using the concat_str expression for this.

df = pl.DataFrame(
    {
        "a": ["a", "b", "c"],
        "b": [1, 2, 3],
    }
)

out = df.select(
    [
        pl.concat_str(["a", "b"]),
    ]
)
print(out)
shape: (3, 1)
┌─────┐
│ a   │
│ --- │
│ str │
╞═════╡
│ a1  │
├╌╌╌╌╌┤
│ b2  │
├╌╌╌╌╌┤
│ c3  │
└─────┘

Window functions 🚀🚀

Window functions are expressions with superpowers. They allow you to perform aggregations on groups in the select context. Let's get a feel for what that means. First we create a dataset. The dataset loaded in the snippet below contains information about pokemon and has the following columns:

['#', 'Name', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Generation', 'Legendary']

import polars as pl

# then let's load some csv data with information about pokemon
df = pl.read_csv(
    "https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv"
)
print(df)
shape: (163, 13)
┌─────┬───────────────────────┬─────────┬────────┬─────┬─────────┬───────┬────────────┬───────────┐
│ #   ┆ Name                  ┆ Type 1  ┆ Type 2 ┆ ... ┆ Sp. Def ┆ Speed ┆ Generation ┆ Legendary │
│ --- ┆ ---                   ┆ ---     ┆ ---    ┆     ┆ ---     ┆ ---   ┆ ---        ┆ ---       │
│ i64 ┆ str                   ┆ str     ┆ str    ┆     ┆ i64     ┆ i64   ┆ i64        ┆ bool      │
╞═════╪═══════════════════════╪═════════╪════════╪═════╪═════════╪═══════╪════════════╪═══════════╡
│ 1   ┆ Bulbasaur             ┆ Grass   ┆ Poison ┆ ... ┆ 65      ┆ 45    ┆ 1          ┆ false     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ Ivysaur               ┆ Grass   ┆ Poison ┆ ... ┆ 80      ┆ 60    ┆ 1          ┆ false     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ Venusaur              ┆ Grass   ┆ Poison ┆ ... ┆ 100     ┆ 80    ┆ 1          ┆ false     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ VenusaurMega Venusaur ┆ Grass   ┆ Poison ┆ ... ┆ 120     ┆ 80    ┆ 1          ┆ false     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ...                   ┆ ...     ┆ ...    ┆ ... ┆ ...     ┆ ...   ┆ ...        ┆ ...       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 147 ┆ Dratini               ┆ Dragon  ┆ null   ┆ ... ┆ 50      ┆ 50    ┆ 1          ┆ false     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 148 ┆ Dragonair             ┆ Dragon  ┆ null   ┆ ... ┆ 70      ┆ 70    ┆ 1          ┆ false     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 149 ┆ Dragonite             ┆ Dragon  ┆ Flying ┆ ... ┆ 100     ┆ 80    ┆ 1          ┆ false     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 150 ┆ Mewtwo                ┆ Psychic ┆ null   ┆ ... ┆ 90      ┆ 130   ┆ 1          ┆ true      │
└─────┴───────────────────────┴─────────┴────────┴─────┴─────────┴───────┴────────────┴───────────┘

Groupby Aggregations in selection

Below we show how to use window functions to group over different columns and perform an aggregation on them. Doing so allows us to use multiple groupby operations in parallel, using a single query. The results of the aggregation are projected back to the original rows. Therefore, a window function will always lead to a DataFrame with the same size as the original.

Note how we call .over("Type 1") and .over(["Type 1", "Type 2"]). Using window functions we can aggregate over different groups in a single select call!

The best part is, this won't cost you anything. The computed groups are cached and shared between different window expressions.


out = df.select(
    [
        "Type 1",
        "Type 2",
        pl.col("Attack").mean().over("Type 1").alias("avg_attack_by_type"),
        pl.col("Defense").mean().over(["Type 1", "Type 2"]).alias("avg_defense_by_type_combination"),
        pl.col("Attack").mean().alias("avg_attack"),
    ]
)
print(out)
shape: (163, 5)
┌─────────┬────────┬────────────────────┬─────────────────────────────────┬────────────┐
│ Type 1  ┆ Type 2 ┆ avg_attack_by_type ┆ avg_defense_by_type_combination ┆ avg_attack │
│ ---     ┆ ---    ┆ ---                ┆ ---                             ┆ ---        │
│ str     ┆ str    ┆ f64                ┆ f64                             ┆ f64        │
╞═════════╪════════╪════════════════════╪═════════════════════════════════╪════════════╡
│ Grass   ┆ Poison ┆ 72.923077          ┆ 67.8                            ┆ 75.349693  │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Grass   ┆ Poison ┆ 72.923077          ┆ 67.8                            ┆ 75.349693  │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Grass   ┆ Poison ┆ 72.923077          ┆ 67.8                            ┆ 75.349693  │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Grass   ┆ Poison ┆ 72.923077          ┆ 67.8                            ┆ 75.349693  │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...     ┆ ...    ┆ ...                ┆ ...                             ┆ ...        │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Dragon  ┆ null   ┆ 94.0               ┆ 55.0                            ┆ 75.349693  │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Dragon  ┆ null   ┆ 94.0               ┆ 55.0                            ┆ 75.349693  │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Dragon  ┆ Flying ┆ 94.0               ┆ 95.0                            ┆ 75.349693  │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Psychic ┆ null   ┆ 53.875             ┆ 51.428571                       ┆ 75.349693  │
└─────────┴────────┴────────────────────┴─────────────────────────────────┴────────────┘

Operations per group

Window functions can do more than aggregation. They can also be viewed as an operation within a group. If, for instance, you want to sort the values within a group, you can write col("value").sort().over("group") and voilà! We sorted by group!

Let's filter out some rows to make this more clear.

filtered = df.filter(pl.col("Type 2") == "Psychic").select(
    [
        "Name",
        "Type 1",
        "Speed",
    ]
)
print(filtered)
shape: (7, 3)
┌─────────────────────┬────────┬───────┐
│ Name                ┆ Type 1 ┆ Speed │
│ ---                 ┆ ---    ┆ ---   │
│ str                 ┆ str    ┆ i64   │
╞═════════════════════╪════════╪═══════╡
│ Slowpoke            ┆ Water  ┆ 15    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ Slowbro             ┆ Water  ┆ 30    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ SlowbroMega Slowbro ┆ Water  ┆ 30    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ Exeggcute           ┆ Grass  ┆ 40    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ Exeggutor           ┆ Grass  ┆ 55    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ Starmie             ┆ Water  ┆ 115   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ Jynx                ┆ Ice    ┆ 95    │
└─────────────────────┴────────┴───────┘

Observe that the group Water of column Type 1 is not contiguous; there are two rows of Grass in between. Also note that each pokemon within a group is sorted by Speed in ascending order. Unfortunately, for this example we want them sorted in descending speed order. Luckily with window functions this is easy to accomplish.

out = filtered.with_columns(
    [
        pl.col(["Name", "Speed"]).sort(reverse=True).over("Type 1"),
    ]
)
print(out)

shape: (7, 3)
┌─────────────────────┬────────┬───────┐
│ Name                ┆ Type 1 ┆ Speed │
│ ---                 ┆ ---    ┆ ---   │
│ str                 ┆ str    ┆ i64   │
╞═════════════════════╪════════╪═══════╡
│ Starmie             ┆ Water  ┆ 115   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ Slowpoke            ┆ Water  ┆ 30    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ SlowbroMega Slowbro ┆ Water  ┆ 30    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ Exeggutor           ┆ Grass  ┆ 55    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ Exeggcute           ┆ Grass  ┆ 40    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ Slowbro             ┆ Water  ┆ 15    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ Jynx                ┆ Ice    ┆ 95    │
└─────────────────────┴────────┴───────┘

Polars keeps track of each group's location and maps the expressions to the proper row locations. This will also work over different groups in a single select.

The power of window expressions is that you often don't need a groupby -> explode combination, but you can put the logic in a single expression. It also makes the API cleaner. If properly used:

  • groupby -> marks that groups are aggregated and we expect a DataFrame of size n_groups
  • over -> marks that we want to compute something within a group, but doesn't modify the original size of the DataFrame

Window expression rules

The evaluations of window expressions are as follows (assuming we apply it to a pl.Int32 column):

# aggregate and broadcast within a group
# output type: -> Int32
pl.sum("foo").over("groups")

# sum within a group and multiply with group elements
# output type: -> Int32
(pl.col("x").sum() * pl.col("y")).over("groups")

# sum within a group and multiply with group elements 
# and aggregate the group to a list
# output type: -> List(Int32)
(pl.col("x").sum() * pl.col("y")).list().over("groups")

# note that it will require an explicit `list()` call
# sum within a group and multiply with group elements 
# and aggregate the group to a list
# the flatten call explodes that list

# This is the fastest method to do things over groups when the groups are sorted
(pl.col("x").sum() * pl.col("y")).list().over("groups").flatten()

More examples

As a further exercise, below are some window functions for us to compute:

  • sort all pokemon by type
  • select the first 3 pokemon per type as "Type 1"
  • sort the pokemon within a type by speed and select the first 3 as "fastest/group"
  • sort the pokemon within a type by attack and select the first 3 as "strongest/group"
  • sort the pokemon by name within a type and select the first 3 as "sorted_by_alphabet"

out = df.sort("Type 1").select(
    [
        pl.col("Type 1").head(3).list().over("Type 1").flatten(),
        pl.col("Name").sort_by(pl.col("Speed")).head(3).list().over("Type 1").flatten().alias("fastest/group"),
        pl.col("Name").sort_by(pl.col("Attack")).head(3).list().over("Type 1").flatten().alias("strongest/group"),
        pl.col("Name").sort().head(3).list().over("Type 1").flatten().alias("sorted_by_alphabet"),
    ]
)
print(out)
shape: (43, 4)
┌────────┬─────────────────────┬─────────────────┬─────────────────────────┐
│ Type 1 ┆ fastest/group       ┆ strongest/group ┆ sorted_by_alphabet      │
│ ---    ┆ ---                 ┆ ---             ┆ ---                     │
│ str    ┆ str                 ┆ str             ┆ str                     │
╞════════╪═════════════════════╪═════════════════╪═════════════════════════╡
│ Bug    ┆ Paras               ┆ Metapod         ┆ Beedrill                │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Bug    ┆ Metapod             ┆ Kakuna          ┆ BeedrillMega Beedrill   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Bug    ┆ Parasect            ┆ Caterpie        ┆ Butterfree              │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Dragon ┆ Dratini             ┆ Dratini         ┆ Dragonair               │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...    ┆ ...                 ┆ ...             ┆ ...                     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Rock   ┆ Omanyte             ┆ Omastar         ┆ Geodude                 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Water  ┆ Slowpoke            ┆ Magikarp        ┆ Blastoise               │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Water  ┆ Slowbro             ┆ Tentacool       ┆ BlastoiseMega Blastoise │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Water  ┆ SlowbroMega Slowbro ┆ Horsea          ┆ Cloyster                │
└────────┴─────────────────────┴─────────────────┴─────────────────────────┘

Flattened window function

If we have a window function that aggregates to a list like the example above with the following expression:

pl.col("Name").sort_by(pl.col("Speed")).head(3).list().over("Type 1")

This still works, but that would give us a column type List which might not be what we want (this would significantly increase our memory usage!).

Instead we could flatten. This just turns our 2D list into a 1D array and projects that array/column back to our DataFrame. This is very fast because the reshape is often free, and adding the column back to the original DataFrame is also a lot cheaper (since we don't require a join like in a normal window function).

However, for this operation to make sense, it is important that the columns used in over([..]) are sorted!

List context

An expression context we haven't discussed yet is the List context. This simply means that we can apply any expression to the elements of a List.

Row wise computations

This context is ideal for computing things in row orientation.

Polars expressions work on columns that have the guarantee that they consist of homogeneous data. Columns have this guarantee, rows in a DataFrame not so much. Luckily we have a data type that has the guarantee that the rows are homogeneous: pl.List data type.

Let's say we have the following data:

grades = pl.DataFrame(
    {
        "student": ["bas", "laura", "tim", "jenny"],
        "arithmetic": [10, 5, 6, 8],
        "biology": [4, 6, 2, 7],
        "geography": [8, 4, 9, 7],
    }
)
print(grades)
shape: (4, 4)
┌─────────┬────────────┬─────────┬───────────┐
│ student ┆ arithmetic ┆ biology ┆ geography │
│ ---     ┆ ---        ┆ ---     ┆ ---       │
│ str     ┆ i64        ┆ i64     ┆ i64       │
╞═════════╪════════════╪═════════╪═══════════╡
│ bas     ┆ 10         ┆ 4       ┆ 8         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ laura   ┆ 5          ┆ 6       ┆ 4         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ tim     ┆ 6          ┆ 2       ┆ 9         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ jenny   ┆ 8          ┆ 7       ┆ 7         │
└─────────┴────────────┴─────────┴───────────┘

If we want to compute the rank of all the columns except for "student", we can collect those into a list data type:

out = grades.select([pl.concat_list(pl.all().exclude("student")).alias("all_grades")])
print(out)

This would give:

shape: (4, 1)
┌────────────┐
│ all_grades │
│ ---        │
│ list [i64] │
╞════════════╡
│ [10, 4, 8] │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [5, 6, 4]  │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [6, 2, 9]  │
├╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [8, 7, 7]  │
└────────────┘

Running Polars expressions on list elements

We can run any Polars expression on the elements of a list with the arr.eval expression! These expressions run entirely on Polars' query engine and can run in parallel, so they will be super fast.

Let's expand the example from above with something a little bit more interesting. Pandas allows you to compute the percentages of the rank values. Polars doesn't provide such a keyword argument. But because expressions are so versatile we can create our own percentage rank expression. Let's try that!

Note that we must select the list's elements from the context when we apply expressions over list elements; any col()/first() selection suffices.

# the percentage rank expression
rank_pct = pl.col("").rank(reverse=True) / pl.col("").count()


grades.with_column(
    # create the list of homogeneous data
    pl.concat_list(pl.all().exclude("student")).alias("all_grades")
).select([
    # select all columns except the intermediate list
    pl.all().exclude("all_grades"),
    # compute the rank by calling `arr.eval`
    pl.col("all_grades").arr.eval(rank_pct, parallel=True).alias("grades_rank")
])

This outputs:

shape: (4, 5)
┌─────────┬────────────┬─────────┬───────────┬────────────────────────────────┐
│ student ┆ arithmetic ┆ biology ┆ geography ┆ grades_rank                    │
│ ---     ┆ ---        ┆ ---     ┆ ---       ┆ ---                            │
│ str     ┆ i64        ┆ i64     ┆ i64       ┆ list [f32]                     │
╞═════════╪════════════╪═════════╪═══════════╪════════════════════════════════╡
│ bas     ┆ 10         ┆ 4       ┆ 8         ┆ [0.333333, 1.0, 0.666667]      │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ laura   ┆ 5          ┆ 6       ┆ 4         ┆ [0.666667, 0.333333, 1.0]      │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ tim     ┆ 6          ┆ 2       ┆ 9         ┆ [0.666667, 1.0, 0.333333]      │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ jenny   ┆ 8          ┆ 7       ┆ 7         ┆ [0.333333, 0.833333, 0.833333] │
└─────────┴────────────┴─────────┴───────────┴────────────────────────────────┘

Note that this solution works for any expressions/operation you want to do row wise.

Numpy interop

Polars expressions support NumPy ufuncs. See here for a list of all supported NumPy functions.

This means that if a function is not provided by Polars, we can use NumPy and we still have fast columnar operation through the NumPy API.

Example

import polars as pl
import numpy as np

df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

out = df.select(
    [
        np.log(pl.all()).suffix("_log"),
    ]
)
print(out)
shape: (3, 2)
┌──────────┬──────────┐
│ a_log    ┆ b_log    │
│ ---      ┆ ---      │
│ f64      ┆ f64      │
╞══════════╪══════════╡
│ 0.0      ┆ 1.386294 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 0.693147 ┆ 1.609438 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1.098612 ┆ 1.791759 │
└──────────┴──────────┘

Gotchas

Read more about the gotchas here.

Custom functions

You should be convinced by now that Polars expressions are so powerful and flexible that the need for custom Python functions is much smaller than in other libraries.

Still, you need to have the power to pass an expression's state to a third party library, or to apply your black box function over data in Polars.

For this we provide the following expressions:

  • map
  • apply

To map or to apply.

These functions have an important distinction in how they operate and consequently what data they will pass to the user.

A map passes the Series backed by the expression as is.

map follows the same rules in both the select and the groupby context; this means that the Series represents a column in a DataFrame. Note that in the groupby context, that column is not yet aggregated!

Use cases for map are for instance passing the Series in an expression to a third party library. Below we show how we could use map to pass an expression column to a neural network model.

df.with_columns([
    pl.col("features").map(lambda s: MyNeuralNetwork.forward(s.to_numpy())).alias("activations")
])

Use cases for map in the groupby context are slim. They are only used for performance reasons, but can quite easily lead to incorrect results. Let me explain why.

df = pl.DataFrame(
    {
        "keys": ["a", "a", "b"],
        "values": [10, 7, 1],
    }
)

out = df.groupby("keys", maintain_order=True).agg(
    [
        pl.col("values").map(lambda s: s.shift()).alias("shift_map"),
        pl.col("values").shift().alias("shift_expression"),
    ]
)
print(df)
shape: (3, 2)
┌──────┬────────┐
│ keys ┆ values │
│ ---  ┆ ---    │
│ str  ┆ i64    │
╞══════╪════════╡
│ a    ┆ 10     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ a    ┆ 7      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b    ┆ 1      │
└──────┴────────┘

In the snippet above we groupby the "keys" column. That means we have the following groups:

"a" -> [10, 7]
"b" -> [1]

If we would then apply a shift operation to the right, we'd expect:

"a" -> [null, 10]
"b" -> [null]

Now, let's print and see what we've got.

print(out)
shape: (2, 3)
┌──────┬────────────┬──────────────────┐
│ keys ┆ shift_map  ┆ shift_expression │
│ ---  ┆ ---        ┆ ---              │
│ str  ┆ list [i64] ┆ list [i64]       │
╞══════╪════════════╪══════════════════╡
│ a    ┆ [null, 10] ┆ [null, 10]       │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b    ┆ [7]        ┆ [null]           │
└──────┴────────────┴──────────────────┘

Ouch.. we clearly get the wrong results here. Group "b" even got a value from group "a" 😵.

This went horribly wrong, because the map applies the function before we aggregate! So that means the whole column [10, 7, 1] got shifted to [null, 10, 7] and was then aggregated.

So my advice is to never use map in the groupby context unless you know you need it and know what you are doing.

To apply

Luckily we can fix the previous example with apply. apply works on the smallest logical elements for that operation.

That is:

  • select context -> single elements
  • groupby context -> single groups

So with apply we should be able to fix our example:

out = df.groupby("keys", maintain_order=True).agg(
    [
        pl.col("values").apply(lambda s: s.shift()).alias("shift_map"),
        pl.col("values").shift().alias("shift_expression"),
    ]
)
print(out)
shape: (2, 3)
┌──────┬────────────┬──────────────────┐
│ keys ┆ shift_map  ┆ shift_expression │
│ ---  ┆ ---        ┆ ---              │
│ str  ┆ list [i64] ┆ list [i64]       │
╞══════╪════════════╪══════════════════╡
│ a    ┆ [null, 10] ┆ [null, 10]       │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b    ┆ [null]     ┆ [null]           │
└──────┴────────────┴──────────────────┘

And observe, a valid result! 🎉

apply in the select context

In the select context, the apply expression passes elements of the column to the Python function.

Note that you are now running Python code, so this will be slow.

Let's go through some examples to see what to expect. We will continue with the DataFrame we defined at the start of this section and show an example with the apply function and a counter example where we use the expression API to achieve the same goals.

Adding a counter

In this example we create a global counter and then add the integer 1 to the global state at every element processed. Every iteration the result of the increment will be added to the element value.

counter = 0


def add_counter(val: int) -> int:
    global counter
    counter += 1
    return counter + val


out = df.select(
    [
        pl.col("values").apply(add_counter).alias("solution_apply"),
        (pl.col("values") + pl.arange(1, pl.count() + 1)).alias("solution_expr"),
    ]
)
print(out)
shape: (3, 2)
┌────────────────┬───────────────┐
│ solution_apply ┆ solution_expr │
│ ---            ┆ ---           │
│ i64            ┆ i64           │
╞════════════════╪═══════════════╡
│ 11             ┆ 11            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 9              ┆ 9             │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4              ┆ 4             │
└────────────────┴───────────────┘

Combining multiple column values

If we want to have access to values of different columns in a single apply function call, we can create a struct data type. This data type collects those columns as fields in the struct. So if we'd create a struct from the columns "keys" and "values", we would get the following struct elements:

[
    {"keys": "a", "values": 10},
    {"keys": "a", "values": 7},
    {"keys": "b", "values": 1},
]

Those would be passed as a dict to the calling Python function and can thus be indexed by field name (a str).

out = df.select(
    [
        pl.struct(["keys", "values"]).apply(lambda x: len(x["keys"]) + x["values"]).alias("solution_apply"),
        (pl.col("keys").str.lengths() + pl.col("values")).alias("solution_expr"),
    ]
)
print(out)
shape: (3, 2)
┌────────────────┬───────────────┐
│ solution_apply ┆ solution_expr │
│ ---            ┆ ---           │
│ i64            ┆ i64           │
╞════════════════╪═══════════════╡
│ 11             ┆ 11            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 8              ┆ 8             │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2              ┆ 2             │
└────────────────┴───────────────┘

Return types?

Custom Python functions are black boxes for Polars. We really don't know what kind of black arts you are doing, so we have to infer and try our best to understand what you meant.

As a user it helps to understand what we do in order to better utilize custom functions.

The data type is automatically inferred. We do that by waiting for the first non-null value. That value will then be used to determine the type of the Series.

The mapping of Python types to Polars data types is as follows (a small sketch follows the list):

  • int -> Int64
  • float -> Float64
  • bool -> Boolean
  • str -> Utf8
  • list[tp] -> List[tp] (where the inner type is inferred with the same rules)
  • dict[str, [tp]] -> struct
  • Any -> object (Prevent this at all times)
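
As a minimal sketch of these inference rules (reusing the "values" column from this section; the halved alias is just for illustration), note that the return type of your function, not the input type, decides the dtype of the resulting Series:

import polars as pl

df = pl.DataFrame({"keys": ["a", "a", "b"], "values": [10, 7, 1]})

# the lambda returns a Python float, so the output dtype is inferred as Float64
out = df.select(pl.col("values").apply(lambda v: v / 2).alias("halved"))
print(out.dtypes)  # expect a single Float64 column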

Expressions

fn(Series) -> Series

  • Lazily evaluated
    • Can be optimized
    • Gives the library writer context so informed decisions can be made
  • Embarrassingly parallel
  • Context dependent
    • selection / projection -> Series = COLUMN, LITERAL or VALUE
    • aggregation -> Series = GROUPS
import polars as pl

df = pl.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "fruits": ["banana", "banana", "apple", "apple", "banana"],
        "B": [5, 4, 3, 2, 1],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
        "optional": [28, 300, None, 2, -30],
    }
)
df
A fruits B cars optional
i64 str i64 str i64
1 "banana" 5 "beetle" 28
2 "banana" 4 "audi" 300
3 "apple" 3 "beetle" null
4 "apple" 2 "beetle" 2
5 "banana" 1 "beetle" -30

Selection context

# We can select by name

(df.select([
    pl.col("A"),
    "B",      # the col part is inferred
    pl.lit("B"),  # we must tell polars we mean the literal "B"
    pl.col("fruits"),
]))
A B literal fruits
i64 i64 str str
1 5 "B" "banana"
2 4 "B" "banana"
3 3 "B" "apple"
4 2 "B" "apple"
5 1 "B" "banana"
# you can select columns with a regex if it starts with '^' and ends with '$'

(df.select([
    pl.col("^A|B$").sum()
]))
A B
i64 i64
15 15
# you can select multiple columns by name

(df.select([
    pl.col(["A", "B"]).sum()
]))
A B
i64 i64
15 15
# We select everything in normal order
# Then we select everything in reversed order

(df.select([
    pl.all(),
    pl.all().reverse().suffix("_reverse")
]))
A fruits B cars optional A_reverse fruits_reverse B_reverse cars_reverse optional_reverse
i64 str i64 str i64 i64 str i64 str i64
1 "banana" 5 "beetle" 28 5 "banana" 1 "beetle" -30
2 "banana" 4 "audi" 300 4 "apple" 2 "beetle" 2
3 "apple" 3 "beetle" null 3 "apple" 3 "beetle" null
4 "apple" 2 "beetle" 2 2 "banana" 4 "audi" 300
5 "banana" 1 "beetle" -30 1 "banana" 5 "beetle" 28
# all expressions run in parallel
# single valued `Series` are broadcasted to the shape of the `DataFrame`

(df.select([
    pl.all(),
    pl.all().sum().suffix("_sum")
]))
A fruits B cars optional A_sum fruits_sum B_sum cars_sum optional_sum
i64 str i64 str i64 i64 str i64 str i64
1 "banana" 5 "beetle" 28 15 null 15 null 300
2 "banana" 4 "audi" 300 15 null 15 null 300
3 "apple" 3 "beetle" null 15 null 15 null 300
4 "apple" 2 "beetle" 2 15 null 15 null 300
5 "banana" 1 "beetle" -30 15 null 15 null 300
# there are `str` and `dt` namespaces for specialized functions

predicate = pl.col("fruits").str.contains("^b.*")

(df.select([
    predicate
]))
fruits
bool
true
true
false
false
true
# use the predicate to filter

df.filter(predicate)
A fruits B cars optional
i64 str i64 str i64
1 "banana" 5 "beetle" 28
2 "banana" 4 "audi" 300
5 "banana" 1 "beetle" -30
# predicate expressions can be used to filter

(df.select([
    pl.col("A").filter(pl.col("fruits").str.contains("^b.*")).sum(),
    (pl.col("B").filter(pl.col("cars").str.contains("^b.*")).sum() * pl.col("B").sum()).alias("some_compute()"),
]))
A some_compute()
i64 i64
8 165
# We can do arithmetic on columns and (literal) values
# here some_var happens to evaluate to 1, but it could be any value determined at runtime

some_var = 1

(df.select([
    ((pl.col("A") / 124.0 * pl.col("B")) / pl.sum("B") * some_var).alias("computed")
]))
computed
f64
0.0
0.0
0.0
0.0
0.0
# We can combine columns by a predicate

(df.select([
    "fruits",
    "B",
    pl.when(pl.col("fruits") == "banana").then(pl.col("B")).otherwise(-1).alias("b")
]))
fruits B b
str i64 i64
"banana" 5 5
"banana" 4 4
"apple" 3 -1
"apple" 2 -1
"banana" 1 1
# We can combine columns by a fold operation on column level

(df.select([
    "A",
    "B",
    pl.fold(0, lambda a, b: a + b, [pl.col("A"), "B", pl.col("B")**2, pl.col("A") / 2.0]).alias("fold")
]))
A B fold
i64 i64 f64
1 5 31
2 4 23
3 3 16
4 2 12
5 1 9
# even combine all

(df.select([
    pl.arange(0, df.height).alias("idx"),
    "A",
    pl.col("A").shift().alias("A_shifted"),
    pl.concat_str(pl.all(), "-").alias("str_concat_1"),  # prefer this
    pl.fold(pl.col("A"), lambda a, b: a + "-" + b, pl.all().exclude("A")).alias("str_concat_2"),  # over this (accidentally O(n^2))
]))
idx A A_shifted str_concat_1 str_concat_2
i64 i64 i64 str str
0 1 null "1-banana-5-beetle-28" "1-banana-5-beetle-28"
1 2 1 "2-banana-4-audi-300" "2-banana-4-audi-300"
2 3 2 null null
3 4 3 "4-apple-2-beetle-2" "4-apple-2-beetle-2"
4 5 4 "5-banana-1-beetle--30" "5-banana-1-beetle--30"

Aggregation context

  • expressions are applied over groups instead of columns
# we can still combine many expressions

(df.sort("cars").groupby("fruits")
    .agg([
        pl.col("B").sum().alias("B_sum"),
        pl.sum("B").alias("B_sum2"),  # syntactic sugar for the first
        pl.first("fruits").alias("fruits_first"),
        pl.count("A").alias("count"),
        pl.col("cars").reverse()
    ]))
fruits B_sum B_sum2 fruits_first count cars
str i64 i64 str u32 list
"banana" 10 10 "banana" 3 [beetle, beetle, audi]
"apple" 5 5 "apple" 2 [beetle, beetle]
# We can explode the list column "cars"

(df.sort("cars").groupby("fruits")
    .agg([
        pl.col("B").sum().alias("B_sum"),
        pl.sum("B").alias("B_sum2"),  # syntactic sugar for the first
        pl.first("fruits").alias("fruits_first"),
        pl.count("A").alias("count"),
        pl.col("cars").reverse()
    ])).explode("cars")
fruits B_sum B_sum2 fruits_first count cars
str i64 i64 str u32 str
"apple" 5 5 "apple" 2 "beetle"
"apple" 5 5 "apple" 2 "beetle"
"banana" 10 10 "banana" 3 "beetle"
"banana" 10 10 "banana" 3 "beetle"
"banana" 10 10 "banana" 3 "audi"
(df.groupby("fruits")
    .agg([
        pl.col("B").sum().alias("B_sum"),
        pl.sum("B").alias("B_sum2"),  # syntactic sugar for the first
        pl.first("fruits").alias("fruits_first"),
        pl.count(),
        pl.col("B").shift().alias("B_shifted")
    ])
 .explode("B_shifted")
)
fruits B_sum B_sum2 fruits_first count B_shifted
str i64 i64 str u32 i64
"apple" 5 5 "apple" 2 null
"apple" 5 5 "apple" 2 3
"banana" 10 10 "banana" 3 null
"banana" 10 10 "banana" 3 5
"banana" 10 10 "banana" 3 4
# We can explode the list column "cars"

(df.sort("cars").groupby("fruits")
    .agg([
        pl.col("B").sum(),
        pl.sum("B").alias("B_sum2"),  # syntactic sugar for the first
        pl.first("fruits").alias("fruits_first"),
        pl.count("A").alias("count"),
        pl.col("cars").reverse()
    ])).explode("cars")
fruits B_sum B_sum2 fruits_first count cars
str i64 i64 str u32 str
"apple" 5 5 "apple" 2 "beetle"
"apple" 5 5 "apple" 2 "beetle"
"banana" 10 10 "banana" 3 "beetle"
"banana" 10 10 "banana" 3 "beetle"
"banana" 10 10 "banana" 3 "audi"
# we can also get the list of the groups

(df.groupby("fruits")
    .agg([
         pl.col("B").shift().alias("shift_B"),
         pl.col("B").reverse().alias("rev_B"),
    ]))
fruits shift_B rev_B
str list list
"apple" [null, 3] [2, 3]
"banana" [null, 5, 4] [1, 4, 5]
# we can do predicates in the groupby as well

(df.groupby("fruits")
    .agg([
        pl.col("B").filter(pl.col("B") > 1).list().keep_name(),
    ]))
fruits B
str list
"banana" [5, 4]
"apple" [3, 2]
# and aggregate (here: take the mean of) only the values where the predicate is true

(df.groupby("fruits")
    .agg([
        pl.col("B").filter(pl.col("B") > 1).mean(),
    ]))
fruits B_mean
str f64
"banana" 4.5
"apple" 2.5
# Another example

(df.groupby("fruits")
    .agg([
        pl.col("B").shift_and_fill(1, fill_value=0).alias("shifted"),
        pl.col("B").shift_and_fill(1, fill_value=0).sum().alias("shifted_sum"),
    ]))
fruits shifted shifted_sum
str list i64
"apple" [0, 3] 3
"banana" [0, 5, 4] 9

Window functions!

  • Expression with superpowers.
  • Aggregation in selection context
pl.col("foo").aggregation_expression(..).over("column_used_to_group")
# groupby 2 different columns

(df.select([
    "fruits",
    "cars",
    "B",
    pl.col("B").sum().over("fruits").alias("B_sum_by_fruits"),
    pl.col("B").sum().over("cars").alias("B_sum_by_cars"),
]))
fruits cars B B_sum_by_fruits B_sum_by_cars
str str i64 i64 i64
"apple" "beetle" 3 5 11
"apple" "beetle" 2 5 11
"banana" "beetle" 5 10 11
"banana" "audi" 4 10 4
"banana" "beetle" 1 10 11
# reverse B by groups and show the results in original DF

(df.select([
    "fruits",
    "B",
    pl.col("B").reverse().over("fruits").alias("B_reversed_by_fruits")
]))
fruits B B_reversed_by_fruits
str i64 i64
"apple" 3 2
"apple" 2 3
"banana" 5 1
"banana" 4 4
"banana" 1 5
# Lag a column within "fruits"

(df.select([
    "fruits",
    "B",
    pl.col("B").shift().over("fruits").alias("lag_B_by_fruits")
]))

fruits B lag_B_by_fruits
str i64 i64
"apple" 3 null
"apple" 2 3
"banana" 5 null
"banana" 4 5
"banana" 1 4

Expression API

The full list of possible expressions is available on the Expr class definition in the reference guide.

Video Introduction

Don't enjoy reading? Take a look at this introduction video on Polars and its expressions.

Indexing

The Polars DataFrame doesn't have an index; therefore indexing behavior can be consistent without the need for a df.loc, df.iloc, or df.at operation.

The rules are as follows (depending on the datatypes of the values):

  • numeric

    • axis 0: row
    • axis 1: column
  • numeric + strings

    • axis 0: row (only accept numbers here)
    • axis 1: column (accept numeric + string values)
  • only strings

    • axis 0: column
    • axis 1: error
  • expressions

    All expression evaluations are executed in parallel

    • axis 0: column
    • axis 1: column
    • ..
    • axis n: column

Comparison with pandas

|                                           | pandas                       | polars                  |
| ----------------------------------------- | ---------------------------- | ----------------------- |
| select row                                | df.iloc[2]                   | df[2, :]                |
| select several rows by their indices      | df.iloc[[2, 5, 6]]           | df[[2, 5, 6], :]        |
| select slice of rows                      | df.iloc[2:6]                 | df[2:6, :]              |
| select rows using a boolean mask          | df.iloc[[True, True, False]] | df[[True, True, False]] |
| select rows by a predicate condition      | df.loc[df["A"] > 3]          | df[df["A"] > 3]         |
| select slice of columns                   | df.iloc[:, 1:3]              | df[:, 1:3]              |
| select slice of columns by string order   | df.loc[:, "A":"Z"]           | df[:, "A":"Z"]          |
| select a single value (scalar)            | df.loc[2, "A"]               | df[2, "A"]              |
| select a single value (scalar)            | df.iloc[2, 1]                | df[2, 1]                |
| select a single value (Series/DataFrame)  | df.loc[2, ["A"]]             | df[2, ["A"]]            |
| select a single value (Series/DataFrame)  | df.iloc[2, [1]]              | df[2, [1]]              |

Anti-pattern

Indexing Polars by boolean masks is considered an anti-pattern and the functionality may be removed in the future. Polars strongly favours the expression API in combination with select and filter over accessing by index.
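
For example, rather than masking rows by index, express the same selection with filter (a small sketch with a hypothetical column "A"):

import polars as pl

df = pl.DataFrame({"A": [1, 4, 6]})

# anti-pattern: indexing with a boolean mask
df[df["A"] > 3]

# preferred: the expression API
df.filter(pl.col("A") > 3)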

Data types

Polars is entirely based on Arrow data types and backed by Arrow memory arrays. This makes data processing cache-efficient and well-supported for Inter Process Communication. Most data types follow the exact implementation from Arrow, with the exception of Utf8 (this is actually LargeUtf8), Categorical, and Object (support is limited).

The data types are:

  • Int8: 8-bit signed integer.
  • Int16: 16-bit signed integer.
  • Int32: 32-bit signed integer.
  • Int64: 64-bit signed integer.
  • UInt8: 8-bit unsigned integer.
  • UInt16: 16-bit unsigned integer.
  • UInt32: 32-bit unsigned integer.
  • UInt64: 64-bit unsigned integer.
  • Float32: 32-bit floating point.
  • Float64: 64-bit floating point.
  • Boolean: Boolean type effectively bit packed.
  • Utf8: String data (this is actually Arrow LargeUtf8 internally).
  • List: A list array contains a child array containing the list values and an offset array. (this is actually Arrow LargeList internally).
  • Struct: A struct array is represented as Vec<Series> and is useful to pack multiple/heterogeneous values in a single column.
  • Date: Date representation, internally represented as days since UNIX epoch encoded by a 32-bit signed integer.
  • Datetime: Datetime representation, internally represented as nanoseconds since UNIX epoch encoded by a 64-bit signed integer.
  • Duration: A timedelta type. Created when subtracting Date/Datetime.
  • Time: Time representation, internally represented as nanoseconds since midnight.
  • Object: A limited supported data type that can be any value.

To learn more about the internal representation of these data types, check the Arrow columnar format.
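
As a small sketch of how these types surface in practice, the schema below shows the data types Polars infers for ordinary Python values (the column names are made up):

import polars as pl
from datetime import date

df = pl.DataFrame(
    {
        "ints": [1, 2],
        "floats": [1.0, 2.0],
        "strings": ["a", "b"],
        "dates": [date(2021, 1, 1), date(2021, 1, 2)],
    }
)
print(df.dtypes)  # expect Int64, Float64, Utf8 and Date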

Coming from Pandas

Here we set out the key points that anyone who has experience with Pandas and wants to try Polars should know. We include both differences in the concepts the libraries are built on and differences in how you should write Polars code compared to Pandas code.

Differences in concepts between Polars and Pandas

Polars does not have an index

Pandas gives a label to each row with an index. Polars does not use an index and each row is indexed by its integer position in the table.

Indexes are not needed! Not having them makes things easier - convince us otherwise!

For more detail on how you select data in Polars see the indexing section.

Polars uses Apache Arrow arrays to represent data in memory while Pandas uses Numpy arrays

Polars represents data in memory with Arrow arrays while Pandas represents data in memory in Numpy arrays. Apache Arrow is an emerging standard for in-memory columnar analytics that can accelerate data load times, reduce memory usage and accelerate calculations.

Polars can convert data to Numpy format with the to_numpy method.

Polars has more support for parallel operations than Pandas

Polars exploits the strong support for concurrency in Rust to run many operations in parallel. While some operations in Pandas are multi-threaded the core of the library is single-threaded and an additional library such as Dask must be used to parallelise operations.

Polars can lazily evaluate queries and apply query optimization

Eager evaluation is where code is evaluated as soon as you run the code. Lazy evaluation is where running a line of code means that the underlying logic is added to a query plan rather than being evaluated.

Polars supports eager evaluation and lazy evaluation whereas Pandas only supports eager evaluation. The lazy evaluation mode is powerful because Polars carries out automatic query optimization where it examines the query plan and looks for ways to accelerate the query or reduce memory usage.

Dask also supports lazy evaluation where it generates a query plan. However, Dask does not carry out query optimization on the query plan.

Key syntax differences

Users coming from Pandas generally need to know one thing...

polars != pandas

If your Polars code looks like it could be Pandas code, it might run, but it likely runs slower than it should.

Let's go through some typical Pandas code and see how we might write that in Polars.

Selecting data

As there is no index in Polars there is no .loc or .iloc method - and there is also no SettingWithCopyWarning in Polars.

To learn more about how you select data in Polars see the indexing section.

However, the best way to select data in Polars is to use the expression API. For example, if you want to select a column in Pandas you can do one of the following:

df['a']
df.loc[:,'a']

but in Polars you would use the .select method:

df.select(['a'])

If you want to select rows based on the values then in Polars you use the .filter method:

df.filter(pl.col('a')<10)

As noted in the section on expressions below, Polars can run operations in .select and filter in parallel and Polars can carry out query optimization on the full set of data selection criteria.

Be lazy

Working in lazy evaluation mode is straightforward and should be your default in Polars as the lazy mode allows Polars to do query optimization.

We can run in lazy mode by either using an implicitly lazy function (such as scan_csv) or explicitly using the lazy method.

Take the following simple example where we read a CSV file from disk and do a groupby. The CSV file has numerous columns but we just want to do a groupby on one of the id columns (id1) and then sum by a value column (v1). In Pandas this would be:

    df = pd.read_csv(csvFile)
    groupedDf = df.loc[:,['id1','v1']].groupby('id1').sum()

In Polars you can build this query in lazy mode with query optimization and evaluate it by replacing the eager Pandas function read_csv with the implicitly lazy Polars function scan_csv:

    df = pl.scan_csv(csvFile)
    groupedDf = df.groupby('id1').agg(pl.col('v1').sum()).collect()

Polars optimizes this query by identifying that only the id1 and v1 columns are relevant and so will only read these columns from the CSV. By calling the .collect method at the end of the second line we instruct Polars to eagerly evaluate the query.

If you do want to run this query in eager mode you can just replace scan_csv with read_csv in the Polars code.
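
For completeness, the explicit route via the lazy method mentioned above would look roughly like this (a sketch assuming an eager df that already holds the id1 and v1 columns):

    lazyQuery = df.lazy().groupby('id1').agg(pl.col('v1').sum())
    groupedDf = lazyQuery.collect()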

Read more about working with lazy evaluation in the lazy API section.

Express yourself

A typical Pandas script consists of multiple data transformations that are executed sequentially. However, in Polars these transformations can be executed in parallel using expressions.

Column assignment

We have a dataframe df with a column called value. We want to add two new columns, a column called tenXValue where the value column is multiplied by 10 and a column called hundredXValue where the value column is multiplied by 100.

In Pandas this would be:

df["tenXValue"] = df["value"] * 10
df["hundredXValue"] = df["value"] * 100

These column assignments are executed sequentially.

In Polars we add columns to df using the .with_columns method and name them with the .alias method:

df.with_columns([
    (pl.col("value") * 10).alias("tenXValue"),
    (pl.col("value") * 100).alias("hundredXValue"),
])

These column assignments are executed in parallel.

Column assignment based on predicate

In this case we have a dataframe df with columns a, b, and c. We want to re-assign the values in column a based on a condition. When the value in column c is equal to 2 then we replace the value in a with the value in b.

In Pandas this would be:

df.loc[df["c"] == 2, "a"] = df.loc[df["c"] == 2, "b"]

while in Polars this would be:

df.with_column(
    pl.when(pl.col("c") == 2)
    .then(pl.col("b"))
    .otherwise(pl.col("a")).alias("a")
)

The Polars way is pure in that the original DataFrame is not modified. The mask is also not computed twice as in Pandas (you could prevent this in Pandas, but that would require setting a temporary variable).

Additionally, Polars can compute every branch of an if -> then -> otherwise in parallel. This is valuable when the branches get more expensive to compute.

Filtering

We want to filter the dataframe df with housing data based on some criteria.

In Pandas you filter the dataframe by passing Boolean expressions to the loc method:

df.loc[(df['sqft_living'] > 2500) & (df['price'] < 300000)]

while in Polars you call the filter method:

df.filter(
    (pl.col("sqft_living") > 2500) & (pl.col("price") < 300000)
)

The query optimizer in Polars can also detect if you write multiple filters separately and combine them into a single filter in the optimized plan.
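
A rough sketch of that behaviour: in lazy mode, the two separate filter calls below end up as a single combined predicate in the optimized plan (reusing the housing column names from above):

(df.lazy()
    .filter(pl.col("sqft_living") > 2500)
    .filter(pl.col("price") < 300000)
    .collect())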

Pandas transform

The Pandas documentation demonstrates an operation on a groupby called transform. In this case we have a dataframe df and we want a new column showing the number of rows in each group.

In Pandas we have:

df = pd.DataFrame({
    "c": [1, 1, 1, 2, 2, 2, 2],
    "type": ["m", "n", "o", "m", "m", "n", "n"],
})

df["size"] = df.groupby("c")["type"].transform(len)

Here Pandas does a groupby on "c", takes column "type", computes the group length and then joins the result back to the original DataFrame producing:

   c type size
0  1    m    3
1  1    n    3
2  1    o    3
3  2    m    4
4  2    m    4
5  2    n    4
6  2    n    4

In Polars the same can be achieved with window functions:

df.select([
    pl.all(),
    pl.col("type").count().over("c").alias("size")
])
shape: (7, 3)
┌─────┬──────┬──────┐
│ c   ┆ type ┆ size │
│ --- ┆ ---  ┆ ---  │
│ i64 ┆ str  ┆ u32  │
╞═════╪══════╪══════╡
│ 1   ┆ m    ┆ 3    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1   ┆ n    ┆ 3    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1   ┆ o    ┆ 3    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ m    ┆ 4    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ m    ┆ 4    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ n    ┆ 4    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ n    ┆ 4    │
└─────┴──────┴──────┘

Because we can store the whole operation in a single expression, we can combine several window functions and even combine different groups!

Polars will cache window expressions that are applied over the same group, so storing them in a single select is both convenient and optimal. In the following example we look at a case where we are calculating group statistics over "c" twice:

df.select([
    pl.all(),
    pl.col("c").count().over("c").alias("size"),
    pl.col("c").sum().over("type").alias("sum"),
    pl.col("c").reverse().over("c").flatten().alias("reverse_type")
])
shape: (7, 5)
┌─────┬──────┬──────┬─────┬──────────────┐
│ c   ┆ type ┆ size ┆ sum ┆ reverse_type │
│ --- ┆ ---  ┆ ---  ┆ --- ┆ ---          │
│ i64 ┆ str  ┆ u32  ┆ i64 ┆ i64          │
╞═════╪══════╪══════╪═════╪══════════════╡
│ 1   ┆ m    ┆ 3    ┆ 5   ┆ 2            │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ n    ┆ 3    ┆ 5   ┆ 2            │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ o    ┆ 3    ┆ 1   ┆ 2            │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ m    ┆ 4    ┆ 5   ┆ 2            │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ m    ┆ 4    ┆ 5   ┆ 1            │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ n    ┆ 4    ┆ 5   ┆ 1            │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ n    ┆ 4    ┆ 5   ┆ 1            │
└─────┴──────┴──────┴─────┴──────────────┘

Coming from Apache Spark

Column-based API vs. Row-based API

Whereas the Spark DataFrame is analogous to a collection of rows, a Polars DataFrame is closer to a collection of columns. This means that you can combine columns in Polars in ways that are not possible in Spark, because Spark preserves the relationship of the data in each row.

Consider this sample dataset:

import polars as pl

df = pl.DataFrame({
    "foo": ["a", "b", "c", "d", "d"],
    "bar": [1, 2, 3, 4, 5],
})

dfs = spark.createDataFrame(
    [
        ("a", 1),
        ("b", 2),
        ("c", 3),
        ("d", 4),
        ("d", 5),
    ],
    schema=["foo", "bar"],
)

Example 1: Combining head and sum

In Polars you can write something like this:

df.select([
    pl.col("foo").sort().head(2),
    pl.col("bar").filter(pl.col("foo") == "d").sum()
])

Output:

shape: (2, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 9   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ b   ┆ 9   │
└─────┴─────┘

The expressions on columns foo and bar are completely independent. Since the expression on bar returns a single value, that value is repeated for each value output by the expression on foo. But a and b have no relation to the data that produced the sum of 9.

To do something similar in Spark, you'd need to compute the sum separately and provide it as a literal:

from pyspark.sql.functions import col, sum, lit

bar_sum = (
    dfs
    .where(col("foo") == "d")
    .groupBy()
    .agg(sum(col("bar")))
    .take(1)[0][0]
)

(
    dfs
    .orderBy("foo")
    .limit(2)
    .withColumn("bar", lit(bar_sum))
    .show()
)

Output:

+---+---+
|foo|bar|
+---+---+
|  a|  9|
|  b|  9|
+---+---+

Example 2: Combining Two heads

In Polars you can combine two different head expressions on the same DataFrame, provided that they return the same number of values.

df.select([
    pl.col("foo").sort().head(2),
    pl.col("bar").sort(reverse=True).head(2),
])

Output:

shape: (2, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 5   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ b   ┆ 4   │
└─────┴─────┘

Again, the two head expressions here are completely independent, and the pairing of a to 5 and b to 4 results purely from the juxtaposition of the two columns output by the expressions.

To accomplish something similar in Spark, you would need to generate an artificial key that enables you to join the values in this way.

from pyspark.sql import Window
from pyspark.sql.functions import row_number

foo_dfs = (
    dfs
    .withColumn(
        "rownum",
        row_number().over(Window.orderBy("foo"))
    )
)

bar_dfs = (
    dfs
    .withColumn(
        "rownum",
        row_number().over(Window.orderBy(col("bar").desc()))
    )
)

(
    foo_dfs.alias("foo")
    .join(bar_dfs.alias("bar"), on="rownum")
    .select("foo.foo", "bar.bar")
    .limit(2)
    .show()
)

Output:

+---+---+
|foo|bar|
+---+---+
|  a|  5|
|  b|  4|
+---+---+

Time Series

For time-series resampling Polars offers a powerful API to resample data. Pandas is well known for its resampling functionality via df.resample.

Polars makes the distinction between

  • upsampling
  • downsampling

Upsampling

An upsample operation is actually nothing more than left joining a date range with your dataset and filling the blanks. Polars provides wrapper methods for this operation. Later on we'll discuss an example.
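
To make that concrete, below is a minimal sketch (with made-up sparse data) of the recipe described above: left join a dense date range onto the sparse frame and fill the blanks.

import polars as pl
from datetime import datetime

sparse = pl.DataFrame(
    {
        "time": [datetime(2021, 1, 1), datetime(2021, 1, 4)],
        "value": [1, 4],
    }
)

# a dense daily date range to join against
dense = pl.date_range(low=datetime(2021, 1, 1), high=datetime(2021, 1, 4), interval="1d", name="time").to_frame()

# left join and fill the blanks (forward fill is one possible strategy)
upsampled = dense.join(sparse, on="time", how="left").fill_null("forward")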

Downsampling

Downsampling is interesting. Here you deal with date intervals, window durations, aggregations etc.

Polars views downsampling as a special case of the groupby operation and therefore has two extra entrances in the expression API within the groupby context:

  • groupby_dynamic
  • groupby_rolling

Calling any of those functions will give you complete access to the expression API and performance!

Let's go through some examples and see what that means.

Groupby Dynamic

In the snippet below we create a date range with every day ("1d") in 2021 and turn this into a DataFrame.

Then we create dynamic windows that start every month ("1mo") and have a window length of 1 month. Dynamic windows don't have a size that is fixed by the number of rows in a DataFrame; instead they are fixed by a temporal unit. This can be a day ("1d"), 3 weeks ("3w") or 5 nanoseconds ("5ns") ... you get the idea.

The values that match these dynamic windows are then assigned to that group and can be aggregated with the powerful expression API.

Below we show an example where we use groupby_dynamic to compute:

  • the number of days until the end of the month
  • the number of days in a month
import polars as pl
from datetime import datetime

df = pl.date_range(low=datetime(2021, 1, 1), high=datetime(2021, 12, 31), interval="1d", name="time").to_frame()

out = (
    df.groupby_dynamic("time", every="1mo", period="1mo", closed="left")
    .agg(
        [
            pl.col("time").cumcount().reverse().head(3).alias("day/eom"),
            ((pl.col("time") - pl.col("time").first()).last().dt.days() + 1).alias("days_in_month"),
        ]
    )
    .explode("day/eom")
)
print(out)
shape: (36, 3)
┌─────────────────────┬─────────┬───────────────┐
│ time                ┆ day/eom ┆ days_in_month │
│ ---                 ┆ ---     ┆ ---           │
│ datetime[ns]        ┆ u32     ┆ i64           │
╞═════════════════════╪═════════╪═══════════════╡
│ 2021-01-01 00:00:00 ┆ 30      ┆ 31            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-01-01 00:00:00 ┆ 29      ┆ 31            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-01-01 00:00:00 ┆ 28      ┆ 31            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-02-01 00:00:00 ┆ 27      ┆ 28            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...                 ┆ ...     ┆ ...           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-11-01 00:00:00 ┆ 27      ┆ 30            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-01 00:00:00 ┆ 30      ┆ 31            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-01 00:00:00 ┆ 29      ┆ 31            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-12-01 00:00:00 ┆ 28      ┆ 31            │
└─────────────────────┴─────────┴───────────────┘

A dynamic window is defined by a:

  • every: indicates the interval of the window
  • period: indicates the duration of the window
  • offset: can be used to offset the start of the windows

Because every does not have to be equal to period, we can create many groups in a very flexible way. They may overlap or leave boundaries between them (a code sketch of the overlapping case follows the diagrams below).

Let's see how the windows for some parameter combinations would look. Let's start out boring. 🥱

  • every: 1 day -> "1d"
  • period: 1 day -> "1d"
this creates adjacent windows of the same size
|--|
   |--|
      |--|
  • every: 1 day -> "1d"
  • period: 2 days -> "2d"
these windows have an overlap of 1 day
|----|
   |----|
      |----|
  • every: 2 days -> "2d"
  • period: 1 day -> "1d"
this would leave gaps between the windows
data points that fall in these gaps will not be a member of any group
|--|
       |--|
              |--|
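
As a quick sketch of the overlapping case, reusing the daily df from the example above (every day a new two-day window starts):

out = (
    df.groupby_dynamic("time", every="1d", period="2d", closed="left")
    .agg([pl.count().alias("points_in_window")])
)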

Rolling GroupBy

The rolling groupby is another entrance to the groupby context. But unlike groupby_dynamic, the windows are not fixed by the parameters every and period. In a rolling groupby the windows are not fixed at all! They are determined by the values in the index_column.

So imagine having a time column with the values {2021-01-01, 2021-01-05} and a period="5d"; this would create the following windows:


2021-01-01   2021-01-06
    |----------|

       2021-01-05   2021-01-10
             |----------|

Because the windows of a rolling groupby are always determined by the values in the DataFrame column, the number of groups is always equal to the number of rows in the original DataFrame.
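
A minimal sketch of what such a call could look like, assuming the groupby_rolling entry point with period="5d" (the exact window boundaries depend on the closed and offset parameters):

import polars as pl
from datetime import datetime

df = pl.DataFrame(
    {
        "time": [datetime(2021, 1, 1), datetime(2021, 1, 5)],
        "value": [1, 2],
    }
)

# one window per row, anchored on that row's time value
out = df.groupby_rolling("time", period="5d").agg([pl.col("value").sum().alias("sum_in_window")])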

Combining Groupby and Dynamic / Rolling

Rolling and dynamic groupbys can be combined with normal groupby operations.

Below is an example with a dynamic groupby.

import polars as pl
from datetime import datetime


df = pl.DataFrame(
    {
        "time": pl.date_range(
            low=datetime(2021, 12, 16),
            high=datetime(2021, 12, 16, 3),
            interval="30m",
        ),
        "groups": ["a", "a", "a", "b", "b", "a", "a"],
    }
)
print(df)
shape: (7, 2)
┌─────────────────────┬────────┐
│ time                ┆ groups │
│ ---                 ┆ ---    │
│ datetime[ns]        ┆ str    │
╞═════════════════════╪════════╡
│ 2021-12-16 00:00:00 ┆ a      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 00:30:00 ┆ a      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 01:00:00 ┆ a      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 01:30:00 ┆ b      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:00:00 ┆ b      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 02:30:00 ┆ a      │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2021-12-16 03:00:00 ┆ a      │
└─────────────────────┴────────┘
out = df.groupby_dynamic(
    "time",
    every="1h",
    closed="both",
    by="groups",
    include_boundaries=True,
).agg([pl.count()])
print(out)
shape: (7, 5)
┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────┐
│ groups ┆ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ count │
│ ---    ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---   │
│ str    ┆ datetime[ns]        ┆ datetime[ns]        ┆ datetime[ns]        ┆ u32   │
╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪═══════╡
│ a      ┆ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 00:00:00 ┆ 1     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a      ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 3     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 1     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a      ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ 1     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 1     │
└────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────┘

Upsample

This content is under construction.

How can I?

This chapter contains some snippets that will get you up to speed with the most idiomatic way to get things done in Polars.

The How can I chapter is under construction.

IO

Polars supports different file types, and its respective parsers are amongst the fastest out there.

For instance, it is faster to load a CSV file via Polars and hand it over to Pandas than to load it with Pandas directly. Just run pl.read_csv("<FILE>", rechunk=False).to_pandas() to convince yourself!

Character-Separated Values

Read & Write

Reading a CSV file should look familiar:

df = pl.read_csv("path.csv")

CSV files come in many different flavors, so make sure to check the read_csv() API.

Writing to a CSV file can be done with the write_csv() method.

df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "bak", "baz"]})
df.write_csv("path.csv")

Scan

Polars allows you to scan a CSV input. Scanning delays the actual parsing of the file and instead returns a lazy computation holder called a LazyFrame.

df = pl.scan_csv("path.csv")

If you want to know why this is desirable, you can read more about those Polars optimizations here.

Parquet

Loading or writing Parquet files is lightning fast. Pandas uses PyArrow (Python bindings exposed by Arrow) to load Parquet files into memory, but it has to copy that data into Pandas memory. With Polars there is no extra cost due to copying as we read Parquet directly into Arrow memory and keep it there.

Read & write

df = pl.read_parquet("path.parquet")
df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "bak", "baz"]})
df.write_parquet("path.parquet")

Scan

Polars allows you to scan a Parquet input. Scanning delays the actual parsing of the file and instead returns a lazy computation holder called a LazyFrame.

df = pl.scan_parquet("path.parquet")

If you want to know why this is desirable, you can read more about those Polars optimizations here.

Dealing with multiple files

Polars can deal with multiple files differently depending on your needs and memory strain.

Let's create some files to give us some context:

import polars as pl

df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "ham", "spam"]})

for i in range(5):
    df.write_csv(f"my_many_files_{i}.csv")

Reading into a single DataFrame

To read multiple files into a single DataFrame, we can use globbing patterns:

df = pl.read_csv("my_many_files_*.csv")
print(df)
shape: (15, 2)
┌─────┬──────┐
│ foo ┆ bar  │
│ --- ┆ ---  │
│ i64 ┆ str  │
╞═════╪══════╡
│ 1   ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ ham  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3   ┆ spam │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1   ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ... ┆ ...  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3   ┆ spam │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1   ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ ham  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3   ┆ spam │
└─────┴──────┘

To see how this works we can take a look at the query plan. Below we see that all files are read separately and concatenated into a single DataFrame. Polars will try to parallelize the reading.

pl.scan_csv("my_many_files_*.csv").show_graph()

single_df_graph

Reading and processing in parallel

If your files don't have to be in a single table you can also build a query plan for each file and execute them in parallel on the Polars thread pool.

All query plan execution is embarrassingly parallel and doesn't require any communication.

import polars as pl
import glob

queries = []
for file in glob.glob("my_many_files_*.csv"):
    q = pl.scan_csv(file).groupby("bar").agg([pl.count(), pl.sum("foo")])
    queries.append(q)

dataframes = pl.collect_all(queries)
print(dataframes)
[shape: (3, 3)
┌──────┬───────┬─────┐
│ bar  ┆ count ┆ foo │
│ ---  ┆ ---   ┆ --- │
│ str  ┆ u32   ┆ i64 │
╞══════╪═══════╪═════╡
│ spam ┆ 1     ┆ 3   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ null ┆ 1     ┆ 1   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ ham  ┆ 1     ┆ 2   │
└──────┴───────┴─────┘, shape: (3, 3)
┌──────┬───────┬─────┐
│ bar  ┆ count ┆ foo │
│ ---  ┆ ---   ┆ --- │
│ str  ┆ u32   ┆ i64 │
╞══════╪═══════╪═════╡
│ null ┆ 1     ┆ 1   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ ham  ┆ 1     ┆ 2   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ spam ┆ 1     ┆ 3   │
└──────┴───────┴─────┘, shape: (3, 3)
┌──────┬───────┬─────┐
│ bar  ┆ count ┆ foo │
│ ---  ┆ ---   ┆ --- │
│ str  ┆ u32   ┆ i64 │
╞══════╪═══════╪═════╡
│ spam ┆ 1     ┆ 3   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ null ┆ 1     ┆ 1   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ ham  ┆ 1     ┆ 2   │
└──────┴───────┴─────┘, shape: (3, 3)
┌──────┬───────┬─────┐
│ bar  ┆ count ┆ foo │
│ ---  ┆ ---   ┆ --- │
│ str  ┆ u32   ┆ i64 │
╞══════╪═══════╪═════╡
│ null ┆ 1     ┆ 1   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ ham  ┆ 1     ┆ 2   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ spam ┆ 1     ┆ 3   │
└──────┴───────┴─────┘, shape: (3, 3)
┌──────┬───────┬─────┐
│ bar  ┆ count ┆ foo │
│ ---  ┆ ---   ┆ --- │
│ str  ┆ u32   ┆ i64 │
╞══════╪═══════╪═════╡
│ spam ┆ 1     ┆ 3   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ ham  ┆ 1     ┆ 2   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ null ┆ 1     ┆ 1   │
└──────┴───────┴─────┘]

Read from MySQL, Postgres, Sqlite, Redshift, Clickhouse

To read from one of the supported databases connector-x needs to be installed.

$  pip install connectorx>=0.2.0a3
import polars as pl

conn = "postgres://username:password@server:port/database"
query = "SELECT * FROM foo"

pl.read_sql(query, conn)

Interact with AWS

The Interact with AWS page is under construction.

To read from or write to an AWS bucket, additional dependencies are needed:

$ pip install s3fs

In the next few snippets we'll demonstrate interacting with a Parquet file located on an AWS bucket.

Read

Load a .parquet file using:

import polars as pl
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()
bucket = "<YOUR_BUCKET>"
path = "<YOUR_PATH>"

dataset = pq.ParquetDataset(f"s3://{bucket}/{path}", filesystem=fs)
df = pl.from_arrow(dataset.read())

Write

This content is under construction.

Interact with Google BigQuery

To read or write from GBQ, additional dependencies are needed:

$ pip install google-cloud-bigquery

Read

We can load a query into a DataFrame like this:

import polars as pl
from google.cloud import bigquery

client = bigquery.Client()

# Perform a query.
QUERY = (
    'SELECT name FROM `bigquery-public-data.usa_names.usa_1910_2013` '
    'WHERE state = "TX" '
    'LIMIT 100')
query_job = client.query(QUERY)  # API request
rows = query_job.result()  # Waits for query to finish

df = pl.from_arrow(rows.to_arrow())

Write

This content is under construction.

Interact with Postgres

Read

To read from postgres, additional dependencies are needed:

$  pip install connectorx>=0.2.0a3
import polars as pl

conn = "postgresql://username:password@server:port/database"
query = "SELECT * FROM foo"

pl.read_sql(query, conn)

Write

To write to postgres, additional dependencies are needed:

$ pip install psycopg2-binary

For writing to a postgres database with psycopg2, we utilize execute_batch. This will limit round trips needed to the server.

We first make sure that all our dtypes are in a format that psycopg2 recognizes, and then we use DataFrame.rows to easily transform the columnar data to rows that the database driver can work with.

from psycopg2 import sql
import psycopg2.extras
import polars as pl

# let's assume we have a DataFrame with some floats, integers, strings, and date64 columns.
df = pl.read_parquet("somefile.parquet")

# first we convert the polars date64 representation to python datetime objects
for col in df:
    # only for date64
    if col.dtype == pl.Date64:
        df = df.with_column(col.dt.to_python_datetime())

# create sql identifiers for the column names
# we do this to safely insert this into a sql query
columns = sql.SQL(",").join(sql.Identifier(name) for name in df.columns)

# create placeholders for the values. These will be filled later
values = sql.SQL(",").join([sql.Placeholder() for _ in df.columns])

table_id = "mytable"

# prepare the insert query
insert_stmt = sql.SQL("INSERT INTO {} ({}) VALUES ({});").format(
    sql.Identifier(table_id), columns, values
)

# make a connection
conn = psycopg2.connect()
cur = conn.cursor()

# do the insert
psycopg2.extras.execute_batch(cur, insert_stmt, df.rows())
conn.commit()

Interoperability

Arrow

Arrow is rapidly becoming the de facto standard for columnar data. This means that support for Arrow is growing rapidly (both languages and tools). Due to the amazing effort behind the format, using Arrow is now likely the fastest way to:

  • Read and write Parquet formatted files
  • Read CSV into columnar data
  • Exchange columnar data

Polars uses an Arrow memory buffer as the most basic building block for the Polars Series. This means that we exchange data between Polars and Arrow without copying it. It also means that Polars shares the same performance gains that Arrow receives.

Convert a Polars DataFrame or Series to Arrow using the .to_arrow() method. Similarly, importing from an Arrow data structure can be performed with the .from_arrow() function.
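
A small sketch of that round trip:

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Polars -> Arrow (zero-copy)
arrow_table = df.to_arrow()

# Arrow -> Polars
df_again = pl.from_arrow(arrow_table)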

NumPy

Polars Series have support for NumPy universal functions (ufuncs). Element-wise functions such as np.exp(), np.cos(), np.divide(), etc. all work with almost zero overhead.

However, as a Polars-specific remark: missing values are stored in a separate bitmask and are not visible to NumPy. This can cause a window function or np.convolve() to give flawed or incomplete results.

Convert a Polars Series to a NumPy array with the .to_numpy() method. Missing values will be replaced by np.nan during the conversion. If the Series does not include missing values, or those values are not desired anymore, the .view() method can be used instead, providing a zero-copy NumPy array of the data.
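
A small sketch of those three routes (a ufunc on a Series, to_numpy with a missing value, and a zero-copy view):

import numpy as np
import polars as pl

s = pl.Series("a", [1.0, 2.0, None])

# ufuncs operate on the Series with almost zero overhead
print(np.exp(s))

# conversion to NumPy: the missing value becomes np.nan
print(s.to_numpy())

# zero-copy view; only use this when there are no missing values
s_dense = pl.Series("b", [1.0, 2.0, 3.0])
print(s_dense.view())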

Data handling

Process strings

Thanks to its Arrow backend, Polars string operations are much faster compared to the same operations performed with NumPy or Pandas. In the latter, strings are stored as Python objects. While traversing the np.array or the pd.Series the CPU needs to follow all the string pointers, and jump to many random memory locations -- which is very cache-inefficient. In Polars (via the Arrow data structure) strings are contiguous in memory. Thus traversing is cache-optimal and predictable for the CPU.

The string processing functions available in Polars are available in the str namespace.

Below are a few examples. To compute string lengths:

import polars as pl

df = pl.DataFrame({"shakespeare": "All that glitters is not gold".split(" ")})

df = df.with_column(pl.col("shakespeare").str.lengths().alias("letter_count"))

returning:

shape: (6, 2)
┌─────────────┬──────────────┐
│ shakespeare ┆ letter_count │
│ ---         ┆ ---          │
│ str         ┆ u32          │
╞═════════════╪══════════════╡
│ All         ┆ 3            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ that        ┆ 4            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ glitters    ┆ 8            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ is          ┆ 2            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ not         ┆ 3            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ gold        ┆ 4            │
└─────────────┴──────────────┘

And below a regex pattern to filter out articles (the, a, and, etc.) from a sentence:

import polars as pl

df = pl.DataFrame({"a": "The man that ate a whole cake".split(" ")})

df = df.filter(pl.col("a").str.contains(r"(?i)^the$|^a$").is_not())

yielding:

shape: (5, 1)
┌───────┐
│ a     │
│ ---   │
│ str   │
╞═══════╡
│ man   │
├╌╌╌╌╌╌╌┤
│ that  │
├╌╌╌╌╌╌╌┤
│ ate   │
├╌╌╌╌╌╌╌┤
│ whole │
├╌╌╌╌╌╌╌┤
│ cake  │
└───────┘

Timestamp parsing

Polars offers the following time datatypes:

  • pl.Date, to be used for date objects: the number of days since the UNIX epoch as a 32 bit signed integer.
  • pl.Datetime, to be used for datetime objects: the number of nanoseconds since the UNIX epoch as a 64 bit signed integer.
  • pl.Time, encoded as the number of nanoseconds since midnight.

Polars string (pl.Utf8) datatypes can be parsed as either of them. You can let Polars try to guess the format of the date[time], or explicitly provide a fmt rule.

For instance (check this link for a comprehensive list):

  • "%Y-%m-%d" for "2020-12-31"
  • "%Y/%B/%d" for "2020/December/31"
  • "%B %y" for "December 20"

Below a quick example:

import polars as pl

dataset = pl.DataFrame({"date": ["2020-01-02", "2020-01-03", "2020-01-04"], "index": [1, 2, 3]})

q = dataset.lazy().with_column(pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"))

df = q.collect()

returning:

shape: (3, 2)
┌────────────┬───────┐
│ date       ┆ index │
│ ---        ┆ ---   │
│ date       ┆ i64   │
╞════════════╪═══════╡
│ 2020-01-02 ┆ 1     │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-03 ┆ 2     │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-04 ┆ 3     │
└────────────┴───────┘

All datetime functionality is shown in the dt namespace.

Performance

This chapter covers some gotchas you need to know to squeeze maximum performance out of Polars. When used properly, Polars can run at blazing speeds. Take a look at the results in the H2O AI database benchmark.

Strings

Understanding the memory format used by Arrow and Polars can really increase performance of your queries. This is especially true for large string data. The figure below shows how an Arrow UTF8 array is laid out in memory.

The array ["foo", "bar", "ham"] is encoded by :

  • a concatenated string "foobarham",
  • an offset array indicating the start (and end) of each string [0, 2, 5, 8],
  • a null bitmap, indicating null values.

This memory structure is very cache-efficient if we are to read the string values. Especially if we compare it to a Vec<String> (an array of heap allocated string data in Rust).

However, if we need to reorder the Arrow UTF8 array, we need to swap around all the bytes of the string values, which can become very expensive when dealing with large strings. For a Vec<String>, on the other hand, we only need to swap the pointers around, moving just 8 bytes of data per value at little cost.

Reordering a DataFrame embedding a large number of Utf8 Series due to an operation (filtering, joining, grouping, etc.) can quickly become quite expensive.

Categorical type

For this reason Polars has a CategoricalType. A Categorical Series is an array filled with u32 values that each represent a unique string value, maintaining cache efficiency whilst remaining cheap to move values around.

In the example below we demonstrate how you can cast a Utf8 Series column to a Categorical Series.

import polars as pl

df["utf8-column"].cast(pl.Categorical)

Eager join multiple DataFrames on Categorical data

When two DataFrames need to be joined based on string data the Categorical data needs to be synchronized (data in column A of df1 needs to point to the same underlying string data as column B in df2). One can do so by casting data in the StringCache context manager. This will synchronize all discoverable string values for the duration of that context manager. If you want the global string cache to exist during the whole run, you can set toggle_string_cache to True.

import polars as pl

df1 = pl.DataFrame({"a": ["foo", "bar", "ham"], "b": [1, 2, 3]})
df2 = pl.DataFrame({"a": ["foo", "spam", "eggs"], "c": [3, 2, 2]})

with pl.StringCache():
    df1 = df1.with_column(pl.col("a").cast(pl.Categorical))
    df2 = df2.with_column(pl.col("a").cast(pl.Categorical))
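
For the whole-run variant mentioned above, a one-line sketch using the toggle_string_cache switch referenced in this section:

import polars as pl

# keep a global string cache alive for the entire run
pl.toggle_string_cache(True)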

Lazy join multiple DataFrames on Categorical data

A lazy query always has a global string cache (unless you opt-out) for the duration of that query (until .collect() is called). The example below shows how you could join two DataFrames with Categorical types.

import polars as pl

lf1 = pl.DataFrame({"a": ["foo", "bar", "ham"], "b": [1, 2, 3]}).lazy()
lf2 = pl.DataFrame({"a": ["foo", "spam", "eggs"], "c": [3, 2, 2]}).lazy()

lf1 = lf1.with_column(pl.col("a").cast(pl.Categorical))
lf2 = lf2.with_column(pl.col("a").cast(pl.Categorical))

lf1.join(lf2, on="a", how="inner")

Optimizations

This chapter will investigate some of the optimizations that are applied by the Polars query optimizer by going through some examples to see how Polars modifies the original query plan.

Lazy API

The Lazy API page is under construction.

To demonstrate the lazy Polars capabilities we'll explore two medium-large datasets of usernames:

Reddit usernames dataset containing 69+ million rows

import polars as pl

from ..paths import DATA_DIR

dataset = pl.read_csv(f"{DATA_DIR}/reddit.csv", stop_after_n_rows=10)
shape: (5, 6)
┌─────┬──────────────────────────┬─────────────┬────────────┬───────────────┬────────────┐
│ id  ┆ name                     ┆ created_utc ┆ updated_on ┆ comment_karma ┆ link_karma │
│ --- ┆ ---                      ┆ ---         ┆ ---        ┆ ---           ┆ ---        │
│ i64 ┆ str                      ┆ i64         ┆ i64        ┆ i64           ┆ i64        │
╞═════╪══════════════════════════╪═════════════╪════════════╪═══════════════╪════════════╡
│ 1   ┆ truman48lamb_jasonbroken ┆ 1397113470  ┆ 1536527864 ┆ 0             ┆ 0          │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ johnethen06_jasonbroken  ┆ 1397113483  ┆ 1536527864 ┆ 0             ┆ 0          │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ yaseinrez_jasonbroken    ┆ 1397113483  ┆ 1536527864 ┆ 0             ┆ 1          │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ Valve92_jasonbroken      ┆ 1397113503  ┆ 1536527864 ┆ 0             ┆ 0          │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 5   ┆ srbhuyan_jasonbroken     ┆ 1397113506  ┆ 1536527864 ┆ 0             ┆ 0          │
└─────┴──────────────────────────┴─────────────┴────────────┴───────────────┴────────────┘

and the Runescape username dataset containing 55+ million records.

import polars as pl

from ..paths import DATA_DIR

dataset = pl.read_csv(f"{DATA_DIR}/runescape.csv", has_headers=False, stop_after_n_rows=10)
shape: (5, 1)
┌─────────────┐
│ column_1    │
│ ---         │
│ str         │
╞═════════════╡
│ a000        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a0000       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a000000     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a0000000    │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a0000000000 │
└─────────────┘

Predicate pushdown

The Predicate pushdown page is under construction

Predicate pushdown is an optimization Polars does that reduces query times and memory usage. A predicate is database jargon for applying a filter on some table, thereby reducing the number of rows on that table.

So let's see if we can load some Reddit data and filter on a few predicates.

import polars as pl

from ..paths import DATA_DIR

q1 = (
    pl.scan_csv(f"{DATA_DIR}/reddit.csv")
    .filter(pl.col("comment_karma") > 0)
    .filter(pl.col("link_karma") > 0)
    .filter(pl.col("name").str.contains(r"^a"))  # filter name that start with an "a"
)

If we were to run the query above, nothing would happen! This is due to lazy evaluation: nothing happens until it is specifically requested. This allows Polars to see the whole context of a query and optimize just in time for execution.

Execution is requested by the .collect method. This would query all available data. While you're writing, optimizing, and checking your query, this is often undesirable. Another method that calls for execution is the .fetch method. .fetch takes a parameter n_rows and tries to 'fetch' that number of rows at the data source (no guarantees are given though).

So let's "fetch" ~10 Million rows from the source file and apply the predicates.

q1.fetch(n_rows=int(1e7))
shape: (656, 6)
┌─────────┬─────────────┬─────────────┬────────────┬───────────────┬────────────┐
│ id      ┆ name        ┆ created_utc ┆ updated_on ┆ comment_karma ┆ link_karma │
│ ---     ┆ ---         ┆ ---         ┆ ---        ┆ ---           ┆ ---        │
│ i64     ┆ str         ┆ i64         ┆ i64        ┆ i64           ┆ i64        │
╞═════════╪═════════════╪═════════════╪════════════╪═══════════════╪════════════╡
│ 77860   ┆ aquarin     ┆ 1137474000  ┆ 1536528294 ┆ 150           ┆ 11         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 77974   ┆ aadvaark    ┆ 1137301200  ┆ 1536528294 ┆ 26            ┆ 47         │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 78004   ┆ apoisel     ┆ 1137301200  ┆ 1536497404 ┆ 42            ┆ 2549       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 78041   ┆ aonic       ┆ 1137301200  ┆ 1536497404 ┆ 2931          ┆ 2095       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...     ┆ ...         ┆ ...         ┆ ...        ┆ ...           ┆ ...        │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1192656 ┆ atothedrian ┆ 1162785880  ┆ 1536497412 ┆ 748           ┆ 585        │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1204607 ┆ akbusiness  ┆ 1162899425  ┆ 1536532995 ┆ 73            ┆ 512        │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1214809 ┆ aaminics    ┆ 1162969322  ┆ 1536533034 ┆ 22            ┆ 6          │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1225341 ┆ antonulrich ┆ 1163110623  ┆ 1536497412 ┆ 9304          ┆ 1782       │
└─────────┴─────────────┴─────────────┴────────────┴───────────────┴────────────┘

Above we see that from the 10 Million rows, 61503 rows match our predicate.

Break it down

In Polars we can visualize the query plan. Let's take a look.

q1.show_graph(optimized=False)

The astute reader might notice that our query is not very optimal because we have three separate FILTER nodes. That means that after every FILTER a new DataFrame is allocated, which will be input to the next FILTER and then deleted from memory -- that must be redundant, and you know what... they'd be right. The predicates should be combined. We should have written this query:

import polars as pl

from ..paths import DATA_DIR

q2 = pl.scan_csv(f"{DATA_DIR}/reddit.csv").filter(
    (pl.col("comment_karma") > 0) & (pl.col("link_karma") > 0) & (pl.col("name").str.contains(r"^a"))
)

That would translate to:

q2.show_graph(optimized=False)

As we can see the predicates are combined. This would lead to less copying of data.

In comes optimization

Polars tries to save that mental overhead from the query writer and combines predicates for you. Besides that, it pushes predicates down to the scan level! Let's see how our optimized query looks.

q1.show_graph(optimized=True)

It may be hard to see, but what is clear is that there is only a single node: the CSV SCAN. The predicate filtering is done during the reading of the csv. This means that this query's memory overhead is reduced by the filtering factor! This makes a huge impact.

Memory

As we have seen there were ~ 62,000 rows left after the FILTER. That means that (aside from some memory overhead of the batch size and filter operations) we use \( \frac{6.2\text{e}4}{1\text{e}7} \sim 0.6 \text{%} \) of the memory we would use during an eager evaluation, where we would first read the whole table into memory before applying a filter.

Performance

At the time of writing, predicate pushdown also improved the query's runtime; a sketch for reproducing such a comparison follows the timings below.

Without optimization, predicate_pushdown=False flag:

real	0m2,401s
user	0m5,457s
sys	0m0,894s

With optimization, predicate_pushdown=True flag:

real	0m1,597s
user	0m6,143s
sys	0m0,647s
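
The numbers above were produced with the Unix time command. Below is a minimal sketch of how such a comparison could be reproduced from Python, assuming fetch accepts the same pushdown flags with False as well (the int(1e7) row limit mirrors the fetch calls later in this guide):

import time

# Time the same query with and without predicate pushdown (a rough sketch, not a proper benchmark).
t0 = time.perf_counter()
q1.fetch(int(1e7), predicate_pushdown=False)  # unoptimized: read everything, then filter
t1 = time.perf_counter()
q1.fetch(int(1e7), predicate_pushdown=True)   # optimized: filter while reading the csv
t2 = time.perf_counter()

print(f"without pushdown: {t1 - t0:.3f}s, with pushdown: {t2 - t1:.3f}s")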

Relational algebra

In the visualization of the query plan you see a \( \sigma \) symbol. This indicates a predicate applied at the SCAN level. There is also a \( \pi \) symbol, indicating a projection (database jargon for column selection), but we'll get to that later.

Cheaper joins

Predicate pushdown will generally also lead to cheaper joins. A join is an expensive operation: the fewer rows that enter a join, the cheaper it becomes.
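
As a rough sketch of why, compare filtering before and after a join; other.csv is a hypothetical second file that also has a name column. With predicate pushdown enabled, Polars rewrites the second form into the first for you.

import polars as pl

from ..paths import DATA_DIR

left = pl.scan_csv(f"{DATA_DIR}/reddit.csv")
right = pl.scan_csv(f"{DATA_DIR}/other.csv")  # hypothetical second dataset with a "name" column

# Cheap: the predicate shrinks the left input before the join runs.
cheap = left.filter(pl.col("comment_karma") > 0).join(right, on="name", how="inner")

# Without pushdown this joins every row first, only to discard most of them afterwards.
expensive = left.join(right, on="name", how="inner").filter(pl.col("comment_karma") > 0)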

Projection pushdown

The Projection pushdown page is under construction.

Let's expand our query from the previous section by joining the result of the FILTER operation with the runescape data, to find popular Reddit users whose username starts with an "a" and who also played Runescape. That must be something we are all interested in!

The query would look like this:

import polars as pl

from ..paths import DATA_DIR

reddit = (
    pl.scan_csv(f"{DATA_DIR}/reddit.csv")
    .filter(pl.col("comment_karma") > 0)
    .filter(pl.col("link_karma") > 0)
    .filter(pl.col("name").str.contains(r"^a"))
)

# The runescape csv has no header, so give its single column a usable name.
runescape = pl.scan_csv("data/runescape.csv", has_headers=False).select(
    pl.col("column_1").alias("name")
)

# Inner join on the username and keep only the three columns we care about.
dataset = reddit.join(runescape, on="name", how="inner").select(
    ["name", "comment_karma", "link_karma"]
)

df1 = dataset.fetch(int(1e7))
df2 = dataset.fetch(int(1e7), predicate_pushdown=True, projection_pushdown=True)

This yields the following DataFrame:

shape: (0, 3)
┌──────┬───────────────┬────────────┐
│ name ┆ comment_karma ┆ link_karma │
│ ---  ┆ ---           ┆ ---        │
│ str  ┆ i64           ┆ i64        │
╞══════╪═══════════════╪════════════╡
└──────┴───────────────┴────────────┘

Break it down

Again, let's take a look at the query plan.

dataset.show_graph(optimized=False)

Now we're focused on the projections, indicated with π. The first node shows π 3/6, meaning that we select 3 out of the 6 columns in the DataFrame. If we look at the csv scans, we see the wildcards π */6 and π */1, meaning that we select all 6 columns of the reddit dataset and the one and only column of the runescape dataset respectively.

This query is not very optimal. We select all columns from both datasets, yet only keep 3 of the 6 after the join. That means that some columns were computed during the join operation only to be ignored, and other columns were parsed during the csv scan only to be dropped at the end. When we are dealing with DataFrames that have a large number of columns, this redundant work can be huge.
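
Conceptually, projection pushdown rewrites the query as if we had selected only the needed columns at scan time ourselves. A hand-written equivalent would look roughly like this (a sketch with the filters left out for brevity; the optimizer performs this rewrite automatically):

import polars as pl

from ..paths import DATA_DIR

# Only read the three columns the final result needs from the reddit csv.
reddit_slim = pl.scan_csv(f"{DATA_DIR}/reddit.csv").select(
    ["name", "comment_karma", "link_karma"]
)
runescape_slim = pl.scan_csv("data/runescape.csv", has_headers=False).select(
    pl.col("column_1").alias("name")
)

dataset_slim = reddit_slim.join(runescape_slim, on="name", how="inner")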

Optimized query

Let's see how Polars optimizes this query.

dataset.show_graph(optimized=True)

The projections are pushed down, past the join operation, all the way to the csv scans. This means that both the scanning and the join operation have become cheaper due to the query optimization.

Performance

Let's time the result before and after optimization.

Without optimization, predicate_pushdown=False and projection_pushdown=False:

real	0m3,273s
user	0m9,284s
sys	0m1,081s

With optimization, the predicate_pushdown and projection_pushdown flags both set to True:

real	0m1,732s
user	0m7,581s
sys	0m0,783s

We can see that we almost halved the query time on this simple query. With real business data, which often comprises many columns and involves filtering missing data, complex groupby operations, and joins, we expect the difference between unoptimized and optimized queries to only grow.

Other optimizations

The Other optimizations page is under construction.

Besides predicate and projection pushdown, Polars does other optimizations.

One important topic is optional caching and parallelization. It's easy to imagine having two different DataFrame computations that lead to a scan of the same file. Polars may cache the scanned file to prevent scanning it twice. However, if you want to, you may override this behavior and force Polars to read the file for each scan. This could be faster because the scans can then be done in parallel.
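
Such a situation could look like the sketch below; whether the shared scan is cached or executed twice (possibly in parallel) depends on your settings and Polars version.

import polars as pl

from ..paths import DATA_DIR

# Two different computations that both originate from a scan of the same file.
scan = pl.scan_csv(f"{DATA_DIR}/reddit.csv")
high_comment = scan.filter(pl.col("comment_karma") > 1000)
high_link = scan.filter(pl.col("link_karma") > 1000)

# Both inputs of this join have data/reddit.csv as their root.
both = high_comment.join(high_link, on="name", how="inner")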

Join parallelization

If we look at the previous query, we see that the join operation has two inputs: one computation path with data/reddit.csv as its root and another with data/runescape.csv as its root. Polars can observe that there are no dependencies between the two DataFrames and will read both files in parallel. If other operations are done before the join (e.g. groupby, filters, etc.), they are also executed in parallel.

Simplify expressions

Another optimization Polars applies is expression simplification. The impact of these optimizations is smaller than that of predicate and projection pushdown, but they likely add up. You can track this issue to see the latest status.
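
As a hypothetical illustration of the kind of redundant expression such a pass could simplify (whether Polars rewrites exactly this pattern depends on the version):

import polars as pl

from ..paths import DATA_DIR

q = pl.scan_csv(f"{DATA_DIR}/reddit.csv").filter(
    # The constant sub-expression lit(10) - lit(9) can be folded to lit(1) ahead of time,
    # after which multiplying by 1 is a no-op, leaving the equivalent of comment_karma > 0.
    (pl.col("comment_karma") * (pl.lit(10) - pl.lit(9))) > 0
)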

Reference guide

Need to see all available methods/functions of Polars? We have Rust and Python references: