GroupBy

The GroupBy page is under construction.

A multithreaded approach

One of the most efficient ways to process tabular data is to parallelize its processing via the "split-apply-combine" approach. This operation is at the core of the Polars grouping implementation, allowing it to attain lightning-fast operations. Specifically, both the "split" and "apply" phases are executed in a multi-threaded fashion.

A simple grouping operation is taken below as an example to illustrate this approach:

For the hashing operations performed during the "split" phase, Polars uses a multithreaded lock-free approach that is illustrated on the following schema:

This parallelization allows the grouping and joining operations (for instance) to be blazingly fast!

Check out this blog post for more details.

Do not kill the parallelization!

The following is specific to Python, and doesn't apply to Rust. Within Rust, blocks and closures (lambdas) can, and will, be executed concurrently.

We have all heard that Python is slow, and does "not scale." Besides the overhead of running "slow" bytecode, Python has to remain within the constraints of the Global Interpreter Lock (GIL). This means that if you were to use a lambda or a custom Python function to apply during a parallelized phase, Polars speed is capped running Python code preventing any multiple threads from executing the function.

This all feels terribly limiting, especially because we often need those lambda functions in a .groupby() step, for example. This approach is still supported by Polars, but keeping in mind bytecode and the GIL costs have to be paid.

To mitigate this, Polars implements a powerful syntax defined not only in its lazy API, but also in its eager API. Let's take a look at what that means.

We can start with the simple US congress dataset.

Note to Rust users, the dtype-categorical feature must be enabled for the examples in this section.

dtypes = {
    "first_name": pl.Categorical,
    "gender": pl.Categorical,
    "type": pl.Categorical,
    "state": pl.Categorical,
    "party": pl.Categorical,
}

dataset = pl.read_csv(url, dtype=dtypes).with_column(pl.col("birthday").str.strptime(pl.Date, strict=False))
    let url = "https://theunitedstates.io/congress-legislators/legislators-historical.csv";

    let mut schema = Schema::new();
    schema.with_column("first_name".to_string(), DataType::Categorical(None));
    schema.with_column("gender".to_string(), DataType::Categorical(None));
    schema.with_column("type".to_string(), DataType::Categorical(None));
    schema.with_column("state".to_string(), DataType::Categorical(None));
    schema.with_column("party".to_string(), DataType::Categorical(None));
    schema.with_column("birthday".to_string(), DataType::Date);

    let data: Vec<u8> = Client::new().get(url).send()?.text()?.bytes().collect();

    let dataset = CsvReader::new(Cursor::new(data))
        .has_header(true)
        .with_dtypes(Some(&schema))
        .with_parse_dates(true)
        .finish()?;

    println!("{}", &dataset);

Basic aggregations

You can easily combine different aggregations by adding multiple expressions in a list. There is no upper bound on the number of aggregations you can do, and you can make any combination you want. In the snippet below we do the following aggregations:

Per GROUP "first_name" we

  • count the number of rows in the group:
    • short form: pl.count("party")
    • full form: pl.col("party").count()
  • aggregate the gender values groups to a list:
    • full form: pl.col("gender").list()
  • get the first value of column "last_name" in the group:
    • short form: pl.first("last_name") (not available in Rust)
    • full form: pl.col("last_name").first()

Besides the aggregation, we immediately sort the result and limit to the top 5 so that we have a nice summary overview.

import polars as pl

from .dataset import dataset

q = (
    dataset.lazy()
    .groupby("first_name")
    .agg(
        [
            pl.count(),
            pl.col("gender").list(),
            pl.first("last_name"),
        ]
    )
    .sort("count", reverse=True)
    .limit(5)
)

df = q.collect()
    let df = dataset
        .clone()
        .lazy()
        .groupby(["first_name"])
        .agg([count(), col("gender").list(), col("last_name").first()])
        .sort(
            "count",
            SortOptions {
                descending: true,
                nulls_last: true,
            },
        )
        .limit(5)
        .collect()?;

    println!("{}", df);
shape: (5, 4)
┌────────────┬───────┬─────────────────────┬───────────┐
│ first_name ┆ count ┆ gender              ┆ last_name │
│ ---        ┆ ---   ┆ ---                 ┆ ---       │
│ cat        ┆ u32   ┆ list[cat]           ┆ str       │
╞════════════╪═══════╪═════════════════════╪═══════════╡
│ John       ┆ 1254  ┆ ["M", "M", ... "M"] ┆ Walker    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ William    ┆ 1022  ┆ ["M", "M", ... "M"] ┆ Few       │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ James      ┆ 712   ┆ ["M", "M", ... "M"] ┆ Armstrong │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Thomas     ┆ 453   ┆ ["M", "M", ... "M"] ┆ Tucker    │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Charles    ┆ 439   ┆ ["M", "M", ... "M"] ┆ Carroll   │
└────────────┴───────┴─────────────────────┴───────────┘

Conditionals

It's that easy! Let's turn it up a notch. Let's say we want to know how many delegates of a "state" are "Pro" or "Anti" administration. We could directly query that in the aggregation without the need of lambda or grooming the DataFrame.

import polars as pl

from .dataset import dataset

q = (
    dataset.lazy()
    .groupby("state")
    .agg(
        [
            (pl.col("party") == "Anti-Administration").sum().alias("anti"),
            (pl.col("party") == "Pro-Administration").sum().alias("pro"),
        ]
    )
    .sort("pro", reverse=True)
    .limit(5)
)

df = q.collect()
    let df = dataset
        .clone()
        .lazy()
        .groupby(["state"])
        .agg([
            (col("party").eq(lit("Anti-Administration")))
                .sum()
                .alias("anti"),
            (col("party").eq(lit("Pro-Administration")))
                .sum()
                .alias("pro"),
        ])
        .sort(
            "pro",
            SortOptions {
                descending: true,
                nulls_last: false,
            },
        )
        .limit(5)
        .collect()?;

    println!("{}", df);
shape: (5, 3)
┌───────┬──────┬─────┐
│ state ┆ anti ┆ pro │
│ ---   ┆ ---  ┆ --- │
│ cat   ┆ u32  ┆ u32 │
╞═══════╪══════╪═════╡
│ CT    ┆ 0    ┆ 3   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ NJ    ┆ 0    ┆ 3   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ NC    ┆ 1    ┆ 2   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ VA    ┆ 3    ┆ 1   │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┤
│ SC    ┆ 0    ┆ 1   │
└───────┴──────┴─────┘

Similarly, this could also be done with a nested GROUPBY, but that doesn't help show off some of these nice features. 😉

import polars as pl

from .dataset import dataset

q = (
    dataset.lazy()
    .groupby(["state", "party"])
    .agg([pl.count("party").alias("count")])
    .filter((pl.col("party") == "Anti-Administration") | (pl.col("party") == "Pro-Administration"))
    .sort("count", reverse=True)
    .limit(5)
)

df = q.collect()
    let df = dataset
        .clone()
        .lazy()
        .groupby(["state", "party"])
        .agg([col("party").count().alias("count")])
        .filter(
            col("party")
                .eq(lit("Anti-Administration"))
                .or(col("party").eq(lit("Pro-Administration"))),
        )
        .sort(
            "count",
            SortOptions {
                descending: true,
                nulls_last: true,
            },
        )
        .limit(5)
        .collect()?;

    println!("{}", df);
shape: (5, 3)
┌───────┬─────────────────────┬───────┐
│ state ┆ party               ┆ count │
│ ---   ┆ ---                 ┆ ---   │
│ cat   ┆ cat                 ┆ u32   │
╞═══════╪═════════════════════╪═══════╡
│ NJ    ┆ Pro-Administration  ┆ 3     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ CT    ┆ Pro-Administration  ┆ 3     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ VA    ┆ Anti-Administration ┆ 3     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ NC    ┆ Pro-Administration  ┆ 2     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ PA    ┆ Anti-Administration ┆ 1     │
└───────┴─────────────────────┴───────┘

Filtering

We can also filter the groups. Let's say we want to compute a mean per group, but we don't want to include all values from that group, and we also don't want to filter the rows from the DataFrame (because we need those rows for another aggregation).

In the example below we show how that can be done.

Note that we can make Python functions for clarity. These functions don't cost us anything. That is because we only create Polars expressions, we don't apply a custom function over a Series during runtime of the query. Of course, you can make functions that return expressions in Rust, too.

from datetime import date

import polars as pl

from .dataset import dataset


def compute_age() -> pl.Expr:
    return date(2021, 1, 1).year - pl.col("birthday").dt.year()


def avg_birthday(gender: str) -> pl.Expr:
    return compute_age().filter(pl.col("gender") == gender).mean().alias(f"avg {gender} birthday")


q = (
    dataset.lazy()
    .groupby(["state"])
    .agg(
        [
            avg_birthday("M"),
            avg_birthday("F"),
            (pl.col("gender") == "M").sum().alias("# male"),
            (pl.col("gender") == "F").sum().alias("# female"),
        ]
    )
    .limit(5)
)

df = q.collect()
    fn compute_age() -> Expr {
        lit(2022) - col("birthday").dt().year()
    }

    fn avg_birthday(gender: &str) -> Expr {
        compute_age()
            .filter(col("gender").eq(lit(gender)))
            .mean()
            .alias(&format!("avg {} birthday", gender))
    }

    let df = dataset
        .clone()
        .lazy()
        .groupby(["state"])
        .agg([
            avg_birthday("M"),
            avg_birthday("F"),
            (col("gender").eq(lit("M"))).sum().alias("# male"),
            (col("gender").eq(lit("F"))).sum().alias("# female"),
        ])
        .limit(5)
        .collect()?;

    println!("{}", df);
shape: (5, 5)
┌───────┬────────────────┬────────────────┬────────┬──────────┐
│ state ┆ avg M birthday ┆ avg F birthday ┆ # male ┆ # female │
│ ---   ┆ ---            ┆ ---            ┆ ---    ┆ ---      │
│ cat   ┆ f64            ┆ f64            ┆ u32    ┆ u32      │
╞═══════╪════════════════╪════════════════╪════════╪══════════╡
│ ID    ┆ 137.666667     ┆ 99.0           ┆ 57     ┆ 2        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ WY    ┆ 137.717949     ┆ 75.0           ┆ 39     ┆ 1        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ OR    ┆ 149.892857     ┆ 106.4          ┆ 85     ┆ 5        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ PI    ┆ 145.0          ┆ null           ┆ 13     ┆ 0        │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ CO    ┆ 130.8876       ┆ 72.666667      ┆ 89     ┆ 3        │
└───────┴────────────────┴────────────────┴────────┴──────────┘

Sorting

It's common to see a DataFrame being sorted for the sole purpose of managing the ordering during a GROUPBY operation. Let's say that we want to get the names of the oldest and youngest politicians per state. We could SORT and GROUPBY.

import polars as pl

from .dataset import dataset


def get_person() -> pl.Expr:
    return pl.col("first_name") + pl.lit(" ") + pl.col("last_name")


q = (
    dataset.lazy()
    .sort("birthday")
    .groupby(["state"])
    .agg(
        [
            get_person().first().alias("youngest"),
            get_person().last().alias("oldest"),
        ]
    )
    .limit(5)
)

df = q.collect()
    fn get_person() -> Expr {
        col("first_name") + lit(" ") + col("last_name")
    }

    let df = dataset
        .clone()
        .lazy()
        .sort(
            "birthday",
            SortOptions {
                descending: true,
                nulls_last: true,
            },
        )
        .groupby(["state"])
        .agg([
            get_person().first().alias("youngest"),
            get_person().last().alias("oldest"),
        ])
        .limit(5)
        .collect()?;

    println!("{}", df);
shape: (5, 3)
┌───────┬─────────────────────┬──────────────────┐
│ state ┆ youngest            ┆ oldest           │
│ ---   ┆ ---                 ┆ ---              │
│ cat   ┆ str                 ┆ str              │
╞═══════╪═════════════════════╪══════════════════╡
│ GU    ┆ Antonio Won Pat     ┆ Robert Underwood │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ VA    ┆ William Grayson     ┆ Scott Taylor     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ IL    ┆ Benjamin Stephenson ┆ Aaron Schock     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ CA    ┆ Edward Gilbert      ┆ Katie Hill       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ KY    ┆ John Edwards        ┆ Ben Chandler     │
└───────┴─────────────────────┴──────────────────┘

However, if we also want to sort the names alphabetically, this breaks. Luckily we can sort in a groupby context separate from the DataFrame.

import polars as pl

from .dataset import dataset


def get_person() -> pl.Expr:
    return pl.col("first_name") + pl.lit(" ") + pl.col("last_name")


q = (
    dataset.lazy()
    .sort("birthday")
    .groupby(["state"])
    .agg(
        [
            get_person().first().alias("youngest"),
            get_person().last().alias("oldest"),
            get_person().sort().first().alias("alphabetical_first"),
        ]
    )
    .limit(5)
)

df = q.collect()
    let df = dataset
        .clone()
        .lazy()
        .sort(
            "birthday",
            SortOptions {
                descending: true,
                nulls_last: true,
            },
        )
        .groupby(["state"])
        .agg([
            get_person().first().alias("youngest"),
            get_person().last().alias("oldest"),
            get_person().sort(false).first().alias("alphabetical_first"),
        ])
        .limit(5)
        .collect()?;

    println!("{}", df);
shape: (5, 4)
┌───────┬───────────────────────┬─────────────────┬────────────────────┐
│ state ┆ youngest              ┆ oldest          ┆ alphabetical_first │
│ ---   ┆ ---                   ┆ ---             ┆ ---                │
│ cat   ┆ str                   ┆ str             ┆ str                │
╞═══════╪═══════════════════════╪═════════════════╪════════════════════╡
│ FL    ┆ Charles Downing       ┆ Patrick Murphy  ┆ Abijah Gilbert     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ OH    ┆ John Smith            ┆ John Boccieri   ┆ Aaron Harlan       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ NJ    ┆ Lambert Cadwalader    ┆ Jon Runyan      ┆ Aaron Kitchell     │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ IA    ┆ Bernhart Henn         ┆ Abby Finkenauer ┆ Abby Finkenauer    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ NY    ┆ Cornelius Schoonmaker ┆ Max Rose        ┆ A. Foster          │
└───────┴───────────────────────┴─────────────────┴────────────────────┘

We can even sort by another column in the groupby context. If we want to know if the alphabetically sorted name is male or female we could add: pl.col("gender").sort_by("first_name").first().alias("gender")

import polars as pl

from .dataset import dataset


def get_person() -> pl.Expr:
    return pl.col("first_name") + pl.lit(" ") + pl.col("last_name")


q = (
    dataset.lazy()
    .sort("birthday")
    .groupby(["state"])
    .agg(
        [
            get_person().first().alias("youngest"),
            get_person().last().alias("oldest"),
            get_person().sort().first().alias("alphabetical_first"),
            pl.col("gender").sort_by("first_name").first().alias("gender"),
        ]
    )
    .sort("state")
    .limit(5)
)

df = q.collect()
    let df = dataset
        .clone()
        .lazy()
        .sort(
            "birthday",
            SortOptions {
                descending: true,
                nulls_last: true,
            },
        )
        .groupby(["state"])
        .agg([
            get_person().first().alias("youngest"),
            get_person().last().alias("oldest"),
            get_person().sort(false).first().alias("alphabetical_first"),
            col("gender")
                .sort_by(["first_name"], [false])
                .first()
                .alias("gender"),
        ])
        .sort("state", SortOptions::default())
        .limit(5)
        .collect()?;

    println!("{}", df);
shape: (5, 5)
┌───────┬───────────────────┬────────────────┬────────────────────┬────────┐
│ state ┆ youngest          ┆ oldest         ┆ alphabetical_first ┆ gender │
│ ---   ┆ ---               ┆ ---            ┆ ---                ┆ ---    │
│ cat   ┆ str               ┆ str            ┆ str                ┆ cat    │
╞═══════╪═══════════════════╪════════════════╪════════════════════╪════════╡
│ DE    ┆ Samuel White      ┆ John Carney    ┆ Albert Polk        ┆ M      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ VA    ┆ William Grayson   ┆ Scott Taylor   ┆ Abraham Venable    ┆ M      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ SC    ┆ Ralph Izard       ┆ Joe Cunningham ┆ Abraham Nott       ┆ M      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ MD    ┆ Benjamin Contee   ┆ Frank Kratovil ┆ Albert Blakeney    ┆ M      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ PA    ┆ Thomas Fitzsimons ┆ Ryan Costello  ┆ Aaron Kreider      ┆ M      │
└───────┴───────────────────┴────────────────┴────────────────────┴────────┘

Conclusion

In the examples above we've seen that we can do a lot by combining expressions. By doing so we delay the use of custom Python functions that slow down the queries (by the slow nature of Python AND the GIL).

If we are missing a type expression let us know by opening a feature request!