Lists and Arrays

Polars has first-class support for List columns: that is, columns where each row is a list of homogeneous elements, and where the lists can differ in length from row to row. Polars also has an Array datatype, which is analogous to NumPy's ndarray objects, where the length is identical across rows.

Note: this is different from Python's list object, where the elements can be of any type. Polars can store these within columns, but as a generic Object datatype that doesn't have the special list manipulation features that we're about to discuss.

Powerful `List` manipulation

Let's say we had the following data from different weather stations across a state. When the weather station is unable to get a result, an error code is recorded instead of the actual temperature at that time.

Python Rust

DataFrame

weather = pl.DataFrame(
    {
        "station": ["Station " + str(x) for x in range(1, 6)],
        "temperatures": [
            "20 5 5 E1 7 13 19 9 6 20",
            "18 8 16 11 23 E2 8 E2 E2 E2 90 70 40",
            "19 24 E9 16 6 12 10 22",
            "E2 E0 15 7 8 10 E1 24 17 13 6",
            "14 8 E0 16 22 24 E1",
        ],
    }
)
print(weather)

DataFrame

let stns: Vec<String> = (1..6).map(|i| format!("Station {i}")).collect();
let weather = df!(
        "station"=> &stns,
        "temperatures"=> &[
            "20 5 5 E1 7 13 19 9 6 20",
            "18 8 16 11 23 E2 8 E2 E2 E2 90 70 40",
            "19 24 E9 16 6 12 10 22",
            "E2 E0 15 7 8 10 E1 24 17 13 6",
            "14 8 E0 16 22 24 E1",
        ],
)?;
println!("{}", &weather);

shape: (5, 2)
┌───────────┬───────────────────────────────────┐
│ station   ┆ temperatures                      │
│ ---       ┆ ---                               │
│ str       ┆ str                               │
╞═══════════╪═══════════════════════════════════╡
│ Station 1 ┆ 20 5 5 E1 7 13 19 9 6 20          │
│ Station 2 ┆ 18 8 16 11 23 E2 8 E2 E2 E2 90 7… │
│ Station 3 ┆ 19 24 E9 16 6 12 10 22            │
│ Station 4 ┆ E2 E0 15 7 8 10 E1 24 17 13 6     │
│ Station 5 ┆ 14 8 E0 16 22 24 E1               │
└───────────┴───────────────────────────────────┘

Creating a `List` column

For the weather DataFrame created above, it's very likely we need to run some analysis on the temperatures that are captured by each station. To make this happen, we first need to get at the individual temperature measurements. We can do this by splitting each string on the space character:

Python Rust

str.split

out = weather.with_columns(pl.col("temperatures").str.split(" "))
print(out)

let out = weather
    .clone()
    .lazy()
    .with_columns([col("temperatures").str().split(" ")])
    .collect()?;
println!("{}", &out);

shape: (5, 2)
┌───────────┬──────────────────────┐
│ station   ┆ temperatures         │
│ ---       ┆ ---                  │
│ str       ┆ list[str]            │
╞═══════════╪══════════════════════╡
│ Station 1 ┆ ["20", "5", … "20"]  │
│ Station 2 ┆ ["18", "8", … "40"]  │
│ Station 3 ┆ ["19", "24", … "22"] │
│ Station 4 ┆ ["E2", "E0", … "6"]  │
│ Station 5 ┆ ["14", "8", … "E1"]  │
└───────────┴──────────────────────┘

From here, one option would be to convert each temperature measurement into its own row:

Python Rust

DataFrame.explode

out = weather.with_columns(pl.col("temperatures").str.split(" ")).explode(
    "temperatures"
)
print(out)

let out = weather
    .clone()
    .lazy()
    .with_columns([col("temperatures").str().split(" ")])
    .explode(["temperatures"])
    .collect()?;
println!("{}", &out);

shape: (49, 2)
┌───────────┬──────────────┐
│ station   ┆ temperatures │
│ ---       ┆ ---          │
│ str       ┆ str          │
╞═══════════╪══════════════╡
│ Station 1 ┆ 20           │
│ Station 1 ┆ 5            │
│ Station 1 ┆ 5            │
│ Station 1 ┆ E1           │
│ …         ┆ …            │
│ Station 5 ┆ 16           │
│ Station 5 ┆ 22           │
│ Station 5 ┆ 24           │
│ Station 5 ┆ E1           │
└───────────┴──────────────┘

However, in Polars, we often do not need to do this to operate on the List elements.

Operating on `List` columns

Polars provides several standard operations on List columns. If we want the first three measurements, we can do a head(3). The last three can be obtained via a tail(3), or alternatively, via slice (negative indexing is supported). We can also identify the number of observations via lengths. Let's see them in action:

Python Rust

Expr.List

out = weather.with_columns(pl.col("temperatures").str.split(" ")).with_columns(
    pl.col("temperatures").list.head(3).alias("top3"),
    pl.col("temperatures").list.slice(-3, 3).alias("bottom_3"),
    pl.col("temperatures").list.lengths().alias("obs"),
)
print(out)

let out = weather
    .clone()
    .lazy()
    .with_columns([col("temperatures").str().split(" ")])
    .with_columns([
        col("temperatures").list().head(lit(3)).alias("top3"),
        col("temperatures")
            .list()
            .slice(lit(-3), lit(3))
            .alias("bottom_3"),
        col("temperatures").list().lengths().alias("obs"),
    ])
    .collect()?;
println!("{}", &out);

shape: (5, 5)
┌───────────┬──────────────────────┬────────────────────┬────────────────────┬─────┐
│ station   ┆ temperatures         ┆ top3               ┆ bottom_3           ┆ obs │
│ ---       ┆ ---                  ┆ ---                ┆ ---                ┆ --- │
│ str       ┆ list[str]            ┆ list[str]          ┆ list[str]          ┆ u32 │
╞═══════════╪══════════════════════╪════════════════════╪════════════════════╪═════╡
│ Station 1 ┆ ["20", "5", … "20"]  ┆ ["20", "5", "5"]   ┆ ["9", "6", "20"]   ┆ 10  │
│ Station 2 ┆ ["18", "8", … "40"]  ┆ ["18", "8", "16"]  ┆ ["90", "70", "40"] ┆ 13  │
│ Station 3 ┆ ["19", "24", … "22"] ┆ ["19", "24", "E9"] ┆ ["12", "10", "22"] ┆ 8   │
│ Station 4 ┆ ["E2", "E0", … "6"]  ┆ ["E2", "E0", "15"] ┆ ["17", "13", "6"]  ┆ 11  │
│ Station 5 ┆ ["14", "8", … "E1"]  ┆ ["14", "8", "E0"]  ┆ ["22", "24", "E1"] ┆ 7   │
└───────────┴──────────────────────┴────────────────────┴────────────────────┴─────┘

arr then, list now

If you find references to the arr API on Stack Overflow or in other sources, just replace arr with list: arr was the old accessor for the List datatype. arr now refers to the newly introduced Array datatype (see below).

Element-wise computation within `List`s

If we need to identify the stations that are reporting the most errors in the starting DataFrame, we need to:

Parse the string input as a List of string values (already done).
Identify those strings that can be converted to numbers.
Identify the number of non-numeric values (i.e. null values) in the list, by row.
Rename this output as errors so that we can easily identify the stations.

Finding the non-numeric values requires a cast (or, alternatively, a regex pattern search) to be performed on each element of the list. We can do this by referencing each element with pl.element() inside list.eval, and then calling a suitable Polars expression on it. Let's see how:

Python Rust

Expr.List · element

out = weather.with_columns(
    pl.col("temperatures")
    .str.split(" ")
    .list.eval(pl.element().cast(pl.Int64, strict=False).is_null())
    .list.sum()
    .alias("errors")
)
print(out)

let out = weather
    .clone()
    .lazy()
    .with_columns([col("temperatures")
        .str()
        .split(" ")
        .list()
        .eval(col("").cast(DataType::Int64).is_null(), false)
        .list()
        .sum()
        .alias("errors")])
    .collect()?;
println!("{}", &out);

shape: (5, 3)
┌───────────┬───────────────────────────────────┬────────┐
│ station   ┆ temperatures                      ┆ errors │
│ ---       ┆ ---                               ┆ ---    │
│ str       ┆ str                               ┆ u32    │
╞═══════════╪═══════════════════════════════════╪════════╡
│ Station 1 ┆ 20 5 5 E1 7 13 19 9 6 20          ┆ 1      │
│ Station 2 ┆ 18 8 16 11 23 E2 8 E2 E2 E2 90 7… ┆ 4      │
│ Station 3 ┆ 19 24 E9 16 6 12 10 22            ┆ 1      │
│ Station 4 ┆ E2 E0 15 7 8 10 E1 24 17 13 6     ┆ 3      │
│ Station 5 ┆ 14 8 E0 16 22 24 E1               ┆ 2      │
└───────────┴───────────────────────────────────┴────────┘

What if we chose the regex route (i.e. recognizing the presence of any alphabetical character)?

Python Rust

str.contains

out = weather.with_columns(
    pl.col("temperatures")
    .str.split(" ")
    .list.eval(pl.element().str.contains("(?i)[a-z]"))
    .list.sum()
    .alias("errors")
)
print(out)

let out = weather
    .clone()
    .lazy()
    .with_columns([col("temperatures")
        .str()
        .split(" ")
        .list()
        .eval(col("").str().contains(lit("(?i)[a-z]"), false), false)
        .list()
        .sum()
        .alias("errors")])
    .collect()?;
println!("{}", &out);

shape: (5, 3)
┌───────────┬───────────────────────────────────┬────────┐
│ station   ┆ temperatures                      ┆ errors │
│ ---       ┆ ---                               ┆ ---    │
│ str       ┆ str                               ┆ u32    │
╞═══════════╪═══════════════════════════════════╪════════╡
│ Station 1 ┆ 20 5 5 E1 7 13 19 9 6 20          ┆ 1      │
│ Station 2 ┆ 18 8 16 11 23 E2 8 E2 E2 E2 90 7… ┆ 4      │
│ Station 3 ┆ 19 24 E9 16 6 12 10 22            ┆ 1      │
│ Station 4 ┆ E2 E0 15 7 8 10 E1 24 17 13 6     ┆ 3      │
│ Station 5 ┆ 14 8 E0 16 22 24 E1               ┆ 2      │
└───────────┴───────────────────────────────────┴────────┘

If you're unfamiliar with the (?i) flag, which makes the pattern case-insensitive, it's a good time to look at the documentation for the str.contains function in Polars! The Rust regex crate provides a lot of additional regex flags that might come in handy.

Row-wise computations

This context is ideal for computing in row orientation.

We can apply any Polars operations on the elements of the list with the list.eval (list().eval in Rust) expression! These expressions run entirely on Polars' query engine and can run in parallel, so they are well optimized. Let's say we have another set of weather data across three days, for different stations:

Python Rust

DataFrame

weather_by_day = pl.DataFrame(
    {
        "station": ["Station " + str(x) for x in range(1, 11)],
        "day_1": [17, 11, 8, 22, 9, 21, 20, 8, 8, 17],
        "day_2": [15, 11, 10, 8, 7, 14, 18, 21, 15, 13],
        "day_3": [16, 15, 24, 24, 8, 23, 19, 23, 16, 10],
    }
)
print(weather_by_day)

DataFrame

let stns: Vec<String> = (1..11).map(|i| format!("Station {i}")).collect();
let weather_by_day = df!(
        "station" => &stns,
        "day_1" => &[17, 11, 8, 22, 9, 21, 20, 8, 8, 17],
        "day_2" => &[15, 11, 10, 8, 7, 14, 18, 21, 15, 13],
        "day_3" => &[16, 15, 24, 24, 8, 23, 19, 23, 16, 10],
)?;
println!("{}", &weather_by_day);

shape: (10, 4)
┌────────────┬───────┬───────┬───────┐
│ station    ┆ day_1 ┆ day_2 ┆ day_3 │
│ ---        ┆ ---   ┆ ---   ┆ ---   │
│ str        ┆ i64   ┆ i64   ┆ i64   │
╞════════════╪═══════╪═══════╪═══════╡
│ Station 1  ┆ 17    ┆ 15    ┆ 16    │
│ Station 2  ┆ 11    ┆ 11    ┆ 15    │
│ Station 3  ┆ 8     ┆ 10    ┆ 24    │
│ Station 4  ┆ 22    ┆ 8     ┆ 24    │
│ …          ┆ …     ┆ …     ┆ …     │
│ Station 7  ┆ 20    ┆ 18    ┆ 19    │
│ Station 8  ┆ 8     ┆ 21    ┆ 23    │
│ Station 9  ┆ 8     ┆ 15    ┆ 16    │
│ Station 10 ┆ 17    ┆ 13    ┆ 10    │
└────────────┴───────┴───────┴───────┘

Let's do something interesting and calculate, for each station, the percentage rank of its temperatures across the three days. Pandas allows you to compute the percentages of the rank values. Polars doesn't provide a special function to do this directly, but because expressions are so versatile we can create our own percentage rank expression, ranking the highest temperature first. Let's try that!

Python Rust

list.eval

rank_pct = (pl.element().rank(descending=True) / pl.col("*").count()).round(2)

out = weather_by_day.with_columns(
    # create the list of homogeneous data
    pl.concat_list(pl.all().exclude("station")).alias("all_temps")
).select(
    # select all columns except the intermediate list
    pl.all().exclude("all_temps"),
    # compute the rank by calling `list.eval`
    pl.col("all_temps").list.eval(rank_pct, parallel=True).alias("temps_rank"),
)

print(out)

let rank_pct = (col("")
    .rank(
        RankOptions {
            method: RankMethod::Average,
            descending: true,
        },
        None,
    )
    .cast(DataType::Float32)
    / col("*").count().cast(DataType::Float32))
.round(2);

let out = weather_by_day
    .clone()
    .lazy()
    .with_columns(
        // create the list of homogeneous data
        [concat_list([all().exclude(["station"])])?.alias("all_temps")],
    )
    .select(
        // select all columns except the intermediate list
        [
            all().exclude(["all_temps"]),
            // compute the rank by calling `list.eval`
            col("all_temps")
                .list()
                .eval(rank_pct, true)
                .alias("temps_rank"),
        ],
    )
    .collect()?;

println!("{}", &out);

shape: (10, 5)
┌────────────┬───────┬───────┬───────┬────────────────────┐
│ station    ┆ day_1 ┆ day_2 ┆ day_3 ┆ temps_rank         │
│ ---        ┆ ---   ┆ ---   ┆ ---   ┆ ---                │
│ str        ┆ i64   ┆ i64   ┆ i64   ┆ list[f64]          │
╞════════════╪═══════╪═══════╪═══════╪════════════════════╡
│ Station 1  ┆ 17    ┆ 15    ┆ 16    ┆ [0.33, 1.0, 0.67]  │
│ Station 2  ┆ 11    ┆ 11    ┆ 15    ┆ [0.83, 0.83, 0.33] │
│ Station 3  ┆ 8     ┆ 10    ┆ 24    ┆ [1.0, 0.67, 0.33]  │
│ Station 4  ┆ 22    ┆ 8     ┆ 24    ┆ [0.67, 1.0, 0.33]  │
│ …          ┆ …     ┆ …     ┆ …     ┆ …                  │
│ Station 7  ┆ 20    ┆ 18    ┆ 19    ┆ [0.33, 1.0, 0.67]  │
│ Station 8  ┆ 8     ┆ 21    ┆ 23    ┆ [1.0, 0.67, 0.33]  │
│ Station 9  ┆ 8     ┆ 15    ┆ 16    ┆ [1.0, 0.67, 0.33]  │
│ Station 10 ┆ 17    ┆ 13    ┆ 10    ┆ [0.33, 0.67, 1.0]  │
└────────────┴───────┴───────┴───────┴────────────────────┘

Polars `Array`s

Arrays are a recently introduced data type, and the set of features they offer is still small. The major difference between a List and an Array is that the latter is limited to having the same number of elements per row, while a List can have a variable number of elements. Both still require that each element's data type is the same.

We can define Array columns in this manner:

Python Rust

Array

array_df = pl.DataFrame(
    [
        pl.Series("Array_1", [[1, 3], [2, 5]]),
        pl.Series("Array_2", [[1, 7, 3], [8, 1, 0]]),
    ],
    schema={"Array_1": pl.Array(2, pl.Int64), "Array_2": pl.Array(3, pl.Int64)},
)
print(array_df)

let mut col1: ListPrimitiveChunkedBuilder<Int32Type> =
    ListPrimitiveChunkedBuilder::new("Array_1", 8, 8, DataType::Int32);
col1.append_slice(&[1, 3]);
col1.append_slice(&[2, 5]);
let mut col2: ListPrimitiveChunkedBuilder<Int32Type> =
    ListPrimitiveChunkedBuilder::new("Array_2", 8, 8, DataType::Int32);
col2.append_slice(&[1, 7, 3]);
col2.append_slice(&[8, 1, 0]);
let array_df = DataFrame::new([col1.finish(), col2.finish()].into())?;

println!("{}", &array_df);

shape: (2, 2)
┌───────────────┬───────────────┐
│ Array_1       ┆ Array_2       │
│ ---           ┆ ---           │
│ array[i64, 2] ┆ array[i64, 3] │
╞═══════════════╪═══════════════╡
│ [1, 3]        ┆ [1, 7, 3]     │
│ [2, 5]        ┆ [8, 1, 0]     │
└───────────────┴───────────────┘

Basic operations are available on them:

Python Rust

arr

out = array_df.select(
    pl.col("Array_1").arr.min().suffix("_min"),
    pl.col("Array_2").arr.sum().suffix("_sum"),
)
print(out)

let out = array_df
    .clone()
    .lazy()
    .select([
        col("Array_1").list().min().suffix("_min"),
        col("Array_2").list().sum().suffix("_sum"),
    ])
    .collect()?;
println!("{}", &out);

shape: (2, 2)
┌─────────────┬─────────────┐
│ Array_1_min ┆ Array_2_sum │
│ ---         ┆ ---         │
│ i64         ┆ i64         │
╞═════════════╪═════════════╡
│ 1           ┆ 11          │
│ 2           ┆ 9           │
└─────────────┴─────────────┘

Polars Arrays are still being actively developed, so this section will likely change in the future.
