Lists and Arrays
Polars has first-class support for List columns: that is, columns where each row is a list of homogeneous elements, of varying lengths. Polars also has an Array datatype, which is analogous to NumPy's ndarray objects, where the length is identical across rows.
Note: this is different from Python's list object, where the elements can be of any type. Polars can store these within columns, but as a generic Object datatype that doesn't have the special list manipulation features that we're about to discuss.
Powerful List manipulation
Let's say we had the following data from different weather stations across a state. When the weather station is unable to get a result, an error code is recorded instead of the actual temperature at that time.
import polars as pl

weather = pl.DataFrame(
    {
        "station": ["Station " + str(x) for x in range(1, 6)],
        "temperatures": [
            "20 5 5 E1 7 13 19 9 6 20",
            "18 8 16 11 23 E2 8 E2 E2 E2 90 70 40",
            "19 24 E9 16 6 12 10 22",
            "E2 E0 15 7 8 10 E1 24 17 13 6",
            "14 8 E0 16 22 24 E1",
        ],
    }
)
print(weather)

use polars::prelude::*;

let stns: Vec<String> = (1..6).map(|i| format!("Station {i}")).collect();
let weather = df!(
    "station" => &stns,
    "temperatures" => &[
        "20 5 5 E1 7 13 19 9 6 20",
        "18 8 16 11 23 E2 8 E2 E2 E2 90 70 40",
        "19 24 E9 16 6 12 10 22",
        "E2 E0 15 7 8 10 E1 24 17 13 6",
        "14 8 E0 16 22 24 E1",
    ],
)?;
println!("{}", &weather);
shape: (5, 2)
┌───────────┬───────────────────────────────────┐
│ station ┆ temperatures │
│ --- ┆ --- │
│ str ┆ str │
╞═══════════╪═══════════════════════════════════╡
│ Station 1 ┆ 20 5 5 E1 7 13 19 9 6 20 │
│ Station 2 ┆ 18 8 16 11 23 E2 8 E2 E2 E2 90 7… │
│ Station 3 ┆ 19 24 E9 16 6 12 10 22 │
│ Station 4 ┆ E2 E0 15 7 8 10 E1 24 17 13 6 │
│ Station 5 ┆ 14 8 E0 16 22 24 E1 │
└───────────┴───────────────────────────────────┘
Creating a List column
For the weather DataFrame created above, it's very likely we need to run some analysis on the temperatures that are captured by each station. To make this happen, we first need to split each string into individual temperature measurements:
out = weather.with_columns(pl.col("temperatures").str.split(" "))
print(out)

let out = weather
    .clone()
    .lazy()
    .with_columns([col("temperatures").str().split(" ")])
    .collect()?;
println!("{}", &out);
shape: (5, 2)
┌───────────┬──────────────────────┐
│ station ┆ temperatures │
│ --- ┆ --- │
│ str ┆ list[str] │
╞═══════════╪══════════════════════╡
│ Station 1 ┆ ["20", "5", … "20"] │
│ Station 2 ┆ ["18", "8", … "40"] │
│ Station 3 ┆ ["19", "24", … "22"] │
│ Station 4 ┆ ["E2", "E0", … "6"] │
│ Station 5 ┆ ["14", "8", … "E1"] │
└───────────┴──────────────────────┘
One way to proceed from here would be to convert each temperature measurement into its own row:
out = weather.with_columns(pl.col("temperatures").str.split(" ")).explode(
    "temperatures"
)
print(out)

let out = weather
    .clone()
    .lazy()
    .with_columns([col("temperatures").str().split(" ")])
    .explode(["temperatures"])
    .collect()?;
println!("{}", &out);
shape: (49, 2)
┌───────────┬──────────────┐
│ station ┆ temperatures │
│ --- ┆ --- │
│ str ┆ str │
╞═══════════╪══════════════╡
│ Station 1 ┆ 20 │
│ Station 1 ┆ 5 │
│ Station 1 ┆ 5 │
│ Station 1 ┆ E1 │
│ … ┆ … │
│ Station 5 ┆ 16 │
│ Station 5 ┆ 22 │
│ Station 5 ┆ 24 │
│ Station 5 ┆ E1 │
└───────────┴──────────────┘
However, in Polars, we often do not need to do this to operate on the List elements.
Operating on List columns
Polars provides several standard operations on List columns. If we want the first three measurements, we can do a head(3). The last three can be obtained via a tail(3), or alternately, via slice (negative indexing is supported). We can also identify the number of observations via lengths. Let's see them in action:
out = weather.with_columns(pl.col("temperatures").str.split(" ")).with_columns(
    pl.col("temperatures").list.head(3).alias("top3"),
    pl.col("temperatures").list.slice(-3, 3).alias("bottom_3"),
    pl.col("temperatures").list.lengths().alias("obs"),
)
print(out)

let out = weather
    .clone()
    .lazy()
    .with_columns([col("temperatures").str().split(" ")])
    .with_columns([
        col("temperatures").list().head(lit(3)).alias("top3"),
        col("temperatures")
            .list()
            .slice(lit(-3), lit(3))
            .alias("bottom_3"),
        col("temperatures").list().lengths().alias("obs"),
    ])
    .collect()?;
println!("{}", &out);
shape: (5, 5)
┌───────────┬──────────────────────┬────────────────────┬────────────────────┬─────┐
│ station ┆ temperatures ┆ top3 ┆ bottom_3 ┆ obs │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ list[str] ┆ list[str] ┆ list[str] ┆ u32 │
╞═══════════╪══════════════════════╪════════════════════╪════════════════════╪═════╡
│ Station 1 ┆ ["20", "5", … "20"] ┆ ["20", "5", "5"] ┆ ["9", "6", "20"] ┆ 10 │
│ Station 2 ┆ ["18", "8", … "40"] ┆ ["18", "8", "16"] ┆ ["90", "70", "40"] ┆ 13 │
│ Station 3 ┆ ["19", "24", … "22"] ┆ ["19", "24", "E9"] ┆ ["12", "10", "22"] ┆ 8 │
│ Station 4 ┆ ["E2", "E0", … "6"] ┆ ["E2", "E0", "15"] ┆ ["17", "13", "6"] ┆ 11 │
│ Station 5 ┆ ["14", "8", … "E1"] ┆ ["14", "8", "E0"] ┆ ["22", "24", "E1"] ┆ 7 │
└───────────┴──────────────────────┴────────────────────┴────────────────────┴─────┘
arr then, list now
If you find references to the arr API on Stack Overflow or other sources, just replace arr with list: arr was the old accessor for the List datatype. arr now refers to the newly introduced Array datatype (see below).
Element-wise computation within Lists
If we need to identify the stations that are giving the most errors from the starting DataFrame, we need to:
- Parse the string input as a List of string values (already done).
- Identify those strings that can be converted to numbers.
- Identify the number of non-numeric values (i.e. null values) in the list, by row.
- Rename this output as errors so that we can easily identify the stations.
The third step requires a casting (or alternately, a regex pattern search) operation to be performed on each element of the list. We can do this by first referencing each element with pl.element() inside the list.eval context, and then calling a suitable Polars expression on it. Let's see how:
out = weather.with_columns(
    pl.col("temperatures")
    .str.split(" ")
    .list.eval(pl.element().cast(pl.Int64, strict=False).is_null())
    .list.sum()
    .alias("errors")
)
print(out)

let out = weather
    .clone()
    .lazy()
    .with_columns([col("temperatures")
        .str()
        .split(" ")
        .list()
        .eval(col("").cast(DataType::Int64).is_null(), false)
        .list()
        .sum()
        .alias("errors")])
    .collect()?;
println!("{}", &out);
shape: (5, 3)
┌───────────┬───────────────────────────────────┬────────┐
│ station ┆ temperatures ┆ errors │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 │
╞═══════════╪═══════════════════════════════════╪════════╡
│ Station 1 ┆ 20 5 5 E1 7 13 19 9 6 20 ┆ 1 │
│ Station 2 ┆ 18 8 16 11 23 E2 8 E2 E2 E2 90 7… ┆ 4 │
│ Station 3 ┆ 19 24 E9 16 6 12 10 22 ┆ 1 │
│ Station 4 ┆ E2 E0 15 7 8 10 E1 24 17 13 6 ┆ 3 │
│ Station 5 ┆ 14 8 E0 16 22 24 E1 ┆ 2 │
└───────────┴───────────────────────────────────┴────────┘
What if we chose the regex route (i.e. recognizing the presence of any alphabetical character)?
out = weather.with_columns(
    pl.col("temperatures")
    .str.split(" ")
    .list.eval(pl.element().str.contains("(?i)[a-z]"))
    .list.sum()
    .alias("errors")
)
print(out)

let out = weather
    .clone()
    .lazy()
    .with_columns([col("temperatures")
        .str()
        .split(" ")
        .list()
        .eval(col("").str().contains(lit("(?i)[a-z]"), false), false)
        .list()
        .sum()
        .alias("errors")])
    .collect()?;
println!("{}", &out);
shape: (5, 3)
┌───────────┬───────────────────────────────────┬────────┐
│ station ┆ temperatures ┆ errors │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 │
╞═══════════╪═══════════════════════════════════╪════════╡
│ Station 1 ┆ 20 5 5 E1 7 13 19 9 6 20 ┆ 1 │
│ Station 2 ┆ 18 8 16 11 23 E2 8 E2 E2 E2 90 7… ┆ 4 │
│ Station 3 ┆ 19 24 E9 16 6 12 10 22 ┆ 1 │
│ Station 4 ┆ E2 E0 15 7 8 10 E1 24 17 13 6 ┆ 3 │
│ Station 5 ┆ 14 8 E0 16 22 24 E1 ┆ 2 │
└───────────┴───────────────────────────────────┴────────┘
If you're unfamiliar with (?i), which makes the pattern case-insensitive, it's a good time to look at the documentation for the str.contains function in Polars! The Rust regex crate provides a lot of additional regex flags that might come in handy.
Row-wise computations
The list.eval context is ideal for computing in row orientation.
We can apply any Polars operations on the elements of the list with the list.eval (list().eval in Rust) expression! These expressions run entirely on Polars' query engine and can run in parallel, so they will be well optimized. Let's say we have another set of weather data across three days, for different stations:
weather_by_day = pl.DataFrame(
    {
        "station": ["Station " + str(x) for x in range(1, 11)],
        "day_1": [17, 11, 8, 22, 9, 21, 20, 8, 8, 17],
        "day_2": [15, 11, 10, 8, 7, 14, 18, 21, 15, 13],
        "day_3": [16, 15, 24, 24, 8, 23, 19, 23, 16, 10],
    }
)
print(weather_by_day)

let stns: Vec<String> = (1..11).map(|i| format!("Station {i}")).collect();
let weather_by_day = df!(
    "station" => &stns,
    "day_1" => &[17, 11, 8, 22, 9, 21, 20, 8, 8, 17],
    "day_2" => &[15, 11, 10, 8, 7, 14, 18, 21, 15, 13],
    "day_3" => &[16, 15, 24, 24, 8, 23, 19, 23, 16, 10],
)?;
println!("{}", &weather_by_day);
shape: (10, 4)
┌────────────┬───────┬───────┬───────┐
│ station ┆ day_1 ┆ day_2 ┆ day_3 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪═══════╪═══════╪═══════╡
│ Station 1 ┆ 17 ┆ 15 ┆ 16 │
│ Station 2 ┆ 11 ┆ 11 ┆ 15 │
│ Station 3 ┆ 8 ┆ 10 ┆ 24 │
│ Station 4 ┆ 22 ┆ 8 ┆ 24 │
│ … ┆ … ┆ … ┆ … │
│ Station 7 ┆ 20 ┆ 18 ┆ 19 │
│ Station 8 ┆ 8 ┆ 21 ┆ 23 │
│ Station 9 ┆ 8 ┆ 15 ┆ 16 │
│ Station 10 ┆ 17 ┆ 13 ┆ 10 │
└────────────┴───────┴───────┴───────┘
Let's do something interesting: for each station, we rank its temperatures across the three days and express each rank as a percentage of the number of days. Pandas allows you to compute the percentages of the rank values (via rank(pct=True)). Polars doesn't provide a special function to do this directly, but because expressions are so versatile we can create our own percentage rank expression for highest temperature. Let's try that!
rank_pct = (pl.element().rank(descending=True) / pl.col("*").count()).round(2)
out = weather_by_day.with_columns(
    # create the list of homogeneous data
    pl.concat_list(pl.all().exclude("station")).alias("all_temps")
).select(
    # select all columns except the intermediate list
    pl.all().exclude("all_temps"),
    # compute the rank by calling `list.eval`
    pl.col("all_temps").list.eval(rank_pct, parallel=True).alias("temps_rank"),
)
print(out)

let rank_pct = (col("")
    .rank(
        RankOptions {
            method: RankMethod::Average,
            descending: true,
        },
        None,
    )
    .cast(DataType::Float32)
    / col("*").count().cast(DataType::Float32))
.round(2);
let out = weather_by_day
    .clone()
    .lazy()
    .with_columns(
        // create the list of homogeneous data
        [concat_list([all().exclude(["station"])])?.alias("all_temps")],
    )
    .select(
        // select all columns except the intermediate list
        [
            all().exclude(["all_temps"]),
            // compute the rank by calling `list.eval`
            col("all_temps")
                .list()
                .eval(rank_pct, true)
                .alias("temps_rank"),
        ],
    )
    .collect()?;
println!("{}", &out);
shape: (10, 5)
┌────────────┬───────┬───────┬───────┬────────────────────┐
│ station ┆ day_1 ┆ day_2 ┆ day_3 ┆ temps_rank │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ list[f64] │
╞════════════╪═══════╪═══════╪═══════╪════════════════════╡
│ Station 1 ┆ 17 ┆ 15 ┆ 16 ┆ [0.33, 1.0, 0.67] │
│ Station 2 ┆ 11 ┆ 11 ┆ 15 ┆ [0.83, 0.83, 0.33] │
│ Station 3 ┆ 8 ┆ 10 ┆ 24 ┆ [1.0, 0.67, 0.33] │
│ Station 4 ┆ 22 ┆ 8 ┆ 24 ┆ [0.67, 1.0, 0.33] │
│ … ┆ … ┆ … ┆ … ┆ … │
│ Station 7 ┆ 20 ┆ 18 ┆ 19 ┆ [0.33, 1.0, 0.67] │
│ Station 8 ┆ 8 ┆ 21 ┆ 23 ┆ [1.0, 0.67, 0.33] │
│ Station 9 ┆ 8 ┆ 15 ┆ 16 ┆ [1.0, 0.67, 0.33] │
│ Station 10 ┆ 17 ┆ 13 ┆ 10 ┆ [0.33, 0.67, 1.0] │
└────────────┴───────┴───────┴───────┴────────────────────┘
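To make the arithmetic behind temps_rank easier to follow, here is a plain-Python sketch of the same calculation for a single row. It is not how Polars computes the result; the helper rank_pct_row is hypothetical and simply mirrors a descending rank with the average method for ties, divided by the number of elements and rounded to two decimals.

```python
def rank_pct_row(values):
    """Descending average rank of each value, as a fraction of the row length."""
    n = len(values)
    result = []
    for v in values:
        # Number of values strictly greater than v: these rank ahead of it.
        higher = sum(1 for other in values if other > v)
        # Number of values equal to v (including v itself).
        ties = sum(1 for other in values if other == v)
        # Tied values share the average of the rank positions they occupy.
        avg_rank = higher + (ties + 1) / 2
        result.append(round(avg_rank / n, 2))
    return result


station_1 = rank_pct_row([17, 15, 16])
station_2 = rank_pct_row([11, 11, 15])
print(station_1)  # [0.33, 1.0, 0.67]
print(station_2)  # [0.83, 0.83, 0.33]
```

The two rows reproduce the values shown for Station 1 and Station 2 in the table above, including the shared rank for the tied readings of 11.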
Polars Arrays
Arrays are a recently introduced data type and still offer only a limited set of features. The major difference between a List and an Array is that the latter is limited to having the same number of elements per row, while a List can have a variable number of elements. Both still require that each element's data type is the same.
We can define Array columns in this manner:
array_df = pl.DataFrame(
    [
        pl.Series("Array_1", [[1, 3], [2, 5]]),
        pl.Series("Array_2", [[1, 7, 3], [8, 1, 0]]),
    ],
    schema={"Array_1": pl.Array(2, pl.Int64), "Array_2": pl.Array(3, pl.Int64)},
)
print(array_df)

let mut col1: ListPrimitiveChunkedBuilder<Int32Type> =
    ListPrimitiveChunkedBuilder::new("Array_1", 8, 8, DataType::Int32);
col1.append_slice(&[1, 3]);
col1.append_slice(&[2, 5]);
let mut col2: ListPrimitiveChunkedBuilder<Int32Type> =
    ListPrimitiveChunkedBuilder::new("Array_2", 8, 8, DataType::Int32);
col2.append_slice(&[1, 7, 3]);
col2.append_slice(&[8, 1, 0]);
let array_df = DataFrame::new([col1.finish(), col2.finish()].into())?;
println!("{}", &array_df);
shape: (2, 2)
┌───────────────┬───────────────┐
│ Array_1 ┆ Array_2 │
│ --- ┆ --- │
│ array[i64, 2] ┆ array[i64, 3] │
╞═══════════════╪═══════════════╡
│ [1, 3] ┆ [1, 7, 3] │
│ [2, 5] ┆ [8, 1, 0] │
└───────────────┴───────────────┘
Basic operations are available on Array columns:
out = array_df.select(
    pl.col("Array_1").arr.min().suffix("_min"),
    pl.col("Array_2").arr.sum().suffix("_sum"),
)
print(out)

let out = array_df
    .clone()
    .lazy()
    .select([
        col("Array_1").list().min().suffix("_min"),
        col("Array_2").list().sum().suffix("_sum"),
    ])
    .collect()?;
println!("{}", &out);
shape: (2, 2)
┌─────────────┬─────────────┐
│ Array_1_min ┆ Array_2_sum │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════════════╪═════════════╡
│ 1 ┆ 11 │
│ 2 ┆ 9 │
└─────────────┴─────────────┘
Polars Arrays are still being actively developed, so this section will likely change in the future.