The Struct datatype

Polars `Struct`s are the idiomatic way of working with multiple columns. Moving columns into a `Struct` is also a free operation: it does not copy any data.

For this section, let's start with a `DataFrame` that captures the average rating of a few movies across some states in the U.S.:

``````import polars as pl

ratings = pl.DataFrame(
    {
        "Movie": ["Cars", "IT", "ET", "Cars", "Up", "IT", "Cars", "ET", "Up", "ET"],
        "Theatre": ["NE", "ME", "IL", "ND", "NE", "SD", "NE", "IL", "IL", "SD"],
        "Avg_Rating": [4.5, 4.4, 4.6, 4.3, 4.8, 4.7, 4.7, 4.9, 4.7, 4.6],
        "Count": [30, 27, 26, 29, 31, 28, 28, 26, 33, 26],
    }
)
print(ratings)
``````

``````let ratings = df!(
"Movie"=> &["Cars", "IT", "ET", "Cars", "Up", "IT", "Cars", "ET", "Up", "ET"],
"Theatre"=> &["NE", "ME", "IL", "ND", "NE", "SD", "NE", "IL", "IL", "SD"],
"Avg_Rating"=> &[4.5, 4.4, 4.6, 4.3, 4.8, 4.7, 4.7, 4.9, 4.7, 4.6],
"Count"=> &[30, 27, 26, 29, 31, 28, 28, 26, 33, 26],

)?;
println!("{}", &ratings);
``````

``````shape: (10, 4)
┌───────┬─────────┬────────────┬───────┐
│ Movie ┆ Theatre ┆ Avg_Rating ┆ Count │
│ ---   ┆ ---     ┆ ---        ┆ ---   │
│ str   ┆ str     ┆ f64        ┆ i64   │
╞═══════╪═════════╪════════════╪═══════╡
│ Cars  ┆ NE      ┆ 4.5        ┆ 30    │
│ IT    ┆ ME      ┆ 4.4        ┆ 27    │
│ ET    ┆ IL      ┆ 4.6        ┆ 26    │
│ Cars  ┆ ND      ┆ 4.3        ┆ 29    │
│ …     ┆ …       ┆ …          ┆ …     │
│ Cars  ┆ NE      ┆ 4.7        ┆ 28    │
│ ET    ┆ IL      ┆ 4.9        ┆ 26    │
│ Up    ┆ IL      ┆ 4.7        ┆ 33    │
│ ET    ┆ SD      ┆ 4.6        ┆ 26    │
└───────┴─────────┴────────────┴───────┘
``````

Encountering the `Struct` type

A common operation that produces a `Struct` column is the popular `value_counts` function, which is often used in exploratory data analysis. Counting how many times each state appears in the data is done like so:

``````out = ratings.select(pl.col("Theatre").value_counts(sort=True))
print(out)
``````

``````let out = ratings
.clone()
.lazy()
.select([col("Theatre").value_counts(true, true)])
.collect()?;
println!("{}", &out);
``````
``````shape: (5, 1)
┌───────────┐
│ Theatre   │
│ ---       │
│ struct[2] │
╞═══════════╡
│ {"NE",3}  │
│ {"IL",3}  │
│ {"SD",2}  │
│ {"ME",1}  │
│ {"ND",1}  │
└───────────┘
``````

This output may be unexpected, especially if you are coming from tools that do not have such a data type. There is no cause for alarm, though: to get back to a more familiar output, all we need to do is `unnest` the `Struct` column into its constituent columns:

``````out = ratings.select(pl.col("Theatre").value_counts(sort=True)).unnest("Theatre")
print(out)
``````

``````let out = ratings
.clone()
.lazy()
.select([col("Theatre").value_counts(true, true)])
.unnest(["Theatre"])
.collect()?;
println!("{}", &out);
``````
``````shape: (5, 2)
┌─────────┬────────┐
│ Theatre ┆ counts │
│ ---     ┆ ---    │
│ str     ┆ u32    │
╞═════════╪════════╡
│ NE      ┆ 3      │
│ IL      ┆ 3      │
│ SD      ┆ 2      │
│ ME      ┆ 1      │
│ ND      ┆ 1      │
└─────────┴────────┘
``````

Why `value_counts` returns a `Struct`

Polars expressions always have a `Fn(Series) -> Series` signature, and `Struct` is thus the data type that allows us to provide multiple columns as the input/output of an expression. In other words, all expressions have to return a `Series` object, and `Struct` allows us to stay consistent with that requirement.

Structs as `dict`s

Polars will interpret a `dict` sent to the `Series` constructor as a `Struct`:

``````rating_Series = pl.Series(
"ratings",
[
{"Movie": "Cars", "Theatre": "NE", "Avg_Rating": 4.5},
{"Movie": "Toy Story", "Theatre": "ME", "Avg_Rating": 4.9},
],
)
print(rating_Series)
``````

``````// There may be no direct Rust equivalent of the dict-based constructor; converting a DataFrame works
let rating_series = df!(
"Movie" => &["Cars", "Toy Story"],
"Theatre" => &["NE", "ME"],
"Avg_Rating" => &[4.5, 4.9],
)?
.into_struct("ratings")
.into_series();
println!("{}", &rating_series);
``````

``````shape: (2,)
Series: 'ratings' [struct[3]]
[
{"Cars","NE",4.5}
{"Toy Story","ME",4.9}
]
``````

Constructing `Series` objects

Note that `Series` here was constructed with the `name` of the series at the beginning, followed by the `values`. Providing the latter first is considered an anti-pattern in Polars, and must be avoided.

Extracting individual values of a `Struct`

Let's say that we need to obtain just the `Movie` value in the `Series` that we created above. We can use the `field` method to do so:

``````out = rating_Series.struct.field("Movie")
print(out)
``````

``````let out = rating_series.struct_()?.field_by_name("Movie")?;
println!("{}", &out);
``````
``````shape: (2,)
Series: 'Movie' [str]
[
"Cars"
"Toy Story"
]
``````

Renaming individual keys of a `Struct`

What if we need to rename individual `field`s of a `Struct` column? We first convert the `rating_Series` object to a `DataFrame` so that we can view the changes easily, and then use the `rename_fields` method:

``````out = (
rating_Series.to_frame()
.select(pl.col("ratings").struct.rename_fields(["Film", "State", "Value"]))
.unnest("ratings")
)
print(out)
``````

``````let out = DataFrame::new([rating_series].into())?
.lazy()
.select([col("ratings")
.struct_()
.rename_fields(["Film".into(), "State".into(), "Value".into()].to_vec())])
.unnest(["ratings"])
.collect()?;

println!("{}", &out);
``````
``````shape: (2, 3)
┌───────────┬───────┬───────┐
│ Film      ┆ State ┆ Value │
│ ---       ┆ ---   ┆ ---   │
│ str       ┆ str   ┆ f64   │
╞═══════════╪═══════╪═══════╡
│ Cars      ┆ NE    ┆ 4.5   │
│ Toy Story ┆ ME    ┆ 4.9   │
└───────────┴───────┴───────┘
``````

Practical use-cases of `Struct` columns

Identifying duplicate rows

Let's get back to the `ratings` data. We want to identify cases where there are duplicates at a `Movie` and `Theatre` level. This is where the `Struct` datatype shines:

``````out = ratings.filter(pl.struct("Movie", "Theatre").is_duplicated())
print(out)
``````

``````let out = ratings
.clone()
.lazy()
// .filter(as_struct(&[col("Movie"), col("Theatre")]).is_duplicated())
// Error: .is_duplicated() not available if you try that
// https://github.com/pola-rs/polars/issues/3803
.filter(count().over([col("Movie"), col("Theatre")]).gt(lit(1)))
.collect()?;
println!("{}", &out);
``````

``````shape: (4, 4)
┌───────┬─────────┬────────────┬───────┐
│ Movie ┆ Theatre ┆ Avg_Rating ┆ Count │
│ ---   ┆ ---     ┆ ---        ┆ ---   │
│ str   ┆ str     ┆ f64        ┆ i64   │
╞═══════╪═════════╪════════════╪═══════╡
│ Cars  ┆ NE      ┆ 4.5        ┆ 30    │
│ ET    ┆ IL      ┆ 4.6        ┆ 26    │
│ Cars  ┆ NE      ┆ 4.7        ┆ 28    │
│ ET    ┆ IL      ┆ 4.9        ┆ 26    │
└───────┴─────────┴────────────┴───────┘
``````

We can also identify the unique cases at this level with `is_unique`!

Multi-column ranking

Suppose that, knowing there are duplicates, we want to decide which of the duplicated rows gets priority. We treat the `Count` of ratings as more important than the `Avg_Rating` itself, and use `Avg_Rating` only to break ties. We can then do:

``````out = ratings.with_columns(
pl.struct("Count", "Avg_Rating")
.rank("dense", descending=True)
.over("Movie", "Theatre")
.alias("Rank")
).filter(pl.struct("Movie", "Theatre").is_duplicated())
print(out)
``````

``````let out = ratings
.clone()
.lazy()
.with_columns([as_struct(&[col("Count"), col("Avg_Rating")])
.rank(
RankOptions {
method: RankMethod::Dense,
descending: true,
},
None,
)
.over([col("Movie"), col("Theatre")])
.alias("Rank")])
// .filter(as_struct(&[col("Movie"), col("Theatre")]).is_duplicated())
// Error: .is_duplicated() not available if you try that
// https://github.com/pola-rs/polars/issues/3803
.filter(count().over([col("Movie"), col("Theatre")]).gt(lit(1)))
.collect()?;
println!("{}", &out);
``````

``````shape: (4, 5)
┌───────┬─────────┬────────────┬───────┬──────┐
│ Movie ┆ Theatre ┆ Avg_Rating ┆ Count ┆ Rank │
│ ---   ┆ ---     ┆ ---        ┆ ---   ┆ ---  │
│ str   ┆ str     ┆ f64        ┆ i64   ┆ u32  │
╞═══════╪═════════╪════════════╪═══════╪══════╡
│ Cars  ┆ NE      ┆ 4.5        ┆ 30    ┆ 1    │
│ ET    ┆ IL      ┆ 4.6        ┆ 26    ┆ 2    │
│ Cars  ┆ NE      ┆ 4.7        ┆ 28    ┆ 2    │
│ ET    ┆ IL      ┆ 4.9        ┆ 26    ┆ 1    │
└───────┴─────────┴────────────┴───────┴──────┘
``````

That's a pretty complex set of requirements done very elegantly in Polars!
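
To see why the ranks come out this way, it can help to think of each struct as a tuple that is compared field by field: `Count` first, then `Avg_Rating`. The sketch below reproduces the `Rank` column for the four duplicated rows using plain Python tuples. It is only an illustration of the ordering and makes no claim about how Polars computes ranks internally.

```python
from collections import defaultdict

# The four duplicated rows from the output above:
# (Movie, Theatre, Avg_Rating, Count)
rows = [
    ("Cars", "NE", 4.5, 30),
    ("ET", "IL", 4.6, 26),
    ("Cars", "NE", 4.7, 28),
    ("ET", "IL", 4.9, 26),
]

# Group the (Count, Avg_Rating) tuples by (Movie, Theatre), like `.over(...)`.
groups = defaultdict(list)
for movie, theatre, avg_rating, count in rows:
    groups[(movie, theatre)].append((count, avg_rating))

# Dense, descending rank: position among the distinct tuples, largest first.
ranks = []
for movie, theatre, avg_rating, count in rows:
    distinct = sorted(set(groups[(movie, theatre)]), reverse=True)
    ranks.append(distinct.index((count, avg_rating)) + 1)

print(ranks)  # [1, 2, 2, 1]
```

For `Cars`/`NE` the higher `Count` wins outright, while for `ET`/`IL` the counts tie at 26 and the higher `Avg_Rating` decides.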

Using multi-column apply

This was discussed in the previous section on User Defined Functions.