Create Enum DataType
Description
An Enum
is a fixed set categorical encoding of a set of
strings. It is similar to the Categorical
data type, but
the categories are explicitly provided by the user and cannot be
modified.
Usage
DataType_Enum(categories)
Arguments
categories
|
A character vector specifying the categories of the variable. |
Details
This functionality is unstable. It is a work-in-progress feature and may not always work as expected. It may be changed at any point without it being considered a breaking change.
Value
An Enum DataType
Examples
library("polars")
pl$DataFrame(
x = c("Polar", "Panda", "Brown", "Brown", "Polar"),
schema = list(x = pl$Enum(c("Polar", "Panda", "Brown")))
)
#> shape: (5, 1)
#> ┌───────┐
#> │ x │
#> │ --- │
#> │ enum │
#> ╞═══════╡
#> │ Polar │
#> │ Panda │
#> │ Brown │
#> │ Brown │
#> │ Polar │
#> └───────┘
# All values of the variable have to be in the categories
dtype = pl$Enum(c("Polar", "Panda", "Brown"))
tryCatch(
pl$DataFrame(
x = c("Polar", "Panda", "Brown", "Brown", "Polar", "Black"),
schema = list(x = dtype)
),
error = function(e) e
)
#> <RPolarsErr_error: Execution halted with the following contexts
#> 0: In R: in $DataFrame():
#> 0: During function call [.main()]
#> 1: Encountered the following error in Rust-Polars:
#> conversion from `str` to `enum` failed in column '' for 1 out of 6 values: ["Black"]
#>
#> Ensure that all values in the input column are present in the categories of the enum datatype.
#>
#> Resolved plan until failure:
#>
#> ---> FAILED HERE RESOLVING 'select' <---
#> SELECT [Series.strict_cast(Enum(Some(local), Physical)).alias("x")] FROM
#> DF []; PROJECT */0 COLUMNS; SELECTION: None
#> >
# Comparing two Enum is only valid if they have the same categories
df = pl$DataFrame(
x = c("Polar", "Panda", "Brown", "Brown", "Polar"),
y = c("Polar", "Polar", "Polar", "Brown", "Brown"),
z = c("Polar", "Polar", "Polar", "Brown", "Brown"),
schema = list(
x = pl$Enum(c("Polar", "Panda", "Brown")),
y = pl$Enum(c("Polar", "Panda", "Brown")),
z = pl$Enum(c("Polar", "Black", "Brown"))
)
)
# Same categories
df$with_columns(x_eq_y = pl$col("x") == pl$col("y"))
#> shape: (5, 4)
#> ┌───────┬───────┬───────┬────────┐
#> │ x ┆ y ┆ z ┆ x_eq_y │
#> │ --- ┆ --- ┆ --- ┆ --- │
#> │ enum ┆ enum ┆ enum ┆ bool │
#> ╞═══════╪═══════╪═══════╪════════╡
#> │ Polar ┆ Polar ┆ Polar ┆ true │
#> │ Panda ┆ Polar ┆ Polar ┆ false │
#> │ Brown ┆ Polar ┆ Polar ┆ false │
#> │ Brown ┆ Brown ┆ Brown ┆ true │
#> │ Polar ┆ Brown ┆ Brown ┆ false │
#> └───────┴───────┴───────┴────────┘
# Different categories
tryCatch(
df$with_columns(x_eq_z = pl$col("x") == pl$col("z")),
error = function(e) e
)
#> <RPolarsErr_error: Execution halted with the following contexts
#> 0: In R: in $with_columns()
#> 0: During function call [.main()]
#> 1: Encountered the following error in Rust-Polars:
#> string caches don't match: cannot compare categoricals coming from different sources, consider setting a global StringCache.
#>
#> Help: if you're using Python, this may look something like:
#>
#> with pl.StringCache():
#> # Initialize Categoricals.
#> df1 = pl.DataFrame({'a': ['1', '2']}, schema={'a': pl.Categorical})
#> df2 = pl.DataFrame({'a': ['1', '3']}, schema={'a': pl.Categorical})
#> # Your operations go here.
#> pl.concat([df1, df2])
#>
#> Alternatively, if the performance cost is acceptable, you could just set:
#>
#> import polars as pl
#> pl.enable_string_cache()
#>
#> on startup.
#> >