Bin continuous values into discrete categories based on their quantiles

Description

Bin continuous values into discrete categories based on their quantiles

Usage

<Expr>$qcut(
  quantiles,
  ...,
  labels = NULL,
  left_closed = FALSE,
  allow_duplicates = FALSE,
  include_breaks = FALSE
)

Arguments

quantiles Either a vector of quantile probabilities between 0 and 1, or a positive integer giving the number of bins of equal probability.
... Ignored.
labels Names of the categories. The number of labels must be equal to the number of cut points plus one.
left_closed Set the intervals to be left-closed instead of right-closed.
allow_duplicates If set to TRUE, duplicates in the resulting quantiles are dropped, rather than raising an error. This can happen even with unique probabilities, depending on the data.
include_breaks Include a column with the right endpoint of the bin each observation falls in. This will change the data type of the output from a Categorical to a Struct.
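
How quantiles, labels and left_closed turn into bins can be sketched in base R, without polars. This is an illustration under stated assumptions, not the polars implementation: quantile() with its default interpolation stands in for the quantile computation and cut() for the binning, so results may differ from $qcut() on other data. The helper name qcut_sketch is hypothetical.

```r
# Hypothetical helper: a base R analogue of quantile-based binning.
qcut_sketch <- function(x, quantiles, labels = NULL, left_closed = FALSE) {
  # A single positive integer n means n bins of equal probability,
  # i.e. the inner probabilities 1/n, 2/n, ..., (n - 1)/n.
  if (length(quantiles) == 1 && quantiles >= 1) {
    quantiles <- seq_len(quantiles - 1) / quantiles
  }
  # Assumption: base R's default quantile interpolation (type 7).
  breaks <- quantile(x, probs = quantiles, names = FALSE)
  # k cut points give k + 1 bins, hence k + 1 labels.
  cut(x, breaks = c(-Inf, breaks, Inf), labels = labels, right = !left_closed)
}

x <- c(-2, -1, 0, 1, 2)
by_probs <- qcut_sketch(x, c(0.25, 0.75), labels = c("a", "b", "c"))
by_count <- qcut_sketch(x, 2, labels = c("low", "high"), left_closed = TRUE)
```

On this input the sketch assigns the same categories as the first two examples in the Examples section: two cut points give three bins, and a bin count of 2 gives a single cut point at the median.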

Value

Expr of data type Categorical if include_breaks is FALSE and of data type Struct if include_breaks is TRUE.

See Also

$cut()

Examples

library("polars")

df = pl$DataFrame(foo = c(-2, -1, 0, 1, 2))

# Divide a column into three categories according to pre-defined quantile
# probabilities
df$with_columns(
  qcut = pl$col("foo")$qcut(c(0.25, 0.75), labels = c("a", "b", "c"))
)
#> shape: (5, 2)
#> ┌──────┬──────┐
#> │ foo  ┆ qcut │
#> │ ---  ┆ ---  │
#> │ f64  ┆ cat  │
#> ╞══════╪══════╡
#> │ -2.0 ┆ a    │
#> │ -1.0 ┆ a    │
#> │ 0.0  ┆ b    │
#> │ 1.0  ┆ b    │
#> │ 2.0  ┆ c    │
#> └──────┴──────┘
# Divide a column into two categories using uniform quantile probabilities.
df$with_columns(
  qcut = pl$col("foo")$qcut(2, labels = c("low", "high"), left_closed = TRUE)
)
#> shape: (5, 2)
#> ┌──────┬──────┐
#> │ foo  ┆ qcut │
#> │ ---  ┆ ---  │
#> │ f64  ┆ cat  │
#> ╞══════╪══════╡
#> │ -2.0 ┆ low  │
#> │ -1.0 ┆ low  │
#> │ 0.0  ┆ high │
#> │ 1.0  ┆ high │
#> │ 2.0  ┆ high │
#> └──────┴──────┘
# Add both the category and the breakpoint
df$with_columns(
  qcut = pl$col("foo")$qcut(c(0.25, 0.75), include_breaks = TRUE)
)$unnest("qcut")
#> shape: (5, 3)
#> ┌──────┬────────────┬────────────┐
#> │ foo  ┆ breakpoint ┆ category   │
#> │ ---  ┆ ---        ┆ ---        │
#> │ f64  ┆ f64        ┆ cat        │
#> ╞══════╪════════════╪════════════╡
#> │ -2.0 ┆ -1.0       ┆ (-inf, -1] │
#> │ -1.0 ┆ -1.0       ┆ (-inf, -1] │
#> │ 0.0  ┆ 1.0        ┆ (-1, 1]    │
#> │ 1.0  ┆ 1.0        ┆ (-1, 1]    │
#> │ 2.0  ┆ inf        ┆ (1, inf]   │
#> └──────┴────────────┴────────────┘
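
Why allow_duplicates exists can be seen from a base R sketch: on data with many tied values, distinct probabilities can map to the same quantile value. The code below uses quantile() and cut() as stand-ins and is not the polars implementation; it only mirrors the idea of dropping repeated cut points instead of failing.

```r
# Four of the five values are tied, so both quantiles land on 1.
x <- c(1, 1, 1, 1, 2)
breaks <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
has_duplicates <- anyDuplicated(breaks) > 0

# Base R's cut() refuses repeated breaks; record the failure instead of stopping.
strict <- tryCatch(
  cut(x, breaks = c(-Inf, breaks, Inf)),
  error = function(e) "error"
)

# Dropping the repeated cut point leaves one break and therefore two bins.
relaxed <- cut(x, breaks = c(-Inf, unique(breaks), Inf))
```

With the duplicate removed, the three requested bins collapse into two, so any labels supplied would need to match the reduced number of bins.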