Returns the Unicode normal form of the string values

Description

Normalizes each string value to one of the four normalization forms defined in Unicode Standard Annex #15 (https://www.unicode.org/reports/tr15/).

Usage

<Expr>$str$normalize(form = c("NFC", "NFKC", "NFD", "NFKD"))

Arguments

form: Unicode normalization form to use. Must be one of "NFC", "NFKC", "NFD", or "NFKD".

Value

A polars expression

Examples

library("polars")

df <- pl$DataFrame(text = c("01²", "ＫＡＤＯＫＡＷＡ"))

new <- df$with_columns(
  nfc = pl$col("text")$str$normalize("NFC"),
  nfkc = pl$col("text")$str$normalize("NFKC"),
)
new
#> shape: (2, 3)
#> ┌──────────────────┬──────────────────┬──────────┐
#> │ text             ┆ nfc              ┆ nfkc     │
#> │ ---              ┆ ---              ┆ ---      │
#> │ str              ┆ str              ┆ str      │
#> ╞══════════════════╪══════════════════╪══════════╡
#> │ 01²              ┆ 01²              ┆ 012      │
#> │ ＫＡＤＯＫＡＷＡ ┆ ＫＡＤＯＫＡＷＡ ┆ KADOKAWA │
#> └──────────────────┴──────────────────┴──────────┘
new$select(pl$all()$str$len_bytes())
#> shape: (2, 3)
#> ┌──────┬─────┬──────┐
#> │ text ┆ nfc ┆ nfkc │
#> │ ---  ┆ --- ┆ ---  │
#> │ u32  ┆ u32 ┆ u32  │
#> ╞══════╪═════╪══════╡
#> │ 4    ┆ 4   ┆ 3    │
#> │ 24   ┆ 24  ┆ 8    │
#> └──────┴─────┴──────┘
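
The examples above only exercise the composed forms. As a complementary sketch, the snippet below compares NFC with NFD on a precomposed "é" (U+00E9) and its decomposed equivalent ("e" followed by the combining acute accent U+0301). The input strings and the names `forms` and `byte_lengths` are illustrative choices, not part of the package documentation.

```r
library("polars")

# "\u00e9" is the precomposed character; "e\u0301" is the letter "e"
# followed by a combining acute accent. Both render as "é".
df <- pl$DataFrame(text = c("\u00e9", "e\u0301"))

forms <- df$with_columns(
  nfc = pl$col("text")$str$normalize("NFC"),
  nfd = pl$col("text")$str$normalize("NFD")
)

# Byte lengths make the difference visible even though the strings
# look identical when printed.
byte_lengths <- forms$select(pl$all()$str$len_bytes())
byte_lengths
```

In UTF-8 the precomposed character takes 2 bytes and the decomposed sequence takes 3, so the `text` column should differ between the two rows, while `nfc` should report 2 bytes and `nfd` 3 bytes for both rows.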