Process strings

Thanks to its Arrow backend, string operations in Polars are much faster than the same operations in NumPy or Pandas. In those libraries, strings are stored as Python objects: while traversing the np.array or the pd.Series, the CPU has to follow a pointer for every string and jump to many random memory locations, which is very cache-inefficient. In Polars (via the Arrow data structure) strings are stored contiguously in memory, so traversal is cache-friendly and predictable for the CPU.
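
A quick, informal way to see this difference is to time the same length computation in both libraries. This is only an illustrative sketch, not a rigorous benchmark; the data and sizes below are made up for the example:

import time

import pandas as pd
import polars as pl

words = "All that glitters is not gold".split(" ") * 100_000

pd_series = pd.Series(words)  # strings stored as Python objects
pl_series = pl.Series(words)  # strings stored in contiguous Arrow buffers

start = time.perf_counter()
pd_series.str.len()
print(f"pandas: {time.perf_counter() - start:.4f}s")

start = time.perf_counter()
pl_series.str.lengths()
print(f"polars: {time.perf_counter() - start:.4f}s")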

The string-processing functions available in Polars live in the str namespace.

Below are a few examples. To compute string lengths:

import polars as pl

df = pl.DataFrame({"shakespeare": "All that glitters is not gold".split(" ")})

df = df.with_column(pl.col("shakespeare").str.lengths().alias("letter_count"))

returning:

shape: (6, 2)
┌─────────────┬──────────────┐
│ shakespeare ┆ letter_count │
│ ---         ┆ ---          │
│ str         ┆ u32          │
╞═════════════╪══════════════╡
│ All         ┆ 3            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ that        ┆ 4            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ glitters    ┆ 8            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ is          ┆ 2            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ not         ┆ 3            │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ gold        ┆ 4            │
└─────────────┴──────────────┘
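
Note that, depending on the Polars version, str.lengths() reports the length of the UTF-8 encoding in bytes rather than the number of characters; for ASCII text like the example above the two coincide, but for accented or non-Latin text they can differ. The sketch below assumes a companion method named str.n_chars() that counts Unicode characters instead:

df_utf8 = pl.DataFrame({"word": ["gold", "naïve"]})

df_utf8 = df_utf8.with_columns(
    [
        pl.col("word").str.lengths().alias("n_bytes"),  # length of the UTF-8 encoding in bytes
        pl.col("word").str.n_chars().alias("n_chars"),  # number of Unicode characters
    ]
)

Here "naïve" yields 6 bytes but 5 characters, since "ï" takes two bytes in UTF-8.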

And below, a regex pattern is used to filter the articles ("the" and "a") out of a sentence:

import polars as pl

df = pl.DataFrame({"a": "The man that ate a whole cake".split(" ")})

# (?i) makes the match case-insensitive; is_not() negates the boolean mask
df = df.filter(pl.col("a").str.contains(r"(?i)^the$|^a$").is_not())

yielding:

shape: (5, 1)
┌───────┐
│ a     │
│ ---   │
│ str   │
╞═══════╡
│ man   │
├╌╌╌╌╌╌╌┤
│ that  │
├╌╌╌╌╌╌╌┤
│ ate   │
├╌╌╌╌╌╌╌┤
│ whole │
├╌╌╌╌╌╌╌┤
│ cake  │
└───────┘
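
The same str namespace also supports regex-based replacement. As a sketch (assuming str.replace_all is available in the version in use), the following strips trailing punctuation from each token; the pattern and data are made up for the example:

df = pl.DataFrame({"a": "The man, that ate a whole cake!".split(" ")})

# remove any trailing punctuation characters from each token
df = df.with_column(pl.col("a").str.replace_all(r"[[:punct:]]+$", ""))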