Process strings
Thanks to its Arrow backend, Polars string operations are much faster than the same
operations performed with NumPy or Pandas. In those two libraries, strings are stored
as Python objects: while traversing the np.array or the pd.Series, the CPU needs to
follow all the string pointers and jump to many random memory locations, which is very
cache-inefficient. In Polars (via the Arrow data structure) strings are contiguous in
memory, so traversal is cache-optimal and predictable for the CPU.
The string processing functions in Polars live in the str namespace.
Below are a few examples. To compute string lengths:
import polars as pl
df = pl.DataFrame({"shakespeare": "All that glitters is not gold".split(" ")})
df = df.with_columns(pl.col("shakespeare").str.len_chars().alias("letter_count"))
returning:
shape: (6, 2)
┌─────────────┬──────────────┐
│ shakespeare ┆ letter_count │
│ --- ┆ --- │
│ str ┆ u32 │
╞═════════════╪══════════════╡
│ All ┆ 3 │
│ that ┆ 4 │
│ glitters ┆ 8 │
│ is ┆ 2 │
│ not ┆ 3 │
│ gold ┆ 4 │
└─────────────┴──────────────┘
And below, a regex pattern to filter out the articles (the, a) from a sentence:
import polars as pl
df = pl.DataFrame({"a": "The man that ate a whole cake".split(" ")})
df = df.filter(pl.col("a").str.contains(r"(?i)^the$|^a$").not_())
yielding:
shape: (5, 1)
┌───────┐
│ a │
│ --- │
│ str │
╞═══════╡
│ man │
│ that │
│ ate │
│ whole │
│ cake │
└───────┘