Strings
The following section discusses operations performed on Utf8
strings, which are a frequently used DataType
when working with DataFrames
. However, processing strings can often be inefficient due to their unpredictable memory size, causing the CPU to access many random memory locations. To address this issue, Polars utilizes Arrow
as its backend, which stores all strings in a contiguous block of memory. As a result, string traversal is cache-optimal and predictable for the CPU.
String processing functions are available in the str
namespace.
Accessing the string namespace
The str
namespace can be accessed through the .str
attribute of a column with Utf8
data type. In the following example, we create a column named animal
and compute the length of each element in the column in terms of the number of bytes and the number of characters. If you are working with ASCII text, then the results of these two computations will be the same, and using lengths
is recommended since it is faster.
df = pl.DataFrame({"animal": ["Crab", "cat and dog", "rab$bit", None]})
out = df.select(
pl.col("animal").str.lengths().alias("byte_count"),
pl.col("animal").str.n_chars().alias("letter_count"),
)
print(out)
let df = df! (
"animal" => &[Some("Crab"), Some("cat and dog"), Some("rab$bit"), None],
)?;
let out = df
.clone()
.lazy()
.select([
col("animal").str().lengths().alias("byte_count"),
col("animal").str().n_chars().alias("letter_count"),
])
.collect()?;
println!("{}", &out);
shape: (4, 2)
┌────────────┬──────────────┐
│ byte_count ┆ letter_count │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞════════════╪══════════════╡
│ 4 ┆ 4 │
│ 11 ┆ 11 │
│ 7 ┆ 7 │
│ null ┆ null │
└────────────┴──────────────┘
String parsing
Polars
offers multiple methods for checking and parsing elements of a string. Firstly, we can use the contains
method to check whether a given pattern exists within a substring. Subsequently, we can extract these patterns and replace them using other methods, which will be demonstrated in upcoming examples.
Check for existence of a pattern
To check for the presence of a pattern within a string, we can use the contains method. The contains
method accepts either a regular substring or a regex pattern, depending on the value of the literal
parameter. If the pattern we're searching for is a simple substring located either at the beginning or end of the string, we can alternatively use the starts_with
and ends_with
functions.
str.contains
· starts_with
· ends_with
out = df.select(
pl.col("animal"),
pl.col("animal").str.contains("cat|bit").alias("regex"),
pl.col("animal").str.contains("rab$", literal=True).alias("literal"),
pl.col("animal").str.starts_with("rab").alias("starts_with"),
pl.col("animal").str.ends_with("dog").alias("ends_with"),
)
print(out)
let out = df
.clone()
.lazy()
.select([
col("animal"),
col("animal")
.str()
.contains(lit("cat|bit"), false)
.alias("regex"),
col("animal")
.str()
.contains_literal(lit("rab$"))
.alias("literal"),
col("animal")
.str()
.starts_with(lit("rab"))
.alias("starts_with"),
col("animal").str().ends_with(lit("dog")).alias("ends_with"),
])
.collect()?;
println!("{}", &out);
shape: (4, 5)
┌─────────────┬───────┬─────────┬─────────────┬───────────┐
│ animal ┆ regex ┆ literal ┆ starts_with ┆ ends_with │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ bool ┆ bool ┆ bool ┆ bool │
╞═════════════╪═══════╪═════════╪═════════════╪═══════════╡
│ Crab ┆ false ┆ false ┆ false ┆ false │
│ cat and dog ┆ true ┆ false ┆ false ┆ true │
│ rab$bit ┆ true ┆ true ┆ true ┆ false │
│ null ┆ null ┆ null ┆ null ┆ null │
└─────────────┴───────┴─────────┴─────────────┴───────────┘
Extract a pattern
The extract
method allows us to extract a pattern from a specified string. This method takes a regex pattern containing one or more capture groups, which are defined by parentheses ()
in the pattern. The group index indicates which capture group to output.
df = pl.DataFrame(
{
"a": [
"http://vote.com/ballon_dor?candidate=messi&ref=polars",
"http://vote.com/ballon_dor?candidat=jorginho&ref=polars",
"http://vote.com/ballon_dor?candidate=ronaldo&ref=polars",
]
}
)
out = df.select(
pl.col("a").str.extract(r"candidate=(\w+)", group_index=1),
)
print(out)
let df = df!(
"a" => &[
"http://vote.com/ballon_dor?candidate=messi&ref=polars",
"http://vote.com/ballon_dor?candidat=jorginho&ref=polars",
"http://vote.com/ballon_dor?candidate=ronaldo&ref=polars",
]
)?;
let out = df
.clone()
.lazy()
.select([col("a").str().extract(r"candidate=(\w+)", 1)])
.collect()?;
println!("{}", &out);
shape: (3, 1)
┌─────────┐
│ a │
│ --- │
│ str │
╞═════════╡
│ messi │
│ null │
│ ronaldo │
└─────────┘
To extract all occurrences of a pattern within a string, we can use the extract_all
method. In the example below, we extract all numbers from a string using the regex pattern (\d+)
, which matches one or more digits. The resulting output of the extract_all
method is a list containing all instances of the matched pattern within the string.
df = pl.DataFrame({"foo": ["123 bla 45 asd", "xyz 678 910t"]})
out = df.select(
pl.col("foo").str.extract_all(r"(\d+)").alias("extracted_nrs"),
)
print(out)
let df = df!("foo"=> &["123 bla 45 asd", "xyz 678 910t"])?;
let out = df
.clone()
.lazy()
.select([col("foo")
.str()
.extract_all(lit(r"(\d+)"))
.alias("extracted_nrs")])
.collect()?;
println!("{}", &out);
shape: (2, 1)
┌────────────────┐
│ extracted_nrs │
│ --- │
│ list[str] │
╞════════════════╡
│ ["123", "45"] │
│ ["678", "910"] │
└────────────────┘
Replace a pattern
We have discussed two methods for pattern matching and extraction thus far, and now we will explore how to replace a pattern within a string. Similar to extract
and extract_all
, Polars provides the replace
and replace_all
methods for this purpose. In the example below we replace one match of abc
at the end of a word (\b
) by ABC
and we replace all occurrence of a
with -
.
df = pl.DataFrame({"id": [1, 2], "text": ["123abc", "abc456"]})
out = df.with_columns(
pl.col("text").str.replace(r"abc\b", "ABC"),
pl.col("text").str.replace_all("a", "-", literal=True).alias("text_replace_all"),
)
print(out)
let df = df!("id"=> &[1, 2], "text"=> &["123abc", "abc456"])?;
let out = df
.clone()
.lazy()
.with_columns([
col("text").str().replace(lit(r"abc\b"), lit("ABC"), false),
col("text")
.str()
.replace_all(lit("a"), lit("-"), false)
.alias("text_replace_all"),
])
.collect()?;
println!("{}", &out);
shape: (2, 3)
┌─────┬────────┬──────────────────┐
│ id ┆ text ┆ text_replace_all │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪════════╪══════════════════╡
│ 1 ┆ 123ABC ┆ 123-bc │
│ 2 ┆ abc456 ┆ -bc456 │
└─────┴────────┴──────────────────┘
API documentation
In addition to the examples covered above, Polars offers various other string manipulation methods for tasks such as formatting, stripping, splitting, and more. To explore these additional methods, you can go to the API documentation of your chosen programming language for Polars.