Use the Aho-Corasick algorithm to find many matches
Description
Returns the byte offset of the start of each match.
The return type is List(UInt32). This method matches
string literals only; regular expressions are not supported.
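To make the offset convention concrete, here is a minimal base-R sketch that contrasts byte offsets with character offsets for a literal pattern. It does not use polars or Aho-Corasick; the helper name `literal_offsets` is hypothetical, and the shift to 0-based offsets is made to mirror the output shown in the examples on this page.

```r
# Hypothetical helper: 0-based start offsets of a literal pattern in one string.
# Base R only; illustrates byte vs. character offsets, not polars internals.
literal_offsets <- function(x, pattern, bytes = TRUE) {
  hits <- gregexpr(pattern, x, fixed = TRUE, useBytes = bytes)[[1]]
  # gregexpr() returns -1 when nothing matches
  if (hits[1] == -1L) {
    return(integer(0))
  }
  # gregexpr() positions are 1-based; shift them to 0-based
  as.integer(hits) - 1L
}

x <- "na\u00efve" # "naive" with a diaeresis; that letter takes two bytes in UTF-8

byte_offsets <- literal_offsets(x, "ve", bytes = TRUE)
char_offsets <- literal_offsets(x, "ve", bytes = FALSE)

# The pattern is a literal, so "." only matches an actual dot
dot_offsets <- literal_offsets("a.b", ".")
```

In this sketch `byte_offsets` is 4 while `char_offsets` is 3, because the two-byte character before the match shifts the byte offset by one. `dot_offsets` is 1, since the dot is not treated as a regular expression wildcard.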
Usage
<Expr>$str$find_many(
  patterns,
  ...,
  ascii_case_insensitive = FALSE,
  overlapping = FALSE
)
Arguments
patterns
String patterns to search. This can be an Expr or something coercible to an Expr. Strings are parsed as column names.
...
These dots are for future extensions and must be empty.
ascii_case_insensitive
Enable ASCII-aware case-insensitive matching. When this option is enabled, searching ignores case for ASCII letters (a-z and A-Z) only.
overlapping
Whether matches can overlap.
Value
A polars expression
Examples
library("polars")
df <- pl$DataFrame(values = "discontent")
patterns <- pl$lit(list(c("winter", "disco", "onte", "discontent")))
df$with_columns(
  matches = pl$col("values")$str$find_many(patterns, overlapping = FALSE),
  matches_overlapping = pl$col("values")$str$find_many(
    patterns,
    overlapping = TRUE
  )
)
#> shape: (1, 3)
#> ┌────────────┬───────────┬─────────────────────┐
#> │ values     ┆ matches   ┆ matches_overlapping │
#> │ ---        ┆ ---       ┆ ---                 │
#> │ str        ┆ list[u32] ┆ list[u32]           │
#> ╞════════════╪═══════════╪═════════════════════╡
#> │ discontent ┆ [0]       ┆ [0, 4, 0]           │
#> └────────────┴───────────┴─────────────────────┘
df <- pl$DataFrame(
  values = c("discontent", "rhapsody"),
  patterns = list(
    c("winter", "disco", "onte", "discontent"),
    c("rhap", "ody", "coalesce")
  )
)
df$select(pl$col("values")$str$find_many("patterns"))
#> shape: (2, 1)
#> ┌───────────┐
#> │ values    │
#> │ ---       │
#> │ list[u32] │
#> ╞═══════════╡
#> │ [0]       │
#> │ [0, 5]    │
#> └───────────┘
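The examples above leave `ascii_case_insensitive` at its default. As a hedged sketch of how that argument might be exercised, the following reuses the same calls as the first example on a mixed-case value. The column names `case_sensitive` and `case_insensitive` are made up for illustration, and no output is shown because it has not been run here.

```r
library("polars")

# One-row frame, as in the first example, but with an upper-case first letter
df <- pl$DataFrame(values = "Discontent")
patterns <- pl$lit(list(c("disco", "onte")))

out <- df$with_columns(
  # Default: ASCII case is respected when matching
  case_sensitive = pl$col("values")$str$find_many(patterns),
  # ASCII letters are compared without regard to case
  case_insensitive = pl$col("values")$str$find_many(
    patterns,
    ascii_case_insensitive = TRUE
  )
)
out
```

Comparing the two list columns shows which matches depend on case. Only ASCII letters are affected, so non-ASCII characters still have to match exactly.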