Use the Aho-Corasick algorithm to find many matches
Description
Returns the byte offset of the start of each match.
The return type is List(UInt32). This method matches
string literals only and does not support regular expressions.
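To make the byte-offset convention concrete, here is a minimal sketch in base R rather than polars. It assumes the offsets are zero-based byte positions in UTF-8 text, which is consistent with the `[0]` results in the Examples. `byte_offsets()` is a hypothetical helper, not part of the polars API, and it relies on `gregexpr()` instead of Aho-Corasick.

```r
# Hypothetical helper (not a polars function): zero-based byte offsets
# of every occurrence of a literal pattern in a single string.
byte_offsets <- function(text, pattern) {
  # fixed = TRUE    -> literal matching, no regular expressions
  # useBytes = TRUE -> positions are counted in bytes, not characters
  pos <- gregexpr(pattern, text, fixed = TRUE, useBytes = TRUE)[[1]]
  # gregexpr() returns -1 when nothing matches, 1-based positions otherwise
  if (pos[1] == -1L) integer(0) else as.integer(pos) - 1L
}

ascii_text <- "hello"
utf8_text <- "h\u00e9llo" # "é" occupies two bytes in UTF-8

byte_offsets(ascii_text, "llo")
byte_offsets(utf8_text, "llo")
```

The two calls differ by one because the accented letter takes two bytes: the character position of "llo" is the same in both strings, but its byte offset is not.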
Usage
<Expr>$str$find_many(
patterns,
...,
ascii_case_insensitive = FALSE,
overlapping = FALSE,
leftmost = FALSE
)
Arguments
patterns
String patterns to search. Accepts expression input. Strings are parsed
as column names, and other non-expression inputs are parsed as literals.
To use the same character vector for all rows, use
list(c(…)) instead of c(…) (see Examples).

...
These dots are for future extensions and must be empty.

ascii_case_insensitive
Enable ASCII-aware case-insensitive matching. When this option is enabled,
searching ignores case for ASCII letters (a-z and A-Z) only.

overlapping
Whether matches can overlap.

leftmost
Whether to guarantee that the leftmost match is used when matches
overlap. If several patterns are candidates for the leftmost match,
the pattern that comes first in patterns is used.
Value
A polars expression
Examples
library("polars")

df <- pl$DataFrame(values = "discontent")

# Wrap the vector in list() so the same patterns apply to every row
patterns <- list(c("winter", "disco", "onte", "discontent"))

df$with_columns(
  matches = pl$col("values")$str$find_many(patterns, overlapping = FALSE),
  matches_overlapping = pl$col("values")$str$find_many(
    patterns,
    overlapping = TRUE
  )
)
#> shape: (1, 3)
#> ┌────────────┬───────────┬─────────────────────┐
#> │ values     ┆ matches   ┆ matches_overlapping │
#> │ ---        ┆ ---       ┆ ---                 │
#> │ str        ┆ list[u32] ┆ list[u32]           │
#> ╞════════════╪═══════════╪═════════════════════╡
#> │ discontent ┆ [0]       ┆ [0, 4, 0]           │
#> └────────────┴───────────┴─────────────────────┘

# Per-row patterns: store them in a list column
df <- pl$DataFrame(
  values = c("discontent", "rhapsody"),
  patterns = list(
    c("winter", "disco", "onte", "discontent"),
    c("rhap", "ody", "coalesce")
  )
)

# The string "patterns" is parsed as a column name
df$select(pl$col("values")$str$find_many("patterns"))
#> shape: (2, 1)
#> ┌───────────┐
#> │ values    │
#> │ ---       │
#> │ list[u32] │
#> ╞═══════════╡
#> │ [0]       │
#> │ [0, 5]    │
#> └───────────┘
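
The examples above leave `ascii_case_insensitive` at its default. The following sketch, built only from the signature shown under Usage, puts case-sensitive and case-insensitive matching side by side. The data values and the column names `case_sensitive` and `case_insensitive` are made up for illustration, and no output is shown because it was not produced by running the code.

```r
library("polars")

# Illustrative data: the same words in different letter cases
df <- pl$DataFrame(values = c("Discontent", "DISCO"))

# Wrap the vector in list() so the same patterns apply to every row
patterns <- list(c("disco", "onte"))

out <- df$with_columns(
  # Default behaviour: matching is case-sensitive
  case_sensitive = pl$col("values")$str$find_many(patterns),
  # ASCII letters are matched without regard to case
  case_insensitive = pl$col("values")$str$find_many(
    patterns,
    ascii_case_insensitive = TRUE
  )
)
out
```

Only ASCII letters are affected by this option, so non-ASCII letters such as "É" and "é" are still compared exactly.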