Check if strings in Expression contain a substring that matches a pattern.
A valid regular expression pattern, compatible with the regex crate.
Optional
literal: boolean
Treat the pattern as a literal string, not as a regular expression.
Optional
strict: boolean
Raise an error if the underlying pattern is not a valid regex; otherwise mask out the value with null.
Returns a Boolean mask.
const df = pl.DataFrame({"txt": ["Crab", "cat and dog", "rab$bit", null]})
df.select(
... pl.col("txt"),
... pl.col("txt").str.contains("cat|bit").alias("regex"),
... pl.col("txt").str.contains("rab$", true).alias("literal"),
... )
shape: (4, 3)
┌─────────────┬───────┬─────────┐
│ txt ┆ regex ┆ literal │
│ --- ┆ --- ┆ --- │
│ str ┆ bool ┆ bool │
╞═════════════╪═══════╪═════════╡
│ Crab ┆ false ┆ false │
│ cat and dog ┆ true ┆ false │
│ rab$bit ┆ true ┆ true │
│ null ┆ null ┆ null │
└─────────────┴───────┴─────────┘
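How the `literal` and `strict` flags interact can be sketched in plain TypeScript. This is only a model of the behaviour documented above, not nodejs-polars code: `containsModel` is a hypothetical helper, and it assumes JavaScript's `RegExp` as a stand-in for the regex crate, whose syntax differs in places.

```typescript
// Hypothetical helper: models the documented contains() flags for one value.
// Assumption: JavaScript RegExp stands in for the Rust regex crate.
function containsModel(
  value: string | null,
  pattern: string,
  literal: boolean = false,
  strict: boolean = true,
): boolean | null {
  // A null input stays null, as in the example output above.
  if (value === null) {
    return null;
  }
  // literal: plain substring search, no regex interpretation.
  if (literal) {
    return value.indexOf(pattern) !== -1;
  }
  let re: RegExp;
  try {
    re = new RegExp(pattern);
  } catch (err) {
    // strict: surface the invalid pattern; otherwise mask the value with null.
    if (strict) {
      throw err;
    }
    return null;
  }
  return re.test(value);
}

console.log(containsModel("rab$bit", "rab$"));          // false: "$" anchors the end
console.log(containsModel("rab$bit", "rab$", true));    // true: "$" is a plain character
console.log(containsModel("Crab", "(", false, false));  // null: invalid regex, not strict
```

With `strict` left at `true`, the last call would throw instead of returning null.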
Check if string values in Expression end with a substring.
>>> df = pl.DataFrame({"fruits": ["apple", "mango", null]})
>>> df.withColumns(
... pl.col("fruits").str.endsWith("go").alias("has_suffix"),
... )
shape: (3, 2)
┌────────┬────────────┐
│ fruits ┆ has_suffix │
│ --- ┆ --- │
│ str ┆ bool │
╞════════╪════════════╡
│ apple ┆ false │
│ mango ┆ true │
│ null ┆ null │
└────────┴────────────┘
>>> df = pl.DataFrame(
... {"fruits": ["apple", "mango", "banana"], "suffix": ["le", "go", "nu"]}
... )
>>> df.withColumns(
... pl.col("fruits").str.endsWith(pl.col("suffix")).alias("has_suffix"),
... )
shape: (3, 3)
┌────────┬────────┬────────────┐
│ fruits ┆ suffix ┆ has_suffix │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ bool │
╞════════╪════════╪════════════╡
│ apple ┆ le ┆ true │
│ mango ┆ go ┆ true │
│ banana ┆ nu ┆ false │
└────────┴────────┴────────────┘
Using `endsWith` as a filter condition:
>>> df.filter(pl.col("fruits").str.endsWith("go"))
shape: (1, 2)
┌────────┬────────┐
│ fruits ┆ suffix │
│ --- ┆ --- │
│ str ┆ str │
╞════════╪════════╡
│ mango ┆ go │
└────────┴────────┘
Extract the target capture group from provided patterns.
Returns a Utf8 array. Contains null if the original value is null or the regex captures nothing.
> df = pl.DataFrame({
... 'a': [
... 'http://vote.com/ballon_dor?candidate=messi&ref=polars',
... 'http://vote.com/ballon_dor?candidat=jorginho&ref=polars',
... 'http://vote.com/ballon_dor?candidate=ronaldo&ref=polars'
... ]})
> df.select(pl.col('a').str.extract(/candidate=(\w+)/, 1))
shape: (3, 1)
┌─────────┐
│ a │
│ --- │
│ str │
╞═════════╡
│ messi │
├╌╌╌╌╌╌╌╌╌┤
│ null │
├╌╌╌╌╌╌╌╌╌┤
│ ronaldo │
└─────────┘
Parse string values in Expression as JSON. Throws an error if it encounters an invalid JSON string.
Optional
dtype: DataType
The dtype to cast the extracted value to. If omitted, the dtype will be inferred from the JSON value.
Optional
inferSchemaLength: number
The maximum number of rows to scan for schema inference.
Returns a DataFrame with a struct column.
>>> df = pl.DataFrame( {json: ['{"a":1, "b": true}', null, '{"a":2, "b": false}']} )
>>> df.select(pl.col("json").str.jsonDecode())
shape: (3, 1)
┌─────────────┐
│ json │
│ --- │
│ struct[2] │
╞═════════════╡
│ {1,true} │
│ {null,null} │
│ {2,false} │
└─────────────┘
See Also
----------
jsonPathMatch : Extract the first match of json string with provided JSONPath expression
Extract the first match of a JSON string in Expression with the provided JSONPath expression. Throws an error if it encounters an invalid JSON string. All return values are cast to Utf8 regardless of the original value.
A valid JSONPath query string.
Returns a Utf8 array. Contains null if the original value is null or the JSONPath query returns nothing.
Get the number of characters of the string values in Expression.
df = pl.DataFrame({"a": ["Café", "345", "東京", null]})
df.withColumns(
pl.col("a").str.lengths().alias("n_chars"),
)
shape: (4, 2)
┌──────┬─────────┐
│ a ┆ n_chars │
│ --- ┆ --- │
│ str ┆ u32 │
╞══════╪═════════╡
│ Café ┆ 4 │
│ 345 ┆ 3 │
│ 東京 ┆ 2 │
│ null ┆ null │
└──────┴─────────┘
Add a trailing fillChar to a string until the given length is reached. If the string is already as long as or longer than that length, it is left unchanged.
length: Length of the final string.
fillChar: Character that will fill the string. Note: if a string longer than one character is provided, only the first character will be used.
Add a leading fillChar to a string in Expression until the given length is reached. If the string is already as long as or longer than that length, it is left unchanged.
length: Length of the final string.
fillChar: Character that will fill the string. Note: if a string longer than one character is provided, only the first character will be used.
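The trailing and leading fill rules described above can be sketched in plain TypeScript. This is a model of the documented behaviour only, not nodejs-polars code: `padModel` and its `side` argument are hypothetical, and the sketch assumes the length is counted the way JavaScript counts string length.

```typescript
// Hypothetical helper: models the documented leading/trailing fill rule for one value.
function padModel(
  value: string,
  length: number,
  fillChar: string,
  side: "start" | "end",
): string {
  // Only the first character of fillChar is used.
  const fill = fillChar.charAt(0);
  // Assumption: length is measured in UTF-16 code units, as JavaScript does.
  const missing = length - value.length;
  if (fill === "" || missing <= 0) {
    // Already long enough (or nothing to pad with): leave the value unchanged.
    return value;
  }
  let padding = "";
  for (let i = 0; i < missing; i++) {
    padding += fill;
  }
  return side === "start" ? padding + value : value + padding;
}

console.log(padModel("cat", 6, "*", "end"));      // cat***
console.log(padModel("cat", 6, "-*", "start"));   // ---cat
console.log(padModel("catfish", 6, "*", "end"));  // catfish
```

The second call shows the first-character rule: `"-*"` is passed, but only `"-"` is used for filling.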
Replace first match with a string value in Expression.
df = pl.DataFrame({"cost": ["#12.34", "#56.78"], "text": ["123abc", "abc456"]})
df = df.withColumns(
pl.col("cost").str.replace(/#(\d+)/, "$$$1"),
pl.col("text").str.replace("ab", "-"),
pl.col("text").str.replace("abc", pl.col("cost")).alias("expr")
);
shape: (2, 3)
┌────────┬───────┬───────────┐
│ cost ┆ text ┆ expr │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞════════╪═══════╪═══════════╡
│ $12.34 ┆ 123-c ┆ 123#12.34 │
│ $56.78 ┆ -c456 ┆ #56.78456 │
└────────┴───────┴───────────┘
Replace all regex matches with a string value in Expression.
df = pl.DataFrame({"weather": ["Rainy", "Sunny", "Cloudy", "Snowy"], "text": ["abcabc", "123a123", null, null]})
df = df.withColumns(
pl.col("weather").str.replaceAll(/foggy|rainy/i, "Sunny"),
pl.col("text").str.replaceAll("a", "-")
)
shape: (4, 2)
┌─────────┬─────────┐
│ weather ┆ text │
│ --- ┆ --- │
│ str ┆ str │
╞═════════╪═════════╡
│ Sunny ┆ -bc-bc │
│ Sunny ┆ 123-123 │
│ Cloudy ┆ null │
│ Snowy ┆ null │
└─────────┴─────────┘
Split a string into substrings using the specified separator and return them as a Series.
A string that identifies the character or characters to use in separating the string.
Optional
options: boolean | { inclusive?: boolean }
Optional
inclusive?: boolean
Include the split character/string in the results.
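The effect of the `inclusive` option can be sketched in plain TypeScript. This is a model, not nodejs-polars code: `splitModel` and `SplitOptions` are hypothetical names, and the sketch assumes that inclusive mode keeps each separator attached to the end of the piece before it, which the description above does not spell out.

```typescript
// Mirrors the documented options shape: a bare boolean or an object.
type SplitOptions = boolean | { inclusive?: boolean };

// Hypothetical helper: models split() on a single string value.
function splitModel(value: string, by: string, options: SplitOptions = false): string[] {
  if (by === "") {
    // An empty separator is outside the scope of this sketch.
    throw new Error("empty separator is not modelled");
  }
  const inclusive = typeof options === "boolean" ? options : options.inclusive === true;
  const parts: string[] = [];
  let start = 0;
  let idx = value.indexOf(by, start);
  while (idx !== -1) {
    // Assumption: in inclusive mode the separator stays on the preceding piece.
    const end = inclusive ? idx + by.length : idx;
    parts.push(value.substring(start, end));
    start = idx + by.length;
    idx = value.indexOf(by, start);
  }
  // Keep the tail; in inclusive mode an empty tail is dropped.
  if (!inclusive || start < value.length) {
    parts.push(value.substring(start));
  }
  return parts;
}

console.log(JSON.stringify(splitModel("a_b_c", "_")));                       // ["a","b","c"]
console.log(JSON.stringify(splitModel("a_b_c", "_", true)));                 // ["a_","b_","c"]
console.log(JSON.stringify(splitModel("a_b_c", "_", { inclusive: true })));  // ["a_","b_","c"]
```

Both forms of `options` are accepted, so `true` and `{ inclusive: true }` give the same result.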
Check if string values start with a substring.
>>> df = pl.DataFrame({"fruits": ["apple", "mango", null]})
>>> df.withColumns(
... pl.col("fruits").str.startsWith("app").alias("has_prefix"),
... )
shape: (3, 2)
┌────────┬────────────┐
│ fruits ┆ has_prefix │
│ --- ┆ --- │
│ str ┆ bool │
╞════════╪════════════╡
│ apple ┆ true │
│ mango ┆ false │
│ null ┆ null │
└────────┴────────────┘
>>> df = pl.DataFrame(
... {"fruits": ["apple", "mango", "banana"], "prefix": ["app", "na", "ba"]}
... )
>>> df.withColumns(
... pl.col("fruits").str.startsWith(pl.col("prefix")).alias("has_prefix"),
... )
shape: (3, 3)
┌────────┬────────┬────────────┐
│ fruits ┆ prefix ┆ has_prefix │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ bool │
╞════════╪════════╪════════════╡
│ apple ┆ app ┆ true │
│ mango ┆ na ┆ false │
│ banana ┆ ba ┆ true │
└────────┴────────┴────────────┘
Using `startsWith` as a filter condition:
>>> df.filter(pl.col("fruits").str.startsWith("app"))
shape: (1, 2)
┌────────┬────────┐
│ fruits ┆ prefix │
│ --- ┆ --- │
│ str ┆ str │
╞════════╪════════╡
│ apple ┆ app │
└────────┴────────┘
Remove leading and trailing characters. Where no characters are supplied (null), whitespace is removed instead.
>>> df = pl.DataFrame({
os: [
"#Kali-Linux###",
"$$$Debian-Linux$",
null,
"Ubuntu-Linux ",
" Mac-Sierra",
],
chars: ["#", "$", " ", " ", null],
})
>>> df.select(col("os").str.stripChars(col("chars")).as("os"))
shape: (5, 1)
┌──────────────┐
│ os │
│ --- │
│ str │
╞══════════════╡
│ Kali-Linux │
│ Debian-Linux │
│ null │
│ Ubuntu-Linux │
│ Mac-Sierra │
└──────────────┘
Parse a Series of dtype Utf8 to a Date/Datetime Series.
The target data type: Date or Datetime.
Datetime is the calendar date and time type, with these options:
Optional
timeUnit: TimeUnit | "ms" | "ns" | "us"
Any of 'ms' | 'ns' | 'us'.
A timezone string as defined by Intl.DateTimeFormat, for example America/New_York.
Optional
fmt: string
Formatting syntax.
String functions for Lazy dataframes