Check if strings in Expression contain a substring that matches a pattern.
A valid regular expression pattern, compatible with the regex crate.
literal: boolean (optional). Treat `pattern` as a literal string, not as a regular expression.
strict: boolean (optional). Raise an error if the underlying pattern is not a valid regex; otherwise mask out with a null value.
Returns: Boolean mask
const df = pl.DataFrame({"txt": ["Crab", "cat and dog", "rab$bit", null]})
df.select(
  pl.col("txt"),
  pl.col("txt").str.contains("cat|bit").alias("regex"),
  pl.col("txt").str.contains("rab$", true).alias("literal"),
)
shape: (4, 3)
┌─────────────┬───────┬─────────┐
│ txt ┆ regex ┆ literal │
│ --- ┆ --- ┆ --- │
│ str ┆ bool ┆ bool │
╞═════════════╪═══════╪═════════╡
│ Crab ┆ false ┆ false │
│ cat and dog ┆ true ┆ false │
│ rab$bit ┆ true ┆ true │
│ null ┆ null ┆ null │
└─────────────┴───────┴─────────┘
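How `literal`, `strict`, and null inputs interact can be modeled in plain TypeScript. This is only an illustrative sketch, not the library's implementation: `containsModel` and `escapeRegex` are hypothetical names, JavaScript's `RegExp` stands in for the Rust regex crate (their syntaxes differ in places), and the default of `strict = true` is an assumption.

```typescript
// Hypothetical model of the semantics described above; not part of nodejs-polars.
function escapeRegex(s: string): string {
  // Escape every character that has a special meaning in a JavaScript RegExp.
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function containsModel(
  values: (string | null)[],
  pattern: string,
  literal = false,
  strict = true, // assumed default, not confirmed by the source
): (boolean | null)[] {
  let re: RegExp;
  try {
    re = new RegExp(literal ? escapeRegex(pattern) : pattern);
  } catch (err) {
    if (strict) throw err;         // strict: surface the invalid pattern
    return values.map(() => null); // non-strict: mask out with null values
  }
  // Null inputs stay null; every other value becomes true or false.
  return values.map((v) => (v === null ? null : re.test(v)));
}

const txt = ["Crab", "cat and dog", "rab$bit", null];
const regexMask = containsModel(txt, "cat|bit");           // [false, true, true, null]
const literalMask = containsModel(txt, "rab$", true);      // [false, false, true, null]
const invalidMask = containsModel(txt, "(", false, false); // [null, null, null, null]
```

The first two masks reproduce the `regex` and `literal` columns of the table above.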
Check if string values in Expression end with a substring.
>>> df = pl.DataFrame({"fruits": ["apple", "mango", null]})
>>> df.withColumns(
... pl.col("fruits").str.endsWith("go").alias("has_suffix"),
... )
shape: (3, 2)
┌────────┬────────────┐
│ fruits ┆ has_suffix │
│ --- ┆ --- │
│ str ┆ bool │
╞════════╪════════════╡
│ apple ┆ false │
│ mango ┆ true │
│ null ┆ null │
└────────┴────────────┘
>>> df = pl.DataFrame(
... {"fruits": ["apple", "mango", "banana"], "suffix": ["le", "go", "nu"]}
... )
>>> df.withColumns(
... pl.col("fruits").str.endsWith(pl.col("suffix")).alias("has_suffix"),
... )
shape: (3, 3)
┌────────┬────────┬────────────┐
│ fruits ┆ suffix ┆ has_suffix │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ bool │
╞════════╪════════╪════════════╡
│ apple ┆ le ┆ true │
│ mango ┆ go ┆ true │
│ banana ┆ nu ┆ false │
└────────┴────────┴────────────┘
Using `endsWith` as a filter condition:
>>> df.filter(pl.col("fruits").str.endsWith("go"))
shape: (1, 2)
┌────────┬────────┐
│ fruits ┆ suffix │
│ --- ┆ --- │
│ str ┆ str │
╞════════╪════════╡
│ mango ┆ go │
└────────┴────────┘
Extract the target capture group from provided patterns.
Returns a Utf8 array; contains null if the original value is null or the regex captures nothing.
> df = pl.DataFrame({
... 'a': [
... 'http://vote.com/ballon_dor?candidate=messi&ref=polars',
... 'http://vote.com/ballon_dor?candidat=jorginho&ref=polars',
... 'http://vote.com/ballon_dor?candidate=ronaldo&ref=polars'
... ]})
> df.select(pl.col('a').str.extract(/candidate=(\w+)/, 1))
shape: (3, 1)
┌─────────┐
│ a │
│ --- │
│ str │
╞═════════╡
│ messi │
├╌╌╌╌╌╌╌╌╌┤
│ null │
├╌╌╌╌╌╌╌╌╌┤
│ ronaldo │
└─────────┘
Parse string values in Expression as JSON. Throws an error if an invalid JSON string is encountered.
dtype: DataType (optional)
inferSchemaLength: number (optional)
Returns: DF with struct
>>> df = pl.DataFrame( {json: ['{"a":1, "b": true}', null, '{"a":2, "b": false}']} )
>>> df.select(pl.col("json").str.jsonDecode())
shape: (3, 1)
┌─────────────┐
│ json │
│ --- │
│ struct[2] │
╞═════════════╡
│ {1,true} │
│ {null,null} │
│ {2,false} │
└─────────────┘
See Also
----------
jsonPathMatch : Extract the first match of json string with provided JSONPath expression
Extract the first match of a JSON string in Expression with the provided JSONPath expression. Throws an error if an invalid JSON string is encountered. All return values are cast to Utf8 regardless of the original value.
Returns a Utf8 array; contains null if the original value is null or the JSONPath returns nothing.
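The three rules above (null in gives null out, invalid JSON throws, every match comes back as a string) can be sketched in plain TypeScript. `jsonPathMatchModel` is a hypothetical helper, not a nodejs-polars function, and it understands only simple dotted paths such as `$.a.b` rather than full JSONPath.

```typescript
// Hypothetical model; supports only dotted paths such as "$.a.b", not full JSONPath.
function jsonPathMatchModel(
  values: (string | null)[],
  path: string,
): (string | null)[] {
  // "$.a.b" -> ["a", "b"]
  const keys = path.replace(/^\$\.?/, "").split(".").filter((k) => k.length > 0);
  return values.map((v) => {
    if (v === null) return null;           // null in, null out
    let node: unknown = JSON.parse(v);     // throws SyntaxError on invalid JSON
    for (const key of keys) {
      if (node === null || typeof node !== "object") return null;
      node = (node as Record<string, unknown>)[key];
      if (node === undefined) return null; // the path matched nothing
    }
    // Every match is returned as a string, whatever its JSON type.
    return typeof node === "string" ? node : JSON.stringify(node);
  });
}

const docs = ['{"a": 1}', '{"a": "x"}', '{"b": 2}', null];
const matched = jsonPathMatchModel(docs, "$.a"); // ["1", "x", null, null]
```

Note that the number `1` comes back as the string `"1"`, mirroring the cast to Utf8 described above.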
Get number of chars of the string values in Expression.
df = pl.DataFrame({"a": ["Café", "345", "東京", null]})
df.withColumns(
pl.col("a").str.lengths().alias("n_chars"),
)
shape: (4, 2)
┌──────┬─────────┐
│ a ┆ n_chars │
│ --- ┆ --- │
│ str ┆ u32 │
╞══════╪═════════╡
│ Café ┆ 4 │
│ 345 ┆ 3 │
│ 東京 ┆ 2 │
│ null ┆ null │
└──────┴─────────┘
Add a leading fillChar to a string in Expression until the given string length is reached. If the string is already longer than or equal to the given length, no modification is made.
length: Length of the final string.
fillChar: Character that will fill the string. If a string longer than 1 character is provided, only the first character is used.
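A minimal sketch of the padding rule in plain TypeScript follows. `padStartModel` is a hypothetical name, not the library's code; it measures length in UTF-16 code units and falls back to a space when the fill string is empty, both of which are assumptions of this sketch.

```typescript
// Hypothetical model of left-padding; not part of nodejs-polars.
function padStartModel(
  values: (string | null)[],
  length: number,
  fillChar: string,
): (string | null)[] {
  // Only the first character of fillChar is used (space if empty: an assumption).
  const fill = fillChar.length > 0 ? fillChar.charAt(0) : " ";
  return values.map((v) => {
    if (v === null) return null; // null in, null out
    let out = v;
    // Strings already at or beyond the target length are left untouched.
    while (out.length < length) out = fill + out;
    return out;
  });
}

const padded = padStartModel(["7", "42", "12345", null], 4, "0*");
// ["0007", "0042", "12345", null]
```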
Replace first match with a string value in Expression.
df = pl.DataFrame({"cost": ["#12.34", "#56.78"], "text": ["123abc", "abc456"]})
df = df.withColumns(
pl.col("cost").str.replace(/#(\d+)/, "$$$1"),
pl.col("text").str.replace("ab", "-"),
pl.col("text").str.replace("abc", pl.col("cost")).alias("expr")
);
shape: (2, 3)
┌────────┬───────┬───────────┐
│ cost ┆ text ┆ expr │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞════════╪═══════╪═══════════╡
│ $12.34 ┆ 123-c ┆ 123#12.34 │
│ $56.78 ┆ -c456 ┆ #56.78456 │
└────────┴───────┴───────────┘
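The replacement string `"$$$1"` in the example reads as `$$` (a literal dollar sign) followed by `$1` (the first capture group). JavaScript's built-in `String.prototype.replace` uses the same two tokens, so it serves as an analogy here. This is a sketch of the replacement syntax only and does not call the library.

```typescript
// JavaScript's own String.prototype.replace, used here only as an analogy.
const input = "#12.34 #56.78";

// Without the "g" flag only the first match is replaced.
const firstOnly = input.replace(/#(\d+)/, "$$$1");   // "$12.34 #56.78"

// With the "g" flag every match is replaced.
const everyMatch = input.replace(/#(\d+)/g, "$$$1"); // "$12.34 $56.78"
```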
Replace all regex matches with a string value in Expression.
df = pl.DataFrame({"weather": ["Rainy", "Sunny", "Cloudy", "Snowy"], "text": ["abcabc", "123a123", null, null]})
df = df.withColumns(
pl.col("weather").str.replaceAll(/foggy|rainy/i, "Sunny"),
pl.col("text").str.replaceAll("a", "-")
)
shape: (4, 2)
┌─────────┬─────────┐
│ weather ┆ text │
│ --- ┆ --- │
│ str ┆ str │
╞═════════╪═════════╡
│ Sunny ┆ -bc-bc │
│ Sunny ┆ 123-123 │
│ Cloudy ┆ null │
│ Snowy ┆ null │
└─────────┴─────────┘
Check if string values start with a substring.
>>> df = pl.DataFrame({"fruits": ["apple", "mango", null]})
>>> df.withColumns(
... pl.col("fruits").str.startsWith("app").alias("has_prefix"),
... )
shape: (3, 2)
┌────────┬────────────┐
│ fruits ┆ has_prefix │
│ --- ┆ --- │
│ str ┆ bool │
╞════════╪════════════╡
│ apple ┆ true │
│ mango ┆ false │
│ null ┆ null │
└────────┴────────────┘
>>> df = pl.DataFrame(
... {"fruits": ["apple", "mango", "banana"], "prefix": ["app", "na", "ba"]}
... )
>>> df.withColumns(
... pl.col("fruits").str.startsWith(pl.col("prefix")).alias("has_prefix"),
... )
shape: (3, 3)
┌────────┬────────┬────────────┐
│ fruits ┆ prefix ┆ has_prefix │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ bool │
╞════════╪════════╪════════════╡
│ apple ┆ app ┆ true │
│ mango ┆ na ┆ false │
│ banana ┆ ba ┆ true │
└────────┴────────┴────────────┘
Using `startsWith` as a filter condition:
>>> df.filter(pl.col("fruits").str.startsWith("app"))
shape: (1, 2)
┌────────┬────────┐
│ fruits ┆ prefix │
│ --- ┆ --- │
│ str ┆ str │
╞════════╪════════╡
│ apple ┆ app │
└────────┴────────┘
Remove leading and trailing characters. Where no characters are given (a null in `chars` below), whitespace is removed.
>>> df = pl.DataFrame({
os: [
"#Kali-Linux###",
"$$$Debian-Linux$",
null,
"Ubuntu-Linux ",
" Mac-Sierra",
],
chars: ["#", "$", " ", " ", null],
})
>>> df.select(pl.col("os").str.stripChars(pl.col("chars")).alias("os"))
shape: (5, 1)
┌──────────────┐
│ os │
│ --- │
│ str │
╞══════════════╡
│ Kali-Linux │
│ Debian-Linux │
│ null │
│ Ubuntu-Linux │
│ Mac-Sierra │
└──────────────┘
Parse a Series of dtype Utf8 to a Date/Datetime Series.
Date or Datetime.
Datetime: Calendar date and time type.
timeUnit: TimeUnit | "ms" | "ns" | "us" (optional). Any of 'ms' | 'ns' | 'us'.
Timezone string as defined by Intl.DateTimeFormat, for example America/New_York.
fmt: string (optional). Formatting syntax.
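To make the role of `fmt` concrete, here is a plain TypeScript sketch of format-driven parsing. It assumes strftime-style specifiers (`%Y`, `%m`, `%d`), which the text above does not spell out. `strptimeModel` is a hypothetical helper rather than the library's code, it returns UTC `Date` objects, and it maps unparseable input to null as a modeling choice.

```typescript
// Hypothetical model: supports only %Y, %m and %d, and returns UTC dates.
type Field = "Y" | "m" | "d";

function strptimeModel(
  values: (string | null)[],
  fmt: string,
): (Date | null)[] {
  const order: Field[] = [];
  // Turn the format into a regex, recording which field each group captures.
  const source = fmt.replace(/%[Ymd]|[.*+?^${}()|[\]\\]/g, (tok) => {
    if (tok === "%Y") { order.push("Y"); return "(\\d{4})"; }
    if (tok === "%m") { order.push("m"); return "(\\d{2})"; }
    if (tok === "%d") { order.push("d"); return "(\\d{2})"; }
    return "\\" + tok; // escape regex metacharacters in the literal parts
  });
  const re = new RegExp("^" + source + "$");
  return values.map((v) => {
    if (v === null) return null; // null in, null out
    const m = re.exec(v);
    if (m === null) return null; // unparseable input: null in this model (assumption)
    const parts: Record<Field, number> = { Y: 1970, m: 1, d: 1 };
    order.forEach((field, i) => { parts[field] = Number(m[i + 1]); });
    return new Date(Date.UTC(parts.Y, parts.m - 1, parts.d));
  });
}

const parsed = strptimeModel(["2021-04-22", "not a date", null], "%Y-%m-%d");
```

Only the first input matches the format; the other two entries of `parsed` are null.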
String functions for Lazy dataframes