Functions
Polars
expressions have a large number of build in functions. These allow you to create complex queries without the need for user defined functions. There are too many to go through here, but we will cover some of the more popular use cases. If you want to view all the functions go to the API Reference for your programming language.
In the examples below we will use the following DataFrame
:
df = pl.DataFrame(
{
"nrs": [1, 2, 3, None, 5],
"names": ["foo", "ham", "spam", "egg", "spam"],
"random": np.random.rand(5),
"groups": ["A", "A", "B", "C", "B"],
}
)
print(df)
shape: (5, 4)
┌──────┬───────┬──────────┬────────┐
│ nrs ┆ names ┆ random ┆ groups │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ str │
╞══════╪═══════╪══════════╪════════╡
│ 1 ┆ foo ┆ 0.154163 ┆ A │
│ 2 ┆ ham ┆ 0.74005 ┆ A │
│ 3 ┆ spam ┆ 0.263315 ┆ B │
│ null ┆ egg ┆ 0.533739 ┆ C │
│ 5 ┆ spam ┆ 0.014575 ┆ B │
└──────┴───────┴──────────┴────────┘
Column Selection
There are various convenience methods to select multiple or all columns.
Select All Columns
df_all = df.select([pl.col("*")])
# Is equivalent to
df_all = df.select([pl.all()])
print(df_all)
Select All Columns Except
df_exclude = df.select([pl.exclude("groups")])
print(df_exclude)
shape: (5, 3)
┌──────┬───────┬──────────┐
│ nrs ┆ names ┆ random │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 │
╞══════╪═══════╪══════════╡
│ 1 ┆ foo ┆ 0.154163 │
│ 2 ┆ ham ┆ 0.74005 │
│ 3 ┆ spam ┆ 0.263315 │
│ null ┆ egg ┆ 0.533739 │
│ 5 ┆ spam ┆ 0.014575 │
└──────┴───────┴──────────┘
Column Naming
By default if you perform a expression it will keep the same name as the original column. In the example below we perform an expression on the nrs
column. Note that the output DataFrame
still has the same name.
df_samename = df.select([pl.col("nrs") + 5])
print(df_samename)
df_samename = df.select([pl.col("nrs") + 5])
print(df_samename)
shape: (5, 1)
┌──────┐
│ nrs │
│ --- │
│ i64 │
╞══════╡
│ 6 │
│ 7 │
│ 8 │
│ null │
│ 10 │
└──────┘
This might get problematic in case you use the same column muliple times in your expression as the output columns will get duplicated. For example the following query will fail.
try:
df_samename2 = df.select([pl.col("nrs") + 5, pl.col("nrs") - 5])
print(df_samename2)
except Exception as e:
print(e)
column with name 'nrs' has more than one occurrences
You can change the output name of an expression by using the alias
function
df_alias = df.select(
[
(pl.col("nrs") + 5).alias("nrs + 5"),
(pl.col("nrs") - 5).alias("nrs - 5"),
]
)
print(df_alias)
shape: (5, 2)
┌─────────┬─────────┐
│ nrs + 5 ┆ nrs - 5 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════════╪═════════╡
│ 6 ┆ -4 │
│ 7 ┆ -3 │
│ 8 ┆ -2 │
│ null ┆ null │
│ 10 ┆ 0 │
└─────────┴─────────┘
In case of multiple columns for example when using all()
or col(*)
you can apply a mapping function map_alias
to change the original column name into something else. In case you want to add a suffix (suffix()
) or prefix (prefix()
) these are also build in.
Count Unique Values
There are two ways two count unique values in Polars
one is an exact methodology and the other one is an approximantion. The approximation uses the HyperLogLog++ algorithm to approximate the cardinality and is especially usefull for very large datasets where an approximation is good enough.
df_alias = df.select(
[
pl.col("names").n_unique().alias("unique"),
pl.approx_unique("names").alias("unique_approx"),
]
)
print(df_alias)
shape: (1, 2)
┌────────┬───────────────┐
│ unique ┆ unique_approx │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞════════╪═══════════════╡
│ 4 ┆ 4 │
└────────┴───────────────┘
Conditionals
Polars
supports if-like conditions in expression with the when
, then
, otherwise
syntax. The predicate is placed in the when
clause and when this evaluates to true
the then
expression is applied otherwise the otherwise
expression is applied (row-wise).
df_conditional = df.select(
[
pl.col("nrs"),
pl.when(pl.col("nrs") > 2)
.then(pl.lit(True))
.otherwise(pl.lit(False))
.alias("conditional"),
]
)
print(df_conditional)
shape: (5, 2)
┌──────┬─────────────┐
│ nrs ┆ conditional │
│ --- ┆ --- │
│ i64 ┆ bool │
╞══════╪═════════════╡
│ 1 ┆ false │
│ 2 ┆ false │
│ 3 ┆ true │
│ null ┆ false │
│ 5 ┆ true │
└──────┴─────────────┘