polars.Expr.top_k_by#

Expr.top_k_by( by: IntoExpr | Iterable[IntoExpr], k: int | IntoExprColumn = 5, *, descending: bool | Sequence[bool] = False, nulls_last: bool = False, maintain_order: bool = False, multithreaded: bool = True, ) → Self[source]#

Return elements corresponding to the k largest elements of the by column(s).

This has time complexity:

\[O(n + k \log{n} - \frac{k}{2})\]

Parameters:

by: Column(s) included in sort order. Accepts expression input. Strings are parsed as column names.
k: Number of elements to return.
descending: If True, consider the k smallest (instead of the k largest). Top-k by multiple columns can be specified per column by passing a sequence of booleans.
nulls_last: Place null values last.
maintain_order: Whether the order should be maintained if elements are equal.
multithreaded: Sort using multiple threads.

See also

top_k
bottom_k
bottom_k_by

Examples

>>> df = pl.DataFrame(
...     {
...         "a": [1, 2, 3, 4, 5, 6],
...         "b": [6, 5, 4, 3, 2, 1],
...         "c": ["Apple", "Orange", "Apple", "Apple", "Banana", "Banana"],
...     }
... )
>>> df
shape: (6, 3)
┌─────┬─────┬────────┐
│ a   ┆ b   ┆ c      │
│ --- ┆ --- ┆ ---    │
│ i64 ┆ i64 ┆ str    │
╞═════╪═════╪════════╡
│ 1   ┆ 6   ┆ Apple  │
│ 2   ┆ 5   ┆ Orange │
│ 3   ┆ 4   ┆ Apple  │
│ 4   ┆ 3   ┆ Apple  │
│ 5   ┆ 2   ┆ Banana │
│ 6   ┆ 1   ┆ Banana │
└─────┴─────┴────────┘

Get the top 2 rows by column a or b.

>>> df.select(
...     pl.all().top_k_by("a", 2).name.suffix("_top_by_a"),
...     pl.all().top_k_by("b", 2).name.suffix("_top_by_b"),
... )
shape: (2, 6)
┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
│ a_top_by_a ┆ b_top_by_a ┆ c_top_by_a ┆ a_top_by_b ┆ b_top_by_b ┆ c_top_by_b │
│ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        │
│ i64        ┆ i64        ┆ str        ┆ i64        ┆ i64        ┆ str        │
╞════════════╪════════════╪════════════╪════════════╪════════════╪════════════╡
│ 6          ┆ 1          ┆ Banana     ┆ 1          ┆ 6          ┆ Apple      │
│ 5          ┆ 2          ┆ Banana     ┆ 2          ┆ 5          ┆ Orange     │
└────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘

Get the top 2 rows by multiple columns with given order.

>>> df.select(
...     pl.all()
...     .top_k_by(["c", "a"], 2, descending=[False, True])
...     .name.suffix("_by_ca"),
...     pl.all()
...     .top_k_by(["c", "b"], 2, descending=[False, True])
...     .name.suffix("_by_cb"),
... )
shape: (2, 6)
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ a_by_ca ┆ b_by_ca ┆ c_by_ca ┆ a_by_cb ┆ b_by_cb ┆ c_by_cb │
│ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     ┆ ---     │
│ i64     ┆ i64     ┆ str     ┆ i64     ┆ i64     ┆ str     │
╞═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ 2       ┆ 5       ┆ Orange  ┆ 2       ┆ 5       ┆ Orange  │
│ 5       ┆ 2       ┆ Banana  ┆ 6       ┆ 1       ┆ Banana  │
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

Get the top 2 rows by column a in each group.

>>> (
...     df.group_by("c", maintain_order=True)
...     .agg(pl.all().top_k_by("a", 2))
...     .explode(pl.all().exclude("c"))
... )
shape: (5, 3)
┌────────┬─────┬─────┐
│ c      ┆ a   ┆ b   │
│ ---    ┆ --- ┆ --- │
│ str    ┆ i64 ┆ i64 │
╞════════╪═════╪═════╡
│ Apple  ┆ 4   ┆ 3   │
│ Apple  ┆ 3   ┆ 4   │
│ Orange ┆ 2   ┆ 5   │
│ Banana ┆ 6   ┆ 1   │
│ Banana ┆ 5   ┆ 2   │
└────────┴─────┴─────┘