Joins

Join strategies

Polars supports the following join strategies by specifying the strategy argument:

Strategy	Description
`inner`	Returns row with matching keys in both frames. Non-matching rows in either the left or right frame are discarded.
`left`	Returns all rows in the left dataframe, whether or not a match in the right-frame is found. Non-matching rows have their right columns null-filled.
`outer`	Returns all rows from both the left and right dataframe. If no match is found in one frame, columns from the other frame are null-filled.
`cross`	Returns the Cartesian product of all rows from the left frame with all rows from the right frame. Duplicates rows are retained; the table length of `A` cross-joined with `B` is always `len(A) × len(B)`.
`asof`	A left-join in which the match is performed on the nearest key rather than on equal keys.
`semi`	Returns all rows from the left frame in which the join key is also present in the right frame.
`anti`	Returns all rows from the left frame in which the join key is not present in the right frame.

Inner join

An inner join produces a DataFrame that contains only the rows where the join key exists in both DataFrames. Let's take for example the following two DataFrames:

Python Rust

DataFrame

df_customers = pl.DataFrame(
    {
        "customer_id": [1, 2, 3],
        "name": ["Alice", "Bob", "Charlie"],
    }
)
print(df_customers)

DataFrame

let df_customers = df! (

    "customer_id" => &[1, 2, 3],
    "name" => &["Alice", "Bob", "Charlie"],
)?;

println!("{}", &df_customers);

shape: (3, 2)
┌─────────────┬─────────┐
│ customer_id ┆ name    │
│ ---         ┆ ---     │
│ i64         ┆ str     │
╞═════════════╪═════════╡
│ 1           ┆ Alice   │
│ 2           ┆ Bob     │
│ 3           ┆ Charlie │
└─────────────┴─────────┘

Python Rust

DataFrame

df_orders = pl.DataFrame(
    {
        "order_id": ["a", "b", "c"],
        "customer_id": [1, 2, 2],
        "amount": [100, 200, 300],
    }
)
print(df_orders)

DataFrame

let df_orders = df!(
        "order_id"=> &["a", "b", "c"],
        "customer_id"=> &[1, 2, 2],
        "amount"=> &[100, 200, 300],
)?;
println!("{}", &df_orders);

shape: (3, 3)
┌──────────┬─────────────┬────────┐
│ order_id ┆ customer_id ┆ amount │
│ ---      ┆ ---         ┆ ---    │
│ str      ┆ i64         ┆ i64    │
╞══════════╪═════════════╪════════╡
│ a        ┆ 1           ┆ 100    │
│ b        ┆ 2           ┆ 200    │
│ c        ┆ 2           ┆ 300    │
└──────────┴─────────────┴────────┘

To get a DataFrame with the orders and their associated customer we can do an inner join on the customer_id column:

Python Rust

join

df_inner_customer_join = df_customers.join(df_orders, on="customer_id", how="inner")
print(df_inner_customer_join)

join

let df_inner_customer_join = df_customers
    .clone()
    .lazy()
    .join(
        df_orders.clone().lazy(),
        [col("customer_id")],
        [col("customer_id")],
        JoinArgs::new(JoinType::Inner),
    )
    .collect()?;
println!("{}", &df_inner_customer_join);

shape: (3, 4)
┌─────────────┬───────┬──────────┬────────┐
│ customer_id ┆ name  ┆ order_id ┆ amount │
│ ---         ┆ ---   ┆ ---      ┆ ---    │
│ i64         ┆ str   ┆ str      ┆ i64    │
╞═════════════╪═══════╪══════════╪════════╡
│ 1           ┆ Alice ┆ a        ┆ 100    │
│ 2           ┆ Bob   ┆ b        ┆ 200    │
│ 2           ┆ Bob   ┆ c        ┆ 300    │
└─────────────┴───────┴──────────┴────────┘

Left join

The left join produces a DataFrame that contains all the rows from the left DataFrame and only the rows from the right DataFrame where the join key exists in the left DataFrame. If we now take the example from above and want to have a DataFrame with all the customers and their associated orders (regardless of whether they have placed an order or not) we can do a left join:

Python Rust

join

df_left_join = df_customers.join(df_orders, on="customer_id", how="left")
print(df_left_join)

join

let df_left_join = df_customers
    .clone()
    .lazy()
    .join(
        df_orders.clone().lazy(),
        [col("customer_id")],
        [col("customer_id")],
        JoinArgs::new(JoinType::Left),
    )
    .collect()?;
println!("{}", &df_left_join);

shape: (4, 4)
┌─────────────┬─────────┬──────────┬────────┐
│ customer_id ┆ name    ┆ order_id ┆ amount │
│ ---         ┆ ---     ┆ ---      ┆ ---    │
│ i64         ┆ str     ┆ str      ┆ i64    │
╞═════════════╪═════════╪══════════╪════════╡
│ 1           ┆ Alice   ┆ a        ┆ 100    │
│ 2           ┆ Bob     ┆ b        ┆ 200    │
│ 2           ┆ Bob     ┆ c        ┆ 300    │
│ 3           ┆ Charlie ┆ null     ┆ null   │
└─────────────┴─────────┴──────────┴────────┘

Notice, that the fields for the customer with the customer_id of 3 are null, as there are no orders for this customer.

Outer join

The outer join produces a DataFrame that contains all the rows from both DataFrames. Columns are null, if the join key does not exist in the source DataFrame. Doing an outer join on the two DataFrames from above produces a similar DataFrame to the left join:

Python Rust

join

df_outer_join = df_customers.join(df_orders, on="customer_id", how="outer")
print(df_outer_join)

join

let df_outer_join = df_customers
    .clone()
    .lazy()
    .join(
        df_orders.clone().lazy(),
        [col("customer_id")],
        [col("customer_id")],
        JoinArgs::new(JoinType::Outer),
    )
    .collect()?;
println!("{}", &df_outer_join);

shape: (4, 4)
┌─────────────┬─────────┬──────────┬────────┐
│ customer_id ┆ name    ┆ order_id ┆ amount │
│ ---         ┆ ---     ┆ ---      ┆ ---    │
│ i64         ┆ str     ┆ str      ┆ i64    │
╞═════════════╪═════════╪══════════╪════════╡
│ 1           ┆ Alice   ┆ a        ┆ 100    │
│ 2           ┆ Bob     ┆ b        ┆ 200    │
│ 2           ┆ Bob     ┆ c        ┆ 300    │
│ 3           ┆ Charlie ┆ null     ┆ null   │
└─────────────┴─────────┴──────────┴────────┘

Cross join

A cross join is a cartesian product of the two DataFrames. This means that every row in the left DataFrame is joined with every row in the right DataFrame. The cross join is useful for creating a DataFrame with all possible combinations of the columns in two DataFrames. Let's take for example the following two DataFrames.

Python Rust

DataFrame

df_colors = pl.DataFrame(
    {
        "color": ["red", "blue", "green"],
    }
)
print(df_colors)

DataFrame

let df_colors = df!(
        "color"=> &["red", "blue", "green"],
)?;
println!("{}", &df_colors);

shape: (3, 1)
┌───────┐
│ color │
│ ---   │
│ str   │
╞═══════╡
│ red   │
│ blue  │
│ green │
└───────┘

Python Rust

DataFrame

df_sizes = pl.DataFrame(
    {
        "size": ["S", "M", "L"],
    }
)
print(df_sizes)

DataFrame

let df_sizes = df!(
        "size"=> &["S", "M", "L"],
)?;
println!("{}", &df_sizes);

shape: (3, 1)
┌──────┐
│ size │
│ ---  │
│ str  │
╞══════╡
│ S    │
│ M    │
│ L    │
└──────┘

We can now create a DataFrame containing all possible combinations of the colors and sizes with a cross join:

Python Rust

join

df_cross_join = df_colors.join(df_sizes, how="cross")
print(df_cross_join)

join

let df_cross_join = df_colors
    .clone()
    .lazy()
    .cross_join(df_sizes.clone().lazy())
    .collect()?;
println!("{}", &df_cross_join);

shape: (9, 2)
┌───────┬──────┐
│ color ┆ size │
│ ---   ┆ ---  │
│ str   ┆ str  │
╞═══════╪══════╡
│ red   ┆ S    │
│ red   ┆ M    │
│ red   ┆ L    │
│ blue  ┆ S    │
│ blue  ┆ M    │
│ blue  ┆ L    │
│ green ┆ S    │
│ green ┆ M    │
│ green ┆ L    │
└───────┴──────┘

The inner, left, outer and cross join strategies are standard amongst dataframe libraries. We provide more details on the less familiar semi, anti and asof join strategies below.

Semi join

The semi join returns all rows from the left frame in which the join key is also present in the right frame. Consider the following scenario: a car rental company has a DataFrame showing the cars that it owns with each car having a unique id.

Python Rust

DataFrame

df_cars = pl.DataFrame(
    {
        "id": ["a", "b", "c"],
        "make": ["ford", "toyota", "bmw"],
    }
)
print(df_cars)

DataFrame

let df_cars = df!(
        "id"=> &["a", "b", "c"],
        "make"=> &["ford", "toyota", "bmw"],
)?;
println!("{}", &df_cars);

shape: (3, 2)
┌─────┬────────┐
│ id  ┆ make   │
│ --- ┆ ---    │
│ str ┆ str    │
╞═════╪════════╡
│ a   ┆ ford   │
│ b   ┆ toyota │
│ c   ┆ bmw    │
└─────┴────────┘

The company has another DataFrame showing each repair job carried out on a vehicle.

Python Rust

DataFrame

df_repairs = pl.DataFrame(
    {
        "id": ["c", "c"],
        "cost": [100, 200],
    }
)
print(df_repairs)

DataFrame

let df_repairs = df!(
        "id"=> &["c", "c"],
        "cost"=> &[100, 200],
)?;
println!("{}", &df_repairs);

shape: (2, 2)
┌─────┬──────┐
│ id  ┆ cost │
│ --- ┆ ---  │
│ str ┆ i64  │
╞═════╪══════╡
│ c   ┆ 100  │
│ c   ┆ 200  │
└─────┴──────┘

You want to answer this question: which of the cars have had repairs carried out?

An inner join does not answer this question directly as it produces a DataFrame with multiple rows for each car that has had multiple repair jobs:

Python Rust

join

df_inner_join = df_cars.join(df_repairs, on="id", how="inner")
print(df_inner_join)

join

let df_inner_join = df_cars
    .clone()
    .lazy()
    .inner_join(df_repairs.clone().lazy(), col("id"), col("id"))
    .collect()?;
println!("{}", &df_inner_join);

shape: (2, 3)
┌─────┬──────┬──────┐
│ id  ┆ make ┆ cost │
│ --- ┆ ---  ┆ ---  │
│ str ┆ str  ┆ i64  │
╞═════╪══════╪══════╡
│ c   ┆ bmw  ┆ 100  │
│ c   ┆ bmw  ┆ 200  │
└─────┴──────┴──────┘

However, a semi join produces a single row for each car that has had a repair job carried out.

Python Rust

join

df_semi_join = df_cars.join(df_repairs, on="id", how="semi")
print(df_semi_join)

join

let df_semi_join = df_cars
    .clone()
    .lazy()
    .join(
        df_repairs.clone().lazy(),
        [col("id")],
        [col("id")],
        JoinArgs::new(JoinType::Semi),
    )
    .collect()?;
println!("{}", &df_semi_join);

shape: (1, 2)
┌─────┬──────┐
│ id  ┆ make │
│ --- ┆ ---  │
│ str ┆ str  │
╞═════╪══════╡
│ c   ┆ bmw  │
└─────┴──────┘

Anti join

Continuing this example, an alternative question might be: which of the cars have not had a repair job carried out? An anti join produces a DataFrame showing all the cars from df_cars where the id is not present in the df_repairs DataFrame.

Python Rust

join

df_anti_join = df_cars.join(df_repairs, on="id", how="anti")
print(df_anti_join)

join

let df_anti_join = df_cars
    .clone()
    .lazy()
    .join(
        df_repairs.clone().lazy(),
        [col("id")],
        [col("id")],
        JoinArgs::new(JoinType::Anti),
    )
    .collect()?;
println!("{}", &df_anti_join);

shape: (2, 2)
┌─────┬────────┐
│ id  ┆ make   │
│ --- ┆ ---    │
│ str ┆ str    │
╞═════╪════════╡
│ a   ┆ ford   │
│ b   ┆ toyota │
└─────┴────────┘

Asof join

An asof join is like a left join except that we match on nearest key rather than equal keys. In Polars we can do an asof join with the join method and specifying strategy="asof". However, for more flexibility we can use the join_asof method.

Consider the following scenario: a stock market broker has a DataFrame called df_trades showing transactions it has made for different stocks.

Python Rust

DataFrame

df_trades = pl.DataFrame(
    {
        "time": [
            datetime(2020, 1, 1, 9, 1, 0),
            datetime(2020, 1, 1, 9, 1, 0),
            datetime(2020, 1, 1, 9, 3, 0),
            datetime(2020, 1, 1, 9, 6, 0),
        ],
        "stock": ["A", "B", "B", "C"],
        "trade": [101, 299, 301, 500],
    }
)
print(df_trades)

DataFrame

use chrono::prelude::*;
let df_trades = df!(
    "time"=> &[
    NaiveDate::from_ymd_opt(2020, 1, 1).unwrap().and_hms_opt(9, 1, 0).unwrap(),
    NaiveDate::from_ymd_opt(2020, 1, 1).unwrap().and_hms_opt(9, 1, 0).unwrap(),
    NaiveDate::from_ymd_opt(2020, 1, 1).unwrap().and_hms_opt(9, 3, 0).unwrap(),
    NaiveDate::from_ymd_opt(2020, 1, 1).unwrap().and_hms_opt(9, 6, 0).unwrap(),
        ],
        "stock"=> &["A", "B", "B", "C"],
        "trade"=> &[101, 299, 301, 500],
)?;
println!("{}", &df_trades);

shape: (4, 3)
┌─────────────────────┬───────┬───────┐
│ time                ┆ stock ┆ trade │
│ ---                 ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ str   ┆ i64   │
╞═════════════════════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A     ┆ 101   │
│ 2020-01-01 09:01:00 ┆ B     ┆ 299   │
│ 2020-01-01 09:03:00 ┆ B     ┆ 301   │
│ 2020-01-01 09:06:00 ┆ C     ┆ 500   │
└─────────────────────┴───────┴───────┘

The broker has another DataFrame called df_quotes showing prices it has quoted for these stocks.

Python Rust

DataFrame

df_quotes = pl.DataFrame(
    {
        "time": [
            datetime(2020, 1, 1, 9, 0, 0),
            datetime(2020, 1, 1, 9, 2, 0),
            datetime(2020, 1, 1, 9, 4, 0),
            datetime(2020, 1, 1, 9, 6, 0),
        ],
        "stock": ["A", "B", "C", "A"],
        "quote": [100, 300, 501, 102],
    }
)

print(df_quotes)

DataFrame

let df_quotes = df!(
        "time"=> &[
    NaiveDate::from_ymd_opt(2020, 1, 1).unwrap().and_hms_opt(9, 0, 0).unwrap(),
    NaiveDate::from_ymd_opt(2020, 1, 1).unwrap().and_hms_opt(9, 2, 0).unwrap(),
    NaiveDate::from_ymd_opt(2020, 1, 1).unwrap().and_hms_opt(9, 4, 0).unwrap(),
    NaiveDate::from_ymd_opt(2020, 1, 1).unwrap().and_hms_opt(9, 6, 0).unwrap(),
        ],
        "stock"=> &["A", "B", "C", "A"],
        "quote"=> &[100, 300, 501, 102],
)?;

println!("{}", &df_quotes);

shape: (4, 3)
┌─────────────────────┬───────┬───────┐
│ time                ┆ stock ┆ quote │
│ ---                 ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ str   ┆ i64   │
╞═════════════════════╪═══════╪═══════╡
│ 2020-01-01 09:00:00 ┆ A     ┆ 100   │
│ 2020-01-01 09:02:00 ┆ B     ┆ 300   │
│ 2020-01-01 09:04:00 ┆ C     ┆ 501   │
│ 2020-01-01 09:06:00 ┆ A     ┆ 102   │
└─────────────────────┴───────┴───────┘

You want to produce a DataFrame showing for each trade the most recent quote provided before the trade. You do this with join_asof (using the default strategy = "backward"). To avoid joining between trades on one stock with a quote on another you must specify an exact preliminary join on the stock column with by="stock".

Python Rust

join_asof

df_asof_join = df_trades.join_asof(df_quotes, on="time", by="stock")
print(df_asof_join)

let df_asof_join = df_trades.join_asof_by(
    &df_quotes,
    "time",
    "time",
    ["stock"],
    ["stock"],
    AsofStrategy::Backward,
    None,
)?;
println!("{}", &df_asof_join);

shape: (4, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ time                ┆ stock ┆ trade ┆ quote │
│ ---                 ┆ ---   ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ str   ┆ i64   ┆ i64   │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A     ┆ 101   ┆ 100   │
│ 2020-01-01 09:01:00 ┆ B     ┆ 299   ┆ null  │
│ 2020-01-01 09:03:00 ┆ B     ┆ 301   ┆ 300   │
│ 2020-01-01 09:06:00 ┆ C     ┆ 500   ┆ 501   │
└─────────────────────┴───────┴───────┴───────┘

If you want to make sure that only quotes within a certain time range are joined to the trades you can specify the tolerance argument. In this case we want to make sure that the last preceding quote is within 1 minute of the trade so we set tolerance = "1m".

Python

df_asof_tolerance_join = df_trades.join_asof(
    df_quotes, on="time", by="stock", tolerance="1m"
)
print(df_asof_tolerance_join)

shape: (4, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ time                ┆ stock ┆ trade ┆ quote │
│ ---                 ┆ ---   ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ str   ┆ i64   ┆ i64   │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A     ┆ 101   ┆ 100   │
│ 2020-01-01 09:01:00 ┆ B     ┆ 299   ┆ null  │
│ 2020-01-01 09:03:00 ┆ B     ┆ 301   ┆ 300   │
│ 2020-01-01 09:06:00 ┆ C     ┆ 500   ┆ null  │
└─────────────────────┴───────┴───────┴───────┘