Skip to content

Joins

Join strategies

Polars supports the following join strategies by specifying the strategy argument:

  • inner
  • left
  • outer
  • cross
  • asof
  • semi
  • anti

Inner join

An inner join produces a DataFrame that contains only the rows where the join key exists in both DataFrames. Let's take for example the following two DataFrames:

DataFrame

df_customers = pl.DataFrame(
    {
        "customer_id": [1, 2, 3],
        "name": ["Alice", "Bob", "Charlie"],
    }
)
print(df_customers)

shape: (3, 2)
┌─────────────┬─────────┐
│ customer_id ┆ name    │
│ ---         ┆ ---     │
│ i64         ┆ str     │
╞═════════════╪═════════╡
│ 1           ┆ Alice   │
│ 2           ┆ Bob     │
│ 3           ┆ Charlie │
└─────────────┴─────────┘

DataFrame

df_orders = pl.DataFrame(
    {
        "order_id": ["a", "b", "c"],
        "customer_id": [1, 2, 2],
        "amount": [100, 200, 300],
    }
)
print(df_orders)

shape: (3, 3)
┌──────────┬─────────────┬────────┐
│ order_id ┆ customer_id ┆ amount │
│ ---      ┆ ---         ┆ ---    │
│ str      ┆ i64         ┆ i64    │
╞══════════╪═════════════╪════════╡
│ a        ┆ 1           ┆ 100    │
│ b        ┆ 2           ┆ 200    │
│ c        ┆ 2           ┆ 300    │
└──────────┴─────────────┴────────┘

To get a DataFrame with the orders and their associated customer we can do an inner join on the customer_id column:

join

df_inner_customer_join = df_customers.join(df_orders, on="customer_id", how="inner")
print(df_inner_customer_join)

shape: (3, 4)
┌─────────────┬───────┬──────────┬────────┐
│ customer_id ┆ name  ┆ order_id ┆ amount │
│ ---         ┆ ---   ┆ ---      ┆ ---    │
│ i64         ┆ str   ┆ str      ┆ i64    │
╞═════════════╪═══════╪══════════╪════════╡
│ 1           ┆ Alice ┆ a        ┆ 100    │
│ 2           ┆ Bob   ┆ b        ┆ 200    │
│ 2           ┆ Bob   ┆ c        ┆ 300    │
└─────────────┴───────┴──────────┴────────┘

Left join

The left join produces a DataFrame that contains all the rows from the left DataFrame and only the rows from the right DataFrame where the join key exists in the left DataFrame. If we now take the example from above and want to have a DataFrame with all the customers and their associated orders (regardless of whether they have placed an order or not) we can do a left join:

join

df_left_join = df_customers.join(df_orders, on="customer_id", how="left")
print(df_left_join)

shape: (4, 4)
┌─────────────┬─────────┬──────────┬────────┐
│ customer_id ┆ name    ┆ order_id ┆ amount │
│ ---         ┆ ---     ┆ ---      ┆ ---    │
│ i64         ┆ str     ┆ str      ┆ i64    │
╞═════════════╪═════════╪══════════╪════════╡
│ 1           ┆ Alice   ┆ a        ┆ 100    │
│ 2           ┆ Bob     ┆ b        ┆ 200    │
│ 2           ┆ Bob     ┆ c        ┆ 300    │
│ 3           ┆ Charlie ┆ null     ┆ null   │
└─────────────┴─────────┴──────────┴────────┘

Notice, that the fields for the customer with the customer_id of 3 are null, as there are no orders for this customer.

Outer join

The outer join produces a DataFrame that contains all the rows from both DataFrames. Columns are null, if the join key does not exist in the source DataFrame. Doing an outer join on the two DataFrames from above produces a similar DataFrame to the left join:

join

df_outer_join = df_customers.join(df_orders, on="customer_id", how="outer")
print(df_outer_join)

shape: (4, 4)
┌─────────────┬─────────┬──────────┬────────┐
│ customer_id ┆ name    ┆ order_id ┆ amount │
│ ---         ┆ ---     ┆ ---      ┆ ---    │
│ i64         ┆ str     ┆ str      ┆ i64    │
╞═════════════╪═════════╪══════════╪════════╡
│ 1           ┆ Alice   ┆ a        ┆ 100    │
│ 2           ┆ Bob     ┆ b        ┆ 200    │
│ 2           ┆ Bob     ┆ c        ┆ 300    │
│ 3           ┆ Charlie ┆ null     ┆ null   │
└─────────────┴─────────┴──────────┴────────┘

Cross join

A cross join is a cartesian product of the two DataFrames. This means that every row in the left DataFrame is joined with every row in the right DataFrame. The cross join is useful for creating a DataFrame with all possible combinations of the columns in two DataFrames. Let's take for example the following two DataFrames.

DataFrame

df_colors = pl.DataFrame(
    {
        "color": ["red", "blue", "green"],
    }
)
print(df_colors)

shape: (3, 1)
┌───────┐
│ color │
│ ---   │
│ str   │
╞═══════╡
│ red   │
│ blue  │
│ green │
└───────┘

DataFrame

df_sizes = pl.DataFrame(
    {
        "size": ["S", "M", "L"],
    }
)
print(df_sizes)

shape: (3, 1)
┌──────┐
│ size │
│ ---  │
│ str  │
╞══════╡
│ S    │
│ M    │
│ L    │
└──────┘

We can now create a DataFrame containing all possible combinations of the colors and sizes with a cross join:

join

df_cross_join = df_colors.join(df_sizes, how="cross")
print(df_cross_join)

shape: (9, 2)
┌───────┬──────┐
│ color ┆ size │
│ ---   ┆ ---  │
│ str   ┆ str  │
╞═══════╪══════╡
│ red   ┆ S    │
│ red   ┆ M    │
│ red   ┆ L    │
│ blue  ┆ S    │
│ blue  ┆ M    │
│ blue  ┆ L    │
│ green ┆ S    │
│ green ┆ M    │
│ green ┆ L    │
└───────┴──────┘


The inner, left, outer and cross join strategies are standard amongst dataframe libraries. We provide more details on the less familiar semi, anti and asof join strategies below.

Semi join

Consider the following scenario: a car rental company has a DataFrame showing the cars that it owns with each car having a unique id.

DataFrame

df_cars = pl.DataFrame(
    {
        "id": ["a", "b", "c"],
        "make": ["ford", "toyota", "bmw"],
    }
)
print(df_cars)

shape: (3, 2)
┌─────┬────────┐
│ id  ┆ make   │
│ --- ┆ ---    │
│ str ┆ str    │
╞═════╪════════╡
│ a   ┆ ford   │
│ b   ┆ toyota │
│ c   ┆ bmw    │
└─────┴────────┘

The company has another DataFrame showing each repair job carried out on a vehicle.

DataFrame

df_repairs = pl.DataFrame(
    {
        "id": ["c", "c"],
        "cost": [100, 200],
    }
)
print(df_repairs)

shape: (2, 2)
┌─────┬──────┐
│ id  ┆ cost │
│ --- ┆ ---  │
│ str ┆ i64  │
╞═════╪══════╡
│ c   ┆ 100  │
│ c   ┆ 200  │
└─────┴──────┘

You want to answer this question: which of the cars have had repairs carried out?

An inner join does not answer this question directly as it produces a DataFrame with multiple rows for each car that has had multiple repair jobs:

join

df_inner_join = df_cars.join(df_repairs, on="id", how="inner")
print(df_inner_join)

shape: (2, 3)
┌─────┬──────┬──────┐
│ id  ┆ make ┆ cost │
│ --- ┆ ---  ┆ ---  │
│ str ┆ str  ┆ i64  │
╞═════╪══════╪══════╡
│ c   ┆ bmw  ┆ 100  │
│ c   ┆ bmw  ┆ 200  │
└─────┴──────┴──────┘

However, a semi join produces a single row for each car that has had a repair job carried out.

join

df_semi_join = df_cars.join(df_repairs, on="id", how="semi")
print(df_semi_join)

shape: (1, 2)
┌─────┬──────┐
│ id  ┆ make │
│ --- ┆ ---  │
│ str ┆ str  │
╞═════╪══════╡
│ c   ┆ bmw  │
└─────┴──────┘

Anti join

Continuing this example, an alternative question might be: which of the cars have not had a repair job carried out? An anti join produces a DataFrame showing all the cars from df_cars where the id is not present in the df_repairs DataFrame.

join

df_anti_join = df_cars.join(df_repairs, on="id", how="anti")
print(df_anti_join)

shape: (2, 2)
┌─────┬────────┐
│ id  ┆ make   │
│ --- ┆ ---    │
│ str ┆ str    │
╞═════╪════════╡
│ a   ┆ ford   │
│ b   ┆ toyota │
└─────┴────────┘

Asof join

An asof join is like a left join except that we match on nearest key rather than equal keys. In Polars we can do an asof join with the join method and specifying strategy="asof". However, for more flexibility we can use the join_asof method.

Consider the following scenario: a stock market broker has a DataFrame called df_trades showing transactions it has made for different stocks.

DataFrame

df_trades = pl.DataFrame(
    {
        "time": [
            datetime(2020, 1, 1, 9, 1, 0),
            datetime(2020, 1, 1, 9, 1, 0),
            datetime(2020, 1, 1, 9, 3, 0),
            datetime(2020, 1, 1, 9, 6, 0),
        ],
        "stock": ["A", "B", "B", "C"],
        "trade": [101, 299, 301, 500],
    }
)
print(df_trades)

shape: (4, 3)
┌─────────────────────┬───────┬───────┐
│ time                ┆ stock ┆ trade │
│ ---                 ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ str   ┆ i64   │
╞═════════════════════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A     ┆ 101   │
│ 2020-01-01 09:01:00 ┆ B     ┆ 299   │
│ 2020-01-01 09:03:00 ┆ B     ┆ 301   │
│ 2020-01-01 09:06:00 ┆ C     ┆ 500   │
└─────────────────────┴───────┴───────┘

The broker has another DataFrame called df_quotes showing prices it has quoted for these stocks.

DataFrame

df_quotes = pl.DataFrame(
    {
        "time": [
            datetime(2020, 1, 1, 9, 0, 0),
            datetime(2020, 1, 1, 9, 2, 0),
            datetime(2020, 1, 1, 9, 4, 0),
            datetime(2020, 1, 1, 9, 6, 0),
        ],
        "stock": ["A", "B", "C", "A"],
        "quote": [100, 300, 501, 102],
    }
)

print(df_quotes)

shape: (4, 3)
┌─────────────────────┬───────┬───────┐
│ time                ┆ stock ┆ quote │
│ ---                 ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ str   ┆ i64   │
╞═════════════════════╪═══════╪═══════╡
│ 2020-01-01 09:00:00 ┆ A     ┆ 100   │
│ 2020-01-01 09:02:00 ┆ B     ┆ 300   │
│ 2020-01-01 09:04:00 ┆ C     ┆ 501   │
│ 2020-01-01 09:06:00 ┆ A     ┆ 102   │
└─────────────────────┴───────┴───────┘

You want to produce a DataFrame showing for each trade the most recent quote provided before the trade. You do this with join_asof (using the default strategy = "backward"). To avoid joining between trades on one stock with a quote on another you must specify an exact preliminary join on the stock column with by="stock".

join_asof

df_asof_join = df_trades.join_asof(df_quotes, on="time", by="stock")
print(df_asof_join)

shape: (4, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ time                ┆ stock ┆ trade ┆ quote │
│ ---                 ┆ ---   ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ str   ┆ i64   ┆ i64   │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A     ┆ 101   ┆ 100   │
│ 2020-01-01 09:01:00 ┆ B     ┆ 299   ┆ null  │
│ 2020-01-01 09:03:00 ┆ B     ┆ 301   ┆ 300   │
│ 2020-01-01 09:06:00 ┆ C     ┆ 500   ┆ 501   │
└─────────────────────┴───────┴───────┴───────┘

If you want to make sure that only quotes within a certain time range are joined to the trades you can specify the tolerance argument. In this case we want to make sure that the last preceding quote is within 1 minute of the trade so we set tolerance = "1m".

df_asof_tolerance_join = df_trades.join_asof(df_quotes, on="time", by="stock")
print(df_asof_tolerance_join)
shape: (4, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ time                ┆ stock ┆ trade ┆ quote │
│ ---                 ┆ ---   ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ str   ┆ i64   ┆ i64   │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A     ┆ 101   ┆ 100   │
│ 2020-01-01 09:01:00 ┆ B     ┆ 299   ┆ null  │
│ 2020-01-01 09:03:00 ┆ B     ┆ 301   ┆ 300   │
│ 2020-01-01 09:06:00 ┆ C     ┆ 500   ┆ 501   │
└─────────────────────┴───────┴───────┴───────┘