Joins

Join strategies

Polars supports the following join strategies by specifying the strategy argument:

  • inner
  • left
  • outer
  • cross
  • asof
  • semi
  • anti

The inner, left, outer and cross join strategies are standard amongst dataframe libraries. We provide more details on the less familiar semi, anti and asof join strategies below.

Semi join

Consider the following scenario: a car rental company has a DataFrame showing the cars that it owns with each car having a unique id.

df_cars = pl.DataFrame(
    {
        "id": ["a", "b", "c"],
        "make": ["ford", "toyota", "bmw"],
    }
)
print(df_cars)
shape: (3, 2)
┌─────┬────────┐
│ id  ┆ make   │
│ --- ┆ ---    │
│ str ┆ str    │
╞═════╪════════╡
│ a   ┆ ford   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b   ┆ toyota │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ c   ┆ bmw    │
└─────┴────────┘

The company has another DataFrame showing each repair job carried out on a vehicle.

df_repairs = pl.DataFrame(
    {
        "id": ["c", "c"],
        "cost": [100, 200],
    }
)
print(df_repairs)
shape: (2, 2)
┌─────┬──────┐
│ id  ┆ cost │
│ --- ┆ ---  │
│ str ┆ i64  │
╞═════╪══════╡
│ c   ┆ 100  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ c   ┆ 200  │
└─────┴──────┘

You want to answer this question: which of the cars have had repairs carried out?

An inner join does not answer this question directly as it produces a DataFrame with multiple rows for each car that has had multiple repair jobs:

df_inner_join = df_cars.join(df_repairs, on="id", how="inner")
print(df_inner_join)
shape: (2, 3)
┌─────┬──────┬──────┐
│ id  ┆ make ┆ cost │
│ --- ┆ ---  ┆ ---  │
│ str ┆ str  ┆ i64  │
╞═════╪══════╪══════╡
│ c   ┆ bmw  ┆ 100  │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ c   ┆ bmw  ┆ 200  │
└─────┴──────┴──────┘

However, a semi join produces a single row for each car that has had a repair job carried out.

df_semi_join = df_cars.join(df_repairs, on="id", how="semi")
print(df_semi_join)
shape: (1, 2)
┌─────┬──────┐
│ id  ┆ make │
│ --- ┆ ---  │
│ str ┆ str  │
╞═════╪══════╡
│ c   ┆ bmw  │
└─────┴──────┘

Anti join

Continuing this example, an alternative question might be: which of the cars have not had a repair job carried out? An anti join produces a DataFrame showing all the cars from df_cars where the id is not present in the df_repairs DataFrame.

df_anti_join = df_cars.join(df_repairs, on="id", how="anti")
print(df_anti_join)
shape: (2, 2)
┌─────┬────────┐
│ id  ┆ make   │
│ --- ┆ ---    │
│ str ┆ str    │
╞═════╪════════╡
│ a   ┆ ford   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b   ┆ toyota │
└─────┴────────┘

Asof join

An asof join is like a left join except that we match on nearest key rather than equal keys. In Polars we can do an asof join with the join method and specifying strategy="asof". However, for more flexibility we can use the join_asof method.

Consider the following scenario: a stock market broker has a DataFrame called df_trades showing transactions it has made for different stocks.

df_trades = pl.DataFrame(
    {
        "time": [
            datetime(2020, 1, 1, 9, 1, 0),
            datetime(2020, 1, 1, 9, 1, 0),
            datetime(2020, 1, 1, 9, 3, 0),
            datetime(2020, 1, 1, 9, 6, 0),
        ],
        "stock": ["A", "B", "B", "C"],
        "trade": [101, 299, 301, 500],
    }
)
print(df_trades)
shape: (4, 3)
┌─────────────────────┬───────┬───────┐
│ time                ┆ stock ┆ trade │
│ ---                 ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ str   ┆ i64   │
╞═════════════════════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A     ┆ 101   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:01:00 ┆ B     ┆ 299   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:03:00 ┆ B     ┆ 301   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:06:00 ┆ C     ┆ 500   │
└─────────────────────┴───────┴───────┘

The broker has another DataFrame called df_quotes showing prices it has quoted for these stocks.

df_quotes = pl.DataFrame(
    {
        "time": [
            datetime(2020, 1, 1, 9, 0, 0),
            datetime(2020, 1, 1, 9, 2, 0),
            datetime(2020, 1, 1, 9, 4, 0),
            datetime(2020, 1, 1, 9, 6, 0),
        ],
        "stock": ["A", "B", "C", "A"],
        "quote": [100, 300, 501, 102],
    }
)
print(df_quotes)
shape: (4, 3)
┌─────────────────────┬───────┬───────┐
│ time                ┆ stock ┆ quote │
│ ---                 ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ str   ┆ i64   │
╞═════════════════════╪═══════╪═══════╡
│ 2020-01-01 09:00:00 ┆ A     ┆ 100   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:02:00 ┆ B     ┆ 300   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:04:00 ┆ C     ┆ 501   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:06:00 ┆ A     ┆ 102   │
└─────────────────────┴───────┴───────┘

You want to produce a DataFrame showing for each trade the most recent quote provided before the trade. You do this with join_asof (using the default strategy = "backward"). To avoid joining between trades on one stock with a quote on another you must specify an exact preliminary join on the stock column with by="stock".

df_asof_join = df_trades.join_asof(df_quotes, on="time", by="stock")
print(df_asof_join)
shape: (4, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ time                ┆ stock ┆ trade ┆ quote │
│ ---                 ┆ ---   ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ str   ┆ i64   ┆ i64   │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A     ┆ 101   ┆ 100   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:01:00 ┆ B     ┆ 299   ┆ null  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:03:00 ┆ B     ┆ 301   ┆ 300   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:06:00 ┆ C     ┆ 500   ┆ 501   │
└─────────────────────┴───────┴───────┴───────┘

If you want to make sure that only quotes within a certain time range are joined to the trades you can specify the tolerance argument. In this case we want to make sure that the last preceding quote is within 1 minute of the trade so we set tolerance = "1m".


print(df_asof_tolerance_join)
shape: (4, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ time                ┆ stock ┆ trade ┆ quote │
│ ---                 ┆ ---   ┆ ---   ┆ ---   │
│ datetime[μs]        ┆ str   ┆ i64   ┆ i64   │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A     ┆ 101   ┆ 100   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:01:00 ┆ B     ┆ 299   ┆ null  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:03:00 ┆ B     ┆ 301   ┆ 300   │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:06:00 ┆ C     ┆ 500   ┆ null  │
└─────────────────────┴───────┴───────┴───────┘