Joins
Join strategies
Polars
supports the following join strategies by specifying the strategy
argument:
inner
left
outer
cross
asof
semi
anti
The inner
, left
, outer
and cross
join strategies are standard amongst dataframe libraries. We provide more details on the less familiar semi
, anti
and asof
join strategies below.
Semi join
Consider the following scenario: a car rental company has a DataFrame
showing the cars that it owns with each car having a unique id
.
df_cars = pl.DataFrame(
{
"id": ["a", "b", "c"],
"make": ["ford", "toyota", "bmw"],
}
)
print(df_cars)
shape: (3, 2)
┌─────┬────────┐
│ id ┆ make │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪════════╡
│ a ┆ ford │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b ┆ toyota │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ c ┆ bmw │
└─────┴────────┘
The company has another DataFrame
showing each repair job carried out on a vehicle.
df_repairs = pl.DataFrame(
{
"id": ["c", "c"],
"cost": [100, 200],
}
)
print(df_repairs)
shape: (2, 2)
┌─────┬──────┐
│ id ┆ cost │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪══════╡
│ c ┆ 100 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ c ┆ 200 │
└─────┴──────┘
You want to answer this question: which of the cars have had repairs carried out?
An inner join does not answer this question directly as it produces a DataFrame
with multiple rows for each car that has had multiple repair jobs:
df_inner_join = df_cars.join(df_repairs, on="id", how="inner")
print(df_inner_join)
shape: (2, 3)
┌─────┬──────┬──────┐
│ id ┆ make ┆ cost │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═════╪══════╪══════╡
│ c ┆ bmw ┆ 100 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ c ┆ bmw ┆ 200 │
└─────┴──────┴──────┘
However, a semi join produces a single row for each car that has had a repair job carried out.
df_semi_join = df_cars.join(df_repairs, on="id", how="semi")
print(df_semi_join)
shape: (1, 2)
┌─────┬──────┐
│ id ┆ make │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪══════╡
│ c ┆ bmw │
└─────┴──────┘
Anti join
Continuing this example, an alternative question might be: which of the cars have not had a repair job carried out? An anti join produces a DataFrame
showing all the cars from df_cars
where the id
is not present in the df_repairs
DataFrame
.
df_anti_join = df_cars.join(df_repairs, on="id", how="anti")
print(df_anti_join)
shape: (2, 2)
┌─────┬────────┐
│ id ┆ make │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪════════╡
│ a ┆ ford │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b ┆ toyota │
└─────┴────────┘
Asof join
An asof
join is like a left join except that we match on nearest key rather than equal keys.
In Polars
we can do an asof join with the join
method and specifying strategy="asof"
. However, for more flexibility we can use the join_asof
method.
Consider the following scenario: a stock market broker has a DataFrame
called df_trades
showing transactions it has made for different stocks.
df_trades = pl.DataFrame(
{
"time": [
datetime(2020, 1, 1, 9, 1, 0),
datetime(2020, 1, 1, 9, 1, 0),
datetime(2020, 1, 1, 9, 3, 0),
datetime(2020, 1, 1, 9, 6, 0),
],
"stock": ["A", "B", "B", "C"],
"trade": [101, 299, 301, 500],
}
)
print(df_trades)
shape: (4, 3)
┌─────────────────────┬───────┬───────┐
│ time ┆ stock ┆ trade │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 │
╞═════════════════════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A ┆ 101 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:01:00 ┆ B ┆ 299 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:03:00 ┆ B ┆ 301 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:06:00 ┆ C ┆ 500 │
└─────────────────────┴───────┴───────┘
The broker has another DataFrame
called df_quotes
showing prices it has quoted for these stocks.
df_quotes = pl.DataFrame(
{
"time": [
datetime(2020, 1, 1, 9, 0, 0),
datetime(2020, 1, 1, 9, 2, 0),
datetime(2020, 1, 1, 9, 4, 0),
datetime(2020, 1, 1, 9, 6, 0),
],
"stock": ["A", "B", "C", "A"],
"quote": [100, 300, 501, 102],
}
)
print(df_quotes)
shape: (4, 3)
┌─────────────────────┬───────┬───────┐
│ time ┆ stock ┆ quote │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 │
╞═════════════════════╪═══════╪═══════╡
│ 2020-01-01 09:00:00 ┆ A ┆ 100 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:02:00 ┆ B ┆ 300 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:04:00 ┆ C ┆ 501 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:06:00 ┆ A ┆ 102 │
└─────────────────────┴───────┴───────┘
You want to produce a DataFrame
showing for each trade the most recent quote provided before the trade. You do this with join_asof
(using the default strategy = "backward"
).
To avoid joining between trades on one stock with a quote on another you must specify an exact preliminary join on the stock column with by="stock"
.
df_asof_join = df_trades.join_asof(df_quotes, on="time", by="stock")
print(df_asof_join)
shape: (4, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ time ┆ stock ┆ trade ┆ quote │
│ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 ┆ i64 │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A ┆ 101 ┆ 100 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:01:00 ┆ B ┆ 299 ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:03:00 ┆ B ┆ 301 ┆ 300 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:06:00 ┆ C ┆ 500 ┆ 501 │
└─────────────────────┴───────┴───────┴───────┘
If you want to make sure that only quotes within a certain time range are joined to the trades you can specify the tolerance
argument. In this case we want to make sure that the last preceding quote is within 1 minute of the trade so we set tolerance = "1m"
.
print(df_asof_tolerance_join)
shape: (4, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ time ┆ stock ┆ trade ┆ quote │
│ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 ┆ i64 │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A ┆ 101 ┆ 100 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:01:00 ┆ B ┆ 299 ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:03:00 ┆ B ┆ 301 ┆ 300 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 09:06:00 ┆ C ┆ 500 ┆ null │
└─────────────────────┴───────┴───────┴───────┘