Joins
Join strategies
Polars
supports the following join strategies by specifying the strategy
argument:
inner
left
outer
cross
asof
semi
anti
Inner join
An inner
join produces a DataFrame
that contains only the rows where the join key exists in both DataFrames
. Let's take for example the following two DataFrames
:
df_customers = pl.DataFrame(
{
"customer_id": [1, 2, 3],
"name": ["Alice", "Bob", "Charlie"],
}
)
print(df_customers)
shape: (3, 2)
┌─────────────┬─────────┐
│ customer_id ┆ name │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════════════╪═════════╡
│ 1 ┆ Alice │
│ 2 ┆ Bob │
│ 3 ┆ Charlie │
└─────────────┴─────────┘
df_orders = pl.DataFrame(
{
"order_id": ["a", "b", "c"],
"customer_id": [1, 2, 2],
"amount": [100, 200, 300],
}
)
print(df_orders)
shape: (3, 3)
┌──────────┬─────────────┬────────┐
│ order_id ┆ customer_id ┆ amount │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞══════════╪═════════════╪════════╡
│ a ┆ 1 ┆ 100 │
│ b ┆ 2 ┆ 200 │
│ c ┆ 2 ┆ 300 │
└──────────┴─────────────┴────────┘
To get a DataFrame
with the orders and their associated customer we can do an inner
join on the customer_id
column:
df_inner_customer_join = df_customers.join(df_orders, on="customer_id", how="inner")
print(df_inner_customer_join)
shape: (3, 4)
┌─────────────┬───────┬──────────┬────────┐
│ customer_id ┆ name ┆ order_id ┆ amount │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 │
╞═════════════╪═══════╪══════════╪════════╡
│ 1 ┆ Alice ┆ a ┆ 100 │
│ 2 ┆ Bob ┆ b ┆ 200 │
│ 2 ┆ Bob ┆ c ┆ 300 │
└─────────────┴───────┴──────────┴────────┘
Left join
The left
join produces a DataFrame
that contains all the rows from the left DataFrame
and only the rows from the right DataFrame
where the join key exists in the left DataFrame
. If we now take the example from above and want to have a DataFrame
with all the customers and their associated orders (regardless of whether they have placed an order or not) we can do a left
join:
df_left_join = df_customers.join(df_orders, on="customer_id", how="left")
print(df_left_join)
shape: (4, 4)
┌─────────────┬─────────┬──────────┬────────┐
│ customer_id ┆ name ┆ order_id ┆ amount │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 │
╞═════════════╪═════════╪══════════╪════════╡
│ 1 ┆ Alice ┆ a ┆ 100 │
│ 2 ┆ Bob ┆ b ┆ 200 │
│ 2 ┆ Bob ┆ c ┆ 300 │
│ 3 ┆ Charlie ┆ null ┆ null │
└─────────────┴─────────┴──────────┴────────┘
Notice, that the fields for the customer with the customer_id
of 3
are null, as there are no orders for this customer.
Outer join
The outer
join produces a DataFrame
that contains all the rows from both DataFrames
. Columns are null, if the join key does not exist in the source DataFrame
. Doing an outer
join on the two DataFrames
from above produces a similar DataFrame
to the left
join:
df_outer_join = df_customers.join(df_orders, on="customer_id", how="outer")
print(df_outer_join)
shape: (4, 4)
┌─────────────┬─────────┬──────────┬────────┐
│ customer_id ┆ name ┆ order_id ┆ amount │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 │
╞═════════════╪═════════╪══════════╪════════╡
│ 1 ┆ Alice ┆ a ┆ 100 │
│ 2 ┆ Bob ┆ b ┆ 200 │
│ 2 ┆ Bob ┆ c ┆ 300 │
│ 3 ┆ Charlie ┆ null ┆ null │
└─────────────┴─────────┴──────────┴────────┘
Cross join
A cross
join is a cartesian product of the two DataFrames
. This means that every row in the left DataFrame
is joined with every row in the right DataFrame
. The cross
join is useful for creating a DataFrame
with all possible combinations of the columns in two DataFrames
. Let's take for example the following two DataFrames
.
df_colors = pl.DataFrame(
{
"color": ["red", "blue", "green"],
}
)
print(df_colors)
shape: (3, 1)
┌───────┐
│ color │
│ --- │
│ str │
╞═══════╡
│ red │
│ blue │
│ green │
└───────┘
df_sizes = pl.DataFrame(
{
"size": ["S", "M", "L"],
}
)
print(df_sizes)
shape: (3, 1)
┌──────┐
│ size │
│ --- │
│ str │
╞══════╡
│ S │
│ M │
│ L │
└──────┘
We can now create a DataFrame
containing all possible combinations of the colors and sizes with a cross
join:
df_cross_join = df_colors.join(df_sizes, how="cross")
print(df_cross_join)
shape: (9, 2)
┌───────┬──────┐
│ color ┆ size │
│ --- ┆ --- │
│ str ┆ str │
╞═══════╪══════╡
│ red ┆ S │
│ red ┆ M │
│ red ┆ L │
│ blue ┆ S │
│ blue ┆ M │
│ blue ┆ L │
│ green ┆ S │
│ green ┆ M │
│ green ┆ L │
└───────┴──────┘
The inner
, left
, outer
and cross
join strategies are standard amongst dataframe libraries. We provide more details on the less familiar semi
, anti
and asof
join strategies below.
Semi join
Consider the following scenario: a car rental company has a DataFrame
showing the cars that it owns with each car having a unique id
.
df_cars = pl.DataFrame(
{
"id": ["a", "b", "c"],
"make": ["ford", "toyota", "bmw"],
}
)
print(df_cars)
shape: (3, 2)
┌─────┬────────┐
│ id ┆ make │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪════════╡
│ a ┆ ford │
│ b ┆ toyota │
│ c ┆ bmw │
└─────┴────────┘
The company has another DataFrame
showing each repair job carried out on a vehicle.
df_repairs = pl.DataFrame(
{
"id": ["c", "c"],
"cost": [100, 200],
}
)
print(df_repairs)
shape: (2, 2)
┌─────┬──────┐
│ id ┆ cost │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪══════╡
│ c ┆ 100 │
│ c ┆ 200 │
└─────┴──────┘
You want to answer this question: which of the cars have had repairs carried out?
An inner join does not answer this question directly as it produces a DataFrame
with multiple rows for each car that has had multiple repair jobs:
df_inner_join = df_cars.join(df_repairs, on="id", how="inner")
print(df_inner_join)
shape: (2, 3)
┌─────┬──────┬──────┐
│ id ┆ make ┆ cost │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═════╪══════╪══════╡
│ c ┆ bmw ┆ 100 │
│ c ┆ bmw ┆ 200 │
└─────┴──────┴──────┘
However, a semi join produces a single row for each car that has had a repair job carried out.
df_semi_join = df_cars.join(df_repairs, on="id", how="semi")
print(df_semi_join)
shape: (1, 2)
┌─────┬──────┐
│ id ┆ make │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪══════╡
│ c ┆ bmw │
└─────┴──────┘
Anti join
Continuing this example, an alternative question might be: which of the cars have not had a repair job carried out? An anti join produces a DataFrame
showing all the cars from df_cars
where the id
is not present in the df_repairs
DataFrame
.
df_anti_join = df_cars.join(df_repairs, on="id", how="anti")
print(df_anti_join)
shape: (2, 2)
┌─────┬────────┐
│ id ┆ make │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪════════╡
│ a ┆ ford │
│ b ┆ toyota │
└─────┴────────┘
Asof join
An asof
join is like a left join except that we match on nearest key rather than equal keys.
In Polars
we can do an asof join with the join
method and specifying strategy="asof"
. However, for more flexibility we can use the join_asof
method.
Consider the following scenario: a stock market broker has a DataFrame
called df_trades
showing transactions it has made for different stocks.
df_trades = pl.DataFrame(
{
"time": [
datetime(2020, 1, 1, 9, 1, 0),
datetime(2020, 1, 1, 9, 1, 0),
datetime(2020, 1, 1, 9, 3, 0),
datetime(2020, 1, 1, 9, 6, 0),
],
"stock": ["A", "B", "B", "C"],
"trade": [101, 299, 301, 500],
}
)
print(df_trades)
shape: (4, 3)
┌─────────────────────┬───────┬───────┐
│ time ┆ stock ┆ trade │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 │
╞═════════════════════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A ┆ 101 │
│ 2020-01-01 09:01:00 ┆ B ┆ 299 │
│ 2020-01-01 09:03:00 ┆ B ┆ 301 │
│ 2020-01-01 09:06:00 ┆ C ┆ 500 │
└─────────────────────┴───────┴───────┘
The broker has another DataFrame
called df_quotes
showing prices it has quoted for these stocks.
df_quotes = pl.DataFrame(
{
"time": [
datetime(2020, 1, 1, 9, 0, 0),
datetime(2020, 1, 1, 9, 2, 0),
datetime(2020, 1, 1, 9, 4, 0),
datetime(2020, 1, 1, 9, 6, 0),
],
"stock": ["A", "B", "C", "A"],
"quote": [100, 300, 501, 102],
}
)
print(df_quotes)
shape: (4, 3)
┌─────────────────────┬───────┬───────┐
│ time ┆ stock ┆ quote │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 │
╞═════════════════════╪═══════╪═══════╡
│ 2020-01-01 09:00:00 ┆ A ┆ 100 │
│ 2020-01-01 09:02:00 ┆ B ┆ 300 │
│ 2020-01-01 09:04:00 ┆ C ┆ 501 │
│ 2020-01-01 09:06:00 ┆ A ┆ 102 │
└─────────────────────┴───────┴───────┘
You want to produce a DataFrame
showing for each trade the most recent quote provided before the trade. You do this with join_asof
(using the default strategy = "backward"
).
To avoid joining between trades on one stock with a quote on another you must specify an exact preliminary join on the stock column with by="stock"
.
df_asof_join = df_trades.join_asof(df_quotes, on="time", by="stock")
print(df_asof_join)
shape: (4, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ time ┆ stock ┆ trade ┆ quote │
│ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 ┆ i64 │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A ┆ 101 ┆ 100 │
│ 2020-01-01 09:01:00 ┆ B ┆ 299 ┆ null │
│ 2020-01-01 09:03:00 ┆ B ┆ 301 ┆ 300 │
│ 2020-01-01 09:06:00 ┆ C ┆ 500 ┆ 501 │
└─────────────────────┴───────┴───────┴───────┘
If you want to make sure that only quotes within a certain time range are joined to the trades you can specify the tolerance
argument. In this case we want to make sure that the last preceding quote is within 1 minute of the trade so we set tolerance = "1m"
.
df_asof_tolerance_join = df_trades.join_asof(df_quotes, on="time", by="stock")
print(df_asof_tolerance_join)
shape: (4, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ time ┆ stock ┆ trade ┆ quote │
│ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 ┆ i64 │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A ┆ 101 ┆ 100 │
│ 2020-01-01 09:01:00 ┆ B ┆ 299 ┆ null │
│ 2020-01-01 09:03:00 ┆ B ┆ 301 ┆ 300 │
│ 2020-01-01 09:06:00 ┆ C ┆ 500 ┆ 501 │
└─────────────────────┴───────┴───────┴───────┘