Joins
Join strategies
Polars
supports the following join strategies by specifying the strategy
argument:
Strategy | Description |
---|---|
inner |
Returns row with matching keys in both frames. Non-matching rows in either the left or right frame are discarded. |
left |
Returns all rows in the left dataframe, whether or not a match in the right-frame is found. Non-matching rows have their right columns null-filled. |
outer |
Returns all rows from both the left and right dataframe. If no match is found in one frame, columns from the other frame are null-filled. |
cross |
Returns the Cartesian product of all rows from the left frame with all rows from the right frame. Duplicates rows are retained; the table length of A cross-joined with B is always len(A) × len(B) . |
asof |
A left-join in which the match is performed on the nearest key rather than on equal keys. |
semi |
Returns all rows from the left frame in which the join key is also present in the right frame. |
anti |
Returns all rows from the left frame in which the join key is not present in the right frame. |
Inner join
An inner
join produces a DataFrame
that contains only the rows where the join key exists in both DataFrames
. Let's take for example the following two DataFrames
:
shape: (3, 2)
┌─────────────┬─────────┐
│ customer_id ┆ name │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════════════╪═════════╡
│ 1 ┆ Alice │
│ 2 ┆ Bob │
│ 3 ┆ Charlie │
└─────────────┴─────────┘
shape: (3, 3)
┌──────────┬─────────────┬────────┐
│ order_id ┆ customer_id ┆ amount │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞══════════╪═════════════╪════════╡
│ a ┆ 1 ┆ 100 │
│ b ┆ 2 ┆ 200 │
│ c ┆ 2 ┆ 300 │
└──────────┴─────────────┴────────┘
To get a DataFrame
with the orders and their associated customer we can do an inner
join on the customer_id
column:
df_inner_customer_join = df_customers.join(df_orders, on="customer_id", how="inner")
print(df_inner_customer_join)
let df_inner_customer_join = df_customers
.clone()
.lazy()
.join(
df_orders.clone().lazy(),
[col("customer_id")],
[col("customer_id")],
JoinArgs::new(JoinType::Inner),
)
.collect()?;
println!("{}", &df_inner_customer_join);
shape: (3, 4)
┌─────────────┬───────┬──────────┬────────┐
│ customer_id ┆ name ┆ order_id ┆ amount │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 │
╞═════════════╪═══════╪══════════╪════════╡
│ 1 ┆ Alice ┆ a ┆ 100 │
│ 2 ┆ Bob ┆ b ┆ 200 │
│ 2 ┆ Bob ┆ c ┆ 300 │
└─────────────┴───────┴──────────┴────────┘
Left join
The left
join produces a DataFrame
that contains all the rows from the left DataFrame
and only the rows from the right DataFrame
where the join key exists in the left DataFrame
. If we now take the example from above and want to have a DataFrame
with all the customers and their associated orders (regardless of whether they have placed an order or not) we can do a left
join:
df_left_join = df_customers.join(df_orders, on="customer_id", how="left")
print(df_left_join)
let df_left_join = df_customers
.clone()
.lazy()
.join(
df_orders.clone().lazy(),
[col("customer_id")],
[col("customer_id")],
JoinArgs::new(JoinType::Left),
)
.collect()?;
println!("{}", &df_left_join);
shape: (4, 4)
┌─────────────┬─────────┬──────────┬────────┐
│ customer_id ┆ name ┆ order_id ┆ amount │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 │
╞═════════════╪═════════╪══════════╪════════╡
│ 1 ┆ Alice ┆ a ┆ 100 │
│ 2 ┆ Bob ┆ b ┆ 200 │
│ 2 ┆ Bob ┆ c ┆ 300 │
│ 3 ┆ Charlie ┆ null ┆ null │
└─────────────┴─────────┴──────────┴────────┘
Notice, that the fields for the customer with the customer_id
of 3
are null, as there are no orders for this customer.
Outer join
The outer
join produces a DataFrame
that contains all the rows from both DataFrames
. Columns are null, if the join key does not exist in the source DataFrame
. Doing an outer
join on the two DataFrames
from above produces a similar DataFrame
to the left
join:
df_outer_join = df_customers.join(df_orders, on="customer_id", how="outer")
print(df_outer_join)
let df_outer_join = df_customers
.clone()
.lazy()
.join(
df_orders.clone().lazy(),
[col("customer_id")],
[col("customer_id")],
JoinArgs::new(JoinType::Outer),
)
.collect()?;
println!("{}", &df_outer_join);
shape: (4, 4)
┌─────────────┬─────────┬──────────┬────────┐
│ customer_id ┆ name ┆ order_id ┆ amount │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 │
╞═════════════╪═════════╪══════════╪════════╡
│ 1 ┆ Alice ┆ a ┆ 100 │
│ 2 ┆ Bob ┆ b ┆ 200 │
│ 2 ┆ Bob ┆ c ┆ 300 │
│ 3 ┆ Charlie ┆ null ┆ null │
└─────────────┴─────────┴──────────┴────────┘
Cross join
A cross
join is a cartesian product of the two DataFrames
. This means that every row in the left DataFrame
is joined with every row in the right DataFrame
. The cross
join is useful for creating a DataFrame
with all possible combinations of the columns in two DataFrames
. Let's take for example the following two DataFrames
.
shape: (3, 1)
┌───────┐
│ color │
│ --- │
│ str │
╞═══════╡
│ red │
│ blue │
│ green │
└───────┘
shape: (3, 1)
┌──────┐
│ size │
│ --- │
│ str │
╞══════╡
│ S │
│ M │
│ L │
└──────┘
We can now create a DataFrame
containing all possible combinations of the colors and sizes with a cross
join:
shape: (9, 2)
┌───────┬──────┐
│ color ┆ size │
│ --- ┆ --- │
│ str ┆ str │
╞═══════╪══════╡
│ red ┆ S │
│ red ┆ M │
│ red ┆ L │
│ blue ┆ S │
│ blue ┆ M │
│ blue ┆ L │
│ green ┆ S │
│ green ┆ M │
│ green ┆ L │
└───────┴──────┘
The inner
, left
, outer
and cross
join strategies are standard amongst dataframe libraries. We provide more details on the less familiar semi
, anti
and asof
join strategies below.
Semi join
The semi
join returns all rows from the left frame in which the join key is also present in the right frame. Consider the following scenario: a car rental company has a DataFrame
showing the cars that it owns with each car having a unique id
.
shape: (3, 2)
┌─────┬────────┐
│ id ┆ make │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪════════╡
│ a ┆ ford │
│ b ┆ toyota │
│ c ┆ bmw │
└─────┴────────┘
The company has another DataFrame
showing each repair job carried out on a vehicle.
shape: (2, 2)
┌─────┬──────┐
│ id ┆ cost │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪══════╡
│ c ┆ 100 │
│ c ┆ 200 │
└─────┴──────┘
You want to answer this question: which of the cars have had repairs carried out?
An inner join does not answer this question directly as it produces a DataFrame
with multiple rows for each car that has had multiple repair jobs:
shape: (2, 3)
┌─────┬──────┬──────┐
│ id ┆ make ┆ cost │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═════╪══════╪══════╡
│ c ┆ bmw ┆ 100 │
│ c ┆ bmw ┆ 200 │
└─────┴──────┴──────┘
However, a semi join produces a single row for each car that has had a repair job carried out.
shape: (1, 2)
┌─────┬──────┐
│ id ┆ make │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪══════╡
│ c ┆ bmw │
└─────┴──────┘
Anti join
Continuing this example, an alternative question might be: which of the cars have not had a repair job carried out? An anti join produces a DataFrame
showing all the cars from df_cars
where the id
is not present in the df_repairs
DataFrame
.
shape: (2, 2)
┌─────┬────────┐
│ id ┆ make │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪════════╡
│ a ┆ ford │
│ b ┆ toyota │
└─────┴────────┘
Asof join
An asof
join is like a left join except that we match on nearest key rather than equal keys.
In Polars
we can do an asof join with the join
method and specifying strategy="asof"
. However, for more flexibility we can use the join_asof
method.
Consider the following scenario: a stock market broker has a DataFrame
called df_trades
showing transactions it has made for different stocks.
df_trades = pl.DataFrame(
{
"time": [
datetime(2020, 1, 1, 9, 1, 0),
datetime(2020, 1, 1, 9, 1, 0),
datetime(2020, 1, 1, 9, 3, 0),
datetime(2020, 1, 1, 9, 6, 0),
],
"stock": ["A", "B", "B", "C"],
"trade": [101, 299, 301, 500],
}
)
print(df_trades)
use chrono::prelude::*;
let df_trades = df!(
"time"=> &[
NaiveDate::from_ymd_opt(2020, 1, 1).unwrap().and_hms_opt(9, 1, 0).unwrap(),
NaiveDate::from_ymd_opt(2020, 1, 1).unwrap().and_hms_opt(9, 1, 0).unwrap(),
NaiveDate::from_ymd_opt(2020, 1, 1).unwrap().and_hms_opt(9, 3, 0).unwrap(),
NaiveDate::from_ymd_opt(2020, 1, 1).unwrap().and_hms_opt(9, 6, 0).unwrap(),
],
"stock"=> &["A", "B", "B", "C"],
"trade"=> &[101, 299, 301, 500],
)?;
println!("{}", &df_trades);
shape: (4, 3)
┌─────────────────────┬───────┬───────┐
│ time ┆ stock ┆ trade │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 │
╞═════════════════════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A ┆ 101 │
│ 2020-01-01 09:01:00 ┆ B ┆ 299 │
│ 2020-01-01 09:03:00 ┆ B ┆ 301 │
│ 2020-01-01 09:06:00 ┆ C ┆ 500 │
└─────────────────────┴───────┴───────┘
The broker has another DataFrame
called df_quotes
showing prices it has quoted for these stocks.
df_quotes = pl.DataFrame(
{
"time": [
datetime(2020, 1, 1, 9, 0, 0),
datetime(2020, 1, 1, 9, 2, 0),
datetime(2020, 1, 1, 9, 4, 0),
datetime(2020, 1, 1, 9, 6, 0),
],
"stock": ["A", "B", "C", "A"],
"quote": [100, 300, 501, 102],
}
)
print(df_quotes)
let df_quotes = df!(
"time"=> &[
NaiveDate::from_ymd_opt(2020, 1, 1).unwrap().and_hms_opt(9, 0, 0).unwrap(),
NaiveDate::from_ymd_opt(2020, 1, 1).unwrap().and_hms_opt(9, 2, 0).unwrap(),
NaiveDate::from_ymd_opt(2020, 1, 1).unwrap().and_hms_opt(9, 4, 0).unwrap(),
NaiveDate::from_ymd_opt(2020, 1, 1).unwrap().and_hms_opt(9, 6, 0).unwrap(),
],
"stock"=> &["A", "B", "C", "A"],
"quote"=> &[100, 300, 501, 102],
)?;
println!("{}", &df_quotes);
shape: (4, 3)
┌─────────────────────┬───────┬───────┐
│ time ┆ stock ┆ quote │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 │
╞═════════════════════╪═══════╪═══════╡
│ 2020-01-01 09:00:00 ┆ A ┆ 100 │
│ 2020-01-01 09:02:00 ┆ B ┆ 300 │
│ 2020-01-01 09:04:00 ┆ C ┆ 501 │
│ 2020-01-01 09:06:00 ┆ A ┆ 102 │
└─────────────────────┴───────┴───────┘
You want to produce a DataFrame
showing for each trade the most recent quote provided before the trade. You do this with join_asof
(using the default strategy = "backward"
).
To avoid joining between trades on one stock with a quote on another you must specify an exact preliminary join on the stock column with by="stock"
.
df_asof_join = df_trades.join_asof(df_quotes, on="time", by="stock")
print(df_asof_join)
let df_asof_join = df_trades.join_asof_by(
&df_quotes,
"time",
"time",
["stock"],
["stock"],
AsofStrategy::Backward,
None,
)?;
println!("{}", &df_asof_join);
shape: (4, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ time ┆ stock ┆ trade ┆ quote │
│ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 ┆ i64 │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A ┆ 101 ┆ 100 │
│ 2020-01-01 09:01:00 ┆ B ┆ 299 ┆ null │
│ 2020-01-01 09:03:00 ┆ B ┆ 301 ┆ 300 │
│ 2020-01-01 09:06:00 ┆ C ┆ 500 ┆ 501 │
└─────────────────────┴───────┴───────┴───────┘
If you want to make sure that only quotes within a certain time range are joined to the trades you can specify the tolerance
argument. In this case we want to make sure that the last preceding quote is within 1 minute of the trade so we set tolerance = "1m"
.
df_asof_tolerance_join = df_trades.join_asof(
df_quotes, on="time", by="stock", tolerance="1m"
)
print(df_asof_tolerance_join)
shape: (4, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ time ┆ stock ┆ trade ┆ quote │
│ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ str ┆ i64 ┆ i64 │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 09:01:00 ┆ A ┆ 101 ┆ 100 │
│ 2020-01-01 09:01:00 ┆ B ┆ 299 ┆ null │
│ 2020-01-01 09:03:00 ┆ B ┆ 301 ┆ 300 │
│ 2020-01-01 09:06:00 ┆ C ┆ 500 ┆ null │
└─────────────────────┴───────┴───────┴───────┘