Projection pushdown
Let's expand our query from the previous section by joining the result of the FILTER
operation with the Runescape data, to find popular Reddit usernames starting with
an "a" that also played Runescape. That must be something we are all interested in!
The query would look like this:
import polars as pl
from ..paths import DATA_DIR

# lazily scan the Reddit data and keep users with positive karma
# whose name starts with an "a"
reddit = (
    pl.scan_csv(f"{DATA_DIR}/reddit.csv")
    .filter(pl.col("comment_karma") > 0)
    .filter(pl.col("link_karma") > 0)
    .filter(pl.col("name").str.contains(r"^a"))
)

# the Runescape file has no header; its single column holds the username
runescape = pl.scan_csv(f"{DATA_DIR}/runescape.csv", has_header=False).select(
    pl.col("column_1").alias("name")
)

dataset = reddit.join(runescape, on="name", how="inner").select(
    ["name", "comment_karma", "link_karma"]
)

df1 = dataset.fetch(int(1e7))
df2 = dataset.fetch(int(1e7), predicate_pushdown=True, projection_pushdown=True)
And yields the following DataFrame.
shape: (0, 3)
┌──────┬───────────────┬────────────┐
│ name ┆ comment_karma ┆ link_karma │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞══════╪═══════════════╪════════════╡
└──────┴───────────────┴────────────┘
Break it down
Again, let's take a look at the query plan.
dataset.show_graph(optimized=False)
Now we are focused on the projections, indicated with π. The first node shows π 3/6,
meaning that we select 3 out of the 6 columns in the DataFrame. If we look at the CSV
scans, we see a wildcard π */6 and π */1, meaning that we select all 6 columns of the
Reddit dataset and the one and only column of the Runescape dataset, respectively.
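If Graphviz is not available, the same plans can be inspected as plain text. A minimal sketch, assuming a Polars version that exposes explain (older versions used describe_plan and describe_optimized_plan instead):
# print the logical plan before any optimization
print(dataset.explain(optimized=False))
# print the plan after optimizations such as projection pushdown
print(dataset.explain(optimized=True))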
This query is not very optimal. We select all columns from both datasets, yet only
show 3 of the 6 after the join. That means some columns were computed during the
join operation that could have been ignored. Columns were also parsed during CSV
scanning, only to be dropped at the end. When we are dealing with DataFrames with a
large number of columns, this redundant work can be huge.
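To see what avoiding that work looks like, here is a sketch of pushing the projection down by hand, reusing the names and assumed paths from the query above. This is roughly the rewrite the optimizer performs for us automatically:
# select only the needed columns at scan time, so the CSV parser
# never materializes the three unused Reddit columns
reddit_pruned = (
    pl.scan_csv(f"{DATA_DIR}/reddit.csv")
    .select(["name", "comment_karma", "link_karma"])
    .filter(pl.col("comment_karma") > 0)
    .filter(pl.col("link_karma") > 0)
    .filter(pl.col("name").str.contains(r"^a"))
)
# the Runescape scan already selects its single column, so the join
# now only carries the columns that appear in the final result
dataset_pruned = reddit_pruned.join(runescape, on="name", how="inner")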
Optimized query
Let's see how Polars optimizes this query.
dataset.show_graph(optimized=True)
The projections are pushed down past the join operation, all the way to the CSV scans, much like the hand-written sketch above. This means that both the scanning and the join operation have become cheaper thanks to the query optimization.
Performance
Let's time the result before and after optimization.
Without optimization, predicate_pushdown=False and projection_pushdown=False.
real 0m3,273s
user 0m9,284s
sys 0m1,081s
With optimization, the predicate_pushdown and projection_pushdown flags both set to True.
real 0m1,732s
user 0m7,581s
sys 0m0,783s
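These numbers appear to come from timing the script with the shell's time command. A minimal sketch of the same comparison done inside Python, reusing the dataset defined above:
import time

# time the query without the pushdown optimizations
t0 = time.perf_counter()
dataset.fetch(int(1e7))
print(f"unoptimized: {time.perf_counter() - t0:.3f}s")

# time the query with both pushdowns enabled
t0 = time.perf_counter()
dataset.fetch(int(1e7), predicate_pushdown=True, projection_pushdown=True)
print(f"optimized: {time.perf_counter() - t0:.3f}s")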
We can see that we almost halved the query time on this simple query. With real business data often comprising many columns, filtering missing data, doing complex groupby operations, and using joins, we expect the difference between unoptimized and optimized queries to only grow.