With the lazy API Polars doesn't run each query line-by-line but instead processes the full query end-to-end. To get the most out of Polars it is important that you use the lazy API because:
- the lazy API allows Polars to apply automatic query optimization with the query optimizer
- the lazy API allows you to work with larger than memory datasets using streaming
- the lazy API can catch schema errors before processing the data
Here we see how to use the lazy API starting from either a file or an existing `DataFrame`.
Using the lazy API from a file
In the ideal case we use the lazy API right from a file, as the query optimizer may help us reduce the amount of data we read from the file.
We create a lazy query from the Reddit CSV data and apply some transformations.
By starting the query with `pl.scan_csv` we are using the lazy API.

```python
q1 = (
    pl.scan_csv("docs/src/data/reddit.csv")
    .with_columns(pl.col("name").str.to_uppercase())
    .filter(pl.col("comment_karma") > 0)
)
```
A `pl.scan_` function is available for a number of file types, including CSV, IPC, Parquet and JSON.
In this query we tell Polars that we want to:
- load data from the Reddit CSV file
- convert the `name` column to uppercase
- apply a filter to the `comment_karma` column
The lazy query will not be executed at this point. See the page on executing lazy queries for more information.
Using the lazy API from a `DataFrame`
An alternative way to access the lazy API is to call `.lazy` on a `DataFrame` that has already been created in memory. By calling `.lazy` we convert the `DataFrame` to a `LazyFrame`.