Streaming larger-than-memory datasets

When a lazy query is executed in streaming mode, Polars processes the dataset in batches rather than all at once. This can allow Polars to process datasets that are larger than memory.

To tell Polars we want to execute a query in streaming mode, we pass the streaming=True argument to collect:

import polars as pl

from ..paths import DATA_DIR

q5 = (
    pl.scan_csv(f"{DATA_DIR}/reddit.csv")
    .with_columns(pl.col("name").str.to_uppercase())
    .filter(pl.col("comment_karma") > 0)
    .collect(streaming=True)
)

We can also use streaming mode when we execute lazy queries in other ways, such as a partial execution with fetch or a profiling execution with profile.
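To make "processing in batches" concrete, here is a minimal conceptual sketch that uses only the Python standard library. It is not how Polars implements streaming; the helper names (iter_batches, count_positive_karma), the batch size, and the sample rows are all hypothetical and chosen for illustration. It shows that only one batch of rows needs to be in memory at a time while a small running result is kept.

```python
import csv
import io
from itertools import islice


def iter_batches(reader, batch_size):
    """Yield lists of at most batch_size rows from a csv.DictReader."""
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


def count_positive_karma(csv_file, batch_size=2):
    """Count rows with comment_karma > 0 while holding one batch at a time."""
    reader = csv.DictReader(csv_file)
    total = 0
    for batch in iter_batches(reader, batch_size):
        # Only the current batch is referenced here; earlier batches can be freed.
        total += sum(1 for row in batch if int(row["comment_karma"]) > 0)
    return total


# Hypothetical sample data standing in for a file too large to load at once.
sample = io.StringIO(
    "name,comment_karma\n"
    "alice,10\n"
    "bob,0\n"
    "carol,3\n"
    "dave,-1\n"
    "erin,7\n"
)
positive_rows = count_positive_karma(sample)
print(positive_rows)
```

The running total is the only state that survives between batches, which is why an aggregation like this can, in principle, run over an input far larger than the available memory.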

When is streaming available?

Streaming is still a developing feature of Polars. We can ask Polars to execute any lazy query in streaming mode, but not all lazy operations support streaming. If a query contains an operation for which streaming is not supported, Polars will run the query in non-streaming mode.

Streaming is supported for many operations including:

  • filter, slice, head, tail
  • with_columns, select
  • groupby, groupby_dynamic
  • join, join_asof
  • sort
  • scan_csv, scan_parquet, scan_ipc

Sinking to a file

Although streaming allows you to process larger-than-memory datasets, the output DataFrame must still fit in memory.

To work with data where the output DataFrame is too large to fit in memory, we can write it directly to disk. We use streaming to do this by executing the lazy query with a sink_ function instead of collect. The sink_ functions use streaming by default.

import polars as pl

from ..paths import DATA_DIR

q11 = (
    pl.scan_csv(f"{DATA_DIR}/reddit.csv")
    .with_columns(pl.col("name").str.to_uppercase())
    .filter(pl.col("comment_karma") > 0)
    .sink_parquet(f"{DATA_DIR}/reddit.parquet")
)

There are sink_ functions available to write to Parquet and IPC file formats.
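The difference between collecting and sinking can be sketched with the standard library alone. The following is a conceptual illustration under stated assumptions, not the Polars implementation: the function name sink_filtered_csv, the batch size, and the sample rows are hypothetical, and CSV is used as the output format only to keep the example dependency-free. Each batch of transformed rows is written out as soon as it is full, so the complete result is never held in memory.

```python
import csv
import io


def sink_filtered_csv(src, dst, batch_size=2):
    """Uppercase `name`, keep rows with comment_karma > 0, and write in batches."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    batch = []
    written = 0
    for row in reader:
        row["name"] = row["name"].upper()
        if int(row["comment_karma"]) > 0:
            batch.append(row)
        if len(batch) == batch_size:
            # Flush the full batch to the destination and release it.
            writer.writerows(batch)
            written += len(batch)
            batch = []
    # Flush whatever is left in the final, possibly partial, batch.
    writer.writerows(batch)
    written += len(batch)
    return written


# Hypothetical in-memory stand-ins for an input file and an output file on disk.
source = io.StringIO(
    "name,comment_karma\n"
    "alice,10\n"
    "bob,0\n"
    "carol,3\n"
    "erin,7\n"
)
destination = io.StringIO()
rows_written = sink_filtered_csv(source, destination)
out_text = destination.getvalue()
print(out_text)
```

In real use, src and dst would be file objects opened on disk; the in-memory buffers here only keep the sketch self-contained.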