Evaluate the query in streaming mode and write to a Parquet file
Description
This allows streaming results that are larger than RAM to be written to disk.
- $lazy_sink_*() don't write directly to the output file(s) until $collect() is
  called. This is useful if you want to save a query to review or run later.
- $sink_*() write directly to the output file(s) (they are shortcuts for
  $lazy_sink_*()$collect()).
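A minimal sketch of the two workflows, assuming an existing LazyFrame lf and
hypothetical output paths:

# Immediate write: the query runs now and "direct.parquet" is created
lf$sink_parquet("direct.parquet")

# Deferred write: the sink becomes part of the plan; nothing is written
# until $collect() is called
sink_query <- lf$lazy_sink_parquet("deferred.parquet")
sink_query$collect()

The Examples section below runs the same pattern end to end on real data.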
Usage
parquet_statistics(
...,
min = TRUE,
max = TRUE,
distinct_count = TRUE,
null_count = TRUE
)
lazyframe__sink_parquet(
path,
...,
compression = c("lz4", "uncompressed", "snappy", "gzip", "lzo", "brotli", "zstd"),
compression_level = NULL,
statistics = TRUE,
row_group_size = NULL,
data_page_size = NULL,
maintain_order = TRUE,
type_coercion = TRUE,
`_type_check` = TRUE,
predicate_pushdown = TRUE,
projection_pushdown = TRUE,
simplify_expression = TRUE,
slice_pushdown = TRUE,
no_optimization = FALSE,
storage_options = NULL,
retries = 2,
sync_on_close = c("none", "data", "all"),
mkdir = FALSE,
engine = c("auto", "in-memory", "streaming"),
collapse_joins = deprecated()
)
lazyframe__lazy_sink_parquet(
path,
...,
compression = c("lz4", "uncompressed", "snappy", "gzip", "lzo", "brotli", "zstd"),
compression_level = NULL,
statistics = TRUE,
row_group_size = NULL,
data_page_size = NULL,
maintain_order = TRUE,
type_coercion = TRUE,
`_type_check` = TRUE,
predicate_pushdown = TRUE,
projection_pushdown = TRUE,
simplify_expression = TRUE,
slice_pushdown = TRUE,
no_optimization = FALSE,
storage_options = NULL,
retries = 2,
sync_on_close = c("none", "data", "all"),
mkdir = FALSE,
collapse_joins = deprecated()
)
Arguments
…
  These dots are for future extensions and must be empty.

min
  Include stats on the minimum values in the column.

max
  Include stats on the maximum values in the column.

distinct_count
  Include stats on the number of distinct values in the column.

null_count
  Include stats on the number of null values in the column.

path
  A character. File path to which the file should be written.

compression
  The compression method. Must be one of "lz4", "uncompressed", "snappy",
  "gzip", "lzo", "brotli", or "zstd".

compression_level
  NULL or integer. The level of compression to use. Only used if the method
  is one of "gzip", "brotli", or "zstd". Higher compression means smaller
  files on disk.

statistics
  Whether statistics should be written to the Parquet headers. Possible
  values include TRUE (the default), FALSE, and the output of
  parquet_statistics() (see the sketch after this argument list).

row_group_size
  Size of the row groups in number of rows. If NULL (default), the chunks of
  the DataFrame are used. Writing in smaller chunks may reduce memory
  pressure and improve writing speeds.

data_page_size
  Size of the data page in bytes. If NULL (default), it is set to 1024^2
  bytes.

maintain_order
  Maintain the order in which data is processed. Setting this to FALSE will
  be slightly faster.

type_coercion
  A logical, indicates type coercion optimization.

_type_check
  For internal use only.

predicate_pushdown
  A logical, indicates predicate pushdown optimization.

projection_pushdown
  A logical, indicates projection pushdown optimization.

simplify_expression
  A logical, indicates simplify expression optimization.

slice_pushdown
  A logical, indicates slice pushdown optimization.

no_optimization
  A logical. If TRUE, turn off (certain) optimizations.

storage_options
  Named vector containing options that indicate how to connect to a cloud
  provider. The cloud providers currently supported are AWS, GCP, and Azure.
  See the cloud provider's documentation for the supported keys. If
  storage_options is not provided, Polars will try to infer the information
  from environment variables.

retries
  Number of retries if accessing a cloud instance fails.

sync_on_close
  Sync to disk before closing a file. Must be one of "none" (default),
  "data", or "all".

mkdir
  Recursively create all the directories in the path.

engine
  The engine name to use for processing the query. One of "auto" (default),
  "in-memory", or "streaming".

collapse_joins
  Deprecated; use predicate_pushdown instead.
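A minimal sketch of selecting which statistics to write by passing a
parquet_statistics() list as the statistics argument, assuming an existing
LazyFrame lf and a hypothetical output path:

# Write only min/max statistics, skipping distinct and null counts
lf$sink_parquet(
  "stats.parquet",
  statistics = parquet_statistics(
    min = TRUE,
    max = TRUE,
    distinct_count = FALSE,
    null_count = FALSE
  ),
  mkdir = TRUE  # create any missing directories in the path
)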
Value
- $sink_*() returns NULL invisibly.
- $lazy_sink_*() returns a new LazyFrame.
Examples
library("polars")
# Sink table 'mtcars' from memory to a Parquet file
tmpf <- tempfile()
as_polars_lf(mtcars)$sink_parquet(tmpf)
# Create a query that can be run in streaming mode end-to-end
tmpf2 <- tempfile()
lf <- pl$scan_parquet(tmpf)$select(pl$col("cyl") * 2)$lazy_sink_parquet(tmpf2)
lf$explain() |>
cat()
#> SINK (file)
#> SELECT [[(col("cyl")) * (2.0)]]
#> Parquet SCAN [/tmp/Rtmp61Zwmp/file5c745401dd63]
#> PROJECT 1/11 COLUMNS
# Execute the query: the result is written to tmpf2, so only an empty
# DataFrame is returned
lf$collect()
#> shape: (0, 0)
#> ┌┐
#> ╞╡
#> └┘

# Read the new file back in to check the result
pl$scan_parquet(tmpf2)$collect()
#> shape: (32, 1)
#> ┌──────┐
#> │ cyl │
#> │ --- │
#> │ f64 │
#> ╞══════╡
#> │ 12.0 │
#> │ 12.0 │
#> │ 8.0 │
#> │ 12.0 │
#> │ 16.0 │
#> │ … │
#> │ 8.0 │
#> │ 16.0 │
#> │ 12.0 │
#> │ 16.0 │
#> │ 8.0 │
#> └──────┘