Skip to content

Stream the output of a query to a Parquet file

Description

This writes the output of a query directly to a Parquet file without collecting it in the R session first. This is useful if the output of the query is still larger than RAM as it would crash the R session if it was collected into R.

Usage

<LazyFrame>$sink_parquet(
  path,
  ...,
  compression = "zstd",
  compression_level = 3,
  statistics = TRUE,
  row_group_size = NULL,
  data_page_size = NULL,
  maintain_order = TRUE,
  type_coercion = TRUE,
  predicate_pushdown = TRUE,
  projection_pushdown = TRUE,
  simplify_expression = TRUE,
  slice_pushdown = TRUE,
  no_optimization = FALSE
)

Arguments

path A character. File path to which the file should be written.
Ignored.
compression String. The compression method. One of:
  • "lz4": fast compression/decompression.
  • "uncompressed"
  • "snappy": this guarantees that the parquet file will be compatible with older parquet readers.
  • "gzip"
  • "lzo"
  • "brotli"
  • "zstd": good compression performance.
compression_level NULL or Integer. The level of compression to use. Only used if method is one of ‘gzip’, ‘brotli’, or ‘zstd’. Higher compression means smaller files on disk:
  • "gzip": min-level: 0, max-level: 10.
  • "brotli": min-level: 0, max-level: 11.
  • "zstd": min-level: 1, max-level: 22.
statistics Whether statistics should be written to the Parquet headers. Possible values:
  • TRUE: enable default set of statistics (default)
  • FALSE: disable all statistics
  • “full”: calculate and write all available statistics.
  • A named list where all values must be TRUE or FALSE, e.g. list(min = TRUE, max = FALSE). Statistics available are “min”, “max”, “distinct_count”, “null_count”.
row_group_size NULL or Integer. Size of the row groups in number of rows. If NULL (default), the chunks of the DataFrame are used. Writing in smaller chunks may reduce memory pressure and improve writing speeds.
data_page_size Size of the data page in bytes. If NULL (default), it is set to 1024^2 bytes. will be ~1MB.
maintain_order Maintain the order in which data is processed. Setting this to FALSE will be slightly faster.
type_coercion Logical. Coerce types such that operations succeed and run on minimal required memory.
predicate_pushdown Logical. Applies filters as early as possible at scan level.
projection_pushdown Logical. Select only the columns that are needed at the scan level.
simplify_expression Logical. Various optimizations, such as constant folding and replacing expensive operations with faster alternatives.
slice_pushdown Logical. Only load the required slice from the scan level. Don’t materialize sliced outputs (e.g. join$head(10)).
no_optimization Logical. Sets the following parameters to FALSE: predicate_pushdown, projection_pushdown, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns.

Value

Invisibly returns the input LazyFrame

Examples

library("polars")

# sink table 'mtcars' from mem to parquet
tmpf = tempfile()
as_polars_lf(mtcars)$sink_parquet(tmpf)

# stream a query end-to-end
tmpf2 = tempfile()
pl$scan_parquet(tmpf)$select(pl$col("cyl") * 2)$sink_parquet(tmpf2)

# load parquet directly into a DataFrame / memory
pl$scan_parquet(tmpf2)$collect()
#> shape: (32, 1)
#> ┌──────┐
#> │ cyl  │
#> │ ---  │
#> │ f64  │
#> ╞══════╡
#> │ 12.0 │
#> │ 12.0 │
#> │ 8.0  │
#> │ 12.0 │
#> │ 16.0 │
#> │ …    │
#> │ 8.0  │
#> │ 16.0 │
#> │ 12.0 │
#> │ 16.0 │
#> │ 8.0  │
#> └──────┘