Stream the output of a query to a Parquet file

Description

This writes the output of a query directly to a Parquet file without collecting it in the R session first. This is useful when the output of the query is larger than RAM, since collecting it into R would crash the session.

Usage

<LazyFrame>$sink_parquet(
  path,
  ...,
  compression = "zstd",
  compression_level = 3,
  statistics = TRUE,
  row_group_size = NULL,
  data_page_size = NULL,
  maintain_order = TRUE,
  type_coercion = TRUE,
  predicate_pushdown = TRUE,
  projection_pushdown = TRUE,
  simplify_expression = TRUE,
  slice_pushdown = TRUE,
  no_optimization = FALSE
)

Arguments

`path`	A character. File path to which the file should be written.
`...`	Ignored.
`compression`	String. The compression method. One of: "lz4" (fast compression/decompression), "uncompressed", "snappy" (guarantees that the Parquet file will be compatible with older Parquet readers), "gzip", "lzo", "brotli", "zstd" (good compression performance).
`compression_level`	`NULL` or Integer. The level of compression to use. Only used if `compression` is one of "gzip", "brotli", or "zstd". A higher level means smaller files on disk. "gzip": min-level: 0, max-level: 10. "brotli": min-level: 0, max-level: 11. "zstd": min-level: 1, max-level: 22.
`statistics`	Whether statistics should be written to the Parquet headers. Possible values: `TRUE` (default), which enables the default set of statistics; `FALSE`, which disables all statistics; `"full"`, which calculates and writes all available statistics; or a named list where all values must be `TRUE` or `FALSE`, e.g. `list(min = TRUE, max = FALSE)`. Statistics available are `"min"`, `"max"`, `"distinct_count"`, `"null_count"`.
`row_group_size`	`NULL` or Integer. Size of the row groups in number of rows. If `NULL` (default), the chunks of the DataFrame are used. Writing in smaller chunks may reduce memory pressure and improve writing speeds.
`data_page_size`	Size of the data page in bytes. If `NULL` (default), it is set to 1024^2 bytes (about 1 MB).
`maintain_order`	Maintain the order in which data is processed. Setting this to `FALSE` will be slightly faster.
`type_coercion`	Logical. Coerce types such that operations succeed and run on minimal required memory.
`predicate_pushdown`	Logical. Applies filters as early as possible at scan level.
`projection_pushdown`	Logical. Select only the columns that are needed at the scan level.
`simplify_expression`	Logical. Various optimizations, such as constant folding and replacing expensive operations with faster alternatives.
`slice_pushdown`	Logical. Only load the required slice from the scan level. Don’t materialize sliced outputs (e.g. `join$head(10)`).
`no_optimization`	Logical. Sets the following parameters to `FALSE`: `predicate_pushdown`, `projection_pushdown`, `slice_pushdown`, `comm_subplan_elim`, `comm_subexpr_elim`, `cluster_with_columns`.
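
As an illustration of the writer arguments described above, the following is a minimal sketch that overrides a few defaults. The variable `out_path` is a hypothetical name chosen for this sketch, and the specific values (gzip at level 6, 16 rows per row group) are arbitrary choices within the documented ranges rather than recommendations.

```r
library("polars")

# Hypothetical output path, used only for this sketch
out_path = tempfile(fileext = ".parquet")

as_polars_lf(mtcars)$sink_parquet(
  out_path,
  compression = "gzip",       # one of the documented compression methods
  compression_level = 6,      # within the documented 0-10 range for gzip
  statistics = list(min = TRUE, max = FALSE),  # named list of TRUE/FALSE values
  row_group_size = 16         # number of rows per row group
)

# Read the file back lazily and collect it to inspect the result
pl$scan_parquet(out_path)$collect()
```

Because everything after `path` follows `...`, these arguments have to be passed by name.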

Value

Invisibly returns the input LazyFrame.

Examples

library("polars")

# sink the in-memory table 'mtcars' to a Parquet file
tmpf = tempfile()
as_polars_lf(mtcars)$sink_parquet(tmpf)

# stream a query end-to-end
tmpf2 = tempfile()
pl$scan_parquet(tmpf)$select(pl$col("cyl") * 2)$sink_parquet(tmpf2)

# load parquet directly into a DataFrame / memory
pl$scan_parquet(tmpf2)$collect()

#> shape: (32, 1)
#> ┌──────┐
#> │ cyl  │
#> │ ---  │
#> │ f64  │
#> ╞══════╡
#> │ 12.0 │
#> │ 12.0 │
#> │ 8.0  │
#> │ 12.0 │
#> │ 16.0 │
#> │ …    │
#> │ 8.0  │
#> │ 16.0 │
#> │ 12.0 │
#> │ 16.0 │
#> │ 8.0  │
#> └──────┘
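
The larger-than-RAM use case from the Description could look roughly like the sketch below, which converts a CSV file to Parquet without collecting the intermediate result. It assumes `pl$scan_csv()` is available in the installed version of polars; `csv_path` and `parquet_path` are hypothetical names, and `mtcars` merely stands in for a dataset that would not fit in memory.

```r
library("polars")

# Hypothetical paths, used only for this sketch
csv_path = tempfile(fileext = ".csv")
parquet_path = tempfile(fileext = ".parquet")

# Write a small CSV file to stand in for a large on-disk dataset
write.csv(mtcars, csv_path, row.names = FALSE)

# Scan lazily, filter and select, then write straight to Parquet.
# maintain_order = FALSE gives up row order, which the Arguments
# section describes as slightly faster.
pl$scan_csv(csv_path)$
  filter(pl$col("cyl") > 4)$
  select(pl$col("mpg", "cyl"))$
  sink_parquet(parquet_path, maintain_order = FALSE)

# Inspect the written file
pl$scan_parquet(parquet_path)$collect()
```

Only the final `$collect()` brings data into the R session, and it is included here just to check the output.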