Stream the output of a query to a Parquet file
Description
This writes the output of a query directly to a Parquet file without collecting it in the R session first. This is useful when the output of the query is larger than RAM, since collecting it into R would crash the session.
Usage
<LazyFrame>$sink_parquet(
path,
...,
compression = "zstd",
compression_level = 3,
statistics = TRUE,
row_group_size = NULL,
data_page_size = NULL,
maintain_order = TRUE,
type_coercion = TRUE,
predicate_pushdown = TRUE,
projection_pushdown = TRUE,
simplify_expression = TRUE,
slice_pushdown = TRUE,
no_optimization = FALSE
)
Arguments
path
|
A character. File path to which the file should be written. |
...
|
Ignored. |
compression
|
String. The compression method. One of:
|
compression_level
|
NULL or Integer. The level of compression to use. Only used
if compression is one of ‘gzip’, ‘brotli’, or ‘zstd’. Higher compression
means smaller files on disk:
|
statistics
|
Whether statistics should be written to the Parquet headers. Possible
values:
|
row_group_size
|
NULL or Integer. Size of the row groups in number of rows.
If NULL (default), the chunks of the DataFrame are used.
Writing in smaller chunks may reduce memory pressure and improve writing
speeds.
|
data_page_size
|
Size of the data page in bytes. If NULL (default), it is
set to 1024^2 bytes (about 1 MB).
|
maintain_order
|
Maintain the order in which data is processed. Setting this to
FALSE will be slightly faster.
|
type_coercion
|
Logical. Coerce types such that operations succeed and run on minimal required memory. |
predicate_pushdown
|
Logical. Applies filters as early as possible at scan level. |
projection_pushdown
|
Logical. Select only the columns that are needed at the scan level. |
simplify_expression
|
Logical. Various optimizations, such as constant folding and replacing expensive operations with faster alternatives. |
slice_pushdown
|
Logical. Only load the required slice from the scan level. Don’t
materialize sliced outputs (e.g. join$head(10)).
|
no_optimization
|
Logical. Sets the following parameters to FALSE:
predicate_pushdown, projection_pushdown,
slice_pushdown, comm_subplan_elim,
comm_subexpr_elim, cluster_with_columns.
|
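The writer arguments above can be combined in a single call. The following is a minimal sketch, assuming that the polars package is installed and that ‘gzip’ is an accepted value for compression (it is named in the compression_level description). The output path, compression level and row group size are arbitrary values chosen for illustration.

```r
library(polars)

# hypothetical output location for this sketch
out_path = tempfile(fileext = ".parquet")

as_polars_lf(mtcars)$sink_parquet(
  out_path,
  compression = "gzip",    # a method for which compression_level is used
  compression_level = 6,   # arbitrary level chosen for illustration
  row_group_size = 10L,    # row groups of 10 rows each
  maintain_order = FALSE   # row order may change; slightly faster
)

# the file is on disk and can be scanned lazily again
file.exists(out_path)
```

Because maintain_order is FALSE here, the rows in the file are not guaranteed to follow the order of mtcars.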
Value
Invisibly returns the input LazyFrame
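Since the return value is invisible, the call prints nothing at the console, but the result can still be assigned and reused. A small sketch, assuming the behaviour stated above (the input LazyFrame is returned); the variable names and output path are hypothetical.

```r
library(polars)

lf = as_polars_lf(mtcars)
out_path = tempfile(fileext = ".parquet")  # hypothetical output location

# nothing is printed, but the returned LazyFrame can be captured
res = lf$sink_parquet(out_path)

# reuse the returned LazyFrame for a further query
res$select(pl$col("mpg"))$collect()
```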
Examples
library("polars")
# sink the in-memory table 'mtcars' to a Parquet file
tmpf = tempfile()
as_polars_lf(mtcars)$sink_parquet(tmpf)
# stream a query end-to-end
tmpf2 = tempfile()
pl$scan_parquet(tmpf)$select(pl$col("cyl") * 2)$sink_parquet(tmpf2)
# load parquet directly into a DataFrame / memory
pl$scan_parquet(tmpf2)$collect()
#> shape: (32, 1)
#> ┌──────┐
#> │ cyl │
#> │ --- │
#> │ f64 │
#> ╞══════╡
#> │ 12.0 │
#> │ 12.0 │
#> │ 8.0 │
#> │ 12.0 │
#> │ 16.0 │
#> │ … │
#> │ 8.0 │
#> │ 16.0 │
#> │ 12.0 │
#> │ 16.0 │
#> │ 8.0 │
#> └──────┘
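Streaming is most useful when a query reduces the data before it is written. The sketch below filters rows and selects two columns between the scan and the sink, so only the reduced result reaches the output file. It assumes the polars package is installed; the file paths and the filter condition are hypothetical choices for illustration.

```r
library(polars)

# hypothetical source and destination files
src = tempfile(fileext = ".parquet")
dst = tempfile(fileext = ".parquet")
as_polars_lf(mtcars)$sink_parquet(src)

# scan lazily, keep only cars with more than 4 cylinders and two columns,
# then write the result without collecting it into R
pl$scan_parquet(src)$
  filter(pl$col("cyl") > 4)$
  select(pl$col("mpg"), pl$col("cyl"))$
  sink_parquet(dst)

# read the reduced file back to inspect it
res = pl$scan_parquet(dst)$collect()
res
```

With the default predicate_pushdown and projection_pushdown settings, the filter and the column selection are applied at the scan level.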