Skip to content

Stream the output of a query to an Arrow IPC file

Description

This writes the output of a query directly to an Arrow IPC file without collecting it in the R session first. This is useful if the output of the query is still larger than RAM as it would crash the R session if it was collected into R.

Usage

<LazyFrame>$sink_ipc(
  path,
  ...,
  compression = c("zstd", "lz4", "uncompressed"),
  maintain_order = TRUE,
  type_coercion = TRUE,
  predicate_pushdown = TRUE,
  projection_pushdown = TRUE,
  simplify_expression = TRUE,
  slice_pushdown = TRUE,
  no_optimization = FALSE,
  inherit_optimization = FALSE
)

Arguments

path A character. File path to which the file should be written.
Ignored.
compression NULL or a character of the compression method, “uncompressed” or "lz4" or "zstd". NULL is equivalent to “uncompressed”. Choose "zstd" for good compression performance. Choose "lz4" for fast compression/decompression.
maintain_order Maintain the order in which data is processed. Setting this to FALSE will be slightly faster.
type_coercion Logical. Coerce types such that operations succeed and run on minimal required memory.
predicate_pushdown Logical. Applies filters as early as possible at scan level.
projection_pushdown Logical. Select only the columns that are needed at the scan level.
simplify_expression Logical. Various optimizations, such as constant folding and replacing expensive operations with faster alternatives.
slice_pushdown Logical. Only load the required slice from the scan level. Don’t materialize sliced outputs (e.g. join$head(10)).
no_optimization Logical. Sets the following parameters to FALSE: predicate_pushdown, projection_pushdown, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns.
inherit_optimization Logical. Use existing optimization settings regardless the settings specified in this function call.

Value

Invisibly returns the input LazyFrame

Examples

library(polars)

# sink table 'mtcars' from mem to ipc
tmpf = tempfile()
pl$LazyFrame(mtcars)$sink_ipc(tmpf)

# stream a query end-to-end (not supported yet, https://github.com/pola-rs/polars/issues/1040)
# tmpf2 = tempfile()
# pl$scan_ipc(tmpf)$select(pl$col("cyl") * 2)$sink_ipc(tmpf2)

# load ipc directly into a DataFrame / memory
# pl$scan_ipc(tmpf2)$collect()