Stream the output of a query to a CSV file

Description

This writes the output of a query directly to a CSV file without first collecting it in the R session. This is useful when the output of the query is larger than RAM, since collecting it into R would crash the session.

Usage

<LazyFrame>$sink_csv(
  path,
  ...,
  include_bom = FALSE,
  include_header = TRUE,
  separator = ",",
  line_terminator = "\n",
  quote_char = "\"",
  batch_size = 1024,
  datetime_format = NULL,
  date_format = NULL,
  time_format = NULL,
  float_precision = NULL,
  null_values = "",
  quote_style = "necessary",
  maintain_order = TRUE,
  type_coercion = TRUE,
  predicate_pushdown = TRUE,
  projection_pushdown = TRUE,
  simplify_expression = TRUE,
  slice_pushdown = TRUE,
  no_optimization = FALSE
)

Arguments

path A character. File path to which the file should be written.
... Ignored.
include_bom Whether to include UTF-8 BOM (byte order mark) in the CSV output.
include_header Whether to include header in the CSV output.
separator Separate CSV fields with this symbol.
line_terminator String used to end each row.
quote_char Byte to use as quoting character.
batch_size Number of rows that will be processed per thread.
datetime_format A format string, with the specifiers defined by the chrono Rust crate. If no format is specified, the default fractional-second precision is inferred from the maximum time unit found in the frame’s Datetime columns (if any).
date_format A format string, with the specifiers defined by the chrono Rust crate.
time_format A format string, with the specifiers defined by the chrono Rust crate.
float_precision Number of decimal places to write, applied to both Float32 and Float64 datatypes.
null_values A string representing null values (defaulting to the empty string).
quote_style Determines the quoting strategy used.
  • “necessary” (default): This puts quotes around fields only when necessary, that is, when a field contains a quote, delimiter or record terminator. Quotes are also necessary when writing an empty record (which is indistinguishable from a record with one empty field).
  • “always”: This puts quotes around every field.
  • “non_numeric”: This puts quotes around all fields that are non-numeric. Namely, when writing a field that does not parse as a valid float or integer, then quotes will be used even if they aren’t strictly necessary.
  • “never”: This never puts quotes around fields, even if that results in invalid CSV data (e.g. by not quoting strings containing the separator).
maintain_order Maintain the order in which data is processed. Setting this to FALSE will be slightly faster.
type_coercion Logical. Coerce types such that operations succeed and run on minimal required memory.
predicate_pushdown Logical. Applies filters as early as possible at scan level.
projection_pushdown Logical. Select only the columns that are needed at the scan level.
simplify_expression Logical. Various optimizations, such as constant folding and replacing expensive operations with faster alternatives.
slice_pushdown Logical. Only load the required slice from the scan level. Don’t materialize sliced outputs (e.g. join$head(10)).
no_optimization Logical. Sets the following parameters to FALSE: predicate_pushdown, projection_pushdown, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns.
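
As a minimal sketch of how some of the formatting arguments combine, the snippet below writes a small frame with a semicolon separator, a custom null marker and a fixed float precision. The data frame, its column names and the chosen values are illustrative assumptions, not part of the package documentation; the arguments after `...` are passed by name.

```r
library(polars)

# Hypothetical toy data: one float column with a missing value and one
# string column in which a value contains the chosen separator.
df <- data.frame(x = c(1.5, NA), s = c("a;b", "c"))

out <- tempfile(fileext = ".csv")

# Arguments after `...` must be named.
as_polars_lf(df)$sink_csv(
  out,
  separator = ";",        # use ";" instead of "," between fields
  null_values = "NA",     # write missing values as the string "NA"
  float_precision = 2,    # two decimal places for Float32/Float64 columns
  quote_style = "necessary"
)

# Inspect the raw text that was written
lines <- readLines(out)
print(lines)
```

With `quote_style = "necessary"`, only the field that contains the separator should end up quoted.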

Value

Invisibly returns the input LazyFrame.

Examples

library("polars")

# sink the in-memory table 'mtcars' to a CSV file
tmpf = tempfile()
as_polars_lf(mtcars)$sink_csv(tmpf)

# stream a query end-to-end
tmpf2 = tempfile()
pl$scan_csv(tmpf)$select(pl$col("cyl") * 2)$sink_csv(tmpf2)

# load the CSV directly into a DataFrame / memory
pl$scan_csv(tmpf2)$collect()
#> shape: (32, 1)
#> ┌──────┐
#> │ cyl  │
#> │ ---  │
#> │ f64  │
#> ╞══════╡
#> │ 12.0 │
#> │ 12.0 │
#> │ 8.0  │
#> │ 12.0 │
#> │ 16.0 │
#> │ …    │
#> │ 8.0  │
#> │ 16.0 │
#> │ 12.0 │
#> │ 16.0 │
#> │ 8.0  │
#> └──────┘
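
The `date_format` argument takes chrono-style specifiers. As a hedged sketch, the following writes a Date column in day/month/year order; the data frame and the column name `day` are hypothetical and used only for illustration.

```r
library(polars)

# Hypothetical example data with a single Date column
dates <- data.frame(day = as.Date(c("2024-01-31", "2024-02-29")))

tmpf3 <- tempfile(fileext = ".csv")

# "%d/%m/%Y" is a chrono format string: day, month, four-digit year
as_polars_lf(dates)$sink_csv(tmpf3, date_format = "%d/%m/%Y")

# Read the file back as plain text to see the formatted dates
date_lines <- readLines(tmpf3)
print(date_lines)
```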