
Evaluate the query in streaming mode and write to a CSV file


Description

[Experimental]

This allows streaming results that are larger than RAM to be written to disk.

  • $lazy_sink_*() don’t write directly to the output file(s) until $collect() is called. This is useful if you want to save a query to review or run later.
  • $sink_*() write directly to the output file(s) (they are shortcuts for $lazy_sink_*()$collect()).
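Assuming the polars package is attached, the relationship between the two forms can be sketched as:

```r
library(polars)

tmp <- tempfile(fileext = ".csv")

# Shortcut form: evaluates the query and writes the file immediately.
as_polars_lf(mtcars)$sink_csv(tmp)

# Equivalent two-step form: lazy_sink_csv() only records the sink in
# the query plan; nothing is written until $collect() runs it.
tmp2 <- tempfile(fileext = ".csv")
query <- as_polars_lf(mtcars)$lazy_sink_csv(tmp2)
query$collect()
```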

Usage

<LazyFrame>$sink_csv(
  path,
  ...,
  include_bom = FALSE,
  include_header = TRUE,
  separator = ",",
  line_terminator = "\n",
  quote_char = "\"",
  batch_size = 1024,
  datetime_format = NULL,
  date_format = NULL,
  time_format = NULL,
  float_scientific = NULL,
  float_precision = NULL,
  decimal_comma = FALSE,
  null_value = "",
  quote_style = c("necessary", "always", "never", "non_numeric"),
  maintain_order = TRUE,
  type_coercion = TRUE,
  `_type_check` = TRUE,
  predicate_pushdown = TRUE,
  projection_pushdown = TRUE,
  simplify_expression = TRUE,
  slice_pushdown = TRUE,
  no_optimization = FALSE,
  storage_options = NULL,
  retries = 2,
  sync_on_close = c("none", "data", "all"),
  mkdir = FALSE,
  engine = c("auto", "in-memory", "streaming"),
  collapse_joins = deprecated()
)

lazyframe__lazy_sink_csv(
  path,
  ...,
  include_bom = FALSE,
  include_header = TRUE,
  separator = ",",
  line_terminator = "\n",
  quote_char = "\"",
  batch_size = 1024,
  datetime_format = NULL,
  date_format = NULL,
  time_format = NULL,
  float_scientific = NULL,
  float_precision = NULL,
  decimal_comma = FALSE,
  null_value = "",
  quote_style = c("necessary", "always", "never", "non_numeric"),
  maintain_order = TRUE,
  type_coercion = TRUE,
  `_type_check` = TRUE,
  predicate_pushdown = TRUE,
  projection_pushdown = TRUE,
  simplify_expression = TRUE,
  slice_pushdown = TRUE,
  no_optimization = FALSE,
  storage_options = NULL,
  retries = 2,
  sync_on_close = c("none", "data", "all"),
  mkdir = FALSE,
  collapse_joins = deprecated()
)

Arguments

path A character. File path to which the file should be written.
... These dots are for future extensions and must be empty.
include_bom Logical, whether to include UTF-8 BOM in the CSV output.
include_header Logical, whether to include header in the CSV output.
separator Separate CSV fields with this symbol.
line_terminator String used to end each row.
quote_char Byte to use as quoting character.
batch_size Number of rows that will be processed per thread.
datetime_format A format string, with the specifiers defined by the chrono Rust crate. If no format is specified, the default fractional-second precision is inferred from the maximum time unit found in the frame’s Datetime columns (if any).
date_format A format string, with the specifiers defined by the chrono Rust crate.
time_format A format string, with the specifiers defined by the chrono Rust crate.
float_scientific Whether to always use scientific notation (TRUE), never use it (FALSE), or decide automatically (NULL) for Float32 and Float64 datatypes.
float_precision Number of decimal places to write, applied to both Float32 and Float64 datatypes.
decimal_comma If TRUE, use a comma “,” as the decimal separator instead of a point. Floats will be encapsulated in quotes if necessary.
null_value A string representing null values (defaulting to the empty string).
quote_style Determines the quoting strategy used. Must be one of:
  • “necessary” (default): This puts quotes around fields only when necessary. They are necessary when fields contain a quote, delimiter, or record terminator. Quotes are also necessary when writing an empty record (which is indistinguishable from a record with one empty field).
  • “always”: This puts quotes around every field. Always.
  • “never”: This never puts quotes around fields, even if that results in invalid CSV data (e.g.: by not quoting strings containing the separator).
  • “non_numeric”: This puts quotes around all fields that are non-numeric. Namely, when writing a field that does not parse as a valid float or integer, then quotes will be used even if they aren’t strictly necessary.
maintain_order Maintain the order in which data is processed. Setting this to FALSE will be slightly faster.
type_coercion A logical, indicates type coercion optimization.
_type_check For internal use only.
predicate_pushdown A logical, indicates predicate pushdown optimization.
projection_pushdown A logical, indicates projection pushdown optimization.
simplify_expression A logical, indicates simplify expression optimization.
slice_pushdown A logical, indicates slice pushdown optimization.
no_optimization A logical. If TRUE, turn off (certain) optimizations.
storage_options Named vector containing options that indicate how to connect to a cloud provider. The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
  • aws
  • gcp
  • azure
  • Hugging Face (hf://): Accepts an API key under the token parameter c(token = YOUR_TOKEN) or by setting the HF_TOKEN environment variable.
If storage_options is not provided, Polars will try to infer the information from environment variables.
retries Number of retries if accessing a cloud instance fails.
sync_on_close Sync to disk before closing the file. Must be one of:
  • “none”: does not sync;
  • “data”: syncs the file contents;
  • “all”: syncs the file contents and metadata.
mkdir Recursively create all the directories in the path.
engine The engine name to use for processing the query. One of the following:
  • “auto” (default): Select the engine automatically. The “in-memory” engine will be selected for most cases.
  • “in-memory”: Use the in-memory engine.
  • “streaming”: [Experimental] Use the (new) streaming engine.
collapse_joins [Deprecated] Use predicate_pushdown instead.
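A short sketch combining a few of the formatting arguments above (assuming the polars package is attached; the exact rendering of rounded floats may vary by platform):

```r
library(polars)

tmp <- tempfile(fileext = ".csv")
df <- data.frame(x = c(1.25, NA, 3.5), y = c("a", "b,c", "d"))

# Quote every field, round floats to one decimal place, and write
# missing values as the string "NA" instead of an empty string.
as_polars_lf(df)$sink_csv(
  tmp,
  quote_style = "always",
  float_precision = 1,
  null_value = "NA"
)

readLines(tmp)
```

Note that with the default quote_style = "necessary", only the "b,c" field would be quoted, since it contains the separator.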

Value

  • $sink_*() returns NULL invisibly.
  • $lazy_sink_*() returns a new LazyFrame.

Examples

library("polars")

# Sink table 'mtcars' from memory to CSV
tmpf <- tempfile(fileext = ".csv")
as_polars_lf(mtcars)$sink_csv(tmpf)

# Create a query that can be run end-to-end in streaming mode
tmpf2 <- tempfile(fileext = ".csv")
lf <- pl$scan_csv(tmpf)$select(pl$col("cyl") * 2)$lazy_sink_csv(tmpf2)
lf$explain() |>
  cat()
#> SINK (file)
#>   SELECT [[(col("cyl")) * (2.0)]]
#>     Csv SCAN [/tmp/RtmpdwCYjn/file56344b552415.csv]
#>     PROJECT 1/11 COLUMNS
#>     ESTIMATED ROWS: 32
# Execute the query and write to disk
lf$collect()
#> shape: (0, 0)
#> ┌┐
#> ╞╡
#> └┘
# Load CSV directly into a DataFrame / memory
pl$read_csv(tmpf2)
#> shape: (32, 1)
#> ┌──────┐
#> │ cyl  │
#> │ ---  │
#> │ f64  │
#> ╞══════╡
#> │ 12.0 │
#> │ 12.0 │
#> │ 8.0  │
#> │ 12.0 │
#> │ 16.0 │
#> │ …    │
#> │ 8.0  │
#> │ 16.0 │
#> │ 12.0 │
#> │ 16.0 │
#> │ 8.0  │
#> └──────┘