
Evaluate the query in streaming mode and write to a CSV file


Description

[Experimental]

This allows streaming results that are larger than RAM to be written to disk.

  • $lazy_sink_*() don’t write directly to the output file(s) until $collect() is called. This is useful if you want to save a query to review or run later.
  • $sink_*() write directly to the output file(s) (they are shortcuts for $lazy_sink_*()$collect()).
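Assuming the polars package is attached, the relationship between the two forms can be sketched as:

```r
library(polars)

tmp <- tempfile(fileext = ".csv")

# Shortcut form: evaluates the query and writes the file immediately.
as_polars_lf(mtcars)$sink_csv(tmp)

# Equivalent two-step form: lazy_sink_csv() only records the sink in
# the query plan; nothing is written until $collect() runs it.
tmp2 <- tempfile(fileext = ".csv")
query <- as_polars_lf(mtcars)$lazy_sink_csv(tmp2)
query$collect()
```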

Usage

<LazyFrame>$sink_csv(
  path,
  ...,
  include_bom = FALSE,
  include_header = TRUE,
  separator = ",",
  line_terminator = "\n",
  quote_char = "\"",
  batch_size = 1024,
  datetime_format = NULL,
  date_format = NULL,
  time_format = NULL,
  float_scientific = NULL,
  float_precision = NULL,
  decimal_comma = FALSE,
  null_value = "",
  quote_style = c("necessary", "always", "never", "non_numeric"),
  maintain_order = TRUE,
  type_coercion = TRUE,
  `_type_check` = TRUE,
  predicate_pushdown = TRUE,
  projection_pushdown = TRUE,
  simplify_expression = TRUE,
  slice_pushdown = TRUE,
  no_optimization = FALSE,
  storage_options = NULL,
  retries = 2,
  sync_on_close = c("none", "data", "all"),
  mkdir = FALSE,
  engine = c("auto", "in-memory", "streaming"),
  collapse_joins = deprecated()
)

lazyframe__lazy_sink_csv(
  path,
  ...,
  include_bom = FALSE,
  include_header = TRUE,
  separator = ",",
  line_terminator = "\n",
  quote_char = "\"",
  batch_size = 1024,
  datetime_format = NULL,
  date_format = NULL,
  time_format = NULL,
  float_scientific = NULL,
  float_precision = NULL,
  decimal_comma = FALSE,
  null_value = "",
  quote_style = c("necessary", "always", "never", "non_numeric"),
  maintain_order = TRUE,
  type_coercion = TRUE,
  `_type_check` = TRUE,
  predicate_pushdown = TRUE,
  projection_pushdown = TRUE,
  simplify_expression = TRUE,
  slice_pushdown = TRUE,
  no_optimization = FALSE,
  storage_options = NULL,
  retries = 2,
  sync_on_close = c("none", "data", "all"),
  mkdir = FALSE,
  collapse_joins = deprecated()
)

Arguments

path A character. File path to which the file should be written.
... These dots are for future extensions and must be empty.
include_bom Logical, whether to include UTF-8 BOM in the CSV output.
include_header Logical, whether to include header in the CSV output.
separator Separate CSV fields with this symbol.
line_terminator String used to end each row.
quote_char Byte to use as quoting character.
batch_size Number of rows that will be processed per thread.
datetime_format A format string, with the specifiers defined by the chrono Rust crate. If no format is specified, the default fractional-second precision is inferred from the maximum time unit found in the frame’s Datetime columns (if any).
date_format A format string, with the specifiers defined by the chrono Rust crate.
time_format A format string, with the specifiers defined by the chrono Rust crate.
float_scientific Whether to always use scientific notation (TRUE), never use it (FALSE), or decide automatically (NULL) for Float32 and Float64 datatypes.
float_precision Number of decimal places to write, applied to both Float32 and Float64 datatypes.
decimal_comma If TRUE, use a comma “,” as the decimal separator instead of a point. Floats will be encapsulated in quotes if necessary.
null_value A string representing null values (defaulting to the empty string).
quote_style Determines the quoting strategy used. Must be one of:
  • “necessary” (default): This puts quotes around fields only when necessary. They are necessary when fields contain a quote, delimiter, or record terminator. Quotes are also necessary when writing an empty record (which is indistinguishable from a record with one empty field).
  • “always”: This puts quotes around every field. Always.
  • “never”: This never puts quotes around fields, even if that results in invalid CSV data (e.g.: by not quoting strings containing the separator).
  • “non_numeric”: This puts quotes around all fields that are non-numeric. Namely, when writing a field that does not parse as a valid float or integer, then quotes will be used even if they aren’t strictly necessary.
maintain_order Maintain the order in which data is processed. Setting this to FALSE will be slightly faster.
type_coercion A logical, indicates type coercion optimization.
_type_check For internal use only.
predicate_pushdown A logical, indicates predicate pushdown optimization.
projection_pushdown A logical, indicates projection pushdown optimization.
simplify_expression A logical, indicates simplify expression optimization.
slice_pushdown A logical, indicates slice pushdown optimization.
no_optimization A logical. If TRUE, turn off (certain) optimizations.
storage_options Named vector containing options that indicate how to connect to a cloud provider. The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
  • aws
  • gcp
  • azure
  • Hugging Face (hf://): Accepts an API key under the token parameter c(token = YOUR_TOKEN) or by setting the HF_TOKEN environment variable.
If storage_options is not provided, Polars will try to infer the information from environment variables.
retries Number of retries if accessing a cloud instance fails.
sync_on_close Sync to disk before closing the file. Must be one of:
  • “none”: does not sync;
  • “data”: syncs the file contents;
  • “all”: syncs the file contents and metadata.
mkdir Recursively create all the directories in the path.
engine The engine name to use for processing the query. One of the following:
  • “auto” (default): Select the engine automatically. The “in-memory” engine will be selected for most cases.
  • “in-memory”: Use the in-memory engine.
  • “streaming”: [Experimental] Use the (new) streaming engine.
collapse_joins [Deprecated] Use predicate_pushdown instead.
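A short sketch combining a few of the formatting arguments above (assuming the polars package is attached; the exact rendering of rounded floats may vary by platform):

```r
library(polars)

tmp <- tempfile(fileext = ".csv")
df <- data.frame(x = c(1.25, NA, 3.5), y = c("a", "b,c", "d"))

# Quote every field, round floats to one decimal place, and write
# missing values as the string "NA" instead of an empty string.
as_polars_lf(df)$sink_csv(
  tmp,
  quote_style = "always",
  float_precision = 1,
  null_value = "NA"
)

readLines(tmp)
```

Note that with the default quote_style = "necessary", only the "b,c" field would be quoted, since it contains the separator.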

Value

  • $sink_*() returns NULL invisibly.
  • $lazy_sink_*() returns a new LazyFrame.

Examples

library("polars")

# Sink table 'mtcars' from memory to CSV
tmpf <- tempfile(fileext = ".csv")
as_polars_lf(mtcars)$sink_csv(tmpf)

# Create a query that can be run end-to-end in streaming mode
tmpf2 <- tempfile(fileext = ".csv")
lf <- pl$scan_csv(tmpf)$select(pl$col("cyl") * 2)$lazy_sink_csv(tmpf2)
lf$explain() |>
  cat()
#> SINK (file)
#>   SELECT [[(col("cyl")) * (2.0)]]
#>     Csv SCAN [/tmp/RtmpdwCYjn/file56344b552415.csv]
#>     PROJECT 1/11 COLUMNS
#>     ESTIMATED ROWS: 32
# Execute the query and write to disk
lf$collect()
#> shape: (0, 0)
#> ┌┐
#> ╞╡
#> └┘
# Load CSV directly into a DataFrame / memory
pl$read_csv(tmpf2)
#> shape: (32, 1)
#> ┌──────┐
#> │ cyl  │
#> │ ---  │
#> │ f64  │
#> ╞══════╡
#> │ 12.0 │
#> │ 12.0 │
#> │ 8.0  │
#> │ 12.0 │
#> │ 16.0 │
#> │ …    │
#> │ 8.0  │
#> │ 16.0 │
#> │ 12.0 │
#> │ 16.0 │
#> │ 8.0  │
#> └──────┘