Skip to content

Evaluate the query in streaming mode and write to a NDJSON file

Source code

Description

[Experimental]

This allows streaming results that are larger than RAM to be written to disk.

Usage

<LazyFrame>$sink_ndjson(
  path,
  ...,
  compression = c("uncompressed", "gzip", "zstd"),
  compression_level = NULL,
  check_extension = TRUE,
  maintain_order = TRUE,
  storage_options = NULL,
  retries = deprecated(),
  sync_on_close = c("none", "data", "all"),
  mkdir = FALSE,
  engine = c("auto", "in-memory", "streaming"),
  optimizations = pl\$QueryOptFlags(),
  type_coercion = deprecated(),
  predicate_pushdown = deprecated(),
  projection_pushdown = deprecated(),
  simplify_expression = deprecated(),
  slice_pushdown = deprecated(),
  collapse_joins = deprecated(),
  no_optimization = deprecated()
)

lazyframe__lazy_sink_ndjson(
  path,
  ...,
  compression = c("uncompressed", "gzip", "zstd"),
  compression_level = NULL,
  check_extension = TRUE,
  maintain_order = TRUE,
  storage_options = NULL,
  retries = deprecated(),
  sync_on_close = c("none", "data", "all"),
  mkdir = FALSE
)

Arguments

path A character. File path to which the file should be written.
These dots are for future extensions and must be empty.
compression [Experimental] What compression format to use. Must be one of uncompressed (default), gzip, or zstd.
compression_level [Experimental] The compression level to use, typically 0-9 or NULL to let the engine choose.
check_extension [Experimental] Whether to check if the filename matches the compression settings. Will raise an error if compression is set to “uncompressed” and the filename ends in one of “.gz”, “.zst”, “.zstd”, or if compression != “uncompressed” and the file uses a mismatched extension. Only applies if file is a path.
maintain_order Maintain the order in which data is processed. Setting this to FALSE will be slightly faster.
storage_options Named vector containing options that indicate how to connect to a cloud provider. The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
  • aws
  • gcp
  • azure
  • Hugging Face (hf://): Accepts an API key under the token parameter c(token = YOUR_TOKEN) or by setting the HF_TOKEN environment variable.
If storage_options is not provided, Polars will try to infer the information from environment variables.
retries [Deprecated] Number of retries if accessing a cloud instance fails. Specify max_retries in storage_options instead.
sync_on_close Sync to disk when before closing a file. Must be one of:
  • “none”: does not sync;
  • “data”: syncs the file contents;
  • “all”: syncs the file contents and metadata.
mkdir Recursively create all the directories in the path.
engine The engine name to use for processing the query. One of the followings:
  • “auto” (default): Select the engine automatically. The “in-memory” engine will be selected for most cases.
  • “in-memory”: Use the in-memory engine.
  • “streaming”: [Experimental] Use the (new) streaming engine.
optimizations [Experimental] A QueryOptFlags object to indicate optimization passes done during query optimization.
type_coercion [Deprecated] Use the type_coercion property of a QueryOptFlags object, then pass that to the optimizations argument instead.
predicate_pushdown [Deprecated] Use the predicate_pushdown property of a QueryOptFlags object, then pass that to the optimizations argument instead.
projection_pushdown [Deprecated] Use the projection_pushdown property of a QueryOptFlags object, then pass that to the optimizations argument instead.
simplify_expression [Deprecated] Use the simplify_expression property of a QueryOptFlags object, then pass that to the optimizations argument instead.
slice_pushdown [Deprecated] Use the slice_pushdown property of a QueryOptFlags object, then pass that to the optimizations argument instead.
collapse_joins [Deprecated] Use the predicate_pushdown property of a QueryOptFlags object, then pass that to the optimizations argument instead.
no_optimization [Deprecated] Use the optimizations argument with pl$QueryOptFlags()$no_optimizations() instead.

Value

  • \$sink\_\*() returns NULL invisibly.
  • \$lazy_sink\_\*() returns a new LazyFrame.

Examples

library("polars")


dat <- as_polars_lf(head(mtcars))
destination <- tempfile()

dat$select(pl$col("drat", "mpg"))$sink_ndjson(destination)
jsonlite::stream_in(file(destination))
#> 
 Found 6 records...
 Imported 6 records. Simplifying...

#>   drat  mpg
#> 1 3.90 21.0
#> 2 3.90 21.0
#> 3 3.85 22.8
#> 4 3.08 21.4
#> 5 3.15 18.7
#> 6 2.76 18.1