Skip to content

Evaluate the query in streaming mode and write to a NDJSON file

Source code

Description

[Experimental]

This allows streaming results that are larger than RAM to be written to disk.

Usage

<LazyFrame>$sink_ndjson(
  path,
  ...,
  maintain_order = TRUE,
  storage_options = NULL,
  retries = 2,
  sync_on_close = c("none", "data", "all"),
  mkdir = FALSE,
  engine = c("auto", "in-memory", "streaming"),
  optimizations = pl\$QueryOptFlags(),
  type_coercion = deprecated(),
  predicate_pushdown = deprecated(),
  projection_pushdown = deprecated(),
  simplify_expression = deprecated(),
  slice_pushdown = deprecated(),
  collapse_joins = deprecated(),
  no_optimization = deprecated()
)

lazyframe__lazy_sink_ndjson(
  path,
  ...,
  maintain_order = TRUE,
  storage_options = NULL,
  retries = 2,
  sync_on_close = c("none", "data", "all"),
  mkdir = FALSE
)

Arguments

path A character. File path to which the file should be written.
These dots are for future extensions and must be empty.
maintain_order Maintain the order in which data is processed. Setting this to FALSE will be slightly faster.
storage_options Named vector containing options that indicate how to connect to a cloud provider. The cloud providers currently supported are AWS, GCP, and Azure. See supported keys here:
  • aws
  • gcp
  • azure
  • Hugging Face (hf://): Accepts an API key under the token parameter c(token = YOUR_TOKEN) or by setting the HF_TOKEN environment variable.
If storage_options is not provided, Polars will try to infer the information from environment variables.
retries Number of retries if accessing a cloud instance fails.
sync_on_close Sync to disk when before closing a file. Must be one of:
  • “none”: does not sync;
  • “data”: syncs the file contents;
  • “all”: syncs the file contents and metadata.
mkdir Recursively create all the directories in the path.
engine The engine name to use for processing the query. One of the followings:
  • “auto” (default): Select the engine automatically. The “in-memory” engine will be selected for most cases.
  • “in-memory”: Use the in-memory engine.
  • “streaming”: [Experimental] Use the (new) streaming engine.
optimizations [Experimental] A QueryOptFlags object to indicate optimization passes done during query optimization.
type_coercion [Deprecated] Use the type_coercion property of a QueryOptFlags object, then pass that to the optimizations argument instead.
predicate_pushdown [Deprecated] Use the predicate_pushdown property of a QueryOptFlags object, then pass that to the optimizations argument instead.
projection_pushdown [Deprecated] Use the projection_pushdown property of a QueryOptFlags object, then pass that to the optimizations argument instead.
simplify_expression [Deprecated] Use the simplify_expression property of a QueryOptFlags object, then pass that to the optimizations argument instead.
slice_pushdown [Deprecated] Use the slice_pushdown property of a QueryOptFlags object, then pass that to the optimizations argument instead.
collapse_joins [Deprecated] Use the predicate_pushdown property of a QueryOptFlags object, then pass that to the optimizations argument instead.
no_optimization [Deprecated] Use the optimizations argument with pl$QueryOptFlags()$no_optimizations() instead.

Value

  • \$sink\_\*() returns NULL invisibly.
  • \$lazy_sink\_\*() returns a new LazyFrame.

Examples

library("polars")


dat <- as_polars_lf(head(mtcars))
destination <- tempfile()

dat$select(pl$col("drat", "mpg"))$sink_ndjson(destination)
jsonlite::stream_in(file(destination))
#> 
 Found 6 records...
 Imported 6 records. Simplifying...

#>   drat  mpg
#> 1 3.90 21.0
#> 2 3.90 21.0
#> 3 3.85 22.8
#> 4 3.08 21.4
#> 5 3.15 18.7
#> 6 2.76 18.1