
Write to parquet file

Description

Write the DataFrame to a Parquet file.

Usage

<DataFrame>$write_parquet(
  file,
  ...,
  compression = "zstd",
  compression_level = 3,
  statistics = TRUE,
  row_group_size = NULL,
  data_page_size = NULL,
  partition_by = NULL,
  partition_chunk_size_bytes = 4294967296
)

Arguments

file File path to which the result should be written. This should be a path to a directory if writing a partitioned dataset.
... Ignored.
compression String. The compression method. One of:
  • "lz4": fast compression/decompression.
  • "uncompressed"
  • "snappy": this guarantees that the parquet file will be compatible with older parquet readers.
  • "gzip"
  • "lzo"
  • "brotli"
  • "zstd": good compression performance.
compression_level NULL or Integer. The level of compression to use. Only used if compression is one of "gzip", "brotli", or "zstd". Higher compression means smaller files on disk (see the sketch after this list):
  • "gzip": min-level: 0, max-level: 10.
  • "brotli": min-level: 0, max-level: 11.
  • "zstd": min-level: 1, max-level: 22.
statistics Whether statistics should be written to the Parquet headers. Possible values (see the sketch after this list):
  • TRUE: enable the default set of statistics (default).
  • FALSE: disable all statistics.
  • "full": calculate and write all available statistics.
  • A named list where all values must be TRUE or FALSE, e.g. list(min = TRUE, max = FALSE). Available statistics are "min", "max", "distinct_count", and "null_count".
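
A minimal sketch of the named-list form, writing only a chosen subset of statistics (the temporary file path is illustrative):

library("polars")

dat = as_polars_df(mtcars)
dest = tempfile(fileext = ".parquet")

# keep min/max and null counts, skip the more expensive distinct counts
dat$write_parquet(
  dest,
  statistics = list(min = TRUE, max = TRUE, distinct_count = FALSE, null_count = TRUE)
)
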
row_group_size NULL or Integer. Size of the row groups in number of rows. If NULL (default), the chunks of the DataFrame are used. Writing in smaller chunks may reduce memory pressure and improve writing speeds.
data_page_size Size of the data page in bytes. If NULL (default), it is set to 1024^2 bytes (~1 MB).
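
A minimal sketch of forcing small row groups, which can reduce memory pressure when writing large frames (the value 8 is purely illustrative):

library("polars")

dat = as_polars_df(mtcars)
dest = tempfile(fileext = ".parquet")

# force 8-row row groups instead of inheriting the DataFrame's chunk layout
dat$write_parquet(dest, row_group_size = 8)
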
partition_by Column(s) to partition by. A partitioned dataset will be written if this is specified.
partition_chunk_size_bytes Approximate size, in bytes, at which to split DataFrames within a single partition when writing. Note that this is calculated using the in-memory size of the DataFrame; the size of the output file may differ depending on the file format and compression.
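
A minimal sketch of a partitioned write with a lowered per-partition chunk size (the 512 MiB value is illustrative; mtcars is far smaller, so each partition still yields a single file):

library("polars")

dat = as_polars_df(mtcars)
dest_dir = tempfile()
dir.create(dest_dir)

# hive-partition by cylinder count, splitting DataFrames within each
# partition at roughly 512 MiB of in-memory data
dat$write_parquet(
  dest_dir,
  partition_by = "cyl",
  partition_chunk_size_bytes = 512 * 1024^2
)
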

Value

Invisibly returns the input DataFrame.
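
Because the input is returned invisibly, the write can sit in the middle of a method chain; a minimal sketch (the temporary file path is illustrative):

library("polars")

dest = tempfile(fileext = ".parquet")

# write to disk, then keep working with the same DataFrame
as_polars_df(mtcars)$write_parquet(dest)$head(3)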

Examples

library("polars")


dat = as_polars_df(mtcars)

# write data to a single parquet file
destination = withr::local_tempfile(fileext = ".parquet")
dat$write_parquet(destination)

# write data to folder with a hive-partitioned structure
dest_folder = withr::local_tempdir()
dat$write_parquet(dest_folder, partition_by = c("gear", "cyl"))
list.files(dest_folder, recursive = TRUE)
#> [1] "gear=3.0/cyl=4.0/00000000.parquet" "gear=3.0/cyl=6.0/00000000.parquet"
#> [3] "gear=3.0/cyl=8.0/00000000.parquet" "gear=4.0/cyl=4.0/00000000.parquet"
#> [5] "gear=4.0/cyl=6.0/00000000.parquet" "gear=5.0/cyl=4.0/00000000.parquet"
#> [7] "gear=5.0/cyl=6.0/00000000.parquet" "gear=5.0/cyl=8.0/00000000.parquet"