Scan a parquet file
Description
Scan a parquet file
Usage
pl_scan_parquet(
source,
...,
n_rows = NULL,
row_index_name = NULL,
row_index_offset = 0L,
parallel = c("auto", "columns", "row_groups", "prefiltered", "none"),
hive_partitioning = NULL,
hive_schema = NULL,
try_parse_hive_dates = TRUE,
glob = TRUE,
schema = NULL,
rechunk = FALSE,
low_memory = FALSE,
storage_options = NULL,
use_statistics = TRUE,
cache = TRUE,
include_file_paths = NULL,
allow_missing_columns = FALSE
)
Arguments
source
    Path to a file. You can use globbing with * to scan/read multiple files in the same directory (see examples).

...
    Ignored.

n_rows
    Maximum number of rows to read.

row_index_name
    If not NULL, this will insert a row index column with the given name into the DataFrame.

row_index_offset
    Offset to start the row index column (only used if the name is set).

parallel
    This determines the direction of parallelism. "auto" will try to determine the optimal direction. Can be "auto", "columns", "row_groups", "prefiltered", or "none". See 'Details'.

hive_partitioning
    Infer statistics and schema from Hive partitioned URL and use them to prune reads. If NULL (default), it is automatically enabled when a single directory is passed, and otherwise disabled.

hive_schema
    A list containing the column names and data types of the columns by which the data is partitioned, e.g. list(a = pl$String, b = pl$Float32). If NULL (default), the schema of the Hive partitions is inferred.

try_parse_hive_dates
    Whether to try parsing hive values as date/datetime types.

glob
    Expand the path given via globbing rules.

schema
    Specify the datatypes of the columns. The datatypes must match the datatypes in the file(s). If there are extra columns that are not in the file(s), consider also enabling allow_missing_columns.

rechunk
    In case of reading multiple files via a glob pattern, rechunk the final DataFrame into contiguous memory chunks.

low_memory
    Reduce memory usage (will yield lower performance).

storage_options
    Experimental. List of options necessary to scan parquet files from different cloud storage providers (GCP, AWS, Azure, HuggingFace). See the 'Details' section.

use_statistics
    Use statistics in the parquet file to determine if pages can be skipped from reading.

cache
    Cache the result after reading.

include_file_paths
    Include the path of the source file(s) as a column with this name.

allow_missing_columns
    When reading a list of parquet files, if a column existing in the first file cannot be found in subsequent files, the default behavior is to raise an error. However, if allow_missing_columns is set to TRUE, a full-NULL column is returned instead of erroring for the files that do not contain the column.
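For illustration, a hedged sketch combining a few of these arguments (the file path is hypothetical):

pl$scan_parquet(
  "data.parquet",          # hypothetical path
  n_rows = 10,             # read at most 10 rows
  row_index_name = "idx",  # prepend a row index column named "idx"
  row_index_offset = 1L    # start the index at 1 instead of 0
)$collect()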
Details
On parallel strategies
The prefiltered strategy first evaluates the pushed-down predicates in parallel and determines a mask of which rows to read. Then, it parallelizes over both the columns and the row groups while filtering out rows that do not need to be read. This can provide significant speedups for large files (i.e. many row groups) with a predicate that filters clustered rows or filters heavily. In other cases, prefiltered may slow down the scan compared to other strategies.
The prefiltered setting falls back to auto if no predicate is given.
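For illustration, a sketch of selecting the strategy explicitly while a predicate is pushed down (the file path and column name are hypothetical):

pl$scan_parquet(
  "large_file.parquet",    # hypothetical file with many row groups
  parallel = "prefiltered"
)$filter(pl$col("cyl") == 6)$collect()  # the filter is pushed down into the scan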
Connecting to cloud providers
Polars supports scanning parquet files from different cloud providers. The cloud providers currently supported are AWS, GCP, and Azure. The keys that can be passed to the storage_options argument depend on the provider; see the upstream Polars documentation for the full list of supported keys.
Currently it is impossible to scan public parquet files from GCP without a valid service account. Be sure to always include a service account in the storage_options argument.
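As a hedged sketch (the bucket, path, and credential values are placeholders, and the key names assume AWS-style options are accepted by storage_options):

pl$scan_parquet(
  "s3://my-bucket/data/*.parquet",  # hypothetical bucket and path
  storage_options = list(
    aws_access_key_id = Sys.getenv("AWS_ACCESS_KEY_ID"),
    aws_secret_access_key = Sys.getenv("AWS_SECRET_ACCESS_KEY"),
    aws_region = "us-east-1"        # illustrative region
  )
)$collect()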
Scanning from HuggingFace
It is possible to scan data stored on HuggingFace using a path starting with hf://. The hf:// path format is defined as hf://BUCKET/REPOSITORY@REVISION/PATH, where:
- BUCKET is one of datasets or spaces.
- REPOSITORY is the location of the repository. This is usually in the format of username/repo_name. A branch can also be optionally specified by appending @branch.
- REVISION is the name of the branch (or commit) to use. This is optional and defaults to main if not given.
- PATH is a file or directory path, or a glob pattern from the repository root.
A Hugging Face API key can be passed to access private locations using either of the following methods:
- Passing a token in storage_options to the scan function, e.g. scan_parquet(..., storage_options = list(token = <your HF token>)).
- Setting the HF_TOKEN environment variable, e.g. Sys.setenv(HF_TOKEN = <your HF token>).
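A minimal sketch of scanning from HuggingFace (the repository and file path are hypothetical; the token is only needed for private locations):

pl$scan_parquet(
  "hf://datasets/some-user/some-repo/data/*.parquet",     # hypothetical location
  storage_options = list(token = Sys.getenv("HF_TOKEN"))  # only needed for private repos
)$collect()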
Value
LazyFrame
Examples
library("polars")
# Write a Parquet file that we can then import as a DataFrame
temp_file = withr::local_tempfile(fileext = ".parquet")
as_polars_df(mtcars)$write_parquet(temp_file)
pl$scan_parquet(temp_file)$collect()
#> shape: (32, 11)
#> ┌──────┬─────┬───────┬───────┬───┬─────┬─────┬──────┬──────┐
#> │ mpg ┆ cyl ┆ disp ┆ hp ┆ … ┆ vs ┆ am ┆ gear ┆ carb │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
#> ╞══════╪═════╪═══════╪═══════╪═══╪═════╪═════╪══════╪══════╡
#> │ 21.0 ┆ 6.0 ┆ 160.0 ┆ 110.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 4.0 ┆ 4.0 │
#> │ 21.0 ┆ 6.0 ┆ 160.0 ┆ 110.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 4.0 ┆ 4.0 │
#> │ 22.8 ┆ 4.0 ┆ 108.0 ┆ 93.0 ┆ … ┆ 1.0 ┆ 1.0 ┆ 4.0 ┆ 1.0 │
#> │ 21.4 ┆ 6.0 ┆ 258.0 ┆ 110.0 ┆ … ┆ 1.0 ┆ 0.0 ┆ 3.0 ┆ 1.0 │
#> │ 18.7 ┆ 8.0 ┆ 360.0 ┆ 175.0 ┆ … ┆ 0.0 ┆ 0.0 ┆ 3.0 ┆ 2.0 │
#> │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
#> │ 30.4 ┆ 4.0 ┆ 95.1 ┆ 113.0 ┆ … ┆ 1.0 ┆ 1.0 ┆ 5.0 ┆ 2.0 │
#> │ 15.8 ┆ 8.0 ┆ 351.0 ┆ 264.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 5.0 ┆ 4.0 │
#> │ 19.7 ┆ 6.0 ┆ 145.0 ┆ 175.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 5.0 ┆ 6.0 │
#> │ 15.0 ┆ 8.0 ┆ 301.0 ┆ 335.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 5.0 ┆ 8.0 │
#> │ 21.4 ┆ 4.0 ┆ 121.0 ┆ 109.0 ┆ … ┆ 1.0 ┆ 1.0 ┆ 4.0 ┆ 2.0 │
#> └──────┴─────┴───────┴───────┴───┴─────┴─────┴──────┴──────┘
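# A hedged sketch of globbing: the directory and file names below are created
# here purely for illustration
temp_dir_glob = withr::local_tempdir()
as_polars_df(mtcars[1:16, ])$write_parquet(file.path(temp_dir_glob, "part-1.parquet"))
as_polars_df(mtcars[17:32, ])$write_parquet(file.path(temp_dir_glob, "part-2.parquet"))
pl$scan_parquet(file.path(temp_dir_glob, "*.parquet"))$collect()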
# Write a hive-style partitioned parquet dataset
temp_dir = withr::local_tempdir()
as_polars_df(mtcars)$write_parquet(temp_dir, partition_by = c("cyl", "gear"))
list.files(temp_dir, recursive = TRUE)
#> [1] "cyl=4.0/gear=3.0/00000000.parquet" "cyl=4.0/gear=4.0/00000000.parquet"
#> [3] "cyl=4.0/gear=5.0/00000000.parquet" "cyl=6.0/gear=3.0/00000000.parquet"
#> [5] "cyl=6.0/gear=4.0/00000000.parquet" "cyl=6.0/gear=5.0/00000000.parquet"
#> [7] "cyl=8.0/gear=3.0/00000000.parquet" "cyl=8.0/gear=5.0/00000000.parquet"
# If the path is a folder, Polars automatically tries to detect partitions
# and includes them in the output
pl$scan_parquet(temp_dir)$collect()
#> shape: (32, 11)
#> ┌──────┬─────┬───────┬───────┬───┬─────┬─────┬──────┬──────┐
#> │ mpg ┆ cyl ┆ disp ┆ hp ┆ … ┆ vs ┆ am ┆ gear ┆ carb │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
#> ╞══════╪═════╪═══════╪═══════╪═══╪═════╪═════╪══════╪══════╡
#> │ 21.5 ┆ 4.0 ┆ 120.1 ┆ 97.0 ┆ … ┆ 1.0 ┆ 0.0 ┆ 3.0 ┆ 1.0 │
#> │ 22.8 ┆ 4.0 ┆ 108.0 ┆ 93.0 ┆ … ┆ 1.0 ┆ 1.0 ┆ 4.0 ┆ 1.0 │
#> │ 24.4 ┆ 4.0 ┆ 146.7 ┆ 62.0 ┆ … ┆ 1.0 ┆ 0.0 ┆ 4.0 ┆ 2.0 │
#> │ 22.8 ┆ 4.0 ┆ 140.8 ┆ 95.0 ┆ … ┆ 1.0 ┆ 0.0 ┆ 4.0 ┆ 2.0 │
#> │ 32.4 ┆ 4.0 ┆ 78.7 ┆ 66.0 ┆ … ┆ 1.0 ┆ 1.0 ┆ 4.0 ┆ 1.0 │
#> │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
#> │ 15.2 ┆ 8.0 ┆ 304.0 ┆ 150.0 ┆ … ┆ 0.0 ┆ 0.0 ┆ 3.0 ┆ 2.0 │
#> │ 13.3 ┆ 8.0 ┆ 350.0 ┆ 245.0 ┆ … ┆ 0.0 ┆ 0.0 ┆ 3.0 ┆ 4.0 │
#> │ 19.2 ┆ 8.0 ┆ 400.0 ┆ 175.0 ┆ … ┆ 0.0 ┆ 0.0 ┆ 3.0 ┆ 2.0 │
#> │ 15.8 ┆ 8.0 ┆ 351.0 ┆ 264.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 5.0 ┆ 4.0 │
#> │ 15.0 ┆ 8.0 ┆ 301.0 ┆ 335.0 ┆ … ┆ 0.0 ┆ 1.0 ┆ 5.0 ┆ 8.0 │
#> └──────┴─────┴───────┴───────┴───┴─────┴─────┴──────┴──────┘
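# A hedged sketch of hive_schema: read the partition columns as Float32 instead
# of the inferred Float64 (the choice of Float32 here is purely illustrative)
pl$scan_parquet(
  temp_dir,
  hive_schema = list(cyl = pl$Float32, gear = pl$Float32)
)$collect()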