Lazily read from an Arrow IPC (Feather v2) file or multiple files via glob patterns
Description
Scanning lazily allows the query optimizer to push down predicates and projections to the scan level, potentially reducing memory overhead.
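As a minimal sketch of what pushdown looks like in practice, the query below filters rows and selects two columns before anything is collected. It assumes the arrow package is installed (used here only to write an uncompressed test file) and the temporary file path is made up for the illustration.

```r
library(polars)

# Write a small uncompressed IPC file to scan (assumes the arrow package)
path = tempfile(fileext = ".arrow")
arrow::write_feather(mtcars, path, compression = "uncompressed")

# Build a lazy query: nothing is read until $collect() is called
query = pl$scan_ipc(path)$
  filter(pl$col("cyl") == 4)$
  select("mpg", "cyl")

# Run the query; the filter and column selection are part of the plan
# that the optimizer can push towards the scan
result = query$collect()
```

Because the filter and selection are declared before `$collect()`, the optimizer is free to apply them while scanning rather than after loading the whole file.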
Usage
pl_scan_ipc(
source,
...,
n_rows = NULL,
memory_map = TRUE,
row_index_name = NULL,
row_index_offset = 0L,
rechunk = FALSE,
cache = TRUE,
hive_partitioning = NULL,
hive_schema = NULL,
try_parse_hive_dates = TRUE,
include_file_paths = NULL
)
Arguments
source
|
Path to a file or directory. You can use globbing with * to scan/read
multiple files in the same directory (see examples).
|
...
|
Ignored. |
n_rows
|
Maximum number of rows to read. |
memory_map
|
A logical. If TRUE, try to memory map the file. This can
greatly improve performance on repeated queries as the OS may cache
pages. Only uncompressed Arrow IPC files can be memory mapped.
|
row_index_name
|
If not NULL, this will insert a row index column with the
given name into the DataFrame.
|
row_index_offset
|
Offset to start the row index column (only used if the name is set). |
rechunk
|
In case of reading multiple files via a glob pattern, rechunk the final DataFrame into contiguous memory chunks. |
cache
|
Cache the result after reading. |
hive_partitioning
|
Infer statistics and schema from a Hive-partitioned URL and use them to
prune reads. If NULL (default), it is automatically enabled
when a single directory is passed, and otherwise disabled.
|
hive_schema
|
A list containing the column names and data types of the columns by
which the data is partitioned, e.g. list(a = pl$String, b = pl$Float32).
If NULL (default), the schema of the Hive partitions is inferred.
|
try_parse_hive_dates
|
Whether to try parsing hive values as date/datetime types. |
include_file_paths
|
Character value indicating the column name that will include the path of the source file(s). |
Details
Hive-style partitioning is supported through the hive_partitioning, hive_schema, and try_parse_hive_dates arguments (see examples).
Value
LazyFrame
Examples
library("polars")
temp_dir = tempfile()
# Write a hive-style partitioned arrow file dataset
arrow::write_dataset(
mtcars,
temp_dir,
partitioning = c("cyl", "gear"),
format = "arrow",
hive_style = TRUE
)
list.files(temp_dir, recursive = TRUE)
#> [1] "cyl=4/gear=3/part-0.arrow" "cyl=4/gear=4/part-0.arrow"
#> [3] "cyl=4/gear=5/part-0.arrow" "cyl=6/gear=3/part-0.arrow"
#> [5] "cyl=6/gear=4/part-0.arrow" "cyl=6/gear=5/part-0.arrow"
#> [7] "cyl=8/gear=3/part-0.arrow" "cyl=8/gear=5/part-0.arrow"
# If the path is a folder, Polars automatically tries to detect partitions
# and includes them in the output
pl$scan_ipc(temp_dir)$collect()
#> shape: (32, 11)
#> ┌──────┬───────┬───────┬──────┬───┬─────┬──────┬─────┬──────┐
#> │ mpg ┆ disp ┆ hp ┆ drat ┆ … ┆ am ┆ carb ┆ cyl ┆ gear │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ i64 ┆ i64 │
#> ╞══════╪═══════╪═══════╪══════╪═══╪═════╪══════╪═════╪══════╡
#> │ 21.5 ┆ 120.1 ┆ 97.0 ┆ 3.7 ┆ … ┆ 0.0 ┆ 1.0 ┆ 4 ┆ 3 │
#> │ 22.8 ┆ 108.0 ┆ 93.0 ┆ 3.85 ┆ … ┆ 1.0 ┆ 1.0 ┆ 4 ┆ 4 │
#> │ 24.4 ┆ 146.7 ┆ 62.0 ┆ 3.69 ┆ … ┆ 0.0 ┆ 2.0 ┆ 4 ┆ 4 │
#> │ 22.8 ┆ 140.8 ┆ 95.0 ┆ 3.92 ┆ … ┆ 0.0 ┆ 2.0 ┆ 4 ┆ 4 │
#> │ 32.4 ┆ 78.7 ┆ 66.0 ┆ 4.08 ┆ … ┆ 1.0 ┆ 1.0 ┆ 4 ┆ 4 │
#> │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
#> │ 15.2 ┆ 304.0 ┆ 150.0 ┆ 3.15 ┆ … ┆ 0.0 ┆ 2.0 ┆ 8 ┆ 3 │
#> │ 13.3 ┆ 350.0 ┆ 245.0 ┆ 3.73 ┆ … ┆ 0.0 ┆ 4.0 ┆ 8 ┆ 3 │
#> │ 19.2 ┆ 400.0 ┆ 175.0 ┆ 3.08 ┆ … ┆ 0.0 ┆ 2.0 ┆ 8 ┆ 3 │
#> │ 15.8 ┆ 351.0 ┆ 264.0 ┆ 4.22 ┆ … ┆ 1.0 ┆ 4.0 ┆ 8 ┆ 5 │
#> │ 15.0 ┆ 301.0 ┆ 335.0 ┆ 3.54 ┆ … ┆ 1.0 ┆ 8.0 ┆ 8 ┆ 5 │
#> └──────┴───────┴───────┴──────┴───┴─────┴──────┴─────┴──────┘
# We can also impose a schema on the partition columns
pl$scan_ipc(temp_dir, hive_schema = list(cyl = pl$String, gear = pl$Int32))$collect()
#> shape: (32, 11)
#> ┌──────┬───────┬───────┬──────┬───┬─────┬──────┬─────┬──────┐
#> │ mpg ┆ disp ┆ hp ┆ drat ┆ … ┆ am ┆ carb ┆ cyl ┆ gear │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ f64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ str ┆ i32 │
#> ╞══════╪═══════╪═══════╪══════╪═══╪═════╪══════╪═════╪══════╡
#> │ 21.5 ┆ 120.1 ┆ 97.0 ┆ 3.7 ┆ … ┆ 0.0 ┆ 1.0 ┆ 4 ┆ 3 │
#> │ 22.8 ┆ 108.0 ┆ 93.0 ┆ 3.85 ┆ … ┆ 1.0 ┆ 1.0 ┆ 4 ┆ 4 │
#> │ 24.4 ┆ 146.7 ┆ 62.0 ┆ 3.69 ┆ … ┆ 0.0 ┆ 2.0 ┆ 4 ┆ 4 │
#> │ 22.8 ┆ 140.8 ┆ 95.0 ┆ 3.92 ┆ … ┆ 0.0 ┆ 2.0 ┆ 4 ┆ 4 │
#> │ 32.4 ┆ 78.7 ┆ 66.0 ┆ 4.08 ┆ … ┆ 1.0 ┆ 1.0 ┆ 4 ┆ 4 │
#> │ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
#> │ 15.2 ┆ 304.0 ┆ 150.0 ┆ 3.15 ┆ … ┆ 0.0 ┆ 2.0 ┆ 8 ┆ 3 │
#> │ 13.3 ┆ 350.0 ┆ 245.0 ┆ 3.73 ┆ … ┆ 0.0 ┆ 4.0 ┆ 8 ┆ 3 │
#> │ 19.2 ┆ 400.0 ┆ 175.0 ┆ 3.08 ┆ … ┆ 0.0 ┆ 2.0 ┆ 8 ┆ 3 │
#> │ 15.8 ┆ 351.0 ┆ 264.0 ┆ 4.22 ┆ … ┆ 1.0 ┆ 4.0 ┆ 8 ┆ 5 │
#> │ 15.0 ┆ 301.0 ┆ 335.0 ┆ 3.54 ┆ … ┆ 1.0 ┆ 8.0 ┆ 8 ┆ 5 │
#> └──────┴───────┴───────┴──────┴───┴─────┴──────┴─────┴──────┘
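The glob, row_index_name, and include_file_paths arguments are not exercised above. The following is a minimal sketch of how they might be combined, assuming the arrow package is available. The directory, the file names, and the column names "row_nr" and "file" are made up for this illustration.

```r
library(polars)

# Split mtcars into two uncompressed IPC files in one temporary directory
# (directory and file names are hypothetical)
data_dir = tempfile()
dir.create(data_dir)
arrow::write_feather(
  mtcars[1:16, ],
  file.path(data_dir, "part-1.arrow"),
  compression = "uncompressed"
)
arrow::write_feather(
  mtcars[17:32, ],
  file.path(data_dir, "part-2.arrow"),
  compression = "uncompressed"
)

# Scan both files with a glob pattern, adding a row index column
# and a column holding the path of the source file for each row
lf = pl$scan_ipc(
  file.path(data_dir, "*.arrow"),
  row_index_name = "row_nr",
  include_file_paths = "file"
)
out = lf$collect()
```

The resulting DataFrame should hold the rows of both files, with the two extra columns added to the original ones.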