Skip to content

Lazily read from an Arrow IPC (Feather v2) file or multiple files via glob patterns

Description

This allows the query optimizer to push down predicates and projections to the scan level, thereby potentially reducing memory overhead.

Usage

pl_scan_ipc(
  source,
  ...,
  n_rows = NULL,
  memory_map = TRUE,
  row_index_name = NULL,
  row_index_offset = 0L,
  rechunk = FALSE,
  cache = TRUE,
  hive_partitioning = NULL,
  hive_schema = NULL,
  try_parse_hive_dates = TRUE,
  include_file_paths = NULL
)

Arguments

source Path to a file. You can use globbing with \* to scan/read multiple files in the same directory (see examples).
Ignored.
n_rows Maximum number of rows to read.
memory_map A logical. If TRUE, try to memory map the file. This can greatly improve performance on repeated queries as the OS may cache pages. Only uncompressed Arrow IPC files can be memory mapped.
row_index_name If not NULL, this will insert a row index column with the given name into the DataFrame.
row_index_offset Offset to start the row index column (only used if the name is set).
rechunk In case of reading multiple files via a glob pattern, rechunk the final DataFrame into contiguous memory chunks.
cache Cache the result after reading.
hive_partitioning Infer statistics and schema from Hive partitioned URL and use them to prune reads. If NULL (default), it is automatically enabled when a single directory is passed, and otherwise disabled.
hive_schema A list containing the column names and data types of the columns by which the data is partitioned, e.g. list(a = pl$String, b = pl$Float32). If NULL (default), the schema of the Hive partitions is inferred.
try_parse_hive_dates Whether to try parsing hive values as date/datetime types.
include_file_paths Character value indicating the column name that will include the path of the source file(s).

Details

Hive-style partitioning is not supported yet.

Value

LazyFrame

Examples

library("polars")


temp_dir = tempfile()
# Write a hive-style partitioned arrow file dataset
arrow::write_dataset(
  mtcars,
  temp_dir,
  partitioning = c("cyl", "gear"),
  format = "arrow",
  hive_style = TRUE
)
list.files(temp_dir, recursive = TRUE)
#> [1] "cyl=4/gear=3/part-0.arrow" "cyl=4/gear=4/part-0.arrow"
#> [3] "cyl=4/gear=5/part-0.arrow" "cyl=6/gear=3/part-0.arrow"
#> [5] "cyl=6/gear=4/part-0.arrow" "cyl=6/gear=5/part-0.arrow"
#> [7] "cyl=8/gear=3/part-0.arrow" "cyl=8/gear=5/part-0.arrow"
# If the path is a folder, Polars automatically tries to detect partitions
# and includes them in the output
pl$scan_ipc(temp_dir)$collect()
#> shape: (32, 11)
#> ┌──────┬───────┬───────┬──────┬───┬─────┬──────┬─────┬──────┐
#> │ mpg  ┆ disp  ┆ hp    ┆ drat ┆ … ┆ am  ┆ carb ┆ cyl ┆ gear │
#> │ ---  ┆ ---   ┆ ---   ┆ ---  ┆   ┆ --- ┆ ---  ┆ --- ┆ ---  │
#> │ f64  ┆ f64   ┆ f64   ┆ f64  ┆   ┆ f64 ┆ f64  ┆ i64 ┆ i64  │
#> ╞══════╪═══════╪═══════╪══════╪═══╪═════╪══════╪═════╪══════╡
#> │ 21.5 ┆ 120.1 ┆ 97.0  ┆ 3.7  ┆ … ┆ 0.0 ┆ 1.0  ┆ 4   ┆ 3    │
#> │ 22.8 ┆ 108.0 ┆ 93.0  ┆ 3.85 ┆ … ┆ 1.0 ┆ 1.0  ┆ 4   ┆ 4    │
#> │ 24.4 ┆ 146.7 ┆ 62.0  ┆ 3.69 ┆ … ┆ 0.0 ┆ 2.0  ┆ 4   ┆ 4    │
#> │ 22.8 ┆ 140.8 ┆ 95.0  ┆ 3.92 ┆ … ┆ 0.0 ┆ 2.0  ┆ 4   ┆ 4    │
#> │ 32.4 ┆ 78.7  ┆ 66.0  ┆ 4.08 ┆ … ┆ 1.0 ┆ 1.0  ┆ 4   ┆ 4    │
#> │ …    ┆ …     ┆ …     ┆ …    ┆ … ┆ …   ┆ …    ┆ …   ┆ …    │
#> │ 15.2 ┆ 304.0 ┆ 150.0 ┆ 3.15 ┆ … ┆ 0.0 ┆ 2.0  ┆ 8   ┆ 3    │
#> │ 13.3 ┆ 350.0 ┆ 245.0 ┆ 3.73 ┆ … ┆ 0.0 ┆ 4.0  ┆ 8   ┆ 3    │
#> │ 19.2 ┆ 400.0 ┆ 175.0 ┆ 3.08 ┆ … ┆ 0.0 ┆ 2.0  ┆ 8   ┆ 3    │
#> │ 15.8 ┆ 351.0 ┆ 264.0 ┆ 4.22 ┆ … ┆ 1.0 ┆ 4.0  ┆ 8   ┆ 5    │
#> │ 15.0 ┆ 301.0 ┆ 335.0 ┆ 3.54 ┆ … ┆ 1.0 ┆ 8.0  ┆ 8   ┆ 5    │
#> └──────┴───────┴───────┴──────┴───┴─────┴──────┴─────┴──────┘
# We can also impose a schema to the partition
pl$scan_ipc(temp_dir, hive_schema = list(cyl = pl$String, gear = pl$Int32))$collect()
#> shape: (32, 11)
#> ┌──────┬───────┬───────┬──────┬───┬─────┬──────┬─────┬──────┐
#> │ mpg  ┆ disp  ┆ hp    ┆ drat ┆ … ┆ am  ┆ carb ┆ cyl ┆ gear │
#> │ ---  ┆ ---   ┆ ---   ┆ ---  ┆   ┆ --- ┆ ---  ┆ --- ┆ ---  │
#> │ f64  ┆ f64   ┆ f64   ┆ f64  ┆   ┆ f64 ┆ f64  ┆ str ┆ i32  │
#> ╞══════╪═══════╪═══════╪══════╪═══╪═════╪══════╪═════╪══════╡
#> │ 21.5 ┆ 120.1 ┆ 97.0  ┆ 3.7  ┆ … ┆ 0.0 ┆ 1.0  ┆ 4   ┆ 3    │
#> │ 22.8 ┆ 108.0 ┆ 93.0  ┆ 3.85 ┆ … ┆ 1.0 ┆ 1.0  ┆ 4   ┆ 4    │
#> │ 24.4 ┆ 146.7 ┆ 62.0  ┆ 3.69 ┆ … ┆ 0.0 ┆ 2.0  ┆ 4   ┆ 4    │
#> │ 22.8 ┆ 140.8 ┆ 95.0  ┆ 3.92 ┆ … ┆ 0.0 ┆ 2.0  ┆ 4   ┆ 4    │
#> │ 32.4 ┆ 78.7  ┆ 66.0  ┆ 4.08 ┆ … ┆ 1.0 ┆ 1.0  ┆ 4   ┆ 4    │
#> │ …    ┆ …     ┆ …     ┆ …    ┆ … ┆ …   ┆ …    ┆ …   ┆ …    │
#> │ 15.2 ┆ 304.0 ┆ 150.0 ┆ 3.15 ┆ … ┆ 0.0 ┆ 2.0  ┆ 8   ┆ 3    │
#> │ 13.3 ┆ 350.0 ┆ 245.0 ┆ 3.73 ┆ … ┆ 0.0 ┆ 4.0  ┆ 8   ┆ 3    │
#> │ 19.2 ┆ 400.0 ┆ 175.0 ┆ 3.08 ┆ … ┆ 0.0 ┆ 2.0  ┆ 8   ┆ 3    │
#> │ 15.8 ┆ 351.0 ┆ 264.0 ┆ 4.22 ┆ … ┆ 1.0 ┆ 4.0  ┆ 8   ┆ 5    │
#> │ 15.0 ┆ 301.0 ┆ 335.0 ┆ 3.54 ┆ … ┆ 1.0 ┆ 8.0  ┆ 8   ┆ 5    │
#> └──────┴───────┴───────┴──────┴───┴─────┴──────┴─────┴──────┘