polars.scan_pyarrow_dataset

polars.scan_pyarrow_dataset(
source: pa.dataset.Dataset,
*,
allow_pyarrow_filter: bool = True,
batch_size: int | None = None,
) → LazyFrame

Scan a pyarrow dataset.

Warning

This functionality is considered unstable. It may be changed at any point without it being considered a breaking change.

This can be useful to connect to cloud or partitioned datasets.

Parameters:
source

Pyarrow dataset to scan.

allow_pyarrow_filter

Allow predicates to be pushed down to PyArrow. This can lead to different results if comparisons involve null values, as PyArrow handles nulls differently than Polars does.

batch_size

The maximum row count for scanned pyarrow record batches.

Warning

This method can only push down predicates that are allowed by PyArrow (i.e. not the full Polars API).

If scan_parquet() works for your source, you should use that instead.

Notes

When using partitioning, the appropriate partitioning option must be set on pyarrow.dataset.dataset before passing it to Polars; otherwise the partitioned-on column(s) may not be passed through to Polars.

Examples

>>> import polars as pl
>>> import pyarrow.dataset as ds
>>> dset = ds.dataset("s3://my-partitioned-folder/", format="ipc")  
>>> (
...     pl.scan_pyarrow_dataset(dset)
...     .filter("bools")
...     .select("bools", "floats", "date")
...     .collect()
... )  
shape: (1, 3)
┌───────┬────────┬────────────┐
│ bools ┆ floats ┆ date       │
│ ---   ┆ ---    ┆ ---        │
│ bool  ┆ f64    ┆ date       │
╞═══════╪════════╪════════════╡
│ true  ┆ 2.0    ┆ 1970-05-04 │
└───────┴────────┴────────────┘