polars.read_parquet

polars.read_parquet(
source: str | Path | list[str] | list[Path] | IO[bytes] | bytes,
*,
columns: list[int] | list[str] | None = None,
n_rows: int | None = None,
row_index_name: str | None = None,
row_index_offset: int = 0,
parallel: ParallelStrategy = 'auto',
use_statistics: bool = True,
hive_partitioning: bool = True,
rechunk: bool = True,
low_memory: bool = False,
storage_options: dict[str, Any] | None = None,
retries: int = 0,
use_pyarrow: bool = False,
pyarrow_options: dict[str, Any] | None = None,
memory_map: bool = True,
) -> DataFrame

Read into a DataFrame from a parquet file.
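
For example, a minimal sketch (the file and directory names below are hypothetical):

>>> import polars as pl
>>> df = pl.read_parquet("data.parquet")  # read a single local file
>>> df = pl.read_parquet("my_dir/")  # a directory: every file inside it is read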

Parameters:
source

Path to a file, or a file-like object (by file-like object we refer to objects that have a read() method, such as a file handle obtained via the builtin open function, or a BytesIO instance). If the path is a directory, all files in that directory will be read.

columns

Columns to select. Accepts a list of column indices (starting at zero) or a list of column names.

n_rows

Stop reading from the parquet file after reading n_rows rows. Only valid when use_pyarrow=False.
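
For example, a sketch combining columns and n_rows (the file name, column names, and row count are illustrative):

>>> import polars as pl
>>> df = pl.read_parquet("data.parquet", columns=["a", "b"])  # select by name
>>> df = pl.read_parquet("data.parquet", columns=[0, 2])  # or by zero-based index
>>> df = pl.read_parquet("data.parquet", n_rows=1000)  # stop after 1000 rows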

row_index_name

Insert a row index column with the given name into the DataFrame as the first column. If set to None (default), no row index column is created.

row_index_offset

Start the row index at this offset. Cannot be negative. Only used if row_index_name is set.
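
For example, a sketch of adding a row index (the column name and offset are arbitrary):

>>> import polars as pl
>>> df = pl.read_parquet("data.parquet", row_index_name="idx", row_index_offset=100)
>>> # the first column is now "idx", counting 100, 101, 102, ...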

parallel : {‘auto’, ‘columns’, ‘row_groups’, ‘none’}

This determines the direction of parallelism. ‘auto’ will try to determine the optimal direction.

use_statistics

Use statistics in the parquet file to determine whether pages can be skipped during reading.

hive_partitioning

Infer statistics and schema from a hive-partitioned URL and use them to prune reads.

rechunk

Make sure that all columns are contiguous in memory by aggregating the chunks into a single array.

low_memory

Reduce memory pressure at the expense of performance.

storage_options

Options that indicate how to connect to a cloud provider. If the cloud provider is not supported by Polars, the storage options are passed to fsspec.open().

The cloud providers currently supported are AWS, GCP, and Azure; see the respective provider documentation for the supported configuration keys.

If storage_options is not provided, Polars will try to infer the information from environment variables.
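
For example, a sketch of a cloud read; the bucket path is hypothetical and the option keys shown are illustrative AWS settings (other providers use different keys):

>>> import polars as pl
>>> df = pl.read_parquet(
...     "s3://my-bucket/data.parquet",
...     storage_options={
...         "aws_access_key_id": "...",
...         "aws_secret_access_key": "...",
...         "aws_region": "us-east-1",
...     },
... )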

retries

Number of retries if accessing a cloud instance fails.

use_pyarrow

Use pyarrow instead of the Rust-native parquet reader. The pyarrow reader is more stable.

pyarrow_options

Keyword arguments for pyarrow.parquet.read_table.

memory_map

Memory map the underlying file. This will likely increase performance. Only used when use_pyarrow=True.
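
For example, a sketch of delegating to pyarrow; the filters expression assumes a hypothetical "year" column and is forwarded unchanged to pyarrow.parquet.read_table:

>>> import polars as pl
>>> df = pl.read_parquet(
...     "data.parquet",
...     use_pyarrow=True,
...     memory_map=True,  # only honoured when use_pyarrow=True
...     pyarrow_options={"filters": [("year", "=", 2023)]},
... )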

Returns:
DataFrame

Notes

  • Partitioned files:

    If you have a directory-nested (hive-style) partitioned dataset, you should use the scan_pyarrow_dataset() method instead (see the sketch after these notes).

  • When benchmarking:

    This operation defaults to a rechunk operation at the end, meaning that all data will be stored contiguously in memory. Set rechunk=False if you are benchmarking the parquet reader, as rechunking can be an expensive operation that should not contribute to the timings (a sketch follows these notes).
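
For a hive-partitioned directory, a sketch using scan_pyarrow_dataset() (the directory layout is hypothetical):

>>> import polars as pl
>>> import pyarrow.dataset as ds
>>> dset = ds.dataset("my_table/", format="parquet", partitioning="hive")
>>> df = pl.scan_pyarrow_dataset(dset).collect()

And when timing the reader itself, the final rechunk can be skipped:

>>> df = pl.read_parquet("data.parquet", rechunk=False)  # exclude rechunking from the benchmark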