polars.read_parquet

polars.read_parquet(
    source: str | Path | list[str] | list[Path] | IO[bytes] | bytes,
    *,
    columns: list[int] | list[str] | None = None,
    n_rows: int | None = None,
    row_index_name: str | None = None,
    row_index_offset: int = 0,
    parallel: ParallelStrategy = 'auto',
    use_statistics: bool = True,
    hive_partitioning: bool = True,
    rechunk: bool = True,
    low_memory: bool = False,
    storage_options: dict[str, Any] | None = None,
    retries: int = 0,
    use_pyarrow: bool = False,
    pyarrow_options: dict[str, Any] | None = None,
    memory_map: bool = True,
) -> DataFrame
Read into a DataFrame from a parquet file.
- Parameters:
  - source
    Path to a file, or a file-like object (by "file-like object" we refer to objects that have a `read()` method, such as a file handler obtained e.g. via the builtin `open` function, or a `BytesIO` instance). If the path is a directory, all files in that directory are read.
  - columns
    Columns to select. Accepts a list of column indices (starting at zero) or a list of column names.
  - n_rows
    Stop reading from the parquet file after reading `n_rows`. Only valid when `use_pyarrow=False`.
  - row_index_name
    Insert a row index column with the given name into the DataFrame as the first column. If set to `None` (default), no row index column is created.
  - row_index_offset
    Start the row index at this offset. Cannot be negative. Only used if `row_index_name` is set.
  - parallel : {'auto', 'columns', 'row_groups', 'none'}
    This determines the direction of parallelism. 'auto' will try to determine the optimal direction.
  - use_statistics
    Use statistics in the parquet file to determine whether pages can be skipped during reading.
  - hive_partitioning
    Infer statistics and schema from Hive-partitioned URLs and use them to prune reads.
  - rechunk
    Make sure that all columns are contiguous in memory by aggregating the chunks into a single array.
  - low_memory
    Reduce memory pressure at the expense of performance.
  - storage_options
    Options that indicate how to connect to a cloud provider. If the cloud provider is not supported by Polars, the storage options are passed to `fsspec.open()`. The cloud providers currently supported are AWS, GCP, and Azure; see the reference documentation of each provider for the supported keys. If `storage_options` is not provided, Polars will try to infer the information from environment variables.
  - retries
    Number of retries if accessing a cloud instance fails.
  - use_pyarrow
    Use pyarrow instead of the Rust-native parquet reader. The pyarrow reader is more stable.
  - pyarrow_options
    Keyword arguments for `pyarrow.parquet.read_table`.
  - memory_map
    Memory-map the underlying file. This will likely increase performance. Only used when `use_pyarrow=True`.
- Returns:
- DataFrame
Notes
- Partitioned files:
  If you have a directory-nested (hive-style) partitioned dataset, you should use the `scan_pyarrow_dataset()` method instead.
- When benchmarking:
  This operation defaults to a `rechunk` operation at the end, meaning that all data will be stored contiguously in memory. Set `rechunk=False` if you are benchmarking the parquet reader, as `rechunk` can be an expensive operation that should not contribute to the timings.