polars.scan_delta
- polars.scan_delta(table_uri: str, version: int | None = None, raw_filesystem: pa.fs.FileSystem | None = None, storage_options: dict[str, Any] | None = None, delta_table_options: dict[str, Any] | None = None, pyarrow_options: dict[str, Any] | None = None) → LazyFrame
Lazily read from a Delta Lake table.
- Parameters:
- table_uri
Path or URI to the root of the Delta Lake table.
Note: For the local filesystem, both absolute and relative paths are supported. The supported object stores (GCS, Azure, and S3) do not support relative paths, so a full URI must be provided.
- version
Version of the Delta Lake table.
Note: If version is not provided, the latest version of the Delta Lake table is read.
- raw_filesystem
A pyarrow.fs.FileSystem to read files from.
Note: The root of the filesystem has to be adjusted to point at the root of the Delta Lake table. The provided raw_filesystem is wrapped into a pyarrow.fs.SubTreeFileSystem. See the pyarrow.fs documentation for more info.
- storage_options
Extra options for the storage backends supported by deltalake. For cloud storage, this may include configuration for authentication, etc.
See the deltalake documentation for the supported storage options.
- delta_table_options
Additional keyword arguments used when reading the Delta Lake table.
- pyarrow_options
Keyword arguments used when converting the Delta Lake table to a pyarrow table. Use this parameter to filter on partition columns.
- Returns:
- LazyFrame
Examples
Creates a scan for a Delta table on the local filesystem. Note: since version is not provided, the latest version of the Delta table is read.
>>> table_path = "/path/to/delta-table/"
>>> pl.scan_delta(table_path).collect()
Use the pyarrow_options parameter to read only certain partitions. Note: this should be preferred over an equivalent .filter() on the resulting LazyFrame, as it avoids reading the excluded partitions at all.
>>> pl.scan_delta(
...     table_path,
...     pyarrow_options={"partitions": [("year", "=", "2021")]},
... )
Creates a scan for a specific version of the Delta table from the local filesystem. Note: this will fail if the provided version of the Delta table does not exist.
>>> pl.scan_delta(table_path, version=1).collect()
Creates a scan for a Delta table from AWS S3. See the deltalake documentation for the list of supported storage options for S3.
>>> table_path = "s3://bucket/path/to/delta-table/"
>>> storage_options = {
...     "AWS_ACCESS_KEY_ID": "THE_AWS_ACCESS_KEY_ID",
...     "AWS_SECRET_ACCESS_KEY": "THE_AWS_SECRET_ACCESS_KEY",
... }
>>> pl.scan_delta(table_path, storage_options=storage_options).collect()
Creates a scan for a Delta table from Google Cloud storage (GCS).
Note: This implementation relies on pyarrow.fs and thus on fsspec-compatible filesystems, so please ensure that pyarrow, fsspec, and gcsfs are installed.
See the deltalake documentation for the list of supported storage options for GCS.
>>> import gcsfs
>>> from pyarrow.fs import PyFileSystem, FSSpecHandler
>>> storage_options = {"SERVICE_ACCOUNT": "SERVICE_ACCOUNT_JSON_ABSOLUTE_PATH"}
>>> fs = gcsfs.GCSFileSystem(
...     project="my-project-id",
...     token=storage_options["SERVICE_ACCOUNT"],
... )
>>> # this pyarrow fs must be created and passed to scan_delta for GCS
>>> pa_fs = PyFileSystem(FSSpecHandler(fs))
>>> table_path = "gs://bucket/path/to/delta-table/"
>>> pl.scan_delta(
...     table_path, storage_options=storage_options, raw_filesystem=pa_fs
... ).collect()
Creates a scan for a Delta table from Azure.
Note: This implementation relies on pyarrow.fs and thus on fsspec-compatible filesystems, so please ensure that pyarrow, fsspec, and adlfs are installed.
The following types of table paths are supported:
az://<container>/<path>
adl://<container>/<path>
abfs://<container>/<path>
See the deltalake documentation for the list of supported storage options for Azure.
>>> import adlfs
>>> from pyarrow.fs import PyFileSystem, FSSpecHandler
>>> storage_options = {
...     "AZURE_STORAGE_ACCOUNT_NAME": "AZURE_STORAGE_ACCOUNT_NAME",
...     "AZURE_STORAGE_ACCOUNT_KEY": "AZURE_STORAGE_ACCOUNT_KEY",
... }
>>> fs = adlfs.AzureBlobFileSystem(
...     account_name=storage_options["AZURE_STORAGE_ACCOUNT_NAME"],
...     account_key=storage_options["AZURE_STORAGE_ACCOUNT_KEY"],
... )
>>> # this pyarrow fs must be created and passed to scan_delta for Azure
>>> pa_fs = PyFileSystem(FSSpecHandler(fs))
>>> table_path = "az://container/path/to/delta-table/"
>>> pl.scan_delta(
...     table_path, storage_options=storage_options, raw_filesystem=pa_fs
... ).collect()
Creates a scan for a Delta table with additional Delta-specific options. In the example below, the without_files option is used, which loads the table without file tracking information.
>>> table_path = "/path/to/delta-table/"
>>> delta_table_options = {"without_files": True}
>>> pl.scan_delta(table_path, delta_table_options=delta_table_options).collect()