Cloud storage

Polars can read from and write to AWS S3, Azure Blob Storage and Google Cloud Storage. The API is the same for all three storage providers.

To read from cloud storage, additional dependencies may be needed depending on the use case and cloud storage provider:

$ pip install fsspec s3fs adlfs gcsfs
$ cargo add aws_sdk_s3 aws_config tokio --features tokio/full

Reading from cloud storage

Polars can read a CSV, IPC or Parquet file in eager mode from cloud storage.

read_parquet · read_csv · read_ipc

import polars as pl

source = "s3://bucket/*.parquet"

df = pl.read_parquet(source)

ParquetReader · CsvReader · IpcReader (available with the parquet, csv and ipc features)

use aws_config::BehaviorVersion;
use polars::prelude::*;

#[tokio::main]
async fn main() {
    let bucket = "<YOUR_BUCKET>";
    let path = "<YOUR_PATH>";

    // Load credentials and region from the environment / AWS config files
    let config = aws_config::load_defaults(BehaviorVersion::latest()).await;
    let client = aws_sdk_s3::Client::new(&config);

    // Download the object from S3
    let object = client
        .get_object()
        .bucket(bucket)
        .key(path)
        .send()
        .await
        .unwrap();

    // Collect the body into an in-memory buffer
    let bytes = object.body.collect().await.unwrap().into_bytes();

    // Parse the buffer as CSV into a DataFrame
    let cursor = std::io::Cursor::new(bytes);
    let df = CsvReader::new(cursor).finish().unwrap();

    println!("{:?}", df);
}

This eager query downloads the file to a buffer in memory and creates a DataFrame from there. Polars uses fsspec to manage this download internally for all cloud storage providers.
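The same pattern applies to the other eager readers. A minimal sketch for CSV and IPC, where the bucket and file names are placeholders:

import polars as pl

# hypothetical object paths; any fsspec-compatible URL works here
df_csv = pl.read_csv("s3://bucket/my_file.csv")
df_ipc = pl.read_ipc("s3://bucket/my_file.arrow")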

Scanning from cloud storage with query optimisation

Polars can scan a Parquet file in lazy mode from cloud storage. We may need to provide details beyond the source URL, such as authentication credentials or the storage region. Polars looks for these in environment variables, but we can also pass them explicitly as a dict via the storage_options argument.

scan_parquet

import polars as pl

source = "s3://bucket/*.parquet"

storage_options = {
    "aws_access_key_id": "<secret>",
    "aws_secret_access_key": "<secret>",
    "aws_region": "us-east-1",
}
df = pl.scan_parquet(source, storage_options=storage_options)
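Alternatively, the credentials can come from the environment instead of storage_options. The variable names below are the standard AWS ones and are set in the script only for illustration:

import os

import polars as pl

# assumed standard AWS variable names; in practice these would be set
# outside the script rather than hard-coded
os.environ["AWS_ACCESS_KEY_ID"] = "<secret>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret>"
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"

df = pl.scan_parquet("s3://bucket/*.parquet")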

In both cases the query creates a LazyFrame without downloading the file. In the LazyFrame we have access to file metadata such as the schema. Polars uses the Rust object_store crate internally to manage the interface with the cloud storage providers, so no extra dependencies are required in Python to scan a cloud Parquet file.
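For example, the schema can be resolved from the Parquet metadata without reading any row data. This sketch assumes a recent Polars version where collect_schema is available and reuses the storage_options dict from above:

import polars as pl

source = "s3://bucket/*.parquet"

lf = pl.scan_parquet(source, storage_options=storage_options)

# resolve the schema from the Parquet metadata; no row data is downloaded
print(lf.collect_schema())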

If we create a lazy query with predicate and projection pushdowns, the query optimizer will apply them before the file is downloaded. This can significantly reduce the amount of data that needs to be downloaded. The query evaluation is triggered by calling collect.

import polars as pl

source = "s3://bucket/*.parquet"


df = pl.scan_parquet(source).filter(pl.col("id") < 100).select("id", "value").collect()
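To verify that the pushdowns are applied before any download, the optimized plan can be inspected with explain. This is a sketch of the same query without collecting it:

import polars as pl

source = "s3://bucket/*.parquet"

query = pl.scan_parquet(source).filter(pl.col("id") < 100).select("id", "value")

# the printed plan should show the scan node with the projected columns
# and the pushed-down predicate
print(query.explain())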

Scanning with PyArrow

We can also scan from cloud storage using PyArrow. This is particularly useful for partitioned datasets, such as those using Hive partitioning.

We first create a PyArrow dataset and then create a LazyFrame from the dataset.

scan_pyarrow_dataset

import polars as pl
import pyarrow.dataset as ds

dset = ds.dataset("s3://my-partitioned-folder/", format="parquet")
(
    pl.scan_pyarrow_dataset(dset)
    .filter(pl.col("foo") == "a")
    .select(["foo", "bar"])
    .collect()
)
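If credentials or the region cannot be picked up from the environment, a PyArrow filesystem can be passed to the dataset explicitly. A minimal sketch, where the region and bucket are placeholders:

import polars as pl
import pyarrow.dataset as ds
import pyarrow.fs as fs

# hypothetical region; credentials are resolved by PyArrow
cloud_fs = fs.S3FileSystem(region="us-east-1")

# note: no s3:// prefix when an explicit filesystem is passed
dset = ds.dataset("my-partitioned-folder/", filesystem=cloud_fs, format="parquet")
df = pl.scan_pyarrow_dataset(dset).filter(pl.col("foo") == "a").collect()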

Writing to cloud storage

We can write a DataFrame to cloud storage in Python using s3fs for S3, adlfs for Azure Blob Storage and gcsfs for Google Cloud Storage. In this example, we write a Parquet file to S3.

write_parquet

import polars as pl
import s3fs

df = pl.DataFrame({
    "foo": ["a", "b", "c", "d", "d"],
    "bar": [1, 2, 3, 4, 5],
})

fs = s3fs.S3FileSystem()
destination = "s3://bucket/my_file.parquet"

# write parquet
with fs.open(destination, mode='wb') as f:
    df.write_parquet(f)
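
The same approach works for the other providers by swapping the filesystem object. A minimal sketch for Google Cloud Storage, assuming gcsfs is installed and the bucket name is a placeholder:

import polars as pl
import gcsfs

df = pl.DataFrame({
    "foo": ["a", "b", "c", "d", "d"],
    "bar": [1, 2, 3, 4, 5],
})

# gcsfs resolves credentials from the environment
# (e.g. GOOGLE_APPLICATION_CREDENTIALS)
fs = gcsfs.GCSFileSystem()
destination = "gs://bucket/my_file.parquet"

with fs.open(destination, mode="wb") as f:
    df.write_parquet(f)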