Dealing with multiple files.

Polars can deal with multiple files differently depending on your needs and memory strain.

Let's create some files to give use some context:

import polars as pl

df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "ham", "spam"]})

for i in range(5):
    df.write_csv(f"my_many_files_{i}.csv")

Reading into a single DataFrame

To read multiple files into a single DataFrame, we can use globbing patterns:

df = pl.read_csv("my_many_files_*.csv")
print(df)
shape: (15, 2)
┌─────┬──────┐
│ foo ┆ bar  │
│ --- ┆ ---  │
│ i64 ┆ str  │
╞═════╪══════╡
│ 1   ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ ham  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3   ┆ spam │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1   ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ... ┆ ...  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3   ┆ spam │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1   ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ ham  │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3   ┆ spam │
└─────┴──────┘

To see how this works we can take a look at the query plan. Below we see that all files are read separately and concatenated into a single DataFrame. Polars will try to parallelize the reading.

pl.scan_csv("my_many_files_*.csv").show_graph()

single_df_graph

Reading and processing in parallel

If your files don't have to be in a single table you can also build a query plan for each file and execute them in paralllel on the Polars thread pool.

All query plan execution is embarrassingly parallel and doesn't require any communication.

import polars as pl
import glob

queries = []
for file in glob.glob("my_many_files_*.csv"):
    q = pl.scan_csv(file).groupby("bar").agg([pl.count(), pl.sum("foo")])
    queries.append(q)

dataframes = pl.collect_all(queries)
print(dataframes)
[shape: (3, 3)
┌──────┬───────┬─────┐
│ bar  ┆ count ┆ foo │
│ ---  ┆ ---   ┆ --- │
│ str  ┆ u32   ┆ i64 │
╞══════╪═══════╪═════╡
│ spam ┆ 1     ┆ 3   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ null ┆ 1     ┆ 1   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ ham  ┆ 1     ┆ 2   │
└──────┴───────┴─────┘, shape: (3, 3)
┌──────┬───────┬─────┐
│ bar  ┆ count ┆ foo │
│ ---  ┆ ---   ┆ --- │
│ str  ┆ u32   ┆ i64 │
╞══════╪═══════╪═════╡
│ null ┆ 1     ┆ 1   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ ham  ┆ 1     ┆ 2   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ spam ┆ 1     ┆ 3   │
└──────┴───────┴─────┘, shape: (3, 3)
┌──────┬───────┬─────┐
│ bar  ┆ count ┆ foo │
│ ---  ┆ ---   ┆ --- │
│ str  ┆ u32   ┆ i64 │
╞══════╪═══════╪═════╡
│ spam ┆ 1     ┆ 3   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ null ┆ 1     ┆ 1   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ ham  ┆ 1     ┆ 2   │
└──────┴───────┴─────┘, shape: (3, 3)
┌──────┬───────┬─────┐
│ bar  ┆ count ┆ foo │
│ ---  ┆ ---   ┆ --- │
│ str  ┆ u32   ┆ i64 │
╞══════╪═══════╪═════╡
│ null ┆ 1     ┆ 1   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ ham  ┆ 1     ┆ 2   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ spam ┆ 1     ┆ 3   │
└──────┴───────┴─────┘, shape: (3, 3)
┌──────┬───────┬─────┐
│ bar  ┆ count ┆ foo │
│ ---  ┆ ---   ┆ --- │
│ str  ┆ u32   ┆ i64 │
╞══════╪═══════╪═════╡
│ spam ┆ 1     ┆ 3   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ ham  ┆ 1     ┆ 2   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ null ┆ 1     ┆ 1   │
└──────┴───────┴─────┘]