Multiple
Dealing with multiple files.
Polars can handle multiple files in different ways, depending on your needs and memory constraints.
Let's create some files to give us some context:
import polars as pl
df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "ham", "spam"]})
for i in range(5):
    df.write_csv(f"docs/data/my_many_files_{i}.csv")
Reading into a single DataFrame
To read multiple files into a single DataFrame, we can use globbing patterns:
df = pl.read_csv("docs/data/my_many_files_*.csv")
print(df)
shape: (15, 2)
┌─────┬──────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪══════╡
│ 1 ┆ null │
│ 2 ┆ ham │
│ 3 ┆ spam │
│ 1 ┆ null │
│ … ┆ … │
│ 3 ┆ spam │
│ 1 ┆ null │
│ 2 ┆ ham │
│ 3 ┆ spam │
└─────┴──────┘
To see how this works we can take a look at the query plan. Below we see that all files are read separately and concatenated into a single DataFrame. Polars will try to parallelize the reading.
pl.scan_csv("docs/data/my_many_files_*.csv").show_graph()
Reading and processing in parallel
If your files don't have to be in a single table, you can also build a query plan for each file and execute them in parallel on the Polars thread pool.
All query plan execution is embarrassingly parallel and doesn't require any communication.
import polars as pl
import glob
queries = []
for file in glob.glob("docs/data/my_many_files_*.csv"):
    q = pl.scan_csv(file).group_by("bar").agg([pl.count(), pl.sum("foo")])
    queries.append(q)
dataframes = pl.collect_all(queries)
print(dataframes)
[shape: (3, 3)
┌──────┬───────┬─────┐
│ bar ┆ count ┆ foo │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ i64 │
╞══════╪═══════╪═════╡
│ null ┆ 1 ┆ 1 │
│ spam ┆ 1 ┆ 3 │
│ ham ┆ 1 ┆ 2 │
└──────┴───────┴─────┘, shape: (3, 3)
┌──────┬───────┬─────┐
│ bar ┆ count ┆ foo │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ i64 │
╞══════╪═══════╪═════╡
│ null ┆ 1 ┆ 1 │
│ ham ┆ 1 ┆ 2 │
│ spam ┆ 1 ┆ 3 │
└──────┴───────┴─────┘, shape: (3, 3)
┌──────┬───────┬─────┐
│ bar ┆ count ┆ foo │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ i64 │
╞══════╪═══════╪═════╡
│ spam ┆ 1 ┆ 3 │
│ ham ┆ 1 ┆ 2 │
│ null ┆ 1 ┆ 1 │
└──────┴───────┴─────┘, shape: (3, 3)
┌──────┬───────┬─────┐
│ bar ┆ count ┆ foo │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ i64 │
╞══════╪═══════╪═════╡
│ ham ┆ 1 ┆ 2 │
│ null ┆ 1 ┆ 1 │
│ spam ┆ 1 ┆ 3 │
└──────┴───────┴─────┘, shape: (3, 3)
┌──────┬───────┬─────┐
│ bar ┆ count ┆ foo │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ i64 │
╞══════╪═══════╪═════╡
│ ham ┆ 1 ┆ 2 │
│ null ┆ 1 ┆ 1 │
│ spam ┆ 1 ┆ 3 │
└──────┴───────┴─────┘]