多文件 - Polars - 用户指南

处理多个文件

Polars可以根据您的需要和内存紧张程度，以不同的方式处理多个文件。

让我们创建一些文件来使用一些上下文（context）：

import polars as pl

df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "ham", "spam"]})

for i in range(5):
    df.write_csv(f"my_many_files_{i}.csv")

读入单个`DataFrame`

要将多个文件读入一个DataFrame，我们可以使用全局模式：

df = pl.read_csv("my_many_files_*.csv")
print(df)

shape: (15, 2)
┌─────┬──────┐
│ foo ┆ bar  │
│ --- ┆ ---  │
│ i64 ┆ str  │
╞═════╪══════╡
│ 1   ┆ null │
│ 2   ┆ ham  │
│ 3   ┆ spam │
│ 1   ┆ null │
│ …   ┆ …    │
│ 3   ┆ spam │
│ 1   ┆ null │
│ 2   ┆ ham  │
│ 3   ┆ spam │
└─────┴──────┘

要了解这是如何工作的，我们可以看看查询计划。下面我们可以看到，所有文件都是单独读取并连接成一个DataFrame 。Polars将尝试将读取并行化。

pl.scan_csv("my_many_files_*.csv").show_graph()

single_df_graph

并行读取和处理

如果您的文件不必位于单个表中，您还可以为每个文件构建一个查询计划，并在Polars线程池中并行执行它们。所有查询计划的执行都是极好的并行执行，不需要任何通信。

import polars as pl
import glob

queries = []
for file in glob.glob("my_many_files_*.csv"):
    q = pl.scan_csv(file).groupby("bar").agg([pl.count(), pl.sum("foo")])
    queries.append(q)

dataframes = pl.collect_all(queries)
print(dataframes)

[shape: (3, 3)
┌──────┬───────┬─────┐
│ bar  ┆ count ┆ foo │
│ ---  ┆ ---   ┆ --- │
│ str  ┆ u32   ┆ i64 │
╞══════╪═══════╪═════╡
│ spam ┆ 1     ┆ 3   │
│ null ┆ 1     ┆ 1   │
│ ham  ┆ 1     ┆ 2   │
└──────┴───────┴─────┘, shape: (3, 3)
┌──────┬───────┬─────┐
│ bar  ┆ count ┆ foo │
│ ---  ┆ ---   ┆ --- │
│ str  ┆ u32   ┆ i64 │
╞══════╪═══════╪═════╡
│ null ┆ 1     ┆ 1   │
│ ham  ┆ 1     ┆ 2   │
│ spam ┆ 1     ┆ 3   │
└──────┴───────┴─────┘, shape: (3, 3)
┌──────┬───────┬─────┐
│ bar  ┆ count ┆ foo │
│ ---  ┆ ---   ┆ --- │
│ str  ┆ u32   ┆ i64 │
╞══════╪═══════╪═════╡
│ ham  ┆ 1     ┆ 2   │
│ spam ┆ 1     ┆ 3   │
│ null ┆ 1     ┆ 1   │
└──────┴───────┴─────┘, shape: (3, 3)
┌──────┬───────┬─────┐
│ bar  ┆ count ┆ foo │
│ ---  ┆ ---   ┆ --- │
│ str  ┆ u32   ┆ i64 │
╞══════╪═══════╪═════╡
│ ham  ┆ 1     ┆ 2   │
│ spam ┆ 1     ┆ 3   │
│ null ┆ 1     ┆ 1   │
└──────┴───────┴─────┘, shape: (3, 3)
┌──────┬───────┬─────┐
│ bar  ┆ count ┆ foo │
│ ---  ┆ ---   ┆ --- │
│ str  ┆ u32   ┆ i64 │
╞══════╪═══════╪═════╡
│ ham  ┆ 1     ┆ 2   │
│ null ┆ 1     ┆ 1   │
│ spam ┆ 1     ┆ 3   │
└──────┴───────┴─────┘]

Polars - 用户指南

处理多个文件

读入单个DataFrame

并行读取和处理

读入单个`DataFrame`