# Parquet
Loading or writing Parquet files is lightning fast. Pandas uses PyArrow (the Python bindings exposed by Arrow) to load Parquet files into memory, but it has to copy that data into Pandas memory. With Polars there is no extra cost due to copying, as we read Parquet directly into Arrow memory and keep it there.
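Because the data already lives in Arrow memory, handing a Polars `DataFrame` over to the Arrow ecosystem is typically zero-copy as well. A minimal sketch, assuming PyArrow is installed:

```python
import polars as pl

df = pl.read_parquet("docs/data/path.parquet")

# The buffers are already Arrow-formatted, so this conversion
# usually does not copy the data.
arrow_table = df.to_arrow()
```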
## Read
```python
import polars as pl

df = pl.read_parquet("docs/data/path.parquet")
```
`ParquetReader` · Available on feature `parquet`
```rust
use polars::prelude::*;

let mut file = std::fs::File::open("docs/data/path.parquet").unwrap();
let df = ParquetReader::new(&mut file).finish().unwrap();
```
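If you only need some of the columns, `read_parquet` also accepts a `columns` argument, letting Polars exploit Parquet's columnar layout to skip the rest. A small sketch, reusing the `foo` and `bar` columns from the write example below:

```python
import polars as pl

# Only the listed columns are read; the remaining columns in the
# file are never deserialized.
df = pl.read_parquet("docs/data/path.parquet", columns=["foo", "bar"])
```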
## Write
```python
df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "bak", "baz"]})
df.write_parquet("docs/data/path.parquet")
```
`ParquetWriter` · Available on feature `parquet`
```rust
let mut df = df!(
    "foo" => &[1, 2, 3],
    "bar" => &[None, Some("bak"), Some("baz")],
)
.unwrap();

let mut file = std::fs::File::create("docs/data/path.parquet").unwrap();
ParquetWriter::new(&mut file).finish(&mut df).unwrap();
```
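`write_parquet` also exposes a `compression` parameter if you want to trade file size against write speed. A minimal sketch using standard Parquet codec names:

```python
import polars as pl

df = pl.DataFrame({"foo": [1, 2, 3], "bar": [None, "bak", "baz"]})

# "zstd" usually gives a good size/speed balance; codecs such as
# "snappy" or "lz4" are also accepted.
df.write_parquet("docs/data/path.parquet", compression="zstd")
```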
## Scan
Polars allows you to scan a Parquet input. Scanning delays the actual parsing of the file and instead returns a lazy computation holder called a `LazyFrame`.
```python
df = pl.scan_parquet("docs/data/path.parquet")
```
`scan_parquet` · Available on feature `parquet`
```rust
let args = ScanArgsParquet::default();
let df = LazyFrame::scan_parquet("./file.parquet", args).unwrap();
```
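Because nothing is parsed until the query is collected, you can chain operations on the `LazyFrame` and let Polars push the work into the scan itself. A minimal sketch, reusing the `foo` and `bar` columns from above:

```python
import polars as pl

# Build a lazy query; the file is not read yet.
lazy_df = (
    pl.scan_parquet("docs/data/path.parquet")
    .filter(pl.col("foo") > 1)
    .select(["foo", "bar"])
)

# Only now is the file read, with the filter and the column
# selection pushed down into the Parquet scan.
df = lazy_df.collect()
```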
If you want to know why this is desirable, you can read more about those Polars optimizations here.