Static schemas

Do you know that feeling? You are 20 minutes into an ETL job and, bam, the pipeline fails because it assumed the wrong data type for a column. With Polars' lazy API, the data types and column names are known at every node in the pipeline.

This schema information is very valuable and can be used to assert data integrity at any node in the pipeline.
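As a minimal sketch of this idea (using a hypothetical toy LazyFrame rather than the taxi dataset below, and assuming a recent Polars version where LazyFrame.collect_schema() is available), the schema of a lazy query can be resolved from the plan alone, without executing anything:

import polars as pl

# A toy lazy frame; the column names and data are illustrative only.
lf = pl.LazyFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

query = lf.with_columns((pl.col("a") * 2.5).alias("a_scaled"))

# Resolved from the query plan alone; no data is computed yet.
# The new column resolves to Float64 because Int64 * float -> Float64.
print(query.collect_schema())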

import polars as pl

# Trip duration in hours, derived from the two datetime columns.
trip_duration = (
    pl.col("dropoff_datetime") - pl.col("pickup_datetime")
).dt.total_seconds() / 3600

assert (
    pl.scan_csv("data/yellow_tripdata_2010-01.csv", try_parse_dates=True)
    .with_columns(trip_duration.alias("trip_duration"))
    .filter(pl.col("trip_duration") > 0)
    .group_by("vendor_id")
    .agg(
        (pl.col("trip_distance") / pl.col("trip_duration")).mean().alias("avg_speed"),
        (pl.col("tip_amount") / pl.col("passenger_count")).mean().alias("avg_tip_per_person"),
    )
    .collect_schema()
) == {"vendor_id": pl.Utf8, "avg_speed": pl.Float64, "avg_tip_per_person": pl.Float64}
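Taken one step further, the same check can guard an expensive query before it runs. A hedged sketch, assuming the query above is bound to a hypothetical name ldf (the name expected is also illustrative, not part of the original example):

# Validate the plan's schema before collecting any data.
expected = {"vendor_id": pl.Utf8, "avg_speed": pl.Float64, "avg_tip_per_person": pl.Float64}

resolved = ldf.collect_schema()
if dict(resolved) != expected:
    raise ValueError(f"unexpected schema: {dict(resolved)}")

df = ldf.collect()  # only runs once the schema contract holds

Because resolving the schema only needs a small sample of the file for type inference, this check costs far less than running the full query, so a mismatch surfaces immediately instead of 20 minutes in.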