Cache the result once the execution of the physical plan hits this node.
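A minimal sketch of caching an intermediate result so two downstream queries reuse it (column names and values here are illustrative, not from the original docs):
const lf = pl.DataFrame({ a: [1, 2, 3], b: [4, 5, 6] }).lazy();
// cache() marks this node; once the plan reaches it the result is stored and reused
const cached = lf.filter(pl.col("a").gt(1)).cache();
const sums = cached.sum().collectSync();   // first query materialises and fills the cache
const rows = cached.collectSync();         // second query reuses the cached result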
Collect into a DataFrame.
Note: use fetch if you want to run this query on the first n rows only.
This can be a huge time saver in debugging queries.
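A short sketch of collecting a lazy query (frame contents are illustrative; collect() returns a Promise, and collectSync() is the blocking variant):
const lf = pl.DataFrame({ foo: [1, 2, 3], bar: ["a", "b", "c"] }).lazy();
// collect() executes the whole plan and materialises a DataFrame
const df = await lf.filter(pl.col("foo").lt(3)).collect();
// for quick debugging on only the first rows, fetch is usually cheaper:
// const preview = await lf.fetch(2);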
Optional opts: LazyOptions - options for lazy operations
Optional commSubexprElim?: boolean
Optional commSubplanElim?: boolean
Optional noOptimization?: boolean
Optional predicatePushdown?: boolean
Optional projectionPushdown?: boolean
Optional simplifyExpression?: boolean
Optional slicePushdown?: boolean
Optional streaming?: boolean
Optional typeCoercion?: boolean
Returns DataFrame
Optional opts: LazyOptions
Returns a string representation of the optimized query plan.
Optional opts: LazyOptions
Returns a string representation of the unoptimized query plan.
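A minimal sketch comparing the two plan descriptions (the CSV path is a placeholder):
const lf = pl
  .scanCsv("./example.csv")               // placeholder path
  .filter(pl.col("a").gt(1))
  .select("a", "b");
console.log(lf.describePlan());           // the plan as written, before optimization
console.log(lf.describeOptimizedPlan());  // the plan after optimizations such as predicate pushdown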
Remove one or multiple columns from a DataFrame.
column or list of columns to be removed
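A small sketch of dropping one or several columns (names are illustrative):
const lf = pl.DataFrame({ foo: [1, 2], bar: [3, 4], ham: ["a", "b"] }).lazy();
const withoutHam = lf.drop("ham");        // drop a single column
const fooOnly = lf.drop(["bar", "ham"]);  // drop a list of columns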
Drop rows with null values from this DataFrame. A row is dropped if any single value in that row is null.
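A small sketch, assuming dropNulls can also be restricted to a subset of columns:
const lf = pl.DataFrame({ a: [1, null, 3], b: [0.5, 1.5, null] }).lazy();
const noNullRows = lf.dropNulls();   // drops every row that contains a null
const noNullA = lf.dropNulls("a");   // only considers nulls in column "a"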
Fetch is like a collect operation, but it overrides the number of rows read by every scan operation.
Note that fetch does not guarantee the final number of rows in the DataFrame. Filters, join operations, and a lower number of rows available in the scanned file all influence the final number of rows.
Collect 'n' rows from the data source.
options for lazy operations
Optional commSubexprElim?: boolean
Optional commSubplanElim?: boolean
Optional noOptimization?: boolean
Optional predicatePushdown?: boolean
Optional projectionPushdown?: boolean
Optional simplifyExpression?: boolean
Optional slicePushdown?: boolean
Optional streaming?: boolean
Optional typeCoercion?: boolean
Fill missing values
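A small sketch of filling nulls with a literal (the method also accepts an expression; numeric columns are assumed here):
const lf = pl.DataFrame({ a: [1, null, 3], b: [0.5, null, 1.5] }).lazy();
const filled = lf.fillNull(0);   // replace every null with 0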
Filter the rows in the DataFrame based on a predicate expression.
> lf = pl.DataFrame({
> "foo": [1, 2, 3],
> "bar": [6, 7, 8],
> "ham": ['a', 'b', 'c']
> }).lazy()
> // Filter on one condition
> lf.filter(pl.col("foo").lt(3)).collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1 ┆ 6 ┆ a │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 7 ┆ b │
└─────┴─────┴─────┘
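Multiple conditions can be combined into a single predicate expression; a brief sketch using the and/eq combinators on the frame from the example above:
> // Filter on multiple conditions: foo < 3 and ham equals "a"
> lf.filter(
...   pl.col("foo").lt(3).and(pl.col("ham").eq(pl.lit("a")))
... ).collect()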
Start a groupby operation.
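A brief sketch of a lazy groupby aggregation (column names are illustrative):
const lf = pl.DataFrame({ key: ["a", "a", "b"], value: [1, 2, 3] }).lazy();
const agg = lf
  .groupBy("key")
  .agg(pl.col("value").sum().alias("value_sum"))
  .collectSync();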
Optional maintainOrder: boolean
Groups based on a time value (or index value of type Int32, Int64). Time windows are calculated and rows are assigned to windows. Different from a normal groupby is that a row can be a member of multiple groups. The time/index window could be seen as a rolling window, with a window size determined by dates/times/values instead of slots in the DataFrame.
A window is defined by:
every: interval of the window
period: length of the window
offset: offset of the window
The every, period and offset arguments are created with the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 day)
1w (1 week)
1mo (1 calendar month)
1y (1 calendar year)
1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
In case of a groupByDynamic on an integer column, the windows are defined by:
"1i" # length 1
"10i" # length 10
Optional by?: ColumnsOrExpr - also group by this column/these columns
Optional closed?: "none" | "left" | "right" | "both" - defines if the window interval is closed or not. Any of {"left", "right", "both", "none"}
every: interval of the window
Optional includeBoundaries?: boolean - add the lower and upper bound of the window to the "_lower_bound" and "_upper_bound" columns. This will impact performance because it is harder to parallelize
indexColumn: column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order; if it is not, the output will not make sense.
In case of a dynamic groupby on indices, the dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.
Optional label?: string - define which label to use for the window. Any of {'left', 'right', 'datapoint'}
Optional offset?: string - offset of the window. If None and period is None, it will be equal to negative every
Optional period?: string - length of the window. If None, it is equal to 'every'
Optional startBy?: StartBy - the strategy to determine the start of the first window. Any of {'window', 'datapoint', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday'}
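A hedged sketch of a dynamic groupby over a parsed datetime column, assuming the same options-object form as groupByRolling below (data and format are illustrative):
const out = pl.DataFrame({
  time: ["2021-01-01 00:00:00", "2021-01-01 12:00:00", "2021-01-02 06:00:00"],
  n: [1, 2, 3],
})
  .lazy()
  .withColumn(pl.col("time").str.strptime(pl.Datetime))
  .groupByDynamic({ indexColumn: "time", every: "1d" })  // one window per calendar day
  .agg(pl.col("n").sum().alias("n_sum"))
  .collectSync();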
Create rolling groups based on a time column (or index value of type Int32, Int64).
Different from a dynamic groupby, the windows are determined by the individual values and are not of constant intervals. For constant intervals use groupByDynamic.
The period and offset arguments are created with the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 day)
1w (1 week)
1mo (1 calendar month)
1y (1 calendar year)
1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
In case of a groupByRolling on an integer column, the windows are defined by:
"1i" # length 1
"10i" # length 10
Optional by?: ColumnsOrExpr - also group by this column/these columns
Optional closed?: "none" | "left" | "right" | "both" - defines if the window interval is closed or not. Any of {"left", "right", "both", "none"}
indexColumn: column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order; if it is not, the output will not make sense.
In case of a rolling groupby on indices, the dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.
Optional offset?: string - offset of the window. Default is -period
period: length of the window
>dates = [
... "2020-01-01 13:45:48",
... "2020-01-01 16:42:13",
... "2020-01-01 16:45:09",
... "2020-01-02 18:12:48",
... "2020-01-03 19:45:32",
... "2020-01-08 23:16:43",
... ]
>df = pl.DataFrame({"dt": dates, "a": [3, 7, 5, 9, 2, 1]}).withColumn(
... pl.col("dt").str.strptime(pl.Datetime)
... )
>out = df.groupByRolling({indexColumn:"dt", period:"2d"}).agg(
... [
... pl.sum("a").alias("sum_a"),
... pl.min("a").alias("min_a"),
... pl.max("a").alias("max_a"),
... ]
... )
>assert(out["sum_a"].toArray() === [3, 10, 15, 24, 11, 1])
>assert(out["max_a"].toArray() === [3, 7, 7, 9, 9, 1])
>assert(out["min_a"].toArray() === [3, 3, 3, 3, 2, 1])
>out
shape: (6, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ dt                  ┆ sum_a ┆ max_a ┆ min_a │
│ --- ┆ --- ┆ --- ┆ --- │
│ datetime[ms] ┆ i64 ┆ i64 ┆ i64 │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 13:45:48 ┆ 3 ┆ 3 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 16:42:13 ┆ 10 ┆ 7 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 16:45:09 ┆ 15 ┆ 7 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-02 18:12:48 ┆ 24 ┆ 9 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-03 19:45:32 ┆ 11 ┆ 9 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-08 23:16:43 ┆ 1 ┆ 1 ┆ 1 │
└─────────────────────┴───────┴───────┴───────┘
Gets the first n rows of the DataFrame. You probably don't want to use this!
Consider using the fetch operation.
The fetch operation will truly load the first n rows lazily.
Optional length: number
SQL-like joins.
DataFrame to join with.
options for same named column join
Optional coalesce?: boolean - coalescing behavior (merging of join columns).
Optional how?: Exclude<JoinType, "cross"> - join strategy
Name(s) of the join columns in both DataFrames.
Optional suffix?: string - suffix to append to columns with a duplicate name.
Optional allowParallel?: boolean - allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.
Optional forceParallel?: boolean - force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.
>>> const df = pl.DataFrame({
>>> foo: [1, 2, 3],
>>> bar: [6.0, 7.0, 8.0],
>>> ham: ['a', 'b', 'c'],
>>> }).lazy()
>>>
>>> const otherDF = pl.DataFrame({
>>> apple: ['x', 'y', 'z'],
>>> ham: ['a', 'b', 'd'],
>>> }).lazy();
>>> const result = await df.join(otherDF, { on: 'ham', how: 'inner' }).collect();
shape: (2, 4)
╭─────┬─────┬─────┬───────╮
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str ┆ str │
╞═════╪═════╪═════╪═══════╡
│ 1 ┆ 6 ┆ "a" ┆ "x" │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 7 ┆ "b" ┆ "y" │
╰─────┴─────┴─────┴───────╯
SQL-like joins with different key names for the left and right DataFrames.
DataFrame to join with.
options for differently named column join
Optional coalesce?: boolean - coalescing behavior (merging of join columns).
Optional how?: Exclude<JoinType, "cross"> - join strategy
Name(s) of the left join column(s).
Name(s) of the right join column(s).
Optional suffix?: string - suffix to append to columns with a duplicate name.
Optional allowParallel?: boolean - allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.
Optional forceParallel?: boolean - force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.
>>> const df = pl.DataFrame({
>>> foo: [1, 2, 3],
>>> bar: [6.0, 7.0, 8.0],
>>> ham: ['a', 'b', 'c'],
>>> }).lazy()
>>>
>>> const otherDF = pl.DataFrame({
>>> apple: ['x', 'y', 'z'],
>>> ham: ['a', 'b', 'd'],
>>> }).lazy();
>>> const result = await df.join(otherDF, { leftOn: 'ham', rightOn: 'ham', how: 'inner' }).collect();
shape: (2, 4)
╭─────┬─────┬─────┬───────╮
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str ┆ str │
╞═════╪═════╪═════╪═══════╡
│ 1 ┆ 6 ┆ "a" ┆ "x" │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 7 ┆ "b" ┆ "y" │
╰─────┴─────┴─────┴───────╯
SQL-like cross joins.
DataFrame to join with.
options for cross join
Optional coalesce?: boolean - coalescing behavior (merging of join columns).
how: "cross" - join strategy
Optional suffix?: string - suffix to append to columns with a duplicate name.
Optional allowParallel?: boolean - allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.
Optional forceParallel?: boolean - force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.
>>> const df = pl.DataFrame({
>>> foo: [1, 2],
>>> bar: [6.0, 7.0],
>>> ham: ['a', 'b'],
>>> }).lazy()
>>>
>>> const otherDF = pl.DataFrame({
>>> apple: ['x', 'y'],
>>> ham: ['a', 'b'],
>>> }).lazy();
>>> const result = await df.join(otherDF, { how: 'cross' }).collect();
shape: (4, 5)
╭─────┬─────┬─────┬───────┬───────────╮
│ foo ┆ bar ┆ ham ┆ apple ┆ ham_right │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ str ┆ str ┆ str │
╞═════╪═════╪═════╪═══════╪═══════════╡
│ 1.0 ┆ 6.0 ┆ a ┆ x ┆ a │
│ 1.0 ┆ 6.0 ┆ a ┆ y ┆ b │
│ 2.0 ┆ 7.0 ┆ b ┆ x ┆ a │
│ 2.0 ┆ 7.0 ┆ b ┆ y ┆ b │
╰─────┴─────┴─────┴───────┴───────────╯
Perform an asof join. This is similar to a left-join except that we match on nearest key rather than equal keys.
Both DataFrames must be sorted by the asof join key.
For each row in the left DataFrame:
A "backward" search selects the last row in the right DataFrame whose 'on' key is less than or equal to the left's key.
A "forward" search selects the first row in the right DataFrame whose 'on' key is greater than or equal to the left's key.
A "nearest" search selects the last row in the right DataFrame whose value is nearest to the left's key. String keys are not currently supported for a nearest search.
The default is "backward".
DataFrame to join with.
Optional allowParallel?: boolean - allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.
Optional by?: string | string[] - join on these columns before doing the asof join
Optional byLeft?: string | string[] - join on these columns before doing the asof join
Optional byRight?: string | string[] - join on these columns before doing the asof join
Optional checkSortedness?: boolean - check the sortedness of the asof keys. If the keys are not sorted Polars will error, or in case of a 'by' argument raise a warning. This might become a hard error in the future.
Optional forceParallel?: boolean - force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.
Optional leftOn?: string - join column of the left DataFrame.
Optional on?: string - join column of both DataFrames. If set, leftOn and rightOn should be undefined.
Optional rightOn?: string - join column of the right DataFrame.
Optional strategy?: "backward" | "forward" | "nearest" - one of {'forward', 'backward', 'nearest'}
Optional suffix?: string - suffix to append to columns with a duplicate name.
Optional tolerance?: string | number - numeric tolerance. By setting this the join will only be done if the nearest keys are within this distance. If an asof join is done on columns of dtype "Date" or "Datetime", you can use the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 day)
1w (1 week)
1mo (1 calendar month)
1y (1 calendar year)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
>const gdp = pl.DataFrame({
... date: [
... new Date('2016-01-01'),
... new Date('2017-01-01'),
... new Date('2018-01-01'),
... new Date('2019-01-01'),
... ], // note record date: Jan 1st (sorted!)
... gdp: [4164, 4411, 4566, 4696],
... })
>const population = pl.DataFrame({
... date: [
... new Date('2016-05-12'),
... new Date('2017-05-12'),
... new Date('2018-05-12'),
... new Date('2019-05-12'),
... ], // note record date: May 12th (sorted!)
... "population": [82.19, 82.66, 83.12, 83.52],
... })
>population.joinAsof(
... gdp,
... {leftOn:"date", rightOn:"date", strategy:"backward"}
... )
shape: (4, 3)
┌─────────────────────┬────────────┬──────┐
│ date ┆ population ┆ gdp │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ f64 ┆ i64 │
╞═════════════════════╪════════════╪══════╡
│ 2016-05-12 00:00:00 ┆ 82.19 ┆ 4164 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2017-05-12 00:00:00 ┆ 82.66 ┆ 4411 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2018-05-12 00:00:00 ┆ 83.12 ┆ 4566 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2019-05-12 00:00:00 ┆ 83.52 ┆ 4696 │
└─────────────────────┴────────────┴──────┘
Get the last row of the DataFrame.
Reverse the DataFrame.
Evaluate the query in streaming mode and write to a CSV file.
Warning: Streaming mode is considered unstable. It may be changed at any point without it being considered a breaking change.
This allows streaming results that are larger than RAM to be written to disk.
File path to which the file should be written.
Optional options: CsvWriterOptions - options for writing the CSV file
Optional batchSize?: number
Optional dateFormat?: string
Optional datetimeFormat?: string
Optional floatPrecision?: number
Optional includeBom?: boolean
Optional includeHeader?: boolean
Optional lineTerminator?: string
Optional maintainOrder?: boolean
Optional nullValue?: string
Optional quoteChar?: string
Optional separator?: string
Optional timeFormat?: string
const lf = pl.scanCsv("/path/to/my_larger_than_ram_file.csv")
lf.sinkCsv("out.csv").collect()
Evaluate the query in streaming mode and write to an IPC file. This allows streaming results that are larger than RAM to be written to disk.
Parameters
File path to which the file should be written.
Optional options: SinkIpcOptions - options for sinking to IPC
Optional cloudOptions?: Map<string, string>
Optional compatLevel?: string
Optional compression?: string
Optional maintainOrder?: boolean
Optional mkdir?: boolean
Optional retries?: number
Optional syncOnClose?: string
const lf = pl.scanCsv("/path/to/my_larger_than_ram_file.csv")
lf.sinkIpc("out.arrow").collect()
Evaluate the query in streaming mode and write to an NDJSON file. This allows streaming results that are larger than RAM to be written to disk.
Parameters
File path to which the file should be written.
Optional options: SinkJsonOptions - options for sinking to NDJSON
Optional cloudOptions?: Map<string, string>
Optional maintainOrder?: boolean
Optional mkdir?: boolean
Optional retries?: number
Optional syncOnClose?: string
const lf = pl.scanCsv("/path/to/my_larger_than_ram_file.csv")
lf.sinkNdJson("out.ndjson").collect()
Evaluate the query in streaming mode and write to a Parquet file.
This allows streaming results that are larger than RAM to be written to disk.
File path to which the file should be written.
Optional options: SinkParquetOptions - options for sinking to Parquet
Optional cloudOptions?: Map<string, string>
Optional compression?: string
Optional compressionLevel?: number
Optional dataPagesizeLimit?: number
Optional maintainOrder?: boolean
Optional noOptimization?: boolean
Optional predicatePushdown?: boolean
Optional projectionPushdown?: boolean
Optional retries?: number
Optional rowGroupSize?: number
Optional simplifyExpression?: boolean
Optional sinkOptions?: SinkOptions
Optional slicePushdown?: boolean
Optional statistics?: boolean
Optional typeCoercion?: boolean
const lf = pl.scanCsv("/path/to/my_larger_than_ram_file.csv")
lf.sinkParquet("out.parquet").collect()
Optional descending: ValueOrArray<boolean>
Optional nullsLast: boolean
Optional maintainOrder: boolean
Aggregate the columns in the DataFrame to their sum value.
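A small sketch combining the sort options above with a whole-frame sum (assuming the positional sort(by, descending) overload):
const lf = pl.DataFrame({ foo: [3, 1, 2], bar: [6.0, 4.0, 5.0] }).lazy();
const sorted = lf.sort("foo", true).collectSync();  // sort by "foo", descending
const totals = lf.sum().collectSync();              // one-row frame with per-column sums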
Compatibility with JSON.stringify.
Drop duplicate rows from this DataFrame.
Note that this fails if there is a column of type List in the DataFrame.
Optional maintainOrder: boolean
Optional subset: ColumnSelection - subset to drop duplicates for
Optional keep: "first" | "last"
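A hedged sketch of dropping duplicate rows, optionally restricted to a subset (assuming an options-object overload matching the parameters above):
const lf = pl.DataFrame({ a: [1, 1, 2], b: ["x", "x", "y"] }).lazy();
const distinctRows = lf.unique().collectSync();   // drop fully duplicated rows
const distinctOnA = lf
  .unique({ maintainOrder: true, subset: ["a"], keep: "first" })
  .collectSync();                                  // keep the first row per value of "a"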
Optional options: { valueName?: string | null; variableName?: string | null }
Aggregate the columns in the DataFrame to their variance value.
Add or overwrite a column in a DataFrame.
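A minimal sketch of adding and overwriting a column with withColumn (the literal column is illustrative):
const lf = pl.DataFrame({ foo: [1, 2, 3] }).lazy();
const withFlag = lf.withColumn(pl.lit(true).alias("flag"));      // add a new column
const renamedCopy = lf.withColumn(pl.col("foo").alias("foo2"));  // copy "foo" under a new name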
Representation of a Lazy computation graph / query.