Cache the result once the execution of the physical plan hits this node.
Collect into a DataFrame.
Note: use fetch if you want to run this query on only the first n rows. This can be a huge time saver when debugging queries.
Optional opts: LazyOptions (options for lazy operations)
  Optional commSubexprElim?: boolean
  Optional commSubplanElim?: boolean
  Optional noOptimization?: boolean
  Optional predicatePushdown?: boolean
  Optional projectionPushdown?: boolean
  Optional simplifyExpression?: boolean
  Optional slicePushdown?: boolean
  Optional streaming?: boolean
  Optional typeCoercion?: boolean
Returns DataFrame
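As a quick illustration, a minimal sketch of collecting versus fetching (pl is the nodejs-polars module, as in the other examples on this page; the column name and row counts are illustrative, and top-level await is assumed):
> const lf = pl.DataFrame({ num: [1, 2, 3, 4] })
...   .lazy()
...   .filter(pl.col("num").gt(1))
> // run the full query and materialize the result (collect returns a Promise)
> const df = await lf.collect()
> // while debugging, fetch runs the same query on only the first n source rows
> const preview = await lf.fetch(2)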
Optional opts: LazyOptions
Returns a string representation of the optimized query plan.
Optional opts: LazyOptions
Returns a string representation of the unoptimized query plan.
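For example, a small sketch comparing the two plan descriptions (the query itself is illustrative, and the method names describePlan / describeOptimizedPlan are assumed from the LazyDataFrame API):
> const lf = pl.DataFrame({ a: [1, 2, 3] }).lazy().filter(pl.col("a").gt(1))
> console.log(lf.describePlan())          // unoptimized plan
> console.log(lf.describeOptimizedPlan()) // optimized plan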
Remove one or multiple columns from a DataFrame.
column or list of columns to be removed
Drop rows with null values from this DataFrame. A row is dropped if any single value in that row is null.
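A minimal sketch of both operations (the column names are illustrative):
> const lf = pl.DataFrame({ id: [1, 2, null], tmp: ["x", "y", "z"] }).lazy()
> // remove the "tmp" column, then drop every row that still contains a null
> const cleaned = await lf.drop("tmp").dropNulls().collect()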
Fetch is like a collect operation, but it overwrites the number of rows read by every scan.
Note that fetch does not guarantee the final number of rows in the DataFrame. Filters, join operations, and a lower number of rows available in the scanned file all influence the final number of rows.
number of rows to collect from the data source
options for lazy operations
  Optional commSubexprElim?: boolean
  Optional commSubplanElim?: boolean
  Optional noOptimization?: boolean
  Optional predicatePushdown?: boolean
  Optional projectionPushdown?: boolean
  Optional simplifyExpression?: boolean
  Optional slicePushdown?: boolean
  Optional streaming?: boolean
  Optional typeCoercion?: boolean
Fill missing values.
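For example, a minimal sketch of filling nulls with a constant (the column name and the fill value 0 are illustrative; assumes fillNull accepts a literal fill value):
> const lf = pl.DataFrame({ qty: [1, null, 3] }).lazy()
> const filled = await lf.fillNull(0).collect()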
Filter the rows in the DataFrame based on a predicate expression.
> const lf = pl.DataFrame({
>   "foo": [1, 2, 3],
>   "bar": [6, 7, 8],
>   "ham": ['a', 'b', 'c']
> }).lazy()
> // Filter on one condition
> await lf.filter(pl.col("foo").lt(3)).collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 7   ┆ b   │
└─────┴─────┴─────┘
Start a groupby operation.
  Optional maintainOrder: boolean
Groups based on a time value (or index value of type Int32, Int64). Time windows are calculated and rows are assigned to windows. Different from a normal groupby, a row can be a member of multiple groups. The time/index window could be seen as a rolling window, with a window size determined by dates/times/values instead of slots in the DataFrame.
A window is defined by:
  every: interval of the window
  period: length of the window
  offset: offset of the window
The every, period and offset arguments are created with the following string language:
  1ns (1 nanosecond)
  1us (1 microsecond)
  1ms (1 millisecond)
  1s (1 second)
  1m (1 minute)
  1h (1 hour)
  1d (1 day)
  1w (1 week)
  1mo (1 calendar month)
  1y (1 calendar year)
  1i (1 index count)
Or combine them: "3d12h4m25s" (3 days, 12 hours, 4 minutes, and 25 seconds).
In case of a groupByDynamic on an integer column, the windows are defined by the index count, e.g. "1i" (a window of length 1) or "10i" (a window of length 10). A short sketch follows the parameter list below.
  Optional by?: ColumnsOrExpr
    Also group by this column or these columns.
  Optional closed?: "none" | "left" | "right" | "both"
    Defines whether the window interval is closed. Any of {"left", "right", "both", "none"}.
  every
    Interval of the window.
  Optional includeBoundaries?: boolean
    Add the lower and upper bound of the window to the "_lower_bound" and "_upper_bound" columns. This will impact performance because it is harder to parallelize.
  indexColumn
    Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order; if not, the output will not make sense.
    In case of a dynamic groupby on indices, the dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.
  Optional label?: string
    Define which label to use for the window: any of {'left', 'right', 'datapoint'}.
  Optional offset?: string
    Offset of the window. If None and period is None, it will be equal to negative every.
  Optional period?: string
    Length of the window. If None, it is equal to every.
  Optional startBy?: StartBy
    The strategy to determine the start of the first window. Any of {'window', 'datapoint', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday'}.
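A minimal sketch of a dynamic groupby (the "time" and "value" columns, the hourly window, and the sum aggregation are all illustrative, not part of the API):
> const df = pl.DataFrame({
...   time: [new Date("2021-01-01T00:15:00"), new Date("2021-01-01T00:45:00"), new Date("2021-01-01T01:10:00")],
...   value: [1, 2, 3],
... })
> const hourly = await df
...   .lazy()
...   .sort("time")
...   .groupByDynamic({ indexColumn: "time", every: "1h" })
...   .agg(pl.col("value").sum().alias("value_sum"))
...   .collect()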
Create rolling groups based on a time column (or index value of type Int32, Int64).
Different from a dynamic groupby, the windows are now determined by the individual values and are not of constant intervals. For constant intervals use groupByDynamic.
The period and offset arguments are created with the following string language (the same durations as in groupByDynamic): 1ns, 1us, 1ms, 1s, 1m, 1h, 1d, 1w, 1mo, 1y, 1i.
Or combine them: "3d12h4m25s" (3 days, 12 hours, 4 minutes, and 25 seconds).
In case of a groupByRolling on an integer column, the windows are defined by the index count, e.g. "1i" (a window of length 1) or "10i" (a window of length 10).
  Optional by?: ColumnsOrExpr
    Also group by this column or these columns.
  Optional closed?: "none" | "left" | "right" | "both"
    Defines whether the window interval is closed. Any of {"left", "right", "both", "none"}.
  indexColumn
    Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order; if not, the output will not make sense.
    In case of a rolling groupby on indices, the dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.
  Optional offset?: string
    Offset of the window. Default is -period.
  period
    Length of the window.
> const dates = [
...   "2020-01-01 13:45:48",
...   "2020-01-01 16:42:13",
...   "2020-01-01 16:45:09",
...   "2020-01-02 18:12:48",
...   "2020-01-03 19:45:32",
...   "2020-01-08 23:16:43",
... ]
> const df = pl.DataFrame({"dt": dates, "a": [3, 7, 5, 9, 2, 1]}).withColumn(
...   pl.col("dt").str.strptime(pl.Datetime)
... )
> const out = df.groupByRolling({indexColumn: "dt", period: "2d"}).agg(
...   [
...     pl.sum("a").alias("sum_a"),
...     pl.max("a").alias("max_a"),
...     pl.min("a").alias("min_a"),
...   ]
... )
> out["sum_a"].toArray() // [3, 10, 15, 24, 11, 1]
> out["max_a"].toArray() // [3, 7, 7, 9, 9, 1]
> out["min_a"].toArray() // [3, 3, 3, 3, 2, 1]
> out
shape: (6, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ dt                  ┆ sum_a ┆ max_a ┆ min_a │
│ ---                 ┆ ---   ┆ ---   ┆ ---   │
│ datetime[ms]        ┆ i64   ┆ i64   ┆ i64   │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 13:45:48 ┆ 3     ┆ 3     ┆ 3     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 16:42:13 ┆ 10    ┆ 7     ┆ 3     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 16:45:09 ┆ 15    ┆ 7     ┆ 3     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-02 18:12:48 ┆ 24    ┆ 9     ┆ 3     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-03 19:45:32 ┆ 11    ┆ 9     ┆ 2     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-08 23:16:43 ┆ 1     ┆ 1     ┆ 1     │
└─────────────────────┴───────┴───────┴───────┘
Gets the first n rows of the DataFrame. You probably don't want to use this! Consider using the fetch operation instead; fetch will truly load the first n rows lazily.
  Optional length: number
SQL-like joins.
DataFrame to join with.
options for lazy join operations
  Optional allowParallel?: boolean
  Optional forceParallel?: boolean
  Optional how?: JoinType
    Join type.
  Optional leftOn?: string | string[]
    Left join column(s).
  Optional on?: string | string[]
    Left and right join column(s).
  Optional rightOn?: string | string[]
    Right join column(s).
  Optional suffix?: string
>>> const df = pl.DataFrame({
>>> foo: [1, 2, 3],
>>> bar: [6.0, 7.0, 8.0],
>>> ham: ['a', 'b', 'c'],
>>> }).lazy()
>>>
>>> const otherDF = pl.DataFrame({
>>> apple: ['x', 'y', 'z'],
>>> ham: ['a', 'b', 'd'],
>>> }).lazy();
>>> const result = await df.join(otherDF, { on: 'ham', how: 'inner' }).collect();
shape: (2, 4)
╭─────┬─────┬─────┬───────╮
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ f64 ┆ str ┆ str   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 6   ┆ "a" ┆ "x"   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ 7   ┆ "b" ┆ "y"   │
╰─────┴─────┴─────┴───────╯
Perform an asof join. This is similar to a left-join except that we match on nearest key rather than equal keys.
Both DataFrames must be sorted by the asof_join key.
For each row in the left DataFrame:
A "backward" search selects the last row in the right DataFrame whose 'on' key is less than or equal to the left's key.
A "forward" search selects the first row in the right DataFrame whose 'on' key is greater than or equal to the left's key.
A "nearest" search selects the last row in the right DataFrame whose value is nearest to the left's key. String keys are not currently supported for a nearest search.
The default is "backward".
DataFrame to join with.
  Optional allowParallel?: boolean
    Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.
  Optional by?: string | string[]
  Optional byLeft?: string | string[]
    Join on these columns before doing the asof join.
  Optional byRight?: string | string[]
    Join on these columns before doing the asof join.
  Optional forceParallel?: boolean
    Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.
  Optional leftOn?: string
    Join column of the left DataFrame.
  Optional on?: string
    Join column of both DataFrames. If set, leftOn and rightOn should be undefined.
  Optional rightOn?: string
    Join column of the right DataFrame.
  Optional strategy?: "backward" | "forward" | "nearest"
    One of {'forward', 'backward', 'nearest'}.
  Optional suffix?: string
    Suffix to append to columns with a duplicate name.
  Optional tolerance?: string | number
    Numeric tolerance. By setting this, the join will only be done if the near keys are within this distance. If the asof join is done on columns of dtype "Date" or "Datetime", you can use the following string language (the same durations as in groupByDynamic): 1ns, 1us, 1ms, 1s, 1m, 1h, 1d, 1w, 1mo, 1y, 1i.
    Or combine them: "3d12h4m25s" (3 days, 12 hours, 4 minutes, and 25 seconds).
>const gdp = pl.DataFrame({
... date: [
... new Date('2016-01-01'),
... new Date('2017-01-01'),
... new Date('2018-01-01'),
... new Date('2019-01-01'),
... ], // note record date: Jan 1st (sorted!)
... gdp: [4164, 4411, 4566, 4696],
... })
>const population = pl.DataFrame({
... date: [
... new Date('2016-05-12'),
... new Date('2017-05-12'),
... new Date('2018-05-12'),
... new Date('2019-05-12'),
... ], // note record date: May 12th (sorted!)
... "population": [82.19, 82.66, 83.12, 83.52],
... })
>population.joinAsof(
... gdp,
... {leftOn:"date", rightOn:"date", strategy:"backward"}
... )
shape: (4, 3)
┌─────────────────────┬────────────┬──────┐
│ date                ┆ population ┆ gdp  │
│ ---                 ┆ ---        ┆ ---  │
│ datetime[μs]        ┆ f64        ┆ i64  │
╞═════════════════════╪════════════╪══════╡
│ 2016-05-12 00:00:00 ┆ 82.19      ┆ 4164 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2017-05-12 00:00:00 ┆ 82.66      ┆ 4411 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2018-05-12 00:00:00 ┆ 83.12      ┆ 4566 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2019-05-12 00:00:00 ┆ 83.52      ┆ 4696 │
└─────────────────────┴────────────┴──────┘
Get the last row of the DataFrame.
Reverse the DataFrame.
Evaluate the query in streaming mode and write to a CSV file.
Warning: Streaming mode is considered unstable. It may be changed at any point without it being considered a breaking change.
This allows streaming results that are larger than RAM to be written to disk.
File path to which the file should be written.
Optional options: CsvWriterOptions
  Optional batchSize?: number
  Optional dateFormat?: string
  Optional datetimeFormat?: string
  Optional floatPrecision?: number
  Optional includeBom?: boolean
  Optional includeHeader?: boolean
  Optional lineTerminator?: string
  Optional maintainOrder?: boolean
  Optional nullValue?: string
  Optional quoteChar?: string
  Optional separator?: string
  Optional timeFormat?: string
Evaluate the query in streaming mode and write to a Parquet file.
Warning: Streaming mode is considered unstable. It may be changed at any point without it being considered a breaking change.
This allows streaming results that are larger than RAM to be written to disk.
File path to which the file should be written.
Optional options: SinkParquetOptions
  Optional cloudOptions?: Map<string, string>
  Optional compression?: string
  Optional compressionLevel?: number
  Optional dataPagesizeLimit?: number
  Optional maintainOrder?: boolean
  Optional noOptimization?: boolean
  Optional predicatePushdown?: boolean
  Optional projectionPushdown?: boolean
  Optional retries?: number
  Optional rowGroupSize?: number
  Optional simplifyExpression?: boolean
  Optional slicePushdown?: boolean
  Optional statistics?: boolean
  Optional typeCoercion?: boolean
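A minimal sketch of the two sinks described above (assumes the methods are exposed as sinkCSV and sinkParquet; the scan path, output paths, filter and compression value are illustrative):
> const lf = pl.scanCSV("data/input.csv").filter(pl.col("amount").gt(0))
> // stream the result to disk instead of collecting it into memory
> lf.sinkCSV("data/output.csv", { maintainOrder: true })
> lf.sinkParquet("data/output.parquet", { compression: "zstd" })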
Sort the DataFrame by the given columns or expressions.
  Optional descending: ValueOrArray<boolean>
  Optional nullsLast: boolean
  Optional maintainOrder: boolean
Aggregate the columns in the DataFrame to their sum value.
compatibility with JSON.stringify
Drop duplicate rows from this DataFrame.
Note that this fails if there is a column of type List in the DataFrame.
  Optional maintainOrder: boolean
  Optional subset: ColumnSelection
    Subset of columns to consider when dropping duplicates.
  Optional keep: "first" | "last"
    Which duplicate row to keep: "first" or "last".
Aggregate the columns in the DataFrame to their variance value.
Add or overwrite a column in a DataFrame.
Add or overwrite multiple columns in a DataFrame.
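A small sketch of both methods (column names and expressions are illustrative; assumes withColumns accepts the expressions as separate arguments):
> const lf = pl.DataFrame({ price: [10, 20], qty: [1, 3] }).lazy()
> // add a single derived column
> const withTotal = await lf.withColumn(pl.col("price").mul(pl.col("qty")).alias("total")).collect()
> // add or overwrite several columns at once
> const more = await lf.withColumns(
...   pl.col("price").mul(1.2).alias("price_with_tax"),
...   pl.lit(true).alias("in_stock"),
... ).collect()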
Representation of a Lazy computation graph / query.