Cache the result once the execution of the physical plan hits this node.
Collect into a DataFrame.
Note: use fetch
if you want to run this query on the first n
rows only.
This can be a huge time saver in debugging queries.
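A minimal sketch contrasting collect and fetch (hypothetical data; assuming pl is the imported nodejs-polars module):
> const lf = pl.DataFrame({
...   a: [1, 2, 3, 4],
...   b: ["w", "x", "y", "z"],
... }).lazy()
> // Run the full query plan and materialize every row.
> const full = await lf.filter(pl.col("a").gt(1)).collect()
> // While debugging, fetch limits how many rows are read from the source.
> const sample = await lf.filter(pl.col("a").gt(1)).fetch(2)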
Optional
opts: LazyOptions
Optional
opts: LazyOptions
A string representation of the optimized query plan.
Optional
opts: LazyOptions
Drop duplicate rows from this DataFrame.
Note that this fails if there is a column of type List
in the DataFrame.
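A minimal sketch of dropping duplicate rows via unique (hypothetical data; assuming pl is the imported nodejs-polars module):
> const lf = pl.DataFrame({
...   a: [1, 1, 2],
...   b: ["x", "x", "y"],
... }).lazy()
> // Keep the first occurrence of each duplicated row.
> const deduped = await lf.unique({ maintainOrder: true, keep: "first" }).collect()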
Optional
maintainOrder: boolean
Optional
subset: ColumnSelection
subset to drop duplicates for
Optional
keep: "first" | "last"
Optional
keep?: "first" | "last"
Optional
maintainOrder?: boolean
Optional
subset?: ColumnSelection
Remove one or multiple columns from a DataFrame.
Rest
...names: string[]
Drop rows with null values from this DataFrame. This method only drops a row if any single value in that row is null.
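A minimal sketch of dropNulls (hypothetical data; assuming pl is in scope):
> const lf = pl.DataFrame({ a: [1, null, 3], b: ["x", "y", null] }).lazy()
> // Rows containing any null are dropped, leaving only the first row.
> const clean = await lf.dropNulls().collect()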
Rest
...columns: string[]
Explode lists to long format.
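A minimal sketch of explode (hypothetical data; assuming pl is in scope):
> const lf = pl.DataFrame({ letters: ["a", "b"], nrs: [[1, 2], [3, 4]] }).lazy()
> // Each element of the "nrs" lists becomes its own row; "letters" values are repeated.
> const long = await lf.explode("nrs").collect()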
Rest
...columns: ExprOrString[]
Fetch is like a collect operation, but it overwrites the number of rows read by every scan.
Note that fetch does not guarantee the final number of rows in the DataFrame. Filters, join operations, and fewer rows being available in the scanned file can all influence the final number of rows.
Optional
numRows: number
Collect n rows from the data source.
Fill missing values
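A minimal sketch of fillNull (hypothetical data; assuming pl is in scope):
> const lf = pl.DataFrame({ a: [1, null, 3] }).lazy()
> // Replace nulls in every column with a literal value.
> const filled = await lf.fillNull(0).collect()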
Filter the rows in the DataFrame based on a predicate expression.
> const lf = pl.DataFrame({
> "foo": [1, 2, 3],
> "bar": [6, 7, 8],
> "ham": ['a', 'b', 'c']
> }).lazy()
> // Filter on one condition
> await lf.filter(pl.col("foo").lt(3)).collect()
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1 ┆ 6 ┆ a │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 7 ┆ b │
└─────┴─────┴─────┘
Group based on a time value (or index value of type Int32, Int64). Time windows are calculated and rows are assigned to windows. Different from a normal groupBy, a row can be a member of multiple groups. The time/index window could be seen as a rolling window, with a window size determined by dates/times/values instead of slots in the DataFrame.
A window is defined by:
every: interval of the window
period: length of the window
offset: offset of the window
The every
, period
and offset
arguments are created with
the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 day)
1w (1 week)
1mo (1 calendar month)
1y (1 calendar year)
1i (1 index count)
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
In case of a groupByDynamic on an integer column, the windows are defined by:
"1i" # length 1
"10i" # length 10
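A minimal sketch of groupByDynamic with hourly windows (hypothetical data; assuming pl is the imported nodejs-polars module):
> const lf = pl.DataFrame({
...   time: [
...     new Date("2021-12-16T00:00:00"),
...     new Date("2021-12-16T00:30:00"),
...     new Date("2021-12-16T01:00:00"),
...     new Date("2021-12-16T01:30:00"),
...   ],
...   n: [1, 2, 3, 4],
... }).lazy()
> // Windows start every hour and span one hour; each row lands in the window covering its timestamp.
> const hourly = await lf
...   .groupByDynamic({ indexColumn: "time", every: "1h" })
...   .agg(pl.col("n").sum().alias("n_sum"))
...   .collect()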
Optional
by?: ColumnsOrExpr
Optional
check_sorted?: boolean
Optional
closed?: "none" | "left" | "right" | "both"
Optional
includeBoundaries?: boolean
Optional
offset?: string
Optional
period?: string
Create rolling groups based on a time column (or index value of type Int32, Int64).
Different from a dynamic groupBy, the windows are determined by the individual values and are not of constant intervals. For constant intervals use groupByDynamic.
The period
and offset
arguments are created with
the same string language described above for groupByDynamic.
Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds
In case of a groupByRolling on an integer column, the windows are defined by:
"1i" # length 1
"10i" # length 10
Optional
by?: ColumnsOrExpr
Optional
check_sorted?: boolean
Optional
closed?: "none" | "left" | "right" | "both"
Optional
offset?: string
> const assert = require("assert")
> const dates = [
... "2020-01-01 13:45:48",
... "2020-01-01 16:42:13",
... "2020-01-01 16:45:09",
... "2020-01-02 18:12:48",
... "2020-01-03 19:45:32",
... "2020-01-08 23:16:43",
... ]
> const df = pl.DataFrame({"dt": dates, "a": [3, 7, 5, 9, 2, 1]}).withColumn(
... pl.col("dt").str.strptime(pl.Datetime)
... )
> const out = df.groupByRolling({indexColumn:"dt", period:"2d"}).agg(
... [
... pl.sum("a").alias("sum_a"),
... pl.max("a").alias("max_a"),
... pl.min("a").alias("min_a"),
... ]
... )
> assert.deepStrictEqual(out["sum_a"].toArray(), [3, 10, 15, 24, 11, 1])
> assert.deepStrictEqual(out["max_a"].toArray(), [3, 7, 7, 9, 9, 1])
> assert.deepStrictEqual(out["min_a"].toArray(), [3, 3, 3, 3, 2, 1])
>out
shape: (6, 4)
┌─────────────────────┬───────┬───────┬───────┐
│ dt ┆ sum_a ┆ max_a ┆ min_a │
│ --- ┆ --- ┆ --- ┆ --- │
│ datetime[ms] ┆ i64 ┆ i64 ┆ i64 │
╞═════════════════════╪═══════╪═══════╪═══════╡
│ 2020-01-01 13:45:48 ┆ 3 ┆ 3 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 16:42:13 ┆ 10 ┆ 7 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-01 16:45:09 ┆ 15 ┆ 7 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-02 18:12:48 ┆ 24 ┆ 9 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-03 19:45:32 ┆ 11 ┆ 9 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2020-01-08 23:16:43 ┆ 1 ┆ 1 ┆ 1 │
└─────────────────────┴───────┴───────┴───────┘
Gets the first n
rows of the DataFrame. You probably don't want to use this!
Consider using the fetch
operation.
The fetch
operation will truly load the first n
rows lazily.
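A minimal sketch contrasting head with fetch (assuming lf is a LazyDataFrame):
> // head slices the first 5 rows of the fully evaluated result.
> const firstFive = await lf.head(5).collect()
> // fetch(5) instead limits how many rows each scan reads in the first place.
> const quickSample = await lf.fetch(5)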
Optional
length: number
SQL-like joins.
>>> const df = pl.DataFrame({
>>> foo: [1, 2, 3],
>>> bar: [6.0, 7.0, 8.0],
>>> ham: ['a', 'b', 'c'],
>>> }).lazy()
>>>
>>> const otherDF = pl.DataFrame({
>>> apple: ['x', 'y', 'z'],
>>> ham: ['a', 'b', 'd'],
>>> }).lazy();
>>> const result = await df.join(otherDF, { on: 'ham', how: 'inner' }).collect();
shape: (2, 4)
╭─────┬─────┬─────┬───────╮
│ foo ┆ bar ┆ ham ┆ apple │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ str ┆ str │
╞═════╪═════╪═════╪═══════╡
│ 1 ┆ 6 ┆ "a" ┆ "x" │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 7 ┆ "b" ┆ "y" │
╰─────┴─────┴─────┴───────╯
Optional
allowParallel?: boolean
Optional
forceParallel?: boolean
Optional
suffix?: string
Perform an asof join. This is similar to a left join, except that we match on the nearest key rather than on equal keys.
Both DataFrames must be sorted by the asof_join key.
For each row in the left DataFrame:
A "backward" search selects the last row in the right DataFrame whose 'on' key is less than or equal to the left's key.
A "forward" search selects the first row in the right DataFrame whose 'on' key is greater than or equal to the left's key.
A "nearest" search selects the last row in the right DataFrame whose value is nearest to the left's key. String keys are not currently supported for a nearest search.
The default is "backward".
DataFrame to join with.
Optional
allowParallel?: boolean
Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.
Optional
by?: string | string[]
Optional
byLeft?: string | string[]
join on these columns before doing asof join
Optional
byRight?: string | string[]
join on these columns before doing asof join
Optional
forceParallel?: boolean
Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.
Optional
leftOn?: string
Join column of the left DataFrame.
Optional
on?: string
Join column of both DataFrames. If set, leftOn
and rightOn
should be undefined.
Optional
rightOn?: string
Join column of the right DataFrame.
Optional
strategy?: "backward" | "forward" | "nearest"
One of {'forward', 'backward', 'nearest'}
Optional
suffix?: string
Suffix to append to columns with a duplicate name.
Optional
tolerance?: string | number
Numeric tolerance. By setting this, the join will only be done if the nearest keys are within this distance. If an asof join is done on columns of dtype "Date" or "Datetime", you can use the same duration string language described above for groupByDynamic (e.g. "1d", "2h"), and the units can also be combined, e.g. "3d12h4m25s".
>const gdp = pl.DataFrame({
... date: [
... new Date('2016-01-01'),
... new Date('2017-01-01'),
... new Date('2018-01-01'),
... new Date('2019-01-01'),
... ], // note record date: Jan 1st (sorted!)
... gdp: [4164, 4411, 4566, 4696],
... })
>const population = pl.DataFrame({
... date: [
... new Date('2016-05-12'),
... new Date('2017-05-12'),
... new Date('2018-05-12'),
... new Date('2019-05-12'),
... ], // note record date: May 12th (sorted!)
... "population": [82.19, 82.66, 83.12, 83.52],
... })
>population.joinAsof(
... gdp,
... {leftOn:"date", rightOn:"date", strategy:"backward"}
... )
shape: (4, 3)
┌─────────────────────┬────────────┬──────┐
│ date ┆ population ┆ gdp │
│ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ f64 ┆ i64 │
╞═════════════════════╪════════════╪══════╡
│ 2016-05-12 00:00:00 ┆ 82.19 ┆ 4164 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2017-05-12 00:00:00 ┆ 82.66 ┆ 4411 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2018-05-12 00:00:00 ┆ 83.12 ┆ 4566 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2019-05-12 00:00:00 ┆ 83.52 ┆ 4696 │
└─────────────────────┴────────────┴──────┘
Get the last row of the DataFrame.
Optional
n: number
since 0.13.0 use unpivot
Reverse the DataFrame.
Rest
...columns: ExprOrString[]
Serializes object to desired format via serde.
Evaluate the query in streaming mode and write to a CSV file.
Warning: Streaming mode is considered unstable. It may be changed at any point without it being considered a breaking change.
This allows streaming results that are larger than RAM to be written to disk.
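A minimal sketch of sinkCSV (the output path is hypothetical; assuming lf is a LazyDataFrame):
> // Evaluate the plan in streaming mode and write the result straight to disk,
> // without collecting it into memory first.
> lf.sinkCSV("result.csv")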
File path to which the file should be written.
Optional
options: SinkCsvOptions
Evaluate the query in streaming mode and write to a Parquet file.
Warning: Streaming mode is considered unstable. It may be changed at any point without it being considered a breaking change.
This allows streaming results that are larger than RAM to be written to disk.
File path to which the file should be written.
Optional
options: SinkParquetOptions
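The optional descending, nullsLast, and maintainOrder parameters below belong to sort. A minimal sketch (hypothetical data; assuming pl is the imported nodejs-polars module):
> const lf = pl.DataFrame({ a: [3, 1, 2], b: ["x", "y", "z"] }).lazy()
> // Sort by column "a" in descending order.
> const sorted = await lf.sort({ by: "a", descending: true }).collect()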
Optional
descending: ValueOrArray<boolean>
Optional
nullsLast: boolean
Optional
maintainOrder: boolean
Optional
descending?: ValueOrArray<boolean>
Optional
maintainOrder?: boolean
Optional
nullsLast?: boolean
Aggregate the columns in the DataFrame to their sum value.
Get the last n
rows of the DataFrame.
Optional
length: number
Drop duplicate rows from this DataFrame.
Note that this fails if there is a column of type List
in the DataFrame.
Optional
maintainOrder: boolean
Optional
subset: ColumnSelection
subset to drop duplicates for
Optional
keep: "first" | "last"
Optional
keep?: "first" | "last"
Optional
maintainOrder?: boolean
Optional
subset?: ColumnSelection
Aggregate the columns in the DataFrame to their variance value.
Add or overwrite column in a DataFrame.
Add or overwrite multiple columns in a DataFrame.
Add a column at index 0 that counts the rows.
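A minimal sketch combining withColumn, withColumns, and withRowCount (hypothetical data; assuming pl is the imported nodejs-polars module):
> const out = await pl.DataFrame({ a: [1, 2, 3] })
...   .lazy()
...   // Add or overwrite a single column.
...   .withColumn(pl.col("a").cast(pl.Float64).alias("a_float"))
...   // Add or overwrite several columns at once.
...   .withColumns(pl.col("a").alias("a_copy"), pl.lit("x").alias("tag"))
...   // Prepend a row-count column named "row_nr".
...   .withRowCount("row_nr")
...   .collect()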
Representation of a Lazy computation graph / query.