Interface LazyDataFrame

Representation of a Lazy computation graph / query.

interface LazyDataFrame {
    [inspect](): string;
    cache(): LazyDataFrame;
    clone(): LazyDataFrame;
    collect(opts?): Promise<pl.DataFrame>;
    collectSync(opts?): pl.DataFrame;
    get columns(): string[];
    describeOptimizedPlan(opts?): string;
    describePlan(): string;
    distinct(maintainOrder?, subset?, keep?): LazyDataFrame;
    distinct(opts): LazyDataFrame;
    drop(name): LazyDataFrame;
    drop(names): LazyDataFrame;
    drop(name, ...names): LazyDataFrame;
    dropNulls(column): LazyDataFrame;
    dropNulls(columns): LazyDataFrame;
    dropNulls(...columns): LazyDataFrame;
    explode(column): LazyDataFrame;
    explode(columns): LazyDataFrame;
    explode(column, ...columns): LazyDataFrame;
    fetch(numRows?): Promise<pl.DataFrame>;
    fetch(numRows, opts): Promise<pl.DataFrame>;
    fetchSync(numRows?): pl.DataFrame;
    fetchSync(numRows, opts): pl.DataFrame;
    fillNull(fillValue): LazyDataFrame;
    filter(predicate): LazyDataFrame;
    first(): pl.DataFrame;
    groupBy(by, maintainOrder?): LazyGroupBy;
    groupBy(by, opts): LazyGroupBy;
    groupByDynamic(options): LazyGroupBy;
    groupByRolling(opts): LazyGroupBy;
    head(length?): LazyDataFrame;
    join(other, joinOptions): LazyDataFrame;
    join(other, joinOptions): LazyDataFrame;
    join(other, options): LazyDataFrame;
    joinAsof(other, options): LazyDataFrame;
    last(): LazyDataFrame;
    limit(n?): LazyDataFrame;
    max(): LazyDataFrame;
    mean(): LazyDataFrame;
    median(): LazyDataFrame;
    melt(idVars, valueVars): LazyDataFrame;
    min(): LazyDataFrame;
    quantile(quantile): LazyDataFrame;
    rename(mapping): any;
    reverse(): any;
    select(column): LazyDataFrame;
    select(columns): LazyDataFrame;
    select(...columns): LazyDataFrame;
    serialize(format): Buffer;
    shift(periods): LazyDataFrame;
    shift(opts): LazyDataFrame;
    shiftAndFill(n, fillValue): LazyDataFrame;
    shiftAndFill(opts): LazyDataFrame;
    sinkCSV(path, options?): void;
    sinkParquet(path, options?): void;
    slice(offset, length): LazyDataFrame;
    slice(opts): LazyDataFrame;
    sort(by, descending?, maintain_order?): LazyDataFrame;
    sort(opts): LazyDataFrame;
    std(): LazyDataFrame;
    sum(): LazyDataFrame;
    tail(length?): LazyDataFrame;
    toJSON(): string;
    unique(maintainOrder?, subset?, keep?): LazyDataFrame;
    unique(opts): LazyDataFrame;
    var(): LazyDataFrame;
    withColumn(expr): LazyDataFrame;
    withColumnRenamed(existing, replacement): LazyDataFrame;
    withColumns(exprs): LazyDataFrame;
    withColumns(...exprs): LazyDataFrame;
    withRowCount(): any;
}

Hierarchy

  • Serialize
  • GroupByOps<LazyGroupBy>
    • LazyDataFrame

Accessors

Methods

  • Collect into a DataFrame. Note: use fetch if you want to run this query on only the first n rows. This can be a huge time saver when debugging queries.

    Parameters

    Returns Promise<pl.DataFrame>

    DataFrame

  • A string representation of the optimized query plan.

    Parameters

    Returns string

  • A string representation of the unoptimized query plan.

    Returns string

  • Drop duplicate rows from this DataFrame. Note that this fails if there is a column of type List in the DataFrame.

    Parameters

    • Optional maintainOrder: boolean
    • Optional subset: ColumnSelection

      Subset of columns to consider when dropping duplicates.

    • Optional keep: "first" | "last"

      "first" | "last"

    Returns LazyDataFrame

    Deprecated

    Since

    0.4.0

    Use

    unique

  • Parameters

    • opts: {
          keep?: "first" | "last";
          maintainOrder?: boolean;
          subset?: ColumnSelection;
      }
      • Optional keep?: "first" | "last"
      • Optional maintainOrder?: boolean
      • Optional subset?: ColumnSelection

    Returns LazyDataFrame

  • Fetch is like a collect operation, but it limits the number of rows read by every scan in the query.

    Note that fetch does not guarantee the final number of rows in the DataFrame: filters, joins, and the number of rows available in the scanned file all influence the final row count.

    Parameters

    • Optional numRows: number

      collect only 'n' rows from the data source

    Returns Promise<pl.DataFrame>

  • Parameters

    Returns Promise<pl.DataFrame>

  • Filter the rows in the DataFrame based on a predicate expression.

    Parameters

    • predicate: string | pl.Expr

      Expression that evaluates to a boolean Series.

    Returns LazyDataFrame

    Example

    > lf = pl.DataFrame({
    > "foo": [1, 2, 3],
    > "bar": [6, 7, 8],
    > "ham": ['a', 'b', 'c']
    > }).lazy()
    > // Filter on one condition
    > lf.filter(pl.col("foo").lt(3)).collect()
    shape: (2, 3)
    ┌─────┬─────┬─────┐
    │ foo ┆ bar ┆ ham │
    │ --- ┆ --- ┆ --- │
    │ i64 ┆ i64 ┆ str │
    ╞═════╪═════╪═════╡
    │ 1   ┆ 6   ┆ a   │
    ├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
    │ 2   ┆ 7   ┆ b   │
    └─────┴─────┴─────┘
  • Start a groupby operation.

    Parameters

    • by: ColumnsOrExpr
    • Optional maintainOrder: boolean

    Returns LazyGroupBy

  • Parameters

    • by: ColumnsOrExpr
    • opts: {
          maintainOrder: boolean;
      }
      • maintainOrder: boolean

    Returns LazyGroupBy

  • Groups based on a time value (or an index value of type Int32 or Int64). Time windows are calculated and rows are assigned to windows. Unlike a normal groupby, a row can be a member of multiple groups. The time/index window could be seen as a rolling window, with a window size determined by dates/times/values instead of slots in the DataFrame.

    A window is defined by:

    • every: interval of the window
    • period: length of the window
    • offset: offset of the window

    The every, period and offset arguments are created with the following string language:

    • 1ns (1 nanosecond)
    • 1us (1 microsecond)
    • 1ms (1 millisecond)
    • 1s (1 second)
    • 1m (1 minute)
    • 1h (1 hour)
    • 1d (1 day)
    • 1w (1 week)
    • 1mo (1 calendar month)
    • 1y (1 calendar year)
    • 1i (1 index count)

    Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds

    In case of a groupByDynamic on an integer column, the windows are defined by:

    • "1i" # length 1
    • "10i" # length 10

    Parameters

    • options: {
          by?: ColumnsOrExpr;
          check_sorted?: boolean;
          closed?: "none" | "left" | "right" | "both";
          every: string;
          includeBoundaries?: boolean;
          indexColumn: string;
          offset?: string;
          period?: string;
          start_by: StartBy;
      }
      • Optional by?: ColumnsOrExpr
      • Optional check_sorted?: boolean
      • Optional closed?: "none" | "left" | "right" | "both"
      • every: string
      • Optional includeBoundaries?: boolean
      • indexColumn: string
      • Optional offset?: string
      • Optional period?: string
      • start_by: StartBy

    Returns LazyGroupBy

  • Create rolling groups based on a time column (or an index value of type Int32 or Int64).

    Unlike a dynamic groupby, the windows are determined by the individual values rather than constant intervals. For constant intervals use groupByDynamic.

    The period and offset arguments are created with the following string language:

    • 1ns (1 nanosecond)
    • 1us (1 microsecond)
    • 1ms (1 millisecond)
    • 1s (1 second)
    • 1m (1 minute)
    • 1h (1 hour)
    • 1d (1 day)
    • 1w (1 week)
    • 1mo (1 calendar month)
    • 1y (1 calendar year)
    • 1i (1 index count)

    Or combine them: "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds

    In case of a groupByRolling on an integer column, the windows are defined by:

    • "1i" # length 1
    • "10i" # length 10

    Parameters

    • opts: {
          by?: ColumnsOrExpr;
          check_sorted?: boolean;
          closed?: "none" | "left" | "right" | "both";
          indexColumn: ColumnsOrExpr;
          offset?: string;
          period: string;
      }
      • Optional by?: ColumnsOrExpr
      • Optional check_sorted?: boolean
      • Optional closed?: "none" | "left" | "right" | "both"
      • indexColumn: ColumnsOrExpr
      • Optional offset?: string
      • period: string

    Returns LazyGroupBy

    Example


    >const dates = [
    ...   "2020-01-01 13:45:48",
    ...   "2020-01-01 16:42:13",
    ...   "2020-01-01 16:45:09",
    ...   "2020-01-02 18:12:48",
    ...   "2020-01-03 19:45:32",
    ...   "2020-01-08 23:16:43",
    ... ]
    >const df = pl.DataFrame({"dt": dates, "a": [3, 7, 5, 9, 2, 1]}).withColumn(
    ...   pl.col("dt").str.strptime(pl.Datetime)
    ... )
    >const out = df.groupByRolling({indexColumn: "dt", period: "2d"}).agg(
    ...   [
    ...     pl.sum("a").alias("sum_a"),
    ...     pl.max("a").alias("max_a"),
    ...     pl.min("a").alias("min_a"),
    ...   ]
    ... )
    >assert.deepEqual(out.getColumn("sum_a").toArray(), [3, 10, 15, 24, 11, 1])
    >assert.deepEqual(out.getColumn("max_a").toArray(), [3, 7, 7, 9, 9, 1])
    >assert.deepEqual(out.getColumn("min_a").toArray(), [3, 3, 3, 3, 2, 1])
    >out
    shape: (6, 4)
    ┌─────────────────────┬───────┬───────┬───────┐
    │ dt                  ┆ sum_a ┆ max_a ┆ min_a │
    │ ---                 ┆ ---   ┆ ---   ┆ ---   │
    │ datetime[ms]        ┆ i64   ┆ i64   ┆ i64   │
    ╞═════════════════════╪═══════╪═══════╪═══════╡
    │ 2020-01-01 13:45:48 ┆ 3     ┆ 3     ┆ 3     │
    ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
    │ 2020-01-01 16:42:13 ┆ 10    ┆ 7     ┆ 3     │
    ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
    │ 2020-01-01 16:45:09 ┆ 15    ┆ 7     ┆ 3     │
    ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
    │ 2020-01-02 18:12:48 ┆ 24    ┆ 9     ┆ 3     │
    ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
    │ 2020-01-03 19:45:32 ┆ 11    ┆ 9     ┆ 2     │
    ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
    │ 2020-01-08 23:16:43 ┆ 1     ┆ 1     ┆ 1     │
    └─────────────────────┴───────┴───────┴───────┘
  • Gets the first n rows of the DataFrame. You probably don't want to use this!

    Consider using the fetch operation instead. The fetch operation will truly load only the first n rows lazily.

    Parameters

    • Optional length: number

    Returns LazyDataFrame

  • SQL-like joins.

    Parameters

    Returns LazyDataFrame

    See

    LazyJoinOptions

    Example

    >>> const df = pl.DataFrame({
    >>> foo: [1, 2, 3],
    >>> bar: [6.0, 7.0, 8.0],
    >>> ham: ['a', 'b', 'c'],
    >>> }).lazy()
    >>>
    >>> const otherDF = pl.DataFrame({
    >>> apple: ['x', 'y', 'z'],
    >>> ham: ['a', 'b', 'd'],
    >>> }).lazy();
    >>> const result = await df.join(otherDF, { on: 'ham', how: 'inner' }).collect();
    shape: (2, 4)
    ╭─────┬─────┬─────┬───────╮
    │ foo ┆ bar ┆ ham ┆ apple │
    │ --- ┆ --- ┆ --- ┆ ---   │
    │ i64 ┆ f64 ┆ str ┆ str   │
    ╞═════╪═════╪═════╪═══════╡
    │ 1   ┆ 6   ┆ "a" ┆ "x"   │
    ├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
    │ 2   ┆ 7   ┆ "b" ┆ "y"   │
    ╰─────┴─────┴─────┴───────╯
  • Parameters

    Returns LazyDataFrame

  • Parameters

    • other: LazyDataFrame
    • options: {
          allowParallel?: boolean;
          forceParallel?: boolean;
          how: "cross";
          suffix?: string;
      }
      • Optional allowParallel?: boolean
      • Optional forceParallel?: boolean
      • how: "cross"
      • Optional suffix?: string

    Returns LazyDataFrame

  • Perform an asof join. This is similar to a left-join except that we match on nearest key rather than equal keys.

    Both DataFrames must be sorted by the asof join key.

    For each row in the left DataFrame:

    • A "backward" search selects the last row in the right DataFrame whose 'on' key is less than or equal to the left's key.

    • A "forward" search selects the first row in the right DataFrame whose 'on' key is greater than or equal to the left's key.

    The default is "backward".

    Parameters

    • other: LazyDataFrame

      DataFrame to join with.

    • options: {
          allowParallel?: boolean;
          by?: string | string[];
          byLeft?: string | string[];
          byRight?: string | string[];
          forceParallel?: boolean;
          leftOn?: string;
          on?: string;
          rightOn?: string;
          strategy?: "backward" | "forward";
          suffix?: string;
          tolerance?: string | number;
      }
      • Optional allowParallel?: boolean

        Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.

      • Optional by?: string | string[]
      • Optional byLeft?: string | string[]

        Join on these columns before performing the asof join.

      • Optional byRight?: string | string[]

        Join on these columns before performing the asof join.

      • Optional forceParallel?: boolean

        Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.

      • Optional leftOn?: string

        Join column of the left DataFrame.

      • Optional on?: string

        Join column of both DataFrames. If set, leftOn and rightOn should be undefined.

      • Optional rightOn?: string

        Join column of the right DataFrame.

      • Optional strategy?: "backward" | "forward"

        One of {'forward', 'backward'}

      • Optional suffix?: string

        Suffix to append to columns with a duplicate name.

      • Optional tolerance?: string | number

        Numeric tolerance. When set, the join only matches keys that are within this distance of each other. If the asof join is performed on columns of dtype "Date" or "Datetime", you can use the following string language:

        • 1ns (1 nanosecond)
        • 1us (1 microsecond)
        • 1ms (1 millisecond)
        • 1s (1 second)
        • 1m (1 minute)
        • 1h (1 hour)
        • 1d (1 day)
        • 1w (1 week)
        • 1mo (1 calendar month)
        • 1y (1 calendar year)
        • 1i (1 index count)

        Or combine them:

        • "3d12h4m25s" # 3 days, 12 hours, 4 minutes, and 25 seconds

    Returns LazyDataFrame

    Example

     >const gdp = pl.DataFrame({
    ... date: [
    ... new Date('2016-01-01'),
    ... new Date('2017-01-01'),
    ... new Date('2018-01-01'),
    ... new Date('2019-01-01'),
    ... ], // note record date: Jan 1st (sorted!)
    ... gdp: [4164, 4411, 4566, 4696],
    ... })
    >const population = pl.DataFrame({
    ... date: [
    ... new Date('2016-05-12'),
    ... new Date('2017-05-12'),
    ... new Date('2018-05-12'),
    ... new Date('2019-05-12'),
    ... ], // note record date: May 12th (sorted!)
    ... "population": [82.19, 82.66, 83.12, 83.52],
    ... })
    >population.joinAsof(
    ... gdp,
    ... {leftOn:"date", rightOn:"date", strategy:"backward"}
    ... )
    shape: (4, 3)
    ┌─────────────────────┬────────────┬──────┐
    │ date                ┆ population ┆ gdp  │
    │ ---                 ┆ ---        ┆ ---  │
    │ datetime[μs]        ┆ f64        ┆ i64  │
    ╞═════════════════════╪════════════╪══════╡
    │ 2016-05-12 00:00:00 ┆ 82.19      ┆ 4164 │
    ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
    │ 2017-05-12 00:00:00 ┆ 82.66      ┆ 4411 │
    ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
    │ 2018-05-12 00:00:00 ┆ 83.12      ┆ 4566 │
    ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
    │ 2019-05-12 00:00:00 ┆ 83.52      ┆ 4696 │
    └─────────────────────┴────────────┴──────┘
  • Serializes the object to the desired format via serde.

    Parameters

    Returns Buffer

  • Evaluate the query in streaming mode and write to a CSV file.

    Warning: Streaming mode is considered unstable. It may be changed at any point without that being considered a breaking change.

    This allows streaming results that are larger than RAM to be written to disk.

    Parameters

    • path: string

      File path to which the file should be written.

    • Optional options: SinkCsvOptions

    Returns void

  • Evaluate the query in streaming mode and write to a Parquet file.

    Warning: Streaming mode is considered unstable. It may be changed at any point without that being considered a breaking change.

    This allows streaming results that are larger than RAM to be written to disk.

    Parameters

    • path: string

      File path to which the file should be written.

    • Optional options: SinkParquetOptions

    Returns void

  • Drop duplicate rows from this DataFrame. Note that this fails if there is a column of type List in the DataFrame.

    Parameters

    • Optional maintainOrder: boolean
    • Optional subset: ColumnSelection

      Subset of columns to consider when dropping duplicates.

    • Optional keep: "first" | "last"

      "first" | "last"

    Returns LazyDataFrame

  • Parameters

    • opts: {
          keep?: "first" | "last";
          maintainOrder?: boolean;
          subset?: ColumnSelection;
      }
      • Optional keep?: "first" | "last"
      • Optional maintainOrder?: boolean
      • Optional subset?: ColumnSelection

    Returns LazyDataFrame