Collect a query into a DataFrame

Description

$collect() executes the query defined on the LazyFrame and returns the result as a DataFrame.

Usage

<LazyFrame>$collect(
  ...,
  type_coercion = TRUE,
  predicate_pushdown = TRUE,
  projection_pushdown = TRUE,
  simplify_expression = TRUE,
  slice_pushdown = TRUE,
  comm_subplan_elim = TRUE,
  comm_subexpr_elim = TRUE,
  cluster_with_columns = TRUE,
  streaming = FALSE,
  no_optimization = FALSE,
  collect_in_background = FALSE
)

Arguments

... Ignored.
type_coercion Logical. Coerce types such that operations succeed and run on minimal required memory.
predicate_pushdown Logical. Applies filters as early as possible at scan level.
projection_pushdown Logical. Select only the columns that are needed at the scan level.
simplify_expression Logical. Various optimizations, such as constant folding and replacing expensive operations with faster alternatives.
slice_pushdown Logical. Only load the required slice from the scan level. Don’t materialize sliced outputs (e.g. join$head(10)).
comm_subplan_elim Logical. Will try to cache branching subplans that occur on self-joins or unions.
comm_subexpr_elim Logical. Common subexpressions will be cached and reused.
cluster_with_columns Logical. Combine sequential independent calls to with_columns().
streaming Logical. Run parts of the query in a streaming fashion (this is in an alpha state).
no_optimization Logical. Sets the following parameters to FALSE: predicate_pushdown, projection_pushdown, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns.
collect_in_background Logical. Detach this query from the R session. Computation starts in the background, and a handle is returned that can later be converted into the resulting DataFrame. Useful in interactive mode to avoid locking the R session.
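
As a minimal sketch of how the optimization arguments above are passed, the following runs the same query once with the defaults and once with no_optimization = TRUE. The dataset (mtcars) and the variable names are illustrative choices, not part of the original page; the optimizer settings are expected to change how the query runs, not what it returns.

```r
library("polars")

# Build a small lazy query on a built-in dataset (illustrative choice).
lf = as_polars_lf(mtcars)$filter(pl$col("cyl") == 6)$select("mpg", "cyl")

# Default: all optimizations listed above are enabled.
optimized = lf$collect()

# Same query with the optimizer passes listed under no_optimization disabled.
unoptimized = lf$collect(no_optimization = TRUE)

# Compare the two results after converting to base R data.frames.
same_result = identical(as.data.frame(optimized), as.data.frame(unoptimized))
```

Individual flags such as predicate_pushdown = FALSE can be passed in the same way when only one optimization needs to be ruled out while debugging.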

Details

Note: use $fetch(n) if you want to run your query on the first n rows only. This can be a huge time saver when debugging queries.
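
A short sketch of that debugging workflow, assuming a filter on the built-in iris dataset (the query itself is only an illustration):

```r
library("polars")

lf = as_polars_lf(iris)$filter(pl$col("Sepal.Length") > 5)

# Debug run: the query is executed on the first 10 rows of the input only.
# The result can contain fewer than 10 rows, because the filter is applied
# to those 10 rows rather than to the full data.
preview = lf$fetch(10)

# Full run once the query looks right.
full = lf$collect()
```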

Value

A DataFrame

See Also

  • $fetch() - fast limited query check
  • $profile() - same as $collect() but also returns a table with each operation profiled.
  • $collect_in_background() - non-blocking collect that returns a handle. Can also be used via $collect(collect_in_background = TRUE).
  • $sink_parquet() - streams the query to a Parquet file.
  • $sink_ipc() - streams the query to an Arrow file.

Examples

library("polars")

as_polars_lf(iris)$filter(pl$col("Species") == "setosa")$collect()
#> shape: (50, 5)
#> ┌──────────────┬─────────────┬──────────────┬─────────────┬─────────┐
#> │ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width ┆ Species │
#> │ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     │
#> │ f64          ┆ f64         ┆ f64          ┆ f64         ┆ cat     │
#> ╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
#> │ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa  │
#> │ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa  │
#> │ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa  │
#> │ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa  │
#> │ 5.0          ┆ 3.6         ┆ 1.4          ┆ 0.2         ┆ setosa  │
#> │ …            ┆ …           ┆ …            ┆ …           ┆ …       │
#> │ 4.8          ┆ 3.0         ┆ 1.4          ┆ 0.3         ┆ setosa  │
#> │ 5.1          ┆ 3.8         ┆ 1.6          ┆ 0.2         ┆ setosa  │
#> │ 4.6          ┆ 3.2         ┆ 1.4          ┆ 0.2         ┆ setosa  │
#> │ 5.3          ┆ 3.7         ┆ 1.5          ┆ 0.2         ┆ setosa  │
#> │ 5.0          ┆ 3.3         ┆ 1.4          ┆ 0.2         ┆ setosa  │
#> └──────────────┴─────────────┴──────────────┴─────────────┴─────────┘
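
The collect_in_background argument described under Arguments could be used along the following lines. This is a sketch, not a definitive recipe: it assumes that the returned handle provides a $join() method that waits for the computation and returns the DataFrame, which is not stated on this page and should be checked against the documentation of your installed polars version.

```r
library("polars")

lf = as_polars_lf(iris)$filter(pl$col("Species") == "setosa")

# Start the query without blocking the R session; a handle is returned.
handle = lf$collect(collect_in_background = TRUE)

# ... other work can happen in the R session here ...

# Assumption: $join() blocks until the query finishes and returns the DataFrame.
df = handle$join()
```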