Collect and profile a lazy query.
Description
Runs the query and returns a list containing the materialized DataFrame and a second DataFrame with profiling information for each executed node.
Usage
<LazyFrame>$profile(
type_coercion = TRUE,
predicate_pushdown = TRUE,
projection_pushdown = TRUE,
simplify_expression = TRUE,
slice_pushdown = TRUE,
comm_subplan_elim = TRUE,
comm_subexpr_elim = TRUE,
cluster_with_columns = TRUE,
streaming = FALSE,
no_optimization = FALSE,
collect_in_background = FALSE,
show_plot = FALSE,
truncate_nodes = 0
)
Arguments
type_coercion
|
Logical. Coerce types such that operations succeed and run on minimal required memory. |
predicate_pushdown
|
Logical. Applies filters as early as possible at scan level. |
projection_pushdown
|
Logical. Select only the columns that are needed at the scan level. |
simplify_expression
|
Logical. Various optimizations, such as constant folding and replacing expensive operations with faster alternatives. |
slice_pushdown
|
Logical. Only load the required slice from the scan level. Don’t materialize sliced outputs (e.g. join$head(10)).
|
comm_subplan_elim
|
Logical. Will try to cache branching subplans that occur on self-joins or unions. |
comm_subexpr_elim
|
Logical. Common subexpressions will be cached and reused. |
cluster_with_columns
|
Logical. Combine sequential independent calls to with_columns().
|
streaming
|
Logical. Run parts of the query in a streaming fashion (this is in an alpha state). |
no_optimization
|
Logical. Sets the following parameters to FALSE: predicate_pushdown, projection_pushdown, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns.
|
collect_in_background
|
Logical. Detach this query from the R session. Computation starts in the background and returns a handle that can later be converted into the resulting DataFrame. Useful in interactive mode to avoid locking the R session. |
show_plot
|
Logical. Show a Gantt chart of the profiling result. |
truncate_nodes
|
Truncate the label lengths in the Gantt chart to this number of
characters. If 0 (default), do not truncate.
|
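As a minimal sketch of how these arguments are passed, the following profiles the same query once with the default optimizations and once with no_optimization = TRUE. The query itself (a filter and a select on mtcars) is an illustrative assumption and not part of the original examples. Timings vary between machines and runs, so none are shown.

```r
library(polars)

# Illustrative query (an assumption): filter rows, then keep two columns
lf = as_polars_lf(mtcars)$
  filter(pl$col("cyl") == 6)$
  select("mpg", "cyl")

# Profile with the default optimizations
optimized = lf$profile()

# Profile with the pushdown and caching optimizations switched off
unoptimized = lf$profile(no_optimization = TRUE)

# Both runs return the same data; the executed nodes and timings may differ
optimized$profile
unoptimized$profile
```

Comparing the node column of the two profiling DataFrames is one way to see which steps an optimization removed or merged.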
Details
The units of the timings are microseconds.
Value
List of two DataFrames: one with the collected result, the other with the timings of each step. If show_plot = TRUE, then the plot is also stored in the list.
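The returned list can be handled like any other R list. The sketch below, offered as one possible way to read the timings rather than a prescribed workflow, pulls out both elements and derives a per-node duration from the start and end columns. The column name duration_us is a hypothetical name introduced here for illustration.

```r
library(polars)

out = as_polars_lf(iris)$
  sort("Sepal.Length")$
  profile()

# The collected query result
out$result

# Per-node timings; `start` and `end` are in microseconds.
# `duration_us` is a hypothetical column name chosen for this sketch.
timings = out$profile$with_columns(
  (pl$col("end") - pl$col("start"))$alias("duration_us")
)
timings
```

Because start and end are offsets from the beginning of the run, the difference gives the time spent in each node, which is usually easier to compare across queries than the raw offsets.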
See Also
- $collect() - regular collect.
- $fetch() - fast limited query check.
- $collect_in_background() - non-blocking collect that returns a future handle. Can also be used via $collect(collect_in_background = TRUE).
- $sink_parquet() - streams query to a Parquet file.
- $sink_ipc() - streams query to an Arrow file.
Examples
#> $result
#> shape: (1, 1)
#> ┌─────────┐
#> │ literal │
#> │ --- │
#> │ f64 │
#> ╞═════════╡
#> │ 4.0 │
#> └─────────┘
#>
#> $profile
#> shape: (2, 3)
#> ┌─────────────────┬───────┬─────┐
#> │ node ┆ start ┆ end │
#> │ --- ┆ --- ┆ --- │
#> │ str ┆ u64 ┆ u64 │
#> ╞═════════════════╪═══════╪═════╡
#> │ optimization ┆ 0 ┆ 45 │
#> │ select(literal) ┆ 45 ┆ 266 │
#> └─────────────────┴───────┴─────┘
# Use $profile() to compare two queries
# -1- map each Species-group with native polars; the group_by node takes ~550us in the run below
as_polars_lf(iris)$
sort("Sepal.Length")$
group_by("Species", maintain_order = TRUE)$
agg(pl$col(pl$Float64)$first() + 5)$
profile()
#> $result
#> shape: (3, 5)
#> ┌────────────┬──────────────┬─────────────┬──────────────┬─────────────┐
#> │ Species ┆ Sepal.Length ┆ Sepal.Width ┆ Petal.Length ┆ Petal.Width │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ cat ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
#> ╞════════════╪══════════════╪═════════════╪══════════════╪═════════════╡
#> │ setosa ┆ 9.3 ┆ 8.0 ┆ 6.1 ┆ 5.1 │
#> │ versicolor ┆ 9.9 ┆ 7.4 ┆ 8.3 ┆ 6.0 │
#> │ virginica ┆ 9.9 ┆ 7.5 ┆ 9.5 ┆ 6.7 │
#> └────────────┴──────────────┴─────────────┴──────────────┴─────────────┘
#>
#> $profile
#> shape: (3, 3)
#> ┌────────────────────┬───────┬──────┐
#> │ node ┆ start ┆ end │
#> │ --- ┆ --- ┆ --- │
#> │ str ┆ u64 ┆ u64 │
#> ╞════════════════════╪═══════╪══════╡
#> │ optimization ┆ 0 ┆ 31 │
#> │ sort(Sepal.Length) ┆ 31 ┆ 727 │
#> │ group_by(Species) ┆ 736 ┆ 1288 │
#> └────────────────────┴───────┴──────┘
# -2- map each Species-group of each numeric column with an R function; the group_by node takes ~70000us in the run below (slow!)
# some R function, prints `.` for each time called by polars
r_func = \(s) {
cat(".")
s$to_r()[1] + 5
}
as_polars_lf(iris)$
sort("Sepal.Length")$
group_by("Species", maintain_order = TRUE)$
agg(pl$col(pl$Float64)$map_elements(r_func))$
profile()
#> ............
#> $result
#> shape: (3, 5)
#> ┌────────────┬────────────────────┬───────────────────┬────────────────────┬───────────────────┐
#> │ Species ┆ Sepal.Length_apply ┆ Sepal.Width_apply ┆ Petal.Length_apply ┆ Petal.Width_apply │
#> │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
#> │ cat ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
#> ╞════════════╪════════════════════╪═══════════════════╪════════════════════╪═══════════════════╡
#> │ setosa ┆ 9.3 ┆ 8.0 ┆ 6.1 ┆ 5.1 │
#> │ versicolor ┆ 9.9 ┆ 7.4 ┆ 8.3 ┆ 6.0 │
#> │ virginica ┆ 9.9 ┆ 7.5 ┆ 9.5 ┆ 6.7 │
#> └────────────┴────────────────────┴───────────────────┴────────────────────┴───────────────────┘
#>
#> $profile
#> shape: (3, 3)
#> ┌────────────────────┬───────┬───────┐
#> │ node ┆ start ┆ end │
#> │ --- ┆ --- ┆ --- │
#> │ str ┆ u64 ┆ u64 │
#> ╞════════════════════╪═══════╪═══════╡
#> │ optimization ┆ 0 ┆ 25 │
#> │ sort(Sepal.Length) ┆ 25 ┆ 568 │
#> │ group_by(Species) ┆ 579 ┆ 70691 │
#> └────────────────────┴───────┴───────┘