Map an expression with an R function
Description
Apply an R function to the Series that an expression evaluates to.
Usage
<Expr>$map_batches(
  f,
  output_type = NULL,
  agg_list = FALSE,
  in_background = FALSE
)
Arguments
f
  A function to map with.
output_type
  NULL or a type available in names(pl$dtypes). If NULL (default), the
  output datatype will match the input datatype. This argument informs the
  schema of the actual return type of the R function. Setting it
  incorrectly could cause problems downstream in the query.
agg_list
  Aggregate list. Map from vector to group in a group_by context.
in_background
  Logical. Whether to execute the map in a background R process. Combined
  with setting e.g. options(polars.rpool_cap = 4), it can speed up some
  slow R functions, as they can run in parallel R sessions. Communication
  between processes is considerably slower than between threads, so this
  will likely only give a speed-up in a "low IO - high CPU" use case. If
  there are multiple $map_batches(in_background = TRUE) calls in the
  query, they will be run in parallel.
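The role of output_type can be illustrated with a function whose return type differs from its input type. The following is a minimal sketch, assuming a small made-up data frame and a function that rounds doubles to integers; the column name x and the values are purely illustrative.

```r
library(polars)

# Illustrative data: a single Float64 column
df <- pl$DataFrame(x = c(1.4, 2.6))

# The function returns an R integer vector, so we declare an integer
# output type instead of letting it default to the input type (Float64).
out <- df$select(
  pl$col("x")$map_batches(
    \(s) as.integer(round(s$to_vector())),
    output_type = pl$dtypes$Int32
  )
)

# Converting back to an R data.frame shows an integer column
str(as.data.frame(out))
```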
Details
It is sometimes necessary to apply a specific R function to one or several
columns. Note, however, that running R code in $map_batches() is slower than
native polars. The user function must take one polars Series as input, and it
should return a Series or any R object convertible into a Series (e.g. a
vector). $map_batches() fully supports browser(), so the function can be
debugged interactively.
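As a minimal sketch of the contract described above, the function below receives a polars Series and returns a plain R vector, which is then converted back into a Series. The helper name times_ten and the data are hypothetical and only serve the illustration.

```r
library(polars)

df <- pl$DataFrame(x = c(1, 2, 3))

# Hypothetical helper: takes one polars Series, returns a plain R vector.
times_ten <- function(s) {
  s$to_vector() * 10
}

# output_type is left as NULL, so the output dtype matches the input (Float64)
out <- df$select(
  pl$col("x")$map_batches(times_ten)
)

as.data.frame(out)
```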
If in_background = FALSE, the function can access any global variable of the
R session. Note, however, that several calls to $map_batches() sequentially
share the same main R session, so the global environment might change between
the start of the query and the moment a $map_batches() call is evaluated. Any
native polars computations can still be executed in the meantime. If
in_background = TRUE, the map will run in one or more other R sessions and
will not have access to global variables. Use options(polars.rpool_cap = 4)
and polars_options()$rpool_cap to set and view the number of parallel R
sessions.
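The difference in variable visibility can be sketched as follows. The variable name scale_factor and the data are assumptions made for illustration; defining the value inside the function body is just one possible way to avoid relying on globals when in_background = TRUE.

```r
library(polars)

scale_factor <- 3 # a global variable in the main R session

df <- pl$DataFrame(x = c(1, 2, 3))

# in_background = FALSE (default): the function is evaluated in the main
# R session, so it can look up scale_factor in the global environment.
res <- df$select(
  pl$col("x")$map_batches(\(s) s$to_vector() * scale_factor)
)

# in_background = TRUE: the function runs in another R session without
# access to globals, so the value is defined inside the function body.
res_bg <- df$select(
  pl$col("x")$map_batches(\(s) {
    scale_factor <- 3
    s * scale_factor
  }, in_background = TRUE)
)

as.data.frame(res)
```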
Value
Expr
Examples
library("polars")
as_polars_df(iris)$
  select(
    pl$col("Sepal.Length")$map_batches(\(x) {
      paste("cheese", as.character(x$to_vector()))
    }, pl$dtypes$String)
  )
#> shape: (150, 1)
#> ┌──────────────┐
#> │ Sepal.Length │
#> │ ---          │
#> │ str          │
#> ╞══════════════╡
#> │ cheese 5.1   │
#> │ cheese 4.9   │
#> │ cheese 4.7   │
#> │ cheese 4.6   │
#> │ cheese 5     │
#> │ …            │
#> │ cheese 6.7   │
#> │ cheese 6.3   │
#> │ cheese 6.5   │
#> │ cheese 6.2   │
#> │ cheese 5.9   │
#> └──────────────┘
# R parallel process example: use Sys.sleep() to imitate a CPU-expensive
# computation.
# Map a, b, c, d sequentially
pl$LazyFrame(a = 1, b = 2, c = 3, d = 4)$select(
  pl$all()$map_batches(\(s) {
    Sys.sleep(.1)
    s * 2
  })
)$collect() |> system.time()
#>    user  system elapsed
#>   0.028   0.000   0.428
# map in parallel 1: Overhead to start up extra R processes / sessions
options(polars.rpool_cap = 0) # drop any previous processes, just to show start-up overhead
options(polars.rpool_cap = 4) # set back to 4, the default
polars_options()$rpool_cap
#> [1] 4
pl$LazyFrame(a = 1, b = 2, c = 3, d = 4)$select(
  pl$all()$map_batches(\(s) {
    Sys.sleep(.1)
    s * 2
  }, in_background = TRUE)
)$collect() |> system.time()
#>    user  system elapsed
#>   0.012   0.000   0.901
# map in parallel 2: reuse the R processes started above
polars_options()$rpool_cap
#> [1] 4
pl$LazyFrame(a = 1, b = 2, c = 3, d = 4)$select(
  pl$all()$map_batches(\(s) {
    Sys.sleep(.1)
    s * 2
  }, in_background = TRUE)
)$collect() |> system.time()
#>    user  system elapsed
#>   0.004   0.005   0.116