Inner workings of the LazyFrame-class
Description
The LazyFrame class is simply two environments holding, respectively,
the public and private methods/function calls to the polars Rust side.
The instantiated LazyFrame object is an externalptr to a low-level Rust
polars LazyFrame object. The pointer address is the only state of the
LazyFrame object on the R side. Any other state resides on the Rust
side. The S3 method .DollarNames.RPolarsLazyFrame exposes all public
$foobar() methods that are callable on the object.
Most methods return another LazyFrame instance or similar, which allows
for method chaining. This class system could, for lack of a better name,
be called "environment classes". It is the same class system that
extendr provides, except that here there is both a public and a private
set of methods. For implementation reasons, the private methods are
external and must be called via .pr$LazyFrame$methodname(). Also, all
private methods must take self as an explicit argument, so they are pure
functions. Having the private methods as pure functions
solved/simplified self-referential complications.
DataFrame and LazyFrame can both be said to be a Frame. To convert
between them, use <DataFrame>$lazy() and <LazyFrame>$collect(). You can
also create a LazyFrame directly with pl$LazyFrame(). This is quite
similar to the lazy-collect syntax that the dplyr package uses to
interact with database connections such as SQL variants. Most SQL
databases are able to perform the same optimizations as polars, such as
predicate pushdown and projection pushdown. However, polars can interact
with and optimize queries across both SQL databases and other data
sources, such as Parquet files, simultaneously.
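The round trip described above can be sketched as follows. This is a minimal illustration that assumes the polars package is attached; the column names x and y and the filter threshold are made up for the example.

```r
library(polars)

# Eager polars DataFrame built from a small in-memory R data.frame
# (the columns x and y are hypothetical example data)
df = as_polars_df(data.frame(x = 1:3, y = c("a", "b", "c")))

# DataFrame -> LazyFrame: no computation happens at this point
lf = df$lazy()

# A LazyFrame can also be created directly
lf_direct = pl$LazyFrame(x = 1:3, y = c("a", "b", "c"))

# LazyFrame -> DataFrame: the query is optimized and executed on $collect()
res = lf$filter(pl$col("x") > 1)$collect()
res_direct = lf_direct$filter(pl$col("x") > 1)$collect()
```

Both routes should yield the same result here; the difference only matters when the data comes from an external source that can be scanned lazily.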
Active bindings
columns
$columns returns a character vector with the column names.
dtypes
$dtypes returns an unnamed list with the data type of each column.
schema
$schema returns a named list with the data type of each column.
width
$width returns the number of columns in the LazyFrame.
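As a small sketch of how these active bindings are accessed (note that they are used without parentheses), assuming the polars package is attached and using a hypothetical two-column LazyFrame:

```r
library(polars)

# Hypothetical example data with one integer and one string column
lf = pl$LazyFrame(a = 1:3, b = c("x", "y", "z"))

col_names = lf$columns  # character vector of column names
n_cols = lf$width       # number of columns
schema = lf$schema      # named list: column name -> data type
dtypes = lf$dtypes      # unnamed list of data types, in column order
```

None of these bindings trigger execution of the query; they only inspect the schema of the LazyFrame.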
Conversion to R data types considerations
When converting Polars objects, such as DataFrames, to R objects, for
example via the as.data.frame() generic function, each type in the
Polars object is converted to an R type. In some cases, an error may
occur because the conversion is not appropriate. In particular, there is
a high possibility of an error when converting a Datetime type without a
time zone. A Datetime type without a time zone in Polars is converted to
the POSIXct type in R, which takes into account the time zone in which
the R session is running (which can be checked with the Sys.timezone()
function). In this case, if ambiguous or non-existent times are
included, a conversion error will occur. In such cases, change the
session time zone using Sys.setenv(TZ = "UTC") and then perform the
conversion, or use the $dt$replace_time_zone() method on the Datetime
type column to explicitly specify the time zone before conversion.
# Due to daylight saving time, clocks were turned forward 1 hour on Sunday, March 8, 2020, 2:00:00 am
# so this particular date-time doesn't exist
non_existent_time = as_polars_series("2020-03-08 02:00:00")$str$strptime(pl$Datetime(), "%F %T")

withr::with_timezone(
  "America/New_York",
  {
    tryCatch(
      # This causes an error due to the time zone (the `TZ` env var is affected).
      as.vector(non_existent_time),
      error = function(e) e
    )
  }
)
#> <error: in to_r: ComputeError(ErrString("datetime '2020-03-08 02:00:00' is non-existent in time zone 'America/New_York'. You may be able to use `non_existent='null'` to return `null` in this case.")) When calling: devtools::document()>

withr::with_timezone(
  "America/New_York",
  {
    # This is safe.
    as.vector(non_existent_time$dt$replace_time_zone("UTC"))
  }
)
#> [1] "2020-03-08 02:00:00 UTC"
Examples
# see all public methods
ls(.pr$env$RPolarsLazyFrame)
#> [1] "cast" "clear" "clone"
#> [4] "collect" "collect_in_background" "columns"
#> [7] "drop" "drop_nulls" "dtypes"
#> [10] "explain" "explode" "fetch"
#> [13] "fill_nan" "fill_null" "filter"
#> [16] "first" "gather_every" "group_by"
#> [19] "group_by_dynamic" "head" "join"
#> [22] "join_asof" "join_where" "last"
#> [25] "limit" "max" "mean"
#> [28] "median" "min" "print"
#> [31] "profile" "quantile" "rename"
#> [34] "reverse" "rolling" "schema"
#> [37] "select" "select_seq" "serialize"
#> [40] "shift" "sink_csv" "sink_ipc"
#> [43] "sink_ndjson" "sink_parquet" "slice"
#> [46] "sort" "sql" "std"
#> [49] "sum" "tail" "to_dot"
#> [52] "unique" "unnest" "unpivot"
#> [55] "var" "width" "with_columns"
#> [58] "with_columns_seq" "with_context" "with_row_index"
# see all private methods (not intended for regular use)
ls(.pr$LazyFrame)
#> [1] "cast" "cast_all"
#> [3] "clone_in_rust" "collect"
#> [5] "collect_in_background" "debug_plan"
#> [7] "describe_optimized_plan" "describe_optimized_plan_tree"
#> [9] "describe_plan" "describe_plan_tree"
#> [11] "deserialize" "drop"
#> [13] "drop_nulls" "explode"
#> [15] "fetch" "fill_nan"
#> [17] "fill_null" "filter"
#> [19] "first" "group_by"
#> [21] "group_by_dynamic" "join"
#> [23] "join_asof" "join_where"
#> [25] "last" "max"
#> [27] "mean" "median"
#> [29] "min" "optimization_toggle"
#> [31] "print" "profile"
#> [33] "quantile" "rename"
#> [35] "reverse" "rolling"
#> [37] "schema" "select"
#> [39] "select_seq" "serialize"
#> [41] "shift" "sink_csv"
#> [43] "sink_ipc" "sink_json"
#> [45] "sink_parquet" "slice"
#> [47] "sort_by_exprs" "std"
#> [49] "sum" "tail"
#> [51] "to_dot" "unique"
#> [53] "unnest" "unpivot"
#> [55] "var" "with_columns"
#> [57] "with_columns_seq" "with_context"
#> [59] "with_row_index"
## Practical example ##
# First write the R iris dataset to disk, to illustrate a difference
temp_filepath = tempfile()
write.csv(iris, temp_filepath, row.names = FALSE)
# Following example illustrates 2 ways to obtain a LazyFrame
# The-Okay-way: convert an in-memory DataFrame to LazyFrame
# eager in-mem R data.frame
Rdf = read.csv(temp_filepath)
# eager in-mem polars DataFrame
Pdf = as_polars_df(Rdf)
# lazy frame starting from in-mem DataFrame
Ldf_okay = Pdf$lazy()
# The-Best-Way: LazyFrame created directly from a data source is best...
Ldf_best = pl$scan_csv(temp_filepath)
# ... because if you e.g. filter the LazyFrame, that filter (also called a predicate) will be
# pushed down in the execution stack to the CSV reader, thereby only bringing into
# memory the rows matching the filter.
# apply filter:
filter_expr = pl$col("Species") == "setosa" # get only rows where Species is setosa
Ldf_okay = Ldf_okay$filter(filter_expr) # overwrite LazyFrame with new
Ldf_best = Ldf_best$filter(filter_expr)
# the non optimized plans are similar, on entire in-mem csv, apply filter
Ldf_okay$explain(optimized = FALSE)
#> [1] "FILTER [(col(\"Species\")) == (String(setosa))] FROM\n DF [\"Sepal.Length\", \"Sepal.Width\", \"Petal.Length\", \"Petal.Width\"]; PROJECT */5 COLUMNS; SELECTION: None"
Ldf_best$explain(optimized = FALSE)
#> [1] "FILTER [(col(\"Species\")) == (String(setosa))] FROM\n Csv SCAN [/tmp/Rtmpdz6h7d/fileb4dbe9eb5aa]\n PROJECT */5 COLUMNS"
# NOTE: For Ldf_okay, the full time to load the csv was already paid when creating Rdf and Pdf
# The optimized plans are quite different, Ldf_best will read the csv and perform the filter simultaneously
Ldf_okay$explain()
#> [1] "DF [\"Sepal.Length\", \"Sepal.Width\", \"Petal.Length\", \"Petal.Width\"]; PROJECT */5 COLUMNS; SELECTION: [(col(\"Species\")) == (String(setosa))]"
Ldf_best$explain()
#> [1] "Csv SCAN [/tmp/Rtmpdz6h7d/fileb4dbe9eb5aa]\nPROJECT */5 COLUMNS\nSELECTION: [(col(\"Species\")) == (String(setosa))]"
# To acquire the result in-mem use $collect()
Pdf_okay = Ldf_okay$collect()
Pdf_best = Ldf_best$collect()
# verify tables would be the same
all.equal(
  Pdf_okay$to_data_frame(),
  Pdf_best$to_data_frame()
)
#> [1] TRUE