Inner workings of the LazyFrame-class

Description

The LazyFrame class is simply two environments holding, respectively, the public and private methods/function calls to the polars Rust side. An instantiated LazyFrame object is an externalptr to a low-level Rust polars LazyFrame object. The pointer address is the only state the LazyFrame object holds on the R side; all other state resides on the Rust side. The S3 method .DollarNames.RPolarsLazyFrame exposes all public $foobar() methods that can be called on the object.

Most methods return another LazyFrame instance or a similar object, which allows for method chaining. For lack of a better name, this class system could be called "environment classes". It is the same class system that extendr provides, except that here there is both a public and a private set of methods. For implementation reasons, the private methods are external and must be called as .pr$LazyFrame$methodname(). In addition, every private method must take self as an explicit argument, which makes them pure functions. Having the private methods as pure functions solved or simplified self-referential complications.
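The pattern described above can be sketched in base R. This is a toy illustration, not the polars implementation: the Counter class, the public and private environments, and new_counter() are all hypothetical names, and a classed number stands in for the externalptr.

```r
# Hypothetical names throughout; this is not polars code.
# The object carries only its value, standing in for the externalptr.
new_counter = function(value) structure(value, class = "Counter")

# Private methods: pure functions that always receive `self` explicitly.
private = new.env()
private$add = function(self, n) new_counter(unclass(self) + n)

# Public methods: written as if `self` were already in scope.
public = new.env()
public$add = function(n = 1) private$add(self, n)
public$value = function() unclass(self)

# `$` looks up the public method and rebinds its environment so that
# `self` resolves to the object the method was called on.
`$.Counter` = function(self, name) {
  func = public[[name]]
  if (is.null(func)) stop("no such method: ", name)
  environment(func) = environment()
  func
}

# Tab completion lists the public methods only.
.DollarNames.Counter = function(x, pattern = "") {
  grep(pattern, ls(public), value = TRUE)
}

# Each call returns a new object, so calls can be chained.
counter = new_counter(0)
result = counter$add(2)$add(3)$value()
print(result)
```

Because the private functions never look up self implicitly, they can be called directly with any object, which mirrors how .pr$LazyFrame$methodname() is described above.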

DataFrame and LazyFrame can both be said to be a Frame. To convert between them, use <DataFrame>$lazy() and <LazyFrame>$collect(). You can also create a LazyFrame directly with pl$LazyFrame(). This is quite similar to the lazy-collect syntax that the dplyr package uses to interact with database connections such as SQL variants. Most SQL databases can perform the same optimizations as polars, such as predicate pushdown and projection pushdown. However, polars can interact with and optimize queries across both SQL databases and other data sources, such as Parquet files, simultaneously.
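A minimal sketch of the round trip between the two frame types, assuming the polars package is attached and using the built-in mtcars data purely for illustration:

```r
library(polars)

# eager polars DataFrame built from an R data.frame
df = as_polars_df(mtcars)

# DataFrame -> LazyFrame: this only starts a query plan
lf = df$lazy()

# LazyFrame -> DataFrame: the plan is optimized and executed on $collect()
df_small = lf$filter(pl$col("cyl") == 4)$select(pl$col("mpg"), pl$col("cyl"))$collect()

# a LazyFrame can also be created directly from R vectors
lf_direct = pl$LazyFrame(a = 1:3, b = c("x", "y", "z"))
df_direct = lf_direct$collect()
```

Nothing is computed until $collect() is called, so the filter and the column selection are handed to the optimizer as one query.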

Active bindings

columns

$columns returns a character vector with the column names.

dtypes

$dtypes returns an unnamed list with the data type of each column.

schema

$schema returns a named list with the data type of each column.

width

$width returns the number of columns in the LazyFrame.
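The four active bindings can be inspected without collecting the LazyFrame. A short sketch, assuming the polars package is attached; the column names id and name are made up for the example:

```r
library(polars)

lf = pl$LazyFrame(id = 1:3, name = c("a", "b", "c"))

# active bindings are accessed without parentheses
cols = lf$columns   # character vector of column names
n_cols = lf$width   # number of columns
types = lf$dtypes   # unnamed list of data types
sch = lf$schema     # named list of data types, keyed by column name

print(cols)
print(n_cols)
```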

Conversion to R data types considerations

When converting Polars objects, such as DataFrames, to R objects, for example via the as.data.frame() generic function, each type in the Polars object is converted to an R type. In some cases, an error may occur because the conversion is not appropriate. In particular, an error is likely when converting a Datetime type without a time zone. A Datetime type without a time zone in Polars is converted to the POSIXct type in R, which takes into account the time zone in which the R session is running (which can be checked with the Sys.timezone() function). If the data then contains times that are ambiguous or do not exist in that time zone, a conversion error will occur. In such cases, either change the session time zone using Sys.setenv(TZ = "UTC") and then perform the conversion, or use the $dt$replace_time_zone() method on the Datetime column to explicitly specify the time zone before conversion.

# Due to daylight saving time, clocks were turned forward 1 hour on Sunday, March 8, 2020, 2:00:00 am
# so this particular date-time doesn't exist
non_existent_time = as_polars_series("2020-03-08 02:00:00")$str$strptime(pl$Datetime(), "%F %T")

withr::with_timezone(
  "America/New_York",
  {
    tryCatch(
      # This causes an error due to the time zone (the `TZ` env var is affected).
      as.vector(non_existent_time),
      error = function(e) e
    )
  }
)
#> <error: in $to_vector(): in to_r: ComputeError(ErrString("datetime '2020-03-08 02:00:00' is non-existent in time zone 'America/New_York'. You may be able to use `non_existent='null'` to return `null` in this case.")) When calling: devtools::document()>

withr::with_timezone(
  "America/New_York",
  {
    # This is safe.
    as.vector(non_existent_time$dt$replace_time_zone("UTC"))
  }
)
#> [1] "2020-03-08 02:00:00 UTC"

Examples

library("polars")

# see all exported methods
ls(.pr$env$RPolarsLazyFrame)

#>  [1] "cast"                  "clear"                 "clone"                
#>  [4] "collect"               "collect_in_background" "columns"              
#>  [7] "drop"                  "drop_nulls"            "dtypes"               
#> [10] "explain"               "explode"               "fetch"                
#> [13] "fill_nan"              "fill_null"             "filter"               
#> [16] "first"                 "gather_every"          "group_by"             
#> [19] "group_by_dynamic"      "head"                  "join"                 
#> [22] "join_asof"             "join_where"            "last"                 
#> [25] "limit"                 "max"                   "mean"                 
#> [28] "median"                "min"                   "print"                
#> [31] "profile"               "quantile"              "rename"               
#> [34] "reverse"               "rolling"               "schema"               
#> [37] "select"                "select_seq"            "serialize"            
#> [40] "shift"                 "sink_csv"              "sink_ipc"             
#> [43] "sink_ndjson"           "sink_parquet"          "slice"                
#> [46] "sort"                  "sql"                   "std"                  
#> [49] "sum"                   "tail"                  "to_dot"               
#> [52] "unique"                "unnest"                "unpivot"              
#> [55] "var"                   "width"                 "with_columns"         
#> [58] "with_columns_seq"      "with_context"          "with_row_index"

# see all private methods (not intended for regular use)
ls(.pr$LazyFrame)

#>  [1] "cast"                         "cast_all"                    
#>  [3] "clone_in_rust"                "collect"                     
#>  [5] "collect_in_background"        "debug_plan"                  
#>  [7] "describe_optimized_plan"      "describe_optimized_plan_tree"
#>  [9] "describe_plan"                "describe_plan_tree"          
#> [11] "deserialize"                  "drop"                        
#> [13] "drop_nulls"                   "explode"                     
#> [15] "fetch"                        "fill_nan"                    
#> [17] "fill_null"                    "filter"                      
#> [19] "first"                        "group_by"                    
#> [21] "group_by_dynamic"             "join"                        
#> [23] "join_asof"                    "join_where"                  
#> [25] "last"                         "max"                         
#> [27] "mean"                         "median"                      
#> [29] "min"                          "optimization_toggle"         
#> [31] "print"                        "profile"                     
#> [33] "quantile"                     "rename"                      
#> [35] "reverse"                      "rolling"                     
#> [37] "schema"                       "select"                      
#> [39] "select_seq"                   "serialize"                   
#> [41] "shift"                        "sink_csv"                    
#> [43] "sink_ipc"                     "sink_json"                   
#> [45] "sink_parquet"                 "slice"                       
#> [47] "sort_by_exprs"                "std"                         
#> [49] "sum"                          "tail"                        
#> [51] "to_dot"                       "unique"                      
#> [53] "unnest"                       "unpivot"                     
#> [55] "var"                          "with_columns"                
#> [57] "with_columns_seq"             "with_context"                
#> [59] "with_row_index"

# Practical example ##
# First write the R iris dataset to disk, to illustrate a difference
temp_filepath = tempfile()
write.csv(iris, temp_filepath, row.names = FALSE)

# Following example illustrates 2 ways to obtain a LazyFrame

# The-Okay-way: convert an in-memory DataFrame to LazyFrame

# eager in-mem R data.frame
Rdf = read.csv(temp_filepath)

# eager in-mem polars DataFrame
Pdf = as_polars_df(Rdf)

# lazy frame starting from in-mem DataFrame
Ldf_okay = Pdf$lazy()

# The-Best-Way: a LazyFrame created directly from a data source is best...
Ldf_best = pl$scan_csv(temp_filepath)

# ... because when you e.g. filter the LazyFrame, that filter (also called a predicate)
# is pushed down the execution stack to the csv reader, so that only the rows matching
# the filter are brought into memory.
# apply filter:
filter_expr = pl$col("Species") == "setosa" # get only rows where Species is setosa
Ldf_okay = Ldf_okay$filter(filter_expr) # overwrite LazyFrame with new
Ldf_best = Ldf_best$filter(filter_expr)

# the non-optimized plans are similar: take the entire data, then apply the filter
Ldf_okay$explain(optimized = FALSE)

#> [1] "FILTER [(col(\"Species\")) == (String(setosa))] FROM\n  DF [\"Sepal.Length\", \"Sepal.Width\", \"Petal.Length\", \"Petal.Width\"]; PROJECT */5 COLUMNS; SELECTION: None"

Ldf_best$explain(optimized = FALSE)

#> [1] "FILTER [(col(\"Species\")) == (String(setosa))] FROM\n  Csv SCAN [/tmp/RtmpHAzLvk/file95d827546911]\n  PROJECT */5 COLUMNS"

# NOTE: for Ldf_okay, the full cost of loading the csv was already paid when creating Rdf and Pdf

# The optimized plans are quite different: Ldf_best will read the csv and apply the filter simultaneously
Ldf_okay$explain()

#> [1] "DF [\"Sepal.Length\", \"Sepal.Width\", \"Petal.Length\", \"Petal.Width\"]; PROJECT */5 COLUMNS; SELECTION: [(col(\"Species\")) == (String(setosa))]"

Ldf_best$explain()

#> [1] "Csv SCAN [/tmp/RtmpHAzLvk/file95d827546911]\nPROJECT */5 COLUMNS\nSELECTION: [(col(\"Species\")) == (String(setosa))]"

# To acquire the result in-mem use $collect()
Pdf_okay = Ldf_okay$collect()
Pdf_best = Ldf_best$collect()


# verify tables would be the same
all.equal(
  Pdf_okay$to_data_frame(),
  Pdf_best$to_data_frame()
)

#> [1] TRUE

# a user might write it as a one-liner like so ($collect() materializes the result):
Pdf_best2 = pl$scan_csv(temp_filepath)$filter(pl$col("Species") == "setosa")$collect()