Inner workings of the DataFrame-class
Description
The DataFrame class is simply two environments holding, respectively, the
public and private methods/function calls to the polars Rust side. The
instantiated DataFrame object is an externalptr to a low-level Rust polars
DataFrame object. The S3 method .DollarNames.RPolarsDataFrame exposes all
public $foobar() methods that are callable on the object. Most methods return
another DataFrame class instance or similar, which allows for method chaining.
This class system could be called "environment classes" (for lack of a better
name) and is the same class system extendr provides, except that here there
are both a public and a private set of methods. For implementation reasons,
the private methods are external and must be called via
.pr$DataFrame$methodname(). Also, all private methods must take self as an
argument, so they are pure functions. Having the private methods as pure
functions solved/simplified self-referential complications.
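The "environment class" idea described above can be imitated in base R. The following is only a minimal sketch, not the polars implementation: the names toy_private, toy_public, new_toy and the class "Toy" are hypothetical, and a plain list stands in for the externalptr. Private methods are pure functions that receive self explicitly, while public methods are looked up by `$` and have self bound at lookup time.

```r
# Hypothetical toy class; none of these names exist in polars.

# Constructor: a plain list stands in for the external pointer.
new_toy = function(data) structure(list(data = data), class = "Toy")

# Private methods: pure functions that always take `self` as an argument.
# `.subset2()` reads the field without triggering the `$` method below.
toy_private = new.env()
toy_private$height = function(self) nrow(.subset2(self, "data"))
toy_private$head = function(self, n) new_toy(utils::head(.subset2(self, "data"), n))

# Public methods: written as if `self` were already in scope.
toy_public = new.env()
toy_public$height = function() toy_private$height(self)
toy_public$head = function(n = 6) toy_private$head(self, n)

# `$` finds the public method and re-homes it in this call frame,
# where `self` is bound to the object the method was requested from.
`$.Toy` = function(self, name) {
  method = toy_public[[name]]
  if (is.null(method)) stop("no such method: ", name)
  environment(method) = environment()
  method
}

# Completion helper listing the public method names.
.DollarNames.Toy = function(x, pattern = "") {
  grep(pattern, ls(toy_public), value = TRUE)
}

obj = new_toy(iris)
obj$height()
obj$head(3)$height() # methods returning a new object can be chained
```

Because the private functions never rely on an implicit self, they can be called directly (here toy_private$height(obj)), which mirrors the .pr$DataFrame$methodname() calling convention.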
Details
Check out the source code in R/dataframe_frame.R to see how public methods are
derived from private methods. Check out extendr-wrappers.R to see the extendr
auto-generated methods. These are moved to .pr and converted into pure
external functions in after-wrappers.R. In zzz.R (named zzz so that it is the
last file sourced) the extendr methods are removed and replaced by any
function prefixed DataFrame_.
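The prefix convention mentioned above can be illustrated with a small base R sketch. It assumes nothing about the actual contents of zzz.R: the helper collect_methods and the Toy_ prefix are hypothetical and only show how functions named with a common prefix could be gathered into a method environment.

```r
# Hypothetical stand-ins for functions written as `<Class>_<method>`.
# `self` is not defined here; it would be supplied when a method is dispatched.
Toy_height = function() nrow(self)
Toy_width = function() ncol(self)

# Gather every function whose name starts with `prefix` into a new
# environment, stripping the prefix from the method name.
collect_methods = function(prefix, from = parent.frame()) {
  methods = new.env()
  for (full_name in ls(from, pattern = paste0("^", prefix))) {
    fn = get(full_name, envir = from)
    if (is.function(fn)) {
      assign(sub(prefix, "", full_name, fixed = TRUE), fn, envir = methods)
    }
  }
  methods
}

toy_methods = collect_methods("Toy_")
sort(ls(toy_methods))
```

Sourcing the file that does this collection last matters because every prefixed function must already be defined when the environment is filled.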
Active bindings
columns
$columns returns a character vector with the column names.
dtypes
$dtypes returns an unnamed list with the data type of each column.
flags
$flags returns a nested list with column names at the top level and column
flags in each sublist.
Flags are used internally to avoid doing unnecessary computations, such as
sorting a variable that we know is already sorted. The number of flags varies
depending on the column type: columns of type array and list have the flags
SORTED_ASC, SORTED_DESC, and FAST_EXPLODE, while other column types only have
the former two.
- SORTED_ASC is set to TRUE when we sort a column in increasing order, so that
  we can use this information later on to avoid re-sorting it.
- SORTED_DESC is similar but applies to sorting in decreasing order.
height
$height returns the number of rows in the DataFrame.
schema
$schema returns a named list with the data type of each column.
shape
$shape returns a numeric vector of length two with the number of rows and the
number of columns.
width
$width returns the number of columns in the DataFrame.
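As a brief illustration of the active bindings listed above, the following sketch builds a DataFrame from the built-in mtcars dataset (an arbitrary choice for this example) and reads each binding. It assumes the polars package is installed. Active bindings are accessed without parentheses.

```r
library(polars)

# mtcars is only used as convenient sample data (32 rows, 11 columns).
df = as_polars_df(mtcars)

df$height          # number of rows
df$width           # number of columns
df$shape           # rows first, then columns
df$columns         # character vector of column names
names(df$schema)   # the schema is a named list keyed by column name
length(df$dtypes)  # unnamed list with one data type per column

# After sorting on a column, its SORTED_ASC flag is expected to be set.
df$sort("mpg")$flags$mpg
```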
Conversion to R data types considerations
When converting Polars objects, such as DataFrames, to R objects, for example
via the as.data.frame() generic function, each type in the Polars object is
converted to an R type. In some cases, an error may occur because the
conversion is not appropriate. In particular, an error is likely when
converting a Datetime type without a time zone. A Datetime type without a time
zone in Polars is converted to the POSIXct type in R, which takes into account
the time zone in which the R session is running (which can be checked with the
Sys.timezone() function). If the data contains times that are ambiguous or
non-existent in that time zone, the conversion fails with an error. In such
cases, either change the session time zone using Sys.setenv(TZ = "UTC") and
then perform the conversion, or use the $dt$replace_time_zone() method on the
Datetime type column to explicitly specify the time zone before conversion.
# Due to daylight savings, clocks were turned forward 1 hour on Sunday, March 8, 2020, 2:00:00 am
# so this particular date-time doesn't exist
non_existent_time = as_polars_series("2020-03-08 02:00:00")$str$strptime(pl$Datetime(), "%F %T")

withr::with_timezone(
  "America/New_York",
  {
    tryCatch(
      # This causes an error due to the time zone (the `TZ` env var is affected).
      as.vector(non_existent_time),
      error = function(e) e
    )
  }
)
#> <error: in to_r: ComputeError(ErrString("datetime '2020-03-08 02:00:00' is non-existent in time zone 'America/New_York'. You may be able to use `non_existent='null'` to return `null` in this case.")) When calling: devtools::document()>

withr::with_timezone(
  "America/New_York",
  {
    # This is safe.
    as.vector(non_existent_time$dt$replace_time_zone("UTC"))
  }
)
#> [1] "2020-03-08 02:00:00 UTC"
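The other remedy mentioned in the text, changing the session time zone before converting, might look like the sketch below. It assumes the polars package is installed and, as the text states, that the conversion honours the TZ environment variable. The variable names naive_time, old_tz and converted are only illustrative. The previous TZ value is restored afterwards so the rest of the session is unaffected.

```r
library(polars)

# A Datetime without a time zone (same value as in the example above).
naive_time = as_polars_series("2020-03-08 02:00:00")$str$strptime(pl$Datetime(), "%F %T")

# Remember the current setting (NA means TZ was not set at all).
old_tz = Sys.getenv("TZ", unset = NA)

# Switch the session to UTC, where every wall-clock time exists exactly once,
# and convert while that setting is active.
Sys.setenv(TZ = "UTC")
converted = as.vector(naive_time)

# Restore the previous session time zone.
if (is.na(old_tz)) Sys.unsetenv("TZ") else Sys.setenv(TZ = old_tz)

class(converted)
```

Wrapping the conversion in withr::with_timezone("UTC", ...) achieves the same effect with automatic cleanup, whereas $dt$replace_time_zone() avoids touching the session state at all.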
Examples
library("polars")
# see all public exported method names (normally accessed via a class
# instance with $)
ls(.pr$env$RPolarsDataFrame)
#> [1] "cast" "clear" "clone" "columns"
#> [5] "describe" "drop" "drop_in_place" "drop_nulls"
#> [9] "dtype_strings" "dtypes" "equals" "estimated_size"
#> [13] "explode" "fill_nan" "fill_null" "filter"
#> [17] "first" "flags" "gather_every" "get_column"
#> [21] "get_columns" "glimpse" "group_by" "group_by_dynamic"
#> [25] "head" "height" "item" "join"
#> [29] "join_asof" "join_where" "last" "lazy"
#> [33] "limit" "max" "mean" "median"
#> [37] "min" "n_chunks" "null_count" "partition_by"
#> [41] "pivot" "print" "quantile" "rechunk"
#> [45] "rename" "reverse" "rolling" "sample"
#> [49] "schema" "select" "select_seq" "shape"
#> [53] "shift" "slice" "sort" "sql"
#> [57] "std" "sum" "tail" "to_data_frame"
#> [61] "to_dummies" "to_list" "to_raw_ipc" "to_series"
#> [65] "to_struct" "transpose" "unique" "unnest"
#> [69] "unpivot" "var" "width" "with_columns"
#> [73] "with_columns_seq" "with_row_index" "write_csv" "write_ipc"
#> [77] "write_json" "write_ndjson" "write_parquet"
# see all private methods (not intended for regular use)
ls(.pr$DataFrame)
#> [1] "clear" "clone_in_rust"
#> [3] "columns" "default"
#> [5] "drop_all_in_place" "drop_in_place"
#> [7] "dtype_strings" "dtypes"
#> [9] "equals" "estimated_size"
#> [11] "export_stream" "from_arrow_record_batches"
#> [13] "from_raw_ipc" "get_column"
#> [15] "get_columns" "lazy"
#> [17] "n_chunks" "new_with_capacity"
#> [19] "null_count" "partition_by"
#> [21] "pivot_expr" "print"
#> [23] "rechunk" "sample_frac"
#> [25] "sample_n" "schema"
#> [27] "select" "select_at_idx"
#> [29] "select_seq" "set_column_from_robj"
#> [31] "set_column_from_series" "set_column_names_mut"
#> [33] "shape" "to_dummies"
#> [35] "to_list" "to_list_tag_structs"
#> [37] "to_list_unwind" "to_raw_ipc"
#> [39] "to_struct" "transpose"
#> [41] "unnest" "unpivot"
#> [43] "with_columns" "with_columns_seq"
#> [45] "with_row_index" "write_csv"
#> [47] "write_ipc" "write_json"
#> [49] "write_ndjson" "write_parquet"
# make an object
df = as_polars_df(iris)
# call an active binding
df$shape
#> [1] 150 5
# use a private method, which has mutability
result = .pr$DataFrame$set_column_from_robj(df, 150:1, "some_ints")
# The column now exists in both DataFrame objects, as they are just pointers
# to the same object.
# There are no public methods with mutability.
df2 = df
df$columns
#> [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
#> [6] "some_ints"
df2$columns
#> [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
#> [6] "some_ints"
# show the flags of each column
df$flags
#> $Sepal.Length
#> $Sepal.Length$SORTED_ASC
#> [1] TRUE
#>
#> $Sepal.Length$SORTED_DESC
#> [1] FALSE
#>
#>
#> $Sepal.Width
#> $Sepal.Width$SORTED_ASC
#> [1] FALSE
#>
#> $Sepal.Width$SORTED_DESC
#> [1] FALSE
#>
#>
#> $Petal.Length
#> $Petal.Length$SORTED_ASC
#> [1] FALSE
#>
#> $Petal.Length$SORTED_DESC
#> [1] FALSE
#>
#>
#> $Petal.Width
#> $Petal.Width$SORTED_ASC
#> [1] FALSE
#>
#> $Petal.Width$SORTED_DESC
#> [1] FALSE
#>
#>
#> $Species
#> $Species$SORTED_ASC
#> [1] FALSE
#>
#> $Species$SORTED_DESC
#> [1] FALSE
#>
#>
#> $some_ints
#> $some_ints$SORTED_ASC
#> [1] FALSE
#>
#> $some_ints$SORTED_DESC
#> [1] FALSE
# The set_column_from_robj method is fallible and returned a result, which
# could be "ok" or an error.
# No public method or function will ever return a result.
# The `result` is very similar to the output of functions wrapped
# with purrr::safely.
# To use results on the R side, they must be unwrapped first so that
# potential errors can be thrown. `unwrap(result)` is a way to communicate
# errors happening on the Rust side to the R side. The default behavior of
# extendr is to use `panic!`(s), which would cause some unnecessarily confusing
# and very verbose error messages about the inner workings of Rust.
# In this case `unwrap(result)` raises no error and just returns NULL, because
# this mutable method does not return any ok-value.
# Try unwrapping an error from polars due to mismatched column lengths
err_result = .pr$DataFrame$set_column_from_robj(df, 1:10000, "wrong_length")
tryCatch(unwrap(err_result, call = NULL), error = \(e) cat(as.character(e)))
#> Error in unwrap(err_result, call = NULL): could not find function "unwrap"
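The result-then-unwrap pattern described in the comments above can be mimicked in base R. This is only a sketch of the idea under stated assumptions: toy_result and toy_unwrap are hypothetical helpers, not polars functions, and the ok/err list layout is an assumption chosen to resemble what purrr::safely returns.

```r
# Hypothetical helpers; not part of polars.

# Evaluate an expression and capture the outcome instead of throwing.
toy_result = function(expr) {
  tryCatch(
    list(ok = expr, err = NULL),
    error = function(e) list(ok = NULL, err = conditionMessage(e))
  )
}

# Turn a captured result back into a value or a regular R error.
toy_unwrap = function(result) {
  if (!is.null(result$err)) stop(result$err, call. = FALSE)
  result$ok
}

good = toy_result(sqrt(16))
bad = toy_result(stop("lengths do not match"))

toy_unwrap(good)

# The error only surfaces once the result is unwrapped.
msg = tryCatch(toy_unwrap(bad), error = function(e) conditionMessage(e))
msg
```

Keeping the error as data until it is unwrapped lets the R side decide where and how to raise it, instead of surfacing a low-level panic.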