Number of chunks of the Series in a DataFrame

Description

Number of chunks (memory allocations) for all Series in a DataFrame, or for the first Series only.

Usage

<DataFrame>$n_chunks(strategy = "first")

Arguments

strategy Either “all” or “first”. “first” only returns chunks for the first Series.

Details

A DataFrame is a vector of Series. Each Series in rust-polars is a wrapper around a ChunkedArray, which is like a virtual contiguous vector physically backed by an ordered set of chunks. Each chunk of values has a contiguous memory layout and is an arrow array. Arrow arrays are a fast, thread-safe and cross-platform memory layout.

In R, combining with c() or rbind() requires immediate vector re-allocation to place the vector values in contiguous memory. This is slow and memory-consuming, which is why repeatedly appending to a vector in R is discouraged.
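
As a rough base-R illustration of the cost described above (a sketch, not a benchmark; the helper names grow_by_append and fill_preallocated are hypothetical and not part of polars), the following builds the same integer vector twice: once by repeated c(), which allocates a new vector and copies the accumulated values on every iteration, and once by filling a vector that was allocated up front.

```r
# Hypothetical helper: grow a vector one element at a time with c().
# Each call to c() allocates a new vector and copies all previous values.
grow_by_append = function(n) {
  out = integer(0)
  for (i in seq_len(n)) {
    out = c(out, i)
  }
  out
}

# Hypothetical helper: allocate the full length once, then fill by index.
fill_preallocated = function(n) {
  out = integer(n)
  for (i in seq_len(n)) {
    out[i] = i
  }
  out
}

a = grow_by_append(1000L)
b = fill_preallocated(1000L)

# Both approaches produce the same values; only the allocation pattern differs.
identical(a, b)
```

Wrapping each call in system.time() with a larger n should make the difference visible, although the exact timings depend on the machine and the R version.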

In polars, when we concatenate or append to a Series or DataFrame, the re-allocation can be avoided or delayed by simply appending chunks to each individual Series. However, if the chunks become numerous and small, or are misaligned across Series, this can hurt the performance of subsequent operations.

In most places in the polars API where chunking could occur, rechunking is the default, and the user typically has to opt out actively by setting the argument rechunk = FALSE.
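
A minimal sketch of opting out, using only calls that appear elsewhere on this page (pl$concat() with rechunk and how, and n_chunks() with a strategy). The data and the variable names df1, df2, merged and chunked are illustrative, and the sketch assumes that pl$DataFrame() is available and that rechunk defaults to TRUE in the installed version, as the paragraph above implies.

```r
library("polars")

# Two small single-column DataFrames (illustrative data)
df1 = pl$DataFrame(a = 1:3)
df2 = pl$DataFrame(a = 4:6)

# Default behaviour (assumed rechunk = TRUE): chunks are merged after concatenation
merged = pl$concat(df1, df2, how = "vertical")

# Opt out: keep the chunks of the inputs as they are
chunked = pl$concat(df1, df2, rechunk = FALSE, how = "vertical")

# Compare the chunk counts of the two results
merged$n_chunks("all")
chunked$n_chunks("all")
```

Calling chunked$rechunk() afterwards returns a new DataFrame with merged chunks, so the opt-out only postpones the re-allocation until it is actually wanted.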

Value

A real vector of chunk counts per Series.

Examples

library("polars")

# create DataFrame with misaligned chunks
df = pl$concat(
  1:10, # single chunk
  pl$concat(1:5, 1:5, rechunk = FALSE, how = "vertical")$rename("b"), # two chunks
  how = "horizontal"
)
df

#> shape: (10, 2)
#> ┌─────┬─────┐
#> │ x   ┆ b   │
#> │ --- ┆ --- │
#> │ i32 ┆ i32 │
#> ╞═════╪═════╡
#> │ 1   ┆ 1   │
#> │ 2   ┆ 2   │
#> │ 3   ┆ 3   │
#> │ 4   ┆ 4   │
#> │ 5   ┆ 5   │
#> │ 6   ┆ 1   │
#> │ 7   ┆ 2   │
#> │ 8   ┆ 3   │
#> │ 9   ┆ 4   │
#> │ 10  ┆ 5   │
#> └─────┴─────┘

# the default strategy = "first" reports only the first Series
df$n_chunks()

#> [1] 1

# rechunk a chunked DataFrame
df$rechunk()$n_chunks()

#> [1] 1

# rechunk is not an in-place operation
df$n_chunks()

#> [1] 1

# The following toy example emulates the Series "chunkiness" in R. Here it is
# an S3-classed list of vectors of the same type, with the relevant S3
# generics implemented so that it behaves as if it were a regular vector.
"+.chunked_vector" = \(x, y) structure(list(unlist(x) + unlist(y)), class = "chunked_vector")
print.chunked_vector = \(x, ...) print(unlist(x), ...)
c.chunked_vector = \(...) {
  structure(do.call(c, lapply(list(...), unclass)), class = "chunked_vector")
}
rechunk = \(x) structure(unlist(x), class = "chunked_vector")
x = structure(list(1:4, 5L), class = "chunked_vector")
x

#> [1] 1 2 3 4 5

x + 5:1

#> [1] 6 6 6 6 6

lapply(x, tracemem) # trace chunks to verify no re-allocation

#> [[1]]
#> [1] "<0x555aa208dc00>"
#> 
#> [[2]]
#> [1] "<0x555a9fbd48d8>"

z = c(x, x)
z # looks like a plain vector

#>  [1] 1 2 3 4 5 1 2 3 4 5

lapply(z, tracemem) # memory allocations in z are the same as in x

#> [[1]]
#> [1] "<0x555aa208dc00>"
#> 
#> [[2]]
#> [1] "<0x555a9fbd48d8>"
#> 
#> [[3]]
#> [1] "<0x555aa208dc00>"
#> 
#> [[4]]
#> [1] "<0x555a9fbd48d8>"

str(z)

#> List of 4
#>  $ : int [1:4] 1 2 3 4
#>  $ : int 5
#>  $ : int [1:4] 1 2 3 4
#>  $ : int 5
#>  - attr(*, "class")= chr "chunked_vector"

z = rechunk(z)
str(z)

#>  'chunked_vector' int [1:10] 1 2 3 4 5 1 2 3 4 5