polars.DataFrame.to_torch#

Convert DataFrame to a 2D PyTorch tensor, Dataset, or dict of Tensors.

New in version 0.20.23.

Parameters:

return_type{“tensor”, “dataset”, “dict”}: Set return type; a 2D PyTorch tensor, PolarsDataset (a frame-specialized TensorDataset), or dict of Tensors.
label: One or more column names, expressions, or selectors that label the feature data; when return_type is “dataset”, the PolarsDataset will return (features, label) tensor tuples for each row. Otherwise, it returns (features,) tensor tuples where the feature contains all the row data; note that setting this parameter with any other result type will raise an informative error.
features: One or more column names, expressions, or selectors that contain the feature data; if omitted, all columns that are not designated as part of the label are used. This parameter is a no-op for return-types other than “dataset”.
dtype: Unify the dtype of all returned tensors; this casts any frame Series that are not of the required dtype before converting to tensor. This includes the label column unless the label is an expression (such as pl.col("label_column").cast(pl.Int16)).

Examples

>>> df = pl.DataFrame(
...     {
...         "lbl": [0, 1, 2, 3],
...         "feat1": [1, 0, 0, 1],
...         "feat2": [1.5, -0.5, 0.0, -2.25],
...     }
... )

Standard return type (Tensor), with f32 supertype:

>>> df.to_torch(dtype=pl.Float32)
tensor([[ 0.0000,  1.0000,  1.5000],
        [ 1.0000,  0.0000, -0.5000],
        [ 2.0000,  0.0000,  0.0000],
        [ 3.0000,  1.0000, -2.2500]])

As a dictionary of individual Tensors:

>>> df.to_torch("dict")
{'lbl': tensor([0, 1, 2, 3]),
 'feat1': tensor([1, 0, 0, 1]),
 'feat2': tensor([ 1.5000, -0.5000,  0.0000, -2.2500], dtype=torch.float64)}

As a PolarsDataset, with f64 supertype:

>>> ds = df.to_torch("dataset", dtype=pl.Float64)
>>> ds[3]
(tensor([ 3.0000,  1.0000, -2.2500], dtype=torch.float64),)
>>> ds[:2]
(tensor([[ 0.0000,  1.0000,  1.5000],
         [ 1.0000,  0.0000, -0.5000]], dtype=torch.float64),)
>>> ds[[0, 3]]
(tensor([[ 0.0000,  1.0000,  1.5000],
         [ 3.0000,  1.0000, -2.2500]], dtype=torch.float64),)

As a convenience the PolarsDataset can opt-in to half-precision data for experimentation (usually this would be set on the model/pipeline):

>>> list(ds.half())
[(tensor([0.0000, 1.0000, 1.5000], dtype=torch.float16),),
 (tensor([ 1.0000,  0.0000, -0.5000], dtype=torch.float16),),
 (tensor([2., 0., 0.], dtype=torch.float16),),
 (tensor([ 3.0000,  1.0000, -2.2500], dtype=torch.float16),)]

Pass PolarsDataset to a DataLoader, designating the label:

>>> from torch.utils.data import DataLoader
>>> ds = df.to_torch("dataset", label="lbl")
>>> dl = DataLoader(ds, batch_size=2)
>>> batches = list(dl)
>>> batches[0]
[tensor([[ 1.0000,  1.5000],
         [ 0.0000, -0.5000]], dtype=torch.float64), tensor([0, 1])]

Note that labels can be given as expressions, allowing them to have a dtype independent of the feature columns (multi-column labels are supported).

>>> ds = df.to_torch(
...     "dataset",
...     dtype=pl.Float32,
...     label=pl.col("lbl").cast(pl.Int16),
... )
>>> ds[:2]
(tensor([[ 1.0000,  1.5000],
         [ 0.0000, -0.5000]]), tensor([0, 1], dtype=torch.int16))

Easily integrate with (for example) scikit-learn and other datasets:

>>> from sklearn.datasets import fetch_california_housing  
>>> housing = fetch_california_housing()  
>>> df = pl.DataFrame(
...     data=housing.data,
...     schema=housing.feature_names,
... ).with_columns(
...     Target=housing.target,
... )  
>>> train = df.to_torch("dataset", label="Target")  
>>> loader = DataLoader(
...     train,
...     shuffle=True,
...     batch_size=64,
... )