polars.DataFrame

class polars.DataFrame(data: dict[str, Sequence[Any]] | Sequence[Any] | np.ndarray[Any, Any] | pa.Table | pd.DataFrame | pli.Series | None = None, columns: ColumnsType | None = None, orient: Orientation | None = None)

Two-dimensional data structure representing data as a table with rows and columns.

Parameters:
data : dict, Sequence, ndarray, Series, or pandas.DataFrame

Two-dimensional data in various forms. A dict must contain Sequences; a Sequence may contain Series or other Sequences.

columns : Sequence of str or (str, DataType) pairs, default None

Column labels to use for the resulting DataFrame. If specified, these override any labels already present in the data. Must match the data dimensions.

orient : {'col', 'row'}, default None

Whether to interpret two-dimensional data as columns or as rows. If None, the orientation is inferred by matching the columns and data dimensions. If this does not yield conclusive results, column orientation is used.

Notes

Some methods internally convert the DataFrame into a LazyFrame before collecting the results back into a DataFrame. This can lead to unexpected behavior when using a subclassed DataFrame. For example,

>>> class MyDataFrame(pl.DataFrame):
...     pass
...
>>> isinstance(MyDataFrame().lazy().collect(), MyDataFrame)
False

Examples

Constructing a DataFrame from a dictionary:

>>> data = {"a": [1, 2], "b": [3, 4]}
>>> df = pl.DataFrame(data)
>>> df
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 4   │
└─────┴─────┘

Notice that the dtype is automatically inferred as a polars Int64:

>>> df.dtypes
[<class 'polars.datatypes.Int64'>, <class 'polars.datatypes.Int64'>]

To specify dtypes for your columns, initialize the DataFrame with a list of typed Series:

>>> data = [
...     pl.Series("col1", [1, 2], dtype=pl.Float32),
...     pl.Series("col2", [3, 4], dtype=pl.Int64),
... ]
>>> df2 = pl.DataFrame(data)
>>> df2
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f32  ┆ i64  │
╞══════╪══════╡
│ 1.0  ┆ 3    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2.0  ┆ 4    │
└──────┴──────┘

Or set the columns parameter with a list of (name, dtype) pairs (compatible with all of the other valid data parameter types):

>>> data = {"col1": [1, 2], "col2": [3, 4]}
>>> df3 = pl.DataFrame(data, columns=[("col1", pl.Float32), ("col2", pl.Int64)])
>>> df3
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ ---  ┆ ---  │
│ f32  ┆ i64  │
╞══════╪══════╡
│ 1.0  ┆ 3    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2.0  ┆ 4    │
└──────┴──────┘

Constructing a DataFrame from a numpy ndarray, specifying column names:

>>> import numpy as np
>>> data = np.array([(1, 2), (3, 4)], dtype=np.int64)
>>> df4 = pl.DataFrame(data, columns=["a", "b"], orient="col")
>>> df4
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 4   │
└─────┴─────┘

Constructing a DataFrame from a list of lists, row orientation inferred:

>>> data = [[1, 2, 3], [4, 5, 6]]
>>> df5 = pl.DataFrame(data, columns=["a", "b", "c"])
>>> df5
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4   ┆ 5   ┆ 6   │
└─────┴─────┴─────┘

Attributes:
columns

Get or set column names.

dtypes

Get dtypes of columns in DataFrame.

height

Get the height of the DataFrame.

schema

Get a dict[column name, DataType].

shape

Get the shape of the DataFrame.

width

Get the width of the DataFrame.

Methods

apply(f[, return_dtype, inference_size])

Apply a custom/user-defined function (UDF) over the rows of the DataFrame.

cleared()

Create an empty copy of the current DataFrame.

clone()

Cheap deepcopy/clone.

describe()

Summary statistics for a DataFrame.

drop(name)

Remove column from DataFrame and return as new.

drop_in_place(name)

Drop in place.

drop_nulls([subset])

Return a new DataFrame where the null values are dropped.

estimated_size([unit])

Return an estimation of the total (heap) allocated size of the DataFrame.

explode(columns)

Explode DataFrame to long format by exploding a column with Lists.

extend(other)

Extend the memory backed by this DataFrame with the values from other.

fill_nan(fill_value)

Fill floating point NaN values by an Expression evaluation.

fill_null([value, strategy, limit, ...])

Fill null values using the specified value or strategy.

filter(predicate)

Filter the rows in the DataFrame based on a predicate expression.

find_idx_by_name(name)

Find the index of a column by name.

fold(operation)

Apply a horizontal reduction on a DataFrame.

frame_equal(other[, null_equal])

Check if DataFrame is equal to other.

get_column(name)

Get a single column as Series by name.

get_columns()

Get the DataFrame as a List of Series.

groupby(by[, maintain_order])

Start a groupby operation.

groupby_dynamic(index_column, every[, ...])

Group based on a time value (or index value of type Int32, Int64).

groupby_rolling(index_column, period[, ...])

Create rolling groups based on a time column.

hash_rows([seed, seed_1, seed_2, seed_3])

Hash and combine the rows in this DataFrame.

head([n])

Get the first n rows.

hstack(columns[, in_place])

Return a new DataFrame grown horizontally by stacking multiple Series to it.

insert_at_idx(index, series)

Insert a Series at a certain column index.

interpolate()

Interpolate intermediate values.

is_duplicated()

Get a mask of all duplicated rows in this DataFrame.

is_empty()

Check if the dataframe is empty.

is_unique()

Get a mask of all unique rows in this DataFrame.

join(other[, left_on, right_on, on, how, suffix])

Join in SQL-like fashion.

join_asof(other[, left_on, right_on, on, ...])

Perform an asof join.

lazy()

Start a lazy query from this point.

limit([n])

Get the first n rows.

max()

Aggregate the columns of this DataFrame to their maximum value.

mean()

Aggregate the columns of this DataFrame to their mean value.

median()

Aggregate the columns of this DataFrame to their median value.

melt([id_vars, value_vars, variable_name, ...])

Unpivot a DataFrame from wide to long format.

min()

Aggregate the columns of this DataFrame to their minimum value.

n_chunks()

Get number of chunks used by the ChunkedArrays of this DataFrame.

null_count()

Create a new DataFrame that shows the null counts per column.

partition_by()

Split into multiple DataFrames partitioned by groups.

pipe(func, *args, **kwargs)

Apply a function on Self.

pivot(values, index, columns[, ...])

Create a spreadsheet-style pivot table as a DataFrame.

product()

Aggregate the columns of this DataFrame to their product values.

quantile(quantile[, interpolation])

Aggregate the columns of this DataFrame to their quantile value.

rechunk()

Rechunk the data in this DataFrame to a contiguous allocation.

rename(mapping)

Rename column names.

replace(column, new_col)

Replace a column by a new Series.

replace_at_idx(index, series)

Replace a column at an index location.

reverse()

Reverse the DataFrame.

row([index, by_predicate])

Get a row as tuple, either by index or by predicate.

rows()

Convert columnar data to rows as python tuples.

sample([n, frac, with_replacement, shuffle, ...])

Sample from this DataFrame.

select(exprs)

Select columns from this DataFrame.

shift(periods)

Shift values by the given period.

shift_and_fill(periods, fill_value)

Shift the values by a given period and fill the resulting null values.

shrink_to_fit([in_place])

Shrink DataFrame memory usage.

slice(offset[, length])

Get a slice of this DataFrame.

sort(by[, reverse, nulls_last])

Sort the DataFrame by column.

std([ddof])

Aggregate the columns of this DataFrame to their standard deviation value.

sum()

Aggregate the columns of this DataFrame to their sum value.

tail([n])

Get the last n rows.

take_every(n)

Take every nth row in the DataFrame and return as a new DataFrame.

to_arrow()

Collect the underlying arrow arrays in an Arrow Table.

to_dict()

Convert DataFrame to a dictionary mapping column name to values.

to_dicts()

Convert every row to a dictionary.

to_dummies(*[, columns])

Get one hot encoded dummy variables.

to_numpy()

Convert DataFrame to a 2D NumPy array.

to_pandas(*args[, date_as_object])

Cast to a pandas DataFrame.

to_series([index])

Select column as Series at index location.

to_struct(name)

Convert a DataFrame to a Series of type Struct.

transpose([include_header, header_name, ...])

Transpose a DataFrame over the diagonal.

unique([maintain_order, subset, keep])

Drop duplicate rows from this DataFrame.

unnest(names)

Decompose a struct into its fields.

unstack(step[, how, columns, fill_values])

Unstack a long table to a wide form without doing an aggregation.

upsample(time_column, every[, offset, by, ...])

Upsample a DataFrame at a regular frequency.

var([ddof])

Aggregate the columns of this DataFrame to their variance value.

vstack(df[, in_place])

Grow this DataFrame vertically by stacking a DataFrame to it.

with_column(column)

Return a new DataFrame with the column added or replaced.

with_columns([exprs])

Add or overwrite multiple columns in a DataFrame.

with_row_count([name, offset])

Add a column at index 0 that counts the rows.

write_avro(file[, compression])

Write to Apache Avro file.

write_csv()

Write to comma-separated values (CSV) file.

write_ipc(file[, compression])

Write to Arrow IPC binary stream or Feather file.

write_json()

Serialize to JSON representation.

write_ndjson()

Serialize to newline delimited JSON representation.

write_parquet(file, *[, compression, ...])

Write to Apache Parquet file.

__init__(data: dict[str, Sequence[Any]] | Sequence[Any] | np.ndarray[Any, Any] | pa.Table | pd.DataFrame | pli.Series | None = None, columns: ColumnsType | None = None, orient: Orientation | None = None)
