DataFrame#
This page gives an overview of all public DataFrame methods.
- class polars.DataFrame(data: FrameInitTypes | None = None, schema: SchemaDefinition | None = None, *, schema_overrides: SchemaDict | None = None, orient: Orientation | None = None, infer_schema_length: int | None = 100, nan_to_null: bool = False)[source]
Two-dimensional data structure representing data as a table with rows and columns.
- Parameters:
- data : dict, Sequence, ndarray, Series, or pandas.DataFrame
Two-dimensional data in various forms; dict input must contain Sequences, Generators, or a range. Sequence may contain Series or other Sequences.
- schema : Sequence of str, (str, DataType) pairs, or a {str: DataType} dict
The DataFrame schema may be declared in several ways:
As a dict of {name:type} pairs; if type is None, it will be auto-inferred.
As a list of column names; in this case types are automatically inferred.
As a list of (name,type) pairs; this is equivalent to the dictionary form.
If you supply a list of column names that does not match the names in the underlying data, the names given here will overwrite them. The number of names given in the schema should match the underlying data dimensions.
- schema_overrides : dict, default None
Support type specification or override of one or more columns; note that any dtypes inferred from the schema param will be overridden.
The number of entries in the schema should match the underlying data dimensions, unless a sequence of dictionaries is being passed, in which case a partial schema can be declared to prevent specific fields from being loaded.
- orient : {‘col’, ‘row’}, default None
Whether to interpret two-dimensional data as columns or as rows. If None, the orientation is inferred by matching the columns and data dimensions. If this does not yield conclusive results, column orientation is used.
- infer_schema_length : int, default 100
Maximum number of rows to read for schema inference; only applies if the input data is a sequence or generator of rows; other input is read as-is.
- nan_to_null : bool, default False
If the data comes from one or more numpy arrays, optionally convert np.nan values in the input data to null instead. This is a no-op for all other input data.
Notes
Some methods internally convert the DataFrame into a LazyFrame before collecting the results back into a DataFrame. This can lead to unexpected behavior when using a subclassed DataFrame. For example,
>>> class MyDataFrame(pl.DataFrame): ... pass ... >>> isinstance(MyDataFrame().lazy().collect(), MyDataFrame) False
Examples
Constructing a DataFrame from a dictionary:
>>> data = {"a": [1, 2], "b": [3, 4]} >>> df = pl.DataFrame(data) >>> df shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 3 │ │ 2 ┆ 4 │ └─────┴─────┘
Notice that the dtypes are automatically inferred as polars Int64:
>>> df.dtypes [Int64, Int64]
To specify a more detailed/specific frame schema you can supply the schema parameter with a dictionary of (name,dtype) pairs…
>>> data = {"col1": [0, 2], "col2": [3, 7]} >>> df2 = pl.DataFrame(data, schema={"col1": pl.Float32, "col2": pl.Int64}) >>> df2 shape: (2, 2) ┌──────┬──────┐ │ col1 ┆ col2 │ │ --- ┆ --- │ │ f32 ┆ i64 │ ╞══════╪══════╡ │ 0.0 ┆ 3 │ │ 2.0 ┆ 7 │ └──────┴──────┘
…a sequence of (name,dtype) pairs…
>>> data = {"col1": [1, 2], "col2": [3, 4]} >>> df3 = pl.DataFrame(data, schema=[("col1", pl.Float32), ("col2", pl.Int64)]) >>> df3 shape: (2, 2) ┌──────┬──────┐ │ col1 ┆ col2 │ │ --- ┆ --- │ │ f32 ┆ i64 │ ╞══════╪══════╡ │ 1.0 ┆ 3 │ │ 2.0 ┆ 4 │ └──────┴──────┘
…or a list of typed Series.
>>> data = [ ... pl.Series("col1", [1, 2], dtype=pl.Float32), ... pl.Series("col2", [3, 4], dtype=pl.Int64), ... ] >>> df4 = pl.DataFrame(data) >>> df4 shape: (2, 2) ┌──────┬──────┐ │ col1 ┆ col2 │ │ --- ┆ --- │ │ f32 ┆ i64 │ ╞══════╪══════╡ │ 1.0 ┆ 3 │ │ 2.0 ┆ 4 │ └──────┴──────┘
Constructing a DataFrame from a numpy ndarray, specifying column names:
>>> import numpy as np >>> data = np.array([(1, 2), (3, 4)], dtype=np.int64) >>> df5 = pl.DataFrame(data, schema=["a", "b"], orient="col") >>> df5 shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 3 │ │ 2 ┆ 4 │ └─────┴─────┘
Constructing a DataFrame from a list of lists, row orientation inferred:
>>> data = [[1, 2, 3], [4, 5, 6]] >>> df6 = pl.DataFrame(data, schema=["a", "b", "c"]) >>> df6 shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ 1 ┆ 2 ┆ 3 │ │ 4 ┆ 5 ┆ 6 │ └─────┴─────┴─────┘
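The schema_overrides parameter adjusts a subset of the inferred dtypes without restating the full schema; a minimal sketch (the column names here are illustrative):
>>> data = {"id": [1, 2], "value": [0.5, 1.5]}
>>> df7 = pl.DataFrame(data, schema_overrides={"id": pl.UInt32})
>>> df7.dtypes
[UInt32, Float64]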
Methods:
Apply a custom/user-defined function (UDF) over the rows of the DataFrame.
Create an empty copy of the current DataFrame, with zero to 'n' rows.
Cheap deepcopy/clone.
Return Pearson product-moment correlation coefficients.
Summary statistics for a DataFrame.
Remove columns from the dataframe.
Drop a single column in-place and return the dropped column.
Drop all rows that contain null values.
Return an estimation of the total (heap) allocated size of the DataFrame.
Explode the dataframe to long format by exploding the given columns.
Extend the memory backed by this DataFrame with the values from other.
Fill floating point NaN values by an Expression evaluation.
Fill null values using the specified value or strategy.
Filter the rows in the DataFrame based on a predicate expression.
Find the index of a column by name.
Apply a horizontal reduction on a DataFrame.
Check if DataFrame is equal to other.
Get a single column as Series by name.
Get the DataFrame as a List of Series.
Return a dense preview of the dataframe.
Start a groupby operation.
Group based on a time value (or index value of type Int32, Int64).
Create rolling groups based on a time column.
Hash and combine the rows in this DataFrame.
Get the first n rows.
Return a new DataFrame grown horizontally by stacking multiple Series to it.
Insert a Series at a certain column index.
Interpolate intermediate values.
Get a mask of all duplicated rows in this DataFrame.
Check if the dataframe is empty.
Get a mask of all unique rows in this DataFrame.
Return the dataframe as a scalar.
Returns an iterator over the DataFrame of rows of python-native values.
Returns a non-copying iterator of slices over the underlying DataFrame.
Join in SQL-like fashion.
Perform an asof join.
Start a lazy query from this point.
Get the first n rows.
Aggregate the columns of this DataFrame to their maximum value.
Aggregate the columns of this DataFrame to their mean value.
Aggregate the columns of this DataFrame to their median value.
Unpivot a DataFrame from wide to long format.
Take two sorted DataFrames and merge them by the sorted key.
Aggregate the columns of this DataFrame to their minimum value.
Get number of chunks used by the ChunkedArrays of this DataFrame.
Return the number of unique rows, or the number of unique row-subsets.
Create a new DataFrame that shows the null counts per column.
Group by the given columns and return the groups as separate dataframes.
Offers a structured way to apply a sequence of user-defined functions (UDFs).
Create a spreadsheet-style pivot table as a DataFrame.
Aggregate the columns of this DataFrame to their product values.
Aggregate the columns of this DataFrame to their quantile value.
Rechunk the data in this DataFrame to a contiguous allocation.
Rename column names.
Replace a column by a new Series.
Replace a column at an index location.
Reverse the DataFrame.
Get the values of a single row, either by index or by predicate.
Returns all data in the DataFrame as a list of rows of python-native values.
Sample from this DataFrame.
Select columns from this DataFrame.
Shift values by the given period.
Shift the values by a given period and fill the resulting null values.
Shrink DataFrame memory usage.
Get a slice of this DataFrame.
Sort the dataframe by the given columns.
Aggregate the columns of this DataFrame to their standard deviation value.
Aggregate the columns of this DataFrame to their sum value.
Get the last n rows.
Take every nth row in the DataFrame and return as a new DataFrame.
Collect the underlying arrow arrays in an Arrow Table.
Convert DataFrame to a dictionary mapping column name to values.
Convert every row to a dictionary of python-native values.
Convert categorical variables into dummy/indicator variables.
Convert DataFrame to a 2D NumPy array.
Cast to a pandas DataFrame.
Select column as Series at index location.
Convert a DataFrame to a Series of type Struct.
Return the k largest elements.
Transpose a DataFrame over the diagonal.
Drop duplicate rows from this dataframe.
Decompose struct columns into separate columns for each of their fields.
Unstack a long table to a wide form without doing an aggregation.
Update the values in this DataFrame with the non-null values in other.
Upsample a DataFrame at a regular frequency.
Aggregate the columns of this DataFrame to their variance value.
Grow this DataFrame vertically by stacking a DataFrame to it.
Add columns to this DataFrame.
Add a column at index 0 that counts the rows.
Write to Apache Avro file.
Write to comma-separated values (CSV) file.
Write a polars frame to a database.
Write frame data to a table in an Excel workbook/worksheet.
Write to Arrow IPC binary stream or Feather file.
Serialize to JSON representation.
Serialize to newline delimited JSON representation.
Write to Apache Parquet file.
Attributes:
Get or set column names.
Get the datatypes of the columns of this DataFrame.
Get the height of the DataFrame.
Get a dict[column name, DataType].
Get the shape of the DataFrame.
Get the width of the DataFrame.
- apply(function: Callable[[tuple[Any, ...]], Any], return_dtype: PolarsDataType | None = None, *, inference_size: int = 256) Self [source]
Apply a custom/user-defined function (UDF) over the rows of the DataFrame.
The UDF will receive each row as a tuple of values: udf(row).
Implementing logic using a Python function is almost always significantly slower and more memory intensive than implementing the same logic using the native expression API because:
The native expression engine runs in Rust; UDFs run in Python.
Use of Python UDFs forces the DataFrame to be materialized in memory.
Polars-native expressions can be parallelised (UDFs typically cannot).
Polars-native expressions can be logically optimised (UDFs cannot).
Wherever possible you should strongly prefer the native expression API to achieve the best performance.
- Parameters:
- function
Custom function or lambda.
- return_dtype
Output type of the operation. If none given, Polars tries to infer the type.
- inference_size
Only used in the case when the custom function returns rows. This uses the first n rows to determine the output schema.
Notes
The frame-level apply cannot track column names (as the UDF is a black-box that may arbitrarily drop, rearrange, transform, or add new columns); if you want to apply a UDF such that column names are preserved, you should use the expression-level apply syntax instead.
If your function is expensive and you don't want it to be called more than once for a given input, consider applying an @lru_cache decorator to it. With suitable data you may achieve order-of-magnitude speedups (or more).
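A minimal sketch of that caching pattern (the UDF below is hypothetical; row values must be hashable for lru_cache to apply):
>>> from functools import lru_cache
>>> @lru_cache(maxsize=None)
... def expensive_udf(row):
...     # stand-in for an expensive per-row computation; repeated rows hit the cache
...     return row[0] * 2 + row[1]
...
>>> df = pl.DataFrame({"foo": [1, 2, 1], "bar": [3, 4, 3]})
>>> out = df.apply(expensive_udf)  # the duplicate row (1, 3) is computed only once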
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [-1, 5, 8]})
Return a DataFrame by mapping each row to a tuple:
>>> df.apply(lambda t: (t[0] * 2, t[1] * 3)) shape: (3, 2) ┌──────────┬──────────┐ │ column_0 ┆ column_1 │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞══════════╪══════════╡ │ 2 ┆ -3 │ │ 4 ┆ 15 │ │ 6 ┆ 24 │ └──────────┴──────────┘
It is better to implement this with an expression:
>>> df.select([pl.col("foo") * 2, pl.col("bar") * 3])
Return a Series by mapping each row to a scalar:
>>> df.apply(lambda t: (t[0] * 2 + t[1])) shape: (3, 1) ┌───────┐ │ apply │ │ --- │ │ i64 │ ╞═══════╡ │ 1 │ │ 9 │ │ 14 │ └───────┘
In this case it is better to use the following expression:
>>> df.select(pl.col("foo") * 2 + pl.col("bar"))
- clear(n: int = 0) Self [source]
Create an empty copy of the current DataFrame, with zero to ‘n’ rows.
Returns a DataFrame with identical schema but no data.
- Parameters:
- n
Number of (empty) rows to return in the cleared frame.
See also
clone
Cheap deepcopy/clone.
Examples
>>> df = pl.DataFrame( ... { ... "a": [None, 2, 3, 4], ... "b": [0.5, None, 2.5, 13], ... "c": [True, True, False, None], ... } ... ) >>> df.clear() shape: (0, 3) ┌─────┬─────┬──────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ bool │ ╞═════╪═════╪══════╡ └─────┴─────┴──────┘
>>> df.clear(n=2) shape: (2, 3) ┌──────┬──────┬──────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ bool │ ╞══════╪══════╪══════╡ │ null ┆ null ┆ null │ │ null ┆ null ┆ null │ └──────┴──────┴──────┘
- clone() Self [source]
Cheap deepcopy/clone.
See also
clear
Create an empty copy of the current DataFrame, with identical schema but no data.
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [0.5, 4, 10, 13], ... "c": [True, True, False, True], ... } ... ) >>> df.clone() shape: (4, 3) ┌─────┬──────┬───────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ bool │ ╞═════╪══════╪═══════╡ │ 1 ┆ 0.5 ┆ true │ │ 2 ┆ 4.0 ┆ true │ │ 3 ┆ 10.0 ┆ false │ │ 4 ┆ 13.0 ┆ true │ └─────┴──────┴───────┘
- property columns: list[str][source]
Get or set column names.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.columns ['foo', 'bar', 'ham']
Set column names:
>>> df.columns = ["apple", "banana", "orange"] >>> df shape: (3, 3) ┌───────┬────────┬────────┐ │ apple ┆ banana ┆ orange │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═══════╪════════╪════════╡ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ │ 3 ┆ 8 ┆ c │ └───────┴────────┴────────┘
- corr(**kwargs: Any) Self [source]
Return Pearson product-moment correlation coefficients.
See numpy corrcoef for more information: https://numpy.org/doc/stable/reference/generated/numpy.corrcoef.html
- Parameters:
- kwargs
Keyword arguments are passed to numpy corrcoef.
Notes
This functionality requires numpy to be installed.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [3, 2, 1], "ham": [7, 8, 9]}) >>> df.corr() shape: (3, 3) ┌──────┬──────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ f64 │ ╞══════╪══════╪══════╡ │ 1.0 ┆ -1.0 ┆ 1.0 │ │ -1.0 ┆ 1.0 ┆ -1.0 │ │ 1.0 ┆ -1.0 ┆ 1.0 │ └──────┴──────┴──────┘
- describe() Self [source]
Summary statistics for a DataFrame.
Examples
>>> from datetime import date >>> df = pl.DataFrame( ... { ... "a": [1.0, 2.8, 3.0], ... "b": [4, 5, None], ... "c": [True, False, True], ... "d": [None, "b", "c"], ... "e": ["usd", "eur", None], ... "f": [date(2020, 1, 1), date(2021, 1, 1), date(2022, 1, 1)], ... } ... ) >>> df.describe() shape: (7, 7) ┌────────────┬──────────┬──────────┬──────────┬──────┬──────┬────────────┐ │ describe ┆ a ┆ b ┆ c ┆ d ┆ e ┆ f │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ f64 ┆ f64 ┆ f64 ┆ str ┆ str ┆ str │ ╞════════════╪══════════╪══════════╪══════════╪══════╪══════╪════════════╡ │ count ┆ 3.0 ┆ 3.0 ┆ 3.0 ┆ 3 ┆ 3 ┆ 3 │ │ null_count ┆ 0.0 ┆ 1.0 ┆ 0.0 ┆ 1 ┆ 1 ┆ 0 │ │ mean ┆ 2.266667 ┆ 4.5 ┆ 0.666667 ┆ null ┆ null ┆ null │ │ std ┆ 1.101514 ┆ 0.707107 ┆ 0.57735 ┆ null ┆ null ┆ null │ │ min ┆ 1.0 ┆ 4.0 ┆ 0.0 ┆ b ┆ eur ┆ 2020-01-01 │ │ max ┆ 3.0 ┆ 5.0 ┆ 1.0 ┆ c ┆ usd ┆ 2022-01-01 │ │ median ┆ 2.8 ┆ 4.5 ┆ 1.0 ┆ null ┆ null ┆ null │ └────────────┴──────────┴──────────┴──────────┴──────┴──────┴────────────┘
- drop(columns: str | Sequence[str], *more_columns: str) Self [source]
Remove columns from the dataframe.
- Parameters:
- columns
Name of the column(s) that should be removed from the dataframe.
- *more_columns
Additional columns to drop, specified as positional arguments.
Examples
Drop a single column by passing the name of that column.
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.0, 7.0, 8.0], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.drop("ham") shape: (3, 2) ┌─────┬─────┐ │ foo ┆ bar │ │ --- ┆ --- │ │ i64 ┆ f64 │ ╞═════╪═════╡ │ 1 ┆ 6.0 │ │ 2 ┆ 7.0 │ │ 3 ┆ 8.0 │ └─────┴─────┘
Drop multiple columns by passing a list of column names.
>>> df.drop(["bar", "ham"]) shape: (3, 1) ┌─────┐ │ foo │ │ --- │ │ i64 │ ╞═════╡ │ 1 │ │ 2 │ │ 3 │ └─────┘
Or use positional arguments to drop multiple columns in the same way.
>>> df.drop("foo", "bar") shape: (3, 1) ┌─────┐ │ ham │ │ --- │ │ str │ ╞═════╡ │ a │ │ b │ │ c │ └─────┘
- drop_in_place(name: str) Series [source]
Drop a single column in-place and return the dropped column.
- Parameters:
- name
Name of the column to drop.
- Returns:
- The dropped column.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.drop_in_place("ham") shape: (3,) Series: 'ham' [str] [ "a" "b" "c" ]
- drop_nulls(subset: str | Sequence[str] | None = None) Self [source]
Drop all rows that contain null values.
Returns a new DataFrame.
- Parameters:
- subset
Column name(s) for which null values are considered. If set to None (default), use all columns.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, None, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.drop_nulls() shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 3 ┆ 8 ┆ c │ └─────┴─────┴─────┘
This method drops rows where any single value of the row is null.
Below are some example snippets that show how you could drop null values based on other conditions:
>>> df = pl.DataFrame( ... { ... "a": [None, None, None, None], ... "b": [1, 2, None, 1], ... "c": [1, None, None, 1], ... } ... ) >>> df shape: (4, 3) ┌──────┬──────┬──────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ f64 ┆ i64 ┆ i64 │ ╞══════╪══════╪══════╡ │ null ┆ 1 ┆ 1 │ │ null ┆ 2 ┆ null │ │ null ┆ null ┆ null │ │ null ┆ 1 ┆ 1 │ └──────┴──────┴──────┘
Drop a row only if all values are null:
>>> df.filter(~pl.all(pl.all().is_null())) shape: (3, 3) ┌──────┬─────┬──────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ f64 ┆ i64 ┆ i64 │ ╞══════╪═════╪══════╡ │ null ┆ 1 ┆ 1 │ │ null ┆ 2 ┆ null │ │ null ┆ 1 ┆ 1 │ └──────┴─────┴──────┘
Drop a column if all values are null:
>>> df[[s.name for s in df if not (s.null_count() == df.height)]] shape: (4, 2) ┌──────┬──────┐ │ b ┆ c │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞══════╪══════╡ │ 1 ┆ 1 │ │ 2 ┆ null │ │ null ┆ null │ │ 1 ┆ 1 │ └──────┴──────┘
- property dtypes: list[PolarsDataType][source]
Get the datatypes of the columns of this DataFrame.
The datatypes can also be found in column headers when printing the DataFrame.
See also
schema
Returns a {colname:dtype} mapping.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.0, 7.0, 8.0], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.dtypes [Int64, Float64, Utf8] >>> df shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6.0 ┆ a │ │ 2 ┆ 7.0 ┆ b │ │ 3 ┆ 8.0 ┆ c │ └─────┴─────┴─────┘
- estimated_size(unit: SizeUnit = 'b') int | float [source]
Return an estimation of the total (heap) allocated size of the DataFrame.
Estimated size is given in the specified unit (bytes by default).
This estimation is the sum of the sizes of its buffers and validity bitmaps, including nested arrays. Multiple arrays may share buffers and bitmaps, so the size of two arrays is not necessarily the sum of the sizes computed by this function. In particular, a StructArray's size is an upper bound.
When an array is sliced, its allocated size remains constant because the buffer is unchanged. However, this function will yield a smaller number, because it returns the visible size of the buffer, not its total capacity.
FFI buffers are included in this estimation.
- Parameters:
- unit : {‘b’, ‘kb’, ‘mb’, ‘gb’, ‘tb’}
Scale the returned size to the given unit.
Examples
>>> df = pl.DataFrame( ... { ... "x": list(reversed(range(1_000_000))), ... "y": [v / 1000 for v in range(1_000_000)], ... "z": [str(v) for v in range(1_000_000)], ... }, ... schema=[("x", pl.UInt32), ("y", pl.Float64), ("z", pl.Utf8)], ... ) >>> df.estimated_size() 25888898 >>> df.estimated_size("mb") 24.689577102661133
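As a small illustration of the slicing behaviour described above, the size reported for a slice is smaller than that of the full frame because only the visible portion of the buffers is counted (a sketch, reusing the frame from the example):
>>> df.slice(0, 1_000).estimated_size() < df.estimated_size()
True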
- explode(columns: str | Sequence[str] | Expr | Sequence[Expr], *more_columns: str | Expr) Self [source]
Explode the dataframe to long format by exploding the given columns.
- Parameters:
- columns
Name of the column(s) to explode. Columns must be of datatype List or Utf8. Accepts col expressions as input as well.
- *more_columns
Additional names of columns to explode, specified as positional arguments.
- Returns:
- DataFrame
Examples
>>> df = pl.DataFrame( ... { ... "letters": ["a", "a", "b", "c"], ... "numbers": [[1], [2, 3], [4, 5], [6, 7, 8]], ... } ... ) >>> df shape: (4, 2) ┌─────────┬───────────┐ │ letters ┆ numbers │ │ --- ┆ --- │ │ str ┆ list[i64] │ ╞═════════╪═══════════╡ │ a ┆ [1] │ │ a ┆ [2, 3] │ │ b ┆ [4, 5] │ │ c ┆ [6, 7, 8] │ └─────────┴───────────┘ >>> df.explode("numbers") shape: (8, 2) ┌─────────┬─────────┐ │ letters ┆ numbers │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════════╪═════════╡ │ a ┆ 1 │ │ a ┆ 2 │ │ a ┆ 3 │ │ b ┆ 4 │ │ b ┆ 5 │ │ c ┆ 6 │ │ c ┆ 7 │ │ c ┆ 8 │ └─────────┴─────────┘
- extend(other: Self) Self [source]
Extend the memory backed by this DataFrame with the values from other.
Different from vstack, which adds the chunks from other to the chunks of this DataFrame, extend appends the data from other to the underlying memory locations and thus may cause a reallocation.
If this does not cause a reallocation, the resulting data structure will not have any extra chunks and thus will yield faster queries.
Prefer extend over vstack when you want to do a query after a single append. For instance during online operations where you add n rows and rerun a query.
Prefer vstack over extend when you want to append many times before doing a query. For instance, when you read in multiple files and want to store them in a single DataFrame. In the latter case, finish the sequence of vstack operations with a rechunk.
- Parameters:
- other
DataFrame to vertically add.
Examples
>>> df1 = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) >>> df2 = pl.DataFrame({"foo": [10, 20, 30], "bar": [40, 50, 60]}) >>> df1.extend(df2) shape: (6, 2) ┌─────┬─────┐ │ foo ┆ bar │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 4 │ │ 2 ┆ 5 │ │ 3 ┆ 6 │ │ 10 ┆ 40 │ │ 20 ┆ 50 │ │ 30 ┆ 60 │ └─────┴─────┘
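For the opposite pattern mentioned in the notes (many appends before querying), a rough sketch of vstack followed by a final rechunk; the frames below are illustrative:
>>> frames = [
...     pl.DataFrame({"foo": [1, 2], "bar": [3, 4]}),
...     pl.DataFrame({"foo": [5, 6], "bar": [7, 8]}),
...     pl.DataFrame({"foo": [9, 10], "bar": [11, 12]}),
... ]
>>> combined = frames[0]
>>> for frame in frames[1:]:
...     combined = combined.vstack(frame)  # cheap appends; chunks are kept separate
...
>>> combined = combined.rechunk()  # consolidate into contiguous memory before querying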
- fill_nan(fill_value: Expr | int | float | None) Self [source]
Fill floating point NaN values by an Expression evaluation.
- Parameters:
- fill_value
Value to fill NaN with.
- Returns:
- DataFrame with NaN replaced with fill_value
Warning
Note that floating point NaNs (Not a Number) are not missing values! To replace missing values, use fill_null().
Examples
>>> df = pl.DataFrame( ... { ... "a": [1.5, 2, float("NaN"), 4], ... "b": [0.5, 4, float("NaN"), 13], ... } ... ) >>> df.fill_nan(99) shape: (4, 2) ┌──────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ f64 ┆ f64 │ ╞══════╪══════╡ │ 1.5 ┆ 0.5 │ │ 2.0 ┆ 4.0 │ │ 99.0 ┆ 99.0 │ │ 4.0 ┆ 13.0 │ └──────┴──────┘
- fill_null(value: Any | None = None, strategy: FillNullStrategy | None = None, limit: int | None = None, *, matches_supertype: bool = True) Self [source]
Fill null values using the specified value or strategy.
- Parameters:
- value
Value used to fill null values.
- strategy : {None, ‘forward’, ‘backward’, ‘min’, ‘max’, ‘mean’, ‘zero’, ‘one’}
Strategy used to fill null values.
- limit
Number of consecutive null values to fill when using the ‘forward’ or ‘backward’ strategy.
- matches_supertype
Fill all matching supertypes of the fill value.
- Returns:
- DataFrame with None values replaced by the filling strategy.
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 2, None, 4], ... "b": [0.5, 4, None, 13], ... } ... ) >>> df.fill_null(99) shape: (4, 2) ┌─────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ f64 │ ╞═════╪══════╡ │ 1 ┆ 0.5 │ │ 2 ┆ 4.0 │ │ 99 ┆ 99.0 │ │ 4 ┆ 13.0 │ └─────┴──────┘ >>> df.fill_null(strategy="forward") shape: (4, 2) ┌─────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ f64 │ ╞═════╪══════╡ │ 1 ┆ 0.5 │ │ 2 ┆ 4.0 │ │ 2 ┆ 4.0 │ │ 4 ┆ 13.0 │ └─────┴──────┘
>>> df.fill_null(strategy="max") shape: (4, 2) ┌─────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ f64 │ ╞═════╪══════╡ │ 1 ┆ 0.5 │ │ 2 ┆ 4.0 │ │ 4 ┆ 13.0 │ │ 4 ┆ 13.0 │ └─────┴──────┘
>>> df.fill_null(strategy="zero") shape: (4, 2) ┌─────┬──────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ f64 │ ╞═════╪══════╡ │ 1 ┆ 0.5 │ │ 2 ┆ 4.0 │ │ 0 ┆ 0.0 │ │ 4 ┆ 13.0 │ └─────┴──────┘
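The limit parameter caps how many consecutive nulls the ‘forward’ or ‘backward’ strategy fills; a minimal sketch (the frame below is illustrative):
>>> df_limited = pl.DataFrame({"a": [1, None, None, None, 5]})
>>> out = df_limited.fill_null(strategy="forward", limit=1)  # 'a' becomes [1, 1, null, null, 5]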
- filter(predicate: Expr | str | Series | list[bool] | np.ndarray[Any, Any] | bool) Self [source]
Filter the rows in the DataFrame based on a predicate expression.
- Parameters:
- predicate
Expression that evaluates to a boolean Series.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... )
Filter on one condition:
>>> df.filter(pl.col("foo") < 3) shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ └─────┴─────┴─────┘
Filter on multiple conditions:
>>> df.filter((pl.col("foo") < 3) & (pl.col("ham") == "a")) shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ └─────┴─────┴─────┘
Filter on an OR condition:
>>> df.filter((pl.col("foo") == 1) | (pl.col("ham") == "c")) shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 3 ┆ 8 ┆ c │ └─────┴─────┴─────┘
- find_idx_by_name(name: str) int [source]
Find the index of a column by name.
- Parameters:
- name
Name of the column to find.
Examples
>>> df = pl.DataFrame( ... {"foo": [1, 2, 3], "bar": [6, 7, 8], "ham": ["a", "b", "c"]} ... ) >>> df.find_idx_by_name("ham") 2
- fold(operation: Callable[[Series, Series], Series]) Series [source]
Apply a horizontal reduction on a DataFrame.
This can be used to effectively determine aggregations on a row level, and can be applied to any DataType that can be supercast (cast to a similar parent type).
Examples of the supercast rules when applying an arithmetic operation on two DataTypes:
Int8 + Utf8 = Utf8
Float32 + Int64 = Float32
Float32 + Float64 = Float64
- Parameters:
- operation
function that takes two Series and returns a Series.
Examples
A horizontal sum operation:
>>> df = pl.DataFrame( ... { ... "a": [2, 1, 3], ... "b": [1, 2, 3], ... "c": [1.0, 2.0, 3.0], ... } ... ) >>> df.fold(lambda s1, s2: s1 + s2) shape: (3,) Series: 'a' [f64] [ 4.0 5.0 9.0 ]
A horizontal minimum operation:
>>> df = pl.DataFrame({"a": [2, 1, 3], "b": [1, 2, 3], "c": [1.0, 2.0, 3.0]}) >>> df.fold(lambda s1, s2: s1.zip_with(s1 < s2, s2)) shape: (3,) Series: 'a' [f64] [ 1.0 1.0 3.0 ]
A horizontal string concatenation:
>>> df = pl.DataFrame( ... { ... "a": ["foo", "bar", 2], ... "b": [1, 2, 3], ... "c": [1.0, 2.0, 3.0], ... } ... ) >>> df.fold(lambda s1, s2: s1 + s2) shape: (3,) Series: 'a' [str] [ "foo11.0" "bar22.0" null ]
A horizontal boolean or, similar to a row-wise .any():
>>> df = pl.DataFrame( ... { ... "a": [False, False, True], ... "b": [False, True, False], ... } ... ) >>> df.fold(lambda s1, s2: s1 | s2) shape: (3,) Series: 'a' [bool] [ false true true ]
- frame_equal(other: DataFrame, *, null_equal: bool = True) bool [source]
Check if DataFrame is equal to other.
- Parameters:
- other
DataFrame to compare with.
- null_equal
Consider null values as equal.
Examples
>>> df1 = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.0, 7.0, 8.0], ... "ham": ["a", "b", "c"], ... } ... ) >>> df2 = pl.DataFrame( ... { ... "foo": [3, 2, 1], ... "bar": [8.0, 7.0, 6.0], ... "ham": ["c", "b", "a"], ... } ... ) >>> df1.frame_equal(df1) True >>> df1.frame_equal(df2) False
- get_column(name: str) Series [source]
Get a single column as Series by name.
- Parameters:
- name : str
Name of the column to retrieve.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) >>> df.get_column("foo") shape: (3,) Series: 'foo' [i64] [ 1 2 3 ]
- get_columns() list[Series] [source]
Get the DataFrame as a List of Series.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) >>> df.get_columns() [shape: (3,) Series: 'foo' [i64] [ 1 2 3 ], shape: (3,) Series: 'bar' [i64] [ 4 5 6 ]]
>>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [0.5, 4, 10, 13], ... "c": [True, True, False, True], ... } ... ) >>> df.get_columns() [shape: (4,) Series: 'a' [i64] [ 1 2 3 4 ], shape: (4,) Series: 'b' [f64] [ 0.5 4.0 10.0 13.0 ], shape: (4,) Series: 'c' [bool] [ true true false true ]]
- glimpse(return_as_string: Literal[False]) None [source]
- glimpse(return_as_string: Literal[True]) str
Return a dense preview of the dataframe.
The formatting is done one line per column, so wide dataframes show nicely. Each line will show the column name, the data type and the first few values.
- Parameters:
- return_as_string
If True, return as string rather than printing to stdout.
Examples
>>> from datetime import date >>> df = pl.DataFrame( ... { ... "a": [1.0, 2.8, 3.0], ... "b": [4, 5, None], ... "c": [True, False, True], ... "d": [None, "b", "c"], ... "e": ["usd", "eur", None], ... "f": [date(2020, 1, 1), date(2021, 1, 2), date(2022, 1, 1)], ... } ... ) >>> df.glimpse() Rows: 3 Columns: 6 $ a <f64> 1.0, 2.8, 3.0 $ b <i64> 4, 5, None $ c <bool> True, False, True $ d <str> None, b, c $ e <str> usd, eur, None $ f <date> 2020-01-01, 2021-01-02, 2022-01-01
- groupby(by: IntoExpr | Iterable[IntoExpr], *more_by: IntoExpr, maintain_order: bool = False) GroupBy[Self] [source]
Start a groupby operation.
- Parameters:
- by
Column(s) to group by. Accepts expression input. Strings are parsed as column names.
- *more_by
Additional columns to group by, specified as positional arguments.
- maintain_order
Ensure that the order of the groups is consistent with the input data. This is slower than a default groupby. Setting this to True blocks the possibility to run on the streaming engine.
Examples
Group by one column and call agg to compute the grouped sum of another column.
>>> df = pl.DataFrame( ... { ... "a": ["a", "b", "a", "b", "c"], ... "b": [1, 2, 1, 3, 3], ... "c": [5, 4, 3, 2, 1], ... } ... ) >>> df.groupby("a").agg(pl.col("b").sum()) shape: (3, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ a ┆ 2 │ │ b ┆ 5 │ │ c ┆ 3 │ └─────┴─────┘
Set maintain_order=True to ensure the order of the groups is consistent with the input.
>>> df.groupby("a", maintain_order=True).agg(pl.col("c")) shape: (3, 2) ┌─────┬───────────┐ │ a ┆ c │ │ --- ┆ --- │ │ str ┆ list[i64] │ ╞═════╪═══════════╡ │ a ┆ [5, 3] │ │ b ┆ [4, 2] │ │ c ┆ [1] │ └─────┴───────────┘
Group by multiple columns by passing a list of column names.
>>> df.groupby(["a", "b"]).agg(pl.max("c")) shape: (4, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ a ┆ 1 ┆ 5 │ │ b ┆ 2 ┆ 4 │ │ b ┆ 3 ┆ 2 │ │ c ┆ 3 ┆ 1 │ └─────┴─────┴─────┘
Or use positional arguments to group by multiple columns in the same way. Expressions are also accepted.
>>> df.groupby("a", pl.col("b") // 2).agg(pl.col("c").mean()) shape: (3, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ f64 │ ╞═════╪═════╪═════╡ │ a ┆ 0 ┆ 4.0 │ │ b ┆ 1 ┆ 3.0 │ │ c ┆ 1 ┆ 1.0 │ └─────┴─────┴─────┘
The GroupBy object returned by this method is iterable, returning the name and data of each group.
>>> for name, data in df.groupby("a"): ... print(name) ... print(data) ... a shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ a ┆ 1 ┆ 5 │ │ a ┆ 1 ┆ 3 │ └─────┴─────┴─────┘ b shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ b ┆ 2 ┆ 4 │ │ b ┆ 3 ┆ 2 │ └─────┴─────┴─────┘ c shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ c ┆ 3 ┆ 1 │ └─────┴─────┴─────┘
- groupby_dynamic(index_column: str, *, every: str | timedelta, period: str | timedelta | None = None, offset: str | timedelta | None = None, truncate: bool = True, include_boundaries: bool = False, closed: ClosedInterval = 'left', by: IntoExpr | Iterable[IntoExpr] | None = None, start_by: StartBy = 'window') DynamicGroupBy[Self] [source]
Group based on a time value (or index value of type Int32, Int64).
Time windows are calculated and rows are assigned to windows. Different from a normal groupby is that a row can be member of multiple groups. The time/index window could be seen as a rolling window, with a window size determined by dates/times/values instead of slots in the DataFrame.
A window is defined by:
every: interval of the window
period: length of the window
offset: offset of the window
The every, period and offset arguments are created with the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 day)
1w (1 week)
1mo (1 calendar month)
1y (1 calendar year)
1i (1 index count)
Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds
In case of a groupby_dynamic on an integer column, the windows are defined by:
“1i” # length 1
“10i” # length 10
Warning
The index column must be sorted in ascending order.
- Parameters:
- index_column
Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order; if it is not, the output will not make sense.
In case of a dynamic groupby on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.
- every
interval of the window
- period
length of the window, if None it is equal to ‘every’
- offset
offset of the window; if None and period is None, it will be equal to negative every
- truncate
truncate the time value to the window lower bound
- include_boundaries
Add the lower and upper bound of the window to the “_lower_boundary” and “_upper_boundary” columns. This will impact performance because it’s harder to parallelize.
- closed : {‘left’, ‘right’, ‘both’, ‘none’}
Define which sides of the temporal interval are closed (inclusive).
- by
Also group by this column/these columns
- start_by : {‘window’, ‘datapoint’, ‘monday’}
The strategy used to determine the start of the first window.
‘window’: Truncate the start of the window with the ‘every’ argument.
‘datapoint’: Start from the first encountered data point.
‘monday’: Start the window on the Monday before the first data point.
Examples
>>> from datetime import datetime >>> # create an example dataframe >>> df = pl.DataFrame( ... { ... "time": pl.date_range( ... low=datetime(2021, 12, 16), ... high=datetime(2021, 12, 16, 3), ... interval="30m", ... ), ... "n": range(7), ... } ... ) >>> df shape: (7, 2) ┌─────────────────────┬─────┐ │ time ┆ n │ │ --- ┆ --- │ │ datetime[μs] ┆ i64 │ ╞═════════════════════╪═════╡ │ 2021-12-16 00:00:00 ┆ 0 │ │ 2021-12-16 00:30:00 ┆ 1 │ │ 2021-12-16 01:00:00 ┆ 2 │ │ 2021-12-16 01:30:00 ┆ 3 │ │ 2021-12-16 02:00:00 ┆ 4 │ │ 2021-12-16 02:30:00 ┆ 5 │ │ 2021-12-16 03:00:00 ┆ 6 │ └─────────────────────┴─────┘
Group by windows of 1 hour starting at 2021-12-16 00:00:00.
>>> df.groupby_dynamic("time", every="1h", closed="right").agg( ... [ ... pl.col("time").min().alias("time_min"), ... pl.col("time").max().alias("time_max"), ... ] ... ) shape: (4, 3) ┌─────────────────────┬─────────────────────┬─────────────────────┐ │ time ┆ time_min ┆ time_max │ │ --- ┆ --- ┆ --- │ │ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] │ ╞═════════════════════╪═════════════════════╪═════════════════════╡ │ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 00:00:00 │ │ 2021-12-16 00:00:00 ┆ 2021-12-16 00:30:00 ┆ 2021-12-16 01:00:00 │ │ 2021-12-16 01:00:00 ┆ 2021-12-16 01:30:00 ┆ 2021-12-16 02:00:00 │ │ 2021-12-16 02:00:00 ┆ 2021-12-16 02:30:00 ┆ 2021-12-16 03:00:00 │ └─────────────────────┴─────────────────────┴─────────────────────┘
The window boundaries can also be added to the aggregation result
>>> df.groupby_dynamic( ... "time", every="1h", include_boundaries=True, closed="right" ... ).agg([pl.col("time").count().alias("time_count")]) shape: (4, 4) ┌─────────────────────┬─────────────────────┬─────────────────────┬────────────┐ │ _lower_boundary ┆ _upper_boundary ┆ time ┆ time_count │ │ --- ┆ --- ┆ --- ┆ --- │ │ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ u32 │ ╞═════════════════════╪═════════════════════╪═════════════════════╪════════════╡ │ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 1 │ │ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 2 │ │ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2 │ │ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2 │ └─────────────────────┴─────────────────────┴─────────────────────┴────────────┘
When closed=”left”, the right end of the interval is not included: [lower_bound, upper_bound).
>>> df.groupby_dynamic("time", every="1h", closed="left").agg( ... [ ... pl.col("time").count().alias("time_count"), ... pl.col("time").alias("time_agg_list"), ... ] ... ) shape: (4, 3) ┌─────────────────────┬────────────┬───────────────────────────────────┐ │ time ┆ time_count ┆ time_agg_list │ │ --- ┆ --- ┆ --- │ │ datetime[μs] ┆ u32 ┆ list[datetime[μs]] │ ╞═════════════════════╪════════════╪═══════════════════════════════════╡ │ 2021-12-16 00:00:00 ┆ 2 ┆ [2021-12-16 00:00:00, 2021-12-16… │ │ 2021-12-16 01:00:00 ┆ 2 ┆ [2021-12-16 01:00:00, 2021-12-16… │ │ 2021-12-16 02:00:00 ┆ 2 ┆ [2021-12-16 02:00:00, 2021-12-16… │ │ 2021-12-16 03:00:00 ┆ 1 ┆ [2021-12-16 03:00:00] │ └─────────────────────┴────────────┴───────────────────────────────────┘
When closed=”both” the time values at the window boundaries belong to 2 groups.
>>> df.groupby_dynamic("time", every="1h", closed="both").agg( ... [pl.col("time").count().alias("time_count")] ... ) shape: (5, 2) ┌─────────────────────┬────────────┐ │ time ┆ time_count │ │ --- ┆ --- │ │ datetime[μs] ┆ u32 │ ╞═════════════════════╪════════════╡ │ 2021-12-15 23:00:00 ┆ 1 │ │ 2021-12-16 00:00:00 ┆ 3 │ │ 2021-12-16 01:00:00 ┆ 3 │ │ 2021-12-16 02:00:00 ┆ 3 │ │ 2021-12-16 03:00:00 ┆ 1 │ └─────────────────────┴────────────┘
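The start_by strategies determine where the first window begins; a brief sketch (result omitted), reusing the frame above:
>>> grouped = df.groupby_dynamic("time", every="1h", start_by="datapoint").agg(
...     pl.col("n").count().alias("n_count")
... )  # the first window starts at the first data point rather than a truncated boundary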
Dynamic groupbys can also be combined with grouping on normal keys
>>> df = pl.DataFrame( ... { ... "time": pl.date_range( ... low=datetime(2021, 12, 16), ... high=datetime(2021, 12, 16, 3), ... interval="30m", ... ), ... "groups": ["a", "a", "a", "b", "b", "a", "a"], ... } ... ) >>> df shape: (7, 2) ┌─────────────────────┬────────┐ │ time ┆ groups │ │ --- ┆ --- │ │ datetime[μs] ┆ str │ ╞═════════════════════╪════════╡ │ 2021-12-16 00:00:00 ┆ a │ │ 2021-12-16 00:30:00 ┆ a │ │ 2021-12-16 01:00:00 ┆ a │ │ 2021-12-16 01:30:00 ┆ b │ │ 2021-12-16 02:00:00 ┆ b │ │ 2021-12-16 02:30:00 ┆ a │ │ 2021-12-16 03:00:00 ┆ a │ └─────────────────────┴────────┘ >>> df.groupby_dynamic( ... "time", ... every="1h", ... closed="both", ... by="groups", ... include_boundaries=True, ... ).agg([pl.col("time").count().alias("time_count")]) shape: (7, 5) ┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬────────────┐ │ groups ┆ _lower_boundary ┆ _upper_boundary ┆ time ┆ time_count │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ u32 │ ╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪════════════╡ │ a ┆ 2021-12-15 23:00:00 ┆ 2021-12-16 00:00:00 ┆ 2021-12-15 23:00:00 ┆ 1 │ │ a ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 3 │ │ a ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 1 │ │ a ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2 │ │ a ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ 1 │ │ b ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2 │ │ b ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 1 │ └────────┴─────────────────────┴─────────────────────┴─────────────────────┴────────────┘
Dynamic groupby on an index column
>>> df = pl.DataFrame( ... { ... "idx": pl.arange(0, 6, eager=True), ... "A": ["A", "A", "B", "B", "B", "C"], ... } ... ) >>> ( ... df.groupby_dynamic( ... "idx", ... every="2i", ... period="3i", ... include_boundaries=True, ... closed="right", ... ).agg(pl.col("A").alias("A_agg_list")) ... ) shape: (3, 4) ┌─────────────────┬─────────────────┬─────┬─────────────────┐ │ _lower_boundary ┆ _upper_boundary ┆ idx ┆ A_agg_list │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 ┆ list[str] │ ╞═════════════════╪═════════════════╪═════╪═════════════════╡ │ 0 ┆ 3 ┆ 0 ┆ ["A", "B", "B"] │ │ 2 ┆ 5 ┆ 2 ┆ ["B", "B", "C"] │ │ 4 ┆ 7 ┆ 4 ┆ ["C"] │ └─────────────────┴─────────────────┴─────┴─────────────────┘
- groupby_rolling(index_column: str, *, period: str | timedelta, offset: str | timedelta | None = None, closed: ClosedInterval = 'right', by: IntoExpr | Iterable[IntoExpr] | None = None) RollingGroupBy[Self] [source]
Create rolling groups based on a time column.
Also works for index values of type Int32 or Int64.
Different from a dynamic_groupby, the windows are now determined by the individual values and are not of constant intervals. For constant intervals use groupby_dynamic.
The period and offset arguments are created either from a timedelta, or by using the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 day)
1w (1 week)
1mo (1 calendar month)
1y (1 calendar year)
1i (1 index count)
Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds
In case of a groupby_rolling on an integer column, the windows are defined by:
“1i” # length 1
“10i” # length 10
- Parameters:
- index_column
Column used to group based on the time window. Often of type Date/Datetime. This column must be sorted in ascending order; if it is not, the output will not make sense.
In case of a rolling groupby on indices, dtype needs to be one of {Int32, Int64}. Note that Int32 gets temporarily cast to Int64, so if performance matters use an Int64 column.
- period
length of the window
- offset
offset of the window. Default is -period
- closed : {‘right’, ‘left’, ‘both’, ‘none’}
Define which sides of the temporal interval are closed (inclusive).
- by
Also group by this column/these columns
Examples
>>> dates = [ ... "2020-01-01 13:45:48", ... "2020-01-01 16:42:13", ... "2020-01-01 16:45:09", ... "2020-01-02 18:12:48", ... "2020-01-03 19:45:32", ... "2020-01-08 23:16:43", ... ] >>> df = pl.DataFrame({"dt": dates, "a": [3, 7, 5, 9, 2, 1]}).with_columns( ... pl.col("dt").str.strptime(pl.Datetime) ... ) >>> out = df.groupby_rolling(index_column="dt", period="2d").agg( ... [ ... pl.sum("a").alias("sum_a"), ... pl.min("a").alias("min_a"), ... pl.max("a").alias("max_a"), ... ] ... ) >>> assert out["sum_a"].to_list() == [3, 10, 15, 24, 11, 1] >>> assert out["max_a"].to_list() == [3, 7, 7, 9, 9, 1] >>> assert out["min_a"].to_list() == [3, 3, 3, 3, 2, 1] >>> out shape: (6, 4) ┌─────────────────────┬───────┬───────┬───────┐ │ dt ┆ sum_a ┆ min_a ┆ max_a │ │ --- ┆ --- ┆ --- ┆ --- │ │ datetime[μs] ┆ i64 ┆ i64 ┆ i64 │ ╞═════════════════════╪═══════╪═══════╪═══════╡ │ 2020-01-01 13:45:48 ┆ 3 ┆ 3 ┆ 3 │ │ 2020-01-01 16:42:13 ┆ 10 ┆ 3 ┆ 7 │ │ 2020-01-01 16:45:09 ┆ 15 ┆ 3 ┆ 7 │ │ 2020-01-02 18:12:48 ┆ 24 ┆ 3 ┆ 9 │ │ 2020-01-03 19:45:32 ┆ 11 ┆ 2 ┆ 9 │ │ 2020-01-08 23:16:43 ┆ 1 ┆ 1 ┆ 1 │ └─────────────────────┴───────┴───────┴───────┘
- hash_rows(seed: int = 0, seed_1: int | None = None, seed_2: int | None = None, seed_3: int | None = None) Series [source]
Hash and combine the rows in this DataFrame.
The hash value is of type UInt64.
- Parameters:
- seed
Random seed parameter. Defaults to 0.
- seed_1
Random seed parameter. Defaults to seed if not set.
- seed_2
Random seed parameter. Defaults to seed if not set.
- seed_3
Random seed parameter. Defaults to seed if not set.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, None, 3, 4], ... "ham": ["a", "b", None, "d"], ... } ... ) >>> df.hash_rows(seed=42) shape: (4,) Series: '' [u64] [ 10783150408545073287 1438741209321515184 10047419486152048166 2047317070637311557 ]
- head(n: int = 5) Self [source]
Get the first n rows.
- Parameters:
- n
Number of rows to return. If a negative value is passed, return all rows except the last abs(n).
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3, 4, 5], ... "bar": [6, 7, 8, 9, 10], ... "ham": ["a", "b", "c", "d", "e"], ... } ... ) >>> df.head(3) shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ │ 3 ┆ 8 ┆ c │ └─────┴─────┴─────┘
Pass a negative value to get all rows except the last abs(n).
>>> df.head(-3) shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ └─────┴─────┴─────┘
- property height: int[source]
Get the height of the DataFrame.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]}) >>> df.height 5
- hstack(columns: list[Series] | DataFrame, *, in_place: bool = False) Self [source]
Return a new DataFrame grown horizontally by stacking multiple Series to it.
- Parameters:
- columns
Series to stack.
- in_place
Modify in place.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> x = pl.Series("apple", [10, 20, 30]) >>> df.hstack([x]) shape: (3, 4) ┌─────┬─────┬─────┬───────┐ │ foo ┆ bar ┆ ham ┆ apple │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str ┆ i64 │ ╞═════╪═════╪═════╪═══════╡ │ 1 ┆ 6 ┆ a ┆ 10 │ │ 2 ┆ 7 ┆ b ┆ 20 │ │ 3 ┆ 8 ┆ c ┆ 30 │ └─────┴─────┴─────┴───────┘
- insert_at_idx(index: int, series: Series) Self [source]
Insert a Series at a certain column index. This operation is in place.
- Parameters:
- index
Index at which to insert the new Series column.
- series
Series to insert.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) >>> s = pl.Series("baz", [97, 98, 99]) >>> df.insert_at_idx(1, s) shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ baz ┆ bar │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ 1 ┆ 97 ┆ 4 │ │ 2 ┆ 98 ┆ 5 │ │ 3 ┆ 99 ┆ 6 │ └─────┴─────┴─────┘
>>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [0.5, 4, 10, 13], ... "c": [True, True, False, True], ... } ... ) >>> s = pl.Series("d", [-2.5, 15, 20.5, 0]) >>> df.insert_at_idx(3, s) shape: (4, 4) ┌─────┬──────┬───────┬──────┐ │ a ┆ b ┆ c ┆ d │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ bool ┆ f64 │ ╞═════╪══════╪═══════╪══════╡ │ 1 ┆ 0.5 ┆ true ┆ -2.5 │ │ 2 ┆ 4.0 ┆ true ┆ 15.0 │ │ 3 ┆ 10.0 ┆ false ┆ 20.5 │ │ 4 ┆ 13.0 ┆ true ┆ 0.0 │ └─────┴──────┴───────┴──────┘
- interpolate() Self [source]
Interpolate intermediate values. The interpolation method is linear.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, None, 9, 10], ... "bar": [6, 7, 9, None], ... "baz": [1, None, None, 9], ... } ... ) >>> df.interpolate() shape: (4, 3) ┌─────┬──────┬─────┐ │ foo ┆ bar ┆ baz │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞═════╪══════╪═════╡ │ 1 ┆ 6 ┆ 1 │ │ 5 ┆ 7 ┆ 3 │ │ 9 ┆ 9 ┆ 6 │ │ 10 ┆ null ┆ 9 │ └─────┴──────┴─────┘
- is_duplicated() Series [source]
Get a mask of all duplicated rows in this DataFrame.
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 1], ... "b": ["x", "y", "z", "x"], ... } ... ) >>> df.is_duplicated() shape: (4,) Series: '' [bool] [ true false false true ]
This mask can be used to visualize the duplicated lines like this:
>>> df.filter(df.is_duplicated()) shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ str │ ╞═════╪═════╡ │ 1 ┆ x │ │ 1 ┆ x │ └─────┴─────┘
- is_empty() bool [source]
Check if the dataframe is empty.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) >>> df.is_empty() False >>> df.filter(pl.col("foo") > 99).is_empty() True
- is_unique() Series [source]
Get a mask of all unique rows in this DataFrame.
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 1], ... "b": ["x", "y", "z", "x"], ... } ... ) >>> df.is_unique() shape: (4,) Series: '' [bool] [ false true true false ]
This mask can be used to visualize the unique lines like this:
>>> df.filter(df.is_unique()) shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ str │ ╞═════╪═════╡ │ 2 ┆ y │ │ 3 ┆ z │ └─────┴─────┘
- item() Any [source]
Return the dataframe as a scalar.
Equivalent to df[0,0], with a check that the shape is (1,1).
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}) >>> result = df.select((pl.col("a") * pl.col("b")).sum()) >>> result shape: (1, 1) ┌─────┐ │ a │ │ --- │ │ i64 │ ╞═════╡ │ 32 │ └─────┘ >>> result.item() 32
- iter_rows(named: Literal[False] = False, buffer_size: int = 500) Iterator[tuple[Any, ...]] [source]
- iter_rows(named: Literal[True], buffer_size: int = 500) Iterator[dict[str, Any]]
Returns an iterator over the DataFrame of rows of python-native values.
- Parameters:
- named
Return dictionaries instead of tuples. The dictionaries are a mapping of column name to row value. This is more expensive than returning a regular tuple, but allows for accessing values by column name.
- buffer_size
Determines the number of rows that are buffered internally while iterating over the data; you should only modify this in very specific cases where the default value is determined not to be a good fit to your access pattern, as the speedup from using the buffer is significant (~2-4x). Setting this value to zero disables row buffering.
- Returns:
- An iterator of tuples (default) or dictionaries of python row values.
Warning
Row iteration is not optimal as the underlying data is stored in columnar form; where possible, prefer export via one of the dedicated export/output methods.
See also
rows
Materialises all frame data as a list of rows.
Notes
If you have ns-precision temporal values you should be aware that python natively only supports up to us-precision; if this matters you should export to a different format.
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 3, 5], ... "b": [2, 4, 6], ... } ... ) >>> [row[0] for row in df.iter_rows()] [1, 3, 5] >>> [row["b"] for row in df.iter_rows(named=True)] [2, 4, 6]
- iter_slices(n_rows: int = 10000) Iterator[DataFrame] [source]
Returns a non-copying iterator of slices over the underlying DataFrame.
- Parameters:
- n_rows
Determines the number of rows contained in each DataFrame slice.
See also
iter_rows
Row iterator over frame data (does not materialise all rows).
partition_by
Split into multiple DataFrames, partitioned by groups.
Examples
>>> from datetime import date >>> df = pl.DataFrame( ... data={ ... "a": range(17_500), ... "b": date(2023, 1, 1), ... "c": "klmnoopqrstuvwxyz", ... }, ... schema_overrides={"a": pl.Int32}, ... ) >>> for idx, frame in enumerate(df.iter_slices()): ... print(f"{type(frame).__name__}:[{idx}]:{len(frame)}") ... DataFrame:[0]:10000 DataFrame:[1]:7500
Using iter_slices is an efficient way to chunk-iterate over DataFrames and any supported frame export/conversion types; for example, as RecordBatches:
>>> for frame in df.iter_slices(n_rows=15_000): ... record_batch = frame.to_arrow().to_batches()[0] ... print(record_batch, "\n<< ", len(record_batch)) ... pyarrow.RecordBatch a: int32 b: date32[day] c: large_string << 15000 pyarrow.RecordBatch a: int32 b: date32[day] c: large_string << 2500
- join(other: DataFrame, *, left_on: str | Expr | Sequence[str | Expr] | None = None, right_on: str | Expr | Sequence[str | Expr] | None = None, on: str | Expr | Sequence[str | Expr] | None = None, how: JoinStrategy = 'inner', suffix: str = '_right') Self [source]
Join in SQL-like fashion.
- Parameters:
- other
DataFrame to join with.
- left_on
Name(s) of the left join column(s).
- right_on
Name(s) of the right join column(s).
- on
Name(s) of the join columns in both DataFrames.
- how : {‘inner’, ‘left’, ‘outer’, ‘semi’, ‘anti’, ‘cross’}
Join strategy.
- suffix
Suffix to append to columns with a duplicate name.
- Returns:
- Joined DataFrame
Notes
For joining on columns with categorical data, see pl.StringCache().
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.0, 7.0, 8.0], ... "ham": ["a", "b", "c"], ... } ... ) >>> other_df = pl.DataFrame( ... { ... "apple": ["x", "y", "z"], ... "ham": ["a", "b", "d"], ... } ... ) >>> df.join(other_df, on="ham") shape: (2, 4) ┌─────┬─────┬─────┬───────┐ │ foo ┆ bar ┆ ham ┆ apple │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str ┆ str │ ╞═════╪═════╪═════╪═══════╡ │ 1 ┆ 6.0 ┆ a ┆ x │ │ 2 ┆ 7.0 ┆ b ┆ y │ └─────┴─────┴─────┴───────┘
>>> df.join(other_df, on="ham", how="outer") shape: (4, 4) ┌──────┬──────┬─────┬───────┐ │ foo ┆ bar ┆ ham ┆ apple │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str ┆ str │ ╞══════╪══════╪═════╪═══════╡ │ 1 ┆ 6.0 ┆ a ┆ x │ │ 2 ┆ 7.0 ┆ b ┆ y │ │ null ┆ null ┆ d ┆ z │ │ 3 ┆ 8.0 ┆ c ┆ null │ └──────┴──────┴─────┴───────┘
>>> df.join(other_df, on="ham", how="left") shape: (3, 4) ┌─────┬─────┬─────┬───────┐ │ foo ┆ bar ┆ ham ┆ apple │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str ┆ str │ ╞═════╪═════╪═════╪═══════╡ │ 1 ┆ 6.0 ┆ a ┆ x │ │ 2 ┆ 7.0 ┆ b ┆ y │ │ 3 ┆ 8.0 ┆ c ┆ null │ └─────┴─────┴─────┴───────┘
>>> df.join(other_df, on="ham", how="semi") shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6.0 ┆ a │ │ 2 ┆ 7.0 ┆ b │ └─────┴─────┴─────┘
>>> df.join(other_df, on="ham", how="anti") shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞═════╪═════╪═════╡ │ 3 ┆ 8.0 ┆ c │ └─────┴─────┴─────┘
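When both frames share a non-join column name, the suffix is appended to the right-hand duplicate; a short sketch using a renamed copy of other_df:
>>> other_renamed = other_df.rename({"apple": "foo"})
>>> df.join(other_renamed, on="ham", suffix="_other").columns
['foo', 'bar', 'ham', 'foo_other']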
- join_asof(other: DataFrame, *, left_on: str | None | Expr = None, right_on: str | None | Expr = None, on: str | None | Expr = None, by_left: str | Sequence[str] | None = None, by_right: str | Sequence[str] | None = None, by: str | Sequence[str] | None = None, strategy: AsofJoinStrategy = 'backward', suffix: str = '_right', tolerance: str | int | float | None = None, allow_parallel: bool = True, force_parallel: bool = False) Self [source]
Perform an asof join.
This is similar to a left-join except that we match on nearest key rather than equal keys.
Both DataFrames must be sorted by the asof_join key.
For each row in the left DataFrame:
A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.
A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.
The default is “backward”.
- Parameters:
- other
DataFrame to join with.
- left_on
Join column of the left DataFrame.
- right_on
Join column of the right DataFrame.
- on
Join column of both DataFrames. If set, left_on and right_on should be None.
- by
join on these columns before doing asof join
- by_left
join on these columns before doing asof join
- by_right
join on these columns before doing asof join
- strategy : {‘backward’, ‘forward’}
Join strategy.
- suffix
Suffix to append to columns with a duplicate name.
- tolerance
Numeric tolerance. By setting this, the join will only be done if the near keys are within this distance. If an asof join is done on columns of dtype “Date”, “Datetime”, “Duration” or “Time”, you may use the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 day)
1w (1 week)
1mo (1 calendar month)
1y (1 calendar year)
1i (1 index count)
Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds
- allow_parallel
Allow the physical plan to optionally evaluate the computation of both DataFrames up to the join in parallel.
- force_parallel
Force the physical plan to evaluate the computation of both DataFrames up to the join in parallel.
Examples
>>> from datetime import datetime >>> gdp = pl.DataFrame( ... { ... "date": [ ... datetime(2016, 1, 1), ... datetime(2017, 1, 1), ... datetime(2018, 1, 1), ... datetime(2019, 1, 1), ... ], # note record date: Jan 1st (sorted!) ... "gdp": [4164, 4411, 4566, 4696], ... } ... ) >>> population = pl.DataFrame( ... { ... "date": [ ... datetime(2016, 5, 12), ... datetime(2017, 5, 12), ... datetime(2018, 5, 12), ... datetime(2019, 5, 12), ... ], # note record date: May 12th (sorted!) ... "population": [82.19, 82.66, 83.12, 83.52], ... } ... ) >>> population.join_asof( ... gdp, left_on="date", right_on="date", strategy="backward" ... ) shape: (4, 3) ┌─────────────────────┬────────────┬──────┐ │ date ┆ population ┆ gdp │ │ --- ┆ --- ┆ --- │ │ datetime[μs] ┆ f64 ┆ i64 │ ╞═════════════════════╪════════════╪══════╡ │ 2016-05-12 00:00:00 ┆ 82.19 ┆ 4164 │ │ 2017-05-12 00:00:00 ┆ 82.66 ┆ 4411 │ │ 2018-05-12 00:00:00 ┆ 83.12 ┆ 4566 │ │ 2019-05-12 00:00:00 ┆ 83.52 ┆ 4696 │ └─────────────────────┴────────────┴──────┘
- lazy() LazyFrame [source]
Start a lazy query from this point. This returns a LazyFrame object.
Operations on a LazyFrame are not executed until execution is requested by calling one of:
.fetch()
(run on a small number of rows)
.collect()
(run on all data)
.describe_plan()
(print unoptimized query plan)
.describe_optimized_plan()
(print optimized query plan)
.show_graph()
(show (un)optimized query plan as graphviz graph)
Lazy operations are advised because they allow for query optimization and more parallelization.
- Returns:
- LazyFrame
Examples
>>> df = pl.DataFrame( ... { ... "a": [None, 2, 3, 4], ... "b": [0.5, None, 2.5, 13], ... "c": [True, True, False, None], ... } ... ) >>> df.lazy() <polars.LazyFrame object at ...>
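To round out the example above, here is a brief sketch (not from the original docstring) of a lazy query being collected back into a DataFrame; the filter expression is an illustrative assumption:
>>> df.lazy().filter(pl.col("a").is_not_null()).collect()
shape: (3, 3)
┌─────┬──────┬───────┐
│ a   ┆ b    ┆ c     │
│ --- ┆ ---  ┆ ---   │
│ i64 ┆ f64  ┆ bool  │
╞═════╪══════╪═══════╡
│ 2   ┆ null ┆ true  │
│ 3   ┆ 2.5  ┆ false │
│ 4   ┆ 13.0 ┆ null  │
└─────┴──────┴───────┘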
- limit(n: int = 5) Self [source]
Get the first n rows.
Alias for
DataFrame.head()
.- Parameters:
- n
Number of rows to return. If a negative value is passed, return all rows except the last
abs(n)
.
See also
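The original entry includes no example here; a minimal sketch, assuming a small hypothetical frame, of the head() alias behaviour:
>>> df = pl.DataFrame({"foo": [1, 2, 3, 4, 5], "bar": [6, 7, 8, 9, 10]})
>>> df.limit(3)
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 6   │
│ 2   ┆ 7   │
│ 3   ┆ 8   │
└─────┴─────┘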
- max(axis: Literal[0] = 0) Self [source]
- max(axis: Literal[1]) Series
- max(axis: int = 0) Self | Series
Aggregate the columns of this DataFrame to their maximum value.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.max() shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 3 ┆ 8 ┆ c │ └─────┴─────┴─────┘
- mean(*, axis: Literal[0] = 0, null_strategy: NullStrategy = 'ignore') Self [source]
- mean(*, axis: Literal[1], null_strategy: NullStrategy = 'ignore') Series
- mean(*, axis: int = 0, null_strategy: NullStrategy = 'ignore') Self | Series
Aggregate the columns of this DataFrame to their mean value.
- Parameters:
- axis
Either 0 or 1.
- null_strategy{‘ignore’, ‘propagate’}
This argument is only used if axis == 1.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... "spam": [True, False, None], ... } ... ) >>> df.mean() shape: (1, 4) ┌─────┬─────┬──────┬──────┐ │ foo ┆ bar ┆ ham ┆ spam │ │ --- ┆ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ str ┆ f64 │ ╞═════╪═════╪══════╪══════╡ │ 2.0 ┆ 7.0 ┆ null ┆ 0.5 │ └─────┴─────┴──────┴──────┘ >>> df.mean(axis=1) shape: (3,) Series: 'foo' [f64] [ 2.666667 3.0 5.5 ]
- median() Self [source]
Aggregate the columns of this DataFrame to their median value.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.median() shape: (1, 3) ┌─────┬─────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ str │ ╞═════╪═════╪══════╡ │ 2.0 ┆ 7.0 ┆ null │ └─────┴─────┴──────┘
- melt(id_vars: Sequence[str] | str | None = None, value_vars: Sequence[str] | str | None = None, variable_name: str | None = None, value_name: str | None = None) Self [source]
Unpivot a DataFrame from wide to long format.
Optionally leaves identifiers set.
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.
- Parameters:
- id_vars
Columns to use as identifier variables.
- value_vars
Columns to use as value (measured) variables. If value_vars is empty all columns that are not in id_vars will be used.
- variable_name
Name to give to the variable column. Defaults to “variable”
- value_name
Name to give to the value column. Defaults to “value”
Examples
>>> df = pl.DataFrame( ... { ... "a": ["x", "y", "z"], ... "b": [1, 3, 5], ... "c": [2, 4, 6], ... } ... ) >>> df.melt(id_vars="a", value_vars=["b", "c"]) shape: (6, 3) ┌─────┬──────────┬───────┐ │ a ┆ variable ┆ value │ │ --- ┆ --- ┆ --- │ │ str ┆ str ┆ i64 │ ╞═════╪══════════╪═══════╡ │ x ┆ b ┆ 1 │ │ y ┆ b ┆ 3 │ │ z ┆ b ┆ 5 │ │ x ┆ c ┆ 2 │ │ y ┆ c ┆ 4 │ │ z ┆ c ┆ 6 │ └─────┴──────────┴───────┘
- merge_sorted(other: DataFrame, key: str) Self [source]
Take two sorted DataFrames and merge them by the sorted key.
The output of this operation will also be sorted. It is the caller's responsibility to ensure that both frames are sorted by that key; otherwise the output will not make sense.
The schemas of both DataFrames must be equal.
- Parameters:
- other
Other DataFrame to merge with this one.
- key
Name of the key column that both frames are sorted by.
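No example accompanies this entry; the following minimal sketch (with hypothetical frames that are already sorted on the key) illustrates the interleaving:
>>> left = pl.DataFrame({"key": [1, 3, 5], "val": ["a", "c", "e"]})
>>> right = pl.DataFrame({"key": [2, 4, 6], "val": ["b", "d", "f"]})
>>> left.merge_sorted(right, key="key")
shape: (6, 2)
┌─────┬─────┐
│ key ┆ val │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ a   │
│ 2   ┆ b   │
│ 3   ┆ c   │
│ 4   ┆ d   │
│ 5   ┆ e   │
│ 6   ┆ f   │
└─────┴─────┘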
- min(axis: Literal[0] = 0) Self [source]
- min(axis: Literal[1]) Series
- min(axis: int = 0) Self | Series
Aggregate the columns of this DataFrame to their minimum value.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.min() shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ └─────┴─────┴─────┘
- n_chunks(strategy: Literal['first'] = 'first') int [source]
- n_chunks(strategy: Literal['all']) list[int]
Get number of chunks used by the ChunkedArrays of this DataFrame.
- Parameters:
- strategy{‘first’, ‘all’}
Return the number of chunks of the ‘first’ column, or ‘all’ columns in this DataFrame.
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [0.5, 4, 10, 13], ... "c": [True, True, False, True], ... } ... ) >>> df.n_chunks() 1 >>> df.n_chunks(strategy="all") [1, 1, 1]
- n_unique(subset: str | Expr | Sequence[str | Expr] | None = None) int [source]
Return the number of unique rows, or the number of unique row-subsets.
- Parameters:
- subset
One or more columns/expressions that define what to count; omit to return the count of unique rows.
Notes
This method operates at the
DataFrame
level; to operate on subsets at the expression level you can make use of struct-packing instead, for example:>>> expr_unique_subset = pl.struct(["a", "b"]).n_unique()
If instead you want to count the number of unique values per-column, you can also use expression-level syntax to return a new frame containing that result:
>>> df = pl.DataFrame([[1, 2, 3], [1, 2, 4]], schema=["a", "b", "c"]) >>> df_nunique = df.select(pl.all().n_unique())
In aggregate context there is also an equivalent method for returning the unique values per-group:
>>> df_agg_nunique = df.groupby(by=["a"]).n_unique()
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 1, 2, 3, 4, 5], ... "b": [0.5, 0.5, 1.0, 2.0, 3.0, 3.0], ... "c": [True, True, True, False, True, True], ... } ... ) >>> df.n_unique() 5
Simple columns subset.
>>> df.n_unique(subset=["b", "c"]) 4
Expression subset.
>>> df.n_unique( ... subset=[ ... (pl.col("a") // 2), ... (pl.col("c") | (pl.col("b") >= 2)), ... ], ... ) 3
- null_count() Self [source]
Create a new DataFrame that shows the null counts per column.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, None, 3], ... "bar": [6, 7, None], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.null_count() shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ u32 ┆ u32 ┆ u32 │ ╞═════╪═════╪═════╡ │ 1 ┆ 1 ┆ 0 │ └─────┴─────┴─────┘
- partition_by(by: str | Iterable[str], *more_by: str, maintain_order: bool = True, as_dict: Literal[False] = False) list[Self] [source]
- partition_by(by: str | Iterable[str], *more_by: str, maintain_order: bool = True, as_dict: Literal[True]) dict[Any, Self]
Group by the given columns and return the groups as separate dataframes.
- Parameters:
- by
Name of the column(s) to group by.
- *more_by
Additional names of columns to group by, specified as positional arguments.
- maintain_order
Ensure that the order of the groups is consistent with the input data. This is slower than a default partition by operation.
- as_dict
Return a dictionary instead of a list. The dictionary keys are the distinct group values that identify that group.
Examples
Pass a single column name to partition by that column.
>>> df = pl.DataFrame( ... { ... "a": ["a", "b", "a", "b", "c"], ... "b": [1, 2, 1, 3, 3], ... "c": [5, 4, 3, 2, 1], ... } ... ) >>> df.partition_by("a") [shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ a ┆ 1 ┆ 5 │ │ a ┆ 1 ┆ 3 │ └─────┴─────┴─────┘, shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ b ┆ 2 ┆ 4 │ │ b ┆ 3 ┆ 2 │ └─────┴─────┴─────┘, shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ c ┆ 3 ┆ 1 │ └─────┴─────┴─────┘]
Partition by multiple columns by either passing a list of column names, or by specifying each column name as a positional argument.
>>> df.partition_by("a", "b") [shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ a ┆ 1 ┆ 5 │ │ a ┆ 1 ┆ 3 │ └─────┴─────┴─────┘, shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ b ┆ 2 ┆ 4 │ └─────┴─────┴─────┘, shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ b ┆ 3 ┆ 2 │ └─────┴─────┴─────┘, shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ c ┆ 3 ┆ 1 │ └─────┴─────┴─────┘]
Return the partitions as a dictionary by specifying
as_dict=True
.>>> df.partition_by("a", as_dict=True) {'a': shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ a ┆ 1 ┆ 5 │ │ a ┆ 1 ┆ 3 │ └─────┴─────┴─────┘, 'b': shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ b ┆ 2 ┆ 4 │ │ b ┆ 3 ┆ 2 │ └─────┴─────┴─────┘, 'c': shape: (1, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ c ┆ 3 ┆ 1 │ └─────┴─────┴─────┘}
- pipe(func: Callable[Concatenate[DataFrame, P], T], *args: P.args, **kwargs: P.kwargs) T [source]
Offers a structured way to apply a sequence of user-defined functions (UDFs).
- Parameters:
- func
Callable; will receive the frame as the first parameter, followed by any given args/kwargs.
- *args
Arguments to pass to the UDF.
- **kwargs
Keyword arguments to pass to the UDF.
Notes
It is recommended to use LazyFrame when piping operations, in order to fully take advantage of query optimization and parallelization. See
df.lazy()
.Examples
>>> def cast_str_to_int(data, col_name): ... return data.with_columns(pl.col(col_name).cast(pl.Int64)) ... >>> df = pl.DataFrame({"a": [1, 2, 3, 4], "b": ["10", "20", "30", "40"]}) >>> df.pipe(cast_str_to_int, col_name="b") shape: (4, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 10 │ │ 2 ┆ 20 │ │ 3 ┆ 30 │ │ 4 ┆ 40 │ └─────┴─────┘
>>> df = pl.DataFrame({"b": [1, 2], "a": [3, 4]}) >>> df shape: (2, 2) ┌─────┬─────┐ │ b ┆ a │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 3 │ │ 2 ┆ 4 │ └─────┴─────┘ >>> df.pipe(lambda tdf: tdf.select(sorted(tdf.columns))) shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 3 ┆ 1 │ │ 4 ┆ 2 │ └─────┴─────┘
- pivot(values: Sequence[str] | str, index: Sequence[str] | str, columns: Sequence[str] | str, aggregate_function: PivotAgg | Expr = 'first', *, maintain_order: bool = True, sort_columns: bool = False, separator: str = '_') Self [source]
Create a spreadsheet-style pivot table as a DataFrame.
- Parameters:
- values
Column values to aggregate. Can be multiple columns if the columns argument contains multiple columns as well.
- index
One or multiple keys to group by.
- columns
Name of the column(s) whose values will be used as the header of the output DataFrame.
- aggregate_function{‘first’, ‘sum’, ‘max’, ‘min’, ‘mean’, ‘median’, ‘last’, ‘count’}
A predefined aggregate function str or an expression.
- maintain_order
Sort the grouped keys so that the output order is predictable.
- sort_columns
Sort the transposed columns by name. Default is by order of discovery.
- separator
Used as separator/delimiter in generated column names.
- Returns:
- DataFrame
Examples
>>> df = pl.DataFrame( ... { ... "foo": ["one", "one", "one", "two", "two", "two"], ... "bar": ["A", "B", "C", "A", "B", "C"], ... "baz": [1, 2, 3, 4, 5, 6], ... } ... ) >>> df.pivot(values="baz", index="foo", columns="bar") shape: (2, 4) ┌─────┬─────┬─────┬─────┐ │ foo ┆ A ┆ B ┆ C │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╪═════╡ │ one ┆ 1 ┆ 2 ┆ 3 │ │ two ┆ 4 ┆ 5 ┆ 6 │ └─────┴─────┴─────┴─────┘
- product() Self [source]
Aggregate the columns of this DataFrame to their product values.
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 2, 3], ... "b": [0.5, 4, 10], ... "c": [True, True, False], ... } ... )
>>> df.product() shape: (1, 3) ┌─────┬──────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ i64 │ ╞═════╪══════╪═════╡ │ 6 ┆ 20.0 ┆ 0 │ └─────┴──────┴─────┘
- quantile(quantile: float, interpolation: RollingInterpolationMethod = 'nearest') Self [source]
Aggregate the columns of this DataFrame to their quantile value.
- Parameters:
- quantile
Quantile between 0.0 and 1.0.
- interpolation{‘nearest’, ‘higher’, ‘lower’, ‘midpoint’, ‘linear’}
Interpolation method.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.quantile(0.5, "nearest") shape: (1, 3) ┌─────┬─────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ str │ ╞═════╪═════╪══════╡ │ 2.0 ┆ 7.0 ┆ null │ └─────┴─────┴──────┘
- rechunk() Self [source]
Rechunk the data in this DataFrame to a contiguous allocation.
This will make sure all subsequent operations have optimal and predictable performance.
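A short illustrative sketch (not part of the original entry); it assumes that vstack leaves the appended data in separate chunks, which rechunk then consolidates into one contiguous allocation:
>>> df1 = pl.DataFrame({"a": [1, 2], "b": [3, 4]})
>>> df2 = pl.DataFrame({"a": [5, 6], "b": [7, 8]})
>>> df = df1.vstack(df2)  # appends the new chunks without copying
>>> df.n_chunks()
2
>>> df.rechunk().n_chunks()
1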
- rename(mapping: dict[str, str]) Self [source]
Rename column names.
- Parameters:
- mapping
Key value pairs that map from old name to new name.
Examples
>>> df = pl.DataFrame( ... {"foo": [1, 2, 3], "bar": [6, 7, 8], "ham": ["a", "b", "c"]} ... ) >>> df.rename({"foo": "apple"}) shape: (3, 3) ┌───────┬─────┬─────┐ │ apple ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═══════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ │ 3 ┆ 8 ┆ c │ └───────┴─────┴─────┘
- replace(column: str, new_col: Series) Self [source]
Replace a column by a new Series.
- Parameters:
- column
Column to replace.
- new_col
New column to insert.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) >>> s = pl.Series([10, 20, 30]) >>> df.replace("foo", s) # works in-place! shape: (3, 2) ┌─────┬─────┐ │ foo ┆ bar │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 10 ┆ 4 │ │ 20 ┆ 5 │ │ 30 ┆ 6 │ └─────┴─────┘
- replace_at_idx(index: int, series: Series) Self [source]
Replace a column at an index location.
- Parameters:
- index
Column index.
- series
Series that will replace the column.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> s = pl.Series("apple", [10, 20, 30]) >>> df.replace_at_idx(0, s) shape: (3, 3) ┌───────┬─────┬─────┐ │ apple ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═══════╪═════╪═════╡ │ 10 ┆ 6 ┆ a │ │ 20 ┆ 7 ┆ b │ │ 30 ┆ 8 ┆ c │ └───────┴─────┴─────┘
- reverse() Self [source]
Reverse the DataFrame.
Examples
>>> df = pl.DataFrame( ... { ... "key": ["a", "b", "c"], ... "val": [1, 2, 3], ... } ... ) >>> df.reverse() shape: (3, 2) ┌─────┬─────┐ │ key ┆ val │ │ --- ┆ --- │ │ str ┆ i64 │ ╞═════╪═════╡ │ c ┆ 3 │ │ b ┆ 2 │ │ a ┆ 1 │ └─────┴─────┘
- row(index: int | None = None, *, by_predicate: Expr | None = None, named: Literal[False] = False) tuple[Any, ...] [source]
- row(index: int | None = None, *, by_predicate: Expr | None = None, named: Literal[True]) dict[str, Any]
Get the values of a single row, either by index or by predicate.
- Parameters:
- index
Row index.
- by_predicate
Select the row according to a given expression/predicate.
- named
Return a dictionary instead of a tuple. The dictionary is a mapping of column name to row value. This is more expensive than returning a regular tuple, but allows for accessing values by column name.
- Returns:
- Tuple (default) or dictionary of row values.
Warning
You should NEVER use this method to iterate over a DataFrame; if you absolutely require row-iteration you should strongly prefer
iter_rows()
instead.See also
Notes
The
index
andby_predicate
params are mutually exclusive. Additionally, to ensure clarity, the by_predicate parameter must be supplied by keyword.When using
by_predicate
it is an error condition if anything other than one row is returned; more than one row raisesTooManyRowsReturned
, and zero rows will raiseNoRowsReturned
(both inherit fromRowsException
).Examples
Specify an index to return the row at the given index as a tuple.
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.row(2) (3, 8, 'c')
Specify
named=True
to get a dictionary instead with a mapping of column names to row values.>>> df.row(2, named=True) {'foo': 3, 'bar': 8, 'ham': 'c'}
Use
by_predicate
to return the row that matches the given predicate.>>> df.row(by_predicate=(pl.col("ham") == "b")) (2, 7, 'b')
- rows(named: Literal[False] = False) list[tuple[Any, ...]] [source]
- rows(named: Literal[True]) list[dict[str, Any]]
Returns all data in the DataFrame as a list of rows of python-native values.
- Parameters:
- named
Return dictionaries instead of tuples. The dictionaries are a mapping of column name to row value. This is more expensive than returning a regular tuple, but allows for accessing values by column name.
- Returns:
- A list of tuples (default) or dictionaries of row values.
Warning
Row-iteration is not optimal as the underlying data is stored in columnar form; where possible, prefer export via one of the dedicated export/output methods.
See also
iter_rows
Row iterator over frame data (does not materialise all rows).
Notes
If you have
ns
-precision temporal values you should be aware that python natively only supports up tous
-precision; if this matters you should export to a different format.Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 3, 5], ... "b": [2, 4, 6], ... } ... ) >>> df.rows() [(1, 2), (3, 4), (5, 6)] >>> df.rows(named=True) [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}, {'a': 5, 'b': 6}]
- sample(n: int | None = None, *, frac: float | None = None, with_replacement: bool = False, shuffle: bool = False, seed: int | None = None) Self [source]
Sample from this DataFrame.
- Parameters:
- n
Number of items to return. Cannot be used with frac. Defaults to 1 if frac is None.
- frac
Fraction of items to return. Cannot be used with n.
- with_replacement
Allow values to be sampled more than once.
- shuffle
If set to True, the order of the sampled rows will be shuffled. If set to False (default), the order of the returned rows will be neither stable nor fully random.
- seed
Seed for the random number generator. If set to None (default), a random seed is generated using the
random
module.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.sample(n=2, seed=0) shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 3 ┆ 8 ┆ c │ │ 2 ┆ 7 ┆ b │ └─────┴─────┴─────┘
- property schema: SchemaDict[source]
Get a dict[column name, DataType].
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.0, 7.0, 8.0], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.schema {'foo': Int64, 'bar': Float64, 'ham': Utf8}
- select(exprs: IntoExpr | Iterable[IntoExpr] | None = None, *more_exprs: IntoExpr, **named_exprs: IntoExpr) Self [source]
Select columns from this DataFrame.
- Parameters:
- exprs
Column(s) to select. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
- *more_exprs
Additional columns to select, specified as positional arguments.
- **named_exprs
Additional columns to select, specified as keyword arguments. The columns will be renamed to the keyword used.
Examples
Pass the name of a column to select that column.
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.select("foo") shape: (3, 1) ┌─────┐ │ foo │ │ --- │ │ i64 │ ╞═════╡ │ 1 │ │ 2 │ │ 3 │ └─────┘
Multiple columns can be selected by passing a list of column names.
>>> df.select(["foo", "bar"]) shape: (3, 2) ┌─────┬─────┐ │ foo ┆ bar │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 6 │ │ 2 ┆ 7 │ │ 3 ┆ 8 │ └─────┴─────┘
Multiple columns can also be selected using positional arguments instead of a list. Expressions are also accepted.
>>> df.select(pl.col("foo"), pl.col("bar") + 1) shape: (3, 2) ┌─────┬─────┐ │ foo ┆ bar │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 7 │ │ 2 ┆ 8 │ │ 3 ┆ 9 │ └─────┴─────┘
Use keyword arguments to easily name your expression inputs.
>>> df.select(threshold=pl.when(pl.col("foo") > 2).then(10).otherwise(0)) shape: (3, 1) ┌───────────┐ │ threshold │ │ --- │ │ i32 │ ╞═══════════╡ │ 0 │ │ 0 │ │ 10 │ └───────────┘
Expressions with multiple outputs can be automatically instantiated as Structs by enabling the experimental setting
Config.set_auto_structify(True)
:>>> with pl.Config() as cfg: ... cfg.set_auto_structify(True) ... df.select( ... is_odd=(pl.col(pl.INTEGER_DTYPES) % 2).suffix("_is_odd"), ... ) ... shape: (3, 1) ┌───────────┐ │ is_odd │ │ --- │ │ struct[2] │ ╞═══════════╡ │ {1,0} │ │ {0,1} │ │ {1,0} │ └───────────┘
- property shape: tuple[int, int][source]
Get the shape of the DataFrame.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]}) >>> df.shape (5, 1)
- shift(periods: int) Self [source]
Shift values by the given period.
- Parameters:
- periods
Number of places to shift (may be negative).
See also
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.shift(periods=1) shape: (3, 3) ┌──────┬──────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞══════╪══════╪══════╡ │ null ┆ null ┆ null │ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ └──────┴──────┴──────┘ >>> df.shift(periods=-1) shape: (3, 3) ┌──────┬──────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞══════╪══════╪══════╡ │ 2 ┆ 7 ┆ b │ │ 3 ┆ 8 ┆ c │ │ null ┆ null ┆ null │ └──────┴──────┴──────┘
- shift_and_fill(periods: int, fill_value: int | str | float) Self [source]
Shift the values by a given period and fill the resulting null values.
- Parameters:
- periods
Number of places to shift (may be negative).
- fill_value
Fill the resulting null values with this value.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.shift_and_fill(periods=1, fill_value=0) shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 0 ┆ 0 ┆ 0 │ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ └─────┴─────┴─────┘
- shrink_to_fit(*, in_place: bool = False) Self [source]
Shrink DataFrame memory usage.
Shrinks to fit the exact capacity needed to hold the data.
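No example accompanies this entry; a minimal sketch of typical usage, with a hypothetical frame whose buffers were allocated for far more rows than are kept:
>>> df = pl.DataFrame({"a": list(range(100_000))}).head(5)
>>> df = df.shrink_to_fit()  # reallocate buffers to fit only the 5 retained rows
>>> df.shape
(5, 1)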
- slice(offset: int, length: int | None = None) Self [source]
Get a slice of this DataFrame.
- Parameters:
- offset
Start index. Negative indexing is supported.
- length
Length of the slice. If set to
None
, all rows starting at the offset will be selected.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6.0, 7.0, 8.0], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.slice(1, 2) shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞═════╪═════╪═════╡ │ 2 ┆ 7.0 ┆ b │ │ 3 ┆ 8.0 ┆ c │ └─────┴─────┴─────┘
- sort(by: IntoExpr | Iterable[IntoExpr], *more_by: IntoExpr, descending: bool | Sequence[bool] = False, nulls_last: bool = False) Self [source]
Sort the dataframe by the given columns.
- Parameters:
- by
Column(s) to sort by. Accepts expression input. Strings are parsed as column names.
- *more_by
Additional columns to sort by, specified as positional arguments.
- descending
Sort in descending order. When sorting by multiple columns, can be specified per column by passing a sequence of booleans.
- nulls_last
Place null values last.
Examples
Pass a single column name to sort by that column.
>>> df = pl.DataFrame( ... { ... "a": [1, 2, None], ... "b": [6.0, 5.0, 4.0], ... "c": ["a", "c", "b"], ... } ... ) >>> df.sort("a") shape: (3, 3) ┌──────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞══════╪═════╪═════╡ │ null ┆ 4.0 ┆ b │ │ 1 ┆ 6.0 ┆ a │ │ 2 ┆ 5.0 ┆ c │ └──────┴─────┴─────┘
Sorting by expressions is also supported.
>>> df.sort(pl.col("a") + pl.col("b") * 2, nulls_last=True) shape: (3, 3) ┌──────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞══════╪═════╪═════╡ │ 2 ┆ 5.0 ┆ c │ │ 1 ┆ 6.0 ┆ a │ │ null ┆ 4.0 ┆ b │ └──────┴─────┴─────┘
Sort by multiple columns by passing a list of columns.
>>> df.sort(["c", "a"], descending=True) shape: (3, 3) ┌──────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞══════╪═════╪═════╡ │ 2 ┆ 5.0 ┆ c │ │ null ┆ 4.0 ┆ b │ │ 1 ┆ 6.0 ┆ a │ └──────┴─────┴─────┘
Or use positional arguments to sort by multiple columns in the same way.
>>> df.sort("c", "a", descending=[False, True]) shape: (3, 3) ┌──────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ str │ ╞══════╪═════╪═════╡ │ 1 ┆ 6.0 ┆ a │ │ null ┆ 4.0 ┆ b │ │ 2 ┆ 5.0 ┆ c │ └──────┴─────┴─────┘
- std(ddof: int = 1) Self [source]
Aggregate the columns of this DataFrame to their standard deviation value.
- Parameters:
- ddof
Degrees of freedom
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.std() shape: (1, 3) ┌─────┬─────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ str │ ╞═════╪═════╪══════╡ │ 1.0 ┆ 1.0 ┆ null │ └─────┴─────┴──────┘ >>> df.std(ddof=0) shape: (1, 3) ┌──────────┬──────────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ str │ ╞══════════╪══════════╪══════╡ │ 0.816497 ┆ 0.816497 ┆ null │ └──────────┴──────────┴──────┘
- sum(*, axis: Literal[0] = 0, null_strategy: NullStrategy = 'ignore') Self [source]
- sum(*, axis: Literal[1], null_strategy: NullStrategy = 'ignore') Series
- sum(*, axis: int = 0, null_strategy: NullStrategy = 'ignore') Self | Series
Aggregate the columns of this DataFrame to their sum value.
- Parameters:
- axis
Either 0 or 1.
- null_strategy{‘ignore’, ‘propagate’}
This argument is only used if axis == 1.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.sum() shape: (1, 3) ┌─────┬─────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪══════╡ │ 6 ┆ 21 ┆ null │ └─────┴─────┴──────┘ >>> df.sum(axis=1) shape: (3,) Series: 'foo' [str] [ "16a" "27b" "38c" ]
- tail(n: int = 5) Self [source]
Get the last n rows.
- Parameters:
- n
Number of rows to return. If a negative value is passed, return all rows except the first
abs(n)
.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3, 4, 5], ... "bar": [6, 7, 8, 9, 10], ... "ham": ["a", "b", "c", "d", "e"], ... } ... ) >>> df.tail(3) shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 3 ┆ 8 ┆ c │ │ 4 ┆ 9 ┆ d │ │ 5 ┆ 10 ┆ e │ └─────┴─────┴─────┘
Pass a negative value to get all rows except the first
abs(n)
.>>> df.tail(-3) shape: (2, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 4 ┆ 9 ┆ d │ │ 5 ┆ 10 ┆ e │ └─────┴─────┴─────┘
- take_every(n: int) Self [source]
Take every nth row in the DataFrame and return as a new DataFrame.
Examples
>>> s = pl.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]}) >>> s.take_every(2) shape: (2, 2) ┌─────┬─────┐ │ a ┆ b │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 5 │ │ 3 ┆ 7 │ └─────┴─────┘
- to_arrow() Table [source]
Collect the underlying arrow arrays in an Arrow Table.
This operation is mostly zero copy.
- Data types that do copy:
CategoricalType
Examples
>>> df = pl.DataFrame( ... {"foo": [1, 2, 3, 4, 5, 6], "bar": ["a", "b", "c", "d", "e", "f"]} ... ) >>> df.to_arrow() pyarrow.Table foo: int64 bar: large_string ---- foo: [[1,2,3,4,5,6]] bar: [["a","b","c","d","e","f"]]
- to_dict(as_series: Literal[True] = True) dict[str, Series] [source]
- to_dict(as_series: Literal[False]) dict[str, list[Any]]
- to_dict(as_series: bool = True) dict[str, Series] | dict[str, list[Any]]
Convert DataFrame to a dictionary mapping column name to values.
- Parameters:
- as_series
True -> Values are Series; False -> Values are List[Any].
Examples
>>> df = pl.DataFrame( ... { ... "A": [1, 2, 3, 4, 5], ... "fruits": ["banana", "banana", "apple", "apple", "banana"], ... "B": [5, 4, 3, 2, 1], ... "cars": ["beetle", "audi", "beetle", "beetle", "beetle"], ... "optional": [28, 300, None, 2, -30], ... } ... ) >>> df shape: (5, 5) ┌─────┬────────┬─────┬────────┬──────────┐ │ A ┆ fruits ┆ B ┆ cars ┆ optional │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ i64 ┆ str ┆ i64 │ ╞═════╪════════╪═════╪════════╪══════════╡ │ 1 ┆ banana ┆ 5 ┆ beetle ┆ 28 │ │ 2 ┆ banana ┆ 4 ┆ audi ┆ 300 │ │ 3 ┆ apple ┆ 3 ┆ beetle ┆ null │ │ 4 ┆ apple ┆ 2 ┆ beetle ┆ 2 │ │ 5 ┆ banana ┆ 1 ┆ beetle ┆ -30 │ └─────┴────────┴─────┴────────┴──────────┘ >>> df.to_dict(as_series=False) {'A': [1, 2, 3, 4, 5], 'fruits': ['banana', 'banana', 'apple', 'apple', 'banana'], 'B': [5, 4, 3, 2, 1], 'cars': ['beetle', 'audi', 'beetle', 'beetle', 'beetle'], 'optional': [28, 300, None, 2, -30]} >>> df.to_dict(as_series=True) {'A': shape: (5,) Series: 'A' [i64] [ 1 2 3 4 5 ], 'fruits': shape: (5,) Series: 'fruits' [str] [ "banana" "banana" "apple" "apple" "banana" ], 'B': shape: (5,) Series: 'B' [i64] [ 5 4 3 2 1 ], 'cars': shape: (5,) Series: 'cars' [str] [ "beetle" "audi" "beetle" "beetle" "beetle" ], 'optional': shape: (5,) Series: 'optional' [i64] [ 28 300 null 2 -30 ]}
- to_dicts() list[dict[str, Any]] [source]
Convert every row to a dictionary of python-native values.
Notes
If you have
ns
-precision temporal values you should be aware that python natively only supports up tous
-precision; if this matters you should export to a different format.Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]}) >>> df.to_dicts() [{'foo': 1, 'bar': 4}, {'foo': 2, 'bar': 5}, {'foo': 3, 'bar': 6}]
- to_dummies(columns: str | Sequence[str] | None = None, *, separator: str = '_') Self [source]
Convert categorical variables into dummy/indicator variables.
- Parameters:
- columns
Name of the column(s) that should be converted to dummy variables. If set to
None
(default), convert all columns.- separator
Separator/delimiter used when generating column names.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2], ... "bar": [3, 4], ... "ham": ["a", "b"], ... } ... ) >>> df.to_dummies() shape: (2, 6) ┌───────┬───────┬───────┬───────┬───────┬───────┐ │ foo_1 ┆ foo_2 ┆ bar_3 ┆ bar_4 ┆ ham_a ┆ ham_b │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ u8 ┆ u8 ┆ u8 ┆ u8 ┆ u8 ┆ u8 │ ╞═══════╪═══════╪═══════╪═══════╪═══════╪═══════╡ │ 1 ┆ 0 ┆ 1 ┆ 0 ┆ 1 ┆ 0 │ │ 0 ┆ 1 ┆ 0 ┆ 1 ┆ 0 ┆ 1 │ └───────┴───────┴───────┴───────┴───────┴───────┘
- to_numpy() ndarray[Any, Any] [source]
Convert DataFrame to a 2D NumPy array.
This operation clones data.
Notes
If you’re attempting to convert Utf8 to an array you’ll need to install
pyarrow
.Examples
>>> df = pl.DataFrame( ... {"foo": [1, 2, 3], "bar": [6, 7, 8], "ham": ["a", "b", "c"]} ... ) >>> numpy_array = df.to_numpy() >>> type(numpy_array) <class 'numpy.ndarray'>
- to_pandas(*args: Any, use_pyarrow_extension_array: bool = False, **kwargs: Any) DataFrame [source]
Cast to a pandas DataFrame.
This requires that
pandas
andpyarrow
are installed. This operation clones data, unless use_pyarrow_extension_array=True.- Parameters:
- use_pyarrow_extension_array
Use PyArrow backed-extension arrays instead of numpy arrays for each column of the pandas DataFrame; this allows zero copy operations and preservation of null values. Subsequent operations on the resulting pandas DataFrame may trigger conversion to NumPy arrays if that operation is not supported by pyarrow compute functions.
- kwargs
Arguments will be sent to
pyarrow.Table.to_pandas()
.
- Returns:
Examples
>>> import pandas >>> df1 = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> pandas_df1 = df1.to_pandas() >>> type(pandas_df1) <class 'pandas.core.frame.DataFrame'> >>> pandas_df1.dtypes foo int64 bar int64 ham object dtype: object >>> df2 = pl.DataFrame( ... { ... "foo": [1, 2, None], ... "bar": [6, None, 8], ... "ham": [None, "b", "c"], ... } ... ) >>> pandas_df2 = df2.to_pandas() >>> pandas_df2 foo bar ham 0 1.0 6.0 None 1 2.0 NaN b 2 NaN 8.0 c >>> pandas_df2.dtypes foo float64 bar float64 ham object dtype: object >>> pandas_df2_pa = df2.to_pandas( ... use_pyarrow_extension_array=True ... ) >>> pandas_df2_pa foo bar ham 0 1 6 <NA> 1 2 <NA> b 2 <NA> 8 c >>> pandas_df2_pa.dtypes foo int64[pyarrow] bar int64[pyarrow] ham large_string[pyarrow] dtype: object
- to_series(index: int = 0) Series [source]
Select column as Series at index location.
- Parameters:
- index
Location of selection.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.to_series(1) shape: (3,) Series: 'bar' [i64] [ 6 7 8 ]
- to_struct(name: str) Series [source]
Convert a
DataFrame
to aSeries
of typeStruct
.- Parameters:
- name
Name for the struct Series
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 4, 5], ... "b": ["one", "two", "three", "four", "five"], ... } ... ) >>> df.to_struct("nums") shape: (5,) Series: 'nums' [struct[2]] [ {1,"one"} {2,"two"} {3,"three"} {4,"four"} {5,"five"} ]
- top_k(k: int, *, by: IntoExpr | Iterable[IntoExpr], descending: bool | Sequence[bool] = False, nulls_last: bool = False) Self [source]
Return the k largest elements.
If `descending=True`, the smallest elements will be returned.
- Parameters:
- k
Number of rows to return.
- by
Column(s) included in sort order. Accepts expression input. Strings are parsed as column names.
- descending
If True, return the ‘k’ smallest elements instead. Top-k by multiple columns can be specified per column by passing a sequence of booleans.
- nulls_last
Place null values last.
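The original entry provides no example; here is a brief sketch with a hypothetical frame (the printed output assumes the result is returned sorted on the by column, largest first):
>>> df = pl.DataFrame({"a": [2, 5, 1, 4, 3], "b": ["v", "w", "x", "y", "z"]})
>>> df.top_k(3, by="a")
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 5   ┆ w   │
│ 4   ┆ y   │
│ 3   ┆ z   │
└─────┴─────┘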
- transpose(*, include_header: bool = False, header_name: str = 'column', column_names: Iterator[str] | Sequence[str] | None = None) Self [source]
Transpose a DataFrame over the diagonal.
- Parameters:
- include_header
If set, the column names will be added as the first column.
- header_name
If include_header is set, this determines the name of the column that will be inserted.
- column_names
Optional generator/iterator that yields column names. Will be used to replace the columns in the DataFrame.
- Returns:
- DataFrame
Notes
This is a very expensive operation. Perhaps you can do it differently.
Examples
>>> df = pl.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3]}) >>> df.transpose(include_header=True) shape: (2, 4) ┌────────┬──────────┬──────────┬──────────┐ │ column ┆ column_0 ┆ column_1 ┆ column_2 │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ i64 │ ╞════════╪══════════╪══════════╪══════════╡ │ a ┆ 1 ┆ 2 ┆ 3 │ │ b ┆ 1 ┆ 2 ┆ 3 │ └────────┴──────────┴──────────┴──────────┘
Replace the auto-generated column names with a list
>>> df.transpose(include_header=False, column_names=["a", "b", "c"]) shape: (2, 3) ┌─────┬─────┬─────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╡ │ 1 ┆ 2 ┆ 3 │ │ 1 ┆ 2 ┆ 3 │ └─────┴─────┴─────┘
Include the header as a separate column
>>> df.transpose( ... include_header=True, header_name="foo", column_names=["a", "b", "c"] ... ) shape: (2, 4) ┌─────┬─────┬─────┬─────┐ │ foo ┆ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ i64 │ ╞═════╪═════╪═════╪═════╡ │ a ┆ 1 ┆ 2 ┆ 3 │ │ b ┆ 1 ┆ 2 ┆ 3 │ └─────┴─────┴─────┴─────┘
Replace the auto-generated column with column names from a generator function
>>> def name_generator(): ... base_name = "my_column_" ... count = 0 ... while True: ... yield f"{base_name}{count}" ... count += 1 ... >>> df.transpose(include_header=False, column_names=name_generator()) shape: (2, 3) ┌─────────────┬─────────────┬─────────────┐ │ my_column_0 ┆ my_column_1 ┆ my_column_2 │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞═════════════╪═════════════╪═════════════╡ │ 1 ┆ 2 ┆ 3 │ │ 1 ┆ 2 ┆ 3 │ └─────────────┴─────────────┴─────────────┘
- unique(*, maintain_order: bool = True, subset: str | Sequence[str] | None = None, keep: UniqueKeepStrategy = 'any') Self [source]
Drop duplicate rows from this dataframe.
- Parameters:
- maintain_order
Keep the same order as the original DataFrame. This is more expensive to compute. Setting this to
True
blocks the possibility to run on the streaming engine.- subset
Column name(s) to consider when identifying duplicates. If set to
None
(default), use all columns.- keep{‘first’, ‘last’, ‘any’, ‘none’}
Which of the duplicate rows to keep.
- ‘any’: Does not give any guarantee of which row is kept. This allows more optimizations.
- ‘none’: Don’t keep duplicate rows.
- ‘first’: Keep first unique row.
- ‘last’: Keep last unique row.
- Returns:
- DataFrame with unique rows.
Warning
This method will fail if there is a column of type List in the DataFrame or subset.
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3, 1], ... "bar": ["a", "a", "a", "a"], ... "ham": ["b", "b", "b", "b"], ... } ... ) >>> df.unique() shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ a ┆ b │ │ 2 ┆ a ┆ b │ │ 3 ┆ a ┆ b │ └─────┴─────┴─────┘ >>> df.unique(subset=["bar", "ham"]) shape: (1, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ a ┆ b │ └─────┴─────┴─────┘ >>> df.unique(keep="last") shape: (3, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ str │ ╞═════╪═════╪═════╡ │ 2 ┆ a ┆ b │ │ 3 ┆ a ┆ b │ │ 1 ┆ a ┆ b │ └─────┴─────┴─────┘
- unnest(columns: str | Sequence[str], *more_columns: str) Self [source]
Decompose struct columns into separate columns for each of their fields.
The new columns will be inserted into the dataframe at the location of the struct column.
- Parameters:
- columns
Name of the struct column(s) that should be unnested.
- *more_columns
Additional columns to unnest, specified as positional arguments.
Examples
>>> df = pl.DataFrame( ... { ... "before": ["foo", "bar"], ... "t_a": [1, 2], ... "t_b": ["a", "b"], ... "t_c": [True, None], ... "t_d": [[1, 2], [3]], ... "after": ["baz", "womp"], ... } ... ).select("before", pl.struct(pl.col("^t_.$")).alias("t_struct"), "after") >>> df shape: (2, 3) ┌────────┬─────────────────────┬───────┐ │ before ┆ t_struct ┆ after │ │ --- ┆ --- ┆ --- │ │ str ┆ struct[4] ┆ str │ ╞════════╪═════════════════════╪═══════╡ │ foo ┆ {1,"a",true,[1, 2]} ┆ baz │ │ bar ┆ {2,"b",null,[3]} ┆ womp │ └────────┴─────────────────────┴───────┘ >>> df.unnest("t_struct") shape: (2, 6) ┌────────┬─────┬─────┬──────┬───────────┬───────┐ │ before ┆ t_a ┆ t_b ┆ t_c ┆ t_d ┆ after │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ str ┆ bool ┆ list[i64] ┆ str │ ╞════════╪═════╪═════╪══════╪═══════════╪═══════╡ │ foo ┆ 1 ┆ a ┆ true ┆ [1, 2] ┆ baz │ │ bar ┆ 2 ┆ b ┆ null ┆ [3] ┆ womp │ └────────┴─────┴─────┴──────┴───────────┴───────┘
- unstack(step: int, how: UnstackDirection = 'vertical', columns: str | Sequence[str] | None = None, fill_values: list[Any] | None = None) Self [source]
Unstack a long table to a wide form without doing an aggregation.
This can be much faster than a pivot, because it can skip the grouping phase.
- Parameters:
- step
Number of rows in the unstacked frame.
- how{ ‘vertical’, ‘horizontal’ }
Direction of the unstack.
- columns
Name of the column(s) to include in the operation. If set to
None
(default), use all columns.- fill_values
Fill values that don’t fit the new size with this value.
Warning
This functionality is experimental and may be subject to changes without it being considered a breaking change.
Examples
>>> from string import ascii_uppercase >>> df = pl.DataFrame( ... { ... "col1": list(ascii_uppercase[0:9]), ... "col2": pl.arange(0, 9, eager=True), ... } ... ) >>> df shape: (9, 2) ┌──────┬──────┐ │ col1 ┆ col2 │ │ --- ┆ --- │ │ str ┆ i64 │ ╞══════╪══════╡ │ A ┆ 0 │ │ B ┆ 1 │ │ C ┆ 2 │ │ D ┆ 3 │ │ … ┆ … │ │ F ┆ 5 │ │ G ┆ 6 │ │ H ┆ 7 │ │ I ┆ 8 │ └──────┴──────┘ >>> df.unstack(step=3, how="vertical") shape: (3, 6) ┌────────┬────────┬────────┬────────┬────────┬────────┐ │ col1_0 ┆ col1_1 ┆ col1_2 ┆ col2_0 ┆ col2_1 ┆ col2_2 │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ str ┆ str ┆ i64 ┆ i64 ┆ i64 │ ╞════════╪════════╪════════╪════════╪════════╪════════╡ │ A ┆ D ┆ G ┆ 0 ┆ 3 ┆ 6 │ │ B ┆ E ┆ H ┆ 1 ┆ 4 ┆ 7 │ │ C ┆ F ┆ I ┆ 2 ┆ 5 ┆ 8 │ └────────┴────────┴────────┴────────┴────────┴────────┘ >>> df.unstack(step=3, how="horizontal") shape: (3, 6) ┌────────┬────────┬────────┬────────┬────────┬────────┐ │ col1_0 ┆ col1_1 ┆ col1_2 ┆ col2_0 ┆ col2_1 ┆ col2_2 │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ str ┆ str ┆ i64 ┆ i64 ┆ i64 │ ╞════════╪════════╪════════╪════════╪════════╪════════╡ │ A ┆ B ┆ C ┆ 0 ┆ 1 ┆ 2 │ │ D ┆ E ┆ F ┆ 3 ┆ 4 ┆ 5 │ │ G ┆ H ┆ I ┆ 6 ┆ 7 ┆ 8 │ └────────┴────────┴────────┴────────┴────────┴────────┘
- update(other: DataFrame, on: None | str | Sequence[str] = None, how: str = 'left') Self [source]
Update the values in this DataFrame with the non-null values in other.
- Parameters:
- other
DataFrame that will be used to update the values
- on
Column names that will be joined on. If none are given, the row count is used.
- how{‘left’, ‘inner’}
‘left’ will keep the left table rows as-is. ‘inner’ will remove rows that are not found in other.
Warning
This functionality is experimental and may change without it being considered a breaking change.
Notes
This is syntactic sugar for a left/inner join + coalesce
Examples
>>> df = pl.DataFrame( ... { ... "A": [1, 2, 3, 4], ... "B": [400, 500, 600, 700], ... } ... ) >>> df shape: (4, 2) ┌─────┬─────┐ │ A ┆ B │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 400 │ │ 2 ┆ 500 │ │ 3 ┆ 600 │ │ 4 ┆ 700 │ └─────┴─────┘ >>> new_df = pl.DataFrame( ... { ... "B": [4, None, 6], ... "C": [7, 8, 9], ... } ... ) >>> new_df shape: (3, 2) ┌──────┬─────┐ │ B ┆ C │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞══════╪═════╡ │ 4 ┆ 7 │ │ null ┆ 8 │ │ 6 ┆ 9 │ └──────┴─────┘ >>> df.update(new_df) shape: (4, 2) ┌─────┬─────┐ │ A ┆ B │ │ --- ┆ --- │ │ i64 ┆ i64 │ ╞═════╪═════╡ │ 1 ┆ 4 │ │ 2 ┆ 500 │ │ 3 ┆ 6 │ │ 4 ┆ 700 │ └─────┴─────┘
- upsample(time_column: str, *, every: str | timedelta, offset: str | timedelta | None = None, by: str | Sequence[str] | None = None, maintain_order: bool = False) Self [source]
Upsample a DataFrame at a regular frequency.
- Parameters:
- time_column
Time column that will be used to determine a date_range. Note that this column has to be sorted for the output to make sense.
- every
Interval will start ‘every’ duration.
- offset
Change the start of the date_range by this offset.
- by
First group by these columns and then upsample for every group.
- maintain_order
Keep the ordering predictable. This is slower.
The `every` and `offset` arguments are created with the following string language:
1ns (1 nanosecond)
1us (1 microsecond)
1ms (1 millisecond)
1s (1 second)
1m (1 minute)
1h (1 hour)
1d (1 day)
1w (1 week)
1mo (1 calendar month)
1y (1 calendar year)
1i (1 index count)
Or combine them: “3d12h4m25s” # 3 days, 12 hours, 4 minutes, and 25 seconds
Examples
Upsample a DataFrame by a certain interval.
>>> from datetime import datetime >>> df = pl.DataFrame( ... { ... "time": [ ... datetime(2021, 2, 1), ... datetime(2021, 4, 1), ... datetime(2021, 5, 1), ... datetime(2021, 6, 1), ... ], ... "groups": ["A", "B", "A", "B"], ... "values": [0, 1, 2, 3], ... } ... ) >>> df.upsample( ... time_column="time", every="1mo", by="groups", maintain_order=True ... ).select(pl.all().forward_fill()) shape: (7, 3) ┌─────────────────────┬────────┬────────┐ │ time ┆ groups ┆ values │ │ --- ┆ --- ┆ --- │ │ datetime[μs] ┆ str ┆ i64 │ ╞═════════════════════╪════════╪════════╡ │ 2021-02-01 00:00:00 ┆ A ┆ 0 │ │ 2021-03-01 00:00:00 ┆ A ┆ 0 │ │ 2021-04-01 00:00:00 ┆ A ┆ 0 │ │ 2021-05-01 00:00:00 ┆ A ┆ 2 │ │ 2021-04-01 00:00:00 ┆ B ┆ 1 │ │ 2021-05-01 00:00:00 ┆ B ┆ 1 │ │ 2021-06-01 00:00:00 ┆ B ┆ 3 │ └─────────────────────┴────────┴────────┘
- var(ddof: int = 1) Self [source]
Aggregate the columns of this DataFrame to their variance value.
- Parameters:
- ddof
Degrees of freedom
Examples
>>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3], ... "bar": [6, 7, 8], ... "ham": ["a", "b", "c"], ... } ... ) >>> df.var() shape: (1, 3) ┌─────┬─────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ str │ ╞═════╪═════╪══════╡ │ 1.0 ┆ 1.0 ┆ null │ └─────┴─────┴──────┘ >>> df.var(ddof=0) shape: (1, 3) ┌──────────┬──────────┬──────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ str │ ╞══════════╪══════════╪══════╡ │ 0.666667 ┆ 0.666667 ┆ null │ └──────────┴──────────┴──────┘
- vstack(df: DataFrame, *, in_place: bool = False) Self [source]
Grow this DataFrame vertically by stacking a DataFrame to it.
- Parameters:
- df
DataFrame to stack.
- in_place
Modify in place
Examples
>>> df1 = pl.DataFrame( ... { ... "foo": [1, 2], ... "bar": [6, 7], ... "ham": ["a", "b"], ... } ... ) >>> df2 = pl.DataFrame( ... { ... "foo": [3, 4], ... "bar": [8, 9], ... "ham": ["c", "d"], ... } ... ) >>> df1.vstack(df2) shape: (4, 3) ┌─────┬─────┬─────┐ │ foo ┆ bar ┆ ham │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ str │ ╞═════╪═════╪═════╡ │ 1 ┆ 6 ┆ a │ │ 2 ┆ 7 ┆ b │ │ 3 ┆ 8 ┆ c │ │ 4 ┆ 9 ┆ d │ └─────┴─────┴─────┘
- property width: int[source]
Get the width of the DataFrame.
Examples
>>> df = pl.DataFrame({"foo": [1, 2, 3, 4, 5]}) >>> df.width 1
- with_columns(exprs: IntoExpr | Iterable[IntoExpr] = None, *more_exprs: IntoExpr, **named_exprs: IntoExpr) Self [source]
Add columns to this DataFrame.
Added columns will replace existing columns with the same name.
- Parameters:
- exprs
Column or columns to add. Accepts expression input. Strings are parsed as column names, other non-expression inputs are parsed as literals.
- *more_exprs
Additional columns to add, specified as positional arguments.
- **named_exprs
Additional columns to add, specified as keyword arguments. The columns will be renamed to the keyword used.
- Returns:
- A new DataFrame with the columns added.
Notes
Creating a new DataFrame using this method does not create a new copy of existing data.
Examples
Pass an expression to add it as a new column.
>>> df = pl.DataFrame( ... { ... "a": [1, 2, 3, 4], ... "b": [0.5, 4, 10, 13], ... "c": [True, True, False, True], ... } ... ) >>> df.with_columns((pl.col("a") ** 2).alias("a^2")) shape: (4, 4) ┌─────┬──────┬───────┬──────┐ │ a ┆ b ┆ c ┆ a^2 │ │ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ bool ┆ f64 │ ╞═════╪══════╪═══════╪══════╡ │ 1 ┆ 0.5 ┆ true ┆ 1.0 │ │ 2 ┆ 4.0 ┆ true ┆ 4.0 │ │ 3 ┆ 10.0 ┆ false ┆ 9.0 │ │ 4 ┆ 13.0 ┆ true ┆ 16.0 │ └─────┴──────┴───────┴──────┘
Added columns will replace existing columns with the same name.
>>> df.with_columns(pl.col("a").cast(pl.Float64)) shape: (4, 3) ┌─────┬──────┬───────┐ │ a ┆ b ┆ c │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ bool │ ╞═════╪══════╪═══════╡ │ 1.0 ┆ 0.5 ┆ true │ │ 2.0 ┆ 4.0 ┆ true │ │ 3.0 ┆ 10.0 ┆ false │ │ 4.0 ┆ 13.0 ┆ true │ └─────┴──────┴───────┘
Multiple columns can be added by passing a list of expressions.
>>> df.with_columns( ... [ ... (pl.col("a") ** 2).alias("a^2"), ... (pl.col("b") / 2).alias("b/2"), ... (pl.col("c").is_not()).alias("not c"), ... ] ... ) shape: (4, 6) ┌─────┬──────┬───────┬──────┬──────┬───────┐ │ a ┆ b ┆ c ┆ a^2 ┆ b/2 ┆ not c │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ bool ┆ f64 ┆ f64 ┆ bool │ ╞═════╪══════╪═══════╪══════╪══════╪═══════╡ │ 1 ┆ 0.5 ┆ true ┆ 1.0 ┆ 0.25 ┆ false │ │ 2 ┆ 4.0 ┆ true ┆ 4.0 ┆ 2.0 ┆ false │ │ 3 ┆ 10.0 ┆ false ┆ 9.0 ┆ 5.0 ┆ true │ │ 4 ┆ 13.0 ┆ true ┆ 16.0 ┆ 6.5 ┆ false │ └─────┴──────┴───────┴──────┴──────┴───────┘
Multiple columns also can be added using positional arguments instead of a list.
>>> df.with_columns( ... (pl.col("a") ** 2).alias("a^2"), ... (pl.col("b") / 2).alias("b/2"), ... (pl.col("c").is_not()).alias("not c"), ... ) shape: (4, 6) ┌─────┬──────┬───────┬──────┬──────┬───────┐ │ a ┆ b ┆ c ┆ a^2 ┆ b/2 ┆ not c │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ bool ┆ f64 ┆ f64 ┆ bool │ ╞═════╪══════╪═══════╪══════╪══════╪═══════╡ │ 1 ┆ 0.5 ┆ true ┆ 1.0 ┆ 0.25 ┆ false │ │ 2 ┆ 4.0 ┆ true ┆ 4.0 ┆ 2.0 ┆ false │ │ 3 ┆ 10.0 ┆ false ┆ 9.0 ┆ 5.0 ┆ true │ │ 4 ┆ 13.0 ┆ true ┆ 16.0 ┆ 6.5 ┆ false │ └─────┴──────┴───────┴──────┴──────┴───────┘
Use keyword arguments to easily name your expression inputs.
>>> df.with_columns( ... ab=pl.col("a") * pl.col("b"), ... not_c=pl.col("c").is_not(), ... ) shape: (4, 5) ┌─────┬──────┬───────┬──────┬───────┐ │ a ┆ b ┆ c ┆ ab ┆ not_c │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ bool ┆ f64 ┆ bool │ ╞═════╪══════╪═══════╪══════╪═══════╡ │ 1 ┆ 0.5 ┆ true ┆ 0.5 ┆ false │ │ 2 ┆ 4.0 ┆ true ┆ 8.0 ┆ false │ │ 3 ┆ 10.0 ┆ false ┆ 30.0 ┆ true │ │ 4 ┆ 13.0 ┆ true ┆ 52.0 ┆ false │ └─────┴──────┴───────┴──────┴───────┘
Expressions with multiple outputs can be automatically instantiated as Structs by enabling the experimental setting
Config.set_auto_structify(True)
:>>> with pl.Config() as cfg: ... cfg.set_auto_structify(True) ... df.drop("c").with_columns( ... diffs=pl.col(["a", "b"]).diff().suffix("_diff"), ... ) ... shape: (4, 3) ┌─────┬──────┬─────────────┐ │ a ┆ b ┆ diffs │ │ --- ┆ --- ┆ --- │ │ i64 ┆ f64 ┆ struct[2] │ ╞═════╪══════╪═════════════╡ │ 1 ┆ 0.5 ┆ {null,null} │ │ 2 ┆ 4.0 ┆ {1,3.5} │ │ 3 ┆ 10.0 ┆ {1,6.0} │ │ 4 ┆ 13.0 ┆ {1,3.0} │ └─────┴──────┴─────────────┘
- with_row_count(name: str = 'row_nr', offset: int = 0) Self [source]
Add a column at index 0 that counts the rows.
- Parameters:
- name
Name of the column to add.
- offset
Start the row count at this offset. Default = 0
Examples
>>> df = pl.DataFrame( ... { ... "a": [1, 3, 5], ... "b": [2, 4, 6], ... } ... ) >>> df.with_row_count() shape: (3, 3) ┌────────┬─────┬─────┐ │ row_nr ┆ a ┆ b │ │ --- ┆ --- ┆ --- │ │ u32 ┆ i64 ┆ i64 │ ╞════════╪═════╪═════╡ │ 0 ┆ 1 ┆ 2 │ │ 1 ┆ 3 ┆ 4 │ │ 2 ┆ 5 ┆ 6 │ └────────┴─────┴─────┘
- write_avro(file: BinaryIO | BytesIO | str | Path, compression: AvroCompression = 'uncompressed') None [source]
Write to Apache Avro file.
- Parameters:
- file
File path to which the file should be written.
- compression{‘uncompressed’, ‘snappy’, ‘deflate’}
Compression method. Defaults to “uncompressed”.
Examples
>>> import pathlib >>> >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3, 4, 5], ... "bar": [6, 7, 8, 9, 10], ... "ham": ["a", "b", "c", "d", "e"], ... } ... ) >>> path: pathlib.Path = dirpath / "new_file.avro" >>> df.write_avro(path)
- write_csv(file: None = None, has_header: bool = True, separator: str = ',', quote: str = '"', batch_size: int = 1024, datetime_format: str | None = None, date_format: str | None = None, time_format: str | None = None, float_precision: int | None = None, null_value: str | None = None) str [source]
- write_csv(file: BytesIO | str | Path, has_header: bool = True, separator: str = ',', quote: str = '"', batch_size: int = 1024, datetime_format: str | None = None, date_format: str | None = None, time_format: str | None = None, float_precision: int | None = None, null_value: str | None = None) None
Write to comma-separated values (CSV) file.
- Parameters:
- file
File path to which the result should be written. If set to
None
(default), the output is returned as a string instead.- has_header
Whether to include header in the CSV output.
- separator
Separate CSV fields with this symbol.
- quote
Byte to use as quoting character.
- batch_size
Number of rows that will be processed per thread.
- datetime_format
A format string, with the specifiers defined by the chrono Rust crate. If no format specified, the default fractional-second precision is inferred from the maximum timeunit found in the frame’s Datetime cols (if any).
- date_format
A format string, with the specifiers defined by the chrono Rust crate.
- time_format
A format string, with the specifiers defined by the chrono Rust crate.
- float_precision
Number of decimal places to write, applied to both
Float32
andFloat64
datatypes.- null_value
A string representing null values (defaulting to the empty string).
Examples
>>> import pathlib >>> >>> df = pl.DataFrame( ... { ... "foo": [1, 2, 3, 4, 5], ... "bar": [6, 7, 8, 9, 10], ... "ham": ["a", "b", "c", "d", "e"], ... } ... ) >>> path: pathlib.Path = dirpath / "new_file.csv" >>> df.write_csv(path, separator=",")
- write_database(table_name: str, connection_uri: str, *, if_exists: DbWriteMode = 'fail', engine: DbWriteEngine = 'sqlalchemy') None [source]
Write a polars frame to a database.
- Parameters:
- table_name
Name of the table to append to or create in the SQL database.
- connection_uri
Connection uri, for example
“postgresql://username:password@server:port/database”
- if_exists{‘append’, ‘replace’, ‘fail’}
The insert mode. ‘replace’ will create a new database table, overwriting an existing one. ‘append’ will append to an existing table. ‘fail’ will fail if the table already exists.
- engine{‘sqlalchemy’, ‘adbc’}
Select the engine used for writing the data.
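No example is provided in the original entry; the following sketch uses a placeholder connection URI and a hypothetical table name, and assumes a SQLAlchemy-compatible driver for the target database is installed:
>>> uri = "postgresql://username:password@server:port/database"  # placeholder, not a real server
>>> df.write_database(
...     table_name="records",  # hypothetical target table
...     connection_uri=uri,
...     if_exists="replace",
... )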
- write_excel(workbook: Workbook | BytesIO | Path | str | None = None, worksheet: str | None = None, *, position: tuple[int, int] | str = 'A1', table_style: str | dict[str, Any] | None = None, table_name: str | None = None, column_formats: dict[str | tuple[str, ...], str] | None = None, dtype_formats: dict[OneOrMoreDataTypes, str] | None = None, conditional_formats: ConditionalFormatDict | None = None, column_totals: ColumnTotalsDefinition | None = None, column_widths: dict[str | tuple[str, ...], int] | int | None = None, row_totals: RowTotalsDefinition | None = None, row_heights: dict[int | tuple[int, ...], int] | int | None = None, sparklines: dict[str, Sequence[str] | dict[str, Any]] | None = None, float_precision: int = 3, has_header: bool = True, autofilter: bool = True, autofit: bool = False, hidden_columns: Sequence[str] | None = None, hide_gridlines: bool = False, sheet_zoom: int | None = None) Workbook [source]
Write frame data to a table in an Excel workbook/worksheet.
- Parameters:
- workbookWorkbook
String name or path of the workbook to create, BytesIO object to write into, or an open
xlsxwriter.Workbook
object that has not been closed. If None, writes to adataframe.xlsx
workbook in the working directory.- worksheetstr
Name of target worksheet; if None, writes to “Sheet1” when creating a new workbook (note that writing to an existing workbook requires a valid existing, or new, worksheet name).
- position{str, tuple}
Table position in Excel notation (eg: “A1”), or a (row,col) integer tuple.
- table_style{str, dict}
A named Excel table style, such as “Table Style Medium 4”, or a dictionary of
{"key":value,}
options containing one or more of the following keys: “style”, “first_column”, “last_column”, “banded_columns”, “banded_rows”.- table_namestr
Name of the output table object in the worksheet; can then be referred to in the sheet by formulae/charts, or by subsequent
xlsxwriter
operations.- column_formatsdict
A
{colname:str,}
dictionary for applying an Excel format string to the given columns. Formats defined here (such as “dd/mm/yyyy”, “0.00%”, etc) will override any defined indtype_formats
(below).- dtype_formatsdict
A
{dtype:str,}
dictionary that sets the default Excel format for the given dtype. (This can be overridden on a per-column basis by thecolumn_formats
param). It is also valid to use dtype groups such aspl.FLOAT_DTYPES
as the dtype/format key, to simplify setting uniform integer and float formats.- conditional_formatsdict
A
{colname(s):str,}
,{colname(s):dict,}
, or{colname(s):list,}
dictionary defining conditional format options for the specified columns.If supplying a string typename, should be one of the valid
xlsxwriter
types such as “3_color_scale”, “data_bar”, etc.If supplying a dictionary you can make use of any/all
xlsxwriter
supported options, including icon sets, formulae, etc.Supplying multiple columns as a tuple/key will apply a single format across all columns - this is effective in creating a heatmap, as the min/max values will be determined across the entire range, not per-column.
Finally, you can also supply a list made up from the above options in order to apply more than one conditional format to the same range.
- column_totals{bool, list, dict}
Add a column-total row to the exported table.
If True, all numeric columns will have an associated total using “sum”.
If passing a list of colnames, only those given will have a total.
For more control, pass a
{colname:funcname,}
dict. Valid names are: “average”, “count_nums”, “count”, “max”, “min”, “std_dev”, “sum”, “var”.
- column_widths{dict, int}
A
{colname:int,}
dict or single integer that sets (or overrides if autofitting) table column widths in integer pixel units. If given as an integer the same value is used for all table columns.- row_totals{dict, bool}
Add a row-total column to the right-hand side of the exported table.
If True, a column called “total” will be added at the end of the table that applies a “sum” function row-wise across all numeric columns.
If passing a list/sequence of column names, only the matching columns will participate in the sum.
Can also pass a {colname:columns,} dictionary to create one or more total columns with distinct names, referencing different columns.
- row_heights{dict, int}
An int or {row_index:int,} dictionary that sets the height of the given rows (if providing a dictionary) or all rows (if providing an integer) that intersect with the table body (including any header and total row) in integer pixel units. Note that row_index starts at zero and will be the header row (unless has_header is False).
- sparklinesdict
A {colname:list,} or {colname:dict,} dictionary defining one or more sparklines to be written into a new column in the table.
If passing a list of colnames (used as the source of the sparkline data) the default sparkline settings are used (eg: line chart with no markers).
For more control an xlsxwriter-compliant options dict can be supplied, in which case three additional polars-specific keys are available: “columns”, “insert_before”, and “insert_after”. These allow you to define the source columns and position the sparkline(s) with respect to other table columns. If no position directive is given, sparklines are added to the end of the table (eg: to the far right) in the order they are given.
- float_precisionint
Default number of decimals displayed for floating point columns (note that this is purely a formatting directive; the actual values are not rounded).
- has_headerbool
Indicate if the table should be created with a header row.
- autofilterbool
If the table has headers, provide autofilter capability.
- autofitbool
Calculate individual column widths from the data.
- hidden_columnslist
A list of table columns to hide in the worksheet.
- hide_gridlinesbool
Do not display any gridlines on the output worksheet.
- sheet_zoomint
Set the default zoom level of the output worksheet.
Notes
Conditional formatting dictionaries should provide xlsxwriter-compatible definitions; polars will take care of how they are applied on the worksheet with respect to the relative sheet/column position. For supported options, see: https://xlsxwriter.readthedocs.io/working_with_conditional_formats.html
Similarly, sparkline option dictionaries should contain xlsxwriter-compatible key/values, as well as a mandatory polars “columns” key that defines the sparkline source data; these source columns should be adjacent to each other. Two other polars-specific keys are available to help define where the sparkline appears in the table: “insert_after” and “insert_before”. The value associated with these keys should be the name of a column in the exported table. For supported options, see: https://xlsxwriter.readthedocs.io/working_with_sparklines.html
Examples
Instantiate a basic dataframe:
>>> from random import uniform
>>> from datetime import date
>>>
>>> df = pl.DataFrame(
...     {
...         "dtm": [date(2023, 1, 1), date(2023, 1, 2), date(2023, 1, 3)],
...         "num": [uniform(-500, 500), uniform(-500, 500), uniform(-500, 500)],
...         "val": [10_000, 20_000, 30_000],
...     }
... )
Export to “dataframe.xlsx” (the default workbook name, if not specified) in the working directory, add column totals (“sum”) on all numeric columns, autofit:
>>> df.write_excel(column_totals=True, autofit=True)
Write frame to a specific location on the sheet, set a named table style, apply US-style date formatting, increase default float precision, apply non-default total function to a single column, autofit:
>>> df.write_excel(
...     position="B4",
...     table_style="Table Style Light 16",
...     dtype_formats={pl.Date: "mm/dd/yyyy"},
...     column_totals={"num": "average"},
...     float_precision=6,
...     autofit=True,
... )
Write the same frame to a named worksheet twice, applying different styles and conditional formatting to each table, adding table titles using explicit xlsxwriter integration:
>>> from xlsxwriter import Workbook
>>> with Workbook("multi_frame.xlsx") as wb:
...     # basic/default conditional formatting
...     df.write_excel(
...         workbook=wb,
...         worksheet="data",
...         position=(3, 1),  # specify position as (row,col) coordinates
...         conditional_formats={"num": "3_color_scale", "val": "data_bar"},
...         table_style="Table Style Medium 4",
...     )
...
...     # advanced conditional formatting, custom styles
...     df.write_excel(
...         workbook=wb,
...         worksheet="data",
...         position=(len(df) + 7, 1),
...         table_style={
...             "style": "Table Style Light 4",
...             "first_column": True,
...         },
...         conditional_formats={
...             "num": {
...                 "type": "3_color_scale",
...                 "min_color": "#76933c",
...                 "mid_color": "#c4d79b",
...                 "max_color": "#ebf1de",
...             },
...             "val": {
...                 "type": "data_bar",
...                 "data_bar_2010": True,
...                 "bar_color": "#9bbb59",
...                 "bar_negative_color_same": True,
...                 "bar_negative_border_color_same": True,
...             },
...         },
...         column_formats={"num": "#,##0.000;[White]-#,##0.000"},
...         column_widths={"val": 125},
...         autofit=True,
...     )
...
...     # add some table titles (with a custom format)
...     ws = wb.get_worksheet_by_name("data")
...     fmt_title = wb.add_format(
...         {
...             "font_color": "#4f6228",
...             "font_size": 12,
...             "italic": True,
...             "bold": True,
...         }
...     )
...     ws.write(2, 1, "Basic/default conditional formatting", fmt_title)
...     ws.write(len(df) + 6, 1, "Customised conditional formatting", fmt_title)
Export a table containing two different types of sparklines. Use default options for the “trend” sparkline and customised options (and positioning) for the “+/-” win_loss sparkline, with non-default integer dtype formatting, column totals, a subtle two-tone heatmap and hidden worksheet gridlines:
>>> df = pl.DataFrame(
...     {
...         "id": ["aaa", "bbb", "ccc", "ddd", "eee"],
...         "q1": [100, 55, -20, 0, 35],
...         "q2": [30, -10, 15, 60, 20],
...         "q3": [-50, 0, 40, 80, 80],
...         "q4": [75, 55, 25, -10, -55],
...     }
... )
>>> df.write_excel(
...     table_style="Table Style Light 2",
...     # apply accounting format to all flavours of integer
...     dtype_formats={pl.INTEGER_DTYPES: "#,##0_);(#,##0)"},
...     sparklines={
...         # default options; just provide source cols
...         "trend": ["q1", "q2", "q3", "q4"],
...         # customised sparkline type, with positioning directive
...         "+/-": {
...             "columns": ["q1", "q2", "q3", "q4"],
...             "insert_after": "id",
...             "type": "win_loss",
...         },
...     },
...     conditional_formats={
...         # create a unified multi-column heatmap
...         ("q1", "q2", "q3", "q4"): {
...             "type": "2_color_scale",
...             "min_color": "#95b3d7",
...             "max_color": "#ffffff",
...         },
...     },
...     column_totals=["q1", "q2", "q3", "q4"],
...     row_totals=True,
...     hide_gridlines=True,
... )
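The remaining layout-oriented parameters (table_name, dict-valued column_widths, row_heights, hidden_columns, sheet_zoom) compose in exactly the same way. The following is a minimal sketch reusing the quarterly frame above; the workbook name "layout_options.xlsx", the table name, and the pixel/zoom values are illustrative only:
>>> df.write_excel(
...     workbook="layout_options.xlsx",  # illustrative output filename
...     table_name="QuarterlyData",
...     table_style="Table Style Light 9",
...     column_widths={("q1", "q2", "q3", "q4"): 60, "id": 100},
...     row_heights={0: 25},  # make the header row (row_index 0) taller
...     hidden_columns=["q4"],
...     sheet_zoom=125,
... )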
- write_ipc(file: BinaryIO | BytesIO | str | Path, compression: IpcCompression = 'uncompressed') None [source]
Write to Arrow IPC binary stream or Feather file.
- Parameters:
- file
File path to which the file should be written.
- compression{‘uncompressed’, ‘lz4’, ‘zstd’}
Compression method. Defaults to “uncompressed”.
Examples
>>> import pathlib
>>>
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3, 4, 5],
...         "bar": [6, 7, 8, 9, 10],
...         "ham": ["a", "b", "c", "d", "e"],
...     }
... )
>>> dirpath = pathlib.Path(".")  # any writable directory
>>> path: pathlib.Path = dirpath / "new_file.arrow"
>>> df.write_ipc(path)
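Because file also accepts a binary stream, the same frame can be written to an in-memory buffer and read back with pl.read_ipc. This is a minimal sketch; the choice of "zstd" compression is illustrative:
>>> from io import BytesIO
>>> buf = BytesIO()
>>> df.write_ipc(buf, compression="zstd")
>>> buf.seek(0)  # rewind before reading the buffer back
0
>>> df_roundtrip = pl.read_ipc(buf)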
- write_json(file: None = None, pretty: bool = False, row_oriented: bool = False) str [source]
- write_json(file: IOBase | str | Path, pretty: bool = False, row_oriented: bool = False) None
Serialize to JSON representation.
- Parameters:
- file
File path to which the result should be written. If set to None (default), the output is returned as a string instead.
- pretty
Pretty serialize JSON.
- row_oriented
Write to row-oriented JSON. This is slower, but more common.
Examples
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...     }
... )
>>> df.write_json()
'{"columns":[{"name":"foo","datatype":"Int64","values":[1,2,3]},{"name":"bar","datatype":"Int64","values":[6,7,8]}]}'
>>> df.write_json(row_oriented=True)
'[{"foo":1,"bar":6},{"foo":2,"bar":7},{"foo":3,"bar":8}]'
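When a path or file object is supplied instead of None, nothing is returned and the JSON is written to that target. A minimal sketch, with "out.json" as an assumed writable path:
>>> df.write_json("out.json")  # writes to disk; returns None
>>> df_from_file = pl.read_json("out.json")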
- write_ndjson(file: None = None) str [source]
- write_ndjson(file: IOBase | str | Path) None
Serialize to newline delimited JSON representation.
- Parameters:
- file
File path to which the result should be written. If set to None (default), the output is returned as a string instead.
Examples
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...     }
... )
>>> df.write_ndjson()
'{"foo":1,"bar":6}\n{"foo":2,"bar":7}\n{"foo":3,"bar":8}\n'
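As with write_json, supplying a path writes the newline-delimited output to disk instead of returning a string. A minimal sketch, with "out.ndjson" as an assumed writable path:
>>> df.write_ndjson("out.ndjson")  # writes to disk; returns None
>>> df_from_file = pl.read_ndjson("out.ndjson")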
- write_parquet(file: str | Path | BytesIO, *, compression: ParquetCompression = 'zstd', compression_level: int | None = None, statistics: bool = False, row_group_size: int | None = None, use_pyarrow: bool = False, pyarrow_options: dict[str, object] | None = None) None [source]
Write to Apache Parquet file.
- Parameters:
- file
File path to which the file should be written.
- compression{‘lz4’, ‘uncompressed’, ‘snappy’, ‘gzip’, ‘lzo’, ‘brotli’, ‘zstd’}
Choose “zstd” for good compression performance. Choose “lz4” for fast compression/decompression. Choose “snappy” for more backwards compatibility guarantees when you deal with older parquet readers.
- compression_level
The level of compression to use. Higher compression means smaller files on disk.
“gzip” : min-level: 0, max-level: 10.
“brotli” : min-level: 0, max-level: 11.
“zstd” : min-level: 1, max-level: 22.
- statistics
Write statistics to the parquet headers. This requires extra compute.
- row_group_size
Size of the row groups in number of rows. If None (default), the chunks of the DataFrame are used. Writing in smaller chunks may reduce memory pressure and improve writing speeds. If None and use_pyarrow=True, the row group size will be the minimum of the DataFrame size and 64 * 1024 * 1024.
- use_pyarrow
Use the C++ parquet implementation instead of the Rust parquet implementation. At the moment the C++ implementation supports more features.
- pyarrow_options
Arguments passed to pyarrow.parquet.write_table.
Examples
>>> import pathlib
>>>
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3, 4, 5],
...         "bar": [6, 7, 8, 9, 10],
...         "ham": ["a", "b", "c", "d", "e"],
...     }
... )
>>> dirpath = pathlib.Path(".")  # any writable directory
>>> path: pathlib.Path = dirpath / "new_file.parquet"
>>> df.write_parquet(path)
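The compression-related parameters are passed in the same call. A minimal sketch trading write speed for a smaller file; the compression level and row group size chosen here are illustrative only:
>>> df.write_parquet(
...     dirpath / "new_file_zstd.parquet",
...     compression="zstd",
...     compression_level=10,  # zstd accepts levels 1-22 (see above)
...     statistics=True,
...     row_group_size=100_000,
... )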