Version 0.20

Breaking changes

Change default join behavior with regard to null values

Previously, null values in the join key were treated like any other value. This meant that null values in the left frame would be joined with null values in the right frame. This is expensive and does not match the default behavior in SQL.

Default behavior has now been changed to ignore null values in the join key. The previous behavior can be retained by setting join_nulls=True.

Example

Before:

>>> df1 = pl.DataFrame({"a": [1, 2, None], "b": [4, 4, 4]})
>>> df2 = pl.DataFrame({"a": [None, 2, 3], "c": [5, 5, 5]})
>>> df1.join(df2, on="a", how="inner")
shape: (2, 3)
┌──────┬─────┬─────┐
│ a    ┆ b   ┆ c   │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ i64 ┆ i64 │
╞══════╪═════╪═════╡
│ null ┆ 4   ┆ 5   │
│ 2    ┆ 4   ┆ 5   │
└──────┴─────┴─────┘

After:

>>> df1.join(df2, on="a", how="inner")
shape: (1, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 2   ┆ 4   ┆ 5   │
└─────┴─────┴─────┘
>>> df1.join(df2, on="a", how="inner", join_nulls=True)  # Keeps previous behavior
shape: (2, 3)
┌──────┬─────┬─────┐
│ a    ┆ b   ┆ c   │
│ ---  ┆ --- ┆ --- │
│ i64  ┆ i64 ┆ i64 │
╞══════╪═════╪═════╡
│ null ┆ 4   ┆ 5   │
│ 2    ┆ 4   ┆ 5   │
└──────┴─────┴─────┘

Preserve left and right join keys in outer joins

Previously, the result of an outer join did not contain the join keys of the left and right frames. Rather, it contained a coalesced version of the left key and right key. This loses information and does not conform to default SQL behavior.

The behavior has been changed to include the original join keys. Name clashes are resolved by appending a suffix (_right by default) to the right join key name. The previous behavior can be retained by setting how="outer_coalesce".

Example

Before:

>>> df1 = pl.DataFrame({"L1": ["a", "b", "c"], "L2": [1, 2, 3]})
>>> df2 = pl.DataFrame({"L1": ["a", "c", "d"], "R2": [7, 8, 9]})
>>> df1.join(df2, on="L1", how="outer")
shape: (4, 3)
┌─────┬──────┬──────┐
│ L1  ┆ L2   ┆ R2   │
│ --- ┆ ---  ┆ ---  │
│ str ┆ i64  ┆ i64  │
╞═════╪══════╪══════╡
│ a   ┆ 1    ┆ 7    │
│ c   ┆ 3    ┆ 8    │
│ d   ┆ null ┆ 9    │
│ b   ┆ 2    ┆ null │
└─────┴──────┴──────┘

After:

>>> df1.join(df2, on="L1", how="outer")
shape: (4, 4)
┌──────┬──────┬──────────┬──────┐
│ L1   ┆ L2   ┆ L1_right ┆ R2   │
│ ---  ┆ ---  ┆ ---      ┆ ---  │
│ str  ┆ i64  ┆ str      ┆ i64  │
╞══════╪══════╪══════════╪══════╡
│ a    ┆ 1    ┆ a        ┆ 7    │
│ b    ┆ 2    ┆ null     ┆ null │
│ c    ┆ 3    ┆ c        ┆ 8    │
│ null ┆ null ┆ d        ┆ 9    │
└──────┴──────┴──────────┴──────┘
>>> df1.join(df2, on="L1", how="outer_coalesce")  # Keeps previous behavior
shape: (4, 3)
┌─────┬──────┬──────┐
│ L1  ┆ L2   ┆ R2   │
│ --- ┆ ---  ┆ ---  │
│ str ┆ i64  ┆ i64  │
╞═════╪══════╪══════╡
│ a   ┆ 1    ┆ 7    │
│ c   ┆ 3    ┆ 8    │
│ d   ┆ null ┆ 9    │
│ b   ┆ 2    ┆ null │
└─────┴──────┴──────┘
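
If the default _right suffix would clash with an existing column, the suffix parameter of join can be used to pick a different name. A minimal sketch (the column order shown is indicative):

>>> df1.join(df2, on="L1", how="outer", suffix="_r").columns
['L1', 'L2', 'L1_r', 'R2']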

count now ignores null values

The count method for Expr and Series now ignores null values. Use len to get the count with null values included.

Note that pl.count() and group_by(...).count() are unchanged. These count the number of rows in the context, so nulls are not applicable in the same way.

This brings behavior more in line with the SQL standard, where COUNT(col) ignores null values but COUNT(*) counts rows regardless of null values.

Example

Before:

>>> df = pl.DataFrame({"a": [1, 2, None]})
>>> df.select(pl.col("a").count())
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ u32 │
╞═════╡
│ 3   │
└─────┘

After:

>>> df.select(pl.col("a").count())
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ u32 │
╞═════╡
│ 2   │
└─────┘
>>> df.select(pl.col("a").len())  # Mirrors previous behavior
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ u32 │
╞═════╡
│ 3   │
└─────┘
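
For comparison, pl.count() still counts the rows in the context, so the null in column a is included. A minimal sketch (column name and output shown are indicative):

>>> df.select(pl.count())  # Counts rows, regardless of null values
shape: (1, 1)
┌───────┐
│ count │
│ ---   │
│ u32   │
╞═══════╡
│ 3     │
└───────┘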

NaN values are now considered equal

Floating point NaN values were previously treated as unequal across Polars operations. This has been corrected to better match user expectations and existing standards.

While this is considered a bug fix, it is included in this guide to draw attention to its possible impact on workflows that contain NaN values.

Example

Before:

>>> s = pl.Series([1.0, float("nan"), float("inf")])
>>> s == s
shape: (3,)
Series: '' [bool]
[
        true
        false
        true
]

After:

>>> s == s
shape: (3,)
Series: '' [bool]
[
        true
        true
        true
]

Assertion utils updates to exact checking and NaN equality

The assertion utility functions assert_frame_equal and assert_series_equal would use the tolerance parameters atol and rtol to do approximate checking, unless check_exact was set to True. This could lead to some surprising behavior, as integers are generally thought of as exact values. Integer values are now always checked exactly. To do inexact checking, convert to float first.

Additionally, the nans_compare_equal parameter has been removed and NaN values are now always considered equal, which was the previous default behavior. This parameter had previously been deprecated but has been removed before the end of the standard deprecation period to facilitate the change to NaN equality.

Example

Before:

>>> from polars.testing import assert_frame_equal
>>> df1 = pl.DataFrame({"id": [123456]})
>>> df2 = pl.DataFrame({"id": [123457]})
>>> assert_frame_equal(df1, df2)  # Passes

After:

>>> assert_frame_equal(df1, df2)
...
AssertionError: DataFrames are different (value mismatch for column 'id')
[left]:  [123456]
[right]: [123457]
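
To keep approximate checking for integer data, cast to a float type first, as suggested above. A minimal sketch relying on the default rtol tolerance:

>>> assert_frame_equal(
...     df1.with_columns(pl.col("id").cast(pl.Float64)),
...     df2.with_columns(pl.col("id").cast(pl.Float64)),
... )  # Passes: float columns are checked approximately by default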

Allow all DataType objects to be instantiated

Polars data types are subclasses of the DataType class. We had a 'hack' in place: instantiating a data type without any arguments would return the class itself rather than an actual instance. The idea was to allow specifying data types as Int64 rather than Int64(), which is more succinct. However, this caused some unexpected behavior when working directly with data type objects, especially as there was a discrepancy with data types like Datetime, which are instantiated in many cases.

Going forward, instantiating a data type will always return an instance of that class. Both classes and instances are handled by Polars, so the previous short syntax is still available. Attributes and methods that return data types, such as Series.dtype and DataFrame.schema, now always return instantiated data type objects.

You may have to update some of your data type checks if you were not already using the equality operator (==), as well as update some type hints.

Example

Before:

>>> s = pl.Series([1, 2, 3], dtype=pl.Int8)
>>> s.dtype == pl.Int8
True
>>> s.dtype is pl.Int8
True
>>> isinstance(s.dtype, pl.Int8)
False

After:

>>> s.dtype == pl.Int8
True
>>> s.dtype is pl.Int8
False
>>> isinstance(s.dtype, pl.Int8)
True

Update constructors for Decimal and Array data types

The Decimal and Array data types have had their constructor parameters reordered. The new constructors should more closely match user expectations.

Example

Before:

>>> pl.Array(2, pl.Int16)
Array(Int16, 2)
>>> pl.Decimal(5, 10)
Decimal(precision=10, scale=5)

After:

>>> pl.Array(pl.Int16, 2)
Array(Int16, width=2)
>>> pl.Decimal(10, 5)
Decimal(precision=10, scale=5)

DataType.is_nested changed from a property to a class method

This is a minor change, but a very important one to update properly. Failure to update accordingly may result in faulty logic, as Python will evaluate the method object to True. For example, if dtype.is_nested will now evaluate to True regardless of the data type, because the expression returns the method itself, which Python considers truthy.

Example

Before:

>>> pl.List(pl.Int8).is_nested
True

After:

>>> pl.List(pl.Int8).is_nested()
True
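
To illustrate the pitfall described above, a minimal sketch of the difference between referencing the method and calling it:

>>> bool(pl.Int8().is_nested)  # Missing parentheses: the method object itself is truthy
True
>>> pl.Int8().is_nested()  # Correct: call the method
False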

Smaller integer data types for datetime components dt.month, dt.week

Most datetime components such as month and week would previously return a UInt32 type. This has been updated to the smallest appropriate signed integer type. This should reduce memory consumption.

Method        Old dtype   New dtype
year          i32         i32
iso_year      i32         i32
quarter       u32         i8
month         u32         i8
week          u32         i8
day           u32         i8
weekday       u32         i8
ordinal_day   u32         i16
hour          u32         i8
minute        u32         i8
second        u32         i8
millisecond   u32         i32*
microsecond   u32         i32
nanosecond    u32         i32

*Technically, millisecond can be an i16. This may be updated in the future.

Example

Before:

>>> from datetime import date
>>> s = pl.Series([date(2023, 12, 31), date(2024, 1, 1)])
>>> s.dt.month()
shape: (2,)
Series: '' [u32]
[
        12
        1
]

After:

>>> s.dt.month()
shape: (2,)
Series: '' [i8]
[
        12
        1
]
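
If downstream code relies on the previous unsigned types, casting restores them. A minimal sketch:

>>> s.dt.month().cast(pl.UInt32)
shape: (2,)
Series: '' [u32]
[
        12
        1
]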

Series now defaults to Null data type when no data is present

This replaces the previous behavior of initializing as a Float32 type.

Example

Before:

>>> pl.Series("a", [None])
shape: (1,)
Series: 'a' [f32]
[
        null
]

After:

>>> pl.Series("a", [None])
shape: (1,)
Series: 'a' [null]
[
        null
]
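
To retain the previous behavior, specify the data type explicitly. A minimal sketch:

>>> pl.Series("a", [None], dtype=pl.Float32)
shape: (1,)
Series: 'a' [f32]
[
        null
]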

replace reimplemented with slightly different behavior

The new implementation is mostly backwards compatible. Please do note the following:

  1. The logic for determining the return data type has changed. You may want to specify return_dtype to override the inferred data type, or take advantage of the new function signature (separate old and new parameters) to influence the return type; see the sketch after the example below.
  2. The previous workaround of wrapping columns in a struct to reference other columns as the default no longer works. Referencing other columns in default now simply works as expected; no workaround is needed.

Example

Before:

>>> df = pl.DataFrame({"a": [1, 2, 2, 3], "b": [1.5, 2.5, 5.0, 1.0]}, schema={"a": pl.Int8, "b": pl.Float64})
>>> df.select(pl.col("a").replace({2: 100}))
shape: (4, 1)
┌─────┐
│ a   │
│ --- │
│ i8  │
╞═════╡
│ 1   │
│ 100 │
│ 100 │
│ 3   │
└─────┘
>>> df.select(pl.struct("a", "b").replace({2: 100}, default=pl.col("b")))
shape: (4, 1)
┌───────┐
│ a     │
│ ---   │
│ f64   │
╞═══════╡
│ 1.5   │
│ 100.0 │
│ 100.0 │
│ 1.0   │
└───────┘

After:

>>> df.select(pl.col("a").replace({2: 100}))
shape: (4, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 100 │
│ 100 │
│ 3   │
└─────┘
>>> df.select(pl.col("a").replace({2: 100}, default=pl.col("b")))  # No struct needed
shape: (4, 1)
┌───────┐
│ a     │
│ ---   │
│ f64   │
╞═══════╡
│ 1.5   │
│ 100.0 │
│ 100.0 │
│ 1.0   │
└───────┘
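
Regarding point 1 above: to control the resulting data type, pass return_dtype, optionally combined with the separate old and new parameters. A minimal sketch (assuming the parameter names old, new, and return_dtype; output shown is indicative):

>>> df.select(pl.col("a").replace(old=2, new=100, return_dtype=pl.Int8))
shape: (4, 1)
┌─────┐
│ a   │
│ --- │
│ i8  │
╞═════╡
│ 1   │
│ 100 │
│ 100 │
│ 3   │
└─────┘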

value_counts resulting column renamed from counts to count

The counts column in the result of the value_counts method has been renamed from counts to count (for the expression version, this is the corresponding struct field).

Example

Before:

>>> s = pl.Series("a", ["x", "x", "y"])
>>> s.value_counts()
shape: (2, 2)
┌─────┬────────┐
│ a   ┆ counts │
│ --- ┆ ---    │
│ str ┆ u32    │
╞═════╪════════╡
│ x   ┆ 2      │
│ y   ┆ 1      │
└─────┴────────┘

After:

>>> s.value_counts()
shape: (2, 2)
┌─────┬───────┐
│ a   ┆ count │
│ --- ┆ ---   │
│ str ┆ u32   │
╞═════╪═══════╡
│ x   ┆ 2     │
│ y   ┆ 1     │
└─────┴───────┘

Update read_parquet to use Object Store rather than fsspec

If you were using read_parquet, installing fsspec as an optional dependency is no longer required. The new Object Store implementation was already in use for scan_parquet. It may have slightly different behavior in certain cases, such as how credentials are detected and how downloads are performed.

The resulting DataFrame should be identical between versions.
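
For example, reading directly from cloud storage now goes through the same machinery as scan_parquet. A minimal sketch (the bucket path and option shown are placeholders, and credentials may also be picked up from the environment):

>>> pl.read_parquet(
...     "s3://my-bucket/data.parquet",  # Placeholder path
...     storage_options={"aws_region": "us-east-1"},  # Placeholder option
... )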

Deprecations

Cumulative functions renamed from cum* to cum_*

Technically, this deprecation was introduced in version 0.19.14, but many users will first encounter it when upgrading to 0.20. It's a relatively impactful change, which is why we mention it here.

Old name    New name
cumfold     cum_fold
cumreduce   cum_reduce
cumsum      cum_sum
cumprod     cum_prod
cummin      cum_min
cummax      cum_max
cumcount    cum_count
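
Updating is a matter of renaming the call sites. A minimal sketch (output shown is indicative):

>>> df = pl.DataFrame({"a": [1, 2, 3]})
>>> df.select(pl.col("a").cum_sum())  # Previously pl.col("a").cumsum(), which now emits a DeprecationWarning
shape: (3, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 3   │
│ 6   │
└─────┘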