Missing data
This page sets out how missing data is represented in Polars
and how missing data can be filled.
null
and NaN
values
Each column in a DataFrame
(or equivalently a Series
) is an Arrow array or a collection of Arrow arrays based on the Apache Arrow format. Missing data is represented in Arrow and Polars
with a null
value. This null
missing value applies for all data types including numerical values.
Polars
also allows NotaNumber
or NaN
values for float columns. These NaN
values are considered to be a type of floating point data rather than missing data. We discuss NaN
values separately below.
You can manually define a missing value with the python None
value:
shape: (2, 1)
┌───────┐
│ value │
│ --- │
│ i64 │
╞═══════╡
│ 1 │
│ null │
└───────┘
Info
In Pandas
the value for missing data depends on the dtype of the column. In Polars
missing data is always represented as a null
value.
Missing data metadata
Each Arrow array used by Polars
stores two kinds of metadata related to missing data. This metadata allows Polars
to quickly show how many missing values there are and which values are missing.
The first piece of metadata is the null_count
- this is the number of rows with null
values in the column:
null_count_df = df.null_count()
print(null_count_df)
let null_count_df = df.null_count();
println!("{}", &null_count_df);
shape: (1, 1)
┌───────┐
│ value │
│ --- │
│ u32 │
╞═══════╡
│ 1 │
└───────┘
The null_count
method can be called on a DataFrame
, a column from a DataFrame
or a Series
. The null_count
method is a cheap operation as null_count
is already calculated for the underlying Arrow array.
The second piece of metadata is an array called a validity bitmap that indicates whether each data value is valid or missing.
The validity bitmap is memory efficient as it is bit encoded - each value is either a 0 or a 1. This bit encoding means the memory overhead per array is only (array length / 8) bytes. The validity bitmap is used by the is_null
method in Polars
.
You can return a Series
based on the validity bitmap for a column in a DataFrame
or a Series
with the is_null
method:
is_null_series = df.select(
pl.col("value").is_null(),
)
print(is_null_series)
let is_null_series = df
.clone()
.lazy()
.select([col("value").is_null()])
.collect()?;
println!("{}", &is_null_series);
shape: (2, 1)
┌───────┐
│ value │
│ --- │
│ bool │
╞═══════╡
│ false │
│ true │
└───────┘
The is_null
method is a cheap operation that does not require scanning the full column for null
values. This is because the validity bitmap already exists and can be returned as a Boolean array.
Filling missing data
Missing data in a Series
can be filled with the fill_null
method. You have to specify how you want the fill_null
method to fill the missing data. The main ways to do this are filling with:
- a literal such as 0 or "0"
- a strategy such as filling forwards
- an expression such as replacing with values from another column
- interpolation
We illustrate each way to fill nulls by defining a simple DataFrame
with a missing value in col2
:
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 1 ┆ 1 │
│ 2 ┆ null │
│ 3 ┆ 3 │
└──────┴──────┘
Fill with specified literal value
We can fill the missing data with a specified literal value with pl.lit
:
fill_literal_df = (
df.with_columns(
pl.col("col2").fill_null(
pl.lit(2),
),
),
)
print(fill_literal_df)
let fill_literal_df = df
.clone()
.lazy()
.with_columns([col("col2").fill_null(lit(2))])
.collect()?;
println!("{}", &fill_literal_df);
(shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 1 ┆ 1 │
│ 2 ┆ 2 │
│ 3 ┆ 3 │
└──────┴──────┘,)
Fill with a strategy
We can fill the missing data with a strategy such as filling forward:
fill_forward_df = df.with_columns(
pl.col("col2").fill_null(strategy="forward"),
)
print(fill_forward_df)
let fill_forward_df = df
.clone()
.lazy()
.with_columns([col("col2").forward_fill(None)])
.collect()?;
println!("{}", &fill_forward_df);
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 1 ┆ 1 │
│ 2 ┆ 1 │
│ 3 ┆ 3 │
└──────┴──────┘
You can find other fill strategies in the API docs.
Fill with an expression
For more flexibility we can fill the missing data with an expression. For example, to fill nulls with the median value from that column:
fill_median_df = df.with_columns(
pl.col("col2").fill_null(pl.median("col2")),
)
print(fill_median_df)
let fill_median_df = df
.clone()
.lazy()
.with_columns([col("col2").fill_null(median("col2"))])
.collect()?;
println!("{}", &fill_median_df);
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞══════╪══════╡
│ 1 ┆ 1.0 │
│ 2 ┆ 2.0 │
│ 3 ┆ 3.0 │
└──────┴──────┘
In this case the column is cast from integer to float because the median is a float statistic.
Fill with interpolation
In addition, we can fill nulls with interpolation (without using the fill_null
function):
fill_interpolation_df = df.with_columns(
pl.col("col2").interpolate(),
)
print(fill_interpolation_df)
let fill_interpolation_df = df
.clone()
.lazy()
.with_columns([col("col2").interpolate(InterpolationMethod::Linear)])
.collect()?;
println!("{}", &fill_interpolation_df);
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 1 ┆ 1 │
│ 2 ┆ 2 │
│ 3 ┆ 3 │
└──────┴──────┘
NotaNumber
or NaN
values
Missing data in a Series
has a null
value. However, you can use NotaNumber
or NaN
values in columns with float datatypes. These NaN
values can be created from Numpy's np.nan
or the native python float('nan')
:
shape: (4, 1)
┌───────┐
│ value │
│ --- │
│ f64 │
╞═══════╡
│ 1.0 │
│ NaN │
│ NaN │
│ 3.0 │
└───────┘
Info
In Pandas
by default a NaN
value in an integer column causes the column to be cast to float. This does not happen in Polars
- instead an exception is raised.
NaN
values are considered to be a type of floating point data and are not considered to be missing data in Polars
. This means:
NaN
values are not counted with thenull_count
methodNaN
values are filled when you usefill_nan
method but are not filled with thefill_null
method
Polars
has is_nan
and fill_nan
methods which work in a similar way to the is_null
and fill_null
methods. The underlying Arrow arrays do not have a pre-computed validity bitmask for NaN
values so this has to be computed for the is_nan
method.
One further difference between null
and NaN
values is that taking the mean
of a column with null
values excludes the null
values from the calculation but with NaN
values taking the mean results in a NaN
. This behaviour can be avoided by replacing the NaN
values with null
values;
mean_nan_df = nan_df.with_columns(
pl.col("value").fill_nan(None).alias("value"),
).mean()
print(mean_nan_df)
let mean_nan_df = nan_df
.clone()
.lazy()
.with_columns([col("value").fill_nan(lit(NULL)).alias("value")])
.mean()
.collect()?;
println!("{}", &mean_nan_df);
shape: (1, 1)
┌───────┐
│ value │
│ --- │
│ f64 │
╞═══════╡
│ 2.0 │
└───────┘