Coming from Pandas
Here we set out the key points that anyone who has experience with Pandas and wants to try Polars should know. We include both differences in the concepts the libraries are built on and differences in how you should write Polars code compared to Pandas code.
Differences in concepts between Polars and Pandas
Polars does not have an index
Pandas gives a label to each row with an index. Polars does not use an index and each row is indexed by its integer position in the table.
Indexes are not needed! Not having them makes things easier - convince us otherwise!
For more detail on how you select data in Polars see the indexing section.
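If you do need position-based access, a minimal sketch (with hypothetical data) looks like this:
import polars as pl
df = pl.DataFrame({"a": [1, 2, 3]})
row = df.row(1)       # second row as a tuple, selected by integer position
sub = df.slice(1, 2)  # rows at positions 1 and 2, as a new DataFrame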
Polars uses Apache Arrow arrays to represent data in memory while Pandas uses Numpy arrays
Polars represents data in memory with Arrow arrays while Pandas represents data in memory in Numpy arrays. Apache Arrow is an emerging standard for in-memory columnar analytics that can accelerate data load times, reduce memory usage and accelerate calculations.
Polars can convert data to Numpy format with the to_numpy method.
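For example, a minimal sketch of the conversion (with hypothetical data):
import polars as pl
df = pl.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
arr = df.to_numpy()  # copies the Arrow-backed columns into a 2D Numpy array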
Polars has more support for parallel operations than Pandas
Polars exploits the strong support for concurrency in Rust to run many operations in parallel. While some operations in Pandas are multi-threaded, the core of the library is single-threaded and an additional library such as Dask must be used to parallelise operations.
Polars can lazily evaluate queries and apply query optimization
Eager evaluation means that code is evaluated as soon as you run it. Lazy evaluation means that running a line of code adds the underlying logic to a query plan rather than evaluating it immediately.
Polars supports eager evaluation and lazy evaluation whereas Pandas only supports eager evaluation. The lazy evaluation mode is powerful because Polars carries out automatic query optimization where it examines the query plan and looks for ways to accelerate the query or reduce memory usage.
Dask also supports lazy evaluation where it generates a query plan. However, Dask does not carry out query optimization on the query plan.
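As a minimal sketch of the difference (with hypothetical data), a lazy query only runs when you ask for the result:
import polars as pl
df = pl.DataFrame({"a": [1, 2, 3]})
lazy_query = df.lazy().filter(pl.col("a") > 1)  # only builds a query plan
result = lazy_query.collect()                   # optimization and execution happen here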
Key syntax differences
Users coming from Pandas generally need to know one thing...
polars != pandas
If your Polars code looks like it could be Pandas code, it might run, but it likely runs slower than it should.
Let's go through some typical Pandas code and see how we might write that in Polars.
Selecting data
As there is no index in Polars there is no .loc or iloc method - and there is also no SettingWithCopyWarning.
To learn more about how you select data in Polars see the indexing section.
However, the best way to select data in Polars is to use the expression API. For example, if you want to select a column in Pandas you can do one of the following:
df['a']
df.loc[:,'a']
but in Polars you would use the .select method:
df.select(['a'])
If you want to select rows based on the values then in Polars you use the .filter method:
df.filter(pl.col('a') < 10)
As noted in the section on expressions below, Polars can run operations in .select and .filter in parallel and Polars can carry out query optimization on the full set of data selection criteria.
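As a hedged sketch of how these compose (column names and data are hypothetical), both operations can sit in a single lazy query:
import polars as pl
df = pl.DataFrame({"a": [1, 5, 12], "b": [0, 1, 2]})
out = (
    df.lazy()
    .filter(pl.col("a") < 10)  # row selection as an expression
    .select(["a", "b"])        # column selection as an expression
    .collect()
)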
Be lazy
Working in lazy evaluation mode is straightforward and should be your default in Polars as the lazy mode allows Polars to do query optimization.
We can run in lazy mode by either using an implicitly lazy function (such as scan_csv) or explicitly using the lazy method.
Take the following simple example where we read a CSV file from disk and do a groupby. The CSV file has numerous columns but we just want to do a groupby on one of the id columns (id1) and then sum a value column (v1). In Pandas this would be:
df = pd.read_csv(csvFile)
groupedDf = df.loc[:, ['id1', 'v1']].groupby('id1').sum()
In Polars you can build this query in lazy mode with query optimization and evaluate it by replacing the eager Pandas function read_csv with the implicitly lazy Polars function scan_csv:
df = pl.scan_csv(csvFile)
groupedDf = df.groupby('id1').agg(pl.col('v1').sum()).collect()
Polars optimizes this query by identifying that only the id1 and v1 columns are relevant and so will only read these columns from the CSV. By calling the .collect method at the end of the second line we instruct Polars to eagerly evaluate the query.
If you do want to run this query in eager mode you can just replace scan_csv with read_csv in the Polars code.
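For completeness, the eager version would then be a sketch like this (note that .collect is no longer needed because each line is evaluated immediately):
df = pl.read_csv(csvFile)
groupedDf = df.groupby('id1').agg(pl.col('v1').sum())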
Read more about working with lazy evaluation in the lazy API section.
Express yourself
A typical Pandas script consists of multiple data transformations that are executed sequentially. However, in Polars these transformations can be executed in parallel using expressions.
Column assignment
We have a dataframe df with a column called value. We want to add two new columns: a column called tenXValue where the value column is multiplied by 10 and a column called hundredXValue where the value column is multiplied by 100.
In Pandas this would be:
df["tenXValue"] = df["value"] * 10
df["hundredXValue"] = df["value"] * 100
These column assignments are executed sequentially.
In Polars we add columns to df using the .with_columns method and name them with the .alias method:
df.with_columns([
    (pl.col("value") * 10).alias("tenXValue"),
    (pl.col("value") * 100).alias("hundredXValue"),
])
These column assignments are executed in parallel.
Column assignment based on predicate
In this case we have a dataframe df with columns a, b and c. We want to re-assign the values in column a based on a condition: when the value in column c is equal to 2, we replace the value in a with the value in b.
In Pandas this would be:
df.loc[df["c"] == 2, "a"] = df.loc[df["c"] == 2, "b"]
while in Polars this would be:
df.with_columns(
    pl.when(pl.col("c") == 2)
    .then(pl.col("b"))
    .otherwise(pl.col("a"))
    .alias("a")
)
The Polars way is pure in that the original DataFrame is not modified. The mask is also not computed twice as it is in Pandas (you could prevent this in Pandas, but that would require setting a temporary variable).
Additionally, Polars can compute every branch of an if -> then -> otherwise in parallel. This is valuable when the branches become more expensive to compute.
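As a hedged sketch of multiple branches (the second condition is hypothetical), when/then clauses can be chained before the final otherwise:
df.with_columns(
    pl.when(pl.col("c") == 2)
    .then(pl.col("b"))
    .when(pl.col("c") == 3)  # hypothetical extra branch
    .then(pl.col("a") * 2)
    .otherwise(pl.col("a"))
    .alias("a")
)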
Filtering
We want to filter the dataframe df with housing data based on some criteria.
In Pandas you filter the dataframe by passing Boolean expressions to the loc method:
df.loc[(df['sqft_living'] > 2500) & (df['price'] < 300000)]
while in Polars you call the filter method:
df.filter(
    (pl.col("sqft_living") > 2500) & (pl.col("price") < 300000)
)
The query optimizer in Polars can also detect if you write multiple filters separately and combine them into a single filter in the optimized plan.
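For example, a minimal sketch in lazy mode (reusing the hypothetical housing columns) where both predicates can be merged:
result = (
    df.lazy()
    .filter(pl.col("sqft_living") > 2500)
    .filter(pl.col("price") < 300000)
    .collect()  # the optimizer can combine both filters into a single predicate
)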
Pandas transform
The Pandas documentation demonstrates an operation on a groupby called transform. In this case we have a dataframe df and we want a new column showing the number of rows in each group.
In Pandas we have:
df = pd.DataFrame({
    "type": ["m", "n", "o", "m", "m", "n", "n"],
    "c": [1, 1, 1, 2, 2, 2, 2],
})
df["size"] = df.groupby("c")["type"].transform(len)
Here Pandas does a groupby on "c", takes column "type", computes the group length and then joins the result back to the original DataFrame producing:
c type size
0 1 m 3
1 1 n 3
2 1 o 3
3 2 m 4
4 2 m 4
5 2 n 4
6 2 n 4
In Polars the same can be achieved with window functions:
df.select([
    pl.all(),
    pl.col("type").count().over("c").alias("size"),
])
shape: (7, 3)
┌─────┬──────┬──────┐
│ c ┆ type ┆ size │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ u32 │
╞═════╪══════╪══════╡
│ 1 ┆ m ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1 ┆ n ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1 ┆ o ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ m ┆ 4 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ m ┆ 4 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ n ┆ 4 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ n ┆ 4 │
└─────┴──────┴──────┘
Because we can store the whole operation in a single expression, we can combine several window functions and even combine different groups!
Polars will cache window expressions that are applied over the same group, so storing them in a single select is both convenient and optimal. In the following example we look at a case where we are calculating group statistics over "c" twice:
df.select([
    pl.all(),
    pl.col("c").count().over("c").alias("size"),
    pl.col("c").sum().over("type").alias("sum"),
    pl.col("c").reverse().over("c").flatten().alias("reverse_type"),
])
shape: (7, 5)
┌─────┬──────┬──────┬─────┬──────────────┐
│ c ┆ type ┆ size ┆ sum ┆ reverse_type │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ u32 ┆ i64 ┆ i64 │
╞═════╪══════╪══════╪═════╪══════════════╡
│ 1 ┆ m ┆ 3 ┆ 5 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ n ┆ 3 ┆ 5 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ o ┆ 3 ┆ 1 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ m ┆ 4 ┆ 5 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ m ┆ 4 ┆ 5 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ n ┆ 4 ┆ 5 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ n ┆ 4 ┆ 5 ┆ 1 │
└─────┴──────┴──────┴─────┴──────────────┘