Selecting with expressions
In this page we show how to select rows and columns with the preferred expression approach. We cover:
- use of the Expression API via the
filter
andselect
methods to select data - combining these expressions and
- optimization of these expression in lazy mode.
To select data with expressions we use:
- the
filter
method to select rows - the
select
method to select columns
For simplicity we deal with DataFrame
examples throughout. The principles are the same for Series
objects except that columns obviously cannot be selected in a Series
. To illustrate the filter
and select
methods we define a simple DataFrame
:
df = pl.DataFrame(
{
"id": [1, 2, 3],
"color": ["blue", "red", "green"],
"size": ["small", "medium", "large"],
}
)
print(df)
shape: (3, 3)
┌─────┬───────┬────────┐
│ id ┆ color ┆ size │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═══════╪════════╡
│ 1 ┆ blue ┆ small │
│ 2 ┆ red ┆ medium │
│ 3 ┆ green ┆ large │
└─────┴───────┴────────┘
Selecting rows with the filter
method
We can select rows by using the filter
method. In the filter
method we pass the condition we are using to select the rows as an expression:
filter_df = df.filter(pl.col("id") <= 2)
print(filter_df)
shape: (2, 3)
┌─────┬───────┬────────┐
│ id ┆ color ┆ size │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═══════╪════════╡
│ 1 ┆ blue ┆ small │
│ 2 ┆ red ┆ medium │
└─────┴───────┴────────┘
We can specify multiple conditions in filter
using the &
operator:
multi_filter_df = df.filter((pl.col("id") <= 2) & (pl.col("size") == "small"))
print(multi_filter_df)
shape: (1, 3)
┌─────┬───────┬───────┐
│ id ┆ color ┆ size │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str │
╞═════╪═══════╪═══════╡
│ 1 ┆ blue ┆ small │
└─────┴───────┴───────┘
Selecting columns with the select
method
We select columns using the select
method. In the select
method we can specify the columns with:
- a (string) column name
- a list of (string) column names
- a boolean list of the same length as the number of columns
- an expression such as a condition on the column name
- a
Series
Select a single column
single_select_df = df.select("id")
print(single_select_df)
shape: (3, 1)
┌─────┐
│ id │
│ --- │
│ i64 │
╞═════╡
│ 1 │
│ 2 │
│ 3 │
└─────┘
Select a list of columns
list_select_df = df.select(["id", "color"])
print(list_select_df)
shape: (3, 2)
┌─────┬───────┐
│ id ┆ color │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═══════╡
│ 1 ┆ blue │
│ 2 ┆ red │
│ 3 ┆ green │
└─────┴───────┘
Select columns with an expression
To select based on a condition on the column name:
condition_select_df = df.select(pl.col("^col.*$"))
print(condition_select_df)
shape: (3, 1)
┌───────┐
│ color │
│ --- │
│ str │
╞═══════╡
│ blue │
│ red │
│ green │
└───────┘
To select based on the dtype of the columns:
dtype_select_df = df.select(pl.col(pl.Int64))
print(dtype_select_df)
shape: (3, 1)
┌─────┐
│ id │
│ --- │
│ i64 │
╞═════╡
│ 1 │
│ 2 │
│ 3 │
└─────┘
Selecting rows and columns
We can combine the filter
and select
methods to select rows and columns
expression_df = df.filter(pl.col("id") <= 2).select(["id", "color"])
print(expression_df)
shape: (2, 2)
┌─────┬───────┐
│ id ┆ color │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═══════╡
│ 1 ┆ blue │
│ 2 ┆ red │
└─────┴───────┘
Query optimization
In lazy mode the query optimizer may be able to optimize the query based on the expressions.
In this example we scan a CSV file with many columns using scan_csv
and then select
a subset of them. The query optimizer creates a query plan that causes only the selected columns to be read from the CSV - see how the Project
part of the query plan below states that only 1 of 2 columns will be read:
lazy_select_df = pl.scan_csv("data/appleStock.csv").select(["Date"])
print(lazy_select_df.describe_optimized_plan())
FAST_PROJECT: [Date]
CSV SCAN data/appleStock.csv
PROJECT 1/2 COLUMNS
If you specify two separate filter conditions the query optimizer will combine them into a single joint condition (see the Selection
part of the query plan below):
lazy_filter_df = (
pl.scan_csv("data/appleStock.csv")
.filter(
pl.col("Date") == datetime(1995, 10, 16),
)
.filter(pl.col("Close") > 100)
)
print(lazy_filter_df.describe_optimized_plan())
CSV SCAN data/appleStock.csv
PROJECT */2 COLUMNS
SELECTION: [([(col("Date")) == (1995-10-16 00:00:00.cast(Utf8))]) & ([(col("Close")) > (100.0)])]