Selecting with expressions

In this page we show how to select rows and columns with the preferred expression approach. We cover:

  • use of the Expression API via the filter and select methods to select data
  • combining these expressions and
  • optimization of these expression in lazy mode.

To select data with expressions we use:

  • the filter method to select rows
  • the select method to select columns

For simplicity we deal with DataFrame examples throughout. The principles are the same for Series objects except that columns obviously cannot be selected in a Series. To illustrate the filter and select methods we define a simple DataFrame:

df = pl.DataFrame(
    {
        "id": [1, 2, 3],
        "color": ["blue", "red", "green"],
        "size": ["small", "medium", "large"],
    }
)
print(df)
shape: (3, 3)
┌─────┬───────┬────────┐
│ id  ┆ color ┆ size   │
│ --- ┆ ---   ┆ ---    │
│ i64 ┆ str   ┆ str    │
╞═════╪═══════╪════════╡
│ 1   ┆ blue  ┆ small  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2   ┆ red   ┆ medium │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3   ┆ green ┆ large  │
└─────┴───────┴────────┘

Selecting rows with the filter method

We can select rows by using the filter method. In the filter method we pass the condition we are using to select the rows as an expression:

filter_df = df.filter(pl.col("id") <= 2)
print(filter_df)
shape: (2, 3)
┌─────┬───────┬────────┐
│ id  ┆ color ┆ size   │
│ --- ┆ ---   ┆ ---    │
│ i64 ┆ str   ┆ str    │
╞═════╪═══════╪════════╡
│ 1   ┆ blue  ┆ small  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2   ┆ red   ┆ medium │
└─────┴───────┴────────┘

We can specify multiple conditions in filter using the & operator:

multi_filter_df = df.filter((pl.col("id") <= 2) & (pl.col("size") == "small"))
print(multi_filter_df)
shape: (1, 3)
┌─────┬───────┬───────┐
│ id  ┆ color ┆ size  │
│ --- ┆ ---   ┆ ---   │
│ i64 ┆ str   ┆ str   │
╞═════╪═══════╪═══════╡
│ 1   ┆ blue  ┆ small │
└─────┴───────┴───────┘

Selecting columns with the select method

We select columns using the select method. In the select method we can specify the columns with:

  • a (string) column name
  • a list of (string) column names
  • a boolean list of the same length as the number of columns
  • an expression such as a condition on the column name
  • a Series

Select a single column

single_select_df = df.select("id")
print(single_select_df)
shape: (3, 1)
┌─────┐
│ id  │
│ --- │
│ i64 │
╞═════╡
│ 1   │
├╌╌╌╌╌┤
│ 2   │
├╌╌╌╌╌┤
│ 3   │
└─────┘

Select a list of columns

list_select_df = df.select(["id", "color"])
print(list_select_df)
shape: (3, 2)
┌─────┬───────┐
│ id  ┆ color │
│ --- ┆ ---   │
│ i64 ┆ str   │
╞═════╪═══════╡
│ 1   ┆ blue  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ red   │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3   ┆ green │
└─────┴───────┘

Select based on a Boolean list:

boolean_list_select_df = df.select(
    [column for column in df.columns if column.startswith("col")],
)
print(boolean_list_select_df)
shape: (3, 1)
┌───────┐
│ color │
│ ---   │
│ str   │
╞═══════╡
│ blue  │
├╌╌╌╌╌╌╌┤
│ red   │
├╌╌╌╌╌╌╌┤
│ green │
└───────┘

Select columns with an expression

To select based on a condition on the column name:

condition_select_df = df.select(pl.col("^col.*$"))
print(condition_select_df)
shape: (3, 1)
┌───────┐
│ color │
│ ---   │
│ str   │
╞═══════╡
│ blue  │
├╌╌╌╌╌╌╌┤
│ red   │
├╌╌╌╌╌╌╌┤
│ green │
└───────┘

To select based on the dtype of the columns:

dtype_select_df = df.select(pl.col(pl.Int64))
print(dtype_select_df)
shape: (3, 1)
┌─────┐
│ id  │
│ --- │
│ i64 │
╞═════╡
│ 1   │
├╌╌╌╌╌┤
│ 2   │
├╌╌╌╌╌┤
│ 3   │
└─────┘

Selecting rows and columns

We can combine the filter and select methods to select rows and columns

expression_df = df.filter(pl.col("id") <= 2).select(["id", "color"])
print(expression_df)
shape: (2, 2)
┌─────┬───────┐
│ id  ┆ color │
│ --- ┆ ---   │
│ i64 ┆ str   │
╞═════╪═══════╡
│ 1   ┆ blue  │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2   ┆ red   │
└─────┴───────┘

Query optimization

In lazy mode the query optimizer may be able to optimize the query based on the expressions.

In this example we scan a CSV file with many columns using scan_csv and then select a subset of them. The query optimizer creates a query plan that causes only the selected columns to be read from the CSV - see how the Project part of the query plan below states that only 1 of 2 columns will be read:

lazy_select_df = pl.scan_csv("data/appleStock.csv").select(["Date"])
print(lazy_select_df.describe_optimized_plan())
FAST PROJECTION 
CSV SCAN data/appleStock.csv; PROJECT 1/2 COLUMNS; SELECTION: None

If you specify two separate filter conditions the query optimizer will combine them into a single joint condition (see the Selection part of the query plan below):

lazy_filter_df = (
    pl.scan_csv("data/appleStock.csv")
    .filter(
        pl.col("Date") == datetime(1995, 10, 16),
    )
    .filter(pl.col("Close") > 100)
)
print(lazy_filter_df.describe_optimized_plan())
CSV SCAN data/appleStock.csv; PROJECT 2/2 COLUMNS; SELECTION: Some([([(col("Date")) == (813801600000000000i64.strict_cast(Datetime(Microseconds, None)).map().cast(Utf8))]) & ([(col("Close")) > (100f64)])])