Series & DataFrames
The core data structures provided by Polars are Series and DataFrames.
Series
Series are a 1-dimensional data structure. Within a Series, all elements have the same data type (e.g. int, string).
The snippet below shows how to create a simple named Series object. In a later section of this getting started guide we will learn how to read data from external sources (e.g. files, databases); for now, let's keep it simple.
shape: (5,)
Series: 'a' [i64]
[
    1
    2
    3
    4
    5
]
Methods
Although it is more common to work directly on a DataFrame object, Series implement a number of base methods which make it easy to perform transformations. Below are some examples of common operations you might want to perform. Note that these are for illustration purposes and only show a small subset of what is available.
Aggregations
Series support all basic aggregations out of the box (e.g. min, max, mean, mode, ...).
1
5
String
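There are a number of methods related to string operations in the StringNameSpace, accessed through the str attribute. These only work on Series with the String data type (called Utf8 in older Polars releases).
import polars as pl

s = pl.Series("a", ["polar", "bear", "arctic", "polar fox", "polar bear"])
# Replace the first match of "polar" in each element
s2 = s.str.replace("polar", "pola")
print(s2)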
// Rust: this operation is not directly available on the Series object yet, only on the DataFrame
shape: (5,)
Series: 'a' [str]
[
    "pola"
    "bear"
    "arctic"
    "pola fox"
    "pola bear"
]
Datetime
Similar to strings, there is a separate namespace for datetime-related operations in the DateLikeNameSpace, accessed through the dt attribute. These only work on Series with date-related data types.
import polars as pl
from datetime import datetime

start = datetime(2001, 1, 1)
stop = datetime(2001, 1, 9)
s = pl.date_range(start, stop, interval="2d", eager=True)
# dt.day() returns a new Series with the day of the month; s itself is unchanged
days = s.dt.day()
print(s)
// Rust: this operation is not directly available on the Series object yet, only on the DataFrame
shape: (5,)
Series: 'date' [datetime[μs]]
[
    2001-01-01 00:00:00
    2001-01-03 00:00:00
    2001-01-05 00:00:00
    2001-01-07 00:00:00
    2001-01-09 00:00:00
]
DataFrame
A DataFrame is a 2-dimensional data structure that is backed by Series, and it can be seen as an abstraction over a collection (e.g. a list) of Series. Operations that can be executed on a DataFrame are very similar to those in a SQL-like query. You can GROUP BY, JOIN, PIVOT, but also define custom functions. In the next pages we will cover how to perform these transformations.
import polars as pl
from datetime import datetime

df = pl.DataFrame(
    {
        "integer": [1, 2, 3, 4, 5],
        "date": [
            datetime(2022, 1, 1),
            datetime(2022, 1, 2),
            datetime(2022, 1, 3),
            datetime(2022, 1, 4),
            datetime(2022, 1, 5),
        ],
        "float": [4.0, 5.0, 6.0, 7.0, 8.0],
    }
)
print(df)
use chrono::prelude::*;
use polars::prelude::*;

let df: DataFrame = df!(
    "integer" => &[1, 2, 3, 4, 5],
    "date" => &[
        NaiveDate::from_ymd_opt(2022, 1, 1).unwrap().and_hms_opt(0, 0, 0).unwrap(),
        NaiveDate::from_ymd_opt(2022, 1, 2).unwrap().and_hms_opt(0, 0, 0).unwrap(),
        NaiveDate::from_ymd_opt(2022, 1, 3).unwrap().and_hms_opt(0, 0, 0).unwrap(),
        NaiveDate::from_ymd_opt(2022, 1, 4).unwrap().and_hms_opt(0, 0, 0).unwrap(),
        NaiveDate::from_ymd_opt(2022, 1, 5).unwrap().and_hms_opt(0, 0, 0).unwrap()
    ],
    "float" => &[4.0, 5.0, 6.0, 7.0, 8.0],
)
.unwrap();
println!("{}", df);
shape: (5, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date                ┆ float │
│ ---     ┆ ---                 ┆ ---   │
│ i64     ┆ datetime[μs]        ┆ f64   │
╞═════════╪═════════════════════╪═══════╡
│ 1       ┆ 2022-01-01 00:00:00 ┆ 4.0   │
│ 2       ┆ 2022-01-02 00:00:00 ┆ 5.0   │
│ 3       ┆ 2022-01-03 00:00:00 ┆ 6.0   │
│ 4       ┆ 2022-01-04 00:00:00 ┆ 7.0   │
│ 5       ┆ 2022-01-05 00:00:00 ┆ 8.0   │
└─────────┴─────────────────────┴───────┘
Viewing data
This part focuses on viewing data in a DataFrame. We will use the DataFrame from the previous example as a starting point.
Head
By default, the head function shows the first 5 rows of a DataFrame. You can specify the number of rows you want to see (e.g. df.head(10)).
shape: (3, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date                ┆ float │
│ ---     ┆ ---                 ┆ ---   │
│ i64     ┆ datetime[μs]        ┆ f64   │
╞═════════╪═════════════════════╪═══════╡
│ 1       ┆ 2022-01-01 00:00:00 ┆ 4.0   │
│ 2       ┆ 2022-01-02 00:00:00 ┆ 5.0   │
│ 3       ┆ 2022-01-03 00:00:00 ┆ 6.0   │
└─────────┴─────────────────────┴───────┘
Tail
By default, the tail function shows the last 5 rows of a DataFrame. As with head, you can specify the number of rows you want to see.
shape: (3, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date                ┆ float │
│ ---     ┆ ---                 ┆ ---   │
│ i64     ┆ datetime[μs]        ┆ f64   │
╞═════════╪═════════════════════╪═══════╡
│ 3       ┆ 2022-01-03 00:00:00 ┆ 6.0   │
│ 4       ┆ 2022-01-04 00:00:00 ┆ 7.0   │
│ 5       ┆ 2022-01-05 00:00:00 ┆ 8.0   │
└─────────┴─────────────────────┴───────┘
Sample
If you want to get an impression of the data in your DataFrame, you can also use sample, which returns n random rows from the DataFrame.
shape: (2, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date                ┆ float │
│ ---     ┆ ---                 ┆ ---   │
│ i64     ┆ datetime[μs]        ┆ f64   │
╞═════════╪═════════════════════╪═══════╡
│ 5       ┆ 2022-01-05 00:00:00 ┆ 8.0   │
│ 4       ┆ 2022-01-04 00:00:00 ┆ 7.0   │
└─────────┴─────────────────────┴───────┘
Describe
Describe returns summary statistics of your DataFrame. Statistics that are not defined for a column's data type (e.g. the mean of the date column here) are reported as null.
print(df.describe())
// Rust: describe is available on the feature flag `describe`
println!("{:?}", df.describe(None));
shape: (9, 4)
┌────────────┬──────────┬─────────────────────┬──────────┐
│ describe   ┆ integer  ┆ date                ┆ float    │
│ ---        ┆ ---      ┆ ---                 ┆ ---      │
│ str        ┆ f64      ┆ str                 ┆ f64      │
╞════════════╪══════════╪═════════════════════╪══════════╡
│ count      ┆ 5.0      ┆ 5                   ┆ 5.0      │
│ null_count ┆ 0.0      ┆ 0                   ┆ 0.0      │
│ mean       ┆ 3.0      ┆ null                ┆ 6.0      │
│ std        ┆ 1.581139 ┆ null                ┆ 1.581139 │
│ min        ┆ 1.0      ┆ 2022-01-01 00:00:00 ┆ 4.0      │
│ 25%        ┆ 2.0      ┆ null                ┆ 5.0      │
│ 50%        ┆ 3.0      ┆ null                ┆ 6.0      │
│ 75%        ┆ 4.0      ┆ null                ┆ 7.0      │
│ max        ┆ 5.0      ┆ 2022-01-05 00:00:00 ┆ 8.0      │
└────────────┴──────────┴─────────────────────┴──────────┘