Series & DataFrames
The core data structures provided by Polars are Series and DataFrames.
Series
Series are a 1-dimensional data structure. Within a Series all elements have the same data type (e.g. int, string).
The snippet below shows how to create a simple named Series object. In a later section of this getting started guide we will learn how to read data from external sources (e.g. files, databases); for now, let's keep it simple.
shape: (5,)
Series: 'a' [i64]
[
1
2
3
4
5
]
Methods
Although it is more common to work directly on a DataFrame object, Series implement a number of base methods that make it easy to perform transformations. Below are some examples of common operations you might want to perform. Note that these are for illustration purposes and only show a small subset of what is available.
Aggregations
Series supports all basic aggregations out of the box (e.g. min, max, mean, mode, ...).
1
5
String
There are a number of methods related to string operations in the StringNamespace. These only work on Series with the data type Utf8.
s = pl.Series("a", ["polar", "bear", "arctic", "polar fox", "polar bear"])
s2 = s.str.replace("polar", "pola")
print(s2)
// This operation is not directly available on the Series object yet, only on the DataFrame
shape: (5,)
Series: 'a' [str]
[
"pola"
"bear"
"arctic"
"pola fox"
"pola bear"
]
Datetime
Similar to strings, there is a separate namespace for datetime-related operations in the DateLikeNameSpace. These only work on Series with data types related to dates.
from datetime import datetime
start = datetime(2001, 1, 1)
stop = datetime(2001, 1, 9)
s = pl.date_range(start, stop, interval="2d", eager=True)
# dt.day() returns a new Series holding the day of the month; s itself is unchanged
days = s.dt.day()
print(s)
// This operation is not directly available on the Series object yet, only on the DataFrame
shape: (5,)
Series: 'date' [datetime[μs]]
[
2001-01-01 00:00:00
2001-01-03 00:00:00
2001-01-05 00:00:00
2001-01-07 00:00:00
2001-01-09 00:00:00
]
DataFrame
A DataFrame is a 2-dimensional data structure that is backed by Series, and it can be seen as an abstraction of a collection (e.g. list) of Series. Operations that can be executed on a DataFrame are very similar to what is done in a SQL-like query. You can GROUP BY, JOIN, PIVOT, but also define custom functions. In the next pages we will cover how to perform these transformations.
from datetime import datetime
df = pl.DataFrame(
{
"integer": [1, 2, 3, 4, 5],
"date": [
datetime(2022, 1, 1),
datetime(2022, 1, 2),
datetime(2022, 1, 3),
datetime(2022, 1, 4),
datetime(2022, 1, 5),
],
"float": [4.0, 5.0, 6.0, 7.0, 8.0],
}
)
print(df)
use chrono::prelude::*;
let df: DataFrame = df!(
"integer" => &[1, 2, 3, 4, 5],
"date" => &[
NaiveDate::from_ymd_opt(2022, 1, 1).unwrap().and_hms_opt(0, 0, 0).unwrap(),
NaiveDate::from_ymd_opt(2022, 1, 2).unwrap().and_hms_opt(0, 0, 0).unwrap(),
NaiveDate::from_ymd_opt(2022, 1, 3).unwrap().and_hms_opt(0, 0, 0).unwrap(),
NaiveDate::from_ymd_opt(2022, 1, 4).unwrap().and_hms_opt(0, 0, 0).unwrap(),
NaiveDate::from_ymd_opt(2022, 1, 5).unwrap().and_hms_opt(0, 0, 0).unwrap()
],
"float" => &[4.0, 5.0, 6.0, 7.0, 8.0],
)
.unwrap();
println!("{}", df);
shape: (5, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date ┆ float │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ f64 │
╞═════════╪═════════════════════╪═══════╡
│ 1 ┆ 2022-01-01 00:00:00 ┆ 4.0 │
│ 2 ┆ 2022-01-02 00:00:00 ┆ 5.0 │
│ 3 ┆ 2022-01-03 00:00:00 ┆ 6.0 │
│ 4 ┆ 2022-01-04 00:00:00 ┆ 7.0 │
│ 5 ┆ 2022-01-05 00:00:00 ┆ 8.0 │
└─────────┴─────────────────────┴───────┘
Viewing data
This part focuses on viewing data in a DataFrame. We will use the DataFrame from the previous example as a starting point.
Head
The head function shows the first 5 rows of a DataFrame by default. You can specify the number of rows you want to see (e.g. df.head(10)).
shape: (3, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date ┆ float │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ f64 │
╞═════════╪═════════════════════╪═══════╡
│ 1 ┆ 2022-01-01 00:00:00 ┆ 4.0 │
│ 2 ┆ 2022-01-02 00:00:00 ┆ 5.0 │
│ 3 ┆ 2022-01-03 00:00:00 ┆ 6.0 │
└─────────┴─────────────────────┴───────┘
Tail
The tail function shows the last 5 rows of a DataFrame by default. You can also specify the number of rows you want to see, similar to head.
shape: (3, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date ┆ float │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ f64 │
╞═════════╪═════════════════════╪═══════╡
│ 3 ┆ 2022-01-03 00:00:00 ┆ 6.0 │
│ 4 ┆ 2022-01-04 00:00:00 ┆ 7.0 │
│ 5 ┆ 2022-01-05 00:00:00 ┆ 8.0 │
└─────────┴─────────────────────┴───────┘
Sample
If you want to get an impression of the data in your DataFrame, you can also use sample. With sample you get n random rows from the DataFrame.
shape: (2, 3)
┌─────────┬─────────────────────┬───────┐
│ integer ┆ date ┆ float │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ f64 │
╞═════════╪═════════════════════╪═══════╡
│ 5 ┆ 2022-01-05 00:00:00 ┆ 8.0 │
│ 4 ┆ 2022-01-04 00:00:00 ┆ 7.0 │
└─────────┴─────────────────────┴───────┘
Describe
Describe returns summary statistics of your DataFrame. It will provide several quick statistics if possible.
print(df.describe())
describe · Available on feature describe
println!("{:?}", df.describe(None));
shape: (9, 4)
┌────────────┬──────────┬─────────────────────┬──────────┐
│ describe ┆ integer ┆ date ┆ float │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ str ┆ f64 │
╞════════════╪══════════╪═════════════════════╪══════════╡
│ count ┆ 5.0 ┆ 5 ┆ 5.0 │
│ null_count ┆ 0.0 ┆ 0 ┆ 0.0 │
│ mean ┆ 3.0 ┆ null ┆ 6.0 │
│ std ┆ 1.581139 ┆ null ┆ 1.581139 │
│ min ┆ 1.0 ┆ 2022-01-01 00:00:00 ┆ 4.0 │
│ 25% ┆ 2.0 ┆ null ┆ 5.0 │
│ 50% ┆ 3.0 ┆ null ┆ 6.0 │
│ 75% ┆ 4.0 ┆ null ┆ 7.0 │
│ max ┆ 5.0 ┆ 2022-01-05 00:00:00 ┆ 8.0 │
└────────────┴──────────┴─────────────────────┴──────────┘