polars.scan_csv#

polars.scan_csv(file: str | Path, has_header: bool = True, sep: str = ',', comment_char: str | None = None, quote_char: str | None = '"', skip_rows: int = 0, dtypes: dict[str, PolarsDataType] | None = None, null_values: str | list[str] | dict[str, str] | None = None, ignore_errors: bool = False, cache: bool = True, with_column_names: Callable[[list[str]], list[str]] | None = None, infer_schema_length: int | None = 100, n_rows: int | None = None, encoding: CsvEncoding = 'utf8', low_memory: bool = False, rechunk: bool = True, skip_rows_after_header: int = 0, row_count_name: str | None = None, row_count_offset: int = 0, parse_dates: bool = False, eol_char: str = '\n') LazyFrame[source]#

Lazily read from a CSV file or multiple files via glob patterns.

This allows the query optimizer to push down predicates and projections to the scan level, thereby potentially reducing memory overhead.

Parameters:
file

Path to a file.

has_header

Indicate if the first row of the dataset is a header. If set to False, column names are autogenerated in the format column_x, where x enumerates the columns starting at 1.

sep

Single byte character to use as delimiter in the file.

comment_char

Single byte character that indicates the start of a comment line, for instance #.

quote_char

Single byte character used for CSV quoting; the default is ". Set to None to turn off special handling and escaping of quotes.

skip_rows

Start reading after skip_rows lines. The header will be parsed at this offset.

dtypes

Overwrite dtypes during inference.

null_values

Values to interpret as null values. You can provide a:

  • str: All values equal to this string will be null.

  • List[str]: All values equal to any string in this list will be null.

  • Dict[str, str]: A dictionary that maps column name to a null value string.

ignore_errors

Try to keep reading lines if some lines yield errors. First try infer_schema_length=0 to read all columns as pl.Utf8 to check which values might cause an issue.

cache

Cache the result after reading.

with_column_names

Apply a function over the column names. This can be used to update the schema just in time, i.e. before scanning.

infer_schema_length

Maximum number of lines to read to infer schema. If set to 0, all columns will be read as pl.Utf8. If set to None, a full table scan will be done (slow).

n_rows

Stop reading from CSV file after reading n_rows.

encoding : {'utf8', 'utf8-lossy'}

Lossy means that invalid utf8 values are replaced with "�" characters. Defaults to "utf8".

low_memory

Reduce memory usage at the expense of performance.

rechunk

Reallocate to contiguous memory when all chunks/files are parsed.

skip_rows_after_header

Skip this number of rows after the header has been parsed.

row_count_name

If not None, this will insert a row count column with the given name into the DataFrame.

row_count_offset

Offset to start the row_count column (only used if the name is set).

parse_dates

Try to automatically parse dates. If this does not succeed, the column remains of data type pl.Utf8.

eol_char

Single byte end-of-line character.

Returns:
LazyFrame

See also

read_csv

Read a CSV file into a DataFrame.

Examples

>>> import pathlib
>>>
>>> (
...     pl.scan_csv("my_long_file.csv")  # lazy, doesn't do a thing
...     .select(
...         ["a", "c"]
...     )  # select only 2 columns (other columns will not be read)
...     .filter(
...         pl.col("a") > 10
...     )  # the filter is pushed down the scan, so less data is read into memory
...     .fetch(100)  # pushed a limit of 100 rows to the scan level
... )  

We can use with_column_names to modify the header before scanning:

>>> df = pl.DataFrame(
...     {"BrEeZaH": [1, 2, 3, 4], "LaNgUaGe": ["is", "terrible", "to", "read"]}
... )
>>> path = pathlib.Path("mydf.csv")
>>> df.write_csv(path)
>>> pl.scan_csv(
...     path, with_column_names=lambda cols: [col.lower() for col in cols]
... ).fetch()
shape: (4, 2)
┌─────────┬──────────┐
│ breezah ┆ language │
│ ---     ┆ ---      │
│ i64     ┆ str      │
╞═════════╪══════════╡
│ 1       ┆ is       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2       ┆ terrible │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3       ┆ to       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 4       ┆ read     │
└─────────┴──────────┘