polars.scan_csv

polars.scan_csv(file: Union[str, pathlib.Path], has_header: bool = True, sep: str = ',', comment_char: Optional[str] = None, quote_char: Optional[str] = '"', skip_rows: int = 0, dtypes: Optional[Dict[str, Type[polars.datatypes.DataType]]] = None, null_values: Optional[Union[str, List[str], Dict[str, str]]] = None, ignore_errors: bool = False, cache: bool = True, with_column_names: Optional[Callable[[List[str]], List[str]]] = None, infer_schema_length: Optional[int] = 100, n_rows: Optional[int] = None, encoding: str = 'utf8', low_memory: bool = False, rechunk: bool = True, skip_rows_after_header: int = 0, row_count_name: Optional[str] = None, row_count_offset: int = 0, parse_dates: bool = False, **kwargs: Any) → polars.internals.lazy_frame.LazyFrame

Lazily read from a CSV file or multiple files via glob patterns.

This allows the query optimizer to push down predicates and projections to the scan level, thereby potentially reducing memory overhead.

Parameters
file

Path to a file, or a glob pattern matching multiple files.

has_header

Indicate if the first row of the dataset is a header or not. If set to False, column names will be autogenerated in the following format: column_x, with x being an enumeration over every column in the dataset, starting at 1.
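
For example, a headerless file can be scanned with autogenerated names (a minimal sketch; data.csv is a hypothetical path):

>>> lf = pl.scan_csv("data.csv", has_header=False)  # columns named column_1, column_2, ...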

sep

Single byte character to use as the delimiter in the file.

comment_char

Single byte character that indicates the start of a comment line, for instance #.

quote_char

Single byte character used for CSV quoting; default = ". Set to None to turn off special handling and escaping of quotes.
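
These options compose; for instance, a semicolon-delimited file with # comment lines and no quote handling (a sketch, again using a hypothetical data.csv):

>>> lf = pl.scan_csv("data.csv", sep=";", comment_char="#", quote_char=None)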

skip_rows

Start reading after skip_rows lines. The header will be parsed at this offset.

dtypes

Overwrite dtypes during inference.
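
For example, forcing a column to a dtype other than the inferred one (a sketch; the column name a is illustrative):

>>> lf = pl.scan_csv("data.csv", dtypes={"a": pl.Float64})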

null_values

Values to interpret as null values. You can provide a:

  • str: All values equal to this string will be null.

  • List[str]: A null value per column.

  • Dict[str, str]: A dictionary that maps column name to a null value string.
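
For example, the three forms side by side (a sketch on a hypothetical data.csv with columns a and b):

>>> lf = pl.scan_csv("data.csv", null_values="NA")  # every "NA" becomes null
>>> lf = pl.scan_csv("data.csv", null_values=["NA", "-"])  # one null value per column
>>> lf = pl.scan_csv("data.csv", null_values={"a": "NA", "b": "-"})  # per column name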

ignore_errors

Try to keep reading lines if some lines yield errors. First try infer_schema_length=0 to read all columns as pl.Utf8 to check which values might cause an issue.
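
A debugging sketch following that advice (data.csv is hypothetical):

>>> lf = pl.scan_csv("data.csv", infer_schema_length=0)  # read everything as pl.Utf8 to inspect problem values
>>> lf = pl.scan_csv("data.csv", ignore_errors=True)  # then keep reading past lines that yield errors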

cache

Cache the result after reading.

with_column_names

Apply a function over the column names. This can be used to update a schema just in time, i.e. right before scanning.

infer_schema_length

Maximum number of lines to read to infer schema. If set to 0, all columns will be read as pl.Utf8. If set to None, a full table scan will be done (slow).

n_rows

Stop reading from the CSV file after reading n_rows.

encoding

Allowed encodings: utf8 or utf8-lossy. Lossy means that invalid utf8 values are replaced with � characters.

low_memory

Reduce memory usage at the expense of performance.

rechunk

Reallocate to contiguous memory when all chunks/files are parsed.

skip_rows_after_header

Skip this number of rows after the header has been parsed.
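
The two skip options compose: skip_rows discards lines before the header is read, while skip_rows_after_header discards data rows that follow it (a sketch; data.csv is hypothetical):

>>> lf = pl.scan_csv("data.csv", skip_rows=2, skip_rows_after_header=1)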

row_count_name

If not None, this will insert a row count column with the given name into the DataFrame.

row_count_offset

Offset to start the row_count column (only used if row_count_name is set).
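
For example, numbering rows starting at 1 (a sketch; the column name row_nr is illustrative):

>>> lf = pl.scan_csv("data.csv", row_count_name="row_nr", row_count_offset=1)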

parse_dates

Try to automatically parse dates. If this does not succeed, the column remains of data type pl.Utf8.
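
For example (a sketch; data.csv is a hypothetical file containing date-like columns):

>>> lf = pl.scan_csv("data.csv", parse_dates=True)  # columns that parse get a temporal dtype; the rest stay pl.Utf8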

Examples

>>> (
...     pl.scan_csv("my_long_file.csv")  # lazy, doesn't do a thing
...     .select(
...         ["a", "c"]
...     )  # select only 2 columns (other columns will not be read)
...     .filter(
...         pl.col("a") > 10
...     )  # the filter is pushed down to the scan, so less data is read into memory
...     .fetch(100)  # pushes a limit of 100 rows to the scan level
... )  

We can use with_column_names to modify the header before scanning:

>>> df = pl.DataFrame(
...     {"BrEeZaH": [1, 2, 3, 4], "LaNgUaGe": ["is", "terrible", "to", "read"]}
... )
>>> df.to_csv("mydf.csv")
>>> pl.scan_csv(
...     "mydf.csv", with_column_names=lambda cols: [col.lower() for col in cols]
... ).fetch()
shape: (4, 2)
┌─────────┬──────────┐
│ breezah ┆ language │
│ ---     ┆ ---      │
│ i64     ┆ str      │
╞═════════╪══════════╡
│ 1       ┆ is       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2       ┆ terrible │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3       ┆ to       │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 4       ┆ read     │
└─────────┴──────────┘