polars.scan_csv(file: str | Path, has_header: bool = True, sep: str = ',', comment_char: str | None = None, quote_char: str | None = '"', skip_rows: int = 0, dtypes: SchemaDict | None = None, null_values: str | list[str] | dict[str, str] | None = None, missing_utf8_is_empty_string: bool = False, ignore_errors: bool = False, cache: bool = True, with_column_names: Callable[[list[str]], list[str]] | None = None, infer_schema_length: int | None = 100, n_rows: int | None = None, encoding: CsvEncoding = 'utf8', low_memory: bool = False, rechunk: bool = True, skip_rows_after_header: int = 0, row_count_name: str | None = None, row_count_offset: int = 0, parse_dates: bool = False, eol_char: str = '\n') → LazyFrame

Lazily read from a CSV file or multiple files via glob patterns.

This allows the query optimizer to push down predicates and projections to the scan level, thereby potentially reducing memory overhead.
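For instance, one call can lazily scan many files at once (the glob pattern and column name below are hypothetical):

>>> (
...     pl.scan_csv("data/part_*.csv")  # hypothetical glob over many CSV files
...     .select(["a"])  # projection pushdown: only column "a" is read from disk
...     .collect()
... )  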


Parameters

file
Path to a file.


has_header
Indicate if the first row of the dataset is a header. If set to False, column names will be autogenerated in the following format: column_x, with x being an enumeration over every column in the dataset, starting at 1.
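A minimal sketch, assuming a hypothetical headerless file; the generated names can then be referenced directly:

>>> pl.scan_csv("no_header.csv", has_header=False).select(["column_1"]).fetch()  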


sep
Single byte character to use as delimiter in the file.
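For example, a tab-separated file (hypothetical name) can be scanned by overriding the delimiter:

>>> pl.scan_csv("data.tsv", sep="\t").fetch()  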


comment_char
Single byte character that indicates the start of a comment line, for instance #.
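For instance, to skip metadata lines starting with # in a hypothetical file:

>>> pl.scan_csv("commented.csv", comment_char="#").fetch()  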


quote_char
Single byte character used for CSV quoting; default = ". Set to None to turn off special handling and escaping of quotes.
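A sketch disabling quote handling, which can speed up parsing when a (hypothetical) file is known to contain no quoted fields:

>>> pl.scan_csv("unquoted.csv", quote_char=None).fetch()  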


skip_rows
Start reading after skip_rows lines. The header will be parsed at this offset.


dtypes
Overwrite dtypes during inference.
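For example, forcing specific types instead of the inferred ones (file and column names are hypothetical):

>>> pl.scan_csv("data.csv", dtypes={"a": pl.Float64, "b": pl.Utf8}).fetch()  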


null_values
Values to interpret as null values (see the sketch after this list). You can provide:

  • str: All values equal to this string will be null.

  • List[str]: All values equal to any string in this list will be null.

  • Dict[str, str]: A dictionary that maps column name to a null value string.
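A minimal sketch of the list form, assuming a hypothetical file:

>>> pl.scan_csv("data.csv", null_values=["NA", "n/a"]).fetch()  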


missing_utf8_is_empty_string
By default a missing value is considered to be null; if you would prefer missing utf8 values to be treated as the empty string, set this parameter to True.


ignore_errors
Try to keep reading lines if some lines yield errors. First try infer_schema_length=0 to read all columns as pl.Utf8 to check which values might cause an issue.
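A sketch of that debugging flow, assuming a hypothetical messy.csv:

>>> lf = pl.scan_csv("messy.csv", infer_schema_length=0)  # every column read as pl.Utf8
>>> lf.fetch()  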


cache
Cache the result after reading.


with_column_names
Apply a function over the column names. This can be used to update a schema just in time, i.e. before scanning.


infer_schema_length
Maximum number of lines to read to infer schema. If set to 0, all columns will be read as pl.Utf8. If set to None, a full table scan will be done (slow).


n_rows
Stop reading from the CSV file after reading n_rows.

encoding : {'utf8', 'utf8-lossy'}
Lossy means that invalid utf8 values are replaced with � characters. Defaults to 'utf8'.


low_memory
Reduce memory usage at the expense of performance.


rechunk
Reallocate to contiguous memory when all chunks/files are parsed.


skip_rows_after_header
Skip this number of rows when the header is parsed.


row_count_name
If not None, this will insert a row count column with the given name into the DataFrame.


row_count_offset
Offset to start the row_count column (only used if the name is set).
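For example, adding a row count column named "id" that starts at 1 (file name is hypothetical):

>>> pl.scan_csv("data.csv", row_count_name="id", row_count_offset=1).fetch()  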


parse_dates
Try to automatically parse dates. If this does not succeed, the column remains of data type pl.Utf8.
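A sketch enabling date parsing (hypothetical file; columns that fail to parse stay pl.Utf8):

>>> pl.scan_csv("events.csv", parse_dates=True).fetch()  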


eol_char
Single byte end-of-line character.


See also

read_csv
Read a CSV file into a DataFrame.


Examples

>>> import pathlib
>>> (
...     pl.scan_csv("my_long_file.csv")  # lazy, doesn't do a thing
...     .select(
...         ["a", "c"]
...     )  # select only 2 columns (other columns will not be read)
...     .filter(
...         pl.col("a") > 10
...     )  # the filter is pushed down the scan, so less data is read into memory
...     .fetch(100)  # pushed a limit of 100 rows to the scan level
... )  

We can use with_column_names to modify the header before scanning:

>>> df = pl.DataFrame(
...     {"BrEeZaH": [1, 2, 3, 4], "LaNgUaGe": ["is", "terrible", "to", "read"]}
... )
>>> dirpath = pathlib.Path(".")  # any writable directory; adjust as needed
>>> path: pathlib.Path = dirpath / "mydf.csv"
>>> df.write_csv(path)
>>> pl.scan_csv(
...     path, with_column_names=lambda cols: [col.lower() for col in cols]
... ).fetch()
shape: (4, 2)
┌─────────┬──────────┐
│ breezah ┆ language │
│ ---     ┆ ---      │
│ i64     ┆ str      │
╞═════════╪══════════╡
│ 1       ┆ is       │
│ 2       ┆ terrible │
│ 3       ┆ to       │
│ 4       ┆ read     │
└─────────┴──────────┘
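Note that fetch only materializes a limited number of rows for quick inspection; use collect to run the full query and return a complete DataFrame:

>>> pl.scan_csv(path).collect()  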