polars.read_csv_batched(
source: str | Path,
has_header: bool = True,
columns: Sequence[int] | Sequence[str] | None = None,
new_columns: Sequence[str] | None = None,
separator: str = ',',
comment_char: str | None = None,
quote_char: str | None = '"',
skip_rows: int = 0,
dtypes: Mapping[str, PolarsDataType] | Sequence[PolarsDataType] | None = None,
null_values: str | Sequence[str] | dict[str, str] | None = None,
missing_utf8_is_empty_string: bool = False,
ignore_errors: bool = False,
try_parse_dates: bool = False,
n_threads: int | None = None,
infer_schema_length: int | None = 100,
batch_size: int = 50000,
n_rows: int | None = None,
encoding: CsvEncoding | str = 'utf8',
low_memory: bool = False,
rechunk: bool = True,
skip_rows_after_header: int = 0,
row_count_name: str | None = None,
row_count_offset: int = 0,
sample_size: int = 1024,
eol_char: str = '\n',
raise_if_empty: bool = True,
) -> BatchedCsvReader

Read a CSV file in batches.

Upon creation of the BatchedCsvReader, Polars will gather statistics and determine the file chunks. After that, work will only be done if next_batches is called, which will return a list of n frames of the given batch size.


Parameters

source
Path to a file or a file-like object (by file-like object, we refer to objects that have a read() method, such as a file handler (e.g. via the builtin open function) or BytesIO). If fsspec is installed, it will be used to open remote files.


has_header
Indicate if the first row of the dataset is a header or not. If set to False, column names will be autogenerated in the following format: column_x, with x being an enumeration over every column in the dataset starting at 1.


columns
Columns to select. Accepts a list of column indices (starting at zero) or a list of column names.


new_columns
Rename columns right after parsing the CSV file. If the given list is shorter than the width of the DataFrame, the remaining columns will keep their original names.


separator
Single byte character to use as delimiter in the file.


comment_char
Single byte character that indicates the start of a comment line, for instance #.


quote_char
Single byte character used for CSV quoting; defaults to ". Set to None to turn off special handling and escaping of quotes.


skip_rows
Start reading after skip_rows lines.


dtypes
Overwrite dtypes during inference.


null_values
Values to interpret as null values. You can provide a:

  • str: All values equal to this string will be null.

  • List[str]: All values equal to any string in this list will be null.

  • Dict[str, str]: A dictionary that maps column name to a null value string.


missing_utf8_is_empty_string
By default a missing value is considered to be null; if you would prefer missing utf8 values to be treated as the empty string, set this parameter to True.


ignore_errors
Try to keep reading lines if some lines yield errors. First try infer_schema_length=0 to read all columns as pl.Utf8 and check which values might cause an issue.


try_parse_dates
Try to automatically parse dates. Most ISO8601-like formats can be inferred, as well as a handful of others. If this does not succeed, the column remains of data type pl.Utf8.


n_threads
Number of threads to use in CSV parsing. Defaults to the number of physical CPUs of your system.


infer_schema_length
Maximum number of lines to read to infer schema. If set to 0, all columns will be read as pl.Utf8. If set to None, a full table scan will be done (slow).


batch_size
Number of lines to read into the buffer at once. Modify this to change performance.


n_rows
Stop reading from CSV file after reading n_rows. During multi-threaded parsing, an upper bound of n_rows rows cannot be guaranteed.

encoding : {'utf8', 'utf8-lossy', …}

Lossy means that invalid utf8 values are replaced with the replacement character (�). When using encodings other than utf8 or utf8-lossy, the input is first decoded in memory with Python. Defaults to utf8.


low_memory
Reduce memory usage at the expense of performance.


rechunk
Make sure that all columns are contiguous in memory by aggregating the chunks into a single array.


skip_rows_after_header
Skip this number of rows when the header is parsed.


row_count_name
If not None, this will insert a row count column with the given name into the DataFrame.


row_count_offset
Offset to start the row_count column (only used if the name is set).


sample_size
Set the sample size. This is used to sample statistics to estimate the allocation needed.


eol_char
Single byte end of line character.


raise_if_empty
When there is no data in the source, NoDataError is raised. If this parameter is set to False, None will be returned from next_batches(n) instead.


See also


scan_csv
Lazily read from a CSV file or multiple files via glob patterns.


>>> reader = pl.read_csv_batched(
...     "./tpch/tables_scale_100/lineitem.tbl", separator="|", try_parse_dates=True
... )  
>>> batches = reader.next_batches(5)  
>>> for df in batches:  
...     print(df)

Read big CSV file in batches and write a CSV file for each “group” of interest.

>>> seen_groups = set()
>>> reader = pl.read_csv_batched("big_file.csv")  
>>> batches = reader.next_batches(100)  
>>> while batches:  
...     df_current_batches = pl.concat(batches)
...     partition_dfs = df_current_batches.partition_by("group", as_dict=True)
...     for group, df in partition_dfs.items():
...         if group in seen_groups:
...             with open(f"./data/{group}.csv", "a") as fh:
...                 fh.write(df.write_csv(file=None, has_header=False))
...         else:
...             df.write_csv(file=f"./data/{group}.csv", has_header=True)
...         seen_groups.add(group)
...     batches = reader.next_batches(100)