polars.read_csv_batched(
source: str | Path,
has_header: bool = True,
columns: Sequence[int] | Sequence[str] | None = None,
new_columns: Sequence[str] | None = None,
separator: str = ',',
comment_char: str | None = None,
quote_char: str | None = '"',
skip_rows: int = 0,
dtypes: Mapping[str, PolarsDataType] | Sequence[PolarsDataType] | None = None,
null_values: str | Sequence[str] | dict[str, str] | None = None,
missing_utf8_is_empty_string: bool = False,
ignore_errors: bool = False,
try_parse_dates: bool = False,
n_threads: int | None = None,
infer_schema_length: int | None = 100,
batch_size: int = 50000,
n_rows: int | None = None,
encoding: CsvEncoding | str = 'utf8',
low_memory: bool = False,
rechunk: bool = True,
skip_rows_after_header: int = 0,
row_count_name: str | None = None,
row_count_offset: int = 0,
sample_size: int = 1024,
eol_char: str = '\n',
raise_if_empty: bool = True,
) -> BatchedCsvReader

Read a CSV file in batches.

Upon creation of the BatchedCsvReader, Polars will gather statistics and determine the file chunks. After that, work will only be done if next_batches is called, which will return a list of n frames of the given batch size.


Parameters

source
Path to a file or a file-like object (by file-like object, we refer to objects that have a read() method, such as a file handler (e.g. via the builtin open function) or BytesIO). If fsspec is installed, it will be used to open remote files.


has_header
Indicate if the first row of the dataset is a header or not. If set to False, column names will be autogenerated in the following format: column_x, with x being an enumeration over every column in the dataset starting at 1.


columns
Columns to select. Accepts a list of column indices (starting at zero) or a list of column names.


new_columns
Rename columns right after parsing the CSV file. If the given list is shorter than the width of the DataFrame, the remaining columns will keep their original names.


separator
Single byte character to use as delimiter in the file.


comment_char
Single byte character that indicates the start of a comment line, for instance #.


quote_char
Single byte character used for CSV quoting; defaults to ". Set to None to turn off special handling and escaping of quotes.


skip_rows
Start reading after skip_rows lines.


dtypes
Overwrite dtypes during inference.


null_values
Values to interpret as null values. You can provide a:

  • str: All values equal to this string will be null.

  • List[str]: All values equal to any string in this list will be null.

  • Dict[str, str]: A dictionary that maps column name to a null value string.


missing_utf8_is_empty_string
By default a missing value is considered to be null; if you would prefer missing utf8 values to be treated as the empty string, set this parameter to True.


ignore_errors
Try to keep reading lines if some lines yield errors. First try infer_schema_length=0 to read all columns as pl.Utf8 and check which values might cause an issue.


try_parse_dates
Try to automatically parse dates. Most ISO8601-like formats can be inferred, as well as a handful of others. If this does not succeed, the column remains of data type pl.Utf8.


n_threads
Number of threads to use in CSV parsing. Defaults to the number of physical CPUs of your system.


infer_schema_length
Maximum number of lines to read to infer schema. If set to 0, all columns will be read as pl.Utf8. If set to None, a full table scan will be done (slow).


batch_size
Number of lines to read into the buffer at once. Modify this to change performance.


n_rows
Stop reading from CSV file after reading n_rows. During multi-threaded parsing, an upper bound of n_rows rows cannot be guaranteed.

encoding : {'utf8', 'utf8-lossy', …}

Lossy means that invalid utf8 values are replaced with the replacement character (�). When using encodings other than utf8 or utf8-lossy, the input is first decoded in memory with Python. Defaults to utf8.


low_memory
Reduce memory usage at the expense of performance.


rechunk
Make sure that all columns are contiguous in memory by aggregating the chunks into a single array.


skip_rows_after_header
Skip this number of rows when the header is parsed.


row_count_name
If not None, this will insert a row count column with the given name into the DataFrame.


row_count_offset
Offset to start the row_count column (only used if the name is set).


sample_size
Set the sample size. This is used to sample statistics to estimate the allocation needed.


eol_char
Single byte end of line character.


raise_if_empty
When there is no data in the source, NoDataError is raised. If this parameter is set to False, None will be returned from next_batches(n) instead.


See also


scan_csv
Lazily read from a CSV file or multiple files via glob patterns.


>>> reader = pl.read_csv_batched(
...     "./tpch/tables_scale_100/lineitem.tbl", separator="|", try_parse_dates=True
... )  
>>> batches = reader.next_batches(5)  
>>> for df in batches:  
...     print(df)

Read big CSV file in batches and write a CSV file for each “group” of interest.

>>> seen_groups = set()
>>> reader = pl.read_csv_batched("big_file.csv")  
>>> batches = reader.next_batches(100)  
>>> while batches:  
...     df_current_batches = pl.concat(batches)
...     partition_dfs = df_current_batches.partition_by("group", as_dict=True)
...     for group, df in partition_dfs.items():
...         if group in seen_groups:
...             with open(f"./data/{group}.csv", "a") as fh:
...                 fh.write(df.write_csv(file=None, has_header=False))
...         else:
...             df.write_csv(file=f"./data/{group}.csv", has_header=True)
...         seen_groups.add(group)
...     batches = reader.next_batches(100)