polars.read_csv
polars.read_csv(file: str | TextIO | BytesIO | Path | BinaryIO | bytes, has_header: bool = True, columns: list[int] | list[str] | None = None, new_columns: list[str] | None = None, sep: str = ',', comment_char: str | None = None, quote_char: str | None = '"', skip_rows: int = 0, dtypes: Mapping[str, PolarsDataType] | list[PolarsDataType] | None = None, null_values: str | list[str] | dict[str, str] | None = None, missing_utf8_is_empty_string: bool = False, ignore_errors: bool = False, parse_dates: bool = False, n_threads: int | None = None, infer_schema_length: int | None = 100, batch_size: int = 8192, n_rows: int | None = None, encoding: CsvEncoding | str = 'utf8', low_memory: bool = False, rechunk: bool = True, use_pyarrow: bool = False, storage_options: dict[str, Any] | None = None, skip_rows_after_header: int = 0, row_count_name: str | None = None, row_count_offset: int = 0, sample_size: int = 1024, eol_char: str = '\n') → DataFrame
Read a CSV file into a DataFrame.
- Parameters:
- file
Path to a file or a file-like object. By file-like object, we refer to objects with a read() method, such as a file handler (e.g. via the builtin open function) or StringIO or BytesIO. If fsspec is installed, it will be used to open remote files.
- has_header
Indicate if the first row of the dataset is a header or not. If set to False, column names will be autogenerated in the following format: column_x, with x being an enumeration over every column in the dataset starting at 1.
- columns
Columns to select. Accepts a list of column indices (starting at zero) or a list of column names.
- new_columns
Rename columns right after parsing the CSV file. If the given list is shorter than the width of the DataFrame, the remaining columns will keep their original name.
- sep
Single byte character to use as delimiter in the file.
- comment_char
Single byte character that indicates the start of a comment line, for instance #.
- quote_char
Single byte character used for CSV quoting, default = ". Set to None to turn off special handling and escaping of quotes.
- skip_rows
Start reading after skip_rows lines.
- dtypes
Overwrite dtypes during inference.
- null_values
Values to interpret as null values. You can provide a:
- str: All values equal to this string will be null.
- List[str]: All values equal to any string in this list will be null.
- Dict[str, str]: A dictionary that maps a column name to a null value string.
- missing_utf8_is_empty_string
By default a missing value is considered to be null; if you would prefer missing utf8 values to be treated as the empty string, you can set this parameter to True.
- ignore_errors
Try to keep reading lines if some lines yield errors. First try infer_schema_length=0 to read all columns as pl.Utf8 and check which values might cause an issue (see the Examples below).
- parse_dates
Try to automatically parse dates. If this does not succeed, the column remains of data type pl.Utf8. If use_pyarrow=True, dates will always be parsed.
- n_threads
Number of threads to use in CSV parsing. Defaults to the number of physical CPUs of your system.
- infer_schema_length
Maximum number of lines to read to infer schema. If set to 0, all columns will be read as pl.Utf8. If set to None, a full table scan will be done (slow).
- batch_size
Number of lines to read into the buffer at once. Modify this to change performance.
- n_rows
Stop reading from the CSV file after reading n_rows. During multi-threaded parsing, an upper bound of n_rows rows cannot be guaranteed.
- encoding {'utf8', 'utf8-lossy', …}
Lossy means that invalid utf8 values are replaced with � characters. When using encodings other than utf8 or utf8-lossy, the input is first decoded in memory with Python. Defaults to utf8.
- low_memory
.- low_memory
Reduce memory usage at the expense of performance.
- rechunk
Make sure that all columns are contiguous in memory by aggregating the chunks into a single array.
- use_pyarrow
Try to use pyarrow’s native CSV parser. This will always parse dates, even if parse_dates=False. This is not always possible; the set of arguments given to this function determines whether pyarrow’s native parser can be used. Note that pyarrow and polars may have different strategies regarding type inference.
- storage_options
Extra options that make sense for fsspec.open() or a particular storage connection, e.g. host, port, username, password, etc.
- skip_rows_after_header
Skip this number of rows after the header is parsed.
- row_count_name
If not None, this will insert a row count column with the given name into the DataFrame.
- row_count_offset
Offset to start the row_count column (only used if the name is set).
- sample_size
Set the sample size. This is used to sample statistics to estimate the allocation needed.
- eol_char
Single byte end-of-line character.
- Returns:
- DataFrame
See also
scan_csv
Lazily read from a CSV file or multiple files via glob patterns.
Notes
This operation defaults to a rechunk operation at the end, meaning that all data will be stored contiguously in memory. Set rechunk=False if you are benchmarking the CSV reader, as a rechunk is an expensive operation.
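Examples
The sketches below are illustrative, not from the source: file names, column names, and values are placeholders, and parameter names follow the signature documented above (newer polars releases have renamed some of them).
A minimal read of a semicolon-delimited file with a header row, selecting two columns by name:
>>> import polars as pl
>>> df = pl.read_csv(
...     "data.csv",  # hypothetical local file
...     sep=";",
...     columns=["id", "value"],  # a list of zero-based indices works as well
... )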
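A headerless file, assuming the sketch’s column names, dtype, and missing-value markers:
>>> df = pl.read_csv(
...     "raw.csv",
...     has_header=False,              # names autogenerated as column_1, column_2, ...
...     new_columns=["id", "score"],   # rename them right after parsing
...     dtypes={"score": pl.Float64},  # overwrite the inferred dtype for one column
...     null_values=["NA", "n/a"],     # treat these strings as null in every column
... )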
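Following the ignore_errors note above, a file that breaks type inference can first be read entirely as pl.Utf8 to locate the offending values, then re-read with explicit dtypes; "messy.csv" and the "amount" column are assumptions for this sketch:
>>> raw = pl.read_csv("messy.csv", infer_schema_length=0)  # every column read as pl.Utf8
>>> df = pl.read_csv(
...     "messy.csv",
...     dtypes={"amount": pl.Float64},  # explicit dtype for the problem column
...     ignore_errors=True,             # keep reading lines that fail to parse
...     row_count_name="row_nr",        # trace surviving rows back to the file
... )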