polars.read_csv
polars.read_csv(file: Union[str, TextIO, _io.BytesIO, pathlib.Path, BinaryIO, bytes], has_header: bool = True, columns: Optional[Union[List[int], List[str]]] = None, new_columns: Optional[List[str]] = None, sep: str = ',', comment_char: Optional[str] = None, quote_char: Optional[str] = '"', skip_rows: int = 0, dtypes: Optional[Union[Mapping[str, Type[polars.datatypes.DataType]], List[Type[polars.datatypes.DataType]]]] = None, null_values: Optional[Union[str, List[str], Dict[str, str]]] = None, ignore_errors: bool = False, parse_dates: bool = False, n_threads: Optional[int] = None, infer_schema_length: Optional[int] = 100, batch_size: int = 8192, n_rows: Optional[int] = None, encoding: str = 'utf8', low_memory: bool = False, rechunk: bool = True, use_pyarrow: bool = False, storage_options: Optional[Dict] = None, skip_rows_after_header: int = 0, row_count_name: Optional[str] = None, row_count_offset: int = 0, sample_size: int = 1024, **kwargs: Any) → polars.internals.frame.DataFrame
Read a CSV file into a DataFrame.
- Parameters
- file
Path to a file or a file-like object. By file-like object, we refer to objects with a `read()` method, such as a file handler (e.g. via the builtin `open` function) or `StringIO` or `BytesIO`. If `fsspec` is installed, it will be used to open remote files.
- has_header
Indicate if the first row of the dataset is a header or not. If set to False, column names will be autogenerated in the following format: `column_x`, with `x` being an enumeration over every column in the dataset starting at 1.
- columns
Columns to select. Accepts a list of column indices (starting at zero) or a list of column names.
- new_columns
Rename columns right after parsing the CSV file. If the given list is shorter than the width of the DataFrame, the remaining columns will keep their original names.
- sep
Single byte character to use as delimiter in the file.
- comment_char
Single byte character that indicates the start of a comment line, for instance `#`.
- quote_char
Single byte character used for CSV quoting; default = `"`. Set to `None` to turn off special handling and escaping of quotes.
- skip_rows
Start reading after `skip_rows` lines.
- dtypes
Overwrite dtypes during inference.
- null_values
Values to interpret as null values. You can provide a:
  - `str`: All values equal to this string will be null.
  - `List[str]`: A null value per column.
  - `Dict[str, str]`: A dictionary that maps column name to a null value string.
- ignore_errors
Try to keep reading lines if some lines yield errors. First try `infer_schema_length=0` to read all columns as `pl.Utf8` to check which values might cause an issue.
- parse_dates
Try to automatically parse dates. If this does not succeed, the column remains of data type `pl.Utf8`.
- n_threads
Number of threads to use during CSV parsing. Defaults to the number of physical CPUs of your system.
- infer_schema_length
Maximum number of lines to read to infer schema. If set to 0, all columns will be read as `pl.Utf8`. If set to `None`, a full table scan will be done (slow).
- batch_size
Number of lines to read into the buffer at once. Modify this to change performance.
- n_rows
Stop reading from the CSV file after reading `n_rows` rows. During multi-threaded parsing, an upper bound of `n_rows` rows cannot be guaranteed.
- encoding
Allowed encodings: `utf8` or `utf8-lossy`. Lossy means that invalid UTF-8 values are replaced with `�` characters.
- low_memory
Reduce memory usage at the expense of performance.
- rechunk
Make sure that all columns are contiguous in memory by aggregating the chunks into a single array.
- use_pyarrow
Try to use pyarrow’s native CSV parser. This is not always possible: the set of arguments given to this function determines whether pyarrow’s native parser can be used. Note that pyarrow and polars may have different strategies regarding type inference.
- storage_options
Extra options that make sense for `fsspec.open()` or a particular storage connection, e.g. host, port, username, password, etc.
- skip_rows_after_header
Skip this number of rows after the header is parsed.
- row_count_name
If not None, this will insert a row count column with the given name into the DataFrame.
- row_count_offset
Offset to start the row count at (only used if row_count_name is set).
- sample_size
Set the sample size. This is used to sample statistics to estimate the allocation needed.
- Returns
- DataFrame