polars.read_excel#
- polars.read_excel(
- source: str | Path | IO[bytes] | bytes,
- *,
- sheet_id: int | Sequence[int] | None = None,
- sheet_name: str | list[str] | tuple[str] | None = None,
- engine: ExcelSpreadsheetEngine | None = None,
- engine_options: dict[str, Any] | None = None,
- read_options: dict[str, Any] | None = None,
- schema_overrides: SchemaDict | None = None,
- raise_if_empty: bool = True,
Read Excel spreadsheet data into a DataFrame.
New in version 0.20.6: Added “calamine” fastexcel engine for Excel Workbooks (.xlsx, .xlsb, .xls).
New in version 0.19.4: Added “pyxlsb” engine for Excel Binary Workbooks (.xlsb).
New in version 0.19.3: Added “openpyxl” engine, and added
schema_overrides
parameter.- Parameters:
- source
Path to a file or a file-like object (by file-like object, we refer to objects that have a
read()
method, such as a file handler (e.g. via builtinopen
function) orBytesIO
).- sheet_id
Sheet number(s) to convert (set
0
to load all sheets as DataFrames) and return a{sheetname:frame,}
dict. (Defaults to1
if neither this norsheet_name
are specified). Can also take a sequence of sheet numbers.- sheet_name
Sheet name(s) to convert; cannot be used in conjunction with
sheet_id
. If more than one is given then a{sheetname:frame,}
dict is returned.- engine
Library used to parse the spreadsheet file; currently defaults to “xlsx2csv” if not explicitly set.
“xlsx2csv”: converts the data to an in-memory CSV before using the native polars
read_csv
method to parse the result. You can passengine_options
andread_options
to refine the conversion.“openpyxl”: this engine is significantly slower than
xlsx2csv
but supports additional automatic type inference; potentially useful if you are otherwise unable to parse your sheet with the (default)xlsx2csv
engine in conjunction with theschema_overrides
parameter.“pyxlsb”: this engine is used for Excel Binary Workbooks (
.xlsb
files). Note that you have to useschema_overrides
to correctly load date/datetime columns (or these will be read as floats representing offset Julian values).“calamine”: this engine can be used for reading all major types of Excel Workbook (
.xlsx
,.xlsb
,.xls
) and is dramatically faster than the other options, using thefastexcel
module to bind calamine.
- engine_options
Additional options passed to the underlying engine’s primary parsing constructor (given below), if supported:
“xlsx2csv”:
Xlsx2csv
“openpyxl”:
load_workbook
“pyxlsb”:
open_workbook
“calamine”:
n/a
- read_options
Extra options passed to the function that reads the sheet data (for example, the
read_csv
method if using the “xlsx2csv” engine, to which you could pass{"infer_schema_length": None}
, or theload_sheet_by_name
method if using the “calamine” engine.- schema_overrides
Support type specification or override of one or more columns.
- raise_if_empty
When there is no data in the sheet,`NoDataError` is raised. If this parameter is set to False, an empty DataFrame (with no columns) is returned instead.
- Returns:
- DataFrame
If reading a single sheet.
- dict
If reading multiple sheets, a “{sheetname: DataFrame, …}” dict is returned.
Notes
When using the default
xlsx2csv
engine the target Excel sheet is first converted to CSV usingxlsx2csv.Xlsx2csv(source).convert()
and then parsed with Polars’read_csv()
function. You can pass additional options toread_options
to influence this part of the parsing pipeline.Examples
Read the “data” worksheet from an Excel file into a DataFrame.
>>> pl.read_excel( ... source="test.xlsx", ... sheet_name="data", ... )
Read table data from sheet 3 in an Excel workbook as a DataFrame while skipping empty lines in the sheet. As sheet 3 does not have a header row and the default engine is
xlsx2csv
you can pass the necessary additional settings for this to the “read_options” parameter; these will be passed toread_csv()
.>>> pl.read_excel( ... source="test.xlsx", ... sheet_id=3, ... engine_options={"skip_empty_lines": True}, ... read_options={"has_header": False, "new_columns": ["a", "b", "c"]}, ... )
If the correct datatypes can’t be determined you can use
schema_overrides
and/or some of theread_csv()
documentation to see which options you can pass to fix this issue. For example"infer_schema_length": None
can be used to read the data twice, once to infer the correct output types and once more to then read the data with those types. If the types are known in advance thenschema_overrides
is the more efficient option.>>> pl.read_excel( ... source="test.xlsx", ... read_options={"infer_schema_length": 1000}, ... schema_overrides={"dt": pl.Date}, ... )
The
openpyxl
package can also be used to parse Excel data; it has slightly better default type detection, but is slower thanxlsx2csv
. If you have a sheet that is better read using this package you can set the engine as “openpyxl” (if you use this engine thenread_options
cannot be set).>>> pl.read_excel( ... source="test.xlsx", ... engine="openpyxl", ... schema_overrides={"dt": pl.Datetime, "value": pl.Int32}, ... )