Read
scan_readstat(path, **kwargs)
Returns a Polars LazyFrame for SAS (.sas7bdat, .xpt/.xpt5/.xpt8), Stata (.dta), and SPSS (.sav/.zsav) files.
from polars_readstat import scan_readstat
lf = scan_readstat("file.sas7bdat")
df = lf.collect()
Key parameters:
| Parameter | Default | Notes |
|---|---|---|
| preserve_order | False | Preserve row order or expose a row index (see below). |
| missing_string_as_null | False | Convert empty strings to null. |
| value_labels_as_strings | False | For labeled numeric columns, return label strings. |
| schema_overrides | None | Dict of {column: polars_dtype}. |
| batch_size | None | Rows per internal chunk during collect. Auto-inferred if None. |
| informative_nulls | None | Capture user-defined missing indicators. |
| threads | None | Number of threads. Defaults to the Polars thread pool size. |
| compress | None | Optional type compression after scan. |
| catalog | None | SAS catalog for value labels. See SAS catalog. |
Compression options
compress accepts:
- True: enable all read-side compression transforms
- False/None: disable read-side compression
- a CompressOptions object
- a plain dict with the same fields
This is read-side type compression (not the Stata writer compress flag).
Fields:
| Field | Default | Description |
|---|---|---|
| enabled | False | Enable compression after scan. |
| cols | None | Restrict compression to a list of column names. |
| compress_numeric | False | Downcast numeric types where safe. |
| datetime_to_date | False | Convert datetime columns to date when possible. |
| string_to_numeric | False | Convert strings to numeric when safe. |
| infer_compress_length | None | Limit compression inference to the first N rows. |
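As a rough sketch of the plain-dict form, the helper below builds a compress dict from the fields listed above and rejects keys that are not in that list. The name make_compress_opts is hypothetical and not part of polars_readstat; only the field names and defaults come from the table.

```python
# Defaults copied from the field table above.
COMPRESS_DEFAULTS = {
    "enabled": False,
    "cols": None,
    "compress_numeric": False,
    "datetime_to_date": False,
    "string_to_numeric": False,
    "infer_compress_length": None,
}


def make_compress_opts(**overrides):
    """Return a plain dict for the `compress` parameter (hypothetical helper)."""
    unknown = set(overrides) - set(COMPRESS_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown compress fields: {sorted(unknown)}")
    # Start from the documented defaults and apply the caller's overrides.
    return {**COMPRESS_DEFAULTS, **overrides}


opts = make_compress_opts(enabled=True, cols=["income", "age"], compress_numeric=True)
# Intended use (not executed here): scan_readstat("file.dta", compress=opts)
```

Catching a misspelled field name before the scan is the only purpose of the check; passing a hand-written dict directly should work just as well.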
Informative Nulls
SAS, Stata, and SPSS files support user-defined missing value codes (SAS .A–.Z, Stata .a–.z, SPSS discrete/range missings). By default these are read as null.
Use informative_nulls to capture the missing-value indicator alongside the data value.
from polars_readstat import scan_readstat, InformativeNullOpts
lf = scan_readstat(
"file.dta",
informative_nulls=InformativeNullOpts(columns="all"),
)
informative_nulls accepts either an InformativeNullOpts dataclass or a plain dict.
The default indicator suffix is _null (used in separate_column mode).
Modes
| Mode | Description |
|---|---|
| "separate_column" (default) | Adds a parallel String column <col><suffix> after each tracked column. |
| "struct" | Wraps each (value, indicator) pair into a Struct column. |
| "merged_string" | Merges into a single String column (value as string, or the indicator code). |
from polars_readstat import InformativeNullOpts
opts = InformativeNullOpts(
columns=["income", "age"],
mode="separate_column",
suffix="_missing",
use_value_labels=True,
)
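To make the difference between the three modes concrete, here is a toy plain-Python model of the output shape for a single column. It is not the library's implementation: the dict keys "value" and "indicator" are hypothetical, and how the real reader names struct fields or represents system-missing values is not specified here.

```python
def apply_mode(values, indicators, mode="separate_column"):
    """Model the output shape of each mode for one column.

    values:     data values, None where the value is missing
    indicators: missing-value codes such as ".a", None where the value is present
    """
    if mode == "separate_column":
        # Two parallel columns: the data and a string indicator column.
        return {"value": list(values), "indicator": list(indicators)}
    if mode == "struct":
        # One column of (value, indicator) pairs.
        return [{"value": v, "indicator": i} for v, i in zip(values, indicators)]
    if mode == "merged_string":
        # One string column: the indicator code if present, else the value as text.
        merged = []
        for v, i in zip(values, indicators):
            if i is not None:
                merged.append(i)
            elif v is None:
                merged.append(None)
            else:
                merged.append(str(v))
        return merged
    raise ValueError(f"unknown mode: {mode!r}")


income = [50000, None, 42000]
income_null = [None, ".a", None]
merged = apply_mode(income, income_null, mode="merged_string")
pairs = apply_mode(income, income_null, mode="struct")
```

In this model the merged form loses the numeric type, which is why it suits display or export more than computation.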
Column projection should use standard Polars lazy syntax:
lf = scan_readstat("file.sas7bdat").select(["income", "age"])
Preserve order options
preserve_order accepts:
| Value | Behavior |
|---|---|
| False | Allow out-of-order batches for higher throughput. |
| True | Current behavior: buffer batches to preserve order (more memory). |
| PreserveOrderOpts(...) or dict | Row-index-based modes (less buffering). |
PreserveOrderOpts fields:
| Field | Default | Description |
|---|---|---|
| mode | "buffered" | "buffered", "row_index", or "sort". |
| row_index_name | "row_index" | Column name when mode is "row_index" or "sort". |
Modes:
| Mode | Description |
|---|---|
| "buffered" | Keep original row order by buffering batches in Rust. This can cause higher peak memory use, as it affects how results are streamed to Polars. |
| "row_index" | Add a row index column in Rust, but return unsorted batches (so you can sort on the row index later). |
| "sort" | Add a row index in Rust, then sort and drop it in Python. |
Example:
from polars_readstat import scan_readstat, PreserveOrderOpts
lf = scan_readstat(
"file.sas7bdat",
preserve_order=PreserveOrderOpts(mode="row_index", row_index_name="__row_idx"),
)
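The idea behind the row-index modes can be modelled in plain Python: each row carries its original position, batches may arrive in any order, and a sort on the index restores the file order. This is only a conceptual sketch with made-up data, not how the reader is implemented.

```python
from itertools import chain

# Batches as a reader might emit them in "row_index" mode: each row carries
# its original position, but the batches themselves arrive out of order.
batches = [
    [{"__row_idx": 2, "age": 51}, {"__row_idx": 3, "age": 19}],
    [{"__row_idx": 0, "age": 34}, {"__row_idx": 1, "age": 27}],
]


def restore_order(batches, row_index_name="__row_idx"):
    """Concatenate batches, sort by the row index, then drop the index column."""
    rows = sorted(chain.from_iterable(batches), key=lambda r: r[row_index_name])
    return [{k: v for k, v in r.items() if k != row_index_name} for r in rows]


ordered = restore_order(batches)
```

With a real LazyFrame in "row_index" mode, the same effect should be available through standard Polars syntax, for example lf.sort("__row_idx"), applied only when the order actually matters.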
SAS Transport (XPT)
.xpt, .xpt5, and .xpt8 files (SAS Transport v5/v8) are supported via the same scan_readstat / ScanReadstat API. Reading is parallelised by row range.
lf = scan_readstat("file.xpt")
df = lf.collect()
SAS Transport files do not carry value labels. Use the catalog parameter with a .sas7bcat file to attach labels from a separate catalog (see SAS catalog).
SAS catalog
SAS stores value labels in a separate format catalog (.sas7bcat). The catalog parameter lets you attach those labels when reading.
catalog accepts:
- a path to a .sas7bcat file
- a pre-built {format_name: {code: label}} dict
When catalog is set and value_labels_as_strings=True, columns whose SAS format name appears in the catalog are returned as strings with codes replaced by their labels.
from polars_readstat import scan_readstat
lf = scan_readstat(
"file.sas7bdat",
catalog="formats.sas7bcat",
value_labels_as_strings=True,
)
df = lf.collect()
To inspect which labels would be applied without collecting data, use ScanReadstat.catalog_labels:
from polars_readstat import ScanReadstat
reader = ScanReadstat("file.sas7bdat", catalog="formats.sas7bcat")
print(reader.catalog_labels)
# {"sex": {1.0: "Male", 2.0: "Female"}, ...}
catalog_labels is a column-name-keyed dict (mapped from format names via the file metadata), or None if no catalog was provided.
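The mapping from format names to column names can be pictured with a small sketch. It assumes per-column metadata entries with "name" and "format" keys; the function labels_by_column and the format name SEXFMT are hypothetical, and the library's own logic may differ.

```python
def labels_by_column(catalog, columns):
    """Map a {format_name: {code: label}} catalog onto column names.

    `columns` is a list of per-column metadata dicts with "name" and
    "format" keys.
    """
    result = {}
    for col in columns:
        labels = catalog.get(col.get("format"))
        if labels is not None:
            # Only columns whose format appears in the catalog get labels.
            result[col["name"]] = labels
    return result


catalog = {"SEXFMT": {1.0: "Male", 2.0: "Female"}}
columns = [
    {"name": "sex", "format": "SEXFMT"},
    {"name": "income", "format": "BEST"},
]
by_column = labels_by_column(catalog, columns)
```

Columns whose format is absent from the catalog, such as income here, are simply left out of the result.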
ScanReadstat(path, **kwargs)
Reader object that exposes schema, metadata, catalog_labels, and df. Useful when you need file metadata before collecting data. Accepts the same parameters as scan_readstat.
from polars_readstat import ScanReadstat
reader = ScanReadstat("file.dta")
print(reader.schema)
print(reader.metadata["row_count"])
df = reader.df.collect()
# With a SAS catalog
reader = ScanReadstat("file.sas7bdat", catalog="formats.sas7bcat")
print(reader.catalog_labels) # column-name-keyed label dicts
df = reader.df.collect() # re-uses already-read metadata
Metadata structure varies by format.
SAS (.sas7bdat)
Per-variable info is under the columns key.
{
"row_count": 10,
"column_count": 100,
"table_name": "TEST1",
"file_encoding": "WINDOWS-1252",
"sas_release": "9.0401M1",
"compression": "None",
"creator_proc": "DATASTEP",
"columns": [
{
"name": "Column1",
"label": null,
"format": "BEST",
"type": "Numeric",
"length": 8,
"offset": 0
},
{
"name": "Column4",
"label": null,
"format": "MMDDYY",
"type": "Numeric",
"length": 8,
"offset": 16
},
...
]
}
SAS .sas7bdat files do not contain value labels.
Stata (.dta)
Per-variable info is under the variables key.
{
"row_count": 30,
"version": 118,
"byte_order": "Little",
"encoding": "UTF-8",
"data_label": null,
"timestamp": " 8 Aug 2016 15:21",
"variables": [
{
"name": "ethnicsn",
"label": "ethnicity, senegal",
"format": "%8.0g",
"type": "Numeric(Int)",
"value_label_name": "ETHNICSN",
"value_labels": {
"101": "bainouk",
"102": "badiaranke",
...
}
},
...
]
}
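One way to use this structure is to build a per-variable lookup of value labels. The sketch below assumes the metadata is available as a Python dict shaped like the example above, with string keys in value_labels; the helper name value_label_lookup and the age variable are hypothetical.

```python
def value_label_lookup(metadata):
    """Return {variable_name: {int_code: label}} for labeled variables."""
    lookup = {}
    for var in metadata.get("variables", []):
        labels = var.get("value_labels")
        if labels:
            # Keys are strings in the metadata shown above; convert them to int.
            lookup[var["name"]] = {int(code): label for code, label in labels.items()}
    return lookup


metadata = {
    "row_count": 30,
    "variables": [
        {"name": "ethnicsn", "value_labels": {"101": "bainouk", "102": "badiaranke"}},
        {"name": "age", "value_labels": None},  # hypothetical unlabeled variable
    ],
}
lookup = value_label_lookup(metadata)
```

Converting the keys to int makes the lookup usable directly against the numeric codes in the data.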
SPSS (.sav / .zsav)
Per-variable info is under the variables key. SPSS exposes the richest variable-level metadata.
{
"row_count": 5,
"version": 2,
"compression": "RLE",
"encoding": "windows-1252",
"file_label": null,
"variables": [
{
"name": "mylabl",
"label": "labeled",
"type": "Numeric",
"measure": "Scale",
"alignment": "Right",
"display_width": 8,
"decimal_places": 2,
"format_type": 5,
"format_width": 8,
"format_decimals": 2,
"format_class": null,
"value_label": "labels0",
"value_labels": {
"1": "Male",
"2": "Female"
},
"missing_doubles": [],
"missing_strings": [],
"missing_range": false,
...
},
...
]
}
SPSS format fields
Each variable includes three format fields from the SAV file header:
- format_type: integer code identifying the SPSS format (see table below)
- format_width: total display width in characters
- format_decimals: number of decimal places
You can reconstruct the SPSS format string as {name}{width}.{decimals}, e.g. format_type=5, format_width=8, format_decimals=2 → "F8.2".
format_type codes
| Code | SPSS name | Description |
|---|---|---|
| 1 | A | String (alphanumeric) |
| 5 | F | Standard numeric (fixed decimal) |
| 20 | DATE | Date (dd-mmm-yyyy) |
| 21 | TIME | Time (hh:mm:ss) |
| 22 | DATETIME | Date and time (dd-mmm-yyyy hh:mm:ss) |
| 23 | ADATE | Date (mm/dd/yyyy) |
| 24 | JDATE | Julian date (yyyyddd) |
| 25 | DTIME | Duration / elapsed time (dd hh:mm:ss) |
| 38 | EDATE | European date (dd.mm.yyyy) |
| 39 | SDATE | Sortable date (yyyy/mm/dd) |
| 41 | YMDHMS | ISO-style datetime (yyyy-mm-dd hh:mm:ss) |
Other codes appear in the wild but are less common. The full list is in the SPSS SAV format specification.
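The reconstruction rule can be written as a short helper. This is a minimal sketch that covers only the codes in the table above and applies the {name}{width}.{decimals} pattern literally; the function name spss_format_string is hypothetical.

```python
# Code-to-name mapping copied from the table above (not the full SPSS list).
SPSS_FORMAT_NAMES = {
    1: "A", 5: "F", 20: "DATE", 21: "TIME", 22: "DATETIME", 23: "ADATE",
    24: "JDATE", 25: "DTIME", 38: "EDATE", 39: "SDATE", 41: "YMDHMS",
}


def spss_format_string(format_type, format_width, format_decimals):
    """Rebuild a format string such as "F8.2" from the three format fields."""
    name = SPSS_FORMAT_NAMES.get(format_type)
    if name is None:
        raise KeyError(f"format_type {format_type} is not in the table above")
    return f"{name}{format_width}.{format_decimals}"


fmt = spss_format_string(5, 8, 2)
```

Codes outside the table raise a KeyError here rather than guessing a name; extend the mapping from the SAV specification if you meet them.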
format_class
format_class is set to "Date", "Time", or "DateTime" for temporal formats (codes 20–25, 38–39, 41) so you can detect date/time columns without hardcoding the numeric codes. It is null for all other formats (numeric, string, etc.).
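For instance, date/time variables can be picked out by checking format_class alone. The sketch assumes the metadata is a Python dict shaped like the SPSS example above; the helper name temporal_columns and the visit_date variable are hypothetical.

```python
def temporal_columns(metadata):
    """Return {variable_name: format_class} for date/time variables."""
    return {
        var["name"]: var["format_class"]
        for var in metadata.get("variables", [])
        if var.get("format_class") is not None
    }


metadata = {
    "variables": [
        {"name": "mylabl", "format_type": 5, "format_class": None},
        {"name": "visit_date", "format_type": 20, "format_class": "Date"},  # hypothetical
    ]
}
temporal = temporal_columns(metadata)
```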
value_label vs value_labels
- value_label: the name of the label set as stored in the file (a string like "labels0"), useful for identifying which variables share the same label set.
- value_labels: the actual mapping of coded values to label strings (the dict you use for display/recoding). This is what you need in practice.
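A small sketch of the first use case, grouping variables by the label set they share. It assumes the metadata is a Python dict shaped like the SPSS example above; the helper name variables_by_label_set and the variables other than mylabl are hypothetical.

```python
from collections import defaultdict


def variables_by_label_set(metadata):
    """Group variable names by the name of the label set they reference."""
    groups = defaultdict(list)
    for var in metadata.get("variables", []):
        label_set = var.get("value_label")
        if label_set is not None:
            groups[label_set].append(var["name"])
    return dict(groups)


metadata = {
    "variables": [
        {"name": "mylabl", "value_label": "labels0"},
        {"name": "partner_sex", "value_label": "labels0"},  # hypothetical
        {"name": "income", "value_label": None},  # hypothetical, unlabeled
    ]
}
shared = variables_by_label_set(metadata)
```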