Read

`scan_readstat(path, **kwargs)`

Returns a Polars LazyFrame for SAS (.sas7bdat, .xpt/.xpt5/.xpt8), Stata (.dta), and SPSS (.sav/.zsav) files.

from polars_readstat import scan_readstat

lf = scan_readstat("file.sas7bdat")
df = lf.collect()

Key parameters:

Parameter	Default	Notes
`preserve_order`	`False`	Preserve row order or expose a row index (see below).
`missing_string_as_null`	`False`	Convert empty strings to `null`.
`value_labels_as_strings`	`False`	For labeled numeric columns, return label strings.
`schema_overrides`	`None`	Dict of `{column: polars_dtype}`.
`batch_size`	`None`	Rows per internal chunk during collect. Auto-inferred if `None`.
`informative_nulls`	`None`	Capture user-defined missing indicators.
`threads`	`None`	Number of threads. Defaults to the Polars thread pool size.
`compress`	`None`	Optional type compression after scan.
`catalog`	`None`	SAS catalog for value labels. See SAS catalog.

Compression options

compress accepts:

True: enable all read-side compression transforms
False / None: disable read-side compression
a CompressOptions object
a plain dict with the same fields

This is read-side type compression (not the Stata writer compress flag).

Fields:

Field	Default	Description
`enabled`	`False`	Enable compression after scan.
`cols`	`None`	Restrict compression to a list of column names.
`compress_numeric`	`False`	Downcast numeric types where safe.
`datetime_to_date`	`False`	Convert datetime columns to date when possible.
`string_to_numeric`	`False`	Convert strings to numeric when safe.
`infer_compress_length`	`None`	Limit compression inference to the first N rows.
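
As a rough illustration of the plain-dict form, the sketch below builds a compress dict from the fields in the table and passes it to scan_readstat. The column names, the chosen values, and the wrapper function scan_compressed are hypothetical. The import sits inside the function only so the snippet can be loaded without the package or a data file at hand.

```python
# Field names come from the table above; values and column names are illustrative.
COMPRESS = {
    "enabled": True,
    "cols": ["income", "age"],        # hypothetical column names
    "compress_numeric": True,         # downcast numeric types where safe
    "datetime_to_date": True,         # datetime -> date when possible
    "string_to_numeric": False,
    "infer_compress_length": 10_000,  # infer from the first 10,000 rows only
}


def scan_compressed(path):
    """Return a LazyFrame with read-side compression applied (hypothetical helper)."""
    from polars_readstat import scan_readstat

    return scan_readstat(path, compress=COMPRESS)
```

Restricting cols and infer_compress_length like this is one way to keep the inference step cheap on wide or long files; whether it helps depends on the data.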

Informative Nulls

SAS, Stata, and SPSS files support user-defined missing value codes (SAS .A–.Z, Stata .a–.z, SPSS discrete/range missing values). By default these are read as null.

Use informative_nulls to capture the missing-value indicator alongside the data value.

from polars_readstat import scan_readstat, InformativeNullOpts

lf = scan_readstat(
    "file.dta",
    informative_nulls=InformativeNullOpts(columns="all"),
)

informative_nulls accepts either an InformativeNullOpts dataclass or a plain dict. The default indicator suffix is _null (used in separate_column mode).

Modes

Mode	Description
`"separate_column"` (default)	Adds a parallel `String` column `<col><suffix>` after each tracked column.
`"struct"`	Wraps each `(value, indicator)` pair into a `Struct` column.
`"merged_string"`	Merges into a single `String` column (value as string, or the indicator code).

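from polars_readstat import scan_readstat, InformativeNullOpts

# Track user-defined missing codes for two columns only.
opts = InformativeNullOpts(
    columns=["income", "age"],
    mode="separate_column",
    suffix="_missing",  # indicator columns become income_missing and age_missing
    use_value_labels=True,
)

lf = scan_readstat("file.dta", informative_nulls=opts)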
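
In separate_column mode each tracked column gets an indicator column named <col><suffix>. A minimal sketch of that naming rule, using the columns and suffix from the example above, may help when selecting those columns later. The helper indicator_columns is hypothetical and not part of polars_readstat.

```python
def indicator_columns(columns, suffix="_null"):
    """Names of the indicator columns that separate_column mode is described as adding."""
    return [f"{col}{suffix}" for col in columns]


names = indicator_columns(["income", "age"], suffix="_missing")
print(names)  # ['income_missing', 'age_missing']
```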

For column projection, use standard Polars lazy syntax:

lf = scan_readstat("file.sas7bdat").select(["income", "age"])

Preserve order options

preserve_order accepts:

Value	Behavior
`False`	Allow out-of-order batches for higher throughput.
`True`	Buffer batches to preserve the original row order (uses more memory).
`PreserveOrderOpts(...)` or `dict`	Choose a mode explicitly, including the row-index-based modes that need less buffering.

PreserveOrderOpts fields:

Field	Default	Description
`mode`	`"buffered"`	`"buffered"`, `"row_index"`, or `"sort"`.
`row_index_name`	`"row_index"`	Column name when mode is `"row_index"` or `"sort"`.

Modes:

Mode	Description
`"buffered"`	Keep the original row order by buffering batches in Rust. Because this interferes with streaming results to Polars, it can cause higher peak memory use.
`"row_index"`	Add a row index column in Rust but return batches unsorted, so you can sort on the row index column later.
`"sort"`	Add a row index in Rust, then sort and drop it in Python.

Example:

from polars_readstat import scan_readstat, PreserveOrderOpts

lf = scan_readstat(
    "file.sas7bdat",
    preserve_order=PreserveOrderOpts(mode="row_index", row_index_name="__row_idx"),
)
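
One possible way to use the "row_index" mode, sketched under assumptions: filter the unsorted batches first, then restore the original order on the smaller result. The age column, the threshold, and the helper name are hypothetical; sort and drop are standard Polars LazyFrame methods.

```python
ROW_INDEX = "__row_idx"


def scan_filtered_in_order(path, min_age=30):
    """Filter first, then restore file order on the reduced result (hypothetical helper)."""
    import polars as pl
    from polars_readstat import scan_readstat, PreserveOrderOpts

    lf = scan_readstat(
        path,
        preserve_order=PreserveOrderOpts(mode="row_index", row_index_name=ROW_INDEX),
    )
    return (
        lf.filter(pl.col("age") >= min_age)  # "age" is a hypothetical column
        .sort(ROW_INDEX)                      # restore the original row order
        .drop(ROW_INDEX)                      # remove the helper column
    )
```

Sorting after the filter means only the surviving rows are reordered, which is the main reason to prefer this mode over "buffered" when the result is much smaller than the file.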

SAS Transport (XPT)

.xpt, .xpt5, and .xpt8 files (SAS Transport v5/v8) are supported via the same scan_readstat / ScanReadstat API. Reading is parallelised by row range.

from polars_readstat import scan_readstat

lf = scan_readstat("file.xpt")
df = lf.collect()

SAS Transport files do not carry value labels. Use the catalog parameter with a .sas7bcat file to attach labels from a separate catalog (see SAS catalog).

SAS catalog

SAS stores value labels in a separate format catalog (.sas7bcat). The catalog parameter lets you attach those labels when reading.

catalog accepts:

a path to a .sas7bcat file
a pre-built {format_name: {code: label}} dict

When catalog is set and value_labels_as_strings=True, columns whose SAS format name appears in the catalog are returned as strings with codes replaced by their labels.

from polars_readstat import scan_readstat

lf = scan_readstat(
    "file.sas7bdat",
    catalog="formats.sas7bcat",
    value_labels_as_strings=True,
)
df = lf.collect()

To inspect which labels would be applied without collecting data, use ScanReadstat.catalog_labels:

from polars_readstat import ScanReadstat

reader = ScanReadstat("file.sas7bdat", catalog="formats.sas7bcat")
print(reader.catalog_labels)
# {"sex": {1.0: "Male", 2.0: "Female"}, ...}

catalog_labels is a column-name-keyed dict (mapped from format names via the file metadata), or None if no catalog was provided.
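
Because catalog_labels can be None, code that looks up labels needs a fallback. The sketch below shows one way to do that on a plain dict shaped like the output above; the helper label_for is hypothetical and not part of the library.

```python
def label_for(catalog_labels, column, code):
    """Return the label for `code` in `column`, or the code itself if none applies."""
    if catalog_labels is None:  # no catalog was provided
        return code
    return catalog_labels.get(column, {}).get(code, code)


# Sample shaped like the catalog_labels output shown above.
catalog_labels = {"sex": {1.0: "Male", 2.0: "Female"}}

print(label_for(catalog_labels, "sex", 1.0))  # Male
print(label_for(catalog_labels, "sex", 9.0))  # 9.0 (no label for this code)
print(label_for(None, "sex", 1.0))            # 1.0 (no catalog)
```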

`ScanReadstat(path, **kwargs)`

Reader object that exposes schema, metadata, catalog_labels, and df. Useful when you need file metadata before collecting data. Accepts the same parameters as scan_readstat.

from polars_readstat import ScanReadstat

reader = ScanReadstat("file.dta")

print(reader.schema)
print(reader.metadata["row_count"])

df = reader.df.collect()

# With a SAS catalog
reader = ScanReadstat("file.sas7bdat", catalog="formats.sas7bcat")
print(reader.catalog_labels)   # column-name-keyed label dicts
df = reader.df.collect()       # re-uses already-read metadata

Metadata structure varies by format.
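
Since the per-variable list lives under "columns" for SAS and under "variables" for Stata and SPSS (see the format sections below), a small helper can hide that difference. This is a sketch on trimmed sample dicts; variable_entries is a hypothetical name, not a library function.

```python
def variable_entries(metadata):
    """Return the per-variable list: "columns" for SAS, "variables" for Stata and SPSS."""
    for key in ("columns", "variables"):
        if key in metadata:
            return metadata[key]
    return []


# Trimmed samples shaped like the metadata shown in the sections below.
sas_meta = {"row_count": 10, "columns": [{"name": "Column1", "format": "BEST"}]}
stata_meta = {"row_count": 30, "variables": [{"name": "ethnicsn", "format": "%8.0g"}]}

print([v["name"] for v in variable_entries(sas_meta)])    # ['Column1']
print([v["name"] for v in variable_entries(stata_meta)])  # ['ethnicsn']
```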

SAS (`.sas7bdat`)

Per-variable info is under the columns key.

{
  "row_count": 10,
  "column_count": 100,
  "table_name": "TEST1",
  "file_encoding": "WINDOWS-1252",
  "sas_release": "9.0401M1",
  "compression": "None",
  "creator_proc": "DATASTEP",
  "columns": [
    {
      "name": "Column1",
      "label": null,
      "format": "BEST",
      "type": "Numeric",
      "length": 8,
      "offset": 0
    },
    {
      "name": "Column4",
      "label": null,
      "format": "MMDDYY",
      "type": "Numeric",
      "length": 8,
      "offset": 16
    },
    ...
  ]
}

SAS .sas7bdat files do not contain value labels. To attach them, use the catalog parameter with a .sas7bcat file (see SAS catalog).
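
As an example of working with the columns list, the sketch below maps column names to their SAS format and picks out date-formatted columns. It runs on a trimmed copy of the sample above; the DATE_FORMATS set is an illustrative, incomplete assumption rather than anything the library provides.

```python
# Trimmed sample shaped like the SAS metadata shown above.
metadata = {
    "row_count": 10,
    "columns": [
        {"name": "Column1", "label": None, "format": "BEST", "type": "Numeric"},
        {"name": "Column4", "label": None, "format": "MMDDYY", "type": "Numeric"},
    ],
}

# Illustrative, incomplete set of SAS date format names (extend as needed).
DATE_FORMATS = {"DATE", "MMDDYY", "DDMMYY", "YYMMDD"}

formats = {col["name"]: col["format"] for col in metadata["columns"]}
date_columns = [name for name, fmt in formats.items() if fmt in DATE_FORMATS]

print(formats)       # {'Column1': 'BEST', 'Column4': 'MMDDYY'}
print(date_columns)  # ['Column4']
```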

Stata (`.dta`)

Per-variable info is under the variables key.

{
  "row_count": 30,
  "version": 118,
  "byte_order": "Little",
  "encoding": "UTF-8",
  "data_label": null,
  "timestamp": " 8 Aug 2016 15:21",
  "variables": [
    {
      "name": "ethnicsn",
      "label": "ethnicity, senegal",
      "format": "%8.0g",
      "type": "Numeric(Int)",
      "value_label_name": "ETHNICSN",
      "value_labels": {
        "101": "bainouk",
        "102": "badiaranke",
        ...
      }
    },
    ...
  ]
}
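
The value_labels keys appear as strings in the sample above. Assuming that also holds for the metadata you get back, and that unlabeled variables have no (or an empty) value_labels entry, a per-variable lookup with integer codes could be built like this. The age variable is hypothetical.

```python
# Trimmed sample shaped like the Stata metadata shown above.
metadata = {
    "variables": [
        {
            "name": "ethnicsn",
            "type": "Numeric(Int)",
            "value_label_name": "ETHNICSN",
            "value_labels": {"101": "bainouk", "102": "badiaranke"},
        },
        {"name": "age", "type": "Numeric(Int)"},  # hypothetical unlabeled variable
    ],
}

# Convert string keys to int so the mapping can be matched against integer data.
labels_by_variable = {
    var["name"]: {int(code): label for code, label in var["value_labels"].items()}
    for var in metadata["variables"]
    if var.get("value_labels")
}

print(labels_by_variable)  # {'ethnicsn': {101: 'bainouk', 102: 'badiaranke'}}
```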

SPSS (`.sav` / `.zsav`)

Per-variable info is under the variables key. SPSS exposes the richest variable-level metadata.

{
  "row_count": 5,
  "version": 2,
  "compression": "RLE",
  "encoding": "windows-1252",
  "file_label": null,
  "variables": [
    {
      "name": "mylabl",
      "label": "labeled",
      "type": "Numeric",
      "measure": "Scale",
      "alignment": "Right",
      "display_width": 8,
      "decimal_places": 2,
      "format_type": 5,
      "format_width": 8,
      "format_decimals": 2,
      "format_class": null,
      "value_label": "labels0",
      "value_labels": {
        "1": "Male",
        "2": "Female"
      },
      "missing_doubles": [],
      "missing_strings": [],
      "missing_range": false,
      ...
    },
    ...
  ]
}

SPSS format fields

Each variable includes three format fields from the SAV file header:

format_type — integer code identifying the SPSS format (see table below)
format_width — total display width in characters
format_decimals — number of decimal places

You can reconstruct the SPSS format string as {name}{width}.{decimals}, where {name} is the SPSS name for the format_type code in the table below. For example, format_type=5, format_width=8, format_decimals=2 → "F8.2".

format_type codes

Code	SPSS name	Description
1	`A`	String (alphanumeric)
5	`F`	Standard numeric (fixed decimal)
20	`DATE`	Date (dd-mmm-yyyy)
21	`TIME`	Time (hh:mm:ss)
22	`DATETIME`	Date and time (dd-mmm-yyyy hh:mm:ss)
23	`ADATE`	Date (mm/dd/yyyy)
24	`JDATE`	Julian date (yyyyddd)
25	`DTIME`	Duration / elapsed time (dd hh:mm:ss)
38	`EDATE`	European date (dd.mm.yyyy)
39	`SDATE`	Sortable date (yyyy/mm/dd)
41	`YMDHMS`	ISO-style datetime (yyyy-mm-dd hh:mm:ss)

Other codes appear in the wild but are less common. The full list is in the SPSS SAV format specification.
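
The reconstruction rule and the code table above can be combined in a few lines. This is a minimal sketch: the lookup covers only the codes listed in the table, and spss_format_string is a hypothetical helper that returns None for any other code.

```python
# Codes and names copied from the table above; other codes are not covered.
SPSS_FORMAT_NAMES = {
    1: "A", 5: "F", 20: "DATE", 21: "TIME", 22: "DATETIME", 23: "ADATE",
    24: "JDATE", 25: "DTIME", 38: "EDATE", 39: "SDATE", 41: "YMDHMS",
}


def spss_format_string(var):
    """Build "{name}{width}.{decimals}" for one variable entry, or None for unknown codes."""
    name = SPSS_FORMAT_NAMES.get(var["format_type"])
    if name is None:
        return None
    return f"{name}{var['format_width']}.{var['format_decimals']}"


print(spss_format_string({"format_type": 5, "format_width": 8, "format_decimals": 2}))   # F8.2
print(spss_format_string({"format_type": 99, "format_width": 8, "format_decimals": 0}))  # None
```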

format_class

format_class is set to "Date", "Time", or "DateTime" for temporal formats (codes 20–25, 38–39, 41) so you can detect date/time columns without hardcoding the numeric codes. It is null for all other formats (numeric, string, etc.).
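
For instance, temporal columns could be collected from the variables list by checking format_class alone. The sketch uses a hand-written sample; the visit_date and visit_start entries are hypothetical.

```python
# Sample entries shaped like the SPSS "variables" list; two of them are hypothetical.
variables = [
    {"name": "mylabl", "format_type": 5, "format_class": None},
    {"name": "visit_date", "format_type": 20, "format_class": "Date"},
    {"name": "visit_start", "format_type": 22, "format_class": "DateTime"},
]

# Keep only variables whose format_class marks them as temporal.
temporal = {
    var["name"]: var["format_class"]
    for var in variables
    if var["format_class"] is not None
}

print(temporal)  # {'visit_date': 'Date', 'visit_start': 'DateTime'}
```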

value_label vs value_labels

value_label — the name of the label set as stored in the file (a string like "labels0"), useful for identifying which variables share the same label set.
value_labels — the actual mapping of coded values to label strings (the dict you use for display/recoding). This is what you need in practice.
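
To see which variables share a label set, the value_label names can be grouped as below. This is a sketch on sample entries: the second and third variables are hypothetical, and it assumes unlabeled variables carry a null value_label.

```python
from collections import defaultdict

# Sample entries shaped like the SPSS "variables" list.
variables = [
    {"name": "mylabl", "value_label": "labels0", "value_labels": {"1": "Male", "2": "Female"}},
    {"name": "sex_partner", "value_label": "labels0", "value_labels": {"1": "Male", "2": "Female"}},
    {"name": "income", "value_label": None, "value_labels": {}},
]

# Group variable names by the label set they reference.
shared = defaultdict(list)
for var in variables:
    if var["value_label"]:
        shared[var["value_label"]].append(var["name"])

print(dict(shared))  # {'labels0': ['mylabl', 'sex_partner']}
```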