Skip to content

Read

scan_readstat(path, **kwargs)

Returns a Polars LazyFrame for SAS, Stata, and SPSS files.

Key parameters:

Parameter Default Notes
preserve_order False Preserve row order or expose a row index (see below).
missing_string_as_null False Convert empty strings to null.
value_labels_as_strings False For labeled numeric columns, return label strings.
schema_overrides None Dict of {column: polars_dtype}.
batch_size 100_000 Rows per internal chunk during collect.
informative_nulls None Capture user-defined missing indicators.
threads None Defaults to the Polars thread pool size.
compress None Optional type compression after scan.

Compression options

compress accepts:

  • True: enable all read-side compression transforms
  • False / None: disable read-side compression
  • a CompressOptions object
  • a plain dict with the same fields

This is read-side type compression (not the Stata writer compress flag).

Fields:

Field Default Description
enabled False Enable compression after scan.
cols None Restrict compression to a list of column names.
compress_numeric False Downcast numeric types where safe.
datetime_to_date False Convert datetime columns to date when possible.
string_to_numeric False Convert strings to numeric when safe.
infer_compress_length None Limit compression inference to the first N rows.

Informative Nulls

SAS, Stata, and SPSS files support user-defined missing value codes (SAS .A.Z, Stata .a.z, SPSS discrete/range missings). By default these are read as null.

Use informative_nulls to capture the missing-value indicator alongside the data value.

from polars_readstat import scan_readstat, InformativeNullOpts

lf = scan_readstat(
    "file.dta",
    informative_nulls=InformativeNullOpts(columns="all"),
)

informative_nulls accepts either an InformativeNullOpts dataclass or a plain dict. The default indicator suffix is _null (used in separate_column mode).

Modes

Mode Description
"separate_column" (default) Adds a parallel String column <col><suffix> after each tracked column.
"struct" Wraps each (value, indicator) pair into a Struct column.
"merged_string" Merges into a single String column (value as string, or the indicator code).
from polars_readstat import InformativeNullOpts

opts = InformativeNullOpts(
    columns=["income", "age"],
    mode="separate_column",
    suffix="_missing",
    use_value_labels=True,
)

Column projection should use standard Polars lazy syntax:

lf = scan_readstat("file.sas7bdat").select(["income", "age"])

Preserve order options

preserve_order accepts:

Value Behavior
False Allow out-of-order batches for higher throughput.
True Current behavior: buffer batches to preserve order (more memory).
PreserveOrderOpts(...) or dict Row-index-based modes (less buffering).

PreserveOrderOpts fields:

Field Default Description
mode "buffered" "buffered", "row_index", or "sort".
row_index_name "row_index" Column name when mode is "row_index" or "sort".

Modes:

Mode Description
"buffered" Keep original row order by buffering batches in Rust. This can result in higher RAM spikes as it affects the streaming of results to polars.
"row_index" Add a row index column in Rust, but return unsorted batches (so you can sort on the row_index later).
"sort" Add a row index in Rust, then sort and drop it in Python.

Example:

from polars_readstat import scan_readstat, PreserveOrderOpts

lf = scan_readstat(
    "file.sas7bdat",
    preserve_order=PreserveOrderOpts(mode="row_index", row_index_name="__row_idx"),
)

read_readstat(path, **kwargs)

Eager version of scan_readstat returning a DataFrame. Accepts the same parameters.

ScanReadstat(path, **kwargs)

Reader object that exposes:

  • schema: a polars.Schema
  • metadata: a dict with file info and per-column details
  • df: a LazyFrame, same as scan_readstat(path)

Metadata includes:

  • columns[].name
  • columns[].label
  • columns[].value_labels