Public Code Repositories

I/O Tools

Working with data across programs can be a pain, so I made tools to load SAS, Stata, and SPSS data in Polars and to use Parquet files in Stata.

polars-readstat

This is a Polars IO plugin to read SAS (sas7bdat), Stata (dta), and SPSS (sav) files. It is generally as fast as or faster than pandas and pyreadstat, particularly for SAS files and when reading a subset of columns from a file.

Available on PyPI.

pq/stata-parquet-io

pq is a Stata package that enables reading and writing Parquet files directly in Stata. This plugin bridges the gap between Stata’s data analysis capabilities and the increasingly popular Parquet file format, which is optimized for storage and performance with large datasets. Reading and writing Parquet with pq is slower than reading and writing a dta file in Stata (Stata is really good at that), but it makes it easier to use Stata in multi-language workflows (dta files are slow to work with in R and Python). It can be as fast as loading a dta file when reading a small subset of columns from a large file.

Available on SSC.

Tools for Missing Data Problems

Survey-Kit

I took advantage of the extended shutdown of 2025 to make some tools available that mirror what I use on the NEWS project to address missing data issues that can bias estimates from survey AND administrative data.

Survey-Kit includes a very fast calibration algorithm that reweights a dataset to make it representative (hat tip to Carl Sanders’s entropy-balance-weighting package). Calibration can help address nonresponse bias in survey data (see this blog, this paper on the CPS ASEC, or this paper on the ACS) and representativeness in administrative data.
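To make the idea of calibration concrete, here is a minimal sketch of raking (iterative proportional fitting), a classic calibration method that rescales weights until the weighted category totals match known population totals. This is not Survey-Kit's algorithm or API. The function name `rake`, the record layout, and the target totals are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def rake(rows, base_weights, margins, max_iter=100, tol=1e-10):
    """Rescale weights until weighted category totals match each margin.

    rows: list of dicts, one per record (hypothetical layout)
    base_weights: starting weight for each record
    margins: {variable: {category: target weighted total}}
    """
    weights = list(base_weights)
    for _ in range(max_iter):
        max_change = 0.0
        for var, targets in margins.items():
            # Current weighted total in each category of this variable.
            totals = defaultdict(float)
            for row, w in zip(rows, weights):
                totals[row[var]] += w
            # Scale every record so its category hits the target total.
            for i, row in enumerate(rows):
                factor = targets[row[var]] / totals[row[var]]
                weights[i] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        # Stop once a full pass leaves the weights essentially unchanged.
        if max_change < tol:
            break
    return weights


# Made-up sample: one record per sex-by-age cell, equal starting weights.
rows = [
    {"sex": "F", "age": "young"},
    {"sex": "F", "age": "old"},
    {"sex": "M", "age": "young"},
    {"sex": "M", "age": "old"},
]
# Made-up population totals; both margins sum to the same population size.
margins = {
    "sex": {"F": 60.0, "M": 40.0},
    "age": {"young": 30.0, "old": 70.0},
}
weights = rake(rows, [1.0] * len(rows), margins)
```

After raking, the weighted counts by sex and by age both match the targets, even though the sample started with equal weights. Real calibration problems add continuous targets, bounds on the weights, and far more records, which is where a fast solver matters.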

The package will also include tools to use machine learning (LightGBM) to impute for missing data. This can be useful in addressing nonrandom nonresponse in survey data and missing administrative data.

It has tools for simpler imputation as well (hot deck/stat match, regression-based imputation, etc.) and to generate multiple imputation statistics with bootstrap or replicate weights and easily make comparisons.
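As a rough illustration of what combining multiple imputation statistics involves, the sketch below applies Rubin's combining rules to a set of estimates, one per imputed dataset. It is a generic textbook calculation, not Survey-Kit's API. The function name `combine_rubin` and the numbers are hypothetical, and the within-imputation variances are assumed to have been computed already (for example, from bootstrap or replicate weights).

```python
import math
from statistics import mean, variance


def combine_rubin(estimates, variances):
    """Combine per-imputation estimates using Rubin's rules.

    estimates: point estimate from each imputed dataset
    variances: sampling variance of each estimate (e.g. from replicate weights)
    Returns the combined point estimate and its standard error.
    """
    m = len(estimates)
    q_bar = mean(estimates)          # combined point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance (divides by m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, math.sqrt(total)


# Made-up estimates from three imputed datasets.
estimates = [10.0, 12.0, 11.0]
variances = [1.0, 1.0, 1.0]
point, se = combine_rubin(estimates, variances)
```

The combined standard error is larger than any single-imputation standard error because it also reflects the uncertainty from the imputation itself, which is the between-imputation term.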

The package uses Narwhals to be (mostly) dataframe agnostic, which means you pass in a pandas DataFrame (or Polars, DuckDB, etc.) and that’s what you get back. It does require polars and pyarrow, unfortunately, because there’s some logic that I couldn’t do in Narwhals.

Available on PyPI, with documentation.

Jon Rothbaum
