Statistics & Standard Errors¶
What Is It¶
The Statistics module provides tools for calculating summary statistics with proper standard errors for complex survey data and multiple imputation. It handles three key sources of uncertainty:
- Sampling variance - From complex survey designs (stratification, clustering, unequal weights) with replicate weights (for examples, see CPS ASEC documentation, ACS documentation, Consumer Expenditure Survey documentation, SCF documentation, MEPS documentation, ...)
- Imputation uncertainty - When working with multiply imputed data
- Both combined - Full uncertainty accounting when you have imputed data with replicate weights
The package supports:
- Quick exploratory summaries with summary()
- Replicate weight standard errors (Bootstrap and Balanced Repeated Replication)
- Multiple imputation inference using Rubin's rules
- Statistical comparisons with proper standard errors
- Custom analysis functions with automatic variance estimation
Why Use It¶
Standard statistical software often gets uncertainty wrong:
The problem: - Using design weights without replicate weights → Standard errors too small - Simple variance formulas with complex samples → Wrong confidence intervals - Ignoring imputation uncertainty → Overstated precision - Ad-hoc comparisons → Invalid hypothesis tests.
For example, replicate weights are intended to account for correlation over space and time in sample selection, so using them for comparisons will account for correlation in estimates across the different replicate factors. To give a concrete example, addresses in the CPS are in sample for consecutive months, so if you want to get the correct confidence intervals on the change in unemployment, you need to use the replicate weights in the month-to-month comparison. The standard errors will be much narrower than if you compared the unemployment rates as if they were two independent samples.
This package: - Implements proper variance estimation for complex surveys (as used by Census Bureau, BLS, etc.) - Correctly combines uncertainty from multiple imputations - Makes it easy to do the right thing (comparisons, custom functions, etc.) - Fast enough for production use with millions of observations (assuming you use polars)
Key Features¶
- DataFrame agnostic - Works with Polars, Pandas, Arrow, DuckDB via Narwhals
- Fast - Optimized for large datasets (100K+ rows, tested on millions)
- Flexible - Custom functions get variance estimation for free
- Parallel processing - Run imputations in parallel
- Proper inference - Implements Rubin's rules for MI, supports BRR/bootstrap for replicate weights
- Production-ready - Used in Census Bureau's National Experimental Wellbeing Statistics
Implementation Notes¶
Replicate Weights¶
Two variance estimation approaches:
- Bootstrap (bootstrap=True) - Standard bootstrap resampling variance
- BRR (bootstrap=False) - Balanced Repeated Replication as used by Census Bureau
Multiple Imputation¶
Implements Rubin's (1987) combining rules: - Combined estimate = average across implicates - Total variance = within-imputation variance + between-imputation variance - Degrees of freedom account for finite number of implicates - Missing information rate quantifies uncertainty from imputation
When to Use What¶
| Use Case | Tool | Why |
|---|---|---|
| Quick exploration | summary() |
Fast, no SEs needed |
| Survey data with replicate weights | StatCalculator + Replicates |
Proper complex survey variance |
| Imputed data | mi_ses_from_function() |
Accounts for imputation uncertainty |
| Imputed survey data | Both combined | Full uncertainty (sampling + imputation) |
| Custom analysis | StatCalculator.from_function() |
Any function gets proper SEs |
API¶
See the Basic Statistics and Standard Errors documentation estimating summary stats and standard errors. See the Multiple Imputation documentation estimating statistics with multiply imputed data.
Example/Tutorial¶
The summary() function provides easy data summaries (with more control than the default describe() methods in pandas or polars)
from survey_kit.utilities.random import RandomData
from survey_kit.utilities.dataframe import summary
from survey_kit.statistics.statistics import Statistics
from survey_kit import logger
# %%
logger.info("Draw some random data")
n_rows = 1_000
df = (
RandomData(n_rows=n_rows, seed=12332151)
.index("index")
.integer("v_int", 0, 10)
.boolean("v_bool")
.float("v_float", -1, 1)
.integer("weight_0", 100, 1_000_000)
.integer("year", 2016, 2018)
.integer("quarter", 1, 4)
.to_df()
.lazy()
)
# %%
logger.info("The simplest option: just call summary(df)")
logger.info(" Note the '_ =' is to prevent it from being printed twice in jupyter when generating the html, you can ignore")
_ = summary(df)
# %%
logger.info("\n\n + Weighted")
_ = summary(df,weight="weight_0")
# %%
logger.info("\n\n + by something")
_ = summary(
df,
weight="weight_0",
by="year"
)
_ = summary(
df,
weight="weight_0",
by=["quarter","year"]
)
# %%
logger.info("\n\n with detailed stats and 4-sig digit rounding")
_ = summary(
df,
weight="weight_0",
detailed=True,
drb_round=True
)
# %%
logger.info("\n\n with additional stats")
logger.info("What is available:")
logger.info(Statistics.available_stats())
_ = summary(
df,
weight="weight_0",
additional_stats=["q10","q95","n|not0","share|not0"]
)
logger.info("Get them (but no need to print):")
df_stats = summary(
df,
weight="weight_0",
additional_stats=["q10","q95","n|not0","share|not0"],
print=False,
)
logger.info(df_stats.collect())
For proper variance estimation with complex survey designs or using bootstrap weights
from __future__ import annotations
import polars as pl
from survey_kit.utilities.random import RandomData
from survey_kit.utilities.dataframe import summary
from survey_kit.statistics.calculator import StatCalculator,ComparisonItem
from survey_kit.statistics.statistics import Statistics
from survey_kit.statistics.replicates import Replicates
from survey_kit.statistics.bootstrap import bayes_bootstrap
from survey_kit.utilities.random import set_seed, generate_seed
from survey_kit import logger, config
# %%
# Draw some random data
def gen_random_table(n_rows: int, n_replicates: int, seed: int):
set_seed(seed)
df = (
RandomData(n_rows=n_rows, seed=generate_seed())
.index("index")
.integer("v_1", 0, 10)
.integer("year", 2016, 2021)
.integer("month", 1, 12)
.integer("income", 0, 100_000)
.integer("income_2", 0, 120_000)
).to_df()
# Attach bootstrap weights to the data
df = (
pl.concat(
[
df,
bayes_bootstrap(
n_rows=n_rows,
n_draws=n_replicates+1,
seed=generate_seed(),
initial_weight_index=0,
prefix="weight_"
)
],
how="horizontal"
)
.lazy()
)
df = df.with_columns(
pl.when(pl.col("year").ne(2016)).then(pl.col("income")).otherwise(pl.lit(0))
)
return df
n_rows = 1_000
n_replicates = 10
logger.info("Create two datasets to compare, with bootstrap weights (or repliate weights)")
df = gen_random_table(n_rows, n_replicates, seed=1230)
df_compare = gen_random_table(n_rows, n_replicates, seed=9324)
# %%
logger.info("What's the data look like")
logger.info(df.head(10).collect())
logger.info(df_compare.head(10).collect())
# %%
logger.info("Use the stat calculator class to get stats")
logger.info(" The basic option, just get some stats")
stats = statistics=Statistics(
stats=["mean", "median|not0"],
columns=["v_1", "income", "income_2"]
)
logger.info("What is available:")
logger.info(Statistics.available_stats())
# %%
logger.info("Estimate them")
sc = StatCalculator(
df,
statistics=stats,
weight="weight_0",
)
# %%
logger.info(" Want standard errors?")
logger.info(" Set up 'Replicates'")
logger.info(" from data (with df)")
logger.info(" NOTE: if pass 'bootstrap=True', it will calculate ")
logger.info(" regular bootstrap SEs rather than balanced repeated replication SEs")
logger.info(" (if you don't know what that means, pass bootstrap=True)")
replicates_from_df = Replicates(
df=df,
weight_stub="weight_",
bootstrap=True
)
logger.info(" or just tell it the number (n_replicates)")
replicates = Replicates(
weight_stub="weight_",
n_replicates=n_replicates,
bootstrap=True
)
logger.info(" Same either way")
logger.info(replicates_from_df.rep_list)
logger.info(replicates.rep_list)
del replicates_from_df
# %%
logger.info("\n\nNow let's calculate the stats with replicate-weighted standard errors")
logger.info(" for df")
logger.info(" (where each estimate has the SE below the estimate)")
sc = StatCalculator(
df,
statistics=stats,
weight="weight_0",
replicates=replicates
)
logger.info(" for df_compare")
sc_compare = StatCalculator(
df_compare,
statistics=stats,
weight="weight_0",
replicates=replicates
)
logger.info("Save them for later")
sc.save(f"{config.path_temp_files}/sc")
sc_compare.save(f"{config.path_temp_files}/sc_compare")
del sc
del sc_compare
# %%
logger.info("Load them back up")
sc = StatCalculator.load(f"{config.path_temp_files}/sc")
sc_compare = StatCalculator.load(f"{config.path_temp_files}/sc_compare")
# %%
logger.info("\n\nWe can compare them easily")
d_comparisons = sc.compare(sc_compare)
logger.info("Which gives us a dictionary of comparisons, with keys=['difference','ratio']")
logger.info(d_comparisons['difference'])
logger.info(d_comparisons['ratio'])
# %%
logger.info("\n\nLet's just get the difference between the means to medians, within the same data")
logger.info(" For df")
sc_mean_median = sc.compare(
sc,
ratio=False,
compare_list_columns=[
ComparisonItem.Column(
"mean",
"median (not 0)",
name="median_mean"
)
])["difference"]
logger.info(" For df_compare")
sc_compare_mean_median = sc_compare.compare(
sc_compare,
ratio=False,
compare_list_columns=[
ComparisonItem.Column(
"mean",
"median (not 0)",
name="median_mean"
)
])["difference"]
# %%
logger.info("\n\nNow, we can use the same tool to get a difference in difference comparison")
logger.info(" i.e. [df(median|not0) - df(mean)] - [df_compare(median|not0) - df_compare(mean)]")
d_diff_in_diffs = sc_mean_median.compare(sc_compare_mean_median,ratio=False)
# %%
logger.info("\n\nLet's get the ratio two variables")
logger.info(" rather than the difference between two statistics for the same variable)")
logger.info(" For df")
sc_inc_inc_2 = sc.compare(
sc,
difference=False,
compare_list_variables=[
ComparisonItem.Variable(
value1="income",
value2="income_2",
name="income_comp"
)
])["ratio"]
logger.info(" For df_compare")
sc_inc_inc_2_compare = sc_compare.compare(
sc_compare,
difference=False,
compare_list_variables=[
ComparisonItem.Variable(
value1="income",
value2="income_2",
name="income_comp"
)
])["ratio"]
# %%
logger.info("\n\nAnd for fun, the diff-in-diff")
logger.info(" i.e. [df(income_2)/df(income)] - [df_compare(income_2)/df_compare(income)]")
d_diff_in_diffs_2 = sc_inc_inc_2.compare(sc_inc_inc_2_compare,ratio=False)
This works with any function that takes a dataframe and weight and returns estimates. You get proper variance estimation automatically.
from __future__ import annotations
import polars as pl
from sklearn.linear_model import LinearRegression
from survey_kit.utilities.random import RandomData
from survey_kit.utilities.dataframe import summary, columns_from_list
from survey_kit.statistics.calculator import StatCalculator
from survey_kit.statistics.replicates import Replicates
from survey_kit.statistics.bootstrap import bayes_bootstrap
from survey_kit.utilities.random import set_seed, generate_seed
from survey_kit import logger
# %%
# Draw some random data
def gen_random_table(n_rows: int, n_replicates: int, seed: int):
set_seed(seed)
df = (
RandomData(n_rows=n_rows, seed=generate_seed())
.index("index")
.integer("v_1", 0, 10)
.float("v_2", 0, 100)
.float("v_3", 0, 100)
.float("v_4", 0, 100)
.np_distribution("e", "normal", loc=0, scale=100)
)
df = df.to_df()
c_v1 = pl.col("v_1")
c_v2 = pl.col("v_2")
c_v3 = pl.col("v_3")
c_v4 = pl.col("v_4")
e = pl.col("e")
df = df.with_columns(
y=10*c_v1-5*c_v2+100*c_v3+c_v4 + e
)
df = (
pl.concat(
[
df,
bayes_bootstrap(
n_rows=n_rows,
n_draws=n_replicates+1,
seed=generate_seed(),
initial_weight_index=0,
prefix="weight_"
)
],
how="horizontal"
)
.lazy()
)
return df
# %%
n_rows = 1_000
n_replicates = 10
logger.info("Create two datasets to compare, with replicate weights (or bootstrap weights)")
df = gen_random_table(n_rows, n_replicates, seed=1230)
df_compare = gen_random_table(n_rows, n_replicates, seed=3254)
logger.info("The data")
_ = summary(df)
# %%
logger.info("To run an arbitrary stat, you need")
logger.info(" any function that takes a dataframe and a weight and returns a dataframe")
logger.info("The function will be run for each boostrap/replicate weight")
def run_regression(df:pl.LazyFrame | pl.DataFrame,
y:str,
X:list[str],
weight:str) -> pl.LazyFrame | pl.DataFrame:
df = df.lazy().collect()
model = LinearRegression()
model.fit(
X=df.select(X),
y=df.select(y),
sample_weight=df[weight]
)
df_betas = pl.DataFrame(
dict(
Variable=df.select(X).lazy().collect_schema().names() + ["_Intercept_"],
Beta=[
float(vali)
for vali in [float(coefi) for coefi in model.coef_[0]] + [float(model.intercept_[0])]
],
)
)
return df_betas
# %%
logger.info("The betas from the regression")
df_betas = run_regression(df=df,y="y",X=columns_from_list(df,["v_*"]),weight="weight_0")
logger.info(df_betas)
logger.info("Call StatCalculator.from_function")
logger.info(" Pass in the data")
logger.info(" 'estimate_ids': the name of the column(s) that index the estimates")
logger.info(" 'arguments': stable (non df and non-weight) arguments")
logger.info(" 'replicates': weights to loop over")
sc = StatCalculator.from_function(
run_regression,
df=df,
estimate_ids=["Variable"],
arguments=dict(
y="y",
X=columns_from_list(df,["v_*"])
),
replicates=Replicates(
weight_stub="weight_",
n_replicates=n_replicates,
bootstrap=True
),
display=False
)
logger.info("The result")
sc.print()
# %%
logger.info("You can compare it like any other stat, run the same regression on another sample")
logger.info(" Run the regression on a different random draw")
sc_compare = StatCalculator.from_function(
run_regression,
df=df_compare,
estimate_ids=["Variable"],
arguments=dict(
y="y",
X=columns_from_list(df,["v_*"])
),
replicates=Replicates(
weight_stub="weight_",
n_replicates=n_replicates,
bootstrap=True
),
display=False
)
# %%
logger.info(" And compare the results")
d_comp = sc.compare(sc_compare)
This works with any function that takes a dataframe and weight and returns estimates. You get proper variance estimation automatically.
Combine estimates properly across multiply imputed datasets using Rubin's rule. This incorporates the bootstrap/replicate factor code internally.
import sys
import os
from pathlib import Path
import narwhals as nw
import polars as pl
from survey_kit.utilities.dataframe import summary
from survey_kit.imputation.srmi import SRMI
from survey_kit import logger, config
from survey_kit.statistics.multiple_imputation import MultipleImputation, mi_ses_from_function
from survey_kit.statistics.calculator import StatCalculator
from survey_kit.statistics.statistics import Statistics
from survey_kit.statistics.replicates import Replicates
from survey_kit.utilities.random import RandomData
from survey_kit.utilities.dataframe import safe_height
from survey_kit.statistics.bootstrap import bayes_bootstrap
from survey_kit.utilities.random import set_seed, generate_seed
# %%
logger.info("Load the SRMI results from that tutorial")
path_model = f"{config.path_temp_files}/py_srmi_test_gbm"
srmi = SRMI.load(path_model)
df_implicates = srmi.df_implicates
# %%
logger.info("Draw bootstrap weights")
set_seed(8345)
n_rows = safe_height(df_implicates[0])
n_replicates = 10
df_weights = (
bayes_bootstrap(
n_rows=n_rows,
n_draws=n_replicates+1,
seed=generate_seed(),
prefix="weight_",
initial_weight_index=0
)
.with_row_index("index")
)
# %%
logger.info("What's the data look like")
_ = df_implicates.pipe(summary)
_ = summary(df_weights)
# %%
logger.info("What statistics do I want:")
logger.info(" In this case, the mean of all variables that start with var_")
stats = Statistics(
stats=["mean"],
columns="var_*",
)
# %%
logger.info("Define the 'replicate' object, which tell is what the weight variables are")
replicates = Replicates(weight_stub="weight_", n_replicates=n_replicates)
# %%
logger.info("Arguments that are getting passed to StatCalculator at each run")
arguments = dict(
statistics=stats,
replicates=replicates
)
# %%
logger.info("Get the multiple imputation standard errofs by calling StatCalculator")
logger.info(" for each implicate and each replicate factor")
logger.info(" If you had 100 bootstrap weights and 5 imputation draws (implicates)")
logger.info(" the StatCalculator calculation would run 5*100=500 times")
mi_results_seq = mi_ses_from_function(
delegate=StatCalculator,
df_implicates=df_implicates,
df_noimputes=df_weights,
index=["index"],
arguments=arguments,
join_on=["Variable"],
parallel=False,
)
# %%
logger.info("Do it again, but run each implicate in it's own process and collect the results")
logger.info(" This will split up the job and ")
logger.info(" use all your cpus (or what you set in config.cpus), dividing them up among the jobs")
mi_results = mi_ses_from_function(
delegate=StatCalculator,
df_implicates=df_implicates,
df_noimputes=df_weights,
index=["index"],
arguments=arguments,
join_on=["Variable"],
parallel=True,
)
# %%
logger.info("The sequential (non-parallel) job results")
mi_results.print()
# %%
logger.info("The parallel job results")
mi_results_seq.print()
# %%
logger.info("Save them for later")
mi_results.save(f"{config.path_temp_files}/mi_sequential")
mi_results_seq.save(f"{config.path_temp_files}/mi_parallel")
del mi_results
del mi_results_seq
# %%
logger.info("Load them back up")
mi_results = MultipleImputation.load(f"{config.path_temp_files}/mi_sequential")
mi_results_seq = MultipleImputation.load(f"{config.path_temp_files}/mi_parallel")
# %%
logger.info("Compare them")
d_comparison = mi_results.compare(mi_results_seq)
logger.info(" Print the mi_results_seq - mi_results")
d_comparison["difference"].print()
logger.info(" Print the (mi_results_seq-mi_results)/mi_results")
d_comparison["ratio"].print()
This works with any function that takes a dataframe and weight and returns estimates. You get proper variance estimation (replicate weights + MI) automatically.
from __future__ import annotations
import polars as pl
from survey_kit.utilities.dataframe import summary, join_wrapper
from survey_kit.imputation.srmi import SRMI
from survey_kit import logger, config
from survey_kit.statistics.multiple_imputation import mi_ses_from_function
from survey_kit.statistics.calculator import StatCalculator
from survey_kit.statistics.replicates import Replicates
from survey_kit.utilities.dataframe import safe_height
from survey_kit.statistics.bootstrap import bayes_bootstrap
# %%
logger.info("Load the data from the SRMI tutorial")
path_model = f"{config.path_temp_files}/py_srmi_test_gbm"
srmi = SRMI.load(path_model)
df_implicates = srmi.df_implicates
# %%
logger.info("Make random weights for the bootstrap")
n_rows = safe_height(df_implicates[0])
n_replicates = 10
df_weights = (
bayes_bootstrap(
n_rows=n_rows,
n_draws=n_replicates+1,
seed=8345,
prefix="weight_",
initial_weight_index=0,
)
.with_row_index("index")
)
# %%
logger.info("What's the data look like")
logger.info(" The SRMI data - both implicates")
_ = df_implicates.pipe(summary)
logger.info(" The weights")
_ = summary(df_weights)
# %%
logger.info("To run an arbitrary stat, you need")
logger.info(" any function that takes a dataframe and a weight and returns a dataframe")
logger.info("The function will be run for each boostrap/replicate weight")
logger.info(" Note: you may need to import packages within the function")
logger.info(" if executed in parallel (it will get pickled and reloaded)")
def run_regression(df:pl.LazyFrame | pl.DataFrame,
y:str,
X:list[str],
weight:str="") -> pl.LazyFrame | pl.DataFrame:
import polars as pl
from sklearn.linear_model import LinearRegression
df = df.lazy().collect()
model = LinearRegression()
d_weight = {}
if weight != "":
d_weight["sample_weight"] = df[weight]
model.fit(
X=df.select(X),
y=df.select(y),
**d_weight
)
df_betas = pl.DataFrame(
dict(
Variable=df.select(X).lazy().collect_schema().names() + ["_Intercept_"],
Beta=[
float(vali)
for vali in [float(coefi) for coefi in model.coef_[0]] + [float(model.intercept_[0])]
],
)
)
return df_betas
# %%
logger.info("The betas from the regression (no weights, implicate 0)")
y = "var_gbm2"
X = ["var2","var3","var4","var5"]
df_betas = run_regression(df=df_implicates[0],y=y,X=X)
logger.info(df_betas)
# %%
logger.info("Confirming how to run it on one implicate")
arguments_sc = dict(
y=y,
X=X,
)
replicates = Replicates(
weight_stub="weight_",
n_replicates=n_replicates,
bootstrap=True
)
sc_compare = StatCalculator.from_function(
run_regression,
df=join_wrapper(df_implicates[0],df_weights,on=["index"],how="left"),
estimate_ids=["Variable"],
arguments=arguments_sc,
replicates=replicates,
display=False
)
sc_compare.print()
# %%
logger.info("Now the full function")
logger.info(" Note the nested arguments...")
logger.info(" Maybe not the best API, but it does mean")
logger.info(" that you can run ANYTHING that takes in a")
logger.info(" dataframe and a weight and returns a dataframe of estiamtes")
logger.info(" and run it through the bootstrap + MI process")
arguments_sc = dict(
y=y,
X=X,
)
arguments_mi = dict(
delegate=run_regression,
arguments=arguments_sc,
estimate_ids=["Variable"],
replicates=replicates,
)
mi_results_seq = mi_ses_from_function(
delegate=StatCalculator.from_function,
df_implicates=df_implicates,
path_srmi=path_model,
df_noimputes=df_weights,
index=["index"],
arguments=arguments_mi,
join_on=["Variable"],
parallel=False,
)
mi_results_seq.print()
# %%
logger.info("For parallel, it's better to pass the path of the model")
logger.info(" To avoid unnecessary I/O")
logger.info(" (For sequential it doesn't really matter)")
mi_results = mi_ses_from_function(
delegate=StatCalculator.from_function,
# df_implicates=df_implicates,
path_srmi=path_model,
df_noimputes=df_weights,
index=["index"],
arguments=arguments_mi,
join_on=["Variable"],
parallel=True,
)
# %%
logger.info("The results")
mi_results.print()
mi_results_seq.print()
# %%
logger.info("Compare two sets of results")
d_comparison = mi_results.compare(mi_results_seq)
d_comparison["difference"].print()
d_comparison["ratio"].print()
This works with any function that takes a dataframe and weight and returns estimates. You get proper variance estimation (replicate weights + MI) automatically.