Imputation/SRMI

SRMI

Bases: Serializable

Sequential Regression Multiple Imputation (SRMI) class for handling missing data imputation.

This class manages the complete SRMI process including variable setup, model configuration, parallel execution, and result management across multiple implicates and iterations.

Parameters:

Name Type Description Default
df IntoFrameT

Data to be used in the imputation

None
variables list[Variable]

Variables to be imputed (as Variable class instances), by default None

None
model str | list

R-style formula string used as the default model for the imputation

''
selection Selection

Variable selection method used within the imputation (if any), by default None

None
preselection Selection

Variable selection done before SRMI starts to pre-prune inputs, by default None

None
modeltype ModelType

Imputation model type from the ModelType enumeration, by default None

None
parameters dict

Model parameters dictionary, by default None

None
joint dict

Key-value pairs of variables to be included together (i.e. if one is selected in the variable selection step, the other is too), by default None

None
ordered_categorical list

List of categorical variables in the model that are ordered (e.g., education, as opposed to an unordered variable like state or county code), by default None

None
seed int

Random seed for replicability, by default 0 (no seed)

0
weight str

Weight variable name for imputation modeling, by default ""

''
n_implicates int

Number of separate implicates to impute

0
n_iterations int

Number of iterations in each implicate

0
bayesian_bootstrap bool

Use Bayesian Bootstrap to account for uncertainty in coefficients, by default True

True
bootstrap_index list

Index variables for resampling in the Bayesian Bootstrap (e.g., to resample by household rather than by person), by default None

None
bootstrap_where str

SQL condition for keeping observations when resampling, by default ""

''
index list

Columns that uniquely identify observations such as ["h_seq","pppos"], by default None

None
parallel bool

Run implicates in parallel, by default True

True
parallel_variables_per_job int

Number of variables per parallel job (useful for memory management and for working around memory leaks, if any), by default 0

0
parallel_CallInputs CallInputs | None

Parameters for parallel execution such as memory and CPU allocation, by default None

None
parallel_testing bool

Test parallel jobs without running them, by default False

False
path_model str

Directory to save model data and temporary files

''
model_name str

Model name for continuing existing runs, by default ""

''
force_start bool

Restart the imputation even if an existing run exists, by default False

False
save_every_variable bool

Save data after each variable is imputed, by default False

False
save_every_iteration bool

Save data after each iteration completes, by default True

True
from_load bool

Flag indicating object created from saved state, by default False

False
imputation_stats list[str] | None

List of statistics to calculate during imputation, by default None

None

Raises:

Type Description
Exception

If n_implicates < 1 or n_iterations < 1, or if path equals path_model_new in load_to_continue_prior

Examples:

Basic usage:

>>> srmi = SRMI(
...     df=data,
...     variables=[var1, var2],
...     n_implicates=5,
...     n_iterations=10,
...     path_model="/path/to/model"
... )
>>> srmi.run()

With parallel execution:

>>> srmi = SRMI(
...     df=data,
...     variables=vars_list,
...     n_implicates=5,
...     n_iterations=10,
...     parallel=True,
...     parallel_CallInputs=CallInputs(CPUs=4, MemInMB=5000)
... )
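
With Bayesian Bootstrap resampled at the household level (a hedged sketch; the h_seq/pppos identifiers follow the index example above):

>>> srmi = SRMI(
...     df=data,
...     variables=vars_list,
...     n_implicates=5,
...     n_iterations=10,
...     bayesian_bootstrap=True,
...     bootstrap_index=["h_seq"],
...     index=["h_seq", "pppos"],
...     seed=12345,
...     path_model="/path/to/model"
... )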

run

run() -> None

Execute the SRMI imputation process.

Orchestrates the complete imputation workflow including initialization, preprocessing, and execution in parallel or sequential mode.

Notes

The method performs these steps:

1. Creates folders and initializes the implicates to be run
2. Preprocesses data (variable selection, hyperparameter tuning)
3. Runs imputation in parallel or sequential mode
4. Saves results and statistics

For parallel execution, creates job files for each iteration and variable subset. For sequential execution, runs each implicate directly.

Variable

Bases: Serializable

PrePost

Namespace within Variable class for handling pre and post imputation operations.

Currently an operation can be:

1. A Narwhals expression (NarwhalsExpression): anything you can put in nw.from_native(df).with_columns()
2. A Python function handle and its parameters (Function), which lets you call an arbitrary function
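
A minimal sketch of the expression form, assuming an imputed column named earnings (the wrapping into a PrePost operation is not shown):

>>> import narwhals as nw
>>> # Any expression accepted by nw.from_native(df).with_columns() qualifies;
>>> # here, negative imputed earnings are recoded to zero
>>> expr = (
...     nw.when(nw.col("earnings") < 0)
...     .then(0)
...     .otherwise(nw.col("earnings"))
...     .alias("earnings")
... )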

Function

NarwhalsExpression

Bases: Serializable

Parameters

Factory class for creating parameter dictionaries for different imputation methods.

Provides static methods to generate properly formatted parameter dictionaries with validation and default values for each imputation approach.
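
Each static method returns a plain dict, which can be passed to SRMI through its parameters argument. A minimal sketch:

>>> params = Parameters.pmm(knearest=10)
>>> isinstance(params, dict)
True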

ErrorDraw

Bases: Enum

Random = 0

pmm = 1

RegressionModel

Bases: Enum

OLS = 0

Probit = 1

Logit = 2

TwoSampleRegression = 3

HotDeck staticmethod

HotDeck(
    model_list: list[str] | list[list[str]] = None,
    donate_list: list | None = None,
    n_hotdeck_array: int = 3,
    sequential_drop: bool = True,
) -> dict

Parameters for HotDeck imputation

Parameters:

Name Type Description Default
model_list list[str] | list[list[str]]

Each model is a list of variables used as match keys. model_list can either be a list of strings (a single model) or a list of lists of strings (a sequence of models for sequential hot deck matching)

None
donate_list list

Additional variables to impute together, by default None. For example, you can predict earnings amount to find a donor, then also impute hours worked and weeks worked with it.

None
n_hotdeck_array int

Size of hot deck donor arrays, by default 3

3
sequential_drop bool

Drop variables sequentially until matches are found, by default True. If model_list is a list of strings (a single model), the last variable is dropped repeatedly until all recipients find a donor. This makes the hot deck/stat match easier to set up.

True

Returns:

Type Description
dict

Hot deck parameter dictionary
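
Example (a hedged sketch; the match keys are illustrative). Passing a list of lists requests sequential hot deck matching:

>>> params = Parameters.HotDeck(
...     model_list=[
...         ["age_group", "sex", "state"],   # first, try the full match key
...         ["age_group", "sex"],            # then fall back to a coarser key
...     ],
...     donate_list=["hours_worked", "weeks_worked"],
...     n_hotdeck_array=3,
... )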

LightGBM staticmethod

LightGBM(
    tune: bool = False,
    tune_overwrite: bool = False,
    parameters: dict | None = None,
    tuner=None,
    tune_hyperparameter_path: str = "",
    quantiles: list = None,
    error: ErrorDraw = ErrorDraw.pmm,
    parameters_pmm: dict | None = None,
) -> dict

Parameters for LightGBM-based imputation.

Parameters:

Name Type Description Default
tune bool

Whether to tune hyperparameters, by default False

False
tune_overwrite bool

Overwrite existing tuned parameters, by default False

False
parameters dict | None

LightGBM model parameters (from NEWS.CodeUtilities.Python.LightGBM), by default None

None
tuner Tuner

Hyperparameter tuner object (from NEWS.CodeUtilities.Python.LightGBM), by default None

None
tune_hyperparameter_path str

Path for saving/loading tuned parameters, by default ""

''
quantiles list

Quantiles for quantile regression, by default None

None
error ErrorDraw

How to convert yhat from LightGBM into imputed values, by default ErrorDraw.pmm. If pmm, for example, imputes are drawn from the nearest yhat neighbors.

pmm
parameters_pmm dict | None

PMM parameters if using PMM error drawing, by default None

None

Returns:

Type Description
dict

LightGBM parameter dictionary

Raises:

Type Description
Exception

If quantiles are not between 0 and 1
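
Example (a hedged sketch; the tuning path is illustrative, and access to ErrorDraw through the Parameters class is assumed):

>>> params = Parameters.LightGBM(
...     tune=True,
...     tune_hyperparameter_path="/path/to/tuned_params",
...     error=Parameters.ErrorDraw.pmm,
...     parameters_pmm=Parameters.pmm(knearest=10),
... )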

NearestNeighbor staticmethod

NearestNeighbor(
    match_to: str | list[str], parameters_pmm: dict = None
) -> dict

Match directly to an x variable (or a set of variables) using nearest neighbor matching

Parameters:

Name Type Description Default
match_to str | list[str]

Variable or list to match to.

required
parameters_pmm dict

Nearest neighbor match parameters, by default None

None

Returns:

Type Description
dict

Nearest neighbor parameter dictionary
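
Example (a hedged sketch; variable names are illustrative):

>>> params = Parameters.NearestNeighbor(
...     match_to=["age", "earnings_lastyear"],
...     parameters_pmm=Parameters.pmm(knearest=5),
... )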

Regression staticmethod

Regression(
    model: RegressionModel = RegressionModel.OLS,
    error: ErrorDraw = ErrorDraw.pmm,
    random_share: float = 1.0,
    parameters_pmm: dict = None,
) -> dict

Parameters for regression-based imputation.

Parameters:

Name Type Description Default
model RegressionModel

Type of regression model, by default RegressionModel.OLS

OLS
error ErrorDraw

Method for drawing errors, by default ErrorDraw.pmm. If pmm, draws are taken from the nearest yhat neighbors; if Random, draws are taken from the observed errors of modeled observations

pmm
random_share float

Fraction of data to use for the regression, by default 1.0. Set below 1.0 to use less memory by running the regression on a random subset.

1.0
parameters_pmm dict

PMM parameters if using PMM error drawing, by default None

None

Returns:

Type Description
dict

Regression parameter dictionary
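
Example (a hedged sketch; access to the enums through the Parameters class is assumed):

>>> params = Parameters.Regression(
...     model=Parameters.RegressionModel.Logit,
...     error=Parameters.ErrorDraw.Random,
...     random_share=0.5,  # fit on a random half of the data to save memory
... )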

StatMatch staticmethod

StatMatch(
    model_list: list = None,
    donate_list: list = None,
    sequential_drop: bool = False,
)

Parameters for statistical match imputation

Stat match and hot deck are essentially the same, but the hot deck iterates over the data carrying arrays of possible donor values, whereas the stat match simply performs a join of donors and recipients

Parameters:

Name Type Description Default
model_list list[str] | list[list[str]]

Each model is a list of variables used as match keys. model_list can either be a list of strings (a single model) or a list of lists of strings (a sequence of models for sequential matching)

None
donate_list list

Additional variables to impute together, by default None. For example, you can predict earnings amount to find a donor, then also impute hours worked and weeks worked with it.

None
sequential_drop bool

Drop variables sequentially until matches are found, by default False. If model_list is a list of strings (a single model), the last variable is dropped repeatedly until all recipients find a donor. This makes the hot deck/stat match easier to set up.

False

Returns:

Type Description
dict

Stat match parameter dictionary
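
Example (a hedged sketch; the match keys are illustrative):

>>> params = Parameters.StatMatch(
...     model_list=["age_group", "sex", "state"],
...     donate_list=["hours_worked", "weeks_worked"],
...     sequential_drop=True,  # drop "state", then "sex", until all recipients match
... )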

pmm staticmethod

pmm(
    knearest: int = 10,
    model: RegressionModel = RegressionModel.OLS,
    donate_list: list[str] = None,
    winsor: tuple[float, float] = [0, 1],
    share_leave_out: float = 0.0,
    donate_by: list[str] | str | None = None,
) -> dict

Parameters for predictive mean matching imputation.

Parameters:

Name Type Description Default
knearest int

Number of nearest neighbors for matching, by default 10

10
model RegressionModel

Regression model type, by default RegressionModel.OLS

OLS
donate_list list[str]

Additional variables to impute together, by default None. For example, you can predict earnings amount to find a donor, then also impute hours worked and weeks worked with it.

None
winsor tuple[float, float]

Winsorization percentiles, by default [0, 1]

[0, 1]
share_leave_out float

Share of the sample to leave out, used to adjust the predictions so that their distribution in a random sample matches the training-sample yhat distribution (potentially relevant for LightGBM), by default 0.0

0.0
donate_by list[str] | str | None

Grouping variables for donation, by default None

None

Returns:

Type Description
dict

PMM parameter dictionary
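
Example (a hedged sketch; column names are illustrative):

>>> params_pmm = Parameters.pmm(
...     knearest=10,
...     winsor=(0.01, 0.99),  # winsorize at the 1st and 99th percentiles
...     donate_list=["hours_worked", "weeks_worked"],
...     donate_by="state",    # only match donors and recipients within a state
... )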

Selection

Bases: Serializable

Handles variable selection for imputation models.

Supports various selection methods including LASSO, stepwise selection, and custom functions to reduce model dimensionality.

Parameters:

Name Type Description Default
method Method

Selection method to use, by default Method.No

No
parameters dict

Method-specific parameters, by default None

None
select_within_by bool

Whether to run selection within each by-group, by default True

True
function callable

Custom selection function, by default None

None
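
Example (a hedged sketch; the Method member name for LASSO is an assumption mirroring Selection.Parameters.lasso):

>>> sel = Selection(
...     method=Method.lasso,  # assumed enum member; see Method for actual names
...     parameters=Selection.Parameters.lasso(nfolds=5),
...     select_within_by=True,
... )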

Parameters

lasso

lasso(
    nfolds: int = 5,
    type_measure: str = "default",
    include_base_with_interaction: bool = True,
    winsorize: tuple[float, float] | None = None,
    continuous: bool = False,
    binomial: bool = False,
    missing_dummies: bool = True,
    optimal_lambda: float = None,
    optimal_lambda_from_pre: bool = True,
    scale_lambda: float = 1.0,
) -> dict

Parameters for LASSO variable selection method.

Parameters:

Name Type Description Default
nfolds int

Number of folds for cross-validation during LASSO regression. Used to determine optimal lambda value through k-fold cross-validation.

5
type_measure str

Type of measure to use for cross-validation error. Determines how model performance is evaluated during lambda selection. Options are "default", "mse", "deviance", "class", "auc", "mae".

"default"
include_base_with_interaction bool

Whether to include base variables when interaction terms are selected. If True, base variables are automatically included when their interactions are selected by LASSO.

True
winsorize tuple[float, float] or None

Percentile bounds for winsorizing the dependent variable. Tuple of (low_percentile, high_percentile) to cap extreme values. None means no winsorization.

None
continuous bool

Force treatment of dependent variable as continuous, overriding automatic detection.

False
binomial bool

Force treatment of dependent variable as binomial, overriding automatic detection.

False
missing_dummies bool

Whether to create dummy variables for missing values in predictor variables.

True
optimal_lambda float or None

Pre-specified optimal lambda value for LASSO regularization. If None, optimal lambda will be determined through cross-validation.

None
optimal_lambda_from_pre bool

Whether to use optimal lambda from a previous preselection step.

True
scale_lambda float

Scaling factor applied to the optimal lambda value. Values > 1 make regularization stronger (fewer variables selected), values < 1 make it weaker (more variables selected).

1.0

Returns:

Type Description
dict

Dictionary containing all parameter values for LASSO selection.

Raises:

Type Description
Exception

When type_measure is not one of the acceptable values.
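
Example (a hedged sketch):

>>> lasso_params = Selection.Parameters.lasso(
...     nfolds=5,
...     type_measure="mse",
...     winsorize=(0.005, 0.995),
...     scale_lambda=1.5,  # > 1 strengthens regularization, keeping fewer variables
... )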

stepwise

stepwise(
    nfolds: int = 5,
    scoring: str = "neg_mean_squared_error",
    include_base_with_interaction: bool = True,
    winsorize: tuple[float, float] | None = None,
    missing_dummies: bool = True,
    min_features_to_select: int = 10,
)

Parameters for stepwise variable selection method using Recursive Feature Elimination with Cross-Validation (RFECV).

Parameters:

Name Type Description Default
nfolds int

Number of folds for cross-validation during stepwise selection. Used in RFECV to evaluate feature importance.

5
scoring str

Scoring metric used to evaluate model performance during cross-validation. Should be a valid sklearn scoring parameter.

"neg_mean_squared_error"
include_base_with_interaction bool

Whether to include base variables when interaction terms are selected. If True, base variables are automatically included when their interactions are selected.

True
winsorize tuple[float, float] or None

Percentile bounds for winsorizing the dependent variable. Tuple of (low_percentile, high_percentile) to cap extreme values. None means no winsorization.

None
missing_dummies bool

Whether to create dummy variables for missing values in predictor variables.

True
min_features_to_select int

Minimum number of features that must be selected by the stepwise procedure. Prevents over-reduction of the feature set.

10

Returns:

Type Description
dict

Dictionary containing all parameter values for stepwise selection.
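
Example (a hedged sketch):

>>> stepwise_params = Selection.Parameters.stepwise(
...     nfolds=5,
...     scoring="neg_mean_squared_error",
...     min_features_to_select=10,
... )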