Imputation/SRMI ¶
SRMI ¶
Bases: Serializable
Sequential Regression Multiple Imputation (SRMI) class for handling missing data imputation.
This class manages the complete SRMI process including variable setup, model configuration, parallel execution, and result management across multiple implicates and iterations.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | IntoFrameT | Data to be used in the imputation. | None |
| variables | list[Variable] | Variables to be imputed, as Variable class instances. | None |
| model | str \| list | R-style string formula used as the default model for the imputation. | '' |
| selection | Selection | Variable selection method used within the imputation (if any). | None |
| preselection | Selection | Variable selection run before SRMI starts to pre-prune inputs. | None |
| modeltype | ModelType | Imputation model type from the ModelType enumeration. | None |
| parameters | dict | Dictionary of model parameters. | None |
| joint | dict | Key-value pairs of variables to be included together (if one is selected in the variable selection step, the other is too). | None |
| ordered_categorical | list | Categorical variables in the model that are ordered, such as education (as opposed to an unordered variable like state or county code). | None |
| seed | int | Random seed for replicability; 0 means no seed. | 0 |
| weight | str | Name of the weight variable used in the imputation models. | '' |
| n_implicates | int | Number of separate implicates to impute. | 0 |
| n_iterations | int | Number of iterations in each implicate. | 0 |
| bayesian_bootstrap | bool | Use a Bayesian bootstrap to account for uncertainty in the coefficients. | True |
| bootstrap_index | list | Index variables for resampling in the Bayesian bootstrap (e.g. to resample by household rather than by person). | None |
| bootstrap_where | str | SQL condition for keeping observations when resampling. | '' |
| index | list | Columns that uniquely identify observations, such as ["h_seq", "pppos"]. | None |
| parallel | bool | Run implicates in parallel. | True |
| parallel_variables_per_job | int | Number of variables per parallel job, used for memory management and to contain memory leaks, if any. | 0 |
| parallel_CallInputs | CallInputs \| None | Parameters for parallel execution, such as memory and CPU allocation. | None |
| parallel_testing | bool | Set up parallel jobs for testing without running them. | False |
| path_model | str | Directory in which to save model data and temporary files. | '' |
| model_name | str | Model name used when continuing an existing run. | '' |
| force_start | bool | Restart the imputation even if an existing run is found. | False |
| save_every_variable | bool | Save data after each variable is imputed. | False |
| save_every_iteration | bool | Save data after each iteration completes. | True |
| from_load | bool | Flag indicating the object was created from a saved state. | False |
| imputation_stats | list[str] \| None | Statistics to calculate during imputation. | None |
Raises:

| Type | Description |
|---|---|
| Exception | If n_implicates < 1 or n_iterations < 1, or if path equals path_model_new in load_to_continue_prior. |
Examples:
Basic usage:
>>> srmi = SRMI(
... df=data,
... variables=[var1, var2],
... n_implicates=5,
... n_iterations=10,
... path_model="/path/to/model"
... )
>>> srmi.run()
With parallel execution:
>>> srmi = SRMI(
... df=data,
... variables=vars_list,
... n_implicates=5,
... n_iterations=10,
... parallel=True,
... parallel_CallInputs=CallInputs(CPUs=4, MemInMB=5000)
... )
run ¶
Execute the SRMI imputation process.
Orchestrates the complete imputation workflow including initialization, preprocessing, and execution in parallel or sequential mode.
Notes
The method performs these steps:

1. Creates folders and initializes the implicates to be run.
2. Preprocesses data (variable selection, hyperparameter tuning).
3. Runs the imputation in parallel or sequential mode.
4. Saves results and statistics.
For parallel execution, creates job files for each iteration and variable subset. For sequential execution, runs each implicate directly.
Variable ¶
Bases: Serializable
PrePost ¶
Namespace within the Variable class for handling pre- and post-imputation operations.
Currently an operation can be either:

1. a Narwhals expression (NarwhalsExpression), i.e. anything you can pass to nw.from_native(df).with_columns(), or
2. a Python function handle plus parameters (Function), which lets you call an arbitrary function.
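As a rough illustration of the two kinds of operation (the column names are illustrative, and how each is attached to a PrePost is determined by the NarwhalsExpression and Function classes below):

>>> import narwhals as nw
>>> # 1) A Narwhals expression of the kind PrePost can wrap: anything valid
>>> #    inside nw.from_native(df).with_columns(). Here, topcode an amount.
>>> post_expr = (
...     nw.when(nw.col("earnings") > 400_000)
...     .then(400_000)
...     .otherwise(nw.col("earnings"))
...     .alias("earnings")
... )
>>> # 2) A plain Python function handle plus parameters, the second kind of
>>> #    operation PrePost supports; here, recode negative values to zero.
>>> def recode_negative_to_zero(df, column):
...     return (
...         nw.from_native(df)
...         .with_columns(
...             nw.when(nw.col(column) < 0).then(0).otherwise(nw.col(column)).alias(column)
...         )
...         .to_native()
...     )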
Function ¶
NarwhalsExpression ¶
Bases: Serializable
Parameters ¶
Factory class for creating parameter dictionaries for different imputation methods.
Provides static methods to generate properly formatted parameter dictionaries with validation and default values for each imputation approach.
ErrorDraw ¶
Bases: Enum
- Random = 0
- pmm = 1
HotDeck
staticmethod
¶
HotDeck(
model_list: list[str] | list[list[str]] = None,
donate_list: list | None = None,
n_hotdeck_array: int = 3,
sequential_drop: bool = True,
) -> dict
Parameters for hot deck (HotDeck) imputation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_list | list[str] \| list[list[str]] | Each model is a list of variables used as match keys. model_list can be either a list of strings (a single model) or a list of lists of strings (a sequence of models for a sequential hot deck). | None |
| donate_list | list | Additional variables to impute together with the matched variable; e.g. predict an earnings amount to find a donor, then also take hours worked and weeks worked from that donor. | None |
| n_hotdeck_array | int | Size of the hot deck donor arrays. | 3 |
| sequential_drop | bool | If model_list is a list of strings (one model), sequentially drop the last match key until every recipient finds a donor. Makes the hot deck/stat match easier to set up. | True |

Returns:

| Type | Description |
|---|---|
| dict | Hot deck parameter dictionary. |
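For example, a sequential hot deck that falls back to coarser match keys and takes hours and weeks worked from the same donor (a rough sketch; the variable names are illustrative and Parameters is assumed to be imported from this module):

>>> hotdeck_params = Parameters.HotDeck(
...     model_list=[
...         ["age_group", "sex", "educ", "state"],  # most detailed match first
...         ["age_group", "sex", "educ"],           # then coarser fallbacks
...         ["age_group", "sex"],
...     ],
...     donate_list=["hours_worked", "weeks_worked"],
...     n_hotdeck_array=3,
... )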
LightGBM
staticmethod
¶
LightGBM(
tune: bool = False,
tune_overwrite: bool = False,
parameters: dict | None = None,
tuner=None,
tune_hyperparameter_path: str = "",
quantiles: list = None,
error: ErrorDraw = ErrorDraw.pmm,
parameters_pmm: dict | None = None,
) -> dict
Parameters for LightGBM-based imputation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| tune | bool | Whether to tune hyperparameters. | False |
| tune_overwrite | bool | Overwrite existing tuned parameters. | False |
| parameters | dict \| None | LightGBM model parameters (from NEWS.CodeUtilities.Python.LightGBM). | None |
| tuner | Tuner | Hyperparameter tuner object (from NEWS.CodeUtilities.Python.LightGBM). | None |
| tune_hyperparameter_path | str | Path for saving/loading tuned parameters. | '' |
| quantiles | list | Quantiles for quantile regression. | None |
| error | ErrorDraw | How to convert yhat from LightGBM into imputed values; if pmm, for example, draw from the nearest yhat neighbors. | pmm |
| parameters_pmm | dict \| None | PMM parameters if using PMM error drawing. | None |

Returns:

| Type | Description |
|---|---|
| dict | LightGBM parameter dictionary. |

Raises:

| Type | Description |
|---|---|
| Exception | If quantiles are not between 0 and 1. |
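For example, LightGBM imputation with PMM error draws (a rough sketch; Parameters and ErrorDraw are assumed to be imported from this module):

>>> lgbm_params = Parameters.LightGBM(
...     tune=False,
...     error=ErrorDraw.pmm,
...     parameters_pmm=Parameters.pmm(knearest=10),
... )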
NearestNeighbor
staticmethod
¶
Match directly to an x variable (or set of variables) using nearest-neighbor matching.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| match_to | str \| list[str] | Variable or list of variables to match to. | required |
| parameters_pmm | dict | Nearest neighbor match parameters. | None |

Returns:

| Type | Description |
|---|---|
| dict | Nearest neighbor parameter dictionary. |
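For example, matching each recipient to the donor with the closest value of a single variable (a rough sketch; the variable name is illustrative):

>>> nn_params = Parameters.NearestNeighbor(match_to="earn_lastyear")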
Regression
staticmethod
¶
Regression(
model: RegressionModel = RegressionModel.OLS,
error: ErrorDraw = ErrorDraw.pmm,
random_share: float = 1.0,
parameters_pmm: dict = None,
) -> dict
Parameters for regression-based imputation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model | RegressionModel | Type of regression model. | OLS |
| error | ErrorDraw | Method for drawing errors: if pmm, draw from the nearest yhat neighbors; if Random, draw from the observed errors of modeled observations. | pmm |
| random_share | float | Fraction of the data to use for the regression; values below 1.0 reduce memory use by fitting on a random subset. | 1.0 |
| parameters_pmm | dict | PMM parameters if using PMM error drawing. | None |

Returns:

| Type | Description |
|---|---|
| dict | Regression parameter dictionary. |
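For example, an OLS model with PMM error draws, fit on half the observations to save memory (a rough sketch; RegressionModel, ErrorDraw, and Parameters are assumed to be imported from this module):

>>> reg_params = Parameters.Regression(
...     model=RegressionModel.OLS,
...     error=ErrorDraw.pmm,
...     random_share=0.5,
...     parameters_pmm=Parameters.pmm(knearest=10),
... )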
StatMatch
staticmethod
¶
Parameters for statistical match (StatMatch) imputation.
A stat match and a hot deck are essentially the same: the hot deck iterates over the data carrying arrays of possible donor values, whereas the stat match simply joins donors to recipients.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_list | list[str] \| list[list[str]] | Each model is a list of variables used as match keys. model_list can be either a list of strings (a single model) or a list of lists of strings (a sequence of models for a sequential match). | None |
| donate_list | list | Additional variables to impute together with the matched variable; e.g. predict an earnings amount to find a donor, then also take hours worked and weeks worked from that donor. | None |
| sequential_drop | bool | If model_list is a list of strings (one model), sequentially drop the last match key until every recipient finds a donor. Makes the hot deck/stat match easier to set up. | False |

Returns:

| Type | Description |
|---|---|
| dict | Stat match parameter dictionary. |
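For example, a single-model stat match that drops the last match key until every recipient finds a donor (a rough sketch; the variable names are illustrative):

>>> statmatch_params = Parameters.StatMatch(
...     model_list=["age_group", "sex", "educ", "state"],
...     donate_list=["hours_worked", "weeks_worked"],
...     sequential_drop=True,
... )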
pmm
staticmethod
¶
pmm(
knearest: int = 10,
model: RegressionModel = RegressionModel.OLS,
donate_list: list[str] = None,
winsor: tuple[float, float] = [0, 1],
share_leave_out: float = 0.0,
donate_by: list[str] | str | None = None,
) -> dict
Parameters for predictive mean matching imputation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| knearest | int | Number of nearest neighbors for matching. | 10 |
| model | RegressionModel | Regression model type. | OLS |
| donate_list | list[str] | Additional variables to impute together with the matched variable; e.g. predict an earnings amount to find a donor, then also take hours worked and weeks worked from that donor. | None |
| winsor | tuple[float, float] | Winsorization percentiles. | [0, 1] |
| share_leave_out | float | Share of observations to leave out and use to adjust the predictions so that, in a random sample, they match the yhat distribution of the training sample (potentially relevant for LightGBM). | 0.0 |
| donate_by | list[str] \| str \| None | Grouping variables for donation. | None |

Returns:

| Type | Description |
|---|---|
| dict | PMM parameter dictionary. |
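For example, PMM that draws from the 10 nearest yhat neighbors, winsorizes predictions at the 1st and 99th percentiles, and only donates within state (a rough sketch; the grouping variable is illustrative):

>>> pmm_params = Parameters.pmm(
...     knearest=10,
...     winsor=(0.01, 0.99),
...     donate_by="state",
... )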
Selection ¶
Bases: Serializable
Handles variable selection for imputation models.
Supports various selection methods including LASSO, stepwise selection, and custom functions to reduce model dimensionality.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| method | Method | Selection method to use. | No |
| parameters | dict | Method-specific parameters. | None |
| select_within_by | bool | Whether to run selection within each by-group. | True |
| function | callable | Custom selection function. | None |
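For example, LASSO selection run within each by-group (a rough sketch: the Method enum member for LASSO is assumed, since only Method.No appears in this reference, and Selection.Parameters.lasso refers to the nested Parameters class documented below):

>>> selection = Selection(
...     method=Method.lasso,  # assumed member name; only Method.No is shown here
...     parameters=Selection.Parameters.lasso(),
...     select_within_by=True,
... )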
Parameters ¶
lasso ¶
lasso(
nfolds: int = 5,
type_measure: str = "default",
include_base_with_interaction: bool = True,
winsorize: tuple[float, float] | None = None,
continuous: bool = False,
binomial: bool = False,
missing_dummies: bool = True,
optimal_lambda: float = None,
optimal_lambda_from_pre: bool = True,
scale_lambda: float = 1.0,
) -> dict
Parameters for LASSO variable selection method.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| nfolds | int | Number of folds for cross-validation during the LASSO regression; used to determine the optimal lambda through k-fold cross-validation. | 5 |
| type_measure | str | Measure of cross-validation error, which determines how model performance is evaluated during lambda selection. Options are "default", "mse", "deviance", "class", "auc", and "mae". | "default" |
| include_base_with_interaction | bool | Whether to include base variables when interaction terms are selected. If True, base variables are automatically included when their interactions are selected by the LASSO. | True |
| winsorize | tuple[float, float] or None | Percentile bounds (low_percentile, high_percentile) for winsorizing the dependent variable to cap extreme values. None means no winsorization. | None |
| continuous | bool | Force treatment of the dependent variable as continuous, overriding automatic detection. | False |
| binomial | bool | Force treatment of the dependent variable as binomial, overriding automatic detection. | False |
| missing_dummies | bool | Whether to create dummy variables for missing values in the predictor variables. | True |
| optimal_lambda | float or None | Pre-specified optimal lambda for LASSO regularization. If None, the optimal lambda is determined through cross-validation. | None |
| optimal_lambda_from_pre | bool | Whether to use the optimal lambda from a previous preselection step. | True |
| scale_lambda | float | Scaling factor applied to the optimal lambda. Values > 1 make regularization stronger (fewer variables selected); values < 1 make it weaker (more variables selected). | 1.0 |

Returns:

| Type | Description |
|---|---|
| dict | Dictionary containing all parameter values for LASSO selection. |

Raises:

| Type | Description |
|---|---|
| Exception | When type_measure is not one of the acceptable values. |
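For example, cross-validating lambda over 5 folds and then scaling it up by 50% to select a sparser model (a rough sketch, assuming the method is called through the nested Selection.Parameters class):

>>> lasso_params = Selection.Parameters.lasso(
...     nfolds=5,
...     winsorize=(0.01, 0.99),
...     scale_lambda=1.5,  # > 1 selects fewer variables
... )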
stepwise ¶
stepwise(
nfolds: int = 5,
scoring: str = "neg_mean_squared_error",
include_base_with_interaction: bool = True,
winsorize: tuple[float, float] | None = None,
missing_dummies: bool = True,
min_features_to_select: int = 10,
)
Parameters for stepwise variable selection method using Recursive Feature Elimination with Cross-Validation (RFECV).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| nfolds | int | Number of folds for cross-validation during stepwise selection; used in RFECV to evaluate feature importance. | 5 |
| scoring | str | Scoring metric used to evaluate model performance during cross-validation. Should be a valid sklearn scoring parameter. | "neg_mean_squared_error" |
| include_base_with_interaction | bool | Whether to include base variables when interaction terms are selected. If True, base variables are automatically included when their interactions are selected. | True |
| winsorize | tuple[float, float] or None | Percentile bounds (low_percentile, high_percentile) for winsorizing the dependent variable to cap extreme values. None means no winsorization. | None |
| missing_dummies | bool | Whether to create dummy variables for missing values in the predictor variables. | True |
| min_features_to_select | int | Minimum number of features that must be selected by the stepwise procedure; prevents over-reduction of the feature set. | 10 |

Returns:

| Type | Description |
|---|---|
| dict | Dictionary containing all parameter values for stepwise selection. |
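For example, RFECV-based selection scored on mean absolute error that keeps at least 15 features (a rough sketch, assuming the method is called through the nested Selection.Parameters class):

>>> stepwise_params = Selection.Parameters.stepwise(
...     scoring="neg_mean_absolute_error",
...     min_features_to_select=15,
... )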