Basic Statistics and Standard

summary

summary(
    df: IntoFrameT,
    columns: list[str] | str | None = None,
    weight: str = "",
    print: bool = True,
    stats: list[str] | str | None = None,
    detailed: bool = False,
    additional_stats: list[str] | str | None = None,
    by: list[str] | str | None = None,
    quantile_interpolated: bool = False,
    drb_round: bool = False,
) -> IntoFrameT

Generate summary statistics for a dataframe.

A convenience function for quickly exploring data. Calculates common summary statistics (mean, std, min, max, etc.) with optional weighting and grouping. Works with any dataframe backend (Polars, Pandas, Arrow, DuckDB) via Narwhals.

Parameters:

Name	Type	Description	Default
`df`	`IntoFrameT`	Input dataframe to summarize.	required
`columns`	`list[str] \| str \| None`	Columns to summarize. Supports wildcards (e.g., "income_*"). If None, summarizes all columns. Default is None.	`None`
`weight`	`str`	Column name for weights. If provided, calculates weighted statistics. Default is "" (unweighted).	`''`
`print`	`bool`	Print the summary table. Default is True.	`True`
`stats`	`list[str] \| str \| None`	Statistics to calculate. If None, uses default set. See Statistics.available_stats() for options. Default is None.	`None`
`detailed`	`bool`	Use detailed statistics (includes quartiles). Overrides stats parameter. Default is False.	`False`
`additional_stats`	`list[str] \| str \| None`	Additional statistics beyond the default/detailed set. Examples: ["q10", "q90", "n\|not0", "share\|not0"]. Default is None.	`None`
`by`	`list[str] \| str \| None`	Column(s) to group by before calculating statistics. Default is None.	`None`
`quantile_interpolated`	`bool`	Use interpolated quantiles (vs exact values from data). Default is False.	`False`
`drb_round`	`bool`	Apply DRB (Disclosure Review Board) rounding rules for 4 significant digits. Useful for publication-ready output. Default is False.	`False`
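
To make "4 significant digits" in the `drb_round` description concrete, here is a minimal sketch of significant-digit rounding in plain Python. The helper name `round_sig` is hypothetical and is not part of survey_kit; the library's own DRB rounding may apply additional rules (for example, special handling of counts) that this sketch does not attempt to reproduce.

```python
import math


def round_sig(value: float, digits: int = 4) -> float:
    """Round value to the given number of significant digits.

    Hypothetical helper for illustration only; not survey_kit's DRB rounding.
    """
    # Zero and non-finite values have no meaningful leading digit
    if value == 0 or not math.isfinite(value):
        return value

    # Position of the leading digit (e.g. 5 for 123456.789, -2 for 0.0123)
    exponent = math.floor(math.log10(abs(value)))

    # Keep `digits` digits counting from the leading one
    return round(value, digits - 1 - exponent)


rounded_large = round_sig(123456.789)
rounded_small = round_sig(0.0123456)
```

Large values lose their trailing digits and small values keep four digits after the leading zeros, which is the behaviour the parameter description suggests.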

Returns:

Type	Description
`IntoFrameT`	Dataframe of summary statistics (same type as input df).
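
As an illustration of what the `weight` parameter changes, the sketch below computes an unweighted and a weighted mean in plain Python. It uses the textbook formula sum(w * x) / sum(w) and assumes that missing values are skipped; the function name `weighted_mean` is hypothetical, and survey_kit's internal implementation may differ.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w * x) / sum(w), skipping missing (None) values.

    Hypothetical helper for illustration only.
    """
    # Assumption: rows with a missing value are dropped before averaging
    pairs = [(x, w) for x, w in zip(values, weights) if x is not None]
    total_weight = sum(w for _, w in pairs)
    return sum(x * w for x, w in pairs) / total_weight


income = [10_000, 20_000, None, 60_000]
survey_weight = [1.0, 1.0, 5.0, 2.0]

# With equal weights every non-missing row counts once
unweighted = weighted_mean(income, [1.0] * len(income))

# With survey weights the last row counts twice as much as the first two
weighted = weighted_mean(income, survey_weight)
```

The weighted mean is pulled toward the rows with larger weights, which is why the weighted and unweighted summaries of the same column can differ noticeably.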

Examples:

Basic unweighted summary:

>>> from survey_kit.utilities.dataframe import summary
>>> from survey_kit.utilities.random import RandomData
>>>
>>> df = RandomData(n_rows=1000, seed=123).integer("income", 0, 100_000).to_df()
>>> summary(df)

Weighted summary:

>>> summary(df, weight="survey_weight")

By groups:

>>> summary(df, weight="survey_weight", by="year")

Detailed statistics with rounding:

>>> summary(df, weight="survey_weight", detailed=True, drb_round=True)

Custom statistics:

>>> from survey_kit.statistics.statistics import Statistics
>>> Statistics.available_stats()  # See options
>>> summary(df, additional_stats=["q10", "q90", "n|not0", "share|not0"])

Specific columns with wildcards:

>>> summary(df, columns=["income_*", "age"], weight="survey_weight")

Get results without printing:

>>> df_stats = summary(df, weight="survey_weight", print=False)
>>> print(df_stats.collect())

Notes

Default statistics (if stats=None and detailed=False):

- n: Count of non-missing values
- n|missing: Count of missing values
- mean: Average
- std: Standard deviation
- min: Minimum
- max: Maximum

Detailed statistics (if detailed=True) add q25, q50 (median), and q75 to the default set.

The "|not0" suffix excludes zeros: "n|not0" counts non-zero values, "share|not0" calculates proportion among non-zero observations.
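
The count statistics described in these notes can be sketched in plain Python as follows. The function name `count_stats` is hypothetical, and treating missing values as excluded from "n|not0" is an assumption; this is not the library's implementation.

```python
def count_stats(values):
    """Return the count-style statistics described in the notes.

    Hypothetical helper for illustration only.
    """
    non_missing = [v for v in values if v is not None]
    return {
        # Count of non-missing values
        "n": len(non_missing),
        # Count of missing values
        "n|missing": len(values) - len(non_missing),
        # Count of non-zero values (assumption: missing values are excluded)
        "n|not0": sum(1 for v in non_missing if v != 0),
    }


counts = count_stats([0, 5, None, 12, 0, 7])
```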

StatCalculator

Bases: Serializable

A comprehensive class for calculating statistical estimates with optional replicate weights.

StatCalculator provides a unified interface for computing various statistics on datasets, with support for weighted calculations, replicate weight standard errors, grouping, and comparison operations. It handles both simple estimates and complex bootstrap or replicate weight variance estimation.

The class supports:

- Multiple statistics calculated simultaneously
- Weighted and unweighted estimates
- Replicate weight and bootstrap standard errors
- Grouping/stratification via by
- Automatic disclosure avoidance rounding (which can be disabled)
- Comparison operations between sets of estimates
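
To give a sense of what a replicate weight standard error is, the sketch below uses the generic form SE = sqrt(c * sum((theta_r - theta)^2)), where theta is the full-sample estimate and theta_r is the estimate recomputed with replicate weight r. The function name `replicate_se` and the scale factor `c` are assumptions for illustration; the correct factor depends on the replication design, and this is not a description of how StatCalculator computes its standard errors.

```python
import math


def replicate_se(estimate, replicate_estimates, scale):
    """Generic replicate-based standard error.

    Hypothetical helper: `scale` is the design-specific factor c in
    SE = sqrt(c * sum((theta_r - theta)^2)).
    """
    squared_deviations = sum((r - estimate) ** 2 for r in replicate_estimates)
    return math.sqrt(scale * squared_deviations)


# Full-sample estimate and the same statistic under four replicate weights
full_sample = 10.0
replicates = [9.0, 11.0, 10.0, 12.0]

# Illustrative scale factor only (1 / number of replicates)
se = replicate_se(full_sample, replicates, scale=1 / len(replicates))
```

The key idea is that the spread of the replicate estimates around the full-sample estimate stands in for the sampling variance.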

Parameters:

Name	Type	Description	Default
`df`	`IntoFrameT`	A narwhals-compliant dataframe	`None`
`statistics`	`list[Statistics] \| Statistics \| None`	Statistics object(s) defining what columns and statistics to calculate. Each Statistics object specifies variables and statistical measures. Default is None.	`None`
`weight`	`str`	Column name for survey weights if weighted estimates are desired. Default is "" (unweighted).	`''`
`scale_wgts_to`	`float`	Value to scale weights to sum to (proportional adjustment). Default is 0.0 (no scaling).	`0.0`
`replicates`	`Replicates \| None`	Replicates object for calculating replicate weight standard errors. Generates weight lists from stub names and counts. Default is None.	`None`
`by`	`dict[str, list[str]] \| list \| None`	Dictionary defining grouping variables for stratified estimates. Keys are group names, values are lists of grouping variables. Example: {"state":["st"], "county":["st","cty"]}. Default is None.	`None`
`display`	`bool`	Whether to print results to log automatically. Default is True.	`True`
`display_all_vars`	`bool`	Print all variables or truncate display. Default is True.	`True`
`display_max_vars`	`int`	Maximum variables to display when display_all_vars=False. Default is 20.	`20`
`round_output`	`bool \| int`	Apply rounding to output (True for DRB rules, int for sig digits). Default is False.	`False`
`calculate`	`bool`	Internal parameter - whether to run calculations immediately. Default is True.	`True`
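
The proportional adjustment behind `scale_wgts_to` can be sketched in plain Python. It follows the same formula as the `scale_weights` method in the source listing (weight / sum of weights * target, applied only when the target is positive); the standalone function below is a hypothetical illustration that works on a list rather than a dataframe column.

```python
def scale_weights(weights, scale_to):
    """Rescale weights proportionally so that they sum to scale_to.

    Hypothetical list-based helper for illustration only.
    """
    # A non-positive target means "no scaling", matching the default of 0.0
    if scale_to <= 0:
        return list(weights)

    total = sum(weights)
    return [w / total * scale_to for w in weights]


scaled = scale_weights([1.0, 1.0, 2.0], scale_to=8.0)
unchanged = scale_weights([1.0, 1.0, 2.0], scale_to=0.0)
```

Relative weights are preserved (the third row still counts twice as much as the others); only the total changes.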

Attributes:

Name	Type	Description
`df_estimates`	`IntoFrameT, narwhals compliant dataframe`	Main estimates dataframe with calculated statistics.
`df_ses`	`IntoFrameT, narwhals compliant dataframe`	Standard errors dataframe (populated when replicates are used).
`df_replicates`	`IntoFrameT, narwhals compliant dataframe`	Full replicate estimates for additional analysis.
`variable_ids`	`list[str]`	Column names that identify unique estimates/variables.
`summarize_vars`	`list[str]`	All grouping variables from by (flattened).
`bootstrap`	`bool`	Whether bootstrap standard errors are to be calculated, as opposed to replicate weights.
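
The "flattened" list in `summarize_vars` can be illustrated with a small standalone sketch that mirrors the logic of the `_by_vars` method in the source listing: values of a `by` dictionary (or items of a `by` list) are concatenated and de-duplicated while keeping first-seen order. The function name `flatten_by` is hypothetical.

```python
def flatten_by(by):
    """Flatten a `by` specification into a de-duplicated list of column names.

    Hypothetical standalone version of the flattening step, for illustration.
    """
    names = []

    if isinstance(by, dict):
        # Each value is a list of grouping columns
        for group_vars in by.values():
            names.extend(group_vars)
    elif isinstance(by, list):
        # Items may be single column names or nested lists of names
        for item in by:
            if isinstance(item, list):
                names.extend(item)
            else:
                names.append(item)

    # dict.fromkeys drops duplicates while preserving order
    return list(dict.fromkeys(names))


flat_from_dict = flatten_by({"state": ["st"], "county": ["st", "cty"]})
flat_from_list = flatten_by(["year", ["st", "cty"], "st"])
```

With the dictionary from the `by` parameter description, "st" appears in both groups but only once in the flattened result.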

Examples:

Basic usage with simple statistics:

>>> from survey_kit.statistics.calculator import StatCalculator
>>> from survey_kit.statistics.statistics import Statistics
>>> stats = Statistics(columns=["income", "age"], statistics=["mean", "median"])
>>> sc = StatCalculator(df=my_data, statistics=stats, weight="survey_wgt")
>>> sc.print()

With replicate weights for standard errors:

>>> from survey_kit.statistics.replicates import Replicates
>>> reps = Replicates(weight_stub="rep_wgt_", n_replicates=80)
>>> sc = StatCalculator(df=my_data, statistics=stats, replicates=reps)
>>> sc.print()  # Will show standard errors

Grouped analysis:

>>> calc = StatCalculator(
...     df=my_data,
...     statistics=stats,
...     by={"state": ["state_code"], "region": ["region_code"]}
... )

Comparison between two sets of estimates:

>>> sc_1 = StatCalculator(df=data1, statistics=stats)
>>> sc_2 = StatCalculator(df=data2, statistics=stats)
>>> comparison = sc_1.compare(sc_2)
>>> comparison["difference"].print()

Source code in src\survey_kit\statistics\calculator.py

class StatCalculator(Serializable):
    """
    A comprehensive class for calculating statistical estimates with optional replicate weights.

    StatCalculator provides a unified interface for computing various statistics on datasets,
    with support for weighted calculations, replicate weight standard errors, grouping,
    and comparison operations. It handles both simple estimates and complex bootstrap
    or replicate weight variance estimation.

    The class supports:
    - Multiple statistics calculated simultaneously
    - Weighted and unweighted estimates
    - Replicate weight and bootstrap standard errors
    - Grouping/stratification via by
    - Automatic disclosure avoidance rounding (which can be disabled)
    - Comparison operations between sets of estimates

    Parameters
    ----------
    df : IntoFrameT
        A narwhals-compliant dataframe
    statistics : list[Statistics]|Statistics|None, optional
        Statistics object(s) defining what columns and statistics to calculate.
        Each Statistics object specifies variables and statistical measures.
        Default is None.
    weight : str, optional
        Column name for survey weights if weighted estimates are desired.
        Default is "" (unweighted).
    scale_wgts_to : float, optional
        Value to scale weights to sum to (proportional adjustment).
        Default is 0.0 (no scaling).
    replicates : Replicates|None, optional
        Replicates object for calculating replicate weight standard errors.
        Generates weight lists from stub names and counts. Default is None.
    by : dict[str,list[str]]|list|None, optional
        Dictionary defining grouping variables for stratified estimates.
        Keys are group names, values are lists of grouping variables.
        Example: {"state":["st"], "county":["st","cty"]}. Default is None.
    display : bool, optional
        Whether to print results to log automatically. Default is True.
    display_all_vars : bool, optional
        Print all variables or truncate display. Default is True.
    display_max_vars : int, optional
        Maximum variables to display when display_all_vars=False. Default is 20.
    round_output : bool|int, optional
        Apply rounding to output (True for DRB rules, int for sig digits).
        Default is False.
    calculate : bool, optional
        Internal parameter - whether to run calculations immediately.
        Default is True.

    Attributes
    ----------
    df_estimates : IntoFrameT, narwhals compliant dataframe
        Main estimates dataframe with calculated statistics.
    df_ses : IntoFrameT, narwhals compliant dataframe
        Standard errors dataframe (populated when replicates are used).
    df_replicates : IntoFrameT, narwhals compliant dataframe
        Full replicate estimates for additional analysis.
    variable_ids : list[str]
        Column names that identify unique estimates/variables.
    summarize_vars : list[str]
        All grouping variables from by (flattened).
    bootstrap : bool
        Whether bootstrap standard errors are to be
        calculated, as opposed to replicate weights.

    Examples
    --------
    Basic usage with simple statistics:

    >>> from survey_kit.statistics.calculator import StatCalculator
    >>> from survey_kit.statistics.statistics import Statistics
    >>> stats = Statistics(columns=["income", "age"], statistics=["mean", "median"])
    >>> sc = StatCalculator(df=my_data, statistics=stats, weight="survey_wgt")
    >>> sc.print()

    With replicate weights for standard errors:

    >>> from survey_kit.statistics.replicates import Replicates
    >>> reps = Replicates(weight_stub="rep_wgt_", n_replicates=80)
    >>> sc = StatCalculator(df=my_data, statistics=stats, replicates=reps)
    >>> sc.print()  # Will show standard errors

    Grouped analysis:

    >>> calc = StatCalculator(
    ...     df=my_data,
    ...     statistics=stats,
    ...     by={"state": ["state_code"], "region": ["region_code"]}
    ... )

    Comparison between two sets of estimates:

    >>> sc_1 = StatCalculator(df=data1, statistics=stats)
    >>> sc_2 = StatCalculator(df=data2, statistics=stats)
    >>> comparison = sc_1.compare(sc_2)
    >>> comparison["difference"].print()
    """

    _save_suffix = "stats_calc"

    def __init__(
        self,
        df: IntoFrameT | None = None,
        statistics: list[Statistics] | Statistics | None = None,
        weight: str = "",
        scale_wgts_to: float = 0.0,
        replicates: Replicates | None = None,
        by: dict[str, list[str]] | list | None = None,
        display: bool = True,
        display_all_vars: bool = True,
        display_max_vars: int = 20,
        round_output: bool | int = False,
        allow_slow_pandas: bool = False,
        calculate: bool = True,
    ):
        if statistics is None:
            #   Nothing to calculate without any Statistics objects
            statistics = []
            calculate = False
        elif not isinstance(statistics, list):
            #   Allow a single Statistics object to be passed directly
            statistics = [statistics]

        self.df = df
        self.statistics = statistics
        self.weight = weight
        if by is not None:
            if len(by) == 0:
                by = None

        self.by = by
        self.display = display
        self.display_all_vars = display_all_vars
        self.display_max_vars = display_max_vars
        self.round_output = round_output
        self.summarize_vars = self._by_vars()
        self.rounding = Rounding(round_output=round_output, round_all=False)

        #   Default columns for variable name id
        self.variable_ids = ["Variable"]

        self.replicates = replicates
        self.scale_wgts_to = scale_wgts_to

        self.allow_slow_pandas = allow_slow_pandas

        self.replicate_stats = ReplicateStats()
        if replicates is not None:
            self.replicate_stats.bootstrap = replicates.bootstrap

        self.scale_weights()

        if calculate:
            if self.replicates is None:
                self._calculate()
            else:
                self._calculate_replicates()

            self.df = None

    def copy(self):
        sc_copy = StatCalculator(
            df=self.df,
            statistics=copy(self.statistics),
            weight=self.weight,
            scale_wgts_to=0,
            replicates=copy(self.replicates),
            by=copy(self.by),
            display=self.display,
            display_all_vars=self.display_all_vars,
            display_max_vars=self.display_max_vars,
            round_output=False,
            allow_slow_pandas=self.allow_slow_pandas,
            calculate=False,
        )

        sc_copy.scale_wgts_to = self.scale_wgts_to
        sc_copy.rounding = copy(self.rounding)
        sc_copy.replicate_stats = self.replicate_stats.copy()
        sc_copy.variable_ids = self.variable_ids

        return sc_copy

    @property
    def df_estimates(self):
        """
        IntoFrameT : Main estimates dataframe containing calculated statistics.

        This property provides access to the primary results table with all
        calculated statistics. Includes variable identifiers, grouping variables,
        and statistical estimates as columns.
        """
        return self.replicate_stats.df_estimates

    @df_estimates.setter
    def df_estimates(self, value):
        self.replicate_stats.df_estimates = value

    @property
    def df_ses(self):
        """
        IntoFrameT : Standard errors dataframe (when replicate weights are used).

        Contains standard error estimates for all statistics calculated using
        replicate weight methods. Has the same structure as df_estimates but
        with standard errors instead of point estimates. Only populated when
        replicates parameter is provided.
        """
        return self.replicate_stats.df_ses

    @df_ses.setter
    def df_ses(self, value):
        self.replicate_stats.df_ses = value

    @property
    def df_replicates(self):
        """
        IntoFrameT : Full replicate estimates dataframe.

        Contains individual estimates for each replicate weight, allowing for
        custom variance calculations or additional analysis. Includes all
        columns from df_estimates plus a replicate identifier column.
        Only populated when replicates parameter is provided.
        """
        return self.replicate_stats.df_replicates

    @df_replicates.setter
    def df_replicates(self, value):
        self.replicate_stats.df_replicates = value

    @property
    def bootstrap(self):
        return self.replicate_stats.bootstrap

    def _by_vars(self=None, by: dict | None = None) -> list[str]:
        """
        Just get a list of the variables to be used as indexes for
            summary stats (for a select statement)
        Returns
        -------
        list[str]
            A full list of all the indexes (with no duplicates)

        """

        if by is None:
            by = self.by

        summarize_list = []

        if by is not None:
            if type(by) is dict:
                for listi in by.values():
                    summarize_list.extend(listi)
            elif type(by) is list:
                for itemi in by:
                    if type(itemi) is list:
                        summarize_list.extend(itemi)
                    else:
                        summarize_list.append(itemi)

            summarize_list = list(dict.fromkeys(summarize_list))

        return summarize_list

    def _calculate(self, weight: str | None = None, display: bool | None = None):
        """
        Calculate the estimates (no SEs) and populate df_estimates.

        Intended mostly for internal use.

        Parameters
        ----------
        weight : str|None, optional
            Programmer option, do not use. The default is None
            (use self.weight).
        display : bool|None, optional
            Display the results? The default is None (use self.display).

        Returns
        -------
        IntoFrameT
            The estimates, also stored in df_estimates.
        """

        df_collected = None

        if weight is None:
            weight = self.weight

        for statsi in self.statistics:
            dfi = statsi.calculate(
                df=self.df,
                weight=weight,
                by=self.by,
                summarize_vars=self.summarize_vars,
                rounding=self.rounding,
                allow_slow_pandas=self.allow_slow_pandas,
            )

            if df_collected is None:
                df_collected = dfi
            else:
                cols_prior = df_collected.drop(
                    self.variable_ids + self.summarize_vars
                ).columns
                cols_now = dfi.drop(self.variable_ids + self.summarize_vars).columns
                cols_match = list(set(cols_prior).intersection(cols_now))

                n_rows = df_collected.select(nw.len()).collect().item()
                df_collected = (
                    join_wrapper(
                        df=df_collected.with_row_index("__summary_index__"),
                        df_to=dfi.with_row_index("__summary_index2__", offset=n_rows),
                        on=self.variable_ids + self.summarize_vars,
                        how="full",
                    )
                    # df_collected = (
                    #                     JoinFileList_Simple(
                    #                         dflist=[
                    #                                     df_collected.with_row_index("__summary_index__"),
                    #                                     dfi.with_row_index("__summary_index2__",
                    #                                                        offset=dfRowCount(df_collected)),
                    #                                 ],
                    #                         Join="outer",
                    #                         JoinOn=self.variable_ids + self.summarize_vars,
                    #                         join_nulls=True,
                    #                         quietly=True
                    #                     )
                    .with_columns(
                        nw.coalesce(
                            nw.col(["__summary_index__", "__summary_index2__"]).alias(
                                "__summary_index__"
                            )
                        )
                    )
                    .sort(self.summarize_vars + ["__summary_index__"])
                    .drop(["__summary_index__", "__summary_index2__"])
                )

                if len(cols_match):
                    cols_new = list(set(cols_now).difference(cols_prior))
                    cols_select = cols_prior + cols_new
                    with_coalesce = [
                        nw.coalesce(nw.col(coli, f"{coli}_right"))
                        for coli in cols_match
                    ]

                    df_collected = df_collected.with_columns(with_coalesce).select(
                        self.variable_ids + self.summarize_vars + cols_select
                    )

        self.df_estimates = df_collected

        if self.rounding.round_output:
            self.df_estimates = self.round_results()

        if display is None:
            display = self.display

        if display:
            self.print()

        return self.df_estimates

    def _calculate_replicates(self):
        """
        Calculate the estimates for each replicate weight

        Returns
        -------
        Populates df_estimates, df_ses and df_replicates

        """

        replicate_se_return = replicates_ses_from_function(
            delegate=self._calculate,
            arguments={"display": False},
            join_on=self.variable_ids + self.summarize_vars,
            weights=self.replicates.rep_list,
            bootstrap=self.replicates.bootstrap,
        )

        self.df_estimates = replicate_se_return.df_estimates
        self.df_ses = replicate_se_return.df_ses
        self.df_replicates = replicate_se_return.df_replicates

        if self.rounding.round_output:
            self.df_ses = self.round_results(df=self.df_ses)

        if self.display:
            self.print()

    def round_results(
        self,
        df: IntoFrameT | None = None,
        rounding: Rounding | None = None,
        display_only: bool = False,
    ) -> IntoFrameT:
        """
        Parameters
        ----------
        df : IntoFrameT, optional
            Table of estimates. The default is the estimates in df_estimates
        rounding : Rounding|None, optional
            Rounding (True for DRB rules) and an integer for specific number of significant digits. The default is self's rounding.
        display_only : bool, optional
            If True, affects the display of numbers (casts to strings). The default is False.

        Returns
        -------
        df : IntoFrameT
            The rounded estimates.

        """

        if df is None:
            df = self.df_estimates

        if rounding is None:
            rounding = self.rounding

        if df is not None:
            df = drb_round_table(
                df=df,
                columns=rounding.cols_round,
                columns_n=rounding.cols_n,
                columns_exclude=rounding.cols_exclude,
                round_all=rounding.round_all,
                digits=rounding.round_digits,
                display_only=display_only,
            )

        return df

    def scale_weights(self=None):
        if self.df is None:
            return None

        if self.scale_wgts_to > 0:
            if self.weight != "":
                self.df = self.df.with_columns(
                    (
                        nw.col(self.weight) / nw.sum(self.weight) * self.scale_wgts_to
                    ).alias(self.weight)
                )

            if self.replicates is not None:
                for weighti in self.replicates.rep_list:
                    self.df = self.df.with_columns(
                        (nw.col(weighti) / nw.sum(weighti) * self.scale_wgts_to).alias(
                            weighti
                        )
                    )

    def print(
        self,
        df: IntoFrameT | None = None,
        round_output: bool | int | None = None,
        estimates_per_page: int = 0,
        sub_log: logging = None,
    ):
        """
        Print the estimates (with SEs if applicable) to the log.

        Parameters
        ----------
        df : IntoFrameT, optional
            The estimates to display. Default is the estimates in self.
        round_output : bool|int|None, optional
            Rounding rule (True for DRB, integer for number of significant digits).
            Default is self's rounding rule.
        estimates_per_page : int, optional
            Repeat the header every k estimates. Defaults to 0 (don't repeat).
        sub_log : logging, optional
            Override logger. Default is None (no override).

        Returns
        -------
        None
        """
        if self.df_replicates is not None:
            self._print_replicates(
                round_output=round_output,
                estimates_per_page=estimates_per_page,
                sub_log=sub_log,
            )
        else:
            self._print_estimates(
                df=df,
                round_output=round_output,
                estimates_per_page=estimates_per_page,
                sub_log=sub_log,
            )

    def _print_estimates(
        self,
        df: IntoFrameT | None = None,
        round_output: bool | int | None = None,
        estimates_per_page: int = 0,
        sub_log: logging = None,
    ):
        """
        Prints the estimates (when there are no SEs) to the log

        Parameters
        ----------
        df : IntoFrameT, optional
            The estimates to show. The default is the estimates in self.
        round_output : bool|int|None, optional
            Rounding rule (True for DRB, integer for number of significant digits). The default is self's rounding rule.
        estimates_per_page : int, optional
            Repeat the header every k estimates.  Defaults to 0 (don't)
        sub_log : logging , optional
            Override logger?  Default is None (no override)
        Returns
        -------
        None.

        """

        if df is None:
            df = self.df_estimates

        nw_type = NarwhalsType(df)
        df = nw_type.to_polars().lazy().collect()
        if sub_log is None:
            sub_log = logger

        #   f_print = print
        f_print = sub_log.info
        #   Round?
        if round_output:
            rounding = deepcopy(self.rounding)
            rounding.set_round_digits(round_output)

            df = self.round_results(df=df, rounding=rounding, display_only=True)

        if self.display_all_vars:
            n_rows = df.height
        else:
            n_rows = min(self.display_max_vars, df.height)

        with pl.Config(fmt_str_lengths=50) as cfg:
            #   Basic formatting
            cfg.set_tbl_cell_alignment("RIGHT")
            cfg.set_tbl_hide_column_data_types(True)
            cfg.set_tbl_hide_dataframe_shape(True)
            cfg.set_thousands_separator(True)
            cfg.set_tbl_width_chars(600)
            cfg.set_tbl_cols(len(df.lazy().collect_schema()))

            cfg.set_tbl_rows(n_rows)

            if estimates_per_page > 0 and n_rows > estimates_per_page:
                slices = math.ceil(n_rows / estimates_per_page)

                for slicei in range(slices):
                    f_print(
                        df.slice(
                            offset=estimates_per_page * slicei,
                            length=estimates_per_page,
                        )
                    )
            else:
                f_print(df)

    def _print_replicates(
        self,
        round_output: bool | int | None = None,
        estimates_per_page: int = 0,
        sub_log: logging = None,
    ):
        """
        Prints the estimates (when there are SEs) to the log

        Parameters
        ----------
        round_output : bool|int|None, optional
            Rounding rule (True for DRB, integer for number of significant digits). The default is self's rounding rule.
        estimates_per_page : int, optional
            Repeat the header every k estimates.  Defaults to 0 (don't)
        sub_log : logging , optional
            Override logger?  Default is None (no override)

        Returns
        -------
        None.

        """
        if sub_log is None:
            sub_log = logger

        #   Round?
        if round_output:
            rounding = deepcopy(self.rounding)
            rounding.set_round_digits(round_output)

            df_estimates = self.round_results(
                df=self.df_estimates, rounding=rounding, display_only=True
            )
            df_ses = self.round_results(
                df=self.df_ses, rounding=rounding, display_only=True
            )
        else:
            df_estimates = self.df_estimates
            df_ses = self.df_ses

        print_se_table(
            df_estimates=df_estimates,
            df_ses=df_ses,
            display_all_vars=self.display_all_vars,
            display_max_vars=self.display_max_vars,
            sort_vars=self.variable_ids + self.summarize_vars,
            round_output=False,
            sub_log=sub_log,
        )

    def table_of_estimates(
        self,
        round_output: bool | int | None = None,
        estimates_to_show: list[str] | None = None,
        variable_prefix: str = "",
        estimate_type_variable_name: str = "Statistic",
        ci_level: float = 0.95,
    ) -> IntoFrameT:
        """
        Create a formatted table of estimates with option of statistics to report.

        Parameters
        ----------
        round_output : bool|int|None, optional
            Rounding rule for display.
        estimates_to_show : list[str] | None, optional
            List of estimate types to include. Options: "estimate", "se", "t", "p", "ci".
            Default is ["estimate", "se"].
        variable_prefix : str, optional
            Prefix to add to variable column names. Default is "".
        estimate_type_variable_name : str, optional
            Name for the column indicating statistic type. Default is "Statistic".
        ci_level : float, optional
            Confidence interval level for "ci" estimates. Default is 0.95.

        Returns
        -------
        IntoFrameT
            Formatted table with estimates arranged by statistic type.
        """
        if estimates_to_show is None:
            estimates_to_show = ["estimate", "se"]

        df_ordered = []
        nw_ordered = []
        col_sort = "__order_output_table__"
        for index, esti in enumerate(estimates_to_show):
            dfi = None
            if esti.lower() == "estimate":
                dfi = self.df_estimates

            elif esti.lower() == "se" and self.df_replicates is not None:
                dfi = self.df_ses
            elif esti.lower() == "t" and self.df_replicates is not None:
                dfi = self._df_t()
            elif esti.lower() == "p" and self.df_replicates is not None:
                dfi = self._df_p()
            elif esti.lower() == "ci" and self.df_replicates is not None:
                dfi = self._df_ci(ci_level=ci_level)
            else:
                message = f"{esti} not allowed for estimates_to_show"
                logger.error(message)
                raise Exception(message)

            if dfi is not None:
                nwi = NarwhalsType(dfi)
                nw_ordered.append(nwi)
                dfi = nwi.to_polars()
                df_ordered.append(
                    dfi.with_columns(
                        [
                            pl.lit(index).alias(col_sort),
                            pl.lit(esti.lower()).alias(estimate_type_variable_name),
                        ]
                    )
                )

        col_row_index = "___estimate_row_count___"
        df_ordered[0] = df_ordered[0].with_row_index(col_row_index)
        df_display = concat_wrapper(df_ordered, how="diagonal").lazy()

        sort_vars = self.variable_ids + self.summarize_vars
        df_display = df_display.sort(sort_vars + [col_sort]).with_columns(
            pl.col(col_row_index).forward_fill()
        )
        df_display = df_display.sort([col_row_index] + [col_sort]).drop(col_row_index)

        #   Clear extraneous information
        with_clear = []
        for coli in sort_vars:
            c_col = pl.col(coli)
            with_clear.append(
                pl.when(pl.col(col_sort) != 0)
                .then(pl.lit(""))
                .otherwise(c_col.cast(pl.String))
                .alias(coli)
            )

        select_order = sort_vars + [estimate_type_variable_name]
        remaining = []
        rename = {}
        for coli in df_display.lazy().collect_schema().names():
            if coli not in select_order and coli != col_sort:
                if variable_prefix != "":
                    rename[coli] = f"{variable_prefix}{coli}"

                remaining.append(coli)

        df_display = df_display.with_columns(with_clear).select(
            select_order + remaining
        )
        #   Round?
        if round_output:
            rounding = deepcopy(self.rounding)
            rounding.set_round_digits(round_output)

            if (
                len(rounding.cols_n) == 0
                and len(rounding.cols_round) == 0
                and not rounding.round_all
            ):
                #   Nothing set to round, round all
                rounding.cols_round = remaining

            df_display = self.round_results(
                df=df_display, rounding=rounding, display_only=True
            )

        if len(rename):
            df_display = df_display.rename(rename)
        return nw_ordered[0].lazy().from_polars(df_display)

    def join_tables_of_estimates(
        self, df_list: list[IntoFrameT], estimate_type_variable_name: str = "Statistic"
    ) -> IntoFrameT:
        df_list = [nw.from_native(dfi) for dfi in df_list]
        sort_vars = (
            self.variable_ids + self.summarize_vars + [estimate_type_variable_name]
        )

        variable_filled = []
        for coli in self.variable_ids:
            if df_list[0].schema[coli] == nw.String:
                c_missing = nw.col(coli).is_null() | (nw.col(coli) == "")
            else:
                c_missing = nw.col(coli).is_null()

            variable_filled.append(
                (
                    nw.when(c_missing)
                    .then(nw.lit(None))
                    .otherwise(nw.col(coli))
                    .alias(coli)
                    .fill_null(strategy="forward")
                )
            )

        row_indices = []
        for i in range(0, len(df_list)):
            row_indices.append(f"__row_index_{i}")
            df_list[i] = (
                df_list[i]
                .with_row_index(f"__row_index_{i}")
                .with_columns(variable_filled)
            )

        df_out = join_list(df_list, on=sort_vars, how="full").sort(row_indices)

        with_clear = []
        c_index = nw.col("__row_index_0")
        for coli in self.variable_ids:
            c_col = nw.col(coli)

            with_clear.append(
                nw.when(c_index == nw.min("__row_index_0").over(self.variable_ids))
                .then(c_col)
                .otherwise(nw.lit(""))
                .alias(coli)
            )

        df_out = (
            drop_if_exists(
                nw.from_native(df_out)
                .with_columns(with_clear)
                .to_native(),
                columns=["__row_index_*"]
            )
        )
        return df_out

    def _df_t(self) -> IntoFrameT:
        join_on = self.variable_ids + self.summarize_vars
        nw_type = NarwhalsType(self.df_estimates)
        cols_stats = (
            nw.from_native(self.df_estimates)
            .lazy()
            .drop(join_on)
            .collect_schema()
            .names()
        )

        with_t = []
        for coli in cols_stats:
            c_est = nw.col(coli)
            c_se = nw.col(f"{coli}_se")

            with_t.append((c_est / c_se).abs().alias(coli))

        df_out = join_list(
            [self.df_estimates, self.df_ses],
            how="left",
            on=join_on,
            suffixes=["", "_se"],
        ).select(join_on + with_t)
        return NarwhalsType.return_df(df_out, nw_type)

    def _df_p(self) -> pl.DataFrame | pl.LazyFrame:
        nw_p = NarwhalsType(self._df_t())
        df_p = nw_p.to_polars()

        join_on = self.variable_ids + self.summarize_vars
        cols_stats = self.df_estimates.drop(join_on).columns

        def p_value(t):
            #   NaN never compares equal to itself, so test it with t != t
            if t == float("inf") or t != t:
                return 0
            df = 1_000_000

            return scipy.stats.t.sf(t, df) * 2

        def p_value_lambda(t, col_t):
            try:
                return p_value(t[col_t])
            except Exception:
                return None

        for coli in cols_stats:
            index_t = df_p.columns.index(f"{coli}")

            df_p = (
                df_p.with_columns(df_p.map_rows(lambda t: p_value_lambda(t, index_t)))
                .drop(coli)
                .rename({"map": coli})
            )

        return nw_p.from_polars(df_p.select(join_on + cols_stats))

    def _df_ci(self, ci_level: float = 0.95):
        return self.replicate_stats._df_ci(
            ci_level=ci_level, join_on=self.variable_ids + self.summarize_vars
        )

    def compare(
        self,
        compare_to,
        difference: bool = True,
        ratio: bool = True,
        display: bool = True,
        ratio_minus_1: bool = True,
        compare_list_variables: list[ComparisonItem.Variable] | None = None,
        compare_list_columns: list[ComparisonItem.Column] | None = None,
        quietly: bool = False,
    ):
        """
        Compare this set of estimates to another set of estimates,
        including MultipleImputation estimates.

        Parameters
        ----------
        compare_to : StatCalculator | MultipleImputation
            The other object to compare to.
        difference : bool, optional
            Calculate and return the difference (with key "difference"). Default is True.
        ratio : bool, optional
            Calculate and return the ratio (with key "ratio"). Default is True.
        ratio_minus_1 : bool, optional
            Rescale ratio by subtracting 1 from it. Default is True.
        display : bool, optional
            Print the difference/ratio to the log. Default is True.
        compare_list_variables : list[ComparisonItem.Variable] | None, optional
            List of variables to compare (i.e. compare rows from prior calculations)
        compare_list_columns : list[ComparisonItem.Column] | None, optional
            List of columns to compare.
            For example, if compare_list_columns = [ComparisonItem.Column("mean","median")],
            then compare the mean from this object to the median from compare_to.
        quietly : bool, optional
            Suppress informational messages. Default is False.

        Returns
        -------
        dict[str, StatCalculator]
            Dictionary with keys ["difference","ratio"] containing
            StatCalculator objects with the comparison estimates
            (with SEs if applicable).
        """

        outputs = {}
        if statistical_comparison_item(self) and statistical_comparison_item(
            compare_to
        ):
            if not quietly:
                if self.bootstrap:
                    logger.info("Comparing estimates using bootstrap weights")
                else:
                    logger.info("Comparing estimates using replicate weights")

            outputs = compare(
                stats1=self,
                stats2=compare_to,
                join_on=self.variable_ids + self.summarize_vars,
                rounding=self.rounding,
                difference=difference,
                ratio=ratio,
                ratio_minus_1=ratio_minus_1,
                compare_list_variables=compare_list_variables,
                compare_list_columns=compare_list_columns,
            )

            if display:
                if difference:
                    logger.info("  Difference")
                    outputs["difference"].print(round_output=self.round_output)
                    logger.info("\n")
                if ratio:
                    logger.info("  Ratio")
                    outputs["ratio"].print(round_output=self.round_output)
                    logger.info("\n")

        else:
            if not quietly:
                logger.info("Comparing estimates")

            df1 = self.df_estimates
            df2 = compare_to.df_estimates

            (df1, df2) = StatComp.process_compare_lists(
                df1=df1,
                df2=df2,
                join_on=self._by_vars() + self.variable_ids,
                compare_list_variables=compare_list_variables,
                compare_list_columns=compare_list_columns,
            )

            sm_compare = StatCalculator(
                df=None, statistics=self.statistics, by=self.by, calculate=False
            )

            cols_index = self.variable_ids + self.summarize_vars
            cols_nonindex = df1.drop(cols_index).columns

            df1 = SafeCollect(df1)
            df2 = SafeCollect(df2)

            #   logger.info(df1.schema)
            #   Upcast any columns that need to be
            [df1, df2] = _polars_safe_upcast(
                df1.with_columns(pl.col(pl.Boolean).cast(pl.Int8)),
                df2.with_columns(pl.col(pl.Boolean).cast(pl.Int8)),
                cols1=cols_nonindex,
                cols2=cols_nonindex,
            )

            df_difference = SafeCollect(df2.select(cols_nonindex)) - df1.select(
                cols_nonindex
            )
            df_ratio = (df_difference) / df1.select(cols_nonindex)

            if difference:
                sm_diff = sm_compare

                if ratio:
                    sm_compare = deepcopy(sm_diff)

                sm_diff.df_estimates = pl.concat(
                    [df1.select(cols_index), df_difference], how="horizontal"
                )

                outputs["difference"] = sm_diff

                if display:
                    logger.info("  Difference")
                    sm_diff.print(round_output=sm_diff.round_output)
                    logger.info("\n")

            if ratio:
                sm_ratio = sm_compare

                sm_ratio.df_estimates = pl.concat(
                    [df1.select(cols_index), df_ratio], how="horizontal"
                )

                outputs["ratio"] = sm_ratio

                if display:
                    logger.info("  Ratio")
                    sm_ratio.print(round_output=sm_ratio.round_output)
                    logger.info("\n")

        return outputs

    @staticmethod
    def from_function(
        delegate: Callable,
        estimate_ids: list | str,
        df: IntoFrameT | None = None,
        df_argument: str = "df",
        arguments: dict | None = None,
        weight: str = "",
        replicates: Replicates | None = None,
        scale_wgts_to: float = 0.0,
        weight_argument_name: str = "weight",
        by: dict[str, list[str]] | None = None,
        display: bool = True,
        display_all_vars: bool = True,
        display_max_vars: int = 20,
        round_output: bool | int = True,
    ) -> StatCalculator:
        """
        Create a StatCalculator from a custom function that returns estimates.

        This static method allows wrapping any function that returns estimates
        in a StatCalculator object for easy display and comparison.

        Parameters
        ----------
        delegate : callable
            Function that returns a table of estimates. Must accept weight
            parameter if replicates are used.
        estimate_ids : list | str
            Column names that identify each unique estimate.
        df : IntoFrameT | None, optional
            Dataframe passed to delegate under the argument name given by
            df_argument. Allows dynamic subsetting with by. Default is None.
        df_argument : str, optional
            Name of the delegate argument that receives the data. Default is "df".
        arguments : dict, optional
            Static arguments (other than weight) passed to delegate. Default is None.
        weight : str, optional
            Weight column name for weighted statistics. Default is "".
        replicates : Replicates|None, optional
            Replicates object for replicate weight standard errors. Default is None.
        scale_wgts_to : float, optional
            Scale weights to sum to this value. Default is 0.0 (no scaling).
        weight_argument_name : str, optional
            Keyword argument name for passing weight to delegate. Default is "weight".
        by : dict[str,list[str]]|None, optional
            Dictionary defining grouping variables for summary statistics.
        display : bool, optional
            Print results to log. Default is True.
        display_all_vars : bool, optional
            Print all variables rather than truncated summary. Default is True.
        display_max_vars : int, optional
            Maximum variables to print if display_all_vars=False. Default is 20.
        round_output : bool|int, optional
            Round the output. Default is True.

        Returns
        -------
        StatCalculator
            StatCalculator object containing the function results with
            estimates, SEs, and replicates as applicable.
        """

        #   Input parsing
        if arguments is None:
            arguments = {}

        if type(estimate_ids) is str:
            estimate_ids = [estimate_ids]

        if df is None:
            by = None
        else:
            if scale_wgts_to > 0:
                if weight != "":
                    weights_to_cast = [weight]
                    if replicates is not None:
                        weights_to_cast.extend(replicates.rep_list)
                    df = safe_sum_cast(df, weights_to_cast)

                    with_scale = [
                        (nw.col(weighti) / nw.col(weighti).sum() * scale_wgts_to).alias(
                            weighti
                        )
                        for weighti in weights_to_cast
                    ]
                    df = nw.from_native(df).with_columns(with_scale).to_native()

        if by is None:
            by = {"All": []}

        replicate_name = "___replicate___"

        df_estimates = []
        df_ses = []
        df_replicates = []
        by_vars = StatCalculator._by_vars(by=by)

        nw_type = NarwhalsType(df)
        for keyi, valuei in by.items():
            if keyi == "All":
                logger.info(f"Running {delegate.__name__}")
            else:
                logger.info(f"Running {delegate.__name__} for {keyi}")

            df_list = []
            if df is not None:
                if len(valuei):
                    df_partitioned = (
                        nw_type.to_polars()
                        .lazy()
                        .collect()
                        .partition_by(by=valuei, maintain_order=True, include_key=True)
                    )

                    df_partitioned = [
                        nw_type.from_polars(dfi) for dfi in df_partitioned
                    ]

                    df_list.extend(df_partitioned)
                else:
                    df_list.append(df)

            if len(df_list) == 0:
                df_list = [None]

            for dfi in df_list:
                append_by = []
                append_values = []

                if dfi is not None:
                    arguments[df_argument] = dfi

                    if len(valuei):
                        append_values = dfi.select(valuei).unique().to_dicts()
                        #   Distinct names avoid shadowing the outer keyi/valuei
                        append_by = [
                            nw.lit(by_value).alias(by_col)
                            for by_col, by_value in append_values[0].items()
                        ]

                if replicates is None:
                    df_esti = delegate(**arguments)
                    if len(append_by):
                        df_esti = (
                            nw.from_native(df_esti).with_columns(append_by).to_native()
                        )

                    df_estimates.append(df_esti)
                else:
                    if len(append_values):
                        logger.info(append_values)

                    rep_return = replicates_ses_from_function(
                        delegate=delegate,
                        arguments=arguments,
                        join_on=estimate_ids,
                        weight_argument_name=weight_argument_name,
                        weights=replicates.rep_list,
                        replicate_name=replicate_name,
                    )

                    df_esti = rep_return.df_estimates
                    df_sei = rep_return.df_ses
                    df_repi = rep_return.df_replicates

                    if len(append_by):
                        df_esti = (
                            nw.from_native(df_esti).with_columns(append_by).to_native()
                        )
                        df_sei = (
                            nw.from_native(df_sei).with_columns(append_by).to_native()
                        )

                        df_repi = (
                            nw.from_native(df_repi).with_columns(append_by).to_native()
                        )

                    df_estimates.append(df_esti)
                    df_ses.append(df_sei)
                    df_replicates.append(df_repi)

            del df_list

        #   Set up the output
        ss_out = StatCalculator(
            df=None,
            weight=weight,
            replicates=replicates,
            by=by,
            display=display,
            display_all_vars=display_all_vars,
            display_max_vars=display_max_vars,
            round_output=round_output,
            calculate=False,
        )

        ss_out.variable_ids = estimate_ids

        if len(df_estimates):
            df_estimates = concat_wrapper(df_estimates, how="diagonal")
            #   Final variable order
            if len(by_vars):
                select_order = estimate_ids + by_vars
                select_order.extend(
                    [
                        coli
                        for coli in safe_columns(df_estimates)
                        if coli not in select_order
                    ]
                )
            else:
                select_order = safe_columns(df_estimates)
            ss_out.df_estimates = df_estimates.select(select_order)
        if len(df_ses):
            ss_out.df_ses = concat_wrapper(df_ses, how="diagonal").select(select_order)

        if len(df_replicates):
            ss_out.df_replicates = concat_wrapper(df_replicates, how="diagonal").select(
                select_order + [replicate_name]
            )

        ss_out.df_estimates = ss_out.round_results(df=ss_out.df_estimates)
        ss_out.df_ses = ss_out.round_results(df=ss_out.df_ses)

        if display:
            ss_out.print()

        return ss_out

    def filter(self, filter_expr: nw.Expr) -> StatCalculator:
        self = self.copy()
        self.replicate_stats = self.replicate_stats.filter(filter_expr)

        return self

    def select(
        self, select_expr: nw.Expr | str | list[str] | list[nw.Expr]
    ) -> StatCalculator:
        cols_keep = (
            nw.from_native(self.df_estimates)
            .lazy()
            .select(select_expr)
            .collect_schema()
            .names()
        )
        add_join_on = list(set(self.variable_ids).difference(cols_keep))
        cols_keep = add_join_on + cols_keep

        self = self.copy()
        self.replicate_stats = self.replicate_stats.select(cols_keep)

        return self

    def with_columns(self, with_expr: nw.Expr | list[nw.Expr]) -> StatCalculator:
        self = self.copy()
        self.replicate_stats = self.replicate_stats.with_columns(with_expr)

        return self

    def sort(
        self, sort_expr: nw.Expr | list[nw.Expr] | str | list[str]
    ) -> StatCalculator:
        self = self.copy()
        self.replicate_stats = self.replicate_stats.sort(sort_expr)

        return self

    def drop(
        self, drop_expr: nw.Expr | list[nw.Expr] | str | list[str]
    ) -> StatCalculator:
        self = self.copy()
        self.replicate_stats = self.replicate_stats.drop(drop_expr)

        return self

    def rename(self, d_rename: dict[str, str]) -> StatCalculator:
        self = self.copy()
        self.replicate_stats = self.replicate_stats.rename(d_rename)

        return self

    def scale_by(
        self, factor: float, columns: list[str] | str | None = None
    ) -> StatCalculator:
        if columns is None:
            #   Any columns that aren't the join_on ones
            columns = (
                nw.from_native(self.df_estimates)
                .lazy()
                .collect_schema()
                .names()
            )
            columns = list(set(columns).difference(self.join_on))

        return self.with_columns(with_expr=nw.col(columns) * factor)

    def pipe(self, function: Callable, *args, **kwargs) -> StatCalculator:
        """
        Pipe a function to df_estimates, df_ses, and df_replicates (as necessary)

        Parameters
        ----------
        function : Callable
            Function to pipe.
        *args : Any
            Positional arguments passed to function.
        **kwargs : Any
            Keyword arguments passed to function.

        Returns
        -------
        StatCalculator

        """

        self = self.copy()
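        #   Pass function positionally: combining function=function with
        #   *args raises TypeError when args is non-empty
        self.replicate_stats = self.replicate_stats.pipe(
            function, *args, **kwargs
        )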

        return self

    # def reshape_groups_wide_long(self,
    #                              copy:bool=False,
    #                              group_first:bool=True,
    #                              group_col:str="Group",
    #                              invert_group:bool=False) -> StatCalculator:
    #     if copy:
    #         self = self.copy()

    #     def _reshape(df:pl.DataFrame | pl.LazyFrame,
    #                  join_on:list[str]) -> pl.DataFrame | pl.LazyFrame:
    #         if "___replicate___" in df.columns :
    #             join_on = join_on + ["___replicate___"]

    #         df_concat = []
    #         for coli in df.columns:
    #             if coli not in join_on:
    #                 coli_group = coli.split(":")[0]
    #                 coli_value = coli.split(":")[1]
    #                 c_name = pl.col(join_on[0])
    #                 rename = {coli:coli_value}

    #                 if group_first:
    #                     with_name = pl.concat_str([pl.lit(coli_group),
    #                                                pl.lit(":"),
    #                                                c_name]).alias(c_name.meta.output_name())
    #                 else:
    #                     with_name = pl.concat_str([c_name,
    #                                                pl.lit(":"),
    #                                                pl.lit(coli_group)]).alias(c_name.meta.output_name())

    #                 with_name = with_name.alias(c_name.meta.output_name())
    #                 if group_col != "":
    #                     if invert_group:
    #                         with_name = [with_name,
    #                                      coli_value.alias(group_col)]
    #                     else:
    #                         with_name = [with_name,
    #                                      pl.lit(coli_group).alias(group_col)]

    #                 df_concat.append((df.select(join_on + [coli])
    #                                     .rename(rename)
    #                                     .with_columns(with_name)))

    #         return pl.concat(df_concat,
    #                          how="vertical_relaxed")

    #     self = self.pipe(_reshape,
    #                      join_on=self.variable_ids)

    #     return self
    def concat_with(
        self, sc_concat: StatCalculator, how: str = "horizontal"
    ) -> StatCalculator:
        """
        Concatenate this with another StatCalculator object

        Parameters
        ----------
        sc_concat : StatCalculator
            Other StatCalculator object to concatenate with.
        how : str, optional
            "horizontal" or "vertical". "horizontal" joins the two sets of
            estimates and "vertical" stacks them.
            The default is "horizontal".

        Returns
        -------
        StatCalculator

        """

        self = self.copy()

        self.replicate_stats.concat_with(
            rs_concat=sc_concat.replicate_stats,
            join_on_self=self.variable_ids,
            join_on_concat=sc_concat.variable_ids,
        )

        return self

    def drb_round_table(
        self,
        columns: list | str | None = None,
        columns_n: list | str | None = None,
        columns_exclude: list | str | None = None,
        round_all: bool = True,
        digits: int = 4,
        compress: bool = False,
    ) -> StatCalculator:
        """
        Apply DRB (Disclosure Review Board) rounding rules to the estimates.

        Parameters
        ----------
        columns : list|str|None, optional
            Specific columns to round. Default is None.
        columns_n : list|str|None, optional
            Columns to treat as counts for rounding. Default is None.
        columns_exclude : list|str|None, optional
            Columns to exclude from rounding. Default is None.
        round_all : bool, optional
            Apply rounding to all numeric columns. Default is True.
        digits : int, optional
            Number of significant digits for rounding. Default is 4.
        compress : bool, optional
            Use compressed rounding format. Default is False.

        Returns
        -------
        StatCalculator
            StatCalculator with DRB rounding applied.
        """
        kwargs = copy(locals())
        del kwargs["self"]
        self.replicate_stats = self.replicate_stats.pipe(
            function=drb_round_table, **kwargs
        )

        return self

    #####################################################
    #   Serializable - BEGIN
    #####################################################
    @classmethod
    def _init_from_dict(cls, data: dict):
        return super()._init_from_dict(data, calculate=False)

df_estimates `property` `writable` ¶

df_estimates

IntoFrameT : Main estimates dataframe containing calculated statistics.

This property provides access to the primary results table with all calculated statistics. Includes variable identifiers, grouping variables, and statistical estimates as columns.

df_replicates `property` `writable` ¶

df_replicates

IntoFrameT : Full replicate estimates dataframe.

Contains individual estimates for each replicate weight, allowing for custom variance calculations or additional analysis. Includes all columns from df_estimates plus a replicate identifier column. Only populated when replicates parameter is provided.
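
As an illustration of the custom variance calculations mentioned above, a replicate-based standard error generally takes the form sqrt(c * sum((rep - est)^2)), where the scale factor c depends on the replicate design. The sketch below is a minimal, standard-library version under that assumption. `replicate_se` and its `scale` parameter are hypothetical names, not part of survey_kit, and the value of c that matches a given set of replicate weights has to come from that survey's documentation.

```python
import math


def replicate_se(
    estimate: float, replicate_estimates: list[float], scale: float
) -> float:
    """Standard error from replicate estimates: sqrt(scale * sum((rep - est)^2))."""
    squared_deviations = [(rep - estimate) ** 2 for rep in replicate_estimates]
    return math.sqrt(scale * sum(squared_deviations))


# Hypothetical values standing in for one cell of df_estimates and the
# matching rows of df_replicates.
point_estimate = 2.0
replicates = [1.0, 2.0, 3.0, 2.0]

# scale = 1 / R is one common choice (an assumption here, not survey_kit's rule).
se = replicate_se(point_estimate, replicates, scale=1 / len(replicates))
print(se)
```

In practice the replicate estimates would be read from df_replicates, grouped by the identifier columns, with one row per replicate.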

df_ses `property` `writable` ¶

df_ses

IntoFrameT : Standard errors dataframe (when replicate weights are used).

Contains standard error estimates for all statistics calculated using replicate weight methods. Has the same structure as df_estimates but with standard errors instead of point estimates. Only populated when replicates parameter is provided.
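
The source above derives its "t" and "p" display rows from df_estimates and df_ses: the t-statistic is the absolute value of estimate / se, and the p-value is `scipy.stats.t.sf(t, df) * 2` with df fixed at 1,000,000. The sketch below reproduces that arithmetic for a single cell using only the standard library. It swaps in the standard normal tail (via `math.erfc`) for the t distribution, which is a close approximation at that many degrees of freedom but not the same call. The function names are hypothetical.

```python
import math


def t_statistic(estimate: float, se: float) -> float:
    """Absolute t-statistic, |estimate / se|."""
    if se == 0:
        # Choice made for this sketch: treat a zero SE as an infinite t.
        return math.inf
    return abs(estimate / se)


def two_sided_p(t: float) -> float:
    """Two-sided p-value under a standard normal approximation.

    2 * (1 - Phi(t)) equals erfc(t / sqrt(2)) for t >= 0.
    """
    return math.erfc(t / math.sqrt(2))


# Hypothetical estimate and standard error for one statistic.
t = t_statistic(estimate=-3.0, se=1.5)
p = two_sided_p(t)
print(t, p)
```

An infinite t maps to a p-value of 0 here, which matches the special case in the source's `p_value` helper.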

compare ¶

compare(
    compare_to,
    difference: bool = True,
    ratio: bool = True,
    display: bool = True,
    ratio_minus_1: bool = True,
    compare_list_variables: list[Variable] | None = None,
    compare_list_columns: list[Column] | None = None,
    quietly: bool = False,
)

Compare this set of estimates to another set of estimates, including MultipleImputation estimates.

Parameters:

Name	Type	Description	Default
`compare_to`	`StatCalculator \| MultipleImputation`	The other object to compare to.	required
`difference`	`bool`	Calculate and return the difference (with key "difference"). Default is True.	`True`
`ratio`	`bool`	Calculate and return the ratio (with key "ratio"). Default is True.	`True`
`ratio_minus_1`	`bool`	Rescale ratio by subtracting 1 from it. Default is True.	`True`
`display`	`bool`	Print the difference/ratio to the log. Default is True.	`True`
`compare_list_variables`	`list[Variable] \| None`	List of variables to compare (i.e. compare rows from prior calculations)	`None`
`compare_list_columns`	`list[Column] \| None`	List of columns to compare. For example, if compare_list_columns = [ComparisonItem.Column("mean","median")], then compare the mean from this object to the median from compare_to.	`None`
`quietly`	`bool`	Suppress informational messages. Default is False.	`False`

Returns:

Type	Description
`dict[str, StatCalculator]`	Dictionary with keys ["difference","ratio"] containing StatCalculator objects with the comparison estimates (with SEs if applicable).
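
To make the returned quantities concrete: when replicate weights are not involved, the source computes the difference as the compare_to estimates minus this object's estimates, and the ratio as that difference divided by this object's estimates (that is, the ratio minus 1). The sketch below shows the same arithmetic on plain dictionaries. `compare_estimates` is a hypothetical helper for illustration, and it returns plain dicts rather than StatCalculator objects.

```python
def compare_estimates(
    est1: dict[str, float],
    est2: dict[str, float],
    ratio_minus_1: bool = True,
) -> dict[str, dict[str, float]]:
    """Difference (est2 - est1) and ratio (est2 / est1) for matching statistics."""
    outputs: dict[str, dict[str, float]] = {"difference": {}, "ratio": {}}
    for name, base in est1.items():
        other = est2[name]
        outputs["difference"][name] = other - base
        ratio = other / base
        # With ratio_minus_1, a value of 0.2 reads as "20 percent higher".
        outputs["ratio"][name] = ratio - 1 if ratio_minus_1 else ratio
    return outputs


# Hypothetical estimates from two calculators.
baseline = {"mean": 10.0, "median": 8.0}
comparison = {"mean": 12.0, "median": 6.0}

result = compare_estimates(baseline, comparison)
print(result["difference"])
print(result["ratio"])
```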

Source code in src\survey_kit\statistics\calculator.py

def compare(
    self,
    compare_to,
    difference: bool = True,
    ratio: bool = True,
    display: bool = True,
    ratio_minus_1: bool = True,
    compare_list_variables: list[ComparisonItem.Variable] | None = None,
    compare_list_columns: list[ComparisonItem.Column] | None = None,
    quietly: bool = False,
):
    """
    Compare this set of estimates to another set of estimates,
    including MultipleImputation estimates.

    Parameters
    ----------
    compare_to : StatCalculator | MultipleImputation
        The other object to compare to.
    difference : bool, optional
        Calculate and return the difference (with key "difference"). Default is True.
    ratio : bool, optional
        Calculate and return the ratio (with key "ratio"). Default is True.
    ratio_minus_1 : bool, optional
        Rescale ratio by subtracting 1 from it. Default is True.
    display : bool, optional
        Print the difference/ratio to the log. Default is True.
    compare_list_variables : list[ComparisonItem.Variable] | None, optional
        List of variables to compare (i.e. compare rows from prior calculations)
    compare_list_columns : list[ComparisonItem.Column] | None, optional
        List of columns to compare.
        For example, if compare_list_columns = [ComparisonItem.Column("mean","median")],
        then compare the mean from this object to the median from compare_to.
    quietly : bool, optional
        Suppress informational messages. Default is False.

    Returns
    -------
    dict[str, StatCalculator]
        Dictionary with keys ["difference","ratio"] containing
        StatCalculator objects with the comparison estimates
        (with SEs if applicable).
    """

    outputs = {}
    if statistical_comparison_item(self) and statistical_comparison_item(
        compare_to
    ):
        if not quietly:
            if self.bootstrap:
                logger.info("Comparing estimates using bootstrap weights")
            else:
                logger.info("Comparing estimates using replicate weights")

        outputs = compare(
            stats1=self,
            stats2=compare_to,
            join_on=self.variable_ids + self.summarize_vars,
            rounding=self.rounding,
            difference=difference,
            ratio=ratio,
            ratio_minus_1=ratio_minus_1,
            compare_list_variables=compare_list_variables,
            compare_list_columns=compare_list_columns,
        )

        if display:
            if difference:
                logger.info("  Difference")
                outputs["difference"].print(round_output=self.round_output)
                logger.info("\n")
            if ratio:
                logger.info("  Ratio")
                outputs["ratio"].print(round_output=self.round_output)
                logger.info("\n")

    else:
        if not quietly:
            logger.info("Comparing estimates")

        df1 = self.df_estimates
        df2 = compare_to.df_estimates

        (df1, df2) = StatComp.process_compare_lists(
            df1=df1,
            df2=df2,
            join_on=self._by_vars() + self.variable_ids,
            compare_list_variables=compare_list_variables,
            compare_list_columns=compare_list_columns,
        )

        sm_compare = StatCalculator(
            df=None, statistics=self.statistics, by=self.by, calculate=False
        )

        cols_index = self.variable_ids + self.summarize_vars
        cols_nonindex = df1.drop(cols_index).columns

        df1 = SafeCollect(df1)
        df2 = SafeCollect(df2)

        #   logger.info(df1.schema)
        #   Upcast any columns that need to be
        [df1, df2] = _polars_safe_upcast(
            df1.with_columns(pl.col(pl.Boolean).cast(pl.Int8)),
            df2.with_columns(pl.col(pl.Boolean).cast(pl.Int8)),
            cols1=cols_nonindex,
            cols2=cols_nonindex,
        )

        df_difference = SafeCollect(df2.select(cols_nonindex)) - df1.select(
            cols_nonindex
        )
        df_ratio = (df_difference) / df1.select(cols_nonindex)

        if difference:
            sm_diff = sm_compare

            if ratio:
                sm_compare = deepcopy(sm_diff)

            sm_diff.df_estimates = pl.concat(
                [df1.select(cols_index), df_difference], how="horizontal"
            )

            outputs["difference"] = sm_diff

            if display:
                logger.info("  Difference")
                sm_diff.print(round_output=sm_diff.round_output)
                logger.info("\n")

        if ratio:
            sm_ratio = sm_compare

            sm_ratio.df_estimates = pl.concat(
                [df1.select(cols_index), df_ratio], how="horizontal"
            )

            outputs["ratio"] = sm_ratio

            if display:
                logger.info("  Ratio")
                sm_ratio.print(round_output=sm_ratio.round_output)
                logger.info("\n")

    return outputs
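
In the branch without replicate or bootstrap weights, the listing above computes the difference as the `compare_to` estimates minus the `self` estimates, and the ratio as that difference divided by the `self` estimates (a relative change). A minimal sketch of that arithmetic with plain dictionaries follows. The helper name `compare_estimates` and the sample values are hypothetical and are not part of survey_kit.

```python
def compare_estimates(
    est1: dict[str, float], est2: dict[str, float]
) -> dict[str, dict[str, float]]:
    """Hypothetical helper mirroring the arithmetic of the non-replicate branch."""
    # Difference: second set of estimates minus the first
    difference = {name: est2[name] - est1[name] for name in est1}
    # Ratio: the difference relative to the first estimate, (est2 - est1) / est1
    ratio = {name: difference[name] / est1[name] for name in est1}
    return {"difference": difference, "ratio": ratio}


out = compare_estimates(
    {"income": 100.0, "age": 40.0},
    {"income": 110.0, "age": 38.0},
)
```

As in the original method, the result is keyed by "difference" and "ratio"; here the values are plain dictionaries rather than StatCalculator objects.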

concat_with ¶

concat_with(
    sc_concat: StatCalculator, how: str = "horizontal"
) -> StatCalculator

Concatenate this with another StatCalculator object

Parameters:

Name	Type	Description	Default
`sc_concat`	`StatCalculator`	Other StatCalculator object to concatenate with.	required
`how`	`str`	Either "horizontal" or "vertical". Horizontal performs a join; vertical stacks the tables. The default is "horizontal".	`'horizontal'`

Returns:

Type	Description
`StatCalculator`
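
The difference between the two `how` modes can be sketched with lists of row dictionaries. This is only an illustration of "join" versus "stack": the function `concat_tables` is hypothetical, and the choice of a left join for the horizontal case is an assumption, since the source does not state which join type is used.

```python
def concat_tables(
    left: list[dict], right: list[dict], join_on: list[str], how: str = "horizontal"
) -> list[dict]:
    """Hypothetical illustration of horizontal (join) vs. vertical (stack)."""
    if how == "vertical":
        # Stack the rows of both tables
        return left + right
    if how != "horizontal":
        raise ValueError(f"Unknown how: {how}")

    # Index the right-hand rows by their identifying columns
    right_by_key = {tuple(row[k] for k in join_on): row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(tuple(row[k] for k in join_on), {})
        extra = {col: val for col, val in match.items() if col not in join_on}
        # Assumed left join: rows without a match are kept unchanged
        joined.append({**row, **extra})
    return joined


left = [{"Variable": "income", "mean": 10.0}, {"Variable": "age", "mean": 40.0}]
right = [{"Variable": "income", "median": 9.0}]
wide = concat_tables(left, right, join_on=["Variable"])
long = concat_tables(left, right, join_on=["Variable"], how="vertical")
```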

Source code in src\survey_kit\statistics\calculator.py

def concat_with(
    self, sc_concat: StatCalculator, how: str = "horizontal"
) -> StatCalculator:
    """
    Concatenate this with another StatCalculator object

    Parameters
    ----------
    sc_concat : StatCalculator
        Other StatCalculator object to concatenate with.
    how : str, optional
        Either "horizontal" or "vertical".
        Horizontal performs a join; vertical stacks the tables.
        The default is "horizontal".

    Returns
    -------
    StatCalculator

    """

    self = self.copy()

    self.replicate_stats.concat_with(
        rs_concat=sc_concat.replicate_stats,
        join_on_self=self.variable_ids,
        join_on_concat=sc_concat.variable_ids,
    )

    return self

drb_round_table ¶

drb_round_table(
    columns: list | str | None = None,
    columns_n: list | str | None = None,
    columns_exclude: list | str | None = None,
    round_all: bool = True,
    digits: int = 4,
    compress: bool = False,
) -> StatCalculator

Apply DRB (Disclosure Review Board) rounding rules to the estimates.

Parameters:

Name	Type	Description	Default
`columns`	`list \| str \| None`	Specific columns to round. Default is None.	`None`
`columns_n`	`list \| str \| None`	Columns to treat as counts for rounding. Default is None.	`None`
`columns_exclude`	`list \| str \| None`	Columns to exclude from rounding. Default is None.	`None`
`round_all`	`bool`	Apply rounding to all numeric columns. Default is True.	`True`
`digits`	`int`	Number of significant digits for rounding. Default is 4.	`4`
`compress`	`bool`	Use compressed rounding format. Default is False.	`False`

Returns:

Type	Description
`StatCalculator`	StatCalculator with DRB rounding applied.
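
To make the `digits` parameter concrete, here is a minimal sketch of rounding a value to a given number of significant digits. The helper `round_sig` is hypothetical and only covers the generic significant-digit case; any special handling that DRB rules apply to counts (the `columns_n` columns) is not shown here.

```python
import math


def round_sig(value: float, digits: int = 4) -> float:
    """Round to `digits` significant digits (hypothetical helper)."""
    if value == 0 or not math.isfinite(value):
        return value
    # Position of the leading digit, e.g. 5 for 123456.789 and -3 for 0.00123
    exponent = math.floor(math.log10(abs(value)))
    return round(value, digits - 1 - exponent)


large = round_sig(123456.789)
small = round_sig(0.00123456)
```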

Source code in src\survey_kit\statistics\calculator.py

def drb_round_table(
    self,
    columns: list | str | None = None,
    columns_n: list | str | None = None,
    columns_exclude: list | str | None = None,
    round_all: bool = True,
    digits: int = 4,
    compress: bool = False,
) -> StatCalculator:
    """
    Apply DRB (Disclosure Review Board) rounding rules to the estimates.

    Parameters
    ----------
    columns : list|str|None, optional
        Specific columns to round. Default is None.
    columns_n : list|str|None, optional
        Columns to treat as counts for rounding. Default is None.
    columns_exclude : list|str|None, optional
        Columns to exclude from rounding. Default is None.
    round_all : bool, optional
        Apply rounding to all numeric columns. Default is True.
    digits : int, optional
        Number of significant digits for rounding. Default is 4.
    compress : bool, optional
        Use compressed rounding format. Default is False.

    Returns
    -------
    StatCalculator
        StatCalculator with DRB rounding applied.
    """
    kwargs = copy(locals())
    del kwargs["self"]
    self.replicate_stats = self.replicate_stats.pipe(
        function=drb_round_table, **kwargs
    )

    return self

from_function ¶

from_function(
    delegate: Callable,
    estimate_ids: list | str,
    df: IntoFrameT | None = None,
    df_argument: str = "df",
    arguments: dict | None = None,
    weight: str = "",
    replicates: Replicates | None = None,
    scale_wgts_to: float = 0.0,
    weight_argument_name: str = "weight",
    by: dict[str, list[str]] | None = None,
    display: bool = True,
    display_all_vars: bool = True,
    display_max_vars: int = 20,
    round_output: bool | int = True,
) -> StatCalculator

Create a StatCalculator from a custom function that returns estimates.

This static method allows wrapping any function that returns estimates in a StatCalculator object for easy display and comparison.

Parameters:

Name	Type	Description	Default
`delegate`	`callable`	Function that returns a table of estimates. Must accept weight parameter if replicates are used.	required
`estimate_ids`	`list \| str`	Column names that identify each unique estimate.	required
`df`	`LazyFrame \| DataFrame`	Dataframe passed as "df" argument to delegate. Allows dynamic subsetting with by. Default is None.	`None`
`df_argument`	`str`	Name of the delegate argument that receives the data. Default is "df".	`'df'`
`arguments`	`dict`	Static arguments (other than weight) passed to delegate. Default is None.	`None`
`weight`	`str`	Weight column name for weighted statistics. Default is "".	`''`
`replicates`	`Replicates \| None`	Replicates object for replicate weight standard errors. Default is None.	`None`
`scale_wgts_to`	`float`	Scale weights to sum to this value. Default is 0.0 (no scaling).	`0.0`
`weight_argument_name`	`str`	Keyword argument name for passing weight to delegate. Default is "weight".	`'weight'`
`by`	`dict[str, list[str]] \| None`	Dictionary defining grouping variables for summary statistics.	`None`
`display`	`bool`	Print results to log. Default is True.	`True`
`display_all_vars`	`bool`	Print all variables rather than truncated summary. Default is True.	`True`
`display_max_vars`	`int`	Maximum variables to print if display_all_vars=False. Default is 20.	`20`
`round_output`	`bool \| int`	Round the output. Default is True.	`True`

Returns:

Type	Description
`StatCalculator`	StatCalculator object containing the function results with estimates, SEs, and replicates as applicable.
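
The `scale_wgts_to` option rescales each weight column so that it sums to the requested value, that is, each weight is divided by the column total and multiplied by the target. A minimal sketch of that step with a plain list follows; the function name `scale_weights` is hypothetical.

```python
def scale_weights(weights: list[float], scale_to: float) -> list[float]:
    """Rescale weights to sum to `scale_to` (hypothetical helper)."""
    if scale_to <= 0:
        # Mirrors the documented default: 0.0 means no scaling
        return list(weights)
    total = sum(weights)
    return [w / total * scale_to for w in weights]


scaled = scale_weights([1.0, 2.0, 5.0], scale_to=100.0)
unscaled = scale_weights([1.0, 2.0, 5.0], scale_to=0.0)
```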

Source code in src\survey_kit\statistics\calculator.py

def from_function(
    delegate: Callable,
    estimate_ids: list | str,
    df: IntoFrameT | None = None,
    df_argument: str = "df",
    arguments: dict | None = None,
    weight: str = "",
    replicates: Replicates | None = None,
    scale_wgts_to: float = 0.0,
    weight_argument_name: str = "weight",
    by: dict[str, list[str]] | None = None,
    display: bool = True,
    display_all_vars: bool = True,
    display_max_vars: int = 20,
    round_output: bool | int = True,
) -> StatCalculator:
    """
    Create a StatCalculator from a custom function that returns estimates.

    This static method allows wrapping any function that returns estimates
    in a StatCalculator object for easy display and comparison.

    Parameters
    ----------
    delegate : callable
        Function that returns a table of estimates. Must accept weight
        parameter if replicates are used.
    estimate_ids : list | str
        Column names that identify each unique estimate.
    df : pl.LazyFrame | pl.DataFrame, optional
        Dataframe passed as "df" argument to delegate. Allows dynamic
        subsetting with by. Default is None.
    df_argument : str, optional
        Name of the delegate argument that receives the data. Default is "df".
    arguments : dict, optional
        Static arguments (other than weight) passed to delegate. Default is None.
    weight : str, optional
        Weight column name for weighted statistics. Default is "".
    replicates : Replicates|None, optional
        Replicates object for replicate weight standard errors. Default is None.
    scale_wgts_to : float, optional
        Scale weights to sum to this value. Default is 0.0 (no scaling).
    weight_argument_name : str, optional
        Keyword argument name for passing weight to delegate. Default is "weight".
    by : dict[str,list[str]]|None, optional
        Dictionary defining grouping variables for summary statistics.
    display : bool, optional
        Print results to log. Default is True.
    display_all_vars : bool, optional
        Print all variables rather than truncated summary. Default is True.
    display_max_vars : int, optional
        Maximum variables to print if display_all_vars=False. Default is 20.
    round_output : bool|int, optional
        Round the output. Default is True.

    Returns
    -------
    StatCalculator
        StatCalculator object containing the function results with
        estimates, SEs, and replicates as applicable.
    """

    #   Input parsing
    if arguments is None:
        arguments = {}

    if type(estimate_ids) is str:
        estimate_ids = [estimate_ids]

    if df is None:
        by = None
    else:
        if scale_wgts_to > 0:
            if weight != "":
                weights_to_cast = [weight]
                if replicates is not None:
                    weights_to_cast.extend(replicates.rep_list)
                df = safe_sum_cast(df, weights_to_cast)

                with_scale = [
                    #   alias expects the column name (a string), not an expression
                    (nw.col(weighti) / nw.col(weighti).sum() * scale_wgts_to).alias(
                        weighti
                    )
                    for weighti in weights_to_cast
                ]
                df = nw.from_native(df).with_columns(with_scale).to_native()

    if by is None:
        by = {"All": []}

    replicate_name = "___replicate___"

    df_estimates = []
    df_ses = []
    df_replicates = []
    by_vars = StatCalculator._by_vars(by=by)

    nw_type = NarwhalsType(df)
    for keyi, valuei in by.items():
        if keyi == "All":
            logger.info(f"Running {delegate.__name__}")
        else:
            logger.info(f"Running {delegate.__name__} for {keyi}")

        df_list = []
        if df is not None:
            if len(valuei):
                df_partitioned = (
                    nw_type.to_polars()
                    .lazy()
                    .collect()
                    .partition_by(by=valuei, maintain_order=True, include_key=True)
                )

                df_partitioned = [
                    nw_type.from_polars(dfi) for dfi in df_partitioned
                ]

                df_list.extend(df_partitioned)
            else:
                df_list.append(df)

        if len(df_list) == 0:
            df_list = [None]

        for dfi in df_list:
            append_by = []
            append_values = []

            if dfi is not None:
                arguments[df_argument] = dfi

                if len(valuei):
                    append_values = dfi.select(valuei).unique().to_dicts()
                    append_by = [
                        nw.lit(by_value).alias(by_col)
                        for by_col, by_value in append_values[0].items()
                    ]

            if replicates is None:
                df_esti = delegate(**arguments)
                if len(append_by):
                    df_esti = (
                        nw.from_native(df_esti).with_columns(append_by).to_native()
                    )

                df_estimates.append(df_esti)
            else:
                if len(append_values):
                    logger.info(append_values)

                rep_return = replicates_ses_from_function(
                    delegate=delegate,
                    arguments=arguments,
                    join_on=estimate_ids,
                    weight_argument_name=weight_argument_name,
                    weights=replicates.rep_list,
                    replicate_name=replicate_name,
                )

                df_esti = rep_return.df_estimates
                df_sei = rep_return.df_ses
                df_repi = rep_return.df_replicates

                if len(append_by):
                    df_esti = (
                        nw.from_native(df_esti).with_columns(append_by).to_native()
                    )
                    df_sei = (
                        nw.from_native(df_sei).with_columns(append_by).to_native()
                    )

                    df_repi = (
                        nw.from_native(df_repi).with_columns(append_by).to_native()
                    )

                df_estimates.append(df_esti)
                df_ses.append(df_sei)
                df_replicates.append(df_repi)

        del df_list

    #   Set up the output
    ss_out = StatCalculator(
        df=None,
        weight=weight,
        replicates=replicates,
        by=by,
        display=display,
        display_all_vars=display_all_vars,
        display_max_vars=display_max_vars,
        round_output=round_output,
        calculate=False,
    )

    ss_out.variable_ids = estimate_ids

    if len(df_estimates):
        df_estimates = concat_wrapper(df_estimates, how="diagonal")
        #   Final variable order
        if len(by_vars):
            select_order = estimate_ids + by_vars
            select_order.extend(
                [
                    coli
                    for coli in safe_columns(df_estimates)
                    if coli not in select_order
                ]
            )
        else:
            select_order = safe_columns(df_estimates)
        ss_out.df_estimates = df_estimates.select(select_order)
    if len(df_ses):
        ss_out.df_ses = concat_wrapper(df_ses, how="diagonal").select(select_order)

    if len(df_replicates):
        ss_out.df_replicates = concat_wrapper(df_replicates, how="diagonal").select(
            select_order + [replicate_name]
        )

    ss_out.df_estimates = ss_out.round_results(df=ss_out.df_estimates)
    if ss_out.df_ses is not None:
        #   round_results falls back to df_estimates when df is None
        ss_out.df_ses = ss_out.round_results(df=ss_out.df_ses)

    if display:
        ss_out.print()

    return ss_out
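
The grouping loop in the listing above partitions the data by the `by` variables, calls the delegate once per partition, and appends the group values to each result. A simplified sketch with lists of row dictionaries, assuming no replicate weights, might look like this. The names `weighted_mean` and `run_by_group` are hypothetical and the delegate is only an example of a function that accepts `df` and `weight`.

```python
def weighted_mean(df: list[dict], weight: str, column: str) -> list[dict]:
    """Hypothetical delegate: returns a one-row table of estimates."""
    total_weight = sum(row[weight] for row in df)
    mean = sum(row[column] * row[weight] for row in df) / total_weight
    return [{"Variable": column, "mean": mean}]


def run_by_group(delegate, df: list[dict], by_vars: list[str], arguments: dict):
    """Call the delegate per group and append the group values to its output."""
    groups: dict[tuple, list[dict]] = {}
    for row in df:
        # Insertion order is kept, similar to maintain_order=True
        groups.setdefault(tuple(row[v] for v in by_vars), []).append(row)

    estimates = []
    for key, rows in groups.items():
        for estimate in delegate(df=rows, **arguments):
            estimates.append({**estimate, **dict(zip(by_vars, key))})
    return estimates


data = [
    {"region": "N", "income": 10.0, "w": 1.0},
    {"region": "N", "income": 20.0, "w": 3.0},
    {"region": "S", "income": 30.0, "w": 2.0},
]
by_region = run_by_group(
    weighted_mean, data, by_vars=["region"], arguments={"weight": "w", "column": "income"}
)
```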

pipe ¶

pipe(function: Callable, *args, **kwargs) -> StatCalculator

Pipe a function to df_estimates, df_ses, and df_replicates (as necessary)

Parameters:

Name	Type	Description	Default
`function`	`Callable`	Function to pipe.	required
`*args`	`Any`	Positional arguments passed to function.	`()`
`**kwargs`	`Any`	Keyword arguments passed to function.	`{}`

Returns:

Type	Description
`StatCalculator`
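
Conceptually, piping applies the same function to each table that exists (estimates, SEs, replicates) and skips any that are missing. A rough sketch of that idea with a dictionary of tables follows; `pipe_tables` and `keep_variable` are hypothetical names and the real method operates on the object's internal tables.

```python
def pipe_tables(tables: dict, function, *args, **kwargs) -> dict:
    """Apply `function` to every table that is not None (hypothetical helper)."""
    return {
        name: function(table, *args, **kwargs) if table is not None else None
        for name, table in tables.items()
    }


def keep_variable(table: list[dict], variable: str) -> list[dict]:
    """Example function to pipe: keep rows for one variable."""
    return [row for row in table if row["Variable"] == variable]


tables = {
    "df_estimates": [{"Variable": "income", "mean": 10.0}, {"Variable": "age", "mean": 40.0}],
    "df_ses": [{"Variable": "income", "mean": 1.5}, {"Variable": "age", "mean": 0.5}],
    "df_replicates": None,
}
piped = pipe_tables(tables, keep_variable, variable="income")
```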

Source code in src\survey_kit\statistics\calculator.py

def pipe(self, function: Callable, *args, **kwargs) -> StatCalculator:
    """
    Pipe a function to df_estimates, df_ses, and df_replicates (as necessary)

    Parameters
    ----------
    function : Callable
        Function to pipe.
    *args : Any
        Positional arguments passed to function.
    **kwargs : Any
        Keyword arguments passed to function.

    Returns
    -------
    StatCalculator

    """

    self = self.copy()
    #   Pass function positionally so that *args cannot collide with it
    self.replicate_stats = self.replicate_stats.pipe(function, *args, **kwargs)

    return self

print ¶

print(
    df: IntoFrameT | None = None,
    round_output: bool | int | None = None,
    estimates_per_page: int = 0,
    sub_log: logging = None,
)

Print the estimates (with SEs if applicable) to the log.

Parameters:

Name	Type	Description	Default
`df`	`IntoFrameT`	The estimates to display. Default is the estimates in self.	`None`
`round_output`	`bool \| int \| None`	Rounding rule (True for DRB, integer for number of significant digits). Default is self's rounding rule.	`None`
`estimates_per_page`	`int`	Repeat the header every k estimates. Defaults to 0 (don't repeat).	`0`
`sub_log`	`logging`	Override logger. Default is None (no override).	`None`

Returns:

Type	Description
`None`

Source code in src\survey_kit\statistics\calculator.py

def print(
    self,
    df: IntoFrameT | None = None,
    round_output: bool | int | None = None,
    estimates_per_page: int = 0,
    sub_log: logging = None,
):
    """
    Print the estimates (with SEs if applicable) to the log.

    Parameters
    ----------
    df : IntoFrameT, optional
        The estimates to display. Default is the estimates in self.
    round_output : bool|int|None, optional
        Rounding rule (True for DRB, integer for number of significant digits).
        Default is self's rounding rule.
    estimates_per_page : int, optional
        Repeat the header every k estimates. Defaults to 0 (don't repeat).
    sub_log : logging, optional
        Override logger. Default is None (no override).

    Returns
    -------
    None
    """
    if self.df_replicates is not None:
        self._print_replicates(
            round_output=round_output,
            estimates_per_page=estimates_per_page,
            sub_log=sub_log,
        )
    else:
        self._print_estimates(
            df=df,
            round_output=round_output,
            estimates_per_page=estimates_per_page,
            sub_log=sub_log,
        )

round_results ¶

round_results(
    df: IntoFrameT | None = None,
    rounding: Rounding | None = None,
    display_only: bool = False,
) -> IntoFrameT

Parameters:

Name	Type	Description	Default
`df`	`IntoFrameT`	Table of estimates. The default is the estimates in df_estimates.	`None`
`rounding`	`Rounding \| None`	Rounding rule (True for DRB rules, or an integer for a specific number of significant digits). The default is self's rounding.	`None`
`display_only`	`bool`	If True, only the display of the numbers is affected (values are cast to strings). The default is False.	`False`

Returns:

Name	Type	Description
`df`	`IntoFrameT`	The rounded estimates.

Source code in src\survey_kit\statistics\calculator.py

def round_results(
    self,
    df: IntoFrameT | None = None,
    rounding: Rounding | None = None,
    display_only: bool = False,
) -> IntoFrameT:
    """
    Parameters
    ----------
    df : IntoFrameT, optional
        Table of estimates. The default is the estimates in df_estimates.
    rounding : Rounding|None, optional
        Rounding rule (True for DRB rules, or an integer for a specific number
        of significant digits). The default is self's rounding.
    display_only : bool, optional
        If True, only the display of the numbers is affected (values are cast to strings). The default is False.

    Returns
    -------
    df : IntoFrameT
        The rounded estimates.

    """

    if df is None:
        df = self.df_estimates

    if rounding is None:
        rounding = self.rounding

    if df is not None:
        df = drb_round_table(
            df=df,
            columns=rounding.cols_round,
            columns_n=rounding.cols_n,
            columns_exclude=rounding.cols_exclude,
            round_all=rounding.round_all,
            digits=rounding.round_digits,
            display_only=display_only,
        )

    return df

table_of_estimates ¶

table_of_estimates(
    round_output: bool | int | None = None,
    estimates_to_show: list[str] | None = None,
    variable_prefix: str = "",
    estimate_type_variable_name: str = "Statistic",
    ci_level: float = 0.95,
) -> IntoFrameT

Create a formatted table of estimates with option of statistics to report.

Parameters:

Name	Type	Description	Default
`round_output`	`bool \| int \| None`	Rounding rule for display.	`None`
`estimates_to_show`	`list[str] \| None`	List of estimate types to include. Options: "estimate", "se", "t", "p", "ci". Default is ["estimate", "se"].	`None`
`variable_prefix`	`str`	Prefix to add to variable column names. Default is "".	`''`
`estimate_type_variable_name`	`str`	Name for the column indicating statistic type. Default is "Statistic".	`'Statistic'`
`ci_level`	`float`	Confidence interval level for "ci" estimates. Default is 0.95.	`0.95`

Returns:

Type	Description
`IntoFrameT`	Formatted table with estimates arranged by statistic type.
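
The resulting layout places each estimate row first, followed by the rows for the other requested statistics, with the identifying columns blanked on all but the first row. A small sketch of that arrangement for "estimate" and "se" with lists of row dictionaries follows; `interleave_estimates` is a hypothetical helper and assumes both tables list the variables in the same order.

```python
def interleave_estimates(
    estimates: list[dict],
    ses: list[dict],
    id_col: str = "Variable",
    type_col: str = "Statistic",
) -> list[dict]:
    """Place each SE row directly under its estimate row (hypothetical helper)."""
    rows = []
    for est_row, se_row in zip(estimates, ses):
        est_values = {k: v for k, v in est_row.items() if k != id_col}
        se_values = {k: v for k, v in se_row.items() if k != id_col}
        rows.append({id_col: est_row[id_col], type_col: "estimate", **est_values})
        # Blank the identifier on the follow-up row, as in the formatted table
        rows.append({id_col: "", type_col: "se", **se_values})
    return rows


table = interleave_estimates(
    [{"Variable": "income", "mean": 10.0}, {"Variable": "age", "mean": 40.0}],
    [{"Variable": "income", "mean": 1.5}, {"Variable": "age", "mean": 0.5}],
)
```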

Source code in src\survey_kit\statistics\calculator.py

def table_of_estimates(
    self,
    round_output: bool | int | None = None,
    estimates_to_show: list[str] | None = None,
    variable_prefix: str = "",
    estimate_type_variable_name: str = "Statistic",
    ci_level: float = 0.95,
) -> IntoFrameT:
    """
    Create a formatted table of estimates with option of statistics to report.

    Parameters
    ----------
    round_output : bool|int|None, optional
        Rounding rule for display.
    estimates_to_show : list[str] | None, optional
        List of estimate types to include. Options: "estimate", "se", "t", "p", "ci".
        Default is ["estimate", "se"].
    variable_prefix : str, optional
        Prefix to add to variable column names. Default is "".
    estimate_type_variable_name : str, optional
        Name for the column indicating statistic type. Default is "Statistic".
    ci_level : float, optional
        Confidence interval level for "ci" estimates. Default is 0.95.

    Returns
    -------
    IntoFrameT
        Formatted table with estimates arranged by statistic type.
    """
    if estimates_to_show is None:
        estimates_to_show = ["estimate", "se"]

    df_ordered = []
    nw_ordered = []
    col_sort = "__order_output_table__"
    for index, esti in enumerate(estimates_to_show):
        dfi = None
        if esti.lower() == "estimate":
            dfi = self.df_estimates

        elif esti.lower() == "se" and self.df_replicates is not None:
            dfi = self.df_ses
        elif esti.lower() == "t" and self.df_replicates is not None:
            dfi = self._df_t()
        elif esti.lower() == "p" and self.df_replicates is not None:
            dfi = self._df_p()
        elif esti.lower() == "ci" and self.df_replicates is not None:
            dfi = self._df_ci(ci_level=ci_level)
        else:
            message = f"{esti} not allowed for estimates_to_show"
            logger.error(message)
            raise Exception(message)

        if dfi is not None:
            nwi = NarwhalsType(dfi)
            nw_ordered.append(nwi)
            dfi = nwi.to_polars()
            df_ordered.append(
                dfi.with_columns(
                    [
                        pl.lit(index).alias(col_sort),
                        pl.lit(esti.lower()).alias(estimate_type_variable_name),
                    ]
                )
            )

    col_row_index = "___estimate_row_count___"
    df_ordered[0] = df_ordered[0].with_row_index(col_row_index)
    df_display = concat_wrapper(df_ordered, how="diagonal").lazy()

    sort_vars = self.variable_ids + self.summarize_vars
    df_display = df_display.sort(sort_vars + [col_sort]).with_columns(
        pl.col(col_row_index).forward_fill()
    )
    df_display = df_display.sort([col_row_index] + [col_sort]).drop(col_row_index)

    #   Clear extraneous information
    with_clear = []
    for coli in sort_vars:
        c_col = pl.col(coli)
        with_clear.append(
            pl.when(pl.col(col_sort) != 0)
            .then(pl.lit(""))
            .otherwise(c_col.cast(pl.String))
            .alias(coli)
        )

    select_order = sort_vars + [estimate_type_variable_name]
    remaining = []
    rename = {}
    for coli in df_display.lazy().collect_schema().names():
        if coli not in select_order and coli != col_sort:
            if variable_prefix != "":
                rename[coli] = f"{variable_prefix}{coli}"

            remaining.append(coli)

    df_display = df_display.with_columns(with_clear).select(
        select_order + remaining
    )
    #   Round?
    if round_output:
        rounding = deepcopy(self.rounding)
        rounding.set_round_digits(round_output)

        if (
            len(rounding.cols_n) == 0
            and len(rounding.cols_round) == 0
            and not rounding.round_all
        ):
            #   Nothing set to round, round all
            rounding.cols_round = remaining

        df_display = self.round_results(
            df=df_display, rounding=rounding, display_only=True
        )

    if len(rename):
        df_display = df_display.rename(rename)
    return nw_ordered[0].lazy().from_polars(df_display)

Statistics ¶

Parameters:

Name	Type	Description	Default
`stats`	`list[str]`	List of statistics to calculate (mean, median, etc.). Call Statistics.available_stats() for options.	required
`formula`	`str`	A formulaic-style (or R-style) formula defining the statistics to be calculated. The default is "". This takes precedence over columns.	`''`
`columns`	`list[str] \| str \| None`	List of columns to calculate statistics over. The default is None.	`None`
`quantile_interpolated`	`bool`	Use linear interpolation (census-style) for quantiles. The default is False.	`False`
`quantile_interpolated_interval`	`int`	If quantile_interpolated, what is the bin interval? The default is 2500.	`2500`
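
Statistic names may carry a modifier after a `|`, for example "n|not0". The sketch below reproduces how the `calculate` method in the source builds the table header for such a name, using the same rename and modifier mappings. The standalone function `stat_header` is hypothetical and exists only for illustration.

```python
# Mappings copied from the calculate method in the source listing
STATS_RENAME = {"count": "n, weighted", "rawcount": "n", "weight": "n, weighted"}
MODIFIERS_OUTPUT = {
    "not0": " (not 0)",
    "is0": " (== 0)",
    "notmissing": " (not null)",
    "share": " (share)",
    "missing": " (missing)",
}


def stat_header(stat: str) -> str:
    """Build the display header for a 'stat' or 'stat|modifier' string."""
    parts = stat.split("|")
    name = parts[0]
    modifier = parts[1] if len(parts) == 2 else ""
    # Unknown names and modifiers pass through unchanged / without a suffix
    return f"{STATS_RENAME.get(name, name)}{MODIFIERS_OUTPUT.get(modifier, '')}"


headers = [stat_header(s) for s in ["mean", "rawcount|not0", "count", "share|not0"]]
```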

Source code in src\survey_kit\statistics\statistics.py

class Statistics:
    """
    Parameters
    ----------
    stats : list[str]
        List of statistics to calculate (mean, median, etc.).
        Call Statistics.available_stats() for options.
    formula : str, optional
        A formulaic-style (or R-style) formula defining the statistics to be calculated.
        The default is "".  This takes precedence over columns.
    columns : list[str]|str|None, optional
        List of columns to calculate statistics over. The default is None.
    quantile_interpolated : bool, optional
        Use linear interpolation (census-style) for quantiles. The default is False.
    quantile_interpolated_interval : int, optional
        If quantile_interpolated, what is the bin interval? The default is 2500.

    """

    def __init__(
        self,
        stats: list[str],
        formula: str = "",
        columns: list[str] | str | None = None,
        quantile_interpolated: bool = False,
        quantile_interpolated_interval: int = 2500,
    ):
        #   Input parsing/set defaults
        if columns is None:
            columns = []
        elif type(columns) is str:
            columns = [columns]

        self.formula = formula
        self.columns = columns
        self.quantile_interpolated = quantile_interpolated
        self.quantile_interpolated_interval = quantile_interpolated_interval
        self.stats = stats

    @nw.narwhalify
    def calculate(
        self,
        df: IntoFrameT,
        weight: str = "",
        by: dict[str, list[str]] | None = None,
        summarize_vars: list | None = None,
        rounding: Rounding | None = None,
        allow_slow_pandas: bool = False,
    ):
        nw_type = NarwhalsType(df)

        if summarize_vars is None:
            summarize_vars = []
        if rounding is None:
            rounding = Rounding(round_output=False)

        if by is None:
            by = {"All": []}

        if type(by) is list:
            by = {f"{i}": itemi for i, itemi in enumerate(by)}

        if self.formula != "":
            #   It's a formula, process accordingly
            df_summary = nw.from_native(
                Formula(self.formula).get_model_matrix(df)
            ).lazy_backend(nw_type)
            cols_summary = df_summary.collect_schema().names()
        else:
            #   It's a variable list
            if len(self.columns):
                cols = []
                for coli in self.columns:
                    cols.extend(columns_from_list(df=df, columns=[coli]))
                df_summary = df.select(cols)
                cols_summary = cols
            else:
                #   No columns given: summarize every column
                df_summary = df
                cols_summary = df.lazy_backend(nw_type).collect_schema().names()

        df_summary = df_summary.select(cs.numeric(), cs.boolean())
        cols_summary = safe_columns(df_summary)
        #   Keep the weights
        if (
            weight != ""
            and weight not in df_summary.lazy_backend(nw_type).collect_schema().names()
        ):
            df_summary = concat_wrapper(
                [df_summary, df.select(weight)], how="horizontal"
            )

        if len(summarize_vars):
            df_summary = concat_wrapper(
                [drop_if_exists(df_summary, summarize_vars), df.select(summarize_vars)],
                how="horizontal",
            )

        #   Rename the stats for more useful table headers
        stats_rename = {
            "count": "n, weighted",
            "rawcount": "n",
            "weight": "n, weighted",
        }

        #   Same with the "modifiers"
        modifiers_output = {
            "not0": " (not 0)",
            "is0": " (== 0)",
            "notmissing": " (not null)",
            "share": " (share)",
            "missing": " (missing)",
        }

        #   Process the stats
        stats_headers = {}
        stats_dict = {}
        for stati in self.stats:
            stat_mod = stati.split("|")

            stati_raw = stat_mod[0]

            modifier = ""
            mod_header = ""
            if len(stat_mod) == 2:
                modifier = stat_mod[1]

                if modifier in modifiers_output.keys():
                    mod_header = modifiers_output[modifier]

            if stati_raw in stats_rename.keys():
                stat_headeri = stats_rename[stati_raw]
            else:
                stat_headeri = stati_raw
            stats_headers[stati] = f"{stat_headeri}{mod_header}"

            if modifier != "":
                cols_include = [f"{coli}|{modifier}" for coli in cols_summary]
            else:
                cols_include = cols_summary.copy()
            stats_dict = column_stats_builder(
                column_stats=stats_dict,
                cols_include=cols_include,
                df=df_summary,
                stat=[stati_raw],
            )

        summary_tables = calculate_by(
            df=df_summary,
            column_stats=stats_dict,
            by=by,
            always_return_as_collection=True,
            weight=weight,
            quantile_interpolated=self.quantile_interpolated,
            quantile_interpolated_interval=self.quantile_interpolated_interval,
            allow_slow_pandas=allow_slow_pandas,
        )

        suffixes = {}

        for stati in self.stats:
            stat_mod = stati.split("|")
            stat_onlyi = stat_mod[0]
            if len(stat_mod) == 2:
                modifier = stat_mod[1]
            else:
                modifier = ""

            suffixes[stati] = self.stat_suffix(stat_onlyi, modifier)

        stat_cols_final = None
        default_index = "___index___"

        for keyi, valuei in summary_tables.items():
            valuei = nw.from_native(valuei)
            b_default_index = False

            if keyi in by.keys():
                index = list_input(by[keyi])

                if index is None:
                    index = default_index
                elif len(index) == 0:
                    index = default_index
            else:
                index = default_index

            if index == default_index:
                #   Just use the row number as the index
                valuei = (
                    valuei.lazy()
                    .collect()
                    .with_row_index(name=index)
                    .lazy_backend(nw_type)
                )
                index = [default_index]
                b_default_index = True

            summaries_by_var = []

            #   Reshape to wide
            for coli in cols_summary:
                #   Catch if it's been renamed, like n->rawcount
                (coli, _, coli_original) = _check_special_modifiers(coli)

                cols_stats = [f"{coli}_{suffixi}" for suffixi in set(suffixes.values())]
                keep_list = index + cols_stats

                summaryi = valuei.select(keep_list)

                if stat_cols_final is None:
                    stat_cols_final = [
                        f"{stats_headers[keyi]}" for keyi in stats_headers.keys()
                    ]

                summaryi = summaryi.rename(
                    {
                        f"{coli}_{suffixes[keyi]}": stats_headers[keyi]
                        for keyi in stats_headers.keys()
                    }
                )
                if b_default_index:
                    summaryi = summaryi.drop(index)
                    keep_list_final = ["Variable"] + list(
                        dict.fromkeys(stats_headers.values())
                    )
                else:
                    keep_list_final = (
                        ["Variable"]
                        + index
                        + list(dict.fromkeys(stats_headers.values()))
                    )

                summaryi = summaryi.with_columns(
                    nw.lit(coli_original).alias("Variable")
                ).select(keep_list_final)
                summaries_by_var.append(summaryi)

            valuei = concat_wrapper(summaries_by_var, how="vertical")
            # summary_tables[keyi] = Compress(valuei,
            #                                 no_boolean=True)
            summary_tables[keyi] = valuei

        cols_by = []
        for keyi, valuei in summary_tables.items():
            cols_by.extend(
                list(
                    set(
                        summary_tables[keyi].lazy().collect_schema().names()
                    ).difference(stat_cols_final + cols_by)
                )
            )
            cols_by.remove("Variable")

        cols_dedupped = list(set(stat_cols_final))
        if len(cols_dedupped) != len(stat_cols_final):
            stat_cols_final = _columns_original_order(
                cols_unordered=cols_dedupped, cols_ordered=stat_cols_final
            )
        keep_order = ["Variable"] + cols_by + stat_cols_final
        output_table = concat_wrapper(
            list(summary_tables.values()), how="diagonal"
        ).select(keep_order)

        if len(cols_by):
            output_table = output_table.sort(cols_by)

        #   Get the information for rounding
        (cols_round, cols_n) = self.rounding_columns(
            output_table.drop(["Variable"] + summarize_vars)
        )

        rounding.cols_round = list(set(rounding.cols_round + cols_round))
        rounding.cols_n = list(set(rounding.cols_n + cols_n))

        return output_table.lazy_backend(nw_type)

    def stat_suffix(
        self=None,
        Statistic: str = "",
        #   not0, missing, nonmissing
        modifier: str = "",
    ) -> str:
        if modifier != "":
            modifier_suffix = f"_{modifier}"
        else:
            modifier_suffix = ""

        if Statistic in ["mean", "sum", "var", "std", "max", "min", "first", "gini"]:
            suffix = Statistic + modifier_suffix
        elif Statistic == "median":
            suffix = "q0_5" + modifier_suffix
        elif Statistic.startswith("q") or Statistic.startswith("p"):
            quantile = float(Statistic.replace("q", "").replace("p", "")) / 100
            suffix = f"q{str(quantile).replace('.', '_')}" + modifier_suffix
        elif (
            Statistic.startswith("count")
            or Statistic.startswith("rawcount")
            or Statistic.startswith("share")
            or Statistic.startswith("rawshare")
            or Statistic == "n"
            or Statistic == "weight"
        ):
            if Statistic.startswith("count") or Statistic == "weight":
                count_prefix = "n"
            elif Statistic.startswith("rawcount") or Statistic == "n":
                count_prefix = "rawn"
            elif Statistic.startswith("share"):
                count_prefix = "share"
            elif Statistic.startswith("rawshare"):
                count_prefix = "rawshare"

            count_suffix = ""
            suffixes = ["_not0", "_is0", "_notmissing", "_missing", "_share"]
            for si in suffixes:
                if Statistic.endswith(si):
                    count_suffix = si

            suffix = f"{count_prefix}{count_suffix}{modifier_suffix}"

        try:
            return suffix
        except UnboundLocalError:
            #   suffix is only assigned when Statistic matched one of the branches above
            message = f"{Statistic} is not a valid statistic"
            logger.error(message)
            raise Exception(message)

    @nw.narwhalify
    def rounding_columns(self, df: IntoFrameT) -> tuple[list[str], list[str]]:
        cols_n = [
            "n",
            "n (missing)",
            "n (not null)",
            "n (not 0)",  # ,
            # "n, weighted",
            # "n missing, weighted",
            # "n (not null), weighted",
            # "n (not 0), weighted"
        ]
        columns = df.lazy().collect_schema().names()
        cols_n = list(set(cols_n).intersection(columns))
        cols_round = list(set(columns).difference(cols_n))

        return (cols_round, cols_n)

    @staticmethod
    def available_stats():
        examples = [
            "mean",
            "sum",
            "median",
            "q10",
            "q97.5",
            "std",
            "var",
            "max",
            "min",
            "weight",
            "n",
            "gini",
        ]
        logger.info("")
        logger.info(f"Some examples: {examples}")
        logger.info("")
        logger.info(
            "Stats can also have 'modifiers' appended to them separated by a pipe ('|'), including"
        )
        modifiers = ["not0", "missing", "notmissing", "is0", "share"]
        logger.info(modifiers)

        logger.info("")
        logger.info("For quantiles, pass q{number} where number in (0,100)")
        logger.info("")
        logger.info("n is the unweighted count and weight is the weighted count")
        logger.info(f"     for n/weight: {modifiers}")

        modifiers = ["not0"]
        logger.info(f"     for all other stats: {modifiers}")

        examples = [
            "mean|not0",
            "sum|not0",
            "median",
            "min|not0",
            "count|missing",
            "n|notmissing",
            "n|share",
        ]
        logger.info("")
        logger.info(f"""Some examples: {examples}""")
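The listing above splits each requested statistic on the pipe character into a base statistic and an optional modifier, then maps it to a column suffix. The standalone sketch below mirrors only the mean/median/quantile branches of `stat_suffix`. The helper name `suffix_for` is hypothetical, it is not part of the library, and the count and share branches are deliberately left out.

```python
def suffix_for(stat: str) -> str:
    """Map a stat string such as "mean|not0" or "q97.5" to a column suffix.

    Hypothetical helper: it mirrors only part of stat_suffix above.
    """
    # Split "stat|modifier" into its two parts (the modifier is optional)
    base, _, modifier = stat.partition("|")
    modifier_suffix = f"_{modifier}" if modifier else ""

    if base in ["mean", "sum", "var", "std", "max", "min", "first", "gini"]:
        return base + modifier_suffix
    if base == "median":
        # The median is stored as the 0.5 quantile
        return "q0_5" + modifier_suffix
    if base.startswith(("q", "p")):
        # "q10" -> 10 / 100 -> 0.1 -> "q0_1"
        quantile = float(base.replace("q", "").replace("p", "")) / 100
        return f"q{str(quantile).replace('.', '_')}" + modifier_suffix

    # Count/share statistics are not covered by this sketch
    raise ValueError(f"{stat} is not handled by this sketch")


print(suffix_for("mean|not0"))
print(suffix_for("q97.5"))
```

Combined with a column name, these suffixes give the intermediate column names that the reshape step looks up, for example `income_mean_not0` for the statistic `mean|not0` on `income`.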

Replicates ¶

Bases: Serializable

Configuration for replicate weight variance estimation.

Replicates defines the structure of replicate weights in survey data for calculating proper standard errors that account for complex sample designs. Supports two variance estimation methods: Bootstrap and Balanced Repeated Replication (BRR).

Replicate weights are commonly used by statistical agencies (Census Bureau, BLS, etc.) to enable users to calculate design-based standard errors without sharing the full sample design details (strata, clusters, etc.).

Parameters:

Name	Type	Description	Default
`weight_stub`	`str`	Prefix for weight column names. For example, if weight_stub="weight_", the function looks for columns: weight_0, weight_1, ..., weight_n where weight_0 is the base weight and weight_1 through weight_n are the replicate weights.	required
`df`	`IntoFrameT \| None`	Dataframe containing the weight columns. Used to automatically detect the number of replicates. Default is None.	`None`
`n_replicates`	`int \| None`	Number of replicate weights (excluding the base weight). If None, will be inferred from df. Default is None.	`None`
`bootstrap`	`bool`	Variance estimation method. True: bootstrap variance (standard bootstrap resampling). False: Balanced Repeated Replication (BRR) variance. If you don't know which to use, use bootstrap=True. Default is False.	`False`

Attributes:

Name	Type	Description
`weight_stub`	`str`	The weight column prefix.
`n_replicates`	`int`	Number of replicate weights.
`bootstrap`	`bool`	Variance estimation method flag.
`rep_list`	`list[str]`	List of all weight column names (base + replicates).

Raises:

Type	Description
`Exception`	If neither df nor n_replicates is provided.

Examples:

Infer number of replicates from dataframe:

>>> from survey_kit.statistics.replicates import Replicates
>>> replicates = Replicates(
...     df=df,
...     weight_stub="weight_",
...     bootstrap=True
... )
>>> print(replicates.n_replicates)
>>> print(replicates.rep_list)

Specify number of replicates directly:

>>> replicates = Replicates(
...     weight_stub="weight_",
...     n_replicates=80,
...     bootstrap=False  # Use BRR variance
... )

Use with StatCalculator:

>>> from survey_kit.statistics.calculator import StatCalculator
>>> from survey_kit.statistics.statistics import Statistics
>>>
>>> stats = Statistics(stats=["mean", "median"], columns=["income"])
>>> replicates = Replicates(weight_stub="weight_", n_replicates=80, bootstrap=True)
>>>
>>> sc = StatCalculator(
...     df=df,
...     statistics=stats,
...     weight="weight_0",
...     replicates=replicates
... )
>>> sc.print()

Use with multiple imputation:

>>> from survey_kit.statistics.multiple_imputation import mi_ses_from_function
>>>
>>> arguments = {
...     "statistics": stats,
...     "replicates": replicates,
...     "weight": "weight_0"
... }
>>>
>>> mi_results = mi_ses_from_function(
...     delegate=StatCalculator,
...     df_implicates=srmi.df_implicates,
...     df_noimputes=weights_df,
...     arguments=arguments,
...     join_on=["Variable"]
... )

Notes

Bootstrap variance (bootstrap=True):

- Standard bootstrap resampling variance estimation
- SE = sqrt(Σ(θ̂ᵣ - θ̂₀)² / R), where θ̂₀ is the base weight estimate, θ̂ᵣ are the replicate estimates, and R is the number of replicates
- Use when you generated your own bootstrap weights

BRR variance (bootstrap=False):

- Balanced Repeated Replication variance
- Used by the Census Bureau and other statistical agencies
- SE = sqrt(4 Σ(θ̂ᵣ - θ̄)² / R), where θ̄ is the mean across all replicate estimates
- Use when working with official government surveys that provide BRR weights
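As a rough illustration of the two formulas in these notes, the sketch below computes a standard error from a base estimate and a list of replicate estimates. It is not the library's implementation: the function name `replicate_se` and its arguments are hypothetical, and the BRR branch centers on the mean of the replicate estimates because that is how the note describes it.

```python
import math


def replicate_se(
    estimate: float, replicate_estimates: list[float], bootstrap: bool
) -> float:
    """Standard error from replicate estimates (hypothetical helper)."""
    n_reps = len(replicate_estimates)

    if bootstrap:
        # Bootstrap: deviations are taken from the base weight estimate
        center = estimate
        scale = 1.0
    else:
        # BRR as described in the notes: deviations from the mean of the
        # replicate estimates, scaled by 4
        center = sum(replicate_estimates) / n_reps
        scale = 4.0

    sum_sq = sum((rep - center) ** 2 for rep in replicate_estimates)
    return math.sqrt(scale * sum_sq / n_reps)


# Toy numbers, made up purely for illustration
base = 10.0
reps = [9.0, 11.0, 10.0, 10.0]
print(replicate_se(base, reps, bootstrap=True))
print(replicate_se(base, reps, bootstrap=False))
```

With identical replicate estimates, the BRR standard error is twice the bootstrap one whenever the replicate mean equals the base estimate, as it does in the toy data here.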

The rep_list attribute provides all weight column names in order: [weight_0, weight_1, ..., weight_n] where weight_0 is the base weight.
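The naming scheme can be sketched in plain Python. Both helpers below are hypothetical. `build_rep_list` follows the documented pattern (stub plus an integer, with 0 as the base weight), while `infer_n_replicates` is only one plausible way the count could be detected from a dataframe's column names; the library's actual detection logic is not shown on this page.

```python
def build_rep_list(weight_stub: str, n_replicates: int) -> list[str]:
    """Base weight (index 0) followed by replicate weights 1..n_replicates."""
    return [f"{weight_stub}{i}" for i in range(n_replicates + 1)]


def infer_n_replicates(columns: list[str], weight_stub: str) -> int:
    """Assumed inference rule: highest integer index found after the stub."""
    indices = [
        int(col[len(weight_stub):])
        for col in columns
        if col.startswith(weight_stub) and col[len(weight_stub):].isdecimal()
    ]
    if not indices:
        raise ValueError(f"No columns found with stub '{weight_stub}'")
    return max(indices)


# Column names made up for illustration
columns = ["id", "income", "weight_0", "weight_1", "weight_2", "weight_3"]
n_replicates = infer_n_replicates(columns, "weight_")
print(n_replicates)
print(build_rep_list("weight_", n_replicates))
```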

ComparisonItem ¶

Helpers for specifying comparisons in StatCalculator and MultipleImputation.

Provides two types of comparisons:

Variable: Compare different variables (e.g., income vs income_2)
Column: Compare different statistics (e.g., mean vs median)
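To make the distinction concrete, here is a conceptual sketch on a tiny estimates table held as a list of dicts. It does not call the library. The function names, the toy numbers, and the sign convention (second item minus first, second item over first) are all assumptions; only the "difference" and "ratio" result keys are taken from the examples on this page.

```python
# Toy estimates table: one row per variable, one key per statistic
estimates = [
    {"Variable": "income", "mean": 50.0, "median": 40.0},
    {"Variable": "income_2", "mean": 55.0, "median": 44.0},
]


def compare_variables(rows, value1, value2, column="Variable"):
    """Variable comparison: two rows, compared statistic by statistic."""
    by_name = {row[column]: row for row in rows}
    row1, row2 = by_name[value1], by_name[value2]
    stats = [key for key in row1 if key != column]
    return {
        "difference": {s: row2[s] - row1[s] for s in stats},
        "ratio": {s: row2[s] / row1[s] for s in stats},
    }


def compare_columns(rows, column1, column2, column="Variable"):
    """Column comparison: two statistics, compared within each row."""
    return {
        "difference": {r[column]: r[column2] - r[column1] for r in rows},
        "ratio": {r[column]: r[column2] / r[column1] for r in rows},
    }


print(compare_variables(estimates, "income", "income_2")["difference"])
print(compare_columns(estimates, "mean", "median")["difference"])
```

A Variable comparison therefore yields one result per statistic, while a Column comparison yields one result per variable.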

Source code in src\survey_kit\statistics\comparisons.py

class ComparisonItem:
    """
    Helpers for specifying comparisons in [StatCalculator][survey_kit.statistics.calculator.StatCalculator]
    and [MultipleImputation][survey_kit.statistics.multiple_imputation.MultipleImputation].

    Provides two types of comparisons:

    - Variable: Compare different variables (e.g., income vs income_2)
    - Column: Compare different statistics (e.g., mean vs median)

    """

    class Variable:
        """
        Specify a comparison between two variables.

        Used to compare different variables within the same dataset, such as
        comparing income from two sources or comparing outcomes across groups.

        Parameters
        ----------
        value1 : str
            First variable value to compare (e.g., "income").
        value2 : str
            Second variable value to compare (e.g., "income_2").
        column : str, optional
            Name of the column containing variable names in the estimates dataframe.
            Default is "Variable".
        name : str, optional
            Name for the comparison result. If empty, uses f"{value1}_vs_{value2}".
            Default is "".

        Examples
        --------
        >>> from survey_kit.statistics.calculator import ComparisonItem
        >>>
        >>> # Compare two income variables
        >>> comp = ComparisonItem.Variable(
        ...     value1="wage_income",
        ...     value2="self_employment_income",
        ...     name="wage_vs_self_employment"
        ... )
        >>>
        >>> comparison = sc.compare(
        ...     sc,
        ...     difference=False,
        ...     compare_list_variables=[comp]
        ... )["ratio"]
        """

        def __init__(
            self, value1: str, value2: str, column: str = "Variable", name: str = ""
        ):
            self.column = column
            self.value1 = value1
            self.value2 = value2
            self.name = name

    class Column:
        """
        Specify a comparison between two statistics/columns.

        Used to compare different statistics for the same variable, such as
        comparing mean vs median or comparing different quantiles.

        Parameters
        ----------
        column1 : str
            First statistic column to compare (e.g., "mean").
        column2 : str
            Second statistic column to compare (e.g., "median").
        name : str, optional
            Name for the comparison result. If empty, uses f"c({column1},{column2})".
            Default is "".

        Examples
        --------
        >>> from survey_kit.statistics.calculator import ComparisonItem
        >>>
        >>> # Compare mean vs median
        >>> comp = ComparisonItem.Column(
        ...     column1="mean",
        ...     column2="median (not 0)",
        ...     name="median_mean_diff"
        ... )
        >>>
        >>> comparison = sc.compare(
        ...     sc,
        ...     ratio=False,
        ...     compare_list_columns=[comp]
        ... )["difference"]
        """

        def __init__(self, column1: str, column2: str, name: str = ""):
            self.column1 = column1
            self.column2 = column2

            if name == "":
                name = f"c({column1},{column2})"
            self.name = name

Column ¶

Specify a comparison between two statistics/columns.

Used to compare different statistics for the same variable, such as comparing mean vs median or comparing different quantiles.

Parameters:

Name	Type	Description	Default
`column1`	`str`	First statistic column to compare (e.g., "mean").	required
`column2`	`str`	Second statistic column to compare (e.g., "median").	required
`name`	`str`	Name for the comparison result. If empty, uses f"c({column1},{column2})". Default is "".	`''`

Examples:

>>> from survey_kit.statistics.calculator import ComparisonItem
>>>
>>> # Compare mean vs median
>>> comp = ComparisonItem.Column(
...     column1="mean",
...     column2="median (not 0)",
...     name="median_mean_diff"
... )
>>>
>>> comparison = sc.compare(
...     sc,
...     ratio=False,
...     compare_list_columns=[comp]
... )["difference"]

Source code in src\survey_kit\statistics\comparisons.py

class Column:
    """
    Specify a comparison between two statistics/columns.

    Used to compare different statistics for the same variable, such as
    comparing mean vs median or comparing different quantiles.

    Parameters
    ----------
    column1 : str
        First statistic column to compare (e.g., "mean").
    column2 : str
        Second statistic column to compare (e.g., "median").
    name : str, optional
        Name for the comparison result. If empty, uses f"c({column1},{column2})".
        Default is "".

    Examples
    --------
    >>> from survey_kit.statistics.calculator import ComparisonItem
    >>>
    >>> # Compare mean vs median
    >>> comp = ComparisonItem.Column(
    ...     column1="mean",
    ...     column2="median (not 0)",
    ...     name="median_mean_diff"
    ... )
    >>>
    >>> comparison = sc.compare(
    ...     sc,
    ...     ratio=False,
    ...     compare_list_columns=[comp]
    ... )["difference"]
    """

    def __init__(self, column1: str, column2: str, name: str = ""):
        self.column1 = column1
        self.column2 = column2

        if name == "":
            name = f"c({column1},{column2})"
        self.name = name

Variable ¶

Specify a comparison between two variables.

Used to compare different variables within the same dataset, such as comparing income from two sources or comparing outcomes across groups.

Parameters:

Name	Type	Description	Default
`value1`	`str`	First variable value to compare (e.g., "income").	required
`value2`	`str`	Second variable value to compare (e.g., "income_2").	required
`column`	`str`	Name of the column containing variable names in the estimates dataframe. Default is "Variable".	`'Variable'`
`name`	`str`	Name for the comparison result. If empty, uses f"{value1}_vs_{value2}". Default is "".	`''`

Examples:

>>> from survey_kit.statistics.calculator import ComparisonItem
>>>
>>> # Compare two income variables
>>> comp = ComparisonItem.Variable(
...     value1="wage_income",
...     value2="self_employment_income",
...     name="wage_vs_self_employment"
... )
>>>
>>> comparison = sc.compare(
...     sc,
...     difference=False,
...     compare_list_variables=[comp]
... )["ratio"]

Source code in src\survey_kit\statistics\comparisons.py

class Variable:
    """
    Specify a comparison between two variables.

    Used to compare different variables within the same dataset, such as
    comparing income from two sources or comparing outcomes across groups.

    Parameters
    ----------
    value1 : str
        First variable value to compare (e.g., "income").
    value2 : str
        Second variable value to compare (e.g., "income_2").
    column : str, optional
        Name of the column containing variable names in the estimates dataframe.
        Default is "Variable".
    name : str, optional
        Name for the comparison result. If empty, uses f"{value1}_vs_{value2}".
        Default is "".

    Examples
    --------
    >>> from survey_kit.statistics.calculator import ComparisonItem
    >>>
    >>> # Compare two income variables
    >>> comp = ComparisonItem.Variable(
    ...     value1="wage_income",
    ...     value2="self_employment_income",
    ...     name="wage_vs_self_employment"
    ... )
    >>>
    >>> comparison = sc.compare(
    ...     sc,
    ...     difference=False,
    ...     compare_list_variables=[comp]
    ... )["ratio"]
    """

    def __init__(
        self, value1: str, value2: str, column: str = "Variable", name: str = ""
    ):
        self.column = column
        self.value1 = value1
        self.value2 = value2
        self.name = name
