Data Utilities

mlsynth.utils.datautils.balance(df: DataFrame, unit_id_column_name: str, time_period_column_name: str) -> None

Check if the panel is strongly balanced.

A strongly balanced panel means every unit has an observation for every time period, and there are no duplicate unit-time observations.

Parameters:
  • df (pd.DataFrame) – The input panel data. Must contain columns specified by unit_id_column_name and time_period_column_name.

  • unit_id_column_name (str) – The name of the column in df that identifies the units. (Formerly unit_col)

  • time_period_column_name (str) – The name of the column in df that identifies the time periods. (Formerly time_col)

Returns:

None – This function does not return a value but raises an error if the panel is not strongly balanced or contains duplicates.

Raises:

MlsynthDataError – If duplicate unit-time observations are found, or if the panel is not strongly balanced (i.e., not all units have observations for all time periods).
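The two conditions above can be illustrated with a minimal pandas sketch. This is not the library's implementation: the helper name check_strongly_balanced is hypothetical, and it raises a plain ValueError where balance raises MlsynthDataError.

```python
import pandas as pd


def check_strongly_balanced(df: pd.DataFrame, unit_col: str, time_col: str) -> None:
    """Hypothetical helper: raise ValueError on duplicates or an unbalanced panel."""
    # Duplicate unit-time pairs violate the one-observation-per-cell requirement.
    if df.duplicated(subset=[unit_col, time_col]).any():
        raise ValueError("Duplicate unit-time observations found.")
    # With no duplicates, a balanced panel has exactly n_units * n_periods rows.
    n_units = df[unit_col].nunique()
    n_periods = df[time_col].nunique()
    if len(df) != n_units * n_periods:
        raise ValueError("Panel is not strongly balanced.")


panel = pd.DataFrame({
    "unit": ["A", "A", "B", "B"],
    "year": [2000, 2001, 2000, 2001],
})
balanced_result = check_strongly_balanced(panel, "unit", "year")  # passes silently

unbalanced = panel.iloc[:-1]  # drop unit B's 2001 observation
try:
    check_strongly_balanced(unbalanced, "unit", "year")
    raised = False
except ValueError:
    raised = True
```

The row-count comparison is only valid once duplicates have been ruled out, which is why the duplicate check runs first.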

mlsynth.utils.datautils.build_covariate_matrix(df: DataFrame, unit_id_column_name: str, time_period_column_name: str, covariates: List[str], pre_periods: int, unit_order: List[Any], *, aggregation: str = 'pre_mean', normalize: bool = True) -> Tuple[ndarray, Tuple[str, ...], ndarray, ndarray]

Per-unit covariate matrix aligned to unit_order.

Aggregates each covariate column into a single value per unit. The default "pre_mean" aggregation takes the mean over the first pre_periods rows; time-invariant covariates collapse to their constant value.

With normalize=True, each column is centered by its cross-unit mean and scaled by its cross-unit standard deviation (both computed across the units in unit_order), so all covariates are unit-free. This is the standard pre-step before applying SCM-style predictor weights or computing standardized mean differences for balance diagnostics.

Returns:

  • cov_matrix (np.ndarray, shape (len(unit_order), len(covariates))) – Per-unit values for each covariate, in the order given by unit_order and covariates.

  • cov_names (tuple of str) – The covariate names, in the same order as the columns of cov_matrix.

  • cov_means (np.ndarray, shape (len(covariates),)) – Cross-unit means used for centering (zeros when normalize=False).

  • cov_scales (np.ndarray, shape (len(covariates),)) – Cross-unit stds used for scaling (ones when normalize=False).
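A rough sketch of the "pre_mean" aggregation followed by cross-unit z-scoring is given below. It is an illustration under stated assumptions, not the library's code: the helper name pre_mean_covariates and the column names are hypothetical, the standard deviation uses NumPy's default (ddof=0), and the guard for constant columns is this sketch's own choice.

```python
import numpy as np
import pandas as pd


def pre_mean_covariates(df, unit_col, time_col, covariates, pre_periods, unit_order):
    """Hypothetical helper: pre-period mean per unit, then cross-unit z-scores."""
    rows = []
    for unit in unit_order:
        unit_rows = df[df[unit_col] == unit].sort_values(time_col)
        # Mean over the first `pre_periods` rows of this unit, per covariate.
        rows.append(unit_rows[covariates].head(pre_periods).mean().to_numpy())
    raw = np.vstack(rows).astype(float)
    means = raw.mean(axis=0)
    scales = raw.std(axis=0)  # ddof=0 is an assumption of this sketch
    # Avoid dividing by zero for covariates that are constant across units.
    scales = np.where(scales == 0, 1.0, scales)
    return (raw - means) / scales, tuple(covariates), means, scales


df = pd.DataFrame({
    "unit": ["A", "A", "A", "B", "B", "B"],
    "year": [1, 2, 3, 1, 2, 3],
    "income": [1.0, 3.0, 100.0, 5.0, 7.0, 100.0],
})
matrix, names, means, scales = pre_mean_covariates(
    df, "unit", "year", ["income"], pre_periods=2, unit_order=["B", "A"]
)
```

Here the third period is excluded from the averages, so the raw values are 6.0 for B and 2.0 for A, and the rows of matrix follow unit_order rather than the order in df.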

mlsynth.utils.datautils.build_donor_segments(ell_hat, m, T0, n)

mlsynth.utils.datautils.clean_surrogates2(surrogate_matrix: ndarray, donor_covariates_matrix: ndarray, treated_unit_covariates_matrix: ndarray, num_pre_treatment_periods: int, common_covariates_matrix: ndarray | None = None) -> ndarray

Clean surrogate variables by orthogonalizing them against covariates.

This function adjusts each surrogate variable in surrogate_matrix by removing the linear influence of a set of covariates. The covariates for projection are formed by combining donor_covariates_matrix (donor pool covariates) and treated_unit_covariates_matrix (treated unit covariates), potentially augmented by common_covariates_matrix. The cleaning process involves estimating a linear relationship between each surrogate and the combined covariates using data from the pre-treatment period (num_pre_treatment_periods), and then subtracting this predicted component from the surrogate across all time periods.

Parameters:
  • surrogate_matrix (np.ndarray) – Matrix of surrogate variables to be cleaned. Shape (n_periods, n_surrogates).

  • donor_covariates_matrix (np.ndarray) – Matrix of covariates for the donor pool. Shape (n_periods, n_donor_covariates).

  • treated_unit_covariates_matrix (np.ndarray) – Matrix of covariates for the treated unit. Shape (n_periods, n_treated_covariates). Typically, n_donor_covariates equals n_treated_covariates.

  • num_pre_treatment_periods (int) – Number of pre-treatment periods. The linear projection coefficients (projection_coefficients) are estimated using data up to this period.

  • common_covariates_matrix (Optional[np.ndarray], default None) – Additional common covariates to include in the projection for both donor and treated sides. Shape (n_periods, n_common_covariates).

Returns:

np.ndarray – The matrix of cleaned surrogate variables, where the influence of the specified covariates has been removed. Shape (n_periods, n_surrogates).
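The core idea (fit on pre-treatment rows, subtract the fitted component everywhere) can be sketched with NumPy least squares. This is only an illustration: the helper name residualize is hypothetical, and it takes a single, already-combined covariate matrix because how clean_surrogates2 combines the donor, treated, and common covariates is not specified here.

```python
import numpy as np


def residualize(surrogates: np.ndarray, covariates: np.ndarray, n_pre: int) -> np.ndarray:
    """Hypothetical helper: remove the linear covariate component from each surrogate."""
    # Estimate projection coefficients from the pre-treatment rows only.
    coef, *_ = np.linalg.lstsq(covariates[:n_pre], surrogates[:n_pre], rcond=None)
    # Subtract the predicted component across all time periods.
    return surrogates - covariates @ coef


rng = np.random.default_rng(0)
n_periods, n_pre = 12, 8
covariates = rng.normal(size=(n_periods, 2))
loadings = np.array([[1.0, -2.0, 0.5], [0.0, 3.0, 1.0]])
surrogates = covariates @ loadings + rng.normal(size=(n_periods, 3))

cleaned = residualize(surrogates, covariates, n_pre)
# In the pre-treatment window the cleaned surrogates are orthogonal to the covariates.
pre_cross_products = covariates[:n_pre].T @ cleaned[:n_pre]
```

Orthogonality holds exactly only on the rows used for fitting; post-treatment rows are adjusted with the same coefficients but are not guaranteed to be uncorrelated with the covariates.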

mlsynth.utils.datautils.dataprep(df: DataFrame, unit_id_column_name: str, time_period_column_name: str, outcome_column_name: str, treatment_indicator_column_name: str, allow_no_donors: bool = False, *, covariates: List[str] | None = None, covariate_aggregation: Literal['pre_mean'] = 'pre_mean', normalize_covariates: bool = True, marex: bool = False) -> Dict[str, Any]

Prepare data for synthetic control methods.

Pivots a long DataFrame into a wide format required by many synthetic control estimators. It identifies treated and donor units, and separates data into pre-treatment and post-treatment periods. Handles cases with a single treated unit or multiple treated units (by cohorting based on treatment start time).

Parameters:
  • df (pd.DataFrame) – The input panel data in long format. Must contain columns for unit identifiers, time periods, outcome variable, and treatment indicator.

  • unit_id_column_name (str) – Name of the column in df that identifies unique units.

  • time_period_column_name (str) – Name of the column in df that identifies time periods.

  • outcome_column_name (str) – Name of the column in df for the outcome variable.

  • treatment_indicator_column_name (str) – Name of the column in df for the treatment indicator (0 or 1).

  • allow_no_donors (bool, default False) – If True, do not raise when zero donor units remain after dropping the treated unit.

  • covariates (list of str, optional) – Column names to fold into a per-unit covariate matrix. When supplied, the returned dict gains covariate_matrix / covariate_names / covariate_means / covariate_scales plus donor_covariate_matrix / treated_covariate_matrix (single treated case). Time-varying covariates are aggregated per the covariate_aggregation mode; time-invariant covariates collapse to their constant value. None (default) leaves the return dict identical to the pre-extension behaviour.

  • covariate_aggregation ({"pre_mean"}, default "pre_mean") – How to collapse each covariate column to a single value per unit. Currently only "pre_mean" (the pre-treatment mean) is supported.

  • normalize_covariates (bool, default True) – If True, z-score each covariate column across units (the standard pre-step for SCM-style predictor weights and SMD diagnostics).

  • marex (bool, default False) – If True, additionally compute the Abadie-Zhou-style stacked predictor matrix X = [Y_pre; covariates^T] plus its population-weighted aggregate, surfacing the inputs that the MAREX-family experimental-design estimators consume.

Returns:

Dict[str, Any] – A dictionary containing the prepared data. Always populates the backward-compatible keys (treated_unit_name, Ywide, y, donor_names, donor_matrix, total_periods, pre_periods, post_periods, time_labels in the single-treated case; Ywide, cohorts, time_labels in the cohort case). When the opt-in kwargs above are non-default, additionally populates:

  • covariate_matrix – shape (N, M)

  • covariate_names – tuple of str

  • covariate_means – shape (M,)

  • covariate_scales – shape (M,)

  • donor_covariate_matrix – shape (n_donors, M); single-treated case only

  • treated_covariate_matrix – shape (1, M); single-treated case only

  • predictor_matrix – shape (T_pre + M, N); when marex=True

  • predictor_block_sizes – (T_pre, M); when marex=True

  • f_weights – shape (N,); when marex=True

  • X_population_mean – shape (T_pre + M,); when marex=True

Raises:

MlsynthDataError

  • If no donor units are found after pivoting and selecting for a single treated unit case.

  • If there are zero pre-treatment periods for a single treated unit case.

  • If a covariate column name is not present in df.columns.

  • If covariate_aggregation is unknown.
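For the single-treated case, the long-to-wide pivot and the treated/donor split described above might look roughly like the following pandas sketch. It does not call dataprep: the column names ("unit", "year", "y", "treated") are hypothetical, the wide layout (time periods as rows, units as columns) is an assumption, and counting untreated rows to get the pre-period length assumes treatment is sustained once it starts.

```python
import pandas as pd

long_df = pd.DataFrame({
    "unit": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "year": [2000, 2001, 2002, 2003] * 3,
    "y": [1.0, 2.0, 3.0, 4.0, 2.0, 3.0, 4.0, 5.0, 0.5, 1.5, 2.5, 3.5],
    "treated": [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
})

# Wide matrices: one row per time period, one column per unit.
y_wide = long_df.pivot(index="year", columns="unit", values="y")
treat_wide = long_df.pivot(index="year", columns="unit", values="treated")

# The treated unit is the only column whose indicator is ever 1.
treated_cols = [c for c in treat_wide.columns if treat_wide[c].sum() > 0]
treated_name = treated_cols[0]

# Pre-treatment periods are the rows where the treated unit's indicator is 0.
pre_periods = int((treat_wide[treated_name] == 0).sum())
post_periods = len(y_wide) - pre_periods

y_treated = y_wide[treated_name].to_numpy()
donor_names = [c for c in y_wide.columns if c != treated_name]
donor_matrix = y_wide[donor_names].to_numpy()
```

With this toy panel, unit A is treated from 2002 onward, so there are two pre-treatment and two post-treatment periods, and B and C form the donor pool.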

mlsynth.utils.datautils.logictreat(treatment_matrix: ndarray) -> Dict[str, Any]

Analyze a treatment matrix to determine treatment timings and unit counts.

Identifies the pre-treatment and post-treatment periods for single or multiple treated units. For multiple treated units, it determines the first treatment period for each.

Parameters:

treatment_matrix (np.ndarray) – A 2D NumPy array where rows represent time periods and columns represent units. A value of 1 indicates treatment, 0 otherwise. Shape (n_periods, n_units).

Returns:

Dict[str, Any] – A dictionary containing treatment analysis results. The keys and their meanings depend on whether a single or multiple treated units are detected:

If a single treated unit is found:

  • "Num Treated Units" (int) – Always 1.

  • "Post Periods" (int) – Number of post-treatment periods for the treated unit.

  • "Treated Index" (np.ndarray) – 1D array containing the column index of the treated unit. Shape (1,).

  • "Pre Periods" (int) – Number of pre-treatment periods for the treated unit.

  • "Total Periods" (int) – Total number of time periods for the treated unit.

If multiple treated units are found:

  • "Num Treated Units" (int) – Number of unique treated units detected.

  • "Treated Index" (np.ndarray) – 1D array containing the column indices of all treated units. Shape (num_treated_units,).

  • "First Treat Periods" (np.ndarray) – 1D array, where each element is the first treatment period (0-indexed) for the corresponding treated unit in "Treated Index". Shape (num_treated_units,).

  • "Pre Periods by Unit" (np.ndarray) – 1D array, number of pre-treatment periods for each treated unit. Shape (num_treated_units,).

  • "Post Periods by Unit" (np.ndarray) – 1D array, number of post-treatment periods for each treated unit. Shape (num_treated_units,).

  • "Total Periods" (int) – Total number of time periods in the input treatment_matrix.

Raises:

MlsynthDataError

  • If treatment_matrix is not a NumPy array.

  • If no treated units are found (zero treated observations).

  • If a treated unit has no post-treatment period.

  • If treatment is not sustained for a treated unit (i.e., a 0 appears after a 1 in its treatment vector).
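The timing logic for the multiple-treated case can be sketched in NumPy as follows. This is an illustration, not the library's code: the helper name treatment_timing is hypothetical, it raises ValueError instead of MlsynthDataError, it returns only a subset of the documented keys, and it assumes the number of pre-treatment periods equals the 0-indexed first treatment period.

```python
import numpy as np


def treatment_timing(treatment_matrix: np.ndarray) -> dict:
    """Hypothetical helper: first treatment period and period counts per treated unit."""
    treated_index = np.flatnonzero(treatment_matrix.any(axis=0))
    if treated_index.size == 0:
        raise ValueError("No treated units found.")
    treated = treatment_matrix[:, treated_index].astype(int)
    # Sustained treatment: once a column switches to 1 it never returns to 0.
    if (np.diff(treated, axis=0) < 0).any():
        raise ValueError("Treatment is not sustained.")
    first = treated.argmax(axis=0)  # row index of the first 1 in each column
    total = treatment_matrix.shape[0]
    return {
        "Treated Index": treated_index,
        "First Treat Periods": first,
        "Pre Periods by Unit": first,
        "Post Periods by Unit": total - first,
        "Total Periods": total,
    }


# Rows are time periods, columns are units; units 1 and 2 are treated.
D = np.array([
    [0, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
    [0, 1, 1],
])
result = treatment_timing(D)

try:
    treatment_timing(np.array([[0], [1], [0]]))  # treatment switches off
    rejected = False
except ValueError:
    rejected = True
```

In this example unit 1 is first treated in period 1 and unit 2 in period 2, leaving three and two post-treatment periods respectively out of four in total.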

mlsynth.utils.datautils.proxy_dataprep(df: DataFrame, surrogate_units: List[Any], proxy_variable_column_names_map: Dict[str, List[str]], unit_id_column_name: str = 'Artist', time_period_column_name: str = 'Date', num_total_periods: int | None = None) -> Tuple[ndarray, ndarray]

Construct surrogate and surrogate-proxy matrices from long-form data.

This function takes a long-format DataFrame and pivots it to create two wide-format matrices for a specified set of surrogate units:

  1. A surrogate matrix (often denoted as X in models like ProximalSC), derived from variables specified by proxy_variable_column_names_map['donorproxies'].

  2. A surrogate-proxy matrix (often denoted Z1), derived from variables specified by proxy_variable_column_names_map['surrogatevars'].

The function assumes that proxy_variable_column_names_map['donorproxies'] and proxy_variable_column_names_map['surrogatevars'] each contain the name of a single column in df to be used for constructing these matrices.

Parameters:
  • df (pd.DataFrame) – The input panel data in long format. Must contain columns for unit identifiers, time periods, and the variables specified in proxy_variable_column_names_map.

  • surrogate_units (List[Any]) – A list of unit identifiers (matching values in unit_id_column_name) that will form the columns of the output matrices.

  • proxy_variable_column_names_map (Dict[str, List[str]]) – A dictionary specifying the variables to use for constructing the matrices. It must contain two keys:

    • 'donorproxies': A list containing the name of the column in df to use for the surrogate matrix. (Formerly proxy_vars)

    • 'surrogatevars': A list containing the name of the column in df to use for the surrogate-proxy matrix.

  • unit_id_column_name (str, default "Artist") – Name of the column in df that identifies unique units. (Formerly id_col)

  • time_period_column_name (str, default "Date") – Name of the column in df that identifies time periods. This column will form the index of the pivoted DataFrames before conversion to NumPy arrays. (Formerly time_col)

  • num_total_periods (Optional[int], default None) – Total number of time periods. This parameter is not explicitly used in the current implementation’s logic but is included for potential API consistency or future extensions. (Formerly T)

Returns:

  • surrogate_matrix (np.ndarray) – The constructed surrogate matrix (X). Shape (n_time_periods, n_surrogate_units).

  • surrogate_proxy_matrix (np.ndarray) – The constructed surrogate-proxy matrix (Z1). Shape (n_time_periods, n_surrogate_units).

Raises:

KeyError

  • If proxy_variable_column_names_map does not contain 'donorproxies' or 'surrogatevars', or if the lists associated with these keys are empty.
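The pivot described above can be approximated directly in pandas. The sketch below does not call proxy_dataprep: the variable columns "streams" and "followers" are hypothetical, and ordering the output columns by surrogate_units is an assumption about the layout rather than documented behavior.

```python
import pandas as pd

long_df = pd.DataFrame({
    "Artist": ["a", "a", "b", "b", "c", "c"],
    "Date": [1, 2, 1, 2, 1, 2],
    "streams": [10.0, 11.0, 20.0, 21.0, 30.0, 31.0],
    "followers": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})
surrogate_units = ["a", "c"]
column_map = {"donorproxies": ["streams"], "surrogatevars": ["followers"]}

# Keep only the surrogate units, then pivot each variable to time x unit.
subset = long_df[long_df["Artist"].isin(surrogate_units)]
surrogate_matrix = subset.pivot(
    index="Date", columns="Artist", values=column_map["donorproxies"][0]
)[surrogate_units].to_numpy()
surrogate_proxy_matrix = subset.pivot(
    index="Date", columns="Artist", values=column_map["surrogatevars"][0]
)[surrogate_units].to_numpy()
```

Both outputs have shape (n_time_periods, n_surrogate_units); unit "b" is dropped because it is not listed in surrogate_units.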