Data Utilities

mlsynth.utils.datautils.balance(df: DataFrame, unit_id_column_name: str, time_period_column_name: str) → None

Check if the panel is strongly balanced.

A strongly balanced panel means every unit has an observation for every time period, and there are no duplicate unit-time observations.

Parameters:

df (pd.DataFrame) – The input panel data. Must contain columns specified by unit_id_column_name and time_period_column_name.
unit_id_column_name (str) – The name of the column in df that identifies the units. (Formerly unit_col)
time_period_column_name (str) – The name of the column in df that identifies the time periods. (Formerly time_col)

Returns:

None – This function does not return a value but raises an error if the panel is not strongly balanced or contains duplicates.

Raises:

MlsynthDataError – If duplicate unit-time observations are found, or if the panel is not strongly balanced (i.e., not all units have observations for all time periods).
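
As an illustration of the two conditions above, the following is a minimal sketch of a strong-balance check written with plain pandas. It is not the mlsynth implementation: the helper name check_strongly_balanced, the short column arguments, and the use of ValueError in place of MlsynthDataError are all assumptions made for the example.

```python
import pandas as pd


def check_strongly_balanced(df: pd.DataFrame, unit_col: str, time_col: str) -> None:
    """Hypothetical helper: raise ValueError on duplicates or missing cells."""
    # Duplicate unit-time rows violate the one-observation-per-cell rule.
    if df.duplicated(subset=[unit_col, time_col]).any():
        raise ValueError("Duplicate unit-time observations found.")
    # With no duplicates, a balanced panel has exactly n_units * n_periods rows.
    n_units = df[unit_col].nunique()
    n_periods = df[time_col].nunique()
    if len(df) != n_units * n_periods:
        raise ValueError("Panel is not strongly balanced.")


balanced = pd.DataFrame(
    {"unit": ["A", "A", "B", "B"], "year": [2000, 2001, 2000, 2001]}
)
check_strongly_balanced(balanced, "unit", "year")  # passes silently

unbalanced = balanced.iloc[:3]  # unit B has no 2001 observation
try:
    check_strongly_balanced(unbalanced, "unit", "year")
    unbalanced_error = None
except ValueError as exc:
    unbalanced_error = str(exc)
```

Checking duplicates first matters: only once every unit-time pair is unique does the row count equal the number of filled cells.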

mlsynth.utils.datautils.build_covariate_matrix(df: DataFrame, unit_id_column_name: str, time_period_column_name: str, covariates: List[str], pre_periods: int, unit_order: List[Any], *, aggregation: str = 'pre_mean', normalize: bool = True) → Tuple[ndarray, Tuple[str, ...], ndarray, ndarray]

Per-unit covariate matrix aligned to unit_order.

Aggregates each covariate column into a single value per unit. The default "pre_mean" aggregation takes the mean over the first pre_periods rows; time-invariant covariates collapse to their constant value.

With normalize=True, each column is centered by its cross-unit mean and scaled by its cross-unit standard deviation (both computed across the units in unit_order), so all covariates are unit-free. This is the standard pre-step before applying SCM-style predictor weights or computing standardized mean differences for balance diagnostics.

Returns:

cov_matrix (np.ndarray, shape (len(unit_order), len(covariates))) – Per-unit values for each covariate, in the order given by unit_order and covariates.
cov_names (tuple of str) – The covariate names, in the same order as the columns of cov_matrix.
cov_means (np.ndarray, shape (len(covariates),)) – Cross-unit means used for centering (zeros when normalize=False).
cov_scales (np.ndarray, shape (len(covariates),)) – Cross-unit stds used for scaling (ones when normalize=False).
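
The aggregation and normalization steps can be sketched as follows. This is an illustrative reconstruction, not the library code: the helper name pre_mean_covariates is hypothetical, and the use of the population standard deviation (ddof=0) and the zero-variance guard are assumptions.

```python
import numpy as np
import pandas as pd


def pre_mean_covariates(df, unit_col, time_col, covariates, pre_periods, unit_order):
    """Hypothetical helper: pre-period mean per unit, then z-score across units."""
    rows = []
    for unit in unit_order:
        unit_df = df[df[unit_col] == unit].sort_values(time_col)
        # Mean of each covariate over the unit's first `pre_periods` rows.
        rows.append(unit_df[covariates].iloc[:pre_periods].mean().to_numpy())
    raw = np.vstack(rows).astype(float)
    means = raw.mean(axis=0)
    scales = raw.std(axis=0)  # assumption: population std (ddof=0)
    # Assumption: constant columns get scale 1 to avoid division by zero.
    scales = np.where(scales == 0, 1.0, scales)
    return (raw - means) / scales, tuple(covariates), means, scales


panel = pd.DataFrame(
    {
        "unit": ["A"] * 3 + ["B"] * 3,
        "t": [1, 2, 3] * 2,
        "income": [1.0, 3.0, 100.0, 5.0, 7.0, 100.0],
    }
)
cov_matrix, cov_names, cov_means, cov_scales = pre_mean_covariates(
    panel, "unit", "t", ["income"], pre_periods=2, unit_order=["A", "B"]
)
```

The third period (value 100.0) is ignored because only the first two rows per unit enter the mean.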

mlsynth.utils.datautils.build_donor_segments(ell_hat, m, T0, n)

mlsynth.utils.datautils.clean_surrogates2(surrogate_matrix: ndarray, donor_covariates_matrix: ndarray, treated_unit_covariates_matrix: ndarray, num_pre_treatment_periods: int, common_covariates_matrix: ndarray | None = None) → ndarray

Clean surrogate variables by orthogonalizing them against covariates.

This function adjusts each surrogate variable in surrogate_matrix by removing the linear influence of a set of covariates. The covariates for projection are formed by combining donor_covariates_matrix (donor pool covariates) and treated_unit_covariates_matrix (treated unit covariates), potentially augmented by common_covariates_matrix. The cleaning process involves estimating a linear relationship between each surrogate and the combined covariates using data from the pre-treatment period (num_pre_treatment_periods), and then subtracting this predicted component from the surrogate across all time periods.

Parameters:

surrogate_matrix (np.ndarray) – Matrix of surrogate variables to be cleaned. Shape (n_periods, n_surrogates).
donor_covariates_matrix (np.ndarray) – Matrix of covariates for the donor pool. Shape (n_periods, n_donor_covariates).
treated_unit_covariates_matrix (np.ndarray) – Matrix of covariates for the treated unit. Shape (n_periods, n_treated_covariates). Typically, n_donor_covariates equals n_treated_covariates.
num_pre_treatment_periods (int) – Number of pre-treatment periods. The linear projection coefficients (projection_coefficients) are estimated using data up to this period.
common_covariates_matrix (Optional[np.ndarray], default None) – Additional common covariates to include in the projection for both donor and treated sides. Shape (n_periods, n_common_covariates).

Returns:

np.ndarray – The matrix of cleaned surrogate variables, where the influence of the specified covariates has been removed. Shape (n_periods, n_surrogates).
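
The core idea (fit on the pre-treatment rows, subtract the fitted component over all rows) can be sketched with NumPy least squares. The sketch assumes a single, already-combined covariate matrix; how clean_surrogates2 actually combines the donor, treated, and common covariates is not specified here, and the helper name residualize_on_pre_period is hypothetical.

```python
import numpy as np


def residualize_on_pre_period(surrogates, covariates, n_pre):
    """Hypothetical helper: remove the linear covariate component.

    Coefficients are fitted on the first `n_pre` rows only.
    """
    coef, *_ = np.linalg.lstsq(covariates[:n_pre], surrogates[:n_pre], rcond=None)
    # Subtract the fitted component over every period, pre and post.
    return surrogates - covariates @ coef


rng = np.random.default_rng(0)
n_periods, n_pre = 12, 8
Z = rng.normal(size=(n_periods, 2))  # combined covariates (assumed pre-stacked)
S = Z @ np.array([[1.0], [-2.0]]) + rng.normal(scale=0.1, size=(n_periods, 1))
S_clean = residualize_on_pre_period(S, Z, n_pre)
```

By construction the cleaned surrogates are orthogonal to the covariates over the pre-treatment rows; post-treatment rows are adjusted with the same coefficients but are not forced to be orthogonal.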

mlsynth.utils.datautils.dataprep(df: DataFrame, unit_id_column_name: str, time_period_column_name: str, outcome_column_name: str, treatment_indicator_column_name: str, allow_no_donors: bool = False, *, covariates: List[str] | None = None, covariate_aggregation: Literal['pre_mean'] = 'pre_mean', normalize_covariates: bool = True, marex: bool = False) → Dict[str, Any]

Prepare data for synthetic control methods.

Pivots a long DataFrame into a wide format required by many synthetic control estimators. It identifies treated and donor units, and separates data into pre-treatment and post-treatment periods. Handles cases with a single treated unit or multiple treated units (by cohorting based on treatment start time).
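
For the single-treated case, the pivot-and-split logic described above might look roughly like this in plain pandas. It is a sketch under stated assumptions, not the dataprep source: the column names (unit, year, y, treated) and variable names are invented for the example, and treatment is assumed to stay on once it starts.

```python
import pandas as pd

long_df = pd.DataFrame(
    {
        "unit": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
        "year": [2000, 2001, 2002] * 3,
        "y": [1.0, 2.0, 5.0, 1.5, 2.5, 3.5, 0.5, 1.5, 2.5],
        "treated": [0, 0, 1, 0, 0, 0, 0, 0, 0],
    }
)

# Wide matrices: rows are time periods, columns are units.
wide = long_df.pivot(index="year", columns="unit", values="y")
treat_wide = long_df.pivot(index="year", columns="unit", values="treated")

# The treated unit is the column whose indicator is ever 1.
treated_unit = treat_wide.max(axis=0).idxmax()

# Assumption: treatment is sustained, so the zeros are exactly the pre-periods.
pre_periods = int((treat_wide[treated_unit] == 0).sum())
post_periods = len(wide) - pre_periods

y_treated = wide[treated_unit].to_numpy()
donor_matrix = wide.drop(columns=treated_unit).to_numpy()
```

Here unit A is treated from 2002, leaving two pre-treatment periods and a donor matrix built from units B and C.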

Parameters:

df (pd.DataFrame) – The input panel data in long format. Must contain columns for unit identifiers, time periods, outcome variable, and treatment indicator.
unit_id_column_name (str) – Name of the column in df that identifies unique units.
time_period_column_name (str) – Name of the column in df that identifies time periods.
outcome_column_name (str) – Name of the column in df for the outcome variable.
treatment_indicator_column_name (str) – Name of the column in df for the treatment indicator (0 or 1).
allow_no_donors (bool, default False) – If True, do not raise when zero donor units remain after dropping the treated unit.
covariates (list of str, optional) – Column names to fold into a per-unit covariate matrix. When supplied, the returned dict gains covariate_matrix / covariate_names / covariate_means / covariate_scales plus donor_covariate_matrix / treated_covariate_matrix (single treated case). Time-varying covariates are aggregated per the covariate_aggregation mode; time-invariant covariates collapse to their constant value. None (default) leaves the return dict identical to the pre-extension behaviour.
covariate_aggregation ({“pre_mean”}, default “pre_mean”) – How to collapse each covariate column to a single value per unit. Currently only "pre_mean" (the pre-treatment mean) is supported.
normalize_covariates (bool, default True) – If True, z-score each covariate column across units (the standard pre-step for SCM-style predictor weights and SMD diagnostics).
marex (bool, default False) – If True, additionally compute the Abadie-Zhou-style stacked predictor matrix X = [Y_pre; covariates^T] plus its population-weighted aggregate, surfacing the inputs that the MAREX-family experimental-design estimators consume.

Returns:

Dict[str, Any] – A dictionary containing the prepared data. Always populates the backward-compatible keys (treated_unit_name, Ywide, y, donor_names, donor_matrix, total_periods, pre_periods, post_periods, time_labels in the single-treated case; Ywide, cohorts, time_labels in the cohort case). When the opt-in kwargs above are non-default, additionally populates:

covariate_matrix : (N, M)
covariate_names : tuple of str
covariate_means : (M,)
covariate_scales : (M,)
donor_covariate_matrix : (n_donors, M), single-treated only
treated_covariate_matrix : (1, M), single-treated only
predictor_matrix : (T_pre + M, N), when marex=True
predictor_block_sizes : (T_pre, M), when marex=True
f_weights : (N,), when marex=True
X_population_mean : (T_pre + M,), when marex=True

Raises:

MlsynthDataError –

- If no donor units are found after pivoting and selecting for a single treated unit case.
- If there are zero pre-treatment periods for a single treated unit case.
- If a covariate column name is not present in df.columns.
- If covariate_aggregation is unknown.
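
The stacked predictor matrix produced when marex=True, X = [Y_pre; covariates^T], can be pictured with a few lines of NumPy. This is only a shape-level sketch: the random inputs are placeholders, and the equal population weights used for f_weights are an assumption, since the way dataprep derives them is not described here.

```python
import numpy as np

T_pre, N, M = 4, 3, 2
rng = np.random.default_rng(1)
Y_pre = rng.normal(size=(T_pre, N))         # pre-treatment outcomes, time x unit
covariate_matrix = rng.normal(size=(N, M))  # per-unit covariates, unit x covariate

# Stack outcomes on top of the transposed covariates: one column per unit.
predictor_matrix = np.vstack([Y_pre, covariate_matrix.T])
predictor_block_sizes = (T_pre, M)

# Assumption: equal population weights; the real weights may differ.
f_weights = np.full(N, 1.0 / N)
X_population_mean = predictor_matrix @ f_weights
```

The resulting shapes match the listing above: predictor_matrix is (T_pre + M, N) and X_population_mean is (T_pre + M,).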

mlsynth.utils.datautils.geoex_dataprep(df: DataFrame, unit_id_column_name: str, time_period_column_name: str, outcome_column_name: str, *, post_col: str | None = None) → Dict[str, Any]

Organize a balanced panel into a wide outcome matrix, without knowing who is treated.

dataprep is the workhorse for estimators, but it needs a treatment indicator to split treated/donor units and pre/post periods. Experimental design tools — geo-experiment market selection, pre-experiment power, clustering, exploratory diagnostics — run before anyone is assigned to treatment, so they only need the treatment-agnostic core: a strongly balanced time x unit outcome matrix. geoex_dataprep returns exactly that, sharing column/index conventions with dataprep (Ywide / time_labels) so the two are interchangeable across modules.

Parameters:

df (pd.DataFrame) – The input panel data in long format. Must contain columns for unit identifiers, time periods, and the outcome variable.
unit_id_column_name (str) – Name of the column in df that identifies units (e.g. geos / markets).
time_period_column_name (str) – Name of the column in df that identifies time periods.
outcome_column_name (str) – Name of the column in df for the outcome variable.
post_col (str, optional) – Name of a 0/1 column marking post-treatment time periods (à la MAREX / lexscm). When supplied, the returned panel is restricted to the pre-treatment period so the design reproduces identically whether the caller hands over a full post-treatment panel (flagged by post_col) or a pre-only panel. None (default) treats every period as pre-treatment.

Returns:

Dict[str, Any] – A dictionary with the treatment-agnostic panel. When post_col is given, Ywide / Y / time_labels / n_units / n_periods describe the pre-treatment slice only:

Ywide (pd.DataFrame) – Wide outcome matrix of shape (n_periods, n_units); the index is the sorted time labels and the columns are the unit ids.
Y (np.ndarray) – Ywide as a (n_periods, n_units) float array, the form the correlation / similarity tools consume directly.
unit_names (pd.Index) – Unit ids, i.e. the Ywide column index.
time_labels (pd.Index) – Sorted time labels, i.e. the Ywide row index.
n_units (int) – Number of units (columns).
n_periods (int) – Number of time periods (rows) in the returned panel.
pre_periods (int) – Number of pre-treatment periods (equals n_periods; derived from post_col when given, else the full panel length).
post_col (str or None) – Echo of the post_col argument, so callers can tell whether the panel was sliced to the pre-period.

Raises:

MlsynthDataError –
- If any of the named columns (including post_col) are absent.
- If the panel is not strongly balanced or contains duplicate unit-time observations (delegated to balance()).
- If post_col is undefined (NaN) for any time period.
MlsynthConfigError –
- If post_col marks every period as post-treatment (no pre-period).
- If post_col’s post periods are not a single contiguous block at the end of the panel.
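
The two post_col configuration checks can be illustrated with a small pure-Python sketch. The helper count_pre_periods is hypothetical and raises ValueError where the library would raise MlsynthConfigError; it takes one 0/1 flag per time period, already in time order.

```python
from typing import Sequence


def count_pre_periods(post_flags: Sequence[int]) -> int:
    """Hypothetical helper: number of leading pre-treatment periods."""
    n_pre = sum(1 for flag in post_flags if flag == 0)
    if n_pre == 0:
        raise ValueError("Every period is flagged post-treatment.")
    # A valid layout is all zeros followed by all ones.
    expected = [0] * n_pre + [1] * (len(post_flags) - n_pre)
    if list(post_flags) != expected:
        raise ValueError("Post periods must form one contiguous block at the end.")
    return n_pre


n_pre_full_panel = count_pre_periods([0, 0, 0, 1, 1])  # panel with a post block
n_pre_only_panel = count_pre_periods([0, 0, 0])        # pre-only panel
```

Both calls yield three pre-treatment periods, which is the sense in which a full panel flagged by post_col and a pre-only panel lead to the same design input.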

mlsynth.utils.datautils.logictreat(treatment_matrix: ndarray) → Dict[str, Any]

Analyze a treatment matrix to determine treatment timings and unit counts.

Identifies the pre-treatment and post-treatment periods for single or multiple treated units. For multiple treated units, it determines the first treatment period for each.

Parameters:

treatment_matrix (np.ndarray) – A 2D NumPy array where rows represent time periods and columns represent units. A value of 1 indicates treatment, 0 otherwise. Shape (n_periods, n_units).

Returns:

Dict[str, Any] – A dictionary containing treatment analysis results. The keys and their meanings depend on whether a single or multiple treated units are detected:

If a single treated unit is found:

"Num Treated Units" (int) – Always 1.
"Post Periods" (int) – Number of post-treatment periods for the treated unit.
"Treated Index" (np.ndarray) – 1D array containing the column index of the treated unit. Shape (1,).
"Pre Periods" (int) – Number of pre-treatment periods for the treated unit.
"Total Periods" (int) – Total number of time periods for the treated unit.

If multiple treated units are found:

"Num Treated Units" (int) – Number of unique treated units detected.
"Treated Index" (np.ndarray) – 1D array containing the column indices of all treated units. Shape (num_treated_units,).
"First Treat Periods" (np.ndarray) – 1D array, where each element is the first treatment period (0-indexed) for the corresponding treated unit in "Treated Index". Shape (num_treated_units,).
"Pre Periods by Unit" (np.ndarray) – 1D array, number of pre-treatment periods for each treated unit. Shape (num_treated_units,).
"Post Periods by Unit" (np.ndarray) – 1D array, number of post-treatment periods for each treated unit. Shape (num_treated_units,).
"Total Periods" (int) – Total number of time periods in the input treatment_matrix.

Raises:

MlsynthDataError –
- If treatment_matrix is not a NumPy array.
- If no treated units are found (zero treated observations).
- If a treated unit has no post-treatment period.
- If treatment is not sustained for a treated unit (i.e., a 0 appears after a 1 in its treatment vector).
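
The timing logic can be sketched in NumPy as below. This is an illustrative helper, not logictreat itself: the name first_treatment_periods is hypothetical, it always returns the multi-unit key layout, and it raises ValueError rather than MlsynthDataError.

```python
import numpy as np


def first_treatment_periods(treatment_matrix: np.ndarray) -> dict:
    """Hypothetical helper: treated columns and their first treated row."""
    treated_index = np.where(treatment_matrix.any(axis=0))[0]
    if treated_index.size == 0:
        raise ValueError("No treated units found.")
    treated = treatment_matrix[:, treated_index].astype(int)
    # Sustained treatment means the indicator never drops from 1 back to 0.
    if (np.diff(treated, axis=0) < 0).any():
        raise ValueError("Treatment is not sustained.")
    # argmax returns the first row where the 0/1 indicator equals 1.
    first_treat = treated.argmax(axis=0)
    total = treatment_matrix.shape[0]
    return {
        "Treated Index": treated_index,
        "First Treat Periods": first_treat,
        "Pre Periods by Unit": first_treat,
        "Post Periods by Unit": total - first_treat,
        "Total Periods": total,
    }


D = np.array(
    [
        [0, 0, 0],
        [0, 1, 0],
        [0, 1, 1],
        [0, 1, 1],
    ]
)
info = first_treatment_periods(D)
```

Because periods are 0-indexed, the first treated row index doubles as the count of pre-treatment periods for that unit.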

mlsynth.utils.datautils.proxy_dataprep(df: DataFrame, surrogate_units: List[Any], proxy_variable_column_names_map: Dict[str, List[str]], unit_id_column_name: str = 'Artist', time_period_column_name: str = 'Date', num_total_periods: int | None = None) → Tuple[ndarray, ndarray]

Construct surrogate and surrogate-proxy matrices from long-form data.

This function takes a long-format DataFrame and pivots it to create two wide-format matrices for a specified set of surrogate units:

A surrogate matrix (often denoted as X in models like ProximalSC), derived from variables specified by proxy_variable_column_names_map['donorproxies'].
A surrogate-proxy matrix (often denoted Z1), derived from variables specified by proxy_variable_column_names_map['surrogatevars'].

The function assumes that proxy_variable_column_names_map['donorproxies'] and proxy_variable_column_names_map['surrogatevars'] each contain the name of a single column in df to be used for constructing these matrices.

Parameters:

df (pd.DataFrame) – The input panel data in long format. Must contain columns for unit identifiers, time periods, and the variables specified in proxy_variable_column_names_map.
surrogate_units (List[Any]) – A list of unit identifiers (matching values in unit_id_column_name) that will form the columns of the output matrices.
proxy_variable_column_names_map (Dict[str, List[str]]) – A dictionary specifying the variables to use for constructing the matrices. It must contain two keys:
- 'donorproxies': A list containing the name of the column in df to use for the surrogate matrix. (Formerly proxy_vars)
- 'surrogatevars': A list containing the name of the column in df to use for the surrogate-proxy matrix.
unit_id_column_name (str, default “Artist”) – Name of the column in df that identifies unique units. (Formerly id_col)
time_period_column_name (str, default “Date”) – Name of the column in df that identifies time periods. This column will form the index of the pivoted DataFrames before conversion to NumPy arrays. (Formerly time_col)
num_total_periods (Optional[int], default None) – Total number of time periods. Not used by the current implementation; kept for API consistency and possible future extensions. (Formerly T)

Returns:

surrogate_matrix (np.ndarray) – The constructed surrogate matrix (X). Shape (n_time_periods, n_surrogate_units).
surrogate_proxy_matrix (np.ndarray) – The constructed surrogate-proxy matrix (Z1). Shape (n_time_periods, n_surrogate_units).

Raises:

KeyError –

If proxy_variable_column_names_map does not contain 'donorproxies' or 'surrogatevars', or if the lists associated with these keys are empty.
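
The two pivots can be sketched in plain pandas as follows. This is an illustration rather than the proxy_dataprep source: the value columns streams and followers are invented, the helper pivot_to_matrix is hypothetical, and ordering the output columns by surrogate_units is an assumption.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Artist": ["A", "A", "B", "B", "C", "C"],
        "Date": [1, 2, 1, 2, 1, 2],
        "streams": [10.0, 11.0, 20.0, 21.0, 30.0, 31.0],
        "followers": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)
surrogate_units = ["A", "C"]
column_map = {"donorproxies": ["streams"], "surrogatevars": ["followers"]}

# Keep only the rows that belong to the surrogate units.
subset = df[df["Artist"].isin(surrogate_units)]


def pivot_to_matrix(value_column: str):
    """Hypothetical helper: time x unit matrix for one value column."""
    wide = subset.pivot(index="Date", columns="Artist", values=value_column)
    # Assumption: columns follow the order given in surrogate_units.
    return wide[surrogate_units].to_numpy()


surrogate_matrix = pivot_to_matrix(column_map["donorproxies"][0])
surrogate_proxy_matrix = pivot_to_matrix(column_map["surrogatevars"][0])
```

Both outputs have shape (n_time_periods, n_surrogate_units); unit B is dropped because it is not listed in surrogate_units.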
