Data Utilities
- mlsynth.utils.datautils.balance(df: DataFrame, unit_id_column_name: str, time_period_column_name: str) -> None
Check if the panel is strongly balanced.
A strongly balanced panel means every unit has an observation for every time period, and there are no duplicate unit-time observations.
- Parameters:
df (pd.DataFrame) – The input panel data. Must contain columns specified by unit_id_column_name and time_period_column_name.
unit_id_column_name (str) – The name of the column in df that identifies the units. (Formerly unit_col)
time_period_column_name (str) – The name of the column in df that identifies the time periods. (Formerly time_col)
- Returns:
None – This function does not return a value but raises an error if the panel is not strongly balanced or contains duplicates.
- Raises:
MlsynthDataError – If duplicate unit-time observations are found, or if the panel is not strongly balanced (i.e., not all units have observations for all time periods).
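The two conditions above (no duplicate unit-time rows, and every unit observed in every period) can be sketched in plain pandas. This is a minimal illustration, not the library's implementation: the helper name `is_strongly_balanced` and the toy column names are hypothetical, and the sketch returns a boolean where `balance` raises `MlsynthDataError`.

```python
import pandas as pd


def is_strongly_balanced(df: pd.DataFrame, unit_col: str, time_col: str) -> bool:
    """Return True when every unit appears exactly once in every time period."""
    # Duplicate unit-time rows violate the panel structure outright.
    if df.duplicated(subset=[unit_col, time_col]).any():
        return False
    # With no duplicates, balance means n_rows == n_units * n_periods.
    n_units = df[unit_col].nunique()
    n_periods = df[time_col].nunique()
    return len(df) == n_units * n_periods


panel = pd.DataFrame(
    {
        "unit": ["A", "A", "B", "B"],
        "year": [2000, 2001, 2000, 2001],
        "y": [1.0, 2.0, 3.0, 4.0],
    }
)
balanced = is_strongly_balanced(panel, "unit", "year")
# Dropping the last row leaves unit B without an observation for 2001.
unbalanced = is_strongly_balanced(panel.iloc[:-1], "unit", "year")
```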
- mlsynth.utils.datautils.build_covariate_matrix(df: DataFrame, unit_id_column_name: str, time_period_column_name: str, covariates: List[str], pre_periods: int, unit_order: List[Any], *, aggregation: str = 'pre_mean', normalize: bool = True) -> Tuple[ndarray, Tuple[str, ...], ndarray, ndarray]
Per-unit covariate matrix aligned to unit_order.
Aggregates each covariate column into a single value per unit. The default "pre_mean" aggregation takes the mean over the first pre_periods rows; time-invariant covariates collapse to their constant value.
With normalize=True, each column is centered by its cross-unit mean and scaled by its cross-unit std (computed across the units in unit_order) so all covariates are unit-free, which is the standard pre-step before applying SCM-style predictor weights or computing standardized mean differences for balance diagnostics.
- Returns:
cov_matrix (np.ndarray, shape (len(unit_order), len(covariates))) – Per-unit values for each covariate, in the order given by unit_order and covariates.
cov_names (tuple of str) – The covariate names, in the same order as the columns of cov_matrix.
cov_means (np.ndarray, shape (len(covariates),)) – Cross-unit means used for centering (zeros when normalize=False).
cov_scales (np.ndarray, shape (len(covariates),)) – Cross-unit stds used for scaling (ones when normalize=False).
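The aggregation and normalization described above can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: the helper name `pre_mean_covariates` and the toy column names are hypothetical, the standard deviation is taken with `ddof=0` (the documentation does not say which convention is used), and zero-variance columns are left unscaled to keep the division defined.

```python
import numpy as np
import pandas as pd


def pre_mean_covariates(df, unit_col, time_col, covariates, pre_periods,
                        unit_order, normalize=True):
    """Illustrative only: per-unit mean over each unit's first `pre_periods` rows."""
    ordered = df.sort_values([unit_col, time_col])
    # Keep the first `pre_periods` rows of every unit, then average per unit.
    pre = ordered.groupby(unit_col, sort=False).head(pre_periods)
    per_unit = pre.groupby(unit_col)[covariates].mean()
    # Align the rows to the requested unit order.
    matrix = per_unit.loc[unit_order].to_numpy(dtype=float)
    n_cov = len(covariates)
    if not normalize:
        return matrix, tuple(covariates), np.zeros(n_cov), np.ones(n_cov)
    means = matrix.mean(axis=0)
    scales = matrix.std(axis=0)  # assumption: population std (ddof=0)
    # Assumption: leave zero-variance columns unscaled.
    scales = np.where(scales == 0, 1.0, scales)
    return (matrix - means) / scales, tuple(covariates), means, scales


toy = pd.DataFrame(
    {
        "unit": ["A"] * 3 + ["B"] * 3,
        "time": [1, 2, 3] * 2,
        "gdp": [1.0, 3.0, 100.0, 5.0, 7.0, 100.0],
    }
)
cov_matrix, cov_names, cov_means, cov_scales = pre_mean_covariates(
    toy, "unit", "time", ["gdp"], pre_periods=2, unit_order=["B", "A"]
)
```

With two pre-periods, the third observation of each unit is ignored, so the pre-means are 6 for B and 2 for A before centering and scaling.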
- mlsynth.utils.datautils.build_donor_segments(ell_hat, m, T0, n)
- mlsynth.utils.datautils.clean_surrogates2(surrogate_matrix: ndarray, donor_covariates_matrix: ndarray, treated_unit_covariates_matrix: ndarray, num_pre_treatment_periods: int, common_covariates_matrix: ndarray | None = None) -> ndarray
Clean surrogate variables by orthogonalizing them against covariates.
This function adjusts each surrogate variable in surrogate_matrix by removing the linear influence of a set of covariates. The covariates for projection are formed by combining donor_covariates_matrix (donor pool covariates) and treated_unit_covariates_matrix (treated unit covariates), potentially augmented by common_covariates_matrix. The cleaning process involves estimating a linear relationship between each surrogate and the combined covariates using data from the pre-treatment period (num_pre_treatment_periods), and then subtracting this predicted component from the surrogate across all time periods.
- Parameters:
surrogate_matrix (np.ndarray) – Matrix of surrogate variables to be cleaned. Shape (n_periods, n_surrogates).
donor_covariates_matrix (np.ndarray) – Matrix of covariates for the donor pool. Shape (n_periods, n_donor_covariates).
treated_unit_covariates_matrix (np.ndarray) – Matrix of covariates for the treated unit. Shape (n_periods, n_treated_covariates). Typically, n_donor_covariates equals n_treated_covariates.
num_pre_treatment_periods (int) – Number of pre-treatment periods. The linear projection coefficients (projection_coefficients) are estimated using data up to this period.
common_covariates_matrix (Optional[np.ndarray], default None) – Additional common covariates to include in the projection for both donor and treated sides. Shape (n_periods, n_common_covariates).
- Returns:
np.ndarray – The matrix of cleaned surrogate variables, where the influence of the specified covariates has been removed. Shape (n_periods, n_surrogates).
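The cleaning step described above (fit on the pre-treatment rows, subtract the fitted component over all rows) can be sketched with ordinary least squares. This is a hedged illustration, not the function's actual code: the helper name `residualize` is hypothetical, the "combined" covariates are assumed to be a column-wise concatenation of the donor and treated blocks, and no intercept is included.

```python
import numpy as np


def residualize(surrogates: np.ndarray, covariates: np.ndarray, n_pre: int) -> np.ndarray:
    """Remove the pre-period linear fit on `covariates` from `surrogates`."""
    # Estimate projection coefficients on the pre-treatment rows only.
    coef, *_ = np.linalg.lstsq(covariates[:n_pre], surrogates[:n_pre], rcond=None)
    # Subtract the fitted component across all time periods.
    return surrogates - covariates @ coef


rng = np.random.default_rng(0)
n_periods, n_pre = 12, 8
donor_cov = rng.normal(size=(n_periods, 2))
treated_cov = rng.normal(size=(n_periods, 2))
# Assumption: the covariate blocks are combined by stacking columns.
combined = np.hstack([donor_cov, treated_cov])
surrogates = combined @ rng.normal(size=(4, 3)) + 0.1 * rng.normal(size=(n_periods, 3))

cleaned = residualize(surrogates, combined, n_pre)
```

By construction, the cleaned surrogates are orthogonal to the covariates over the pre-treatment rows; nothing guarantees that in the post-treatment rows, since the coefficients are not re-estimated there.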
- mlsynth.utils.datautils.dataprep(df: DataFrame, unit_id_column_name: str, time_period_column_name: str, outcome_column_name: str, treatment_indicator_column_name: str, allow_no_donors: bool = False, *, covariates: List[str] | None = None, covariate_aggregation: Literal['pre_mean'] = 'pre_mean', normalize_covariates: bool = True, marex: bool = False) -> Dict[str, Any]
Prepare data for synthetic control methods.
Pivots a long DataFrame into a wide format required by many synthetic control estimators. It identifies treated and donor units, and separates data into pre-treatment and post-treatment periods. Handles cases with a single treated unit or multiple treated units (by cohorting based on treatment start time).
- Parameters:
df (pd.DataFrame) – The input panel data in long format. Must contain columns for unit identifiers, time periods, outcome variable, and treatment indicator.
unit_id_column_name (str) – Name of the column in df that identifies unique units.
time_period_column_name (str) – Name of the column in df that identifies time periods.
outcome_column_name (str) – Name of the column in df for the outcome variable.
treatment_indicator_column_name (str) – Name of the column in df for the treatment indicator (0 or 1).
allow_no_donors (bool, default False) – If True, do not raise when zero donor units remain after dropping the treated unit.
covariates (list of str, optional) – Column names to fold into a per-unit covariate matrix. When supplied, the returned dict gains covariate_matrix / covariate_names / covariate_means / covariate_scales plus donor_covariate_matrix / treated_covariate_matrix (single treated case). Time-varying covariates are aggregated per the covariate_aggregation mode; time-invariant covariates collapse to their constant value. None (default) leaves the return dict identical to the pre-extension behaviour.
covariate_aggregation ({"pre_mean"}, default "pre_mean") – How to collapse each covariate column to a single value per unit. Currently only "pre_mean" (the pre-treatment mean) is supported.
normalize_covariates (bool, default True) – If True, z-score each covariate column across units (the standard pre-step for SCM-style predictor weights and SMD diagnostics).
marex (bool, default False) – If True, additionally compute the Abadie-Zhou-style stacked predictor matrix X = [Y_pre; covariates^T] plus its population-weighted aggregate, surfacing the inputs that the MAREX-family experimental-design estimators consume.
- Returns:
Dict[str, Any] – A dictionary containing the prepared data. Always populates the backward-compatible keys (treated_unit_name, Ywide, y, donor_names, donor_matrix, total_periods, pre_periods, post_periods, time_labels in the single-treated case; Ywide, cohorts, time_labels in the cohort case). When the opt-in kwargs above are non-default, additionally populates:
- covariate_matrix: (N, M)
- covariate_names: tuple of str
- covariate_means: (M,)
- covariate_scales: (M,)
- donor_covariate_matrix: (n_donors, M), single-treated only
- treated_covariate_matrix: (1, M), single-treated only
- predictor_matrix: (T_pre + M, N), when marex=True
- predictor_block_sizes: (T_pre, M), when marex=True
- f_weights: (N,), when marex=True
- X_population_mean: (T_pre + M,), when marex=True
- Raises:
MlsynthDataError –
- If no donor units are found after pivoting and selecting for a single treated unit case.
- If there are zero pre-treatment periods for a single treated unit case.
- If a covariate column name is not present in df.columns.
- If covariate_aggregation is unknown.
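The long-to-wide pivot and the treated/donor split that dataprep performs in the single-treated case can be sketched in plain pandas. This is a rough illustration only: the toy column names and variable names are hypothetical and do not mirror the keys of the returned dictionary, and the sketch assumes exactly one treated unit with sustained treatment.

```python
import pandas as pd

long_df = pd.DataFrame(
    {
        "unit": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
        "time": [1, 2, 3, 4] * 3,
        "y": [1.0, 2.0, 3.0, 9.0, 1.5, 2.5, 3.5, 4.5, 0.5, 1.5, 2.5, 3.5],
        "treated": [0, 0, 0, 1] + [0] * 8,
    }
)

# Wide matrices: rows are time periods, columns are units.
y_wide = long_df.pivot(index="time", columns="unit", values="y")
d_wide = long_df.pivot(index="time", columns="unit", values="treated")

# The treated unit is the first column with any treatment indicator set.
treated_name = d_wide.any(axis=0).idxmax()
# Pre-treatment periods are the rows before the first treated observation.
pre_periods = int(d_wide[treated_name].to_numpy().argmax())
post_periods = len(d_wide) - pre_periods

# Split the outcomes into the treated series and the donor matrix.
y_treated = y_wide[treated_name].to_numpy()
donor_matrix = y_wide.drop(columns=treated_name).to_numpy()
```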
- mlsynth.utils.datautils.logictreat(treatment_matrix: ndarray) -> Dict[str, Any]
Analyze a treatment matrix to determine treatment timings and unit counts.
Identifies the pre-treatment and post-treatment periods for single or multiple treated units. For multiple treated units, it determines the first treatment period for each.
- Parameters:
treatment_matrix (np.ndarray) – A 2D NumPy array where rows represent time periods and columns represent units. A value of 1 indicates treatment, 0 otherwise. Shape (n_periods, n_units).
- Returns:
Dict[str, Any] – A dictionary containing treatment analysis results. The keys and their meanings depend on whether a single or multiple treated units are detected:
- If a single treated unit is found:
  - "Num Treated Units" (int) – Always 1.
  - "Post Periods" (int) – Number of post-treatment periods for the treated unit.
  - "Treated Index" (np.ndarray) – 1D array containing the column index of the treated unit. Shape (1,).
  - "Pre Periods" (int) – Number of pre-treatment periods for the treated unit.
  - "Total Periods" (int) – Total number of time periods for the treated unit.
- If multiple treated units are found:
  - "Num Treated Units" (int) – Number of unique treated units detected.
  - "Treated Index" (np.ndarray) – 1D array containing the column indices of all treated units. Shape (num_treated_units,).
  - "First Treat Periods" (np.ndarray) – 1D array, where each element is the first treatment period (0-indexed) for the corresponding treated unit in "Treated Index". Shape (num_treated_units,).
  - "Pre Periods by Unit" (np.ndarray) – 1D array, number of pre-treatment periods for each treated unit. Shape (num_treated_units,).
  - "Post Periods by Unit" (np.ndarray) – 1D array, number of post-treatment periods for each treated unit. Shape (num_treated_units,).
  - "Total Periods" (int) – Total number of time periods in the input treatment_matrix.
- Raises:
MlsynthDataError –
- If treatment_matrix is not a NumPy array.
- If no treated units are found (zero treated observations).
- If a treated unit has no post-treatment period.
- If treatment is not sustained for a treated unit (i.e., a 0 appears after a 1 in its treatment vector).
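The timing quantities listed above can be derived from a treatment matrix with a few NumPy operations. The sketch below is an assumption-laden illustration, not the library's implementation: the helper name `treatment_timing` is hypothetical, and it returns a flag for sustained treatment where logictreat raises `MlsynthDataError`.

```python
import numpy as np


def treatment_timing(treatment_matrix: np.ndarray):
    """Illustrative only: treated columns, first treated rows, and period counts."""
    total_periods = treatment_matrix.shape[0]
    # Treated units are the columns with at least one treated observation.
    treated_idx = np.flatnonzero(treatment_matrix.any(axis=0))
    sub = treatment_matrix[:, treated_idx].astype(int)
    # argmax returns the first row holding the maximum, i.e. the first 1.
    first_treat = sub.argmax(axis=0)
    # Sustained treatment: the indicator never drops from 1 back to 0.
    sustained = bool(np.all(np.diff(sub, axis=0) >= 0))
    pre_by_unit = first_treat  # 0-indexed first treated row == number of pre rows
    post_by_unit = total_periods - first_treat
    return treated_idx, first_treat, pre_by_unit, post_by_unit, sustained


# Rows are time periods, columns are units; unit 0 is never treated.
D = np.array(
    [
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 1],
        [0, 1, 1],
        [0, 1, 1],
    ]
)
treated_idx, first_treat, pre_by_unit, post_by_unit, sustained = treatment_timing(D)
```

Here unit 1 is first treated in row 3 and unit 2 in row 2, so their pre-period counts are 3 and 2 and their post-period counts are 2 and 3.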
- mlsynth.utils.datautils.proxy_dataprep(df: DataFrame, surrogate_units: List[Any], proxy_variable_column_names_map: Dict[str, List[str]], unit_id_column_name: str = 'Artist', time_period_column_name: str = 'Date', num_total_periods: int | None = None) -> Tuple[ndarray, ndarray]
Construct surrogate and surrogate-proxy matrices from long-form data.
This function takes a long-format DataFrame and pivots it to create two wide-format matrices for a specified set of surrogate units:
- A surrogate matrix (often denoted X in models like ProximalSC), derived from variables specified by proxy_variable_column_names_map['donorproxies'].
- A surrogate-proxy matrix (often denoted Z1), derived from variables specified by proxy_variable_column_names_map['surrogatevars'].
The function assumes that proxy_variable_column_names_map['donorproxies'] and proxy_variable_column_names_map['surrogatevars'] each contain the name of a single column in df to be used for constructing these matrices.
- Parameters:
df (pd.DataFrame) – The input panel data in long format. Must contain columns for unit identifiers, time periods, and the variables specified in proxy_variable_column_names_map.
surrogate_units (List[Any]) – A list of unit identifiers (matching values in unit_id_column_name) that will form the columns of the output matrices.
proxy_variable_column_names_map (Dict[str, List[str]]) – A dictionary specifying the variables to use for constructing the matrices. It must contain two keys:
- 'donorproxies': A list containing the name of the column in df to use for the surrogate matrix. (Formerly proxy_vars)
- 'surrogatevars': A list containing the name of the column in df to use for the surrogate-proxy matrix.
unit_id_column_name (str, default "Artist") – Name of the column in df that identifies unique units. (Formerly id_col)
time_period_column_name (str, default "Date") – Name of the column in df that identifies time periods. This column forms the index of the pivoted DataFrames before conversion to NumPy arrays. (Formerly time_col)
num_total_periods (Optional[int], default None) – Total number of time periods. Currently unused by the implementation; retained for API consistency and possible future extensions. (Formerly T)
- Returns:
surrogate_matrix (np.ndarray) – The constructed surrogate matrix (X). Shape (n_time_periods, n_surrogate_units).
surrogate_proxy_matrix (np.ndarray) – The constructed surrogate-proxy matrix (Z1). Shape (n_time_periods, n_surrogate_units).
- Raises:
KeyError – If proxy_variable_column_names_map does not contain 'donorproxies' or 'surrogatevars', or if the lists associated with these keys are empty.
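The pivot described above can be sketched in plain pandas. This is a minimal illustration and not the function's actual code: the value columns `streams` and `followers` are hypothetical, and ordering the output columns by `surrogate_units` is an assumption about the intended alignment.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Artist": ["a", "a", "b", "b", "c", "c"],
        "Date": [1, 2, 1, 2, 1, 2],
        "streams": [10.0, 11.0, 20.0, 21.0, 30.0, 31.0],   # hypothetical column
        "followers": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],       # hypothetical column
    }
)
surrogate_units = ["a", "c"]
column_map = {"donorproxies": ["streams"], "surrogatevars": ["followers"]}

# Restrict the panel to the surrogate units before pivoting.
subset = df[df["Artist"].isin(surrogate_units)]


def to_wide(value_col: str):
    """Pivot one value column to shape (n_time_periods, n_surrogate_units)."""
    wide = subset.pivot(index="Date", columns="Artist", values=value_col)
    # Assumption: columns follow the order given in `surrogate_units`.
    return wide[surrogate_units].to_numpy()


# Each map entry is a list; the single column name is its first element.
surrogate_matrix = to_wide(column_map["donorproxies"][0])
surrogate_proxy_matrix = to_wide(column_map["surrogatevars"][0])
```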