Forward-Selected Synthetic Control (FSCM)

When to Use This Estimator

The synthetic control method (SCM) of Abadie, Diamond and Hainmueller [ABADIE2010] builds a treated unit’s counterfactual as a convex combination of donor units. Conventional practice is to start from all available donors and let the simplex weights zero out the irrelevant ones. Cerulli [FSCM] argues this is often suboptimal: the number of initial donors is itself a complexity parameter governing a bias–variance trade-off. A richer donor pool fits the pre-treatment window better in sample, but each extra donor that is only weakly correlated with the treated unit injects variance into the counterfactual out of sample – the synthetic control overfits the pre-period and predicts the post-period worse.

FSCM resolves this by treating the donor count as a tuning parameter chosen by out-of-sample validation. It is the right tool when you have a single treated unit and a sizeable donor pool and suspect that not all donors deserve to be in the comparison set – when you want SCM to tell you how many and which donors to use, rather than assuming “more is better.” Because the selection is greedy (forward stepwise), the number of fits grows only quadratically in the pool size (\(1 + N_0(N_0+1)/2\) for a pool of \(N_0\) donors), unlike the \(2^{N_0}\) fits of an exhaustive subset search.

The donor (and predictor) weights are computed by the bilevel optimization of Malo, Eskelinen, Zhou and Kuosmanen [malo2023computing], implemented from scratch in mlsynth.utils.fscm_helpers.bilevel – no external QP solver is used. Two switches control the estimator:

  • forward_selection (default True) – when True, run the greedy forward selection with rolling-origin out-of-sample validation, fitting each candidate donor set with the bilevel solver. When False, skip selection entirely and take the full bilevel solve over all donors (the global SCM optimum), reporting the donors that carry weight.

  • covariates / match_periods – when given, the estimator runs in predictor mode (Abadie’s predictor matching with a bilevel-optimized predictor-weight matrix \(\mathbf{V}\)); when omitted, in trajectory mode (matching the pre-treatment outcome path). The four combinations all run.
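As a rough sketch of how the two switches combine, the snippet below assembles the four configuration dictionaries. It uses only keys that appear on this page (forward_selection, covariates, match_periods) on top of the common fields; the helper make_config and the base dict are hypothetical names for illustration, and a real run would also need "df" before passing a dict to FSCM(config).fit().

```python
from itertools import product

# Common fields; a real run also needs "df" (the long panel DataFrame).
base = {"outcome": "cigsale", "treat": "treated", "unitid": "state", "time": "year"}

# Predictor specification taken from the Proposition 99 example on this page.
predictors = {"covariates": ["lnincome", "beer", "age15to24", "retprice"],
              "match_periods": [1975, 1980, 1988]}

def make_config(forward_selection, predictor_mode):
    """Hypothetical helper: assemble one of the four FSCM configurations."""
    cfg = dict(base, forward_selection=forward_selection)
    if predictor_mode:
        cfg.update(predictors)   # predictor mode: covariates and/or match_periods given
    return cfg                   # otherwise trajectory mode

configs = {(fs, pm): make_config(fs, pm)
           for fs, pm in product([True, False], repeat=2)}

for (fs, pm), cfg in configs.items():
    mode = "predictor" if pm else "trajectory"
    print(f"forward_selection={fs!s:5}  mode={mode:10}  keys={sorted(cfg)}")
```

Keeping the switches in plain dictionaries makes it easy to loop over all four variants on the same panel and compare their diagnostics.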

Notation

Let \(j = 1\) denote the treated unit, with all units \(\mathcal{N} \coloneqq \{1, \dots, N\}\) and donor pool \(\mathcal{N}_0 \coloneqq \mathcal{N} \setminus \{1\}\) of cardinality \(N_0\). Time runs over \(t \in \mathcal{T} \coloneqq \{1, \dots, T\}\), 1-indexed; the intervention takes effect after period \(T_0\), splitting \(\mathcal{T}\) into the pre-period \(\mathcal{T}_1 \coloneqq \{t \in \mathcal{T} : t \le T_0\}\) (of length \(T_0\)) and the post-period \(\mathcal{T}_2 \coloneqq \{t \in \mathcal{T} : t > T_0\}\).

The treated series is \(\mathbf{y}_1\) with scalar outcomes \(y_{1t}\), and each donor \(j \in \mathcal{N}_0\) contributes a series \(\mathbf{y}_j\), stacked into the donor matrix \(\mathbf{Y}_0 \coloneqq [\mathbf{y}_j]_{j \in \mathcal{N}_0} \in \mathbb{R}^{T \times N_0}\) (one column per donor). For a donor subset \(U \subseteq \mathcal{N}_0\), the simplex weights solve

\[\mathbf{w}^\ast(U) = \operatorname*{argmin}_{\mathbf{w}\in\Delta_U} \sum_{t\in\mathcal{S}} \bigl(y_{1t} - \mathbf{Y}_{0,t,U}\,\mathbf{w}\bigr)^2, \qquad \Delta_U \coloneqq \Bigl\{\mathbf{w}\ge 0 : \textstyle\sum_{j\in U}w_j = 1\Bigr\},\]

fit over a window \(\mathcal{S}\), and the root-mean-square prediction error over an evaluation window \(\mathcal{E}\) is \(\mathrm{RMSPE}_{\mathcal{E}}(U) = \sqrt{|\mathcal{E}|^{-1}\sum_{t\in\mathcal{E}} (y_{1t} - \mathbf{Y}_{0,t,U}\mathbf{w}^\ast(U))^2}\). The synthetic counterfactual is \(\widehat{\mathbf{y}}_1\) with entries \(\widehat{y}_{1t}\), the per-period effect is \(\tau_t \coloneqq y_{1t} - \widehat{y}_{1t}\), and the ATT is \(\widehat{\tau} \coloneqq |\mathcal{T}_2|^{-1} \sum_{t \in \mathcal{T}_2} \tau_t\). The significance level is \(\alpha\).
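The RMSPE and ATT definitions above translate directly into NumPy. This is a minimal sketch on a toy panel, not mlsynth code: the function names rmspe and att are illustrative, and rows are 0-indexed, so the pre-period is the first T0 rows.

```python
import numpy as np

def rmspe(y1, Y0, w, window):
    """Root-mean-square prediction error of Y0 @ w against y1 over `window` (row indices)."""
    resid = y1[window] - Y0[window] @ w
    return float(np.sqrt(np.mean(resid ** 2)))

def att(y1, Y0, w, T0):
    """Mean post-period gap tau_t = y_1t - yhat_1t; the first T0 rows are the pre-period."""
    gap = y1 - Y0 @ w
    return float(gap[T0:].mean())

# Toy panel: T = 6 periods, 2 donors, T0 = 4. The treated unit is an exact 50/50 mix
# of the donors before treatment and sits 3 units lower afterwards.
Y0 = np.array([[1., 3.], [2., 4.], [3., 5.], [4., 6.], [5., 7.], [6., 8.]])
w = np.array([0.5, 0.5])
y1 = Y0 @ w
y1[4:] -= 3.0

pre_fit = rmspe(y1, Y0, w, np.arange(4))
effect = att(y1, Y0, w, T0=4)
print(pre_fit, effect)  # 0.0 -3.0
```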

Computing the weights: bilevel optimization

With predictors, SCM jointly chooses predictor weights \(\mathbf{V}\) (a \(K\times K\) non-negative diagonal matrix on the simplex) and donor weights \(\mathbf{w}\). Malo et al. [malo2023computing] show this is an optimistic bilevel program: the upper level fits the outcome, the lower level fits the \(\mathbf{V}\)-weighted predictors,

\[\min_{\mathbf{V},\mathbf{w}} \; \tfrac{1}{T_0}\bigl\|\mathbf{y}_{1,\mathcal{T}_1} - \mathbf{Y}_{0,\mathcal{T}_1}\mathbf{w}\bigr\|_2^2 \quad\text{s.t.}\quad \mathbf{w} \in \operatorname*{argmin}_{\mathbf{w}\in\Delta} \bigl\|\mathbf{x}_1 - \mathbf{X}_0\mathbf{w}\bigr\|_{\mathbf{V}}^2 ,\]

which is NP-hard in general and is the reason off-the-shelf SCM packages can be numerically unstable. mlsynth implements the paper’s globally-convergent iterative algorithm in three stages, short-circuiting as soon as an optimum is certified (the paper notes the optimum is usually a corner found early):

  1. Unconstrained feasibility (Section 3.1) – solve the simplex regression of the treated outcome on the donors, giving the lower bound \(L(\mathbf{w})\) on the upper-level loss; an LP over \(\mathbf{V}\) checks whether some predictor is already matched, which certifies optimality.

  2. Corner solutions (Section 3.2) – evaluate the \(K\) basic predictor weightings (all weight on one predictor) and keep the best by outcome loss.

  3. Tykhonov-regularized descent (Section 3.3) – only if a gap remains, descend over \(\mathbf{V}\) for a vanishing regularization sequence.
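To make the corner stage concrete, the weighted norm in the lower level can be written out. The expansion below is the standard convention for a diagonal \(\mathbf{V} = \operatorname{diag}(v_1, \dots, v_K)\); the entry notation \(x_{1k}\) and \(x_{jk}\) (predictor \(k\) for the treated unit and for donor \(j\)) is introduced here for illustration and is an assumption about the notation, not taken from the source:

\[\bigl\|\mathbf{x}_1 - \mathbf{X}_0\mathbf{w}\bigr\|_{\mathbf{V}}^2 = (\mathbf{x}_1 - \mathbf{X}_0\mathbf{w})^\top \mathbf{V}\,(\mathbf{x}_1 - \mathbf{X}_0\mathbf{w}) = \sum_{k=1}^{K} v_k \Bigl(x_{1k} - \sum_{j\in\mathcal{N}_0} x_{jk}\, w_j\Bigr)^2 .\]

At a corner, where \(v_k = 1\) for one predictor and \(v_{k'} = 0\) for all others, the sum collapses to a single squared term, so the lower level only asks the donors to match that one predictor.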

The lower-level (and the trajectory-mode) simplex problems are solved by a self-contained FISTA projected-gradient routine (simplex_lstsq()), which matches a reference QP solver to ~1e-8. In predictor mode the optimal \(\mathbf{V}\) is computed once on the full donor pool and reused through forward selection.
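For intuition, here is a minimal sketch of a FISTA projected-gradient solver for simplex-constrained least squares. It is an illustrative stand-in, not the code behind simplex_lstsq(): the function names, iteration cap, and tolerance are assumptions, and the projection uses the common sort-based algorithm.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w >= 0, sum(w) = 1} (sort-based algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def simplex_lstsq_sketch(A, b, n_iter=5000, tol=1e-12):
    """Minimise ||b - A w||^2 over the simplex with FISTA (illustrative only)."""
    n = A.shape[1]
    w = np.full(n, 1.0 / n)                        # start at the simplex centre
    z, t = w.copy(), 1.0
    L = 2.0 * np.linalg.norm(A, 2) ** 2 + 1e-12    # Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ z - b)
        w_new = project_simplex(z - grad / L)      # projected gradient step
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = w_new + ((t - 1.0) / t_new) * (w_new - w)   # momentum extrapolation
        if np.linalg.norm(w_new - w) < tol:
            w = w_new
            break
        w, t = w_new, t_new
    return w

# Recover known simplex weights from noiseless synthetic data.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 4))
w_true = np.array([0.6, 0.4, 0.0, 0.0])
b = A @ w_true
w_hat = simplex_lstsq_sketch(A, b)
print(np.round(w_hat, 4))
```

The projection keeps every iterate feasible, so the returned weights are non-negative and sum to one without any external QP solver.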

The forward stepwise algorithm

When forward_selection=True (default), the donor count is chosen by Cerulli’s procedure ([FSCM], Table 1):

  1. Start from the empty model \(U_0 = \varnothing\).

  2. For \(k = 0, 1, \ldots, N_0-1\): among the \(N_0-k\) candidate donors not yet selected, add the one whose inclusion minimizes the in-sample pre-period RMSPE, giving the nested model \(U_{k+1} = U_k \cup \{j^\ast\}\).

  3. For each nested model \(U_k\), compute an out-of-sample validation RMSPE and select \(k^\ast = \operatorname*{argmin}_k \mathrm{CV}(U_k)\).

The selected donor set is \(U_{k^\ast}\); mlsynth then refits the weights on the full pre-period over \(U_{k^\ast}\) to form the counterfactual \(\widehat{y}_{1t} = \mathbf{Y}_{0,t,U_{k^\ast}}\mathbf{w}^\ast\), the gap \(\tau_t = y_{1t} - \widehat{y}_{1t}\), and the ATT \(\widehat{\tau} = |\mathcal{T}_2|^{-1}\sum_{t\in\mathcal{T}_2}\tau_t\).
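The greedy step of the procedure above can be sketched in a few lines of NumPy. This is a simplified illustration in trajectory mode, not mlsynth's engine: fit_simplex is a compact plain projected-gradient stand-in for a real simplex solver, all function names are hypothetical, and only the nested path (steps 1 and 2) is built. Step 3 would then score each nested set out of sample and keep the argmin.

```python
import numpy as np

def fit_simplex(A, b, n_iter=2000):
    """Projected-gradient simplex least squares (compact stand-in for a real solver)."""
    def proj(v):
        u = np.sort(v)[::-1]
        css = np.cumsum(u) - 1.0
        rho = np.nonzero(u - css / np.arange(1, v.size + 1) > 0)[0][-1]
        return np.maximum(v - css[rho] / (rho + 1.0), 0.0)
    w = np.full(A.shape[1], 1.0 / A.shape[1])
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2 + 1e-12)
    for _ in range(n_iter):
        w = proj(w - step * 2.0 * A.T @ (A @ w - b))
    return w

def rmspe(A, b, w):
    return float(np.sqrt(np.mean((b - A @ w) ** 2)))

def forward_path(y_pre, Y_pre, max_donors=None):
    """Greedy nested donor sequence: each step adds the donor minimising in-sample RMSPE."""
    n = Y_pre.shape[1]
    selected, path = [], []
    for _ in range(min(n, max_donors or n)):
        best = None
        for j in range(n):
            if j in selected:
                continue
            cols = selected + [j]
            w = fit_simplex(Y_pre[:, cols], y_pre)
            score = rmspe(Y_pre[:, cols], y_pre, w)
            if best is None or score < best[0]:
                best = (score, j)
        selected.append(best[1])
        path.append((list(selected), best[0]))   # (nested donor set U_k, in-sample RMSPE)
    return path

# Synthetic pre-period: the treated unit is 0.7 * donor 0 + 0.3 * donor 3 plus small noise.
rng = np.random.default_rng(0)
Y_pre = rng.normal(size=(30, 6))
y_pre = 0.7 * Y_pre[:, 0] + 0.3 * Y_pre[:, 3] + 0.01 * rng.normal(size=30)

path = forward_path(y_pre, Y_pre)
for donors, score in path:
    print(donors, round(score, 4))
```

On this synthetic panel the in-sample RMSPE stops improving meaningfully once the two true donors have entered, which is the pattern an out-of-sample score is meant to detect.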

When forward_selection=False the selection and cross-validation are skipped: the estimator returns the single full bilevel solve over all donors (the global SCM optimum), reporting the weight-bearing donors. This is faster and is the right choice when you want the canonical SCM weights rather than a parsimonious donor subset.

Rolling-origin cross-validation. Cerulli’s paper splits the pre-period once (early half train, late half test). With short pre-periods (Proposition 99 has only 19 pre-treatment periods) a single split is noisy, so mlsynth uses an expanding-window, one-step-ahead scheme instead: for each origin \(t\) from \(\lceil T_0\cdot\texttt{cv\_split}\rceil\) to \(T_0-1\), weights are fit on \(\{1,\ldots,t-1\}\) and used to forecast period \(t\); the validation score is the RMSPE of those one-step forecasts. Each of these late pre-treatment periods serves as a test point, which uses the data more efficiently than a single cut.
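The expanding-window score can be sketched as follows. This is an illustration under stated assumptions rather than the library's implementation: the origin range follows the paragraph above literally (1-indexed periods, origins from the ceiling of T0 * cv_split to T0 - 1), fit_simplex is a compact projected-gradient stand-in for a real simplex solver, and the guard requiring at least one training period is an addition of this sketch.

```python
import math
import numpy as np

def fit_simplex(A, b, n_iter=2000):
    """Projected-gradient simplex least squares (compact stand-in for a real solver)."""
    def proj(v):
        u = np.sort(v)[::-1]
        css = np.cumsum(u) - 1.0
        rho = np.nonzero(u - css / np.arange(1, v.size + 1) > 0)[0][-1]
        return np.maximum(v - css[rho] / (rho + 1.0), 0.0)
    w = np.full(A.shape[1], 1.0 / A.shape[1])
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2 + 1e-12)
    for _ in range(n_iter):
        w = proj(w - step * 2.0 * A.T @ (A @ w - b))
    return w

def rolling_origin_cv(y_pre, Y_pre, cols, cv_split=0.5):
    """One-step-ahead expanding-window RMSPE for the donor columns `cols`.

    Periods are 1-indexed as in the text: period t is forecast from weights fit on
    periods 1..t-1, for t = ceil(T0 * cv_split), ..., T0 - 1.
    """
    T0 = y_pre.shape[0]
    first = max(math.ceil(T0 * cv_split), 2)   # sketch-only guard: need a training period
    errors = []
    for t in range(first, T0):                 # t = first, ..., T0 - 1
        train = slice(0, t - 1)                # rows holding periods 1..t-1
        w = fit_simplex(Y_pre[train][:, cols], y_pre[train])
        errors.append(y_pre[t - 1] - Y_pre[t - 1, cols] @ w)
    return float(np.sqrt(np.mean(np.square(errors))))

# Synthetic pre-period of length 19 (the Proposition 99 pre-period length).
rng = np.random.default_rng(0)
T0 = 19
Y_pre = rng.normal(size=(T0, 6))
y_pre = 0.7 * Y_pre[:, 0] + 0.3 * Y_pre[:, 3] + 0.01 * rng.normal(size=T0)

cv_small = rolling_origin_cv(y_pre, Y_pre, cols=[0, 3])
cv_full = rolling_origin_cv(y_pre, Y_pre, cols=list(range(6)))
print(f"CV RMSPE  two donors={cv_small:.4f}  full pool={cv_full:.4f}")
```

With cv_split=0.5 and T0=19 the loop produces nine one-step forecasts (periods 10 through 18), each from weights that never saw the period being predicted.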

Assumptions (forward convex hull + selection consistency)

FSCM relaxes canonical SCM’s “the full donor pool must contain a convex-hull match for the treated unit” to the much weaker “some subset does,” and the forward-stepwise selector is the device that finds it. The four assumptions that make the selector behave:

A1 (forward convex-hull condition – the identifying premise). There exists a non-empty subset \(U^\ast \subseteq \mathcal{N}_0\) and simplex weights \(\mathbf{w}^\ast \in \Delta_{U^\ast}\) such that the treated unit’s pre-period trajectory is (approximately) reproduced by the corresponding donor combination,

\[y_{1t} \;\approx\; \sum_{j \in U^\ast} w_j^\ast\, y_{jt} \quad \text{for all } t \in \mathcal{T}_1.\]

Remark. The classical SCM hull condition is the special case \(U^\ast = \mathcal{N}_0\); FSCM operates whenever some subset (potentially a small one) supplies the hull. If no subset of controls can form a convex hull around the target, FSCM cannot be used, and no amount of forward stepwise selection will rescue it. The diagnostic is the lower envelope of in-sample RMSPE across the nested models \(\{U_k\}\): if it never falls below the noise floor at any size, A1 is failing.

A2 (stable pre/post relationship). The weights that reproduce the treated unit on the pre-period also reproduce its untreated trajectory on the post-period – the standard SCM identification premise carried through to the selected subset. Remark. This is what licenses using pre-period out-of-sample fit (the cross-validation test window) as a stand-in for post-period predictive accuracy of the counterfactual, which is never observed.

A3 (forward-stepwise selection consistency). Under regularity conditions on the donor pool and the pre-period length (Shi & Huang 2023, Theorem 1) [fsPDA], the forward stepwise selection rule recovers the oracle donor set with probability approaching one (wpa1):

\[\mathbb{P}\bigl( U_{k^\ast} = U^\ast \bigr) \;\longrightarrow\; 1 \qquad \text{as } T_0 \to \infty.\]

Remark. This is the wpa1 selection-consistency property that distinguishes FSCM from heuristic donor pre-screening: greedy forward steps are not just computationally tractable (\(1 + N_0(N_0+1)/2\) fits vs. the exhaustive \(2^{N_0}\)), they asymptotically pick the right subset. The regularity conditions follow Shi & Huang and assume bounded donor signals, mixing shocks within units, and a signal-strength condition on the oracle weights (no donor in \(U^\ast\) carries vanishing weight in \(\mathbf{w}^\ast\)).
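The fit counts quoted in the remark are easy to check numerically. The short sketch below simply evaluates the two formulas for the 38-donor Proposition 99 pool; the function names are illustrative.

```python
def forward_fits(n0):
    """Fits used by greedy forward selection: 1 + n0 * (n0 + 1) / 2."""
    return 1 + n0 * (n0 + 1) // 2

def exhaustive_fits(n0):
    """Subsets visited by an exhaustive search: 2 ** n0."""
    return 2 ** n0

n0 = 38  # Proposition 99 donor pool
print(forward_fits(n0))      # 742
print(exhaustive_fits(n0))   # 274877906944
```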

A4 (informative cross-validation split). The pre-period is long enough, and the donor / treated dynamics stable enough across the split, that the test-RMSPE on the held-out interval is a consistent estimate of out-of-sample prediction error. Remark. mlsynth’s expanding-window scheme expects the late pre-period to resemble the post-period in distribution; if the panel has a structural break inside the pre-window (a global financial crisis, a regime change), this assumption fails and the selected \(k^\ast\) will reflect the break, not the donor pool.

Empirical Illustration: California’s Proposition 99

Following Cerulli’s application – the canonical Abadie [ABADIE2010] study of California’s 1988 tobacco-control program – FSCM runs on P99data.csv (per-capita cigarette sales, 38 control states, 1970–2000). The treated indicator is California from 1989 onward.

import pandas as pd
from mlsynth import FSCM

url = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/refs/heads/main/basedata/P99data.csv"
df = pd.read_csv(url)
df["treated"] = ((df["state"] == "California") & (df["year"] >= 1989)).astype(int)

res = FSCM({"df": df, "outcome": "cigsale", "treat": "treated",
            "unitid": "state", "time": "year", "display_graphs": True}).fit()

print(f"optimal donors : {res.n_selected} of {res.diagnostics['n_donors_available']}")
print(f"selected       : {res.selected_donors}")
print(f"ATT            : {res.att:.2f}")
print(f"pre-period R^2 : {res.diagnostics['pre_r_squared']:.3f}")
print(f"CV RMSPE       : optimum={res.diagnostics['cv_rmspe_at_optimum']:.3f}"
      f"  full pool={res.diagnostics['cv_rmspe_full_pool']:.3f}")

This prints:

optimal donors : 3 of 38
selected       : ['Montana', 'Nevada', 'Utah']
ATT            : -20.15
pre-period R^2 : 0.970
CV RMSPE       : optimum=1.605  full pool=2.916

Forward selection keeps 3 of the 38 donors and estimates a drop of about 20 packs per capita, consistent with Abadie’s original synthetic-control estimate. The key diagnostic is the last line: the rolling-origin CV RMSPE at the optimum (1.61) is far below the value obtained with all 38 donors (2.92). This is the out-of-sample evidence for Cerulli’s bias–variance argument: the smaller donor set forecasts the treated unit better than the full pool. With display_graphs=True the second panel plots the CV-RMSPE curve against the number of donors (the paper’s Fig. 2–3), with the selected count marked.
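The same curve can be read off numerically from the selection path. The helper below is a hypothetical convenience, not part of mlsynth; it relies only on the FSCMSelectionPath field names documented on this page (sizes, order, train_rmspe, test_rmspe, optimal_size) and assumes they are aligned element by element. The demo object holds made-up placeholder numbers, not output from a real fit.

```python
from types import SimpleNamespace

def format_selection_path(path):
    """Render a selection path as text rows, marking the CV-optimal size."""
    rows = []
    for size, donor, tr, te in zip(path.sizes, path.order,
                                   path.train_rmspe, path.test_rmspe):
        mark = " <- selected" if size == path.optimal_size else ""
        rows.append(f"k={size:2d}  +{donor!s:10}  train={tr:6.3f}  cv={te:6.3f}{mark}")
    return rows

# Stand-in object with made-up numbers (NOT output from mlsynth); with a real fit,
# pass res.selection_path instead (it is None when forward_selection=False).
demo = SimpleNamespace(sizes=[1, 2, 3], order=["A", "B", "C"],
                       train_rmspe=[3.0, 2.0, 1.5], test_rmspe=[2.5, 1.8, 2.1],
                       optimal_size=2)
print("\n".join(format_selection_path(demo)))
```

In the placeholder data the in-sample column keeps falling while the CV column turns back up after the second donor, which is the shape that leads the selector to stop early.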

Matching on the author’s full predictor specification

P99data.csv ships with Abadie’s predictor specification: the covariates lnincome, beer, age15to24, retprice and cigarette sales in 1975, 1980 and 1988. Abadie averages the covariates over 1980–1988 (beer over 1984–1988); these aggregation windows are given through covariate_windows, and the lagged smoking values through match_periods. With forward_selection=False the estimator returns the full bilevel SCM optimum over all donors:

res = FSCM({"df": df, "outcome": "cigsale", "treat": "treated",
            "unitid": "state", "time": "year", "forward_selection": False,
            "covariates": ["lnincome", "beer", "age15to24", "retprice"],
            "covariate_windows": {"lnincome": (1980, 1988), "age15to24": (1980, 1988),
                                  "retprice": (1980, 1988), "beer": (1984, 1988)},
            "match_periods": [1975, 1980, 1988]}).fit()
print(f"R^2 {res.diagnostics['pre_r_squared']:.4f}  ATT {res.att:.2f}")
print({str(k): round(v, 3) for k, v in res.donor_weights.items() if v > 1e-3})
# R^2 0.9787  ATT -19.68
# {'Utah': 0.386, 'Montana': 0.257, 'Nevada': 0.206, 'Connecticut': 0.107, 'New Hampshire': 0.043}

This reproduces the global optimum reported in Malo et al. [malo2023computing] (their Table 1): \(R^2 = 0.979\) with donor weights concentrated on Utah, Montana, Nevada, Connecticut and New Hampshire – the solution the standard Synth package fails to reach. Consistent with the paper’s central finding, the optimal predictor weights \(\mathbf{V}\) form a corner solution that places all weight on a single predictor: many predictors are interchangeable at the optimum (the upper-level loss is non-unique in \(\mathbf{V}\)), so the bilevel solver certifies optimality at a corner rather than spreading weight across predictors.

Note

With forward_selection=True and predictors, the global \(\mathbf{V}\) is frozen and reused across candidate donor sets. Because that corner \(\mathbf{V}\) may rest on an easily-matched predictor, it can stop constraining the subset fits, so the rolling-origin CV may decline to prune the pool. For predictor matching, forward_selection=False (the full bilevel optimum) is therefore the more reliable choice; forward selection is most useful in trajectory mode.

Verification

Note

Empirical (Proposition 99). In trajectory mode with forward selection, FSCM reproduces Cerulli’s regime: a small donor set (3 of 38) with a rolling-origin CV RMSPE minimized below the full pool – the out-of-sample signature of the bias–variance trade-off – and an ATT \(\approx -20\). With the full Abadie predictor spec and forward_selection=False, the bilevel solver reproduces the global SCM optimum of Malo et al. [malo2023computing] exactly (\(R^2 = 0.979\), Table 1 donor weights), which the Synth package does not reach.

Solver. The self-contained FISTA simplex solver agrees with a reference QP solver to ~1e-8 over random problems; the bilevel unconstrained feasibility certificate, corner-solution bounds, and a determinism check are unit-tested (mlsynth/tests/test_fscm_bilevel.py). All four forward_selection x covariates combinations are exercised in mlsynth/tests/test_fscm.py.

Core API

Forward-Selected Synthetic Control (FSCM).

A thin, NumPy-first orchestration over mlsynth.utils.fscm_helpers. FSCM (Cerulli 2024) treats the number of donors as a complexity parameter governing a bias–variance trade-off. A forward stepwise selection grows a nested donor sequence (greedy on in-sample pre-period RMSPE), and an out-of-sample validation over the late pre-period picks the donor count that minimizes validation RMSPE. Cerulli’s original scheme is a two-interval-time split into a training half and a test half; mlsynth replaces it with the rolling-origin, one-step-ahead scheme described above. The final simplex weights are refit on the full pre-period over the selected donors.

Optional covariate matching (covariates=[...]) augments the SCM objective with the author’s predictor specification; selection and validation scores are always measured on the outcome.

References

Cerulli, G. (2024). Optimal initial donor selection for the synthetic control method. Economics Letters, 244, 111976. https://doi.org/10.1016/j.econlet.2024.111976

class mlsynth.estimators.fscm.FSCM(config: FSCMConfig | dict)

Bases: object

Forward-Selected Synthetic Control estimator.

Parameters:

config (FSCMConfig or dict) – Validated configuration. Beyond the common fields (df, outcome, treat, unitid, time, display_graphs, save, colors), FSCM reads forward_selection (whether to run the greedy selection and cross-validation), covariates (optional predictor columns), covariate_windows (optional per-covariate aggregation windows), match_periods (specific pre-treatment periods whose outcome value is matched directly, as in Abadie’s special predictors), cv_split (fraction of the pre-period that sets the first rolling origin) and max_donors (cap on forward-selection steps).

fit() → FSCMResults

Select donors, fit the synthetic control, and return results.

Returns:

FSCMResults – Container with the optimal donor set, simplex weights, counterfactual, gap, ATT, fit diagnostics, and the forward-selection / cross-validation path.

Configuration

class mlsynth.config_models.FSCMConfig(*, df: ~pandas.DataFrame, outcome: str, treat: str, unitid: str, time: str, display_graphs: bool = True, save: bool | str = False, counterfactual_color: ~typing.List[str] = <factory>, treated_color: str = 'black', plot: ~mlsynth.config_models.PlotConfig = <factory>, forward_selection: bool = True, covariates: ~typing.List[str] | None = None, covariate_windows: ~typing.Dict[str, ~typing.Any] | None = None, match_periods: ~typing.List[~typing.Any] | None = None, cv_split: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Lt(lt=1.0)] = 0.5, max_donors: ~typing.Annotated[int | None, ~annotated_types.Ge(ge=1)] = None)#

Configuration for the Forward-Selected Synthetic Control Method (FSCM).

FSCM grows a nested donor sequence by forward stepwise selection (greedy on in-sample pre-period RMSPE), then chooses the donor count by minimizing out-of-sample RMSPE over the late pre-period. Cerulli’s original scheme is a two-interval-time cross-validation on a held-out test half; mlsynth uses the rolling-origin, one-step-ahead variant, with cv_split setting the first origin. The final simplex weights are refit on the full pre-period over the selected donors.

References

Cerulli, Giovanni. 2024. “Optimal initial donor selection for the synthetic control method.” Economics Letters, 244: 111976. https://doi.org/10.1016/j.econlet.2024.111976

covariate_windows: Dict[str, Any] | None
covariates: List[str] | None
cv_split: float
forward_selection: bool
match_periods: List[Any] | None
max_donors: int | None
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

Result Containers

FSCM.fit() returns an FSCMResults, an EffectResult on the standardized two-family contract: the flat accessors res.att / res.counterfactual / res.gap / res.donor_weights (the full-pool mapping) / res.pre_rmse resolve through the standardized sub-models (effects / time_series / weights / fit_diagnostics). The FSCM-specific detail is carried alongside: the weight-bearing donor set (selected_donors), the raw simplex weight array (weights_vector), the rich fit-diagnostics dict (diagnostics – pre-RMSE, R², donor counts, CV stats), and – when forward selection ran – an FSCMSelectionPath (per-size in-sample and rolling-origin CV RMSPE, and the selection order; None when forward_selection=False). The prepared, NumPy-only panel is exposed as an FSCMInputs, with units and time addressed through an IndexSet.

Note

The raw simplex weight array is res.weights_vector and the rich diagnostics dict is res.diagnostics; the bare names res.weights / res.fit_diagnostics are reserved by the contract for the standardized WeightsResults / FitDiagnosticsResults sub-models.

Frozen, NumPy-first containers for Forward-Selected SCM (FSCM).

FSCM (Cerulli 2024) treats the number of donors as a complexity parameter governing a bias–variance trade-off: a richer donor pool fits the pre-treatment window better in sample but injects variance from poorly correlated donors out of sample. A forward stepwise selection grows a nested donor sequence (greedy on the in-sample fit), and an out-of-sample validation on the late pre-period picks the donor count that minimizes validation RMSPE. Cerulli’s original scheme is a two-interval-time split with a held-out tail; mlsynth uses the rolling-origin, one-step-ahead variant.

Everything below is pure NumPy; units/time are addressed through IndexSet. The only DataFrame touchpoint is setup.

References

Cerulli, G. (2024). Optimal initial donor selection for the synthetic control method. Economics Letters, 244, 111976.

class mlsynth.utils.fscm_helpers.structures.FSCMInputs(unit_index: ~mlsynth.utils.fast_scm_helpers.structure.IndexSet, time_index: ~mlsynth.utils.fast_scm_helpers.structure.IndexSet, y: ~numpy.ndarray, Y: ~numpy.ndarray, T0: int, treated_label: ~typing.Any, cov_treated: ~numpy.ndarray = <factory>, cov_donors: ~numpy.ndarray = <factory>, covariate_names: ~typing.List[str] = <factory>, match_idx: ~numpy.ndarray = <factory>, match_periods: ~typing.List[~typing.Any] = <factory>, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#

Bases: object

Preprocessed, NumPy-only panel for the FSCM engine.

Parameters:
  • unit_index (IndexSet) – All N donor units (column order of Y).

  • time_index (IndexSet) – All T periods (row order of y and Y).

  • y (np.ndarray) – Treated-unit outcome over all periods, shape (T,).

  • Y (np.ndarray) – Donor outcomes, shape (T, N).

  • T0 (int) – Number of pre-treatment periods; post is T2 = T - T0.

  • treated_label (Any) – Identifier of the treated unit.

  • cov_treated (np.ndarray, optional) – Treated covariate predictor values, shape (P,) – each covariate averaged over its (Abadie) aggregation window. Empty if no covariates.

  • cov_donors (np.ndarray, optional) – Donor covariate predictor values, shape (N, P). Empty if none.

  • covariate_names (list of str) – Names of the P covariate columns.

  • match_idx (np.ndarray, optional) – Time-row indices of “special predictor” periods – specific pre-treatment periods whose outcome value is matched directly (e.g. the 1975/1980/1988 cigarette-sales values in Abadie’s spec).

  • match_periods (list) – Labels of those periods, for provenance.

  • metadata (dict) – Free-form provenance.

property T: int
T0: int
property T2: int
Y: ndarray
cov_donors: ndarray
cov_treated: ndarray
covariate_names: List[str]
property donor_labels: ndarray
property has_covariates: bool
property has_match_periods: bool
property has_predictors: bool
match_idx: ndarray
match_periods: List[Any]
metadata: Dict[str, Any]
property n_donors: int
time_index: IndexSet
treated_label: Any
unit_index: IndexSet
y: ndarray

class mlsynth.utils.fscm_helpers.structures.FSCMResults(*, effects: ~mlsynth.config_models.EffectsResults | None = None, fit_diagnostics: ~mlsynth.config_models.FitDiagnosticsResults | None = None, time_series: ~mlsynth.config_models.TimeSeriesResults | None = None, weights: ~mlsynth.config_models.WeightsResults | None = None, inference: ~mlsynth.config_models.InferenceResults | None = None, method_details: ~mlsynth.config_models.MethodDetailsResults | None = None, sub_method_results: ~typing.Dict[str, ~typing.Any] | None = None, additional_outputs: ~typing.Dict[str, ~typing.Any] | None = None, raw_results: ~typing.Dict[str, ~typing.Any] | None = None, execution_summary: ~typing.Dict[str, ~typing.Any] | None = None, plot_config: ~mlsynth.config_models.PlotConfig | None = None, inputs: ~mlsynth.utils.fscm_helpers.structures.FSCMInputs, selected_donors: ~typing.List[~typing.Any], weights_vector: ~numpy.ndarray, selection_path: ~mlsynth.utils.fscm_helpers.structures.FSCMSelectionPath | None = None, diagnostics: ~typing.Dict[str, float] = <factory>, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#

Bases: BaseEstimatorResults

Top-level container returned by mlsynth.FSCM.fit().

An EffectResult (the observational report): it populates the standardized sub-models so the flat accessors (att / counterfactual / gap / donor_weights / pre_rmse) resolve through the base contract. The FSCM-specific fields below carry the forward-selection detail.

Parameters:
  • inputs (FSCMInputs) – Preprocessed panel.

  • selected_donors (list) – Labels of the donor set carrying weight.

  • weights_vector (np.ndarray) – (n_selected,) simplex weights over selected_donors (the raw weight array; res.donor_weights is the full-pool mapping).

  • selection_path (FSCMSelectionPath or None) – Forward-selection / cross-validation trace (None when forward selection is off).

  • diagnostics (dict) – Rich fit diagnostics (pre-RMSE, R-squared, donor counts, CV stats); the standardized fit_diagnostics sub-model mirrors the headline numbers and keeps this dict in additional_metrics.

  • metadata (dict) – Free-form provenance.

diagnostics: Dict[str, float]
inputs: FSCMInputs
metadata: Dict[str, Any]
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'frozen': True, 'json_encoders': {<class 'numpy.ndarray'>: <function BaseEstimatorResults.Config.<lambda>>}}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

property n_selected: int
selected_donors: List[Any]
selection_path: FSCMSelectionPath | None
weights_vector: np.ndarray

class mlsynth.utils.fscm_helpers.structures.FSCMSelectionPath(sizes: ndarray, order: List[Any], train_rmspe: ndarray, test_rmspe: ndarray, optimal_size: int)

Bases: object

The forward-selection / cross-validation trace (Cerulli Fig. 2–3).

Each entry k describes the nested model with k donors: the donor added at that step, the in-sample pre-period RMSPE, and the out-of-sample rolling-origin CV RMSPE used to choose the optimal size.

optimal_size: int
order: List[Any]
sizes: ndarray
test_rmspe: ndarray
train_rmspe: ndarray

Helper Modules

Data preparation – the only DataFrame touchpoint: pivots to NumPy, builds the unit/time IndexSets, splits pre/post, and assembles the optional covariate arrays.

Long-DataFrame -> NumPy boundary for FSCM (the only pandas touchpoint).

mlsynth.utils.fscm_helpers.setup.derive_treatment(df: DataFrame, unitid: str, time: str, treat: str) → Tuple[Any, Any]

Read the single treated unit and its first treated period from treat.

mlsynth.utils.fscm_helpers.setup.prepare_fscm_inputs(df: DataFrame, *, unitid: str, time: str, outcome: str, treat: str, covariates: List[str] | None = None, covariate_windows: Dict[str, Tuple[Any, Any]] | None = None, match_periods: List[Any] | None = None) → FSCMInputs

Pivot the panel to NumPy, build IndexSets, split pre/post.

Covariate predictors are each averaged over an aggregation window (Abadie’s specification): covariate_windows maps a covariate to an inclusive (start, end) label range; covariates not listed are averaged over the full pre-treatment period. match_periods are specific pre-treatment periods whose outcome value is matched directly.
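The windowed averaging described above amounts to a masked column mean. The sketch below illustrates it on a toy array; window_mean is a hypothetical helper, not a function in mlsynth.utils.fscm_helpers.setup, and it assumes the covariate has no missing values inside the window.

```python
import numpy as np

def window_mean(values, periods, window):
    """Average each unit's covariate over an inclusive (start, end) label range.

    values  : (T, N) array, one column per unit
    periods : (T,) array of period labels (e.g. years)
    window  : (start, end), both ends included
    """
    start, end = window
    mask = (periods >= start) & (periods <= end)
    return values[mask].mean(axis=0)

# Toy covariate for two units over 1980..1988: unit 0 rises by 1 each year from 0,
# unit 1 is constant at 2.
years = np.arange(1980, 1989)
values = np.column_stack([np.arange(9, dtype=float), np.full(9, 2.0)])

beer_like = window_mean(values, years, (1984, 1988))
print(beer_like)  # [6. 2.]
```

Applying the helper once per covariate, with the window taken from a dict such as covariate_windows and a default of the full pre-treatment range, would give one predictor value per unit and covariate.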

The forward-selection / rolling-origin cross-validation engine and the weight-fitting that dispatches to the bilevel solver (predictor mode) or the trajectory simplex solve.

NumPy-first FSCM engine: forward selection + rolling-origin CV.

Implements Cerulli (2024) with two matching modes:

  • Trajectory mode (no covariates and no special predictors): the SCM weights match the treated unit’s pre-treatment outcome trajectory.

  • Predictor mode (covariates and/or match_periods given): the weights match Abadie’s predictor specification, with the predictor-weight matrix V and donor weights jointly determined by the bilevel optimization of Malo et al. (2024) – V is solved once on the full donor pool and then reused through forward selection (see bilevel).

In both modes the donor pool is grown greedily and the donor count is chosen by rolling-origin cross-validation; the final weights are refit on the full pre-period over the selected donors. All optimization is self-contained – the simplex problems are solved by the FISTA primitive in bilevel.simplex, not by Opt.SCopt.

mlsynth.utils.fscm_helpers.estimation.run_fscm(inputs: FSCMInputs, *, forward_selection: bool = True, cv_split: float = 0.5, max_donors: int | None = None) → FSCMResults

Synthetic control via the Malo et al. (2024) bilevel solver.

Parameters:
  • inputs (FSCMInputs) – Prepared NumPy panel.

  • forward_selection (bool, default True) – If True, run Cerulli’s forward selection with rolling-origin OOS validation to choose a donor subset (each candidate fit by the bilevel solver). If False, take the full bilevel solve over all donors with no selection.

  • cv_split (float, default 0.5) – Sets the first rolling origin (forward selection only): forecasting begins at period round(T0 * cv_split) and sweeps to the pre-period end.

  • max_donors (int, optional) – Cap on the number of forward-selection steps (default: all donors).

Plotting: the outcome paths and the donor-count CV-RMSPE selection curve.

Plotting for FSCM: treated-vs-counterfactual and the donor-count CV curve.

The observed-vs-counterfactual panel is delegated to the shared Plotter; the rolling-origin CV curve is FSCM’s own bespoke panel and stays local.

mlsynth.utils.fscm_helpers.plotter.plot_fscm(results: FSCMResults, *, outcome: str, time: str, treated_color: str = 'black', counterfactual_color: str | List[str] = 'red', save: bool | str = False) → None

Outcome paths (shared archetype) plus the donor-count CV curve.

Bilevel optimization (Malo et al. 2024)

A self-contained implementation of the optimistic bilevel SCM program, used as the weight solver in predictor mode (and the simplex primitive everywhere). No external QP solver is involved.

The simplex-constrained least-squares core: Euclidean projection onto the probability simplex and the FISTA accelerated projected-gradient solver.

The three algorithm stages – unconstrained feasibility, corner solutions, and Tykhonov-regularized descent.

The driver composing the stages, and the structured problem/solution containers.