Vanilla Synthetic Control (VanillaSC)


Overview#

VanillaSC is the standard synthetic control method (Abadie & Gardeazabal 2003; Abadie, Diamond & Hainmueller 2010, 2015), built on mlsynth’s self-contained bilevel engine. It estimates the effect on a single treated unit by constructing a weighted average of donor units – the synthetic control – that tracks the treated unit’s pre-treatment path, and reads the effect as the post-treatment gap between the treated unit and its synthetic counterpart.

What distinguishes this implementation is that it is explicit about the two regimes of the SCM optimisation:

  • No covariates -> the donor weights \(\mathbf{w}\) solve the convex simplex least-squares fit on the pre-treatment outcomes. This is a single, well-posed convex program – deterministic and reproducible (unique up to donor collinearity).

  • Covariates -> the predictor weights \(\mathbf{V}\) and donor weights \(\mathbf{w}\) are chosen jointly through a bilevel program. This is non-convex, and the predictor weights are generically non-identified. VanillaSC solves it with a reliable backend and reports a diagnostic (\(\text{v\_agreement}\)) so that fragility is visible rather than silent.

When to use this estimator#

  • You want the standard synthetic control done reliably, with the solver choice and identification fragility surfaced.

  • Outcome-only matching when you have a long, informative pre-period – this is the well-posed, reproducible case.

  • Covariate matching with mscmt when the donor pool is rich enough that the problem is well-conditioned (see the replications below). When \(\text{v\_agreement}\) comes back near 1, prefer outcome-only or penalized.

A concrete example: California passes a tobacco-control law (Proposition 99) and you want its effect on cigarette sales. There is one treated state, a long pre-law history, and dozens of untreated states to draw on. VanillaSC builds a synthetic California as a weighted blend of those donor states that matches the pre-law sales path, then reads the policy effect as the post-law gap between the real and synthetic series.

Notation#

Let \(j = 1\) denote the treated unit, with all units \(\mathcal{N} \coloneqq \{1, \dots, N\}\) and donor pool \(\mathcal{N}_0 \coloneqq \mathcal{N} \setminus \{1\}\) of cardinality \(N_0\). Time runs over \(t \in \mathcal{T} \coloneqq \{1, \dots, T\}\), 1-indexed; the intervention takes effect after period \(T_0\), splitting \(\mathcal{T}\) into the pre-period \(\mathcal{T}_1 \coloneqq \{t \in \mathcal{T} : t \le T_0\}\) (of length \(T_0\)) and the post-period \(\mathcal{T}_2 \coloneqq \{t \in \mathcal{T} : t > T_0\}\).

The treated series is \(\mathbf{y}_1 = (y_{11}, \dots, y_{1T})^\top \in \mathbb{R}^{T}\) with scalar outcomes \(y_{1t}\); each donor \(j \in \mathcal{N}_0\) contributes a series \(\mathbf{y}_j\), stacked into the donor matrix \(\mathbf{Y}_0 \coloneqq [\mathbf{y}_j]_{j \in \mathcal{N}_0} \in \mathbb{R}^{T \times N_0}\) (one column per donor). Donor weights are \(\mathbf{w} \in \mathbb{R}^{N_0}\), constrained to the unit simplex \(\Delta^{N_0} \coloneqq \{\mathbf{w} \in \mathbb{R}_{\ge 0}^{N_0} : \|\mathbf{w}\|_1 = 1\}\) in the canonical SCM; the optimiser is \(\mathbf{w}^\ast\) and the fitted vector \(\widehat{\mathbf{w}}\). The synthetic counterfactual is \(\widehat{\mathbf{y}}_1 \coloneqq \mathbf{Y}_0\,\widehat{\mathbf{w}}\) with entries \(\widehat{y}_{1t}\), the per-period effect is \(\tau_t \coloneqq y_{1t} - \widehat{y}_{1t}\), and the ATT is \(\widehat{\tau} \coloneqq |\mathcal{T}_2|^{-1} \sum_{t \in \mathcal{T}_2} \tau_t\). The significance level is \(\alpha\).
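As a minimal sketch of the last two definitions, the per-period effects and the ATT can be computed from an observed and a synthetic series as below. The function name `att_from_gaps` and the toy numbers are illustrative assumptions, not part of mlsynth; indices are 0-based in the code while the text is 1-indexed.

```python
def att_from_gaps(y_treated, y_synth, t0):
    """Return (tau, att): tau_t = y_1t - yhat_1t and the post-period mean of tau.

    t0 is the number of pre-treatment periods, so the post-period is the
    slice [t0:] in 0-based indexing.
    """
    if len(y_treated) != len(y_synth):
        raise ValueError("series must have equal length")
    if not 0 < t0 < len(y_treated):
        raise ValueError("need at least one pre and one post period")
    tau = [y - s for y, s in zip(y_treated, y_synth)]
    post = tau[t0:]
    return tau, sum(post) / len(post)


# Toy data: perfect pre-fit over three periods, then a gap opens.
y1 = [10.0, 11.0, 12.0, 9.0, 8.0]
yhat = [10.0, 11.0, 12.0, 12.0, 12.0]
tau, att = att_from_gaps(y1, yhat, t0=3)
print(att)  # -3.5
```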

Identifying assumptions#

  1. Pre-treatment fit / convex-hull support. There exist weights \(\mathbf{w} \in \Delta^{N_0}\) under which the treated pre-period path is reproduced by the donors, \(y_{1t} \approx (\mathbf{Y}_0\mathbf{w})_t\) for \(t \in \mathcal{T}_1\) – equivalently, the treated unit lies inside (or near) the convex hull of the donors’ lagged outcomes (Abadie, Diamond & Hainmueller 2010).

    Remark. This is the workhorse identifying condition: when the simplex fit balances the pre-period exactly the synthetic control is (near) unbiased, and a good pre-fit is the empirical certificate one inspects. When no convex combination fits – few donors, an outlying treated unit, a long high-dimensional pre-period – residual imbalance becomes bias, which is exactly the regime the ridge augmentation below targets.

  2. No anticipation. Treatment has no effect through \(T_0\): \(y_{1t} = y_{1t}^N\) for all \(t \in \mathcal{T}_1\), so the pre-period outcomes reflect the no-intervention path.

    Remark. If the treated unit reacts in advance of the formal intervention date, the pre-period fit is contaminated by the effect itself and the gap understates it. The remedy is to set \(T_0\) to the last period before the first plausible response, not the nominal policy date.

  3. No interference and untreated donors (SUTVA). The treatment of unit \(1\) does not change any donor’s outcome, and every \(j \in \mathcal{N}_0\) is untreated over \(\mathcal{T}\), so \(\mathbf{Y}_0\) carries only no-intervention outcomes.

    Remark. Donors contaminated by the same shock (spillovers, a co-incident policy) bias the counterfactual. Drop or quarantine such units from the donor pool before fitting.

  4. Outcome-model stability. The no-intervention outcomes follow a stable data-generating process across \(\mathcal{T}\) – for instance a linear factor structure \(y_{jt}^N = \boldsymbol{\lambda}_j^\top \mathbf{f}_t + \varepsilon_{jt}\) – so that weights matching the treated unit’s factor loadings on \(\mathcal{T}_1\) continue to reproduce its no-intervention path on \(\mathcal{T}_2\) (Abadie, Diamond & Hainmueller 2010).

    Remark. This is what licenses extrapolating the pre-period fit forward: the synthetic control tracks the treated unit after \(T_0\) only if the same latent structure governs both windows. A regime change unrelated to treatment (a structural break in \(\mathbf{f}_t\)) breaks the counterfactual even with a perfect pre-fit.

Mathematical formulation#

For a treated unit with pre-treatment outcomes \(\mathbf{y}_{1,\mathcal{T}_1} \in \mathbb{R}^{T_0}\) and donors \(\mathbf{Y}_{0,\mathcal{T}_1} \in \mathbb{R}^{T_0 \times N_0}\):

Outcome-only (no covariates).

\[\widehat{\mathbf{w}} = \operatorname*{argmin}_{\mathbf{w}} \; \bigl\| \mathbf{y}_{1,\mathcal{T}_1} - \mathbf{Y}_{0,\mathcal{T}_1}\mathbf{w} \bigr\|_2^2 \quad \text{s.t.} \quad \mathbf{w} \ge 0,\ \mathbf{1}^\top\mathbf{w} = 1.\]
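To make the convex program concrete, here is a small standard-library sketch that solves it by projected gradient descent. This is an illustrative stand-in, not mlsynth's solver: the function names, the step-size rule and the toy data are all assumptions of the example.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w >= 0, sum(w) = 1} (sort-based)."""
    u = sorted(v, reverse=True)
    cumsum, theta = 0.0, 0.0
    for k, uk in enumerate(u, start=1):
        cumsum += uk
        t = (cumsum - 1.0) / k
        if uk - t > 0:          # holds for a prefix of the sorted values
            theta = t
    return [max(x - theta, 0.0) for x in v]


def simplex_least_squares(y_pre, Y0_pre, iters=5000):
    """Minimise ||y_pre - Y0_pre w||^2 over the simplex.

    Y0_pre is a list of T0 rows, each holding the N0 donor values.
    """
    n0 = len(Y0_pre[0])
    # Step 1/L, with L = 2 * ||Y0||_F^2 an upper bound on the Lipschitz
    # constant of the gradient 2 * Y0^T (Y0 w - y).
    lip = 2.0 * sum(x * x for row in Y0_pre for x in row)
    w = [1.0 / n0] * n0
    for _ in range(iters):
        resid = [sum(r[j] * w[j] for j in range(n0)) - y
                 for r, y in zip(Y0_pre, y_pre)]
        grad = [2.0 * sum(r[j] * e for r, e in zip(Y0_pre, resid))
                for j in range(n0)]
        w = project_to_simplex([wj - g / lip for wj, g in zip(w, grad)])
    return w


# Toy panel: the treated pre-path is an exact blend (0.5, 0.3, 0.2) of donors.
Y0 = [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0], [1.0, 1.0, 0.0], [2.0, 0.0, 1.0]]
y = [0.9, 0.7, 0.8, 1.2]
w_hat = simplex_least_squares(y, Y0)
print([round(x, 4) for x in w_hat])
```

Because the donor columns in this toy panel are linearly independent, the minimiser is unique and the iteration recovers the blending weights; with collinear donors the fitted path is still unique but the weights need not be.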

Covariate matching (bilevel). With predictor vector \(\mathbf{X}_1 \in \mathbb{R}^{P}\) (treated) and predictor matrix \(\mathbf{X}_0 \in \mathbb{R}^{P \times N_0}\) (donors), each predictor averaged over its window and scaled to unit variance, the lower level solves, for given diagonal predictor weights \(\mathbf{V}\),

\[\mathbf{w}^\ast(\mathbf{V}) = \operatorname*{argmin}_{\mathbf{w} \in \Delta^{N_0}} \; (\mathbf{X}_1 - \mathbf{X}_0 \mathbf{w})^\top \mathbf{V} (\mathbf{X}_1 - \mathbf{X}_0 \mathbf{w}),\]

and the upper level chooses \(\mathbf{V}\) to minimise the pre-treatment outcome fit,

\[\min_{\mathbf{V}} \; \bigl\| \mathbf{y}_{1,\mathcal{T}_1} - \mathbf{Y}_{0,\mathcal{T}_1}\, \mathbf{w}^\ast(\mathbf{V}) \bigr\|_2^2 .\]

The donor weights \(\mathbf{w}\) and the counterfactual are pinned by this program; the predictor weights \(\mathbf{V}\) are generically not (a whole polytope of \(\mathbf{V}\) reproduces the same \(\mathbf{w}\)).

Backends#

VanillaSC exposes four backends via backend=: the outcome-only fit plus three solvers for the covariate path.

"outcome-only"

No predictor weights; the convex simplex fit above. The well-posed default (also selected by backend="auto" when no covariates are given).

"mscmt"

Becker & Kloessner (2018): a global differential-evolution search over \(\log_{10} \mathbf{V}\) with the simplex inner solve. The default when covariates are supplied. Set canonical_v="min.loss.w" (or "max.order") to report a canonical, reproducible \(\mathbf{V}\) via the MSCMT determine_v step.

"malo"

Malo et al. (2024): a staged corner search. Fast and exact when the optimum is a predictor corner – but note that when a lagged outcome is among the predictors, the loss-minimising corner puts all weight on that lag, collapsing the inner match to pure outcome-fitting (it drifts toward the outcome floor).

"penalized"

Abadie & L’Hour (2021): a pairwise-penalized estimator with leave-one-out \(\lambda\) selection, giving a unique, sparse \(\mathbf{w}\). Works with or without covariates. The solver is cross-validated against the authors’ own wsoll1 (durable case pensynth_prop99); see Penalized Synthetic Control (Abadie & L’Hour 2021).

The identification diagnostic#

When covariates are used, res.weights.summary_stats["v_agreement"] reports the maximum absolute difference between the two MSCMT canonical predictor-weight vectors (min.loss.w and max.order). It is small when \(\mathbf{V}\) is well identified and large (up to 1) when the predictor weights – and the donor weights they imply – are fragile. A large value is a warning that the covariate-matched solution should not be over-interpreted.
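The diagnostic itself is a one-line computation once the two canonical vectors are in hand. The sketch below is an illustration only: the function name is hypothetical, and normalising each vector to sum to one before comparing is an assumption of this example (it is what keeps the value in [0, 1]).

```python
def v_agreement(v_min_loss_w, v_max_order):
    """Maximum absolute difference between two predictor-weight vectors.

    Assumption of this sketch: each vector is rescaled to sum to one first.
    """
    if len(v_min_loss_w) != len(v_max_order):
        raise ValueError("vectors must have the same number of predictors")

    def normalise(v):
        total = sum(v)
        if total <= 0:
            raise ValueError("predictor weights must have a positive sum")
        return [x / total for x in v]

    a, b = normalise(v_min_loss_w), normalise(v_max_order)
    return max(abs(x - y) for x, y in zip(a, b))


print(v_agreement([0.5, 0.3, 0.2], [0.5, 0.3, 0.2]))  # 0.0: V is pinned
print(v_agreement([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]))  # 1.0: V is fragile
```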

Inference#

Five inference modes are available via inference=:

"placebo" (default, inference=True)

Abadie’s in-space placebo test: the synthetic control is refit treating each donor as pseudo-treated, and the treated unit’s post/pre RMSPE ratio is ranked against the placebo distribution to give a p-value. Simple and assumption-light, but the smallest achievable p-value is about \(1/(N_0+1)\).
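The ranking step can be sketched as follows, assuming the per-unit gap series (observed minus synthetic) have already been produced by refitting with each unit as pseudo-treated. The function names and toy gaps are illustrative, not mlsynth's API.

```python
import math


def rmspe(gaps):
    """Root mean squared prediction error of a gap series."""
    return math.sqrt(sum(g * g for g in gaps) / len(gaps))


def placebo_p_value(gaps_by_unit, treated, t0):
    """Rank the treated unit's post/pre RMSPE ratio among all units.

    gaps_by_unit maps each unit to its full gap series; t0 is the number of
    pre-treatment periods. The treated unit counts itself, so the smallest
    possible value is 1 / (N_0 + 1). A zero pre-period RMSPE would divide by
    zero and is not handled in this sketch.
    """
    ratios = {u: rmspe(g[t0:]) / rmspe(g[:t0]) for u, g in gaps_by_unit.items()}
    at_least = sum(1 for r in ratios.values() if r >= ratios[treated])
    return at_least / len(ratios), ratios


gaps = {
    "treated": [0.1, -0.1, 1.0, 1.0],   # tight pre-fit, large post gap
    "a": [0.2, 0.2, 0.2, 0.2],
    "b": [0.1, 0.1, 0.3, 0.3],
    "c": [0.5, -0.5, 0.5, 0.5],
}
p, ratios = placebo_p_value(gaps, "treated", t0=2)
print(p)  # 0.25, the floor 1 / (N_0 + 1) with three donors
```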

"conformal" – prediction intervals (Chernozhukov, Wüthrich & Zhu 2021)

The augsynth default for Augmented SCM, and a distribution-free test by inversion. For a sharp null \(H_0:\ \tau_t = \tau_0\) the post-period treated outcome is adjusted by \(\tau_0\), the weights are refit on the adjusted data, and the post-period residual is checked for whether it conforms with the pre-treatment residuals – its rank among them is the \(p\)-value,

\[p(\tau_0) = \frac{1 + \#\{\,t \le T_0 : |\widehat{u}_t| \ge |\widehat{u}_{\mathrm{post}}(\tau_0)|\,\}}{T_0 + 1}.\]

Inverting the test (collecting the \(\tau_0\) not rejected at level \(\alpha\)) gives a per-period prediction interval for the random counterfactual \(y_{1t}^N\); the same machinery returns one joint-null \(p\)-value for the whole effect path. It is exactly valid when the residuals are exchangeable, and finite-sample bounded otherwise – the ridge penalty controls the SCM-vs-ASCM weight difference, so validity holds as \(T_0 \to \infty\). Unlike the placebo test it needs no donor pool, and unlike asymptotic intervals no normality; in the Kansas calibrations its intervals attain near-nominal coverage where plain SCM under-covers from poor-fit bias. The bands are returned in res.inference.details["counterfactual_lower"] and res.inference.details["counterfactual_upper"] (shaded on the plot), alongside the joint p-value in res.inference.details["joint_p_value"]. Reference: mlsynth.utils.bilevel.ridge_inference.conformal_intervals().
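The rank formula and the inversion can be sketched in a few lines. This is not the library routine: the function names are hypothetical, and the toy `residuals_under_null` below holds the weights fixed instead of refitting them for each \(\tau_0\), which the actual procedure does.

```python
def conformal_p_value(pre_residuals, post_residual):
    """p(tau_0) = (1 + #{t <= T0 : |u_t| >= |u_post|}) / (T0 + 1)."""
    exceed = sum(1 for u in pre_residuals if abs(u) >= abs(post_residual))
    return (1 + exceed) / (len(pre_residuals) + 1)


def conformal_interval(residuals_under_null, tau_grid, alpha):
    """Keep every tau_0 on the grid that is not rejected at level alpha.

    residuals_under_null(tau_0) must return (pre_residuals, post_residual)
    for the data adjusted by tau_0.
    """
    kept = [t for t in tau_grid
            if conformal_p_value(*residuals_under_null(t)) > alpha]
    return (min(kept), max(kept)) if kept else None


pre = [0.5, -1.0, 0.2, -0.3, 0.8, -0.6, 0.1, 0.4, -0.9]
observed_gap = 2.0


def toy_null(tau0):
    # Simplification for illustration: weights are not refit per tau_0.
    return pre, observed_gap - tau0


p_zero = conformal_p_value(*toy_null(0.0))
interval = conformal_interval(toy_null, [k / 10 for k in range(41)], alpha=0.2)
print(p_zero, interval)
```

With nine pre-period residuals the p-value moves in steps of 1/10, so the interval's resolution is limited by \(T_0\) as well as by the grid.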

"scpi" – prediction intervals (Cattaneo, Feng & Titiunik 2021)

Treats \(\tau_T\) as a predictand (a random variable) and builds prediction intervals, decomposing the prediction error as

\[\widehat{\tau}_T - \tau_T = e_T - \mathbf{p}_T^\top(\widehat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0),\]

an out-of-sample shock \(e_T\) plus an in-sample weight-estimation error. The counterfactual prediction band is assembled period-by-period as \([\,Y_{\text{fit}} + w_L + e_L,\; Y_{\text{fit}} + w_U + e_U\,]\), and the treatment-effect interval is \([\,Y_{\text{obs}} - \text{cf}_U,\; Y_{\text{obs}} - \text{cf}_L\,]\).

  • In-sample (\(w_L\)/\(w_U\)): a simulation-based bound. With \(\mathbf{Q} = \mathbf{B}^\top\mathbf{B}/T_0\) (donor pre-outcomes), \(\widehat{\boldsymbol{\Sigma}} = \mathbf{B}^\top \mathrm{diag}(\boldsymbol{\omega})\,\mathbf{B} / T_0^2\) where \(\omega_t = \tfrac{T_0}{T_0-\mathrm{df}}(u_t - \mathbb{E}[u_t])^2\) (HC1), and pre-period residuals \(\mathbf{u} = \mathbf{A} - \mathbf{B}\widehat{\mathbf{w}}\), draw \(\mathbf{G}^\ast \sim N(\mathbf{0},\widehat{\boldsymbol{\Sigma}})\). For each draw and predictor \(\mathbf{p}_T\), solve over the localised simplex set

    \[\min/\max\ \mathbf{p}_T^\top\mathbf{x} \quad\text{s.t.}\quad (\mathbf{x}-\widehat{\mathbf{w}})^\top\mathbf{Q}(\mathbf{x}-\widehat{\mathbf{w}}) - 2\mathbf{G}^{\ast\top}(\mathbf{x}-\widehat{\mathbf{w}}) \le 0,\; \textstyle\sum \mathbf{x} = 1,\; \mathbf{x} \ge \boldsymbol{\ell},\]

    with \(\ell_j = \widehat{w}_j\) if \(\widehat{w}_j < \rho\) else \(0\). The regularisation parameter \(\rho\) is data-driven and capped at \(\rho_{\max} = 0.2\); \(\mathbf{Q}\) is reduced via a thresholded eigen-square-root so collinear (near-null) donor directions are left unconstrained. \(w_L\)/\(w_U\) are the \(\alpha_1/2\) / \(1-\alpha_1/2\) quantiles of \(\mathbf{p}_T^\top(\widehat{\mathbf{w}} - \mathbf{x})\) across draws.

  • Out-of-sample (\(e_L\)/\(e_U\)): a location-scale model, \(e_T = \mathbb{E}[e] + \sqrt{\mathrm{Var}[e]}\,\varepsilon\). The conditional mean and a log-variance scale (capped by the residual IQR, Gaussian \(\varepsilon\)) are estimated by regressing \(\mathbf{u}\) on the active-donor design; "ls" and "empirical" use standardized / raw residual quantiles.

VanillaSC returns the average-effect (ATT) interval in res.inference.ci_lower/ci_upper and the full per-period sequence (point effects, prediction intervals, counterfactual bands, and the in-/out-of-sample components) in res.inference.details. This implements the canonical simplex / outcome-only case; for covariate backends it uses the same outcome design and is approximate.
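The band assembly described above is simple arithmetic once the in-sample and out-of-sample bounds exist. A minimal sketch, with a hypothetical function name and made-up numbers, assuming the four bound sequences are already computed per post period:

```python
def assemble_scpi_bands(y_obs, y_fit, w_l, w_u, e_l, e_u):
    """Combine bounds into counterfactual bands and effect intervals.

    Counterfactual band: [y_fit + w_L + e_L, y_fit + w_U + e_U].
    Effect interval:     [y_obs - cf_U,      y_obs - cf_L].
    All arguments are equal-length sequences over the post periods.
    """
    cf_lower = [f + a + b for f, a, b in zip(y_fit, w_l, e_l)]
    cf_upper = [f + a + b for f, a, b in zip(y_fit, w_u, e_u)]
    effect = [(o - hi, o - lo) for o, lo, hi in zip(y_obs, cf_lower, cf_upper)]
    return cf_lower, cf_upper, effect


cf_lo, cf_hi, effect = assemble_scpi_bands(
    y_obs=[8.0], y_fit=[10.0],
    w_l=[-0.5], w_u=[0.5],    # in-sample (weight-estimation) bounds
    e_l=[-1.0], e_u=[1.0],    # out-of-sample (shock) bounds
)
print(cf_lo, cf_hi, effect)  # [8.5] [11.5] [(-3.5, -0.5)]
```

Note the swap in the effect interval: the upper counterfactual bound produces the lower effect bound.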

Note

This is a self-contained, MIT-licensed re-derivation of the Cattaneo-Feng-Titiunik algorithm – it does not import the GPL reference package scpi. It is validated to reproduce scpi’s CI_all_gaussian on the Proposition 99 panel to within Monte-Carlo error (see test_scpi_matches_reference_package, which is skipped unless scpi_pkg happens to be installed).

"lto" – leave-two-out refined placebo (Lei & Sudijono 2025)

A design-based randomization test that fixes the two structural weaknesses of the ordinary placebo test – its coarse \(\{1/N, 2/N, \dots\}\) grid and its zero size when \(\alpha < 1/N\). It replaces the “one turn each” permutation with a tournament over triples and reports both a naive p-value (res.inference.p_value) and a powered one (details["p_powered"]), together with the Type-I bound and tournament tallies. It shares the placebo test’s assumptions but is far more powerful in small donor pools. See The leave-two-out refined placebo test and the two theory subsections below for the full treatment.

"ttest" – debiased SC t-test for the ATT (Chernozhukov, Wüthrich & Zhu 2025)

A \(K\)-fold cross-fitting debiasing with a self-normalized statistic that is asymptotically \(t_{K-1}\), giving the ATT in the familiar one-number form \(\widehat{\tau} \pm t_{K-1}(1-\alpha/2)\,\mathrm{se}\) – robust to misspecification and valid with stationary or non-stationary data, with no long-run-variance estimation. The pre-period is split into ttest_K blocks; each block’s weights are refit (with the configured backend) on its complement, and the held-out block gap removes the SC bias. The debiased ATT, se, tstat and the two-sided p_value land in res.inference.details with the interval in res.inference.ci_lower/ci_upper. Set ttest_K="auto" to choose \(K\) from the SC-residual persistence and the RAE formula (their Section 3.2); \(K = 3\) is the small-\(T_0\) benchmark. Because it only needs \(\ell_2\)-consistent weights it composes with every backend. Supplying oracle_weights (a {donor: weight} map) bypasses the weight solve and uses those weights in every fold – the paper’s oracle benchmark, and a way to plug in externally computed weights. Reference: mlsynth.utils.inferutils.debiased_sc_ttest(); reproduces the paper’s Table 5 carbon-tax estimate (durable case benchmarks/cases/cwz_ttest.py).

The debiased t-test: assumptions and econometric theory#

Setup and estimand. The treated unit \(j = 1\) is untreated through \(T_0\) and treated over \(\mathcal{T}_2\); the target is the ATT \(\tau \coloneqq |\mathcal{T}_2|^{-1}\sum_{t\in\mathcal{T}_2}(y_{1t}^I - y_{1t}^N)\). SC predicts the missing \(y_{1t}^N\) from the donor outcomes \(\mathbf{x}_t \coloneqq (y_{jt})_{j\in\mathcal{N}_0}\); write the linear prediction model

\[y_{1t}^N = \mathbf{x}_t^\top \mathbf{w}^\ast + u_t, \qquad \mathbf{w}^\ast \coloneqq \operatorname*{argmin}_{\mathbf{w}\in\Delta^{N_0}} \mathbb{E}\bigl(y_{1t}^N - \mathbf{x}_t^\top\mathbf{w}\bigr)^2 .\]

This is a predictive, not a structural, model: the pseudo-true weights \(\mathbf{w}^\ast\) need not be “true” SC weights, so the test tolerates misspecification (for instance a linear factor model under which SC is biased).

Why cross-fitting debiases. The naive SC ATT is biased because the weights are estimated in high dimension relative to \(T_0\); under stationarity that bias is the same in \(\mathcal{T}_1\) and \(\mathcal{T}_2\). The \(K\)-fold cross-fit estimates it from a held-out pre-period block and subtracts it. In this subsection \(T_1 \coloneqq |\mathcal{T}_2|\) denotes the number of post-treatment periods (the paper's notation; it is not the pre-period set \(\mathcal{T}_1\)). Splitting \(\{1,\dots,T_0\}\) into \(K\) blocks of length \(r = \min(\lfloor T_0/K\rfloor, T_1)\) and refitting \(\widehat{\mathbf{w}}_{(k)}\) on each block’s complement keeps the weights approximately independent of the held-out block \(H_k\), so

\[\widehat{\tau}_k = |\mathcal{T}_2|^{-1}\!\!\sum_{t\in\mathcal{T}_2} \bigl(y_{1t} - \mathbf{x}_t^\top\widehat{\mathbf{w}}_{(k)}\bigr) - |H_k|^{-1}\!\!\sum_{t\in H_k} \bigl(y_{1t} - \mathbf{x}_t^\top\widehat{\mathbf{w}}_{(k)}\bigr), \qquad \widehat{\tau} = K^{-1}\!\sum_{k=1}^{K} \widehat{\tau}_k ,\]

where the second (held-out) term estimates the pre-period bias that, under stationarity, also contaminates the first (post-period) term.

The identifying assumptions are the following (stationary case; Section 4 of the paper relaxes the first to nonstationary data).

  1. Covariance-stationary data. \(\{(y_{1t}^N, \mathbf{x}_t)\}\) is covariance-stationary.

    Remark. This is what makes the held-out pre-period gap a valid estimate of the post-period bias – the load-bearing restriction. It is plausibly violated by a structural break shortly after \(T_0\); the placebo test (inference="placebo") can be used to probe it.

  2. \(\ell_2\)-consistent weights. \(\max_{k} \|\widehat{\mathbf{w}}_{(k)} - \mathbf{w}^\ast\|_2 = o_p(1)\).

    Remark. Mild and generic: it holds for SC even when the number of donors \(N_0\) grows with \(T_0\) (no sparsity needed) and for many penalized-regression estimators – the reason the test rides any backend (outcome-only / mscmt / malo / penalized) and, more broadly, any \(\ell_2\)-consistent weighting.

  3. Weak dependence. \(\{(\mathbf{x}_t, u_t)\}\) is \(\beta\)-mixing with sufficient moments and covariance eigenvalues bounded away from zero.

    Remark. Satisfied by ARMA, GARCH and many stochastic-volatility processes; it rules out unit-root / near-unit-root prediction errors in the stationary theory (the nonstationary results in Section 4 handle those separately).

Limiting distribution. Under Assumptions 1-3, as \(T_0, T_1 \to \infty\), the component estimators \((\widehat{\tau}_1,\dots,\widehat{\tau}_K)\) are jointly asymptotically normal but share a common term \(\xi_0\) (the post-treatment average), so they are not independent. Estimating the long-run variance \(\sigma^2\) is unreliable in small samples, so the test instead self-normalizes:

\[\mathbb{T}_K = \frac{\sqrt{K}\,(\widehat{\tau} - \tau)} {\widehat{\sigma}_{\widehat{\tau}}}, \qquad \widehat{\sigma}_{\widehat{\tau}} = \sqrt{1 + \tfrac{Kr}{T_1}}\, \Bigl(\tfrac{1}{K-1}\textstyle\sum_{k}(\widehat{\tau}_k - \widehat{\tau})^2 \Bigr)^{1/2}.\]

The shared \(\xi_0\) cancels between numerator and denominator – this is exactly what the \(\sqrt{1 + Kr/T_1}\) rescale corrects for – giving an asymptotically pivotal \(\mathbb{T}_K \xrightarrow{d} t_{K-1}\), a Student-\(t\) law with \(K-1\) degrees of freedom. No LRV estimate, no subsampling, no permutation distribution is needed; self-normalization also delivers higher-order refinements that explain the strong small-sample performance. Inverting the statistic gives \(\widehat{\tau} \pm t_{K-1}(1-\alpha/2)\, \widehat{\sigma}_{\widehat{\tau}}/\sqrt{K}\) with asymptotic coverage \(1-\alpha\) (a confidence interval when \(\tau\) is fixed, a prediction interval when it is random).
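The statistic and interval can be traced numerically with a short sketch. It assumes the fold-specific gaps have already been computed with refit weights, takes the Student-t critical value as an input (the standard library has no t quantile function), and uses hypothetical function names; `n_post` stands for the number of post-treatment periods written \(T_1\) in this subsection.

```python
import math


def fold_estimate(post_gaps, heldout_gaps):
    """tau_k: mean post-period gap minus mean held-out pre-period gap."""
    return (sum(post_gaps) / len(post_gaps)
            - sum(heldout_gaps) / len(heldout_gaps))


def debiased_t_interval(tau_folds, r, n_post, t_crit):
    """Return (tau_hat, sigma_hat, (lower, upper)) for the debiased ATT.

    sigma_hat = sqrt(1 + K r / T_1) * sample sd of the fold estimates, and the
    interval is tau_hat +/- t_crit * sigma_hat / sqrt(K).
    """
    k = len(tau_folds)
    tau_hat = sum(tau_folds) / k
    spread = math.sqrt(sum((t - tau_hat) ** 2 for t in tau_folds) / (k - 1))
    sigma_hat = math.sqrt(1.0 + k * r / n_post) * spread
    half = t_crit * sigma_hat / math.sqrt(k)
    return tau_hat, sigma_hat, (tau_hat - half, tau_hat + half)


# Toy K = 3 example with T0 = 6 and two post periods, so r = min(6 // 3, 2) = 2.
folds = [
    fold_estimate([-2.5, -1.5], [0.0, 0.0]),
    fold_estimate([-1.0, -2.0], [-0.5, -0.5]),
    fold_estimate([-3.0, -3.0], [0.5, -0.5]),
]
T_CRIT_DF2_975 = 4.302653  # 0.975 quantile of Student-t with 2 df, hard-coded
tau_hat, sigma_hat, ci = debiased_t_interval(folds, r=2, n_post=2,
                                             t_crit=T_CRIT_DF2_975)
print(tau_hat, sigma_hat, ci)
```

With only \(K-1 = 2\) degrees of freedom the critical value is large, which is the interval-length cost of small \(K\) discussed under "Choosing \(K\)".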

Efficiency and nonstationarity. The debiased SC estimator’s asymptotic variance is no larger than difference-in-differences’, whether or not SC is correctly specified, and the t-test stays valid when DID’s common-trends assumption fails. With nonstationary data it is valid when all units share a common nonstationarity, or under bounded heterogeneity in the deviations (then SC must be correctly specified); its robustness improves as \(T_0\) grows. Reach for it with one treated unit and enough post-periods (the asymptotics send \(T_1 \to \infty\)); when \(T_1\) is very small or a break hits just after \(T_0\), prefer the conformal or SCPI modes, which are valid for fixed \(T_1\).

Choosing \(K\). ttest_K trades interval length against coverage accuracy: larger \(K\) shortens the interval – its relative asymptotic efficiency (the RAE of eq. 14, reproduced in mlsynth.utils.inferutils.rae()) rises from about 64% at \(K=3\) toward 92% at \(K=10\) for \(c_0 = T_0/T_1 = 30/16\) – but degrades coverage when \(T_0\) is small or the prediction errors are persistent. ttest_K=3 is the robust small-\(T_0\) benchmark; ttest_K="auto" gauges persistence by an AR(1) fit to the SC residuals and selects \(K\) per Section 3.2 (it bumps to \(K=4\) when persistence is low and climbs further only when \(T_0\) is large enough to keep each block well-sized).

How the SCPI machinery works (one fit)#

scpi_intervals(y, Y0, pre, W, ...) takes the fitted donor weights \(\widehat{\mathbf{w}}\) (from any backend), the donor outcome matrix, and the number of pre-treatment periods, and runs the following steps. Let \(\mathbf{A} = \mathbf{y}_{1,\mathcal{T}_1}\) be the treated pre-outcomes, \(\mathbf{B} = \mathbf{Y}_{0,\mathcal{T}_1}\) the donor pre-outcomes, \(\mathbf{P}\) the donor post-outcomes, and \(\mathbf{u} = \mathbf{A} - \mathbf{B}\widehat{\mathbf{w}}\) the pre-period residuals.

  1. Degrees of freedom. For the simplex, \(\mathrm{df} = (\#\{\widehat{w}_j \neq 0\}) - 1\), giving the HC1 correction \(\mathrm{vc} = T_0/(T_0-\mathrm{df})\).

  2. Regularisation parameter \(\rho\). The data-driven type-1 value \(\rho = \tfrac{\sigma_u}{\min_j \mathrm{sd}(\mathbf{B}_j)} \sqrt{\log(N_0)\, d_0 \log T_0}/\sqrt{T_0}\), capped at \(\rho_{\max}=0.2\) (with a fallback bump if it comes out below \(0.001\)). \(\rho\) defines the “active” donor set \(\{\,j : \widehat{w}_j > \rho\,\}\).

  3. Conditional mean & variance. Regress \(\mathbf{u}\) on the active-donor design \([\,\mathbf{B}_{\cdot,\text{active}},\,\mathbf{1}\,]\) to get \(\mathbb{E}[\mathbf{u}]\) (the u_missp step), then \(\omega_t = \mathrm{vc}\,(u_t - \mathbb{E}[u_t])^2\). Form \(\mathbf{Q} = \mathbf{B}^\top\mathbf{B}/T_0\) and \(\widehat{\boldsymbol{\Sigma}} = \mathbf{B}^\top\mathrm{diag}(\boldsymbol{\omega})\mathbf{B}/T_0^2\), and its matrix square root \(\boldsymbol{\Sigma}^{1/2}\).

  4. Localised feasible set. Lower bounds \(\ell_j = \widehat{w}_j\) if \(\widehat{w}_j < \rho\) else \(0\) (near-binding donors are pinned at their tiny weight; active donors may move down to zero). \(\mathbf{Q}\) is reduced by a thresholded eigen-square-root so the near-null (collinear) directions are left unconstrained.

  5. In-sample simulation. For each of scpi_sims draws \(\mathbf{G}^\ast = \boldsymbol{\Sigma}^{1/2}\,\mathbf{z}\), \(\mathbf{z}\sim N(\mathbf{0},\mathbf{I})\), and each post predictor \(\mathbf{p}_T\), solve the small conic program in \(\mathbf{x}\) (donor weights) twice – minimise and maximise \(\mathbf{p}_T^\top\mathbf{x}\) subject to \((\mathbf{x}-\widehat{\mathbf{w}})^\top\mathbf{Q}(\mathbf{x}-\widehat{\mathbf{w}}) - 2\mathbf{G}^{\ast\top}(\mathbf{x}-\widehat{\mathbf{w}})\le 0\), \(\sum \mathbf{x} = 1\), \(\mathbf{x}\ge\boldsymbol{\ell}\). Record \(\mathbf{p}_T^\top(\widehat{\mathbf{w}} - \mathbf{x})\) for each branch; \(w_L\)/\(w_U\) are the \(\alpha_1/2\) / \(1-\alpha_1/2\) quantiles across draws.

  6. Out-of-sample band. From the location-scale model on \(\mathbf{u}\) get \(e_L\)/\(e_U\) per post period (see the out-of-sample bullet in the "scpi" description above).

  7. Assemble. Counterfactual band \([\,Y_{\text{fit}} + w_L + e_L,\; Y_{\text{fit}} + w_U + e_U\,]\), effect interval \([\,Y_{\text{obs}} - \text{cf}_U,\; Y_{\text{obs}} - \text{cf}_L\,]\), and an ATT interval. The ATT interval comes from an appended post-period-average predictor row that is carried through steps 5-6, so it uses the same simulation rather than a naive average of the per-period bounds.

The result is an InferenceResults with ci_lower/ci_upper (the ATT interval), confidence_level \(= 1-2\alpha\), and a details dict holding the per-period periods, tau, pi_lower/pi_upper, counterfactual_lower/upper, the in_sample_* (\(w_L,w_U\)) and out_of_sample_* (\(e_L,e_U\)) components, sims and e_method.
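Steps 1, 2 and 4 above reduce to simple bookkeeping on the fitted weights. The sketch below illustrates that bookkeeping only; the function name is hypothetical, \(\rho\) is taken as given instead of computed from the data-driven formula, and none of this is the scpi_intervals source.

```python
def scpi_localisation(w_hat, rho, t0):
    """Return (df, vc, active, lower) for simplex weights w_hat.

    df     = (number of non-zero weights) - 1
    vc     = T0 / (T0 - df), the HC1 correction
    active = indices j with w_hat[j] > rho
    lower  = w_hat[j] if w_hat[j] < rho else 0 (lower bounds of the local set)
    """
    df = sum(1 for w in w_hat if w != 0) - 1
    if t0 <= df:
        raise ValueError("need more pre-periods than degrees of freedom")
    vc = t0 / (t0 - df)
    active = [j for j, w in enumerate(w_hat) if w > rho]
    lower = [w if w < rho else 0.0 for w in w_hat]
    return df, vc, active, lower


df, vc, active, lower = scpi_localisation([0.6, 0.35, 0.05, 0.0],
                                          rho=0.1, t0=19)
print(df, round(vc, 4), active, lower)
```

In the example the third donor is pinned at its small weight of 0.05, while the two active donors are free to move down to zero inside the localised set.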

Composing SCPI with the backends#

backend (how \(\mathbf{w}\) is estimated) and inference (how uncertainty is quantified) are orthogonal – any of the four backends pairs with any of the five inference modes:

VanillaSC({..., "backend": "mscmt", "inference": "scpi"}).fit()
VanillaSC({..., "backend": "malo",  "inference": "scpi"}).fit()

The pipeline fits the weights with the chosen backend and hands the resulting res.W to scpi_intervals. Two things to keep in mind:

  • The in-sample simulation rebuilds \(\mathbf{Q}\) and \(\widehat{\boldsymbol{\Sigma}}\) from the donor pre-outcomes \(\mathbf{B}\), treating \(\widehat{\mathbf{w}}\) as simplex weights. With outcome-only this is the exact Cattaneo-Feng-Titiunik interval (the case validated against scpi). With mscmt/malo the weights were also shaped by the covariate predictors, so SCPI uses the outcome design as a stand-in – it is approximate for covariate backends. The point effects, the ATT, and the out-of-sample band are unaffected; only the in-sample \(w_L\)/\(w_U\) term carries the approximation.

  • Read the SCPI interval alongside \(\text{v\_agreement}\). When the predictor weights are non-identified (v_agreement near 1, e.g. Prop 99 with lagged outcomes) the point counterfactual is still pinned, but the covariate-matched solution is fragile; the placebo test, which is exact for any backend, is the conservative cross-check.

The leave-two-out (LTO) refined placebo test#

What it is. The ordinary placebo test (above) gives each of the \(N\) units exactly one turn as the pseudo-treated unit and ranks the real treated unit’s fit statistic against those \(N\) values. That is its weakness: the p-value can only land on the grid \(\{1/N, 2/N, \dots, 1\}\), and at a conventional level like \(\alpha = 0.05\) with a small donor pool the test is either coarse or – when \(\alpha < 1/N\) – literally unable to reject (its size is zero).

The Lei-Sudijono (2025) leave-two-out test keeps the same design-based logic but replaces the “one turn each” permutation with a tournament over triples. Think of every triple \(\{i, j, I\}\) (two controls and the treated unit) as a match: leave all three out of the donor pool, build a synthetic control for each of them from the remaining \(N-3\) units, score each by its post/pre RMSPE ratio, and the unit with the largest ratio “wins” the match.

The treated unit should win often if the treatment had a real effect (a large post-period gap relative to a tight pre-period fit). The p-value is the fraction of matches the treated unit does not win,

\[p_{\mathrm{naive\text{-}LTO}} = \frac{1}{(N-1)(N-2)} \sum_{i \neq j} \mathbf{1}\bigl\{R_{i,j,I;I} \le \max(R_{i,j,I;i}, R_{i,j,I;j})\bigr\},\]

where \(R_{i,j,I;k} = \lvert S_{\text{ratio-RMSPE}}(\mathbf{y}_k, \widehat{\mathbf{y}}_k)\rvert\) is the score of unit \(k\) when the pool excludes \(\{i, j, I\}\). Because there are \(\binom{N-1}{2}\) matches rather than \(N\), the p-value lives on an \(O(N^2)\)-fine grid – the granularity problem disappears.
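The tournament count can be sketched directly from the formula. The `score` callback below is a placeholder: in the real test each call refits a synthetic control with the triple removed from the pool, whereas the toy scores here are fixed numbers chosen for illustration, and the function name is hypothetical.

```python
from itertools import permutations


def naive_lto_p_value(score, controls, treated):
    """Fraction of matches the treated unit does not win.

    score(i, j, k) is the ratio-RMSPE score of unit k when the pool excludes
    {i, j, treated}. The sum runs over ordered pairs i != j of controls,
    i.e. (N - 1)(N - 2) terms.
    """
    not_won, matches = 0, 0
    for i, j in permutations(controls, 2):
        matches += 1
        if score(i, j, treated) <= max(score(i, j, i), score(i, j, j)):
            not_won += 1
    return not_won / matches


# Toy scores that ignore the left-out pair, purely for illustration.
fixed = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 6.0, "treated": 5.0}
p_lto = naive_lto_p_value(lambda i, j, k: fixed[k],
                          ["a", "b", "c", "d"], "treated")
print(p_lto)  # 0.5: the treated unit loses every match that includes "d"
```

With four controls there are 12 ordered pairs, so the p-value already moves in steps of 1/12 instead of the 1/5 steps of the ordinary placebo test.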

Two p-values. res.inference.p_value is the naive LTO p-value above. res.inference.details["p_powered"] is the powered variant \(p_{\mathrm{naive\text{-}LTO}} - c(N, \alpha) + \delta\), which shifts the naive value down by the largest amount the discrete Type-I bound allows (powered_offset_c), strictly increasing power. The powered value is a decision rule tied to one \(\alpha\) – reject when it is \(\le \alpha\) (reject_at_alpha) – not a general-purpose p-value, so do not compare it across levels or report it as “the” p-value.

LTO: design-based assumptions and econometric theory#

The LTO test is design-based, not outcome-model-based: the potential outcomes are treated as fixed, and all randomness comes from which unit got treated. Its validity rests on two assumptions.

  • Uniform assignment. The treated index \(I\) is uniformly distributed over \(\{1, \dots, N\}\) – a priori, any unit was equally likely to be the treated one. Under the null this makes the \(N\) units exchangeable, which is exactly what licenses the tournament. This holds by construction in (cluster-)randomized experiments. In observational work it is a modelling choice: it is most defensible when the treated unit is comparable to the donors (often after covariate adjustment), and in quasi-experimental settings – e.g. natural disasters, where which locality is hit is plausibly close to random over a small comparable region.

  • Sharp null. The hypothesis tested is Fisher’s sharp null \(H_0 : y_{it}^I = y_{it}^N\) for all \(t > T_0\) (no effect for any unit in any post period), or a known-\(\tau\) additive version \(y_{it}^I = y_{it}^N + \tau_{it}\). Sharpness is what lets the test impute every unit’s counterfactual under the null and so run the tournament.

Under these, the test has a finite-sample Type-I error guarantee (no large-\(N\), no long-\(T\), no asymptotics):

\[\mathbb{P}_{H_0}\!\left(p_{\mathrm{naive\text{-}LTO}} \le \alpha\right) \le \frac{\lfloor N f(N, \alpha)\rfloor}{N},\]

reported as type_i_bound. This bound is never worse than the approximate-placebo bound \((\lfloor N\alpha\rfloor + 1)/N\), and for the levels and sizes typical of SCM applications (\(\alpha \in \{0.01, 0.02\}\) for \(6 < N < 200\); \(\alpha = 0.05\) for most \(N\)) it is identical to it – so switching to LTO costs nothing in worst-case Type-I error. Crucially, the placebo bound is tight whereas the LTO bound generally is not: in practice the LTO test’s actual Type-I error is often strictly below \(\alpha\), i.e. it can be unconditionally valid even when \(\alpha < 1/N\).
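The placebo side of this comparison is easy to tabulate. The sketch below computes only the approximate-placebo bound \((\lfloor N\alpha\rfloor + 1)/N\) quoted above and flags the \(\alpha < 1/N\) regime; the function \(f(N, \alpha)\) behind the LTO bound is not defined in this section, so it is deliberately left out. Function names are hypothetical, and exact fractions avoid floating-point surprises in the floor.

```python
from fractions import Fraction
from math import floor


def placebo_type_i_bound(n_units, alpha):
    """Approximate-placebo bound (floor(N * alpha) + 1) / N, as a Fraction.

    alpha should be a string or Fraction (e.g. "0.05") so it is exact.
    """
    alpha = Fraction(alpha)
    return Fraction(floor(n_units * alpha) + 1, n_units)


def placebo_cannot_reject(n_units, alpha):
    """True in the alpha < 1/N regime, where the smallest placebo p-value,
    1/N, already exceeds alpha."""
    return Fraction(alpha) < Fraction(1, n_units)


for n in (10, 20, 38):
    print(n, placebo_type_i_bound(n, "0.05"), placebo_cannot_reject(n, "0.05"))
```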

Two further theoretical properties matter in practice:

  • Consistency where the placebo test fails (Theorem 6.1). When \(\alpha < 1/N\), the LTO test is uniformly consistent – its power goes to 1 as the effect size grows. The approximate placebo test is not: in this regime it can have essentially zero power no matter how large the true effect (zero if \(N\) is even, \(\le 1/N\) if odd). This is the single strongest reason to prefer LTO in small donor pools.

  • Confidence regions. Inverting the additive-\(\tau\) test (\(\{\theta : p_{\mathrm{naive\text{-}LTO}}(\theta) > \alpha\}\)) yields a region for the post-period effect path with guaranteed coverage \(\ge 1 - \lfloor N f(N,\alpha)\rfloor / N\). (mlsynth currently reports the p-values; the inversion is a straightforward extension.)

Methodologically, the LTO test is a new kind of randomization inference: it generalises the Jackknife+ of Barber et al. (2021) (which leaves one point out and so still has \(1/N\) granularity) and is distinct from classical permutation/rank inference. It also – unlike most asymptotic SCM inference – does not simplify the synthetic-control construction: the full weight/predictor machinery (any VanillaSC backend) is re-run inside every match, so the test reflects the estimator you actually use.

When the LTO assumptions are violated#

The sharp null is testable and usually uncontroversial; the uniform assignment assumption is where care is needed.

  • Selection on outcomes / non-comparable treated unit. If the treated unit was chosen because of its (anticipated) trajectory, or is structurally unlike every donor, exchangeability fails and the Type-I guarantee no longer holds. The usual remedy is to restore comparability through the specification – match on covariates, restrict the donor pool to genuinely similar units – before trusting any placebo-type p-value.

  • Known non-uniform assignment. When the treatment probabilities \(\pi_k\) are known or estimable (e.g. seismic risk for an earthquake study), Lei-Sudijono give a weighted LTO p-value \(p_{\text{w-LTO}}(\pi)\) that reweights each match by \(\pi_j\pi_k / ((1-\pi_I)^2 - \sum_{l\neq I}\pi_l^2)\) and reduces to the naive value when \(\pi_i \equiv 1/N\).

  • Sensitivity analysis (\(\Gamma\)). Rather than commit to uniformity, one can ask how far from it the design could be before the conclusion flips. Following Rosenbaum, constrain \(\pi_i \in [\tfrac{1}{\Gamma N}, \tfrac{\Gamma}{N}]\) and find the smallest \(\Gamma \ge 1\) at which the worst-case weighted p-value crosses \(\alpha\). In the paper, Prop 99 tolerates \(\Gamma \approx 1.4\) (robust) while German reunification flips at only \(\Gamma \approx 1.1\) (fragile). The weighted p-value and \(\Gamma\) search require solving a non-convex (NP-hard) quadratic program and are not yet implemented in VanillaSC; the uniform-assignment naive/powered p-values are.
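As a quick check on the weighted-LTO reweighting quoted in the list above, the sketch below evaluates \(\pi_j\pi_k / ((1-\pi_I)^2 - \sum_{l\neq I}\pi_l^2)\) directly. The function name `lto_pair_weight` and the toy probabilities are illustrative assumptions, not part of mlsynth (which does not implement the weighted test).

```python
import numpy as np

def lto_pair_weight(pi, treated, j, k):
    """Weight on the ordered control pair (j, k), following the quoted expression."""
    pi = np.asarray(pi, dtype=float)
    controls = np.delete(pi, treated)               # assignment probabilities of the controls
    denom = (1.0 - pi[treated]) ** 2 - np.sum(controls ** 2)
    return pi[j] * pi[k] / denom

N, treated = 17, 0

# Uniform assignment: every ordered pair gets the naive weight 1 / ((N-1)(N-2)).
uniform = np.full(N, 1.0 / N)
w_uniform = lto_pair_weight(uniform, treated, 1, 2)

# Non-uniform (toy) assignment: the weights still sum to one over ordered control pairs.
rng = np.random.default_rng(0)
pi = rng.dirichlet(np.ones(N))
total = sum(lto_pair_weight(pi, treated, j, k)
            for j in range(1, N) for k in range(1, N) if j != k)
```

The second check works because \(\sum_{j \neq k} \pi_j \pi_k = (1-\pi_I)^2 - \sum_{l \neq I} \pi_l^2\) whenever the probabilities sum to one, so the denominator is exactly the normaliser.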

Choosing among placebo, LTO, and SCPI#

  • Prefer LTO over the ordinary placebo whenever the donor pool is small – especially in the \(\alpha < 1/N\) regime (e.g. \(N < 20\) at \(\alpha = 0.05\)), where the placebo test cannot reject and LTO can. The powered variant almost Pareto-improves the placebo: same worst-case Type-I error, more power. Both share the same assumptions, so LTO is close to a free upgrade.

  • Keep the ordinary placebo when you want the most familiar, widely-reported statistic, when \(N\) is large enough that granularity is a non-issue, or as a cheap (\(O(N)\) vs \(O(N^2)\)) cross-check.

  • Reach for SCPI when the question is how big the effect is (a prediction interval / confidence statement on the magnitude), not just whether there is one. SCPI rests on different (model-based, conditional) foundations than the design-based placebo/LTO tests, so the two are complementary: LTO answers “is the treated unit special?” by randomization, SCPI quantifies the effect’s uncertainty.
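A small helper can make the first criterion concrete: the ordinary placebo p-value cannot fall below \(1/N\), so it can only reject when \(1/N \le \alpha\). The sketch below is a hypothetical convenience function (not an mlsynth API), evaluated at \(N = 17\) and \(N = 39\), the two sizes for which this page quotes helper constants.

```python
from fractions import Fraction

def placebo_can_reject(N, alpha):
    """True if the smallest attainable placebo p-value, 1/N, is <= alpha."""
    # Fractions avoid floating-point surprises at the boundary 1/N == alpha.
    return Fraction(1, N) <= Fraction(str(alpha))

results = {N: placebo_can_reject(N, 0.05) for N in (17, 39)}
for N, ok in results.items():
    print(N, round(1 / N, 3), "placebo can reject" if ok else "placebo cannot reject")
```

When the helper returns False, the placebo test has zero size at that level and inference="lto" is the design-based option that can still reject.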

Empirical replications#

VanillaSC reproduces the three canonical synthetic-control studies on their original datasets – California / Proposition 99 (ADH 2010; synthetic Utah/Nevada/Montana/Colorado/Connecticut, ATT \(\approx -19\) packs), German reunification (ADH 2015; Austria-dominant donor pool, negative ATT) and the Basque Country (Abadie-Gardeazabal 2003; Cataluna \(\approx 0.8\) + Madrid \(\approx 0.2\), ATT \(\approx -0.68\)). See the dedicated replication page, VanillaSC — Standard Synthetic Control (ADH 2010/2015; Abadie-Gardeazabal 2003), for the full datasets, code and donor-weight tables. These are locked as regression tests in mlsynth/tests/test_vanillasc_replications.py.

Ridge augmentation (Augmented SCM)#

The ridge-augmented synthetic control of Ben-Michael, Feller & Rothstein (2021) – progfunc="Ridge" in the augsynth R package – is not a separate estimator but a bias-correction layer on top of the simplex SCM. Given simplex weights \(\mathbf{w}\) and the centered pre-treatment outcomes \(\mathbf{A} = \mathbf{X}_1 - \bar{\mathbf{X}}\), \(\mathbf{B} = \mathbf{X}_0 - \bar{\mathbf{X}}\), it adds a ridge correction that closes the residual pre-treatment imbalance the simplex cannot,

\[\mathbf{w}_{\text{aug}}^\top = \mathbf{w}^\top + (\mathbf{A} - \mathbf{B}^\top\mathbf{w})^\top \left(\mathbf{B}^\top\mathbf{B} + \lambda \mathbf{I}\right)^{-1} \mathbf{B}^\top,\]

at the cost of leaving the simplex (the augmented weights may go negative and need not sum to one). Because any base \(\mathbf{w}\) can be augmented, the capability lives in the bilevel engine (mlsynth.utils.bilevel.ridge_augment.ridge_augment_weights()) and rides along wherever the solver goes. The penalty \(\lambda\) is chosen by leave-one-period-out cross-validation (augsynth’s 1-SE rule); inference is by the conformal permutation test of Chernozhukov, Wüthrich & Zhu (2021) (mlsynth.utils.bilevel.ridge_inference.conformal_pvalue()).
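The ridge correction can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the code of ridge_augment_weights(): it assumes \(\mathbf{B}\) stores one donor per row (shape J x T0), so that \(\mathbf{B}^\top\mathbf{w}\) is the fitted pre-period path and the ridge system is T0 x T0; the function name `ridge_augment` and the random inputs are hypothetical.

```python
import numpy as np

def ridge_augment(w, A, B, lam):
    """Add the ridge correction to base weights w.

    w : (J,) base (simplex) weights
    A : (T0,) centered treated pre-period outcomes
    B : (J, T0) centered donor pre-period outcomes, one donor per row (assumption)
    """
    resid = A - B.T @ w                               # residual pre-period imbalance
    T0 = B.shape[1]
    correction = B @ np.linalg.solve(B.T @ B + lam * np.eye(T0), resid)
    return w + correction                             # may be negative, need not sum to 1

rng = np.random.default_rng(0)
J, T0 = 8, 5
B = rng.normal(size=(J, T0))
A = rng.normal(size=T0)
w = np.full(J, 1.0 / J)                               # stand-in base weights

w_small = ridge_augment(w, A, B, lam=1e-10)           # near-zero penalty: imbalance closed
w_large = ridge_augment(w, A, B, lam=1e12)            # huge penalty: base weights returned
```

With a near-zero penalty the augmented weights reproduce the treated pre-period path (here J > T0, so an exact fit exists), while a very large penalty leaves the base weights essentially untouched.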

When to prefer augmentation#

Plain SCM is justified only when the pre-treatment fit is excellent: when the treated unit lies inside the convex hull of the donors’ lagged outcomes the simplex weights balance \(\mathbf{X}_1\) exactly and the estimator is (near) unbiased (Abadie, Diamond & Hainmueller 2010). Outside the hull – few donors, a long or high-dimensional pre-period, an outlying treated unit – no convex combination fits, and the residual imbalance \(\mathbf{X}_1 - \mathbf{X}_0^\top\widehat{\mathbf{w}}^{\mathrm{scm}}\) turns into bias. SCM has no way to correct it.

Augmented SCM is the middle ground. With \(\widehat{\mathbf{m}}\) the ridge outcome model, the ASCM estimate is the SCM estimate plus an estimate of that bias,

\[\widehat{y}_{1T}^{\,\mathrm{aug},N} = \underbrace{\textstyle\sum_{j \in \mathcal{N}_0} \widehat{w}_j^{\,\mathrm{scm}} y_{jT}}_{\text{SCM}} + \Big( \widehat{m}_{1T} - \textstyle\sum_{j \in \mathcal{N}_0}\widehat{w}_j^{\,\mathrm{scm}} \widehat{m}_{jT} \Big),\]

exactly analogous to bias correction for inexact matching (Abadie & Imbens 2011) and connected to doubly-robust estimation (Robins, Rotnitzky & Zhao 1994). Ridge ASCM is itself a penalized SCM whose penalty is on deviations from the simplex weights: it starts at the SCM solution and extrapolates beyond the hull (admitting negative weights) only as far as needed. \(\lambda\) sets the amount – \(\lambda \to \infty\) recovers plain SCM (no extrapolation), \(\lambda \to 0\) drives the pre-treatment imbalance to zero (full extrapolation). That is a bias-variance dial: augmentation removes imbalance bias at the cost of a larger weight norm / extrapolation variance, and \(\lambda\) (leave-one-period-out CV, one-standard-error rule) negotiates it.
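The "SCM plus estimated bias" form and the "augmented weights" form give the same number, which a short numerical check makes visible. The sketch assumes a ridge outcome model without intercept, \(\widehat{m}(\mathbf{x}) = \mathbf{x}^\top\widehat{\boldsymbol\eta}\), fitted on the donors, with donors stored one per row; all arrays are random stand-ins (the simplex weights are drawn, not fitted).

```python
import numpy as np

rng = np.random.default_rng(1)
J, T0, lam = 10, 6, 0.5
X0 = rng.normal(size=(J, T0))        # centered donor pre-period outcomes, one row per donor
X1 = rng.normal(size=T0)             # centered treated pre-period outcomes
yT = rng.normal(size=J)              # donor outcomes in a single post-period T
w_scm = rng.dirichlet(np.ones(J))    # stand-in simplex weights (not an actual SCM fit)

# Ridge outcome model fitted on the donors: eta = (X0'X0 + lam I)^{-1} X0' yT
ridge = X0.T @ X0 + lam * np.eye(T0)
eta = np.linalg.solve(ridge, X0.T @ yT)

# Form 1: SCM prediction plus the estimated bias (treated fit minus weighted donor fits)
scm_part = w_scm @ yT
bias_hat = X1 @ eta - w_scm @ (X0 @ eta)
y_aug_bias_form = scm_part + bias_hat

# Form 2: apply ridge-augmented weights directly to the donor outcomes
w_aug = w_scm + X0 @ np.linalg.solve(ridge, X1 - X0.T @ w_scm)
y_aug_weight_form = w_aug @ yT
```

Both forms agree up to floating-point error, which is why the correction can live in the weights and be reused by any backend.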

The authors’ practical rule – and ours – is to decide from the estimated bias itself: it is the imbalance term above, in the units of the estimand, and the first quantity ASCM computes. If it is large relative to the effect you expect, augment; if pre-fit is already excellent the correction is negligible and ASCM and SCM coincide. Two diagnostics accompany it: the pre-treatment RMSE (\(\lVert \mathbf{X}_1 - \mathbf{X}_0^\top\widehat{\mathbf{w}}\rVert/\sqrt{T_0}\), the imbalance that remains) and the extrapolation distance (\(\lVert\widehat{\mathbf{w}}^{\mathrm{aug}} - \widehat{\mathbf{w}}^{\mathrm{scm}}\rVert /\sqrt{N_0}\), how far the weights left the simplex). Across the paper’s calibrated DGPs, ASCM has both lower bias and lower RMSE than SCM – gains largest under misspecification and poor fit, modest when SCM already fits well.
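The two diagnostics are one-liners. The sketch below assumes \(\mathbf{X}_0\) is stored with one donor per row (N0 x T0), matching the \(\mathbf{X}_0^\top\widehat{\mathbf{w}}\) notation; the function names and the tiny two-donor example are hypothetical.

```python
import numpy as np

def pre_rmse(X1, X0, w):
    """Pre-treatment RMSE: ||X1 - X0' w|| / sqrt(T0)."""
    return np.linalg.norm(X1 - X0.T @ w) / np.sqrt(X1.size)

def extrapolation_distance(w_aug, w_scm):
    """How far the augmented weights moved from the simplex solution, per donor."""
    return np.linalg.norm(w_aug - w_scm) / np.sqrt(w_scm.size)

X0 = np.array([[1.0, 2.0],           # donor 1, two pre-periods
               [3.0, 4.0]])          # donor 2
X1 = np.array([2.0, 4.0])            # treated unit
w_scm = np.array([0.5, 0.5])         # on the simplex: fitted path is [2, 3]
w_aug = np.array([2.0, 0.0])         # off the simplex: fitted path is [2, 4]

rmse_scm = pre_rmse(X1, X0, w_scm)
rmse_aug = pre_rmse(X1, X0, w_aug)
dist = extrapolation_distance(w_aug, w_scm)
```

In this toy case the augmented weights close the imbalance entirely, at the price of a visible move away from the simplex, which is the trade-off the two diagnostics are meant to expose.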

Auxiliary covariates enter in either of augsynth’s two ways: parallel (residualize=False, the default) standardizes the covariates to the outcome scale and stacks them as extra matching rows; residualized (residualize=True) regresses the covariates out of the outcomes, matches on the residuals, and restores covariate balance with an add-back on the weights.

Example#

Augmented SCM is a mode of VanillaSC – set augment="ridge" (and inference="conformal" for the CWZ prediction intervals). Covariates are passed by column name; following mlsynth’s convention, apply any transforms (e.g. log) to the DataFrame yourself first. residualize=True switches from parallel inclusion to the residualized variant. The four-cell augsynth Kansas ladder, reproduced through the public API:

import numpy as np, pandas as pd
from mlsynth import VanillaSC

df = pd.read_csv("basedata/kansas_ascm.csv")          # long fips x quarter panel
for c in ("revstatecapita", "revlocalcapita", "avgwklywagecapita"):
    df[c] = np.log(df[c])                             # log the covariates up front
covs = ["lngdpcapita", "revstatecapita", "revlocalcapita",
        "avgwklywagecapita", "estabscapita", "emplvlcapita"]
base = dict(df=df, outcome="lngdpcapita", treat="treated",
            unitid="fips", time="year_qtr")

VanillaSC({**base}).fit().effects.att                              # -0.029  classic SCM
VanillaSC({**base, "augment": "ridge"}).fit().effects.att          # -0.040  ridge ASCM
VanillaSC({**base, "augment": "ridge",
           "covariates": covs}).fit().effects.att                  # -0.061  covariate ASCM
VanillaSC({**base, "augment": "ridge", "covariates": covs,
           "residualize": True}).fit().effects.att                 # -0.055  residualized

# conformal prediction intervals (and a plotted band with display_graphs=True):
res = VanillaSC({**base, "augment": "ridge", "inference": "conformal"}).fit()
res.inference.ci_lower, res.inference.ci_upper        # ATT prediction interval
res.inference.details["joint_p_value"]                # conformal joint-null p-value

Verification#

The augmentation is validated against augsynth on its flagship Kansas tax-cut study (quarterly log GDP per capita): the de-biasing ladder – classic SCM (ATT \(-0.029\)), ridge ASCM (\(-0.040\)), covariate ASCM (\(-0.061\)) and the residualized variant (\(-0.055\)) – is reproduced value-for-value, with pre-fit \(L_2\) imbalance falling monotonically from \(0.083\) to \(0.054\). The paper’s Section-7 thesis (near-nominal coverage and bias reduction across calibrated DGPs) is reproduced as a Path-B simulation. The full ladder is reproduced through the public API – pinned in mlsynth/tests/test_vanillasc_ascm.py::test_augsynth_kansas_ladder_public_api – not just at the engine level. See the dedicated page Ridge ASCM — Augmented Synthetic Control (Ben-Michael, Feller & Rothstein 2021); durable cases ascm_kansas (cross-validation vs augsynth) and augsynth_calibrated (Path B), locked in mlsynth/tests/test_bilevel_ridge.py.

Core API#

VanillaSC: the standard synthetic control, on the bilevel engine.

The ordinary single-treated synthetic control method (Abadie & Gardeazabal 2003; Abadie, Diamond & Hainmueller 2010), implemented on mlsynth’s self-contained bilevel machinery:

  • No covariates -> the well-posed convex problem: donor weights W minimise the pre-treatment outcome fit on the simplex. Unique up to donor collinearity, deterministic, reproducible.

  • Covariates -> the bilevel program (predictor weights V + donor weights W), solved by a reliable backend: "mscmt" (global differential evolution, Becker-Kloessner 2018), "malo" (corner search, Malo et al. 2024), or "penalized" (unique/sparse, Abadie-L’Hour 2021).

Because predictor weights are generically non-identified, VanillaSC reports a v_agreement diagnostic (the gap between the two MSCMT canonical V choices): small means V is well identified, large means the predictor weights – and the donor weights they imply – are fragile.

class mlsynth.estimators.vanillasc.VanillaSC(config: VanillaSCConfig | dict)#

Bases: object

Standard synthetic control estimator (bilevel engine).

Parameters:

config (VanillaSCConfig or dict) – Configuration. See mlsynth.utils.vanillasc_helpers.config.VanillaSCConfig.

Examples

>>> from mlsynth.estimators.vanillasc import VanillaSC
>>> cfg = {"df": panel, "outcome": "gdp", "treat": "treated",
...        "unitid": "country", "time": "year",
...        "covariates": ["trade", "infrate"], "backend": "mscmt"}
>>> res = VanillaSC(cfg).fit()
>>> res.effects.att
fit() BaseEstimatorResults#

Estimate the synthetic control and return standardized results.

Configuration#

class mlsynth.utils.vanillasc_helpers.config.VanillaSCConfig(*, df: ~pandas.DataFrame, outcome: str, treat: str, unitid: str, time: str, display_graphs: bool = True, save: bool | str = False, counterfactual_color: ~typing.List[str] = <factory>, treated_color: str = 'black', plot: ~mlsynth.config_models.PlotConfig = <factory>, backend: ~typing.Literal['auto', 'outcome-only', 'malo', 'mscmt', 'penalized'] = 'auto', covariates: ~typing.List[str] | None = None, covariate_windows: ~typing.Dict[~typing.Any, ~typing.Any] | None = None, canonical_v: bool | str = False, seed: int = 0, mscmt_maxiter: ~typing.Annotated[int, ~annotated_types.Ge(ge=1)] = 300, mscmt_popsize: ~typing.Annotated[int, ~annotated_types.Ge(ge=1)] = 15, mscmt_prune_shady: bool = True, augment: ~typing.Literal['ridge'] | None = None, ridge_lambda: ~typing.Annotated[float | None, ~annotated_types.Ge(ge=0.0)] = None, residualize: bool = False, inference: bool | str = True, alpha: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Lt(lt=1.0)] = 0.05, scpi_sims: ~typing.Annotated[int, ~annotated_types.Ge(ge=1)] = 200, scpi_e_method: ~typing.Literal['gaussian', 'empirical'] = 'gaussian', lto_max_pairs: ~typing.Annotated[int | None, ~annotated_types.Ge(ge=1)] = None, oracle_weights: ~typing.Dict[~typing.Any, float] | None = None, ttest_K: int | ~typing.Literal['auto'] = 3, penalized_cv: ~typing.Literal['holdout', 'loo', 'pensynth'] = 'holdout')#

Configuration for the VanillaSC estimator (standard SCM, bilevel engine).

The ordinary single-treated synthetic control, built on the self-contained bilevel machinery. With no covariates it reduces to the well-posed convex outcome-matching problem; with covariates it routes through the bilevel predictor-weight (V) optimisation, with a selectable, reliable backend.

Parameters:
  • backend ({“auto”, “outcome-only”, “malo”, “mscmt”, “penalized”}) – Predictor-weight backend. "auto" (default) uses "outcome-only" (convex simplex fit on pre-treatment outcomes) when no covariates are given, and "mscmt" (global differential-evolution V search) when they are. "malo" is the Malo et al. (2024) corner search, "penalized" the Abadie-L’Hour (2021) unique/sparse estimator.

  • covariates (list of str, optional) – Predictor columns. Each is averaged over its window (see covariate_windows) and scaled to unit variance, then matched via the bilevel program. None -> outcome-only matching.

  • covariate_windows (dict, optional) – Per-covariate inclusive (start, end) averaging window of time labels (Abadie’s special-predictor spec). Covariates not listed are averaged over the full pre-treatment period.

  • canonical_v (bool or {“min.loss.w”, “max.order”}) – Canonicalise the (non-identified) predictor weights for mscmt (MSCMT determine_v). The reported v_agreement is small when V is well identified and large when it is fragile. Default False.

  • seed (int) – RNG seed for the mscmt differential-evolution search.

  • mscmt_maxiter, mscmt_popsize (int) – Differential-evolution budget for the mscmt backend.

  • inference (bool or {“placebo”, “scpi”, “lto”}) – Inference method. True/"placebo" (default) runs Abadie in-space placebo inference (refit treating each donor as pseudo-treated; the p-value ranks the treated unit’s post/pre RMSPE ratio). "scpi" runs Cattaneo-Feng-Titiunik (2021) prediction intervals (in-sample simulation + out-of-sample location-scale; exact for the simplex / outcome-only synthetic control). "lto" runs the Lei-Sudijono (2025) leave-two-out refined placebo test (O(J^2) reference comparisons; finer granularity and non-zero size when alpha < 1/N). False skips inference.

  • alpha (float) – Level. For placebo, the confidence statement; for SCPI, used as both the in-sample (alpha1) and out-of-sample (alpha2) levels, giving a prediction interval with coverage approximately 1 - 2*alpha.

  • scpi_sims (int) – Number of Gaussian draws for the SCPI in-sample simulation.

  • scpi_e_method ({“gaussian”, “empirical”}) – Out-of-sample location-scale tabulation for SCPI.

  • lto_max_pairs (int, optional) – Cap on the number of donor pairs evaluated by the "lto" test (deterministic subsample via seed). None (default) uses all J*(J-1)/2 pairs; set a cap to keep the O(J^2) cost tractable with slow backends.
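To make the covariates / covariate_windows semantics concrete, here is a minimal pandas sketch of window averaging followed by unit-variance scaling. It is an illustration, not mlsynth's dataprep: the function name `predictor_matrix`, the `pre_end` argument, and the choice to scale each predictor by its population standard deviation across all units are assumptions.

```python
import pandas as pd

def predictor_matrix(df, unitid, time, covariates, windows, pre_end):
    """Average each covariate over its inclusive window, then scale to unit variance."""
    first = df[time].min()
    cols = {}
    for c in covariates:
        # Covariates without a window fall back to the full pre-treatment period.
        start, end = windows.get(c, (first, pre_end))
        in_window = df[time].between(start, end)          # inclusive on both ends
        cols[c] = df.loc[in_window].groupby(unitid)[c].mean()
    raw = pd.DataFrame(cols).T                            # predictors x units
    scaled = raw.div(raw.std(axis=1, ddof=0), axis=0)     # unit variance per predictor (assumed ddof)
    return raw, scaled

# Toy long panel: three units, 1980-1983, two covariates.
rows = [(u, yr, b * (yr - 1979), b + (yr - 1980))
        for u, b in (("A", 1), ("B", 2), ("C", 4)) for yr in range(1980, 1984)]
panel = pd.DataFrame(rows, columns=["unit", "year", "x", "z"])

raw, scaled = predictor_matrix(panel, "unit", "year", ["x", "z"],
                               windows={"x": (1981, 1982)}, pre_end=1982)
```

Here `x` is averaged over 1981-1982 only, while `z`, which has no entry in `windows`, is averaged over the whole pre-period 1980-1982.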

alpha: float#
augment: Literal['ridge'] | None#
backend: Literal['auto', 'outcome-only', 'malo', 'mscmt', 'penalized']#
canonical_v: bool | str#
covariate_windows: Dict[Any, Any] | None#
covariates: List[str] | None#
inference: bool | str#
lto_max_pairs: int | None#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

mscmt_maxiter: int#
mscmt_popsize: int#
mscmt_prune_shady: bool#
oracle_weights: Dict[Any, float] | None#
penalized_cv: Literal['holdout', 'loo', 'pensynth']#
residualize: bool#
ridge_lambda: float | None#
scpi_e_method: Literal['gaussian', 'empirical']#
scpi_sims: int#
seed: int#
ttest_K: int | Literal['auto']#

Engine#

Orchestration for the VanillaSC estimator.

dataprep -> (optional) covariate matrices -> bilevel engine -> ATT, fit diagnostics, in-space placebo inference -> standardized results.

mlsynth.utils.vanillasc_helpers.pipeline.run_vanillasc(config) BaseEstimatorResults#

Fit VanillaSC and assemble BaseEstimatorResults.

SCPI prediction intervals for the (simplex) synthetic control.

Cattaneo, Feng & Titiunik (2021, JASA) and Cattaneo, Feng, Palomba & Titiunik (2025, JSS scpi). The prediction error of the synthetic-control counterfactual decomposes as

tau_hat_T - tau_T = e_T - p_T’ (beta_hat - beta_0),

an out-of-sample shock e_T plus an in-sample weight-estimation error. This module is a from-scratch (MIT-licensed) re-derivation of the algorithm described in those papers – it does not import the GPL scpi package – and has been validated to reproduce scpi’s CI_all_gaussian for the canonical simplex control to within Monte-Carlo error.

The counterfactual prediction band is assembled period-by-period as

[ Y_fit + w_lb + e_lb , Y_fit + w_ub + e_ub ],

with the treatment-effect interval [Y_obs - cf_upper, Y_obs - cf_lower].
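The assembly step is simple arithmetic once the two components are available. The sketch below uses made-up numbers for two post-periods; the function name `assemble_band` is hypothetical, and the inputs stand in for the quantities named above (Y_fit, Y_obs, w_lb/w_ub, e_lb/e_ub).

```python
import numpy as np

def assemble_band(y_fit, y_obs, w_lb, w_ub, e_lb, e_ub):
    """Combine in-sample (w_*) and out-of-sample (e_*) bounds period by period."""
    cf_lower = y_fit + w_lb + e_lb          # lower counterfactual bound
    cf_upper = y_fit + w_ub + e_ub          # upper counterfactual bound
    tau_lower = y_obs - cf_upper            # effect interval flips the bounds
    tau_upper = y_obs - cf_lower
    return cf_lower, cf_upper, tau_lower, tau_upper

y_fit = np.array([100.0, 98.0])             # synthetic-control prediction
y_obs = np.array([90.0, 80.0])              # observed treated outcome
w_lb, w_ub = np.array([-2.0, -3.0]), np.array([2.0, 3.0])
e_lb, e_ub = np.array([-4.0, -4.0]), np.array([4.0, 4.0])

cf_lower, cf_upper, tau_lower, tau_upper = assemble_band(y_fit, y_obs, w_lb, w_ub, e_lb, e_ub)
```

Note that the upper counterfactual bound produces the lower effect bound, because the effect is the observed outcome minus the counterfactual.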

In-sample component (w_lb/w_ub)#

With Z = B (donor pre-outcomes), Q = Z'Z / T0 and pre-period residuals u = A - B w_hat, draw G* ~ N(0, Sigma) with Sigma = Z' diag(omega) Z / T0**2 and omega_t = (T0/(T0-df)) (u_t - E[u_t])**2 (HC1; E[u] from a regression of u on the active-donor design when u_missp). For each draw and post-period predictor p_T solve, over the localised simplex set,

min / max p_T’ x s.t. (x - w_hat)’Q(x - w_hat) - 2 G*’(x - w_hat) <= 0,

sum(x) = 1, x >= lb,

where lb_j = w_hat_j if w_hat_j < rho else 0 (the local geometry of Cattaneo et al.; rho is the data-driven regularisation parameter, capped at rho_max = 0.2). Q is reduced via a thresholded eigen-square-root so that collinear (near-null) donor directions are left unconstrained, exactly as in the reference conic reformulation. w_lb/w_ub are the alpha1/2 / 1 - alpha1/2 quantiles, across draws, of p_T'(w_hat - x) for the maximising / minimising branch.
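The simulation inputs of the in-sample step (Q, Sigma and the Gaussian draws G*) can be sketched as follows. This covers only the set-up, not the constrained min/max problem, and it makes two simplifying assumptions that are not taken from the source: E[u] is set to zero (the u_missp = False case) and df is taken to be the number of active donors. The function name `insample_draws` is hypothetical.

```python
import numpy as np

def insample_draws(A, B, w_hat, n_draws, seed=0):
    """Build Q, the HC1-style Sigma, and Gaussian draws G* ~ N(0, Sigma).

    A : (T0,) treated pre-period outcomes;  B : (T0, J) donor pre-period outcomes (Z = B).
    """
    T0, J = B.shape
    u = A - B @ w_hat                              # pre-period residuals
    df = int(np.sum(w_hat > 0))                    # assumption: df = number of active donors
    omega = (T0 / (T0 - df)) * u ** 2              # HC1 weights with E[u] = 0 assumed
    Q = B.T @ B / T0
    Sigma = B.T @ (omega[:, None] * B) / T0 ** 2   # Z' diag(omega) Z / T0^2
    rng = np.random.default_rng(seed)
    G = rng.multivariate_normal(np.zeros(J), Sigma, size=n_draws)
    return Q, Sigma, G

rng0 = np.random.default_rng(3)
T0, J = 12, 4
B = rng0.normal(size=(T0, J))
A = rng0.normal(size=T0)
w_hat = np.array([0.5, 0.5, 0.0, 0.0])             # toy simplex weights

Q, Sigma, G = insample_draws(A, B, w_hat, n_draws=200)
```

Each row of `G` would then feed one min/max problem per post-period predictor p_T, and the quantiles across rows give w_lb/w_ub.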

Out-of-sample component (e_lb/e_ub)#

A location-scale model for e_T: regress u on the active-donor design to get the conditional mean E[e] and a log-variance model for the scale sqrt(Var[e]) (Gaussian), capped by the inter-quartile range of the residuals (IQR / 1.34). The Gaussian band is E[e] +/- sqrt(-2 ln alpha2) * scale; "ls" uses standardized-residual quantiles, "empirical" the raw residual quantiles.
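For orientation, the Gaussian band multiplier as stated above, sqrt(-2 ln alpha2), is easy to tabulate. The sketch takes the conditional mean and scale as given inputs (estimating them is the location-scale regression described above, which is not reproduced here); the function names are hypothetical.

```python
import math

def gaussian_half_width(alpha2, scale):
    """Half-width of the Gaussian out-of-sample band: sqrt(-2 ln alpha2) * scale."""
    return math.sqrt(-2.0 * math.log(alpha2)) * scale

def gaussian_band(mean_e, scale, alpha2):
    """(e_lb, e_ub) = E[e] -/+ half-width."""
    h = gaussian_half_width(alpha2, scale)
    return mean_e - h, mean_e + h

multiplier = gaussian_half_width(0.05, 1.0)     # about 2.448 scale units at alpha2 = 0.05
e_lb, e_ub = gaussian_band(mean_e=0.0, scale=2.0, alpha2=0.05)
```

The multiplier grows slowly as alpha2 shrinks, so tightening the out-of-sample level widens the band only modestly compared with a change in the estimated scale.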

This implements the canonical simplex case (w >= 0, sum w = 1), the scpi default and the standard synthetic control.

class mlsynth.utils.vanillasc_helpers.scpi.SCPIResult(tau: ndarray, lower: ndarray, upper: ndarray, cf_lower: ndarray, cf_upper: ndarray, M1_lower: ndarray, M1_upper: ndarray, M2_lower: ndarray, M2_upper: ndarray, metadata: Dict[str, Any])#

Per-post-period SCPI prediction intervals (arrays of length T_post).

M1_lower: ndarray#
M1_upper: ndarray#
M2_lower: ndarray#
M2_upper: ndarray#
cf_lower: ndarray#
cf_upper: ndarray#
lower: ndarray#
metadata: Dict[str, Any]#
tau: ndarray#
upper: ndarray#
mlsynth.utils.vanillasc_helpers.scpi.scpi_intervals(y: ndarray, Y0: ndarray, pre: int, W: ndarray, *, sims: int = 200, u_alpha: float = 0.05, e_alpha: float = 0.05, u_missp: bool = True, e_method: str = 'gaussian', seed: int = 0) SCPIResult#

Compute SCPI prediction intervals for a simplex synthetic control.

Parameters:
  • y (np.ndarray) – Treated outcome over all periods, shape (T,).

  • Y0 (np.ndarray) – Donor outcomes over all periods, shape (T, J) (columns match W).

  • pre (int) – Number of pre-treatment periods T0.

  • W (np.ndarray) – Fitted simplex donor weights, shape (J,).

  • sims (int) – Number of Gaussian draws for the in-sample simulation.

  • u_alpha, e_alpha (float) – In-sample (alpha1) and out-of-sample (alpha2) levels.

  • u_missp (bool) – If True, allow E[u | H] != 0 (estimated by regressing the pre-period residuals on the active-donor design); else assume 0.

  • e_method ({“gaussian”, “ls”, “empirical”}) – Tabulation for the out-of-sample shock.

  • seed (int) – RNG seed for the simulation.

Leave-Two-Out (LTO) refined placebo test for the synthetic control.

Lei & Sudijono (2025), “Inference for Synthetic Controls via Refined Placebo Tests” (arXiv:2401.07152). The ordinary placebo / permutation test builds its null distribution from only \(N\) reference estimates, so its p-value lives on the coarse grid \(\{1/N, 2/N, \dots, 1\}\) and has zero size when \(\alpha < 1/N\). The LTO test bypasses this by leaving two control units out at a time, producing \(O(N^2)\) reference comparisons while retaining the same finite-sample Type-I error guarantee under uniform assignment.

Procedure (naive LTO, eqs. 5-7)#

Let \(I\) be the treated unit and \([N]\setminus\{I\}\) the controls (\(N = J + 1\) with \(J\) donors). For every unordered pair of distinct controls \(\{i, j\}\):

  1. Build the synthetic control for each \(k \in \{i, j, I\}\) using the donor pool \([N]\setminus\{i, j, I\}\) (all controls except \(i, j\)), and form the residual \(R_{i,j,I;k} = \lvert S(Y_k, \hat Y_k)\rvert\) with \(S\) the post/pre RMSPE-ratio statistic.

  2. Let \(R^{\mathrm{LTO}}_{i,j} = \max(R_{i,j,I;i}, R_{i,j,I;j})\); the treated unit “wins” the triple when \(R_{i,j,I;I} > R^{\mathrm{LTO}}_{i,j}\).

The naive LTO p-value counts the fraction of pairs the treated unit does not win,

\[p_{\mathrm{naive\text{-}LTO}} = \frac{1}{(N-1)(N-2)} \sum_{i \neq j} \mathbf{1}\{R_{i,j,I;I} \le R^{\mathrm{LTO}}_{i,j}\},\]

which (Theorem 2.2) satisfies \(\mathbb{P}_{H_0}(p_{\mathrm{naive\text{-}LTO}} \le \alpha) \le \lfloor N f(N, \alpha)\rfloor / N\).
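The counting logic of the naive procedure can be sketched end to end on simulated data. This is an illustration only: a small projected-gradient solver stands in for the outcome-only simplex fit (it is not mlsynth's engine), the data are a toy one-factor panel, and all function names are hypothetical. Averaging over unordered pairs equals the ordered-pair average in the formula because the indicator is symmetric in \(i, j\).

```python
from itertools import combinations
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u * k > css - 1.0)[0][-1]
    return np.maximum(v - (css[rho] - 1.0) / (rho + 1), 0.0)

def simplex_fit(target_pre, pool_pre, iters=500):
    """Projected-gradient stand-in for the simplex least-squares fit."""
    w = np.full(pool_pre.shape[1], 1.0 / pool_pre.shape[1])
    step = 1.0 / np.linalg.norm(pool_pre, 2) ** 2      # 1 / Lipschitz constant
    for _ in range(iters):
        grad = pool_pre.T @ (pool_pre @ w - target_pre)
        w = project_simplex(w - step * grad)
    return w

def rmspe_ratio(target, pool, pre):
    """Post/pre RMSPE ratio of `target` against a synthetic control from `pool`."""
    w = simplex_fit(target[:pre], pool[:pre])
    gap = target - pool @ w
    return np.sqrt(np.mean(gap[pre:] ** 2)) / np.sqrt(np.mean(gap[:pre] ** 2))

def naive_lto_pvalue(y, Y0, pre):
    """Fraction of control pairs {i, j} that the treated unit does not win."""
    pairs = list(combinations(range(Y0.shape[1]), 2))
    not_won = 0
    for i, j in pairs:
        pool = np.delete(Y0, [i, j], axis=1)           # all controls except i and j
        r_treated = rmspe_ratio(y, pool, pre)
        r_lto = max(rmspe_ratio(Y0[:, i], pool, pre),
                    rmspe_ratio(Y0[:, j], pool, pre))
        not_won += int(r_treated <= r_lto)
    return not_won / len(pairs), len(pairs)

# Toy panel: one common factor, six donors, a large post-period effect on the treated unit.
rng = np.random.default_rng(0)
T, pre, J = 20, 14, 6
factor = 10.0 + 0.5 * np.arange(T)
Y0 = np.outer(factor, np.linspace(0.5, 1.5, J)) + rng.normal(scale=0.5, size=(T, J))
y = factor + rng.normal(scale=0.5, size=T)
y[pre:] += 30.0

p_value, n_pairs = naive_lto_pvalue(y, Y0, pre)
print(p_value, n_pairs)
```

With J donors there are J(J-1)/2 pairs and three fits per pair, which is where the quadratic cost of the test comes from.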

Powered LTO (Theorem 2.3)#

For testing at a fixed level \(\alpha\), the powered p-value \(p_{\mathrm{powered\text{-}LTO}}(\alpha) = p_{\mathrm{naive\text{-}LTO}} - c(N, \alpha) + \delta\) shifts the naive value down by the largest amount that leaves the discrete Type-I bound unchanged, strictly increasing power. It is only valid for the \(\alpha\) it was computed at (reject when it is \(\le \alpha\)).

mlsynth.utils.vanillasc_helpers.lto.lto_f(N: int, alpha: float) float#

Type-I error rate function \(f(N, \alpha)\) (Lei-Sudijono eq. 9).

mlsynth.utils.vanillasc_helpers.lto.lto_placebo_test(engine: Any, y: ndarray, Y0: ndarray, pre: int, *, X1: ndarray | None = None, X0: ndarray | None = None, alpha: float = 0.05, max_pairs: int | None = None, seed: int = 0) Dict[str, Any]#

Run the Lei-Sudijono (2025) LTO refined placebo test.

Parameters:
  • engine (BilevelSCM) – Fitted-config synthetic-control engine; engine.fit(...) is re-run for each leave-two-out subproblem (any backend works, but the cost is \(O(J^2)\) fits, so fast backends are recommended).

  • y (np.ndarray) – Treated outcome over all periods, shape (T,).

  • Y0 (np.ndarray) – Donor outcomes, shape (T, J).

  • pre (int) – Number of pre-treatment periods.

  • X1, X0 (np.ndarray, optional) – Treated predictor vector (P,) and donor predictor matrix (P, J) (already windowed and scaled). None for outcome-only matching.

  • alpha (float) – Level at which the powered LTO p-value and Type-I bound are reported.

  • max_pairs (int, optional) – Cap on the number of donor pairs evaluated (deterministic subsample, for expensive backends). None -> all \(\binom{J}{2}\) pairs.

  • seed (int) – RNG seed for the pair subsample when max_pairs is set.

Returns:

dict – p_value (naive LTO), p_powered (valid only at alpha), c (powered offset), type_i_bound, n_pairs, treated_losses, N, alpha, reject (powered decision at alpha), and subsampled.

mlsynth.utils.vanillasc_helpers.lto.lto_powered_offset(N: int, alpha: float) float#

c(N, alpha): largest shift leaving the discrete Type-I bound fixed.

Defined (Theorem 2.3) as the smallest c with f(N, alpha + c) = (floor(N f(N, alpha)) + 1) / N. Found by bisection on the monotone increasing f. Reproduces the paper’s values (c(39, 0.05) = 0.002, c(17, 0.05) = 0.0125).
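The bisection itself is generic in \(f\). Since the closed form of \(f(N, \alpha)\) (eq. 9) is not reproduced on this page, the sketch below takes \(f\) as a callable and demonstrates the search with a toy stand-in \(f(N, a) = a\); the resulting number therefore illustrates the mechanics only and is not the paper's \(c(17, 0.05)\). The function name `powered_offset` is hypothetical.

```python
import math

def powered_offset(f, N, alpha, tol=1e-10):
    """Smallest c with f(N, alpha + c) >= (floor(N f(N, alpha)) + 1) / N, by bisection.

    Assumes f is monotone increasing (and continuous) in its second argument.
    """
    target = (math.floor(N * f(N, alpha)) + 1) / N
    lo, hi = 0.0, 1.0 - alpha
    if f(N, alpha + hi) < target:
        raise ValueError("target level is not reached on [alpha, 1]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(N, alpha + mid) >= target:
            hi = mid                 # target reached: the root is at or below mid
        else:
            lo = mid                 # target not reached: the root is above mid
    return hi

toy_f = lambda N, a: a               # stand-in only, NOT the Lei-Sudijono f
c_toy = powered_offset(toy_f, N=17, alpha=0.05)
```

With the toy \(f\) the next grid point above 0.05 is 1/17, so the search returns roughly 1/17 - 0.05; plugging in the real \(f\) is what would give the constants quoted above.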

mlsynth.utils.vanillasc_helpers.lto.lto_type_i_bound(N: int, alpha: float) float#

Discrete Type-I error upper bound \(\lfloor N f(N,\alpha)\rfloor/N\).

SCPI prediction intervals#

To request Cattaneo-Feng-Titiunik prediction intervals instead of the placebo test, set inference="scpi". On Prop 99 (outcome-only) this yields an ATT around \(-19\) with a 90% prediction interval that excludes zero, and per-period intervals that widen as the post-period extends.

import pandas as pd
from mlsynth import VanillaSC

d = pd.read_csv("basedata/augmented_cali_long.csv")
d["treated"] = ((d.state == "California") & (d.year >= 1989)).astype(int)

res = VanillaSC({
    "df": d[["state", "year", "cigsale", "treated"]],
    "outcome": "cigsale", "treat": "treated", "unitid": "state", "time": "year",
    "backend": "outcome-only", "inference": "scpi", "alpha": 0.05,
    "scpi_sims": 200, "display_graphs": False,
}).fit()

print(res.inference.ci_lower, res.inference.ci_upper)   # ATT prediction interval
det = res.inference.details                              # per-period sequence
for yr, lo, up in zip(det["periods"], det["pi_lower"], det["pi_upper"]):
    print(yr, round(lo, 1), round(up, 1))

SCPI with the covariate backends (MSCMT and Malo)#

The same inference="scpi" switch composes with the covariate-matching backends. Running each of the three canonical studies under both mscmt and malo (alpha=0.05 -> 90% intervals, scpi_sims=200, seed=1) gives the table below. The ATT prediction interval excludes zero in every case, and the two backends agree to within Monte-Carlo / weight-choice differences – a useful robustness cross-check. Note the v_agreement column: for Prop 99 and Germany under mscmt the predictor weights are non-identified (\(\approx 1\)), so those intervals should be read with the caveat above.

| Study (backend)    | ATT          | ATT 90% PI            | v_agreement             | top donors                        |
|--------------------|--------------|-----------------------|-------------------------|-----------------------------------|
| California (mscmt) | \(-18.98\)   | \([-27.31,\,-5.28]\)  | \(\approx 1\) (fragile) | Utah .34, Nevada .24, Montana .20 |
| California (malo)  | \(-19.60\)   | \([-31.32,\,-3.27]\)  | n/a                     | Utah .38, Montana .25, Nevada .21 |
| Germany (mscmt)    | \(-1396\)    | \([-2368,\,-949]\)    | \(\approx 1\) (fragile) | Austria .40, Switz .16, USA .15   |
| Germany (malo)     | \(-1306\)    | \([-2025,\,-521]\)    | n/a                     | USA .35, Austria .33, Switz .11   |
| Basque (mscmt)     | \(-0.70\)    | \([-1.13,\,-0.32]\)   | \(0.63\)                | Cataluna .84, Madrid .16          |
| Basque (malo)      | \(-0.63\)    | \([-1.14,\,-0.18]\)   | \(\approx 0\) (clean)   | Cataluna .47, Madrid .33          |

The Basque case is the cleanest: with the special-predictor covariates, malo returns a well-identified \(\mathbf{V}\) (v_agreement \(\approx 0\)) and mscmt recovers the published Cataluna/Madrid split, both with tight intervals that exclude zero. The early German post-years (1990-1992) are not significant under either backend – the interval includes zero – and only turn decisively negative later, exactly as the reunification narrative implies.

import pandas as pd
from mlsynth import VanillaSC

# --- California / Prop 99 (ADH 2010) ---
d = pd.read_csv("basedata/augmented_cali_long.csv")
for yr, col in [(1975, "cig_1975"), (1980, "cig_1980"), (1988, "cig_1988")]:
    d[col] = d.state.map(d[d.year == yr].set_index("state").cigsale)
d["treated"] = ((d.state == "California") & (d.year >= 1989)).astype(int)
cov = ["p_cig", "pct15-24", "loginc", "pc_beer", "cig_1975", "cig_1980", "cig_1988"]
win = {"p_cig": (1980, 1988), "pct15-24": (1980, 1988),
       "loginc": (1980, 1988), "pc_beer": (1984, 1988)}
common = dict(df=d, outcome="cigsale", treat="treated", unitid="state", time="year",
              covariates=cov, covariate_windows=win, inference="scpi",
              alpha=0.05, scpi_sims=200, seed=1, display_graphs=False)

mscmt = VanillaSC({**common, "backend": "mscmt", "canonical_v": "min.loss.w"}).fit()
malo  = VanillaSC({**common, "backend": "malo"}).fit()
for name, r in [("mscmt", mscmt), ("malo", malo)]:
    i = r.inference
    print(name, round(r.effects.att, 2), (round(i.ci_lower, 2), round(i.ci_upper, 2)),
          "v_agreement=", r.weights.summary_stats.get("v_agreement"))

# --- German reunification (ADH 2015): outcome "gdp", same pattern ---
# --- Basque (AG 2003): outcome "gdpcap", special-predictor covariates ---
# (swap df/outcome/covariates; everything else is identical.)

The per-period sequence is always in res.inference.details; switching backend changes \(\widehat{\mathbf{w}}\) (and hence the centre and width of the band) but not the inference code path.

Leave-two-out refined placebo test#

Set inference="lto" for the Lei-Sudijono (2025) refined placebo test. It is a drop-in replacement for the ordinary placebo with a much finer p-value grid and valid rejections when \(\alpha < 1/N\).

import pandas as pd
from mlsynth import VanillaSC

d = pd.read_csv("basedata/augmented_cali_long.csv")
d["treated"] = ((d.state == "California") & (d.year >= 1989)).astype(int)

res = VanillaSC({
    "df": d[["state", "year", "cigsale", "treated"]],
    "outcome": "cigsale", "treat": "treated", "unitid": "state", "time": "year",
    "backend": "outcome-only", "inference": "lto", "alpha": 0.05,
    "display_graphs": False,
}).fit()

det = res.inference.details
print(res.inference.p_value)        # naive LTO p-value (703 pairs for N = 39)
print(det["p_powered"], det["powered_offset_c"])   # powered p-value at alpha
print(det["type_i_bound"], det["reject_at_alpha"])

Empirical relations across the three studies#

On the canonical datasets the LTO test reproduces Lei-Sudijono’s (2025) Table 1: it can change the conclusion where the placebo grid is too coarse (German: exact placebo 0.059 does not reject, LTO 0.042 does), is not mechanically smaller (Basque: LTO 0.67 > placebo 0.41), and nearly coincides with the placebo when both reject (Prop 99: 0.024 vs 0.026). The helper constants match the paper exactly (c(39, 0.05) = 0.002, c(17, 0.05) = 0.0125). See VanillaSC — Standard Synthetic Control (ADH 2010/2015; Abadie-Gardeazabal 2003) for the full Table-1 relations and discussion. Because the cost is \(O(N_0^2)\) engine fits, run the covariate-matched (mscmt) version on the smaller studies or cap pairs with lto_max_pairs; the 38-donor Prop 99 outcome-only LTO runs in under two minutes.

References#

Abadie, A., & Gardeazabal, J. (2003). “The Economic Costs of Conflict: A Case Study of the Basque Country.” American Economic Review 93(1):113-132.

Abadie, A., Diamond, A., & Hainmueller, J. (2010). “Synthetic Control Methods for Comparative Case Studies.” Journal of the American Statistical Association 105(490):493-505.

Abadie, A., Diamond, A., & Hainmueller, J. (2015). “Comparative Politics and the Synthetic Control Method.” American Journal of Political Science 59(2):495-510.

Abadie, A., & L’Hour, J. (2021). “A Penalized Synthetic Control Estimator for Disaggregated Data.” Journal of the American Statistical Association 116(536):1817-1834.

Becker, M., & Kloessner, S. (2018). “Fast and Reliable Computation of Generalized Synthetic Controls.” Econometrics and Statistics 5:1-19.

Lei, L., & Sudijono, T. (2025). “Inference for Synthetic Controls via Refined Placebo Tests.” arXiv:2401.07152.

Malo, P., Eskelinen, J., Zhou, X., & Kuosmanen, T. (2024). “Computing Synthetic Controls Using Bilevel Optimization.” Computational Economics 64:1113-1136.