SparseSC — L1 Predictor Selection for Synthetic Controls (Vives-i-Bastida 2022)#
- Estimator: Sparse Synthetic Control (SparseSC), mlsynth.SparseSC
- Source: Vives-i-Bastida, Jaume (2022), “Predictor Selection for Synthetic Controls,” working paper, arXiv:2203.11576 [jaumesparsesc].
- Replication type: Path A (two canonical empirical panels) + Path B (the paper’s own Monte-Carlo study, Section 4 / Figures 1-2).
- Status: Fully verified. Recovers the Abadie, Diamond & Hainmueller (2010) Proposition 99 ATT and the Abadie & Gardeazabal (2003) Basque result (each with the L1 penalty selecting the predictor set), and reproduces the qualitative findings of the paper’s simulation study.
Why Path A#
SparseSC’s contribution is predictor selection: an L1 penalty on the
predictor-importance vector \(v\) drives uninformative predictors to exactly
zero. The natural way to demonstrate that is to hand the estimator a deliberately
over-rich predictor set on the canonical California tobacco-control panel
(Proposition 99) and check that (a) the penalty prunes it back to a sparse,
interpretable subset, and (b) the resulting effect still lands on the
Abadie-Diamond-Hainmueller (ADH) benchmark of roughly \(-19\) packs with the
ADH donor pool. Both the outcome panel and the augmented covariates ship in
basedata/augmented_cali_long.csv, so this is reproducible value-for-value.
The specification#
Outcome / treatment: per-capita cigarette sales, California treated from 1989; pre-period 1970-1988.
Predictors (over-rich, the point of the exercise): 30 economic and policy covariates carried in the augmented panel (collapsed to pre-period unit means) plus three lagged outcomes — cigarette sales in 1975, 1980 and 1988 — giving \(P = 33\) predictor rows against \(N = 38\) donors. The first predictor is pinned to \(v_1 = 1\) to fix the scale; the rest are bound-constrained non-negative.
Penalty selection: the default 51-point grid \(\{0\} \cup \mathrm{logspace}(-4, 0, 50)\), with \(\lambda\) chosen by the unpenalised validation-block MSE.
Outer V-solve: the finite-difference gradient default (see A note on optimisation below).
Inference: the default moving-block conformal CI (Chernozhukov, Wuethrich & Zhu 2021), calibrated on the validation residuals.
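The penalty grid and selection rule in the specification can be sketched in a few lines. This is a minimal illustration, not mlsynth’s internal code: the helper names `penalty_grid` and `select_lambda` are hypothetical, and the validation curve is a made-up toy whose only purpose is to show the argmin step.

```python
import math

def penalty_grid(n_log=50, lo_exp=-4.0, hi_exp=0.0):
    """Return [0] followed by n_log log-spaced values from 10**lo_exp to 10**hi_exp."""
    step = (hi_exp - lo_exp) / (n_log - 1)
    return [0.0] + [10.0 ** (lo_exp + i * step) for i in range(n_log)]

def select_lambda(grid, validation_mse):
    """Pick the grid value whose (unpenalised) validation MSE is smallest."""
    best = min(range(len(grid)), key=lambda i: validation_mse[i])
    return grid[best]

grid = penalty_grid()  # 51 points: {0} plus logspace(-4, 0, 50)

# Hypothetical validation curve with its minimum near 0.02, for illustration only;
# in practice each entry comes from refitting the model at that lambda.
toy_mse = [(math.log10(lam + 1e-6) - math.log10(0.02)) ** 2 for lam in grid]
lam_star = select_lambda(grid, toy_mse)
print(len(grid), grid[1], grid[-1], round(lam_star, 4))
```

Because the penalty is excluded from the selection criterion, the chosen value depends only on out-of-sample fit, and a finer grid simply gives the argmin more candidates to choose from.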
Reproducing the result#
import numpy as np
import pandas as pd
from mlsynth import SparseSC

d = pd.read_csv("basedata/augmented_cali_long.csv")
d["treated"] = ((d.state == "California") & (d.year >= 1989)).astype(int)

# over-rich predictor set: every numeric covariate with complete, non-constant
# pre-period coverage (collapsed to unit means), capped at 30, + lagged outcomes
pre = d[d.year < 1989]
drop = {"state", "year", "treated", "cigsale", "stateno", "state_fips",
        "state_icpsr", "is_a_state", "region"}
covs = [c for c in d.columns
        if c not in drop and d[c].dtype.kind in "if"
        and pre.groupby("state")[c].mean().notna().all()
        and pre.groupby("state")[c].mean().std(ddof=1) > 0][:30]

res = SparseSC({
    "df": d, "outcome": "cigsale", "treat": "treated",
    "unitid": "state", "time": "year",
    "covariates": covs, "outcome_lag_periods": [1975, 1980, 1988],
    "run_inference": True, "inference_method": "conformal",
    "display_graphs": False,  # defaults: FD gradient, 51-pt grid
}).fit()

# count predictors whose importance weight v is not numerically zero
keep = int(np.sum(np.asarray(res.design.v) > 1e-6))
print(f"ATT={res.att:.2f} pre-RMSE={res.pre_rmse:.2f} "
      f"lambda*={res.design.opt_lambda:.4g} predictors kept={keep}/{len(res.design.v)}")
print(f"95% CI=[{res.inference.ci_lower:.2f}, {res.inference.ci_upper:.2f}]")
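For readers unfamiliar with the inference step, the following is a minimal sketch of the generic moving-block permutation p-value from Chernozhukov, Wuethrich & Zhu (2021). It is not mlsynth’s implementation (which, per the specification, calibrates on the validation residuals); the function name and the toy residuals are hypothetical. The idea is to recompute the post-period statistic on every cyclic shift of the residual series and see how extreme the observed value is.

```python
def moving_block_pvalue(residuals, n_post, q=1):
    """Permutation p-value using all cyclic shifts of the residual series.

    residuals: list of gaps (observed minus synthetic), pre-period first,
               computed under the null hypothesis being tested.
    n_post:    number of post-treatment periods at the end of the series.
    """
    T = len(residuals)

    def stat(u):
        # S_q statistic on the last n_post entries of the (shifted) series
        block = u[-n_post:]
        return (sum(abs(x) ** q for x in block) / n_post ** 0.5) ** (1.0 / q)

    observed = stat(residuals)
    shifted = [stat(residuals[j:] + residuals[:j]) for j in range(T)]
    return sum(s >= observed for s in shifted) / T

# Toy residuals, for illustration only: small pre-period noise, large post gaps.
toy = [0.1, -0.2, 0.1, 0.0, -0.1, 0.2, -0.1, 0.1, 5.0, 6.0]
p_effect = moving_block_pvalue(toy, n_post=2)
p_null = moving_block_pvalue([0.1] * 10, n_post=2)
print(p_effect, p_null)
```

A confidence interval follows by test inversion: the p-value is recomputed for a grid of candidate effects, and the interval collects the candidates that are not rejected.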
Results#
| Quantity | SparseSC (augmented) | ADH (2010) benchmark |
|---|---|---|
| ATT, 1989-2000 (packs) | -17.9 | \(\approx -19\) |
| 95% conformal CI | | excludes 0 |
| pre-treatment RMSE | 2.21 | ~1.8 |
| predictors kept (of 33) | 6 | n/a |
| selected \(\lambda\) | 0.019 | n/a |
| donor pool | Utah 0.39, Nevada 0.30, Connecticut 0.20, Colorado 0.12 | Utah / Nevada / Connecticut / Colorado / Montana |
What it confirms#
The penalty selects. From 33 candidate predictors the L1 fit keeps only 6, discarding the bulk of the over-rich augmented set — exactly the variable-selection behaviour that motivates the method.
The effect is canonical. The pruned fit lands at \(-17.9\) packs with a conformal interval excluding zero, close to the ADH \(\approx -19\) benchmark. The four donors it weights (Utah / Nevada / Connecticut / Colorado), drawn from a 38-state pool, all belong to ADH’s five-state synthetic California, which also includes Montana. The selection does not distort the answer.
The durable check lives in benchmarks/cases/sparse_sc_prop99.py:
python benchmarks/run_benchmarks.py --case sparse_sc_prop99
A second case: the Basque Country#
The same estimator, unchanged, reproduces the other canonical SC study —
Abadie & Gardeazabal’s (2003) Basque Country terrorism analysis — on the full
predictor set shipped in basedata/basque_data.csv. Here the outcome is real
GDP per capita, the Basque Country is treated from 1975, and the predictors are
the A&G schooling shares, sectoral GVA shares, investment ratio and population
density (collapsed to pre-period unit means), plus three lagged outcomes (GDP
per capita in 1960, 1965, 1969) — \(P = 15\) against \(N = 16\) donors.
import pandas as pd
from mlsynth import SparseSC

d = pd.read_csv("basedata/basque_data.csv")
d["treated"] = ((d.regionname == "Basque Country (Pais Vasco)")
                & (d.year >= 1975)).astype(int)

# A&G predictors: schooling shares + investment ratio, sectoral GVA shares
ed = ["school.illit", "school.prim", "school.med", "school.high", "invest"]
sec = ["sec.agriculture", "sec.energy", "sec.industry", "sec.construction",
       "sec.services.venta", "sec.services.nonventa"]

res = SparseSC({
    "df": d, "outcome": "gdpcap", "treat": "treated",
    "unitid": "regionname", "time": "year",
    "covariates": ed + sec + ["popdens"],
    "outcome_lag_periods": [1960, 1965, 1969],
    "run_inference": True, "inference_method": "conformal",
    "display_graphs": False,
}).fit()
| Quantity | SparseSC | A&G (2003) benchmark |
|---|---|---|
| ATT, 1975-1997 (GDP p.c., thousands) | -0.65 | peak \(\approx -0.85\) |
| 95% conformal CI | | excludes 0 |
| pre-treatment RMSE | 0.092 | n/a |
| predictors kept (of 15) | 3 (illit, non-market services, popdens) | n/a |
| donor pool | Cataluna 0.82, Madrid 0.16, Cantabria 0.03 | Catalonia + Madrid |
| gap trajectory | opens ~1978, peak \(-0.95\) (1990), \(-0.75\) by 1997 | widening through the 1980s |
What it adds: SparseSC recovers A&G’s actual two-donor synthetic — Catalonia
plus Madrid — rather than the single-donor Catalonia the penalized/MSCMT backends
collapse to on this panel, while pruning 15 predictors to 3 and achieving a
tighter pre-fit (RMSE 0.092) than either. The effect (\(-0.65\) average,
peaking \(-0.95\)) lands on the A&G result. Notably the penalty keeps
popdens as informative here — the very dimension that destabilised the
Abadie-L’Hour bias correction on this panel: density genuinely helps explain
GDP (a good predictor) even though it cannot be safely extrapolated in a
residual correction.
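The summary quantities reported in the tables (ATT, pre-treatment RMSE, and the peak of the gap trajectory) are all simple functions of the gap series. A minimal sketch, assuming the ATT is the mean post-treatment gap and the peak is the most negative post-treatment gap; the helper name and the numbers are illustrative only and are not the Basque data:

```python
import math

def summarise_gap(years, observed, synthetic, first_treated_year):
    """Return ATT, pre-period RMSE, and the most negative post-period gap."""
    gaps = [o - s for o, s in zip(observed, synthetic)]
    pre = [g for y, g in zip(years, gaps) if y < first_treated_year]
    post = [(y, g) for y, g in zip(years, gaps) if y >= first_treated_year]
    att = sum(g for _, g in post) / len(post)
    pre_rmse = math.sqrt(sum(g * g for g in pre) / len(pre))
    peak_year, peak_gap = min(post, key=lambda yg: yg[1])
    return {"att": att, "pre_rmse": pre_rmse,
            "peak_year": peak_year, "peak_gap": peak_gap}

# Illustrative numbers only.
years = [1973, 1974, 1975, 1976, 1977, 1978]
observed = [5.0, 5.2, 5.3, 5.3, 5.2, 5.4]
synthetic = [5.1, 5.1, 5.4, 5.6, 5.8, 5.9]
summary = summarise_gap(years, observed, synthetic, first_treated_year=1975)
print(summary)
```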
Path B: the simulation study (Section 4)#
The paper’s Monte-Carlo (Figures 1-2) is reproduced from a linear factor model with a grouped structure (the design of the companion paper, Abadie & Vives-i-Bastida 2022, extended with covariates):
\[ Y_{it} = \delta_t + \theta_t Z_i + \lambda_t \mu_i + \varepsilon_{it}, \]
with J+1 = 21 units in 7 groups of 3 sharing a one-hot factor loading
mu_i; common factors lambda_t AR(1) (rho = 0.5, standard-Gaussian
innovations); delta_t = 100; eps ~ N(0, 0.25^2). Covariates split into
useful Z^1 (nonzero theta) and nuisance Z^2 (zero theta), all drawn U[0,1].
The treated unit’s useful predictors are set to (Z_2 + Z_3)/2 and it shares
the group of units 2 and 3, so the oracle synthetic control is
w_2 = w_3 = 1/2. The design matrix adds 10 lagged outcomes (20 predictors
total). Two regimes: k1=k2=5 (balanced) and k1=1, k2=9 (nuisance-heavy).
The true effect is zero, so post-treatment MSE measures counterfactual
prediction error. Three estimators are compared, mapped onto SparseSC’s knobs:
SCM: the standard control with the fixed Mahalanobis weight V = (X0' X0)^{-1} (no optimisation; solved as min_w (X1-X0 w)'V(X1-X0 w) on the simplex).
SCM λ=0: V minimises the validation-block outcome fit with no penalty (outer_loss_window="validation", lambda=0; Abadie-Diamond-Hainmueller 2015).
Sparse: the same with the L1 penalty and CV-selected lambda.
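All three estimators share the same inner step: given predictor weights, find donor weights on the simplex that minimise the weighted predictor discrepancy. A minimal sketch of that step follows, under two labelled assumptions: V is diagonal (a vector `v`, as with SparseSC’s predictor-importance vector), and the solver is plain projected gradient descent, which is a stand-in and not the solver mlsynth uses. The function names and toy matrices are hypothetical.

```python
def project_to_simplex(x):
    """Euclidean projection of x onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(x, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        if u_k - candidate > 0:
            theta = candidate  # keep the threshold from the largest valid k
    return [max(x_i - theta, 0.0) for x_i in x]

def scm_weights(X1, X0, v, n_iter=2000):
    """Minimise sum_p v[p] * (X1[p] - sum_j X0[p][j] * w[j])**2 over the simplex."""
    P, J = len(X0), len(X0[0])
    # Step size 1/L, with L bounded by twice the trace of X0' diag(v) X0.
    lipschitz = 2.0 * sum(v[p] * X0[p][j] ** 2 for p in range(P) for j in range(J))
    step = 1.0 / lipschitz
    w = [1.0 / J] * J  # start from uniform weights
    for _ in range(n_iter):
        resid = [X1[p] - sum(X0[p][j] * w[j] for j in range(J)) for p in range(P)]
        grad = [-2.0 * sum(v[p] * X0[p][j] * resid[p] for p in range(P))
                for j in range(J)]
        w = project_to_simplex([w[j] - step * grad[j] for j in range(J)])
    return w

# Toy problem: the treated unit is the average of donors 0 and 1.
X0 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0],
      [0.0, 0.0, 1.0],
      [1.0, 1.0, 1.0]]   # rows = predictors, columns = donors
X1 = [0.5, 0.5, 0.0, 1.0]
w = scm_weights(X1, X0, v=[1.0, 1.0, 1.0, 1.0])
print([round(w_j, 4) for w_j in w])
```

The estimators then differ only in how `v` is obtained: fixed in advance (SCM), tuned on the validation block without a penalty (SCM λ=0), or tuned with the L1 penalty (Sparse).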
Results (B = 60 draws; “Vnoise” is the V-weight assigned to the nuisance covariates):
| setting / method | post-MSE | \|bias\| | val-RMSE | w₂+w₃ | Vnoise |
|---|---|---|---|---|---|
| k1=5,k2=5 SCM | 1.78 | 0.06 | 1.27 | 0.25 | n/a |
| SCM λ=0 | 0.132 | 0.001 | 0.278 | 0.92 | 5.27 |
| Sparse | 0.153 | 0.010 | 0.255 | 0.90 | 0.16 |
| k1=1,k2=9 SCM | 1.99 | 0.02 | 1.28 | 0.17 | n/a |
| SCM λ=0 | 0.164 | 0.003 | 0.253 | 0.89 | 0.85 |
| Sparse | 0.141 | 0.016 | 0.224 | 0.90 | 0.21 |
The paper’s three headline findings reproduce:
Standard SCM is decisively worst (post-MSE ~1.8-2.0; it cannot even concentrate weight on the right donors, w₂+w₃ ≈ 0.2).
Sparse is robust across regimes while the unpenalised method degrades. Moving from balanced to nuisance-heavy, SCM λ=0 worsens 0.132 → 0.164 while Sparse stays flat (0.153 → 0.141) and overtakes it: Figure 1b’s central result.
Sparse performs the selection (Figure 2): it drives the nuisance-predictor weights to ≈ 0 (Vnoise 0.16-0.21) where SCM λ=0 piles weight on them (5.27, 0.85), and it carries the lowest validation RMSE throughout.
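The table columns can be read as averages over Monte-Carlo draws. The sketch below shows one plausible way to compute them. The definitions are assumptions, not taken from the paper or the benchmark code: post-MSE as the mean squared post-treatment gap, |bias| as the absolute mean gap, w₂+w₃ as the weight on the two oracle donors, and Vnoise as the total V-weight on nuisance covariates. The dictionary keys and toy draws are hypothetical.

```python
def summarise_draws(draws, donors_of_interest, nuisance_idx):
    """Average Monte-Carlo diagnostics over a list of per-draw results.

    Each draw is a dict with hypothetical keys:
      "post_gap": post-treatment gaps (true effect is zero, so gap = error),
      "w":        donor weights,
      "v":        predictor weights.
    """
    B = len(draws)
    post_mse = sum(sum(g * g for g in d["post_gap"]) / len(d["post_gap"])
                   for d in draws) / B
    mean_gap = sum(sum(d["post_gap"]) / len(d["post_gap"]) for d in draws) / B
    w_oracle = sum(sum(d["w"][j] for j in donors_of_interest) for d in draws) / B
    v_noise = sum(sum(d["v"][p] for p in nuisance_idx) for d in draws) / B
    return {"post_mse": post_mse, "abs_bias": abs(mean_gap),
            "w_oracle": w_oracle, "v_noise": v_noise}

# Two toy draws, for illustration only.
draws = [
    {"post_gap": [0.5, -0.5], "w": [0.5, 0.4, 0.1], "v": [1.0, 0.0, 0.2]},
    {"post_gap": [0.5, 0.5], "w": [0.4, 0.4, 0.2], "v": [1.0, 0.1, 0.0]},
]
out = summarise_draws(draws, donors_of_interest=[0, 1], nuisance_idx=[1, 2])
print(out)
```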
Honest caveats: absolute MSE levels are not comparable to the paper. The
useful-predictor coefficient scale theta_t is not pinned down in either
paper (the companion design has no covariate term), so it is filled in, in
the spirit of the theta_t Z_i term, as time-varying N(0,1). In the easy
k1=k2=5 regime Sparse ties rather than strictly beats SCM λ=0, which is
within the latitude of that unspecified constant, B = 60 Monte-Carlo noise,
and a coarser (21-point) penalty grid. The ordering and the
robustness/selection mechanism, which are the paper’s actual claims, reproduce.
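The data-generating process described in this section, including the N(0,1) fill for theta_t, can be sketched as follows. Several details are assumptions made for the sketch and are not specified above: the number of periods (`n_periods=30`), starting the AR(1) factors at zero, and the function name itself. The function also returns the noise-free signal so the oracle property can be checked directly.

```python
import random

def simulate_panel(k1=5, k2=5, n_periods=30, rho=0.5, sigma=0.25, seed=0):
    """Simulate Y[i][t] = 100 + theta_t . Z1_i + lambda_t[group(i)] + eps."""
    rng = random.Random(seed)
    n_units, n_groups = 21, 7
    group = [i // 3 for i in range(n_units)]  # units 0, 1, 2 share group 0
    # Covariates: k1 useful columns then k2 nuisance columns, all U[0, 1].
    Z = [[rng.random() for _ in range(k1 + k2)] for _ in range(n_units)]
    # Treated unit (index 0): useful covariates average those of units 1 and 2.
    for k in range(k1):
        Z[0][k] = 0.5 * (Z[1][k] + Z[2][k])
    factors = [0.0] * n_groups  # AR(1) factors, started at zero (assumption)
    signal = [[0.0] * n_periods for _ in range(n_units)]
    Y = [[0.0] * n_periods for _ in range(n_units)]
    for t in range(n_periods):
        factors = [rho * f + rng.gauss(0.0, 1.0) for f in factors]
        theta = [rng.gauss(0.0, 1.0) for _ in range(k1)]  # nuisance coefs are zero
        for i in range(n_units):
            m = 100.0 + sum(theta[k] * Z[i][k] for k in range(k1)) + factors[group[i]]
            signal[i][t] = m
            Y[i][t] = m + rng.gauss(0.0, sigma)
    return Y, signal, Z

Y, signal, Z = simulate_panel(k1=5, k2=5)
# Oracle check: the treated unit's signal equals the average of units 1 and 2.
max_dev = max(abs(signal[0][t] - 0.5 * (signal[1][t] + signal[2][t]))
              for t in range(len(signal[0])))
print(len(Y), len(Y[0]), max_dev < 1e-9)
```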
A note on optimisation (and grid resolution)#
Two implementation facts surface naturally here and are worth recording:
Grid resolution does the heavy lifting. A coarse 21-point grid lands at \(-20.8\) packs with pre-RMSE 4.77 and 25 predictors retained — the penalty barely bites. The full 51-point default grid finds the better \(\lambda\) (0.019), achieves true sparsity (6 predictors), halves the pre-RMSE, and pulls the ATT onto \(-17.9\). The default grid matters more than any micro-optimisation of the solver.
Keep the finite-difference gradient default. SparseSC also ships an envelope-theorem closed-form gradient (use_analytical_grad=True) that is ~5-10x faster, but on this augmented spec it settles on a much worse critical point (pre-RMSE \(\approx 10\), no predictor selection) that even multi-start restarts do not escape; the finite-difference path’s gradient noise is what finds the good basin. The analytical gradient is therefore opt-in, and the verified result above uses the FD default.
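For reference, a finite-difference gradient approximates each partial derivative by perturbing one coordinate at a time, which costs extra function evaluations per coordinate but needs no closed form. The sketch below uses central differences on a toy quadratic and compares against the exact gradient. It is a generic illustration: whether mlsynth uses forward or central differences is not stated here, and the function names are hypothetical.

```python
def fd_gradient(f, v, h=1e-6):
    """Central finite-difference approximation to the gradient of f at v."""
    grad = []
    for i in range(len(v)):
        up = list(v)
        up[i] += h
        down = list(v)
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad

def toy_loss(v):
    """Toy quadratic with minimum at v = [0, 1, 2, ...]."""
    return sum((v_i - i) ** 2 for i, v_i in enumerate(v))

def toy_grad(v):
    """Closed-form gradient of toy_loss."""
    return [2.0 * (v_i - i) for i, v_i in enumerate(v)]

point = [0.5, 0.5, 0.5]
numeric = fd_gradient(toy_loss, point)
exact = toy_grad(point)
print(numeric, exact)
```

The two agree closely on a smooth toy like this one; the difference described above arises on the real, non-convex outer problem, where the two gradient paths can settle in different basins.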
Note
“Augmented” here is an automatically selected 30-covariate proxy (every numeric covariate with complete, non-constant pre-period coverage), not a hand-curated predictor list. The headline match — sparse selection, ADH-range ATT, ADH donor pool — is robust to that choice; the exact \(\lambda\) and retained-predictor identities will shift with the predictor set.