Benchmarks
Every estimator in mlsynth ships with at least one durable benchmark: a
self-contained case under benchmarks/cases/ that re-runs a published result
(or a reference implementation) and asserts the headline numbers against a fixed
tolerance. Where the Replications page tells the story of each
validation in prose, this page documents the machinery – the runnable cases
that guard against regressions as the library changes.
Each case is a small module exposing two names: run(), which drives everything
through mlsynth’s public API and returns a dict of metrics, and EXPECTED, a map
from each metric to a (value, tolerance) pair. The driver compares the two. A
case that cannot find its data or an optional reference dependency raises
BenchmarkSkipped rather than failing.
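The case contract described above can be sketched as follows. This is a minimal illustration, not the project's driver: the metric names, the numbers, the `compare` and `check_case` helpers, and the local `BenchmarkSkipped` class are all hypothetical stand-ins, and the tolerance is assumed to be absolute (the real driver may use a different rule).

```python
import math


class BenchmarkSkipped(Exception):
    """Local stand-in for the skip signal named above (hypothetical definition)."""


# Hypothetical case contents: metric -> (expected value, absolute tolerance).
EXPECTED = {"att": (-19.5, 0.5), "pre_rmse": (1.8, 0.2)}


def run():
    """A real case would call mlsynth's public API; fixed numbers stand in here."""
    return {"att": -19.3, "pre_rmse": 1.75}


def compare(metrics, expected):
    """Return {metric: (observed, target, passed)} under an absolute tolerance."""
    report = {}
    for name, (target, tol) in expected.items():
        observed = metrics.get(name, math.nan)
        # A missing metric becomes NaN, and any comparison with NaN is False.
        passed = abs(observed - target) <= tol
        report[name] = (observed, target, passed)
    return report


def check_case(run_fn, expected):
    """Run one case and classify it as 'passed', 'failed', or 'skipped'."""
    try:
        metrics = run_fn()
    except BenchmarkSkipped as exc:
        return "skipped", str(exc)
    report = compare(metrics, expected)
    status = "passed" if all(ok for _, _, ok in report.values()) else "failed"
    return status, report


if __name__ == "__main__":
    print(check_case(run, EXPECTED))
```

Treating a skip as its own outcome, separate from a failure, keeps a missing dataset or an absent reference package from showing up as a regression.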
Running them
python benchmarks/run_benchmarks.py --all # every pure-Python case
python benchmarks/run_benchmarks.py --case cwz_ttest # one case
python benchmarks/run_benchmarks.py --with-reference # also R / external cross-checks
The registry of cases lives in benchmarks/registry.py (the source of truth);
the catalogue below is grouped by validation path.
Validation paths
Path A – reproduce the source paper’s empirical result on the original authors’ data.
Path B – reproduce the paper’s Monte Carlo / simulation table.
Cross-validation – match an authoritative reference implementation (an R/MATLAB package or the authors’ own code); these skip themselves when the optional dependency is absent.
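The self-skipping behaviour of cross-validation cases could look something like the sketch below. It assumes a case checks for its optional dependency before doing any work. The helper names `require_executable` and `require_module` and the local `BenchmarkSkipped` class are hypothetical, not part of mlsynth. Only the standard-library calls `shutil.which` and `importlib.util.find_spec` are real.

```python
import importlib.util
import shutil


class BenchmarkSkipped(Exception):
    """Local stand-in for the skip signal (hypothetical definition)."""


def require_executable(name):
    """Return the full path of `name` on PATH, or raise BenchmarkSkipped."""
    path = shutil.which(name)
    if path is None:
        raise BenchmarkSkipped(f"{name} not found on PATH")
    return path


def require_module(name):
    """Raise BenchmarkSkipped if the top-level Python package `name` is missing."""
    if importlib.util.find_spec(name) is None:
        raise BenchmarkSkipped(f"Python package {name!r} is not installed")


def run():
    """Hypothetical cross-validation case that needs a live R installation."""
    rscript = require_executable("Rscript")  # skips cleanly when R is absent
    # ... a real case would invoke the reference implementation here ...
    return {"rscript_path": rscript}
```

Calling `run()` on a machine without R raises `BenchmarkSkipped`, which a driver can report as a skip instead of a failure.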
Path A – empirical replications
| Case | Validates |
|---|---|
| | RPCA-SC West Germany |
| | CWZ 2025 Table 5 carbon-tax debiased t-test |
| | DSC distributional SC on Dube minimum-wage (Gunsilius/DiSCo vignette) |
| | DSCAR Beijing PM2.5 alerts (Zheng-Chen) |
| | HK GDP empirical |
| | forward-selected SC (Prop 99) |
| | HSC HK handover |
| | Walmart placebo design |
| | dense L-inf vs sparse SC (Prop 99) |
| | MAREX Walmart placebo design (Abadie-Zhao SCDesign, 10-store subset) |
| | MASC Basque/ETA (KMPT Sec 5) |
| | Shi-Wang Brexit multi-treated-units L2-relaxation |
| | PDA methods on HK CEPA (Shi-Wang App E.1) |
| | Shi-Huang China luxury-watch fsPDA (prewhitened-NW) |
| | Shi-Wang China PPI L2-relaxation (real-estate policy) |
| | SCM-relaxation Brexit/UK GDP (2016Q3) |
| | SCM-relaxation Brexit robustness (2020Q1) |
| | Lee-Wooldridge Prop99 + castle |
| | SBC German reunification |
| | Tian et al. West Germany balance |
| | L1 predictor selection (Prop 99) |
| | SPCD design vs random/SC on Prop 99 (Lu et al. 2022) |
| | grossi direct+spillover German reunification (Grossi et al.) |
| | inclusive SCM German reunification (Di Stefano-Mellace) |
| | iterative waterfall SCM German reunification (Melnychuk) |
| | SPOTSYNTH donor-spillover screening: Germany/California/Basque (Fig 6) + detection (Fig 2) + debias (Fig 4) |
| | Brooklyn showroom (Li-Shankar) |
| | canonical ADH 2010 Prop 99 |
Path B – Monte Carlo / simulation
| Case | Validates |
|---|---|
| | ASCM near-nominal coverage + bias reduction (BMR 2021 Sec 7) |
| | ClusterSC vs RSC |
| | CTSC vs two-way FE bias (Powell 2022 Table 1) |
| | CWZ 2025 Table 3 application-based Monte Carlo |
| | DR/PIPW recovery + double-robustness (Qiu et al. normal DGP) |
| | simulation |
| | FMA asymptotic-CI coverage robust to variance (Li-Sonnier) |
| | HSC regime adaptation |
| | Abadie-Zhao design sim |
| | L-inf vs SC (Wang-Xing-Ye Table 4) |
| | MSQRT unbiasedness + RMSE noise-floor (Shen-Song-Abadie Sec 6) |
| | nonlinear coverage + error-shrinks-with-J |
| | PANGEO trajectory match vs scalar (Chen et al.) |
| | Shi-Wang Table 2 L2-relaxation size/power |
| | Li-Bell Table 2 LASSO-PDA OOS prediction (N>T1) |
| | Jiang et al. 2025 prediction-interval coverage (Tables 2-5) |
| | Shi-Huang Table 1 fs-vs-LASSO size/power geometry |
| | PI/PIS/PIPost vs SC under trending factor (Liu et al.) |
| | latent-group MC, relaxations beat SCM |
| | RSC train≈gen error |
| | Shi-Xi-Xie MSE ratios |
| | Sun averaged regime geometry |
| | Tian Table 1 / Sun Sim1 |
| | SSDiD vs DiD coverage/RMSE |
| | SHC latent-confounder recovery (Chen-Yang-Yang Sec 3.1) |
| | SIV vs 2SLS-TWFE bias (Gulek-Vives Table 1) |
| | ORTHSC carbon-tax ATT/p/K/CI (Fry; Andersson 2019 data, vs live R) |
| | ORTHSC fixed-smoothing t-test size control + power (Fry Tables 1-2) |
| | SAR spillover recovery + SCM nesting (Sakaguchi-Tagawa) |
| | SPSC IFEM recovery + DT-vs-NoDT coverage (Park-Tchetgen) |
| | Doudchenko et al. 2021 Monte Carlo (BLS unemployment) |
| | TASC vs SC state-space ablation (Rho et al.) |
| | Figure 2 MSE-ratio grid |
Cross-validation against reference implementations
| Case | Validates |
|---|---|
| | vs augsynth: Kansas ridge-ASCM ladder (SCM/ridge/covariate/residualized) |
| | vs authors’ repo |
| | vs LIVE augsynth (Rscript): lambda/weights/ATT (skips if absent) |
| | vs LIVE pensynth wsoll1 (Rscript+LowRankQP): penalized SC weights/ATT on Prop 99 (skips if absent) |
| | vs GeoLiftMarketSelection: CPIC investment value-for-value |
| | vs augsynth: multi-cell per-cell ATT + donor exclusion |
| | vs GeoLift/augsynth: GeoLift_Walkthrough realized report (fixedeff ASCM + conformal) |
| | LINF vs LinfinitySC (skips if absent) |
| | vs causaltensor |
| | vs R microsynth panel method (Seattle DMI) |
| | vs Bottmer’s mlSC_estimator (skips if absent) |
| | vs Tian’s NSC.R (Prop 99 Table 2) |
| | vs augsynth::multisynth (jackknife + bootstrap SEs) |
| | vs freshtaste/proximal (Panic 1907 Table 3) |
| | vs scmrelax (skips if absent) |
| | Shen CIs + coverage |
| | vs causaltensor |
| | vs Agarwal-Shah-Shen 2026 authors’ code (Prop 99) |
| | vs deshen24/syntheticNN (Prop 99) |
| | vs Melnychuk-Andrii/Spillover-SCM (inclusive SCM German) |
| | vs jcao0/synthetic-control-spillover (Cao-Dowd Prop 99) |
| | vs authors’ repo |
| | vs jcao0/staggered_synthetic_control (criminality Sec 4) |