Benchmarks

Every estimator in mlsynth ships with at least one durable benchmark: a self-contained case under benchmarks/cases/ that re-runs a published result (or a reference implementation) and asserts that the headline numbers match to within a fixed tolerance. Where the Replications page tells the story of each validation in prose, this page documents the machinery – the runnable cases that guard against regressions as the library changes.

Each case is a small module exposing run(), which drives everything through mlsynth’s public API and returns a dict of metrics, and EXPECTED, a map from each metric to a (value, tolerance) pair. The driver compares the two. A case that cannot find its data or an optional reference dependency raises BenchmarkSkipped rather than failing.
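As a rough illustration of that contract, the sketch below shows what a case module and the comparison step could look like. It is not the library's actual driver: the metric names and numbers are placeholders, the tolerance is assumed to be absolute, BenchmarkSkipped is redefined locally as a stand-in, and check() and drive() are hypothetical helpers.

```python
import math


class BenchmarkSkipped(Exception):
    """Local stand-in for the exception named in the text, so the sketch runs alone."""


# Hypothetical expectations: metric name -> (expected value, absolute tolerance).
EXPECTED = {
    "att": (-1.50, 0.05),
    "pre_rmse": (0.20, 0.02),
}


def run():
    """Return a dict of metrics; a real case would call mlsynth's public API here."""
    # Placeholder numbers so the example runs without any data.
    return {"att": -1.48, "pre_rmse": 0.21}


def check(metrics, expected):
    """Return (metric, got, want, tol) tuples for every metric outside tolerance."""
    failures = []
    for name, (want, tol) in expected.items():
        got = metrics.get(name, math.nan)
        # A missing or NaN metric fails, because the comparison is False for NaN.
        if not abs(got - want) <= tol:
            failures.append((name, got, want, tol))
    return failures


def drive(case_run, expected):
    """Classify one case as 'pass', 'fail', or 'skip'."""
    try:
        metrics = case_run()
    except BenchmarkSkipped:
        return "skip"
    return "fail" if check(metrics, expected) else "pass"


if __name__ == "__main__":
    print(drive(run, EXPECTED))
```

Treating a skip as a third outcome, distinct from pass and fail, keeps a missing dataset or R installation from being reported as a regression.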

Running them

python benchmarks/run_benchmarks.py --all            # every pure-Python case
python benchmarks/run_benchmarks.py --case cwz_ttest  # one case
python benchmarks/run_benchmarks.py --with-reference  # also R / external cross-checks

The registry of cases lives in benchmarks/registry.py (the source of truth); the catalogue below is grouped by validation path.
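To make the relationship between the command-line flags and a registry concrete, here is a minimal sketch. The flag names come from the commands above, but everything else is an assumption: CaseEntry, REGISTRY, build_parser() and select() are hypothetical, the reference flags attached to the three case names are illustrative guesses, and the real benchmarks/registry.py may be organised quite differently.

```python
import argparse
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseEntry:
    name: str
    group: str             # hypothetical labels: "A", "B", or "xval"
    needs_reference: bool  # True if the case relies on R or other external code


# Tiny hypothetical registry keyed by case name (insertion order is preserved).
REGISTRY = {
    e.name: e
    for e in (
        CaseEntry("cwz_ttest", "A", False),
        CaseEntry("cwz_mc", "B", False),
        CaseEntry("rescm_relax_ref", "xval", True),
    )
}


def build_parser():
    """Parse the three flags shown in the commands above."""
    parser = argparse.ArgumentParser(description="Hypothetical benchmark driver")
    parser.add_argument("--all", action="store_true", help="every pure-Python case")
    parser.add_argument("--case", help="run one named case")
    parser.add_argument("--with-reference", action="store_true",
                        help="also run R / external cross-checks")
    return parser


def select(args):
    """Return the registry entries a given flag combination would run."""
    if args.case is not None:
        return [REGISTRY[args.case]]  # KeyError for an unknown case name
    if args.all or args.with_reference:
        # Reference-backed cases are included only when explicitly requested.
        return [e for e in REGISTRY.values()
                if args.with_reference or not e.needs_reference]
    return []


if __name__ == "__main__":
    chosen = select(build_parser().parse_args())
    print([e.name for e in chosen])
```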

Validation paths

  • Path A – reproduce the source paper’s empirical result on the original authors’ data.

  • Path B – reproduce the paper’s Monte Carlo / simulation table.

  • Cross-validation – match an authoritative reference implementation (an R/MATLAB package or the authors’ own code); these skip themselves when the optional dependency is absent.
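The self-skipping behaviour of cross-validation cases can be sketched with two small guards, one for an external executable such as Rscript and one for an optional Python package. This is only an illustration under stated assumptions: require_executable() and require_module() are hypothetical helpers, and BenchmarkSkipped is redefined locally as a stand-in for the library's exception.

```python
import importlib.util
import shutil


class BenchmarkSkipped(Exception):
    """Local stand-in for the exception named in the text, so the sketch runs alone."""


def require_executable(name):
    """Return the full path of an executable, or skip the case if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        raise BenchmarkSkipped(f"{name} not found on PATH")
    return path


def require_module(name):
    """Skip the case if an optional top-level Python package is not importable."""
    # find_spec returns None for a missing top-level module without importing it.
    if importlib.util.find_spec(name) is None:
        raise BenchmarkSkipped(f"optional dependency {name!r} is not installed")


def run():
    """Hypothetical cross-validation case: check dependencies before doing any work."""
    require_executable("Rscript")
    # ... a real case would call the reference implementation and mlsynth here ...
    return {}


if __name__ == "__main__":
    try:
        run()
        print("reference available")
    except BenchmarkSkipped as exc:
        print(f"skipped: {exc}")
```

Checking dependencies at the top of run() means a case skips before loading data or fitting anything, so a machine without R still gets a fast, clean run of the pure-Python cases.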

Path A – empirical replications

| Case | Validates |
| --- | --- |
| clustersc_rpca_germany | RPCA-SC West Germany |
| cwz_ttest | CWZ 2025 Table 5 carbon-tax debiased t-test |
| dsc_dube | DSC distributional SC on Dube minimum-wage (Gunsilius/DiSCo vignette) |
| dscar_beijing | DSCAR Beijing PM2.5 alerts (Zheng-Chen) |
| fdid_hongkong | HK GDP empirical |
| fscm_prop99 | forward-selected SC (Prop 99) |
| hsc_hongkong | HSC HK handover |
| lexscm_walmart | Walmart placebo design |
| linf_prop99 | dense L-inf vs sparse SC (Prop 99) |
| marex_walmart | MAREX Walmart placebo design (Abadie-Zhao SCDesign, 10-store subset) |
| masc_basque | MASC Basque/ETA (KMPT Sec 5) |
| pda_brexit | Shi-Wang Brexit multi-treated-units L2-relaxation |
| pda_hongkong | PDA methods on HK CEPA (Shi-Wang App E.1) |
| pda_luxurywatch | Shi-Huang China luxury-watch fsPDA (prewhitened-NW) |
| pda_ppi | Shi-Wang China PPI L2-relaxation (real-estate policy) |
| rescm_brexit | SCM-relaxation Brexit/UK GDP (2016Q3) |
| rescm_brexit_2020 | SCM-relaxation Brexit robustness (2020Q1) |
| rolldid_lw | Lee-Wooldridge Prop99 + castle |
| sbc_germany | SBC German reunification |
| scmo_germany | Tian et al. West Germany balance |
| sparse_sc_prop99 | L1 predictor selection (Prop 99) |
| spcd_prop99 | SPCD design vs random/SC on Prop 99 (Lu et al. 2022) |
| spillsynth_grossi_germany | grossi direct+spillover German reunification (Grossi et al.) |
| spillsynth_iscm_germany | inclusive SCM German reunification (Di Stefano-Mellace) |
| spillsynth_iterative_germany | iterative waterfall SCM German reunification (Melnychuk) |
| spotsynth_real_data | SPOTSYNTH donor-spillover screening: Germany/California/Basque (Fig 6) + detection (Fig 2) + debias (Fig 4) |
| tssc_brooklyn | Brooklyn showroom (Li-Shankar) |
| vanillasc_prop99 | canonical ADH 2010 Prop 99 |

Path B – Monte Carlo / simulation

| Case | Validates |
| --- | --- |
| augsynth_calibrated | ASCM near-nominal coverage + bias reduction (BMR 2021 Sec 7) |
| clustersc_subgroups | ClusterSC vs RSC |
| ctsc_powell_mc | CTSC vs two-way FE bias (Powell 2022 Table 1) |
| cwz_mc | CWZ 2025 Table 3 application-based Monte Carlo |
| dr_proximal_mc | DR/PIPW recovery + double-robustness (Qiu et al. normal DGP) |
| fdid_table5 | simulation |
| fma_coverage_mc | FMA asymptotic-CI coverage robust to variance (Li-Sonnier) |
| hsc_mc | HSC regime adaptation |
| lexscm_design_mc | Abadie-Zhao design sim |
| linf_sim | L-inf vs SC (Wang-Xing-Ye Table 4) |
| msqrt_sim | MSQRT unbiasedness + RMSE noise-floor (Shen-Song-Abadie Sec 6) |
| nsc_mc | nonlinear coverage + error-shrinks-with-J |
| pangeo_supergeo_mc | PANGEO trajectory match vs scalar (Chen et al.) |
| pda_l2_sim | Shi-Wang Table 2 L2-relaxation size/power |
| pda_lasso_sim | Li-Bell Table 2 LASSO-PDA OOS prediction (N>T1) |
| pda_pi_coverage | Jiang et al. 2025 prediction-interval coverage (Tables 2-5) |
| pda_table1 | Shi-Huang Table 1 fs-vs-LASSO size/power geometry |
| proximal_surrogates_mc | PI/PIS/PIPost vs SC under trending factor (Liu et al.) |
| rescm_relax_mc | latent-group MC, relaxations beat SCM |
| rsc_synth_error | RSC train≈gen error |
| sbc_mc | Shi-Xi-Xie MSE ratios |
| scmo_averaged_mc | Sun averaged regime geometry |
| scmo_concatenated_mc | Tian Table 1 / Sun Sim1 |
| seq_sdid_mc | SSDiD vs DiD coverage/RMSE |
| shc_recovery_mc | SHC latent-confounder recovery (Chen-Yang-Yang Sec 3.1) |
| siv_syria_mc | SIV vs 2SLS-TWFE bias (Gulek-Vives Table 1) |
| orthsc_carbontax | ORTHSC carbon-tax ATT/p/K/CI (Fry; Andersson 2019 data, vs live R) |
| orthsc_size_power | ORTHSC fixed-smoothing t-test size control + power (Fry Tables 1-2) |
| spillsynth_sar_mc | SAR spillover recovery + SCM nesting (Sakaguchi-Tagawa) |
| spsc_ifem_mc | SPSC IFEM recovery + DT-vs-NoDT coverage (Park-Tchetgen) |
| syndes_bls | Doudchenko et al. 2021 Monte Carlo (BLS unemployment) |
| tasc_mc | TASC vs SC state-space ablation (Rho et al.) |
| tssc_figure2 | Figure 2 MSE-ratio grid |

Cross-validation against reference implementations

| Case | Validates |
| --- | --- |
| ascm_kansas | vs augsynth: Kansas ridge-ASCM ladder (SCM/ridge/covariate/residualized) |
| clustersc_subgroups_ref | vs authors’ repo |
| geolift_augsynth_ref | vs LIVE augsynth (Rscript): lambda/weights/ATT (skips if absent) |
| pensynth_prop99 | vs LIVE pensynth wsoll1 (Rscript+LowRankQP): penalized SC weights/ATT on Prop 99 (skips if absent) |
| geolift_cpic | vs GeoLiftMarketSelection: CPIC investment value-for-value |
| geolift_multicell | vs augsynth: multi-cell per-cell ATT + donor exclusion |
| geolift_walkthrough | vs GeoLift/augsynth: GeoLift_Walkthrough realized report (fixedeff ASCM + conformal) |
| linf_crossval_ref | LINF vs LinfinitySC (skips if absent) |
| mcnnm_prop99 | vs causaltensor |
| microsynth_seattle | vs R microsynth panel method (Seattle DMI) |
| mlsc_bottmer | vs Bottmer’s mlSC_estimator (skips if absent) |
| nsc_prop99 | vs Tian’s NSC.R (Prop 99 Table 2) |
| ppscm_paglayan | vs augsynth::multisynth (jackknife + bootstrap SEs) |
| proximal_panic1907 | vs freshtaste/proximal (Panic 1907 Table 3) |
| rescm_relax_ref | vs scmrelax (skips if absent) |
| rsc_shen_coverage | Shen CIs + coverage |
| sdid_prop99 | vs causaltensor |
| si_prop99 | vs Agarwal-Shah-Shen 2026 authors’ code (Prop 99) |
| snn_prop99 | vs deshen24/syntheticNN (Prop 99) |
| spillsynth_iscm_xval | vs Melnychuk-Andrii/Spillover-SCM (inclusive SCM German) |
| spillsynth_prop99 | vs jcao0/synthetic-control-spillover (Cao-Dowd Prop 99) |
| spsydid_state_mc | vs authors’ repo |
| ssc_guanajuato | vs jcao0/staggered_synthetic_control (criminality Sec 4) |