SPOTSYNTH: O’Riordan & Gilligan-Lee (2025) spillover detection

Reproduction of the empirical results in

O’Riordan & Gilligan-Lee (2025). Spillover detection for donor selection in synthetic control models. Journal of Causal Inference 13:20240036.

SPOTSYNTH screens every candidate donor for spillover contamination, then builds a simplex synthetic control on the donors judged valid. The screen rests on Theorem 3.1: a valid donor’s post-intervention value is forecastable from the other donors’ pre-intervention data, so a forecast failure flags an invalid donor. Two screening rules are exposed: S1 (keep the n best-forecast donors) and S2 (drop donors whose realised post value falls outside the forecast PPI).
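The two rules can be illustrated with a minimal sketch that starts from already-computed forecasts. The `DonorForecast` container, the function names, and the numbers below are hypothetical and chosen for illustration; they are not the package's API, and the forecasting step itself is left out.

```python
from dataclasses import dataclass


@dataclass
class DonorForecast:
    """Hypothetical container: one donor's forecast and realised post value."""
    name: str
    mean: float      # forecast point estimate
    lower: float     # lower bound of the forecast interval
    upper: float     # upper bound of the forecast interval
    realised: float  # observed post-intervention value

    @property
    def abs_error(self) -> float:
        return abs(self.realised - self.mean)


def screen_s1(forecasts, n):
    """S1-style rule: keep the n donors with the smallest forecast error."""
    ranked = sorted(forecasts, key=lambda f: f.abs_error)
    return [f.name for f in ranked[:n]]


def screen_s2(forecasts):
    """S2-style rule: keep donors whose realised value lies inside the interval."""
    return [f.name for f in forecasts if f.lower <= f.realised <= f.upper]


# Made-up numbers: donor C is the forecast failure.
forecasts = [
    DonorForecast("A", mean=10.0, lower=9.0, upper=11.0, realised=10.2),
    DonorForecast("B", mean=5.0, lower=4.0, upper=6.0, realised=5.9),
    DonorForecast("C", mean=8.0, lower=7.0, upper=9.0, realised=12.5),
]

kept_s1 = screen_s1(forecasts, n=2)
kept_s2 = screen_s2(forecasts)
print(kept_s1, kept_s2)
```

S1 needs the analyst to choose n in advance, whereas S2 lets the interval width decide how many donors survive.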

The durable case benchmarks/cases/spotsynth_real_data.py reproduces three figures of the paper.

Figure 6: real-data screening (Path A, semi-synthetic)

Each of the three canonical Abadie panels, German Reunification (german_reunification.csv), California Tobacco Control (smoking_data.csv), and Basque Country / ETA (basque_data.csv), receives a planted semi-synthetic invalid donor: a noisy proxy of the target, \(x_{\text{syn}} \sim \mathcal N(y, \sigma)\). Because it tracks the target, the proxy earns a large SC weight and biases the unscreened effect toward zero; both S1 and S2 flag and exclude it, recovering the effect.
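The pull toward zero follows from one line of algebra. As a sketch in notation introduced here (not the paper's): write the planted donor as \(x_{\text{syn}} = y + \varepsilon\), let \(w_{\text{syn}}\) be its SC weight, and let \(\bar{x}\) be the weighted average of the remaining donors with their weights renormalised to sum to one. In a post-intervention period the unscreened effect estimate is then

\[
\hat{\tau}
= y - \bigl(w_{\text{syn}}\, x_{\text{syn}} + (1 - w_{\text{syn}})\,\bar{x}\bigr)
= (1 - w_{\text{syn}})\,(y - \bar{x}) - w_{\text{syn}}\,\varepsilon .
\]

The gap \(y - \bar{x}\) that the remaining donors would report is scaled by \(1 - w_{\text{syn}}\), so the larger the weight on the planted proxy, the closer the estimate sits to zero (up to the noise term \(w_{\text{syn}}\,\varepsilon\)).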

Figure 2: detection power (Path B)

Leave-one-out detection AUC, the probability that an invalid donor scores as more anomalous than a valid one, is 0.97 (sharp) and 0.90 (gradual) under a valid majority (30% invalid). Under an invalid majority (80% invalid) there is a documented inversion (AUC ≈ 0); in that regime the package pins the lag anchor instead.
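The AUC as defined above can be computed directly from pairwise comparisons. The following is a minimal sketch with made-up anomaly scores; `detection_auc` is a hypothetical helper, not a function of the package.

```python
def detection_auc(invalid_scores, valid_scores):
    """Probability that a random invalid donor scores higher (more anomalous)
    than a random valid donor; ties count as one half."""
    wins = 0.0
    for s_inv in invalid_scores:
        for s_val in valid_scores:
            if s_inv > s_val:
                wins += 1.0
            elif s_inv == s_val:
                wins += 0.5
    return wins / (len(invalid_scores) * len(valid_scores))


# Made-up scores: invalid donors mostly rank above valid ones.
good = detection_auc([3.1, 2.4, 0.9], [0.2, 0.5, 1.0, 0.3])

# Made-up scores with the ranking reversed: every valid donor looks
# more anomalous than every invalid one.
inverted = detection_auc([0.2, 0.3], [2.0, 2.5, 3.0])

print(round(good, 3), inverted)
```

An AUC of 0.5 corresponds to an uninformative score, and a value near 0 means the ranking is reversed rather than merely noisy.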

Figure 4: sensitivity / proximal debias (Path B)

When the kept donors are noisy proxies (errors-in-variables, EIV), even a perfect valid-donor SC is attenuation-biased. The proximal two-stage debias (eq. 5), which uses the screen-excluded donors as proximal controls, reduces that bias: mean \(|\text{bias}|\) falls from 0.47 to 0.40 over the paper’s EIV DGP.
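The mechanism can be seen in a scalar toy simulation. This sketch is not eq. 5 and not the package's code: it shows classical attenuation in a one-regressor OLS fit, and a generic two-stage (instrumental-variable style) correction in which a second noisy proxy plays a role loosely analogous to a proximal control. All variable names and the data-generating process are assumptions made for illustration.

```python
import random

random.seed(0)
n = 20000
beta = 1.0  # true slope (assumed for the toy DGP)

x = [random.gauss(0, 1) for _ in range(n)]      # latent signal
x1 = [xi + random.gauss(0, 1) for xi in x]      # noisy proxy used as regressor
x2 = [xi + random.gauss(0, 1) for xi in x]      # second, independent noisy proxy
y = [beta * xi + random.gauss(0, 0.5) for xi in x]


def cov(a, b):
    """Population-style covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)


# Naive OLS slope of y on the noisy proxy: attenuated toward zero.
naive = cov(x1, y) / cov(x1, x1)

# Stage 1: project x1 on x2 (slope only; all series are roughly mean-zero).
a = cov(x1, x2) / cov(x2, x2)
x1_hat = [a * v for v in x2]

# Stage 2: regress y on the projected regressor.
two_stage = cov(x1_hat, y) / cov(x1_hat, x1_hat)

print(round(naive, 2), round(two_stage, 2))
```

With equal signal and noise variances the naive slope settles near half the true slope, while the two-stage slope is close to the true value because the noise in `x2` is independent of the noise in `x1`.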

Note

The paper’s §3.4 also gives analytical bias bounds for false-positive (eq. 7) and false-negative (eq. 8) screening errors. These are sensitivity formulas the analyst evaluates (the false-negative bound needs the unknown spillover \(\tau\) as a sensitivity parameter); mlsynth implements the debias remedy (eq. 5), not the bounds, so the benchmark validates the debias, not the closed-form bounds.

Reproduce

python benchmarks/run_benchmarks.py spotsynth_real_data