Spillover-Detecting Synthetic Control (SPOTSYNTH)


Overview#

SPOTSYNTH packages the donor-selection procedure of O’Riordan & Gilligan-Lee (2025), Spillover detection for donor selection in synthetic control models ([SPOTSYNTH], Journal of Causal Inference 13:20240036). It addresses a prerequisite that classical synthetic control takes for granted: that the donor pool is valid – that no donor is itself affected by the intervention through a spillover. When the donor pool is large, deciding which donors are valid by domain knowledge alone is infeasible, and a single contaminated donor – one that moves with the treated unit after the intervention – can absorb a large weight and bias the estimated effect toward zero.

The paper’s main result (Theorem 3.1) is that, under the same assumptions that make synthetic control non-parametrically identified (invariant causal mechanisms and proxy completeness), a valid donor’s post-intervention value is forecastable from pre-intervention donor data. SPOTSYNTH turns this into a practical screen: for each candidate donor, forecast its untreated post-intervention path from donor data the intervention has not touched, and flag the donor if its realised path departs from the forecast. A forecast failure means the donor was hit by a spillover (or its latent distribution shifted) – either way it is excluded. (Two forecast anchors implement this – a leave-one-out anchor, the default, and the paper’s first-post-point anchor; the forecast-anchor section explains when to use each.) The surviving donors feed the authors’ Bayesian Dirichlet simplex synthetic control (a \(\mathrm{Dirichlet}(0.4)\) prior on the weights, a half-normal prior on the residual scale, pre-period standardisation, and 95% posterior-predictive credible intervals). The donors the screen excludes are not discarded: they can be reused as proximal control variables in a two-stage (GMM) step that debiases the weights when the kept donors are noisy proxies.

Two selection rules are exposed through SPOTSYNTHConfig.selection:

S1 – keep the donors with the smallest forecast error. The analyst fixes how many donors to keep (e.g. “give me 30 valid donors”), which is convenient when a downstream method needs a set number of donors.
S2 – keep the donors whose realised post-intervention value falls inside a posterior predictive interval (default 80%). The analyst does not fix the number kept; instead the interval level controls the false-positive rate (how often a valid donor is wrongly excluded).
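The difference between the two rules can be sketched in a few lines of plain Python. This is an illustration only, not mlsynth's implementation: `select_s1` and `select_s2` are hypothetical helper names, and the forecast errors and interval bounds are made-up standardised values.

```python
def select_s1(errors, n_keep):
    """S1: indices of the n_keep donors with the smallest forecast error."""
    ranked = sorted(range(len(errors)), key=lambda j: errors[j])
    return sorted(ranked[:n_keep])


def select_s2(realised, lower, upper):
    """S2: indices of donors whose realised value lies strictly inside the interval."""
    return [
        j
        for j, (y, lo, hi) in enumerate(zip(realised, lower, upper))
        if lo < y < hi
    ]


# Hypothetical standardised values for four candidate donors.
errors = [0.1, 2.0, 0.3, 0.9]
kept_s1 = select_s1(errors, n_keep=2)

realised = [0.1, 2.0, -0.3, 0.9]
lower = [-0.8, -0.8, -0.8, -0.8]
upper = [0.8, 0.8, 0.8, 0.8]
kept_s2 = select_s2(realised, lower, upper)
```

S1 always returns exactly `n_keep` donors, whereas the number S2 returns depends on how many realised values fall inside their intervals.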

When to use this estimator#

Reach for SPOTSYNTH when:

You have a large donor pool and cannot certify by hand that every donor is free of spillovers. This is the motivating case – e.g. estimating the effect of a feature launch on a platform where any of thousands of candidate “donor” markets might have been indirectly exposed.
You are worried a donor is too good a match – a unit that tracks the treated unit suspiciously closely after the intervention. Such a donor grabs a large SC weight and biases the effect toward zero; the screen is built to catch exactly this (the semi-synthetic demonstrations below).
You want a principled, data-driven donor screen with explicit sensitivity bounds on the bias from selection errors, rather than an ad hoc “drop the weird-looking donor” rule.

Use the default forecast="loo" for applied work – a mostly-valid donor pool with a contaminant whose effect may arrive at any speed (it is onset-robust and dominates this regime). Switch to forecast="lag" only when you have prior reason to believe a large fraction of the pool is contaminated and the spillover is abrupt, the one regime where loo inverts (see the forecast-anchor power analysis below).

Do not use SPOTSYNTH when:

A relevant latent has no valid donor proxy (A2 fails). No amount of donor selection closes the backdoor path; the FP-bias bound below is uninformative. Switch to a factor-model-aware estimator (Factor Model Approach (FMA)) or a design that observes the confounder.
The spillover onset is gradual and the donor pool is mostly contaminated. The "lag" anchor needs a sharp first-period signal; the "loo" anchor needs a valid majority. With neither, the screen has no clean reference.
Causal mechanisms are non-invariant (A1 fails) – e.g. the latent-to-donor map changes over the sample. The pre-period forecast then does not transport to the post-period.
You only have a tiny, hand-curated donor pool already known to be valid. The screen adds variance (it may drop a good donor) without identification gain; a canonical SC (Two-Step Synthetic Control, Forward Difference-in-Differences (FDID)) is the more honest default.
Interference runs treated-to-treated or is structural across many units rather than a few contaminated donors. For spillover-aware estimands (rather than donor cleaning) see Spatial Synthetic Difference-in-Differences (SpSyDiD) and Spillover-Aware Synthetic Control (SPILLSYNTH).

Notation#

Let \(j = 1\) denote the treated unit, with all units \(\mathcal{N} \coloneqq \{1, \dots, N\}\) and donor pool \(\mathcal{N}_0 \coloneqq \mathcal{N} \setminus \{1\}\) of cardinality \(N_0\). Time runs over \(t \in \mathcal{T} \coloneqq \{1, \dots, T\}\), 1-indexed; the intervention takes effect after the common adoption time \(T_0\), splitting \(\mathcal{T}\) into the pre-period \(\mathcal{T}_1 \coloneqq \{t \in \mathcal{T} : t \le T_0\}\) (of length \(T_0\)) and the post-period \(\mathcal{T}_2 \coloneqq \{t \in \mathcal{T} : t > T_0\}\). The intervention indicator is \(d_t \coloneqq \mathbf{1}\{t > T_0\}\) – zero on \(\mathcal{T}_1\), one on \(\mathcal{T}_2\).

The treated series is \(\mathbf{y}_1 = (y_{11}, \dots, y_{1T})^\top \in \mathbb{R}^{T}\) with scalar outcomes \(y_{1t}\); each donor \(j \in \mathcal{N}_0\) contributes a series \(\mathbf{y}_j\), stacked into the donor matrix \(\mathbf{Y}_0 \coloneqq [\mathbf{y}_j]_{j \in \mathcal{N}_0} \in \mathbb{R}^{T \times N_0}\) (one column per donor). The panel is generated by latent factors \(u_1, \dots, u_M\) for which the donors are proxies. Donor weights are \(\mathbf{w} \in \mathbb{R}^{N_0}\), constrained to the unit simplex \(\Delta^{N_0} \coloneqq \{\mathbf{w} \in \mathbb{R}_{\ge 0}^{N_0} : \|\mathbf{w}\|_1 = 1\}\); the optimiser is \(\mathbf{w}^\ast\). The synthetic counterfactual is \(\widehat{\mathbf{y}}_1 \coloneqq \mathbf{Y}_0\,\mathbf{w}^\ast\) with entries \(\widehat{y}_{1t}\), the per-period effect is \(\tau_t \coloneqq y_{1t} - \widehat{y}_{1t}\), and the ATT is \(\widehat{\tau} \coloneqq |\mathcal{T}_2|^{-1} \sum_{t \in \mathcal{T}_2} \tau_t\). A donor’s spillover magnitude – the size of the post-intervention shift its outcome suffers when it is invalid – is written \(\gamma_j\), kept distinct from the treatment effect \(\tau\).
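As a small numerical illustration of this notation, the sketch below builds the synthetic counterfactual, the per-period gap, and the ATT from a toy panel. The numbers and the fixed weight vector are invented for the example; in practice the weights come from the fitted synthetic control.

```python
import numpy as np

T, T0 = 6, 4                                     # six periods, four pre-intervention
Y0 = np.array([[1.0, 3.0],
               [2.0, 4.0],
               [3.0, 5.0],
               [4.0, 6.0],
               [5.0, 7.0],
               [6.0, 8.0]])                      # donor matrix, shape (T, N0)
w = np.array([0.5, 0.5])                         # weights on the unit simplex
y1 = np.array([2.0, 3.0, 4.0, 5.0, 8.0, 9.0])    # treated series

y1_hat = Y0 @ w                                  # synthetic counterfactual
tau = y1 - y1_hat                                # per-period gap tau_t
post = np.arange(1, T + 1) > T0                  # d_t = 1{t > T0}, 1-indexed time
att = tau[post].mean()                           # average gap over the post-period
```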

Assumptions#

The screen rests on the SC structural causal model and three working assumptions layered on Theorem 3.1.

A1 (Invariant causal mechanisms; Definition 3.3). The structural functions are time-invariant. This is what makes the forecast function \(h_j\) the same before and after the intervention, so a pre-intervention forecast is valid post-intervention.

Remark. This is what licenses transporting the pre-period forecast across \(T_0\): only if the same mechanism governs both windows does a forecast trained on \(\mathcal{T}_1\) predict a valid donor’s untreated value on \(\mathcal{T}_2\). A valid donor’s pre-period one-step forecast residuals should therefore look stationary; strong heteroskedasticity or trending residual variance signals a non-invariant mechanism.

A2 (Proxy completeness; Definition 3.2). The donors are proxies for the latents. If a relevant latent has no donor proxy, excluding donors cannot close all backdoor paths and the SC is biased by omitted variables (Section 3.4.1).

Remark. Donor selection cleans contaminated proxies; it cannot manufacture a proxy for a latent the donor pool never spanned. When this fails the FP-bias bound below is uninformative, which is exactly the regime in which the page steers the reader to a factor-model-aware estimator (Factor Model Approach (FMA)). A large pre-period fit residual for the treated unit against the donor pool is a symptom that the donors do not span the latents.

A3 (No contemporaneous latent shift). The latent error distributions \(P(\varepsilon_u)\) do not shift at the same time as the intervention.

Remark. The paper is explicit that latent shifts which occur later in the post-period do not bias the screen (its forecast test, the "lag" anchor here, only inspects the first post-intervention point), and that lags can be absorbed by time averaging. A contemporaneous latent shift, however, is indistinguishable from a spillover and produces a false positive (a valid donor wrongly excluded; Figure 5).

What does not break the screen: spillovers that arrive late, latent shifts that arrive late, and large donor pools (the factor-regularised forecast handles \(N >\) pre-period length).

Mathematical Formulation#

The estimand#

We observe the treated unit \(j = 1\) with outcome \(y_{1t}\) and the donor pool \(\mathcal{N}_0\) with outcomes \(y_{jt}\), over periods \(t \in \mathcal{T}\). The intervention indicator is \(d_t = \mathbf{1}\{t > T_0\}\) – zero on the pre-period \(\mathcal{T}_1\), one on the post-period \(\mathcal{T}_2\). The estimand is the treatment effect on the treated,

\[\tau_t = \underbrace{\mathbb E\bigl(y_{1t} \mid \mathrm{do}(d_t = 1)\bigr)}_{\text{observed}} - \underbrace{\mathbb E\bigl(y_{1t} \mid \mathrm{do}(d_t = 0)\bigr)}_{\text{counterfactual}}, \qquad t \in \mathcal{T}_2,\]

estimated as the post-intervention gap between the treated unit and a synthetic control built from valid donors. A donor is valid if it adheres to the structural causal model of the paper (Figure 1a) and remains a proxy for the latent factors at every time point – in particular it must not be impacted by the intervention. Spillover effects manifest as a post-intervention shift in the donor’s exogenous error \(P(\varepsilon_{jt})\).

SC structural causal model#

Following Zeitler et al. (2023), the panel is modelled as a structural causal model with latent variables \(u_1, \dots, u_M\), donors \(\mathbf{y}_j\) (\(j \in \mathcal{N}_0\)) as children of the latents, and the treated unit \(\mathbf{y}_1\) as a child of the latents and the intervention. Two conditions are central.

Definition 3.2 (Proxy completeness). For any square-integrable \(f\), if \(\mathbb E(f(\mathbf{Y}_{0,t}) \mid u_{1t}, \dots, u_{Mt}) = 0\) then \(f \equiv 0\), where \(\mathbf{Y}_{0,t}\) is the donor cross-section \((y_{jt})_{j \in \mathcal{N}_0}\) at time \(t\). Intuitively, the donors carry all the “information” the latents do – they are genuine proxies for the latent factors.

Definition 3.3 (Invariant causal mechanism). The deterministic functions mapping parents to children do not depend on the time index \(t\). This generalises the time-independent factor loadings of the classical latent-factor SC model.

The forecast theorem#

Theorem 3.1. If causal mechanisms are invariant, and the donor cross-section \(\mathbf{Y}_{0,t-1}\) is a proxy for the latents \(u_{1,t-1}, \dots, u_{M,t-1}\), then for each donor \(j\) there exists a unique function \(h_j\) such that for all \(t\)

\[\mathbb E(y_{jt}) = \mathbb E\bigl(h_j(\mathbf{Y}_{0,t-1}, P(\varepsilon_{jt}))\bigr).\]

The intuition (Figure 1b): because the donors at \(t-1\) are proxies for the latents at \(t-1\), and the latents evolve by an invariant mechanism, we can write \(y_{jt}\) as a function of the lagged donor cross-section \(\mathbf{Y}_{0,t-1}\) and the donor’s own noise. So a donor’s value at \(t\) is forecastable from the donor pool one step earlier.

This is the key contrast with standard SC identifiability, which uses post-intervention donor values to predict the target’s counterfactual. Theorem 3.1 instead lets us forecast each donor’s own post-intervention value from pre-intervention data, and that forecast becomes a test for validity.

From theorem to screen#

A donor \(j\) can be forecast from its past if (a) it is valid (not impacted by the intervention) and (b) the latent error distributions \(P(\varepsilon_u)\) have not shifted at time \(t\). Conversely, failing to forecast \(y_{jt}\) from pre-intervention data implies (a), (b), or both are violated. Assuming the latents have not shifted (or shift only later in the post-period, which does not bias the screen – see below), a forecast failure flags a spillover – an invalid donor.

Algorithm 1 (per candidate donor \(j\)).

Normalise the donor data and labels (zero mean, unit variance over the pre-intervention window). The normalisation makes the procedure invariant to the scale of the donors.
Regress \(y_{jt}\) on the lagged donor cross-section \(\mathbf{Y}_{0,t-1} = (y_{k,t-1})_{k \in \mathcal{N}_0}\) over the pre-intervention transitions \(t \le T_0\) to obtain \(\widehat{h}_j\). mlsynth fits a factor-regularised regression (on the leading donor factors of the lagged cross-section) so the forecast is well posed even when \(N_0\) exceeds the number of pre-intervention periods – the regime of large donor pools the method is built for.
Predict the first post-intervention value \(\widehat{y}_{j,T_0+1}\) from the last pre-intervention cross-section (which is clean), with a \(\phi\)-level posterior predictive interval \([\widehat{y}_{j,-}, \widehat{y}_{j,+}]\).
Forecast error (procedure S1): \(A_j \coloneqq |y_{j,T_0+1} - \widehat{y}_{j,T_0+1}|\).
PPI flag (procedure S2): \(B_j = 0\) if \(\widehat{y}_{j,-} < y_{j,T_0+1} < \widehat{y}_{j,+}\), else \(B_j = 1\).

The assumed forecast model (paper equation 3) is linear and time-invariant,

\[y_{jt} \sim \mathcal N\Bigl(\rho_j + \textstyle\sum_k \theta_{jk}\, y_{k,t-1},\ \sigma_{y_j}\Bigr),\]

with coefficients \(\rho_j, \theta_{jk}\) shared across time – encoding that \(h_j\) is time-independent. The two selection rules are

\[S1:\ \min_{j}\ \Bigl|y_{j,T_0+1} - \rho_j - \textstyle\sum_k \theta_{jk} y_{k,T_0}\Bigr|, \qquad S2:\ y_{j,T_0+1} \in \text{the } \phi\text{-PPI}.\]
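The steps of Algorithm 1 can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not mlsynth's code: `lag_screen` is a hypothetical name, the factor regularisation is done by plain principal-component regression, and the posterior predictive interval is approximated by a Gaussian interval built from the pre-period residual standard deviation. The simulated panel (two AR(1) latent factors, a sharp spillover on donor 0) is likewise invented for the demonstration.

```python
import numpy as np
from statistics import NormalDist


def lag_screen(Y0, T0, n_factors=5, ppi=0.8):
    """First-post-point forecast screen. Y0 has shape (T, N0); T0 pre-periods."""
    pre = Y0[:T0]
    Z = (Y0 - pre.mean(axis=0)) / pre.std(axis=0)      # step 1: pre-period standardisation
    X, Y = Z[:T0 - 1], Z[1:T0]                         # lagged predictors, next-period targets
    xbar = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - xbar, full_matrices=False)
    k = min(n_factors, Vt.shape[0])
    V = Vt[:k].T                                       # leading factor directions, (N0, k)
    F = np.column_stack([np.ones(T0 - 1), (X - xbar) @ V])
    coef, *_ = np.linalg.lstsq(F, Y, rcond=None)       # step 2: one regression per donor
    resid = Y - F @ coef
    dof = max(T0 - 1 - (k + 1), 1)
    sigma = np.sqrt((resid ** 2).sum(axis=0) / dof)    # residual sd per donor
    f_last = np.concatenate([[1.0], (Z[T0 - 1] - xbar) @ V])
    y_hat = f_last @ coef                              # step 3: forecast of period T0 + 1
    A = np.abs(Z[T0] - y_hat)                          # S1 forecast error
    z = NormalDist().inv_cdf(0.5 + ppi / 2)            # Gaussian interval half-width factor
    B = (A >= z * sigma).astype(int)                   # S2 flag: 1 = outside the interval
    return A, B


# Simulated panel: two AR(1) latent factors, 20 donors, sharp spillover on donor 0.
rng = np.random.default_rng(0)
T, T0, N0 = 60, 50, 20
u = np.zeros((T, 2))
for t in range(1, T):
    u[t] = 0.9 * u[t - 1] + 0.3 * rng.standard_normal(2)
loadings = rng.uniform(0.5, 1.5, size=(2, N0))
Y0 = u @ loadings + 0.1 * rng.standard_normal((T, N0))
Y0[T0:, 0] += 8.0                                      # spillover at full size from T0 + 1

A, B = lag_screen(Y0, T0)
flagged = int(np.argmax(A))                            # donor with the largest forecast error
```

Because the forecast for period \(T_0 + 1\) uses only the last pre-intervention cross-section, the contaminated donor's own post-period values never enter its prediction.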

Choosing the forecast anchor: `loo` (default) vs `lag`#

The screen needs a forecast of each donor’s untreated post-intervention path to test against. mlsynth offers two anchors, and the choice between them is the single most consequential setting on the estimator.

The paper’s anchor (lag) is Algorithm 1 to the letter: forecast only the first post-intervention point, from the last clean pre-intervention cross-section. At \(T_0\) the lagged predictors are spillover-free, so the forecast predicts the untreated value and the spillover surfaces as the error. This works even when most donors are contaminated (the lag is anchored to clean pre-data, not to the donor consensus) – but it carries a hidden assumption: that the spillover is sharp, i.e. present at full magnitude by \(T_0 + 1\). The paper’s own spillover model bakes this in (the donor error mean jumps from 0 to \(\gamma_j\) at \(T_0 + 1\) and stays there). If the spillover instead builds gradually – the realistic case for diffusion, adoption, or accumulation – then \(\gamma_{j,T_0+1} \approx 0\), the first-post-point error is ~0, and the screen is blind by construction. (Testing later periods does not rescue it: their lags are themselves contaminated, so a one-step-lagged forecast then sees only the period-to-period increment of the spillover, which is small for a gradual ramp.)

The default anchor (loo) removes that fragility. It forecasts each donor’s whole post-intervention trajectory from the other donors’ common factors (leave-one-out) and ranks by the mean absolute deviation. Because it differences out the common factor and accumulates evidence over the entire post-period, a contaminated donor’s divergence is detectable regardless of how gradually it arrives. Its one requirement is a valid majority – the other donors must be mostly clean, so they form a trustworthy reference. That is the normal applied situation (a handful of suspect donors in a large pool).
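A leave-one-out anchor of this kind can be sketched as below. The sketch assumes a simple variant (principal components of the other donors' pre-period data, an ordinary least-squares map from those factors to the held-out donor, and the mean absolute post-period deviation as the score); `loo_scores` is a hypothetical name and mlsynth's internal implementation may differ. The simulated spillover ramps up linearly from zero, so the first post-intervention point carries no signal.

```python
import numpy as np


def loo_scores(Y0, T0, n_factors=5):
    """Mean absolute post-period deviation of each donor from its leave-one-out forecast."""
    pre = Y0[:T0]
    Z = (Y0 - pre.mean(axis=0)) / pre.std(axis=0)      # pre-period standardisation
    T, N0 = Z.shape
    scores = np.empty(N0)
    for j in range(N0):
        others = np.delete(Z, j, axis=1)               # every donor except j
        _, _, Vt = np.linalg.svd(others[:T0], full_matrices=False)
        k = min(n_factors, Vt.shape[0])
        G = np.column_stack([np.ones(T), others @ Vt[:k].T])   # intercept + factor scores
        beta, *_ = np.linalg.lstsq(G[:T0], Z[:T0, j], rcond=None)
        scores[j] = np.mean(np.abs(Z[T0:, j] - G[T0:] @ beta))
    return scores


# Simulated panel: valid majority, gradual spillover on donor 0 only.
rng = np.random.default_rng(1)
T, T0, N0 = 70, 50, 20
u = np.zeros((T, 2))
for t in range(1, T):
    u[t] = 0.9 * u[t - 1] + 0.3 * rng.standard_normal(2)
loadings = rng.uniform(0.5, 1.5, size=(2, N0))
Y0 = u @ loadings + 0.1 * rng.standard_normal((T, N0))
Y0[T0:, 0] += np.linspace(0.0, 8.0, T - T0)            # ramp: zero at T0 + 1, 8 at T

scores = loo_scores(Y0, T0, n_factors=2)
most_anomalous = int(np.argmax(scores))
```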

Power analysis. run_forecast_power_analysis() reproduces the comparison: detection AUC (probability an invalid donor scores more anomalous than a valid one; 1 = perfect, 0.5 = none, < 0.5 = inverted) as the spillover onset sweeps sharp → gradual, at two contamination levels:

Detection AUC by anchor, onset, and contamination#
regime	onset	`lag`	`loo`
valid majority (30% invalid)	sharp	0.61	0.96
valid majority (30% invalid)	gradual	0.49	0.92
invalid majority (80%)	sharp	0.61	0.00 (inverted)
invalid majority (80%)	gradual	0.49	0.02 (inverted)

reproduced by:

from mlsynth.utils.spotsynth_helpers import run_forecast_power_analysis
run_forecast_power_analysis(invalid_fracs=(0.3, 0.8), ramps=(1, 6, 24))

The reading is clean: loo dominates the applied (valid-majority) regime and is robust to onset speed; lag only has power for sharp onsets and only earns its keep in the paper’s mostly-invalid stress regime, where loo inverts (the contaminated majority becomes the “consensus”, so the screen flags the valid donors). The remaining cell – gradual onset and invalid majority – is the honest limit: no forecast screen separates signal from a contaminated consensus that creeps in slowly.
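The detection AUC used in this comparison is the probability that a randomly chosen invalid donor scores as more anomalous than a randomly chosen valid one. A minimal pure-Python sketch of that statistic follows; `detection_auc` is a hypothetical helper written for illustration and is not claimed to be the function the package uses internally.

```python
def detection_auc(scores, is_invalid):
    """P(invalid donor scores higher than valid donor); ties count one half."""
    invalid = [s for s, bad in zip(scores, is_invalid) if bad]
    valid = [s for s, bad in zip(scores, is_invalid) if not bad]
    wins = sum((i > v) + 0.5 * (i == v) for i in invalid for v in valid)
    return wins / (len(invalid) * len(valid))


perfect = detection_auc([3.0, 2.0, 1.0, 0.5], [True, True, False, False])
inverted = detection_auc([0.5, 1.0, 2.0, 3.0], [True, True, False, False])
uninformative = detection_auc([1.0, 1.0, 1.0, 1.0], [True, True, False, False])
```

A value below 0.5, as in the invalid-majority rows of the table, means the screen ranks valid donors as more anomalous than contaminated ones.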

A note on CUSUM. A cumulative-sum statistic on the lagged residuals can rescue lag on individual gradual real panels (it telescopes the increments back into the level), but the power analysis disqualifies it as a general default: in the shared-factor DGP it is swamped by the common innovation and falls below chance. loo is the robust generalisation, so it – not CUSUM – is the default.

Recommendation. Use the default loo for applied work. Switch to lag only when you have prior reason to believe a large fraction of the pool is contaminated and the spillover is abrupt – e.g. the paper’s simulation, which this package pins to lag for exactly that reason.

Time averaging#

Two practical issues are handled by forecasting on time-averaged (coarsened) data, set via SPOTSYNTHConfig.time_average ("lag" only). First, a lag between the intervention and the onset of a spillover: averaging over a window still surfaces a spillover that arrives a few periods late. Second, very noisy donors: averaging reduces the donor noise and so reduces false negatives (Figure 3). The averaging must not mix pre- and post-intervention periods in the same bucket; mlsynth buckets the two windows separately. Longer windows reduce noise but raise the risk of false positives from latent shifts within the window.
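The bucketing constraint (never average across \(T_0\)) can be sketched as follows. This is an illustrative sketch: `time_average_panel` is a hypothetical name, and the handling of leftover periods (dropping the earliest pre-periods and the latest post-periods so that buckets stay aligned to the intervention date) is an assumption, not a documented mlsynth behaviour.

```python
import numpy as np


def time_average_panel(Y, T0, width):
    """Average consecutive blocks of `width` periods, pre and post separately."""
    n_cols = Y.shape[1]
    n_pre = (T0 // width) * width
    pre = Y[T0 - n_pre:T0]                             # drop the earliest leftovers
    n_post = ((len(Y) - T0) // width) * width
    post = Y[T0:T0 + n_post]                           # drop the latest leftovers
    pre_avg = pre.reshape(n_pre // width, width, n_cols).mean(axis=1)
    post_avg = post.reshape(n_post // width, width, n_cols).mean(axis=1)
    return np.vstack([pre_avg, post_avg]), n_pre // width


Y = np.arange(10, dtype=float).reshape(10, 1)          # one donor, values 0..9
averaged, new_T0 = time_average_panel(Y, T0=6, width=2)
```

The returned `new_T0` is the number of pre-intervention buckets, so the coarsened panel can be passed to the same screen with the intervention boundary intact.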

The synthetic-control model#

Once the valid donors are selected, the counterfactual is built with the authors’ synthetic-control model (paper page 12). It is Bayesian:

\[y_{1t} \sim \mathcal N\Bigl(\alpha + \textstyle\sum_j w_j\, y_{jt},\ \sigma_y\Bigr), \quad w_j \ge 0,\ \textstyle\sum_j w_j = 1, \quad \mathbf{w} \sim \mathrm{Dirichlet}(0.4), \quad \sigma_y \sim \mathcal N^+(0, 1),\]

with the target and donors standardised to zero mean and unit standard deviation over the pre-intervention window – which is what absorbs the intercept \(\alpha\) of equation (4), so no separate intercept term is fit. The \(\mathrm{Dirichlet}(0.4)\) prior (concentration < 1) regularises the weights toward sparse corners of the simplex, retaining the Abadie-Diamond-Hainmueller non-negativity / sum-to-one restriction. 95% credible intervals for the counterfactual and the ATT are the 2.5 / 97.5 percentiles of the posterior predictive distribution.

This posterior has no closed form (a Dirichlet prior is not conjugate to a Gaussian likelihood under the simplex constraint). The authors fit it in Stan (Hamiltonian Monte Carlo / NUTS), but their Stan code was not shared with us; mlsynth fits the model as published with NumPyro’s NUTS – the same HMC family – so it reproduces their estimation procedure as closely as the published specification allows. NumPyro is an optional dependency: set SPOTSYNTHConfig.inference to "frequentist" for a fast, dependency-free simplex least-squares point estimate (no intervals) when running large simulations or when NumPyro is unavailable – the donor-selection bias pattern is identical either way.
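For intuition about the simplex least-squares point estimate, the sketch below solves the constrained problem by projected gradient descent with a Euclidean projection onto the simplex. The solver choice is an assumption made for illustration (mlsynth's frequentist path may use a different optimiser), and `project_simplex` and `simplex_least_squares` are hypothetical names.

```python
import numpy as np


def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)


def simplex_least_squares(X, y, n_iter=2000):
    """Minimise 0.5 * ||X w - y||^2 over the unit simplex by projected gradient."""
    n = X.shape[1]
    w = np.full(n, 1.0 / n)                            # start at the simplex centre
    step = 1.0 / np.linalg.norm(X, 2) ** 2             # 1 / largest eigenvalue of X'X
    for _ in range(n_iter):
        w = project_simplex(w - step * X.T @ (X @ w - y))
    return w


# Noise-free toy problem with a sparse true weight vector.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 5))                       # standardised pre-period donors
w_true = np.array([0.7, 0.3, 0.0, 0.0, 0.0])
y = X @ w_true                                         # standardised pre-period target
w_hat = simplex_least_squares(X, y)
```

Unlike the Bayesian fit, this yields a single weight vector with no credible intervals.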

Note

On validating this SC model. Because the authors’ Stan was unavailable, we implemented the model from the published specification (the p.12 equations above) rather than from their code, and validated our NUTS fit against exact grid quadrature of the posterior – ground truth, not another sampler. On 2- and 3-donor problems the posterior means of the weights and of \(\sigma_y\) match the deterministic numerical integral to \(\le 10^{-3}\) (weights) and \(\le 10^{-4}\) (\(\sigma_y\)). That establishes our code correctly samples the stated model. It does not and cannot establish that the authors’ Stan matches their published equations in every undocumented detail – the irreducible limitation when reference code is withheld. The paper’s headline contribution – the donor-selection screen (Algorithm 1) – is fully specified independently of the SC solver, and is what the Path-B (Figure 2) and Path-A (Figure 6) reproductions below exercise through .fit().

Bias and debiasing when the screen errs#

Bias when the screen errs: sensitivity analysis#

The screen can make two kinds of error, and the paper bounds the SC bias each induces (so the analyst can gauge robustness rather than trust the selection blindly).

False positive (a valid donor excluded). If the excluded donors were proxies for a relevant latent, dropping them reintroduces omitted-variable bias, bounded (Section 3.4.2) by

\[\text{FP Bias} \le N_0 \cdot \max_{j}(|w_j|) \cdot \max_{l}\bigl(|\mathbb E(z_l^{\text{pre}}) - \mathbb E(z_l^{\text{post}})|\bigr),\]

where \(z_l\) are the excluded donors and \(\mathbf{w}\) the SC weights. The bound is small when the kept donors already span the latents.
False negative (an invalid donor kept). The bias from a retained spillover-\(\gamma_j\) donor is bounded (Section 3.4.3) by

\[\text{FN Bias} \le N_0 \cdot \max_{j}(|w_j|) \cdot \max_{j}(|\gamma_j|).\]

The spillover \(\gamma_j\) is unknown but, following the negative-control literature (Miao 2024), can be treated as a sensitivity parameter: domain knowledge bounding \(\gamma_j\) bounds the bias, and one can ask how large a spillover would have to be to flip the sign of the estimated effect.
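Both bounds are simple arithmetic, and the sign-flip question reduces to solving the false-negative bound for the spillover size. The sketch below evaluates them on invented numbers; the function names are hypothetical, and `n0` is passed explicitly so the reader decides which donor count the bound should use.

```python
def fp_bias_bound(n0, weights, pre_means, post_means):
    """N0 * max|w_j| * max_l |E(z_l pre) - E(z_l post)| for the excluded donors z_l."""
    max_shift = max(abs(a - b) for a, b in zip(pre_means, post_means))
    return n0 * max(abs(w) for w in weights) * max_shift


def fn_bias_bound(n0, weights, gamma_max):
    """N0 * max|w_j| * max|gamma_j| for a retained contaminated donor."""
    return n0 * max(abs(w) for w in weights) * gamma_max


def breakeven_spillover(n0, weights, att):
    """Smallest max|gamma_j| at which the FN bound reaches |ATT| (a sign flip is possible)."""
    return abs(att) / (n0 * max(abs(w) for w in weights))


weights = [0.5, 0.3, 0.2]                              # hypothetical SC weights
fp = fp_bias_bound(3, weights, pre_means=[1.0, 2.0], post_means=[1.5, 2.25])
fn = fn_bias_bound(3, weights, gamma_max=2.0)
gamma_star = breakeven_spillover(3, weights, att=-1.5)
```

If domain knowledge rules out spillovers as large as `gamma_star`, the sign of the estimated effect is robust to false negatives under this bound.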

Using excluded donors to debias (proximal two-stage)#

The donors excluded by the screen are not used to build the SC, but they can still debias it. Because only pre-intervention data is ever used from the excluded donors \(z_l\), the SC estimate is unaffected by their post-intervention spillover dynamics. Treating the excluded donors as proximal control variables (Shi et al. 2023), one jointly models the target and the kept donors as functions of the excluded donors (paper equation 5) and recovers consistent weights even when the kept donors are imperfect (noisy) proxies of the latents.

The paper notes (page 9) that equation (5) “effectively combines [a] two-stage process into a single model”, and that this is the standard proximal / instrumental-variables estimator: regress the kept donors \(\mathbf{X}\) on the excluded donors \(\mathbf{Z}\) to form \(\widehat{\mathbf{X}}\), then regress the target on \(\widehat{\mathbf{X}}\). mlsynth implements exactly this two-stage estimator in closed form (no probabilistic-programming dependency), enabled by SPOTSYNTHConfig.debias. When True, the result object carries att_debiased alongside the screened ATT. On the paper’s Figure 4 (errors-in-variables) setting, this measurably reduces the attenuation bias that persists even under a perfect valid-donor selection.
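The two-stage logic can be sketched with ordinary least squares on simulated errors-in-variables data. This is a generic two-stage least-squares illustration under assumed data (standard-normal latents, kept and excluded donors that are independent noisy proxies of them), not a reproduction of mlsynth's estimator or of the paper's Figure 4; `two_stage_weights` is a hypothetical name.

```python
import numpy as np


def two_stage_weights(X, Z, y):
    """Stage 1: project kept donors X on excluded donors Z. Stage 2: regress y on the fit."""
    first, *_ = np.linalg.lstsq(Z, X, rcond=None)
    X_hat = Z @ first
    w, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
    return w


# Simulated pre-period data: two latents, noisy kept donors, excluded donors as proxies.
rng = np.random.default_rng(0)
n = 20000
u = rng.standard_normal((n, 2))                        # latent factors
w_true = np.array([0.6, 0.4])
y = u @ w_true                                         # target driven by the latents
X = u + rng.standard_normal((n, 2))                    # kept donors: latents plus noise
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
Z = u @ A + rng.standard_normal((n, 3))                # excluded donors: independent noise

w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)          # attenuated by the noise in X
w_2s = two_stage_weights(X, Z, y)                      # attenuation removed
```

With equal latent and noise variances the naive regression shrinks the weights by about half, while the two-stage estimate recovers them because the noise in the excluded donors is independent of the noise in the kept donors.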

Durable benchmark#

benchmarks/cases/spotsynth_real_data.py reproduces three figures of the paper end to end: Figure 6 (real-data screening on German Reunification, California, and Basque Country – both S1 and S2 exclude a planted noisy-proxy donor and recover the canonical effect, ATT California ~ −22, Basque ~ −1.2, while the unscreened estimate collapses toward zero); Figure 2 (leave-one-out detection AUC, high under a valid majority, inverted under an invalid majority); and Figure 4 (the proximal debias reduces errors-in-variables bias). Run it with python benchmarks/run_benchmarks.py spotsynth_real_data; see SPOTSYNTH — O’Riordan & Gilligan-Lee (2025) spillover detection.

Core API#

SPOTSYNTH: spillover detection for donor selection (O’Riordan & Gilligan-Lee 2025).

O’Riordan, M. & Gilligan-Lee, C. M. (2025). “Spillover detection for donor selection in synthetic control models.” Journal of Causal Inference 13:20240036. doi:10.1515/jci-2024-0036.

To identify a causal effect, a synthetic control needs donors that are not impacted by the intervention – valid donors. Deciding which donors are valid usually demands strong a-priori domain knowledge, which is infeasible with large donor pools. SPOTSYNTH replaces that domain knowledge with a forecast test.

The paper’s Theorem 3.1 shows that, under invariant causal mechanisms and the proxy-completeness condition, a valid donor’s post-intervention value is forecastable from pre-intervention donor data. Algorithm 1 turns this into a practical screen: for each candidate donor, fit a forecast on pre-intervention data and predict the first post-intervention value. A donor whose realised value departs from the forecast has either been hit by a spillover or seen its latent distribution shift – either way it is excluded. Two selection rules are offered: S1 (keep the donors with the smallest forecast error; the analyst fixes the number kept) and S2 (keep donors whose realised value falls inside a posterior predictive interval; controls the false-positive rate). The surviving donors feed a canonical simplex synthetic control.

class mlsynth.estimators.spotsynth.SPOTSYNTH(config: SPOTSYNTHConfig | dict)#

Bases: object

Spillover-detecting synthetic control.

Parameters: config (SPOTSYNTHConfig or dict) – Configuration object. See mlsynth.config_models.SPOTSYNTHConfig.

fit() → SpotSynthResults: Screen donors for spillover, fit the synthetic control, return results.
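A typical call might look like the sketch below. The field names and the import path are the ones documented on this page; the column names (`sales`, `treated`, `market`, `week`) and the two helper functions are hypothetical, and the sketch assumes `df` is a long-format panel with a single treated unit.

```python
def spotsynth_config(df, **overrides):
    """Build a config dict using the documented SPOTSYNTHConfig fields."""
    config = {
        "df": df,
        "outcome": "sales",            # hypothetical column names
        "treat": "treated",
        "unitid": "market",
        "time": "week",
        "selection": "S2",             # keep donors inside the predictive interval
        "ppi": 0.8,
        "forecast": "loo",             # default anchor
        "inference": "frequentist",    # no NumPyro dependency, point estimate only
        "debias": True,                # also compute the proximal two-stage ATT
        "display_graphs": False,
    }
    config.update(overrides)
    return config


def run_spotsynth(df, **overrides):
    """Fit the estimator; mlsynth is imported lazily so the helper above needs no install."""
    from mlsynth.estimators.spotsynth import SPOTSYNTH

    return SPOTSYNTH(spotsynth_config(df, **overrides)).fit()


# Switching to S1 with a fixed donor count only changes two fields.
s1_config = spotsynth_config(None, selection="S1", n_donors=30)
```

Passing a plain dict relies on the constructor accepting `SPOTSYNTHConfig | dict`, as shown in the class signature above.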

Configuration#

class mlsynth.config_models.SPOTSYNTHConfig(*, df: ~pandas.DataFrame, outcome: str, treat: str, unitid: str, time: str, display_graphs: bool = True, save: bool | str = False, counterfactual_color: ~typing.List[str] = <factory>, treated_color: str = 'black', plot: ~mlsynth.config_models.PlotConfig = <factory>, selection: ~typing.Literal['S1', 'S2', 'all'] = 'S1', forecast: ~typing.Literal['loo', 'lag'] = 'loo', n_donors: ~typing.Annotated[int | None, ~annotated_types.Ge(ge=1)] = None, ppi: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Lt(lt=1.0)] = 0.8, n_factors: ~typing.Annotated[int, ~annotated_types.Ge(ge=1)] = 5, time_average: ~typing.Annotated[int | None, ~annotated_types.Ge(ge=1)] = None, inference: ~typing.Literal['bayes', 'frequentist'] = 'bayes', dirichlet_alpha: ~typing.Annotated[float, ~annotated_types.Gt(gt=0)] = 0.4, ci_level: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Lt(lt=1.0)] = 0.95, n_samples: ~typing.Annotated[int, ~annotated_types.Ge(ge=200)] = 4000, n_warmup: ~typing.Annotated[int, ~annotated_types.Ge(ge=100)] = 2000, debias: bool = False, seed: int = 0)#

Configuration for the SPOTSYNTH estimator.

O’Riordan & Gilligan-Lee (2025), “Spillover detection for donor selection in synthetic control models” (Journal of Causal Inference 13:20240036, doi:10.1515/jci-2024-0036). Screens each candidate donor for spillover contamination via a pre-intervention forecast test (Algorithm 1), excludes the contaminated donors, and fits a simplex synthetic control on the valid set. Inherits the standard df / outcome / treat / unitid / time interface and expects a single treated unit.

Parameters:

selection ({“S1”, “S2”, “all”}) – Donor-selection rule. S1 keeps the n_donors donors with the smallest forecast error (the analyst fixes how many to keep); S2 keeps donors whose realised post-intervention value falls inside the ppi posterior predictive interval (controls the false-positive rate); all keeps every donor (the unscreened baseline).
forecast ({“loo”, “lag”}) – Forecast anchor for the screen. loo (default) is the leave-one-out anchor: each donor’s whole post-intervention trajectory is forecast from the other donors’ common factors and scored by the mean absolute deviation. It is robust to the onset speed of the spillover (it detects gradually-arriving contamination that the first-post-point misses) and is the right choice whenever the donor pool is mostly valid – the typical applied setting. lag is the paper’s literal Algorithm 1: forecast the first post-intervention point from the last clean pre-intervention cross-section. It is the correct anchor only in the paper’s stress regime – a mostly-invalid donor pool with a sharp (immediate) spillover – where loo inverts because the contaminated majority defines the “consensus”. See the forecast-anchor discussion in the estimator docs.
n_donors (int, optional) – Number of donors to keep under S1 (default: half the pool).
ppi (float) – Posterior-predictive-interval level for S2 (default 0.8).
n_factors (int) – Number of donor factors used to regularise the forecast (default 5).
time_average (int, optional) – Bucket width for time-averaging the data before screening (lag only; reduces false negatives with very noisy donors).
inference ({“bayes”, “frequentist”}) – Synthetic-control weight model. bayes (default) is the authors’ Bayesian simplex SC – weights with a Dirichlet(dirichlet_alpha) prior, a half-normal prior on the residual sd, pre-period standardisation of target and donors, and 95% posterior-predictive credible intervals – fit with NumPyro’s NUTS (the same Hamiltonian-Monte-Carlo family as the authors’ Stan). It requires the optional numpyro package. frequentist is a fast, dependency-free simplex least-squares point estimate (no intervals), useful for large simulations or when NumPyro is unavailable.
dirichlet_alpha (float) – Dirichlet concentration on the donor weights (paper uses 0.4; < 1 favours sparse weights).
ci_level (float) – Credible-interval level for the Bayesian fit (paper uses 0.95).
n_samples, n_warmup (int) – Posterior draws and warm-up iterations for the Bayesian sampler.
debias (bool) – If True, also compute the proximal (two-stage / GMM) debiased ATT (equation 5), using the screen-excluded donors as proximal controls to correct errors-in-variables bias when the kept donors are noisy proxies.
seed (int) – RNG seed for the Bayesian sampler.

ci_level: float#

debias: bool#

dirichlet_alpha: float#

forecast: Literal['loo', 'lag']#

inference: Literal['bayes', 'frequentist']#

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

n_donors: int | None#

n_factors: int#

n_samples: int#

n_warmup: int#

ppi: float#

seed: int#

selection: Literal['S1', 'S2', 'all']#

time_average: int | None#

Helper Modules#

Spillover-detection screen (O’Riordan & Gilligan-Lee 2025, Algorithm 1).

The central object is spillover_screen(), which implements the donor forecast test underpinning Theorem 3.1: under invariant causal mechanisms and the proxy-completeness condition, a valid donor’s post-intervention value is forecastable from pre-intervention donor data. A donor whose realised post-intervention value departs from that forecast has either been hit by a spillover or seen its latent distribution shift – in either case it is unsafe to keep in the donor pool.

Two forecast anchors are provided. Both normalise each donor to zero mean and unit standard deviation over the pre-intervention window (Algorithm 1, step 1), optionally on time-averaged (“bucketed”) data (Section 3.2.1; Figure 3).

"loo" (default) – a leave-one-out anchor. Predict each donor’s whole post-intervention trajectory from the other donors’ common factors and rank by the mean absolute deviation. It differences out the common factor and accumulates evidence over the post-period, so it detects contamination regardless of how gradually the spillover arrives – but it needs a valid majority (the other donors form the reference). This is the right anchor for applied work (a few suspect donors in a mostly-valid pool) and is onset-robust.
"lag" – the paper’s Algorithm 1. Fit the forecast on lagged donor data over pre-intervention transitions, then predict the first post-intervention point from the last (clean) pre-intervention cross-section. Anchored to clean pre-data, it survives a mostly-invalid pool – but it assumes a sharp spillover (present at full size by the first post-period) and is blind to gradual onsets. Use it only in the paper’s mostly-invalid / sharp regime, where loo inverts. See run_forecast_power_analysis for the power comparison.

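
The leave-one-out idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code mlsynth ships: the function name loo_forecast_error is hypothetical, the common factors are taken to be principal components of the other donors' pre-period data, and the toy panel uses a single random-walk factor.

```python
import numpy as np

def loo_forecast_error(D, T0, n_factors=1):
    """Illustrative leave-one-out screen statistic (hypothetical helper).

    D is a (T, n_donors) array and T0 the number of pre-intervention rows.
    Returns one mean-absolute post-period deviation per donor.
    """
    D = np.asarray(D, dtype=float)
    # Standardise each donor on the pre-intervention window only.
    Z = (D - D[:T0].mean(axis=0)) / D[:T0].std(axis=0)
    errors = np.empty(Z.shape[1])
    for i in range(Z.shape[1]):
        others = np.delete(Z, i, axis=1)
        # Common factors of the other donors, estimated on pre-period rows.
        _, _, Vt = np.linalg.svd(others[:T0], full_matrices=False)
        k = min(n_factors, Vt.shape[0])
        scores = others @ Vt[:k].T          # factor scores for all T periods
        # Map factor scores to donor i using pre-period data only.
        coef, *_ = np.linalg.lstsq(scores[:T0], Z[:T0, i], rcond=None)
        forecast = scores[T0:] @ coef       # forecast of the untreated post path
        errors[i] = np.mean(np.abs(Z[T0:, i] - forecast))
    return errors


# Toy panel: ten donors track one random-walk factor; donor 3 gets a spillover.
rng = np.random.default_rng(0)
T, T0, n = 60, 40, 10
factor = np.cumsum(rng.normal(size=T))
D = factor[:, None] + rng.normal(scale=0.1, size=(T, n))
D[T0:, 3] += 5.0                            # sharp level shift after T0
errors = loo_forecast_error(D, T0)
flagged = int(np.argmax(errors))            # donor with the largest deviation
```

Because the other nine donors form the reference, the sketch only works when valid donors are the majority, which is the limitation the text attributes to the loo anchor.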
mlsynth.utils.spotsynth_helpers.screen.spillover_screen(D: ndarray, T0: int, donor_names, *, selection: str = 'S1', forecast: str = 'loo', n_donors=None, ppi: float = 0.8, n_factors: int = 5, time_average=None) → SpilloverScreen#

Run the Algorithm 1 spillover screen and select valid donors.

Parameters:

D (np.ndarray) – Donor-pool outcomes, shape (T, n_donors).
T0 (int) – Number of pre-intervention periods.
donor_names (sequence) – Donor names (length n_donors).
selection ({“S1”, “S2”, “all”}) – S1 keeps the n_donors donors with the smallest forecast error; S2 keeps the donors whose realised value lies inside the ppi posterior predictive interval; all keeps every donor (the baseline).
forecast ({“loo”, “lag”}) – Forecast anchor (default "loo"; see module docstring). Use "lag" only for the mostly-invalid / sharp-spillover regime.
n_donors (int, optional) – Number of donors to retain under S1 (default: half the pool, minimum 2).
ppi (float) – Posterior-predictive-interval level for S2 (default 0.8).
n_factors (int) – Number of donor factors used to regularise the forecast.
time_average (int, optional) – Bucket width for time-averaging the data before screening ("lag" only; Section 3.2.1).

Returns:

SpilloverScreen
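
The pre-window normalisation and the optional time averaging can be illustrated as follows. This is a sketch under assumptions that the documentation does not confirm: the helper names are hypothetical, incomplete buckets are dropped so that no bucket straddles the intervention date, and bucketing is applied before standardising.

```python
import numpy as np

def time_average(D, T0, width):
    """Average consecutive blocks of `width` rows, separately pre and post.

    Assumption: incomplete buckets are dropped (the oldest pre-period rows
    and the latest post-period rows), so no bucket mixes pre and post data.
    """
    D = np.asarray(D, dtype=float)
    n_cols = D.shape[1]
    n_pre, n_post = T0 // width, (len(D) - T0) // width
    pre = D[T0 - n_pre * width:T0]           # most recent complete pre buckets
    post = D[T0:T0 + n_post * width]         # earliest complete post buckets
    pre_b = pre.reshape(n_pre, width, n_cols).mean(axis=1)
    post_b = post.reshape(n_post, width, n_cols).mean(axis=1)
    return np.vstack([pre_b, post_b]), n_pre  # bucketed data and its new T0

def standardise_pre(D, T0):
    """Zero mean and unit standard deviation over the pre-intervention rows."""
    mu, sd = D[:T0].mean(axis=0), D[:T0].std(axis=0)
    return (D - mu) / sd

D = np.arange(20, dtype=float).reshape(10, 2)   # 10 periods, 2 donors
D_b, T0_b = time_average(D, T0=7, width=2)
Z = standardise_pre(D_b, T0_b)
```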

Panel ingestion for the SPOTSYNTH estimator (single treated unit).

mlsynth.utils.spotsynth_helpers.setup.prepare_spotsynth_inputs(df: DataFrame, outcome: str, treat: str, unitid: str, time: str) → SpotSynthInputs#

Pivot a long panel into a treated series and a balanced donor matrix.

SPOTSYNTH targets the classic single-treated-unit synthetic-control design: one unit adopts the intervention at a common time T0 and every other unit is a candidate donor to be screened for spillover contamination.
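
As a rough picture of what this ingestion step involves, the sketch below pivots a long panel with pandas. It is illustrative only: pivot_single_treated_panel is a hypothetical name, and the rule used for T0 (the number of periods before the first treated observation) is an assumption consistent with the description above.

```python
import numpy as np
import pandas as pd

def pivot_single_treated_panel(df, outcome, treat, unitid, time):
    """Return (y, D, T0, donor_names) from a long panel (hypothetical helper)."""
    wide = df.pivot(index=time, columns=unitid, values=outcome).sort_index()
    if wide.isna().any().any():
        raise ValueError("panel is not balanced")
    treated_units = df.loc[df[treat] == 1, unitid].unique()
    if len(treated_units) != 1:
        raise ValueError("expected exactly one treated unit")
    treated = treated_units[0]
    first_treated_time = df.loc[df[treat] == 1, time].min()
    T0 = int((wide.index < first_treated_time).sum())   # pre-period count
    donor_names = [c for c in wide.columns if c != treated]
    y = wide[treated].to_numpy(dtype=float)
    D = wide[donor_names].to_numpy(dtype=float)
    return y, D, T0, donor_names

# Three units over four periods; unit "A" is treated from period 3 onward.
offsets = {"A": 0.0, "B": 10.0, "C": 20.0}
rows = [
    {"unit": u, "time": t, "Y": t + offsets[u], "treated": int(u == "A" and t >= 3)}
    for u in offsets for t in range(1, 5)
]
panel = pd.DataFrame(rows)
y, D, T0, donor_names = pivot_single_treated_panel(panel, "Y", "treated", "unit", "time")
```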

Simplex synthetic-control weight solver for SPOTSYNTH.

After the spillover screen selects the valid donor subset, SPOTSYNTH fits the canonical Abadie-Diamond-Hainmueller program on those donors: non-negative weights summing to one, matching the treated unit’s pre-intervention path (O’Riordan & Gilligan-Lee 2025, equation (4) with the simplex restriction of Abadie et al.).

mlsynth.utils.spotsynth_helpers.sc.simplex_weights(y: ndarray, D: ndarray, T0: int) → Tuple[ndarray, ndarray]#

Fit simplex synthetic-control weights on the pre-intervention window.

Parameters:

y (np.ndarray) – Treated-unit series, length T.
D (np.ndarray) – Donor matrix, shape (T, n_donors).
T0 (int) – Number of pre-intervention periods.

Returns:

(weights, counterfactual) (tuple of np.ndarray) – weights has length n_donors; counterfactual is D @ weights over all T periods.
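
To make the simplex program concrete, here is a small NumPy sketch that solves it by projected gradient descent. The solver choice is an assumption made for illustration, and mlsynth may use a different optimiser; the function names are hypothetical.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def simplex_weights_pgd(y, D, T0, n_iter=5000):
    """Least squares on the pre-window subject to simplex weights (sketch)."""
    y, D = np.asarray(y, dtype=float), np.asarray(D, dtype=float)
    A, b = D[:T0], y[:T0]
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1 / Lipschitz constant of the gradient
    w = np.full(D.shape[1], 1.0 / D.shape[1])   # start from equal weights
    for _ in range(n_iter):
        w = project_to_simplex(w - step * A.T @ (A @ w - b))
    return w, D @ w                             # weights and counterfactual over all T

rng = np.random.default_rng(0)
D = rng.normal(size=(60, 4))
w_true = np.array([0.5, 0.3, 0.2, 0.0])
y = D @ w_true                                  # noiseless target for the demo
weights, counterfactual = simplex_weights_pgd(y, D, T0=40)
```

On this noiseless toy problem the recovered weights should match w_true closely, including the exact zero on the last donor.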

Bayesian Dirichlet simplex synthetic control (O’Riordan & Gilligan-Lee 2025, p.12).

The paper’s SC model is the Bayesian simplex regression

\[y^t \sim \mathcal N\Bigl(\textstyle\sum_i \beta_i x_i^t,\ \sigma_y\Bigr), \quad \beta_i \ge 0,\ \textstyle\sum_i \beta_i = 1, \quad \beta \sim \mathrm{Dirichlet}(0.4), \quad \sigma_y \sim \mathcal N^+(0, 1),\]

with the target and donors standardised to zero mean and unit standard deviation over the pre-intervention window (which absorbs the intercept \(\alpha\) of equation (4)). The \(\mathrm{Dirichlet}(0.4)\) prior (concentration < 1) regularises the weights toward sparse corners of the simplex. 95% credible intervals come from the 2.5 / 97.5 percentiles of the posterior predictive distribution.
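
The sparsity effect of a concentration below 1 is easy to see by sampling from the prior. The snippet assumes that Dirichlet(0.4) denotes a symmetric prior with concentration 0.4 on every donor; the comparison value 5.0 is arbitrary and chosen only for contrast.

```python
import numpy as np

rng = np.random.default_rng(0)
n_donors, n_draws = 10, 5000

# Symmetric Dirichlet draws: concentration 0.4 (the prior above) vs 5.0.
sparse = rng.dirichlet(np.full(n_donors, 0.4), size=n_draws)
diffuse = rng.dirichlet(np.full(n_donors, 5.0), size=n_draws)

# Average size of the largest weight in a draw.
mean_max_sparse = sparse.max(axis=1).mean()
mean_max_diffuse = diffuse.max(axis=1).mean()

# Share of individual weights that are essentially zero under each prior.
share_near_zero_sparse = (sparse < 0.01).mean()
share_near_zero_diffuse = (diffuse < 0.01).mean()
```

Under the 0.4 prior a typical draw puts most of its mass on a few donors and leaves many weights near zero, whereas the 5.0 prior spreads weight almost evenly.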

The authors fit this model in Stan (Hamiltonian Monte Carlo / NUTS) but could not share their code. mlsynth fits the same model with NumPyro’s NUTS, a sampler from the same HMC family, so it follows their estimation procedure as closely as an open-source tool can. NumPyro is an optional dependency; if it is not installed, use inference="frequentist" for the dependency-free simplex least-squares point estimate.

class mlsynth.utils.spotsynth_helpers.bayes.BayesianSCFit(weights: ndarray, counterfactual: ndarray, cf_lower: ndarray, cf_upper: ndarray, att: float, att_ci: Tuple[float, float], sigma: float, accept_prob: float, n_samples: int)#

Posterior summary of the Dirichlet simplex SC (NUTS).

accept_prob: float#

att: float#

att_ci: Tuple[float, float]#

cf_lower: ndarray#

cf_upper: ndarray#

counterfactual: ndarray#

n_samples: int#

sigma: float#

weights: ndarray#

mlsynth.utils.spotsynth_helpers.bayes.bayesian_simplex_sc(y: ndarray, D: ndarray, T0: int, *, alpha: float = 0.4, sigma_prior_scale: float = 1.0, n_samples: int = 4000, n_warmup: int = 1000, n_chains: int = 2, ci_level: float = 0.95, seed: int = 0) → BayesianSCFit#

Fit the Dirichlet(alpha) Bayesian simplex SC with NumPyro NUTS.

Parameters:

y (np.ndarray) – Treated-unit outcome, length T.
D (np.ndarray) – Donor matrix, shape (T, n).
T0 (int) – Number of pre-intervention periods (weights are fit on the pre-window).
alpha (float) – Dirichlet concentration (paper uses 0.4; < 1 favours sparse weights).
sigma_prior_scale (float) – Scale of the half-normal prior on the (standardised) residual sd.
n_samples, n_warmup (int) – Total post-warm-up draws (across chains) and warm-up iterations per chain.
n_chains (int) – Number of NUTS chains.
ci_level (float) – Credible-interval level (paper uses 0.95).
seed (int) – RNG seed for the NUTS sampler.

Returns:

BayesianSCFit

Raises:

MlsynthEstimationError – If NumPyro / JAX are not installed.
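
The sketch below shows one way posterior draws could be turned into the percentile summaries described above. It is an assumption-laden illustration rather than the library's code: summarise_posterior is a hypothetical helper, the draws are taken to be already on the outcome scale, and the stand-in draws are generated directly instead of coming from NUTS.

```python
import numpy as np

def summarise_posterior(y, D, T0, beta_draws, sigma_draws, ci_level=0.95, seed=0):
    """Percentile summaries from weight draws (S, n) and residual-sd draws (S,)."""
    rng = np.random.default_rng(seed)
    mean_path = beta_draws @ D.T                                   # (S, T)
    noise = rng.normal(size=mean_path.shape) * sigma_draws[:, None]
    predictive = mean_path + noise                                 # posterior predictive
    tail = 100 * (1 - ci_level) / 2
    cf_lower, cf_upper = np.percentile(predictive, [tail, 100 - tail], axis=0)
    att_draws = (y[T0:] - predictive[:, T0:]).mean(axis=1)         # one ATT per draw
    att_ci = tuple(np.percentile(att_draws, [tail, 100 - tail]))
    return mean_path.mean(axis=0), cf_lower, cf_upper, float(att_draws.mean()), att_ci

rng = np.random.default_rng(1)
T, T0 = 30, 20
D = rng.normal(size=(T, 3))
y = D @ np.array([0.6, 0.4, 0.0])
y[T0:] += 2.0                                                      # true effect of 2
beta_draws = rng.dirichlet([60.0, 40.0, 1.0], size=2000)           # stand-in draws, not a real posterior
sigma_draws = np.full(2000, 0.1)
cf, cf_lower, cf_upper, att, att_ci = summarise_posterior(y, D, T0, beta_draws, sigma_draws)
```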

Proximal (two-stage / GMM) debiasing with excluded donors.

O’Riordan & Gilligan-Lee (2025), Section 3.3, equation (5). When the kept donors are noisy (imperfect) proxies of the latents, a synthetic control fit directly on them suffers errors-in-variables (attenuation) bias even with a perfect valid-donor selection (the paper’s Figure 4). The donors excluded by the spillover screen – though invalid for building the counterfactual – satisfy the proxy condition on pre-intervention data, so they can serve as proximal control variables to debias the weights.

The paper notes (page 9) that the joint model of equation (5) “effectively combines [a] two-stage process into a single model”, and that this is the standard proximal / instrumental-variables estimator: regress the kept donors \(X\) on the excluded donors \(Z\) to form \(\hat X\), then regress the target \(y\) on \(\hat X\). Crucially, only pre-intervention data from the excluded donors is used, so their post-intervention spillover dynamics never enter the estimate. This module implements that two-stage estimator in closed form – no probabilistic-programming dependency.

class mlsynth.utils.spotsynth_helpers.debias.ProximalDebiasFit(weights: ndarray, intercept: float, counterfactual: ndarray, att: float, n_instruments: int)#

Result of the two-stage proximal debiasing.

att: float#

counterfactual: ndarray#

intercept: float#

n_instruments: int#

weights: ndarray#

mlsynth.utils.spotsynth_helpers.debias.proximal_debias(y: ndarray, X_kept: ndarray, Z_excluded: ndarray, T0: int) → ProximalDebiasFit#

Two-stage proximal debiasing of the SC weights (equation 5).

Parameters:

y (np.ndarray) – Treated-unit outcome, length T.
X_kept (np.ndarray) – Kept (valid) donor matrix used to build the SC, shape (T, k).
Z_excluded (np.ndarray) – Excluded donor matrix used as proximal controls, shape (T, m). Only the pre-intervention rows enter the estimate.
T0 (int) – Number of pre-intervention periods.

Returns:

ProximalDebiasFit

Notes

Stage 1 regresses each kept donor on the excluded donors over the pre-intervention window to form the proximal projection \(\hat X = Z\,(Z_{\text{pre}}^+ X_{\text{pre}})\). Stage 2 regresses the treated unit on \(\hat X\) over the pre-window to obtain the debiased weights, and the counterfactual is the kept donors evaluated at those weights. With fewer excluded donors than kept donors the projection is rank-deficient and debiasing is skipped (the kept-donor fit is returned).
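
The two stages in the note above can be written out directly with a pseudo-inverse. The following is a sketch, not the module's source: two_stage_debias is a hypothetical name, and centring every series on its pre-period mean to handle the intercept is an assumption of this example.

```python
import numpy as np

def two_stage_debias(y, X_kept, Z_excluded, T0):
    """Two-stage proximal fit using pre-period rows of the excluded donors."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X_kept, dtype=float)
    Z = np.asarray(Z_excluded, dtype=float)
    y_mu, X_mu, Z_mu = y[:T0].mean(), X[:T0].mean(axis=0), Z[:T0].mean(axis=0)
    Xc, Zc, yc = X - X_mu, Z - Z_mu, y - y_mu
    if Z.shape[1] < X.shape[1]:
        design = Xc[:T0]                         # too few proximal controls: skip stage 1
    else:
        B = np.linalg.pinv(Zc[:T0]) @ Xc[:T0]    # stage 1: kept donors on excluded donors
        design = Zc[:T0] @ B                     # projected kept donors, pre-period only
    w, *_ = np.linalg.lstsq(design, yc[:T0], rcond=None)   # stage 2
    intercept = float(y_mu - X_mu @ w)
    counterfactual = intercept + X @ w           # kept donors evaluated at the weights
    att = float(np.mean(y[T0:] - counterfactual[T0:]))
    return w, intercept, counterfactual, att

# Errors-in-variables demo: both donors are noisy proxies of one latent u.
rng = np.random.default_rng(0)
T0, n_post = 1000, 50
T = T0 + n_post
u = rng.normal(size=T)
X = u[:, None] + rng.normal(size=(T, 1))         # kept donor
Z = u[:, None] + rng.normal(size=(T, 1))         # excluded donor
Z[T0:] -= 2.0                                    # its post-period spillover is never used
y = u.copy()
y[T0:] += 2.0
w_iv, intercept, cf, att = two_stage_debias(y, X, Z, T0)

# Naive pre-period regression on the noisy kept donor, for comparison.
Xc_pre = X[:T0] - X[:T0].mean(axis=0)
w_ols, *_ = np.linalg.lstsq(Xc_pre, y[:T0] - y[:T0].mean(), rcond=None)
```

In this toy setup the true loading is 1: the naive slope is attenuated toward roughly one half, while the two-stage slope should land much closer to 1.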

Orchestration for the SPOTSYNTH estimator (O’Riordan & Gilligan-Lee 2025).

mlsynth.utils.spotsynth_helpers.pipeline.run_spotsynth(inputs: SpotSynthInputs, *, selection: str = 'S1', forecast: str = 'loo', n_donors=None, ppi: float = 0.8, n_factors: int = 5, time_average=None, inference: str = 'bayes', dirichlet_alpha: float = 0.4, ci_level: float = 0.95, n_samples: int = 4000, n_warmup: int = 1000, debias: bool = False, seed: int = 0) → SpotSynthResults#

Screen donors for spillover, then fit a synthetic control on the valid set.

Stages#

Spillover screen (Algorithm 1) over the donor pool -> valid-donor subset.
Synthetic control on the selected donors: the authors’ Bayesian Dirichlet simplex SC fit with NumPyro NUTS (inference="bayes", with credible intervals) or a fast frequentist simplex (inference="frequentist").
ATT as the mean post-intervention gap; an unscreened (All) ATT is reported alongside for comparison.
Optionally, the proximal (two-stage / GMM) debiased ATT using the excluded donors (debias=True).
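
The ATT stage is simple arithmetic, shown here on made-up numbers. The helper att_summary is hypothetical and only mirrors the description above (mean post-intervention gap, plus a per-period breakdown keyed by time label).

```python
def att_summary(y, counterfactual, T0, time_labels):
    """Mean post-intervention gap and the per-period gaps keyed by time label."""
    gap = [obs - cf for obs, cf in zip(y, counterfactual)]
    post_gap = gap[T0:]                          # periods at index T0 and later
    att = sum(post_gap) / len(post_gap)
    att_by_period = dict(zip(time_labels[T0:], post_gap))
    return att, att_by_period

y = [10.0, 11.0, 12.0, 15.0, 17.0]               # observed treated series
cf = [10.0, 11.0, 12.0, 13.0, 14.0]              # synthetic-control counterfactual
att, att_by_period = att_summary(y, cf, T0=3, time_labels=[2001, 2002, 2003, 2004, 2005])
# att is 2.5; att_by_period is {2004: 2.0, 2005: 3.0}
```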

Frozen dataclasses for the SPOTSYNTH estimator.

O’Riordan & Gilligan-Lee (2025), Spillover detection for donor selection in synthetic control models, Journal of Causal Inference 13:20240036 (doi:10.1515/jci-2024-0036). SPOTSYNTH screens every candidate donor for spillover contamination by testing whether its first post-intervention value can be forecast from pre-intervention donor data (Theorem 3.1, Algorithm 1), excludes the donors that fail the test, and builds a synthetic control on the donors judged valid.

class mlsynth.utils.spotsynth_helpers.structures.SpilloverScreen(donor_names: List[Any], forecast_error: ndarray, inside_ppi: ndarray, selected_idx: ndarray, excluded_idx: ndarray, selection: str, forecast: str, metadata: Dict[str, Any] = <factory>)#

Per-donor output of the Algorithm 1 spillover screen.

donor_names#

Donor names, in the original pool order.

Type:: list

forecast_error#

Procedure S1: absolute (normalised) forecast error \(A_i = |x_i^t - \hat x_i^t|\) at the screened post-intervention horizon. Smaller = more likely a valid donor.

Type:: np.ndarray

inside_ppi#

Procedure S2: True where the realised post-intervention value falls inside the donor’s forecast posterior predictive interval (i.e. B = 0 in the paper – judged valid).

Type:: np.ndarray of bool

selected_idx#

Indices (into the donor pool) of the donors judged valid and used to build the synthetic control.

Type:: np.ndarray

excluded_idx#

Indices of the donors flagged as spillover-contaminated.

Type:: np.ndarray

selection#

"S1", "S2", or "all" – which procedure drove the selection.

Type:: str

forecast#

"lag" (paper Algorithm 1) or "loo" (leave-one-out variant).

Type:: str

metadata#

Type:: dict

donor_names: List[Any]#

excluded_idx: ndarray#

property excluded_names: List[Any]#

forecast: str#

forecast_error: ndarray#

inside_ppi: ndarray#

metadata: Dict[str, Any]#

selected_idx: ndarray#

property selected_names: List[Any]#

selection: str#
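
Given the forecast_error and inside_ppi fields, the two selection rules reduce to a sort and a mask. The sketch below is illustrative; the helper names are hypothetical, and the rounding used for the default pool size (integer half, minimum 2) is an assumption.

```python
import numpy as np

def select_s1(forecast_error, n_keep=None):
    """S1: keep the n_keep donors with the smallest forecast error."""
    err = np.asarray(forecast_error, dtype=float)
    if n_keep is None:
        n_keep = max(2, len(err) // 2)           # half the pool, at least 2
    order = np.argsort(err, kind="stable")
    return np.sort(order[:n_keep]), np.sort(order[n_keep:])

def select_s2(realised, lower, upper):
    """S2: keep donors whose realised value lies inside the predictive interval."""
    realised, lower, upper = (np.asarray(a, dtype=float) for a in (realised, lower, upper))
    inside = (lower <= realised) & (realised <= upper)
    return np.flatnonzero(inside), np.flatnonzero(~inside), inside

s1_selected, s1_excluded = select_s1([0.1, 2.0, 0.3, 1.5, 0.2, 0.05])
s2_selected, s2_excluded, inside_ppi = select_s2(
    realised=[0.0, 2.1, -0.2, 1.4], lower=[-1.0] * 4, upper=[1.0] * 4,
)
```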

class mlsynth.utils.spotsynth_helpers.structures.SpotSynthInputs(y: ndarray, D: ndarray, T0: int, donor_names: List[Any], treated_name: Any, time_labels: ndarray)#

Preprocessed single-treated-unit panel for SPOTSYNTH.

y#

Treated-unit outcome series, length T.

Type:: np.ndarray

D#

Donor-pool outcomes, shape (T, n_donors) (columns = donors).

Type:: np.ndarray

T0#

Number of pre-intervention periods (intervention at index T0).

Type:: int

donor_names#

Names of the donor-pool columns (aligned with D).

Type:: list

treated_name#

Name of the treated unit.

Type:: Any

time_labels#

The T period labels.

Type:: np.ndarray

D: ndarray#

property T: int#

T0: int#

donor_names: List[Any]#

property n_donors: int#

time_labels: ndarray#

treated_name: Any#

y: ndarray#

class mlsynth.utils.spotsynth_helpers.structures.SpotSynthResults(*, effects: EffectsResults | None = None, fit_diagnostics: FitDiagnosticsResults | None = None, time_series: TimeSeriesResults | None = None, weights: WeightsResults | None = None, inference: InferenceResults | None = None, method_details: MethodDetailsResults | None = None, sub_method_results: Dict[str, Any] | None = None, additional_outputs: Dict[str, Any] | None = None, raw_results: Dict[str, Any] | None = None, execution_summary: Dict[str, Any] | None = None, plot_config: PlotConfig | None = None, inputs: SpotSynthInputs, screen: SpilloverScreen, att_by_period: Dict[Any, float], att_unscreened: float, inference_method: str = 'frequentist', counterfactual_lower: ndarray | None = None, counterfactual_upper: ndarray | None = None, att_debiased: float | None = None, metadata: Dict[str, Any] = <factory>)#

Top-level container returned by mlsynth.SPOTSYNTH.fit().

An EffectResult (the observational report). In addition to the SPOTSYNTH-specific fields below, it exposes the standardized sub-models (effects, time_series, weights, inference, fit_diagnostics, method_details) and the flat accessors att / att_ci / counterfactual / gap / donor_weights / pre_rmse. The screened ATT is the post-period mean gap; att_ci reads the Dirichlet credible interval from inference.

Parameters:

inputs (SpotSynthInputs)
screen (SpilloverScreen) – The per-donor spillover diagnostics and the valid-donor selection.
att_by_period (dict) – {time_label: gap} over the post-intervention periods.
att_unscreened (float) – ATT from a synthetic control on the full donor pool (the All baseline) – for comparison.
inference_method (str) – "bayes" (Dirichlet posterior) or "frequentist" (simplex LS). (Renamed from the former inference field, which now holds the standardized InferenceResults.)
counterfactual_lower, counterfactual_upper (np.ndarray, optional) – Posterior-predictive credible band for the counterfactual (length T) under inference_method="bayes", else None.
att_debiased (float, optional) – Proximal (two-stage / GMM) debiased ATT using the excluded donors as proximal controls (when debias=True), else None.
metadata (dict)

att_by_period: Dict[Any, float]#

att_debiased: float | None#

att_unscreened: float#

counterfactual_lower: np.ndarray | None#

counterfactual_upper: np.ndarray | None#

inference_method: str#

inputs: SpotSynthInputs#

metadata: Dict[str, Any]#

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'frozen': True, 'json_encoders': {<class 'numpy.ndarray'>: <function BaseEstimatorResults.Config.<lambda>>}}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

screen: SpilloverScreen#

The paper’s local-linear-trend DGP for SPOTSYNTH examples/replications.

Implements the data-generating process of O’Riordan & Gilligan-Lee (2025), Appendix B: a target series driven by a sum of latent local-linear-trend processes, a large pool of donors that are noisy proxies of those latents, and a random subset of donors hit by a constant spillover effect from the intervention. Returns a tidy long panel (one treated unit + donor pool) ready for mlsynth.SPOTSYNTH, together with the ground-truth validity mask.

mlsynth.utils.spotsynth_helpers.simulation.simulate_spillover_panel(n_donors: int = 120, n_latent: int = 10, T0: int = 80, n_post: int = 20, sigma_x: float = 0.5, frac_invalid: float = 0.8, tau: float = 2.0, spillover: float = -2.0, spillover_ramp: int = 1, sigma_u: float = 1.0, sigma_delta: float = 0.1, sigma_y: float = 0.1, seed: int = 0) → Tuple[DataFrame, ndarray]#

Simulate one panel from the Appendix B local-linear-trend DGP.

\[\begin{split}u_j^{t+1} &\sim \mathcal N(u_j^t + \delta_j^t,\ \sigma_u), \\ \delta_j^{t+1} &\sim \mathcal N(S_j + \rho_j(\delta_j^t - S_j),\ \sigma_\delta), \\ y^t &\sim \mathcal N\Bigl(\sum_j u_j^t + \tau I^t,\ \sigma_y\Bigr), \\ x_i^t &\sim \mathcal N\Bigl(\sum_j u_j^t + \tau_{x_i} I^t,\ \sigma_x\Bigr),\end{split}\]

with \(S_j \sim \mathcal N(0.1, 0.1)\), \(\rho_j \sim U(0, 1)\), intervention indicator \(I^t = \mathbb 1\{t \ge T_0\}\), target effect \(\tau\), and donor spillover \(\tau_{x_i} =\) spillover for the invalid donors and 0 for the valid ones.

Parameters:

n_donors (int) – Size of the donor pool.
n_latent (int) – Number of latent local-linear-trend processes.
T0, n_post (int) – Pre- and post-intervention period counts.
sigma_x (float) – Donor-noise standard deviation (low/medium/high ~ 0.1/0.5/1.0 in the paper). When sigma_x approaches |spillover| the screen’s false-negative rate rises (Figure 2).
frac_invalid (float) – Fraction of donors hit by the spillover effect (0.8 in the paper).
tau, spillover (float) – Treatment effect on the target and spillover effect on invalid donors.
spillover_ramp (int) – Onset speed of the donor spillover: 1 is a sharp/immediate level shift (the paper’s DGP); larger values ramp the spillover linearly to its final level over that many post-periods (a gradual onset), holding the final level fixed. Used to study detection power vs onset speed.
sigma_u, sigma_delta, sigma_y (float) – Latent-level, slope, and target-noise standard deviations.
seed (int) – RNG seed.

Returns:

(df, valid_mask) (tuple) – df is a long panel with columns unit, time, Y, treated; the treated unit is named "target" and donors "d{j}". valid_mask is a boolean array (length n_donors, aligned with sorted donor names) flagging the valid donors.
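
For readers who want to see the equations above as code, here is a compact NumPy sketch of the sharp-onset case. It is not the library's simulator: simulate_llt_panel is a hypothetical name, the initial conditions (latent levels at 0, slopes at S_j) are assumptions, and the gradual-onset ramp is omitted.

```python
import numpy as np

def simulate_llt_panel(n_donors=20, n_latent=3, T0=40, n_post=10, sigma_x=0.5,
                       frac_invalid=0.8, tau=2.0, spillover=-2.0,
                       sigma_u=1.0, sigma_delta=0.1, sigma_y=0.1, seed=0):
    """Local-linear-trend panel with a constant spillover on invalid donors."""
    rng = np.random.default_rng(seed)
    T = T0 + n_post
    S = rng.normal(0.1, 0.1, size=n_latent)          # long-run slopes S_j
    rho = rng.uniform(0.0, 1.0, size=n_latent)       # slope persistence rho_j
    u = np.zeros((T, n_latent))                      # assumption: u_j starts at 0
    delta = np.tile(S, (T, 1))                       # assumption: delta_j starts at S_j
    for t in range(T - 1):
        u[t + 1] = rng.normal(u[t] + delta[t], sigma_u)
        delta[t + 1] = rng.normal(S + rho * (delta[t] - S), sigma_delta)
    level = u.sum(axis=1)                            # sum of latents at each t
    post = (np.arange(T) >= T0).astype(float)        # intervention indicator I^t
    valid = np.ones(n_donors, dtype=bool)
    n_invalid = int(round(frac_invalid * n_donors))
    valid[rng.choice(n_donors, size=n_invalid, replace=False)] = False
    tau_x = np.where(valid, 0.0, spillover)          # donor-specific spillover
    y = rng.normal(level + tau * post, sigma_y)
    X = rng.normal(level[:, None] + post[:, None] * tau_x[None, :], sigma_x)
    return y, X, valid

y, X, valid = simulate_llt_panel()
# Post-period gap between invalid and valid donors; should sit near the spillover.
post_gap = X[40:, ~valid].mean() - X[40:, valid].mean()
```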

Replications of O’Riordan & Gilligan-Lee (2025).

Path B – the paper’s simulation study (Section 4.1 / Figure 2). On the Appendix B data-generating process, a synthetic control built on all donors is biased by the spillover-contaminated ones (bias ~1.6), one built on the valid donors is unbiased, and the S1 / S2 donor-selection procedures recover most of that gap, degrading as the donor noise approaches the spillover magnitude.
Path A (semi-synthetic) – the real-data demonstrations (Section 4.2 / Figure 6) on California tobacco control and German reunification. A semi-synthetic donor that is a noisy proxy of the treated unit grabs a large synthetic-control weight and biases the effect toward zero; the screen flags and excludes it, restoring the canonical effect.

Every reproduction is driven through the public mlsynth.SPOTSYNTH estimator (.fit()), not the internal helpers.

class mlsynth.utils.spotsynth_helpers.replication.SpotSimConfig(n_donors: int = 120, n_latent: int = 10, T0: int = 80, n_post: int = 20, frac_invalid: float = 0.8, tau: float = 2.0, spillover: float = -2.0, n_keep: int = 15, ppi: float = 0.8, n_factors: int = 5, n_reps: int = 20, noise_levels: tuple = (0.1, 0.5, 1.0))#

Parameters for the SPOTSYNTH simulation study (paper Section 4.1).

T0: int = 80#

frac_invalid: float = 0.8#

n_donors: int = 120#

n_factors: int = 5#

n_keep: int = 15#

n_latent: int = 10#

n_post: int = 20#

n_reps: int = 20#

noise_levels: tuple = (0.1, 0.5, 1.0)#

ppi: float = 0.8#

spillover: float = -2.0#

tau: float = 2.0#

mlsynth.utils.spotsynth_helpers.replication.replicate_all_spillover(*, verbose: bool = True) → Dict[str, Dict]#

Run all three real-data spillover demonstrations with the loo default.

Reproduces, end-to-end through SPOTSYNTH.fit(), the semi-synthetic contamination-and-recovery on California tobacco control, German reunification, and the Basque Country – each returning the oracle / contaminated / screened ATTs, the screened 95% credible interval, the pre-treatment RMSE, the synthetic-control donor weights, and the selected / excluded donor sets.

Returns:: dict – {"prop99": ..., "germany": ..., "basque": ...}, each the rich result dict from the corresponding replicate_* function.

mlsynth.utils.spotsynth_helpers.replication.replicate_basque_spillover(data: str | DataFrame | None = None, *, n_keep: int = 12, sigma: float = 0.1, seed: int = 0, verbose: bool = True)#

Basque Country (ETA terrorism, 1975) with a planted spillover donor.

A third canonical SC panel (Abadie & Gardeazabal 2003), not in the O’Riordan & Gilligan-Lee paper – an additional robustness check. Loads the 17-region Spanish panel, plants a semi-synthetic donor that is a noisy proxy of the Basque Country, and runs SPOTSYNTH S1 (keep 12) with the default leave-one-out forecast. Because the ETA effect builds gradually, this is exactly the case the loo anchor is designed for; the invalid donor is excluded and the ~-0.7 (thousand 1986 USD) per-capita-GDP effect is restored.

mlsynth.utils.spotsynth_helpers.replication.replicate_germany_spillover(data: str | DataFrame | None = None, *, n_keep: int = 12, sigma: float = 20.0, seed: int = 0, verbose: bool = True)#

German reunification with a planted spillover donor (Figure 6a).

Loads the 17-country OECD panel, plants a semi-synthetic donor that is a noisy proxy of West Germany, and runs SPOTSYNTH S1 (keep 12) with the default leave-one-out forecast (the reunification effect builds slowly). The invalid donor is excluded, restoring the large negative per-capita-GDP effect.

mlsynth.utils.spotsynth_helpers.replication.replicate_prop99_spillover(data: str | DataFrame | None = None, *, n_keep: int = 30, sigma: float = 0.5, seed: int = 0, verbose: bool = True)#

California tobacco control with a planted spillover donor (Figure 6b).

Loads the 39-state Abadie tobacco panel, plants a semi-synthetic donor that is a noisy proxy of California, and runs SPOTSYNTH S1 (keep 30) with the default leave-one-out (loo) forecast anchor. The invalid donor is flagged and excluded, restoring the canonical ~-20 effect.

mlsynth.utils.spotsynth_helpers.replication.run_forecast_power_analysis(*, n_donors: int = 60, T0: int = 60, n_post: int = 30, sigma_x: float = 0.5, invalid_fracs=(0.3, 0.8), ramps=(1, 6, 24), n_factors: int = 8, n_reps: int = 30, seed: int = 0, verbose: bool = True) → Dict#

Detection power (AUC) of the lag vs loo anchors vs onset speed.

Reproduces the analysis that motivates loo as the default: for each contamination fraction and spillover onset speed (ramp; 1 = sharp, larger = gradual), it computes the detection AUC of the two shipped forecast anchors, where AUC is the probability that an invalid donor’s forecast statistic is more anomalous than a valid donor’s (1 = perfect, 0.5 = none, < 0.5 = inverted).

The headline findings it reproduces:

Valid majority (e.g. 30% invalid): loo is near-perfect and onset-robust (AUC ~0.95+ for sharp and gradual); lag (first post-point) only has power for sharp onsets and decays to chance as the onset becomes gradual.
Invalid majority (80%, the paper’s regime): loo inverts (AUC ~0, it flags the valid donors); lag is the only anchor with power, and only for sharp onsets.
Gradual onset + invalid majority: neither anchor has power – the honest limit of forecast-based spillover detection.

Returns:: dict – {(frac_invalid, ramp): {"lag": auc, "loo": auc}}.
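
The AUC defined above is a pairwise comparison and can be computed directly. This is a small sketch with a hypothetical helper name; counting ties as one half is an assumption of the example.

```python
def detection_auc(statistic, is_invalid):
    """Probability that an invalid donor scores higher than a valid donor."""
    invalid = [s for s, bad in zip(statistic, is_invalid) if bad]
    valid = [s for s, bad in zip(statistic, is_invalid) if not bad]
    if not invalid or not valid:
        raise ValueError("need at least one valid and one invalid donor")
    wins = 0.0
    for bad_score in invalid:
        for good_score in valid:
            if bad_score > good_score:
                wins += 1.0
            elif bad_score == good_score:
                wins += 0.5                       # ties count as half a win
    return wins / (len(invalid) * len(valid))

labels = [False, False, True, True]               # last two donors are invalid
perfect = detection_auc([0.1, 0.2, 0.9, 0.8], labels)    # 1.0
inverted = detection_auc([0.9, 0.8, 0.1, 0.2], labels)   # 0.0
mixed = detection_auc([0.1, 0.5, 0.3, 0.8], labels)      # 0.75
```

The inverted case is the pattern reported for loo under an invalid majority: the screen ranks valid donors as more anomalous than contaminated ones.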

mlsynth.utils.spotsynth_helpers.replication.run_spotsynth_simulation(cfg: SpotSimConfig = SpotSimConfig(n_donors=120, n_latent=10, T0=80, n_post=20, frac_invalid=0.8, tau=2.0, spillover=-2.0, n_keep=15, ppi=0.8, n_factors=5, n_reps=20, noise_levels=(0.1, 0.5, 1.0)), *, seed: int = 0, verbose: bool = True) → Dict[float, Dict[str, float]]#

Reproduce the Figure 2 bias finding through SPOTSYNTH.fit().

For each donor-noise level returns the mean bias E[tau_hat] - tau of four strategies: All (every donor), Valid (oracle – only the truly valid donors), S1 (smallest forecast error), and S2 (inside the PPI).

Returns:: dict – {sigma_x: {"All": bias, "Valid": bias, "S1": bias, "S2": bias}}.

Example#

A self-contained one-draw run on the paper’s data-generating process: a treated unit, a pool of 60 donors of which 80% carry a \(-2\) spillover, the S1 screen keeping the 12 most-forecastable donors, and the authors’ Bayesian Dirichlet SC (default) returning a 95% credible interval. The proximal (GMM) debiased ATT is requested with debias=True. Because this is the mostly-invalid regime, the example sets forecast="lag" (the default loo inverts when contaminated donors are the majority – see the forecast-anchor discussion).

"""One draw of the SPOTSYNTH spillover-detection DGP."""

from mlsynth import SPOTSYNTH
from mlsynth.utils.spotsynth_helpers import simulate_spillover_panel

# Treated unit + 60 donors; 80% invalid (spillover -2); true tau = 2.
df, valid_mask = simulate_spillover_panel(
    n_donors=60, T0=60, n_post=15, sigma_x=0.3, seed=3,
)

res = SPOTSYNTH({
    "df": df, "outcome": "Y", "treat": "treated",
    "unitid": "unit", "time": "time",
    "selection": "S1", "forecast": "lag", "n_donors": 12,
    "inference": "bayes",     # Dirichlet(0.4) Bayesian simplex SC (default)
    "debias": True,           # also report the proximal/GMM debiased ATT
    "display_graphs": False,
}).fit()

print("true tau            = +2.00")
print(f"unscreened ATT      = {res.att_unscreened:+.2f}   (all 60 donors)")
print(f"screened   ATT      = {res.att:+.2f}   ({res.metadata['n_selected']} valid donors)")
lo, hi = res.att_ci
print(f"95% credible ATT    = [{lo:+.2f}, {hi:+.2f}]")
print(f"debiased   ATT      = {res.att_debiased:+.2f}")
print(f"donors screened out = {res.metadata['n_excluded']}")
# forecast errors: valid donors should score lower than invalid ones
err = res.screen.forecast_error
print(f"mean S1 error  valid={err[valid_mask].mean():.3f}  "
      f"invalid={err[~valid_mask].mean():.3f}")

Verification (Path B): the simulation study#

This reproduces the headline finding of the paper’s Figure 2 through the public SPOTSYNTH.fit() call. On the Appendix B data-generating process, a synthetic control built on all donors is biased upward (~+1.6) by the spillover-contaminated ones; one built on the valid donors is unbiased; and the S1 / S2 screens recover most of that gap, degrading as the donor noise grows toward the spillover magnitude.

Note

This study is the paper’s mostly-invalid regime (80% of donors contaminated, with a sharp spillover), so it pins forecast="lag". The package default loo inverts here because the contaminated majority defines the consensus. This is the one regime where lag is the correct anchor, which is why the paper uses the first-post-point screen. See the forecast-anchor power analysis above.

from mlsynth.utils.spotsynth_helpers import (
    run_spotsynth_simulation, SpotSimConfig,
)

# A compact configuration that runs in well under a minute.
cfg = SpotSimConfig(
    n_donors=80, T0=60, n_post=15, n_keep=12, n_reps=12,
    noise_levels=(0.1, 0.5, 1.0),
)
bias = run_spotsynth_simulation(cfg, seed=0)

prints a table of the bias \(\mathbb E[\widehat{\tau}] - \tau\) like:

SPOTSYNTH simulation (Figure 2), 12 reps, 80 donors, 80% invalid:
  noise      All    Valid       S1       S2
    0.1    +1.61    +0.00    +0.50    +1.32
    0.5    +1.60    -0.00    +0.48    +1.28
    1.0    +1.62    -0.00    +0.87    +1.10

reproducing the qualitative finding: All is badly biased, Valid is unbiased, S1 removes most of the bias at low and medium noise and is the best screen throughout, S2 is intermediate, and S1 degrades as the noise rises toward the spillover size. (The paper’s full study uses 1000 donors and 2000 reps. The SpotSimConfig defaults documented above are far smaller, 120 donors and 20 reps, so set n_donors and n_reps explicitly to approach that scale.)

Verification (semi-synthetic real data): Figure 6#

The paper also demonstrates the screen on two canonical SC datasets by planting a semi-synthetic invalid donor – a noisy proxy of the treated unit, \(y_{\text{syn},t} \sim \mathcal N(y_{1t}, \sigma)\). Being a near-copy of the target, this invalid donor receives a large SC weight and biases the effect toward zero; the screen flags and excludes it, restoring the canonical effect. Both demos run through SPOTSYNTH.fit().

California tobacco control (Figure 6b). The 39-state Abadie panel plus the planted donor; S1 keeps 30 donors with the default leave-one-out forecast (one contaminant in a mostly-valid pool – the regime loo is built for).

import numpy as np
import pandas as pd
from mlsynth import SPOTSYNTH

url = ("https://raw.githubusercontent.com/jgreathouse9/mlsynth/"
       "main/basedata/P99data.csv")
df = pd.read_csv(url)[["state", "year", "cigsale"]]
df["treated"] = ((df["state"] == "California") & (df["year"] >= 1989)).astype(int)

# Plant a noisy proxy of California (sigma = 0.5) into the donor pool.
rng = np.random.default_rng(0)
ca = df[df["state"] == "California"].sort_values("year")
syn = ca.copy()
syn["state"] = "Synthetic California"
syn["cigsale"] = ca["cigsale"].to_numpy() + rng.normal(0, 0.5, len(ca))
syn["treated"] = 0  # the planted unit is a donor, never treated
df_contam = pd.concat([df, syn], ignore_index=True)

res = SPOTSYNTH({
    "df": df_contam, "outcome": "cigsale", "treat": "treated",
    "unitid": "state", "time": "year",
    "selection": "S1", "n_donors": 30,
    # forecast="loo" is the default; inference="bayes" is the default
    # (Dirichlet(0.4) Bayesian simplex SC).
    "display_graphs": False,
}).fit()

print(f"contaminated (all donors) ATT = {res.att_unscreened:+.2f}")
print(f"screened ATT                  = {res.att:+.2f}")
lo, hi = res.att_ci
print(f"95% credible interval         = [{lo:+.2f}, {hi:+.2f}]")
print(f"planted donor excluded?       = "
      f"{'Synthetic California' in res.screen.excluded_names}")

prints:

contaminated (all donors) ATT = -1.43
screened ATT                  = -22.87
95% credible interval         = [-26.61, -18.73]
planted donor excluded?       = True

The contaminated pool gives a near-zero effect (the synthetic donor hijacks the SC); the screen excludes it and the Bayesian Dirichlet SC recovers the canonical \(\approx -20\) packs-per-capita effect, with a credible interval that brackets it.

replicate_all_spillover() runs all three real panels at once – California (Figure 6b), German reunification (Figure 6a), and, as an additional robustness check not in the paper, the Basque Country (Abadie & Gardeazabal 2003, ETA terrorism 1975). All use the default forecast="loo" (each is a single contaminant in a mostly-valid pool) and the Bayesian Dirichlet SC, and each returns the full set of standard outputs – oracle / contaminated / screened ATTs, the 95% credible interval, the pre-treatment RMSE, the SC donor weights, and the selected / excluded donor sets:

from mlsynth.utils.spotsynth_helpers import replicate_all_spillover
results = replicate_all_spillover()

# each entry carries the standard diagnostics, e.g. for California:
ca = results["prop99"]
print(ca["screened_att"], ca["att_ci"], ca["pre_rmse"])
print(ca["donor_weights"])           # {donor: weight} for the screened SC
print(ca["synthetic_donor_excluded"])

With verbose=True (the default), the replicate_all_spillover() call itself prints a report like:

Prop 99 (California):
  oracle ATT=-19.514  contaminated=-1.434  screened ATT=-22.871  95% CrI=(-26.61, -18.73)
  pre-treatment RMSE=2.544  donors kept=30/39  synthetic donor excluded=True
  top SC weights: Nevada=0.28  New Hampshire=0.24  Delaware=0.06
Reunification (Germany):
  oracle ATT=-1297.477  contaminated=-166.863  screened ATT=-1489.345  95% CrI=(-1672.92, -1314.43)
  pre-treatment RMSE=55.629  donors kept=12/17  synthetic donor excluded=True
  top SC weights: Austria=0.30  USA=0.28  Italy=0.22
Basque Country (ETA):
  oracle ATT=-0.692  contaminated=-0.379  screened ATT=-0.795  95% CrI=(-1.03, -0.60)
  pre-treatment RMSE=0.087  donors kept=12/17  synthetic donor excluded=True
  top SC weights: Cataluna=0.34  Aragon=0.17  Madrid (Comunidad De)=0.10

Summary (loo forecast, Bayesian Dirichlet SC):
panel                       oracle    contam   screened  pre-RMSE  synth excl
Prop 99 (California)       -19.514    -1.434    -22.871     2.544  True
Reunification (Germany)  -1297.477  -166.863  -1489.345    55.629  True
Basque Country (ETA)        -0.692    -0.379     -0.795     0.087  True

In all three the planted invalid donor is flagged and excluded, the Bayesian Dirichlet SC restores the effect the contamination had masked (with a 95% posterior-predictive credible interval), and the recovered donor weights match the canonical SC literature – Austria/USA for West Germany, Cataluna for the Basque Country. The Basque effect builds gradually, so it is exactly the case the loo anchor is designed for (the first-post-point lag anchor fails on it; see the forecast-anchor discussion).
