SEQ_SDID — Sequential Synthetic DiD (Arkhangelsky & Samkov 2025)

Estimator:

Sequential Synthetic Difference-in-Differences (Sequential SDiD), mlsynth.SequentialSDID

Source:

Arkhangelsky, D. & Samkov, A. (2025), “Sequential Synthetic Difference in Differences,” arXiv:2404.00164v2.

Replication type:

Path B — the paper’s Monte Carlo (Section 5.2.2, “Experiment 2: Calibrated State-Level Panel”; Table 1 and Figures 4-5).

Status:

Verified (geometry) — the headline coverage/RMSE contrast is reproduced; the exact Table-1 cells require the authors’ (non-public) CPS panel.

Validation strategy

The paper’s central empirical claim is about inference: when parallel trends fail because adoption timing is correlated with unobserved interactive fixed effects, standard difference-in-differences is severely biased and its confidence intervals under-cover, whereas Sequential SDiD stays approximately unbiased with near-nominal coverage. Table 1 quantifies this on a state-by-year panel calibrated to March-CPS women’s log wages: 95% CI coverage of ~0.95 for Sequential SDiD against ~0.70 for DiD, with lower RMSE at every lag.

That panel is not public, so the cells cannot be matched value-for-value. Instead we re-implement the design (scenario 1) from the paper’s description alone and reproduce its geometry: the same qualitative ranking and an even sharper version of the same coverage collapse.

A convenient feature of the method makes the comparison airtight: the paper’s standard-DiD comparator is the same estimator at \(\eta \to \infty\) (the “Original Results” line in Figure 1; the stacked-DiD limit of Remark 2.2), exposed as mode="sdid_imputation". Both arms therefore share the Bayesian bootstrap and differ only in the weighting.
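The limiting argument can be illustrated with a generic stand-in. The sketch below is not the paper's weighting program or mlsynth code: it assumes a ridge penalty `eta` on weights that must sum to one (the non-negativity constraint is omitted for a closed form), and the helper name `ridge_sum_to_one_weights` is hypothetical. It shows why a very large penalty pushes any such weights toward uniform, which is the equal-weight comparison that DiD uses.

```python
import numpy as np

def ridge_sum_to_one_weights(X, y, eta):
    """Minimise ||X w - y||^2 + eta * ||w||^2 subject to sum(w) = 1.

    Illustrative stand-in only (hypothetical helper): closed form from the
    KKT conditions, w = A^{-1} (X'y + mu * 1) with A = X'X + eta * I.
    """
    n = X.shape[1]
    A = X.T @ X + eta * np.eye(n)
    ones = np.ones(n)
    a = np.linalg.solve(A, X.T @ y)      # unconstrained ridge solution
    b = np.linalg.solve(A, ones)         # direction that restores sum(w) = 1
    mu = (1.0 - ones @ a) / (ones @ b)   # multiplier for the constraint
    return a + mu * b

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))             # 10 pre-periods, 5 donors (toy data)
y = rng.normal(size=10)

w_small = ridge_sum_to_one_weights(X, y, eta=0.05)   # data-driven weights
w_large = ridge_sum_to_one_weights(X, y, eta=1e9)    # approaches 1/5 each
```

With `eta` small the weights follow the pre-period fit; with `eta` very large they are numerically indistinguishable from `1/5` each.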

The data-generating process

The DGP is packaged in mlsynth.utils.seq_sdid_helpers.simulate. Following the paper’s recipe:

  • Structural truth is fixed; only shocks are redrawn. The authors freeze the estimated structural components (two-way FE plus a low-rank interactive fixed effect) and generate new draws by resampling the idiosyncratic AR shocks. calibrate_staggered_ife() draws the structure once; each call to simulate_replication() redraws only the AR(2) noise. This is what makes the within-panel bootstrap a valid measure of the sampling variability the Monte Carlo averages.

  • The IFE is a differential linear trend — the canonical rank-one interactive fixed effect, \(\lambda_i \, f_t\) with \(f_t = t/T\). Adoption is tilted toward high-loading (steeper-trending) units, so treatment timing is correlated with the unobserved trend. DiD assumes a common trend and is biased; Sequential SDiD balances the loading against later-adopting and never-treated donors and is not.

  • Cohorts are enlarged by replicating each unit four times (Section 5.2.1), so cohort aggregates concentrate.

  • Only donor-balanced cohorts are estimated. A cohort needs at least two later-adopting or never-treated donor cohorts to balance its loading, so a_max is capped at the sixth-latest cohort; the latest cohorts are donor-starved (see “Limitations” on the Sequential Synthetic Difference-in-Differences (Sequential SDiD) estimator page).
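The recipe above can be sketched in plain NumPy. This is a toy illustration, not the code in mlsynth.utils.seq_sdid_helpers.simulate: the function names (`draw_structure`, `ar2_noise`, `redraw`), the logistic tilt, the AR coefficients, and the panel dimensions are all assumptions chosen for readability, and the unit-replication and a_max steps are left out.

```python
import numpy as np

def draw_structure(rng, n_units=40, n_periods=20, tilt=2.0):
    """Draw the frozen structural part once (hypothetical helper)."""
    t = np.arange(1, n_periods + 1)
    lam = rng.normal(size=n_units)                  # IFE loadings lambda_i
    base = (rng.normal(size=(n_units, 1))           # unit fixed effects
            + rng.normal(size=(1, n_periods))       # time fixed effects
            + np.outer(lam, t / n_periods))         # lambda_i * f_t, f_t = t/T
    # Adoption tilted toward high-loading units (assumed logistic in lambda_i).
    treated = rng.random(n_units) < 1.0 / (1.0 + np.exp(-tilt * lam))
    first = rng.integers(n_periods // 2, n_periods, size=n_units)
    adopt = np.where(treated, first, n_periods)     # n_periods means never treated
    D = (np.arange(n_periods)[None, :] >= adopt[:, None]).astype(float)
    return {"base": base, "lam": lam, "treated": treated, "D": D}

def ar2_noise(rng, n_units, n_periods, phi=(0.5, 0.2), sigma=0.1, burn=50):
    """Stationary AR(2) shocks per unit; coefficients are assumed values."""
    e = np.zeros((n_units, n_periods + burn))
    u = rng.normal(scale=sigma, size=e.shape)
    for s in range(2, e.shape[1]):
        e[:, s] = phi[0] * e[:, s - 1] + phi[1] * e[:, s - 2] + u[:, s]
    return e[:, burn:]                              # drop the burn-in periods

def redraw(structure, rng, tau=1.0):
    """One Monte Carlo draw: same structure and adoption, fresh AR(2) noise."""
    n_units, n_periods = structure["base"].shape
    noise = ar2_noise(rng, n_units, n_periods)
    return structure["base"] + tau * structure["D"] + noise

structure = draw_structure(np.random.default_rng(0))
y1 = redraw(structure, np.random.default_rng(1))
y2 = redraw(structure, np.random.default_rng(2))
```

Here `y1` and `y2` share fixed effects, loadings, and adoption dates and differ only in the noise, mirroring the split between drawing the design once and redrawing shocks per replication.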

Reproducing Table 1’s geometry

import warnings
import numpy as np
from mlsynth import SequentialSDID
from mlsynth.utils.seq_sdid_helpers.simulate import (
    calibrate_staggered_ife, simulate_replication)

# Draw the structural design once; every replication reuses it.
design = calibrate_staggered_ife(seed=2024)
# True effect, number of lags, Monte Carlo draws, bootstrap reps.
tau, K, M, B = 1.0, 4, 40, 50

def fit(df, mode):
    """Fit one arm and return the event-study estimates and their CIs."""
    res = SequentialSDID({"df": df, "outcome": "y", "treat": "treat",
        "unitid": "unit", "time": "year", "mode": mode, "eta": 0.05,
        "K": K, "a_max": design.a_max, "n_bootstrap": B, "seed": 7,
        "display_graphs": False}).fit()
    return res.event_study.tau, res.event_study.ci

# Per-draw coverage for each arm: "ssdid" vs the DiD limit "sdid_imputation".
cov = {"ssdid": [], "sdid_imputation": []}
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    for m in range(M):
        # Fresh shocks on the frozen design, with a distinct seed per draw.
        df = simulate_replication(design, np.random.default_rng(8000 + m), tau=tau)
        for mode in cov:
            tau_hat, ci = fit(df, mode)
            # Share of lags whose CI contains the true effect in this draw.
            cov[mode].append(((ci[:, 0] <= tau) & (tau <= ci[:, 1])).mean())
print("SSDiD coverage", np.mean(cov["ssdid"]))
print("DiD   coverage", np.mean(cov["sdid_imputation"]))

Results

At \(M = 40\) draws, \(B = 50\) bootstrap reps (the paper uses \(M = 1000\), \(B = 100\)):

Metric                      Sequential SDiD   Standard DiD   Paper (SSDiD / DiD)
95% CI coverage             0.945             0.45           ~0.95 / ~0.70
mean \(|\mathrm{bias}|\)    0.062             0.305
RMSE                        0.252             0.346          SSDiD < DiD
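
The three reported metrics can be computed from per-draw estimates along the following lines. This is a minimal sketch, not the benchmark's code: the function name `summarize` is hypothetical, and the aggregation conventions (bias averaged over draws per lag and then in absolute value over lags; RMSE pooled over draws and lags) are assumptions, since the page does not spell them out.

```python
import numpy as np

def summarize(tau_hat, ci_lo, ci_hi, tau):
    """Coverage, mean absolute bias and RMSE from (M draws, K lags) arrays.

    Hypothetical helper; the aggregation choices are assumptions.
    """
    err = tau_hat - tau                              # estimation error per draw and lag
    covered = (ci_lo <= tau) & (tau <= ci_hi)        # does the CI contain the truth?
    return {
        "coverage": covered.mean(),
        "mean_abs_bias": np.abs(err.mean(axis=0)).mean(),
        "rmse": np.sqrt((err ** 2).mean()),
    }

# Toy input: 2 draws, 2 lags, intervals of half-width 0.2 around each estimate.
est = np.array([[1.1, 0.9],
                [1.3, 0.9]])
stats = summarize(est, est - 0.2, est + 0.2, tau=1.0)
```

Applied to stacked `tau_hat` and `ci` arrays collected inside the Monte Carlo loop, the same function would give one row of the table per arm.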

What it confirms

  • Sequential SDiD delivers valid inference — coverage 0.945, essentially the nominal 0.95 — under an IFE violation that breaks DiD.

  • Standard DiD’s coverage collapses to 0.45 and its bias is about five times larger; its CIs are unreliable in exactly the regime the method targets. (The collapse is sharper than the paper’s ~0.70 because the reconstructed differential-trend violation is stronger than the CPS calibration; the direction and ranking are the paper’s.)

  • Sequential SDiD has lower RMSE, the second half of Table 1’s finding.

A noiseless corollary, pinned in test_seq_sdid.py, underlies the result: on a noiseless rank-one IFE the estimator recovers the effect to machine precision for every donor-balanced cohort, so the design’s reliability is not a tolerance artifact.
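
The algebra behind that noiseless property can be checked in a few lines. The sketch below is not the test in test_seq_sdid.py and not the Sequential SDiD estimator: it assumes a single treated unit, sum-to-one donor weights without a sign constraint (found by least squares on pre-period data), and made-up dimensions. Once the weighted donor loading equals the treated loading, the time effects and the \(\lambda_i f_t\) term cancel and only the effect remains.

```python
import numpy as np

rng = np.random.default_rng(0)
n_donors, T, T0, tau = 8, 12, 8, 1.0             # toy sizes; T0 pre-periods
f = np.arange(1, T + 1) / T                      # factor f_t = t/T
# Donor loadings in [0, 1]; the treated unit (last row) trends more steeply.
lam = np.append(rng.uniform(0.0, 1.0, n_donors), 1.5)
alpha = rng.normal(size=n_donors + 1)            # unit fixed effects
beta = rng.normal(size=T)                        # time fixed effects
Y = alpha[:, None] + beta[None, :] + np.outer(lam, f)   # noiseless panel
Y[-1, T0:] += tau                                # treatment effect after T0

# Remove unit levels with pre-period means (the "difference" in DiD).
Yc = Y - Y[:, :T0].mean(axis=1, keepdims=True)
donors, treated = Yc[:-1], Yc[-1]

def did(weights):
    """Average post-period gap between the treated unit and weighted donors."""
    return (treated[T0:] - weights @ donors[:, T0:]).mean()

# Plain DiD: equal weights on all donors.
did_plain = did(np.full(n_donors, 1.0 / n_donors))

# Balanced weights: match the treated pre-period path, with sum(w) = 1
# imposed as an extra equation (no non-negativity, an assumed simplification).
A = np.vstack([donors[:, :T0].T, np.ones((1, n_donors))])
b = np.append(treated[:T0], 1.0)
w, *_ = np.linalg.lstsq(A, b, rcond=1e-10)
did_balanced = did(w)
```

In this construction `did_balanced` equals `tau` up to floating-point error, while `did_plain` carries a bias equal to the loading gap times the post-minus-pre difference in \(f_t\).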

The durable check lives in benchmarks/cases/seq_sdid_mc.py:

python benchmarks/run_benchmarks.py --case seq_sdid_mc