Rolling-Transformation DiD (ROLLDID)
When to Use This Estimator
ROLLDID implements the rolling-transformation difference-in-differences
estimator of Lee & Wooldridge [LW2026] — a clean-room (MIT) build from the
paper’s equations. It is for panel DiD when the number of treated units, the
number of control units, or both is small: the regime where the usual
cluster-robust / large-\(N\) asymptotics are unreliable and a single
mis-measured cluster can drive the answer. Its defining feature is that it
collapses the panel into one cross-sectional observation per unit by a
pre-treatment transformation, and then reads the treatment effect off an
ordinary cross-sectional regression. Two consequences follow.
First, because that regression is cross-sectional with independent observations, inference does not require clustering, weak time-series dependence, or a long panel: under the classical linear model it is exact in finite samples — valid even with a single treated unit (\(N_1 = 1\)) — and it composes naturally with randomization inference.
Second, in its detrending form it allows unit-specific linear trends, a strict relaxation of parallel trends. This is what lets it track a treated unit whose pre-period drifts away from the donor average — exactly the California Proposition 99 picture — without a convex donor combination.
ROLLDID is therefore the regression complement to the synthetic-control
family in mlsynth (Synthetic Difference-in-Differences (SDID), Forward Difference-in-Differences (FDID), Vanilla Synthetic Control (VanillaSC)): the same
small-donor regime, a different identification lever (parallel trends after
removing unit means or trends, rather than a weighted donor combination). On
short, donor-starved staggered panels — where SC-style per-cohort weight
optimisation becomes unstable — it stays well behaved, because it estimates no
weights at all.
Reach for ROLLDID when
you have few treated and/or few control units and want inference you can defend in finite samples (down to one treated unit);
pre-trends are heterogeneous but approximately linear, so detrending buys you a weaker identifying assumption than parallel trends;
adoption is common-timing or staggered, and you want a per-period event study (common timing) or a cohort-share-weighted aggregate (staggered) with honest standard errors;
you want a transparent, weight-free alternative to SC / SDID to report alongside them.
Do not use ROLLDID when
No never-treated units exist (staggered case). The aggregate uses the never-treated group as the comparison; if every unit is eventually treated, use a not-yet-treated design or Synthetic Difference-in-Differences (SDID).
The treated unit’s pre-trend is non-linear / complex. Linear detrending cannot remove it; a synthetic control (Forward Difference-in-Differences (FDID), Vanilla Synthetic Control (VanillaSC)) or a factor model (Matrix Completion with Nuclear Norm Minimization (MCNNM)) may fit better — plot the pre-period to decide.
You want the broader DiD ecosystem: Callaway–Sant’Anna, Sun–Abraham, Wooldridge ETWFE, honest-DiD sensitivity, large-\(N\) staggered estimators. mlsynth ships ROLLDID for the small-\(N\) exact regime where it complements synthetic control; it does not aim to cover DiD comprehensively. For the full toolkit see the sibling package diff-diff.
Notation
Let \(\mathcal{N} \coloneqq \{1, \dots, N\}\) index the units and \(t \in \mathcal{T} \coloneqq \{1, \dots, T\}\) the periods (1-indexed). Unit \(j\) has observed outcome \(y_{jt}\), with potential outcomes \(y_{jt}^{N}\) absent the intervention and \(y_{jt}^{I}\) under it; the treatment dummy is \(d_{jt}\), so \(y_{jt} = y_{jt}^{N} + (y_{jt}^{I} - y_{jt}^{N})\,d_{jt}\). Treatment is absorbing: once on it stays on. The unit-level “ever-treated” indicator is
\[d_j \coloneqq \max_{t \in \mathcal{T}} d_{jt},\]
partitioning \(\mathcal{N}\) into the eventually-treated \(\mathcal{N}_1 \coloneqq \{j : d_j = 1\}\) (size \(N_1\)) and the never-treated \(\mathcal{N}_0 \coloneqq \{j : d_j = 0\}\) (size \(N_0\)), with \(N = N_0 + N_1 \ge 3\).
Common timing. All treated units adopt after period \(T_0\), splitting time into the pre-period \(\mathcal{T}_1 \coloneqq \{t \le T_0\}\) and the post-period \(\mathcal{T}_2 \coloneqq \{t > T_0\}\).
Staggered adoption. Treated unit \(j\) has a cohort \(g_j \coloneqq \min\{t : d_{jt} = 1\}\); its pre-period is \(\{t < g_j\}\). Cohorts are collected in \(\mathcal{G} \coloneqq \{g_j : j \in \mathcal{N}_1\}\) with sizes \(N_g \coloneqq |\{j : g_j = g\}|\).
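The notation maps directly onto a long panel. The following is a minimal sketch, assuming a toy panel with hypothetical column names unit, t and d; it is not the package's ingestion code, only an illustration of how \(d_j\), \(g_j\), \(N_g\), \(N_1\) and \(N_0\) can be read off the treatment dummy.

```python
import pandas as pd

# Hypothetical toy panel: units A-C adopt at different dates, D never does.
adoption = {"A": 3, "B": 4, "C": 4, "D": None}
rows = [
    {"unit": unit, "t": t, "d": int(g is not None and t >= g)}
    for unit, g in adoption.items()
    for t in range(1, 6)
]
panel = pd.DataFrame(rows)

# Ever-treated indicator d_j: 1 if the unit is treated in any period.
ever_treated = panel.groupby("unit")["d"].max()

# Cohort g_j: the first period in which d_jt = 1 (treated units only).
cohort = panel.loc[panel["d"] == 1].groupby("unit")["t"].min()

# Cohort sizes N_g and the group counts N_1, N_0.
cohort_sizes = cohort.value_counts().sort_index()
n_treated = int(ever_treated.sum())
n_control = int((ever_treated == 0).sum())

print(cohort.to_dict())
print(cohort_sizes.to_dict())
print(n_treated, n_control)
```

Here the cohort set is \(\mathcal{G} = \{3, 4\}\), and the design would count as staggered because the treated units do not share a single adoption date.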
Transformed series carry a tilde: \(\widetilde{y}_{jt}\) is the residualised
post-period outcome and \(\widetilde{y}_j\) its collapse to a single scalar
per unit. The per-period effect is \(\tau_t\) and the ATT is
\(\widehat{\tau}\) (the gap and att of the result object).
Assumptions
Assumption 1 (no anticipation). Before adoption, treated and untreated potential outcomes coincide in expectation: \(\mathbb{E}[y_{jt}^{I} - y_{jt}^{N} \mid d_j = 1] = 0\) for all \(t < g_j\) (sufficient: \(y_{jt}^{I} = y_{jt}^{N}\) for \(t < g_j\)).
Remark. The pre-period mean / trend is estimated only on \(t < g_j\), so anticipation in the final pre-periods would contaminate the baseline. Detrending and shorter pre-windows are the sensitivity levers (the package exposes the pre-window directly).
Assumption 2 (parallel trends after the rolling transformation). With the within-unit transformed regressand \(\widetilde{y}_j(0)\) formed from the untreated potential outcomes (defined in the next section), there is a constant \(\alpha\) with
\[\mathbb{E}\bigl[\widetilde{y}_j(0) \mid d_j\bigr] = \alpha.\]
Two cases: (a) under demeaning, \(\widetilde{y}_j(0)\) differences out a unit-specific level, so \(\alpha\) absorbs an arbitrary common trend; (b) under detrending, it differences out a unit-specific linear trend, so \(\alpha\) permits unit-specific linear trends.
Remark. Case (b) is strictly weaker than standard parallel trends and is the estimator’s edge over SC/SDID when pre-trends are heterogeneous but linear: treatment may be correlated with a unit’s level and slope as long as it is mean-independent of the post-minus-pre deviation.
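To see what case (b) buys, consider a noise-free toy panel in which the treated unit has both a higher level and a steeper linear trend than the controls. The sketch below is an illustration under those stated assumptions (all numbers are made up, and the helper names are hypothetical): demeaning attributes the slope gap to the treatment, while detrending recovers the true effect.

```python
import numpy as np

T, g, tau = 10, 7, -1.0                 # periods 1..T, treatment starts at t = g
t = np.arange(1, T + 1)
pre, post = t < g, t >= g

# Hypothetical units: unit 0 is treated and trends more steeply than the controls.
levels = np.array([5.0, 1.0, 2.0, 3.0])
slopes = np.array([0.8, 0.1, 0.2, 0.3])
d = np.array([1, 0, 0, 0])

Y = levels[:, None] + slopes[:, None] * t[None, :]   # untreated paths, shape (N, T)
Y[d == 1] += tau * post                              # add the effect after adoption

def collapse_demean(y):
    # Post-average of the outcome minus the pre-period mean.
    return (y[post] - y[pre].mean()).mean()

def collapse_detrend(y):
    # Fit a line on the pre-period, extrapolate it, average the post residuals.
    b, a = np.polyfit(t[pre], y[pre], deg=1)         # slope, intercept
    return (y[post] - (a + b * t[post])).mean()

def diff_in_means(ytilde):
    return ytilde[d == 1].mean() - ytilde[d == 0].mean()

tau_demean = diff_in_means(np.array([collapse_demean(y) for y in Y]))
tau_detrend = diff_in_means(np.array([collapse_detrend(y) for y in Y]))
print(tau_demean, tau_detrend)
```

With these numbers the mean post period is 5 periods later than the mean pre period, so demeaning leaves \(5 b_j\) in each unit's collapsed outcome and the treated-control slope gap contaminates the estimate; detrending removes it.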
Assumption 3 (classical linear model, for exact inference). The cross-sectional regression error \(u_j\) (next section) satisfies
\[u_j \mid d_j \;\sim\; \mathcal{N}(0, \sigma^2), \qquad \text{independently across } j.\]
Remark. Normality is plausible despite small \(N\) because \(\widetilde{y}_j\) averages \(y_{jt}\) over \(\mathcal{T}_2\): if the series is weakly dependent over time, a central limit theorem across time makes the average approximately normal. The leverage for inference comes from \(T\), not \(N\). HC3 relaxes homoskedasticity (Assumption 3 then only needs mean-independence) at the cost of requiring a handful of treated and control units; randomization inference drops normality entirely.
Mathematical Formulation
The rolling transformation
Each unit’s outcome is residualised against its own pre-treatment path and
then averaged over the post window. For cohort \(g\) (in common timing,
\(g = T_0 + 1\) for every treated unit), rolling="demean" (the paper’s
Procedure 2.1) removes the pre-period mean,
\[\widetilde{y}_{jt} \coloneqq y_{jt} - \frac{1}{g-1}\sum_{s<g} y_{js}, \qquad t \ge g,\]
while rolling="detrend" (Procedure 3.1) fits a unit-specific line on the
pre-period, \((\widehat{a}_j, \widehat{b}_j) =
\operatorname*{argmin}_{a,b}\sum_{t<g}(y_{jt} - a - b\,t)^2\), and removes its
extrapolation into the post-period,
\[\widetilde{y}_{jt} \coloneqq y_{jt} - \widehat{a}_j - \widehat{b}_j\,t, \qquad t \ge g.\]
The unit’s single cross-sectional regressand is the post-average of its transformed outcome,
\[\widetilde{y}_j \coloneqq \frac{1}{T-g+1}\sum_{t \ge g}\widetilde{y}_{jt}.\]
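A worked single-unit example may help fix ideas. This is a minimal sketch, not the package's code: the function name rolling_transform and the numbers are hypothetical, and the detrending branch assumes at least two pre-periods so that a line can be fitted.

```python
import numpy as np

def rolling_transform(y, t, g, mode="demean"):
    """Transformed post-period outcomes for one unit with cohort date g."""
    y, t = np.asarray(y, dtype=float), np.asarray(t, dtype=float)
    pre, post = t < g, t >= g
    if mode == "demean":
        baseline = np.full(post.sum(), y[pre].mean())   # pre-period mean
    elif mode == "detrend":
        b, a = np.polyfit(t[pre], y[pre], deg=1)        # pre-period line
        baseline = a + b * t[post]                      # extrapolated into post
    else:
        raise ValueError("mode must be 'demean' or 'detrend'")
    return y[post] - baseline

t = np.arange(1, 7)
y = np.array([2.0, 4.0, 6.0, 8.0, 11.0, 13.0])   # pre-period lies on y = 2t
g = 5

demeaned = rolling_transform(y, t, g, "demean")
detrended = rolling_transform(y, t, g, "detrend")
print(demeaned, demeaned.mean())     # per-period values and their collapse
print(detrended, detrended.mean())
```

The pre-period mean is 5, so demeaning gives post values 6 and 8 (collapse 7). The pre-period line predicts 10 and 12, so detrending gives 1 and 1 (collapse 1): most of the demeaned "effect" for this unit is just its own trend.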
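The collapse equivalence
The key algebraic fact (Lee & Wooldridge): the panel DiD coefficient equals the slope of the cross-sectional regression of \(\widetilde{y}_j\) on the ever-treated indicator,
\[\widetilde{y}_j = \alpha + \tau\, d_j + u_j,\]
so that, in closed form,
\[\widehat{\tau} = \frac{1}{N_1}\sum_{j \in \mathcal{N}_1}\widetilde{y}_j \;-\; \frac{1}{N_0}\sum_{j \in \mathcal{N}_0}\widetilde{y}_j.\]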
The transformation does two things at once: differencing post-minus-pre within each unit cancels any time-constant heterogeneity (a unit fixed effect, even a unit root), and collapsing time to a scalar pushes the serial correlation in \(\{y_{jt}\}\) inside \(\widetilde{y}_j\). Across units the \(\widetilde{y}_j\) are independent, so the object on which we do inference is an ordinary cross-section — no clustering, no large-\(T\) requirement, and strong time-series dependence is absorbed rather than modelled.
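The cross-sectional step is ordinary OLS on a binary regressor, which can be checked numerically. The sketch below uses simulated collapsed outcomes (an assumption made purely for illustration) and confirms that the slope on \(d_j\) is the treated-minus-control difference in means, with the classical standard error \(\widehat{\sigma}\sqrt{1/N_1 + 1/N_0}\) on \(N - 2\) degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(0)
d = np.array([1, 1, 0, 0, 0, 0, 0, 0])          # ever-treated indicator d_j
ytilde = rng.normal(size=d.size) - 0.5 * d      # hypothetical collapsed outcomes

# OLS of ytilde on a constant and d.
X = np.column_stack([np.ones(d.size), d])
coef, *_ = np.linalg.lstsq(X, ytilde, rcond=None)
alpha_hat, tau_hat = coef

# Closed form: difference in group means.
diff_means = ytilde[d == 1].mean() - ytilde[d == 0].mean()

# Classical (homoskedastic) standard error with N - 2 degrees of freedom.
resid = ytilde - X @ coef
n1, n0 = int(d.sum()), int((1 - d).sum())
sigma2_hat = resid @ resid / (d.size - 2)
se_tau = np.sqrt(sigma2_hat * (1 / n1 + 1 / n0))
t_stat = tau_hat / se_tau

print(tau_hat, diff_means, se_tau, t_stat)
```

Comparing t_stat against a \(t_{N-2}\) reference distribution then gives the exact test described under Inference.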
Per-period effects (common timing)
Rather than collapse, regress the transformed value at each post period on \(d_j\) to obtain the event study,
\[\widetilde{y}_{jt} = \alpha_t + \tau_t\, d_j + u_{jt}, \qquad t \in \mathcal{T}_2,\]
with \(\widehat{\tau}_t\) the estimated coefficient on \(d_j\). Because OLS is linear and \(\widetilde{y}_j = |\mathcal{T}_2|^{-1}\sum_{t}\widetilde{y}_{jt}\), the aggregate is exactly the mean of the per-period effects, \(\widehat{\tau} = |\mathcal{T}_2|^{-1}\sum_{t\in\mathcal{T}_2}\widehat{\tau}_t\).
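The linearity claim is easy to verify numerically. In this sketch the transformed post-period outcomes are simulated (an assumption for illustration only), and the slope on \(d_j\) is computed as a difference in means, which is what OLS on a binary regressor returns.

```python
import numpy as np

rng = np.random.default_rng(1)
d = np.array([1, 0, 0, 0, 0, 0])                 # single treated unit
n_post = 4
ytilde_post = rng.normal(size=(d.size, n_post))  # hypothetical transformed outcomes, (N, |T_2|)

def slope_on_d(v):
    # OLS slope of v on a binary d equals the difference in group means.
    return v[d == 1].mean() - v[d == 0].mean()

# Event study: one cross-sectional regression per post period.
tau_t = np.array([slope_on_d(ytilde_post[:, k]) for k in range(n_post)])

# Aggregate: one regression on the post-average.
tau_agg = slope_on_d(ytilde_post.mean(axis=1))

print(tau_t, tau_t.mean(), tau_agg)
```

The equality holds for the point estimates; the standard error of the aggregate is not the average of the per-period standard errors, because the per-period regressions share units.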
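Staggered aggregation
With several cohorts, transform each unit relative to every cohort date and form, per cohort, \(\widetilde{y}_j^{(g)} \coloneqq (T-g+1)^{-1}\sum_{t\ge g} \widetilde{y}_{jt}\). Using never-treated units as the comparison and cohort shares \(\widehat{\omega}_g \coloneqq N_g / N_1\), the collapsed regressand is
\[\widetilde{y}_j \coloneqq \begin{cases} \widetilde{y}_j^{(g_j)}, & j \in \mathcal{N}_1, \\[2pt] \sum_{g \in \mathcal{G}} \widehat{\omega}_g\, \widetilde{y}_j^{(g)}, & j \in \mathcal{N}_0. \end{cases}\]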
Running the single cross-sectional regression \(\widetilde{y}_j = \alpha + \tau_\omega d_j + u_j\) then returns the cohort-share-weighted aggregate \(\widehat{\tau}_\omega\) whose standard error automatically accounts for the covariance across cohort effects (it is one regression, not a stitched-together sum). Per-cohort effects \(\widehat{\tau}_g\) are the analogous cohort-\(g\) vs never-treated contrasts. Using only never-treated comparisons sidesteps the “forbidden comparison” / negative-weighting pathologies of two-way fixed effects under staggered adoption.
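The aggregation can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the data are simulated, demeaning is used for the transformation, never-treated units are coded with cohort 0, and the regressand is built so that each treated unit contributes its own-cohort value while each never-treated unit contributes the cohort-share-weighted average of its values across cohort dates. Under that construction the single regression slope equals \(\sum_g \widehat{\omega}_g \widehat{\tau}_g\).

```python
import numpy as np

rng = np.random.default_rng(2)
T = 6
t = np.arange(1, T + 1)
cohort = np.array([3, 3, 5, 0, 0, 0, 0])     # 0 = never-treated (illustrative coding)
Y = rng.normal(size=(cohort.size, T))        # hypothetical outcomes, shape (N, T)

def collapsed(y, g):
    # Demeaned post-average relative to cohort date g.
    return y[t >= g].mean() - y[t < g].mean()

treated = cohort > 0
cohorts, counts = np.unique(cohort[treated], return_counts=True)
omega = counts / treated.sum()               # cohort shares N_g / N_1

# ytilde_g[k, j]: unit j transformed relative to cohort date cohorts[k].
ytilde_g = np.array([[collapsed(y, g) for y in Y] for g in cohorts])

# Per-cohort effects: cohort g versus never-treated, both transformed at date g.
tau_g = np.array([
    ytilde_g[k, cohort == g].mean() - ytilde_g[k, ~treated].mean()
    for k, g in enumerate(cohorts)
])

# Collapsed regressand: weighted average for controls, own cohort for treated.
ytilde = omega @ ytilde_g
for k, g in enumerate(cohorts):
    ytilde[cohort == g] = ytilde_g[k, cohort == g]

tau_omega = ytilde[treated].mean() - ytilde[~treated].mean()
print(tau_omega, omega @ tau_g)
```

Because the aggregate comes from one regression on one regressand, its standard error needs no separate covariance bookkeeping across cohorts.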
Inference
All three modes operate on the cross-sectional regression \(\widetilde{y}_j = \alpha + \tau d_j + u_j\) (or its per-period / per-cohort variants), with residual degrees of freedom \(N - 2\).
inference="exact": under Assumption 3 the studentised statistic is exactly \(t\)-distributed, \[\frac{\widehat{\tau} - \tau}{\operatorname{se}(\widehat{\tau})} \;\sim\; t_{N-2},\] giving exact two-sided tests and exact-coverage intervals \(\widehat{\tau} \pm t_{1-\alpha/2,\,N-2}\,\operatorname{se}(\widehat{\tau})\), for any \(N \ge 3\), including \(N_1 = 1\). (With a single treated unit this is the studentised-residual / outlier statistic of Donald & Lang.)
inference="hc3": heteroskedasticity-robust (MacKinnon–White HC3). Use only with a handful of treated and control units: with a single treated unit its leverage is \(1\) and HC3 is undefined, so ROLLDID raises rather than returning a degenerate standard error.
inference="ri": randomization inference. The \(N_1\) treated labels are re-assigned across units ri_reps times and the permutation \(p\)-value \(\Pr(|\widehat{\tau}^{\,\text{perm}}| \ge |\widehat{\tau}|)\) is reported; no normality is required.
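The HC3 restriction follows from the hat matrix of a regression on a constant and a binary regressor: each treated unit has leverage \(1/N_1\) and each control \(1/N_0\). The sketch below is a hypothetical helper, not the package's routine (which raises its own exception type); it computes the HC3 standard error of the slope and refuses to proceed when any leverage equals one.

```python
import numpy as np

def hc3_se_slope(ytilde, d):
    """HC3 standard error of the slope in ytilde = alpha + tau * d + u."""
    ytilde, d = np.asarray(ytilde, dtype=float), np.asarray(d, dtype=float)
    X = np.column_stack([np.ones(d.size), d])
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)       # leverages h_jj
    if np.any(np.isclose(h, 1.0)):
        raise ValueError("HC3 undefined: a unit has leverage 1")
    resid = ytilde - X @ (XtX_inv @ X.T @ ytilde)
    weights = (resid / (1.0 - h)) ** 2                # HC3 rescaled squared residuals
    meat = X.T @ (X * weights[:, None])
    cov = XtX_inv @ meat @ XtX_inv
    return float(np.sqrt(cov[1, 1]))

# Two treated, three controls: leverages are 1/2 and 1/3, so HC3 is defined.
y_two = np.array([1.2, 0.4, 0.1, -0.3, 0.5])
d_two = np.array([1, 1, 0, 0, 0])
se_two = hc3_se_slope(y_two, d_two)
print(se_two)

# One treated unit: leverage 1/N_1 = 1, so the helper raises.
try:
    hc3_se_slope(y_two, np.array([1, 0, 0, 0, 0]))
    single_treated_raised = False
except ValueError:
    single_treated_raised = True
print(single_treated_raised)
```

With \(N_1 = 1\) the treated unit's residual is identically zero and its leverage is one, so the HC3 weight is a 0/0 expression; exact or randomization inference is the appropriate choice there.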
The event study (common timing) and the per-cohort table (staggered) report
uncertainty at the level of each individual effect: every \(\tau_t\) (resp.
\(\widehat{\tau}_g\)) ships with its own se, p_value and
\((1-\alpha)\) confidence band, which is what plot_rolldid renders around
the effect path.
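For the randomization-inference mode, a Monte Carlo permutation test on the collapsed outcomes can be sketched in a few lines. The function below is an illustration only: the argument names ri_reps and seed mirror the configuration fields, but the implementation and the data are assumptions, not the package's code.

```python
import numpy as np

def ri_pvalue(ytilde, d, ri_reps=1000, seed=0):
    """Permutation p-value for the difference in means of ytilde by d."""
    ytilde, d = np.asarray(ytilde, dtype=float), np.asarray(d)
    rng = np.random.default_rng(seed)

    def stat(labels):
        return ytilde[labels == 1].mean() - ytilde[labels == 0].mean()

    observed = abs(stat(d))
    # Re-assign the N_1 treated labels at random and recompute the statistic.
    draws = np.array([abs(stat(rng.permutation(d))) for _ in range(ri_reps)])
    # Small tolerance so that exact ties with the observed value are counted.
    return float(np.mean(draws >= observed - 1e-12))

# Hypothetical collapsed outcomes: unit 0 is treated and clearly the most extreme.
ytilde = np.array([-3.0, 0.1, -0.2, 0.3, 0.0, -0.1])
d = np.array([1, 0, 0, 0, 0, 0])
p_value = ri_pvalue(ytilde, d)
print(p_value)
```

With a single treated unit there are only \(N\) distinct assignments, so the smallest attainable permutation \(p\)-value is \(1/N\) (here about 0.17 up to simulation noise); rejecting at the 5% level by randomization inference therefore needs at least 20 units.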
Why it improves on SC / SDID
The synthetic-control family needs the treated unit inside the convex hull of the donors and, for inference, large \(N_0, T_0, T_1\) (SDID additionally assumes \(I(0)\) weak dependence and normality). ROLLDID needs none of those dimensions to grow for its inference to be valid: under Assumption 3 the collapsed regression is exact in finite samples, and the time dimension matters only because averaging over the post-period is what makes normality plausible. Detrending, in turn, weakens parallel trends rather than requiring a donor combination to exist. The trade-off, which the paper is explicit about, is efficiency: the cross-sectional estimator can have larger variance than SDID when SDID’s assumptions hold, but SDID’s packaged intervals can under-cover, whereas ROLLDID’s are exact under Assumption 3. It is best read as a complement: a transparent, weight-free tool that is the more trustworthy of the two precisely when units or periods are scarce, or when the per-cohort weight optimisation of staggered SDID becomes ill-posed.
Example: California Proposition 99
Reproduces the paper’s Table 3 (common timing, single treated unit) on the bundled Abadie–Diamond–Hainmueller smoking panel. The outcome is log per-capita cigarette sales; California is treated from 1989 against 38 never-treated states.
import numpy as np
import pandas as pd

from mlsynth import ROLLDID

BASE = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/main/basedata/"
df = pd.read_csv(BASE + "smoking_data.csv")  # 39 states x 1970-2000
df["logcig"] = np.log(df["cigsale"])  # outcome: log per-capita cigarette sales
df["treat"] = df["Proposition 99"].astype(int)  # d_jt: California from 1989

# Detrending with exact t-based inference; plotting is switched off.
res = ROLLDID({
    "df": df, "outcome": "logcig", "treat": "treat",
    "unitid": "state", "time": "year",
    "rolling": "detrend", "inference": "exact", "display_graphs": False,
}).fit()

print(res.effects.att, res.inference.standard_error, res.inference.p_value)
# -0.227 0.094 0.021 (paper Table 3)
print(res.per_period[["time", "att", "ci_lower", "ci_upper"]].tail(3))
# tau_2000 = -0.403, 95% CI [-0.712, -0.094]
Staggered: castle-doctrine laws
The bundled castle.csv panel (50 states, 2000–2010; 21 staggered adopters,
29 never-treated) gives the cohort-share-weighted aggregate via the same call,
with the treatment indicator turning on at each state’s adoption year. Then
res.effects.att is \(\widehat{\tau}_\omega\) (0.092 demeaning /
0.067 detrending, matching §7.2) and res.per_cohort is the per-cohort
breakdown with its own confidence intervals.
Verification
Path A: both empirical applications are reproduced to the reported precision and
were cross-validated during development against the AGPL lwdid package, used only
as a black-box oracle (clean-room, sharing no code). California Prop 99
(Table 3): demeaning ATT \(-0.422\) (se \(0.121\)), detrending
\(-0.227\) (se \(0.094\)), detrend exact \(p = 0.021\),
\(\tau_{2000} = -0.667\) (demeaning) / \(-0.403\) (detrending). Castle laws (§7.2, staggered):
demeaning aggregate \(0.092\) (se \(0.057\)), detrending \(0.067\)
(HC3 se \(0.055\)). See “ROLLDID: Lee & Wooldridge rolling-transformation DiD”; the durable case is
benchmarks/cases/rolldid_lw.py and unit-level reproduction is pinned in
mlsynth/tests/test_rolldid.py.
[LW2026] Lee, S. J., & Wooldridge, J. M. (2026). Simple Approaches to Inference with Difference-in-Differences Estimators with Small Cross-Sectional Sample Sizes.
Core API
- class mlsynth.ROLLDID(config: ROLLDIDConfig | dict)
Rolling-transformation difference-in-differences.
- Parameters:
config (ROLLDIDConfig or dict) – See
mlsynth.utils.rolldid_helpers.config.ROLLDIDConfig, which adds rolling (demean / detrend), inference (exact / hc3 / ri), alpha, ri_reps and seed to the base df / outcome / treat / unitid / time fields.
Examples
>>> from mlsynth import ROLLDID
>>> res = ROLLDID({"df": panel, "outcome": "y", "treat": "w",
...                "unitid": "id", "time": "t", "rolling": "detrend"}).fit()
>>> res.effects.att, res.inference.p_value
- fit() -> ROLLDIDResults
Run the rolling-DiD estimate and return a standardized result.
Configuration
- class mlsynth.utils.rolldid_helpers.config.ROLLDIDConfig(*, df: pandas.DataFrame, outcome: str, treat: str, unitid: str, time: str, display_graphs: bool = True, save: bool | str = False, counterfactual_color: typing.List[str] = <factory>, treated_color: str = 'black', plot: mlsynth.config_models.PlotConfig = <factory>, rolling: str = 'demean', inference: str = 'exact', alpha: float = 0.05, ri_reps: int = 1000, seed: int = 0)
Configuration for ROLLDID (Lee & Wooldridge rolling-transformation DiD).
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
Result Containers
ROLLDID.fit() returns a
ROLLDIDResults, a standardized
BaseEstimatorResults: the aggregated ATT is
res.effects.att; res.inference carries the standard error, \(p\)-value,
confidence interval and method; res.time_series is the event-study path
(common timing). The rolling-DiD specifics sit alongside — res.transformation
(demean / detrend), res.inference_type, res.design (common /
staggered), res.n_treated / res.n_control, and the per-period
(res.per_period) or per-cohort (res.per_cohort) effect tables, each with
its own se / p_value / ci_lower / ci_upper.
- class mlsynth.utils.rolldid_helpers.structures.ROLLDIDResults(*, effects: EffectsResults | None = None, fit_diagnostics: FitDiagnosticsResults | None = None, time_series: TimeSeriesResults | None = None, weights: WeightsResults | None = None, inference: InferenceResults | None = None, method_details: MethodDetailsResults | None = None, sub_method_results: Dict[str, Any] | None = None, additional_outputs: Dict[str, Any] | None = None, raw_results: Dict[str, Any] | None = None, execution_summary: Dict[str, Any] | None = None, plot_config: PlotConfig | None = None, transformation: str | None = None, inference_type: str | None = None, design: str | None = None, n_treated: int | None = None, n_control: int | None = None, per_period: DataFrame | None = None, per_cohort: DataFrame | None = None, **extra_data: Any)
Rolling-DiD result: a standardized
BaseEstimatorResults (effects.att = the aggregated ATT, inference = SE / p / CI / method, time_series = the event-study path for common timing) plus the rolling-DiD specifics.
- n_treated, n_control
Eventually-treated and never-treated unit counts.
- Type:
int or None
- per_period
Per-period ATTs (common timing):
time / att / se / t / p_value / ci_lower / ci_upper.
- Type:
pandas.DataFrame or None
- per_cohort
Per-cohort ATTs (staggered):
cohort / n_treated / att / …
- Type:
pandas.DataFrame or None
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'allow', 'json_encoders': {<class 'numpy.ndarray'>: <function BaseEstimatorResults.Config.<lambda>>}}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
Helper Modules
Panel ingestion: resolves the long panel into per-unit series, the treatment cohorts, and the never-treated set; checks the absorbing-treatment and never-treated conditions.
Ingestion for ROLLDID: long panel -> per-unit series + cohort structure.
The rolling-DiD transformation is unit-level, so we need each unit’s full
outcome series, its treatment cohort (the first period in which its
treatment indicator turns on, g), and the set of never-treated units. The
design is common timing when every treated unit shares one cohort, and
staggered otherwise.
- mlsynth.utils.rolldid_helpers.setup.rolldid_setup(df: DataFrame, unit_id: str, time_id: str, outcome: str, treat: str) -> Dict[str, Any]
Resolve the panel into the inputs the rolling-DiD pipeline needs.
Returns a dict with
Ywide (time x unit), cohort_of ({unit: g} for treated units), never (never-treated unit labels), treated (all eventually-treated labels), times (sorted time labels), and design ("common" / "staggered").
- Raises:
MlsynthConfigError – If a required column is missing.
MlsynthDataError – If the panel is unbalanced, the treatment indicator is not 0/1, there are no treated units, no never-treated controls, or treatment switches off (the indicator must be absorbing once on).
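The conditions listed under Raises can be pictured with a small stand-alone validator. This is a hedged sketch: check_panel is a hypothetical helper that raises plain ValueError, not the package's rolldid_setup or its exception classes, and the column names in the example are made up.

```python
import pandas as pd

def check_panel(df, unit_id, time_id, treat):
    """Illustrative versions of the ingestion checks; returns the design label."""
    missing = [c for c in (unit_id, time_id, treat) if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if not set(df[treat].unique()) <= {0, 1}:
        raise ValueError("treatment indicator must be 0/1")

    # Balanced: every unit observed exactly once in every period.
    n_periods = df[time_id].nunique()
    per_unit = df.groupby(unit_id)[time_id].agg(["size", "nunique"])
    if not ((per_unit["size"] == n_periods) & (per_unit["nunique"] == n_periods)).all():
        raise ValueError("panel is unbalanced")

    ever = df.groupby(unit_id)[treat].max()
    if ever.sum() == 0:
        raise ValueError("no treated units")
    if (ever == 0).sum() == 0:
        raise ValueError("no never-treated controls")

    # Absorbing: within a unit the indicator may never step down from 1 to 0.
    ordered = df.sort_values([unit_id, time_id])
    if ordered.groupby(unit_id)[treat].diff().lt(0).any():
        raise ValueError("treatment switches off")

    first_on = df.loc[df[treat] == 1].groupby(unit_id)[time_id].min()
    return "common" if first_on.nunique() == 1 else "staggered"

good = pd.DataFrame({
    "unit": ["A"] * 4 + ["B"] * 4,
    "t": [1, 2, 3, 4] * 2,
    "d": [0, 0, 1, 1, 0, 0, 0, 0],
})
design = check_panel(good, "unit", "t", "d")
print(design)
```

A panel in which unit A's indicator returned to 0 in the last period would fail the absorbing check, which matters because the pre-period baseline is only well defined when adoption happens once.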
The rolling transformations, the cross-sectional estimator (aggregate, per-period, per-cohort), and the exact / HC3 / randomization inference.
Estimation + inference for ROLLDID (clean-room from Lee & Wooldridge 2026).
The method collapses the panel to one cross-sectional observation per unit by a pre-treatment rolling transformation (demean = Procedure 2.1; detrend = Procedure 3.1), then reads the ATT off a cross-sectional regression of the transformed post-average on a treatment indicator. Common timing and staggered adoption share the same machinery (the staggered aggregate is eq. 7.18-7.19, with never-treated units as the comparison). Inference is exact-t (CLM normality), HC3, or randomization.
- mlsynth.utils.rolldid_helpers.pipeline.estimate(prep: Dict[str, Any], *, mode: str, inference: str, alpha: float, ri_reps: int, seed: int) -> Dict[str, Any]
Run the rolling-DiD estimate end to end and return a result dict.
Event-study / effect plot for ROLLDID.
- mlsynth.utils.rolldid_helpers.plotter.plot_rolldid(result, *, show: bool = True, theme: dict | None = None)
Plot the rolling-DiD effect.
Common timing: the per-period ATTs (event study) with their confidence band and a zero reference line. Staggered: the per-cohort ATTs. Uses the shared
mlsynth house style.