Rolling-Transformation DiD (ROLLDID)#

When to Use This Estimator#

ROLLDID implements the rolling-transformation difference-in-differences estimator of Lee & Wooldridge [LW2026] — a clean-room (MIT) build from the paper’s equations. It is for panel DiD when the number of treated units, the number of control units, or both is small: the regime where the usual cluster-robust / large-\(N\) asymptotics are unreliable and a single mis-measured cluster can drive the answer. Its defining feature is that it collapses the panel into one cross-sectional observation per unit by a pre-treatment transformation, and then reads the treatment effect off an ordinary cross-sectional regression. Two consequences follow.

First, because that regression is cross-sectional with independent observations, inference does not require clustering, weak time-series dependence, or a long panel: under the classical linear model it is exact in finite samples — valid even with a single treated unit (\(N_1 = 1\)) — and it composes naturally with randomization inference.

Second, in its detrending form it allows unit-specific linear trends, a strict relaxation of parallel trends. This is what lets it track a treated unit whose pre-period drifts away from the donor average — exactly the California Proposition 99 picture — without a convex donor combination.

ROLLDID is therefore the regression complement to the synthetic-control family in mlsynth (Synthetic Difference-in-Differences (SDID), Forward Difference-in-Differences (FDID), Vanilla Synthetic Control (VanillaSC)): the same small-donor regime, a different identification lever (parallel trends after removing unit means or trends, rather than a weighted donor combination). On short, donor-starved staggered panels — where SC-style per-cohort weight optimisation becomes unstable — it stays well behaved, because it estimates no weights at all.

Reach for ROLLDID when#

  • you have few treated and/or few control units and want inference you can defend in finite samples (down to one treated unit);

  • pre-trends are heterogeneous but approximately linear, so detrending buys you a weaker identifying assumption than parallel trends;

  • adoption is common-timing or staggered, and you want a per-period event study (common timing) or a cohort-share-weighted aggregate (staggered) with honest standard errors;

  • you want a transparent, weight-free alternative to SC / SDID to report alongside them.

Do not use ROLLDID when#

Notation#

Let \(\mathcal{N} \coloneqq \{1, \dots, N\}\) index the units and \(t \in \mathcal{T} \coloneqq \{1, \dots, T\}\) the periods (1-indexed). Unit \(j\) has observed outcome \(y_{jt}\), with potential outcomes \(y_{jt}^{N}\) absent the intervention and \(y_{jt}^{I}\) under it; the treatment dummy is \(d_{jt}\), so \(y_{jt} = y_{jt}^{N} + (y_{jt}^{I} - y_{jt}^{N})\,d_{jt}\). Treatment is absorbing: once on it stays on. The unit-level “ever-treated” indicator is

\[d_j \coloneqq \max_{t\in\mathcal{T}} d_{jt} = \mathbf{1}\{\text{unit } j \text{ is ever treated}\},\]

partitioning \(\mathcal{N}\) into the eventually-treated \(\mathcal{N}_1 \coloneqq \{j : d_j = 1\}\) (size \(N_1\)) and the never-treated \(\mathcal{N}_0 \coloneqq \{j : d_j = 0\}\) (size \(N_0\)), with \(N = N_0 + N_1 \ge 3\).

Common timing. All treated units adopt after period \(T_0\), splitting time into the pre-period \(\mathcal{T}_1 \coloneqq \{t \le T_0\}\) and the post-period \(\mathcal{T}_2 \coloneqq \{t > T_0\}\).

Staggered adoption. Treated unit \(j\) has a cohort \(g_j \coloneqq \min\{t : d_{jt} = 1\}\); its pre-period is \(\{t < g_j\}\). Cohorts are collected in \(\mathcal{G} \coloneqq \{g_j : j \in \mathcal{N}_1\}\) with sizes \(N_g \coloneqq |\{j : g_j = g\}|\).
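The notation above maps directly onto a long panel. The following is a minimal pandas sketch, not the package's ingestion code; the column names `unit`, `t` and `d` are made up for illustration.

```python
import pandas as pd

# Toy long panel (hypothetical column names): three units, four periods.
# Unit "A" adopts at t = 3, unit "B" at t = 4, unit "C" is never treated.
panel = pd.DataFrame({
    "unit": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "t":    [1, 2, 3, 4] * 3,
    "d":    [0, 0, 1, 1,   0, 0, 0, 1,   0, 0, 0, 0],
})

# Ever-treated indicator d_j = max_t d_jt.
ever_treated = panel.groupby("unit")["d"].max()

# Cohort g_j = first period with d_jt = 1 (defined for treated units only).
cohort = panel.loc[panel["d"] == 1].groupby("unit")["t"].min()

# Cohort sizes N_g, and the design implied by the number of distinct cohorts.
cohort_sizes = cohort.value_counts().sort_index()
design = "common" if cohort.nunique() == 1 else "staggered"

print(ever_treated.to_dict(), cohort.to_dict(), design)
```

Here \(N_1 = 2\), \(N_0 = 1\), and the two treated units form two singleton cohorts, so the design is staggered.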

Transformed series carry a tilde: \(\widetilde{y}_{jt}\) is the residualised post-period outcome and \(\widetilde{y}_j\) its collapse to a single scalar per unit. The per-period effect is \(\tau_t\) and the ATT is \(\widehat{\tau}\) (the gap and att of the result object).

Assumptions#

Assumption 1 (no anticipation). Treated potential outcomes equal the never-treated ones before adoption: \(\mathbb{E}[y_{jt}^{I} - y_{jt}^{N} \mid d_j = 1] = 0\) for all \(t < g_j\) (sufficient: \(y_{jt}^{I} = y_{jt}^{N}\) for \(t < g_j\)).

Remark. The pre-period mean / trend is estimated only on \(t < g_j\), so anticipation in the final pre-periods would contaminate the baseline. Detrending and shorter pre-windows are the sensitivity levers (the package exposes the pre-window directly).

Assumption 2 (parallel trends after the rolling transformation). With the within-unit transformed regressand \(\widetilde{y}_j(0)\) formed from the untreated potential outcomes (defined in the next section), there is a constant \(\alpha\) with

\[\mathbb{E}\!\left[\widetilde{y}_j(0) \mid d_j\right] = \alpha .\]

Two cases: (a) under demeaning, \(\widetilde{y}_j(0)\) differences out a unit-specific level, so \(\alpha\) absorbs an arbitrary common trend; (b) under detrending, it differences out a unit-specific linear trend, so \(\alpha\) permits unit-specific linear trends.

Remark. Case (b) is strictly weaker than standard parallel trends and is the estimator’s edge over SC/SDID when pre-trends are heterogeneous but linear: treatment may be correlated with a unit’s level and slope as long as it is mean-independent of the post-minus-pre deviation.
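A small noise-free illustration of the difference between cases (a) and (b), assuming untreated outcomes of the form \(y_{jt}^{N} = a_j + b_j t\) with a steeper slope for the treated unit. All numbers are invented for the sketch; no treatment effect is present, so any nonzero contrast is bias.

```python
import numpy as np

T, g = 10, 7                           # periods 1..10, adoption at t = 7
t = np.arange(1, T + 1)
d = np.array([0, 0, 0, 1])             # three controls, one treated unit
level = np.array([1.0, 2.0, 3.0, 10.0])
slope = np.array([0.1, 0.1, 0.1, 0.5])  # the treated unit drifts faster

y0 = level + np.outer(t, slope)        # (T, N) untreated potential outcomes
pre, post = t < g, t >= g

# (a) demeaning: subtract each unit's pre-period mean, then average over the post window
demeaned = (y0[post] - y0[pre].mean(axis=0)).mean(axis=0)

# (b) detrending: fit a line per unit on the pre-period and project it forward
b_hat, a_hat = np.polyfit(t[pre], y0[pre], deg=1)   # one column per unit
detrended = (y0[post] - (a_hat + np.outer(t[post], b_hat))).mean(axis=0)

def contrast(v):
    """Treated-minus-control difference in means."""
    return v[d == 1].mean() - v[d == 0].mean()

placebo_demean = contrast(demeaned)     # picks up the slope gap
placebo_detrend = contrast(detrended)   # slope gap removed
print(placebo_demean, placebo_detrend)
```

Demeaning leaves a spurious contrast of about 2.0 (the slope gap of 0.4 times the 5-period distance between the post and pre window midpoints), while detrending returns a contrast that is zero up to floating-point error.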

Assumption 3 (classical linear model, for exact inference). The error \(u_j\) of the cross-sectional regression (next section) satisfies

\[u_j \mid d_j \sim \operatorname{Normal}(0, \sigma^2) \quad\text{i.i.d. across } j .\]

Remark. Normality is plausible despite small \(N\) because \(\widetilde{y}_j\) averages \(y_{jt}\) over \(\mathcal{T}_2\): if the series is weakly dependent over time, a central limit theorem across time makes the average approximately normal. The leverage for inference comes from \(T\), not \(N\). HC3 relaxes homoskedasticity (Assumption 3 then only needs mean-independence) at the cost of requiring a handful of treated and control units; randomization inference drops normality entirely.

Mathematical Formulation#

The rolling transformation#

Each unit’s outcome is residualised against its own pre-treatment path and then averaged over the post window. For cohort \(g\) (in common timing, \(g = T_0 + 1\) for every treated unit), rolling="demean" (the paper’s Procedure 2.1) removes the pre-period mean,

\[\widetilde{y}_{jt} \coloneqq y_{jt} - \bar{y}_j^{\,\mathrm{pre}(g)}, \qquad \bar{y}_j^{\,\mathrm{pre}(g)} \coloneqq \frac{1}{g-1}\sum_{t < g} y_{jt}, \qquad t \ge g,\]

while rolling="detrend" (Procedure 3.1) fits a unit-specific line on the pre-period, \((\widehat{a}_j, \widehat{b}_j) = \operatorname*{argmin}_{a,b}\sum_{t<g}(y_{jt} - a - b\,t)^2\), and removes its projection into the post-period,

\[\widetilde{y}_{jt} \coloneqq y_{jt} - \bigl(\widehat{a}_j + \widehat{b}_j\,t\bigr), \qquad t \ge g .\]

The unit’s single cross-sectional regressand is the post-average of its transformed outcome,

\[\widetilde{y}_j \coloneqq \frac{1}{|\mathcal{T}_2|}\sum_{t\in\mathcal{T}_2} \widetilde{y}_{jt} \;=\; \bar{y}_j^{\,\mathrm{post}} - \bar{y}_j^{\,\mathrm{pre}} \quad\text{(demeaning)} .\]
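The two procedures and the collapse can be sketched in a few lines of NumPy. This is an illustrative reading of the formulas above, not the package's implementation; the function name `rolling_transform` and the wide `(T, N)` layout are assumptions of the sketch.

```python
import numpy as np

def rolling_transform(y, g, mode="demean"):
    """Residualise each unit's series against its own pre-period (t < g).

    y    : array of shape (T, N); row i holds period t = i + 1, columns are units
    g    : first treated period (1-indexed)
    mode : "demean" or "detrend" (detrending needs at least two pre-periods)
    Returns the transformed post-period block, shape (T - g + 1, N).
    """
    y = np.asarray(y, dtype=float)
    t = np.arange(1, y.shape[0] + 1)
    pre, post = t < g, t >= g
    if mode == "demean":
        baseline = y[pre].mean(axis=0)                 # pre-period mean per unit
    elif mode == "detrend":
        slope, intercept = np.polyfit(t[pre], y[pre], deg=1)  # one fit per column
        baseline = intercept + np.outer(t[post], slope)       # projected pre-trend
    else:
        raise ValueError("mode must be 'demean' or 'detrend'")
    return y[post] - baseline

# Unit 0 follows y = 2 + 3t exactly; unit 1 is flat at 5, then jumps to 7.
y = np.array([[5., 5.], [8., 5.], [11., 5.], [14., 7.], [17., 7.]])

collapsed_demean = rolling_transform(y, g=4, mode="demean").mean(axis=0)
collapsed_detrend = rolling_transform(y, g=4, mode="detrend").mean(axis=0)
print(collapsed_demean, collapsed_detrend)
```

For unit 0 the demeaned collapse is 7.5 (post mean 15.5 minus pre mean 8), while the detrended collapse is zero because the series never leaves its pre-period line. For unit 1 both transformations return the jump of 2.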

The collapse equivalence#

The key algebraic fact (Lee & Wooldridge): the panel DiD coefficient equals the slope of the cross-sectional regression of \(\widetilde{y}_j\) on the ever-treated indicator,

\[\widetilde{y}_j = \alpha + \tau\,d_j + u_j, \qquad j \in \mathcal{N},\]

so that, in closed form,

\[\widehat{\tau} = \frac{1}{N_1}\sum_{j\in\mathcal{N}_1}\widetilde{y}_j - \frac{1}{N_0}\sum_{j\in\mathcal{N}_0}\widetilde{y}_j = \overline{\widetilde{y}}_{\,\text{treated}} - \overline{\widetilde{y}}_{\,\text{control}} .\]

The transformation does two things at once: differencing post-minus-pre within each unit cancels any time-constant heterogeneity (a unit fixed effect, even a unit root), and collapsing time to a scalar pushes the serial correlation in \(\{y_{jt}\}\) inside \(\widetilde{y}_j\). Across units the \(\widetilde{y}_j\) are independent, so the object on which we do inference is an ordinary cross-section — no clustering, no large-\(T\) requirement, and strong time-series dependence is absorbed rather than modelled.
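The equivalence can be checked numerically for the demeaning case with common timing. The sketch below simulates a balanced panel (all values invented), computes \(\widehat{\tau}\) from the collapsed cross-section, and compares it with the coefficient on \(d_{jt}\) from a two-way fixed effects regression on the full panel.

```python
import numpy as np

rng = np.random.default_rng(0)
T, T0, N0, N1 = 6, 4, 5, 2
N = N0 + N1
d = np.r_[np.zeros(N0), np.ones(N1)]               # ever-treated indicator d_j
post = (np.arange(1, T + 1) > T0).astype(float)     # 1 for t > T0
y = rng.normal(size=(T, N)) + 1.5 * np.outer(post, d)   # rows: periods, cols: units

# Cross-sectional route: post mean minus pre mean per unit, then difference in means.
y_tilde = y[T0:].mean(axis=0) - y[:T0].mean(axis=0)
tau_cross = y_tilde[d == 1].mean() - y_tilde[d == 0].mean()

# Panel route: regress y_jt on unit dummies, time dummies and d_jt.
# y.reshape(-1) is period-major, so row t*N + j is unit j in period t + 1.
unit_fe = np.kron(np.ones((T, 1)), np.eye(N))            # (T*N, N)
time_fe = np.kron(np.eye(T), np.ones((N, 1)))[:, 1:]      # drop one period (collinearity)
d_jt = np.outer(post, d).reshape(-1, 1)
X = np.hstack([unit_fe, time_fe, d_jt])
beta, *_ = np.linalg.lstsq(X, y.reshape(-1), rcond=None)
tau_panel = beta[-1]

print(tau_cross, tau_panel)
```

The two numbers agree to floating-point precision, which is the point of the collapse: the panel regression and the \(N\)-observation cross-section deliver the same point estimate, but only the latter has independent observations.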

Per-period effects (common timing)#

Rather than collapse, regress the transformed value at each post period on \(d_j\) to obtain the event study,

\[\widetilde{y}_{jt} = \alpha_t + \tau_t\,d_j + u_{jt}, \qquad t \in \mathcal{T}_2 ,\]

with \(\widehat{\tau}_t\) the estimated coefficient on \(d_j\). Because OLS is linear and \(\widetilde{y}_j = |\mathcal{T}_2|^{-1}\sum_{t}\widetilde{y}_{jt}\), the aggregate is exactly the mean of the per-period estimates, \(\widehat{\tau} = |\mathcal{T}_2|^{-1}\sum_{t\in\mathcal{T}_2}\widehat{\tau}_t\).
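A quick numerical check of the averaging identity, using demeaning on simulated data (all values invented for the sketch). With a single binary regressor, each per-period OLS slope is a treated-minus-control difference in means.

```python
import numpy as np

rng = np.random.default_rng(1)
T, T0, N0, N1 = 7, 4, 5, 2
d = np.r_[np.zeros(N0), np.ones(N1)]
y = rng.normal(size=(T, N0 + N1))

# Transformed outcome at every post period: y_jt minus the unit's pre-period mean.
y_tilde_t = y[T0:] - y[:T0].mean(axis=0)             # shape (T - T0, N)

# Event study: one treated-minus-control contrast per post period.
tau_t = y_tilde_t[:, d == 1].mean(axis=1) - y_tilde_t[:, d == 0].mean(axis=1)

# Aggregate: the same contrast on the collapsed (time-averaged) regressand.
y_tilde = y_tilde_t.mean(axis=0)
tau_hat = y_tilde[d == 1].mean() - y_tilde[d == 0].mean()

print(tau_t, tau_hat, tau_t.mean())
```

The last two printed values coincide, so the reported ATT can always be recovered by averaging the event-study path.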

Staggered aggregation#

With several cohorts, transform each unit relative to every cohort date \(g\), giving \(\widetilde{y}_{jt}^{(g)}\), and form, per cohort, \(\widetilde{y}_j^{(g)} \coloneqq (T-g+1)^{-1}\sum_{t\ge g} \widetilde{y}_{jt}^{(g)}\). Using never-treated units as the comparison and cohort shares \(\widehat{\omega}_g \coloneqq N_g / N_1\), the collapsed regressand is

\[\begin{split}\widetilde{y}_j \coloneqq \begin{cases} \widetilde{y}_j^{(g_j)}, & j \in \mathcal{N}_1 \ \text{(its own cohort)},\\[2mm] \displaystyle\sum_{g\in\mathcal{G}} \widehat{\omega}_g\,\widetilde{y}_j^{(g)}, & j \in \mathcal{N}_0 \ \text{(never treated)} . \end{cases}\end{split}\]

Running the single cross-sectional regression \(\widetilde{y}_j = \alpha + \tau_\omega d_j + u_j\) then returns the cohort-share-weighted aggregate \(\widehat{\tau}_\omega\) whose standard error automatically accounts for the covariance across cohort effects (it is one regression, not a stitched-together sum). Per-cohort effects \(\widehat{\tau}_g\) are the analogous cohort-\(g\) vs never-treated contrasts. Using only never-treated comparisons sidesteps the “forbidden comparison” / negative-weighting pathologies of two-way fixed effects under staggered adoption.
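The staggered construction can be sketched as follows for the demeaning case. This is a sketch under stated assumptions rather than the package's code: the function name `collapse_staggered` and the convention of coding never-treated units as cohort `0` are hypothetical.

```python
import numpy as np

def collapse_staggered(y, cohort):
    """Cohort-share-weighted aggregate with never-treated comparisons (demeaning).

    y      : (T, N) outcomes; row i holds period t = i + 1
    cohort : length-N integer array; g_j >= 2 for treated units, 0 for never-treated
             (the 0 coding is a convention of this sketch)
    Returns (tau_omega, cohorts, tau_g, shares).
    """
    y = np.asarray(y, dtype=float)
    cohort = np.asarray(cohort)
    treated = cohort > 0
    cohorts, counts = np.unique(cohort[treated], return_counts=True)
    shares = counts / counts.sum()                      # omega_g = N_g / N_1

    # y_tilde_j^{(g)} for every unit and every cohort date: post-g mean minus pre-g mean.
    per_cohort = np.stack([y[g - 1:].mean(axis=0) - y[:g - 1].mean(axis=0)
                           for g in cohorts])           # shape (G, N)

    y_tilde = shares @ per_cohort                       # never-treated: weighted mix
    for k, g in enumerate(cohorts):
        own = cohort == g
        y_tilde[own] = per_cohort[k, own]               # treated: own cohort only

    tau_omega = y_tilde[treated].mean() - y_tilde[~treated].mean()
    tau_g = np.array([per_cohort[k, cohort == g].mean() - per_cohort[k, ~treated].mean()
                      for k, g in enumerate(cohorts)])
    return tau_omega, cohorts, tau_g, shares

rng = np.random.default_rng(2)
y = rng.normal(size=(6, 6))
cohort = np.array([0, 0, 0, 3, 3, 5])                   # two cohorts, three never-treated
tau_omega, cohorts, tau_g, shares = collapse_staggered(y, cohort)
print(tau_omega, shares @ tau_g)
```

The two printed values match: the single-regression aggregate equals the share-weighted sum of the per-cohort contrasts, while its standard error comes from one cross-section rather than from combining separate cohort regressions.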

Inference#

All three modes operate on the cross-sectional regression \(\widetilde{y}_j = \alpha + \tau d_j + u_j\) (or its per-period / per-cohort variants), with residual degrees of freedom \(N - 2\).

  • inference="exact" — under Assumption 3 the studentised statistic is exactly \(t\)-distributed,

    \[\frac{\widehat{\tau} - \tau}{\operatorname{se}(\widehat{\tau})} \;\sim\; t_{N-2},\]

    giving exact two-sided tests and exact-coverage intervals \(\widehat{\tau} \pm t_{1-\alpha/2,\,N-2}\,\operatorname{se}(\widehat{\tau})\), with any \(N \ge 3\), including \(N_1 = 1\). (With a single treated unit this is the studentised-residual / outlier statistic of Donald & Lang.)

  • inference="hc3" — heteroskedasticity-robust (MacKinnon–White HC3). Use only with a handful of treated and control units: with a single treated unit its leverage is \(1\) and HC3 is undefined, so ROLLDID raises rather than returning a degenerate standard error.

  • inference="ri" — randomization inference: re-assign the \(N_1\) treated labels across units ri_reps times and report the permutation \(p\)-value \(\Pr(|\widehat{\tau}^{\,\text{perm}}| \ge |\widehat{\tau}|)\), requiring no normality.

The event study (common timing) and the per-cohort table (staggered) report uncertainty at the level of each individual effect: each \(\widehat{\tau}_t\) (resp. \(\widehat{\tau}_g\)) ships with its own se, p_value and \((1-\alpha)\) confidence band, which is what plot_rolldid renders around the effect path.
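The three inference modes can be sketched on a collapsed cross-section with NumPy and SciPy. The function `cross_section_inference` is hypothetical and is not the package's pipeline; in particular it returns `nan` for the HC3 standard error when a leverage equals 1, whereas ROLLDID is described above as raising in that case.

```python
import numpy as np
from scipy import stats

def cross_section_inference(y_tilde, d, alpha=0.05, ri_reps=1000, seed=0):
    """Exact-t, HC3 and randomization inference for y_tilde = a + tau * d + u."""
    y_tilde = np.asarray(y_tilde, dtype=float)
    d = np.asarray(d, dtype=float)
    N = y_tilde.size
    X = np.column_stack([np.ones(N), d])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y_tilde
    resid = y_tilde - X @ beta
    tau, df = beta[1], N - 2

    # Exact: classical OLS standard error and the t_{N-2} reference distribution.
    se = np.sqrt(resid @ resid / df * XtX_inv[1, 1])
    p_exact = 2 * stats.t.sf(abs(tau / se), df)
    half_width = stats.t.ppf(1 - alpha / 2, df) * se

    # HC3: sandwich with residuals inflated by 1 / (1 - h_j); undefined if h_j = 1.
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)         # leverages
    if np.any(np.isclose(h, 1.0)):
        se_hc3 = np.nan
    else:
        w = (resid / (1 - h)) ** 2
        meat = (X * w[:, None]).T @ X
        se_hc3 = np.sqrt((XtX_inv @ meat @ XtX_inv)[1, 1])

    # Randomization inference: reshuffle the treated labels across units.
    rng = np.random.default_rng(seed)
    def contrast(lab):
        return y_tilde[lab == 1].mean() - y_tilde[lab == 0].mean()
    perm = np.array([contrast(rng.permutation(d)) for _ in range(ri_reps)])
    p_ri = np.mean(np.abs(perm) >= abs(contrast(d)))

    return {"tau": tau, "se": se, "p_exact": p_exact,
            "ci": (tau - half_width, tau + half_width),
            "se_hc3": se_hc3, "p_ri": p_ri}

# Single treated unit (N_1 = 1): exact inference works, HC3 does not.
out = cross_section_inference([0.1, -0.2, 0.05, 0.0, 1.0], [0, 0, 0, 0, 1], ri_reps=200)
print(out["tau"], out["se"], out["p_exact"], out["se_hc3"], out["p_ri"])
```

With one treated unit its leverage is \(1/N_1 = 1\), so the HC3 entry comes back as `nan`. With only five units there are only five distinct label assignments, so the randomization \(p\)-value cannot fall below roughly \(1/5\); that granularity is inherent to permutation tests in very small samples.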

Why it improves on SC / SDID#

The synthetic-control family needs the treated unit inside the convex hull of the donors and, for inference, large \(N_0, T_0, T_1\) (SDID additionally assumes \(I(0)\) weak dependence and normality). ROLLDID does not need \(N_0\) or \(N_1\) to grow: under Assumption 3 the collapse gives exact finite-sample inference for any \(N \ge 3\), with a longer panel serving only to make the normality of the time-averaged \(\widetilde{y}_j\) more plausible, and detrending weakens parallel trends rather than requiring a donor combination to exist. The trade-off, which the paper is explicit about, is efficiency: the cross-sectional estimator can have larger variance than SDID when SDID’s assumptions hold. In exchange, SDID’s packaged intervals can under-cover, whereas ROLLDID’s are exact under Assumption 3. It is best read as a complement: a transparent, weight-free tool that is the more trustworthy of the two precisely when units or periods are scarce, or when the per-cohort weight optimisation of staggered SDID becomes ill-posed.

Example: California Proposition 99#

Reproduces the paper’s Table 3 (common timing, single treated unit) on the bundled Abadie–Diamond–Hainmueller smoking panel. The outcome is log per-capita cigarette sales; California is treated from 1989 against 38 never-treated states.

import numpy as np
import pandas as pd
from mlsynth import ROLLDID

BASE = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/main/basedata/"
df = pd.read_csv(BASE + "smoking_data.csv")          # 39 states x 1970-2000
df["logcig"] = np.log(df["cigsale"])
df["treat"] = df["Proposition 99"].astype(int)       # d_jt: California from 1989

res = ROLLDID({
    "df": df, "outcome": "logcig", "treat": "treat",
    "unitid": "state", "time": "year",
    "rolling": "detrend", "inference": "exact", "display_graphs": False,
}).fit()

print(res.effects.att, res.inference.standard_error, res.inference.p_value)
#  -0.227   0.094   0.021     (paper Table 3)
print(res.per_period[["time", "att", "ci_lower", "ci_upper"]].tail(3))
#  tau_2000 = -0.403, 95% CI [-0.712, -0.094]

Staggered: castle-doctrine laws#

The bundled castle.csv panel (50 states, 2000–2010; 21 staggered adopters, 29 never-treated) gives the cohort-share-weighted aggregate via the same call, with the treatment indicator turning on at each state’s adoption year. Then res.effects.att is \(\widehat{\tau}_\omega\) (0.092 demeaning / 0.067 detrending, matching §7.2) and res.per_cohort is the per-cohort breakdown with its own confidence intervals.

Verification#

Path A, both empirical applications reproduced to the reported precision and cross-validated (during development) against the AGPL lwdid package used only as a black-box oracle — clean-room, sharing no code. California Prop 99 (Table 3): demeaning ATT \(-0.422\) (se \(0.121\)), detrending \(-0.227\) (se \(0.094\)), detrend exact \(p = 0.021\), \(\tau_{2000} = -0.667\) / \(-0.403\). Castle laws (§7.2, staggered): demeaning aggregate \(0.092\) (se \(0.057\)), detrending \(0.067\) (HC3 se \(0.055\)). See ROLLDID — Lee & Wooldridge rolling-transformation DiD; the durable case is benchmarks/cases/rolldid_lw.py and unit-level reproduction is pinned in mlsynth/tests/test_rolldid.py.

[LW2026]

Lee, S. J., & Wooldridge, J. M. (2026). Simple Approaches to Inference with Difference-in-Differences Estimators with Small Cross-Sectional Sample Sizes.

Core API#

class mlsynth.ROLLDID(config: ROLLDIDConfig | dict)#

Rolling-transformation difference-in-differences.

Parameters:

config (ROLLDIDConfig or dict) – See mlsynth.utils.rolldid_helpers.config.ROLLDIDConfig — adds rolling (demean/detrend), inference (exact/hc3/ri), alpha, ri_reps, seed to the base df / outcome / treat / unitid / time fields.

Examples

>>> from mlsynth import ROLLDID
>>> res = ROLLDID({"df": panel, "outcome": "y", "treat": "w",
...                "unitid": "id", "time": "t", "rolling": "detrend"}).fit()
>>> res.effects.att, res.inference.p_value

fit() → ROLLDIDResults#

Run the rolling-DiD estimate and return a standardized result.

Configuration#

class mlsynth.utils.rolldid_helpers.config.ROLLDIDConfig(*, df: pandas.DataFrame, outcome: str, treat: str, unitid: str, time: str, display_graphs: bool = True, save: bool | str = False, counterfactual_color: typing.List[str] = <factory>, treated_color: str = 'black', plot: mlsynth.config_models.PlotConfig = <factory>, rolling: str = 'demean', inference: str = 'exact', alpha: float = 0.05, ri_reps: int = 1000, seed: int = 0)#

Configuration for ROLLDID (Lee & Wooldridge rolling-transformation DiD).

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

Result Containers#

ROLLDID.fit() returns a ROLLDIDResults, a standardized BaseEstimatorResults: the aggregated ATT is res.effects.att; res.inference carries the standard error, \(p\)-value, confidence interval and method; res.time_series is the event-study path (common timing). The rolling-DiD specifics sit alongside — res.transformation (demean / detrend), res.inference_type, res.design (common / staggered), res.n_treated / res.n_control, and the per-period (res.per_period) or per-cohort (res.per_cohort) effect tables, each with its own se / p_value / ci_lower / ci_upper.

class mlsynth.utils.rolldid_helpers.structures.ROLLDIDResults(*, effects: EffectsResults | None = None, fit_diagnostics: FitDiagnosticsResults | None = None, time_series: TimeSeriesResults | None = None, weights: WeightsResults | None = None, inference: InferenceResults | None = None, method_details: MethodDetailsResults | None = None, sub_method_results: Dict[str, Any] | None = None, additional_outputs: Dict[str, Any] | None = None, raw_results: Dict[str, Any] | None = None, execution_summary: Dict[str, Any] | None = None, plot_config: PlotConfig | None = None, transformation: str | None = None, inference_type: str | None = None, design: str | None = None, n_treated: int | None = None, n_control: int | None = None, per_period: DataFrame | None = None, per_cohort: DataFrame | None = None, **extra_data: Any)#

Rolling-DiD result: a standardized BaseEstimatorResults (effects.att = the aggregated ATT, inference = SE / p / CI / method, time_series = the event-study path for common timing) plus the rolling-DiD specifics.

transformation#

"demean" (Procedure 2.1) or "detrend" (Procedure 3.1).

Type:

str

inference_type#

"exact" / "hc3" / "ri" as requested.

Type:

str

design#

"common" (single cohort) or "staggered".

Type:

str

n_treated, n_control

Eventually-treated and never-treated unit counts.

Type:

int

per_period#

Per-period ATTs (common timing): time / att / se / t / p_value / ci_lower / ci_upper.

Type:

pandas.DataFrame or None

per_cohort#

Per-cohort ATTs (staggered): cohort / n_treated / att / …

Type:

pandas.DataFrame or None

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'allow', 'json_encoders': {<class 'numpy.ndarray'>: <function BaseEstimatorResults.Config.<lambda>>}}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

Helper Modules#

Panel ingestion: resolves the long panel into per-unit series, the treatment cohorts, and the never-treated set; checks the absorbing-treatment and never-treated conditions.

Ingestion for ROLLDID: long panel -> per-unit series + cohort structure.

The rolling-DiD transformation is unit-level, so we need each unit’s full outcome series, its treatment cohort (the first period in which its treatment indicator turns on, g), and the set of never-treated units. The design is common timing when every treated unit shares one cohort, and staggered otherwise.

mlsynth.utils.rolldid_helpers.setup.rolldid_setup(df: DataFrame, unit_id: str, time_id: str, outcome: str, treat: str) → Dict[str, Any]#

Resolve the panel into the inputs the rolling-DiD pipeline needs.

Returns a dict with Ywide (time x unit), cohort_of ({unit: g} for treated units), never (never-treated unit labels), treated (all eventually-treated labels), times (sorted time labels), and design ("common" / "staggered").

Raises:
  • MlsynthConfigError – If a required column is missing.

  • MlsynthDataError – If the panel is unbalanced, the treatment indicator is not 0/1, there are no treated units, no never-treated controls, or treatment switches off (the indicator must be absorbing once on).
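The data conditions listed above can be illustrated with a short pandas sketch. The helper `check_panel` is hypothetical and raises a plain `ValueError`; it is not the package's validation code and does not use the mlsynth exception classes.

```python
import pandas as pd

def check_panel(df, unit, time, treat):
    """Illustrative checks: balanced panel, 0/1 absorbing treatment, both groups present."""
    counts = df.groupby(unit)[time].nunique()
    if (df.duplicated([unit, time]).any()
            or counts.nunique() != 1
            or counts.iloc[0] != df[time].nunique()):
        raise ValueError("panel is unbalanced")
    if not df[treat].isin([0, 1]).all():
        raise ValueError("treatment indicator must be 0/1")

    ordered = df.sort_values([unit, time])
    # Absorbing treatment: within a unit the indicator never decreases over time.
    if ordered.groupby(unit)[treat].diff().lt(0).any():
        raise ValueError("treatment switches off for at least one unit")

    ever = ordered.groupby(unit)[treat].max()
    if ever.sum() == 0:
        raise ValueError("no treated units")
    if (ever == 0).sum() == 0:
        raise ValueError("no never-treated controls")

good = pd.DataFrame({"id": ["A"] * 3 + ["B"] * 3, "t": [1, 2, 3] * 2,
                     "w": [0, 1, 1, 0, 0, 0]})
check_panel(good, "id", "t", "w")        # passes silently

bad = good.assign(w=[0, 1, 0, 0, 0, 0])  # unit A switches back off at t = 3
try:
    check_panel(bad, "id", "t", "w")
    switch_off_caught = False
except ValueError:
    switch_off_caught = True
print(switch_off_caught)
```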

The rolling transformations, the cross-sectional estimator (aggregate, per-period, per-cohort), and the exact / HC3 / randomization inference.

Estimation + inference for ROLLDID (clean-room from Lee & Wooldridge 2026).

The method collapses the panel to one cross-sectional observation per unit by a pre-treatment rolling transformation (demean = Procedure 2.1; detrend = Procedure 3.1), then reads the ATT off a cross-sectional regression of the transformed post-average on a treatment indicator. Common timing and staggered adoption share the same machinery (the staggered aggregate is eq. 7.18-7.19, with never-treated units as the comparison). Inference is exact-t (CLM normality), HC3, or randomization.

mlsynth.utils.rolldid_helpers.pipeline.estimate(prep: Dict[str, Any], *, mode: str, inference: str, alpha: float, ri_reps: int, seed: int) → Dict[str, Any]#

Run the rolling-DiD estimate end to end and return a result dict.

Event-study / effect plot.

Event-study / effect plot for ROLLDID.

mlsynth.utils.rolldid_helpers.plotter.plot_rolldid(result, *, show: bool = True, theme: dict | None = None)#

Plot the rolling-DiD effect.

Common timing: the per-period ATTs (event study) with their confidence band and a zero reference line. Staggered: the per-cohort ATTs. Uses the shared mlsynth house style.