Forward Difference-in-Differences (FDID)

When to Use This Estimator#

Difference-in-differences (DiD) is the workhorse of quasi-experimental causal inference, but it rests on a parallel-trends assumption: the treated unit’s untreated outcome would have moved in lockstep with the average of the controls. With a large, heterogeneous pool of candidate controls that assumption is rarely credible for the pool as a whole – most of the controls are simply the wrong comparison. The usual escape hatches each have a catch:

Plain DiD uses every control with equal weight. One badly mismatched control contaminates the average, and its bias does not shrink as the panel grows.
Synthetic control (SC, [ABADIE2010]) weights the controls on the simplex, but is justified only as the pre-period grows without bound, and has no inference theory when the data are non-stationary of unknown form – exactly the regime of most marketing and macro panels.
The panel-data approach of Hsiao, Ching and Wan ([HCW]) and its forward-selected variant ([fsPDA]) fit an unrestricted regression on the controls. When the number of donors \(N_0\) exceeds the number of pre-treatment periods \(T_0\) – common in store/geo studies – they overfit in-sample and predict poorly out-of-sample.

Forward DiD (Li [Li2024]) targets precisely this regime: many candidate controls, a short-to-moderate pre-period, and a need for valid inference under non-stationarity. It keeps DiD’s transparency – an equal-weighted comparison group plus a single intercept – but chooses which controls enter the comparison by a greedy forward search on pre-treatment fit. Because only one parameter is ever estimated (the DiD intercept \(b_0\)) no matter how many controls are selected, the fitted model cannot overfit the way an unrestricted regression on the controls does, and the textbook DiD standard error applies. Its advantages, in Li’s own summary:

It is a flexible drop-in for DiD, usable even when DiD’s all-controls parallel trend is too restrictive.
It accommodates any number of controls, including \(N_0 > T_0\).
There are no overfitting concerns – one parameter after selection.
It is computationally cheap: a greedy \(O(N_0^2)\) search rather than the \(2^{N_0}\) subsets of the optimal procedure.
It has inference theory valid for stationary and non-stationary data, which SC and HCW lack.

Forward Selection vs. Matching and Weighting#

Every synthetic-control-family estimator answers the same question – what comparison reproduces the treated unit’s untreated path? – but each makes a different structural bet about the comparison. Forward DiD’s bet is distinctive and worth stating plainly:

A subset of the controls shares the treated unit’s trend; find that subset and average it with equal weights.

It does not try to weight all controls cleverly (SC), nor regress on all of them (HCW), nor trust all of them (DiD). It selects. The selection is greedy: add, one at a time, the control that most improves pre-treatment fit, trace the fit as the subset grows, and keep the subset that fits best.

	DiD	Synthetic control	Forward DiD
Comparison	all controls, equal weight	all controls, simplex weights	a selected subset, equal weight
Free parameters	1 (intercept)	\(N_0\) weights	1 (intercept) – after selection
Overfitting risk	none	controlled by the simplex	none – one parameter
Key assumption	all controls are parallel	treated is in the convex hull	some subset is parallel
Inference under non-stationarity	standard	none available	standard (Prop. 2.1)

The equal weights are the crux. Because the selected controls enter through a single average – not \(|\mathcal{D}|\) separate coefficients – adding more controls cannot increase the model’s degrees of freedom. This is an implicit regularization that buys both the overfitting immunity and the clean, DiD-style inference theory. Forward DiD is therefore best read as DiD with a principled, data-driven choice of comparison group, not as a new weighting scheme.

Notation#

Index the units by \(j\). Let \(j = 1\) be the single treated unit and \(\mathcal{N} \coloneqq \{1, \ldots, N\}\) all units; the donor pool is \(\mathcal{N}_0 \coloneqq \mathcal{N} \setminus \{1\}\), with \(|\mathcal{N}_0| = N_0\). A selected subset \(\mathcal{D} \subseteq \mathcal{N}_0\) is the comparison group; write its equal-weighted average outcome as

\[\bar{y}_{\mathcal{D}, t} = \frac{1}{|\mathcal{D}|} \sum_{j \in \mathcal{D}} y_{jt}.\]

Time runs over \(t \in \mathcal{T} \coloneqq \{1, \ldots, T\}\); the intervention takes effect after \(T_0\), giving a pre-period \(\mathcal{T}_1 \coloneqq \{t \in \mathcal{T} : t \le T_0\}\) (so \(|\mathcal{T}_1| = T_0\)) and a post-period \(\mathcal{T}_2 \coloneqq \{t \in \mathcal{T} : t > T_0\}\) (so \(|\mathcal{T}_2| = T - T_0\)). Potential outcomes are \(y_{jt}^N\) (without the intervention) and \(y_{jt}^I\) (under it); for the treated unit we observe \(y_{1t} = y_{1t}^N\) on \(\mathcal{T}_1\) and \(y_{1t} = y_{1t}^I\) on \(\mathcal{T}_2\). The estimand is the average treatment effect on the treated,

\[\mathrm{ATT} = \frac{1}{|\mathcal{T}_2|} \sum_{t \in \mathcal{T}_2} \bigl(y_{1t}^I - y_{1t}^N\bigr).\]

Notation bridge

Li [Li2024] writes the treated outcome \(y_{tr,t}\), the selected control set \(\mathcal{N}_{co}\) of size \(N_{co}\), its average \(\bar{y}_{\mathcal{N}_{co}, t}\), the intercept \(\alpha\), and uses \(T_1\) / \(T_2\) for the pre/post sample sizes (treatment at \(T_1 + 1\)). In the mlsynth canon these become the treated unit \(j = 1\) (hence \(y_{1t}\)), the comparison group \(\mathcal{D} \subseteq \mathcal{N}_0\) with average \(\bar{y}_{\mathcal{D}, t}\), the level-shift intercept \(b_0\) (the canon reserves \(\alpha\) for the significance level), and the single split point \(T_0\).

Mathematical Formulation#

The DiD model#

For a fixed comparison group \(\mathcal{D}\), Forward DiD posits that the treated unit’s untreated outcome equals the group average plus a constant level shift:

\[y_{1t}^N = b_0 + \bar{y}_{\mathcal{D}, t} + v_t, \qquad t = 1, \ldots, T,\]

with \(b_0\) an unknown intercept and \(v_t\) a zero-mean, weakly dependent error. Crucially, \(y_{1t}^N\) and \(\bar{y}_{\mathcal{D}, t}\) may each be non-stationary (trending) provided their difference is stationary – this is the Forward DiD parallel-trends condition. The intercept is estimated by least squares on the pre-period,

\[\widehat{b}_0 = \frac{1}{T_0} \sum_{t \in \mathcal{T}_1} \bigl(y_{1t} - \bar{y}_{\mathcal{D}, t}\bigr),\]

so the in-sample fit and out-of-sample counterfactual are

\[\widehat{y}_{1t} = \widehat{b}_0 + \bar{y}_{\mathcal{D}, t}, \qquad t = 1, \ldots, T,\]

and the estimated per-period effect is \(\tau_t = y_{1t} - \widehat{y}_{1t}\), whose post-period average is the ATT estimate,

\[\widehat{\tau} = \frac{1}{|\mathcal{T}_2|} \sum_{t \in \mathcal{T}_2} \bigl(y_{1t} - \widehat{y}_{1t}\bigr).\]

Because the model has a single parameter, the pre-treatment fit quality is summarized by an \(R^2\) (identical to the adjusted \(R^2\), since the slope on \(\bar{y}_{\mathcal{D}, t}\) is fixed at one and only the intercept is estimated):

\[R^2_{\mathcal{D}} = 1 - \frac{\sum_{t \in \mathcal{T}_1} \widehat{v}_t^2} {\sum_{t \in \mathcal{T}_1} (y_{1t} - \bar{y}_1)^2}, \qquad \widehat{v}_t = y_{1t} - \bar{y}_{\mathcal{D}, t} - \widehat{b}_0,\]

where \(\bar{y}_1\) is the treated unit’s pre-period mean.
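The formulas above translate almost line for line into NumPy. The sketch below is a minimal illustration, not mlsynth's implementation: the function name `did_from_group`, its arguments, and the toy numbers are assumptions made for this example.

```python
import numpy as np


def did_from_group(y_treated, donors, T0, selected):
    """Intercept-only DiD fit against the equal-weighted mean of `selected`.

    y_treated : treated outcome, shape (T,)
    donors    : donor outcomes, shape (T, N0)
    T0        : number of pre-treatment periods
    selected  : donor column indices forming the comparison group D
    (hypothetical helper for illustration; not part of mlsynth)
    """
    y_bar = donors[:, selected].mean(axis=1)       # comparison-group average
    b0 = np.mean(y_treated[:T0] - y_bar[:T0])      # intercept, pre-period only
    counterfactual = b0 + y_bar                    # fitted / counterfactual path
    gap = y_treated - counterfactual               # per-period effect
    att = gap[T0:].mean()                          # post-period average
    resid = gap[:T0]                               # pre-period residuals
    ss_tot = np.sum((y_treated[:T0] - y_treated[:T0].mean()) ** 2)
    r2 = 1.0 - np.sum(resid ** 2) / ss_tot
    return {"b0": b0, "counterfactual": counterfactual,
            "gap": gap, "att": att, "r2": r2}


# Toy panel (made-up numbers): the treated unit equals the donor mean plus 2,
# then jumps by 1 once the T0 = 4 pre-periods are over.
donors = np.array([[1.0, 3.0], [2.0, 4.0], [4.0, 4.0],
                   [3.0, 7.0], [5.0, 7.0], [6.0, 8.0]])
y = donors.mean(axis=1) + 2.0
y[4:] += 1.0

fit = did_from_group(y, donors, T0=4, selected=[0, 1])
print(fit["b0"], fit["att"], fit["r2"])
```

On this noise-free toy panel the intercept is 2, the ATT is 1, and the pre-period \(R^2\) is 1, because the treated path is an exact level shift of the group average.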

Note

\(\tau_t\) and \(\widehat{\tau}\) are exactly the gap and att the result object returns (computed in mlsynth.utils.effectutils), and the pre/post fit are rmse_pre / rmse_post (mlsynth.utils.fitutils) – the math here names the quantities mlsynth.FDID.fit() reports.

The forward-selection algorithm#

Maximizing \(R^2_{\mathcal{D}}\) is equivalent to minimizing the pre-period residual variance \(T_0^{-1} \sum_{t \in \mathcal{T}_1} \widehat{v}_t^2\). Forward DiD searches over comparison groups greedily:

Step 1. For each single donor \(i \in \mathcal{N}_0\), form the one-unit comparison group and compute its pre-period \(R^2\). Keep the best single donor, \(\widehat{c}_1\).
Step 2. Add to \(\{\widehat{c}_1\}\) each of the remaining \(N_0 - 1\) donors in turn; keep the two-unit group with the highest \(R^2\).
Step 3. Continue, adding one donor at a time, until all \(N_0\) donors are in. This yields \(N_0\) nested groups (sizes \(1, 2, \ldots, N_0\)); select the one, \(\widehat{\mathcal{D}}\) (Li’s \(\mathcal{N}_{co}\)), with the largest \(R^2\).

The greedy search evaluates \(1 + 2 + \cdots + N_0 = N_0(N_0+1)/2\) sub-models rather than the \(2^{N_0}\) of the exhaustive procedure (for \(N_0 = 60\), that is 1,830 versus \(1.15 \times 10^{18}\)). The final group \(\widehat{\mathcal{D}}\) is then plugged into the DiD formulas above for the ATT, its standard error, and the \(R^2\).
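Steps 1 to 3 can be written down literally before any optimization. The sketch below is a deliberately naive reading of the algorithm, meant only to make the search and the sub-model count concrete; `pre_r2`, `forward_select_naive`, and the simulated panel are assumptions for this example, and ties are broken by taking the first maximum.

```python
import numpy as np


def pre_r2(y_pre, group_mean_pre):
    """Pre-period R^2 of the intercept-only DiD fit."""
    diff = y_pre - group_mean_pre
    resid = diff - diff.mean()                 # subtracting b0_hat = centering
    return 1.0 - np.sum(resid ** 2) / np.sum((y_pre - y_pre.mean()) ** 2)


def forward_select_naive(y_pre, donors_pre):
    """Literal forward selection: rebuild every candidate average from scratch."""
    n0 = donors_pre.shape[1]
    order, remaining, r2_path = [], list(range(n0)), []
    evaluated = 0
    while remaining:
        scores = []
        for c in remaining:                    # try each remaining donor
            group_mean = donors_pre[:, order + [c]].mean(axis=1)
            scores.append(pre_r2(y_pre, group_mean))
            evaluated += 1
        best = int(np.argmax(scores))
        r2_path.append(scores[best])
        order.append(remaining.pop(best))      # keep the best donor
    k_star = int(np.argmax(r2_path)) + 1       # size of the best nested group
    return order[:k_star], np.array(r2_path), evaluated


# Simulated pre-period: donors 0 and 1 share the treated unit's trend,
# donors 2 to 5 trend the opposite way.
rng = np.random.default_rng(0)
T0 = 20
f = np.linspace(0.0, 5.0, T0)
donors_pre = np.column_stack([f, f, -f, -f, -f, -f]) + 0.1 * rng.normal(size=(T0, 6))
y_pre = 3.0 + f + 0.1 * rng.normal(size=T0)

selected, r2_path, evaluated = forward_select_naive(y_pre, donors_pre)
print(sorted(selected), evaluated)             # evaluated = 6 * 7 / 2 = 21

# The counts quoted in the text for N0 = 60.
print(60 * 61 // 2, 2 ** 60)
```

With six donors the search scores 21 sub-models, matching \(N_0(N_0+1)/2\), and the selected group stays inside the two donors that share the treated trend.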

How mlsynth computes this: incremental means and a batched \(R^2\)#

Read literally, each step of the algorithm re-forms a subset average from scratch and re-fits a DiD regression for every remaining candidate – an \(O(N)\) rebuild times an \(O(N)\) candidate loop times the per-candidate work, repeated \(N\) times. mlsynth’s forward_did_select() instead collapses each step into a handful of vectorized NumPy operations through three observations.

1. The comparison average is updated incrementally, never rebuilt. Let \(\mathbf{m}^{(k)} \in \mathbb{R}^{T}\) be the running average over the \(k\) already-selected controls. Adding control \(c\) gives the \((k+1)\)-average by a single running-mean update,

\[\mathbf{m}^{(k+1)} = \mathbf{m}^{(k)} + \frac{\mathbf{y}_c - \mathbf{m}^{(k)}}{k + 1},\]

which is \(O(T)\) rather than \(O(kT)\). This is _update_synthetic_control() (current_mean + (control - current_mean) / (k + 1)).

2. All candidate averages for a step are built in one matrix. At step \(k\), let \(\mathbf{Y}_{\mathcal{R}} \in \mathbb{R}^{T_0 \times |\mathcal{R}|}\) stack the pre-period columns of the remaining candidates \(\mathcal{R}\). Every candidate \((k+1)\)-average – one per column – is formed simultaneously by broadcasting the running pre-period mean \(\mathbf{m}^{(k)}_{\mathcal{T}_1}\):

\[\mathbf{M} = \frac{k\,\mathbf{m}^{(k)}_{\mathcal{T}_1}\mathbf{1}^\top + \mathbf{Y}_{\mathcal{R}}}{k + 1} \in \mathbb{R}^{T_0 \times |\mathcal{R}|}.\]

In code this is the one line new_means = (current_mean_pre[:, None] * k + candidates) / (k + 1) inside _select_best_donor().

3. The intercept \(b_0\) drops out, so scoring is pure inner products. This is the step that removes the per-candidate regression entirely. Profiling out \(b_0\) from the DiD loss is exactly centering: the fitted residual for candidate column \(\ell\) is \(\widehat{v}_t = (y_{1t} - \bar y_1) - (M_{t\ell} - \bar M_\ell)\). Writing \(\widetilde{\mathbf{y}} = \mathbf{y}_{1,\mathcal{T}_1} - \bar y_1\) (precomputed once, with its norm \(\|\widetilde{\mathbf{y}}\|_2^2 = \mathrm{ss}_{\text{tot}}\)), the residual sum of squares for all candidates is

\[\mathrm{SSR}_\ell = \mathrm{ss}_{\text{tot}} + \underbrace{\| \mathbf{M}_\ell - \bar M_\ell \|_2^2}_{\text{column SS}} - 2\,\underbrace{\widetilde{\mathbf{y}}^\top (\mathbf{M}_\ell - \bar M_\ell)}_{\text{one matrix--vector product}}, \qquad R^2_\ell = 1 - \frac{\mathrm{SSR}_\ell}{\mathrm{ss}_{\text{tot}}}.\]

The cross term for the whole candidate set is the single matvec \(\widetilde{\mathbf{y}}^\top(\mathbf{M} - \bar{\mathbf{M}})\); the column sums of squares are one reduction. This is _r2_batch() – no candidate is ever regressed, and \(b_0\) is never explicitly solved during the search (it is recovered only once, for the winning group, in did_from_mean()).

Taken together, a forward step costs \(O(T_0 |\mathcal{R}|)\) and the entire search is \(O(T_0 N_0^2)\), with the inner loop expressed as a broadcast, a matrix–vector product, a column reduction, and an argmax – no Python-level loop over candidates and no per-candidate solve. This is what lets the implementation run the selection over \(\sim\)1,500 controls, and what makes the \(M = 5{,}000\) replication Monte Carlo in Verification tractable.
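The three observations can be combined into a compact sketch. This is an illustrative reconstruction from the formulas above rather than mlsynth's actual code: `r2_batch_sketch` and `forward_select_fast` are hypothetical names, and the simulated panel exists only to check the batched scores against a direct per-group computation.

```python
import numpy as np


def r2_batch_sketch(y_pre, current_mean_pre, k, candidates):
    """R^2 of every candidate (k+1)-group at once; candidates has shape (T0, R)."""
    new_means = (current_mean_pre[:, None] * k + candidates) / (k + 1)  # matrix M
    centered = new_means - new_means.mean(axis=0)                       # M - Mbar
    y_tilde = y_pre - y_pre.mean()
    ss_tot = y_tilde @ y_tilde
    ssr = ss_tot + np.sum(centered ** 2, axis=0) - 2.0 * (y_tilde @ centered)
    return 1.0 - ssr / ss_tot


def forward_select_fast(y_pre, donors_pre):
    """Greedy search with an incremental mean and batched scoring."""
    T0, n0 = donors_pre.shape
    remaining, order, r2_path = list(range(n0)), [], []
    current_mean = np.zeros(T0)            # multiplied by k = 0 at the first step
    for k in range(n0):
        scores = r2_batch_sketch(y_pre, current_mean, k, donors_pre[:, remaining])
        best = int(np.argmax(scores))
        chosen = remaining.pop(best)
        # O(T0) running-mean update instead of rebuilding the average
        current_mean = current_mean + (donors_pre[:, chosen] - current_mean) / (k + 1)
        order.append(chosen)
        r2_path.append(float(scores[best]))
    k_star = int(np.argmax(r2_path)) + 1
    return order, k_star, np.array(r2_path)


def direct_r2(y_pre, group_mean_pre):
    """Reference R^2: centre the gap, then compare with the total sum of squares."""
    diff = y_pre - group_mean_pre
    resid = diff - diff.mean()
    return 1.0 - np.sum(resid ** 2) / np.sum((y_pre - y_pre.mean()) ** 2)


rng = np.random.default_rng(1)
T0, n0 = 30, 8
donors_pre = rng.normal(size=(T0, n0)).cumsum(axis=0)
y_pre = 0.5 + donors_pre[:, [2, 5]].mean(axis=1) + 0.05 * rng.normal(size=T0)

order, k_star, r2_path = forward_select_fast(y_pre, donors_pre)
check = np.array([direct_r2(y_pre, donors_pre[:, order[: k + 1]].mean(axis=1))
                  for k in range(n0)])
print(np.allclose(r2_path, check), order[:k_star])
```

The agreement between `r2_path` and `check` is the point of the sketch: the batched inner-product score reproduces the \(R^2\) obtained by forming each group average and centering its gap explicitly.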

Assumptions#

Assumption 1 (Forward DiD parallel trends). There exists a subset \(\mathcal{D} \subseteq \mathcal{N}_0\) and a constant \(b_0\) such that \(y_{1t}^N = b_0 + \bar{y}_{\mathcal{D}, t} + v_t\) for all \(t\), where \(v_t\) is a weakly dependent process with zero mean and finite variance.

Remark. This says the gap between the treated unit and the selected comparison group is stable across the pre- and post-periods up to a mean-zero shock. It is strictly weaker than DiD’s requirement that all controls be parallel: it asks only that some equal-weighted subset be parallel. Both \(y_{1t}^N\) and \(\bar{y}_{\mathcal{D}, t}\) may trend arbitrarily (even non-linearly), so long as their difference is trendless – which is what makes the method valid under non-stationarity.

Assumption 2 (no anticipation / no interference). Controls are untreated throughout, and the treated unit’s outcome equals its untreated potential outcome in the pre-period.

Remark. Standard in the DiD/SC literature. It is what lets the pre-period identify the comparison group: if controls were themselves affected by the intervention (spillover), their pre/post relationship to the treated unit would shift and selection would be biased.

Assumption 3 (regularity for inference). The partial sums of \(v_t\) obey a central limit theorem; errors are weakly dependent with finite long-run variance.

Remark. This is what delivers the asymptotic normality in Proposition 2.1 below, and it holds for the broad class of stationary, weakly-dependent error processes – it does not require \(v_t\) to be i.i.d. or the levels \(y_{1t}^N\) to be stationary.
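Assumption 1 asks for a trendless gap between the treated unit and the comparison average. One informal way to eyeball this on the pre-period is to regress the gap on a time index and look at the slope. The sketch below is an assumption of this page's editor rather than a test proposed by Li; `gap_trend_slope` and the toy series are hypothetical, and a near-zero slope is only suggestive, not a formal test.

```python
import numpy as np


def gap_trend_slope(y_pre, group_mean_pre):
    """OLS slope of the pre-period gap y_1t - ybar_{D,t} on the time index."""
    gap = y_pre - group_mean_pre
    t = np.arange(gap.size)
    slope, _intercept = np.polyfit(t, gap, deg=1)   # highest degree first
    return slope


# Toy series: both treated paths start 2 above the comparison average,
# but the second one trends three times as fast.
t = np.arange(30)
group_mean = 0.5 * t
slope_parallel = gap_trend_slope(2.0 + 0.5 * t, group_mean)
slope_steeper = gap_trend_slope(2.0 + 1.5 * t, group_mean)
print(round(slope_parallel, 6), round(slope_steeper, 6))
```

A slope near zero is what the parallel case produces; a slope clearly different from zero, as in the steeper case, indicates that the level-shift model is leaving a trend in the residuals.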

When not to use Forward DiD

Assumption 1 fails when no subset of controls can track the treated unit – most importantly when the treated unit lies outside the range of the controls (e.g. its outcome trends upward more steeply than every control’s). Equal weights cannot extrapolate beyond the controls, so no selection rescues it. In that regime Li points to methods that let the treated unit sit outside the control hull: the augmented DiD ([ADID]), factor-model / interactive-fixed-effect estimators, or SC with an intercept.

Diagnostic: a side-by-side panel where Forward PTA holds vs. fails#

The pretreatment \(R^2\) returned by FDID is the natural empirical check on Assumption 1. The script below draws two panels with the same underlying common factor and the same true ATT of zero, differing only in the treated unit’s factor loading:

Panel A (treated_loading = 1). The treated unit shares the controls’ single-factor loading. Assumption 1 holds for any subset of the donors.
Panel B (treated_loading = 3). The treated unit trends three times faster than any control. No subset of the equal-weighted donors can extrapolate the steeper trend – Assumption 1 fails.

import numpy as np
import pandas as pd

from mlsynth import FDID


def make_panel(*, treated_loading, n_controls=40, T1=24, T2=12, seed=0):
    """Synthetic panel with one common random-walk factor.

    The treated unit's loading on the factor is ``treated_loading``; every
    control loads with coefficient 1. True ATT = 0 by construction.
    """
    rng = np.random.default_rng(seed)
    T = T1 + T2
    f = np.cumsum(rng.standard_normal(T)) / np.sqrt(T)
    eps_tr = 0.10 * rng.standard_normal(T)
    eps_co = 0.10 * rng.standard_normal((n_controls, T))
    y_tr = 1.0 + treated_loading * f + eps_tr
    y_co = 1.0 + 1.0 * f[None, :] + eps_co
    rows = []
    for t in range(T):
        rows.append({"unit": "treated", "time": t, "y": float(y_tr[t]),
                     "treat": int(t >= T1)})
        for i in range(n_controls):
            rows.append({"unit": f"c{i}", "time": t, "y": float(y_co[i, t]),
                         "treat": 0})
    return pd.DataFrame(rows)


for label, loading in [("Forward PTA holds (loading=1)", 1.0),
                        ("Forward PTA fails (loading=3)", 3.0)]:
    df = make_panel(treated_loading=loading, seed=0)
    res = FDID({"df": df, "outcome": "y", "treat": "treat",
                 "unitid": "unit", "time": "time",
                 "display_graphs": False}).fit()
    print(f"{label:35s}  FDID ATT = {res.fdid.att:+.3f}  "
           f"R^2 = {res.fdid.r_squared:.3f}  "
           f"selected {len(res.fdid.selected_names)} donors")

prints:

Forward PTA holds (loading=1)        FDID ATT = -0.009  R^2 = 0.975  selected 4 donors
Forward PTA fails (loading=3)        FDID ATT = -0.802  R^2 = 0.588  selected 2 donors

Two lessons jump out:

The \(R^2\) is the warning signal. When Forward PTA holds, FDID hits \(R^2 \approx 0.98\) and recovers the true zero ATT to within noise. When it fails, the in-sample fit drops to \(R^2 \approx 0.59\) – a much weaker fit on a panel of the same dimensions. Compare the two against the same threshold you would apply in a forecast exercise (Li’s empirical applications report \(R^2\) of 0.76-0.91 on Atlanta / San Diego / San Jose). When the pre-fit is weak, distrust the post-fit ATT.
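The bias is large and systematic. When Forward PTA fails because the treated unit loads more heavily on the common factor than any subset of controls, FDID’s equal-weighted comparison group under-reacts to the factor, so the post-period counterfactual is too flat and the estimated ATT absorbs the treated unit’s extra factor movement. The estimate is pushed away from the truth (here: \(-0.80\) against a true 0), with a sign that follows the direction in which the factor moves between the pre- and post-periods. A placebo on the pre-period will also be off: the in-sample residuals are systematically wrong when the controls cannot extrapolate the treated unit’s trend.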
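The size and sign of the bias in Panel B can be read off the simulation's data-generating process. The derivation below is a sketch under that design only (treated loading \(\lambda\), unit loading for every control, true effect zero); the symbols \(\lambda\), \(\bar f_{\mathrm{pre}}\), \(\bar f_{\mathrm{post}}\) and \(e_t\) are notation introduced here for illustration.

```latex
% lambda = treated_loading, f_t = common factor,
% e_t = treated noise minus the averaged control noise (mean zero).
\[
y_{1t} - \bar{y}_{\mathcal{D},t} = (\lambda - 1)\, f_t + e_t ,
\qquad
\widehat{b}_0 = (\lambda - 1)\, \bar f_{\mathrm{pre}} + \bar e_{\mathrm{pre}} ,
\]
\[
\widehat{\tau}
  = (\lambda - 1)\,\bigl(\bar f_{\mathrm{post}} - \bar f_{\mathrm{pre}}\bigr)
  + \bigl(\bar e_{\mathrm{post}} - \bar e_{\mathrm{pre}}\bigr).
\]
```

With \(\lambda = 1\) the factor cancels for every subset and only averaged noise remains. With \(\lambda = 3\) the estimate is roughly twice the change in the factor's mean between the two periods, whichever subset is selected, which is why no choice of equal-weighted donors removes it.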

If your application reports an \(R^2\) materially below the threshold you would consider acceptable for a forecast (say, < 0.7), read the ATT estimate as a symptom of misspecification rather than an estimate of the causal effect. Re-running Forward DiD with a different comparison construction is unlikely to recover it; switch instead to one of the methods Li flags for the out-of-hull case: the augmented DiD, a factor-model / interactive-fixed-effects estimator, or synthetic control with an intercept.

Inference#

Because Forward DiD estimates a single parameter, its inference is the textbook DiD inference. Let \(\widehat{\sigma}^2_{\mathcal{D}} = T_0^{-1} \sum_{t \in \mathcal{T}_1} \widehat{v}_t^2\) be the pre-period residual variance on the selected group. Li’s Proposition 2.1 establishes

\[\left| \Pr\!\left( \frac{\sqrt{|\mathcal{T}_2|}\,(\widehat{\tau} - \mathrm{ATT})} {\widehat{\sigma}_{\mathcal{D}}} \le a \right) - \Phi(a) \right| \to 0, \qquad a \in \mathbb{R},\]

as \(T_0, |\mathcal{T}_2| \to \infty\), where \(\Phi\) is the standard-normal CDF. mlsynth reports the finite-sample standard error that also carries the estimation error in \(\widehat{b}_0\):

\[\mathrm{SE}(\widehat{\tau}) = \widehat{\sigma}_{\mathcal{D}} \sqrt{\frac{1}{T_0} + \frac{1}{|\mathcal{T}_2|}},\]

since \(\widehat{\tau} - \mathrm{ATT} = -T_0^{-1} \sum_{\mathcal{T}_1} v_t + |\mathcal{T}_2|^{-1} \sum_{\mathcal{T}_2} v_t\) contributes one \(1/T_0\) and one \(1/|\mathcal{T}_2|\) variance term. This collapses to Proposition 2.1’s \(\widehat{\sigma}_{\mathcal{D}} / \sqrt{|\mathcal{T}_2|}\) when \(T_0 \gg |\mathcal{T}_2|\). The 95% Wald interval and two-sided p-value follow in the usual way.
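The standard error, Wald interval, and p-value can be computed from the pre-period residuals in a few lines. The sketch below follows the formulas in this section using only NumPy and the standard library; `did_inference` and the toy residuals are assumptions for illustration, and mlsynth's own routine may differ in details such as the variance divisor.

```python
import math
from statistics import NormalDist

import numpy as np


def did_inference(resid_pre, att, n_post, level=0.95):
    """SE, Wald confidence interval, and two-sided p-value for the ATT."""
    T0 = resid_pre.size
    sigma = math.sqrt(np.mean(resid_pre ** 2))          # divides by T0, as in the text
    se = sigma * math.sqrt(1.0 / T0 + 1.0 / n_post)     # pre- and post-period terms
    z = NormalDist().inv_cdf(0.5 + level / 2.0)         # about 1.96 for level = 0.95
    ci = (att - z * se, att + z * se)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(att) / se))
    return se, ci, p_value


# Toy inputs (made-up): four pre-period residuals, four post-periods.
resid_pre = np.array([0.1, -0.1, 0.1, -0.1])
se, ci, p_value = did_inference(resid_pre, att=0.2, n_post=4)
print(se, ci, p_value)
```

Dropping the \(1/T_0\) term inside the square root gives the large-\(T_0\) standard error of Proposition 2.1, which is always smaller than the finite-sample version used here.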

Consistency of the selection#

Li also shows the greedy search recovers a valid comparison group. Under Assumption 1 and the appendix’s regularity conditions, with \(N_0\) fixed, the empirical forward selection selects (one of) the same subset(s) the infeasible procedure based on true error variances would select, with probability approaching one as \(T_0 \to \infty\) (Proposition 2.2; Proposition D.1 handles ties). Proposition D.2 extends this to the case where \(N_0\) grows with \(T_0\) under a latent group structure. Intuitively, by the law of large numbers each step’s empirical \(R^2\) converges to its population value, so the greedy path tracks the population-optimal path.

Example#

The block below fits Forward DiD on the Hsiao, Ching and Wan ([HCW]) panel of quarterly real-GDP growth for Hong Kong and 24 comparison economies – the canonical setting for the forward-selected panel-data approach ([fsPDA]) that Forward DiD descends from – with Hong Kong’s economic integration with mainland China as the intervention (44 pre-treatment quarters, 17 post).

import pandas as pd
from mlsynth import FDID

url = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/main/basedata/HongKong.csv"
df = pd.read_csv(url)

res = FDID({"df": df, "outcome": "GDP", "treat": "Integration",
            "unitid": "Country", "time": "Time", "display_graphs": True}).fit()

# Forward DiD fit and the all-controls DiD benchmark, side by side.
print(res.fdid.att, res.fdid.r_squared, res.fdid.selected_names)
print(res.did.att, res.did.r_squared)

Forward DiD keeps a small, regionally sensible subset of Hong Kong’s trading partners rather than averaging all 24 economies, so it tracks Hong Kong’s pre-integration path far more closely (a higher pre-period \(R^2\)) and estimates the post-integration GDP-growth effect more precisely than the all-controls DiD. The exact selected group and the cell-by-cell match to Li’s released output are in FDID — Forward Difference-in-Differences (Li 2024).

res is an FDIDResults: res.fdid and res.did are the two FDIDMethodFit objects, the convenience accessors (res.att, res.att_se, res.counterfactual, res.donor_weights) forward to the Forward DiD fit, and res.att_by_method() / res.ci_by_method() return both side by side.

Verification#

Forward DiD is validated on two fronts. Path A – mlsynth reproduces the author’s public Hong Kong GDP companion replication cell by cell (FDID ATT \(0.0254\), \(53.84\%\), pre-period \(R^2 = 0.843\), 9 of 24 controls). Path B – the paper’s own Monte Carlo (Li 2024, Web Appendix E) reproduces cell by cell (e.g. cell \((48, 24)\) yields \(\mathrm{PMSE} = 0.084\) against the paper’s \(0.082\)), confirming Forward DiD pays only a small efficiency cost when ordinary DiD is valid and wins decisively when half the controls are mismatched. See the dedicated replication page, FDID — Forward Difference-in-Differences (Li 2024), for the full design, code, and cell-by-cell tables.

Core API#

Forward Difference-in-Differences (FDID) estimator.

Implements the forward-selection difference-in-differences method of Li (2023), Frontiers: A Simple Forward Difference-in-Differences Method, Marketing Science. FDID greedily grows the control group one donor at a time, keeping the subset that maximises pre-treatment fit, and reports both the forward-selected estimate (FDID) and the textbook all-donor difference-in-differences benchmark (DID), each with Li (2023) analytical standard errors.

The estimator is a thin orchestration layer over mlsynth.utils.fdid_helpers: it validates configuration, prepares the panel, runs forward selection, assembles a typed FDIDResults, and optionally plots the counterfactuals.

class mlsynth.estimators.fdid.FDID(config: FDIDConfig | dict)#

Bases: object

Forward Difference-in-Differences (FDID) estimator.

Parameters: config (FDIDConfig or dict) – Validated configuration (or a compatible dictionary). See mlsynth.utils.fdid_helpers.config.FDIDConfig for the available fields (df, outcome, treat, unitid, time, display_graphs, save, counterfactual_color, treated_color, verbose).

References

Li, K. T. (2023). Frontiers: A Simple Forward Difference-in-Differences Method. Marketing Science, 43(2), 267-279. https://doi.org/10.1287/mksc.2022.0212

Examples

>>> import pandas as pd
>>> from mlsynth import FDID
>>> url = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/refs/heads/main/basedata/basque_data.csv"
>>> data = pd.read_csv(url)
>>> config = {
...     "df": data,
...     "outcome": data.columns[2],
...     "treat": data.columns[-1],
...     "unitid": data.columns[0],
...     "time": data.columns[1],
...     "display_graphs": False,
... }
>>> results = FDID(config).fit()
>>> round(results.att, 3)

fit() → FDIDResults#

Run forward selection and return the typed FDID results.

Returns:

FDIDResults – Container exposing the forward-selected fdid fit (primary) and the all-donor did benchmark, plus convenience aliases (att, att_se, counterfactual, gap, donor_weights).

Raises:

MlsynthDataError – If panel balancing or data preparation fails.
MlsynthEstimationError – If there are too few pre-periods or forward selection fails.

Configuration#

class mlsynth.utils.fdid_helpers.config.FDIDConfig(*, df: DataFrame, outcome: str, treat: str, unitid: str, time: str, display_graphs: bool = True, save: bool | str = False, counterfactual_color: List[str] = <factory>, treated_color: str = 'black', plot: PlotConfig = <factory>, verbose: bool = True)#

Configuration for the Forward Difference-in-Differences (FDID) estimator. Inherits all common configuration parameters from BaseEstimatorConfig.

Additional Parameters#

verbose (bool, default=True) – Whether to save intermediary Forward Selection results.

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#: Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

verbose: bool#

Result Containers#

FDID.fit() returns an FDIDResults, whose fdid and did fields each hold an FDIDMethodFit (counterfactual, gap, ATT, analytical standard error, 95% CI, p-value, pre-period RMSE and \(R^2\), selected donor names and equal weights, and – for Forward DiD – the \(R^2\) selection path). The prepared panel is exposed as an FDIDInputs.

Frozen dataclasses for the Forward Difference-in-Differences estimator.

FDID (Li 2023, Frontiers: A Simple Forward Difference-in-Differences Method, Marketing Science) builds the control group for a single treated unit by forward selection: it greedily adds the donor that most improves pre-treatment fit (R^2 between the treated unit and the running donor average), tracks the R^2 path, and keeps the subset that maximises it. The synthetic control is the simple average of the selected donors, with a difference-in-differences intercept.

Two estimates are always returned side by side:

FDID – the forward-selected difference-in-differences (best donor subset).
DID – the textbook two-way difference-in-differences using all donors (the average of every control unit). This is the natural benchmark the forward search improves upon.

Both carry Li (2023) analytical standard errors. The three layers below (inputs, per-method fit, top-level results) mirror the CLUSTERSC / PROXIMAL container design used elsewhere in mlsynth.

class mlsynth.utils.fdid_helpers.structures.FDIDInputs(y: ndarray, donor_matrix: ndarray, pre_periods: int, post_periods: int, T: int, donor_names: Sequence, time_labels: ndarray, treated_unit_name: Any, verbose: bool = True, prepped: Dict[str, Any] = <factory>)#

Bases: object

Preprocessed panel data for the FDID pipeline.

Parameters:

y (np.ndarray) – Treated-unit outcome over all T periods, shape (T,).
donor_matrix (np.ndarray) – Donor outcomes, shape (T, n_donors).
pre_periods (int) – Number of pre-treatment periods T0.
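post_periods (int) – Number of post-treatment periods T1 = T - T0 (these docstrings write T1 for the post-period count, i.e. \(|\mathcal{T}_2|\) in the notation above, not Li’s pre-period \(T_1\)).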
T (int) – Total number of periods.
donor_names (Sequence) – Length-n_donors donor labels (column order of donor_matrix).
time_labels (np.ndarray) – Length-T time labels.
treated_unit_name (Any) – Identifier of the treated unit.
verbose (bool) – Whether the forward-selection path is recorded step by step.
prepped (dict) – The raw mlsynth.utils.datautils.dataprep() dictionary, kept so the plotter can reuse the prepared panel.

T: int#

donor_matrix: ndarray#

donor_names: Sequence#

property n_donors: int#: Number of donor units.

post_periods: int#

pre_periods: int#

prepped: Dict[str, Any]#

time_labels: ndarray#

treated_unit_name: Any#

verbose: bool = True#

y: ndarray#

class mlsynth.utils.fdid_helpers.structures.FDIDMethodFit(name: str, counterfactual: ndarray, gap: ndarray, att: float, att_se: float, att_percent: float, satt: float, pre_rmse: float, r_squared: float, intercept: float, p_value: float, ci: Tuple[float, float], selected_indices: List[int], selected_names: List[Any], donor_weights: Dict[Any, float], r2_path: ndarray | None = None, intermediary: list | None = None, metadata: Dict[str, Any] = <factory>)#

Bases: object

Single FDID/DID fit output.

Parameters:

name (str) – Method identifier ("FDID" or "DID").
counterfactual (np.ndarray) – Estimated counterfactual outcome path, shape (T,).
gap (np.ndarray) – Observed treated minus counterfactual, shape (T,).
att (float) – Mean post-treatment treatment effect.
att_se (float) – Li (2023) analytical standard error of the ATT.
att_percent (float) – ATT as a percentage of the post-period counterfactual mean.
satt (float) – Standardised ATT (att / se * sqrt(T1)).
pre_rmse (float) – Root-mean-squared pre-treatment fit error.
r_squared (float) – Pre-treatment R^2 of the difference-in-differences fit.
intercept (float) – Difference-in-differences intercept (treated minus donor pre-period mean).
p_value (float) – Two-sided p-value for the ATT.
ci (tuple of float) – (lower, upper) 95% confidence interval for the ATT.
selected_indices (list of int) – Column indices of the donors retained (all donors for DID).
selected_names (list) – Donor labels corresponding to selected_indices.
donor_weights (dict) – Mapping {donor_name: weight} (equal weights over the selected donors).
r2_path (np.ndarray or None) – R^2 after each forward-selection step (FDID only; None for DID).
intermediary (list or None) – Per-step diagnostics when verbose (FDID only).
metadata (dict) – Free-form per-method diagnostics.

att: float#

att_percent: float#

att_se: float#

ci: Tuple[float, float]#

counterfactual: ndarray#

donor_weights: Dict[Any, float]#

gap: ndarray#

intercept: float#

intermediary: list | None = None#

metadata: Dict[str, Any]#

name: str#

p_value: float#

pre_rmse: float#

r2_path: ndarray | None = None#

r_squared: float#

satt: float#

selected_indices: List[int]#

selected_names: List[Any]#

class mlsynth.utils.fdid_helpers.structures.FDIDResults(*, effects: EffectsResults | None = None, fit_diagnostics: FitDiagnosticsResults | None = None, time_series: TimeSeriesResults | None = None, weights: WeightsResults | None = None, inference: InferenceResults | None = None, method_details: MethodDetailsResults | None = None, sub_method_results: Dict[str, Any] | None = None, additional_outputs: Dict[str, Any] | None = None, raw_results: Dict[str, Any] | None = None, execution_summary: Dict[str, Any] | None = None, plot_config: PlotConfig | None = None, inputs: FDIDInputs, fdid: FDIDMethodFit, did: FDIDMethodFit, selected_variant: str = 'FDID', metadata: Dict[str, Any] = <factory>)#

Bases: BaseEstimatorResults

Top-level container returned by mlsynth.FDID.fit().

An EffectResult (the observational report): in addition to the FDID-specific fields below, it exposes the standardized sub-models (effects, time_series, weights, inference, fit_diagnostics, method_details) – derived from the selected variant – and the flat accessors att / att_ci / counterfactual / gap / donor_weights / pre_rmse.

Parameters:

inputs (FDIDInputs) – Preprocessed panel.
fdid (FDIDMethodFit) – Forward-selected difference-in-differences fit (primary).
did (FDIDMethodFit) – Textbook difference-in-differences using all donors.
selected_variant (str) – Which fit is exposed via the convenience aliases att, att_se, counterfactual, gap, donor_weights – "FDID" or "DID". Defaults to "FDID".
metadata (dict) – Free-form pipeline diagnostics.

property att: float#: ATT of the primary variant.

att_by_method() → Dict[str, float]#: {method: ATT} for both fits.

property att_se: float#: ATT standard error of the primary variant.

ci_by_method() → Dict[str, Tuple[float, float]]#: {method: (lower, upper)} confidence intervals for both fits.

property counterfactual: ndarray#: Counterfactual of the primary variant.

did: FDIDMethodFit#

property donor_weights: Dict[Any, float]#: Donor weights of the primary variant.

fdid: FDIDMethodFit#

property gap: ndarray#: Gap of the primary variant.

inputs: FDIDInputs#

metadata: Dict[str, Any]#

property methods: Dict[str, FDIDMethodFit]#: {method_name: fit} for both fits, FDID first.

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'frozen': True, 'json_encoders': {<class 'numpy.ndarray'>: <function BaseEstimatorResults.Config.<lambda>>}}#: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

property pre_rmse: float#: Pre-treatment RMSE of the primary variant.

se_by_method() → Dict[str, float]#: {method: ATT standard error} for both fits.

selected_variant: str#

Helper Modules#

Data preparation – balances the panel, pivots it, validates the pre-period count, and packs everything into the typed FDIDInputs.

Data preparation for the Forward Difference-in-Differences estimator.

mlsynth.utils.fdid_helpers.setup.prepare_fdid_inputs(df: DataFrame, outcome: str, treat: str, unitid: str, time: str, verbose: bool = True) → FDIDInputs#

Balance the panel, pivot it, and package it into FDIDInputs.

Parameters:

df (pd.DataFrame) – Long panel with outcome, treatment, unit, and time columns.
outcome, treat, unitid, time (str) – Column names identifying the outcome, treatment indicator, unit, and time period.
verbose (bool, default True) – Whether the forward-selection path should be recorded step by step.

Returns:

FDIDInputs – Preprocessed panel ready for forward selection.

Raises:

MlsynthDataError – If panel balancing or data preparation fails (e.g. no donor units).
MlsynthEstimationError – If fewer than two pre-treatment periods are available.
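The preparation steps described above (pivot the long panel, check balance, locate the treated unit, count pre-treatment periods) can be illustrated with a small pandas sketch. This is not the library's implementation: the function name pivot_panel_sketch, the returned tuple, and the use of ValueError in place of MlsynthDataError / MlsynthEstimationError are all assumptions made for illustration, as is the restriction to a single treated unit.

```python
import pandas as pd

def pivot_panel_sketch(df, outcome, treat, unitid, time):
    """Illustrative stand-in for the data-preparation step (hypothetical helper)."""
    # Long -> wide: rows are time periods, columns are units.
    wide = df.pivot(index=time, columns=unitid, values=outcome).sort_index()
    if wide.isna().any().any():
        raise ValueError("panel is unbalanced")

    # The treated unit is the one whose treatment indicator is ever 1.
    ever_treated = df.groupby(unitid)[treat].max()
    treated_units = ever_treated[ever_treated == 1].index.tolist()
    if len(treated_units) != 1:
        raise ValueError("this sketch assumes exactly one treated unit")
    treated_unit = treated_units[0]

    # Pre-treatment periods are those where the treated unit is still untreated.
    treated_rows = df.loc[df[unitid] == treated_unit]
    pre_periods = int((treated_rows[treat] == 0).sum())
    if pre_periods < 2:
        raise ValueError("need at least two pre-treatment periods")

    donors = [u for u in wide.columns if u != treated_unit]
    if not donors:
        raise ValueError("no donor units")
    return wide[treated_unit].to_numpy(), wide[donors].to_numpy(), pre_periods, donors

# Toy panel: unit "A" is treated from period 3 onward.
rows = [
    {"unit": u, "time": t, "y": float(t + ord(u)), "treat": int(u == "A" and t >= 3)}
    for u in ["A", "B", "C"]
    for t in range(1, 5)
]
panel = pd.DataFrame(rows)
y, X, T0, donor_names = pivot_panel_sketch(panel, "y", "treat", "unit", "time")
```

The control matrix comes back with shape (T, N), the orientation that forward_did_select documents for control_outcomes.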

The forward-selection core and the difference-in-differences arithmetic. The public entry points are forward_did_select (the vectorized greedy search) and did_from_mean (the DiD fit for a fixed comparison group); the private helpers documented below are the incremental-mean and batched \(R^2\) primitives described in How mlsynth computes this.

Forward-selection and difference-in-differences estimation for FDID.

This module holds the heavy numerical core of the Forward Difference-in-Differences estimator of Li (2023):

forward_did_select() – the vectorised forward-selection loop that greedily adds the donor most improving pre-treatment R^2, tracks the R^2 path, and returns the optimal donor subset alongside the textbook all-donor difference-in-differences benchmark.
did_from_mean() – the difference-in-differences estimate for a given donor average (ATT, fit, analytical inference, and vectors).

Both previously lived in the shared selector_helpers grab-bag and the legacy estutils module; they are FDID-specific and now live with the rest of the FDID pipeline.

mlsynth.utils.fdid_helpers.estimation._choose_optimal_subset(selected: List[int], R2_path: ndarray) → Tuple[List[int], ndarray]#: Keep the donor prefix up to (and including) the R^2-maximising step.

mlsynth.utils.fdid_helpers.estimation._compute_fdid_result(treated_outcome: ndarray, control_outcomes: ndarray, optimal_idxs: List[int], pre_periods: int, R2_path: ndarray, donor_names: List[Any]) → Dict[str, Any]#: Difference-in-differences result for the selected donor subset.

mlsynth.utils.fdid_helpers.estimation._r2_batch(y_c: ndarray, ss_tot: float, X_pre: ndarray) → ndarray#

Pre-treatment R^2 of each candidate donor average vs the treated unit.

Parameters:

y_c (np.ndarray) – Centred treated pre-treatment vector (y - mean(y)).
ss_tot (float) – Total sum of squares of y_c.
X_pre (np.ndarray) – Candidate pre-treatment vectors, shape (T0, N).

Returns:

np.ndarray – R^2 for each candidate, shape (N,).
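A batched R^2 of this kind might look like the NumPy sketch below. It assumes the R^2 is that of the one-parameter DiD fit (candidate average plus an intercept), so centring both series absorbs the intercept and the residual is just the difference of the centred vectors. The name r2_batch_sketch is illustrative and is not the library's private helper.

```python
import numpy as np

def r2_batch_sketch(y_c: np.ndarray, ss_tot: float, X_pre: np.ndarray) -> np.ndarray:
    """R^2 of the intercept-only DiD fit for every candidate column at once.

    Assumption: the fit is y_t = b0 + x_t + e_t, so after centring both
    series the residual is y_c - x_c.
    """
    X_c = X_pre - X_pre.mean(axis=0, keepdims=True)  # centre each candidate
    resid = y_c[:, None] - X_c                       # (T0, N) residual matrix
    ss_res = np.einsum("tn,tn->n", resid, resid)     # column-wise sum of squares
    return 1.0 - ss_res / ss_tot

# Candidate 0 is the treated series shifted by a constant (a perfect parallel
# trend); candidate 1 is an unrelated series.
y = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
X_pre = np.column_stack([y + 10.0, np.array([5.0, 1.0, 0.0, 4.0, 2.0])])
y_c = y - y.mean()
r2 = r2_batch_sketch(y_c, float(y_c @ y_c), X_pre)
```

Because the slope on the candidate is fixed at one rather than estimated, this R^2 can be negative for a badly mismatched candidate, which is what lets it discriminate between donors.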

mlsynth.utils.fdid_helpers.estimation._record_verbose_step(intermediary_results: list, it: int, best_idx: int, best_r2: float, r2_cand: ndarray, selected: List[int], donor_names: List[Any], current_mean_pre: ndarray, k: int) → None#: Append one forward-selection step to the verbose diagnostics log.

mlsynth.utils.fdid_helpers.estimation._select_best_donor(X_pre: ndarray, current_mean_pre: ndarray, k: int, remaining_idx: ndarray, y_c: ndarray, ss_tot: float) → Tuple[int, float, ndarray]#: Pick the remaining donor whose addition maximises pre-period R^2.

mlsynth.utils.fdid_helpers.estimation._update_synthetic_control(current_mean: ndarray, control_outcomes: ndarray, best_idx: int, k: int) → ndarray#: Incrementally fold a newly selected donor into the running average.
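The incremental update can be sketched as follows, assuming k counts the donors already in the running average (the library's own convention for k is not stated here). The helper name fold_in_donor and the selection order are hypothetical.

```python
import numpy as np

def fold_in_donor(current_mean: np.ndarray, new_donor: np.ndarray, k: int) -> np.ndarray:
    """Update the average of k donors to the average of k + 1 donors in O(T)."""
    return (k * current_mean + new_donor) / (k + 1)

rng = np.random.default_rng(0)
controls = rng.normal(size=(8, 5))   # (T, N) toy control matrix
order = [3, 0, 4]                    # hypothetical selection order

running = np.zeros(controls.shape[0])
for k, idx in enumerate(order):      # k = number of donors already averaged
    running = fold_in_donor(running, controls[:, idx], k)

direct = controls[:, order].mean(axis=1)  # recomputed from scratch for comparison
```

The point of the update is cost: each step touches one column instead of re-averaging the whole selected set.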

mlsynth.utils.fdid_helpers.estimation.did_from_mean(treated: ndarray, mean_ctrl: ndarray, pre_periods: int) → Dict[str, Any]#

Difference-in-differences estimate from a pre-computed donor average.

Parameters:

treated (np.ndarray) – Treated-unit outcome vector, shape (T,).
mean_ctrl (np.ndarray) – Average outcome of the selected donor pool, shape (T,).
pre_periods (int) – Number of pre-treatment periods T0.

Returns:

dict – Structured result with Effects, Fit, Inference, and Vectors blocks.
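For orientation, the arithmetic behind a DiD fit from a fixed donor average can be sketched in a few lines. The flat dictionary keys below are illustrative only; the library returns nested Effects, Fit, Inference, and Vectors blocks whose exact layout is not reproduced here.

```python
import numpy as np

def did_from_mean_sketch(treated: np.ndarray, mean_ctrl: np.ndarray, pre_periods: int) -> dict:
    """DiD with a single estimated parameter: the pre-period intercept b0."""
    T0 = pre_periods
    b0 = float(np.mean(treated[:T0] - mean_ctrl[:T0]))  # pre-period level gap
    counterfactual = b0 + mean_ctrl                     # shifted donor average
    gap = treated - counterfactual
    y_pre = treated[:T0]
    ss_res = float(np.sum(gap[:T0] ** 2))
    ss_tot = float(np.sum((y_pre - y_pre.mean()) ** 2))
    return {
        "att": float(np.mean(gap[T0:])),                # post-period average gap
        "b0": b0,
        "r_squared": 1.0 - ss_res / ss_tot,
        "pre_rmse": float(np.sqrt(ss_res / T0)),
        "counterfactual": counterfactual,
        "gap": gap,
    }

# Toy series: treated sits 2 above the donor average, then gains 3 after T0 = 4.
mean_ctrl = np.arange(6.0)
treated = mean_ctrl + 2.0
treated[4:] += 3.0
fit = did_from_mean_sketch(treated, mean_ctrl, pre_periods=4)
```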

mlsynth.utils.fdid_helpers.estimation.forward_did_select(treated_outcome: ndarray, control_outcomes: ndarray, pre_periods: int, donor_names: List[Any], verbose: bool = False) → Dict[str, Any]#

Run Li (2023) forward-selected difference-in-differences.

Sequentially adds the control unit that most improves pre-treatment fit (R^2) with the treated unit, tracks the path of R^2 values, and returns both the textbook all-donor DID and the optimal FDID estimate.

Parameters:

treated_outcome (np.ndarray) – Treated-unit outcome vector, shape (T,).
control_outcomes (np.ndarray) – Outcome matrix for all potential control units, shape (T, N).
pre_periods (int) – Number of pre-treatment periods T0.
donor_names (list) – Donor labels; length must equal N.
verbose (bool, default False) – If True, attach per-step diagnostics under "intermediary".

Returns:

dict – {"DID": <all-donor result>, "FDID": <forward-selected result>}.
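The greedy search itself can be sketched compactly. The version below is a simplified illustration, not the library routine: it works on pre-period data only, returns just the selected subset and the R^2 path, and assumes the R^2 of the intercept-only DiD fit as the selection criterion. The function name and toy data are hypothetical.

```python
import numpy as np

def forward_select_sketch(y_pre: np.ndarray, X_pre: np.ndarray):
    """Greedy forward selection on pre-period DiD R^2; returns (subset, R^2 path)."""
    T0, N = X_pre.shape
    y_c = y_pre - y_pre.mean()
    ss_tot = float(y_c @ y_c)
    remaining = list(range(N))
    selected, r2_path = [], []
    running_sum = np.zeros(T0)
    for k in range(N):
        # Average that would result from adding each remaining donor next.
        cand = (running_sum[:, None] + X_pre[:, remaining]) / (k + 1)
        cand_c = cand - cand.mean(axis=0, keepdims=True)
        r2 = 1.0 - ((y_c[:, None] - cand_c) ** 2).sum(axis=0) / ss_tot
        j = int(np.argmax(r2))
        best = remaining.pop(j)
        selected.append(best)
        r2_path.append(float(r2[j]))
        running_sum += X_pre[:, best]
    # Keep the prefix up to and including the R^2-maximising step.
    k_star = int(np.argmax(r2_path))
    return selected[: k_star + 1], np.array(r2_path)

# Donors 0 and 1 share the treated unit's trend; donors 2 and 3 do not.
t = np.arange(24, dtype=float)
f = 0.5 * t + 2.0 * np.sin(t / 2.0)
noise = np.random.default_rng(0).normal(scale=0.1, size=(24, 3))
y_pre = 1.0 + f + noise[:, 0]
X_pre = np.column_stack([f + noise[:, 1], 2.0 + f + noise[:, 2], -2.0 * f, 5.0 + 3.0 * f])
subset, r2_path = forward_select_sketch(y_pre, X_pre)
```

In this toy panel the mismatched donors enter only at the third and fourth steps, where the R^2 path drops, so the retained prefix contains only well-matched donors.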

References

Li, K. T. (2023). Frontiers: A Simple Forward Difference-in-Differences Method. Marketing Science, 43(2), 267-279. https://doi.org/10.1287/mksc.2022.0212

The Li (2023) analytical standard error, confidence interval, and p-value.

Analytical inference for Forward Difference-in-Differences (Li 2023).

Li (2023) derives a closed-form variance for the difference-in-differences ATT estimator. Writing the pre-treatment residuals of the treated unit against its difference-in-differences fit as e_t, the post-period average treatment effect has asymptotic variance

Var(ATT) = (omega_1 + omega_2) / T1,

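where T0 and T1 are the pre- and post-treatment period counts, omega_2 = mean(e_t^2) is the pre-period residual variance, and omega_1 = (T1 / T0) * omega_2 rescales it by the ratio of post- to pre-period sample sizes. Substituting gives Var(ATT) = omega_2 * (1 / T0 + 1 / T1), so the variance shrinks as either window grows. The standard error is the square root of this quantity.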

mlsynth.utils.fdid_helpers.inference.did_inference(att: float, pre_residuals: ndarray, pre_periods: int, post_periods: int) → Tuple[float, Tuple[float, float], float, float]#

Compute the Li (2023) analytical SE, 95% CI, p-value, and SATT.

Parameters:

att (float) – Estimated average treatment effect on the treated.
pre_residuals (np.ndarray) – Pre-treatment residuals of the treated unit against its difference-in-differences fit, shape (T0,).
pre_periods (int) – Number of pre-treatment periods T0.
post_periods (int) – Number of post-treatment periods T1.

Returns:

se (float) – Analytical standard error of the ATT (nan if undefined).
ci (tuple of float) – (lower, upper) 95% confidence interval.
p_value (float) – Two-sided p-value for the ATT.
satt (float) – Standardised ATT (att / se * sqrt(T1)).
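The variance formula above translates directly into code. The sketch below covers the standard error, confidence interval, and p-value only, and assumes a standard normal reference distribution for the latter two; the function name is illustrative and the library's own routine may differ in details such as its handling of degenerate inputs.

```python
import math
import numpy as np

def did_inference_sketch(att: float, pre_residuals: np.ndarray,
                         pre_periods: int, post_periods: int):
    """Analytical SE, 95% CI, and two-sided p-value (normal approximation assumed)."""
    omega_2 = float(np.mean(pre_residuals ** 2))      # pre-period residual variance
    omega_1 = (post_periods / pre_periods) * omega_2
    se = math.sqrt((omega_1 + omega_2) / post_periods)
    if se == 0.0 or not math.isfinite(se):
        return float("nan"), (float("nan"), float("nan")), float("nan")
    z = 1.959963984540054                             # 97.5% standard normal quantile
    ci = (att - z * se, att + z * se)
    # Two-sided p-value: 2 * (1 - Phi(|att / se|)) = erfc(|att / se| / sqrt(2)).
    p_value = math.erfc(abs(att) / (se * math.sqrt(2.0)))
    return se, ci, p_value

# Toy input: residual variance 1, T0 = 4, T1 = 2, so Var = 1 * (1/4 + 1/2) = 0.75.
se, ci, p_value = did_inference_sketch(
    att=2.0, pre_residuals=np.array([1.0, -1.0, 1.0, -1.0]),
    pre_periods=4, post_periods=2,
)
```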

Assembly of the raw selection output into the typed result containers.

Assemble typed FDID results from the raw estimation dictionaries.

mlsynth.utils.fdid_helpers.results_assembly.assemble_fdid_results(selector_output: Dict[str, Dict[str, Any]], inputs: FDIDInputs) → FDIDResults#

Build the typed FDIDResults container.

Parameters:

selector_output (dict) – {"DID": ..., "FDID": ...} as returned by mlsynth.utils.fdid_helpers.estimation.forward_did_select().
inputs (FDIDInputs) – Preprocessed panel.

Returns:

FDIDResults – Container exposing the FDID (primary) and DID fits.

The observed-versus-counterfactual overlay plot for the FDID and DID fits.

Plotting wrapper for the Forward Difference-in-Differences estimator.

mlsynth.utils.fdid_helpers.plotter.plot_fdid(results: FDIDResults, *, time: str, unitid: str, outcome: str, treat: str, treated_color: str, counterfactual_color: str | List[str], save: bool | dict) → None#

Plot observed vs FDID and DID counterfactuals.

Plotting failures are downgraded to warnings so a rendering problem never masks a successful estimation.

The Web Appendix E Monte Carlo DGPs (DGP1-DGP4), packaged as simulate_fdid_sample() so the replication in Verification runs as a one-liner.

Web Appendix E Monte Carlo DGPs for the Forward DiD method.

Implements the four data-generating processes from Li (2023), Web Appendix E. Each draw produces one treated unit and N controls over T1 + T2 periods (T1 pre-treatment, T2 post-treatment), generated from three common factors:

\[\begin{split}f_{1t} &= 0.8 f_{1,t-1} + v_{1t}, \\ f_{2t} &= -0.6 f_{2,t-1} + v_{2t} + 0.8 v_{2,t-1}, \\ f_{3t} &= v_{3t} + 0.9 v_{3,t-1} + 0.4 v_{3,t-2},\end{split}\]

with \(v_{kt} \sim \mathcal{N}(0, 1)\) and outcomes

\[\begin{split}y_{tr,t} &= a_0 + c_0 \mathbf{1}' f_t + \varepsilon_{tr,t}, \\ y_{it} &= 1 + c_1 \mathbf{1}' f_t + \varepsilon_{it} \quad i \le N/2, \\ y_{it} &= 1 + c_2 \mathbf{1}' f_t + \varepsilon_{it} \quad i > N/2,\end{split}\]

where \(\varepsilon_{it} \sim \mathcal{N}(0, 1)\). The four DGPs vary \((a_0, c_0, c_1, c_2)\):

DGP  (a_0, c_0, c_1, c_2)  Description
1    (1, 1, 1, 1)          all controls match (DiD is applicable)
2    (1, 1, 1, 2)          half the controls have mismatched loadings
3    (2, 1, 1, 1)          treated has a different intercept
4    (2, 1, 1, 2)          intercept and half-mismatched loadings

True ATT is zero in every DGP (matching the paper’s PMSE convention; the PMSE is invariant to a constant treatment effect).

Note

The appendix prints f_2t = -0.6 f_{1,t-1} + ... for the lag term, but the Monte Carlo numbers in Li’s Table 5 match the alternative reading -0.6 f_{2,t-1} (ARMA(1,1) on \(f_2\) itself). The latter is used here — it reproduces the paper’s DID PMSE values closely (within ~3%) while the literal reading reproduces only the FDID column.
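The factor and outcome equations above can be simulated in a few lines of NumPy. The sketch below is an independent illustration, not the packaged simulate_fdid_sample(): the function names, the 50-period burn-in used to initialise the factors, and the returned tuple are assumptions. It uses the -0.6 f_{2,t-1} reading discussed in the note.

```python
import numpy as np

def simulate_factors_sketch(T: int, rng: np.random.Generator, burn: int = 50) -> np.ndarray:
    """Three common factors, shape (T, 3); the burn-in length is an assumption."""
    n = T + burn + 2                       # two extra rows supply the MA lags
    v = rng.standard_normal((n, 3))
    f = np.zeros((n, 3))
    for t in range(2, n):
        f[t, 0] = 0.8 * f[t - 1, 0] + v[t, 0]                        # AR(1)
        f[t, 1] = -0.6 * f[t - 1, 1] + v[t, 1] + 0.8 * v[t - 1, 1]   # ARMA(1,1)
        f[t, 2] = v[t, 2] + 0.9 * v[t - 1, 2] + 0.4 * v[t - 2, 2]    # MA(2)
    return f[-T:]

def simulate_dgp_sketch(dgp: int, N: int = 60, T1: int = 24, T2: int = 12, rng=None):
    """One draw: treated path of shape (T,) and controls of shape (N, T)."""
    params = {1: (1, 1, 1, 1), 2: (1, 1, 1, 2), 3: (2, 1, 1, 1), 4: (2, 1, 1, 2)}
    a0, c0, c1, c2 = params[dgp]
    if rng is None:
        rng = np.random.default_rng()
    T = T1 + T2
    common = simulate_factors_sketch(T, rng).sum(axis=1)       # 1' f_t
    y_treated = a0 + c0 * common + rng.standard_normal(T)
    loadings = np.where(np.arange(N) < N // 2, c1, c2)         # first half c1, rest c2
    y_controls = 1.0 + loadings[:, None] * common[None, :] + rng.standard_normal((N, T))
    return y_treated, y_controls

y_tr, Y_c = simulate_dgp_sketch(2, rng=np.random.default_rng(1))
y_tr_again, Y_c_again = simulate_dgp_sketch(2, rng=np.random.default_rng(1))
```

No treatment effect is added after T1, consistent with the true ATT of zero stated above.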

class mlsynth.utils.fdid_helpers.simulation.FDIDSample(df: DataFrame, Y_treated: ndarray, Y_controls: ndarray, T1: int, T2: int, dgp: int)#

One draw from a Web Appendix E DGP.

Attributes:

df (pd.DataFrame) – Long panel with columns unit / time / y / treat ready to feed to mlsynth.FDID.
Y_treated (np.ndarray) – Treated-unit outcome path, shape (T,).
Y_controls (np.ndarray) – Control outcomes, shape (N, T). Rows 0..N//2-1 carry loading c_1; rows N//2..N-1 carry loading c_2.
T1, T2 (int) – Pre- / post-treatment period counts.
dgp (int) – Which of the four DGPs was drawn.

T1: int#

T2: int#

Y_controls: ndarray#

Y_treated: ndarray#

df: DataFrame#

dgp: int#

mlsynth.utils.fdid_helpers.simulation.simulate_fdid_sample(dgp: int, N: int = 60, T1: int = 24, T2: int = 12, rng: Generator | None = None) → FDIDSample#

Draw one sample from FDID Web Appendix E DGP dgp (1-4).

Parameters:

dgp (int) – Which DGP to draw (1, 2, 3, or 4).
N (int, default 60) – Number of control units (the paper uses N = 60).
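T1, T2 (int, default 24 and 12) – Pre- and post-treatment period counts. This module follows the T1 / T2 naming, whereas the inference helpers call the same two quantities T0 and T1.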
rng (np.random.Generator, optional) – NumPy RNG. Defaults to np.random.default_rng().

Returns:

FDIDSample
