Forward Difference-in-Differences (FDID)#
When to Use This Estimator#
Difference-in-differences (DiD) is the workhorse of quasi-experimental causal inference, but it rests on a parallel-trends assumption: the treated unit’s untreated outcome would have moved in lockstep with the average of the controls. With a large, heterogeneous pool of candidate controls that assumption is rarely credible for the pool as a whole – most of the controls are simply the wrong comparison. The usual escape hatches each have a catch:
Plain DiD uses every control with equal weight. One badly mismatched control contaminates the average, and its bias does not shrink as the panel grows.
Synthetic control (SC, [ABADIE2010]) weights the controls on the simplex, but is justified only as the pre-period grows without bound, and has no inference theory when the data are non-stationary of unknown form – exactly the regime of most marketing and macro panels.
The panel-data approach of Hsiao, Ching and Wan ([HCW]) and its forward-selected variant ([fsPDA]) fit an unrestricted regression on the controls. When the number of controls \(N\) exceeds the number of pre-treatment periods \(T_1\) – common in store/geo studies – they overfit in-sample and predict poorly out-of-sample.
Forward DiD (Li [Li2024]) targets precisely this regime: many candidate controls, a short-to-moderate pre-period, and a need for valid inference under non-stationarity. It keeps DiD’s transparency – an equal-weighted comparison group plus a single intercept – but chooses which controls enter the comparison by a greedy forward search on pre-treatment fit. Because only one parameter is ever estimated (the DiD intercept \(\alpha\)), no matter how many controls are selected, overfitting is impossible and the textbook DiD standard error applies. Its advantages, in Li’s own summary:
It is a flexible drop-in for DiD, usable even when DiD’s all-controls parallel trend is too restrictive.
It accommodates any number of controls, including \(N > T_1\).
There are no overfitting concerns – one parameter after selection.
It is computationally cheap: a greedy \(O(N^2)\) search rather than the \(2^N\) subsets of the optimal procedure.
It has inference theory valid for stationary and non-stationary data, which SC and HCW lack.
Forward Selection vs. Matching and Weighting#
Every synthetic-control-family estimator answers the same question – what comparison reproduces the treated unit’s untreated path? – but each makes a different structural bet about the comparison. Forward DiD’s bet is distinctive and worth stating plainly:
A *subset* of the controls shares the treated unit’s trend; find that subset and average it with equal weights.
It does not try to weight all controls cleverly (SC), nor regress on all of them (HCW), nor trust all of them (DiD). It selects. The selection is greedy: add, one at a time, the control that most improves pre-treatment fit, trace the fit as the subset grows, and keep the subset that fits best.
| | DiD | Synthetic control | Forward DiD |
|---|---|---|---|
| Comparison | all controls, equal weight | all controls, simplex weights | a selected subset, equal weight |
| Free parameters | 1 (intercept) | \(N\) weights | 1 (intercept) – after selection |
| Overfitting risk | none | controlled by the simplex | none – one parameter |
| Key assumption | all controls are parallel | treated is in the convex hull | some subset is parallel |
| Inference under non-stationarity | standard | none available | standard (Prop. 2.1) |
The equal weights are the crux. Because the selected controls enter through a single average – not \(|\mathcal{D}|\) separate coefficients – adding more controls cannot increase the model’s degrees of freedom. This is an implicit regularization that buys both the overfitting immunity and the clean, DiD-style inference theory. Forward DiD is therefore best read as DiD with a principled, data-driven choice of comparison group, not as a new weighting scheme.
Notation#
Index the units by \(j\), with \(j = 0\) the single treated unit and \(\mathcal{N} = \{1, \ldots, N\}\) the control units. A selected subset \(\mathcal{D} \subseteq \mathcal{N}\) is the comparison group; write its equal-weighted average outcome as
\[\bar{y}_{\mathcal{D}, t} = \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} y_{it}.\]
Time runs over \(t \in \{1, \ldots, T\}\); the intervention starts at \(T_1 + 1\), giving a pre-period \(\mathcal{T}_1 = \{1, \ldots, T_1\}\) and a post-period \(\mathcal{T}_2 = \{T_1 + 1, \ldots, T\}\) of length \(T_2 = T - T_1\). Potential outcomes are \(y^0_{jt}\) (untreated) and \(y^1_{jt}\) (treated); we observe \(y_{0t} = y^0_{0t}\) for \(t \in \mathcal{T}_1\) and \(y_{0t} = y^1_{0t}\) for \(t \in \mathcal{T}_2\). The estimand is the average treatment effect on the treated,
\[\mathrm{ATT} = \frac{1}{T_2} \sum_{t \in \mathcal{T}_2} \left( y^1_{0t} - y^0_{0t} \right).\]
Notation bridge
Li [Li2024] writes the treated outcome \(y_{tr,t}\), the selected control set \(\mathcal{N}_{co}\) with size \(N_{co}\), the control average \(\bar{y}_{\mathcal{N}_{co}, t}\), the intercept \(\alpha\), and \(T_1\) / \(T_2\) for the pre/post counts (treatment at \(T_1 + 1\)). We keep \(j = 0\) for the treated unit, \(\mathcal{D}\) for the selected comparison group, and \(\bar{y}_{\mathcal{D}, t}\) for its average.
Mathematical Formulation#
The DiD model#
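For a fixed comparison group \(\mathcal{D}\), Forward DiD posits that the treated unit’s untreated outcome equals the group average plus a constant level shift:
\[y^0_{0t} = \alpha + \bar{y}_{\mathcal{D}, t} + v_t, \qquad t = 1, \ldots, T,\]
with \(\alpha\) an unknown intercept and \(v_t\) a zero-mean, weakly dependent error. Crucially, \(y^0_{0t}\) and \(\bar{y}_{\mathcal{D}, t}\) may each be non-stationary (trending) provided their difference is stationary – this is the Forward DiD parallel-trends condition. The intercept is estimated by least squares on the pre-period,
\[\hat{\alpha} = \frac{1}{T_1} \sum_{t \in \mathcal{T}_1} \left( y_{0t} - \bar{y}_{\mathcal{D}, t} \right),\]
so the in-sample fit and out-of-sample counterfactual are
\[\hat{y}^0_{0t} = \hat{\alpha} + \bar{y}_{\mathcal{D}, t}, \qquad t = 1, \ldots, T,\]
and the ATT is the mean post-period gap
\[\widehat{\mathrm{ATT}} = \frac{1}{T_2} \sum_{t \in \mathcal{T}_2} \left( y_{0t} - \hat{y}^0_{0t} \right).\]
Because the model has a single parameter, the pre-treatment fit quality is summarized by an \(R^2\) (identical to the adjusted \(R^2\), since the intercept is the only estimated parameter and no slope coefficient is fitted):
\[R^2_{\mathcal{D}} = 1 - \frac{\sum_{t \in \mathcal{T}_1} \hat{v}_t^2}{\sum_{t \in \mathcal{T}_1} (y_{0t} - \bar{y}_0)^2}, \qquad \hat{v}_t = y_{0t} - \hat{\alpha} - \bar{y}_{\mathcal{D}, t},\]
where \(\bar{y}_0\) is the treated unit’s pre-period mean.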
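The DiD model for a fixed comparison group can be sketched in a few lines of NumPy. This is a minimal illustration of the quantities defined in this section (intercept, counterfactual, ATT, pre-period \(R^2\)), not mlsynth’s implementation; the function name `did_fit` and the toy panel are assumptions made for the example.

```python
import numpy as np


def did_fit(y_treated, y_controls, T1):
    """Intercept-only DiD fit against an equal-weighted control average.

    y_treated  : array of shape (T,), treated-unit outcomes
    y_controls : array of shape (T, n_selected), outcomes of the chosen group
    T1         : number of pre-treatment periods
    (hypothetical helper, written only for this illustration)
    """
    y_treated = np.asarray(y_treated, dtype=float)
    ybar = np.asarray(y_controls, dtype=float).mean(axis=1)  # equal weights
    alpha = np.mean(y_treated[:T1] - ybar[:T1])              # least-squares intercept
    counterfactual = alpha + ybar                            # fit (pre) and forecast (post)
    gap = y_treated - counterfactual
    ss_res = np.sum(gap[:T1] ** 2)                           # pre-period residuals
    ss_tot = np.sum((y_treated[:T1] - y_treated[:T1].mean()) ** 2)
    return {"alpha": alpha, "att": gap[T1:].mean(),
            "r2": 1.0 - ss_res / ss_tot, "counterfactual": counterfactual}


# Toy panel: the treated unit sits 2 units above the control average
# and gains 1 unit from period 6 onward.
t = np.arange(10.0)
controls = np.column_stack([t, t + 0.5])
treated = 2.0 + controls.mean(axis=1)
treated[6:] += 1.0
fit = did_fit(treated, controls, T1=6)
print(fit["alpha"], fit["att"], fit["r2"])
```

On this noise-free toy panel the fit recovers the level shift of 2 as the intercept and the post-period jump of 1 as the ATT, with a perfect pre-period fit.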
The forward-selection algorithm#
Maximizing \(R^2_{\mathcal{D}}\) is equivalent to minimizing the pre-period residual variance \(T_1^{-1} \sum_{t \in \mathcal{T}_1} \hat{v}_t^2\). Forward DiD searches over comparison groups greedily:
Step 1. For each single control \(i \in \mathcal{N}\), form the one-unit comparison group and compute its pre-period \(R^2\). Keep the best single control, \(\hat{c}_1\).
Step 2. Add to \(\{\hat{c}_1\}\) each of the remaining \(N - 1\) controls in turn; keep the two-unit group with the highest \(R^2\).
Step 3. Continue, adding one control at a time, until all \(N\) controls are in. This yields \(N\) nested groups (sizes \(1, 2, \ldots, N\)); select the one, \(\hat{\mathcal{D}} = \mathcal{N}_{co}\), with the largest \(R^2\).
The greedy search evaluates \(1 + 2 + \cdots + N = N(N+1)/2\) sub-models rather than the \(2^N\) of the exhaustive procedure (for \(N = 60\), that is 1,830 versus \(1.15 \times 10^{18}\)). The final group \(\hat{\mathcal{D}}\) is then plugged into the DiD formulas above for the ATT, its standard error, and the \(R^2\).
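Read literally, Steps 1 to 3 amount to the loop below. It is a deliberately naive sketch that re-forms every candidate average from scratch, written to mirror the description above rather than mlsynth’s vectorized code; `pre_r2`, `forward_select_naive`, and the toy data are hypothetical names and values chosen for illustration.

```python
import numpy as np


def pre_r2(y_pre, ybar_pre):
    """Pre-period R^2 of the intercept-only fit of y_pre on ybar_pre."""
    y_c = y_pre - y_pre.mean()
    resid = y_c - (ybar_pre - ybar_pre.mean())   # the intercept is profiled out
    return 1.0 - (resid @ resid) / (y_c @ y_c)


def forward_select_naive(y_pre, Y_pre):
    """Greedy forward selection over the columns of Y_pre (shape (T1, N))."""
    remaining = list(range(Y_pre.shape[1]))
    selected, path, n_evals = [], [], 0
    while remaining:
        # Score every remaining control when added to the current group.
        scores = [pre_r2(y_pre, Y_pre[:, selected + [c]].mean(axis=1))
                  for c in remaining]
        n_evals += len(scores)
        best = int(np.argmax(scores))
        path.append(scores[best])
        selected.append(remaining.pop(best))
    k_best = int(np.argmax(path)) + 1            # size of the best nested group
    return selected[:k_best], np.array(path), n_evals


# Toy pre-period: controls 0 and 1 share the treated trend (their noise
# cancels in the average); controls 2 and 3 trend twice as fast.
t = np.arange(8.0)
f = t ** 2 / 10.0
d = np.where(t % 2 == 0, 0.1, -0.1)
y_pre = 1.0 + f
Y_pre = np.column_stack([f + d, f - d, 2 * f, 2 * f + d])
chosen, path, n_evals = forward_select_naive(y_pre, Y_pre)
print(sorted(chosen), n_evals)
```

With \(N = 4\) the search scores \(4 + 3 + 2 + 1 = 10\) sub-models, matching the \(N(N+1)/2\) count, and keeps the two controls that share the treated unit’s trend.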
How mlsynth computes this: incremental means and a batched \(R^2\)#
Read literally, each step of the algorithm re-forms a subset average from
scratch and re-fits a DiD regression for every remaining candidate – an
\(O(N)\) rebuild times an \(O(N)\) candidate loop times the
per-candidate work, repeated \(N\) times. mlsynth’s
forward_did_select() instead
collapses each step into a handful of vectorized NumPy operations through
three observations.
1. The comparison average is updated incrementally, never rebuilt. Let \(\mathbf{m}^{(k)} \in \mathbb{R}^{T}\) be the running average over the \(k\) already-selected controls. Adding control \(c\), with outcome path \(\mathbf{y}_c\), gives the \((k+1)\)-average by a single running-mean update,
\[\mathbf{m}^{(k+1)} = \mathbf{m}^{(k)} + \frac{\mathbf{y}_c - \mathbf{m}^{(k)}}{k + 1},\]
which is \(O(T)\) rather than \(O(kT)\). This is _update_synthetic_control() (current_mean + (control - current_mean) / (k + 1)).
2. All candidate averages for a step are built in one matrix. At step \(k\), let \(\mathbf{Y}_{\mathcal{R}} \in \mathbb{R}^{T_1 \times |\mathcal{R}|}\) stack the pre-period columns of the remaining candidates \(\mathcal{R}\). Every candidate \((k+1)\)-average – one per column – is formed simultaneously by broadcasting the running pre-period mean \(\mathbf{m}^{(k)}_{\mathcal{T}_1}\):
\[\mathbf{M} = \frac{k\, \mathbf{m}^{(k)}_{\mathcal{T}_1} \mathbf{1}^\top + \mathbf{Y}_{\mathcal{R}}}{k + 1} \in \mathbb{R}^{T_1 \times |\mathcal{R}|}.\]
In code this is the one line new_means = (current_mean_pre[:, None] * k + candidates) / (k + 1) inside _select_best_donor().
3. The intercept \(\alpha\) drops out, so scoring is pure inner products. This is the step that removes the per-candidate regression entirely. Profiling out \(\alpha\) from the DiD loss is exactly centering: the fitted residual for candidate column \(\ell\) is \(\hat v_t = (y_{0t} - \bar y_0) - (M_{t\ell} - \bar M_\ell)\). Writing \(\tilde{\mathbf{y}} = \mathbf{y}_{0,\mathcal{T}_1} - \bar y_0\) (precomputed once, with its norm \(\lVert\tilde{\mathbf{y}}\rVert^2 = \mathrm{ss}_{\text{tot}}\)), the residual sum of squares for all candidates is
\[\mathrm{RSS}_\ell = \lVert \tilde{\mathbf{y}} \rVert^2 - 2\, \tilde{\mathbf{y}}^\top (\mathbf{M}_{\cdot\ell} - \bar{M}_\ell) + \lVert \mathbf{M}_{\cdot\ell} - \bar{M}_\ell \rVert^2, \qquad R^2_\ell = 1 - \frac{\mathrm{RSS}_\ell}{\mathrm{ss}_{\text{tot}}}.\]
The cross term for the whole candidate set is the single matvec \(\tilde{\mathbf{y}}^\top(\mathbf{M} - \bar{\mathbf{M}})\); the column sums of squares are one reduction. This is _r2_batch() – no candidate is ever regressed, and \(\alpha\) is never explicitly solved during the search (it is recovered only once, for the winning group, in did_from_mean()).
Taken together, a forward step costs \(O(T_1 |\mathcal{R}|)\) and the
entire search is \(O(T_1 N^2)\), with the inner loop expressed as a
broadcast, a matrix–vector product, a column reduction, and an
argmax – no Python-level loop over candidates and no per-candidate
solve. This is what lets the implementation run the selection over
\(\sim\)1,500 controls, and what makes the \(M = 1{,}000\)
replication Monte Carlo in Verification tractable.
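The three observations above combine into one vectorized forward step. The sketch below is an independent illustration of that idea, not a copy of mlsynth’s `_select_best_donor()` or `_r2_batch()`, whose exact signatures are not shown here; the names `r2_batch`, `forward_step`, and `r2_loop` are assumptions introduced for the example.

```python
import numpy as np


def r2_batch(y_pre, M):
    """Pre-period R^2 for every candidate column of M (shape (T1, R)) at once."""
    y_c = y_pre - y_pre.mean()                   # centering profiles out alpha
    M_c = M - M.mean(axis=0)
    ss_tot = y_c @ y_c
    cross = y_c @ M_c                            # one matrix-vector product
    col_ss = np.einsum("tj,tj->j", M_c, M_c)     # column sums of squares
    return 1.0 - (ss_tot - 2.0 * cross + col_ss) / ss_tot


def forward_step(y_pre, current_mean_pre, k, candidates):
    """Score all candidate (k+1)-averages and return the best one."""
    new_means = (current_mean_pre[:, None] * k + candidates) / (k + 1)
    scores = r2_batch(y_pre, new_means)
    best = int(np.argmax(scores))
    return best, float(scores[best]), new_means[:, best]


def r2_loop(y_pre, col):
    """Reference score for one candidate, solving alpha explicitly."""
    alpha = np.mean(y_pre - col)
    resid = y_pre - alpha - col
    return 1.0 - (resid @ resid) / np.sum((y_pre - y_pre.mean()) ** 2)


rng = np.random.default_rng(0)
T1, N = 12, 5
y_pre = rng.standard_normal(T1)
cands = rng.standard_normal((T1, N))

# First step of the search: k = 0, so the running mean is not used yet.
best, score, running_mean = forward_step(y_pre, np.zeros(T1), 0, cands)
loop_scores = np.array([r2_loop(y_pre, cands[:, j]) for j in range(N)])
print(np.allclose(r2_batch(y_pre, cands), loop_scores))
```

The batched scores agree with the explicit per-candidate fit because centering both series is algebraically the same as solving for the intercept.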
Assumptions#
Assumption 1 (Forward DiD parallel trends). There exists a subset \(\mathcal{D} \subseteq \mathcal{N}\) and a constant \(\alpha\) such that \(y^0_{0t} = \alpha + \bar{y}_{\mathcal{D}, t} + v_t\) for all \(t\), where \(v_t\) is a weakly dependent process with zero mean and finite variance.
Remark. This says the gap between the treated unit and the selected comparison group is stable across the pre- and post-periods up to a mean-zero shock. It is strictly weaker than DiD’s requirement that all controls be parallel: it asks only that some equal-weighted subset be parallel. Both \(y^0_{0t}\) and \(\bar{y}_{\mathcal{D}, t}\) may trend arbitrarily (even non-linearly), so long as their difference is trendless – which is what makes the method valid under non-stationarity.
Assumption 2 (no anticipation / no interference). Controls are untreated throughout, and the treated unit’s outcome equals its untreated potential outcome in the pre-period.
Remark. Standard in the DiD/SC literature. It is what lets the pre-period identify the comparison group and the post-period controls stand in for the untreated path: anticipation would contaminate the pre-period fit that drives selection, and if controls were themselves affected by the intervention (spillover), their post-period average would shift and the counterfactual would be biased.
Assumption 3 (regularity for inference). The partial sums of \(v_t\) obey a central limit theorem; errors are weakly dependent with finite long-run variance.
Remark. This is what delivers the asymptotic normality in Proposition 2.1 below, and it holds for the broad class of stationary, weakly-dependent error processes – it does not require \(v_t\) to be i.i.d. or the levels \(y^0_{0t}\) to be stationary.
When not to use Forward DiD
Assumption 1 fails when no subset of controls can track the treated unit – most importantly when the treated unit lies outside the range of the controls (e.g. its outcome trends upward more steeply than every control’s). Equal weights cannot extrapolate beyond the controls, so no selection rescues it. In that regime Li points to methods that let the treated unit sit outside the control hull: the augmented DiD ([ADID]), factor-model / interactive-fixed-effect estimators, or SC with an intercept.
The pretreatment \(R^2\) returned by FDID is the natural empirical check on Assumption 1. The script below draws two panels with the same underlying common factor and the same true ATT of zero, differing only in the treated unit’s factor loading:
Panel A (treated_loading = 1). The treated unit shares the controls’ single-factor loading. Assumption 1 holds for any subset of the donors.
Panel B (treated_loading = 3). The treated unit trends three times faster than any control. No subset of the equal-weighted donors can extrapolate the steeper trend – Assumption 1 fails.
import numpy as np
import pandas as pd
from mlsynth import FDID
def make_panel(*, treated_loading, n_controls=40, T1=24, T2=12, seed=0):
    """Synthetic panel with one common smoothly-trending factor.
    The treated unit's loading on the factor is ``treated_loading``; every
    control loads with coefficient 1. True ATT = 0 by construction.
    """
    rng = np.random.default_rng(seed)
    T = T1 + T2
    f = np.cumsum(rng.standard_normal(T)) / np.sqrt(T)
    eps_tr = 0.10 * rng.standard_normal(T)
    eps_co = 0.10 * rng.standard_normal((n_controls, T))
    y_tr = 1.0 + treated_loading * f + eps_tr
    y_co = 1.0 + 1.0 * f[None, :] + eps_co
    rows = []
    for t in range(T):
        rows.append({"unit": "treated", "time": t, "y": float(y_tr[t]),
                     "treat": int(t >= T1)})
        for i in range(n_controls):
            rows.append({"unit": f"c{i}", "time": t, "y": float(y_co[i, t]),
                         "treat": 0})
    return pd.DataFrame(rows)
for label, loading in [("Forward PTA holds (loading=1)", 1.0),
                       ("Forward PTA fails (loading=3)", 3.0)]:
    df = make_panel(treated_loading=loading, seed=0)
    res = FDID({"df": df, "outcome": "y", "treat": "treat",
                "unitid": "unit", "time": "time",
                "display_graphs": False}).fit()
    print(f"{label:35s} FDID ATT = {res.fdid.att:+.3f} "
          f"R^2 = {res.fdid.r_squared:.3f} "
          f"selected {len(res.fdid.selected_names)} donors")
prints:
Forward PTA holds (loading=1) FDID ATT = -0.009 R^2 = 0.975 selected 4 donors
Forward PTA fails (loading=3) FDID ATT = -0.802 R^2 = 0.588 selected 2 donors
Two lessons jump out:
The \(R^2\) is the warning signal. When Forward PTA holds, FDID hits \(R^2 \approx 0.98\) and recovers the true zero ATT to within noise. When it fails, the in-sample fit drops to \(R^2 \approx 0.59\) – a much weaker fit on a panel of the same dimensions. Compare the two against the same threshold you would apply in a forecast exercise (Li’s empirical applications report \(R^2\) of 0.76-0.91 on Atlanta / San Diego / San Jose). When the pre-fit is weak, distrust the post-fit ATT.
The bias is large and systematic. When Forward PTA fails because the treated unit trends faster than any subset of controls, FDID’s equal-weighted comparison group cannot reproduce the treated unit’s post-period path, and the ATT is pushed away from the true value (here: \(-0.80\) against a true 0). In this design the gap left after differencing is twice the common factor, so the sign of the bias follows the direction in which the factor moves between the pre- and post-periods. A placebo test on the pre-period will also fail: the in-sample residuals are systematically wrong when the controls cannot extrapolate the treated unit’s trend.
If your application reports \(R^2\) materially below the threshold you would consider acceptable for a forecast (say, < 0.7), read the ATT estimate as a symptom of misspecification rather than as an estimate of the causal effect. Re-running Forward DiD with a different comparison construction is unlikely to help, so switch to one of the methods Li flags for the out-of-hull case: the augmented DiD, a factor-model / interactive-fixed-effects estimator, or synthetic control with an intercept.
Inference#
Because Forward DiD estimates a single parameter, its inference is the textbook DiD inference. Let \(\hat{\sigma}^2_{\mathcal{D}} = T_1^{-1} \sum_{t \in \mathcal{T}_1} \hat{v}_t^2\) be the pre-period residual variance on the selected group. Li’s Proposition 2.1 establishes
\[\Pr\left( \frac{\sqrt{T_2}\, (\widehat{\mathrm{ATT}} - \mathrm{ATT})}{\hat{\sigma}_{\mathcal{D}}} \le x \right) \longrightarrow \Phi(x)\]
as \(T_1, T_2 \to \infty\), where \(\Phi\) is the standard-normal CDF. mlsynth reports the finite-sample standard error that also carries the estimation error in \(\hat{\alpha}\):
\[\widehat{\mathrm{se}}\big(\widehat{\mathrm{ATT}}\big) = \hat{\sigma}_{\mathcal{D}} \sqrt{\frac{1}{T_1} + \frac{1}{T_2}},\]
since \(\widehat{\mathrm{ATT}} - \mathrm{ATT} = -T_1^{-1} \sum_{\mathcal{T}_1} v_t + T_2^{-1} \sum_{\mathcal{T}_2} v_t\) contributes one \(1/T_1\) and one \(1/T_2\) variance term. This collapses to Proposition 2.1’s \(\hat{\sigma}_{\mathcal{D}} / \sqrt{T_2}\) when \(T_1 \gg T_2\). The 95% Wald interval and two-sided p-value follow in the usual way.
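The inference arithmetic can be sketched as follows. This is a minimal illustration that assumes the standard error takes the form implied by the variance decomposition described above (the pre-period residual standard deviation scaled by \(\sqrt{1/T_1 + 1/T_2}\)); it is not taken from mlsynth’s source, and the function name `att_inference` is hypothetical.

```python
import math

import numpy as np


def att_inference(gap, T1, z=1.959963984540054):
    """ATT, standard error, Wald interval and two-sided p-value.

    gap : observed minus counterfactual over all T periods, so that
          gap[:T1] holds the pre-period residuals (mean zero by construction).
    z   : normal critical value (default gives a 95% interval).
    """
    gap = np.asarray(gap, dtype=float)
    T2 = gap.size - T1
    att = float(gap[T1:].mean())
    sigma = math.sqrt(float(np.mean(gap[:T1] ** 2)))     # pre-period residual s.d.
    se = sigma * math.sqrt(1.0 / T1 + 1.0 / T2)          # assumed finite-sample form
    p_value = math.erfc(abs(att / se) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return att, se, (att - z * se, att + z * se), p_value


# Toy gap series: four pre-period residuals, two post-period gaps.
gap = np.array([0.1, -0.1, 0.1, -0.1, 1.0, 1.0])
att, se, ci, p_value = att_inference(gap, T1=4)
print(att, se, ci, p_value)
```

The p-value uses the identity \(2(1 - \Phi(|z|)) = \mathrm{erfc}(|z| / \sqrt{2})\), which avoids a dependency on SciPy.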
Consistency of the selection#
Li also shows the greedy search recovers a valid comparison group. Under Assumption 1 and the appendix’s regularity conditions, with \(N\) fixed, the empirical forward selection selects (one of) the same subset(s) the infeasible procedure based on true error variances would select, with probability approaching one as \(T_1 \to \infty\) (Proposition 2.2; Proposition D.1 handles ties). Proposition D.2 extends this to the case where \(N\) grows with \(T_1\) under a latent group structure. Intuitively, by the law of large numbers each step’s empirical \(R^2\) converges to its population value, so the greedy path tracks the population-optimal path.
Example#
The block below is self-contained. It draws one panel from Li’s Web Appendix E data-generating process (three common factors, 60 controls), in the configuration where half the controls are the wrong comparison: the treated unit and the first 30 controls load on the common factor with weight 1, while the last 30 load with weight 2 (Li’s “DGP2”). The true ATT is zero. Forward DiD should select from the matching half and beat plain DiD, which is contaminated by the mismatched half.
import numpy as np
from mlsynth import FDID
from mlsynth.utils.fdid_helpers.simulation import simulate_fdid_sample
sample = simulate_fdid_sample(dgp=2, N=60, T1=24, T2=12,
                              rng=np.random.default_rng(0))
res = FDID({"df": sample.df, "outcome": "y", "treat": "treat",
            "unitid": "unit", "time": "time",
            "display_graphs": False}).fit()
sel = res.fdid.selected_names
# Count selected donors whose index (the name minus its one-character prefix) falls in the first half.
matching = sum(int(s[1:]) < 60 // 2 for s in sel)
print(f"FDID: ATT={res.fdid.att:+.3f} R2={res.fdid.r_squared:.3f} "
      f"selected {len(sel)} donors, {matching} from the matching group")
print(f"DID : ATT={res.did.att:+.3f} R2={res.did.r_squared:.3f} (all 60 donors)")
A representative single draw prints:
FDID: ATT=-0.556 R2=0.918 selected 4 donors, 4 from the matching group
DID : ATT=-0.924 R2=0.632 (all 60 donors)
Forward DiD picks only matching controls, lifting the pre-fit \(R^2\) from 0.63 to 0.92 and landing closer to the true zero effect than DiD – which is dragged off by the 30 mismatched controls it is forced to include. (A single draw is noisy; the averaged behaviour over many draws is in Verification below.)
res is an FDIDResults: res.fdid and res.did are the two FDIDMethodFit objects; the convenience accessors (res.att, res.att_se, res.counterfactual, res.donor_weights) forward to the Forward DiD fit; and res.att_by_method() / res.ci_by_method() return both fits side by side.
Empirical Illustration: Hong Kong’s economic integration#
Forward DiD is the DiD analogue of the forward-selected panel-data approach ([fsPDA]), and it shines on exactly the data those methods target. We use the Hsiao, Ching and Wan ([HCW]) panel of quarterly real-GDP growth for Hong Kong and 24 comparison economies, with Hong Kong’s economic integration with mainland China as the intervention (44 pre-treatment quarters, 17 post).
import pandas as pd
from mlsynth import FDID
url = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/main/basedata/HongKong.csv"
df = pd.read_csv(url)
res = FDID({"df": df, "outcome": "GDP", "treat": "Integration",
            "unitid": "Country", "time": "Time", "display_graphs": True}).fit()
print(f"FDID ATT {res.fdid.att:.4f} SE {res.fdid.att_se:.4f} "
      f"R2 {res.fdid.r_squared:.3f} ({len(res.fdid.selected_names)} of "
      f"{res.inputs.n_donors} controls)")
print("selected:", res.fdid.selected_names)
print(f"DID ATT {res.did.att:.4f} SE {res.did.att_se:.4f} R2 {res.did.r_squared:.3f}")
This prints:
FDID ATT 0.0254 SE 0.0046 R2 0.843 (9 of 24 controls)
selected: ['Philippines', 'Singapore', 'Thailand', 'Norway', 'Mexico',
'Korea', 'Indonesia', 'New Zealand', 'Malaysia']
DID ATT 0.0317 SE 0.0082 R2 0.505
Forward DiD keeps 9 of the 24 economies – a mix dominated by Hong Kong’s East and Southeast Asian neighbours (six of the nine) – and in doing so lifts the pre-intervention \(R^2\) from 0.51 (all-controls DiD) to 0.84 and cuts the standard error by about 44% (0.0082 to 0.0046). The selected comparison group implies a post-integration GDP-growth effect of about +2.5 percentage points, more precisely estimated and better-fitting than the all-controls DiD’s +3.2.
Verification#
Monte Carlo replication (Path B). Li’s empirical application – the
effect of opening physical stores on an online-first retailer’s
city-level sales – runs on a confidential retailer dataset, so it
cannot be reproduced value-for-value. Per the project’s replication
contract (agents/agents_estimators.md), Forward DiD is therefore
validated by reproducing the paper’s own Monte Carlo, Web Appendix E.
The four DGPs and their factor structure are all packaged in
mlsynth.utils.fdid_helpers.simulation.simulate_fdid_sample(): three
common factors – f1 AR(1) 0.8, f2 ARMA(1,1) (-0.6, 0.8),
f3 MA(2) (0.9, 0.4), innovations \(N(0,1)\) – with outcomes
\(a_0 + c_0 \sum_k f_{kt} + \varepsilon\) for the treated unit and
\(1 + c \sum_k f_{kt} + \varepsilon\) for the controls (first half
loading \(c_1\), second half \(c_2\)). Four DGPs vary
\((a_0, c_0, c_1, c_2)\): DGP1 (1,1,1,1) and DGP3
(2,1,1,1) (all controls match – DiD is applicable); DGP2
(1,1,1,2) and DGP4 (2,1,1,2) (half the controls have the
wrong loading – DiD breaks). True ATT \(= 0\) and
\(\mathrm{PMSE} = M^{-1} \sum_j \widehat{\mathrm{ATT}}_j^2\).
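For readers without mlsynth at hand, the data-generating process described above can be sketched directly. This is an illustration under stated assumptions, not the packaged simulate_fdid_sample(): the idiosyncratic noise is assumed to be \(N(0,1)\), a 100-period burn-in is assumed, the ARMA(1,1) pair is read as (AR = -0.6, MA = 0.8), and the names `DGPS`, `simulate_factors`, and `simulate_panel` are hypothetical.

```python
import numpy as np

# (a0, c0, c1, c2) for the four designs described in the text.
DGPS = {1: (1, 1, 1, 1), 2: (1, 1, 1, 2), 3: (2, 1, 1, 1), 4: (2, 1, 1, 2)}


def simulate_factors(T, rng, burn=100):
    """Three common factors: AR(1), ARMA(1,1), MA(2) with N(0,1) innovations."""
    n = T + burn                                  # burn-in length is an assumption
    e1, e2, e3 = rng.standard_normal((3, n))
    f1, f2, f3 = np.zeros(n), np.zeros(n), np.zeros(n)
    for t in range(2, n):
        f1[t] = 0.8 * f1[t - 1] + e1[t]
        f2[t] = -0.6 * f2[t - 1] + e2[t] + 0.8 * e2[t - 1]   # assumed AR/MA order
        f3[t] = e3[t] + 0.9 * e3[t - 1] + 0.4 * e3[t - 2]
    return np.column_stack([f1, f2, f3])[burn:]


def simulate_panel(dgp, N, T, rng):
    """Treated series (T,) and control matrix (T, N); true ATT is zero."""
    a0, c0, c1, c2 = DGPS[dgp]
    common = simulate_factors(T, rng).sum(axis=1)
    # First half of the controls loads with c1, second half with c2.
    loadings = np.where(np.arange(N) < N // 2, c1, c2).astype(float)
    y_treated = a0 + c0 * common + rng.standard_normal(T)    # noise s.d. assumed 1
    y_controls = 1.0 + common[:, None] * loadings[None, :] + rng.standard_normal((T, N))
    return y_treated, y_controls


y1, Y1 = simulate_panel(1, N=60, T=36, rng=np.random.default_rng(0))
y3, Y3 = simulate_panel(3, N=60, T=36, rng=np.random.default_rng(0))
print(y1.shape, Y1.shape)
```

Drawing DGP1 and DGP3 from the same seed makes the role of \(a_0\) visible: the two treated series differ by a constant and the controls are identical.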
Replicating Table 5 takes a short script:
import numpy as np
from mlsynth import FDID
from mlsynth.utils.fdid_helpers.simulation import simulate_fdid_sample
def pmse_cell(dgp, N, T1, T2, M, seed=0):
    fdid_sq, did_sq = [], []
    for j in range(M):
        rng = np.random.default_rng(seed + j)
        sample = simulate_fdid_sample(dgp=dgp, N=N, T1=T1, T2=T2, rng=rng)
        res = FDID({"df": sample.df, "outcome": "y", "treat": "treat",
                    "unitid": "unit", "time": "time",
                    "display_graphs": False, "verbose": False}).fit()
        fdid_sq.append(res.fdid.att ** 2)  # true ATT = 0, so the squared error is att^2
        did_sq.append(res.did.att ** 2)
    return float(np.mean(fdid_sq)), float(np.mean(did_sq))
for dgp in (1, 2, 3, 4):
    for T1, T2 in [(12, 6), (24, 12), (48, 24)]:
        fdid_pmse, did_pmse = pmse_cell(dgp, N=60, T1=T1, T2=T2, M=1000)
        print(f"DGP{dgp} ({T1},{T2}): FDID={fdid_pmse:.4f} DID={did_pmse:.4f}")
At \(M = 1{,}000\) (Li uses \(M = 10{,}000\); the smaller \(M\) shortens the runtime at the cost of more Monte Carlo noise) this tracks Table 5 closely, with the largest gaps in the \((12, 6)\) cells of DGP2 and DGP4:
| DGP | \((T_1, T_2)\) | DID (mlsynth) | DID (Li) | FDID (mlsynth) | FDID (Li) |
|---|---|---|---|---|---|
| 1 | (12, 6) | 0.265 | 0.259 | 0.325 | 0.315 |
| 1 | (24, 12) | 0.127 | 0.128 | 0.147 | 0.146 |
| 1 | (48, 24) | 0.065 | 0.063 | 0.075 | 0.071 |
| 2 | (12, 6) | 1.202 | 1.037 | 0.431 | 0.385 |
| 2 | (24, 12) | 0.765 | 0.746 | 0.177 | 0.180 |
| 2 | (48, 24) | 0.451 | 0.473 | 0.084 | 0.082 |
| 3 | (12, 6) | 0.265 | 0.252 | 0.325 | 0.303 |
| 3 | (24, 12) | 0.127 | 0.123 | 0.147 | 0.143 |
| 3 | (48, 24) | 0.065 | 0.064 | 0.075 | 0.072 |
| 4 | (12, 6) | 1.202 | 1.038 | 0.431 | 0.391 |
| 4 | (24, 12) | 0.765 | 0.744 | 0.177 | 0.171 |
| 4 | (48, 24) | 0.451 | 0.454 | 0.084 | 0.081 |
The two headline findings reproduce. When all controls are valid (DGP1, DGP3) DiD is the parsimonious efficient choice and edges out Forward DiD by a small margin at every horizon. When half the controls are mismatched (DGP2, DGP4) DiD’s PMSE stays large and shrinks only slowly as the panel grows (DGP2 at \((48,24)\): DID still 0.45), because the contaminating controls bias the all-controls average; Forward DiD’s PMSE collapses (0.084) because the forward search discards them. Forward DiD pays only a small efficiency cost when DiD is valid, and wins decisively when it is not – Li’s central result. The mlsynth entries for DGP1 and DGP3 (and for DGP2 and DGP4) are identical, which also confirms intercept invariance – moving \(a_0\) from 1 to 2 changes nothing because the estimated intercept \(\widehat\alpha\) absorbs it. The \((12, 6)\) cells run hot under DGP2/4 (for example DID at 1.202 against Li’s 1.037), which is plausibly Monte Carlo noise at \(M = 1{,}000\) vs Li’s \(M = 10{,}000\).
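The intercept invariance noted above is easy to confirm numerically. The sketch below is a stand-alone illustration on simulated data, with the hypothetical helper `did_att_r2` standing in for the DiD arithmetic; it does not call mlsynth.

```python
import numpy as np


def did_att_r2(y, ybar, T1):
    """ATT and pre-period R^2 of the intercept-only DiD fit."""
    alpha = np.mean(y[:T1] - ybar[:T1])           # absorbs any level shift in y
    gap = y - alpha - ybar
    r2 = 1.0 - np.sum(gap[:T1] ** 2) / np.sum((y[:T1] - y[:T1].mean()) ** 2)
    return float(gap[T1:].mean()), float(r2)


rng = np.random.default_rng(0)
ybar = np.cumsum(rng.standard_normal(36))         # trending comparison average
y = 1.0 + ybar + 0.1 * rng.standard_normal(36)    # treated level a0 = 1

att_a, r2_a = did_att_r2(y, ybar, T1=24)
att_b, r2_b = did_att_r2(y + 1.0, ybar, T1=24)    # same draw with a0 = 2
print(np.isclose(att_a, att_b), np.isclose(r2_a, r2_b))
```

Shifting the treated series by a constant moves the estimated intercept by the same constant and leaves the gap, the ATT, and the \(R^2\) unchanged.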
For reference, Li’s confidential store-opening study reports a Forward DiD effect of opening a store in Atlanta of +$75,143 in monthly sales (an 86% lift, pre-period \(R^2 = 0.76\)), with DiD and SC – which fit Atlanta’s steep pre-trend poorly – overstating it.
Core API#
Forward Difference-in-Differences (FDID) estimator.
Implements the forward-selection difference-in-differences method of
Li (2023), Frontiers: A Simple Forward Difference-in-Differences
Method, Marketing Science. FDID greedily grows the control group one
donor at a time, keeping the subset that maximises pre-treatment fit,
and reports both the forward-selected estimate (FDID) and the
textbook all-donor difference-in-differences benchmark (DID), each
with Li (2023) analytical standard errors.
The estimator is a thin orchestration layer over
mlsynth.utils.fdid_helpers: it validates configuration, prepares
the panel, runs forward selection, assembles a typed
FDIDResults, and
optionally plots the counterfactuals.
- class mlsynth.estimators.fdid.FDID(config: FDIDConfig | dict)#
Bases: object
Forward Difference-in-Differences (FDID) estimator.
- Parameters:
config (FDIDConfig or dict) – Validated configuration (or a compatible dictionary). See mlsynth.config_models.FDIDConfig for the available fields (df, outcome, treat, unitid, time, display_graphs, save, counterfactual_color, treated_color, verbose).
References
Li, K. T. (2023). Frontiers: A Simple Forward Difference-in-Differences Method. Marketing Science, 43(2), 267-279. https://doi.org/10.1287/mksc.2022.0212
Examples
>>> import pandas as pd
>>> from mlsynth import FDID
>>> url = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/refs/heads/main/basedata/basque_data.csv"
>>> data = pd.read_csv(url)
>>> config = {
...     "df": data,
...     "outcome": data.columns[2],
...     "treat": data.columns[-1],
...     "unitid": data.columns[0],
...     "time": data.columns[1],
...     "display_graphs": False,
... }
>>> results = FDID(config).fit()
>>> round(results.att, 3)
- fit() → FDIDResults#
Run forward selection and return the typed FDID results.
- Returns:
FDIDResults – Container exposing the forward-selected fdid fit (primary) and the all-donor did benchmark, plus convenience aliases (att, att_se, counterfactual, gap, donor_weights).
- Raises:
MlsynthDataError – If panel balancing or data preparation fails.
MlsynthEstimationError – If there are too few pre-periods or forward selection fails.
Configuration#
- class mlsynth.config_models.FDIDConfig(*, df: pandas.DataFrame, outcome: str, treat: str, unitid: str, time: str, display_graphs: bool = True, save: bool | str = False, counterfactual_color: typing.List[str] = <factory>, treated_color: str = 'black', verbose: bool = True)#
Configuration for the Forward Difference-in-Differences (FDID) estimator. Inherits all common configuration parameters from BaseEstimatorConfig.
Additional Parameters#
- plot_did : bool, default=True
Whether to display a plot for the standard DID estimator. Has no effect on FDID or ADID plots.
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
Result Containers#
FDID.fit() returns an
FDIDResults, whose fdid
and did fields each hold an
FDIDMethodFit
(counterfactual, gap, ATT, analytical standard error, 95% CI, p-value,
pre-period RMSE and \(R^2\), selected donor names and equal weights,
and – for Forward DiD – the \(R^2\) selection path). The prepared
panel is exposed as an
FDIDInputs.
Frozen dataclasses for the Forward Difference-in-Differences estimator.
FDID (Li 2023, Frontiers: A Simple Forward Difference-in-Differences Method, Marketing Science) builds the control group for a single treated unit by forward selection: it greedily adds the donor that most improves pre-treatment fit (R^2 between the treated unit and the running donor average), tracks the R^2 path, and keeps the subset that maximises it. The synthetic control is the simple average of the selected donors, with a difference-in-differences intercept.
Two estimates are always returned side by side:
FDID – the forward-selected difference-in-differences (best donor subset).
DID – the textbook two-way difference-in-differences using all donors (the average of every control unit). This is the natural benchmark the forward search improves upon.
Both carry Li (2023) analytical standard errors. The three layers below
(inputs, per-method fit, top-level results) mirror the CLUSTERSC /
PROXIMAL container design used elsewhere in mlsynth.
- class mlsynth.utils.fdid_helpers.structures.FDIDInputs(y: numpy.ndarray, donor_matrix: numpy.ndarray, pre_periods: int, post_periods: int, T: int, donor_names: typing.Sequence, time_labels: numpy.ndarray, treated_unit_name: typing.Any, verbose: bool = True, prepped: typing.Dict[str, typing.Any] = <factory>)#
Bases: object
Preprocessed panel data for the FDID pipeline.
- Parameters:
y (np.ndarray) – Treated-unit outcome over all T periods, shape (T,).
donor_matrix (np.ndarray) – Donor outcomes, shape (T, n_donors).
pre_periods (int) – Number of pre-treatment periods T0 (written \(T_1\) in the notation above).
post_periods (int) – Number of post-treatment periods T1 = T - T0 (written \(T_2\) in the notation above).
T (int) – Total number of periods.
donor_names (Sequence) – Length-n_donors donor labels (column order of donor_matrix).
time_labels (np.ndarray) – Length-T time labels.
treated_unit_name (Any) – Identifier of the treated unit.
verbose (bool) – Whether the forward-selection path is recorded step by step.
prepped (dict) – The raw mlsynth.utils.datautils.dataprep() dictionary, kept so the plotter can reuse the prepared panel.
- donor_matrix: ndarray#
- time_labels: ndarray#
- y: ndarray#
- class mlsynth.utils.fdid_helpers.structures.FDIDMethodFit(name: str, counterfactual: numpy.ndarray, gap: numpy.ndarray, att: float, att_se: float, att_percent: float, satt: float, pre_rmse: float, r_squared: float, intercept: float, p_value: float, ci: typing.Tuple[float, float], selected_indices: typing.List[int], selected_names: typing.List[typing.Any], donor_weights: typing.Dict[typing.Any, float], r2_path: numpy.ndarray | None = None, intermediary: list | None = None, metadata: typing.Dict[str, typing.Any] = <factory>)#
Bases: object
Single FDID/DID fit output.
- Parameters:
name (str) – Method identifier ("FDID" or "DID").
counterfactual (np.ndarray) – Estimated counterfactual outcome path, shape (T,).
gap (np.ndarray) – Observed treated minus counterfactual, shape (T,).
att (float) – Mean post-treatment treatment effect.
att_se (float) – Li (2023) analytical standard error of the ATT.
att_percent (float) – ATT as a percentage of the post-period counterfactual mean.
satt (float) – Standardised ATT (att / se * sqrt(T1)).
pre_rmse (float) – Root-mean-squared pre-treatment fit error.
r_squared (float) – Pre-treatment R^2 of the difference-in-differences fit.
intercept (float) – Difference-in-differences intercept (treated minus donor pre-period mean).
p_value (float) – Two-sided p-value for the ATT.
ci (tuple of float) – (lower, upper) 95% confidence interval for the ATT.
selected_indices (list of int) – Column indices of the donors retained (all donors for DID).
selected_names (list) – Donor labels corresponding to selected_indices.
donor_weights (dict) – Mapping {donor_name: weight} (equal weights over the selected donors).
r2_path (np.ndarray or None) – R^2 after each forward-selection step (FDID only; None for DID).
intermediary (list or None) – Per-step diagnostics when verbose (FDID only).
metadata (dict) – Free-form per-method diagnostics.
- counterfactual: ndarray#
- gap: ndarray#
- class mlsynth.utils.fdid_helpers.structures.FDIDResults(inputs: ~mlsynth.utils.fdid_helpers.structures.FDIDInputs, fdid: ~mlsynth.utils.fdid_helpers.structures.FDIDMethodFit, did: ~mlsynth.utils.fdid_helpers.structures.FDIDMethodFit, selected_variant: str = 'FDID', metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#
Bases: object
Top-level container returned by mlsynth.FDID.fit().
- Parameters:
inputs (FDIDInputs) – Preprocessed panel.
fdid (FDIDMethodFit) – Forward-selected difference-in-differences fit (primary).
did (FDIDMethodFit) – Textbook difference-in-differences using all donors.
selected_variant (str) – Which fit is exposed via the convenience aliases (att, att_se, counterfactual, gap, donor_weights): "FDID" or "DID". Defaults to "FDID".
metadata (dict) – Free-form pipeline diagnostics.
- ci_by_method() -> Dict[str, Tuple[float, float]]
{method: (lower, upper)} confidence intervals for both fits.
- property counterfactual: ndarray
Counterfactual of the variant named by selected_variant.
- did: FDIDMethodFit
- fdid: FDIDMethodFit
- property gap: ndarray
Gap of the variant named by selected_variant.
- inputs: FDIDInputs
- property methods: Dict[str, FDIDMethodFit]
{method_name: fit} for both fits, FDID first.
Helper Modules
Data preparation – balances the panel, pivots it, validates the
pre-period count, and packs everything into the typed
FDIDInputs.
Data preparation for the Forward Difference-in-Differences estimator.
- mlsynth.utils.fdid_helpers.setup.prepare_fdid_inputs(df: DataFrame, outcome: str, treat: str, unitid: str, time: str, verbose: bool = True) -> FDIDInputs
Balance the panel, pivot it, and package it into FDIDInputs.
- Parameters:
df (pd.DataFrame) – Long panel with outcome, treatment, unit, and time columns.
outcome, treat, unitid, time (str) – Column names identifying the outcome, treatment indicator, unit, and time period.
verbose (bool, default True) – Whether the forward-selection path should be recorded step by step.
- Returns:
FDIDInputs – Preprocessed panel ready for forward selection.
- Raises:
MlsynthDataError – If panel balancing or data preparation fails (e.g. no donor units).
MlsynthEstimationError – If fewer than two pre-treatment periods are available.
The forward-selection core and the difference-in-differences arithmetic.
The public entry points are forward_did_select (the vectorized greedy
search) and did_from_mean (the DiD fit for a fixed comparison group);
the private helpers documented below are the incremental-mean and batched
\(R^2\) primitives described in How mlsynth computes this.
Forward-selection and difference-in-differences estimation for FDID.
This module holds the heavy numerical core of the Forward Difference-in-Differences estimator of Li (2023):
forward_did_select() – the vectorised forward-selection loop that greedily adds the donor most improving pre-treatment R^2, tracks the R^2 path, and returns the optimal donor subset alongside the textbook all-donor difference-in-differences benchmark.
did_from_mean() – the difference-in-differences estimate for a given donor average (ATT, fit, analytical inference, and vectors).
Both previously lived in the shared selector_helpers grab-bag and the
legacy estutils module; they are FDID-specific and now live with the
rest of the FDID pipeline.
- mlsynth.utils.fdid_helpers.estimation._choose_optimal_subset(selected: List[int], R2_path: ndarray) -> Tuple[List[int], ndarray]
Keep the donor prefix up to (and including) the R^2-maximising step.
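The prefix rule can be sketched in a few lines. This is an illustration only: the name choose_prefix is hypothetical, ties are assumed to go to the earliest step (the behaviour of np.argmax), and returning the full R^2 path as the second element is an assumption about what the private helper hands back.

```python
import numpy as np

def choose_prefix(selected, r2_path):
    """Keep donors up to and including the step with the highest R^2."""
    r2_path = np.asarray(r2_path, dtype=float)
    best_step = int(np.argmax(r2_path))      # first maximiser if there are ties
    return list(selected[: best_step + 1]), r2_path

selected = [7, 2, 5, 0]                      # donors in the order they were added
r2_path = [0.61, 0.83, 0.79, 0.70]           # R^2 after each addition
subset, path = choose_prefix(selected, r2_path)
print(subset)                                # [7, 2]
```

Donors added after the maximising step are dropped even though they were visited, which is why the selection order matters and not just the set of donors.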
- mlsynth.utils.fdid_helpers.estimation._compute_fdid_result(treated_outcome: ndarray, control_outcomes: ndarray, optimal_idxs: List[int], pre_periods: int, R2_path: ndarray, donor_names: List[Any]) -> Dict[str, Any]
Difference-in-differences result for the selected donor subset.
- mlsynth.utils.fdid_helpers.estimation._r2_batch(y_c: ndarray, ss_tot: float, X_pre: ndarray) -> ndarray
Pre-treatment R^2 of each candidate donor average vs the treated unit.
- Parameters:
y_c (np.ndarray) – Centred treated pre-treatment vector (y - mean(y)).
ss_tot (float) – Total sum of squares of y_c.
X_pre (np.ndarray) – Candidate pre-treatment vectors, shape (T0, N).
- Returns:
np.ndarray – R^2 for each candidate, shape (N,).
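A minimal sketch of a batched R^2 computation with the same argument layout. It assumes the fit being scored is the DiD regression with slope fixed at one (treated = intercept + candidate average), so centring each candidate column absorbs the intercept. The name r2_batch_sketch and the test data are hypothetical; the library's private helper may be organised differently.

```python
import numpy as np

def r2_batch_sketch(y_c, ss_tot, X_pre):
    """R^2 of the fit y_t = alpha + x_t for every column x of X_pre."""
    X_c = X_pre - X_pre.mean(axis=0)              # centring absorbs the intercept alpha
    resid = y_c[:, None] - X_c                    # (T0, N) residuals, slope fixed at 1
    ss_res = np.einsum("tn,tn->n", resid, resid)  # column-wise sums of squares
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
T0, N = 20, 5
X_pre = rng.normal(size=(T0, N)).cumsum(axis=0)   # five random-walk candidates
y = X_pre[:, 2] + 3.0                             # column 2 shifted by a constant
y_c = y - y.mean()
ss_tot = float(y_c @ y_c)
r2 = r2_batch_sketch(y_c, ss_tot, X_pre)
```

Because the slope is not estimated, this R^2 can be negative for a badly mismatched candidate; a candidate that differs from the treated unit only by a constant scores one.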
- mlsynth.utils.fdid_helpers.estimation._record_verbose_step(intermediary_results: list, it: int, best_idx: int, best_r2: float, r2_cand: ndarray, selected: List[int], donor_names: List[Any], current_mean_pre: ndarray, k: int) -> None
Append one forward-selection step to the verbose diagnostics log.
- mlsynth.utils.fdid_helpers.estimation._select_best_donor(X_pre: ndarray, current_mean_pre: ndarray, k: int, remaining_idx: ndarray, y_c: ndarray, ss_tot: float) -> Tuple[int, float, ndarray]
Pick the remaining donor whose addition maximises pre-period R^2.
- mlsynth.utils.fdid_helpers.estimation._update_synthetic_control(current_mean: ndarray, control_outcomes: ndarray, best_idx: int, k: int) -> ndarray
Incrementally fold a newly selected donor into the running average.
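The incremental update is the usual running-mean identity. The sketch below assumes k is the number of donors already in the average before the new one is added; that reading of k, and the name fold_in, are assumptions rather than the helper's documented contract.

```python
import numpy as np

def fold_in(current_mean, control_outcomes, best_idx, k):
    """Mean over k + 1 donors, from the mean over k donors plus one new column."""
    return (k * current_mean + control_outcomes[:, best_idx]) / (k + 1)

rng = np.random.default_rng(1)
Y = rng.normal(size=(8, 4))                 # (T, N) donor outcomes
mean = np.zeros(8)                          # empty average before any donor is chosen
for k, idx in enumerate([3, 0, 2]):         # k donors are already in the mean
    mean = fold_in(mean, Y, idx, k)
```

Each step costs O(T) instead of re-averaging all selected columns, which is what keeps the forward search cheap when the donor pool is large.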
- mlsynth.utils.fdid_helpers.estimation.did_from_mean(treated: ndarray, mean_ctrl: ndarray, pre_periods: int) -> Dict[str, Any]
Difference-in-differences estimate from a pre-computed donor average.
- Parameters:
treated (np.ndarray) – Treated-unit outcome vector, shape (T,).
mean_ctrl (np.ndarray) – Average outcome of the selected donor pool, shape (T,).
pre_periods (int) – Number of pre-treatment periods T0.
- Returns:
dict – Structured result with Effects, Fit, Inference, and Vectors blocks.
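The arithmetic behind a DiD fit on a fixed donor average is short enough to write out. The sketch below returns a flat dictionary with hypothetical keys, not the Effects / Fit / Inference / Vectors blocks described above, and it omits inference entirely.

```python
import numpy as np

def did_sketch(treated, mean_ctrl, pre_periods):
    """Intercept-only DiD: counterfactual = donor average + pre-period mean gap."""
    diff = treated - mean_ctrl
    alpha = diff[:pre_periods].mean()            # the single estimated parameter
    counterfactual = mean_ctrl + alpha
    gap = treated - counterfactual
    return {
        "intercept": float(alpha),
        "att": float(gap[pre_periods:].mean()),  # mean post-period gap
        "pre_rmse": float(np.sqrt(np.mean(gap[:pre_periods] ** 2))),
        "counterfactual": counterfactual,
        "gap": gap,
    }

mean_ctrl = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
treated = mean_ctrl + 2.0                        # parallel trend, level shift of 2
treated[4:] += 1.5                               # effect of 1.5 in the two post periods
out = did_sketch(treated, mean_ctrl, pre_periods=4)
print(out["intercept"], out["att"])              # 2.0 1.5
```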
- mlsynth.utils.fdid_helpers.estimation.forward_did_select(treated_outcome: ndarray, control_outcomes: ndarray, pre_periods: int, donor_names: List[Any], verbose: bool = False) -> Dict[str, Any]
Run Li (2023) forward-selected difference-in-differences.
Sequentially adds the control unit that most improves pre-treatment fit (R^2) with the treated unit, tracks the path of R^2 values, and returns both the textbook all-donor DID and the optimal FDID estimate.
- Parameters:
treated_outcome (np.ndarray) – Treated-unit outcome vector, shape (T,).
control_outcomes (np.ndarray) – Outcome matrix for all potential control units, shape (T, N).
pre_periods (int) – Number of pre-treatment periods T0.
donor_names (list) – Donor labels; length must equal N.
verbose (bool, default False) – If True, attach per-step diagnostics under "intermediary".
- Returns:
dict – {"DID": <all-donor result>, "FDID": <forward-selected result>}.
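The greedy search described above can be sketched end to end on a toy panel. This is not the library's code: forward_select_sketch is a hypothetical name, candidate averages are rebuilt from a running sum, and only the selected subset and the R^2 path are returned. The toy donors are constructed so that the average of donors 0 and 1 tracks the treated unit exactly up to a constant.

```python
import numpy as np

def forward_select_sketch(y, X, pre_periods):
    """Greedy forward selection on pre-period DiD R^2.

    y: (T,) treated outcomes; X: (T, N) donor outcomes.
    Returns (donor prefix up to the best step, R^2 after every step).
    """
    y_pre, X_pre = y[:pre_periods], X[:pre_periods]
    y_c = y_pre - y_pre.mean()
    ss_tot = float(y_c @ y_c)
    remaining = list(range(X_pre.shape[1]))
    order, r2_path = [], []
    running_sum = np.zeros(pre_periods)
    while remaining:
        k = len(order)
        # Average of the current selection plus each remaining donor in turn.
        cand = (running_sum[:, None] + X_pre[:, remaining]) / (k + 1)
        resid = y_c[:, None] - (cand - cand.mean(axis=0))
        r2 = 1.0 - (resid ** 2).sum(axis=0) / ss_tot
        j = int(np.argmax(r2))
        best = remaining.pop(j)
        order.append(best)
        r2_path.append(float(r2[j]))
        running_sum += X_pre[:, best]
    best_step = int(np.argmax(r2_path))
    return order[: best_step + 1], np.array(r2_path)

t = np.arange(12.0)
f = 3.0 * np.sin(t) + 0.5 * t                 # common trend
wiggle = 0.1 * np.cos(3.0 * t)
X = np.column_stack([f + wiggle, f - wiggle, -f, t ** 2, 10.0 * np.cos(t)])
y = f + 5.0                                   # treated unit: trend plus a level shift
subset, r2_path = forward_select_sketch(y, X, pre_periods=8)
print(sorted(subset))                         # [0, 1]
```

Every donor is visited, so the R^2 path has N entries, but only the prefix ending at the best step is kept; the three mismatched donors lower the fit and are discarded.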
References
Li, K. T. (2023). Frontiers: A Simple Forward Difference-in-Differences Method. Marketing Science, 43(2), 267-279. https://doi.org/10.1287/mksc.2022.0212
The Li (2023) analytical standard error, confidence interval, and p-value.
Analytical inference for Forward Difference-in-Differences (Li 2023).
Li (2023) derives a closed-form variance for the difference-in-differences
ATT estimator. Writing the pre-treatment residuals of the treated unit
against its difference-in-differences fit as e_t, the post-period
average treatment effect has asymptotic variance
Var(ATT) = (omega_1 + omega_2) / T1,
where omega_2 = mean(e_t^2) is the pre-period residual variance and
omega_1 = (T1 / T0) * omega_2 inflates it for the post-period sample
size T1. The standard error is the square root of this quantity.
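Substituting the definition of omega_1 into the variance gives an equivalent form that separates the two sample sizes. The display below is only an algebraic rearrangement of the formula above, with T0 and T1 the pre- and post-period counts as used in this module:

```latex
\operatorname{Var}(\widehat{\mathrm{ATT}})
  = \frac{\omega_1 + \omega_2}{T_1}
  = \frac{(T_1/T_0)\,\omega_2 + \omega_2}{T_1}
  = \omega_2\left(\frac{1}{T_0} + \frac{1}{T_1}\right),
\qquad
\mathrm{SE} = \sqrt{\omega_2\left(\frac{1}{T_0} + \frac{1}{T_1}\right)}.
```

One way to read this: the 1/T0 term reflects the noise in the intercept estimated over the pre-period, and the 1/T1 term reflects averaging the post-period gaps.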
- mlsynth.utils.fdid_helpers.inference.did_inference(att: float, pre_residuals: ndarray, pre_periods: int, post_periods: int) -> Tuple[float, Tuple[float, float], float, float]
Compute the Li (2023) analytical SE, 95% CI, p-value, and SATT.
- Parameters:
att (float) – Estimated average treatment effect on the treated.
pre_residuals (np.ndarray) – Pre-treatment residuals of the treated unit against its difference-in-differences fit, shape (T0,).
pre_periods (int) – Number of pre-treatment periods T0.
post_periods (int) – Number of post-treatment periods T1.
- Returns:
se (float) – Analytical standard error of the ATT (nan if undefined).
ci (tuple of float) – (lower, upper) 95% confidence interval.
p_value (float) – Two-sided p-value for the ATT.
satt (float) – Standardised ATT (att / se * sqrt(T1)).
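A minimal sketch of the variance formula from the module docstring, using only the standard library. The normal reference distribution for the interval and p-value, the handling of degenerate inputs, and the name did_inference_sketch are assumptions; the SATT is left out.

```python
import math
from statistics import NormalDist

def did_inference_sketch(att, pre_residuals, pre_periods, post_periods, level=0.95):
    """SE, CI and two-sided p-value from Var(ATT) = (omega_1 + omega_2) / T1."""
    nan = float("nan")
    if pre_periods < 1 or post_periods < 1 or len(pre_residuals) == 0:
        return nan, (nan, nan), nan               # assumed meaning of "undefined"
    omega_2 = sum(e * e for e in pre_residuals) / len(pre_residuals)
    omega_1 = (post_periods / pre_periods) * omega_2
    se = math.sqrt((omega_1 + omega_2) / post_periods)
    if se == 0.0:                                 # perfect pre-period fit: no z-statistic
        return se, (att, att), nan
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 for level=0.95
    ci = (att - z * se, att + z * se)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(att) / se))
    return se, ci, p_value

se, ci, p = did_inference_sketch(
    att=1.5, pre_residuals=[0.5, -0.5, 0.5, -0.5], pre_periods=4, post_periods=2
)
```

In this example omega_2 = 0.25 and omega_1 = 0.125, so the variance is 0.375 / 2 = 0.1875 and the standard error is its square root.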
Assembly of the raw selection output into the typed result containers.
Assemble typed FDID results from the raw estimation dictionaries.
- mlsynth.utils.fdid_helpers.results_assembly.assemble_fdid_results(selector_output: Dict[str, Dict[str, Any]], inputs: FDIDInputs) -> FDIDResults
Build the typed FDIDResults container.
- Parameters:
selector_output (dict) – {"DID": ..., "FDID": ...} as returned by mlsynth.utils.fdid_helpers.estimation.forward_did_select().
inputs (FDIDInputs) – Preprocessed panel.
- Returns:
FDIDResults – Container exposing the FDID (primary) and DID fits.
The observed-versus-counterfactual overlay plot for the FDID and DID fits.
Plotting wrapper for the Forward Difference-in-Differences estimator.
- mlsynth.utils.fdid_helpers.plotter.plot_fdid(results: FDIDResults, *, time: str, unitid: str, outcome: str, treat: str, treated_color: str, counterfactual_color: str | List[str], save: bool | dict) -> None
Plot observed vs FDID and DID counterfactuals.
Plotting failures are downgraded to warnings so a rendering problem never masks a successful estimation.
The Web Appendix E Monte Carlo DGPs (DGP1-DGP4), packaged as
simulate_fdid_sample() so the replication in Verification runs as a
one-liner.
Web Appendix E Monte Carlo DGPs for the Forward DiD method.
Implements the four data-generating processes from Li (2023)
Web Appendix E. Each draw produces one treated unit and N controls over
T1 + T2 periods, generated from three common factors:
with \(v_{kt} \sim \mathcal{N}(0, 1)\) and outcomes
where \(\varepsilon_{it} \sim \mathcal{N}(0, 1)\). The four DGPs vary \((a_0, c_0, c_1, c_2)\):
DGP (a_0, c_0, c_1, c_2)
1 (1, 1, 1, 1) — all controls match (DiD is applicable)
2 (1, 1, 1, 2) — half the controls have mismatched loadings
3 (2, 1, 1, 1) — treated has a different intercept
4 (2, 1, 1, 2) — intercept and half-mismatched loadings
True ATT is zero in every DGP (matching the paper’s PMSE convention; the PMSE is invariant to a constant treatment effect).
Note
The appendix prints f_2t = -0.6 f_{1,t-1} + ... for the lag term, but
the Monte Carlo numbers in Li’s Table 5 match the alternative reading
-0.6 f_{2,t-1} (ARMA(1,1) on \(f_2\) itself). The latter is used
here — it reproduces the paper’s DID PMSE values closely (within ~3%) while
the literal reading reproduces only the FDID column.
- class mlsynth.utils.fdid_helpers.simulation.FDIDSample(df: DataFrame, Y_treated: ndarray, Y_controls: ndarray, T1: int, T2: int, dgp: int)
One draw from a Web Appendix E DGP.
- df
Long panel with columns unit / time / y / treat, ready to feed to mlsynth.FDID.
- Type:
pd.DataFrame
- Y_treated
Treated-unit outcome path, shape (T,).
- Type:
np.ndarray
- Y_controls
Control outcomes, shape (N, T). Rows 0..N//2-1 carry loading c_1; rows N//2..N-1 carry loading c_2.
- Type:
np.ndarray
- T1, T2
Pre- / post-treatment period counts.
- Type:
int
- Y_controls: ndarray
- Y_treated: ndarray
- df: DataFrame
- mlsynth.utils.fdid_helpers.simulation.simulate_fdid_sample(dgp: int, N: int = 60, T1: int = 24, T2: int = 12, rng: Generator | None = None) -> FDIDSample
Draw one sample from FDID Web Appendix E DGP dgp (1-4).
- Parameters:
dgp (int) – Which DGP to draw (1, 2, 3, or 4).
N (int, default 60) – Number of control units (the paper uses N = 60).
T1, T2 (int) – Pre- and post-treatment period counts.
rng (np.random.Generator, optional) – NumPy RNG. Defaults to np.random.default_rng().
- Returns:
FDIDSample