FDID: Forward Difference-in-Differences (Li 2024)

Estimator: Forward Difference-in-Differences (FDID), mlsynth.FDID
Source: Li, Kathleen T. (2024), “Frontiers: A Simple Forward Difference-in-Differences Method,” Marketing Science 43(2) [Li2024].
Replication type: Path A, the author’s released public empirical example (Hong Kong GDP) reproduced cell by cell, and Path B, the paper’s own Monte Carlo (Web Appendix E, Table 5).
Status: Fully verified; empirical and simulation results reproduced.

Validation strategy

Li’s headline application — the effect of opening physical stores on an online-first retailer’s city-level sales — runs on a confidential retailer dataset that cannot be redistributed. For context, that study reports a Forward DiD effect of opening a store in Atlanta of about +$75,143 in monthly sales (an 86% lift, pre-period $R^2 = 0.76$), with ordinary DiD and synthetic control — which fit Atlanta’s steep pre-trend poorly — overstating it.

That headline number cannot be checked value-for-value, but it does not have to be: alongside the paper the author released a public companion replication (MATLAB and R) on the Hsiao, Ching & Wan (2012) Hong Kong GDP panel, which mlsynth reproduces cell by cell (Path A). We additionally reproduce the paper’s own Monte Carlo (Path B), which exercises the mismatched-control regimes a single empirical case cannot.

Path A: Hong Kong GDP

The author’s public dataset, basedata/HongKong.csv, is the Hsiao, Ching & Wan (2012) panel: Hong Kong plus 24 OECD / Asian control economies over 61 quarters. Treatment, the political and economic integration with mainland China, begins at quarter 44, so $T_1 = 44$.

import pandas as pd
from mlsynth import FDID

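# Long-format panel with the columns named in the config below.
df = pd.read_csv("basedata/HongKong.csv")
res = FDID({"df": df, "outcome": "GDP", "treat": "Integration",
            "unitid": "Country", "time": "Time",
            "display_graphs": False, "verbose": False}).fit()
# Headline quantities: ATT, percent ATT, pre-period R^2, number of selected controls.
print(res.fdid.att, res.fdid.att_percent, res.fdid.r_squared, len(res.fdid.selected_names))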

Forward DiD selects 9 of the 24 controls and matches the output of the author’s own Forward DiD code cell by cell (Fun_FDID.R, run live and captured; see Reference parity and runtime, below):

Metric	FDID (mlsynth)	FDID (Li)	DID (mlsynth)	DID (Li)
ATT	0.0254	0.0254	0.0317	0.0317
% ATT	53.84	53.84	77.62	77.62
pre-period $R^2$	0.843	0.843	0.505	0.505
controls used	9	9	24	24

The 95% confidence interval (0.0163, 0.0345) and the standardized ATT (t-statistic) $\approx 5.49$ likewise match the released values. Forward DiD’s far higher pre-period fit ($R^2 = 0.84$ versus DiD’s $0.50$) is the whole point: the all-controls average tracks Hong Kong’s pre-integration path poorly, so plain DiD overstates the effect, while the forward search keeps only the 9 economies that co-move with Hong Kong.
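
As a quick consistency check, the reported interval and t-statistic can be tied together by hand. The sketch below assumes the interval is a symmetric normal-approximation interval with critical value 1.96; that is an assumption, not something stated above. Because the inputs are rounded to four decimals, the implied t-statistic can only be expected to land near the reported 5.49, not on it exactly.

```python
# Values as reported above (rounded to four decimals).
att = 0.0254
ci_low, ci_high = 0.0163, 0.0345

# ASSUMPTION: symmetric normal-approximation interval, z = 1.96.
z = 1.96
midpoint = (ci_low + ci_high) / 2   # should sit on the point estimate
se = (ci_high - ci_low) / (2 * z)   # implied standard error
t_stat = att / se                   # implied standardized ATT

print(f"midpoint={midpoint:.4f}  se={se:.5f}  t={t_stat:.2f}")
```

The midpoint falling on the point estimate is what a symmetric interval would give; the implied t-statistic differs from the reported one only through input rounding.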

The durable check lives in benchmarks/cases/fdid_hongkong.py:

python benchmarks/run_benchmarks.py --case fdid_hongkong

Reference parity and runtime

The reference column above is not transcribed from the readme. It is a live captured run of Kathleen Li’s own Fun_FDID.R, vendored under benchmarks/reference/fdid_hongkong/ with its provenance pinned (R version, data checksum), so the two implementations are compared object to object; mlsynth matches the author’s code to about $10^{-5}$ on every quantity.

On the same Hong Kong panel — warmed up, data load excluded, averaged over 200 calls on one machine — the two implementations differ sharply in speed:

Implementation	per call	work done
Li `Fun_FDID.R` (R)	95.0 ms	forward selection and the point estimate
mlsynth `FDID().fit()` (Python)	6.5 ms	the same selection, plus inference, the all-donor DiD arm, and the typed result object

mlsynth is roughly fifteen times faster while doing strictly more, because its forward search is matrix algebra rather than nested loops. Li’s R code rescans every remaining control with a doubly nested loop, re-averaging the growing donor set from scratch at each step: order $N^2 T$ work at interpreted-loop speed. mlsynth scores all remaining candidates at once. Each step forms the candidate running means by broadcasting and evaluates their pre-period $R^2$ in a single matrix–vector product (_r2_batch in mlsynth.utils.fdid_helpers.estimation), then folds the chosen donor into the running synthetic control with an order-$T$ rank-one mean update instead of re-averaging the subset. The selection path is identical (9 controls, ATT $0.0254$); only the arithmetic is reorganized, which is why the numbers agree to the displayed precision while the wall-clock times do not.
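
The batched search described above can be illustrated with a minimal NumPy sketch. It is not mlsynth’s _r2_batch or Li’s Fun_FDID.R: the names did_r2_batch and forward_select are hypothetical, a running sum stands in for the running-mean update, and the stopping rule (keep the prefix of the selection path with the highest pre-period $R^2$) is an assumption about the method. The toy data are synthetic.

```python
import numpy as np

def did_r2_batch(y, cand_means):
    """Pre-period R^2 of the DiD fit y_t = alpha + m_t, one value per column m."""
    gap = y[:, None] - cand_means          # (T0, K) gaps to each candidate mean
    resid = gap - gap.mean(axis=0)         # subtract each candidate's alpha-hat
    ss_res = (resid ** 2).sum(axis=0)
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

def forward_select(y_pre, X_pre):
    """Greedy forward search; hypothetical helper, not the library API."""
    T0, N = X_pre.shape
    remaining = list(range(N))
    selected, path_r2 = [], []
    running_sum = np.zeros(T0)             # sum of donors chosen so far
    for k in range(1, N + 1):
        # Broadcast: mean of (chosen donors + each remaining candidate).
        cand = (running_sum[:, None] + X_pre[:, remaining]) / k
        r2 = did_r2_batch(y_pre, cand)
        best = int(np.argmax(r2))
        j = remaining.pop(best)
        selected.append(j)
        running_sum += X_pre[:, j]         # order-T update, no re-averaging
        path_r2.append(float(r2[best]))
    k_star = int(np.argmax(path_r2)) + 1   # ASSUMED rule: best prefix of the path
    return selected[:k_star], path_r2[k_star - 1]

# Toy panel: controls 0-4 share the treated unit's loading, 5-9 do not.
rng = np.random.default_rng(0)
T0 = 40
f = 3.0 * np.sin(np.arange(T0) / 4.0)
y_pre = 1.0 + f + 0.1 * rng.standard_normal(T0)
good = f[:, None] + 0.1 * rng.standard_normal((T0, 5))
bad = 2.0 * f[:, None] + 0.1 * rng.standard_normal((T0, 5))
X_pre = np.hstack([good, bad])

selected, r2 = forward_select(y_pre, X_pre)
r2_all = float(did_r2_batch(y_pre, X_pre.mean(axis=1, keepdims=True))[0])
print(sorted(selected), round(r2, 3), round(r2_all, 3))
```

Each step costs one broadcasted subtraction and one column-wise reduction over the remaining candidates, which is the reorganization the paragraph above credits for the speed difference.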

Path B: the simulation design

The four DGPs and their factor structure are packaged in mlsynth.utils.fdid_helpers.simulation.simulate_fdid_sample(): three common factors — f1 AR(1) 0.8, f2 ARMA(1,1) (-0.6, 0.8), f3 MA(2) (0.9, 0.4), innovations $N(0,1)$ — with outcomes $a_0 + c_0 \sum_k f_{kt} + \varepsilon$ for the treated unit and $1 + c \sum_k f_{kt} + \varepsilon$ for the controls (the first half of the donor pool loading $c_1$, the second half $c_2$). The four DGPs vary $(a_0, c_0, c_1, c_2)$:

DGP1 (1,1,1,1) and DGP3 (2,1,1,1) — all controls share the treated unit’s factor loading, so ordinary DiD is applicable.
DGP2 (1,1,1,2) and DGP4 (2,1,1,2) — half the controls carry the wrong loading, so the all-controls DiD average is contaminated.

The true ATT is $0$, and the reported risk is the predictive MSE $\mathrm{PMSE} = M^{-1}\sum_j \widehat{\mathrm{ATT}}_j^2$.
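
To make the design concrete, here is a minimal standalone sketch of the data-generating process as described above. It is not the packaged simulate_fdid_sample(): the function names, the burn-in length, the sign conventions of the ARMA and MA recursions, and the $N(0,1)$ distribution of the idiosyncratic noise $\varepsilon$ are all assumptions made for illustration.

```python
import numpy as np

# (a0, c0, c1, c2) for the four DGPs, as listed above.
DGPS = {1: (1, 1, 1, 1), 2: (1, 1, 1, 2), 3: (2, 1, 1, 1), 4: (2, 1, 1, 2)}

def simulate_factor_sum(T, rng, burn=50):
    """Sum of the three common factors; recursion conventions are ASSUMED."""
    n = T + burn
    e1, e2, e3 = rng.standard_normal((3, n))
    f1, f2, f3 = np.zeros(n), np.zeros(n), np.zeros(n)
    for t in range(2, n):
        f1[t] = 0.8 * f1[t - 1] + e1[t]                      # AR(1), 0.8
        f2[t] = -0.6 * f2[t - 1] + e2[t] + 0.8 * e2[t - 1]   # ARMA(1,1), (-0.6, 0.8)
        f3[t] = e3[t] + 0.9 * e3[t - 1] + 0.4 * e3[t - 2]    # MA(2), (0.9, 0.4)
    return (f1 + f2 + f3)[burn:]                             # drop the burn-in

def sketch_sample(dgp, N, T1, T2, rng):
    """Hypothetical helper: treated series (T,) and control matrix (T, N)."""
    a0, c0, c1, c2 = DGPS[dgp]
    T = T1 + T2
    fsum = simulate_factor_sum(T, rng)
    # True ATT is 0, so no shift is added to the post-period.
    y_treated = a0 + c0 * fsum + rng.standard_normal(T)      # ASSUMED eps ~ N(0,1)
    # First half of the donor pool loads c1, second half c2.
    loadings = np.where(np.arange(N) < N // 2, c1, c2)
    Y_controls = 1.0 + fsum[:, None] * loadings[None, :] + rng.standard_normal((T, N))
    return y_treated, Y_controls

# Same seed, DGP1 vs DGP3: only the treated intercept a0 differs.
y1, Y1 = sketch_sample(1, N=60, T1=12, T2=6, rng=np.random.default_rng(0))
y3, Y3 = sketch_sample(3, N=60, T1=12, T2=6, rng=np.random.default_rng(0))
print(Y1.shape, float(np.mean(y3 - y1)))
```

With a common seed the DGP1 and DGP3 draws differ only by the constant shift in $a_0$, which makes paired comparisons across those designs straightforward.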

Reproducing Table 5

import numpy as np
from mlsynth import FDID
from mlsynth.utils.fdid_helpers.simulation import simulate_fdid_sample

def pmse_cell(dgp, N, T1, T2, M, seed=0):
    fdid_sq, did_sq = [], []
    for j in range(M):
        rng = np.random.default_rng(seed + j)
        sample = simulate_fdid_sample(dgp=dgp, N=N, T1=T1, T2=T2, rng=rng)
        res = FDID({"df": sample.df, "outcome": "y", "treat": "treat",
                    "unitid": "unit", "time": "time",
                    "display_graphs": False, "verbose": False}).fit()
        fdid_sq.append(res.fdid.att ** 2)   # true ATT is 0, so the squared error is att**2
        did_sq.append(res.did.att ** 2)
    return float(np.mean(fdid_sq)), float(np.mean(did_sq))

for dgp in (1, 2, 3, 4):
    for T1, T2 in [(12, 6), (24, 12), (48, 24)]:
        fdid_pmse, did_pmse = pmse_cell(dgp, N=60, T1=T1, T2=T2, M=1000)
        print(f"DGP{dgp} ({T1},{T2}): FDID={fdid_pmse:.4f}  DID={did_pmse:.4f}")

Results

At $M = 1{,}000$ (Li uses $M = 10{,}000$; the lower replication count, which shortens the runtime, is the only material change) this reproduces Table 5 cell by cell, up to simulation noise:

DGP	$(T_1, T_2)$	DID (mlsynth)	DID (Li)	FDID (mlsynth)	FDID (Li)
1	(12, 6)	0.265	0.259	0.325	0.315
1	(24, 12)	0.127	0.128	0.147	0.146
1	(48, 24)	0.065	0.063	0.075	0.071
2	(12, 6)	1.202	1.037	0.431	0.385
2	(24, 12)	0.765	0.746	0.177	0.180
2	(48, 24)	0.451	0.473	0.084	0.082
3	(12, 6)	0.265	0.252	0.325	0.303
3	(24, 12)	0.127	0.123	0.147	0.143
3	(48, 24)	0.065	0.064	0.075	0.072
4	(12, 6)	1.202	1.038	0.431	0.391
4	(24, 12)	0.765	0.744	0.177	0.171
4	(48, 24)	0.451	0.454	0.084	0.081

What it confirms

The two headline findings reproduce:

When all controls are valid (DGP1, DGP3), DiD is the parsimonious, efficient choice and edges out Forward DiD by a small margin at every horizon — Forward DiD pays only a small efficiency cost for the safety of the search.
When half the controls are mismatched (DGP2, DGP4), DiD’s PMSE stays large and does not shrink as the panel grows (DGP2 at $(48,24)$: DID still 0.45) because the contaminating controls bias the all-controls average, while Forward DiD’s PMSE collapses (0.084) because the forward search discards them. Forward DiD wins decisively when DiD is invalid — Li’s central result.

The identity of the DGP1/DGP3 (and DGP2/DGP4) rows in the mlsynth columns also confirms the estimator’s intercept invariance: moving $a_0$ from 1 to 2 changes nothing because Forward DiD’s $\widehat\alpha$ absorbs it. The mlsynth $(12, 6)$ cells run somewhat above Li’s under DGP2/4, consistent with Monte Carlo noise at $M = 1{,}000$ versus Li’s $M = 10{,}000$.
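
The intercept invariance can be checked directly on a toy panel. The sketch below assumes the DiD-with-intercept form in which $\widehat\alpha$ is the mean pre-period gap between the treated unit and the control average, and the ATT is the mean post-period gap net of $\widehat\alpha$; did_att is a hypothetical helper, not part of mlsynth.

```python
import numpy as np

def did_att(y, controls, T1):
    """DiD with an intercept: alpha-hat from the pre-period, ATT from the post-period."""
    gap = y - controls.mean(axis=1)       # treated minus control average
    alpha_hat = gap[:T1].mean()           # absorbs any constant level shift
    return float((gap[T1:] - alpha_hat).mean())

rng = np.random.default_rng(1)
T1, T2, N = 12, 6, 10
controls = rng.standard_normal((T1 + T2, N))
y = 1.0 + controls.mean(axis=1) + rng.standard_normal(T1 + T2)

att_a0_1 = did_att(y, controls, T1)         # intercept 1
att_a0_2 = did_att(y + 1.0, controls, T1)   # intercept 2: same ATT
print(att_a0_1, att_a0_2)
```

A constant added to the treated series raises every gap by the same amount, so it cancels between the pre-period and post-period means. The pre-period $R^2$ is unchanged for the same reason, so the forward selection itself should not move either.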
