Panel Data Approach (PDA)#
When to Use This Estimator#
The panel data approach (PDA) of Hsiao, Ching and Wan [HCW] estimates a single treated unit’s untreated counterfactual by a linear regression on the control units, fit over the pre-treatment window and extrapolated out-of-sample. Unlike the synthetic control method, PDA imposes no constraints on the coefficients (no simplex, no non-negativity), so it is a plain projection justified by a latent-factor model: if every unit loads on a small set of common factors, the treated unit’s untreated path is a linear combination of the controls’ paths plus an orthogonal error.
The challenge is which controls and how many. Classical PDA was built
for low dimensions (few controls relative to pre-periods) and chooses controls
by AIC/BIC, which break down once the number of controls N approaches or
exceeds the pre-period length T0. mlsynth packages the original
Hsiao-Ching-Wan method together with three high-dimensional variants that
resolve the N-near-T0 problem differently, each with the estimation and
inference theory of its own paper:
Original best-subset (
hcw; Hsiao, Ching & Wan [HCW2012]) – the method that started the literature. The counterfactual is an unrestricted OLS regression (with intercept) of the treated pre-period on the best subset of controls, the subset and its size chosen by AICc (also AIC / BIC). Built for the low-dimensional regime (a moderate, pre-screened candidate pool); the best-subset search is combinatorial, so caphcw_nvmaxor pre-restrict the pool for largeN. The three estimators below are its scalable descendants.L2-relaxation (
l2; Shi & Wang [l2relax]) – a dense estimator (a “cousin of ridge”) for when the factor model makes the projection coefficients dense; toleratesN > T0; prediction is robust to heteroskedasticity.LASSO (
lasso; Li & Bell [LASSOPDA]) – an L1 estimator that selects a sparse set of relevant controls; computationally far cheaper than AIC/BIC and works forN > T0.Forward selection (
fs; Shi & Huang [fsPDA]) – a greedy procedure that grows the control set one unit at a time, with valid post-selection inference and no sparsity requirement (works for dense or sparse models). The scalable replacement forhcw’s combinatorial best-subset search.
All variants target the single-treated-unit, many-candidate-controls regime
and produce a time-varying treatment effect and an average treatment effect
(ATE) with a HAC-based confidence interval. The practical choice among them
(detailed below) follows the authors’ own arguments: l2 when the
coefficients are dense, lasso when only a few controls matter and you want
an interpretable selection, fs when you want a cheap, predictive ensemble
with honest post-selection inference regardless of sparsity.
Notation#
We use mlsynth’s standard notation, with the two sample operators of Shi & Wang
[l2relax] defined below. For a positive integer n,
\([n] = \{1, \ldots, n\}\); \(\mathbf{I}_n\) is the identity,
\(\mathbf{1}_n\), \(\mathbf{0}_n\) the all-ones/zeros vectors. For a
matrix \(\mathbf{A} = (a_{ij})\) and index sets
\(\mathcal{S}, \mathcal{Q}\), the submatrix is
\(\mathbf{A}_{\mathcal{S}\mathcal{Q}} = (a_{ij})_{i\in\mathcal{S}, j\in\mathcal{Q}}\)
and the subvector \(\mathbf{x}_{\mathcal{Q}} = (x_i)_{i\in\mathcal{Q}}\).
\(\phi_{\min}(\cdot)\), \(\phi_{\max}(\cdot)\) denote the smallest and
largest eigenvalues; \(\|\mathbf{A}\|_\infty = \max_{ij}|a_{ij}|\),
\(\|\mathbf{A}\|_2 = \sqrt{\phi_{\max}(\mathbf{A}'\mathbf{A})}\),
\(\|\mathbf{x}\|_1 = \sum_i |x_i|\).
The two time-series operators are central. Over an index set \(\mathcal{S}\) of periods,
the sample mean and sample covariance of a series over \(\mathcal{S}\). The intervention takes effect after the split point \(T_0\). The pre-period is \(\mathcal{T}_1 \coloneqq \{1, \ldots, T_0\}\) with \(|\mathcal{T}_1| = T_0\); the post-period is \(\mathcal{T}_2 \coloneqq \{T_0+1, \ldots, T\}\) with \(T_2 \coloneqq |\mathcal{T}_2| = T - T_0\). Let \(j=1\) denote the treated unit, with treated series \(\mathbf{y}_1 = (y_{11}, \ldots, y_{1T})^\top\); the donor pool is \(\mathcal{N}_0 \coloneqq \mathcal{N}\setminus\{1\}\), with cardinality \(N_0\), where \(\mathcal{N} = \{1,\ldots,N\}\).
Original best subset (hcw, Hsiao-Ching-Wan)#
Idea. This is the method that started the literature [HCW2012]. The recipe is the plainest one in the family – ordinary least squares of the treated unit’s pre-period outcome on a handful of control series, extrapolated past the intervention – with one twist: which controls, and how many, are chosen for you rather than fixed in advance. HCW’s Section 5 turns that choice into a model-selection problem. It helps to walk through it as four steps.
Step 1 – measure how well a candidate set of controls fits. Pick any subset \(\mathcal{S}\subseteq\mathcal{N}_0\) of the donors. Regress the treated pre-period outcome on those donors and an intercept by OLS, and record the residual sum of squares – the total squared pre-period miss,
A smaller \(\mathrm{RSS}\) means a tighter pre-period fit. By itself it is not a usable score, though: adding any control can only lower \(\mathrm{RSS}\), so “smallest RSS” always points at the largest possible model, which memorises the short pre-period and extrapolates badly past the cut.
Step 2 – charge for complexity with an information criterion. To stop the search from simply taking every donor, each subset is scored by an information criterion: the (log) fit plus a penalty that grows with the number of parameters. The default is the small-sample-corrected AICc (AIC and BIC are the same shape with a lighter or heavier penalty),
Lower is better. The first term falls as the fit improves; the second rises with
the model size \(r\), so the criterion bottoms out where one more donor stops
paying for the variance it adds – the bias-variance trade-off turned into a
single number. The penalised parameter count \(K = r + 2\) counts the
\(r\) donors, the intercept, and the error variance (the convention of the
pampe R package, leaps::regsubsets + AICc + lm); the small-sample
correction \(2K(K+1)/(T_0-K-1)\) matters because PDA’s pre-period is short.
Step 3 – keep the subset with the best score. The selected controls are the global minimiser
In practice this is computed the way leaps does it (HCW Section 5): for each
model size \(r\) find the single lowest-\(\mathrm{RSS}\) subset of that
size, then choose the size whose best subset minimises the criterion – a “best
of each size, then choose the size” search. The cap \(r_{\max}\)
(hcw_nvmax, pampe’s nvmax) limits the largest size considered and cannot
exceed the pre-period degrees of freedom (the criterion is undefined once the
donors plus intercept use up the \(T_0\) observations).
Step 4 – refit and extrapolate. Re-fit OLS on the chosen controls
\(\widehat{\mathcal{S}}\) over the pre-period, then carry the coefficients
forward: the counterfactual is that fitted line evaluated on the post-period
control values, and the per-period effect is treated minus counterfactual,
exactly as in the shared model above. On HCW’s Hong Kong sovereignty study this
selects \(\{\text{Japan},\text{Korea},\text{Taiwan},\text{USA}\}\) with
\(\mathrm{AICc} = -171.771\), reproducing Table XVI value-for-value
(cross-validated against pampe – see the benchmark below).
Best subset selection is HCW’s classical, low-dimensional choice: AICc / AIC /
BIC are only defined while \(r < T_0\), and the search is combinatorial in
\(N_0\). HCW therefore pre-screened the donor pool (limiting Hong Kong to
ten candidate economies). [HTT2020] place best subset alongside its two cheaper
relatives – forward stepwise and the lasso – and show forward stepwise tracks
best subset closely, which is exactly why fs is hcw’s scalable
descendant; lasso and l2 are the others. Use hcw when the candidate
pool is moderate and pre-screened, when you want the original method or to
reproduce HCW / pampe, and when an exact, certified optimum is the point.
Computing the best subset: the optimisation#
There are \(2^{N_0}\) candidate subsets, each an OLS fit, so naive
enumeration is the regime where even leaps strains. mlsynth solves the
problem exactly with a branch-and-bound built on three ideas, plus an optional
fourth that hands the problem to a mixed-integer solver.
Furnival-Wilson sweep engine. The default is the canonical
leaps::regsubsets algorithm [FurnivalWilson]: a depth-first search over
subsets driven by the symmetric sweep operator on the augmented cross-product
matrix \([\mathbf{1},\mathbf{X},\mathbf{y}]'
[\mathbf{1},\mathbf{X},\mathbf{y}]\). Sweeping a donor in leaves the running
\(\mathrm{RSS}\) in the response diagonal in \(O(p^2)\), and the reverse
sweep removes it on backtracking, so no OLS is ever re-solved from scratch. Each
subtree is bounded below by the smallest \(\mathrm{RSS}\) it can reach
(include every remaining donor – \(\mathrm{RSS}\) is monotone in the
variable set) paired with the smallest information-criterion penalty any
descendant can carry; when that lower bound cannot beat the incumbent, the
whole subtree is pruned. The bound is a true lower bound, so the returned subset
is the exact global optimum – identical to exhaustive enumeration, at a small
fraction of the cost.
Discrete first-order warm start. The incumbent is seeded, at every model size, with the Bertsimas-King-Mazumder projected-gradient hard-thresholding method [BKM2016] (their Algorithm 1): iterate \(\boldsymbol{\beta}\leftarrow H_r\!\bigl(\boldsymbol{\beta} - \tfrac{1}{L} \nabla g(\boldsymbol{\beta})\bigr)\), where \(H_r\) keeps the \(r\) largest-magnitude coordinates and \(L\) upper-bounds the largest eigenvalue of the centred donor Gram, with a least-squares refit on the active set and several restarts. A near-optimal incumbent tightens the bound from the outset and prunes more subtrees; because the seed can only lower the incumbent, the certified optimum is unchanged.
Node budget and a certified optimality gap. For a pool too large to certify
quickly, the search stops at a node budget and returns the best incumbent found
together with a valid lower bound – the smallest subtree bound it never reached.
The reported \(\text{optimality gap} = \text{incumbent} - \text{lower
bound}\) is then a genuine suboptimality certificate (zero means provably
optimal), the Bertsimas-King-Mazumder gap obtained from the branch-and-bound’s
own bounds with no solver. This replaces an outright refusal of large pools:
fits["hcw"].metadata reports certified_optimal and optimality_gap.
Optional exact mixed-integer backend. Setting hcw_backend="scip" certifies
the optimum at pool sizes beyond the branch-and-bound’s exact reach via the SCIP
solver. It casts the size-\(r\) problem as the Bertsimas-King-Mazumder
cardinality-constrained least squares – binary indicators \(z_i\), an SOS-1
coupling forcing \(z_i = 0 \Rightarrow \beta_i = 0\), and
\(\sum_i z_i \le r\) – solved once per size, then chooses the size by the
criterion. The dependency is optional: pyscipopt is imported only on demand
(pip install mlsynth[scip]). The default fw backend needs no solver and
is all the low-dimensional regime hcw targets requires.
Assumptions and inference. hcw shares the identifying stack documented
under Shared assumptions across the PDA class
(the latent factor model A1, single absorbing treatment A2, weak temporal
dependence A3), specialised to the low-dimensional regime \(N_0 < T_0\) that
makes the information criteria well-defined. Inference mirrors the post-selection
HAC machinery of the scalable variants: the average effect carries a
Newey-West / Bartlett long-run-variance confidence interval (prewhitened by
default; lrvar_lag switches to a fixed-lag Bartlett estimator), and
prediction_intervals=True attaches the Jiang et al. per-period intervals
described below.
L2-relaxation (l2, Shi & Wang)#
Idea. Under the factor model the projection coefficient \(\boldsymbol{\beta}^0 = \boldsymbol{\Omega}^{-1}\boldsymbol{\Lambda} (\boldsymbol{\Lambda}'\boldsymbol{\Omega}^{-1}\boldsymbol{\Lambda} + \mathbf{I}_q)^{-1}\boldsymbol{\lambda}_1\) is dense in general – almost no entries are exactly zero. Sparse methods (LASSO) are then mis-matched. With \(\widehat{\boldsymbol{\Sigma}} = \Gamma_{\mathcal{T}_1}(\mathbf{x}_t, \mathbf{x}_t')\) and \(\widehat{\boldsymbol{\eta}} = \Gamma_{\mathcal{T}_1}(\mathbf{x}_t, y_t)\), OLS solves the KKT condition \(\widehat{\boldsymbol{\Sigma}}\boldsymbol{\beta} = \widehat{\boldsymbol{\eta}}\), which is unstable or non-unique once \(N\) is close to or exceeds \(T_0\). L2-relaxation relaxes the sup-norm of this moment condition by a tuning parameter \(\varepsilon\) and minimizes the coefficient norm:
This is the “bias-variance trade-off” made explicit: tolerating a small
violation \(\varepsilon\) of the OLS moment condition shrinks the variance. At
\(\varepsilon = 0\) it reduces to (ridgeless) OLS; at \(\varepsilon \ge
\|\widehat{\boldsymbol{\eta}}\|_\infty\) it gives \(\boldsymbol{\beta} =
\mathbf{0}\). mlsynth picks \(\varepsilon\) by sequential out-of-sample
validation on the tail of the training window (the validated \(\varepsilon\)
tracks the infeasible-optimal one, and both shrink toward zero as the sample
grows) over a log-spaced grid down to \(10^{-4}\max|\widehat{\boldsymbol{\eta}}|\)
(the optimum is often a tiny fraction of the cap). This is time-respecting –
the fit never sees periods later than the validation tail – unlike the
released L2relax.CV, whose 5-block K-fold trains on both past and future
of each block.
Note
Standardisation. Following the authors’ released L2relax, the treated
and control series are standardised (demeaned and scaled to unit
variance) before forming \(\widehat{\boldsymbol{\Sigma}}\) /
\(\widehat{\boldsymbol{\eta}}\), and the solution is mapped back to the
original scale. This is the default (l2_standardize=True) – the
\(\ell_2\) penalty is scale-sensitive, so standardisation is both
recommended and what reproduces the paper’s empirical results; on the Hong
Kong panel it moves the L2 estimate from \(2.48\%\) to \(2.61\%\)
(closer to Shi & Wang’s \(2.65\%\)). Set l2_standardize=False for the
raw-scale variant.
Assumptions (Shi & Wang).
Assumption 1 (loadings). \(\|\boldsymbol{\lambda}_1\|_\infty + \|\boldsymbol{\Lambda}\|_\infty \le C\), and there is an \(r\)-unit subset making \(\boldsymbol{\Lambda}_{\mathcal{Q}\cdot}\) full column rank; the average factor strength \(\xi_N = \phi_{\min}(\boldsymbol{\Lambda}' \boldsymbol{\Lambda}/N)\) may vanish (weak factors allowed).
Remark. Bounded loadings keep the projection coefficient well-defined, while the full-rank subset is what lets the donors span the treated unit’s factor exposure; allowing \(\xi_N \to 0\) means the method tolerates weak factors that contribute little cross-sectional variation.
Assumption 2 (errors). The idiosyncratic covariance \(\boldsymbol{\Omega}\) has eigenvalues bounded between \(\underline\sigma^2\) and \(\overline\sigma^2\); errors may be heteroskedastic and cross-sectionally dependent.
Remark. The two-sided eigenvalue bound rules out a degenerate noise direction, but heteroskedastic, cross-sectionally dependent errors are permitted – the realistic case for economic panels.
Assumption 3-4 (sampling). In- and out-of-sample sampling errors of the sample moments are \(O_p(T_0^{-1/2})\) for low-dimensional pieces and \(O_p(\sqrt{\log N / (N\wedge T_0)})\) for high-dimensional ones (holding under time-series weak dependence, not just i.i.d.).
Remark. These are the convergence rates that make the relaxed moment condition usable in high dimension: the high-dimensional pieces pay only a \(\sqrt{\log N}\) price, so \(N\) may grow with (or beyond) \(T_0\).
Assumption 5 (ATE inference). The oracle prediction error \(\epsilon_t^*\) and the effect-plus-error \(d_t^* = \Delta_t - \mathbb{E}[\Delta_t] + \epsilon_t^*\) have finite, positive long-run variances \(\rho^2_{\epsilon^*}\), \(\rho^2_{d^*}\) consistently estimable by HAC, and a sequential CLT applies.
Remark. The coefficient estimator is consistent for the oracle target (Theorem 1) and – importantly – the prediction error is asymptotically irrelevant to heteroskedasticity (Theorem 2): unlike the coefficient MSE, the out-of-sample MSE does not depend on the noise heterogeneity \(\psi_{\max}\). This is what licenses the HAC-based CLT for the ATE.
Inference (Shi & Wang, Theorem 3; single treated unit). With pre-period prediction residuals \(e_t = y_{1t} - \widehat{y}_{1t}\) (\(t\in\mathcal{T}_1\)) and post-period effects \(\tau_t\) (\(t\in\mathcal{T}_2\)),
where \(\widehat{\rho}^2_{(1)}\) is the HAC long-run variance of the pre-period residuals (first-stage estimation uncertainty) and \(\widehat{\rho}^2_{(2)}\) is the HAC long-run variance of the de-meaned post-period effects (post-period averaging). Both sources of uncertainty enter, which matters when \(T_0\) and \(T_2\) are comparable.
When to use. Dense, factor-driven coefficients; high dimension (\(N>T_0\) permitted); when prediction accuracy and heteroskedasticity- robustness matter more than identifying a handful of controls.
LASSO (lasso, Li & Bell)#
Idea. When only a few controls are truly relevant, an L1 penalty selects them and shrinks the rest. Li & Bell estimate
with \(\lambda\) chosen by (leave-one-out) cross-validation, then predict the counterfactual as in the shared model. LASSO works for \(N > T_0\) (where AIC/AICC/BIC cannot even be computed) and is far cheaper.
Assumptions (Li & Bell). They relax HCW’s linear-conditional-mean assumption and drop one of HCW’s identification conditions. The key conditions are: a factor model with \(\mathrm{Rank}(\widetilde{B}) = K\) (enough independent factor variation among the controls); a weakly dependent, weakly stationary panel so laws of large numbers and CLTs apply to partial sums; consistency of the pre-period least-squares pieces (\(\widehat{\delta}_1 - \delta_1, \widehat{\delta}-\delta = O_p(T_0^{-1/2})\)); and \(\rho\)-mixing with geometric decay plus a finite limit \(\eta = \lim T_2/T_0\). Sparsity (only \(m\) of \(\boldsymbol{\beta}\) non-zero, \(m\) fixed or \(o(T_0)\)) is assumed for the high-dimensional selection.
Remark. Li & Bell prove consistency \(\widehat{\tau} - \Delta_{\mathcal{T}_2} = O_p(T_0^{-1/2} + T_2^{-1/2})\) (estimation error has two parts: first-stage \(O_p(T_0^{-1/2})\) and post-averaging \(O_p(T_2^{-1/2})\)), holding even when \(y_{1t}\) is trend-stationary.
Inference (Li & Bell, Theorem 3.2). With \(\widehat{\Sigma} = \widehat{\Sigma}_1 + \widehat{\Sigma}_2\),
where \(\widehat{\Sigma}_2\) is the Newey-West HAC long-run variance of the post-period effects, and \(\widehat{\Sigma}_1\) is the first-stage (pre-period estimation) variance – the OLS prediction variance of the mean post-period counterfactual on the selected support. Li & Bell note \(\widehat{\Sigma}_1\) is negligible when \(T_0 \gg T_2\), so the post-period term dominates in long-pre-period panels.
When to use. A genuinely sparse set of relevant controls; very large \(N\) (even \(N/T_0 \to \infty\)); when an interpretable, computa- tionally cheap selection is preferred. (For selection consistency, Li & Bell note the adaptive LASSO; for prediction, plain LASSO already beats AIC/BIC and leave-many-out CV in their simulations.)
Forward selection (fs, Shi & Huang)#
Idea. Rather than penalize, grow the control set greedily. Start empty; at each step add the control whose inclusion maximizes the pre-treatment OLS \(R^2\) (equivalently minimizes the residual sum of squares). The number of selected controls \(R\) is a tuning parameter chosen by a modified BIC (Wang, Li & Tsai),
with \(\widehat{\sigma}^2(\widehat{U}_r)\) the pre-period residual variance of OLS on the \(r\)-unit set. The counterfactual is the OLS extrapolation on the chosen set \(\widehat{U}_{\widehat{R}}\). Forward selection evaluates \(\sum_r (N-r+1)\) regressions – linear in \(N\) – versus the \(2^N\) of exhaustive subset search.
Assumptions (Shi & Huang). Asymptotics are multi-index: \(N\to\infty\) with \(T_0 = T_0(N)\) deterministic, \(\log N / T_0 \to 0\), and \(T_2 = T_2(N) \to \infty\) with \(\log N / T_2 \to 0\) (\(N\) may exceed \(T_0\)).
Assumption 1 (sparse Riesz / restricted eigenvalue). The minimal eigenvalue of the population Gram matrix over any \(u\)-unit subset (\(u \le (1+\delta_1)R\)) is bounded below – a condition the authors show is a natural implication of the latent factor model, not an ad hoc restriction.
Remark. The bounded minimal eigenvalue is what keeps any candidate subset of donors well-conditioned, so the greedy step can identify the next informative control rather than chase a near-singular direction; that it follows from the factor model means it is not an extra demand on the data.
Assumption 2 (second moments). Sample second moments converge at the high-dimensional rate \(O_p(\sqrt{\log N / T_0})\) with bounded fourth moments.
Remark. The high-dimensional rate is the pre-period price of admitting many donors: with bounded fourth moments the sample Gram matrix tracks its population counterpart even when \(N\) is large relative to \(T_0\).
Assumption 3 (post-period). Analogous convergence and long-run-variance bounds on the post-treatment data.
Remark. The matching post-period bounds are what let the average effect over \(\mathcal{T}_2\) obey a central limit theorem with a consistently estimable long-run variance, the basis for the post-selection test below.
Assumption 4 (weak dependence). The series are strong (\(\alpha\)-) mixing with geometric decay, so a Berry-Esseen bound for heterogeneous time series applies.
Remark. The validity is uniform over a class of DGPs (Theorem 1) – it holds whether the true coefficients are dense or sparse, which separates fsPDA from the post-selection-inference literature that needs sparsity or the oracle property. Theorem 2 shows the greedy algorithm attains a regression variance asymptotically as small as the best \(u\)-unit subset, so the cheap forward search is statistically efficient.
Inference (Shi & Huang, Eq. 4). Because forward selection uses only the pre-period and, under weak dependence, the pre- and post-periods become asymptotically independent (sample splitting), the naive conditional t-statistic is valid:
where \(\widehat{\rho}^2_{\tau\widehat{U}}\) is the HAC long-run variance of the de-meaned post-period effects. No first-stage variance term is needed – the asymptotic independence absorbs it – which makes fsPDA’s inference the simplest of the three.
Note
Long-run-variance estimator. mlsynth defaults to the prewhitened
Newey-West estimator (Andrews-Monahan VAR(1) prewhitening + Bartlett kernel
with the data-driven NW(1994) bandwidth + finite-sample adjustment) – R’s
sandwich::lrvar(..., prewhite = TRUE, adjust = TRUE), which Shi & Huang
use in their application scripts. Prewhitening is essential when the
treatment-effect series is strongly serially dependent: monthly growth rates
mean-revert (lag-1 autocorrelation around \(-0.45\) in the luxury-watch
panel), and a plain Bartlett kernel cannot absorb that, leaving
\(\widehat\rho\) nearly double its true value and the test far too
conservative. Setting lrvar_lag instead switches to the released
est.fsPDA package’s fixed-lag Bartlett estimator
(default lag \(\lfloor T_2^{1/4}\rfloor\), capped at
\(\lfloor\sqrt{T_2}\rfloor\)); on the watch panel that no-prewhitening
form gives an insignificant \(t \approx -1.15\), versus the prewhitened
default’s \(-2.51\) (the paper reports \(-2.457\)).
When to use. A large candidate-control pool where the goal is to synthesize an ensemble that mimics the outcome (not to interpret which controls are “causal”); when computational efficiency and honest post-selection inference matter; and regardless of whether the underlying model is sparse. (Shi & Huang recommend the adaptive LASSO instead when the identity of a few causal controls is the object of interest.)
Choosing among the three#
Method |
Coefficient structure |
Inference variance |
Use when |
|---|---|---|---|
|
dense (factor-implied) |
pre + post HAC (both terms) |
dense coefficients; |
|
sparse |
post HAC (+ first stage, small if |
few relevant controls; interpretable selection; very large |
|
dense or sparse |
post HAC only (sample splitting) |
large pool; predictive ensemble; cheap; honest post-selection inference |
Empirical Illustration: Hong Kong economic integration#
The original HCW [HCW] application – and Shi & Huang’s Example 1 – evaluates the effect of economic integration with mainland China on Hong Kong’s quarterly real-GDP growth, using 24 comparison economies. Here all four PDA methods run on the same data, and the package returns the time-varying effect, the ATE, and a HAC confidence interval for each.
import pandas as pd
from mlsynth import PDA
url = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/refs/heads/main/basedata/HongKong.csv"
df = pd.read_csv(url)
res = PDA({"df": df, "outcome": "GDP", "treat": "Integration",
"unitid": "Country", "time": "Time",
"methods": ["hcw", "l2", "LASSO", "fs"], "alpha": 0.05,
"display_graphs": True}).fit()
for name, fit in res.fits.items():
print(f"{name:6s} ATE {fit.att:.4f} SE {fit.att_se:.4f} "
f"95% CI ({fit.ci[0]:.4f}, {fit.ci[1]:.4f}) p={fit.p_value:.3f} "
f"donors={len(fit.donor_weights)}")
This prints:
hcw ATE 0.0403 SE 0.0057 95% CI (0.0292, 0.0514) p=0.000 donors=6
l2 ATE 0.0261 SE 0.0033 95% CI (0.0195, 0.0326) p=0.000 donors=24
lasso ATE 0.0330 SE 0.0054 95% CI (0.0224, 0.0436) p=0.000 donors=11
fs ATE 0.0395 SE 0.0059 95% CI (0.0280, 0.0510) p=0.000 donors=9
All four find a significant positive integration effect on Hong Kong’s GDP
growth – between +2.6 and +4.0 percentage points per quarter – differing only
in how they choose controls. hcw runs the exact best-subset search and
certifies a parsimonious six-economy model (Austria, Italy, Korea, Mexico,
Norway, Singapore): on this pool the long 44-quarter pre-period gives the
branch-and-bound room to certify the global optimum (certified_optimal is
True, gap 0) even over 24 donors. l2 keeps all 24 controls with dense,
signed coefficients (pre-RMSE 0.012); lasso keeps 11; fs grows to 9. The
four estimates bracket the Forward-DiD result on the same data (0.025), a useful
cross-method check.
Prediction intervals#
The HAC confidence interval above quantifies uncertainty in the average
treatment effect. For uncertainty in each period’s effect, set
prediction_intervals=True to attach the bootstrap prediction intervals of
Jiang, Li, Shen and Zhou [pdapi] to every fitted variant. These bound the
post-period untreated counterfactual \(Y_t\) (and hence the effect
\(\Delta_t = y_t - \widehat Y_t\)), combining the in-sample estimation error
with the out-of-sample prediction error that a confidence interval for the mean
omits. The construction (their Algorithm 2.1) resamples the pre-period
prediction error with a dependent wild bootstrap and the post-period error with a
residual bootstrap, refits the chosen estimator on each bootstrap sample at the
fixed tuning parameter, and reads quantiles of the self-normalized statistic
\(\widehat e_t / \sqrt{\widehat V_t + \widehat\sigma^2}\).
The implementation lives at the shared mlsynth.utils.inferutils.pda_prediction_intervals()
so any panel-data estimator can reuse it. Both the equal-tailed (eq) and
symmetric (sy) intervals are returned, and each variant reports which
studentization it used: sandwich for the post-selection OLS HAC variance
\(\widehat V_t\) (hcw, lasso and fs, which all select then run
OLS – hcw is exactly the low-dimensional post-selection-OLS case of Jiang
et al.’s Remark 2.1, with best subset as the selector re-run on each bootstrap
draw), or the sigma2 fallback when that sandwich is undefined – as for the
dense L2-relaxation when the controls outnumber the pre-periods.
res = PDA({"df": df, "outcome": "GDP", "treat": "Integration",
"unitid": "Country", "time": "Time",
"methods": ["l2", "LASSO", "fs"],
"prediction_intervals": True, "pi_n_boot": 199, "pi_seed": 0,
"display_graphs": False}).fit()
for name, fit in res.fits.items():
eff = fit.prediction_intervals["effect"]
print(name, fit.prediction_intervals["studentization"])
for k in range(3): # first three post-quarters
lo, hi = eff["eq_lower"][k], eff["eq_upper"][k]
print(f" t+{k+1}: effect {eff['point'][k]:+.4f} 95% PI ({lo:+.4f}, {hi:+.4f})")
The pointwise 95% effect intervals for the first three post-quarters (decimal GDP growth):
method studentization t+1 t+2 t+3
l2 sandwich +0.0203 (-0.0128, +0.0411) +0.0427 (+0.0076, +0.0658) +0.0023 (-0.0347, +0.0272)
lasso sandwich +0.0312 (-0.0066, +0.0499) +0.0519 (+0.0165, +0.0706) +0.0152 (-0.0198, +0.0397)
fs sandwich +0.0249 (-0.0079, +0.0521) +0.0467 (+0.0158, +0.0719) +0.0103 (-0.0237, +0.0351)
Each band is wider than the ATE confidence interval, because it carries the out-of-sample prediction error on top of the estimation error, and it tightens or widens quarter by quarter as the fitted controls track each period better or worse.
hcw produces these same intervals, with one practical caveat: the bootstrap
refits the entire selection on every draw, and HCW’s refit re-runs the
best-subset search. On the 24-economy pool that is roughly ten seconds a draw, so
its prediction intervals there cost minutes – while the certified point estimate
and the HAC ATE interval need no resampling and return at once. On HCW’s
intended, pre-screened pool (the ten candidate economies of the sovereignty
study) the search is tiny and the intervals come straight back:
cands = ["China", "Indonesia", "Japan", "Korea", "Malaysia", "Philippines",
"Singapore", "Taiwan", "Thailand", "United States"]
sub = df[df["Country"].isin(["Hong Kong"] + cands)]
hk = PDA({"df": sub, "outcome": "GDP", "treat": "Integration",
"unitid": "Country", "time": "Time", "method": "hcw",
"prediction_intervals": True, "pi_n_boot": 199,
"display_graphs": False}).fit()
eff = hk.fits["hcw"].prediction_intervals["effect"]
lo, hi = eff["eq_lower"][0], eff["eq_upper"][0]
print(f"t+1: effect {eff['point'][0]:+.4f} 95% PI ({lo:+.4f}, {hi:+.4f})")
# t+1: effect +0.0304 95% PI (-0.0227, +0.0703)
Verification#
Note
Empirical (Path A, Hong Kong). All four PDA methods run on the HCW Hong Kong panel (above) and agree on a significant positive integration effect, consistent with the literature and the Forward-DiD cross-check (0.025).
Original HCW (Path A, best subset). benchmarks/cases/pda_hcw_hongkong.py
reproduces Hsiao, Ching & Wan (2012) Tables XVI-XVII value-for-value with
method="hcw": on the sovereignty study (ten candidate economies,
T0 = 18), AICc selects {Japan, Korea, Taiwan, USA} with the published OLS
weights, pre-period \(R^2 = 0.9314\), and an insignificant average
effect of -3.96% – cross-validating against the pampe R package.
Prediction intervals (Path B, coverage). The bootstrap prediction
intervals are validated by benchmarks/cases/pda_pi_coverage.py, which
reproduces the coverage geometry of Jiang et al. (2025) Tables 2-5 on their
Setup 1: the equal-tailed and symmetric intervals cover near the nominal
95%, while the normal-quantile intervals under-cover (to about 77% under
exponential errors).
Simulation study (Path B): forward selection vs LASSO#
Shi & Huang’s (2023) Table 1 compares forward selection against LASSO on a
four-factor DGP, re-implemented in
mlsynth.utils.pda_helpers.simulation.simulate_pda_panel(): four factors
(f1 i.i.d.; f2 AR(1) 0.9; f3 MA(2) (0.8,0.4); f4 ARMA(1,1)
(0.5,0.5) under the dynamic structure, all i.i.d. N(0,1) under the
i.i.d. structure); loadings U(1,2) on the treated + 4 relevant controls
and U(-0.1,0.1) on the remaining 96; idiosyncratic N(0,0.5); one treated
unit, \(N=100\) controls, \(T_0=T_2\). Shocks D1-D7 set the
post-period ATE (D1-D3 null → size; D4-D7 non-zero → power).
Driving the packaged PDA (methods=["fs","LASSO"],
fs_intercept=False) at \(T_0=100\) (200 reps for size, 60 for power):
factors |
method |
# donors |
size (D1) |
power (D5) |
|---|---|---|---|---|
i.i.d. |
fs |
3.9 |
0.090 |
1.00 |
i.i.d. |
LASSO |
9.5 |
0.065 |
1.00 |
dynamic |
fs |
4.5 |
0.075 |
1.00 |
dynamic |
LASSO |
15.0 |
0.140 |
1.00 |
The paper’s geometry reproduces. Forward selection is parsimonious – it
keeps to ~the 4 relevant donors in both structures – while LASSO
over-selects (9-15 donors). Forward selection’s test is correctly sized
(≈ 0.05-0.09) under both factor structures, the robustness Shi & Huang
emphasise. LASSO is correctly sized under i.i.d. factors (0.065, matching
the paper’s 0.058) but its size inflates under dynamic factors (0.140;
paper’s modified-BIC LASSO 0.184) – the size inflation the paper reports is a
dynamic-factor phenomenon, not an i.i.d. one. Both tests are fully powered at
D5 (mean-1 shift). Durable case: pda_table1.
Note
mlsynth’s LASSO is cross-validated; the paper’s is not. Shi & Huang
select the Lasso penalty with a modified BIC (Remark 4 cont., p.521:
“we tune the constants in the modified BIC to allow Lasso to take in more
variables”); mlsynth’s L1-PDA selects it with LassoCV (5-fold
cross-validation). The two are different penalty rules, so the LASSO cells
above are mlsynth’s CV variant, not a cell-by-cell match of the paper’s
Lasso. What both share – and what the benchmark pins – is the geometry:
LASSO over-selects relative to fs and its size inflates under dynamic
factors, while forward selection (the paper’s method, validated cell-by-
cell on Hong Kong in pda_hongkong) stays parsimonious and correctly
sized.
The fs_intercept knob – valid size on factor data
Achieving the correct fs size above required a fix. The released fsPDA
R package (est.fsPDA.R) fits the donor regression with an intercept;
on the paper’s mean-zero factor DGP that intercept absorbs a spurious
pre-period constant which extrapolates into the post window, biasing the
gap and inflating the null rejection rate to ~0.20. The paper’s Table 1
was produced by the simulation code (FS.R), which fits without an
intercept and yields valid size. mlsynth exposes both via
PDAConfig.fs_intercept (default False = the no-intercept, valid-size
form; set True for panels with genuine unit level shifts). With the
default, the fs D1 rejection rate drops from ~0.20 (intercept) to the
~0.05-0.09 reported above (no intercept).
One honest caveat: under dynamic factors fs shows mild residual size inflation at small \(T\) (the “imprecise long-run-variance” effect the paper itself notes), shrinking toward 5% as \(T_0 \to \infty\).
L2-relaxation (Shi & Wang). The l2 method’s out-of-sample MPSE falls
with \(T\) and its test approaches the nominal 5% size as \(T_0 \to
\infty\), matching Shi & Wang’s Table 2 (size \(0.142\) at \(T_0=50\)
→ \(0.072\) at \(200\)). Its per-fit cross-validation over the
\(\varepsilon\) grid makes large Monte Carlos expensive (~5 s/fit), so the full
table is summarized rather than swept here.
Multiple treated units#
When several units are treated by the same intervention, Shi & Wang’s L2-relaxation PDA fits a counterfactual per treated unit against the shared control pool and aggregates the effects into a per-period cross-sectional ATE,
with \(\widehat{\Sigma}_e\) the cross-sectional covariance of the pre-period
prediction residuals (so the standard error is constant across post-periods).
mlsynth.utils.pda_helpers.multitreat.run_pda_multitreat() implements this.
Because every per-unit fit shares the same
\(\widehat{\boldsymbol{\Sigma}} = X'X/T_0\), all \(J\) fits run through a
single OSQP factorisation – the batched solver
mlsynth.utils.pda_helpers.l2.batch.l2_relax_batch() updates only the
constraint bounds per (unit, tau) (hundreds of conic solves become hundreds
of warm-started ADMM updates). On the Brexit study (52 UK firms vs 300 controls)
this reproduces the paper’s first-day return ATE of \(-4.3\%\)
(\(t \approx -7.4\)); see benchmarks/cases/pda_brexit.py.
Core API#
Panel Data Approach (PDA) for program evaluation.
A thin, NumPy-first orchestration over mlsynth.utils.pda_helpers. PDA
(Hsiao, Ching & Wan 2012) predicts a single treated unit’s untreated
counterfactual by a linear regression on the control units fit over the
pre-treatment window, then extrapolates out-of-sample to estimate the ATE.
Four variants are supported, each with the inference theory from its own paper:
hcw– the original best-subset PDA (Hsiao, Ching & Wan 2012): unrestricted OLS (with intercept) on the best subset of controls, the subset chosen by AICc (also AIC / BIC); HAC t-test on the ATE. Matches thepampeR package and HCW Table XVI value-for-value.l2– L2-relaxation (Shi & Wang 2024): dense ridge-like coefficients viamin ||beta||_2^2 s.t. ||eta_hat - Sigma_hat beta||_inf <= tau; ATE inference combines pre-residual and post-effect HAC long-run variances.LASSO– L1 (Li & Bell 2017): sparse donor selection by cross-validated LASSO; HAC t-test on the ATE with a first-stage variance term.fs– forward selection (Shi & Huang 2023): greedy R^2 selection with a modified-BIC stopping rule; post-selection HAC t-test (sample splitting). The scalable descendant ofhcw’s combinatorial search.
Set method for one variant or methods to run several at once.
- class mlsynth.estimators.pda.PDA(config: PDAConfig | dict)#
Bases:
objectPanel Data Approach estimator (l2 / LASSO / fs).
- Parameters:
config (PDAConfig or dict) – Validated configuration. Beyond the common fields, PDA reads
method/methods(which variant(s) to run),tau(the l2 relaxation parameter;Noneselects it by validation), andalpha(CI level).
References
Hsiao, C., Ching, H. S., & Wan, S. K. (2012). A panel data approach for program evaluation. Journal of Applied Econometrics, 27(5), 705-740.
Shi, Z., & Wang, Y. (2024). L2-relaxation for Economic Prediction.
Li, K. T., & Bell, D. R. (2017). Estimation of average treatment effects with panel data. Journal of Econometrics, 197(1), 65-75.
Shi, Z., & Huang, J. (2023). Forward-selected panel data approach for program evaluation. Journal of Econometrics, 234(2), 512-535.
- fit() PDAResults#
Run the requested PDA variant(s) and return typed results.
- Returns:
PDAResults – Container of per-variant
PDAMethodFitobjects (coefficients, counterfactual, gap, ATE, HAC standard error, CI, p-value, donor weights), with convenience aliases (att,att_se,counterfactual,donor_weights) forwarding to the primary variant.
Configuration#
- class mlsynth.config_models.PDAConfig(*, df: ~pandas.DataFrame, outcome: str, treat: str, unitid: str, time: str, display_graphs: bool = True, save: bool | str = False, counterfactual_color: ~typing.List[str] = <factory>, treated_color: str = 'black', plot: ~mlsynth.config_models.PlotConfig = <factory>, method: ~typing.Annotated[str, _PydanticGeneralMetadata(pattern='^(LASSO|l2|fs|hcw)$')] = 'fs', methods: ~typing.List[str] | None = None, hcw_criterion: ~typing.Annotated[str, _PydanticGeneralMetadata(pattern='^(AICc|AIC|BIC)$')] = 'AICc', hcw_nvmax: ~typing.Annotated[int | None, ~annotated_types.Gt(gt=0)] = None, hcw_backend: ~typing.Annotated[str, _PydanticGeneralMetadata(pattern='^(fw|scip)$')] = 'fw', tau: float | None = None, alpha: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Lt(lt=1.0)] = 0.05, fs_intercept: bool = False, lrvar_lag: int | None = None, l2_standardize: bool = True, prediction_intervals: bool = False, pi_n_boot: ~typing.Annotated[int, ~annotated_types.Ge(ge=2)] = 999, pi_seed: int | None = 0)#
Configuration for the Panel Data Approach (PDA) estimator.
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
Result Containers#
PDA.fit() returns a
PDAResults, whose fits maps
each variant to a
PDAMethodFit (coefficients,
intercept, counterfactual, gap, ATE, HAC standard error, CI, p-value, donor
weights, and the selected-donor list for lasso/fs). The prepared,
NumPy-only panel is exposed as a
PDAInputs, with units and time
addressed through an IndexSet.
Note
PDA.fit() returns an EffectResult on the
standardized two-family contract. It is a dispatcher over the variants in
res.fits (l2 / lasso / fs); the selected variant drives the flat
accessors (res.att / res.att_ci / res.counterfactual /
res.gap / res.donor_weights / res.pre_rmse), which resolve
through the standardized sub-models. res.donor_weights are the regression
coefficients (PDA is a regression counterfactual, not a simplex average);
res.att_by_method() / res.se_by_method() report every variant.
Frozen, NumPy-first containers for the Panel Data Approach (PDA).
PDA (Hsiao, Ching & Wan 2012) predicts a treated unit’s untreated
counterfactual by a linear regression on the control units fit over the
pre-treatment window, then extrapolates out-of-sample. mlsynth exposes
three high-dimensional PDA variants, each in its own subpackage with the
inference theory from its own paper:
l2– L2-relaxation (Shi & Wang 2024).lasso– L1/LASSO (Li & Bell 2017).fs– forward-selected PDA (Shi & Huang 2023).hcw– original best-subset PDA (Hsiao, Ching & Wan 2012).
Everything below is pure NumPy; units/time are addressed through
IndexSet. The only DataFrame touchpoint is setup.
- class mlsynth.utils.pda_helpers.structures.PDAInputs(unit_index: ~mlsynth.utils.fast_scm_helpers.structure.IndexSet, time_index: ~mlsynth.utils.fast_scm_helpers.structure.IndexSet, y: ~numpy.ndarray, X: ~numpy.ndarray, T0: int, treated_label: ~typing.Any, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#
Bases:
objectPreprocessed, NumPy-only panel for the PDA engine.
- Parameters:
unit_index (IndexSet) – All
Ndonor units (column order ofX).time_index (IndexSet) – All
Tperiods (row order ofyandX).y (np.ndarray) – Treated-unit outcome over all periods, shape
(T,).X (np.ndarray) – Donor outcomes, shape
(T, N).T0 (int) – Number of pre-treatment periods (
T1); post isT2 = T - T0.treated_label (Any) – Identifier of the treated unit.
metadata (dict) – Free-form provenance.
- X: ndarray#
- property donor_labels: ndarray#
- time_index: IndexSet#
- unit_index: IndexSet#
- y: ndarray#
- class mlsynth.utils.pda_helpers.structures.PDAMethodFit(name: str, beta: ~numpy.ndarray, intercept: float, counterfactual: ~numpy.ndarray, gap: ~numpy.ndarray, att: float, att_se: float, ci: ~typing.Tuple[float, float], p_value: float, donor_weights: ~typing.Dict[~typing.Any, float], selected_donors: ~typing.List[~typing.Any] | None = None, prediction_intervals: ~typing.Dict[str, ~typing.Any] | None = None, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#
Bases:
objectA single PDA-variant fit (l2 / lasso / fs).
- beta: ndarray#
- counterfactual: ndarray#
- gap: ndarray#
- class mlsynth.utils.pda_helpers.structures.PDAResults(*, effects: ~mlsynth.config_models.EffectsResults | None = None, fit_diagnostics: ~mlsynth.config_models.FitDiagnosticsResults | None = None, time_series: ~mlsynth.config_models.TimeSeriesResults | None = None, weights: ~mlsynth.config_models.WeightsResults | None = None, inference: ~mlsynth.config_models.InferenceResults | None = None, method_details: ~mlsynth.config_models.MethodDetailsResults | None = None, sub_method_results: ~typing.Dict[str, ~typing.Any] | None = None, additional_outputs: ~typing.Dict[str, ~typing.Any] | None = None, raw_results: ~typing.Dict[str, ~typing.Any] | None = None, execution_summary: ~typing.Dict[str, ~typing.Any] | None = None, plot_config: ~mlsynth.config_models.PlotConfig | None = None, inputs: ~mlsynth.utils.pda_helpers.structures.PDAInputs, fits: ~typing.Dict[str, ~mlsynth.utils.pda_helpers.structures.PDAMethodFit], selected_variant: str = 'fs', metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#
Bases:
BaseEstimatorResultsTop-level container returned by
mlsynth.PDA.fit().An
EffectResult(the observational report): a dispatcher over the PDA variants infits, it lifts the selected variant’s quantities into the standardized sub-models so the flat accessors (att/att_ci/counterfactual/gap/donor_weights/pre_rmse) resolve through the base contract.donor_weightshere are the regression coefficients (PDA is a regression/factor counterfactual, not a simplex donor average). The PDA-specific fields keep every variant fit.- Parameters:
inputs (PDAInputs) – Preprocessed panel.
fits (dict) –
{variant_name: PDAMethodFit}for every variant run (l2 / lasso / fs).selected_variant (str) – Which variant drives the standardized surface / flat accessors.
- fits: Dict[str, PDAMethodFit]#
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'frozen': True, 'json_encoders': {<class 'numpy.ndarray'>: <function BaseEstimatorResults.Config.<lambda>>}}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
Helper Modules#
Data preparation – the only DataFrame touchpoint: pivots to NumPy, builds the
unit/time IndexSetes, and splits pre/post.
Long-DataFrame -> NumPy boundary for PDA (the only pandas touchpoint).
- mlsynth.utils.pda_helpers.setup.derive_treatment(df: DataFrame, unitid: str, time: str, treat: str) Tuple[Any, Any]#
Read the single treated unit and its first treated period from
treat.
- mlsynth.utils.pda_helpers.setup.prepare_pda_inputs(df: DataFrame, *, unitid: str, time: str, outcome: str, treat: str) PDAInputs#
Pivot the panel to NumPy, build
IndexSets, split pre/post.- Returns:
PDAInputs – Pure-NumPy container: treated vector
y(T,), donor matrixX(T, N),T0, and unit/timeIndexSets.
Shared HAC long-run-variance machinery (Bartlett/Newey-West) and the N(0,1) test used by every PDA variant.
Shared HAC long-run-variance machinery for PDA inference.
All three PDA variants base inference on a heteroskedasticity- and autocorrelation-consistent (HAC) long-run variance of a scalar time series of post-period treatment effects (and, for some variants, pre-period prediction residuals). The Bartlett/Newey-West and uniform kernels are provided; the default truncation lag follows Newey-West.
- mlsynth.utils.pda_helpers.inference.fspda_lrvar_lag(T2: int, lrvar_lag: int | None = None) int#
Bartlett-kernel truncation lag for the fsPDA long-run variance.
Follows Shi & Huang’s released
fsPDApackage (est.fsPDA): the default isfloor(T2 ** (1/4))and a user-supplied lag must be a non-negative integer no larger thanfloor(sqrt(T2)).
- mlsynth.utils.pda_helpers.inference.hac_lrv(z: ndarray, lag: int | None = None, kernel: str = 'bartlett') float#
HAC long-run variance of a mean-zero-able scalar series
z.lrv = gamma_0 + sum_{l=1}^{L} k(l) * 2 * gamma_lwithgamma_lthe sample autocovariance andkthe Bartlett (1 - l/(L+1)) or uniform kernel. The series is de-meaned first.
- mlsynth.utils.pda_helpers.inference.lrvar_prewhite_nw(x: ndarray) float#
Prewhitened Newey-West long-run variance of the mean of
x.Andrews-Monahan (1992) VAR(1) prewhitening + a Bartlett kernel with the Newey-West (1994) data-driven bandwidth + the finite-sample adjustment – i.e. R’s
sandwich::lrvar(x, type = "Newey-West", prewhite = TRUE, adjust = TRUE), which Shi & Huang use for the post-selection t-test in their applications (app1_luxury_watch/fsPDA.R). Returns the variance of the sample mean (so the t-statistic ismean(x) / sqrt(lrvar_prewhite_nw(x))).Prewhitening is what makes this differ sharply from the plain Bartlett estimator on strongly serially-dependent effects (e.g. monthly growth rates, which mean-revert): the VAR(1) step removes the bulk of the dependence before the kernel smooths the remainder, then the long-run factor
1 / (1 - A)recolors it.
- mlsynth.utils.pda_helpers.inference.newey_west_lag(n: int) int#
Newey-West (1994) automatic truncation lag
floor(4 (n/100)^(2/9)).
- mlsynth.utils.pda_helpers.inference.normal_test(att: float, se: float, alpha: float = 0.05)#
Two-sided N(0,1) test: returns (p_value, (ci_lower, ci_upper)).
L2-relaxation (Shi & Wang): the relaxation primal, \(\varepsilon\) validation, and the two-term HAC ATE inference.
L2-relaxation estimation (Shi & Wang 2024).
Solves the L2-relaxation primal (their Eq. 3)
min_beta (1/2) ||beta||_2^2 s.t. || eta_hat - Sigma_hat beta ||_inf <= tau,
with Sigma_hat = X'X / T1 and eta_hat = X'y / T1 over the pre-period.
Following the authors’ released L2relax (ishwang1/L2relax-PDA),
the treated and control series are standardised (demeaned and scaled to unit
variance) before forming Sigma / eta, and the solution is mapped back to
the original scale: beta = sd_y * (beta_tilde / sd_X) and intercept
alpha = mean_y - mean_X . beta. Standardisation is the default (it is what
the replication code does and what reproduces the paper’s empirical results);
set standardize=False for the raw-scale variant. The tuning parameter
tau is chosen by out-of-sample validation over a log-spaced grid (the
optimal tau is often a tiny fraction of max|eta|).
- mlsynth.utils.pda_helpers.l2.estimation.cross_validate_tau(y_pre: ndarray, X_pre: ndarray, val_frac: float = 0.2, n_coarse: int = 40, n_fine: int = 40, standardize: bool = True, tau_grid: ndarray | None = None) float#
Validate
tauby sequential out-of-sample validation on the tail.The training window is split in time order – fit on the earlier
1 - val_fracfraction, validate on the most-recentval_fractail – so no future period informs an earlier counterfactual (unlike a K-fold split over contiguous blocks, which trains on both past and future of each fold).tau_gridoverrides the automatic log-spaced grid (e.g. to match a reference grid); otherwise the grid is log-spaced up tomax|eta|(the optimaltauis often a tiny fraction of that).
- mlsynth.utils.pda_helpers.l2.estimation.fit_l2(y: ndarray, X: ndarray, T0: int, tau: float | None = None, standardize: bool = True, tau_grid: ndarray | None = None) Tuple[ndarray, float, ndarray, float]#
Fit L2-relaxation; return
(beta, intercept, counterfactual, tau).
- mlsynth.utils.pda_helpers.l2.estimation.l2_relax(y_pre: ndarray, X_pre: ndarray, tau: float, standardize: bool = True) Tuple[ndarray, float]#
Solve the L2-relaxation primal for coefficients and intercept.
The primal objective
||beta||^2 / 2is strictly convex, so the solution is unique (Shi & Wang 2024, Lemma 2). It is solved natively with polished OSQP (mlsynth.utils.pda_helpers.l2.batch.l2_relax_solve()) – the same solver the multiple-treated batch uses – which matches an interior-point (cvxpy/CLARABEL) solve on the unique optimum without cvxpy’s per-call canonicalisation overhead.
ATE inference for L2-relaxation PDA (Shi & Wang 2024, Theorem 3).
Single treated unit. With the post-period estimated effects
Delta_hat_t = y_t - y_hat_t (t in T2) and the pre-period prediction
residuals e_t = y_t - y_hat_t (t in T1), the ATE Delta_bar is
asymptotically normal,
Z_hat = Delta_bar / sqrt( rho_hat_(1)^2 / T1 + rho_hat_(2)^2 / T2 ) -> N(0, 1),
where rho_hat_(1)^2 is the HAC long-run variance of the pre-period
residuals and rho_hat_(2)^2 is the HAC long-run variance of the de-meaned
post-period effects. Both estimation uncertainty (pre) and post-period noise
contribute.
- mlsynth.utils.pda_helpers.l2.inference.l2_ate_inference(y: ndarray, counterfactual: ndarray, T0: int, alpha: float = 0.05) Tuple[float, float, Tuple[float, float], float]#
Return
(att, se, ci, p_value)for the L2-relaxation ATE.
LASSO (Li & Bell): cross-validated L1 estimation and the HAC t-test with a first-stage variance term.
L1/LASSO PDA estimation (Li & Bell 2017).
Selects control units and estimates coefficients by LASSO on the pre-period, with the penalty chosen by cross-validation, then predicts the treated unit’s counterfactual out-of-sample.
- mlsynth.utils.pda_helpers.lasso.estimation.fit_lasso(y: ndarray, X: ndarray, T0: int, cv: int = 5, random_state: int = 0, alpha: float | None = None) Tuple[ndarray, float, ndarray, ndarray]#
Fit LASSO PDA; return
(beta, intercept, counterfactual, support_mask).alphafixes the penalty (skipping cross-validation);None(default) selects it byLassoCV. The fixed-penalty path is used by the prediction-interval bootstrap, which reuseslambda_{T0}.
- mlsynth.utils.pda_helpers.lasso.estimation.lasso_cv_alpha(y: ndarray, X: ndarray, T0: int, cv: int = 5, random_state: int = 0) float#
The cross-validated LASSO penalty for the pre-period (the
alpha_thatfit_lasso()selects).Exposed so the prediction-interval bootstrap can refit at a fixed penalty (Jiang et al. 2025 reuse
lambda_{T0}across bootstrap samples rather than re-cross-validating each one).
ATE inference for LASSO PDA (Li & Bell 2017, Theorem 3.2).
The test statistic is sqrt(T2) * Delta_bar / sqrt(Sigma_hat) with
Sigma_hat = Sigma_hat_1 + Sigma_hat_2:
Sigma_hat_2– the HAC (Newey-West) long-run variance of the post-period effects, the always-present post-period averaging variance;Sigma_hat_1– the first-stage (pre-period estimation) variance, here the OLS prediction variance of the mean post-period counterfactual on the LASSO-selected support. Li & Bell note it is negligible whenT1 >> T2.
Equivalently the ATE standard error is
SE = sqrt(var_firststage + Sigma_hat_2 / T2).
- mlsynth.utils.pda_helpers.lasso.inference.lasso_ate_inference(y: ndarray, X: ndarray, counterfactual: ndarray, support: ndarray, T0: int, alpha: float = 0.05) Tuple[float, float, Tuple[float, float], float]#
Return
(att, se, ci, p_value)for the LASSO PDA ATE.
Forward selection (Shi & Huang): greedy R^2 selection with modified-BIC stopping and the post-selection HAC t-test.
Forward-selected PDA estimation (Shi & Huang 2023).
Greedy forward selection of control units: at each step add the donor whose
inclusion maximizes the pre-treatment OLS R^2 (equivalently minimizes the
residual variance sigma^2 = mean(e^2)). Selection stops as soon as the
modified information criterion
IC(r) = log( sigma^2(U_r) ) + B * r, B = log(log N) * log(T1) / T1
stops decreasing (the stopping rule of Wang, Li & Tsai used in the authors’
fsPDA R package), starting from the intercept-only IC = log(var(y1)).
The counterfactual is the OLS extrapolation on the selected set. This is a
direct port of est.fsPDA.R.
- mlsynth.utils.pda_helpers.fs.estimation.forward_select(y: ndarray, X: ndarray, T0: int, intercept: bool = False) Tuple[List[int], ndarray, float, ndarray]#
Forward-select donors with the stop-on-increase IC rule, then refit OLS.
Returns
(selected_indices, beta_full, intercept, counterfactual)wherebeta_fullis anN-vector with zeros off the selected support.- Parameters:
intercept (bool, default False) – When
Falsethe donor regression is fit without a constant and the IC baseline islog(mean(y_pre**2))– the paper’s simulation convention (Shi & Huang 2023,simulation/FS.R). This is the correctly-specified form for the factor DGP (mean-zero factors, no unit level shift) and yields valid test size. WhenTruea constant is included and the baseline islog(var(y_pre))– the releasedfsPDAR package convention (est.fsPDA.R), appropriate for panels with genuine level differences but size-distorting on centered data.
Post-selection ATE inference for forward-selected PDA (Shi & Huang 2023, Eq. 4).
Because forward selection uses pre-treatment data only and the pre/post periods become asymptotically independent under weak dependence, the naive conditional t-statistic is valid:
Z_U = Delta_bar / sqrt(lrvar_hat) -> N(0, 1),
where lrvar_hat is the long-run variance of the sample mean of the
(de-meaned) post-period treatment effects. By default mlsynth uses the
prewhitened Newey-West estimator (sandwich::lrvar(..., prewhite = TRUE,
adjust = TRUE)) that Shi & Huang use in their applications
(app1_luxury_watch/fsPDA.R); on serially-dependent effects this is far less
conservative than a plain Bartlett kernel. Supplying lrvar_lag instead
switches to the released est.fsPDA package’s fixed-lag Bartlett estimator
(floor(T2 ** (1/4)) if the cap, see fspda_lrvar_lag()).
- mlsynth.utils.pda_helpers.fs.inference.fs_ate_inference(y: ndarray, counterfactual: ndarray, T0: int, alpha: float = 0.05, lrvar_lag: int | None = None) Tuple[float, float, Tuple[float, float], float]#
Return
(att, se, ci, p_value)for the forward-selected PDA ATE.
Run loop assembling the per-variant fits.
Run the requested PDA variant(s) and assemble typed per-method fits.
- mlsynth.utils.pda_helpers.orchestration.resolve_methods(method: str, methods: List[str] | None) List[str]#
Internal method keys to run (explicit
methodswin overmethod).
- mlsynth.utils.pda_helpers.orchestration.run_pda(inputs: PDAInputs, methods: List[str], tau: float | None, alpha: float, fs_intercept: bool = False, lrvar_lag: int | None = None, l2_standardize: bool = True, l2_tau_grid: Sequence[float] | None = None, hcw_criterion: str = 'AICc', hcw_nvmax: int | None = None, hcw_backend: str = 'fw', prediction_intervals: bool = False, pi_n_boot: int = 999, pi_seed: int | None = 0) Dict[str, PDAMethodFit]#
Fit each requested PDA variant with its own paper’s inference.
When
prediction_intervalsis set, Jiang et al. (2025) bootstrap prediction intervals (per-period treatment effect and counterfactual) are attached to each variant viamlsynth.utils.inferutils.pda_prediction_intervals().