Panel Data Approach (PDA)

Panel Data Approach (PDA)#

When to Use This Estimator#

The panel data approach (PDA) of Hsiao, Ching and Wan [HCW] estimates a single treated unit’s untreated counterfactual by a linear regression on the control units, fit over the pre-treatment window and extrapolated out-of-sample. Unlike the synthetic control method, PDA imposes no constraints on the coefficients (no simplex, no non-negativity), so it is a plain projection justified by a latent-factor model: if every unit loads on a small set of common factors, the treated unit’s untreated path is a linear combination of the controls’ paths plus an orthogonal error.

The challenge is which controls and how many. Classical PDA was built for low dimensions (few controls relative to pre-periods) and chooses controls by AIC/BIC, which break down once the number of controls N approaches or exceeds the pre-period length T0. mlsynth packages the original Hsiao-Ching-Wan method together with three high-dimensional variants that resolve the N-near-T0 problem differently, each with the estimation and inference theory of its own paper:

Original best-subset (hcw; Hsiao, Ching & Wan [HCW2012]) – the method that started the literature. The counterfactual is an unrestricted OLS regression (with intercept) of the treated pre-period on the best subset of controls, the subset and its size chosen by AICc (also AIC / BIC). Built for the low-dimensional regime (a moderate, pre-screened candidate pool); the best-subset search is combinatorial, so cap hcw_nvmax or pre-restrict the pool for large N. The three estimators below are its scalable descendants.
L2-relaxation (l2; Shi & Wang [l2relax]) – a dense estimator (a “cousin of ridge”) for when the factor model makes the projection coefficients dense; tolerates N > T0; prediction is robust to heteroskedasticity.
LASSO (lasso; Li & Bell [LASSOPDA]) – an L1 estimator that selects a sparse set of relevant controls; computationally far cheaper than AIC/BIC and works for N > T0.
Forward selection (fs; Shi & Huang [fsPDA]) – a greedy procedure that grows the control set one unit at a time, with valid post-selection inference and no sparsity requirement (works for dense or sparse models). The scalable replacement for hcw’s combinatorial best-subset search.

All variants target the single-treated-unit, many-candidate-controls regime and produce a time-varying treatment effect and an average treatment effect (ATE) with a HAC-based confidence interval. The practical choice among them (detailed below) follows the authors’ own arguments: l2 when the coefficients are dense, lasso when only a few controls matter and you want an interpretable selection, fs when you want a cheap, predictive ensemble with honest post-selection inference regardless of sparsity.

Notation#

We use mlsynth’s standard notation, with the two sample operators of Shi & Wang [l2relax] defined below. For a positive integer n, $[n] = \{1, \ldots, n\}$; $\mathbf{I}_n$ is the identity, $\mathbf{1}_n$, $\mathbf{0}_n$ the all-ones/zeros vectors. For a matrix $\mathbf{A} = (a_{ij})$ and index sets $\mathcal{S}, \mathcal{Q}$, the submatrix is $\mathbf{A}_{\mathcal{S}\mathcal{Q}} = (a_{ij})_{i\in\mathcal{S}, j\in\mathcal{Q}}$ and the subvector $\mathbf{x}_{\mathcal{Q}} = (x_i)_{i\in\mathcal{Q}}$. $\phi_{\min}(\cdot)$, $\phi_{\max}(\cdot)$ denote the smallest and largest eigenvalues; $\|\mathbf{A}\|_\infty = \max_{ij}|a_{ij}|$, $\|\mathbf{A}\|_2 = \sqrt{\phi_{\max}(\mathbf{A}'\mathbf{A})}$, $\|\mathbf{x}\|_1 = \sum_i |x_i|$.

The two time-series operators are central. Over an index set $\mathcal{S}$ of periods,

\[\mathcal{E}_{\mathcal{S}}(\mathbf{x}_t) = \frac{1}{|\mathcal{S}|} \sum_{t\in\mathcal{S}} \mathbf{x}_t, \qquad \Gamma_{\mathcal{S}}(\mathbf{x}_t, \mathbf{y}_t') = \mathcal{E}_{\mathcal{S}} \Bigl( [\mathbf{x}_t - \mathcal{E}_{\mathcal{S}}(\mathbf{x}_t)] [\mathbf{y}_t - \mathcal{E}_{\mathcal{S}}(\mathbf{y}_t)]' \Bigr),\]

the sample mean and sample covariance of a series over $\mathcal{S}$. The intervention takes effect after the split point $T_0$. The pre-period is $\mathcal{T}_1 \coloneqq \{1, \ldots, T_0\}$ with $|\mathcal{T}_1| = T_0$; the post-period is $\mathcal{T}_2 \coloneqq \{T_0+1, \ldots, T\}$ with $T_2 \coloneqq |\mathcal{T}_2| = T - T_0$. Let $j=1$ denote the treated unit, with treated series $\mathbf{y}_1 = (y_{11}, \ldots, y_{1T})^\top$; the donor pool is $\mathcal{N}_0 \coloneqq \mathcal{N}\setminus\{1\}$, with cardinality $N_0$, where $\mathcal{N} = \{1,\ldots,N\}$.

The shared model#

All four methods rest on a common latent-factor data-generating process: for $t \in \mathcal{T}$,

\[y_{jt}^N = \mu_j + \boldsymbol{\lambda}_j' \mathbf{f}_t + u_{jt}, \quad j \in \{1\}\cup\mathcal{N}_0,\]

with $\mathbf{f}_t$ an $r$-vector of latent common factors, $\boldsymbol{\lambda}_j$ factor loadings, and $u_{jt}$ a weakly-dependent idiosyncratic error orthogonal to the factors. Because the common factors drive both the treated unit and the donors, the untreated treated outcome admits a linear projection on the donors,

\[y_{1t} = \alpha^0 + \mathbf{x}_t' \boldsymbol{\beta}^0 + \epsilon_t, \qquad \mathbf{x}_t = (y_{jt})_{j\in\mathcal{N}_0}',\]

with $\mathbb{E}[\mathbf{x}_t \epsilon_t] = \mathbf{0}$. PDA fits $(\alpha^0, \boldsymbol{\beta}^0)$ on $\mathcal{T}_1$ and predicts $\widehat{y}_{1t} = \mathcal{E}_{\mathcal{T}_1}(y_{1s}) + [\mathbf{x}_t - \mathcal{E}_{\mathcal{T}_1}(\mathbf{x}_s)]'\widehat{\boldsymbol{\beta}}$ for $t \in \mathcal{T}_2$. The per-period treatment effect is $\tau_t = y_{1t} - \widehat{y}_{1t}$ and the ATE $\widehat{\tau} = \mathcal{E}_{\mathcal{T}_2}(\tau_t)$. The methods differ in how they estimate $\boldsymbol{\beta}$ and, crucially, in the inference theory each paper proves for $\widehat{\tau}$.

Original best subset (`hcw`, Hsiao-Ching-Wan)#

Idea. This is the method that started the literature [HCW2012]. The recipe is the plainest one in the family – ordinary least squares of the treated unit’s pre-period outcome on a handful of control series, extrapolated past the intervention – with one twist: which controls, and how many, are chosen for you rather than fixed in advance. HCW’s Section 5 turns that choice into a model-selection problem. It helps to walk through it as four steps.

Step 1 – measure how well a candidate set of controls fits. Pick any subset $\mathcal{S}\subseteq\mathcal{N}_0$ of the donors. Regress the treated pre-period outcome on those donors and an intercept by OLS, and record the residual sum of squares – the total squared pre-period miss,

\[\mathrm{RSS}(\mathcal{S}) = \min_{\alpha,\,\boldsymbol{\beta}_{\mathcal{S}}} \sum_{t\in\mathcal{T}_1} \bigl(y_{1t} - \alpha - \mathbf{x}_{t,\mathcal{S}}' \boldsymbol{\beta}_{\mathcal{S}}\bigr)^2 , \qquad \mathbf{x}_{t,\mathcal{S}} = (y_{jt})_{j\in\mathcal{S}}.\]

A smaller $\mathrm{RSS}$ means a tighter pre-period fit. By itself it is not a usable score, though: adding any control can only lower $\mathrm{RSS}$, so “smallest RSS” always points at the largest possible model, which memorises the short pre-period and extrapolates badly past the cut.

Step 2 – charge for complexity with an information criterion. To stop the search from simply taking every donor, each subset is scored by an information criterion: the (log) fit plus a penalty that grows with the number of parameters. The default is the small-sample-corrected AICc (AIC and BIC are the same shape with a lighter or heavier penalty),

\[\mathrm{AICc}(\mathcal{S}) = \underbrace{T_0 \log\!\frac{\mathrm{RSS} (\mathcal{S})}{T_0}}_{\text{rewards fit}} + \underbrace{2K + \frac{2K(K+1)}{T_0 - K - 1}}_{\text{charges for size}}, \qquad K = r + 2,\;\; r = |\mathcal{S}|.\]

Lower is better. The first term falls as the fit improves; the second rises with the model size $r$, so the criterion bottoms out where one more donor stops paying for the variance it adds – the bias-variance trade-off turned into a single number. The penalised parameter count $K = r + 2$ counts the $r$ donors, the intercept, and the error variance (the convention of the pampe R package, leaps::regsubsets + AICc + lm); the small-sample correction $2K(K+1)/(T_0-K-1)$ matters because PDA’s pre-period is short.

Step 3 – keep the subset with the best score. The selected controls are the global minimiser

\[\widehat{\mathcal{S}} = \operatorname*{argmin}_{\mathcal{S}\subseteq \mathcal{N}_0,\; |\mathcal{S}|\le r_{\max}} \mathrm{AICc}(\mathcal{S}).\]

In practice this is computed the way leaps does it (HCW Section 5): for each model size $r$ find the single lowest-$\mathrm{RSS}$ subset of that size, then choose the size whose best subset minimises the criterion – a “best of each size, then choose the size” search. The cap $r_{\max}$ (hcw_nvmax, pampe’s nvmax) limits the largest size considered and cannot exceed the pre-period degrees of freedom (the criterion is undefined once the donors plus intercept use up the $T_0$ observations).

Step 4 – refit and extrapolate. Re-fit OLS on the chosen controls $\widehat{\mathcal{S}}$ over the pre-period, then carry the coefficients forward: the counterfactual is that fitted line evaluated on the post-period control values, and the per-period effect is treated minus counterfactual, exactly as in the shared model above. On HCW’s Hong Kong sovereignty study this selects $\{\text{Japan},\text{Korea},\text{Taiwan},\text{USA}\}$ with $\mathrm{AICc} = -171.771$, reproducing Table XVI value-for-value (cross-validated against pampe – see the benchmark below).

Best subset selection is HCW’s classical, low-dimensional choice: AICc / AIC / BIC are only defined while $r < T_0$, and the search is combinatorial in $N_0$. HCW therefore pre-screened the donor pool (limiting Hong Kong to ten candidate economies). [HTT2020] place best subset alongside its two cheaper relatives – forward stepwise and the lasso – and show forward stepwise tracks best subset closely, which is exactly why fs is hcw’s scalable descendant; lasso and l2 are the others. Use hcw when the candidate pool is moderate and pre-screened, when you want the original method or to reproduce HCW / pampe, and when an exact, certified optimum is the point.

Computing the best subset: the optimisation#

There are $2^{N_0}$ candidate subsets, each an OLS fit, so naive enumeration is the regime where even leaps strains. mlsynth solves the problem exactly with a branch-and-bound built on three ideas, plus an optional fourth that hands the problem to a mixed-integer solver.

Furnival-Wilson sweep engine. The default is the canonical leaps::regsubsets algorithm [FurnivalWilson]: a depth-first search over subsets driven by the symmetric sweep operator on the augmented cross-product matrix $[\mathbf{1},\mathbf{X},\mathbf{y}]' [\mathbf{1},\mathbf{X},\mathbf{y}]$. Sweeping a donor in leaves the running $\mathrm{RSS}$ in the response diagonal in $O(p^2)$, and the reverse sweep removes it on backtracking, so no OLS is ever re-solved from scratch. Each subtree is bounded below by the smallest $\mathrm{RSS}$ it can reach (include every remaining donor – $\mathrm{RSS}$ is monotone in the variable set) paired with the smallest information-criterion penalty any descendant can carry; when that lower bound cannot beat the incumbent, the whole subtree is pruned. The bound is a true lower bound, so the returned subset is the exact global optimum – identical to exhaustive enumeration, at a small fraction of the cost.

Discrete first-order warm start. The incumbent is seeded, at every model size, with the Bertsimas-King-Mazumder projected-gradient hard-thresholding method [BKM2016] (their Algorithm 1): iterate $\boldsymbol{\beta}\leftarrow H_r\!\bigl(\boldsymbol{\beta} - \tfrac{1}{L} \nabla g(\boldsymbol{\beta})\bigr)$, where $H_r$ keeps the $r$ largest-magnitude coordinates and $L$ upper-bounds the largest eigenvalue of the centred donor Gram, with a least-squares refit on the active set and several restarts. A near-optimal incumbent tightens the bound from the outset and prunes more subtrees; because the seed can only lower the incumbent, the certified optimum is unchanged.

Node budget and a certified optimality gap. For a pool too large to certify quickly, the search stops at a node budget and returns the best incumbent found together with a valid lower bound – the smallest subtree bound it never reached. The reported $\text{optimality gap} = \text{incumbent} - \text{lower bound}$ is then a genuine suboptimality certificate (zero means provably optimal), the Bertsimas-King-Mazumder gap obtained from the branch-and-bound’s own bounds with no solver. This replaces an outright refusal of large pools: fits["hcw"].metadata reports certified_optimal and optimality_gap.

Optional exact mixed-integer backend. Setting hcw_backend="scip" certifies the optimum at pool sizes beyond the branch-and-bound’s exact reach via the SCIP solver. It casts the size-$r$ problem as the Bertsimas-King-Mazumder cardinality-constrained least squares – binary indicators $z_i$, an SOS-1 coupling forcing $z_i = 0 \Rightarrow \beta_i = 0$, and $\sum_i z_i \le r$ – solved once per size, then chooses the size by the criterion. The dependency is optional: pyscipopt is imported only on demand (pip install mlsynth[scip]). The default fw backend needs no solver and is all the low-dimensional regime hcw targets requires.

Assumptions and inference. hcw shares the identifying stack documented under Shared assumptions across the PDA class (the latent factor model A1, single absorbing treatment A2, weak temporal dependence A3), specialised to the low-dimensional regime $N_0 < T_0$ that makes the information criteria well-defined. Inference mirrors the post-selection HAC machinery of the scalable variants: the average effect carries a Newey-West / Bartlett long-run-variance confidence interval (prewhitened by default; lrvar_lag switches to a fixed-lag Bartlett estimator), and prediction_intervals=True attaches the Jiang et al. per-period intervals described below.

L2-relaxation (`l2`, Shi & Wang)#

Idea. Under the factor model the projection coefficient $\boldsymbol{\beta}^0 = \boldsymbol{\Omega}^{-1}\boldsymbol{\Lambda} (\boldsymbol{\Lambda}'\boldsymbol{\Omega}^{-1}\boldsymbol{\Lambda} + \mathbf{I}_q)^{-1}\boldsymbol{\lambda}_1$ is dense in general – almost no entries are exactly zero. Sparse methods (LASSO) are then mis-matched. With $\widehat{\boldsymbol{\Sigma}} = \Gamma_{\mathcal{T}_1}(\mathbf{x}_t, \mathbf{x}_t')$ and $\widehat{\boldsymbol{\eta}} = \Gamma_{\mathcal{T}_1}(\mathbf{x}_t, y_t)$, OLS solves the KKT condition $\widehat{\boldsymbol{\Sigma}}\boldsymbol{\beta} = \widehat{\boldsymbol{\eta}}$, which is unstable or non-unique once $N$ is close to or exceeds $T_0$. L2-relaxation relaxes the sup-norm of this moment condition by a tuning parameter $\varepsilon$ and minimizes the coefficient norm:

\[\min_{\boldsymbol{\beta}} \tfrac{1}{2}\|\boldsymbol{\beta}\|_2^2 \quad \text{s.t.} \quad \|\widehat{\boldsymbol{\eta}} - \widehat{\boldsymbol{\Sigma}}\boldsymbol{\beta}\|_\infty \le \varepsilon.\]

This is the “bias-variance trade-off” made explicit: tolerating a small violation $\varepsilon$ of the OLS moment condition shrinks the variance. At $\varepsilon = 0$ it reduces to (ridgeless) OLS; at $\varepsilon \ge \|\widehat{\boldsymbol{\eta}}\|_\infty$ it gives $\boldsymbol{\beta} = \mathbf{0}$. mlsynth picks $\varepsilon$ by sequential out-of-sample validation on the tail of the training window (the validated $\varepsilon$ tracks the infeasible-optimal one, and both shrink toward zero as the sample grows) over a log-spaced grid down to $10^{-4}\max|\widehat{\boldsymbol{\eta}}|$ (the optimum is often a tiny fraction of the cap). This is time-respecting – the fit never sees periods later than the validation tail – unlike the released L2relax.CV, whose 5-block K-fold trains on both past and future of each block.

Note

Standardisation. Following the authors’ released L2relax, the treated and control series are standardised (demeaned and scaled to unit variance) before forming $\widehat{\boldsymbol{\Sigma}}$ / $\widehat{\boldsymbol{\eta}}$, and the solution is mapped back to the original scale. This is the default (l2_standardize=True) – the $\ell_2$ penalty is scale-sensitive, so standardisation is both recommended and what reproduces the paper’s empirical results; on the Hong Kong panel it moves the L2 estimate from $2.48\%$ to $2.61\%$ (closer to Shi & Wang’s $2.65\%$). Set l2_standardize=False for the raw-scale variant.

Assumptions (Shi & Wang).

Assumption 1 (loadings). $\|\boldsymbol{\lambda}_1\|_\infty + \|\boldsymbol{\Lambda}\|_\infty \le C$, and there is an $r$-unit subset making $\boldsymbol{\Lambda}_{\mathcal{Q}\cdot}$ full column rank; the average factor strength $\xi_N = \phi_{\min}(\boldsymbol{\Lambda}' \boldsymbol{\Lambda}/N)$ may vanish (weak factors allowed).

Remark. Bounded loadings keep the projection coefficient well-defined, while the full-rank subset is what lets the donors span the treated unit’s factor exposure; allowing $\xi_N \to 0$ means the method tolerates weak factors that contribute little cross-sectional variation.

Assumption 2 (errors). The idiosyncratic covariance $\boldsymbol{\Omega}$ has eigenvalues bounded between $\underline\sigma^2$ and $\overline\sigma^2$; errors may be heteroskedastic and cross-sectionally dependent.

Remark. The two-sided eigenvalue bound rules out a degenerate noise direction, but heteroskedastic, cross-sectionally dependent errors are permitted – the realistic case for economic panels.

Assumption 3-4 (sampling). In- and out-of-sample sampling errors of the sample moments are $O_p(T_0^{-1/2})$ for low-dimensional pieces and $O_p(\sqrt{\log N / (N\wedge T_0)})$ for high-dimensional ones (holding under time-series weak dependence, not just i.i.d.).

Remark. These are the convergence rates that make the relaxed moment condition usable in high dimension: the high-dimensional pieces pay only a $\sqrt{\log N}$ price, so $N$ may grow with (or beyond) $T_0$.

Assumption 5 (ATE inference). The oracle prediction error $\epsilon_t^*$ and the effect-plus-error $d_t^* = \Delta_t - \mathbb{E}[\Delta_t] + \epsilon_t^*$ have finite, positive long-run variances $\rho^2_{\epsilon^*}$, $\rho^2_{d^*}$ consistently estimable by HAC, and a sequential CLT applies.

Remark. The coefficient estimator is consistent for the oracle target (Theorem 1) and – importantly – the prediction error is asymptotically irrelevant to heteroskedasticity (Theorem 2): unlike the coefficient MSE, the out-of-sample MSE does not depend on the noise heterogeneity $\psi_{\max}$. This is what licenses the HAC-based CLT for the ATE.

Inference (Shi & Wang, Theorem 3; single treated unit). With pre-period prediction residuals $e_t = y_{1t} - \widehat{y}_{1t}$ ($t\in\mathcal{T}_1$) and post-period effects $\tau_t$ ($t\in\mathcal{T}_2$),

\[\widehat{Z} = \frac{\widehat{\tau} - \Delta_{\mathcal{T}_2}} {\sqrt{\widehat{\rho}^2_{(1)}/T_0 + \widehat{\rho}^2_{(2)}/T_2}} \xrightarrow{d} N(0,1),\]

where $\widehat{\rho}^2_{(1)}$ is the HAC long-run variance of the pre-period residuals (first-stage estimation uncertainty) and $\widehat{\rho}^2_{(2)}$ is the HAC long-run variance of the de-meaned post-period effects (post-period averaging). Both sources of uncertainty enter, which matters when $T_0$ and $T_2$ are comparable.

When to use. Dense, factor-driven coefficients; high dimension ($N>T_0$ permitted); when prediction accuracy and heteroskedasticity- robustness matter more than identifying a handful of controls.

LASSO (`lasso`, Li & Bell)#

Idea. When only a few controls are truly relevant, an L1 penalty selects them and shrinks the rest. Li & Bell estimate

\[\widehat{\boldsymbol{\beta}}^{\text{las}} = \operatorname*{argmin}_{\boldsymbol{\beta}} \; \sum_{t\in\mathcal{T}_1} (y_{1t} - \mathbf{x}_t'\boldsymbol{\beta})^2 + \lambda \sum_{j} |\beta_j|,\]

with $\lambda$ chosen by (leave-one-out) cross-validation, then predict the counterfactual as in the shared model. LASSO works for $N > T_0$ (where AIC/AICC/BIC cannot even be computed) and is far cheaper.

Assumptions (Li & Bell). They relax HCW’s linear-conditional-mean assumption and drop one of HCW’s identification conditions. The key conditions are: a factor model with $\mathrm{Rank}(\widetilde{B}) = K$ (enough independent factor variation among the controls); a weakly dependent, weakly stationary panel so laws of large numbers and CLTs apply to partial sums; consistency of the pre-period least-squares pieces ($\widehat{\delta}_1 - \delta_1, \widehat{\delta}-\delta = O_p(T_0^{-1/2})$); and $\rho$-mixing with geometric decay plus a finite limit $\eta = \lim T_2/T_0$. Sparsity (only $m$ of $\boldsymbol{\beta}$ non-zero, $m$ fixed or $o(T_0)$) is assumed for the high-dimensional selection.

Remark. Li & Bell prove consistency $\widehat{\tau} - \Delta_{\mathcal{T}_2} = O_p(T_0^{-1/2} + T_2^{-1/2})$ (estimation error has two parts: first-stage $O_p(T_0^{-1/2})$ and post-averaging $O_p(T_2^{-1/2})$), holding even when $y_{1t}$ is trend-stationary.

Inference (Li & Bell, Theorem 3.2). With $\widehat{\Sigma} = \widehat{\Sigma}_1 + \widehat{\Sigma}_2$,

\[\text{T.S.} = \frac{\sqrt{T_2}\,\widehat{\tau}}{\sqrt{\widehat{\Sigma}}} \xrightarrow{d} N(0,1),\]

where $\widehat{\Sigma}_2$ is the Newey-West HAC long-run variance of the post-period effects, and $\widehat{\Sigma}_1$ is the first-stage (pre-period estimation) variance – the OLS prediction variance of the mean post-period counterfactual on the selected support. Li & Bell note $\widehat{\Sigma}_1$ is negligible when $T_0 \gg T_2$, so the post-period term dominates in long-pre-period panels.

When to use. A genuinely sparse set of relevant controls; very large $N$ (even $N/T_0 \to \infty$); when an interpretable, computa- tionally cheap selection is preferred. (For selection consistency, Li & Bell note the adaptive LASSO; for prediction, plain LASSO already beats AIC/BIC and leave-many-out CV in their simulations.)

Forward selection (`fs`, Shi & Huang)#

Idea. Rather than penalize, grow the control set greedily. Start empty; at each step add the control whose inclusion maximizes the pre-treatment OLS $R^2$ (equivalently minimizes the residual sum of squares). The number of selected controls $R$ is a tuning parameter chosen by a modified BIC (Wang, Li & Tsai),

\[\widehat{R} = \operatorname*{argmin}_{r} \log\bigl(\widehat{\sigma}^2(\widehat{U}_r)\bigr) + \log(\log N)\,\frac{r\,\log T_0}{T_0},\]

with $\widehat{\sigma}^2(\widehat{U}_r)$ the pre-period residual variance of OLS on the $r$-unit set. The counterfactual is the OLS extrapolation on the chosen set $\widehat{U}_{\widehat{R}}$. Forward selection evaluates $\sum_r (N-r+1)$ regressions – linear in $N$ – versus the $2^N$ of exhaustive subset search.

Assumptions (Shi & Huang). Asymptotics are multi-index: $N\to\infty$ with $T_0 = T_0(N)$ deterministic, $\log N / T_0 \to 0$, and $T_2 = T_2(N) \to \infty$ with $\log N / T_2 \to 0$ ($N$ may exceed $T_0$).

Assumption 1 (sparse Riesz / restricted eigenvalue). The minimal eigenvalue of the population Gram matrix over any $u$-unit subset ($u \le (1+\delta_1)R$) is bounded below – a condition the authors show is a natural implication of the latent factor model, not an ad hoc restriction.

Remark. The bounded minimal eigenvalue is what keeps any candidate subset of donors well-conditioned, so the greedy step can identify the next informative control rather than chase a near-singular direction; that it follows from the factor model means it is not an extra demand on the data.

Assumption 2 (second moments). Sample second moments converge at the high-dimensional rate $O_p(\sqrt{\log N / T_0})$ with bounded fourth moments.

Remark. The high-dimensional rate is the pre-period price of admitting many donors: with bounded fourth moments the sample Gram matrix tracks its population counterpart even when $N$ is large relative to $T_0$.

Assumption 3 (post-period). Analogous convergence and long-run-variance bounds on the post-treatment data.

Remark. The matching post-period bounds are what let the average effect over $\mathcal{T}_2$ obey a central limit theorem with a consistently estimable long-run variance, the basis for the post-selection test below.

Assumption 4 (weak dependence). The series are strong ($\alpha$-) mixing with geometric decay, so a Berry-Esseen bound for heterogeneous time series applies.

Remark. The validity is uniform over a class of DGPs (Theorem 1) – it holds whether the true coefficients are dense or sparse, which separates fsPDA from the post-selection-inference literature that needs sparsity or the oracle property. Theorem 2 shows the greedy algorithm attains a regression variance asymptotically as small as the best $u$-unit subset, so the cheap forward search is statistically efficient.

Inference (Shi & Huang, Eq. 4). Because forward selection uses only the pre-period and, under weak dependence, the pre- and post-periods become asymptotically independent (sample splitting), the naive conditional t-statistic is valid:

\[\widehat{\mathcal{Z}}_{\widehat{U}} = \widehat{\rho}_{\tau\widehat{U}}^{-1}\sqrt{T_2}\, \widehat{\tau}_{\widehat{U}} \xrightarrow{d} N(0,1),\]

where $\widehat{\rho}^2_{\tau\widehat{U}}$ is the HAC long-run variance of the de-meaned post-period effects. No first-stage variance term is needed – the asymptotic independence absorbs it – which makes fsPDA’s inference the simplest of the three.

Note

Long-run-variance estimator. mlsynth defaults to the prewhitened Newey-West estimator (Andrews-Monahan VAR(1) prewhitening + Bartlett kernel with the data-driven NW(1994) bandwidth + finite-sample adjustment) – R’s sandwich::lrvar(..., prewhite = TRUE, adjust = TRUE), which Shi & Huang use in their application scripts. Prewhitening is essential when the treatment-effect series is strongly serially dependent: monthly growth rates mean-revert (lag-1 autocorrelation around $-0.45$ in the luxury-watch panel), and a plain Bartlett kernel cannot absorb that, leaving $\widehat\rho$ nearly double its true value and the test far too conservative. Setting lrvar_lag instead switches to the released est.fsPDA package’s fixed-lag Bartlett estimator (default lag $\lfloor T_2^{1/4}\rfloor$, capped at $\lfloor\sqrt{T_2}\rfloor$); on the watch panel that no-prewhitening form gives an insignificant $t \approx -1.15$, versus the prewhitened default’s $-2.51$ (the paper reports $-2.457$).

When to use. A large candidate-control pool where the goal is to synthesize an ensemble that mimics the outcome (not to interpret which controls are “causal”); when computational efficiency and honest post-selection inference matter; and regardless of whether the underlying model is sparse. (Shi & Huang recommend the adaptive LASSO instead when the identity of a few causal controls is the object of interest.)

Choosing among the three#

Method	Coefficient structure	Inference variance	Use when
`l2`	dense (factor-implied)	pre + post HAC (both terms)	dense coefficients; `N>T0`; prediction & heteroskedasticity-robustness
`lasso`	sparse	post HAC (+ first stage, small if `T0>>T2`)	few relevant controls; interpretable selection; very large `N`
`fs`	dense or sparse	post HAC only (sample splitting)	large pool; predictive ensemble; cheap; honest post-selection inference

Shared assumptions across the PDA class#

The three estimators (l2, lasso, fs) differ in how they fit $\boldsymbol\beta$, but they share the same identifying stack. Stated formally:

A1 (Latent factor model for untreated outcomes). All $N + 1$ units share at most $r$ common latent factors,

\[y_{jt}^N \;=\; \mu_j \;+\; \boldsymbol\lambda_j' \mathbf f_t \;+\; u_{jt}, \qquad j \in \{1\} \cup \mathcal N_0, \;\; t \in \mathcal T,\]

with $\mathbb E[\mathbf f_t u_{jt}] = 0$. This is the shared model underlying HCW (Hsiao-Ching-Wan 2012), Li-Bell (2017), Shi-Huang (2023), and Shi-Wang (l2). The factor structure is what licenses the linear projection of the treated unit’s untreated outcome on the controls’ outcomes,

\[y_{1t} \;=\; \alpha^0 \;+\; \mathbf x_t' \boldsymbol\beta^0 \;+\; \epsilon_t, \qquad \mathbb E[\mathbf x_t \epsilon_t] = \mathbf 0,\]

with $\boldsymbol\beta^0 = \boldsymbol\Omega^{-1} \boldsymbol\Lambda (\boldsymbol\Lambda' \boldsymbol\Omega^{-1} \boldsymbol\Lambda + \mathbf I_q)^{-1} \boldsymbol\lambda_1$.

A2 (Single treated unit, sharp absorbing aggregate-level treatment). Unit $j = 1$ is the only treated unit; treatment turns on at $T_0 + 1$ and stays on. Donors are untreated throughout (no interference). The original HCW / Li-Bell / Shi-Huang theorems are stated for this single-treated case. (The l2-relaxation paper Section 4.4 sketches a multiple-treated-units extension with a short post-window; the mlsynth implementation tracks the single-treated form.)

A3 (Weak temporal dependence). The series $(\mathbf x_t, y_{1t})$ are $\rho$-mixing or strong- mixing with at-least-geometric decay (the exact rate varies by variant):

Li-Bell A6: $w_t = (\widetilde y_t', \epsilon_{1t})$ is a weakly stationary $\rho$-mixing process with $\rho(s) = O(\lambda^s)$.
Shi-Huang A4: strong ($\alpha$-) mixing with geometric decay, so a Berry-Esseen bound for heterogeneous time series applies.
Shi-Wang A3-A4: time-series weak dependence at the $O_p(T_0^{-1/2})$ and $O_p(\sqrt{\log N / (N \wedge T_0)})$ rates for the sample moments.

This is what makes pre-period sample moments converge at the high-dimensional rate and – crucially – what makes the pre-period and post-period asymptotically independent, which is the engine behind fs-PDA’s sample-splitting inference and the two-term HAC variance in l2 / lasso.

A4 (Sample-size regime). $N \to \infty$, $T_0 = T_0(N) \to \infty$ deterministically with $\log N / T_0 \to 0$, $T_2 \to \infty$ with $\log N / T_2 \to 0$. $N$ may exceed $T_0$, which is the entire point of the high-dimensional PDA literature. Li-Bell’s A7 additionally posits $\eta = \lim T_2 / T_0 \in [0, \infty)$, which determines whether the first-stage variance term $\widehat\Sigma_1$ matters.

A5 (Donor pool regularity). The controls’ Gram matrix has enough variation:

For lasso (Li-Bell A2): $\mathrm{Rank}(\widetilde B) = K$ – removing the first row of the loading matrix leaves full-rank factor variation; $E[x_t x_t']$ is invertible on the active set.
For fs (Shi-Huang A1): a sparse Riesz / restricted eigenvalue condition – the minimum eigenvalue of the population Gram matrix over any $u$-unit subset ($u \le (1 + \delta_1) R$) is bounded below. The paper shows this is a natural implication of the latent factor model, not an ad-hoc lasso-style assumption.
For l2 (Shi-Wang A1-A2): factor strength $\xi_N = \phi_{\min}(\boldsymbol\Lambda' \boldsymbol\Lambda / N)$ may vanish (weak factors allowed); the idiosyncratic covariance has eigenvalues bounded in $[\underline\sigma^2, \overline\sigma^2]$.

A6 (Variant-specific structure of $\boldsymbol\beta^0$ ).

lasso: sparse $\boldsymbol\beta^0$ – only $m = o(T_0)$ of its components are non-zero.
l2: dense $\boldsymbol\beta^0$ – almost no exact zeros (the factor projection gives every donor a small-but-nonzero coefficient).
fs: agnostic – the inference is valid uniformly over a class of DGPs that includes both dense and sparse $\boldsymbol\beta^0$ (Theorem 1 in Shi-Huang).

A7 (Inferential regularity). For all three, the post-period average effect $\widehat\tau$ has a CLT with HAC long-run variance consistently estimable by Newey-West. For l2 and lasso, both the pre-period (first-stage) and post-period HAC variances enter; for fs, sample-splitting absorbs the first-stage term and only the post-period HAC variance enters.

When the assumptions bind: practical diagnostics#

Factor-driven DGP (A1). PDA’s whole identification story rides on the latent factor structure. If the panel is not well-described by a small number of common factors, the linear-projection equation has a non-vanishing $\epsilon_t$ term that the high-dimensional estimators cannot remove.

Plausibly violated when donors are largely idiosyncratic (each follows its own unrelated process), or when one or two donors have a structural break that the factor model can’t absorb. Diagnostic: run an SVD on the donor pre-period matrix; the top few singular values should carry the bulk of the spectral energy. If the spectrum is flat-tailed (no clear factor cutoff), the factor model fails and PDA’s linear-projection consistency is fragile. In that regime, use a balancing-aware estimator (MicroSynth (User-Level Balancing SC) if you have unit-level data) or stay with canonical SC.
Single treated unit, absorbing aggregate-level treatment (A2). Multiple treated units, treatment that turns off, or interference among donors break the framework.

Plausibly violated when the policy is rescinded mid- sample, or when treated and donor states are spatially or economically linked enough that the donors’ untreated trajectories shift. Diagnostic: the config validator enforces single-cohort; the silent failure is treated-donor spillover. Split donors by geographic / economic distance to the treated unit and refit; large ATE shifts flag interference. Use Spillover-Aware Synthetic Control (SPILLSYNTH) or Spatial Synthetic Difference-in-Differences (SpSyDiD) for genuine spillovers, FECT / Synthetic Difference-in-Differences (SDID) for staggered designs.
Weak temporal dependence (A3). All three variants assume mixing or $\rho$-mixing pre-period series with geometric decay. Unit-root outcomes, long-memory series, or series with structural breaks fail this.

Plausibly violated when the outcome is a price level, a cumulative quantity, or an undifferenced macroeconomic series. Diagnostic: ADF / KPSS on the pre-period residuals; non-stationarity flags A3 failure. The pre/post asymptotic-independence story (which licenses fs-PDA’s sample-splitting inference) is then in question. First- difference the outcome, or use Synthetic Business Cycle (SBC) (a stationary- cycle estimator) before PDA.
Sample-size regime (A4). PDA needs both $T_0$ and $T_2$ growing with $\log N$ small relative to each. A short post-period ($T_2 \le 5$) breaks the CLT on $\widehat\tau$; a short pre-period ($T_0 \le 20$) breaks the moment-convergence rates.

Plausibly violated when the pre-period is short with many donors. Diagnostic: compute $(\log N) / T_0$ and $(\log N) / T_2$; both should be visibly below 1. If they are not, the asymptotic approximation has not kicked in. Either lengthen the panel (aggregate to a finer time grid), prune donors, or move to canonical SCM / Two-Step Synthetic Control / Forward Difference-in-Differences (FDID) which work with shorter panels.
Donor regularity (A5). Each variant has its own Gram-matrix / factor-strength condition. The practical common failure is near-collinear donors: two donor series that move together up to noise produce a near- singular pre-period Gram matrix.

Plausibly violated when the donor pool contains near-duplicates (two adjacent states with essentially identical industry mix). Diagnostic: read the condition number of $\Gamma_{\mathcal T_1}(\mathbf x_t, \mathbf x_t')$. A condition number above ~1e6 is a red flag. For lasso and fs this manifests as selection flipping between near-clones across seeds; for l2 the $\varepsilon$-validation curve gets noisy. Prune one of each clone pair before refitting.
Coefficient structure (A6) – choosing the right variant. The biggest practitioner-side decision is whether to assume sparse or dense $\boldsymbol\beta^0$. Choosing wrong pays a real cost: lasso on a dense truth over-selects and inflates size (see the Path-B table above: LASSO’s size is 0.16-0.36 under the dense factor DGP, vs fs’s 0.05); l2 on a sparse truth pays variance for the dense fit it doesn’t need.

Diagnostic: fit fs first – it’s valid in both regimes per Shi-Huang Theorem 1, and the selected $\widehat R$ and per-step $R^2$ curve tell you whether you’re in a sparse (few donors carry the fit) or dense (many donors add information) regime. Then run the matched variant (lasso if fs keeps a handful; l2 if fs keeps many).
Inferential regularity (A7). The HAC long-run variance must be consistently estimable. With strong serial correlation and a short post-period, the Bartlett / Newey-West kernel needs more lags than the post-period supports.

Plausibly violated when $T_2 \le 8$ and the treatment-effect series is autocorrelated. Diagnostic: plot the autocorrelation function of res.gap[-T2:]; if it stays high beyond $\sqrt{T_2}$ lags, the Newey-West bandwidth choice is binding and the CI is optimistic. Lengthen the post-window if you can, or report bootstrap CIs alongside the HAC ones.

When to use PDA – and when not to#

Reach for PDA when:

You have a single treated unit, a moderate-to-large donor pool ($N$ comparable to or exceeding $T_0$), and a plausibly factor-driven panel. PDA was designed for exactly this regime, and unlike canonical SCM, the projection has no simplex / non-negativity constraint – it can extrapolate through negative coefficients on far donors when the factor structure demands it.
You want HAC-based, classical-statistics inference on the ATE with a closed-form normal-distribution test, not a permutation or conformal procedure. PDA’s CLT-based inference is what makes it the closest synthetic-control cousin to difference-in-differences from an inferential standpoint.
The treated unit’s pre-trajectory lies in the linear span (not necessarily convex hull) of the controls, so a no-constraint linear projection makes sense. PDA cannot recover an effect when the projection itself is impossible.
You’re between dense (l2) and sparse (lasso) regimes and want a uniformly valid test that doesn’t require choosing the right sparsity story – run fs.

Do not use PDA when:

You need convex (non-negative, sum-to-one) weights as a policy-interpretation deliverable. PDA’s no-constraint projection produces negative coefficients on far donors, which is awkward to explain in many policy contexts. Use canonical SCM / Two-Step Synthetic Control (canonical SC) or Forward-Selected Synthetic Control (FSCM) (forward-selected SC with the simplex retained).
The treated unit is structurally outside the donor span (not just the convex hull). PDA’s linear projection cannot reach a treated outcome that no linear combination of donor outcomes can express. The pre-period RMSE stays large at every PDA variant. Use Imperfect Synthetic Controls (ISCM), whose A2(b) mechanism identifies the effect through donors that use the treated unit as a positive-weight donor in their synthetic controls.
Outcomes are non-stationary (unit-root or trending series). A3 fails and the pre/post asymptotic-independence story breaks. First-difference the outcome (the l2-relaxation paper’s empirical PPI application does exactly this), or use Synthetic Business Cycle (SBC) (stationary-cycle estimator).
You have multiple treated units with overlapping cohorts. PDA’s theorems (with the exception of the l2 relaxerm which does support multiple treated units but is not written yet) are written for the single-treated case. Use Synthetic Difference-in-Differences (SDID) for staggered adoption.
Spillovers across donors. A2’s no-interference clause fails when donor states are economically linked to the treated state. Use Spillover-Aware Synthetic Control (SPILLSYNTH) or Spatial Synthetic Difference-in-Differences (SpSyDiD).
Continuous or multi-valued treatment. PDA encodes a single binary intervention; continuous dose belongs in Continuous-Treatment Synthetic Control (CTSC).
Distributional questions (Lorenz curves, QTEs). PDA targets the mean ATE through a Gaussian-likelihood linear projection. Use Distributional Synthetic Control (DSC) for distributional effects.
You need Bayesian posterior credible bands. PDA returns frequentist HAC-based CIs. For Bayesian inference and posterior inclusion probabilities on donors, use Bayesian Synthetic Control with a Soft Simplex Constraint (BVS-SS) (spike-and-slab with a soft simplex).
Very short pre-period $(T_0 \le 15)$ with many donors. The high-dimensional approximation has not kicked in; the selected $\widehat\beta$ is noise. Use canonical SCM / Two-Step Synthetic Control / Forward Difference-in-Differences (FDID), which work without high-dimensional asymptotics.
Very short post-period $(T_2 \le 5)$. The CLT on $\widehat\tau$ is shaky; the HAC bandwidth choice dominates the inference. Either accept a wider permutation CI from canonical SCM / Two-Step Synthetic Control, or move to the l2-relaxation multiple-treated-units extension (Shi-Wang Section 4.4) which is built for this regime.
You want predictor-level matching (covariates + pre-period outcomes), not outcome-only projection. PDA’s workhorse projection is on donor outcomes, not on predictor moments. Use canonical SCM / Two-Step Synthetic Control / Sparse Synthetic Control (SparseSC) (predictor selection with L1 penalty on the V-weight matrix) for the predictor-matching setup.
The factor model itself is the object of interest (you want to identify and interpret the factors). PDA is agnostic to factor estimation – the factor model only motivates the linear projection, never enters the estimator. Use Factor Model Approach (FMA) (factor-model-aware estimator) if recovering factors is the goal.

Empirical Illustration: Hong Kong economic integration#

The original HCW [HCW] application – and Shi & Huang’s Example 1 – evaluates the effect of economic integration with mainland China on Hong Kong’s quarterly real-GDP growth, using 24 comparison economies. Here all four PDA methods run on the same data, and the package returns the time-varying effect, the ATE, and a HAC confidence interval for each.

import pandas as pd
from mlsynth import PDA

url = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/refs/heads/main/basedata/HongKong.csv"
df = pd.read_csv(url)

res = PDA({"df": df, "outcome": "GDP", "treat": "Integration",
           "unitid": "Country", "time": "Time",
           "methods": ["hcw", "l2", "LASSO", "fs"], "alpha": 0.05,
           "display_graphs": True}).fit()

for name, fit in res.fits.items():
    print(f"{name:6s} ATE {fit.att:.4f}  SE {fit.att_se:.4f}  "
          f"95% CI ({fit.ci[0]:.4f}, {fit.ci[1]:.4f})  p={fit.p_value:.3f}  "
          f"donors={len(fit.donor_weights)}")

This prints:

hcw    ATE 0.0403  SE 0.0057  95% CI (0.0292, 0.0514)  p=0.000  donors=6
l2     ATE 0.0261  SE 0.0033  95% CI (0.0195, 0.0326)  p=0.000  donors=24
lasso  ATE 0.0330  SE 0.0054  95% CI (0.0224, 0.0436)  p=0.000  donors=11
fs     ATE 0.0395  SE 0.0059  95% CI (0.0280, 0.0510)  p=0.000  donors=9

All four find a significant positive integration effect on Hong Kong’s GDP growth – between +2.6 and +4.0 percentage points per quarter – differing only in how they choose controls. hcw runs the exact best-subset search and certifies a parsimonious six-economy model (Austria, Italy, Korea, Mexico, Norway, Singapore): on this pool the long 44-quarter pre-period gives the branch-and-bound room to certify the global optimum (certified_optimal is True, gap 0) even over 24 donors. l2 keeps all 24 controls with dense, signed coefficients (pre-RMSE 0.012); lasso keeps 11; fs grows to 9. The four estimates bracket the Forward-DiD result on the same data (0.025), a useful cross-method check.

Prediction intervals#

The HAC confidence interval above quantifies uncertainty in the average treatment effect. For uncertainty in each period’s effect, set prediction_intervals=True to attach the bootstrap prediction intervals of Jiang, Li, Shen and Zhou [pdapi] to every fitted variant. These bound the post-period untreated counterfactual $Y_t$ (and hence the effect $\Delta_t = y_t - \widehat Y_t$), combining the in-sample estimation error with the out-of-sample prediction error that a confidence interval for the mean omits. The construction (their Algorithm 2.1) resamples the pre-period prediction error with a dependent wild bootstrap and the post-period error with a residual bootstrap, refits the chosen estimator on each bootstrap sample at the fixed tuning parameter, and reads quantiles of the self-normalized statistic $\widehat e_t / \sqrt{\widehat V_t + \widehat\sigma^2}$.

The implementation lives at the shared mlsynth.utils.inferutils.pda_prediction_intervals() so any panel-data estimator can reuse it. Both the equal-tailed (eq) and symmetric (sy) intervals are returned, and each variant reports which studentization it used: sandwich for the post-selection OLS HAC variance $\widehat V_t$ (hcw, lasso and fs, which all select then run OLS – hcw is exactly the low-dimensional post-selection-OLS case of Jiang et al.’s Remark 2.1, with best subset as the selector re-run on each bootstrap draw), or the sigma2 fallback when that sandwich is undefined – as for the dense L2-relaxation when the controls outnumber the pre-periods.

res = PDA({"df": df, "outcome": "GDP", "treat": "Integration",
           "unitid": "Country", "time": "Time",
           "methods": ["l2", "LASSO", "fs"],
           "prediction_intervals": True, "pi_n_boot": 199, "pi_seed": 0,
           "display_graphs": False}).fit()

for name, fit in res.fits.items():
    eff = fit.prediction_intervals["effect"]
    print(name, fit.prediction_intervals["studentization"])
    for k in range(3):                       # first three post-quarters
        lo, hi = eff["eq_lower"][k], eff["eq_upper"][k]
        print(f"  t+{k+1}: effect {eff['point'][k]:+.4f}  95% PI ({lo:+.4f}, {hi:+.4f})")

The pointwise 95% effect intervals for the first three post-quarters (decimal GDP growth):

method  studentization   t+1                          t+2                          t+3
l2      sandwich         +0.0203 (-0.0128, +0.0411)   +0.0427 (+0.0076, +0.0658)   +0.0023 (-0.0347, +0.0272)
lasso   sandwich         +0.0312 (-0.0066, +0.0499)   +0.0519 (+0.0165, +0.0706)   +0.0152 (-0.0198, +0.0397)
fs      sandwich         +0.0249 (-0.0079, +0.0521)   +0.0467 (+0.0158, +0.0719)   +0.0103 (-0.0237, +0.0351)

Each band is wider than the ATE confidence interval, because it carries the out-of-sample prediction error on top of the estimation error, and it tightens or widens quarter by quarter as the fitted controls track each period better or worse.

hcw produces these same intervals, with one practical caveat: the bootstrap refits the entire selection on every draw, and HCW’s refit re-runs the best-subset search. On the 24-economy pool that is roughly ten seconds a draw, so its prediction intervals there cost minutes – while the certified point estimate and the HAC ATE interval need no resampling and return at once. On HCW’s intended, pre-screened pool (the ten candidate economies of the sovereignty study) the search is tiny and the intervals come straight back:

cands = ["China", "Indonesia", "Japan", "Korea", "Malaysia", "Philippines",
         "Singapore", "Taiwan", "Thailand", "United States"]
sub = df[df["Country"].isin(["Hong Kong"] + cands)]

hk = PDA({"df": sub, "outcome": "GDP", "treat": "Integration",
          "unitid": "Country", "time": "Time", "method": "hcw",
          "prediction_intervals": True, "pi_n_boot": 199,
          "display_graphs": False}).fit()
eff = hk.fits["hcw"].prediction_intervals["effect"]
lo, hi = eff["eq_lower"][0], eff["eq_upper"][0]
print(f"t+1: effect {eff['point'][0]:+.4f}  95% PI ({lo:+.4f}, {hi:+.4f})")
# t+1: effect +0.0304  95% PI (-0.0227, +0.0703)

Verification#

Note

Empirical (Path A, Hong Kong). All four PDA methods run on the HCW Hong Kong panel (above) and agree on a significant positive integration effect, consistent with the literature and the Forward-DiD cross-check (0.025).

Original HCW (Path A, best subset). benchmarks/cases/pda_hcw_hongkong.py reproduces Hsiao, Ching & Wan (2012) Tables XVI-XVII value-for-value with method="hcw": on the sovereignty study (ten candidate economies, T0 = 18), AICc selects {Japan, Korea, Taiwan, USA} with the published OLS weights, pre-period $R^2 = 0.9314$, and an insignificant average effect of -3.96% – cross-validating against the pampe R package.

Prediction intervals (Path B, coverage). The bootstrap prediction intervals are validated by benchmarks/cases/pda_pi_coverage.py, which reproduces the coverage geometry of Jiang et al. (2025) Tables 2-5 on their Setup 1: the equal-tailed and symmetric intervals cover near the nominal 95%, while the normal-quantile intervals under-cover (to about 77% under exponential errors).

Simulation study (Path B): forward selection vs LASSO#

Shi & Huang’s (2023) Table 1 compares forward selection against LASSO on a four-factor DGP, re-implemented in mlsynth.utils.pda_helpers.simulation.simulate_pda_panel(): four factors (f1 i.i.d.; f2 AR(1) 0.9; f3 MA(2) (0.8,0.4); f4 ARMA(1,1) (0.5,0.5) under the dynamic structure, all i.i.d. N(0,1) under the i.i.d. structure); loadings U(1,2) on the treated + 4 relevant controls and U(-0.1,0.1) on the remaining 96; idiosyncratic N(0,0.5); one treated unit, $N=100$ controls, $T_0=T_2$. Shocks D1-D7 set the post-period ATE (D1-D3 null → size; D4-D7 non-zero → power). Driving the packaged PDA (methods=["fs","LASSO"], fs_intercept=False) at $T_0=100$ (200 reps for size, 60 for power):

forward selection vs LASSO, $T_0=100$#
factors	method	# donors	size (D1)	power (D5)
i.i.d.	fs	3.9	0.090	1.00
i.i.d.	LASSO	9.5	0.065	1.00
dynamic	fs	4.5	0.075	1.00
dynamic	LASSO	15.0	0.140	1.00

The paper’s geometry reproduces. Forward selection is parsimonious – it keeps to ~the 4 relevant donors in both structures – while LASSO over-selects (9-15 donors). Forward selection’s test is correctly sized (≈ 0.05-0.09) under both factor structures, the robustness Shi & Huang emphasise. LASSO is correctly sized under i.i.d. factors (0.065, matching the paper’s 0.058) but its size inflates under dynamic factors (0.140; paper’s modified-BIC LASSO 0.184) – the size inflation the paper reports is a dynamic-factor phenomenon, not an i.i.d. one. Both tests are fully powered at D5 (mean-1 shift). Durable case: pda_table1.

Note

mlsynth’s LASSO is cross-validated; the paper’s is not. Shi & Huang select the Lasso penalty with a modified BIC (Remark 4 cont., p.521: “we tune the constants in the modified BIC to allow Lasso to take in more variables”); mlsynth’s L1-PDA selects it with LassoCV (5-fold cross-validation). The two are different penalty rules, so the LASSO cells above are mlsynth’s CV variant, not a cell-by-cell match of the paper’s Lasso. What both share – and what the benchmark pins – is the geometry: LASSO over-selects relative to fs and its size inflates under dynamic factors, while forward selection (the paper’s method, validated cell-by- cell on Hong Kong in pda_hongkong) stays parsimonious and correctly sized.

The fs_intercept knob – valid size on factor data

Achieving the correct fs size above required a fix. The released fsPDA R package (est.fsPDA.R) fits the donor regression with an intercept; on the paper’s mean-zero factor DGP that intercept absorbs a spurious pre-period constant which extrapolates into the post window, biasing the gap and inflating the null rejection rate to ~0.20. The paper’s Table 1 was produced by the simulation code (FS.R), which fits without an intercept and yields valid size. mlsynth exposes both via PDAConfig.fs_intercept (default False = the no-intercept, valid-size form; set True for panels with genuine unit level shifts). With the default, the fs D1 rejection rate drops from ~0.20 (intercept) to the ~0.05-0.09 reported above (no intercept).

One honest caveat: under dynamic factors fs shows mild residual size inflation at small $T$ (the “imprecise long-run-variance” effect the paper itself notes), shrinking toward 5% as $T_0 \to \infty$.

L2-relaxation (Shi & Wang). The l2 method’s out-of-sample MPSE falls with $T$ and its test approaches the nominal 5% size as $T_0 \to \infty$, matching Shi & Wang’s Table 2 (size $0.142$ at $T_0=50$ → $0.072$ at $200$). Its per-fit cross-validation over the $\varepsilon$ grid makes large Monte Carlos expensive (~5 s/fit), so the full table is summarized rather than swept here.

Multiple treated units#

When several units are treated by the same intervention, Shi & Wang’s L2-relaxation PDA fits a counterfactual per treated unit against the shared control pool and aggregates the effects into a per-period cross-sectional ATE,

\[\widehat{\mathrm{ATE}}_t = \frac{1}{J} \sum_{j=1}^{J} \bigl( y_{jt} - \widehat{y}_{jt} \bigr), \qquad \mathrm{s.e.} = \frac{\sqrt{\mathbf{1}'\widehat{\Sigma}_e\,\mathbf{1}}}{J},\]

with $\widehat{\Sigma}_e$ the cross-sectional covariance of the pre-period prediction residuals (so the standard error is constant across post-periods). mlsynth.utils.pda_helpers.multitreat.run_pda_multitreat() implements this. Because every per-unit fit shares the same $\widehat{\boldsymbol{\Sigma}} = X'X/T_0$, all $J$ fits run through a single OSQP factorisation – the batched solver mlsynth.utils.pda_helpers.l2.batch.l2_relax_batch() updates only the constraint bounds per (unit, tau) (hundreds of conic solves become hundreds of warm-started ADMM updates). On the Brexit study (52 UK firms vs 300 controls) this reproduces the paper’s first-day return ATE of $-4.3\%$ ($t \approx -7.4$); see benchmarks/cases/pda_brexit.py.

Core API#

Panel Data Approach (PDA) for program evaluation.

A thin, NumPy-first orchestration over mlsynth.utils.pda_helpers. PDA (Hsiao, Ching & Wan 2012) predicts a single treated unit’s untreated counterfactual by a linear regression on the control units fit over the pre-treatment window, then extrapolates out-of-sample to estimate the ATE. Four variants are supported, each with the inference theory from its own paper:

hcw – the original best-subset PDA (Hsiao, Ching & Wan 2012): unrestricted OLS (with intercept) on the best subset of controls, the subset chosen by AICc (also AIC / BIC); HAC t-test on the ATE. Matches the pampe R package and HCW Table XVI value-for-value.
l2 – L2-relaxation (Shi & Wang 2024): dense ridge-like coefficients via min ||beta||_2^2 s.t. ||eta_hat - Sigma_hat beta||_inf <= tau; ATE inference combines pre-residual and post-effect HAC long-run variances.
LASSO – L1 (Li & Bell 2017): sparse donor selection by cross-validated LASSO; HAC t-test on the ATE with a first-stage variance term.
fs – forward selection (Shi & Huang 2023): greedy R^2 selection with a modified-BIC stopping rule; post-selection HAC t-test (sample splitting). The scalable descendant of hcw’s combinatorial search.

Set method for one variant or methods to run several at once.

class mlsynth.estimators.pda.PDA(config: PDAConfig | dict)#

Bases: object

Panel Data Approach estimator (l2 / LASSO / fs).

Parameters:: config (PDAConfig or dict) – Validated configuration. Beyond the common fields, PDA reads method / methods (which variant(s) to run), tau (the l2 relaxation parameter; None selects it by validation), and alpha (CI level).

References

Hsiao, C., Ching, H. S., & Wan, S. K. (2012). A panel data approach for program evaluation. Journal of Applied Econometrics, 27(5), 705-740.

Shi, Z., & Wang, Y. (2024). L2-relaxation for Economic Prediction.

Li, K. T., & Bell, D. R. (2017). Estimation of average treatment effects with panel data. Journal of Econometrics, 197(1), 65-75.

Shi, Z., & Huang, J. (2023). Forward-selected panel data approach for program evaluation. Journal of Econometrics, 234(2), 512-535.

fit() → PDAResults#

Run the requested PDA variant(s) and return typed results.

Returns:: PDAResults – Container of per-variant PDAMethodFit objects (coefficients, counterfactual, gap, ATE, HAC standard error, CI, p-value, donor weights), with convenience aliases (att, att_se, counterfactual, donor_weights) forwarding to the primary variant.

Configuration#

class mlsynth.config_models.PDAConfig(*, df: ~pandas.DataFrame, outcome: str, treat: str, unitid: str, time: str, display_graphs: bool = True, save: bool | str = False, counterfactual_color: ~typing.List[str] = <factory>, treated_color: str = 'black', plot: ~mlsynth.config_models.PlotConfig = <factory>, method: ~typing.Annotated[str, _PydanticGeneralMetadata(pattern='^(LASSO|l2|fs|hcw)$')] = 'fs', methods: ~typing.List[str] | None = None, hcw_criterion: ~typing.Annotated[str, _PydanticGeneralMetadata(pattern='^(AICc|AIC|BIC)$')] = 'AICc', hcw_nvmax: ~typing.Annotated[int | None, ~annotated_types.Gt(gt=0)] = None, hcw_backend: ~typing.Annotated[str, _PydanticGeneralMetadata(pattern='^(fw|scip)$')] = 'fw', tau: float | None = None, alpha: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Lt(lt=1.0)] = 0.05, fs_intercept: bool = False, lrvar_lag: int | None = None, l2_standardize: bool = True, prediction_intervals: bool = False, pi_n_boot: ~typing.Annotated[int, ~annotated_types.Ge(ge=2)] = 999, pi_seed: int | None = 0)#

Configuration for the Panel Data Approach (PDA) estimator.

alpha: float#

fs_intercept: bool#

hcw_backend: str#

hcw_criterion: str#

hcw_nvmax: int | None#

l2_standardize: bool#

lrvar_lag: int | None#

method: str#

methods: List[str] | None#

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

pi_n_boot: int#

pi_seed: int | None#

prediction_intervals: bool#

tau: float | None#

Result Containers#

PDA.fit() returns a PDAResults, whose fits maps each variant to a PDAMethodFit (coefficients, intercept, counterfactual, gap, ATE, HAC standard error, CI, p-value, donor weights, and the selected-donor list for lasso/fs). The prepared, NumPy-only panel is exposed as a PDAInputs, with units and time addressed through an IndexSet.

Note

PDA.fit() returns an EffectResult on the standardized two-family contract. It is a dispatcher over the variants in res.fits (l2 / lasso / fs); the selected variant drives the flat accessors (res.att / res.att_ci / res.counterfactual / res.gap / res.donor_weights / res.pre_rmse), which resolve through the standardized sub-models. res.donor_weights are the regression coefficients (PDA is a regression counterfactual, not a simplex average); res.att_by_method() / res.se_by_method() report every variant.

Frozen, NumPy-first containers for the Panel Data Approach (PDA).

PDA (Hsiao, Ching & Wan 2012) predicts a treated unit’s untreated counterfactual by a linear regression on the control units fit over the pre-treatment window, then extrapolates out-of-sample. mlsynth exposes three high-dimensional PDA variants, each in its own subpackage with the inference theory from its own paper:

l2 – L2-relaxation (Shi & Wang 2024).
lasso – L1/LASSO (Li & Bell 2017).
fs – forward-selected PDA (Shi & Huang 2023).
hcw – original best-subset PDA (Hsiao, Ching & Wan 2012).

Everything below is pure NumPy; units/time are addressed through IndexSet. The only DataFrame touchpoint is setup.

class mlsynth.utils.pda_helpers.structures.PDAInputs(unit_index: ~mlsynth.utils.fast_scm_helpers.structure.IndexSet, time_index: ~mlsynth.utils.fast_scm_helpers.structure.IndexSet, y: ~numpy.ndarray, X: ~numpy.ndarray, T0: int, treated_label: ~typing.Any, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#

Bases: object

Preprocessed, NumPy-only panel for the PDA engine.

Parameters:

unit_index (IndexSet) – All N donor units (column order of X).
time_index (IndexSet) – All T periods (row order of y and X).
y (np.ndarray) – Treated-unit outcome over all periods, shape (T,).
X (np.ndarray) – Donor outcomes, shape (T, N).
T0 (int) – Number of pre-treatment periods (T1); post is T2 = T - T0.
treated_label (Any) – Identifier of the treated unit.
metadata (dict) – Free-form provenance.

property T: int#

T0: int#

property T2: int#

X: ndarray#

property donor_labels: ndarray#

metadata: Dict[str, Any]#

property n_donors: int#

time_index: IndexSet#

treated_label: Any#

unit_index: IndexSet#

y: ndarray#

class mlsynth.utils.pda_helpers.structures.PDAMethodFit(name: str, beta: ~numpy.ndarray, intercept: float, counterfactual: ~numpy.ndarray, gap: ~numpy.ndarray, att: float, att_se: float, ci: ~typing.Tuple[float, float], p_value: float, donor_weights: ~typing.Dict[~typing.Any, float], selected_donors: ~typing.List[~typing.Any] | None = None, prediction_intervals: ~typing.Dict[str, ~typing.Any] | None = None, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#

Bases: object

A single PDA-variant fit (l2 / lasso / fs).

att: float#

att_se: float#

beta: ndarray#

ci: Tuple[float, float]#

counterfactual: ndarray#

donor_weights: Dict[Any, float]#

gap: ndarray#

intercept: float#

metadata: Dict[str, Any]#

name: str#

p_value: float#

prediction_intervals: Dict[str, Any] | None = None#

selected_donors: List[Any] | None = None#

class mlsynth.utils.pda_helpers.structures.PDAResults(*, effects: ~mlsynth.config_models.EffectsResults | None = None, fit_diagnostics: ~mlsynth.config_models.FitDiagnosticsResults | None = None, time_series: ~mlsynth.config_models.TimeSeriesResults | None = None, weights: ~mlsynth.config_models.WeightsResults | None = None, inference: ~mlsynth.config_models.InferenceResults | None = None, method_details: ~mlsynth.config_models.MethodDetailsResults | None = None, sub_method_results: ~typing.Dict[str, ~typing.Any] | None = None, additional_outputs: ~typing.Dict[str, ~typing.Any] | None = None, raw_results: ~typing.Dict[str, ~typing.Any] | None = None, execution_summary: ~typing.Dict[str, ~typing.Any] | None = None, plot_config: ~mlsynth.config_models.PlotConfig | None = None, inputs: ~mlsynth.utils.pda_helpers.structures.PDAInputs, fits: ~typing.Dict[str, ~mlsynth.utils.pda_helpers.structures.PDAMethodFit], selected_variant: str = 'fs', metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#

Bases: BaseEstimatorResults

Top-level container returned by mlsynth.PDA.fit().

An EffectResult (the observational report): a dispatcher over the PDA variants in fits, it lifts the selected variant’s quantities into the standardized sub-models so the flat accessors (att / att_ci / counterfactual / gap / donor_weights / pre_rmse) resolve through the base contract. donor_weights here are the regression coefficients (PDA is a regression/factor counterfactual, not a simplex donor average). The PDA-specific fields keep every variant fit.

Parameters:

inputs (PDAInputs) – Preprocessed panel.
fits (dict) – {variant_name: PDAMethodFit} for every variant run (l2 / lasso / fs).
selected_variant (str) – Which variant drives the standardized surface / flat accessors.

att_by_method() → Dict[str, float]#

property att_se: float#

fits: Dict[str, PDAMethodFit]#

inputs: PDAInputs#

metadata: Dict[str, Any]#

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'frozen': True, 'json_encoders': {<class 'numpy.ndarray'>: <function BaseEstimatorResults.Config.<lambda>>}}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

se_by_method() → Dict[str, float]#

selected_variant: str#

Helper Modules#

Data preparation – the only DataFrame touchpoint: pivots to NumPy, builds the unit/time IndexSetes, and splits pre/post.

Long-DataFrame -> NumPy boundary for PDA (the only pandas touchpoint).

mlsynth.utils.pda_helpers.setup.derive_treatment(df: DataFrame, unitid: str, time: str, treat: str) → Tuple[Any, Any]#: Read the single treated unit and its first treated period from treat.

mlsynth.utils.pda_helpers.setup.prepare_pda_inputs(df: DataFrame, *, unitid: str, time: str, outcome: str, treat: str) → PDAInputs#

Pivot the panel to NumPy, build IndexSets, split pre/post.

Returns:: PDAInputs – Pure-NumPy container: treated vector y (T,), donor matrix X (T, N), T0, and unit/time IndexSets.

Shared HAC long-run-variance machinery (Bartlett/Newey-West) and the N(0,1) test used by every PDA variant.

Shared HAC long-run-variance machinery for PDA inference.

All three PDA variants base inference on a heteroskedasticity- and autocorrelation-consistent (HAC) long-run variance of a scalar time series of post-period treatment effects (and, for some variants, pre-period prediction residuals). The Bartlett/Newey-West and uniform kernels are provided; the default truncation lag follows Newey-West.

mlsynth.utils.pda_helpers.inference.fspda_lrvar_lag(T2: int, lrvar_lag: int | None = None) → int#

Bartlett-kernel truncation lag for the fsPDA long-run variance.

Follows Shi & Huang’s released fsPDA package (est.fsPDA): the default is floor(T2 ** (1/4)) and a user-supplied lag must be a non-negative integer no larger than floor(sqrt(T2)).

mlsynth.utils.pda_helpers.inference.hac_lrv(z: ndarray, lag: int | None = None, kernel: str = 'bartlett') → float#

HAC long-run variance of a mean-zero-able scalar series z.

lrv = gamma_0 + sum_{l=1}^{L} k(l) * 2 * gamma_l with gamma_l the sample autocovariance and k the Bartlett (1 - l/(L+1)) or uniform kernel. The series is de-meaned first.

mlsynth.utils.pda_helpers.inference.lrvar_prewhite_nw(x: ndarray) → float#

Prewhitened Newey-West long-run variance of the mean of x.

Andrews-Monahan (1992) VAR(1) prewhitening + a Bartlett kernel with the Newey-West (1994) data-driven bandwidth + the finite-sample adjustment – i.e. R’s sandwich::lrvar(x, type = "Newey-West", prewhite = TRUE, adjust = TRUE), which Shi & Huang use for the post-selection t-test in their applications (app1_luxury_watch/fsPDA.R). Returns the variance of the sample mean (so the t-statistic is mean(x) / sqrt(lrvar_prewhite_nw(x))).

Prewhitening is what makes this differ sharply from the plain Bartlett estimator on strongly serially-dependent effects (e.g. monthly growth rates, which mean-revert): the VAR(1) step removes the bulk of the dependence before the kernel smooths the remainder, then the long-run factor 1 / (1 - A) recolors it.

mlsynth.utils.pda_helpers.inference.newey_west_lag(n: int) → int#: Newey-West (1994) automatic truncation lag floor(4 (n/100)^(2/9)).

mlsynth.utils.pda_helpers.inference.normal_test(att: float, se: float, alpha: float = 0.05)#: Two-sided N(0,1) test: returns (p_value, (ci_lower, ci_upper)).

L2-relaxation (Shi & Wang): the relaxation primal, $\varepsilon$ validation, and the two-term HAC ATE inference.

L2-relaxation estimation (Shi & Wang 2024).

Solves the L2-relaxation primal (their Eq. 3)

min_beta (1/2) ||beta||_2^2 s.t. || eta_hat - Sigma_hat beta ||_inf <= tau,

with Sigma_hat = X'X / T1 and eta_hat = X'y / T1 over the pre-period.

Following the authors’ released L2relax (ishwang1/L2relax-PDA), the treated and control series are standardised (demeaned and scaled to unit variance) before forming Sigma / eta, and the solution is mapped back to the original scale: beta = sd_y * (beta_tilde / sd_X) and intercept alpha = mean_y - mean_X . beta. Standardisation is the default (it is what the replication code does and what reproduces the paper’s empirical results); set standardize=False for the raw-scale variant. The tuning parameter tau is chosen by out-of-sample validation over a log-spaced grid (the optimal tau is often a tiny fraction of max|eta|).

mlsynth.utils.pda_helpers.l2.estimation.cross_validate_tau(y_pre: ndarray, X_pre: ndarray, val_frac: float = 0.2, n_coarse: int = 40, n_fine: int = 40, standardize: bool = True, tau_grid: ndarray | None = None) → float#

Validate tau by sequential out-of-sample validation on the tail.

The training window is split in time order – fit on the earlier 1 - val_frac fraction, validate on the most-recent val_frac tail – so no future period informs an earlier counterfactual (unlike a K-fold split over contiguous blocks, which trains on both past and future of each fold). tau_grid overrides the automatic log-spaced grid (e.g. to match a reference grid); otherwise the grid is log-spaced up to max|eta| (the optimal tau is often a tiny fraction of that).

mlsynth.utils.pda_helpers.l2.estimation.fit_l2(y: ndarray, X: ndarray, T0: int, tau: float | None = None, standardize: bool = True, tau_grid: ndarray | None = None) → Tuple[ndarray, float, ndarray, float]#: Fit L2-relaxation; return (beta, intercept, counterfactual, tau).

mlsynth.utils.pda_helpers.l2.estimation.l2_relax(y_pre: ndarray, X_pre: ndarray, tau: float, standardize: bool = True) → Tuple[ndarray, float]#

Solve the L2-relaxation primal for coefficients and intercept.

The primal objective ||beta||^2 / 2 is strictly convex, so the solution is unique (Shi & Wang 2024, Lemma 2). It is solved natively with polished OSQP (mlsynth.utils.pda_helpers.l2.batch.l2_relax_solve()) – the same solver the multiple-treated batch uses – which matches an interior-point (cvxpy/CLARABEL) solve on the unique optimum without cvxpy’s per-call canonicalisation overhead.

ATE inference for L2-relaxation PDA (Shi & Wang 2024, Theorem 3).

Single treated unit. With the post-period estimated effects Delta_hat_t = y_t - y_hat_t (t in T2) and the pre-period prediction residuals e_t = y_t - y_hat_t (t in T1), the ATE Delta_bar is asymptotically normal,

Z_hat = Delta_bar / sqrt( rho_hat_(1)^2 / T1 + rho_hat_(2)^2 / T2 ) -> N(0, 1),

where rho_hat_(1)^2 is the HAC long-run variance of the pre-period residuals and rho_hat_(2)^2 is the HAC long-run variance of the de-meaned post-period effects. Both estimation uncertainty (pre) and post-period noise contribute.

mlsynth.utils.pda_helpers.l2.inference.l2_ate_inference(y: ndarray, counterfactual: ndarray, T0: int, alpha: float = 0.05) → Tuple[float, float, Tuple[float, float], float]#: Return (att, se, ci, p_value) for the L2-relaxation ATE.

LASSO (Li & Bell): cross-validated L1 estimation and the HAC t-test with a first-stage variance term.

L1/LASSO PDA estimation (Li & Bell 2017).

Selects control units and estimates coefficients by LASSO on the pre-period, with the penalty chosen by cross-validation, then predicts the treated unit’s counterfactual out-of-sample.

mlsynth.utils.pda_helpers.lasso.estimation.fit_lasso(y: ndarray, X: ndarray, T0: int, cv: int = 5, random_state: int = 0, alpha: float | None = None) → Tuple[ndarray, float, ndarray, ndarray]#

Fit LASSO PDA; return (beta, intercept, counterfactual, support_mask).

alpha fixes the penalty (skipping cross-validation); None (default) selects it by LassoCV. The fixed-penalty path is used by the prediction-interval bootstrap, which reuses lambda_{T0}.

mlsynth.utils.pda_helpers.lasso.estimation.lasso_cv_alpha(y: ndarray, X: ndarray, T0: int, cv: int = 5, random_state: int = 0) → float#

The cross-validated LASSO penalty for the pre-period (the alpha_ that fit_lasso() selects).

Exposed so the prediction-interval bootstrap can refit at a fixed penalty (Jiang et al. 2025 reuse lambda_{T0} across bootstrap samples rather than re-cross-validating each one).

ATE inference for LASSO PDA (Li & Bell 2017, Theorem 3.2).

The test statistic is sqrt(T2) * Delta_bar / sqrt(Sigma_hat) with Sigma_hat = Sigma_hat_1 + Sigma_hat_2:

Sigma_hat_2 – the HAC (Newey-West) long-run variance of the post-period effects, the always-present post-period averaging variance;
Sigma_hat_1 – the first-stage (pre-period estimation) variance, here the OLS prediction variance of the mean post-period counterfactual on the LASSO-selected support. Li & Bell note it is negligible when T1 >> T2.

Equivalently the ATE standard error is SE = sqrt(var_firststage + Sigma_hat_2 / T2).

mlsynth.utils.pda_helpers.lasso.inference.lasso_ate_inference(y: ndarray, X: ndarray, counterfactual: ndarray, support: ndarray, T0: int, alpha: float = 0.05) → Tuple[float, float, Tuple[float, float], float]#: Return (att, se, ci, p_value) for the LASSO PDA ATE.

Forward selection (Shi & Huang): greedy R^2 selection with modified-BIC stopping and the post-selection HAC t-test.

Forward-selected PDA estimation (Shi & Huang 2023).

Greedy forward selection of control units: at each step add the donor whose inclusion maximizes the pre-treatment OLS R^2 (equivalently minimizes the residual variance sigma^2 = mean(e^2)). Selection stops as soon as the modified information criterion

IC(r) = log( sigma^2(U_r) ) + B * r, B = log(log N) * log(T1) / T1

stops decreasing (the stopping rule of Wang, Li & Tsai used in the authors’ fsPDA R package), starting from the intercept-only IC = log(var(y1)). The counterfactual is the OLS extrapolation on the selected set. This is a direct port of est.fsPDA.R.

mlsynth.utils.pda_helpers.fs.estimation.forward_select(y: ndarray, X: ndarray, T0: int, intercept: bool = False) → Tuple[List[int], ndarray, float, ndarray]#

Forward-select donors with the stop-on-increase IC rule, then refit OLS.

Returns (selected_indices, beta_full, intercept, counterfactual) where beta_full is an N-vector with zeros off the selected support.

Parameters:: intercept (bool, default False) – When False the donor regression is fit without a constant and the IC baseline is log(mean(y_pre**2)) – the paper’s simulation convention (Shi & Huang 2023, simulation/FS.R). This is the correctly-specified form for the factor DGP (mean-zero factors, no unit level shift) and yields valid test size. When True a constant is included and the baseline is log(var(y_pre)) – the released fsPDA R package convention (est.fsPDA.R), appropriate for panels with genuine level differences but size-distorting on centered data.

Post-selection ATE inference for forward-selected PDA (Shi & Huang 2023, Eq. 4).

Because forward selection uses pre-treatment data only and the pre/post periods become asymptotically independent under weak dependence, the naive conditional t-statistic is valid:

Z_U = Delta_bar / sqrt(lrvar_hat) -> N(0, 1),

where lrvar_hat is the long-run variance of the sample mean of the (de-meaned) post-period treatment effects. By default mlsynth uses the prewhitened Newey-West estimator (sandwich::lrvar(..., prewhite = TRUE, adjust = TRUE)) that Shi & Huang use in their applications (app1_luxury_watch/fsPDA.R); on serially-dependent effects this is far less conservative than a plain Bartlett kernel. Supplying lrvar_lag instead switches to the released est.fsPDA package’s fixed-lag Bartlett estimator (floor(T2 ** (1/4)) if the cap, see fspda_lrvar_lag()).

mlsynth.utils.pda_helpers.fs.inference.fs_ate_inference(y: ndarray, counterfactual: ndarray, T0: int, alpha: float = 0.05, lrvar_lag: int | None = None) → Tuple[float, float, Tuple[float, float], float]#: Return (att, se, ci, p_value) for the forward-selected PDA ATE.

Run loop assembling the per-variant fits.

Run the requested PDA variant(s) and assemble typed per-method fits.

mlsynth.utils.pda_helpers.orchestration.resolve_methods(method: str, methods: List[str] | None) → List[str]#: Internal method keys to run (explicit methods win over method).

mlsynth.utils.pda_helpers.orchestration.run_pda(inputs: PDAInputs, methods: List[str], tau: float | None, alpha: float, fs_intercept: bool = False, lrvar_lag: int | None = None, l2_standardize: bool = True, l2_tau_grid: Sequence[float] | None = None, hcw_criterion: str = 'AICc', hcw_nvmax: int | None = None, hcw_backend: str = 'fw', prediction_intervals: bool = False, pi_n_boot: int = 999, pi_seed: int | None = 0) → Dict[str, PDAMethodFit]#

Fit each requested PDA variant with its own paper’s inference.

When prediction_intervals is set, Jiang et al. (2025) bootstrap prediction intervals (per-period treatment effect and counterfactual) are attached to each variant via mlsynth.utils.inferutils.pda_prediction_intervals().

Panel Data Approach (PDA)

Contents

Panel Data Approach (PDA)#

When to Use This Estimator#

Notation#

The shared model#

Original best subset (hcw, Hsiao-Ching-Wan)#

Computing the best subset: the optimisation#

L2-relaxation (l2, Shi & Wang)#

LASSO (lasso, Li & Bell)#

Forward selection (fs, Shi & Huang)#

Choosing among the three#

Shared assumptions across the PDA class#

When the assumptions bind: practical diagnostics#

When to use PDA – and when not to#

Empirical Illustration: Hong Kong economic integration#

Prediction intervals#

Verification#

Simulation study (Path B): forward selection vs LASSO#

Multiple treated units#

Core API#

Configuration#

Result Containers#

Helper Modules#

Original best subset (`hcw`, Hsiao-Ching-Wan)#

L2-relaxation (`l2`, Shi & Wang)#

LASSO (`lasso`, Li & Bell)#

Forward selection (`fs`, Shi & Huang)#