Panel Data Approach (PDA)

Contents

Panel Data Approach (PDA)#

When to Use This Estimator#

The panel data approach (PDA) of Hsiao, Ching and Wan [HCW] estimates a single treated unit’s untreated counterfactual by a linear regression on the control units, fit over the pre-treatment window and extrapolated out-of-sample. Unlike the synthetic control method, PDA imposes no constraints on the coefficients (no simplex, no non-negativity), so it is a plain projection justified by a latent-factor model: if every unit loads on a small set of common factors, the treated unit’s untreated path is a linear combination of the controls’ paths plus an orthogonal error.

The challenge is which controls and how many. Classical PDA was built for low dimensions (few controls relative to pre-periods) and chooses controls by AIC/BIC, which break down once the number of controls N approaches or exceeds the pre-period length T1. mlsynth packages three high-dimensional PDA variants that resolve this differently, each with the estimation and inference theory of its own paper:

  • L2-relaxation (l2; Shi & Wang [l2relax]) – a dense estimator (a “cousin of ridge”) for when the factor model makes the projection coefficients dense; tolerates N > T1; prediction is robust to heteroskedasticity.

  • LASSO (lasso; Li & Bell [LASSOPDA]) – an L1 estimator that selects a sparse set of relevant controls; computationally far cheaper than AIC/BIC and works for N > T1.

  • Forward selection (fs; Shi & Huang [fsPDA]) – a greedy procedure that grows the control set one unit at a time, with valid post-selection inference and no sparsity requirement (works for dense or sparse models).

All three target the single-treated-unit, many-candidate-controls regime and produce a time-varying treatment effect and an average treatment effect (ATE) with a HAC-based confidence interval. The practical choice among them (detailed below) follows the authors’ own arguments: l2 when the coefficients are dense, lasso when only a few controls matter and you want an interpretable selection, fs when you want a cheap, predictive ensemble with honest post-selection inference regardless of sparsity.

Notation#

We adopt the notation of Shi & Wang [l2relax]. For a positive integer n, \([n] = \{1, \ldots, n\}\); \(\mathbf{I}_n\) is the identity, \(\mathbf{1}_n\), \(\mathbf{0}_n\) the all-ones/zeros vectors. For a matrix \(\mathbf{A} = (a_{ij})\) and index sets \(\mathcal{S}, \mathcal{Q}\), the submatrix is \(\mathbf{A}_{\mathcal{S}\mathcal{Q}} = (a_{ij})_{i\in\mathcal{S}, j\in\mathcal{Q}}\) and the subvector \(\mathbf{x}_{\mathcal{Q}} = (x_i)_{i\in\mathcal{Q}}\). \(\phi_{\min}(\cdot)\), \(\phi_{\max}(\cdot)\) denote the smallest and largest eigenvalues; \(\|\mathbf{A}\|_\infty = \max_{ij}|a_{ij}|\), \(\|\mathbf{A}\|_2 = \sqrt{\phi_{\max}(\mathbf{A}'\mathbf{A})}\), \(\|\mathbf{x}\|_1 = \sum_i |x_i|\).

The two time-series operators are central. Over an index set \(\mathcal{S}\) of periods,

\[\mathcal{E}_{\mathcal{S}}(\mathbf{x}_t) = \frac{1}{|\mathcal{S}|} \sum_{t\in\mathcal{S}} \mathbf{x}_t, \qquad \Gamma_{\mathcal{S}}(\mathbf{x}_t, \mathbf{y}_t') = \mathcal{E}_{\mathcal{S}} \Bigl( [\mathbf{x}_t - \mathcal{E}_{\mathcal{S}}(\mathbf{x}_t)] [\mathbf{y}_t - \mathcal{E}_{\mathcal{S}}(\mathbf{y}_t)]' \Bigr),\]

the sample mean and sample covariance of a series over \(\mathcal{S}\). The intervention occurs after the pre-period \(\mathcal{T}_1 = \{1, \ldots, T_1\}\); the post-period is \(\mathcal{T}_2 = \{T_1+1, \ldots, T\}\) with \(T_2 = |\mathcal{T}_2|\). Unit \(j=0\) is treated, \(\mathcal{N} = \{1,\ldots,N\}\) the controls.

The shared model#

All three methods rest on a common latent-factor data-generating process: for \(t \in \mathcal{T}\),

\[y_{jt}^0 = \mu_j + \boldsymbol{\lambda}_j' \mathbf{f}_t + u_{jt}, \quad j \in \{0\}\cup\mathcal{N},\]

with \(\mathbf{f}_t\) a \(q\)-vector of latent common factors, \(\boldsymbol{\lambda}_j\) factor loadings, and \(u_{jt}\) a weakly-dependent idiosyncratic error orthogonal to the factors. Because the common factors drive both the treated unit and the controls, the untreated treated outcome admits a linear projection on the controls,

\[y_{0t} = \alpha^0 + \mathbf{x}_t' \boldsymbol{\beta}^0 + \epsilon_t, \qquad \mathbf{x}_t = (y_{1t}, \ldots, y_{Nt})',\]

with \(\mathbb{E}[\mathbf{x}_t \epsilon_t] = \mathbf{0}\). PDA fits \((\alpha^0, \boldsymbol{\beta}^0)\) on \(\mathcal{T}_1\) and predicts \(\hat{y}_{0t}^0 = \mathcal{E}_{\mathcal{T}_1}(y_s) + [\mathbf{x}_t - \mathcal{E}_{\mathcal{T}_1}(\mathbf{x}_s)]'\hat{\boldsymbol{\beta}}\) for \(t \in \mathcal{T}_2\). The treatment effect is \(\hat{\Delta}_t = y_{0t} - \hat{y}_{0t}^0\) and the ATE \(\bar{\Delta} = \mathcal{E}_{\mathcal{T}_2}(\hat{\Delta}_t)\). The methods differ in how they estimate \(\boldsymbol{\beta}\) and, crucially, in the inference theory each paper proves for \(\bar{\Delta}\).

L2-relaxation (l2, Shi & Wang)#

Idea. Under the factor model the projection coefficient \(\boldsymbol{\beta}^0 = \boldsymbol{\Omega}^{-1}\boldsymbol{\Lambda} (\boldsymbol{\Lambda}'\boldsymbol{\Omega}^{-1}\boldsymbol{\Lambda} + \mathbf{I}_q)^{-1}\boldsymbol{\lambda}_0\) is dense in general – almost no entries are exactly zero. Sparse methods (LASSO) are then mis-matched. With \(\hat{\boldsymbol{\Sigma}} = \Gamma_{\mathcal{T}_1}(\mathbf{x}_t, \mathbf{x}_t')\) and \(\hat{\boldsymbol{\eta}} = \Gamma_{\mathcal{T}_1}(\mathbf{x}_t, y_t)\), OLS solves the KKT condition \(\hat{\boldsymbol{\Sigma}}\boldsymbol{\beta} = \hat{\boldsymbol{\eta}}\), which is unstable or non-unique once \(N\) is close to or exceeds \(T_1\). L2-relaxation relaxes the sup-norm of this moment condition by a tuning parameter \(\tau\) and minimizes the coefficient norm:

\[\min_{\boldsymbol{\beta}} \tfrac{1}{2}\|\boldsymbol{\beta}\|_2^2 \quad \text{s.t.} \quad \|\hat{\boldsymbol{\eta}} - \hat{\boldsymbol{\Sigma}}\boldsymbol{\beta}\|_\infty \le \tau.\]

This is the “bias-variance trade-off” made explicit: tolerating a small violation \(\tau\) of the OLS moment condition shrinks the variance. At \(\tau = 0\) it reduces to (ridgeless) OLS; at \(\tau \ge \|\hat{\boldsymbol{\eta}}\|_\infty\) it gives \(\boldsymbol{\beta} = \mathbf{0}\). mlsynth picks \(\tau\) by sequential out-of-sample validation on the tail of the training window (the validated \(\tau\) tracks the infeasible-optimal one, and both shrink toward zero as the sample grows) over a log-spaced grid down to \(10^{-4}\max|\hat{\boldsymbol{\eta}}|\) (the optimum is often a tiny fraction of the cap). This is time-respecting – the fit never sees periods later than the validation tail – unlike the released L2relax.CV, whose 5-block K-fold trains on both past and future of each block.

Note

Standardisation. Following the authors’ released L2relax, the treated and control series are standardised (demeaned and scaled to unit variance) before forming \(\hat{\boldsymbol{\Sigma}}\) / \(\hat{\boldsymbol{\eta}}\), and the solution is mapped back to the original scale. This is the default (l2_standardize=True) – the \(\ell_2\) penalty is scale-sensitive, so standardisation is both recommended and what reproduces the paper’s empirical results; on the Hong Kong panel it moves the L2 estimate from \(2.48\%\) to \(2.61\%\) (closer to Shi & Wang’s \(2.65\%\)). Set l2_standardize=False for the raw-scale variant.

Assumptions (Shi & Wang).

Assumption 1 (loadings). \(\|\boldsymbol{\lambda}_0\|_\infty + \|\boldsymbol{\Lambda}\|_\infty \le C\), and there is a \(q\)-unit subset making \(\boldsymbol{\Lambda}_{\mathcal{Q}\cdot}\) full column rank; the average factor strength \(\xi_N = \phi_{\min}(\boldsymbol{\Lambda}' \boldsymbol{\Lambda}/N)\) may vanish (weak factors allowed).

Assumption 2 (errors). The idiosyncratic covariance \(\boldsymbol{\Omega}\) has eigenvalues bounded between \(\underline\sigma^2\) and \(\overline\sigma^2\); errors may be heteroskedastic and cross-sectionally dependent.

Assumption 3-4 (sampling). In- and out-of-sample sampling errors of the sample moments are \(O_p(T_1^{-1/2})\) for low-dimensional pieces and \(O_p(\sqrt{\log N / (N\wedge T_1)})\) for high-dimensional ones (holding under time-series weak dependence, not just i.i.d.).

Assumption 5 (ATE inference). The oracle prediction error \(\epsilon_t^*\) and the effect-plus-error \(d_t^* = \Delta_t - \mathbb{E}[\Delta_t] + \epsilon_t^*\) have finite, positive long-run variances \(\rho^2_{\epsilon^*}\), \(\rho^2_{d^*}\) consistently estimable by HAC, and a sequential CLT applies.

Remark. The coefficient estimator is consistent for the oracle target (Theorem 1) and – importantly – the prediction error is asymptotically irrelevant to heteroskedasticity (Theorem 2): unlike the coefficient MSE, the out-of-sample MSE does not depend on the noise heterogeneity \(\psi_{\max}\).

Inference (Shi & Wang, Theorem 3; single treated unit). With pre-period prediction residuals \(e_t = y_t - \hat{y}_t\) (\(t\in\mathcal{T}_1\)) and post-period effects \(\hat{\Delta}_t\) (\(t\in\mathcal{T}_2\)),

\[\hat{Z} = \frac{\bar{\Delta} - \Delta_{\mathcal{T}_2}} {\sqrt{\hat{\rho}^2_{(1)}/T_1 + \hat{\rho}^2_{(2)}/T_2}} \xrightarrow{d} N(0,1),\]

where \(\hat{\rho}^2_{(1)}\) is the HAC long-run variance of the pre-period residuals (first-stage estimation uncertainty) and \(\hat{\rho}^2_{(2)}\) is the HAC long-run variance of the de-meaned post-period effects (post-period averaging). Both sources of uncertainty enter, which matters when \(T_1\) and \(T_2\) are comparable.

When to use. Dense, factor-driven coefficients; high dimension (\(N>T_1\) permitted); when prediction accuracy and heteroskedasticity- robustness matter more than identifying a handful of controls.

LASSO (lasso, Li & Bell)#

Idea. When only a few controls are truly relevant, an L1 penalty selects them and shrinks the rest. Li & Bell estimate

\[\hat{\boldsymbol{\beta}}^{\text{las}} = \operatorname*{argmin}_{\boldsymbol{\beta}} \; \sum_{t\in\mathcal{T}_1} (y_{0t} - \mathbf{x}_t'\boldsymbol{\beta})^2 + \lambda \sum_{j} |\beta_j|,\]

with \(\lambda\) chosen by (leave-one-out) cross-validation, then predict the counterfactual as in the shared model. LASSO works for \(N > T_1\) (where AIC/AICC/BIC cannot even be computed) and is far cheaper.

Assumptions (Li & Bell). They relax HCW’s linear-conditional-mean assumption and drop one of HCW’s identification conditions. The key conditions are: a factor model with \(\mathrm{Rank}(\tilde{B}) = K\) (enough independent factor variation among the controls); a weakly dependent, weakly stationary panel so laws of large numbers and CLTs apply to partial sums; consistency of the pre-period least-squares pieces (\(\hat{\delta}_1 - \delta_1, \hat{\delta}-\delta = O_p(T_1^{-1/2})\)); and \(\rho\)-mixing with geometric decay plus a finite limit \(\eta = \lim T_2/T_1\). Sparsity (only \(m\) of \(\boldsymbol{\beta}\) non-zero, \(m\) fixed or \(o(T_1)\)) is assumed for the high-dimensional selection.

Remark. Li & Bell prove consistency \(\hat{\Delta}_1 - \Delta_1 = O_p(T_1^{-1/2} + T_2^{-1/2})\) (estimation error has two parts: first-stage \(O_p(T_1^{-1/2})\) and post-averaging \(O_p(T_2^{-1/2})\)), holding even when \(y_t\) is trend-stationary.

Inference (Li & Bell, Theorem 3.2). With \(\hat{\Sigma} = \hat{\Sigma}_1 + \hat{\Sigma}_2\),

\[\text{T.S.} = \frac{\sqrt{T_2}\,\hat{\Delta}_1}{\sqrt{\hat{\Sigma}}} \xrightarrow{d} N(0,1),\]

where \(\hat{\Sigma}_2\) is the Newey-West HAC long-run variance of the post-period effects, and \(\hat{\Sigma}_1\) is the first-stage (pre-period estimation) variance – the OLS prediction variance of the mean post-period counterfactual on the selected support. Li & Bell note \(\hat{\Sigma}_1\) is negligible when \(T_1 \gg T_2\), so the post-period term dominates in long-pre-period panels.

When to use. A genuinely sparse set of relevant controls; very large \(N\) (even \(N/T_1 \to \infty\)); when an interpretable, computa- tionally cheap selection is preferred. (For selection consistency, Li & Bell note the adaptive LASSO; for prediction, plain LASSO already beats AIC/BIC and leave-many-out CV in their simulations.)

Forward selection (fs, Shi & Huang)#

Idea. Rather than penalize, grow the control set greedily. Start empty; at each step add the control whose inclusion maximizes the pre-treatment OLS \(R^2\) (equivalently minimizes the residual sum of squares). The number of selected controls \(R\) is a tuning parameter chosen by a modified BIC (Wang, Li & Tsai),

\[\hat{R} = \operatorname*{argmin}_{r} \log\bigl(\hat{\sigma}^2(\hat{U}_r)\bigr) + \log(\log N)\,\frac{r\,\log T_1}{T_1},\]

with \(\hat{\sigma}^2(\hat{U}_r)\) the pre-period residual variance of OLS on the \(r\)-unit set. The counterfactual is the OLS extrapolation on the chosen set \(\hat{U}_{\hat{R}}\). Forward selection evaluates \(\sum_r (N-r+1)\) regressions – linear in \(N\) – versus the \(2^N\) of exhaustive subset search.

Assumptions (Shi & Huang). Asymptotics are multi-index: \(N\to\infty\) with \(T_1 = T_1(N)\) deterministic, \(\log N / T_1 \to 0\), and \(T_2 = T_2(N) \to \infty\) with \(\log N / T_2 \to 0\) (\(N\) may exceed \(T_1\)).

Assumption 1 (sparse Riesz / restricted eigenvalue). The minimal eigenvalue of the population Gram matrix over any \(u\)-unit subset (\(u \le (1+\delta_1)R\)) is bounded below – a condition the authors show is a natural implication of the latent factor model, not an ad hoc restriction.

Assumption 2 (second moments). Sample second moments converge at the high-dimensional rate \(O_p(\sqrt{\log N / T_1})\) with bounded fourth moments.

Assumption 3 (post-period). Analogous convergence and long-run-variance bounds on the post-treatment data.

Assumption 4 (weak dependence). The series are strong (\(\alpha\)-) mixing with geometric decay, so a Berry-Esseen bound for heterogeneous time series applies.

Remark. The validity is uniform over a class of DGPs (Theorem 1) – it holds whether the true coefficients are dense or sparse, which separates fsPDA from the post-selection-inference literature that needs sparsity or the oracle property. Theorem 2 shows the greedy algorithm attains a regression variance asymptotically as small as the best \(u\)-unit subset, so the cheap forward search is statistically efficient.

Inference (Shi & Huang, Eq. 4). Because forward selection uses only the pre-period and, under weak dependence, the pre- and post-periods become asymptotically independent (sample splitting), the naive conditional t-statistic is valid:

\[\hat{\mathcal{Z}}_{\hat{U}} = \hat{\rho}_{\tau\hat{U}}^{-1}\sqrt{T_2}\, \bar{\Delta}_{\hat{U}} \xrightarrow{d} N(0,1),\]

where \(\hat{\rho}^2_{\tau\hat{U}}\) is the HAC long-run variance of the de-meaned post-period effects. No first-stage variance term is needed – the asymptotic independence absorbs it – which makes fsPDA’s inference the simplest of the three.

Note

Long-run-variance estimator. mlsynth defaults to the prewhitened Newey-West estimator (Andrews-Monahan VAR(1) prewhitening + Bartlett kernel with the data-driven NW(1994) bandwidth + finite-sample adjustment) – R’s sandwich::lrvar(..., prewhite = TRUE, adjust = TRUE), which Shi & Huang use in their application scripts. Prewhitening is essential when the treatment-effect series is strongly serially dependent: monthly growth rates mean-revert (lag-1 autocorrelation around \(-0.45\) in the luxury-watch panel), and a plain Bartlett kernel cannot absorb that, leaving \(\hat\rho\) nearly double its true value and the test far too conservative. Setting lrvar_lag instead switches to the released est.fsPDA package’s fixed-lag Bartlett estimator (default lag \(\lfloor T_2^{1/4}\rfloor\), capped at \(\lfloor\sqrt{T_2}\rfloor\)); on the watch panel that no-prewhitening form gives an insignificant \(t \approx -1.15\), versus the prewhitened default’s \(-2.51\) (the paper reports \(-2.457\)).

When to use. A large candidate-control pool where the goal is to synthesize an ensemble that mimics the outcome (not to interpret which controls are “causal”); when computational efficiency and honest post-selection inference matter; and regardless of whether the underlying model is sparse. (Shi & Huang recommend the adaptive LASSO instead when the identity of a few causal controls is the object of interest.)

Choosing among the three#

Method

Coefficient structure

Inference variance

Use when

l2

dense (factor-implied)

pre + post HAC (both terms)

dense coefficients; N>T1; prediction & heteroskedasticity-robustness

lasso

sparse

post HAC (+ first stage, small if T1>>T2)

few relevant controls; interpretable selection; very large N

fs

dense or sparse

post HAC only (sample splitting)

large pool; predictive ensemble; cheap; honest post-selection inference

Shared assumptions across the PDA class#

The three estimators (l2, lasso, fs) differ in how they fit \(\boldsymbol\beta\), but they share the same identifying stack. Stated formally:

A1 (Latent factor model for untreated outcomes). All \(N + 1\) units share at most \(q\) common latent factors,

\[y_{jt}^0 \;=\; \mu_j \;+\; \boldsymbol\lambda_j' \mathbf f_t \;+\; u_{jt}, \qquad j \in \{0\} \cup \mathcal N, \;\; t \in \mathcal T,\]

with \(\mathbb E[\mathbf f_t u_{jt}] = 0\). This is the shared model underlying HCW (Hsiao-Ching-Wan 2012), Li-Bell (2017), Shi-Huang (2023), and Shi-Wang (l2). The factor structure is what licenses the linear projection of the treated unit’s untreated outcome on the controls’ outcomes,

\[y_{0t} \;=\; \alpha^0 \;+\; \mathbf x_t' \boldsymbol\beta^0 \;+\; \epsilon_t, \qquad \mathbb E[\mathbf x_t \epsilon_t] = \mathbf 0,\]

with \(\boldsymbol\beta^0 = \boldsymbol\Omega^{-1} \boldsymbol\Lambda (\boldsymbol\Lambda' \boldsymbol\Omega^{-1} \boldsymbol\Lambda + \mathbf I_q)^{-1} \boldsymbol\lambda_0\).

A2 (Single treated unit, sharp absorbing aggregate-level treatment). Unit \(j = 0\) is the only treated unit; treatment turns on at \(T_1 + 1\) and stays on. Donors are untreated throughout (no interference). The original HCW / Li-Bell / Shi-Huang theorems are stated for this single-treated case. (The l2-relaxation paper Section 4.4 sketches a multiple-treated-units extension with a short post-window; the mlsynth implementation tracks the single-treated form.)

A3 (Weak temporal dependence). The series \((\mathbf x_t, y_{0t})\) are \(\rho\)-mixing or strong- mixing with at-least-geometric decay (the exact rate varies by variant):

  • Li-Bell A6: \(w_t = (\tilde y_t', \epsilon_{1t})\) is a weakly stationary \(\rho\)-mixing process with \(\rho(\tau) = O(\lambda^\tau)\).

  • Shi-Huang A4: strong (\(\alpha\)-) mixing with geometric decay, so a Berry-Esseen bound for heterogeneous time series applies.

  • Shi-Wang A3-A4: time-series weak dependence at the \(O_p(T_1^{-1/2})\) and \(O_p(\sqrt{\log N / (N \wedge T_1)})\) rates for the sample moments.

This is what makes pre-period sample moments converge at the high-dimensional rate and – crucially – what makes the pre-period and post-period asymptotically independent, which is the engine behind fs-PDA’s sample-splitting inference and the two-term HAC variance in l2 / lasso.

A4 (Sample-size regime). \(N \to \infty\), \(T_1 = T_1(N) \to \infty\) deterministically with \(\log N / T_1 \to 0\), \(T_2 \to \infty\) with \(\log N / T_2 \to 0\). \(N\) may exceed \(T_1\), which is the entire point of the high-dimensional PDA literature. Li-Bell’s A7 additionally posits \(\eta = \lim T_2 / T_1 \in [0, \infty)\), which determines whether the first-stage variance term \(\hat\Sigma_1\) matters.

A5 (Donor pool regularity). The controls’ Gram matrix has enough variation:

  • For lasso (Li-Bell A2): \(\mathrm{Rank}(\tilde B) = K\) – removing the first row of the loading matrix leaves full-rank factor variation; \(E[x_t x_t']\) is invertible on the active set.

  • For fs (Shi-Huang A1): a sparse Riesz / restricted eigenvalue condition – the minimum eigenvalue of the population Gram matrix over any \(u\)-unit subset (\(u \le (1 + \delta_1) R\)) is bounded below. The paper shows this is a natural implication of the latent factor model, not an ad-hoc lasso-style assumption.

  • For l2 (Shi-Wang A1-A2): factor strength \(\xi_N = \phi_{\min}(\boldsymbol\Lambda' \boldsymbol\Lambda / N)\) may vanish (weak factors allowed); the idiosyncratic covariance has eigenvalues bounded in \([\underline\sigma^2, \overline\sigma^2]\).

A6 (Variant-specific structure of \(\boldsymbol\beta^0\) ).

  • lasso: sparse \(\boldsymbol\beta^0\) – only \(m = o(T_1)\) of its components are non-zero.

  • l2: dense \(\boldsymbol\beta^0\) – almost no exact zeros (the factor projection gives every donor a small-but-nonzero coefficient).

  • fs: agnostic – the inference is valid uniformly over a class of DGPs that includes both dense and sparse \(\boldsymbol\beta^0\) (Theorem 1 in Shi-Huang).

A7 (Inferential regularity). For all three, the post-period average effect \(\bar\Delta\) has a CLT with HAC long-run variance consistently estimable by Newey-West. For l2 and lasso, both the pre-period (first-stage) and post-period HAC variances enter; for fs, sample-splitting absorbs the first-stage term and only the post-period HAC variance enters.

  1. Factor-driven DGP (A1). PDA’s whole identification story rides on the latent factor structure. If the panel is not well-described by a small number of common factors, the linear-projection equation has a non-vanishing \(\epsilon_t\) term that the high-dimensional estimators cannot remove.

    Plausibly violated when donors are largely idiosyncratic (each follows its own unrelated process), or when one or two donors have a structural break that the factor model can’t absorb. Diagnostic: run an SVD on the donor pre-period matrix; the top few singular values should carry the bulk of the spectral energy. If the spectrum is flat-tailed (no clear factor cutoff), the factor model fails and PDA’s linear-projection consistency is fragile. In that regime, use a balancing-aware estimator (MicroSynth (User-Level Balancing SC) if you have unit-level data) or stay with canonical SC.

  2. Single treated unit, absorbing aggregate-level treatment (A2). Multiple treated units, treatment that turns off, or interference among donors break the framework.

    Plausibly violated when the policy is rescinded mid- sample, or when treated and donor states are spatially or economically linked enough that the donors’ untreated trajectories shift. Diagnostic: the config validator enforces single-cohort; the silent failure is treated-donor spillover. Split donors by geographic / economic distance to the treated unit and refit; large ATE shifts flag interference. Use Spillover-Aware Synthetic Control (SPILLSYNTH) or Spatial Synthetic Difference-in-Differences (SpSyDiD) for genuine spillovers, FECT / Synthetic Difference-in-Differences (SDID) for staggered designs.

  3. Weak temporal dependence (A3). All three variants assume mixing or \(\rho\)-mixing pre-period series with geometric decay. Unit-root outcomes, long-memory series, or series with structural breaks fail this.

    Plausibly violated when the outcome is a price level, a cumulative quantity, or an undifferenced macroeconomic series. Diagnostic: ADF / KPSS on the pre-period residuals; non-stationarity flags A3 failure. The pre/post asymptotic-independence story (which licenses fs-PDA’s sample-splitting inference) is then in question. First- difference the outcome, or use Synthetic Business Cycle (SBC) (a stationary- cycle estimator) before PDA.

  4. Sample-size regime (A4). PDA needs both \(T_1\) and \(T_2\) growing with \(\log N\) small relative to each. A short post-period (\(T_2 \le 5\)) breaks the CLT on \(\bar\Delta\); a short pre-period (\(T_1 \le 20\)) breaks the moment-convergence rates.

    Plausibly violated when the pre-period is short with many donors. Diagnostic: compute \((\log N) / T_1\) and \((\log N) / T_2\); both should be visibly below 1. If they are not, the asymptotic approximation has not kicked in. Either lengthen the panel (aggregate to a finer time grid), prune donors, or move to canonical SCM / Two-Step Synthetic Control / Forward Difference-in-Differences (FDID) which work with shorter panels.

  5. Donor regularity (A5). Each variant has its own Gram-matrix / factor-strength condition. The practical common failure is near-collinear donors: two donor series that move together up to noise produce a near- singular pre-period Gram matrix.

    Plausibly violated when the donor pool contains near-duplicates (two adjacent states with essentially identical industry mix). Diagnostic: read the condition number of \(\Gamma_{\mathcal T_1}(\mathbf x_t, \mathbf x_t')\). A condition number above ~1e6 is a red flag. For lasso and fs this manifests as selection flipping between near-clones across seeds; for l2 the tau-validation curve gets noisy. Prune one of each clone pair before refitting.

  6. Coefficient structure (A6) – choosing the right variant. The biggest practitioner-side decision is whether to assume sparse or dense \(\boldsymbol\beta^0\). Choosing wrong pays a real cost: lasso on a dense truth over-selects and inflates size (see the Path-B table above: LASSO’s size is 0.16-0.36 under the dense factor DGP, vs fs’s 0.05); l2 on a sparse truth pays variance for the dense fit it doesn’t need.

    Diagnostic: fit fs first – it’s valid in both regimes per Shi-Huang Theorem 1, and the selected \(\hat R\) and per-step \(R^2\) curve tell you whether you’re in a sparse (few donors carry the fit) or dense (many donors add information) regime. Then run the matched variant (lasso if fs keeps a handful; l2 if fs keeps many).

  7. Inferential regularity (A7). The HAC long-run variance must be consistently estimable. With strong serial correlation and a short post-period, the Bartlett / Newey-West kernel needs more lags than the post-period supports.

    Plausibly violated when \(T_2 \le 8\) and the treatment-effect series is autocorrelated. Diagnostic: plot the autocorrelation function of res.gap[-T2:]; if it stays high beyond \(\sqrt{T_2}\) lags, the Newey-West bandwidth choice is binding and the CI is optimistic. Lengthen the post-window if you can, or report bootstrap CIs alongside the HAC ones.

Reach for PDA when:

  • You have a single treated unit, a moderate-to-large donor pool (\(N\) comparable to or exceeding \(T_1\)), and a plausibly factor-driven panel. PDA was designed for exactly this regime, and unlike canonical SCM, the projection has no simplex / non-negativity constraint – it can extrapolate through negative coefficients on far donors when the factor structure demands it.

  • You want HAC-based, classical-statistics inference on the ATE with a closed-form normal-distribution test, not a permutation or conformal procedure. PDA’s CLT-based inference is what makes it the closest synthetic-control cousin to difference-in-differences from an inferential standpoint.

  • The treated unit’s pre-trajectory lies in the linear span (not necessarily convex hull) of the controls, so a no-constraint linear projection makes sense. PDA cannot recover an effect when the projection itself is impossible.

  • You’re between dense (l2) and sparse (lasso) regimes and want a uniformly valid test that doesn’t require choosing the right sparsity story – run fs.

Do not use PDA when:

  • You need convex (non-negative, sum-to-one) weights as a policy-interpretation deliverable. PDA’s no-constraint projection produces negative coefficients on far donors, which is awkward to explain in many policy contexts. Use canonical SCM / Two-Step Synthetic Control (canonical SC) or Forward-Selected Synthetic Control (FSCM) (forward-selected SC with the simplex retained).

  • The treated unit is structurally outside the donor span (not just the convex hull). PDA’s linear projection cannot reach a treated outcome that no linear combination of donor outcomes can express. The pre-period RMSE stays large at every PDA variant. Use Imperfect Synthetic Controls (ISCM), whose A2(b) mechanism identifies the effect through donors that use the treated unit as a positive-weight donor in their synthetic controls.

  • Outcomes are non-stationary (unit-root or trending series). A3 fails and the pre/post asymptotic-independence story breaks. First-difference the outcome (the l2-relaxation paper’s empirical PPI application does exactly this), or use Synthetic Business Cycle (SBC) (stationary-cycle estimator).

  • You have multiple treated units with overlapping cohorts. PDA’s theorems (with the exception of the l2 relaxerm which does support multiple treated units but is not written yet) are written for the single-treated case. Use

  • Spillovers across donors. A2’s no-interference clause fails when donor states are economically linked to the treated state. Use Spillover-Aware Synthetic Control (SPILLSYNTH) or Spatial Synthetic Difference-in-Differences (SpSyDiD).

  • Continuous or multi-valued treatment. PDA encodes a single binary intervention; continuous dose belongs in Continuous-Treatment Synthetic Control (CTSC).

  • Distributional questions (Lorenz curves, QTEs). PDA targets the mean ATE through a Gaussian-likelihood linear projection. Use Distributional Synthetic Control (DSC) for distributional effects.

  • You need Bayesian posterior credible bands. PDA returns frequentist HAC-based CIs. For Bayesian inference and posterior inclusion probabilities on donors, use Bayesian Synthetic Control with a Soft Simplex Constraint (BVS-SS) (spike-and-slab with a soft simplex).

  • Very short pre-period \((T_1 \le 15)\) with many donors. The high-dimensional approximation has not kicked in; the selected \(\hat\beta\) is noise. Use canonical SCM / Two-Step Synthetic Control / Forward Difference-in-Differences (FDID), which work without high-dimensional asymptotics.

  • Very short post-period \((T_2 \le 5)\). The CLT on \(\bar\Delta\) is shaky; the HAC bandwidth choice dominates the inference. Either accept a wider permutation CI from canonical SCM / Two-Step Synthetic Control, or move to the l2-relaxation multiple-treated-units extension (Shi-Wang Section 4.4) which is built for this regime.

  • You want predictor-level matching (covariates + pre-period outcomes), not outcome-only projection. PDA’s workhorse projection is on donor outcomes, not on predictor moments. Use canonical SCM / Two-Step Synthetic Control / Sparse Synthetic Control (SparseSC) (predictor selection with L1 penalty on the V-weight matrix) for the predictor-matching setup.

  • The factor model itself is the object of interest (you want to identify and interpret the factors). PDA is agnostic to factor estimation – the factor model only motivates the linear projection, never enters the estimator. Use Factor Model Approach (FMA) (factor-model-aware estimator) if recovering factors is the goal.

Empirical Illustration: Hong Kong economic integration#

The original HCW [HCW] application – and Shi & Huang’s Example 1 – evaluates the effect of economic integration with mainland China on Hong Kong’s quarterly real-GDP growth, using 24 comparison economies. Here all three PDA variants run on the same data, and the package returns the time-varying effect, the ATE, and a HAC confidence interval for each.

import pandas as pd
from mlsynth import PDA

url = "https://raw.githubusercontent.com/jgreathouse9/mlsynth/refs/heads/main/basedata/HongKong.csv"
df = pd.read_csv(url)

res = PDA({"df": df, "outcome": "GDP", "treat": "Integration",
           "unitid": "Country", "time": "Time",
           "methods": ["l2", "LASSO", "fs"], "alpha": 0.05,
           "display_graphs": True}).fit()

for name, fit in res.fits.items():
    sel = "-" if fit.selected_donors is None else len(fit.selected_donors)
    print(f"{name:6s} ATE {fit.att:.4f}  SE {fit.att_se:.4f}  "
          f"95% CI ({fit.ci[0]:.4f}, {fit.ci[1]:.4f})  p={fit.p_value:.3f}  donors={sel}")

This prints:

l2     ATE 0.0248  SE 0.0032  95% CI (0.0185, 0.0311)  p=0.000  donors=24
lasso  ATE 0.0330  SE 0.0054  95% CI (0.0224, 0.0436)  p=0.000  donors=11
fs     ATE 0.0285  SE 0.0059  95% CI (0.0169, 0.0401)  p=0.000  donors=7

All three find a significant positive integration effect on Hong Kong’s GDP growth – roughly +2.5 to +3.3 percentage points – differing only by their selection philosophy: l2 keeps all 24 controls with dense, signed coefficients (pre-RMSE 0.013); lasso keeps 11; fs keeps a parsimonious 7 (Malaysia, New Zealand, Norway, Austria, Canada, Thailand, Australia). The estimates bracket the Forward-DiD result on the same data (0.025), a useful cross-method check.

Verification#

Note

Empirical (Path A, Hong Kong). All three variants run on the HCW Hong Kong panel (above) and agree on a significant positive integration effect, consistent with the literature and the Forward-DiD cross-check (0.025).

Simulation study (Path B): forward selection vs LASSO#

Shi & Huang’s (2023) Table 1 compares forward selection against LASSO on a four-factor DGP, re-implemented in mlsynth.utils.pda_helpers.simulation.simulate_pda_panel(): four factors (f1 i.i.d.; f2 AR(1) 0.9; f3 MA(2) (0.8,0.4); f4 ARMA(1,1) (0.5,0.5) under the dynamic structure, all i.i.d. N(0,1) under the i.i.d. structure); loadings U(1,2) on the treated + 4 relevant controls and U(-0.1,0.1) on the remaining 96; idiosyncratic N(0,0.5); one treated unit, \(N=100\) controls, \(T_1=T_2\). Shocks D1-D7 set the post-period ATE (D1-D3 null → size; D4-D7 non-zero → power). Driving the packaged PDA (methods=["fs","LASSO"], fs_intercept=False) at \(T_1=100\) (200 reps for size, 60 for power):

forward selection vs LASSO, \(T_1=100\)#

factors

method

# donors

size (D1)

power (D5)

i.i.d.

fs

3.9

0.090

1.00

i.i.d.

LASSO

9.5

0.065

1.00

dynamic

fs

4.5

0.075

1.00

dynamic

LASSO

15.0

0.140

1.00

The paper’s geometry reproduces. Forward selection is parsimonious – it keeps to ~the 4 relevant donors in both structures – while LASSO over-selects (9-15 donors). Forward selection’s test is correctly sized (≈ 0.05-0.09) under both factor structures, the robustness Shi & Huang emphasise. LASSO is correctly sized under i.i.d. factors (0.065, matching the paper’s 0.058) but its size inflates under dynamic factors (0.140; paper’s modified-BIC LASSO 0.184) – the size inflation the paper reports is a dynamic-factor phenomenon, not an i.i.d. one. Both tests are fully powered at D5 (mean-1 shift). Durable case: pda_table1.

Note

mlsynth’s LASSO is cross-validated; the paper’s is not. Shi & Huang select the Lasso penalty with a modified BIC (Remark 4 cont., p.521: “we tune the constants in the modified BIC to allow Lasso to take in more variables”); mlsynth’s L1-PDA selects it with LassoCV (5-fold cross-validation). The two are different penalty rules, so the LASSO cells above are mlsynth’s CV variant, not a cell-by-cell match of the paper’s Lasso. What both share – and what the benchmark pins – is the geometry: LASSO over-selects relative to fs and its size inflates under dynamic factors, while forward selection (the paper’s method, validated cell-by- cell on Hong Kong in pda_hongkong) stays parsimonious and correctly sized.

The fs_intercept knob – valid size on factor data

Achieving the correct fs size above required a fix. The released fsPDA R package (est.fsPDA.R) fits the donor regression with an intercept; on the paper’s mean-zero factor DGP that intercept absorbs a spurious pre-period constant which extrapolates into the post window, biasing the gap and inflating the null rejection rate to ~0.20. The paper’s Table 1 was produced by the simulation code (FS.R), which fits without an intercept and yields valid size. mlsynth exposes both via PDAConfig.fs_intercept (default False = the no-intercept, valid-size form; set True for panels with genuine unit level shifts). With the default, the fs D1 rejection rate drops from ~0.20 (intercept) to the ~0.05-0.09 reported above (no intercept).

One honest caveat: under dynamic factors fs shows mild residual size inflation at small \(T\) (the “imprecise long-run-variance” effect the paper itself notes), shrinking toward 5% as \(T_1 \to \infty\).

L2-relaxation (Shi & Wang). The l2 method’s out-of-sample MPSE falls with \(T\) and its test approaches the nominal 5% size as \(T_1 \to \infty\), matching Shi & Wang’s Table 2 (size \(0.142\) at \(T_1=50\)\(0.072\) at \(200\)). Its per-fit cross-validation over the \(\tau\) grid makes large Monte Carlos expensive (~5 s/fit), so the full table is summarized rather than swept here.

Multiple treated units#

When several units are treated by the same intervention, Shi & Wang’s L2-relaxation PDA fits a counterfactual per treated unit against the shared control pool and aggregates the effects into a per-period cross-sectional ATE,

\[\widehat{\mathrm{ATE}}_t = \frac{1}{J} \sum_{j=1}^{J} \bigl( y_{jt} - \hat{y}_{jt} \bigr), \qquad \mathrm{s.e.} = \frac{\sqrt{\mathbf{1}'\hat{\Sigma}_e\,\mathbf{1}}}{J},\]

with \(\hat{\Sigma}_e\) the cross-sectional covariance of the pre-period prediction residuals (so the standard error is constant across post-periods). mlsynth.utils.pda_helpers.multitreat.run_pda_multitreat() implements this. Because every per-unit fit shares the same \(\hat{\boldsymbol{\Sigma}} = X'X/T_1\), all \(J\) fits run through a single OSQP factorisation – the batched solver mlsynth.utils.pda_helpers.l2.batch.l2_relax_batch() updates only the constraint bounds per (unit, tau) (hundreds of conic solves become hundreds of warm-started ADMM updates). On the Brexit study (52 UK firms vs 300 controls) this reproduces the paper’s first-day return ATE of \(-4.3\%\) (\(t \approx -7.4\)); see benchmarks/cases/pda_brexit.py.

Core API#

Panel Data Approach (PDA) for program evaluation.

A thin, NumPy-first orchestration over mlsynth.utils.pda_helpers. PDA (Hsiao, Ching & Wan 2012) predicts a single treated unit’s untreated counterfactual by a linear regression on the control units fit over the pre-treatment window, then extrapolates out-of-sample to estimate the ATE. Three high-dimensional variants are supported, each with the inference theory from its own paper:

  • l2 – L2-relaxation (Shi & Wang 2024): dense ridge-like coefficients via min ||beta||_2^2 s.t. ||eta_hat - Sigma_hat beta||_inf <= tau; ATE inference combines pre-residual and post-effect HAC long-run variances.

  • LASSO – L1 (Li & Bell 2017): sparse donor selection by cross-validated LASSO; HAC t-test on the ATE with a first-stage variance term.

  • fs – forward selection (Shi & Huang 2023): greedy R^2 selection with a modified-BIC stopping rule; post-selection HAC t-test (sample splitting).

Set method for one variant or methods to run several at once.

class mlsynth.estimators.pda.PDA(config: PDAConfig | dict)#

Bases: object

Panel Data Approach estimator (l2 / LASSO / fs).

Parameters:

config (PDAConfig or dict) – Validated configuration. Beyond the common fields, PDA reads method / methods (which variant(s) to run), tau (the l2 relaxation parameter; None selects it by validation), and alpha (CI level).

References

Hsiao, C., Ching, H. S., & Wan, S. K. (2012). A panel data approach for program evaluation. Journal of Applied Econometrics, 27(5), 705-740.

Shi, Z., & Wang, Y. (2024). L2-relaxation for Economic Prediction.

Li, K. T., & Bell, D. R. (2017). Estimation of average treatment effects with panel data. Journal of Econometrics, 197(1), 65-75.

Shi, Z., & Huang, J. (2023). Forward-selected panel data approach for program evaluation. Journal of Econometrics, 234(2), 512-535.

fit() PDAResults#

Run the requested PDA variant(s) and return typed results.

Returns:

PDAResults – Container of per-variant PDAMethodFit objects (coefficients, counterfactual, gap, ATE, HAC standard error, CI, p-value, donor weights), with convenience aliases (att, att_se, counterfactual, donor_weights) forwarding to the primary variant.

Configuration#

class mlsynth.config_models.PDAConfig(*, df: ~pandas.DataFrame, outcome: str, treat: str, unitid: str, time: str, display_graphs: bool = True, save: bool | str = False, counterfactual_color: ~typing.List[str] = <factory>, treated_color: str = 'black', plot: ~mlsynth.config_models.PlotConfig = <factory>, method: ~typing.Annotated[str, _PydanticGeneralMetadata(pattern='^(LASSO|l2|fs)$')] = 'fs', methods: ~typing.List[str] | None = None, tau: float | None = None, alpha: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Lt(lt=1.0)] = 0.05, fs_intercept: bool = False, lrvar_lag: int | None = None, l2_standardize: bool = True)#

Configuration for the Panel Data Approach (PDA) estimator.

alpha: float#
fs_intercept: bool#
l2_standardize: bool#
lrvar_lag: int | None#
method: str#
methods: List[str] | None#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

tau: float | None#

Result Containers#

PDA.fit() returns a PDAResults, whose fits maps each variant to a PDAMethodFit (coefficients, intercept, counterfactual, gap, ATE, HAC standard error, CI, p-value, donor weights, and the selected-donor list for lasso/fs). The prepared, NumPy-only panel is exposed as a PDAInputs, with units and time addressed through an IndexSet.

Note

PDA.fit() returns an EffectResult on the standardized two-family contract. It is a dispatcher over the variants in res.fits (l2 / lasso / fs); the selected variant drives the flat accessors (res.att / res.att_ci / res.counterfactual / res.gap / res.donor_weights / res.pre_rmse), which resolve through the standardized sub-models. res.donor_weights are the regression coefficients (PDA is a regression counterfactual, not a simplex average); res.att_by_method() / res.se_by_method() report every variant.

Frozen, NumPy-first containers for the Panel Data Approach (PDA).

PDA (Hsiao, Ching & Wan 2012) predicts a treated unit’s untreated counterfactual by a linear regression on the control units fit over the pre-treatment window, then extrapolates out-of-sample. mlsynth exposes three high-dimensional PDA variants, each in its own subpackage with the inference theory from its own paper:

  • l2 – L2-relaxation (Shi & Wang 2024).

  • lasso – L1/LASSO (Li & Bell 2017).

  • fs – forward-selected PDA (Shi & Huang 2023).

Everything below is pure NumPy; units/time are addressed through IndexSet. The only DataFrame touchpoint is setup.

class mlsynth.utils.pda_helpers.structures.PDAInputs(unit_index: ~mlsynth.utils.fast_scm_helpers.structure.IndexSet, time_index: ~mlsynth.utils.fast_scm_helpers.structure.IndexSet, y: ~numpy.ndarray, X: ~numpy.ndarray, T0: int, treated_label: ~typing.Any, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#

Bases: object

Preprocessed, NumPy-only panel for the PDA engine.

Parameters:
  • unit_index (IndexSet) – All N donor units (column order of X).

  • time_index (IndexSet) – All T periods (row order of y and X).

  • y (np.ndarray) – Treated-unit outcome over all periods, shape (T,).

  • X (np.ndarray) – Donor outcomes, shape (T, N).

  • T0 (int) – Number of pre-treatment periods (T1); post is T2 = T - T0.

  • treated_label (Any) – Identifier of the treated unit.

  • metadata (dict) – Free-form provenance.

property T: int#
T0: int#
property T2: int#
X: ndarray#
property donor_labels: ndarray#
metadata: Dict[str, Any]#
property n_donors: int#
time_index: IndexSet#
treated_label: Any#
unit_index: IndexSet#
y: ndarray#
class mlsynth.utils.pda_helpers.structures.PDAMethodFit(name: str, beta: ~numpy.ndarray, intercept: float, counterfactual: ~numpy.ndarray, gap: ~numpy.ndarray, att: float, att_se: float, ci: ~typing.Tuple[float, float], p_value: float, donor_weights: ~typing.Dict[~typing.Any, float], selected_donors: ~typing.List[~typing.Any] | None = None, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#

Bases: object

A single PDA-variant fit (l2 / lasso / fs).

att: float#
att_se: float#
beta: ndarray#
ci: Tuple[float, float]#
counterfactual: ndarray#
donor_weights: Dict[Any, float]#
gap: ndarray#
intercept: float#
metadata: Dict[str, Any]#
name: str#
p_value: float#
selected_donors: List[Any] | None = None#
class mlsynth.utils.pda_helpers.structures.PDAResults(*, effects: ~mlsynth.config_models.EffectsResults | None = None, fit_diagnostics: ~mlsynth.config_models.FitDiagnosticsResults | None = None, time_series: ~mlsynth.config_models.TimeSeriesResults | None = None, weights: ~mlsynth.config_models.WeightsResults | None = None, inference: ~mlsynth.config_models.InferenceResults | None = None, method_details: ~mlsynth.config_models.MethodDetailsResults | None = None, sub_method_results: ~typing.Dict[str, ~typing.Any] | None = None, additional_outputs: ~typing.Dict[str, ~typing.Any] | None = None, raw_results: ~typing.Dict[str, ~typing.Any] | None = None, execution_summary: ~typing.Dict[str, ~typing.Any] | None = None, plot_config: ~mlsynth.config_models.PlotConfig | None = None, inputs: ~mlsynth.utils.pda_helpers.structures.PDAInputs, fits: ~typing.Dict[str, ~mlsynth.utils.pda_helpers.structures.PDAMethodFit], selected_variant: str = 'fs', metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#

Bases: BaseEstimatorResults

Top-level container returned by mlsynth.PDA.fit().

An EffectResult (the observational report): a dispatcher over the PDA variants in fits, it lifts the selected variant’s quantities into the standardized sub-models so the flat accessors (att / att_ci / counterfactual / gap / donor_weights / pre_rmse) resolve through the base contract. donor_weights here are the regression coefficients (PDA is a regression/factor counterfactual, not a simplex donor average). The PDA-specific fields keep every variant fit.

Parameters:
  • inputs (PDAInputs) – Preprocessed panel.

  • fits (dict) – {variant_name: PDAMethodFit} for every variant run (l2 / lasso / fs).

  • selected_variant (str) – Which variant drives the standardized surface / flat accessors.

att_by_method() Dict[str, float]#
property att_se: float#
fits: Dict[str, PDAMethodFit]#
inputs: PDAInputs#
metadata: Dict[str, Any]#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'frozen': True, 'json_encoders': {<class 'numpy.ndarray'>: <function BaseEstimatorResults.Config.<lambda>>}}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

se_by_method() Dict[str, float]#
selected_variant: str#

Helper Modules#

Data preparation – the only DataFrame touchpoint: pivots to NumPy, builds the unit/time ``IndexSet``es, and splits pre/post.

Long-DataFrame -> NumPy boundary for PDA (the only pandas touchpoint).

mlsynth.utils.pda_helpers.setup.derive_treatment(df: DataFrame, unitid: str, time: str, treat: str) Tuple[Any, Any]#

Read the single treated unit and its first treated period from treat.

mlsynth.utils.pda_helpers.setup.prepare_pda_inputs(df: DataFrame, *, unitid: str, time: str, outcome: str, treat: str) PDAInputs#

Pivot the panel to NumPy, build IndexSets, split pre/post.

Returns:

PDAInputs – Pure-NumPy container: treated vector y (T,), donor matrix X (T, N), T0, and unit/time IndexSets.

Shared HAC long-run-variance machinery (Bartlett/Newey-West) and the N(0,1) test used by all three variants.

Shared HAC long-run-variance machinery for PDA inference.

All three PDA variants base inference on a heteroskedasticity- and autocorrelation-consistent (HAC) long-run variance of a scalar time series of post-period treatment effects (and, for some variants, pre-period prediction residuals). The Bartlett/Newey-West and uniform kernels are provided; the default truncation lag follows Newey-West.

mlsynth.utils.pda_helpers.inference.fspda_lrvar_lag(T2: int, lrvar_lag: int | None = None) int#

Bartlett-kernel truncation lag for the fsPDA long-run variance.

Follows Shi & Huang’s released fsPDA package (est.fsPDA): the default is floor(T2 ** (1/4)) and a user-supplied lag must be a non-negative integer no larger than floor(sqrt(T2)).

mlsynth.utils.pda_helpers.inference.hac_lrv(z: ndarray, lag: int | None = None, kernel: str = 'bartlett') float#

HAC long-run variance of a mean-zero-able scalar series z.

lrv = gamma_0 + sum_{l=1}^{L} k(l) * 2 * gamma_l with gamma_l the sample autocovariance and k the Bartlett (1 - l/(L+1)) or uniform kernel. The series is de-meaned first.

mlsynth.utils.pda_helpers.inference.lrvar_prewhite_nw(x: ndarray) float#

Prewhitened Newey-West long-run variance of the mean of x.

Andrews-Monahan (1992) VAR(1) prewhitening + a Bartlett kernel with the Newey-West (1994) data-driven bandwidth + the finite-sample adjustment – i.e. R’s sandwich::lrvar(x, type = "Newey-West", prewhite = TRUE, adjust = TRUE), which Shi & Huang use for the post-selection t-test in their applications (app1_luxury_watch/fsPDA.R). Returns the variance of the sample mean (so the t-statistic is mean(x) / sqrt(lrvar_prewhite_nw(x))).

Prewhitening is what makes this differ sharply from the plain Bartlett estimator on strongly serially-dependent effects (e.g. monthly growth rates, which mean-revert): the VAR(1) step removes the bulk of the dependence before the kernel smooths the remainder, then the long-run factor 1 / (1 - A) recolors it.

mlsynth.utils.pda_helpers.inference.newey_west_lag(n: int) int#

Newey-West (1994) automatic truncation lag floor(4 (n/100)^(2/9)).

mlsynth.utils.pda_helpers.inference.normal_test(att: float, se: float, alpha: float = 0.05)#

Two-sided N(0,1) test: returns (p_value, (ci_lower, ci_upper)).

L2-relaxation (Shi & Wang): the relaxation primal, tau validation, and the two-term HAC ATE inference.

L2-relaxation estimation (Shi & Wang 2024).

Solves the L2-relaxation primal (their Eq. 3)

min_beta (1/2) ||beta||_2^2 s.t. || eta_hat - Sigma_hat beta ||_inf <= tau,

with Sigma_hat = X'X / T1 and eta_hat = X'y / T1 over the pre-period.

Following the authors’ released L2relax (ishwang1/L2relax-PDA), the treated and control series are standardised (demeaned and scaled to unit variance) before forming Sigma / eta, and the solution is mapped back to the original scale: beta = sd_y * (beta_tilde / sd_X) and intercept alpha = mean_y - mean_X . beta. Standardisation is the default (it is what the replication code does and what reproduces the paper’s empirical results); set standardize=False for the raw-scale variant. The tuning parameter tau is chosen by out-of-sample validation over a log-spaced grid (the optimal tau is often a tiny fraction of max|eta|).

mlsynth.utils.pda_helpers.l2.estimation.cross_validate_tau(y_pre: ndarray, X_pre: ndarray, val_frac: float = 0.2, n_coarse: int = 40, n_fine: int = 40, standardize: bool = True, tau_grid: ndarray | None = None) float#

Validate tau by sequential out-of-sample validation on the tail.

The training window is split in time order – fit on the earlier 1 - val_frac fraction, validate on the most-recent val_frac tail – so no future period informs an earlier counterfactual (unlike a K-fold split over contiguous blocks, which trains on both past and future of each fold). tau_grid overrides the automatic log-spaced grid (e.g. to match a reference grid); otherwise the grid is log-spaced up to max|eta| (the optimal tau is often a tiny fraction of that).

mlsynth.utils.pda_helpers.l2.estimation.fit_l2(y: ndarray, X: ndarray, T0: int, tau: float | None = None, standardize: bool = True, tau_grid: ndarray | None = None) Tuple[ndarray, float, ndarray, float]#

Fit L2-relaxation; return (beta, intercept, counterfactual, tau).

mlsynth.utils.pda_helpers.l2.estimation.l2_relax(y_pre: ndarray, X_pre: ndarray, tau: float, standardize: bool = True) Tuple[ndarray, float]#

Solve the L2-relaxation primal for coefficients and intercept.

ATE inference for L2-relaxation PDA (Shi & Wang 2024, Theorem 3).

Single treated unit. With the post-period estimated effects Delta_hat_t = y_t - y_hat_t (t in T2) and the pre-period prediction residuals e_t = y_t - y_hat_t (t in T1), the ATE Delta_bar is asymptotically normal,

Z_hat = Delta_bar / sqrt( rho_hat_(1)^2 / T1 + rho_hat_(2)^2 / T2 ) -> N(0, 1),

where rho_hat_(1)^2 is the HAC long-run variance of the pre-period residuals and rho_hat_(2)^2 is the HAC long-run variance of the de-meaned post-period effects. Both estimation uncertainty (pre) and post-period noise contribute.

mlsynth.utils.pda_helpers.l2.inference.l2_ate_inference(y: ndarray, counterfactual: ndarray, T0: int, alpha: float = 0.05) Tuple[float, float, Tuple[float, float], float]#

Return (att, se, ci, p_value) for the L2-relaxation ATE.

LASSO (Li & Bell): cross-validated L1 estimation and the HAC t-test with a first-stage variance term.

L1/LASSO PDA estimation (Li & Bell 2017).

Selects control units and estimates coefficients by LASSO on the pre-period, with the penalty chosen by cross-validation, then predicts the treated unit’s counterfactual out-of-sample.

mlsynth.utils.pda_helpers.lasso.estimation.fit_lasso(y: ndarray, X: ndarray, T0: int, cv: int = 5, random_state: int = 0) Tuple[ndarray, float, ndarray, ndarray]#

Fit LASSO PDA; return (beta, intercept, counterfactual, support_mask).

ATE inference for LASSO PDA (Li & Bell 2017, Theorem 3.2).

The test statistic is sqrt(T2) * Delta_bar / sqrt(Sigma_hat) with Sigma_hat = Sigma_hat_1 + Sigma_hat_2:

  • Sigma_hat_2 – the HAC (Newey-West) long-run variance of the post-period effects, the always-present post-period averaging variance;

  • Sigma_hat_1 – the first-stage (pre-period estimation) variance, here the OLS prediction variance of the mean post-period counterfactual on the LASSO-selected support. Li & Bell note it is negligible when T1 >> T2.

Equivalently the ATE standard error is SE = sqrt(var_firststage + Sigma_hat_2 / T2).

mlsynth.utils.pda_helpers.lasso.inference.lasso_ate_inference(y: ndarray, X: ndarray, counterfactual: ndarray, support: ndarray, T0: int, alpha: float = 0.05) Tuple[float, float, Tuple[float, float], float]#

Return (att, se, ci, p_value) for the LASSO PDA ATE.

Forward selection (Shi & Huang): greedy R^2 selection with modified-BIC stopping and the post-selection HAC t-test.

Forward-selected PDA estimation (Shi & Huang 2023).

Greedy forward selection of control units: at each step add the donor whose inclusion maximizes the pre-treatment OLS R^2 (equivalently minimizes the residual variance sigma^2 = mean(e^2)). Selection stops as soon as the modified information criterion

IC(r) = log( sigma^2(U_r) ) + B * r, B = log(log N) * log(T1) / T1

stops decreasing (the stopping rule of Wang, Li & Tsai used in the authors’ fsPDA R package), starting from the intercept-only IC = log(var(y1)). The counterfactual is the OLS extrapolation on the selected set. This is a direct port of est.fsPDA.R.

mlsynth.utils.pda_helpers.fs.estimation.forward_select(y: ndarray, X: ndarray, T0: int, intercept: bool = False) Tuple[List[int], ndarray, float, ndarray]#

Forward-select donors with the stop-on-increase IC rule, then refit OLS.

Returns (selected_indices, beta_full, intercept, counterfactual) where beta_full is an N-vector with zeros off the selected support.

Parameters:

intercept (bool, default False) – When False the donor regression is fit without a constant and the IC baseline is log(mean(y_pre**2)) – the paper’s simulation convention (Shi & Huang 2023, simulation/FS.R). This is the correctly-specified form for the factor DGP (mean-zero factors, no unit level shift) and yields valid test size. When True a constant is included and the baseline is log(var(y_pre)) – the released fsPDA R package convention (est.fsPDA.R), appropriate for panels with genuine level differences but size-distorting on centered data.

Post-selection ATE inference for forward-selected PDA (Shi & Huang 2023, Eq. 4).

Because forward selection uses pre-treatment data only and the pre/post periods become asymptotically independent under weak dependence, the naive conditional t-statistic is valid:

Z_U = Delta_bar / sqrt(lrvar_hat) -> N(0, 1),

where lrvar_hat is the long-run variance of the sample mean of the (de-meaned) post-period treatment effects. By default mlsynth uses the prewhitened Newey-West estimator (sandwich::lrvar(..., prewhite = TRUE, adjust = TRUE)) that Shi & Huang use in their applications (app1_luxury_watch/fsPDA.R); on serially-dependent effects this is far less conservative than a plain Bartlett kernel. Supplying lrvar_lag instead switches to the released est.fsPDA package’s fixed-lag Bartlett estimator (floor(T2 ** (1/4)) if the cap, see fspda_lrvar_lag()).

mlsynth.utils.pda_helpers.fs.inference.fs_ate_inference(y: ndarray, counterfactual: ndarray, T0: int, alpha: float = 0.05, lrvar_lag: int | None = None) Tuple[float, float, Tuple[float, float], float]#

Return (att, se, ci, p_value) for the forward-selected PDA ATE.

Run loop assembling the per-variant fits.

Run the requested PDA variant(s) and assemble typed per-method fits.

mlsynth.utils.pda_helpers.orchestration.resolve_methods(method: str, methods: List[str] | None) List[str]#

Internal method keys to run (explicit methods win over method).

mlsynth.utils.pda_helpers.orchestration.run_pda(inputs: PDAInputs, methods: List[str], tau: float | None, alpha: float, fs_intercept: bool = False, lrvar_lag: int | None = None, l2_standardize: bool = True, l2_tau_grid: Sequence[float] | None = None) Dict[str, PDAMethodFit]#

Fit each requested PDA variant with its own paper’s inference.