Synthetic Principal Component Design (SPCD)

Synthetic Principal Component Design (SPCD)#

When to Use This Estimator#

Most synthetic-control work takes the treated unit as given and asks only how to weight the donors. SPCD, due to Lu, Li, Ying and Blanchet (2022) [SPCD] arXiv:2211.15241, answers the prior, design-time question: you are about to run an experiment, you have a panel of pre-treatment outcomes, and you must decide which units to treat (and how to weight the rest into a synthetic comparison) so that the treatment-effect estimate is as precise as possible. It is a sibling of Synthetic Design (SYNDES) — both choose the treated set rather than assume it — but where SYNDES solves a mixed-integer program, SPCD reformulates the design as a phase-synchronization problem and solves it with a spectrally-initialized power method that converges globally under standard linear-factor models, in seconds rather than minutes.

Reach for SPCD when treatment can only be applied to coarse, expensive units, randomizing at random leaves accuracy on the table, and you want a fast, provably-good assignment:

Marketing / geo experiments. You can switch a campaign on in some media markets (DMAs), regions, or store clusters but not at the individual-customer level. SPCD reads each market’s pre-period sales or traffic history and picks the treated markets — and the synthetic comparison — so the two groups tracked each other before launch. Any post-launch divergence is then a low-variance read on the campaign.
Retail / product rollouts. Choosing which stores or SKUs get a new layout, price, or feature first. Treating a handful of stores is costly, so you want the few you pick to be maximally informative about the whole chain.
Platform / policy pilots. Selecting cities or submarkets for a pricing pilot, a new service tier, or a regulatory change, where interference rules out customer-level randomization and the number of treatable units is small.

The flip side: if assignment is already fixed (you know who was treated and only need a counterfactual), this is the wrong tool — use a standard retrospective estimator such as Forward-Selected Synthetic Control (FSCM), Cluster Synthetic Controls (CLUSTERSC), or Two-Step Synthetic Control. SPCD’s value is entirely in the design stage, before the experiment runs.

Notation#

We observe an outcome \(y_{it}\) for units \(i \in \mathcal{N} \coloneqq \{1, \dots, N\}\) over pre-treatment periods \(t \in \mathcal{T} \coloneqq \{1, \dots, T\}\), stacked into the pre-treatment donor matrix \(\mathbf{Y}_0 \in \mathbb{R}^{T \times N}\) (one column per unit). At \(t = T\) the experimenter assigns each unit a sign \(D_i \in \{-1, +1\}\): \(D_i = +1\) puts unit \(i\) in the treated group, \(D_i = -1\) in the control group. Each unit also carries a non-negative weight \(w_i \ge 0\), normalized to sum to one on each side. After assignment, outcomes are observed for \(S\) further periods \(t = T+1, \dots, T+S\), with potential outcomes \(y_{it}(-1) = \mu_{it} + e_{it}\) and \(y_{it}(+1) = y_{it}(-1) + \tau\), where \(\mu_{it}\) is the base (signal) outcome, \(e_{it}\) is mean-zero idiosyncratic noise with variance \(\sigma\), and \(\tau\) is the treatment effect to be estimated. The estimand is the weighted average treatment effect on the treated (wATET).

Throughout, \(\mathbf{1}\) is the all-ones vector, \(\mathbf{I}\) the identity, \(\operatorname{sgn}(\cdot)\) the elementwise sign, \(\|\cdot\|_1\) / \(\|\cdot\|_2\) the \(\ell_1\) / \(\ell_2\) norms, and \((\cdot)^{-1}\) the matrix inverse.

Note

Notation bridge. The single-treated-unit synthetic-control canon (treated \(j=1\), donor pool \(\mathcal{N}_0\), treatment dummy \(d_{jt} \in \{0,1\}\)) does not fit a design problem in which the assignment is itself the decision variable. Following the paper, the assignment is the signed vector \(\mathbf{D} \in \{-1,+1\}^N\) and the two groups are symmetric (neither is privileged as “the donors”). The implementation stores the pre-treatment matrix in the canon orientation, as \(\mathbf{Y}_0 \in \mathbb{R}^{T \times N}\) (time in rows, units in columns), so the iteration matrix below is built in code as Y_pre.T @ Y_pre + alpha I + lambda 1 1.T.

How SPCD Works (Plain-Language Walkthrough)#

Before the equations, here is the whole idea in five steps.

The goal is a fair pre-period match. Split the units into a treated group and a control group, and weight each unit, so that before the experiment the weighted treated history and the weighted control history are nearly identical curves. If the two groups move together in the pre-period, then after treatment the gap between them is the treatment effect — and a well-balanced design makes that gap a low-variance estimate.
Trying every split is hopeless. With \(N\) units there are exponentially many ways to form two groups; choosing the best one is NP-hard. SPCD sidesteps the brute-force search with a change of variable.
One signed number per unit. Pack each unit’s group label and weight into a single signed number \(W_i = w_i D_i\): its sign says which group the unit is in, its magnitude is the weight. “Make the two weighted groups match” then becomes “find a signed vector \(\mathbf{W}\) that is as small as possible against the data,” i.e. minimize \(\mathbf{W}^\top (\mathbf{Y}_0 \mathbf{Y}_0^\top) \mathbf{W}\) subject to the groups balancing out. This is the \(\ell_1\)-PCA / phase-synchronization problem — a well-studied shape with fast, globally-convergent solvers.
Solve it with a power method, warm-started by an eigenvector. Start from a smart first guess — the sign pattern of the smallest eigenvector of the data matrix (the spectral initialization). Then repeatedly refine the group labels: multiply the current sign vector by \(\mathbf{M}^{-1}\) and take signs again (the generalized power method). Keep going until the labels stop flipping. The normalized variant (the package default) rescales by the diagonal of \(\mathbf{M}^{-1}\) first and is guaranteed to converge to the global optimum at a linear rate.
Read off the design, then estimate. Once the labels are fixed, the weights follow in closed form (Eq. (9)) or from a tiny convex program (Eq. (6)). Treat the smaller of the two groups (the minority rule). Run the experiment. When the post-period data arrives, the treatment effect is simply the gap between the two weighted groups, averaged over the post-period.

A useful analogy: it is like drafting two evenly-matched teams from a pool of players based on their past stats — except SPCD also decides how much each player counts, and it finds the balanced split with a fast eigenvector computation instead of trying every possible roster.

Mathematical Formulation#

SPCD jointly chooses the signs \(\{D_i\}_{i=1}^N\) and non-negative weights \(\{w_i\}_{i=1}^N\) so that the weighted pre-treatment trajectories of the two groups track each other as closely as possible.

Following Doudchenko et al. (2021) [SYNDES] and Abadie & Zhao (2021) [ABADIE2024], the expected MSE of the difference-in-weighted-means treatment-effect estimator decomposes as

\[\mathbb{E}\!\left[(\widehat{\tau} - \tau)^2 \,\middle|\, \{D_i, w_i\}\right] \;=\; \underbrace{\left( \sum_{i: D_i=1} w_i \mu_{i, T+1} - \sum_{i: D_i=-1} w_i \mu_{i, T+1} \right)^2}_{\text{weighted covariate balancing}} \;+\; \sigma^2 \sum_{i=1}^N w_i^2.\]

The first term is a bias from imperfect pre-treatment balance; the second a variance that grows with weight concentration. Minimizing the first term over the pre-treatment window gives the mixed-integer program SPCD solves (Eq. (1) of the paper):

\[\begin{split}\begin{aligned} \min_{\{D_i, w_i\}_{i=1}^N} \quad & \frac{1}{T} \sum_{t=1}^T \left( \sum_{i: D_i=1} w_i y_{it} - \sum_{i: D_i=-1} w_i y_{it} \right)^2 + \sigma \sum_{i=1}^N w_i^2 \\ \text{s.t.} \quad & w_i \geq 0,\quad D_i \in \{-1, +1\}, \\ & \sum_{i: D_i=1} w_i \;=\; \sum_{i: D_i=-1} w_i \;=\; 1. \end{aligned}\end{split}\]

Reformulation as \(\ell_1\)-PCA#

The key observation of the paper is that the change of variable \(W_i = w_i D_i\) collapses the assignment and weight variables into a single signed vector \(\mathbf{W} \in \mathbb{R}^N\). Under \(\mathbf{W}\), \(D_i = \operatorname{sgn}(W_i)\) and \(w_i = |W_i|\), the adding-up constraints become \(\mathbf{1}^\top \mathbf{W} = 0\), and the design problem becomes

\[\min_{\mathbf{W} \in \mathbb{R}^N,\ \mathbf{1}^\top \mathbf{W} = 0,\ \|\mathbf{W}\|_1 = 1} \mathbf{W}^\top \left( \mathbf{Y}_0 \mathbf{Y}_0^\top + \sigma \mathbf{I} \right) \mathbf{W}.\]

The hard equality \(\mathbf{1}^\top \mathbf{W} = 0\) is absorbed into the objective via a quadratic penalty:

\[\min_{\mathbf{W} \in \mathbb{R}^N,\ \|\mathbf{W}\|_1 = 1} \mathbf{W}^\top \left( \mathbf{Y}_0 \mathbf{Y}_0^\top + \sigma \mathbf{I} + \lambda \mathbf{1}\mathbf{1}^\top \right) \mathbf{W}.\]

Theorem 1 of the paper shows that for \(\lambda\) large enough, the sign vector \(\operatorname{sgn}(\mathbf{W}^\ast)\) of any global minimum coincides with the sign vector of an associated phase-synchronization problem. Once the signs are recovered, the magnitudes follow from a small convex QP (or, in practice, the closed form in Eq. (9)).

The Iteration Matrix#

All four code-paths of SPCD operate on the same \(N \times N\) iteration matrix from Eq. (2) of the paper:

\[\mathbf{M} \;\coloneqq\; \mathbf{Y}_0 \mathbf{Y}_0^\top \;+\; \alpha \mathbf{I} \;+\; \lambda \mathbf{1} \mathbf{1}^\top,\]

where \(\alpha\) plays the role of \(\sigma\) (noise variance) and \(\lambda\) is the constraint-absorbing penalty. The paper treats both as pre-defined hyperparameters and gives no formula for them.

Note

Choosing \(\alpha\). The paper’s appendix sets the perturbation ridge on the noise scale (\(\alpha = \lVert\boldsymbol{\Delta}\rVert\) with \(\alpha \le \lVert\boldsymbol{\Delta}\rVert\)), so \(\alpha\) should track the idiosyncratic noise variance, not the dominant (signal/level) eigenvalue of \(\mathbf{Y}_0 \mathbf{Y}_0^\top\). When \(N > T_{\text{pre}}\) the matrix \(\mathbf{Y}_0 \mathbf{Y}_0^\top\) is rank-deficient and the post-period RMSE is a non-monotone, jumpy function of \(\alpha\) (small changes flip the discrete sign vector), so no single closed-form estimate is robust. When the user does not supply \(\alpha\), formulation.py first fixes its scale with the Gavish-Donoho median-singular-value noise estimate (estimate_noise_variance()), and orchestration.select_alpha_by_holdout then picks the value on a noise-scale grid that balances a held-out tail of the pre-period best out of sample. \(\lambda\) and \(\beta\) still default from the spectrum of \(\mathbf{Y}_0 \mathbf{Y}_0^\top\). If you know the noise level (e.g. the simulations below fix \(\sigma = 1\)), pass alpha_ridge explicitly to skip selection.

Balancing on Covariates#

By default SPCD balances the two groups on the pre-treatment outcomes only. Passing covariates (a list of column names) additionally balances on auxiliary unit characteristics. Each unit’s per-covariate pre-period mean is taken (time-invariant covariates – e.g. last year’s market share – collapse to their value), the resulting \(N \times P\) matrix \(\mathbf{X}\) is z-scored across units, and a covariate-balance term is folded into the iteration matrix:

\[\mathbf{M} \;\coloneqq\; \mathbf{Y}_0 \mathbf{Y}_0^\top \;+\; \texttt{covariate\_weight} \cdot s \cdot \mathbf{X} \mathbf{X}^\top \;+\; \alpha \mathbf{I} \;+\; \lambda \mathbf{1}\mathbf{1}^\top,\]

where \(s\) rescales \(\mathbf{X} \mathbf{X}^\top\) to match the trace (“energy”) of \(\mathbf{Y}_0 \mathbf{Y}_0^\top\), so covariate_weight = 1 weights covariates equally to the outcomes, > 1 upweights them, and 0 (or omitting covariates) recovers the outcome-only design. Crucially this keeps the iteration matrix a quadratic form \(\mathbf{W}^\top \mathbf{M} \mathbf{W}\) with the same phase-synchronization structure, so the spectral solver and the global-optimality theory are unchanged – the design now drives the weighted treated and control groups to agree on outcomes and covariates jointly. (This is the same way classical synthetic control balances on predictors; the relative weight covariate_weight is the analog of SCM’s \(V\).)

Note

Budget / cardinality constraints are not supported by SPCD: they are linear constraints on the discrete assignment and break the phase-synchronization reduction (which is precisely what removing the cardinality constraint buys, per the paper’s Remark 1). Use Synthetic Design (SYNDES), whose MIP takes such constraints natively, when you need a budget.

res = SPCD({
    "df": df, "outcome": "sales", "unitid": "DMA", "time": "Week",
    "covariates": ["mkt_share"],     # balance outcomes + last-year market share
    "covariate_weight": 1.0,          # equal weight to outcomes (default)
}).fit()
# the weighted treated/control groups now match on mkt_share as well

Spectral Initialization#

Both Algorithm 1 and Algorithm 2 of the paper start from the same warm start: the sign of the smallest eigenvector of \(\mathbf{M}\),

\[\mathbf{y}^{0} \;=\; \operatorname{sgn}(\mathbf{v}), \qquad \mathbf{v} \;=\; \operatorname*{argmin}_{\|\mathbf{v}\|_2 = 1} \mathbf{v}^\top \mathbf{M} \mathbf{v}.\]

Appendix 3.2.1 of the paper (Lemma 4) shows that under the linear latent-factor model \(Y_{it} = \delta_t + \theta_t^\top \mu_i + e_{it}\) (Assumption 1) together with the realizability assumption (Assumption 2), the sign vector \(\operatorname{sgn}(\mathbf{v})\) already agrees with the global optimum \(\operatorname{sgn}(\mathbf{w}^\ast)\) up to a bounded fraction of entries.

The (Normalized) Generalized Power Method#

To refine the spectral initialization, SPCD performs sign-iterations on \(\mathbf{M}^{-1}\). The two variant choices correspond to the two update boxes of Algorithm 1 / Algorithm 2:

variant="spcd" (Eq. (4)/(7)) — Generalized Power Iteration:

\[\mathbf{y}^{t+1} \;=\; \operatorname{sgn}\!\left[\, \left( \mathbf{M}^{-1} + \beta \mathbf{I} \right) \, \mathbf{y}^{t} \,\right].\]
variant="norm_spcd" (Eq. (5)/(8)) — Normalized Generalized Power Iteration:

\[\mathbf{y}^{t+1} \;=\; \operatorname{sgn}\!\left[\, \left( \mathbf{M}^{-1} + \beta \mathbf{I} \right) \, \big( \mathbf{y}^{t} \,/\, \mathbf{d} \big) \,\right], \quad \mathbf{d} \;=\; \sqrt{\operatorname{diag}(\mathbf{M}^{-1})}.\]

Here \(\beta > 0\) is the step parameter (auto-defaulted to \(1/\lambda_{\max}(\mathbf{M})\)), and \(/\) denotes element-wise division. The loop terminates as soon as the sign vector stops changing between successive iterations.

Under Assumptions 1 and 2 with small enough idiosyncratic noise, Theorem 3 of the paper guarantees:

variant="spcd" converges globally if \(\epsilon \,>\, (\sqrt{3}/2) - 1\);
variant="norm_spcd" converges linearly and globally for any \(\epsilon > 0\).

This is why the normalized variant is the package default.

Final Weight Step#

Once the iteration converges to a sign vector \(\mathbf{y}^\ast \in \{-1, +1\}^N\), the unit weights are produced by one of two procedures:

weights="empirical" (Eq. (9), Algorithm 2 — the paper’s experimental default):

\[\mathbf{w} \;=\; \frac{2 \, \mathbf{M}^{-1} \mathbf{y}^\ast}{\left\| \mathbf{M}^{-1} \mathbf{y}^\ast \right\|_1}.\]

The optimality condition of the original QP (Eq. (6)) implies that \(\operatorname{sgn}(\mathbf{w}) = \mathbf{y}^\ast\) whenever the iteration has converged to a fixed point of the closed-form map, so the signed vector \(\mathbf{w}\) simultaneously encodes group membership and weights.
weights="exact" (Eq. (6), Algorithm 1) — solves the convex QP

\[\begin{split}\begin{aligned} \min_{\mathbf{w} \geq 0} \quad & \frac{1}{T} \sum_{t=1}^T \left( \sum_{i:\, y^\ast_i = +1} w_i Y_{it} - \sum_{i:\, y^\ast_i = -1} w_i Y_{it} \right)^2 + \alpha \sum_{i=1}^N w_i^2 \\ \text{s.t.} \quad & \sum_{i:\, y^\ast_i = +1} w_i \;=\; \sum_{i:\, y^\ast_i = -1} w_i \;=\; 1. \end{aligned}\end{split}\]

via cvxpy. Use this when you need the exact Algorithm-1 weights rather than the closed-form approximation.

Minority Convention#

Per the bottom of Algorithm 1 (page 7 of the paper), SPCD then applies the minority-group rule

\[\text{Treat unit } i \iff y^\ast_i \;=\; -\operatorname{sgn}\!\left(\sum_{j} y^\ast_j\right),\]

which flips the sign vector (if necessary) so that the smaller of the two groups is the treated group. The treated unit labels reported by SPCDResults.selected_unit_labels() follow this convention.

Treatment Effect and Pre-Period Fit#

For post-treatment periods \(t = T+1, \dots, T+S\), the SPCD treatment-effect estimator at the bottom of Algorithm 1 is

\[\widehat{\tau} \;=\; \frac{1}{S} \sum_{t = T+1}^{T+S} \left( \sum_{i: y^\ast_i = +1} w_i \, Y_{i,t} - \sum_{i: y^\ast_i = -1} w_i \, Y_{i,t} \right),\]

which is precisely the mean of the post-period synthetic gap reported as results.att. When no post period is supplied (design-only mode), the estimator reports att = 0.0.

The pre-period RMSE between the synthetic treated and synthetic control trajectories,

\[\mathrm{RMSE}_{\text{pre}} \;=\; \sqrt{ \frac{1}{T} \sum_{t = 1}^T \left( \sum_{i: y^\ast_i = +1} w_i \, Y_{i,t} - \sum_{i: y^\ast_i = -1} w_i \, Y_{i,t} \right)^2 },\]

measures the residual of Eq. (1) on the chosen sign vector and is reported as results.rmse_pre.

Algorithm 3 and Algorithm 4 of the Paper#

The paper’s Appendix 3.2 also defines two further numbered algorithms — Algorithm 3 (Generalized Power Method on an abstract Hermitian matrix) and Algorithm 4 (its normalized counterpart). These are not implemented as separate code paths in mlsynth: they are the abstract meta-versions of Algorithms 1 and 2 used to prove the global convergence theorem (Theorem 3) and operate on a generic matrix \(\mathbf{C} = \mathbf{z} \mathbf{z}^\top + \boldsymbol{\Delta}\) rather than on the SPCD-specific iteration matrix \(\mathbf{M}\). The two variant options exposed in the API already cover both procedures applied to \(\mathbf{M}\).

Assumptions and Theory#

SPCD’s global-optimality guarantee rests on two assumptions — one structural, one a realizability condition — both stated formally below, each paired with a plain-language Remark.

Assumption 1 (linear latent-factor model). Outcomes are generated by

\[Y_{it} \;=\; \delta_t \;+\; \frac{D_{it} + 1}{2}\,\tau \;+\; \theta_t^\top \mu_i \;+\; e_{it}, \qquad \mathbb{E}[e_{it} \mid \delta_t, \mu_i, D_{it}] = 0, \quad \operatorname{Var}[e_{it} \mid \cdot] = \sigma,\]

where \(\delta_t\) is a time fixed effect, \(\mu_i\) are unobserved common factors, \(\theta_t\) is a vector of unknown factor loadings, \(e_{it}\) is i.i.d. idiosyncratic noise, and \(\tau\) is the treatment effect. In the pre-treatment period \(D_{it} = -1\) for all units.

Remark. This is the standard interactive-fixed-effects model that underlies the consistency theory of synthetic control (Abadie et al., 2010 [ABADIE2010]; Xu, 2017 [Xu2017]). It is the assumption that makes “units that shared latent factors in the past will keep tracking each other” a defensible basis for the design — exactly the structure SPCD exploits when it balances the two groups on pre-period outcomes.

Assumption 2 (realizable design). There exists a unique parameter \((w_i, D_i)_{i=1}^N\) satisfying: (a) \(D_i \in \{-1, +1\}\); (b) \(w_i \ge 0\) with \(\sum_i D_i w_i = 0\); (c) \(\|\mathbf{w}\|_2^2 = N\) and \(\epsilon \le |w_i| \le 1/\epsilon\) for all \(i\); and (d) the weights balance the factors, \(\sum_i w_i D_i \mu_i = 0\).

Remark. This says a perfect, balanced design exists in population — a split whose weighted factor loadings cancel exactly — and that it is unique (so the optimizer need not arbitrate between competing perfect designs). It is the design-time analogue of the “treated unit lies in the convex hull of the donors” condition in retrospective SCM, and is what turns an NP-hard search into a problem with a recoverable global optimum.

Theorem 1 (sign recovery). For \(\lambda\) large enough, the global solution \(\mathbf{w}^\ast\) of the penalized program satisfies \(\operatorname{sgn}(\mathbf{w}^\ast) = \operatorname{sgn}(\operatorname*{argmin}_{\|\mathbf{W}\|_1=1} \mathbf{W}^\top (\mathbf{Y}_0\mathbf{Y}_0^\top + \sigma \mathbf{I} + \lambda \mathbf{1}\mathbf{1}^\top) \mathbf{W})\). In words: the quadratic penalty does not corrupt the signs of the optimal design, so recovering group membership and recovering the weights can be separated.

Theorem 2 (equivalence to phase synchronization). The design problem is symbolically identical to a phase-synchronization problem on the matrix \((\mathbf{A}^\top \mathbf{A})^{-1}\). In words: SPCD inherits the entire fast-solver toolkit developed for phase synchronization — most importantly the spectrally-initialized generalized power method.

Theorem 3 (global convergence). Under Assumptions 1–2, if \(\sigma\) is small enough and \(T \ge \mathrm{poly}(N, 1/\epsilon)\), then variant="spcd" converges to the global optimum whenever \(\epsilon > \sqrt{3}/2 - 1\), and variant="norm_spcd" converges to the global optimum at a linear rate for any \(\epsilon > 0\). In words: the normalization step buys global convergence under a strictly weaker condition, which is why norm_spcd is the default.

Inference and Power Analysis#

Beyond the design itself, SPCD produces a moving-block conformal confidence interval for the post-period ATT and a Monte Carlo estimate of the minimum detectable effect (MDE). The procedure mirrors LEXSCM’s train-on-E / calibrate-on-B discipline.

Estimation / Holdout Split#

Given the pretreatment matrix \(\mathbf{Y}_{\text{pre}} \in \mathbb{R}^{T_{\text{pre}} \times N}\), SPCD splits the pretreatment window into

\[\begin{split}\mathbf{Y}_{\text{pre}} \;=\; \begin{bmatrix} \mathbf{Y}_{E} \\[3pt] \mathbf{Y}_{B} \end{bmatrix},\end{split}\]

where \(\mathbf{Y}_E\) contains the first holdout_frac_E of the pretreatment periods (default 70 %) and \(\mathbf{Y}_B\) contains the remainder. The SPCD design — sign vector, treated/control weights, and the iteration matrix \(\mathbf{M}\) — is fit on \(\mathbf{Y}_E\) only.

Backwards-Compatibility Guarantee#

Because the design is fit on \(\mathbf{Y}_E\) alone, two callers who share the same pretreatment data — one in planning mode (no \(\mathbf{Y}_{\text{post}}\) yet) and one in retrospective mode (with \(\mathbf{Y}_{\text{post}}\)) — receive identical designs. The only difference between the two callers is whether a post-period ATT and its conformal CI are reported.

Holdout Residuals#

Applying the design weights to \(\mathbf{Y}_B\) gives an out-of-sample synthetic-gap series

\[\mathbf{r}_B \;=\; \mathbf{Y}_B \cdot \mathbf{w},\]

where \(\mathbf{w}\) is the signed contrast_weights vector from the \(\mathbf{Y}_E\)-fit design. Under the linear-factor model with no treatment, \(\mathbf{r}_B\) is a zero-mean noise series whose empirical distribution characterizes the noise structure of any synthetic-gap linear functional based on the same \(\mathbf{w}\). This makes \(\mathbf{r}_B\) the natural calibration set for inference and the natural noise pool for power simulations.

Moving-Block Conformal CI#

When \(\mathbf{Y}_{\text{post}}\) is supplied, the post-period synthetic gap is \(\mathbf{g} = \mathbf{Y}_{\text{post}} \cdot \mathbf{w} \in \mathbb{R}^S\), with mean

\[\widehat{\tau} \;=\; \frac{1}{S} \sum_{t=1}^S g_t.\]

The test statistic is \(T(\mathbf{g}) = \mathrm{mean}(|\mathbf{g}|)\). Conformity scores are computed by taking mean-absolute values over all sliding blocks of size \(b = \max(3, \lfloor\sqrt{S}\rfloor)\) (both standard and circular) of \(\mathbf{r}_B\). The conformal p-value vs. \(H_0\colon \tau = 0\) is

\[p \;=\; \frac{ \#\{\text{blocks with score} \geq T(\mathbf{g})\} }{\text{number of blocks}}.\]

The \((1 - \alpha)\) confidence interval for \(\widehat{\tau}\) is obtained by inversion: scan a grid of candidate values \(\theta\) around \(\widehat{\tau}\), include \(\theta\) if the adjusted-residual score \(T(\mathbf{g} - \theta)\) is in the in-distribution region of the block-score empirical distribution. Pointwise bands at the \((1 - \alpha)\) quantile \(q\) of the block scores are reported as \(g_t \pm q\).

This is exchangeability-based inference: coverage holds in finite samples under the assumption that overlapping blocks of \(\mathbf{r}_B\) are exchangeable with overlapping blocks of \(\mathbf{g}\) under the null. This is a stronger assumption than IID noise and weaker than perfect \(H_0\). It breaks if the treatment introduces variance changes.

Monte Carlo MDE#

The MDE answers the pre-experiment planning question: given my holdout residuals and a planned post-period horizon \(S\), what is the smallest constant treatment effect I could detect with power \(\geq \pi\)?

The procedure follows mlsynth.utils.fast_scm_helpers.power_helpers:

Build the null distribution of \(T\) at horizon \(S\) by resampling \(\mathbf{r}_B\) (padded with Gaussian draws so that resampling of size \(S\) is always feasible). Compute \(c_\alpha = \text{quantile}(\text{null}, 1 - \alpha)\).
For each candidate \(\tau\) on a grid of effect sizes, draw \(n_{\text{trials}}\) post-period vectors of the form \(\tau + \mathcal{N}(0, \widehat{\sigma}_B^2)\) and count the fraction exceeding \(c_\alpha\).
The smallest \(\tau\) whose empirical power reaches the target \(\pi\) is reported as the MDE, both on the absolute scale (mde) and as a percentage of the holdout-baseline outcome level (mde_pct).

A detectability curve — the MDE as a function of \(S\) — can be requested by passing mde_horizon_grid to the config. This answers the related question: how long do I need to run the experiment to detect a target effect?

Power Analysis Always Runs#

The MDE computation depends only on \(\mathbf{r}_B\), not on \(\mathbf{Y}_{\text{post}}\). So power analysis runs whenever enable_inference=True and the holdout window is large enough (at least min_blank_size periods, default 5). The conformal CI only runs when \(\mathbf{Y}_{\text{post}}\) is supplied; otherwise the ATT and CI are reported as None (honest absence rather than a silent zero).

Opting Out#

Setting enable_inference=False skips the E/B split entirely. The design is then fit on the full pretreatment matrix (legacy behavior), no holdout residuals are produced, and inference / power analysis are not run. Use this when you need byte-identical reproducibility against a pre-inference SPCD release, or when your pretreatment window is too short to spare a holdout.

Example: Reproducing the Paper’s RMSE Advantage#

The paper’s headline finding (Section 4.1, Table 1, Figures 2–4) is not merely that SPCD is unbiased — it is that SPCD’s design yields a far lower root-mean-square error of the treatment-effect estimate than a random design on the linear latent-factor model. The block below reproduces that finding and is self-contained: it draws panels from the paper’s data-generating process (\(Y_{it} = \text{level}_i + v_t^\top \gamma_i + e_{it}\), with a true effect \(\tau = 1\) and noise \(\sigma = 1\)), and for each draw estimates \(\tau\) two ways — with SPCD, and with the paper’s random baseline (assign each unit to treated/control with probability one half, then take a simple difference in group means). Because the SPCD design is fit on pre-treatment data only, injecting a post-period effect never changes which units SPCD selects, so the two SPCD calls per draw share an identical design.

import numpy as np
import pandas as pd
from mlsynth import SPCD

def factor_panel(rng, N=10, T_pre=20, T_post=10, L=8, sigma=1.0):
    """One draw from the paper's linear factor model (Section 4.1):
    Y_it = level_i + v_t^T gamma_i + e_it, with no treatment applied yet."""
    gamma = rng.standard_normal((N, L))                 # unit loadings
    v = rng.standard_normal((T_pre + T_post, L))        # time factors
    level = rng.uniform(40.0, 60.0, size=N)             # positive baselines
    noise = rng.normal(scale=sigma, size=(T_pre + T_post, N))
    return level[None, :] + v @ gamma.T + noise         # shape (T, N)

def make_df(Y, T_pre):
    T, N = Y.shape
    return pd.DataFrame(
        [{"unit": i, "time": t, "y": float(Y[t, i]), "post": int(t >= T_pre)}
         for i in range(N) for t in range(T)]
    )

def spcd_att(Y, T_pre, tau):
    """Fit the SPCD design on the pre-period, inject tau on the chosen
    treated units, then read the estimated ATT off the post-period."""
    cfg = dict(outcome="y", unitid="unit", time="time", post_col="post",
               enable_inference=False, alpha_ridge=1.0)  # sigma^2 known = 1
    treated = SPCD({"df": make_df(Y, T_pre), **cfg}).fit().selected_unit_labels
    Y = Y.copy(); Y[T_pre:, np.asarray(treated)] += tau
    return SPCD({"df": make_df(Y, T_pre), **cfg}).fit().att

def random_att(rng, Y, T_pre, tau):
    """Paper's baseline: assign each unit to treated/control with prob 1/2
    and estimate tau by a difference in (equally weighted) group means."""
    N = Y.shape[1]
    sign = rng.choice([-1, 1], size=N)
    if sign.min() == sign.max():                        # keep both groups non-empty
        sign[rng.integers(N)] *= -1
    treated, control = np.where(sign == 1)[0], np.where(sign == -1)[0]
    Y = Y.copy(); Y[T_pre:, treated] += tau
    gap = Y[T_pre:, treated].mean(axis=1) - Y[T_pre:, control].mean(axis=1)
    return gap.mean()

rng = np.random.default_rng(0)
T_pre, T_post, tau, n_draws = 20, 10, 1.0, 200

# --- one draw: SPCD recovers the injected effect ---
Y = factor_panel(rng, T_pre=T_pre, T_post=T_post)
print(f"single-draw SPCD ATT: {spcd_att(Y, T_pre, tau):.3f}  (true {tau})")

# --- many draws: SPCD's design slashes ATT RMSE vs a random design ---
err_spcd, err_rand = [], []
for _ in range(n_draws):
    Y = factor_panel(rng, T_pre=T_pre, T_post=T_post)
    err_spcd.append(spcd_att(Y, T_pre, tau) - tau)
    err_rand.append(random_att(rng, Y, T_pre, tau) - tau)
err_spcd, err_rand = np.array(err_spcd), np.array(err_rand)

rmse = lambda e: np.sqrt(np.mean(e ** 2))
print(f"SPCD   mean ATT {tau + err_spcd.mean():.3f}   RMSE {rmse(err_spcd):.3f}")
print(f"random mean ATT {tau + err_rand.mean():.3f}   RMSE {rmse(err_rand):.3f}")
print(f"RMSE ratio (random / SPCD): {rmse(err_rand) / rmse(err_spcd):.1f}x")

Both designs are roughly unbiased (mean ATT \(\approx 1\)), but SPCD’s RMSE is about 0.4 against the random design’s 3.9 — an order-of- magnitude reduction (≈ 9–10× on this seed), reproducing the paper’s central claim that SPCD “surpasses the random design by a large margin.” The mechanism is the balance term of Eq. (1): a single draw of the random design can split the units so that the two group means differ substantially even before treatment, and that pre-period imbalance carries straight through to the ATT; SPCD instead chooses the split (and weights) that makes the groups track each other, so almost nothing but the injected effect survives into the post-period gap.

Switch variant ("spcd" vs. "norm_spcd") and weights ("empirical" vs. "exact") to compare the four code-paths; all share the same iteration matrix \(\mathbf{M}\) and differ only in the refinement and weight steps described above. Enabling inference (the default) additionally returns a conformal p-value, CI, and the design-time MDE on each fit.

Note

selected_unit_labels returns the minority group (the treated side) by the minority-group convention. Treating the smaller group keeps the control pool large, which is what the variance term \(\sigma^2 \sum_i w_i^2\) rewards.

API Variants and Multi-Arm Designs#

SPCD accepts either an SPCDConfig instance or a plain dict. The two orthogonal modelling choices are:

variant: which power-method update box to iterate. "spcd" is the generalized power iteration (Eq. (4)/(7)); "norm_spcd" is the normalized iteration (Eq. (5)/(8)) used in all of the paper’s Section 4 experiments and is the package default.
weights: how to compute the final unit weights. "empirical" uses the closed-form Eq. (9) (fast; the authors’ experimental default); "exact" solves the convex QP Eq. (6) via cvxpy (slower; matches Algorithm 1 to the letter).

Passing an arm column solves the SPCD design independently within each arm’s units and returns an SPCDMultiArmResults (a dict of per-arm SPCDResults); when arm is None (default) a single SPCDResults is returned.

Multi-Arm Power: the Pooled Average Effect#

Each arm carries its own per-arm MDE (arm_designs[label].mde), but those are m independent analyses with no family-wise error control. When the estimand of interest is the average effect across arms — the usual program-level question — SPCDMultiArmResults also reports a single pooled average-effect MDE (results.pooled_mde / results.pooled_mde_pct), computed by default whenever inference is enabled and at least two arms have a usable holdout window.

It is formed by pooling the arms’ time-aligned holdout residuals into one contrast, \(g_t = \sum_a w_a\, r^{(a)}_{B,t}\), and running the ordinary single-series MDE on \(\mathbf{g}\). Because the arms share calendar time, summing the aligned series makes the cross-arm correlation enter through \(\operatorname{Var}(\mathbf{g}) = \mathbf{w}^\top \boldsymbol{\Sigma}\, \mathbf{w}\) automatically — which is why the residual series must be pooled, never resampled per arm independently (that would drop positive correlation and report an over-optimistic MDE). The weights \(w_a\) are set by pooled_weights: "size" (default) weights each arm by its unit count, giving the population-average effect; "equal" weights arms equally.

Pooling answers an easier question than the per-arm analyses — averaging cancels idiosyncratic noise, so the detectable average effect is typically several times smaller than any single arm’s MDE.

Warning

The pooled MDE targets the weighted-average effect. If arm effects are heterogeneous and opposite-signed they can cancel in the average and go undetected even when individual arms move a lot. Use the per-arm arm_designs[label].mde (or a family-wise procedure) when detecting individual arms is what matters.

res = SPCD({
    "df": df, "outcome": "sales", "unitid": "unit", "time": "time",
    "arm": "region", "post_col": "post",
    "pooled_weights": "size",    # population-average effect ("equal" also available)
}).fit()

print(res.pooled_mde)            # smallest detectable AVERAGE effect across arms
print(res.pooled_mde_pct)        # ... as % of the pooled baseline
print({k: r.mde for k, r in res.arm_designs.items()})   # per-arm MDEs

Power Curves and the MDE at Each Time Point#

Every fitted design stores two diagnostic curves on its SPCDPowerAnalysis — for each arm (arm_designs[label].power) and for the whole study (pooled_power):

Power vs. effect size — power.effect_grid_pct and power.power_curve give the full power curve at the analysis horizon; the MDE is the smallest effect whose power reaches power_target.
MDE vs. horizon (“MDE at time point \(t\)”) — power.detectability is a {horizon -> MDE percent} map. It is computed by default over horizons \(1, \dots, \min(12, n_{\text{post}})\) — real market tests rarely run past ~12 periods, so the grid is capped at 12 rather than sweeping the full (often 30+ period) post window. Pass mde_horizon_grid to choose your own horizons (e.g. range(1, 9) for an 8-week test). The curve is built both per arm and for the whole study; the MDE shrinks as the horizon grows, and the pooled curve sits below every per-arm curve.

Warning

The headline pooled_mde_pct is computed at the full post-window length in your data. If you will actually run a shorter test, quote the curve point for that length — e.g. pooled_power.detectability[12] for a 12-period test — not the headline number.

Pre-Period Agreement (Design Diagnostic)#

Every fitted design also reports how closely the synthetic treated and synthetic control trajectories agree before treatment — the RMSE of their gap, broken out by window (results.pre_fit / results.pre_fit_rmse):

estimation — RMSE over the estimation window \(E\) (in-sample for the design fit);
blank — RMSE over the blank/holdout window \(B\) (out-of-sample — the honest read on whether the match will hold up post-launch);
overall_pre — RMSE over the full pre-period (\(E \cup B\)).

Lower is better. The blank value is None when enable_inference=False (no holdout split is made).

r = results.arm_designs["A"]            # or a single SPCDResults
print(r.pre_fit_rmse)
# {"estimation": 25.8, "blank": 16.7, "overall_pre": 23.4}

res = SPCD({
    "df": df, "outcome": "sales", "unitid": "unit", "time": "time",
    "arm": "region", "post_col": "post",
}).fit()

whole_study = res.pooled_power.detectability    # {horizon: MDE % of baseline}
per_arm = {k: r.power.detectability for k, r in res.arm_designs.items()}

# Recommend the shortest horizon that detects a target average effect:
target_pct = 1.0
feasible = [h for h, m in sorted(whole_study.items()) if m <= target_pct]
recommended_periods = min(feasible) if feasible else None

Three built-in plot helpers turn these into slides-ready figures:

from mlsynth import plot_mde_bars, plot_power_curves, plot_detectability

plot_mde_bars(res)        # bar chart: per-arm MDE % with the pooled bar
plot_power_curves(res)    # power vs. effect size, per arm + pooled
plot_detectability(res)   # MDE vs. study length (time points), per arm + pooled

Each returns a Matplotlib Figure (so it renders inline in a notebook and can be .savefig(...)-ed), and works on a single SPCDResults as well as a multi-arm SPCDMultiArmResults.

Note

Computing the per-horizon detectability for every arm and the pooled contrast multiplies the Monte Carlo cost by \(n_{\text{post}}\). For long post windows, pass a coarser mde_horizon_grid (e.g. range(2, n_post + 1, 2)) or lower mde_n_sims to keep fits fast.

from mlsynth import SPCD

config = {
    "df": df,
    "outcome": "sales",
    "unitid": "unitid",
    "time": "time",
    "post_col": "post",        # or T0=...
    "variant": "norm_spcd",    # iteration box; "spcd" also available
    "weights": "empirical",    # paper's experimental default; "exact" uses cvxpy
    "display_graph": True,
}

results = SPCD(config).fit()

# Selected units and design diagnostics
print(results.selected_unit_labels)
print(results.design.n_iterations, results.design.converged)
print(results.design.alpha_ridge, results.design.lam_balance, results.design.beta)

# Standardized result bundle — same shape as the rest of mlsynth
print(results.att)                  # mean post-period synthetic gap (= 0 if no post)
print(results.rmse_pre)             # pre-period RMSE between synthetic groups
print(results.rmse_post)            # post-period RMSE (None if no post)
print(results.donor_weights)        # {control unit label: weight}

# Or via the full BaseEstimatorResults object
summary = results.summary
summary.effects.att
summary.fit_diagnostics.rmse_pre
summary.time_series.observed_outcome      # synthetic treated trajectory
summary.time_series.counterfactual_outcome  # synthetic control trajectory
summary.time_series.estimated_gap           # synthetic gap (signed)
summary.weights.donor_weights
summary.method_details.method_name          # "SPCD (norm_spcd, weights=empirical)"
summary.method_details.parameters_used      # variant, alpha/lam/beta, iterations

Verification#

Cross-validated against the paper’s Prop 99 design study (Section 4.2, Table 1): on the ADH panel (38 states, California excluded, 1970-2000), SPCD’s placebo RMSE of the estimated effect reproduces the reported values – 0.94 at T=25 (paper 0.98), and within ~2x at T=15 – and the design beats both a randomized difference-in-means and a single-unit synthetic control at both horizons (SPCD \(\ll\) Random \(\ll\) SC), the paper’s headline. See SPCD — Lu et al. (2022) Prop 99 design study; run it with python benchmarks/run_benchmarks.py spcd_prop99.

Note

Table 1’s other block (US BLS unemployment, target SPCD RMSE 0.9) is not reproducible from the paper alone: SPCD ships no public code and leaves alpha/lambda/beta unspecified, so a faithful Eq.-9 implementation (mlsynth’s, verified to \(\|w\|_1 = 2\)) lands near 8 on those noisy rank-deficient subsamples. The Prop 99 cell is the durable target.

Core API#

Synthetic Principal Component Design (SPCD) estimator.

This module implements:

Lu, Y., Li, J., Ying, L., & Blanchet, J. (2022). “Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls.” arXiv:2211.15241v1.

Algorithm flow (see mlsynth.utils.spcd_helpers):

formulation.py : Eq. (2) M = Y Y^T + alpha I + lambda 1 1^T spectral_init.py : Algs 1 & 2 y^0 = sgn(smallest eigvec of M) iteration_spcd.py : Eqs. (4)/(7) SPCD update (variant=”spcd”) iteration_norm_spcd.py : Eqs. (5)/(8) NormSPCD update (variant=”norm_spcd”) weights_empirical.py : Eq. (9) Algorithm 2 final step (weights=”empirical”) weights_exact.py : Eq. (6) Algorithm 1 final step (weights=”exact”) treatment_effect.py : Algs 1 & 2 footer (minority flip, synthetic paths)

Theoretical artifacts (not implemented as separate code paths):

Algorithm 3 (Appx 3.2, p.20) : abstract GPW used in the proof of Theorem 3 Algorithm 4 (Appx 3.2, p.20) : abstract Normalized GPW used in the proof

class mlsynth.estimators.spcd.SPCD(config: SPCDConfig | dict)#

Bases: object

Synthetic Principal Component Design (SPCD) estimator.

Selects a treated group and constructs synthetic control weights simultaneously by solving the optimal experiment-design problem of Lu, Li, Ying, Blanchet (2022) via a (normalized) generalized power method with spectral initialization.

Parameters:: config (SPCDConfig or dict) – Configuration object defining data, iteration variant, and weight-step choice.

config#

Parsed configuration object.

Type:: SPCDConfig

df#

Input panel data.

Type:: pandas.DataFrame

outcome#

Name of outcome column.

Type:: str

unitid#

Unit identifier column.

Type:: str

time#

Time identifier column.

Type:: str

variant#

Iteration-box choice. "spcd" uses Eq. (4)/(7); "norm_spcd" uses Eq. (5)/(8).

Type:: {“spcd”, “norm_spcd”}

weights#

Final-weight-step choice. "empirical" uses Eq. (9), the paper’s experimental default; "exact" solves Eq. (6) via cvxpy.

Type:: {“empirical”, “exact”}

alpha_ridge, lam_balance, beta

Hyperparameters of Eq. (2) and the iteration update. When None, each is auto-estimated from the spectrum of Y_pre.T @ Y_pre.

Type:: float or None

max_iter#

Maximum iterations for the SPCD/NormSPCD while loop.

Type:: int

T0, post_col

Pre/post split specification (mirrors SYNDES’s interface).

Type:: optional

solver#

Passed to cvxpy when weights="exact".

Type:: optional

display_graph#

Whether to plot synthetic treated/control series after fit.

Type:: bool

verbose#

Solver verbosity flag.

Type:: bool

Returns:: SPCDResults – Container with the optimized design, attached preprocessed inputs, and synthetic-path arrays.

fit() → SPCDResults | SPCDMultiArmResults#

Run the SPCD pipeline and return the design.

Returns:: SPCDResults or SPCDMultiArmResults – A single design when no arm column is configured; otherwise one independent SPCD design per arm, keyed by arm label.

Configuration#

class mlsynth.config_models.SPCDConfig(*, df: DataFrame, outcome: str, unitid: str, time: str, variant: Literal['spcd', 'norm_spcd'] = 'norm_spcd', weights: Literal['empirical', 'exact'] = 'empirical', alpha_ridge: Annotated[float | None, Ge(ge=0.0)] = None, lam_balance: Annotated[float | None, Ge(ge=0.0)] = None, beta: Annotated[float | None, Ge(ge=0.0)] = None, max_iter: Annotated[int, Gt(gt=0)] = 200, T0: Annotated[int | None, Gt(gt=0)] = None, post_col: str | None = None, arm: str | None = None, solver: Any = None, display_graph: bool = False, verbose: bool = False, enable_inference: bool = True, holdout_frac_E: Annotated[float, Ge(ge=0.1), Le(le=0.95)] = 0.7, inference_alpha: Annotated[float, Gt(gt=0.0), Lt(lt=1.0)] = 0.05, power_target: Annotated[float, Gt(gt=0.0), Lt(lt=1.0)] = 0.8, mde_n_sims: Annotated[int, Gt(gt=0)] = 5000, mde_n_trials: Annotated[int, Gt(gt=0)] = 400, mde_horizon_grid: List[int] | None = None, inference_seed: int = 1400, min_blank_size: Annotated[int, Gt(gt=0)] = 5, pooled_weights: Literal['size', 'equal'] = 'size', covariates: List[str] | None = None, covariate_weight: Annotated[float, Ge(ge=0)] = 1.0)#

Configuration for the Synthetic Principal Component Design (SPCD) estimator.

Implements Lu, Li, Ying, Blanchet (2022), “Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls”, arXiv:2211.15241v1.

Parameters:

variant ({“spcd”, “norm_spcd”}) – Iteration-box choice. "spcd" uses Eq. (4)/(7) of the paper; "norm_spcd" uses Eq. (5)/(8). The paper’s Section 4 experiments use "norm_spcd" with closed-form weights.
weights ({“empirical”, “exact”}) – Final-weight-step choice. "empirical" uses Eq. (9) — the closed-form approximation used in all of the paper’s experiments. "exact" solves Eq. (6) via cvxpy.
alpha_ridge (float or None, optional) – Ridge term alpha in Eq. (2), playing the role of the noise variance sigma. If None, it is chosen by out-of-sample pre-period balance over a noise-scale grid (select_alpha_by_holdout), since the post-period RMSE is a jumpy function of alpha when N > T_pre. Pass a value (e.g. a known noise variance) to bypass selection.
lam_balance (float or None, optional) – Sum-zero penalty lambda in Eq. (2). Auto-estimated if None. Theorem 1 requires this to be “large enough”.
beta (float or None, optional) – Iteration step parameter beta in Eqs. (4)/(5)/(7)/(8). Auto-estimated from the spectrum if None.
max_iter (int) – Maximum iterations for the SPCD/NormSPCD while loop.
T0 (int or None, optional) – Number of pre-treatment periods.
post_col (str or None, optional) – Column indicating post-treatment periods.
solver (Any, optional) – CVXPY-compatible solver. Used only when weights="exact".
display_graph (bool) – Whether to display the synthetic treated/control plot.
verbose (bool) – Solver verbosity.

Notes

Algorithms 3 and 4 of the paper (Appendix 3.2) are abstract meta-versions used in the proof of Theorem 3 (global convergence). They correspond to the same iterations as Algorithms 1 and 2 acting on a generic Hermitian perturbed rank-1 matrix and are not exposed as separate user options here.

T0: int | None#

alpha_ridge: float | None#

arm: str | None#

beta: float | None#

classmethod check_spcd_params(values: Any) → Any#

covariate_weight: float#

covariates: List[str] | None#

display_graph: bool#

enable_inference: bool#

holdout_frac_E: float#

inference_alpha: float#

inference_seed: int#

lam_balance: float | None#

max_iter: int#

mde_horizon_grid: List[int] | None#

mde_n_sims: int#

mde_n_trials: int#

min_blank_size: int#

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

pooled_weights: Literal['size', 'equal']#

post_col: str | None#

power_target: float#

solver: Any#

variant: Literal['spcd', 'norm_spcd']#

verbose: bool#

weights: Literal['empirical', 'exact']#

Result Containers#

SPCD.fit() returns an SPCDResults, bundling the optimized SPCDDesign (sign vector, treated/control/contrast weights, selected unit labels, iteration diagnostics, and the auto-chosen \(\alpha\), \(\lambda\), \(\beta\)), the prepared SPCDInputs, a standardized summary (BaseEstimatorResults), and the optional conformal CI and power-analysis objects. When an arm column is configured, an SPCDMultiArmResults collects one independent SPCDResults per arm.

Structured containers for the SPCD synthetic design pipeline.

Implements the data containers used throughout the SPCD estimator, which itself implements:

Lu, Y., Li, J., Ying, L., & Blanchet, J. (2022). Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls. arXiv:2211.15241v1.

class mlsynth.utils.spcd_helpers.structures.SPCDDesign(variant: str, weights_mode: str, assignment_pm1: ndarray, selected_mask: ndarray, raw_weights: ndarray, treated_weights: ndarray, control_weights: ndarray, contrast_weights: ndarray, synthetic_treated: ndarray, synthetic_control: ndarray, synthetic_gap: ndarray, selected_unit_indices: ndarray, selected_unit_labels: ndarray, n_treated: int, n_iterations: int, converged: bool, alpha_ridge: float, lam_balance: float, beta: float)#

Bases: object

Optimized SPCD design solution.

The “assignment” is represented in two equivalent forms:

assignment_pm1 — the paper’s internal {-1, +1} sign vector y* in {-1, +1}^N (see Algorithm 1, page 7).
selected_mask — a {0, 1} indicator marking the minority group that is treated, following the rule “Treat Unit i if gamma(i) = -sgn(sum_j gamma(j))” at the bottom of Algorithm 1.

Parameters:

variant (str) – Iteration variant used: "spcd" (Eq. (4)/(7)) or "norm_spcd" (Eq. (5)/(8)).
weights_mode (str) – Final weight step used: "empirical" (Eq. (9), Algorithm 2) or "exact" (Eq. (6), Algorithm 1).
assignment_pm1 (np.ndarray) – Final sign vector y* in {-1, +1}^N produced by the iteration. See Algorithm 1, page 7.
selected_mask (np.ndarray) – Binary {0, 1} indicator of treated units, following the minority-group convention from the bottom of Algorithm 1.
raw_weights (np.ndarray) – Signed weights w in R^N computed by the final weight step (Eq. (9) for weights="empirical" or Eq. (6) for weights="exact").
treated_weights (np.ndarray) – Weights restricted to the treated group, normalized to sum to 1.
control_weights (np.ndarray) – Weights restricted to the control group, normalized to sum to 1.
contrast_weights (np.ndarray) – Signed contrast weights forming treated_weights - control_weights (with appropriate signs), used to construct the synthetic gap.
synthetic_treated (np.ndarray) – Synthetic treated trajectory Y @ treated_weights of length T_pre + T_post.
synthetic_control (np.ndarray) – Synthetic control trajectory Y @ control_weights of length T_pre + T_post.
synthetic_gap (np.ndarray) – Pointwise difference synthetic_treated - synthetic_control.
selected_unit_indices (np.ndarray) – Integer indices of treated units.
selected_unit_labels (np.ndarray) – Original labels of treated units.
n_treated (int) – Number of treated units.
n_iterations (int) – Number of iterations executed by Algorithm 1’s while loop.
converged (bool) – True if the sign vector stabilized before max_iter.
alpha_ridge (float) – Value of alpha used in Eq. (2) (ridge on Y Y^T).
lam_balance (float) – Value of lambda used in Eq. (2) (penalty on (1^T W)^2).
beta (float) – Value of beta used in the iteration update.

alpha_ridge: float#

assignment_pm1: ndarray#

beta: float#

contrast_weights: ndarray#

control_weights: ndarray#

converged: bool#

lam_balance: float#

n_iterations: int#

n_treated: int#

raw_weights: ndarray#

selected_mask: ndarray#

selected_unit_indices: ndarray#

selected_unit_labels: ndarray#

synthetic_control: ndarray#

synthetic_gap: ndarray#

synthetic_treated: ndarray#

treated_weights: ndarray#

variant: str#

weights_mode: str#

class mlsynth.utils.spcd_helpers.structures.SPCDInputs(Y_pre: ndarray, Y_post: ndarray | None, unit_index: IndexSet, time_index: IndexSet, pre_time_index: IndexSet, post_time_index: IndexSet | None, outcome: str, covariates: ndarray | None = None, covariate_names: list | None = None)#

Bases: object

Preprocessed panel data for SPCD estimation.

Parameters:

Y_pre (np.ndarray) – Pre-treatment outcome matrix of shape (T_pre, N). Note that rows are time periods and columns are units, matching the convention used by prepare_syndes_inputs. The paper’s equations use Y in R^{N x T}, so when implementing Eq. (2) of the paper, the iteration matrix is built as Y_pre.T @ Y_pre + alpha I + lambda 1 1.T.
Y_post (np.ndarray or None) – Post-treatment outcome matrix of shape (T_post, N).
unit_index (IndexSet) – Mapping from unit labels to integer indices.
time_index (IndexSet) – Mapping from time labels to integer indices.
pre_time_index (IndexSet) – Index set for pre-treatment periods.
post_time_index (IndexSet or None) – Index set for post-treatment periods.
outcome (str) – Name of outcome variable.

Y_post: ndarray | None#

Y_pre: ndarray#

covariate_names: list | None = None#

covariates: ndarray | None = None#

outcome: str#

post_time_index: IndexSet | None#

pre_time_index: IndexSet#

time_index: IndexSet#

unit_index: IndexSet#

class mlsynth.utils.spcd_helpers.structures.SPCDMultiArmResults(arm_designs: Dict[Any, SPCDResults], arm: str, pooled_power: 'SPCDPowerAnalysis' | None = None, pooled_weights: str | None = None)#

Bases: object

Per-arm SPCD designs.

Returned by mlsynth.estimators.SPCD when an arm column is configured: the SPCD design problem is solved independently within each arm’s units, and each arm’s full SPCDResults (design, inputs, summary, conformal CI and power) is collected here.

Parameters:

arm_designs (dict) – {arm_label: SPCDResults} – one independent SPCD solution per arm.
arm (str) – Name of the arm column the units were partitioned on.
pooled_power (SPCDPowerAnalysis, optional) – Minimum detectable effect for the average effect across arms (the pooled, size- or equal-weighted contrast). Computed by default when inference is enabled and at least two arms have a usable holdout window. None otherwise. Note this targets the weighted-average effect, so opposite-signed arm effects can cancel – use the per-arm power for individual-arm detection.
pooled_weights (str, optional) – Weighting used for the pooled average ("size" or "equal"); None when no pooled MDE was computed.

arm: str#

arm_designs: Dict[Any, SPCDResults]#

att_by_arm() → Dict[Any, float | None]#: {arm_label: ATT} across arms (None where no summary).

property mode: str#

property pooled_mde: float | None#: MDE of the average effect across arms (absolute scale), if computed.

property pooled_mde_pct: float | None#: Pooled average-effect MDE as a percentage of the pooled baseline.

pooled_power: 'SPCDPowerAnalysis' | None = None#

pooled_weights: str | None = None#

class mlsynth.utils.spcd_helpers.structures.SPCDPreFit(rmse_estimation: float, rmse_blank: float | None, rmse_pre: float, n_estimation: int, n_blank: int)#

Bases: object

Pre-period agreement between the synthetic treated and synthetic control trajectories – a design-phase fit diagnostic.

Reports the RMSE of the synthetic gap (synthetic treated minus synthetic control) over three windows:

rmse_estimation – the estimation window E, in-sample for the design fit;
rmse_blank – the blank/holdout window B, out-of-sample (None when enable_inference=False so no split was made);
rmse_pre – the overall pre-period (E + B).

Lower values mean the two synthetic series track each other more closely before treatment. The blank-window RMSE is the honest out-of-sample read on how well the design will hold up post-launch.

n_blank: int#

n_estimation: int#

rmse_blank: float | None#

rmse_estimation: float#

rmse_pre: float#

class mlsynth.utils.spcd_helpers.structures.SPCDResults(design: SPCDDesign, inputs: SPCDInputs | None = None, summary: 'BaseEstimatorResults' | None = None, conformal: 'SPCDConformalResult' | None = None, power: 'SPCDPowerAnalysis' | None = None, pre_fit: SPCDPreFit | None = None)#

Bases: object

User-facing output of the SPCD estimator.

Parameters:

design (SPCDDesign) – Optimization solution.
inputs (SPCDInputs, optional) – Preprocessed data used in estimation. Attached by the mlsynth.estimators.SPCD orchestrator so the result is self-contained for plotting.
summary (BaseEstimatorResults, optional) – Standardized result bundle containing ATT, pre/post fit RMSEs, synthetic-paths time series, per-unit signed weights, and method diagnostics. Attached by the SPCD orchestrator so users get a single object whose shape matches the rest of the mlsynth estimator suite.

Notes

mode always reports "spcd" so plotting and dispatch code can branch on it uniformly with SYNDESResults.

Convenience properties (att, rmse_pre, rmse_post, donor_weights) forward to the corresponding fields of summary when it is attached.

property assignment: ndarray#: Alias for design.selected_mask (0/1 indicator of treated units).

property att: float | None#: Average treatment effect on the treated, or None if no summary.

property ci_lower: float | None#: Conformal lower bound of the ATT CI, if computed.

property ci_upper: float | None#: Conformal upper bound of the ATT CI, if computed.

conformal: 'SPCDConformalResult' | None = None#

property control_weights_by_unit: dict | None#

Control-side weights as {label: weight} (non-negative, sum to 1).

Alias of donor_weights() for users who prefer the explicit treated/control naming over the SC-literature “donor” terminology.

design: SPCDDesign#

property donor_weights: dict | None#: Per-unit signed contrast weights as a label-to-float dict.

inputs: SPCDInputs | None = None#

property mde: float | None#: Minimum detectable effect on the absolute scale, if computed.

property mde_pct: float | None#: Minimum detectable effect as a percentage of the holdout baseline.

property mode: str#: Solver mode reported to downstream consumers.

property p_value: float | None#

tau = 0, if computed.

Type:: Conformal p-value vs. H0

power: 'SPCDPowerAnalysis' | None = None#

pre_fit: SPCDPreFit | None = None#

property pre_fit_rmse: dict | None#

Synthetic treated-vs-control RMSE by window.

{"estimation": ..., "blank": ..., "overall_pre": ...} – how closely the synthetic treated and synthetic control trajectories agree over the estimation, blank, and overall pre-treatment windows. None if the pre-fit diagnostic was not computed.

property rmse_post: float | None#: Post-treatment RMSE of the synthetic gap, if a post period exists.

property rmse_pre: float | None#: Pre-treatment RMSE of the synthetic gap.

property selected_unit_indices: ndarray#: Integer indices of units selected into treatment.

property selected_unit_labels: ndarray#: Labels of units selected into treatment.

summary: 'BaseEstimatorResults' | None = None#

property treated_weights_by_unit: dict | None#: Treated-side weights as {label: weight} (non-negative, sum to 1).

Helper Modules#

Data preparation helpers for SPCD.

Pivots long panel data into the wide (T, N) matrix consumed by the SPCD iteration. The orientation here is the inverse of the paper’s Y in R^{N x T} (Algorithm 1, page 7), so the iteration matrix from Eq. (2),

M = Y Y^T + alpha I + lambda 1 1^T,

is implemented as Y_pre.T @ Y_pre + alpha I + lambda 1 1^T further downstream in formulation.py.

mlsynth.utils.spcd_helpers.setup.prepare_spcd_inputs(df: DataFrame, outcome: str, unitid: str, time: str, T0: int | None = None, post_col: str | None = None, covariates: List[str] | None = None) → SPCDInputs#

Pivot long panel data and split it into pre/post matrices for SPCD.

Parameters:

df (pd.DataFrame) – Long balanced panel data.
outcome, unitid, time (str) – Column names identifying the outcome, units, and time periods.
T0 (Optional[int]) – Number of pre-treatment periods when post_col is not supplied.
post_col (Optional[str]) – Optional 0/1 or boolean column identifying post-treatment periods.
covariates (Optional[list of str]) – Optional covariate columns to balance on alongside the outcomes. Each unit’s per-covariate pre-period mean is taken and the resulting (N, P) matrix is z-scored across units. Time-invariant covariates collapse to their constant value.

Returns:

SPCDInputs – Wide pre/post matrices and label metadata.

Iteration-matrix construction for SPCD.

Implements Eq. (2) of the paper,

M = Y Y^T + alpha I + lambda 1 1^T,

and supplies spectral-based auto-defaults for the three hyperparameters alpha, lambda, beta when the user has not specified them.

Reference#

Lu, Y., Li, J., Ying, L., & Blanchet, J. (2022). “Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls.” arXiv:2211.15241v1.

mlsynth.utils.spcd_helpers.formulation.build_iteration_matrix(Y_pre: ndarray, alpha: float | None = None, lam: float | None = None, beta: float | None = None, covariates: ndarray | None = None, covariate_weight: float = 1.0) → Tuple[ndarray, ndarray, float, float, float]#

Build the SPCD iteration matrix M and its inverse.

Implements Eq. (2) of the paper:

M = Y_pre.T @ Y_pre + alpha I + lambda 1 1^T

Note that Y_pre here is (T_pre, N) (mlsynth convention), which is the transpose of the paper’s Y in R^{N x T}. The product Y_pre.T @ Y_pre is exactly the paper’s Y Y^T.

When covariates (an (N, P) z-scored matrix) is supplied, a covariate-balance term is folded in:

M = Y_pre.T @ Y_pre + covariate_weight * s * (X X^T)

alpha I + lambda 1 1^T,

where s scales the covariate Gram to match the trace (total “energy”) of Y_pre.T @ Y_pre so covariate_weight = 1 weights covariates equally to outcomes. This keeps the phase-synchronization structure (a quadratic form W^T M W) intact, so the solver and the global-optimality theory are unchanged – the design now balances the two groups on outcomes and covariates jointly.

Parameters:

Y_pre (np.ndarray) – Pre-treatment matrix of shape (T_pre, N).
alpha (float, optional) – Ridge term alpha in Eq. (2), which plays the role of the idiosyncratic noise variance sigma (Assumption 1). If None, it is estimated from the noise scale of Y_pre via the Gavish-Donoho median-singular-value estimator (estimate_noise_variance()), consistent with the paper’s appendix condition alpha = ||Delta|| on the perturbation scale. Note the paper itself treats alpha as a pre-defined hyperparameter and gives no formula; this default is a heuristic. For data generated with a known noise level (e.g. the paper’s Section 4.1 simulations with sigma = 1), pass that value explicitly.
lam (float, optional) – Sum-zero penalty lambda in Eq. (2). If None, defaults to the largest eigenvalue of Y_pre.T @ Y_pre so that Theorem 1’s “large enough lambda” condition is satisfied on the scale of the data.
beta (float, optional) – Iteration step parameter beta used in Eqs. (4), (5), (7), (8). If None, defaults to 1 / lambda_max(M), the smallest eigenvalue of M^{-1}. This is the natural scale that keeps M^{-1} + beta I from being numerically dominated by either term at large N.

Returns:

M (np.ndarray) – The N x N iteration matrix from Eq. (2).
M_inv (np.ndarray) – The inverse of M, computed via np.linalg.solve(M, I).
alpha (float) – Final value used (auto-estimated if input was None).
lam (float) – Final value used (auto-estimated if input was None).
beta (float) – Final value used (auto-estimated if input was None).

mlsynth.utils.spcd_helpers.formulation.estimate_noise_variance(Y_pre: ndarray) → float#

Estimate the idiosyncratic noise variance sigma^2 of Y_pre.

In the paper’s outcome model the ridge term alpha in Eq. (2) plays the role of the noise variance sigma = Var(e_it) (it is the coefficient of the variance term sigma * sum_i w_i^2 in the MSE decomposition), and the appendix convergence theorems set the perturbation ridge on the noise scale alpha = ||Delta|| with alpha <= ||Delta||. So alpha must track the noise scale of the data, not its dominant (signal/level) eigenvalue.

This uses the Gavish-Donoho (2014) median-singular-value estimator, which is parameter-free and robust when the signal is approximately low-rank: with Y_pre of shape (m, n) and beta = min(m, n) / max(m, n),

\[\hat\sigma = \frac{\operatorname{median}(\text{singular values})} {\sqrt{\max(m, n)\, \mu_\beta}},\]

where mu_beta is the Marchenko-Pastur median. Returns hat_sigma^2.

Parameters:: Y_pre (np.ndarray) – Pre-treatment outcome matrix of shape (T_pre, N).
Returns:: float – Estimated noise variance sigma^2 (always positive).

mlsynth.utils.spcd_helpers.formulation.validate_spcd_inputs(Y_pre: ndarray) → None#

Check basic shape and feasibility of the pre-treatment matrix.

Parameters:: Y_pre (np.ndarray) – Pre-treatment outcome matrix of shape (T_pre, N).
Raises:: MlsynthDataError – If Y_pre is not 2D, has fewer than 2 periods, or fewer than 2 units.

Spectral initialization for the SPCD iteration.

Implements the “Spectral Initialization” step of Algorithm 1 (page 7) and Algorithm 2 (page 13) of the SPCD paper:

y^0 = sgn(v),

where v is the smallest eigenvector of the iteration matrix

M = Y Y^T + alpha I + lambda 1 1^T (Eq. 2)

The theoretical justification appears in Appendix 3.2.1 (“The Spectral Initialization”, page 20). Under Assumption 2 (realizability) and the generative process of Section 3.2, the spectral estimator’s sign vector agrees with the global optimum’s sign vector up to a bounded fraction of entries (Lemma 4, page 20).

Reference#

Lu, Y., Li, J., Ying, L., & Blanchet, J. (2022). “Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls.” arXiv:2211.15241v1.

mlsynth.utils.spcd_helpers.spectral_init.spectral_initialization(M: ndarray) → ndarray#

Compute the spectral-initialization sign vector for SPCD.

Implements the “Spectral Initialization” step from Algorithm 1 (page 7) and Algorithm 2 (page 13) of the paper:

y^0 = sgn(v), v = smallest eigenvector of M.

Parameters:: M (np.ndarray) – N x N symmetric positive-definite iteration matrix from Eq. (2).
Returns:: y0 (np.ndarray) – Length-N sign vector with entries in {-1, +1}.

Notes

np.linalg.eigh returns eigenvalues in ascending order, so column zero of the eigenvector matrix is the eigenvector associated with the smallest eigenvalue. Any zero entries in sgn(v) are mapped to +1 so that y^0 is strictly in {-1, +1}^N.

SPCD iteration step (generalized power method).

Implements the SPCD update box from Algorithm 1 (page 7) and the identical update from Algorithm 2 (page 13) of the paper:

Eq. (4) / Eq. (7):
y^{t+1} = sgn[ (M^{-1} + beta I) y^t ],

where M = Y Y^T + alpha I + lambda 1 1^T (Eq. 2) and beta is a pre-defined step parameter.

Theoretical justification: Algorithm 3 of Appendix 3.2 (page 20) gives the abstract meta-version of this iteration on a generic Hermitian perturbed rank-1 matrix. Theorem 3 (page 9, informal) gives global convergence under Assumptions 1-2.

Reference#

Lu, Y., Li, J., Ying, L., & Blanchet, J. (2022). “Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls.” arXiv:2211.15241v1.

mlsynth.utils.spcd_helpers.iteration_spcd.run_spcd_iteration(M_inv: ndarray, y0: ndarray, beta: float, max_iter: int) → Tuple[ndarray, int, bool]#

Run the SPCD while Converged do loop using Eq. (4)/(7).

Implements the SPCD branch of the outer loop in Algorithm 1 (page 7) and Algorithm 2 (page 13). The loop terminates as soon as the sign vector stops changing between successive iterations.

Parameters:

M_inv (np.ndarray) – N x N inverse of the iteration matrix from Eq. (2).
y0 (np.ndarray) – Initial sign vector from Spectral Initialization.
beta (float) – Iteration step parameter from Eq. (4)/(7).
max_iter (int) – Maximum number of iterations to perform before terminating.

Returns:

y_final (np.ndarray) – Final sign vector y* in {-1, +1}^N.
n_iter (int) – Number of iterations actually performed.
converged (bool) – True if the sign vector stabilized before max_iter.

mlsynth.utils.spcd_helpers.iteration_spcd.spcd_step(M_inv: ndarray, y: ndarray, beta: float) → ndarray#

One iteration of the SPCD update box, Eq. (4)/(7).

Computes:

y^{t+1} = sgn[ (M^{-1} + beta I) y^t ]

Parameters:

M_inv (np.ndarray) – N x N inverse of the iteration matrix from Eq. (2).
y (np.ndarray) – Length-N sign vector at iteration t.
beta (float) – Iteration step parameter from Eq. (4)/(7).

Returns:

y_new (np.ndarray) – Length-N sign vector at iteration t + 1.

NormSPCD iteration step (normalized generalized power method).

Implements the NormSPCD update box from Algorithm 1 (page 7) and the identical update from Algorithm 2 (page 13) of the paper:

Eq. (5) / Eq. (8):
y^{t+1} = sgn[ (M^{-1} + beta I) (y^t / d) ],

where:

M = Y Y^T + alpha I + lambda 1 1^T          (Eq. 2)
d = sqrt(diag(M^{-1}))                      (definition below Eq. 5)

and / denotes element-wise division.

Theoretical justification: Algorithm 4 of Appendix 3.2 (page 20) gives the abstract meta-version of this iteration. In the same appendix the paper interprets NormSPCD as a Riemannian gradient descent under a diagonal preconditioning given by the inverse-covariance diagonal (Appendix 3, page 8 and Appendix 3.2). Theorem 3 (page 9) gives linear-rate global convergence under Assumptions 1-2 with epsilon > 0.

Reference#

Lu, Y., Li, J., Ying, L., & Blanchet, J. (2022). “Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls.” arXiv:2211.15241v1.

mlsynth.utils.spcd_helpers.iteration_norm_spcd.norm_spcd_step(M_inv: ndarray, d: ndarray, y: ndarray, beta: float) → ndarray#

One iteration of the NormSPCD update box, Eq. (5)/(8).

Computes:

y^{t+1} = sgn[ (M^{-1} + beta I) (y^t / d) ]

Parameters:

M_inv (np.ndarray) – N x N inverse of the iteration matrix from Eq. (2).
d (np.ndarray) – Length-N normalization vector sqrt(diag(M^{-1})). See the definition immediately below Eq. (5), page 7.
y (np.ndarray) – Length-N sign vector at iteration t.
beta (float) – Iteration step parameter from Eq. (5)/(8).

Returns:

y_new (np.ndarray) – Length-N sign vector at iteration t + 1.

mlsynth.utils.spcd_helpers.iteration_norm_spcd.run_norm_spcd_iteration(M_inv: ndarray, y0: ndarray, beta: float, max_iter: int) → Tuple[ndarray, int, bool]#

Run the NormSPCD while Converged do loop using Eq. (5)/(8).

Implements the NormSPCD branch of the outer loop in Algorithm 1 (page 7) and Algorithm 2 (page 13). The loop terminates as soon as the sign vector stops changing between successive iterations.

Parameters:

M_inv (np.ndarray) – N x N inverse of the iteration matrix from Eq. (2).
y0 (np.ndarray) – Initial sign vector from Spectral Initialization.
beta (float) – Iteration step parameter from Eq. (5)/(8).
max_iter (int) – Maximum number of iterations to perform before terminating.

Returns:

y_final (np.ndarray) – Final sign vector y* in {-1, +1}^N.
n_iter (int) – Number of iterations actually performed.
converged (bool) – True if the sign vector stabilized before max_iter.

Closed-form empirical weights (Algorithm 2 of the paper).

Implements Eq. (9) of the paper (page 13), the closed-form weight construction used by “Algorithm 2: Empirical Implementation of SPCD”:

w = 2 (Y Y^T + alpha I + lambda 1 1^T)^{-1} y*
/ || (Y Y^T + alpha I + lambda 1 1^T)^{-1} y* ||_1

The paper notes that this expression is exactly the optimal weight under the optimality conditions of Eq. (6) when the global solution sign vector y* is known (see the remark “From [10] we know that once the optimal design profile y* is obtained, then w is the optimal design weight” on page 9). In practice, since Algorithm 1’s iteration recovers the correct signs (Theorem 1) but not necessarily the exact optimum of the convex QP (Eq. 6), this closed-form gives a fast near-optimal approximation. The paper states: “In all the experiment in this paper, we use this simplified implementation” (page 9, top).

Reference#

Lu, Y., Li, J., Ying, L., & Blanchet, J. (2022). “Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls.” arXiv:2211.15241v1.

mlsynth.utils.spcd_helpers.weights_empirical.empirical_weights(M_inv: ndarray, y_star: ndarray) → ndarray#

Compute the closed-form SPCD weights from Eq. (9).

Implements:

w = 2 * M^{-1} y* / || M^{-1} y* ||_1

where M is the iteration matrix from Eq. (2) and y* is the converged sign vector from Algorithm 1’s iteration.

Parameters:

M_inv (np.ndarray) – N x N inverse of the iteration matrix from Eq. (2).
y_star (np.ndarray) – Length-N final sign vector y* in {-1, +1}^N from the SPCD or NormSPCD iteration.

Returns:

w (np.ndarray) – Length-N signed weight vector. Per the optimality condition noted on page 13 (“The optimality condition ensures sgn(w) = gamma”), the signs of w should agree with y_star whenever the iteration has converged to a fixed point of the closed-form map.

Exact convex-QP weights (Algorithm 1 of the paper).

Implements Eq. (6) of the paper (page 7), the convex quadratic program used by “Algorithm 1: Synthetic principal component Design” once the sign vector gamma is fixed by the iteration:

Eq. (6):
    min_{w_i >= 0}    (1/T) sum_t (
                          sum_{i: gamma(i)=1}  w_i Y_{it}
                        - sum_{i: gamma(i)=-1} w_i Y_{it}
                      )^2 + sigma sum_i w_i^2

    s.t.   w_i >= 0  for all i in [N],
           sum_{gamma(i)=1}  w_i = 1,
           sum_{gamma(i)=-1} w_i = 1.

This is a convex QP and is solved exactly via cvxpy. sigma plays the role of the ridge coefficient alpha from Eq. (2).

The paper notes that in their experiments they prefer the closed-form approximation in Eq. (9) (see weights_empirical.py), but Eq. (6) is the original optimization problem and is provided here for users who want the exact Algorithm 1 weights.

Reference#

Lu, Y., Li, J., Ying, L., & Blanchet, J. (2022). “Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls.” arXiv:2211.15241v1.

mlsynth.utils.spcd_helpers.weights_exact.exact_weights(Y_pre: ndarray, y_star: ndarray, sigma: float, solver: Any | None = None, verbose: bool = False) → ndarray#

Solve Eq. (6) via cvxpy and return the signed weight vector.

Parameters:

Y_pre (np.ndarray) – Pre-treatment outcome matrix of shape (T_pre, N) (mlsynth convention; the paper’s Y in R^{N x T} corresponds to Y_pre.T).
y_star (np.ndarray) – Length-N sign vector gamma in {-1, +1}^N from the iteration. Used to identify the treated and control groups in the constraints of Eq. (6).
sigma (float) – Ridge coefficient sigma from Eq. (6). Plays the same role as alpha in Eq. (2).
solver (optional) – Optional cvxpy solver name or object passed through to cp.Problem.solve.
verbose (bool) – Whether to print solver progress.

Returns:

w (np.ndarray) – Length-N signed weight vector (2 * indicator(gamma=1) - 1) * w_raw so that downstream consumers see one signed vector per unit. The unsigned positive weights w_raw[i] sum to 1 within each group.

Raises:

MlsynthEstimationError – If cvxpy is not installed, or if the solver fails to find an optimal solution.

Treatment-effect assembly for SPCD (final step of Algorithms 1 & 2).

Implements the synthetic-paths construction and treatment-effect formula at the bottom of Algorithm 1 (page 7) and the bottom of Algorithm 2 (page 13):

Treat Unit i if  gamma(i) = -sgn(sum_j gamma(j))            (minority rule)

tau_hat = sum_t (
              sum_{gamma(i)=sgn(sum_j gamma(j))}  w(i) Y_{i, T+t}
            - sum_{gamma(i)=-sgn(sum_j gamma(j))} w(i) Y_{i, T+t}
          )

The signed-weight convention used elsewhere in this module collapses the above difference-of-sums into a single dot product Y_post @ contrast_weights once the signs of contrast_weights match the minority-flipped sign vector.

Reference#

Lu, Y., Li, J., Ying, L., & Blanchet, J. (2022). “Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls.” arXiv:2211.15241v1.

mlsynth.utils.spcd_helpers.treatment_effect.apply_minority_flip(y_star: ndarray, raw_weights: ndarray) → Tuple[ndarray, ndarray]#

Apply the minority-group convention from the bottom of Algorithm 1.

Per the paper (page 7, bottom of Algorithm 1):

“Treat Unit i if gamma(i) = -sgn(sum_j gamma(j))
(To ensure the size of the treated group is smaller than the control group.)”

If the spectral iteration converges to a sign vector whose positive entries outnumber its negative entries, we flip the sign so that +1 corresponds to the minority (treated) group. The raw weights are flipped in lock-step so that raw_weights continues to be the signed Eq. (9) (or Eq. 6) weights of the chosen treated and control groups.

Parameters:

y_star (np.ndarray) – Length-N sign vector y* in {-1, +1}^N from the iteration.
raw_weights (np.ndarray) – Length-N signed weights from Eq. (9) or Eq. (6).

Returns:

assignment_pm1 (np.ndarray) – Possibly sign-flipped version of y_star, with +1 marking the minority (treated) group.
raw_weights_flipped (np.ndarray) – Possibly sign-flipped raw_weights consistent with the new assignment.

mlsynth.utils.spcd_helpers.treatment_effect.build_synthetic_paths(Y_pre: ndarray, Y_post: ndarray | None, treated_weights: ndarray, control_weights: ndarray) → Tuple[ndarray, ndarray, ndarray]#

Form synthetic treated, synthetic control, and gap trajectories.

Parameters:

Y_pre (np.ndarray) – Pre-treatment matrix of shape (T_pre, N).
Y_post (np.ndarray or None) – Post-treatment matrix of shape (T_post, N), or None.
treated_weights (np.ndarray) – Length-N treated-group weights summing to 1.
control_weights (np.ndarray) – Length-N control-group weights summing to 1.

Returns:

synthetic_treated (np.ndarray) – Synthetic treated trajectory over T_pre + T_post periods.
synthetic_control (np.ndarray) – Synthetic control trajectory over T_pre + T_post periods.
synthetic_gap (np.ndarray) – Difference synthetic_treated - synthetic_control.

mlsynth.utils.spcd_helpers.treatment_effect.build_weight_groups(assignment_pm1: ndarray, raw_weights: ndarray) → Tuple[ndarray, ndarray, ndarray, ndarray]#

Split signed weights into treated/control/contrast components.

The signed weight vector returned by Eq. (9) or Eq. (6) already encodes group membership: positive entries belong to the treated group, negative entries to the control group. This helper turns that into the four arrays used downstream by the plotter and by the treatment-effect formula.

Parameters:

assignment_pm1 (np.ndarray) – Length-N sign vector with +1 for treated units, -1 for control units.
raw_weights (np.ndarray) – Length-N signed weights.

Returns:

selected_mask (np.ndarray) – 0/1 indicator of treated units (parity with the SYNDES design container’s assignment field).
treated_weights (np.ndarray) – Length-N weights restricted to the treated group; zero elsewhere. Normalized to sum to 1 over treated units.
control_weights (np.ndarray) – Length-N weights restricted to the control group; zero elsewhere. Normalized to sum to 1 over control units.
contrast_weights (np.ndarray) – Length-N signed contrast weights treated_weights - control_weights. The synthetic gap is Y_full @ contrast_weights.

mlsynth.utils.spcd_helpers.treatment_effect.compute_att_and_fit(Y_pre: ndarray, Y_post: ndarray | None, treated_weights: ndarray, control_weights: ndarray) → Tuple[float, float, float | None]#

Compute SPCD’s ATT and pre/post fit RMSEs.

Implements the tau_hat formula at the bottom of Algorithm 1 (page 7) of the paper, averaged over the post-treatment horizon:

ATT = (1/S) sum_{t=T+1..T+S} (
        sum_{gamma(i)=+1} w(i) Y_{i,t}
      - sum_{gamma(i)=-1} w(i) Y_{i,t}
      )
    = mean(synthetic_gap[post])

When no post-treatment matrix is provided, the SPCD design step is pure covariate-balancing and ATT is reported as 0.0.

rmse_pre measures the pre-period balance achieved by the design (the residual of Eq. (1)/(2) on the chosen sign vector); smaller is better. rmse_post measures the post-period dispersion of the synthetic gap, useful for sanity-checking convergence behavior.

Parameters:

Y_pre (np.ndarray) – Pre-treatment matrix of shape (T_pre, N).
Y_post (np.ndarray or None) – Post-treatment matrix of shape (T_post, N), or None.
treated_weights (np.ndarray) – Length-N treated-group weights summing to 1.
control_weights (np.ndarray) – Length-N control-group weights summing to 1.

Returns:

att (float) – Mean of the post-period synthetic gap, or 0.0 when Y_post is None.
rmse_pre (float) – Root-mean-square synthetic gap over the pre-treatment period.
rmse_post (float or None) – Root-mean-square synthetic gap over the post-treatment period, or None when Y_post is None.

Pre-treatment estimation/holdout split helpers for SPCD inference.

Implements the train-on-E / calibrate-on-B discipline used by LEXSCM and adapted here for SPCD. The first frac_E of the pretreatment window is used to fit the design; the remaining periods are held out and used to produce out-of-sample residuals for conformal inference and power analysis.

Key guarantee#

With enable_inference=True the design is fit on Y_E only, regardless of whether Y_post is supplied. This means a user who has not yet treated (planning mode) and a user who has treated (retrospective mode) both receive the same recommended design as long as their pretreatment data are identical.

mlsynth.utils.spcd_helpers.holdout.compute_holdout_residuals(Y_B: ndarray, contrast_weights: ndarray) → ndarray#

Compute out-of-sample residuals on the holdout window.

The contrast weights were learned on Y_E only, so

r_B = Y_B @ contrast_weights

is the synthetic gap on data the optimization never saw. Under the linear-factor model with no treatment, r_B is a zero-mean noise series whose empirical distribution is used as the calibration set for conformal inference and the noise pool for Monte Carlo MDE.

Parameters:

Y_B (np.ndarray) – Holdout window of shape (n_B, N).
contrast_weights (np.ndarray) – Length-N signed weights from the SPCD design fit on Y_E.

Returns:

residuals_B (np.ndarray) – Length-n_B out-of-sample synthetic gap.

mlsynth.utils.spcd_helpers.holdout.split_pre_window(Y_pre: ndarray, frac_E: float = 0.7, min_blank_size: int = 5) → Tuple[ndarray, ndarray, int, int, bool]#

Split the pretreatment matrix into estimation (E) and holdout (B).

Parameters:

Y_pre (np.ndarray) – Pretreatment outcome matrix of shape (T_pre, N).
frac_E (float) – Fraction of pretreatment periods to use for estimation. The remaining 1 - frac_E periods form the holdout window. Values outside [0.1, 0.95] are rejected.
min_blank_size (int) – Minimum number of holdout periods required for inference to be meaningful. If T_B < min_blank_size the function still returns a valid split (so that the design can be fit on Y_E) but flags can_infer=False so the caller can skip the inference / MDE machinery with a warning.

Returns:

Y_E (np.ndarray) – Estimation window of shape (n_E, N). n_E = floor(T_pre * frac_E).
Y_B (np.ndarray) – Holdout window of shape (n_B, N). n_B = T_pre - n_E.
n_E (int) – Number of periods in the estimation window.
n_B (int) – Number of periods in the holdout window.
can_infer (bool) – True if n_B >= min_blank_size, False otherwise.

Raises:

MlsynthConfigError – If frac_E is out of range.
MlsynthDataError – If the resulting estimation window has fewer than 2 periods, in which case SPCD cannot be fit at all.

Moving-block conformal inference for the SPCD post-period effect.

Adapts the moving-block conformal procedure from mlsynth.utils.fast_scm_helpers.inference.compute_moving_block_conformal_ci to the SPCD setting, where the test statistic is the post-period synthetic-gap mean and the calibration set is the out-of-sample residual vector r_B from the holdout window.

The procedure is exchangeability-based: confidence coverage holds in finite samples under the assumption that overlapping blocks of r_B have the same distribution as overlapping blocks of the post-period gap under the null. This is a stronger assumption than IID noise and weaker than perfect H0; in practice it is the standard for synthetic control conformal inference.

class mlsynth.utils.spcd_helpers.inference.SPCDConformalResult(att: float, ci_lower: float, ci_upper: float, p_value: float, pointwise_lower: ndarray, pointwise_upper: ndarray, block_size: int, alpha: float, method: str = 'moving_block_conformal')#

Container for the moving-block conformal output.

att#

Observed ATT (mean of the post-period synthetic gap).

Type:: float

ci_lower, ci_upper

Conformal CI for the ATT at level 1 - alpha.

Type:: float

p_value#

Conformal p-value vs. H_0: tau = 0.

Type:: float

pointwise_lower, pointwise_upper

Period-by-period conformal bands of length equal to the post horizon.

Type:: np.ndarray

block_size#

Moving-block size used.

Type:: int

alpha#

Two-sided significance level.

Type:: float

method#

Identifier string. Always "moving_block_conformal".

Type:: str

alpha: float#

att: float#

block_size: int#

ci_lower: float#

ci_upper: float#

method: str = 'moving_block_conformal'#

p_value: float#

pointwise_lower: ndarray#

pointwise_upper: ndarray#

mlsynth.utils.spcd_helpers.inference.compute_conformal_ci(residuals_B: ndarray, post_gap: ndarray, alpha: float = 0.05, block_size: int | None = None, grid_size: int = 200) → SPCDConformalResult#

Compute a moving-block conformal CI for the SPCD post-period ATT.

Parameters:

residuals_B (np.ndarray) – Length-n_B out-of-sample residuals from the holdout window (Y_B @ contrast_weights).
post_gap (np.ndarray) – Length-S post-period synthetic gap (Y_post @ contrast_weights).
alpha (float) – Two-sided significance level. Coverage is 1 - alpha.
block_size (int, optional) – Moving-block size. Defaults to max(3, floor(sqrt(S))).
grid_size (int) – Number of grid points to search for the CI inversion.

Returns:

SPCDConformalResult

Pre-experiment power analysis for SPCD.

Adapts the Monte Carlo MDE machinery from mlsynth.utils.fast_scm_helpers.power_helpers to operate directly on the SPCD holdout residuals r_B. The MDE is the smallest treatment effect tau such that the mean-absolute-effect statistic rejects the null with probability at least power_target at significance alpha.

Because SPCD’s post-period estimator is a fixed linear functional tau_hat_t = sum_i contrast_weights[i] * Y[i, t], the noise distribution of tau_hat_t under H0 is fully characterized by the distribution of r_B – no further model-fitting is needed at the power stage.

class mlsynth.utils.spcd_helpers.power.SPCDPowerAnalysis(mde_tau: float, mde_pct: float, baseline: float, critical_stat: float, feasible: bool, n_post: int, alpha: float, power_target: float, detectability: Dict[int, float] | None = None, effect_grid_pct: ndarray | None = None, power_curve: ndarray | None = None)#

Container for MDE and detectability outputs.

mde_tau#

Smallest absolute effect detectable at power_target. np.nan when no point on the grid reaches the target.

Type:: float

mde_pct#

mde_tau expressed as a percentage of the holdout baseline.

Type:: float

baseline#

Baseline level used for the percentage scaling (mean of the synthetic-treated trajectory over the holdout window).

Type:: float

critical_stat#

(1 - alpha)-quantile of the null distribution of the test statistic at horizon n_post.

Type:: float

feasible#

True if a non-nan MDE was identified.

Type:: bool

n_post#

Post-treatment horizon for which the MDE was computed.

Type:: int

alpha#

Significance level used.

Type:: float

power_target#

Power target used.

Type:: float

detectability#

Optional horizon -> MDE mapping if a horizon grid was supplied; None otherwise.

Type:: dict[int, float] or None

effect_grid_pct#

Grid of candidate effect sizes (as a percentage of baseline) at which the power curve was evaluated, at horizon n_post.

Type:: np.ndarray or None

power_curve#

Empirical power at each entry of effect_grid_pct (same length). Together they give the full power-vs-effect curve; the MDE is the smallest effect whose power reaches power_target.

Type:: np.ndarray or None

alpha: float#

baseline: float#

critical_stat: float#

detectability: Dict[int, float] | None = None#

effect_grid_pct: ndarray | None = None#

feasible: bool#

mde_pct: float#

mde_tau: float#

n_post: int#

power_curve: ndarray | None = None#

power_target: float#

mlsynth.utils.spcd_helpers.power.compute_detectability_curve(residuals_B: ndarray, baseline: float, horizon_grid: List[int], alpha: float = 0.05, power_target: float = 0.8, n_sims: int = 5000, n_trials: int = 400, seed: int = 1400) → Dict[int, float]#

Compute MDE as a function of post-treatment horizon length.

Useful for answering “how long does the experiment need to run to detect a target effect?” before committing to a treatment window.

Parameters:

residuals_B (np.ndarray) – Out-of-sample holdout residuals.
baseline (float) – Baseline level for percentage scaling.
horizon_grid (list of int) – Horizons (number of post periods) at which to evaluate the MDE.
alpha, power_target, n_sims, n_trials, seed (see compute_mde().)

Returns:

dict – Mapping horizon -> MDE in percent. Infeasible entries are recorded as nan.

mlsynth.utils.spcd_helpers.power.compute_mde(residuals_B: ndarray, baseline: float, n_post: int, alpha: float = 0.05, power_target: float = 0.8, tau_grid: ndarray | None = None, n_sims: int = 5000, n_trials: int = 400, seed: int = 1400, store_curve: bool = True) → SPCDPowerAnalysis#

Compute the minimum detectable effect via Monte Carlo.

Procedure follows power_helpers._analytical_mde in fast_scm_helpers:

Build the null distribution of the test statistic at horizon n_post by resampling residuals_B (padded with Gaussian draws to handle horizon overflow).
Compute the critical value c_alpha = quantile(null_stats, 1 - alpha).
For each candidate tau on the grid, draw n_trials post-period vectors of the form tau + Gaussian noise and record the fraction of trials whose statistic exceeds c_alpha. The smallest tau with empirical power at or above power_target is the MDE.

Parameters:

residuals_B (np.ndarray) – Out-of-sample residuals on the holdout window.
baseline (float) – Baseline level used to express the MDE as a percentage. Typically the mean of the synthetic-treated trajectory over the holdout window.
n_post (int) – Post-treatment horizon for the MDE.
alpha, power_target (float) – Significance and power.
tau_grid (np.ndarray, optional) – Grid of candidate effect sizes (in absolute units). Defaults to linspace(0, 5 * std(residuals_B), 60).
n_sims, n_trials, seed (int) – Monte Carlo control.

Returns:

SPCDPowerAnalysis

mlsynth.utils.spcd_helpers.power.compute_pooled_average_mde(residuals_by_arm: Dict[object, ndarray], baselines_by_arm: Dict[object, float], sizes_by_arm: Dict[object, int], n_post: int, weights: str = 'size', alpha: float = 0.05, power_target: float = 0.8, n_sims: int = 5000, n_trials: int = 400, seed: int = 1400, horizon_grid: List[int] | None = None) → SPCDPowerAnalysis#

MDE for the (size- or equal-weighted) average effect across arms.

Forms the pooled contrast series g_t = sum_a w_a * r_B^(a)_t on time-aligned holdout residuals, then runs the ordinary single-series compute_mde() on g. Because the arms share calendar time, summing the aligned series makes the cross-arm correlation enter through Var(g) = w' Sigma w automatically – so one must pool the aligned series, never resample arms independently, or positive correlation is dropped and the MDE comes out over-optimistic.

This answers “what average effect across arms can we detect?” and is the most powerful pooled question (averaging cancels idiosyncratic noise). Its estimand is the weighted-average effect, so heterogeneous, opposite-signed arm effects can cancel and be missed; use the per-arm analyses when individual-arm detection matters.

Parameters:

residuals_by_arm (dict) – {arm_label: r_B} out-of-sample holdout residual series. The series are truncated to the common (minimum) length and aligned from the start, which is correct when arms share the calendar time index.
baselines_by_arm (dict) – {arm_label: baseline} per-arm holdout baselines; pooled by the same weights for the percentage scaling.
sizes_by_arm (dict) – {arm_label: n_units} used for weights="size".
n_post (int) – Post-treatment horizon for the MDE.
weights ({“size”, “equal”}) – "size" weights each arm by its unit count (the population-average effect); "equal" weights arms equally.
horizon_grid (list of int, optional) – If given, also computes the pooled-average MDE at each post-treatment horizon and attaches it as detectability ({horizon -> MDE percent}). This answers “how long must the study run to detect a target average effect?”.
alpha, power_target, n_sims, n_trials, seed (see compute_mde().)

Returns:

SPCDPowerAnalysis – The MDE of the pooled average-effect contrast (with a detectability curve when horizon_grid is supplied).

mlsynth.utils.spcd_helpers.power.default_detectability_grid(n_post: int, cap: int = 12) → List[int]#: Default per-horizon detectability grid: 1 .. min(cap, n_post).

Assemble the standardized BaseEstimatorResults summary for SPCD.

Packages an SPCDDesign together with its preprocessed SPCDInputs and optional inference / power outputs into the project’s standardized result pydantic models defined in mlsynth.config_models:

EffectsResults          : ATT (mean post-period synthetic gap)
FitDiagnosticsResults   : pre/post RMSE of the synthetic gap +
                          MDE / detectability in additional_metrics
TimeSeriesResults       : synthetic treated/control/gap trajectories
WeightsResults          : per-unit signed contrast weights
InferenceResults        : conformal p-value, CI, pointwise bands
MethodDetailsResults    : variant, weights mode, alpha/lam/beta, iters

The result is wrapped in a BaseEstimatorResults so SPCD’s public results.summary matches the shape used by the rest of the mlsynth estimator suite.

mlsynth.utils.spcd_helpers.results_assembly.build_summary(design: SPCDDesign, inputs: SPCDInputs, conformal: SPCDConformalResult | None = None, power: SPCDPowerAnalysis | None = None) → BaseEstimatorResults#

Build the standardized result bundle for an SPCD design.

Parameters:

design (SPCDDesign) – Output of solve_spcd_with_holdout() (or solve_spcd()).
inputs (SPCDInputs) – Preprocessed panel data used in the fit.
conformal (SPCDConformalResult, optional) – Moving-block conformal CI for the post-period ATT. None when no post period is supplied or inference was disabled.
power (SPCDPowerAnalysis, optional) – MDE / detectability output. None when inference was disabled or the holdout window was too small.

Returns:

BaseEstimatorResults – Bundle with effects, fit_diagnostics, time_series, weights, inference, and method_details populated. Fields are honest about absence: ATT is None (not 0.0) when no post period is supplied.

Top-level SPCD solver — glues the four algorithm steps together.

The flow follows Algorithm 1 (page 7) and Algorithm 2 (page 13) of the paper, both of which share the same skeleton:

Formulation (Eq. 2) build_iteration_matrix()

Spectral Initialization spectral_initialization()

while not Converged do run_spcd_iteration()
Pick ONE update box: or - SPCD (Eq. 4 / 7) run_norm_spcd_iteration() - NormSPCD (Eq. 5 / 8)

Final weight step empirical_weights() OR exact_weights()
Algorithm 1 uses Eq. (6). Algorithm 2 uses Eq. (9).

Minority-group flip + tau-hat apply_minority_flip(),
build_synthetic_paths()

The two iteration-box choices and the two weight-step choices are exposed as independent options here. The combination variant="norm_spcd" + weights="empirical" matches the algorithm used in the paper’s Section 4 experiments (“In all the experiment in this paper, we use this simplified implementation”, page 9).

Reference#

Lu, Y., Li, J., Ying, L., & Blanchet, J. (2022). “Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls.” arXiv:2211.15241v1.

mlsynth.utils.spcd_helpers.orchestration.select_alpha_by_holdout(Y_pre: ndarray, *, variant: str = 'norm_spcd', weights: str = 'empirical', lam: float | None = None, beta: float | None = None, max_iter: int = 200, solver: Any | None = None, val_frac: float = 0.3, covariates: ndarray | None = None, covariate_weight: float = 1.0) → float#

Choose the ridge alpha by out-of-sample pre-period balance.

The post-period RMSE of an SPCD design is a non-monotone, jumpy function of alpha when N > T_pre (the assignment is a discrete sign vector that flips as alpha moves), so no single closed-form estimate of the noise variance is robust. Instead, this fits the design on the first 1 - val_frac of the pre-period and scores each candidate alpha by the RMS of the resulting contrast on the held-out pre-period tail, returning the best. Candidates are a multiplicative grid around the Gavish-Donoho noise-variance estimate (estimate_noise_variance()), which sets the correct scale; the holdout picks the robust point on the jumpy curve.

Falls back to the bare noise-variance estimate when the pre-period is too short to spare a validation tail.

Parameters:

Y_pre (np.ndarray) – Pre-treatment matrix of shape (T_pre, N).
variant, weights, lam, beta, max_iter, solver – Passed through to the design pipeline.
val_frac (float) – Fraction of the pre-period held out (from the end) for scoring.

Returns:

float – Selected alpha.

mlsynth.utils.spcd_helpers.orchestration.solve_spcd(inputs: SPCDInputs, *, variant: str = 'norm_spcd', weights: str = 'empirical', alpha: float | None = None, lam: float | None = None, beta: float | None = None, max_iter: int = 200, solver: Any | None = None, verbose: bool = False, covariate_weight: float = 1.0) → SPCDDesign#

Run SPCD end-to-end and return the resulting design.

Parameters:

inputs (SPCDInputs) – Pre-treatment (and optional post-treatment) outcome matrices and index metadata.
variant ({“spcd”, “norm_spcd”}) – Iteration-box choice. "spcd" selects Eq. (4)/(7), "norm_spcd" selects Eq. (5)/(8).
weights ({“empirical”, “exact”}) – Final-weight-step choice. "empirical" uses Eq. (9) (the closed-form approximation used in the paper’s experiments), "exact" solves Eq. (6) via cvxpy.
alpha, lam, beta (float, optional) – Hyperparameters for Eq. (2) and Eqs. (4)/(5)/(7)/(8). The paper treats all three as pre-defined. When alpha is None it is chosen by select_alpha_by_holdout() (out-of-sample pre-period balance over a noise-scale grid); lam and beta default from the spectrum via formulation.build_iteration_matrix. Pass an explicit alpha (e.g. a known noise variance) to bypass selection.
max_iter (int) – Maximum iterations for the SPCD/NormSPCD while loop.
solver (optional) – Solver passed to cvxpy when weights="exact". Ignored otherwise.
verbose (bool) – Whether to print solver progress for weights="exact".

Returns:

SPCDDesign – Container with the final sign vector, weight components, synthetic paths, and diagnostics.

mlsynth.utils.spcd_helpers.orchestration.solve_spcd_with_holdout(inputs: SPCDInputs, *, variant: str = 'norm_spcd', weights: str = 'empirical', alpha: float | None = None, lam: float | None = None, beta: float | None = None, max_iter: int = 200, solver: Any | None = None, verbose: bool = False, enable_inference: bool = True, holdout_frac_E: float = 0.7, inference_alpha: float = 0.05, power_target: float = 0.8, mde_n_sims: int = 5000, mde_n_trials: int = 400, mde_horizon_grid: list | None = None, inference_seed: int = 1400, min_blank_size: int = 5, covariate_weight: float = 1.0)#

Run SPCD with the train-on-E / calibrate-on-B inference flow.

When enable_inference=False this falls back to solve_spcd() on the full pretreatment matrix and returns (design, None, None) for backwards compatibility with the pre-inference SPCD release.

When enable_inference=True:

Split inputs.Y_pre into Y_E (first frac_E) and Y_B (remainder).
Fit the SPCD design on Y_E only.
Re-render the synthetic paths over the full timeline (Y_E + Y_B + Y_post) so plots show the entire window.
Compute holdout residuals r_B = Y_B @ contrast_weights.
Run the Monte Carlo MDE analysis on r_B (always, since this is pre-experiment planning).
If Y_post is present, also compute the moving-block conformal CI for the ATT.

Returns:

design (SPCDDesign) – The SPCD design, with synthetic paths spanning the full timeline. Fit on Y_E only when enable_inference=True.
conformal (SPCDConformalResult or None) – Conformal CI for the post-period ATT. None if Y_post is None or enable_inference=False or the holdout window is too small.
power (SPCDPowerAnalysis or None) – MDE / detectability output. None if enable_inference=False or the holdout window is too small.

Plotting helpers for SPCD results.

mlsynth.utils.spcd_helpers.plotter.plot_detectability(results: SPCDResults | SPCDMultiArmResults, ax: Axes | None = None)#

Plot the MDE as a function of post-treatment horizon (“MDE at time point t”) for each arm and the pooled whole-study contrast – the “how long must the study run?” view.

Returns:: matplotlib.figure.Figure

mlsynth.utils.spcd_helpers.plotter.plot_mde_bars(results: SPCDResults | SPCDMultiArmResults, ax: Axes | None = None)#

Bar chart of the percent MDE per arm, with the pooled whole-study MDE alongside (when present).

Parameters:

results (SPCDResults or SPCDMultiArmResults) – A fitted SPCD result. Power analysis must have run (enable_inference=True).
ax (matplotlib Axes, optional) – Axes to draw on; a new figure is created if omitted.

Returns:

matplotlib.figure.Figure

mlsynth.utils.spcd_helpers.plotter.plot_power_curves(results: SPCDResults | SPCDMultiArmResults, ax: Axes | None = None)#

Plot the stored power-vs-effect curves for each arm and the pooled whole-study contrast. The MDE is where each curve crosses the power_target line.

Returns:: matplotlib.figure.Figure

mlsynth.utils.spcd_helpers.plotter.plot_spcd_design(results: SPCDResults) → None#

Plot synthetic treated and synthetic control trajectories for SPCD.

Mirrors the layout of plot_relaxed_design in syndes_helpers.plotter so SPCD and SYNDES plots feel consistent.

Parameters:: results (SPCDResults) – Output of mlsynth.estimators.SPCD. Must have inputs attached so that pre/post matrices can be stacked.
Raises:: MlsynthPlottingError – If the result has no attached inputs.

Synthetic Principal Component Design (SPCD)

Contents

Synthetic Principal Component Design (SPCD)#

When to Use This Estimator#

Notation#

How SPCD Works (Plain-Language Walkthrough)#

Mathematical Formulation#

Reformulation as \(\ell_1\)-PCA#

The Iteration Matrix#

Balancing on Covariates#

Spectral Initialization#

The (Normalized) Generalized Power Method#

Final Weight Step#

Minority Convention#

Treatment Effect and Pre-Period Fit#

Algorithm 3 and Algorithm 4 of the Paper#

Assumptions and Theory#

Inference and Power Analysis#

Estimation / Holdout Split#

Backwards-Compatibility Guarantee#

Holdout Residuals#

Moving-Block Conformal CI#

Monte Carlo MDE#

Power Analysis Always Runs#

Opting Out#

Example: Reproducing the Paper’s RMSE Advantage#

API Variants and Multi-Arm Designs#

Multi-Arm Power: the Pooled Average Effect#

Power Curves and the MDE at Each Time Point#

Pre-Period Agreement (Design Diagnostic)#

Verification#

Core API#

Configuration#

Result Containers#

Helper Modules#

Reference#

Reference#

Reference#

Reference#

Reference#

Reference#

Reference#

Key guarantee#

Reference#