Synthetic Principal Component Design (SPCD)#
When to Use This Estimator#
Most synthetic-control work takes the treated unit as given and asks only how to weight the donors. SPCD, due to Lu, Li, Ying and Blanchet (2022) [SPCD] arXiv:2211.15241, answers the prior, design-time question: you are about to run an experiment, you have a panel of pre-treatment outcomes, and you must decide which units to treat (and how to weight the rest into a synthetic comparison) so that the treatment-effect estimate is as precise as possible. It is a sibling of Synthetic Design (SYNDES) — both choose the treated set rather than assume it — but where SYNDES solves a mixed-integer program, SPCD reformulates the design as a phase-synchronization problem and solves it with a spectrally-initialized power method that converges globally under standard linear-factor models, in seconds rather than minutes.
Reach for SPCD when treatment can only be applied to coarse, expensive units, randomizing at random leaves accuracy on the table, and you want a fast, provably-good assignment:
Marketing / geo experiments. You can switch a campaign on in some media markets (DMAs), regions, or store clusters but not at the individual-customer level. SPCD reads each market’s pre-period sales or traffic history and picks the treated markets — and the synthetic comparison — so the two groups tracked each other before launch. Any post-launch divergence is then a low-variance read on the campaign.
Retail / product rollouts. Choosing which stores or SKUs get a new layout, price, or feature first. Treating a handful of stores is costly, so you want the few you pick to be maximally informative about the whole chain.
Platform / policy pilots. Selecting cities or submarkets for a pricing pilot, a new service tier, or a regulatory change, where interference rules out customer-level randomization and the number of treatable units is small.
The flip side: if assignment is already fixed (you know who was treated and only need a counterfactual), this is the wrong tool — use a standard retrospective estimator such as Forward-Selected Synthetic Control (FSCM), Cluster Synthetic Controls (CLUSTERSC), or Two-Step Synthetic Control. SPCD’s value is entirely in the design stage, before the experiment runs.
Notation#
We observe an outcome \(Y_{it}\) for units \(i \in \{1, \dots, N\}\) over pre-treatment periods \(t \in \{1, \dots, T\}\), stacked into the pre-treatment matrix \(Y \in \mathbb{R}^{N \times T}\) (units in rows, periods in columns). At \(t = T\) the experimenter assigns each unit a sign \(D_i \in \{-1, +1\}\): \(D_i = +1\) puts unit \(i\) in the treated group, \(D_i = -1\) in the control group. Each unit also carries a non-negative weight \(w_i \ge 0\), normalized to sum to one on each side. After assignment, outcomes are observed for \(S\) further periods \(t = T+1, \dots, T+S\), with potential outcomes \(Y_{it}(-1) = \mu_{it} + e_{it}\) and \(Y_{it}(+1) = Y_{it}(-1) + \tau\), where \(\mu_{it}\) is the base (signal) outcome, \(e_{it}\) is mean-zero idiosyncratic noise with variance \(\sigma\), and \(\tau\) is the treatment effect to be estimated. The estimand is the weighted average treatment effect on the treated (wATET).
Throughout, \(\mathbf{1}\) is the all-ones vector, \(I\) the identity, \(\operatorname{sgn}(\cdot)\) the elementwise sign, \(\|\cdot\|_1\) / \(\|\cdot\|_2\) the L1 / L2 norms, and \((\cdot)^{-1}\) the matrix inverse.
Note
Notation bridge. The single-treated-unit synthetic-control canon
(treated \(j=0\), donors \(1, \dots, N\), treatment dummy
\(d_{jt} \in \{0,1\}\)) does not fit a design problem in which the
assignment is itself the decision variable. Following the paper, the
assignment is the signed vector \(D \in \{-1,+1\}^N\) and the two
groups are symmetric (neither is privileged as “the donors”). The
implementation stores the pre-treatment matrix transposed, as
\(Y_{\text{pre}} \in \mathbb{R}^{T \times N}\) (time in rows, units in
columns), so the iteration matrix below is built in code as
Y_pre.T @ Y_pre + alpha I + lambda 1 1.T.
How SPCD Works (Plain-Language Walkthrough)#
Before the equations, here is the whole idea in five steps.
The goal is a fair pre-period match. Split the units into a treated group and a control group, and weight each unit, so that before the experiment the weighted treated history and the weighted control history are nearly identical curves. If the two groups move together in the pre-period, then after treatment the gap between them is the treatment effect — and a well-balanced design makes that gap a low-variance estimate.
Trying every split is hopeless. With \(N\) units there are exponentially many ways to form two groups; choosing the best one is NP-hard. SPCD sidesteps the brute-force search with a change of variable.
One signed number per unit. Pack each unit’s group label and weight into a single signed number \(W_i = w_i D_i\): its sign says which group the unit is in, its magnitude is the weight. “Make the two weighted groups match” then becomes “find a signed vector \(W\) that is as small as possible against the data,” i.e. minimize \(W^\top (Y Y^\top) W\) subject to the groups balancing out. This is the \(\ell_1\)-PCA / phase-synchronization problem — a well-studied shape with fast, globally-convergent solvers.
Solve it with a power method, warm-started by an eigenvector. Start from a smart first guess — the sign pattern of the smallest eigenvector of the data matrix (the spectral initialization). Then repeatedly refine the group labels: multiply the current sign vector by \(M^{-1}\) and take signs again (the generalized power method). Keep going until the labels stop flipping. The normalized variant (the package default) rescales by the diagonal of \(M^{-1}\) first and is guaranteed to converge to the global optimum at a linear rate.
Read off the design, then estimate. Once the labels are fixed, the weights follow in closed form (Eq. (9)) or from a tiny convex program (Eq. (6)). Treat the smaller of the two groups (the minority rule). Run the experiment. When the post-period data arrives, the treatment effect is simply the gap between the two weighted groups, averaged over the post-period.
A useful analogy: it is like drafting two evenly-matched teams from a pool of players based on their past stats — except SPCD also decides how much each player counts, and it finds the balanced split with a fast eigenvector computation instead of trying every possible roster.
Mathematical Formulation#
SPCD jointly chooses the signs \(\{D_i\}_{i=1}^N\) and non-negative weights \(\{w_i\}_{i=1}^N\) so that the weighted pre-treatment trajectories of the two groups track each other as closely as possible.
Following Doudchenko et al. (2021) [SYNDES] and Abadie & Zhao (2021) [ABADIE2024], the expected MSE of the difference-in-weighted-means treatment-effect estimator decomposes as
The first term is a bias from imperfect pre-treatment balance; the second a variance that grows with weight concentration. Minimizing the first term over the pre-treatment window gives the mixed-integer program SPCD solves (Eq. (1) of the paper):
Reformulation as \(\ell_1\)-PCA#
The key observation of the paper is that the change of variable \(W_i = w_i D_i\) collapses the assignment and weight variables into a single signed vector \(W \in \mathbb{R}^N\). Under \(W\), \(D_i = \operatorname{sgn}(W_i)\) and \(w_i = |W_i|\), the adding-up constraints become \(\mathbf{1}^\top W = 0\), and the design problem becomes
The hard equality \(\mathbf{1}^\top W = 0\) is absorbed into the objective via a quadratic penalty:
Theorem 1 of the paper shows that for \(\lambda\) large enough, the sign vector \(\operatorname{sgn}(W^*)\) of any global minimum coincides with the sign vector of an associated phase-synchronization problem. Once the signs are recovered, the magnitudes follow from a small convex QP (or, in practice, the closed form in Eq. (9)).
The Iteration Matrix#
All four code-paths of SPCD operate on the same \(N \times N\) iteration matrix from Eq. (2) of the paper:
where \(\alpha\) plays the role of \(\sigma\) (noise variance) and \(\lambda\) is the constraint-absorbing penalty. The paper treats both as pre-defined hyperparameters and gives no formula for them.
Note
Choosing \(\alpha\). The paper’s appendix sets the perturbation
ridge on the noise scale (\(\alpha = \lVert\Delta\rVert\) with
\(\alpha \le \lVert\Delta\rVert\)), so \(\alpha\) should track the
idiosyncratic noise variance, not the dominant (signal/level) eigenvalue
of \(Y Y^\top\). When \(N > T_{\text{pre}}\) the matrix
\(Y Y^\top\) is rank-deficient and the post-period RMSE is a
non-monotone, jumpy function of \(\alpha\) (small changes flip the
discrete sign vector), so no single closed-form estimate is robust. When
the user does not supply \(\alpha\), formulation.py first fixes its
scale with the Gavish-Donoho median-singular-value noise estimate
(estimate_noise_variance()),
and orchestration.select_alpha_by_holdout then picks the value on a
noise-scale grid that balances a held-out tail of the pre-period best
out of sample. \(\lambda\) and \(\beta\) still default from the
spectrum of \(Y Y^\top\). If you know the noise level (e.g. the
simulations below fix \(\sigma = 1\)), pass alpha_ridge explicitly
to skip selection.
Balancing on Covariates#
By default SPCD balances the two groups on the pre-treatment outcomes
only. Passing covariates (a list of column names) additionally balances on
auxiliary unit characteristics. Each unit’s per-covariate pre-period mean
is taken (time-invariant covariates – e.g. last year’s market share – collapse
to their value), the resulting \(N \times P\) matrix \(X\) is z-scored
across units, and a covariate-balance term is folded into the iteration matrix:
where \(s\) rescales \(X X^\top\) to match the trace (“energy”) of
\(Y Y^\top\), so covariate_weight = 1 weights covariates equally to
the outcomes, > 1 upweights them, and 0 (or omitting covariates)
recovers the outcome-only design. Crucially this keeps the iteration matrix a
quadratic form \(W^\top M W\) with the same phase-synchronization
structure, so the spectral solver and the global-optimality theory are
unchanged – the design now drives the weighted treated and control groups to
agree on outcomes and covariates jointly. (This is the same way classical
synthetic control balances on predictors; the relative weight covariate_weight
is the analog of SCM’s \(V\).)
Note
Budget / cardinality constraints are not supported by SPCD: they are linear constraints on the discrete assignment and break the phase-synchronization reduction (which is precisely what removing the cardinality constraint buys, per the paper’s Remark 1). Use Synthetic Design (SYNDES), whose MIP takes such constraints natively, when you need a budget.
res = SPCD({
"df": df, "outcome": "sales", "unitid": "DMA", "time": "Week",
"covariates": ["mkt_share"], # balance outcomes + last-year market share
"covariate_weight": 1.0, # equal weight to outcomes (default)
}).fit()
# the weighted treated/control groups now match on mkt_share as well
Spectral Initialization#
Both Algorithm 1 and Algorithm 2 of the paper start from the same warm start: the sign of the smallest eigenvector of \(M\),
Appendix 3.2.1 of the paper (Lemma 4) shows that under the linear latent-factor model \(Y_{it} = \delta_t + \theta_t^\top \mu_i + e_{it}\) (Assumption 1) together with the realizability assumption (Assumption 2), the sign vector \(\operatorname{sgn}(v)\) already agrees with the global optimum \(\operatorname{sgn}(W^*)\) up to a bounded fraction of entries.
The (Normalized) Generalized Power Method#
To refine the spectral initialization, SPCD performs sign-iterations on
\(M^{-1}\). The two variant choices correspond to the two update boxes
of Algorithm 1 / Algorithm 2:
variant="spcd"(Eq. (4)/(7)) — Generalized Power Iteration:\[y^{t+1} \;=\; \operatorname{sgn}\!\left[\, \left( M^{-1} + \beta I \right) \, y^{t} \,\right].\]variant="norm_spcd"(Eq. (5)/(8)) — Normalized Generalized Power Iteration:\[y^{t+1} \;=\; \operatorname{sgn}\!\left[\, \left( M^{-1} + \beta I \right) \, \big( y^{t} \,/\, d \big) \,\right], \quad d \;=\; \sqrt{\operatorname{diag}(M^{-1})}.\]
Here \(\beta > 0\) is the step parameter (auto-defaulted to \(1/\lambda_{\max}(M)\)), and \(/\) denotes element-wise division. The loop terminates as soon as the sign vector stops changing between successive iterations.
Under Assumptions 1 and 2 with small enough idiosyncratic noise, Theorem 3 of the paper guarantees:
variant="spcd"converges globally if \(\epsilon \,>\, (\sqrt{3}/2) - 1\);variant="norm_spcd"converges linearly and globally for any \(\epsilon > 0\).
This is why the normalized variant is the package default.
Final Weight Step#
Once the iteration converges to a sign vector \(y^* \in \{-1, +1\}^N\), the unit weights are produced by one of two procedures:
weights="empirical"(Eq. (9), Algorithm 2 — the paper’s experimental default):\[w \;=\; \frac{2 \, M^{-1} y^*}{\left\| M^{-1} y^* \right\|_1}.\]The optimality condition of the original QP (Eq. (6)) implies that \(\operatorname{sgn}(w) = y^*\) whenever the iteration has converged to a fixed point of the closed-form map, so the signed vector \(w\) simultaneously encodes group membership and weights.
weights="exact"(Eq. (6), Algorithm 1) — solves the convex QP\[\begin{split}\begin{aligned} \min_{w \geq 0} \quad & \frac{1}{T} \sum_{t=1}^T \left( \sum_{i:\, y^*_i = +1} w_i Y_{it} - \sum_{i:\, y^*_i = -1} w_i Y_{it} \right)^2 + \alpha \sum_{i=1}^N w_i^2 \\ \text{s.t.} \quad & \sum_{i:\, y^*_i = +1} w_i \;=\; \sum_{i:\, y^*_i = -1} w_i \;=\; 1. \end{aligned}\end{split}\]via
cvxpy. Use this when you need the exact Algorithm-1 weights rather than the closed-form approximation.
Minority Convention#
Per the bottom of Algorithm 1 (page 7 of the paper), SPCD then applies the minority-group rule
which flips the sign vector (if necessary) so that the smaller of the two
groups is the treated group. The treated unit labels reported by
SPCDResults.selected_unit_labels() follow this convention.
Treatment Effect and Pre-Period Fit#
For post-treatment periods \(t = T+1, \dots, T+S\), the SPCD treatment-effect estimator at the bottom of Algorithm 1 is
which is precisely the mean of the post-period synthetic gap reported as
results.att. When no post period is supplied (design-only mode), the
estimator reports att = 0.0.
The pre-period RMSE between the synthetic treated and synthetic control trajectories,
measures the residual of Eq. (1) on the chosen sign vector and is reported as
results.rmse_pre.
Algorithm 3 and Algorithm 4 of the Paper#
The paper’s Appendix 3.2 also defines two further numbered algorithms —
Algorithm 3 (Generalized Power Method on an abstract Hermitian matrix) and
Algorithm 4 (its normalized counterpart). These are not implemented as
separate code paths in mlsynth: they are the abstract meta-versions of
Algorithms 1 and 2 used to prove the global convergence theorem (Theorem 3)
and operate on a generic matrix \(C = z z^\top + \Delta\) rather than on
the SPCD-specific iteration matrix \(M\). The two variant options
exposed in the API already cover both procedures applied to \(M\).
Assumptions and Theory#
SPCD’s global-optimality guarantee rests on two assumptions — one structural, one a realizability condition — both stated formally below, each paired with a plain-language Remark.
Assumption 1 (linear latent-factor model). Outcomes are generated by
where \(\delta_t\) is a time fixed effect, \(\mu_j\) are unobserved common factors, \(\theta_t\) is a vector of unknown factor loadings, \(e_{jt}\) is i.i.d. idiosyncratic noise, and \(\tau\) is the treatment effect. In the pre-treatment period \(D_{jt} = -1\) for all units.
Remark. This is the standard interactive-fixed-effects model that underlies the consistency theory of synthetic control (Abadie et al., 2010 [ABADIE2010]; Xu, 2017 [Xu2017]). It is the assumption that makes “units that shared latent factors in the past will keep tracking each other” a defensible basis for the design — exactly the structure SPCD exploits when it balances the two groups on pre-period outcomes.
Assumption 2 (realizable design). There exists a unique parameter \((w_i, D_i)_{i=1}^N\) satisfying: (a) \(D_i \in \{-1, +1\}\); (b) \(w_i \ge 0\) with \(\sum_i D_i w_i = 0\); (c) \(\|w\|_2^2 = N\) and \(\epsilon \le |w_i| \le 1/\epsilon\) for all \(i\); and (d) the weights balance the factors, \(\sum_i w_i D_i \mu_i = 0\).
Remark. This says a perfect, balanced design exists in population — a split whose weighted factor loadings cancel exactly — and that it is unique (so the optimizer need not arbitrate between competing perfect designs). It is the design-time analogue of the “treated unit lies in the convex hull of the donors” condition in retrospective SCM, and is what turns an NP-hard search into a problem with a recoverable global optimum.
Theorem 1 (sign recovery). For \(\lambda\) large enough, the global solution \(W^*\) of the penalized program satisfies \(\operatorname{sgn}(W^*) = \operatorname{sgn}(\arg\min_{\|W\|_1=1} W^\top (YY^\top + \sigma I + \lambda \mathbf{1}\mathbf{1}^\top) W)\). In words: the quadratic penalty does not corrupt the signs of the optimal design, so recovering group membership and recovering the weights can be separated.
Theorem 2 (equivalence to phase synchronization). The design problem is symbolically identical to a phase-synchronization problem on the matrix \((A^\top A)^{-1}\). In words: SPCD inherits the entire fast-solver toolkit developed for phase synchronization — most importantly the spectrally-initialized generalized power method.
Theorem 3 (global convergence). Under Assumptions 1–2, if \(\sigma\)
is small enough and \(T \ge \mathrm{poly}(N, 1/\epsilon)\), then
variant="spcd" converges to the global optimum whenever \(\epsilon >
\sqrt{3}/2 - 1\), and variant="norm_spcd" converges to the global optimum at
a linear rate for any \(\epsilon > 0\). In words: the normalization
step buys global convergence under a strictly weaker condition, which is why
norm_spcd is the default.
Inference and Power Analysis#
Beyond the design itself, SPCD produces a moving-block conformal
confidence interval for the post-period ATT and a Monte Carlo estimate of the
minimum detectable effect (MDE). The procedure mirrors LEXSCM’s train-on-E /
calibrate-on-B discipline.
Estimation / Holdout Split#
Given the pretreatment matrix \(Y_{\text{pre}} \in \mathbb{R}^{T_{\text{pre}} \times N}\), SPCD splits the pretreatment window into
where \(Y_E\) contains the first holdout_frac_E of the pretreatment
periods (default 70 %) and \(Y_B\) contains the remainder. The SPCD design
— sign vector, treated/control weights, and the iteration matrix \(M\) —
is fit on \(Y_E\) only.
Backwards-Compatibility Guarantee#
Because the design is fit on \(Y_E\) alone, two callers who share the same pretreatment data — one in planning mode (no \(Y_{\text{post}}\) yet) and one in retrospective mode (with \(Y_{\text{post}}\)) — receive identical designs. The only difference between the two callers is whether a post-period ATT and its conformal CI are reported.
Holdout Residuals#
Applying the design weights to \(Y_B\) gives an out-of-sample synthetic-gap series
where \(w\) is the signed contrast_weights vector from the
\(Y_E\)-fit design. Under the linear-factor model with no treatment,
\(r_B\) is a zero-mean noise series whose empirical distribution
characterizes the noise structure of any synthetic-gap linear functional based
on the same \(w\). This makes \(r_B\) the natural calibration set for
inference and the natural noise pool for power simulations.
Moving-Block Conformal CI#
When \(Y_{\text{post}}\) is supplied, the post-period synthetic gap is \(g = Y_{\text{post}} \cdot w \in \mathbb{R}^S\), with mean
The test statistic is \(T(g) = \mathrm{mean}(|g|)\). Conformity scores are computed by taking mean-absolute values over all sliding blocks of size \(b = \max(3, \lfloor\sqrt{S}\rfloor)\) (both standard and circular) of \(r_B\). The conformal p-value vs. \(H_0\colon \tau = 0\) is
The \((1 - \alpha)\) confidence interval for \(\hat\tau\) is obtained by inversion: scan a grid of candidate values \(\theta\) around \(\hat\tau\), include \(\theta\) if the adjusted-residual score \(T(g - \theta)\) is in the in-distribution region of the block-score empirical distribution. Pointwise bands at the \((1 - \alpha)\) quantile \(q\) of the block scores are reported as \(g_t \pm q\).
This is exchangeability-based inference: coverage holds in finite samples under the assumption that overlapping blocks of \(r_B\) are exchangeable with overlapping blocks of \(g\) under the null. This is a stronger assumption than IID noise and weaker than perfect \(H_0\). It breaks if the treatment introduces variance changes.
Monte Carlo MDE#
The MDE answers the pre-experiment planning question: given my holdout residuals and a planned post-period horizon \(S\), what is the smallest constant treatment effect I could detect with power \(\geq \pi\)?
The procedure follows mlsynth.utils.fast_scm_helpers.power_helpers:
Build the null distribution of \(T\) at horizon \(S\) by resampling \(r_B\) (padded with Gaussian draws so that resampling of size \(S\) is always feasible). Compute \(c_\alpha = \text{quantile}(\text{null}, 1 - \alpha)\).
For each candidate \(\tau\) on a grid of effect sizes, draw \(n_{\text{trials}}\) post-period vectors of the form \(\tau + \mathcal{N}(0, \hat\sigma_B^2)\) and count the fraction exceeding \(c_\alpha\).
The smallest \(\tau\) whose empirical power reaches the target \(\pi\) is reported as the MDE, both on the absolute scale (
mde) and as a percentage of the holdout-baseline outcome level (mde_pct).
A detectability curve — the MDE as a function of \(S\) — can be requested
by passing mde_horizon_grid to the config. This answers the related
question: how long do I need to run the experiment to detect a target effect?
Power Analysis Always Runs#
The MDE computation depends only on \(r_B\), not on
\(Y_{\text{post}}\). So power analysis runs whenever
enable_inference=True and the holdout window is large enough (at least
min_blank_size periods, default 5). The conformal CI only runs when
\(Y_{\text{post}}\) is supplied; otherwise the ATT and CI are reported as
None (honest absence rather than a silent zero).
Opting Out#
Setting enable_inference=False skips the E/B split entirely. The design is
then fit on the full pretreatment matrix (legacy behavior), no holdout
residuals are produced, and inference / power analysis are not run. Use this
when you need byte-identical reproducibility against a pre-inference SPCD
release, or when your pretreatment window is too short to spare a holdout.
Example: Reproducing the Paper’s RMSE Advantage#
The paper’s headline finding (Section 4.1, Table 1, Figures 2–4) is not merely that SPCD is unbiased — it is that SPCD’s design yields a far lower root-mean-square error of the treatment-effect estimate than a random design on the linear latent-factor model. The block below reproduces that finding and is self-contained: it draws panels from the paper’s data-generating process (\(Y_{it} = \text{level}_i + v_t^\top \gamma_i + e_{it}\), with a true effect \(\tau = 1\) and noise \(\sigma = 1\)), and for each draw estimates \(\tau\) two ways — with SPCD, and with the paper’s random baseline (assign each unit to treated/control with probability one half, then take a simple difference in group means). Because the SPCD design is fit on pre-treatment data only, injecting a post-period effect never changes which units SPCD selects, so the two SPCD calls per draw share an identical design.
import numpy as np
import pandas as pd
from mlsynth import SPCD
def factor_panel(rng, N=10, T_pre=20, T_post=10, L=8, sigma=1.0):
"""One draw from the paper's linear factor model (Section 4.1):
Y_it = level_i + v_t^T gamma_i + e_it, with no treatment applied yet."""
gamma = rng.standard_normal((N, L)) # unit loadings
v = rng.standard_normal((T_pre + T_post, L)) # time factors
level = rng.uniform(40.0, 60.0, size=N) # positive baselines
noise = rng.normal(scale=sigma, size=(T_pre + T_post, N))
return level[None, :] + v @ gamma.T + noise # shape (T, N)
def make_df(Y, T_pre):
T, N = Y.shape
return pd.DataFrame(
[{"unit": i, "time": t, "y": float(Y[t, i]), "post": int(t >= T_pre)}
for i in range(N) for t in range(T)]
)
def spcd_att(Y, T_pre, tau):
"""Fit the SPCD design on the pre-period, inject tau on the chosen
treated units, then read the estimated ATT off the post-period."""
cfg = dict(outcome="y", unitid="unit", time="time", post_col="post",
enable_inference=False, alpha_ridge=1.0) # sigma^2 known = 1
treated = SPCD({"df": make_df(Y, T_pre), **cfg}).fit().selected_unit_labels
Y = Y.copy(); Y[T_pre:, np.asarray(treated)] += tau
return SPCD({"df": make_df(Y, T_pre), **cfg}).fit().att
def random_att(rng, Y, T_pre, tau):
"""Paper's baseline: assign each unit to treated/control with prob 1/2
and estimate tau by a difference in (equally weighted) group means."""
N = Y.shape[1]
sign = rng.choice([-1, 1], size=N)
if sign.min() == sign.max(): # keep both groups non-empty
sign[rng.integers(N)] *= -1
treated, control = np.where(sign == 1)[0], np.where(sign == -1)[0]
Y = Y.copy(); Y[T_pre:, treated] += tau
gap = Y[T_pre:, treated].mean(axis=1) - Y[T_pre:, control].mean(axis=1)
return gap.mean()
rng = np.random.default_rng(0)
T_pre, T_post, tau, n_draws = 20, 10, 1.0, 200
# --- one draw: SPCD recovers the injected effect ---
Y = factor_panel(rng, T_pre=T_pre, T_post=T_post)
print(f"single-draw SPCD ATT: {spcd_att(Y, T_pre, tau):.3f} (true {tau})")
# --- many draws: SPCD's design slashes ATT RMSE vs a random design ---
err_spcd, err_rand = [], []
for _ in range(n_draws):
Y = factor_panel(rng, T_pre=T_pre, T_post=T_post)
err_spcd.append(spcd_att(Y, T_pre, tau) - tau)
err_rand.append(random_att(rng, Y, T_pre, tau) - tau)
err_spcd, err_rand = np.array(err_spcd), np.array(err_rand)
rmse = lambda e: np.sqrt(np.mean(e ** 2))
print(f"SPCD mean ATT {tau + err_spcd.mean():.3f} RMSE {rmse(err_spcd):.3f}")
print(f"random mean ATT {tau + err_rand.mean():.3f} RMSE {rmse(err_rand):.3f}")
print(f"RMSE ratio (random / SPCD): {rmse(err_rand) / rmse(err_spcd):.1f}x")
Both designs are roughly unbiased (mean ATT \(\approx 1\)), but SPCD’s RMSE is about 0.4 against the random design’s 3.9 — an order-of- magnitude reduction (≈ 9–10× on this seed), reproducing the paper’s central claim that SPCD “surpasses the random design by a large margin.” The mechanism is the balance term of Eq. (1): a single draw of the random design can split the units so that the two group means differ substantially even before treatment, and that pre-period imbalance carries straight through to the ATT; SPCD instead chooses the split (and weights) that makes the groups track each other, so almost nothing but the injected effect survives into the post-period gap.
Switch variant ("spcd" vs. "norm_spcd") and weights
("empirical" vs. "exact") to compare the four code-paths; all share the
same iteration matrix \(M\) and differ only in the refinement and weight
steps described above. Enabling inference (the default) additionally returns a
conformal p-value, CI, and the design-time MDE on each fit.
Note
selected_unit_labels returns the minority group (the treated side) by
the minority-group convention. Treating the smaller group keeps the control
pool large, which is what the variance term \(\sigma^2 \sum_i w_i^2\)
rewards.
API Variants and Multi-Arm Designs#
SPCD accepts either an SPCDConfig instance or a plain dict. The
two orthogonal modelling choices are:
variant: which power-method update box to iterate."spcd"is the generalized power iteration (Eq. (4)/(7));"norm_spcd"is the normalized iteration (Eq. (5)/(8)) used in all of the paper’s Section 4 experiments and is the package default.weights: how to compute the final unit weights."empirical"uses the closed-form Eq. (9) (fast; the authors’ experimental default);"exact"solves the convex QP Eq. (6) viacvxpy(slower; matches Algorithm 1 to the letter).
Passing an arm column solves the SPCD design independently within each
arm’s units and returns an SPCDMultiArmResults (a dict of per-arm
SPCDResults); when arm is None (default) a single
SPCDResults is returned.
Multi-Arm Power: the Pooled Average Effect#
Each arm carries its own per-arm MDE (arm_designs[label].mde), but those
are m independent analyses with no family-wise error control. When the
estimand of interest is the average effect across arms — the usual
program-level question — SPCDMultiArmResults also reports a single
pooled average-effect MDE (results.pooled_mde /
results.pooled_mde_pct), computed by default whenever inference is enabled
and at least two arms have a usable holdout window.
It is formed by pooling the arms’ time-aligned holdout residuals into one
contrast, \(g_t = \sum_a w_a\, r^{(a)}_{B,t}\), and running the ordinary
single-series MDE on \(g\). Because the arms share calendar time, summing
the aligned series makes the cross-arm correlation enter through
\(\operatorname{Var}(g) = w^\top \Sigma\, w\) automatically — which is why
the residual series must be pooled, never resampled per arm independently
(that would drop positive correlation and report an over-optimistic MDE). The
weights \(w_a\) are set by pooled_weights: "size" (default) weights
each arm by its unit count, giving the population-average effect;
"equal" weights arms equally.
Pooling answers an easier question than the per-arm analyses — averaging cancels idiosyncratic noise, so the detectable average effect is typically several times smaller than any single arm’s MDE.
Warning
The pooled MDE targets the weighted-average effect. If arm effects are
heterogeneous and opposite-signed they can cancel in the average and go
undetected even when individual arms move a lot. Use the per-arm
arm_designs[label].mde (or a family-wise procedure) when detecting
individual arms is what matters.
res = SPCD({
"df": df, "outcome": "sales", "unitid": "unit", "time": "time",
"arm": "region", "post_col": "post",
"pooled_weights": "size", # population-average effect ("equal" also available)
}).fit()
print(res.pooled_mde) # smallest detectable AVERAGE effect across arms
print(res.pooled_mde_pct) # ... as % of the pooled baseline
print({k: r.mde for k, r in res.arm_designs.items()}) # per-arm MDEs
Power Curves and the MDE at Each Time Point#
Every fitted design stores two diagnostic curves on its
SPCDPowerAnalysis — for each arm
(arm_designs[label].power) and for the whole study
(pooled_power):
Power vs. effect size —
power.effect_grid_pctandpower.power_curvegive the full power curve at the analysis horizon; the MDE is the smallest effect whose power reachespower_target.MDE vs. horizon (“MDE at time point \(t\)”) —
power.detectabilityis a{horizon -> MDE percent}map. It is computed by default over horizons \(1, \dots, \min(12, n_{\text{post}})\) — real market tests rarely run past ~12 periods, so the grid is capped at 12 rather than sweeping the full (often 30+ period) post window. Passmde_horizon_gridto choose your own horizons (e.g.range(1, 9)for an 8-week test). The curve is built both per arm and for the whole study; the MDE shrinks as the horizon grows, and the pooled curve sits below every per-arm curve.
Warning
The headline pooled_mde_pct is computed at the full post-window
length in your data. If you will actually run a shorter test, quote the
curve point for that length — e.g. pooled_power.detectability[12] for a
12-period test — not the headline number.
Pre-Period Agreement (Design Diagnostic)#
Every fitted design also reports how closely the synthetic treated and
synthetic control trajectories agree before treatment — the RMSE of
their gap, broken out by window (results.pre_fit /
results.pre_fit_rmse):
estimation— RMSE over the estimation window \(E\) (in-sample for the design fit);blank— RMSE over the blank/holdout window \(B\) (out-of-sample — the honest read on whether the match will hold up post-launch);overall_pre— RMSE over the full pre-period (\(E \cup B\)).
Lower is better. The blank value is None when
enable_inference=False (no holdout split is made).
r = results.arm_designs["A"] # or a single SPCDResults
print(r.pre_fit_rmse)
# {"estimation": 25.8, "blank": 16.7, "overall_pre": 23.4}
res = SPCD({
"df": df, "outcome": "sales", "unitid": "unit", "time": "time",
"arm": "region", "post_col": "post",
}).fit()
whole_study = res.pooled_power.detectability # {horizon: MDE % of baseline}
per_arm = {k: r.power.detectability for k, r in res.arm_designs.items()}
# Recommend the shortest horizon that detects a target average effect:
target_pct = 1.0
feasible = [h for h, m in sorted(whole_study.items()) if m <= target_pct]
recommended_periods = min(feasible) if feasible else None
Three built-in plot helpers turn these into slides-ready figures:
from mlsynth import plot_mde_bars, plot_power_curves, plot_detectability
plot_mde_bars(res) # bar chart: per-arm MDE % with the pooled bar
plot_power_curves(res) # power vs. effect size, per arm + pooled
plot_detectability(res) # MDE vs. study length (time points), per arm + pooled
Each returns a Matplotlib Figure (so it renders inline in a notebook and
can be .savefig(...)-ed), and works on a single SPCDResults as
well as a multi-arm SPCDMultiArmResults.
Note
Computing the per-horizon detectability for every arm and the pooled
contrast multiplies the Monte Carlo cost by \(n_{\text{post}}\). For
long post windows, pass a coarser mde_horizon_grid (e.g.
range(2, n_post + 1, 2)) or lower mde_n_sims to keep fits fast.
from mlsynth import SPCD
config = {
"df": df,
"outcome": "sales",
"unitid": "unitid",
"time": "time",
"post_col": "post", # or T0=...
"variant": "norm_spcd", # iteration box; "spcd" also available
"weights": "empirical", # paper's experimental default; "exact" uses cvxpy
"display_graph": True,
}
results = SPCD(config).fit()
# Selected units and design diagnostics
print(results.selected_unit_labels)
print(results.design.n_iterations, results.design.converged)
print(results.design.alpha_ridge, results.design.lam_balance, results.design.beta)
# Standardized result bundle — same shape as the rest of mlsynth
print(results.att) # mean post-period synthetic gap (= 0 if no post)
print(results.rmse_pre) # pre-period RMSE between synthetic groups
print(results.rmse_post) # post-period RMSE (None if no post)
print(results.donor_weights) # {control unit label: weight}
# Or via the full BaseEstimatorResults object
summary = results.summary
summary.effects.att
summary.fit_diagnostics.rmse_pre
summary.time_series.observed_outcome # synthetic treated trajectory
summary.time_series.counterfactual_outcome # synthetic control trajectory
summary.time_series.estimated_gap # synthetic gap (signed)
summary.weights.donor_weights
summary.method_details.method_name # "SPCD (norm_spcd, weights=empirical)"
summary.method_details.parameters_used # variant, alpha/lam/beta, iterations
Core API#
Synthetic Principal Component Design (SPCD) estimator.
This module implements:
Lu, Y., Li, J., Ying, L., & Blanchet, J. (2022). “Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls.” arXiv:2211.15241v1.
Algorithm flow (see mlsynth.utils.spcd_helpers):
formulation.py : Eq. (2) M = Y Y^T + alpha I + lambda 1 1^T spectral_init.py : Algs 1 & 2 y^0 = sgn(smallest eigvec of M) iteration_spcd.py : Eqs. (4)/(7) SPCD update (variant=”spcd”) iteration_norm_spcd.py : Eqs. (5)/(8) NormSPCD update (variant=”norm_spcd”) weights_empirical.py : Eq. (9) Algorithm 2 final step (weights=”empirical”) weights_exact.py : Eq. (6) Algorithm 1 final step (weights=”exact”) treatment_effect.py : Algs 1 & 2 footer (minority flip, synthetic paths)
Theoretical artifacts (not implemented as separate code paths):
Algorithm 3 (Appx 3.2, p.20) : abstract GPW used in the proof of Theorem 3 Algorithm 4 (Appx 3.2, p.20) : abstract Normalized GPW used in the proof
- class mlsynth.estimators.spcd.SPCD(config: SPCDConfig | dict)#
Bases:
objectSynthetic Principal Component Design (SPCD) estimator.
Selects a treated group and constructs synthetic control weights simultaneously by solving the optimal experiment-design problem of Lu, Li, Ying, Blanchet (2022) via a (normalized) generalized power method with spectral initialization.
- Parameters:
config (SPCDConfig or dict) – Configuration object defining data, iteration variant, and weight-step choice.
- config#
Parsed configuration object.
- Type:
- df#
Input panel data.
- Type:
pandas.DataFrame
- variant#
Iteration-box choice.
"spcd"uses Eq. (4)/(7);"norm_spcd"uses Eq. (5)/(8).- Type:
{“spcd”, “norm_spcd”}
- weights#
Final-weight-step choice.
"empirical"uses Eq. (9), the paper’s experimental default;"exact"solves Eq. (6) via cvxpy.- Type:
{“empirical”, “exact”}
- alpha_ridge, lam_balance, beta
Hyperparameters of Eq. (2) and the iteration update. When
None, each is auto-estimated from the spectrum ofY_pre.T @ Y_pre.- Type:
float or None
- T0, post_col
Pre/post split specification (mirrors SYNDES’s interface).
- Type:
optional
- solver#
Passed to cvxpy when
weights="exact".- Type:
optional
- Returns:
SPCDResults – Container with the optimized design, attached preprocessed inputs, and synthetic-path arrays.
- fit() SPCDResults | SPCDMultiArmResults#
Run the SPCD pipeline and return the design.
- Returns:
SPCDResults or SPCDMultiArmResults – A single design when no
armcolumn is configured; otherwise one independent SPCD design per arm, keyed by arm label.
Configuration#
- class mlsynth.config_models.SPCDConfig(*, df: DataFrame, outcome: str, unitid: str, time: str, variant: Literal['spcd', 'norm_spcd'] = 'norm_spcd', weights: Literal['empirical', 'exact'] = 'empirical', alpha_ridge: Annotated[float | None, Ge(ge=0.0)] = None, lam_balance: Annotated[float | None, Ge(ge=0.0)] = None, beta: Annotated[float | None, Ge(ge=0.0)] = None, max_iter: Annotated[int, Gt(gt=0)] = 200, T0: Annotated[int | None, Gt(gt=0)] = None, post_col: str | None = None, arm: str | None = None, solver: Any = None, display_graph: bool = False, verbose: bool = False, enable_inference: bool = True, holdout_frac_E: Annotated[float, Ge(ge=0.1), Le(le=0.95)] = 0.7, inference_alpha: Annotated[float, Gt(gt=0.0), Lt(lt=1.0)] = 0.05, power_target: Annotated[float, Gt(gt=0.0), Lt(lt=1.0)] = 0.8, mde_n_sims: Annotated[int, Gt(gt=0)] = 5000, mde_n_trials: Annotated[int, Gt(gt=0)] = 400, mde_horizon_grid: List[int] | None = None, inference_seed: int = 1400, min_blank_size: Annotated[int, Gt(gt=0)] = 5, pooled_weights: Literal['size', 'equal'] = 'size', covariates: List[str] | None = None, covariate_weight: Annotated[float, Ge(ge=0)] = 1.0)#
Configuration for the Synthetic Principal Component Design (SPCD) estimator.
Implements Lu, Li, Ying, Blanchet (2022), “Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls”, arXiv:2211.15241v1.
- Parameters:
variant ({“spcd”, “norm_spcd”}) – Iteration-box choice.
"spcd"uses Eq. (4)/(7) of the paper;"norm_spcd"uses Eq. (5)/(8). The paper’s Section 4 experiments use"norm_spcd"with closed-form weights.weights ({“empirical”, “exact”}) – Final-weight-step choice.
"empirical"uses Eq. (9) — the closed-form approximation used in all of the paper’s experiments."exact"solves Eq. (6) via cvxpy.alpha_ridge (float or None, optional) – Ridge term
alphain Eq. (2), playing the role of the noise variancesigma. IfNone, it is chosen by out-of-sample pre-period balance over a noise-scale grid (select_alpha_by_holdout), since the post-period RMSE is a jumpy function ofalphawhenN > T_pre. Pass a value (e.g. a known noise variance) to bypass selection.lam_balance (float or None, optional) – Sum-zero penalty
lambdain Eq. (2). Auto-estimated ifNone. Theorem 1 requires this to be “large enough”.beta (float or None, optional) – Iteration step parameter
betain Eqs. (4)/(5)/(7)/(8). Auto-estimated from the spectrum ifNone.max_iter (int) – Maximum iterations for the SPCD/NormSPCD while loop.
T0 (int or None, optional) – Number of pre-treatment periods.
post_col (str or None, optional) – Column indicating post-treatment periods.
solver (Any, optional) – CVXPY-compatible solver. Used only when
weights="exact".display_graph (bool) – Whether to display the synthetic treated/control plot.
verbose (bool) – Solver verbosity.
Notes
Algorithms 3 and 4 of the paper (Appendix 3.2) are abstract meta-versions used in the proof of Theorem 3 (global convergence). They correspond to the same iterations as Algorithms 1 and 2 acting on a generic Hermitian perturbed rank-1 matrix and are not exposed as separate user options here.
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
Result Containers#
SPCD.fit() returns an
SPCDResults, bundling the
optimized SPCDDesign
(sign vector, treated/control/contrast weights, selected unit labels,
iteration diagnostics, and the auto-chosen \(\alpha\), \(\lambda\),
\(\beta\)), the prepared
SPCDInputs, a standardized
summary (BaseEstimatorResults), and the
optional conformal CI and power-analysis objects. When an arm column is
configured, an SPCDMultiArmResults
collects one independent SPCDResults per arm.
Structured containers for the SPCD synthetic design pipeline.
Implements the data containers used throughout the SPCD estimator, which itself implements:
Lu, Y., Li, J., Ying, L., & Blanchet, J. (2022). Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls. arXiv:2211.15241v1.
- class mlsynth.utils.spcd_helpers.structures.SPCDDesign(variant: str, weights_mode: str, assignment_pm1: ndarray, selected_mask: ndarray, raw_weights: ndarray, treated_weights: ndarray, control_weights: ndarray, contrast_weights: ndarray, synthetic_treated: ndarray, synthetic_control: ndarray, synthetic_gap: ndarray, selected_unit_indices: ndarray, selected_unit_labels: ndarray, n_treated: int, n_iterations: int, converged: bool, alpha_ridge: float, lam_balance: float, beta: float)#
Bases:
objectOptimized SPCD design solution.
The “assignment” is represented in two equivalent forms:
assignment_pm1— the paper’s internal{-1, +1}sign vectory* in {-1, +1}^N(see Algorithm 1, page 7).selected_mask— a{0, 1}indicator marking the minority group that is treated, following the rule “Treat Unit i ifgamma(i) = -sgn(sum_j gamma(j))” at the bottom of Algorithm 1.
- Parameters:
variant (str) – Iteration variant used:
"spcd"(Eq. (4)/(7)) or"norm_spcd"(Eq. (5)/(8)).weights_mode (str) – Final weight step used:
"empirical"(Eq. (9), Algorithm 2) or"exact"(Eq. (6), Algorithm 1).assignment_pm1 (np.ndarray) – Final sign vector
y* in {-1, +1}^Nproduced by the iteration. See Algorithm 1, page 7.selected_mask (np.ndarray) – Binary
{0, 1}indicator of treated units, following the minority-group convention from the bottom of Algorithm 1.raw_weights (np.ndarray) – Signed weights
w in R^Ncomputed by the final weight step (Eq. (9) forweights="empirical"or Eq. (6) forweights="exact").treated_weights (np.ndarray) – Weights restricted to the treated group, normalized to sum to 1.
control_weights (np.ndarray) – Weights restricted to the control group, normalized to sum to 1.
contrast_weights (np.ndarray) – Signed contrast weights forming
treated_weights - control_weights(with appropriate signs), used to construct the synthetic gap.synthetic_treated (np.ndarray) – Synthetic treated trajectory
Y @ treated_weightsof lengthT_pre + T_post.synthetic_control (np.ndarray) – Synthetic control trajectory
Y @ control_weightsof lengthT_pre + T_post.synthetic_gap (np.ndarray) – Pointwise difference
synthetic_treated - synthetic_control.selected_unit_indices (np.ndarray) – Integer indices of treated units.
selected_unit_labels (np.ndarray) – Original labels of treated units.
n_treated (int) – Number of treated units.
n_iterations (int) – Number of iterations executed by Algorithm 1’s while loop.
converged (bool) – True if the sign vector stabilized before
max_iter.alpha_ridge (float) – Value of
alphaused in Eq. (2) (ridge onY Y^T).lam_balance (float) – Value of
lambdaused in Eq. (2) (penalty on(1^T W)^2).beta (float) – Value of
betaused in the iteration update.
- assignment_pm1: ndarray#
- contrast_weights: ndarray#
- control_weights: ndarray#
- raw_weights: ndarray#
- selected_mask: ndarray#
- selected_unit_indices: ndarray#
- selected_unit_labels: ndarray#
- synthetic_control: ndarray#
- synthetic_gap: ndarray#
- synthetic_treated: ndarray#
- treated_weights: ndarray#
- class mlsynth.utils.spcd_helpers.structures.SPCDInputs(Y_pre: ndarray, Y_post: ndarray | None, unit_index: IndexSet, time_index: IndexSet, pre_time_index: IndexSet, post_time_index: IndexSet | None, outcome: str, covariates: ndarray | None = None, covariate_names: list | None = None)#
Bases:
objectPreprocessed panel data for SPCD estimation.
- Parameters:
Y_pre (np.ndarray) – Pre-treatment outcome matrix of shape
(T_pre, N). Note that rows are time periods and columns are units, matching the convention used byprepare_syndes_inputs. The paper’s equations useY in R^{N x T}, so when implementing Eq. (2) of the paper, the iteration matrix is built asY_pre.T @ Y_pre + alpha I + lambda 1 1.T.Y_post (np.ndarray or None) – Post-treatment outcome matrix of shape
(T_post, N).unit_index (IndexSet) – Mapping from unit labels to integer indices.
time_index (IndexSet) – Mapping from time labels to integer indices.
pre_time_index (IndexSet) – Index set for pre-treatment periods.
post_time_index (IndexSet or None) – Index set for post-treatment periods.
outcome (str) – Name of outcome variable.
- Y_pre: ndarray#
- pre_time_index: IndexSet#
- time_index: IndexSet#
- unit_index: IndexSet#
- class mlsynth.utils.spcd_helpers.structures.SPCDMultiArmResults(arm_designs: Dict[Any, SPCDResults], arm: str, pooled_power: 'SPCDPowerAnalysis' | None = None, pooled_weights: str | None = None)#
Bases:
objectPer-arm SPCD designs.
Returned by
mlsynth.estimators.SPCDwhen anarmcolumn is configured: the SPCD design problem is solved independently within each arm’s units, and each arm’s fullSPCDResults(design, inputs, summary, conformal CI and power) is collected here.- Parameters:
arm_designs (dict) –
{arm_label: SPCDResults}– one independent SPCD solution per arm.arm (str) – Name of the arm column the units were partitioned on.
pooled_power (SPCDPowerAnalysis, optional) – Minimum detectable effect for the average effect across arms (the pooled, size- or equal-weighted contrast). Computed by default when inference is enabled and at least two arms have a usable holdout window.
Noneotherwise. Note this targets the weighted-average effect, so opposite-signed arm effects can cancel – use the per-armpowerfor individual-arm detection.pooled_weights (str, optional) – Weighting used for the pooled average (
"size"or"equal");Nonewhen no pooled MDE was computed.
- arm_designs: Dict[Any, SPCDResults]#
- property pooled_mde: float | None#
MDE of the average effect across arms (absolute scale), if computed.
- class mlsynth.utils.spcd_helpers.structures.SPCDPreFit(rmse_estimation: float, rmse_blank: float | None, rmse_pre: float, n_estimation: int, n_blank: int)#
Bases:
objectPre-period agreement between the synthetic treated and synthetic control trajectories – a design-phase fit diagnostic.
Reports the RMSE of the synthetic gap (synthetic treated minus synthetic control) over three windows:
rmse_estimation– the estimation window E, in-sample for the design fit;rmse_blank– the blank/holdout window B, out-of-sample (Nonewhenenable_inference=Falseso no split was made);rmse_pre– the overall pre-period (E + B).
Lower values mean the two synthetic series track each other more closely before treatment. The blank-window RMSE is the honest out-of-sample read on how well the design will hold up post-launch.
- class mlsynth.utils.spcd_helpers.structures.SPCDResults(design: SPCDDesign, inputs: SPCDInputs | None = None, summary: 'BaseEstimatorResults' | None = None, conformal: 'SPCDConformalResult' | None = None, power: 'SPCDPowerAnalysis' | None = None, pre_fit: SPCDPreFit | None = None)#
Bases:
objectUser-facing output of the SPCD estimator.
- Parameters:
design (SPCDDesign) – Optimization solution.
inputs (SPCDInputs, optional) – Preprocessed data used in estimation. Attached by the
mlsynth.estimators.SPCDorchestrator so the result is self-contained for plotting.summary (BaseEstimatorResults, optional) – Standardized result bundle containing ATT, pre/post fit RMSEs, synthetic-paths time series, per-unit signed weights, and method diagnostics. Attached by the SPCD orchestrator so users get a single object whose shape matches the rest of the mlsynth estimator suite.
Notes
modealways reports"spcd"so plotting and dispatch code can branch on it uniformly withSYNDESResults.Convenience properties (
att,rmse_pre,rmse_post,donor_weights) forward to the corresponding fields ofsummarywhen it is attached.- property assignment: ndarray#
Alias for
design.selected_mask(0/1 indicator of treated units).
- property control_weights_by_unit: dict | None#
Control-side weights as
{label: weight}(non-negative, sum to 1).Alias of
donor_weights()for users who prefer the explicit treated/control naming over the SC-literature “donor” terminology.
- design: SPCDDesign#
- inputs: SPCDInputs | None = None#
- pre_fit: SPCDPreFit | None = None#
- property pre_fit_rmse: dict | None#
Synthetic treated-vs-control RMSE by window.
{"estimation": ..., "blank": ..., "overall_pre": ...}– how closely the synthetic treated and synthetic control trajectories agree over the estimation, blank, and overall pre-treatment windows.Noneif the pre-fit diagnostic was not computed.
- property rmse_post: float | None#
Post-treatment RMSE of the synthetic gap, if a post period exists.
- property selected_unit_indices: ndarray#
Integer indices of units selected into treatment.
- property selected_unit_labels: ndarray#
Labels of units selected into treatment.
Helper Modules#
Data preparation helpers for SPCD.
Pivots long panel data into the wide (T, N) matrix consumed by the
SPCD iteration. The orientation here is the inverse of the paper’s
Y in R^{N x T} (Algorithm 1, page 7), so the iteration matrix from
Eq. (2),
M = Y Y^T + alpha I + lambda 1 1^T,
is implemented as Y_pre.T @ Y_pre + alpha I + lambda 1 1^T further
downstream in formulation.py.
- mlsynth.utils.spcd_helpers.setup.prepare_spcd_inputs(df: DataFrame, outcome: str, unitid: str, time: str, T0: int | None = None, post_col: str | None = None, covariates: List[str] | None = None) SPCDInputs#
Pivot long panel data and split it into pre/post matrices for SPCD.
- Parameters:
df (pd.DataFrame) – Long balanced panel data.
outcome, unitid, time (str) – Column names identifying the outcome, units, and time periods.
T0 (Optional[int]) – Number of pre-treatment periods when
post_colis not supplied.post_col (Optional[str]) – Optional 0/1 or boolean column identifying post-treatment periods.
covariates (Optional[list of str]) – Optional covariate columns to balance on alongside the outcomes. Each unit’s per-covariate pre-period mean is taken and the resulting
(N, P)matrix is z-scored across units. Time-invariant covariates collapse to their constant value.
- Returns:
SPCDInputs – Wide pre/post matrices and label metadata.
Iteration-matrix construction for SPCD.
Implements Eq. (2) of the paper,
M = Y Y^T + alpha I + lambda 1 1^T,
and supplies spectral-based auto-defaults for the three hyperparameters
alpha, lambda, beta when the user has not specified them.
Reference#
Lu, Y., Li, J., Ying, L., & Blanchet, J. (2022). “Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls.” arXiv:2211.15241v1.
- mlsynth.utils.spcd_helpers.formulation.build_iteration_matrix(Y_pre: ndarray, alpha: float | None = None, lam: float | None = None, beta: float | None = None, covariates: ndarray | None = None, covariate_weight: float = 1.0) Tuple[ndarray, ndarray, float, float, float]#
Build the SPCD iteration matrix
Mand its inverse.Implements Eq. (2) of the paper:
M = Y_pre.T @ Y_pre + alpha I + lambda 1 1^T
Note that
Y_prehere is(T_pre, N)(mlsynth convention), which is the transpose of the paper’sY in R^{N x T}. The productY_pre.T @ Y_preis exactly the paper’sY Y^T.When
covariates(an(N, P)z-scored matrix) is supplied, a covariate-balance term is folded in:- M = Y_pre.T @ Y_pre + covariate_weight * s * (X X^T)
alpha I + lambda 1 1^T,
where
sscales the covariate Gram to match the trace (total “energy”) ofY_pre.T @ Y_presocovariate_weight = 1weights covariates equally to outcomes. This keeps the phase-synchronization structure (a quadratic formW^T M W) intact, so the solver and the global-optimality theory are unchanged – the design now balances the two groups on outcomes and covariates jointly.- Parameters:
Y_pre (np.ndarray) – Pre-treatment matrix of shape
(T_pre, N).alpha (float, optional) – Ridge term
alphain Eq. (2), which plays the role of the idiosyncratic noise variancesigma(Assumption 1). IfNone, it is estimated from the noise scale ofY_previa the Gavish-Donoho median-singular-value estimator (estimate_noise_variance()), consistent with the paper’s appendix conditionalpha = ||Delta||on the perturbation scale. Note the paper itself treatsalphaas a pre-defined hyperparameter and gives no formula; this default is a heuristic. For data generated with a known noise level (e.g. the paper’s Section 4.1 simulations withsigma = 1), pass that value explicitly.lam (float, optional) – Sum-zero penalty
lambdain Eq. (2). IfNone, defaults to the largest eigenvalue ofY_pre.T @ Y_preso that Theorem 1’s “large enough lambda” condition is satisfied on the scale of the data.beta (float, optional) – Iteration step parameter
betaused in Eqs. (4), (5), (7), (8). IfNone, defaults to1 / lambda_max(M), the smallest eigenvalue ofM^{-1}. This is the natural scale that keepsM^{-1} + beta Ifrom being numerically dominated by either term at largeN.
- Returns:
M (np.ndarray) – The N x N iteration matrix from Eq. (2).
M_inv (np.ndarray) – The inverse of
M, computed vianp.linalg.solve(M, I).alpha (float) – Final value used (auto-estimated if input was
None).lam (float) – Final value used (auto-estimated if input was
None).beta (float) – Final value used (auto-estimated if input was
None).
- mlsynth.utils.spcd_helpers.formulation.estimate_noise_variance(Y_pre: ndarray) float#
Estimate the idiosyncratic noise variance
sigma^2ofY_pre.In the paper’s outcome model the ridge term
alphain Eq. (2) plays the role of the noise variancesigma = Var(e_it)(it is the coefficient of the variance termsigma * sum_i w_i^2in the MSE decomposition), and the appendix convergence theorems set the perturbation ridge on the noise scalealpha = ||Delta||withalpha <= ||Delta||. Soalphamust track the noise scale of the data, not its dominant (signal/level) eigenvalue.This uses the Gavish-Donoho (2014) median-singular-value estimator, which is parameter-free and robust when the signal is approximately low-rank: with
Y_preof shape(m, n)andbeta = min(m, n) / max(m, n),\[\hat\sigma = \frac{\operatorname{median}(\text{singular values})} {\sqrt{\max(m, n)\, \mu_\beta}},\]where
mu_betais the Marchenko-Pastur median. Returnshat_sigma^2.- Parameters:
Y_pre (np.ndarray) – Pre-treatment outcome matrix of shape
(T_pre, N).- Returns:
float – Estimated noise variance
sigma^2(always positive).
- mlsynth.utils.spcd_helpers.formulation.validate_spcd_inputs(Y_pre: ndarray) None#
Check basic shape and feasibility of the pre-treatment matrix.
- Parameters:
Y_pre (np.ndarray) – Pre-treatment outcome matrix of shape
(T_pre, N).- Raises:
MlsynthDataError – If
Y_preis not 2D, has fewer than 2 periods, or fewer than 2 units.
Spectral initialization for the SPCD iteration.
Implements the “Spectral Initialization” step of Algorithm 1 (page 7) and Algorithm 2 (page 13) of the SPCD paper:
y^0 = sgn(v),
where v is the smallest eigenvector of the iteration matrix
M = Y Y^T + alpha I + lambda 1 1^T (Eq. 2)
The theoretical justification appears in Appendix 3.2.1 (“The Spectral Initialization”, page 20). Under Assumption 2 (realizability) and the generative process of Section 3.2, the spectral estimator’s sign vector agrees with the global optimum’s sign vector up to a bounded fraction of entries (Lemma 4, page 20).
Reference#
Lu, Y., Li, J., Ying, L., & Blanchet, J. (2022). “Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls.” arXiv:2211.15241v1.
- mlsynth.utils.spcd_helpers.spectral_init.spectral_initialization(M: ndarray) ndarray#
Compute the spectral-initialization sign vector for SPCD.
Implements the “Spectral Initialization” step from Algorithm 1 (page 7) and Algorithm 2 (page 13) of the paper:
y^0 = sgn(v), v = smallest eigenvector of M.
- Parameters:
M (np.ndarray) – N x N symmetric positive-definite iteration matrix from Eq. (2).
- Returns:
y0 (np.ndarray) – Length-N sign vector with entries in
{-1, +1}.
Notes
np.linalg.eighreturns eigenvalues in ascending order, so column zero of the eigenvector matrix is the eigenvector associated with the smallest eigenvalue. Any zero entries insgn(v)are mapped to+1so thaty^0is strictly in{-1, +1}^N.
SPCD iteration step (generalized power method).
Implements the SPCD update box from Algorithm 1 (page 7) and the identical update from Algorithm 2 (page 13) of the paper:
- Eq. (4) / Eq. (7):
y^{t+1} = sgn[ (M^{-1} + beta I) y^t ],
where M = Y Y^T + alpha I + lambda 1 1^T (Eq. 2) and beta is a
pre-defined step parameter.
Theoretical justification: Algorithm 3 of Appendix 3.2 (page 20) gives the abstract meta-version of this iteration on a generic Hermitian perturbed rank-1 matrix. Theorem 3 (page 9, informal) gives global convergence under Assumptions 1-2.
Reference#
Lu, Y., Li, J., Ying, L., & Blanchet, J. (2022). “Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls.” arXiv:2211.15241v1.
- mlsynth.utils.spcd_helpers.iteration_spcd.run_spcd_iteration(M_inv: ndarray, y0: ndarray, beta: float, max_iter: int) Tuple[ndarray, int, bool]#
Run the SPCD
while Converged doloop using Eq. (4)/(7).Implements the SPCD branch of the outer loop in Algorithm 1 (page 7) and Algorithm 2 (page 13). The loop terminates as soon as the sign vector stops changing between successive iterations.
- Parameters:
M_inv (np.ndarray) – N x N inverse of the iteration matrix from Eq. (2).
y0 (np.ndarray) – Initial sign vector from Spectral Initialization.
beta (float) – Iteration step parameter from Eq. (4)/(7).
max_iter (int) – Maximum number of iterations to perform before terminating.
- Returns:
y_final (np.ndarray) – Final sign vector
y* in {-1, +1}^N.n_iter (int) – Number of iterations actually performed.
converged (bool) –
Trueif the sign vector stabilized beforemax_iter.
- mlsynth.utils.spcd_helpers.iteration_spcd.spcd_step(M_inv: ndarray, y: ndarray, beta: float) ndarray#
One iteration of the SPCD update box, Eq. (4)/(7).
Computes:
y^{t+1} = sgn[ (M^{-1} + beta I) y^t ]
- Parameters:
M_inv (np.ndarray) – N x N inverse of the iteration matrix from Eq. (2).
y (np.ndarray) – Length-N sign vector at iteration
t.beta (float) – Iteration step parameter from Eq. (4)/(7).
- Returns:
y_new (np.ndarray) – Length-N sign vector at iteration
t + 1.
NormSPCD iteration step (normalized generalized power method).
Implements the NormSPCD update box from Algorithm 1 (page 7) and the identical update from Algorithm 2 (page 13) of the paper:
- Eq. (5) / Eq. (8):
y^{t+1} = sgn[ (M^{-1} + beta I) (y^t / d) ],
where:
M = Y Y^T + alpha I + lambda 1 1^T (Eq. 2)
d = sqrt(diag(M^{-1})) (definition below Eq. 5)
and / denotes element-wise division.
Theoretical justification: Algorithm 4 of Appendix 3.2 (page 20) gives
the abstract meta-version of this iteration. In the same appendix the
paper interprets NormSPCD as a Riemannian gradient descent under a
diagonal preconditioning given by the inverse-covariance diagonal
(Appendix 3, page 8 and Appendix 3.2). Theorem 3 (page 9) gives
linear-rate global convergence under Assumptions 1-2 with epsilon > 0.
Reference#
Lu, Y., Li, J., Ying, L., & Blanchet, J. (2022). “Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls.” arXiv:2211.15241v1.
- mlsynth.utils.spcd_helpers.iteration_norm_spcd.norm_spcd_step(M_inv: ndarray, d: ndarray, y: ndarray, beta: float) ndarray#
One iteration of the NormSPCD update box, Eq. (5)/(8).
Computes:
y^{t+1} = sgn[ (M^{-1} + beta I) (y^t / d) ]
- Parameters:
M_inv (np.ndarray) – N x N inverse of the iteration matrix from Eq. (2).
d (np.ndarray) – Length-N normalization vector
sqrt(diag(M^{-1})). See the definition immediately below Eq. (5), page 7.y (np.ndarray) – Length-N sign vector at iteration
t.beta (float) – Iteration step parameter from Eq. (5)/(8).
- Returns:
y_new (np.ndarray) – Length-N sign vector at iteration
t + 1.
- mlsynth.utils.spcd_helpers.iteration_norm_spcd.run_norm_spcd_iteration(M_inv: ndarray, y0: ndarray, beta: float, max_iter: int) Tuple[ndarray, int, bool]#
Run the NormSPCD
while Converged doloop using Eq. (5)/(8).Implements the NormSPCD branch of the outer loop in Algorithm 1 (page 7) and Algorithm 2 (page 13). The loop terminates as soon as the sign vector stops changing between successive iterations.
- Parameters:
M_inv (np.ndarray) – N x N inverse of the iteration matrix from Eq. (2).
y0 (np.ndarray) – Initial sign vector from Spectral Initialization.
beta (float) – Iteration step parameter from Eq. (5)/(8).
max_iter (int) – Maximum number of iterations to perform before terminating.
- Returns:
y_final (np.ndarray) – Final sign vector
y* in {-1, +1}^N.n_iter (int) – Number of iterations actually performed.
converged (bool) –
Trueif the sign vector stabilized beforemax_iter.
Closed-form empirical weights (Algorithm 2 of the paper).
Implements Eq. (9) of the paper (page 13), the closed-form weight construction used by “Algorithm 2: Empirical Implementation of SPCD”:
- w = 2 (Y Y^T + alpha I + lambda 1 1^T)^{-1} y*
/ || (Y Y^T + alpha I + lambda 1 1^T)^{-1} y* ||_1
The paper notes that this expression is exactly the optimal weight
under the optimality conditions of Eq. (6) when the global solution
sign vector y* is known (see the remark “From [10] we know that
once the optimal design profile y* is obtained, then w is the optimal
design weight” on page 9). In practice, since Algorithm 1’s iteration
recovers the correct signs (Theorem 1) but not necessarily the exact
optimum of the convex QP (Eq. 6), this closed-form gives a fast
near-optimal approximation. The paper states: “In all the experiment
in this paper, we use this simplified implementation” (page 9, top).
Reference#
Lu, Y., Li, J., Ying, L., & Blanchet, J. (2022). “Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls.” arXiv:2211.15241v1.
- mlsynth.utils.spcd_helpers.weights_empirical.empirical_weights(M_inv: ndarray, y_star: ndarray) ndarray#
Compute the closed-form SPCD weights from Eq. (9).
Implements:
w = 2 * M^{-1} y* / || M^{-1} y* ||_1
where
Mis the iteration matrix from Eq. (2) andy*is the converged sign vector from Algorithm 1’s iteration.- Parameters:
M_inv (np.ndarray) – N x N inverse of the iteration matrix from Eq. (2).
y_star (np.ndarray) – Length-N final sign vector
y* in {-1, +1}^Nfrom the SPCD or NormSPCD iteration.
- Returns:
w (np.ndarray) – Length-N signed weight vector. Per the optimality condition noted on page 13 (“The optimality condition ensures sgn(w) = gamma”), the signs of
wshould agree withy_starwhenever the iteration has converged to a fixed point of the closed-form map.
Exact convex-QP weights (Algorithm 1 of the paper).
Implements Eq. (6) of the paper (page 7), the convex quadratic program
used by “Algorithm 1: Synthetic principal component Design” once the
sign vector gamma is fixed by the iteration:
Eq. (6):
min_{w_i >= 0} (1/T) sum_t (
sum_{i: gamma(i)=1} w_i Y_{it}
- sum_{i: gamma(i)=-1} w_i Y_{it}
)^2 + sigma sum_i w_i^2
s.t. w_i >= 0 for all i in [N],
sum_{gamma(i)=1} w_i = 1,
sum_{gamma(i)=-1} w_i = 1.
This is a convex QP and is solved exactly via cvxpy. sigma plays
the role of the ridge coefficient alpha from Eq. (2).
The paper notes that in their experiments they prefer the closed-form
approximation in Eq. (9) (see weights_empirical.py), but Eq. (6)
is the original optimization problem and is provided here for users
who want the exact Algorithm 1 weights.
Reference#
Lu, Y., Li, J., Ying, L., & Blanchet, J. (2022). “Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls.” arXiv:2211.15241v1.
- mlsynth.utils.spcd_helpers.weights_exact.exact_weights(Y_pre: ndarray, y_star: ndarray, sigma: float, solver: Any | None = None, verbose: bool = False) ndarray#
Solve Eq. (6) via cvxpy and return the signed weight vector.
- Parameters:
Y_pre (np.ndarray) – Pre-treatment outcome matrix of shape
(T_pre, N)(mlsynth convention; the paper’sY in R^{N x T}corresponds toY_pre.T).y_star (np.ndarray) – Length-N sign vector
gamma in {-1, +1}^Nfrom the iteration. Used to identify the treated and control groups in the constraints of Eq. (6).sigma (float) – Ridge coefficient
sigmafrom Eq. (6). Plays the same role asalphain Eq. (2).solver (optional) – Optional cvxpy solver name or object passed through to
cp.Problem.solve.verbose (bool) – Whether to print solver progress.
- Returns:
w (np.ndarray) – Length-N signed weight vector
(2 * indicator(gamma=1) - 1) * w_rawso that downstream consumers see one signed vector per unit. The unsigned positive weightsw_raw[i]sum to 1 within each group.- Raises:
MlsynthEstimationError – If cvxpy is not installed, or if the solver fails to find an optimal solution.
Treatment-effect assembly for SPCD (final step of Algorithms 1 & 2).
Implements the synthetic-paths construction and treatment-effect formula at the bottom of Algorithm 1 (page 7) and the bottom of Algorithm 2 (page 13):
Treat Unit i if gamma(i) = -sgn(sum_j gamma(j)) (minority rule)
tau_hat = sum_t (
sum_{gamma(i)=sgn(sum_j gamma(j))} w(i) Y_{i, T+t}
- sum_{gamma(i)=-sgn(sum_j gamma(j))} w(i) Y_{i, T+t}
)
The signed-weight convention used elsewhere in this module collapses
the above difference-of-sums into a single dot product
Y_post @ contrast_weights once the signs of contrast_weights
match the minority-flipped sign vector.
Reference#
Lu, Y., Li, J., Ying, L., & Blanchet, J. (2022). “Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls.” arXiv:2211.15241v1.
- mlsynth.utils.spcd_helpers.treatment_effect.apply_minority_flip(y_star: ndarray, raw_weights: ndarray) Tuple[ndarray, ndarray]#
Apply the minority-group convention from the bottom of Algorithm 1.
Per the paper (page 7, bottom of Algorithm 1):
- “Treat Unit i if gamma(i) = -sgn(sum_j gamma(j))
(To ensure the size of the treated group is smaller than the control group.)”
If the spectral iteration converges to a sign vector whose positive entries outnumber its negative entries, we flip the sign so that
+1corresponds to the minority (treated) group. The raw weights are flipped in lock-step so thatraw_weightscontinues to be the signed Eq. (9) (or Eq. 6) weights of the chosen treated and control groups.- Parameters:
y_star (np.ndarray) – Length-N sign vector
y* in {-1, +1}^Nfrom the iteration.raw_weights (np.ndarray) – Length-N signed weights from Eq. (9) or Eq. (6).
- Returns:
assignment_pm1 (np.ndarray) – Possibly sign-flipped version of
y_star, with+1marking the minority (treated) group.raw_weights_flipped (np.ndarray) – Possibly sign-flipped
raw_weightsconsistent with the new assignment.
- mlsynth.utils.spcd_helpers.treatment_effect.build_synthetic_paths(Y_pre: ndarray, Y_post: ndarray | None, treated_weights: ndarray, control_weights: ndarray) Tuple[ndarray, ndarray, ndarray]#
Form synthetic treated, synthetic control, and gap trajectories.
- Parameters:
Y_pre (np.ndarray) – Pre-treatment matrix of shape
(T_pre, N).Y_post (np.ndarray or None) – Post-treatment matrix of shape
(T_post, N), orNone.treated_weights (np.ndarray) – Length-N treated-group weights summing to 1.
control_weights (np.ndarray) – Length-N control-group weights summing to 1.
- Returns:
synthetic_treated (np.ndarray) – Synthetic treated trajectory over
T_pre + T_postperiods.synthetic_control (np.ndarray) – Synthetic control trajectory over
T_pre + T_postperiods.synthetic_gap (np.ndarray) – Difference
synthetic_treated - synthetic_control.
- mlsynth.utils.spcd_helpers.treatment_effect.build_weight_groups(assignment_pm1: ndarray, raw_weights: ndarray) Tuple[ndarray, ndarray, ndarray, ndarray]#
Split signed weights into treated/control/contrast components.
The signed weight vector returned by Eq. (9) or Eq. (6) already encodes group membership: positive entries belong to the treated group, negative entries to the control group. This helper turns that into the four arrays used downstream by the plotter and by the treatment-effect formula.
- Parameters:
assignment_pm1 (np.ndarray) – Length-N sign vector with
+1for treated units,-1for control units.raw_weights (np.ndarray) – Length-N signed weights.
- Returns:
selected_mask (np.ndarray) – 0/1 indicator of treated units (parity with the SYNDES design container’s
assignmentfield).treated_weights (np.ndarray) – Length-N weights restricted to the treated group; zero elsewhere. Normalized to sum to 1 over treated units.
control_weights (np.ndarray) – Length-N weights restricted to the control group; zero elsewhere. Normalized to sum to 1 over control units.
contrast_weights (np.ndarray) – Length-N signed contrast weights
treated_weights - control_weights. The synthetic gap isY_full @ contrast_weights.
- mlsynth.utils.spcd_helpers.treatment_effect.compute_att_and_fit(Y_pre: ndarray, Y_post: ndarray | None, treated_weights: ndarray, control_weights: ndarray) Tuple[float, float, float | None]#
Compute SPCD’s ATT and pre/post fit RMSEs.
Implements the
tau_hatformula at the bottom of Algorithm 1 (page 7) of the paper, averaged over the post-treatment horizon:ATT = (1/S) sum_{t=T+1..T+S} ( sum_{gamma(i)=+1} w(i) Y_{i,t} - sum_{gamma(i)=-1} w(i) Y_{i,t} ) = mean(synthetic_gap[post])
When no post-treatment matrix is provided, the SPCD design step is pure covariate-balancing and ATT is reported as 0.0.
rmse_premeasures the pre-period balance achieved by the design (the residual of Eq. (1)/(2) on the chosen sign vector); smaller is better.rmse_postmeasures the post-period dispersion of the synthetic gap, useful for sanity-checking convergence behavior.- Parameters:
Y_pre (np.ndarray) – Pre-treatment matrix of shape
(T_pre, N).Y_post (np.ndarray or None) – Post-treatment matrix of shape
(T_post, N), orNone.treated_weights (np.ndarray) – Length-N treated-group weights summing to 1.
control_weights (np.ndarray) – Length-N control-group weights summing to 1.
- Returns:
att (float) – Mean of the post-period synthetic gap, or
0.0whenY_postisNone.rmse_pre (float) – Root-mean-square synthetic gap over the pre-treatment period.
rmse_post (float or None) – Root-mean-square synthetic gap over the post-treatment period, or
NonewhenY_postisNone.
Pre-treatment estimation/holdout split helpers for SPCD inference.
Implements the train-on-E / calibrate-on-B discipline used by LEXSCM and
adapted here for SPCD. The first frac_E of the pretreatment window
is used to fit the design; the remaining periods are held out and used
to produce out-of-sample residuals for conformal inference and power
analysis.
Key guarantee#
With enable_inference=True the design is fit on Y_E only,
regardless of whether Y_post is supplied. This means a user who has
not yet treated (planning mode) and a user who has treated
(retrospective mode) both receive the same recommended design as long
as their pretreatment data are identical.
- mlsynth.utils.spcd_helpers.holdout.compute_holdout_residuals(Y_B: ndarray, contrast_weights: ndarray) ndarray#
Compute out-of-sample residuals on the holdout window.
The contrast weights were learned on
Y_Eonly, sor_B = Y_B @ contrast_weights
is the synthetic gap on data the optimization never saw. Under the linear-factor model with no treatment,
r_Bis a zero-mean noise series whose empirical distribution is used as the calibration set for conformal inference and the noise pool for Monte Carlo MDE.- Parameters:
Y_B (np.ndarray) – Holdout window of shape
(n_B, N).contrast_weights (np.ndarray) – Length-N signed weights from the SPCD design fit on
Y_E.
- Returns:
residuals_B (np.ndarray) – Length-
n_Bout-of-sample synthetic gap.
- mlsynth.utils.spcd_helpers.holdout.split_pre_window(Y_pre: ndarray, frac_E: float = 0.7, min_blank_size: int = 5) Tuple[ndarray, ndarray, int, int, bool]#
Split the pretreatment matrix into estimation (E) and holdout (B).
- Parameters:
Y_pre (np.ndarray) – Pretreatment outcome matrix of shape
(T_pre, N).frac_E (float) – Fraction of pretreatment periods to use for estimation. The remaining
1 - frac_Eperiods form the holdout window. Values outside[0.1, 0.95]are rejected.min_blank_size (int) – Minimum number of holdout periods required for inference to be meaningful. If
T_B < min_blank_sizethe function still returns a valid split (so that the design can be fit onY_E) but flagscan_infer=Falseso the caller can skip the inference / MDE machinery with a warning.
- Returns:
Y_E (np.ndarray) – Estimation window of shape
(n_E, N).n_E = floor(T_pre * frac_E).Y_B (np.ndarray) – Holdout window of shape
(n_B, N).n_B = T_pre - n_E.n_E (int) – Number of periods in the estimation window.
n_B (int) – Number of periods in the holdout window.
can_infer (bool) –
Trueifn_B >= min_blank_size,Falseotherwise.
- Raises:
MlsynthConfigError – If
frac_Eis out of range.MlsynthDataError – If the resulting estimation window has fewer than 2 periods, in which case SPCD cannot be fit at all.
Moving-block conformal inference for the SPCD post-period effect.
Adapts the moving-block conformal procedure from
mlsynth.utils.fast_scm_helpers.inference.compute_moving_block_conformal_ci
to the SPCD setting, where the test statistic is the post-period
synthetic-gap mean and the calibration set is the out-of-sample
residual vector r_B from the holdout window.
The procedure is exchangeability-based: confidence coverage holds in
finite samples under the assumption that overlapping blocks of r_B
have the same distribution as overlapping blocks of the post-period
gap under the null. This is a stronger assumption than IID noise and
weaker than perfect H0; in practice it is the standard for synthetic
control conformal inference.
- class mlsynth.utils.spcd_helpers.inference.SPCDConformalResult(att: float, ci_lower: float, ci_upper: float, p_value: float, pointwise_lower: ndarray, pointwise_upper: ndarray, block_size: int, alpha: float, method: str = 'moving_block_conformal')#
Container for the moving-block conformal output.
- ci_lower, ci_upper
Conformal CI for the ATT at level
1 - alpha.- Type:
- pointwise_lower, pointwise_upper
Period-by-period conformal bands of length equal to the post horizon.
- Type:
np.ndarray
- pointwise_lower: ndarray#
- pointwise_upper: ndarray#
- mlsynth.utils.spcd_helpers.inference.compute_conformal_ci(residuals_B: ndarray, post_gap: ndarray, alpha: float = 0.05, block_size: int | None = None, grid_size: int = 200) SPCDConformalResult#
Compute a moving-block conformal CI for the SPCD post-period ATT.
- Parameters:
residuals_B (np.ndarray) – Length-
n_Bout-of-sample residuals from the holdout window (Y_B @ contrast_weights).post_gap (np.ndarray) – Length-
Spost-period synthetic gap (Y_post @ contrast_weights).alpha (float) – Two-sided significance level. Coverage is
1 - alpha.block_size (int, optional) – Moving-block size. Defaults to
max(3, floor(sqrt(S))).grid_size (int) – Number of grid points to search for the CI inversion.
- Returns:
SPCDConformalResult
Pre-experiment power analysis for SPCD.
Adapts the Monte Carlo MDE machinery from
mlsynth.utils.fast_scm_helpers.power_helpers to operate directly on
the SPCD holdout residuals r_B. The MDE is the smallest treatment
effect tau such that the mean-absolute-effect statistic rejects the
null with probability at least power_target at significance
alpha.
Because SPCD’s post-period estimator is a fixed linear functional
tau_hat_t = sum_i contrast_weights[i] * Y[i, t], the noise
distribution of tau_hat_t under H0 is fully characterized by the
distribution of r_B – no further model-fitting is needed at the
power stage.
- class mlsynth.utils.spcd_helpers.power.SPCDPowerAnalysis(mde_tau: float, mde_pct: float, baseline: float, critical_stat: float, feasible: bool, n_post: int, alpha: float, power_target: float, detectability: Dict[int, float] | None = None, effect_grid_pct: ndarray | None = None, power_curve: ndarray | None = None)#
Container for MDE and detectability outputs.
- mde_tau#
Smallest absolute effect detectable at
power_target.np.nanwhen no point on the grid reaches the target.- Type:
- baseline#
Baseline level used for the percentage scaling (mean of the synthetic-treated trajectory over the holdout window).
- Type:
- critical_stat#
(1 - alpha)-quantile of the null distribution of the test statistic at horizonn_post.- Type:
- detectability#
Optional
horizon -> MDEmapping if a horizon grid was supplied;Noneotherwise.
- effect_grid_pct#
Grid of candidate effect sizes (as a percentage of
baseline) at which the power curve was evaluated, at horizonn_post.- Type:
np.ndarray or None
- power_curve#
Empirical power at each entry of
effect_grid_pct(same length). Together they give the full power-vs-effect curve; the MDE is the smallest effect whose power reachespower_target.- Type:
np.ndarray or None
- mlsynth.utils.spcd_helpers.power.compute_detectability_curve(residuals_B: ndarray, baseline: float, horizon_grid: List[int], alpha: float = 0.05, power_target: float = 0.8, n_sims: int = 5000, n_trials: int = 400, seed: int = 1400) Dict[int, float]#
Compute MDE as a function of post-treatment horizon length.
Useful for answering “how long does the experiment need to run to detect a target effect?” before committing to a treatment window.
- Parameters:
residuals_B (np.ndarray) – Out-of-sample holdout residuals.
baseline (float) – Baseline level for percentage scaling.
horizon_grid (list of int) – Horizons (number of post periods) at which to evaluate the MDE.
alpha, power_target, n_sims, n_trials, seed (see
compute_mde().)
- Returns:
dict – Mapping
horizon -> MDE in percent. Infeasible entries are recorded asnan.
- mlsynth.utils.spcd_helpers.power.compute_mde(residuals_B: ndarray, baseline: float, n_post: int, alpha: float = 0.05, power_target: float = 0.8, tau_grid: ndarray | None = None, n_sims: int = 5000, n_trials: int = 400, seed: int = 1400, store_curve: bool = True) SPCDPowerAnalysis#
Compute the minimum detectable effect via Monte Carlo.
Procedure follows
power_helpers._analytical_mdeinfast_scm_helpers:Build the null distribution of the test statistic at horizon
n_postby resamplingresiduals_B(padded with Gaussian draws to handle horizon overflow).Compute the critical value
c_alpha = quantile(null_stats, 1 - alpha).For each candidate
tauon the grid, drawn_trialspost-period vectors of the formtau + Gaussian noiseand record the fraction of trials whose statistic exceedsc_alpha. The smallesttauwith empirical power at or abovepower_targetis the MDE.
- Parameters:
residuals_B (np.ndarray) – Out-of-sample residuals on the holdout window.
baseline (float) – Baseline level used to express the MDE as a percentage. Typically the mean of the synthetic-treated trajectory over the holdout window.
n_post (int) – Post-treatment horizon for the MDE.
alpha, power_target (float) – Significance and power.
tau_grid (np.ndarray, optional) – Grid of candidate effect sizes (in absolute units). Defaults to
linspace(0, 5 * std(residuals_B), 60).n_sims, n_trials, seed (int) – Monte Carlo control.
- Returns:
SPCDPowerAnalysis
- mlsynth.utils.spcd_helpers.power.compute_pooled_average_mde(residuals_by_arm: Dict[object, ndarray], baselines_by_arm: Dict[object, float], sizes_by_arm: Dict[object, int], n_post: int, weights: str = 'size', alpha: float = 0.05, power_target: float = 0.8, n_sims: int = 5000, n_trials: int = 400, seed: int = 1400, horizon_grid: List[int] | None = None) SPCDPowerAnalysis#
MDE for the (size- or equal-weighted) average effect across arms.
Forms the pooled contrast series
g_t = sum_a w_a * r_B^(a)_ton time-aligned holdout residuals, then runs the ordinary single-seriescompute_mde()ong. Because the arms share calendar time, summing the aligned series makes the cross-arm correlation enter throughVar(g) = w' Sigma wautomatically – so one must pool the aligned series, never resample arms independently, or positive correlation is dropped and the MDE comes out over-optimistic.This answers “what average effect across arms can we detect?” and is the most powerful pooled question (averaging cancels idiosyncratic noise). Its estimand is the weighted-average effect, so heterogeneous, opposite-signed arm effects can cancel and be missed; use the per-arm analyses when individual-arm detection matters.
- Parameters:
residuals_by_arm (dict) –
{arm_label: r_B}out-of-sample holdout residual series. The series are truncated to the common (minimum) length and aligned from the start, which is correct when arms share the calendar time index.baselines_by_arm (dict) –
{arm_label: baseline}per-arm holdout baselines; pooled by the same weights for the percentage scaling.sizes_by_arm (dict) –
{arm_label: n_units}used forweights="size".n_post (int) – Post-treatment horizon for the MDE.
weights ({“size”, “equal”}) –
"size"weights each arm by its unit count (the population-average effect);"equal"weights arms equally.horizon_grid (list of int, optional) – If given, also computes the pooled-average MDE at each post-treatment horizon and attaches it as
detectability({horizon -> MDE percent}). This answers “how long must the study run to detect a target average effect?”.alpha, power_target, n_sims, n_trials, seed (see
compute_mde().)
- Returns:
SPCDPowerAnalysis – The MDE of the pooled average-effect contrast (with a
detectabilitycurve whenhorizon_gridis supplied).
- mlsynth.utils.spcd_helpers.power.default_detectability_grid(n_post: int, cap: int = 12) List[int]#
Default per-horizon detectability grid:
1 .. min(cap, n_post).
Assemble the standardized BaseEstimatorResults summary for SPCD.
Packages an SPCDDesign together with its preprocessed
SPCDInputs and optional inference / power outputs into the
project’s standardized result pydantic models defined in
mlsynth.config_models:
EffectsResults : ATT (mean post-period synthetic gap)
FitDiagnosticsResults : pre/post RMSE of the synthetic gap +
MDE / detectability in additional_metrics
TimeSeriesResults : synthetic treated/control/gap trajectories
WeightsResults : per-unit signed contrast weights
InferenceResults : conformal p-value, CI, pointwise bands
MethodDetailsResults : variant, weights mode, alpha/lam/beta, iters
The result is wrapped in a BaseEstimatorResults so SPCD’s
public results.summary matches the shape used by the rest of the
mlsynth estimator suite.
- mlsynth.utils.spcd_helpers.results_assembly.build_summary(design: SPCDDesign, inputs: SPCDInputs, conformal: SPCDConformalResult | None = None, power: SPCDPowerAnalysis | None = None) BaseEstimatorResults#
Build the standardized result bundle for an SPCD design.
- Parameters:
design (SPCDDesign) – Output of
solve_spcd_with_holdout()(orsolve_spcd()).inputs (SPCDInputs) – Preprocessed panel data used in the fit.
conformal (SPCDConformalResult, optional) – Moving-block conformal CI for the post-period ATT.
Nonewhen no post period is supplied or inference was disabled.power (SPCDPowerAnalysis, optional) – MDE / detectability output.
Nonewhen inference was disabled or the holdout window was too small.
- Returns:
BaseEstimatorResults – Bundle with
effects,fit_diagnostics,time_series,weights,inference, andmethod_detailspopulated. Fields are honest about absence: ATT isNone(not 0.0) when no post period is supplied.
Top-level SPCD solver — glues the four algorithm steps together.
The flow follows Algorithm 1 (page 7) and Algorithm 2 (page 13) of the paper, both of which share the same skeleton:
Formulation (Eq. 2) build_iteration_matrix()
Spectral Initialization spectral_initialization()
- while not Converged do run_spcd_iteration()
Pick ONE update box: or - SPCD (Eq. 4 / 7) run_norm_spcd_iteration() - NormSPCD (Eq. 5 / 8)
- Final weight step empirical_weights() OR exact_weights()
Algorithm 1 uses Eq. (6). Algorithm 2 uses Eq. (9).
- Minority-group flip + tau-hat apply_minority_flip(),
build_synthetic_paths()
The two iteration-box choices and the two weight-step choices are
exposed as independent options here. The combination
variant="norm_spcd" + weights="empirical" matches the algorithm
used in the paper’s Section 4 experiments (“In all the experiment in
this paper, we use this simplified implementation”, page 9).
Reference#
Lu, Y., Li, J., Ying, L., & Blanchet, J. (2022). “Synthetic Principal Component Design: Fast Covariate Balancing with Synthetic Controls.” arXiv:2211.15241v1.
- mlsynth.utils.spcd_helpers.orchestration.select_alpha_by_holdout(Y_pre: ndarray, *, variant: str = 'norm_spcd', weights: str = 'empirical', lam: float | None = None, beta: float | None = None, max_iter: int = 200, solver: Any | None = None, val_frac: float = 0.3, covariates: ndarray | None = None, covariate_weight: float = 1.0) float#
Choose the ridge
alphaby out-of-sample pre-period balance.The post-period RMSE of an SPCD design is a non-monotone, jumpy function of
alphawhenN > T_pre(the assignment is a discrete sign vector that flips asalphamoves), so no single closed-form estimate of the noise variance is robust. Instead, this fits the design on the first1 - val_fracof the pre-period and scores each candidatealphaby the RMS of the resulting contrast on the held-out pre-period tail, returning the best. Candidates are a multiplicative grid around the Gavish-Donoho noise-variance estimate (estimate_noise_variance()), which sets the correct scale; the holdout picks the robust point on the jumpy curve.Falls back to the bare noise-variance estimate when the pre-period is too short to spare a validation tail.
- Parameters:
Y_pre (np.ndarray) – Pre-treatment matrix of shape
(T_pre, N).variant, weights, lam, beta, max_iter, solver – Passed through to the design pipeline.
val_frac (float) – Fraction of the pre-period held out (from the end) for scoring.
- Returns:
float – Selected
alpha.
- mlsynth.utils.spcd_helpers.orchestration.solve_spcd(inputs: SPCDInputs, *, variant: str = 'norm_spcd', weights: str = 'empirical', alpha: float | None = None, lam: float | None = None, beta: float | None = None, max_iter: int = 200, solver: Any | None = None, verbose: bool = False, covariate_weight: float = 1.0) SPCDDesign#
Run SPCD end-to-end and return the resulting design.
- Parameters:
inputs (SPCDInputs) – Pre-treatment (and optional post-treatment) outcome matrices and index metadata.
variant ({“spcd”, “norm_spcd”}) – Iteration-box choice.
"spcd"selects Eq. (4)/(7),"norm_spcd"selects Eq. (5)/(8).weights ({“empirical”, “exact”}) – Final-weight-step choice.
"empirical"uses Eq. (9) (the closed-form approximation used in the paper’s experiments),"exact"solves Eq. (6) via cvxpy.alpha, lam, beta (float, optional) – Hyperparameters for Eq. (2) and Eqs. (4)/(5)/(7)/(8). The paper treats all three as pre-defined. When
alphaisNoneit is chosen byselect_alpha_by_holdout()(out-of-sample pre-period balance over a noise-scale grid);lamandbetadefault from the spectrum viaformulation.build_iteration_matrix. Pass an explicitalpha(e.g. a known noise variance) to bypass selection.max_iter (int) – Maximum iterations for the SPCD/NormSPCD while loop.
solver (optional) – Solver passed to cvxpy when
weights="exact". Ignored otherwise.verbose (bool) – Whether to print solver progress for
weights="exact".
- Returns:
SPCDDesign – Container with the final sign vector, weight components, synthetic paths, and diagnostics.
- mlsynth.utils.spcd_helpers.orchestration.solve_spcd_with_holdout(inputs: SPCDInputs, *, variant: str = 'norm_spcd', weights: str = 'empirical', alpha: float | None = None, lam: float | None = None, beta: float | None = None, max_iter: int = 200, solver: Any | None = None, verbose: bool = False, enable_inference: bool = True, holdout_frac_E: float = 0.7, inference_alpha: float = 0.05, power_target: float = 0.8, mde_n_sims: int = 5000, mde_n_trials: int = 400, mde_horizon_grid: list | None = None, inference_seed: int = 1400, min_blank_size: int = 5, covariate_weight: float = 1.0)#
Run SPCD with the train-on-E / calibrate-on-B inference flow.
When
enable_inference=Falsethis falls back tosolve_spcd()on the full pretreatment matrix and returns(design, None, None)for backwards compatibility with the pre-inference SPCD release.When
enable_inference=True:Split
inputs.Y_preintoY_E(firstfrac_E) andY_B(remainder).Fit the SPCD design on
Y_Eonly.Re-render the synthetic paths over the full timeline (
Y_E+Y_B+Y_post) so plots show the entire window.Compute holdout residuals
r_B = Y_B @ contrast_weights.Run the Monte Carlo MDE analysis on
r_B(always, since this is pre-experiment planning).If
Y_postis present, also compute the moving-block conformal CI for the ATT.
- Returns:
design (SPCDDesign) – The SPCD design, with synthetic paths spanning the full timeline. Fit on
Y_Eonly whenenable_inference=True.conformal (SPCDConformalResult or None) – Conformal CI for the post-period ATT.
NoneifY_post is Noneorenable_inference=Falseor the holdout window is too small.power (SPCDPowerAnalysis or None) – MDE / detectability output.
Noneifenable_inference=Falseor the holdout window is too small.
Plotting helpers for SPCD results.
- mlsynth.utils.spcd_helpers.plotter.plot_detectability(results: SPCDResults | SPCDMultiArmResults, ax: Axes | None = None)#
Plot the MDE as a function of post-treatment horizon (“MDE at time point t”) for each arm and the pooled whole-study contrast – the “how long must the study run?” view.
- Returns:
matplotlib.figure.Figure
- mlsynth.utils.spcd_helpers.plotter.plot_mde_bars(results: SPCDResults | SPCDMultiArmResults, ax: Axes | None = None)#
Bar chart of the percent MDE per arm, with the pooled whole-study MDE alongside (when present).
- Parameters:
results (SPCDResults or SPCDMultiArmResults) – A fitted SPCD result. Power analysis must have run (
enable_inference=True).ax (matplotlib Axes, optional) – Axes to draw on; a new figure is created if omitted.
- Returns:
matplotlib.figure.Figure
- mlsynth.utils.spcd_helpers.plotter.plot_power_curves(results: SPCDResults | SPCDMultiArmResults, ax: Axes | None = None)#
Plot the stored power-vs-effect curves for each arm and the pooled whole-study contrast. The MDE is where each curve crosses the
power_targetline.- Returns:
matplotlib.figure.Figure
- mlsynth.utils.spcd_helpers.plotter.plot_spcd_design(results: SPCDResults) None#
Plot synthetic treated and synthetic control trajectories for SPCD.
Mirrors the layout of
plot_relaxed_designinsyndes_helpers.plotterso SPCD and SYNDES plots feel consistent.- Parameters:
results (SPCDResults) – Output of
mlsynth.estimators.SPCD. Must haveinputsattached so that pre/post matrices can be stacked.- Raises:
MlsynthPlottingError – If the result has no attached
inputs.