Cluster Synthetic Controls (CLUSTERSC)#
Overview#
CLUSTERSC packages two paper-aligned families of robust synthetic control behind a single estimator.
PCR-SC. Rho, Tang, Bergam, Cummings & Misra (2025), ClusterSC: Advancing Synthetic Control with Donor Selection (arXiv:2503.21629). Hard Singular Value Thresholding (HSVT) of the donor matrix, optional \(k\)-means donor clustering on the right-singular-vector embedding, then ordinary least squares against the denoised donor matrix. Builds on the Robust Synthetic Control proposal of Amjad, Shah & Shen (2018) and the PCR consistency results of Agarwal, Shah, Shen & Song (2021). mlsynth exposes two alternative weight solvers on top of Algorithm 2: the Bayesian Robust Synthetic Control posterior over the weights (Amjad, Shah & Shen 2018) and a simplex-constrained variant that retains Abadie-Diamond-Hainmueller weights.
RPCA-SC. Bayani (2021), Robust PCA Synthetic Control (arXiv:2108.12542; Chapter 1 of Bayani 2022). Functional PCA on pre-period trajectories, silhouette-driven \(k\)-means on the resulting FPC scores, robust \(L + S\) decomposition of the treated unit’s cluster via Principal Component Pursuit (Candes, Li, Ma & Wright 2011) or half-quadratic non-convex regularisation (Wang, Li, So & Liu 2023), then non-negative least squares against the low-rank donor matrix.
Either family can be selected via CLUSTERSCConfig.method
("pcr", "rpca", or "both"); when both run, the
CLUSTERSCConfig.primary field selects which fit drives the
convenience aliases (att, counterfactual, gap,
donor_weights) on the result object.
Mathematical Formulation#
Setup#
We observe a single treated unit (indexed 1) and \(J\) controls
over \(T\) periods, with treatment starting at \(T_0 + 1\).
Stack the donor outcomes as \(Y_0 \in \mathbb{R}^{T \times J}\)
(columns = donors) and write \(y_1 \in \mathbb{R}^{T}\) for the
treated series. Pre-period slices are
\(Y_0^- \in \mathbb{R}^{T_0 \times J}\) and \(y_1^- \in
\mathbb{R}^{T_0}\); post-period slices carry the + superscript.
Treatment-effect targets are the per-period gaps
\(\tau_t = y_{1, t} - \hat y^0_{1, t}\) and the average
treatment effect on the treated,
Both families assume the untreated potential outcome has an approximately low-rank structure \(Y^0 = M + E\), where \(M_{i,t} = g(\theta_i, \rho_t)\) is a deterministic latent factor signal (Athey et al. 2021) and \(E\) collects zero-mean idiosyncratic noise.
PCR-SC family (Rho et al. 2025)#
The PCR family combines three building blocks: rank-\(r\) HSVT denoising of the donor matrix, an optional donor-clustering pre-step, and a weight solver. Algorithm 2 of the paper sets out the basic pipeline; Algorithm 4 wraps it with the clustering step.
Why PCR: theory and assumptions#
What PCR is, and why it is the right tool. Principal component regression (project the donors onto their top-\(r\) principal components, then regress) is, without any change, robust to the noise and missingness that pervade panel data [Agarwal2021]. The key identity is that PCR is equivalent to ordinary least squares after hard singular-value thresholding (HSVT) of the covariate (donor) matrix (Agarwal et al. 2021, Prop. 2.1): retaining the top-\(r\) PCs and regressing yields the same fitted response as \(\mathrm{HSVT}_r\) followed by OLS. That HSVT step implicitly de-noises the donors by projecting them onto the rank-\(r\) signal subspace, which is exactly why the synthetic control inherits robustness to the idiosyncratic shocks every donor series carries. In the synthetic-control language this estimator is Robust Synthetic Control (Amjad-Shah-Shen [Amjad2018]); Agarwal et al. prove PCR and RSC are identical, so the HSVT+OLS path here carries RSC’s guarantees.
Model (error-in-variables). In the synthetic-control setting [ClusterSC] formalizes the donor panel as a noisy, partially observed view of a low-rank signal, \(Y_0 = M + E\), where \(E\) collects mean-zero idiosyncratic shocks and (optionally) missing cells, and the treated series obeys an approximate linear synthetic control on the signal,
with response noise \(\varepsilon_t\) (mean zero, variance \(\le \sigma^2\)) and a deterministic model-mismatch term \(\phi_t\).
Assumptions.
Assumption 1 (low-rank latent-factor signal). The signal is generated by a latent variable model \(M_{i,t} = g(\theta_i, \rho_t)\) with \(g\) Lipschitz in finite-dimensional unit factors \(\theta_i\) and time factors \(\rho_t\), and bounded entries (\(\|M\|_\infty \le 1\)). This forces \(M\) to be (approximately) low rank, \(\mathrm{rank}(M) = r = O(\log T) < T\). Remark. This is the assumption that lets PCR/HSVT work at all – and it earns the synthetic control rather than assuming it: under this model an (approximate) linear SC \(f^\star\) provably exists (Agarwal et al. 2021, Prop. 4.1), so the existence of a synthetic combination need not be imposed as an axiom as in classical SC.
Assumption 2 (error-in-variables donors). We observe \(Z_{jt} = M_{jt} + \eta_{jt}\) with each cell present independently with probability \(\rho\) (missing otherwise); the covariate noise \(\eta\) has independent, mean-zero, \(\psi_\alpha\) (sub-exponential-type) rows with bounded variance. Remark. SC is inherently an error-in-variables problem – donor series are noisy proxies for their signal – which is precisely the regime where regressing on the raw donors (or Lasso/Ridge) loses consistency; the HSVT pre-step is what restores it.
Assumption 3 (approximate linear SC). The treated signal lies (approximately) in the span of the donor signals, \(y_1 = M^\top f^\star + \varepsilon + \phi\), with bounded \(\|f^\star\|\) and deterministic mismatch \(\phi\). Remark. This relaxes the classical convex-hull (non-negative, sum-to-one) restriction: the natural solver is unconstrained OLS on the denoised donors (no simplex), which is the default below; the SIMPLEX option re-imposes the convex hull when interpretability is preferred.
Assumption 4 (independent noise sources). The response noise \(\varepsilon\), covariate noise \(\eta\), and the missingness pattern are mutually independent (Agarwal et al. 2021, Rmk. 3.2). Remark. PCR is noise-model-agnostic: unlike the error-in-variables literature it needs no knowledge of the noise covariance, which is what makes it practical for panels with unknown shock structure.
Finite-sample guarantees. Under Assumptions 1-4 the pre-period (training) error of the HSVT+OLS estimator decomposes into three interpretable pieces (Agarwal et al. 2021, Thm. 3.1):
where the donor-corruption term is controlled by a novel \(\ell_{2,\infty}\)-norm bound on HSVT (Lemma 3.1, stronger than the usual Frobenius bound). With observation fraction \(\rho\), the overall pre-period rate is \(\rho^{-4} r / \min(T_0, J) + \|\phi\|^2/T_0\) (Cor. 3.1) – i.e. PCR matches, up to log factors, the minimax OLS rate one would get with perfectly observed donors, despite seeing only noisy, partially observed ones. The post-period (test) error is bounded by the training error plus a generalization penalty scaling as \(r^{5/2}/\sqrt{T_0}\) (Thm. 3.2), which is exactly what justifies the data-driven rank selection (cumulative-variance / USVT rules) used below: choose \(r\) to trade pre-period fit against this complexity penalty.
HSVT denoising. mlsynth follows the Amjad-Shah-Shen (2018)
convention and applies HSVT to the pre-period donor matrix only.
The full-matrix variant proposed in Rho et al. (2025) Algorithm 2
(SVD on the entire \((T, J)\) panel, then slice the pre-period
rows) leaks post-period donor information into the rank-\(r\)
reconstruction, which can wash out the very post-period deviations
the synthetic control is meant to detect. The user can opt into
the full-matrix variant via CLUSTERSCConfig.project_denoised,
in which case HSVT is also applied to \((Y_0)\) for the
projection step.
Let \(Y_0^- = U \Sigma V^\top\) be the SVD of the pre-period donor matrix. The rank-\(r\) hard truncation
isolates the low-rank signal \(M^-\). The truncation rank can be chosen three ways:
Cumulative variance (default; paper Section 6.1) – smallest \(r\) with \(\sum_{i \le r} \sigma_i^2 / \sum_i \sigma_i^2 \ge\)
cumvar_threshold(default \(0.95\)).Explicit rank. Caller passes \(r\) directly via the
rankparameter; the dispatcher promotesrank_method="fixed"automatically.USVT (Chatterjee 2015; Donoho-Gavish 2014). Universal threshold; preserved for back-compatibility with Amjad et al. 2018.
For the data-driven rules ("cumvar" and "usvt"), the
spectral comparison is computed on the column-centred donor
matrix by default (standardize_for_rank=True). Each donor is
demeaned over time; the variance scale is left untouched.
Otherwise the leading singular value of an uncentered panel
(e.g. cigarette sales in absolute units) absorbs the overall
scale and dwarfs the remaining components, leaving
cumvar_threshold unable to discriminate. Full z-scoring
(also dividing by the donor standard deviations) is avoided
because it equalises donor variances and artificially inflates
the rank that a cumvar threshold picks. Centring is only used
for rank picking – the HSVT step itself still consumes the raw
matrix so the counterfactual is returned in original units.
The same rank is applied to the full \(Y_0\) so that the post-period projection (Algorithm 4 Step 5) consumes the denoised matrix \(\widetilde M = \mathrm{HSVT}_r(Y_0)\).
Donor clustering (Algorithm 3). When clustering=True the
estimator clusters donors on the rows of
\(\widetilde U = U \Sigma_r\). With \(k\) chosen by the
silhouette coefficient (Rousseeuw 1987) over
\(k \in [2, k_{\max}]\), run \(k\)-means and embed the
treated unit via \(\tilde u = V_r^\top y_1^-\); the donor pool
becomes the cluster minimising \(\|c_\ell - \tilde u\|_2\)
(Algorithm 4 Step 2). The selected donor sub-matrix is denoised
again at the same rank before the weight step.
OLS weights (Algorithm 2 Step 3). Drop the simplex constraints of canonical synthetic control and solve
with \((\widetilde M^-)^{+}\) the Moore-Penrose pseudo-inverse.
Appendix E of the paper compares this OLS path to ridge / lasso
variants; mlsynth exposes optional elastic-net knobs
(lambda_penalty, p, q) for the same purpose.
The counterfactual is
Algorithm 4 Step 5 of Rho et al. (2025) writes the projection
through the denoised matrix \(\widetilde M\). For the OLS
solver the two are mathematically identical – \(\widehat f\)
lies in the column space of \(V_r\), so the discarded
high-order components annihilate. They differ for the Bayesian and
SIMPLEX solvers below, where \(\widehat f\) is not constrained
to the rank-\(r\) subspace. Set project_denoised=True to
recover the paper-strict projection.
mlsynth extensions#
Two paper-extensible weight solvers live alongside the OLS default:
Bayesian PCR (
estimator="bayesian"). This is the Bayesian Robust Synthetic Control of Amjad, Shah & Shen [Amjad2018]: replace the point-estimate OLS with a Gaussian posterior over the weights (Bayesian linear regression on the HSVT-denoised donors), which yields calibrated uncertainty directly rather than by resampling. Concretely,\[f \mid y_1^-, \widetilde M^- \sim \mathcal N\bigl(\mu_n, \Sigma_n\bigr), \quad \Sigma_n = \bigl( \sigma_e^{-2} (\widetilde M^-)^\top \widetilde M^- + \alpha_0 I \bigr)^{-1}, \quad \mu_n = \sigma_e^{-2} \Sigma_n (\widetilde M^-)^\top y_1^-,\]with \(\sigma_e^2 = \mathrm{Var}(y_1^-)\) and prior precision \(\alpha_0\).
n_bayes_samplesposterior draws are propagated through \(\widetilde M\) to yield per-period credible bands at level \(1 - \alpha\). Aggregated over the post period, the implied ATT credible interval is reported onCLUSTERSCInference.Convex PCR (
pcr_objective="SIMPLEX"). Keep the HSVT denoising of the donor matrix but solve the classical Abadie-Diamond-Hainmueller (2010) program,\[\widehat f = \arg\min_{f \in \Delta_J} \bigl\| \widetilde M^- f - y_1^- \bigr\|_2^2, \qquad \Delta_J = \bigl\{ f \in \mathbb{R}_{\ge 0}^J : \mathbf 1^\top f = 1 \bigr\}.\]Useful when the user wants non-extrapolation and an interpretable convex-combination donor weighting on top of the PCR denoising.
RPCA-SC family (Bayani 2021)#
The RPCA family follows the same five-step skeleton as PCR-SC but swaps the feature step, the denoising step, and the weight constraints. Algorithm 4 of the paper is the orchestrator.
Step 1 – Functional PCA on pre-period trajectories. Each unit’s pre-period series \(y_j^- (t)\) is smoothed via a cubic B-spline expansion and projected onto the FPC basis \(\{\phi_k\}_{k=1}^K\) (Li, Wang & Carroll 2010):
The rank \(K\) is the smallest integer whose cumulative
spectral energy meets fpca_cumvar (paper default \(0.95\)).
The scores are standardised before clustering.
Step 2 – :math:`k`-means clustering. Apply Hartigan-Wong (1979) \(k\)-means to the FPC scores with \(k\) chosen by the silhouette coefficient
where \(a(j)\) is the mean distance from unit \(j\) to other members of its own cluster and \(b(j)\) the mean distance to the closest neighbouring cluster. The donor pool \(\mathcal{D} \subseteq \{2, \dots, J + 1\}\) is the set of non-treated units sharing the treated unit’s cluster.
Step 3 – Robust PCA on the donor pool. Stack the cluster donor outcomes into \(Y_{-1} \in \mathbb{R}^{|\mathcal{D}| \times T}\). mlsynth offers two robust decompositions \(Y_{-1} = L + S\).
PCP (Candes et al. 2011) replaces the NP-hard \(\min_{L, S} \mathrm{rank}(L) + \lambda \|S\|_0\) with the convex relaxation
solved by ADMM with default penalties \(\lambda = 1 / \sqrt{\max(|\mathcal D|, T)}\) and \(\mu = |\mathcal D| T / (4 \sum_{i, t} |Y_{-1, it}|)\) per Bayani Section 2.4. Each iteration alternates a singular-value soft-threshold update on \(L\), an entry-wise soft-threshold update on \(S\), and the standard dual ascent on the multiplier \(\Lambda\).
HQF (Wang et al. 2023) instead factors \(L = UV\) directly and alternates two Tikhonov-regularised least-squares updates,
with the sparse component updated by a median-absolute-deviation
threshold on the residual (controlled by hqf_ip). The rank
defaults to the smallest \(r\) whose cumulative singular-value
energy meets hqf_cumvar (Bayani uses \(0.999\)).
Step 4 – Non-negative least squares. Let \(L^- \in \mathbb{R}^{|\mathcal{D}| \times T_0}\) be the pre-period slice of the low-rank component. The weights solve
The simplex constraint of canonical synthetic control is deliberately dropped: the clustering step has already restricted the donor pool to behaviourally similar units, so non-negativity suffices to keep the counterfactual interpretable (Bayani 2021, Section 2.4).
Step 5 – Projection. Counterfactual and ATT come from the denoised donor matrix,
RPCA-SC tuning via leave-one-time-out cross-validation#
Bayani (2021) takes the Candes-Li-Ma-Wright (2011) optimal-recovery value \(\lambda^\star = 1/\sqrt{\max(|\mathcal D|, T)}\) for PCP, and chooses the HQF factorisation rank by an explained-variance threshold. Both defaults are conservative for L/S decomposition identifiability but can be poor for counterfactual prediction. On the California Proposition 99 panel the Candes default leaves PCP under-regularised by roughly 2x and HQF stops at a rank that loses ATT magnitude.
mlsynth therefore exposes two opt-in cross-validation tuners
(CLUSTERSCConfig.cv_lambda for PCP,
CLUSTERSCConfig.cv_hqf_rank for HQF). The same
algorithm drives both:
Let \(L \in \mathbb{R}^{|\mathcal D| \times T}\) be the low-rank component returned by the robust PCA solver at candidate hyperparameter \(\theta\). The pre-period slice \(L^- \in \mathbb{R}^{|\mathcal D| \times T_0}\) is the design matrix for the NNLS weight step. Define the leave-one-time-period-out mean squared error
where the \(-t\) subscript drops the \(t\)-th pre-period column. The donor matrix \(Y_{-1}\) is fully observed at every period so the RPCA decomposition is run once per \(\theta\), not \(T_0\) times. The chosen \(\widehat \theta\) minimises \(\mathrm{CV}(\theta)\) over a grid:
For PCP, the grid is
cv_lambda_multipliers\(\times \lambda^\star\) with default multipliers \((0.5, 1, 2, 3, 5, 8, 12)\).For HQF, the grid is the integer ranks \(\{1, 2, \dots, \min(|\mathcal D|, T_0 - 1)\}\).
On the Proposition 99 panel, cv_lambda=True picks
\(\widehat \lambda = 2 \lambda^\star\) and cuts pre-period RMSE
in half (2.11 → 1.08, ATT moving from −15.5 to −17.7);
cv_hqf_rank=True picks \(\widehat r = 8\) (vs. cumvar
default of 4) with ATT moving from −12.8 to −16.6. Both ATTs land
in the canonical SCM / RSC range of −19 to −24.
When the assumptions bind: practical diagnostics#
Assumptions 1-4 above (Agarwal et al. 2021, transcribed in the
PCR-SC family) are written as structural conditions. Here is how
each shows up in real data and what to check in the
CLUSTERSCResults container before trusting the
counterfactual.
Low-rank latent-factor signal (A1). The donor pool’s untreated mean \(M\) is approximately low rank. If the panel is genuinely full-rank or the spectrum decays slowly, HSVT throws away signal and the OLS step over-fits whatever is left.
Plausibly violated when donor series are idiosyncratic in a way that no factor model can absorb – think 200 firm-level time series where each firm has its own unrelated business cycle. Diagnostic: compute the spectral-energy share of the top-\(r\) singular values of the column-centred donor matrix; if you need \(r \ge \min(T_0, J) / 2\) to capture \(90\%\) of the energy, the low-rank story is failing. On the Prop 99 panel the top-1 singular value carries \(\approx 99\%\) of the energy – the exact regime where PCR-SC dominates. When the spectrum decays slowly, fall back to canonical SC (canonical SCM, Two-Step Synthetic Control) or to a balancing-aware estimator (MicroSynth (User-Level Balancing SC) for unit-level data; Factor Model Approach (FMA) for factor-model-aware estimation).
Error-in-variables donors (A2). Donor entries are noisy / partially-observed views of the signal. The HSVT step is precisely what restores consistency when this assumption holds. The condition fails softly when the noise is sparse and heavy-tailed instead of Gaussian, hard when missingness is non-random (donor entries missing because the donor deviates from the signal).
Plausibly violated when you have a panel with structural outliers (a donor with a one-time policy shock spike) rather than i.i.d. Gaussian noise. Diagnostic: residualise each donor against the rank-\(r\) HSVT reconstruction and histogram the residuals; sparse heavy tails are a red flag. The fix is the RPCA-SC family – robust \(L + S\) decomposition explicitly separates the low-rank signal from the sparse outliers, so heavy-tailed donor noise is absorbed into \(S\) rather than contaminating \(L\).
Approximate linear SC (A3). The treated signal lies (approximately) in the span of the denoised donor signals. PCR-SC uses unconstrained OLS so this relaxes the canonical convex-hull restriction – but the model-mismatch term \(\phi_t\) (the gap between the true treated mean and the closest linear combination of donor signals) still bounds the finite-sample error.
Plausibly violated when the treated unit is structurally outside the donor pool – coastal vs. interior states, a tech-led economy with only commodity-led donors. Diagnostic: inspect
res.pcr.pre_rmseagainst the donor noise floor (the mean leave-one-out pre-RMSE across donors). If treated pre-RMSE is substantially higher than the donor noise floor, the mismatch \(\phi_t\) is large and PCR-SC’s unbiasedness argument breaks. For the structurally outside- hull case, switch to Imperfect Synthetic Controls (ISCM) (whose A2(b) mechanism identifies the effect through donors that use the treated unit as a donor).Independent noise sources (A4). Response noise, donor noise, and the missingness pattern are mutually independent. The PCR-SC consistency results assume the missingness pattern is data-independent.
Plausibly violated when donor observations are missing because of the outcome value (e.g. countries stopping reporting GDP during recessions). Diagnostic: cross-tabulate missingness with the pre-period outcome quantiles; if missing entries are concentrated in the tails, the assumption fails. Multiple imputation pre-step or a missing-data-aware estimator is the fix; mlsynth’s Synthetic Nearest Neighbors / Causal Matrix Completion (SNN) is built for missingness- informative panels.
Cluster structure (Rho et al. 2025, ClusterSC-specific). The donor pool decomposes into latent subgroups distinguishable on the right-singular-vector embedding. Clustering only helps if such subgroups actually exist and the silhouette statistic can find them.
Plausibly violated when the donor pool is genuinely homogeneous – a tight set of comparable units already pre-screened by the analyst. Diagnostic: read
res.pcr.metadata['k_clusters']; if the silhouette picks \(k = 1\), clustering adds no value and you should setclustering=Falseto use the full pool. If \(k > 1\) is picked but the post-fit ATT moves substantially when you flipclusteringoff and on, the cluster structure is spurious (cluster boundaries are within the noise floor) and the un-clustered fit is the more honest one.Functional smoothness (Bayani 2021, RPCA-SC-specific). Pre-period trajectories admit a parsimonious FPCA basis. If trajectories are dominated by high-frequency jagged noise the cubic-spline FPCA pre-step throws away most of the signal.
Plausibly violated when the outcome is noisy at the observation frequency (daily stock prices, hourly sensor readings). Diagnostic: read
res.rpca.metadata['fpca_components']; if the silhouette or cumulative-variance step keeps a large number of FPC components, the smoothness assumption is failing. Aggregate to a coarser time grid (weekly, daily) before refitting, or move to a stationary-cycle estimator (Synthetic Business Cycle (SBC)) that handles unsmoothed series natively.NNLS-friendly truth (Bayani 2021, RPCA-SC-specific). RPCA-SC ends with a non-negative least-squares step. If the true counterfactual is best represented as an extrapolation (some donor weights would naturally be negative), NNLS cannot reach it.
Plausibly violated when the treated unit is at the edge of the donor distribution and the un-clustered PCR-SC OLS fit returns substantial negative weights. Diagnostic: refit with PCR-SC + OLS (
method="pcr",pcr_objective="OLS") and inspect the weight vector. If many large negative weights appear, the convex-combination restriction is binding and RPCA-SC is throwing away identification.
When to use PCR-SC, RPCA-SC, or neither#
The four papers behind CLUSTERSC chip at different parts of the canonical SC pipeline. The decision logic for picking among them:
Reach for PCR-SC + clustering (the CLUSTERSC default) when:
The donor pool is large and noisy (\(J \gtrsim T_0\), disaggregated panels with hundreds of donors), and the donor matrix has a clear low-rank spectrum. This is the Rho et al. (2025) regime – the empirical case studies are individual- level health records and disaggregated economic panels.
The treated unit comes from a plausible subgroup of the donor pool that you cannot easily isolate by hand (similar patient phenotype, similar state-economy composition). The silhouette-driven \(k\)-means step formalises the subgroup decision.
You’re willing to let weights be negative in the un-clustered OLS solve (Assumption 3 above) for tighter pre-fit and lower bias – and you accept that “California = +0.4 Utah +0.3 Montana -0.1 Tennessee” is a defensible weight story for your application.
Reach for PCR-SC without clustering (i.e., classic Amjad-Shah-Shen RSC) when:
You have a moderate-size donor pool that you have already pre-screened, and you mostly want the HSVT denoising step to protect against donor-matrix noise / missingness. Pre-screening has done the work that clustering would do.
You want the Shen-Ding-Sekhon-Yu (2023) closed-form CIs. These are wired in for
pcr_objective="OLS"and give you proper frequentist HZ/VT/DR-source standard errors without a bootstrap; the Bayesian path delivers credible bands.The empirical setting is one of the canonical aggregate SC case studies (Prop 99, Basque, etc.) where the donor pool is small (J ≤ 40) and a hand-picked subgroup already exists.
Reach for RPCA-SC when:
The donor matrix has sparse heavy-tailed outliers rather than uniform Gaussian noise – a few donors with one-time policy shocks, structural breaks, or recording errors. The \(L + S\) decomposition explicitly absorbs these into \(S\), leaving a clean \(L\) for the weight fit. PCR-SC’s HSVT is an \(L_2\)-based denoiser and is less robust to this exact regime.
Pre-period trajectories are smooth functional curves (annual GDP, monthly population) where the FPCA basis is a natural language for the donor pool. RPCA-SC’s Step 1 was built for this.
You want non-negative interpretable weights – the NNLS step at the end produces a sparse convex-combination story similar to canonical SC, while still benefiting from the robust-PCA denoising. The German-reunification application is the canonical case (Norway 0.48, France 0.35, New Zealand 0.30, Austria 0.02).
You want Cattaneo-Feng-Titiunik (2021) prediction intervals – mlsynth’s CFT port runs on RPCA-SC, providing in-sample bootstrap + Hoeffding out-of-sample bands.
Use PCR-SC over RPCA-SC when:
The noise is Gaussian / sub-Gaussian (PCR-SC’s HSVT is optimal under this), the donor matrix is high-dimensional (clustering helps), and you want negative weights / closed- form Shen CIs.
You need speed – PCR-SC is one SVD + one OLS, while RPCA-SC’s PCP is an ADMM loop and HQF is an iterative factorisation. Differences are small on Prop99-size panels; they bite on disaggregated panels with thousands of donors.
Use RPCA-SC over PCR-SC when:
The donor matrix has visible sparse outliers (one or two donors with a single shock period that would dominate the HSVT spectrum). RPCA’s \(L + S\) decomposition is built for exactly that pattern.
Trajectories are smooth and you want the donor-pool reduction that FPCA + \(k\)-means delivers ahead of the weight fit (this is what selects the 11-economy West-Germany cluster from the 17-country OECD pool).
You want a non-negative interpretable weight vector for policy storytelling.
Do not use either family when:
The donor matrix has no low-rank structure. Both PCR-SC and RPCA-SC depend on this. With high-dimensional donor pools where the spectrum decays slowly, switch to a balancing-aware estimator (MicroSynth (User-Level Balancing SC) if you have user-level data; Factor Model Approach (FMA) if a factor-model fit is defensible).
The treated unit is structurally outside the donor convex hull in a way that no linear combination can capture – even with negative weights, the pre-RMSE stays large. Switch to Imperfect Synthetic Controls (ISCM), whose A2(b) mechanism identifies the effect through donors that use the treated unit as a positive- weight donor in their own synthetic controls.
Distributional questions (Lorenz curves, QTEs, tail changes). PCR-SC and RPCA-SC target the mean ATT. Use Distributional Synthetic Control (DSC) instead.
Continuous or multi-valued treatment. The CLUSTERSC pipeline encodes a single binary treatment indicator. Continuous dose belongs in Continuous-Treatment Synthetic Control (CTSC).
Spillovers / interference across donors. The low-rank signal model assumes donors are untreated and independent of the treatment of unit 1. Spillovers break this; use Spillover-Aware Synthetic Control (SPILLSYNTH) or Spatial Synthetic Difference-in-Differences (SpSyDiD).
Tiny donor pool (\(J \le 10\)) and a tight canonical- SC pre-fit. The denoising and clustering machinery is overkill; both add variance without identification gain. Use canonical SCM, Two-Step Synthetic Control, or Forward Difference-in-Differences (FDID).
Staggered adoption with a long mixed-treatment pre-period. CLUSTERSC assumes a common pre-period free of treatment. Use FECT or Synthetic Difference-in-Differences (SDID).
You need the donor weight vector to be a single sparse convex combination as the headline story (and you’re not willing to use RPCA-SC’s NNLS variant). The PCR-SC default is the unrestricted OLS solve; the SIMPLEX variant re-imposes the convex hull at the cost of the denoising-bias gain. If the sparse convex combination is the deliverable, canonical SCM or Two-Step Synthetic Control are the more honest default.
Inference#
CLUSTERSC.fit() returns an
EffectResult: the flat accessors (res.att,
res.counterfactual, res.gap, res.att_ci, res.donor_weights,
res.pre_rmse) and the standardized sub-models (res.effects,
res.time_series, res.weights, res.inference, res.fit_diagnostics,
res.method_details) are populated from the primary variant. The rich,
estimator-specific inference object lives on res.cluster_inference (its
scalar ATT interval is mirrored into the standardized res.inference), and the
two family fits remain available side by side on res.pcr / res.rpca.
Three inference families are wired into CLUSTERSCInference
(accessed via res.cluster_inference):
Frequentist PCR – Shen-Ding-Sekhon-Yu (2023) closed-form CIs. Default on for
estimator="frequentist"andpcr_objective="OLS". See the next subsection.Bayesian PCR – posterior credible interval. Computed when
estimator="bayesian"from posterior draws of the counterfactual (the Bayesian Robust Synthetic Control of Amjad, Shah & Shen [Amjad2018]).RPCA-SC – Cattaneo-Feng-Titiunik (2021) prediction intervals. Opt-in via
CLUSTERSCConfig.compute_cft_pi. See the dedicated subsection below.
Shen-Ding-Sekhon-Yu (2023) frequentist CIs for OLS PCR#
For the symmetric estimator class – OLS minimum \(\ell_2\)-norm, PCR, and ridge – Theorem 1 of Shen et al. (2023) shows that the horizontal (HZ) and vertical (VT) regression formulations give algebraically identical point estimates. The two formulations nevertheless quantify uncertainty against different generative models:
HZ model (Assumption 1). Each donor’s post-period outcome is a noisy linear combination of its own pre-period values:
\[Y_{i, T} = \sum_{t \le T_0} \alpha^*_t Y_{i, t} + \varepsilon_{i, T}, \quad i = 1, \dots, N_0.\]The randomness lives in the cross-sectional dimension.
VT model (Assumption 2). The treated unit’s pre-period outcome is a noisy linear combination of the donors’ pre-period values:
\[Y_{N, t} = \sum_{i \le N_0} \beta^*_i Y_{i, t} + \varepsilon_{N, t}, \quad t = 1, \dots, T_0.\]The randomness lives in the time-series dimension.
DR model (Assumption 3). Both sources of randomness are present.
Each model yields a distinct estimand and a distinct asymptotic variance for the same point estimate \(\widehat Y_{N, T}(0)\) (Theorem 3). With rank-\(k\) HSVT projections \(H^u_\perp = I - U_k U_k^\top\) (donor-space) and \(H^v_\perp = I - V_k^\top V_k\) (time-space), and the homoskedastic variance plug-ins (paper eq 19):
(where \(R = \mathrm{rank}(Y_0)\) after truncation), the per-period variance estimators are
with \(\widehat A = \widehat \sigma^2_{\mathrm{hz}} \widehat \sigma^2_{\mathrm{vt}}\, Y_0^{+} (Y_0^\top)^{+}\) the interaction term. The \((1 - \alpha)\) CI under source \(s \in \{\mathrm{hz}, \mathrm{vt}, \mathrm{dr}\}\) is
mlsynth also ports the jackknife and HRK (Hartley-Rao-Kish) variance estimators from var.py in the authors’ reference repository. The HRK estimator is only valid when \(\max_i (1 - (H_\perp)_{ii}) < 1/2\) for both projections; mlsynth checks this and raises if violated.
For multi-period extrapolation the procedure runs per post-period: at each \(t > T_0\) the donor outcomes \(y_t\) change but the projections, weights, and per-period variances are recomputed from the same fitted weight pair. The ATT is the mean of per-period gaps \(\widehat{ATT} = (T - T_0)^{-1} \sum_{t > T_0} \widehat \tau_t\), and its variance is aggregated assuming independence across post-periods:
The independence assumption is the standard first-pass; serially correlated shocks would inflate the true variance.
Config knobs: CLUSTERSCConfig.compute_shen_ci (default
True) toggles the inference; CLUSTERSCConfig.shen_variance
chooses "homoskedastic" (default), "jackknife", or
"hrk". The output CLUSTERSCInference.shen carries
per-period gaps, per-period CIs under each source, and the
aggregated ATT CIs.
Cattaneo-Feng-Titiunik (2021) prediction intervals for RPCA-SC#
Bayani (2021) leaves inference unaddressed and the Shen et al. (2023)
closed-form CIs do not apply to the asymmetric NNLS weights of
RPCA-SC. For uncertainty quantification mlsynth ships a focused port
of Cattaneo, M. D., Feng, Y., & Titiunik, R. (2021), “Prediction
Intervals for Synthetic Control Methods” (JASA 116(536):1865-1880),
opt-in via CLUSTERSCConfig.compute_cft_pi.
The CFT prediction error at post-period \(t > T_0\) decomposes as
where \(p_t\) is the loading at \(t\) (for RPCA-SC the denoised donor row \(L_t\)), \(w^*\) is the population weight, and \(\widehat w\) is the NNLS estimate. The two components are quantified separately:
In-sample component \(M_w(t, \alpha/2)\). The paper’s reference implementation solves a constrained \(\sup / \inf\) over the “compatible set” of weights via an ECOS LP at each post-period. mlsynth replaces that with an HC1-scaled parametric bootstrap of the pre-period residuals: at each of
cft_simsdraws,\[\widehat u^*_t = \sqrt{\tfrac{T_0}{T_0 - 1}} \cdot \widehat u^-_{\sigma(t)}, \quad \sigma \sim \mathrm{Unif}\{1, \dots, T_0\},\]we perturb the treated pre-period outcome \(y^*_t = \widehat y^0_t + \widehat u^*_t\), refit the full RPCA-SC pipeline, and collect the resulting counterfactual \(\widehat y^{0, *}\) at every period. The asymmetric \((\alpha/2, 1 - \alpha/2)\) empirical quantiles of \(\widehat y^{0, *}_t - \widehat y^0_t\) form the in-sample band per post-period. This is equivalent to the ECOS-based bound under regularity conditions and avoids pulling in
ecos/dask/plotnineas hard dependencies.Out-of-sample component \(M_e(t, \alpha/2)\). Under sub-Gaussian post-period shocks with scale parameter \(\sigma_e\), Hoeffding’s inequality gives the tail bound
\[M_e(t, \alpha/2) = \sqrt{-2 \log \alpha} \; \widehat \sigma_e, \qquad \widehat \sigma_e = \mathrm{sd}\!\bigl(\widehat u^-\bigr).\]This is the
gaussianvariant of the paper’s three out-of-sample options (gaussian/ls/qreg); the other two are deferred.
The combined \((1 - \alpha)\) PI on the counterfactual at \(t\) is
which inverts to the PI on the per-period treatment effect \(\widehat \tau_t = y_t - \widehat y^0_t\).
For the ATT, the in-sample component aggregates by storing the post-period mean of the counterfactual at each bootstrap draw and taking quantiles. The out-of-sample component shrinks by \(\sqrt{T_1}\) under post-period shock independence:
Config knobs: CLUSTERSCConfig.compute_cft_pi (default
False – the bootstrap costs cft_sims full pipeline refits,
roughly 0.3-0.5 s each on a moderate panel),
CLUSTERSCConfig.cft_sims (default 200),
CLUSTERSCConfig.cft_alpha (default 0.05), and
CLUSTERSCConfig.cft_e_method (currently only
"gaussian"). The output
CLUSTERSCInference.cft carries the per-period gaps,
per-period PIs, in-sample bootstrap bands, the Hoeffding constant,
and the aggregated ATT PI.
Verification#
In the high-dimensional-subgroup regime (pooled donor rank
\(> T_0\)), CLUSTERSC reproduces the central claim of Rho et al.
(2025): donor clustering lowers the post-period prediction MSE versus the
whole-pool RSC baseline at every noise level (down
\(60.8\% / 43.2\% / 24.3\%\) at \(\sigma = 0.10/0.25/0.40\)).
Both modes run through the one estimator (clustering=False is RSC,
clustering=True is ClusterSC). Pinned in
benchmarks/cases/clustersc_subgroups.py; the authors’ own code is
cross-checked against its paper in clustersc_subgroups_ref.py; the RSC
pre/post-error and Shen-CI coverage are pinned in rsc_synth_error.py /
rsc_shen_coverage.py. The RPCA-SC family is pinned separately on the
West-German-reunification application (clustersc_rpca_germany.py: Norway 0.49
/ France 0.35 / pre-RMSE ~89). See the dedicated page
ClusterSC — Synthetic Control with Donor Selection (Rho et al. 2025).
Core API#
Cluster-based Synthetic Control (CLUSTERSC) estimator.
Bundles two complementary robust SCM families behind a single orchestrator:
PCR-RSC – SVD-based donor clustering (Amjad, Shah, Shen 2018) plus Principal Component Regression weight estimation (Agarwal, Shah, Shen, Song 2021). Frequentist QP or Bayesian Robust Synthetic Control posterior (Amjad, Shah, Shen 2018) selected via
estimator.RPCA-SC – robust low-rank donor denoising (PCP – Candes, Li, Ma, Wright 2011; or HQF – Wang, Li, So, Liu 2023) followed by simplex SCM weights.
method = "both" runs both families in parallel and exposes them
side by side on the CLUSTERSCResults container. The
primary field selects which fit drives the convenience aliases
att / counterfactual / gap / donor_weights.
- class mlsynth.estimators.clustersc.CLUSTERSC(config: CLUSTERSCConfig | dict)#
Bases:
objectCluster-based Synthetic Control estimator.
- Parameters:
config (CLUSTERSCConfig or dict) – Configuration object. See
mlsynth.config_models.CLUSTERSCConfig.- Returns:
CLUSTERSCResults – Frozen container with optional PCR-RSC and RPCA-SC fits plus Bayesian credible interval inference (PCR estimator = “bayesian” only).
- fit() CLUSTERSCResults#
Run the requested family (or both) and return a
CLUSTERSCResults.
Configuration#
- class mlsynth.config_models.CLUSTERSCConfig(*, df: ~pandas.DataFrame, outcome: str, treat: str, unitid: str, time: str, display_graphs: bool = True, save: bool | str = False, counterfactual_color: ~typing.List[str] = <factory>, treated_color: str = 'black', plot: ~mlsynth.config_models.PlotConfig = <factory>, method: ~typing.Literal['pcr', 'rpca', 'both'] = 'pcr', primary: ~typing.Literal['pcr', 'rpca'] = 'pcr', pcr_objective: ~typing.Literal['OLS', 'SIMPLEX'] = 'OLS', clustering: bool = True, estimator: ~typing.Literal['frequentist', 'bayesian'] = 'frequentist', rpca_method: ~typing.Literal['PCP', 'HQF'] = 'PCP', lambda_penalty: ~typing.Annotated[float | None, ~annotated_types.Ge(ge=0.0)] = None, p: ~typing.Annotated[float | None, ~annotated_types.Ge(ge=0.0)] = None, q: ~typing.Annotated[float | None, ~annotated_types.Ge(ge=0.0)] = None, alpha: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Lt(lt=1.0)] = 0.05, rank: ~typing.Annotated[int | None, ~annotated_types.Ge(ge=1)] = None, rank_method: ~typing.Literal['cumvar', 'fixed', 'usvt'] = 'cumvar', cumvar_threshold: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Le(le=1.0)] = 0.95, standardize_for_rank: bool = True, project_denoised: bool = False, k_clusters: ~typing.Annotated[int | None, ~annotated_types.Ge(ge=1)] = None, k_max: ~typing.Annotated[int, ~annotated_types.Ge(ge=2)] = 8, n_bayes_samples: ~typing.Annotated[int, ~annotated_types.Ge(ge=1)] = 1000, random_state: int = 0, compute_shen_ci: bool = True, shen_variance: ~typing.Literal['homoskedastic', 'jackknife', 'hrk'] = 'homoskedastic', fpca_cumvar: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Le(le=1.0)] = 0.95, pcp_lambda: ~typing.Annotated[float | None, ~annotated_types.Ge(ge=0.0)] = None, pcp_mu: ~typing.Annotated[float | None, ~annotated_types.Gt(gt=0.0)] = None, pcp_max_iter: ~typing.Annotated[int, ~annotated_types.Ge(ge=1)] = 1000, pcp_tol: ~typing.Annotated[float, ~annotated_types.Gt(gt=0)] = 1e-09, hqf_rank: ~typing.Annotated[int | None, ~annotated_types.Ge(ge=1)] = None, hqf_cumvar: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Le(le=1.0)] = 0.999, hqf_lambda: ~typing.Annotated[float | None, ~annotated_types.Ge(ge=0.0)] = None, hqf_ip: ~typing.Annotated[float, ~annotated_types.Gt(gt=0)] = 1.0, hqf_max_iter: ~typing.Annotated[int, ~annotated_types.Ge(ge=1)] = 1000, cv_lambda: bool = False, cv_hqf_rank: bool = False, compute_cft_pi: bool = False, cft_sims: ~typing.Annotated[int, ~annotated_types.Ge(ge=10)] = 200, cft_alpha: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Lt(lt=1.0)] = 0.05, cft_e_method: ~typing.Literal['gaussian'] = 'gaussian')#
Configuration for the Cluster-based Synthetic Control (CLUSTERSC) estimator.
CLUSTERSC packages two robust synthetic-control families behind a single interface:
PCR-RSC – SVD-based donor clustering plus Principal Component Regression weight estimation (Amjad, Shah, Shen 2018; Agarwal, Shah, Shen, Song 2021). Frequentist QP via
cvxpyor Bayesian Robust Synthetic Control posterior (Amjad, Shah, Shen 2018).RPCA-SC – robust low-rank donor denoising (PCP – Candes, Li, Ma, Wright 2011; or HQF – Wang, Li, So, Liu 2023) followed by simplex SCM weights.
- Parameters:
method ({“pcr”, “rpca”, “both”}) – Estimation family to run.
"both"runs PCR and RPCA in parallel;primaryselects which one exposesatt/counterfactual/gapon the result object.primary ({“pcr”, “rpca”}) – Which fit drives the result aliases when
method = "both".pcr_objective ({“OLS”, “SIMPLEX”}) – Inner SCM weight objective for the PCR family.
clustering (bool) – Whether to apply SVD-based donor clustering before the PCR fit.
estimator ({“frequentist”, “bayesian”}) – Frequentist QP versus Bayesian posterior for the PCR family.
rpca_method ({“PCP”, “HQF”}) – Robust-PCA decomposition for the RPCA family.
lambda_penalty, p, q (float or None) – Elastic-net-style regularization knobs forwarded to the PCR inner solver.
alpha (float) – Two-sided level for the Bayesian credible interval (PCR estimator = “bayesian” only).
- cft_e_method: Literal['gaussian']#
- estimator: Literal['frequentist', 'bayesian']#
- method: Literal['pcr', 'rpca', 'both']#
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- pcr_objective: Literal['OLS', 'SIMPLEX']#
- primary: Literal['pcr', 'rpca']#
- rank_method: Literal['cumvar', 'fixed', 'usvt']#
- rpca_method: Literal['PCP', 'HQF']#
- shen_variance: Literal['homoskedastic', 'jackknife', 'hrk']#
Helper Modules#
PCR-SC pipeline#
HSVT (Hard Singular Value Thresholding) primitives for PCR-SC.
Implements rank selection and the rank-\(r\) truncation step \(\widetilde{M} = HSVT_r(X)\) from Rho, Tang, Bergam, Cummings & Misra (2025), ClusterSC: Advancing Synthetic Control with Donor Selection, Algorithm 2.
Three rank-selection modes are exposed:
"cumvar"– smallest \(r\) whose cumulative spectral energy reaches a user threshold (paper’s empirical default of 95%, Section 6.1)."fixed"– caller supplies the explicit rank \(r\)."usvt"– Universal Singular Value Thresholding (Chatterjee 2015; Donoho & Gavish 2014). Preserved for back-compat with the legacy Amjad-Shah-Shen 2018 path.
- mlsynth.utils.clustersc_helpers.pcr.hsvt.hsvt(X: ndarray, rank: int) Tuple[ndarray, ndarray, ndarray, ndarray]#
Apply hard rank-
ranktruncation toX.Computes \(\widetilde{M} = \sum_{i=1}^{r} \sigma_i u_i v_i^\top\) (Algorithm 2 step 2 of Rho et al. 2025).
- Returns:
M_hat (np.ndarray) – Rank-
rankreconstruction, shape(m, n).U_r (np.ndarray) – Truncated left singular vectors, shape
(m, rank).s_r (np.ndarray) – Truncated singular values, shape
(rank,).Vt_r (np.ndarray) – Truncated right singular vectors (transposed), shape
(rank, n).
- mlsynth.utils.clustersc_helpers.pcr.hsvt.select_rank(X: ndarray, method: str = 'cumvar', cumvar_threshold: float = 0.95, r: int | None = None, standardize: bool = True) int#
Pick a truncation rank for
Xunder the chosen rule.- Parameters:
X (np.ndarray) – Matrix to decompose, shape
(m, n).method ({“cumvar”, “fixed”, “usvt”}) – Rank-selection rule.
cumvar_threshold (float) – Cumulative-variance target in
(0, 1]formethod="cumvar".r (int, optional) – Explicit rank for
method="fixed".standardize (bool) – When True (default), the data-driven rules (
"cumvar"/"usvt") operate on the column-standardised version ofX. This prevents the leading singular value from being dominated by uncentered level information (e.g. cigarette sales numbers, GDP per capita) so thatcumvar_thresholdactually discriminates over the panel’s factor structure rather than the overall scale. Ignored formethod="fixed".
- Returns:
int – Selected rank \(r \in [1, \min(m, n)]\).
Donor clustering for ClusterSC (Rho et al. 2025, Algorithm 3 & 4).
Implements the two paper-aligned steps:
cluster_donors()– Algorithm 3. SVD-truncate the (transposed) donor matrix at rank \(r\), build features \(\widetilde{U} = U \Sigma_r\), and run \(k\)-means on the rows of \(\widetilde{U}\). The number of clusters is either user-supplied or selected by the silhouette coefficient (Amjad, Shah, Shen 2018).assign_target()– Algorithm 4 Step 2. Embed the treated unit into the right-singular-vector basis via \(\tilde{u} = V_r^\top x_0^-\) and assign it to the nearest cluster centroid in \(\widetilde{U}\)-space.
Convention: X_pre here is shape (J, T0) – one row per donor unit,
one column per pre-period – to match the paper’s notation
(\(X \in \mathbb{R}^{n \times T}\)). The orchestrator transposes
as needed.
- class mlsynth.utils.clustersc_helpers.pcr.clustering.ClusterPartition(labels: ndarray, centers: ndarray, V_r: ndarray, k: int, rank: int)#
Output of the paper’s Algorithm 3.
- V_r: ndarray#
- centers: ndarray#
- labels: ndarray#
- mlsynth.utils.clustersc_helpers.pcr.clustering.assign_target(target_outcome_pre: ndarray, partition: ClusterPartition) Tuple[int, ndarray]#
Algorithm 4 Step 2: assign the target to its matching cluster.
Computes \(\tilde{u} = V_r^\top x_0^-\) and picks the centroid minimising \(\| c_\ell - \tilde{u} \|_2\).
- Returns:
cluster_id (int) – Index of the nearest cluster.
target_embedding (np.ndarray) – \(\tilde{u}\), shape
(r,).
- mlsynth.utils.clustersc_helpers.pcr.clustering.cluster_donors(donor_outcomes_pre: ndarray, rank: int, k_clusters: int | None = None, k_max: int = 8, random_state: int = 0) ClusterPartition#
Run Algorithm 3 of Rho et al. (2025).
- Parameters:
donor_outcomes_pre (np.ndarray) – Donor outcomes in the pre-period, shape
(J, T0)(donors on rows).rank (int) – Truncation rank \(r\) for the SVD feature map.
k_clusters (int, optional) – Number of clusters. If
None, the silhouette coefficient picks \(k \in [2, \min(k_{\max}, J-1)]\).k_max (int) – Upper bound for the silhouette search.
random_state (int) – Seed forwarded to
sklearn.cluster.KMeansfor reproducibility.
- Returns:
ClusterPartition – Cluster labels, centroids in \(\widetilde{U}\)-space, and the rank-truncated right singular matrix \(V_r\) needed for the target-assignment step.
Frequentist OLS weight solver for PCR-SC.
Implements the learn step of Algorithm 2 of Rho et al. (2025): given the HSVT-denoised pre-period donor matrix \(\widetilde{M}^- \in \mathbb{R}^{T_0 \times J}\) and the target pre-period vector \(x_0^- \in \mathbb{R}^{T_0}\), return
Optionally extends to p-norm (ridge / lasso) regularisation via
lambda_penalty, p, q (Appendix E of the paper compares OLS,
Ridge, Lasso): min_f ||x0 - M f||_2^2 + lambda ||f||_p^q. For the
unregularised path (lambda_penalty in {None, 0}) the closed-form
pseudo-inverse solver is used – the paper’s exact Algorithm 2 Step 3.
The regularised path is solved directly with cvxpy, so this module has no
dependency on the legacy estutils optimizer.
- mlsynth.utils.clustersc_helpers.pcr.frequentist.solve_ols(denoised_donor_pre: ndarray, target_pre: ndarray, donor_names: List[str] | None = None, lambda_penalty: float | None = None, p: float | None = None, q: float | None = None) ndarray#
Return the OLS weight vector \(\widehat{f}\).
- Parameters:
denoised_donor_pre (np.ndarray) – HSVT-denoised pre-period donor matrix, shape
(T0, J).target_pre (np.ndarray) – Treated unit’s pre-period outcomes, shape
(T0,).donor_names (list of str, optional) – Unused; retained for call-signature compatibility.
lambda_penalty, p, q (float or None) – Optional
p-norm regularisation.None(or0) means the paper’s plain OLS via pseudo-inverse. Defaults are ridge (p = q = 2).
Bayesian PCR-SC weight solver (mlsynth extension; Bayani 2022 Ch. 1).
posterior– core Gaussian-conjugate posterior over the weights (BayesSCM()), relocated from the former sharedbayesutils.solver–solve_bayesian(), the PCR-SC wrapper that draws from that posterior and forms per-period credible bands.
- mlsynth.utils.clustersc_helpers.pcr.bayesian.BayesSCM(denoised_donor_matrix: ndarray, target_outcome_pre_intervention: ndarray, observation_noise_variance: float, weights_prior_precision: float) Tuple[ndarray, ndarray, ndarray, ndarray]#
Implement the fully Bayesian synthetic control method.
This function calculates the posterior distribution of synthetic control weights and the predictive distribution of the counterfactual outcome under a Bayesian framework. It assumes a Gaussian likelihood for the outcome and a Gaussian prior for the weights.
- Parameters:
denoised_donor_matrix (np.ndarray) – Denoised donor matrix, representing the pre-intervention outcomes or predictors for donor units. Shape (num_pre_intervention_periods, num_donors), where num_pre_intervention_periods is the number of pre-intervention time periods and num_donors is the number of donor units.
target_outcome_pre_intervention (np.ndarray) – Pre-intervention outcome vector for the treated unit. Shape (num_pre_intervention_periods,) or (num_pre_intervention_periods, 1).
observation_noise_variance (float) – Variance of the noise term in the outcome model (scalar). This represents the observation noise \(\sigma^2\).
weights_prior_precision (float) – Prior precision parameter for the synthetic control weights (scalar). This is the inverse of the prior variance for the weights, assuming a zero-mean Gaussian prior \(\beta \sim N(0, \text{weights_prior_precision}^{-1}I)\).
- Returns:
weights_posterior_mean (np.ndarray) – Posterior mean of the synthetic control weights. Shape (num_donors,).
weights_posterior_covariance (np.ndarray) – Posterior covariance matrix of the synthetic control weights. Shape (num_donors, num_donors).
counterfactual_predictive_mean_pre_intervention (np.ndarray) – Predictive mean of the counterfactual outcome for the treated unit during the pre-intervention periods. Shape (num_pre_intervention_periods,).
counterfactual_predictive_variance_pre_intervention (np.ndarray) – Predictive variance of the counterfactual outcome for the treated unit during the pre-intervention periods. Shape (num_pre_intervention_periods,).
- Raises:
MlsynthDataError – If input data types, shapes, or values are invalid.
MlsynthEstimationError – If matrix inversion fails during posterior computation.
- mlsynth.utils.clustersc_helpers.pcr.bayesian.solve_bayesian(denoised_donor_pre: ndarray, target_pre: ndarray, denoised_donor_full: ndarray, *, alpha: float = 0.05, n_samples: int = 1000, alpha_prior: float = 1.0, rng: Generator | None = None) Tuple[ndarray, ndarray, ndarray, ndarray]#
Return weights, median counterfactual, and per-period credible bands.
- Parameters:
denoised_donor_pre (np.ndarray) – HSVT-denoised pre-period donor matrix, shape
(T0, J).target_pre (np.ndarray) – Pre-period treated outcomes, shape
(T0,).denoised_donor_full (np.ndarray) – HSVT-denoised donor matrix across all periods, shape
(T, J). Used to project posterior draws into the counterfactual band.alpha (float) – Nominal level: returned band is the central
1 - alphaposterior interval.n_samples (int) – Number of posterior draws for the predictive band.
alpha_prior (float) – Prior precision for the Gaussian weights prior (passed to
BayesSCM()).rng (np.random.Generator, optional) – Source of randomness. A fresh
default_rng()is used otherwise.
- Returns:
f_hat (np.ndarray) – Posterior mean weights, shape
(J,).cf_median (np.ndarray) – Posterior median of the counterfactual, shape
(T,).cf_lower (np.ndarray) – Lower per-period bound at level
alpha/2, shape(T,).cf_upper (np.ndarray) – Upper per-period bound at level
1 - alpha/2, shape(T,).
Convex (simplex-constrained) weight solver for PCR-SC.
mlsynth extension on top of Algorithm 2 of Rho et al. (2025): keep the HSVT denoising step, but replace the OLS inner solve with the classical Abadie-Diamond-Hainmueller (2010) simplex-constrained program,
This combines the denoising robustness of PCR with the
non-extrapolation and interpretability properties of convex synthetic
control weights. The simplex least-squares program is solved directly with
cvxpy (CLARABEL), so this module has no dependency on the legacy estutils
optimizer.
- mlsynth.utils.clustersc_helpers.pcr.convex.solve_simplex(denoised_donor_pre: ndarray, target_pre: ndarray, donor_names: List[str] | None = None) ndarray#
Return simplex-constrained weights against the denoised donor matrix.
Solves
min_w ||x0_pre - M_pre w||_2subject tow >= 0andsum(w) == 1.- Parameters:
denoised_donor_pre (np.ndarray) – HSVT-denoised pre-period donor matrix, shape
(T0, J).target_pre (np.ndarray) – Treated unit’s pre-period outcomes, shape
(T0,).donor_names (list of str, optional) – Unused; retained for call-signature compatibility.
Shen-Ding-Sekhon-Yu (2023) confidence intervals for PCR-SC.
Implements the variance estimators from Section 4 of Same Root
Different Leaves: Time Series and Cross-Sectional Methods in Panel
Data (Econometrica 91(6):2125-2154). The closed-form formulas below
are ports of var.py from the authors’ reference implementation
at deshen24/panel-data-regressions, adapted to
mlsynth’s column-orientation (donors as columns of Y_pre).
The three variance estimators correspond to three assumed sources of randomness:
HZ (horizontal / time-series). Assumption 1 of the paper: donors’ post-period outcome \(y_T\) is generated from their pre-period outcomes plus mean-zero noise that varies across donors. The relevant residual variance comes from the cross-sectional dimension. Confidence intervals built from \(\sigma^2_{\mathrm{hz}}\) quantify uncertainty under the HZ generative model.
VT (vertical / cross-sectional). Assumption 2 of the paper: treated unit’s pre-period outcome \(y_N\) is generated from the donors’ pre-period outcomes plus mean-zero noise that varies across pre-periods. The relevant residual variance comes from the time-series dimension.
DR (doubly robust). Assumption 3: noise varies across both units and time. The DR variance is \(v_{\mathrm{hz}} + v_{\mathrm{vt}} - \mathrm{tr}(A)\) where \(A\) is an interaction term, clipped at zero (paper eq. 17 / var.py).
Three variance estimators are available:
"homoskedastic"– paper Section 4.1.3, eq 19. Closed-form, assumes errors are i.i.d."jackknife"– paper Supplemental Material. Diagonal weighting."hrk"– Hartley-Rao-Kish, paper Supplemental Material. Only valid whenmax(1 - diag(H_perp)) < 1/2.
- class mlsynth.utils.clustersc_helpers.pcr.inference.ShenInference(method: str, alpha: float, att: float, per_period_gap: ndarray, per_period_se_hz: ndarray, per_period_se_vt: ndarray, per_period_se_dr: ndarray, per_period_ci_hz: ndarray, per_period_ci_vt: ndarray, per_period_ci_dr: ndarray, att_se_hz: float, att_se_vt: float, att_se_dr: float, att_ci_hz: Tuple[float, float], att_ci_vt: Tuple[float, float], att_ci_dr: Tuple[float, float], rank: int)#
Output of
shen_inference()for a PCR-SC fit.- per_period_gap#
Per-period treatment effect estimates, shape
(T1,).- Type:
np.ndarray
- per_period_se_{hz,vt,dr}
Per-period standard errors under each source-of-randomness assumption, shape
(T1,).- Type:
np.ndarray
- per_period_ci_{hz,vt,dr}
Per-period \((1-\alpha)\) CIs, shape
(T1, 2).- Type:
np.ndarray
- att_se_{hz,vt,dr}
ATT standard errors (variance-of-mean under independence).
- Type:
- per_period_ci_dr: ndarray#
- per_period_ci_hz: ndarray#
- per_period_ci_vt: ndarray#
- per_period_gap: ndarray#
- per_period_se_dr: ndarray#
- per_period_se_hz: ndarray#
- per_period_se_vt: ndarray#
- mlsynth.utils.clustersc_helpers.pcr.inference.shen_inference(treated_outcome: ndarray, donor_outcomes: ndarray, T0: int, rank: int, *, variance: Literal['homoskedastic', 'jackknife', 'hrk'] = 'homoskedastic', alpha: float = 0.05) ShenInference#
Per-period and ATT CIs for PCR-SC, following Shen et al. (2023).
- Parameters:
treated_outcome (np.ndarray) – Treated outcome series, shape
(T,).donor_outcomes (np.ndarray) – Donor outcomes (columns = donors), shape
(T, J).T0 (int) – Number of pre-treatment periods.
rank (int) – PCR truncation rank \(k\) already chosen by the pipeline. Reused so the inference uses the same low-rank donor matrix the weights were fit on.
variance ({“homoskedastic”, “jackknife”, “hrk”}) – Variance estimator. Defaults to homoskedastic (paper eq 19).
alpha (float) – Two-sided level for the returned CIs.
PCR-SC orchestration pipeline.
Implements Rho, Tang, Bergam, Cummings & Misra (2025), ClusterSC: Advancing Synthetic Control with Donor Selection, by composing:
hsvt– rank selection + HSVT denoising (Algorithm 2 Step 2).clustering– donor clustering and target-cluster assignment (Algorithms 3 and 4 Steps 1-3).frequentist,bayesian,convex– the three weight solvers.
The public entry point run_pcr() accepts the same call signature
the legacy estutils.pcr exposed (so the orchestrator and SI estimator
work unchanged) plus a handful of paper-aligned knobs:
rankandrank_method("cumvar"/"fixed"/"usvt");k_clustersandk_maxfor the silhouette-driven clustering step;alphaandn_bayes_samplesfor the Bayesian credible band.
- mlsynth.utils.clustersc_helpers.pcr.pipeline.run_pcr(treated_outcome: ndarray, donor_outcomes: ndarray, donor_names: Sequence[str], T0: int, *, objective: str = 'OLS', estimator: str = 'frequentist', clustering: bool = False, rank: int | None = None, rank_method: str = 'cumvar', cumvar_threshold: float = 0.95, standardize_for_rank: bool = True, project_denoised: bool = False, k_clusters: int | None = None, k_max: int = 8, alpha: float = 0.05, n_bayes_samples: int = 1000, alpha_prior: float = 1.0, lambda_penalty: float | None = None, p: float | None = None, q: float | None = None, shen_variance: str = 'homoskedastic', compute_shen_ci: bool = True, random_state: int = 0) Tuple[MethodFit, Tuple[ndarray, ndarray] | None]#
Run the paper-aligned PCR-SC pipeline.
- Parameters:
treated_outcome (np.ndarray) – Treated outcome series, shape
(T,).donor_outcomes (np.ndarray) – Donor outcomes (columns = donors), shape
(T, J).donor_names (sequence of str) – Length-
Jdonor labels.T0 (int) – Number of pre-treatment periods.
objective ({“OLS”, “SIMPLEX”}) – Which weight solver to use for
estimator="frequentist"."OLS"is the paper’s Algorithm 2;"SIMPLEX"is the Abadie-style mlsynth extension.estimator ({“frequentist”, “bayesian”}) – Frequentist Algorithm 2 vs. the Bayani 2022 posterior.
clustering (bool) – If True, run Algorithm 4 (donor clustering + target match).
rank, rank_method, cumvar_threshold – Truncation-rank controls; see
hsvt.select_rank().k_clusters, k_max – Cluster-count controls; see
clustering.cluster_donors().alpha, n_bayes_samples, alpha_prior – Bayesian-path controls; see
bayesian.solve_bayesian().lambda_penalty, p, q – Optional elastic-net knobs for the frequentist OLS path.
random_state (int) – Seed forwarded to k-means and the Bayesian sampler.
- Returns:
fit (MethodFit) – Frozen container with the PCR fit (counterfactual projected through the denoised donor matrix, Algorithm 4 Step 5).
credible_band (tuple of np.ndarray, optional) –
(lower, upper)per-period credible bounds whenestimator="bayesian";Noneotherwise.
RPCA-SC pipeline#
Functional Principal Component Analysis (FPCA) for RPCA-SC.
Implements Step 1 of Bayani (2021), Robust PCA Synthetic Control:
Smooth each unit’s pre-period outcome trajectory with a cubic B-spline (Li, Wang & Carroll 2010).
Apply PCA to the smoothed trajectories.
Truncate to the leading components that explain at least
cumvar_thresholdof the variance (paper Section 3 recommends \(\geq 95\%\)).Standardise the surviving scores so they are ready for k-means.
A fallback to plain PCA is used when the pre-period is too short for a cubic spline. Edge cases (no units / no scores after truncation) are handled by returning a degenerate one-cluster partition upstream.
- class mlsynth.utils.clustersc_helpers.rpca.fpca.FPCAFeatures(scores: ndarray, rank: int, smoothing: str)#
Output of
compute_fpca_features().- scores: ndarray#
- mlsynth.utils.clustersc_helpers.rpca.fpca.compute_fpca_features(pre_outcomes: ndarray, cumvar_threshold: float = 0.95, spline_degree: int = 3) FPCAFeatures#
Compute standardised FPC scores from a panel of pre-period trajectories.
- Parameters:
pre_outcomes (np.ndarray) – Pre-period outcomes, shape
(n_units, T0). Rows are units (the treated unit included, by paper convention), columns are time points.cumvar_threshold (float) – Cumulative-variance target for truncating the FPC expansion. Paper default is 0.95.
spline_degree (int) – B-spline degree. Default 3 (cubic, Bayani 2021).
K-means clustering on FPC scores for RPCA-SC.
Implements Step 2 of Bayani (2021): apply Hartigan-Wong style \(k\)-means to the standardised FPC scores (treated unit included in the panel), then select the donor pool as the other members of the cluster containing the treated unit.
The number of clusters is either user-supplied or chosen by the silhouette coefficient (Rousseeuw 1987) over \(k \in [2, k_{\max}]\), matching Bayani’s recommendation.
- class mlsynth.utils.clustersc_helpers.rpca.clustering.FPCACluster(labels: ndarray, treated_cluster: int, donor_indices: ndarray, k: int)#
Output of
assign_clusters().- donor_indices: ndarray#
- labels: ndarray#
- mlsynth.utils.clustersc_helpers.rpca.clustering.assign_clusters(scores: ndarray, treated_row: int, k_clusters: int | None = None, k_max: int = 8, random_state: int = 0) FPCACluster#
Cluster units by their FPC scores and pick the treated unit’s cluster.
- Parameters:
scores (np.ndarray) – Standardised FPC scores, shape
(n_units, rank). The treated unit is rowtreated_row.treated_row (int) – Index of the treated unit’s row in
scores.k_clusters (int, optional) – Number of clusters. If
None, the silhouette coefficient selects \(k \in [2, k_{\max}]\).k_max (int) – Upper bound for the silhouette search.
random_state (int) – Seed for
sklearn.cluster.KMeansinitialisation.
- Returns:
FPCACluster – Cluster labels, the treated unit’s cluster id, the indices of the donor pool (cluster members minus the treated row), and the number of clusters used.
Principal Component Pursuit (PCP) via ADMM.
Implements Step 3 of Bayani (2021), with the convex relaxation due to Candes, Li, Ma & Wright (2011):
The augmented Lagrangian is solved by alternating direction method of multipliers (ADMM): \(L\) is updated via the singular value soft-thresholding operator \(\mathcal{D}_{1/\mu}\), \(S\) via element-wise soft-thresholding \(\mathcal{S}_{\lambda/\mu}\), and the dual variable \(\Lambda\) is the standard multiplier update.
Defaults follow Bayani (2021) Section 2.4:
\(\lambda = 1 / \sqrt{\max(N, T)}\) (Candes et al. 2011 default).
\(\mu = \frac{N \cdot T}{4 \cdot \sum_{ij} |Y_{ij}|}\) (Bayani’s rule of thumb).
- class mlsynth.utils.clustersc_helpers.rpca.pcp.PCPResult(low_rank: ndarray, sparse: ndarray, iterations: int, converged: bool, lambda_used: float, mu_used: float)#
Decomposition output:
Y = L + S, plus solver diagnostics.- low_rank: ndarray#
- sparse: ndarray#
- mlsynth.utils.clustersc_helpers.rpca.pcp.pcp_decompose(Y: ndarray, lam: float | None = None, mu: float | None = None, max_iter: int = 1000, tol: float = 1e-09) PCPResult#
Solve the PCP problem and return the
(L, S)decomposition.- Parameters:
Y (np.ndarray) – Observed matrix, shape
(m, n).lam (float, optional) – Sparsity penalty \(\lambda\). Defaults to \(1/\sqrt{\max(m, n)}\) (Candes et al. 2011).
mu (float, optional) – Augmented-Lagrangian penalty \(\mu\). Defaults to \(m n / (4 \sum |Y_{ij}|)\) (Bayani 2021).
max_iter (int) – Iteration cap. Default
1000.tol (float) – Frobenius-norm tolerance for
Y - L - S, scaled by||Y||_F.
Half-quadratic regularised Robust PCA (HQF).
Implements the non-convex robust PCA variant of Wang, Li, So & Liu (2023), Robust PCA via non-convex half-quadratic regularization, Signal Processing 204, 108816.
Solves a factored model \(Y = UV + S\) where \(U \in \mathbb{R}^{m \times r}\), \(V \in \mathbb{R}^{r \times n}\), and \(S\) collects sparse / outlier entries. Updates alternate between regularised least squares for the factors and an adaptive median-based threshold on the residual for the sparse component.
Default hyperparameters (Bayani 2021 Section 2.4, recommending Wang et al.’s defaults):
rank– smallest \(r\) with cumulative spectral energy \(\geq\)cumvar_threshold(Bayani uses 0.999).ip(noise-scale adaptation factor) –1.0.lam(factor regularisation) – \(1 / \sqrt{\max(m, n)}\).max_iter–1000.
- class mlsynth.utils.clustersc_helpers.rpca.hqf.HQFResult(low_rank: ndarray, sparse: ndarray, iterations: int, rank_used: int, lambda_used: float, ip_used: float)#
Output of
hqf_decompose(): low-rank + sparse parts + diagnostics.- low_rank: ndarray#
- sparse: ndarray#
- mlsynth.utils.clustersc_helpers.rpca.hqf.hqf_decompose(Y: ndarray, rank: int | None = None, cumvar_threshold: float = 0.999, lam: float | None = None, ip: float = 1.0, max_iter: int = 1000, random_state: int = 0) HQFResult#
Run the HQF robust PCA decomposition.
- Parameters:
Y (np.ndarray) – Observed matrix, shape
(m, n).rank (int, optional) – Explicit factorisation rank. If
None, picked by cumulative spectral energy atcumvar_threshold.cumvar_threshold (float) – Cumulative-variance target used when
rankis None. Bayani 2021 uses0.999.lam (float, optional) – Tikhonov factor for the alternating LS step. Defaults to \(1 / \sqrt{\max(m, n)}\) (Bayani 2021).
ip (float) – Noise-scale adaptation factor. Default
1.0.max_iter (int) – Iteration cap. Default
1000.random_state (int) – Seed for the initialisation of \(U\).
Non-negative least squares weight solver for RPCA-SC.
Implements Step 4 of Bayani (2021):
where \(L^-\) is the HSVT-denoised pre-period donor matrix. The paper deliberately drops the sum-to-one constraint that classical synthetic control imposes – the clustering step already restricts the donor pool to units that behave like the treated unit, so the non-negativity constraint is enough to keep the counterfactual interpretable.
- mlsynth.utils.clustersc_helpers.rpca.weights.solve_nnls(denoised_donor_pre: ndarray, target_pre: ndarray) ndarray#
Return the non-negative least squares weights \(\widehat{\beta}\).
- Parameters:
denoised_donor_pre (np.ndarray) – Pre-period denoised donor matrix, shape
(T0, J). Columns are donor units.target_pre (np.ndarray) – Treated unit’s pre-period outcomes, shape
(T0,).
Notes
Wraps
scipy.optimize.nnls(). The Lawson-Hanson algorithm is finite-step exact for the box-constrained QP arising from Bayani’s Step 4.
Cross-validation tuners for the RPCA-SC pipeline.
Bayani (2021) takes the Candes-Li-Ma-Wright (2011) optimal-recovery value \(\lambda = 1/\sqrt{\max(N, T)}\) for the PCP sparsity penalty, and selects the HQF factorisation rank by an explained- variance threshold. Both defaults are conservative for the L/S decomposition identifiability problem but can be poor for the counterfactual prediction problem (e.g. on the California Proposition 99 panel they leave PCP under-regularised by ~2x and PCR-SC beats RPCA-SC on pre-period RMSE by a factor of ~2).
This module adds leave-one-time-period-out cross-validation for the PCP \(\lambda\) and HQF \(r\) knobs. For each candidate value:
Run the RPCA decomposition on the full pre-period donor matrix (the donor data is observed at every period – only the treated unit’s pre-period is “held out”).
For each pre-period \(t \in [0, T_0)\), refit the NNLS weights on the other \(T_0 - 1\) periods and predict the held-out \(y_t\). Aggregate the squared errors.
Pick the value with the lowest mean held-out MSE.
This evaluates exactly the quantity the user cares about (out-of-sample prediction accuracy of the synthetic control) without touching the donor matrix (so the RPCA decomposition is not over-fit to the held-out treated period).
- class mlsynth.utils.clustersc_helpers.rpca.tuning.CVResult(grid: ndarray, cv_mse: ndarray, best: float, best_idx: int)#
Output of one CV sweep.
- cv_mse: ndarray#
- grid: ndarray#
- mlsynth.utils.clustersc_helpers.rpca.tuning.cv_hqf_rank(donor_pre: ndarray, treated_pre: ndarray, grid: Sequence[int] | None = None, *, max_rank: int | None = None, hqf_lambda: float | None = None, hqf_ip: float = 1.0, hqf_max_iter: int = 1000, random_state: int = 0) CVResult#
Leave-one-time-out CV for HQF’s factorisation rank \(r\).
- Parameters:
donor_pre (np.ndarray) – Pre-period donor matrix, shape
(T0, J).treated_pre (np.ndarray) – Treated unit’s pre-period outcomes, shape
(T0,).grid (sequence of int, optional) – Integer ranks to try. Defaults to
range(1, max_rank+1)withmax_rank = min(J, T0-1).max_rank – Upper bound for the default grid.
hqf_lambda, hqf_ip, hqf_max_iter, random_state – Forwarded to
mlsynth.utils.clustersc_helpers.rpca.hqf.hqf_decompose().
- mlsynth.utils.clustersc_helpers.rpca.tuning.cv_pcp_lambda(donor_pre: ndarray, treated_pre: ndarray, grid: Sequence[float] | None = None, *, base_lambda: float | None = None, multipliers: Sequence[float] = (0.5, 1.0, 2.0, 3.0, 5.0, 8.0, 12.0), pcp_mu: float | None = None, pcp_max_iter: int = 1000, pcp_tol: float = 1e-09) CVResult#
Leave-one-time-out CV for PCP’s sparsity penalty \(\lambda\).
- Parameters:
donor_pre (np.ndarray) – Pre-period donor matrix, shape
(T0, J)(columns are donors).treated_pre (np.ndarray) – Treated unit’s pre-period outcomes, shape
(T0,).grid (sequence of float, optional) – Explicit grid of lambda values. If
None, the grid is built asmultipliers * base_lambdawherebase_lambdadefaults to Candes’ value \(1/\sqrt{\max(J, T_0)}\).base_lambda, multipliers – Used only when
gridis not supplied.pcp_mu, pcp_max_iter, pcp_tol – Forwarded to
mlsynth.utils.clustersc_helpers.rpca.pcp.pcp_decompose().
Cattaneo-Feng-Titiunik prediction intervals for RPCA-SC.
Implements a focused port of the prediction-interval machinery from
Cattaneo, M. D., Feng, Y., & Titiunik, R. (2021), “Prediction
Intervals for Synthetic Control Methods” (JASA 116(536):1865-1880)
and its multi-period extension in Cattaneo, Feng, Palomba, Titiunik
(2025). The reference implementation lives at
nppackages/scpi (Python package scpi_pkg).
The full scpi_pkg ships an ECOS-based constrained optimisation
solver for the in-sample component and three out-of-sample methods
(gaussian / ls / qreg) plus dask-parallel simulation.
Vendoring that for one inference method on one estimator family
would pull in dask, plotnine, mizani, fsspec, cloudpickle, ecos,
luddite, and partd as hard dependencies. Instead this module
implements the same two-component structure with two lightweight
choices that capture the spirit of the paper:
Prediction interval at post-period \(t > T_0\):
where the two components are quantified as follows.
In-sample uncertainty \(M_w(t, \alpha/2)\) – HC1-scaled parametric bootstrap of the pre-period residuals. At each of
simsdraws we perturb the treated pre-period outcome with a resampled-residual error and refit the full RPCA-SC pipeline via the user-suppliedrefit_fn. The \((\alpha/2, 1-\alpha/2)\) empirical quantiles of the resulting counterfactual draws give an asymmetric in-sample band per post-period. This replaces the paper’s ECOS solver, which solves a constrained \(\sup / \inf\) over the “compatible set” of weights – the bootstrap is equivalent under regularity conditions and avoids pulling inecosas a dependency.Out-of-sample uncertainty \(M_e(t, \alpha/2)\) – the
gaussianvariant of the paper’se_method. Under sub-Gaussian post-period shocks the Hoeffding bound gives\[M_e(t, \alpha/2) = \sqrt{-2 \log \alpha} \; \widehat\sigma_e,\]where \(\widehat\sigma_e = \mathrm{sd}(\widehat u^-)\) is estimated from the pre-period residuals. The two more elaborate variants in the paper (
ls– location-scale model;qreg– quantile regression on residuals) are deferred for a future revision.
For the average treatment effect on the treated, the in-sample component aggregates by simulating the post-period mean of the counterfactual across bootstrap draws and taking quantiles. The out-of-sample component aggregates as \(\widehat\sigma_e \sqrt{-2 \log \alpha} / \sqrt{T_1}\) under post-period shock independence.
- class mlsynth.utils.clustersc_helpers.rpca.inference.CFTInference(method: str, alpha: float, att: float, per_period_gap: ndarray, per_period_pi: ndarray, in_sample_band: ndarray, out_of_sample_eps: float, att_pi: Tuple[float, float], sims: int, sigma_e: float)#
Output of
cft_prediction_intervals().- per_period_gap#
Per-post-period treatment effect estimates, shape
(T1,).- Type:
np.ndarray
- per_period_pi#
Per-post-period \((1-\alpha)\) PI for the treatment effect \(\tau_t\), shape
(T1, 2).- Type:
np.ndarray
- in_sample_band#
Per-post-period in-sample bootstrap band on the counterfactual, shape
(T1, 2).in_sample_band[t, 0]is the bootstrap \(\alpha/2\) quantile minus the point estimate;in_sample_band[t, 1]is the \((1 - \alpha/2)\) quantile minus the point estimate.- Type:
np.ndarray
- out_of_sample_eps#
Out-of-sample Hoeffding bound – the same scalar at every post-period under the homoskedastic Gaussian default.
- Type:
- in_sample_band: ndarray#
- per_period_gap: ndarray#
- per_period_pi: ndarray#
- mlsynth.utils.clustersc_helpers.rpca.inference.cft_prediction_intervals(treated_outcome: ndarray, counterfactual: ndarray, T0: int, refit_fn: Callable[[ndarray], ndarray], *, e_method: str = 'gaussian', alpha: float = 0.05, sims: int = 200, random_state: int = 0) CFTInference#
Two-component prediction intervals for an RPCA-SC fit.
- Parameters:
treated_outcome (np.ndarray) – Treated outcome series, shape
(T,).counterfactual (np.ndarray) – Counterfactual from the actual fit, shape
(T,).T0 (int) – Number of pre-treatment periods.
refit_fn (callable) –
refit_fn(y_perturbed) -> counterfactualwhere the argument is a length-Tmodified treated outcome whose pre-period entries have been perturbed by a resampled residual draw, and the return is the resulting full-period counterfactual. The pipeline supplies this thunk so the inference module stays decoupled from the orchestration.e_method ({“gaussian”}) – Out-of-sample method. Currently only
"gaussian"(Hoeffding bound) is supported.alpha (float) – Two-sided level. Default 0.05 -> 95% intervals.
sims (int) – Number of bootstrap draws for the in-sample component.
random_state (int) – Seed for the resampling RNG.
Orchestration pipeline for RPCA-SC (Bayani 2021, Algorithm 4).
Composes the five steps of Robust PCA Synthetic Control:
fpca– standardised FPC scores from pre-period trajectories.clustering– silhouette-driven \(k\)-means and donor selection via the treated unit’s cluster membership.pcp/hqf– robust \(Y = L + S\) decomposition of the selected donor pool.weights– non-negative least squares against the denoised donor matrix \(L^-\).Project: \(\widehat{Y}_i^+ = (L^+)^\top \widehat{\beta}\) using the same denoised donor matrix in the post-period.
The dispatcher run_rpca() takes the same numpy-array signature
as run_pcr() so the CLUSTERSC orchestrator can call both behind
a uniform interface.
- mlsynth.utils.clustersc_helpers.rpca.pipeline.run_rpca(treated_outcome: ndarray, donor_outcomes: ndarray, donor_names: Sequence[str], T0: int, *, rpca_method: str = 'PCP', fpca_cumvar: float = 0.95, k_clusters: int | None = None, k_max: int = 8, pcp_lambda: float | None = None, pcp_mu: float | None = None, pcp_max_iter: int = 1000, pcp_tol: float = 1e-09, hqf_rank: int | None = None, hqf_cumvar: float = 0.999, hqf_lambda: float | None = None, hqf_ip: float = 1.0, hqf_max_iter: int = 1000, cv_lambda: bool = False, cv_hqf_rank: bool = False, cv_lambda_multipliers: Sequence[float] = (0.5, 1.0, 2.0, 3.0, 5.0, 8.0, 12.0), cv_hqf_rank_grid: Sequence[int] | None = None, compute_cft_pi: bool = False, cft_alpha: float = 0.05, cft_sims: int = 200, cft_e_method: str = 'gaussian', random_state: int = 0) MethodFit#
Run the five-step RPCA-SC pipeline and assemble a
MethodFit.- Parameters:
treated_outcome (np.ndarray) – Treated outcome series, shape
(T,).donor_outcomes (np.ndarray) – Donor outcomes (columns = donors), shape
(T, J).donor_names (sequence of str) – Length-
Jdonor labels.T0 (int) – Number of pre-treatment periods.
rpca_method ({“PCP”, “HQF”}) – Robust PCA decomposition.
fpca_cumvar (float) – Cumulative-variance target for FPCA truncation (Step 1). Paper default
0.95.k_clusters, k_max – Cluster-count controls; see
clustering.assign_clusters().pcp_lambda, pcp_mu, pcp_max_iter, pcp_tol – PCP solver knobs (Candes et al. 2011 / Bayani 2021).
hqf_rank, hqf_cumvar, hqf_lambda, hqf_ip, hqf_max_iter – HQF solver knobs (Wang et al. 2023).
random_state (int) – Seed for k-means and HQF.
- Returns:
MethodFit – Frozen container with the RPCA-SC fit (counterfactual projected through the denoised donor matrix in both pre and post).
Shared utilities#
Data preparation helpers for CLUSTERSC.
- mlsynth.utils.clustersc_helpers.setup.prepare_clustersc_inputs(df: DataFrame, outcome: str, treat: str, unitid: str, time: str) CLUSTERSCInputs#
Pivot the long panel into the CLUSTERSC layout.
- Parameters:
df (pd.DataFrame) – Long balanced panel.
outcome, treat, unitid, time (str) – Column names.
- Returns:
CLUSTERSCInputs – Preprocessed panel.
- Raises:
MlsynthDataError – On unbalanced panel, missing values, multiple treated cohorts, or zero pre/post-treatment periods.
Result structures for the Cluster-based Synthetic Control (CLUSTERSC) estimator.
The inputs and per-method fits are frozen dataclasses; the top-level
CLUSTERSCResults is a Pydantic EffectResult
(the two-family result contract), carrying the dataclass layers as typed fields.
CLUSTERSC bundles two complementary robust synthetic-control families:
PCR-RSC – Robust Synthetic Control via Principal Component Regression. Combines SVD-based donor clustering (Amjad, Shah and Shen 2018) with PCR weight estimation (Agarwal, Shah, Shen and Song 2021). Frequentist mode uses an SCM-style QP via
cvxpy; Bayesian mode uses the conjugate-prior posterior of Bayani (2022, CUNY dissertation Chapter 1) with credible intervals.RPCA-SC – Robust PCA Synthetic Control. Applies functional-PCA clustering on the pre-treatment outcomes, then a robust low-rank decomposition (PCP – Candes, Li, Ma, Wright 2011; or HQF – Wang, Li, So and Liu 2023) to denoise the donor panel before fitting simplex SCM weights.
Both methods can run in parallel via method = "both" so the user
can compare their estimates side by side.
The four layers below (inputs, per-method fit, optional Bayesian inference, top-level results) keep the pipeline pluggable.
- class mlsynth.utils.clustersc_helpers.structures.CLUSTERSCInference(method: str, alpha: float, att: float = nan, credible_interval: Tuple[float, float] = (nan, nan), shen: Any | None = None, cft: Any | None = None)#
Optional inferential output.
Two inference families are supported on the PCR-SC fit:
Bayesian credible band – populated when
estimator = "bayesian". Reports a posterior credible interval for the ATT (Bayani 2022 Ch. 1).Shen-Ding-Sekhon-Yu (2023) frequentist CIs – populated for the frequentist OLS PCR path. Reports per-period and ATT CIs under three sources of randomness (HZ / VT / DR) and one of three variance estimators (homoskedastic / jackknife / HRK).
- Parameters:
method (str) –
"bayesian_credible"for the Bayesian path,"shen_<variance>"(e.g."shen_homoskedastic") for the Shen et al. path, or"none"if no inference was run.alpha (float) – Two-sided significance level (e.g. 0.05 -> 95% interval).
att (float) – Mean post-treatment gap for the primary variant.
credible_interval (tuple of float) –
(lower, upper)posterior credible interval for the ATT (Bayesian path only;(nan, nan)otherwise).shen (object, optional) – Full
mlsynth.utils.clustersc_helpers.pcr.inference.ShenInferenceobject when the Shen et al. CIs were computed (frequentist OLS PCR only). Carries per-period and ATT CIs under each source-of-randomness assumption.cft (object, optional) – Full
mlsynth.utils.clustersc_helpers.rpca.inference.CFTInferenceobject when Cattaneo-Feng-Titiunik (2021) prediction intervals were computed (RPCA-SC only, opt-in viaCLUSTERSCConfig.compute_cft_pi). Carries per-period and ATT prediction intervals.
- class mlsynth.utils.clustersc_helpers.structures.CLUSTERSCInputs(treated_outcome: ndarray, donor_outcomes: ndarray, donor_names: ndarray, treated_unit_name: Any, T: int, T0: int, time_labels: ndarray)#
Preprocessed panel data for CLUSTERSC.
- Parameters:
treated_outcome (np.ndarray) – Treated outcome series, shape
(T,).donor_outcomes (np.ndarray) – Donor outcome matrix, shape
(T, J).donor_names (np.ndarray) – Length-
Jlabels of the donor units.treated_unit_name (Any) – Label of the treated unit.
T (int) – Total number of panel periods.
T0 (int) – Number of pre-treatment periods.
time_labels (np.ndarray) – Length-
Tlabels of the time periods.
- donor_names: ndarray#
- donor_outcomes: ndarray#
- time_labels: ndarray#
- treated_outcome: ndarray#
- class mlsynth.utils.clustersc_helpers.structures.CLUSTERSCResults(*, effects: ~mlsynth.config_models.EffectsResults | None = None, fit_diagnostics: ~mlsynth.config_models.FitDiagnosticsResults | None = None, time_series: ~mlsynth.config_models.TimeSeriesResults | None = None, weights: ~mlsynth.config_models.WeightsResults | None = None, inference: ~mlsynth.config_models.InferenceResults | None = None, method_details: ~mlsynth.config_models.MethodDetailsResults | None = None, sub_method_results: ~typing.Dict[str, ~typing.Any] | None = None, additional_outputs: ~typing.Dict[str, ~typing.Any] | None = None, raw_results: ~typing.Dict[str, ~typing.Any] | None = None, execution_summary: ~typing.Dict[str, ~typing.Any] | None = None, plot_config: ~mlsynth.config_models.PlotConfig | None = None, inputs: ~mlsynth.utils.clustersc_helpers.structures.CLUSTERSCInputs, pcr: ~mlsynth.utils.clustersc_helpers.structures.MethodFit | None = None, rpca: ~mlsynth.utils.clustersc_helpers.structures.MethodFit | None = None, selected_variant: str = 'pcr', cluster_inference: ~mlsynth.utils.clustersc_helpers.structures.CLUSTERSCInference | None = None, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#
Top-level container returned by
mlsynth.CLUSTERSC.fit().An
EffectResult(the observational report): it populates the standardized sub-models (effects,time_series,weights,inference,fit_diagnostics,method_details) from the primary variant, so the flat accessorsatt/counterfactual/gap/att_ci/donor_weights/pre_rmseresolve through the base contract. The CLUSTERSC-specific fields below carry the dispatcher’s extra structure (both family fits side by side, and the rich inference).- Parameters:
inputs (CLUSTERSCInputs) – Preprocessed panel.
pcr (MethodFit or None) – PCR-RSC fit (populated when
method in {"pcr", "both"}).rpca (MethodFit or None) – RPCA-SC fit (populated when
method in {"rpca", "both"}).selected_variant (str) – Which fit drives the standardized sub-models / flat accessors –
"pcr"or"rpca". Whenmethod = "both"the user picks viaCLUSTERSCConfig.primary; default"pcr".cluster_inference (CLUSTERSCInference or None) – The rich inferential output (Bayesian credible interval, Shen et al. per-period/ATT CIs, or CFT prediction intervals). The scalar summary is mirrored into the standardized
inferenceslot sores.att_ciresolves.metadata (dict) – Free-form pipeline diagnostics.
- cluster_inference: CLUSTERSCInference | None#
- inputs: CLUSTERSCInputs#
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'frozen': True, 'json_encoders': {<class 'numpy.ndarray'>: <function BaseEstimatorResults.Config.<lambda>>}}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- class mlsynth.utils.clustersc_helpers.structures.MethodFit(name: str, counterfactual: ~numpy.ndarray, gap: ~numpy.ndarray, att: float, pre_rmse: float, donor_weights: ~typing.Dict[~typing.Any, float], selected_donors: ~numpy.ndarray, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#
Single-method (PCR or RPCA) fit output.
- Parameters:
name (str) – Method identifier (
"pcr_frequentist","pcr_bayesian", or"rpca").counterfactual (np.ndarray) – Synthetic-control imputation of the treated outcome at every period, shape
(T,).gap (np.ndarray) – Observed treated minus counterfactual, shape
(T,).att (float) – Mean post-treatment gap.
pre_rmse (float) – Root-mean-squared pre-treatment fit error.
donor_weights (dict) – Mapping
{donor_name: weight}with non-trivial weights only.selected_donors (np.ndarray) – Subset of
donor_namesretained after clustering (PCR) or cluster selection (RPCA). Equal to all donors when no clustering is applied.metadata (dict) – Free-form per-method diagnostics (e.g. cluster-id assignment, rank used by RPCA, posterior credible bounds for Bayesian PCR).
- counterfactual: ndarray#
- gap: ndarray#
- selected_donors: ndarray#
Diagnostic plot for CLUSTERSC results.
- mlsynth.utils.clustersc_helpers.plotter.plot_clustersc(results: CLUSTERSCResults) None#
Two-panel plot: trajectories + gap, both with PCR / RPCA overlays.
Example#
A self-contained one-draw Monte Carlo at a two-factor DGP, fitting
both PCR-SC and RPCA-SC in parallel and inspecting their fits via
the frozen CLUSTERSCResults container.
"""One draw of a two-factor CLUSTERSC simulation."""
import numpy as np
import pandas as pd
from mlsynth import CLUSTERSC
# ---------------------------------------------------------------------
# 1. Simulate one panel from a two-factor DGP
# ---------------------------------------------------------------------
rng = np.random.default_rng(0)
J = 12 # control units
T_pre = 14
T_post = 6
T = T_pre + T_post
r_true = 2
tau_true = 1.0 # additive treatment effect on the treated unit
F = rng.standard_normal((T, r_true))
lam = rng.standard_normal((J + 1, r_true))
eps = rng.standard_normal((T, J + 1)) * 0.4
Y = F @ lam.T + eps
Y[T_pre:, 0] += tau_true # unit 0 treated
rows = [
{
"unit": j,
"time": t,
"y": float(Y[t, j]),
"D": int(j == 0 and t >= T_pre),
}
for j in range(J + 1)
for t in range(T)
]
df = pd.DataFrame(rows)
# ---------------------------------------------------------------------
# 2. Fit CLUSTERSC with both families and Bayesian PCR inference
# ---------------------------------------------------------------------
res = CLUSTERSC({
"df": df,
"outcome": "y",
"treat": "D",
"unitid": "unit",
"time": "time",
"method": "both",
"primary": "pcr",
"pcr_objective": "OLS",
"estimator": "bayesian",
"rpca_method": "HQF",
"alpha": 0.10,
# No cluster structure in the synthetic panel; pool donors for
# the RPCA branch so its Algorithm 4 Step 2 does not isolate
# the treated row.
"k_clusters": 1,
}).fit()
# ---------------------------------------------------------------------
# 3. Inspect the output
# ---------------------------------------------------------------------
print(f"true tau = {tau_true:+.3f}")
print(f"selected variant = {res.selected_variant}")
print(f"primary ATT = {res.att:+.3f}")
print(f"PCR ATT = {res.pcr.att:+.3f}")
print(f"RPCA ATT = {res.rpca.att:+.3f}")
print(f"PCR pre-RMSE = {res.pcr.pre_rmse:.4f}")
print(f"RPCA pre-RMSE = {res.rpca.pre_rmse:.4f}")
print(f"HSVT rank (PCR) = {res.pcr.metadata['rank']}")
print(f"HQF rank (RPCA) = {res.rpca.metadata['hqf_rank']}")
if res.cluster_inference.method == "bayesian_credible":
lo, hi = res.cluster_inference.credible_interval
print(f"Bayesian 90% CrI ATT = [{lo:+.3f}, {hi:+.3f}]")
Empirical example: California Proposition 99#
The PCR-SC family also runs without clustering, in which case it
reduces to Amjad-Shah-Shen (2018) Robust Synthetic Control on the
full donor pool. Below: PCR with explicit rank \(r = 4\), then
inspect the Shen-Ding-Sekhon-Yu (2023) frequentist CIs that the
default compute_shen_ci=True produces.
from mlsynth import CLUSTERSC
import pandas as pd
file = (
"https://raw.githubusercontent.com/jgreathouse9/mlsynth/"
"refs/heads/main/basedata/smoking_data.csv"
)
df = pd.read_csv(file)
res = CLUSTERSC({
"df": df,
"outcome": "cigsale",
"treat": "Proposition 99",
"unitid": "state",
"time": "year",
"method": "pcr",
"clustering": False,
"rank": 4, # explicit HSVT rank
"display_graphs": True,
}).fit()
print(f"ATT = {res.att:+.3f}")
shen = res.cluster_inference.shen
print(f"95% CI (VT src) = [{shen.att_ci_vt[0]:+.2f}, {shen.att_ci_vt[1]:+.2f}]")
print(f"95% CI (HZ src) = [{shen.att_ci_hz[0]:+.2f}, {shen.att_ci_hz[1]:+.2f}]")
Same panel, RPCA-SC with cross-validated PCP \(\lambda\). The
cv_lambda=True flag triggers leave-one-time-out CV over
cv_lambda_multipliers \(\times \lambda^\star\) to pick the
prediction-optimal sparsity penalty.
res = CLUSTERSC({
"df": df,
"outcome": "cigsale",
"treat": "Proposition 99",
"unitid": "state",
"time": "year",
"method": "rpca",
"rpca_method": "PCP",
"cv_lambda": True,
"k_clusters": 1, # weak cluster structure on this panel
}).fit()
md = res.rpca.metadata
print(f"PCP lambda chosen by CV = {md['pcp_lambda']:.4f} "
f"(best of grid {[round(x,3) for x in md['cv_lambda_grid']]})")
print(f"pre-RMSE = {res.rpca.pre_rmse:.3f}")
print(f"ATT = {res.rpca.att:+.3f}")
Same fit with Cattaneo-Feng-Titiunik (2021) prediction intervals
(opt-in via compute_cft_pi=True; the bootstrap reruns the full
pipeline cft_sims times, so this takes ~30s on Prop 99 with
the default cft_sims=200).
res = CLUSTERSC({
"df": df,
"outcome": "cigsale",
"treat": "Proposition 99",
"unitid": "state",
"time": "year",
"method": "rpca",
"rpca_method": "PCP",
"cv_lambda": True,
"k_clusters": 1,
"compute_cft_pi": True,
"cft_sims": 200,
"cft_alpha": 0.05,
}).fit()
cft = res.cluster_inference.cft
print(f"ATT = {res.att:+.3f}")
print(f"95% CFT PI on ATT = [{cft.att_pi[0]:+.2f}, {cft.att_pi[1]:+.2f}]")
print(f"sigma_e (pre-RMSE) = {cft.sigma_e:.3f}")
print(f"out-of-sample eps = {cft.out_of_sample_eps:.3f}")
for ti in range(len(cft.per_period_gap)):
lo, hi = cft.per_period_pi[ti]
print(f" t={res.inputs.T0+ti}: tau_t={cft.per_period_gap[ti]:+.2f} "
f"PI=[{lo:+.2f}, {hi:+.2f}]")
Empirical validation: West German reunification (Bayani 2021)#
Bayani (2021, Section 3) validates RPCA-SC on the German
reunification panel of Abadie, Diamond & Hainmueller (2015): the
1990 reunification is treated as an intervention on West German
per-capita GDP, with 16 OECD economies as the candidate donor pool.
It is the canonical empirical benchmark for the RPCA-SC family –
the analogue of Proposition 99 for PCR-SC – and mlsynth reproduces
it from the bundled german_reunification.csv. RPCA-SC uses only
the outcome series (no auxiliary covariates), in contrast to the
five predictors Abadie et al. (2015) used.
from mlsynth import CLUSTERSC
import pandas as pd
file = (
"https://raw.githubusercontent.com/jgreathouse9/mlsynth/"
"refs/heads/main/basedata/german_reunification.csv"
)
df = pd.read_csv(file)
res = CLUSTERSC({
"df": df,
"outcome": "gdp",
"treat": "Reunification",
"unitid": "country",
"time": "year",
"method": "rpca",
"rpca_method": "PCP", # convex RPCA, as in Bayani (2021)
"compute_cft_pi": True,
"cft_sims": 200,
"cft_alpha": 0.05,
"display_graphs": True,
}).fit()
cft = res.cluster_inference.cft
gap = res.inputs.treated_outcome - res.rpca.counterfactual
weights = {k: round(v, 2) for k, v in res.rpca.donor_weights.items()
if abs(v) > 1e-3}
print(f"donor pool (clustered) = {len(res.rpca.donor_weights)}")
print(f"positive weights = {weights}")
print(f"pre-RMSE (1960-1989) = {res.rpca.pre_rmse:.1f}")
print(f"reunification ATT = {res.rpca.att:+.0f}")
print(f"gap in 2003 = {gap[-1]:+.0f}")
print(f"95% CFT PI on ATT = [{cft.att_pi[0]:+.0f}, {cft.att_pi[1]:+.0f}]")
The FPCA + \(k\)-means step (Steps 1-2) selects an 11-economy
cluster for West Germany – Australia, Austria, Belgium, Denmark,
France, Italy, Japan, the Netherlands, New Zealand, Norway and the
UK – excluding the level outliers (the USA, Switzerland) and the
lower-income periphery (Greece, Portugal, Spain), exactly the donor
pool of Bayani’s Table 2. PCP then reproduces that table’s weights
to two digits: Norway \(\approx 0.48\), France
\(\approx 0.35\), New Zealand \(\approx 0.30\) and Austria
\(\approx 0.02\), with the remaining donors at zero. The average
post-1990 gap is \(\approx -1500\) per-capita units with a
1960-1989 fit RMSE near \(90\), and the per-period gap widens to
about \(-3730\) by 2003 – tracing the Robust PCA Synthetic
Control trajectory in Bayani’s Figure 4 (a small positive blip in
1991-1992 followed by a steadily widening negative gap). The
Cattaneo-Feng-Titiunik (2021) 95% prediction interval on the ATT is
about \([-1765, -1280]\), excluding zero (its width tracks
cft_sims and random_state); the HQF decomposition lands near
\(-1920\). The conclusion echoes Abadie et al. (2015):
reunification depressed West German per-capita GDP relative to its
synthetic counterpart.
Simulation studies#
The three families documented here were each validated by their authors on a purpose-built simulation. mlsynth ships these designs verbatim where reproducible so the estimators can be regression- tested against the originals.
PCR-SC – missing-data robustness (Agarwal et al. 2021)#
Agarwal et al. (2021, Section 4.5) do not use a synthetic DGP; their validation is a missing-data robustness study on two canonical panels – Basque terrorism (Abadie-Gardeazabal 2003) and California Proposition 99 (Abadie et al. 2010). For each panel they
take the widely accepted published counterfactual as ground truth;
confirm the donor matrix is near rank-one – over \(99\%\) of the spectral energy sits in the top singular value, exactly the low-rank regime in which the Theorem 3.1 / 3.2 error bounds bite;
randomly obfuscate \(5\%\)-\(20\%\) of the donor entries and re-fit PCR (HSVT + OLS) on the outcome series only.
The finding: PCR tracks the published baseline across all missing- data levels, whereas classical convex SC degrades sharply and plain OLS overfits the pre-period noise (it even flips the sign of the Basque effect). This is the empirical face of the HSVT denoising step – PCR recovers the same conclusion with neither auxiliary covariates nor complete data. mlsynth reproduces the headline check on Proposition 99: the singular spectrum is near rank-one and a rank-1 HSVT PCR fit matches the canonical estimate.
import numpy as np, pandas as pd
from mlsynth import CLUSTERSC
file = (
"https://raw.githubusercontent.com/jgreathouse9/mlsynth/"
"refs/heads/main/basedata/smoking_data.csv"
)
df = pd.read_csv(file)
wide = df.pivot(index="year", columns="state", values="cigsale")
sv = np.linalg.svd(wide.values, compute_uv=False)
print(f"top-SV energy share = {sv[0]**2 / (sv**2).sum():.3f}") # ~0.99
res = CLUSTERSC({
"df": df, "outcome": "cigsale", "treat": "Proposition 99",
"unitid": "state", "time": "year",
"method": "pcr", "clustering": False, "rank": 1,
}).fit()
print(f"rank-1 PCR ATT = {res.att:+.2f}")
PCR-SC – de-noising and the Bayesian posterior (Amjad-Shah-Shen 2018)#
Amjad-Shah-Shen ([Amjad2018], Section 5.3) introduce the HSVT de-noising step that PCR-SC is built on, and validate it on a fully synthetic panel where the true mean is known. Each unit \(i\) draws a latent feature \(\theta_i \sim U[0, 1]\), time is \(\rho_t = t\), and the mean is
with periodicities \(f_1 = \rho_t \bmod 360\), \(f_2 = \rho_t \bmod 180\), \(f_3 = 2\rho_t \bmod 360\), \(f_4 = 2\rho_t \bmod 180\); the observed \(X_{it} = m_{it} + \mathcal N(0, \sigma^2)\). They use \(N = 100\), \(T = 2000\), intervention at \(t = 1600\). Two findings drive the paper:
Training error tracks generalization error (Table 1). The pre-intervention MSE of the estimated mean closely matches the post-intervention MSE at every noise level – the de-noised fit generalizes.
De-noising beats no de-noising (Table 2). The HSVT step lowers the generalization error consistently versus regressing on the raw donor matrix.
mlsynth reproduces both: the pre/post MSE ratio is \(\approx 1\)
across \(\sigma^2\), and rank-4 HSVT de-noising cuts the
generalization error several-fold relative to the full-rank (no
de-noising) fit. The paper’s Bayesian counterpart (Figure 12) trades
the inner OLS for a Gaussian posterior over the weights; mlsynth’s
estimator="bayesian" path now tracks the frequentist point
estimate to within Monte-Carlo error. Three implementation choices
make that hold (in
mlsynth.utils.clustersc_helpers.pcr.bayesian.solve_bayesian()):
the observation noise \(\sigma^2\) is estimated from the pre-period
fit residual (not the total target variance, which conflates signal
with noise); the point counterfactual is the posterior-mean
projection (not a Monte-Carlo median of draws); and both the point
estimate and the credible band are projected through the de-noised
rank-\(r\) donor matrix, so the band reflects the signal subspace
rather than raw-donor noise in the weight null space.
import numpy as np
import pandas as pd
from mlsynth import CLUSTERSC
N, T, T0 = 100, 2000, 1600
rng = np.random.default_rng(0)
theta = rng.uniform(0, 1, N)
rho = np.arange(1, T + 1).astype(float)
time_term = (np.cos((rho % 360) * np.pi / 180)
+ 0.5 * np.sin((rho % 180) * np.pi / 180)
+ 1.5 * np.cos(((2 * rho) % 360) * np.pi / 180)
- 0.5 * np.sin(((2 * rho) % 180) * np.pi / 180))
a = 1.0 + 0.3 * (rho / T) * np.exp(rho / T)
M = np.outer(theta, a) + np.outer(np.ones(N), time_term) # true means (N, T)
m_true = M[0] # treated unit's truth
def fit(X, rank, estimator="frequentist"):
df = pd.DataFrame(
[(j, t, float(X[j, t]), int(j == 0 and t >= T0))
for j in range(N) for t in range(T)],
columns=["unit", "time", "y", "D"])
res = CLUSTERSC({
"df": df, "outcome": "y", "treat": "D", "unitid": "unit",
"time": "time", "method": "pcr", "estimator": estimator,
"clustering": False, "rank": rank, "rank_method": "fixed",
"standardize_for_rank": False, "compute_shen_ci": False,
}).fit()
return np.asarray(res.pcr.counterfactual)
X = M + rng.normal(0, np.sqrt(1.9), (N, T)) # sigma^2 = 1.9
cf4 = fit(X, rank=4) # de-noised (4 singular values)
cf_full = fit(X, rank=N - 1) # no de-noising (full rank)
cf_b = fit(X, rank=4, estimator="bayesian")
mse = lambda u, v: float(np.mean((u - v) ** 2))
print(f"Table 1 train(pre)={mse(cf4[:T0], m_true[:T0]):.4f} "
f"gen(post)={mse(cf4[T0:], m_true[T0:]):.4f}") # ~equal
print(f"Table 2 de-noised gen={mse(cf4[T0:], m_true[T0:]):.4f} "
f"no-de-noising gen={mse(cf_full[T0:], m_true[T0:]):.4f}")
print(f"Bayesian gen={mse(cf_b[T0:], m_true[T0:]):.4f} (tracks frequentist)")
A single draw at \(\sigma^2 = 1.9\) gives training and generalization MSE both \(\approx 0.02\), a no-de-noising generalization error \(\approx 6\times\) larger, and a Bayesian generalization error indistinguishable from the frequentist – the de-noising and posterior machinery behaving as Amjad-Shah-Shen describe.
PCR-SC – donor-selection gains (Rho et al. 2025, ClusterSC)#
Rho et al. (2025, Section 6.1) build a synthetic panel with two latent subgroups to show that clustering the donor pool before fitting SC tightens the post-intervention error (their Theorems 5.11 / 5.13). The DGP, sized after disaggregated SC applications, uses \(T = 10\), \(T_0 = 8\), \(n \in \{1000, 2000\}\) donors split evenly into groups \(A\) and \(B\). Each group is a sum of \(r_S\) sinusoids
with group-specific frequency/magnitude laws –
\(\alpha_i \sim \mathrm{Beta}(2, 2)\),
\(\omega_i \sim \mathrm{Unif}(1, 3)\) for \(A\) versus
\(\alpha_i \sim \mathrm{Beta}(2, 5)\),
\(\omega_i \sim \mathrm{Unif}(3, 6)\) for \(B\), and
\(\phi_i \sim \mathcal N(0, 1)\) for both – then additive noise
\(E_{i,t} \sim \mathcal N(0, s^2)\) swept over
\(s \in \{0.10, 0.15, \dots, 0.40\}\). Over 500 datasets, a
leave-one-out placebo test on \(30\%\) of group \(A\)
compares ClusterSC against full-pool SC via the per-target
improvement \(I_i = \mathrm{MSE}_{\text{SC}} -
\mathrm{MSE}_{\text{ClusterSC}}\). The median improvement is positive
at every noise level and grows with \(s\): once noise blurs
the latent structure, restricting to the treated unit’s cluster pays
off. mlsynth’s clustering=True (the default) is exactly this
donor-selection step ahead of the PCR weight solve.
mlsynth’s PCR-SC was validated head to head against the authors’
own reference implementation (the syclib library released with
the paper) on this exact DGP and on the paper’s empirical
house-price-index panel. Running both pipelines – SVD
\(k\)-means clustering, HSVT denoising, then an OLS weight solve
– on the same targets, the two produce near-identical
counterfactuals: the per-target counterfactual trajectories
correlate at \(\ge 0.99\) on the synthetic sine panel and at
\(\ge 0.999\) on the house-price-index panel, with matching
cluster assignment / donor-pool sizes and pre- and post-intervention
MSEs in the same range. The two residual differences are documented
design choices, not discrepancies: mlsynth denoises the pre-period
donor block for the weight fit (the Amjad-Shah-Shen [Amjad2018]
convention, project_denoised=False by default) where the
reference code denoises the full \(Y_0\) once, and mlsynth’s
clustering features come from the pre-period SVD rather than the full
panel. Set project_denoised=True to match the reference code’s
projection convention.
The synthetic DGP is self-contained – one draw, one treated subgroup-\(A\) unit, donors clustered before the PCR fit:
import numpy as np
import pandas as pd
from mlsynth import CLUSTERSC
def _sine_panel(n, T, noise, n_signals, alpha_beta, omega_lo, omega_hi, rng):
"""One subgroup: n units, each a random mix of n_signals sinusoids + noise."""
basis = np.zeros((n_signals, T))
grid = np.arange(int(T * 1.2)) * 10 * np.pi
for i in range(n_signals):
alpha = rng.beta(*alpha_beta)
omega = rng.uniform(omega_lo, omega_hi)
phi = rng.normal(0, 1)
wave = alpha * np.sin(2 * np.pi * omega * grid / 360 + phi)
basis[i] = wave[int(0.2 * T):] # drop first 20%
W = rng.uniform(0, 1, (n, n_signals))
return W @ basis + rng.normal(0, noise, (n, T)) # (n, T)
rng = np.random.default_rng(0)
T, T0, rS, nA, nB, noise = 10, 8, 3, 60, 60, 0.30
A = _sine_panel(nA, T, noise, rS, (2, 2), 1, 3, rng) # subgroup A
B = _sine_panel(nB, T, noise, rS, (2, 5), 3, 6, rng) # subgroup B
panel = np.vstack([A, B]) # (nA + nB, T)
# Treat one subgroup-A unit; everyone else is a donor. Add a known effect.
tau_true, treated = 5.0, 0
panel[treated, T0:] += tau_true
df = pd.DataFrame([
{"unit": j, "time": t, "y": float(panel[j, t]),
"D": int(j == treated and t >= T0)}
for j in range(nA + nB) for t in range(T)
])
res = CLUSTERSC({
"df": df, "outcome": "y", "treat": "D", "unitid": "unit", "time": "time",
"method": "pcr", "pcr_objective": "OLS", "clustering": True,
"k_clusters": 2, "rank": rS, "rank_method": "fixed",
}).fit()
n_sel = len(res.pcr.donor_weights)
print(f"donors selected by clustering = {n_sel} of {nA + nB - 1}") # ~ subgroup A
print(f"selected (mostly) subgroup A = {n_sel <= nA}") # True
print(f"pre-RMSE = {res.pcr.pre_rmse:.3f}")
print(f"ATT (true {tau_true:+.1f}) = {res.pcr.att:+.3f}") # ~ +5
The clustering step isolates the treated unit’s subgroup (\(\approx n_A - 1\) donors), the pre-period fit is tight, and the OLS weight solve recovers the planted effect – matching the reference ClusterSC+OLS output to the differences noted above.
RPCA-SC – two-process recovery under noise and missingness (Bayani 2021)#
Bayani (2021, Section 4) generates \(N_1 = N_2 = 100\) units from two deterministic processes plus i.i.d. Gaussian noise \(\epsilon_t \sim \mathcal N(0, \sigma^2)\) over \(t \in [0, T]\) with \(T = 250\) and intervention at \(T_0 = 150\):
with the noise variance swept over \(\sigma^2 \in \{1, 4, 9, 16, 25\}\). The study has two halves:
Clustering recovery (Steps 1-2). FPCA over the pre-period plus \(k\)-means recovers the two latent processes with \(100\%\) accuracy at every noise level; the first FPC score alone explains over \(95\%\) of the variation and the silhouette statistic selects \(k = 2\).
Counterfactual accuracy (Steps 3-5). Treating the noiseless mean of \(f_1\) as the target and its 100 noisy realisations as the donor pool, RPCA extracts the low-rank component and projects the counterfactual. The estimation RMSPE stays low and pre-/post- intervention errors stay close even at high noise:
Bayani (2021) Table 3 – RMSPE, fully observed# \(\sigma^2\)
Pre-intervention
Post-intervention
1
0.09
0.13
4
0.19
0.25
9
0.29
0.38
16
0.39
0.51
25
0.49
0.64
Repeating with \(30\%\) of entries removed at random (Table 4) inflates the errors – the robust PCA step also imputes – but the counterfactual still tracks the true mean (e.g. post-intervention RMSPE \(0.65\) at \(\sigma^2 = 1\), rising to \(2.59\) at \(\sigma^2 = 25\)). Bayani’s design uses the convex (PCP) decomposition; the HQF solver is a later mlsynth addition.
A compact one-draw reproduction of the clustering half:
import numpy as np, pandas as pd
from mlsynth import CLUSTERSC
rng = np.random.default_rng(0)
T, T0, sigma2 = 250, 150, 9.0
t = np.arange(1, T + 1)
def f1(): return (0.3 * (t % (T + 1)) - (t % 10) * np.sin(t / np.pi)
+ (t % 10) * np.cos(t / np.pi))
def f2(): return np.log(t) + 4 * np.sin(t / np.pi) + 4 * np.cos(t / np.pi)
rows = []
for j in range(200): # 100 of each process
base = f1() if j < 100 else f2()
y = base + rng.normal(0, np.sqrt(sigma2), size=T)
treated = (j == 0) # one f1 unit is treated
if treated:
y[T0:] += 5.0
rows += [{"unit": j, "time": int(tt), "y": float(yi),
"D": int(treated and tt > T0)}
for tt, yi in zip(t, y)]
df = pd.DataFrame(rows)
res = CLUSTERSC({
"df": df, "outcome": "y", "treat": "D",
"unitid": "unit", "time": "time",
"method": "rpca", "rpca_method": "PCP",
}).fit()
print(f"donor pool size = {len(res.rpca.donor_weights)}") # treated unit's cluster (f1 subgroup)
print(f"pre-RMSE = {res.rpca.pre_rmse:.2f}")
print(f"ATT (true 5.0) = {res.rpca.att:+.2f}")
References#
Abadie, A., & Gardeazabal, J. (2003). “The Economic Costs of Conflict: A Case Study of the Basque Country.” American Economic Review 93(1):113-132.
Abadie, A., Diamond, A., & Hainmueller, J. (2010). “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program.” Journal of the American Statistical Association 105(490):493-505.
Abadie, A., Diamond, A., & Hainmueller, J. (2015). “Comparative Politics and the Synthetic Control Method.” American Journal of Political Science 59(2):495-510.
Agarwal, A., Shah, D., Shen, D., & Song, D. (2021). “On Robustness of Principal Component Regression.” Journal of the American Statistical Association 116(536):1731-1745.
Amjad, M., Shah, D., & Shen, D. (2018). “Robust Synthetic Control.” Journal of Machine Learning Research 19(22):1-51.
Athey, S., Bayati, M., Doudchenko, N., Imbens, G., & Khosravi, K. (2021). “Matrix Completion Methods for Causal Panel Data Models.” Journal of the American Statistical Association 116(536): 1716-1730.
Bayani, M. (2021). “Robust PCA Synthetic Control.” arXiv:2108.12542.
Bayani, M. (2022). “Essays on Machine Learning Methods in Economics.” Chapter 1, CUNY Academic Works.
Candes, E. J., Li, X., Ma, Y., & Wright, J. (2011). “Robust Principal Component Analysis?” Journal of the ACM 58(3):11.
Cattaneo, M. D., Feng, Y., & Titiunik, R. (2021). “Prediction Intervals for Synthetic Control Methods.” Journal of the American Statistical Association 116(536):1865-1880. Reference implementation: nppackages/scpi.
Chatterjee, S. (2015). “Matrix Estimation by Universal Singular Value Thresholding.” Annals of Statistics 43(1):177-214.
Hartigan, J. A., & Wong, M. A. (1979). “Algorithm AS 136: A K-means Clustering Algorithm.” Applied Statistics 28(1):100-108.
Li, Y., Wang, N., & Carroll, R. J. (2010). “Generalized Functional Linear Models with Semiparametric Single-index Interactions.” Journal of the American Statistical Association 105(490):621-633.
Rho, S., Tang, A., Bergam, N., Cummings, R., & Misra, V. (2025). “ClusterSC: Advancing Synthetic Control with Donor Selection.” arXiv:2503.21629.
Rousseeuw, P. J. (1987). “Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis.” Journal of Computational and Applied Mathematics 20:53-65.
Shen, D., Ding, P., Sekhon, J., & Yu, B. (2023). “Same Root Different Leaves: Time Series and Cross-Sectional Methods in Panel Data.” Econometrica 91(6):2125-2154. Reference implementation: deshen24/panel-data-regressions.
Wang, Z., Li, X. P., So, H. C., & Liu, Z. (2023). “Robust PCA via Non-convex Half-quadratic Regularization.” Signal Processing 204:108816.