Cluster Synthetic Controls (CLUSTERSC)

Contents

Cluster Synthetic Controls (CLUSTERSC)#

Overview#

CLUSTERSC packages two paper-aligned families of robust synthetic control behind a single estimator.

  • PCR-SC. Rho, Tang, Bergam, Cummings & Misra (2025), ClusterSC: Advancing Synthetic Control with Donor Selection (arXiv:2503.21629). Hard Singular Value Thresholding (HSVT) of the donor matrix, optional \(k\)-means donor clustering on the right-singular-vector embedding, then ordinary least squares against the denoised donor matrix. Builds on the Robust Synthetic Control proposal of Amjad, Shah & Shen (2018) and the PCR consistency results of Agarwal, Shah, Shen & Song (2021). mlsynth exposes two alternative weight solvers on top of Algorithm 2: the Bayesian Robust Synthetic Control posterior over the weights (Amjad, Shah & Shen 2018) and a simplex-constrained variant that retains Abadie-Diamond-Hainmueller weights.

  • RPCA-SC. Bayani (2021), Robust PCA Synthetic Control (arXiv:2108.12542; Chapter 1 of Bayani 2022). Functional PCA on pre-period trajectories, silhouette-driven \(k\)-means on the resulting FPC scores, robust \(L + S\) decomposition of the treated unit’s cluster via Principal Component Pursuit (Candes, Li, Ma & Wright 2011) or half-quadratic non-convex regularisation (Wang, Li, So & Liu 2023), then non-negative least squares against the low-rank donor matrix.

Either family can be selected via CLUSTERSCConfig.method ("pcr", "rpca", or "both"); when both run, the CLUSTERSCConfig.primary field selects which fit drives the convenience aliases (att, counterfactual, gap, donor_weights) on the result object.

Mathematical Formulation#

Setup#

We observe a single treated unit (indexed 1) and \(J\) controls over \(T\) periods, with treatment starting at \(T_0 + 1\). Stack the donor outcomes as \(Y_0 \in \mathbb{R}^{T \times J}\) (columns = donors) and write \(y_1 \in \mathbb{R}^{T}\) for the treated series. Pre-period slices are \(Y_0^- \in \mathbb{R}^{T_0 \times J}\) and \(y_1^- \in \mathbb{R}^{T_0}\); post-period slices carry the + superscript. Treatment-effect targets are the per-period gaps \(\tau_t = y_{1, t} - \hat y^0_{1, t}\) and the average treatment effect on the treated,

\[ATT = \frac{1}{T - T_0} \sum_{t > T_0} \tau_t.\]

Both families assume the untreated potential outcome has an approximately low-rank structure \(Y^0 = M + E\), where \(M_{i,t} = g(\theta_i, \rho_t)\) is a deterministic latent factor signal (Athey et al. 2021) and \(E\) collects zero-mean idiosyncratic noise.

PCR-SC family (Rho et al. 2025)#

The PCR family combines three building blocks: rank-\(r\) HSVT denoising of the donor matrix, an optional donor-clustering pre-step, and a weight solver. Algorithm 2 of the paper sets out the basic pipeline; Algorithm 4 wraps it with the clustering step.

Why PCR: theory and assumptions#

What PCR is, and why it is the right tool. Principal component regression (project the donors onto their top-\(r\) principal components, then regress) is, without any change, robust to the noise and missingness that pervade panel data [Agarwal2021]. The key identity is that PCR is equivalent to ordinary least squares after hard singular-value thresholding (HSVT) of the covariate (donor) matrix (Agarwal et al. 2021, Prop. 2.1): retaining the top-\(r\) PCs and regressing yields the same fitted response as \(\mathrm{HSVT}_r\) followed by OLS. That HSVT step implicitly de-noises the donors by projecting them onto the rank-\(r\) signal subspace, which is exactly why the synthetic control inherits robustness to the idiosyncratic shocks every donor series carries. In the synthetic-control language this estimator is Robust Synthetic Control (Amjad-Shah-Shen [Amjad2018]); Agarwal et al. prove PCR and RSC are identical, so the HSVT+OLS path here carries RSC’s guarantees.

Model (error-in-variables). In the synthetic-control setting [ClusterSC] formalizes the donor panel as a noisy, partially observed view of a low-rank signal, \(Y_0 = M + E\), where \(E\) collects mean-zero idiosyncratic shocks and (optionally) missing cells, and the treated series obeys an approximate linear synthetic control on the signal,

\[y_{1,t} = M_{\cdot,t}^\top f^\star + \varepsilon_t + \phi_t , \qquad \|f^\star\| \le \mu ,\]

with response noise \(\varepsilon_t\) (mean zero, variance \(\le \sigma^2\)) and a deterministic model-mismatch term \(\phi_t\).

Assumptions.

Assumption 1 (low-rank latent-factor signal). The signal is generated by a latent variable model \(M_{i,t} = g(\theta_i, \rho_t)\) with \(g\) Lipschitz in finite-dimensional unit factors \(\theta_i\) and time factors \(\rho_t\), and bounded entries (\(\|M\|_\infty \le 1\)). This forces \(M\) to be (approximately) low rank, \(\mathrm{rank}(M) = r = O(\log T) < T\). Remark. This is the assumption that lets PCR/HSVT work at all – and it earns the synthetic control rather than assuming it: under this model an (approximate) linear SC \(f^\star\) provably exists (Agarwal et al. 2021, Prop. 4.1), so the existence of a synthetic combination need not be imposed as an axiom as in classical SC.

Assumption 2 (error-in-variables donors). We observe \(Z_{jt} = M_{jt} + \eta_{jt}\) with each cell present independently with probability \(\rho\) (missing otherwise); the covariate noise \(\eta\) has independent, mean-zero, \(\psi_\alpha\) (sub-exponential-type) rows with bounded variance. Remark. SC is inherently an error-in-variables problem – donor series are noisy proxies for their signal – which is precisely the regime where regressing on the raw donors (or Lasso/Ridge) loses consistency; the HSVT pre-step is what restores it.

Assumption 3 (approximate linear SC). The treated signal lies (approximately) in the span of the donor signals, \(y_1 = M^\top f^\star + \varepsilon + \phi\), with bounded \(\|f^\star\|\) and deterministic mismatch \(\phi\). Remark. This relaxes the classical convex-hull (non-negative, sum-to-one) restriction: the natural solver is unconstrained OLS on the denoised donors (no simplex), which is the default below; the SIMPLEX option re-imposes the convex hull when interpretability is preferred.

Assumption 4 (independent noise sources). The response noise \(\varepsilon\), covariate noise \(\eta\), and the missingness pattern are mutually independent (Agarwal et al. 2021, Rmk. 3.2). Remark. PCR is noise-model-agnostic: unlike the error-in-variables literature it needs no knowledge of the noise covariance, which is what makes it practical for panels with unknown shock structure.

Finite-sample guarantees. Under Assumptions 1-4 the pre-period (training) error of the HSVT+OLS estimator decomposes into three interpretable pieces (Agarwal et al. 2021, Thm. 3.1):

\[\mathrm{MSE}_{\mathrm{pre}}(\widehat y_1) \;\lesssim\; \underbrace{\frac{\sigma^2 r}{T_0}}_{\text{regression}} + \underbrace{\frac{\|f^\star\|^2}{T_0}\, \mathbb{E}\bigl\|\mathrm{HSVT}_r(Z) - M\bigr\|_{2,\infty}^2}_{\text{donor corruption}} + \underbrace{\frac{\|\phi\|^2}{T_0}}_{\text{mismatch}} ,\]

where the donor-corruption term is controlled by a novel \(\ell_{2,\infty}\)-norm bound on HSVT (Lemma 3.1, stronger than the usual Frobenius bound). With observation fraction \(\rho\), the overall pre-period rate is \(\rho^{-4} r / \min(T_0, J) + \|\phi\|^2/T_0\) (Cor. 3.1) – i.e. PCR matches, up to log factors, the minimax OLS rate one would get with perfectly observed donors, despite seeing only noisy, partially observed ones. The post-period (test) error is bounded by the training error plus a generalization penalty scaling as \(r^{5/2}/\sqrt{T_0}\) (Thm. 3.2), which is exactly what justifies the data-driven rank selection (cumulative-variance / USVT rules) used below: choose \(r\) to trade pre-period fit against this complexity penalty.

HSVT denoising. mlsynth follows the Amjad-Shah-Shen (2018) convention and applies HSVT to the pre-period donor matrix only. The full-matrix variant proposed in Rho et al. (2025) Algorithm 2 (SVD on the entire \((T, J)\) panel, then slice the pre-period rows) leaks post-period donor information into the rank-\(r\) reconstruction, which can wash out the very post-period deviations the synthetic control is meant to detect. The user can opt into the full-matrix variant via CLUSTERSCConfig.project_denoised, in which case HSVT is also applied to \((Y_0)\) for the projection step.

Let \(Y_0^- = U \Sigma V^\top\) be the SVD of the pre-period donor matrix. The rank-\(r\) hard truncation

\[\widetilde M^- = \mathrm{HSVT}_r(Y_0^-) = \sum_{i = 1}^{r} \sigma_i u_i v_i^\top\]

isolates the low-rank signal \(M^-\). The truncation rank can be chosen three ways:

  • Cumulative variance (default; paper Section 6.1) – smallest \(r\) with \(\sum_{i \le r} \sigma_i^2 / \sum_i \sigma_i^2 \ge\) cumvar_threshold (default \(0.95\)).

  • Explicit rank. Caller passes \(r\) directly via the rank parameter; the dispatcher promotes rank_method="fixed" automatically.

  • USVT (Chatterjee 2015; Donoho-Gavish 2014). Universal threshold; preserved for back-compatibility with Amjad et al. 2018.

For the data-driven rules ("cumvar" and "usvt"), the spectral comparison is computed on the column-centred donor matrix by default (standardize_for_rank=True). Each donor is demeaned over time; the variance scale is left untouched. Otherwise the leading singular value of an uncentered panel (e.g. cigarette sales in absolute units) absorbs the overall scale and dwarfs the remaining components, leaving cumvar_threshold unable to discriminate. Full z-scoring (also dividing by the donor standard deviations) is avoided because it equalises donor variances and artificially inflates the rank that a cumvar threshold picks. Centring is only used for rank picking – the HSVT step itself still consumes the raw matrix so the counterfactual is returned in original units.

The same rank is applied to the full \(Y_0\) so that the post-period projection (Algorithm 4 Step 5) consumes the denoised matrix \(\widetilde M = \mathrm{HSVT}_r(Y_0)\).

Donor clustering (Algorithm 3). When clustering=True the estimator clusters donors on the rows of \(\widetilde U = U \Sigma_r\). With \(k\) chosen by the silhouette coefficient (Rousseeuw 1987) over \(k \in [2, k_{\max}]\), run \(k\)-means and embed the treated unit via \(\tilde u = V_r^\top y_1^-\); the donor pool becomes the cluster minimising \(\|c_\ell - \tilde u\|_2\) (Algorithm 4 Step 2). The selected donor sub-matrix is denoised again at the same rank before the weight step.

OLS weights (Algorithm 2 Step 3). Drop the simplex constraints of canonical synthetic control and solve

\[\widehat f = \arg\min_{f \in \mathbb{R}^J} \bigl\| \widetilde M^- f - y_1^- \bigr\|_2^2 = (\widetilde M^-)^{+} y_1^-,\]

with \((\widetilde M^-)^{+}\) the Moore-Penrose pseudo-inverse. Appendix E of the paper compares this OLS path to ridge / lasso variants; mlsynth exposes optional elastic-net knobs (lambda_penalty, p, q) for the same purpose.

The counterfactual is

\[\widehat y^0_1 = Y_0\, \widehat f, \qquad \widehat{ATT} = \frac{1}{T - T_0} \sum_{t > T_0} \bigl(y_{1, t} - (Y_0\, \widehat f)_t\bigr).\]

Algorithm 4 Step 5 of Rho et al. (2025) writes the projection through the denoised matrix \(\widetilde M\). For the OLS solver the two are mathematically identical – \(\widehat f\) lies in the column space of \(V_r\), so the discarded high-order components annihilate. They differ for the Bayesian and SIMPLEX solvers below, where \(\widehat f\) is not constrained to the rank-\(r\) subspace. Set project_denoised=True to recover the paper-strict projection.

mlsynth extensions#

Two paper-extensible weight solvers live alongside the OLS default:

  • Bayesian PCR (estimator="bayesian"). This is the Bayesian Robust Synthetic Control of Amjad, Shah & Shen [Amjad2018]: replace the point-estimate OLS with a Gaussian posterior over the weights (Bayesian linear regression on the HSVT-denoised donors), which yields calibrated uncertainty directly rather than by resampling. Concretely,

    \[f \mid y_1^-, \widetilde M^- \sim \mathcal N\bigl(\mu_n, \Sigma_n\bigr), \quad \Sigma_n = \bigl( \sigma_e^{-2} (\widetilde M^-)^\top \widetilde M^- + \alpha_0 I \bigr)^{-1}, \quad \mu_n = \sigma_e^{-2} \Sigma_n (\widetilde M^-)^\top y_1^-,\]

    with \(\sigma_e^2 = \mathrm{Var}(y_1^-)\) and prior precision \(\alpha_0\). n_bayes_samples posterior draws are propagated through \(\widetilde M\) to yield per-period credible bands at level \(1 - \alpha\). Aggregated over the post period, the implied ATT credible interval is reported on CLUSTERSCInference.

  • Convex PCR (pcr_objective="SIMPLEX"). Keep the HSVT denoising of the donor matrix but solve the classical Abadie-Diamond-Hainmueller (2010) program,

    \[\widehat f = \arg\min_{f \in \Delta_J} \bigl\| \widetilde M^- f - y_1^- \bigr\|_2^2, \qquad \Delta_J = \bigl\{ f \in \mathbb{R}_{\ge 0}^J : \mathbf 1^\top f = 1 \bigr\}.\]

    Useful when the user wants non-extrapolation and an interpretable convex-combination donor weighting on top of the PCR denoising.

RPCA-SC family (Bayani 2021)#

The RPCA family follows the same five-step skeleton as PCR-SC but swaps the feature step, the denoising step, and the weight constraints. Algorithm 4 of the paper is the orchestrator.

Step 1 – Functional PCA on pre-period trajectories. Each unit’s pre-period series \(y_j^- (t)\) is smoothed via a cubic B-spline expansion and projected onto the FPC basis \(\{\phi_k\}_{k=1}^K\) (Li, Wang & Carroll 2010):

\[y_j^-(t) \approx \mu(t) + \sum_{k=1}^{K} \xi_{jk}\, \phi_k(t), \qquad \widehat \xi_{jk} = \int_a^b (y_j^-(t) - \widehat \mu(t))\, \widehat \phi_k(t)\, dt.\]

The rank \(K\) is the smallest integer whose cumulative spectral energy meets fpca_cumvar (paper default \(0.95\)). The scores are standardised before clustering.

Step 2 – :math:`k`-means clustering. Apply Hartigan-Wong (1979) \(k\)-means to the FPC scores with \(k\) chosen by the silhouette coefficient

\[s(j) = \frac{b(j) - a(j)}{\max\{a(j), b(j)\}},\]

where \(a(j)\) is the mean distance from unit \(j\) to other members of its own cluster and \(b(j)\) the mean distance to the closest neighbouring cluster. The donor pool \(\mathcal{D} \subseteq \{2, \dots, J + 1\}\) is the set of non-treated units sharing the treated unit’s cluster.

Step 3 – Robust PCA on the donor pool. Stack the cluster donor outcomes into \(Y_{-1} \in \mathbb{R}^{|\mathcal{D}| \times T}\). mlsynth offers two robust decompositions \(Y_{-1} = L + S\).

PCP (Candes et al. 2011) replaces the NP-hard \(\min_{L, S} \mathrm{rank}(L) + \lambda \|S\|_0\) with the convex relaxation

\[\min_{L, S} \; \| L \|_* + \lambda \| S \|_1 \quad \text{s.t.} \quad Y_{-1} = L + S,\]

solved by ADMM with default penalties \(\lambda = 1 / \sqrt{\max(|\mathcal D|, T)}\) and \(\mu = |\mathcal D| T / (4 \sum_{i, t} |Y_{-1, it}|)\) per Bayani Section 2.4. Each iteration alternates a singular-value soft-threshold update on \(L\), an entry-wise soft-threshold update on \(S\), and the standard dual ascent on the multiplier \(\Lambda\).

HQF (Wang et al. 2023) instead factors \(L = UV\) directly and alternates two Tikhonov-regularised least-squares updates,

\[U \leftarrow \bigl((Y_{-1} - S) V^\top + \lambda U_{\text{prev}}\bigr) (V V^\top + \lambda I)^{-1}, \quad V \leftarrow (U^\top U + \lambda I)^{-1} (U^\top (Y_{-1} - S) + \lambda V_{\text{prev}}),\]

with the sparse component updated by a median-absolute-deviation threshold on the residual (controlled by hqf_ip). The rank defaults to the smallest \(r\) whose cumulative singular-value energy meets hqf_cumvar (Bayani uses \(0.999\)).

Step 4 – Non-negative least squares. Let \(L^- \in \mathbb{R}^{|\mathcal{D}| \times T_0}\) be the pre-period slice of the low-rank component. The weights solve

\[\widehat \beta = \arg\min_{\beta \ge 0} \bigl\| y_1^- - (L^-)^\top \beta \bigr\|_2^2.\]

The simplex constraint of canonical synthetic control is deliberately dropped: the clustering step has already restricted the donor pool to behaviourally similar units, so non-negativity suffices to keep the counterfactual interpretable (Bayani 2021, Section 2.4).

Step 5 – Projection. Counterfactual and ATT come from the denoised donor matrix,

\[\widehat y^0_1 = L^\top \widehat \beta, \qquad \widehat{ATT} = \frac{1}{T - T_0} \sum_{t > T_0} \bigl(y_{1, t} - (L^\top \widehat \beta)_t\bigr).\]

RPCA-SC tuning via leave-one-time-out cross-validation#

Bayani (2021) takes the Candes-Li-Ma-Wright (2011) optimal-recovery value \(\lambda^\star = 1/\sqrt{\max(|\mathcal D|, T)}\) for PCP, and chooses the HQF factorisation rank by an explained-variance threshold. Both defaults are conservative for L/S decomposition identifiability but can be poor for counterfactual prediction. On the California Proposition 99 panel the Candes default leaves PCP under-regularised by roughly 2x and HQF stops at a rank that loses ATT magnitude.

mlsynth therefore exposes two opt-in cross-validation tuners (CLUSTERSCConfig.cv_lambda for PCP, CLUSTERSCConfig.cv_hqf_rank for HQF). The same algorithm drives both:

Let \(L \in \mathbb{R}^{|\mathcal D| \times T}\) be the low-rank component returned by the robust PCA solver at candidate hyperparameter \(\theta\). The pre-period slice \(L^- \in \mathbb{R}^{|\mathcal D| \times T_0}\) is the design matrix for the NNLS weight step. Define the leave-one-time-period-out mean squared error

\[\mathrm{CV}(\theta) = \frac{1}{T_0} \sum_{t = 1}^{T_0} \biggl( y_{1, t} - (L^-_{:, -t})^\top \widehat \beta^{(-t)} \biggr)^2, \quad \widehat \beta^{(-t)} = \arg\min_{\beta \ge 0} \bigl\| y_{1, -t}^- - (L^-_{:, -t})^\top \beta \bigr\|_2^2,\]

where the \(-t\) subscript drops the \(t\)-th pre-period column. The donor matrix \(Y_{-1}\) is fully observed at every period so the RPCA decomposition is run once per \(\theta\), not \(T_0\) times. The chosen \(\widehat \theta\) minimises \(\mathrm{CV}(\theta)\) over a grid:

  • For PCP, the grid is cv_lambda_multipliers \(\times \lambda^\star\) with default multipliers \((0.5, 1, 2, 3, 5, 8, 12)\).

  • For HQF, the grid is the integer ranks \(\{1, 2, \dots, \min(|\mathcal D|, T_0 - 1)\}\).

On the Proposition 99 panel, cv_lambda=True picks \(\widehat \lambda = 2 \lambda^\star\) and cuts pre-period RMSE in half (2.11 → 1.08, ATT moving from −15.5 to −17.7); cv_hqf_rank=True picks \(\widehat r = 8\) (vs. cumvar default of 4) with ATT moving from −12.8 to −16.6. Both ATTs land in the canonical SCM / RSC range of −19 to −24.

When the assumptions bind: practical diagnostics#

Assumptions 1-4 above (Agarwal et al. 2021, transcribed in the PCR-SC family) are written as structural conditions. Here is how each shows up in real data and what to check in the CLUSTERSCResults container before trusting the counterfactual.

  1. Low-rank latent-factor signal (A1). The donor pool’s untreated mean \(M\) is approximately low rank. If the panel is genuinely full-rank or the spectrum decays slowly, HSVT throws away signal and the OLS step over-fits whatever is left.

    Plausibly violated when donor series are idiosyncratic in a way that no factor model can absorb – think 200 firm-level time series where each firm has its own unrelated business cycle. Diagnostic: compute the spectral-energy share of the top-\(r\) singular values of the column-centred donor matrix; if you need \(r \ge \min(T_0, J) / 2\) to capture \(90\%\) of the energy, the low-rank story is failing. On the Prop 99 panel the top-1 singular value carries \(\approx 99\%\) of the energy – the exact regime where PCR-SC dominates. When the spectrum decays slowly, fall back to canonical SC (canonical SCM, Two-Step Synthetic Control) or to a balancing-aware estimator (MicroSynth (User-Level Balancing SC) for unit-level data; Factor Model Approach (FMA) for factor-model-aware estimation).

  2. Error-in-variables donors (A2). Donor entries are noisy / partially-observed views of the signal. The HSVT step is precisely what restores consistency when this assumption holds. The condition fails softly when the noise is sparse and heavy-tailed instead of Gaussian, hard when missingness is non-random (donor entries missing because the donor deviates from the signal).

    Plausibly violated when you have a panel with structural outliers (a donor with a one-time policy shock spike) rather than i.i.d. Gaussian noise. Diagnostic: residualise each donor against the rank-\(r\) HSVT reconstruction and histogram the residuals; sparse heavy tails are a red flag. The fix is the RPCA-SC family – robust \(L + S\) decomposition explicitly separates the low-rank signal from the sparse outliers, so heavy-tailed donor noise is absorbed into \(S\) rather than contaminating \(L\).

  3. Approximate linear SC (A3). The treated signal lies (approximately) in the span of the denoised donor signals. PCR-SC uses unconstrained OLS so this relaxes the canonical convex-hull restriction – but the model-mismatch term \(\phi_t\) (the gap between the true treated mean and the closest linear combination of donor signals) still bounds the finite-sample error.

    Plausibly violated when the treated unit is structurally outside the donor pool – coastal vs. interior states, a tech-led economy with only commodity-led donors. Diagnostic: inspect res.pcr.pre_rmse against the donor noise floor (the mean leave-one-out pre-RMSE across donors). If treated pre-RMSE is substantially higher than the donor noise floor, the mismatch \(\phi_t\) is large and PCR-SC’s unbiasedness argument breaks. For the structurally outside- hull case, switch to Imperfect Synthetic Controls (ISCM) (whose A2(b) mechanism identifies the effect through donors that use the treated unit as a donor).

  4. Independent noise sources (A4). Response noise, donor noise, and the missingness pattern are mutually independent. The PCR-SC consistency results assume the missingness pattern is data-independent.

    Plausibly violated when donor observations are missing because of the outcome value (e.g. countries stopping reporting GDP during recessions). Diagnostic: cross-tabulate missingness with the pre-period outcome quantiles; if missing entries are concentrated in the tails, the assumption fails. Multiple imputation pre-step or a missing-data-aware estimator is the fix; mlsynth’s Synthetic Nearest Neighbors / Causal Matrix Completion (SNN) is built for missingness- informative panels.

  5. Cluster structure (Rho et al. 2025, ClusterSC-specific). The donor pool decomposes into latent subgroups distinguishable on the right-singular-vector embedding. Clustering only helps if such subgroups actually exist and the silhouette statistic can find them.

    Plausibly violated when the donor pool is genuinely homogeneous – a tight set of comparable units already pre-screened by the analyst. Diagnostic: read res.pcr.metadata['k_clusters']; if the silhouette picks \(k = 1\), clustering adds no value and you should set clustering=False to use the full pool. If \(k > 1\) is picked but the post-fit ATT moves substantially when you flip clustering off and on, the cluster structure is spurious (cluster boundaries are within the noise floor) and the un-clustered fit is the more honest one.

  6. Functional smoothness (Bayani 2021, RPCA-SC-specific). Pre-period trajectories admit a parsimonious FPCA basis. If trajectories are dominated by high-frequency jagged noise the cubic-spline FPCA pre-step throws away most of the signal.

    Plausibly violated when the outcome is noisy at the observation frequency (daily stock prices, hourly sensor readings). Diagnostic: read res.rpca.metadata['fpca_components']; if the silhouette or cumulative-variance step keeps a large number of FPC components, the smoothness assumption is failing. Aggregate to a coarser time grid (weekly, daily) before refitting, or move to a stationary-cycle estimator (Synthetic Business Cycle (SBC)) that handles unsmoothed series natively.

  7. NNLS-friendly truth (Bayani 2021, RPCA-SC-specific). RPCA-SC ends with a non-negative least-squares step. If the true counterfactual is best represented as an extrapolation (some donor weights would naturally be negative), NNLS cannot reach it.

    Plausibly violated when the treated unit is at the edge of the donor distribution and the un-clustered PCR-SC OLS fit returns substantial negative weights. Diagnostic: refit with PCR-SC + OLS (method="pcr", pcr_objective="OLS") and inspect the weight vector. If many large negative weights appear, the convex-combination restriction is binding and RPCA-SC is throwing away identification.

When to use PCR-SC, RPCA-SC, or neither#

The four papers behind CLUSTERSC chip at different parts of the canonical SC pipeline. The decision logic for picking among them:

Reach for PCR-SC + clustering (the CLUSTERSC default) when:

  • The donor pool is large and noisy (\(J \gtrsim T_0\), disaggregated panels with hundreds of donors), and the donor matrix has a clear low-rank spectrum. This is the Rho et al. (2025) regime – the empirical case studies are individual- level health records and disaggregated economic panels.

  • The treated unit comes from a plausible subgroup of the donor pool that you cannot easily isolate by hand (similar patient phenotype, similar state-economy composition). The silhouette-driven \(k\)-means step formalises the subgroup decision.

  • You’re willing to let weights be negative in the un-clustered OLS solve (Assumption 3 above) for tighter pre-fit and lower bias – and you accept that “California = +0.4 Utah +0.3 Montana -0.1 Tennessee” is a defensible weight story for your application.

Reach for PCR-SC without clustering (i.e., classic Amjad-Shah-Shen RSC) when:

  • You have a moderate-size donor pool that you have already pre-screened, and you mostly want the HSVT denoising step to protect against donor-matrix noise / missingness. Pre-screening has done the work that clustering would do.

  • You want the Shen-Ding-Sekhon-Yu (2023) closed-form CIs. These are wired in for pcr_objective="OLS" and give you proper frequentist HZ/VT/DR-source standard errors without a bootstrap; the Bayesian path delivers credible bands.

  • The empirical setting is one of the canonical aggregate SC case studies (Prop 99, Basque, etc.) where the donor pool is small (J ≤ 40) and a hand-picked subgroup already exists.

Reach for RPCA-SC when:

  • The donor matrix has sparse heavy-tailed outliers rather than uniform Gaussian noise – a few donors with one-time policy shocks, structural breaks, or recording errors. The \(L + S\) decomposition explicitly absorbs these into \(S\), leaving a clean \(L\) for the weight fit. PCR-SC’s HSVT is an \(L_2\)-based denoiser and is less robust to this exact regime.

  • Pre-period trajectories are smooth functional curves (annual GDP, monthly population) where the FPCA basis is a natural language for the donor pool. RPCA-SC’s Step 1 was built for this.

  • You want non-negative interpretable weights – the NNLS step at the end produces a sparse convex-combination story similar to canonical SC, while still benefiting from the robust-PCA denoising. The German-reunification application is the canonical case (Norway 0.48, France 0.35, New Zealand 0.30, Austria 0.02).

  • You want Cattaneo-Feng-Titiunik (2021) prediction intervals – mlsynth’s CFT port runs on RPCA-SC, providing in-sample bootstrap + Hoeffding out-of-sample bands.

Use PCR-SC over RPCA-SC when:

  • The noise is Gaussian / sub-Gaussian (PCR-SC’s HSVT is optimal under this), the donor matrix is high-dimensional (clustering helps), and you want negative weights / closed- form Shen CIs.

  • You need speed – PCR-SC is one SVD + one OLS, while RPCA-SC’s PCP is an ADMM loop and HQF is an iterative factorisation. Differences are small on Prop99-size panels; they bite on disaggregated panels with thousands of donors.

Use RPCA-SC over PCR-SC when:

  • The donor matrix has visible sparse outliers (one or two donors with a single shock period that would dominate the HSVT spectrum). RPCA’s \(L + S\) decomposition is built for exactly that pattern.

  • Trajectories are smooth and you want the donor-pool reduction that FPCA + \(k\)-means delivers ahead of the weight fit (this is what selects the 11-economy West-Germany cluster from the 17-country OECD pool).

  • You want a non-negative interpretable weight vector for policy storytelling.

Do not use either family when:

  • The donor matrix has no low-rank structure. Both PCR-SC and RPCA-SC depend on this. With high-dimensional donor pools where the spectrum decays slowly, switch to a balancing-aware estimator (MicroSynth (User-Level Balancing SC) if you have user-level data; Factor Model Approach (FMA) if a factor-model fit is defensible).

  • The treated unit is structurally outside the donor convex hull in a way that no linear combination can capture – even with negative weights, the pre-RMSE stays large. Switch to Imperfect Synthetic Controls (ISCM), whose A2(b) mechanism identifies the effect through donors that use the treated unit as a positive- weight donor in their own synthetic controls.

  • Distributional questions (Lorenz curves, QTEs, tail changes). PCR-SC and RPCA-SC target the mean ATT. Use Distributional Synthetic Control (DSC) instead.

  • Continuous or multi-valued treatment. The CLUSTERSC pipeline encodes a single binary treatment indicator. Continuous dose belongs in Continuous-Treatment Synthetic Control (CTSC).

  • Spillovers / interference across donors. The low-rank signal model assumes donors are untreated and independent of the treatment of unit 1. Spillovers break this; use Spillover-Aware Synthetic Control (SPILLSYNTH) or Spatial Synthetic Difference-in-Differences (SpSyDiD).

  • Tiny donor pool (\(J \le 10\)) and a tight canonical- SC pre-fit. The denoising and clustering machinery is overkill; both add variance without identification gain. Use canonical SCM, Two-Step Synthetic Control, or Forward Difference-in-Differences (FDID).

  • Staggered adoption with a long mixed-treatment pre-period. CLUSTERSC assumes a common pre-period free of treatment. Use FECT or Synthetic Difference-in-Differences (SDID).

  • You need the donor weight vector to be a single sparse convex combination as the headline story (and you’re not willing to use RPCA-SC’s NNLS variant). The PCR-SC default is the unrestricted OLS solve; the SIMPLEX variant re-imposes the convex hull at the cost of the denoising-bias gain. If the sparse convex combination is the deliverable, canonical SCM or Two-Step Synthetic Control are the more honest default.

Inference#

CLUSTERSC.fit() returns an EffectResult: the flat accessors (res.att, res.counterfactual, res.gap, res.att_ci, res.donor_weights, res.pre_rmse) and the standardized sub-models (res.effects, res.time_series, res.weights, res.inference, res.fit_diagnostics, res.method_details) are populated from the primary variant. The rich, estimator-specific inference object lives on res.cluster_inference (its scalar ATT interval is mirrored into the standardized res.inference), and the two family fits remain available side by side on res.pcr / res.rpca.

Three inference families are wired into CLUSTERSCInference (accessed via res.cluster_inference):

  • Frequentist PCR – Shen-Ding-Sekhon-Yu (2023) closed-form CIs. Default on for estimator="frequentist" and pcr_objective="OLS". See the next subsection.

  • Bayesian PCR – posterior credible interval. Computed when estimator="bayesian" from posterior draws of the counterfactual (the Bayesian Robust Synthetic Control of Amjad, Shah & Shen [Amjad2018]).

  • RPCA-SC – Cattaneo-Feng-Titiunik (2021) prediction intervals. Opt-in via CLUSTERSCConfig.compute_cft_pi. See the dedicated subsection below.

Shen-Ding-Sekhon-Yu (2023) frequentist CIs for OLS PCR#

For the symmetric estimator class – OLS minimum \(\ell_2\)-norm, PCR, and ridge – Theorem 1 of Shen et al. (2023) shows that the horizontal (HZ) and vertical (VT) regression formulations give algebraically identical point estimates. The two formulations nevertheless quantify uncertainty against different generative models:

  • HZ model (Assumption 1). Each donor’s post-period outcome is a noisy linear combination of its own pre-period values:

    \[Y_{i, T} = \sum_{t \le T_0} \alpha^*_t Y_{i, t} + \varepsilon_{i, T}, \quad i = 1, \dots, N_0.\]

    The randomness lives in the cross-sectional dimension.

  • VT model (Assumption 2). The treated unit’s pre-period outcome is a noisy linear combination of the donors’ pre-period values:

    \[Y_{N, t} = \sum_{i \le N_0} \beta^*_i Y_{i, t} + \varepsilon_{N, t}, \quad t = 1, \dots, T_0.\]

    The randomness lives in the time-series dimension.

  • DR model (Assumption 3). Both sources of randomness are present.

Each model yields a distinct estimand and a distinct asymptotic variance for the same point estimate \(\widehat Y_{N, T}(0)\) (Theorem 3). With rank-\(k\) HSVT projections \(H^u_\perp = I - U_k U_k^\top\) (donor-space) and \(H^v_\perp = I - V_k^\top V_k\) (time-space), and the homoskedastic variance plug-ins (paper eq 19):

\[\widehat \sigma^2_{\mathrm{hz}} = \frac{\| H^u_\perp y_T \|_2^2}{N_0 - R}, \qquad \widehat \sigma^2_{\mathrm{vt}} = \frac{\| H^v_\perp y_N \|_2^2}{T_0 - R},\]

(where \(R = \mathrm{rank}(Y_0)\) after truncation), the per-period variance estimators are

\[\widehat v_{\mathrm{hz}} = \widehat \sigma^2_{\mathrm{hz}}\, \| \widehat \beta \|_2^2, \quad \widehat v_{\mathrm{vt}} = \widehat \sigma^2_{\mathrm{vt}}\, \| \widehat \alpha \|_2^2, \quad \widehat v_{\mathrm{dr}} = \max\!\bigl(0,\, \widehat v_{\mathrm{hz}} + \widehat v_{\mathrm{vt}} - \mathrm{tr}\widehat A\bigr),\]

with \(\widehat A = \widehat \sigma^2_{\mathrm{hz}} \widehat \sigma^2_{\mathrm{vt}}\, Y_0^{+} (Y_0^\top)^{+}\) the interaction term. The \((1 - \alpha)\) CI under source \(s \in \{\mathrm{hz}, \mathrm{vt}, \mathrm{dr}\}\) is

\[\widehat Y_{N, T}(0) \pm z_{\alpha/2}\, \sqrt{\widehat v_s}.\]

mlsynth also ports the jackknife and HRK (Hartley-Rao-Kish) variance estimators from var.py in the authors’ reference repository. The HRK estimator is only valid when \(\max_i (1 - (H_\perp)_{ii}) < 1/2\) for both projections; mlsynth checks this and raises if violated.

For multi-period extrapolation the procedure runs per post-period: at each \(t > T_0\) the donor outcomes \(y_t\) change but the projections, weights, and per-period variances are recomputed from the same fitted weight pair. The ATT is the mean of per-period gaps \(\widehat{ATT} = (T - T_0)^{-1} \sum_{t > T_0} \widehat \tau_t\), and its variance is aggregated assuming independence across post-periods:

\[\widehat v_s(\widehat{ATT}) = \frac{1}{T_1} \cdot \frac{1}{T_1} \sum_{t > T_0} \widehat v_s(t), \qquad T_1 = T - T_0.\]

The independence assumption is the standard first-pass; serially correlated shocks would inflate the true variance.

Config knobs: CLUSTERSCConfig.compute_shen_ci (default True) toggles the inference; CLUSTERSCConfig.shen_variance chooses "homoskedastic" (default), "jackknife", or "hrk". The output CLUSTERSCInference.shen carries per-period gaps, per-period CIs under each source, and the aggregated ATT CIs.

Cattaneo-Feng-Titiunik (2021) prediction intervals for RPCA-SC#

Bayani (2021) leaves inference unaddressed and the Shen et al. (2023) closed-form CIs do not apply to the asymmetric NNLS weights of RPCA-SC. For uncertainty quantification mlsynth ships a focused port of Cattaneo, M. D., Feng, Y., & Titiunik, R. (2021), “Prediction Intervals for Synthetic Control Methods” (JASA 116(536):1865-1880), opt-in via CLUSTERSCConfig.compute_cft_pi.

The CFT prediction error at post-period \(t > T_0\) decomposes as

\[y_t - \widehat y^0_t = \underbrace{\bigl\langle p_t, w^* - \widehat w \bigr\rangle}_{ \text{in-sample } u_t} + \underbrace{e_t}_{\text{out-of-sample shock}},\]

where \(p_t\) is the loading at \(t\) (for RPCA-SC the denoised donor row \(L_t\)), \(w^*\) is the population weight, and \(\widehat w\) is the NNLS estimate. The two components are quantified separately:

  • In-sample component \(M_w(t, \alpha/2)\). The paper’s reference implementation solves a constrained \(\sup / \inf\) over the “compatible set” of weights via an ECOS LP at each post-period. mlsynth replaces that with an HC1-scaled parametric bootstrap of the pre-period residuals: at each of cft_sims draws,

    \[\widehat u^*_t = \sqrt{\tfrac{T_0}{T_0 - 1}} \cdot \widehat u^-_{\sigma(t)}, \quad \sigma \sim \mathrm{Unif}\{1, \dots, T_0\},\]

    we perturb the treated pre-period outcome \(y^*_t = \widehat y^0_t + \widehat u^*_t\), refit the full RPCA-SC pipeline, and collect the resulting counterfactual \(\widehat y^{0, *}\) at every period. The asymmetric \((\alpha/2, 1 - \alpha/2)\) empirical quantiles of \(\widehat y^{0, *}_t - \widehat y^0_t\) form the in-sample band per post-period. This is equivalent to the ECOS-based bound under regularity conditions and avoids pulling in ecos / dask / plotnine as hard dependencies.

  • Out-of-sample component \(M_e(t, \alpha/2)\). Under sub-Gaussian post-period shocks with scale parameter \(\sigma_e\), Hoeffding’s inequality gives the tail bound

    \[M_e(t, \alpha/2) = \sqrt{-2 \log \alpha} \; \widehat \sigma_e, \qquad \widehat \sigma_e = \mathrm{sd}\!\bigl(\widehat u^-\bigr).\]

    This is the gaussian variant of the paper’s three out-of-sample options (gaussian / ls / qreg); the other two are deferred.

The combined \((1 - \alpha)\) PI on the counterfactual at \(t\) is

\[\widehat y^0_t \pm \bigl[ M_w(t, \alpha/2) + M_e(t, \alpha/2) \bigr],\]

which inverts to the PI on the per-period treatment effect \(\widehat \tau_t = y_t - \widehat y^0_t\).

For the ATT, the in-sample component aggregates by storing the post-period mean of the counterfactual at each bootstrap draw and taking quantiles. The out-of-sample component shrinks by \(\sqrt{T_1}\) under post-period shock independence:

\[M_e^{\mathrm{ATT}}(\alpha/2) = \frac{\sqrt{-2 \log \alpha} \; \widehat \sigma_e}{\sqrt{T_1}}.\]

Config knobs: CLUSTERSCConfig.compute_cft_pi (default False – the bootstrap costs cft_sims full pipeline refits, roughly 0.3-0.5 s each on a moderate panel), CLUSTERSCConfig.cft_sims (default 200), CLUSTERSCConfig.cft_alpha (default 0.05), and CLUSTERSCConfig.cft_e_method (currently only "gaussian"). The output CLUSTERSCInference.cft carries the per-period gaps, per-period PIs, in-sample bootstrap bands, the Hoeffding constant, and the aggregated ATT PI.

Verification#

In the high-dimensional-subgroup regime (pooled donor rank \(> T_0\)), CLUSTERSC reproduces the central claim of Rho et al. (2025): donor clustering lowers the post-period prediction MSE versus the whole-pool RSC baseline at every noise level (down \(60.8\% / 43.2\% / 24.3\%\) at \(\sigma = 0.10/0.25/0.40\)). Both modes run through the one estimator (clustering=False is RSC, clustering=True is ClusterSC). Pinned in benchmarks/cases/clustersc_subgroups.py; the authors’ own code is cross-checked against its paper in clustersc_subgroups_ref.py; the RSC pre/post-error and Shen-CI coverage are pinned in rsc_synth_error.py / rsc_shen_coverage.py. The RPCA-SC family is pinned separately on the West-German-reunification application (clustersc_rpca_germany.py: Norway 0.49 / France 0.35 / pre-RMSE ~89). See the dedicated page ClusterSC — Synthetic Control with Donor Selection (Rho et al. 2025).

Core API#

Cluster-based Synthetic Control (CLUSTERSC) estimator.

Bundles two complementary robust SCM families behind a single orchestrator:

  • PCR-RSC – SVD-based donor clustering (Amjad, Shah, Shen 2018) plus Principal Component Regression weight estimation (Agarwal, Shah, Shen, Song 2021). Frequentist QP or Bayesian Robust Synthetic Control posterior (Amjad, Shah, Shen 2018) selected via estimator.

  • RPCA-SC – robust low-rank donor denoising (PCP – Candes, Li, Ma, Wright 2011; or HQF – Wang, Li, So, Liu 2023) followed by simplex SCM weights.

method = "both" runs both families in parallel and exposes them side by side on the CLUSTERSCResults container. The primary field selects which fit drives the convenience aliases att / counterfactual / gap / donor_weights.

class mlsynth.estimators.clustersc.CLUSTERSC(config: CLUSTERSCConfig | dict)#

Bases: object

Cluster-based Synthetic Control estimator.

Parameters:

config (CLUSTERSCConfig or dict) – Configuration object. See mlsynth.config_models.CLUSTERSCConfig.

Returns:

CLUSTERSCResults – Frozen container with optional PCR-RSC and RPCA-SC fits plus Bayesian credible interval inference (PCR estimator = “bayesian” only).

fit() CLUSTERSCResults#

Run the requested family (or both) and return a CLUSTERSCResults.

Configuration#

class mlsynth.config_models.CLUSTERSCConfig(*, df: ~pandas.DataFrame, outcome: str, treat: str, unitid: str, time: str, display_graphs: bool = True, save: bool | str = False, counterfactual_color: ~typing.List[str] = <factory>, treated_color: str = 'black', plot: ~mlsynth.config_models.PlotConfig = <factory>, method: ~typing.Literal['pcr', 'rpca', 'both'] = 'pcr', primary: ~typing.Literal['pcr', 'rpca'] = 'pcr', pcr_objective: ~typing.Literal['OLS', 'SIMPLEX'] = 'OLS', clustering: bool = True, estimator: ~typing.Literal['frequentist', 'bayesian'] = 'frequentist', rpca_method: ~typing.Literal['PCP', 'HQF'] = 'PCP', lambda_penalty: ~typing.Annotated[float | None, ~annotated_types.Ge(ge=0.0)] = None, p: ~typing.Annotated[float | None, ~annotated_types.Ge(ge=0.0)] = None, q: ~typing.Annotated[float | None, ~annotated_types.Ge(ge=0.0)] = None, alpha: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Lt(lt=1.0)] = 0.05, rank: ~typing.Annotated[int | None, ~annotated_types.Ge(ge=1)] = None, rank_method: ~typing.Literal['cumvar', 'fixed', 'usvt'] = 'cumvar', cumvar_threshold: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Le(le=1.0)] = 0.95, standardize_for_rank: bool = True, project_denoised: bool = False, k_clusters: ~typing.Annotated[int | None, ~annotated_types.Ge(ge=1)] = None, k_max: ~typing.Annotated[int, ~annotated_types.Ge(ge=2)] = 8, n_bayes_samples: ~typing.Annotated[int, ~annotated_types.Ge(ge=1)] = 1000, random_state: int = 0, compute_shen_ci: bool = True, shen_variance: ~typing.Literal['homoskedastic', 'jackknife', 'hrk'] = 'homoskedastic', fpca_cumvar: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Le(le=1.0)] = 0.95, pcp_lambda: ~typing.Annotated[float | None, ~annotated_types.Ge(ge=0.0)] = None, pcp_mu: ~typing.Annotated[float | None, ~annotated_types.Gt(gt=0.0)] = None, pcp_max_iter: ~typing.Annotated[int, ~annotated_types.Ge(ge=1)] = 1000, pcp_tol: ~typing.Annotated[float, ~annotated_types.Gt(gt=0)] = 1e-09, hqf_rank: ~typing.Annotated[int | None, ~annotated_types.Ge(ge=1)] = None, hqf_cumvar: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Le(le=1.0)] = 0.999, hqf_lambda: ~typing.Annotated[float | None, ~annotated_types.Ge(ge=0.0)] = None, hqf_ip: ~typing.Annotated[float, ~annotated_types.Gt(gt=0)] = 1.0, hqf_max_iter: ~typing.Annotated[int, ~annotated_types.Ge(ge=1)] = 1000, cv_lambda: bool = False, cv_hqf_rank: bool = False, compute_cft_pi: bool = False, cft_sims: ~typing.Annotated[int, ~annotated_types.Ge(ge=10)] = 200, cft_alpha: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Lt(lt=1.0)] = 0.05, cft_e_method: ~typing.Literal['gaussian'] = 'gaussian')#

Configuration for the Cluster-based Synthetic Control (CLUSTERSC) estimator.

CLUSTERSC packages two robust synthetic-control families behind a single interface:

  • PCR-RSC – SVD-based donor clustering plus Principal Component Regression weight estimation (Amjad, Shah, Shen 2018; Agarwal, Shah, Shen, Song 2021). Frequentist QP via cvxpy or Bayesian Robust Synthetic Control posterior (Amjad, Shah, Shen 2018).

  • RPCA-SC – robust low-rank donor denoising (PCP – Candes, Li, Ma, Wright 2011; or HQF – Wang, Li, So, Liu 2023) followed by simplex SCM weights.

Parameters:
  • method ({“pcr”, “rpca”, “both”}) – Estimation family to run. "both" runs PCR and RPCA in parallel; primary selects which one exposes att / counterfactual / gap on the result object.

  • primary ({“pcr”, “rpca”}) – Which fit drives the result aliases when method = "both".

  • pcr_objective ({“OLS”, “SIMPLEX”}) – Inner SCM weight objective for the PCR family.

  • clustering (bool) – Whether to apply SVD-based donor clustering before the PCR fit.

  • estimator ({“frequentist”, “bayesian”}) – Frequentist QP versus Bayesian posterior for the PCR family.

  • rpca_method ({“PCP”, “HQF”}) – Robust-PCA decomposition for the RPCA family.

  • lambda_penalty, p, q (float or None) – Elastic-net-style regularization knobs forwarded to the PCR inner solver.

  • alpha (float) – Two-sided level for the Bayesian credible interval (PCR estimator = “bayesian” only).

alpha: float#
cft_alpha: float#
cft_e_method: Literal['gaussian']#
cft_sims: int#
clustering: bool#
compute_cft_pi: bool#
compute_shen_ci: bool#
cumvar_threshold: float#
cv_hqf_rank: bool#
cv_lambda: bool#
estimator: Literal['frequentist', 'bayesian']#
fpca_cumvar: float#
hqf_cumvar: float#
hqf_ip: float#
hqf_lambda: float | None#
hqf_max_iter: int#
hqf_rank: int | None#
k_clusters: int | None#
k_max: int#
lambda_penalty: float | None#
method: Literal['pcr', 'rpca', 'both']#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

n_bayes_samples: int#
p: float | None#
pcp_lambda: float | None#
pcp_max_iter: int#
pcp_mu: float | None#
pcp_tol: float#
pcr_objective: Literal['OLS', 'SIMPLEX']#
primary: Literal['pcr', 'rpca']#
project_denoised: bool#
q: float | None#
random_state: int#
rank: int | None#
rank_method: Literal['cumvar', 'fixed', 'usvt']#
rpca_method: Literal['PCP', 'HQF']#
shen_variance: Literal['homoskedastic', 'jackknife', 'hrk']#
standardize_for_rank: bool#

Helper Modules#

PCR-SC pipeline#

HSVT (Hard Singular Value Thresholding) primitives for PCR-SC.

Implements rank selection and the rank-\(r\) truncation step \(\widetilde{M} = HSVT_r(X)\) from Rho, Tang, Bergam, Cummings & Misra (2025), ClusterSC: Advancing Synthetic Control with Donor Selection, Algorithm 2.

Three rank-selection modes are exposed:

  • "cumvar" – smallest \(r\) whose cumulative spectral energy reaches a user threshold (paper’s empirical default of 95%, Section 6.1).

  • "fixed" – caller supplies the explicit rank \(r\).

  • "usvt" – Universal Singular Value Thresholding (Chatterjee 2015; Donoho & Gavish 2014). Preserved for back-compat with the legacy Amjad-Shah-Shen 2018 path.

mlsynth.utils.clustersc_helpers.pcr.hsvt.hsvt(X: ndarray, rank: int) Tuple[ndarray, ndarray, ndarray, ndarray]#

Apply hard rank-rank truncation to X.

Computes \(\widetilde{M} = \sum_{i=1}^{r} \sigma_i u_i v_i^\top\) (Algorithm 2 step 2 of Rho et al. 2025).

Returns:

  • M_hat (np.ndarray) – Rank-rank reconstruction, shape (m, n).

  • U_r (np.ndarray) – Truncated left singular vectors, shape (m, rank).

  • s_r (np.ndarray) – Truncated singular values, shape (rank,).

  • Vt_r (np.ndarray) – Truncated right singular vectors (transposed), shape (rank, n).

mlsynth.utils.clustersc_helpers.pcr.hsvt.select_rank(X: ndarray, method: str = 'cumvar', cumvar_threshold: float = 0.95, r: int | None = None, standardize: bool = True) int#

Pick a truncation rank for X under the chosen rule.

Parameters:
  • X (np.ndarray) – Matrix to decompose, shape (m, n).

  • method ({“cumvar”, “fixed”, “usvt”}) – Rank-selection rule.

  • cumvar_threshold (float) – Cumulative-variance target in (0, 1] for method="cumvar".

  • r (int, optional) – Explicit rank for method="fixed".

  • standardize (bool) – When True (default), the data-driven rules ("cumvar" / "usvt") operate on the column-standardised version of X. This prevents the leading singular value from being dominated by uncentered level information (e.g. cigarette sales numbers, GDP per capita) so that cumvar_threshold actually discriminates over the panel’s factor structure rather than the overall scale. Ignored for method="fixed".

Returns:

int – Selected rank \(r \in [1, \min(m, n)]\).

Donor clustering for ClusterSC (Rho et al. 2025, Algorithm 3 & 4).

Implements the two paper-aligned steps:

  • cluster_donors() – Algorithm 3. SVD-truncate the (transposed) donor matrix at rank \(r\), build features \(\widetilde{U} = U \Sigma_r\), and run \(k\)-means on the rows of \(\widetilde{U}\). The number of clusters is either user-supplied or selected by the silhouette coefficient (Amjad, Shah, Shen 2018).

  • assign_target() – Algorithm 4 Step 2. Embed the treated unit into the right-singular-vector basis via \(\tilde{u} = V_r^\top x_0^-\) and assign it to the nearest cluster centroid in \(\widetilde{U}\)-space.

Convention: X_pre here is shape (J, T0) – one row per donor unit, one column per pre-period – to match the paper’s notation (\(X \in \mathbb{R}^{n \times T}\)). The orchestrator transposes as needed.

class mlsynth.utils.clustersc_helpers.pcr.clustering.ClusterPartition(labels: ndarray, centers: ndarray, V_r: ndarray, k: int, rank: int)#

Output of the paper’s Algorithm 3.

V_r: ndarray#
centers: ndarray#
k: int#
labels: ndarray#
rank: int#
mlsynth.utils.clustersc_helpers.pcr.clustering.assign_target(target_outcome_pre: ndarray, partition: ClusterPartition) Tuple[int, ndarray]#

Algorithm 4 Step 2: assign the target to its matching cluster.

Computes \(\tilde{u} = V_r^\top x_0^-\) and picks the centroid minimising \(\| c_\ell - \tilde{u} \|_2\).

Returns:

  • cluster_id (int) – Index of the nearest cluster.

  • target_embedding (np.ndarray) – \(\tilde{u}\), shape (r,).

mlsynth.utils.clustersc_helpers.pcr.clustering.cluster_donors(donor_outcomes_pre: ndarray, rank: int, k_clusters: int | None = None, k_max: int = 8, random_state: int = 0) ClusterPartition#

Run Algorithm 3 of Rho et al. (2025).

Parameters:
  • donor_outcomes_pre (np.ndarray) – Donor outcomes in the pre-period, shape (J, T0) (donors on rows).

  • rank (int) – Truncation rank \(r\) for the SVD feature map.

  • k_clusters (int, optional) – Number of clusters. If None, the silhouette coefficient picks \(k \in [2, \min(k_{\max}, J-1)]\).

  • k_max (int) – Upper bound for the silhouette search.

  • random_state (int) – Seed forwarded to sklearn.cluster.KMeans for reproducibility.

Returns:

ClusterPartition – Cluster labels, centroids in \(\widetilde{U}\)-space, and the rank-truncated right singular matrix \(V_r\) needed for the target-assignment step.

Frequentist OLS weight solver for PCR-SC.

Implements the learn step of Algorithm 2 of Rho et al. (2025): given the HSVT-denoised pre-period donor matrix \(\widetilde{M}^- \in \mathbb{R}^{T_0 \times J}\) and the target pre-period vector \(x_0^- \in \mathbb{R}^{T_0}\), return

\[\widehat{f} = \arg\min_{f \in \mathbb{R}^J} \| \widetilde{M}^- f - x_0^- \|_2^2.\]

Optionally extends to p-norm (ridge / lasso) regularisation via lambda_penalty, p, q (Appendix E of the paper compares OLS, Ridge, Lasso): min_f ||x0 - M f||_2^2 + lambda ||f||_p^q. For the unregularised path (lambda_penalty in {None, 0}) the closed-form pseudo-inverse solver is used – the paper’s exact Algorithm 2 Step 3.

The regularised path is solved directly with cvxpy, so this module has no dependency on the legacy estutils optimizer.

mlsynth.utils.clustersc_helpers.pcr.frequentist.solve_ols(denoised_donor_pre: ndarray, target_pre: ndarray, donor_names: List[str] | None = None, lambda_penalty: float | None = None, p: float | None = None, q: float | None = None) ndarray#

Return the OLS weight vector \(\widehat{f}\).

Parameters:
  • denoised_donor_pre (np.ndarray) – HSVT-denoised pre-period donor matrix, shape (T0, J).

  • target_pre (np.ndarray) – Treated unit’s pre-period outcomes, shape (T0,).

  • donor_names (list of str, optional) – Unused; retained for call-signature compatibility.

  • lambda_penalty, p, q (float or None) – Optional p-norm regularisation. None (or 0) means the paper’s plain OLS via pseudo-inverse. Defaults are ridge (p = q = 2).

Bayesian PCR-SC weight solver (mlsynth extension; Bayani 2022 Ch. 1).

  • posterior – core Gaussian-conjugate posterior over the weights (BayesSCM()), relocated from the former shared bayesutils.

  • solversolve_bayesian(), the PCR-SC wrapper that draws from that posterior and forms per-period credible bands.

mlsynth.utils.clustersc_helpers.pcr.bayesian.BayesSCM(denoised_donor_matrix: ndarray, target_outcome_pre_intervention: ndarray, observation_noise_variance: float, weights_prior_precision: float) Tuple[ndarray, ndarray, ndarray, ndarray]#

Implement the fully Bayesian synthetic control method.

This function calculates the posterior distribution of synthetic control weights and the predictive distribution of the counterfactual outcome under a Bayesian framework. It assumes a Gaussian likelihood for the outcome and a Gaussian prior for the weights.

Parameters:
  • denoised_donor_matrix (np.ndarray) – Denoised donor matrix, representing the pre-intervention outcomes or predictors for donor units. Shape (num_pre_intervention_periods, num_donors), where num_pre_intervention_periods is the number of pre-intervention time periods and num_donors is the number of donor units.

  • target_outcome_pre_intervention (np.ndarray) – Pre-intervention outcome vector for the treated unit. Shape (num_pre_intervention_periods,) or (num_pre_intervention_periods, 1).

  • observation_noise_variance (float) – Variance of the noise term in the outcome model (scalar). This represents the observation noise \(\sigma^2\).

  • weights_prior_precision (float) – Prior precision parameter for the synthetic control weights (scalar). This is the inverse of the prior variance for the weights, assuming a zero-mean Gaussian prior \(\beta \sim N(0, \text{weights_prior_precision}^{-1}I)\).

Returns:

  • weights_posterior_mean (np.ndarray) – Posterior mean of the synthetic control weights. Shape (num_donors,).

  • weights_posterior_covariance (np.ndarray) – Posterior covariance matrix of the synthetic control weights. Shape (num_donors, num_donors).

  • counterfactual_predictive_mean_pre_intervention (np.ndarray) – Predictive mean of the counterfactual outcome for the treated unit during the pre-intervention periods. Shape (num_pre_intervention_periods,).

  • counterfactual_predictive_variance_pre_intervention (np.ndarray) – Predictive variance of the counterfactual outcome for the treated unit during the pre-intervention periods. Shape (num_pre_intervention_periods,).

Raises:
  • MlsynthDataError – If input data types, shapes, or values are invalid.

  • MlsynthEstimationError – If matrix inversion fails during posterior computation.

mlsynth.utils.clustersc_helpers.pcr.bayesian.solve_bayesian(denoised_donor_pre: ndarray, target_pre: ndarray, denoised_donor_full: ndarray, *, alpha: float = 0.05, n_samples: int = 1000, alpha_prior: float = 1.0, rng: Generator | None = None) Tuple[ndarray, ndarray, ndarray, ndarray]#

Return weights, median counterfactual, and per-period credible bands.

Parameters:
  • denoised_donor_pre (np.ndarray) – HSVT-denoised pre-period donor matrix, shape (T0, J).

  • target_pre (np.ndarray) – Pre-period treated outcomes, shape (T0,).

  • denoised_donor_full (np.ndarray) – HSVT-denoised donor matrix across all periods, shape (T, J). Used to project posterior draws into the counterfactual band.

  • alpha (float) – Nominal level: returned band is the central 1 - alpha posterior interval.

  • n_samples (int) – Number of posterior draws for the predictive band.

  • alpha_prior (float) – Prior precision for the Gaussian weights prior (passed to BayesSCM()).

  • rng (np.random.Generator, optional) – Source of randomness. A fresh default_rng() is used otherwise.

Returns:

  • f_hat (np.ndarray) – Posterior mean weights, shape (J,).

  • cf_median (np.ndarray) – Posterior median of the counterfactual, shape (T,).

  • cf_lower (np.ndarray) – Lower per-period bound at level alpha/2, shape (T,).

  • cf_upper (np.ndarray) – Upper per-period bound at level 1 - alpha/2, shape (T,).

Convex (simplex-constrained) weight solver for PCR-SC.

mlsynth extension on top of Algorithm 2 of Rho et al. (2025): keep the HSVT denoising step, but replace the OLS inner solve with the classical Abadie-Diamond-Hainmueller (2010) simplex-constrained program,

\[\widehat{f} = \arg\min_{f \in \mathbb{R}_{\geq 0}^{J},\, \mathbf{1}^\top f = 1} \| \widetilde{M}^- f - x_0^- \|_2.\]

This combines the denoising robustness of PCR with the non-extrapolation and interpretability properties of convex synthetic control weights. The simplex least-squares program is solved directly with cvxpy (CLARABEL), so this module has no dependency on the legacy estutils optimizer.

mlsynth.utils.clustersc_helpers.pcr.convex.solve_simplex(denoised_donor_pre: ndarray, target_pre: ndarray, donor_names: List[str] | None = None) ndarray#

Return simplex-constrained weights against the denoised donor matrix.

Solves min_w ||x0_pre - M_pre w||_2 subject to w >= 0 and sum(w) == 1.

Parameters:
  • denoised_donor_pre (np.ndarray) – HSVT-denoised pre-period donor matrix, shape (T0, J).

  • target_pre (np.ndarray) – Treated unit’s pre-period outcomes, shape (T0,).

  • donor_names (list of str, optional) – Unused; retained for call-signature compatibility.

Shen-Ding-Sekhon-Yu (2023) confidence intervals for PCR-SC.

Implements the variance estimators from Section 4 of Same Root Different Leaves: Time Series and Cross-Sectional Methods in Panel Data (Econometrica 91(6):2125-2154). The closed-form formulas below are ports of var.py from the authors’ reference implementation at deshen24/panel-data-regressions, adapted to mlsynth’s column-orientation (donors as columns of Y_pre).

The three variance estimators correspond to three assumed sources of randomness:

  • HZ (horizontal / time-series). Assumption 1 of the paper: donors’ post-period outcome \(y_T\) is generated from their pre-period outcomes plus mean-zero noise that varies across donors. The relevant residual variance comes from the cross-sectional dimension. Confidence intervals built from \(\sigma^2_{\mathrm{hz}}\) quantify uncertainty under the HZ generative model.

  • VT (vertical / cross-sectional). Assumption 2 of the paper: treated unit’s pre-period outcome \(y_N\) is generated from the donors’ pre-period outcomes plus mean-zero noise that varies across pre-periods. The relevant residual variance comes from the time-series dimension.

  • DR (doubly robust). Assumption 3: noise varies across both units and time. The DR variance is \(v_{\mathrm{hz}} + v_{\mathrm{vt}} - \mathrm{tr}(A)\) where \(A\) is an interaction term, clipped at zero (paper eq. 17 / var.py).

Three variance estimators are available:

  • "homoskedastic" – paper Section 4.1.3, eq 19. Closed-form, assumes errors are i.i.d.

  • "jackknife" – paper Supplemental Material. Diagonal weighting.

  • "hrk" – Hartley-Rao-Kish, paper Supplemental Material. Only valid when max(1 - diag(H_perp)) < 1/2.

class mlsynth.utils.clustersc_helpers.pcr.inference.ShenInference(method: str, alpha: float, att: float, per_period_gap: ndarray, per_period_se_hz: ndarray, per_period_se_vt: ndarray, per_period_se_dr: ndarray, per_period_ci_hz: ndarray, per_period_ci_vt: ndarray, per_period_ci_dr: ndarray, att_se_hz: float, att_se_vt: float, att_se_dr: float, att_ci_hz: Tuple[float, float], att_ci_vt: Tuple[float, float], att_ci_dr: Tuple[float, float], rank: int)#

Output of shen_inference() for a PCR-SC fit.

method#

Variance estimator tag.

Type:

str

alpha#

Nominal level used to form the CIs.

Type:

float

att#

Point estimate of the ATT (average gap over the post-period).

Type:

float

per_period_gap#

Per-period treatment effect estimates, shape (T1,).

Type:

np.ndarray

per_period_se_{hz,vt,dr}

Per-period standard errors under each source-of-randomness assumption, shape (T1,).

Type:

np.ndarray

per_period_ci_{hz,vt,dr}

Per-period \((1-\alpha)\) CIs, shape (T1, 2).

Type:

np.ndarray

att_se_{hz,vt,dr}

ATT standard errors (variance-of-mean under independence).

Type:

float

att_ci_{hz,vt,dr}

ATT \((1-\alpha)\) CIs.

Type:

tuple of float

rank#

PCR rank used.

Type:

int

alpha: float#
att: float#
att_ci_dr: Tuple[float, float]#
att_ci_hz: Tuple[float, float]#
att_ci_vt: Tuple[float, float]#
att_se_dr: float#
att_se_hz: float#
att_se_vt: float#
method: str#
per_period_ci_dr: ndarray#
per_period_ci_hz: ndarray#
per_period_ci_vt: ndarray#
per_period_gap: ndarray#
per_period_se_dr: ndarray#
per_period_se_hz: ndarray#
per_period_se_vt: ndarray#
rank: int#
mlsynth.utils.clustersc_helpers.pcr.inference.shen_inference(treated_outcome: ndarray, donor_outcomes: ndarray, T0: int, rank: int, *, variance: Literal['homoskedastic', 'jackknife', 'hrk'] = 'homoskedastic', alpha: float = 0.05) ShenInference#

Per-period and ATT CIs for PCR-SC, following Shen et al. (2023).

Parameters:
  • treated_outcome (np.ndarray) – Treated outcome series, shape (T,).

  • donor_outcomes (np.ndarray) – Donor outcomes (columns = donors), shape (T, J).

  • T0 (int) – Number of pre-treatment periods.

  • rank (int) – PCR truncation rank \(k\) already chosen by the pipeline. Reused so the inference uses the same low-rank donor matrix the weights were fit on.

  • variance ({“homoskedastic”, “jackknife”, “hrk”}) – Variance estimator. Defaults to homoskedastic (paper eq 19).

  • alpha (float) – Two-sided level for the returned CIs.

PCR-SC orchestration pipeline.

Implements Rho, Tang, Bergam, Cummings & Misra (2025), ClusterSC: Advancing Synthetic Control with Donor Selection, by composing:

  • hsvt – rank selection + HSVT denoising (Algorithm 2 Step 2).

  • clustering – donor clustering and target-cluster assignment (Algorithms 3 and 4 Steps 1-3).

  • frequentist, bayesian, convex – the three weight solvers.

The public entry point run_pcr() accepts the same call signature the legacy estutils.pcr exposed (so the orchestrator and SI estimator work unchanged) plus a handful of paper-aligned knobs:

  • rank and rank_method ("cumvar" / "fixed" / "usvt");

  • k_clusters and k_max for the silhouette-driven clustering step;

  • alpha and n_bayes_samples for the Bayesian credible band.

mlsynth.utils.clustersc_helpers.pcr.pipeline.run_pcr(treated_outcome: ndarray, donor_outcomes: ndarray, donor_names: Sequence[str], T0: int, *, objective: str = 'OLS', estimator: str = 'frequentist', clustering: bool = False, rank: int | None = None, rank_method: str = 'cumvar', cumvar_threshold: float = 0.95, standardize_for_rank: bool = True, project_denoised: bool = False, k_clusters: int | None = None, k_max: int = 8, alpha: float = 0.05, n_bayes_samples: int = 1000, alpha_prior: float = 1.0, lambda_penalty: float | None = None, p: float | None = None, q: float | None = None, shen_variance: str = 'homoskedastic', compute_shen_ci: bool = True, random_state: int = 0) Tuple[MethodFit, Tuple[ndarray, ndarray] | None]#

Run the paper-aligned PCR-SC pipeline.

Parameters:
  • treated_outcome (np.ndarray) – Treated outcome series, shape (T,).

  • donor_outcomes (np.ndarray) – Donor outcomes (columns = donors), shape (T, J).

  • donor_names (sequence of str) – Length-J donor labels.

  • T0 (int) – Number of pre-treatment periods.

  • objective ({“OLS”, “SIMPLEX”}) – Which weight solver to use for estimator="frequentist". "OLS" is the paper’s Algorithm 2; "SIMPLEX" is the Abadie-style mlsynth extension.

  • estimator ({“frequentist”, “bayesian”}) – Frequentist Algorithm 2 vs. the Bayani 2022 posterior.

  • clustering (bool) – If True, run Algorithm 4 (donor clustering + target match).

  • rank, rank_method, cumvar_threshold – Truncation-rank controls; see hsvt.select_rank().

  • k_clusters, k_max – Cluster-count controls; see clustering.cluster_donors().

  • alpha, n_bayes_samples, alpha_prior – Bayesian-path controls; see bayesian.solve_bayesian().

  • lambda_penalty, p, q – Optional elastic-net knobs for the frequentist OLS path.

  • random_state (int) – Seed forwarded to k-means and the Bayesian sampler.

Returns:

  • fit (MethodFit) – Frozen container with the PCR fit (counterfactual projected through the denoised donor matrix, Algorithm 4 Step 5).

  • credible_band (tuple of np.ndarray, optional) – (lower, upper) per-period credible bounds when estimator="bayesian"; None otherwise.

RPCA-SC pipeline#

Functional Principal Component Analysis (FPCA) for RPCA-SC.

Implements Step 1 of Bayani (2021), Robust PCA Synthetic Control:

  • Smooth each unit’s pre-period outcome trajectory with a cubic B-spline (Li, Wang & Carroll 2010).

  • Apply PCA to the smoothed trajectories.

  • Truncate to the leading components that explain at least cumvar_threshold of the variance (paper Section 3 recommends \(\geq 95\%\)).

  • Standardise the surviving scores so they are ready for k-means.

A fallback to plain PCA is used when the pre-period is too short for a cubic spline. Edge cases (no units / no scores after truncation) are handled by returning a degenerate one-cluster partition upstream.

class mlsynth.utils.clustersc_helpers.rpca.fpca.FPCAFeatures(scores: ndarray, rank: int, smoothing: str)#

Output of compute_fpca_features().

rank: int#
scores: ndarray#
smoothing: str#
mlsynth.utils.clustersc_helpers.rpca.fpca.compute_fpca_features(pre_outcomes: ndarray, cumvar_threshold: float = 0.95, spline_degree: int = 3) FPCAFeatures#

Compute standardised FPC scores from a panel of pre-period trajectories.

Parameters:
  • pre_outcomes (np.ndarray) – Pre-period outcomes, shape (n_units, T0). Rows are units (the treated unit included, by paper convention), columns are time points.

  • cumvar_threshold (float) – Cumulative-variance target for truncating the FPC expansion. Paper default is 0.95.

  • spline_degree (int) – B-spline degree. Default 3 (cubic, Bayani 2021).

K-means clustering on FPC scores for RPCA-SC.

Implements Step 2 of Bayani (2021): apply Hartigan-Wong style \(k\)-means to the standardised FPC scores (treated unit included in the panel), then select the donor pool as the other members of the cluster containing the treated unit.

The number of clusters is either user-supplied or chosen by the silhouette coefficient (Rousseeuw 1987) over \(k \in [2, k_{\max}]\), matching Bayani’s recommendation.

class mlsynth.utils.clustersc_helpers.rpca.clustering.FPCACluster(labels: ndarray, treated_cluster: int, donor_indices: ndarray, k: int)#

Output of assign_clusters().

donor_indices: ndarray#
k: int#
labels: ndarray#
treated_cluster: int#
mlsynth.utils.clustersc_helpers.rpca.clustering.assign_clusters(scores: ndarray, treated_row: int, k_clusters: int | None = None, k_max: int = 8, random_state: int = 0) FPCACluster#

Cluster units by their FPC scores and pick the treated unit’s cluster.

Parameters:
  • scores (np.ndarray) – Standardised FPC scores, shape (n_units, rank). The treated unit is row treated_row.

  • treated_row (int) – Index of the treated unit’s row in scores.

  • k_clusters (int, optional) – Number of clusters. If None, the silhouette coefficient selects \(k \in [2, k_{\max}]\).

  • k_max (int) – Upper bound for the silhouette search.

  • random_state (int) – Seed for sklearn.cluster.KMeans initialisation.

Returns:

FPCACluster – Cluster labels, the treated unit’s cluster id, the indices of the donor pool (cluster members minus the treated row), and the number of clusters used.

Principal Component Pursuit (PCP) via ADMM.

Implements Step 3 of Bayani (2021), with the convex relaxation due to Candes, Li, Ma & Wright (2011):

\[\min_{L,S} \| L \|_* + \lambda \| S \|_1 \quad \text{s.t.} \quad Y = L + S.\]

The augmented Lagrangian is solved by alternating direction method of multipliers (ADMM): \(L\) is updated via the singular value soft-thresholding operator \(\mathcal{D}_{1/\mu}\), \(S\) via element-wise soft-thresholding \(\mathcal{S}_{\lambda/\mu}\), and the dual variable \(\Lambda\) is the standard multiplier update.

Defaults follow Bayani (2021) Section 2.4:

  • \(\lambda = 1 / \sqrt{\max(N, T)}\) (Candes et al. 2011 default).

  • \(\mu = \frac{N \cdot T}{4 \cdot \sum_{ij} |Y_{ij}|}\) (Bayani’s rule of thumb).

class mlsynth.utils.clustersc_helpers.rpca.pcp.PCPResult(low_rank: ndarray, sparse: ndarray, iterations: int, converged: bool, lambda_used: float, mu_used: float)#

Decomposition output: Y = L + S, plus solver diagnostics.

converged: bool#
iterations: int#
lambda_used: float#
low_rank: ndarray#
mu_used: float#
sparse: ndarray#
mlsynth.utils.clustersc_helpers.rpca.pcp.pcp_decompose(Y: ndarray, lam: float | None = None, mu: float | None = None, max_iter: int = 1000, tol: float = 1e-09) PCPResult#

Solve the PCP problem and return the (L, S) decomposition.

Parameters:
  • Y (np.ndarray) – Observed matrix, shape (m, n).

  • lam (float, optional) – Sparsity penalty \(\lambda\). Defaults to \(1/\sqrt{\max(m, n)}\) (Candes et al. 2011).

  • mu (float, optional) – Augmented-Lagrangian penalty \(\mu\). Defaults to \(m n / (4 \sum |Y_{ij}|)\) (Bayani 2021).

  • max_iter (int) – Iteration cap. Default 1000.

  • tol (float) – Frobenius-norm tolerance for Y - L - S, scaled by ||Y||_F.

Half-quadratic regularised Robust PCA (HQF).

Implements the non-convex robust PCA variant of Wang, Li, So & Liu (2023), Robust PCA via non-convex half-quadratic regularization, Signal Processing 204, 108816.

Solves a factored model \(Y = UV + S\) where \(U \in \mathbb{R}^{m \times r}\), \(V \in \mathbb{R}^{r \times n}\), and \(S\) collects sparse / outlier entries. Updates alternate between regularised least squares for the factors and an adaptive median-based threshold on the residual for the sparse component.

Default hyperparameters (Bayani 2021 Section 2.4, recommending Wang et al.’s defaults):

  • rank – smallest \(r\) with cumulative spectral energy \(\geq\) cumvar_threshold (Bayani uses 0.999).

  • ip (noise-scale adaptation factor) – 1.0.

  • lam (factor regularisation) – \(1 / \sqrt{\max(m, n)}\).

  • max_iter1000.

class mlsynth.utils.clustersc_helpers.rpca.hqf.HQFResult(low_rank: ndarray, sparse: ndarray, iterations: int, rank_used: int, lambda_used: float, ip_used: float)#

Output of hqf_decompose(): low-rank + sparse parts + diagnostics.

ip_used: float#
iterations: int#
lambda_used: float#
low_rank: ndarray#
rank_used: int#
sparse: ndarray#
mlsynth.utils.clustersc_helpers.rpca.hqf.hqf_decompose(Y: ndarray, rank: int | None = None, cumvar_threshold: float = 0.999, lam: float | None = None, ip: float = 1.0, max_iter: int = 1000, random_state: int = 0) HQFResult#

Run the HQF robust PCA decomposition.

Parameters:
  • Y (np.ndarray) – Observed matrix, shape (m, n).

  • rank (int, optional) – Explicit factorisation rank. If None, picked by cumulative spectral energy at cumvar_threshold.

  • cumvar_threshold (float) – Cumulative-variance target used when rank is None. Bayani 2021 uses 0.999.

  • lam (float, optional) – Tikhonov factor for the alternating LS step. Defaults to \(1 / \sqrt{\max(m, n)}\) (Bayani 2021).

  • ip (float) – Noise-scale adaptation factor. Default 1.0.

  • max_iter (int) – Iteration cap. Default 1000.

  • random_state (int) – Seed for the initialisation of \(U\).

Non-negative least squares weight solver for RPCA-SC.

Implements Step 4 of Bayani (2021):

\[\widehat{\beta} = \arg\min_{\beta \geq 0} \| y_i^- - (L^-)^\top \beta \|_2^2,\]

where \(L^-\) is the HSVT-denoised pre-period donor matrix. The paper deliberately drops the sum-to-one constraint that classical synthetic control imposes – the clustering step already restricts the donor pool to units that behave like the treated unit, so the non-negativity constraint is enough to keep the counterfactual interpretable.

mlsynth.utils.clustersc_helpers.rpca.weights.solve_nnls(denoised_donor_pre: ndarray, target_pre: ndarray) ndarray#

Return the non-negative least squares weights \(\widehat{\beta}\).

Parameters:
  • denoised_donor_pre (np.ndarray) – Pre-period denoised donor matrix, shape (T0, J). Columns are donor units.

  • target_pre (np.ndarray) – Treated unit’s pre-period outcomes, shape (T0,).

Notes

Wraps scipy.optimize.nnls(). The Lawson-Hanson algorithm is finite-step exact for the box-constrained QP arising from Bayani’s Step 4.

Cross-validation tuners for the RPCA-SC pipeline.

Bayani (2021) takes the Candes-Li-Ma-Wright (2011) optimal-recovery value \(\lambda = 1/\sqrt{\max(N, T)}\) for the PCP sparsity penalty, and selects the HQF factorisation rank by an explained- variance threshold. Both defaults are conservative for the L/S decomposition identifiability problem but can be poor for the counterfactual prediction problem (e.g. on the California Proposition 99 panel they leave PCP under-regularised by ~2x and PCR-SC beats RPCA-SC on pre-period RMSE by a factor of ~2).

This module adds leave-one-time-period-out cross-validation for the PCP \(\lambda\) and HQF \(r\) knobs. For each candidate value:

  1. Run the RPCA decomposition on the full pre-period donor matrix (the donor data is observed at every period – only the treated unit’s pre-period is “held out”).

  2. For each pre-period \(t \in [0, T_0)\), refit the NNLS weights on the other \(T_0 - 1\) periods and predict the held-out \(y_t\). Aggregate the squared errors.

  3. Pick the value with the lowest mean held-out MSE.

This evaluates exactly the quantity the user cares about (out-of-sample prediction accuracy of the synthetic control) without touching the donor matrix (so the RPCA decomposition is not over-fit to the held-out treated period).

class mlsynth.utils.clustersc_helpers.rpca.tuning.CVResult(grid: ndarray, cv_mse: ndarray, best: float, best_idx: int)#

Output of one CV sweep.

best: float#
best_idx: int#
cv_mse: ndarray#
grid: ndarray#
mlsynth.utils.clustersc_helpers.rpca.tuning.cv_hqf_rank(donor_pre: ndarray, treated_pre: ndarray, grid: Sequence[int] | None = None, *, max_rank: int | None = None, hqf_lambda: float | None = None, hqf_ip: float = 1.0, hqf_max_iter: int = 1000, random_state: int = 0) CVResult#

Leave-one-time-out CV for HQF’s factorisation rank \(r\).

Parameters:
  • donor_pre (np.ndarray) – Pre-period donor matrix, shape (T0, J).

  • treated_pre (np.ndarray) – Treated unit’s pre-period outcomes, shape (T0,).

  • grid (sequence of int, optional) – Integer ranks to try. Defaults to range(1, max_rank+1) with max_rank = min(J, T0-1).

  • max_rank – Upper bound for the default grid.

  • hqf_lambda, hqf_ip, hqf_max_iter, random_state – Forwarded to mlsynth.utils.clustersc_helpers.rpca.hqf.hqf_decompose().

mlsynth.utils.clustersc_helpers.rpca.tuning.cv_pcp_lambda(donor_pre: ndarray, treated_pre: ndarray, grid: Sequence[float] | None = None, *, base_lambda: float | None = None, multipliers: Sequence[float] = (0.5, 1.0, 2.0, 3.0, 5.0, 8.0, 12.0), pcp_mu: float | None = None, pcp_max_iter: int = 1000, pcp_tol: float = 1e-09) CVResult#

Leave-one-time-out CV for PCP’s sparsity penalty \(\lambda\).

Parameters:
  • donor_pre (np.ndarray) – Pre-period donor matrix, shape (T0, J) (columns are donors).

  • treated_pre (np.ndarray) – Treated unit’s pre-period outcomes, shape (T0,).

  • grid (sequence of float, optional) – Explicit grid of lambda values. If None, the grid is built as multipliers * base_lambda where base_lambda defaults to Candes’ value \(1/\sqrt{\max(J, T_0)}\).

  • base_lambda, multipliers – Used only when grid is not supplied.

  • pcp_mu, pcp_max_iter, pcp_tol – Forwarded to mlsynth.utils.clustersc_helpers.rpca.pcp.pcp_decompose().

Cattaneo-Feng-Titiunik prediction intervals for RPCA-SC.

Implements a focused port of the prediction-interval machinery from Cattaneo, M. D., Feng, Y., & Titiunik, R. (2021), “Prediction Intervals for Synthetic Control Methods” (JASA 116(536):1865-1880) and its multi-period extension in Cattaneo, Feng, Palomba, Titiunik (2025). The reference implementation lives at nppackages/scpi (Python package scpi_pkg).

The full scpi_pkg ships an ECOS-based constrained optimisation solver for the in-sample component and three out-of-sample methods (gaussian / ls / qreg) plus dask-parallel simulation. Vendoring that for one inference method on one estimator family would pull in dask, plotnine, mizani, fsspec, cloudpickle, ecos, luddite, and partd as hard dependencies. Instead this module implements the same two-component structure with two lightweight choices that capture the spirit of the paper:

Prediction interval at post-period \(t > T_0\):

\[\mathrm{PI}_t = \widehat y^0_t \;\pm\; \bigl[ M_w(t, \alpha/2) + M_e(t, \alpha/2) \bigr],\]

where the two components are quantified as follows.

  • In-sample uncertainty \(M_w(t, \alpha/2)\) – HC1-scaled parametric bootstrap of the pre-period residuals. At each of sims draws we perturb the treated pre-period outcome with a resampled-residual error and refit the full RPCA-SC pipeline via the user-supplied refit_fn. The \((\alpha/2, 1-\alpha/2)\) empirical quantiles of the resulting counterfactual draws give an asymmetric in-sample band per post-period. This replaces the paper’s ECOS solver, which solves a constrained \(\sup / \inf\) over the “compatible set” of weights – the bootstrap is equivalent under regularity conditions and avoids pulling in ecos as a dependency.

  • Out-of-sample uncertainty \(M_e(t, \alpha/2)\) – the gaussian variant of the paper’s e_method. Under sub-Gaussian post-period shocks the Hoeffding bound gives

    \[M_e(t, \alpha/2) = \sqrt{-2 \log \alpha} \; \widehat\sigma_e,\]

    where \(\widehat\sigma_e = \mathrm{sd}(\widehat u^-)\) is estimated from the pre-period residuals. The two more elaborate variants in the paper (ls – location-scale model; qreg – quantile regression on residuals) are deferred for a future revision.

For the average treatment effect on the treated, the in-sample component aggregates by simulating the post-period mean of the counterfactual across bootstrap draws and taking quantiles. The out-of-sample component aggregates as \(\widehat\sigma_e \sqrt{-2 \log \alpha} / \sqrt{T_1}\) under post-period shock independence.

class mlsynth.utils.clustersc_helpers.rpca.inference.CFTInference(method: str, alpha: float, att: float, per_period_gap: ndarray, per_period_pi: ndarray, in_sample_band: ndarray, out_of_sample_eps: float, att_pi: Tuple[float, float], sims: int, sigma_e: float)#

Output of cft_prediction_intervals().

method#

Tag identifying the e_method used (currently "cft_gaussian").

Type:

str

alpha#

Nominal level used to form the intervals.

Type:

float

att#

Mean post-period gap.

Type:

float

per_period_gap#

Per-post-period treatment effect estimates, shape (T1,).

Type:

np.ndarray

per_period_pi#

Per-post-period \((1-\alpha)\) PI for the treatment effect \(\tau_t\), shape (T1, 2).

Type:

np.ndarray

in_sample_band#

Per-post-period in-sample bootstrap band on the counterfactual, shape (T1, 2). in_sample_band[t, 0] is the bootstrap \(\alpha/2\) quantile minus the point estimate; in_sample_band[t, 1] is the \((1 - \alpha/2)\) quantile minus the point estimate.

Type:

np.ndarray

out_of_sample_eps#

Out-of-sample Hoeffding bound – the same scalar at every post-period under the homoskedastic Gaussian default.

Type:

float

att_pi#

Aggregated PI on the ATT.

Type:

tuple of float

sims#

Number of bootstrap draws used.

Type:

int

sigma_e#

Estimated pre-period residual standard deviation.

Type:

float

alpha: float#
att: float#
att_pi: Tuple[float, float]#
in_sample_band: ndarray#
method: str#
out_of_sample_eps: float#
per_period_gap: ndarray#
per_period_pi: ndarray#
sigma_e: float#
sims: int#
mlsynth.utils.clustersc_helpers.rpca.inference.cft_prediction_intervals(treated_outcome: ndarray, counterfactual: ndarray, T0: int, refit_fn: Callable[[ndarray], ndarray], *, e_method: str = 'gaussian', alpha: float = 0.05, sims: int = 200, random_state: int = 0) CFTInference#

Two-component prediction intervals for an RPCA-SC fit.

Parameters:
  • treated_outcome (np.ndarray) – Treated outcome series, shape (T,).

  • counterfactual (np.ndarray) – Counterfactual from the actual fit, shape (T,).

  • T0 (int) – Number of pre-treatment periods.

  • refit_fn (callable) – refit_fn(y_perturbed) -> counterfactual where the argument is a length-T modified treated outcome whose pre-period entries have been perturbed by a resampled residual draw, and the return is the resulting full-period counterfactual. The pipeline supplies this thunk so the inference module stays decoupled from the orchestration.

  • e_method ({“gaussian”}) – Out-of-sample method. Currently only "gaussian" (Hoeffding bound) is supported.

  • alpha (float) – Two-sided level. Default 0.05 -> 95% intervals.

  • sims (int) – Number of bootstrap draws for the in-sample component.

  • random_state (int) – Seed for the resampling RNG.

Orchestration pipeline for RPCA-SC (Bayani 2021, Algorithm 4).

Composes the five steps of Robust PCA Synthetic Control:

  1. fpca – standardised FPC scores from pre-period trajectories.

  2. clustering – silhouette-driven \(k\)-means and donor selection via the treated unit’s cluster membership.

  3. pcp / hqf – robust \(Y = L + S\) decomposition of the selected donor pool.

  4. weights – non-negative least squares against the denoised donor matrix \(L^-\).

  5. Project: \(\widehat{Y}_i^+ = (L^+)^\top \widehat{\beta}\) using the same denoised donor matrix in the post-period.

The dispatcher run_rpca() takes the same numpy-array signature as run_pcr() so the CLUSTERSC orchestrator can call both behind a uniform interface.

mlsynth.utils.clustersc_helpers.rpca.pipeline.run_rpca(treated_outcome: ndarray, donor_outcomes: ndarray, donor_names: Sequence[str], T0: int, *, rpca_method: str = 'PCP', fpca_cumvar: float = 0.95, k_clusters: int | None = None, k_max: int = 8, pcp_lambda: float | None = None, pcp_mu: float | None = None, pcp_max_iter: int = 1000, pcp_tol: float = 1e-09, hqf_rank: int | None = None, hqf_cumvar: float = 0.999, hqf_lambda: float | None = None, hqf_ip: float = 1.0, hqf_max_iter: int = 1000, cv_lambda: bool = False, cv_hqf_rank: bool = False, cv_lambda_multipliers: Sequence[float] = (0.5, 1.0, 2.0, 3.0, 5.0, 8.0, 12.0), cv_hqf_rank_grid: Sequence[int] | None = None, compute_cft_pi: bool = False, cft_alpha: float = 0.05, cft_sims: int = 200, cft_e_method: str = 'gaussian', random_state: int = 0) MethodFit#

Run the five-step RPCA-SC pipeline and assemble a MethodFit.

Parameters:
  • treated_outcome (np.ndarray) – Treated outcome series, shape (T,).

  • donor_outcomes (np.ndarray) – Donor outcomes (columns = donors), shape (T, J).

  • donor_names (sequence of str) – Length-J donor labels.

  • T0 (int) – Number of pre-treatment periods.

  • rpca_method ({“PCP”, “HQF”}) – Robust PCA decomposition.

  • fpca_cumvar (float) – Cumulative-variance target for FPCA truncation (Step 1). Paper default 0.95.

  • k_clusters, k_max – Cluster-count controls; see clustering.assign_clusters().

  • pcp_lambda, pcp_mu, pcp_max_iter, pcp_tol – PCP solver knobs (Candes et al. 2011 / Bayani 2021).

  • hqf_rank, hqf_cumvar, hqf_lambda, hqf_ip, hqf_max_iter – HQF solver knobs (Wang et al. 2023).

  • random_state (int) – Seed for k-means and HQF.

Returns:

MethodFit – Frozen container with the RPCA-SC fit (counterfactual projected through the denoised donor matrix in both pre and post).

Shared utilities#

Data preparation helpers for CLUSTERSC.

mlsynth.utils.clustersc_helpers.setup.prepare_clustersc_inputs(df: DataFrame, outcome: str, treat: str, unitid: str, time: str) CLUSTERSCInputs#

Pivot the long panel into the CLUSTERSC layout.

Parameters:
  • df (pd.DataFrame) – Long balanced panel.

  • outcome, treat, unitid, time (str) – Column names.

Returns:

CLUSTERSCInputs – Preprocessed panel.

Raises:

MlsynthDataError – On unbalanced panel, missing values, multiple treated cohorts, or zero pre/post-treatment periods.

Result structures for the Cluster-based Synthetic Control (CLUSTERSC) estimator.

The inputs and per-method fits are frozen dataclasses; the top-level CLUSTERSCResults is a Pydantic EffectResult (the two-family result contract), carrying the dataclass layers as typed fields.

CLUSTERSC bundles two complementary robust synthetic-control families:

  • PCR-RSC – Robust Synthetic Control via Principal Component Regression. Combines SVD-based donor clustering (Amjad, Shah and Shen 2018) with PCR weight estimation (Agarwal, Shah, Shen and Song 2021). Frequentist mode uses an SCM-style QP via cvxpy; Bayesian mode uses the conjugate-prior posterior of Bayani (2022, CUNY dissertation Chapter 1) with credible intervals.

  • RPCA-SC – Robust PCA Synthetic Control. Applies functional-PCA clustering on the pre-treatment outcomes, then a robust low-rank decomposition (PCP – Candes, Li, Ma, Wright 2011; or HQF – Wang, Li, So and Liu 2023) to denoise the donor panel before fitting simplex SCM weights.

Both methods can run in parallel via method = "both" so the user can compare their estimates side by side.

The four layers below (inputs, per-method fit, optional Bayesian inference, top-level results) keep the pipeline pluggable.

class mlsynth.utils.clustersc_helpers.structures.CLUSTERSCInference(method: str, alpha: float, att: float = nan, credible_interval: Tuple[float, float] = (nan, nan), shen: Any | None = None, cft: Any | None = None)#

Optional inferential output.

Two inference families are supported on the PCR-SC fit:

  • Bayesian credible band – populated when estimator = "bayesian". Reports a posterior credible interval for the ATT (Bayani 2022 Ch. 1).

  • Shen-Ding-Sekhon-Yu (2023) frequentist CIs – populated for the frequentist OLS PCR path. Reports per-period and ATT CIs under three sources of randomness (HZ / VT / DR) and one of three variance estimators (homoskedastic / jackknife / HRK).

Parameters:
  • method (str) – "bayesian_credible" for the Bayesian path, "shen_<variance>" (e.g. "shen_homoskedastic") for the Shen et al. path, or "none" if no inference was run.

  • alpha (float) – Two-sided significance level (e.g. 0.05 -> 95% interval).

  • att (float) – Mean post-treatment gap for the primary variant.

  • credible_interval (tuple of float) – (lower, upper) posterior credible interval for the ATT (Bayesian path only; (nan, nan) otherwise).

  • shen (object, optional) – Full mlsynth.utils.clustersc_helpers.pcr.inference.ShenInference object when the Shen et al. CIs were computed (frequentist OLS PCR only). Carries per-period and ATT CIs under each source-of-randomness assumption.

  • cft (object, optional) – Full mlsynth.utils.clustersc_helpers.rpca.inference.CFTInference object when Cattaneo-Feng-Titiunik (2021) prediction intervals were computed (RPCA-SC only, opt-in via CLUSTERSCConfig.compute_cft_pi). Carries per-period and ATT prediction intervals.

alpha: float#
att: float = nan#
cft: Any | None = None#
credible_interval: Tuple[float, float] = (nan, nan)#
method: str#
shen: Any | None = None#
class mlsynth.utils.clustersc_helpers.structures.CLUSTERSCInputs(treated_outcome: ndarray, donor_outcomes: ndarray, donor_names: ndarray, treated_unit_name: Any, T: int, T0: int, time_labels: ndarray)#

Preprocessed panel data for CLUSTERSC.

Parameters:
  • treated_outcome (np.ndarray) – Treated outcome series, shape (T,).

  • donor_outcomes (np.ndarray) – Donor outcome matrix, shape (T, J).

  • donor_names (np.ndarray) – Length-J labels of the donor units.

  • treated_unit_name (Any) – Label of the treated unit.

  • T (int) – Total number of panel periods.

  • T0 (int) – Number of pre-treatment periods.

  • time_labels (np.ndarray) – Length-T labels of the time periods.

property J: int#

Number of donor units.

T: int#
T0: int#
donor_names: ndarray#
donor_outcomes: ndarray#
property n_post: int#

Number of post-treatment periods.

time_labels: ndarray#
treated_outcome: ndarray#
treated_unit_name: Any#
class mlsynth.utils.clustersc_helpers.structures.CLUSTERSCResults(*, effects: ~mlsynth.config_models.EffectsResults | None = None, fit_diagnostics: ~mlsynth.config_models.FitDiagnosticsResults | None = None, time_series: ~mlsynth.config_models.TimeSeriesResults | None = None, weights: ~mlsynth.config_models.WeightsResults | None = None, inference: ~mlsynth.config_models.InferenceResults | None = None, method_details: ~mlsynth.config_models.MethodDetailsResults | None = None, sub_method_results: ~typing.Dict[str, ~typing.Any] | None = None, additional_outputs: ~typing.Dict[str, ~typing.Any] | None = None, raw_results: ~typing.Dict[str, ~typing.Any] | None = None, execution_summary: ~typing.Dict[str, ~typing.Any] | None = None, plot_config: ~mlsynth.config_models.PlotConfig | None = None, inputs: ~mlsynth.utils.clustersc_helpers.structures.CLUSTERSCInputs, pcr: ~mlsynth.utils.clustersc_helpers.structures.MethodFit | None = None, rpca: ~mlsynth.utils.clustersc_helpers.structures.MethodFit | None = None, selected_variant: str = 'pcr', cluster_inference: ~mlsynth.utils.clustersc_helpers.structures.CLUSTERSCInference | None = None, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#

Top-level container returned by mlsynth.CLUSTERSC.fit().

An EffectResult (the observational report): it populates the standardized sub-models (effects, time_series, weights, inference, fit_diagnostics, method_details) from the primary variant, so the flat accessors att / counterfactual / gap / att_ci / donor_weights / pre_rmse resolve through the base contract. The CLUSTERSC-specific fields below carry the dispatcher’s extra structure (both family fits side by side, and the rich inference).

Parameters:
  • inputs (CLUSTERSCInputs) – Preprocessed panel.

  • pcr (MethodFit or None) – PCR-RSC fit (populated when method in {"pcr", "both"}).

  • rpca (MethodFit or None) – RPCA-SC fit (populated when method in {"rpca", "both"}).

  • selected_variant (str) – Which fit drives the standardized sub-models / flat accessors – "pcr" or "rpca". When method = "both" the user picks via CLUSTERSCConfig.primary; default "pcr".

  • cluster_inference (CLUSTERSCInference or None) – The rich inferential output (Bayesian credible interval, Shen et al. per-period/ATT CIs, or CFT prediction intervals). The scalar summary is mirrored into the standardized inference slot so res.att_ci resolves.

  • metadata (dict) – Free-form pipeline diagnostics.

cluster_inference: CLUSTERSCInference | None#
inputs: CLUSTERSCInputs#
metadata: Dict[str, Any]#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'frozen': True, 'json_encoders': {<class 'numpy.ndarray'>: <function BaseEstimatorResults.Config.<lambda>>}}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

pcr: MethodFit | None#
rpca: MethodFit | None#
selected_variant: str#
class mlsynth.utils.clustersc_helpers.structures.MethodFit(name: str, counterfactual: ~numpy.ndarray, gap: ~numpy.ndarray, att: float, pre_rmse: float, donor_weights: ~typing.Dict[~typing.Any, float], selected_donors: ~numpy.ndarray, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#

Single-method (PCR or RPCA) fit output.

Parameters:
  • name (str) – Method identifier ("pcr_frequentist", "pcr_bayesian", or "rpca").

  • counterfactual (np.ndarray) – Synthetic-control imputation of the treated outcome at every period, shape (T,).

  • gap (np.ndarray) – Observed treated minus counterfactual, shape (T,).

  • att (float) – Mean post-treatment gap.

  • pre_rmse (float) – Root-mean-squared pre-treatment fit error.

  • donor_weights (dict) – Mapping {donor_name: weight} with non-trivial weights only.

  • selected_donors (np.ndarray) – Subset of donor_names retained after clustering (PCR) or cluster selection (RPCA). Equal to all donors when no clustering is applied.

  • metadata (dict) – Free-form per-method diagnostics (e.g. cluster-id assignment, rank used by RPCA, posterior credible bounds for Bayesian PCR).

att: float#
counterfactual: ndarray#
donor_weights: Dict[Any, float]#
gap: ndarray#
metadata: Dict[str, Any]#
name: str#
pre_rmse: float#
selected_donors: ndarray#

Diagnostic plot for CLUSTERSC results.

mlsynth.utils.clustersc_helpers.plotter.plot_clustersc(results: CLUSTERSCResults) None#

Two-panel plot: trajectories + gap, both with PCR / RPCA overlays.

Example#

A self-contained one-draw Monte Carlo at a two-factor DGP, fitting both PCR-SC and RPCA-SC in parallel and inspecting their fits via the frozen CLUSTERSCResults container.

"""One draw of a two-factor CLUSTERSC simulation."""

import numpy as np
import pandas as pd

from mlsynth import CLUSTERSC


# ---------------------------------------------------------------------
# 1. Simulate one panel from a two-factor DGP
# ---------------------------------------------------------------------

rng = np.random.default_rng(0)
J = 12             # control units
T_pre = 14
T_post = 6
T = T_pre + T_post
r_true = 2
tau_true = 1.0     # additive treatment effect on the treated unit

F = rng.standard_normal((T, r_true))
lam = rng.standard_normal((J + 1, r_true))
eps = rng.standard_normal((T, J + 1)) * 0.4
Y = F @ lam.T + eps
Y[T_pre:, 0] += tau_true                              # unit 0 treated

rows = [
    {
        "unit": j,
        "time": t,
        "y": float(Y[t, j]),
        "D": int(j == 0 and t >= T_pre),
    }
    for j in range(J + 1)
    for t in range(T)
]
df = pd.DataFrame(rows)


# ---------------------------------------------------------------------
# 2. Fit CLUSTERSC with both families and Bayesian PCR inference
# ---------------------------------------------------------------------

res = CLUSTERSC({
    "df": df,
    "outcome": "y",
    "treat": "D",
    "unitid": "unit",
    "time": "time",
    "method": "both",
    "primary": "pcr",
    "pcr_objective": "OLS",
    "estimator": "bayesian",
    "rpca_method": "HQF",
    "alpha": 0.10,
    # No cluster structure in the synthetic panel; pool donors for
    # the RPCA branch so its Algorithm 4 Step 2 does not isolate
    # the treated row.
    "k_clusters": 1,
}).fit()


# ---------------------------------------------------------------------
# 3. Inspect the output
# ---------------------------------------------------------------------

print(f"true tau              = {tau_true:+.3f}")
print(f"selected variant      = {res.selected_variant}")
print(f"primary ATT           = {res.att:+.3f}")
print(f"PCR  ATT              = {res.pcr.att:+.3f}")
print(f"RPCA ATT              = {res.rpca.att:+.3f}")
print(f"PCR  pre-RMSE         = {res.pcr.pre_rmse:.4f}")
print(f"RPCA pre-RMSE         = {res.rpca.pre_rmse:.4f}")
print(f"HSVT rank (PCR)       = {res.pcr.metadata['rank']}")
print(f"HQF  rank (RPCA)      = {res.rpca.metadata['hqf_rank']}")
if res.cluster_inference.method == "bayesian_credible":
    lo, hi = res.cluster_inference.credible_interval
    print(f"Bayesian 90% CrI ATT  = [{lo:+.3f}, {hi:+.3f}]")

Empirical example: California Proposition 99#

The PCR-SC family also runs without clustering, in which case it reduces to Amjad-Shah-Shen (2018) Robust Synthetic Control on the full donor pool. Below: PCR with explicit rank \(r = 4\), then inspect the Shen-Ding-Sekhon-Yu (2023) frequentist CIs that the default compute_shen_ci=True produces.

from mlsynth import CLUSTERSC
import pandas as pd

file = (
    "https://raw.githubusercontent.com/jgreathouse9/mlsynth/"
    "refs/heads/main/basedata/smoking_data.csv"
)
df = pd.read_csv(file)

res = CLUSTERSC({
    "df": df,
    "outcome": "cigsale",
    "treat": "Proposition 99",
    "unitid": "state",
    "time": "year",
    "method": "pcr",
    "clustering": False,
    "rank": 4,                 # explicit HSVT rank
    "display_graphs": True,
}).fit()

print(f"ATT             = {res.att:+.3f}")
shen = res.cluster_inference.shen
print(f"95% CI (VT src) = [{shen.att_ci_vt[0]:+.2f}, {shen.att_ci_vt[1]:+.2f}]")
print(f"95% CI (HZ src) = [{shen.att_ci_hz[0]:+.2f}, {shen.att_ci_hz[1]:+.2f}]")

Same panel, RPCA-SC with cross-validated PCP \(\lambda\). The cv_lambda=True flag triggers leave-one-time-out CV over cv_lambda_multipliers \(\times \lambda^\star\) to pick the prediction-optimal sparsity penalty.

res = CLUSTERSC({
    "df": df,
    "outcome": "cigsale",
    "treat": "Proposition 99",
    "unitid": "state",
    "time": "year",
    "method": "rpca",
    "rpca_method": "PCP",
    "cv_lambda": True,
    "k_clusters": 1,            # weak cluster structure on this panel
}).fit()

md = res.rpca.metadata
print(f"PCP lambda chosen by CV = {md['pcp_lambda']:.4f} "
      f"(best of grid {[round(x,3) for x in md['cv_lambda_grid']]})")
print(f"pre-RMSE  = {res.rpca.pre_rmse:.3f}")
print(f"ATT       = {res.rpca.att:+.3f}")

Same fit with Cattaneo-Feng-Titiunik (2021) prediction intervals (opt-in via compute_cft_pi=True; the bootstrap reruns the full pipeline cft_sims times, so this takes ~30s on Prop 99 with the default cft_sims=200).

res = CLUSTERSC({
    "df": df,
    "outcome": "cigsale",
    "treat": "Proposition 99",
    "unitid": "state",
    "time": "year",
    "method": "rpca",
    "rpca_method": "PCP",
    "cv_lambda": True,
    "k_clusters": 1,
    "compute_cft_pi": True,
    "cft_sims": 200,
    "cft_alpha": 0.05,
}).fit()

cft = res.cluster_inference.cft
print(f"ATT                = {res.att:+.3f}")
print(f"95% CFT PI on ATT  = [{cft.att_pi[0]:+.2f}, {cft.att_pi[1]:+.2f}]")
print(f"sigma_e (pre-RMSE) = {cft.sigma_e:.3f}")
print(f"out-of-sample eps  = {cft.out_of_sample_eps:.3f}")
for ti in range(len(cft.per_period_gap)):
    lo, hi = cft.per_period_pi[ti]
    print(f"  t={res.inputs.T0+ti}:  tau_t={cft.per_period_gap[ti]:+.2f}  "
          f"PI=[{lo:+.2f}, {hi:+.2f}]")

Empirical validation: West German reunification (Bayani 2021)#

Bayani (2021, Section 3) validates RPCA-SC on the German reunification panel of Abadie, Diamond & Hainmueller (2015): the 1990 reunification is treated as an intervention on West German per-capita GDP, with 16 OECD economies as the candidate donor pool. It is the canonical empirical benchmark for the RPCA-SC family – the analogue of Proposition 99 for PCR-SC – and mlsynth reproduces it from the bundled german_reunification.csv. RPCA-SC uses only the outcome series (no auxiliary covariates), in contrast to the five predictors Abadie et al. (2015) used.

from mlsynth import CLUSTERSC
import pandas as pd

file = (
    "https://raw.githubusercontent.com/jgreathouse9/mlsynth/"
    "refs/heads/main/basedata/german_reunification.csv"
)
df = pd.read_csv(file)

res = CLUSTERSC({
    "df": df,
    "outcome": "gdp",
    "treat": "Reunification",
    "unitid": "country",
    "time": "year",
    "method": "rpca",
    "rpca_method": "PCP",     # convex RPCA, as in Bayani (2021)
    "compute_cft_pi": True,
    "cft_sims": 200,
    "cft_alpha": 0.05,
    "display_graphs": True,
}).fit()

cft = res.cluster_inference.cft
gap = res.inputs.treated_outcome - res.rpca.counterfactual
weights = {k: round(v, 2) for k, v in res.rpca.donor_weights.items()
           if abs(v) > 1e-3}
print(f"donor pool (clustered) = {len(res.rpca.donor_weights)}")
print(f"positive weights       = {weights}")
print(f"pre-RMSE (1960-1989)   = {res.rpca.pre_rmse:.1f}")
print(f"reunification ATT      = {res.rpca.att:+.0f}")
print(f"gap in 2003            = {gap[-1]:+.0f}")
print(f"95% CFT PI on ATT      = [{cft.att_pi[0]:+.0f}, {cft.att_pi[1]:+.0f}]")

The FPCA + \(k\)-means step (Steps 1-2) selects an 11-economy cluster for West Germany – Australia, Austria, Belgium, Denmark, France, Italy, Japan, the Netherlands, New Zealand, Norway and the UK – excluding the level outliers (the USA, Switzerland) and the lower-income periphery (Greece, Portugal, Spain), exactly the donor pool of Bayani’s Table 2. PCP then reproduces that table’s weights to two digits: Norway \(\approx 0.48\), France \(\approx 0.35\), New Zealand \(\approx 0.30\) and Austria \(\approx 0.02\), with the remaining donors at zero. The average post-1990 gap is \(\approx -1500\) per-capita units with a 1960-1989 fit RMSE near \(90\), and the per-period gap widens to about \(-3730\) by 2003 – tracing the Robust PCA Synthetic Control trajectory in Bayani’s Figure 4 (a small positive blip in 1991-1992 followed by a steadily widening negative gap). The Cattaneo-Feng-Titiunik (2021) 95% prediction interval on the ATT is about \([-1765, -1280]\), excluding zero (its width tracks cft_sims and random_state); the HQF decomposition lands near \(-1920\). The conclusion echoes Abadie et al. (2015): reunification depressed West German per-capita GDP relative to its synthetic counterpart.

Simulation studies#

The three families documented here were each validated by their authors on a purpose-built simulation. mlsynth ships these designs verbatim where reproducible so the estimators can be regression- tested against the originals.

PCR-SC – missing-data robustness (Agarwal et al. 2021)#

Agarwal et al. (2021, Section 4.5) do not use a synthetic DGP; their validation is a missing-data robustness study on two canonical panels – Basque terrorism (Abadie-Gardeazabal 2003) and California Proposition 99 (Abadie et al. 2010). For each panel they

  • take the widely accepted published counterfactual as ground truth;

  • confirm the donor matrix is near rank-one – over \(99\%\) of the spectral energy sits in the top singular value, exactly the low-rank regime in which the Theorem 3.1 / 3.2 error bounds bite;

  • randomly obfuscate \(5\%\)-\(20\%\) of the donor entries and re-fit PCR (HSVT + OLS) on the outcome series only.

The finding: PCR tracks the published baseline across all missing- data levels, whereas classical convex SC degrades sharply and plain OLS overfits the pre-period noise (it even flips the sign of the Basque effect). This is the empirical face of the HSVT denoising step – PCR recovers the same conclusion with neither auxiliary covariates nor complete data. mlsynth reproduces the headline check on Proposition 99: the singular spectrum is near rank-one and a rank-1 HSVT PCR fit matches the canonical estimate.

import numpy as np, pandas as pd
from mlsynth import CLUSTERSC

file = (
    "https://raw.githubusercontent.com/jgreathouse9/mlsynth/"
    "refs/heads/main/basedata/smoking_data.csv"
)
df = pd.read_csv(file)

wide = df.pivot(index="year", columns="state", values="cigsale")
sv = np.linalg.svd(wide.values, compute_uv=False)
print(f"top-SV energy share = {sv[0]**2 / (sv**2).sum():.3f}")  # ~0.99

res = CLUSTERSC({
    "df": df, "outcome": "cigsale", "treat": "Proposition 99",
    "unitid": "state", "time": "year",
    "method": "pcr", "clustering": False, "rank": 1,
}).fit()
print(f"rank-1 PCR ATT = {res.att:+.2f}")

PCR-SC – de-noising and the Bayesian posterior (Amjad-Shah-Shen 2018)#

Amjad-Shah-Shen ([Amjad2018], Section 5.3) introduce the HSVT de-noising step that PCR-SC is built on, and validate it on a fully synthetic panel where the true mean is known. Each unit \(i\) draws a latent feature \(\theta_i \sim U[0, 1]\), time is \(\rho_t = t\), and the mean is

\[m_{it} = \theta_i\bigl(1 + 0.3\tfrac{\rho_t}{T} e^{\rho_t/T}\bigr) + \cos\tfrac{f_1\pi}{180} + 0.5\sin\tfrac{f_2\pi}{180} + 1.5\cos\tfrac{f_3\pi}{180} - 0.5\sin\tfrac{f_4\pi}{180},\]

with periodicities \(f_1 = \rho_t \bmod 360\), \(f_2 = \rho_t \bmod 180\), \(f_3 = 2\rho_t \bmod 360\), \(f_4 = 2\rho_t \bmod 180\); the observed \(X_{it} = m_{it} + \mathcal N(0, \sigma^2)\). They use \(N = 100\), \(T = 2000\), intervention at \(t = 1600\). Two findings drive the paper:

  • Training error tracks generalization error (Table 1). The pre-intervention MSE of the estimated mean closely matches the post-intervention MSE at every noise level – the de-noised fit generalizes.

  • De-noising beats no de-noising (Table 2). The HSVT step lowers the generalization error consistently versus regressing on the raw donor matrix.

mlsynth reproduces both: the pre/post MSE ratio is \(\approx 1\) across \(\sigma^2\), and rank-4 HSVT de-noising cuts the generalization error several-fold relative to the full-rank (no de-noising) fit. The paper’s Bayesian counterpart (Figure 12) trades the inner OLS for a Gaussian posterior over the weights; mlsynth’s estimator="bayesian" path now tracks the frequentist point estimate to within Monte-Carlo error. Three implementation choices make that hold (in mlsynth.utils.clustersc_helpers.pcr.bayesian.solve_bayesian()): the observation noise \(\sigma^2\) is estimated from the pre-period fit residual (not the total target variance, which conflates signal with noise); the point counterfactual is the posterior-mean projection (not a Monte-Carlo median of draws); and both the point estimate and the credible band are projected through the de-noised rank-\(r\) donor matrix, so the band reflects the signal subspace rather than raw-donor noise in the weight null space.

import numpy as np
import pandas as pd
from mlsynth import CLUSTERSC

N, T, T0 = 100, 2000, 1600
rng = np.random.default_rng(0)
theta = rng.uniform(0, 1, N)
rho = np.arange(1, T + 1).astype(float)
time_term = (np.cos((rho % 360) * np.pi / 180)
             + 0.5 * np.sin((rho % 180) * np.pi / 180)
             + 1.5 * np.cos(((2 * rho) % 360) * np.pi / 180)
             - 0.5 * np.sin(((2 * rho) % 180) * np.pi / 180))
a = 1.0 + 0.3 * (rho / T) * np.exp(rho / T)
M = np.outer(theta, a) + np.outer(np.ones(N), time_term)   # true means (N, T)
m_true = M[0]                                              # treated unit's truth

def fit(X, rank, estimator="frequentist"):
    df = pd.DataFrame(
        [(j, t, float(X[j, t]), int(j == 0 and t >= T0))
         for j in range(N) for t in range(T)],
        columns=["unit", "time", "y", "D"])
    res = CLUSTERSC({
        "df": df, "outcome": "y", "treat": "D", "unitid": "unit",
        "time": "time", "method": "pcr", "estimator": estimator,
        "clustering": False, "rank": rank, "rank_method": "fixed",
        "standardize_for_rank": False, "compute_shen_ci": False,
    }).fit()
    return np.asarray(res.pcr.counterfactual)

X = M + rng.normal(0, np.sqrt(1.9), (N, T))                # sigma^2 = 1.9
cf4 = fit(X, rank=4)                  # de-noised (4 singular values)
cf_full = fit(X, rank=N - 1)          # no de-noising (full rank)
cf_b = fit(X, rank=4, estimator="bayesian")

mse = lambda u, v: float(np.mean((u - v) ** 2))
print(f"Table 1  train(pre)={mse(cf4[:T0], m_true[:T0]):.4f}  "
      f"gen(post)={mse(cf4[T0:], m_true[T0:]):.4f}")          # ~equal
print(f"Table 2  de-noised gen={mse(cf4[T0:], m_true[T0:]):.4f}  "
      f"no-de-noising gen={mse(cf_full[T0:], m_true[T0:]):.4f}")
print(f"Bayesian gen={mse(cf_b[T0:], m_true[T0:]):.4f}  (tracks frequentist)")

A single draw at \(\sigma^2 = 1.9\) gives training and generalization MSE both \(\approx 0.02\), a no-de-noising generalization error \(\approx 6\times\) larger, and a Bayesian generalization error indistinguishable from the frequentist – the de-noising and posterior machinery behaving as Amjad-Shah-Shen describe.

PCR-SC – donor-selection gains (Rho et al. 2025, ClusterSC)#

Rho et al. (2025, Section 6.1) build a synthetic panel with two latent subgroups to show that clustering the donor pool before fitting SC tightens the post-intervention error (their Theorems 5.11 / 5.13). The DGP, sized after disaggregated SC applications, uses \(T = 10\), \(T_0 = 8\), \(n \in \{1000, 2000\}\) donors split evenly into groups \(A\) and \(B\). Each group is a sum of \(r_S\) sinusoids

\[v_{i,t} = \alpha_i \sin(2\pi \omega_i t + \phi_i), \qquad S_{i,t} = \sum_{i=1}^{r_S} w_i\, v_{i,t}, \quad w_i \sim \mathrm{Unif}[0, 1],\]

with group-specific frequency/magnitude laws – \(\alpha_i \sim \mathrm{Beta}(2, 2)\), \(\omega_i \sim \mathrm{Unif}(1, 3)\) for \(A\) versus \(\alpha_i \sim \mathrm{Beta}(2, 5)\), \(\omega_i \sim \mathrm{Unif}(3, 6)\) for \(B\), and \(\phi_i \sim \mathcal N(0, 1)\) for both – then additive noise \(E_{i,t} \sim \mathcal N(0, s^2)\) swept over \(s \in \{0.10, 0.15, \dots, 0.40\}\). Over 500 datasets, a leave-one-out placebo test on \(30\%\) of group \(A\) compares ClusterSC against full-pool SC via the per-target improvement \(I_i = \mathrm{MSE}_{\text{SC}} - \mathrm{MSE}_{\text{ClusterSC}}\). The median improvement is positive at every noise level and grows with \(s\): once noise blurs the latent structure, restricting to the treated unit’s cluster pays off. mlsynth’s clustering=True (the default) is exactly this donor-selection step ahead of the PCR weight solve.

mlsynth’s PCR-SC was validated head to head against the authors’ own reference implementation (the syclib library released with the paper) on this exact DGP and on the paper’s empirical house-price-index panel. Running both pipelines – SVD \(k\)-means clustering, HSVT denoising, then an OLS weight solve – on the same targets, the two produce near-identical counterfactuals: the per-target counterfactual trajectories correlate at \(\ge 0.99\) on the synthetic sine panel and at \(\ge 0.999\) on the house-price-index panel, with matching cluster assignment / donor-pool sizes and pre- and post-intervention MSEs in the same range. The two residual differences are documented design choices, not discrepancies: mlsynth denoises the pre-period donor block for the weight fit (the Amjad-Shah-Shen [Amjad2018] convention, project_denoised=False by default) where the reference code denoises the full \(Y_0\) once, and mlsynth’s clustering features come from the pre-period SVD rather than the full panel. Set project_denoised=True to match the reference code’s projection convention.

The synthetic DGP is self-contained – one draw, one treated subgroup-\(A\) unit, donors clustered before the PCR fit:

import numpy as np
import pandas as pd
from mlsynth import CLUSTERSC


def _sine_panel(n, T, noise, n_signals, alpha_beta, omega_lo, omega_hi, rng):
    """One subgroup: n units, each a random mix of n_signals sinusoids + noise."""
    basis = np.zeros((n_signals, T))
    grid = np.arange(int(T * 1.2)) * 10 * np.pi
    for i in range(n_signals):
        alpha = rng.beta(*alpha_beta)
        omega = rng.uniform(omega_lo, omega_hi)
        phi = rng.normal(0, 1)
        wave = alpha * np.sin(2 * np.pi * omega * grid / 360 + phi)
        basis[i] = wave[int(0.2 * T):]                  # drop first 20%
    W = rng.uniform(0, 1, (n, n_signals))
    return W @ basis + rng.normal(0, noise, (n, T))      # (n, T)


rng = np.random.default_rng(0)
T, T0, rS, nA, nB, noise = 10, 8, 3, 60, 60, 0.30
A = _sine_panel(nA, T, noise, rS, (2, 2), 1, 3, rng)     # subgroup A
B = _sine_panel(nB, T, noise, rS, (2, 5), 3, 6, rng)     # subgroup B
panel = np.vstack([A, B])                                # (nA + nB, T)

# Treat one subgroup-A unit; everyone else is a donor. Add a known effect.
tau_true, treated = 5.0, 0
panel[treated, T0:] += tau_true
df = pd.DataFrame([
    {"unit": j, "time": t, "y": float(panel[j, t]),
     "D": int(j == treated and t >= T0)}
    for j in range(nA + nB) for t in range(T)
])

res = CLUSTERSC({
    "df": df, "outcome": "y", "treat": "D", "unitid": "unit", "time": "time",
    "method": "pcr", "pcr_objective": "OLS", "clustering": True,
    "k_clusters": 2, "rank": rS, "rank_method": "fixed",
}).fit()

n_sel = len(res.pcr.donor_weights)
print(f"donors selected by clustering = {n_sel} of {nA + nB - 1}")  # ~ subgroup A
print(f"selected (mostly) subgroup A  = {n_sel <= nA}")             # True
print(f"pre-RMSE                      = {res.pcr.pre_rmse:.3f}")
print(f"ATT (true {tau_true:+.1f})             = {res.pcr.att:+.3f}")  # ~ +5

The clustering step isolates the treated unit’s subgroup (\(\approx n_A - 1\) donors), the pre-period fit is tight, and the OLS weight solve recovers the planted effect – matching the reference ClusterSC+OLS output to the differences noted above.

RPCA-SC – two-process recovery under noise and missingness (Bayani 2021)#

Bayani (2021, Section 4) generates \(N_1 = N_2 = 100\) units from two deterministic processes plus i.i.d. Gaussian noise \(\epsilon_t \sim \mathcal N(0, \sigma^2)\) over \(t \in [0, T]\) with \(T = 250\) and intervention at \(T_0 = 150\):

\[\begin{split}f_1(t) &= 0.3\,(t \bmod (T{+}1)) - (t \bmod 10)\sin(t/\pi) + (t \bmod 10)\cos(t/\pi) + \epsilon_t, \\ f_2(t) &= \log(t) + 4\sin(t/\pi) + 4\cos(t/\pi) + \epsilon_t,\end{split}\]

with the noise variance swept over \(\sigma^2 \in \{1, 4, 9, 16, 25\}\). The study has two halves:

  • Clustering recovery (Steps 1-2). FPCA over the pre-period plus \(k\)-means recovers the two latent processes with \(100\%\) accuracy at every noise level; the first FPC score alone explains over \(95\%\) of the variation and the silhouette statistic selects \(k = 2\).

  • Counterfactual accuracy (Steps 3-5). Treating the noiseless mean of \(f_1\) as the target and its 100 noisy realisations as the donor pool, RPCA extracts the low-rank component and projects the counterfactual. The estimation RMSPE stays low and pre-/post- intervention errors stay close even at high noise:

    Bayani (2021) Table 3 – RMSPE, fully observed#

    \(\sigma^2\)

    Pre-intervention

    Post-intervention

    1

    0.09

    0.13

    4

    0.19

    0.25

    9

    0.29

    0.38

    16

    0.39

    0.51

    25

    0.49

    0.64

    Repeating with \(30\%\) of entries removed at random (Table 4) inflates the errors – the robust PCA step also imputes – but the counterfactual still tracks the true mean (e.g. post-intervention RMSPE \(0.65\) at \(\sigma^2 = 1\), rising to \(2.59\) at \(\sigma^2 = 25\)). Bayani’s design uses the convex (PCP) decomposition; the HQF solver is a later mlsynth addition.

A compact one-draw reproduction of the clustering half:

import numpy as np, pandas as pd
from mlsynth import CLUSTERSC

rng = np.random.default_rng(0)
T, T0, sigma2 = 250, 150, 9.0
t = np.arange(1, T + 1)

def f1(): return (0.3 * (t % (T + 1)) - (t % 10) * np.sin(t / np.pi)
                  + (t % 10) * np.cos(t / np.pi))
def f2(): return np.log(t) + 4 * np.sin(t / np.pi) + 4 * np.cos(t / np.pi)

rows = []
for j in range(200):                       # 100 of each process
    base = f1() if j < 100 else f2()
    y = base + rng.normal(0, np.sqrt(sigma2), size=T)
    treated = (j == 0)                      # one f1 unit is treated
    if treated:
        y[T0:] += 5.0
    rows += [{"unit": j, "time": int(tt), "y": float(yi),
              "D": int(treated and tt > T0)}
             for tt, yi in zip(t, y)]
df = pd.DataFrame(rows)

res = CLUSTERSC({
    "df": df, "outcome": "y", "treat": "D",
    "unitid": "unit", "time": "time",
    "method": "rpca", "rpca_method": "PCP",
}).fit()
print(f"donor pool size = {len(res.rpca.donor_weights)}")  # treated unit's cluster (f1 subgroup)
print(f"pre-RMSE        = {res.rpca.pre_rmse:.2f}")
print(f"ATT (true 5.0)  = {res.rpca.att:+.2f}")

References#

Abadie, A., & Gardeazabal, J. (2003). “The Economic Costs of Conflict: A Case Study of the Basque Country.” American Economic Review 93(1):113-132.

Abadie, A., Diamond, A., & Hainmueller, J. (2010). “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program.” Journal of the American Statistical Association 105(490):493-505.

Abadie, A., Diamond, A., & Hainmueller, J. (2015). “Comparative Politics and the Synthetic Control Method.” American Journal of Political Science 59(2):495-510.

Agarwal, A., Shah, D., Shen, D., & Song, D. (2021). “On Robustness of Principal Component Regression.” Journal of the American Statistical Association 116(536):1731-1745.

Amjad, M., Shah, D., & Shen, D. (2018). “Robust Synthetic Control.” Journal of Machine Learning Research 19(22):1-51.

Athey, S., Bayati, M., Doudchenko, N., Imbens, G., & Khosravi, K. (2021). “Matrix Completion Methods for Causal Panel Data Models.” Journal of the American Statistical Association 116(536): 1716-1730.

Bayani, M. (2021). “Robust PCA Synthetic Control.” arXiv:2108.12542.

Bayani, M. (2022). “Essays on Machine Learning Methods in Economics.” Chapter 1, CUNY Academic Works.

Candes, E. J., Li, X., Ma, Y., & Wright, J. (2011). “Robust Principal Component Analysis?” Journal of the ACM 58(3):11.

Cattaneo, M. D., Feng, Y., & Titiunik, R. (2021). “Prediction Intervals for Synthetic Control Methods.” Journal of the American Statistical Association 116(536):1865-1880. Reference implementation: nppackages/scpi.

Chatterjee, S. (2015). “Matrix Estimation by Universal Singular Value Thresholding.” Annals of Statistics 43(1):177-214.

Hartigan, J. A., & Wong, M. A. (1979). “Algorithm AS 136: A K-means Clustering Algorithm.” Applied Statistics 28(1):100-108.

Li, Y., Wang, N., & Carroll, R. J. (2010). “Generalized Functional Linear Models with Semiparametric Single-index Interactions.” Journal of the American Statistical Association 105(490):621-633.

Rho, S., Tang, A., Bergam, N., Cummings, R., & Misra, V. (2025). “ClusterSC: Advancing Synthetic Control with Donor Selection.” arXiv:2503.21629.

Rousseeuw, P. J. (1987). “Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis.” Journal of Computational and Applied Mathematics 20:53-65.

Shen, D., Ding, P., Sekhon, J., & Yu, B. (2023). “Same Root Different Leaves: Time Series and Cross-Sectional Methods in Panel Data.” Econometrica 91(6):2125-2154. Reference implementation: deshen24/panel-data-regressions.

Wang, Z., Li, X. P., So, H. C., & Liu, Z. (2023). “Robust PCA via Non-convex Half-quadratic Regularization.” Signal Processing 204:108816.