Multivariate Square-root Lasso Synthetic Control (MSQRT)#
When to Use This Estimator#
MSQRT implements the high-dimensional synthetic-control estimator of Shen,
Song and Abadie [MSQRT]. It is built for disaggregated panels: many
fine-grained units (stores, ZIP codes, geos, products) where the number of
candidate donors \(n\) is comparable to or larger than the number of
pre-treatment periods \(T_0\), and where several units are treated at the
same time (a block design). Two ideas distinguish it from running an
ordinary synthetic control unit-by-unit.
First, it pools the treated units into one matrix regression \(Y = X\Theta + E\) and estimates the entire donor-weight matrix \(\Theta\) jointly, borrowing strength across treated units rather than solving \(m\) independent, noisy fits. Second, the loss is the nuclear norm of the residual matrix, a “square-root” (pivotal) objective whose optimal penalty level does not depend on the unknown noise variance – so a single cross-validated \(\lambda\) regularises the whole problem, while the element-wise \(\ell_1\) term performs donor selection in the regime where the classical quadratic program is non-unique and plain Lasso over-selects.
Reach for MSQRT when you have a block of treated units adopting
together, a wide donor pool (\(n \gtrsim T_0\)), and you want a single
interpretable, sparse set of donor weights shared across the treated block,
with an ATT read as the mean post-period gap. It is the natural tool for
retrospective “matched-market” lift studies and other disaggregated multi-unit
roll-outs.
Do not use MSQRT when#
Adoption is staggered – treated units switch on at different times. MSQRT assumes a block design and will refuse staggered panels. Use Synthetic Difference-in-Differences (SDID), Sequential Synthetic Difference-in-Differences (Sequential SDiD), Partially Pooled SCM (PPSCM), or Matrix Completion with Nuclear Norm Minimization (MCNNM) (which handles staggered missingness natively).
There is a single treated unit. The pooling that makes MSQRT efficient has nothing to borrow across. Start at Forward Difference-in-Differences (FDID) or Two-Step Synthetic Control, or the high-dimensional single-unit tools (Forward-Selected Synthetic Control (FSCM), Sparse Synthetic Control (SparseSC)).
The donor pool is small (\(n \ll T_0\)). The classical quadratic program is well-posed; use Two-Step Synthetic Control, Synthetic Control with Multiple Outcomes (SCMO), or Forward Difference-in-Differences (FDID) and keep the simplex interpretation.
The outcome panel has missing cells / gaps. MSQRT needs a balanced block panel; for informative or structured missingness use Matrix Completion with Nuclear Norm Minimization (MCNNM) (MAR) or Synthetic Nearest Neighbors / Causal Matrix Completion (SNN) (MNAR).
Spillovers contaminate the donor block (SUTVA fails) – use Spatial Synthetic Difference-in-Differences (SpSyDiD) or Spillover-Aware Synthetic Control (SPILLSYNTH).
Treatment is endogenous and you have an instrument – use Synthetic IV.
You are designing an experiment (choosing whom to treat) rather than estimating a retrospective effect – use Synthetic Controls for Experimental Design (MAREX), Synthetic Design (SYNDES), or Lexicographic Synthetic Control (LEXSCM).
Notation#
Stack the \(m\) treated units’ outcomes into the time-major matrix \(Y \in \mathbb{R}^{T_0 \times m}\) and the \(n\) never-treated donors into \(X \in \mathbb{R}^{T_0 \times n}\) over the pre-treatment periods \(t = 1, \ldots, T_0\) (all treated units adopt at \(T_0 + 1\)). The donor-weight matrix \(\Theta \in \mathbb{R}^{n \times m}\) has one column per treated unit; column \(j\) is that unit’s synthetic-control weights. The nuclear norm \(\lVert A \rVert_{*} = \sum_i \sigma_i(A)\) sums the singular values of \(A\). The treatment effect on treated cell \((t, j)\) in the post-period is \(\hat\tau_{tj} = Y_{tj} - (X\widehat\Theta)_{tj}\), and the ATT is its average over the treated block.
The estimator#
MSQRT estimates the weight matrix by Multivariate Square-root Lasso (paper Eq. 5):
The first term is a pivotal (square-root-type) loss: the nuclear norm of the residual matrix, scaled by \(1/\sqrt{T_0}\). Its key property is that the penalty level \(\lambda\) that delivers the oracle rate does not depend on the unknown noise scale, so the same \(\lambda\) regularises every column of \(\Theta\) simultaneously. The second term is the element-wise \(\ell_1\) penalty driving sparse donor selection. The synthetic counterfactual for the treated block is \(X\widehat\Theta\), extended into the post-period with the post-period donor matrix \(X_{\text{post}}\).
Algorithm and tuning#
The objective is convex but couples a nuclear norm and an \(\ell_1\)
penalty, so mlsynth solves it with a purpose-built two-split ADMM rather
than a generic conic solver (the paper itself emphasises computational
efficiency). Splitting \(R = X\Theta\) (carrying the nuclear-norm term) and
\(Z = \Theta\) (carrying the \(\ell_1\) term) gives three closed-form
updates per iteration: a singular-value soft-threshold for \(R\), an
element-wise soft-threshold for \(Z\), and an exact least-squares step
for \(\Theta\) whose system matrix \(X^\top X + I\) is Cholesky-factored
once and reused. The penalty parameter is balanced adaptively (Boyd et al.
2011) and the updates are over-relaxed, so a high-dimensional fit takes a few
hundred lightweight iterations – orders of magnitude faster than the conic
reformulation, which it matches to numerical precision. (A cvxpy backend is
retained for validation.)
The penalty \(\lambda\) is chosen by rolling-origin (expanding-window)
cross-validation on the pre-period: an expanding training window fits
\(\Theta\), the next cv_val_window periods are held out, and the
\(\lambda\) minimising mean validation MSE across folds is selected. The
search is pathwise – candidate penalties are visited in descending order and
each solve is warm-started from the previous one, sharply cutting iterations.
The fold schedule adapts to \(T_0\) by default and can be overridden
(cv_initial_train, cv_val_window, cv_step, cv_folds); a fixed
lambda_ skips CV entirely.
Assumptions and remarks#
Assumption 1 (block design). All treated units adopt at the same period, and
the panel is balanced. Remark. This is what lets the treated units share one
\(\Theta\) problem; mlsynth enforces it at ingestion and points
staggered panels to Synthetic Difference-in-Differences (SDID)/Matrix Completion with Nuclear Norm Minimization (MCNNM).
Assumption 2 (approximate factor structure / sparsity). Each treated unit is well approximated by a sparse combination of donors driven by a common low-dimensional factor structure. Remark. The nuclear-norm loss exploits the shared factor structure across treated units; the \(\ell_1\) penalty keeps the per-unit weight vectors sparse and interpretable when \(n \gtrsim T_0\).
Assumption 3 (pivotal tuning). The noise is homoskedastic enough that the square-root loss’s variance-free penalty calibration applies. Remark. This is the payoff of the nuclear-norm formulation – one \(\lambda\) for the whole matrix, no per-unit variance estimation.
Causal use#
MSQRT forms \(\hat\tau_{tj} = Y_{tj} - (X\widehat\Theta)_{tj}\) over the
post-treatment block and aggregates to the overall ATT (mean post-period
gap), reporting it both in levels (att) and as a percentage of the synthetic
counterfactual (att_percent). It also exposes att_t (mean treated gap at
each post period) and unit_att (per-treated-unit post-period mean gap), so
heterogeneity across the treated block is visible.
Inference#
The Shen, Song & Abadie paper establishes finite-sample estimation-error
bounds for \(\widehat\Theta\) rather than a confidence-interval procedure
for the treatment effect. For uncertainty quantification mlsynth instead
uses the non-asymptotic prediction intervals of Cattaneo, Feng, Palomba and
Titiunik [SCPI] (the scpi framework). Those intervals decompose the
prediction error \(\widehat\tau - \tau\) into an in-sample error (from
estimating the SC weights) and an out-of-sample error (the irreducible
post-treatment sampling noise), and bound each separately:
The in-sample bound is derived from the optimality condition of a
quadratic-loss SC program. MSQRT instead minimises a nuclear-norm
square-root-Lasso objective with no constraints on \(\Theta\), so that
derivation does not strictly apply. mlsynth therefore models, for MSQRT,
only the out-of-sample error – the rigorous component – via the
sub-Gaussian concentration bound
with the conditional mean and variance proxy \(\widehat\sigma^2\) estimated
from each treated unit’s pre-treatment residuals. The full miscoverage budget
\(\alpha\) is spent on the out-of-sample band
(\(\alpha_{\text{in}} = 0\)). The resulting intervals reflect
post-treatment sampling uncertainty but not weight-estimation uncertainty;
they are correspondingly narrower, and this is recorded on the result
(in_sample_included = False).
Setting inference=True returns a
SCPIResults with the four
scpi predictands, each with its own band:
taua– Time-averaged unit-averaged: the overall ATT.tsua– Time-specific unit-averaged:{period: band}, the average effect across treated units at each post-period.taus– Time-averaged unit-specific:{unit: band}, each treated unit’s effect averaged over the post-window.tsus– Time-specific unit-specific:{(unit, period): band}, the per-cell effect.
plus simultaneous – TSUS bands widened (Bonferroni over the post-periods)
for joint coverage across all post-periods of a unit, which supports
statements such as “the largest per-period effect is significant”. The
time-averaged predictands accept a time_dependence setting ("iid",
default, shrinks the band by \(\sqrt{L}\); "general" makes no
serial-dependence assumption). The reusable
mlsynth.utils.scpi_helpers module also implements the in-sample
simulation bound for use by quadratic-loss SC estimators.
Example#
A high-dimensional, multiple-treated panel in the Shen-Song-Abadie regime: five
treated units adopt together at period 100, against forty never-treated donors
following an AR(1) process, with each treated unit a sparse convex combination
of the donors plus noise and a constant treatment effect of +2. MSQRT
pools the treated units, selects \(\lambda\) by cross-validation, and
recovers the ATT with a CFPT/scpi prediction band.
from mlsynth import MSQRT
from mlsynth.utils.msqrt_helpers.simulation import simulate_msqrt_panel
df = simulate_msqrt_panel(
n_treated=5, n_control=40, T0=100, n_post=10, att=2.0, seed=0,
)
res = MSQRT({
"df": df, "outcome": "Y", "treat": "treated",
"unitid": "unit", "time": "time",
"n_lambda": 15, # log-spaced CV grid
"inference": True, # CFPT/scpi prediction intervals
"display_graphs": True, # treated mean vs synthetic mean
}).fit()
print(f"ATT (true 2.0) = {res.att:+.3f}")
lo, hi = res.inference.taua.ci # overall ATT (TAUA) band
print(f"90% scpi band = [{lo:+.3f}, {hi:+.3f}]")
# per-period (unit-averaged) effects with their bands
for period, band in res.inference.tsua.items():
print(f" k={period}: {band.point:+.3f} [{band.lower:+.3f}, {band.upper:+.3f}]")
print(f"selected lambda = {res.best_lambda:.3f}")
print(f"avg active donors per treated = "
f"{res.weights.summary_stats['avg_active_donors_per_treated']:.1f}")
This prints an ATT of roughly +2.05 with a 90% band that brackets the true
+2 and a pre-treatment RMSE below the noise floor.
Verification#
Note
Monte-Carlo replication. mlsynth.utils.msqrt_helpers.replication
reproduces the Shen, Song & Abadie [MSQRT] simulation study (their
Section 6) through the public mlsynth.MSQRT.fit() API:
\(T_0 = 100\) pre-periods, \(n = 400\) donors, a treated-unit grid
\(m \in \{50, \ldots, 400\}\), and the paper’s two data-generating
settings. The headline findings reproduce – the ATT estimator is
essentially unbiased (bias centred at zero) and the RMSE for the imputed
\(Y(0)\) stays flat in \(m\) at roughly 0.71-0.73,
matching the paper’s Table 1. The PAPER preset runs the full
500-replication study; the DEMO preset is a faster, reduced-count
version.
Solver. The two-split ADMM matches the cvxpy conic solution of eq. (5) to numerical precision while being orders of magnitude faster – the property that makes the high-dimensional, many-treated regime tractable.
Block-design guard. Feeding a staggered panel raises
MlsynthDataError, redirecting to the
staggered-adoption estimators – MSQRT’s assumptions are enforced, not
assumed.
Core API#
MSQRT: Multivariate Square-root Lasso Synthetic Control (Shen et al. 2025).
Shen, Z., Song, X. & Abadie, A. (2025). “Efficiently Learning Synthetic Control Models for High-dimensional Disaggregated Data.” arXiv:2510.22828.
MSQRT targets the regime where the panel is disaggregated – many fine-grained units (often with the number of donors \(n\) comparable to or exceeding the number of pre-treatment periods \(T_0\)) and several units treated at the same time (a block design). Rather than fitting each treated unit separately, it stacks all treated units into a single matrix regression \(Y = X\Theta + E\) and estimates the donor-weight matrix \(\Theta\) by Multivariate Square-root Lasso,
where \(\lVert\cdot\rVert_{*}\) is the nuclear norm. The nuclear-norm (“square-root”) loss is pivotal: the optimal penalty level does not depend on the unknown noise variance, so a single \(\lambda\) regularises the whole weight matrix while the element-wise \(\ell_1\) term performs donor selection. Borrowing strength across treated units in one joint problem is what makes the high-dimensional, multiple-treated case tractable and efficient.
The treatment effect is the mean post-treatment gap (observed minus synthetic)
over the treated cells – an ATT. mlsynth selects \(\lambda\) by
rolling-origin cross-validation on the pre-period and, optionally, attaches
CFPT/scpi prediction intervals (Cattaneo, Feng, Palomba & Titiunik 2025) for
all four predictands – TSUS, TAUS, TSUA and the overall ATT (TAUA) – plus
simultaneous bands. For MSQRT only the out-of-sample (post-treatment noise)
error is modelled; see mlsynth.utils.scpi_helpers.
- class mlsynth.estimators.msqrt.MSQRT(config: MSQRTConfig | dict)#
Bases:
objectMultivariate Square-root Lasso Synthetic Control estimator.
- Parameters:
config (MSQRTConfig or dict) – Configuration object. See
mlsynth.config_models.MSQRTConfig.
- fit() MSQRTResults#
Run MSQRT and return
MSQRTResults.
Configuration#
- class mlsynth.config_models.MSQRTConfig(*, df: ~pandas.DataFrame, outcome: str, treat: str, unitid: str, time: str, display_graphs: bool = True, save: bool | str = False, counterfactual_color: ~typing.List[str] = <factory>, treated_color: str = 'black', lambda_: ~typing.Annotated[float | None, ~annotated_types.Gt(gt=0.0)] = None, n_lambda: ~typing.Annotated[int, ~annotated_types.Ge(ge=2)] = 15, cv_initial_train: ~typing.Annotated[int | None, ~annotated_types.Ge(ge=2)] = None, cv_val_window: ~typing.Annotated[int | None, ~annotated_types.Ge(ge=1)] = None, cv_step: ~typing.Annotated[int | None, ~annotated_types.Ge(ge=1)] = None, cv_folds: ~typing.Annotated[int | None, ~annotated_types.Ge(ge=1)] = None, inference: bool = False, alpha: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Lt(lt=1.0)] = 0.1, time_dependence: ~typing.Literal['iid', 'general'] = 'iid')#
Configuration for the MSQRT estimator.
Shen, Song & Abadie (2025), “Efficiently Learning Synthetic Control Models for High-dimensional Disaggregated Data” (arXiv:2510.22828). Stacks all treated units into one matrix regression
Y = X Theta + Eand fits the donor-weight matrix by Multivariate Square-root Lasso (nuclear-norm loss + element-wise L1), with the L1 penalty chosen by rolling-origin cross-validation on the pre-period. Assumes a block design (all treated units adopt at the same period) and inherits the standarddf/outcome/treat/unitid/timeinterface.- Parameters:
lambda_ (float, optional) – Fixed L1 penalty. If None (default), selected by cross-validation.
n_lambda (int) – Number of log-spaced candidate penalties in the CV grid.
cv_initial_train, cv_val_window, cv_step, cv_folds (int, optional) – Rolling-origin CV schedule overrides. Adaptive defaults (scaled to the pre-period length) are used when left None.
inference (bool) – Attach CFPT/scpi prediction intervals (Cattaneo, Feng, Palomba & Titiunik 2025) for all four predictands plus simultaneous bands. For MSQRT only the out-of-sample error is modelled. Default False.
alpha (float) – Total miscoverage level for the intervals (default 0.1 -> 90%).
time_dependence ({“iid”, “general”}) – Time-averaging assumption for the time-averaged predictands (TAUS, TAUA).
"iid"(default) shrinks the band bysqrt(L);"general"makes no serial-dependence assumption.
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
Result Containers#
MSQRT.fit() returns a
MSQRTResults: the ATT (levels
and percent), the donor-weight matrix \(\Theta\), per-treated-unit and
aggregate donor weights (WeightsResults), the
synthetic counterfactual and gap matrices, att_t and unit_att, the
cross-unit observed/synthetic means the plotter draws, the CV-selected
best_lambda, per-unit sparsity, the pre-period RMSE, and – when requested
– the SCPIResults prediction
intervals.
Frozen dataclasses for the MSQRT estimator.
Shen, Song & Abadie (2025), Efficiently Learning Synthetic Control Models
for High-dimensional Disaggregated Data (arXiv:2510.22828). The estimator
stacks all treated units into one matrix regression Y = X Theta + E and
fits the donor-weight matrix Theta by Multivariate Square-root Lasso
(nuclear-norm loss + element-wise L1), then forms the ATT as the mean
post-treatment gap (observed minus synthetic) over the treated cells.
- class mlsynth.utils.msqrt_helpers.structures.MSQRTInputs(Y_pre: ndarray, Y_post: ndarray, X_pre: ndarray, X_post: ndarray, treated_names: List[Any], control_names: List[Any], time_labels: ndarray, T0: int)#
Bases:
objectPreprocessed block panel for MSQRT (multiple treated units, common timing).
Matrices are time-major (rows = periods) to match the paper’s
Y = X Theta + Econvention.- Y_pre, Y_post
Treated outcomes, shape
(T0, m)and(T_post, m).- Type:
np.ndarray
- X_pre, X_post
Control (donor) outcomes, shape
(T0, n)and(T_post, n).- Type:
np.ndarray
- treated_names, control_names
- Type:
- time_labels#
All period labels (length
T0 + T_post).- Type:
np.ndarray
- X_post: ndarray#
- X_pre: ndarray#
- Y_post: ndarray#
- Y_pre: ndarray#
- time_labels: ndarray#
- class mlsynth.utils.msqrt_helpers.structures.MSQRTResults(inputs: ~mlsynth.utils.msqrt_helpers.structures.MSQRTInputs, att: float, att_percent: float, theta: ~numpy.ndarray, weights: ~typing.Any, counterfactual: ~numpy.ndarray, gap: ~numpy.ndarray, att_t: ~numpy.ndarray, unit_att: ~typing.Dict[~typing.Any, float], treated_mean: ~numpy.ndarray, synthetic_mean: ~numpy.ndarray, best_lambda: float, sparsity: ~numpy.ndarray, pre_rmse: float, inference: ~typing.Any | None = None, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#
Bases:
objectTop-level container returned by
mlsynth.MSQRT.fit().- inputs#
- Type:
- att#
Average treatment effect on the treated (mean post-period gap over all treated units).
- Type:
- att_percent#
attas a percentage of the mean synthetic counterfactual on the post window.- Type:
- theta#
Estimated donor-weight matrix
Theta, shape(n, m).- Type:
np.ndarray
- weights#
mlsynth.config_models.WeightsResults– per-treated-unit donor weight dicts plus aggregate sparsity stats.- Type:
- counterfactual#
Synthetic (untreated) outcome for the treated units, shape
(T0 + T_post, m).- Type:
np.ndarray
- gap#
Observed minus synthetic for the treated units, same shape.
- Type:
np.ndarray
- att_t#
Mean treated gap at each post-treatment period, shape
(T_post,).- Type:
np.ndarray
- unit_att#
{treated_name: post-period mean gap}.- Type:
Dict
- treated_mean, synthetic_mean
Cross-treated-unit means of observed / synthetic over the full timeline (the series the plotter draws), length
T0 + T_post.- Type:
np.ndarray
- sparsity#
Per-treated-unit count of active donors (
|Theta_ij| > tol).- Type:
np.ndarray
- inference#
CFPT/scpi prediction intervals (Cattaneo, Feng, Palomba & Titiunik 2025); see
mlsynth.utils.scpi_helpers. For MSQRT only the out-of-sample error is modelled.- Type:
SCPIResults, optional
- att_t: ndarray#
- counterfactual: ndarray#
- gap: ndarray#
- inputs: MSQRTInputs#
- sparsity: ndarray#
- synthetic_mean: ndarray#
- theta: ndarray#
- treated_mean: ndarray#
Helper Modules#
Data preparation – the DataFrame touchpoint: pivots the long panel into the
stacked Y = X Theta + E block matrices and enforces the block design.
Panel ingestion for the MSQRT estimator (block, multiple treated units).
- mlsynth.utils.msqrt_helpers.setup.prepare_msqrt_inputs(df: DataFrame, outcome: str, treat: str, unitid: str, time: str) MSQRTInputs#
Pivot a long panel into the stacked
Y = X Theta + Eblock matrices.MSQRT (Shen, Song & Abadie 2025) assumes a block design: every ever-treated unit adopts at the same period. Treated units form the columns of
Y; never-treated units form the columns ofX.
The two-split ADMM solve (singular-value + element-wise soft-thresholding with
a cached exact \(\Theta\) step, adaptive penalty and over-relaxation) and
the warm-started pathwise rolling-origin cross-validation over \(\lambda\).
A cvxpy reference solve is retained for validation.
Multivariate Square-root Lasso solve + rolling-origin lambda selection.
Estimator (Shen, Song & Abadie 2025, eq. 5):
Theta_hat = argmin_Theta (1/sqrt(T0)) * ||Y - X Theta||_* + lambda * ||Theta||_1
where ||.||_* is the nuclear norm (the “square-root”/pivotal loss; tuning
does not depend on the unknown noise variance) and ||.||_1 is the
elementwise L1 penalty sum_{i,j} |Theta_ij| that drives donor selection
in the high-dimensional regime.
The default solver is a two-split (consensus) ADMM in which every subproblem has a closed form:
the nuclear-norm term -> singular-value soft-thresholding of the residual,
the L1 term -> elementwise soft-thresholding,
the coupling -> an exact least-squares step, solved with a Cholesky factorisation of
X'X + Ithat is computed once and reused across every iteration (it does not depend on the penalty parameterrho).
Replacing the linearised proximal-gradient step of a one-split scheme with this
exact step collapses the iteration count from thousands to a few hundred, which
is what makes the high-dimensional, many-treated regime tractable. A cvxpy
path is retained for validation against a general conic solver.
- mlsynth.utils.msqrt_helpers.optimization.fit_msqrt_admm(Y: ndarray, X: ndarray, lambd: float, *, rho: float = 1.0, over_relax: float = 1.5, max_iter: int = 5000, abstol: float = 1e-06, reltol: float = 0.0001, accelerate: bool = False, eta: float = 0.999, adaptive_rho: bool = True, warm_start: dict | None = None, return_state: bool = False)#
Solve the Multivariate Square-root Lasso (eq. 5) by two-split ADMM.
Minimises
\[\frac{1}{\sqrt{T}}\,\lVert Y - X\Theta\rVert_* + \lambda \lVert \Theta \rVert_1\]over
Theta in R^{n x m}. The problem is cast in the standard ADMM formmin f(x) + g(z) s.t. A x + B z = cwithx = Theta, two consensus blocksz = (R, Z)forR = X Theta(nuclear-norm term) andZ = Theta(L1 term),A = [X; I],B = -Iandc = 0. The scaled-dual updates (Boyd et al. 2011, Sec. 3.1.1) areTheta– exact least squares:(X'X + I) Theta = X'(Rhat - Uhat) + (Zhat - What);R–R = Y - svt(Y - X Theta - Uhat, (1/sqrt(T)) / rho);Z–Z = soft_threshold(Theta + What, lambda / rho);duals –
U = Uhat + (X Theta - R),W = What + (Theta - Z).
where
(Rhat, Zhat, Uhat, What)are the (possibly extrapolated) inputs to each iteration – equal to the previous iterate for vanilla ADMM, or a Nesterov-momentum extrapolation whenaccelerate=True.The system matrix
M = X'X + Idoes not depend onrho, so it is Cholesky-factorised once and reused for every iteration.- Parameters:
Y (numpy.ndarray, shape (T, m)) – Treated-unit outcomes, time-major (rows are periods).
X (numpy.ndarray, shape (T, n)) – Donor (control) outcomes, time-major.
lambd (float) – Non-negative L1 penalty
lambdain eq. (5).rho (float, optional) – ADMM penalty parameter (default
1.0). Affects only the convergence path, not the solution. Held fixed whenaccelerate=True.over_relax (float, optional) – Over-relaxation parameter
alphain(0, 2)(default1.5; Boyd et al. 2011, Sec. 3.4.3). In theR/Zand dual updates the coupling termA Thetais replaced byalpha (A Theta) + (1 - alpha) (previous z). Values in[1.5, 1.8]typically cut the iteration count appreciably at no per-iteration cost;alpha = 1recovers the un-relaxed scheme.max_iter (int, optional) – Maximum number of iterations (default
5000); a safety cap rarely reached because the exactThetastep (and acceleration) converge quickly.abstol (float, optional) – Absolute tolerance for the primal/dual stopping rule (default
1e-6).reltol (float, optional) – Relative tolerance for the primal/dual stopping rule (default
1e-4).accelerate (bool, optional) – If True, apply the Nesterov-momentum acceleration with adaptive restart of Goldstein, O’Donoghue, Setzer & Baraniuk (2014), “Fast Alternating Direction Optimization Methods”, SIAM J. Imaging Sci. 7(3) (UCLA CAM report 12-35), Algorithm 8 (the general-convex restart variant).
rhois held fixed in this mode. Default False: acceleration targets the convergence order, but for this problem the limiting factor is the mismatch in scale between the two constraints (R = X Thetais on the scale of the data;Z = Thetaon the scale of the weights). Boyd’s adaptive-rhoresidual balancing (Sec. 3.4.1) – the practical proxy for the per-constraint penalties of Sec. 3.4.2 – handles that directly and, empirically, converges to the cvxpy optimum (objective agreeing to five significant figures) far faster than fixed-rhoacceleration.eta (float, optional) – Restart threshold in
(0, 1)(default0.999): when accelerating, momentum is kept while the combined residual decreases by at least a factoreta, otherwise the iterate restarts (momentum reset). Only used whenaccelerate=True.adaptive_rho (bool, optional) – If True (default), and only when
accelerate=False, apply residual-balancingrhoupdates (Boyd et al. 2011, Sec. 3.4.1): double/halverhowhen the primal and dual residuals differ by more than 10x (mu = 10,tau = 2), rescaling the scaled duals accordingly. This is the default solving mode. Ignored whenaccelerate=True.warm_start (dict or None, optional) – Initial solver state to start from, e.g. the
statereturned by a previous call (keysR,Z,U,W,rho). Defaults to a cold start (zeros). Warm starting from a nearby solution – as in the pathwise cross-validation ofselect_lambda_cv()– sharply reduces the iteration count.return_state (bool, optional) – If True, also return the final solver state
dict(for warm-starting a subsequent solve). Default False.
- Returns:
numpy.ndarray, shape (n, m) – The estimated weight matrix
Theta(the L1Z-block, hence exactly sparse).X @ Thetais the fitted treated-outcome matrix.state (dict, optional) – Returned only when
return_state=True: the finalR,Z,U,Wandrho, suitable as awarm_startfor a related problem.
Notes
Termination uses the primal/dual residual criterion of Boyd et al. (2011, Sec. 3.3): stop when
||r||_2 <= eps_priand||s||_2 <= eps_dualwith primal residualr = (X Theta - R, Theta - Z), dual residuals = rho * A'(z^k - z^{k-1}), and toleranceseps_pri = sqrt((T+n) m) abstol + reltol max(||A Theta||, ||B z||)andeps_dual = sqrt(n m) abstol + reltol ||A' (rho u)||. The accelerated restart uses the combined residualc_k = rho (||u^k - uhat^k||^2 + ||z^k - zhat^k||^2).
- mlsynth.utils.msqrt_helpers.optimization.fit_msqrt_weights(Y: ndarray, X: ndarray, lambd: float, *, tol: float = 0.01, solver: str = 'admm', **solver_kwargs) Tuple[ndarray, ndarray, ndarray]#
Solve eq. (5) for the donor-weight matrix and report fit + sparsity.
- Parameters:
Y (numpy.ndarray, shape (T, m)) – Treated-unit outcomes, time-major.
X (numpy.ndarray, shape (T, n)) – Donor outcomes, time-major.
lambd (float) – Non-negative L1 penalty
lambda.tol (float, optional) – Magnitude below which a weight is treated as zero when counting active donors (default
1e-2). Does not affect the solve.solver ({“admm”, “cvxpy”}, optional) – Which backend to use.
"admm"(default) is the fast two-split solverfit_msqrt_admm();"cvxpy"is the slow conic reference_fit_msqrt_cvxpy().**solver_kwargs – Extra keyword arguments forwarded to
fit_msqrt_admm()(ignored for the cvxpy backend), e.g.rho,max_iter,abstol,reltol.
- Returns:
Theta_hat (numpy.ndarray, shape (n, m)) – Estimated weight matrix.
Y_hat (numpy.ndarray, shape (T, m)) – Fitted treated outcomes
X @ Theta_hat.nonzero_per_col (numpy.ndarray, shape (m,)) – Number of active donors (
|Theta_ij| > tol) for each treated unit.
- Raises:
ValueError – If
solveris not"admm"or"cvxpy".
- mlsynth.utils.msqrt_helpers.optimization.select_lambda_cv(Y_pre: ndarray, X_pre: ndarray, lambdas: Sequence[float], *, initial_train: int | None = None, val_window: int | None = None, step: int | None = None, n_folds: int | None = None, solver: str = 'admm') float#
Select the L1 penalty by rolling-origin cross-validation on the pre-period.
Uses an expanding training window: for each fold the model is fit on
Y_pre[:train_end]and scored by mean squared error on the nextval_windowperiods. The penalty minimising the mean validation MSE over all folds is returned.- Parameters:
Y_pre (numpy.ndarray, shape (T0, m)) – Pre-treatment treated-unit outcomes, time-major.
X_pre (numpy.ndarray, shape (T0, n)) – Pre-treatment donor outcomes, time-major.
lambdas (sequence of float) – Candidate penalty values to search.
initial_train, val_window, step (int or None, optional) – Rolling-origin schedule overrides; see
_cv_schedule()for the adaptive defaults used when these are None.n_folds (int or None, optional) – If given, use at most this many folds (the earliest ones).
solver ({“admm”, “cvxpy”}, optional) – Backend forwarded to the solver (default
"admm").
- Returns:
float – The cross-validated penalty. Falls back to
min(lambdas)when the pre-period is too short to form any fold.
Notes
With the ADMM backend the search is pathwise: within each fold the candidate penalties are visited in descending order and each solve is warm-started from the previous (larger-
lambda, sparser) solution. Since neighbouring penalties have nearby solutions, this cuts the per-solve iteration count substantially relative to cold starts, without changing the selected penalty.
Run loop: CV / fit, ATT and per-period/per-unit effects, donor-weight assembly, and the optional scpi prediction intervals.
Orchestration for the MSQRT estimator (Shen, Song & Abadie 2025).
Coordinates the steps end to end: choose the L1 penalty (or take the supplied
one), fit the donor-weight matrix by Multivariate Square-root Lasso, build the
synthetic counterfactual, form the ATT and per-period/per-unit effects,
optionally attach a block-conformal band, and assemble MSQRTResults.
- mlsynth.utils.msqrt_helpers.pipeline.run_msqrt(inputs: MSQRTInputs, *, lambda_: float | None = None, n_lambda: int = 15, lambdas: Sequence[float] | None = None, cv_initial_train: int | None = None, cv_val_window: int | None = None, cv_step: int | None = None, cv_folds: int | None = None, inference: bool = False, alpha: float = 0.1, time_dependence: str = 'iid') MSQRTResults#
Fit MSQRT and assemble
MSQRTResults.- Parameters:
inputs (MSQRTInputs)
lambda_ (float, optional) – Fixed L1 penalty. If None, chosen by rolling-origin CV.
n_lambda (int) – Size of the default log-spaced CV grid (used when
lambdasis None).lambdas (sequence, optional) – Explicit CV grid (overrides
n_lambda).cv_* (int, optional) – Rolling-origin CV schedule overrides; adaptive defaults otherwise.
inference (bool) – Attach CFPT/scpi prediction intervals (Cattaneo, Feng, Palomba & Titiunik 2025) on all four predictands + simultaneous bands. For MSQRT only the out-of-sample (post-treatment noise) error is modelled – see
mlsynth.utils.scpi_helpers. Default False.alpha (float) – Total miscoverage level for the intervals.
time_dependence ({“iid”, “general”}) – Time-averaging assumption for the time-averaged predictands.
The data-generating process: AR(1) donors and a sparse convex donor-weight matrix (the Shen-Song-Abadie regime), wrapped into a long panel with an injected treatment effect for examples and tests.
Data-generating processes for MSQRT examples, tests and replication.
Implements the regime of Shen, Song & Abadie (2025): many disaggregated donor
units following an AR(1) process, with several treated units adopting at the
same period (a block design), each treated unit a sparse, convex combination
of the donors. These are the building blocks of the paper’s Monte-Carlo study
(see mlsynth.utils.msqrt_helpers.replication); simulate_msqrt_panel()
wraps them into a tidy long panel – with an injected treatment effect – ready
for mlsynth.MSQRT.
- mlsynth.utils.msqrt_helpers.simulation.ar1_panel(n: int, T: int, rng: Generator, *, burn: int = 50) ndarray#
Simulate the paper’s AR(1) outcome panel, time-major.
Each unit follows
Y_{i,t} = 0.1 * c_i + 0.9 * Y_{i,t-1} + Z_{i,t}withc_i in {1, ..., 10}(cycled across units) andZ_{i,t} ~ N(0, 1), so the stationary mean of unitiisc_iand the series is persistent.- Parameters:
n (int) – Number of units.
T (int) – Number of retained periods (after burn-in).
rng (numpy.random.Generator) – Random generator (drives the innovations).
burn (int, optional) – Burn-in periods discarded so the panel starts near stationarity (default
50).
- Returns:
numpy.ndarray, shape (T, n) – Time-major outcome matrix (rows are periods, columns are units).
- mlsynth.utils.msqrt_helpers.simulation.random_theta(n: int, m: int, s: int, rng: Generator) ndarray#
Random sparse weight matrix with
stotal non-zeros, columns summing to 1.The
snon-zeros are spread across themcolumns as evenly as possible (at least one per column); within a column the non-zero weights are drawn uniformly and normalised to sum to one (a convex combination of donors).- Parameters:
n (int) – Number of donors (rows of
Theta).m (int) – Number of treated units (columns of
Theta).s (int) – Total number of non-zero entries across the whole matrix.
rng (numpy.random.Generator) – Random generator.
- Returns:
numpy.ndarray, shape (n, m) – Sparse, column-stochastic weight matrix.
- mlsynth.utils.msqrt_helpers.simulation.simulate_msqrt_panel(n_treated: int = 5, n_control: int = 40, T0: int = 100, n_post: int = 10, nonzeros_per_unit: int = 5, att: float = 2.0, noise: float = 0.5, seed: int = 0) DataFrame#
Simulate a block, multiple-treated panel in the Shen-Song-Abadie regime.
Donors follow the AR(1) process of
ar1_panel(); each treated unit is a sparse convex combination of the donors (random_theta()) plus Gaussian noise – i.e.Y_treated = X @ Theta + E– so the synthetic-control model is well specified. A constant treatment effectattis then added to the treated units’ post-treatment cells; all treated units adopt at periodT0(a block design).- Parameters:
n_treated, n_control (int) – Counts of treated and never-treated (donor) units.
T0, n_post (int) – Pre- and post-treatment period counts (common adoption at
T0).nonzeros_per_unit (int) – Approximate number of active donors per treated unit (controls the sparsity of
Theta); capped atn_control.att (float) – Constant additive treatment effect applied to treated post-period cells.
noise (float) – Standard deviation of the idiosyncratic error
E.seed (int) – RNG seed.
- Returns:
pandas.DataFrame – Long panel with columns
unit(c*donors,t*treated),time,Y,treated.
Replication of the paper’s Monte-Carlo study (Section 6) through the public
MSQRT.fit API – the two data-generating settings, the ATT-bias and RMSE
metrics, and the PAPER / DEMO presets.
Replication of the Shen, Song & Abadie (2025) Monte-Carlo study (Section 6).
Shen, Z., Song, X. & Abadie, A. (2025), Efficiently Learning Synthetic Control Models for High-dimensional Disaggregated Data (arXiv:2510.22828).
This reproduces the simulation design for the MSC estimator (the paper’s
name for the Multivariate Square-root Lasso SC that mlsynth ships as MSQRT):
Design#
T0pre-treatment periods,T1post-treatment periods,ncontrols,mtreated units (swept over a grid),stotal non-zeros inTheta.Controls
Xfollow an AR(1):Y_{i,t} = 0.1*c_i + 0.9*Y_{i,t-1} + Z,c_i in {1..10},Z ~ N(0, 1).Setting (1) – treated units generated by the same AR(1) process (no true sparse SC representation; misspecified).
Setting (2) – treated units generated by
Y = X Theta + Ewith a randoms-sparseThetawhose columns sum to 1 andE ~ N(0, sigma^2)(well-specified sparse SC).
No treatment effect is injected, so the true ATT delta = 0 and the reported
ATT bias delta_hat - delta is the mean counterfactual-prediction error
at the first post-period (paper eq. 1). The RMSE is the root-mean-square
error of the predicted Y(0) over all T1 post-periods (paper Table 1).
The paper’s headline parameters (PAPER) are n=400, T0=100,
T1=10, s=1000, m in {50,100,150,200,250,300,350,400}, 500 reps.
Those nuclear-norm programs are large; DEMO is a proportionally scaled
preset (same s/n, m/n and s/m ratios) that runs quickly and
directionally reproduces the findings.
- class mlsynth.utils.msqrt_helpers.replication.SimConfig(n: 'int', T0: 'int', T1: 'int', s: 'int', sigma: 'float', m_grid: 'List[int]', n_reps: 'int', lambda_grid: 'List[float]' = <factory>)#
- mlsynth.utils.msqrt_helpers.replication.run_msqrt_simulation(cfg: SimConfig = SimConfig(n=150, T0=100, T1=10, s=375, sigma=0.5, m_grid=[20, 40, 60, 80, 100, 120, 150], n_reps=10, lambda_grid=[np.float64(0.01), np.float64(0.039810717055349734), np.float64(0.15848931924611134), np.float64(0.630957344480193), np.float64(2.5118864315095797), np.float64(10.0)]), *, settings=(1, 2), seed: int = 0, lambda_override: float | None = None, verbose: bool = True) Dict#
Run the MSC Monte-Carlo and return summary tables per setting.
For each setting the L1 penalty is pre-selected once (as in the paper) by rolling-origin CV on a representative draw, then reused across all
mand replications. Passlambda_overrideto fix the penalty and skip CV (much faster – CV is many extra solves).- Returns:
dict –
{setting: list of {m, rmse_mean, rmse_std, bias_mean, bias_std}}.
- mlsynth.utils.msqrt_helpers.replication.simulate_shen2025(setting: int, m: int, *, n: int, T0: int, T1: int, s: int, sigma: float, rng: Generator)#
One Monte-Carlo draw. Returns
(X, Y_treated)time-major over T0+T1.
- mlsynth.utils.msqrt_helpers.replication.to_long_df(X: ndarray, Y: ndarray, T0: int) DataFrame#
Stack control matrix
X (T,n)and treated matrixY (T,m)into the long panel MSQRT expects: columnsunit,time,Y,treated.Treated units (one block, common adoption) switch on at period
T0; no treatment effect is injected, so the observed outcome isY(0).
Uncertainty quantification is provided by the shared CFPT/scpi module (Cattaneo, Feng, Palomba & Titiunik 2025), which any synthetic-control estimator can reuse.
CFPT/scpi prediction-interval construction.
Out-of-sample bands (Section 4.2 of Cattaneo, Feng, Palomba & Titiunik 2025) for the four causal predictands and their simultaneous (uniform-over-periods) versions, plus the in-sample simulation bound (Section 4.1) as a reusable helper for quadratic-loss SC estimators.
All bands are returned as SCPIBand objects and assembled into a
SCPIResults. The interval for a predictand tau is
[ tau_hat - Mbar_in - Mbar_out , tau_hat - M_in - M_out ].
When the in-sample error is not modelled (in_sample_included=False, e.g.
MSQRT), the in-sample contribution is (0, 0) and the full miscoverage
budget alpha is spent on the out-of-sample band.
- mlsynth.utils.scpi_helpers.intervals.in_sample_band_gaussian(donor_post: ndarray, Q_hat: ndarray, Sigma_hat: ndarray, *, alpha_in: float = 0.05, n_sim: int = 1000, random_state: int = 0) Tuple[float, float]#
In-sample error band for a TSUS counterfactual, unconstrained weights.
Implements Cattaneo et al.’s simulation (their eq. 4.4) for the special case of an unconstrained donor regression, where the feasible set
Delta*is all ofR^J0and the inf/sup of the donor projection over the quadratic constraintdelta' Q_hat delta - 2 G*' delta <= 0has the closed forma’ Q_hat^{-1} G* ± sqrt( (G*’ Q_hat^{-1} G*) (a’ Q_hat^{-1} a) ),
with
a = donor_post. DrawsG* ~ N(0, Sigma_hat)and returns the(alpha_in/2, 1 - alpha_in/2)quantiles(M_in, Mbar_in).- Parameters:
donor_post (np.ndarray) –
(J0,)donor outcomes at the post-treatment period.Q_hat (np.ndarray) –
(J0, J0)donor Gram matrixY_N' Y_Nover the pre-period.Sigma_hat (np.ndarray) –
(J0, J0)estimate ofV[gamma_hat | H].
- mlsynth.utils.scpi_helpers.intervals.out_of_sample_intervals(effects: ndarray, pre_residuals: ndarray, unit_names: Sequence, period_labels: Sequence, *, alpha: float = 0.1, time_dependence: str = 'iid', assume_zero_mean: bool = False) SCPIResults#
Build the full CFPT out-of-sample interval family for a block design.
- Parameters:
effects (np.ndarray) –
(L, m)post-period predicted effectstau_hat_ik(rows = post-periods, columns = treated units).pre_residuals (np.ndarray) –
(T0, m)pre-period SC residuals (one column per treated unit).unit_names (sequence) – Length-
mtreated-unit identifiers.period_labels (sequence) – Length-
Lpost-period labels.alpha (float) – Total (and, here, out-of-sample) miscoverage level.
time_dependence ({“iid”, “general”}) – Time-averaging bound for the time-averaged predictands (TAUS, TAUA).
"iid"shrinks the band bysqrt(L)(residuals independent over time);"general"makes no dependence assumption (no shrinkage).assume_zero_mean (bool) – Pass-through to the conditional-moment estimator.
Conditional-moment estimation for the out-of-sample CFPT band.
The out-of-sample error of an SC prediction is the post-treatment residual
u_i(Ti+k). Cattaneo et al. bound it with a sub-Gaussian concentration
inequality that needs two conditional moments per treated unit: the mean
E[u_it | H] and a variance proxy sigma_it^2 (the optimal sub-Gaussian
parameter, which equals V[u_it | H] under strict sub-Gaussianity). Both are
estimated from the pre-treatment residuals of the synthetic-control fit.
For the block, short-post settings MSQRT targets, a per-unit constant proxy
(the residual mean and variance over the pre-period) is the natural default;
assume_zero_mean imposes the common E[u|H] = 0 simplification.
- mlsynth.utils.scpi_helpers.moments.conditional_moments(pre_residuals: ndarray, *, assume_zero_mean: bool = False) Tuple[float, float]#
Per-unit
(mean, variance-proxy)from a vector of pre-period residuals.- Parameters:
pre_residuals (np.ndarray) – Synthetic-control residuals
u_hat_itover the pre-treatment periods for one treated unit.assume_zero_mean (bool) – If True, fix
E[u|H] = 0and estimate the variance proxy as the mean squared residual; otherwise demean first.
- mlsynth.utils.scpi_helpers.moments.unit_moments(pre_residuals: ndarray, unit_names, *, assume_zero_mean: bool = False) Tuple[Dict, Dict]#
Stack
conditional_moments()over the columns of a residual matrix.- Parameters:
pre_residuals (np.ndarray) –
(T0, m)pre-period residuals (one column per treated unit).unit_names (sequence) – Length-
mtreated-unit identifiers (the dict keys).
- Returns:
(mu, sigma) ((dict, dict)) –
{unit: mean}and{unit: sigma}(standard deviation, not variance).
Result containers for the CFPT/scpi prediction intervals.
Cattaneo, Feng, Palomba & Titiunik (2025), “Uncertainty Quantification in Synthetic Controls with Staggered Treatment Adoption” (arXiv:2210.05026).
The method quantifies the uncertainty of a synthetic-control prediction
tau_hat by separately bounding two error sources and combining them by a
union bound,
I(tau) = [ tau_hat - Mbar_in - Mbar_out , tau_hat - M_in - M_out ],
where [M_in, Mbar_in] covers the in-sample error (SC-weight estimation)
and [M_out, Mbar_out] covers the out-of-sample error (post-treatment
sampling noise). Four causal predictands are supported – TSUS (unit x period),
TAUS (unit, time-averaged), TSUA (period, unit-averaged) and TAUA (overall) –
plus simultaneous (uniform-over-periods) bands.
- class mlsynth.utils.scpi_helpers.structures.SCPIBand(predictand: str, label: Any, point: float, lower: float, upper: float, out_sample: Tuple[float, float], in_sample: Tuple[float, float] = (0.0, 0.0))#
Bases:
objectOne prediction interval for a single predictand value.
- label#
Identifier of the value: a unit (TAUS), a post-period (TSUA), a
(unit, period)pair (TSUS), orNone(TAUA).- Type:
Any
- lower, upper
Prediction-interval endpoints.
- Type:
- in_sample#
The in-sample contribution
(M_in, Mbar_in)((0, 0)when the in-sample error is not modelled, e.g. for MSQRT).
- class mlsynth.utils.scpi_helpers.structures.SCPIResults(method: str, alpha: float, alpha_in: float, alpha_out: float, in_sample_included: bool, taua: ~mlsynth.utils.scpi_helpers.structures.SCPIBand, tsua: ~typing.Dict[~typing.Any, ~mlsynth.utils.scpi_helpers.structures.SCPIBand], taus: ~typing.Dict[~typing.Any, ~mlsynth.utils.scpi_helpers.structures.SCPIBand], tsus: ~typing.Dict[~typing.Tuple[~typing.Any, ~typing.Any], ~mlsynth.utils.scpi_helpers.structures.SCPIBand], simultaneous: ~typing.Dict[~typing.Any, ~typing.List[~mlsynth.utils.scpi_helpers.structures.SCPIBand]], sigma: ~typing.Dict[~typing.Any, float], time_dependence: str = 'iid', metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#
Bases:
objectFull CFPT/scpi uncertainty quantification for a set of predictands.
- alpha, alpha_in, alpha_out
Total target miscoverage and its in/out split. When the in-sample error is omitted,
alpha_out == alphaandalpha_in == 0.- Type:
- simultaneous#
{unit: [SCPIBand, ...]}– TSUS bands widened for joint coverage across all post-periods of that unit.- Type: