Robust Matrix estimation with Side Information (RMSI)#
When to Use This Estimator#
RMSI implements the side-information matrix estimator of Agarwal, Choi and
Yuan [RMSI]. It is a causal-panel matrix-completion method – like
Matrix Completion with Nuclear Norm Minimization (MCNNM), it imputes the treated units’ missing counterfactual cells – but
it exploits covariates on both margins of the panel: unit-level (row)
characteristics \(X\) and time-level (column) characteristics \(Z\).
Its robustness comes from decomposing the target matrix into four complementary pieces and estimating each separately:
Unlike inductive matrix completion (which forces an exact low-rank linear
covariate-interaction term and no genuine noise component), this decomposition
accommodates nonlinear covariate effects, a part explained by only one
margin, and a part explained by neither – and it degrades gracefully when
the covariates are uninformative (it then reduces to a de-meaned low-rank
completion). Reach for RMSI when you have a block-adoption causal panel,
informative unit and/or time covariates, and you expect those covariates to
carry signal the raw outcome matrix alone would estimate noisily.
Do not use RMSI when#
Adoption is staggered (treated units switch on at different times). RMSI assumes a block design; use Matrix Completion with Nuclear Norm Minimization (MCNNM), Staggered Synthetic Control (SSC), Synthetic Difference-in-Differences (SDID), or Partially Pooled SCM (PPSCM).
You have no covariates and no reason to expect one margin’s structure to help. With no side information RMSI reduces to a low-rank completion, so Matrix Completion with Nuclear Norm Minimization (MCNNM) (purpose-built, with inference) is the more direct choice.
An interpretable donor-weight story is the deliverable – RMSI imputes a matrix, not a sparse convex combination; use Two-Step Synthetic Control/Synthetic Control with Multiple Outcomes (SCMO).
Spillovers contaminate the controls (SUTVA fails) – use Spatial Synthetic Difference-in-Differences (SpSyDiD) or Spillover-Aware Synthetic Control (SPILLSYNTH).
The estimator#
Algorithm 1 (fully observed). With polynomial sieve bases of the covariates and projectors \(P_X, P_Z\) onto them, the component explained by both margins is \(\widehat M_1 = P_X Y P_Z\), and the other three are singular-value soft-thresholds of the projected residuals (the penalised least squares \(\arg\min_A \lVert B - A\rVert_F^2 + \nu\lVert A\rVert_*\) has the closed form \(\operatorname{svt}(B, \nu/2)\)):
with \(\nu_2 = C_2\sqrt{T}\), \(\nu_3 = C_3\sqrt{N}\), \(\nu_4 = C_4(\sqrt{N}+\sqrt{T})\). The estimate is \(\widehat M = \widehat M_1 + \widehat M_2 + \widehat M_3 + \widehat M_4\) – only projections and SVDs, no iterative solver.
Algorithm 3 (block-missing causal). The treated post-treatment cells form a
missing block. RMSI applies Algorithm 1 to the fully observed tall submatrix
(all units, pre-treatment periods) and wide submatrix (control units, all
periods), then recombines their singular subspaces,
\(\widehat M = \widehat U_{\text{tall}}\, \widehat H\, \widehat D_{\text{wide}} \widehat V_{\text{wide}}^\top\),
where \(\widehat H\) rotates the wide left-singular vectors onto the tall
ones over the control rows. The ATT is the observed minus the imputed outcome
over the treated cells. RMSI targets the block (common adoption time)
setting.
Side information#
Pass the covariate columns through the config: unit_covariates are columns
(approximately) constant within a unit – averaged per unit to form the row
feature matrix \(X\) – and time_covariates are columns constant within
a period – averaged per period to form \(Z\). Either may be empty.
Example#
A block panel of forty units (eight treated at period 20) whose untreated
outcomes are driven by nonlinear functions of two unit covariates and two time
covariates plus a residual low-rank part, with a constant effect of +5:
from mlsynth import RMSI
from mlsynth.utils.rmsi_helpers.simulation import simulate_rmsi_panel
df = simulate_rmsi_panel(
n_units=40, n_treated=8, T0=20, n_post=11, att=5.0, seed=0,
)
res = RMSI({
"df": df, "outcome": "Y", "treat": "treated",
"unitid": "unit", "time": "time",
"unit_covariates": ["x0", "x1"], # row side information
"time_covariates": ["z0", "z1"], # column side information
"sieve_order": 2,
"display_graphs": True, # observed vs. imputed counterfactual
}).fit()
print(f"ATT (true 5.0) = {res.att:+.3f} [rank {res.rank}]")
Replication#
Both of the paper’s empirical pieces are reproduced through the public API.
Path A – Proposition 99 (Section 5.2). Using the Abadie et al. (2010)
tobacco panel shipped at
basedata/P99data.csv,
RMSI treats California from 1989 and uses the state-level Abadie predictors
as unit covariates and the year-average retail price as the time covariate:
from mlsynth.utils.rmsi_helpers.replication import replicate_prop99
# downloads basedata/P99data.csv by default; pass a path/DataFrame to override
res = replicate_prop99(rank=3)
# -> California Proposition 99 ATT ~ -21 packs/capita (1989 ~ -7, 2000 ~ -32)
The estimated ATT of about \(-21\) packs per capita is consistent with
Abadie, Diamond & Hainmueller [ABADIE2010] and with the Matrix Completion with Nuclear Norm Minimization (MCNNM) / Synthetic Nearest Neighbors / Causal Matrix Completion (SNN)
estimates elsewhere in mlsynth. The equivalent explicit call:
import pandas as pd
from mlsynth import RMSI
url = ("https://raw.githubusercontent.com/jgreathouse9/mlsynth/"
"main/basedata/P99data.csv")
d = pd.read_csv(url)
for c in ["lnincome", "beer", "age15to24", "retprice"]:
d[c] = d.groupby("state")[c].transform(lambda s: s.fillna(s.mean()))
d[c] = d[c].fillna(d[c].mean())
d["treated"] = ((d["state"] == "California") & (d["year"] >= 1989)).astype(int)
res = RMSI({"df": d, "outcome": "cigsale", "treat": "treated",
"unitid": "state", "time": "year",
"unit_covariates": ["lnincome", "beer", "age15to24", "retprice"],
"time_covariates": ["retprice"], "rank": 3,
"display_graphs": False}).fit()
print(res.att)
Path B – synthetic Monte Carlo (Section 5.1). The paper’s four-component DGP under the block-missing (MNAR) pattern; RMSI imputes the missing block at a lower average mean-squared error than the no-side-information baseline:
from mlsynth.utils.rmsi_helpers.replication import (
run_rmsi_simulation, RMSISimConfig, PAPER,
)
# quick, reduced-count preset (use PAPER for the full N=T=400, 100-rep study)
out = run_rmsi_simulation(RMSISimConfig(N=120, T=120, N0=60, T0=60,
J=5, n_reps=10))
# -> AMSE side-info < AMSE no-side-info (side information lowers the error)
Verification#
Note
Path B (synthetic). On the paper’s four-component MNAR DGP
(simulate_rmsi_dgp()), RMSI’s
block imputation achieves a lower missing-block AMSE than the
no-side-information baseline – reproducing the paper’s central finding that
leveraging side information improves imputation accuracy.
Path A (Proposition 99). replicate_prop99 recovers a California
Proposition 99 ATT of about \(-21\) packs per capita (widening toward
\(-32\) by 2000), matching the Abadie-Diamond-Hainmueller [ABADIE2010]
baseline and the other mlsynth estimators on the same panel.
Core API#
RMSI: Robust Matrix estimation with Side Information (Agarwal, Choi & Yuan 2026).
Agarwal, A., Choi, J. & Yuan, M. (2026). “Robust Matrix Estimation with Side Information.” arXiv:2603.24833.
RMSI estimates a panel’s untreated-outcome matrix while exploiting covariates on both margins – unit-level (row) characteristics \(X\) and time-level (column) characteristics \(Z\). It decomposes the target matrix into four complementary pieces,
and estimates each by sieve projection plus nuclear-norm penalization: with projectors \(P_X, P_Z\) onto polynomial sieve bases of the covariates, the component explained by both margins is \(\widehat M_1 = P_X Y P_Z\), and the remaining three are singular-value soft-thresholds of the corresponding projected residuals (the penalised least squares has that closed form). Because the pieces are recovered separately and summed, the method is robust: it accommodates nonlinear covariate effects, parts explained by only one margin, and a part explained by neither – and degrades gracefully when the covariates are uninformative (it then reduces to a de-meaned low-rank completion).
For causal panel data the treated cells form a missing block. RMSI applies the estimator to the fully observed “tall” submatrix (all units, pre-treatment periods) and “wide” submatrix (control units, all periods) and recombines their singular subspaces to impute the missing treated counterfactual; the ATT is the observed minus the imputed outcome over the treated cells. This estimator targets the block (common adoption time) causal setting.
- class mlsynth.estimators.rmsi.RMSI(config: RMSIConfig | dict)#
Bases:
objectRobust Matrix estimation with Side Information.
- Parameters:
config (RMSIConfig or dict) – Configuration object. See
mlsynth.config_models.RMSIConfig.
- fit() RMSIResults#
Run RMSI and return
RMSIResults.
Configuration#
- class mlsynth.config_models.RMSIConfig(*, df: ~pandas.DataFrame, outcome: str, treat: str, unitid: str, time: str, display_graphs: bool = True, save: bool | str = False, counterfactual_color: ~typing.List[str] = <factory>, treated_color: str = 'black', unit_covariates: ~typing.List[str] = <factory>, time_covariates: ~typing.List[str] = <factory>, sieve_order: ~typing.Annotated[int, ~annotated_types.Ge(ge=1)] = 2, rank: ~typing.Annotated[int | None, ~annotated_types.Ge(ge=1)] = None)#
Configuration for the RMSI estimator.
Agarwal, Choi & Yuan (2026), “Robust Matrix Estimation with Side Information” (arXiv:2603.24833). Imputes the treated counterfactual of a block-adoption causal panel by a four-component sieve + nuclear-norm matrix estimator that exploits unit-level (row) and time-level (column) covariates. Inherits the standard
df/outcome/treat/unitid/timeinterface.- Parameters:
unit_covariates (list of str) – Columns that vary across units and are (approximately) constant within a unit; averaged per unit to form the row feature matrix
X. May be empty (then the row projection captures only the mean).time_covariates (list of str) – Columns that vary across periods and are constant across units within a period; averaged per period to form the column feature matrix
Z. May be empty.sieve_order (int) – Polynomial sieve order
J(default 2, matching the paper).rank (int, optional) – Factor rank for the block recombination; chosen by a relative-magnitude singular-value threshold when None.
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
Result Containers#
RMSI.fit() returns a
RMSIResults: the ATT, the imputed
counterfactual matrix counterfactual and per-cell effects,
att_by_period, the cross-treated-unit observed/imputed means the plotter
draws, the factor rank used, and metadata.
Frozen dataclasses for the RMSI estimator.
Agarwal, Choi & Yuan (2026), Robust Matrix Estimation with Side Information (arXiv:2603.24833). RMSI imputes the treated counterfactual of a block-adoption causal panel by a four-component sieve + nuclear-norm matrix estimator that exploits unit-level (row) and time-level (column) covariates.
- class mlsynth.utils.rmsi_helpers.structures.RMSIInputs(Y: ndarray, D: ndarray, X: ndarray, Z: ndarray, T0: int, treated_idx: ndarray, control_idx: ndarray, unit_names: List[Any], time_labels: ndarray, unit_covariates: List[str], time_covariates: List[str])#
Bases:
objectPreprocessed block-adoption panel with side information for RMSI.
- Y#
Outcomes, shape
(N, T).- Type:
np.ndarray
- D#
Treatment indicators, shape
(N, T)(1where treated).- Type:
np.ndarray
- X#
Row (unit) covariates, shape
(N, d1)– one row per unit.- Type:
np.ndarray
- Z#
Column (time) covariates, shape
(T, d2)– one row per period.- Type:
np.ndarray
- treated_idx, control_idx
Indices of ever-treated and never-treated units.
- Type:
np.ndarray
- unit_names, time_labels
- Type:
sequence
- unit_covariates, time_covariates
The covariate column names used to build
XandZ.
- D: ndarray#
- X: ndarray#
- Y: ndarray#
- Z: ndarray#
- control_idx: ndarray#
- time_labels: ndarray#
- treated_idx: ndarray#
- class mlsynth.utils.rmsi_helpers.structures.RMSIResults(inputs: ~mlsynth.utils.rmsi_helpers.structures.RMSIInputs, att: float, counterfactual: ~numpy.ndarray, effects: ~numpy.ndarray, att_by_period: ~typing.Dict[~typing.Any, float], treated_mean: ~numpy.ndarray, synthetic_mean: ~numpy.ndarray, rank: int, components: ~typing.Dict[str, ~numpy.ndarray] | None = None, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#
Bases:
objectTop-level container returned by
mlsynth.RMSI.fit().- inputs#
- Type:
- counterfactual#
Estimated untreated potential outcomes
M_hat, shape(N, T).- Type:
np.ndarray
- effects#
Observed minus imputed on treated cells (NaN elsewhere), shape
(N, T).- Type:
np.ndarray
- treated_mean, synthetic_mean
Cross-treated-unit means of observed / imputed over the full timeline (the series the plotter draws), length
T.- Type:
np.ndarray
- counterfactual: ndarray#
- effects: ndarray#
- inputs: RMSIInputs#
- synthetic_mean: ndarray#
- treated_mean: ndarray#
Helper Modules#
The numerical core: sieve bases, projections, singular-value soft-thresholding, Algorithm 1 (four-component) and Algorithm 3 (block-missing recombination).
Numerical core for RMSI (Agarwal, Choi & Yuan 2026).
Implements the four-component sieve + nuclear-norm estimator (their Algorithm 1) and the block-missing causal extension (their Algorithm 3). Both reduce to projections and singular-value soft-thresholding – no iterative solver.
Algorithm 1 decomposes M = M1 + M2 + M3 + M4:
M1 = PX Y PZ– explained by both row & column featuresM2 = svt(PX Y (I-PZ), nu2/2)– row-feature-drivenM3 = svt((I-PX) Y PZ, nu3/2)– column-feature-drivenM4 = svt((I-PX) Y (I-PZ), nu4/2)– residual low-rank (no features)
where PX/PZ project onto sieve bases of the row/column covariates, the
penalised least-squares argmin ||B - A||_F^2 + nu||A||_* has the closed form
svt(B, nu/2), and nu2 = C2 sqrt(T), nu3 = C3 sqrt(N),
nu4 = C4 (sqrt(N) + sqrt(T)).
- mlsynth.utils.rmsi_helpers.core.algorithm1(Y: ndarray, X: ndarray, Z: ndarray, *, J: int = 2, C2: float = 1.0, C3: float = 1.0, C4: float = 1.0, PX: ndarray | None = None, PZ: ndarray | None = None)#
Four-component sieve + SVT estimator (Algorithm 1) for a full matrix.
- Parameters:
Y (np.ndarray, shape (N, T)) – Fully observed outcome matrix.
X (np.ndarray, shape (N, d1)) – Row (unit) covariates.
Z (np.ndarray, shape (T, d2)) – Column (time) covariates.
J (int) – Sieve order.
C2, C3, C4 (float) – Penalty constants for the three soft-thresholded components.
PX, PZ (np.ndarray, optional) – Precomputed projectors (built from the sieve bases if omitted).
- Returns:
M_hat (np.ndarray, shape (N, T)) – The aggregated estimate
M1 + M2 + M3 + M4.components (dict) –
{"M1", "M2", "M3", "M4"}– the individual components.
- mlsynth.utils.rmsi_helpers.core.algorithm3(Y: ndarray, X: ndarray, Z: ndarray, *, control_idx: ndarray, T0: int, J: int = 2, rank: int | None = None, C2: float = 1.0, C3: float = 1.0, C4: float = 1.0)#
Block-missing causal estimator (Algorithm 3).
Imputes the full matrix from the “tall” submatrix (all units, pre-treatment periods) and the “wide” submatrix (control units, all periods), recombining via
M_hat = U_tall H_adj D_wide V_wide'with a rotationH_adjthat aligns the wide left-singular vectors to the tall ones on the control rows.- Parameters:
Y (np.ndarray, shape (N, T)) – Outcome matrix; the treated post-treatment block is treated as missing.
X, Z (np.ndarray) – Row and column covariates.
control_idx (np.ndarray) – Indices of never-treated (control) units (the “wide” rows).
T0 (int) – Number of clean pre-treatment periods (the “tall” columns).
J (int) – Sieve order.
rank (int, optional) – Factor rank
K; a relative-magnitude singular-value threshold (_auto_rank()) is used when omitted.C2, C3, C4 (float) – Penalty constants forwarded to
algorithm1().
- Returns:
M_hat (np.ndarray, shape (N, T)) – The completed matrix (the imputed counterfactual everywhere).
rank (int) – The factor rank used.
- mlsynth.utils.rmsi_helpers.core.sieve_poly(C: ndarray, J: int = 2) ndarray#
Polynomial sieve basis (with intercept) of a covariate matrix.
For
J = 2each covariateccontributescandc**2; a constant column is prepended so the projection captures means.- Parameters:
C (np.ndarray, shape (n, d)) – Covariate matrix (rows are units/periods).
J (int) – Polynomial sieve order.
- Returns:
np.ndarray, shape (n, 1 + d*J) – Basis matrix
[1, c_1, c_1^2, ..., c_d, c_d^2, ...].
Panel + side-information ingestion (block design enforced).
Panel + side-information ingestion for the RMSI estimator.
- mlsynth.utils.rmsi_helpers.setup.prepare_rmsi_inputs(df: DataFrame, outcome: str, treat: str, unitid: str, time: str, unit_covariates: List[str] | None = None, time_covariates: List[str] | None = None) RMSIInputs#
Pivot a long panel into block matrices and extract side information.
RMSI (Agarwal, Choi & Yuan 2026) assumes a block design: every ever-treated unit adopts at the same period
T0.unit_covariatesare columns (approximately) constant within a unit -> the row feature matrixX(one row per unit, time-averaged);time_covariatesare columns constant within a period -> the column feature matrixZ(one row per period, unit-averaged). Either may be empty, in which case the corresponding projection captures only the mean.
Run loop: Algorithm 3, the ATT and per-period effects, and the imputed series.
Orchestration for the RMSI estimator (Agarwal, Choi & Yuan 2026).
- mlsynth.utils.rmsi_helpers.pipeline.run_rmsi(inputs: RMSIInputs, *, J: int = 2, rank: int | None = None, C2: float = 1.0, C3: float = 1.0, C4: float = 1.0) RMSIResults#
Run RMSI (Algorithm 3) and assemble
RMSIResults.- Parameters:
inputs (RMSIInputs)
J (int) – Polynomial sieve order (default 2, matching the paper).
rank (int, optional) – Factor rank; estimated by the eigenvalue-ratio method if omitted.
C2, C3, C4 (float) – Penalty constants for the soft-thresholded components.
Block causal DGP with side information for examples and tests.
Block-adoption causal DGP with side information for RMSI examples/tests.
Mirrors the regime RMSI targets: an outcome matrix whose untreated potential
outcomes are driven by (nonlinear) functions of unit covariates X and time
covariates Z plus a residual low-rank part, with a block of treated units
adopting at a common period. Returns a tidy long panel (with the covariate
columns) ready for mlsynth.RMSI.
- mlsynth.utils.rmsi_helpers.simulation.simulate_rmsi_panel(n_units: int = 40, n_treated: int = 8, T0: int = 20, n_post: int = 11, d_unit: int = 2, d_time: int = 2, att: float = 5.0, noise: float = 0.5, seed: int = 0) DataFrame#
Simulate a block, side-information, low-rank causal panel.
- Parameters:
n_units, n_treated (int) – Total units and the number treated (treated units adopt at
T0).T0, n_post (int) – Pre- and post-treatment period counts.
d_unit, d_time (int) – Numbers of unit (row) and time (column) covariates.
att (float) – Constant additive treatment effect on treated post cells.
noise (float) – Idiosyncratic-noise standard deviation.
seed (int) – RNG seed.
- Returns:
pandas.DataFrame – Long panel with columns
unit,time,Y,treated,x0..(unit covariates),z0..(time covariates).
Path-A (Proposition 99) and Path-B (synthetic Monte-Carlo) replications.
Replications of Agarwal, Choi & Yuan (2026).
Path B – the paper’s synthetic Monte-Carlo study (Section 5.1): the four-component DGP under the block-missing (MNAR) pattern. RMSI recovers the missing block at a lower error than the no-side-information baseline.
Path A – the empirical Proposition 99 / tobacco application (Section 5.2), run on the data shipped in
basedata/P99data.csv.
Both run through the public estimator surface / the documented helper functions.
- class mlsynth.utils.rmsi_helpers.replication.RMSISimConfig(N: int = 400, T: int = 400, N0: int = 200, T0: int = 200, J: int = 5, sigma: float = 0.5, alphas: tuple = (0.25, 0.25, 0.25, 0.25), n_reps: int = 100)#
Parameters for the RMSI synthetic Monte-Carlo (paper Section 5.1, MNAR).
- mlsynth.utils.rmsi_helpers.replication.replicate_prop99(data: str | DataFrame | None = None, *, rank: int | None = 3, sieve_order: int = 2, verbose: bool = True)#
Estimate California’s Proposition 99 effect with RMSI (Path A).
Loads the Abadie et al. (2010) tobacco panel (
basedata/P99data.csv), treats California from 1989, uses the state-level Abadie predictors (lnincome,beer,age15to24,retprice) as unit covariates and the year-average retail price as the time covariate, and runsmlsynth.RMSI.- Parameters:
data (str or pandas.DataFrame, optional) – Path/URL to the panel, or a DataFrame. Default downloads
P99data.csvfrombasedata/.rank (int, optional) – Factor rank (default 3; suits the single-treated-unit case – the eigenvalue-ratio default tends to pick rank 1 here).
sieve_order (int) – Polynomial sieve order.
- Returns:
mlsynth.utils.rmsi_helpers.structures.RMSIResults
- mlsynth.utils.rmsi_helpers.replication.run_rmsi_simulation(cfg: RMSISimConfig = RMSISimConfig(N=120, T=120, N0=60, T0=60, J=5, sigma=0.5, alphas=(0.25, 0.25, 0.25, 0.25), n_reps=10), *, seed: int = 0, verbose: bool = True) Dict#
Run the synthetic MNAR Monte-Carlo and return missing-block AMSE.
Compares RMSI (with side information) to a no-side-information baseline (the same block estimator with empty covariates, i.e. a de-meaned low-rank completion) by the average mean squared error over the missing block.
- Returns:
dict –
{"rmsi": amse_side_info, "no_side_info": amse_baseline, "rel_improvement": ...}.
- mlsynth.utils.rmsi_helpers.replication.simulate_rmsi_dgp(N: int, T: int, *, alphas=(0.25, 0.25, 0.25, 0.25), sigma: float = 0.5, seed: int = 0)#
Generate
(Y, M, X, Z)from the paper’s four-component DGP (Section 5.1).Eight characteristics (four per margin) with the paper’s distributions;
M = sum_r alpha_r M_rwith eachM_rnormalised to||M_r||_F = sqrt(2 N T); noiseN(0, sigma^2).