Robust Matrix estimation with Side Information (RMSI)#

When to Use This Estimator#

RMSI implements the side-information matrix estimator of Agarwal, Choi and Yuan [RMSI]. It is a causal-panel matrix-completion method – like Matrix Completion with Nuclear Norm Minimization (MCNNM), it imputes the treated units’ missing counterfactual cells – but it exploits covariates on both margins of the panel: unit-level (row) characteristics \(X\) and time-level (column) characteristics \(Z\).

Its robustness comes from decomposing the target matrix into four complementary pieces and estimating each separately:

\[M = \underbrace{G_1(X) Q_1(Z)^\top}_{M_1:\ \text{both margins}} + \underbrace{G_2(X) V_1^\top}_{M_2:\ \text{row-driven}} + \underbrace{W_1 Q_2(Z)^\top}_{M_3:\ \text{column-driven}} + \underbrace{W_2 V_2^\top}_{M_4:\ \text{residual low-rank}} .\]

Unlike inductive matrix completion (which forces an exact low-rank linear covariate-interaction term and no genuine noise component), this decomposition accommodates nonlinear covariate effects, a part explained by only one margin, and a part explained by neither – and it degrades gracefully when the covariates are uninformative (it then reduces to a de-meaned low-rank completion). Reach for RMSI when you have a block-adoption causal panel, informative unit and/or time covariates, and you expect those covariates to carry signal the raw outcome matrix alone would estimate noisily.

Do not use RMSI when#

The estimator#

Algorithm 1 (fully observed). With polynomial sieve bases of the covariates and projectors \(P_X, P_Z\) onto them, the component explained by both margins is \(\widehat M_1 = P_X Y P_Z\), and the other three are singular-value soft-thresholds of the projected residuals (the penalised least squares \(\arg\min_A \lVert B - A\rVert_F^2 + \nu\lVert A\rVert_*\) has the closed form \(\operatorname{svt}(B, \nu/2)\)):

\[\widehat M_2 = \operatorname{svt}\!\big(P_X Y (I - P_Z),\ \nu_2/2\big),\quad \widehat M_3 = \operatorname{svt}\!\big((I - P_X) Y P_Z,\ \nu_3/2\big),\quad \widehat M_4 = \operatorname{svt}\!\big((I - P_X) Y (I - P_Z),\ \nu_4/2\big),\]

with \(\nu_2 = C_2\sqrt{T}\), \(\nu_3 = C_3\sqrt{N}\), \(\nu_4 = C_4(\sqrt{N}+\sqrt{T})\). The estimate is \(\widehat M = \widehat M_1 + \widehat M_2 + \widehat M_3 + \widehat M_4\) – only projections and SVDs, no iterative solver.

Algorithm 3 (block-missing causal). The treated post-treatment cells form a missing block. RMSI applies Algorithm 1 to the fully observed tall submatrix (all units, pre-treatment periods) and wide submatrix (control units, all periods), then recombines their singular subspaces, \(\widehat M = \widehat U_{\text{tall}}\, \widehat H\, \widehat D_{\text{wide}} \widehat V_{\text{wide}}^\top\), where \(\widehat H\) rotates the wide left-singular vectors onto the tall ones over the control rows. The ATT is the observed minus the imputed outcome over the treated cells. RMSI targets the block (common adoption time) setting.

Side information#

Pass the covariate columns through the config: unit_covariates are columns (approximately) constant within a unit – averaged per unit to form the row feature matrix \(X\) – and time_covariates are columns constant within a period – averaged per period to form \(Z\). Either may be empty.

Example#

A block panel of forty units (eight treated at period 20) whose untreated outcomes are driven by nonlinear functions of two unit covariates and two time covariates plus a residual low-rank part, with a constant effect of +5:

from mlsynth import RMSI
from mlsynth.utils.rmsi_helpers.simulation import simulate_rmsi_panel

df = simulate_rmsi_panel(
    n_units=40, n_treated=8, T0=20, n_post=11, att=5.0, seed=0,
)

res = RMSI({
    "df": df, "outcome": "Y", "treat": "treated",
    "unitid": "unit", "time": "time",
    "unit_covariates": ["x0", "x1"],   # row side information
    "time_covariates": ["z0", "z1"],   # column side information
    "sieve_order": 2,
    "display_graphs": True,            # observed vs. imputed counterfactual
}).fit()

print(f"ATT (true 5.0) = {res.att:+.3f}   [rank {res.rank}]")

Replication#

Both of the paper’s empirical pieces are reproduced through the public API.

Path A – Proposition 99 (Section 5.2). Using the Abadie et al. (2010) tobacco panel shipped at basedata/P99data.csv, RMSI treats California from 1989 and uses the state-level Abadie predictors as unit covariates and the year-average retail price as the time covariate:

from mlsynth.utils.rmsi_helpers.replication import replicate_prop99

# downloads basedata/P99data.csv by default; pass a path/DataFrame to override
res = replicate_prop99(rank=3)
# -> California Proposition 99 ATT ~ -21 packs/capita (1989 ~ -7, 2000 ~ -32)

The estimated ATT of about \(-21\) packs per capita is consistent with Abadie, Diamond & Hainmueller [ABADIE2010] and with the Matrix Completion with Nuclear Norm Minimization (MCNNM) / Synthetic Nearest Neighbors / Causal Matrix Completion (SNN) estimates elsewhere in mlsynth. The equivalent explicit call:

import pandas as pd
from mlsynth import RMSI

url = ("https://raw.githubusercontent.com/jgreathouse9/mlsynth/"
       "main/basedata/P99data.csv")
d = pd.read_csv(url)
for c in ["lnincome", "beer", "age15to24", "retprice"]:
    d[c] = d.groupby("state")[c].transform(lambda s: s.fillna(s.mean()))
    d[c] = d[c].fillna(d[c].mean())
d["treated"] = ((d["state"] == "California") & (d["year"] >= 1989)).astype(int)

res = RMSI({"df": d, "outcome": "cigsale", "treat": "treated",
            "unitid": "state", "time": "year",
            "unit_covariates": ["lnincome", "beer", "age15to24", "retprice"],
            "time_covariates": ["retprice"], "rank": 3,
            "display_graphs": False}).fit()
print(res.att)

Path B – synthetic Monte Carlo (Section 5.1). The paper’s four-component DGP under the block-missing (MNAR) pattern; RMSI imputes the missing block at a lower average mean-squared error than the no-side-information baseline:

from mlsynth.utils.rmsi_helpers.replication import (
    run_rmsi_simulation, RMSISimConfig, PAPER,
)

# quick, reduced-count preset (use PAPER for the full N=T=400, 100-rep study)
out = run_rmsi_simulation(RMSISimConfig(N=120, T=120, N0=60, T0=60,
                                        J=5, n_reps=10))
# -> AMSE side-info < AMSE no-side-info (side information lowers the error)

Verification#

Note

Path B (synthetic). On the paper’s four-component MNAR DGP (simulate_rmsi_dgp()), RMSI’s block imputation achieves a lower missing-block AMSE than the no-side-information baseline – reproducing the paper’s central finding that leveraging side information improves imputation accuracy.

Path A (Proposition 99). replicate_prop99 recovers a California Proposition 99 ATT of about \(-21\) packs per capita (widening toward \(-32\) by 2000), matching the Abadie-Diamond-Hainmueller [ABADIE2010] baseline and the other mlsynth estimators on the same panel.

Core API#

RMSI: Robust Matrix estimation with Side Information (Agarwal, Choi & Yuan 2026).

Agarwal, A., Choi, J. & Yuan, M. (2026). “Robust Matrix Estimation with Side Information.” arXiv:2603.24833.

RMSI estimates a panel’s untreated-outcome matrix while exploiting covariates on both margins – unit-level (row) characteristics \(X\) and time-level (column) characteristics \(Z\). It decomposes the target matrix into four complementary pieces,

\[M = \underbrace{G_1(X) Q_1(Z)^\top}_{M_1\text{: both}} + \underbrace{G_2(X) V_1^\top}_{M_2\text{: row-driven}} + \underbrace{W_1 Q_2(Z)^\top}_{M_3\text{: column-driven}} + \underbrace{W_2 V_2^\top}_{M_4\text{: residual low-rank}},\]

and estimates each by sieve projection plus nuclear-norm penalization: with projectors \(P_X, P_Z\) onto polynomial sieve bases of the covariates, the component explained by both margins is \(\widehat M_1 = P_X Y P_Z\), and the remaining three are singular-value soft-thresholds of the corresponding projected residuals (the penalised least squares has that closed form). Because the pieces are recovered separately and summed, the method is robust: it accommodates nonlinear covariate effects, parts explained by only one margin, and a part explained by neither – and degrades gracefully when the covariates are uninformative (it then reduces to a de-meaned low-rank completion).

For causal panel data the treated cells form a missing block. RMSI applies the estimator to the fully observed “tall” submatrix (all units, pre-treatment periods) and “wide” submatrix (control units, all periods) and recombines their singular subspaces to impute the missing treated counterfactual; the ATT is the observed minus the imputed outcome over the treated cells. This estimator targets the block (common adoption time) causal setting.

class mlsynth.estimators.rmsi.RMSI(config: RMSIConfig | dict)#

Bases: object

Robust Matrix estimation with Side Information.

Parameters:

config (RMSIConfig or dict) – Configuration object. See mlsynth.config_models.RMSIConfig.

fit() RMSIResults#

Run RMSI and return RMSIResults.

Configuration#

class mlsynth.config_models.RMSIConfig(*, df: ~pandas.DataFrame, outcome: str, treat: str, unitid: str, time: str, display_graphs: bool = True, save: bool | str = False, counterfactual_color: ~typing.List[str] = <factory>, treated_color: str = 'black', unit_covariates: ~typing.List[str] = <factory>, time_covariates: ~typing.List[str] = <factory>, sieve_order: ~typing.Annotated[int, ~annotated_types.Ge(ge=1)] = 2, rank: ~typing.Annotated[int | None, ~annotated_types.Ge(ge=1)] = None)#

Configuration for the RMSI estimator.

Agarwal, Choi & Yuan (2026), “Robust Matrix Estimation with Side Information” (arXiv:2603.24833). Imputes the treated counterfactual of a block-adoption causal panel by a four-component sieve + nuclear-norm matrix estimator that exploits unit-level (row) and time-level (column) covariates. Inherits the standard df / outcome / treat / unitid / time interface.

Parameters:
  • unit_covariates (list of str) – Columns that vary across units and are (approximately) constant within a unit; averaged per unit to form the row feature matrix X. May be empty (then the row projection captures only the mean).

  • time_covariates (list of str) – Columns that vary across periods and are constant across units within a period; averaged per period to form the column feature matrix Z. May be empty.

  • sieve_order (int) – Polynomial sieve order J (default 2, matching the paper).

  • rank (int, optional) – Factor rank for the block recombination; chosen by a relative-magnitude singular-value threshold when None.

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

rank: int | None#
sieve_order: int#
time_covariates: List[str]#
unit_covariates: List[str]#

Result Containers#

RMSI.fit() returns a RMSIResults: the ATT, the imputed counterfactual matrix counterfactual and per-cell effects, att_by_period, the cross-treated-unit observed/imputed means the plotter draws, the factor rank used, and metadata.

Frozen dataclasses for the RMSI estimator.

Agarwal, Choi & Yuan (2026), Robust Matrix Estimation with Side Information (arXiv:2603.24833). RMSI imputes the treated counterfactual of a block-adoption causal panel by a four-component sieve + nuclear-norm matrix estimator that exploits unit-level (row) and time-level (column) covariates.

class mlsynth.utils.rmsi_helpers.structures.RMSIInputs(Y: ndarray, D: ndarray, X: ndarray, Z: ndarray, T0: int, treated_idx: ndarray, control_idx: ndarray, unit_names: List[Any], time_labels: ndarray, unit_covariates: List[str], time_covariates: List[str])#

Bases: object

Preprocessed block-adoption panel with side information for RMSI.

Y#

Outcomes, shape (N, T).

Type:

np.ndarray

D#

Treatment indicators, shape (N, T) (1 where treated).

Type:

np.ndarray

X#

Row (unit) covariates, shape (N, d1) – one row per unit.

Type:

np.ndarray

Z#

Column (time) covariates, shape (T, d2) – one row per period.

Type:

np.ndarray

T0#

Number of clean pre-treatment periods (common adoption at T0).

Type:

int

treated_idx, control_idx

Indices of ever-treated and never-treated units.

Type:

np.ndarray

unit_names, time_labels
Type:

sequence

unit_covariates, time_covariates

The covariate column names used to build X and Z.

Type:

list of str

D: ndarray#
property N: int#
property T: int#
T0: int#
X: ndarray#
Y: ndarray#
Z: ndarray#
control_idx: ndarray#
time_covariates: List[str]#
time_labels: ndarray#
treated_idx: ndarray#
unit_covariates: List[str]#
unit_names: List[Any]#
class mlsynth.utils.rmsi_helpers.structures.RMSIResults(inputs: ~mlsynth.utils.rmsi_helpers.structures.RMSIInputs, att: float, counterfactual: ~numpy.ndarray, effects: ~numpy.ndarray, att_by_period: ~typing.Dict[~typing.Any, float], treated_mean: ~numpy.ndarray, synthetic_mean: ~numpy.ndarray, rank: int, components: ~typing.Dict[str, ~numpy.ndarray] | None = None, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#

Bases: object

Top-level container returned by mlsynth.RMSI.fit().

inputs#
Type:

RMSIInputs

att#

Average treatment effect on the treated (mean post-period gap over treated cells).

Type:

float

counterfactual#

Estimated untreated potential outcomes M_hat, shape (N, T).

Type:

np.ndarray

effects#

Observed minus imputed on treated cells (NaN elsewhere), shape (N, T).

Type:

np.ndarray

att_by_period#

{time_label: ATT} over the post-treatment periods.

Type:

dict

treated_mean, synthetic_mean

Cross-treated-unit means of observed / imputed over the full timeline (the series the plotter draws), length T.

Type:

np.ndarray

rank#

Factor rank used in the block recombination.

Type:

int

components#

The four components of the “tall” fit (diagnostic), if retained.

Type:

dict, optional

metadata#
Type:

dict

att: float#
att_by_period: Dict[Any, float]#
components: Dict[str, ndarray] | None = None#
counterfactual: ndarray#
effects: ndarray#
inputs: RMSIInputs#
metadata: Dict[str, Any]#
rank: int#
synthetic_mean: ndarray#
treated_mean: ndarray#

Helper Modules#

The numerical core: sieve bases, projections, singular-value soft-thresholding, Algorithm 1 (four-component) and Algorithm 3 (block-missing recombination).

Numerical core for RMSI (Agarwal, Choi & Yuan 2026).

Implements the four-component sieve + nuclear-norm estimator (their Algorithm 1) and the block-missing causal extension (their Algorithm 3). Both reduce to projections and singular-value soft-thresholding – no iterative solver.

Algorithm 1 decomposes M = M1 + M2 + M3 + M4:

  • M1 = PX Y PZ – explained by both row & column features

  • M2 = svt(PX Y (I-PZ), nu2/2) – row-feature-driven

  • M3 = svt((I-PX) Y PZ, nu3/2) – column-feature-driven

  • M4 = svt((I-PX) Y (I-PZ), nu4/2) – residual low-rank (no features)

where PX/PZ project onto sieve bases of the row/column covariates, the penalised least-squares argmin ||B - A||_F^2 + nu||A||_* has the closed form svt(B, nu/2), and nu2 = C2 sqrt(T), nu3 = C3 sqrt(N), nu4 = C4 (sqrt(N) + sqrt(T)).

mlsynth.utils.rmsi_helpers.core.algorithm1(Y: ndarray, X: ndarray, Z: ndarray, *, J: int = 2, C2: float = 1.0, C3: float = 1.0, C4: float = 1.0, PX: ndarray | None = None, PZ: ndarray | None = None)#

Four-component sieve + SVT estimator (Algorithm 1) for a full matrix.

Parameters:
  • Y (np.ndarray, shape (N, T)) – Fully observed outcome matrix.

  • X (np.ndarray, shape (N, d1)) – Row (unit) covariates.

  • Z (np.ndarray, shape (T, d2)) – Column (time) covariates.

  • J (int) – Sieve order.

  • C2, C3, C4 (float) – Penalty constants for the three soft-thresholded components.

  • PX, PZ (np.ndarray, optional) – Precomputed projectors (built from the sieve bases if omitted).

Returns:

  • M_hat (np.ndarray, shape (N, T)) – The aggregated estimate M1 + M2 + M3 + M4.

  • components (dict) – {"M1", "M2", "M3", "M4"} – the individual components.

mlsynth.utils.rmsi_helpers.core.algorithm3(Y: ndarray, X: ndarray, Z: ndarray, *, control_idx: ndarray, T0: int, J: int = 2, rank: int | None = None, C2: float = 1.0, C3: float = 1.0, C4: float = 1.0)#

Block-missing causal estimator (Algorithm 3).

Imputes the full matrix from the “tall” submatrix (all units, pre-treatment periods) and the “wide” submatrix (control units, all periods), recombining via M_hat = U_tall H_adj D_wide V_wide' with a rotation H_adj that aligns the wide left-singular vectors to the tall ones on the control rows.

Parameters:
  • Y (np.ndarray, shape (N, T)) – Outcome matrix; the treated post-treatment block is treated as missing.

  • X, Z (np.ndarray) – Row and column covariates.

  • control_idx (np.ndarray) – Indices of never-treated (control) units (the “wide” rows).

  • T0 (int) – Number of clean pre-treatment periods (the “tall” columns).

  • J (int) – Sieve order.

  • rank (int, optional) – Factor rank K; a relative-magnitude singular-value threshold (_auto_rank()) is used when omitted.

  • C2, C3, C4 (float) – Penalty constants forwarded to algorithm1().

Returns:

  • M_hat (np.ndarray, shape (N, T)) – The completed matrix (the imputed counterfactual everywhere).

  • rank (int) – The factor rank used.

mlsynth.utils.rmsi_helpers.core.sieve_poly(C: ndarray, J: int = 2) ndarray#

Polynomial sieve basis (with intercept) of a covariate matrix.

For J = 2 each covariate c contributes c and c**2; a constant column is prepended so the projection captures means.

Parameters:
  • C (np.ndarray, shape (n, d)) – Covariate matrix (rows are units/periods).

  • J (int) – Polynomial sieve order.

Returns:

np.ndarray, shape (n, 1 + d*J) – Basis matrix [1, c_1, c_1^2, ..., c_d, c_d^2, ...].

Panel + side-information ingestion (block design enforced).

Panel + side-information ingestion for the RMSI estimator.

mlsynth.utils.rmsi_helpers.setup.prepare_rmsi_inputs(df: DataFrame, outcome: str, treat: str, unitid: str, time: str, unit_covariates: List[str] | None = None, time_covariates: List[str] | None = None) RMSIInputs#

Pivot a long panel into block matrices and extract side information.

RMSI (Agarwal, Choi & Yuan 2026) assumes a block design: every ever-treated unit adopts at the same period T0. unit_covariates are columns (approximately) constant within a unit -> the row feature matrix X (one row per unit, time-averaged); time_covariates are columns constant within a period -> the column feature matrix Z (one row per period, unit-averaged). Either may be empty, in which case the corresponding projection captures only the mean.

Run loop: Algorithm 3, the ATT and per-period effects, and the imputed series.

Orchestration for the RMSI estimator (Agarwal, Choi & Yuan 2026).

mlsynth.utils.rmsi_helpers.pipeline.run_rmsi(inputs: RMSIInputs, *, J: int = 2, rank: int | None = None, C2: float = 1.0, C3: float = 1.0, C4: float = 1.0) RMSIResults#

Run RMSI (Algorithm 3) and assemble RMSIResults.

Parameters:
  • inputs (RMSIInputs)

  • J (int) – Polynomial sieve order (default 2, matching the paper).

  • rank (int, optional) – Factor rank; estimated by the eigenvalue-ratio method if omitted.

  • C2, C3, C4 (float) – Penalty constants for the soft-thresholded components.

Block causal DGP with side information for examples and tests.

Block-adoption causal DGP with side information for RMSI examples/tests.

Mirrors the regime RMSI targets: an outcome matrix whose untreated potential outcomes are driven by (nonlinear) functions of unit covariates X and time covariates Z plus a residual low-rank part, with a block of treated units adopting at a common period. Returns a tidy long panel (with the covariate columns) ready for mlsynth.RMSI.

mlsynth.utils.rmsi_helpers.simulation.simulate_rmsi_panel(n_units: int = 40, n_treated: int = 8, T0: int = 20, n_post: int = 11, d_unit: int = 2, d_time: int = 2, att: float = 5.0, noise: float = 0.5, seed: int = 0) DataFrame#

Simulate a block, side-information, low-rank causal panel.

Parameters:
  • n_units, n_treated (int) – Total units and the number treated (treated units adopt at T0).

  • T0, n_post (int) – Pre- and post-treatment period counts.

  • d_unit, d_time (int) – Numbers of unit (row) and time (column) covariates.

  • att (float) – Constant additive treatment effect on treated post cells.

  • noise (float) – Idiosyncratic-noise standard deviation.

  • seed (int) – RNG seed.

Returns:

pandas.DataFrame – Long panel with columns unit, time, Y, treated, x0.. (unit covariates), z0.. (time covariates).

Path-A (Proposition 99) and Path-B (synthetic Monte-Carlo) replications.

Replications of Agarwal, Choi & Yuan (2026).

  • Path B – the paper’s synthetic Monte-Carlo study (Section 5.1): the four-component DGP under the block-missing (MNAR) pattern. RMSI recovers the missing block at a lower error than the no-side-information baseline.

  • Path A – the empirical Proposition 99 / tobacco application (Section 5.2), run on the data shipped in basedata/P99data.csv.

Both run through the public estimator surface / the documented helper functions.

class mlsynth.utils.rmsi_helpers.replication.RMSISimConfig(N: int = 400, T: int = 400, N0: int = 200, T0: int = 200, J: int = 5, sigma: float = 0.5, alphas: tuple = (0.25, 0.25, 0.25, 0.25), n_reps: int = 100)#

Parameters for the RMSI synthetic Monte-Carlo (paper Section 5.1, MNAR).

J: int = 5#
N: int = 400#
N0: int = 200#
T: int = 400#
T0: int = 200#
alphas: tuple = (0.25, 0.25, 0.25, 0.25)#
n_reps: int = 100#
sigma: float = 0.5#
mlsynth.utils.rmsi_helpers.replication.replicate_prop99(data: str | DataFrame | None = None, *, rank: int | None = 3, sieve_order: int = 2, verbose: bool = True)#

Estimate California’s Proposition 99 effect with RMSI (Path A).

Loads the Abadie et al. (2010) tobacco panel (basedata/P99data.csv), treats California from 1989, uses the state-level Abadie predictors (lnincome, beer, age15to24, retprice) as unit covariates and the year-average retail price as the time covariate, and runs mlsynth.RMSI.

Parameters:
  • data (str or pandas.DataFrame, optional) – Path/URL to the panel, or a DataFrame. Default downloads P99data.csv from basedata/.

  • rank (int, optional) – Factor rank (default 3; suits the single-treated-unit case – the eigenvalue-ratio default tends to pick rank 1 here).

  • sieve_order (int) – Polynomial sieve order.

Returns:

mlsynth.utils.rmsi_helpers.structures.RMSIResults

mlsynth.utils.rmsi_helpers.replication.run_rmsi_simulation(cfg: RMSISimConfig = RMSISimConfig(N=120, T=120, N0=60, T0=60, J=5, sigma=0.5, alphas=(0.25, 0.25, 0.25, 0.25), n_reps=10), *, seed: int = 0, verbose: bool = True) Dict#

Run the synthetic MNAR Monte-Carlo and return missing-block AMSE.

Compares RMSI (with side information) to a no-side-information baseline (the same block estimator with empty covariates, i.e. a de-meaned low-rank completion) by the average mean squared error over the missing block.

Returns:

dict{"rmsi": amse_side_info, "no_side_info": amse_baseline, "rel_improvement": ...}.

mlsynth.utils.rmsi_helpers.replication.simulate_rmsi_dgp(N: int, T: int, *, alphas=(0.25, 0.25, 0.25, 0.25), sigma: float = 0.5, seed: int = 0)#

Generate (Y, M, X, Z) from the paper’s four-component DGP (Section 5.1).

Eight characteristics (four per margin) with the paper’s distributions; M = sum_r alpha_r M_r with each M_r normalised to ||M_r||_F = sqrt(2 N T); noise N(0, sigma^2).