Robust Matrix estimation with Side Information (RMSI)#

When to Use This Estimator#

RMSI implements the side-information matrix estimator of Agarwal, Choi and Yuan [RMSI]. It is a causal-panel matrix-completion method – like Matrix Completion with Nuclear Norm Minimization (MCNNM), it imputes the treated units’ missing counterfactual cells – but it exploits covariates on both margins of the panel: unit-level (row) characteristics \(\mathbf{X}\) and time-level (column) characteristics \(\mathbf{Z}\).

Its robustness comes from decomposing the target matrix into four complementary pieces and estimating each separately:

\[\mathbf{M} = \underbrace{\mathbf{G}_1(\mathbf{X})\, \mathbf{Q}_1(\mathbf{Z})^\top}_{\mathbf{M}_1:\ \text{both margins}} + \underbrace{\mathbf{G}_2(\mathbf{X})\, \mathbf{V}_1^\top}_{\mathbf{M}_2:\ \text{row-driven}} + \underbrace{\mathbf{W}_1\, \mathbf{Q}_2(\mathbf{Z})^\top}_{\mathbf{M}_3:\ \text{column-driven}} + \underbrace{\mathbf{W}_2\, \mathbf{V}_2^\top}_{\mathbf{M}_4:\ \text{residual low-rank}} .\]

Unlike inductive matrix completion (which forces an exact low-rank linear covariate-interaction term and no genuine noise component), this decomposition accommodates nonlinear covariate effects, a part explained by only one margin, and a part explained by neither – and it degrades gracefully when the covariates are uninformative (it then reduces to a de-meaned low-rank completion). Reach for RMSI when you have a block-adoption causal panel, informative unit and/or time covariates, and you expect those covariates to carry signal the raw outcome matrix alone would estimate noisily.

Do not use RMSI when#

Notation#

The panel has \(N\) units \(\mathcal{N} \coloneqq \{1, \dots, N\}\) and \(T\) periods \(t \in \mathcal{T} \coloneqq \{1, \dots, T\}\), 1-indexed; the intervention takes effect after period \(T_0\), splitting \(\mathcal{T}\) into the pre-period \(\mathcal{T}_1 \coloneqq \{t \in \mathcal{T} : t \le T_0\}\) and the post-period \(\mathcal{T}_2 \coloneqq \{t \in \mathcal{T} : t > T_0\}\). A block of treated units adopts at the common time \(T_0\); the remaining units are controls. The outcome of unit \(i\) at time \(t\) is \(y_{it}\), collected into the \(T \times N\) matrix \(\mathbf{Y}\) (one column per unit). Unit-level covariates are averaged per unit into the row feature matrix \(\mathbf{X}\) and time-level covariates per period into \(\mathbf{Z}\); \(\mathbf{P}_X, \mathbf{P}_Z\) are the sieve projectors onto their spans. The target counterfactual matrix is \(\mathbf{M}\) with the no-intervention entries \(y_{it}^N\); its estimate is \(\widehat{\mathbf{M}}\), with imputed entries \(\widehat{y}_{it}\). The per-cell effect is \(\tau_{it} \coloneqq y_{it} - \widehat{y}_{it}\) (estimating \(y_{it}^I - y_{it}^N\)), and the ATT \(\widehat{\tau}\) averages \(\tau_{it}\) over the treated cells in \(\mathcal{T}_2\).

Identifying assumptions#

  1. Block adoption. All treated units adopt at the common time \(T_0\), so the missing cells form a single rectangular block (treated units over \(\mathcal{T}_2\)); the tall submatrix (all units over \(\mathcal{T}_1\)) and the wide submatrix (controls over \(\mathcal{T}\)) are fully observed.

    Remark. This is what lets Algorithm 1 run on the two observed submatrices and Algorithm 3 recombine their singular subspaces. Staggered adoption breaks the rectangular missingness pattern; use Matrix Completion with Nuclear Norm Minimization (MCNNM), Staggered Synthetic Control (SSC), Synthetic Difference-in-Differences (SDID), or Partially Pooled SCM (PPSCM) instead.

  2. Side-information structure. The no-intervention matrix decomposes into the four components \(\mathbf{M} = \mathbf{M}_1 + \mathbf{M}_2 + \mathbf{M}_3 + \mathbf{M}_4\) – explained by both margins, the row margin only, the column margin only, and a residual low-rank part – with the margin-driven pieces captured by (possibly nonlinear) functions of \(\mathbf{X}\) and \(\mathbf{Z}\) realised through the sieve bases.

    Remark. When the covariates are uninformative the first three components vanish and the model reduces to a de-meaned low-rank completion, recovering the no-side-information baseline – the source of RMSI’s graceful degradation.

  3. No anticipation and SUTVA. Pre-period outcomes reflect the no-intervention path, \(y_{it} = y_{it}^N\) for \(t \in \mathcal{T}_1\), and controls are untreated and uncontaminated over \(\mathcal{T}\), so the observed submatrices carry only no-intervention outcomes.

    Remark. Spillovers onto the controls or a pre-\(T_0\) response bias the imputed block. Quarantine contaminated controls before fitting; if spillovers are intrinsic, use Spatial Synthetic Difference-in-Differences (SpSyDiD) or Spillover-Aware Synthetic Control (SPILLSYNTH).

The estimator#

Algorithm 1 (fully observed). With polynomial sieve bases of the covariates and projectors \(\mathbf{P}_X, \mathbf{P}_Z\) onto them, the component explained by both margins is \(\widehat{\mathbf{M}}_1 = \mathbf{P}_X \mathbf{Y} \mathbf{P}_Z\), and the other three are singular-value soft-thresholds of the projected residuals (the penalised least squares \(\operatorname*{argmin}_{\mathbf{A}} \lVert \mathbf{B} - \mathbf{A}\rVert_F^2 + \nu\lVert \mathbf{A}\rVert_*\) has the closed form \(\operatorname{svt}(\mathbf{B}, \nu/2)\)):

\[\widehat{\mathbf{M}}_2 = \operatorname{svt}\!\big(\mathbf{P}_X \mathbf{Y} (\mathbf{I} - \mathbf{P}_Z),\ \nu_2/2\big),\quad \widehat{\mathbf{M}}_3 = \operatorname{svt}\!\big((\mathbf{I} - \mathbf{P}_X) \mathbf{Y} \mathbf{P}_Z,\ \nu_3/2\big),\quad \widehat{\mathbf{M}}_4 = \operatorname{svt}\!\big((\mathbf{I} - \mathbf{P}_X) \mathbf{Y} (\mathbf{I} - \mathbf{P}_Z),\ \nu_4/2\big),\]

with \(\nu_2 = C_2\sqrt{T}\), \(\nu_3 = C_3\sqrt{N}\), \(\nu_4 = C_4(\sqrt{N}+\sqrt{T})\). The estimate is \(\widehat{\mathbf{M}} = \widehat{\mathbf{M}}_1 + \widehat{\mathbf{M}}_2 + \widehat{\mathbf{M}}_3 + \widehat{\mathbf{M}}_4\) – only projections and SVDs, no iterative solver.

Algorithm 3 (block-missing causal). The treated post-treatment cells form a missing block. RMSI applies Algorithm 1 to the fully observed tall submatrix (all units, pre-treatment periods) and wide submatrix (control units, all periods), then recombines their singular subspaces, \(\widehat{\mathbf{M}} = \widehat{\mathbf{U}}_{\text{tall}}\, \widehat{\mathbf{H}}\, \widehat{\mathbf{D}}_{\text{wide}} \widehat{\mathbf{V}}_{\text{wide}}^\top\), where \(\widehat{\mathbf{H}}\) rotates the wide left-singular vectors onto the tall ones over the control rows. The ATT \(\widehat{\tau}\) is the observed minus the imputed outcome averaged over the treated post-treatment cells \(\mathcal{T}_2\), with per-cell effects \(\tau_{it} \coloneqq y_{it} - \widehat{y}_{it}\). RMSI targets the block (common adoption time) setting.

Side information#

Pass the covariate columns through the config: unit_covariates are columns (approximately) constant within a unit – averaged per unit to form the row feature matrix \(\mathbf{X}\) – and time_covariates are columns constant within a period – averaged per period to form \(\mathbf{Z}\). Either may be empty.

Example#

A block panel of forty units (eight treated at period 20) whose untreated outcomes are driven by nonlinear functions of two unit covariates and two time covariates plus a residual low-rank part, with a constant effect of +5:

from mlsynth import RMSI
from mlsynth.utils.rmsi_helpers.simulation import simulate_rmsi_panel

df = simulate_rmsi_panel(
    n_units=40, n_treated=8, T0=20, n_post=11, att=5.0, seed=0,
)

res = RMSI({
    "df": df, "outcome": "Y", "treat": "treated",
    "unitid": "unit", "time": "time",
    "unit_covariates": ["x0", "x1"],   # row side information
    "time_covariates": ["z0", "z1"],   # column side information
    "sieve_order": 2,
    "display_graphs": True,            # observed vs. imputed counterfactual
}).fit()

print(f"ATT (true 5.0) = {res.att:+.3f}   [rank {res.rank}]")

Replication#

Both of the paper’s empirical pieces are reproduced through the public API.

Path A – Proposition 99 (Section 5.2). Using the Abadie et al. (2010) tobacco panel shipped at basedata/P99data.csv, RMSI treats California from 1989 and uses the state-level Abadie predictors as unit covariates and the year-average retail price as the time covariate:

from mlsynth.utils.rmsi_helpers.replication import replicate_prop99

# downloads basedata/P99data.csv by default; pass a path/DataFrame to override
res = replicate_prop99(rank=3)
# -> California Proposition 99 ATT ~ -21 packs/capita (1989 ~ -7, 2000 ~ -32)

The estimated ATT of about \(-21\) packs per capita is consistent with Abadie, Diamond & Hainmueller [ABADIE2010] and with the Matrix Completion with Nuclear Norm Minimization (MCNNM) / Synthetic Nearest Neighbors / Causal Matrix Completion (SNN) estimates elsewhere in mlsynth. The equivalent explicit call:

import pandas as pd
from mlsynth import RMSI

url = ("https://raw.githubusercontent.com/jgreathouse9/mlsynth/"
       "main/basedata/P99data.csv")
d = pd.read_csv(url)
for c in ["lnincome", "beer", "age15to24", "retprice"]:
    d[c] = d.groupby("state")[c].transform(lambda s: s.fillna(s.mean()))
    d[c] = d[c].fillna(d[c].mean())
d["treated"] = ((d["state"] == "California") & (d["year"] >= 1989)).astype(int)

res = RMSI({"df": d, "outcome": "cigsale", "treat": "treated",
            "unitid": "state", "time": "year",
            "unit_covariates": ["lnincome", "beer", "age15to24", "retprice"],
            "time_covariates": ["retprice"], "rank": 3,
            "display_graphs": False}).fit()
print(res.att)

Path B – synthetic Monte Carlo (Section 5.1). The paper’s four-component DGP under the block-missing (MNAR) pattern; RMSI imputes the missing block at a lower average mean-squared error than the no-side-information baseline:

from mlsynth.utils.rmsi_helpers.replication import (
    run_rmsi_simulation, RMSISimConfig, PAPER,
)

# quick, reduced-count preset (use PAPER for the full N=T=400, 100-rep study)
out = run_rmsi_simulation(RMSISimConfig(N=120, T=120, N0=60, T0=60,
                                        J=5, n_reps=10))
# -> AMSE side-info < AMSE no-side-info (side information lowers the error)

Verification#

Note

Path B (synthetic). On the paper’s four-component MNAR DGP (simulate_rmsi_dgp()), RMSI’s block imputation achieves a lower missing-block AMSE than the no-side-information baseline – reproducing the paper’s central finding that leveraging side information improves imputation accuracy.

Path A (Proposition 99). replicate_prop99 recovers a California Proposition 99 ATT of about \(-21\) packs per capita (widening toward \(-32\) by 2000), matching the Abadie-Diamond-Hainmueller [ABADIE2010] baseline and the other mlsynth estimators on the same panel.

Core API#

RMSI: Robust Matrix estimation with Side Information (Agarwal, Choi & Yuan 2026).

Agarwal, A., Choi, J. & Yuan, M. (2026). “Robust Matrix Estimation with Side Information.” arXiv:2603.24833.

RMSI estimates a panel’s untreated-outcome matrix while exploiting covariates on both margins – unit-level (row) characteristics \(X\) and time-level (column) characteristics \(Z\). It decomposes the target matrix into four complementary pieces,

\[M = \underbrace{G_1(X) Q_1(Z)^\top}_{M_1\text{: both}} + \underbrace{G_2(X) V_1^\top}_{M_2\text{: row-driven}} + \underbrace{W_1 Q_2(Z)^\top}_{M_3\text{: column-driven}} + \underbrace{W_2 V_2^\top}_{M_4\text{: residual low-rank}},\]

and estimates each by sieve projection plus nuclear-norm penalization: with projectors \(P_X, P_Z\) onto polynomial sieve bases of the covariates, the component explained by both margins is \(\widehat M_1 = P_X Y P_Z\), and the remaining three are singular-value soft-thresholds of the corresponding projected residuals (the penalised least squares has that closed form). Because the pieces are recovered separately and summed, the method is robust: it accommodates nonlinear covariate effects, parts explained by only one margin, and a part explained by neither – and degrades gracefully when the covariates are uninformative (it then reduces to a de-meaned low-rank completion).

For causal panel data the treated cells form a missing block. RMSI applies the estimator to the fully observed “tall” submatrix (all units, pre-treatment periods) and “wide” submatrix (control units, all periods) and recombines their singular subspaces to impute the missing treated counterfactual; the ATT is the observed minus the imputed outcome over the treated cells. This estimator targets the block (common adoption time) causal setting.

class mlsynth.estimators.rmsi.RMSI(config: RMSIConfig | dict)#

Bases: object

Robust Matrix estimation with Side Information.

Parameters:

config (RMSIConfig or dict) – Configuration object. See mlsynth.config_models.RMSIConfig.

fit() RMSIResults#

Run RMSI and return RMSIResults.

Configuration#

class mlsynth.config_models.RMSIConfig(*, df: ~pandas.DataFrame, outcome: str, treat: str, unitid: str, time: str, display_graphs: bool = True, save: bool | str = False, counterfactual_color: ~typing.List[str] = <factory>, treated_color: str = 'black', plot: ~mlsynth.config_models.PlotConfig = <factory>, unit_covariates: ~typing.List[str] = <factory>, time_covariates: ~typing.List[str] = <factory>, sieve_order: ~typing.Annotated[int, ~annotated_types.Ge(ge=1)] = 2, rank: ~typing.Annotated[int | None, ~annotated_types.Ge(ge=1)] = None)#

Configuration for the RMSI estimator.

Agarwal, Choi & Yuan (2026), “Robust Matrix Estimation with Side Information” (arXiv:2603.24833). Imputes the treated counterfactual of a block-adoption causal panel by a four-component sieve + nuclear-norm matrix estimator that exploits unit-level (row) and time-level (column) covariates. Inherits the standard df / outcome / treat / unitid / time interface.

Parameters:
  • unit_covariates (list of str) – Columns that vary across units and are (approximately) constant within a unit; averaged per unit to form the row feature matrix X. May be empty (then the row projection captures only the mean).

  • time_covariates (list of str) – Columns that vary across periods and are constant across units within a period; averaged per period to form the column feature matrix Z. May be empty.

  • sieve_order (int) – Polynomial sieve order J (default 2, matching the paper).

  • rank (int, optional) – Factor rank for the block recombination; chosen by a relative-magnitude singular-value threshold when None.

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

rank: int | None#
sieve_order: int#
time_covariates: List[str]#
unit_covariates: List[str]#

Result Containers#

RMSI.fit() returns a RMSIResults: the ATT, the imputed counterfactual matrix counterfactual and per-cell effects, att_by_period, the cross-treated-unit observed/imputed means the plotter draws, the factor rank used, and metadata.

Frozen dataclasses for the RMSI estimator.

Agarwal, Choi & Yuan (2026), Robust Matrix Estimation with Side Information (arXiv:2603.24833). RMSI imputes the treated counterfactual of a block-adoption causal panel by a four-component sieve + nuclear-norm matrix estimator that exploits unit-level (row) and time-level (column) covariates.

class mlsynth.utils.rmsi_helpers.structures.RMSIInputs(Y: ndarray, D: ndarray, X: ndarray, Z: ndarray, T0: int, treated_idx: ndarray, control_idx: ndarray, unit_names: List[Any], time_labels: ndarray, unit_covariates: List[str], time_covariates: List[str])#

Bases: object

Preprocessed block-adoption panel with side information for RMSI.

Y#

Outcomes, shape (N, T).

Type:

np.ndarray

D#

Treatment indicators, shape (N, T) (1 where treated).

Type:

np.ndarray

X#

Row (unit) covariates, shape (N, d1) – one row per unit.

Type:

np.ndarray

Z#

Column (time) covariates, shape (T, d2) – one row per period.

Type:

np.ndarray

T0#

Number of clean pre-treatment periods (common adoption at T0).

Type:

int

treated_idx, control_idx

Indices of ever-treated and never-treated units.

Type:

np.ndarray

unit_names, time_labels
Type:

sequence

unit_covariates, time_covariates

The covariate column names used to build X and Z.

Type:

list of str

D: ndarray#
property N: int#
property T: int#
T0: int#
X: ndarray#
Y: ndarray#
Z: ndarray#
control_idx: ndarray#
time_covariates: List[str]#
time_labels: ndarray#
treated_idx: ndarray#
unit_covariates: List[str]#
unit_names: List[Any]#
class mlsynth.utils.rmsi_helpers.structures.RMSIResults(*, effects: ~mlsynth.config_models.EffectsResults | None = None, fit_diagnostics: ~mlsynth.config_models.FitDiagnosticsResults | None = None, time_series: ~mlsynth.config_models.TimeSeriesResults | None = None, weights: ~mlsynth.config_models.WeightsResults | None = None, inference: ~mlsynth.config_models.InferenceResults | None = None, method_details: ~mlsynth.config_models.MethodDetailsResults | None = None, sub_method_results: ~typing.Dict[str, ~typing.Any] | None = None, additional_outputs: ~typing.Dict[str, ~typing.Any] | None = None, raw_results: ~typing.Dict[str, ~typing.Any] | None = None, execution_summary: ~typing.Dict[str, ~typing.Any] | None = None, plot_config: ~mlsynth.config_models.PlotConfig | None = None, inputs: ~mlsynth.utils.rmsi_helpers.structures.RMSIInputs, counterfactual_matrix: ~numpy.ndarray, effects_matrix: ~numpy.ndarray, att_by_period: ~typing.Dict[~typing.Any, float], treated_mean: ~numpy.ndarray, synthetic_mean: ~numpy.ndarray, rank: int, components: ~typing.Dict[str, ~numpy.ndarray] | None = None, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#

Bases: BaseEstimatorResults

Top-level container returned by mlsynth.RMSI.fit().

An EffectResult (the observational report): in addition to the RMSI-specific fields below it exposes the standardized sub-models (effects, time_series, weights, inference, fit_diagnostics, method_details) and the flat accessors att / counterfactual / gap / pre_rmse. The treated counterfactual path (res.counterfactual) is the cross-treated-unit synthetic mean; the full imputed (N, T) matrix lives in counterfactual_matrix.

Parameters:
  • inputs (RMSIInputs)

  • counterfactual_matrix (np.ndarray) – Estimated untreated potential outcomes M_hat, shape (N, T). (Renamed from counterfactual, which now returns the 1-D treated path per the result contract.)

  • effects_matrix (np.ndarray) – Observed minus imputed on treated cells (NaN elsewhere), shape (N, T). (Renamed from effects, which is now the standardized EffectsResults slot.)

  • att_by_period (dict) – {time_label: ATT} over the post-treatment periods.

  • treated_mean, synthetic_mean (np.ndarray) – Cross-treated-unit means of observed / imputed over the full timeline (the series the plotter draws), length T.

  • rank (int) – Factor rank used in the block recombination.

  • components (dict, optional) – The four components of the “tall” fit (diagnostic), if retained.

  • metadata (dict)

att_by_period: Dict[Any, float]#
components: Dict[str, np.ndarray] | None#
counterfactual_matrix: np.ndarray#
effects_matrix: np.ndarray#
inputs: RMSIInputs#
metadata: Dict[str, Any]#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'frozen': True, 'json_encoders': {<class 'numpy.ndarray'>: <function BaseEstimatorResults.Config.<lambda>>}}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

rank: int#
synthetic_mean: np.ndarray#
treated_mean: np.ndarray#

Helper Modules#

The numerical core: sieve bases, projections, singular-value soft-thresholding, Algorithm 1 (four-component) and Algorithm 3 (block-missing recombination).

Numerical core for RMSI (Agarwal, Choi & Yuan 2026).

Implements the four-component sieve + nuclear-norm estimator (their Algorithm 1) and the block-missing causal extension (their Algorithm 3). Both reduce to projections and singular-value soft-thresholding – no iterative solver.

Algorithm 1 decomposes M = M1 + M2 + M3 + M4:

  • M1 = PX Y PZ – explained by both row & column features

  • M2 = svt(PX Y (I-PZ), nu2/2) – row-feature-driven

  • M3 = svt((I-PX) Y PZ, nu3/2) – column-feature-driven

  • M4 = svt((I-PX) Y (I-PZ), nu4/2) – residual low-rank (no features)

where PX/PZ project onto sieve bases of the row/column covariates, the penalised least-squares argmin ||B - A||_F^2 + nu||A||_* has the closed form svt(B, nu/2), and nu2 = C2 sqrt(T), nu3 = C3 sqrt(N), nu4 = C4 (sqrt(N) + sqrt(T)).

mlsynth.utils.rmsi_helpers.core.algorithm1(Y: ndarray, X: ndarray, Z: ndarray, *, J: int = 2, C2: float = 1.0, C3: float = 1.0, C4: float = 1.0, PX: ndarray | None = None, PZ: ndarray | None = None)#

Four-component sieve + SVT estimator (Algorithm 1) for a full matrix.

Parameters:
  • Y (np.ndarray, shape (N, T)) – Fully observed outcome matrix.

  • X (np.ndarray, shape (N, d1)) – Row (unit) covariates.

  • Z (np.ndarray, shape (T, d2)) – Column (time) covariates.

  • J (int) – Sieve order.

  • C2, C3, C4 (float) – Penalty constants for the three soft-thresholded components.

  • PX, PZ (np.ndarray, optional) – Precomputed projectors (built from the sieve bases if omitted).

Returns:

  • M_hat (np.ndarray, shape (N, T)) – The aggregated estimate M1 + M2 + M3 + M4.

  • components (dict) – {"M1", "M2", "M3", "M4"} – the individual components.

mlsynth.utils.rmsi_helpers.core.algorithm3(Y: ndarray, X: ndarray, Z: ndarray, *, control_idx: ndarray, T0: int, J: int = 2, rank: int | None = None, C2: float = 1.0, C3: float = 1.0, C4: float = 1.0)#

Block-missing causal estimator (Algorithm 3).

Imputes the full matrix from the “tall” submatrix (all units, pre-treatment periods) and the “wide” submatrix (control units, all periods), recombining via M_hat = U_tall H_adj D_wide V_wide' with a rotation H_adj that aligns the wide left-singular vectors to the tall ones on the control rows.

Parameters:
  • Y (np.ndarray, shape (N, T)) – Outcome matrix; the treated post-treatment block is treated as missing.

  • X, Z (np.ndarray) – Row and column covariates.

  • control_idx (np.ndarray) – Indices of never-treated (control) units (the “wide” rows).

  • T0 (int) – Number of clean pre-treatment periods (the “tall” columns).

  • J (int) – Sieve order.

  • rank (int, optional) – Factor rank K; a relative-magnitude singular-value threshold (_auto_rank()) is used when omitted.

  • C2, C3, C4 (float) – Penalty constants forwarded to algorithm1().

Returns:

  • M_hat (np.ndarray, shape (N, T)) – The completed matrix (the imputed counterfactual everywhere).

  • rank (int) – The factor rank used.

mlsynth.utils.rmsi_helpers.core.sieve_poly(C: ndarray, J: int = 2) ndarray#

Polynomial sieve basis (with intercept) of a covariate matrix.

For J = 2 each covariate c contributes c and c**2; a constant column is prepended so the projection captures means.

Parameters:
  • C (np.ndarray, shape (n, d)) – Covariate matrix (rows are units/periods).

  • J (int) – Polynomial sieve order.

Returns:

np.ndarray, shape (n, 1 + d*J) – Basis matrix [1, c_1, c_1^2, ..., c_d, c_d^2, ...].

Panel + side-information ingestion (block design enforced).

Panel + side-information ingestion for the RMSI estimator.

mlsynth.utils.rmsi_helpers.setup.prepare_rmsi_inputs(df: DataFrame, outcome: str, treat: str, unitid: str, time: str, unit_covariates: List[str] | None = None, time_covariates: List[str] | None = None) RMSIInputs#

Pivot a long panel into block matrices and extract side information.

RMSI (Agarwal, Choi & Yuan 2026) assumes a block design: every ever-treated unit adopts at the same period T0. unit_covariates are columns (approximately) constant within a unit -> the row feature matrix X (one row per unit, time-averaged); time_covariates are columns constant within a period -> the column feature matrix Z (one row per period, unit-averaged). Either may be empty, in which case the corresponding projection captures only the mean.

Run loop: Algorithm 3, the ATT and per-period effects, and the imputed series.

Orchestration for the RMSI estimator (Agarwal, Choi & Yuan 2026).

mlsynth.utils.rmsi_helpers.pipeline.run_rmsi(inputs: RMSIInputs, *, J: int = 2, rank: int | None = None, C2: float = 1.0, C3: float = 1.0, C4: float = 1.0) RMSIResults#

Run RMSI (Algorithm 3) and assemble RMSIResults.

Parameters:
  • inputs (RMSIInputs)

  • J (int) – Polynomial sieve order (default 2, matching the paper).

  • rank (int, optional) – Factor rank; estimated by the eigenvalue-ratio method if omitted.

  • C2, C3, C4 (float) – Penalty constants for the soft-thresholded components.

Block causal DGP with side information for examples and tests.

Block-adoption causal DGP with side information for RMSI examples/tests.

Mirrors the regime RMSI targets: an outcome matrix whose untreated potential outcomes are driven by (nonlinear) functions of unit covariates X and time covariates Z plus a residual low-rank part, with a block of treated units adopting at a common period. Returns a tidy long panel (with the covariate columns) ready for mlsynth.RMSI.

mlsynth.utils.rmsi_helpers.simulation.simulate_rmsi_panel(n_units: int = 40, n_treated: int = 8, T0: int = 20, n_post: int = 11, d_unit: int = 2, d_time: int = 2, att: float = 5.0, noise: float = 0.5, seed: int = 0) DataFrame#

Simulate a block, side-information, low-rank causal panel.

Parameters:
  • n_units, n_treated (int) – Total units and the number treated (treated units adopt at T0).

  • T0, n_post (int) – Pre- and post-treatment period counts.

  • d_unit, d_time (int) – Numbers of unit (row) and time (column) covariates.

  • att (float) – Constant additive treatment effect on treated post cells.

  • noise (float) – Idiosyncratic-noise standard deviation.

  • seed (int) – RNG seed.

Returns:

pandas.DataFrame – Long panel with columns unit, time, Y, treated, x0.. (unit covariates), z0.. (time covariates).

Path-A (Proposition 99) and Path-B (synthetic Monte-Carlo) replications.

Replications of Agarwal, Choi & Yuan (2026).

  • Path B – the paper’s synthetic Monte-Carlo study (Section 5.1): the four-component DGP under the block-missing (MNAR) pattern. RMSI recovers the missing block at a lower error than the no-side-information baseline.

  • Path A – the empirical Proposition 99 / tobacco application (Section 5.2), run on the data shipped in basedata/P99data.csv.

Both run through the public estimator surface / the documented helper functions.

class mlsynth.utils.rmsi_helpers.replication.RMSISimConfig(N: int = 400, T: int = 400, N0: int = 200, T0: int = 200, J: int = 5, sigma: float = 0.5, alphas: tuple = (0.25, 0.25, 0.25, 0.25), n_reps: int = 100)#

Parameters for the RMSI synthetic Monte-Carlo (paper Section 5.1, MNAR).

J: int = 5#
N: int = 400#
N0: int = 200#
T: int = 400#
T0: int = 200#
alphas: tuple = (0.25, 0.25, 0.25, 0.25)#
n_reps: int = 100#
sigma: float = 0.5#
mlsynth.utils.rmsi_helpers.replication.replicate_prop99(data: str | DataFrame | None = None, *, rank: int | None = 3, sieve_order: int = 2, verbose: bool = True)#

Estimate California’s Proposition 99 effect with RMSI (Path A).

Loads the Abadie et al. (2010) tobacco panel (basedata/P99data.csv), treats California from 1989, uses the state-level Abadie predictors (lnincome, beer, age15to24, retprice) as unit covariates and the year-average retail price as the time covariate, and runs mlsynth.RMSI.

Parameters:
  • data (str or pandas.DataFrame, optional) – Path/URL to the panel, or a DataFrame. Default downloads P99data.csv from basedata/.

  • rank (int, optional) – Factor rank (default 3; suits the single-treated-unit case – the eigenvalue-ratio default tends to pick rank 1 here).

  • sieve_order (int) – Polynomial sieve order.

Returns:

mlsynth.utils.rmsi_helpers.structures.RMSIResults

mlsynth.utils.rmsi_helpers.replication.run_rmsi_simulation(cfg: RMSISimConfig = RMSISimConfig(N=120, T=120, N0=60, T0=60, J=5, sigma=0.5, alphas=(0.25, 0.25, 0.25, 0.25), n_reps=10), *, seed: int = 0, verbose: bool = True) Dict#

Run the synthetic MNAR Monte-Carlo and return missing-block AMSE.

Compares RMSI (with side information) to a no-side-information baseline (the same block estimator with empty covariates, i.e. a de-meaned low-rank completion) by the average mean squared error over the missing block.

Returns:

dict{"rmsi": amse_side_info, "no_side_info": amse_baseline, "rel_improvement": ...}.

mlsynth.utils.rmsi_helpers.replication.simulate_rmsi_dgp(N: int, T: int, *, alphas=(0.25, 0.25, 0.25, 0.25), sigma: float = 0.5, seed: int = 0)#

Generate (Y, M, X, Z) from the paper’s four-component DGP (Section 5.1).

Eight characteristics (four per margin) with the paper’s distributions; M = sum_r alpha_r M_r with each M_r normalised to ||M_r||_F = sqrt(2 N T); noise N(0, sigma^2).