Robust Matrix estimation with Side Information (RMSI)#
When to Use This Estimator#
RMSI implements the side-information matrix estimator of Agarwal, Choi and
Yuan [RMSI]. It is a causal-panel matrix-completion method – like
Matrix Completion with Nuclear Norm Minimization (MCNNM), it imputes the treated units’ missing counterfactual cells – but
it exploits covariates on both margins of the panel: unit-level (row)
characteristics \(\mathbf{X}\) and time-level (column) characteristics
\(\mathbf{Z}\).
Its robustness comes from decomposing the target matrix into four complementary pieces and estimating each separately:
Unlike inductive matrix completion (which forces an exact low-rank linear
covariate-interaction term and no genuine noise component), this decomposition
accommodates nonlinear covariate effects, a part explained by only one
margin, and a part explained by neither – and it degrades gracefully when
the covariates are uninformative (it then reduces to a de-meaned low-rank
completion). Reach for RMSI when you have a block-adoption causal panel,
informative unit and/or time covariates, and you expect those covariates to
carry signal the raw outcome matrix alone would estimate noisily.
Do not use RMSI when#
Adoption is staggered (treated units switch on at different times). RMSI assumes a block design; use Matrix Completion with Nuclear Norm Minimization (MCNNM), Staggered Synthetic Control (SSC), Synthetic Difference-in-Differences (SDID), or Partially Pooled SCM (PPSCM).
You have no covariates and no reason to expect one margin’s structure to help. With no side information RMSI reduces to a low-rank completion, so Matrix Completion with Nuclear Norm Minimization (MCNNM) (purpose-built, with inference) is the more direct choice.
An interpretable donor-weight story is the deliverable – RMSI imputes a matrix, not a sparse convex combination; use Two-Step Synthetic Control/Synthetic Control with Multiple Outcomes (SCMO).
Spillovers contaminate the controls (SUTVA fails) – use Spatial Synthetic Difference-in-Differences (SpSyDiD) or Spillover-Aware Synthetic Control (SPILLSYNTH).
Notation#
The panel has \(N\) units \(\mathcal{N} \coloneqq \{1, \dots, N\}\) and \(T\) periods \(t \in \mathcal{T} \coloneqq \{1, \dots, T\}\), 1-indexed; the intervention takes effect after period \(T_0\), splitting \(\mathcal{T}\) into the pre-period \(\mathcal{T}_1 \coloneqq \{t \in \mathcal{T} : t \le T_0\}\) and the post-period \(\mathcal{T}_2 \coloneqq \{t \in \mathcal{T} : t > T_0\}\). A block of treated units adopts at the common time \(T_0\); the remaining units are controls. The outcome of unit \(i\) at time \(t\) is \(y_{it}\), collected into the \(T \times N\) matrix \(\mathbf{Y}\) (one column per unit). Unit-level covariates are averaged per unit into the row feature matrix \(\mathbf{X}\) and time-level covariates per period into \(\mathbf{Z}\); \(\mathbf{P}_X, \mathbf{P}_Z\) are the sieve projectors onto their spans. The target counterfactual matrix is \(\mathbf{M}\) with the no-intervention entries \(y_{it}^N\); its estimate is \(\widehat{\mathbf{M}}\), with imputed entries \(\widehat{y}_{it}\). The per-cell effect is \(\tau_{it} \coloneqq y_{it} - \widehat{y}_{it}\) (estimating \(y_{it}^I - y_{it}^N\)), and the ATT \(\widehat{\tau}\) averages \(\tau_{it}\) over the treated cells in \(\mathcal{T}_2\).
Identifying assumptions#
Block adoption. All treated units adopt at the common time \(T_0\), so the missing cells form a single rectangular block (treated units over \(\mathcal{T}_2\)); the tall submatrix (all units over \(\mathcal{T}_1\)) and the wide submatrix (controls over \(\mathcal{T}\)) are fully observed.
Remark. This is what lets Algorithm 1 run on the two observed submatrices and Algorithm 3 recombine their singular subspaces. Staggered adoption breaks the rectangular missingness pattern; use Matrix Completion with Nuclear Norm Minimization (MCNNM), Staggered Synthetic Control (SSC), Synthetic Difference-in-Differences (SDID), or Partially Pooled SCM (PPSCM) instead.
Side-information structure. The no-intervention matrix decomposes into the four components \(\mathbf{M} = \mathbf{M}_1 + \mathbf{M}_2 + \mathbf{M}_3 + \mathbf{M}_4\) – explained by both margins, the row margin only, the column margin only, and a residual low-rank part – with the margin-driven pieces captured by (possibly nonlinear) functions of \(\mathbf{X}\) and \(\mathbf{Z}\) realised through the sieve bases.
Remark. When the covariates are uninformative the first three components vanish and the model reduces to a de-meaned low-rank completion, recovering the no-side-information baseline – the source of RMSI’s graceful degradation.
No anticipation and SUTVA. Pre-period outcomes reflect the no-intervention path, \(y_{it} = y_{it}^N\) for \(t \in \mathcal{T}_1\), and controls are untreated and uncontaminated over \(\mathcal{T}\), so the observed submatrices carry only no-intervention outcomes.
Remark. Spillovers onto the controls or a pre-\(T_0\) response bias the imputed block. Quarantine contaminated controls before fitting; if spillovers are intrinsic, use Spatial Synthetic Difference-in-Differences (SpSyDiD) or Spillover-Aware Synthetic Control (SPILLSYNTH).
The estimator#
Algorithm 1 (fully observed). With polynomial sieve bases of the covariates and projectors \(\mathbf{P}_X, \mathbf{P}_Z\) onto them, the component explained by both margins is \(\widehat{\mathbf{M}}_1 = \mathbf{P}_X \mathbf{Y} \mathbf{P}_Z\), and the other three are singular-value soft-thresholds of the projected residuals (the penalised least squares \(\operatorname*{argmin}_{\mathbf{A}} \lVert \mathbf{B} - \mathbf{A}\rVert_F^2 + \nu\lVert \mathbf{A}\rVert_*\) has the closed form \(\operatorname{svt}(\mathbf{B}, \nu/2)\)):
with \(\nu_2 = C_2\sqrt{T}\), \(\nu_3 = C_3\sqrt{N}\), \(\nu_4 = C_4(\sqrt{N}+\sqrt{T})\). The estimate is \(\widehat{\mathbf{M}} = \widehat{\mathbf{M}}_1 + \widehat{\mathbf{M}}_2 + \widehat{\mathbf{M}}_3 + \widehat{\mathbf{M}}_4\) – only projections and SVDs, no iterative solver.
Algorithm 3 (block-missing causal). The treated post-treatment cells form a
missing block. RMSI applies Algorithm 1 to the fully observed tall submatrix
(all units, pre-treatment periods) and wide submatrix (control units, all
periods), then recombines their singular subspaces,
\(\widehat{\mathbf{M}} = \widehat{\mathbf{U}}_{\text{tall}}\, \widehat{\mathbf{H}}\, \widehat{\mathbf{D}}_{\text{wide}} \widehat{\mathbf{V}}_{\text{wide}}^\top\),
where \(\widehat{\mathbf{H}}\) rotates the wide left-singular vectors onto
the tall ones over the control rows. The ATT
\(\widehat{\tau}\) is the observed minus the imputed outcome averaged over
the treated post-treatment cells \(\mathcal{T}_2\), with per-cell effects
\(\tau_{it} \coloneqq y_{it} - \widehat{y}_{it}\). RMSI targets the
block (common adoption time) setting.
Side information#
Pass the covariate columns through the config: unit_covariates are columns
(approximately) constant within a unit – averaged per unit to form the row
feature matrix \(\mathbf{X}\) – and time_covariates are columns constant
within a period – averaged per period to form \(\mathbf{Z}\). Either may be
empty.
Example#
A block panel of forty units (eight treated at period 20) whose untreated
outcomes are driven by nonlinear functions of two unit covariates and two time
covariates plus a residual low-rank part, with a constant effect of +5:
from mlsynth import RMSI
from mlsynth.utils.rmsi_helpers.simulation import simulate_rmsi_panel
df = simulate_rmsi_panel(
n_units=40, n_treated=8, T0=20, n_post=11, att=5.0, seed=0,
)
res = RMSI({
"df": df, "outcome": "Y", "treat": "treated",
"unitid": "unit", "time": "time",
"unit_covariates": ["x0", "x1"], # row side information
"time_covariates": ["z0", "z1"], # column side information
"sieve_order": 2,
"display_graphs": True, # observed vs. imputed counterfactual
}).fit()
print(f"ATT (true 5.0) = {res.att:+.3f} [rank {res.rank}]")
Replication#
Both of the paper’s empirical pieces are reproduced through the public API.
Path A – Proposition 99 (Section 5.2). Using the Abadie et al. (2010)
tobacco panel shipped at
basedata/P99data.csv,
RMSI treats California from 1989 and uses the state-level Abadie predictors
as unit covariates and the year-average retail price as the time covariate:
from mlsynth.utils.rmsi_helpers.replication import replicate_prop99
# downloads basedata/P99data.csv by default; pass a path/DataFrame to override
res = replicate_prop99(rank=3)
# -> California Proposition 99 ATT ~ -21 packs/capita (1989 ~ -7, 2000 ~ -32)
The estimated ATT of about \(-21\) packs per capita is consistent with
Abadie, Diamond & Hainmueller [ABADIE2010] and with the Matrix Completion with Nuclear Norm Minimization (MCNNM) / Synthetic Nearest Neighbors / Causal Matrix Completion (SNN)
estimates elsewhere in mlsynth. The equivalent explicit call:
import pandas as pd
from mlsynth import RMSI
url = ("https://raw.githubusercontent.com/jgreathouse9/mlsynth/"
"main/basedata/P99data.csv")
d = pd.read_csv(url)
for c in ["lnincome", "beer", "age15to24", "retprice"]:
d[c] = d.groupby("state")[c].transform(lambda s: s.fillna(s.mean()))
d[c] = d[c].fillna(d[c].mean())
d["treated"] = ((d["state"] == "California") & (d["year"] >= 1989)).astype(int)
res = RMSI({"df": d, "outcome": "cigsale", "treat": "treated",
"unitid": "state", "time": "year",
"unit_covariates": ["lnincome", "beer", "age15to24", "retprice"],
"time_covariates": ["retprice"], "rank": 3,
"display_graphs": False}).fit()
print(res.att)
Path B – synthetic Monte Carlo (Section 5.1). The paper’s four-component DGP under the block-missing (MNAR) pattern; RMSI imputes the missing block at a lower average mean-squared error than the no-side-information baseline:
from mlsynth.utils.rmsi_helpers.replication import (
run_rmsi_simulation, RMSISimConfig, PAPER,
)
# quick, reduced-count preset (use PAPER for the full N=T=400, 100-rep study)
out = run_rmsi_simulation(RMSISimConfig(N=120, T=120, N0=60, T0=60,
J=5, n_reps=10))
# -> AMSE side-info < AMSE no-side-info (side information lowers the error)
Verification#
Note
Path B (synthetic). On the paper’s four-component MNAR DGP
(simulate_rmsi_dgp()), RMSI’s
block imputation achieves a lower missing-block AMSE than the
no-side-information baseline – reproducing the paper’s central finding that
leveraging side information improves imputation accuracy.
Path A (Proposition 99). replicate_prop99 recovers a California
Proposition 99 ATT of about \(-21\) packs per capita (widening toward
\(-32\) by 2000), matching the Abadie-Diamond-Hainmueller [ABADIE2010]
baseline and the other mlsynth estimators on the same panel.
Core API#
RMSI: Robust Matrix estimation with Side Information (Agarwal, Choi & Yuan 2026).
Agarwal, A., Choi, J. & Yuan, M. (2026). “Robust Matrix Estimation with Side Information.” arXiv:2603.24833.
RMSI estimates a panel’s untreated-outcome matrix while exploiting covariates on both margins – unit-level (row) characteristics \(X\) and time-level (column) characteristics \(Z\). It decomposes the target matrix into four complementary pieces,
and estimates each by sieve projection plus nuclear-norm penalization: with projectors \(P_X, P_Z\) onto polynomial sieve bases of the covariates, the component explained by both margins is \(\widehat M_1 = P_X Y P_Z\), and the remaining three are singular-value soft-thresholds of the corresponding projected residuals (the penalised least squares has that closed form). Because the pieces are recovered separately and summed, the method is robust: it accommodates nonlinear covariate effects, parts explained by only one margin, and a part explained by neither – and degrades gracefully when the covariates are uninformative (it then reduces to a de-meaned low-rank completion).
For causal panel data the treated cells form a missing block. RMSI applies the estimator to the fully observed “tall” submatrix (all units, pre-treatment periods) and “wide” submatrix (control units, all periods) and recombines their singular subspaces to impute the missing treated counterfactual; the ATT is the observed minus the imputed outcome over the treated cells. This estimator targets the block (common adoption time) causal setting.
- class mlsynth.estimators.rmsi.RMSI(config: RMSIConfig | dict)#
Bases:
objectRobust Matrix estimation with Side Information.
- Parameters:
config (RMSIConfig or dict) – Configuration object. See
mlsynth.config_models.RMSIConfig.
- fit() RMSIResults#
Run RMSI and return
RMSIResults.
Configuration#
- class mlsynth.config_models.RMSIConfig(*, df: ~pandas.DataFrame, outcome: str, treat: str, unitid: str, time: str, display_graphs: bool = True, save: bool | str = False, counterfactual_color: ~typing.List[str] = <factory>, treated_color: str = 'black', plot: ~mlsynth.config_models.PlotConfig = <factory>, unit_covariates: ~typing.List[str] = <factory>, time_covariates: ~typing.List[str] = <factory>, sieve_order: ~typing.Annotated[int, ~annotated_types.Ge(ge=1)] = 2, rank: ~typing.Annotated[int | None, ~annotated_types.Ge(ge=1)] = None)#
Configuration for the RMSI estimator.
Agarwal, Choi & Yuan (2026), “Robust Matrix Estimation with Side Information” (arXiv:2603.24833). Imputes the treated counterfactual of a block-adoption causal panel by a four-component sieve + nuclear-norm matrix estimator that exploits unit-level (row) and time-level (column) covariates. Inherits the standard
df/outcome/treat/unitid/timeinterface.- Parameters:
unit_covariates (list of str) – Columns that vary across units and are (approximately) constant within a unit; averaged per unit to form the row feature matrix
X. May be empty (then the row projection captures only the mean).time_covariates (list of str) – Columns that vary across periods and are constant across units within a period; averaged per period to form the column feature matrix
Z. May be empty.sieve_order (int) – Polynomial sieve order
J(default 2, matching the paper).rank (int, optional) – Factor rank for the block recombination; chosen by a relative-magnitude singular-value threshold when None.
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
Result Containers#
RMSI.fit() returns a
RMSIResults: the ATT, the imputed
counterfactual matrix counterfactual and per-cell effects,
att_by_period, the cross-treated-unit observed/imputed means the plotter
draws, the factor rank used, and metadata.
Frozen dataclasses for the RMSI estimator.
Agarwal, Choi & Yuan (2026), Robust Matrix Estimation with Side Information (arXiv:2603.24833). RMSI imputes the treated counterfactual of a block-adoption causal panel by a four-component sieve + nuclear-norm matrix estimator that exploits unit-level (row) and time-level (column) covariates.
- class mlsynth.utils.rmsi_helpers.structures.RMSIInputs(Y: ndarray, D: ndarray, X: ndarray, Z: ndarray, T0: int, treated_idx: ndarray, control_idx: ndarray, unit_names: List[Any], time_labels: ndarray, unit_covariates: List[str], time_covariates: List[str])#
Bases:
objectPreprocessed block-adoption panel with side information for RMSI.
- Y#
Outcomes, shape
(N, T).- Type:
np.ndarray
- D#
Treatment indicators, shape
(N, T)(1where treated).- Type:
np.ndarray
- X#
Row (unit) covariates, shape
(N, d1)– one row per unit.- Type:
np.ndarray
- Z#
Column (time) covariates, shape
(T, d2)– one row per period.- Type:
np.ndarray
- treated_idx, control_idx
Indices of ever-treated and never-treated units.
- Type:
np.ndarray
- unit_names, time_labels
- Type:
sequence
- unit_covariates, time_covariates
The covariate column names used to build
XandZ.
- D: ndarray#
- X: ndarray#
- Y: ndarray#
- Z: ndarray#
- control_idx: ndarray#
- time_labels: ndarray#
- treated_idx: ndarray#
- class mlsynth.utils.rmsi_helpers.structures.RMSIResults(*, effects: ~mlsynth.config_models.EffectsResults | None = None, fit_diagnostics: ~mlsynth.config_models.FitDiagnosticsResults | None = None, time_series: ~mlsynth.config_models.TimeSeriesResults | None = None, weights: ~mlsynth.config_models.WeightsResults | None = None, inference: ~mlsynth.config_models.InferenceResults | None = None, method_details: ~mlsynth.config_models.MethodDetailsResults | None = None, sub_method_results: ~typing.Dict[str, ~typing.Any] | None = None, additional_outputs: ~typing.Dict[str, ~typing.Any] | None = None, raw_results: ~typing.Dict[str, ~typing.Any] | None = None, execution_summary: ~typing.Dict[str, ~typing.Any] | None = None, plot_config: ~mlsynth.config_models.PlotConfig | None = None, inputs: ~mlsynth.utils.rmsi_helpers.structures.RMSIInputs, counterfactual_matrix: ~numpy.ndarray, effects_matrix: ~numpy.ndarray, att_by_period: ~typing.Dict[~typing.Any, float], treated_mean: ~numpy.ndarray, synthetic_mean: ~numpy.ndarray, rank: int, components: ~typing.Dict[str, ~numpy.ndarray] | None = None, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#
Bases:
BaseEstimatorResultsTop-level container returned by
mlsynth.RMSI.fit().An
EffectResult(the observational report): in addition to the RMSI-specific fields below it exposes the standardized sub-models (effects,time_series,weights,inference,fit_diagnostics,method_details) and the flat accessorsatt/counterfactual/gap/pre_rmse. The treated counterfactual path (res.counterfactual) is the cross-treated-unit synthetic mean; the full imputed(N, T)matrix lives incounterfactual_matrix.- Parameters:
inputs (RMSIInputs)
counterfactual_matrix (np.ndarray) – Estimated untreated potential outcomes
M_hat, shape(N, T). (Renamed fromcounterfactual, which now returns the 1-D treated path per the result contract.)effects_matrix (np.ndarray) – Observed minus imputed on treated cells (NaN elsewhere), shape
(N, T). (Renamed fromeffects, which is now the standardizedEffectsResultsslot.)att_by_period (dict) –
{time_label: ATT}over the post-treatment periods.treated_mean, synthetic_mean (np.ndarray) – Cross-treated-unit means of observed / imputed over the full timeline (the series the plotter draws), length
T.rank (int) – Factor rank used in the block recombination.
components (dict, optional) – The four components of the “tall” fit (diagnostic), if retained.
metadata (dict)
- counterfactual_matrix: np.ndarray#
- effects_matrix: np.ndarray#
- inputs: RMSIInputs#
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'frozen': True, 'json_encoders': {<class 'numpy.ndarray'>: <function BaseEstimatorResults.Config.<lambda>>}}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- synthetic_mean: np.ndarray#
- treated_mean: np.ndarray#
Helper Modules#
The numerical core: sieve bases, projections, singular-value soft-thresholding, Algorithm 1 (four-component) and Algorithm 3 (block-missing recombination).
Numerical core for RMSI (Agarwal, Choi & Yuan 2026).
Implements the four-component sieve + nuclear-norm estimator (their Algorithm 1) and the block-missing causal extension (their Algorithm 3). Both reduce to projections and singular-value soft-thresholding – no iterative solver.
Algorithm 1 decomposes M = M1 + M2 + M3 + M4:
M1 = PX Y PZ– explained by both row & column featuresM2 = svt(PX Y (I-PZ), nu2/2)– row-feature-drivenM3 = svt((I-PX) Y PZ, nu3/2)– column-feature-drivenM4 = svt((I-PX) Y (I-PZ), nu4/2)– residual low-rank (no features)
where PX/PZ project onto sieve bases of the row/column covariates, the
penalised least-squares argmin ||B - A||_F^2 + nu||A||_* has the closed form
svt(B, nu/2), and nu2 = C2 sqrt(T), nu3 = C3 sqrt(N),
nu4 = C4 (sqrt(N) + sqrt(T)).
- mlsynth.utils.rmsi_helpers.core.algorithm1(Y: ndarray, X: ndarray, Z: ndarray, *, J: int = 2, C2: float = 1.0, C3: float = 1.0, C4: float = 1.0, PX: ndarray | None = None, PZ: ndarray | None = None)#
Four-component sieve + SVT estimator (Algorithm 1) for a full matrix.
- Parameters:
Y (np.ndarray, shape (N, T)) – Fully observed outcome matrix.
X (np.ndarray, shape (N, d1)) – Row (unit) covariates.
Z (np.ndarray, shape (T, d2)) – Column (time) covariates.
J (int) – Sieve order.
C2, C3, C4 (float) – Penalty constants for the three soft-thresholded components.
PX, PZ (np.ndarray, optional) – Precomputed projectors (built from the sieve bases if omitted).
- Returns:
M_hat (np.ndarray, shape (N, T)) – The aggregated estimate
M1 + M2 + M3 + M4.components (dict) –
{"M1", "M2", "M3", "M4"}– the individual components.
- mlsynth.utils.rmsi_helpers.core.algorithm3(Y: ndarray, X: ndarray, Z: ndarray, *, control_idx: ndarray, T0: int, J: int = 2, rank: int | None = None, C2: float = 1.0, C3: float = 1.0, C4: float = 1.0)#
Block-missing causal estimator (Algorithm 3).
Imputes the full matrix from the “tall” submatrix (all units, pre-treatment periods) and the “wide” submatrix (control units, all periods), recombining via
M_hat = U_tall H_adj D_wide V_wide'with a rotationH_adjthat aligns the wide left-singular vectors to the tall ones on the control rows.- Parameters:
Y (np.ndarray, shape (N, T)) – Outcome matrix; the treated post-treatment block is treated as missing.
X, Z (np.ndarray) – Row and column covariates.
control_idx (np.ndarray) – Indices of never-treated (control) units (the “wide” rows).
T0 (int) – Number of clean pre-treatment periods (the “tall” columns).
J (int) – Sieve order.
rank (int, optional) – Factor rank
K; a relative-magnitude singular-value threshold (_auto_rank()) is used when omitted.C2, C3, C4 (float) – Penalty constants forwarded to
algorithm1().
- Returns:
M_hat (np.ndarray, shape (N, T)) – The completed matrix (the imputed counterfactual everywhere).
rank (int) – The factor rank used.
- mlsynth.utils.rmsi_helpers.core.sieve_poly(C: ndarray, J: int = 2) ndarray#
Polynomial sieve basis (with intercept) of a covariate matrix.
For
J = 2each covariateccontributescandc**2; a constant column is prepended so the projection captures means.- Parameters:
C (np.ndarray, shape (n, d)) – Covariate matrix (rows are units/periods).
J (int) – Polynomial sieve order.
- Returns:
np.ndarray, shape (n, 1 + d*J) – Basis matrix
[1, c_1, c_1^2, ..., c_d, c_d^2, ...].
Panel + side-information ingestion (block design enforced).
Panel + side-information ingestion for the RMSI estimator.
- mlsynth.utils.rmsi_helpers.setup.prepare_rmsi_inputs(df: DataFrame, outcome: str, treat: str, unitid: str, time: str, unit_covariates: List[str] | None = None, time_covariates: List[str] | None = None) RMSIInputs#
Pivot a long panel into block matrices and extract side information.
RMSI (Agarwal, Choi & Yuan 2026) assumes a block design: every ever-treated unit adopts at the same period
T0.unit_covariatesare columns (approximately) constant within a unit -> the row feature matrixX(one row per unit, time-averaged);time_covariatesare columns constant within a period -> the column feature matrixZ(one row per period, unit-averaged). Either may be empty, in which case the corresponding projection captures only the mean.
Run loop: Algorithm 3, the ATT and per-period effects, and the imputed series.
Orchestration for the RMSI estimator (Agarwal, Choi & Yuan 2026).
- mlsynth.utils.rmsi_helpers.pipeline.run_rmsi(inputs: RMSIInputs, *, J: int = 2, rank: int | None = None, C2: float = 1.0, C3: float = 1.0, C4: float = 1.0) RMSIResults#
Run RMSI (Algorithm 3) and assemble
RMSIResults.- Parameters:
inputs (RMSIInputs)
J (int) – Polynomial sieve order (default 2, matching the paper).
rank (int, optional) – Factor rank; estimated by the eigenvalue-ratio method if omitted.
C2, C3, C4 (float) – Penalty constants for the soft-thresholded components.
Block causal DGP with side information for examples and tests.
Block-adoption causal DGP with side information for RMSI examples/tests.
Mirrors the regime RMSI targets: an outcome matrix whose untreated potential
outcomes are driven by (nonlinear) functions of unit covariates X and time
covariates Z plus a residual low-rank part, with a block of treated units
adopting at a common period. Returns a tidy long panel (with the covariate
columns) ready for mlsynth.RMSI.
- mlsynth.utils.rmsi_helpers.simulation.simulate_rmsi_panel(n_units: int = 40, n_treated: int = 8, T0: int = 20, n_post: int = 11, d_unit: int = 2, d_time: int = 2, att: float = 5.0, noise: float = 0.5, seed: int = 0) DataFrame#
Simulate a block, side-information, low-rank causal panel.
- Parameters:
n_units, n_treated (int) – Total units and the number treated (treated units adopt at
T0).T0, n_post (int) – Pre- and post-treatment period counts.
d_unit, d_time (int) – Numbers of unit (row) and time (column) covariates.
att (float) – Constant additive treatment effect on treated post cells.
noise (float) – Idiosyncratic-noise standard deviation.
seed (int) – RNG seed.
- Returns:
pandas.DataFrame – Long panel with columns
unit,time,Y,treated,x0..(unit covariates),z0..(time covariates).
Path-A (Proposition 99) and Path-B (synthetic Monte-Carlo) replications.
Replications of Agarwal, Choi & Yuan (2026).
Path B – the paper’s synthetic Monte-Carlo study (Section 5.1): the four-component DGP under the block-missing (MNAR) pattern. RMSI recovers the missing block at a lower error than the no-side-information baseline.
Path A – the empirical Proposition 99 / tobacco application (Section 5.2), run on the data shipped in
basedata/P99data.csv.
Both run through the public estimator surface / the documented helper functions.
- class mlsynth.utils.rmsi_helpers.replication.RMSISimConfig(N: int = 400, T: int = 400, N0: int = 200, T0: int = 200, J: int = 5, sigma: float = 0.5, alphas: tuple = (0.25, 0.25, 0.25, 0.25), n_reps: int = 100)#
Parameters for the RMSI synthetic Monte-Carlo (paper Section 5.1, MNAR).
- mlsynth.utils.rmsi_helpers.replication.replicate_prop99(data: str | DataFrame | None = None, *, rank: int | None = 3, sieve_order: int = 2, verbose: bool = True)#
Estimate California’s Proposition 99 effect with RMSI (Path A).
Loads the Abadie et al. (2010) tobacco panel (
basedata/P99data.csv), treats California from 1989, uses the state-level Abadie predictors (lnincome,beer,age15to24,retprice) as unit covariates and the year-average retail price as the time covariate, and runsmlsynth.RMSI.- Parameters:
data (str or pandas.DataFrame, optional) – Path/URL to the panel, or a DataFrame. Default downloads
P99data.csvfrombasedata/.rank (int, optional) – Factor rank (default 3; suits the single-treated-unit case – the eigenvalue-ratio default tends to pick rank 1 here).
sieve_order (int) – Polynomial sieve order.
- Returns:
mlsynth.utils.rmsi_helpers.structures.RMSIResults
- mlsynth.utils.rmsi_helpers.replication.run_rmsi_simulation(cfg: RMSISimConfig = RMSISimConfig(N=120, T=120, N0=60, T0=60, J=5, sigma=0.5, alphas=(0.25, 0.25, 0.25, 0.25), n_reps=10), *, seed: int = 0, verbose: bool = True) Dict#
Run the synthetic MNAR Monte-Carlo and return missing-block AMSE.
Compares RMSI (with side information) to a no-side-information baseline (the same block estimator with empty covariates, i.e. a de-meaned low-rank completion) by the average mean squared error over the missing block.
- Returns:
dict –
{"rmsi": amse_side_info, "no_side_info": amse_baseline, "rel_improvement": ...}.
- mlsynth.utils.rmsi_helpers.replication.simulate_rmsi_dgp(N: int, T: int, *, alphas=(0.25, 0.25, 0.25, 0.25), sigma: float = 0.5, seed: int = 0)#
Generate
(Y, M, X, Z)from the paper’s four-component DGP (Section 5.1).Eight characteristics (four per margin) with the paper’s distributions;
M = sum_r alpha_r M_rwith eachM_rnormalised to||M_r||_F = sqrt(2 N T); noiseN(0, sigma^2).