Synthetic Nearest Neighbors / Causal Matrix Completion (SNN)

Overview#

SNN (Agarwal, A., Dahleh, M., Shah, D. & Shen, D. (2021). “Causal Matrix Completion,” arXiv:2109.15154) recovers the missing entries of a partially observed matrix when the data are missing not at random (MNAR) – the probability that an entry is observed depends on the underlying value. This selection bias is the norm in the two canonical matrix-completion applications:

Recommender systems: a user who dislikes horror films will almost never rate one, so the missingness pattern is informative about the ratings themselves.
Panel data / causal inference: policy-makers adopt programs for reasons correlated with outcomes, and competing policies cannot be observed simultaneously, so the potential-outcome matrix is systematically (not randomly) missing.

Classical matrix completion assumes data are missing completely at random (MCAR) and is biased under MNAR. SNN provides a causal framework and an estimator with entry-wise (max-norm) finite-sample consistency and asymptotic normality under MNAR.
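The selection bias is easy to see in a small simulation. The sketch below is not SNN and uses no mlsynth code; it assumes a hypothetical logistic observation rule in which larger values are more likely to be observed, and compares the mean of the observed entries with the true mean under MCAR and MNAR sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Latent values of a 200 x 500 matrix (population mean is 0).
A = rng.normal(size=(200, 500))

# Hypothetical MNAR rule: P(observed) rises with the entry's own value.
p_obs_mnar = 1.0 / (1.0 + np.exp(-2.0 * A))
observed_mnar = rng.random(A.shape) < p_obs_mnar

# MCAR benchmark: every entry is observed with probability 0.5.
observed_mcar = rng.random(A.shape) < 0.5

true_mean = A.mean()
mnar_mean = A[observed_mnar].mean()   # pulled upward by selection
mcar_mean = A[observed_mcar].mean()   # stays close to the true mean

print(f"true {true_mean:+.3f} | MCAR {mcar_mean:+.3f} | MNAR {mnar_mean:+.3f}")
```

Any estimator that treats the MNAR sample as representative inherits this upward shift, which is the failure mode SNN is designed to avoid.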

The algorithm#

To impute entry \((i, j)\), SNN combines nearest neighbors (collaborative filtering) with synthetic controls:

Anchor rows and columns. Find a fully observed submatrix \(S\) whose rows are observed in column \(j\) and whose columns are observed in row \(i\) (paper Section 4.2). The reference implementation finds these via a maximum-biclique search; mlsynth uses a dependency-free greedy search for a large fully observed block.
Principal component regression. Truncate the SVD of \(S\), regress row \(i\)’s anchor-column values \(q\) on \(S\) to learn weights \(\beta\), and apply them to column \(j\)’s anchor-row values \(x\): \(\widehat A_{ij} = \langle x, \beta \rangle\) (paper Algorithm 1).
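The PCR step can be sketched in a few lines of NumPy. This is an illustration of the two steps above under stated assumptions, not mlsynth's implementation: the function name `pcr_impute` is hypothetical, and the anchor rows, anchor columns, and rank are supplied by hand rather than searched for.

```python
import numpy as np

def pcr_impute(X, i, j, anchor_rows, anchor_cols, rank):
    """Impute X[i, j] by principal component regression on an anchor block."""
    S = X[np.ix_(anchor_rows, anchor_cols)]  # fully observed anchor submatrix
    q = X[i, anchor_cols]                    # row i on the anchor columns
    x = X[anchor_rows, j]                    # column j on the anchor rows

    # Rank-truncated SVD of S, then solve q ~ S^T beta in the retained subspace.
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    U, s, Vt = U[:, :rank], s[:rank], Vt[:rank]
    beta = U @ ((Vt @ q) / s)

    return float(x @ beta)                   # <x, beta>

X = np.array([[1.0, 2.0, np.nan],
              [2.0, 4.0, 6.0],
              [3.0, np.nan, 9.0]])

# Entry (0, 2): row 1 is observed in column 2; columns 0 and 1 are observed in row 0.
estimate = pcr_impute(X, i=0, j=2, anchor_rows=[1], anchor_cols=[0, 1], rank=1)
print(estimate)
```

The toy matrix is rank one (every row is a multiple of [1, 2, 3]), so the estimate recovers the missing value 3 up to floating-point error.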

SNN generalises Synthetic Interventions (Agarwal et al. 2021b, the mlsynth.SI estimator), which itself generalises classic synthetic control: the same PCR machinery is applied, but the anchor submatrix is found per entry rather than assuming a fixed treated/donor block, so SNN handles arbitrary (block-structured) MNAR patterns.

Why panel data is a natural fit#

A fully observed anchor block is essential. Under independent MCAR sampling a large fully observed submatrix is unlikely to exist, but the block-structured missingness of panel data – a control block observed throughout, with treated units missing their post-treatment \(y^N\) – naturally induces anchor rows (controls) and columns (pre-periods). SNN is therefore especially well suited to comparative case studies and staggered-adoption designs, exactly the setting this estimator targets.
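A toy observation mask makes the point concrete. The panel dimensions below are made up for illustration; the check confirms that, for a treated post-treatment cell, the control units and the pre-periods form a fully observed block.

```python
import numpy as np

N, T, T0 = 5, 6, 4                      # 5 units, 6 periods, treatment starts at column 4
observed = np.ones((N, T), dtype=bool)
observed[0, T0:] = False                # unit 0 is treated: its post-period y^N is missing

i, t = 0, 5                             # target cell: treated unit, last period
anchor_cols = np.flatnonzero(observed[i])      # periods observed for unit i (pre-periods)
anchor_rows = np.flatnonzero(observed[:, t])   # units observed in period t (controls)

block_is_full = observed[np.ix_(anchor_rows, anchor_cols)].all()
print(anchor_rows, anchor_cols, block_is_full)
```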

Causal use in mlsynth#

The SNN estimator masks the treated post-treatment cells as missing, imputes their untreated potential outcomes by SNN matrix completion, and forms the treatment effect as observed minus imputed:

\[\tau_{it} \coloneqq y_{it} - \widehat{y}_{it}^N, \qquad \widehat{\tau} \coloneqq \frac{1}{|\{(i,t): d_{it}=1\}|} \sum_{d_{it}=1} \tau_{it}.\]
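As a worked instance of the formula, the sketch below computes \(\tau_{it}\) and \(\widehat{\tau}\) for a two-unit toy panel. The imputed matrix `Y0_hat` is a hypothetical stand-in for the SNN output, not a real result.

```python
import numpy as np

Y = np.array([[10.0, 11.0, 9.0, 8.0],          # treated unit
              [10.0, 10.0, 10.0, 10.0]])       # control unit
D = np.array([[0, 0, 1, 1],
              [0, 0, 0, 0]])
Y0_hat = np.array([[10.0, 11.0, 12.0, 12.0],   # hypothetical imputed y^N
                   [10.0, 10.0, 10.0, 10.0]])

tau = np.where(D == 1, Y - Y0_hat, np.nan)     # tau_it on treated cells, NaN elsewhere
att = np.nanmean(tau)                          # average over {(i, t): d_it = 1}
print(att)
```

The two treated cells have effects of -3 and -4, so the average is -3.5.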

The general matrix-completion engine is exposed directly as mlsynth.utils.snn_helpers.snn_complete() for non-causal MNAR completion (e.g. recommender systems): pass a matrix with NaN for missing entries.

When to Use This Method#

SNN’s distinctive bet is about why data are missing. Classical matrix completion – and the nuclear-norm estimator in Matrix Completion with Nuclear Norm Minimization (MCNNM) – assumes the observed cells are a structured but ultimately exogenous sample of the matrix. SNN instead targets missing not at random (MNAR): the very event of observing a cell is correlated with its value (a horror-averse user never rates horror films; a region adopts a policy because of where its outcomes are heading). Under MNAR, MCAR-based completion is biased, and SNN’s per-entry nearest-neighbour + PCR construction restores entry-wise consistency and asymptotic normality.

Reach for SNN when#

Missingness is informative. Whether a cell is observed depends on its own (latent) value – recommender ratings, self-selected program adoption, instrument-driven attrition.
The observed cells contain a large fully observed anchor block. Panel causal designs supply this naturally: a control block observed throughout, with treated units missing only their post-treatment \(y^N\). This block structure is what lets SNN find anchor rows and columns per entry.
Missingness is arbitrary or block-structured. This includes staggered adoption, where different units are missing different post-periods and no single fixed treated/donor split applies (SNN generalises mlsynth.SI to this case).
You want general (non-causal) MNAR matrix completion – e.g. a recommender matrix – via mlsynth.utils.snn_helpers.snn_complete().

Do not use SNN when#

No large fully observed submatrix exists. SNN’s anchor step needs a dense observed block; if missingness is heavy and scattered with no such block, prefer the nuclear-norm estimator Matrix Completion with Nuclear Norm Minimization (MCNNM), which regularises the whole matrix rather than imputing entry-by-entry.
The design is a simple single-treated block with a clean pre-period and you want classic interpretable donor weights, closed-form CIs, or a convex-combination story. Use Synthetic Interventions (SI), Two-Step Synthetic Control, or Synthetic Control with Multiple Outcomes (SCMO); per-entry anchoring is unnecessary machinery there.
Spillovers violate SUTVA on the control block – use Spatial Synthetic Difference-in-Differences (SpSyDiD) or Spillover-Aware Synthetic Control (SPILLSYNTH).
Continuous or multi-valued treatment – SNN imputes an untreated potential-outcome matrix under a binary mask; dose response belongs in Continuous-Treatment Synthetic Control (CTSC).
Distributional questions (quantiles, tails) – use Distributional Synthetic Control (DSC).

Core API#

SNN: Synthetic Nearest Neighbors / Causal Matrix Completion (Agarwal et al. 2021).

Agarwal, A., Dahleh, M., Shah, D. & Shen, D. (2021). “Causal Matrix Completion.” arXiv:2109.15154.

SNN recovers missing entries of a partially observed matrix under missing not at random (MNAR) patterns – where the probability that an entry is observed depends on the underlying values (selection bias), as in recommender systems and panel data. It does so by combining nearest neighbors (collaborative filtering) with synthetic controls: for a target entry it finds a fully observed anchor submatrix and runs principal component regression to impute the value. SNN generalises the Synthetic Interventions estimator (which mlsynth exposes as mlsynth.SI), which in turn generalises classic synthetic control.

In the causal/panel setting handled by this estimator, the treated units’ post-treatment untreated potential outcomes \(Y(0)\) are exactly the missing entries; SNN imputes them and the treatment effect is the observed outcome minus the imputed counterfactual. The underlying matrix completion engine is also exposed directly via mlsynth.utils.snn_helpers.snn_complete() for general MNAR matrix completion (e.g. recommender systems).

The block-structured missingness of panel data – a fully observed control block – naturally induces the anchor rows and columns SNN needs, so the method is especially well suited to comparative case studies and staggered-adoption designs.

class mlsynth.estimators.snn.SNN(config: SNNConfig | dict)#

Bases: object

Synthetic Nearest Neighbors (causal matrix completion) estimator.

Parameters: config (SNNConfig or dict) – Configuration object. See mlsynth.config_models.SNNConfig.

fit() → SNNResults#: Run SNN and return SNNResults.

Configuration#

class mlsynth.config_models.SNNConfig(*, df: DataFrame, outcome: str, treat: str, unitid: str, time: str, display_graphs: bool = True, save: bool | str = False, counterfactual_color: List[str] = <factory>, treated_color: str = 'black', plot: PlotConfig = <factory>, n_neighbors: Annotated[int, Ge(ge=1)] = 1, max_rank: Annotated[int | None, Ge(ge=1)] = None, spectral_energy: Annotated[float, Gt(gt=0.0), Le(le=1.0)] = 0.95, universal_rank: bool = True, clip: bool = True, inference: bool = False, alpha: Annotated[float, Gt(gt=0.0), Lt(lt=1.0)] = 0.05, random_state: int = 0)#

Configuration for the Synthetic Nearest Neighbors (SNN) estimator.

Agarwal, Dahleh, Shah & Shen (2021), “Causal Matrix Completion” (arXiv:2109.15154). Imputes treated units’ untreated potential outcomes by MNAR matrix completion (anchor submatrix + principal component regression), generalising the Synthetic Interventions / synthetic-control approach. Inherits the standard df / outcome / treat / unitid / time interface.

Parameters:

n_neighbors (int) – Number of synthetic neighbours (anchor-row groups) to average.
max_rank (int, optional) – Fixed PCR truncation rank; overrides the spectral/universal rule.
spectral_energy (float) – Singular-value energy threshold for spectral rank selection (used when max_rank is None and universal_rank is False).
universal_rank (bool) – Use the Donoho-Gavish (2014) universal hard-threshold rank. Default True – well-calibrated for small low-rank panels (e.g. Prop 99); set False to use the spectral-energy threshold.
clip (bool) – Clip imputations to the observed value range.
inference (bool) – Run a leave-one-control jackknife for the ATT SE / CI.
alpha (float) – Two-sided level for the jackknife confidence interval.
random_state (int) – Seed for anchor-row splitting.
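The two automatic rank rules can be sketched as follows. Both functions are illustrative and their names are hypothetical; in particular, measuring energy by squared singular values and using the unknown-noise form of the Gavish-Donoho threshold (the polynomial approximation of \(\omega(\beta)\) times the median singular value) are assumptions about how mlsynth applies these rules.

```python
import numpy as np

def spectral_rank(s, energy=0.95):
    """Smallest rank whose squared singular values reach the energy share."""
    share = np.cumsum(s**2) / np.sum(s**2)
    return int(min(np.searchsorted(share, energy) + 1, len(s)))

def universal_rank(s, shape):
    """Gavish-Donoho hard threshold for an unknown noise level."""
    beta = min(shape) / max(shape)            # aspect ratio in (0, 1]
    omega = 0.56 * beta**3 - 0.95 * beta**2 + 1.82 * beta + 1.43
    tau = omega * np.median(s)                # threshold scaled by the median singular value
    return int(np.sum(s > tau))

# Two dominant singular values followed by a noise-like tail (10 x 20 matrix).
s = np.array([50.0, 40.0, 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3])
print(spectral_rank(s, energy=0.95), universal_rank(s, shape=(10, 20)))
```

On this spectrum both rules keep the two dominant components. They can disagree when the tail carries a larger share of the energy, which is when the choice between universal_rank and spectral_energy matters.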

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

Helper Modules#

Core matrix-completion engine for Synthetic Nearest Neighbors (SNN).

Agarwal, A., Dahleh, M., Shah, D. & Shen, D. (2021). “Causal Matrix Completion.” arXiv:2109.15154.

SNN imputes a missing entry \((i, j)\) of a partially observed matrix by (1) finding anchor rows and columns – a fully observed submatrix \(S\) whose rows are observed in column \(j\) and whose columns are observed in row \(i\) – and (2) running principal component regression (PCR): truncate the SVD of \(S\), regress row \(i\)’s anchor-column values on \(S\) to learn weights \(\beta\), and apply them to column \(j\)’s anchor-row values (paper Algorithm 1).

It generalises the Synthetic Interventions / synthetic-control PCR machinery to arbitrary “missing not at random” (MNAR) patterns, because the anchor submatrix is found per entry rather than assuming a fixed treated/donor block. The reference implementation (github.com/deshen24/syntheticNN) uses a NetworkX maximum-biclique search to find anchors; this implementation uses a dependency-free greedy search for the largest fully observed submatrix.
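One simple greedy strategy for the anchor search is sketched below. It is an assumption for illustration, not the search mlsynth or the reference implementation runs: start from every row observed in column \(j\) and every column observed in row \(i\), then repeatedly drop the row or column with the largest share of missing cells until the remaining block is fully observed. The function name `greedy_anchor_block` is hypothetical.

```python
import numpy as np

def greedy_anchor_block(observed, i, j):
    """Return (rows, cols) of a fully observed block usable to impute (i, j)."""
    rows = [r for r in range(observed.shape[0]) if r != i and observed[r, j]]
    cols = [c for c in range(observed.shape[1]) if c != j and observed[i, c]]

    while rows and cols:
        block = observed[np.ix_(rows, cols)]
        if block.all():
            break
        row_missing = (~block).mean(axis=1)   # share of missing cells per candidate row
        col_missing = (~block).mean(axis=0)   # share of missing cells per candidate column
        # Drop the worst offender; ties go to dropping a row.
        if row_missing.max() >= col_missing.max():
            rows.pop(int(row_missing.argmax()))
        else:
            cols.pop(int(col_missing.argmax()))

    return np.array(rows, dtype=int), np.array(cols, dtype=int)

X = np.array([[1.0, 2.0, np.nan],
              [2.0, 4.0, 6.0],
              [3.0, np.nan, 9.0]])
rows, cols = greedy_anchor_block(~np.isnan(X), i=0, j=2)
print(rows, cols)
```

A greedy pass like this is not guaranteed to find the largest block, which is the trade-off against an exact maximum-biclique search.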

mlsynth.utils.snn_helpers.completion.snn_complete(X: ndarray, *, n_neighbors: int = 1, max_rank: int | None = None, spectral_energy: float = 0.95, universal: bool = False, min_value: float | None = None, max_value: float | None = None, random_state: int = 0) → Tuple[ndarray, ndarray]#

Complete a matrix with missing entries marked as NaN via SNN.

Parameters:

X (np.ndarray) – Partially observed matrix; missing entries are NaN.
n_neighbors (int) – Number of synthetic neighbours (anchor-row groups) to average.
max_rank (int, optional) – Fixed PCR truncation rank; overrides the spectral/universal rule.
spectral_energy (float) – Energy threshold for spectral rank selection (when max_rank and universal are unset).
universal (bool) – Use the Donoho-Gavish universal hard threshold for the rank.
min_value, max_value (float, optional) – Clip imputed values to this range.
random_state (int) – Seed for the anchor-row splitting.

Returns:

completed (np.ndarray) – Matrix with missing entries imputed (NaN where infeasible).
feasible (np.ndarray) – Boolean mask, True where an imputation was produced.
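One plausible reading of n_neighbors and random_state is sketched below: the anchor rows are shuffled with the seed, split into n_neighbors groups, each group yields its own PCR estimate, and the estimates are averaged. The helper names `pcr_weights` and `snn_average` are hypothetical, and the splitting scheme is an assumption rather than a description of mlsynth's internals.

```python
import numpy as np

def pcr_weights(S, q, rank):
    """Rank-truncated least-squares weights solving q ~ S^T beta."""
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return U[:, :rank] @ ((Vt[:rank] @ q) / s[:rank])

def snn_average(X, i, j, anchor_rows, anchor_cols, n_neighbors, rank, seed=0):
    """Average the PCR estimates from n_neighbors disjoint anchor-row groups."""
    rng = np.random.default_rng(seed)
    groups = np.array_split(rng.permutation(anchor_rows), n_neighbors)
    q = X[i, anchor_cols]
    estimates = []
    for rows in groups:
        S = X[np.ix_(rows, anchor_cols)]      # one synthetic neighbour's anchor block
        beta = pcr_weights(S, q, rank)
        estimates.append(X[rows, j] @ beta)
    return float(np.mean(estimates))

# Rank-one toy matrix with entry (0, 3) missing; its true value is 1 * 4 = 4.
X = np.outer(np.arange(1.0, 8.0), np.array([1.0, 2.0, 3.0, 4.0]))
X[0, 3] = np.nan
estimate = snn_average(X, 0, 3, anchor_rows=np.arange(1, 7),
                       anchor_cols=np.arange(3), n_neighbors=3, rank=1)
print(estimate)
```

With noiseless data every group returns the same value; with noisy data, averaging over groups trades a smaller anchor block per estimate for lower variance in the final imputation.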

mlsynth.utils.snn_helpers.completion.snn_donor_weights(X: ndarray, mask: ndarray, i: int, *, n_neighbors: int = 1, max_rank: int | None = None, spectral_energy: float = 0.95, universal: bool = False, random_state: int = 0) → Tuple[ndarray, ndarray]#

Effective PCR donor weights for treated unit i.

For a treated unit, every missing (post-treatment) cell shares the same anchor rows (the donor units) and anchor columns (the pre-periods), so a single weight vector \(\beta\) over the donors reproduces the imputed counterfactual: \(\widehat Y_{it}(0) = \sum_j \beta_j Y_{jt}\). Returns (donor_indices, weights); the weights are the (unconstrained) PCR coefficients – they need not be non-negative nor sum to one. Returns empty arrays if no anchor block exists.
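The shared-weights property can be checked on a toy panel. The sketch below is plain NumPy and does not call snn_donor_weights; the donor data are made up, and the treated unit is constructed without a treatment effect so that its post-period values are its true \(Y(0)\).

```python
import numpy as np

# Three donors over six periods; the first four periods are pre-treatment.
donors = np.array([[1.0, 0.0, 1.0, 2.0, 1.0, 3.0],
                   [0.0, 1.0, 1.0, 1.0, 2.0, 1.0],
                   [1.0, 1.0, 2.0, 3.0, 3.0, 4.0]])
treated = 2.0 * donors[0] - donors[1]     # untreated path of the treated unit
T0, rank = 4, 2

S = donors[:, :T0]                        # anchor block: donors x pre-periods
q = treated[:T0]                          # treated unit's pre-period outcomes
U, s, Vt = np.linalg.svd(S, full_matrices=False)
beta = U[:, :rank] @ ((Vt[:rank] @ q) / s[:rank])

# One weight vector reproduces every post-period counterfactual.
counterfactual = beta @ donors[:, T0:]
print(beta, beta.sum(), counterfactual)
```

Here the minimum-norm weights are (5/3, -4/3, 1/3): one is negative and they sum to 2/3, yet they reproduce the post-period path (0, 5) exactly, which illustrates why PCR weights should not be read as convex synthetic-control weights.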

mlsynth.utils.snn_helpers.completion.snn_predict(X: ndarray, mask: ndarray, i: int, j: int, *, n_neighbors: int = 1, max_rank: int | None = None, spectral_energy: float = 0.95, universal: bool = False, random_state: int = 0) → Tuple[float, bool]#: Impute entry (i, j) of X via SNN. Returns (value, feasible).

Panel ingestion for the SNN estimator.

Pivots a long panel into (N, T) outcome and treatment matrices. SNN treats the treated post-treatment cells as missing and imputes them.

mlsynth.utils.snn_helpers.setup.prepare_snn_inputs(df: DataFrame, outcome: str, treat: str, unitid: str, time: str) → SNNInputs#

Pivot a long panel into SNNInputs.

A treated unit has treat == 1 from a common adoption period onward; SNN imputes those cells’ untreated potential outcomes.
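The pivot can be pictured with plain pandas. This sketch does not call prepare_snn_inputs; the column names and the variable `first_treated_col` are hypothetical, and treating the first treated period as a column index is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "unit": ["A"] * 3 + ["B"] * 3,
    "year": [2000, 2001, 2002] * 2,
    "y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "treat": [0, 0, 1, 0, 0, 0],          # unit A is treated from 2002 onward
})

# Long panel -> (N, T) matrices, rows sorted by unit and columns by year.
Y = df.pivot(index="unit", columns="year", values="y").to_numpy()
D = df.pivot(index="unit", columns="year", values="treat").to_numpy()

treated_idx = np.flatnonzero(D.any(axis=1))           # ever-treated units
first_treated_col = int(np.argmax(D.any(axis=0)))     # first column with any treatment
Y_masked = np.where(D == 1, np.nan, Y)                # treated post cells become missing
print(Y_masked)
```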

Orchestration for the SNN estimator (Agarwal et al. 2021).

In the causal/panel setting, SNN masks the treated post-treatment cells as missing, imputes their untreated potential outcomes by synthetic nearest-neighbors matrix completion, and forms treatment effects as observed minus imputed.

mlsynth.utils.snn_helpers.pipeline.run_snn(inputs: SNNInputs, *, n_neighbors: int = 1, max_rank: int | None = None, spectral_energy: float = 0.95, universal: bool = True, clip: bool = True, inference: bool = False, alpha_level: float = 0.05, random_state: int = 0) → SNNResults#

Run SNN and assemble SNNResults.

Parameters:

inputs (SNNInputs) – Preprocessed panel.
n_neighbors (int) – Number of synthetic neighbours (anchor-row groups) to average.
max_rank (int, optional) – Fixed PCR truncation rank (overrides spectral/universal rule).
spectral_energy (float) – Energy threshold for spectral rank selection.
universal (bool) – Use the Donoho-Gavish universal hard threshold for the rank (default True; well-calibrated for small low-rank panels).
clip (bool) – Clip imputations to the observed value range.
inference (bool) – If True, run a leave-one-control jackknife for the ATT SE/CI.
alpha_level (float) – Two-sided level for the jackknife CI.
random_state (int) – Seed for anchor-row splitting.

Frozen dataclasses for the Synthetic Nearest Neighbors (SNN) estimator.

Agarwal, A., Dahleh, M., Shah, D. & Shen, D. (2021). “Causal Matrix Completion.” arXiv:2109.15154.

In the causal/panel setting, SNN treats the treated units’ post-treatment untreated potential outcomes \(Y(0)\) as the missing entries of the outcome matrix, imputes them by matrix completion (synthetic nearest neighbors), and forms treatment effects as observed minus imputed. It generalises the Synthetic Interventions / synthetic-control estimator to arbitrary missingness patterns.

class mlsynth.utils.snn_helpers.structures.SNNInference(method: str, se: float, ci: tuple, alpha_level: float, n_jackknife: int)#

Jackknife inference for the SNN ATT.

Leaves out one control (anchor) unit at a time, re-imputes, and uses the spread of the resulting ATTs to form a standard error and confidence interval.
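A leave-one-out jackknife of this kind can be sketched as follows. The source only states that the spread of the re-fitted ATTs is used, so the standard jackknife variance formula and the normal-quantile interval below are assumptions, and the function name `jackknife_ci` and the input numbers are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def jackknife_ci(att_hat, att_loo, alpha=0.05):
    """Jackknife SE and normal-approximation CI from leave-one-out ATTs."""
    att_loo = np.asarray(att_loo, dtype=float)
    n = att_loo.size
    # Standard jackknife variance: (n - 1) / n times the sum of squared deviations.
    se = np.sqrt((n - 1) / n * np.sum((att_loo - att_loo.mean()) ** 2))
    z = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided normal quantile
    return float(se), (float(att_hat - z * se), float(att_hat + z * se))

# Made-up ATTs from five leave-one-control re-fits around a full-sample ATT of -18.
se, (lo, hi) = jackknife_ci(-18.0, [-18.0, -19.0, -17.0, -18.5, -17.5])
print(se, lo, hi)
```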

Attributes:

method (str) – "jackknife".
se (float) – Jackknife standard error of the ATT.
ci (tuple of float) – Two-sided confidence interval for the ATT.
alpha_level (float) – Level used for ci.
n_jackknife (int) – Number of leave-one-control re-fits used.

class mlsynth.utils.snn_helpers.structures.SNNInputs(Y: ndarray, D: ndarray, treated_idx: ndarray, T0: int, unit_names: List[Any], time_labels: ndarray)#

Preprocessed panel for SNN.

Attributes:

Y (np.ndarray) – Observed outcomes, shape (N, T).
D (np.ndarray) – Treatment indicators, shape (N, T); 1 where treated.
treated_idx (np.ndarray) – Indices of ever-treated units.
T0 (int) – First treated period (post-treatment is t >= T0).
unit_names (list) – Length-N unit identifiers.
time_labels (np.ndarray) – Length-T period labels.

property N: int#

property T: int#

class mlsynth.utils.snn_helpers.structures.SNNResults(*, effects: EffectsResults | None = None, fit_diagnostics: FitDiagnosticsResults | None = None, time_series: TimeSeriesResults | None = None, weights: WeightsResults | None = None, inference: InferenceResults | None = None, method_details: MethodDetailsResults | None = None, sub_method_results: Dict[str, Any] | None = None, additional_outputs: Dict[str, Any] | None = None, raw_results: Dict[str, Any] | None = None, execution_summary: Dict[str, Any] | None = None, plot_config: PlotConfig | None = None, inputs: SNNInputs, counterfactual_matrix: ndarray, effects_matrix: ndarray, att_by_period: Dict[Any, float], feasible: ndarray, inference_jackknife: SNNInference | None = None, metadata: Dict[str, Any] = <factory>)#

Top-level container returned by mlsynth.SNN.fit().

SNNResults is an EffectResult (the observational report). In addition to the SNN-specific fields below, it exposes the standardized sub-models (effects, time_series, weights, inference, fit_diagnostics, method_details) and the flat accessors att / counterfactual / gap / att_ci / pre_rmse. The treated counterfactual path (res.counterfactual) is the mean of the imputed \(Y(0)\) across treated units; the full imputed (N, T) matrix lives in counterfactual_matrix.

Parameters:

inputs (SNNInputs) – Preprocessed panel.
counterfactual_matrix (np.ndarray) – Outcome matrix with treated post-treatment \(Y(0)\) imputed, shape (N, T). (Renamed from counterfactual, which now returns the 1-D treated path per the result contract.)
effects_matrix (np.ndarray) – Per-cell treatment effects (observed minus imputed) for treated post cells; NaN elsewhere, shape (N, T). (Renamed from effects, which is now the standardized EffectsResults slot.)
att_by_period (dict) – {period_label: mean effect across treated units} for post-treatment periods.
feasible (np.ndarray) – Boolean mask of cells SNN could impute, shape (N, T).
inference_jackknife (SNNInference, optional) – The raw jackknife inference object (method / se / ci) when inference=True; None otherwise. The standardized InferenceResults is mirrored into the inference slot (so res.att_ci resolves).
metadata (dict) – Free-form diagnostics.
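The relationship between the matrix fields and the summary fields can be sketched with NumPy. The arrays below are made up, and the computations follow the field descriptions above rather than mlsynth's source, so treat them as an assumption about how the summaries are derived.

```python
import numpy as np

time_labels = np.array([1988, 1989, 1990])
treated_idx = np.array([0, 1])            # two treated units, one control
T0 = 1                                    # post-treatment columns are t >= 1

counterfactual_matrix = np.array([[10.0, 12.0, 14.0],
                                  [20.0, 22.0, 24.0],
                                  [30.0, 30.0, 30.0]])
effects_matrix = np.array([[np.nan, -2.0, -4.0],
                           [np.nan, -1.0, -3.0],
                           [np.nan, np.nan, np.nan]])

# 1-D treated path: mean of the imputed Y(0) across treated units.
counterfactual_path = counterfactual_matrix[treated_idx].mean(axis=0)

# Per-period ATT: mean effect across treated units in each post period.
att_by_period = {
    int(time_labels[t]): float(np.nanmean(effects_matrix[treated_idx, t]))
    for t in range(T0, len(time_labels))
}

# Overall ATT: mean over all treated post-treatment cells.
att = float(np.nanmean(effects_matrix))
print(counterfactual_path, att_by_period, att)
```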

Notes

The PCR donor weights (the linear combination that builds the counterfactual) live in the standardized weights slot: for a single treated unit donor_weights maps donor name -> weight; with multiple treated units it holds the cross-unit average and the per-unit weights live in summary_stats['per_unit_donor_weights'].

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid', 'frozen': True, 'json_encoders': {<class 'numpy.ndarray'>: <function BaseEstimatorResults.Config.<lambda>>}}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

Example#

Proposition 99 – California’s 1988 tobacco-control program, the canonical synthetic-control case study. SNN treats California’s post-1988 per-capita cigarette sales as the missing entries, imputes the counterfactual by matrix completion, and reports the ATT. With display_graphs=True it draws the observed-vs-counterfactual chart.

import pandas as pd

from mlsynth import SNN

# ------------------------------------------------------------------
# Load the Prop 99 panel (39 states, 1970-2000; California treated 1989)
# ------------------------------------------------------------------
file = (
    "https://raw.githubusercontent.com/jgreathouse9/mlsynth/"
    "refs/heads/main/basedata/smoking_data.csv"
)
df = pd.read_csv(file)

res = SNN({
    "df": df,
    "outcome": "cigsale",
    "treat": "Proposition 99",      # boolean treatment column
    "unitid": "state",
    "time": "year",
    "inference": True,              # leave-one-control jackknife
    "display_graphs": True,         # observed vs SNN counterfactual
}).fit()

print(f"ATT (avg 1989-2000) = {res.att:+.2f} packs/capita")
lo, hi = res.att_ci               # standardized; jackknife when inference=True
print(f"jackknife 95% CI    = [{lo:+.2f}, {hi:+.2f}]")
print(f"gap by 2000         = {res.att_by_period[2000]:+.2f}")

The default universal_rank=True (Donoho-Gavish hard threshold) keeps the rank well-calibrated for this small (39 x 31) low-rank panel; it returns an average ATT of about -18 packs/capita, widening to roughly -29 by 2000 – consistent with Abadie, Diamond & Hainmueller (2010).

Verification#

Cross-validated against the reference implementation (deshen24/syntheticNN): on the Prop 99 block-missingness pattern mlsynth reproduces the reference’s imputed counterfactual to machine precision. See SNN — deshen24/syntheticNN (Prop 99).

The same SNN engine performs general (non-causal) matrix completion on any matrix with NaN for the missing entries:

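import numpy as np
from mlsynth.utils.snn_helpers import snn_complete

# 3 x 3 matrix; NaN marks the missing entries
X = np.array([[1.0, 2.0, np.nan],
              [2.0, 4.0, 6.0],
              [3.0, np.nan, 9.0]])

# completed: imputed matrix (NaN where no imputation was feasible)
# feasible:  boolean mask, True where an imputation was produced
completed, feasible = snn_complete(X)
print(completed)
print(feasible)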

References#

Abadie, A., Diamond, A., & Hainmueller, J. (2010). “Synthetic Control Methods for Comparative Case Studies.” Journal of the American Statistical Association 105(490):493-505.

Agarwal, A., Shah, D., Shen, D., & Song, D. (2021b). “On Robustness of Principal Component Regression.” Journal of the American Statistical Association.

Agarwal, A., Dahleh, M., Shah, D., & Shen, D. (2021). “Causal Matrix Completion.” arXiv:2109.15154.

Athey, S., Bayati, M., Doudchenko, N., Imbens, G., & Khosravi, K. (2021). “Matrix Completion Methods for Causal Panel Data Models.” Journal of the American Statistical Association 116(536):1716-1730.

Gavish, M., & Donoho, D. L. (2014). “The Optimal Hard Threshold for Singular Values is 4/sqrt(3).” IEEE Transactions on Information Theory 60(8):5040-5053.

Ma, W., & Chen, G. H. (2019). “Missing Not at Random in Matrix Completion: The Effectiveness of Estimating Missingness Probabilities Under a Low Nuclear Norm Assumption.” NeurIPS.
