Synthetic Nearest Neighbors / Causal Matrix Completion (SNN)#
Overview#
SNN (Agarwal, A., Dahleh, M., Shah, D. & Shen, D. (2021). “Causal Matrix Completion,” arXiv:2109.15154) recovers the missing entries of a partially observed matrix when the data are missing not at random (MNAR) – the probability that an entry is observed depends on the underlying value. This selection bias is the norm in the two canonical matrix-completion applications:
Recommender systems: a user who dislikes horror films will almost never rate one, so the missingness pattern is informative about the ratings themselves.
Panel data / causal inference: policy-makers adopt programs for reasons correlated with outcomes, and competing policies cannot be observed simultaneously, so the potential-outcome matrix is systematically (not randomly) missing.
Classical matrix completion assumes data are missing completely at random (MCAR) and is biased under MNAR. SNN provides a causal framework and an estimator with entry-wise (max-norm) finite-sample consistency and asymptotic normality under MNAR.
The algorithm#
To impute entry \((i, j)\), SNN combines nearest neighbors (collaborative filtering) with synthetic controls:
Anchor rows and columns. Find a fully observed submatrix \(S\) whose rows are observed in column \(j\) and whose columns are observed in row \(i\) (paper Section 4.2). The reference implementation finds these via a maximum-biclique search; mlsynth uses a dependency-free greedy search for a large fully observed block.
Principal component regression. Truncate the SVD of \(S\), regress row \(i\)’s anchor-column values \(q\) on \(S\) to learn weights \(\beta\), and apply them to column \(j\)’s anchor-row values \(x\): \(\widehat A_{ij} = \langle x, \beta \rangle\) (paper Algorithm 1).
SNN generalises Synthetic Interventions (Agarwal et al. 2021b, the
mlsynth.SI estimator), which itself generalises classic
synthetic control: the same PCR machinery is applied, but the anchor
submatrix is found per entry rather than assuming a fixed treated/donor
block, so SNN handles arbitrary (block-structured) MNAR patterns.
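The two steps above can be sketched for a single entry with NumPy. This is a minimal illustration rather than mlsynth's implementation: the function name `pcr_impute` and the fixed `rank` argument are assumptions made here, and the anchor rows and columns are chosen by hand.

```python
import numpy as np

def pcr_impute(S, q, x, rank):
    """Impute one entry by principal component regression (illustrative).

    S : (n_anchor_rows, n_anchor_cols) fully observed, nonzero anchor submatrix
    q : (n_anchor_cols,) row i's values in the anchor columns
    x : (n_anchor_rows,) column j's values in the anchor rows
    """
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    # Keep at most `rank` components and never more than the numerical rank.
    k = max(1, min(rank, int(np.sum(s > 1e-10))))
    # Minimum-norm solution of S_k^T beta = q, where S_k is the rank-k truncation.
    beta = U[:, :k] @ ((Vt[:k] @ q) / s[:k])
    return float(x @ beta), beta

# Rank-1 toy matrix; pretend entry (0, 2) is missing.
A = np.outer([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
S = A[np.ix_([1, 2], [0, 1])]   # anchor rows 1, 2 and anchor columns 0, 1
q = A[0, [0, 1]]                # row 0 on the anchor columns
x = A[[1, 2], 2]                # column 2 on the anchor rows
estimate, beta = pcr_impute(S, q, x, rank=1)
```

Because the toy matrix is exactly rank one, `estimate` should recover the held-out value `A[0, 2] = 3` up to floating-point error.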
Why panel data is a natural fit#
A fully observed anchor block is essential. Under independent MCAR sampling with sparse observations, a large fully observed submatrix rarely exists, but the block-structured missingness of panel data – a control block observed throughout, with treated units missing their post-treatment \(Y(0)\) – naturally induces anchor rows (controls) and columns (pre-periods). SNN is therefore especially well suited to comparative case studies and staggered-adoption designs, exactly the setting this estimator targets.
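To make the anchor idea concrete, here is one possible greedy heuristic for finding a fully observed block for a target entry. It is a sketch under stated assumptions, not the search mlsynth or the reference implementation uses: the function name `greedy_anchor_block` and the pruning rule (drop the row or column with the most missing cells) are hypothetical.

```python
import numpy as np

def greedy_anchor_block(mask, i, j):
    """Return (anchor_rows, anchor_cols) for entry (i, j).

    mask : boolean (N, T) array, True where a cell is observed.
    Starts from every row observed in column j and every column observed
    in row i, then prunes until the remaining submatrix is fully observed.
    """
    rows = [r for r in range(mask.shape[0]) if r != i and mask[r, j]]
    cols = [c for c in range(mask.shape[1]) if c != j and mask[i, c]]
    while rows and cols:
        sub = mask[np.ix_(rows, cols)]
        if sub.all():
            break
        row_missing = (~sub).sum(axis=1)
        col_missing = (~sub).sum(axis=0)
        # Drop whichever row or column currently has the most missing cells.
        if row_missing.max() >= col_missing.max():
            rows.pop(int(row_missing.argmax()))
        else:
            cols.pop(int(col_missing.argmax()))
    return np.array(rows, dtype=int), np.array(cols, dtype=int)

# Panel-style mask: unit 0 is treated from period 3 onward.
mask = np.ones((4, 5), dtype=bool)
mask[0, 3:] = False
rows, cols = greedy_anchor_block(mask, i=0, j=4)

# One scattered missing control cell forces a prune.
mask2 = mask.copy()
mask2[2, 1] = False
rows2, cols2 = greedy_anchor_block(mask2, i=0, j=4)
```

In the clean panel the anchor rows are the three controls and the anchor columns are the three pre-periods, which is the block structure described above.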
Causal use in mlsynth#
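The SNN estimator masks the treated post-treatment cells as
missing, imputes their untreated potential outcomes by SNN matrix
completion, and forms the treatment effect as observed minus imputed: \(\widehat\tau_{it} = Y_{it} - \widehat Y_{it}(0)\) for each treated post-treatment cell \((i, t)\).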
The general matrix-completion engine is exposed directly as
mlsynth.utils.snn_helpers.snn_complete() for non-causal MNAR
completion (e.g. recommender systems): pass a matrix with NaN for
missing entries.
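The observed-minus-imputed step of the causal use case is simple array arithmetic. The sketch below assumes an imputed counterfactual matrix is already available; the helper name `effects_and_att` and the toy numbers are illustrative only and are not part of mlsynth.

```python
import numpy as np

def effects_and_att(Y, Y0_hat, D):
    """Per-cell effects (observed minus imputed) and their average.

    Y      : (N, T) observed outcomes
    Y0_hat : (N, T) outcomes with treated post-treatment Y(0) imputed
    D      : (N, T) treatment indicators, 1 where treated
    """
    treated = np.asarray(D).astype(bool)
    effects = np.full(Y.shape, np.nan)
    effects[treated] = Y[treated] - Y0_hat[treated]
    att = float(np.nanmean(effects))  # averages over treated cells only
    return effects, att

Y = np.array([[10.0, 11.0, 8.0, 7.0],
              [5.0, 6.0, 7.0, 8.0]])
D = np.array([[0, 0, 1, 1],
              [0, 0, 0, 0]])
Y0_hat = Y.copy()
Y0_hat[0, 2:] = [12.0, 13.0]   # made-up imputed counterfactuals
effects, att = effects_and_att(Y, Y0_hat, D)
```

Untreated cells stay `NaN` in `effects`, matching the convention described for the results container.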
When to Use This Method#
SNN’s distinctive bet is about why data are missing. Classical matrix completion – and the nuclear-norm estimator in Matrix Completion with Nuclear Norm Minimization (MCNNM) – assumes the observed cells are a structured but ultimately exogenous sample of the matrix. SNN instead targets missing not at random (MNAR): the very event of observing a cell is correlated with its value (a horror-averse user never rates horror films; a region adopts a policy because of where its outcomes are heading). Under MNAR, MCAR-based completion is biased, and SNN’s per-entry nearest-neighbour + PCR construction restores entry-wise consistency and asymptotic normality.
Reach for SNN when#
Missingness is informative. Whether a cell is observed depends on its own (latent) value – recommender ratings, self-selected program adoption, instrument-driven attrition.
The observed cells contain a large fully observed anchor block. Panel causal designs supply this naturally: a control block observed throughout, with treated units missing only their post-treatment \(Y(0)\). This block structure is what lets SNN find anchor rows and columns per entry.
Missingness is arbitrary or block-structured, including staggered adoption, where different units are missing different post-periods and no single fixed treated/donor split applies (SNN generalises mlsynth.SI to this case).
You want general (non-causal) MNAR matrix completion – e.g. a recommender matrix – via mlsynth.utils.snn_helpers.snn_complete().
Do not use SNN when#
No large fully observed submatrix exists. SNN’s anchor step needs a dense observed block; if missingness is heavy and scattered with no such block, prefer the nuclear-norm estimator Matrix Completion with Nuclear Norm Minimization (MCNNM), which regularises the whole matrix rather than imputing entry-by-entry.
The design is a simple single-treated block with a clean pre-period and you want classic interpretable donor weights, closed-form CIs, or a convex-combination story. Use Synthetic Interventions (SI), Two-Step Synthetic Control, or Synthetic Control with Multiple Outcomes (SCMO); per-entry anchoring is unnecessary machinery there.
Spillovers violate SUTVA on the control block – use Spatial Synthetic Difference-in-Differences (SpSyDiD) or Spillover-Aware Synthetic Control (SPILLSYNTH).
Continuous or multi-valued treatment – SNN imputes an untreated potential-outcome matrix under a binary mask; dose response belongs in Continuous-Treatment Synthetic Control (CTSC).
Distributional questions (quantiles, tails) – use Distributional Synthetic Control (DSC).
Core API#
SNN: Synthetic Nearest Neighbors / Causal Matrix Completion (Agarwal et al. 2021).
Agarwal, A., Dahleh, M., Shah, D. & Shen, D. (2021). “Causal Matrix Completion.” arXiv:2109.15154.
SNN recovers missing entries of a partially observed matrix under
missing not at random (MNAR) patterns – where the probability that an
entry is observed depends on the underlying values (selection bias), as in
recommender systems and panel data. It does so by combining nearest
neighbors (collaborative filtering) with synthetic controls: for a target
entry it finds a fully observed anchor submatrix and runs principal
component regression to impute the value. SNN generalises the Synthetic
Interventions estimator (which mlsynth exposes as mlsynth.SI),
which in turn generalises classic synthetic control.
In the causal/panel setting handled by this estimator, the treated units’
post-treatment untreated potential outcomes \(Y(0)\) are exactly the
missing entries; SNN imputes them and the treatment effect is the
observed outcome minus the imputed counterfactual. The underlying matrix
completion engine is also exposed directly via
mlsynth.utils.snn_helpers.snn_complete() for general MNAR matrix
completion (e.g. recommender systems).
The block-structured missingness of panel data – a fully observed control block – naturally induces the anchor rows and columns SNN needs, so the method is especially well suited to comparative case studies and staggered-adoption designs.
- class mlsynth.estimators.snn.SNN(config: SNNConfig | dict)#
Bases: object
Synthetic Nearest Neighbors (causal matrix completion) estimator.
- Parameters:
config (SNNConfig or dict) – Configuration object. See
mlsynth.config_models.SNNConfig.
- fit() SNNResults#
Run SNN and return
SNNResults.
Configuration#
- class mlsynth.config_models.SNNConfig(*, df: ~pandas.DataFrame, outcome: str, treat: str, unitid: str, time: str, display_graphs: bool = True, save: bool | str = False, counterfactual_color: ~typing.List[str] = <factory>, treated_color: str = 'black', n_neighbors: ~typing.Annotated[int, ~annotated_types.Ge(ge=1)] = 1, max_rank: ~typing.Annotated[int | None, ~annotated_types.Ge(ge=1)] = None, spectral_energy: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Le(le=1.0)] = 0.95, universal_rank: bool = True, clip: bool = True, inference: bool = False, alpha: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Lt(lt=1.0)] = 0.05, random_state: int = 0)#
Configuration for the Synthetic Nearest Neighbors (SNN) estimator.
Agarwal, Dahleh, Shah & Shen (2021), “Causal Matrix Completion” (arXiv:2109.15154). Imputes treated units’ untreated potential outcomes by MNAR matrix completion (anchor submatrix + principal component regression), generalising the Synthetic Interventions / synthetic-control approach. Inherits the standard
df/outcome/treat/unitid/time interface.
- Parameters:
n_neighbors (int) – Number of synthetic neighbours (anchor-row groups) to average.
max_rank (int, optional) – Fixed PCR truncation rank; overrides the spectral/universal rule.
spectral_energy (float) – Singular-value energy threshold for spectral rank selection (used when max_rank is None and universal_rank is False).
universal_rank (bool) – Use the Donoho-Gavish (2014) universal hard-threshold rank. Default True – well-calibrated for small low-rank panels (e.g. Prop 99); set False to use the spectral-energy threshold.
clip (bool) – Clip imputations to the observed value range.
inference (bool) – Run a leave-one-control jackknife for the ATT SE / CI.
alpha (float) – Two-sided level for the jackknife confidence interval.
random_state (int) – Seed for anchor-row splitting.
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
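The two automatic rank rules named in the parameters can be sketched as follows. Both functions are illustrative assumptions rather than mlsynth's code: the energy rule is taken here to use squared singular values, and the universal rule uses the published approximation for the unknown-noise hard threshold, \(\omega(\beta) \approx 0.56\beta^3 - 0.95\beta^2 + 1.82\beta + 1.43\) times the median singular value.

```python
import numpy as np

def spectral_energy_rank(s, energy=0.95):
    """Smallest k whose leading k squared singular values reach `energy`."""
    s2 = np.asarray(s, dtype=float) ** 2
    cumulative = np.cumsum(s2) / s2.sum()
    k = int(np.searchsorted(cumulative, energy)) + 1
    return min(k, len(s2))

def universal_rank(s, shape):
    """Count singular values above the approximate universal hard threshold."""
    s = np.asarray(s, dtype=float)
    m, n = shape
    beta = min(m, n) / max(m, n)              # aspect ratio in (0, 1]
    omega = 0.56 * beta**3 - 0.95 * beta**2 + 1.82 * beta + 1.43
    tau = omega * np.median(s)
    return max(1, int(np.sum(s > tau)))       # keep at least one component

s = np.array([10.0, 1.0, 0.9, 0.8, 0.7])      # one dominant singular value
k_energy = spectral_energy_rank(s, energy=0.95)
k_energy_strict = spectral_energy_rank(s, energy=0.99)
k_universal = universal_rank(s, shape=(5, 5))
```

On this spectrum the energy rule selects rank 1 at 0.95 but rank 4 at 0.99, while the universal threshold stays at rank 1, which illustrates why the choice of rule matters on small panels.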
Helper Modules#
Core matrix-completion engine for Synthetic Nearest Neighbors (SNN).
Agarwal, A., Dahleh, M., Shah, D. & Shen, D. (2021). “Causal Matrix Completion.” arXiv:2109.15154.
SNN imputes a missing entry \((i, j)\) of a partially observed matrix by (1) finding anchor rows and columns – a fully observed submatrix \(S\) whose rows are observed in column \(j\) and whose columns are observed in row \(i\) – and (2) running principal component regression (PCR): truncate the SVD of \(S\), regress row \(i\)’s anchor-column values on \(S\) to learn weights \(\beta\), and apply them to column \(j\)’s anchor-row values (paper Algorithm 1).
It generalises the Synthetic Interventions / synthetic-control PCR machinery to arbitrary “missing not at random” (MNAR) patterns, because the anchor submatrix is found per entry rather than assuming a fixed treated/donor block. The reference implementation (github.com/deshen24/syntheticNN) uses a NetworkX maximum-biclique search to find anchors; this implementation uses a dependency-free greedy search for the largest fully observed submatrix.
- mlsynth.utils.snn_helpers.completion.snn_complete(X: ndarray, *, n_neighbors: int = 1, max_rank: int | None = None, spectral_energy: float = 0.95, universal: bool = False, min_value: float | None = None, max_value: float | None = None, random_state: int = 0) Tuple[ndarray, ndarray]#
Complete a matrix with missing entries marked as NaN via SNN.
- Parameters:
X (np.ndarray) – Partially observed matrix; missing entries are NaN.
n_neighbors (int) – Number of synthetic neighbours (anchor-row groups) to average.
max_rank (int, optional) – Fixed PCR truncation rank; overrides the spectral/universal rule.
spectral_energy (float) – Energy threshold for spectral rank selection (when max_rank and universal are unset).
universal (bool) – Use the Donoho-Gavish universal hard threshold for the rank.
min_value, max_value (float, optional) – Clip imputed values to this range.
random_state (int) – Seed for the anchor-row splitting.
- Returns:
completed (np.ndarray) – Matrix with missing entries imputed (NaN where infeasible).
feasible (np.ndarray) – Boolean mask, True where an imputation was produced.
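The role of n_neighbors and random_state can be illustrated by splitting the anchor rows into groups, imputing with each group, and averaging. This is a hedged sketch of that idea only: `averaged_imputation` is a hypothetical helper, the split uses a seeded permutation, and an untruncated least-squares fit stands in for the rank-truncated PCR step.

```python
import numpy as np

def averaged_imputation(S, q, x, n_neighbors=1, random_state=0):
    """Average the imputations from `n_neighbors` disjoint anchor-row groups.

    S : (n_anchor_rows, n_anchor_cols) fully observed anchor submatrix
    q : (n_anchor_cols,) target row's values in the anchor columns
    x : (n_anchor_rows,) target column's values in the anchor rows
    """
    rng = np.random.default_rng(random_state)
    order = rng.permutation(S.shape[0])
    groups = [g for g in np.array_split(order, n_neighbors) if g.size > 0]
    estimates = []
    for g in groups:
        # Minimum-norm least squares: weights over this group's anchor rows.
        beta, *_ = np.linalg.lstsq(S[g].T, q, rcond=None)
        estimates.append(float(x[g] @ beta))
    return float(np.mean(estimates))

# Rank-1 toy matrix; entry (0, 2) is the target, rows 1..4 are anchors.
A = np.outer([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0])
est = averaged_imputation(A[1:, :2], A[0, :2], A[1:, 2], n_neighbors=2)
```

With exact low-rank data every group gives the same answer; with noisy data, averaging over groups is what reduces the variance of the estimate.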
- mlsynth.utils.snn_helpers.completion.snn_donor_weights(X: ndarray, mask: ndarray, i: int, *, n_neighbors: int = 1, max_rank: int | None = None, spectral_energy: float = 0.95, universal: bool = False, random_state: int = 0) Tuple[ndarray, ndarray]#
Effective PCR donor weights for treated unit i.
For a treated unit, every missing (post-treatment) cell shares the same anchor rows (the donor units) and anchor columns (the pre-periods), so a single weight vector \(\beta\) over the donors reproduces the imputed counterfactual: \(\widehat Y_{it}(0) = \sum_j \beta_j Y_{jt}\). Returns (donor_indices, weights); the weights are the (unconstrained) PCR coefficients – they need not be non-negative nor sum to one. Returns empty arrays if no anchor block exists.
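The statement that one weight vector serves every post-treatment period can be checked on a toy panel. The sketch below assumes a single treated unit whose donors are all other units and whose anchor columns are the first T0 periods; `panel_donor_weights` is a hypothetical name and does not reproduce the library's internals.

```python
import numpy as np

def panel_donor_weights(Y, i, T0, rank):
    """PCR weights fitted on pre-periods, applied to all post-periods."""
    donors = np.array([r for r in range(Y.shape[0]) if r != i])
    S = Y[donors, :T0]          # donors x pre-periods (anchor block)
    q = Y[i, :T0]               # treated unit's pre-period outcomes
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    k = max(1, min(rank, int(np.sum(s > 1e-10))))
    beta = U[:, :k] @ ((Vt[:k] @ q) / s[:k])
    # Same beta for every post period: Y0_hat[i, t] = sum_j beta_j * Y[j, t].
    y0_hat_post = beta @ Y[donors, T0:]
    return donors, beta, y0_hat_post

# Rank-1 panel with no treatment effect, so the counterfactual is known.
Y = np.outer([1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0])
donors, beta, y0_hat_post = panel_donor_weights(Y, i=0, T0=2, rank=1)
```

Here the weights are neither constrained to be non-negative nor to sum to one, yet they reproduce unit 0's post-period outcomes `[3, 4]` up to floating-point error.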
- mlsynth.utils.snn_helpers.completion.snn_predict(X: ndarray, mask: ndarray, i: int, j: int, *, n_neighbors: int = 1, max_rank: int | None = None, spectral_energy: float = 0.95, universal: bool = False, random_state: int = 0) Tuple[float, bool]#
Impute entry (i, j) of X via SNN. Returns (value, feasible).
Panel ingestion for the SNN estimator.
Pivots a long panel into (N, T) outcome and treatment matrices. SNN
treats the treated post-treatment cells as missing and imputes them.
- mlsynth.utils.snn_helpers.setup.prepare_snn_inputs(df: DataFrame, outcome: str, treat: str, unitid: str, time: str) SNNInputs#
Pivot a long panel into SNNInputs.
A treated unit has treat == 1 from a common adoption period onward; SNN imputes those cells’ untreated potential outcomes.
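The pivot described above can be approximated with pandas. This is an illustrative sketch for a balanced panel with a common adoption period, not the body of prepare_snn_inputs: the function name `pivot_panel`, the returned tuple, and the toy column names are assumptions.

```python
import numpy as np
import pandas as pd

def pivot_panel(df, outcome, treat, unitid, time):
    """Pivot a long, balanced panel into (N, T) outcome and treatment arrays."""
    Y = df.pivot(index=unitid, columns=time, values=outcome)
    D = df.pivot(index=unitid, columns=time, values=treat).astype(int)
    D_arr = D.to_numpy()
    treated_idx = np.flatnonzero(D_arr.any(axis=1))   # ever-treated units
    any_treated = D_arr.any(axis=0)
    # T0 = number of pre-treatment periods (all periods if nobody is treated).
    T0 = int(np.argmax(any_treated)) if any_treated.any() else D_arr.shape[1]
    return (Y.to_numpy(dtype=float), D_arr, treated_idx, T0,
            list(Y.index), Y.columns.to_numpy())

toy = pd.DataFrame({
    "unit": ["A", "A", "A", "B", "B", "B"],
    "year": [2000, 2001, 2002] * 2,
    "y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "d": [0, 0, 0, 0, 0, 1],          # unit B adopts in 2002
})
Y, D, treated_idx, T0, unit_names, time_labels = pivot_panel(
    toy, outcome="y", treat="d", unitid="unit", time="year"
)
```

The cells where `D == 1` are the ones that would then be masked and imputed.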
Orchestration for the SNN estimator (Agarwal et al. 2021).
In the causal/panel setting, SNN masks the treated post-treatment cells as missing, imputes their untreated potential outcomes by synthetic nearest-neighbors matrix completion, and forms treatment effects as observed minus imputed.
- mlsynth.utils.snn_helpers.pipeline.run_snn(inputs: SNNInputs, *, n_neighbors: int = 1, max_rank: int | None = None, spectral_energy: float = 0.95, universal: bool = True, clip: bool = True, inference: bool = False, alpha_level: float = 0.05, random_state: int = 0) SNNResults#
Run SNN and assemble SNNResults.
- Parameters:
inputs (SNNInputs) – Preprocessed panel.
n_neighbors (int) – Number of synthetic neighbours (anchor-row groups) to average.
max_rank (int, optional) – Fixed PCR truncation rank (overrides spectral/universal rule).
spectral_energy (float) – Energy threshold for spectral rank selection.
universal (bool) – Use the Donoho-Gavish universal hard threshold for the rank (default True; well-calibrated for small low-rank panels).
clip (bool) – Clip imputations to the observed value range.
inference (bool) – If True, run a leave-one-control jackknife for the ATT SE/CI.
alpha_level (float) – Two-sided level for the jackknife CI.
random_state (int) – Seed for anchor-row splitting.
Frozen dataclasses for the Synthetic Nearest Neighbors (SNN) estimator.
Agarwal, A., Dahleh, M., Shah, D. & Shen, D. (2021). “Causal Matrix Completion.” arXiv:2109.15154.
In the causal/panel setting, SNN treats the treated units’ post-treatment potential outcomes \(Y(0)\) as the missing entries of the outcome matrix and imputes them by matrix completion (synthetic nearest neighbors), then forms treatment effects as observed minus imputed. It generalises the Synthetic Interventions / synthetic-control estimator to arbitrary missingness patterns.
- class mlsynth.utils.snn_helpers.structures.SNNInference(method: str, se: float, ci: tuple, alpha_level: float, n_jackknife: int)#
Jackknife inference for the SNN ATT.
Leaves out one control (anchor) unit at a time, re-imputes, and uses the spread of the resulting ATTs to form a standard error and confidence interval.
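A leave-one-out jackknife of this kind is commonly summarised with the textbook variance formula shown below. Whether mlsynth uses exactly this scaling or a normal critical value is an assumption here; `jackknife_ci` is a hypothetical helper that takes precomputed leave-one-control ATTs.

```python
import math
from statistics import NormalDist

def jackknife_ci(att_full, att_leave_one_out, alpha=0.05):
    """Standard jackknife SE and a normal-approximation confidence interval.

    att_full          : ATT estimated with all controls
    att_leave_one_out : list of ATTs, each re-estimated without one control
    """
    n = len(att_leave_one_out)
    mean_loo = sum(att_leave_one_out) / n
    # Jackknife variance: (n - 1) / n times the sum of squared deviations.
    variance = (n - 1) / n * sum((a - mean_loo) ** 2 for a in att_leave_one_out)
    se = math.sqrt(variance)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return se, (att_full - z * se, att_full + z * se)

# Made-up leave-one-out estimates, for illustration only.
se, (lo, hi) = jackknife_ci(att_full=2.0, att_leave_one_out=[1.0, 2.0, 3.0])
```

The interval is symmetric around the full-sample ATT and widens as the leave-one-out estimates disagree more.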
- class mlsynth.utils.snn_helpers.structures.SNNInputs(Y: ndarray, D: ndarray, treated_idx: ndarray, T0: int, unit_names: List[Any], time_labels: ndarray)#
Preprocessed panel for SNN.
- Y#
Observed outcomes, shape (N, T).
- Type:
np.ndarray
- D#
Treatment indicators, shape (N, T); 1 where treated.
- Type:
np.ndarray
- treated_idx#
Indices of ever-treated units.
- Type:
np.ndarray
- time_labels#
Length-T period labels.
- Type:
np.ndarray
- D: ndarray#
- Y: ndarray#
- time_labels: ndarray#
- treated_idx: ndarray#
- class mlsynth.utils.snn_helpers.structures.SNNResults(inputs: ~mlsynth.utils.snn_helpers.structures.SNNInputs, att: float, counterfactual: ~numpy.ndarray, effects: ~numpy.ndarray, att_by_period: ~typing.Dict[~typing.Any, float], feasible: ~numpy.ndarray, weights: ~typing.Any | None = None, inference: ~mlsynth.utils.snn_helpers.structures.SNNInference | None = None, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#
Top-level container returned by mlsynth.SNN.fit().
- att#
Average treatment effect on the treated, over imputed treated post-treatment cells.
- Type:
float
- counterfactual#
Outcome matrix with treated post-treatment \(Y(0)\) imputed, shape (N, T).
- Type:
np.ndarray
- effects#
Per-cell treatment effects (observed minus imputed) for treated post cells; NaN elsewhere, shape (N, T).
- Type:
np.ndarray
- att_by_period#
{period_label: mean effect across treated units} for post-treatment periods.
- Type:
dict
- feasible#
Boolean mask of cells SNN could impute, shape (N, T).
- Type:
np.ndarray
- weights#
Per-treated-unit donor weights (the PCR coefficients that build the counterfactual as a linear combination of the donor units). For a single treated unit, donor_weights maps donor name -> weight; with multiple treated units it holds the cross-unit average and the per-unit weights live in summary_stats['per_unit_donor_weights'].
- Type:
WeightsResults, optional
- inference#
SNNInference when inference=True; None otherwise.
- Type:
object, optional
- counterfactual: ndarray#
- effects: ndarray#
- feasible: ndarray#
- inference: SNNInference | None = None#
Example#
Proposition 99 – California’s 1988 tobacco-control program, the canonical
synthetic-control case study. SNN treats California’s post-1988
per-capita cigarette sales as the missing entries, imputes the
counterfactual by matrix completion, and reports the ATT. With
display_graphs=True it draws the observed-vs-counterfactual chart.
import pandas as pd
from mlsynth import SNN
# ------------------------------------------------------------------
# Load the Prop 99 panel (39 states, 1970-2000; California treated 1989)
# ------------------------------------------------------------------
file = (
    "https://raw.githubusercontent.com/jgreathouse9/mlsynth/"
    "refs/heads/main/basedata/smoking_data.csv"
)
df = pd.read_csv(file)
res = SNN({
    "df": df,
    "outcome": "cigsale",
    "treat": "Proposition 99",   # boolean treatment column
    "unitid": "state",
    "time": "year",
    "inference": True,           # leave-one-control jackknife
    "display_graphs": True,      # observed vs SNN counterfactual
}).fit()
print(f"ATT (avg 1989-2000) = {res.att:+.2f} packs/capita")
lo, hi = res.inference.ci
print(f"jackknife 95% CI = [{lo:+.2f}, {hi:+.2f}]")
print(f"gap by 2000 = {res.att_by_period[2000]:+.2f}")
The default universal_rank=True (Donoho-Gavish hard threshold) keeps
the rank well-calibrated for this small (39 x 31) low-rank panel; it
returns an average ATT of about -19 packs/capita, widening to roughly
-31 by 2000 – consistent with Abadie, Diamond & Hainmueller (2010).
The same SNN engine performs general (non-causal) matrix completion on
any matrix with NaN for the missing entries:
import numpy as np
from mlsynth.utils.snn_helpers import snn_complete
X = np.array([[1.0, 2.0, np.nan],
              [2.0, 4.0, 6.0],
              [3.0, np.nan, 9.0]])
completed, feasible = snn_complete(X)
References#
Abadie, A., Diamond, A., & Hainmueller, J. (2010). “Synthetic Control Methods for Comparative Case Studies.” Journal of the American Statistical Association 105(490):493-505.
Agarwal, A., Shah, D., Shen, D., & Song, D. (2021b). “On Robustness of Principal Component Regression.” Journal of the American Statistical Association.
Agarwal, A., Dahleh, M., Shah, D., & Shen, D. (2021). “Causal Matrix Completion.” arXiv:2109.15154.
Athey, S., Bayati, M., Doudchenko, N., Imbens, G., & Khosravi, K. (2021). “Matrix Completion Methods for Causal Panel Data Models.” Journal of the American Statistical Association 116(536):1716-1730.
Gavish, M., & Donoho, D. L. (2014). “The Optimal Hard Threshold for Singular Values is 4/sqrt(3).” IEEE Transactions on Information Theory 60(8):5040-5053.
Ma, W., & Chen, G. H. (2019). “Missing Not at Random in Matrix Completion: The Effectiveness of Estimating Missingness Probabilities Under a Low Nuclear Norm Assumption.” NeurIPS.