About mlsynth
Synthetic control, with batteries included. mlsynth is a Python package that gives applied researchers and data scientists access to the modern synthetic-control literature under a single, unified API. You provide a long DataFrame and a configuration dictionary; the estimator returns a typed result object with the ATT, the counterfactual, donor weights, fit statistics, and (where applicable) confidence or credible intervals.
At the time of writing, mlsynth ships more than thirty estimators spanning the synthetic-control literature, from the canonical Abadie-Diamond-Hainmueller method through Bayesian spike-and-slab variable selection, state-space models, matrix completion, synthetic difference-in-differences for staggered adoption, synthetic IV, synthetic-design methods for prospective experiments, and more. Every estimator is implemented from its original source paper and – for the verified subset – replicates the paper’s published numbers in a dedicated Verification section.
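As a point of reference, the quantities named above can be written down for the canonical single-treated-unit case. This is a sketch in standard notation, not a description of every estimator in the library: unit 1 is assumed treated after period \(T_0\), there are \(N\) donor units indexed \(j = 2, \dots, N+1\) with weights \(w_j\), and \(T\) (an assumed symbol) is the total number of periods.

\[
\hat{y}_{1t}^{0} = \sum_{j=2}^{N+1} w_j \, y_{jt},
\qquad
\widehat{\mathrm{ATT}} = \frac{1}{T - T_0} \sum_{t=T_0+1}^{T} \left( y_{1t} - \hat{y}_{1t}^{0} \right).
\]

Here \(\hat{y}_{1t}^{0}\) is the counterfactual path, and the ATT is the average post-treatment gap between the observed and counterfactual outcomes.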
Design philosophy
mlsynth is built around three principles.
Long DataFrame in, ATT out.
Every estimator consumes the same long-format panel: one row per unit
per time period, with at minimum a unit identifier, a time index, an
outcome column, and a binary treatment indicator. There is no
Dataprep object to construct, no pivoting to wide form, no
special-cased input for each method. The same DataFrame that fits a
Two-Step Synthetic Control will fit Matching and Synthetic Control (MASC).
One config dict, one .fit() call.
Estimators take a single configuration dictionary and expose a single .fit() method that returns a
frozen, typed result. There are no separate compute_weights /
compute_counterfactual / compute_inference steps for the user
to assemble – the orchestration is the estimator’s job.
Every estimator is verified against its source.
Most synthetic-control libraries ship “an implementation”; the user
trusts it does what the paper says. mlsynth’s Verification campaign
holds every estimator to a stronger contract: each
docs/<estimator>.rst page contains (or will contain) a
Verification section that reproduces one of the source paper’s
reported numbers – either by replicating an empirical result on the
authors’ own dataset (“Path A”) or by reproducing a Monte Carlo from
the paper’s simulation section (“Path B”).
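The first two principles can be illustrated with a deliberately small toy. The sketch below is not mlsynth code: `ToyEstimator` and `ToyResult` are hypothetical names, the column names are made up, and the "estimator" is just an equal-weight donor average with a pre-period level shift. It only shows the shape of the contract: a long DataFrame and a config dict go in, one `.fit()` call returns a frozen result, and the reshape to wide form happens inside the estimator.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class ToyResult:
    """Immutable result object (hypothetical; not mlsynth's result type)."""
    att: float
    counterfactual: tuple


class ToyEstimator:
    """Hypothetical stand-in for illustration; not an mlsynth class."""

    def __init__(self, config):
        self.config = dict(config)

    def fit(self):
        cfg = self.config
        df, unit, time = cfg["df"], cfg["unitid"], cfg["time"]
        outcome, treat = cfg["outcome"], cfg["treat"]

        # Long-format contract: one row per unit-period, binary treatment.
        if df.duplicated([unit, time]).any():
            raise ValueError("expected one row per unit per time period")
        if not set(df[treat].unique()) <= {0, 1}:
            raise ValueError("treatment indicator must be binary")

        # The pivot to wide form is the estimator's job, not the user's.
        wide = df.pivot(index=time, columns=unit, values=outcome)
        treated_unit = df.loc[df[treat] == 1, unit].iloc[0]
        post_mask = wide.index.isin(df.loc[df[treat] == 1, time])

        # Toy counterfactual: equal-weight donor mean, shifted to match
        # the treated unit's pre-period level.
        donors = wide.drop(columns=treated_unit).mean(axis=1)
        shift = (wide[treated_unit] - donors)[~post_mask].mean()
        counterfactual = donors + shift
        gap = wide[treated_unit] - counterfactual
        return ToyResult(att=float(gap[post_mask].mean()),
                         counterfactual=tuple(counterfactual.tolist()))


# A tiny long-format panel: unit "A" is treated from 2003 onward.
outcomes = {"A": [1.0, 2.0, 3.0, 6.0, 7.0],
            "B": [0.0, 1.0, 2.0, 3.0, 4.0],
            "C": [2.0, 3.0, 4.0, 5.0, 6.0]}
years = range(2000, 2005)
panel = pd.DataFrame(
    [{"unit": u, "year": y, "gdp": v, "treated": int(u == "A" and y >= 2003)}
     for u, vals in outcomes.items() for y, v in zip(years, vals)]
)

config = {"df": panel, "outcome": "gdp", "treat": "treated",
          "unitid": "unit", "time": "year"}
result = ToyEstimator(config).fit()
print(f"ATT = {result.att:+.3f}")  # ATT = +2.000
```

Because the result is a frozen dataclass, assigning to `result.att` after the fact raises an error, which is the property the "frozen, typed result" wording points at.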
The unified API in action
The same data, the same DataFrame, four different estimators – with
only the keys in the configuration dictionary changing. The example
below moves between vanilla Robust SCM, its convex variant from Dennis
Shen’s MIT master’s thesis (MIT DSpace),
the clustered variant of Rho et al. (2025), and the Bayesian variant (https://jmlr.csail.mit.edu/papers/volume19/17-777/17-777.pdf) – all
via the same CLUSTERSC class:
import pandas as pd
from mlsynth import CLUSTERSC

# Load the Basque panel in long format.
url = ("https://raw.githubusercontent.com/jgreathouse9/mlsynth/"
       "main/basedata/basque_data.csv")
data = pd.read_csv(url)

# Keys shared by every variant; columns are selected by position.
base = {"df": data, "outcome": data.columns[3],
        "treat": data.columns[-1], "unitid": data.columns[1],
        "time": data.columns[2], "display_graphs": True}

# Only the method-specific keys differ between variants.
variants = [
    ("Vanilla RSC", {**base, "method": "PCR", "objective": "OLS"}),
    ("Convex RSC", {**base, "method": "PCR", "objective": "SIMPLEX"}),
    ("Clustered RSC", {**base, "method": "PCR", "objective": "OLS",
                       "cluster": True}),
    ("Bayesian RSC", {**base, "method": "PCR", "Frequentist": False,
                      "cluster": True}),
]

# Fit each variant and report its ATT.
for name, cfg in variants:
    res = CLUSTERSC(cfg).fit()
    print(f"{name:15} ATT = {res.att:+.3f}")
Four estimators, one DataFrame, and only a few configuration keys that differ between them. The same pattern recurs throughout the library: switching estimators means changing configuration keys, not reshaping the data.
The verification campaign
Wherever possible, mlsynth reproduces results from each estimator’s source paper. Each verified estimator’s documentation page contains a replication section showing that the implementation matches one of the paper’s headline numbers. We distinguish between Path A (empirical replication on the original authors’ dataset, matching their published estimates) and Path B (Monte Carlo replication of the paper’s simulation section). Where both paths are feasible, both are run; where the authors’ data is not redistributable or not easily accessible, Path B is used.
See Replications for the catalogue of all current replications, with headline numbers and per-family coverage status.
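To make the Path B idea concrete, here is a minimal sketch of a Monte Carlo check. It is not taken from any mlsynth Verification section: the one-factor data-generating process, the unconstrained least-squares weights, and every function name are assumptions chosen for brevity, and the planted `TRUE_ATT` stands in for the figure a source paper would report.

```python
import numpy as np

TRUE_ATT = 2.0  # known effect planted in the simulated data (assumed value)
T_PRE, T_POST, N_DONORS = 30, 10, 10


def simulate_panel(rng):
    """One-factor panel: loading * factor + noise, plus a post-period effect."""
    n_periods = T_PRE + T_POST
    factor = rng.normal(size=n_periods)
    loadings = rng.uniform(0.5, 1.5, size=N_DONORS)
    donors = np.outer(factor, loadings) + 0.5 * rng.normal(size=(n_periods, N_DONORS))
    treated = loadings.mean() * factor + 0.5 * rng.normal(size=n_periods)
    treated[T_PRE:] += TRUE_ATT
    return treated, donors


def estimate_att(treated, donors):
    """Least-squares donor weights fitted on the pre-period only."""
    weights, *_ = np.linalg.lstsq(donors[:T_PRE], treated[:T_PRE], rcond=None)
    gap = treated - donors @ weights
    return gap[T_PRE:].mean()


def monte_carlo_bias(n_reps=200, seed=0):
    """Average estimation error across replications, with a fixed seed."""
    rng = np.random.default_rng(seed)
    estimates = [estimate_att(*simulate_panel(rng)) for _ in range(n_reps)]
    return float(np.mean(estimates) - TRUE_ATT)


bias = monte_carlo_bias()
print(f"mean bias over 200 replications: {bias:+.4f}")
```

A real Path B check would swap in the paper's own simulation design and compare against its reported bias or RMSE rather than against a planted effect.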
Installation
mlsynth requires Python 3.9 or later and standard scientific dependencies. The simplest install is from PyPI:
pip install mlsynth
For the development version directly from GitHub:
pip install -U git+https://github.com/jgreathouse9/mlsynth.git
Confirm the install:
>>> import mlsynth
>>> mlsynth.__version__
For an isolated environment, the standard
python -m venv mlsynth_env && source mlsynth_env/bin/activate
pattern works as expected.
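The install can also be checked without importing the package, using only the standard library. This is a small sketch: `installed_version` is a hypothetical helper, and it assumes the distribution name is `mlsynth`, as in the `pip install` command above.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="mlsynth"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# The documentation states a minimum of Python 3.9.
if sys.version_info < (3, 9):
    print("Python 3.9 or later is required")

found = installed_version()
print("mlsynth is not installed" if found is None else f"mlsynth {found}")
```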
For a fuller tour of which estimator fits which problem, see A practitioner’s decision tree.
Use cases
mlsynth is a general-purpose synthetic-control toolkit. The applications below are the ones the library has been used for most heavily in practice.
Observational causal inference. Estimate the average treatment effect of a policy, a regulatory change, a marketing intervention, or a supply shock from a panel of already-observed outcomes. This is the canonical comparative-case-study setting that synthetic control was built for; mlsynth supplies more than two dozen estimators covering low-dimensional, high-dimensional, staggered, and instrumental variants.
Experimental design at the market level. When randomising individual units is infeasible – as in geo-marketing experiments, cluster-level public-health interventions, or market-level pricing studies – mlsynth supports the design-stage problem of choosing which units to treat, before any intervention takes place. See Synthetic Design (SYNDES), Synthetic Controls for Experimental Design (MAREX), Parallel-Trends Supergeo Design (PANGEO), Synthetic Principal Component Design (SPCD), and Lexicographic Synthetic Control (LEXSCM). This use case is described in detail in Synthetic Controls for Marketing Experiments.
High-dimensional donor pools. When the number of donor units far exceeds the number of pre-treatment periods, \(N \gg T_0\) (as arises with commodity-category panels, large product portfolios, or fine industry classifications), or when there are many covariates to choose from, the classical SC quadratic program loses its unique solution, and Lasso-style alternatives sometimes over-select. Bayesian Synthetic Control with a Soft Simplex Constraint (BVS-SS), Cluster Synthetic Controls (CLUSTERSC), Multi-Level Synthetic Control (mlSC), Relaxed / Penalized Synthetic Control (RESCM), Sparse Synthetic Control (SparseSC), Forward-Selected Synthetic Control (FSCM), and Panel Data Approach (PDA) each address this regime with a different selection strategy.
Special data structures. Multiple outcomes (Synthetic Control with Multiple Outcomes (SCMO)), continuous treatments (Continuous-Treatment Synthetic Control (CTSC)), full distributional ATTs (Distributional Synthetic Control (DSC)), multiple treatment arms (Synthetic Interventions (SI)), individual-level units (MicroSynth (User-Level Balancing SC)), missing outcome cells (Matrix Completion with Nuclear Norm Minimization (MCNNM), Synthetic Nearest Neighbors / Causal Matrix Completion (SNN)), and strongly trending series (Time-Aware Synthetic Control (TASC), Synthetic Business Cycle (SBC), Harmonic Synthetic Control (HSC)) each have estimators built for them.
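The non-uniqueness mentioned under high-dimensional donor pools can be seen in a few lines of NumPy. The sketch below uses made-up random data with five donors and three pre-treatment periods, and constructs two different weight vectors, both on the simplex, that reproduce the treated unit's pre-period path exactly. It illustrates the problem only; it is not how any mlsynth estimator resolves it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_donors, t_pre = 5, 3  # more donors than pre-treatment periods

# Pre-period donor outcomes (rows are periods, columns are donors).
donors_pre = rng.normal(size=(t_pre, n_donors))

# Build a treated path that equal weights reproduce exactly.
w1 = np.full(n_donors, 1.0 / n_donors)
treated_pre = donors_pre @ w1

# Directions that change neither the fit nor the sum of the weights lie in
# the null space of the stacked matrix [donors_pre; 1'].
constraints = np.vstack([donors_pre, np.ones(n_donors)])
_, _, vt = np.linalg.svd(constraints)
direction = vt[-1]  # null vector: the 4 x 5 matrix has rank at most 4

# Take a step small enough to keep every weight nonnegative.
step = 0.5 * w1.min() / np.abs(direction).max()
w2 = w1 + step * direction

print("w1:", np.round(w1, 3))
print("w2:", np.round(w2, 3))
print("same pre-period fit:", np.allclose(donors_pre @ w2, treated_pre))
```

Both `w1` and `w2` are nonnegative, sum to one, and fit the pre-period perfectly, so pre-treatment fit alone cannot choose between them. This is why the estimators listed for this regime add a selection or regularization criterion.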
Citation
If you use mlsynth in academic work, please cite the library as:
@software{mlsynth,
author = {Greathouse, Jared},
title = {{mlsynth}: A Python Toolbox of Synthetic-Control Methods
for Program Evaluation},
year = {2025},
version = {0.1.2},
url = {https://github.com/jgreathouse9/mlsynth},
}
Please also cite the original source paper of any estimator you use in your analysis. The reference for each estimator is listed at the top of its documentation page; a consolidated bibliography is in References.
Acknowledgments
mlsynth was developed by Jared Greathouse at Georgia State University and benefited from advice, code contributions, and methodological discussions with Jason Coupet, Kathy Li, Mani Bayani, Zhentao Shi, and Jaume Vives-i-Bastida. The Robust PCA Synthetic Control implementation in particular would not exist without Mani Bayani’s original code contribution.
Roadmap
Current development priorities, in roughly the order they will land:
Verification coverage to 100%. Extend the Path A / Path B campaign to every estimator in the library.
A unified result-object contract. All estimators currently expose an ATT, a counterfactual path, and (where applicable) a CI, but the attribute names differ across families. A documented minimum contract (result.att, result.att_ci, result.counterfactual, result.weights, result.pre_rmse) is in design.
Staggered-adoption expansion. Several estimators currently built for a single treated unit (the \(\ell_2\) relaxation of Shi & Wang, the Factor Model Approach of Li & Sonnier) are in principle compatible with staggered adoption; exposing those extensions in the API is planned.
A comparison matrix. A single table indexing which estimators produce CIs, which handle staggered adoption, which scale to large \(N\), which require covariates, etc.
Performance work. Several per-unit synthetic-control fits currently incur cvxpy’s compilation overhead; the plan is to replace them with closed-form simplex projection where possible.
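For readers unfamiliar with the simplex projection mentioned in the performance item, here is a minimal sketch. It assumes the item refers to Euclidean projection onto the probability simplex, computed with the standard sort-and-threshold algorithm. The function name `project_to_simplex` is hypothetical and this is not mlsynth's implementation.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    v = np.asarray(v, dtype=float)
    sorted_desc = np.sort(v)[::-1]
    cumulative = np.cumsum(sorted_desc)
    ranks = np.arange(1, v.size + 1)
    # Largest rank at which the sorted entry stays positive after the shift.
    positive = sorted_desc - (cumulative - 1.0) / ranks > 0
    rho = ranks[positive][-1]
    theta = (cumulative[rho - 1] - 1.0) / rho
    # Shift every entry by theta and clip negatives to zero.
    return np.maximum(v - theta, 0.0)


print(project_to_simplex([2.0, 0.0]))        # [1. 0.]
print(project_to_simplex([1.0, 1.0]))        # [0.5 0.5]
print(project_to_simplex([0.2, 0.3, 0.5]))   # already on the simplex
```

The projection costs one sort, so it needs no solver compilation step. On its own it does not solve the synthetic-control weight problem, but it can serve as a building block, for example inside a projected-gradient loop.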
If you would like to contribute to any of the above, see the GitHub issue tracker.