.. mlsynth documentation master file. You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. Welcome to mlsynth 1.0.0 ======================== .. meta:: :description: A Python toolbox of synthetic-control estimators for program evaluation. Express your causal panel-data problem in a long DataFrame, pick an estimator, get an ATT. :keywords: synthetic control, causal inference, program evaluation, difference-in-differences, panel data, ATT, Python. .. raw:: html Synthetic control, for everyone. mlsynth is an open-source Python toolbox of synthetic-control methods for program evaluation. It implements the classical Abadie-Diamond-Hainmueller estimator alongside a growing catalogue of modern variants -- Bayesian spike-and-slab selection, state-space modelling, instrumental variants, sequential difference-in-differences, matrix completion -- under a single long-DataFrame API. Every estimator's documentation page includes a *Verification* section that reproduces the original paper's reported numbers where applicable. For example, the following code reproduces Abadie, Diamond and Hainmueller's Proposition 99 study and cross-checks it against a robust variant. It loads the California tobacco panel shipped with the library, fits the classical synthetic control on the canonical predictor set, then fits an ordinary-least-squares principal-component-regression control on the same outcomes, and prints both estimated effects: .. code:: python import pandas as pd from mlsynth import VanillaSC, CLUSTERSC # California tobacco panel, 1970-2000: per-capita cigarette sales plus the # Abadie-Diamond-Hainmueller predictors. url = ("https://raw.githubusercontent.com/jgreathouse9/mlsynth/" "main/basedata/augmented_cali_long.csv") df = pd.read_csv(url) df = df[df["year"] <= 2000].copy() df["treated"] = ((df["state"] == "California") & (df["year"] >= 1989)).astype(int) # Three lagged-outcome predictors, as in ADH (2010), Table 2. for L in (1975, 1980, 1988): df[f"cig{L}"] = df["state"].map( df[df["year"] == L].set_index("state")["cigsale"]) covariates = ["loginc", "p_cig", "pct15-24", "pc_beer", "cig1975", "cig1980", "cig1988"] windows = {"loginc": (1980, 1988), "p_cig": (1980, 1988), "pct15-24": (1980, 1988), "pc_beer": (1984, 1988), "cig1975": (1975, 1975), "cig1980": (1980, 1980), "cig1988": (1988, 1988)} common = dict(df=df, outcome="cigsale", treat="treated", unitid="state", time="year", display_graphs=False) # Classical synthetic control on the canonical predictor set. sc = VanillaSC({**common, "covariates": covariates, "covariate_windows": windows, "backend": "mscmt", "canonical_v": "min.loss.w", "seed": 0}).fit() # Robust principal-component-regression control on the same outcomes. pcr = CLUSTERSC({**common, "method": "pcr", "pcr_objective": "OLS"}).fit() print(f"VanillaSC ATT = {sc.att:+.2f} packs/yr") print(f"OLS-PCR ATT = {pcr.att:+.2f} packs/yr") prints:: VanillaSC ATT = -18.98 packs/yr OLS-PCR ATT = -21.39 packs/yr The classical synthetic control reproduces ADH's headline estimate -- a drop of about nineteen packs per capita -- and the principal-component control, built on a different identifying assumption, agrees on a large reduction. Fitting an estimator and then stress-testing the finding against a second one is the everyday way mlsynth is used. This short script is a representative example of what mlsynth can do. In addition to classical SC, mlsynth also supports Bayesian variable selection (:doc:`bvss`), staggered-adoption sequential difference-in-differences (:doc:`seq_sdid`, :doc:`spsydid`), instrumental synthetic control (:doc:`siv`), matrix completion under missingness (:doc:`mcnnm`), state- space time-aware control (:doc:`tasc`), and clustered / robust high- dimensional variants (:doc:`clustersc`, :doc:`mlsc`, :doc:`marex`). For a guided tour of the estimator catalogue, start with the :doc:`about` page. Browse the *Estimators* sidebar for the full list grouped by methodology. mlsynth builds on top of `numpy `_, `pandas `_, `scipy `_, `scikit-learn `_, `cvxpy `_, `pydantic `_, and `statsmodels `_; convex programs are routed through cvxpy's solver stack. Installation. Install the latest version straight from GitHub: .. code:: bash pip install -U git+https://github.com/jgreathouse9/mlsynth.git mlsynth runs on Python 3.9 and later. The base install carries every core dependency and runs every estimator except two that rely on heavier, specialised backends. Those backends ship as optional *extras*, so you install only the weight you use: .. list-table:: :header-rows: 1 :widths: 12 30 58 * - Extra - Adds - Needed for * - ``design`` - ``pyscipopt`` (the SCIP mixed-integer solver) - the experimental-design estimators :doc:`syndes` and :doc:`marex`, whose market-selection step is a mixed-integer quadratic program * - ``bayes`` - ``numpyro`` (JAX-based MCMC) - :doc:`spotsynth`'s Bayesian synthetic-control mode * - ``all`` - both of the above - the full feature set Request an extra with bracket syntax, quoting the specifier so the shell does not glob the brackets: .. code:: bash # SCIP solver for SYNDES / MAREX pip install -U "mlsynth[design] @ git+https://github.com/jgreathouse9/mlsynth.git" # NumPyro for SPOTSYNTH's Bayesian mode pip install -U "mlsynth[bayes] @ git+https://github.com/jgreathouse9/mlsynth.git" # everything pip install -U "mlsynth[all] @ git+https://github.com/jgreathouse9/mlsynth.git" Both extra backends are imported lazily, so ``import mlsynth`` and importing any estimator class always succeed on the base install; the extra is consulted only when you actually run the design optimiser (SYNDES / MAREX) or SPOTSYNTH's Bayesian path, which otherwise raise a clear error naming the missing package. ``pyscipopt`` ships prebuilt wheels that bundle SCIP, so ``mlsynth[design]`` is normally a plain install with no separate solver setup. The test suite is a development artifact and is not shipped in the installed package -- clone the repository to run ``pytest``. Not sure which estimator to use? Walk the :doc:`choose` decision tree -- a sequence of identification and design questions that funnels you from "what kind of problem do I have?" down to one or two methods, with the catalogue grouped by family. Community. The mlsynth community spans economists, statisticians, and data scientists who use synthetic-control methods for program evaluation across policy, marketing, sports, and public health. We welcome you to join us! * To share feature requests and bug reports, use the `issue tracker `_. * To follow development, watch the `mlsynth repository `_ on GitHub. Development. mlsynth is maintained by `Jared Greathouse `_ (Georgia State University). The project would not be possible without the kind efforts of and discussions with `Jason Coupet `_, `Kathy Li `_, `Mani Bayani `_, `Zhentao Shi `_, and `Jaume Vives-i-Bastida `_, along with a growing list of contributors. News. The verification campaign now covers thirty-two of the thirty-six estimators in mlsynth -- each auditing its implementation against its source paper, either by reproducing an empirical Table value on the authors' own data ("Path A") or by reproducing a Monte Carlo from the paper's simulation section ("Path B"), or against an authoritative reference implementation. See the :doc:`replications` page for the full catalogue with headline numbers. .. toctree:: :hidden: :caption: Get started about choose replications benchmarks references .. toctree:: :hidden: :caption: Observational: canonical workhorses vanillasc tssc fdid .. toctree:: :hidden: :caption: Observational: decomposition-first sbc hsc .. toctree:: :hidden: :caption: Observational: generalised estimand / treatment / unit scmo ctsc dsc si microsynth .. toctree:: :hidden: :caption: Observational: convex-hull relaxation iscm nsc .. toctree:: :hidden: :caption: Observational: No Donors shc .. toctree:: :hidden: :caption: Observational: high-dimensional donors bvss clustersc mlsc fscm masc msqrt sparse_sc pda rescm .. toctree:: :hidden: :caption: Observational: time-aware and factor models tasc fma dscar .. toctree:: :hidden: :caption: Observational: staggered adoption sdid seq_sdid ppscm ssc .. toctree:: :hidden: :caption: Observational: spillover-aware spsydid spillsynth spotsynth .. toctree:: :hidden: :caption: Observational: missing data mcnnm snn rmsi .. toctree:: :hidden: :caption: Observational: endogenous treatment siv orthsc proximal .. toctree:: :hidden: :caption: Experimental design lexscm marex syndes pangeo geolift spcd musc rolldid .. toctree:: :hidden: :caption: Utilities and internals data helpers