Distributional Synthetic Control (DSC)

Contents

Distributional Synthetic Control (DSC)#

Overview#

DSC generalizes the synthetic control method from aggregate values to entire outcome distributions. Where classical synthetic control builds a donor combination that matches the treated unit’s pre-period mean and returns a single ATT, DSC builds a donor combination that matches the treated unit’s pre-period quantile function and returns the full counterfactual distribution at every post-period – hence the quantile treatment effect (QTE) at any quantile \(q \in (0, 1)\), and any functional of the distribution (Lorenz curves, Gini coefficients, interquartile ranges, stochastic-dominance comparisons).

The method is due to Gunsilius (2023); mlsynth’s implementation is validated against the author’s reference DiSCo R package, and the large-sample theory is from Zhang, Zhang & Zhang (2026). The core idea is the 2-Wasserstein barycenter: the natural generalization of a weighted average from points on the real line to probability distributions. Averaging quantile functions (not densities) keeps the synthetic unit geometrically faithful – it reproduces the support, the moments, and the quantiles of the target – whereas a naive linear mixture of densities is multimodal and matches none of them.

When to use this estimator#

Reach for DSC when the distribution is the object of interest, and you have micro-data. Three motivating regimes:

  • Inequality and the whole distribution, not the mean. A minimum-wage increase, a cash-transfer program, or a tax reform may leave the mean of family income almost unchanged while compressing the lower tail or fattening the middle. Gunsilius (2023, Section 6.2) studies exactly this: the distribution of equalized family income (as multiples of the federal poverty line) across U.S. states, treating Michigan as the unit of interest. A mean-only synthetic control would miss the distributional story entirely.

  • Heterogeneous treatment effects across the outcome distribution. In marketing, a promotion might lift spending among already-heavy buyers while doing nothing for light buyers; the QTE curve \(q \mapsto \widehat\alpha_{1t, q}\) reveals where in the spending distribution the effect lands. DSC returns this directly, without pre-specifying which quantile to test.

  • Repeated cross-sections with many individuals per cell. Whenever each (unit, time) cell is itself a sample – households in a state-year, customers in a store-week, patients in a hospital-month – DSC uses all of that within-cell information rather than collapsing it to a mean.

If you only have one aggregate number per (unit, time) cell, DSC reduces to classical synthetic control (Gunsilius 2023, Section 3.1) and the dedicated aggregate estimators (e.g. mlsynth.CLUSTERSC, mlsynth.FMA) are the appropriate tools.

Notation#

We follow the repository’s canonical conventions. There are \(J + 1\) units; unit \(1\) is treated and units \(j = 2, \dots, J + 1\) are donors. The panel runs over periods \(t \in \{1, \dots, T\}\) with treatment beginning at \(T_0 + 1\); the pre-period is \(\mathcal{T}_0 = \{1, \dots, T_0\}\) and the post-period is \(\mathcal{T}_1 = \{T_0 + 1, \dots, T\}\). Each cell \((j, t)\) carries \(n_{jt}\) individual observations \(\{Y_{l, jt}\}_{l = 1}^{n_{jt}}\). \(F_{Y_{jt}}\) is the cell’s outcome distribution and \(F^{-1}_{Y_{jt}}\) its quantile function; \(\mathcal{H} = \{w \in \mathbb{R}^J_{\ge 0} : \mathbf{1}^\top w = 1\}\) is the unit simplex. Weights \(\widehat w_t\) are fit per pre-period and aggregated to \(\widehat w\). \(W_2\) denotes the 2-Wasserstein distance, and \(F^{-1}_{Y_{1t, N}}\) the counterfactual (no-treatment) quantile function of the treated unit.

Mathematical Formulation#

Setup#

The empirical quantile estimator for a cell is the order-statistic rule

\[\widehat F^{-1}_{Y_{jt, n_j}}(q) = Y_{t, n_j(k)}, \quad \frac{k - 1}{n_j} < q \le \frac{k}{n_j},\]

where \(Y_{t, n_j(k)}\) is the \(k\)-th order statistic of cell \((j, t)\). The DSC counterfactual quantile function for the treated unit at a post-period \(t > T_0\) is the 2-Wasserstein barycenter of the donor quantile functions,

\[\widehat F^{-1}_{Y_{1t, N}}(q) = \sum_{j = 2}^{J + 1} \widehat w_j\, \widehat F^{-1}_{Y_{jt, n_j}}(q), \qquad \widehat w \in \mathcal{H},\]

and the quantile treatment effect is \(\widehat \alpha_{1t, q} = \widehat F^{-1}_{Y_{1t, I}}(q) - \widehat F^{-1}_{Y_{1t, N}}(q)\), the gap between the observed treated quantile function and its counterfactual. Averaging quantile functions rather than densities is what makes the synthetic unit geometrically faithful: a weighted average of quantile functions is itself a valid quantile function, so the barycenter has the same kind of support and shape as the target (Gunsilius 2023, Figure 1; Agueh & Carlier 2011).

Identifying assumptions#

The structural content sits in the causal model \(h(t, \cdot)\) mapping latent variables to observed distributions. DSC recovers the correct counterfactual under a distributional analogue of the classical factor-model restriction.

Assumption 1 (scaled-isometry causal model – identification). The data-generating map \(h(t, \cdot)\) on the 2-Wasserstein space is a scaled isometry for every \(t\): it preserves relative distances between distributions up to a common scale, \(d(U_1, U_j) = \tau\, d\bigl(h(t, U_1), h(t, U_j)\bigr)\). Then (Gunsilius 2023, Theorem 1) the DSC quantile function \(\widehat F^{-1}_{Y_{1t, N}}\) coincides with the treated unit’s quantile function had it not been treated. For the panel model it suffices that the maps \(U_{jt} \mapsto U_{jt'}\) preserve the optimal weights \(\widehat\lambda^\star\) across periods. Remark. This is the distributional counterpart of the linear factor model behind classical synthetic control: on the real line, scaled isometries in Euclidean space are exactly affine maps, so the classical method’s affine factor structure is the special case. Working in Wasserstein space buys generality – by the Kloeckner (2010) characterization, isometries there can even deform the support – so identification can hold with as little as one pre- and one post-period, analogous to changes-in-changes (Athey & Imbens 2006), without a rank-invariance assumption.

Assumption 2 (finite second moments and independence – consistency). All cell distributions have finite second moments, \(\mathbb{E}[Y_{jt}^2] < \infty\), and are independent across units for each \(t\). Remark. This is all that is needed for the estimated weights to be consistent (Gunsilius 2023, Proposition 1): the plug-in empirical-quantile weights \(\widehat{\widetilde\lambda}^\star_{tn}\) converge to the population optimal weights as the within-cell sample sizes grow. It is deliberately weak – no smoothness or continuity is required just to estimate the weights.

Assumption 3 (absolute continuity – uniform inference). For uniform results on the whole quantile process, each \(F_{Y_{jt}}\) is absolutely continuous with a density bounded away from zero on its support. Remark. This is the standard regularity condition (Csörgő & Horváth 1993) that delivers the Gaussian large-sample distribution of the counterfactual quantile process (Proposition 2) and hence bootstrap confidence bands. It can be relaxed at the cost of a non-Gaussian limit (discrete case, Proposition 5), which is why the inference below leans on Monte-Carlo / permutation routines rather than closed-form critical values.

When the assumptions bind: practical diagnostics#

The theory above states the structural conditions; this section translates them into things you can actually look for in a dataset. Each item names a plausible failure mode of an applied DSC study and a concrete diagnostic.

  1. Scaled-isometry structure fails – the donor pool cannot reproduce the treated quantile function in the pre-period. DSC’s identification requires the data-generating maps to be scaled isometries on Wasserstein space (Assumption 1). The cheapest practical proxy is whether the barycenter actually fits the treated quantile function pre-treatment.

    Plausibly violated when the treated unit has support, skewness, or tail behaviour that no convex combination of donors can match – a state whose income distribution is bimodal while every donor is unimodal, a hospital whose patient mix is qualitatively different from every other hospital. Diagnostic: inspect DSCResults.pre_period_wasserstein. If the pre-period \(W_2\) distance is large relative to the cross-donor scale, the barycenter is not finding the treated unit in the donor convex hull; overlay \(\widehat F^{-1}_{Y_{1t}}\) against the fitted barycenter at a few pre-periods and look for systematic miss in the tails.

  2. Too few within-cell observations – empirical quantile functions are noisy. Consistency (Assumption 2) requires finite second moments, but in practice it also requires enough draws per cell for the empirical quantile function \(\widehat F^{-1}_{Y_{jt, n_j}}\) to be a reasonable estimate of the population quantile function. With small \(n_{jt}\) the order-statistic estimator is jagged and the Wasserstein loss is dominated by sampling noise.

    Plausibly violated when you have only tens of individuals per cell – a small store-week, a rare disease cohort, a thinly populated state-year. Diagnostic: re-fit the model on the half of cells with the largest \(n_{jt}\) only, and check whether the QTE moves substantially. Alternatively, bin the quantile grid more coarsely and confirm the weights stabilise.

  3. Heavy tails or infinite second moments. The 2-Wasserstein loss requires finite second moments (Assumption 2). With Pareto-tailed outcomes (firm revenue, financial losses) the empirical loss is dominated by tail outliers and the weights become unstable.

    Plausibly violated when the outcome is a Pareto-like distribution or has visible outliers in the extreme upper quantiles. Diagnostic: refit DSC after Winsorising the outcome at, say, the 99th percentile and compare; large differences flag the second-moment assumption. Alternatively, run the analysis on a log-transformed outcome and confirm conclusions are preserved.

  4. Discrete or mixed outcomes – inference is non-Gaussian. Uniform large-sample bands (Assumption 3) require absolute-continuity of each cell distribution with a density bounded away from zero. With ordinal-scale outcomes or large point masses the quantile-process limit is non-Gaussian (Gunsilius 2023, Proposition 5).

    Plausibly violated when the outcome has heaps (counts, Likert scales, capped variables). Diagnostic: the point estimate is still fine – but lean on the placebo permutation test (compute_inference=True) rather than analytic bands, since the permutation procedure is distribution-free.

  5. Non-stationary donors – weights drift across pre-periods. DSC aggregates per-pre-period weights via \(\widehat w = \sum_t \lambda_t \widehat w_t\). The implicit assumption is that the donor-to-treated mapping is the same isometry across \(t\).

    Plausibly violated when the cross-sectional distribution of donors itself shifts over the pre-period (a new entrant changes a market, a policy change affects only some donors). Diagnostic: inspect the per-pre-period weights (returned in the DSC fit object) and look for a single donor whose weight rises or falls monotonically over \(t\); switch lambda_method="recency" to down-weight stale pre-periods.

  6. Donor contamination – another donor was treated in the pre-period. DSC, like classical SC, assumes donors are untreated in the pre-period and remain on their counterfactual trajectory throughout the post-period (no spillovers, no anticipation).

    Plausibly violated when a donor enacts a similar policy partway through the analysis window, or when the treatment generates spillovers across units (geographic neighbours, vertically linked markets). Diagnostic: drop suspected contaminated donors and refit; large QTE changes flag the SUTVA failure. Spillover-aware estimators (Spillover-Aware Synthetic Control (SPILLSYNTH), Spatial Synthetic Difference-in-Differences (SpSyDiD)) are aggregate-only and so do not directly substitute, but they can characterise the spillover size at the mean.

When not to use DSC#

  • Only one aggregate observation per (unit, time) cell. DSC needs a sample within each cell to estimate the empirical quantile function. With one number per cell, the empirical quantile function collapses to a step at that single value and DSC reduces to classical synthetic control. Use the aggregate estimators – Two-Step Synthetic Control, canonical SCM, Factor Model Approach (FMA), Cluster Synthetic Controls (CLUSTERSC) – instead.

  • The mean (or any single moment) is the genuine object of interest. If the policy question is “did the average outcome change?” and there is no scientific reason to read off effects at specific quantiles, DSC is overkill: it pays a variance cost for distributional richness you will not use. A scalar-targeted estimator with sharper inference (Forward Difference-in-Differences (FDID), Factor Model Approach (FMA)) is more efficient.

  • Outcome is fundamentally non-continuous and you need uniform inference. Heavily discrete outcomes (binary, low-count) violate Assumption 3 and the Gaussian large-sample bands no longer apply. Point estimation still works, but if you need uniform confidence bands, switch to a model designed for discrete outcomes (or rely on permutation inference and report quantile-by-quantile rather than uniformly).

  • Short pre-period with rapidly moving donor distributions. DSC averages per-pre-period weights and so needs enough pre-periods for that average to stabilise. If \(T_0\) is a handful of periods and the cross-sectional donor distributions are themselves moving rapidly, the aggregation step is doing little work and the post-period counterfactual will be dominated by whichever pre-period happens to look most like the post. Use Forward Difference-in-Differences (FDID) or Two-Step Synthetic Control with carefully chosen covariates.

  • Small within-cell sample (a few individuals per cell-period). The empirical quantile function is jagged with small \(n_{jt}\) and the Wasserstein loss becomes dominated by sampling noise. If micro-data is sparse, collapse to means and run an aggregate estimator instead of forcing a distributional fit.

  • Treated unit lies outside the convex hull of donor distributions. No simplex-weighted combination of donor quantile functions can reproduce a treated distribution that sits outside the hull – e.g. the only unit with a bimodal outcome, the only state with a long Pareto tail. The pre-period \(W_2\) will be visibly large. Either expand the donor pool, switch to an extrapolating variant (drop the simplex constraint, at the cost of identification), or fall back to an aggregate estimator.

  • Spillovers or interference across units. SUTVA-violating designs break DSC for the same reason they break classical SC. Use a spillover-aware aggregate estimator (Spillover-Aware Synthetic Control (SPILLSYNTH), Spatial Synthetic Difference-in-Differences (SpSyDiD)) and accept that the distributional question becomes identification-impractical at the same time.

Algorithm#

mlsynth implements Gunsilius’s four-step recipe (formalized as Algorithm 1 in Zhang et al. 2026).

Step 1 – Empirical quantile functions. For each cell \((j, t)\), compute \(\widehat F^{-1}_{Y_{jt, n_j}}\) via the order-statistic estimator above.

Step 2 – Per-pre-period weights. Draw an \(M\)-point quantile grid \(\{V_m\}_{m=1}^M \subset (0, 1)\) (Halton / Sobol low-discrepancy by default, or uniform i.i.d.) and form the pseudo-sample matrices \(\widetilde Y_{jt, m} = \widehat F^{-1}_{Y_{jt, n_j}}(V_m)\). The squared 2-Wasserstein loss \(W_2^2(\cdot) = \int_0^1 \lvert \sum_j w_j \widehat F^{-1}_{Y_{jt}}(q) - \widehat F^{-1}_{Y_{1t}}(q) \rvert^2 dq\) is approximated by the empirical risk \(L_t(w) = M^{-1} \sum_m \lvert \widetilde Y_{1t, m} - \sum_j w_j \widetilde Y_{jt, m}\rvert^2\), and the per-pre-period weights solve the simplex-constrained quadratic program

\[\widehat w_t = \arg\min_{w \in \mathcal{H}} L_t(w), \qquad t \in \mathcal{T}_0.\]

mlsynth solves this with accelerated projected gradient descent (FISTA, Beck & Teboulle 2009) and the exact simplex projection of Duchi et al. (2008). This is the dependency-free analogue of the reference R package’s pracma::lsqlincon constrained least-squares call, and – unlike a generic SLSQP solver – it stays robust as the donor pool grows into the hundreds, the regime the method is designed for (Section 6.1). By the Koksma-Hlawka inequality the QMC approximation error is \(O(\log M / M)\) for Halton / Sobol vs. \(O(M^{-1/2})\) for i.i.d. draws.

Step 3 – Aggregate over the pre-period. The final weight is a convex combination \(\widehat w = \sum_{t \in \mathcal{T}_0} \lambda_t \widehat w_t\), with \(\lambda_t \ge 0\) and \(\sum_t \lambda_t = 1\). mlsynth offers "uniform" (default; \(\lambda_t = 1/T_0\)) and "recency" (geometric decay) rules, and accepts caller-supplied weights so Arkhangelsky et al. (2021) SDiD-style \(\lambda_t\) can be plugged in.

Step 4 – Post-period QTE. For each \(t > T_0\), evaluate the counterfactual quantile function at the QTE grid and difference it against the observed treated quantile function to obtain \(\widehat\alpha_{1t, q}\).

Inference#

Placebo permutation test (Gunsilius 2023, Algorithm 1). The distributional analogue of the Abadie-Diamond-Hainmueller (2010) placebo test, and of the reference package’s DiSCo_per. Each donor is treated as a pseudo-treated unit in turn: DSC is refit on the remaining units (the real treated unit re-enters the donor pool), and the per-period squared 2-Wasserstein distance between that unit’s observed and barycenter quantile functions is recorded,

\[d_{\iota t} = \int_0^1 \bigl| F^{-1}_{Y_{\iota t, N}}(q) - F^{-1}_{Y_{\iota t}}(q) \bigr|^2 dq .\]

If the model fits the placebos pre-treatment and there is a genuine post-treatment effect, the real treated unit’s distance sits in the extreme upper tail of the \(J + 1\) distances. The permutation p-value at post-period \(t\) is \(p_t = r(d_{1t}) / (J + 1)\), the rank of the treated unit’s distance (rank 1 = largest). Enable it with compute_inference=True; the DSCInference object exposes the full distance paths (treated and every placebo) and the per-post-period p-values. Because it refits the weights \(J\) times it costs roughly \(J\times\) the point estimate.

Goodness-of-fit and large-sample bands. Working with whole distributions also licenses a Wasserstein goodness-of-fit test of \(H_0: F_{Y_{1t, N}} = F_{Y_{1t}}\) and tests of first-/second-order stochastic dominance (Gunsilius 2023, Propositions 3-4). Under Assumption 3 the counterfactual quantile process is asymptotically Gaussian (Proposition 2), so confidence bands follow from the standard bootstrap; in the discrete case the limit is non-Gaussian (Proposition 5) and a Monte-Carlo plug-in is used instead. The pre-period fit \(\xi_t\) is surfaced directly via DSCResults.pre_period_wasserstein so users can check fit quality before trusting any post-period contrast.

Core API#

Distributional Synthetic Control (DSC) estimator.

DSC (Gunsilius, F. (2023). “Distributional Synthetic Controls.” Econometrica 91(3):1105-1117) reconstructs the treated unit’s outcome distribution under no treatment as a weighted average of donor units’ outcome distributions, where the average is taken in the 2-Wasserstein space. Unlike classical SCM, which targets aggregate means, DSC delivers the full counterfactual quantile function and hence the quantile treatment effect at any quantile.

The asymptotic optimality of the DSC weights is established by Zhang, L., Zhang, X., & Zhang, X. (2026). “Asymptotic Properties of the Distributional Synthetic Controls.” arXiv:2405.00953. mlsynth’s implementation follows Algorithm 1 of that paper, which formalises Gunsilius’s recipe with explicit quantile-grid sampling and simplex-constrained Wasserstein regression on the resulting pseudo-sample matrices.

Data requirements#

DSC operates on micro-level panel data: for each (unit, time) cell the user supplies multiple individual observations. The input DataFrame should therefore have one row per (unit, time, individual observation) triple.

When only aggregate-level data are available, classical synthetic-control estimators (e.g. mlsynth.CLUSTERSC, mlsynth.FMA) remain the appropriate tool.

Output#

The DSCResults container surfaces:

  • donor_weights – the aggregated simplex weights \(\widehat w = \sum_t \lambda_t \widehat w_t\).

  • qte_curves – per-post-period QTECurve objects with the observed and counterfactual quantile functions plus the QTE.

  • average_qte – QTE averaged over post-periods.

  • att – single scalar mean-of-QTE summary, for compatibility with the rest of mlsynth.

class mlsynth.estimators.dsc.DSC(config: DSCConfig | dict)#

Bases: object

Distributional Synthetic Control estimator.

Parameters:

config (DSCConfig or dict) – Configuration object. See mlsynth.config_models.DSCConfig.

fit() DSCResults#

Run Algorithm 1 of Zhang et al. (2026) and return DSCResults.

Configuration#

class mlsynth.config_models.DSCConfig(*, df: ~pandas.DataFrame, outcome: str, treat: str, unitid: str, time: str, display_graphs: bool = True, save: bool | str = False, counterfactual_color: ~typing.List[str] = <factory>, treated_color: str = 'black', M: ~typing.Annotated[int | None, ~annotated_types.Ge(ge=50)] = None, grid_method: ~typing.Literal['halton', 'sobol', 'uniform'] = 'halton', lambda_method: ~typing.Literal['uniform', 'recency'] = 'uniform', lambda_decay: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Le(le=1.0)] = 0.9, lambda_weights: ~typing.List[float] | None = None, qte_quantiles: ~typing.List[float] | None = None, n_qte_points: ~typing.Annotated[int, ~annotated_types.Ge(ge=1)] = 99, random_state: int = 0, compute_inference: bool = False, inference_grid_points: ~typing.Annotated[int, ~annotated_types.Ge(ge=2)] = 200)#

Configuration for the Distributional Synthetic Control (DSC) estimator.

DSC (Gunsilius 2023; asymptotic theory: Zhang, Zhang & Zhang 2026) fits simplex-constrained weights on the quantile functions of donor outcomes to reconstruct the treated unit’s counterfactual outcome distribution. Unlike the other mlsynth estimators, DSC expects micro-level panel data: each (unit, time) cell carries multiple individual observations, supplied as one row per individual in the input DataFrame.

Parameters:
  • M (int, optional) – Number of quantile-grid points used to approximate the 2-Wasserstein loss. If None, defaults to max(200, min_cell_size) (Zhang et al. 2026 suggest \(M = C n\) for a constant C >= 1).

  • grid_method ({“halton”, “sobol”, “uniform”}) – Sampling rule for the quantile grid. "halton" (default) and "sobol" are quasi-Monte Carlo with Koksma-Hlawka error \(O(\log M / M)\); "uniform" is i.i.d. with \(O(M^{-1/2})\) error.

  • lambda_method ({“uniform”, “recency”}) – Default rule for the pre-period aggregation weights \(\lambda_t\). Ignored when lambda_weights is set.

  • lambda_decay (float) – Geometric decay factor for lambda_method="recency". Default 0.9.

  • lambda_weights (sequence of float, optional) – Caller-supplied length-T0 aggregation weights (must be non-negative and sum to 1). Useful for Arkhangelsky et al. (2021) SDiD-style time weights computed externally.

  • qte_quantiles (sequence of float, optional) – Quantile grid in (0, 1) at which to report the QTE. If None, an evenly spaced grid of n_qte_points quantiles is used.

  • n_qte_points (int) – Length of the default QTE grid when qte_quantiles is None. Default 99 (every percentile from 0.01 to 0.99).

  • random_state (int) – Seed forwarded to the QMC quantile-grid sampler.

M: int | None#
compute_inference: bool#
grid_method: Literal['halton', 'sobol', 'uniform']#
inference_grid_points: int#
lambda_decay: float#
lambda_method: Literal['uniform', 'recency']#
lambda_weights: List[float] | None#
model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'forbid'}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

n_qte_points: int#
qte_quantiles: List[float] | None#
random_state: int#

Helper Modules#

Micro-panel data preparation for the Distributional Synthetic Control estimator.

DSC operates at a finer granularity than the rest of mlsynth: each (unit, time) cell carries multiple individual observations rather than a single aggregated outcome. This module converts a long-format DataFrame (one row per individual observation) into the DSCInputs container that the orchestrator expects.

Treatment is still applied at the unit-time level, exactly as in classical SCM. The treated unit is identified from the treat column: the unique unit-id with treat == 1 for some t.

mlsynth.utils.dsc_helpers.setup.prepare_dsc_inputs(df: DataFrame, outcome: str, treat: str, unitid: str, time: str) DSCInputs#

Pivot a long-format micro-panel into DSCInputs.

Parameters:
  • df (pd.DataFrame) – Long-format panel where each row is one individual observation. Must contain columns unitid, time, outcome, and treat (the treatment indicator, set at the unit-time level; identical for all individual rows within a given cell).

  • outcome (str) – Column with the individual-level outcome value.

  • treat (str) – Column with the 0/1 treatment indicator.

  • unitid (str) – Column with the unit identifier.

  • time (str) – Column with the time identifier.

Empirical quantile functions and Wasserstein-aligned pseudo-samples for DSC.

DSC fits weights on the quantile functions of donor and treated outcomes. Given the empirical CDF \(\widehat F_{Y_{jt, n_j}}\) constructed from the within-cell sample \(\{Y_{l, jt}\}_{l=1}^{n_j}\), its quantile function is

\[\widehat F^{-1}_{Y_{jt, n_j}}(q) = Y_{t, n_j(k)}, \quad \frac{k - 1}{n_j} < q \le \frac{k}{n_j},\]

where \(Y_{t, n_j(k)}\) is the \(k\)-th order statistic of the sample. This module exposes:

  • empirical_quantile() – evaluate \(\widehat F^{-1}_{Y_{jt, n_j}}\) at a vector of quantile points via NumPy’s inverted_cdf rule (matches the paper’s order-statistic estimator).

  • sample_quantile_grid() – draw the \(M\) quantile points \(\{V_m\}_{m=1}^{M}\) used to build pseudo-samples. Supports uniform i.i.d. draws and the Halton / Sobol low-discrepancy sequences (Koksma-Hlawka error \(O(\log M / M)\) vs. \(O(M^{-1/2})\) for i.i.d.).

mlsynth.utils.dsc_helpers.quantiles.build_pseudo_sample_matrix(inputs, time_label, quantile_grid: ndarray) tuple[ndarray, ndarray]#

Construct \((\widetilde{Y}_t, \widehat{Y}_{1t})\) for one period.

Returns:

  • donor_matrix (np.ndarray) – Shape (M, J) – column j is the donor’s quantile function evaluated at the grid.

  • treated_vec (np.ndarray) – Shape (M,) – the treated unit’s quantile function evaluated at the grid.

mlsynth.utils.dsc_helpers.quantiles.empirical_quantile(sample: ndarray, quantiles: ndarray) ndarray#

Evaluate the order-statistic empirical quantile function at quantiles.

Parameters:
  • sample (np.ndarray) – Length-n 1-D sample of observed outcomes for a single (unit, time) cell.

  • quantiles (np.ndarray) – Quantile probabilities in (0, 1), shape (M,).

Returns:

np.ndarray – Shape (M,) evaluations of \(\widehat F^{-1}_{Y_{jt, n_j}}(q)\).

mlsynth.utils.dsc_helpers.quantiles.sample_quantile_grid(M: int, method: Literal['halton', 'sobol', 'uniform'] = 'halton', random_state: int = 0) ndarray#

Draw the quantile-grid \(\{V_m\}_{m=1}^{M} \subset (0, 1)\).

Parameters:
  • M (int) – Number of quantile points.

  • method ({“halton”, “sobol”, “uniform”}) – Sampling rule. "halton" (default) and "sobol" give deterministic low-discrepancy sequences with Koksma-Hlawka error \(O(\log M / M)\); "uniform" draws i.i.d. samples with \(O(M^{-1/2})\) error.

  • random_state (int) – Seed for the QMC scrambling / i.i.d. RNG.

Returns:

np.ndarray – Length-M quantile points in (0, 1).

Simplex-constrained weight solver for Distributional Synthetic Controls.

For each pre-period \(t \in \mathcal T_0\), DSC solves

\[\widehat w_t = \arg\min_{w \in \mathcal H} \bigl\| \widetilde Y_t\, w - \widehat Y_{1t} \bigr\|_2^2, \qquad \mathcal H = \bigl\{ w \in [0, 1]^J : \mathbf 1^\top w = 1 \bigr\},\]

where \(\widetilde Y_t\) is the \(M \times J\) donor pseudo-sample matrix and \(\widehat Y_{1t}\) is the \(M \times 1\) treated pseudo-sample vector (Zhang, Zhang & Zhang 2026 eq. 3; the loss is the squared 2-Wasserstein distance approximated by Monte Carlo / QMC).

mlsynth.utils.dsc_helpers.weights.solve_simplex_weights(donor_matrix: ndarray, treated_vec: ndarray, *, max_iter: int = 5000, tol: float = 1e-12) ndarray#

Return the simplex-constrained least-squares weight vector.

Solves the convex program

\[\widehat w = \arg\min_{w \in \mathcal H} \| \widetilde Y_t\, w - \widehat Y_{1t} \|_2^2, \qquad \mathcal H = \{ w \ge 0 : \mathbf 1^\top w = 1 \},\]

by accelerated projected gradient descent (FISTA; Beck & Teboulle 2009) with the exact simplex projection of Duchi et al. (2008). This replaces an earlier SLSQP solver that failed ("Positive directional derivative for linesearch") once the donor pool grew past a few dozen units – precisely the regime of Gunsilius (2023, Section 6.1), where the method is meant to use tens to hundreds of donors. The reference DiSCo R package solves the same program with a dedicated constrained least-squares routine (pracma::lsqlincon); projected gradient is its dependency-free analogue and returns the identical optimum (the objective is convex with a unique minimum value over the simplex).

Parameters:
  • donor_matrix (np.ndarray) – \((M, J)\) design matrix – donor quantile functions evaluated on the grid.

  • treated_vec (np.ndarray) – Length-M target quantile function.

  • max_iter (int) – Maximum FISTA iterations.

  • tol (float) – Relative objective-change stopping tolerance.

Returns:

np.ndarray – Length-J weight vector with w >= 0 and sum(w) == 1.

mlsynth.utils.dsc_helpers.weights.wasserstein_loss_at_weights(donor_matrix: ndarray, treated_vec: ndarray, weights: ndarray) float#

Squared 2-Wasserstein loss \(\|\widetilde Y_t w - \widehat Y_{1t}\|_2^2 / M\).

The \(1/M\) normalisation makes this comparable across periods with different grid sizes.

Pre-period aggregation weights for DSC.

The DSC weight is

\[\widehat w = \sum_{t \in \mathcal T_0} \lambda_t\, \widehat w_t, \qquad \lambda_t \ge 0, \quad \sum_{t \in \mathcal T_0} \lambda_t = 1.\]

Zhang, Zhang & Zhang (2026, Section 2) point to Arkhangelsky et al. (2021) for principled choices of \(\lambda_t\). mlsynth currently ships the uniform rule (default) and a recency-weighted variant; the SDID-style time weights can be supplied externally via the lambda_weights argument to run_dsc().

mlsynth.utils.dsc_helpers.aggregation.aggregate_period_weights(period_weights: ndarray, lambda_weights: ndarray) ndarray#

Compute \(\widehat w = \sum_t \lambda_t \widehat w_t\).

Parameters:
  • period_weights (np.ndarray) – Shape (T0, J) matrix of per-pre-period donor weights.

  • lambda_weights (np.ndarray) – Length-T0 aggregation weights.

mlsynth.utils.dsc_helpers.aggregation.build_lambda_weights(T0: int, method: Literal['uniform', 'recency'] = 'uniform', decay: float = 0.9) ndarray#

Return length-T0 non-negative weights summing to 1.

Parameters:
  • T0 (int) – Number of pre-treatment periods.

  • method ({“uniform”, “recency”}) – "uniform" returns 1 / T0 everywhere. "recency" returns a geometrically-decayed schedule \(\lambda_t \propto \mathrm{decay}^{T_0 - t}\) so the weight peaks at the most recent pre-period.

  • decay (float) – Per-period decay factor for "recency" (must lie in (0, 1]).

Orchestration pipeline for Distributional Synthetic Controls.

Implements the four-step procedure of Gunsilius (2023) as formalised in Algorithm 1 of Zhang, Zhang & Zhang (2026):

  1. Estimate the empirical quantile function \(\widehat F^{-1}_{Y_{jt, n_j}}\) for every (unit, time) cell.

  2. For each pre-period \(t \in \mathcal T_0\), draw an \(M\)-point quantile grid \(\{V_m\}\), build the pseudo-sample matrix \(\widetilde Y_t\), and solve the simplex-constrained Wasserstein regression \(\widehat w_t = \arg\min_{w \in \mathcal H} \| \widetilde Y_t w - \widehat Y_{1t} \|_2^2\).

  3. Aggregate across the pre-periods: \(\widehat w = \sum_t \lambda_t \widehat w_t\).

  4. For each post-period \(t \in \mathcal T_1\), predict the counterfactual quantile function \(\widehat F^{-1}_{Y_{1t, N}}(q) = \sum_{j=2}^{J+1} \widehat w_j\, \widehat F^{-1}_{Y_{jt, n_j}}(q)\) and form the quantile treatment effect \(\widehat \alpha_{1t, q} = \widehat F^{-1}_{Y_{1t, I}}(q) - \widehat F^{-1}_{Y_{1t, N}}(q)\).

mlsynth.utils.dsc_helpers.pipeline.run_dsc(inputs: DSCInputs, *, M: int | None = None, grid_method: Literal['halton', 'sobol', 'uniform'] = 'halton', lambda_method: Literal['uniform', 'recency'] = 'uniform', lambda_decay: float = 0.9, lambda_weights: Sequence[float] | None = None, qte_quantiles: Sequence[float] | None = None, n_qte_points: int = 99, compute_inference: bool = False, inference_grid_points: int = 200, random_state: int = 0) DSCResults#

Run Algorithm 1 of Zhang, Zhang & Zhang (2026) and assemble DSCResults.

Parameters:
  • inputs (DSCInputs) – Preprocessed micro-panel; build via mlsynth.utils.dsc_helpers.setup.prepare_dsc_inputs().

  • M (int, optional) – Number of quantile-grid points \(V_m\). If None, defaults to max(200, min_cell_size) where min_cell_size is the smallest within-cell sample size in the panel (Zhang et al. 2026 suggest M = C * n for some constant C).

  • grid_method ({“halton”, “sobol”, “uniform”}) – Sampling rule for the quantile grid. "halton" (default) and "sobol" are deterministic QMC sequences with Koksma-Hlawka error \(O(\log M / M)\); "uniform" gives i.i.d. \(U[0, 1]\) with \(O(M^{-1/2})\) error.

  • lambda_method ({“uniform”, “recency”}) – Default aggregation rule for the pre-period weights. Ignored if lambda_weights is provided explicitly.

  • lambda_decay (float) – Geometric decay for lambda_method="recency".

  • lambda_weights (sequence of float, optional) – Caller-supplied length-T0 aggregation weights. Useful for Arkhangelsky-et-al-style SDiD time weights computed outside this estimator.

  • qte_quantiles (sequence of float, optional) – Quantile grid in (0, 1) at which to report the QTE. If None, an evenly spaced grid of n_qte_points quantiles between 1/(n_qte_points + 1) and n_qte_points/(n_qte_points + 1) is used.

  • n_qte_points (int) – Length of the default QTE grid.

  • random_state (int) – Seed forwarded to the QMC sampler.

Returns:

DSCResults – Frozen container with donor weights, per-pre-period weights, QTE curves per post-period, the averaged QTE, and a single scalar ATT summary.

Placebo permutation inference for Distributional Synthetic Controls.

Implements the placebo-permutation test of Gunsilius (2023, Algorithm 1), the distributional analogue of the Abadie-Diamond-Hainmueller (2010) placebo test and the procedure in the reference DiSCo R package (DiSCo_per / DiSCo_per_iter).

For every unit \(\iota \in \{1, \dots, J + 1\}\) we pretend it is the treated unit, fit DSC weights on the other units over the pre-period, aggregate them, build the post-period barycenter, and record the per-period squared 2-Wasserstein distance between that unit’s observed quantile function and its barycenter,

\[d_{\iota t} = \int_0^1 \bigl| F^{-1}_{Y_{\iota t, N}}(q) - F^{-1}_{Y_{\iota t}}(q) \bigr|^2 dq .\]

If there is a genuine treatment effect on the real treated unit and the model fits the placebos well pre-treatment, the real unit’s post-period distance should sit in the extreme tail of the placebo distances. The permutation p-value at post-period \(t\) is the rank of the treated unit’s distance among all \(J + 1\) distances, \(p_t = r(d_{1t}) / (J + 1)\), with rank 1 the largest.

class mlsynth.utils.dsc_helpers.inference.DSCInference(time_labels: ndarray, post_time_labels: ndarray, treated_distance: ndarray, placebo_distances: ndarray, placebo_units: List[Any], p_values: Dict[Any, float], n_permutations: int)#

Placebo-permutation inference output for DSC.

time_labels#

All T period labels (pre and post), the x-axis of the distance paths.

Type:

np.ndarray

post_time_labels#

The T - T0 post-treatment labels the p-values refer to.

Type:

np.ndarray

treated_distance#

Squared 2-Wasserstein distance per period for the real treated unit, shape (T,).

Type:

np.ndarray

placebo_distances#

Squared 2-Wasserstein distance per period for each placebo unit (each donor treated as pseudo-treated in turn), shape (J, T).

Type:

np.ndarray

placebo_units#

Donor labels aligned with the rows of placebo_distances.

Type:

list

p_values#

Mapping {post_time_label: p_t} with \(p_t = r(d_{1t}) / (J + 1)\) (rank 1 = largest distance).

Type:

dict

n_permutations#

Number of placebo units used (J).

Type:

int

n_permutations: int#
p_values: Dict[Any, float]#
placebo_distances: ndarray#
placebo_units: List[Any]#
post_time_labels: ndarray#
time_labels: ndarray#
treated_distance: ndarray#
mlsynth.utils.dsc_helpers.inference.placebo_permutation_test(inputs, *, M: int, grid_method: str, lam: ndarray, n_eval: int = 200, random_state: int = 0) DSCInference#

Run the Gunsilius (2023) Algorithm 1 placebo permutation test.

Parameters:
  • inputs (DSCInputs) – Preprocessed micro-panel.

  • M (int) – Quantile-grid size for the weight-fitting Wasserstein loss.

  • grid_method ({“halton”, “sobol”, “uniform”}) – Quantile-grid sampling rule (matches the main fit).

  • lam (np.ndarray) – Length-T0 pre-period aggregation weights (the same vector the point estimate used).

  • n_eval (int) – Number of evenly spaced quantiles used to evaluate the squared 2-Wasserstein distances.

  • random_state (int) – Seed for the quantile grid.

Frozen dataclasses for the Distributional Synthetic Control (DSC) estimator.

DSC (Gunsilius 2023; asymptotic theory in Zhang, Zhang & Zhang 2026) operates on micro-level panels: for each (unit, time) cell the user supplies multiple individual observations and the estimator works with the empirical quantile function of that cell rather than its mean.

The output therefore lives at the quantile level: per post-period we return a counterfactual quantile function alongside the observed one, and per quantile we expose the quantile treatment effect (QTE).

class mlsynth.utils.dsc_helpers.structures.DSCInputs(cell_samples: Dict[Tuple[Any, Any], ndarray], unit_names: List[Any], time_labels: ndarray, T: int, T0: int, treated_unit_name: Any)#

Preprocessed micro-panel data for DSC.

cell_samples#

Mapping (unit_id, time_id) -> 1-D np.ndarray of individual outcomes for that cell. Each array’s length is \(n_{jt}\) (the within-cell sample size).

Type:

dict

unit_names#

Length-(J + 1) ordering of unit ids. unit_names[0] is the treated unit, the remaining J are donors.

Type:

list

time_labels#

Length-T ordering of time-period labels.

Type:

np.ndarray

T#

Total number of panel periods.

Type:

int

T0#

Number of pre-treatment periods.

Type:

int

treated_unit_name#

Label of the treated unit.

Type:

Any

property J: int#

Number of donor units (excluding the treated unit).

T: int#
T0: int#
cell_samples: Dict[Tuple[Any, Any], ndarray]#
property n_post: int#

Number of post-treatment periods.

time_labels: ndarray#
treated_unit_name: Any#
unit_names: List[Any]#
class mlsynth.utils.dsc_helpers.structures.DSCResults(inputs: ~mlsynth.utils.dsc_helpers.structures.DSCInputs, donor_weights: ~typing.Dict[~typing.Any, float], period_weights: ~typing.Dict[~typing.Any, ~numpy.ndarray], lambda_weights: ~numpy.ndarray, qte_curves: ~typing.List[~mlsynth.utils.dsc_helpers.structures.QTECurve], average_qte: ~numpy.ndarray, att: float, pre_period_wasserstein: ~numpy.ndarray, weights: ~typing.Any | None = None, inference: ~typing.Any | None = None, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#

Top-level container returned by mlsynth.DSC.fit().

inputs#

Preprocessed micro-panel.

Type:

DSCInputs

donor_weights#

Mapping {donor_unit_name: weight} for the aggregated weight vector \(\widehat w = \sum_t \lambda_t \widehat w_t\). All weights are non-negative and sum to one (the simplex constraint of Gunsilius 2023).

Type:

dict

period_weights#

Mapping {time_label: weight_vector} for the per-pre-period weights \(\widehat w_t\) (each a J-vector). Useful for diagnostic inspection of pre-period heterogeneity.

Type:

dict

lambda_weights#

Length-T0 pre-period aggregation weights.

Type:

np.ndarray

qte_curves#

Per-post-period QTE objects. Index 0 is post-period 1.

Type:

list of QTECurve

average_qte#

QTE averaged over post-periods (uniform mean), shape (Q,).

Type:

np.ndarray

att#

Mean-aggregated treatment effect on the treated, obtained by averaging the QTE over both quantiles and post-periods. Useful as a single scalar summary; the QTE itself remains the primary object.

Type:

float

pre_period_wasserstein#

Pre-period 2-Wasserstein loss per t \in T_0, shape (T0,). Lower is tighter pre-period fit.

Type:

np.ndarray

metadata#

Free-form pipeline diagnostics (M, quantile grid method, random seed, etc.).

Type:

dict

att: float#
average_qte: ndarray#
donor_weights: Dict[Any, float]#
inference: Any | None = None#
inputs: DSCInputs#
lambda_weights: ndarray#
metadata: Dict[str, Any]#
period_weights: Dict[Any, ndarray]#
pre_period_wasserstein: ndarray#
qte_curves: List[QTECurve]#
weights: Any | None = None#
class mlsynth.utils.dsc_helpers.structures.QTECurve(time_label: Any, quantiles: ndarray, observed: ndarray, counterfactual: ndarray, qte: ndarray)#

Quantile treatment effect at a single post-period.

time_label#

Time-period identifier.

Type:

Any

quantiles#

Quantile grid in (0, 1), shape (Q,).

Type:

np.ndarray

observed#

Empirical treated quantile function evaluated at quantiles, shape (Q,). Under treatment this is \(F^{-1}_{Y_{1t, I}}\).

Type:

np.ndarray

counterfactual#

DSC counterfactual quantile function, shape (Q,). Estimate of \(F^{-1}_{Y_{1t, N}}\).

Type:

np.ndarray

qte#

observed - counterfactual, shape (Q,). Quantile treatment effect.

Type:

np.ndarray

counterfactual: ndarray#
observed: ndarray#
qte: ndarray#
quantiles: ndarray#
time_label: Any#

Example#

A self-contained one-draw Monte Carlo. The treated unit shares the donors’ pre-period DGP; in the post-period it receives a location shift of \(+1.5\) (a constant effect across quantiles). We expect a roughly flat QTE near \(1.5\), and – because the shift is real – the placebo permutation test should rank the treated unit most extreme.

"""One draw of a DSC location-shift simulation with placebo inference."""

import numpy as np
import pandas as pd

from mlsynth import DSC

rng = np.random.default_rng(0)
J, T_pre, T_post, n_per_cell = 12, 8, 4, 200
T = T_pre + T_post
delta_post = 1.5

unit_loc = rng.standard_normal(J + 1) * 0.5
time_shift = np.linspace(0.0, 1.0, T)
rows = []
for j in range(J + 1):
    for t in range(T):
        loc = unit_loc[j] + time_shift[t]
        if j == 0 and t >= T_pre:
            loc += delta_post
        for y in rng.normal(loc=loc, scale=1.0, size=n_per_cell):
            rows.append({"unit": j, "time": t, "y": float(y),
                         "D": int(j == 0 and t >= T_pre)})
df = pd.DataFrame(rows)

res = DSC({
    "df": df, "outcome": "y", "treat": "D", "unitid": "unit", "time": "time",
    "M": 400, "grid_method": "halton", "lambda_method": "uniform",
    "n_qte_points": 49,
    "compute_inference": True,        # placebo permutation test
    "inference_grid_points": 100,
}).fit()

print(f"true location shift = {delta_post:+.3f}")
print(f"DSC mean-of-QTE     = {res.att:+.3f}")
print(f"placebo p-values    = "
      f"{{t: round(p, 3) for t, p in res.inference.p_values.items()}}")
print(f"pre-period W2 fit   = {res.pre_period_wasserstein.round(4).tolist()}")

Reproducing the paper’s simulation (Gunsilius 2023, Figure 4)#

The headline simulation in Gunsilius (2023, Section 6.1) shows the DSC barycenter replicating a Gaussian-mixture target better as the donor pool grows: rough with 4 controls, good with 30, near-perfect with 500. The controls are mixtures of 3 Gaussians and the target a mixture of 4, with means drawn uniformly on \([-10, 10]\) and variances on \([0.5, 6]\). The snippet below reproduces that finding using the same quantile machinery the estimator calls internally, reporting the squared 2-Wasserstein distance between barycenter and target as \(J\) grows.

import numpy as np
from mlsynth.utils.dsc_helpers import (
    empirical_quantile, sample_quantile_grid, solve_simplex_weights,
)

def mixture(rng, n_comp, n):
    means = rng.uniform(-10.0, 10.0, n_comp)
    var = rng.uniform(0.5, 6.0, n_comp)
    comp = rng.integers(0, n_comp, n)
    return rng.normal(means[comp], np.sqrt(var[comp]))

def w2_to_target(J, rng, n=1000, M=1000, n_eval=1000):
    target = mixture(rng, 4, n)
    controls = [mixture(rng, 3, n) for _ in range(J)]
    V = sample_quantile_grid(M=M, method="uniform",
                             random_state=int(rng.integers(1_000_000_000)))
    donor_mat = np.column_stack([empirical_quantile(c, V) for c in controls])
    w = solve_simplex_weights(donor_mat, empirical_quantile(target, V))
    q = np.linspace(0.005, 0.995, n_eval)
    bc = np.column_stack([empirical_quantile(c, q) for c in controls]) @ w
    w2 = float(np.mean((bc - empirical_quantile(target, q)) ** 2))
    return w2, float(np.mean(w > 1e-4))   # distance, fraction of active weights

rng = np.random.default_rng(12345)
for J in (4, 30, 100, 500):
    reps = 20 if J <= 100 else 6
    out = np.array([w2_to_target(J, rng) for _ in range(reps)])
    print(f"J={J:>4}:  mean W2^2 = {out[:, 0].mean():7.3f}   "
          f"frac weights > 1e-4 = {out[:, 1].mean():.3f}")

# J=   4:  mean W2^2 ~  4.2     frac weights > 1e-4 ~ 0.58
# J=  30:  mean W2^2 ~  0.6     frac weights > 1e-4 ~ 0.17
# J= 100:  mean W2^2 ~  0.8     frac weights > 1e-4 ~ 0.05
# J= 500:  mean W2^2 ~  0.4     frac weights > 1e-4 ~ 0.02

The distance collapses once the donor pool is rich enough for the target to sit inside (or near) the convex hull of the controls, and the weight vector is “essentially sparse” – only a small fraction of donors carry non-negligible weight – exactly the two phenomena Gunsilius reports. (The residual distance at large \(J\) is empirical-quantile noise from the finite within-cell sample of 1000 draws, not bias.)

References#

Abadie, A., Diamond, A., & Hainmueller, J. (2010). “Synthetic Control Methods for Comparative Case Studies.” Journal of the American Statistical Association 105(490):493-505.

Agueh, M., & Carlier, G. (2011). “Barycenters in the Wasserstein Space.” SIAM Journal on Mathematical Analysis 43(2):904-924.

Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W., & Wager, S. (2021). “Synthetic Difference-in-Differences.” American Economic Review 111(12):4088-4118.

Athey, S., & Imbens, G. W. (2006). “Identification and Inference in Nonlinear Difference-in-Differences Models.” Econometrica 74(2):431-497.

Beck, A., & Teboulle, M. (2009). “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems.” SIAM Journal on Imaging Sciences 2(1):183-202.

Dube, A. (2019). “Minimum Wages and the Distribution of Family Incomes.” American Economic Journal: Applied Economics 11(4):268-304.

Duchi, J., Shalev-Shwartz, S., Singer, Y., & Chandra, T. (2008). “Efficient Projections onto the l1-Ball for Learning in High Dimensions.” ICML.

Gunsilius, F. F. (2023). “Distributional Synthetic Controls.” Econometrica 91(3):1105-1117. Reference implementation: the DiSCo R package.

Kloeckner, B. (2010). “A Geometric Study of Wasserstein Spaces: Euclidean Spaces.” Annali della Scuola Normale Superiore di Pisa 9(2):297-323.

Zhang, L., Zhang, X., & Zhang, X. (2026). “Asymptotic Properties of the Distributional Synthetic Controls.” arXiv:2405.00953.