Parallel-Trends Supergeo Design (PANGEO)

When to Use This Estimator

PANGEO is a tool for designing a geo experiment — deciding, before you run it, which geographic markets to treat and which to hold out, so that the after-the-fact comparison is as clean as possible. Use it when:

you can assign treatment at the geo level (turn an ad campaign, price, or feature on in some markets and not others);
you have a panel of pre-period history (weekly/monthly sales, conversions, GMV…) for every candidate geo;
the number of geos is modest (tens, not thousands), as is typical with DMAs/regions; and
you want the eventual treatment-effect estimate to be precise — i.e. you want treated and control markets that already move together.

A worked geo-experiment. A brand wants to measure the incremental sales from a new TV/CTV campaign. Ads are bought at the DMA level, so the experimental units are ~50–210 large, heterogeneous markets — not exchangeable shoppers. The plan: run the campaign in a treatment set of DMAs, withhold it in a control set, and read the post-launch sales gap as the lift. The whole experiment lives or dies on the split: if the treated DMAs were already trending differently from the controls, the post-launch gap mixes the campaign effect with that pre-existing divergence. PANGEO reads the DMAs’ pre-period sales panel and chooses the treatment/control split (bundling DMAs into balanced supergeos) so the two sides’ sales trajectories run parallel beforehand — turning the post-period gap into a clean read on the campaign.

PANGEO has two stages:

Design (pre-period data only). Each arm’s geos are grouped into composite supergeos and formed into balanced pairs, with no geo trimmed, so that pre-period parallelism is maximised. By default this partition is found by a fast clustering heuristic (an analogue of OSD, Shaw 2025); an exact set-partitioning mixed-integer program is available by opting out of the default (fast=False). Both optimise the same parallelism objective. A power analysis reports the minimum detectable effect (MDE) implied by the chosen supergeo size \(Q\).
Evaluation (after the experiment). The same design is scored against the realised outcomes with the Augmented Difference-in-Differences estimator of Li & Van den Bulte (2022), giving the ATT, percent ATT, and CIs at the arm and program levels.

The two stages share one quantity: both the design objective and the standard error of the realised effect are governed by the variance of the supergeo gap residual. Minimising non-parallelism simultaneously minimises the MDE and tightens the CI — optimising parallelism is optimising inferential precision.

This is a principled deviation from Google’s Supergeo Design. Chen et al. (2023) and OSD (Shaw 2025) match supergeos on a scalar summary — the summed baseline response, or a few covariate totals — which collapses the time dimension. PANGEO matches on the full pre-period trajectory. That difference is not cosmetic: the downstream analysis is a difference-in-differences, which differences trajectories over time, so two markets with identical totals but different seasonal shapes are not interchangeable for it even though scalar matching scores them as a perfect match. In trending, seasonal data — which is essentially all geo-marketing data — matching on shape rather than on a single number is what makes the post-period comparison valid. (The simulation at the end of this page quantifies the gap: when geos share a baseline mean but differ in shape, PANGEO recovers the effect ~30× more precisely than a scalar match.)

When not to use it. If the assignment is already fixed (an observational study, or a campaign that already ran in specific markets), there is no design to choose — use an estimation-stage method (Two-Step Synthetic Control, Synthetic Business Cycle (SBC), Forward Difference-in-Differences (FDID)). Large geo pools (hundreds of markets) are handled by the default clustering partition (see Forming the supergeos below); only the exact opt-out program is scale-limited. And if the outcome is plausibly stationary with no trend or seasonality, scalar matching is already adequate and simpler.

What is a supergeo?

Geo experiments differ from ordinary A/B tests in one decisive way: the experimental units are a small number of large, heterogeneous aggregates — markets, regions, DMAs — rather than many exchangeable individuals. Randomising treatment across a handful of dissimilar markets routinely produces treatment and control groups with very different baseline characteristics, and the resulting post-randomisation bias does not average away over the single assignment a practitioner actually runs (Abadie & Zhao 2026). Classic matched-pair designs help, but with heterogeneous geos there may be no good one-to-one match for a given market.

A supergeo resolves this by relaxing the unit of matching. Rather than insisting that single geos match, geos are pooled into composite aggregates: a supergeo is simply a bundle of geos treated as one unit, with outcome equal to their (population-weighted) mean. Composite units can be made comparable even when their constituents are not — a small, noisy market combined with a complementary one can, in aggregate, track another composite closely. The design then pairs supergeos, randomises treatment within each pair, and — unlike trimming-based approaches — assigns every geo to some supergeo, so the experiment spans the entire market with nothing discarded (Chen, Doudchenko, Jiang, Stein & Ying 2023).

PANGEO keeps this structure but changes what the supergeos are matched on. Supergeo Design and OSD match on a scalar summary (the summed response, or a few covariate totals), which collapses the time dimension; PANGEO matches on the full pre-treatment trajectory, choosing pairs whose aggregate paths run as parallel as possible. The reason is that the downstream difference-in-differences analysis differences trajectories: two markets with identical totals but different seasonal shapes are not interchangeable for it, even though scalar matching treats them as equivalent.

Setup and notation

Let \(y_{it}\) denote the outcome of geo \(i \in \mathcal{N} \coloneqq \{1,\dots,N\}\) in period \(t \in \mathcal{T} \coloneqq \{1,\dots,T\}\). The first \(T_0\) periods are the pre-treatment (design) window \(\mathcal{T}_1 \coloneqq \{t \in \mathcal{T} : t \le T_0\}\) and the remaining \(|\mathcal{T}_2| = T - T_0\) periods are the experimental window \(\mathcal{T}_2 \coloneqq \{t \in \mathcal{T} : t > T_0\}\). A single categorical column assigns each geo to an arm; arms occupy disjoint geo pools \(\mathcal{N}_a\) and are designed independently, so the exposition below fixes one arm and drops the arm subscript.

Throughout we maintain the linear factor model used by both the synthetic-control and DiD literatures (Abadie, Diamond & Hainmueller 2010; Li & Van den Bulte 2022) for the no-treatment potential outcome,

\[y_{it}^{N} = \delta_t + \boldsymbol{\theta}_t^{\top} \mathbf{z}_i + \boldsymbol{\lambda}_t^{\top}\boldsymbol{\mu}_i + \varepsilon_{it}, \tag{1}\]

where \(\delta_t\) is a common time effect, \(\mathbf{z}_i\) are observed covariates with time-varying loadings \(\boldsymbol{\theta}_t\), \(\boldsymbol{\mu}_i\) are unobserved factor loadings with factors \(\boldsymbol{\lambda}_t\), and \(\varepsilon_{it}\) is mean-zero idiosyncratic noise.

A supergeo is a set \(\mathcal{S}\) of same-arm geos with aggregate trajectory

\[\bar y_{\mathcal{S},t} \coloneqq \frac{\sum_{i\in \mathcal{S}}\omega_i\,y_{it}}{\sum_{i\in \mathcal{S}}\omega_i},\]

where \(\omega_i>0\) are aggregation weights (the weight_col population, or \(\omega_i\equiv 1\)). A pair \(p=(\mathcal{A}_p,\mathcal{B}_p)\) consists of two disjoint supergeos with \(|\mathcal{A}_p|,|\mathcal{B}_p|\le Q\); \(\mathcal{A}_p\) is the treatment half and \(\mathcal{B}_p\) the control half. Its gap is

\[g_{p,t} \coloneqq \bar y_{\mathcal{A}_p,t} - \bar y_{\mathcal{B}_p,t}. \tag{2}\]

Under (1) the common time effect cancels and

\[g_{p,t} = \boldsymbol{\lambda}_t^{\top}\big(\bar{\boldsymbol{\mu}}_{\mathcal{A}_p}-\bar{\boldsymbol{\mu}}_{\mathcal{B}_p}\big) + \boldsymbol{\theta}_t^{\top}\big(\bar{\mathbf{z}}_{\mathcal{A}_p}-\bar{\mathbf{z}}_{\mathcal{B}_p}\big) + \big(\bar\varepsilon_{\mathcal{A}_p,t}-\bar\varepsilon_{\mathcal{B}_p,t}\big), \tag{3}\]

with \(\bar{\boldsymbol{\mu}}_{\mathcal{S}}, \bar{\mathbf{z}}_{\mathcal{S}}\) the weighted means over \(\mathcal{S}\). The pair exhibits parallel trends when the loadings are balanced, \(\bar{\boldsymbol{\mu}}_{\mathcal{A}_p}=\bar{\boldsymbol{\mu}}_{\mathcal{B}_p}\) (and \(\bar{\mathbf{z}}_{\mathcal{A}_p}=\bar{\mathbf{z}}_{\mathcal{B}_p}\)); the gap is then constant (indeed zero) in expectation and a difference-in-differences comparison within the pair is unbiased. Exact balance is sufficient rather than necessary: what the comparison needs is that the expected gap not vary with \(t\), so a purely time-constant difference between the halves is harmless.
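The aggregate trajectory and the gap (2) can be sketched in a few lines of plain Python. This is an illustration only, not the library's code: the helper names `supergeo_mean` and `pair_gap` and the toy numbers are hypothetical.

```python
def supergeo_mean(series_by_geo, weights, members):
    """Weighted mean trajectory of the geos listed in `members`."""
    total_weight = sum(weights[g] for g in members)
    n_periods = len(series_by_geo[members[0]])
    return [
        sum(weights[g] * series_by_geo[g][t] for g in members) / total_weight
        for t in range(n_periods)
    ]

def pair_gap(series_by_geo, weights, treatment, control):
    """Gap g_{p,t}: treatment-half aggregate minus control-half aggregate."""
    a = supergeo_mean(series_by_geo, weights, treatment)
    b = supergeo_mean(series_by_geo, weights, control)
    return [x - y for x, y in zip(a, b)]

# Hypothetical toy panel: three geos, three periods, population weights.
y = {"g1": [10.0, 12.0, 14.0], "g2": [20.0, 22.0, 24.0], "g3": [13.0, 15.0, 17.0]}
w = {"g1": 1.0, "g2": 3.0, "g3": 1.0}

gap = pair_gap(y, w, treatment=["g1", "g2"], control=["g3"])
print(gap)  # a flat gap: the two halves differ in level but move in parallel
```

The composite of g1 and g2 sits at a different level from g3 yet tracks it period by period, which is the sense in which a supergeo can match where its constituents need not.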

Identifying assumptions

PANGEO can make a good experiment likely, but it cannot manufacture one that the data do not support. Three assumptions underpin it, and each points at a way the method can fail or stall.

Parallel trends (the crux). The design maximises parallelism in the pre-period; the validity of the Stage-2 effect estimate rests on that parallelism persisting into the post-period absent treatment — i.e. the treatment and control supergeos would have continued to move together had the campaign never launched. This is exactly the difference-in-differences parallel-trends assumption, and it is the assumption PANGEO is organised around.

Remark. PANGEO optimises pre-period parallelism (making the assumption as plausible as the data allow) but cannot guarantee it holds out-of-sample. If a shock hits the treated markets, a competitor reacts only there, or the pre-period co-movement was coincidental, the post-period gap diverges on its own and the ATT is biased, the same Achilles’ heel as any DiD. The achieved parallelism \(R^2\) (low values mean no balanced design exists), the reported MDE, and a placebo / blank-window check on the held-out pre-period are the diagnostics that flag the risk.

A linear factor structure for the no-treatment outcomes (Eq. (1)). The gap decomposition that makes “match on trajectory” equivalent to “balance the factor loadings” relies on this model. It is the standard synthetic-control / interactive-fixed-effects assumption.

Remark. The model is mild for sales-like panels, but a wildly non-factor outcome (e.g. one driven by an idiosyncratic, unit-specific regime change) is not balanceable by any partition: no split of the geos can render (3) constant in expectation.

A modest, designable geo pool. Each arm needs enough geos to form at least one supergeo pair, and the geos must be heterogeneous but matchable.

Remark. A pool that cannot form a balanced pair (e.g. two wildly different markets) has no good design to find. Scale itself is not a barrier — the default clustering partition handles large pools (see Forming the supergeos below) — only the exact opt-out program is scale-limited.

When PANGEO Fails or Stalls

The concrete failure / stall modes:

Parallel trends breaks post-launch — the dominant risk, above. No design fixes it; only the diagnostics warn of it.
No matchable structure. If the geos are so heterogeneous that no partition achieves high parallelism, PANGEO still returns the best feasible design, but the parallelism \(R^2\) stays low and the MDE blows up — a signal that a geo experiment here is underpowered and the read will be noisy regardless of split.
The exact program stalls. Set partitioning is NP-hard, so the exact opt-out program (fast=False) can be slow or intractable with many geos and a large supergeo size \(Q\). The default clustering partition avoids this — it is \(O(n\log n)\) and, under cluster structure, matches the exact optimum (see Forming the supergeos). If you nonetheless need the exact program, cap \(Q\) (smaller supergeos), use the automatic \(Q\) selection, or raise min_pairs.
Too few geos / arms. Each arm needs enough geos to form at least one supergeo pair; an arm with fewer, or a pool that cannot form a balanced pair (e.g. two wildly different markets), has no good design to find.

In short: PANGEO improves the plausibility of parallel trends by construction and quantifies the residual risk (parallelism \(R^2\), MDE), but it inherits DiD’s identifying assumption rather than removing it. Treat a low parallelism \(R^2\) or a large MDE as the design telling you the experiment is fragile.

Stage 1: the supergeo design

The parallelism objective

The pre-treatment window is split into an estimation window \(\mathcal E\) (the first \(\lfloor \kappa T_0\rfloor\) periods, \(\kappa=\) frac_E, default \(0.7\)) and a held-out blank window \(\mathcal B=\{1,\dots,T_0\}\setminus\mathcal E\). A pair is scored by the sum of squared deviations of its gap from its own estimation-window mean, i.e. the unnormalised variance of the level-removed gap,

\[c(p) \coloneqq \sum_{t\in\mathcal E}\big(g_{p,t}-\bar g_p\big)^2, \qquad \bar g_p \coloneqq \frac{1}{|\mathcal E|}\sum_{t\in\mathcal E} g_{p,t}, \tag{4}\]

which is exactly the pre-period residual sum of squares of a difference-in-differences fit (cf. mlsynth.utils.selector_helpers._did_from_mean()). Taking expectations under (3) with balanced covariates,

\[\mathbb E\,c(p) = \big(\bar{\boldsymbol{\mu}}_{\mathcal{A}_p}-\bar{\boldsymbol{\mu}}_{\mathcal{B}_p}\big)^{\top} \Big[\textstyle\sum_{t\in\mathcal E}(\boldsymbol{\lambda}_t-\bar{\boldsymbol{\lambda}}) (\boldsymbol{\lambda}_t-\bar{\boldsymbol{\lambda}})^{\top}\Big] \big(\bar{\boldsymbol{\mu}}_{\mathcal{A}_p}-\bar{\boldsymbol{\mu}}_{\mathcal{B}_p}\big) \;+\; \mathbb E\!\sum_{t\in\mathcal E} \big(\Delta\bar\varepsilon_{p,t}-\overline{\Delta\bar\varepsilon}_p\big)^2 .\]

Here \(\Delta\bar\varepsilon_{p,t} \coloneqq \bar\varepsilon_{\mathcal{A}_p,t}-\bar\varepsilon_{\mathcal{B}_p,t}\) is the noise difference of (3), and \(\bar{\boldsymbol{\lambda}}\) and \(\overline{\Delta\bar\varepsilon}_p\) are estimation-window means. The first term is a positive semi-definite quadratic form in the loading imbalance (positive definite when the demeaned factors span the factor space), so minimising (4) drives \(\bar{\boldsymbol{\mu}}_{\mathcal{A}_p}\to\bar{\boldsymbol{\mu}}_{\mathcal{B}_p}\): it balances the unobserved factor loadings, which is what parallel-trends DiD requires. The time-constant component of the loading difference is absorbed by the level shift \(\bar g_p\) and never penalised: two supergeos may differ arbitrarily in level yet match perfectly in shape. Scalar sum-matching, by contrast, collapses the time dimension and is blind to shape.
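A small numerical illustration of the score (4) may help. It is a sketch with made-up numbers, and `parallelism_score` is a hypothetical helper name rather than a library function. It contrasts a control that is a pure level shift of the treated series with one that has the same total but a different shape.

```python
def parallelism_score(gap):
    """Score (4): squared deviations of the gap from its own mean."""
    mean_gap = sum(gap) / len(gap)
    return sum((g - mean_gap) ** 2 for g in gap)

# Hypothetical trajectories over four pre-periods.
treated = [100.0, 120.0, 100.0, 80.0]
level_shifted = [v - 30.0 for v in treated]    # same shape, lower level
same_total = [80.0, 100.0, 120.0, 100.0]       # same total, different shape

score_shifted = parallelism_score([a - b for a, b in zip(treated, level_shifted)])
score_same_total = parallelism_score([a - b for a, b in zip(treated, same_total)])

print(sum(treated) == sum(same_total))  # a scalar sum-match sees a perfect match
print(score_shifted, score_same_total)  # the trajectory score does not
```

The level-shifted control scores zero because the level is removed by the mean gap, while the equal-total control is penalised for its different shape.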

Forming the supergeos

Given the per-pair cost (4), the design must partition each arm’s geos into supergeo pairs of minimum total non-parallelism. PANGEO offers two solvers for this, and they optimise the same objective (4). The econometric content of the design — balancing the factor loadings of (3), i.e. making the supergeo halves move in parallel — is therefore identical whichever solver runs; they differ only in how they search. Both run independently within each arm (arms hold disjoint geos), and both produce an exact cover: every geo is assigned to exactly one supergeo pair, with nothing trimmed.

Clustering partition (default, fast=True). Geo-experiment panels almost always have cluster structure: markets fall into a handful of latent types that move together up to a level shift (shared seasonality, regional demand, category-level trends). When that structure is present the best supergeos are just groups of same-type geos, which can be found by clustering rather than by combinatorial search. The default solver — an analogue of OSD (Shaw 2025) for the trajectory objective — forms the supergeos in five plain steps:

Level removal. Each geo’s pre-period trajectory has its own time-mean subtracted, leaving the shape \(y_{it}-\bar y_i\). This is the same demeaning as the score (4): two geos moving in parallel at any level become identical, so the grouping targets parallel trends rather than matching levels.
Embedding. The shapes are projected onto their leading principal components (denoising; under the factor model (1) the shapes span the factor space, which PCA recovers).
Clustering. Hierarchical (Ward) linkage orders the geos so that parallel-moving markets are adjacent.
Size-bounded grouping. The ordering is cut into contiguous groups of size \(2,\dots,2Q\) — each splittable into two halves of size \(\le Q\) — giving the exact cover. min_pairs caps the group size so that at least that many pairs form.
Per-group split. Within each group the treatment and control halves are the split minimising (4) — the same best split the exact program uses.

A handful of candidate groupings are generated (varying the linkage rule and a small embedding perturbation, as in OSD) and the one with the lowest total (4) is kept; fast_candidates sets how many. The cost is \(O(n\log n)\), against the exponential search below.
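Steps 1 to 4 above can be sketched with NumPy and SciPy. This is a minimal illustration under stated assumptions, not PANGEO's implementation: the function names, the choice of two principal components, and the near-equal cut into groups of at most \(2Q\) geos are all hypothetical, and the candidate search, min_pairs handling and per-group split (step 5) are omitted.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

def cluster_order(Y, n_components=2):
    """Order geos so that similarly shaped trajectories are adjacent.

    Y has shape (n_geos, n_periods) and holds pre-period outcomes only.
    """
    shapes = Y - Y.mean(axis=1, keepdims=True)           # step 1: level removal
    centred = shapes - shapes.mean(axis=0)               # centre columns for PCA
    U, s, _ = np.linalg.svd(centred, full_matrices=False)
    embedding = U[:, :n_components] * s[:n_components]   # step 2: PC scores
    tree = linkage(embedding, method="ward")             # step 3: Ward linkage
    return leaves_list(tree)                             # dendrogram leaf order

def cut_into_groups(order, Q):
    """Step 4 (simplified): contiguous groups of at most 2Q geos each."""
    n_groups = -(-len(order) // (2 * Q))                 # ceil(n / 2Q)
    return [chunk.tolist() for chunk in np.array_split(order, n_groups)]

# Toy panel: two latent types (seasonal, trending) at arbitrary levels.
rng = np.random.default_rng(0)
t = np.arange(52)
patterns = [10 * np.sin(2 * np.pi * t / 13), 0.5 * t]
types = np.array([0, 1, 0, 1, 0, 1, 0, 1])
levels = rng.uniform(50, 200, size=types.size)
Y = np.vstack([levels[i] + patterns[k] + rng.normal(0, 0.1, t.size)
               for i, k in enumerate(types)])

groups = cut_into_groups(cluster_order(Y), Q=2)
print(groups)  # each group should contain geos of a single latent type
```

On this clean toy panel the levels play no role: the grouping follows the trajectory type, which is the behaviour the level-removal step is meant to deliver.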

Remark. The clustering partition is a heuristic: its total cost is \(\ge\) the exact optimum, with equality when the trajectory clusters are clean. Checked against a structure-aware oracle (group by the true latent clusters, then split each exactly), on clustered panels it matches the exact optimum — e.g. at \(N=60\) geos with supergeos up to \(Q=6\) it recovers the oracle design in well under a second, a regime where the exact program cannot even begin (it enumerates \(O(N^{2Q})\) candidate subsets). This is why it is the default. On data with no cluster structure the gap to the optimum can be large; there the exact program is preferable when affordable.

The exact set-partitioning program (`fast=False`)

The exact solver enumerates every admissible pair and selects the optimal exact cover by mixed-integer programming. Let \(\mathcal F\) be the family of admissible pairs: every subset of the arm’s geos of size \(2,\dots,2Q\) that can be split into two halves each of size \(\le Q\), each subset scored at its best such split by (4). Let \(\mathbf{M}\in\{0,1\}^{N\times|\mathcal F|}\) be the geo-by-pair incidence matrix (\(M_{iG}=1\) iff geo \(i\in G\)) and \(c_G\) the score of pair \(G\). The design solves the set-partitioning program

\[\min_{\mathbf{x}\in\{0,1\}^{|\mathcal F|}} \sum_{G\in\mathcal F} c_G\,x_G \quad\text{s.t.}\quad \mathbf{M}\mathbf{x} = \mathbf 1 \ \ (\text{exact cover}),\qquad \mathbf 1^{\top} \mathbf{x} \ge \kappa_{\min}\ \ (\text{minimum pairs}), \tag{5}\]

solved with cvxpy and the HiGHS mixed-integer backend. The exact-cover constraint \(\mathbf{M}\mathbf{x}=\mathbf 1\) assigns every geo to exactly one chosen pair (no geo is trimmed). Because each \(c_G\) is precomputed offline, the objective is linear in \(\mathbf{x}\) — the program is a mixed-integer linear program regardless of the (possibly nonlinear) per-pair cost, which is what keeps it tractable. Within each chosen pair the treatment and control halves are the score-minimising split; which half is actually treated is randomised in the field.
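For intuition, the exact-cover problem (5) can be solved by brute force on a tiny arm. The sketch below is a stand-in for the cvxpy/HiGHS program, not a reproduction of it: it uses unweighted means, ignores the minimum-pairs constraint, and every function name is hypothetical. It enumerates admissible groups containing the first uncovered geo, scores each at its best split by (4), and recurses on the rest.

```python
from itertools import combinations

def ss_res(gap):
    m = sum(gap) / len(gap)
    return sum((g - m) ** 2 for g in gap)

def mean_path(Y, members):
    return [sum(Y[g][t] for g in members) / len(members)
            for t in range(len(Y[members[0]]))]

def best_split(group, Y, Q):
    """Cheapest split of `group` into two non-empty halves of size <= Q."""
    first, others = group[0], group[1:]
    best = (float("inf"), None, None)
    for k in range(len(others) + 1):          # k other geos join `first`
        for extra in combinations(others, k):
            half_a = (first,) + extra
            half_b = tuple(g for g in others if g not in extra)
            if not half_b or len(half_a) > Q or len(half_b) > Q:
                continue
            gap = [a - b for a, b in zip(mean_path(Y, half_a), mean_path(Y, half_b))]
            cost = ss_res(gap)
            if cost < best[0]:
                best = (cost, half_a, half_b)
    return best

def exact_partition(geos, Y, Q):
    """Minimum-cost exact cover of `geos` by admissible pairs (brute force)."""
    if not geos:
        return 0.0, []
    first, others = geos[0], geos[1:]
    best_cost, best_pairs = float("inf"), None
    for size in range(2, 2 * Q + 1):
        for rest in combinations(others, size - 1):
            cost, half_a, half_b = best_split((first,) + rest, Y, Q)
            remaining = tuple(g for g in others if g not in rest)
            sub_cost, sub_pairs = exact_partition(remaining, Y, Q)
            if sub_pairs is not None and cost + sub_cost < best_cost:
                best_cost = cost + sub_cost
                best_pairs = [(half_a, half_b)] + sub_pairs
    return best_cost, best_pairs

# Hypothetical arm: geos 0 and 2 are seasonal, geos 1 and 3 are trending.
seasonal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]
trend = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
Y = {0: [10.0 + v for v in seasonal], 1: [20.0 + v for v in trend],
     2: [50.0 + v for v in seasonal], 3: [5.0 + v for v in trend]}

cost, pairs = exact_partition((0, 1, 2, 3), Y, Q=1)
print(cost, pairs)
```

The recursion visits every exact cover, which is why this approach is only viable for a handful of geos; the integer program searches the same space with a solver instead.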

Per-pair objectives

The objective argument selects the per-pair cost \(c_G\); because each cost is precomputed, all three choices leave (5) linear in \(\mathbf{x}\). Writing \(g_t\) for the gap of a candidate split and \(\bar g\) for its estimation-window mean,

"ss_res" (default): the absolute residual sum of squares \(\sum_t (g_t-\bar g)^2\). Scale-dependent, so high-amplitude pairs weigh more and the design prioritises making large markets parallel.
"r2": the scale-free criterion \(1-R^2 = \sum_t(g_t-\bar g)^2 / \sum_t(\bar y_{\mathcal{A},t}-\bar{\bar y}_{\mathcal{A}})^2\), where \(\bar{\bar y}_{\mathcal{A}}\) is the estimation-window mean of the treatment half’s trajectory, so every pair counts equally (FDID’s \(R^2\) criterion, optimised exactly by the program rather than greedily).
"weighted": a recency-weighted residual SS \(\sum_t w_t (g_t-\bar g_w)^2\), the level removed at the weighted mean \(\bar g_w\), with weights \(w_t=\rho_{\mathrm{dec}}^{\,T_0-1-t}\) (recency_decay), up-weighting the recent pre-period closest to the experiment.

The per-pair gap_variance and parallelism_r2 reported on the result are always the unweighted quantities of (4), so designs from different objectives are comparable on a common yardstick.
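The difference between the scale-dependent and scale-free criteria is easy to see numerically. The sketch below uses hypothetical helper names and made-up trajectories; it scales a pair by a factor of ten and compares the two costs.

```python
def ss_res(series):
    m = sum(series) / len(series)
    return sum((v - m) ** 2 for v in series)

def one_minus_r2(treated, control):
    """Scale-free cost: gap SS relative to the treated half's own SS."""
    gap = [a - b for a, b in zip(treated, control)]
    return ss_res(gap) / ss_res(treated)

# Hypothetical pair, then the same pair ten times larger.
treated = [10.0, 12.0, 9.0, 13.0]
control = [8.0, 9.0, 8.0, 10.0]
big_treated = [10 * v for v in treated]
big_control = [10 * v for v in control]

small_ss = ss_res([a - b for a, b in zip(treated, control)])
big_ss = ss_res([a - b for a, b in zip(big_treated, big_control)])

print(small_ss, big_ss)                     # "ss_res" grows with the square of scale
print(one_minus_r2(treated, control),
      one_minus_r2(big_treated, big_control))  # "r2" is unchanged
```

Under "ss_res" the larger pair dominates the total cost, whereas under "r2" both pairs contribute the same amount.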

Supergeo size \(Q\) and automatic selection

Setting max_supergeo_size \(=Q=1\) recovers the classic matched-pairs design; \(Q>1\) permits composite supergeos when no single geo matches another well, without trimming. \(Q\) is a granularity knob with an interior optimum: too small and no parallel matches exist (singleton geos are too noisy); too large and the arm yields few, coarse pairs. The program-level MDE is not monotone in \(Q\) and is not tracked by the parallelism \(R^2\) (which is scale-free and rises with \(Q\)); only the absolute residual variance that drives power matters.

Consequently, if max_supergeo_size is left unset, PANGEO selects \(Q\) automatically: it solves (5) for every feasible \(Q\in\{1,\dots,\min(\lceil N/2\rceil, 6)\}\) and returns the design with the smallest mean program MDE. The full sweep — each \(Q\)’s program-pair count, mean program MDE, and the \(2/2^{P}\) randomisation-inference p-value floor for \(P\) pairs — is recorded in results.metadata["q_sweep"] and the choice in results.metadata["q_selected"], so the decision is auditable and may be overridden with an explicit \(Q\).

The default selection rule (q_selection="mde_min") minimises the mean program MDE. That rule sees only one axis — power — and can pick a \(Q\) whose MDE edge is within the sampling noise of a design with more pairs. The alternative q_selection="pareto_1se" treats the choice as the two-objective problem it is: minimise the MDE and maximise the pair count \(P\) (more pairs give a finer randomisation reference and a lower \(2/2^{P}\) p-value floor). It keeps only the Pareto-efficient \(Q\) on (MDE \(\downarrow\), \(P\) \(\uparrow\)), then a one-standard-error tie-break returns the largest \(P\) whose MDE is within one SE of the frontier’s best — spending pairs on power only when the gain exceeds its own noise. The SE is the deterministic small-sample estimate \(\mathrm{MDE}/\sqrt{2(B-1)}\) on the \(B\)-period blank window, and an optional q_min_pairs sets a hard inference floor. Setting compute_q_sweep=True records the full sweep even when \(Q\) is fixed, so a chosen \(Q\) can be audited against the alternatives.
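The two selection rules can be compared on a toy sweep. The code below is one reading of the description above, not the library's implementation: the sweep values are invented, the dictionary keys and function names are hypothetical, and the one-SE band is assumed to be computed from the frontier's best MDE.

```python
from math import sqrt

def pvalue_floor(n_pairs):
    """Smallest two-sided randomisation p-value attainable with P pairs."""
    return 2 / 2 ** n_pairs

def pareto_frontier(sweep):
    def dominates(b, a):
        return (b["mde"] <= a["mde"] and b["pairs"] >= a["pairs"]
                and (b["mde"] < a["mde"] or b["pairs"] > a["pairs"]))
    return [a for a in sweep if not any(dominates(b, a) for b in sweep)]

def select_pareto_1se(sweep, blank_len):
    """Largest pair count whose MDE is within one SE of the frontier's best."""
    frontier = pareto_frontier(sweep)
    best = min(frontier, key=lambda r: r["mde"])
    se = best["mde"] / sqrt(2 * (blank_len - 1))   # SE formula quoted above
    eligible = [r for r in frontier if r["mde"] <= best["mde"] + se]
    return max(eligible, key=lambda r: r["pairs"])

# Hypothetical sweep: one record per candidate Q.
sweep = [
    {"Q": 1, "pairs": 9, "mde": 6.0},
    {"Q": 2, "pairs": 5, "mde": 4.2},
    {"Q": 3, "pairs": 3, "mde": 4.0},
    {"Q": 4, "pairs": 2, "mde": 4.5},
]

by_mde = min(sweep, key=lambda r: r["mde"])
by_pareto = select_pareto_1se(sweep, blank_len=15)
print(by_mde["Q"], pvalue_floor(by_mde["pairs"]))
print(by_pareto["Q"], pvalue_floor(by_pareto["pairs"]))
```

In this example the MDE-minimising rule picks the design with three pairs, whose p-value floor is 0.25, while the one-SE rule accepts a slightly larger MDE in exchange for five pairs and a floor of 0.0625.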

Solver diagnostics. Every design records how the partition solver ran in results.metadata["solver_diagnostics"] (per arm). The clustering (fast=True) path reports the candidate groupings tried and feasible, the winning candidate and its linkage, and each candidate’s total score; the exact set-partitioning path (fast=False) reports the number of candidate supergeo pairs \(|\mathcal F|\), the MIP objective, optimality gap and dual bound, node/iteration counts, and the solve time. When no exact cover exists the raised error names the structural obstruction (units that appear in no admissible pair, or an odd-arm/even-pair parity clash) rather than a generic failure.

Balancing baseline covariates

Parallelism is level-blind: by (4) the level shift \(\bar g_p\) absorbs any time-constant gap, so a baseline characteristic (population, income) that merely shifts a market’s level is differenced out and never enters the trajectory score. This is correct for parallel-trends DiD but says nothing about balance on such characteristics — the role of OSD’s scalar covariate matching. PANGEO restores it with a standardised mean-difference penalty appended to (4),

\[c(p) \;\longmapsto\; c(p) \;+\; \sum_{m} \omega^{\mathrm{cov}}_m \Big(\frac{\bar z_{\mathcal{A}_p,m}-\bar z_{\mathcal{B}_p,m}}{s_m}\Big)^2, \tag{6}\]

the weighted squared standardised mean difference (SMD) between the halves’ covariate means, where \(s_m\) is the cross-geo standard deviation (standardize_covariates, default True) and \(\omega^{\mathrm{cov}}_m\) a per-covariate weight (covariate_weights, default \(1\)). Because (6) is precomputed it preserves linearity in (5). Larger weights buy tighter covariate balance at the cost of some parallelism; the achieved per-pair SMDs are reported in SupergeoPair.covariate_smd. Pass covariates=[...] (baseline columns, each reduced to its per-geo mean) to enable; with no covariates the design is unchanged. This is also the Abadie & Zhao (2026, Thm. 1) prescription — moving structure from the unobserved \(\boldsymbol{\mu}_i\) into the observed \(\mathbf{z}_i\) lowers the estimator’s bias — and the Stage-2 device for restoring inferential validity (below).

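# Assumes PANGEO is importable from the top-level package, as the
# mlsynth.PANGEO.fit() reference on this page suggests.
from mlsynth import PANGEO

# make_seasonal_sales_panel is the simulated-panel helper used on this page;
# its definition/import is not shown in this section.
df = make_seasonal_sales_panel(units_per_arm=6, arms=("A", "B", "C"),
                               T=104, seed=0, covariates=True)

# Design with a covariate-balance penalty on population and income.
res = PANGEO({
    "df": df, "outcome": "sales", "arm": "arm",
    "unitid": "unit", "time": "time", "max_supergeo_size": 3,
    "covariates": ["population", "income"],
    "covariate_weights": {"population": 5.0, "income": 5.0},
}).fit()

# Per pair: the two halves, the achieved parallelism, and the covariate SMDs.
for arm, design in res.arm_designs.items():
    for p in design.pairs:
        print(p.treatment, p.control, p.parallelism_r2, p.covariate_smd)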
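The penalty term in (6) can be computed by hand for a single covariate. The sketch below is illustrative only: `smd_penalty` is a hypothetical helper, the halves' means are unweighted, and the sample standard deviation is assumed for \(s_m\) (the text does not say which convention the library uses).

```python
from statistics import mean, stdev

def smd_penalty(covariate, half_a, half_b, weight=1.0):
    """Weighted squared standardised mean difference for one covariate."""
    s = stdev(covariate.values())                  # cross-geo spread (assumed sample SD)
    smd = (mean(covariate[g] for g in half_a)
           - mean(covariate[g] for g in half_b)) / s
    return weight * smd ** 2

# Hypothetical per-geo baseline covariate (e.g. population in millions).
population = {"a": 1.0, "b": 3.0, "c": 2.0, "d": 4.0}

unbalanced = smd_penalty(population, ["a", "b"], ["c", "d"], weight=5.0)
balanced = smd_penalty(population, ["a", "d"], ["b", "c"], weight=5.0)
print(unbalanced, balanced)
```

A split that equalises the halves' covariate means adds nothing to the cost, so among equally parallel splits the penalised objective prefers the covariate-balanced one.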

Power and the minimum detectable effect

Because power and the design objective are governed by the same supergeo gap residual, mlsynth.PANGEO.fit() returns a power analysis (results.power). For pair \(p\) the per-period noise is estimated honestly on the held-out blank window \(\mathcal B\) (out of sample with respect to the optimisation) as the residual of the same counterfactual model used at evaluation (Eq. (7)), fit on the estimation window \(\mathcal E\) and evaluated on \(\mathcal B\):

\[\widehat\sigma_p^2 \coloneqq \frac{1}{|\mathcal B|-1}\sum_{t\in\mathcal B} \widehat e_{p,t}^2 .\]

Using the evaluation model here (the augmented-DiD residual by default, or the plain level-removed gap when att_augment=False) rather than a fixed recipe keeps the projected MDE and the realised standard error (Eq. (8)) coherent. The \(X\)-period effect for the pair then has variance \(\widehat\sigma_p^2\,[f(X,\rho)+f(T_0,\rho)]\), where

\[f(n,\rho) = \frac{1}{n}\Big(1 + 2\sum_{k=1}^{n-1}(1-\tfrac{k}{n})\rho^{k}\Big)\]

is the variance of the mean of \(n\) AR(1)-correlated periods per unit of residual variance (it equals \(1/n\) when \(\rho=0\)), and \(\rho\) is the pooled lag-1 autocorrelation of the blank residuals. Serial correlation is decisive: weekly sales are highly autocorrelated, so \(X\) post weeks are worth far fewer than \(X\) independent observations and adding post periods yields sharply diminishing returns. This is the trap a naive i.i.d. power calculation falls into.

The program-level effect is the treated-size-weighted average of the pair effects, with weights \(w_p = (\sum_{i\in A_p}\omega_i)/\sum_{q}\sum_{i\in A_q}\omega_i\), and (treating pairs as independent across the program)

\[\widehat{\operatorname{Var}}(\widehat\tau_{\mathrm{prog}}) = \sum_p w_p^2\,\widehat\sigma_p^2\,\big[f(X,\rho)+f(T_0,\rho)\big], \qquad \mathrm{MDE}(X) = \big(z_{1-\alpha/2}+z_{1-\beta}\big)\, \sqrt{\widehat{\operatorname{Var}}(\widehat\tau_{\mathrm{prog}})}.\]

The program level is the headline: small arms are individually under-powered (with \(P\) pairs a pure within-pair randomisation test has a hard p-value floor of \(2/2^{P}\), so one needs \(P\ge 6\) to reach \(p<0.05\)), whereas pooling across arms gives the program an effective sample size equal to the total pair count and routinely detects effects several points smaller than any one arm. Per-arm curves are stored in results.power.arms. The MDE is reported in outcome units and as a percent of the treated baseline, by default at \(1-\beta=0.80\) power for horizons \(X=2,\dots,12\); power_target, power_alpha and power_post_periods configure this and compute_power=False skips it.

# Reuses df and the PANGEO import from the covariate example above.
res = PANGEO({
    "df": df, "outcome": "sales", "arm": "arm",
    "unitid": "unit", "time": "time", "max_supergeo_size": 3,
}).fit()

pw = res.power
print(f"serial correlation rho = {pw.serial_correlation:.2f}")
print(pw.summary())                       # MDE % by horizon: program + arms
print(pw.program.mde_pct_by_horizon()[8]) # detectable % lift after 8 weeks
print(pw.power_for_effect(effect_pct=5.0, post_periods=8))  # invert: power
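The formulas behind the power report can be evaluated directly with the standard library. The sketch below implements \(f(n,\rho)\) and the MDE expression as written above; the function names and the example inputs are hypothetical, and it is not the code behind results.power.

```python
from math import sqrt
from statistics import NormalDist

def ar1_mean_factor(n, rho):
    """f(n, rho): variance of the mean of n AR(1) periods, per unit variance."""
    return (1.0 + 2.0 * sum((1 - k / n) * rho ** k for k in range(1, n))) / n

def program_mde(sigma2_by_pair, weights, post_periods, pre_periods, rho,
                alpha=0.05, power=0.80):
    factor = ar1_mean_factor(post_periods, rho) + ar1_mean_factor(pre_periods, rho)
    variance = sum(w ** 2 * s2 for w, s2 in zip(weights, sigma2_by_pair)) * factor
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z * sqrt(variance)

# Four post weeks are worth far less than four independent draws at rho = 0.8.
print(ar1_mean_factor(4, 0.0), ar1_mean_factor(4, 0.8))

# Hypothetical program: three equally weighted pairs, 70 pre-periods.
sigma2 = [4.0, 5.0, 6.0]
w = [1 / 3, 1 / 3, 1 / 3]
for horizon in (2, 4, 8, 12):
    print(horizon, round(program_mde(sigma2, w, horizon, 70, rho=0.8), 3))
```

With strong serial correlation the MDE falls only slowly as the horizon grows, which is the diminishing-returns pattern described above.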

Stage 2: evaluation by Augmented DiD

The estimator

Once the experiment has run, pass a post_col (a \(0/1\) indicator of post-treatment periods, as in LEXSCM). The design is rebuilt on the pre rows alone — so it is identical to the design-only result — and results.effects carries the realised ATT at the arm and program levels using the Augmented Difference-in-Differences estimator of Li & Van den Bulte (2022).

Fix a level (an arm, or the program) and write \(y^{T}_t\) for its treated supergeo aggregate and \(y^{C}_t\) for its control supergeo aggregate, both treated-size-weighted across the level’s pairs. The counterfactual is the pre-period least-squares projection

\[y^{T}_t = \delta_1 + \delta_2\,y^{C}_t + \gamma\,t + e_t, \qquad t = 1,\dots,T_0 . \tag{7}\]

This augments plain DiD in two ways: the control scale \(\delta_2\) is estimated rather than fixed at \(1\), and a linear time trend \(\gamma t\) is included (att_augment and att_trend, both default True). With regressor \(\mathbf{x}_t=(1,\,y^{C}_t,\,t)^{\top}\) and OLS estimate \(\widehat{\boldsymbol{\delta}}\), the per-period effect and the ATT are

\[\widehat u_t \coloneqq y^{T}_t - \mathbf{x}_t^{\top}\widehat{\boldsymbol{\delta}}, \qquad \widehat\tau \coloneqq \frac{1}{T_{\mathrm{post}}} \sum_{t=T_0+1}^{T} \widehat u_t .\]

The percent ATT is taken relative to the post-period counterfactual (cf. mlsynth.utils.resultutils.effects.calculate()), not the pre-treatment baseline:

\[\widehat\tau_{\%} \coloneqq 100\times\frac{\widehat\tau}{\bar y^{0}_{\mathrm{post}}}, \qquad \bar y^{0}_{\mathrm{post}} \coloneqq \frac{1}{T_{\mathrm{post}}}\sum_{t=T_0+1}^{T} \mathbf{x}_t^{\top}\widehat{\boldsymbol{\delta}} = \frac{1}{T_{\mathrm{post}}}\sum_{t=T_0+1}^{T}\big(y^{T}_t-\widehat u_t\big).\]
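The projection (7), the ATT and the percent ATT amount to one least-squares fit. The sketch below uses NumPy on noise-free synthetic data; the function name `augmented_did` and the data-generating numbers are hypothetical, and the library's own routine may differ in detail.

```python
import numpy as np

def augmented_did(y_treat, y_ctrl, T0):
    """Fit (7) on the first T0 periods; return ATT, percent ATT, coefficients."""
    y_treat = np.asarray(y_treat, dtype=float)
    y_ctrl = np.asarray(y_ctrl, dtype=float)
    t = np.arange(1, y_treat.size + 1, dtype=float)
    X = np.column_stack([np.ones(t.size), y_ctrl, t])      # x_t = (1, y^C_t, t)
    delta, *_ = np.linalg.lstsq(X[:T0], y_treat[:T0], rcond=None)
    counterfactual = X @ delta
    u = y_treat - counterfactual                           # per-period effects
    att = u[T0:].mean()
    att_pct = 100.0 * att / counterfactual[T0:].mean()     # relative to post counterfactual
    return att, att_pct, delta

# Synthetic aggregates: the treated series is a scaled, trending copy of the
# control, plus a constant lift of 7 after period T0.
T, T0 = 40, 30
t = np.arange(1, T + 1)
y_ctrl = 100 + 10 * np.sin(t / 3)
y_treat = 5 + 1.2 * y_ctrl + 0.3 * t
y_treat[T0:] += 7.0

att, att_pct, delta = augmented_did(y_treat, y_ctrl, T0)
print(round(att, 6), round(att_pct, 3), np.round(delta, 6))
```

Because the control scale (1.2) and the trend (0.3) are estimated rather than fixed, the fit recovers the lift; a plain DiD with the control coefficient fixed at one would leave both in its residual.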

Inference

Li & Van den Bulte (2022, Prop. 3.1–3.3) show \(\sqrt{T_{\mathrm{post}}}\,(\widehat\tau-\tau)\xrightarrow{d} N(0,\Sigma_1+\Sigma_2)\), where \(\Sigma_1\) is the variance from estimating \(\boldsymbol{\delta}\) and \(\Sigma_2\) from averaging the post-period errors. Their Web Appendix C.13 gives the prediction-variance estimator

\[\widehat{\operatorname{Var}}(\widehat\tau) = \widehat\omega^2\Big[\, \bar{\mathbf{x}}_{\mathrm{post}}^{\top} \Big(\textstyle\sum_{t=1}^{T_0} \mathbf{x}_t \mathbf{x}_t^{\top}\Big)^{-1} \bar{\mathbf{x}}_{\mathrm{post}} \;+\; \frac{1}{T_{\mathrm{post}}}\Big], \tag{8}\]

with \(\bar{\mathbf{x}}_{\mathrm{post}}=T_{\mathrm{post}}^{-1}\sum_{t>T_0}\mathbf{x}_t\). The first bracketed term is \(\Sigma_1\) (it inflates automatically when the post-period control drifts outside its pre-period range, pricing the extrapolation uncertainty) and the second is \(\Sigma_2\). The residual variance \(\widehat\omega^2\) is estimated over the long pre-period as a Newey–West/Bartlett long-run variance with truncation lag \(\lfloor T_0^{1/4}\rfloor\) (Li & Van den Bulte’s \(O(T^{1/4})\) rule); lag \(0\) is the i.i.d. case \(\widehat\omega^2=\widehat e^{\top}\widehat e/(T_0-k)\) for \(k\) regressors. The confidence interval is \(\widehat\tau \pm z_{1-\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\widehat\tau)}\) and the p-value is the two-sided normal test of \(\tau=0\).
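The variance estimator (8) and the Bartlett long-run variance can be sketched as follows. The function names are hypothetical, and scaling the autocovariances by \(T_0-k\) is an assumption chosen so that lag \(0\) reproduces the i.i.d. formula quoted above; the library may normalise differently.

```python
import math
import numpy as np

def bartlett_long_run_variance(resid, k, lag=None):
    """Newey-West/Bartlett long-run variance of pre-period residuals."""
    e = np.asarray(resid, dtype=float)
    n = e.size
    if lag is None:
        lag = math.floor(n ** 0.25)            # truncation rule quoted above
    omega2 = e @ e / (n - k)                   # lag 0: the i.i.d. estimate
    for j in range(1, lag + 1):
        bartlett = 1.0 - j / (lag + 1.0)       # linearly declining weights
        omega2 += 2.0 * bartlett * (e[j:] @ e[:-j]) / (n - k)
    return omega2

def prediction_variance(X_pre, X_post, omega2):
    """Equation (8): coefficient-estimation term plus post-averaging term."""
    x_bar = X_post.mean(axis=0)
    estimation_term = x_bar @ np.linalg.solve(X_pre.T @ X_pre, x_bar)
    return omega2 * (estimation_term + 1.0 / X_post.shape[0])

# Intercept-only check: the bracket reduces to 1/T0 + 1/T_post.
var_tau = prediction_variance(np.ones((10, 1)), np.ones((5, 1)), omega2=2.0)
print(var_tau)

# Negatively autocorrelated residuals lower the long-run variance.
e = [1.0, -1.0, 1.0, -1.0]
print(bartlett_long_run_variance(e, k=1, lag=0),
      bartlett_long_run_variance(e, k=1, lag=1))
```

The intercept-only case makes the two sources of uncertainty visible: one shrinks with the length of the pre-period and the other with the length of the post-period.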

Design-based inference. The Augmented-DiD interval above is model-based: it is valid when the regression residual is weakly dependent and stationary. Because treatment is randomised within each supergeo pair, PANGEO also reports an assumption-light design-based companion at results.effects.randomization (per arm and for the program). Let \(d_k\) be the antisymmetric within-pair difference-in-differences contrast (treated minus control, level-removed), which negates exactly when a pair’s assignment is flipped. Under the sharp null of no effect the only randomness is the assignment, so the null distribution of the weighted-mean statistic is generated by the \(2^{P}\) sign flips \(\{\sum_k w_k s_k d_k : s\in\{\pm1\}^{P}\}\); the two-sided permutation p-value is the share with \(|\text{stat}|\ge|\text{observed}|\) (enumerated exactly for \(P\le 14\) pairs, otherwise a fixed-seed Monte Carlo). The companion matched-pair (pair-clustered) standard error \(\sqrt{\tfrac{P}{P-1}\sum_k w_k^2 (d_k-\widehat\tau)^2}\), which reduces to \(\mathrm{sd}(d_k)/\sqrt{P}\) under equal weights \(w_k=1/P\), gives a \(t_{P-1}\) interval that respects the pairing. The permutation p-value cannot fall below \(2/2^{P}\), so a design with few pairs honestly reports weak design-based evidence even when the model-based SE is tight: the same \(P\)-versus-power tension that the q_sweep p-value floor surfaces at the planning stage.
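The exact sign-flip test is short enough to write out. The sketch below enumerates all \(2^{P}\) assignments for a handful of pairs; the function name and the contrasts are hypothetical, and the Monte Carlo branch used for larger \(P\) is not shown.

```python
from itertools import product
from math import sqrt
from statistics import stdev

def sign_flip_pvalue(d, w, tol=1e-12):
    """Two-sided exact randomisation p-value over all within-pair flips."""
    observed = abs(sum(wk * dk for wk, dk in zip(w, d)))
    null = [abs(sum(wk * sk * dk for wk, sk, dk in zip(w, signs, d)))
            for signs in product((1, -1), repeat=len(d))]
    return sum(stat >= observed - tol for stat in null) / len(null)

# Hypothetical contrasts for six pairs, all favouring treatment.
d = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
w = [1 / len(d)] * len(d)

p_value = sign_flip_pvalue(d, w)
paired_se = stdev(d) / sqrt(len(d))   # equal-weight matched-pair SE
print(p_value, round(paired_se, 4))
```

Even though every contrast has the same sign, the p-value cannot go below the floor of \(2/2^{6}\), because the all-plus and all-minus assignments both attain the observed magnitude.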

Why this estimator suits the supergeo gap#

Li & Van den Bulte’s regularity conditions (Assumptions C2–C3) explicitly admit trend and unit-root (integrated) common factors \(\boldsymbol{\lambda}_t\) — the regimes under which naive i.i.d. standard errors collapse. The mechanism is the augmentation: regressing the treated aggregate on a scaled control is a cointegrating regression, and a single \(\delta_2\) cancels a shared integrated factor in (3), while \(\gamma t\) absorbs deterministic drift. The validity condition reduces to a single requirement — that the regression residual \(e_t\) be (weakly dependent) stationary — which the augmentation and trend deliver. The design’s parallelism is retained throughout; it minimises the residual variance, which by (8) directly tightens the standard error — and, because the power analysis reads the same held-out residual, the planning MDE as well.

Plain DiD as an option. Setting att_augment=False (and optionally att_trend=False) recovers Li & Van den Bulte’s ordinary difference-in-differences, \(y^{T}_t - y^{C}_t = \delta_1\,[+\,\gamma t] + e_t\) with the control coefficient fixed at one, and the power analysis follows suit, so the two stages stay coherent. A head-to-head Monte-Carlo comparison (R² design + plain DiD versus the augmented defaults) found augmented DiD both more precise (lower realised MDE) and better-covering across the stationary, trend-plus-seasonal and integrated-factor regimes, because plain DiD has no mechanism to absorb a control-scale mismatch or a trend and leaves that structure in its residual. The augmented estimator is therefore the default; plain DiD remains available for settings where its textbook simplicity is preferred.

The result object#

fit() returns a DesignResult — the experimental-design half of mlsynth’s two-family result contract — because PANGEO chooses an assignment before any intervention. The chosen treated markets are on results.selected_units and the full map on results.assignment; the supergeo pairs, achieved parallelism, power/MDE and solver diagnostics are on results.arm_designs, results.power and results.metadata. When the panel carries a post_col, the design resolves to an effect report: results.report is a standard EffectResult exposing the program ATT through the same flat surface as any estimation-stage method (.att, .counterfactual, .gap, .att_ci), while the richer per-pair, per-arm and design-based numbers remain on results.effects.

Validity envelope (smoke tests)#

A Monte-Carlo study over the bundled simulator confirms that validity hinges on residual stationarity, not on the interval recipe. The simulator can place the unobserved factor on an i.i.d., AR(1) or random-walk process (factor) and toggle the seasonal amplitude (season_amp) and per-geo trend (trend_sd):

Gap structure	Program coverage (nominal 0.95)	Type-I (nominal 0.05)
stationary i.i.d. factor (paper DGP)	0.93	0.07
linear trend + seasonality	0.87	0.13
integrated factor (random walk)	0.60	0.40

The point estimate is unbiased in every regime. On a stationary gap, matching Li & Van den Bulte’s factor-model design, the prediction-variance interval covers close to its nominal rate (0.93 against 0.95); the augmentation and trend regressor recover most of the coverage lost to a deterministic trend and seasonality (0.87); and the adversarial random-walk-plus-seasonality gap (0.60), where two integrated factors and amplitude-heterogeneous seasonality exceed what a single \(\delta_2\) can cointegrate, marks the honest assumption boundary. In practice the fitted \(\widehat\delta_2\) (reported as AttEstimate.scale) and the residual diagnose whether the assumption holds; if a single scale cannot flatten the gap, add covariate or seasonal regressors before trusting the interval.

Because the power analysis now uses the evaluation model’s held-out residual, the planning MDE is calibrated to the realised standard error: on the stationary gap the projected MDE matches the realised value to within roughly 7% (ratio \(\approx 0.93\)), and on integrated gaps it is conservative (over-states the MDE), the safe direction. These experiments live in mlsynth/tests/test_pangeo.py (TestADIDInference).

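# PANGEO's import path follows the class reference below; make_seasonal_sales_panel
# is the bundled simulator and must be imported from wherever mlsynth exposes it.
from mlsynth.estimators.pangeo import PANGEO

df = make_seasonal_sales_panel(units_per_arm=6, arms=("A", "B", "C"),
                               T=104, seed=0, n_post=8)
res = PANGEO({
    "df": df, "outcome": "sales", "arm": "arm",
    "unitid": "unit", "time": "time", "post_col": "post_col",
    "max_supergeo_size": 3,
    "att_augment": True, "att_trend": True,   # Augmented DiD (defaults)
}).fit()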

print(res.effects.summary())           # program + per-arm ATT, SE, CI, p
pe = res.effects.program
print(f"program ATT = {pe.att_pct:.1f}% "
      f"[{pe.ci_lower_pct:.1f}, {pe.ci_upper_pct:.1f}], "
      f"p={pe.p_value:.3f}, scale delta_2={pe.scale:.2f}")

Core API#

PANGEO: Parallel-trends supergeo experimental design.

PANGEO is a prospective experimental-design method for geographic (geo) experiments, in the lineage of Supergeo Design (Chen, Doudchenko, Jiang, Stein & Ying 2023). The Supergeo idea – group geos into composite “supergeos” and form balanced pairs, randomising treatment within each pair, without trimming any geo – is retained, including its set-partitioning mixed-integer program.

The departure is the matching objective. Supergeo (and the scalable OSD variant) match on a scalar aggregate (the summed response) or a few summary covariate balances. PANGEO instead matches on the full pre-treatment trajectory: it chooses the partition whose treatment and control halves are as parallel as possible over the pre-period, scored by the difference-in-differences pre-period residual sum of squares (the level-removed gap variance; cf. mlsynth.utils.selector_helpers._did_from_mean()). Because the DiD level shift is absorbed, two supergeos can differ in level yet still match perfectly on shape – exactly what a downstream DiD / synthetic-control analysis needs, and what scalar sum-matching throws away.

Multi-arm support: a single categorical column names each geo’s eligible treatment arm (e.g. A/B/C); arms occupy non-overlapping geos and PANGEO designs each arm independently. The output is a design (supergeo pairs + treatment/control assignment + achieved parallelism), not a treatment effect.

class mlsynth.estimators.pangeo.PANGEO(config: PANGEOConfig | dict)#

Bases: object

Parallel-trends supergeo experimental design.

Parameters:: config (PANGEOConfig or dict) – Configuration object. See mlsynth.config_models.PANGEOConfig.

fit() → PangeoResults#

Design the parallel supergeo pairs and return PangeoResults.

With a post_col, the design is built on the pre rows only (so it is identical to the design-only result) and the realized DiD ATT on the post rows is attached as results.effects.

Configuration#

class mlsynth.config_models.PANGEOConfig(*, df: ~pandas.DataFrame, outcome: str, arm: str, unitid: str, time: str, post_col: str | None = None, weight_col: str | None = None, max_supergeo_size: ~typing.Annotated[int | None, ~annotated_types.Ge(ge=1)] = None, min_pairs: ~typing.Annotated[int, ~annotated_types.Ge(ge=1)] = 1, q_selection: ~typing.Literal['mde_min', 'pareto_1se'] = 'mde_min', q_min_pairs: ~typing.Annotated[int, ~annotated_types.Ge(ge=1)] = 1, compute_q_sweep: bool = False, fast: bool = True, fast_candidates: ~typing.Annotated[int, ~annotated_types.Ge(ge=1)] = 5, objective: ~typing.Literal['ss_res', 'r2', 'weighted'] = 'ss_res', recency_decay: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Le(le=1.0)] = 0.97, frac_E: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Lt(lt=1.0)] = 0.7, covariates: ~typing.List[str] | None = None, covariate_weights: ~typing.Dict[str, float] | None = None, standardize_covariates: bool = True, compute_power: bool = True, power_target: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Lt(lt=1.0)] = 0.8, power_alpha: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Lt(lt=1.0)] = 0.05, power_post_periods: ~typing.List[int] = <factory>, att_augment: bool = True, att_trend: bool = True, display_graphs: bool = True, save: bool | str = False)#

Configuration for the PANGEO experimental-design estimator.

Parallel-trends supergeo design (in the Supergeo / Chen et al. 2023 lineage): partitions each treatment arm’s geos into supergeo pairs whose treatment/control halves are maximally parallel over the pre-period, by a fast clustering heuristic when fast=True (the default) or exactly by a cvxpy/HiGHS set-partitioning MIP when fast=False. A prospective design method – it returns supergeo pairs + a treatment/control assignment, not a treatment effect – so it takes a single categorical arm column rather than a binary treat indicator.

Parameters:

df (pd.DataFrame) – Historical (pre-treatment) balanced long panel.
outcome (str) – Historical outcome column (e.g. sales).
arm (str) – Single categorical column naming each geo’s eligible treatment arm (e.g. values A/B/C). Arms occupy non-overlapping geos; the design runs independently within each arm.
unitid (str) – Unit (geo) identifier column.
time (str) – Time-period column.
post_col (str, optional) – 0/1 indicator column marking post-treatment periods (0 = pre). When given, the design is built on the pre rows alone – identical to the design-only result – and the realized difference-in-differences ATT is additionally computed on the post rows (results.effects).
weight_col (str, optional) – Per-unit aggregation weight (e.g. population), constant within a unit. Makes both the supergeo design and the ATT population-weighted.
max_supergeo_size (int, optional) – Q – the maximum size of either supergeo within a pair. Set 1 to recover classic matched pairs. If None (the default), Q is selected automatically.
min_pairs (int) – Minimum number of supergeo pairs per arm.

class Config#

arbitrary_types_allowed = True#

arm: str#

att_augment: bool#

att_trend: bool#

compute_power: bool#

compute_q_sweep: bool#

covariate_weights: Dict[str, float] | None#

covariates: List[str] | None#

df: pd.DataFrame#

display_graphs: bool#

fast: bool#

fast_candidates: int#

frac_E: float#

max_supergeo_size: int | None#

min_pairs: int#

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

objective: Literal['ss_res', 'r2', 'weighted']#

outcome: str#

post_col: str | None#

power_alpha: float#

power_post_periods: List[int]#

power_target: float#

q_min_pairs: int#

q_selection: Literal['mde_min', 'pareto_1se']#

recency_decay: float#

save: bool | str#

standardize_covariates: bool#

time: str#

unitid: str#

weight_col: str | None#

Helper Modules#

Pre-treatment parallelism scoring for PANGEO supergeo pairs.

The design objective replaces Supergeo’s scalar sum-matching with a difference-in-differences parallelism score on the full pre-period vector. For a supergeo pair split into halves A and B with mean trajectories \(\bar Y_A, \bar Y_B\), define the DiD level shift as the pre-period mean gap \(\delta = \overline{(\bar Y_A - \bar Y_B)}\) and score the pair by the unnormalised variance (sum of squared deviations) of the level-removed pre-period gap

\[\text{score}(A, B) = \sum_{t} \big[(\bar Y_{A,t} - \bar Y_{B,t}) - \delta\big]^2 .\]

This is exactly the pre-period residual sum of squares of a DiD fit (cf. mlsynth.utils.selector_helpers._did_from_mean()): minimising it makes the two halves run parallel, so the within-pair DiD comparison is clean regardless of their levels (the level is absorbed by \(\delta\)).
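A small worked instance makes the level-invariance concrete. The sketch below is a standard-library stand-in for the score above (the name level_removed_ss is hypothetical, not the library's function): two halves that differ only by a constant score zero, while a half with a different shape does not.

```python
def level_removed_ss(mean_a, mean_b):
    """Sum over t of [(A_t - B_t) - delta]^2, with delta the mean pre-period gap."""
    gap = [a - b for a, b in zip(mean_a, mean_b)]
    delta = sum(gap) / len(gap)  # DiD level shift
    return sum((g - delta) ** 2 for g in gap)


half_a = [10.0, 12.0, 11.0, 13.0]
half_b_parallel = [x - 4.0 for x in half_a]   # same shape, lower level
half_b_diverging = [6.0, 7.0, 8.0, 9.0]       # different shape

score_parallel = level_removed_ss(half_a, half_b_parallel)
score_diverging = level_removed_ss(half_a, half_b_diverging)
```

The diverging pair has gap (4, 5, 3, 4) around a mean of 4, so its score is 0 + 1 + 1 + 0 = 2.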

mlsynth.utils.pangeo_helpers.parallelism.best_split(members: ndarray, Ypre: ndarray, max_size: int, objective: str = 'ss_res', weights: ndarray | None = None, cov: ndarray | None = None, cov_scales: ndarray | None = None, cov_weights: ndarray | None = None, unit_weights: ndarray | None = None) → Tuple[float, List[int], List[int]]#

Best treatment/control split of a candidate supergeo pair.

Parameters:

members (np.ndarray) – Row indices (into Ypre) of the units in this candidate pair.
Ypre (np.ndarray) – Pre-period outcomes, shape (n_units, T0).
max_size (int) – Maximum size of either supergeo (Q).
objective ({“ss_res”, “r2”, “weighted”}) – Per-pair cost to minimise (see split_cost()).
weights (np.ndarray, optional) – Length-T0 weights for objective="weighted".
cov (np.ndarray, optional) – Baseline covariate matrix, shape (n_units, M) aligned with the rows of Ypre. When given, a standardized SMD^2 imbalance term is added to each split’s trajectory cost (see covariate_imbalance()).
cov_scales (np.ndarray, optional) – Length-M standardization scales for the covariates.
cov_weights (np.ndarray, optional) – Length-M per-covariate penalty weights (default 1 each).
unit_weights (np.ndarray, optional) – Length-n_units per-unit aggregation weights (e.g. population); the supergeo mean trajectory is the weighted average of its members.

Returns:

score (float) – Minimum cost over admissible splits (inf if none).
side_a, side_b (list of int) – The treatment / control halves (unit indices) achieving it.

mlsynth.utils.pangeo_helpers.parallelism.covariate_imbalance(cov_a: ndarray, cov_b: ndarray, scales: ndarray, weights: ndarray | None = None) → float#

Weighted standardized SMD^2 between two supergeos’ covariate means.

For supergeo means \(\bar c_A, \bar c_B\) (averaged over each half’s units) and per-covariate scales \(s_m\),

\[\sum_m w_m \Big(\frac{\bar c_{A,m} - \bar c_{B,m}}{s_m}\Big)^2 .\]

A precomputed scalar, so adding it to the trajectory cost keeps the outer set-partitioning problem a linear MILP.
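The penalty is simple enough to compute by hand. The following is a minimal sketch of the formula above using plain lists; the function name and the example covariates are illustrative assumptions, and unit weights within each half are taken to be equal.

```python
def smd2_imbalance(cov_a, cov_b, scales, weights=None):
    """Weighted sum of squared standardized mean differences between two halves.

    cov_a / cov_b hold one row of M covariate values per unit in each half.
    """
    m = len(scales)
    if weights is None:
        weights = [1.0] * m  # default penalty weight of 1 per covariate
    mean_a = [sum(row[j] for row in cov_a) / len(cov_a) for j in range(m)]
    mean_b = [sum(row[j] for row in cov_b) / len(cov_b) for j in range(m)]
    return sum(
        w * ((ma - mb) / s) ** 2
        for w, ma, mb, s in zip(weights, mean_a, mean_b, scales)
    )


# Two covariates (say, population in thousands and store count).
cov_a = [[100.0, 1.0], [120.0, 3.0]]   # half A: means (110, 2)
cov_b = [[90.0, 2.0]]                  # half B: means (90, 2)
scales = [10.0, 1.0]

penalty = smd2_imbalance(cov_a, cov_b, scales)
penalty_downweighted = smd2_imbalance(cov_a, cov_b, scales, weights=[0.5, 1.0])
```

Here only the first covariate is imbalanced, by two scale units, so the unweighted penalty is 4.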

mlsynth.utils.pangeo_helpers.parallelism.enumerate_candidate_pairs(unit_indices: ndarray, Ypre: ndarray, max_size: int, objective: str = 'ss_res', weights: ndarray | None = None, cov: ndarray | None = None, cov_scales: ndarray | None = None, cov_weights: ndarray | None = None, unit_weights: ndarray | None = None) → List[dict]#

All admissible supergeo pairs over unit_indices with their scores.

A candidate pair is any subset of size 2 .. 2*max_size that can be split into two halves each of size <= max_size. Returns a list of {"members", "score", "side_a", "side_b"} dicts – the inputs to the set-partitioning MIP. score is the chosen objective (plus the optional standardized covariate-imbalance penalty when cov is given).
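The enumeration described above can be sketched with itertools. This is an illustrative stand-in under stated assumptions, not the library code: the function names are hypothetical, members are equally weighted, only the "ss_res" objective is scored, and the side labelled side_a is arbitrary (the score is symmetric, so the first member is pinned to side_a to avoid visiting mirrored splits twice).

```python
from itertools import combinations


def level_removed_ss(mean_a, mean_b):
    gap = [a - b for a, b in zip(mean_a, mean_b)]
    delta = sum(gap) / len(gap)
    return sum((g - delta) ** 2 for g in gap)


def mean_trajectory(rows, Ypre):
    """Equal-weighted mean pre-period trajectory of the units in rows."""
    return [sum(Ypre[i][t] for i in rows) / len(rows) for t in range(len(Ypre[0]))]


def best_split_sketch(members, Ypre, max_size):
    """Cheapest split of members into two non-empty halves of size <= max_size."""
    best = (float("inf"), [], [])
    first, rest = members[0], members[1:]
    for k in range(len(rest) + 1):
        for extra in combinations(rest, k):
            side_a = [first, *extra]
            side_b = [u for u in members if u not in side_a]
            if not side_b or len(side_a) > max_size or len(side_b) > max_size:
                continue
            score = level_removed_ss(
                mean_trajectory(side_a, Ypre), mean_trajectory(side_b, Ypre)
            )
            if score < best[0]:
                best = (score, side_a, side_b)
    return best


def enumerate_candidates_sketch(unit_indices, Ypre, max_size):
    """All subsets of size 2..2*max_size with their best admissible split."""
    out = []
    for size in range(2, 2 * max_size + 1):
        for members in combinations(unit_indices, size):
            score, a, b = best_split_sketch(list(members), Ypre, max_size)
            if score < float("inf"):
                out.append({"members": list(members), "score": score,
                            "side_a": a, "side_b": b})
    return out


Ypre = [[1.0, 2.0, 3.0], [5.0, 6.0, 7.0], [1.0, 3.0, 2.0], [4.0, 6.0, 5.0]]
candidates = enumerate_candidates_sketch([0, 1, 2, 3], Ypre, max_size=2)
```

With four units and Q = 2 there are 6 + 4 + 1 = 11 candidates; the count grows combinatorially with the arm size and Q, which is why a fast heuristic is useful on larger arms.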

mlsynth.utils.pangeo_helpers.parallelism.gap_variance(mean_a: ndarray, mean_b: ndarray) → float#: Variance of the level-removed gap between two trajectories (the DiD pre-period residual sum of squares).

mlsynth.utils.pangeo_helpers.parallelism.parallelism_r2(mean_a: ndarray, mean_b: ndarray) → float#: R^2 of the DiD parallel-trends fit (1 = perfectly parallel).

mlsynth.utils.pangeo_helpers.parallelism.split_cost(mean_a: ndarray, mean_b: ndarray, objective: str = 'ss_res', weights: ndarray | None = None) → float#

Per-pair cost minimised by the MIP (lower = more parallel).

All three objectives are precomputed scalars, so the outer selection problem stays a linear MILP.

"ss_res" – absolute DiD residual sum of squares \(\sum_t (g_t - \bar g)^2\) (scale-dependent; big-amplitude pairs weigh more).
"r2" – 1 - R^2 = ss_res / ss_tot (scale-free; every pair counts equally, FDID’s R^2 criterion but optimised exactly).
"weighted" – weighted residual SS \(\sum_t w_t (g_t - \bar g_w)^2\) with the level removed at the weighted mean \(\bar g_w = \sum_t w_t g_t / \sum_t w_t\) (e.g. recency weighting, so recent parallelism matters more).

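As an illustration of the "weighted" objective, the sketch below pairs it with geometric recency weights of the form recency_decay**(T0-1-t), normalised to sum to T0, as described for the recency_decay parameter. The function names are hypothetical and the code is a standard-library sketch rather than the library's implementation.

```python
def recency_weights(T0, decay=0.97):
    """Geometric weights decay**(T0-1-t), rescaled so they sum to T0."""
    raw = [decay ** (T0 - 1 - t) for t in range(T0)]
    scale = T0 / sum(raw)
    return [r * scale for r in raw]


def weighted_ss(mean_a, mean_b, weights):
    """sum_t w_t (g_t - g_w)^2 with the level removed at the weighted mean g_w."""
    gap = [a - b for a, b in zip(mean_a, mean_b)]
    g_w = sum(w * g for w, g in zip(weights, gap)) / sum(weights)
    return sum(w * (g - g_w) ** 2 for w, g in zip(weights, gap))


half_a = [10.0, 12.0, 11.0, 13.0]
half_b = [6.0, 7.0, 8.0, 9.0]

flat = weighted_ss(half_a, half_b, recency_weights(4, decay=1.0))
recent = weighted_ss(half_a, half_b, recency_weights(4, decay=0.5))
w = recency_weights(3, decay=0.5)
```

With decay equal to 1 every period gets weight 1 and the cost collapses to the plain "ss_res" value; smaller decay values let late pre-period deviations dominate the cost.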
Set-partitioning MIP for PANGEO supergeo-pair design.

Given the admissible supergeo pairs over an arm’s units (each with a pre-period parallelism score), select a subset of pairs that partitions the units, covering every unit exactly once, while minimising total non-parallelism – the Supergeo covering formulation (Chen et al. 2023), unchanged except that the per-pair score is the difference-in-differences parallelism score rather than a scalar sum-difference.

\[\min_{x \in \{0,1\}^{|\mathcal F|}} \sum_{G} \text{score}(G)\, x_G \quad\text{s.t.}\quad M^\top x = \mathbf 1 \ (\text{exact cover}),\; \mathbf 1^\top x \ge \kappa\ (\text{min pairs}).\]

Solved with cvxpy using the first installed mixed-integer backend (HiGHS by preference, else SCIP / GLPK_MI / CBC).
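To make the exact-cover program concrete, here is a tiny exhaustive-search stand-in. It is deliberately not the cvxpy/HiGHS solver: it is a hypothetical recursive search, suitable only for toy instances and assuming non-negative scores, that returns the same kind of optimum (minimum total score, every unit covered once, at least min_pairs pairs).

```python
def solve_exact_cover_sketch(candidates, units, min_pairs=1):
    """Exhaustive exact cover; candidates are dicts with 'members' and 'score'."""
    best = {"score": float("inf"), "chosen": None}

    def recurse(remaining, chosen, total):
        if total >= best["score"]:      # prune: scores are non-negative
            return
        if not remaining:
            if len(chosen) >= min_pairs:
                best["score"], best["chosen"] = total, list(chosen)
            return
        pivot = min(remaining)          # the next uncovered unit must be covered
        for idx, cand in enumerate(candidates):
            members = set(cand["members"])
            if pivot in members and members <= remaining:
                chosen.append(idx)
                recurse(remaining - members, chosen, total + cand["score"])
                chosen.pop()

    recurse(frozenset(units), [], 0.0)
    return best["score"], best["chosen"]


toy = [
    {"members": [0, 1], "score": 1.0},
    {"members": [2, 3], "score": 1.5},
    {"members": [0, 2], "score": 0.2},
    {"members": [1, 3], "score": 0.4},
    {"members": [0, 1, 2, 3], "score": 0.5},
]
score_one, chosen_one = solve_exact_cover_sketch(toy, [0, 1, 2, 3], min_pairs=1)
score_two, chosen_two = solve_exact_cover_sketch(toy, [0, 1, 2, 3], min_pairs=2)
```

The toy instance shows what the min-pairs constraint does: without it the single four-geo pair wins on score (0.5), while requiring two pairs forces the {0, 2} and {1, 3} design (0.6).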

mlsynth.utils.pangeo_helpers.mip.solve_partition(candidate_pairs: List[dict], unit_indices: ndarray, min_pairs: int = 1, return_diagnostics: bool = False)#

Select the exact-cover set of supergeo pairs of minimum total score.

Dispatches to the standalone highspy HiGHS backend when available (dropping the cvxpy middleman for the MIP), otherwise falls back to the cvxpy MIP. Both solve the identical exact-cover program and return the same optimum; see _solve_partition_highspy() / _solve_partition_cvxpy().

Panel ingestion for the PANGEO design estimator.

Pivots a historical (pre-treatment) long panel into a wide units x time outcome matrix and records each unit’s treatment-arm eligibility from a single categorical arm column (values A, B, … ). The design is run independently within each arm.

class mlsynth.utils.pangeo_helpers.setup.PangeoInputs(Y: ~numpy.ndarray, unit_names: ~typing.List[~typing.Any], time_labels: ~numpy.ndarray, arm_of: ~typing.Dict[~typing.Any, ~typing.Any], arm_units: ~typing.Dict[~typing.Any, ~numpy.ndarray], covariates: ~numpy.ndarray | None = None, covariate_names: ~typing.List[str] = <factory>, covariate_scales: ~numpy.ndarray | None = None, weights: ~numpy.ndarray | None = None, weight_name: str | None = None)#

Preprocessed pre-treatment panel for PANGEO.

Y#

Pre-period outcomes, shape (N, T); rows = units in unit_names order.

Type:: np.ndarray

unit_names#

Length-N unit identifiers.

Type:: list

time_labels#

Length-T time labels.

Type:: np.ndarray

arm_of#

{unit_name: arm_label}.

Type:: dict

arm_units#

{arm_label: np.ndarray of row indices} (the arm’s geo pool).

Type:: dict

covariates#

Baseline covariate matrix, shape (N, M) aligned with unit_names rows (None if no covariates requested).

Type:: np.ndarray or None

covariate_names#

Length-M covariate column names (empty if none).

Type:: list

covariate_scales#

Length-M cross-unit standard deviations used to standardize the covariate imbalance (None if no covariates).

Type:: np.ndarray or None

weights#

Length-N per-unit aggregation weights (e.g. population) aligned with unit_names (None = equal weights). Used for both the supergeo mean trajectory in the design and the downstream ATT.

Type:: np.ndarray or None

weight_name#

Name of the weight column (None if equal weights).

Type:: str or None

Y: ndarray#

arm_of: Dict[Any, Any]#

arm_units: Dict[Any, ndarray]#

covariate_names: List[str]#

covariate_scales: ndarray | None = None#

covariates: ndarray | None = None#

time_labels: ndarray#

unit_names: List[Any]#

weight_name: str | None = None#

weights: ndarray | None = None#

mlsynth.utils.pangeo_helpers.setup.build_post_matrix(post_df: DataFrame, inputs: PangeoInputs, outcome: str, unitid: str, time: str) → tuple[ndarray, ndarray]#

Pivot the post-treatment rows into a (N, T_post) outcome matrix aligned with inputs.unit_names.

Returns (Y_post, post_time_labels). Every design unit must appear in the post period with no missing cells (a balanced post panel).

mlsynth.utils.pangeo_helpers.setup.prepare_pangeo_inputs(df: DataFrame, outcome: str, arm: str, unitid: str, time: str, min_units_per_arm: int = 2, covariates: List[str] | None = None, standardize_covariates: bool = True, weight_col: str | None = None) → PangeoInputs#

Pivot a historical panel into PangeoInputs.

Parameters:

df (pd.DataFrame) – Balanced pre-treatment long panel; one row per (unit, time).
outcome (str) – Historical outcome column (e.g. sales).
arm (str) – Single categorical column naming each geo’s eligible treatment arm (e.g. values A/B/C). Units are designed within their arm.
unitid, time (str) – Unit-id and time column names.
min_units_per_arm (int) – Minimum geos required per arm to form at least one supergeo pair.
covariates (list of str, optional) – Baseline covariate columns to balance across supergeo halves. Each unit’s covariate value is its mean over the panel (so a column that varies over time is reduced to a per-unit baseline level).
standardize_covariates (bool) – Divide each covariate’s imbalance by its cross-unit std (default). With False the raw scale is used (scales = 1).
weight_col (str, optional) – Per-unit aggregation weight column (e.g. population), constant within a unit. Makes the supergeo aggregate a weighted average; None (default) gives equal weights.

Orchestration for the PANGEO design estimator.

For each treatment arm: enumerate admissible supergeo pairs over the arm’s geos (scored by pre-period DiD parallelism), solve the set-partitioning MIP to choose the exact-cover design of minimum total non-parallelism, and assemble the per-arm supergeo pairs with their treatment/control halves.

mlsynth.utils.pangeo_helpers.pipeline.run_pangeo(inputs: PangeoInputs, *, max_supergeo_size: int | None = None, min_pairs: int = 1, fast: bool = False, fast_candidates: int = 5, objective: str = 'ss_res', recency_decay: float = 0.97, frac_E: float = 0.7, covariate_weights: Dict[str, float] | None = None, compute_power: bool = True, power_target: float = 0.8, power_alpha: float = 0.05, power_post_periods: Sequence[int] | None = None, att_augment: bool = True, att_trend: bool = True, q_selection: str = 'mde_min', q_min_pairs: int = 1, compute_q_sweep: bool = False) → PangeoResults#

Design parallel supergeo pairs within each arm.

Parameters:

inputs (PangeoInputs) – Preprocessed pre-treatment panel.
max_supergeo_size (int, optional) – Q – the maximum size of either supergeo within a pair. If None (the default), Q is selected automatically: every feasible Q in 1..min(ceil(smallest_arm/2), 6) is designed and the one minimising the program-level MDE is returned (see _auto_select_q()). The sweep is recorded in results.metadata["q_sweep"].
min_pairs (int) – Minimum number of supergeo pairs per arm.
objective ({“ss_res”, “r2”, “weighted”}) – Per-pair parallelism cost minimised by the MIP (see mlsynth.utils.pangeo_helpers.parallelism.split_cost()).
recency_decay (float) – Geometric recency-weight decay for objective="weighted": period t gets weight recency_decay**(T0-1-t) (recent periods up-weighted), normalised to sum to T0.
frac_E (float) – Fraction of the pre-period used as the estimation window E that the split is optimised over; the remaining tail is the blank window B, held out so its gap residuals are an honest, out-of-sample estimate of the parallel-trends noise (powering the MDE and the conformal CIs). Mirrors LEXSCM / SPCD. Falls back to the full pre when the panel is too short to leave a usable B.
covariate_weights (dict, optional) – {covariate_name: weight} on the standardized SMD^2 imbalance penalty (default 1.0 each). Only used when inputs.covariates is present.
compute_power (bool) – Attach a program- and arm-level MDE / power analysis to the result (see mlsynth.utils.pangeo_helpers.power).
power_target (float) – Target power for the stored MDE (default 0.80).
power_alpha (float) – Two-sided significance level for the MDE (default 0.05).
power_post_periods (sequence of int, optional) – Post-period horizons to evaluate (default range(2, 13) = 2..12).
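Two pieces of arithmetic in the parameter list above can be sketched directly: the Q range swept when max_supergeo_size is None, and the estimation/blank split implied by frac_E. The function names are hypothetical. The sketch also assumes the estimation window length is floor(frac_E * T0); the library's exact rounding, its feasibility filter on Q, and its short-panel fallback are not specified here and are not reproduced.

```python
import math


def candidate_q_values(arm_sizes, cap=6):
    """Q values 1..min(ceil(smallest_arm / 2), cap), before feasibility checks."""
    upper = min(math.ceil(min(arm_sizes) / 2), cap)
    return list(range(1, upper + 1))


def split_estimation_blank(T0, frac_E=0.7):
    """Index ranges for the estimation window E and the held-out blank window B.

    Assumption: len(E) = floor(frac_E * T0); the remaining tail is B.
    """
    n_E = int(frac_E * T0)
    return range(0, n_E), range(n_E, T0)


q_values = candidate_q_values([6, 6, 6])   # three arms of six geos each
E, B = split_estimation_blank(104)         # 104 pre-period weeks
```

For three arms of six geos the sweep covers Q = 1, 2, 3; under the stated rounding assumption a 104-week pre-period gives 72 estimation weeks and a 32-week blank window.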

Program- and arm-level power / MDE analysis for a PANGEO design.

Once a PANGEO design is frozen, the minimum detectable effect (MDE) after X post-treatment periods is a closed-form function of the pre-period parallelism the design achieved – which is exactly what the MILP minimised, so power and the design objective are the same quantity.

For a supergeo pair the no-effect gap \(g_t = \bar Y^T_t - \bar Y^C_t\) sits on its parallel-trends line \(\delta_p = \overline{g}_{\text{pre}}\); its per-period residual variance is

\[\sigma_p^2 = \text{ss\_res}_p / (T_0 - 1) \qquad (\text{ss\_res}_p = \text{pair.gap\_variance}),\]

the noise an X-period difference-in-differences ATT must overcome. The estimator \(\hat\tau_p = \overline{g}_{\text{post}} - \delta_p\) has variance

\[\operatorname{Var}(\hat\tau_p) = \sigma_p^2\,\big[f(X,\rho) + f(T_0,\rho)\big],\]

where \(f(n,\rho) = \operatorname{Var}(\text{mean of } n \text{ serially-correlated periods})/\sigma^2\) is the variance-inflation factor of an AR(1) process. Consecutive weeks are correlated, so X post weeks are worth far fewer than X independent draws – the trap a naive i.i.d. power calculation falls into. \(\rho\) is estimated from the pooled pre-period gap residuals of the chosen pairs.

The program ATT is the treated-size-weighted average of the pair ATTs; its MDE is the headline number a program owner reports. Per-arm curves are also returned. Pairs are treated as independent across the program, so the arm count multiplies the effective sample size – which is why pooling to the program level detects far smaller effects than any one small arm could. Cross-pair common shocks within an arm are ignored (a mild optimism; a placebo-in-time engine would absorb them).
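For a single pair, the variance formula above can be evaluated in closed form. The sketch below uses the standard expression for the variance of a mean of \(n\) AR(1) observations, \(f(n,\rho) = n^{-1}\big[1 + 2\sum_{k=1}^{n-1}(1-k/n)\rho^k\big]\), and the usual two-sided Gaussian MDE multiplier \(z_{1-\alpha/2} + z_{\text{power}}\). The function names are hypothetical, and the library's pooling across pairs and arms is not reproduced.

```python
import math
from statistics import NormalDist


def ar1_mean_inflation(n, rho):
    """Var(mean of n AR(1) periods) / sigma^2."""
    return (1.0 + 2.0 * sum((1 - k / n) * rho ** k for k in range(1, n))) / n


def pair_mde(ss_res, T0, X, rho, alpha=0.05, power=0.8):
    """MDE and SE of one pair's X-period DiD ATT under the AR(1) noise model."""
    sigma2 = ss_res / (T0 - 1)  # per-period residual variance
    se = math.sqrt(sigma2 * (ar1_mean_inflation(X, rho) + ar1_mean_inflation(T0, rho)))
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z * se, se


mde_iid, se_iid = pair_mde(ss_res=103.0, T0=104, X=8, rho=0.0)
mde_ar, se_ar = pair_mde(ss_res=103.0, T0=104, X=8, rho=0.6)
```

With the same residual variance, positive serial correlation raises the standard error and hence the MDE, which is the gap between a naive i.i.d. calculation and the serially-correlated one.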

class mlsynth.utils.pangeo_helpers.power.MDEPoint(post_periods: int, mde_absolute: float, mde_pct: float, se: float)#

Minimum detectable effect at one post-period horizon.

post_periods#

Number of post-treatment periods (X) assumed.

Type:: int

mde_absolute#

MDE on the per-unit outcome scale.

Type:: float

mde_pct#

MDE as a percentage of the baseline outcome level.

Type:: float

se#

Standard error of the X-period ATT under the design.

Type:: float

mde_absolute: float#

mde_pct: float#

post_periods: int#

se: float#

class mlsynth.utils.pangeo_helpers.power.PangeoPower(program: ~mlsynth.utils.pangeo_helpers.power.PowerCurve, arms: ~typing.Dict[~typing.Any, ~mlsynth.utils.pangeo_helpers.power.PowerCurve], alpha: float, power_target: float, post_periods: ~typing.List[int], serial_correlation: float, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#

Power / MDE analysis attached to PangeoResults.

program#

Headline program-level MDE curve (pooled across all arms).

Type:: PowerCurve

arms#

{arm_label: PowerCurve} – per-arm MDE curves.

Type:: dict

alpha#

Two-sided significance level assumed.

Type:: float

power_target#

Target power the MDEs are computed at (default 0.80).

Type:: float

post_periods#

Horizons evaluated.

Type:: list of int

serial_correlation#

Pooled lag-1 (AR(1)) autocorrelation of the gap residuals used to inflate the variance for serial dependence.

Type:: float

alpha: float#

arms: Dict[Any, PowerCurve]#

metadata: Dict[str, Any]#

post_periods: List[int]#

power_for_effect(effect_pct: float, post_periods: int, level: str = 'program') → float#

Power to detect an effect of effect_pct % at horizon post_periods.

Inverts the MDE relation for a given true effect size (two-sided Gaussian approximation).
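A minimal sketch of that inversion, assuming the effect is converted to the outcome scale through the baseline level and that both rejection tails are counted (the library may keep only the dominant tail). The function name and its arguments are illustrative, not the method's signature.

```python
from statistics import NormalDist


def gaussian_power(effect_pct, baseline, se, alpha=0.05):
    """Two-sided Gaussian power for a true effect of effect_pct % of baseline."""
    norm = NormalDist()
    shift = abs(effect_pct) / 100.0 * baseline / se   # effect in SE units
    z_crit = norm.inv_cdf(1 - alpha / 2)
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


baseline, se = 100.0, 2.0
# The MDE at 80% power, expressed as a percentage of baseline.
mde_pct = (NormalDist().inv_cdf(0.975) + NormalDist().inv_cdf(0.8)) * se / baseline * 100.0

power_at_mde = gaussian_power(mde_pct, baseline, se)
power_at_zero = gaussian_power(0.0, baseline, se)
```

By construction the power at the MDE is the target (0.80 up to a negligible second-tail term), and the power at a zero effect is the significance level.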

power_target: float#

program: PowerCurve#

serial_correlation: float#

summary() → DataFrame#: Tidy table of MDE (% of baseline) by horizon: program + each arm.

class mlsynth.utils.pangeo_helpers.power.PowerCurve(level: str, baseline: float, n_treated: int, n_pairs: int, points: List[MDEPoint])#

MDE-vs-horizon curve at one aggregation level (program or arm).

level#

"program" or an arm label.

Type:: str

baseline#

Treated-group baseline outcome level used to express mde_pct.

Type:: float

n_treated#

Number of treated geos contributing.

Type:: int

n_pairs#

Number of supergeo pairs contributing.

Type:: int

points#

One entry per post-period horizon.

Type:: list of MDEPoint

baseline: float#

level: str#

mde_pct_by_horizon() → Dict[int, float]#: {post_periods: mde_pct} for quick lookup.

n_pairs: int#

n_treated: int#

points: List[MDEPoint]#

mlsynth.utils.pangeo_helpers.power.compute_pangeo_power(arm_designs: Dict[Any, Any], *, post_periods: Sequence[int] | None = None, alpha: float = 0.05, power_target: float = 0.8) → PangeoPower#

Program- and arm-level MDE curves for a frozen PANGEO design.

Parameters:

arm_designs (dict) – {arm_label: ArmDesign} from a completed design.
post_periods (sequence of int, optional) – Horizons to evaluate (default range(2, 13) = 2..12).
alpha (float) – Two-sided significance level (default 0.05).
power_target (float) – Target power (default 0.80).

Realized ATT estimation for a PANGEO design with post-period data.

PANGEO is a design method: with only pre-treatment history it returns the supergeo pairs and the treatment/control assignment. If the experiment has since run – i.e. the panel carries a post_col marking post-treatment periods – the same design (built on the pre-period alone) is scored against the realized post outcomes here, with inference following the Augmented Difference-in-Differences estimator of Li & Van den Bulte (2022, Marketing Science 42(4):746-767).

The estimator and its inference#

For a treated supergeo aggregate \(y^{T}_t\) and a control supergeo aggregate \(y^{C}_t\), the counterfactual is the regression projection

\[y^{T}_t = \delta_1 + \delta_2\, y^{C}_t + \gamma\, t + e_t , \qquad t = 1,\dots,T_1 ,\]

fit by least squares on the pre-period (the augmented DiD: the scale \(\delta_2\) is free rather than forced to 1, and a linear time trend \(\gamma t\) is included). Writing \(x_t = (1, y^{C}_t, t)'\) and \(\hat\delta\) for the OLS estimate, the per-period treatment effect is \(\hat u_t = y^{T}_t - x_t'\hat\delta\) and the ATT is \(\hat\Delta = T_2^{-1}\sum_{t=T_1+1}^{T}\hat u_t\).
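The point estimate can be sketched without any numerical library. The code below fits the three-regressor projection by solving the normal equations (adequate for illustration, though a QR-based least-squares routine is numerically preferable) and averages the post-period gaps. The function names and the synthetic series are assumptions made for the example; inference is not covered.

```python
import math


def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [mr - factor * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def augmented_did_att(y_treated, y_control, T1):
    """OLS of treated on (1, control, t) over t = 1..T1, then the mean post gap."""
    X = [[1.0, yc, float(t)] for t, yc in enumerate(y_control, start=1)]
    X_pre, y_pre = X[:T1], y_treated[:T1]
    XtX = [[sum(r[i] * r[j] for r in X_pre) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * y for r, y in zip(X_pre, y_pre)) for i in range(3)]
    delta = solve_linear(XtX, Xty)
    counterfactual = [sum(d * x for d, x in zip(delta, row)) for row in X]
    effects = [y - c for y, c in zip(y_treated[T1:], counterfactual[T1:])]
    return sum(effects) / len(effects), delta


# Noise-free synthetic series: scale 1.5, trend 0.1, and a lift of 4 after T1.
T, T1, true_effect = 24, 20, 4.0
y_control = [50.0 + 5.0 * math.sin(0.7 * t) + 0.3 * t for t in range(1, T + 1)]
y_treated = [2.0 + 1.5 * yc + 0.1 * t + (true_effect if t > T1 else 0.0)
             for t, yc in enumerate(y_control, start=1)]
att, delta = augmented_did_att(y_treated, y_control, T1)
```

Because the synthetic treated series is an exact linear function of the regressors before launch, the fit recovers the scale of 1.5 and the ATT of 4 up to floating-point error; a plain DiD with the scale forced to 1 would leave the mismatch in its residual.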

Li & Van den Bulte show (Propositions 3.1-3.3; Web Appendix C) that \(\sqrt{T_2}(\hat\Delta-\Delta)\to N(0,\Sigma_1+\Sigma_2)\), which gives the prediction-variance standard error (their C.13)

\[\widehat{\operatorname{Var}}(\hat\Delta) = \hat\omega^2\Big[\,\bar x_{\text{post}}' \big(\textstyle\sum_{t=1}^{T_1} x_t x_t'\big)^{-1} \bar x_{\text{post}} \;+\; \tfrac{1}{T_2}\,\Big] ,\]

where \(\bar x_{\text{post}}\) is the post-period mean of \(x_t\) and \(\hat\omega^2\) is the residual variance, estimated over the long pre-period (a Newey-West/Bartlett long-run variance with lag \(\lfloor T_1^{1/4}\rfloor\) to allow serial correlation; lag 0 is the i.i.d. case \(\hat\omega^2=\hat e'\hat e/(T_1-k)\) for \(k\) regressors). The two terms are the coefficient-estimation variance (\(\Sigma_1\)) and the post-period averaging variance (\(\Sigma_2\)). The CI is \(\hat\Delta\pm z_{1-\alpha/2}\,\text{SE}\).
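The long-run variance step can be sketched as follows, using the standard Bartlett-kernel form \(\hat\omega^2 = \hat\gamma_0 + 2\sum_{j=1}^{L}\big(1-\tfrac{j}{L+1}\big)\hat\gamma_j\). The function names are hypothetical, and dividing every autocovariance by \(T_1-k\) is an assumption made so that lag 0 reproduces the i.i.d. formula above; the library's normalisation may differ.

```python
import math


def default_lag(T1):
    """Truncation lag floor(T1 ** (1/4))."""
    return int(math.floor(T1 ** 0.25))


def bartlett_long_run_variance(resid, k=0, lag=None):
    """Newey-West long-run variance of pre-period residuals with Bartlett weights."""
    T1 = len(resid)
    if lag is None:
        lag = default_lag(T1)
    denom = T1 - k                      # assumption: degrees-of-freedom scaling
    omega2 = sum(e * e for e in resid) / denom
    for j in range(1, lag + 1):
        gamma_j = sum(resid[t] * resid[t - j] for t in range(j, T1)) / denom
        omega2 += 2.0 * (1.0 - j / (lag + 1)) * gamma_j
    return omega2


# A perfectly alternating residual: strong negative lag-1 autocorrelation.
resid = [1.0, -1.0] * 8
omega2_iid = bartlett_long_run_variance(resid, lag=0)
omega2_lag1 = bartlett_long_run_variance(resid, lag=1)
```

Negative serial correlation shrinks the long-run variance below the i.i.d. value, while the positive correlation typical of weekly sales gaps inflates it, which is the case the lag rule is meant to guard against.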

Why this estimator suits the supergeo gap#

The theory explicitly admits trend and unit-root (integrated) common factors (Li & Van den Bulte Assumptions C2/C3, Prop 3.3). The augmentation \(\delta_2\) makes treated-on-control a cointegrating regression, scaling out a shared integrated factor; the trend term absorbs deterministic drift; and the prediction-variance term automatically inflates when the post-period control drifts outside its pre-period range, pricing the extrapolation uncertainty. The validity condition is that the residual \(e_t\) be (weakly dependent) stationary – which the augmentation + trend deliver. The arm and program ATTs apply this single-treated-unit estimator to the treated-size-weighted supergeo aggregate at each level; the program number is the headline.

class mlsynth.utils.pangeo_helpers.effects.AttEstimate(level: str, att: float, att_pct: float, baseline: float, se: float, ci_lower: float, ci_upper: float, ci_lower_pct: float, ci_upper_pct: float, p_value: float, n_post: int, scale: float, observed: ~numpy.ndarray = <factory>, counterfactual: ~numpy.ndarray = <factory>)#

An augmented-DiD ATT (Li & Van den Bulte 2022) at one level.

level#

"program" or an arm label.

Type:: str

att#

Augmented-DiD ATT on the (population-weighted) outcome scale.

Type:: float

att_pct#

ATT as a percentage of the post-period counterfactual level.

Type:: float

baseline#

Mean post-period counterfactual outcome used for att_pct (the predicted treated series absent treatment).

Type:: float

se#

Prediction-variance standard error (Li & Van den Bulte C.13).

Type:: float

ci_lower, ci_upper

Confidence interval for the absolute ATT.

Type:: float

ci_lower_pct, ci_upper_pct

The same interval as a percentage of baseline.

Type:: float

p_value#

Two-sided normal p-value for the null of no effect.

Type:: float

n_post#

Number of post-treatment periods averaged.

Type:: int

scale#

Fitted augmentation coefficient \(\hat\delta_2\) (1.0 if the augmentation is disabled, i.e. plain DiD).

Type:: float

observed#

Observed treated supergeo aggregate over pre + post periods.

Type:: np.ndarray

counterfactual#

Augmented-DiD counterfactual prediction of the treated aggregate over the same periods; the gap in the post window is the per-period effect.

Type:: np.ndarray
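The derived fields relate to `att`, `baseline` and `se` by simple arithmetic. The sketch below shows one way they could be computed; `summarize_att` is a hypothetical helper, and expressing the percentage fields as 100 times the ratio to `baseline` is an assumption about the scale, not something the field descriptions state.

```python
from statistics import NormalDist


def summarize_att(att, baseline, se, alpha=0.05):
    """Derive interval, percentage and p-value fields from att, baseline and se.

    Hypothetical helper; the 100x percentage scale is an assumption.
    """
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1.0 - alpha / 2.0)
    ci_lower, ci_upper = att - z * se, att + z * se
    # Two-sided normal p-value for the null of no effect.
    p_value = 2.0 * (1.0 - std_normal.cdf(abs(att) / se))
    return {
        "att_pct": 100.0 * att / baseline,
        "ci_lower": ci_lower,
        "ci_upper": ci_upper,
        "ci_lower_pct": 100.0 * ci_lower / baseline,
        "ci_upper_pct": 100.0 * ci_upper / baseline,
        "p_value": p_value,
    }
```

For example, an ATT of 2 on a counterfactual baseline of 100 with a standard error of 1 is a 2% lift with a two-sided p-value of about 0.046.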

class mlsynth.utils.pangeo_helpers.effects.PangeoEffects(program: AttEstimate, arms: Dict[Any, AttEstimate], pair_att: Dict[Any, List[float]], n_post: int, weighted: bool, alpha: float, metadata: Dict[str, Any] = <factory>, randomization: Dict[str, RandomizationInference] = <factory>)#

Realized ATT for a PANGEO design scored against post-period data.

program#

Headline program-level ATT (pooled across all arms).

Type:: AttEstimate

arms#

{arm_label: AttEstimate}.

Type:: dict

pair_att#

{arm_label: [per-pair ATT, ...]} (point estimates).

Type:: dict

n_post#

Number of post periods.

Type:: int

weighted#

Whether a population weight was used in the aggregation.

Type:: bool

alpha#

Significance level for the intervals.

Type:: float

randomization#

{"program"|arm_label: RandomizationInference} – the design-based (permutation + matched-pair) companion to the model-based ADID SE.

Type:: dict

metadata: Dict[str, Any]#

randomization_summary() → DataFrame#: Tidy table of the design-based (permutation) inference per level.

summary() → DataFrame#: Tidy table of the program and per-arm ATT estimates.

class mlsynth.utils.pangeo_helpers.effects.RandomizationInference(level: str, att: float, se_pair: float, ci_lower: float, ci_upper: float, p_permutation: float, n_pairs: int, df: int, exact: bool, n_perm: int)#

Design-based (randomization) inference for one level’s realized ATT.

Because treatment is randomized within each supergeo pair, the sharp null of no effect can be tested without a model: recompute the (weighted) mean of the antisymmetric within-pair DiD contrasts under every within-pair sign flip and compare to the observed statistic.

level#

"program" or an arm label.

Type:: str

att#

Design-based point estimate: the treated-size-weighted mean of the within-pair DiD contrasts \(d_k = (\bar y^T_{\text{post}}-\bar y^T_{\text{pre}}) - (\bar y^C_{\text{post}}-\bar y^C_{\text{pre}})\) (model-free; flips sign exactly when a pair’s treatment assignment flips).

Type:: float

se_pair#

Matched-pair (pair-clustered) standard error sqrt(K/(K-1) * sum_k wn_k^2 (d_k-att)^2), with wn_k the pair weights normalized to sum to one; reduces to sd(d_k)/sqrt(K) under equal weights. nan for a single pair.

Type:: float

ci_lower, ci_upper

att +/- t_{K-1} * se_pair interval.

Type:: float

p_permutation#

Two-sided randomization p-value: fraction of within-pair sign flips whose |statistic| >= |observed|.

Type:: float

n_pairs#

Number of supergeo pairs (K) – the randomization units.

Type:: int

df#

Degrees of freedom K-1 for the matched-pair t interval.

Type:: int

exact#

True if all 2**K sign flips were enumerated; False if the null distribution was Monte-Carlo sampled (large K).

Type:: bool

n_perm#

Number of sign patterns evaluated.

Type:: int
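The sign-flip test and the matched-pair standard error described above can be sketched as follows. This is an illustration under stated assumptions rather than the library's implementation: `sign_flip_inference`, the `max_exact` cutoff and the Monte-Carlo draw count are hypothetical, and the small-sample factor K/(K-1) is applied inside the square root, which is the form that reduces to sd(d_k)/sqrt(K) under equal weights.

```python
import itertools

import numpy as np


def sign_flip_inference(d, weights=None, max_exact=12, n_mc=10_000, seed=0):
    """Randomization inference on within-pair DiD contrasts d_k (hypothetical helper)."""
    d = np.asarray(d, dtype=float)
    K = d.size
    w = np.ones(K) if weights is None else np.asarray(weights, dtype=float)
    wn = w / w.sum()                      # normalized pair weights
    att = float(wn @ d)                   # observed weighted mean contrast

    if K <= max_exact:                    # enumerate all 2**K sign patterns
        signs = np.array(list(itertools.product([1.0, -1.0], repeat=K)))
        exact = True
    else:                                 # Monte-Carlo sample of sign patterns
        rng = np.random.default_rng(seed)
        signs = rng.choice([1.0, -1.0], size=(n_mc, K))
        exact = False

    null_stats = signs @ (wn * d)         # statistic under each sign flip
    # Small tolerance so the observed pattern always counts as a tie.
    p_perm = float(np.mean(np.abs(null_stats) >= abs(att) - 1e-12))

    if K > 1:
        se_pair = float(np.sqrt(K / (K - 1) * np.sum(wn**2 * (d - att) ** 2)))
    else:
        se_pair = float("nan")            # undefined for a single pair

    return {"att": att, "se_pair": se_pair, "p_permutation": p_perm,
            "n_pairs": K, "df": K - 1, "exact": exact,
            "n_perm": int(signs.shape[0])}
```

With K pairs the smallest attainable exact two-sided p-value is 2 / 2**K, so a design with three pairs can never go below 0.25; this is worth checking before relying on the permutation test for a small arm.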

mlsynth.utils.pangeo_helpers.effects.adid_counterfactual(YT: ndarray, YC: ndarray, n_pre: int, augment: bool = True, trend: bool = True) → ndarray#

Treated-series counterfactual from the (augmented) DiD fit.

Fit yT = d1 [+ d2*yC] [+ g*t] on the first n_pre periods and return the predicted treated trajectory over all periods – the line PANGEO plots against the observed treated aggregate. For plain DiD the counterfactual is yC + (d1 [+ g*t]); for augmented DiD it is the projection d1 + d2*yC [+ g*t] directly.
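The fit described above is an ordinary least-squares regression on the pre-period, projected over all periods. The sketch below reproduces that logic with NumPy; `adid_counterfactual_sketch` is a hypothetical stand-in written from the description, not the library function, and it assumes 1-D treated and control aggregates of equal length.

```python
import numpy as np


def adid_counterfactual_sketch(yT, yC, n_pre, augment=True, trend=True):
    """Treated-series counterfactual from an (augmented) DiD fit (illustrative)."""
    yT = np.asarray(yT, dtype=float)
    yC = np.asarray(yC, dtype=float)
    t = np.arange(yT.size, dtype=float)

    cols = [np.ones_like(t)]              # intercept d1
    if augment:
        cols.append(yC)                   # free coefficient d2 on the control
    if trend:
        cols.append(t)                    # linear trend g*t
    X = np.column_stack(cols)

    # Plain DiD fixes d2 = 1, so regress the treated-minus-control gap instead.
    target = yT if augment else yT - yC
    coef, *_ = np.linalg.lstsq(X[:n_pre], target[:n_pre], rcond=None)

    fitted = X @ coef                     # prediction over pre + post periods
    return fitted if augment else yC + fitted
```

The post-window difference between the observed treated series and this prediction is the per-period effect path; its mean over the post periods is the ATT.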

mlsynth.utils.pangeo_helpers.effects.build_effect_report(effects: PangeoEffects)#

Map the realized program-level ATT to a standard EffectResult.

Resolves a PANGEO DesignResult to the observational report family: the program Augmented-DiD ATT, its counterfactual/gap path, and the CI are packed into the standardized sub-models so report satisfies the same flat read-contract (att, counterfactual, gap, att_ci) as any effect estimator. The richer per-pair / design-based numbers stay on PangeoResults.effects.

mlsynth.utils.pangeo_helpers.effects.compute_pangeo_effects(results, inputs, Y_post: ndarray, *, alpha: float = 0.05, augment: bool = True, trend: bool = True) → PangeoEffects#

Augmented-DiD ATT (Li & Van den Bulte 2022) for a design scored on post outcomes, at the program and arm levels.

Parameters:

results (PangeoResults) – The frozen design (pairs + assignment) built on the pre-period.
inputs (PangeoInputs) – Pre-period inputs (supplies unit order and population weights).
Y_post (np.ndarray) – Post-period outcomes, shape (N, T_post), rows aligned with inputs.unit_names.
alpha (float) – Significance level for the CIs / p-values.
augment (bool) – Free augmentation coefficient delta_2 on the control aggregate (the augmented DiD). False forces delta_2 = 1 (plain DiD).
trend (bool) – Include a linear time-trend regressor.
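Because `Y_post` must be an (N, T_post) array whose rows follow `inputs.unit_names`, a long post-period frame usually needs pivoting and reordering first. The helper below is a hypothetical sketch of that step using pandas only; the column names are assumptions matching the simulator's `unit`, `time` and `sales` columns.

```python
import numpy as np
import pandas as pd


def build_y_post(post_df, unit_names, unit_col="unit", time_col="time",
                 outcome_col="sales"):
    """Pivot a long post-period panel to (N, T_post), rows in unit_names order.

    Hypothetical helper; column names are assumptions.
    """
    wide = post_df.pivot(index=unit_col, columns=time_col, values=outcome_col)
    # Reorder rows to the design's unit order and columns chronologically.
    wide = wide.reindex(index=list(unit_names)).sort_index(axis=1)
    if wide.isna().any().any():
        raise ValueError("post-period panel is unbalanced or is missing units")
    return wide.to_numpy(dtype=float)
```

Raising on missing cells is a deliberate choice here: a silently misaligned or partially filled row would bias the ATT for every pair that contains the affected geo.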

Seasonal factor-model simulator of sales-like panel data for PANGEO.

Generates a balanced panel of “sales” with the structure typical of geo marketing data, drawing on the factor-model DGPs used throughout mlsynth:

\[Y_{it} = \underbrace{\lambda_t^\top \mu_i}_{\text{low-rank factors}} + \underbrace{a_i \sin(2\pi t / s + \phi_i)}_{\text{seasonality}} + \underbrace{\gamma_i + \beta_i t}_{\text{unit level + trend}} + \varepsilon_{it},\]

across several non-overlapping treatment arms. The design problem is prospective, so the generated panel is pre-treatment only – it is the historical window a designer would use to build balanced supergeo pairs.
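To make the data-generating process concrete, the sketch below draws one panel term by term from the equation above. It is a simplified illustration, not the bundled simulator: `simulate_factor_panel` is a hypothetical name, it covers a single arm, and every distributional choice (random-walk factor step size, loading spread, amplitude and level ranges) is an assumption.

```python
import numpy as np


def simulate_factor_panel(n_units=6, T=104, n_factors=2, season_period=52,
                          noise=0.05, seed=0):
    """Draw Y (n_units x T) from factors + seasonality + level/trend + noise.

    Illustrative only; all distribution parameters are assumptions.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(T)

    factors = np.cumsum(rng.normal(0.0, 0.1, (T, n_factors)), axis=0)  # lambda_t
    loadings = rng.normal(1.0, 0.3, (n_units, n_factors))              # mu_i
    amp = rng.uniform(0.5, 1.5, n_units)                               # a_i
    phase = rng.uniform(0.0, 2.0 * np.pi, n_units)                     # phi_i
    level = rng.normal(10.0, 1.0, n_units)                             # gamma_i
    slope = rng.normal(0.0, 0.01, n_units)                             # beta_i

    low_rank = factors @ loadings.T                                    # (T, N)
    season = amp * np.sin(2.0 * np.pi * t[:, None] / season_period + phase)
    unit_part = level + slope * t[:, None]
    eps = rng.normal(0.0, noise, (T, n_units))

    return (low_rank + season + unit_part + eps).T                     # (N, T)
```

Geos with similar loadings, phase and slope produce nearly parallel series, which is exactly the structure a parallelism-based design is meant to find.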

mlsynth.utils.pangeo_helpers.simulation.make_seasonal_sales_panel(units_per_arm: int = 5, arms: Tuple[str, ...] = ('A', 'B', 'C'), T: int = 156, n_factors: int = 2, season_period: int = 52, noise: float = 0.05, seed: int = 0, covariates: bool = False, n_post: int = 0, factor: str = 'rw', season_amp: float = 1.0, trend_sd: float = 0.01) → DataFrame#

Simulate a seasonal, multi-arm, sales-like pre-treatment panel.

Parameters:

units_per_arm (int) – Number of geos (markets) eligible for each arm.
arms (tuple of str) – Arm labels; each unit is eligible for exactly one arm. Arms occupy non-overlapping geos.
T (int) – Number of pre-treatment periods (e.g. weeks; default 3 years).
n_factors (int) – Rank of the common low-rank factor structure.
season_period (int) – Seasonal cycle length (e.g. 52 weeks).
noise (float) – Idiosyncratic noise scale, relative to the signal.
seed (int) – RNG seed.
covariates (bool) – If True, also emit time-invariant baseline population and income columns (correlated with the unit’s level and factor loadings) for PANGEO’s covariate-balancing option.
n_post (int) – Number of post-treatment periods to append after the T pre periods. When > 0 the panel gains a post_col (0 = pre, 1 = post); the DGP continues unchanged (no treatment effect – the effect is injected by the caller after the design is fixed, which is how PANGEO’s ATT recovery is validated).
factor ({“rw”, “iid”, “ar1”}) – Process for the unobserved common factors that drive the supergeo gap. "rw" (default) is an integrated random walk – a stress test whose non-exchangeable loadings violate the conformal-inference assumption (Abadie & Zhao 2026, Thm 2) and demand the increment-bootstrap. "iid" and "ar1" are stationary (exchangeable) factors under which the blank-window conformal CI is exact.

Returns:

pd.DataFrame – Long panel with columns unit, time, sales, arm (plus population/income when covariates=True and post_col when n_post > 0).

Frozen dataclasses for the PANGEO design estimator.

PANGEO is a prospective experimental design method: from historical (pre-treatment) sales it partitions each treatment arm’s geos into supergeo pairs whose treatment/control halves are maximally parallel over the pre-period, so a later difference-in-differences / synthetic-control analysis has clean parallel trends. The output is a design (supergeo pairs + treatment/control assignment + achieved parallelism), not a treatment effect.

class mlsynth.utils.pangeo_helpers.structures.ArmDesign(arm: Any, pairs: List[SupergeoPair], n_units: int, total_gap_variance: float, mean_parallelism_r2: float, treatment_units: List[Any], control_units: List[Any])#

The supergeo-pair design for a single treatment arm.

arm#

Arm label.

Type:: Any

pairs#

The chosen supergeo pairs partitioning the arm’s units.

Type:: list of SupergeoPair

n_units#

Number of geos eligible for the arm.

Type:: int

total_gap_variance#

Sum of the pairs’ gap variances – the design objective value.

Type:: float

mean_parallelism_r2#

Mean within-pair parallel-trends R^2 across pairs.

Type:: float

treatment_units#

All units assigned to treatment in the arm.

Type:: list

control_units#

All units assigned to control in the arm.

Type:: list

class mlsynth.utils.pangeo_helpers.structures.PangeoResults(*, report: BaseEstimatorResults | None = None, assignment: Any | None = None, selected_units: Any | None = None, design_weights: WeightsResults | None = None, power: Any | None = None, metadata: Dict[str, Any] | None = None, arm_designs: Dict[Any, ArmDesign] = <factory>, max_supergeo_size: int | None = None, time_labels: ndarray | None = None, effects: Any | None = None, **extra_data: Any)#

Top-level container returned by mlsynth.PANGEO.fit().

A DesignResult (the experimental-design family): it chooses the treatment/control assignment before any intervention, and – once post-period outcomes exist – resolves to an EffectResult via report.

Inherited design fields#

report (EffectResult or None) – The realized effect report (program-level Augmented-DiD ATT mapped to the standard effect surface); None for a design-only fit.
assignment (dict) – Flat {unit_name: "treatment"|"control"} map across all arms.
selected_units (list) – Units assigned to treatment (the design’s chosen treated set).
power (PangeoPower or None) – Program- and arm-level MDE / power analysis.
metadata (dict) – Free-form design diagnostics (solver, q_sweep, solver_diagnostics, …).

PANGEO-specific fields#

arm_designs (dict) – {arm_label: ArmDesign}, the supergeo-pair design per arm.
max_supergeo_size (int) – The Q used (max size of either supergeo within a pair).
time_labels (np.ndarray) – Pre-period time labels the design was built on.
effects (PangeoEffects or None) – The rich realized-ATT object (per-pair, per-arm, program, plus the design-based randomization inference) when post-period data is given.

class Config#

arbitrary_types_allowed = True#

extra = 'allow'#

frozen = True#

arm_designs: Dict[Any, ArmDesign]#

effects: Any | None#

max_supergeo_size: int | None#

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'extra': 'allow', 'frozen': True, 'json_encoders': {<class 'numpy.ndarray'>: <function MlsynthResult.Config.<lambda>>}}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

time_labels: np.ndarray | None#

class mlsynth.utils.pangeo_helpers.structures.SupergeoPair(treatment: List[Any], control: List[Any], gap_variance: float, parallelism_r2: float, treatment_mean: ndarray, control_mean: ndarray, covariate_smd: Dict[str, float] = <factory>, gap_level: float = 0.0, holdout_resid: ndarray = <factory>)#

One supergeo pair within an arm.

treatment#

Unit names assigned to treatment in this pair.

Type:: list

control#

Unit names assigned to control in this pair.

Type:: list

gap_variance#

Pre-period level-removed gap variance between the two halves (lower = more parallel; the DiD pre-period residual SS).

Type:: float

parallelism_r2#

R^2 of the within-pair parallel-trends fit (1 = perfectly parallel).

Type:: float

treatment_mean#

Pre-period mean trajectory of the treatment half.

Type:: np.ndarray

control_mean#

Pre-period mean trajectory of the control half.

Type:: np.ndarray

covariate_smd#

{covariate_name: standardized mean difference} between the treatment and control halves (empty if no covariates were used).

Type:: dict

gap_level#

DiD counterfactual gap level \(\delta\) – the mean gap over the estimation window E (the periods the split was optimised on).

Type:: float

holdout_resid#

Gap residuals on the held-out blank window B (gap[B] - gap_level). B is excluded from the optimisation, so these residuals are an honest out-of-sample estimate of the parallel-trends noise – the reservoir for conformal inference and the variance behind the MDE.

Type:: np.ndarray
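The relationship between `gap_level`, the estimation window E and the held-out blank window B can be sketched directly from the two mean trajectories. `gap_diagnostics` is a hypothetical helper; reporting the estimation-window residual sum of squares without any normalization is an assumption, since the text does not state how `gap_variance` is scaled.

```python
import numpy as np


def gap_diagnostics(treatment_mean, control_mean, blank_idx):
    """Gap level on E and out-of-sample gap residuals on B (illustrative)."""
    gap = np.asarray(treatment_mean, dtype=float) - np.asarray(control_mean, dtype=float)

    is_blank = np.zeros(gap.size, dtype=bool)
    is_blank[np.asarray(blank_idx, dtype=int)] = True   # periods held out as B

    gap_level = float(gap[~is_blank].mean())            # mean gap over E
    resid_e = gap[~is_blank] - gap_level                # level-removed gap on E
    holdout_resid = gap[is_blank] - gap_level           # gap[B] - gap_level

    return {"gap_level": gap_level,
            "resid_ss": float(resid_e @ resid_e),       # unnormalized SS (assumption)
            "holdout_resid": holdout_resid}
```

A pair can look perfectly parallel on E and still show large residuals on B, which is why the held-out residuals, not the in-sample fit, feed the inference and the MDE.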

Example#

A seasonal, multi-arm sales panel (the bundled simulator), designed into parallel supergeo pairs. With display_graphs=True PANGEO plots, per arm, the observed treated supergeo aggregate against the augmented-DiD counterfactual prediction (mlsynth.utils.pangeo_helpers.effects.adid_counterfactual()): the in-sample pre-period fit when designing, and – once post_col data are supplied – the counterfactual extended past the treatment date, so the post-window gap is the estimated effect.

from mlsynth import PANGEO
from mlsynth.utils.pangeo_helpers import make_seasonal_sales_panel

# 3 arms (non-overlapping geos), 6 geos each, 156 weeks of history.
df = make_seasonal_sales_panel(units_per_arm=6, arms=("A", "B", "C"),
                               T=156, seed=0)

res = PANGEO({
    "df": df,
    "outcome": "sales",
    "arm": "arm",                # single categorical arm column
    "unitid": "unit",
    "time": "time",
    "max_supergeo_size": 3,      # Q
}).fit()

for arm, design in res.arm_designs.items():
    print(f"Arm {arm}: {len(design.pairs)} pair(s), "
          f"parallel-trends R^2 = {design.mean_parallelism_r2:.3f}")
    for p in design.pairs:
        print(f"   T={p.treatment}  C={p.control}  R^2={p.parallelism_r2:.3f}")

# res.assignment maps every geo -> 'treatment' / 'control'.

On the simulated data this returns designs with parallel-trends \(R^2\) around 0.90–0.98 — roughly 10–35x more parallel than a random treatment/control split of the same geos.

Simulation: Trajectory Matching vs. a Scalar Supergeo#

The supergeo design of Chen et al. (2023) matches (super)geos on a scalar summary of baseline response (its variance term is \(\sum_k (Z_{G_{k,+}} - Z_{G_{k,-}})^2\), a sum over scalar baseline differences). PANGEO carries this into the panel setting: it matches on the full pre-treatment trajectory (level-removed parallelism), so it can separate geos that look identical on a scalar yet move differently over time. The self-contained Monte Carlo below makes the gap concrete. It adapts the paper’s RMSE comparison (supergeo vs. matched pairs) to a panel where every geo has the same pre-period mean but a distinct trajectory shape, a setting in which scalar matching is blind by construction. Six geos keep the design step near-instantaneous, whether it runs the default heuristic or the exact MIP.

import numpy as np
import pandas as pd
from mlsynth import PANGEO

def make_panel(rng, T_pre=20, S=6, level=100.0, noise=0.6):
    """Six geos = three parallel pairs (up-trend, down-trend, cycle);
    each shape is demeaned over the pre-period so all pre-means match."""
    T = T_pre + S                             # T_pre pre-periods, then S post-periods
    t = np.arange(T)
    up = (t - t.mean()) / t.std()
    cyc = np.sin(2 * np.pi * t / 5.0)
    shapes = [5 * up, 5 * up, -5 * up, -5 * up, 5 * cyc, 5 * cyc]
    cols = []
    for s in shapes:
        s = s - s[:T_pre].mean()              # equal pre-means => scalar-blind
        cols.append(level + s + rng.normal(0, noise, T))
    return np.column_stack(cols), T_pre

def did(Y, T_pre, treated, control, tau):
    Yo = Y.copy()
    Yo[T_pre:, treated] += tau                # inject the true effect
    t_eff = Yo[T_pre:, treated].mean() - Yo[:T_pre, treated].mean()
    c_eff = Yo[T_pre:, control].mean() - Yo[:T_pre, control].mean()
    return t_eff - c_eff

def pangeo_design(Y, T_pre):
    df = pd.DataFrame(
        {"geo": f"g{g}", "t": int(t), "y": Y[t, g], "arm": "A"}
        for g in range(6) for t in range(T_pre)
    )
    a = PANGEO({"df": df, "outcome": "y", "unitid": "geo", "time": "t",
                "arm": "arm", "max_supergeo_size": 1,
                "compute_power": False, "display_graphs": False}).fit().assignment
    T = [int(g[1:]) for g, v in a.items() if v == "treatment"]
    C = [int(g[1:]) for g, v in a.items() if v == "control"]
    return T, C

def scalar_match(Y, T_pre, rng):              # Google-style: pair on scalar mean
    order = np.argsort(Y[:T_pre].mean(0))
    T, C = [], []
    for k in range(0, 6, 2):
        a, b = order[k], order[k + 1]
        if rng.random() < 0.5:
            T.append(a); C.append(b)
        else:
            T.append(b); C.append(a)
    return T, C

tau, R = 4.0, 60
rng = np.random.default_rng(1)
errs = {"PANGEO (trajectory)": [], "scalar matched-pairs": []}
for _ in range(R):
    Y, T_pre = make_panel(rng)
    Tp, Cp = pangeo_design(Y, T_pre)
    errs["PANGEO (trajectory)"].append(did(Y, T_pre, Tp, Cp, tau) - tau)
    Ts, Cs = scalar_match(Y, T_pre, rng)
    errs["scalar matched-pairs"].append(did(Y, T_pre, Ts, Cs, tau) - tau)

for name, e in errs.items():
    e = np.array(e)
    print(f"{name:22s} RMSE = {np.sqrt((e ** 2).mean()):.2f}")

Because all geos share a pre-period mean, the scalar match pairs them essentially at random, and the difference-in-differences estimate inherits the up/down/cycle shape mismatch (RMSE ≈ 6 against a true effect of 4). PANGEO reads the trajectory shape, recovers the three parallel pairs, and estimates the effect about 30× more precisely (RMSE ≈ 0.2). That is the supergeo idea carried into the panel world: match on how the series move, not on a single number.

References#

Chen, A., Doudchenko, N., Jiang, S., Stein, C., & Ying, B. (2023). “Supergeo Design: Generalized Matching for Geographic Experiments.” arXiv:2301.12044.

Shaw, C. (2025). “Optimized Supergeo Design: A Scalable Framework for Geographic Marketing Experiments.” arXiv:2506.20499.

Li, K. T. (2023). “Frontiers: A Simple Forward Difference-in-Differences Method.” Marketing Science 43(2):267-279.

Li, K. T., & Van den Bulte, C. (2022). “Augmented Difference-in-Differences.” Marketing Science 42(4):746-767.

Abadie, A., & Zhao, J. (2026). “Synthetic Controls for Experimental Design.” Working paper.

Abadie, A., Diamond, A., & Hainmueller, J. (2010). “Synthetic Control Methods for Comparative Case Studies.” Journal of the American Statistical Association 105(490):493-505.
