Parallel-Trends Supergeo Design (PANGEO)#
When to Use This Estimator#
PANGEO is a tool for designing a geo experiment — deciding, before you run it, which geographic markets to treat and which to hold out, so that the after-the-fact comparison is as clean as possible. Use it when:
you can assign treatment at the geo level (turn an ad campaign, price, or feature on in some markets and not others);
you have a panel of pre-period history (weekly/monthly sales, conversions, GMV…) for every candidate geo;
the number of geos is modest (tens, not thousands), as is typical with DMAs/regions; and
you want the eventual treatment-effect estimate to be precise — i.e. you want treated and control markets that already move together.
A worked geo-experiment. A brand wants to measure the incremental sales from a new TV/CTV campaign. Ads are bought at the DMA level, so the experimental units are ~50–210 large, heterogeneous markets — not exchangeable shoppers. The plan: run the campaign in a treatment set of DMAs, withhold it in a control set, and read the post-launch sales gap as the lift. The whole experiment lives or dies on the split: if the treated DMAs were already trending differently from the controls, the post-launch gap mixes the campaign effect with that pre-existing divergence. PANGEO reads the DMAs’ pre-period sales panel and chooses the treatment/control split (bundling DMAs into balanced supergeos) so the two sides’ sales trajectories run parallel beforehand — turning the post-period gap into a clean read on the campaign.
PANGEO is two stages:
Design (pre-period data only). A set-partitioning mixed-integer program groups each arm’s geos into composite supergeos, forms balanced pairs with no geo trimmed, and selects the partition that maximises pre-period parallelism. A power analysis reports the minimum detectable effect (MDE) implied by the chosen supergeo size \(Q\).
Evaluation (after the experiment). The same design is scored against the realised outcomes with the Augmented Difference-in-Differences estimator of Li & Van den Bulte (2022), giving the ATT, percent ATT, and CIs at the arm and program levels.
The two stages share one quantity: both the design objective and the standard error of the realised effect are governed by the variance of the supergeo gap residual. Minimising non-parallelism simultaneously minimises the MDE and tightens the CI — optimising parallelism is optimising inferential precision.
This is a principled deviation from Google’s Supergeo Design. Chen et al. (2023) and OSD (Shaw 2025) match supergeos on a scalar summary — the summed baseline response, or a few covariate totals — which collapses the time dimension. PANGEO matches on the full pre-period trajectory. That difference is not cosmetic: the downstream analysis is a difference-in-differences, which differences trajectories over time, so two markets with identical totals but different seasonal shapes are not interchangeable for it even though scalar matching scores them as a perfect match. In trending, seasonal data — which is essentially all geo-marketing data — matching on shape rather than on a single number is what makes the post-period comparison valid. (The simulation at the end of this page quantifies the gap: when geos share a baseline mean but differ in shape, PANGEO recovers the effect ~30× more precisely than a scalar match.)
When *not* to use it. If the assignment is already fixed (an observational study, or a campaign that already ran in specific markets), there is no design to choose — use an estimation-stage method (Two-Step Synthetic Control, Synthetic Business Cycle (SBC), Forward Difference-in-Differences (FDID)). If you have hundreds of geos the exact MIP may be intractable (see When PANGEO Fails or Stalls below) — the scalable OSD relaxation is the better fit there. And if the outcome is plausibly stationary with no trend or seasonality, scalar matching is already adequate and simpler.
What is a supergeo?#
Geo experiments differ from ordinary A/B tests in one decisive way: the experimental units are a small number of large, heterogeneous aggregates — markets, regions, DMAs — rather than many exchangeable individuals. Randomising treatment across a handful of dissimilar markets routinely produces treatment and control groups with very different baseline characteristics, and the resulting post-randomisation bias does not average away over the single assignment a practitioner actually runs (Abadie & Zhao 2026). Classic matched-pair designs help, but with heterogeneous geos there may be no good one-to-one match for a given market.
A supergeo resolves this by relaxing the unit of matching. Rather than insisting that single geos match, geos are pooled into composite aggregates: a supergeo is simply a bundle of geos treated as one unit, with outcome equal to their (population-weighted) mean. Composite units can be made comparable even when their constituents are not — a small, noisy market combined with a complementary one can, in aggregate, track another composite closely. The design then pairs supergeos, randomises treatment within each pair, and — unlike trimming-based approaches — assigns every geo to some supergeo, so the experiment spans the entire market with nothing discarded (Chen, Doudchenko, Jiang, Stein & Ying 2023).
PANGEO keeps this structure but changes what the supergeos are matched on. Supergeo Design and OSD match on a scalar summary (the summed response, or a few covariate totals), which collapses the time dimension; PANGEO matches on the full pre-treatment trajectory, choosing pairs whose aggregate paths run as parallel as possible. The reason is that the downstream difference-in-differences analysis differences trajectories: two markets with identical totals but different seasonal shapes are not interchangeable for it, even though scalar matching treats them as equivalent.
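The contrast between scalar and trajectory matching can be made concrete with a toy example. The sketch below uses three made-up markets (the names, amplitudes, and the helper `level_removed_gap_variance` are illustrative assumptions, not part of mlsynth): two share a total but have opposite seasonal phase, and a third is a pure level shift of the first.

```python
import math

def level_removed_gap_variance(a, b):
    """Mean squared deviation of the gap a_t - b_t from its own average."""
    gap = [x - y for x, y in zip(a, b)]
    mean_gap = sum(gap) / len(gap)
    return sum((g - mean_gap) ** 2 for g in gap) / len(gap)

T = 52
# Two hypothetical markets with the same annual total but opposite seasonal phase.
market_a = [100 + 20 * math.sin(2 * math.pi * t / T) for t in range(T)]
market_b = [100 - 20 * math.sin(2 * math.pi * t / T) for t in range(T)]
# A third market that is market_a shifted up by a constant 30 units.
market_c = [y + 30 for y in market_a]

# Scalar matching compares totals; trajectory matching compares shapes.
scalar_gap_ab = abs(sum(market_a) - sum(market_b))
scalar_gap_ac = abs(sum(market_a) - sum(market_c))
shape_gap_ab = level_removed_gap_variance(market_a, market_b)
shape_gap_ac = level_removed_gap_variance(market_a, market_c)

print(f"A vs B: total gap {scalar_gap_ab:.2f}, shape gap {shape_gap_ab:.2f}")
print(f"A vs C: total gap {scalar_gap_ac:.2f}, shape gap {shape_gap_ac:.2f}")
```

A scalar match prefers B as a partner for A (identical totals), while a trajectory match prefers C (identical shape, different level), which is the partner a difference-in-differences comparison can actually use.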
Assumptions, and When PANGEO Fails or Stalls#
PANGEO can make a good experiment likely, but it cannot manufacture one that the data do not support. Three assumptions underpin it, and each points at a way the method can fail or stall.
1. Parallel trends (the crux). The design maximises parallelism in the pre-period; the validity of the Stage-2 effect estimate rests on that parallelism persisting into the post-period absent treatment — i.e. the treatment and control supergeos would have continued to move together had the campaign never launched. This is exactly the difference-in-differences parallel-trends assumption, and it is the assumption PANGEO is organised around. The crucial honesty: PANGEO optimises pre-period parallelism (making the assumption as plausible as the data allow) but it cannot guarantee the assumption holds out-of-sample. If a shock hits the treated markets, a competitor reacts only there, or the pre-period co-movement was coincidental, the post-period gap diverges on its own and the ATT is biased — the same Achilles’ heel as any DiD. Diagnostics that flag the risk: the achieved parallelism \(R^2\) (low values mean no balanced design exists — see below), the reported MDE, and a placebo / blank-window check on the held-out pre-period.
2. A linear factor structure for the no-treatment outcomes (Eq. (1)). The gap decomposition that makes “match on trajectory” equivalent to “balance the factor loadings” relies on this model. It is the standard synthetic-control / interactive-fixed-effects assumption and is mild for sales-like panels, but a wildly non-factor outcome (e.g. one driven by an idiosyncratic, unit-specific regime change) is not balanceable by any partition.
3. A modest, designable geo pool. Each arm needs enough geos to form at least one supergeo pair, and the geos must be heterogeneous-but-matchable.
The concrete failure / stall modes:
Parallel trends breaks post-launch — the dominant risk, above. No design fixes it; only the diagnostics warn of it.
No matchable structure. If the geos are so heterogeneous that no partition achieves high parallelism, PANGEO still returns the best feasible design, but the parallelism \(R^2\) stays low and the MDE blows up — a signal that a geo experiment here is underpowered and the read will be noisy regardless of split.
The MIP stalls. Set partitioning is NP-hard. With many geos and a large supergeo size \(Q\), the exact mixed-integer program can be slow or intractable. Mitigations: cap \(Q\) (smaller supergeos), use the automatic \(Q\) selection, raise min_pairs, or, for hundreds of geos, fall back to the scalable OSD relaxation. The examples on this page use a handful of geos, so the solve is instant.
Too few geos / arms. A pool that cannot form a balanced pair (e.g. two wildly different markets) has no good design to find.
In short: PANGEO improves the plausibility of parallel trends by construction and quantifies the residual risk (parallelism \(R^2\), MDE), but it inherits DiD’s identifying assumption rather than removing it. Treat a low parallelism \(R^2\) or a large MDE as the design telling you the experiment is fragile.
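To get a feel for when the exact program becomes heavy, one can count the candidate pairs it has to price. The sketch below assumes the candidate family is every subset of an arm's geos of size 2 to 2Q (as described in the set-partitioning section of this page); the function name `candidate_family_size` is a hypothetical helper, not an mlsynth function.

```python
from math import comb

def candidate_family_size(n_geos, q):
    """Number of geo subsets of size 2..2Q; each one is a candidate pair (one MIP column)."""
    return sum(comb(n_geos, k) for k in range(2, min(2 * q, n_geos) + 1))

# Column count of the set-partitioning program for a few pool sizes and Q values.
for n_geos in (6, 12, 24, 50):
    sizes = [candidate_family_size(n_geos, q) for q in (1, 2, 3)]
    print(f"N={n_geos:>2}: Q=1,2,3 -> {sizes}")
```

The count grows combinatorially in both the pool size and \(Q\), which is why capping \(Q\) is the first mitigation to try.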
Setup and notation#
Let \(Y_{it}\) denote the outcome of geo \(i\in\{1,\dots,N\}\) in period \(t\in\{1,\dots,T\}\). The first \(T_0\) periods are the pre-treatment (design) window and the remaining \(T_{\mathrm{post}} = T - T_0\) periods are the experimental window. A single categorical column assigns each geo to an arm; arms occupy disjoint geo pools \(\mathcal N_a\) and are designed independently, so the exposition below fixes one arm and drops the arm subscript.
Throughout we maintain the linear factor model used by both the synthetic-control and DiD literatures (Abadie, Diamond & Hainmueller 2010; Li & Van den Bulte 2022) for the no-treatment potential outcome,
\[Y_{it}(0) = \delta_t + \theta_t^{\top} Z_i + \lambda_t^{\top}\mu_i + \varepsilon_{it}, \tag{1}\]
where \(\delta_t\) is a common time effect, \(Z_i\) are observed covariates with time-varying loadings \(\theta_t\), \(\mu_i\) are unobserved factor loadings with factors \(\lambda_t\), and \(\varepsilon_{it}\) is mean-zero idiosyncratic noise.
A supergeo is a set \(S\) of same-arm geos with aggregate trajectory
\[\bar Y_{S,t} = \frac{\sum_{i\in S}\omega_i Y_{it}}{\sum_{i\in S}\omega_i},\]
where \(\omega_i>0\) are aggregation weights (the weight_col population, or \(\omega_i\equiv 1\)). A pair \(p=(A_p,B_p)\) consists of two disjoint supergeos with \(|A_p|,|B_p|\le Q\); \(A_p\) is the treatment half and \(B_p\) the control half. Its gap is
\[g_{p,t} = \bar Y_{A_p,t} - \bar Y_{B_p,t}.\]
Under (1) the common time effect cancels and
\[g_{p,t} = \theta_t^{\top}(\bar Z_{A_p}-\bar Z_{B_p}) + \lambda_t^{\top}(\bar\mu_{A_p}-\bar\mu_{B_p}) + (\bar\varepsilon_{A_p,t}-\bar\varepsilon_{B_p,t}), \tag{3}\]
with \(\bar\mu_S, \bar Z_S\) (and \(\bar\varepsilon_{S,t}\)) the weighted means over \(S\). The pair exhibits parallel trends precisely when the loadings are balanced, \(\bar\mu_{A_p}=\bar\mu_{B_p}\) (and \(\bar Z_{A_p}=\bar Z_{B_p}\)); the gap is then constant in expectation and a difference-in-differences comparison within the pair is unbiased.
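The claim that balanced loadings give a constant gap can be checked numerically. The following is a minimal sketch under simplifying assumptions: a noise-free outcome with one latent factor, a geo-specific level standing in for the time-constant part of the model, and made-up levels, loadings, and weights. None of the function names come from mlsynth.

```python
import random

random.seed(0)
T = 30
common = [random.gauss(0.0, 1.0) for _ in range(T)]   # delta_t, shared by every geo
factor = [random.gauss(0.0, 1.0) for _ in range(T)]   # lambda_t, one latent factor

def simulate_geo(level, loading):
    """Noise-free outcome: level + delta_t + loading * lambda_t."""
    return [level + common[t] + loading * factor[t] for t in range(T)]

def supergeo_path(paths, weights):
    """Weighted mean trajectory of a bundle of geos."""
    total = sum(weights)
    return [sum(w * p[t] for w, p in zip(weights, paths)) / total for t in range(T)]

def gap_spread(treat, control):
    """Range of the gap over time; zero means a perfectly constant gap."""
    gap = [a - b for a, b in zip(treat, control)]
    return max(gap) - min(gap)

# Treatment half: weighted mean loading (2*1.5 + 1*3.0) / 3 = 2.0.
half_a = supergeo_path([simulate_geo(10, 1.5), simulate_geo(20, 3.0)], [2.0, 1.0])
# Control half with the same mean loading (2.0) but a different level.
balanced_b = supergeo_path([simulate_geo(5, 2.0), simulate_geo(7, 2.0)], [1.0, 1.0])
# Control half with a mismatched mean loading (0.5).
unbalanced_b = supergeo_path([simulate_geo(5, 0.5), simulate_geo(7, 0.5)], [1.0, 1.0])

spread_balanced = gap_spread(half_a, balanced_b)
spread_unbalanced = gap_spread(half_a, unbalanced_b)
print(f"balanced loadings:   gap range = {spread_balanced:.2e}")
print(f"unbalanced loadings: gap range = {spread_unbalanced:.2f}")
```

The individual geos in the treatment half match nothing in the control pool, yet their weighted composite runs exactly parallel to the balanced control half.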
Stage 1 — the supergeo design#
The parallelism objective#
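The pre-treatment window is split into an estimation window \(\mathcal E\) (the first \(\lfloor \kappa T_0\rfloor\) periods, \(\kappa=\) frac_E, default \(0.7\)) and a held-out blank window \(\mathcal B=\{1,\dots,T_0\}\setminus\mathcal E\). A pair is scored by the variance of its level-removed gap over the estimation window,
\[V_p = \frac{1}{|\mathcal E|}\sum_{t\in\mathcal E}\big(g_{p,t}-\bar g_p\big)^2, \qquad \bar g_p = \frac{1}{|\mathcal E|}\sum_{t\in\mathcal E} g_{p,t}, \tag{4}\]
which is the pre-period residual sum of squares of a difference-in-differences fit, scaled by \(1/|\mathcal E|\) (cf. mlsynth.utils.selector_helpers._did_from_mean()). Taking expectations under (3) with balanced covariates,
\[\mathbb E[V_p] \approx (\bar\mu_{A_p}-\bar\mu_{B_p})^{\top}\,\hat\Sigma_\lambda\,(\bar\mu_{A_p}-\bar\mu_{B_p}) + \sigma^2_{\varepsilon,p},\]
where \(\hat\Sigma_\lambda\) is the estimation-window covariance matrix of the factors \(\lambda_t\) and \(\sigma^2_{\varepsilon,p}\) is the variance of the aggregated idiosyncratic noise in the gap.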
The first term is a positive-definite quadratic form in the loading imbalance, so minimising (4) drives \(\bar\mu_{A_p}\to\bar\mu_{B_p}\) — it balances the unobserved factor loadings, which is what parallel-trends DiD requires. The time-constant component of the loading difference is absorbed by the level shift \(\bar g_p\) and never penalised: two supergeos may differ arbitrarily in level yet match perfectly in shape. Scalar sum-matching, by contrast, collapses the time dimension and is blind to shape.
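As a rough illustration of the window split and the level-blind score, the sketch below assumes the score is the mean squared level-removed gap over the estimation window. The function names and the toy series are hypothetical and are not the mlsynth implementation.

```python
from math import floor, sin, pi

def split_pre_period(T0, frac_E=0.7):
    """Estimation window = first floor(frac_E * T0) periods; the rest is the blank window."""
    n_est = floor(frac_E * T0)
    return range(n_est), range(n_est, T0)

def parallelism_score(treat, control, periods):
    """Variance of the gap after removing its mean level over the given periods."""
    gap = [treat[t] - control[t] for t in periods]
    level = sum(gap) / len(gap)
    return sum((g - level) ** 2 for g in gap) / len(gap)

T0 = 26
estimation, blank = split_pre_period(T0)
control = [50 + 5 * sin(2 * pi * t / 13) for t in range(T0)]
parallel_treat = [y + 40 for y in control]                      # same shape, higher level
drifting_treat = [y + 0.5 * t for t, y in enumerate(control)]   # same level, extra trend

score_parallel = parallelism_score(parallel_treat, control, estimation)
score_drifting = parallelism_score(drifting_treat, control, estimation)
print(len(estimation), len(blank), score_parallel, score_drifting)
```

A 40-unit level difference costs nothing, while a mild relative drift is penalised, which is the behaviour the section describes.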
The set-partitioning program#
Let \(\mathcal F\) be the family of admissible pairs: every subset of the arm’s geos of size \(2,\dots,2Q\) that can be split into two halves each of size \(\le Q\), each subset scored at its best such split by (4). Let \(M\in\{0,1\}^{N\times|\mathcal F|}\) be the geo-by-pair incidence matrix (\(M_{iG}=1\) iff geo \(i\in G\)) and \(c_G\) the score of pair \(G\). The design solves the set-partitioning program
\[\min_{x\in\{0,1\}^{|\mathcal F|}}\ \sum_{G\in\mathcal F} c_G\,x_G \quad\text{subject to}\quad Mx=\mathbf 1, \tag{5}\]
solved with cvxpy and the HiGHS mixed-integer backend. The exact-cover constraint \(Mx=\mathbf 1\) assigns every geo to exactly one chosen pair (no geo is trimmed). Because each \(c_G\) is precomputed offline, the objective is linear in \(x\): the program is a mixed-integer linear program regardless of the (possibly nonlinear) per-pair cost, which is what keeps it tractable. Within each chosen pair the treatment and control halves are the score-minimising split; which half is actually treated is randomised in the field.
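For a handful of geos the same exact-cover problem can be solved by brute force, which makes the structure easy to inspect. The sketch below is a stand-in for the cvxpy/HiGHS program, not a reproduction of it: it enumerates candidate subsets, prices each at its best split with an unweighted residual sum of squares, and searches all exact covers recursively. All function names and the six toy geos are assumptions for illustration.

```python
from itertools import combinations
from math import sin, cos, pi

def mean_path(Y, members):
    """Unweighted mean trajectory of the geos in `members`."""
    T = len(Y[members[0]])
    return [sum(Y[i][t] for i in members) / len(members) for t in range(T)]

def gap_ss(Y, side_a, side_b):
    """Residual sum of squares of the level-removed gap between two halves."""
    gap = [a - b for a, b in zip(mean_path(Y, side_a), mean_path(Y, side_b))]
    level = sum(gap) / len(gap)
    return sum((g - level) ** 2 for g in gap)

def best_split(Y, members, q):
    """Cheapest split of `members` into two non-empty halves of size <= q."""
    best = (float("inf"), None, None)
    for k in range(1, len(members)):
        if k > q or len(members) - k > q:
            continue
        for side_a in combinations(members, k):
            side_b = tuple(i for i in members if i not in side_a)
            cost = gap_ss(Y, side_a, side_b)
            if cost < best[0]:
                best = (cost, side_a, side_b)
    return best

def exact_cover_design(Y, q):
    """Minimum-cost set of candidate pairs covering every geo exactly once."""
    n = len(Y)
    candidates = {m: best_split(Y, m, q)
                  for size in range(2, 2 * q + 1)
                  for m in combinations(range(n), size)}

    def search(uncovered):
        if not uncovered:
            return 0.0, []
        first = min(uncovered)  # branch on the lowest uncovered geo
        best_cost, best_pairs = float("inf"), None
        for members, (cost, a, b) in candidates.items():
            if first in members and set(members) <= uncovered:
                rest_cost, rest_pairs = search(uncovered - set(members))
                if cost + rest_cost < best_cost:
                    best_cost, best_pairs = cost + rest_cost, [(a, b)] + rest_pairs
        return best_cost, best_pairs

    return search(frozenset(range(n)))

T = 24
season = [sin(2 * pi * t / 12) for t in range(T)]
trend = [0.1 * t for t in range(T)]
cosine = [cos(2 * pi * t / 12) for t in range(T)]
curve = [0.01 * (t - 12) ** 2 for t in range(T)]
Y = [season, trend, [0.5 * (s + r) + 3.0 for s, r in zip(season, trend)],
     cosine, curve, [0.5 * (c + v) - 2.0 for c, v in zip(cosine, curve)]]

total_cost, pairs = exact_cover_design(Y, q=2)
print(total_cost, pairs)
```

No single geo here matches any other, but geo 2 is parallel to the composite of geos 0 and 1, and geo 5 to the composite of geos 3 and 4, so the search returns two two-versus-one supergeo pairs with essentially zero cost.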
Per-pair objectives#
The objective argument selects the per-pair cost \(c_G\); all three choices leave (5) a linear program. Writing \(g_t\) for the gap of a candidate split and \(\bar g\) for its estimation-window mean,
"ss_res" (default): the absolute residual sum of squares \(\sum_t (g_t-\bar g)^2\). Scale-dependent, so high-amplitude pairs weigh more and the design prioritises making large markets parallel.
"r2": the scale-free criterion \(1-R^2 = \sum_t(g_t-\bar g)^2 / \sum_t(\bar Y_{A,t}-\overline{\bar Y_A})^2\), so every pair counts equally (FDID’s \(R^2\) criterion, optimised exactly by the program rather than greedily).
"weighted": a recency-weighted residual SS \(\sum_t w_t (g_t-\bar g_w)^2\), the level removed at the weighted mean \(\bar g_w\), with weights \(w_t=\rho_{\mathrm{dec}}^{\,T_0-1-t}\) (recency_decay), up-weighting the recent pre-period closest to the experiment.
The per-pair gap_variance and parallelism_r2 reported on the result are always the unweighted quantities of (4), so designs from different objectives are comparable on a common yardstick.
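The three costs are easy to compare side by side. The following sketch implements them as plain functions under the formulas quoted above, with periods indexed from zero so the last period gets weight one. The function names and toy gaps are illustrative assumptions rather than the library's internals.

```python
def ss_res(gap):
    """Absolute residual sum of squares of the level-removed gap."""
    mean_gap = sum(gap) / len(gap)
    return sum((g - mean_gap) ** 2 for g in gap)

def one_minus_r2(gap, treat_path):
    """Scale-free cost: gap residual SS relative to the treated path's own variation."""
    mean_treat = sum(treat_path) / len(treat_path)
    return ss_res(gap) / sum((y - mean_treat) ** 2 for y in treat_path)

def weighted_ss(gap, decay=0.97):
    """Recency-weighted residual SS with weights decay ** (n - 1 - t)."""
    n = len(gap)
    w = [decay ** (n - 1 - t) for t in range(n)]
    mean_w = sum(wi * g for wi, g in zip(w, gap)) / sum(w)
    return sum(wi * (g - mean_w) ** 2 for wi, g in zip(w, gap))

# The same one-off deviation, placed early versus late in the pre-period.
early_gap = [1.0] + [0.0] * 9
late_gap = [0.0] * 9 + [1.0]
treat = [float(t % 4) for t in range(10)]

print(ss_res(early_gap), ss_res(late_gap))                      # identical
print(weighted_ss(early_gap, 0.9), weighted_ss(late_gap, 0.9))  # late costs more
# Rescaling the market by 10 changes ss_res by 100x but leaves 1 - R^2 unchanged.
print(one_minus_r2(late_gap, treat),
      one_minus_r2([10 * g for g in late_gap], [10 * y for y in treat]))
```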
Supergeo size \(Q\) and automatic selection#
Setting max_supergeo_size \(=Q=1\) recovers the classic
matched-pairs design; \(Q>1\) permits composite supergeos when no single
geo matches another well, without trimming. \(Q\) is a granularity knob
with an interior optimum: too small and no parallel matches exist
(singleton geos are too noisy); too large and the arm yields few, coarse
pairs. The program-level MDE is not monotone in \(Q\) and is not
tracked by the parallelism \(R^2\) (which is scale-free and rises with
\(Q\)); only the absolute residual variance that drives power matters.
Consequently, if max_supergeo_size is left unset, PANGEO selects
\(Q\) automatically: it solves (5) for every feasible
\(Q\in\{1,\dots,\min(\lceil N/2\rceil, 6)\}\) and returns the design
with the smallest mean program MDE. The full sweep — each \(Q\)’s
program-pair count, mean program MDE, and the \(2/2^{P}\)
randomisation-inference p-value floor for \(P\) pairs — is recorded in
results.metadata["q_sweep"] and the choice in
results.metadata["q_selected"], so the decision is auditable and may be
overridden with an explicit \(Q\).
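Two pieces of this sweep are simple enough to sketch directly: the grid of \(Q\) values that is tried and the randomisation-inference p-value floor for a given pair count. The helper names below are hypothetical; the formulas are the ones stated in the text.

```python
from math import ceil

def feasible_q_grid(n_geos, cap=6):
    """Supergeo sizes swept when Q is unset: 1 .. min(ceil(N / 2), cap)."""
    return list(range(1, min(ceil(n_geos / 2), cap) + 1))

def ri_pvalue_floor(n_pairs):
    """Smallest two-sided p-value a within-pair randomisation test can reach."""
    return 2 / 2 ** n_pairs

print(feasible_q_grid(6))    # a six-geo arm
print(feasible_q_grid(50))   # a large arm hits the cap
for n_pairs in range(1, 8):
    print(n_pairs, ri_pvalue_floor(n_pairs))

# Smallest pair count whose floor is below 0.05.
min_pairs_for_05 = next(p for p in range(1, 50) if ri_pvalue_floor(p) < 0.05)
```

Larger \(Q\) means fewer pairs per arm, so the p-value floor rises as supergeos get coarser; the sweep records both quantities so the trade-off is visible.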
Balancing baseline covariates#
Parallelism is level-blind: by (4) the level shift \(\bar g_p\) absorbs any time-constant gap, so a baseline characteristic (population, income) that merely shifts a market’s level is differenced out and never enters the trajectory score. This is correct for parallel-trends DiD but says nothing about balance on such characteristics — the role of OSD’s scalar covariate matching. PANGEO restores it with a standardised mean-difference penalty appended to (4),
\[\sum_{m}\omega^{\mathrm{cov}}_m\left(\frac{\bar Z_{A_p,m}-\bar Z_{B_p,m}}{s_m}\right)^2, \tag{6}\]
the weighted squared standardised mean difference (SMD) between the halves’ covariate means, where \(s_m\) is the cross-geo standard deviation (standardize_covariates, default True) and \(\omega^{\mathrm{cov}}_m\) a per-covariate weight (covariate_weights, default \(1\)). Because (6) is precomputed it preserves linearity in (5). Larger
weights buy tighter covariate balance at the cost of some parallelism; the
achieved per-pair SMDs are reported in SupergeoPair.covariate_smd. Pass
covariates=[...] (baseline columns, each reduced to its per-geo mean) to
enable; with no covariates the design is unchanged. This is also the
Abadie & Zhao (2026, Thm. 1) prescription — moving structure from the
unobserved \(\mu_i\) into the observed \(Z_i\) lowers the
estimator’s bias — and the Stage-2 device for restoring inferential
validity (below).
from mlsynth import PANGEO
# make_seasonal_sales_panel is the bundled simulator; its import path is not shown on this page.

df = make_seasonal_sales_panel(units_per_arm=6, arms=("A", "B", "C"),
                               T=104, seed=0, covariates=True)
res = PANGEO({
    "df": df, "outcome": "sales", "arm": "arm",
    "unitid": "unit", "time": "time", "max_supergeo_size": 3,
    "covariates": ["population", "income"],
    "covariate_weights": {"population": 5.0, "income": 5.0},
}).fit()
# Print each pair's halves, achieved parallelism, and covariate balance.
for arm, design in res.arm_designs.items():
    for p in design.pairs:
        print(p.treatment, p.control, p.parallelism_r2, p.covariate_smd)
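To see what the penalty rewards, here is a minimal stand-alone sketch of a weighted squared SMD between two halves. It assumes unweighted half means and the sample standard deviation across geos as the scale (the page does not say which standard-deviation convention the library uses); `covariate_penalty` and the toy covariate table are hypothetical.

```python
from statistics import mean, stdev

def covariate_penalty(cov, side_a, side_b, weights=None, standardize=True):
    """Sum over covariates of weight * ((mean_A - mean_B) / scale) ** 2."""
    n_cov = len(cov[0])
    weights = weights or [1.0] * n_cov
    penalty = 0.0
    for m in range(n_cov):
        column = [row[m] for row in cov]
        scale = stdev(column) if standardize else 1.0   # assumed: sample std across geos
        diff = mean(cov[i][m] for i in side_a) - mean(cov[i][m] for i in side_b)
        penalty += weights[m] * (diff / scale) ** 2
    return penalty

# Four geos, two baseline covariates (say population and income, in arbitrary units).
cov = [[1.0, 10.0], [3.0, 10.0], [2.0, 30.0], [2.0, 30.0]]

split_1 = covariate_penalty(cov, side_a=[0, 1], side_b=[2, 3])   # balanced on cov 0 only
split_2 = covariate_penalty(cov, side_a=[0, 2], side_b=[1, 3])   # balanced on cov 1 only
split_2_heavy = covariate_penalty(cov, [0, 2], [1, 3], weights=[5.0, 1.0])
print(split_1, split_2, split_2_heavy)
```

Because the result is a single precomputed number per candidate split, it can be added to the trajectory cost without changing the form of the outer program.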
Power and the minimum detectable effect#
Because power and the design objective are governed by the same supergeo gap residual, mlsynth.PANGEO.fit() returns a power analysis (results.power). For pair \(p\) the per-period noise variance \(\hat\sigma_p^2\) is estimated honestly on the held-out blank window \(\mathcal B\) (out of sample with respect to the optimisation) from the residual of the same counterfactual model used at evaluation ((7)), fit on the estimation window \(\mathcal E\) and evaluated on \(\mathcal B\).
Using the evaluation model here (the augmented-DiD residual by default, or the plain level-removed gap when att_augment=False) rather than a fixed recipe keeps the projected MDE and the realised standard error ((8)) coherent. The \(X\)-period effect for the pair then has variance \(\hat\sigma_p^2\,[f(X,\rho)+f(T_0,\rho)]\), where
\[f(n,\rho) = \frac{1}{n}\left[1 + 2\sum_{k=1}^{n-1}\left(1-\frac{k}{n}\right)\rho^{k}\right]\]
is the variance-inflation factor of the mean of \(n\) AR(1)-correlated periods and \(\rho\) is the pooled lag-1 autocorrelation of the blank residuals. Serial correlation is decisive: weekly sales are highly autocorrelated, so \(X\) post weeks are worth far fewer than \(X\) independent observations and adding post periods yields sharply diminishing returns — the trap a naive i.i.d. power calculation falls into.
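The program-level effect is the treated-size-weighted average of the pair effects, with weights \(w_p = (\sum_{i\in A_p}\omega_i)/\sum_{q}\sum_{i\in A_q}\omega_i\), and (treating pairs as independent across the program)
\[\operatorname{Var}\big(\hat\Delta^{\mathrm{prog}}_X\big) = \sum_p w_p^2\,\hat\sigma_p^2\,[f(X,\rho)+f(T_0,\rho)], \qquad \mathrm{MDE}_X = (z_{1-\alpha/2}+z_{1-\beta})\sqrt{\operatorname{Var}\big(\hat\Delta^{\mathrm{prog}}_X\big)}.\]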
The program level is the headline: small arms are individually
under-powered (with \(P\) pairs a pure within-pair randomisation test
has a hard p-value floor of \(2/2^{P}\), so one needs \(P\ge 6\) to
reach \(p<0.05\)), whereas pooling across arms gives the program an
effective sample size equal to the total pair count and routinely detects
effects several points smaller than any one arm. Per-arm curves are stored
in results.power.arms. The MDE is reported in outcome units and as a
percent of the treated baseline, by default at \(1-\beta=0.80\) power
for horizons \(X=2,\dots,12\); power_target, power_alpha and
power_post_periods configure this and compute_power=False skips it.
res = PANGEO({
    "df": df, "outcome": "sales", "arm": "arm",
    "unitid": "unit", "time": "time", "max_supergeo_size": 3,
}).fit()
pw = res.power
print(f"serial correlation rho = {pw.serial_correlation:.2f}")
print(pw.summary())                                         # MDE % by horizon: program + arms
print(pw.program.mde_pct_by_horizon()[8])                   # detectable % lift after 8 weeks
print(pw.power_for_effect(effect_pct=5.0, post_periods=8))  # invert: power
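The diminishing returns from extra post periods can be reproduced with a few lines of arithmetic. This sketch assumes that \(f(n,\rho)\) is the textbook variance factor for the mean of \(n\) AR(1) observations and that the MDE uses the usual two-sided normal multiplier \(z_{1-\alpha/2}+z_{1-\beta}\); both are standard choices but are assumptions about what the library computes, and the function names are hypothetical.

```python
from statistics import NormalDist

def mean_variance_factor(n, rho):
    """Var(mean of n AR(1) periods) / Var(one period); equals 1/n when rho = 0."""
    return (1 + 2 * sum((1 - k / n) * rho ** k for k in range(1, n))) / n

def pair_mde(sigma2, post_periods, pre_periods, rho, alpha=0.05, power=0.80):
    """Minimum detectable effect for one pair, in outcome units."""
    variance = sigma2 * (mean_variance_factor(post_periods, rho)
                         + mean_variance_factor(pre_periods, rho))
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z * variance ** 0.5

# Same noise level, with and without serial correlation, at several horizons.
for rho in (0.0, 0.8):
    mdes = [pair_mde(sigma2=1.0, post_periods=x, pre_periods=72, rho=rho) for x in (4, 8, 12)]
    print(f"rho={rho}: MDE at X=4, 8, 12 -> {[round(m, 3) for m in mdes]}")
```

With strong autocorrelation the MDE is both larger and much slower to shrink as post periods are added, which is the trap the i.i.d. calculation hides.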
Stage 2 — evaluation by Augmented DiD#
The estimator#
Once the experiment has run, pass a post_col (a \(0/1\) indicator of
post-treatment periods, as in LEXSCM). The design is rebuilt on the pre rows
alone — so it is identical to the design-only result — and
results.effects carries the realised ATT at the arm and program
levels using the Augmented Difference-in-Differences estimator of Li &
Van den Bulte (2022).
Fix a level (an arm, or the program) and write \(y^{T}_t\) for its treated supergeo aggregate and \(y^{C}_t\) for its control supergeo aggregate, both treated-size-weighted across the level’s pairs. The counterfactual is the pre-period least-squares projection
\[y^{T}_t = \delta_1 + \delta_2\,y^{C}_t + \gamma t + e_t, \qquad t\le T_0. \tag{7}\]
This augments plain DiD in two ways: the control scale \(\delta_2\) is estimated rather than fixed at \(1\), and a linear time trend \(\gamma t\) is included (att_augment and att_trend, both default True). With regressor \(x_t=(1,\,y^{C}_t,\,t)^{\top}\) and OLS estimate \(\hat\delta\), the per-period effect and the ATT are
\[\hat\Delta_t = y^{T}_t - x_t^{\top}\hat\delta \quad (t>T_0), \qquad \hat\Delta = \frac{1}{T_{\mathrm{post}}}\sum_{t>T_0}\hat\Delta_t.\]
The percent ATT is taken relative to the post-period counterfactual (cf. mlsynth.utils.resultutils.effects.calculate()), not the pre-treatment baseline:
\[\%\mathrm{ATT} = 100\times\frac{\hat\Delta}{T_{\mathrm{post}}^{-1}\sum_{t>T_0}x_t^{\top}\hat\delta}.\]
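The point estimate can be sketched in plain Python on a noise-free toy series. The code below regresses the treated aggregate on an intercept, the control aggregate, and a time index over the pre-period, then averages the post-period prediction errors. It solves the normal equations with a small hand-written elimination routine to stay dependency-free; the function names and simulated series are illustrative assumptions, not the mlsynth implementation.

```python
from math import sin, pi

def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def augmented_did(y_treat, y_control, T0):
    """Pre-period OLS of treated on (1, control, t); ATT = mean post prediction error."""
    X = [[1.0, y_control[t], float(t)] for t in range(len(y_treat))]
    X_pre, y_pre = X[:T0], y_treat[:T0]
    k = 3
    XtX = [[sum(r[i] * r[j] for r in X_pre) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * y for r, y in zip(X_pre, y_pre)) for i in range(k)]
    coef = solve_linear(XtX, Xty)
    counterfactual = [sum(c * x for c, x in zip(coef, row)) for row in X[T0:]]
    effects = [y - cf for y, cf in zip(y_treat[T0:], counterfactual)]
    att = sum(effects) / len(effects)
    att_pct = 100 * att / (sum(counterfactual) / len(counterfactual))
    return coef, att, att_pct

T, T0, true_effect = 40, 32, 4.0
control = [50 + 10 * sin(2 * pi * t / 13) for t in range(T)]
# Treated aggregate: scaled control plus a trend, with a lift added after T0.
treated = [5 + 1.5 * c + 0.2 * t + (true_effect if t >= T0 else 0.0)
           for t, c in enumerate(control)]

coef, att, att_pct = augmented_did(treated, control, T0)
print(coef, att, att_pct)
```

Plain DiD would fix the control coefficient at one and leave the 1.5 scale mismatch and the trend in the residual; here both are absorbed, so the lift is recovered up to numerical precision.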
Inference#
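Li & Van den Bulte (2022, Prop. 3.1–3.3) show \(\sqrt{T_{\mathrm{post}}}\,(\hat\Delta-\Delta)\xrightarrow{d} N(0,\Sigma_1+\Sigma_2)\), where \(\Sigma_1\) is the variance from estimating \(\delta\) and \(\Sigma_2\) from averaging the post-period errors. Their Web Appendix C.13 gives the prediction-variance estimator
\[\widehat{\operatorname{Var}}(\hat\Delta) = \hat\omega^2\left[\bar x_{\mathrm{post}}^{\top}\Big(\sum_{t\le T_0}x_t x_t^{\top}\Big)^{-1}\bar x_{\mathrm{post}} + \frac{1}{T_{\mathrm{post}}}\right] \tag{8}\]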
with \(\bar x_{\mathrm{post}}=T_{\mathrm{post}}^{-1}\sum_{t>T_0}x_t\). The first bracketed term is \(\Sigma_1\) (it inflates automatically when the post-period control drifts outside its pre-period range, pricing the extrapolation uncertainty) and the second is \(\Sigma_2\). The residual variance \(\hat\omega^2\) is estimated over the long pre-period as a Newey–West/Bartlett long-run variance with truncation lag \(\lfloor T_0^{1/4}\rfloor\) (Li & Van den Bulte’s \(O(T^{1/4})\) rule); lag \(0\) is the i.i.d. case \(\hat\omega^2=\hat e^{\top}\hat e/(T_0-k)\) for \(k\) regressors. The confidence interval is \(\hat\Delta \pm z_{1-\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\hat\Delta)}\) and the p-value is the two-sided normal test of \(\Delta=0\).
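The long-run variance and the resulting interval can be sketched as follows. The code assumes Bartlett weights \(1-j/(L+1)\), autocovariances normalised by \(T_0-k\) so that lag 0 reproduces the i.i.d. formula quoted above, and a prediction variance of the form \(\hat\omega^2(h + 1/T_{\mathrm{post}})\) with the leverage term \(h\) supplied by the caller. These normalisation details and all names are assumptions for illustration.

```python
from math import floor, sqrt
from statistics import NormalDist

def long_run_variance(resid, n_regressors, lag=None):
    """Newey-West/Bartlett long-run variance; lag defaults to floor(T0 ** 0.25)."""
    T0 = len(resid)
    if lag is None:
        lag = floor(T0 ** 0.25)
    dof = T0 - n_regressors
    lrv = sum(e * e for e in resid) / dof            # lag-0 (i.i.d.) term
    for j in range(1, lag + 1):
        gamma_j = sum(resid[t] * resid[t - j] for t in range(j, T0)) / dof
        lrv += 2.0 * (1.0 - j / (lag + 1)) * gamma_j  # Bartlett-weighted autocovariance
    return lrv

def att_interval(att, omega2, leverage, n_post, alpha=0.05):
    """Normal CI with variance omega2 * (leverage + 1 / n_post)."""
    se = sqrt(omega2 * (leverage + 1.0 / n_post))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return att - z * se, att + z * se

# Toy residuals with strong positive serial dependence: runs of five +1s then five -1s.
resid = ([1.0] * 5 + [-1.0] * 5) * 10
iid_var = long_run_variance(resid, n_regressors=3, lag=0)
nw_var = long_run_variance(resid, n_regressors=3)
lo, hi = att_interval(att=4.0, omega2=nw_var, leverage=0.05, n_post=8)
print(iid_var, nw_var, (lo, hi))
```

On serially dependent residuals the long-run variance is roughly double the i.i.d. figure in this toy case, so an interval built from the lag-0 formula would be too narrow.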
Why this estimator suits the supergeo gap#
Li & Van den Bulte’s regularity conditions (Assumptions C2–C3) explicitly admit trend and unit-root (integrated) common factors \(\lambda_t\) — the regimes under which naive i.i.d. standard errors collapse. The mechanism is the augmentation: regressing the treated aggregate on a scaled control is a cointegrating regression, and a single \(\delta_2\) cancels a shared integrated factor in (3), while \(\gamma t\) absorbs deterministic drift. The validity condition reduces to a single requirement — that the regression residual \(e_t\) be (weakly dependent) stationary — which the augmentation and trend deliver. The design’s parallelism is retained throughout; it minimises the residual variance, which by (8) directly tightens the standard error — and, because the power analysis reads the same held-out residual, the planning MDE as well.
Plain DiD as an option. Setting att_augment=False (and optionally att_trend=False) recovers Li & Van den Bulte’s ordinary difference-in-differences, \(y^{T}_t - y^{C}_t = \delta_1\,[+\,\gamma t] + e_t\) with the control coefficient fixed at one, and the power analysis follows suit, so the two stages stay coherent. A head-to-head Monte-Carlo
comparison (R2 design + plain DiD versus the augmented defaults)
found augmented DiD both more precise (lower realised MDE) and
better-covering across the stationary, trend-plus-seasonal and
integrated-factor regimes, because plain DiD has no mechanism to absorb a
control-scale mismatch or a trend and leaves that structure in its residual.
The augmented estimator is therefore the default; plain DiD remains available
for settings where its textbook simplicity is preferred.
Validity envelope (smoke tests)#
A Monte-Carlo study over the bundled simulator confirms that validity hinges
on residual stationarity, not on the interval recipe. The simulator can
place the unobserved factor on an i.i.d., AR(1) or random-walk process
(factor) and toggle the seasonal amplitude (season_amp) and per-geo
trend (trend_sd):
| Gap structure | Program coverage (nominal 0.95) | Type-I (nominal 0.05) |
|---|---|---|
| stationary i.i.d. factor (paper DGP) | 0.93 | 0.07 |
| deterministic trend + seasonality | 0.87 | 0.13 |
| random-walk factor + seasonality | 0.60 | 0.40 |
The point estimate is unbiased in every regime. On a stationary gap (matching Li & Van den Bulte’s factor-model design) the prediction-variance interval is close to its nominal rate (0.93 against 0.95); the augmentation and trend
regressor recover most of the coverage lost to a deterministic trend and
seasonality; and the adversarial random-walk-plus-seasonality gap, where two
integrated factors and amplitude-heterogeneous seasonality exceed what a
single \(\delta_2\) can cointegrate, marks the honest assumption
boundary. In practice the fitted \(\hat\delta_2\) (reported as
AttEstimate.scale) and the residual diagnose whether the assumption
holds; if a single scale cannot flatten the gap, add covariate or seasonal
regressors before trusting the interval.
Because the power analysis now uses the evaluation model’s held-out
residual, the planning MDE is calibrated to the realised standard error:
on the stationary gap the projected MDE matches the realised value to within
roughly 7% (ratio \(\approx 0.93\)), and on integrated gaps it is
conservative (over-states the MDE), the safe direction. These experiments
live in mlsynth/tests/test_pangeo.py (TestADIDInference).
df = make_seasonal_sales_panel(units_per_arm=6, arms=("A", "B", "C"),
                               T=104, seed=0, n_post=8)
res = PANGEO({
    "df": df, "outcome": "sales", "arm": "arm",
    "unitid": "unit", "time": "time", "post_col": "post_col",
    "max_supergeo_size": 3,
    "att_augment": True, "att_trend": True,   # Augmented DiD (defaults)
}).fit()
print(res.effects.summary())                  # program + per-arm ATT, SE, CI, p
pe = res.effects.program
print(f"program ATT = {pe.att_pct:.1f}% "
      f"[{pe.ci_lower_pct:.1f}, {pe.ci_upper_pct:.1f}], "
      f"p={pe.p_value:.3f}, scale delta_2={pe.scale:.2f}")
Core API#
PANGEO: Parallel-trends supergeo experimental design.
PANGEO is a prospective experimental-design method for geographic (geo) experiments, in the lineage of Supergeo Design (Chen, Doudchenko, Jiang, Stein & Ying 2023). The Supergeo idea – group geos into composite “supergeos” and form balanced pairs, randomising treatment within each pair, without trimming any geo – is retained, including its set-partitioning mixed-integer program.
The departure is the matching objective. Supergeo (and the scalable
OSD variant) match on a scalar aggregate (the summed response) or a few
summary covariate balances. PANGEO instead matches on the full
pre-treatment trajectory: it chooses the partition whose treatment and
control halves are as parallel as possible over the pre-period, scored
by the difference-in-differences pre-period residual sum of squares (the
level-removed gap variance; cf.
mlsynth.utils.selector_helpers._did_from_mean()). Because the DiD
level shift is absorbed, two supergeos can differ in level yet still match
perfectly on shape – exactly what a downstream DiD / synthetic-control
analysis needs, and what scalar sum-matching throws away.
Multi-arm support: a single categorical column names each geo’s eligible
treatment arm (e.g. A/B/C); arms occupy non-overlapping geos
and PANGEO designs each arm independently. The output is a design
(supergeo pairs + treatment/control assignment + achieved parallelism),
not a treatment effect.
- class mlsynth.estimators.pangeo.PANGEO(config: PANGEOConfig | dict)#
Bases: object
Parallel-trends supergeo experimental design.
- Parameters:
config (PANGEOConfig or dict) – Configuration object. See mlsynth.config_models.PANGEOConfig.
- fit() → PangeoResults#
Design the parallel supergeo pairs and return PangeoResults.
With a post_col, the design is built on the pre rows only (so it is identical to the design-only result) and the realized DiD ATT on the post rows is attached as results.effects.
Configuration#
- class mlsynth.config_models.PANGEOConfig(*, df: ~pandas.DataFrame, outcome: str, arm: str, unitid: str, time: str, post_col: str | None = None, weight_col: str | None = None, max_supergeo_size: ~typing.Annotated[int | None, ~annotated_types.Ge(ge=1)] = None, min_pairs: ~typing.Annotated[int, ~annotated_types.Ge(ge=1)] = 1, objective: ~typing.Literal['ss_res', 'r2', 'weighted'] = 'ss_res', recency_decay: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Le(le=1.0)] = 0.97, frac_E: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Lt(lt=1.0)] = 0.7, covariates: ~typing.List[str] | None = None, covariate_weights: ~typing.Dict[str, float] | None = None, standardize_covariates: bool = True, compute_power: bool = True, power_target: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Lt(lt=1.0)] = 0.8, power_alpha: ~typing.Annotated[float, ~annotated_types.Gt(gt=0.0), ~annotated_types.Lt(lt=1.0)] = 0.05, power_post_periods: ~typing.List[int] = <factory>, att_augment: bool = True, att_trend: bool = True, display_graphs: bool = True, save: bool | str = False)#
Configuration for the PANGEO experimental-design estimator.
Parallel-trends supergeo design (in the Supergeo / Chen et al. 2023 lineage): partitions each treatment arm’s geos into supergeo pairs whose treatment/control halves are maximally parallel over the pre-period, via a cvxpy/HiGHS set-partitioning MIP. A prospective design method – it returns supergeo pairs + a treatment/control assignment, not a treatment effect – so it takes a single categorical arm column rather than a binary treat indicator.
- Parameters:
df (pd.DataFrame) – Historical (pre-treatment) balanced long panel.
outcome (str) – Historical outcome column (e.g. sales).
arm (str) – Single categorical column naming each geo’s eligible treatment arm (e.g. values A/B/C). Arms occupy non-overlapping geos; the design runs independently within each arm.
unitid (str) – Unit (geo) identifier column.
time (str) – Time-period column.
post_col (str, optional) – 0/1 indicator column marking post-treatment periods (0 = pre). When given, the design is built on the pre rows alone – identical to the design-only result – and the realized difference-in-differences ATT is additionally computed on the post rows (results.effects).
weight_col (str, optional) – Per-unit aggregation weight (e.g. population), constant within a unit. Makes both the supergeo design and the ATT population-weighted.
max_supergeo_size (int, optional) – Q, the maximum size of either supergeo within a pair. Set 1 to recover classic matched pairs.
min_pairs (int) – Minimum number of supergeo pairs per arm.
- df: DataFrame#
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
Helper Modules#
Pre-treatment parallelism scoring for PANGEO supergeo pairs.
The design objective replaces Supergeo’s scalar sum-matching with a
difference-in-differences parallelism score on the full pre-period
vector. For a supergeo pair split into halves A and B with mean
trajectories \(\bar Y_A, \bar Y_B\), define the DiD level shift
\(\delta = \overline{(\bar Y_A - \bar Y_B)}\) and score the pair by
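the variance of the level-removed pre-period gap
\[\frac{1}{T_0}\sum_{t=1}^{T_0}\big(\bar Y_{A,t}-\bar Y_{B,t}-\delta\big)^2 .\]
Up to the factor \(1/T_0\), this is the pre-period residual sum of squares of a DiD fit (cf.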
mlsynth.utils.selector_helpers._did_from_mean()): minimising it makes
the two halves run parallel, so the within-pair DiD comparison is clean
regardless of their levels (the level is absorbed by \(\delta\)).
- mlsynth.utils.pangeo_helpers.parallelism.best_split(members: ndarray, Ypre: ndarray, max_size: int, objective: str = 'ss_res', weights: ndarray | None = None, cov: ndarray | None = None, cov_scales: ndarray | None = None, cov_weights: ndarray | None = None, unit_weights: ndarray | None = None) Tuple[float, List[int], List[int]]#
Best treatment/control split of a candidate supergeo pair.
- Parameters:
members (np.ndarray) – Row indices (into Ypre) of the units in this candidate pair.
Ypre (np.ndarray) – Pre-period outcomes, shape (n_units, T0).
max_size (int) – Maximum size of either supergeo (Q).
objective ({“ss_res”, “r2”, “weighted”}) – Per-pair cost to minimise (see split_cost()).
weights (np.ndarray, optional) – Length-T0 weights for objective="weighted".
cov (np.ndarray, optional) – Baseline covariate matrix, shape (n_units, M) aligned with the rows of Ypre. When given, a standardized SMD^2 imbalance term is added to each split’s trajectory cost (see covariate_imbalance()).
cov_scales (np.ndarray, optional) – Length-M standardization scales for the covariates.
cov_weights (np.ndarray, optional) – Length-M per-covariate penalty weights (default 1 each).
unit_weights (np.ndarray, optional) – Length-n_units per-unit aggregation weights (e.g. population); the supergeo mean trajectory is the weighted average of its members.
- Returns:
score (float) – Minimum cost over admissible splits (inf if none).
side_a, side_b (list of int) – The treatment / control halves (unit indices) achieving it.
- mlsynth.utils.pangeo_helpers.parallelism.covariate_imbalance(cov_a: ndarray, cov_b: ndarray, scales: ndarray, weights: ndarray | None = None) float#
Weighted standardized SMD^2 between two supergeos’ covariate means.
For supergeo means \(\bar c_A, \bar c_B\) (averaged over each half’s units) and per-covariate scales \(s_m\),
\[\sum_m w_m \Big(\frac{\bar c_{A,m} - \bar c_{B,m}}{s_m}\Big)^2 .\]
A precomputed scalar, so adding it to the trajectory cost keeps the outer set-partitioning problem a linear MILP.
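The penalty above is simple enough to check by hand. The following is a minimal standard-library sketch of the formula, not the library's implementation; the function name `smd2_imbalance` and the toy covariate values are hypothetical.

```python
from typing import Optional, Sequence


def smd2_imbalance(
    cov_a: Sequence[Sequence[float]],
    cov_b: Sequence[Sequence[float]],
    scales: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum over covariates of the squared standardized mean difference."""
    n_cov = len(scales)
    if weights is None:
        weights = [1.0] * n_cov  # default penalty weight of 1 per covariate
    total = 0.0
    for m in range(n_cov):
        mean_a = sum(row[m] for row in cov_a) / len(cov_a)
        mean_b = sum(row[m] for row in cov_b) / len(cov_b)
        total += weights[m] * ((mean_a - mean_b) / scales[m]) ** 2
    return total


# Two covariates (say population and income) for a 2-geo half and a 1-geo half.
half_a = [[100.0, 50.0], [140.0, 54.0]]  # covariate means: 120, 52
half_b = [[110.0, 48.0]]                 # covariate means: 110, 48
penalty = smd2_imbalance(half_a, half_b, scales=[20.0, 4.0])
print(penalty)  # (10/20)^2 + (4/4)^2 = 1.25
```

Because the result is a plain number per candidate split, it can be added to the trajectory cost before the partition problem is solved.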
- mlsynth.utils.pangeo_helpers.parallelism.enumerate_candidate_pairs(unit_indices: ndarray, Ypre: ndarray, max_size: int, objective: str = 'ss_res', weights: ndarray | None = None, cov: ndarray | None = None, cov_scales: ndarray | None = None, cov_weights: ndarray | None = None, unit_weights: ndarray | None = None) List[dict]#
All admissible supergeo pairs over unit_indices with their scores.
A candidate pair is any subset of size 2 .. 2*max_size that can be split into two halves each of size <= max_size. Returns a list of {"members", "score", "side_a", "side_b"} dicts – the inputs to the set-partitioning MIP. score is the chosen objective (plus the optional standardized covariate-imbalance penalty when cov is given).
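To make the enumeration concrete, here is a brute-force sketch using only the standard library. It assumes equal unit weights, the "ss_res" objective, and no covariate penalty; the helper names (`mean_path`, `best_split_sketch`, `enumerate_pairs_sketch`) are illustrative and are not the library's functions.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def mean_path(rows: Sequence[Sequence[float]]) -> List[float]:
    """Equal-weight mean trajectory of a group of units."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def ss_res(mean_a: Sequence[float], mean_b: Sequence[float]) -> float:
    """Sum of squares of the gap after removing its mean level."""
    gap = [a - b for a, b in zip(mean_a, mean_b)]
    level = sum(gap) / len(gap)
    return sum((g - level) ** 2 for g in gap)


def best_split_sketch(members: List[int], ypre, max_size: int) -> Tuple[float, List[int], List[int]]:
    """Cheapest two-way split of `members` with both halves <= max_size."""
    best: Tuple[float, List[int], List[int]] = (float("inf"), [], [])
    first, rest = members[0], members[1:]
    # Pin the first member to side A so mirrored splits are not scored twice.
    for k in range(len(rest) + 1):
        for extra in combinations(rest, k):
            side_a = [first, *extra]
            side_b = [u for u in rest if u not in extra]
            if not side_b or len(side_a) > max_size or len(side_b) > max_size:
                continue
            cost = ss_res(mean_path([ypre[u] for u in side_a]),
                          mean_path([ypre[u] for u in side_b]))
            if cost < best[0]:
                best = (cost, side_a, side_b)
    return best


def enumerate_pairs_sketch(units: Sequence[int], ypre, max_size: int) -> List[Dict]:
    """Every subset of size 2..2*max_size that admits a valid split."""
    out = []
    for size in range(2, 2 * max_size + 1):
        for members in combinations(units, size):
            score, a, b = best_split_sketch(list(members), ypre, max_size)
            if score < float("inf"):
                out.append({"members": list(members), "score": score,
                            "side_a": a, "side_b": b})
    return out


ypre = [
    [1.0, 2.0, 3.0, 4.0],      # unit 0: rising
    [11.0, 12.0, 13.0, 14.0],  # unit 1: rising, parallel to unit 0
    [4.0, 3.0, 2.0, 1.0],      # unit 2: falling
    [9.0, 8.0, 7.0, 6.0],      # unit 3: falling, parallel to unit 2
]
candidates = enumerate_pairs_sketch([0, 1, 2, 3], ypre, max_size=1)
for c in candidates:
    print(c["members"], c["score"])
```

With `max_size=1` only two-unit candidates exist (classic matched pairs). Raising `max_size` grows the candidate list combinatorially, which is one reason the approach targets a modest number of geos.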
- mlsynth.utils.pangeo_helpers.parallelism.gap_variance(mean_a: ndarray, mean_b: ndarray) float#
Variance of the level-removed gap between two trajectories (the DiD pre-period residual sum of squares).
- mlsynth.utils.pangeo_helpers.parallelism.parallelism_r2(mean_a: ndarray, mean_b: ndarray) float#
R^2 of the DiD parallel-trends fit (1 = perfectly parallel).
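The level-removed gap cost, its scale-free counterpart, and a recency-weighted variant can be sketched in a few lines. This is a hedged illustration rather than the library code: the function name `split_cost_sketch` is hypothetical, and the choice of side A's total sum of squares as `ss_tot` is an assumption modelled on the usual DiD R^2.

```python
from typing import Optional, Sequence


def split_cost_sketch(
    mean_a: Sequence[float],
    mean_b: Sequence[float],
    objective: str = "ss_res",
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Per-pair non-parallelism cost (lower = more parallel)."""
    gap = [a - b for a, b in zip(mean_a, mean_b)]
    if objective == "weighted":
        if weights is None:
            raise ValueError("objective='weighted' needs per-period weights")
        # Remove the level at the weighted mean, then take the weighted SS.
        level = sum(w * g for w, g in zip(weights, gap)) / sum(weights)
        return sum(w * (g - level) ** 2 for w, g in zip(weights, gap))
    level = sum(gap) / len(gap)  # the DiD level shift
    ss_res = sum((g - level) ** 2 for g in gap)
    if objective == "ss_res":
        return ss_res
    if objective == "r2":
        # ASSUMPTION: ss_tot is the total sum of squares of side A's trajectory.
        mean_level = sum(mean_a) / len(mean_a)
        ss_tot = sum((a - mean_level) ** 2 for a in mean_a)
        return ss_res / ss_tot if ss_tot > 0 else float("inf")
    raise ValueError(f"unknown objective: {objective}")


flat = [0.0] * 5
late_break = [0.0, 0.0, 0.0, 0.0, 1.0]   # diverges in the latest period
early_break = [1.0, 0.0, 0.0, 0.0, 0.0]  # diverges in the oldest period

# Geometric recency weights, normalised to sum to the number of periods.
decay = 0.5
raw = [decay ** (5 - 1 - t) for t in range(5)]
recency = [w * 5 / sum(raw) for w in raw]

unweighted = (split_cost_sketch(late_break, flat),
              split_cost_sketch(early_break, flat))
late_cost = split_cost_sketch(late_break, flat, "weighted", recency)
early_cost = split_cost_sketch(early_break, flat, "weighted", recency)
print(unweighted, late_cost > early_cost)
```

The unweighted cost cannot tell the two breaks apart, while the recency-weighted cost penalises the recent divergence more heavily.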
- mlsynth.utils.pangeo_helpers.parallelism.split_cost(mean_a: ndarray, mean_b: ndarray, objective: str = 'ss_res', weights: ndarray | None = None) float#
Per-pair cost minimised by the MIP (lower = more parallel).
All three objectives are precomputed scalars, so the outer selection problem stays a linear MILP.
"ss_res" – absolute DiD residual sum of squares \(\sum_t (g_t - \bar g)^2\) (scale-dependent; big-amplitude pairs weigh more).
"r2" – 1 - R^2 = ss_res / ss_tot (scale-free; every pair counts equally, FDID’s R^2 criterion but optimised exactly).
"weighted" – weighted residual SS \(\sum_t w_t (g_t - \bar g_w)^2\) with the level removed at the weighted mean \(\bar g_w = \sum_t w_t g_t / \sum_t w_t\) (e.g. recency weighting, so recent parallelism matters more).
Set-partitioning MIP for PANGEO supergeo-pair design.
Given the admissible supergeo pairs over an arm’s units (each with a
pre-period parallelism score), select a subset of pairs that partitions
every unit exactly once while minimising total non-parallelism – the
Supergeo covering formulation (Chen et al. 2023), unchanged except the
per-pair score is the difference-in-differences parallelism score from the
parallelism module rather than a scalar sum-difference.
Solved with cvxpy using the HiGHS mixed-integer backend.
- mlsynth.utils.pangeo_helpers.mip.solve_partition(candidate_pairs: List[dict], unit_indices: ndarray, min_pairs: int = 1) List[dict]#
Select the exact-cover set of supergeo pairs of minimum total score.
- Parameters:
candidate_pairs (list of dict) – Output of parallelism.enumerate_candidate_pairs(); each has members (unit indices), score, side_a, side_b.
unit_indices (np.ndarray) – The arm’s unit indices that must all be covered exactly once.
min_pairs (int) – Minimum number of supergeo pairs in the design (>= 1).
- Returns:
list of dict – The chosen candidate pairs (a subset of candidate_pairs).
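In symbols, the selection is: minimise \(\sum_j c_j x_j\) over binary \(x_j\), subject to every unit appearing in exactly one chosen candidate and at least min_pairs candidates being chosen. The sketch below solves the same exact-cover problem by exhaustive search instead of a MILP solver, which is a deliberate substitution so the example needs no third-party packages. The function name `exact_cover_min` is hypothetical, and the pruning step assumes non-negative scores.

```python
from typing import Dict, List, Optional, Sequence


def exact_cover_min(candidates: List[Dict], units: Sequence[int],
                    min_pairs: int = 1) -> List[Dict]:
    """Cheapest set of disjoint candidates covering every unit exactly once."""
    best_cost = float("inf")
    best_pick: Optional[List[Dict]] = None

    def search(uncovered: frozenset, chosen: List[Dict], cost: float) -> None:
        nonlocal best_cost, best_pick
        if cost >= best_cost:  # prune (valid because scores are non-negative)
            return
        if not uncovered:
            if len(chosen) >= min_pairs:
                best_cost, best_pick = cost, list(chosen)
            return
        pivot = min(uncovered)  # always cover the smallest uncovered unit next
        for cand in candidates:
            members = frozenset(cand["members"])
            if pivot in members and members <= uncovered:
                chosen.append(cand)
                search(uncovered - members, chosen, cost + cand["score"])
                chosen.pop()

    search(frozenset(units), [], 0.0)
    return best_pick or []


toy = [
    {"members": [0, 1, 2, 3], "score": 0.1},  # one big supergeo pair
    {"members": [0, 1], "score": 1.0},
    {"members": [2, 3], "score": 1.0},
    {"members": [0, 2], "score": 20.0},
    {"members": [1, 3], "score": 20.0},
]
one_pair = exact_cover_min(toy, [0, 1, 2, 3], min_pairs=1)
two_pairs = exact_cover_min(toy, [0, 1, 2, 3], min_pairs=2)
print([c["members"] for c in one_pair])
print([c["members"] for c in two_pairs])
```

The `min_pairs` constraint matters here: without it the single four-geo pair wins on score, but it leaves only one treated/control contrast in the arm.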
Panel ingestion for the PANGEO design estimator.
Pivots a historical (pre-treatment) long panel into a wide
units x time outcome matrix and records each unit’s treatment-arm
eligibility from a single categorical arm column (values A, B,
… ). The design is run independently within each arm.
- class mlsynth.utils.pangeo_helpers.setup.PangeoInputs(Y: ~numpy.ndarray, unit_names: ~typing.List[~typing.Any], time_labels: ~numpy.ndarray, arm_of: ~typing.Dict[~typing.Any, ~typing.Any], arm_units: ~typing.Dict[~typing.Any, ~numpy.ndarray], covariates: ~numpy.ndarray | None = None, covariate_names: ~typing.List[str] = <factory>, covariate_scales: ~numpy.ndarray | None = None, weights: ~numpy.ndarray | None = None, weight_name: str | None = None)#
Preprocessed pre-treatment panel for PANGEO.
- Y#
Pre-period outcomes, shape (N, T); rows = units in unit_names order.
- Type:
np.ndarray
- time_labels#
Length-T time labels.
- Type:
np.ndarray
- covariates#
Baseline covariate matrix, shape (N, M), aligned with unit_names rows (None if no covariates requested).
- Type:
np.ndarray or None
- covariate_scales#
Length-M cross-unit standard deviations used to standardize the covariate imbalance (None if no covariates).
- Type:
np.ndarray or None
- weights#
Length-N per-unit aggregation weights (e.g. population) aligned with unit_names (None = equal weights). Used for both the supergeo mean trajectory in the design and the downstream ATT.
- Type:
np.ndarray or None
- Y: ndarray#
- time_labels: ndarray#
- mlsynth.utils.pangeo_helpers.setup.build_post_matrix(post_df: DataFrame, inputs: PangeoInputs, outcome: str, unitid: str, time: str) tuple[ndarray, ndarray]#
Pivot the post-treatment rows into a (N, T_post) outcome matrix aligned with inputs.unit_names.
Returns (Y_post, post_time_labels). Every design unit must appear in the post period with no missing cells (a balanced post panel).
- mlsynth.utils.pangeo_helpers.setup.prepare_pangeo_inputs(df: DataFrame, outcome: str, arm: str, unitid: str, time: str, min_units_per_arm: int = 2, covariates: List[str] | None = None, standardize_covariates: bool = True, weight_col: str | None = None) PangeoInputs#
Pivot a historical panel into PangeoInputs.
- Parameters:
df (pd.DataFrame) – Balanced pre-treatment long panel; one row per (unit, time).
outcome (str) – Historical outcome column (e.g. sales).
arm (str) – Single categorical column naming each geo’s eligible treatment arm (e.g. values A / B / C). Units are designed within their arm.
unitid, time (str) – Unit-id and time column names.
min_units_per_arm (int) – Minimum geos required per arm to form at least one supergeo pair.
covariates (list of str, optional) – Baseline covariate columns to balance across supergeo halves. Each unit’s covariate value is its mean over the panel (so a column that varies over time is reduced to a per-unit baseline level).
standardize_covariates (bool) – Divide each covariate’s imbalance by its cross-unit std (default). With False the raw scale is used (scales = 1).
weight_col (str, optional) – Per-unit aggregation weight column (e.g. population), constant within a unit. Makes the supergeo aggregate a weighted average; None (default) gives equal weights.
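The long-to-wide pivot and the balance requirement can be illustrated without pandas. This sketch works on a list of row dictionaries; the function name `pivot_panel` and the sorted unit/time ordering are assumptions for illustration, and the library's own ordering and error messages may differ.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def pivot_panel(rows: List[Dict[str, Any]], outcome: str, arm: str,
                unitid: str, time: str) -> Tuple[list, list, list, dict]:
    """Pivot long rows into a units x time matrix and group units by arm."""
    units = sorted({r[unitid] for r in rows})
    times = sorted({r[time] for r in rows})
    cell = {(r[unitid], r[time]): r[outcome] for r in rows}
    missing = [(u, t) for u in units for t in times if (u, t) not in cell]
    if missing:
        raise ValueError(f"panel is not balanced; missing cells: {missing[:3]}")
    Y = [[cell[(u, t)] for t in times] for u in units]
    arm_of = {r[unitid]: r[arm] for r in rows}
    arm_units: Dict[Any, List[int]] = defaultdict(list)
    for i, u in enumerate(units):
        arm_units[arm_of[u]].append(i)  # row indices into Y, per arm
    return Y, units, times, dict(arm_units)


rows = [
    {"unit": u, "time": t, "sales": 10.0 * i + t, "arm": a}
    for i, (u, a) in enumerate([("g0", "A"), ("g1", "A"), ("g2", "B"), ("g3", "B")])
    for t in range(3)
]
Y, units, times, arm_units = pivot_panel(rows, "sales", "arm", "unit", "time")
print(Y[1], arm_units)
```

Dropping any one row from `rows` makes `pivot_panel` raise, which mirrors the balanced-panel requirement stated above.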
Orchestration for the PANGEO design estimator.
For each treatment arm: enumerate admissible supergeo pairs over the arm’s geos (scored by pre-period DiD parallelism), solve the set-partitioning MIP to choose the exact-cover design of minimum total non-parallelism, and assemble the per-arm supergeo pairs with their treatment/control halves.
- mlsynth.utils.pangeo_helpers.pipeline.run_pangeo(inputs: PangeoInputs, *, max_supergeo_size: int | None = None, min_pairs: int = 1, objective: str = 'ss_res', recency_decay: float = 0.97, frac_E: float = 0.7, covariate_weights: Dict[str, float] | None = None, compute_power: bool = True, power_target: float = 0.8, power_alpha: float = 0.05, power_post_periods: Sequence[int] | None = None, att_augment: bool = True, att_trend: bool = True) PangeoResults#
Design parallel supergeo pairs within each arm.
- Parameters:
inputs (PangeoInputs) – Preprocessed pre-treatment panel.
max_supergeo_size (int, optional) – Q, the maximum size of either supergeo within a pair. If None (the default), Q is selected automatically: every feasible Q in 1..min(ceil(smallest_arm/2), 6) is designed and the one minimising the program-level MDE is returned (see _auto_select_q()). The sweep is recorded in results.metadata["q_sweep"].
min_pairs (int) – Minimum number of supergeo pairs per arm.
objective ({"ss_res", "r2", "weighted"}) – Per-pair parallelism cost minimised by the MIP (see mlsynth.utils.pangeo_helpers.parallelism.split_cost()).
recency_decay (float) – Geometric recency-weight decay for objective="weighted": period t gets weight recency_decay**(T0-1-t) (recent periods up-weighted), normalised to sum to T0.
frac_E (float) – Fraction of the pre-period used as the estimation window E that the split is optimised over; the remaining tail is the blank window B, held out so its gap residuals are an honest, out-of-sample estimate of the parallel-trends noise (powering the MDE and the conformal CIs). Mirrors LEXSCM / SPCD. Falls back to the full pre-period when the panel is too short to leave a usable B.
covariate_weights (dict, optional) – {covariate_name: weight} on the standardized SMD^2 imbalance penalty (default 1.0 each). Only used when inputs.covariates is present.
compute_power (bool) – Attach a program- and arm-level MDE / power analysis to the result (see mlsynth.utils.pangeo_helpers.power).
power_target (float) – Target power for the stored MDE (default 0.80).
power_alpha (float) – Two-sided significance level for the MDE (default 0.05).
power_post_periods (sequence of int, optional) – Post-period horizons to evaluate (default range(2, 13) = 2..12).
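Two of these parameters, recency_decay and frac_E, are easy to misread, so a small sketch may help. It builds the geometric recency weights as described and splits a pre-period gap into the estimation window E and the blank window B. The function names and the rounding rule for the E/B cut are assumptions; the fallback for very short panels is not modelled.

```python
from typing import List, Sequence, Tuple


def recency_weights(t0: int, decay: float = 0.97) -> List[float]:
    """Weight decay**(t0-1-t) for period t, rescaled to sum to t0."""
    raw = [decay ** (t0 - 1 - t) for t in range(t0)]
    scale = t0 / sum(raw)
    return [w * scale for w in raw]


def split_windows(t0: int, frac_e: float = 0.7) -> Tuple[List[int], List[int]]:
    """Leading fraction is E (optimised on); the tail is B (held out)."""
    n_e = int(round(frac_e * t0))  # ASSUMPTION: simple rounding of the cut point
    return list(range(n_e)), list(range(n_e, t0))


def holdout_residuals(gap: Sequence[float], frac_e: float = 0.7) -> Tuple[float, List[float]]:
    """Gap level estimated on E, residuals evaluated on B only."""
    e_idx, b_idx = split_windows(len(gap), frac_e)
    level = sum(gap[t] for t in e_idx) / len(e_idx)
    return level, [gap[t] - level for t in b_idx]


w = recency_weights(10, decay=0.97)
e_idx, b_idx = split_windows(10, frac_e=0.7)
gap = [1.0] * 7 + [1.5, 0.5, 1.0]  # flat on E, noisy on B
level, resid_b = holdout_residuals(gap)
print(round(sum(w), 6), len(e_idx), len(b_idx), level, resid_b)
```

The residuals on B are never seen by the optimiser, which is why they can serve as an out-of-sample noise estimate.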
Program- and arm-level power / MDE analysis for a PANGEO design.
Once a PANGEO design is frozen, the minimum detectable effect (MDE) after
X post-treatment periods is a closed-form function of the pre-period
parallelism the design achieved – which is exactly what the MILP
minimised, so power and the design objective are the same quantity.
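For a supergeo pair the no-effect gap \(g_t = \bar Y^T_t - \bar Y^C_t\) sits on its parallel-trends line \(\delta_p = \overline{g}_{\text{pre}}\); its per-period residual variance is
\[\sigma_p^2 = \operatorname{Var}_{t \in \text{pre}}\big(g_t - \delta_p\big),\]
the noise an X-period difference-in-differences ATT must overcome. The
estimator \(\hat\tau_p = \overline{g}_{\text{post}} - \delta_p\)
has variance
\[\operatorname{Var}(\hat\tau_p) = \sigma_p^2\,\big[f(X,\rho) + f(T_0,\rho)\big],\]
where \(T_0\) is the number of pre-periods averaged in \(\delta_p\) and \(f(n,\rho) = \operatorname{Var}(\text{mean of } n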
\text{ serially-correlated periods})/\sigma^2\) is the variance-inflation
factor of an AR(1) process. Consecutive weeks are correlated, so X post
weeks are worth far fewer than X independent draws – the trap a naive
i.i.d. power calculation falls into. \(\rho\) is estimated from the
pooled pre-period gap residuals of the chosen pairs.
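For a stationary AR(1) series the factor has the standard closed form \(f(n,\rho) = \frac{1}{n}\big[1 + 2\sum_{k=1}^{n-1}(1 - k/n)\rho^k\big]\). The sketch below evaluates it and the implied number of effectively independent periods; the function name is hypothetical and this is a textbook formula, not a copy of the library's code.

```python
def ar1_mean_variance_factor(n: int, rho: float) -> float:
    """Var(mean of n AR(1) periods) divided by the per-period variance."""
    tail = sum((1 - k / n) * rho ** k for k in range(1, n))
    return (1 + 2 * tail) / n


iid = ar1_mean_variance_factor(8, 0.0)         # 1/8: the i.i.d. benchmark
correlated = ar1_mean_variance_factor(8, 0.6)  # larger: weeks overlap in information
effective_n = 1 / correlated                   # about 2.6 independent weeks
print(iid, round(correlated, 4), round(effective_n, 2))
```

At a lag-1 correlation of 0.6, eight post weeks carry roughly the information of 2.6 independent ones, so an i.i.d. calculation would understate the standard error by a factor of about 1.75.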
The program ATT is the treated-size-weighted average of the pair ATTs; its MDE is the headline number a program owner reports. Per-arm curves are also returned. Pairs are treated as independent across the program, so the arm count multiplies the effective sample size – which is why pooling to the program level detects far smaller effects than any one small arm could. Cross-pair common shocks within an arm are ignored (a mild optimism; a placebo-in-time engine would absorb them).
- class mlsynth.utils.pangeo_helpers.power.MDEPoint(post_periods: int, mde_absolute: float, mde_pct: float, se: float)#
Minimum detectable effect at one post-period horizon.
- class mlsynth.utils.pangeo_helpers.power.PangeoPower(program: ~mlsynth.utils.pangeo_helpers.power.PowerCurve, arms: ~typing.Dict[~typing.Any, ~mlsynth.utils.pangeo_helpers.power.PowerCurve], alpha: float, power_target: float, post_periods: ~typing.List[int], serial_correlation: float, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#
Power / MDE analysis attached to PangeoResults.
- program#
Headline program-level MDE curve (pooled across all arms).
- Type:
PowerCurve
- serial_correlation#
Pooled lag-1 (AR(1)) autocorrelation of the gap residuals used to inflate the variance for serial dependence.
- Type:
float
- arms: Dict[Any, PowerCurve]#
- power_for_effect(effect_pct: float, post_periods: int, level: str = 'program') float#
Power to detect an effect_pct% effect at horizon post_periods.
Inverts the MDE relation for a given true effect size (two-sided Gaussian approximation).
- program: PowerCurve#
- summary() DataFrame#
Tidy table of MDE (% of baseline) by horizon: program + each arm.
- class mlsynth.utils.pangeo_helpers.power.PowerCurve(level: str, baseline: float, n_treated: int, n_pairs: int, points: List[MDEPoint])#
MDE-vs-horizon curve at one aggregation level (program or arm).
- mlsynth.utils.pangeo_helpers.power.compute_pangeo_power(arm_designs: Dict[Any, Any], *, post_periods: Sequence[int] | None = None, alpha: float = 0.05, power_target: float = 0.8) PangeoPower#
Program- and arm-level MDE curves for a frozen PANGEO design.
- Parameters:
arm_designs (dict) – {arm_label: ArmDesign} from a completed design.
post_periods (sequence of int, optional) – Horizons to evaluate (default range(2, 13) = 2..12).
alpha (float) – Two-sided significance level (default 0.05).
power_target (float) – Target power (default 0.80).
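Under a two-sided Gaussian approximation, the link between a standard error, the MDE, and power is short enough to write out. The sketch below uses the common multiplier \(z_{1-\alpha/2} + z_{\text{power}}\); that the library uses exactly this form is an assumption, and the function names and the example numbers are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def mde_from_se(se: float, alpha: float = 0.05, power: float = 0.8) -> float:
    """Smallest effect detectable at the target power, given the ATT's SE."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return (z_alpha + z_power) * se


def power_for_effect_sketch(effect: float, se: float, alpha: float = 0.05) -> float:
    """Inverse relation: power to detect a given true effect (far tail ignored)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(abs(effect) / se - z_alpha)


se, baseline = 1.5, 200.0  # illustrative values, not library output
mde_abs = mde_from_se(se)
mde_pct = 100 * mde_abs / baseline
print(round(mde_abs, 3), round(mde_pct, 2), round(power_for_effect_sketch(mde_abs, se), 3))
```

By construction, plugging the MDE back into the power function returns the target power, here 0.80.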
Realized ATT estimation for a PANGEO design with post-period data.
PANGEO is a design method: with only pre-treatment history it returns the
supergeo pairs and the treatment/control assignment. If the experiment has
since run – i.e. the panel carries a post_col marking post-treatment
periods – the same design (built on the pre-period alone) is scored
against the realized post outcomes here, with inference following the
Augmented Difference-in-Differences estimator of Li & Van den Bulte
(2022, Marketing Science 42(4):746-767).
The estimator and its inference#
For a treated supergeo aggregate \(y^{T}_t\) and a control supergeo aggregate \(y^{C}_t\), the counterfactual is the regression projection
\[y^{T}_t = \delta_1 + \delta_2\, y^{C}_t + \gamma t + e_t\]
fit by least squares on the pre-period (the augmented DiD: the scale \(\delta_2\) is free rather than forced to 1, and a linear time trend \(\gamma t\) is included). Writing \(x_t = (1, y^{C}_t, t)'\) and \(\hat\delta\) for the OLS estimate, the per-period treatment effect is \(\hat u_t = y^{T}_t - x_t'\hat\delta\) and the ATT is \(\hat\Delta = T_2^{-1}\sum_{t=T_1+1}^{T}\hat u_t\).
Li & Van den Bulte show (Propositions 3.1-3.3; Web Appendix C) that \(\sqrt{T_2}(\hat\Delta-\Delta)\to N(0,\Sigma_1+\Sigma_2)\), which gives the prediction-variance standard error (their C.13)
\[\text{SE}(\hat\Delta) = \sqrt{\hat\omega^2\Big(\bar x_{\text{post}}'\,(X_{\text{pre}}'X_{\text{pre}})^{-1}\,\bar x_{\text{post}} + \frac{1}{T_2}\Big)},\]
where \(X_{\text{pre}}\) stacks the pre-period rows \(x_t'\), \(\bar x_{\text{post}}\) is the post-period mean of \(x_t\) and \(\hat\omega^2\) is the residual variance, estimated over the long pre-period (a Newey-West/Bartlett long-run variance with lag \(\lfloor T_1^{1/4}\rfloor\) to allow serial correlation; lag 0 is the i.i.d. case \(\hat\sigma^2_e=\hat e'\hat e/(T_1-k)\)). The two terms are the coefficient-estimation variance (Σ₁) and the post-period averaging variance (Σ₂). The CI is \(\hat\Delta\pm z_{1-\alpha/2}\,\text{SE}\).
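A compact NumPy sketch of the estimator may make the pieces concrete. It fits the treated aggregate on an intercept, the control aggregate, and a linear trend over the pre-period, then averages the post-period gaps. The standard error uses the lag-0 (i.i.d.) residual variance with the textbook OLS prediction-variance form; treating that as equivalent to the library's computation is an assumption, and `adid_att_sketch` and the simulated series are hypothetical.

```python
import numpy as np


def adid_att_sketch(y_treated, y_control, n_pre):
    """Augmented-DiD ATT, lag-0 standard error, and full counterfactual path."""
    y_t = np.asarray(y_treated, dtype=float)
    y_c = np.asarray(y_control, dtype=float)
    n_all = y_t.size
    X = np.column_stack([np.ones(n_all), y_c, np.arange(n_all, dtype=float)])
    X_pre, X_post = X[:n_pre], X[n_pre:]

    # OLS on the pre-period only: intercept, free scale on control, trend.
    coef, *_ = np.linalg.lstsq(X_pre, y_t[:n_pre], rcond=None)
    counterfactual = X @ coef
    att = float(np.mean(y_t[n_pre:] - counterfactual[n_pre:]))

    # Lag-0 residual variance and the two variance terms.
    resid = y_t[:n_pre] - X_pre @ coef
    sigma2 = resid @ resid / (n_pre - X.shape[1])
    x_bar = X_post.mean(axis=0)
    leverage = x_bar @ np.linalg.solve(X_pre.T @ X_pre, x_bar)
    se = float(np.sqrt(sigma2 * (leverage + 1.0 / (n_all - n_pre))))
    return att, se, counterfactual


rng = np.random.default_rng(0)
n_all, n_pre, true_effect = 60, 48, 4.0
t = np.arange(n_all)
y_c = 100 + 5 * np.sin(2 * np.pi * t / 13) + 0.2 * t
y_t = 5 + 1.2 * y_c + 0.05 * t + rng.normal(0, 0.3, n_all)
y_t[n_pre:] += true_effect  # inject the effect after the "design" is fixed

att, se, _ = adid_att_sketch(y_t, y_c, n_pre)
print(f"ATT = {att:.2f}, SE = {se:.2f}, "
      f"95% CI = [{att - 1.96 * se:.2f}, {att + 1.96 * se:.2f}]")
```

The leverage term grows as the post-period regressors move away from their pre-period range, which is the extrapolation penalty built into the prediction variance.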
Why this estimator suits the supergeo gap#
The theory explicitly admits trend and unit-root (integrated) common factors (Li & Van den Bulte Assumptions C2/C3, Prop 3.3). The augmentation \(\delta_2\) makes treated-on-control a cointegrating regression, scaling out a shared integrated factor; the trend term absorbs deterministic drift; and the prediction-variance term automatically inflates when the post-period control drifts outside its pre-period range, pricing the extrapolation uncertainty. The validity condition is that the residual \(e_t\) be (weakly dependent) stationary – which the augmentation + trend deliver. The arm and program ATTs apply this single-treated-unit estimator to the treated-size-weighted supergeo aggregate at each level; the program number is the headline.
- class mlsynth.utils.pangeo_helpers.effects.AttEstimate(level: str, att: float, att_pct: float, baseline: float, se: float, ci_lower: float, ci_upper: float, ci_lower_pct: float, ci_upper_pct: float, p_value: float, n_post: int, scale: float, observed: ~numpy.ndarray = <factory>, counterfactual: ~numpy.ndarray = <factory>)#
An augmented-DiD ATT (Li & Van den Bulte 2022) at one level.
- baseline#
Mean post-period counterfactual outcome used for att_pct (the predicted treated series absent treatment).
- Type:
float
- ci_lower, ci_upper
Confidence interval for the absolute ATT.
- Type:
float
- ci_lower_pct, ci_upper_pct
The same interval as a percentage of baseline.
- Type:
float
- scale#
Fitted augmentation coefficient \(\hat\delta_2\) (1.0 if the augmentation is disabled, i.e. plain DiD).
- Type:
float
- observed#
Observed treated supergeo aggregate over pre + post periods.
- Type:
np.ndarray
- counterfactual#
Augmented-DiD counterfactual prediction of the treated aggregate over the same periods; the gap in the post window is the per-period effect.
- Type:
np.ndarray
- counterfactual: ndarray#
- observed: ndarray#
- class mlsynth.utils.pangeo_helpers.effects.PangeoEffects(program: ~mlsynth.utils.pangeo_helpers.effects.AttEstimate, arms: ~typing.Dict[~typing.Any, ~mlsynth.utils.pangeo_helpers.effects.AttEstimate], pair_att: ~typing.Dict[~typing.Any, ~typing.List[float]], n_post: int, weighted: bool, alpha: float, metadata: ~typing.Dict[str, ~typing.Any] = <factory>)#
Realized ATT for a PANGEO design scored against post-period data.
- program#
Headline program-level ATT (pooled across all arms).
- Type:
AttEstimate
- arms: Dict[Any, AttEstimate]#
- program: AttEstimate#
- summary() DataFrame#
Tidy table of the program and per-arm ATT estimates.
- mlsynth.utils.pangeo_helpers.effects.adid_counterfactual(YT: ndarray, YC: ndarray, n_pre: int, augment: bool = True, trend: bool = True) ndarray#
Treated-series counterfactual from the (augmented) DiD fit.
Fit yT = d1 [+ d2*yC] [+ g*t] on the first n_pre periods and return the predicted treated trajectory over all periods – the line PANGEO plots against the observed treated aggregate. For plain DiD the counterfactual is yC + (d1 [+ g*t]); for augmented DiD it is the projection d1 + d2*yC [+ g*t] directly.
- mlsynth.utils.pangeo_helpers.effects.compute_pangeo_effects(results, inputs, Y_post: ndarray, *, alpha: float = 0.05, augment: bool = True, trend: bool = True) PangeoEffects#
Augmented-DiD ATT (Li & Van den Bulte 2022) for a design scored on post outcomes, at the program and arm levels.
- Parameters:
results (PangeoResults) – The frozen design (pairs + assignment) built on the pre-period.
inputs (PangeoInputs) – Pre-period inputs (supplies unit order and population weights).
Y_post (np.ndarray) – Post-period outcomes, shape (N, T_post), rows aligned with inputs.unit_names.
alpha (float) – Significance level for the CIs / p-values.
augment (bool) – Free augmentation coefficient delta_2 on the control aggregate (the augmented DiD). False forces delta_2 = 1 (plain DiD).
trend (bool) – Include a linear time-trend regressor.
Seasonal factor-model simulator of sales-like panel data for PANGEO.
Generates a balanced panel of “sales” with the structure typical of geo marketing data, drawing on the factor-model DGPs used throughout mlsynth:
across several non-overlapping treatment arms. The design problem is prospective, so the generated panel is pre-treatment only – it is the historical window a designer would use to build balanced supergeo pairs.
- mlsynth.utils.pangeo_helpers.simulation.make_seasonal_sales_panel(units_per_arm: int = 5, arms: Tuple[str, ...] = ('A', 'B', 'C'), T: int = 156, n_factors: int = 2, season_period: int = 52, noise: float = 0.05, seed: int = 0, covariates: bool = False, n_post: int = 0, factor: str = 'rw', season_amp: float = 1.0, trend_sd: float = 0.01) DataFrame#
Simulate a seasonal, multi-arm, sales-like pre-treatment panel.
- Parameters:
units_per_arm (int) – Number of geos (markets) eligible for each arm.
arms (tuple of str) – Arm labels; each unit is eligible for exactly one arm. Arms occupy non-overlapping geos.
T (int) – Number of pre-treatment periods (e.g. weeks; default 3 years).
n_factors (int) – Rank of the common low-rank factor structure.
season_period (int) – Seasonal cycle length (e.g. 52 weeks).
noise (float) – Idiosyncratic noise scale, relative to the signal.
seed (int) – RNG seed.
covariates (bool) – If True, also emit time-invariant baseline population and income columns (correlated with the unit’s level and factor loadings) for PANGEO’s covariate-balancing option.
n_post (int) – Number of post-treatment periods to append after the T pre periods. When > 0 the panel gains a post_col (0 = pre, 1 = post); the DGP continues unchanged (no treatment effect – the effect is injected by the caller after the design is fixed, which is how PANGEO’s ATT recovery is validated).
factor ({"rw", "iid", "ar1"}) – Process for the unobserved common factors that drive the supergeo gap. "rw" (default) is an integrated random walk – a stress test whose non-exchangeable loadings violate the conformal-inference assumption (Abadie & Zhao 2026, Thm 2) and demand the increment-bootstrap. "iid" and "ar1" are stationary (exchangeable) factors under which the blank-window conformal CI is exact.
- Returns:
pd.DataFrame – Long panel with columns unit, time, sales, arm (plus population / income when covariates=True and post_col when n_post > 0).
Frozen dataclasses for the PANGEO design estimator.
PANGEO is a prospective experimental design method: from historical (pre-treatment) sales it partitions each treatment arm’s geos into supergeo pairs whose treatment/control halves are maximally parallel over the pre-period, so a later difference-in-differences / synthetic-control analysis has clean parallel trends. The output is a design (supergeo pairs + treatment/control assignment + achieved parallelism), not a treatment effect.
- class mlsynth.utils.pangeo_helpers.structures.ArmDesign(arm: Any, pairs: List[SupergeoPair], n_units: int, total_gap_variance: float, mean_parallelism_r2: float, treatment_units: List[Any], control_units: List[Any])#
The supergeo-pair design for a single treatment arm.
- arm#
Arm label.
- Type:
Any
- pairs#
The chosen supergeo pairs partitioning the arm’s units.
- Type:
list of SupergeoPair
- pairs: List[SupergeoPair]#
- class mlsynth.utils.pangeo_helpers.structures.PangeoResults(arm_designs: ~typing.Dict[~typing.Any, ~mlsynth.utils.pangeo_helpers.structures.ArmDesign], max_supergeo_size: int, assignment: ~typing.Dict[~typing.Any, str], time_labels: ~numpy.ndarray, metadata: ~typing.Dict[str, ~typing.Any] = <factory>, power: ~typing.Any | None = None, effects: ~typing.Any | None = None)#
Top-level container returned by mlsynth.PANGEO.fit().
- time_labels#
Pre-period time labels the design was built on.
- Type:
np.ndarray
- time_labels: ndarray#
- class mlsynth.utils.pangeo_helpers.structures.SupergeoPair(treatment: ~typing.List[~typing.Any], control: ~typing.List[~typing.Any], gap_variance: float, parallelism_r2: float, treatment_mean: ~numpy.ndarray, control_mean: ~numpy.ndarray, covariate_smd: ~typing.Dict[str, float] = <factory>, gap_level: float = 0.0, holdout_resid: ~numpy.ndarray = <factory>)#
One supergeo pair within an arm.
- gap_variance#
Pre-period level-removed gap variance between the two halves (lower = more parallel; the DiD pre-period residual SS).
- Type:
float
- treatment_mean#
Pre-period mean trajectory of the treatment half.
- Type:
np.ndarray
- control_mean#
Pre-period mean trajectory of the control half.
- Type:
np.ndarray
- covariate_smd#
{covariate_name: standardized mean difference} between the treatment and control halves (empty if no covariates were used).
- Type:
Dict[str, float]
- gap_level#
DiD counterfactual gap level \(\delta\) – the mean gap over the estimation window E (the periods the split was optimised on).
- Type:
float
- holdout_resid#
Gap residuals on the held-out blank window B (gap[B] - gap_level). B is excluded from the optimisation, so these residuals are an honest out-of-sample estimate of the parallel-trends noise – the reservoir for conformal inference and the variance behind the MDE.
- Type:
np.ndarray
- control_mean: ndarray#
- holdout_resid: ndarray#
- treatment_mean: ndarray#
Example#
A seasonal, multi-arm sales panel (the bundled simulator), designed into
parallel supergeo pairs. With display_graphs=True PANGEO plots, per arm,
the observed treated supergeo aggregate against the augmented-DiD
counterfactual prediction (mlsynth.utils.pangeo_helpers.effects.adid_counterfactual()):
the in-sample pre-period fit when designing, and – once post_col data
are supplied – the counterfactual extended past the treatment date, so the
post-window gap is the estimated effect.
from mlsynth import PANGEO
from mlsynth.utils.pangeo_helpers import make_seasonal_sales_panel

# 3 arms (non-overlapping geos), 6 geos each, 156 weeks of history.
df = make_seasonal_sales_panel(units_per_arm=6, arms=("A", "B", "C"),
                               T=156, seed=0)

res = PANGEO({
    "df": df,
    "outcome": "sales",
    "arm": "arm",             # single categorical arm column
    "unitid": "unit",
    "time": "time",
    "max_supergeo_size": 3,   # Q
}).fit()

for arm, design in res.arm_designs.items():
    print(f"Arm {arm}: {len(design.pairs)} pair(s), "
          f"parallel-trends R^2 = {design.mean_parallelism_r2:.3f}")
    for p in design.pairs:
        print(f"  T={p.treatment} C={p.control} R^2={p.parallelism_r2:.3f}")

# res.assignment maps every geo -> 'treatment' / 'control'.
On the simulated data this returns designs with parallel-trends \(R^2\) around 0.90–0.98 — roughly 10–35x more parallel than a random treatment/control split of the same geos.
Simulation: Trajectory Matching vs. a Scalar Supergeo#
The supergeo design of Chen et al. (2023) matches (super)geos on a scalar summary of baseline response (its variance term is \(\sum_k (Z_{G_{k,+}} - Z_{G_{k,-}})^2\), a sum over scalar baseline differences). PANGEO carries this into the panel setting: it matches on the full pre-treatment trajectory (level-removed parallelism), so it can separate geos that look identical on a scalar yet move differently over time. The self-contained Monte Carlo below makes the gap concrete — adapting the paper’s RMSE comparison (supergeo vs. matched pairs) to a panel where every geo has the same pre-period mean but a distinct trajectory shape, a setting in which scalar matching is by construction blind. Six geos keep the MIP instantaneous.
import numpy as np
import pandas as pd
from mlsynth import PANGEO


def make_panel(rng, T_pre=20, S=6, level=100.0, noise=0.6):
    """Six geos = three parallel pairs (up-trend, down-trend, cycle);
    each shape is demeaned over the pre-period so all pre-means match."""
    T = T_pre + S
    t = np.arange(T)
    up = (t - t.mean()) / t.std()
    cyc = np.sin(2 * np.pi * t / 5.0)
    shapes = [5 * up, 5 * up, -5 * up, -5 * up, 5 * cyc, 5 * cyc]
    cols = []
    for s in shapes:
        s = s - s[:T_pre].mean()          # equal pre-means => scalar-blind
        cols.append(level + s + rng.normal(0, noise, T))
    return np.column_stack(cols), T_pre


def did(Y, T_pre, treated, control, tau):
    Yo = Y.copy()
    Yo[T_pre:, treated] += tau            # inject the true effect
    t_eff = Yo[T_pre:, treated].mean() - Yo[:T_pre, treated].mean()
    c_eff = Yo[T_pre:, control].mean() - Yo[:T_pre, control].mean()
    return t_eff - c_eff


def pangeo_design(Y, T_pre):
    # Long pre-period panel: one row per (geo, period), all in a single arm.
    df = pd.DataFrame(
        {"geo": f"g{g}", "t": int(t), "y": Y[t, g], "arm": "A"}
        for g in range(6) for t in range(T_pre)
    )
    a = PANGEO({"df": df, "outcome": "y", "unitid": "geo", "time": "t",
                "arm": "arm", "max_supergeo_size": 1,
                "compute_power": False, "display_graphs": False}).fit().assignment
    T = [int(g[1:]) for g, v in a.items() if v == "treatment"]
    C = [int(g[1:]) for g, v in a.items() if v == "control"]
    return T, C


def scalar_match(Y, T_pre, rng):          # Google-style: pair on scalar mean
    order = np.argsort(Y[:T_pre].mean(0))
    T, C = [], []
    for k in range(0, 6, 2):
        a, b = order[k], order[k + 1]
        if rng.random() < 0.5:
            T.append(a); C.append(b)
        else:
            T.append(b); C.append(a)
    return T, C


tau, R = 4.0, 60
rng = np.random.default_rng(1)
errs = {"PANGEO (trajectory)": [], "scalar matched-pairs": []}
for _ in range(R):
    Y, T_pre = make_panel(rng)
    Tp, Cp = pangeo_design(Y, T_pre)
    errs["PANGEO (trajectory)"].append(did(Y, T_pre, Tp, Cp, tau) - tau)
    Ts, Cs = scalar_match(Y, T_pre, rng)
    errs["scalar matched-pairs"].append(did(Y, T_pre, Ts, Cs, tau) - tau)

for name, e in errs.items():
    e = np.array(e)
    print(f"{name:22s} RMSE = {np.sqrt((e ** 2).mean()):.2f}")
Because all geos share a pre-period mean, the scalar match pairs them essentially at random, and the difference-in-differences estimate inherits the up/down/cycle shape mismatch (RMSE ≈ 6 against a true effect of 4). PANGEO reads the trajectory shape, recovers the three parallel pairs, and estimates the effect about 30× more precisely (RMSE ≈ 0.2). That is the supergeo idea carried into the panel world: match on how the series move, not on a single number.
References#
Chen, A., Doudchenko, N., Jiang, S., Stein, C., & Ying, B. (2023). “Supergeo Design: Generalized Matching for Geographic Experiments.” arXiv:2301.12044.
Shaw, C. (2025). “Optimized Supergeo Design: A Scalable Framework for Geographic Marketing Experiments.” arXiv:2506.20499.
Li, K. T. (2023). “Frontiers: A Simple Forward Difference-in-Differences Method.” Marketing Science 43(2):267-279.
Li, K. T., & Van den Bulte, C. (2022). “Augmented Difference-in-Differences.” Marketing Science 42(4):746-767.
Abadie, A., & Zhao, J. (2026). “Synthetic Controls for Experimental Design.” Working paper.
Abadie, A., Diamond, A., & Hainmueller, J. (2010). “Synthetic Control Methods for Comparative Case Studies.” Journal of the American Statistical Association 105(490):493-505.