Parallel-Trends Supergeo Design (PANGEO)
=========================================

.. currentmodule:: mlsynth

When to Use This Estimator
--------------------------

PANGEO is a tool for designing a geo experiment — deciding, *before*
you run it, which geographic markets to treat and which to hold out, so
that the after-the-fact comparison is as clean as possible. Use it when:

- you can assign treatment at the geo level (turn an ad campaign,
  price, or feature on in some markets and not others);
- you have a panel of pre-period history (weekly/monthly sales,
  conversions, GMV…) for every candidate geo;
- the number of geos is modest (tens, not thousands), as is typical
  with DMAs/regions; and
- you want the eventual treatment-effect estimate to be precise —
  i.e. you want treated and control markets that already move together.

A worked geo-experiment. A brand wants to measure the incremental
sales from a new TV/CTV campaign. Ads are bought at the DMA level, so the
experimental units are ~50–210 large, heterogeneous markets — not
exchangeable shoppers. The plan: run the campaign in a *treatment* set of
DMAs, withhold it in a *control* set, and read the post-launch sales gap
as the lift. The whole experiment lives or dies on the split: if the
treated DMAs were already trending differently from the controls, the
post-launch gap mixes the campaign effect with that pre-existing
divergence. PANGEO reads the DMAs' pre-period sales panel and chooses the
treatment/control split (bundling DMAs into balanced *supergeos*) so the
two sides' sales trajectories run parallel beforehand — turning the
post-period gap into a clean read on the campaign.

PANGEO is two stages:

#. Design (pre-period data only). Each arm's geos are grouped into
   composite *supergeos* and formed into balanced pairs --- with no geo
   trimmed --- so that pre-period parallelism is maximised. By default this
   partition is found by a fast clustering heuristic (the OSD analogue of
   Shaw 2025); an exact set-partitioning mixed-integer program is available
   as an opt-out (``fast=False``). Both optimise the same parallelism
   objective. A power analysis reports the minimum detectable effect (MDE)
   implied by the chosen supergeo size :math:`Q`.

#. Evaluation (after the experiment). The *same* design is scored
   against the realised outcomes with the Augmented
   Difference-in-Differences estimator of Li & Van den Bulte (2022),
   giving the ATT, percent ATT, and CIs at the arm and program levels.

The two stages share one quantity: both the design objective and the
standard error of the realised effect are governed by the variance of the
supergeo *gap* residual. Minimising non-parallelism simultaneously
minimises the MDE and tightens the CI — optimising parallelism is
optimising inferential precision.

This is a principled deviation from Google's Supergeo Design. Chen et
al. (2023) and OSD (Shaw 2025) match supergeos on a scalar summary —
the summed baseline response, or a few covariate totals — which collapses
the time dimension. PANGEO matches on the full pre-period trajectory.
That difference is not cosmetic: the downstream analysis is a
difference-in-differences, which *differences trajectories over time*, so
two markets with identical totals but different seasonal shapes are not
interchangeable for it even though scalar matching scores them as a
perfect match. In trending, seasonal data — which is essentially all
geo-marketing data — matching on shape rather than on a single number is
what makes the post-period comparison valid. (The simulation at the end
of this page quantifies the gap: when geos share a baseline mean but
differ in shape, PANGEO recovers the effect ~30× more precisely than a
scalar match.)

When *not* to use it. If the assignment is already fixed (an
observational study, or a campaign that already ran in specific markets),
there is no design to choose — use an estimation-stage method
(:doc:`tssc`, :doc:`sbc`, :doc:`fdid`). Large geo pools (hundreds of
markets) are handled by the default clustering partition (see *Forming the
supergeos* below); only the exact opt-out program is scale-limited. And if
the outcome is plausibly stationary with no trend or seasonality, scalar
matching is already adequate and simpler.

What is a supergeo?
^^^^^^^^^^^^^^^^^^^

Geo experiments differ from ordinary A/B tests in one decisive way: the
experimental units are a *small* number of *large, heterogeneous* aggregates
--- markets, regions, DMAs --- rather than many exchangeable individuals.
Randomising treatment across a handful of dissimilar markets routinely
produces treatment and control groups with very different baseline
characteristics, and the resulting post-randomisation bias does not average
away over the single assignment a practitioner actually runs (Abadie & Zhao
2026). Classic matched-pair designs help, but with heterogeneous geos there
may be *no* good one-to-one match for a given market.

A supergeo resolves this by relaxing the unit of matching. Rather than
insisting that single geos match, geos are pooled into composite aggregates:
a supergeo is simply a bundle of geos treated as one unit, with outcome equal
to their (population-weighted) mean. Composite units can be made comparable
even when their constituents are not --- a small, noisy market combined with
a complementary one can, in aggregate, track another composite closely. The
design then pairs *supergeos*, randomises treatment within each pair, and ---
unlike trimming-based approaches --- assigns every geo to some supergeo,
so the experiment spans the entire market with nothing discarded (Chen,
Doudchenko, Jiang, Stein & Ying 2023).

PANGEO keeps this structure but changes *what* the supergeos are matched on.
Supergeo Design and OSD match on a scalar summary (the summed response, or a
few covariate totals), which collapses the time dimension; PANGEO matches on
the full pre-treatment trajectory, choosing pairs whose aggregate paths
run as parallel as possible. The reason is that the downstream
difference-in-differences analysis differences trajectories: two markets with
identical totals but different seasonal shapes are not interchangeable for
it, even though scalar matching treats them as equivalent.

Setup and notation
------------------

Let :math:`y_{it}` denote the outcome of geo
:math:`i \in \mathcal{N} \coloneqq \{1,\dots,N\}` in period
:math:`t \in \mathcal{T} \coloneqq \{1,\dots,T\}`. The first :math:`T_0`
periods are the pre-treatment (design) window
:math:`\mathcal{T}_1 \coloneqq \{t \in \mathcal{T} : t \le T_0\}` and the
remaining :math:`|\mathcal{T}_2| = T - T_0` periods are the experimental window
:math:`\mathcal{T}_2 \coloneqq \{t \in \mathcal{T} : t > T_0\}`.
A single categorical column assigns each geo to an arm; arms occupy
disjoint geo pools :math:`\mathcal{N}_a` and are designed independently, so
the exposition below fixes one arm and drops the arm subscript.

Throughout we maintain the linear factor model used by both the synthetic-
control and DiD literatures (Abadie, Diamond & Hainmueller 2010; Li &
Van den Bulte 2022) for the no-treatment potential outcome,

.. math::
   :label: factor

   y_{it}^{N} = \delta_t + \boldsymbol{\theta}_t^{\top} \mathbf{z}_i
                + \boldsymbol{\lambda}_t^{\top}\boldsymbol{\mu}_i
                + \varepsilon_{it},

where :math:`\delta_t` is a common time effect, :math:`\mathbf{z}_i` are
observed covariates with time-varying loadings :math:`\boldsymbol{\theta}_t`,
:math:`\boldsymbol{\mu}_i` are unobserved factor loadings with factors
:math:`\boldsymbol{\lambda}_t`, and :math:`\varepsilon_{it}` is mean-zero
idiosyncratic noise.

A supergeo is a set :math:`\mathcal{S}` of same-arm geos with aggregate
trajectory

.. math::

   \bar y_{\mathcal{S},t}
   \coloneqq \frac{\sum_{i\in \mathcal{S}}\omega_i\,y_{it}}{\sum_{i\in \mathcal{S}}\omega_i},

where :math:`\omega_i>0` are aggregation weights (the ``weight_col``
population, or :math:`\omega_i\equiv 1`). A pair
:math:`p=(\mathcal{A}_p,\mathcal{B}_p)` consists of two disjoint supergeos with
:math:`|\mathcal{A}_p|,|\mathcal{B}_p|\le Q`; :math:`\mathcal{A}_p` is the
treatment half and :math:`\mathcal{B}_p` the control half. Its gap is

.. math::
   :label: gap

   g_{p,t} \coloneqq \bar y_{\mathcal{A}_p,t} - \bar y_{\mathcal{B}_p,t}.

Under :eq:`factor` the common time effect cancels and

.. math::
   :label: gapdecomp

   g_{p,t}
       = \boldsymbol{\lambda}_t^{\top}\big(\bar{\boldsymbol{\mu}}_{\mathcal{A}_p}-\bar{\boldsymbol{\mu}}_{\mathcal{B}_p}\big)
       + \boldsymbol{\theta}_t^{\top}\big(\bar{\mathbf{z}}_{\mathcal{A}_p}-\bar{\mathbf{z}}_{\mathcal{B}_p}\big)
       + \big(\bar\varepsilon_{\mathcal{A}_p,t}-\bar\varepsilon_{\mathcal{B}_p,t}\big),

with :math:`\bar{\boldsymbol{\mu}}_{\mathcal{S}}, \bar{\mathbf{z}}_{\mathcal{S}}`
the weighted means over :math:`\mathcal{S}`. The
pair exhibits parallel trends precisely when the loadings are balanced,
:math:`\bar{\boldsymbol{\mu}}_{\mathcal{A}_p}=\bar{\boldsymbol{\mu}}_{\mathcal{B}_p}`
(and :math:`\bar{\mathbf{z}}_{\mathcal{A}_p}=\bar{\mathbf{z}}_{\mathcal{B}_p}`);
the gap is then constant in expectation and a difference-in-differences
comparison within the pair is unbiased.

Identifying assumptions
-----------------------

PANGEO can make a good experiment *likely*, but it cannot manufacture one
that the data do not support. Three assumptions underpin it, and each
points at a way the method can fail or stall.

1. Parallel trends (the crux). The design maximises parallelism in the
   pre-period; the validity of the Stage-2 effect estimate rests on that
   parallelism persisting into the post-period absent treatment — i.e.
   the treatment and control supergeos *would have continued to move
   together* had the campaign never launched. This is exactly the
   difference-in-differences parallel-trends assumption, and it is the
   assumption PANGEO is organised around.

   *Remark.* PANGEO *optimises* pre-period parallelism (making the
   assumption as plausible as the data allow) but cannot guarantee it holds
   out-of-sample. If a shock hits the treated markets, a competitor reacts
   only there, or the pre-period co-movement was coincidental, the
   post-period gap diverges on its own and the ATT is biased — the same
   Achilles' heel as any DiD. The achieved parallelism :math:`R^2` (low
   values mean no balanced design exists), the reported MDE, and a placebo /
   blank-window check on the held-out pre-period are the diagnostics that
   flag the risk.

2. A linear factor structure for the no-treatment outcomes
   (Eq. :eq:`factor`). The gap decomposition that makes "match on trajectory"
   equivalent to "balance the factor loadings" relies on this model. It is
   the standard synthetic-control / interactive-fixed-effects assumption.

   *Remark.* The model is mild for sales-like panels, but a wildly
   non-factor outcome (e.g. one driven by an idiosyncratic, unit-specific
   regime change) is not balanceable by any partition — no split of the geos
   can render :eq:`gapdecomp` constant in expectation.

3. A modest, designable geo pool. Each arm needs enough geos to form
   at least one supergeo pair, and the geos must be heterogeneous-but-
   matchable.

   *Remark.* A pool that cannot form a balanced pair (e.g. two wildly
   different markets) has no good design to find. Scale itself is not a
   barrier --- the default clustering partition handles large pools (see
   *Forming the supergeos* below) --- only the exact opt-out program is
   scale-limited.

When PANGEO Fails or Stalls
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The concrete failure / stall modes:

- Parallel trends breaks post-launch — the dominant risk, above. No
  design fixes it; only the diagnostics warn of it.
- No matchable structure. If the geos are so heterogeneous that *no*
  partition achieves high parallelism, PANGEO still returns the best
  feasible design, but the parallelism :math:`R^2` stays low and the MDE
  blows up — a signal that a geo experiment here is underpowered and the
  read will be noisy regardless of split.
- The exact program stalls. Set partitioning is NP-hard, so the exact
  opt-out program (``fast=False``) can be slow or intractable with many geos
  and a large supergeo size :math:`Q`. The default clustering partition
  avoids this --- it is :math:`O(n\log n)` and, under cluster structure,
  matches the exact optimum (see *Forming the supergeos*). If you
  nonetheless need the exact program, cap :math:`Q` (smaller supergeos), use
  the automatic :math:`Q` selection, or raise ``min_pairs``.
- Too few geos / arms. A pool that cannot form a balanced pair (e.g.
  two wildly different markets) has no good design to find.

In short: PANGEO improves the *plausibility* of parallel trends by
construction and *quantifies the residual risk* (parallelism
:math:`R^2`, MDE), but it inherits DiD's identifying assumption rather
than removing it. Treat a low parallelism :math:`R^2` or a large MDE as
the design telling you the experiment is fragile.

Stage 1 --- the supergeo design
-------------------------------

The parallelism objective
^^^^^^^^^^^^^^^^^^^^^^^^^^

The pre-treatment window is split into an estimation window
:math:`\mathcal E` (the first :math:`\lfloor \kappa T_0\rfloor` periods,
:math:`\kappa=` ``frac_E``, default :math:`0.7`) and a held-out blank
window :math:`\mathcal B=\{1,\dots,T_0\}\setminus\mathcal E`. A pair is
scored by the variance of its *level-removed* gap over the estimation
window,

.. math::
   :label: score

   c(p) \coloneqq \sum_{t\in\mathcal E}\big(g_{p,t}-\bar g_p\big)^2,
   \qquad
   \bar g_p \coloneqq \frac{1}{|\mathcal E|}\sum_{t\in\mathcal E} g_{p,t},

which is exactly the pre-period residual sum of squares of a
difference-in-differences fit (cf.
:func:`mlsynth.utils.selector_helpers._did_from_mean`). Taking expectations
under :eq:`gapdecomp` with balanced covariates,

.. math::

   \mathbb E\,c(p)
     = \big(\bar{\boldsymbol{\mu}}_{A_p}-\bar{\boldsymbol{\mu}}_{B_p}\big)^{\top}
       \Big[\textstyle\sum_{t\in\mathcal E}(\boldsymbol{\lambda}_t-\bar{\boldsymbol{\lambda}})
            (\boldsymbol{\lambda}_t-\bar{\boldsymbol{\lambda}})^{\top}\Big]
       \big(\bar{\boldsymbol{\mu}}_{A_p}-\bar{\boldsymbol{\mu}}_{B_p}\big)
     \;+\; \mathbb E\!\sum_{t\in\mathcal E}
       \big(\bar\varepsilon_{A_p,t}-\bar\varepsilon_{B_p,t}-\overline{\cdot}\big)^2 .

The first term is a positive-definite quadratic form in the loading
imbalance, so minimising :eq:`score` drives
:math:`\bar{\boldsymbol{\mu}}_{A_p}\to\bar{\boldsymbol{\mu}}_{B_p}` --- it balances the *unobserved*
factor loadings, which is what parallel-trends DiD requires. The
time-constant component of the loading difference is absorbed by the level
shift :math:`\bar g_p` and never penalised: two supergeos may differ
arbitrarily in level yet match perfectly in *shape*. Scalar sum-matching,
by contrast, collapses the time dimension and is blind to shape.

Forming the supergeos
^^^^^^^^^^^^^^^^^^^^^^

Given the per-pair cost :eq:`score`, the design must partition each arm's geos
into supergeo pairs of minimum total non-parallelism. PANGEO offers two solvers
for this, and they optimise the *same* objective :eq:`score`. The econometric
content of the design --- balancing the factor loadings of :eq:`gapdecomp`,
i.e. making the supergeo halves move in parallel --- is therefore identical
whichever solver runs; they differ only in how they search. Both run
independently within each arm (arms hold disjoint geos), and both produce an
exact cover: every geo is assigned to exactly one supergeo pair, with nothing
trimmed.

Clustering partition (default, ``fast=True``). Geo-experiment panels almost
always have cluster structure: markets fall into a handful of latent types
that move together up to a level shift (shared seasonality, regional demand,
category-level trends). When that structure is present the best supergeos are
just groups of same-type geos, which can be found by clustering rather than by
combinatorial search. The default solver --- an analogue of OSD (Shaw 2025)
for the trajectory objective --- forms the supergeos in five plain steps:

#. Level removal. Each geo's pre-period trajectory has its own time-mean
   subtracted, leaving the *shape* :math:`y_{it}-\bar y_i`. This is the same
   demeaning as the score :eq:`score`: two geos moving in parallel at any
   level become identical, so the grouping targets parallel trends rather
   than matching levels.
#. Embedding. The shapes are projected onto their leading principal
   components (denoising; under the factor model :eq:`factor` the shapes span
   the factor space, which PCA recovers).
#. Clustering. Hierarchical (Ward) linkage orders the geos so that
   parallel-moving markets are adjacent.
#. Size-bounded grouping. The ordering is cut into contiguous groups of size
   :math:`2,\dots,2Q` --- each splittable into two halves of size :math:`\le
   Q` --- giving the exact cover. ``min_pairs`` caps the group size so that at
   least that many pairs form.
#. Per-group split. Within each group the treatment and control halves are
   the split minimising :eq:`score` --- the same best split the exact program
   uses.

A handful of candidate groupings are generated (varying the linkage rule and a
small embedding perturbation, as in OSD) and the one with the lowest total
:eq:`score` is kept; ``fast_candidates`` sets how many. The cost is
:math:`O(n\log n)`, against the exponential search below.

*Remark.* The clustering partition is a heuristic: its total cost is
:math:`\ge` the exact optimum, with equality when the trajectory clusters are
clean. Checked against a structure-aware oracle (group by the true latent
clusters, then split each exactly), on clustered panels it *matches* the exact
optimum --- e.g. at :math:`N=60` geos with supergeos up to :math:`Q=6` it
recovers the oracle design in well under a second, a regime where the exact
program cannot even begin (it enumerates :math:`O(N^{2Q})` candidate subsets).
This is why it is the default. On data with no cluster structure the gap to the
optimum can be large; there the exact program is preferable when affordable.

The exact set-partitioning program (``fast=False``)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The exact solver enumerates every admissible pair and selects the optimal
exact cover by mixed-integer programming. Let :math:`\mathcal F` be the family
of admissible pairs: every subset
of the arm's geos of size :math:`2,\dots,2Q` that can be split into two
halves each of size :math:`\le Q`, each subset scored at its best such split
by :eq:`score`. Let :math:`\mathbf{M}\in\{0,1\}^{N\times|\mathcal F|}` be the geo-by-
pair incidence matrix (:math:`M_{iG}=1` iff geo :math:`i\in G`) and
:math:`c_G` the score of pair :math:`G`. The design solves the
set-partitioning program

.. math::
   :label: mip

   \min_{\mathbf{x}\in\{0,1\}^{|\mathcal F|}} \sum_{G\in\mathcal F} c_G\,x_G
   \quad\text{s.t.}\quad
   \mathbf{M}\mathbf{x} = \mathbf 1 \ \ (\text{exact cover}),\qquad
   \mathbf 1^{\top} \mathbf{x} \ge \kappa_{\min}\ \ (\text{minimum pairs}),

solved with ``cvxpy`` and the HiGHS mixed-integer backend. The exact-cover
constraint :math:`\mathbf{M}\mathbf{x}=\mathbf 1` assigns every geo to exactly one chosen
pair (no geo is trimmed). Because each :math:`c_G` is *precomputed* offline,
the objective is linear in :math:`\mathbf{x}` --- the program is a mixed-integer
*linear* program regardless of the (possibly nonlinear) per-pair cost,
which is what keeps it tractable. Within each chosen pair the treatment and
control halves are the score-minimising split; which half is actually
treated is randomised in the field.

Per-pair objectives
^^^^^^^^^^^^^^^^^^^^

The ``objective`` argument selects the per-pair cost :math:`c_G`; all three
choices leave :eq:`mip` a linear program. Writing :math:`g_t` for the gap
of a candidate split and :math:`\bar g` for its estimation-window mean,

* ``"ss_res"`` (default) --- the absolute residual sum of squares
  :math:`\sum_t (g_t-\bar g)^2`. Scale-dependent, so high-amplitude pairs
  weigh more and the design prioritises making large markets parallel.
* ``"r2"`` --- the scale-free criterion
  :math:`1-R^2 = \sum_t(g_t-\bar g)^2 / \sum_t(\bar Y_{A,t}-\overline{\bar Y_A})^2`,
  so every pair counts equally (FDID's :math:`R^2` criterion, optimised
  *exactly* by the program rather than greedily).
* ``"weighted"`` --- a recency-weighted residual SS
  :math:`\sum_t w_t (g_t-\bar g_w)^2`, the level removed at the weighted mean
  :math:`\bar g_w`, with weights :math:`w_t=\rho_{\mathrm{dec}}^{\,T_0-1-t}`
  (``recency_decay``), up-weighting the recent pre-period closest to the
  experiment.

The per-pair ``gap_variance`` and ``parallelism_r2`` reported on the result
are always the unweighted quantities of :eq:`score`, so designs from
different objectives are comparable on a common yardstick.

Supergeo size :math:`Q` and automatic selection
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Setting ``max_supergeo_size`` :math:`=Q=1` recovers the classic
matched-pairs design; :math:`Q>1` permits composite supergeos when no single
geo matches another well, without trimming. :math:`Q` is a granularity knob
with an interior optimum: too small and no parallel matches exist
(singleton geos are too noisy); too large and the arm yields few, coarse
pairs. The program-level MDE is *not* monotone in :math:`Q` and is *not*
tracked by the parallelism :math:`R^2` (which is scale-free and rises with
:math:`Q`); only the absolute residual variance that drives power matters.

Consequently, if ``max_supergeo_size`` is left unset, PANGEO selects
:math:`Q` automatically: it solves :eq:`mip` for every feasible
:math:`Q\in\{1,\dots,\min(\lceil N/2\rceil, 6)\}` and returns the design
with the smallest mean program MDE. The full sweep --- each :math:`Q`'s
program-pair count, mean program MDE, and the :math:`2/2^{P}`
randomisation-inference p-value floor for :math:`P` pairs --- is recorded in
``results.metadata["q_sweep"]`` and the choice in
``results.metadata["q_selected"]``, so the decision is auditable and may be
overridden with an explicit :math:`Q`.

Balancing baseline covariates
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Parallelism is level-blind: by :eq:`score` the level shift
:math:`\bar g_p` absorbs any time-constant gap, so a baseline characteristic
(population, income) that merely shifts a market's level is differenced out
and never enters the trajectory score. This is correct for parallel-trends
DiD but says nothing about balance on such characteristics --- the role of
OSD's scalar covariate matching. PANGEO restores it with a standardised
mean-difference penalty appended to :eq:`score`,

.. math::
   :label: covpen

   c(p) \;\longmapsto\; c(p)
     \;+\; \sum_{m} \omega^{\mathrm{cov}}_m
       \Big(\frac{\bar c_{A_p,m}-\bar c_{B_p,m}}{s_m}\Big)^2,

the weighted squared standardised mean difference (SMD) between the halves'
covariate means, where :math:`s_m` is the cross-geo standard deviation
(``standardize_covariates``, default ``True``) and :math:`\omega^{\mathrm{cov}}_m`
a per-covariate weight (``covariate_weights``, default :math:`1`). Because
:eq:`covpen` is precomputed it preserves linearity in :eq:`mip`. Larger
weights buy tighter covariate balance at the cost of some parallelism; the
achieved per-pair SMDs are reported in ``SupergeoPair.covariate_smd``. Pass
``covariates=[...]`` (baseline columns, each reduced to its per-geo mean) to
enable; with no covariates the design is unchanged. This is also the
Abadie & Zhao (2026, Thm. 1) prescription --- moving structure from the
unobserved :math:`\boldsymbol{\mu}_i` into the observed :math:`\mathbf{z}_i` lowers the
estimator's bias --- and the Stage-2 device for restoring inferential
validity (below).

.. code-block:: python

   df = make_seasonal_sales_panel(units_per_arm=6, arms=("A", "B", "C"),
                                  T=104, seed=0, covariates=True)

   res = PANGEO({
       "df": df, "outcome": "sales", "arm": "arm",
       "unitid": "unit", "time": "time", "max_supergeo_size": 3,
       "covariates": ["population", "income"],
       "covariate_weights": {"population": 5.0, "income": 5.0},
   }).fit()

   for arm, design in res.arm_designs.items():
       for p in design.pairs:
           print(p.treatment, p.control, p.parallelism_r2, p.covariate_smd)

Power and the minimum detectable effect
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Because power and the design objective are governed by the same supergeo
gap residual, :meth:`mlsynth.PANGEO.fit` returns a power analysis
(``results.power``). For pair :math:`p` the per-period noise is estimated
honestly on the held-out blank window :math:`\mathcal B` (out of sample
with respect to the optimisation) as the residual of the *same counterfactual
model used at evaluation* (:eq:`adid`) --- fit on the estimation window
:math:`\mathcal E`, evaluated on :math:`\mathcal B`:

.. math::

   \widehat\sigma_p^2
     \coloneqq \frac{1}{|\mathcal B|-1}\sum_{t\in\mathcal B} \widehat e_{p,t}^2 .

Using the evaluation model here (the augmented-DiD residual by default,
or the plain level-removed gap when ``att_augment=False``) rather than a
fixed recipe keeps the projected MDE and the realised standard error
(:eq:`adidvar`) coherent. The :math:`X`-period effect for the pair then has
variance :math:`\widehat\sigma_p^2\,[f(X,\rho)+f(T_0,\rho)]`, where

.. math::

   f(n,\rho) = \frac{1}{n}\Big(1 + 2\sum_{k=1}^{n-1}(1-\tfrac{k}{n})\rho^{k}\Big)

is the variance-inflation factor of the mean of :math:`n` AR(1)-correlated
periods and :math:`\rho` is the pooled lag-1 autocorrelation of the blank
residuals. Serial correlation is decisive: weekly sales are highly
autocorrelated, so :math:`X` post weeks are worth far fewer than :math:`X`
independent observations and adding post periods yields sharply diminishing
returns --- the trap a naive i.i.d. power calculation falls into.

The program-level effect is the treated-size-weighted average of the pair
effects, with weights
:math:`w_p = (\sum_{i\in A_p}\omega_i)/\sum_{q}\sum_{i\in A_q}\omega_i`, and
(treating pairs as independent across the program)

.. math::

   \widehat{\operatorname{Var}}(\widehat\tau_{\mathrm{prog}})
     = \sum_p w_p^2\,\widehat\sigma_p^2\,\big[f(X,\rho)+f(T_0,\rho)\big],
   \qquad
   \mathrm{MDE}(X) = \big(z_{1-\alpha/2}+z_{1-\beta}\big)\,
       \sqrt{\widehat{\operatorname{Var}}(\widehat\tau_{\mathrm{prog}})}.

The program level is the headline: small arms are individually
under-powered (with :math:`P` pairs a pure within-pair randomisation test
has a hard p-value floor of :math:`2/2^{P}`, so one needs :math:`P\ge 6` to
reach :math:`p<0.05`), whereas pooling across arms gives the program an
effective sample size equal to the total pair count and routinely detects
effects several points smaller than any one arm. Per-arm curves are stored
in ``results.power.arms``. The MDE is reported in outcome units and as a
percent of the treated baseline, by default at :math:`1-\beta=0.80` power
for horizons :math:`X=2,\dots,12`; ``power_target``, ``power_alpha`` and
``power_post_periods`` configure this and ``compute_power=False`` skips it.

.. code-block:: python

   res = PANGEO({
       "df": df, "outcome": "sales", "arm": "arm",
       "unitid": "unit", "time": "time", "max_supergeo_size": 3,
   }).fit()

   pw = res.power
   print(f"serial correlation rho = {pw.serial_correlation:.2f}")
   print(pw.summary())                       # MDE % by horizon: program + arms
   print(pw.program.mde_pct_by_horizon()[8]) # detectable % lift after 8 weeks
   print(pw.power_for_effect(effect_pct=5.0, post_periods=8))  # invert: power

Stage 2 --- evaluation by Augmented DiD
---------------------------------------

The estimator
^^^^^^^^^^^^^

Once the experiment has run, pass a ``post_col`` (a :math:`0/1` indicator of
post-treatment periods, as in LEXSCM). The design is rebuilt on the pre rows
alone --- so it is identical to the design-only result --- and
``results.effects`` carries the realised ATT at the arm and program
levels using the Augmented Difference-in-Differences estimator of Li &
Van den Bulte (2022).

Fix a level (an arm, or the program) and write :math:`y^{T}_t` for its
treated supergeo aggregate and :math:`y^{C}_t` for its control supergeo
aggregate, both treated-size-weighted across the level's pairs. The
counterfactual is the pre-period least-squares projection

.. math::
   :label: adid

   y^{T}_t = \delta_1 + \delta_2\,y^{C}_t + \gamma\,t + e_t,
   \qquad t = 1,\dots,T_0 .

This *augments* plain DiD in two ways: the control scale :math:`\delta_2` is
estimated rather than fixed at :math:`1`, and a linear time trend
:math:`\gamma t` is included (``att_augment`` and ``att_trend``, both default
``True``). With regressor :math:`\mathbf{x}_t=(1,\,y^{C}_t,\,t)^{\top}` and OLS
estimate :math:`\widehat{\boldsymbol{\delta}}`, the per-period effect and the ATT are

.. math::

   \widehat u_t \coloneqq y^{T}_t - \mathbf{x}_t^{\top}\widehat{\boldsymbol{\delta}},
   \qquad
   \widehat\tau \coloneqq \frac{1}{T_{\mathrm{post}}}
       \sum_{t=T_0+1}^{T} \widehat u_t .

The percent ATT is taken relative to the post-period counterfactual
(cf. :func:`mlsynth.utils.resultutils.effects.calculate`), not the
pre-treatment baseline:

.. math::

   \widehat\tau_{\%} \coloneqq 100\times\frac{\widehat\tau}{\bar y^{0}_{\mathrm{post}}},
   \qquad
   \bar y^{0}_{\mathrm{post}}
     \coloneqq \frac{1}{T_{\mathrm{post}}}\sum_{t=T_0+1}^{T} \mathbf{x}_t^{\top}\widehat{\boldsymbol{\delta}}
     = \frac{1}{T_{\mathrm{post}}}\sum_{t=T_0+1}^{T}\big(y^{T}_t-\widehat u_t\big).

Inference
^^^^^^^^^

Li & Van den Bulte (2022, Prop. 3.1--3.3) show
:math:`\sqrt{T_{\mathrm{post}}}\,(\widehat\tau-\tau)\xrightarrow{d}
N(0,\Sigma_1+\Sigma_2)`, where :math:`\Sigma_1` is the variance from
estimating :math:`\boldsymbol{\delta}` and :math:`\Sigma_2` from averaging the
post-period errors. Their Web Appendix C.13 gives the
prediction-variance estimator

.. math::
   :label: adidvar

   \widehat{\operatorname{Var}}(\widehat\tau)
     = \widehat\omega^2\Big[\,
         \bar{\mathbf{x}}_{\mathrm{post}}^{\top}
         \Big(\textstyle\sum_{t=1}^{T_0} \mathbf{x}_t \mathbf{x}_t^{\top}\Big)^{-1}
         \bar{\mathbf{x}}_{\mathrm{post}}
         \;+\; \frac{1}{T_{\mathrm{post}}}\Big],

with :math:`\bar{\mathbf{x}}_{\mathrm{post}}=T_{\mathrm{post}}^{-1}\sum_{t>T_0}\mathbf{x}_t`.
The first bracketed term is :math:`\Sigma_1` (it inflates automatically when
the post-period control drifts outside its pre-period range, pricing the
extrapolation uncertainty) and the second is :math:`\Sigma_2`. The residual
variance :math:`\widehat\omega^2` is estimated over the long pre-period as a
Newey--West/Bartlett long-run variance with truncation lag
:math:`\lfloor T_0^{1/4}\rfloor` (Li & Van den Bulte's
:math:`O(T^{1/4})` rule); lag :math:`0` is the i.i.d. case
:math:`\widehat\omega^2=\widehat e^{\top}\widehat e/(T_0-k)` for :math:`k` regressors.
The confidence interval is
:math:`\widehat\tau \pm z_{1-\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\widehat\tau)}`
and the p-value is the two-sided normal test of :math:`\tau=0`.

Why this estimator suits the supergeo gap
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Li & Van den Bulte's regularity conditions (Assumptions C2--C3) explicitly
admit trend and unit-root (integrated) common factors :math:`\boldsymbol{\lambda}_t`
--- the regimes under which naive i.i.d. standard errors collapse. The
mechanism is the augmentation: regressing the treated aggregate on a
*scaled* control is a cointegrating regression, and a single
:math:`\delta_2` cancels a shared integrated factor in :eq:`gapdecomp`,
while :math:`\gamma t` absorbs deterministic drift. The validity condition
reduces to a single requirement --- that the regression residual
:math:`e_t` be (weakly dependent) stationary --- which the augmentation and
trend deliver. The design's parallelism is retained throughout; it minimises
the residual variance, which by :eq:`adidvar` directly tightens the standard
error --- and, because the power analysis reads the *same* held-out residual,
the planning MDE as well.

Plain DiD as an option. Setting ``att_augment=False`` (and optionally
``att_trend=False``) recovers Li & Van den Bulte's ordinary
difference-in-differences --- ``y^{T}_t - y^{C}_t = \delta_1 [+ \gamma t] +
e_t`` with the control coefficient fixed at one --- and the power analysis
follows suit, so the two stages stay coherent. A head-to-head Monte-Carlo
comparison (R\ :sup:`2` design + plain DiD versus the augmented defaults)
found augmented DiD both more precise (lower realised MDE) and
better-covering across the stationary, trend-plus-seasonal and
integrated-factor regimes, because plain DiD has no mechanism to absorb a
control-scale mismatch or a trend and leaves that structure in its residual.
The augmented estimator is therefore the default; plain DiD remains available
for settings where its textbook simplicity is preferred.

Validity envelope (smoke tests)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A Monte-Carlo study over the bundled simulator confirms that validity hinges
on residual stationarity, not on the interval recipe. The simulator can
place the unobserved factor on an i.i.d., AR(1) or random-walk process
(``factor``) and toggle the seasonal amplitude (``season_amp``) and per-geo
trend (``trend_sd``):

.. list-table::
   :header-rows: 1
   :widths: 50 25 25

   * - Gap structure
     - Program coverage (nominal 0.95)
     - Type-I (nominal 0.05)
   * - stationary i.i.d. factor (paper DGP)
     - 0.93
     - 0.07
   * - + linear trend + seasonality
     - 0.87
     - 0.13
   * - + integrated factor (random walk)
     - 0.60
     - 0.40

The point estimate is unbiased in every regime. On a stationary gap ---
matching Li & Van den Bulte's factor-model design --- the prediction-
variance interval is at its nominal rate; the augmentation and trend
regressor recover most of the coverage lost to a deterministic trend and
seasonality; and the adversarial random-walk-plus-seasonality gap, where two
integrated factors and amplitude-heterogeneous seasonality exceed what a
single :math:`\delta_2` can cointegrate, marks the honest assumption
boundary. In practice the fitted :math:`\widehat\delta_2` (reported as
``AttEstimate.scale``) and the residual diagnose whether the assumption
holds; if a single scale cannot flatten the gap, add covariate or seasonal
regressors before trusting the interval.

Because the power analysis now uses the evaluation model's held-out
residual, the planning MDE is calibrated to the realised standard error:
on the stationary gap the projected MDE matches the realised value to within
roughly 7% (ratio :math:`\approx 0.93`), and on integrated gaps it is
conservative (over-states the MDE), the safe direction. These experiments
live in ``mlsynth/tests/test_pangeo.py`` (``TestADIDInference``).

.. code-block:: python

   df = make_seasonal_sales_panel(units_per_arm=6, arms=("A", "B", "C"),
                                  T=104, seed=0, n_post=8)
   res = PANGEO({
       "df": df, "outcome": "sales", "arm": "arm",
       "unitid": "unit", "time": "time", "post_col": "post_col",
       "max_supergeo_size": 3,
       "att_augment": True, "att_trend": True,   # Augmented DiD (defaults)
   }).fit()

   print(res.effects.summary())           # program + per-arm ATT, SE, CI, p
   pe = res.effects.program
   print(f"program ATT = {pe.att_pct:.1f}% "
         f"[{pe.ci_lower_pct:.1f}, {pe.ci_upper_pct:.1f}], "
         f"p={pe.p_value:.3f}, scale delta_2={pe.scale:.2f}")

Core API
--------

.. automodule:: mlsynth.estimators.pangeo
   :members:
   :undoc-members:
   :show-inheritance:

Configuration
-------------

.. autoclass:: mlsynth.config_models.PANGEOConfig
   :members:
   :undoc-members:

Helper Modules
--------------

.. automodule:: mlsynth.utils.pangeo_helpers.parallelism
   :members:
   :undoc-members:

.. automodule:: mlsynth.utils.pangeo_helpers.mip
   :members:
   :undoc-members:

.. automodule:: mlsynth.utils.pangeo_helpers.setup
   :members:
   :undoc-members:

.. automodule:: mlsynth.utils.pangeo_helpers.pipeline
   :members:
   :undoc-members:

.. automodule:: mlsynth.utils.pangeo_helpers.power
   :members:
   :undoc-members:

.. automodule:: mlsynth.utils.pangeo_helpers.effects
   :members:
   :undoc-members:

.. automodule:: mlsynth.utils.pangeo_helpers.simulation
   :members:
   :undoc-members:

.. automodule:: mlsynth.utils.pangeo_helpers.structures
   :members:
   :undoc-members:

Example
-------

A seasonal, multi-arm sales panel (the bundled simulator), designed into
parallel supergeo pairs. With ``display_graphs=True`` PANGEO plots, per arm,
the observed treated supergeo aggregate against the augmented-DiD
counterfactual prediction (:func:`mlsynth.utils.pangeo_helpers.effects.adid_counterfactual`):
the in-sample pre-period fit when designing, and -- once ``post_col`` data
are supplied -- the counterfactual extended past the treatment date, so the
post-window gap is the estimated effect.

.. code-block:: python

   from mlsynth import PANGEO
   from mlsynth.utils.pangeo_helpers import make_seasonal_sales_panel

   # 3 arms (non-overlapping geos), 6 geos each, 156 weeks of history.
   df = make_seasonal_sales_panel(units_per_arm=6, arms=("A", "B", "C"),
                                  T=156, seed=0)

   res = PANGEO({
       "df": df,
       "outcome": "sales",
       "arm": "arm",                # single categorical arm column
       "unitid": "unit",
       "time": "time",
       "max_supergeo_size": 3,      # Q
   }).fit()

   for arm, design in res.arm_designs.items():
       print(f"Arm {arm}: {len(design.pairs)} pair(s), "
             f"parallel-trends R^2 = {design.mean_parallelism_r2:.3f}")
       for p in design.pairs:
           print(f"   T={p.treatment}  C={p.control}  R^2={p.parallelism_r2:.3f}")

   # res.assignment maps every geo -> 'treatment' / 'control'.

On the simulated data this returns designs with parallel-trends :math:`R^2`
around 0.90--0.98 --- roughly 10--35x more parallel than a random
treatment/control split of the same geos.

Simulation: Trajectory Matching vs. a Scalar Supergeo
-----------------------------------------------------

The supergeo design of Chen et al. (2023) matches (super)geos on a
scalar summary of baseline response (its variance term is
:math:`\sum_k (Z_{G_{k,+}} - Z_{G_{k,-}})^2`, a sum over scalar baseline
differences). PANGEO carries this into the panel setting: it matches
on the full pre-treatment *trajectory* (level-removed parallelism), so it
can separate geos that look identical on a scalar yet move differently
over time. The self-contained Monte Carlo below makes the gap concrete —
adapting the paper's RMSE comparison (supergeo vs. matched pairs) to a
panel where every geo has the same pre-period mean but a distinct
trajectory shape, a setting in which scalar matching is by construction
blind. Six geos keep the MIP instantaneous.

.. code-block:: python

   import numpy as np
   import pandas as pd
   from mlsynth import PANGEO

   def make_panel(rng, T_pre=20, S=6, level=100.0, noise=0.6):
       """Six geos = three parallel pairs (up-trend, down-trend, cycle);
       each shape is demeaned over the pre-period so all pre-means match."""
       T = T_pre + S
       t = np.arange(T)
       up = (t - t.mean()) / t.std()
       cyc = np.sin(2 * np.pi * t / 5.0)
       shapes = [5 * up, 5 * up, -5 * up, -5 * up, 5 * cyc, 5 * cyc]
       cols = []
       for s in shapes:
           s = s - s[:T_pre].mean()              # equal pre-means => scalar-blind
           cols.append(level + s + rng.normal(0, noise, T))
       return np.column_stack(cols), T_pre

   def did(Y, T_pre, treated, control, tau):
       Yo = Y.copy()
       Yo[T_pre:, treated] += tau                # inject the true effect
       t_eff = Yo[T_pre:, treated].mean() - Yo[:T_pre, treated].mean()
       c_eff = Yo[T_pre:, control].mean() - Yo[:T_pre, control].mean()
       return t_eff - c_eff

   def pangeo_design(Y, T_pre):
       df = pd.DataFrame(
           {"geo": f"g{g}", "t": int(t), "y": Y[t, g], "arm": "A"}
           for g in range(6) for t in range(T_pre)
       )
       a = PANGEO({"df": df, "outcome": "y", "unitid": "geo", "time": "t",
                   "arm": "arm", "max_supergeo_size": 1,
                   "compute_power": False, "display_graphs": False}).fit().assignment
       T = [int(g[1:]) for g, v in a.items() if v == "treatment"]
       C = [int(g[1:]) for g, v in a.items() if v == "control"]
       return T, C

   def scalar_match(Y, T_pre, rng):              # Google-style: pair on scalar mean
       order = np.argsort(Y[:T_pre].mean(0))
       T, C = [], []
       for k in range(0, 6, 2):
           a, b = order[k], order[k + 1]
           if rng.random() < 0.5:
               T.append(a); C.append(b)
           else:
               T.append(b); C.append(a)
       return T, C

   tau, R = 4.0, 60
   rng = np.random.default_rng(1)
   errs = {"PANGEO (trajectory)": [], "scalar matched-pairs": []}
   for _ in range(R):
       Y, T_pre = make_panel(rng)
       Tp, Cp = pangeo_design(Y, T_pre)
       errs["PANGEO (trajectory)"].append(did(Y, T_pre, Tp, Cp, tau) - tau)
       Ts, Cs = scalar_match(Y, T_pre, rng)
       errs["scalar matched-pairs"].append(did(Y, T_pre, Ts, Cs, tau) - tau)

   for name, e in errs.items():
       e = np.array(e)
       print(f"{name:22s} RMSE = {np.sqrt((e ** 2).mean()):.2f}")

Because all geos share a pre-period mean, the scalar match pairs them
essentially at random, and the difference-in-differences estimate
inherits the up/down/cycle shape mismatch (RMSE ≈ 6 against a true effect
of 4). PANGEO reads the trajectory shape, recovers the three parallel
pairs, and estimates the effect about 30× more precisely (RMSE ≈ 0.2).
That is the supergeo idea carried into the panel world: match on *how the
series move*, not on a single number.

References
----------

Chen, A., Doudchenko, N., Jiang, S., Stein, C., & Ying, B. (2023).
"Supergeo Design: Generalized Matching for Geographic Experiments."
arXiv:2301.12044.

Shaw, C. (2025). "Optimized Supergeo Design: A Scalable Framework for
Geographic Marketing Experiments." arXiv:2506.20499.

Li, K. T. (2023). "Frontiers: A Simple Forward Difference-in-Differences
Method." *Marketing Science* 43(2):267-279.

Li, K. T., & Van den Bulte, C. (2022). "Augmented Difference-in-Differences."
*Marketing Science* 42(4):746-767.

Abadie, A., & Zhao, J. (2026). "Synthetic Controls for Experimental Design."
*Working paper.*

Abadie, A., Diamond, A., & Hainmueller, J. (2010). "Synthetic Control
Methods for Comparative Case Studies." *Journal of the American
Statistical Association* 105(490):493-505.