GEOLIFT — Meta’s GeoLift walkthrough (augsynth cross-validation)#
- Estimator:
- Source:
Meta’s GeoLift package (facebookincubator/GeoLift), the GeoLift_Walkthrough vignette, which runs Ben-Michael, Feller & Rothstein’s Augmented SCM ([BMFR2021]) via augsynth (ebenmichael/augsynth) with Chernozhukov–Wüthrich–Zhu conformal inference ([CWZ2021]).
- Replication type:
Cross-validation: match an authoritative reference implementation (GeoLift/augsynth) value-for-value on the package’s own published example.
- Status:
Done, fully verified. The realized effect report reproduces both of GeoLift’s walkthrough summaries (the unaugmented base model and the ridge-augmented “best” model): ATT, percent lift, incremental, conformal p-value, L2 imbalance, scaled L2, percent improvement, bias removed, and the donor weights.
- Durable check:
benchmarks/cases/geolift.py (geolift_walkthrough, vs the published vignette), benchmarks/cases/geolift_marketselection.py (geolift_marketselection, vs the BestMarkets ranking), and the live-Rscript cross-checks geolift_augsynth_ref (vs live augsynth) and geolift_marketselection_ref (vs live GeoLiftMarketSelection); plus mlsynth/tests/test_geolift_walkthrough.py.
Why this is the replication target#
The earlier port had no value-for-value anchor (only an end-to-end null on
the no-effect panel), because GeoLift’s market-selection routine has no
published table. But GeoLift’s realized effect report — the ATT and conformal
p-value it prints once a test has run — is published, in the
GeoLift_Walkthrough: it is the augsynth Augmented SCM with
fixed_effects=TRUE, the package’s default. That gives a hard cross-validation
target for the part of GEOLIFT that does the causal inference
(realize_design()).
The walkthrough treats chicago + portland over the last 15 of 105 days
(GeoLift_Test: 40 markets, the other 38 as donors) and prints two
summaries — the unaugmented base model (GeoLift(...)) and the ridge-augmented
“best” model (GeoLift(..., model = "best")):
| Quantity | GeoLift base | GeoLift augmented |
|---|---|---|
| Average ATT (per unit/period) | 155.556 | 156.805 |
| Percent Lift | 5.4% | 5.5% |
| Incremental Y (summed) | 4667 | 4704 |
| Conformal p-value | 0.01 | 0.01 |
| L2 imbalance | | |
| Scaled L2 | | |
| Percent improvement (naive) | | |
| Avg estimated bias removed | n/a | -1.249 |
Note
The two columns are two models, not two augsynth versions. The base
GeoLift() call is the unaugmented (simplex) fit — mlsynth reproduces it
with augment=None — and model = "best" selects the ridge-augmented fit,
which mlsynth reproduces with augment="ridge" (its default). Both match
the printed summaries to the published digits, including the L2 imbalance,
scaled L2, and percent-improvement diagnostics (the average bias removed is just
the base-minus-augmented ATT gap, 155.556 - 156.805 = -1.249). The live
augsynth cross-check below independently confirms the ridge fit to floating
point.
The walkthrough’s public call (GeoLift names the locations and the post
window — it is an analysis of a given test region, not a market search):
GeoLift_Test <- GeoLift(Y_id = "Y", data = GeoTestData_Test,
                        locations = c("chicago", "portland"),
                        treatment_start_time = 91, treatment_end_time = 105)
summary(GeoLift_Test)  # base: ATT 155.556, Lift 5.4%, Incremental 4667, p 0.01

GeoTestBest <- GeoLift(Y_id = "Y", data = GeoTestData_Test,
                       locations = c("chicago", "portland"),
                       treatment_start_time = 91, treatment_end_time = 105,
                       model = "best")
summary(GeoTestBest)   # ridge: ATT 156.805, Lift 5.5%, Incremental 4704, p 0.01
mlsynth reaches the same numbers through its public estimator —
GEOLIFT(...).fit() with fixed_effects=True (the default). The estimator is
a market-selection design, so the two markets are pinned with to_be_treated
+ treatment_size (the only candidate of that size) and the post window is
marked by post_col; res.report is the realized effect report — the
analogue of summary(GeoLift_Test):
import pandas as pd
from mlsynth import GEOLIFT

df = pd.read_csv("basedata/geolift_test_data.csv")  # GeoLift_Test
dates = sorted(df["date"].unique())
df["post"] = df["date"].isin(set(dates[90:])).astype(int)  # days 91-105

res = GEOLIFT({
    "df": df, "outcome": "Y", "unitid": "location", "time": "date",
    "treatment_size": 2, "to_be_treated": ["chicago", "portland"],
    "durations": [15], "effect_sizes": [0.0, 0.10], "post_col": "post",
    "how": "mean", "fixed_effects": True, "display_graphs": False,
    "augment": "ridge",  # the "best" model; use augment=None for the base model
}).fit()

res.selected_units            # ['chicago', 'portland']
res.report.effects.att        # 156.805 (ridge "best"; augment=None gives 155.556)
res.report.inference.p_value  # 0.011 (GeoLift 0.01)
# how="sum" reports the summed incremental: ATT 313.6/period, p identical.
Pinned end-to-end through the public API in benchmarks/cases/geolift.py
(geolift_walkthrough), covering both models and every printed quantity (ATT,
lift, incremental, conformal p, L2 imbalance, scaled L2, percent improvement,
bias removed, and the 13 donor weights), and in
mlsynth/tests/test_geolift_walkthrough.py.
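The printed quantities hang together arithmetically, which allows a quick sanity check before running any estimator. The sketch below assumes (the walkthrough summary does not spell this out) that the summed incremental is the per-unit, per-period ATT multiplied by the 2 treated markets and the 15 post-treatment days. The variable names are illustrative only.

```python
# Consistency check on the published walkthrough numbers (values copied from above).
# Assumption: incremental = ATT (per unit/period) x treated units x post periods.
n_treated, n_post = 2, 15
att_base, att_ridge = 155.556, 156.805

incremental_base = att_base * n_treated * n_post    # compare with the printed 4667
incremental_ridge = att_ridge * n_treated * n_post  # compare with the printed 4704
bias_removed = att_base - att_ridge                 # base-minus-augmented ATT gap
att_sum_ridge = att_ridge * n_treated               # per-period ATT under how="sum"

print(round(incremental_base), round(incremental_ridge))
print(round(bias_removed, 3), round(att_sum_ridge, 1))
```

Under that assumption the rounded values line up with the printed incrementals, the -1.249 bias gap, and the 313.6 per-period figure quoted for how="sum".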
Live cross-check vs augsynth#
To pin the augmented fit against the gold-standard reference rather than a doc
string, the durable cross-check fits augsynth itself and compares.
benchmarks/R/augsynth_geolift.R runs
augsynth(Y ~ trt, unit = location, time = t, data = panel,
         progfunc = "ridge", scm = TRUE, fixedeff = TRUE)  # GeoLift's fit
on the same chicago+portland panel (the two test geos averaged into one treated
series, exactly as GeoLift aggregates them), and
benchmarks/cases/geolift_augsynth_ref.py (geolift_augsynth_ref) checks mlsynth
against it. The agreement is essentially floating-point:
Install the reference once with benchmarks/R/install_augsynth.sh (augsynth
only — GeoLift’s fit is augsynth, so the heavy MarketMatching → Boom
chain is not needed). The install is commit-pinned — augsynth 0.2.0 @
7a90ea4 and every source-compiled dependency frozen to a SHA (S7,
LiblineaR, osqp) as of 2026-06-12 — so the cross-check runs the same
reference code every time, rather than a moving master tip whose results
could shift release to release. The case skips itself
when Rscript / augsynth is absent, so it is a no-op in CI and runs only
where the reference is installed.
This is what licenses the strong claim above: mlsynth’s ridge ASCM, its CV
λ-selection (the 1-SE rule), and its fixed-effect conformal refit are not merely
close to augsynth — they are the same computation, to ~7–11 significant
figures.
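To make "significant figures of agreement" concrete, here is a minimal sketch of how such a figure can be computed from a relative error. The helper name and the example pair are hypothetical; this is not the tolerance logic the benchmark case itself uses.

```python
import math

def agreeing_digits(value, reference):
    """Approximate count of matching significant figures between two floats."""
    if value == reference:
        return math.inf
    rel_err = abs(value - reference) / abs(reference)
    return -math.log10(rel_err)

# Hypothetical pair that differs around the 10th significant figure.
print(agreeing_digits(156.805000012, 156.805))
```

A result between 7 and 11 from a helper like this corresponds to a relative error between roughly 1e-7 and 1e-11.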
Market selection (the BestMarkets ranking)#
The walkthrough’s other half is the search for a test region, run on the 90-period pre-test panel:
GeoLiftMarketSelection(data = GeoTestData_PreTest, treatment_periods = c(10, 15),
                       N = c(2, 3, 4, 5), effect_size = seq(0, 0.2, 0.05),
                       include_markets = "chicago", exclude_markets = "honolulu",
                       cpic = 7.50, budget = 1e5, fixed_effects = TRUE,
                       side_of_test = "two_sided")
GeoLift prints a ranked BestMarkets table; its top five designs are reproduced
by mlsynth value-for-value — rank, CPIC investment (exact to the cent), MDE,
and abs_lift_in_zero:
mlsynth’s GEOLIFT design takes one treatment_size, so the case runs it
for N = 2, 3, 4, 5 and pools the per-design MDE rows, then applies GeoLift’s
composite rank (the mean of three dense_rank scores over |MDE|, power, and
abs_lift_in_zero, with ties = min) across the pool, exactly how
GeoLiftMarketSelection ranks its single results table.
Note
Matching this top-five required fixing include_markets handling to GeoLift’s
generate-then-filter semantics (pre_test_power.R): candidates are
generated ignoring the forced markets, then kept only if they already contain
them, so a forced market is never welded onto an anchor it is uncorrelated
with. The earlier remove-and-reattach approach manufactured low-correlation
candidates (e.g. {chicago, las vegas}) that GeoLift never forms, which then
polluted the ranking (the composite ranks on MDE/power, not fit). Only the
stable top-five are pinned (see the live cross-check below for why).
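The difference between the two include_markets semantics is easiest to see on a toy candidate list. Both helpers below are hypothetical illustrations (neither is mlsynth’s nor GeoLift’s code), and the candidate tuples are invented.

```python
def generate_then_filter(candidates, include):
    """Keep only generated candidates that already contain every forced market."""
    forced = set(include)
    return [c for c in candidates if forced <= set(c)]

def remove_and_reattach(candidates, include):
    """Earlier approach: weld the forced markets onto each candidate."""
    forced = set(include)
    return [tuple(sorted(set(c) | forced)) for c in candidates]

# Candidates generated over all markets (size 2) vs. without chicago (size 1).
full_pool = [("chicago", "cincinnati"), ("las vegas", "san diego")]
reduced_pool = [("las vegas",), ("cincinnati",)]

print(generate_then_filter(full_pool, ["chicago"]))
print(remove_and_reattach(reduced_pool, ["chicago"]))
```

Filtering can only return sets the generator actually formed, whereas reattaching produces a pairing such as chicago with las vegas regardless of whether the generator would ever have grouped them.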
Live cross-check vs GeoLiftMarketSelection#
Beyond the published table, benchmarks/cases/geolift_marketselection_ref.py
(geolift_marketselection_ref) runs the real GeoLiftMarketSelection via
Rscript and compares its BestMarkets to mlsynth’s pooled selection. On
the top-five designs mlsynth matches live GeoLift exactly — same candidate sets,
same rank, investment to the cent, same MDE. Crucially it also reproduces the
low-correlation candidates live GeoLift forms (e.g.
{atlanta, chicago, cleveland, las vegas}), which is what licenses the
generate-then-filter fix: the two libraries now build the same candidate pool.
The one design that differs is a single N=5 set
({chicago, cincinnati, houston, nashville, san diego}) at GeoLift’s rank six.
The cause is not selection but the power metric: with lookback_window = 1 the
power is a single-placement binary (detected or not at one flush window), and
that design’s flush-placement conformal p sits right at alpha — mlsynth gets
0.123, GeoLift just under 0.10 — so it falls on opposite sides of the
threshold, and (its only within-budget effect size being 0.05) mlsynth drops
it while GeoLift keeps it. The design’s true power at 0.05 is ~0.8 (with
lookback_window = 5 the p-values are 0.123, 0.061, 0.054, 0.016, 0.046),
so it is a known small-effect / single-placement power-methodology difference,
not a candidate-generation discrepancy — and it sits below the stable top-five.
Set lookback_window > 1 for a stable MDE/ranking: it is not the treatment
length (that is durations) but the number of staggered historical placements
the power is averaged over, so a single placement (the walkthrough’s default) is
a high-variance estimate at borderline designs.
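The arithmetic behind that rank-six design can be reproduced from the five p-values quoted above. The sketch assumes power is the share of placements whose p-value falls strictly below alpha, and that the first listed value (0.123) is the single placement used when lookback_window = 1; the function name is illustrative.

```python
def placement_power(p_values, alpha=0.10):
    """Share of historical placements whose conformal p-value is below alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)

p_lookback_5 = [0.123, 0.061, 0.054, 0.016, 0.046]  # p-values quoted above

print(placement_power(p_lookback_5[:1]))  # single placement: all-or-nothing
print(placement_power(p_lookback_5))      # averaged over five placements
```

With one placement the estimate can only be 0 or 1, so a p-value sitting near alpha flips the design in or out of the table; averaging over five placements gives 0.8.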
Install the reference once with benchmarks/R/install_geolift.sh (it builds on
install_augsynth.sh and compiles the MarketMatching → CausalImpact →
bsts → Boom chain plus gsynth from GitHub’s CRAN mirrors, every
package pinned to a CRAN tag / commit, so no CRAN call is needed — the latest
Boom requires R ≥ 4.5, so the set is frozen to the last R-4.3-compatible
release). The case skips itself when Rscript / GeoLift is absent, so it
is a no-op in CI.
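A skip guard of this kind can be written with the standard library alone. The helper below is a hypothetical sketch, not the code the benchmark cases use: it treats the reference as available only when Rscript is on PATH and the named R package loads.

```python
import shutil
import subprocess

def r_package_available(package):
    """Return True only if Rscript is on PATH and `package` can be loaded."""
    rscript = shutil.which("Rscript")
    if rscript is None:
        return False
    # Exit status 0 when the namespace loads, 1 otherwise.
    expr = (
        'quit(save = "no", status = as.integer('
        f'!requireNamespace("{package}", quietly = TRUE)))'
    )
    try:
        done = subprocess.run([rscript, "-e", expr], capture_output=True, timeout=120)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return done.returncode == 0

if not r_package_available("GeoLift"):
    print("reference not installed; skipping live cross-check")
```

Returning False on any failure keeps the check safe to call in CI, where a missing reference should mean a skipped case rather than an error.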
Pinned in benchmarks/cases/geolift_marketselection.py
(geolift_marketselection, vs the published table) and
geolift_marketselection_ref (vs the live run); the per-design investment is
also pinned independently by geolift_cpic.
Note
The walkthrough’s power curve (GeoLiftPower over an effect-size grid) is
published only as a plot, with no numeric table, so it has no value-for-value
target. The quantities that build it — each design’s per-effect-size power and
MDE — are the same ones validated by geolift_walkthrough,
geolift_marketselection, and the live geolift_marketselection_ref.
What it took to match — the four ingredients#
Reaching parity required reproducing augsynth’s pipeline from scratch and verifying each component against the published number. Four ingredients, each necessary; drop any one and the ATT or the p-value diverges.
- Unit fixed effects (augsynth fixedeff = TRUE, which GeoLift sets through its default fixed_effects = TRUE). demean_data subtracts each unit’s own pre-period mean from all of its periods, fits the SCM on the residuals (matching shapes, not levels), and restores the level with an intercept. This is what stops the donor pool from absorbing a treated-unit level shift: a convex/ridge combination of level-matched donors can chase a post-period jump, but once every unit is demeaned it cannot. Without it the realized ATT is wrong (≈209 vs 311) and the conformal refit absorbs the effect (p ≈ 0.56).
- Fit the mean of the treated units (augsynth colMeans), not their sum. This is not scale-invariant for the conformal test: a sum of \(k\) markets sits at \(k\times\) donor scale, outside the convex hull, so the simplex base fits it badly and the residual path changes. Sum gives p ≈ 0.68 where mean gives p ≈ 0.01. GEOLIFT fits the per-unit mean and rescales the reported paths by \(k\) when how="sum" (the p-value, a ratio of norms, is invariant to that global reporting scale).
- The faithful conformal refit ([CWZ2021], augsynth conformal). For the joint null the Augmented SCM is refit on all periods (augsynth’s cbind(X, y)); under fixed effects the refit demeans by the full-path mean (rowMeans of the augmented matching matrix). The post-block statistic \((\sum |u_t|^q / \sqrt{n})^{1/q}\) is compared to permutations of the residual path. The all-period refit is what makes the pre/post residuals exchangeable, and hence the test calibrated.
- augsynth’s ridge ASCM itself: a simplex base plus a period-space ridge correction \(w = w_\text{scm} + (X_1 - X_c^\top w_\text{scm})^\top (X_c^\top X_c + \lambda I)^{-1} X_c^\top\), where \(X_c\) is the donors-by-pre-periods matrix and \(X_1\) the treated pre-period vector, with \(\lambda\) selected by leave-one-period-out CV under the 1-SE rule (augsynth’s default min_1se = TRUE). mlsynth’s ridge_augment_weights() reproduces these weights to corr = 1.0000 on matched inputs.
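Two of these ingredients, the pre-period demeaning and the ridge correction, are small enough to sketch in NumPy. This is an illustration under stated assumptions, not mlsynth’s ridge_augment_weights(): the simplex SCM base is replaced by uniform weights (no QP is solved), \(\lambda\) is fixed by hand rather than chosen by CV, the data are random, and the donor matrix is taken as donors by pre-periods.

```python
import numpy as np

def demean_pre(Y, n_pre):
    """Subtract each unit's pre-period mean from all of its periods (rows = units)."""
    return Y - Y[:, :n_pre].mean(axis=1, keepdims=True)

def ridge_augment(w_scm, x1, Xc, lam):
    """Base weights plus the ridge correction; Xc is (donors, pre-periods)."""
    resid = x1 - Xc.T @ w_scm                     # pre-period imbalance of the base fit
    gram = Xc.T @ Xc + lam * np.eye(Xc.shape[1])  # period-space (T0 x T0) system
    return w_scm + Xc @ np.linalg.solve(gram, resid)

rng = np.random.default_rng(0)

# Demeaning removes unit-level shifts: adding a constant per unit changes nothing.
Y = rng.normal(size=(4, 6))
levels = np.array([[0.0], [50.0], [100.0], [150.0]])
same = np.allclose(demean_pre(Y + levels, 4), demean_pre(Y, 4))

# Ridge correction on random inputs, with uniform weights standing in for the SCM base.
Xc, x1 = rng.normal(size=(8, 3)), rng.normal(size=3)
w0 = np.full(8, 1.0 / 8)
w_tight = ridge_augment(w0, x1, Xc, lam=1e-10)  # small lam: near-exact pre-period balance
w_loose = ridge_augment(w0, x1, Xc, lam=1e12)   # large lam: collapses back to the base
print(same, np.abs(Xc.T @ w_tight - x1).max(), np.abs(w_loose - w0).max())
```

The two limits show what \(\lambda\) trades off: near zero the augmented weights balance the pre-period exactly (and may leave the simplex), while a very large \(\lambda\) returns the base weights unchanged.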
Two traps we walked into (and out of)#
These are the cross-codebase-consistency lessons worth carrying to the next port.
- A calibrated test can look “anti-powered.” Before isolating the fixed effect, the symptom was “our conformal p (0.57) is far from GeoLift’s (0.01), so the conformal must be broken/anti-powered.” It is not. A 40-market placebo study on the no-effect panel showed the all-period refit is well-calibrated (rejection rate ≈ 0.10 at \(\alpha = 0.10\)), and the tempting “fix” (fitting once on the pre-period and permuting the gap path) is the one that is broken: it has a ≈ 50 % false-positive rate, because pre residuals are in-sample and post residuals are out-of-sample, so they are not exchangeable. The high p-value was never the test’s fault; it was the fit (missing fixed effects), which estimated a smaller, level-absorbed effect that a correct test then correctly judged insignificant. Diagnose the estimand before blaming the inference.
- Match defaults before mechanisms. Two of the four ingredients (fixed_effects=TRUE, min_1se=TRUE) are just augsynth/GeoLift defaults we had not mirrored; one (mean vs sum) is an aggregation default. Only the fourth is “mechanism.” When two codebases disagree, enumerate the reference’s defaults first: most divergences are an unmatched default, not a wrong formula. Reproducing the reference end-to-end in a standalone scratch script (here, ~40 lines of NumPy that hit ATT 312 / p 0.011) localizes which default matters far faster than reading either codebase.
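For the first trap, it helps to see how small the test itself is once the residual path is in hand. The sketch below assumes the residuals already come from an all-period refit under the null (not performed here) and uses cyclic shifts as the permutation set, which is one of the schemes in [CWZ2021]; whether augsynth uses this scheme by default is not asserted. The residual path is synthetic.

```python
import numpy as np

def post_block_stat(u_post, q=1):
    """(sum |u_t|^q / sqrt(n))^(1/q) over the n post-period residuals."""
    u_post = np.asarray(u_post, dtype=float)
    return (np.sum(np.abs(u_post) ** q) / np.sqrt(u_post.size)) ** (1.0 / q)

def conformal_p_value(resid, n_post, q=1):
    """Share of cyclic shifts whose post-block statistic is at least the observed one."""
    resid = np.asarray(resid, dtype=float)
    observed = post_block_stat(resid[-n_post:], q)
    shifted = [post_block_stat(np.roll(resid, s)[-n_post:], q)
               for s in range(resid.size)]
    return float(np.mean(np.array(shifted) >= observed))

# Synthetic path: 90 small pre-period residuals, then 15 large post-period ones.
resid = np.concatenate([0.1 * (-1.0) ** np.arange(90), np.full(15, 10.0)])

print(conformal_p_value(resid, n_post=15))        # only the unshifted path attains the maximum
print(conformal_p_value(2.0 * resid, n_post=15))  # unchanged by a global rescale
print(conformal_p_value(np.ones(105), n_post=15)) # no effect: every shift ties
```

Since the unshifted path is always counted, the smallest attainable p-value with 105 periods is 1/105. The second call illustrates the scale invariance noted for how="sum": multiplying every residual by the same constant leaves the p-value unchanged.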
Ben-Michael, E., Feller, A., & Rothstein, J. (2021). The Augmented Synthetic Control Method. Journal of the American Statistical Association.