GEOLIFT: Meta’s GeoLift walkthrough (augsynth cross-validation)
- Estimator:
- Source: Meta’s GeoLift package (facebookincubator/GeoLift), the GeoLift_Walkthrough vignette, which runs Ben-Michael, Feller & Rothstein’s Augmented SCM ([BMFR2021]) via augsynth (ebenmichael/augsynth) with Chernozhukov–Wüthrich–Zhu conformal inference ([CWZ2021]).
- Replication type: Cross-validation: match an authoritative reference implementation (GeoLift/augsynth) value-for-value on the package’s own published example.
- Status: Done, fully verified; the realized effect report reproduces GeoLift’s walkthrough ATT, percent lift, incremental, and conformal p-value.
- Durable check: benchmarks/cases/geolift.py (geolift_walkthrough, vs the published vignette) and benchmarks/cases/geolift_augsynth_ref.py (geolift_augsynth_ref, vs live augsynth via Rscript); plus mlsynth/tests/test_geolift_walkthrough.py.
Why this is the replication target
The earlier port had no value-for-value anchor (only an end-to-end null on
the no-effect panel), because GeoLift’s market-selection routine has no
published table. But GeoLift’s realized effect report — the ATT and conformal
p-value it prints once a test has run — is published, in the
GeoLift_Walkthrough: it is the augsynth Augmented SCM with
fixed_effects=TRUE, the package’s default. That gives a hard cross-validation
target for the part of GEOLIFT that does the causal inference
(realize_design()).
The walkthrough treats chicago + portland over the last 15 of 105 days
(GeoLift_Test: 40 markets, the other 38 as donors) and reports
ATT 155.556, lift 5.4%, incremental 4667, and conformal p-value 0.01.
Note
The vignette’s printed 155.556 / 4667 is from an older augsynth
release. Run against augsynth today the same fit returns
ATT = 156.81 (λ = 1.673102e9, 13 donors); mlsynth reproduces that
live augsynth output to floating point — see Live cross-check vs augsynth
below — so the ~0.8 % gap to the printed number is augsynth’s own
version-to-version drift, not an mlsynth discrepancy.
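The figures quoted in this section can be tied together with a few lines of arithmetic. The sketch below assumes the incremental is the per-unit, per-day ATT summed over the 2 treated markets and the 15 post-period days (an assumption, though it does reproduce the printed 4667); the variable names are illustrative only.

```python
# Numbers quoted in this document
vignette_att = 155.556   # per-unit, per-day ATT printed in the vignette
live_att = 156.81        # same fit run against the current augsynth
n_units, n_days = 2, 15  # chicago + portland, days 91-105

# Assumed definition: incremental = ATT x treated units x post days
incremental = vignette_att * n_units * n_days                # about 4666.68, printed as 4667

# Relative gap between the printed and the live ATT
drift_pct = 100 * (live_att - vignette_att) / vignette_att   # about 0.81 (%)

# Summed-over-markets ATT per period, the scale reported under how="sum"
summed_att = live_att * n_units                              # about 313.62

print(round(incremental), round(drift_pct, 2), round(summed_att, 1))
```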
The walkthrough’s public call (GeoLift names the locations and the post
window — it is an analysis of a given test region, not a market search):
GeoLift_Test <- GeoLift(Y_id = "Y", data = GeoTestData_Test,
locations = c("chicago", "portland"),
treatment_start_time = 91, treatment_end_time = 105)
summary(GeoLift_Test) # ATT 155.556, Lift 5.4%, Incremental 4667, p 0.01
mlsynth reaches the same numbers through its public estimator —
GEOLIFT(...).fit() with fixed_effects=True (the default). The estimator is
a market-selection design, so the two markets are pinned with to_be_treated
+ treatment_size (the only candidate of that size) and the post window is
marked by post_col; res.report is the realized effect report — the
analogue of summary(GeoLift_Test):
import pandas as pd
from mlsynth import GEOLIFT
df = pd.read_csv("basedata/geolift_test_data.csv") # GeoLift_Test
dates = sorted(df["date"].unique())
df["post"] = df["date"].isin(set(dates[90:])).astype(int) # days 91-105
res = GEOLIFT({
    "df": df, "outcome": "Y", "unitid": "location", "time": "date",
    "treatment_size": 2, "to_be_treated": ["chicago", "portland"],
    "durations": [15], "effect_sizes": [0.0, 0.10], "post_col": "post",
    "how": "mean", "fixed_effects": True, "display_graphs": False,
}).fit()
res.selected_units # ['chicago', 'portland']
res.report.effects.att # 156.8 (GeoLift per-unit ATT 155.6)
res.report.inference.p_value # 0.011 (GeoLift 0.01)
# how="sum" reports the summed incremental: ATT 313.6/period, p identical.
Pinned end-to-end through the public API in benchmarks/cases/geolift.py
(geolift_walkthrough) and mlsynth/tests/test_geolift_walkthrough.py.
Live cross-check vs augsynth
Because the printed vignette number has drifted with augsynth’s version, the
durable cross-check fits augsynth itself and compares — the gold-standard
reference rather than a doc string. benchmarks/R/augsynth_geolift.R runs
augsynth(Y ~ trt, unit = location, time = t, data = panel,
progfunc = "ridge", scm = TRUE, fixedeff = TRUE) # GeoLift's fit
on the same chicago+portland panel (the two test geos averaged into one treated
series, exactly as GeoLift aggregates them), and
benchmarks/cases/geolift_augsynth_ref.py (geolift_augsynth_ref) checks mlsynth
against it. The agreement is essentially at floating-point precision (about 7
to 11 significant figures).
Install the reference once with benchmarks/R/install_augsynth.sh (augsynth
only: GeoLift’s fit is augsynth, so the heavy MarketMatching → Boom
chain is not needed). The install is commit-pinned (augsynth 0.2.0 @
7a90ea4, with every source-compiled dependency, S7, LiblineaR, and osqp,
frozen to a SHA as of 2026-06-12), so the cross-check runs the same
reference code every time rather than a moving master tip; an unpinned tip
is exactly the drift that made the vignette’s number stale. The case skips
itself when Rscript or augsynth is absent, so it is a no-op in CI and runs
only where the reference is installed.
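The skip-when-absent behaviour can be sketched as a small availability probe. This is a minimal illustration, not the code in geolift_augsynth_ref.py: the function name augsynth_available is hypothetical, and the probe simply asks R whether the augsynth namespace can be loaded.

```python
import shutil
import subprocess


def augsynth_available(rscript="Rscript"):
    """Return True only if Rscript is on PATH and R can load the augsynth package."""
    exe = shutil.which(rscript)
    if exe is None:
        return False  # no R toolchain on this machine: skip the case
    # Exit status 0 if augsynth is installed, 1 otherwise
    probe = 'quit(save = "no", status = if (requireNamespace("augsynth", quietly = TRUE)) 0 else 1)'
    try:
        done = subprocess.run([exe, "-e", probe], capture_output=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return done.returncode == 0


if not augsynth_available():
    print("augsynth reference not installed; skipping live cross-check")
```

A benchmark case guarded this way never fails on a machine without R; it only ever compares against the pinned reference where that reference exists.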
This is what licenses the strong claim above: mlsynth’s ridge ASCM, its CV
λ-selection (the 1-SE rule), and its fixed-effect conformal refit are not merely
close to augsynth — they are the same computation, to ~7–11 significant
figures.
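One way to make "agrees to N significant figures" concrete is the negative base-10 log of the relative error. The helper below is a hypothetical illustration (it is not part of mlsynth or the benchmark suite) and assumes a nonzero reference value.

```python
import math


def agreeing_sig_figs(value, reference):
    """Approximate number of significant figures on which value matches reference."""
    if value == reference:
        return math.inf
    return -math.log10(abs(value - reference) / abs(reference))


# The printed vignette ATT vs the live augsynth ATT agree to only about 2 figures,
# which is the version drift described earlier, far from floating-point agreement.
print(agreeing_sig_figs(155.556, 156.81))
```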
What it took to match: the four ingredients
Reaching parity required reproducing augsynth’s pipeline from scratch and verifying each component against the published number. Four ingredients, each necessary; drop any one and the ATT or the p-value diverges.
1. Unit fixed effects (augsynth fixed_effects=TRUE, GeoLift’s default). demean_data subtracts each unit’s own pre-period mean from all of its periods, fits the SCM on the residuals (matching shapes, not levels), and restores the level with an intercept. This is what stops the donor pool from absorbing a treated-unit level shift: a convex/ridge combination of level-matched donors can chase a post-period jump, but once every unit is demeaned it cannot. Without it the realized ATT is wrong (≈209 vs 311) and the conformal refit absorbs the effect (p ≈ 0.56).
2. Fit the mean of the treated units (augsynth colMeans), not their sum. This is not scale-invariant for the conformal: a sum of \(k\) markets sits at \(k\times\) donor scale, outside the convex hull, so the simplex base fits it badly and the residual path changes; sum gives p ≈ 0.68 where mean gives p ≈ 0.01. GEOLIFT fits the per-unit mean and rescales the reported paths by \(k\) when how="sum" (the p-value, a ratio of norms, is invariant to that global reporting scale).
3. The faithful conformal refit ([CWZ2021], augsynth conformal). For the joint null the Augmented SCM is refit on all periods (augsynth’s cbind(X, y)); under fixed effects the refit demeans by the full-path mean (rowMeans of the augmented matching matrix). The post-block statistic \((\sum |u_t|^q / \sqrt{n})^{1/q}\) is compared to permutations of the residual path. The all-period refit is what makes the pre/post residuals exchangeable, and hence the test calibrated.
4. augsynth’s ridge ASCM itself: a simplex base + a period-space ridge correction \(w^\top = w_\text{scm}^\top + (X_1 - X_c^\top w_\text{scm})^\top (X_c^\top X_c + \lambda I)^{-1} X_c^\top\), where \(X_c\) is the donors × pre-periods matrix and \(X_1\) the treated pre-period vector, with \(\lambda\) selected by leave-one-period-out CV under the 1-SE rule (augsynth’s default min_1se = TRUE). mlsynth’s ridge_augment_weights() reproduces these weights to corr = 1.0000 on matched inputs.
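The fixed-effect demeaning, the ridge correction, and the 1-SE rule from the list above can be sketched in NumPy. This is a hedged illustration, not mlsynth’s ridge_augment_weights() or augsynth’s code. All function names are hypothetical, and three assumptions are built in:

- \(X_c\) is stored donors × pre-periods, so the Gram matrix is periods × periods.
- The simplex base is replaced by uniform weights as a stand-in, not an SCM solve.
- The 1-SE rule is read in the usual way: the largest \(\lambda\) whose CV error is within one standard error of the minimum.

```python
import numpy as np


def demean_by_pre_mean(Y, t0):
    """Subtract each unit's own pre-period mean (first t0 columns) from all its periods."""
    pre_means = Y[:, :t0].mean(axis=1, keepdims=True)
    return Y - pre_means, pre_means


def ridge_augment(w_scm, x1, Xc, lam):
    """Ridge-correct base weights: w_scm + Xc (Xc'Xc + lam I)^{-1} (x1 - Xc' w_scm).

    Xc is donors x pre-periods and x1 is the treated pre-period path, so the
    Gram matrix Xc'Xc lives in period space.
    """
    imbalance = x1 - Xc.T @ w_scm
    gram = Xc.T @ Xc + lam * np.eye(Xc.shape[1])
    return w_scm + Xc @ np.linalg.solve(gram, imbalance)


def pick_lambda_1se(lambdas, cv_err, cv_se):
    """Largest lambda whose CV error is within one SE of the minimum CV error."""
    lambdas, cv_err, cv_se = map(np.asarray, (lambdas, cv_err, cv_se))
    best = np.argmin(cv_err)
    return float(lambdas[cv_err <= cv_err[best] + cv_se[best]].max())


# Synthetic panel: treated series in row 0, 20 donors, 12 periods (8 pre-period)
rng = np.random.default_rng(0)
t0 = 8
Y = rng.normal(size=(21, 12)) + rng.normal(scale=50.0, size=(21, 1))  # large level gaps
Yd, _ = demean_by_pre_mean(Y, t0)
x1, Xc = Yd[0, :t0], Yd[1:, :t0]

w_scm = np.full(20, 1 / 20)  # stand-in for a fitted simplex base (assumption)
w_aug = ridge_augment(w_scm, x1, Xc, lam=1.0)

gap_before = np.linalg.norm(x1 - Xc.T @ w_scm)
gap_after = np.linalg.norm(x1 - Xc.T @ w_aug)
print(gap_before, gap_after)  # the ridge correction shrinks the pre-period imbalance
```

After the correction the remaining imbalance is \(\lambda (X_c^\top X_c + \lambda I)^{-1}\) times the original one, so it can only shrink. As \(\lambda\) grows the weights fall back to the simplex base.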
Two traps we walked into (and out of)
These are the cross-codebase-consistency lessons worth carrying to the next port.
1. A calibrated test can look “anti-powered.” Before isolating the fixed effect, the symptom was “our conformal p (0.57) is far from GeoLift’s (0.01), so the conformal must be broken/anti-powered.” It is not. A 40-market placebo study on the no-effect panel showed the all-period refit is well calibrated (rejection rate ≈ 0.10 at \(\alpha = 0.10\)), and the tempting “fix” (fitting once on the pre-period and permuting the gap path) is the one that is broken: ≈ 50 % false-positive rate, because pre residuals are in-sample and post residuals are out-of-sample, so they are not exchangeable. The high p-value was never the test’s fault; it came from the fit (missing fixed effects), which estimated a smaller, level-absorbed effect that a correct test then correctly judged insignificant. Diagnose the estimand before blaming the inference.
2. Match defaults before mechanisms. Two of the four ingredients (fixed_effects=TRUE, min_1se=TRUE) are just augsynth/GeoLift defaults we had not mirrored; one (mean vs sum) is an aggregation default. Only the fourth is “mechanism.” When two codebases disagree, enumerate the reference’s defaults first: most divergences are an unmatched default, not a wrong formula. Reproducing the reference end-to-end in a standalone scratch script (here, ~40 lines of NumPy that hit ATT 312 / p 0.011) localizes which default matters far faster than reading either codebase.
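The permutation step behind the conformal p-value can be sketched as follows. This is an illustration under stated assumptions, not augsynth’s or mlsynth’s implementation. It takes the all-period residual path as given (producing it requires the all-period refit described above) and uses moving-block (cyclic-shift) permutations, one of the schemes in [CWZ2021], so the result is deterministic. Function names are hypothetical.

```python
import numpy as np


def post_block_stat(u_post, q=1.0):
    """Post-block statistic (sum |u_t|^q / sqrt(n))^(1/q) over n post-period residuals."""
    n = len(u_post)
    return (np.sum(np.abs(u_post) ** q) / np.sqrt(n)) ** (1.0 / q)


def conformal_p_value(resid, n_post, q=1.0):
    """Moving-block permutation p-value; resid covers all periods, post-period last."""
    resid = np.asarray(resid, dtype=float)
    observed = post_block_stat(resid[-n_post:], q)
    # Each cyclic shift moves a different block of residuals into the post slot;
    # shift 0 is the observed ordering, so the p-value is at least 1 / len(resid).
    stats = np.array([post_block_stat(np.roll(resid, s)[-n_post:], q)
                      for s in range(len(resid))])
    return float(np.mean(stats >= observed))


# Toy residual path: 90 small pre-period residuals, 15 large post-period ones
pre = 0.1 * np.sin(np.arange(90))
post = np.full(15, 5.0)
resid = np.concatenate([pre, post])

p = conformal_p_value(resid, n_post=15)           # 1/105: only the identity shift ties
p_scaled = conformal_p_value(2.0 * resid, n_post=15)
print(p, p_scaled)
```

Rescaling every residual by the same constant leaves the p-value unchanged, which is why the how="sum" reporting scale does not move it. What does move it is a different fit, such as one without fixed effects, because that changes the residual path itself.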
[BMFR2021] Ben-Michael, E., Feller, A., & Rothstein, J. (2021). The Augmented Synthetic Control Method. Journal of the American Statistical Association.