A/B Test Autopsy · interactive demo

About this demo

This page documents the design and provenance of every chart on the site. It exists so that a reader can replicate the work end-to-end from the repository in a few commands.

Why this exists

Most A/B tests in industry are read incorrectly rather than run incorrectly. Teams stop as soon as the p-value crosses the significance threshold, segment until something pops, or extrapolate from a day-one lift that does not persist. The five pathologies on this site are well documented in the experimentation literature; the contribution here is to reproduce each one from synthetic data with known ground truth, then apply the same lens to a real published dataset.

Statistical methods

| Section | Method | Reference |
| --- | --- | --- |
| Peeking bias | Sequential two-proportion z-tests under H₀; Monte Carlo estimate of P(reject anywhere) | Johari et al., 2017 |
| Minimum detectable effect | Normal-approximation power calculation, two proportions | Lehr, 1992 |
| Simpson's paradox | Conditional vs. marginal lift across a binary segment | Pearl, 2009 |
| Novelty decay | Daily lift with exponential decay toward steady state | Goldberg et al., 2019 |
| Sample-ratio mismatch | χ² goodness-of-fit on bucket counts at α = 10⁻³ | Fabijan et al., 2019 |
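
Three of the methods listed above can be sketched in a few lines. This is a minimal illustration, not the code in the repository: the function names, default parameters, and the use of the standard library in place of numpy and scipy are all assumptions made to keep the example self-contained.

```python
import math
import random
from statistics import NormalDist


def two_proportion_p_value(x_a, n_a, x_b, n_b):
    """Two-sided p-value of the pooled two-proportion z-test."""
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation observed, nothing to reject
    z = (x_b / n_b - x_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def peeking_false_positive_rate(base_rate=0.10, per_look=100, looks=10,
                                alpha=0.05, sims=500, seed=0):
    """Monte Carlo estimate of P(reject at any look) when H0 is true.

    Both arms share the same conversion rate, so every rejection is a
    false positive. All defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    rejected = 0
    for _ in range(sims):
        x_a = x_b = n = 0
        for _ in range(looks):
            # Accumulate one more batch of users per arm, then peek.
            x_a += sum(rng.random() < base_rate for _ in range(per_look))
            x_b += sum(rng.random() < base_rate for _ in range(per_look))
            n += per_look
            if two_proportion_p_value(x_a, n, x_b, n) < alpha:
                rejected += 1
                break  # the analyst stops at the first "significant" look
    return rejected / sims


def minimum_detectable_effect(base_rate, n_per_arm, alpha=0.05, power=0.80):
    """Absolute MDE for two proportions under the normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * math.sqrt(
        2 * base_rate * (1 - base_rate) / n_per_arm)


def srm_p_value(n_control, n_treatment, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value for a two-bucket split (1 df)."""
    total = n_control + n_treatment
    exp_c = total * expected_control_share
    exp_t = total - exp_c
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x/2)).
    return math.erfc(math.sqrt(chi2 / 2))
```

With ten looks at α = 0.05, `peeking_false_positive_rate()` typically returns a value well above the nominal 5%, which is the peeking pathology in miniature. A split such as `srm_p_value(50_000, 51_200)` falls below the 10⁻³ threshold from the table, while `srm_p_value(50_000, 50_100)` does not.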

Architecture

Pathologies are simulated in Python with numpy and scipy. Each scenario carries the ground-truth parameters used to generate it. A build step (build_site.py) converts the simulation outputs to small JSON files under site/data/; the histograms for the Upworthy capstone are pre-aggregated to keep payloads small. Charts on every page are Vega-Lite specs rendered by Vega-Embed in the browser. The site is a small set of hand-written HTML files served by GitHub Pages.
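
The export-to-JSON step and the chart specs might look roughly like the sketch below. The function names, the column names, and the file paths in the usage note are hypothetical; the actual build_site.py may be organized differently.

```python
import csv
import json
from pathlib import Path


def csv_to_json(csv_path, json_path):
    """Convert one CSV export into a compact JSON array of row objects."""
    with open(csv_path, newline="") as fh:
        rows = list(csv.DictReader(fh))
    # Cast numeric-looking fields so the chart layer sees numbers, not strings.
    for row in rows:
        for key, value in row.items():
            try:
                row[key] = float(value)
            except (TypeError, ValueError):
                pass  # leave labels and missing values untouched
    Path(json_path).parent.mkdir(parents=True, exist_ok=True)
    with open(json_path, "w") as fh:
        # Compact separators keep the payload small.
        json.dump(rows, fh, separators=(",", ":"))
    return rows


def line_chart_spec(data_url, x_field, y_field):
    """Build a minimal Vega-Lite line chart spec as a plain dict."""
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "data": {"url": data_url},
        "mark": "line",
        "encoding": {
            "x": {"field": x_field, "type": "quantitative"},
            "y": {"field": y_field, "type": "quantitative"},
        },
    }
```

A call such as `csv_to_json("data/exports/novelty.csv", "site/data/novelty.json")` followed by `json.dumps(line_chart_spec("data/novelty.json", "day", "lift"))` would produce a data file and a spec that Vega-Embed can render; those file and field names are placeholders.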

The analyst notes

The short note next to each chart is produced by a language model conditioned on the numeric scenario. The prompt forbids hype words, em-dashes, and chain-of-thought preamble, and asks for a 45–65 word summary, one caveat, and one next step. Notes are generated once at build time and cached. A Streamlit version in the repository keeps a live chat panel; the static site omits it because embedding the API key in client JavaScript would be unsafe.
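
One way to enforce the prompt constraints and the build-time cache is sketched below. Everything here is an assumption for illustration: the hype-word list, the function names, and the `generate` callable standing in for the model call are hypothetical, and the repository may validate and cache notes differently.

```python
import hashlib
import json

# Hypothetical examples; the real prompt's word list is not shown in this page.
HYPE_WORDS = {"revolutionary", "game-changing", "groundbreaking"}


def note_violations(summary):
    """Return a list of constraint violations for one generated summary."""
    problems = []
    word_count = len(summary.split())
    if not 45 <= word_count <= 65:
        problems.append(f"summary has {word_count} words, expected 45-65")
    if "\u2014" in summary:  # U+2014 is the em-dash
        problems.append("contains an em-dash")
    lowered = summary.lower()
    for word in sorted(HYPE_WORDS):
        if word in lowered:
            problems.append(f"contains hype word: {word}")
    return problems


def scenario_key(scenario):
    """Stable cache key: hash of the scenario's canonical JSON form."""
    payload = json.dumps(scenario, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def cached_note(scenario, cache, generate):
    """Call `generate` only when the scenario has not been narrated before."""
    key = scenario_key(scenario)
    if key not in cache:
        cache[key] = generate(scenario)
    return cache[key]
```

Keying the cache on the scenario's numbers means a rebuild only calls the model again when a simulation parameter actually changes.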

Data

The capstone uses the Upworthy Research Archive, a published dataset of 32,487 A/B tests on news headlines run by Upworthy between 2013 and 2015 and released for replication research by Matias, Munger, Le Quéré, and Ebersole in 2021. When the archive CSV is not present locally, the build pipeline generates a deterministic stand-in with the same statistical shape: log-normal test sizes, a small minority of strong-effect headlines, and roughly 5% of tests reaching significance at α = 0.05.
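
A deterministic stand-in of this kind could be generated along the following lines. This is a sketch under stated assumptions, not the pipeline's generator: the function name, the seed, the log-normal parameters, the baseline click rate, the share and size of strong effects, and the normal approximation to binomial click counts are all illustrative choices.

```python
import math
import random


def synthetic_archive(n_tests=1000, strong_share=0.03, base_ctr=0.015,
                      seed=20210101):
    """Generate a reproducible list of two-arm headline tests."""
    rng = random.Random(seed)  # fixed seed makes every build identical
    tests = []
    for test_id in range(n_tests):
        # Log-normal test sizes: many small tests, a long tail of large ones.
        impressions = max(100, int(rng.lognormvariate(8.0, 1.0)))
        # A small minority of headlines carry a real, strong effect.
        strong = rng.random() < strong_share
        lift = rng.uniform(0.2, 0.5) if strong else 0.0
        clicks = []
        for ctr in (base_ctr, base_ctr * (1 + lift)):
            # Normal approximation to Binomial(impressions, ctr), clipped
            # to the valid range; an assumption made for speed.
            mean = impressions * ctr
            sd = math.sqrt(impressions * ctr * (1 - ctr))
            clicks.append(min(impressions, max(0, round(rng.gauss(mean, sd)))))
        tests.append({
            "test_id": test_id,
            "impressions_per_arm": impressions,
            "true_relative_lift": lift,
            "clicks_control": clicks[0],
            "clicks_treatment": clicks[1],
        })
    return tests
```

Because the generator draws from a seeded `random.Random` instance rather than the global RNG, two calls with the same seed return identical data regardless of what else the build has run.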

Reproducibility

From the repository root:

uv sync
uv run python -m abtest_autopsy.export      # regenerate data/exports/*.csv
uv run python -m abtest_autopsy.narrate     # regenerate narratives.csv
uv run python build_site.py                 # data/exports -> site/data
uv run pytest                               # 5 statistical sanity tests

References