A postmortem of the ways A/B tests go wrong.
Five common failure modes in online experimentation, reproduced from synthetic data with known ground truth and applied to thirty-two thousand real headline tests from the Upworthy Research Archive. Each section is interactive.
The five failure modes
Stopping early inflates type-I error
Checking interim results and stopping when significant drives the false-positive rate to roughly 30% at twenty looks, against a nominal 5%.
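A minimal sketch of the peeking effect, assuming an A/A test on unit-variance normal outcomes, equally spaced looks, and a two-sided z-test at every look. The function name and all parameters are hypothetical; the exact inflated rate depends on how the looks are spaced and which test is used.

```python
import numpy as np
from scipy import stats


def peeking_false_positive_rate(n_sims=2000, n_looks=20, n_per_look=50,
                                alpha=0.05, seed=0):
    """Share of A/A tests declared significant at any of n_looks interim checks."""
    rng = np.random.default_rng(seed)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    n_total = n_looks * n_per_look
    # Both arms come from the same distribution, so every rejection is a false positive.
    a = rng.standard_normal((n_sims, n_total))
    b = rng.standard_normal((n_sims, n_total))
    # Per-arm sample size at each look.
    n_at_look = np.arange(1, n_looks + 1) * n_per_look
    # Running sum of the arm difference, read off at each look.
    diff_sum = np.cumsum(a - b, axis=1)[:, n_at_look - 1]
    # z = (mean difference) / sqrt(2 / n) = diff_sum / sqrt(2 * n), with known unit variance.
    z = diff_sum / np.sqrt(2 * n_at_look)
    # A simulated test "stops early" if any look crosses the critical value.
    return float(np.mean(np.any(np.abs(z) > z_crit, axis=1)))


rate_single_look = peeking_false_positive_rate(n_looks=1)
rate_twenty_looks = peeking_false_positive_rate(n_looks=20)
print(f"one look: {rate_single_look:.3f}, twenty looks: {rate_twenty_looks:.3f}")
```

With a single look the rate stays near the nominal level; with twenty looks it is several times higher.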
Required N scales with 1/MDE²
Halving the effect size you can detect quadruples the sample you need. Running an under-powered test longer rarely closes the gap.
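The scaling can be checked with the usual normal-approximation formula for a two-sided, two-proportion test. This is a sketch under simplifying assumptions: equal arm sizes, an absolute MDE, and the baseline variance used for both arms. The function name is hypothetical.

```python
from scipy import stats


def required_n_per_arm(baseline, mde_abs, alpha=0.05, power=0.8):
    """Approximate users per arm needed to detect an absolute lift of mde_abs."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-sided critical value
    z_beta = stats.norm.ppf(power)           # quantile for the target power
    variance = 2 * baseline * (1 - baseline)  # variance of the difference, per user
    return (z_alpha + z_beta) ** 2 * variance / mde_abs ** 2


n_full = required_n_per_arm(baseline=0.10, mde_abs=0.010)
n_half = required_n_per_arm(baseline=0.10, mde_abs=0.005)
print(f"MDE 1.0pp: {n_full:,.0f} per arm; MDE 0.5pp: {n_half:,.0f} per arm")
print(f"ratio: {n_half / n_full:.1f}")
```

Because the MDE enters only through the squared denominator, the ratio between the two sample sizes is exactly four regardless of the baseline, α, or power chosen.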
Aggregate can reverse the segments
Treatment can help every segment yet appear to hurt in aggregate when assignment ratios are uneven across segments with different baselines (Simpson's paradox).
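A worked example of the reversal, using made-up counts chosen for illustration (they are not taken from the site's scenarios). Treatment is over-assigned to the low-baseline segment, so its pooled rate is dragged down even though it wins within each segment.

```python
# Hypothetical (users, conversions) per segment and arm.
data = {
    "new":       {"control": (1000, 20),  "treatment": (9000, 225)},
    "returning": {"control": (9000, 900), "treatment": (1000, 110)},
}


def rate(users, conversions):
    return conversions / users


# Within-segment lift: treatment rate minus control rate.
segment_lift = {
    segment: rate(*arms["treatment"]) - rate(*arms["control"])
    for segment, arms in data.items()
}


def pooled_rate(arm):
    """Conversion rate for one arm after pooling all segments."""
    users = sum(arms[arm][0] for arms in data.values())
    conversions = sum(arms[arm][1] for arms in data.values())
    return conversions / users


aggregate_lift = pooled_rate("treatment") - pooled_rate("control")

for segment, lift in segment_lift.items():
    print(f"{segment}: {lift:+.4f}")
print(f"aggregate: {aggregate_lift:+.4f}")
```

Comparing arms within each segment, or reweighting segments to a common mix, removes the reversal.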
Day-one lifts fade
A 30% lift on day one can decay to near zero by day twenty-eight. Reading the test too early overstates the long-run effect.
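One way to picture the fade is a novelty curve that halves at a fixed interval. The exponential shape and the four-day half-life below are assumptions made for illustration only; the source does not state the functional form it simulates.

```python
import numpy as np


def lift_on_day(day, initial_lift=0.30, half_life_days=4.0):
    """Hypothetical novelty curve: the lift halves every half_life_days after day 1."""
    return initial_lift * 0.5 ** ((day - 1) / half_life_days)


days = np.arange(1, 29)          # days 1 through 28
daily_lift = lift_on_day(days)

day_one_lift = daily_lift[0]     # what an early read would report
day_28_lift = daily_lift[-1]     # where the effect has settled by the end
mean_28d_lift = daily_lift.mean()  # average lift over the full window

print(f"day 1: {day_one_lift:.3f}, day 28: {day_28_lift:.4f}, "
      f"28-day mean: {mean_28d_lift:.3f}")
```

Under these assumptions the day-one reading is several times larger than the 28-day average, which is the gap an early stop would hide.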
Broken randomisation invalidates lift
With an intended 50/50 allocation, a 48/52 split across 20k users is not noise (χ² = 32). A χ² guardrail at p < 10⁻³ catches assignment bugs before they reach analysis.
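A sketch of such a guardrail using `scipy.stats.chisquare`, assuming an intended 50/50 allocation. The function name `srm_check` and its defaults are hypothetical.

```python
from scipy import stats


def srm_check(n_control, n_treatment, expected_ratio=0.5, threshold=1e-3):
    """Chi-squared goodness-of-fit test of observed arm sizes against the intended split."""
    total = n_control + n_treatment
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    chi2, p_value = stats.chisquare([n_control, n_treatment], f_exp=expected)
    # Flag a sample ratio mismatch when the p-value falls below the guardrail threshold.
    return float(chi2), float(p_value), bool(p_value < threshold)


# 48/52 split across 20k users.
chi2_bad, p_bad, flagged_bad = srm_check(9_600, 10_400)
# A split that is well within sampling noise.
chi2_ok, p_ok, flagged_ok = srm_check(10_050, 9_950)

print(f"48/52: chi2={chi2_bad:.1f}, p={p_bad:.2e}, flagged={flagged_bad}")
print(f"50.25/49.75: chi2={chi2_ok:.1f}, p={p_ok:.2f}, flagged={flagged_ok}")
```

The strict threshold keeps the guardrail from firing on ordinary sampling noise while still catching imbalances of this size.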
The lens on real data
Applied to thirty-two thousand peer-reviewed headline A/B tests from the Upworthy Research Archive. Lift distribution by significance, filterable by topic.
How it is built
Pathologies are simulated in Python with numpy and scipy. Each scenario carries the ground-truth parameters used to generate it, so the chart can show what a naive analyst would conclude alongside the truth.
Charts are Vega-Lite specs embedded directly in each page. The site is a small set of hand-written HTML files. There is no application server: the page that holds this paragraph is a static file on GitHub Pages.
The analyst notes on each page are produced by a language model conditioned on the numeric scenario and cached at build time. The repository ships a Streamlit version with a live chat panel for running locally.
Reference data
The capstone uses the Upworthy Research Archive (Matias, Munger, Le Quere, Ebersole, 2021), approximately 32 488 peer-reviewed A/B tests on news headlines. When the archive CSV is not present locally, a deterministic stand-in is generated with the same statistical shape: log-normal test sizes, a small minority of strong-effect headlines, and roughly 5% of tests crossing α = 0.05.
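A stand-in with that shape could be generated along the following lines. This is a sketch, not the repository's generator: the function name, the base click-through rate, the log-normal parameters, and the share and size of strong effects are all hypothetical.

```python
import numpy as np
from scipy import stats


def make_stand_in(n_tests=5000, strong_share=0.01, seed=42):
    """Synthetic headline tests; a fixed seed makes the output deterministic."""
    rng = np.random.default_rng(seed)
    # Log-normal impressions per arm (hypothetical location and scale), floored at 200.
    n = np.maximum(rng.lognormal(mean=8.0, sigma=0.6, size=n_tests).astype(int), 200)
    base_ctr = 0.015  # hypothetical baseline click-through rate
    # Most headlines have no true effect; a small minority get a strong relative lift.
    strong = rng.random(n_tests) < strong_share
    true_lift = np.where(strong, 0.5, 0.0)
    clicks_a = rng.binomial(n, base_ctr)
    clicks_b = rng.binomial(n, base_ctr * (1 + true_lift))
    # Pooled two-proportion z-test for each headline pair.
    p_a, p_b = clicks_a / n, clicks_b / n
    pooled = (clicks_a + clicks_b) / (2 * n)
    se = np.sqrt(2 * pooled * (1 - pooled) / n)
    z = np.divide(p_b - p_a, se, out=np.zeros(n_tests), where=se > 0)
    p_value = 2 * stats.norm.sf(np.abs(z))
    return {"n": n, "true_lift": true_lift,
            "observed_lift": p_b - p_a, "p_value": p_value}


stand_in = make_stand_in()
share_significant = float(np.mean(stand_in["p_value"] < 0.05))
print(f"share of tests with p < 0.05: {share_significant:.3f}")
```

Because most simulated headlines carry no true effect, the share of tests crossing α = 0.05 stays close to the false-positive rate, with the strong-effect minority adding a little on top.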
Source & reproducibility
- Repository: github.com/lupiochi
- Stack: Python 3.12, numpy, scipy, pandas
- Front end: HTML, Vega-Lite, Vega-Embed
- Built and verified May 2026