Interactive demo

A postmortem of the ways A/B tests go wrong.

Five common failure modes in online experimentation, reproduced on synthetic data with known ground truth and then applied to roughly thirty-two thousand real headline tests from the Upworthy Research Archive. Each section is interactive.

The five failure modes

01 · Peeking bias

Stopping early inflates type-I error

Checking interim results and stopping at the first significant one drives the false-positive rate to roughly 30% at twenty looks, against a nominal 5%.
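The inflation can be reproduced in a few lines. The sketch below is not the site's simulation: it assumes a two-sided z-test with known variance, equally sized batches between looks, and a true effect of zero, and the function name and defaults are illustrative. The exact rate depends on the test statistic and the look schedule, so it need not match the figure quoted above.

```python
import random
from statistics import NormalDist


def peeking_false_positive_rate(looks=20, alpha=0.05, n_sims=5000, seed=0):
    """Share of A/A experiments declared significant at any of `looks` peeks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    hits = 0
    for _ in range(n_sims):
        cum = 0.0  # running sum of standardised batch effects (true effect is 0)
        for k in range(1, looks + 1):
            cum += rng.gauss(0.0, 1.0)  # one more equally sized batch arrives
            z = cum / k ** 0.5          # z-statistic on all data seen so far
            if abs(z) > z_crit:         # analyst stops at the first significant look
                hits += 1
                break
    return hits / n_sims


if __name__ == "__main__":
    for looks in (1, 5, 20):
        print(looks, peeking_false_positive_rate(looks=looks))
```

With a single look the rate stays near the nominal 5%; every additional look gives the null another chance to cross the threshold.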

02 · Minimum detectable effect

Required N scales with 1/MDE²

Halving the smallest effect you want to detect quadruples the sample you need. Running an under-powered test longer rarely closes the gap.
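The 1/MDE² scaling follows from the standard two-sample formula n = 2 (z₁₋α/₂ + z_power)² σ² / MDE² per arm. A minimal sketch, assuming a two-sided test on a difference in means with equal arm sizes and a common standard deviation; the function name and defaults are illustrative, not taken from the repository.

```python
from statistics import NormalDist


def required_n_per_arm(mde, sigma=1.0, alpha=0.05, power=0.8):
    """Approximate users per arm needed to detect an absolute difference `mde`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance quantile
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    return 2 * ((z_alpha + z_power) * sigma / mde) ** 2


if __name__ == "__main__":
    n_full = required_n_per_arm(mde=0.10)  # roughly 1,570 per arm
    n_half = required_n_per_arm(mde=0.05)  # four times as many
    print(round(n_full), round(n_half), n_half / n_full)
```

Because the MDE enters squared in the denominator, the ratio between the two calls is exactly 4 whatever α, power, or σ are chosen.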

03 · Simpson's paradox

Aggregate can reverse the segments

Treatment can help every segment yet appear to hurt in aggregate when assignment ratios are uneven across segments with different baselines.
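A small worked example makes the reversal concrete. All counts below are made up for illustration: control is mostly high-baseline returning users, treatment is mostly low-baseline new users.

```python
# (users, conversions) per segment and arm; every number here is hypothetical.
DATA = {
    "returning": {"control": (8000, 1600), "treatment": (2000, 440)},  # 20% vs 22%
    "new":       {"control": (2000, 100),  "treatment": (8000, 480)},  # 5% vs 6%
}


def segment_lifts(data):
    """Absolute lift (treatment rate minus control rate) within each segment."""
    lifts = {}
    for segment, arms in data.items():
        users_c, conv_c = arms["control"]
        users_t, conv_t = arms["treatment"]
        lifts[segment] = conv_t / users_t - conv_c / users_c
    return lifts


def aggregate_lift(data):
    """Absolute lift after pooling all segments, ignoring the segment mix."""
    rates = {}
    for arm in ("control", "treatment"):
        users = sum(arms[arm][0] for arms in data.values())
        conversions = sum(arms[arm][1] for arms in data.values())
        rates[arm] = conversions / users
    return rates["treatment"] - rates["control"]


if __name__ == "__main__":
    print(segment_lifts(DATA))   # positive in both segments
    print(aggregate_lift(DATA))  # negative once pooled
```

Treatment wins by two points among returning users and one point among new users, yet the pooled rate is 9.2% against 17% for control, because the pooled comparison mostly reflects which arm received more high-baseline users.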

04 · Novelty decay

Day-one lifts fade

A 30% lift on day one decays to near zero by day twenty-eight. Reading the test too early overstates the long-run effect.
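How much an early read overstates the effect depends on the shape of the decay. The sketch below assumes an exponential decay with a 3.5-day half-life; both the functional form and the half-life are assumptions chosen so that a 30% day-one lift is near zero by day twenty-eight, not parameters from the site's scenario.

```python
def daily_lift(day, initial_lift=0.30, half_life_days=3.5):
    """Relative lift on `day` (1-indexed) under an assumed exponential decay."""
    return initial_lift * 0.5 ** ((day - 1) / half_life_days)


def mean_lift(first_day, last_day, **decay_params):
    """Average daily lift an analyst would report over an inclusive day window."""
    days = range(first_day, last_day + 1)
    return sum(daily_lift(d, **decay_params) for d in days) / len(days)


if __name__ == "__main__":
    print("day 1:", daily_lift(1))
    print("day 28:", daily_lift(28))
    print("week 1 average:", mean_lift(1, 7))
    print("week 4 average:", mean_lift(22, 28))
```

Under these assumptions the first-week average is many times larger than the fourth-week average, which is the closer proxy for the long-run effect.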

05 · Sample-ratio mismatch

Broken randomisation invalidates lift

Under an intended 50/50 allocation, a 48/52 split across 20k users is not noise. A χ² guardrail at p < 10⁻³ catches assignment bugs before they reach analysis.
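The guardrail is a one-degree-of-freedom χ² goodness-of-fit test on the arm counts. A minimal sketch using only the standard library, assuming an intended 50/50 allocation; the function names are illustrative. For one degree of freedom the χ² tail probability equals erfc(√(χ²/2)), so no statistics package is needed.

```python
import math


def srm_check(n_control, n_treatment, expected_control_share=0.5):
    """Return (chi2, p_value) for the observed split against the intended one."""
    total = n_control + n_treatment
    expected_c = total * expected_control_share
    expected_t = total - expected_c
    chi2 = ((n_control - expected_c) ** 2 / expected_c
            + (n_treatment - expected_t) ** 2 / expected_t)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 df
    return chi2, p_value


def has_srm(n_control, n_treatment, threshold=1e-3):
    """True when the split is too unlikely under the intended allocation."""
    _, p_value = srm_check(n_control, n_treatment)
    return p_value < threshold


if __name__ == "__main__":
    print(srm_check(9600, 10400))  # chi2 = 32, p about 1.5e-8
    print(has_srm(9950, 10050))    # an ordinary fluctuation, not flagged
```

A 9,600 / 10,400 split gives χ² = 32, several orders of magnitude past the 10⁻³ threshold, while a 9,950 / 10,050 split passes comfortably.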

Capstone

The lens on real data

The same lens, applied to roughly thirty-two thousand headline A/B tests from the peer-reviewed Upworthy Research Archive. The lift distribution is shown by significance and can be filtered by topic.

How it is built

Pathologies are simulated in Python with numpy and scipy. Each scenario carries the ground-truth parameters used to generate it, so the chart can show what a naive analyst would conclude alongside the truth.

Charts are Vega-Lite specs embedded directly in each page. The site is a small set of hand-written HTML files. There is no application server: the page that holds this paragraph is a static file on GitHub Pages.

The analyst notes on each page are produced by a language model conditioned on the numeric scenario and cached at build time. The repository ships a Streamlit version with a live chat panel for running locally.

Reference data

The capstone uses the Upworthy Research Archive (Matias, Munger, Le Quere, Ebersole, 2021), a peer-reviewed dataset of approximately 32 488 A/B tests on news headlines. When the archive CSV is not present locally, a deterministic stand-in is generated with the same statistical shape: log-normal test sizes, a small minority of strong-effect headlines, and roughly 5% of tests reaching significance at α = 0.05.
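The idea of a deterministic stand-in can be sketched as follows. This is not the repository's generator: the function name, the log-normal parameters, the 3% share of strong-effect headlines, and the lift size are all hypothetical placeholders, and the sketch produces only test sizes and true lifts, not click outcomes or p-values.

```python
import random


def synthetic_archive(n_tests=1000, seed=42, strong_share=0.03, strong_lift=0.5):
    """Generate a reproducible list of stand-in headline tests."""
    rng = random.Random(seed)  # a fixed seed makes every build identical
    tests = []
    for test_id in range(n_tests):
        # Log-normal test size: most tests are small, a few are very large.
        impressions = int(rng.lognormvariate(8.0, 1.0)) + 1
        # A small minority of headlines carry a real effect; the rest are null.
        true_lift = strong_lift if rng.random() < strong_share else 0.0
        tests.append({"test_id": test_id,
                      "impressions": impressions,
                      "true_lift": true_lift})
    return tests


if __name__ == "__main__":
    archive = synthetic_archive()
    strong = sum(1 for t in archive if t["true_lift"] > 0)
    print(len(archive), "tests,", strong, "with a real effect")
```

Seeding a private `random.Random` instance, rather than the module-level generator, keeps the output stable even if other code draws random numbers during the build.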

Source & reproducibility

Repository: github.com/lupiochi
Stack: Python 3.12, numpy, scipy, pandas
Front end: HTML, Vega-Lite, Vega-Embed
Built and verified May 2026