About this demo

This page documents the design and provenance of every chart on the site. It exists so that a reader can replicate the work end-to-end from the repository in a few commands.

Why this exists

Most A/B tests in industry are read incorrectly rather than run incorrectly. Teams stop as soon as a result first crosses the significance threshold, slice by segment until something looks significant, or extrapolate from a day-one lift that does not persist. The five pathologies on this site are well documented in the experimentation literature; the contribution here is to reproduce each one from synthetic data with known ground truth, then apply the same checks to a real published dataset.

Statistical methods

Section	Method	Reference
Peeking bias	Sequential two-proportion z-tests under H₀; Monte Carlo estimate of P(reject at any look)	Johari et al., 2017
Minimum detectable effect	Normal-approximation power calculation, two proportions	Lehr, 1992
Simpson's paradox	Conditional vs. marginal lift across a binary segment	Pearl, 2009
Novelty decay	Daily lift with exponential decay toward steady state	Goldberg et al., 2019
Sample-ratio mismatch	χ² goodness-of-fit on bucket counts at α = 10⁻³	Fabijan et al., 2019
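
As an illustration of three rows in the table (peeking bias, minimum detectable effect, and sample-ratio mismatch), here is a minimal standard-library sketch. It is not the repository's implementation, which uses numpy and scipy; the function names, default parameters, and the equal-sized batches per look are assumptions made for the example.

```python
import math
import random
from statistics import NormalDist


def two_proportion_p_value(x_a, n_a, x_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation observed, nothing to reject
    z = (x_b / n_b - x_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def peeking_false_positive_rate(n_sims, n_looks, batch, rate, alpha=0.05, seed=0):
    """Monte Carlo estimate of P(reject at any look) when H0 is true.

    Both arms share the same conversion rate, so every rejection is a
    false positive. Each look adds `batch` users per arm (an assumption).
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        x_a = x_b = n = 0
        for _ in range(n_looks):
            x_a += sum(rng.random() < rate for _ in range(batch))
            x_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            if two_proportion_p_value(x_a, n, x_b, n) < alpha:
                rejections += 1
                break  # the experimenter stops at the first "significant" look
    return rejections / n_sims


def minimum_detectable_effect(n_per_arm, base_rate, alpha=0.05, power=0.8):
    """Absolute MDE under the normal approximation, using the baseline
    variance for both arms."""
    z_sum = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z_sum * math.sqrt(2 * base_rate * (1 - base_rate) / n_per_arm)


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value for two buckets (1 degree of freedom)."""
    total = n_control + n_treatment
    exp_t = total * expected_treatment_share
    exp_c = total - exp_t
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    # With 1 degree of freedom, P(chi2 > x) equals P(|Z| > sqrt(x)).
    return math.erfc(math.sqrt(chi2 / 2))
```

With α = 0.05 and 80% power, the squared sum of the two normal quantiles is about 7.85, and doubling it gives roughly 16, which is the constant in Lehr's rule of thumb. Comparing `peeking_false_positive_rate` at one look against ten looks should show the any-look rejection rate climbing well above the nominal α, and `srm_p_value` can be compared against the 10⁻³ threshold from the table.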

Architecture

The pathologies are simulated in Python with NumPy and SciPy. Each scenario carries the ground-truth parameters used to generate it. A build step (build_site.py) converts the simulation outputs to small JSON files under site/data/; the histograms for the Upworthy capstone are pre-aggregated to keep payloads small. Charts on every page are Vega-Lite specs rendered by Vega-Embed in the browser. The site is a small set of hand-written HTML files served by GitHub Pages.
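
The build step can be pictured roughly as follows. This is a sketch under assumptions, not the contents of build_site.py: the helper names (`export_csv_to_json`, `histogram`), the record layout, and the `bin_start`/`count` field names are hypothetical. It turns one exported CSV into a compact JSON array of records, a shape Vega-Lite can load from a URL, and shows how pre-aggregating a histogram replaces many raw values with a few bin counts.

```python
import csv
import json
import math
from pathlib import Path


def _coerce(value):
    """Turn numeric-looking CSV strings into numbers; leave other values alone."""
    for cast in (int, float):
        try:
            return cast(value)
        except (TypeError, ValueError):
            pass
    return value


def export_csv_to_json(csv_path, json_path):
    """Write the rows of one CSV file as a compact JSON array of records."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = [
            {key: _coerce(val) for key, val in row.items()}
            for row in csv.DictReader(handle)
        ]
    json_path = Path(json_path)
    json_path.parent.mkdir(parents=True, exist_ok=True)
    # Compact separators drop the whitespace json.dumps adds by default.
    json_path.write_text(json.dumps(rows, separators=(",", ":")), encoding="utf-8")
    return len(rows)


def histogram(values, bin_width):
    """Pre-aggregate raw values into sorted {bin_start, count} records."""
    counts = {}
    for value in values:
        start = math.floor(value / bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return [{"bin_start": s, "count": c} for s, c in sorted(counts.items())]
```

The payload saving comes from shipping one record per bin instead of one record per test, so the file size depends on the number of bins, not the number of experiments.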

The analyst notes

The short note next to each chart is produced by a language model conditioned on the numeric scenario. The prompt forbids hype words, em-dashes, and chain-of-thought preamble, and asks for a 45–65 word summary, one caveat, and one next step. Notes are generated once at build time and cached. A Streamlit version in the repository keeps a live chat panel; the static site omits it because embedding the API key in client-side JavaScript would expose it to every visitor.
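
One way to picture "generated once at build time and cached" is a cache keyed by a hash of the scenario, paired with a cheap check of the prompt's constraints on whatever comes back. The sketch below is an assumption about how this could be done, not the repository's narrate module: the function names are hypothetical, `generate` stands in for the model call, and checking the output after generation is an addition of this example (the source only says the prompt states the constraints).

```python
import hashlib
import json

EM_DASH = "\u2014"


def scenario_key(scenario):
    """Stable cache key: hash of the scenario serialized with sorted keys."""
    payload = json.dumps(scenario, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def summary_violations(summary, min_words=45, max_words=65):
    """Return a list of constraint violations found in a generated summary."""
    problems = []
    n_words = len(summary.split())
    if not min_words <= n_words <= max_words:
        problems.append(
            f"summary has {n_words} words, expected {min_words}-{max_words}"
        )
    if EM_DASH in summary:
        problems.append("summary contains an em-dash")
    return problems


def cached_note(scenario, cache, generate):
    """Call `generate` only when this exact scenario has not been seen before."""
    key = scenario_key(scenario)
    if key not in cache:
        cache[key] = generate(scenario)
    return cache[key]
```

Because the key depends only on the scenario's numbers, rebuilding the site with unchanged simulation outputs would reuse every cached note and make no model calls.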

Data

The capstone uses the Upworthy Research Archive, a published dataset of 32,488 A/B tests on news headlines run by Upworthy between 2014 and 2016 and released for replication research by Matias, Munger, Le Quere, and Ebersole in 2021. When the archive CSV is not present locally, the build pipeline generates a deterministic stand-in with the same statistical shape: log-normal test sizes, a small minority of strong-effect headlines, and roughly 5% of tests crossing α = 0.05.
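
A deterministic stand-in of this kind can be sketched with a seeded random number generator. The function name, field names, and every numeric parameter below (log-normal location and scale, the share and range of strong effects, the minimum size) are illustrative assumptions, not the values used by the build pipeline. The sketch covers only test sizes and true effects; it does not simulate clicks or p-values.

```python
import random


def synthetic_archive(n_tests, seed=0, strong_share=0.02):
    """Generate a reproducible list of synthetic headline tests.

    A fixed seed makes the output identical on every build, so charts
    built from the stand-in do not change between runs.
    """
    rng = random.Random(seed)
    tests = []
    for test_id in range(n_tests):
        # Log-normal sizes: many small tests and a long tail of large ones.
        impressions = max(100, round(rng.lognormvariate(8.0, 1.0)))
        # A small minority of headlines carry a strong true effect;
        # the rest have no true effect at all.
        is_strong = rng.random() < strong_share
        true_lift = rng.uniform(0.2, 0.5) if is_strong else 0.0
        tests.append(
            {"test_id": test_id, "impressions": impressions, "true_lift": true_lift}
        )
    return tests
```

Calling `synthetic_archive` twice with the same seed returns equal lists, which is the property that lets the rest of the pipeline be tested without the real archive on disk.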

Reproducibility

From the repository root:

uv sync
uv run python -m abtest_autopsy.export      # regenerate data/exports/*.csv
uv run python -m abtest_autopsy.narrate     # regenerate narratives.csv
uv run python build_site.py                 # data/exports -> site/data
uv run pytest                               # 5 statistical sanity tests

References

Johari, R., Pekelis, L., & Walsh, D. J. (2017). Always valid inference: Bringing sequential analysis to A/B testing. arXiv:1512.04922.
Lehr, R. (1992). Sixteen S-squared over D-squared: A relation for crude sample size estimates. Statistics in Medicine, 11(8), 1099–1102.
Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press.
Goldberg, D., Johndrow, J., & Lee, J. (2019). Decision-making with the Sequential Probability Ratio Test. arXiv:1905.07731.
Fabijan, A., Gupchup, J., Gupta, S., Omhover, J., Qin, W., Vermeer, L., & Dmitriev, P. (2019). Diagnosing sample ratio mismatch in online controlled experiments. Proc. KDD '19.
Matias, J. N., Munger, K., Le Quere, M. A., & Ebersole, C. R. (2021). The Upworthy Research Archive: A time series of 32,488 experiments in U.S. media. Scientific Data, 8, 195. doi:10.1038/s41597-021-00934-7