About this demo
This page documents the design and provenance of every chart on the site. It exists so that a reader can replicate the work end-to-end from the repository in a few commands.
Why this exists
Most A/B tests in industry are read incorrectly rather than run incorrectly. Teams stop as soon as the p-value crosses the significance threshold, segment until something pops, or extrapolate from a day-one lift that does not persist. The five pathologies on this site are well documented in the experimentation literature; the contribution here is to reproduce each one from synthetic data with known ground truth, then apply the same lens to a real published dataset.
Statistical methods
| Section | Method | Reference |
|---|---|---|
| Peeking bias | Sequential two-proportion z-tests under H₀; Monte Carlo estimate of P(reject anywhere) | Johari et al., 2017 |
| Minimum detectable effect | Normal-approximation power calculation, two proportions | Lehr, 1992 |
| Simpson's paradox | Conditional vs. marginal lift across a binary segment | Pearl, 2009 |
| Novelty decay | Daily lift with exponential decay toward steady state | Goldberg et al., 2019 |
| Sample-ratio mismatch | χ² goodness-of-fit on bucket counts at α = 10⁻³ | Fabijan et al., 2019 |
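The peeking-bias row can be illustrated with a short simulation. The sketch below is not the repository's code: the function name, the default sample sizes, and the number of looks are assumptions chosen for illustration. Both arms share one conversion rate, so H₀ holds and every rejection is a false positive.

```python
import numpy as np
from scipy.stats import norm


def peeking_false_positive_rate(n_sims=2000, n_per_arm=5000, n_looks=20,
                                base_rate=0.10, alpha=0.05, seed=0):
    """Estimate P(reject at any look) and P(reject at the final look) under H0.

    All defaults are illustrative assumptions, not values from the site.
    """
    rng = np.random.default_rng(seed)
    z_crit = norm.ppf(1 - alpha / 2)

    # Equal-sized batches arrive between looks; n_cum is the per-arm
    # sample size at each look.
    batch = n_per_arm // n_looks
    n_cum = batch * np.arange(1, n_looks + 1)

    # Cumulative conversions per arm at each look. Both arms use base_rate,
    # so there is no true effect.
    conv_a = rng.binomial(batch, base_rate, size=(n_sims, n_looks)).cumsum(axis=1)
    conv_b = rng.binomial(batch, base_rate, size=(n_sims, n_looks)).cumsum(axis=1)

    # Pooled two-proportion z-statistic at every look.
    pooled = (conv_a + conv_b) / (2 * n_cum)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n_cum)
    diff = (conv_b - conv_a) / n_cum
    z = np.divide(diff, se, out=np.zeros_like(diff), where=se > 0)

    significant = np.abs(z) > z_crit
    # First value: stop at the first significant look. Second: test once at the end.
    return significant.any(axis=1).mean(), significant[:, -1].mean()
```

Calling `peeking_false_positive_rate()` returns two rates. The single final test should sit near the nominal α, while the "reject anywhere" rate is noticeably higher, which is the inflation the peeking section visualizes.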
Architecture
Pathologies are simulated in Python with numpy and scipy. Each scenario carries the ground-truth parameters used to generate it. A build step (build_site.py) converts the simulation outputs to small JSON files under site/data/; the histograms for the Upworthy capstone are pre-aggregated to keep payloads small. Charts on every page are Vega-Lite specs rendered by Vega-Embed in the browser. The site is a small set of hand-written HTML files served by GitHub Pages.
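A build step of this kind might look roughly like the sketch below. It is not the contents of build_site.py: the helper names, the record layout, and the bin count are assumptions. It shows the two ideas described above, turning a CSV export into a JSON array of records and replacing raw values with pre-aggregated histogram bins.

```python
import csv
import json
from pathlib import Path

import numpy as np


def _cast(value: str):
    """Convert numeric-looking CSV cells to float; leave other cells as text."""
    try:
        return float(value)
    except ValueError:
        return value


def csv_to_json(csv_path: Path, json_path: Path) -> int:
    """Write one CSV export as a JSON array of records; return the row count."""
    with csv_path.open(newline="") as f:
        rows = [{k: _cast(v) for k, v in row.items()} for row in csv.DictReader(f)]
    json_path.parent.mkdir(parents=True, exist_ok=True)
    json_path.write_text(json.dumps(rows))
    return len(rows)


def histogram_payload(values, bins=30):
    """Pre-aggregate raw values into one record per bin.

    The payload size then depends on the bin count, not on the number of
    underlying observations.
    """
    counts, edges = np.histogram(np.asarray(values, dtype=float), bins=bins)
    return [
        {"bin_start": float(lo), "bin_end": float(hi), "count": int(c)}
        for lo, hi, c in zip(edges[:-1], edges[1:], counts)
    ]
```

A list of records in this shape can be loaded directly as a Vega-Lite data source, with `bin_start`, `bin_end`, and `count` (hypothetical field names) mapped to a pre-binned bar chart.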
The analyst notes
The short note next to each chart is produced by a language model conditioned on the numeric scenario. The prompt forbids hype words, em-dashes, and chain-of-thought preamble, and asks for a 45–65 word summary, one caveat, and one next step. Notes are generated once at build time and cached. A Streamlit version in the repository keeps a live chat panel; the static site omits it because embedding the API key in client JavaScript would expose the key.
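The source does not say whether the build validates the model's output or how the cache is keyed. The sketch below shows one way the stated prompt rules could be checked after generation and how a cache key could be derived from the scenario. The hype-word list, the function names, and the hashing scheme are all hypothetical.

```python
import hashlib
import json

# Hypothetical examples; the actual forbidden-word list is not given in the source.
HYPE_WORDS = ("revolutionary", "game-changing", "groundbreaking")


def check_summary(summary: str, min_words: int = 45, max_words: int = 65) -> list:
    """Return a list of rule violations; an empty list means the summary passes."""
    problems = []
    n_words = len(summary.split())
    if not min_words <= n_words <= max_words:
        problems.append(f"{n_words} words, expected {min_words} to {max_words}")
    if "\u2014" in summary:  # U+2014 is the em-dash
        problems.append("contains an em-dash")
    lowered = summary.lower()
    found = [w for w in HYPE_WORDS if w in lowered]
    if found:
        problems.append("hype words: " + ", ".join(found))
    return problems


def cache_key(scenario: dict) -> str:
    """Stable key for a numeric scenario, so a note is generated once per scenario."""
    canonical = json.dumps(scenario, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting the keys before hashing means two scenarios with the same parameters map to the same cached note regardless of dictionary ordering.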
Data
The capstone uses the Upworthy Research Archive, a published dataset of 32,487 A/B tests on news headlines run by Upworthy between January 2013 and April 2015 and released for replication research by Matias, Munger, Le Quere, and Ebersole in 2021. When the archive CSV is not present locally, the build pipeline generates a deterministic stand-in with the same statistical shape: log-normal test sizes, a small minority of strong-effect headlines, and roughly 5% of tests crossing α = 0.05.
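A stand-in with that shape could be generated along the following lines. This is a sketch under stated assumptions, not the pipeline's generator: the function name, the log-normal parameters, the base click rate, and the share and size of strong effects are invented for illustration. A fixed seed makes the output deterministic.

```python
import numpy as np
from scipy.stats import norm


def make_stand_in(n_tests=2000, base_rate=0.02, strong_share=0.01,
                  strong_lift=0.5, seed=42):
    """Generate synthetic headline tests; every default is an assumption."""
    rng = np.random.default_rng(seed)

    # Log-normal test sizes, floored so tiny tests do not degenerate.
    n_per_arm = np.maximum(rng.lognormal(mean=8.0, sigma=0.8, size=n_tests).astype(int), 200)

    # A small minority of tests get a real relative lift; the rest are null.
    is_strong = rng.random(n_tests) < strong_share
    rate_b = base_rate * np.where(is_strong, 1.0 + strong_lift, 1.0)

    clicks_a = rng.binomial(n_per_arm, base_rate)
    clicks_b = rng.binomial(n_per_arm, rate_b)

    # Pooled two-proportion z-test per simulated experiment.
    pooled = (clicks_a + clicks_b) / (2 * n_per_arm)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n_per_arm)
    diff = (clicks_b - clicks_a) / n_per_arm
    z = np.divide(diff, se, out=np.zeros_like(diff), where=se > 0)
    p_value = 2 * norm.sf(np.abs(z))

    return {"n_per_arm": n_per_arm, "is_strong": is_strong,
            "clicks_a": clicks_a, "clicks_b": clicks_b, "p_value": p_value}
```

Because most simulated tests are true nulls, their p-values are approximately uniform, so about 5% fall below 0.05 by construction; the strong-effect minority adds slightly to that share.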
Reproducibility
From the repository root:
uv sync
uv run python -m abtest_autopsy.export # regenerate data/exports/*.csv
uv run python -m abtest_autopsy.narrate # regenerate narratives.csv
uv run python build_site.py # data/exports -> site/data
uv run pytest # 5 statistical sanity tests
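The five sanity tests are not listed on this page. As an illustration of the kind of check such a suite might contain, the sketch below tests a sample-ratio-mismatch detector at α = 10⁻³ and compares Lehr's rule with the normal-approximation sample size. The helper and test names are hypothetical and do not come from the repository.

```python
import math

from scipy.stats import chisquare, norm


def srm_p_value(n_control, n_treatment, expected_ratio=0.5):
    """Chi-squared goodness-of-fit p-value for the observed bucket counts."""
    total = n_control + n_treatment
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    return chisquare([n_control, n_treatment], f_exp=expected).pvalue


def lehr_n_per_arm(p_base, abs_lift):
    """Lehr's rule: about 16 * s^2 / d^2 per arm (80% power, two-sided 5%)."""
    p_bar = p_base + abs_lift / 2
    return math.ceil(16 * p_bar * (1 - p_bar) / abs_lift**2)


def test_srm_flags_skewed_split():
    assert srm_p_value(50_000, 51_500) < 1e-3


def test_srm_passes_balanced_split():
    assert srm_p_value(50_000, 50_100) > 1e-3


def test_lehr_close_to_normal_approximation():
    p_base, lift = 0.10, 0.01
    p_bar = p_base + lift / 2
    z_sum = norm.ppf(0.975) + norm.ppf(0.80)
    exact = 2 * z_sum**2 * p_bar * (1 - p_bar) / lift**2
    # Lehr rounds 2 * (z_alpha + z_beta)^2 up to 16, so it is slightly conservative.
    assert 0 <= (lehr_n_per_arm(p_base, lift) - exact) / exact < 0.05
```

Saved in a file that pytest discovers, these functions would run with the same `uv run pytest` command.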
References
- Johari, R., Pekelis, L., & Walsh, D. J. (2017). Always valid inference: Bringing sequential analysis to A/B testing. arXiv:1512.04922.
- Lehr, R. (1992). Sixteen S-squared over D-squared: A relation for crude sample size estimates. Statistics in Medicine, 11(8), 1099–1102.
- Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press.
- Goldberg, D., Johndrow, J., & Lee, J. (2019). Decision-making with the Sequential Probability Ratio Test. arXiv:1905.07731.
- Fabijan, A., Gupchup, J., Gupta, S., Omhover, J., Qin, W., Vermeer, L., & Dmitriev, P. (2019). Diagnosing sample ratio mismatch in online controlled experiments. Proc. KDD '19.
- Matias, J. N., Munger, K., Le Quere, M. A., & Ebersole, C. R. (2021). The Upworthy Research Archive, a time series of 32,487 experiments in U.S. media. Scientific Data, 8, 195. doi:10.1038/s41597-021-00934-7