The lens, applied to real data
The data are approximately thirty-two thousand A/B tests on news headlines, originally collected by Upworthy between 2014 and 2016 and released for replication research by Matias et al. (2021). Each test pits two headlines against one another on randomly assigned traffic. The autopsy lens asks two questions: how often do these tests reach significance at α, and what does the distribution of lifts look like in production?
Distribution of observed lifts
Across 32,488 tests, observed lifts are centered near zero and heavy-tailed. A handful of headlines beat their alternative by more than 50%; most move the click rate little, if at all. Bars are split by whether the test reached significance at α = 0.05.
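As a rough illustration of how a single test's lift and significance could be computed, here is a minimal sketch. It assumes that "lift" means the relative difference in click-through rate and that significance comes from a pooled two-proportion z-test; the page does not state which test it uses, and the function name and the click counts are hypothetical.

```python
import math

def two_proportion_test(clicks_a, n_a, clicks_b, n_b):
    """Return (relative lift of B over A, two-sided p-value) from a pooled z-test.

    Assumes both arms have impressions and arm A has at least one click.
    """
    ctr_a = clicks_a / n_a
    ctr_b = clicks_b / n_b
    lift = (ctr_b - ctr_a) / ctr_a  # relative lift (assumed definition)
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (ctr_b - ctr_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return lift, p_value

# Hypothetical counts, not taken from the archive.
lift, p = two_proportion_test(clicks_a=50, n_a=3000, clicks_b=75, n_b=3000)
print(f"lift={lift:+.1%}, p={p:.4f}, significant={p < 0.05}")
```

With a few thousand impressions per arm, even a 50% relative lift sits close to the α = 0.05 boundary, which is one reason small observed lifts rarely register as significant.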
P-value distribution
If most A/B tests in a population are truly null, their p-values should be roughly uniform on [0, 1]. The observed bump near zero reflects the minority of headlines with a real effect, and the small spike just under 0.05 is the classical signature of optional stopping, the pattern that motivates a fixed-horizon, pre-registered design.
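The uniformity claim can be checked with a small simulation. The sketch below draws pairs of arms with an identical click rate (so every test is null by construction), computes a pooled z-test p-value for each, and bins the results. The sample sizes, click rate, seed, and function names are illustrative assumptions, not values from the archive.

```python
import math
import random

def p_value(clicks_a, clicks_b, n):
    """Two-sided pooled z-test p-value for two arms of equal size n."""
    pooled = (clicks_a + clicks_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0  # both arms all-zero or all-click: no evidence of a difference
    z = ((clicks_b - clicks_a) / n) / se
    return math.erfc(abs(z) / math.sqrt(2))

def simulate_null_p_values(n_tests=1000, n=1000, ctr=0.05, seed=0):
    """Simulate n_tests null A/B tests with a seeded generator."""
    rng = random.Random(seed)

    def draw_clicks():
        return sum(rng.random() < ctr for _ in range(n))

    return [p_value(draw_clicks(), draw_clicks(), n) for _ in range(n_tests)]

p_values = simulate_null_p_values()

# Histogram with ten equal-width bins on [0, 1].
bins = [0] * 10
for p in p_values:
    bins[min(int(p * 10), 9)] += 1

share_below_alpha = sum(p < 0.05 for p in p_values) / len(p_values)
print(bins)
print(f"share below 0.05: {share_below_alpha:.3f}")
```

Under the null, each bin should hold roughly a tenth of the tests and about 5% of p-values should fall below 0.05. Because click counts are discrete, the bins will not be perfectly flat.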
By topic
| Topic | Tests | Significant @ α=0.05 | Sig. rate | Median lift | Mean lift | Median impressions/arm | SRM flagged |
|---|---|---|---|---|---|---|---|
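The "SRM flagged" column refers to sample ratio mismatch: arm sizes that deviate from the intended traffic split by more than chance would explain. The page does not say how the flag is computed. One common approach, sketched here under the assumptions of an intended 50/50 split and a cutoff of p < 0.001, is a chi-square goodness-of-fit test on the impression counts. The function name, the threshold, and the example counts are hypothetical.

```python
import math

def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for arm sizes."""
    total = n_a + n_b
    exp_a = total * expected_share_a
    exp_b = total - exp_a
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    # With 1 degree of freedom, P(chi2 > x) equals the two-sided normal tail at sqrt(x).
    return math.erfc(math.sqrt(chi2 / 2))

SRM_THRESHOLD = 0.001  # assumed cutoff, not taken from the page

# Hypothetical impression counts for two tests.
for n_a, n_b in [(5000, 5050), (5000, 5400)]:
    p = srm_p_value(n_a, n_b)
    status = "SRM flagged" if p < SRM_THRESHOLD else "ok"
    print(n_a, n_b, f"p={p:.2g}", status)
```

A strict threshold is typical for this check because a flagged test suggests a problem with randomization or logging rather than with the headlines themselves.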
High-volume examples
The eight tests with the largest impression counts in the corpus.
| Test ID | Topic | Year | Impressions A | Impressions B | CTR A | CTR B | Lift | p-value |
|---|---|---|---|---|---|---|---|---|
Matias, J. N., Munger, K., Le Quere, M. A., & Ebersole, C. R. (2021). The Upworthy Research Archive: A time series of 32,488 experiments in U.S. media. Scientific Data, 8, 195. doi:10.1038/s41597-021-00934-7
This page renders a deterministic stand-in with the same statistical shape when the archive CSV is not present locally.
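For readers curious what a deterministic stand-in might look like, the following is a minimal sketch and not the page's actual generator. It assumes a seeded random number generator, a population in which most tests are null, and a minority with a normally distributed true lift. Every parameter value and field name is hypothetical.

```python
import random

def synthetic_tests(n_tests=200, seed=2021, share_real=0.15):
    """Return a list of synthetic two-arm tests; the same seed gives the same list."""
    rng = random.Random(seed)
    tests = []
    for test_id in range(n_tests):
        n = rng.randint(2000, 6000)          # impressions per arm (assumed range)
        base_ctr = rng.uniform(0.005, 0.03)  # click rate of headline A (assumed range)
        # Most tests are null; a minority receive a nonzero true lift.
        true_lift = rng.gauss(0, 0.3) if rng.random() < share_real else 0.0
        ctr_b = min(max(base_ctr * (1 + true_lift), 0.0), 1.0)
        clicks_a = sum(rng.random() < base_ctr for _ in range(n))
        clicks_b = sum(rng.random() < ctr_b for _ in range(n))
        tests.append({
            "test_id": test_id,
            "impressions_a": n,
            "impressions_b": n,
            "clicks_a": clicks_a,
            "clicks_b": clicks_b,
        })
    return tests

sample = synthetic_tests(n_tests=5)
for row in sample:
    print(row)
```

Seeding a dedicated `random.Random` instance, rather than the module-level generator, keeps the output reproducible even if other code on the page also draws random numbers.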