Capstone · Upworthy Research Archive

The lens, applied to real data

The archive holds approximately thirty-two thousand A/B tests on news headlines, run by Upworthy between January 2013 and April 2015 and released for replication research by Matias et al. (2021). Each test pits two headlines against one another, with visitors randomly assigned to one of them. The autopsy lens: how often do these tests reach significance at α = 0.05, and what does the lift distribution look like in production?

Distribution of observed lifts

Across 32,487 tests, observed lifts are centered near zero with heavy tails. A handful of headlines beat their alternative by more than 50%; most barely move the click rate. Bars are split by whether the test crossed α = 0.05.
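For readers who want to reproduce a single bar's inputs, the following is a minimal sketch of one way to compute a relative lift and a two-sided p-value from raw counts. It assumes a pooled two-proportion z-test and defines lift as the relative change in click rate of B over A; the page does not state which test it uses, and the function name and arguments are hypothetical.

```python
import math


def two_proportion_test(clicks_a, impressions_a, clicks_b, impressions_b):
    """Return (relative lift of B over A, two-sided p-value).

    Assumption: a pooled two-proportion z-test. This is an illustrative
    choice, not necessarily the test used to build the charts.
    """
    ctr_a = clicks_a / impressions_a
    ctr_b = clicks_b / impressions_b

    # Relative lift is undefined when arm A has no clicks.
    lift = (ctr_b - ctr_a) / ctr_a if ctr_a > 0 else float("nan")

    # Pooled click rate under the null hypothesis of equal CTRs.
    pooled = (clicks_a + clicks_b) / (impressions_a + impressions_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / impressions_a + 1 / impressions_b))
    if se == 0:
        # No clicks at all (or all clicks): the arms are indistinguishable.
        return lift, 1.0

    z = (ctr_b - ctr_a) / se
    # Two-sided tail probability of a standard normal: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return lift, p_value
```

As a usage example, `two_proportion_test(30, 3000, 45, 3000)` gives a lift of about 50% with a p-value of roughly 0.08, so a large observed lift on a few thousand impressions per arm can still fall short of α = 0.05.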

P-value distribution

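If every test were null, p-values would be uniform on [0, 1]. In a population where most A/B tests are null, the distribution should therefore be roughly flat, with an excess near zero. That observed bump near zero is the minority of headlines with a real effect. The small spike just under 0.05 is the classical pattern associated with peeking at results and stopping once a test reaches significance, which is what motivates a fixed-horizon, pre-registered design.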
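The flat-plus-bump shape can be illustrated with a small simulation. The sketch below assumes each test yields a z-statistic that is standard normal under the null and shifted by a fixed amount when the effect is real; the share of real effects (10%) and the shift (2.5) are made-up illustration values, not estimates from the archive. It does not model peeking, so it reproduces the bump near zero but not the spike just under 0.05.

```python
import math
import random


def simulate_p_values(n_tests=20000, share_real=0.1, effect_z=2.5, seed=0):
    """Simulate two-sided p-values for a mix of null and non-null tests.

    Assumptions (hypothetical values): 10% of tests have a real effect,
    and a real effect shifts the expected z-statistic to 2.5.
    """
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    p_values = []
    for _ in range(n_tests):
        mean_z = effect_z if rng.random() < share_real else 0.0
        z = rng.gauss(mean_z, 1.0)
        p_values.append(math.erfc(abs(z) / math.sqrt(2)))
    return p_values


def histogram(p_values, bins=20):
    """Count p-values in equal-width bins on [0, 1]."""
    counts = [0] * bins
    for p in p_values:
        counts[min(int(p * bins), bins - 1)] += 1
    return counts
```

With 20 bins, the first bin covers p < 0.05. In this setup the null tests contribute about the same count to every bin, while the tests with a real effect pile into the first one.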

By topic

Topic	Tests	Significant @ α=0.05	Sig. rate	Median lift	Mean lift	Median impressions/arm	SRM flagged
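The "SRM flagged" column refers to sample ratio mismatch: a split of impressions between arms that is too lopsided to be explained by random assignment. A minimal sketch of one common check follows, assuming an intended 50/50 split, a chi-square goodness-of-fit test with one degree of freedom, and a flagging threshold of p < 0.001. The page does not state its actual rule, so the threshold and function names here are assumptions.

```python
import math


def srm_p_value(impressions_a, impressions_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value for the observed traffic split.

    Assumption: traffic was meant to be split according to
    expected_share_a (50/50 by default).
    """
    total = impressions_a + impressions_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((impressions_a - expected_a) ** 2 / expected_a
            + (impressions_b - expected_b) ** 2 / expected_b)
    # With one degree of freedom, the chi-square tail equals
    # the two-sided normal tail: erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def srm_flagged(impressions_a, impressions_b, threshold=0.001):
    """Flag a test whose split is very unlikely under the intended ratio."""
    return srm_p_value(impressions_a, impressions_b) < threshold
```

Under these assumptions, a split of 5,000 versus 5,100 impressions is not flagged, while 5,000 versus 5,400 is. A strict threshold keeps the false-alarm rate low when the check runs across tens of thousands of tests.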

High-volume examples

Eight tests with the largest impression counts in the corpus.

Test ID	Topic	Year	Impressions A	Impressions B	CTR A	CTR B	Lift	p-value

Citation

Matias, J. N., Munger, K., Le Quere, M. A., & Ebersole, C. R. (2021). The Upworthy Research Archive, a time series of 32,487 experiments in U.S. media. Scientific Data, 8, 195. doi:10.1038/s41597-021-00934-7

This page renders a deterministic stand-in with the same statistical shape when the archive CSV is not present locally.