New canonical dataset paper for Media Cloud

Published
June 3, 2026

In a new canonical dataset paper, we present a completely re-engineered Media Cloud, a massive, searchable, open-source archive of digital news sources and content from around the globe. Since Media Cloud was last presented at ICWSM in 2021, the team has rebuilt its data collection, storage, and retrieval systems, built a new front-end research interface, surpassed 1.8 billion stories, and reprocessed all content so that the extracted metadata is produced with consistent, modern techniques. In the paper, we document the new system's engineering, characterize the datasets to date, and describe the user-facing tools, which include a Directory of online news sources and a searchable Story Index of global news stories. We also discuss the utility of the datasets, how they compare to related work, the challenges of maintaining open research infrastructure, and the research made possible by the datasets and tooling.

Read the paper in the ICWSM 2026 proceedings.