Print newspapers are declaring bankruptcy nationwide. High-profile blogs are proliferating. Media companies are exploring new production techniques and business models in a landscape that is increasingly dominated by the Internet. In the midst of this upheaval, it is difficult to know what is actually happening to the shape of our news. Beyond one-off anecdotes or painstaking manual content analysis, there are few ways to examine the emerging news ecosystem.
The idea for Media Cloud emerged through a series of discussions between faculty and friends of the Berkman Center. The conversations would follow a predictable pattern: one person would ask a provocative question about what was happening in the media landscape, someone else would suggest interesting follow-on inquiries, and everyone would realize that a good answer would require heavy number crunching. Nobody had the time to develop a huge infrastructure and download all the news just to answer a single question. Eventually, however, there were enough of these questions that we decided to build a tool for everyone to use.
Some of the early driving questions included:
- Do bloggers introduce storylines into mainstream media or the other way around?
- What parts of the world are being covered or ignored by different media sources?
- Where do stories begin?
- How are competing terms for the same event used in different publications?
- Can we characterize the overall mix of coverage for a given source?
- How do patterns differ between local and national news coverage?
- Can we track news cycles for specific issues?
- Do online comments shape the news?
Media Cloud offers a way to quantitatively examine all of these challenging questions by collecting and analyzing the news stream of tens of thousands of online sources.
Technically, Media Cloud performs five basic functions: media definition, crawling, text extraction, word vectoring, and analysis. First, we define the set of media sources we want to collect and discover the feeds for each media source (many newspapers have hundreds of feeds). Second, we crawl each of those feeds several times a day to discover any new stories and then download the HTML of each new story. Third, we extract just the substantive content of each story from its HTML page, leaving behind the ads, navigation, and other cruft. Fourth, we break that substantive text down into a set of word counts so that we can determine, down to the level of individual sentences, which words each media source uses to talk about which topics. Finally, we provide a set of tools for analyzing those word counts, including the Media Dashboard tool that acts as the front page for http://mediacloud.org. We make all of the code for the system available under an open source license and publish as much of the underlying data as legally possible.
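As a rough illustration of the fourth step (word vectoring), the sketch below splits a story's extracted text into sentences and counts the words in each one, keyed by media source. It is not Media Cloud's actual code: the function names, the tiny stop-word list, the naive sentence splitter, and the "example-paper" source name are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only; the words the
# real system ignores are not described in the text.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are", "on"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def count_words(sentence):
    """Lowercase the sentence and count its words, skipping stop words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def vector_story(source, text):
    """Return one (source, sentence_index, word_counts) record per sentence."""
    sentences = split_sentences(text)
    return [(source, i, count_words(s)) for i, s in enumerate(sentences)]


def totals_by_source(records):
    """Aggregate sentence-level counts into one Counter per media source."""
    totals = {}
    for source, _, counts in records:
        totals.setdefault(source, Counter()).update(counts)
    return totals


if __name__ == "__main__":
    story = "The senate passed the bill. Critics say the bill is flawed!"
    records = vector_story("example-paper", story)
    totals = totals_by_source(records)
    print(totals["example-paper"]["bill"])  # prints 2
```

Keeping one record per sentence, rather than one per story, is what would let an analysis tool ask which words a source uses in the same sentence as a given topic term, instead of only which words appear somewhere in the story.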
Media Cloud is made possible by the generous support of the John D. and Catherine T. MacArthur Foundation and the Ford Foundation.