How is the word cloud generated?
The Media Dashboard generates two kinds of word clouds: a simple word cloud for a single query and a comparative word cloud for two queries. The single-query word cloud includes the 100 most common words, after stopword removal, used by the given media set or media source during the given week. The size of each word is determined by its frequency across all media sources within the query. Words on a large stoplist of about four thousand words are excluded from the word cloud.
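As a rough illustration of the single-query word cloud, the sketch below counts words across a handful of sentences, drops stoplist words, and keeps the most common ones. The function name `top_words`, the tiny `STOPLIST`, and the tokenization rule are assumptions made for this example; they are not taken from the Media Cloud code.

```python
import re
from collections import Counter

# Hypothetical, tiny stand-in for the real stoplist of about four thousand words.
STOPLIST = {"the", "a", "of", "and", "in", "to", "is"}


def top_words(sentences, limit=100, stoplist=STOPLIST):
    """Return up to `limit` (word, count) pairs, most common first."""
    counts = Counter()
    for sentence in sentences:
        # Assumed tokenization: lowercase runs of letters and apostrophes.
        for word in re.findall(r"[a-z']+", sentence.lower()):
            if word not in stoplist:
                counts[word] += 1
    return counts.most_common(limit)


sentences = [
    "The economy is slowing",
    "The economy and the election",
    "Election coverage of the economy",
]
cloud = top_words(sentences)
```

In a rendered cloud, the count attached to each word would drive its font size, so "economy" would be drawn largest in this small sample.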
The comparative word cloud starts with the top 75 words of each query but colors words according to their relative frequency within each query. Words in red or blue are words within the top 75 words for one query that are ranked at least 25 spots lower than in the other query. Words in purple are ranked within 25 spots in both of the queries. So the word
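The coloring rule can be sketched as follows. This is only one reading of the description above, and several details are assumptions: that red belongs to the first query and blue to the second, that a word is given a query's color when it ranks at least 25 spots higher in that query than in the other, and that a word missing from one query's top list takes the color of the query it does appear in. The function name `color_words` is hypothetical.

```python
def color_words(ranked_a, ranked_b, top=75, gap=25):
    """Map each word in either top list to 'red', 'blue', or 'purple'.

    ranked_a and ranked_b are lists of words, most frequent first.
    """
    rank_a = {word: i for i, word in enumerate(ranked_a[:top])}
    rank_b = {word: i for i, word in enumerate(ranked_b[:top])}
    colors = {}
    for word in rank_a.keys() | rank_b.keys():
        if word not in rank_b:
            colors[word] = "red"      # assumption: only in query A's top list
        elif word not in rank_a:
            colors[word] = "blue"     # assumption: only in query B's top list
        elif rank_b[word] - rank_a[word] >= gap:
            colors[word] = "red"      # ranked much higher in query A
        elif rank_a[word] - rank_b[word] >= gap:
            colors[word] = "blue"     # ranked much higher in query B
        else:
            colors[word] = "purple"   # similar rank in both queries
    return colors


# Small demonstration with a top-4 list and a gap of 2 instead of 75 and 25.
ranked_a = ["obama", "economy", "jobs", "health"]
ranked_b = ["health", "obama", "sports", "economy"]
colors = color_words(ranked_a, ranked_b, top=4, gap=2)
```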
How is the coverage map generated?
The coverage map displays a heat map of how heavily each country is covered by the given media set or media source during the given week. So for a media source that mentions China frequently but African countries rarely, China should appear very dark on the map while African countries should appear very light. We use a very simple country tagging system that counts mentions of a country by looking for the country name itself if it is a single word (e.g., Ghana) or, if the country name is more than one word (e.g., United States), for two words from the name appearing anywhere in the same sentence. This tagging algorithm is likely to produce some false negatives and false positives.
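A minimal sketch of this kind of tagging is shown below, including one false positive and one false negative of the sort mentioned above. The function names, the tokenization, and the choice to require at least two distinct words of a multi-word name are assumptions for illustration, not the actual Media Cloud implementation.

```python
import re


def mentions_country(sentence, country):
    """Return True if the sentence counts as a mention of the country."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    name_words = country.lower().split()
    if len(name_words) == 1:
        # Single-word name: the name itself must appear in the sentence.
        return name_words[0] in words
    # Multi-word name (assumed reading): at least two distinct words of the
    # name appear anywhere in the sentence, in any order.
    return len(words & set(name_words)) >= 2


def count_mentions(sentences, countries):
    """Count, for each country, the sentences that mention it."""
    return {
        country: sum(mentions_country(s, country) for s in sentences)
        for country in countries
    }


sentences = [
    "Ghana held elections",
    "The United States and China signed a deal",
    "The United Nations praised several states",  # false positive for United States
    "The U.S. economy grew",                      # false negative for United States
]
counts = count_mentions(sentences, ["Ghana", "United States", "China"])
```

The resulting counts per country are what a heat map would shade, with higher counts drawn darker.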
What is a topic?
A topic is a way of tracking not just generic content within a given media set or media source, but content closely associated with a specific subject (e.g., egypt, economy, or health). Word clouds generated for a specific topic include only words that appear in the same sentence as the topic word. So the word cloud for the week of 2010-08-01 for Political Blogs with the health topic will include only words within that week and media source within sentences that include the word
How is the list of sentences generated?
Each word in a word cloud links to the list of sentences including that word within the given week and media set or media source. So clicking on
For each media source, we show a percentage indicating the frequency of the clicked word. This percentage is calculated by comparing the number of times the clicked word occurs within the media source in the given week with the total number of words within that media source for the given week. For topic word clouds, the percentage compares the occurrences of the clicked word within sentences in the media source for the given week that contain the topic word with the total number of words within those sentences. For words that appear in both queries (words in purple in the word cloud), we average the percentages from the two queries.
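The percentage calculation can be sketched like this. The function names and tokenization are assumptions, as is the choice to count every word (stopwords included) in the total, which the description above does not spell out.

```python
import re


def tokenize(sentence):
    """Assumed tokenization: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", sentence.lower())


def word_percentage(sentences, word, topic=None):
    """Percentage of all words in `sentences` that are `word`.

    If `topic` is given, only sentences containing the topic word are used.
    """
    hits = 0
    total = 0
    for sentence in sentences:
        tokens = tokenize(sentence)
        if topic is not None and topic not in tokens:
            continue
        hits += tokens.count(word)
        total += len(tokens)
    return 100.0 * hits / total if total else 0.0


sentences = [
    "health care reform passed",
    "the senate debated reform",
    "health costs rose",
]
overall = word_percentage(sentences, "reform")                  # 2 of 11 words
in_topic = word_percentage(sentences, "reform", topic="health")  # 1 of 7 words

# For a word shown in purple, the two queries' percentages are averaged.
averaged = (overall + in_topic) / 2
```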
What is the similarity score?
A similarity score is included in the legend at the bottom of each comparative word cloud. The similarity score is the cosine similarity of the word counts of the 500 most common stopworded words within each of the two queries. A score of 0 indicates no similarity at all and a score of 1 indicates that the list of word counts is exactly the same. This similarity score can be used to quantitatively assess the degree to which two queries differ, and to compare differences between various comparisons (to ask, for instance, whether news coverage differed more between two media sources in the past two weeks than the previous two weeks, or to ask whether the content of mainstream media coverage of the economy is more similar to content of the whitehouse feeds than the content of blog coverage).
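For readers who want to reproduce a score like this, here is a minimal sketch of cosine similarity over word counts, limited to the most common words of each query. The function name `cosine_similarity` and its `top` parameter are assumptions for this example, and the counts are expected to have had stopwords removed already.

```python
import math
from collections import Counter


def cosine_similarity(counts_a, counts_b, top=500):
    """Cosine similarity of two word-count mappings, using each one's top words."""
    a = dict(Counter(counts_a).most_common(top))
    b = dict(Counter(counts_b).most_common(top))
    # Only words present in both lists contribute to the dot product.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty query is treated as having no similarity
    return dot / (norm_a * norm_b)


week_a = {"obama": 23, "libya": 5, "basketball": 2}
week_b = dict(week_a)                   # identical word counts
week_c = {"economy": 10, "jobs": 4}     # no words in common with week_a

same = cosine_similarity(week_a, week_b)       # approximately 1
different = cosine_similarity(week_a, week_c)  # exactly 0.0
```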
What exactly is cosine similarity?
The following is adapted from a longer blog post by Ethan Zuckerman.
Cosine similarity is a technique computer scientists use to detect a type of similarity between documents. Basically, a computer program counts the appearances of words in a document (in our case, a week’s worth of media coverage) and compares that frequency list to that of another document. If those documents are identical in word frequency – both mention Obama 23 times, Libya 5 times and basketball twice – they score a 1. If they’ve got no words in common, they score a zero.
The actual math behind this is wonderfully cool, if slightly mind-bending. Imagine a set of documents with only two words in them – “Obama” and “NCAA”. In source A, Obama is mentioned 8 times, NCAA 2 times. Put a point on a graph at (8,2) – Obama’s our X axis, NCAA our Y axis, and draw a line that passes through (0,0) and (8,2) – that’s the vector that represents set A. In source B, Obama gets mentioned twice, NCAA 8 times – put the point at (2,8) and draw the vector for source B. The angle between vectors A and B is a measure of how similar the sets are, and taking the cosine of that angle is a simple way to scale the value to be between 0 and 1 for angles between 0 and 90 degrees.
The trick, of course, is that documents contain words other than Obama and NCAA, and cosine similarity adds a new dimension to our graph for each new term. So the vectors we’re measuring when we compare all the words in a set of media sources over a week to another comparable week exist in 1000-dimensional space.
Don’t bother imagining 1000-dimensional space – it will make your head hurt. Just imagine three dimensional space and think about two vectors that each emerge from (0,0,0) and each pass through an arbitrary point in positive x,y,z space – it’s easy enough to imagine measuring the angle between those two vectors. Then take it on faith that, mathematically, you can do the same thing in many-dimensional space.
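To make the two-word example concrete, the cosine of the angle between A = (8,2) and B = (2,8) can be worked out directly. The component notation a_i and b_i in the second line is introduced here only for illustration.

```latex
\[
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{8 \cdot 2 + 2 \cdot 8}{\sqrt{8^2 + 2^2}\,\sqrt{2^2 + 8^2}}
  = \frac{32}{68}
  \approx 0.47
\]
\[
\cos\theta
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}
  \qquad \text{(the same formula for } n \text{ words)}
\]
```

A score of about 0.47 corresponds to an angle of roughly 62 degrees, so these two sources would be rated as only moderately similar.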
Is the source code for Media Cloud available?
Yes. We publish all Media Cloud code under the GNU Affero General Public License, which basically says that you can download the code, modify it, and host your own version of Media Cloud, as long as you publish any changes you make to the code if you host a public version of Media Cloud or distribute your own version. The Media Cloud code is available on SourceForge. Please note that we have not for some time made any strong attempt to package Media Cloud or to make it easy for other folks to install and maintain, so enter at your own risk. Most importantly, the answers to any specific methodological questions about how we collect and analyze data for Media Cloud are available in the open source code.