This week’s word cloud looks a little strange:
Question: What role, then, is the Internet playing in Russian media?
Answer: Elena Vartanova (Moscow State University Journalism Faculty): It really is a new part of our media system. People are increasingly consuming online news, and online news often takes the first step in agenda-setting. Only then do consumers get more analysis and commentary from print sources.
One of the functions of online media is creating an alternative news agenda. If you watch big television channels you see distilled content, which is double-checked by company managers, by people in power – you won’t find problematic material. The alternative agenda on the Internet is helping Russians see pitfalls and problems. And the Internet has become a tool for people to create public opinion, to support the “man on the street.” In Russia, when mainstream media says something, you should double-check on the Internet. It provides a different point of view.
–Interview by Josh Tapper, Nieman Journalism Lab
In the above quote, Elena Vartanova echoes two key research questions we have for Russian Media Cloud:
1. Do blogs and other online media provide an alternative public sphere; and
2. What role do they play in setting the news agenda?
To begin to test these hypotheses we have built off the hard work by Ethan Zuckerman, Hal Roberts, David Larochelle, Yochi Benkler and Zoe Fraade-Blanar on English Media Cloud, which collects data on different sets of English language blogs and popular traditional media available online (mostly newspapers). For the Russia effort we have an even larger and more varied set of feeds, including:
1. 1000 popular Russian blogs: The Yandex Top 1000 list
2. Over 11,000 Russian language blogs divided into link-based attentive clusters, based on the results of our previous Russian blog research
3. 1000 random, or long tail, blogs based on our own spider of the Russian blogosphere
4. Top 25 ‘mainstream media’: This is currently the Google Ad Planner list of the top 25 most popular news Web sites in Russia, which we filtered to remove any sites that are not news related or not primarily about Russia (*See list at bottom of this post)
5. Russian TV news transcripts: Channel 1, Vesti, REN TV, TV Tsentra, NTV, Channel 5, Mir, Zvezda, and TV Stolitsa
6. Russian government Web sites: President Medvedev’s official site, Putin’s official site, the Russian government portal government.ru, and sites of the Ministry of Emergency Situations, Ministry of Justice, Ministry of Defense, and the Ministry of Foreign Affairs
Using the same method that Ethan describes in his blog post on calculating cosine similarity among sources and sets of sources, we are able to draw a visual map that shows how similar these different sets of feeds are to one another, based on content (as opposed to links). What this method allows us to do, and what we have done with all of the examples below, is compare the similarity of bags of words in different media sets. Media Cloud outputs alone do not say anything about the meaning behind the differences between sources. However, with additional context about what we know of the political situation and media ownership in Russia, as well as qualitative analysis of sentences within queries, we can begin to hypothesize about the possible meaning behind similarity scores, word clouds, polar maps and other automated outputs.
As Ethan writes about cosine similarity:
This is a technique computer scientists use to detect a type of similarity between documents. Basically, a computer program counts the appearances of words in a document (in this case, a week’s worth of media coverage by 25 outlets) and compares that frequency list to that of another document. If those documents are identical in word frequency – both mention Obama 23 times, Libya 5 times and basketball twice – they score a 1. If they’ve got no words in common, they score a zero.
(The actual math behind this is wonderfully cool, if slightly mind-bending. Imagine a set of documents with only two words in them – “Obama” and “NCAA”. In source A, Obama is mentioned 8 times, NCAA 2 times. Put a point on a graph at (8,2) – Obama’s our X axis, NCAA our Y axis, and draw a line that passes through 0,0 and 8,2 – that’s the vector that represents set A. In source B, Obama gets mentioned twice, NCAA 8 times – put the point at 2,8 and draw the vector for source B. The angle between vectors A and B is a measure of how similar the sets are, and taking the cosine of that angle is a simple way to scale the value to be between 0 and 1 for angles between 0 and 90 degrees. The trick, of course, is that documents contain words other than Obama and NCAA, and cosine similarity adds a new dimension to our graph for each new term. So the vectors we’re measuring when we compare all the words in 25 media sources over a week to another comparable week exist in 3000-dimensional space. Don’t bother imagining 3000-dimensional space – it will make your head hurt. Just imagine three dimensional space and think about two vectors that each emerge from 0,0,0 and each pass through an arbitrary point in positive x,y,z space – it’s easy enough to imagine measuring the angle between those two vectors. Then take it on faith that, mathematically, you can do the same thing in many-dimensional space.)
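To make the two-word example in the quote concrete, here is a minimal Python sketch of cosine similarity over word counts. The function name and the dictionary representation are assumptions made for illustration, not Media Cloud's actual code.

```python
import math
from collections import Counter

def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two word-count vectors (word -> count)."""
    # Only words present in both documents contribute to the dot product.
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# The example from the quote: source A is (8, 2), source B is (2, 8).
source_a = Counter({"Obama": 8, "NCAA": 2})
source_b = Counter({"Obama": 2, "NCAA": 8})
similarity = cosine_similarity(source_a, source_b)
print(round(similarity, 3))  # 32 / 68, i.e. 0.471
```

Each additional word simply adds another key to the dictionaries, which is the "new dimension for each new term" described above.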
Popular Blogs Compared to the Government and Traditional Media
As a first test of whether blogs differ from Russian traditional media and government information channels, the first polar map compares the Yandex Top 1000 popular blogs with the Russian government, TV news transcripts, and top 25 MSM over the period from December 15, 2010 to February 21, 2011. The center node, or pole around which the map is drawn, is the collective content of Russian government feeds over that same time period. The further a source is from the black dot in the center, the more different it is from Russian government feeds. Although the sheer number of blogs makes the map fairly overwhelming, what we see at first glance is that most blogs are located near the outer ring, while the government, MSM and TV sources are located closer to the center, showing that the media are more similar to the government than most blogs are. This is probably due at least in part to the fact that Russian popular blogs are not focused exclusively on politics, which we see from the content clustering (color) process.
Polar Map

Center Node: Russian Government
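The distance from the center can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it takes the radius of each source to be one minus its cosine similarity to the pooled government word counts, and it uses made-up toy counts. How Media Cloud actually scales distances or chooses each node's angle is not described here.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def polar_radii(center_counts, source_counts):
    """Map each source to a radius: 0 = identical to the center, 1 = no shared words."""
    return {name: 1.0 - cosine_similarity(center_counts, counts)
            for name, counts in source_counts.items()}

# Hypothetical toy counts, not Media Cloud data.
government = {"president": 10, "government": 8, "ministry": 5}
sources = {
    "tv_channel": {"president": 7, "government": 5, "film": 1},
    "popular_blog": {"film": 9, "photograph": 6, "president": 1},
}
radii = polar_radii(government, sources)
for name, radius in sorted(radii.items(), key=lambda item: item[1]):
    print(f"{name}: {radius:.2f}")
```

In this toy case the TV channel lands near the center and the blog near the outer ring, mirroring the pattern described for the real map.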

The color (and related title) of the nodes is determined by a slightly different process than the one that determines their location (polar mapping). The clustering process is agnostic to the source of the feed, and splits the individual sources into different clusters based on the similarity of the words that each uses in a given query made by researchers. The clustering engine uses a simple k-means implementation based on the cosine similarity of the lists of the top 100 non-stopword query words of each media source. This approach returns a different, randomized solution each time, so we run clustering about 20 times and keep the run with the highest sum of total similarity across clusters. The title of each cluster is the most popular word within that cluster that is ranked lower in every other cluster (so if three clusters all have ‘Russia’ as the most popular word, none of them can use ‘Russia’ as the cluster title).
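The clustering procedure described above can be sketched roughly as follows. This is a minimal pure-Python illustration, not the Media Cloud engine: the function names, the toy word counts, the fixed iteration count, and the scoring rule (sum of each source's cosine similarity to its cluster centroid) are all assumptions.

```python
import math
import random

def normalize(vec):
    """Scale a word -> weight dictionary to unit length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()} if norm else {}

def cosine(a, b):
    # Both inputs are unit length, so the dot product is the cosine.
    return sum(v * b.get(w, 0.0) for w, v in a.items())

def assign(vectors, centroids):
    """Label each vector with the index of its most similar centroid."""
    return [max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            for v in vectors]

def kmeans_once(vectors, k, rng, iterations=20):
    centroids = rng.sample(vectors, k)  # random start, so runs differ
    for _ in range(iterations):
        labels = assign(vectors, centroids)
        for c in range(k):
            total = {}
            for v, label in zip(vectors, labels):
                if label == c:
                    for w, x in v.items():
                        total[w] = total.get(w, 0.0) + x
            if total:  # keep the old centroid if a cluster empties out
                centroids[c] = normalize(total)
    labels = assign(vectors, centroids)
    score = sum(cosine(v, centroids[label]) for v, label in zip(vectors, labels))
    return labels, score

def best_clustering(count_dicts, k, restarts=20, seed=0):
    """Run k-means `restarts` times and keep the highest-scoring run."""
    rng = random.Random(seed)
    vectors = [normalize(c) for c in count_dicts]
    return max((kmeans_once(vectors, k, rng) for _ in range(restarts)),
               key=lambda run: run[1])

# Hypothetical toy sources: two about politics, two about film and photos.
toy_sources = [
    {"president": 9, "government": 7, "russia": 5},
    {"president": 6, "government": 5, "russia": 6, "film": 1},
    {"film": 9, "photograph": 4, "russia": 1},
    {"photograph": 8, "film": 5},
]
labels, score = best_clustering(toy_sources, k=2)
print(labels, round(score, 3))
```

Because each run starts from randomly chosen sources, a single run can settle on a poor grouping; keeping the best of about 20 runs is a cheap way to reduce that risk.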
The main clusters that emerge from this query are Film (green), Russia (tan/light orange), Photograph (orange), Site (light blue), and Russian (dark blue). The Russian government, TV and MSM are still found primarily near the center of the map (which is centered on the Russian government feeds), and most of the nodes are colored tan, which represents the “Russia” cluster. Although the sheer number of blogs makes the map fairly overwhelming, we see that almost all of them are located near the outer ring of this map, as in the other polar maps. This is probably due at least in part to the fact that Russian popular blogs are not focused exclusively on politics.
Oppositional Political Blogs
In the next experiment, we focused only on known political blogs (which we identified in our previous blog research, based on links) to see how different political blogs are from the government and more traditional media sources. In the polar map below, we mapped how similar the content of Russian democratic blogs, Russian nationalist blogs, the top 25 mainstream media, Russian TV channels and Russian government Web sites is to the Russian government feeds. The center node, or pole around which the map is drawn, is the collective content of Russian government feeds over a two-month period (in this case, from November 29, 2010 to January 31, 2011). Again, the further a source is from the black dot in the center, the less similar it is to Russian government feeds.
Center Node: Russian Government
1. Kremlin.ru (Kremlin Web site)
2. Government.ru (Government of Russia Portal)
3. Premier.ru (Vladimir Putin’s Web site)
On the map we see that Russian political blogs on both extremes of the Russian opposition (nationalist and democratic) are the least similar to the Russian government and are located in a zone almost completely separated from traditional and online news sources. TV and popular mainstream media are found close to the center of the map, and are also typically blue in color. The content clusters in this clustering run are ‘crowd,’ ‘Russian (russkaya),’ ‘country,’ ‘Russian (rossiskaya)’ and a very small cluster around the term ‘happy.’
An example of a democratic opposition blog is that of the Strategy 31 movement, which attempts to organize protests against the government on the 31st of each month that has 31 days; the blog is located in the outer ring of the map. In the above map we’ve also highlighted a typical nationalist blog. The two word clouds below show the terms used most often by each. The Strategy 31 blog preferentially uses the terms ‘freedom,’ ‘constitution’ and ‘rally (miting).’ The blog from the nationalist cluster uses nationalist language (e.g., the word Rossiyanskovo instead of Rossiskovo), as well as the terms ‘Chechens,’ ‘Tadzhiks,’ ‘pay,’ ‘Lenin’ and ‘Domodedovo’ (the airport where a bombing blamed on Chechens took place).
Word Cloud: Democratic Opposition Blog

Popular words in a democratic opposition blog: Strategy, rally, gathering, Triumfal’noi, freedom, constitution, Nemtsov (an opposition politician arrested at a political protest)
Word Cloud: Nationalist Blog

Popular words in a nationalist blog include: Lenin, Domodedovo, Russian (Rossiyanskovo), Tadzhik, Chechen, Kavkaz, and pay
As one would expect, Russian government Web sites such as Kremlin.ru and Premier.ru are very close to the center. The official Russian government newspaper, Rossiskaya Gazeta, is the newspaper that is most similar to the government.
It is surprising that, according to our data, TV channels are not that different from other news media. Based on known ownership of and editorial influence over TV channels, one would have expected TV to be closer to the Russian government than it is, and other online and offline newspapers to be further from the center than Russian TV. It is quite surprising to see Channel 1 as far from the center as it is, but looking at the stories coming through its news feed, this seems likely to be due to a fair number of advertisements for entertainment and other programming highlights on the channel, unrelated to political or other news, that are included in its ‘news feed.’ It is worth investigating further whether our other news feeds capture similar promotional material for non-hard-news stories.
Among the government Web sites, the Ministry of Defense is the least similar to the collective government feeds, while the official Kremlin Web site (primarily about Medvedev) and the official Russian government Web portal Government.ru appear to be the most similar to all government feeds.
The mainstream news sites that are the least similar to the government are 3dnews (by a long shot; it is found in the outer blog ring) and Cnews.ru, which is explained by both sites’ heavy focus on technology news rather than Russian politics. The TV channels most similar to the government are TV Tsentra and Zvezda, a Russian military channel.
Clustering this map according to content also shows that the mainstream media and TV sources are all clustered together in dark blue. The word frequency cloud shows that this group is highly focused on Russian government and politics, with ‘Russia,’ ‘President,’ ‘government’ and ‘Putin’ among the most frequently used words.
These early findings seem to indicate that, for whatever reason, Russian TV channels and newspapers (traditional and Web native) cover topics similar to each other and to the Russian government. It will require more research to understand why this might be the case, but a few theories are possible. It may be a reflection of the dominance of two individuals, Medvedev and Putin, over Russian politics. Because they are the only two people whose decisions really matter in politics, stories about them may be the only political stories that ever get covered. Alternatively, it may support the theory of US media scholar Robert Entman, who argues that in the US the White House sets the news agenda, especially regarding international affairs, and that of Lance Bennett, who argues that the media simply index the opinions of elites, including government elites, as well as the more general theories around media gatekeepers. This effect may be amplified in semi-authoritarian settings like Russia, where sources of power and authority are more limited than in liberal democracies. It is also possible that we are detecting some level of self-censorship or even bias in the traditional media, caused by concerns over upsetting the Kremlin. Again, our research cannot yet say why traditional media are so similar to official Russian government information channels, simply that they are similar in the words they use, and from that we infer the stories that they cover.
We are currently exploring whether word frequency counts are a good way of measuring the agenda of a given media set (what that set or its individual media sources talk about). However, even if they are, this will likely not tell us what frame a given source employs (how it talks about a given issue). So the fact that two sources both frequently talk about Putin and Medvedev does not necessarily mean they are talking about them in the same way; establishing that would require human coding of blog posts or automated sentiment analysis.
Still, based on this early output from Russian Media Cloud, it seems that opposition blogs are indeed different from both government information channels and popular media, and that they are likely providing an alternative agenda to mainstream sources. More research is required to understand how these different sources talk about the same topic, and whether blogs in any way have a different agenda than other media. The recent events in Egypt provide an excellent example of the appearance of an agenda item in the blogosphere that is almost completely absent from official Russian government information channels. That will be the focus of my next Media Cloud post.
Cross posted on the Internet & Democracy Blog.
*“Top 25 Mainstream Media” currently in Media Cloud (We are updating this list based on analysis of additional rankings of Russian media besides Google Ad Planner)
RIA Novosti
Komsomolskaya Pravda
lenta.ru
gazeta.ru
3D News
Regnum
Vzglad
Newsru
Svobodnaya Pressa
Inosmi
Vedomosti
Argumenti i Fakti
Rossiskaya Gazeta
Pravda
Cnews
Dni.ru
Rosbalt
Interfax
Kommersant
Moskovskii Komsomolets
expert.ru
Izvestiya
bfm.ru
Trud
fontanka.ru
Today, the Berkman Center is relaunching Media Cloud, a platform designed to let scholars, journalists and anyone interested in the world of media ask and answer quantitative questions about media attention. For more than a year, we’ve been collecting roughly 50,000 English-language stories a day from 17,000 media sources, including major mainstream media outlets, left and right-leaning American political blogs, as well as from 1000 popular general interest blogs. (For much more about what Media Cloud does and how it does it, please see this post on the system from our lead architect, Hal Roberts.)
2011 has been an exciting year for those of us who usually complain that US audiences don’t encounter enough international news. Since protests in Tunisia succeeded in ousting Ben Ali from power, the news cycle has been dominated by stories of revolution in the Arab world and, tragically, by the destruction caused by the earthquake and tsunami in Japan and the drama of a possible nuclear disaster as a result. International news is very rarely the dominant story in US media – when the fine folks at Project for Excellence in Journalism noted that the protests in Iran were one of the very few international stories that led a US news cycle, I analyzed a few years of their data and concluded that, aside from coverage of the Olympics, it was virtually the only non-US story in recent years to have led a US news cycle. This year, we’re seeing this trend reversed: interest in the Japan disasters and in the protests in Egypt and Libya has been extremely high in US media. Perhaps there’s been a shift in public attention, in media coverage, or both.
This week, we saw a big shift in coverage in the mainstream media from the Easter holiday and the NBA and NHL playoffs last week to the beatification of Pope John Paul II, Donald Trump, and the Sony PlayStation Network attack:
The following is an overview of the methods that Media Cloud uses to collect, download, and analyze the media ecosystem. Media Cloud collects stories from 30,000 feeds belonging to 17,000 media sources, a combination of mainstream and new media. It stores both the full HTML and the extracted story text of about 50,000 stories per day from those media sources. It converts that story text into per-story word counts that it makes publicly available as daily data dumps. We use those word counts to perform a variety of modes of analysis of the media ecosystem, including word clouds, clustering, mapping, and a variety of regular and custom reports written by the Media Cloud team.
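As a rough illustration of the step that turns story text into per-story word counts, here is a minimal Python sketch. The tokenizer, the tiny stopword list, and the function names are assumptions for the example; the real pipeline's tokenization and stopword handling (and any stemming) may differ.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list; a real list would be far longer.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "on"}

def story_word_counts(text):
    """Lowercase the text, split it into runs of letters, and drop stopwords."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def media_set_counts(stories):
    """Sum per-story counts into one count for a whole media set."""
    total = Counter()
    for text in stories:
        total += story_word_counts(text)
    return total

# Made-up example stories, not real Media Cloud data.
stories = [
    "The earthquake in Japan dominated the news.",
    "Japan and Libya led coverage; Japan is on every front page.",
]
per_story = [story_word_counts(s) for s in stories]
totals = media_set_counts(stories)
print(totals.most_common(3))
```

Aggregated counts like `totals` are the kind of input that word clouds and similarity comparisons can then be built on.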
This week, a curious finding: Democrats fall out of the cloud for the MSM even though Republicans are still covered in both the MSM and political blogs:
The most obvious finding from this week’s word cloud is the disparity in coverage of Geraldine Ferraro’s death in political blogs vs. the MSM.
Libya and the Japan nuclear disaster are again the dominant stories this week, without much new joining them.
The Japan earthquake dominated coverage this week, pushing the other big story, Libya, almost completely out of the cloud:
Even Charlie Sheen gets pre-empted by events in Japan. Here’s how this week compared to last week in the MSM:
The similarity between political blogs and the MSM on Japan is remarkable.
Interestingly, Japan pops up a lot in coverage of Libya this week. It may be an indication of a set of stories framing Obama as overwhelmed by events, such as this article from TalkingPointsMemo:
The devastation in Japan comes at a tumultuous time for the President who is also being forced to respond to the spiraling unrest in the Middle East — particularly in Libya — and political clashes at home where Republicans are impugning his spending priorities as a government shutdown looms.
Wisconsin remains in the coverage of political blogs. Last week we reported the difference in coverage between the left and right blogospheres on this issue. If anything, things seem to have gotten more vitriolic. For example, the term “fleebagger” appears with significant frequency in the right blogosphere, referring to the Democratic legislators who left the state to avoid a vote on collective bargaining.
One story in Pajamas Media manages to refer to both union thugs and fleebagger Democrats in the same post. For its part, the left blogosphere has made “busting” the 17th most frequent word in sentences that also include “Wisconsin.” Union-busting is a common phrase in blog posts from the left, like this one from the Daily Kos.
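A ranking like “the 17th most frequent word in sentences that also include ‘Wisconsin’” can be approximated with a short Python sketch. The sentence splitter, stopword list, function names, and sample text below are assumptions for illustration, not the query code Media Cloud runs.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list for the example.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is"}

def cooccurring_words(text, query):
    """Count non-stopwords in sentences that contain `query` (case-insensitive)."""
    query = query.lower()
    counts = Counter()
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        words = re.findall(r"[^\W\d_]+", sentence.lower())
        if query in words:
            counts.update(w for w in words if w not in STOPWORDS and w != query)
    return counts

def rank_of(word, counts):
    """1-based rank of `word` by frequency, or None if it never co-occurs."""
    ranked = [w for w, _ in counts.most_common()]
    return ranked.index(word) + 1 if word in ranked else None

# Made-up sample text, not real blog content.
sample = ("Union-busting continues in Wisconsin. "
          "Wisconsin lawmakers call the bill union-busting. "
          "The weather in Ohio is mild.")
counts = cooccurring_words(sample, "Wisconsin")
print(counts["busting"], rank_of("busting", counts))
```

Note that this tokenizer splits “union-busting” into “union” and “busting,” which is one way a term like “busting” can surface on its own in a frequency list.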
There are a few more interesting findings in the comparison between the left and right blogospheres in general this week:
*The right has heavy coverage of Islam that is completely absent from the left and the MSM. Terms like “allah,” “islam,” “muslim,” “arab,” and “terror” rank very high in the right blogosphere. There’s also a lot of discussion about religion generally, as “Jews,” “Jewish,” and “Christians” also appear with high frequency.
*Only the right uses the word “Barack” which points to a preference in the right blogosphere for referring to the president as Barack Obama rather than President Obama.
*The left is covering James O’Keefe, notorious for bringing down ACORN and high-ranking officials at NPR with his hidden video tactics, though he’s not trending in the right. Both blogospheres, however, are paying attention to NPR.