Media Cloud 2.0 Pre-alpha release 00.00.05.

We have released a pre-alpha version of Media Cloud 2.0. The changes since our initial release in 2009 are too numerous to list. Perhaps most noticeably, the install process has been vastly improved. Many users will be able to install Media Cloud simply by running a single script instead of having to manually configure and install numerous Perl modules.

A source distribution of this release is available on Sourceforge at the following location:

http://sourceforge.net/projects/mediacloud/files/releases/mediacloud-2.0-pre-alpha.00.00.05.tgz/download

We have also put together an Ubuntu virtual machine image with Media Cloud already installed. It’s available here:

http://sourceforge.net/projects/mediacloud/files/VMs/Media%20Cloud%202.0%20Pre-alpha%2000.00.05.ova/download

This is still pre-alpha software, but we hope this release provides an easier and more stable alternative to installing Media Cloud from source control.


HTML::TextCruft CPAN module released

We have extracted a piece of the Media Cloud code base and released it as HTML::TextCruft, a standalone CPAN module. HTML::TextCruft is the first part of the code that extracts article text from HTML and removes ads, navigation, and other cruft.
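To make the idea of cruft removal concrete, here is a minimal Python sketch. It is not the algorithm or API of HTML::TextCruft; it only assumes that text inside a handful of structural tags (the hypothetical `CRUFT_TAGS` set below) is not article text, and it keeps everything else.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as cruft in this sketch. This list is an
# assumption for illustration, not the rule set used by HTML::TextCruft.
CRUFT_TAGS = {"script", "style", "nav", "aside", "header", "footer"}


class CruftStripper(HTMLParser):
    """Collects text that is not nested inside any cruft tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many cruft tags we are currently inside
        self.chunks = []     # pieces of text kept so far

    def handle_starttag(self, tag, attrs):
        if tag in CRUFT_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in CRUFT_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside all cruft tags.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the text of an HTML string with cruft sections dropped."""
    parser = CruftStripper()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For example, `extract_text("<nav>Home</nav><p>Article body.</p>")` keeps only the paragraph text. Real pages need much more than a tag list, which is why a dedicated module is useful.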

Media Cloud has always been free and open source, but because it is a large code base, not everyone is able to install it. By releasing this piece as a separate module, we hope that its functionality will be more accessible to the wider community.

More information on HTML::TextCruft is available on its CPAN page.


Media Cloud is Participating in Google Summer of Code 2012

Media Cloud is excited to be participating in Google Summer of Code this year through the Berkman Center for Internet & Society. Google Summer of Code (GSoC) is a global program in which Google offers students stipends to work on open source projects. Media Cloud received valuable contributions from our students when we participated in 2009 and 2010, and we’re looking forward to this year’s program.

For students who are interested in working on Media Cloud through Google Summer of Code, we have put together a list of possible Media Cloud projects here. There is also a Berkman Wiki listing Berkman-specific GSoC requirements as well as a number of other interesting Berkman projects participating in GSoC. Finally, the GSoC homepage contains detailed information about GSoC policies and eligibility requirements.

The final application deadline is April 6 at 19:00 UTC, but early applications are preferred.


Russian Media for the Week of 6/27/2011 – 7/03/2011

Russian media this week has seen the emergence of a number of prominent stories, including themes related to Russia’s budget and banking system, political appointments, energy politics, Russia’s relations with neighboring countries, bills being debated by the Duma, and concerns over forest fires in the country’s far east.

Week of June 27 – July 3 (Red) Compared to June 20 – June 26 (Blue) for Five Major Russian Media Segments (TV, Pop Blogs, Random Blogs, Mainstream Media, Government):

New issues related to domestic politics and finance seem to dominate the overall week-to-week comparison cloud, indicated by the emergence of new high frequency words (in red) such as “банк” (bank), “бюджет” (budget), “газа” (gas), and “национальной” (national).  The frequent discussion of banks this week is in part accounted for by the catastrophic failure and subsequent bailout of the Bank of Moscow, Russia’s fifth largest bank.  In what is reputed to be “the largest bailout in modern Russian history,” the bank will receive as much as $14 billion in state-backed loans, with the state-run VTB Bank increasing its stake in the company to 75%.

The Russian budget and budgetary constraints were also an important theme in this week’s news. On Wednesday, 6/29, President Dmitry Medvedev delivered an address to the Duma laying out his three-year budget guidelines for the 2012-2014 period. Focusing on governance efficiency, modernization, competitiveness, long-term development, and living standards, the President laid out 12 vital areas of budget policy that will be central to achieving national economic goals in the coming years. In addition to his ongoing emphasis on modernization, Medvedev stressed the need for economic decentralization, with development occurring on a regional level and not just in and around the capital cities. Budgets were also discussed in several other contexts this week, helping to account for the appearance of “бюджет” in the week’s overall word cloud. Prime Minister Vladimir Putin made headlines for drawing attention to the need to ensure the new budget would be deficit-free. News stories also discussed the protests in Greece related to that country’s budget debate and the possible implications for Russian oil revenue. Nezavisimaya Gazeta reported on a new study that shows Russians on average spend 30% of their household budget on food, with poorer families spending as much as 50% of their income. The discussion of Medvedev’s budgetary plan and related topics clearly dominated the Government media segment for the week. New high frequency words there such as “развивать” (develop), “реализации” (implementation), “региональных” (regional), “современные” (modern), “экономики” (economy), “экономической” (economic), and “технического” (technical) indicate the frequent discussion of some of the main components of Medvedev’s plan.

Week of June 27 – July 3 (Red) Compared to June 20 – June 26 (Blue) for Russian Government:

A couple of last week’s major stories continued to attract attention this week, with related terms showing up in purple in the week-to-week comparison cloud. These include, for example, the nomination of St. Petersburg Governor Valentina Matvienko to become Speaker of the Federation Council. With the approval of Medvedev and Putin, Matvienko agreed this week to accept the new position. Opposition formed in St. Petersburg, with young Yabloko party members protesting in the street on Wednesday and an opposition bloc forming under the name “St. Petersburg against Matvienko.” As the city’s governor since 2003, Matvienko had become increasingly unpopular, resented by local residents for her government’s failure to clear the streets of snow and ice in the winter. Many have speculated that her move was part of an effort to buoy support for the United Russia party in preparation for the upcoming Duma elections this December. This story’s continued prominence is indicated by the frequency of words such as “петербург” (Petersburg), “федерации” ([of the] federation), and “совет” (council) in the week’s overall cloud. Drilling down into specific media segments, the attention garnered by Matvienko’s high profile move becomes even more apparent, with her name (“Матвиенко”) and the word “губернатор” (governor) appearing among the new high frequency words in this week’s Mainstream Media word cloud.

Week of June 27 – July 3 (Red) Compared to June 20 – June 26 (Blue) for Russian Mainstream Media:

Some additional prominent topics in the week’s news also become more apparent on examining some of the other week-to-week comparisons for particular media segments. The ongoing controversy surrounding the corruption accusations against and trial of former Orange Revolution leader Yulia Tymoshenko in Ukraine, for example, attracted the attention of some news segments more than others. The former prime minister was indicted last December for abuse of power, with President Victor Yanukovich claiming that she illegally used $425 million in “Kyoto money” (money received from the sale of carbon emission quotas) to finance pensions. If she is found guilty, Tymoshenko will be banned from holding political office. While some variant on “Украина” (Ukraine) appears as a high frequency word over the last couple of weeks in the Mainstream Media and the Popular Blogs word clouds, this topic appears not to have received equal attention in all media segments. A comparison between popular blogs and TV media shows that this story appears to have gotten significantly more attention in the blogosphere than in television news coverage, as demonstrated by the appearance of “Украины” in red in the word cloud comparing these two media segments.

Russian Popular Blogs (Red) versus Television (Blue) for Week of June 27 – July 3:

A similar contrast can be seen in the coverage of the ongoing conflict between Russia and Belarus over unpaid electricity debt for April and May. Belarus, which has been suffering a deep economic crisis over the last several months, owes Russia some 1.2 billion rubles ($43 million). The situation came to a head this week, with the Kremlin threatening to cut off Belarusian electricity supplies if the debt was not repaid by Wednesday. Though the immediate crisis was resolved by week’s end with Belarus promising to pay its debt and Russia restoring power supplies, the tension between the two countries continued, with disagreement as to the extent to which natural gas prices should be reduced in light of the recent Belarusian currency devaluation. This story, as with that concerning Ukraine, appears to have received more attention in some media segments than others. In contrast to the Ukrainian trial, this story seems to have been covered more by television and mainstream media and received less scrutiny in the blogosphere. Note the appearance of “Белоруссия” (Belarus) in blue in the word cloud comparing high frequency words in the week’s TV and Popular Blog media segments.


Russian Media for the Week of 6/20/2011 – 6/26/2011

Russian media this week has been dominated by several new themes, relating to national history, disasters, and high politics.  The red words in the word cloud below indicate words that appeared in this week’s news with unusually high frequency, showing a contrast with the previous week.  (Blue words show high frequency words unique to the previous week, and purple indicates words that appeared with significant prevalence both weeks – generally representative of recurrent themes.)
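The color scheme described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Media Cloud's actual implementation: the names `top_words` and `color_words`, the whitespace tokenization, and the cutoff `n` are all hypothetical, and the sketch ranks words by raw frequency only.

```python
from collections import Counter


def top_words(text, n):
    """Return the set of the n most frequent lowercase tokens in text."""
    counts = Counter(text.lower().split())
    return {word for word, _ in counts.most_common(n)}


def color_words(this_week_text, last_week_text, n=50):
    """Map each high frequency word to red, blue, or purple.

    red    = high frequency this week only
    blue   = high frequency in the previous week only
    purple = high frequency in both weeks
    """
    current = top_words(this_week_text, n)
    previous = top_words(last_week_text, n)
    colors = {}
    for word in current | previous:
        if word in current and word in previous:
            colors[word] = "purple"
        elif word in current:
            colors[word] = "red"
        else:
            colors[word] = "blue"
    return colors
```

Calling `color_words` on the combined text of two weeks returns a dictionary that a word cloud renderer could use to pick each word's color.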

Week of June 20 – June 26 (Red) Compared to June 13 – June 19 (Blue) for Five Major Russian Media Segments (TV, Pop Blogs, Random Blogs, Mainstream Media, Government):

As is clear from this week’s overall comparative word cloud across five major media segments, one of the dominant themes in the week’s media has been the 70th anniversary of the German invasion that marked the beginning of the Great Patriotic War (World War II). The German invasion of the Soviet Union (Operation Barbarossa) began on June 22nd, 1941, when Nazi tanks entered Soviet territory near the town of Brest in Belarus. It was the beginning of four years of war in which over 20 million Soviet soldiers and civilians would perish (over 13% of the population). The anniversary, referred to as a national “Day of Memory and Sorrow,” was somberly recalled in memorial events across Russia this week. The unusually high occurrence of various forms of words such as “война” (war), “служба” (service), “великий” (great [patriotic war]), and “военный” (military) indicates the frequency with which the war and its legacy were discussed across the five media segments over the course of this week. Some variants of one or more of these words appear clearly in the week’s word clouds for both Mainstream Media and Television, indicating that the story had particular prominence across these segments. In popular blogs, we also see higher than usual discussion involving words such as “советский” (Soviet), often involving discussion of Soviet history and the legacy of the war.

One of the other major stories of the week was the June 20th crash of a passenger airplane (a Tupolev 134A-3) en route from Moscow to Petrozavodsk. The aircraft, registered as RA-65691 and operated by the airline RusAir (Русэйр), crashed and broke apart on landing, killing forty-seven out of fifty-two occupants. This story is clearly indicated by prominent words in the week’s word cloud, such as “самолет” (airplane) and “петрозаводск” (Petrozavodsk). One or both of these words appear in the week’s word clouds for both the Mainstream Media and TV. The story apparently also received some prominent attention in the Government press, with “мчс” (acronym for the Russian Emergencies Ministry) appearing as one of the week’s highest frequency words for that news segment. This theme seems to have been particularly picked up in Russian television, with additional words such as “авиакатастрофе” (aviation accident), “больницы” (hospitals), “погибших” (dead/deceased), “аэропорт” (airport), “пассажир” (passenger), “транспорт” (transportation), and “транспортакатастрофы” (transportation accident) featuring as unusually high frequency words visible in the segment-specific weekly word clouds.

A third significant set of stories of this week had to do with the appointments and nominations of officials for government positions.  Specifically, this included President Medvedev’s appointment of officials to fill leadership positions in the Ministry of the Interior (Министерство Внутренних дел Российской Федерации), the President’s apparent support for Saint Petersburg Governor Valentina Matvienko’s nomination as the new Speaker of Russia’s Federation Council (Совет Федерации), and the reappointment of Yuri Chaika as Prosecutor General (Генеральный Прокурор) by the Federation Council.  These stories are indicated by the prevalence of words such as “министерства” (ministry), “внутренних” (internal), “совет” (council), “федерации” ([of the] federation), and “генерал” (general).  The coverage of these news events appears to have been particularly strong, not surprisingly, across the Government media segment, though they also have received some attention in TV, Mainstream Media, and Popular Blogs.

Below are the week’s comparative word clouds from each of the five media segments (TV, mainstream media, government, popular blogs, and a random sample of all blogs).  Click on these figures to view interactive word clouds from which to explore themes of interest.

Week of June 20 – June 26 (Red) Compared to June 13 – June 19 (Blue) for Russian TV:

Week of June 20 – June 26 (Red) Compared to June 13 – June 19 (Blue) for Russian Mainstream Media:

Week of June 20 – June 26 (Red) Compared to June 13 – June 19 (Blue) for Russian Government:

Week of June 20 – June 26 (Red) Compared to June 13 – June 19 (Blue) for Russian Popular Blogs:

Week of June 20 – June 26 (Red) Compared to June 13 – June 19 (Blue) for Russian Random Blogs:


Russian Media for the Week of 6/12/2011 – 6/18/2011

This week’s Russian word cloud shows some new trends and stories that differ from those of the previous week, though there have been few dramatic shifts in coverage. The most striking new story to emerge here appears to be that of Colonel Yuri Budanov (Полковник Юрий Буданов), who was murdered after being paroled from a prison sentence for the murder of a young woman in Chechnya. This story accounts for several of the increased frequency words that emerge in this week’s word cloud, a pattern also separately visible across all major media segments except for official government sources. On closer inspection, some other stories have acquired new or renewed attention in particular media segments, with coverage of Ukraine and Mikhail Khodorkovsky featuring prominently in popular blogs and television media respectively.

Words in four prominent media segments (popular blogs, mainstream media, government, television) during the week starting 2011-06-05 (Blue) versus during the week starting 2011-06-12 (Red):

The word cloud above, comparing a combined set of main media sources from June 12th through June 18th 2011 (red) with the same set of sources over the previous week, June 5th through June 11th 2011 (blue), shows several new stories emerging (red), but none of these are at as high a word frequency as the major words in purple (mentioned frequently both weeks) or even as the major words from the previous week (in blue). The cloud compares the combined sets of popular blogs, mainstream media sources, government media content, and television media content across the two weeks.

Some of the newly prominent words do not appear to represent any major new stories; ubiquitous names and financial terms likely appear as top words only because of a relative decline in other major stories with more uncommon terms.

The overall cosine similarity across the four media segments in Media Cloud between the weeks of June 5-11 and June 12-18 is 0.905, demonstrating a fairly high level of similarity between the two weeks. This level of variation is not constant across all media forms, however. We see some dissimilarities in the patterns of change within distinct media sources.
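For readers unfamiliar with the measure, the following Python sketch shows how a cosine similarity between two weeks of word counts can be computed. It assumes each week is represented as a plain word-count mapping; the function name `cosine_similarity` is ours, and Media Cloud's own pipeline may weight, stem, or filter words differently.

```python
import math
from collections import Counter


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two word-count vectors.

    Returns 1.0 for identical distributions and 0.0 when the two
    vectors share no words (or when either one is empty).
    """
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(value * value for value in counts_a.values()))
    norm_b = math.sqrt(sum(value * value for value in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy counts, made up for illustration only (not Media Cloud data).
week_one = Counter({"банк": 3, "бюджет": 2})
week_two = Counter({"банк": 3, "бюджет": 2})
same = cosine_similarity(week_one, week_two)  # identical weeks give 1.0
```

A score near 1, such as the 0.905 reported above, means the two weeks used largely the same words in similar proportions.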

Government sources seem to have shown the most significant changes in topical focus between the two weeks, with TV and mainstream media showing the next greatest amounts of change; both show lower week-to-week cosine similarity scores than popular blogs during this period. This is interesting, as it indicates that the blogosphere’s topical foci have remained relatively constant while some new topics have been introduced to (or have disappeared from) the mainstream media, TV, and government sources.

In terms of coverage of key stories, there appear to be substantial differences between the topics receiving the greatest attention across the different media segments. Most of this variation has persisted from the previous week, rather than reflecting a dramatic shift caused by uneven coverage of a suddenly emerging pivotal story.

As we can see here, there has in fact been a modest convergence among the different news sources in the last week. Nevertheless, the differences across segments are striking. The following word cloud shows the comparison between the content of popular blogs versus government media outlets during the June 12th-18th period.

Words in Popular Blogs (Blue) during the week starting 2011-06-12 versus words in Government media sources (Red) during the same week:


Here we see that coverage of war, other countries (including the US and Ukraine), Moscow, words related to the internet, politics, and the Budanov murder (colonel, Budanov, murder) all receive more attention in the popular blogs, whereas words related to economics (budget, financial), governance (regional, municipal, federal, law), and citizenship (self-governance, participation, citizen) feature prominently in the government media sources.

The extremely low cosine similarity value between popular blogs and government sources is consistent with tendencies noted in previous blog posts.  Perhaps more surprising is the fact that TV media sources appear even more dissimilar from government sources, with these two media segments showing the lowest cosine similarity for the week at 0.318.
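Finding the least similar pair of segments amounts to computing the cosine similarity for every pair and taking the minimum. The Python sketch below shows that step; the helper names and the toy word counts are made up for illustration and are not Media Cloud data or code.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two word-count vectors."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(value * value for value in counts_a.values()))
    norm_b = math.sqrt(sum(value * value for value in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_similarities(segments):
    """Return {(name_a, name_b): similarity} for every pair of segments."""
    return {
        (a, b): cosine_similarity(segments[a], segments[b])
        for a, b in combinations(sorted(segments), 2)
    }


def least_similar_pair(segments):
    """Return the pair of segment names with the lowest similarity."""
    scores = pairwise_similarities(segments)
    return min(scores, key=scores.get)


# Toy word counts, invented for this example.
toy_segments = {
    "tv": Counter({"film": 3, "war": 2}),
    "government": Counter({"budget": 4, "law": 1}),
    "blogs": Counter({"war": 3, "budget": 1}),
}
lowest = least_similar_pair(toy_segments)
```

With these toy counts, TV and government share no words at all, so they come out as the least similar pair.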

Words in TV (Blue) during the week starting 2011-06-12 versus words in Government media sources (Red) during the same week:


Here the high frequency words from TV (blue) show significant difference from those appearing frequently in government sources (red) with very little overlap (purple) in high frequency words.  While this does not definitively indicate a lack of similarity in coverage (or lack of coverage) of some topics, it certainly appears to indicate that there is a fair degree of dissimilarity in the topics that are covered.  In addition to the TV coverage of the Budanov murder (which did not receive frequent mention in government sources), the TV sources for the week included more prominent discussion of Khodorkovsky, war, other countries (including Europe), and cultural items such as film and festivals.

As these last couple of examples indicate, some of this dissimilarity could have to do with non-news content in the TV news feed (or at least a broader definition of news to include things not addressed by government media sources); but, as demonstrated by the other examples of non-overlapping frequent words, it appears there is also some substantial difference in the primary news content.


Weekly Update: Week of May 23

You know it’s a slow news week when there’s this much baseball–and soccer!–showing up in the US mainstream media:



Russian Media Cloud Comparative Analysis

Using Cosine Similarity to compare week to week coverage in Russian media

What are the differences in how various Russian media outlets – traditional and web native – cover events?  How does coverage differ between sources during the same time period? Does coverage overlap – or do different outlets highlight different events?  What do these choices tell us about media outlet priorities and preferences?

With these questions in mind, we used Russian Media Cloud to determine the levels of similarity between four Russian media sets (Russian government websites, Russian television news, Russian mainstream media websites, and popular Russian blogs) over two weeks in April 2011. We use cosine similarity to determine how similar the content of the media sets is to one another, a method described in this earlier post by Ethan Zuckerman.
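For reference, the standard definition of cosine similarity is given below. The notation is ours: we assume each media set is represented as a vector of word counts, with a_w and b_w the counts of word w in media sets A and B. The exact weighting used in the earlier post may differ.

```latex
\[
\operatorname{sim}(A, B)
  = \frac{\sum_{w} a_w \, b_w}
         {\sqrt{\sum_{w} a_w^{2}} \; \sqrt{\sum_{w} b_w^{2}}}
\]
```

Because word counts are never negative, the score ranges from 0 (no words in common) to 1 (identical word distributions).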



Mapping the U.S. Popular Blogosphere

One thing that we can use quantitative text analysis for is to get a sense of the overall landscape of a set of blogs. The following map of popular blogs in the U.S. (the top 1000 blogs according to Bloglines) gives a good sense of what topics people write about on popular blogs and how those topics relate to one another:



Weekly Update: May 16, 2011

This week was the week of stories that could have been:

