An Important Fix to Our Top Words Analysis

Published April 8, 2021

We are making a technical change to a part of the Media Cloud system to address an issue we've recently discovered: due to overly aggressive English and Portuguese stop words lists, meaningful words may have been filtered out from top words visualizations and downloads in our tools. If you have research that used Media Cloud to generate word clouds (especially with English or Portuguese languages), you should read what follows and perhaps repeat your queries using the tool today, with shorter and less-restrictive stop words lists. This change does not affect other language tools within our system, nor any research conducted in other languages, and it is not relevant for research that relies on our attention, influence, or representation data.

Background

Media Cloud, like most text analysis systems, uses "stop words lists": lists of words that are so common that including them in the analysis makes results noisy and hard to use. For instance, in English, words like "a," "and," or "the" are so common that including them in visualizations of the most frequently used words in news coverage is unhelpful. We have different stop words lists for the twenty fully supported languages in Media Cloud.
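To make the idea concrete, here is a minimal Python sketch of counting top words with a stop words filter. It is only an illustration: the `top_words` function, the tiny `STOP_WORDS` set, and the sample sentences are assumptions made for this example, not Media Cloud's actual code, lists, or data.

```python
from collections import Counter
import re

# Illustrative stop words only; NOT Media Cloud's actual list.
STOP_WORDS = {"a", "and", "the", "of", "in", "to"}


def top_words(texts, stop_words, n=5):
    """Return the n most frequent words across texts, skipping stop words."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts.most_common(n)


# Hypothetical sample sentences, invented for this example.
sample = [
    "The police and the protesters met in the city.",
    "A report on the police response to the protests.",
]

# With the filter, a meaningful word leads; without it, "the" dominates.
print(top_words(sample, STOP_WORDS, n=1))
print(top_words(sample, set(), n=1))
```

The same mechanism explains the problem described in this post: any word placed in the stop words set, meaningful or not, can never appear in the results.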

Last week, we discovered that the English stop words list we have been using to generate word clouds is overly broad for many purposes. The list we have been using for the last 8 years includes terms that are common but important enough that they should be present in visualizations. Notably, it includes terms like "black" and "police", which are critical in discussions of racial justice in America, a frequent use of our tools. While we don't have a clear picture of why those terms were listed as stop words, we are guessing they were included because of the topic-specific nature of how Media Cloud started. We started building these tools 10 years ago for ourselves, to study online discourse on specific topics in the US. But Media Cloud is very different now: it is used by researchers across the world to study a diverse set of topics. This stop words list makes little sense now. Spurred by this discovery, we are auditing our stop words lists across all the languages we support. We have already found that our previous Portuguese stop words list needed major improvements, and we are revisiting others.

The Change We Are Making

As of today, we are using a much shorter list of stop words for English and Portuguese, and are polishing the lists for all other languages. This is an interim fix, and it will almost certainly make word clouds more "noisy" by including terms that people may prefer to filter out. You can hide terms you identify as stop words specific to your query using the "Edit this Word Cloud" feature found in the View Options menu.

This change only affects the word cloud visualizations available through the "Language" tab of our Explorer and Topic Mapper tools. Other aspects of the Media Cloud tools do not use these particular stop words lists. As noted, English and Portuguese are the main languages affected, but there are smaller changes we are making to all the other languages as well.

While this change might have minimal impact on some queries, the results of other queries change significantly. For instance, here is an illustrative comparison for a study of coverage of George Floyd in US media. On the left are the top words you would have seen prior to this change, when our stop words list was overly aggressive; on the right, highlighted in green, are the new words that will come back starting today. Note specifically how terms like "death", "killing", and "murder" all now show up. Research into language about this event would be significantly affected.

Comparing prior (left) and updated (right) top word results for a query of "George Floyd". New terms are highlighted in green.

On the other hand, some research inquiries are relatively unaffected. Take the before-and-after comparison below, from research into language in media coverage of President Biden. The terms coming back, from our reading, don't significantly change findings about the language used to discuss his plans.

Comparing prior (left) and updated (right) top word results for a query of "Biden". New terms are highlighted in green.

Next Steps

Our systems have grown large and complex over the past decade, but our team has remained small. That means we partly rely on reports from our users to identify problems with our systems. We are grateful to our friends at Global Voices who raised questions about stop words previously, and regret that we didn't fully address their concerns at the time. We now realize that the current stop words arrangement makes certain queries difficult to carry out accurately, and that is why we are acting quickly to solve the problem.

We plan to re-examine the way we use stop words lists, and to allow researchers to easily compare results generated by shorter and longer lists. For now, you can manually review the problematic "aggressive" English stop words list and the new shorter English one we have put in place. In the next few weeks, we will release a tool that will allow you to compare word cloud visualizations generated with the old and new stop words lists, so that researchers can examine their queries and see if they were adversely affected. We have also added support for reviewing the difference at the API level (use the new `old_stopwords` parameter in a call to `wordCount`).
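If you would like to reason about this kind of comparison offline, the following Python sketch shows one way to find the terms that a shorter stop words list brings back into a top words result. It does not call the Media Cloud API or reproduce the behavior of `old_stopwords`; the `top_words` and `restored_terms` helpers, both word lists, and the sample text are hypothetical and exist only for illustration.

```python
from collections import Counter
import re


def top_words(texts, stop_words, n):
    """Return the n most frequent non-stop-word terms, most frequent first."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return [word for word, _ in counts.most_common(n)]


def restored_terms(texts, old_stop_words, new_stop_words, n=10):
    """Terms in the top n under the new list that were absent under the old one."""
    old_top = set(top_words(texts, old_stop_words, n))
    return [w for w in top_words(texts, new_stop_words, n) if w not in old_top]


# Hypothetical lists: the "old" one is overly aggressive, the "new" one is shorter.
OLD_STOP_WORDS = {"the", "of", "a", "police", "death"}
NEW_STOP_WORDS = {"the", "of", "a"}

# Hypothetical sample text, invented for this example.
sample = [
    "The death of a man in police custody",
    "Police statement on the death",
]

print(restored_terms(sample, OLD_STOP_WORDS, NEW_STOP_WORDS, n=3))
```

In this toy input, the restored terms are exactly the meaningful words that the longer list had suppressed, which mirrors the green-highlighted terms in the comparisons above.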

Thanks for your understanding and patience. While we are grateful for all the financial support Media Cloud has received over the years, we are still a very small team working on this product, and we appreciate all the help we get from our community in identifying and fixing problems with the system. We intend to continue this audit for relics of previous versions of our system that might be affecting current uses, especially with regard to issues such as race and representation.

If you would like to discuss this change, or for support with how your research may be affected, please email support@mediacloud.org.

For reference: