Word Spaces - visualizing word2vec to support media analysis

Published: May 24, 2018

Media Cloud is a database of online news content, and a suite of tools for analyzing online media ecosystems. Researchers using it tend to focus on questions of attention, representation, influence, and language. We're introducing a new feature today to support the last one on that list - visualizing language via word2vec word-embeddings with what we're calling the "word space" chart. This feature was created and designed by Becky Bell and Rahul Bhargava.  Here's an example with the word space for a corpus of reporting on climate change in 2016 from American news sources:

An example word space, showing words in reporting about climate change in US media sources during 2016. Click to see the full interactive version.

There's a lot we've tried to bake in here. A quick tour of features:

  • Bigger and darker words were used more often in the stories; smaller and lighter words were used less often. As you can see, "climate" and "global" were the most used words.
  • When you hover over a word with your mouse, it turns orange to make it easier to read. Words that turn orange, and that are within the light blue cone, are all used in similar ways. This 'similarity' is based on an algorithmic analysis (cosine similarity) of the contexts that the words are used in.
  • The distance between words gives you an indication of how often they show up in the same contexts. Words that are close together have a high probability of showing up in the same contexts (though that doesn't necessarily mean they are used together).  Words that are far apart are used in different contexts.

Background

Quantitative analysis of text often relies on the process of creating "word embeddings" - converting words or phrases into numbers and vectors. There are a variety of techniques for doing this; most are based either on counting word occurrences or on predictive models of where words might occur.
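
To make the counting flavor concrete, here's a toy sketch of a co-occurrence-count approach; the documents and the naive tokenizer are our own illustration, not part of the Media Cloud pipeline.

```python
from collections import defaultdict
from itertools import combinations

docs = [
    "climate change threatens coastal cities",
    "global warming and climate policy debate",
]

# Count how often each pair of words appears in the same document. Each word's
# row of co-occurrence counts is a simple count-based "embedding" of that word.
cooccurrence = defaultdict(lambda: defaultdict(int))
for doc in docs:
    tokens = set(doc.lower().split())
    for w1, w2 in combinations(sorted(tokens), 2):
        cooccurrence[w1][w2] += 1
        cooccurrence[w2][w1] += 1

print(dict(cooccurrence["climate"]))  # "climate" co-occurs with words from both docs
```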

The simplest way to represent information about word use is the standard word cloud, as shown below, which sizes words based on their frequency of use.  This has well-documented limitations due to its lack of context. Don't show this to any friends you have that are data visualization experts; they hate word clouds.

A classic word cloud of reporting about climate change in US media during 2016. Words used more often appear bigger, darker, and more centrally located.

A more sophisticated word-embedding approach turns each word in a corpus into a vector that captures multidimensional information about the other words it is used with - addressing the context problem of word clouds.  This idea relies on the distributional hypothesis in linguistics: that words used in similar contexts tend to have similar meanings.  As J.R. Firth put it in 1957, “You shall know a word by the company it keeps.”

Over the last few years people have become very excited about using "word2vec" to represent the use of words and the contexts they appear in. A simplified description would be to say that word2vec builds neural nets that predict the probability of words appearing in a specific context, and of contexts appearing given specific words. This lets you do clever math-like operations on the words (i.e. "king − man + woman = queen" is the canonical example). For more technical details, read the academic papers that introduce the underlying neural net architectures and describe training efficiencies.  Or look at the open-source implementation of the algorithm described in both papers.
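
As a rough sketch of that analogy math, here's how it looks with the gensim library and the pre-trained Google News model we link to later in this post; the downloader API used here is our assumption for illustration, not something described in the papers.

```python
import gensim.downloader as api

# Load the pre-trained Google News word2vec vectors (a large download).
model = api.load("word2vec-google-news-300")

# "king - man + woman" lands closest to "queen" in the vector space.
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```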

More importantly for us, people often collapse this massive word2vec prediction space into a two-dimensional visual representation (using PCA) so we humans can look at it and understand it. Our first attempts at integrating this were visual mash-ups.  Like word clouds, we made more frequently used words both darker and larger. However, we placed the words in a 2D space based on the word vector data (a common technique).
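
A minimal sketch of that dimensionality reduction, assuming gensim word vectors and scikit-learn's PCA (the word list is illustrative only):

```python
import gensim.downloader as api
from sklearn.decomposition import PCA

model = api.load("word2vec-google-news-300")  # 300-dimensional vectors

words = ["climate", "global", "warming", "energy", "policy"]
vectors = [model[w] for w in words]

# Collapse the 300-dimensional vectors into 2D coordinates for plotting.
coords = PCA(n_components=2).fit_transform(vectors)

for word, (x, y) in zip(words, coords):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```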

An early attempt at showing word frequencies and word2vec data in a 2D space. A little busy, no?

The Problem

We've found that these mash-ups are confusing.  People read word clusters in the visualization above as groups of related words, but this judgment is based on cartesian (i.e. x/y) distance rather than angular (i.e. cosine) distance, which is the more accurate similarity metric. To correctly gauge the relationship between words, they should be reading the angle from the origin (remember, each word is a vector). We validated that cosine similarity in this 2D space is strongly correlated with cosine similarity in the high-dimensional space of the full model.
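
To see why the two readings disagree, here's a toy comparison using made-up 2D coordinates (not real model output): two words can sit close together on screen while pointing in quite different directions from the origin.

```python
import numpy as np

word_a = np.array([0.10, 0.02])  # hypothetical on-screen position of one word
word_b = np.array([0.02, 0.10])  # hypothetical position of another

# Cartesian (x/y) distance: how close the words look on screen.
euclidean = np.linalg.norm(word_a - word_b)

# Angular (cosine) distance: how far apart the vectors point from the origin.
cosine_similarity = np.dot(word_a, word_b) / (
    np.linalg.norm(word_a) * np.linalg.norm(word_b)
)
cosine_distance = 1 - cosine_similarity

print(euclidean)        # ~0.11 -- the words look close together
print(cosine_distance)  # ~0.62 -- but the angle between them is wide
```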

Reading cartesian (i.e. x/y) distance vs. reading angular distance. In cartesian distance these two words are very close, but in angular distance they are far apart.

Of course, this is really hard to think about, so we went about designing a more helpful visualization that would scaffold people's exploration of the results in an accurate way. The "word space" is the result of a number of iterations with our real users, but of course we'll continue to iterate on it.

Angular Position

The first key insight was that we needed to help people find the related words in the accurate (i.e. angular) way. Our inspiration here is the rich history of representing polar coordinates. We also decided to leverage the rollover action, because we saw that people tend to do that naturally when interacting with word clouds. With the vector angle in mind, the thing we want to show is "similarity" as represented by the cosine distance between the words.  After some testing with corpora that we know well, we came up with a threshold for "similar": on rollover we highlight words that have a cosine distance below this threshold. In addition, we introduced a cone-shaped background shading to visually represent the angular distance threshold.  This has visual echoes of a flashlight, which maps nicely onto the user task of "discovering conversations" in the corpus. Our hope is that these two indications (color and the conical background shading) expose some of the mechanisms under the hood and thereby help novice viewers understand how to read the word2vec data.
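
In sketch form, the rollover highlighting boils down to something like the following, assuming gensim-style word vectors; the threshold value here is a stand-in for illustration, not the one we actually settled on.

```python
SIMILARITY_THRESHOLD = 0.25  # illustrative cosine-distance cutoff, not our real value

def words_to_highlight(model, hovered_word, visible_words):
    """Return the words that would turn orange when `hovered_word` is rolled over:
    those whose cosine distance to it falls below the threshold."""
    highlighted = []
    for word in visible_words:
        cosine_distance = 1 - model.similarity(hovered_word, word)
        if cosine_distance < SIMILARITY_THRESHOLD:
            highlighted.append(word)
    return highlighted
```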

As the user rolls over "clinton", the background cone appears, highlighting the similar word space and the accurate (i.e. angular) way to read the chart. Words that are similar turn orange, and all the other words fade to grey.

Word Proximity

The other thing that is hard to think about with vectors is their length, usually measured as the distance from the origin in polar coordinates.  We read a whole bunch of papers, and tried a few things, but couldn't come up with a solid definition of what distance from the origin represents in this 2D reduction of the high-dimensional space.  At first we thought it might represent the variety of contexts a word is used in (see Shackel, 2015), but our validation didn't bear that out.

So instead we focus on the proximity between words as the useful concept to think about. Words that show up closer together have a high probability of being used in similar word contexts.  Words that are far apart are not likely to be used in the same contexts.

Of course, words can block each other out. To address this, we added the ability to zoom in (by double-clicking, for now). However, zooming is a very disorienting operation that can leave someone confused about where they are in the larger 2-dimensional space. Returning to polar coordinates for inspiration, we added a grid to the background.  It animates as the user zooms in, so they have a sense of the change in scale and can stay oriented to where they are in the word space.

Evaluation

We conducted a small user study to evaluate the effectiveness of this word space visualization at helping people identify distinct conversations within a corpus of text.  The study compared the utility of a standard word cloud, a modified word cloud including the word2vec similarity data, and the word space in helping users complete a short theme-detection task. Roughly 100 people participated in the study, and each was randomly assigned one of the three visualizations.

We found that there was little difference between visualizations when measuring performance by the number of correct themes identified; however, we found that the word space visualization helped users more in finding themes that were less prominent within the topic. Based on these initial findings, we believe that although standard word clouds can provide a very general understanding of the contents of a topic, the word space visualization can provide users with subtle insights into the text corpus that cannot be revealed by word frequency data alone.

We're still evaluating the results, in hopes of writing them up more formally, but for now we believe that the word space visualization can help users with limited knowledge of a topic find conversations that would otherwise be difficult to identify without deeper study of the text corpus.

Feedback?

Our team and collaborating researchers have already found these rough visualizations useful. They've been discovering insights with these charts that they hadn't seen before, which is rewarding for us to see!

However, we're not word2vec experts so we'd love your thoughts on this interpretation of the data and the visual representation. How can we make this wealth of hard-to-think-about data useful to novices researching large text corpora? Is this visual and explanation accurate? Is it useful? How could we improve it? How are other people visualizing word2vec data to help novices use it?

Also note that our Explorer tool uses the publicly shared Google News model (download), which like most models reflects many of the biases in our culture and news reporting (see Microsoft's work on ways to undo this). In our Topic Mapper tool, we generate a new model for each topic based solely on the corpus of stories included in that topic.  This produces far more useful results for researchers trying to identify disparate conversations or frames within a focused set of news about their topic of study.
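
The core idea behind those per-topic models looks something like the following gensim sketch (gensim 4.x style); the toy stories, tokenization, and training parameters are illustrative assumptions, not our production settings.

```python
from gensim.models import Word2Vec

# Each story in the topic's corpus becomes a list of tokens (toy tokenizer here).
topic_stories = [
    "climate change dominated coverage of the summit",
    "negotiators debated emissions targets and climate policy",
]
sentences = [story.lower().split() for story in topic_stories]

# Train a small model on just this topic's stories.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Words used in similar contexts within the topic end up close in the model.
print(model.wv.most_similar("climate", topn=3))
```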