Tech Brief: How We Deduplicate Content

Rahul Bhargava
Technical Lead and Principal Investigator
Northeastern University
Published July 14, 2020

By Rahul Bhargava (with input from Hal Roberts, Emily Ndulue, Dennis Jen, and Cindy Bishop)

Working on the open web is hard - getting the raw content out of a webpage is tricky, the content of stories posted online changes often, the same "story" often shows up at different URLs, and stories can become unavailable if you try to view them a few months later ("link rot"). These problems stymie anyone trying to research online news at scale, including us. Various research projects work around these problems with techniques referred to as "normalizing", "merging", or "deduplicating" content. Whatever you call it, Media Cloud takes a multi-pronged approach. This blog post explains some of the problems and the techniques we employ to address them, both in Explorer and when you create a Topic in our Topic Mapper tool.

What is the real "content" of a story?

If you download a story from a newspaper's website, around 80% of the HTML content is not really part of what a reader would call the "story". JavaScript code, formatting, ads, and navigation controls make up the bulk of most news stories online. For our research we want to store only the actual news content, not this cruft. Getting to that content can be tricky.

A screenshot of an online story from the Boston Globe. I've scribbled over all the non-story content in red. I put it sideways so it doesn't make you scroll forever to see the whole thing.

Like most other projects, we use third-party software to get the "news content" out of the raw HTML: version 0.7 of readability-lxml. It does a great job of removing all the non-content information. Every story we ingest is passed through readability-lxml.
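
As a rough illustration, extracting content with readability-lxml looks something like the following sketch. The URL is made up and this is not our production pipeline, just the library's basic usage:

```python
import requests
from readability import Document  # pip install readability-lxml

# Fetch the raw HTML of a story (URL is illustrative)
html = requests.get("https://www.example.com/news/some-story").text

# readability-lxml wraps the page and exposes the extracted article
doc = Document(html)
print(doc.short_title())  # the story's title, with site cruft trimmed
print(doc.summary())      # the article body as cleaned-up HTML
```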

However, we found one specific way readability-lxml fails that was affecting our research. News sites commonly include links to other stories within an article. A few years ago we started seeing a design pattern where more and more of these links were long snippets, sometimes including the entire lede of the linked-to story. readability-lxml was including those snippets as part of the story text, because they are so hard to distinguish from the actual news content. Including them is very harmful, because they are often long links to the most popular "top" stories, so they can significantly skew our attention results.

An example of an extended link to a related story embedded within the content of a story on the NYTimes site. This is interpreted by readability-lxml as content, when in fact it is not. We want to remove things like this.

Unfortunately the solution to this is quite complicated - it relies on the observation that these embedded links appear across lots of stories on the same site. We added a layer of code that checks, within a given media source and calendar week, whether the exact same sentence repeats across stories. If it does, then for regular stories in Explorer search results we remove the sentence from every story except the first one we found that includes it. The idea is that this redundant content is not actually part of the news story. We've found that this works well, though it has one weakness we noticed: it can sometimes remove repeated quotes. Overall, however, the approach helps far more than it hurts. When you create a topic we go one step further and remove that first occurrence of the duplicated sentence as well, because in the past we have had topics that were overwhelmed when we did not.
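
Here is a minimal sketch of that pass in Python, operating on in-memory data. The field names ('media_id', 'week', 'sentences') and the function name are hypothetical; our real implementation runs against the database rather than in memory:

```python
import hashlib
from collections import defaultdict

def remove_duplicate_sentences(stories, keep_first=True):
    """Drop sentences that repeat across stories from the same media
    source within the same calendar week.

    `stories` is a list of dicts with hypothetical fields: 'media_id',
    'week' (e.g. '2020-W29'), and 'sentences' (a list of strings).
    keep_first=True mirrors the Explorer behavior (the first occurrence
    survives); keep_first=False mirrors the Topic behavior (every
    occurrence of a repeated sentence is dropped).
    """
    def key_for(story, sentence):
        digest = hashlib.md5(sentence.encode("utf-8")).hexdigest()
        return (story["media_id"], story["week"], digest)

    # First pass: count each sentence within its (source, week) bucket.
    counts = defaultdict(int)
    for story in stories:
        for sentence in story["sentences"]:
            counts[key_for(story, sentence)] += 1

    # Second pass: keep unique sentences; handle repeats per the flag.
    seen = set()
    for story in stories:
        kept = []
        for sentence in story["sentences"]:
            key = key_for(story, sentence)
            if counts[key] == 1 or (keep_first and key not in seen):
                kept.append(sentence)
            seen.add(key)
        story["sentences"] = kept
    return stories
```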

What distinguishes one "story" from another?

Another issue is that we very often run into multiple copies of the same story - it can appear in an RSS feed, be linked to from another story, be shared in a Tweet, etc. We take two approaches to merging the many copies of a story we find into a single entry in our database: URL-based matching and title-based matching. Each is described below.

One norm of the open web is that the same news story shows up at multiple URLs. For instance, the URL we find for a story in an RSS feed might be different from the one linked on the publisher's homepage; and one or both of those can sometimes be a "redirect" that takes you to a third URL for the story, and so on. As an example, our system once encountered a dozen different URLs for one single story from Reuters.

Like many other researchers, we begin by aggressively "normalizing" each URL to reduce it to its most basic unique form. Interestingly, this often yields URLs for a story that don't even resolve to a real web address (you can browse our source code for this "lossy normalization" process). The idea is to get down to the bare minimum of information in a URL that makes it unique. Once we have that, we save the "normalized" URL and connect it to the original story and URL. When a new URL comes in, we compare it to all the other normalized URLs from the same media source to look for matches; if there is no match, we add it to our system as a new story. We find this helps our research significantly, because we worry far less than we used to about duplicates showing up. They still do, but at a much smaller rate.
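
To make the idea concrete, here is a simplified sketch of lossy URL normalization in Python. The transformations and the tracking-parameter list below are illustrative assumptions, not our exact rules; those live in the source code linked above:

```python
from urllib.parse import urlparse, parse_qsl, urlencode

# Query parameters that track readers rather than identify stories
# (an illustrative list, not Media Cloud's actual one).
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                   "utm_term", "utm_content", "fbclid", "ref"}

def normalize_url_lossy(url: str) -> str:
    """Reduce a URL to a minimal form for duplicate matching. The
    result may not resolve in a browser; it only needs to be
    consistent so that variants of the same story's URL collide."""
    parsed = urlparse(url.strip().lower())
    host = parsed.netloc
    if host.startswith("www."):
        host = host[4:]
    # Keep only query parameters that might identify the story.
    query = sorted((k, v) for k, v in parse_qsl(parsed.query)
                   if k not in TRACKING_PARAMS)
    normalized = "http://" + host + parsed.path.rstrip("/")
    if query:
        normalized += "?" + urlencode(query)
    return normalized

print(normalize_url_lossy(
    "https://www.example.com/politics/story-slug/?utm_source=rss&fbclid=abc"))
# -> http://example.com/politics/story-slug
```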

The second approach we take to matching stories with the same content is to use the title as a clue. Our software starts by getting rid of boilerplate that very often appears in article titles: many publishers prefix or postfix titles with their website or newspaper name. For instance, the Boston Globe article pictured earlier has a page title of "Harvard, MIT sue to block Trump move to bar foreign students from US if classes are online - The Boston Globe". Our code strips these away, removes punctuation, and converts the title to lowercase (you can review the source code for normalizing titles if you are curious). With that "normalized title" in hand, we check it against all the other titles from the same media source on the same day. If there is a match, we add an entry so the URLs resolve to the same story ID in our database.
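
As a sketch of what that looks like, here is a simplified title normalizer in Python. The separator list, regexes, and function name are illustrative assumptions, not our exact implementation; see the linked source code for the real logic:

```python
import re

def normalize_title(title: str, media_name: str) -> str:
    """Strip a publisher's boilerplate and reduce a title to a
    comparable form. `media_name` is the source's name,
    e.g. "The Boston Globe"."""
    # Drop a prefixed or postfixed publication name around common separators.
    for sep in (" - ", " | ", ": "):
        parts = title.split(sep)
        if len(parts) > 1:
            parts = [p for p in parts
                     if p.strip().lower() != media_name.lower()]
            title = sep.join(parts)
    # Lowercase, strip punctuation, and collapse whitespace.
    title = re.sub(r"[^\w\s]", "", title.lower())
    return re.sub(r"\s+", " ", title).strip()

print(normalize_title(
    "Harvard, MIT sue to block Trump move to bar foreign students "
    "from US if classes are online - The Boston Globe",
    "The Boston Globe"))
# -> harvard mit sue to block trump move to bar foreign students
#    from us if classes are online
```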

It is worth mentioning one thing we do not do: we don't refetch stories to track edits or changes. We fetch an article once and save that version as the one true version. Our system doesn't follow an article over time to look for changes or to get the latest copy of it. This rules out some types of research questions, for instance digging into corrections or SEO optimizations, but doesn't hinder our main body of work at all.

So how much does it help?

It is hard to come up with a rigorous quantification of how much these approaches help. Ideally you'd compare a "raw" dataset of stories to a manually reviewed and curated one. At the scale we operate, however, that would be very hard to do: our content and research cover too many languages and technology norms to allow for a robust manual evaluation of how much each of the approaches mentioned here reduces duplicate content. In addition, there are often qualitative judgments about whether two stories are actually the same one; sometimes it can be hard even for a set of humans to agree on that. That said, over the years we have found that these approaches identify about 25% of the stories within a given set as duplicates.

We hope this post has helped illuminate how we approach deduplicating our open web news content in Media Cloud, and why. If you have more questions or suggestions, do drop us a line at support @ mediacloud.org.