Evaluating Author Extraction for the Media Cloud Platform

Sands Fish & Rahul Bhargava
December 4, 2017

As we expand the capabilities of the Media Cloud platform, we are always looking for ways to detect and parse additional metadata that can support our research questions and those of our collaborators. One type of question we often encounter is the desire to understand who is talking. If we can answer this question at a broad scale, we can begin to understand who speaks the most, who garners the most attention, and who uses certain language. With that information we could paint a clearer picture of the media landscape and the conversations that happen there.

We think about answering this question of who is speaking in online media reporting as two separate sub-questions:

  1. Who is quoted in news online, and who quotes them?
  2. Who are the authors that are writing news online?

This blog post summarizes our latest pass at investigating existing solutions for the second question: author detection in semi-structured web-based text. There are a few approaches we can draw on: pattern matching, relying on structured metadata, or outsourcing the job to more complex algorithms and APIs.
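
To make the first two approaches concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not the code we evaluated: the names `MetaAuthorParser`, `BYLINE_RE`, and `extract_authors` are hypothetical, the sketch assumes pages expose authors in `<meta name="author">` or `<meta property="article:author">` tags, and the fallback assumes English bylines of the form "By Jane Doe" on their own line.

```python
import re
from html.parser import HTMLParser

# Assumed byline shape: a line that reads "By Firstname Lastname",
# with two to four capitalized name tokens. Real bylines vary widely.
BYLINE_RE = re.compile(
    r"^[ \t]*By[ \t]+([A-Z][\w.'-]*(?:[ \t]+[A-Z][\w.'-]*){1,3})[ \t]*$",
    re.MULTILINE,
)


class MetaAuthorParser(HTMLParser):
    """Collects author meta tags and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.authors = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = (attrs.get("name") or attrs.get("property") or "").lower()
            content = (attrs.get("content") or "").strip()
            if key in ("author", "article:author") and content:
                self.authors.append(content)
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_parts.append(data)


def extract_authors(html):
    """Prefer structured metadata; fall back to byline pattern matching."""
    parser = MetaAuthorParser()
    parser.feed(html)
    if parser.authors:
        return parser.authors
    text = "\n".join(parser.text_parts)
    return [match.group(1) for match in BYLINE_RE.finditer(text)]
```

Trying metadata first and the pattern second is a design choice made for this sketch: metadata, when present, is less ambiguous than free text, while the pattern covers pages that carry no author tags at all.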



We were recently notified that this testing relied on old, unmaintained versions of the goose and newspaper libraries, which renders it inaccurate and not useful. We have pulled it down while we update our code and results. We're sorry for the error.