Unsupervised Entity Outlier Detection: The Case of Same-Sex Marriage

Claudia Orellana and Fernando Bermejo
November 2, 2017

A controversy is a topic that evokes highly polarized opinions from different sectors of the population. In controversial topics, such as same-sex marriage, transgender rights, selective abortion, and re-entry post-incarceration, the role of news media is particularly critical. News media influences audience attitudes, especially when discussing topics or groups of people with whom the public has minimal interpersonal contact (Schiappa et al., 2005).

For example, for individuals who have never met a transgender person, the only image they have about them is likely to be the one portrayed by the media. This gives media the power to either increase or reduce prejudice, depending on how the topic is framed, who is involved in it, and how these actors are portrayed.

One controversial topic that has triggered significant media coverage around the world is same-sex marriage. Starting in the early 2000s, a handful of national parliaments began to legalize same-sex marriage. In May 2015, Ireland became the first country in the world to approve same-sex marriage through a referendum. In the U.S., after years of legal arguments, the Supreme Court held on June 26, 2015 that the fundamental right to marry is guaranteed to same-sex couples throughout the entire country. Given its worldwide importance and its broad resonance, we concentrate on same-sex marriage as our topic of analysis.

We present an unsupervised method that allows us to surface the main named-entities, i.e., people, locations, and organizations, discussed by the media in news articles on same-sex marriage. We follow an outlier detection approach to identify such entities of interest. Our initial exploration shows that the outlier named-entities, in a specific time period, provide useful initial knowledge that complements the information discovered based on popularity or topic detection methods.

Steady and bursty named-entities

Named-entities present in news articles carry different degrees of information. Some give the reader the background needed to better understand the topic, while others are present in a news article because they are the key participants in an event.

Based on their presence in the news, we define two types of named-entities:

  1. Steady: named-entities that are often mentioned in news articles and show a seemingly steady importance to the topic. They provide background knowledge and help the reader understand the controversy.
  2. Bursty: named-entities that present a noticeable burst of importance, possibly close to or during an event. They can help the reader better understand specific moments of the controversy.

In this analysis, we address the question: what are the most important characters of a controversy at a given point in time? To this end, we focus on detecting bursty named-entities which, contrary to steady named-entities, can give us more information related to events that occurred at specific moments in a controversy.

Data collection

Using Media Cloud, an open-source platform for studying media ecosystems, we collected 6,761 news articles on same-sex marriage from 844 U.S. news media sources, published between May 25 and July 23, 2015. The articles were selected because they contain the term marriage together with equality, traditional, gay, lesbian, or same-sex. For each news article, we collected the title, body, date and time of publication, the media source where the article was published, and the number and destination/origin of its outlinks/inlinks. May to July 2015 was a decisive period for same-sex marriage in the U.S., including the decision that rendered it legal throughout the country.

Identifying Named-Entities in News Articles

We use Stanford NER, a Java implementation of a Named Entity Recognizer, to identify named-entities in the sentences of the news articles. Stanford NER can identify up to 7 entity classes: location, person, organization, money, percent, date, and time. In this analysis, we concentrate on persons, locations, and organizations.

Although Stanford NER is implemented in Java, there are other options for those who prefer another programming language, such as Python. NLTK is a platform for building Python programs that work with human language data, and it offers a variety of tools for tasks such as tokenization, stemming, and named-entity recognition. You can read more about NLTK in the book Natural Language Processing with Python. NLTK also includes a wrapper for Stanford NER, StanfordNERTagger, which allows us to use the Java-implemented tagger from Python (read more in Named Entity Recognition with Stanford NER Tagger).

Once we have identified the named-entities, our next step is to perform entity disambiguation. Entity disambiguation is the task of determining the identity of an entity in the text and using this identity to uniquely represent the entity. Take the example of the U.S. Supreme Court. In our corpus, we find several ways in which news sources refer to it, including supreme court, us supreme court, the supreme court, big supreme court, the us supreme court, united states supreme court, and the united states supreme court. We know that all these variations refer to the same named-entity, so we need to find a way to map them all to a unique, unambiguous name. Another issue that we might encounter when working with entities is that the same word can refer to different entities depending on the context. For example, jaguar can refer to the animal, the car, or the band.

In this analysis, we disambiguate the identified named-entities based on their edit-distance as in (Leskovec et al., 2009). Edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other. After this step, 941 unique named-entities remain.
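To make the edit-distance idea concrete, here is a minimal sketch in Python. It computes the Levenshtein distance with dynamic programming and greedily groups name variants that lie close to a group's first member. The function names, the greedy grouping strategy, and the `max_distance` threshold are assumptions for illustration only; the exact procedure used in this analysis may differ.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def group_variants(names, max_distance=4):
    """Greedy grouping (hypothetical helper): a name joins the first group
    whose representative (first member) is within max_distance edits."""
    groups = []
    for name in names:
        for group in groups:
            if edit_distance(name, group[0]) <= max_distance:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


variants = ["supreme court", "the supreme court", "us supreme court", "guam"]
print(group_variants(variants))
```

A small absolute threshold merges short variants such as supreme court and the supreme court, but longer variants such as united states supreme court are many edits away and would need a looser or length-normalized threshold, or an additional alias list.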

Among these 941 named-entities, some are mentioned only once or twice in the whole corpus, or are mentioned by only one media source. These cases do not give us an overall idea of the characters involved in the controversy. To reduce noise, we set a lower bound N on the named-entity frequency in the whole corpus and a lower bound M on the number of unique media sources mentioning the entity. We use N=10 and M=5, meaning that we only keep entities that are mentioned in at least 10 news articles and by at least 5 different news sources, and we discard all named-entities that do not meet both conditions. Note that this decision is made on a case-by-case basis: for some tasks, we may wish to keep all the identified named-entities, or to set stricter lower bounds.
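The two lower bounds can be sketched as a simple filter. The input format, a list of (entity, article id, source) tuples, and the function name are assumptions made for this illustration.

```python
from collections import defaultdict


def filter_entities(mentions, min_articles=10, min_sources=5):
    """Keep entities mentioned in at least min_articles distinct articles (N)
    and by at least min_sources distinct media sources (M).

    mentions: iterable of (entity, article_id, source) tuples
              (hypothetical input format).
    """
    articles = defaultdict(set)
    sources = defaultdict(set)
    for entity, article_id, source in mentions:
        articles[entity].add(article_id)
        sources[entity].add(source)
    return {
        entity
        for entity in articles
        if len(articles[entity]) >= min_articles
        and len(sources[entity]) >= min_sources
    }
```

Using sets means that repeated mentions within the same article, or by the same source, are counted only once.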

We then proceed to group the news articles based on their publication date (we choose days as our period of study).
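A minimal sketch of this grouping step follows. It assumes each article has already been reduced to a (day, list of entities) pair, with days written as ISO date strings so that sorting them is chronological; both the input format and the function names are illustrative. The second helper builds a day-by-entity presence matrix with entities ordered from most to least frequent.

```python
from collections import Counter, defaultdict


def daily_entity_counts(articles):
    """articles: iterable of (day, entities) pairs, one per news article.
    Returns a dict mapping each day to a Counter of entity mentions."""
    counts = defaultdict(Counter)
    for day, entities in articles:
        counts[day].update(entities)
    return dict(counts)


def presence_matrix(counts):
    """Rows are days, columns are entities ordered by overall frequency;
    a cell is True when the entity is mentioned on that day."""
    totals = Counter()
    for day_counts in counts.values():
        totals.update(day_counts)
    days = sorted(counts)
    entities = [entity for entity, _ in totals.most_common()]
    rows = [[counts[day][entity] > 0 for entity in entities] for day in days]
    return days, entities, rows
```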

Fig 1: Entities (columns) in the same-sex marriage news articles per day (row), ordered from the most to the least frequent. Entities that are frequently mentioned in the news articles are shown as red columns - a red box indicates presence. Days where all or the majority of entities are mentioned in news articles are shown as red rows.

Figure 1 illustrates the presence of named-entities in these news articles per day. A colored box indicates that entity e (columns) is mentioned in the news on day d (rows). We observe that certain named-entities are mentioned almost daily, e.g., Barack Obama, U.S. Congress, Justice Kennedy, and Ted Cruz. These named-entities can give us an overall sense of the type of issues being discussed in same-sex marriage news articles during that period. In Figure 1 we also observe that on certain days, news articles concentrate on only a small group of entities, while on other days (e.g., June 26 to June 30, 2015) nearly every entity is mentioned.

Now we compute a daily importance score for each named-entity, using tf-idf. tf-idf, or term-frequency inverse document-frequency, is a numerical statistic that reflects the importance of a term to a document in a collection of documents. If a term is present in all or almost all documents in the collection, it is less important to a document than another term that is either unique to that document or appears rarely in the collection. According to their tf-idf values, we observe further differences among named-entities. Figure 2 shows the tf-idf values for each named-entity per day. Of all the identified named-entities, only some show a noticeable burst in their tf-idf value within the period of study, for example, Guam on June 5, Caitlyn Jenner on June 6, Wells Fargo and BB&T on June 10, and Jackie Cote, Walmart, and Diana Smithson on July 14, 2015. These entities do not necessarily provide an overall idea of the controversy, but rather indicate specific events that took place close to or at the time when the burst in the tf-idf value is observed.
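As an illustration, the sketch below computes a daily tf-idf score by treating each day as a document and each named-entity as a term. This mapping, the relative-frequency tf, and the unsmoothed log idf are assumptions: the text does not state which tf-idf variant was used, and the function name is hypothetical.

```python
import math
from collections import Counter


def daily_tfidf(counts):
    """counts: dict mapping day -> Counter(entity -> mentions on that day).
    Returns dict mapping day -> {entity: tf-idf score}.

    tf  = mentions of the entity on the day / all entity mentions on the day
    idf = log(number of days / number of days mentioning the entity)
    """
    n_days = len(counts)
    days_with_entity = Counter()
    for day_counts in counts.values():
        days_with_entity.update(e for e, n in day_counts.items() if n > 0)

    scores = {}
    for day, day_counts in counts.items():
        total = sum(day_counts.values())
        scores[day] = {
            entity: (n / total) * math.log(n_days / days_with_entity[entity])
            for entity, n in day_counts.items()
            if n > 0
        }
    return scores
```

With this variant, an entity mentioned on every day gets an idf of zero, so steady entities score low while entities concentrated on a few days score high.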

Fig 2: Daily importance (tf-idf) values for each named-entity from May 25 to July 23 2015.

Our final step is to apply the outlier detection method to each day separately.

Bursty named-entities as outliers. As shown in Figure 2, the importance of certain named-entities in news articles on given days appears to differ considerably from that of other named-entities. Given these characteristics, we use the concept of outlier as an analogy to our bursty named-entities.

Entity outlier detection. Our outlier detection method is based on the Median Absolute Deviation (MAD) criterion. MAD uses the median (M), rather than the mean, as the measure of centrality given its robustness to the presence of outliers in the data (Leys et al., 2013).

We use the MAD and define a decision criterion to detect the bursty named-entities, or outliers, per day. The MAD for day d is defined as follows:

$$MAD_d = b \, M_j\left( \left| i_e - M(i_E) \right| \right)$$

where $i_e$ is the importance score of named-entity $e$, $M(i_E)$ is the median of the importance scores of all named-entities present on day $d$, $M_j$ is the median of the resulting absolute values, and $b$ is a constant, $b = 1.4826$.

We then define the decision criterion for a value as follows:

$$\frac{i_e - M(i_E)}{MAD_d} > 3$$

where $M(i_E)$ is the median of the importance scores on day $d$, $i_e$ is the importance score of named-entity $e$, and $MAD_d$ is the MAD for day $d$. Under this criterion, entities whose value exceeds the threshold of 3 are our bursty named-entities, or outliers, for day d. Please refer to (Leys et al., 2013) for further explanation of the MAD and the decision criterion.
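The criterion can be sketched in a few lines of Python using only the standard library. The function name, the dictionary input format, and the choice to return no outliers when the MAD is zero are assumptions for illustration; only the constant b = 1.4826 and the threshold of 3 come from the text.

```python
from statistics import median

B = 1.4826  # constant b from the MAD definition


def bursty_entities(day_scores, threshold=3.0):
    """day_scores: dict mapping entity -> importance score for one day.
    Returns the entities whose (score - median) / MAD exceeds threshold,
    ordered from the strongest outlier to the weakest."""
    values = list(day_scores.values())
    if not values:
        return []
    m = median(values)
    mad = B * median([abs(v - m) for v in values])
    if mad == 0:
        # Assumption: with no spread around the median, flag nothing.
        return []
    flagged = [
        (entity, (score - m) / mad)
        for entity, score in day_scores.items()
        if (score - m) / mad > threshold
    ]
    flagged.sort(key=lambda pair: pair[1], reverse=True)
    return [entity for entity, _ in flagged]


# Toy scores for a single day (made-up values, not from the corpus).
scores = {"a": 0.10, "b": 0.11, "c": 0.09, "d": 0.10, "e": 0.12, "guam": 0.90}
print(bursty_entities(scores))
```

Because the sketch keeps the sign of the deviation, only entities with unusually high importance are flagged, which matches the notion of a burst.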

We use tf-idf as the measure of importance in the process of bursty named-entity detection. Given that this method is based only on the importance values, we do not need to assign any label to the named-entities. The process is entirely unsupervised.


To explore our results, we focus on four days between June and July 2015 when major events in the same-sex marriage controversy took place. The events and the top-10 surfaced outlier entities are shown in Table 1.

Table 1: Major events and top-10 outlier entities for four selected days.

  • June 5
      Event: Chief Judge Frances Tydingco-Gatewood of the U.S. territory of Guam strikes down its ban on same-sex marriage in Aguero v. Calvo
      Outlier entities: guam, the united states supreme court, mike huckabee, calvo, callie granade, texas, blair, united states, new york, alabama
  • June 9
      Event: Pulaski County Circuit Judge Wendell Griffen rules that over 500 same-sex marriages performed in Arkansas in May 2014 are valid
      Outlier entities: the united states supreme court, gloria baileydavies, massachusetts, griffen, wells fargo, franklin graham, arkansas, henry, billy graham, north carolina
  • June 26
      Event: The United States Supreme Court rules in Obergefell v. Hodges that because the fundamental right to marry extends to same-sex couples, same-sex marriage bans are unconstitutional under the Fourteenth Amendment. The decision renders same-sex marriage legal throughout the entire United States
      Outlier entities: the united states supreme court, united states, barack obama, roberts, america, justice anthony kennedy, washington, texas, justice antonin scalia, james obergefell
  • July 1
      Event: The Episcopal Church by overwhelming votes at its General Convention removes gender specific language from church laws on marriage to allow for religious wedding services for same-sex couples
      Outlier entities: the united states supreme court, nathan collier, barack obama, united states, episcopal church, america, christine, texas, montana, kardashian

We seek to automatically detect bursty named-entities that allow us to better understand the characters involved in a controversy. We cast the problem of identifying named-entities relevant to specific moments in time as one of outlier detection. Our preliminary results (see Table 1) suggest that our method is able to surface bursty named-entities that are, in fact, relevant to the events of a given day.

We explore our method’s ability to surface named-entities that appear in four major events as recorded in Wikipedia. Our method also identifies further named-entities that represent events of the day but do not appear in the Wikipedia snippets. For example, on June 26, among the top-5 bursty named-entities we observe Roberts, who is not mentioned in the Wikipedia event but who was Chief Justice of the Supreme Court when it issued that day’s decision to legalize same-sex marriage in the U.S. We are currently working on an experimental evaluation that includes additional controversies, as well as sources of ground truth other than Wikipedia events.

Named-entities are only one option to work with. When trying to characterize a controversy (who are the main characters? how are they involved in the controversy? what are the main events that took place within a time period?), we need to look further. Detecting relevant named-entities is a good place to start, but we have to go beyond them to fully understand the coverage and main events surrounding a controversial topic.


References

Edward Schiappa, Peter B. Gregg, and Dean E. Hewes. 2005. The parasocial contact hypothesis. Communication Monographs 72(1):92–115. https://doi.org/10.1080/0363775052000342544.

Jure Leskovec, Lars Backstrom, and Jon Kleinberg. 2009. Meme-tracking and the dynamics of the news cycle. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA, KDD ’09, pages 497–506. https://doi.org/10.1145/1557019.1557077.

Christophe Leys, Christophe Ley, Olivier Klein, Philippe Bernard, and Laurent Licata. 2013. Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. Journal of Experimental Social Psychology 49(4):764–766. https://doi.org/10.1016/j.jesp.2013.03.013.