Approaches to Classifying Online Text as "Hate Speech"

June 16, 2022

By Jake Murel


Media Cloud’s International Hate Observatory is working to understand extreme speech on YouTube. It is a large project which we are approaching from several angles, employing multiple methods. We would like to be able to build a dataset of YouTube videos, produce transcripts from those videos, and get a sense for what, if any, harmful content they may contain. An early challenge, of course, is how to define, find, and classify extreme speech. To that end, we are working to develop machine learning classifiers using natural language processing. It’s a computational approach to understanding language use which is increasingly popular, but still evolving and complex. Before we can begin developing classifiers, we need a clearer picture of existing literature and methodological best practices. This blog post offers a quick summary of some of the dominant computational approaches we see emerging from various academic fields.

As we’ve seen, a lot of extreme speech proliferates across social media and networking sites such as Facebook and Twitter, which have begun to adopt stricter stances on enforcing previously professed community guidelines against hate speech. Their early approach of having employees or users manually tag hate speech has proven unfeasible given that 1) user flagging marks only a small percentage of hate speech on sites (Salminen et al.) and 2) manual moderation of hate speech in real time is laborious and resource-intensive. To address this issue, a number of machine learning models for classifying hate speech online have been developed.

It is a truism in scholarship to remark that the lack of any agreed-upon definition of hate speech is a key critical issue in its detection online (for comparisons of hate speech definitions in scholarship, see Fortuna and Nunes, 2-5; MacAvaney et al., 2-3; Yin and Zubiaga, 17). Moreover, regarding the development of hate speech detection models, Yin and Zubiaga remark that variance in sampling methods, definitions, and annotation schemes inhibits any sort of extrapolation from specific models. Although Schmidt and Wiegand wrote a few years ago that “[t]here exist no comparative studies which would allow making judgment on the most effective learning method,” more recent studies have sought to compare algorithms and features in detecting hate speech; I discuss and cite these studies below.


The most straightforward (and technically simple) approach for hate speech detection is keyword searching. Some of keyword searching’s most severe disadvantages, however, are polysemy (one word with multiple meanings) and synonymy (multiple words that share one meaning) (MacAvaney et al.; Sahlgren et al.; Saleem et al.). Both of these factors can result in an abundance of false positives and false negatives. Moreover, keyword searching is inhibited by languages’ inherent propensity for developing over time, as well as social media users adapting terminology to escape detection (Nobata et al.). Keyword searches are thus inhibited by what Salminen et al. call the “linguistic diversity of hate.” Bag of words analysis (word frequencies extracted from text) has been suggested as a means of (partially) overcoming these issues, yet still suffers from ignoring hate speech’s semantic/syntactic context (Fortuna and Nunes; Sahlgren et al.). A number of approaches have been developed as potential means of addressing the problem of word context.
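
To make the false positive and false negative problem concrete, here is a minimal sketch of keyword searching. The one-word lexicon and the two example sentences are invented for illustration and do not come from any of the studies cited here; the function names `tokenize` and `keyword_flag` are likewise hypothetical.

```python
import re

# Hypothetical one-word lexicon, chosen only for illustration.
KEYWORDS = {"vermin"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def keyword_flag(text, keywords=KEYWORDS):
    """Return True if any token in the text appears in the keyword list."""
    return any(token in keywords for token in tokenize(text))

examples = [
    # Literal, non-hateful sense of the keyword: flagged anyway (false positive).
    "The farmer set traps for vermin in the barn.",
    # Dehumanizing language that uses a different word: missed (false negative).
    "Those people are parasites.",
]

for text in examples:
    print(keyword_flag(text), "-", text)
```

The first sentence is flagged even though the word is used in its literal sense, and the second slips through because its wording is not in the list. Adding more keywords narrows the second gap but tends to widen the first.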

N-grams (contiguous sequences of words in text) are widely considered effective for detecting hate speech, and scholarship further points to n-grams’ increased accuracy when combined with additional methods, notably word vectors (Fortuna and Nunes; Sahlgren et al.; Salminen et al.). Admittedly, this is unsurprising, as later studies (discussed and cited below) demonstrate that combining multiple features is more effective than utilizing one feature alone. Yin and Zubiaga do point to a few studies in which n-grams alone outperform neural networks, although the authors question whether n-grams may overfit training data in such studies. Despite their possible efficacy, however, n-grams have been shown to suffer when related words sit far apart in a sentence, and there appears to be no adequate solution to this issue (Fortuna and Nunes).
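
As a rough illustration of the difference between bag of words and n-gram features, the sketch below counts both for a single invented sentence. The helper name `ngrams` and the sample text are assumptions made for this example, not part of any cited study.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "they are not welcome here".split()

# n = 1 is a bag of words: plain word counts, with word order lost.
bag_of_words = Counter(ngrams(tokens, 1))

# n = 2 keeps local order by counting adjacent word pairs.
bigrams = Counter(ngrams(tokens, 2))

print(bag_of_words)
print(bigrams)
```

The bigram ("not", "welcome") preserves a negation that the bag of words loses. On the other hand, "they" and "welcome" are separated by two words, so no bigram pairs them, which is a small instance of the distance problem noted above.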

Word embeddings (high dimensional vectors representing word meanings and use) are thought to have limited efficacy given that hate speech often occurs beyond the word level. To address this, some researchers have developed paragraph or comment embeddings, which are shown to be more effective than word embeddings in classifying hate speech (Fortuna and Nunes; Schmidt and Wiegand). These, as well as n-grams, struggle to detect purposefully coded words (e.g. w0m3n) (Bodapati and Gella). Normalizing datasets can fix this issue for a controlled environment, but normalization would not assist in monitoring hate speech across websites in real time and would require accounting for all possible variations of a given word (e.g. women, w0m3n, wmn, w-men, wm*n, wom3n, etc.).
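
The limits of normalization are easy to see in a small sketch. The substitution table below is a hypothetical and deliberately short list of character swaps; it is an assumption for illustration, not a method taken from the studies cited above.

```python
# Hypothetical character substitutions; a real list would be much longer.
LEET_MAP = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"}
)

def normalize(token):
    """Lowercase a token and undo the substitutions listed in LEET_MAP."""
    return token.lower().translate(LEET_MAP)

variants = ["women", "w0m3n", "wom3n", "wmn", "w-men", "wm*n"]
normalized = {v: normalize(v) for v in variants}

for original, cleaned in normalized.items():
    print(f"{original!r} -> {cleaned!r}")

# Only the digit-for-letter variants are recovered; dropped letters,
# hyphens, and wildcard characters fall outside this table.
recovered = [v for v, cleaned in normalized.items() if cleaned == "women"]
print(recovered)
```

Each new obfuscation style needs its own rule, which is why a fixed normalization table tends to lag behind how people actually write.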

Similar to embeddings, transformers (a specific type of neural network) like BERT address the context of a given word within a dataset, but in a more nuanced fashion. To put it simply, word embeddings map each word from a dataset onto a vector space, with words sharing a similar textual context appearing nearer to one another in that vector space. While this approach thus addresses each word’s context (unlike keyword searches), word embeddings still suffer from an inability to address polysemy. They do not take into account semantic differences in alternate uses of right, for example, mapping all instances of this word (despite its multiple denotations) onto one vector. While there have been attempts to address this issue via multi-sense embeddings, transformers like BERT offer another route by producing contextualized embeddings for each occurrence of a word according to its position within any given sentence. Of course, I am glossing over the theory and mechanics of how transformers work. My main point in this greatly oversimplified summary, however, is that transformers like BERT are shown to be more accurate and precise in detecting hate speech in part because they offer a means of addressing perennial issues of polysemy and synonymy.
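
The contrast between one vector per word and one vector per occurrence can be shown with a toy example. To be clear about the assumptions: the two-dimensional vectors are made up, and `toy_contextual_embed` simply averages neighbouring vectors. It is a stand-in to show the idea of context-dependent vectors and is not how BERT computes its embeddings.

```python
# Made-up 2-dimensional vectors; real embeddings have hundreds of dimensions.
STATIC = {
    "turn": (1.0, 0.0),
    "human": (0.0, 1.0),
    "right": (0.5, 0.5),
}

def static_embed(tokens):
    """Look up one fixed vector per word, whatever the sentence."""
    return [STATIC[t] for t in tokens]

def toy_contextual_embed(tokens, window=1):
    """Average each word's vector with its neighbours' vectors.

    This is only a stand-in for contextualization, not how BERT works.
    """
    vectors = static_embed(tokens)
    out = []
    for i in range(len(vectors)):
        nearby = vectors[max(0, i - window): i + window + 1]
        out.append(tuple(sum(dim) / len(nearby) for dim in zip(*nearby)))
    return out

direction = ["turn", "right"]
entitlement = ["human", "right"]

# Static lookup: "right" gets the same vector in both phrases.
print(static_embed(direction)[1], static_embed(entitlement)[1])

# Toy contextual version: the vector for "right" now depends on its neighbour.
print(toy_contextual_embed(direction)[1], toy_contextual_embed(entitlement)[1])
```

With the static table, a classifier sees identical input for both senses of "right". Once the vector depends on surrounding words, the two occurrences separate, which is the property that contextualized models provide in a far more sophisticated way.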

Numerous studies conclude that BERT performs with significantly better accuracy and precision when compared to other classification models (Bodapati and Gella; Swamy et al.). Salminen et al. note that the initial training for BERT models is “computationally expensive,” yet applying pretrained BERT models to downstream tasks is straightforward and markedly less expensive. Yin and Zubiaga also note that pretraining BERT models is “computationally-heavy,” which limits the feasibility of training BERT models from scratch. Although BERT (much like the above features) also suffers from decreased performance in cross-dataset evaluation, that drop in performance was significantly smaller than for other features (Yin and Zubiaga; see also Fortuna et al.). Moreover, beyond the above-cited study by Salminen et al., other papers have cross-examined BERT against other models, concluding that models trained with BERT consistently outperform others in terms of accuracy and precision (e.g. Mozafari et al.).

Researchers have also been comparing approaches to try to identify the best-performing technologies. Saleem et al. cross-examine what they purport to be the most commonly used machine learning algorithms for detecting hate speech, namely naive Bayes, support vector machines (SVM), and logistic regression, concluding that “the three classifiers perform almost identically.” MacAvaney et al. test a multi-view SVM against BERT on multiple datasets, concluding that the two detect hate speech with roughly comparable accuracy, each outperforming the other depending on the dataset. Salminen et al. compare the above three algorithms alongside XGBoost and feed-forward neural networks (FFNN) in conjunction with four machine learning features: BoW, TF-IDF, Word2Vec, and BERT. XGBoost and FFNN consistently outperformed the other three algorithms, with the highest F1 score (.994) resulting from combining XGBoost and BERT. This further points to BERT’s efficacy, substantiating its apparent status as possibly one of the foremost classification models in terms of accuracy and precision.
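
Since these comparisons rest on scores such as precision and F1, a short sketch of how those numbers are computed may help. The labels and the two sets of predictions below are invented, and "model A" and "model B" are hypothetical; nothing here reproduces results from the studies above.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # hateful, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # hateful, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Invented labels: 1 = hateful, 0 = not hateful.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
model_a = [1, 1, 0, 0, 0, 0, 0, 1]  # misses one hateful post, flags one benign post
model_b = [1, 1, 1, 1, 1, 0, 0, 0]  # catches every hateful post, flags two benign posts

for name, preds in [("model A", model_a), ("model B", model_b)]:
    p, r, f1 = precision_recall_f1(y_true, preds)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

F1 is the harmonic mean of precision and recall, so a model cannot score well by flagging everything or by flagging almost nothing. That matters for hate speech data, where hateful posts are usually a small minority and plain accuracy can look high for a model that rarely flags anything.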

Additionally, some suggest that the greatest precision/accuracy may come from combining BERT with convolutional neural networks (Malik et al.; Zhou et al.). In the end, although BERT’s implementation is more computationally intensive than other methods, recent scholarship suggests BERT-based transformers are among the most accurate and precise classifiers to date.


Despite the extent and variety of research on automated hate speech detection, nearly all studies to date have focused on textual data, with Twitter being the most common platform of study. This creates a blind spot for hate speech that circulates online in audiovisual forms, a crucial oversight given the ever-increasing popularity of sites such as YouTube. Working with YouTube videos presents a separate set of challenges, such as evaluating noisy transcriptions and understanding YouTube’s proprietary algorithms.

The purpose of the present post is to provide a cursory cross-examination of research on hate speech detection, summarizing these methods’ respective efficacy and limitations in order to provide a frame for understanding our YouTube classification project. Over the next few months we’ll be iterating on these approaches to see how they perform on various types of text transcribed from YouTube videos. Doing similar work? Please drop us a line and let us know how you’re approaching it.