Automating Online Hate Speech Detection

A survey of deep learning approaches to the task of identifying hateful content

September 2, 2019 | 📍 Edinburgh, UK

Full dissertation
In March 2019, a white supremacist posted racist manifestos, which spread through Twitter, and live-streamed the shooting of more than 50 people on Facebook and YouTube. While both companies use machine learning algorithms to automatically detect and remove such content, they failed to take it down quickly enough. Facebook’s head of public policy defended the platform’s slow response time:
“The video was a new type that our machine learning system hadn’t seen before. It was a first person shooter with a GoPro on his head… This is unfortunately an adversarial space. Those sharing the video were deliberately splicing and cutting it and using filters to subvert automation. There is still progress to be made with machine learning.” [1]
Progress needs to be made sooner rather than later. Hateful content on social media contributes to real-world violence [2], serves as recruitment material and propaganda for terrorist individuals and groups [3], makes other users feel less safe and secure on social platforms [4], and triggers increased levels of toxicity in the network [5].
Governments are demanding workable solutions, fast. The Sri Lankan government temporarily banned social media networks three separate times in the wake of the Easter 2019 suicide bombings that killed hundreds of people, in order to prevent “social unrest via hate messages” [6]. French President Emmanuel Macron has made fighting online hate speech a priority, meeting with Mark Zuckerberg and other high-level Facebook representatives; as a result, Facebook has agreed to hand over data on French users suspected of hate speech online. This is an international first and, according to a counsel at law firm Linklaters, “a strong signal in terms of regulation” [8]. Germany and the UK have strict legislation against hate speech [7]. EU countries have threatened to fine social networks up to 50 million euros per year if they continue to fail to act fast enough [9].
Social media has contributed to a more open and connected world [10]. It has promoted Western liberal values through its effects on protest mobilization [11], community building [12], and accountability in governments and institutions [13]. It is critical that we do not lose the benefits of these platforms as they operate under increased regulation and scrutiny. This research surveys the capabilities and limitations of state-of-the-art (SOTA) deep learning classifiers. We aim to inform policy and decision makers, who must reconcile the benefits of social media platforms with the harms they threaten when hate speech is allowed to propagate.
• • •
The experiments consist of three phases that demonstrate the effect of user behavior metrics on a combination of models and embedding types. Each of our 4 model choices (CNN, LSTM, MLP, and DenseNet) is run with each of our 3 embedding types, for 4 rounds of experiments, with 3 different seeds, for a total of 144+ experiments before tuning.
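The experiment count follows directly from the size of the grid. The sketch below enumerates it; the embedding labels, round numbers, and seed values are placeholders chosen for illustration, not the identifiers used in the dissertation.

```python
from itertools import product

MODELS = ["CNN", "LSTM", "MLP", "DenseNet"]
EMBEDDINGS = ["tfidf", "twitter_word2vec", "bert"]  # placeholder labels
ROUNDS = [1, 2, 3, 4]                               # the four rounds of experiments
SEEDS = [0, 1, 2]                                   # placeholder seed values

# One configuration per (model, embedding, round, seed) combination.
runs = [
    {"model": m, "embedding": e, "round": r, "seed": s}
    for m, e, r, s in product(MODELS, EMBEDDINGS, ROUNDS, SEEDS)
]

print(len(runs))  # 4 * 3 * 4 * 3 = 144
```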
The dataset for this research comes from Founta et al., who describe large-scale crowdsourcing of annotations for hateful, normal, abusive, and spam tweets [14].
We clean the tweets by tokenizing, lowercasing, and removing punctuation.
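A minimal sketch of such a cleaning step, using only the Python standard library. The function name `clean_tweet` is hypothetical, and whitespace splitting stands in for whatever tokenizer was actually used. Note that this version treats `#` and `@` as punctuation and only removes ASCII punctuation, which may differ from the original pipeline.

```python
import string
from typing import List

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_tweet(text: str) -> List[str]:
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    return text.lower().translate(_PUNCT_TABLE).split()

print(clean_tweet("Hello, World! #NLP"))  # ['hello', 'world', 'nlp']
```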
We experiment with three types of text embeddings:
TF-IDF embeddings: TF-IDF represents each tweet as a vector of normalized frequency weights over the 10,000 most common words in our training vocabulary. The weights are learned on the training set and then used to transform the validation and test sets.
Pretrained Twitter embeddings: These embeddings are from a Word2Vec model trained on 400 million raw English tweets, with an embedding dimension of 400 [15].
Pretrained BERT embeddings: Google’s BERT, or Bidirectional Encoder Representations from Transformers, is a method of pre-training language representations that obtains SOTA results on a range of NLP tasks described in [16].
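To make the fit-on-train, transform-on-validation idea behind the TF-IDF embeddings concrete, here is a from-scratch sketch. It is not the tooling used in the dissertation: the function names are hypothetical, and the smoothed IDF formula and L2 normalization are one common variant, assumed here for illustration.

```python
import math
from collections import Counter
from typing import Dict, List

def fit_tfidf(train_docs: List[List[str]], max_features: int = 10_000) -> Dict[str, float]:
    """Learn IDF weights for the most frequent terms in the training set."""
    term_counts = Counter(tok for doc in train_docs for tok in doc)
    vocab = [term for term, _ in term_counts.most_common(max_features)]
    doc_freq = Counter(tok for doc in train_docs for tok in set(doc))
    n_docs = len(train_docs)
    # Smoothed IDF (an assumed variant): rarer terms get larger weights.
    return {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}

def transform_tfidf(doc: List[str], idf: Dict[str, float]) -> List[float]:
    """Map one tokenized document to an L2-normalized TF-IDF vector."""
    counts = Counter(tok for tok in doc if tok in idf)  # unseen words are dropped
    raw = [counts[t] * w for t, w in idf.items()]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw] if norm > 0 else raw

# Weights come from the training set only, then are reused on held-out text.
train = [["hate", "speech", "online"], ["speech", "detection"]]
idf = fit_tfidf(train)
validation_vector = transform_tfidf(["speech", "unseen"], idf)
```

Keeping the learned weights fixed when transforming validation and test data avoids leaking held-out vocabulary statistics into training.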
• • •
Phase 1: We compare our deep learning models to a baseline logistic regression model to estimate how much of an improvement they can offer. Here the input features to our deep learning models are simply the tweet embeddings.
Phase 2: The goal of this round of experiments is to add context to our embeddings. Here and for the remaining experiments, we shift to a neural network with multiple inputs so that the network learns from the annotated tweet in addition to other types of embeddings. First, motivated by the statistics collected on the retweeting behavior of hateful users, we define pairs of tweets: the original (annotated) tweet and a context tweet. The context tweet covers the case where the annotated tweet was a response to something else. The annotated tweet embedding and the context tweet embedding are separate inputs processed by different parts of the model architecture; the learned features for each are concatenated and fed into a final fully connected layer. Next, because tweet in-degree has been shown to be significant [17], we focus on in-degree in terms of the number of times a given tweet has been retweeted and the number of times it has been favorited. We concatenate the logged retweet and favorite counts to our tweet and reply embeddings to see whether the network can learn from these counts as a measure of context.
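The retweet and favorite features can be sketched as follows, reading "logged" as log-transformed, which is an assumption. The function name is hypothetical, and `log1p` is chosen here so that tweets with zero retweets or favorites stay finite; the dissertation may handle zeros differently.

```python
import math
from typing import List, Sequence

def add_engagement_features(
    embedding: Sequence[float], retweets: int, favorites: int
) -> List[float]:
    """Append log-scaled retweet and favorite counts to an embedding vector."""
    # log1p(x) = log(1 + x), so a count of zero maps to 0.0 rather than -inf.
    return list(embedding) + [math.log1p(retweets), math.log1p(favorites)]

features = add_engagement_features([0.1, 0.2], retweets=0, favorites=9)
print(len(features))  # 4
```

The log scale compresses heavy-tailed engagement counts so that a handful of viral tweets does not dominate the input range.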
Multiple Input CNN-BERT Model Architecture
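The data flow of a multiple-input network (separate branches per input, concatenation, then a final fully connected layer) can be illustrated with a toy forward pass. This is a sketch under strong simplifications: plain dense layers stand in for the CNN branches, weights are random and untrained, and all names and layer sizes are hypothetical. The four output classes correspond to the dataset labels (hateful, normal, abusive, spam).

```python
import math
import random
from typing import List

Vector = List[float]

def init_layer(n_in: int, n_out: int, rng: random.Random) -> List[Vector]:
    """Random weights with a trailing bias entry per row (illustrative only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_out)]

def dense(x: Vector, layer: List[Vector]) -> Vector:
    """Affine map: each row holds len(x) weights followed by one bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + row[-1] for row in layer]

def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]

def softmax(x: Vector) -> Vector:
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

class TwoInputClassifier:
    """Two branches whose learned features are concatenated before the head."""

    def __init__(self, tweet_dim: int, context_dim: int,
                 hidden: int = 8, n_classes: int = 4, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.tweet_branch = init_layer(tweet_dim, hidden, rng)
        self.context_branch = init_layer(context_dim, hidden, rng)
        self.head = init_layer(2 * hidden, n_classes, rng)

    def forward(self, tweet: Vector, context: Vector) -> Vector:
        t = relu(dense(tweet, self.tweet_branch))      # annotated-tweet features
        c = relu(dense(context, self.context_branch))  # context-tweet features
        return softmax(dense(t + c, self.head))        # list concatenation, then head

model = TwoInputClassifier(tweet_dim=5, context_dim=3)
probs = model.forward([0.1] * 5, [0.2] * 3)
```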
Phase 3: This phase aims to add context in a more sophisticated way. For each tweet in our dataset, we crawl the author’s user timeline and collect 200 tweets. We then perform topic modeling with latent Dirichlet allocation (LDA). We choose LDA over other topic modeling techniques because it represents documents (here, tweets) as random mixtures over topics in the corpus, which reflects what we expect from tweets on a user’s timeline [18]. We also concatenate the coherence and perplexity scores of the user’s timeline to each embedded topic word as a global measure of the topic model.
Hyperparameter tuning: We tune the best-performing model for each model type and embedding type, experimenting with learning rate, regularization, number of layers, and a model-specific parameter.
Hyperparameter Tuning, Facilitated by Comet.ML
• • •
Our final model, phase 2 CNN-BERT, successfully picked up on negative tweet sentiment and identified the abusive class at the highest rate, with 82% accuracy and an F-score of 0.78. The model offers a significant improvement in detecting hate speech: on a dataset with scarce hateful labels, it improves on our logistic regression baseline on the hateful class by 0.13 F-score. If we were to randomly annotate a tweet as hateful with 4% probability, we’d achieve around 4% accuracy on the hateful class. Thus, we interpret the final F-score of 0.33 on the hateful class as relatively high.
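The random-annotation comparison can be made explicit with a short calculation. The sketch assumes that roughly 4% of tweets are truly hateful (an assumption implied by, but not stated in, the text) and that the random annotator flags tweets independently of their content.

```python
from typing import Tuple

def random_baseline_scores(prevalence: float, predict_rate: float) -> Tuple[float, float, float]:
    """Expected precision, recall, and F1 for labels assigned at random."""
    precision = prevalence   # a randomly flagged tweet is hateful with prob. = prevalence
    recall = predict_rate    # each hateful tweet is flagged with prob. = predict_rate
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = random_baseline_scores(prevalence=0.04, predict_rate=0.04)
print(round(f1, 4))  # 0.04
```

Under these assumptions the random annotator scores an F1 of about 0.04 on the hateful class, roughly an eighth of the reported 0.33.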
The models that used Google’s pretrained BERT embeddings performed better than those using TF-IDF and pretrained Twitter embeddings across most models in our three phases of experiments. CNN-BERT outperformed the logistic regression, MLP, LSTM, and DenseNet models in all three phases. Before tuning, our best-performing model is the CNN multiple-input architecture with tweet and user topic BERT embeddings. After tuning, our best-performing model is the CNN multiple-input architecture with tweet and reply BERT embeddings. We hypothesize that this is because the tuned parameter choices, a single-layer, 47-filter CNN with high dropout, overfit when combined with BERT and a large measure of user context. The pretrained BERT embeddings add enough semantic information to give us our most competitive models, and adding further context metrics from the social network does not improve performance.
The task of automating the detection of hate speech on social media platforms remains a challenge, in part due to the difficulty in obtaining high-quality, large-scale annotated datasets and the scarce hateful samples available for machine learning models to learn from. Our experiments reflect this and suggest that improving the quality and consistency of annotations in our dataset is likely to result in more accurate automated systems.


  1. J. Wakefield, “Hate speech: Facebook, Twitter and YouTube told off by MPs,” Apr 2019.
  2. K. Müller and C. Schwarz, “Fanning the flames of hate: Social media and hate crime,” Available at SSRN 3082972, 2018.
  3. I. Awan, “Cyber-extremism: ISIS and the power of social media,” Society, vol. 54, no. 2, pp. 138–149, 2017.
  4. M. ElSherief, S. Nilizadeh, D. Nguyen, G. Vigna, and E. Belding, “Peer to peer hate: Hate speech instigators and their targets,” in Twelfth International AAAI Conference on Web and Social Media, 2018.
  5. J. Cheng, M. Bernstein, C. Danescu-Niculescu-Mizil, and J. Leskovec, “Anyone can become a troll: Causes of trolling behavior in online discussions,” in Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing. ACM, 2017, pp. 1217–1230.
  6. T. Marcin, “Facebook, YouTube, WhatsApp banned again in Sri Lanka after violence against Muslims,” May 2019.
  7. E. Stein, “History against free speech: The new German law against the ‘Auschwitz’ - and other - ‘lies’,” Michigan Law Review, vol. 85, no. 2, pp. 277–324, 1986.
  8. M. Rosemain, “Exclusive: In a world first, Facebook to give data on hate speech…” Jun 2019.
  9. E. Thomasson, “German cabinet agrees to fine social media over hate speech,” Apr 2017.
  10. H. Rainie, J. Q. Anderson, and J. Albright, The future of free speech, trolls, anonymity and fake news online. Pew Research Center Washington, DC, 2017.
  11. A. Breuer, T. Landman, and D. Farquhar, “Social media and protest mobilization: Evidence from the Tunisian revolution,” Democratization, vol. 22, no. 4, pp. 764–792, 2015.
  12. S. J. Jackson, M. Bailey, and B. Foucault Welles, “#GirlsLikeUs: Trans advocacy and community building online,” New Media & Society, vol. 20, no. 5, pp. 1868–1888, 2018.
  13. R. Enikolopov, M. Petrova, and K. Sonin, “Social media and corruption,” American Economic Journal: Applied Economics, vol. 10, no. 1, pp. 150–74, 2018.
  14. A. M. Founta, C. Djouvas, D. Chatzakou, I. Leontiadis, J. Blackburn, G. Stringhini, A. Vakali, M. Sirivianos, and N. Kourtellis, “Large scale crowdsourcing and characterization of Twitter abusive behavior,” in Twelfth International AAAI Conference on Web and Social Media, 2018.
  15. F. Godin, B. Vandersmissen, W. De Neve, and R. Van de Walle, “Multimedia Lab @ ACL W-NUT NER shared task: Named entity recognition for Twitter microposts using distributed word representations,” in Proceedings of the Workshop on Noisy User-generated Text, 2015, pp. 146–153.
  16. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  17. R. Nishi, T. Takaguchi, K. Oka, T. Maehara, M. Toyoda, K.-i. Kawarabayashi, and N. Masuda, “Reply trees in Twitter: Data analysis and branching process models,” Social Network Analysis and Mining, vol. 6, no. 1, p. 26, 2016.
  18. D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, no. Jan, pp. 993–1022, 2003.