Data Exploration for Hate Speech Detection

About Me

Greetings, everyone! I’m Mohit Mishra, an aspiring data scientist who loves learning new algorithms and is always ready to work on machine learning and data analysis projects. I strongly believe that words and data are two of the most powerful tools for understanding the world. With this post, I am starting a series of 7–8 posts on hate speech detection, covering data exploration, cleaning, analysis, and modeling, with plenty of new things to learn and discuss with all of you. This post is dedicated to data exploration alone. I’ve decided to give this learning series a try! Please enjoy!

The code used during the tutorial can be found on my Github and Notebook Link

Introduction

Nowadays, very strange things are happening on the internet day by day. People are using a kind of language that we call hate speech: they attack the gender, color, and ethnicity of other people, which sometimes leads to real-life consequences. The term “hate speech” shall be understood as covering all forms of expression that spread, incite, promote, or justify racial hatred, xenophobia, anti-Semitism, or other forms of hatred based on intolerance, including intolerance expressed by aggressive nationalism and ethnocentrism, and discrimination.

Content

Import Libraries
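For reference, a minimal set of imports that the rest of this exploration relies on might look like this (the exact list in the notebook may differ slightly):

```python
# Core data handling and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Word cloud visualization (install with: pip install wordcloud)
from wordcloud import WordCloud
```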

Data Exploration

Loading and displaying Datasets
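A minimal loading sketch; the file names train.csv and test.csv are assumptions and may differ from the paths used in the notebook:

```python
# Load the train and test splits (file names are assumed)
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Display the first few rows of each DataFrame
print(train.head())
print(test.head())
```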

Now let’s check the Label Encodings
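Assuming the label column is simply called label, a quick check of the distinct values looks like this:

```python
# Distinct values in the label column
print(train["label"].unique())   # expected: [0 1]
```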

Train vs Test Dataset

A. Train Dataset
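A quick structural summary (a sketch, assuming the training DataFrame is named train):

```python
# Column names, dtypes, non-null counts, and memory usage
train.info()

# Explicit per-column null check
print(train.isnull().sum())
```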

Info:

  • The training dataset has three columns — “id”, “label”, “tweet”
  • Column “id” is not useful for classification.
  • Column “tweet” contains the tweets
  • Column “label” contains their category.
  • We have approximately 31K data points, and there are no null values present in the training data.

B. Test Dataset

Now let’s describe the test dataset
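A sketch of the equivalent checks for the test data:

```python
# Structure and summary statistics of the test data
test.info()
print(test.describe(include="all"))
```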

Train and Test Datasets: Tweet Length

Now we will check the distribution of tweet lengths in both the train and test data.
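One way to plot this, assuming the tweet text lives in a column named tweet:

```python
# Character length of every tweet in the train and test sets
train_len = train["tweet"].str.len()
test_len = test["tweet"].str.len()

# Overlay the two length distributions
plt.hist(train_len, bins=20, alpha=0.6, label="train tweets")
plt.hist(test_len, bins=20, alpha=0.6, label="test tweets")
plt.xlabel("Tweet length (characters)")
plt.ylabel("Frequency")
plt.legend()
plt.show()
```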

The histogram above shows the distribution of tweet lengths across the dataset. Most tweets are between 70 and 120 characters long.

Positive vs Negative Tweets

A. Check out a few non-racist/sexist (negative) tweets.

B. Check out a few racist/sexist (positive) tweets.
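A sketch of both checks, using the 0/1 encoding described in the label distribution section below (0 = non-racist/sexist, 1 = racist/sexist):

```python
# A few non-racist/sexist (negative, label 0) tweets
print(train[train["label"] == 0]["tweet"].head(10))

# A few racist/sexist (positive, label 1) tweets
print(train[train["label"] == 1]["tweet"].head(10))
```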

Label Distribution

  • It is very important to know whether the data is balanced or not.
  • For dealing with imbalanced data, we can use SMOTE, upsampling, or downsampling (a small upsampling sketch follows this list).
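As a minimal illustration only (we will not resample anything during exploration), simple random upsampling of the minority class could look like this:

```python
from sklearn.utils import resample

majority = train[train["label"] == 0]
minority = train[train["label"] == 1]

# Randomly duplicate minority rows until both classes have the same size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
train_balanced = pd.concat([majority, minority_upsampled])
print(train_balanced["label"].value_counts())
```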

Let’s have a glimpse at label distribution in the training dataset.
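A one-line count plot is enough for a first look (a sketch using seaborn):

```python
# Bar plot of the number of tweets per label
sns.countplot(x="label", data=train)
plt.title("Label distribution in the training data")
plt.show()
```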

Now let’s calculate counts and percentages
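A sketch of both, using pandas value_counts:

```python
# Absolute counts per label
print(train["label"].value_counts())

# Percentage of tweets per label
print(train["label"].value_counts(normalize=True) * 100)
```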

Around 93% of the data in this dataset is non-hate speech.

From the plot above, we can see that most of the data is labeled 0 (non-offensive tweets), while label 1 denotes offensive tweets.

Unbalanced Data

  • In the training dataset, we have 2,242 (7%) tweets labeled as racist or sexist and 29,720 (93%) tweets labeled as non-racist/sexist. So, it is an imbalanced classification challenge.
  • Due to this class imbalance, accuracy may not be a good option for checking the performance of our models. Instead, a confusion matrix or the F1-score is a better option (see the sketch after this list).
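For later reference, both metrics are available in scikit-learn; a tiny sketch with placeholder predictions (y_true and y_pred are made-up values, not model output):

```python
from sklearn.metrics import confusion_matrix, f1_score

y_true = [0, 0, 1, 1, 0]   # placeholder ground-truth labels
y_pred = [0, 1, 1, 0, 0]   # placeholder model predictions

print(confusion_matrix(y_true, y_pred))
print(f1_score(y_true, y_pred))
```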

Tweet length analysis (character length)

Knowing the maximum tweet length in the dataset will be helpful later. Let’s extract the length (in characters) of each tweet in the training data.
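A sketch that stores the length as a new column so it can be reused in the plots below:

```python
# Character length of each tweet, stored as a new column
train["length"] = train["tweet"].str.len()
print(train["length"].max())   # longest tweet in the training data
```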

Average Tweet Length vs Label

Label Counts

Average hate tweets vs normal tweets
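A sketch covering the label counts, the average length per label, and the length comparison between hate and normal tweets (assuming the length column added above):

```python
# Average tweet length per label
print(train.groupby("label")["length"].mean())

# Number of tweets per label, for reference
print(train["label"].value_counts())

# Compare the length distributions of normal vs. hate tweets
train[train["label"] == 0]["length"].plot(kind="hist", bins=20, alpha=0.6, label="normal (0)")
train[train["label"] == 1]["length"].plot(kind="hist", bins=20, alpha=0.6, label="hate (1)")
plt.xlabel("Tweet length (characters)")
plt.legend()
plt.show()
```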

The length distributions of the two classes look almost the same; 90 to 120 characters per tweet is the most common range in both.

Tweet lengths

Let's split the positive and negative tweets into two different lists.

Split negative and positive tweets

Now let’s check the exact length distribution of the positive tweets.
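A sketch of the split and of the exact length counts for the positive class:

```python
# Character lengths for each class, as plain lists
neg_lengths = train[train["label"] == 0]["tweet"].str.len().tolist()
pos_lengths = train[train["label"] == 1]["tweet"].str.len().tolist()

# Exact count of tweets for every positive tweet length
print(pd.Series(pos_lengths).value_counts().sort_index())
```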

Box Plot
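A sketch of both an overall box plot and a per-class box plot, again assuming the length column created earlier:

```python
# Overall box plot of tweet lengths
sns.boxplot(y=train["length"])
plt.ylabel("Number of characters")
plt.show()

# Box plot of tweet length per class
sns.boxplot(x="label", y="length", data=train)
plt.xlabel("Label (0 = non-racist/sexist, 1 = racist/sexist)")
plt.ylabel("Number of characters")
plt.show()
```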

  • The x-axis represents the tweets, while the y-axis represents the number of characters.
  • The number of characters between the 25th and 75th percentiles (the interquartile range) runs from 60 to 110, with a median of 80 characters. The maximum length of any tweet in the dataset is less than 275 characters.

Box plot analysis

  1. Negative class: There are a few outliers, and the maximum tweet length goes beyond 270 characters.
  2. Positive class: There are no outliers, and the maximum tweet length is approximately 150 characters.

WordCloud

A wordcloud is a visualization wherein the most frequent words appear in large sizes and the less frequent words appear in smaller sizes.

We can also think of questions related to the data at hand. A few probable questions are as follows:

  • What are the most common words in the entire dataset?
  • What are the most common words in the dataset for negative and positive tweets, respectively?
  • How many hashtags are there in a tweet?
  • Which trends are associated with my dataset?
  • Which trends are associated with either of the sentiments? — Are they compatible with the sentiments?

Let’s visualize all the words in our data using the wordcloud plot

1) Word cloud of all tweets
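A sketch of how such a word cloud can be generated with the wordcloud package:

```python
# Build one long string from all tweets and render a word cloud
all_words = " ".join(train["tweet"].astype(str))

wordcloud = WordCloud(width=800, height=400, background_color="white").generate(all_words)

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```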

With the help of the word cloud, we can easily spot the most repeated words in the whole dataset. From the image above, we can conclude that ‘user’ is the most frequent word in our data. We also see some unknown symbols, so before proceeding to modeling we should first clean the data.

2) Word cloud of positive tweets

3) Word cloud of negative tweets
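A combined sketch for both class-specific clouds above, keeping the same convention (1 = positive/racist-sexist, 0 = negative/non-racist-sexist):

```python
# Reuse the same approach for each class
for label, name in [(1, "positive (racist/sexist)"), (0, "negative (non-racist/sexist)")]:
    words = " ".join(train[train["label"] == label]["tweet"].astype(str))
    wc = WordCloud(width=800, height=400, background_color="white").generate(words)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Word cloud of {name} tweets")
    plt.show()
```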

As we can clearly see, most of the words have negative connotations. So, it seems we have pretty good text data to work on.

Conclusion

Thanks for reading this blog. To get notified about upcoming blogs on Machine Learning, Deep Learning, and DevOps, you can connect with me on LinkedIn and follow me on Medium. If you liked this content, please give it some applause and subscribe to the newsletter for notifications.

The code used during the tutorial can be found on my Github and Notebook Link.
