This article is the third in the series “Hop on your Natural Language Processing Journey” about NLP:
- Chapter 1: Getting started with Natural Language Processing
- Chapter 2: email categorization
- Chapter 3: Create your first simple text classifier
- Chapter 4: Open-Source solutions vs provider
Last week, the text classification solution was implemented in Joe’s department. Little by little, as labeled data was collected and the machine learning algorithm took over from the rule-based approach, everyone could witness the full potential of the new system. Morale is high and there is a smile on everyone’s face… Except Joe’s: he wants more.
Curiosity is his second nature, and he’d like to understand in more detail what’s behind the curtain. You’re the boss Joe, let’s get our hands dirty!
To meet his needs, we proposed a workshop where we will learn to implement a simple text classifier using only open source tools. This is a small introduction, and while not every aspect will be covered in this guide, by the end you should be able to implement your own first elementary sentiment analyzer.
“Sentiment analyzer? Didn’t you talk about a text classifier?”
True, but a sentiment analyzer can be seen as a special case of text classifier. In this case, we will analyze tweets about coronavirus and try to associate one of five sentiments with each of them:
- Extremely negative
- Negative
- Neutral
- Positive
- Extremely positive
We will build the classifier in Python, using open source technologies.
We need four elements:
- The labeled data
- A Python interpreter and a text editor
- An NLP library for preprocessing the data
- A machine learning library
The labeled data: Kaggle
Kaggle is a website that gathers “the world’s largest data science community” (source: themselves). You can find lots of interesting information, tutorials… But what interests us today is the huge number of datasets it provides.
Let’s download this one, which contains the tweets we are looking for and their respective sentiment labels, as well as other information we won’t use.
A Python interpreter and a text editor: Your turn
If you don’t have a Python interpreter and a text editor, I invite you to learn one of the most important aspects of the IT world, and what occupies most of a programmer’s day: Googling!
Jokes aside, as I am sure your time is precious, here are a few simple steps to follow to set everything up:
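The original helper isn’t reproduced here, but on macOS/Linux the setup boils down to something like this (on Windows the activation line differs):

```shell
# check that a Python 3 interpreter and pip are available
python3 --version
python3 -m pip --version

# (optional but recommended) work inside a virtual environment
python3 -m venv nlp-workshop
. nlp-workshop/bin/activate
```

Any text editor works; if in doubt, Google “Python text editor” and pick one you like.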
An NLP library for preprocessing: spaCy
“What is this thing?”
spaCy is a Natural Language Processing library for Python. It can be used with pretrained models for several languages. Those models have been trained on thousands of different texts in order to recognize various text features in unseen texts.
Click here to learn more about spaCy and their models.
Among other features, spaCy is capable of:
- Tokenization: segmenting a text into “tokens” (sentences, words…)
- Lemmatization: Assigning the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”
- Part-of-speech tagging: Assigning word types to tokens, like verb or noun
- Text classification: Assigning categories or labels to a whole document, or parts of a document
Click here to learn more about the features.
“Good, spaCy does text classification, what do we need a machine learning library for?”
Well, you got me! Technically, spaCy does contain modules to train a model which will then make predictions. But a machine learning library will give us more flexibility over our model, such as choosing the algorithms we want, with the specific parameters we need.
Click here to install spacy.
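The installation itself is a single pip command in a terminal:

```shell
pip install spacy
```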
“Why would we need spaCy?”
spaCy will help us preprocess the texts, transforming raw data into a more suitable format for our model. Here are some of the techniques used:
- Tokenization: In order to transform the text into a vector form, we first need to tokenize it, so that the engine knows what the “words” are
- Removing stop words: removing the most common words of a language is convenient, as we don’t want to flood the model with not-so-useful information
- Lemmatization: ideally, we would want “be” and “was” to be recognized as the same word in our vectorization, and that’s where lemmatization comes into play
Click here to download the English model.
You can install the “en_core_web_lg” model.
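The download is done through spaCy’s command-line interface:

```shell
python -m spacy download en_core_web_lg
```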
A machine learning library: scikit-learn
Finally, we need a machine learning library that supports the algorithms we want to implement. For that, we will use scikit-learn which is free and features various machine learning algorithms as well as hundreds of built-in functions to help us train and test models.
In order to install it, open a terminal and use the “pip” package installer for python with the following command:
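The command in question:

```shell
pip install scikit-learn
```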
All the information you need about this library, as well as tutorials and exercises can be found on their official website.
We will talk about three of the most used machine learning algorithms in text classification:
- Naïve Bayes
- Support Vector Machine (SVM)
- Deep Learning
Naïve Bayes
The Naïve Bayes family of statistical algorithms are some of the most used algorithms in text classification and text analysis overall.
One of those is Multinomial Naive Bayes which has the advantage that we can get good results even when the dataset isn’t very large.
Naive Bayes is based on Bayes’ Theorem, which helps us compute the conditional probabilities of the occurrence of two events, based on the probabilities of the occurrence of each individual event. So, we’re calculating the probability of each tag for a given text, and then outputting the tag with the highest probability.
The probability of A, if B is true, is equal to the probability of B, if A is true, times the probability of A being true, divided by the probability of B being true.
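In symbols, that statement is Bayes’ Theorem:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```

For classification, read A as the tag and B as the text: we compare P(tag | text) across all tags and output the one with the highest probability.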
This means that any vector that represents a text will have to contain information about the probabilities of the appearance of certain words within the texts of a given category, so that the algorithm can compute the likelihood of that text’s belonging to the category.
Support Vector Machine
Support Vector Machines (SVM) are another powerful machine learning algorithm for text classification. Like Naive Bayes, SVM doesn’t require lots of training data.
In short, SVM draws a line or “hyperplane” that divides a space into two subspaces. One subspace contains the vectors of texts that belong to a group, and the other contains the vectors of texts that do not belong to that group.
The optimal hyperplane is the one with the largest margin between the two groups. In two dimensions, the hyperplane is simply a line separating the two sets of points.
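To make the idea concrete, here is a tiny sketch using scikit-learn’s linear SVM on two invented clusters of 2-D points (the data is purely illustrative):

```python
from sklearn.svm import SVC

# two well-separated groups of 2-D points (invented for illustration)
X = [[0, 0], [1, 1], [0, 1], [5, 5], [6, 5], [5, 6]]
y = [0, 0, 0, 1, 1, 1]

# a linear kernel means the hyperplane is a straight line in 2-D
clf = SVC(kernel="linear")
clf.fit(X, y)

# points near each cluster fall on the matching side of the hyperplane
print(clf.predict([[0.5, 0.5], [5.5, 5.5]]))
```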
Deep Learning
Deep learning is a set of algorithms and techniques inspired by how the human brain works, built around neural networks. Deep learning architectures offer huge benefits for text classification because they can reach very high accuracy.
The two main deep learning architectures for text classification are Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN).
The drawback of deep learning is the huge amount of training data required for the algorithm to perform well. On the other hand, unlike algorithms such as SVM and Naïve Bayes, deep learning models don’t hit a performance plateau as data grows: DL classifiers will continue to get better the more data we feed them.
Now that we have those concepts in our head, let’s start implementing a classifier. First, create a file with the “.py” extension. Let’s call it “classifier.py”. You can then open it with your favorite text editor.
Tokenizing the Data With spaCy
First, let’s use spaCy to preprocess the text. In order to do that, let’s first import the modules we need:
- string, a standard Python module that provides additional tools to manipulate strings. We will use it to get a list of all the punctuation marks
- spaCy, which we downloaded and installed before
After that, we will create two variables, one is a list of all the punctuation marks and the other contains the stop words defined by spaCy for the English language.
We can load the English model that we downloaded with spaCy and then create a function whose purpose is to tokenize a text. It takes a text as input and outputs a list of tokens that have been lemmatized and lowercased, with stop words and punctuation marks removed.
As an example, let’s try this function on the following input and see what comes out:
We can see that “learning” has been transformed into “learn” through the lemmatization and that the pronouns (my) and stop words (is, am, a, do, to…) have been ignored in the process.
Defining a custom transformer
Next, we will create a custom transformer to further clean our data and remove leading and trailing spaces. We won’t get into too much detail about this class, but it will be used later on.
Now that we know what we’re working with, let’s create a custom tokenizer function using spaCy. We’ll use this function to automatically strip information we don’t need, like stop words and punctuation, from each review.
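The class below is one common way to write such a transformer; it is a sketch following scikit-learn’s TransformerMixin convention, where clean_text simply strips surrounding spaces and lowercases each document:

```python
from sklearn.base import BaseEstimator, TransformerMixin

def clean_text(text):
    # remove leading/trailing spaces and lowercase the text
    return text.strip().lower()

class predictors(BaseEstimator, TransformerMixin):
    """Custom transformer applying clean_text to every document."""

    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        # nothing to learn: cleaning is stateless
        return self
```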
Training and testing the model
Before beginning to implement the training and testing code, we need to import the modules we need.
We already installed sklearn earlier, but what about “pandas”? “pandas” is a library that will help us manipulate data. We received the data as a csv file, which will be read with this library. As it is not a standard library, we first need to install it. How would you do it? (hint: same process as with sklearn, using the tool “pip”).
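One possible set of imports for the steps that follow (assuming pandas was installed with pip install pandas):

```python
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
```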
Once the imports are done, we can get started. In the zip archive from Kaggle, there are two csv files:
Those two files are the training and testing sets. For this example, let’s use only the testing set (the smaller one) as our whole dataset, which we will then divide into our own training and testing sets. Let’s use pandas to read the csv file.
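With the real Kaggle file you would call pd.read_csv on its path directly. To keep this sketch self-contained, a tiny in-memory sample stands in for the file; the two column names are assumed to match the Kaggle dataset, and the sample rows are invented:

```python
import io

import pandas as pd

# stand-in for the Kaggle csv (invented rows, assumed column names)
sample = io.StringIO(
    "OriginalTweet,Sentiment\n"
    '"Stocked up on hand sanitizer today",Neutral\n'
    '"Grocery shelves are empty, this is scary",Extremely negative\n'
)
df = pd.read_csv(sample)
print(df.head())
```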
Then we will use a vectorizer using the custom tokenizer that we built earlier. Its role will be to vectorize each tweet, using the bag of words technique.
Using the “head” function in pandas, we can see the first lines of our data, with the names of the columns.
We then isolate the data that is of interest to us.
We already know that computers don’t like text, and the sentiment column is composed of text values: “extremely negative, negative, neutral, positive, extremely positive”. We will use sklearn’s preprocessing module to help us transform those text values into numeric values:
- Extremely negative => 0
- Negative => 1
- Neutral => 2
- Positive => 3
- Extremely positive => 4
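One way to get exactly this mapping with sklearn’s preprocessing module is OrdinalEncoder with an explicit category order (note that the simpler LabelEncoder would number the labels alphabetically instead):

```python
from sklearn.preprocessing import OrdinalEncoder

# explicit order, from most negative (0) to most positive (4)
order = [["Extremely negative", "Negative", "Neutral", "Positive", "Extremely positive"]]
encoder = OrdinalEncoder(categories=order)

# OrdinalEncoder expects a 2-D input: one row per sample
labels = [["Extremely negative"], ["Positive"], ["Extremely positive"]]
encoded = encoder.fit_transform(labels)
print(encoded)
```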
The next step is to separate the training and testing sets. Fortunately, sklearn helps us do that. We can choose the ratio of training and testing data out of our dataset. In this example we will use 80% of the dataset for training purposes and 20% for testing.
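A sketch of the split, on invented placeholder data:

```python
from sklearn.model_selection import train_test_split

# placeholder tweets and encoded sentiments (invented for illustration)
X = [f"tweet number {i}" for i in range(10)]
y = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]

# test_size=0.2 keeps 20% aside for testing; random_state makes it reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))
```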
In this exercise, we will use a Naive Bayes algorithm, but testing other algorithms is as easy as a line of code. We will create a pipeline containing our custom transformer, our vectorizer (bag of words, with our custom tokenizer) and our classifier (the model/algorithm we chose, i.e. Naive Bayes). We will then train the model on our training set.
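Putting the pieces together, here is a condensed, self-contained sketch of such a pipeline. A plain CountVectorizer and a tiny invented dataset stand in for the spaCy-powered vectorizer and the real tweets, so the snippet runs on its own:

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

class TextCleaner(BaseEstimator, TransformerMixin):
    """Stand-in for our custom transformer: strip spaces and lowercase."""
    def transform(self, X, **transform_params):
        return [text.strip().lower() for text in X]
    def fit(self, X, y=None, **fit_params):
        return self

# tiny invented training set: 1 = positive, 0 = negative
X_train = ["i love this", "great news today", "awful situation", "this is terrible"]
y_train = [1, 1, 0, 0]

pipe = Pipeline([
    ("cleaner", TextCleaner()),       # custom transformer
    ("vectorizer", CountVectorizer()),  # bag of words
    ("classifier", MultinomialNB()),  # Naive Bayes
])
pipe.fit(X_train, y_train)

print(pipe.predict(["great news today"]))
```

Swapping in another algorithm really is one line: replace MultinomialNB() with, say, an SVM classifier.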
Once this is done, we can test our model on unseen data, taken from our testing set and then evaluating how the model performs through metrics like accuracy, precision and recall:
- Accuracy: the percentage of all the predictions our model makes that are correct.
- Precision: describes the ratio of true positives to true positives plus false positives in our predictions.
- Recall: describes the ratio of true positives to true positives plus false negatives in our predictions.
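Each of these metrics is one function call in scikit-learn; the true labels and predictions below are invented for illustration:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# hypothetical true labels and model predictions for six tweets
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

acc = accuracy_score(y_true, y_pred)    # correct predictions / all predictions
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)
print(acc, prec, rec)
```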
We can see that the metrics are not very good. What could have gone wrong and how would you improve it?
Here is the full code of the implementation we just made.
Now Joe understands a bit more what is happening and how easy it is to implement a text classification solution. He still wonders:
“Is it really the most efficient approach? Shouldn’t we go higher level, by using spaCy built-in text classification?”
Clever Joe, that is indeed something that needs our attention. Should we reinvent the wheel? Should we go the easy but expensive way of cloud providers?
Stay tuned for the next episode to go deeper into that train of thought!
Written by Charles-Antoine Vanbeers