Corpus, Vocabulary, and Feature Vectors
00:00 This video starts off with some terminology. In the previous video, you combined the data from three sources (Yelp, Amazon, and IMDb) into a DataFrame. The DataFrame contains sentences, labels, and sources.
00:16 The sentences are a collection of texts that you will use with scikit-learn, and later Keras, to make predictions. This collection of texts is referred to as a corpus.
00:28 If you drill down to the word level, the collection of unique words in the corpus is called the vocabulary. You use only the unique words because each one is assigned a unique index that identifies the word during training.
00:44 Machine learning algorithms operate on numbers, not raw text, so you have to represent the text with numerical values.
00:50 If you take each sentence in the corpus and convert it into a list of numbers using the indexes mapped in the vocabulary, you get a feature vector. This is a numerical representation of each sentence.
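The three terms can be sketched in plain Python before any library is involved. This is a minimal illustration, not the course's code: the second sentence, the whitespace tokenization, and the helper name to_feature_vector are assumptions made for the example.

```python
# Toy corpus. The second sentence is an assumption made for illustration.
corpus = ["John likes ice cream", "John hates chocolate"]

# Vocabulary: every unique word in the corpus, mapped to a unique index.
unique_words = sorted({word for sentence in corpus for word in sentence.split()})
vocabulary = {word: index for index, word in enumerate(unique_words)}


def to_feature_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    vector = [0] * len(vocabulary)
    for word in sentence.split():
        if word in vocabulary:
            vector[vocabulary[word]] += 1
    return vector


# One feature vector per sentence, each as long as the vocabulary.
feature_vectors = [to_feature_vector(sentence, vocabulary) for sentence in corpus]

print(vocabulary)
print(feature_vectors)
```

Every vector has one slot per vocabulary word, so all sentences end up with the same length regardless of how many words they contain.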
01:03 Let’s see how this works in more detail.
01:06 The next cell in the Notebook defines two sentences. You’re not going to train a model using these sentences, because there aren’t enough of them, but you’ll use the classes inside of scikit-learn to explore this concept of converting a corpus into feature vectors. In the next cell, import the CountVectorizer class from the sklearn.feature_extraction.text module and create a vectorizer that considers all words, regardless of how many times they appear, and that is case sensitive.
01:39 Tell the vectorizer to create the vocabulary from the sentences by calling the .fit() method and passing it the sentences. The vocabulary is found in the .vocabulary_ attribute.
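The steps so far might look roughly like the sketch below. The transcript only quotes the first sentence, so the second one here is an assumption, chosen to be consistent with the half-dozen-word vocabulary mentioned later. The arguments are also assumptions: min_df=1 (the default) keeps every word that appears in at least one sentence, and lowercase=False keeps the vectorizer case sensitive.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The second sentence is an assumption; only the first is quoted in the video.
sentences = ["John likes ice cream", "John hates chocolate."]

# min_df=1 keeps every word that occurs in at least one sentence.
# lowercase=False preserves case, so 'John' and 'john' would be different words.
vectorizer = CountVectorizer(min_df=1, lowercase=False)

# Build the vocabulary from the corpus.
vectorizer.fit(sentences)

# Dictionary mapping each unique word to its index.
vocabulary = vectorizer.vocabulary_
print(vocabulary)
```

The default tokenizer drops punctuation, which is why the trailing period does not appear in the vocabulary.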
01:52 The .vocabulary_ attribute is a dictionary with the unique words as the keys and the indexes as the values. To get the vectors, call the .transform() method on the vectorizer and pass it the corpus, that is, the sentences.
02:06 This returns a sparse matrix from the SciPy library. Use the .toarray() method to display the vectors.
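As a rough sketch of the transform step, the block below refits the vectorizer so that it runs on its own. The second sentence and the variable names are assumptions made for illustration.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# The second sentence is an assumption; only the first is quoted in the video.
sentences = ["John likes ice cream", "John hates chocolate."]

vectorizer = CountVectorizer(min_df=1, lowercase=False)
vectorizer.fit(sentences)

# transform() returns a SciPy sparse matrix with one row per sentence.
sparse_vectors = vectorizer.transform(sentences)
print(issparse(sparse_vectors))

# toarray() expands the sparse matrix into a regular NumPy array for display.
dense_vectors = sparse_vectors.toarray()
print(dense_vectors)
```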
02:13 Notice the structure of the vectors. Each vector is the same length as the vocabulary. A new vector is created for each sentence, and for each word in the sentence, the vector holds that word’s count, here a 1, at the word’s vocabulary index.
02:29 For example, the first sentence is 'John likes ice cream', whose words map to indexes 0, 5, 4, and 2, and the first vector has a 1 at indexes 0, 2, 4, and 5.
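The index bookkeeping in this example can be checked with a short sketch. The second sentence and the variable names are assumptions; the word lookup uses a simple whitespace split, which happens to match the vectorizer's tokens for the first sentence.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The second sentence is an assumption; only the first is quoted in the video.
sentences = ["John likes ice cream", "John hates chocolate."]

vectorizer = CountVectorizer(min_df=1, lowercase=False)
vectorizer.fit(sentences)

first_sentence = sentences[0]

# Vocabulary index of each word, in the order the words appear in the sentence.
word_indexes = [int(vectorizer.vocabulary_[word]) for word in first_sentence.split()]

# Positions in the first feature vector that hold a 1.
first_vector = vectorizer.transform([first_sentence]).toarray()[0]
positions_of_ones = [index for index, count in enumerate(first_vector) if count == 1]

print(word_indexes)       # order of the words in the sentence
print(positions_of_ones)  # same indexes, in ascending order
```

The two lists contain the same indexes. Only the order differs, because the vector records which words occur and how often, not where they sit in the sentence.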
02:43 A sparse matrix is a good choice when you have a matrix with a lot of unused values. Your vocabulary right now only has a half dozen words, so this might not be obvious, but later on you’ll work with a larger vocabulary of over a thousand words.
02:57 It’s unlikely that even 5% of these words will be used in a single sentence. The number of unused words in each sentence will be quite large. You can reduce the size of the matrix by compressing that unused data and focusing only on the relevant data, and this is what a sparse matrix does.
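To make the storage argument concrete, here is a small sketch that compares a dense array with a SciPy CSR matrix. The sizes and the positions of the nonzero entries are made up for illustration, and CSR is just one of several sparse formats SciPy offers.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical sizes: 3 sentences and a vocabulary of 1,000 words.
n_sentences, vocab_size = 3, 1000

# Dense representation: one cell per (sentence, word) pair, almost all zeros.
dense = np.zeros((n_sentences, vocab_size), dtype=np.int64)
dense[0, [0, 5, 4, 2]] = 1    # made-up word indexes for sentence 1
dense[1, [0, 3, 1]] = 1       # made-up word indexes for sentence 2
dense[2, [7, 42, 999]] = 1    # made-up word indexes for sentence 3

# Sparse representation: only the nonzero entries and their positions are kept.
sparse = csr_matrix(dense)

total_cells = dense.size     # every cell, used or not
stored_values = sparse.nnz   # only the nonzero cells

print(f"dense cells: {total_cells}, stored sparse values: {stored_values}")
```

Both objects describe the same matrix. The sparse one simply skips the zeros, so the savings grow as the vocabulary gets larger.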
03:18 You’ve just created what is called a bag-of-words model. In the next video, you’ll use these feature vectors to train a machine learning model using scikit-learn.