Corpus, Vocabulary, and Feature Vectors
00:00 This video starts off with some terminology. In the previous video, you combined the data from three sources— Yelp, Amazon, and IMDb—into a DataFrame. The DataFrame contains sentences, labels, and sources.
00:28 If you drill down to the word level, the collection of unique words in the corpus is called the vocabulary. The reason why you use only the unique words is because each one will be assigned a unique index that will be used to identify the words during training.
00:50 If you take each sentence in the corpus and convert it into a list of numbers using the indexes mapped in the vocabulary, you get a feature vector. This is a numerical representation of each sentence.
The next cell in the notebook defines two sentences. You’re not going to train a model using these sentences, because there aren’t enough of them, but you’ll use the classes inside of scikit-learn to explore this concept of converting a corpus into feature vectors. In the next cell, import the CountVectorizer class from the sklearn.feature_extraction.text module and create a vectorizer that considers all words regardless of how often they appear and that is case sensitive.
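The steps above can be sketched as follows. The two sentences here are placeholders, since the transcript doesn’t show the exact sentences used in the notebook:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder corpus standing in for the two sentences defined in
# the notebook (the exact sentences are an assumption).
sentences = ["John likes ice cream", "John hates chocolate"]

# min_df=0 keeps every word regardless of frequency, and
# lowercase=False makes the vectorizer case sensitive.
vectorizer = CountVectorizer(min_df=0, lowercase=False)
vectorizer.fit(sentences)

print(vectorizer.vocabulary_)
```

Fitting the vectorizer builds the vocabulary: each unique word in the corpus is mapped to an integer index.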
After fitting the vectorizer on the corpus, .vocabulary_ is a dictionary with the unique words as the keys and their indexes as the values. To get the vectors, call the .transform() method on the vectorizer and pass it the corpus—the two sentences defined earlier.
Notice the structure of the vectors. Each vector is the same length as the vocabulary. A new vector is created for each sentence, and for each word in the sentence, the position matching that word’s vocabulary index is set to 1.
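A minimal sketch of the transform step, again using placeholder sentences in place of the notebook’s corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder corpus (the notebook's actual sentences may differ).
sentences = ["John likes ice cream", "John hates chocolate"]

vectorizer = CountVectorizer(min_df=0, lowercase=False)
vectorizer.fit(sentences)

# .transform() returns a sparse matrix; .toarray() makes the structure
# visible: one row per sentence, one column per vocabulary word.
vectors = vectorizer.transform(sentences)
print(vectors.toarray())
```

Each row has a 1 in the columns whose vocabulary indexes correspond to the words in that sentence, and 0 everywhere else.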
02:43 A sparse matrix is a good choice when you have a matrix with a lot of unused values. Your vocabulary right now only has a half dozen words, so this might not be obvious, but later on you’ll work with a larger vocabulary of over a thousand words.
02:57 It’s unlikely that even 5% of these words will be used in a single sentence. The number of unused words in each sentence will be quite large. You can reduce the size of the matrix by compressing that unused data and focusing only on the relevant data, and this is what a sparse matrix does.
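This compression is easy to see by comparing the number of values the sparse matrix actually stores against the size of the equivalent dense matrix. The sentences below are placeholders; any corpus shows the same effect:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder corpus (assumed; the savings grow with vocabulary size).
sentences = ["John likes ice cream", "John hates chocolate"]

vectorizer = CountVectorizer(min_df=0, lowercase=False)
vectors = vectorizer.fit_transform(sentences)

# The sparse matrix stores only the nonzero entries.
print(type(vectors))           # a scipy.sparse matrix
print(vectors.nnz)             # number of stored (nonzero) values
print(vectors.toarray().size)  # total cells in the dense equivalent
```

With a half-dozen words the savings are small, but with a thousand-word vocabulary where each sentence uses only a handful of words, the sparse matrix stores a tiny fraction of the dense matrix’s cells.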