Word Vectors and Embedding Layers
00:00 Let’s take a look at another way to represent words. You’ve seen the bag-of-words model, which represents a whole sequence of words as a single vector. Next, you’ll see how to represent each individual word as a vector.
00:14
One way to represent a word as a vector is with one-hot encoding. Take a look at the following code. It simply creates a list of strings. The LabelEncoder from scikit-learn assigns a unique integer value to each distinct string in the list, and you do this by calling the .fit_transform() method.
00:35
It assigns 0 to 'Berlin', 1 to 'London', and 2 to 'New York'.
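Here’s a rough sketch of that step in case you want to follow along; the exact list of cities is a stand-in, not the data set from the video:

```python
# Encode a list of strings as integers with scikit-learn's LabelEncoder.
from sklearn.preprocessing import LabelEncoder

cities = ['London', 'Berlin', 'Berlin', 'New York', 'London']  # example data (assumed)

encoder = LabelEncoder()
city_labels = encoder.fit_transform(cities)
print(city_labels)  # [1 0 0 2 1] -> 0 is 'Berlin', 1 is 'London', 2 is 'New York'
```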
00:41
You can use this to create one-hot encodings for each string with the OneHotEncoder in scikit-learn. Notice that the OneHotEncoder needs the labels to be in a column, not in a row. The result is a NumPy array.
00:56
Again, there is an entry for each string or label. This time, the labels are represented as a vector. The vector is the same length as the vocabulary—3, in this sample. In each vector, all the values are 0 except for a single 1 value.
01:13
The position of the 1 value is the same as the unique value assigned to the label. But this is not the ideal representation for text unless you are dealing with categorical values.
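Continuing the sketch, here’s roughly how the one-hot step could look. Note that the keyword for a dense result is sparse_output=False in recent scikit-learn releases; older versions use sparse=False:

```python
# Turn the integer labels into one-hot vectors with OneHotEncoder.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

city_labels = np.array([1, 0, 0, 2, 1])       # integer labels from the previous step
city_labels = city_labels.reshape((-1, 1))    # the encoder wants a column, not a row

encoder = OneHotEncoder(sparse_output=False)  # return a dense NumPy array
onehot = encoder.fit_transform(city_labels)
print(onehot)
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```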
01:25 Next, you’ll look at the concept of an embedding layer for more efficient representations of text. The text needs to be prepared—again, up to 80% of machine learning is data preparation—to work with the embeddings.
01:38
The Tokenizer utility inside of Keras will help you convert the text into integer values. The Tokenizer will assign an integer value to each of the 5,000 most frequently used words in the corpus.
01:51 It will then create a vector for each sentence in the data set.
01:56
As you can see, the word 'the' was assigned 1, as it occurs most often in the data set.
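A minimal sketch of the tokenization step, using made-up sentences rather than the actual data set (the API shown is the tf.keras one from TensorFlow 2.x):

```python
# Map words to integers with the Keras Tokenizer.
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'The food at the cafe was great and the service was friendly',  # example text (assumed)
    'The food was cold',
]

tokenizer = Tokenizer(num_words=5000)  # keep only the 5,000 most frequent words
tokenizer.fit_on_texts(sentences)      # build the word-to-integer mapping
sequences = tokenizer.texts_to_sequences(sentences)

print(tokenizer.word_index['the'])  # 1 -- the most frequent word gets the lowest index
print(sequences)                    # one integer vector per sentence, each a different length
```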
02:04
So, how is this different from using the CountVectorizer from scikit-learn that you saw earlier? Recall that the CountVectorizer created vectors that were the size of the vocabulary.
02:15
The Tokenizer generates vectors that are the length of each text. This means that the vectors are going to be of different lengths and need to be padded with zeros to make them uniform.
02:27
It doesn’t matter if you pad the beginning or the end, as long as you are consistent. In Keras, the pad_sequences() function will take care of padding for you.
02:38
Give it the text to pad, where to pad—'post' will pad at the end of the text—and the maximum length of the padded sequences. To create the embedding layer, you can use a pretrained model. You’ll do that later, but first, you’ll train a custom layer. Keras provides more utility classes to help out. Create a new Sequential model and add an Embedding layer.
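Here’s a quick sketch of the padding call just described; the sequences and maxlen=10 are made-up values:

```python
# Pad integer sequences to a uniform length, adding zeros at the end.
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[1, 3, 4, 1, 5], [1, 3, 6]]  # integer vectors of different lengths
padded = pad_sequences(sequences, padding='post', maxlen=10)
print(padded)
# [[1 3 4 1 5 0 0 0 0 0]
#  [1 3 6 0 0 0 0 0 0 0]]
```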
03:03
The keyword arguments for the Embedding layer will be the size of the vocabulary, the size of the vectors, and the length of each padded vector.
03:13
Then send the output to an intermediate Dense layer of size 10 with an activation of 'relu', and finally to an output layer of size 1 with activation='sigmoid'.
03:24
Notice that the output of the Embedding layer must be flattened before the Dense layer can use it. Compile the model with the same hyperparameters as before.
03:35
The loss function will be 'binary_crossentropy', the optimizer will be 'adam', and the metrics will be ['accuracy'].
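Putting those steps together, a hedged sketch of the model might look like the following. The embedding dimension of 50 and the padded length of 100 are assumed values, not taken from the video, and input_length is the TensorFlow 2.x Keras keyword (newer Keras releases may ignore it):

```python
# Build and compile a small model with a trainable embedding layer.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

vocab_size = 5000    # matches the Tokenizer's num_words
embedding_dim = 50   # size of each word vector (assumed)
maxlen = 100         # length of each padded sequence (assumed)

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=maxlen))
model.add(Flatten())                       # flatten the embeddings for the Dense layers
model.add(Dense(10, activation='relu'))    # intermediate layer
model.add(Dense(1, activation='sigmoid'))  # output layer for binary classification

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
```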
03:44
Train the model with the .fit() method like you did before. This time, use only 20 epochs. You can see that the training and validation accuracy are both quite low.
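A sketch of the training call; X_train, y_train, X_test, and y_test stand in for the prepared data set, and the batch size is an assumption:

```python
# Train for 20 epochs, tracking validation performance along the way.
history = model.fit(
    X_train, y_train,                  # placeholders for the padded training data and labels
    epochs=20,
    validation_data=(X_test, y_test),  # placeholders for the held-out data
    batch_size=10,                     # assumed batch size
)
```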
03:57 Test the model and graph the history. Again, the accuracy is too low and the errors are too high, so let’s look at a better way to train this model. The problem is that you want to consider the order of the values in the vectors instead of each individual value alone, and one way to handle this is to use a pooling layer. In this case, you’ll add a global pooling layer after the embedding layer in the model.
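Testing and graphing could look roughly like this; the history keys 'accuracy' and 'val_accuracy' assume a recent tf.keras (older versions use 'acc' and 'val_acc'):

```python
# Evaluate on the test set and plot the accuracy curves from training.
import matplotlib.pyplot as plt

loss, accuracy = model.evaluate(X_test, y_test, verbose=False)
print(f'Testing accuracy: {accuracy:.4f}')

plt.plot(history.history['accuracy'], label='training accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.legend()
plt.show()
```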
04:24
Inside the pooling layer, the maximum of the values in each dimension will be selected. There are average pooling layers as well, but the max pooling layer will highlight large values. Add the GlobalMaxPool1D layer to the model and train it again.
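Here’s the same sketch of the model with global max pooling after the embedding layer; since its output is already flat, it takes the place of the Flatten step. The sizes are the same assumed values as before:

```python
# Use GlobalMaxPool1D to take the maximum over each embedding dimension.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalMaxPool1D, Dense

vocab_size, embedding_dim, maxlen = 5000, 50, 100  # assumed values, as before

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=maxlen))
model.add(GlobalMaxPool1D())               # one maximum per embedding dimension
model.add(Dense(10, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```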
04:46 Now this is much more accurate, but training a real-world embedding layer would take a lot more time and attention. In the next video, you’ll see how to save time by using a pretrained word embedding.
04:58 There are many pretrained embedding layers available to include with your project.