Defining a Baseline Model
00:00 While training a machine learning model, you’ll often split your data set into two parts. The first part is used to train the model, and the trained model is then used to make predictions on new data. However, during training, your model will see the training data set over and over, many times. It’s possible for the model to get very comfortable with the training data and perform very well on it, yet perform poorly when it sees new data outside of the training data. This phenomenon is known as overfitting.
00:33 In other words, the model is fit too closely to the training data, and this is contrary to the goal of a generalized model that performs well on any data.
00:43 One way to detect overfitting is to use the remaining part of the data set for testing. This gives the model a chance to see data that isn’t in the training data set, so you can evaluate how well it generalizes. Let’s take a look.
00:58 You’re going to focus only on the Yelp data for this experiment. Use Pandas to filter the DataFrame and return only those rows that have a 'source' of 'yelp'. Next, get the sentences and labels from the DataFrame.
01:11 The .values attribute will extract the values of the columns into a NumPy array.
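As a minimal sketch, assuming the combined DataFrame from the earlier videos is named df and has 'sentence', 'label', and 'source' columns (these names are assumptions, not confirmed by the transcript), the filtering and extraction might look like this:

```python
# Assumes df is the combined DataFrame from earlier videos, with
# 'sentence', 'label', and 'source' columns (hypothetical names).
df_yelp = df[df['source'] == 'yelp']

# .values extracts the column contents as NumPy arrays.
sentences = df_yelp['sentence'].values
labels = df_yelp['label'].values
```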
01:18 Performing the split is not as straightforward as you might expect. One potential issue is that the labels might not be evenly distributed across the training and testing data sets. If you train the model on a data set that is mostly positive sentiment and then test it on a data set that is mostly negative sentiment, you’re not going to be impressed with the outcome. The sklearn.model_selection module provides the train_test_split() function.
01:43 This will create the training and testing data sets for you. First, import the function. Then call the function, providing the sentences and labels along with the percentage of the data to be used for the test data set. Here, you’ll use 25%.
01:58 You’ll find that there are a number of religious debates about the ideal split point, but usually it’s going to be around 20% to 30%. In this case, for reproducibility, specify a random_state of 1000.
02:12 The function will return four values: the sentences in the training data set, the sentences in the test data set, the labels in the training data set, and the labels in the test data set.
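A minimal sketch of that call, using the 25% test size and the random_state of 1000 mentioned above:

```python
from sklearn.model_selection import train_test_split

# Hold out 25% of the data for testing; random_state makes the
# shuffle reproducible across runs. (Passing stratify=labels would
# additionally guarantee the same label balance in both splits.)
sentences_train, sentences_test, y_train, y_test = train_test_split(
    sentences, labels, test_size=0.25, random_state=1000
)
```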
02:25 The next cell repeats the process you saw in the previous video to vectorize the sentences. Notice the size of the sparse matrix. The vocabulary is 1,714 words and there are 1,000 sentences in the Yelp data, so 75% would be 750.
02:42 Therefore, uncompressed this would be 750 vectors of length 1,714, and this is a total of just under 1.3 million values. Now, how many sentences can you think of with 1,700 words? Or, for that matter, even 170 words? Well, it turns out the average length of a sentence in the data set is about 11 words and the longest sentence is 32 words in length.
03:08 Now you should be able to see the value of the sparse matrix: with sentences averaging around 11 words against a vocabulary of 1,714, the overwhelming majority of those 1.3 million entries are zero, and a sparse matrix stores only the nonzero ones.
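The vectorization cell itself isn’t quoted in this transcript, but a minimal sketch of the process from the previous video, assuming it uses scikit-learn’s CountVectorizer for the bag-of-words representation, would be:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary from the training sentences only, then
# transform both splits into sparse bag-of-words matrices.
vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)

X_train = vectorizer.transform(sentences_train)  # roughly (750, 1714) per the numbers above
X_test = vectorizer.transform(sentences_test)    # roughly (250, 1714)
```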
03:14 It’s time to do some machine learning! You’re going to use an algorithm called logistic regression. Without going too deep into the mathematics, the result of logistic regression will be a value between 0 and 1 inclusive.
03:27 If you consider values in the lower part of that range, below a threshold such as 0.5, to be negative sentiment and the rest to be positive, logistic regression can be used for binary classification. And that’s all the math you need to know, because scikit-learn provides a logistic regression class.
03:43 Import the LogisticRegression class from the sklearn.linear_model module. Then create a new instance and use the .fit() method to train a model on the training data. Then use the .score() method to test the model.
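Assuming the X_train and X_test matrices from the vectorization sketch above, those steps come down to just a few lines:

```python
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression()
classifier.fit(X_train, y_train)          # train on the training split

score = classifier.score(X_test, y_test)  # mean accuracy on the test split
print(f'Accuracy: {score:.4f}')           # roughly 0.80 for the Yelp data
```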
04:00 The model is about 80% accurate. Before training the model on the entire data set, I want to call attention to the amount of code written to prepare the data versus the amount of code written to train the model.
04:12 Even if you leave out the vectorization experiment, which is only about half a dozen lines, you wrote a lot more code to prepare the data than to train on it. This reflects a common observation in machine learning: up to 80% of your time is spent preparing your data, because garbage in means garbage out. You may find that you spend less time than you expected writing code that actually does machine learning.
04:37 Now, I’m not going to labor over the next piece of code because there’s not much new in it. All it does is repeat the logistic regression for each source in the data set and print the score. You’ll see that the score for the Amazon data set is similar to Yelp’s, and the IMDb score is a little worse, but not terrible.
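A sketch of what that loop might look like, assuming the same df and column names as in the earlier sketches (again, assumptions rather than quoted code):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Repeat the whole pipeline (split, vectorize, fit, score) per source.
for source in df['source'].unique():
    df_source = df[df['source'] == source]
    sentences = df_source['sentence'].values
    labels = df_source['label'].values

    sentences_train, sentences_test, y_train, y_test = train_test_split(
        sentences, labels, test_size=0.25, random_state=1000
    )

    vectorizer = CountVectorizer()
    vectorizer.fit(sentences_train)
    X_train = vectorizer.transform(sentences_train)
    X_test = vectorizer.transform(sentences_test)

    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)
    print(f'Accuracy for {source} data: {classifier.score(X_test, y_test):.4f}')
```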
04:55 Now that you understand the process of sentiment analysis, in the next video, you’re going to see neural networks for the first time.