Defining a Baseline Model
00:00 While training a machine learning model, you’ll often split your data set into two parts. The first part is used to train the model, which can then be used to make predictions on new data. However, during training your model will see the training data set over and over, many times. It’s possible that the model could get very comfortable with the training data and perform very well on it. However, when it sees new data outside of the training data, it might perform poorly. This is a phenomenon known as overfitting.
00:43 One way to prevent or detect overfitting is to use the remaining part of the data set for testing, and this will give the model a chance to see data not in the training data set, and then you can evaluate the performance of the model. Let’s take a look.
You’re going to focus only on the Yelp data for this experiment. Use Pandas to filter the DataFrame and return only those rows whose source is
'yelp'. Next, get the sentences and labels from the DataFrame.
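The filtering step might look something like the following sketch. The column names (`sentence`, `label`, `source`) and the tiny stand-in DataFrame are assumptions for illustration; the course’s actual DataFrame layout may differ.

```python
import pandas as pd

# Stand-in for the combined sentiment DataFrame (hypothetical column names).
df = pd.DataFrame({
    "sentence": ["Great food!", "Terrible service.", "Loved this movie."],
    "label": [1, 0, 1],
    "source": ["yelp", "yelp", "imdb"],
})

# Keep only the rows whose source is 'yelp'.
df_yelp = df[df["source"] == "yelp"]

# Pull the sentences and labels out as arrays.
sentences = df_yelp["sentence"].values
labels = df_yelp["label"].values
```

The boolean mask `df["source"] == "yelp"` selects matching rows without copying the whole DataFrame up front.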
Performing the split is not as straightforward as you might expect. One potential issue is that the labels might not be evenly distributed across the training and testing data sets. If you train the model on a data set that is mostly positive sentiment and then test it on a data set that is mostly negative sentiment, you’re not going to be impressed with the outcome. The
sklearn.model_selection module provides the train_test_split() function for exactly this purpose.
01:43 This will create the training and testing data sets for you. First, import the function. Then call the function, providing the sentences and labels along with the percentage of the data to be used for the test data set. Here, you’ll use 25%.
You’ll find that there are a number of religious debates about the ideal split point, but usually it’s going to be around 20% to 30%. In this case, for reproducibility, specify a fixed random state as well.
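Put together, the split might look like this sketch. The sample sentences are placeholders, and the `random_state` value is arbitrary; any fixed integer makes the split reproducible.

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the Yelp sentences and labels.
sentences = ["good", "bad", "great", "awful", "fine", "poor", "nice", "meh"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Hold out 25% of the data for testing; random_state makes the
# shuffle-and-split reproducible across runs.
sentences_train, sentences_test, labels_train, labels_test = train_test_split(
    sentences, labels, test_size=0.25, random_state=42
)

print(len(sentences_train), len(sentences_test))  # 6 2
```

Because the function shuffles before splitting, positive and negative labels end up mixed into both partitions rather than clumped in one.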
02:25 The next cell repeats the process you saw in the previous video to vectorize the sentences. Notice the size of the sparse matrix. The vocabulary is 1,714 words and there are 1,000 sentences in the Yelp data, so 75% would be 750.
02:42 Therefore, uncompressed this would be 750 vectors of length 1,714, and this is a total of just under 1.3 million values. Now, how many sentences can you think of with 1,700 words? Or, for that matter, even 170 words? Well, it turns out the average length of a sentence in the data set is about 11 words and the longest sentence is 32 words in length.
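As a reminder of the vectorization step, here is a minimal sketch using scikit-learn’s CountVectorizer on a few stand-in sentences (the real run uses the 750 training sentences and yields a 750 × 1,714 matrix):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny stand-in for the training sentences.
sentences_train = [
    "the food was great",
    "the service was slow",
    "great food and great service",
]

vectorizer = CountVectorizer()
vectorizer.fit(sentences_train)          # learn the vocabulary
X_train = vectorizer.transform(sentences_train)

# One row per sentence, one column per vocabulary word.
print(X_train.shape)  # (3, 7)

# Stored as a sparse matrix: only the nonzero counts are kept,
# which is why 1.3 million mostly-zero values stay manageable.
print(X_train.nnz)
```

Since a typical sentence uses only a dozen or so of the 1,714 vocabulary words, almost every entry in a row is zero, and the sparse representation only stores the handful that aren’t.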
03:14 It’s time to do some machine learning! You’re going to use an algorithm called logistic regression. Without going too deep into the mathematics, the result of logistic regression will be a value between 0 and 1.
03:27 If you consider values below 0.5 to be negative sentiment and the rest to be positive, logistic regression can be used for binary classification. And that’s all the math you need to know, because scikit-learn provides a logistic regression class.
Import the LogisticRegression class from the
sklearn.linear_model module. Then create a new instance and use the
.fit() method to train a model on the training data. Then use the
.score() method to test the model.
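The train-and-score step can be sketched like this. The four-sentence corpus is a made-up stand-in; in the course, `X_train` and `X_test` are the vectorized Yelp splits from earlier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny stand-in corpus; the real data is the vectorized Yelp split.
sentences_train = ["great food", "awful service", "great service", "awful food"]
labels_train = [1, 0, 1, 0]
sentences_test = ["great", "awful"]
labels_test = [1, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(sentences_train)
X_test = vectorizer.transform(sentences_test)   # reuse the SAME vocabulary

classifier = LogisticRegression()
classifier.fit(X_train, labels_train)             # train on the training split
accuracy = classifier.score(X_test, labels_test)  # fraction correct on held-out data
print(accuracy)
```

Note that the test sentences are transformed with the vectorizer fitted on the training data, so the model never sees the test split during training.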
04:00 The model is about 80% accurate. Before training the model on the entire data set, I want to call attention to the amount of code written to prepare the data versus the amount of code written to train the model.
04:12 Even if you leave out the vectorization experiment, which is only about half a dozen lines, you wrote a lot more code to prepare the data. This illustrates a common refrain in machine learning: up to 80% of your time is spent preparing your data, because “garbage in, garbage out.” You may find that you’re not spending as much time as you expected writing code for the actual machine learning.
04:37 Now, I’m not going to belabor the next piece of code, because there’s not really that much new. All it does is repeat the logistic regression on each source in the data set and print the score. You’ll see that the score for the Amazon data set is similar to Yelp’s, and IMDb’s is a little worse but not terrible.
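That per-source loop might be sketched as follows. The twelve-row DataFrame and its column names are stand-ins; the real data set holds 1,000 sentences per source.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the combined DataFrame.
df = pd.DataFrame({
    "sentence": ["great food", "awful service", "loved it", "hated it",
                 "works well", "broke fast", "good value", "bad quality",
                 "nice plot", "boring film", "great acting", "dull story"],
    "label":    [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "source":   ["yelp"] * 4 + ["amazon"] * 4 + ["imdb"] * 4,
})

for source in df["source"].unique():
    df_source = df[df["source"] == source]
    sentences = df_source["sentence"].values
    labels = df_source["label"].values

    # Same split-vectorize-train-score pipeline, once per source.
    sentences_train, sentences_test, y_train, y_test = train_test_split(
        sentences, labels, test_size=0.25, random_state=42
    )
    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(sentences_train)
    X_test = vectorizer.transform(sentences_test)

    classifier = LogisticRegression()
    classifier.fit(X_train, y_train)
    score = classifier.score(X_test, y_test)
    print(f"{source}: {score:.4f}")
```

Fitting a fresh vectorizer inside the loop matters: each source gets its own vocabulary, so the scores are independent experiments rather than one shared model.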