Using kNN in scikit-learn: Score and Update k

Kimberly Fessel

Using k-Nearest Neighbors (kNN) in Python Kimberly Fessel 07:15

00:00 In this lesson, you’ll continue coding in scikit-learn to score and update the hyperparameter k of your kNN model.

00:10 In a previous lesson, you created X and y data from a dataset about abalones. X contains the physical measurements of the abalones, while y stores their rings, which you tried to predict using kNN.

00:24 You went on to split your data into training and test sets. You used sklearn to create a kNN model that makes predictions based on the three closest neighbors.

00:36 You fit that model to your training data and then made predictions with it for the features of both the training and test sets. Now, let’s determine how good those predictions are by scoring them against the actual values.

00:51 There are many different scoring methods to choose from, but for this lesson, let’s use mean squared error. You can use scikit-learn’s implementation by first importing it.

01:01 Execute from sklearn.metrics import mean_squared_error.

01:10 Then calculate the error for your training set by applying mean_squared_error() to y_train and pred_train. So you just need to pass the two arrays that you’d like to compare as the arguments here.

01:24 And let’s go ahead and save that output as mse_train. Taking a look at mse_train, you see that you have an error of just over 2.7, but what does that value actually mean?

01:39 MSE can be difficult to understand at first glance because it reports error in squared units. Let’s switch over to RMSE, or root mean squared error. So rmse_train will be equal to the square root of mse_train. You can just use NumPy’s square root function to calculate it.

02:01 So np, which is for NumPy, sqrt(), which is a square root function, and then we’ll take the square root of mse_train. Now, rmse_train is about 1.65, which is in the same units as your original target.

02:18 Your kNN model predictions are off by about 1.65 rings for the abalones in your training set.
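As a small worked illustration of the relationship between MSE and RMSE, here is a minimal sketch. The arrays y_true and y_pred hold made-up ring counts (an assumption for illustration, not values from the abalone dataset), and mse_manual is a hypothetical name for the hand-computed check.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical ring counts for illustration only (not abalone data)
y_true = np.array([10, 8, 12, 9])
y_pred = np.array([9, 8, 14, 10])

# MSE is the mean of the squared differences, in squared units (rings squared)
mse = mean_squared_error(y_true, y_pred)

# The same quantity computed by hand with NumPy
mse_manual = np.mean((y_true - y_pred) ** 2)

# Taking the square root brings the error back to the target's units (rings)
rmse = np.sqrt(mse)

print(mse, mse_manual, rmse)
```

The errors here are 1, 0, -2, and -1, so the squared errors average to 6 / 4 = 1.5, and the RMSE is the square root of that.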

02:27 But for a more realistic result, you should see how your model performs on data that it hasn’t actually ever seen, and that’s the test set. So let’s find the mean squared error of the actual target values in your test set compared to the predictions of your kNN model.

02:44 Let’s call that mse_test and that will be the mean_squared_error() of y_test and pred_test. Once again, let’s go ahead and take the square root of this.

02:56 So rmse_test will be the square root of mse_test.

03:03 Your RMSE for the test set is about 2.38. So if you use this model to predict rings from new abalone physical measurements, you can expect it to be off by about 2.38 rings.
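The training and test scoring steps can be sketched end to end as follows. To keep the example self-contained, it uses synthetic data as a stand-in for the abalone measurements (an assumption, so the numbers will not match the 1.65 and 2.38 quoted in the lesson). The test_size and random_state values are also illustrative choices.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the abalone features and rings (assumption)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 7))
y = 10 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 2, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345
)

# kNN regressor that averages the targets of the three closest neighbors
knn_model = KNeighborsRegressor(n_neighbors=3)
knn_model.fit(X_train, y_train)

# Predictions for data the model has seen and data it has not
pred_train = knn_model.predict(X_train)
pred_test = knn_model.predict(X_test)

# Score each set of predictions against the actual values
mse_train = mean_squared_error(y_train, pred_train)
mse_test = mean_squared_error(y_test, pred_test)
rmse_train = np.sqrt(mse_train)
rmse_test = np.sqrt(mse_test)

print(f"train RMSE: {rmse_train:.2f}, test RMSE: {rmse_test:.2f}")
```

The training error tends to look better than the test error because each training point counts itself among its own nearest neighbors, which is why the test score is the more realistic one.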

03:22 So far, you’ve used a value of k equals 3 for your kNN model, which means the algorithm only considers the three nearest neighbors when making a prediction.

03:32 The k is a so-called hyperparameter of kNN, and it needs to be adjusted to an appropriate value each time you apply kNN to a new dataset. Imagine you set k equal to 1, so you’d only consider the closest neighbor when making a prediction.

03:48 Your predictions probably wouldn’t be very good because they would vary a lot from one point to another. That’s called high variance. However, if you set k to be very high, say the size of the entire dataset, you might be using neighbors that are very far away to make predictions, and you’d lose out on the nuances of your dataset. That’s called high bias.
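The two extremes can be seen directly in a short sketch. It uses synthetic data rather than the abalone dataset (an assumption), and the names knn_k1, knn_all, pred_k1, and pred_all are hypothetical. With k equal to 1, each training point is its own nearest neighbor, so the model reproduces the training targets exactly. With k equal to the size of the training set, every prediction collapses to the mean of the training targets.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption, not the abalone dataset)
rng = np.random.default_rng(1)
X_train = rng.uniform(0, 1, size=(200, 3))
y_train = 10 * X_train[:, 0] + rng.normal(0, 2, size=200)
X_new = rng.uniform(0, 1, size=(5, 3))

# k = 1: predictions follow every individual point, noise included
knn_k1 = KNeighborsRegressor(n_neighbors=1).fit(X_train, y_train)
pred_k1 = knn_k1.predict(X_train)

# k = number of training points: every query averages over all targets
knn_all = KNeighborsRegressor(n_neighbors=len(X_train)).fit(X_train, y_train)
pred_all = knn_all.predict(X_new)

print(np.allclose(pred_k1, y_train))          # training targets reproduced
print(np.allclose(pred_all, y_train.mean()))  # same value for every query
```

Neither extreme says anything useful about new data: the first memorizes noise, and the second ignores the features entirely.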

04:12 Instead, the ideal k will be somewhere in the middle, and you’ll use some kind of validation strategy in order to find it.

04:22 Your model currently has an RMSE for the test dataset of about 2.38, but let’s try a different value for k to see if this helps lower the RMSE for your test set.

04:34 Create a new kNN model called knn_model_25, and that’s also going to be a KNeighborsRegressor, except for now we’re going to set k equal to 25.

04:49 That’s n_neighbors=25. That means that when your model’s making predictions, it will look at the twenty-five closest neighbors in order to make that prediction.

05:02 Now you’re going to train that model with your training data using the .fit() method, so knn_model_25.fit(), and you’re fitting this to X_train, y_train.

05:17 And make predictions for your test set. Let’s call those pred_test_25, and those will be equal to your knn_model_25.predict() applied to X_test.

05:30 Okay, so the predictions have been made, and you can go ahead and score these against the actual rings of your test abalones. Set mse_test_25 equal to the mean_squared_error() of your y_test and your pred_test_25.

05:49 And just like before, you can compute the RMSE of your test set, which is the square root of your MSE.

06:01 Let’s take a look at rmse_test_25, and you can see that this value has now decreased from 2.38 to 2.17. By considering twenty-five neighbors when making a prediction, your kNN model has less variability from point to point. That makes your test error lower and means that your kNN generalizes better to unseen data.
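The steps for the k equals 25 model can be sketched as follows. As a stand-in for the abalone data, the example generates synthetic data (an assumption), so the resulting RMSE values will differ from the 2.38 and 2.17 reported in the lesson, and which k scores better depends on the data. The name rmse_test_3 is hypothetical and is only there for comparison.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the abalone features and rings (assumption)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 7))
y = 10 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 2, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345
)

# Baseline model with k = 3, scored on the test set
knn_model = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)
rmse_test_3 = np.sqrt(mean_squared_error(y_test, knn_model.predict(X_test)))

# New model that averages over the twenty-five closest neighbors
knn_model_25 = KNeighborsRegressor(n_neighbors=25)
knn_model_25.fit(X_train, y_train)
pred_test_25 = knn_model_25.predict(X_test)

# Score the new predictions against the actual test targets
mse_test_25 = mean_squared_error(y_test, pred_test_25)
rmse_test_25 = np.sqrt(mse_test_25)

print(f"k=3 test RMSE: {rmse_test_3:.2f}, k=25 test RMSE: {rmse_test_25:.2f}")
```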

06:27 The choice of 25 neighbors was somewhat arbitrary in this lesson, but in actuality, you would likely use a validation set or a cross-validation tool like scikit-learn’s GridSearchCV to select the best hyperparameter k for your particular situation. While this process is outside the scope of this lesson, you can learn more about validation or cross-validation elsewhere on the Real Python platform.
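For a rough idea of what that selection process can look like, here is a minimal sketch using GridSearchCV. It is not part of this lesson's code: the synthetic data, the candidate range of 1 to 30 neighbors, the five folds, and the names param_grid, gridsearch, and best_k are all illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the abalone features and rings (assumption)
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 7))
y = 10 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 2, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345
)

# Candidate values of k to try (illustrative range)
param_grid = {"n_neighbors": list(range(1, 31))}

# 5-fold cross-validation on the training data only.
# scikit-learn maximizes scores, so MSE is passed in negated form.
gridsearch = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
gridsearch.fit(X_train, y_train)

best_k = gridsearch.best_params_["n_neighbors"]
cv_rmse = np.sqrt(-gridsearch.best_score_)
print(f"best k: {best_k}, cross-validated RMSE: {cv_rmse:.2f}")

# By default the best model is refit on all the training data,
# so it can be used directly for predictions on the test set
pred_test_best = gridsearch.best_estimator_.predict(X_test)
```

Because the search only ever sees the training data, the test set stays untouched until the final evaluation.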

06:56 You’ve now completed building a kNN model in Python’s scikit-learn. Coming up next, you’ll conclude this course by reviewing all you’ve learned about kNN, including its primary attributes, the main steps of the algorithm, and the code you used to make kNN predictions in Python.
