Using kNN in scikit-learn: Data, Fit, and Predict
00:00 Now let’s use Python’s scikit-learn library to build and use a k-nearest neighbors model, starting with data handling, fitting the model, and making predictions.
00:12 scikit-learn is one of the most comprehensive and most popular machine learning packages in Python. You can use it to do all kinds of data science operations, and in this lesson, you’ll use it to easily create a kNN model.
00:27 You’re about to dive into the scikit-learn code, but here’s a preview of the steps you’ll take. First, the abalone data will be split into two components, the training dataset and the test set. By splitting the data, you’ll be able to evaluate the performance of your kNN model.
00:45 Once your model is created, the training set will be used as the potential neighbors. Then you’ll make predictions for the test dataset and score those predictions against the actual abalone rings of the test data.
00:59 And you’ll be using scikit-learn to accomplish all of this through easy-to-use functions and methods. So let’s get started.
01:09 In a previous lesson, you set up X and y, which are NumPy arrays. X contains the features of your abalone dataset, while y contains the target values that you'll try to predict. And recall that in this example, you want to predict the number of rings an abalone has based on its physical measurements, because the rings indicate the age of the abalone but can be difficult for biologists to obtain.
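If you're coding along, a minimal sketch of that earlier setup might look like this. The dataset URL and the column handling are assumptions here, since the actual loading happened in a previous lesson; the UCI abalone data has a Sex column that gets dropped, leaving seven numeric features:

    import pandas as pd

    # Assumed recap of the earlier setup -- the URL and column handling
    # may differ slightly from the previous lesson.
    url = (
        "https://archive.ics.uci.edu/ml/machine-learning-databases"
        "/abalone/abalone.data"
    )
    abalone = pd.read_csv(url, header=None)
    abalone.columns = [
        "Sex", "Length", "Diameter", "Height", "Whole weight",
        "Shucked weight", "Viscera weight", "Shell weight", "Rings",
    ]
    abalone = abalone.drop("Sex", axis=1)  # keep the seven numeric measurements
    X = abalone.drop("Rings", axis=1).values  # feature matrix as a NumPy array
    y = abalone["Rings"].values               # target: number of rings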
01:35 You'll now randomly split each of the rows in your data into either the training set or the test set. And to do that, you're going to be using scikit-learn's train_test_split() function.
01:46 It can be found in the model_selection submodule. And by the way, scikit-learn is often called sklearn. So let's type from sklearn.model_selection import train_test_split.
02:06 Now you use train_test_split() to create several variables: X_train and X_test, as well as y_train and y_test.
02:19 Those will be equal to the output of train_test_split() applied to X and y. Note that train_test_split() is able to split up as many arrays as you'd like here, and it will split them all in the same way.
02:33
So the rows, or abalone, that end up in X_train
will be the exact same rows that end up in y_train
. So by using train_test_split()
, you don’t have to worry about an abalone’s physical measurements then matching up with its labels.
02:49
sklearn
is going to take care of that for you, but you will likely want to add a few more options here. First, you can set test_size
equal to 0.2
.
03:02 That means that you want 20 percent of your data to be put in the test set. The other 80 percent will be in the training set. Let's check the first five values in y_train.
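As a sketch, assuming the X and y arrays from earlier:

    from sklearn.model_selection import train_test_split

    # All arrays are split the same way, so rows in X_train still line up
    # with their labels in y_train; test_size=0.2 reserves 20% for testing.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    print(y_train[:5])  # first five ring labels in the training set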
03:16 As it stands now, every time you rerun train_test_split(), you'll get another random set of rows in the training set, and those rows are going to change each time. So if we rerun this, we're going to see a different selection of rows in y_train. Now, this isn't ideal for reproducing results, so you can set random_state to be an integer—
03:41 any integer, it doesn't matter which one—and you'll get the same random split of your data each time you rerun your code. So if we rerun this, we're going to get the same y_train values. Okay, so what do you have? Checking the shape of all of these variables, X_train is an array with just over 3,000 rows, and X_test has just over 800 or so abalone. 04:09 They both have seven columns, or seven features.
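For instance, with an arbitrary seed value, and assuming the full 4,177-row abalone dataset for the shapes shown in the comments:

    # Fixing random_state makes the split reproducible across reruns.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=12345
    )

    print(X_train.shape)  # (3341, 7)
    print(X_test.shape)   # (836, 7)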
04:21
y_train
has the matching 3300 or so ring labels for the training data, and y_test
has about 800. Those match the test set.
04:36
Now let’s create a kNN model and fit it to the data. First, you’ll need to pull the kNN codebase from the scikit-learn library. So, from sklearn.neighbors
import KNeighborsRegressor
, and note that you’re using Regressor
here because the target values are numeric.
05:00 If you wanted to predict classes, you'd use KNeighborsClassifier. Next, you're going to instantiate a kNN model. That basically just means that you'll create a variable.
05:12 Let's call it knn_model. This will store all the model attributes and all the hyperparameters, or anything that the model learns from the training data, and so on. So for now, knn_model will be equal to KNeighborsRegressor,
05:32 and this is going to be an object. You can also set any hyperparameters of your model here. For kNN, the main hyperparameter is k, the number of neighbors to consider when making a prediction.
05:45 That's called n_neighbors in this code. So go ahead and set n_neighbors equal to 3. That means you'll consider three neighbors when making a prediction with this model.
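Putting those two steps together, a sketch:

    from sklearn.neighbors import KNeighborsRegressor

    # k = 3: each prediction will average the ring counts of the
    # three nearest training abalone.
    knn_model = KNeighborsRegressor(n_neighbors=3)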
06:00 So knn_model is now a Python variable for your model. It has various different properties that you can look at, such as n_neighbors.
06:12 But there are also a few properties that aren't available yet. For example, n_features_in_. You get an error if you try to look at this property right now because the model hasn't been fitted to any training data yet.
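For example:

    print(knn_model.n_neighbors)  # 3 -- hyperparameters are set right away

    try:
        knn_model.n_features_in_  # learned attributes don't exist yet
    except AttributeError as err:
        print(err)  # the model hasn't been fitted to any training data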
06:26
So let’s go ahead and fit our model now. Execute knn_model.fit()
and pass in the X_train
and y_train
data. This trains your kNN model to the data that you specify in the arguments.
06:44 The first argument should be your features, and the second argument should be your targets. At this point, your model has learned all it needs to be able to make predictions, and several properties that weren't available before have now been learned. For example, if we now check knn_model.n_features_in_, we would see that there are 7 features, or columns, that were used to train this kNN model.
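In code:

    knn_model.fit(X_train, y_train)  # features first, then targets

    print(knn_model.n_features_in_)  # 7 -- learned during fitting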
07:13 Once your kNN model has been trained, you can use it to make predictions. Your kNN model has a method called .predict(), so knn_model.predict(), and you only need to pass in the features that you'd like to make predictions for, such as X_train.
07:31 These are the predicted ring values for the features in X_train. Let's go ahead and save that output as pred_train.
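That is:

    pred_train = knn_model.predict(X_train)  # predicted rings for the training rows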
07:47 pred_train has 3341 items in it, one prediction for every row of your training data. Let's take a look at the first five predictions here.
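For instance:

    print(pred_train.shape)  # (3341,) -- one prediction per training row
    print(pred_train[:5])    # first five predicted ring values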
08:02 Seems like those are on the right scale, since most of the abalone in our dataset had about ten rings. You can go on to compare these predictions against the actual ring values in y_train. And even more importantly, you can make predictions for the test set.
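A quick eyeball comparison:

    print(pred_train[:5])  # predicted rings
    print(y_train[:5])     # actual rings for the same training rows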
08:19 Let's call that pred_test. We will use our kNN model to predict ring values for X_test, and we'll be scoring these predictions in the next lesson.
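As a sketch:

    pred_test = knn_model.predict(X_test)  # to be scored in the next lesson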
08:33 By the way, if you're new to sklearn, just notice how much simpler it was to use sklearn than to code up kNN from scratch by yourself.
08:42 scikit-learn is also great because its codebase follows similar patterns for all of its various supervised learning models, with the same .fit() and .predict() methods throughout.
08:57 In this lesson, you leveraged Python's scikit-learn library to build a kNN model and make predictions with it. Coming up next, you'll continue working in scikit-learn to score your predictions and explore the hyperparameter k, the number of neighbors considered when making a prediction.