Using kNN in scikit-learn: Data, Fit, and Predict
00:00 Now let’s use Python’s scikit-learn library to build and use a k-nearest neighbors model, starting with data handling, fitting the model, and making predictions.
00:12 scikit-learn is one of the most comprehensive and most popular machine learning packages in Python. You can use it to do all kinds of data science operations, and in this lesson, you’ll use it to easily create a kNN model.
00:27 You’re about to dive into the scikit-learn code, but here’s a preview of the steps you’ll take. First, the abalone data will be split into two components, the training dataset and the test set. By splitting the data, you’ll be able to evaluate the performance of your kNN model.
00:45 Once your model is created, the training set will be used as the potential neighbors. Then you’ll make predictions for the test dataset and score those predictions against the actual abalone rings of the test data.
00:59 And you’ll be using scikit-learn to accomplish all of this through easy-to-use functions and methods. So let’s get started.
In a previous lesson, you set up X and y, which are NumPy arrays.
X contains the features of your abalone dataset, while
y contains the target values that you’ll try to predict. And recall that in this example, you want to predict the number of rings an abalone has based on its physical measurements because the rings indicate the age of the abalone, but can be difficult for biologists to obtain.
You’ll now randomly split each of the rows in your data into either the training set or the test set. And to do that, you’re going to be using scikit-learn’s train_test_split() function. It can be found in the model_selection submodule. And by the way, scikit-learn is often called sklearn. So let’s type out the import.
Now you use train_test_split() to create several variables: X_train and X_test, as well as y_train and y_test. Those will be equal to the output of train_test_split() applied to X and y. Note that train_test_split() is able to split up as many arrays as you’d like here, and it will split them in the same way.
So the rows, or abalone, that end up in
X_train will be the exact same rows that end up in
y_train. So by using
train_test_split(), you don’t have to worry about an abalone’s physical measurements not matching up with its labels.
sklearn is going to take care of that for you, but you will likely want to add a few more options here. First, you can set
test_size equal to 0.2. That means that you want 20 percent of your data to be put in the test set. The other 80 percent will be in the training set. Let’s check the first five values in y_train.
As it stands now, every time you rerun
train_test_split(), you’ll get another random set of rows in the training set, and those rows are going to change each time. So if we rerun this, we’re going to see a different selection of rows in
y_train. Now, this isn’t ideal for reproducing results, so you can set
random_state to be an integer—
any integer, it doesn’t matter which one—and you’ll get the same random split of your data each time you rerun your code. So if we rerun this, we’re going to get the same
y_train values. Okay, so what do you have? Checking the shape of all of these variables,
X_train is an array with just over 3000 rows, and
04:09 X_test has just over 800 or so abalone. They both have seven columns, or seven features.
y_train has the matching 3300 or so ring labels for the training data, and
y_test has about 800. Those match the test set.
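The splitting step above can be sketched like this. Since the abalone loading code comes from a previous lesson, this sketch uses a small synthetic stand-in for X and y with the same seven feature columns:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the abalone data: 100 rows, 7 feature columns
rng = np.random.default_rng(0)
X = rng.random((100, 7))
y = rng.integers(1, 30, size=100)  # stand-in ring counts

# 20% of the rows go to the test set; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345
)

print(X_train.shape, X_test.shape)  # (80, 7) (20, 7)
print(y_train.shape, y_test.shape)  # (80,) (20,)
```

Because X and y are passed together, row i of X_train always lines up with element i of y_train.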
Now let’s create a kNN model and fit it to the data. First, you’ll need to pull the kNN code from the scikit-learn library. So, from sklearn.neighbors, import KNeighborsRegressor, and note that you’re using the Regressor here because the target values are numeric.
If you wanted to predict classes, you’d use
KNeighborsClassifier. Next, you’re going to instantiate a kNN model. That basically just means that you’ll create a variable.
Let’s call it
knn_model. This will store all the model attributes and all the hyperparameters or anything that the model learns from the training data and so on. So for now,
knn_model will be equal to
05:32 KNeighborsRegressor(), and this is going to be an object. You can also set any hyperparameters of your model here. For kNN, the main hyperparameter is k, the number of neighbors to consider when making a prediction.
It’s called n_neighbors in this code. So go ahead and set
n_neighbors equal to
3. That means you’ll consider three neighbors when making a prediction with this model.
knn_model is now a Python variable for your model. It has various properties that you can look at, such as n_neighbors. But there are also a few properties that aren’t available yet. For example,
n_features_in_. You get an error if you try to look at this property right now because the model hasn’t been fitted to any training data yet.
So let’s go ahead and fit our model now. Execute
knn_model.fit() and pass in the X_train and y_train data. This trains your kNN model on the data that you specify in the arguments.
The first argument should be your features, and the second argument should be your targets. At this point, your model has learned all it needs to make predictions, and several properties that weren’t available before are now set. For example, if we now check
knn_model.n_features_in_, we would see that there are
7 features or columns that were used to train this kNN model.
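Putting the instantiate-and-fit steps together, here’s a minimal sketch; the four-row training array is a made-up stand-in for the real X_train and y_train:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up stand-in for the abalone training data: 4 rows, 7 features
X_train = np.array([[0.1] * 7, [0.2] * 7, [0.3] * 7, [0.4] * 7])
y_train = np.array([5.0, 7.0, 9.0, 11.0])  # stand-in ring counts

# Instantiate the model with the hyperparameter k = 3
knn_model = KNeighborsRegressor(n_neighbors=3)

# Before this call, accessing knn_model.n_features_in_ raises AttributeError
knn_model.fit(X_train, y_train)

print(knn_model.n_neighbors)     # 3
print(knn_model.n_features_in_)  # 7, available only after fitting
```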
Once your kNN model has been trained, you can use it to make predictions. Your kNN model has a method called
knn_model.predict(), and you only need to pass in the features that you’d like to make predictions for, such as X_train. These are the predicted ring values for the features in
X_train. Let’s go ahead and save that output as pred_train. It has 3341 items in it, one prediction for every row of your training data. Let’s take a look at the first five predictions here.
Seems like those are on the right scale, since most of the abalone in our dataset had about ten rings. You can go on to compare these predictions against the actual ring values in
y_train. And even more importantly, you can make predictions for the test set.
Let’s call that
pred_test. We will use our kNN model to predict ring values for
X_test, and we’ll be scoring these predictions in the next lesson.
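A minimal prediction sketch, again on made-up stand-in data rather than the real abalone arrays:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Stand-in training data: 4 rows, 7 features
X_train = np.array([[0.0] * 7, [1.0] * 7, [2.0] * 7, [3.0] * 7])
y_train = np.array([4.0, 6.0, 8.0, 10.0])

knn_model = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)

# One prediction per row of the features you pass in
pred_train = knn_model.predict(X_train)

# An unseen row (stand-in for X_test): its 3 nearest training rows are
# the 0.0, 1.0, and 2.0 rows, so the prediction is mean(4, 6, 8) = 6.0
X_test = np.array([[0.5] * 7])
pred_test = knn_model.predict(X_test)
print(pred_test)  # [6.]
```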
By the way, if you’re new to
sklearn, just notice how much simpler it was to use
sklearn than to code up kNN from scratch by yourself.
scikit-learn is also great because its codebase follows similar patterns for all of its various supervised learning models, with the same
.fit() and .predict() methods throughout.
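That shared interface means swapping a regressor for a classifier is mostly a one-line change. A small illustration, with made-up one-feature data just to show the common .fit()/.predict() pattern:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y_numeric = np.array([1.0, 2.0, 3.0, 4.0])  # regression targets
y_classes = np.array([0, 0, 1, 1])          # classification labels

# Same pattern for both model types: instantiate, .fit(), .predict()
reg = KNeighborsRegressor(n_neighbors=2).fit(X, y_numeric)
clf = KNeighborsClassifier(n_neighbors=2).fit(X, y_classes)

print(reg.predict([[2.5]]))  # averages the 2 nearest targets: [3.5]
print(clf.predict([[2.5]]))  # majority vote of the 2 nearest labels: [1]
```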
08:57 In this lesson, you leveraged Python’s scikit-learn library to build a kNN model and make predictions with it. Coming up next, you’ll continue working in scikit-learn to score your predictions and explore the hyperparameter k, the number of neighbors considered when making a prediction.