Coding kNN From Scratch: Neighbors and Predictions
00:00 In this lesson, you’ll continue coding your kNN model from scratch using Python.
00:07 Let’s continue coding that kNN model. Remember from the last lesson that you have X and y, which contain your features in X and the target values in y.
00:18 You’ll use these values to train your kNN model. You also have a new_data_point, which contains physical measurements for a brand-new abalone, and you’re trying to predict the number of rings of this abalone.
00:32 You also calculated the distances between your X data and the new_data_point.
00:38 Once those distances are known, it’s time to find the nearest neighbors. First, you need to decide how many nearest neighbors you’d like to consider. For this case, let’s set k equal to 3.
00:50 This means that you’ll use the three closest neighbors when making a prediction with kNN. So you now have the distances between your new data point and all the other points.
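If you’d like to follow along outside the video, here’s a minimal sketch of that setup. The arrays below are tiny made-up stand-ins for the real abalone X and y from the previous lesson, and the distance calculation assumes Euclidean distance, as used there:

```python
import numpy as np

# Hypothetical stand-ins for the abalone data from the previous lesson:
# each row of X holds physical measurements, and y holds ring counts.
X = np.array([
    [0.57, 0.44, 0.15],
    [0.53, 0.42, 0.14],
    [0.62, 0.48, 0.17],
    [0.35, 0.28, 0.09],
    [0.55, 0.43, 0.15],
])
y = np.array([9, 10, 12, 6, 11])

# A brand-new abalone whose ring count you want to predict.
new_data_point = np.array([0.56, 0.44, 0.15])

# Euclidean distance from the new point to every row of X.
distances = np.linalg.norm(X - new_data_point, axis=1)

# Number of nearest neighbors to consider.
k = 3
```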
01:00 You’d like to know which of those distances are the smallest, but even beyond that, you actually want to know which of those neighbors are the closest to your new data point. So instead of just taking distances and sorting this array, you’re actually going to use .argsort(). Let’s take a look at what this does.
01:19 If I press Shift + Tab + Tab in a Jupyter Notebook, I can pull up the docstring. This says it will return the indices that would sort this array.
01:30 So instead of actually getting a sorted array out of this, we’re going to get the appropriate indices that would sort this array. That means that the indices of the closest neighbors will show up first, and the furthest away will show up last.
01:46 So if you go ahead and execute this, you’re going to get an array of indices rather than sorted distance values. In this case, you just want to know the three closest neighbors.
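To see this on something small, here’s a quick sketch with a made-up distances array:

```python
import numpy as np

distances = np.array([0.40, 0.05, 0.90, 0.12])

# .argsort() returns the indices that would sort the array,
# not the sorted values themselves.
print(distances.argsort())  # [1 3 0 2] -> index 1 is the closest point
```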
01:58 So create a new variable called nearest_neighbor_ids and set it equal to distances.argsort(). But instead of keeping the entire array of indices, you just want to pick off the first k, or, in this case, three values.
02:16 Now, nearest_neighbor_ids just gives us the three closest neighbors.
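As a sketch, continuing with the made-up distances array from above:

```python
import numpy as np

distances = np.array([0.40, 0.05, 0.90, 0.12])
k = 3

# Slice off only the first k indices: the IDs of the k closest neighbors.
nearest_neighbor_ids = distances.argsort()[:k]
print(nearest_neighbor_ids)  # [1 3 0]
```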
02:23 And to prove to you that these actually are the three closest neighbors, let’s look at the physical measurements for neighbor number 4045. These values are very similar to the physical measurements of the new data point, and the distance between these two points is very, very small.
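Here’s a sketch of that sanity check, using the toy arrays again rather than the real dataset, where the neighbor’s row index happened to be 4045:

```python
import numpy as np

X = np.array([
    [0.57, 0.44, 0.15],
    [0.53, 0.42, 0.14],
    [0.62, 0.48, 0.17],
])
new_data_point = np.array([0.56, 0.44, 0.15])

# Inspect one neighbor's measurements (row 0 here; 4045 in the real data)
# and confirm that its distance to the new point is tiny.
neighbor = X[0]
print(neighbor)
print(np.linalg.norm(neighbor - new_data_point))
```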
02:46 You now know the ID numbers of the three closest neighbors to the new data point. How does this help you make a prediction? Well, remember that for regression problems, you want to average the neighbors’ targets together in order to make a prediction. The neighbors’ targets are found in the y array.
03:05 So if you go look at the value for neighbor number 4045, you would see that this abalone has 9 rings. So create a new variable called nearest_neighbor_rings, and you can gather these ring values from the y array by indexing with the nearest_neighbor_ids. Now, nearest_neighbor_rings has three different values for rings.
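That step is NumPy’s fancy indexing; here’s a sketch with the toy arrays from before:

```python
import numpy as np

y = np.array([9, 10, 12, 6, 11])
nearest_neighbor_ids = np.array([1, 3, 0])

# Fancy indexing: passing an array of indices pulls out all three targets at once.
nearest_neighbor_rings = y[nearest_neighbor_ids]
print(nearest_neighbor_rings)  # [10  6  9]
```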
03:31 These are the three ring values for those three nearest neighbors. Because this is a regression problem, the only thing we need to do to make a prediction is to take those rings and average them together, which we can do with the .mean() method.
03:48 The prediction for our new data point is 10.0 rings. So to make this prediction, you had the new data point’s physical measurements, you found the three closest neighbors to that data point, and then you averaged together the closest neighbors’ ring values.
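Putting the regression step together as a sketch, with the toy numbers from above, so the result won’t match the lesson’s 10.0:

```python
import numpy as np

y = np.array([9, 10, 12, 6, 11])
nearest_neighbor_ids = np.array([1, 3, 0])

nearest_neighbor_rings = y[nearest_neighbor_ids]

# For regression, the kNN prediction is just the mean of the neighbors' targets.
prediction = nearest_neighbor_rings.mean()
print(prediction)  # 8.333... for these toy numbers
```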
04:07 This lesson featured a regression problem, but if you have a classification problem, you can just use a majority vote based on the neighbors’ targets. So say that you have your nearest neighbors, and you know what their class labels are. You would probably have those in a NumPy array, and perhaps those are given by "Square", "Circle", "Square", "Triangle", and "Square".
04:31 You can use these now to make a prediction for your new data point. There are a couple of different ways that you can do this. Let’s go ahead and import something called a Counter: from the collections module, import Counter.
04:46 And you’re going to be using Counter on your nearest_neighbors_classes.
04:53 So you can see that Counter will just count up the total number of times it saw each of those different classes.
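Here’s a sketch of that counting step, using the five example labels from above:

```python
from collections import Counter

import numpy as np

nearest_neighbors_classes = np.array(
    ["Square", "Circle", "Square", "Triangle", "Square"]
)

# Counter tallies how many times each class label appears.
class_count = Counter(nearest_neighbors_classes)
print(class_count)
# Something like: Counter({'Square': 3, 'Circle': 1, 'Triangle': 1})
```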
05:00 You can go ahead and save this as class_count. There are several different cool things that you can do with counters, but taking a look at the docstring, you can see that one of the first things that comes up is called .most_common().
05:14 You can go ahead and use this to find the most common class.
05:20 You can pull up that class_count variable and apply .most_common() to it, and you just want the first, or very most common, element here.
05:31 That’s the square, and it was seen three times. If you’d like to go ahead and just pull out the class name "Square", you’ll need to access the zeroth element.
05:41 And then the zeroth element of this resulting tuple, which is the "Square" label.
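And here’s the full indexing chain as a sketch:

```python
from collections import Counter

class_count = Counter(["Square", "Circle", "Square", "Triangle", "Square"])

# .most_common(1) returns a list holding the single most frequent item
# as a (label, count) tuple.
print(class_count.most_common(1))        # [('Square', 3)]
print(class_count.most_common(1)[0])     # ('Square', 3) -> the zeroth element
print(class_count.most_common(1)[0][0])  # 'Square' -> just the class label
```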
05:51 While coding models from scratch can be great for educational purposes, it isn’t usually the most optimized way of doing machine learning. So in the next lessons, you’ll create a kNN model with Python’s scikit-learn library, which will help you perform all sorts of data science tasks.