Coding kNN From Scratch: Data and Distances
00:00 Now that you know how kNN works, let’s code up a k-nearest neighbors algorithm from scratch in Python.
00:09
In this lesson, you’ll be coding up your own kNN model from scratch using NumPy. So let’s go ahead and import that library, and we’ll alias it as np.
00:18
You’re going to continue working with the abalone dataset. So just to refresh your memory, abalone is a DataFrame, and the top part of that DataFrame shows us that we have various physical measurements as well as a final column called "Rings", which can tell biologists the age of an abalone.
00:36
We’re going to try to predict rings using our kNN model. To train a kNN model, we need two separate quantities: X, which will contain our features, and y, which will contain the targets.
00:48
So let’s go ahead and set those up. In this case, X is coming from the abalone DataFrame, but one very important thing that we must do here is drop the "Rings" column.
00:59
Because we’re going to try to predict rings with our kNN model, we absolutely cannot use this column when we feed in the features of our dataset. And because "Rings" is a column, we actually need to drop this using axis=1, which tells pandas that we’re going to drop a column.
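Here is a minimal sketch of that drop. The three-row DataFrame and its values are made up for illustration and stand in for the real abalone data, which has more columns and rows.

```python
import pandas as pd

# Hypothetical stand-in for the abalone DataFrame; the values are invented.
abalone = pd.DataFrame({
    "Length": [0.455, 0.350, 0.530],
    "Diameter": [0.365, 0.265, 0.420],
    "Rings": [15, 7, 9],
})

# axis=1 tells pandas to drop a column rather than a row.
# drop() returns a new DataFrame, so abalone itself still has "Rings".
X = abalone.drop("Rings", axis=1)
print(X.head())
```

Because drop() does not modify abalone in place, the "Rings" column is still there later when you need it for y.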
01:16
We can take a look at the .head() of X, and you’ll see that the "Rings" column has now been removed. It turns out that it’s going to be easier to work with X if we have a NumPy array instead of a pandas DataFrame. Right now, X is a pandas DataFrame.
01:32
We want to convert this into a NumPy array. To do that, we’re going to overwrite X with the values from this DataFrame. Now we can see that X is a NumPy array, and if we check the type, we will also verify it’s a NumPy array.
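As a rough sketch, the conversion and the type check could look like this. The two-row DataFrame is a hypothetical placeholder for the real features.

```python
import numpy as np
import pandas as pd

# Hypothetical placeholder for the feature DataFrame.
X = pd.DataFrame({"Length": [0.455, 0.350], "Diameter": [0.365, 0.265]})

# Overwrite X with the underlying values, which come back as a NumPy array.
X = X.values

print(type(X))   # <class 'numpy.ndarray'>
print(X.shape)   # (2, 2)
```

The .values attribute is what the lesson describes. X.to_numpy() should give the same array here and is the spelling that the pandas documentation currently recommends.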
01:50
So now we have X, the features that we’re going to use to train our kNN model. We also need y, which will be the target values for these labeled observations.
02:00
y also comes from the abalone DataFrame, but now we’re talking about the "Rings" column. Right now, y is a pandas Series, but we’d actually like that to be a NumPy array.
02:11
So once again, we’ll select the values from y only, and we can check to see that, yes, that is now an array.
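The same two steps for the targets might be sketched like this, again with a made-up DataFrame in place of the real abalone data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the abalone DataFrame; the values are invented.
abalone = pd.DataFrame({
    "Length": [0.455, 0.350, 0.530],
    "Rings": [15, 7, 9],
})

# Selecting a single column gives a pandas Series.
y = abalone["Rings"]
print(type(y))

# Keep only the underlying values, which are a one-dimensional NumPy array.
y = y.values
print(type(y))
```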
02:21
You now have X and y, which contain all of the data that you’ll need to train your kNN model. But once that model’s created, you’d like to be able to use it to make predictions.
02:31
So now let’s go ahead and create a new data point. Say you have a new abalone, and you’ve gathered its physical measurements and recorded them in a chart. You have its length, its diameter, and so on, and you’d like to be able to make a prediction for how many rings this abalone has. You can create a variable called new_data_point, which will be a NumPy array, and you’re just recording all of these measurements in that NumPy array.
02:57
Just note that all these measurements are in the exact same order as what you had in the original abalone DataFrame, starting with length all the way down to shell weight.
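A sketch of such a data point follows. The seven measurements are hypothetical numbers chosen for illustration, and the feature_names list is an assumed column order running from length to shell weight, not something read from the chart in the video.

```python
import numpy as np

# Assumed feature order, matching the columns of X (everything except "Rings").
feature_names = [
    "Length", "Diameter", "Height", "Whole weight",
    "Shucked weight", "Viscera weight", "Shell weight",
]

# Hypothetical measurements for one new abalone, in that same order.
new_data_point = np.array([0.57, 0.45, 0.15, 1.02, 0.44, 0.22, 0.29])

# One value per feature, so the point lines up with a row of X.
print(new_data_point.shape)  # (7,)
```

If the order differed from the columns of X, the subtraction later on would still run, but it would compare unrelated measurements.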
03:09
The first step when predicting with the kNN algorithm is to calculate the distance between your point of interest and every other observation in your dataset. So we’re going to do that using NumPy’s norm() function.
03:23
Let’s take a look at how norm() works first. In NumPy’s linear algebra submodule, there is a function called norm(), and this works by calculating the length of a vector that it’s given.
03:35
Say that we have a vector of [1, 1]. Now we know from the Pythagorean theorem that the length of the hypotenuse should be the square root of two, and that’s exactly what norm() gives us.
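That check is short enough to try directly. With its default settings, np.linalg.norm() returns the Euclidean length of a one-dimensional array.

```python
import numpy as np

v = np.array([1, 1])

# Euclidean length of the vector: sqrt(1**2 + 1**2)
length = np.linalg.norm(v)

print(length)                          # 1.4142135623730951
print(np.isclose(length, np.sqrt(2)))  # True
```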
03:47
So we’re going to be using norm() to calculate distances here. Create a new variable called distances, and here we’re going to store the distance between our new data point and every other observation in X.
04:00
So you can call up norm() and pass to it X, which contains all of your other abalones, minus the new_data_point.
04:10
You also want to specify axis=1 here, and that just lets norm() know that you’d like to take the length of the new vectors row-wise as opposed to column-wise.
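Here is a small sketch of that call. The toy X with two features and the origin as the new data point are assumptions made so the distances are easy to verify by hand. The real X has seven features.

```python
import numpy as np

# Toy data: three observations with two features each (not the abalone data).
X = np.array([
    [0.0, 0.0],
    [3.0, 4.0],
    [1.0, 1.0],
])
new_data_point = np.array([0.0, 0.0])

# axis=1 takes the norm of each row, giving one distance per observation.
distances = np.linalg.norm(X - new_data_point, axis=1)

# Expected by hand: 0, 5 (a 3-4-5 triangle), and sqrt(2).
print(distances)
```

Without axis=1, norm() would treat the two-dimensional array as a matrix and return a single number instead of one distance per row.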
04:21
There’s actually a lot happening here, so let’s break this down a little bit further. Take a look at the shape of the X array. It’s 4177 by 7, but if you take a look at the shape of the new_data_point,
04:38
it’s only got 7 values. So when you do X minus the new_data_point, you’re actually using NumPy’s broadcasting to do that.
04:47
NumPy knows that you want to subtract the new_data_point from every row of X. Once that subtraction’s done, you’re going to be taking the norm() of every individual row, and can you guess what the shape of distances is? Yep, 4177.
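To see the shapes at each step, here is a sketch that uses random placeholder arrays with the same shapes as the lesson’s X and new_data_point. The values are not abalone measurements, and the manual calculation at the end is only there as a cross-check.

```python
import numpy as np

# Random placeholders with the same shapes as in the lesson.
rng = np.random.default_rng(0)
X = rng.random((4177, 7))
new_data_point = rng.random(7)

# Broadcasting stretches the (7,) point across all 4177 rows of X.
diff = X - new_data_point

# One Euclidean norm per row.
distances = np.linalg.norm(diff, axis=1)

print(X.shape, new_data_point.shape, diff.shape, distances.shape)
# (4177, 7) (7,) (4177, 7) (4177,)

# Cross-check: square the differences, sum across each row, take the root.
manual = np.sqrt((diff ** 2).sum(axis=1))
print(np.allclose(distances, manual))  # True
```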
05:06
So you’ll actually have one distance for every single row of the X array. Each one tells you how far away that observation is from your new_data_point.
05:21 In this lesson, you prepared your data and calculated the distances between your new observation and every other abalone in the dataset. Next up, you’ll finish creating your kNN model and use it to make a prediction.