Coding kNN From Scratch: Data and Distances
00:00 Now that you know how kNN works, let’s code up a k-nearest neighbors algorithm from scratch in Python.
In this lesson, you’ll be coding up your own kNN model from scratch using NumPy. So let’s go ahead and import that library, and we’ll alias it as np.
You’re going to continue working with the abalone dataset. So just to refresh your memory,
abalone is a DataFrame, and the top part of that DataFrame shows us that we have various physical measurements as well as a final column called
"Rings", which can tell biologists the age of an abalone.
We’re going to try to predict rings using our kNN model. To train a kNN model, we need two separate quantities:
X, which will contain our features, and
y, which will contain the targets.
So let’s go ahead and set those up. In this case,
X is coming from the
abalone DataFrame, but one very important thing that we must do here is drop the "Rings" column.
Because we’re going to try to predict rings with our kNN model, we absolutely cannot use this column when we feed in the features of our dataset. And because
"Rings" is a column, we actually need to drop this using
axis=1, which tells pandas that we’re going to drop a column.
We can take a look at X, and you’ll see that the
"Rings" column has now been removed. It turns out that it’s going to be easier to work with
X if we have a NumPy array instead of a pandas DataFrame. Right now,
X is a pandas DataFrame.
We want to convert this into a NumPy array. To do that, we’re going to overwrite
X with the values from this DataFrame. Now we can see that
X is a NumPy array, and checking its type with type(X) confirms it.
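Putting those steps together, setting up X might look like this. The snippet uses a small stand-in DataFrame with made-up values so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Small stand-in for the abalone DataFrame (made-up values).
abalone = pd.DataFrame({
    "Length": [0.455, 0.350],
    "Shell weight": [0.150, 0.070],
    "Rings": [15, 7],
})

# Drop the target column; axis=1 tells pandas to drop a column, not a row.
X = abalone.drop("Rings", axis=1)

# Convert the DataFrame into a plain NumPy array.
X = X.values

print(type(X))  # <class 'numpy.ndarray'>
print(X.shape)  # one row per abalone, one column per feature
```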
So now we have
X, the features that we’re going to use to train our kNN model. We also need
y, which will be the target values for these labeled observations.
y also comes from the
abalone DataFrame, but now we’re talking about the
"Rings" column. Right now,
y is a pandas Series, but we’d actually like that to be a NumPy array.
So once again, we’ll select the values from
y only, and we can check to see that, yes, that is now an array.
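The y setup can be sketched the same way, again with a small stand-in DataFrame so the snippet is self-contained:

```python
import numpy as np
import pandas as pd

# Small stand-in for the abalone DataFrame (made-up values).
abalone = pd.DataFrame({
    "Length": [0.455, 0.350],
    "Rings": [15, 7],
})

# Select the target column; .values turns the pandas Series
# into a NumPy array.
y = abalone["Rings"].values

print(type(y))  # <class 'numpy.ndarray'>
```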
You now have X and y, which contain all of the data that you’ll need to train your kNN model. But once that model’s created, you’d like to be able to use it to make predictions.
So now let’s go ahead and create a new data point. Say you have a new abalone, and you’ve gathered its physical measurements and recorded them in a chart. You have its length, its diameter, and so on, and you’d like to be able to make a prediction for how many rings this abalone has. You can create a variable called
new_data_point, which will be a NumPy array, and you’re just recording all of these measurements in that NumPy array.
Just note that all these measurements are in the exact same order as what you had in the original
abalone DataFrame, starting with length all the way down to shell weight.
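Recording the new observation might look like this. The measurement values below are made up for illustration; what matters is that they follow the same column order as the features in X:

```python
import numpy as np

# Hypothetical measurements for a new abalone, in the same order as the
# feature columns: Length, Diameter, Height, Whole weight, Shucked weight,
# Viscera weight, Shell weight.
new_data_point = np.array([0.569, 0.446, 0.154, 1.009, 0.450, 0.221, 0.290])

print(new_data_point.shape)  # (7,) -- one value per feature
```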
The first step when predicting with the kNN algorithm is to calculate the distance between your point of interest and every other observation in your dataset. So we’re going to do that using NumPy’s linalg.norm() function.
Let’s take a look at how
norm() works first. In NumPy’s linear algebra submodule, there is a function called
norm(), and this works by calculating the length of a vector that it’s given.
Say that we have a vector of
[1, 1]. Now we know from the Pythagorean theorem that the length of the hypotenuse should be the square root of two, and that’s exactly what
norm() gives us.
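Here’s that check in code: the norm of the vector [1, 1] comes out to the square root of two:

```python
import numpy as np

# norm() returns the Euclidean length of the vector it's given.
length = np.linalg.norm(np.array([1, 1]))

print(length)  # 1.4142135623730951 == sqrt(2), per the Pythagorean theorem
```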
So we’re going to be using
norm() to calculate distances here. Create a new variable called
distances, and here we’re going to store the distance between our new data point and every other observation in X.
So you can call up
norm() and pass to it
X, which contains all of your other abalones, minus the new_data_point.
You also want to specify
axis=1 here, and that just lets
norm() know that you’d like to take the length of the new vectors row-wise as opposed to column-wise.
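The distance calculation can be sketched like this, using a small stand-in X with made-up values so the snippet runs on its own:

```python
import numpy as np

# Stand-in feature matrix: 3 abalones, 2 features each (made-up values).
X = np.array([
    [0.455, 0.150],
    [0.350, 0.070],
    [0.530, 0.210],
])
new_data_point = np.array([0.500, 0.180])

# Broadcasting subtracts new_data_point from every row of X; axis=1 makes
# norm() measure each row's length, giving one Euclidean distance per abalone.
distances = np.linalg.norm(X - new_data_point, axis=1)

print(distances.shape)  # (3,) -- one distance per observation
```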
There’s actually a lot happening here, so let’s break this down a little bit further. Take a look at the shape of the X array. It’s (4177, 7), but if you take a look at the shape of the new_data_point,
it’s only got
7 values. So when you do
X minus the
new_data_point, you’re actually using NumPy’s broadcasting to do that.
NumPy knows that you want to subtract the
new_data_point from every row of
X. Once that subtraction’s done, you’re going to be taking the
norm() of every individual row, and can you guess what the shape of
distances is? Yep, it’s (4177,).
So you’ll actually have one distance for every single row that came from the
X array. This tells you how far away each observation is from your new data point.
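You can verify those shapes yourself. This sketch uses small random stand-in arrays; with the real data, X would have shape (4177, 7) and distances would have shape (4177,):

```python
import numpy as np

# Stand-in arrays: 4 observations, 7 features each (random values).
rng = np.random.default_rng(0)
X = rng.random((4, 7))
new_data_point = rng.random(7)

# (4, 7) minus (7,) broadcasts: the single row is subtracted
# from every row of X.
diff = X - new_data_point
distances = np.linalg.norm(diff, axis=1)

print(X.shape)               # (4, 7)
print(new_data_point.shape)  # (7,)
print(diff.shape)            # (4, 7)
print(distances.shape)       # (4,) -- one distance per row of X
```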
05:21 In this lesson, you prepared your data and calculated the distances between your new observation and every other abalone in the dataset. Next up, you’ll finish creating your kNN model and use it to make a prediction.