
# Coding kNN From Scratch: Data and Distances

**00:00**
Now that you know how kNN works, let’s code up a k-nearest neighbors algorithm from scratch in Python.

**00:09**
In this lesson, you’ll be coding up your own kNN model from scratch using NumPy. So let’s go ahead and import that library, and we’ll alias it as `np`.

**00:18**
You’re going to continue working with the abalone dataset. So just to refresh your memory, `abalone` is a DataFrame, and the top part of that DataFrame shows us that we have various physical measurements as well as a final column called `"Rings"`, which can tell biologists the age of an abalone.
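The lesson works with the full abalone dataset (4,177 rows), which isn’t reproduced here. As a minimal sketch, a small stand-in DataFrame with the same seven numeric measurement columns plus `"Rings"` might look like this (the three rows of values are illustrative):

```python
import pandas as pd

# Tiny stand-in for the abalone DataFrame used in the lesson.
# The real dataset has 4,177 rows; the column names follow the
# abalone physical measurements, with "Rings" as the final column.
abalone = pd.DataFrame({
    "Length": [0.455, 0.350, 0.530],
    "Diameter": [0.365, 0.265, 0.420],
    "Height": [0.095, 0.090, 0.135],
    "Whole weight": [0.5140, 0.2255, 0.6770],
    "Shucked weight": [0.2245, 0.0995, 0.2565],
    "Viscera weight": [0.1010, 0.0485, 0.1415],
    "Shell weight": [0.150, 0.070, 0.210],
    "Rings": [15, 7, 9],
})
print(abalone.head())
```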

**00:36**
We’re going to try to predict rings using our kNN model. To train a kNN model, we need two separate quantities: `X`, which will contain our features, and `y`, which will contain the targets.

**00:48**
So let’s go ahead and set those up. In this case, `X` is coming from the `abalone` DataFrame, but one very important thing that we must do here is drop the `"Rings"` column.

**00:59**
Because we’re going to try to predict rings with our kNN model, we absolutely cannot use this column when we feed in the features of our dataset. And because `"Rings"` is a column, we actually need to drop this using `axis=1`, which tells pandas that we’re going to drop a column.
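In code, assuming `abalone` is a DataFrame like the one described above (shown here as a two-column stand-in), the drop looks like this:

```python
import pandas as pd

# Minimal stand-in for the abalone DataFrame.
abalone = pd.DataFrame({
    "Length": [0.455, 0.350],
    "Diameter": [0.365, 0.265],
    "Rings": [15, 7],
})

# Drop the target column; axis=1 tells pandas to drop a column, not a row.
X = abalone.drop("Rings", axis=1)
print(X.columns.tolist())
```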

**01:16**
We can take a look at the `.head()` of `X`, and you’ll see that the `"Rings"` column has now been removed. It turns out that it’s going to be easier to work with `X` if we have a NumPy array instead of a pandas DataFrame. Right now, `X` is a pandas DataFrame.

**01:32**
We want to convert this into a NumPy array. To do that, we’re going to overwrite `X` with the values from this DataFrame. Now we can see that `X` is a NumPy array, and if we check the type, we can also verify it’s a NumPy array.
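This conversion is a one-liner. As a sketch with a stand-in DataFrame:

```python
import numpy as np
import pandas as pd

# Stand-in feature DataFrame.
X = pd.DataFrame({"Length": [0.455, 0.350], "Diameter": [0.365, 0.265]})

# Overwrite X with the underlying NumPy array.
X = X.values

# Checking the type confirms the conversion.
print(type(X))
```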

**01:50**
So now we have `X`, the features that we’re going to use to train our kNN model. We also need `y`, which will be the target values for these labeled observations.

**02:00**
`y` also comes from the `abalone` DataFrame, but now we’re talking about the `"Rings"` column. Right now, `y` is a pandas Series, but we’d actually like that to be a NumPy array.

**02:11**
So once again, we’ll select the values from `y` only, and we can check to see that, yes, that is now an array.
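Again with a stand-in DataFrame, extracting the target as an array looks like this:

```python
import numpy as np
import pandas as pd

# Stand-in for the abalone DataFrame.
abalone = pd.DataFrame({"Length": [0.455, 0.350], "Rings": [15, 7]})

# Select the "Rings" column (a Series) and take its values as a NumPy array.
y = abalone["Rings"].values
print(type(y))
```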

**02:21**
You now have `X` and `y`, which contain all of the data that you’ll need to train your kNN model. But once that model’s created, you’d like to be able to use it to make predictions.

**02:31**
So now let’s go ahead and create a new data point. Say you have a new abalone, and you’ve gathered its physical measurements and recorded them in a chart. You have its length, its diameter, and so on, and you’d like to be able to make a prediction for how many rings this abalone has. You can create a variable called `new_data_point`, which will be a NumPy array, and you’re just recording all of these measurements in that NumPy array.

**02:57**
Just note that all these measurements are in the exact same order as what you had in the original `abalone` DataFrame, starting with length all the way down to shell weight.
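The exact measurements from the video aren’t reproduced in this text, so the seven values below are illustrative placeholders; what matters is that they follow the DataFrame’s column order, length through shell weight:

```python
import numpy as np

# Seven illustrative measurements (placeholder values), in the same
# column order as the abalone DataFrame: Length, Diameter, Height,
# Whole weight, Shucked weight, Viscera weight, Shell weight.
new_data_point = np.array([0.569, 0.446, 0.154, 1.016, 0.439, 0.222, 0.291])
print(new_data_point.shape)
```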

**03:09**
The first step when predicting with the kNN algorithm is to calculate the distance between your point of interest and every other observation in your dataset. So we’re going to do that using NumPy’s `norm()` function.

**03:23**
Let’s take a look at how `norm()` works first. In NumPy’s linear algebra submodule, there is a function called `norm()`, and this works by calculating the length of a vector that it’s given.

**03:35**
Say that we have a vector of `[1, 1]`. Now we know from the Pythagorean theorem that the length of the hypotenuse should be the square root of two, and that’s exactly what `norm()` gives us.
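This check is quick to run:

```python
import numpy as np

# The length (Euclidean norm) of the vector [1, 1] is sqrt(2).
length = np.linalg.norm(np.array([1, 1]))
print(length)  # → 1.4142135623730951
```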

**03:47**
So we’re going to be using `norm()` to calculate distances here. Create a new variable called `distances`, and here we’re going to store the distance between our new data point and every other observation in `X`.

**04:00**
So you can call up `norm()` and pass to it `X`, which contains all of your other abalones, minus the `new_data_point`.

**04:10**
You also want to specify `axis=1` here, and that just lets `norm()` know that you’d like to take the length of the new vectors row-wise as opposed to column-wise.
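Putting this together, here is a sketch of the distance computation on a toy three-row, three-feature array standing in for the full 4177-by-7 abalone features:

```python
import numpy as np

# Toy feature array standing in for the 4177 x 7 abalone features.
X = np.array([
    [0.455, 0.365, 0.095],
    [0.350, 0.265, 0.090],
    [0.530, 0.420, 0.135],
])
new_data_point = np.array([0.500, 0.400, 0.100])

# Broadcasting subtracts new_data_point from every row of X;
# axis=1 takes the norm of each row, giving one distance per abalone.
distances = np.linalg.norm(X - new_data_point, axis=1)
print(distances.shape)
```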

**04:21**
There’s actually a lot happening here, so let’s break this down a little bit further. Take a look at the shape of the `X` array. It’s `4177` by `7`, but if you take a look at the shape of the `new_data_point`,

**04:38**
it’s only got `7` values. So when you do `X` minus the `new_data_point`, you’re actually using NumPy’s broadcasting to do that.

**04:47**
NumPy knows that you want to subtract the `new_data_point` from every row of `X`. Once that subtraction’s done, you’re going to be taking the `norm()` of every individual row, and can you guess what the shape of `distances` is? Yep, `4177`.
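You can verify the shapes involved in this broadcast directly (using zero-filled arrays just to demonstrate the shapes):

```python
import numpy as np

X = np.zeros((4177, 7))       # features: one row per abalone
new_data_point = np.zeros(7)  # a single observation, shape (7,)

# Broadcasting: the (7,) vector is subtracted from each of the 4177 rows.
diff = X - new_data_point

# One norm per row, so one distance per abalone.
distances = np.linalg.norm(diff, axis=1)
print(X.shape, new_data_point.shape, distances.shape)
```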

**05:06**
So you’ll actually have one distance for every single value that came from the `X` array. This tells you how far away that observation is from your `new_data_point`.

**05:21**
In this lesson, you prepared your data and calculated the distances between your new observation and every other abalone in the dataset. Next up, you’ll finish creating your kNN model and use it to make a prediction.
