Coding kNN From Scratch: Data and Distances
You’re going to continue working with the abalone dataset. So just to refresh your memory, abalone is a DataFrame, and the top part of that DataFrame shows us that we have various physical measurements as well as a final column called "Rings", which can tell biologists the age of an abalone.
Because we’re going to try to predict rings with our kNN model, we absolutely cannot use this column when we feed in the features of our dataset. And because "Rings" is a column, we need to drop it using axis=1, which tells pandas that we’re going to drop a column rather than a row. We’ll store the remaining features in X.
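As a minimal sketch of that step, the snippet below drops "Rings" from a tiny stand-in DataFrame. The column names and values are made up for illustration, and the real abalone data has more rows and columns.

```python
import pandas as pd

# Hypothetical stand-in for the abalone DataFrame; the values are made up.
abalone = pd.DataFrame(
    {
        "Length": [0.455, 0.350, 0.530],
        "Diameter": [0.365, 0.265, 0.420],
        "Rings": [15, 7, 9],
    }
)

# axis=1 tells pandas to drop a column rather than a row.
X = abalone.drop("Rings", axis=1)

print(X.columns.tolist())  # ['Length', 'Diameter']
```

Note that drop() returns a new DataFrame by default, so abalone itself still contains the "Rings" column.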
We can take a look at X, and you’ll see that the "Rings" column has now been removed. It turns out that it’s going to be easier to work with X if we have a NumPy array instead of a pandas DataFrame. Right now, X is a pandas DataFrame. We want to convert this into a NumPy array. To do that, we’re going to overwrite X with the values from this DataFrame. Now when we display X, it prints as a NumPy array, and checking its type confirms that.
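The conversion might look like the following sketch, again using a small made-up DataFrame in place of the real features.

```python
import numpy as np
import pandas as pd

# Hypothetical feature DataFrame; the values are made up.
X = pd.DataFrame({"Length": [0.455, 0.350], "Diameter": [0.365, 0.265]})
print(type(X))  # a pandas DataFrame

# Overwrite X with the underlying values, which are a NumPy array.
X = X.values
print(type(X))  # a NumPy ndarray
```

The pandas documentation recommends DataFrame.to_numpy() for the same purpose; .values is used here because it matches the wording "the values from this DataFrame".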
So now let’s go ahead and create a new data point. Say you have a new abalone, and you’ve gathered its physical measurements and recorded them in a chart. You have its length, its diameter, and so on, and you’d like to be able to make a prediction for how many rings this abalone has. You can create a variable called new_data_point, which will be a NumPy array, and you’re just recording all of these measurements in that NumPy array.
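Here is a sketch of such a data point. The seven numbers are hypothetical placeholders, not real measurements; the only requirement is one value per feature column in X, in the same order.

```python
import numpy as np

# Hypothetical measurements: one value per feature column in X (seven in total).
new_data_point = np.array([0.57, 0.45, 0.15, 1.02, 0.44, 0.22, 0.29])

print(new_data_point.shape)  # (7,)
```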
The first step when predicting with the kNN algorithm is to calculate the distance between your point of interest and every other observation in your dataset. We’re going to do that using norm() from NumPy’s linalg module. Create a new variable called distances, and here we’re going to store the distance between our new data point and every other observation in X.
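A minimal sketch of the distance calculation, assuming Euclidean distance and using a tiny made-up array so the results are easy to check by hand:

```python
import numpy as np

# Tiny made-up feature array: 3 observations with 2 features each.
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
new_data_point = np.array([0.0, 0.0])

# Subtract the new point from every row, then take the norm of each row.
# axis=1 makes norm() work row by row instead of over the whole array.
distances = np.linalg.norm(X - new_data_point, axis=1)

print(distances.tolist())  # [0.0, 5.0, 10.0]
```

Without axis=1, norm() would return a single number for the whole array rather than one distance per observation.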
There’s actually a lot happening here, so let’s break this down a little bit further. Take a look at the shape of the X array. It has one row per abalone and 7 columns, but if you take a look at the shape of new_data_point, it’s a one-dimensional array of just 7 values. Through broadcasting, NumPy knows that you want to subtract new_data_point from every row of X. Once that subtraction’s done, you’re going to be taking the norm() of every individual row, and can you guess what the shape of distances is? Yep, it’s one-dimensional, with one distance for every row of X.
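To see how the shapes line up, here is a small sketch with placeholder arrays. The row count of 5 is arbitrary and stands in for the number of abalones in the dataset.

```python
import numpy as np

# Placeholder arrays: 5 observations (arbitrary) with 7 features each.
X = np.zeros((5, 7))
new_data_point = np.ones(7)

# Broadcasting stretches the 7-value point across every row of X.
differences = X - new_data_point

# Taking the norm of each row collapses the 7 columns into one distance.
distances = np.linalg.norm(differences, axis=1)

print(X.shape)               # (5, 7)
print(new_data_point.shape)  # (7,)
print(differences.shape)     # (5, 7)
print(distances.shape)       # (5,)
```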
In this lesson, you prepared your data and calculated the distances between your new observation and every other abalone in the dataset. Next up, you’ll finish creating your kNN model and use it to make a prediction.