Using kNN in scikit-learn: Data, Fit, and Predict
00:12 scikit-learn is one of the most comprehensive and most popular machine learning packages in Python. You can use it to do all kinds of data science operations, and in this lesson, you’ll use it to easily create a kNN model.
00:27 You’re about to dive into the scikit-learn code, but here’s a preview of the steps you’ll take. First, the abalone data will be split into two components, the training dataset and the test set. By splitting the data, you’ll be able to evaluate the performance of your kNN model.
00:45 Once your model is created, the training set will be used as the potential neighbors. Then you’ll make predictions for the test dataset and score those predictions against the actual abalone rings of the test data.
In a previous lesson, you set up X and
y, which are NumPy arrays.
X contains the features of your abalone dataset, while
y contains the target values that you’ll try to predict. And recall that in this example, you want to predict the number of rings an abalone has based on its physical measurements because the rings indicate the age of the abalone, but can be difficult for biologists to obtain.
Now create four new variables: X_train, X_test, y_train, and y_test. Those will be equal to the output of
train_test_split() applied to
X and y. Note that
train_test_split() is able to split up as many arrays as you’d like here, and it will split them in the same way.
So the rows, or abalone, that end up in
X_train will be the exact same rows that end up in
y_train. So by using
train_test_split(), you don’t have to worry about an abalone’s physical measurements failing to match up with its labels.
As it stands now, every time you rerun
train_test_split(), you’ll get another random set of rows in the training set, and those rows are going to change each time. So if we rerun this, we’re going to see a different selection of rows in
y_train. Now, this isn’t ideal for reproducing results, so you can set
random_state to be an integer—
any integer, it doesn’t matter which one—and you’ll get the same random split of your data each time you rerun your code. So if we rerun this, we’re going to get the same
y_train values. Okay, so what do you have? Checking the shape of all of these variables,
X_train is an array with just over 3000 rows, and the remaining rows ended up in X_test.
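The splitting steps described above can be sketched as follows. This is a minimal, self-contained sketch: the random arrays stand in for the real abalone features and ring counts, and the `test_size=0.25` and `random_state=12345` values are assumptions for illustration, since the lesson doesn't show the exact numbers used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the abalone data: 4177 rows and
# 7 feature columns, like the real dataset's dimensions.
rng = np.random.default_rng(0)
X = rng.random((4177, 7))
y = rng.integers(1, 30, size=4177).astype(float)

# One call splits both arrays the same way, so row i of X_train
# still lines up with element i of y_train. Setting random_state
# (any integer works) makes the split reproducible across reruns.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

print(X_train.shape)  # (3132, 7) -> just over 3000 training rows
print(X_test.shape)   # (1045, 7)
```

Rerunning `train_test_split()` with the same `random_state` returns the identical split, which is what makes your results reproducible.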
Now let’s create a kNN model and fit it to the data. First, you’ll need to pull the kNN codebase from the scikit-learn library. So,
import KNeighborsRegressor, and note that you’re using
Regressor here because the target values are numeric.
Let’s call it
knn_model. This will store the model’s attributes: its hyperparameters, anything the model learns from the training data, and so on. So for now,
knn_model will be equal to KNeighborsRegressor().
But there are also a few properties that aren’t available yet. For example,
n_features_in_. You get an error if you try to look at this property right now because the model hasn’t been fitted to any training data yet.
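A short sketch of this behavior: hyperparameters like `n_neighbors` are set as soon as the model is created, but learned attributes like `n_features_in_` don't exist until after fitting.

```python
from sklearn.neighbors import KNeighborsRegressor

# Create the (still unfitted) model. n_neighbors defaults to 5.
knn_model = KNeighborsRegressor()

# Hyperparameters are available right away...
print(knn_model.n_neighbors)  # 5

# ...but learned attributes like n_features_in_ don't exist yet,
# so looking one up raises an AttributeError.
try:
    knn_model.n_features_in_
except AttributeError:
    print("model has not been fitted yet")
```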
Next, call knn_model.fit(). The first argument should be your features, and the second argument should be your targets. At this point, your model has learned all it needs to be able to make predictions, and several properties that it didn’t know before are now set. For example, if we now check
knn_model.n_features_in_, we would see that there are
7 features or columns that were used to train this kNN model.
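The fitting step looks like this in code. The random arrays here are hypothetical stand-ins for the training split, with 7 columns to mirror the abalone features from the lesson.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical stand-in for the training split: 7 feature columns,
# matching the abalone features described in the lesson.
rng = np.random.default_rng(0)
X_train = rng.random((100, 7))
y_train = rng.random(100) * 20.0  # stand-in ring counts

knn_model = KNeighborsRegressor()

# fit() takes the features first and the targets second.
knn_model.fit(X_train, y_train)

# After fitting, learned attributes are available:
print(knn_model.n_features_in_)  # 7
```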
Once your kNN model has been trained, you can use it to make predictions. Your kNN model has a method called
knn_model.predict(), and you only need to pass in the features that you’d like to make predictions for, such as X_train.
Seems like those are on the right scale, since most of the abalone in our dataset had about ten rings. You can go on to compare these predictions against the actual ring values in
y_train. And even more importantly, you can make predictions for the test set.
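The prediction step can be sketched like this, again with random stand-in data in place of the real abalone splits. Each kNN prediction is the average of the target values of the k nearest training rows, which is why the outputs land on the same scale as the ring counts.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical stand-in data in place of the abalone splits.
rng = np.random.default_rng(0)
X_train = rng.random((200, 7))
y_train = rng.integers(3, 20, size=200).astype(float)
X_test = rng.random((50, 7))

knn_model = KNeighborsRegressor().fit(X_train, y_train)

# predict() only needs features; each prediction is the average of
# the ring counts of the 5 nearest training abalone.
train_preds = knn_model.predict(X_train)
test_preds = knn_model.predict(X_test)

print(train_preds.shape)  # (200,) -- compare element-wise with y_train
print(test_preds.shape)   # (50,)  -- predictions for unseen abalone
```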
08:57 In this lesson, you leveraged Python’s scikit-learn library to build a kNN model and make predictions with it. Coming up next, you’ll continue working in scikit-learn to score your predictions and explore the hyperparameter k, the number of neighbors considered when making a prediction.