Understanding How kNN Works
00:08 Recall that kNN is a supervised learning algorithm that learns from training data with labeled target values. Unlike most other machine learning algorithms, which learn during the training phase, kNN performs nearly all of its calculations during the prediction phase.
01:22 The intuition behind kNN makes a lot of sense. Think about the neighbors where you live. You likely have a lot in common with them. You're likely in a similar socioeconomic class, perhaps you do similar types of professional work, or maybe their children attend the same school as yours, and so on. But of course, this neighborhood approach doesn't always make sense. For example, you wouldn't compare yourself to your neighbors when trying to predict your favorite color. And the same goes for using kNN. For some problems, this method will be highly successful, but for others, it will not.
02:15 When it's time to make a prediction, kNN follows two steps: first, it finds the nearest neighbors of the data point in question, and then it makes a prediction based on those neighbors' targets. Let's talk a little bit more about that first step. You may have a few questions about this part of the algorithm, namely: how should nearest be defined, and how many neighbors should be considered?
02:30 There are a few different ways that nearest can be defined, but typically you'll use Euclidean distance to judge how close data points are to one another. Say that you have a problem with two variables, height and width, and you want to know how close two data points, A and B, are to each other.
03:02 If you want to measure the distance between A and B, you can follow vector a to its endpoint and then draw the vector negative b, which is the same length as vector b but points in the opposite direction. Following negative b from the endpoint of a lands you at the endpoint of the vector a minus b, and the length of that vector is the distance between the two points.
03:42 The Euclidean distance between points A and B is defined as the square root of the quantity a1 minus b1 squared, plus a2 minus b2 squared, and so on, all the way up to an minus bn squared, where n represents the number of variables that your data has.
04:03 This formula is equivalent to the norm of vector a minus vector b, since the norm measures the length of a vector. And if all of this reminds you of the Pythagorean theorem, you're absolutely right.
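If you'd like to see this in code, here's a minimal sketch using NumPy with two made-up points. Both calculations produce the same distance, once from the formula written out directly and once using the norm:

```python
import numpy as np

# Two points, each described by the same two features: height and width
a = np.array([3.0, 4.0])
b = np.array([1.0, 1.0])

# The formula written out: square root of the sum of squared differences
distance = np.sqrt(np.sum((a - b) ** 2))

# Equivalent: the norm of the vector a - b
same_distance = np.linalg.norm(a - b)

print(distance)       # 3.605551275463989
print(same_distance)  # 3.605551275463989
```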
04:24 Okay, now you know how nearness is defined. What about the other question? How many neighbors should be considered by the algorithm? Well, the k in k-nearest neighbors represents the number of neighbors. At minimum, k will be equal to one, which would mean that when a new data point needs a prediction, you'll only look at its single nearest neighbor from the training set to assign it a target. On the other end of the spectrum, at maximum, k could be equal to the total number of points in the training set.
04:55 In that case, you would pool the targets of the entire training set anytime a new data point needed a prediction. Neither of these options, the minimum nor the maximum k, is typically used in practice.
05:10 Instead, you'll need to optimize k to suit your individual problem. k is a so-called hyperparameter of the k-nearest neighbors algorithm, which can be set using validation, cross-validation, or some other strategy.
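As a sketch of what that tuning might look like, here's one common strategy using scikit-learn's GridSearchCV. The synthetic dataset and the range of candidate k values are just placeholders for your real problem:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for a real training set
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

# Cross-validate each candidate k and keep the best-scoring one
param_grid = {"n_neighbors": list(range(1, 26))}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # for example, {'n_neighbors': 7}
```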
05:46 Once the neighbors have been found, the second step is to make a prediction based on their targets. For classification problems, you'll use a majority vote: kNN predicts the mode, or most common, of the class targets of the point's neighbors. Once again, say that you're considering two variables, height and width. You have a training set for these two features, like this one.
06:07 When it comes time to make a prediction for a new data point with a new height and width, your kNN algorithm will look at the point's neighbors, and perhaps you want k to be equal to 5. kNN will find the five nearest training data points. Since these are training points, they have known class labels, so maybe this one's a square, and this one's a circle.
06:33 You have a square, a triangle, and another square. What should the prediction for the data point in question be? Well, each neighbor gets one vote, and since there are three votes for the square class, kNN predicts the new data point will be a square as well.
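In code, that vote could look like the following sketch, using the neighbor labels from this example:

```python
from collections import Counter

# Class labels of the five nearest neighbors from the example
neighbor_labels = ["square", "circle", "square", "triangle", "square"]

# Each neighbor gets one vote, and the most common label wins
prediction = Counter(neighbor_labels).most_common(1)[0][0]
print(prediction)  # square
```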
07:07 You’ll soon be considering the abalone dataset once again, which is a regression problem. So how does kNN make predictions for regression? kNN predicts the average of the nearest neighbors’ targets for regression problems. Given the same setup for height and width, if you want to make a regression prediction for a new data point based on five neighbors, once again, find the five closest neighbors.
07:34 But now the targets of those neighbors are numeric. Maybe this one has a value of 4.5, and this one is 2.1. You have 3.3, 4.2, and 2.9. kNN will take those five values, compute their average, and predict a value of 3.4 for the test point in question.
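Here's the same calculation as a quick sketch in code, using the five neighbor targets from this example:

```python
# Target values of the five nearest neighbors from the example
neighbor_targets = [4.5, 2.1, 3.3, 4.2, 2.9]

# For regression, kNN predicts the mean of the neighbors' targets
prediction = sum(neighbor_targets) / len(neighbor_targets)
print(prediction)  # 3.4
```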
To recap, kNN relies on the guiding principle that data points are similar to their neighbors. Training a kNN model basically just involves memorizing a training set of data points.
08:14 But when it comes time to make a prediction, kNN follows two steps. First, find the nearest neighbors of the data point of interest. You'll almost always use Euclidean distance to judge closeness.
08:27 And you’ll also need to set k, the number of neighbors to consider. Once those neighbors have been found, make a prediction based on the targets of those neighbors. For classification, that means taking a majority vote of the neighbors’ class targets, and for regression, that means averaging the neighbors’ target values together.
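To tie both steps together, here's a minimal from-scratch sketch of the whole algorithm. The function name and the toy data are just for illustration, and in practice you'd typically reach for scikit-learn's KNeighborsClassifier or KNeighborsRegressor instead:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, new_point, k=5, classify=True):
    # Step 1: Euclidean distance from the new point to every training
    # point, then the indices of the k smallest distances
    distances = np.linalg.norm(X_train - new_point, axis=1)
    nearest = distances.argsort()[:k]

    # Step 2: predict from the targets of those neighbors
    targets = y_train[nearest]
    if classify:
        return Counter(targets).most_common(1)[0][0]  # majority vote
    return targets.mean()  # average for regression

# Toy regression example: five training points with numeric targets
X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0],
                    [6.0, 5.0], [7.0, 8.0]])
y_train = np.array([1.2, 2.3, 1.1, 4.5, 5.0])
print(knn_predict(X_train, y_train, np.array([2.0, 2.0]),
                  k=3, classify=False))  # roughly 1.53
```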