Understanding How kNN Works
00:00 By the end of this lesson, you’ll be able to explain how the k-nearest neighbors algorithm works.
00:08 Recall that kNN is a supervised learning algorithm that learns from training data with labeled target values. Unlike most other machine learning algorithms, which do most of their work during the training phase, kNN performs nearly all of its calculations during the prediction phase.
00:26 kNN is a nonlinear algorithm capable of learning complex patterns, and it’s also nonparametric, so it does not summarize the training data into a set of prescribed parameters.
00:39 And you can use kNN for either classification or regression problems.
00:47 The k-nearest neighbors algorithm boils down to one main idea: kNN assumes that data points are similar to their neighbors.
00:58 When it comes time for a trained kNN model to make a prediction for a new data point, the model will first find that new point’s nearest neighbors from the training dataset.
01:10 Then it makes predictions based on the targets of those neighbors, and that’s pretty much all there is to making a prediction with kNN. Pretty simple, right?
01:22 The intuition behind kNN really makes a lot of sense. Think about the neighbors where you live. You likely have a lot in common with them. You’re probably in a similar socioeconomic class, perhaps you do similar types of professional work, or maybe their children attend the same school as yours, and so on. But of course, this neighborhood approach doesn’t always make sense. For example, you wouldn’t compare yourself to your neighbors when trying to predict your favorite color. And the same goes for using kNN. For some problems, this method will be highly successful, but for others, it will not.
02:03 So predicting with k-nearest neighbors involves two primary steps: finding the data point of interest’s nearest neighbors, and then making a prediction based on those neighbors’ targets.
02:15 Let’s talk a little bit more about that first step. You may have a few questions about this part of the algorithm—namely, how should nearest be defined, and how many neighbors should be considered?
02:30 There are a few different ways that nearest can be defined, but typically you’ll use Euclidean distance to judge how close data points are to one another. Say that you have a problem with two variables, height and width, and you want to know how close two data points, A and B, are to each other.
02:50 These data points actually represent vectors in two-dimensional space. Vector a spans from the origin to point A, while vector b goes from the origin to point B.
03:02 If you want to measure the distance between A and B, you can follow vector a to its endpoint and then draw the vector negative b, which is the same length as vector b but points in the opposite direction.
03:17 The point where this new vector ends corresponds to the vector a minus b, and the length of a minus b is equal to the distance between points A and B.
03:31 So, usually, nearness is measured by the length of the vector between points A and B, and you can use the Euclidean distance formula to compute that length.
03:42 The Euclidean distance between points A and B is defined as the square root of the quantity a1 minus b1 squared, plus a2 minus b2 squared, and so on, all the way up to an minus bn squared, where n represents the number of variables that your data has.
04:03 This formula is equivalent to the norm of vector a minus vector b, since the norm measures the length of a vector. And if all of this reminds you of the Pythagorean theorem, you’re absolutely right.
04:17 The Euclidean distance formula can be derived from the Pythagorean theorem.
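Here’s a quick sketch of that distance calculation in Python. The two points below are made up for illustration, and NumPy is assumed to be available even though it hasn’t been introduced in this lesson:

```python
import numpy as np

# Two made-up points in two-dimensional (height, width) space.
a = np.array([2.0, 3.0])
b = np.array([5.0, 7.0])

# Euclidean distance as the length (norm) of the difference vector a - b.
distance = np.linalg.norm(a - b)

# The same result, written out directly from the distance formula.
distance_by_hand = np.sqrt(np.sum((a - b) ** 2))

print(distance)          # 5.0
print(distance_by_hand)  # 5.0
```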
04:24 Okay, now you know how nearness is defined. What about the other question? How many neighbors should be considered by the algorithm? Well, the k in k-nearest neighbors represents the number of neighbors. At minimum, k will be equal to one, which would mean that when a new data point needs a prediction, you’ll only look at its single nearest neighbor from the training set to assign it a target. On the other end of the spectrum, at maximum, k could be equal to the total number of points in the training set.
04:55 In that case, you would pool the targets of the entire training set anytime a new data point needed a prediction. Neither extreme, the minimum nor the maximum k, is typically used in practice.
05:10 Instead, you’ll need to optimize k to suit your individual problem. k is a so-called hyperparameter of the k-nearest neighbors algorithm, which can be set using validation, cross-validation, or with some other strategy.
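The lesson doesn’t settle on a specific tuning strategy here, but as one possible sketch, scikit-learn’s cross-validation tools (an assumption on my part, since this library isn’t part of the lesson) can search over candidate values of k, exposed as the n_neighbors parameter. The dataset below is just a stand-in:

```python
from sklearn.datasets import load_iris  # stand-in dataset, purely for illustration
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Cross-validate a range of candidate k values and keep the best one.
search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 16))},
    cv=5,
)
search.fit(X, y)

print(search.best_params_)  # e.g. {'n_neighbors': ...}
```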
05:28 Returning to the main steps of kNN, you know about finding the neighbors in step one. So what about step two? How can you make predictions from those neighbors’ targets?
05:40 It turns out that this depends on your type of supervised learning problem.
05:46 For classification problems, you’ll use majority vote: kNN predicts the mode, or most common class, among the targets of the point’s neighbors. Once again, say that you’re considering two variables, height and width, and you have a training set for these two features, like this one.
06:07 When it comes time to make a prediction for a new data point with a new height and width, your kNN algorithm will look at the point’s neighbors. Perhaps you want k to be equal to 5, so kNN will find the five nearest training data points. Since these are training points, they have known target class labels, so maybe this one’s a square, and this one’s a circle.
06:33 You have a square, a triangle, and another square. What should the prediction for the data point in question be? Well, each neighbor gets one vote, and since there are three votes for the square class, kNN predicts the new data point will be a square as well.
06:53 You may be wondering what happens in the event of a tie. The default behavior for most kNN implementations is to break ties by prioritizing the vote of the closest neighbor.
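As a minimal from-scratch sketch of both steps for classification, here’s one way the majority vote and that closest-neighbor tie-break could look. The toy arrays and the knn_classify name are illustrative stand-ins, not the lesson’s code:

```python
import numpy as np
from collections import Counter

def knn_classify(new_point, X_train, y_train, k=5):
    # Step 1: Euclidean distances from new_point to every training point.
    distances = np.linalg.norm(X_train - new_point, axis=1)
    nearest = np.argsort(distances)[:k]    # indices of the k closest points
    neighbor_labels = y_train[nearest]     # ordered from closest to farthest

    # Step 2: majority vote over the neighbors' class labels.
    votes = Counter(neighbor_labels).most_common(2)
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        # Tie: fall back to the label of the single closest neighbor.
        return neighbor_labels[0]
    return votes[0][0]

# Toy training set with (height, width) features and class labels.
X_train = np.array([[1.0, 1.0], [1.2, 0.9], [3.0, 3.1], [0.8, 1.1], [2.9, 3.0]])
y_train = np.array(["square", "square", "circle", "square", "circle"])

print(knn_classify(np.array([1.0, 1.0]), X_train, y_train, k=5))  # square
```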
07:07 You’ll soon be considering the abalone dataset once again, which is a regression problem. So how does kNN make predictions for regression? kNN predicts the average of the nearest neighbors’ targets for regression problems. Given the same setup for height and width, if you want to make a regression prediction for a new data point based on five neighbors, once again, find the five closest neighbors.
07:34 But now the targets of those neighbors are numeric. Maybe this one has a value of 4.5, and this one is 2.1. You have 3.3, 4.2, and 2.9. kNN will take those five values, compute their average, and then predict an average value of 3.4 for the test point in question. To recap, kNN relies on the guiding principle that data points are similar to their neighbors. Training a kNN model basically just involves memorizing a training set of data points.
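Just to confirm the arithmetic from that example, here’s the averaging step on its own. The five values are the neighbor targets from the example, and finding those neighbors works exactly as before:

```python
import numpy as np

# Targets of the five nearest neighbors from the example.
neighbor_targets = np.array([4.5, 2.1, 3.3, 4.2, 2.9])

# For regression, kNN predicts the average of the neighbors' targets.
prediction = neighbor_targets.mean()
print(round(prediction, 1))  # 3.4
```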
08:14 But when it comes time to make a prediction, kNN follows two steps. First, find the nearest neighbors of the data point of interest. You’ll almost always use Euclidean distance to judge closeness.
08:27 And you’ll also need to set k, the number of neighbors to consider. Once those neighbors have been found, make a prediction based on the targets of those neighbors. For classification, that means taking a majority vote of the neighbors’ class targets, and for regression, that means averaging the neighbors’ target values together.
08:52 Up next, you’ll start creating your own kNN model from scratch by writing Python code.