Simple Linear Regression: Background
00:57 And so the problem is going to be to find the coefficients b₀ and b₁ so that the estimated response, the value we get by plugging one of the observed inputs into our model, is as close as possible to the actual observed response, yᵢ, and we want this to hold for all observations i, from one through n.
01:40 One way to collectively minimize all of the residuals is to minimize the function RSS(b₀, b₁) = Σᵢ₌₁ⁿ (yᵢ − (b₀ + b₁xᵢ))², the residual sum of squares. As its name suggests, we compute the residual for each observation i, square it, and sum those squares from i = 1 through n. Now because the observations are fixed, the residual sum of squares function only depends on the coefficients b₀ and b₁.
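The residual sum of squares can be sketched in a few lines of Python. The data below are illustrative, not the lesson's data, and the function name `rss` is just a convenient label:

```python
# Minimal sketch: RSS for a candidate line y = b0 + b1 * x.
# Illustrative data, not the six points used later in the lesson.

def rss(b0, b1, xs, ys):
    """Sum of squared residuals (y_i - (b0 + b1 * x_i))^2 over all observations."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3]
ys = [2, 4, 6]

# The line y = 2x passes through all three points, so its RSS is 0.
print(rss(0, 2, xs, ys))  # 0
# The line y = 1 + x gives residuals 0, 1, 2, so RSS = 0 + 1 + 4 = 5.
print(rss(1, 1, xs, ys))  # 5
```

With the data held fixed, `rss` is a function of `b0` and `b1` alone, which is exactly the two-variable optimization problem the lesson describes.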
02:14 So when you think about this expression, these observations for the response and for the corresponding input, those are actual numerical values that are known, and this RSS function really only depends on b₀ and b₁.
02:29 So finding the b₀ and b₁ that minimize this RSS function is a standard optimization problem. Using some techniques from calculus, you can actually derive explicit formulas for the coefficients b₀ and b₁. Now in these formulas, x̄ and ȳ (read "x bar" and "y bar") denote the averages of the x and y values.
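The standard closed-form least-squares formulas are b₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and b₀ = ȳ − b₁x̄. A minimal sketch, using illustrative data and a hypothetical helper name `fit_simple_linear`:

```python
# Sketch of the closed-form least-squares formulas:
#   b1 = sum((x_i - x_bar) * (y_i - y_bar)) / sum((x_i - x_bar)^2)
#   b0 = y_bar - b1 * x_bar

def fit_simple_linear(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    return b0, b1

# These four points lie exactly on y = 1 + 2x, so the fit recovers b0 = 1, b1 = 2.
b0, b1 = fit_simple_linear([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # 1.0 2.0
```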
03:01 Now in the background, the scikit-learn module, which we'll see in a minute, will be computing these coefficients for you. So of course, you're not going to have to compute these manually using, say, these formulas or some other method. All right, that's the math behind simple linear regression.
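As a preview of the next lesson, here is a minimal sketch of letting scikit-learn compute the coefficients. It assumes scikit-learn and NumPy are installed, and again uses illustrative data:

```python
# Minimal sketch: fitting a simple linear regression with scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1, 2, 3, 4]).reshape(-1, 1)  # inputs as an n x 1 column
y = np.array([3, 5, 7, 9])                 # responses lying on y = 1 + 2x

model = LinearRegression().fit(x, y)
print(model.intercept_)  # b0, approximately 1.0
print(model.coef_[0])    # b1, approximately 2.0
```

The `reshape(-1, 1)` is needed because scikit-learn expects the inputs as a two-dimensional array with one column per feature, even when there is only one feature.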
03:24 This figure is a visual representation of what’s happening in simple linear regression. In this hypothetical test data, we’ve got six observations, and these are represented by the six green dots.
03:38 The x values or the inputs for these observations are at five, fifteen, twenty-five, thirty-five, forty-five, and fifty-five. And the black line represents the computed linear regression line for this test data.
04:07 And this is the data that we'll use when we go onto the next lesson, when we implement this in Python. Now for the computed linear regression line, the coefficient b₀ is, graphically, the point where the line intersects the y-axis, that is, the intercept.
04:45 And so these right there are the predicted responses or the estimated responses for the corresponding input values. The difference between the actual response, yᵢ, and the estimated or predicted response, f(xᵢ), is graphically represented by this vertical line.
05:07 So in this particular observation, the actual response—so the green dot—is greater than the estimated response, while in this observation, the actual response is less than the predicted response.
05:22 So to summarize, the main idea with simple linear regression is to find the best line that fits the data, where "best" means the line that minimizes the residual sum of squares.