Simple Linear Regression: Background

Cesar Aguilar

Starting With Linear Regression in Python (05:46)

00:00 In this first lesson, we’ll introduce the basics of regression by taking a look at the simplest case, called simple linear regression, where we only have one scalar input.

00:12 We’re going to use simple linear regression to introduce a lot of the ideas that we’re going to need when we talk about multiple linear regression and polynomial regression.

00:22 In simple linear regression, there’s only one independent variable, which we’re going to denote simply by x instead of putting a subscript, x₁.

00:31 So the model f depends only on that input variable x, and the unknowns are the coefficients b₀ and b₁, and we assume a linear model. Now, as in the general case, we have n observations.
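
The slide is not reproduced in this transcript, so as a reading aid, the linear model described here can be written as follows (standard notation, assumed to match what is shown on screen):

```latex
f(x) = b_0 + b_1 x
```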

00:46 So we’ve got an observation for the input and its associated observation for the output, and we’ve got n of these.

00:57 And so the problem is to find the coefficients b₀ and b₁ so that the estimated response, which is what we get by plugging one of the observed inputs into our model, is as close as possible to the actual observed response, yᵢ. And we want to do this for all observations, i, from one through n.

01:25 Now the differences between the actual responses, yᵢ, and the estimated responses from our fitted model are called the residuals. And so we’re going to have n residuals.
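
As a minimal sketch of what a residual is, the snippet below computes one residual per observation. The x values are the six inputs mentioned later in this lesson. The y values and the coefficients `b0` and `b1` are hypothetical, chosen only for illustration, and are not taken from the lesson's slides.

```python
# Inputs from the lesson; responses are hypothetical illustration values.
x = [5, 15, 25, 35, 45, 55]
y = [8, 18, 17, 30, 27, 41]

# Hypothetical coefficients (not fitted to the data).
b0, b1 = 5.0, 0.5


def f(x_i):
    """Linear model: estimated response for a single input."""
    return b0 + b1 * x_i


# One residual per observation: actual response minus estimated response.
residuals = [y_i - f(x_i) for x_i, y_i in zip(x, y)]
print(residuals)
```

With n = 6 observations, the list holds six residuals, some positive and some negative.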

01:40 One way to collectively minimize all of the residuals is to minimize this function. That’s usually called RSS, which stands for residual sum of squares. And as its name suggests, what we’re doing is computing the residual for each observation, i, squaring it, and summing those up from i=1 through n. Now because the observations are fixed, the residual sum of squares function only depends on the coefficients b₀ and b₁.
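
Since the on-screen expression is not part of the transcript, here is the residual sum of squares written out in standard notation (assumed to be equivalent to the slide):

```latex
\mathrm{RSS}(b_0, b_1)
  = \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2
  = \sum_{i=1}^{n} \bigl(y_i - b_0 - b_1 x_i\bigr)^2
```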

02:14 So when you think about this expression, these observations for the response and for the corresponding input, those are actual numerical values that are known, and this RSS function really only depends on b₀ and b₁.
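
To make the point concrete, here is a small sketch in which the observations are fixed data and RSS is a function of `b0` and `b1` only. The y values are hypothetical, chosen for illustration, and the function name `rss` is likewise an assumption.

```python
# Fixed observations: inputs from the lesson, hypothetical responses.
x = [5, 15, 25, 35, 45, 55]
y = [8, 18, 17, 30, 27, 41]


def rss(b0, b1):
    """Residual sum of squares for the fixed observations above."""
    return sum((y_i - (b0 + b1 * x_i)) ** 2 for x_i, y_i in zip(x, y))


# Different coefficient choices give different RSS values.
print(rss(5.0, 0.5))
print(rss(0.0, 1.0))
```

Of these two candidate lines, the one with the smaller RSS fits the observations more closely.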

02:29 So finding the b₀ and b₁ that minimize this RSS function is a standard optimization problem. Using some techniques from calculus, you can actually derive explicit formulas for the coefficients b₀ and b₁. Now in these formulas, x̄ and ȳ (the overline notation, read as x-bar and y-bar) denote the averages of the observed x and y values.
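
The formulas themselves appear only on the slide. One standard way to write them, obtained by setting the partial derivatives of RSS with respect to b₀ and b₁ to zero, is shown below. The slide may use an algebraically equivalent form.

```latex
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
           {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b_0 = \bar{y} - b_1 \bar{x},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\quad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i
```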

02:57 And so these are nice closed-form formulas for those coefficients.
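
As a rough sketch, the standard closed-form least-squares solution can be coded in a few lines of plain Python. The function name `fit_simple_linear` and the y values are assumptions made for illustration, and the formula used is the textbook one, which may be written differently on the slide.

```python
def fit_simple_linear(x, y):
    """Return (b0, b1) minimizing the residual sum of squares.

    Uses the textbook closed form:
        b1 = sum((x_i - x_bar) * (y_i - y_bar)) / sum((x_i - x_bar) ** 2)
        b0 = y_bar - b1 * x_bar
    """
    n = len(x)
    x_bar = sum(x) / n  # average of the inputs
    y_bar = sum(y) / n  # average of the responses
    s_xy = sum((x_i - x_bar) * (y_i - y_bar) for x_i, y_i in zip(x, y))
    s_xx = sum((x_i - x_bar) ** 2 for x_i in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Inputs from the lesson; responses are hypothetical illustration values.
x = [5, 15, 25, 35, 45, 55]
y = [8, 18, 17, 30, 27, 41]

b0, b1 = fit_simple_linear(x, y)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
```

The function assumes the inputs are not all identical, since otherwise the denominator is zero and no unique line exists.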

03:01 Now in the background, the scikit-learn library, which we’ll see in a minute, will be computing these coefficients for you. So of course, you’re not going to have to compute these manually using, say, these formulas or some other method. All right, that’s the math behind simple linear regression.

03:19 Let’s take a look at what’s going on visually with some test data.

03:24 This figure is a visual representation of what’s happening in simple linear regression. In this hypothetical test data, we’ve got six observations, and these are represented by the six green dots.

03:38 The x values or the inputs for these observations are at five, fifteen, twenty-five, thirty-five, forty-five, and fifty-five. And the black line represents the computed linear regression line for this test data.

03:55 So in this particular case for these six observations, which are x and y pairs, computing the coefficients b₀ and b₁, we get these two values.

04:07 And this is the data that we’ll use when we go on to the next lesson, when we implement this in Python. Now for the line that’s computed, the linear regression line, the coefficient b₀ is graphically the point where the line crosses the y-axis.

04:27 And the coefficient b₁, that is the slope of the line.
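
A quick way to check this reading of the coefficients is to evaluate the model at zero and at two inputs one unit apart. The coefficient values below are hypothetical, not the ones shown in the lesson's figure.

```python
# Hypothetical coefficients, chosen only for illustration.
b0, b1 = 5.0, 0.5


def f(x):
    """Linear model f(x) = b0 + b1 * x."""
    return b0 + b1 * x


# The line crosses the y-axis at x = 0, where the model returns b0.
print(f(0))

# Increasing x by one unit changes the response by the slope b1.
print(f(11) - f(10))
```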

04:32 When we evaluate the model, so the linear regression function, at the inputs from the observed data, we get these red squares.

04:45 And so these right there are the predicted responses or the estimated responses for the corresponding input values. The difference between the actual response—so yᵢ—and the estimated or predicted response, f, at xᵢ is graphically represented by this vertical line.

05:07 So in this particular observation, the actual response—so the green dot—is greater than the estimated response, while in this observation, the actual response is less than the predicted response.
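
The sign of each residual tells you which side of the line an observation falls on. The sketch below fits a line with the textbook closed-form formulas and labels each point. The y values are hypothetical illustration values, so the pattern will not necessarily match the lesson's figure.

```python
# Inputs from the lesson; responses are hypothetical illustration values.
x = [5, 15, 25, 35, 45, 55]
y = [8, 18, 17, 30, 27, 41]

# Fit the least-squares line with the textbook closed-form formulas.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((x_i - x_bar) * (y_i - y_bar) for x_i, y_i in zip(x, y))
      / sum((x_i - x_bar) ** 2 for x_i in x))
b0 = y_bar - b1 * x_bar

positions = []
for x_i, y_i in zip(x, y):
    residual = y_i - (b0 + b1 * x_i)  # actual minus estimated response
    if residual > 0:
        label = "above"   # actual response is greater than the estimate
    elif residual < 0:
        label = "below"   # actual response is less than the estimate
    else:
        label = "on"      # the point lies exactly on the line
    positions.append(label)
    print(f"x = {x_i:2d}: residual = {residual:+.3f} ({label} the line)")
```

For a least-squares line with an intercept, the residuals sum to zero up to rounding, so unless every point lies exactly on the line, some fall above it and some below.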

05:22 So to summarize, the main idea with simple linear regression is to find the best line that fits the data, where best means the line whose coefficients minimize the residual sum of squares.
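
One way to see what "best" means here is to fit the line and then nudge the coefficients slightly: every nudge should increase the RSS. This is a sketch using hypothetical y values and the textbook closed-form formulas. The helper name `rss` and the nudge sizes are arbitrary choices.

```python
# Inputs from the lesson; responses are hypothetical illustration values.
x = [5, 15, 25, 35, 45, 55]
y = [8, 18, 17, 30, 27, 41]


def rss(b0, b1):
    """Residual sum of squares for the fixed observations above."""
    return sum((y_i - (b0 + b1 * x_i)) ** 2 for x_i, y_i in zip(x, y))


# Least-squares coefficients from the textbook closed-form formulas.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((x_i - x_bar) * (y_i - y_bar) for x_i, y_i in zip(x, y))
      / sum((x_i - x_bar) ** 2 for x_i in x))
b0 = y_bar - b1 * x_bar

best = rss(b0, b1)

# Move the intercept or the slope a little in each direction.
nudges = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.05), (0.0, -0.05)]
nudged = [rss(b0 + d0, b1 + d1) for d0, d1 in nudges]

print(f"RSS at fitted coefficients: {best:.3f}")
for (d0, d1), value in zip(nudges, nudged):
    print(f"b0 {d0:+.2f}, b1 {d1:+.2f}: RSS = {value:.3f}")
```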

05:38 All right, let’s implement simple linear regression on this same test data in Python using sklearn.
