A Small Example of Linear Regression

00:00 A small example of linear regression. In this example, you’ll apply what you’ve learned so far to solve a small regression problem. You’ll learn how to create datasets, split them into training and test subsets, and use them for linear regression.

00:16 As always, start by importing the necessary packages. You’ll need numpy, LinearRegression, and train_test_split().
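
The imports described above might look like the following sketch (assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
```

`LinearRegression` provides the regression model, while `train_test_split()` handles splitting the data into training and test subsets.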

00:35 Now that you’ve imported everything you need, you can create two small arrays, x and y, to represent the observations and then split them into training and test sets, just as you have done previously.

01:02 The dataset has twenty observations, or x-y pairs. You specify the argument test_size=8 so that the dataset is divided into a training set with twelve observations and a test set with eight observations.
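
Creating the arrays and splitting them could look like this sketch. The values of `x` and `y` here are illustrative, not the ones shown onscreen; what matters is that there are twenty observations and `test_size=8`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Twenty illustrative observations; scikit-learn expects a 2D feature array.
x = np.arange(20).reshape(-1, 1)
y = np.array([5, 12, 11, 19, 30, 29, 23, 40, 51, 54,
              74, 62, 68, 73, 89, 84, 89, 101, 99, 106])

# test_size=8 puts 8 observations in the test set and the other 12 in training.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=8, random_state=0
)

print(x_train.shape, x_test.shape)  # (12, 1) (8, 1)
```

Passing an integer to `test_size` requests that exact number of test observations; a float between 0 and 1 would instead request a fraction of the dataset.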

01:38 Now you can use the training set to fit the model. LinearRegression() creates the object that represents the model, while .fit() trains, or fits, the model and returns it. With linear regression, fitting the model means determining the best intercept and slope values of the regression line, and you can see those values by querying the .intercept_ and .coef_ attributes as seen onscreen.
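
Fitting the model and inspecting the results could look like the following sketch. The data here is illustrative and lies exactly on the line y = 3 + 5x, so the fitted intercept and slope are easy to verify:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Illustrative data on the line y = 3 + 5x.
x = np.arange(20).reshape(-1, 1)
y = 3 + 5 * np.arange(20)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=8, random_state=0
)

# .fit() trains the model and returns it, so calls can be chained.
model = LinearRegression().fit(x_train, y_train)

print(model.intercept_)  # ~3.0, the best-fit intercept
print(model.coef_)       # ~[5.0], the best-fit slope
```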

02:13 Although you can use x_train and y_train to check the goodness of fit, this isn’t best practice. An unbiased estimation of the predictive performance of your model is based on the test data.

02:32 .score() returns the coefficient of determination, or R², for the data passed. Its maximum is 1: the higher the R² value, the better the fit. In this case, the training data yields a slightly higher coefficient.
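
Checking the fit on both subsets could look like this sketch (the noisy illustrative data stands in for the lesson's dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Illustrative noisy data around the line y = 3 + 5x.
rng = np.random.default_rng(0)
x = np.arange(20).reshape(-1, 1)
y = 3 + 5 * np.arange(20) + rng.normal(scale=4, size=20)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=8, random_state=0
)
model = LinearRegression().fit(x_train, y_train)

print(model.score(x_train, y_train))  # R² on the training data
print(model.score(x_test, y_test))   # R² on the test data (unbiased estimate)
```

With a strong linear signal like this one, both scores land close to 1, but only the test score estimates how the model will do on data it hasn't seen.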

02:48 However, R-squared calculated with test data is an unbiased measure of your model’s prediction performance. You can see how this looks on the graph onscreen.

03:00 The green dots represent the x-y pairs used for training. The black line, called the estimated regression line, is defined by the results of model fitting: the intercept and the slope.

03:11 So it reflects the positions of the green dots only. The white dots represent the test set. You can use them to estimate the performance of the model with data not used for training.
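
A plot like the one described can be sketched as follows (assuming matplotlib is available; the data is illustrative, and the colors match the description: green training dots, white test dots, black regression line):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Illustrative data on the line y = 3 + 5x.
x = np.arange(20).reshape(-1, 1)
y = 3 + 5 * np.arange(20)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=8, random_state=0
)
model = LinearRegression().fit(x_train, y_train)

fig, ax = plt.subplots()
ax.scatter(x_train, y_train, color="green", label="training set")
ax.scatter(x_test, y_test, facecolor="white",
           edgecolor="black", label="test set")

# Draw the estimated regression line from the fitted intercept and slope.
line_x = np.linspace(x.min(), x.max(), 2).reshape(-1, 1)
ax.plot(line_x, model.predict(line_x), color="black",
        label="estimated regression line")
ax.legend()
fig.savefig("regression.png")
```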

03:24 Now that you’ve seen how to use train_test_split() with a small example, in the next section, you’ll see how to use it with a larger dataset.
