A Small Example of Linear Regression

00:00 A small example of linear regression. In this example, you’ll apply what you’ve learned so far to solve a small regression problem. You’ll learn how to create datasets, split them into training and test subsets, and use them for linear regression.

00:16 As always, start by importing the necessary packages. You’ll need numpy, LinearRegression, and train_test_split().
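
The imports described above might look like the following sketch (assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
```

`LinearRegression` provides the regression model, while `train_test_split()` handles splitting the data into training and test subsets.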

00:35 Now that you’ve imported everything you need, you can create two small arrays, x and y, to represent the observations and then split them into training and test sets, just as you have done previously.

01:02 The dataset has twenty observations, or x-y pairs. You specify the argument test_size=8 so that the dataset is divided into a training set with twelve observations and a test set with eight observations.
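
Creating the arrays and splitting them could look like this sketch. The values of `x` and `y` here are illustrative, not the ones shown onscreen; what matters is that there are twenty observations and `test_size=8`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Twenty illustrative observations; scikit-learn expects a 2D feature array.
x = np.arange(20).reshape(-1, 1)
y = np.array([5, 12, 11, 19, 30, 29, 23, 40, 51, 54,
              74, 62, 68, 73, 89, 84, 89, 101, 99, 106])

# test_size=8 puts 8 observations in the test set and the other 12 in training.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=8, random_state=0
)

print(x_train.shape, x_test.shape)  # (12, 1) (8, 1)
```

Passing an integer to `test_size` requests that exact number of test observations; a float between 0 and 1 would instead request a fraction of the dataset.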

01:38 Now you can use the training set to fit the model. LinearRegression() creates the object that represents the model, while .fit() trains, or fits, the model and returns it. With linear regression, fitting the model means determining the best intercept and slope values of the regression line, and you can see those values by querying the .intercept_ and .coef_ attributes as seen onscreen.
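
Fitting the model and inspecting the results could look like the following sketch. The data here is illustrative and lies exactly on the line y = 3 + 5x, so the fitted intercept and slope are easy to verify:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Illustrative data on the line y = 3 + 5x.
x = np.arange(20).reshape(-1, 1)
y = 3 + 5 * np.arange(20)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=8, random_state=0
)

# .fit() trains the model and returns it, so calls can be chained.
model = LinearRegression().fit(x_train, y_train)

print(model.intercept_)  # ~3.0, the best-fit intercept
print(model.coef_)       # ~[5.0], the best-fit slope
```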

02:13 Although you can use x_train and y_train to check the goodness of fit, this isn’t best practice. An unbiased estimation of the predictive performance of your model is based on the test data.

02:32 .score() returns the coefficient of determination, or R², for the data passed. Its maximum is 1: the higher the R² value, the better the fit. In this case, the training data yields a slightly higher coefficient.
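
Checking the fit on both subsets could look like this sketch (the noisy illustrative data stands in for the lesson's dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Illustrative noisy data around the line y = 3 + 5x.
rng = np.random.default_rng(0)
x = np.arange(20).reshape(-1, 1)
y = 3 + 5 * np.arange(20) + rng.normal(scale=4, size=20)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=8, random_state=0
)
model = LinearRegression().fit(x_train, y_train)

print(model.score(x_train, y_train))  # R² on the training data
print(model.score(x_test, y_test))   # R² on the test data (unbiased estimate)
```

With a strong linear signal like this one, both scores land close to 1, but only the test score estimates how the model will do on data it hasn't seen.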

02:48 However, R-squared calculated with test data is an unbiased measure of your model’s prediction performance. You can see how this looks on the graph onscreen.

03:00 The green dots represent the x-y pairs used for training. The black line, called the estimated regression line, is defined by the results of model fitting: the intercept and the slope.

03:11 So it reflects the positions of the green dots only. The white dots represent the test set. You can use them to estimate the performance of the model with data not used for training.
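
A plot like the one described can be sketched as follows (assuming matplotlib is available; the data is illustrative, and the colors match the description: green training dots, white test dots, black regression line):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Illustrative data on the line y = 3 + 5x.
x = np.arange(20).reshape(-1, 1)
y = 3 + 5 * np.arange(20)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=8, random_state=0
)
model = LinearRegression().fit(x_train, y_train)

fig, ax = plt.subplots()
ax.scatter(x_train, y_train, color="green", label="training set")
ax.scatter(x_test, y_test, facecolor="white",
           edgecolor="black", label="test set")

# Draw the estimated regression line from the fitted intercept and slope.
line_x = np.linspace(x.min(), x.max(), 2).reshape(-1, 1)
ax.plot(line_x, model.predict(line_x), color="black",
        label="estimated regression line")
ax.legend()
fig.savefig("regression.png")
```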

03:24 Now that you’ve seen how to use train_test_split() with a small example, in the next section, you’ll see how to use it with a larger dataset.
