A Larger Regression Example

00:00 A larger regression example. Now you’re ready to split a larger dataset to solve a regression problem. You’ll use a well-known Boston house prices dataset, which is included in scikit-learn.

00:14 This dataset has 506 samples, 13 input variables, and the house values as the output. You can retrieve it with load_boston(). First, import train_test_split() and load_boston().
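The imports might look like this in the REPL. Note that load_boston() was deprecated in scikit-learn 1.0 and removed in 1.2, so this sketch assumes an older scikit-learn version:

>>> from sklearn.datasets import load_boston
>>> from sklearn.model_selection import train_test_split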

00:38 Now that you have the needed functions imported, you can get the data to work with.

00:48 As you can see, load_boston() with the argument return_X_y=True returns a tuple with two NumPy arrays: a two-dimensional array with the inputs, and a one-dimensional array with the outputs.
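A minimal sketch of that call:

>>> x, y = load_boston(return_X_y=True)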

01:06 Viewing x and y may not be so informative, as you’re now dealing with much larger arrays, but you can check the dimensions of a NumPy array through its .shape attribute, as seen onscreen now.
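For example:

>>> x.shape
(506, 13)
>>> y.shape
(506,)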

01:22 You can see that x has 506 rows and 13 columns, and y has 506 rows. The next step is to split the data the same way as before.

01:45 When you work with larger datasets, it’s usually more convenient to pass the training or test size as a ratio. test_size=0.4 means that approximately 40% of samples will be assigned to the test set, and the remaining 60% will be assigned to the training set.
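The split might look like this. The random_state value here is an arbitrary choice for illustration; any fixed integer makes the split reproducible:

>>> x_train, x_test, y_train, y_test = train_test_split(
...     x, y, test_size=0.4, random_state=0
... )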

02:03 You can use .shape once more on the x_train and x_test arrays to confirm their sizes, showing that the training set has 303 rows, and the test set, 203. Finally, you can use the training set x_train and y_train to fit the model and the test set x_test and y_test for an unbiased evaluation of the model.
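Checking the shapes:

>>> x_train.shape
(303, 13)
>>> x_test.shape
(203, 13)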

02:29 In this example, you’ll apply three well-known regression algorithms to create models that fit your data: firstly, linear regression; secondly, gradient boosting; and thirdly, a random forest.

02:45 The process is pretty much the same as in the previous example. Firstly, import the needed classes. Secondly, create and fit the model instances using the training set. And thirdly, evaluate the model with .score() using the test set.

03:02 You’ll see all three of these onscreen, starting with linear regression. The first step is to import the LinearRegression model. Next, create and train the model in a single line that chains the .fit() method after the model is created.

03:26 You can then evaluate the model’s performance on the training and test datasets.
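Putting those steps together, the linear regression code might look something like this. The exact scores are omitted here, since they depend on how the data was split:

>>> from sklearn.linear_model import LinearRegression
>>> model = LinearRegression().fit(x_train, y_train)  # create and fit in one line
>>> train_score = model.score(x_train, y_train)  # R² on the training set
>>> test_score = model.score(x_test, y_test)  # R² on the test set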

03:40 The procedure for the GradientBoostingRegressor is largely similar: importing it first,

03:53 and then creating and fitting the model in a single line.

04:04 Note that the random_state parameter is passed to the regressor to ensure reproducible results. Once again, the model is assessed on training and test datasets.
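A sketch of the equivalent gradient boosting steps, where random_state=0 is again an illustrative choice:

>>> from sklearn.ensemble import GradientBoostingRegressor
>>> model = GradientBoostingRegressor(random_state=0).fit(x_train, y_train)
>>> train_score = model.score(x_train, y_train)
>>> test_score = model.score(x_test, y_test)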

04:24 Finally, the same process is repeated for the RandomForestRegressor, noting again that the random forest’s random_state is set during its creation.
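And a sketch of the random forest version:

>>> from sklearn.ensemble import RandomForestRegressor
>>> model = RandomForestRegressor(random_state=0).fit(x_train, y_train)
>>> train_score = model.score(x_train, y_train)
>>> test_score = model.score(x_test, y_test)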

04:57 You’ve used your training and test datasets to fit three models and evaluate their performance. The measure of performance obtained with .score() is the coefficient of determination.

05:09 It can be calculated with either the training or test set, but as you’ve already learned, the score obtained with the test set represents an unbiased estimation of performance.
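For reference, the coefficient of determination computed by .score() is defined as

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

where y_i are the observed values, \hat{y}_i are the model’s predictions, and \bar{y} is the mean of the observed values.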

05:20 As mentioned in the documentation, you can provide optional arguments to LinearRegression, GradientBoostingRegressor, and RandomForestRegressor.

05:29 As we’ve already seen, the GradientBoostingRegressor and RandomForestRegressor use the random_state parameter for the same reason as train_test_split() does: to deal with randomness in the algorithms and ensure reproducibility.

05:45 You can use train_test_split() to solve classification problems the same way you do for regression analysis. In machine learning, classification problems involve training a model to apply labels to, or classify, the input values, sorting your dataset into categories.

06:03 In this Real Python tutorial, you’ll find an example of a handwriting recognition task. The example provides another demonstration of splitting data into training and test sets to avoid bias in the evaluation process.
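As a minimal sketch, a classification split might use scikit-learn’s digits dataset; the choice of dataset and parameters here is illustrative, not necessarily what the linked tutorial uses:

>>> from sklearn.datasets import load_digits
>>> from sklearn.model_selection import train_test_split
>>> x, y = load_digits(return_X_y=True)
>>> x_train, x_test, y_train, y_test = train_test_split(
...     x, y, test_size=0.4, random_state=0, stratify=y
... )  # stratify=y keeps class proportions the same in both sets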

06:18 In the next section of this course, you’ll take a look at some other functionalities that can be used for validation of your model.
