A Larger Regression Example
00:00 A larger regression example. Now you’re ready to split a larger dataset to solve a regression problem. You’ll use a well-known Boston house prices dataset, which is included in scikit-learn.
This dataset has 506 samples, 13 input variables, and the house values as the output. You can retrieve it with
load_boston(). First, import train_test_split() and load_boston().
00:38 Now that you have the needed functions imported, you can get the data to work with.
As you can see,
load_boston() with the argument
return_X_y=True returns a tuple with two NumPy arrays: a two-dimensional array with the inputs, and a one-dimensional array with the outputs.
Printing x and y may not be so informative now that you're dealing with much larger arrays, but you can see the dimensions of a NumPy array by viewing its
.shape attribute, as seen onscreen now.
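As a runnable sketch of this step: load_boston() was removed in scikit-learn 1.2, so the snippet below substitutes synthetic data of the same shape (506 samples, 13 features) generated with make_regression(). The synthetic data is an assumption for illustration only, not the actual Boston house prices.

```python
from sklearn.datasets import make_regression

# Synthetic stand-in for the Boston dataset (removed in scikit-learn 1.2):
# 506 samples, 13 input features, one continuous output.
x, y = make_regression(n_samples=506, n_features=13, noise=10, random_state=0)

# .shape reveals the dimensions of each NumPy array.
print(x.shape)  # (506, 13)
print(y.shape)  # (506,)
```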
You can see that x has
506 rows and
13 columns, while y has
506 rows. The next step is to split the data the same way as before.
When you work with larger datasets, it’s usually more convenient to pass the training or test size as a ratio.
test_size=0.4 means that approximately 40% of samples will be assigned to the test set, and the remaining 60% will be assigned to the training set.
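The split with a ratio can be sketched like this; the data here is a synthetic stand-in (load_boston() is no longer available in recent scikit-learn versions), but the train_test_split() call mirrors the lesson:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the Boston dataset.
x, y = make_regression(n_samples=506, n_features=13, noise=10, random_state=0)

# test_size=0.4 assigns roughly 40% of the samples to the test set
# (scikit-learn rounds the test size up, so 203 of 506 samples).
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=0
)

print(x_train.shape)  # (303, 13)
print(x_test.shape)   # (203, 13)
```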
You can use
.shape once more on the
x_train and
x_test arrays to confirm their sizes, showing that the training set has
303 rows, and the test set,
203. Finally, you can use the training set
x_train and
y_train to fit the model and the test set
x_test and
y_test for an unbiased evaluation of the model.
02:29 In this example, you’ll apply three well-known regression algorithms to create models that fit your data: first, linear regression; second, gradient boosting; and third, a random forest.
The process is pretty much the same as in the previous example: first, import the needed classes; second, create and fit the model instances using the training set; and third, evaluate each model with
.score() using the test set.
You’ll see all three of these onscreen starting with linear regression. The first step is to import the
LinearRegression model. Next, create and train the model with the single line that chains the
.fit() method after the model is created.
03:26 You can then evaluate the model’s performance on the training and test datasets.
The procedure for the
GradientBoostingRegressor is largely similar: importing it first,
03:53 and then creating and fitting the model in a single line.
Note that the
random_state parameter is passed to the regressor to ensure reproducible results. Once again, the model is assessed on training and test datasets.
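The gradient boosting step follows the same pattern; note random_state passed to the regressor for reproducibility (data is the same synthetic stand-in as above):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

x, y = make_regression(n_samples=506, n_features=13, noise=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=0
)

# random_state makes the boosting procedure reproducible across runs.
model = GradientBoostingRegressor(random_state=0).fit(x_train, y_train)

print(model.score(x_train, y_train))  # R² on the training data
print(model.score(x_test, y_test))    # R² on the test data
```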
Finally, the same process is repeated for the
RandomForestRegressor, noting again that
random_state is set when the random forest is created.
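And the random forest step, identical in shape to the previous two (synthetic stand-in data again):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

x, y = make_regression(n_samples=506, n_features=13, noise=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=0
)

# random_state is set at creation to make the forest reproducible.
model = RandomForestRegressor(random_state=0).fit(x_train, y_train)

print(model.score(x_train, y_train))  # R² on the training data
print(model.score(x_test, y_test))    # R² on the test data
```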
You’ve used your training and test datasets to fit three models and evaluate their performance. The measure of accuracy obtained with
.score() is the coefficient of determination.
05:09 It can be calculated with either the training or test set, but, as you’ve already learned, the score obtained with the test set represents an unbiased estimation of performance.
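For regressors, .score() returns the coefficient of determination (R²), the same value that scikit-learn's r2_score() computes from the predictions. A quick check, using the same synthetic stand-in data:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

x, y = make_regression(n_samples=506, n_features=13, noise=10, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=0
)

model = LinearRegression().fit(x_train, y_train)

# .score() on a regressor is the coefficient of determination (R²),
# which matches r2_score() applied to the model's predictions.
score = model.score(x_test, y_test)
r2 = r2_score(y_test, model.predict(x_test))
print(score, r2)
```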
As mentioned in the documentation, you can provide optional arguments to these classes to tune their behavior. As you’ve already seen,
GradientBoostingRegressor and
RandomForestRegressor use the
random_state parameter for the same reason that
train_test_split() does: to deal with randomness in the algorithms and ensure reproducibility.
You can use
train_test_split() to solve classification problems the same way you do for regression analysis. In machine learning, classification problems involve training a model to apply labels to the input values, sorting your dataset into categories.
06:03 In this Real Python tutorial, you’ll find an example of a handwriting recognition task. The example provides another demonstration of splitting data into training and test sets to avoid bias in the evaluation process.
06:18 In the next section of this course, you’ll take a look at some other functionalities that can be used for validation of your model.