A Larger Regression Example
00:00 A larger regression example. Now you’re ready to split a larger dataset to solve a regression problem. You’ll use the well-known Boston house prices dataset, which is included in scikit-learn.
00:14
This dataset has 506 samples, 13 input variables, and the house values as the output. You can retrieve it with load_boston(). First, import train_test_split() and load_boston().
00:38 Now that you have the needed functions imported, you can get the data to work with.
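As a rough sketch of what that looks like in code, assuming a scikit-learn version older than 1.2, where load_boston() is still available (see the comments below for why it was removed and what to use instead):

# Assumes scikit-learn < 1.2; load_boston() has since been removed.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# Load the inputs (x) and outputs (y) as two NumPy arrays.
x, y = load_boston(return_X_y=True)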
00:48
As you can see, load_boston() with the argument return_X_y=True returns a tuple with two NumPy arrays: a two-dimensional array with the inputs, and a one-dimensional array with the outputs.
01:06
Viewing x and y may not be very informative, since you’re now dealing with much larger arrays, but you can check the dimensions of a NumPy array by viewing its .shape attribute, as seen onscreen now.
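As a minimal sketch, with the expected output shown in comments:

x.shape  # (506, 13): 506 samples, 13 input variables
y.shape  # (506,): one house value per sample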
01:22
You can see that x has 506 rows and 13 columns, and y has 506 rows. The next step is to split the data the same way as before.
01:45
When you work with larger datasets, it’s usually more convenient to pass the training or test size as a ratio. test_size=0.4 means that approximately 40% of samples will be assigned to the test set, and the remaining 60% will be assigned to the training set.
02:03
You can use .shape once more on the x_train and x_test arrays to confirm their sizes, showing that the training set has 303 rows and the test set has 203. Finally, you can use the training set x_train and y_train to fit the model, and the test set x_test and y_test for an unbiased evaluation of the model.
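A sketch of that split might look like this; the random_state value is an assumption to make the split reproducible, not necessarily the one used onscreen:

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=0
)
x_train.shape  # (303, 13): roughly 60% of the samples
x_test.shape   # (203, 13): roughly 40% of the samples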
02:29 In this example, you’ll apply three well-known regression algorithms to create models that fit your data: first, linear regression; second, gradient boosting; and third, a random forest.
02:45
The process is pretty much the same as with the previous example. First, import the needed classes. Second, create and fit the model instances using the training set. And third, evaluate the model with .score() using the test set.
03:02
You’ll see all three of these onscreen, starting with linear regression. The first step is to import the LinearRegression model. Next, create and train the model with a single line that chains the .fit() method after the model is created.
03:26 You can then evaluate the model’s performance on the training and test datasets.
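A sketch of those steps, with variable names that are assumptions rather than what’s shown onscreen:

from sklearn.linear_model import LinearRegression

# Create and fit the model in a single chained line.
lin_model = LinearRegression().fit(x_train, y_train)

lin_model.score(x_train, y_train)  # R² on the training set
lin_model.score(x_test, y_test)    # R² on the test set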
03:40
The procedure for the GradientBoostingRegressor is largely similar: importing it first,
03:53 and then creating and fitting the model in a single line.
04:04
Note that the random_state parameter is passed to the regressor to ensure reproducible results. Once again, the model is assessed on the training and test datasets.
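A comparable sketch for gradient boosting; the random_state value is again an assumption:

from sklearn.ensemble import GradientBoostingRegressor

gb_model = GradientBoostingRegressor(random_state=0).fit(x_train, y_train)
gb_model.score(x_train, y_train)  # R² on the training set
gb_model.score(x_test, y_test)    # R² on the test set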
04:24
Finally, the same process is repeated for the RandomForestRegressor, noting again that random_state is set when the random forest is created.
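And a similar sketch for the random forest, again with an assumed random_state:

from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(random_state=0).fit(x_train, y_train)
rf_model.score(x_train, y_train)  # R² on the training set
rf_model.score(x_test, y_test)    # R² on the test set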
04:57
You’ve used your training and test datasets to fit three models and evaluate their performance. The measure of accuracy obtained with .score() is the coefficient of determination.
05:09 It can be calculated with either the training or the test set, but as you’ve already learned, the score obtained with the test set represents an unbiased estimation of the model’s performance.
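To illustrate, for a regressor, .score() returns the same value as r2_score() computed on that model’s predictions; a quick sketch using the random forest from above:

from sklearn.metrics import r2_score

# Equivalent to rf_model.score(x_test, y_test)
r2_score(y_test, rf_model.predict(x_test))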
05:20
As mentioned in the documentation, you can provide optional arguments to LinearRegression, GradientBoostingRegressor, and RandomForestRegressor.
05:29
As we’ve already seen, the GradientBoostingRegressor and RandomForestRegressor use the random_state parameter for the same reason as train_test_split() does: to deal with randomness in the algorithms and ensure reproducibility.
05:45
You can use train_test_split() to solve classification problems the same way you do for regression analysis. In machine learning, classification problems involve training a model to apply labels to, or classify, the input values, sorting your dataset into categories.
06:03 In this Real Python tutorial, you’ll find an example of a handwriting recognition task. The example provides another demonstration of splitting data into training and test sets to avoid bias in the evaluation process.
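As an illustrative sketch that isn’t taken from the video, splitting a classification dataset works the same way; with the digits dataset, for example, you can also pass stratify=y so that both subsets keep similar class proportions:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

x, y = load_digits(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0, stratify=y
)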
06:18 In the next section of this course, you’ll take a look at some other functionalities that can be used for validation of your model.
toigopaul on Aug. 7, 2024
FWIW, changing
from sklearn.datasets import load_boston
x, y = load_boston(return_X_y=True)
to
import numpy as np
import pandas as pd

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
x = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
y = raw_df.values[1::2, 2]
worked fine for me.
Martin Breuss RP Team on Aug. 7, 2024
@toigopaul thanks for noting this and posting these interesting links to read 🙌
We’ve already updated the Regression Example in the tutorial that this course is based on with the California Housing dataset for the reasons you mentioned.
Video courses are a bit harder to update without breaking the overall course, so we haven’t gotten around to doing this one yet.
But for anyone bumping into this, I’d strongly suggest reading over the links that @toigopaul shared above, those are important lessons regarding how there can be significant bias in a dataset!
And to work through the example, head over to the written tutorial and work through the Regression Example as it’s described there.
toigopaul on Aug. 6, 2024
Error I got playing along in Visual Studio Code:
load_boston has been removed from scikit-learn since version 1.2. The Boston housing prices dataset has an ethical problem: as investigated in [1], the authors of this dataset engineered a non-invertible variable “B” assuming that racial self-segregation had a positive impact on house prices [2]. Furthermore, the goal of the research that led to the creation of this dataset was to study the impact of air quality, but it did not give adequate demonstration of the validity of this assumption.
The scikit-learn maintainers therefore strongly discourage the use of this dataset unless the purpose of the code is to study and educate about ethical issues in data science and machine learning.
In this special case, you can fetch the dataset from the original source:

import pandas as pd
import numpy as np

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

Alternative datasets include the California housing dataset and the Ames housing dataset. You can load the datasets as follows:

from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

for the California housing dataset and:

from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True)

for the Ames housing dataset.
[1] M Carlisle. “Racist data destruction?” medium.com/@docintangible/racist-data-destruction-113e3eff54a8
[2] Harrison Jr, David, and Daniel L. Rubinfeld. “Hedonic housing prices and the demand for clean air.” Journal of environmental economics and management 5.1 (1978): 81-102. www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air