A Larger Regression Example
00:00 A larger regression example. Now you’re ready to split a larger dataset to solve a regression problem. You’ll use the well-known Boston house prices dataset, which is included in scikit-learn.
00:14
This dataset has 506 samples, 13 input variables, and the house values as the output. You can retrieve it with load_boston(). First, import train_test_split() and load_boston().
00:38 Now that you have the needed functions imported, you can get the data to work with.
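As a rough sketch of what that looks like in code, assuming a scikit-learn version older than 1.2, where load_boston() is still available (see the comments below for why it was removed and what to use instead):

# Assumes scikit-learn < 1.2; load_boston() has since been removed.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# Load the inputs (x) and outputs (y) as two NumPy arrays.
x, y = load_boston(return_X_y=True)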
00:48
As you can see, load_boston() with the argument return_X_y=True returns a tuple with two NumPy arrays: a two-dimensional array with the inputs, and a one-dimensional array with the outputs.
01:06
Viewing x and y may not be very informative, since you’re now dealing with much larger arrays, but you can check the dimensions of a NumPy array by viewing its .shape attribute, as seen onscreen now.
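As a minimal sketch, with the expected output shown in comments:

x.shape  # (506, 13): 506 samples, 13 input variables
y.shape  # (506,): one house value per sample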
01:22
You can see that x has 506 rows and 13 columns, and y has 506 rows. The next step is to split the data the same way as before.
01:45
When you work with larger datasets, it’s usually more convenient to pass the training or test size as a ratio. test_size=0.4 means that approximately 40% of samples will be assigned to the test set, and the remaining 60% will be assigned to the training set.
02:03
You can use .shape once more on the x_train and x_test arrays to confirm their sizes, showing that the training set has 303 rows and the test set has 203. Finally, you can use the training set x_train and y_train to fit the model, and the test set x_test and y_test for an unbiased evaluation of the model.
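A sketch of that split might look like this; the random_state value is an assumption to make the split reproducible, not necessarily the one used onscreen:

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=0
)
x_train.shape  # (303, 13): roughly 60% of the samples
x_test.shape   # (203, 13): roughly 40% of the samples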
02:29 In this example, you’ll apply three well-known regression algorithms to create models that fit your data: first, linear regression; second, gradient boosting; and third, a random forest.
02:45
The process is pretty much the same as with the previous example. First, import the needed classes. Second, create and fit the model instances using the training set. And third, evaluate the model with .score() using the test set.
03:02
You’ll see all three of these onscreen, starting with linear regression. The first step is to import the LinearRegression model. Next, create and train the model with a single line that chains the .fit() method after the model is created.
03:26 You can then evaluate the model’s performance on the training and test datasets.
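A sketch of those steps, with variable names that are assumptions rather than what’s shown onscreen:

from sklearn.linear_model import LinearRegression

# Create and fit the model in a single chained line.
lin_model = LinearRegression().fit(x_train, y_train)

lin_model.score(x_train, y_train)  # R² on the training set
lin_model.score(x_test, y_test)    # R² on the test set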
03:40
The procedure for the GradientBoostingRegressor is largely similar: importing it first,
03:53 and then creating and fitting the model in a single line.
04:04
Note that the random_state parameter is passed to the regressor to ensure reproducible results. Once again, the model is assessed on the training and test datasets.
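A comparable sketch for gradient boosting; the random_state value is again an assumption:

from sklearn.ensemble import GradientBoostingRegressor

gb_model = GradientBoostingRegressor(random_state=0).fit(x_train, y_train)
gb_model.score(x_train, y_train)  # R² on the training set
gb_model.score(x_test, y_test)    # R² on the test set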
04:24
Finally, the same process is repeated for the RandomForestRegressor, noting again that random_state is set when the random forest is created.
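And a similar sketch for the random forest, again with an assumed random_state:

from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor(random_state=0).fit(x_train, y_train)
rf_model.score(x_train, y_train)  # R² on the training set
rf_model.score(x_test, y_test)    # R² on the test set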
04:57
You’ve used your training and test datasets to fit three models and evaluate their performance. The measure of accuracy obtained with .score() is the coefficient of determination.
05:09 It can be calculated with either the training or the test set, but as you’ve already learned, the score obtained with the test set represents an unbiased estimation of the model’s performance.
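To illustrate, for a regressor, .score() returns the same value as r2_score() computed on that model’s predictions; a quick sketch using the random forest from above:

from sklearn.metrics import r2_score

# Equivalent to rf_model.score(x_test, y_test)
r2_score(y_test, rf_model.predict(x_test))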
05:20
As mentioned in the documentation, you can provide optional arguments to LinearRegression, GradientBoostingRegressor, and RandomForestRegressor.
05:29
As we’ve already seen, the GradientBoostingRegressor and RandomForestRegressor use the random_state parameter for the same reason as train_test_split() does: to deal with randomness in the algorithms and ensure reproducibility.
05:45
You can use train_test_split() to solve classification problems the same way you do for regression analysis. In machine learning, classification problems involve training a model to apply labels to, or classify, the input values, sorting your dataset into categories.
06:03 In this Real Python tutorial, you’ll find an example of a handwriting recognition task. The example provides another demonstration of splitting data into training and test sets to avoid bias in the evaluation process.
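As an illustrative sketch that isn’t taken from the video, splitting a classification dataset works the same way; with the digits dataset, for example, you can also pass stratify=y so that both subsets keep similar class proportions:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

x, y = load_digits(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0, stratify=y
)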
06:18 In the next section of this course, you’ll take a look at some other functionalities that can be used for validation of your model.
toigopaul on Aug. 7, 2024
FWIW, changing
from sklearn.datasets import load_boston
x, y = load_boston(return_X_y=True)
to
import numpy as np
import pandas as pd

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
x = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
y = raw_df.values[1::2, 2]
worked fine for me.
Martin Breuss RP Team on Aug. 7, 2024
@toigopaul thanks for noting this and posting these interesting links to read 🙌
We’ve already updated the Regression Example in the tutorial that this course is based on with the California Housing dataset for the reasons you mentioned.
Video courses are a bit harder to update without breaking the overall course, so we haven’t gotten around to doing this one yet.
But for anyone bumping into this, I’d strongly suggest reading over the links that @toigopaul shared above, those are important lessons regarding how there can be significant bias in a dataset!
And to work through the example, head over to the written tutorial and work through the Regression Example as it’s described there.
toigopaul on Aug. 6, 2024
Error I got playing along in Visual Studio Code:
load_boston has been removed from scikit-learn since version 1.2. The Boston housing prices dataset has an ethical problem: as investigated in [1], the authors of this dataset engineered a non-invertible variable “B” assuming that racial self-segregation had a positive impact on house prices [2]. Furthermore, the goal of the research that led to the creation of this dataset was to study the impact of air quality, but it did not give adequate demonstration of the validity of this assumption.
The scikit-learn maintainers therefore strongly discourage the use of this dataset unless the purpose of the code is to study and educate about ethical issues in data science and machine learning.
In this special case, you can fetch the dataset from the original source:

import pandas as pd
import numpy as np

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

Alternative datasets include the California housing dataset and the Ames housing dataset. You can load the datasets as follows:

from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

for the California housing dataset and:

from sklearn.datasets import fetch_openml
housing = fetch_openml(name="house_prices", as_frame=True)

for the Ames housing dataset.
[1] M Carlisle. “Racist data destruction?” medium.com/@docintangible/racist-data-destruction-113e3eff54a8
[2] Harrison Jr, David, and Daniel L. Rubinfeld. “Hedonic housing prices and the demand for clean air.” Journal of environmental economics and management 5.1 (1978): 81-102. www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air