Splitting Datasets With scikit-learn and train_test_split() (Summary)

You now know why and how to use train_test_split() from sklearn. You’ve learned that, for an unbiased estimation of the predictive performance of machine learning models, you should use data that hasn’t been used for model fitting. That’s why you need to split your dataset into training, test, and in some cases, validation subsets.

In this course, you’ve learned how to:

  • Use train_test_split() to get training and test sets
  • Control the size of the subsets with the parameters train_size and test_size
  • Determine the randomness of your splits with the random_state parameter
  • Obtain stratified splits with the stratify parameter
  • Use train_test_split() as a part of supervised machine learning procedures

You’ve also seen that the sklearn.model_selection module offers several other tools for model validation, including cross-validation, learning curves, and hyperparameter tuning.
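As a rough sketch of two of those tools, the example below runs cross-validation with cross_val_score() and a small hyperparameter search with GridSearchCV. The synthetic dataset, the Ridge model, and the alpha grid are assumptions picked for illustration; they are not taken from the course.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic regression data (an assumption for illustration, not the course dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Hold out a test set that is not used during cross-validation or tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 5-fold cross-validation on the training data.
# For regressors, the default score is the coefficient of determination (R^2).
cv_scores = cross_val_score(Ridge(alpha=1.0), X_train, y_train, cv=5)
print("Mean CV score:", cv_scores.mean())

# Hyperparameter tuning: try a few regularization strengths with 5-fold CV.
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Test score:", search.score(X_test, y_test))
```

The test set is scored only once, at the very end, so the final number is still an estimate on data that played no part in fitting or tuning.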



aniketbarphe on Sept. 4, 2021

Dear Team, thank you very much for such a wonderful session. All topics were explained nicely. My only suggestion concerns the topic “Other Validation Functionalities”: it would be helpful if it were explained with an example. An example makes it easier to understand than theory alone. Looking forward to a positive response.


Joe Madaus on Nov. 7, 2021

I plan on taking the Linear Regression course next, so it is possible that this question is answered there :)

So, when we looked at the Boston house dataset, what were we trying to accomplish? This might be too far in the weeds for this course, but when training and testing, how does one know (besides the accuracy number) that their model works, and how does one know that their model is even good?

For example, couldn’t someone effectively leave a field out of a dataset and get completely different results – better or worse?

I thought the course was excellent, but I can see I need to study a lot more :)


hodges-troy on Nov. 26, 2021

I feel more confident now in my use of train_test_split()’s various options. I will use this information to run a regression on electricity usage data and utilize the stratify argument to make sure big and small users of electricity are represented correctly in the data groups.


Richpy on Feb. 10, 2023

It’s very important to learn the code and why it is written in a certain way. For example, random_state controls the randomness to ensure reproducibility. Thank you for sharing these important points in the video. Also, I am getting an error for the dataset provided in the video. You could replace it with another dataset.
