Splitting Datasets With scikit-learn and train_test_split() (Summary)

Splitting Datasets With scikit-learn and train_test_split() Darren Jones 01:23

You now know why and how to use train_test_split() from sklearn. You’ve learned that, for an unbiased estimation of the predictive performance of machine learning models, you should use data that hasn’t been used for model fitting. That’s why you need to split your dataset into training, test, and in some cases, validation subsets.

In this course, you’ve learned how to:

Use train_test_split() to get training and test sets
Control the size of the subsets with the parameters train_size and test_size
Determine the randomness of your splits with the random_state parameter
Obtain stratified splits with the stratify parameter
Use train_test_split() as a part of supervised machine learning procedures
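
The points above can be sketched together in a few lines. This is a minimal illustration rather than code from the course: the toy arrays X and y are invented here, and test_size=0.25 and random_state=42 are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data invented for illustration: 12 samples, 2 features,
# with imbalanced labels (eight 0s and four 1s).
X = np.arange(24).reshape(12, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# test_size sets the share of samples held out for testing,
# random_state fixes the shuffle so the split is reproducible,
# and stratify=y keeps the label ratio the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Repeating the call with the same random_state gives the same split.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
same_split = np.array_equal(X_test, X_test2)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))
```

With stratify=y, the 2:1 ratio between the two labels is preserved in both the training and the test set, which is the main reason to use the parameter on imbalanced data.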

You’ve also seen that the sklearn.model_selection module offers several other tools for model validation, including cross-validation, learning curves, and hyperparameter tuning.
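
As a rough sketch of one of those tools, cross-validation can be run with cross_val_score() and a KFold splitter. The synthetic dataset from make_regression(), the LinearRegression model, and the choice of five folds are all assumptions made for this example, not details taken from the course.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, generated only for this illustration.
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

# Five shuffled folds; random_state makes the fold assignment reproducible.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Each fold serves once as the test set while the model is fitted on the rest.
# For a regressor, the default score is the coefficient of determination (R^2).
scores = cross_val_score(LinearRegression(), X, y, cv=cv)

print(scores)
print(scores.mean())
```

Instead of relying on a single train/test split, this approach averages the score over several splits, which tends to give a more stable estimate of performance.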

Congratulations, you made it to the end of the course! What’s your #1 takeaway or favorite thing you learned? How are you going to put your newfound skills to use? Leave a comment in the discussion section and let us know.

00:00 Summary. Well done! You’ve made it to the end of the course. You now know why and how to use train_test_split() from scikit-learn. You’ve learned that for an unbiased estimation of the predictive performance of machine learning models, you should use data that hasn’t been used for model fitting.

00:19 That’s why you need to split your dataset into training, test, and in some cases, validation subsets. In this course, you’ve learned how to use train_test_split() to get training and test sets, control the size of the subsets with the parameters train_size and test_size, determine the randomness of your splits with the random_state parameter, obtain stratified splits with the stratify parameter, and use train_test_split() as part of a supervised machine learning procedure.

00:53 You’ve also seen that scikit-learn’s model_selection module offers several other tools for model validation, including cross-validation, learning curves, and hyperparameter tuning.

01:04 You can use what you’ve learned in this course to work confidently with datasets for all of these tools and to create reproducible training, test, and validation sets.

01:16 We hope you found this course useful, and we’ll see you again soon at realpython.com.

aniketbarphe on Sept. 4, 2021

Dear Team, thank you very much for such a wonderful session. All topics were explained nicely. My only suggestion concerns the topic “Other Validation Functionalities”: it would be helpful if it were explained with an example, since an example is easier to understand than theory alone. Looking forward to a positive response.

Joe Madaus on Nov. 7, 2021

I plan on taking the Linear Regression course next, so it is possible that this question is answered there :)

So, when we looked at the Boston house dataset, what were we trying to accomplish? This might be too far in the weeds for this course, but when training and testing, how does one know (besides the accuracy number) that their model works, and how does one know that their model is even good?

For example, couldn’t someone effectively leave a field out of a dataset and get completely different results – better or worse?

I thought the course was excellent, but I can see I need to study a lot more :)

hodges-troy on Nov. 26, 2021

I feel more confident now in my use of train_test_split()’s various options. I will use this information to run a regression on electricity usage data and utilize the stratify argument to make sure big and small users of electricity are represented correctly in the data groups.

Richpy on Feb. 10, 2023

It’s very important to learn the code and why it is written in a certain way. For example, random_state controls the randomness of the split so that results are reproducible. Thank you for sharing these important points in the video. Also, I am getting an error for the dataset provided in the video. You may want to replace it with another dataset.
