Reproducible Results With train_test_split()
00:29 In the previous example, you used a dataset with twelve observations and got a training sample with nine rows and a test sample with three rows. That’s because you didn’t specify the desired size of the training and test sets.
00:42 By default, 25% of the samples are assigned to the test set. This ratio is generally fine for many applications, but it’s not always what you need. Typically, you’ll want to define the size of the test or training set explicitly, and sometimes you’ll even want to experiment with different values.
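As a minimal sketch of the default behavior (the 12-row dataset here is illustrative, not the course's actual data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative dataset with twelve observations
x = np.arange(12).reshape(-1, 1)
y = np.arange(12)

# No test_size given, so 25% of the samples go to the test set by default
x_train, x_test, y_train, y_test = train_test_split(x, y)
print(len(x_train), len(x_test))  # 9 3
```

With twelve rows, the default 25% ratio yields three test samples and nine training samples, matching the earlier example.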
With this change, you get a different result from before. Earlier, you had a training set with nine items and a test set with three items. Now, with test_size=4, the training set has eight items and the test set predictably has four items.
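A sketch of the call the narration implies, again with an illustrative 12-row dataset; test_size accepts an integer (an exact number of rows) or a float (a fraction of the dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(12).reshape(-1, 1)
y = np.arange(12)

# test_size=4 requests exactly four test rows; random_state=4 fixes the shuffle
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=4, random_state=4
)
print(len(x_train), len(x_test))  # 8 4
```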
You now get the same result each time you run the function. This is because you've fixed the random number generator with random_state=4. Onscreen, you can see what's going on when you call train_test_split().
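You can check the reproducibility directly: calling the function twice with the same random_state produces identical splits (the dataset below is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(12).reshape(-1, 1)
y = np.arange(12)

# The same random_state yields the same shuffle, hence identical splits
first = train_test_split(x, y, test_size=4, random_state=4)
second = train_test_split(x, y, test_size=4, random_state=4)
print(all((a == b).all() for a, b in zip(first, second)))  # True
```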
The samples of the dataset are shuffled randomly and then split into the training and test sets according to the size that was defined. You can see that
y has six 0s and six 1s. However, the test set has three 0s out of four items.
If you want y_train and y_test to keep approximately the same ratios of 0s and 1s as the original y array, then pass stratify=y. Stratified splits are desirable in some cases, like when you're classifying an imbalanced dataset, which is a dataset with a significant difference in the number of samples that belong to distinct classes. Finally, you can turn off data shuffling and random splitting with shuffle=False.
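Both options can be sketched with an illustrative dataset of six 0s and six 1s; note that stratify requires shuffling, so the two options are used separately:

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(12).reshape(-1, 1)
y = np.array([0, 1] * 6)  # six 0s and six 1s

# stratify=y preserves the 50/50 class ratio in both subsets
_, _, y_train, y_test = train_test_split(
    x, y, test_size=4, stratify=y, random_state=4
)
print((y_test == 0).sum(), (y_test == 1).sum())  # 2 2

# shuffle=False disables shuffling: the last four rows become the test set
_, x_test, _, _ = train_test_split(x, y, test_size=4, shuffle=False)
print((x_test == x[-4:]).all())  # True
```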