Reproducible Results With train_test_split()
Making your work reproducible. Sometimes, to make your test reproducible, you need a random split with the same output for each function call. You can do that with the parameter random_state.
The value of
random_state isn’t important. It could be any non-negative integer. You could use an instance of
numpy.random.RandomState instead, but that is a more complex approach.
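As a minimal sketch of the idea (the arrays x and y here are illustrative, not the exact data onscreen), two calls with the same random_state return identical splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative twelve-row dataset; the names x and y are assumptions
x = np.arange(12).reshape(-1, 1)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# Two calls with the same random_state yield the same shuffle,
# and therefore the same split
x_train_a, x_test_a, y_train_a, y_test_a = train_test_split(x, y, random_state=4)
x_train_b, x_test_b, y_train_b, y_test_b = train_test_split(x, y, random_state=4)

print(np.array_equal(x_test_a, x_test_b))  # True: the splits are identical
```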
00:29 In the previous example, you used a dataset with twelve observations and got a training sample with nine rows and a test sample with three rows. That’s because you didn’t specify the desired size of the training and test sets.
00:42 By default, 25% of the samples are assigned to the test set. This ratio is generally fine for many applications, but it’s not always what you need. Typically, you’ll want to define the size of the test or training set explicitly, and sometimes you’ll even want to experiment with different values.
You can do that with the parameters train_size and
test_size. Onscreen, you can see
train_test_split() being called again, this time with both test_size and random_state specified.
With this change, you get a different result from before. Earlier, you had a training set with nine items and a test set with three items. Now, with
test_size=4, the training set has eight items and the test set predictably has four items.
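A sketch of such a call with an integer test_size, using a twelve-row toy dataset (the array contents are assumptions, not the onscreen data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(12).reshape(-1, 1)  # twelve observations
y = np.arange(12)

# An integer test_size reserves exactly that many samples for the test set
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=4, random_state=4
)
print(len(x_train), len(x_test))  # 8 4
```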
You’d get the same result with
test_size=0.33 because 33% of 12 is approximately 4. There’s one more very important difference between the last two examples.
You now get the same result each time you run the function. This is because you’ve fixed the random number generator with
random_state=4. Onscreen, you can see what’s going on when you call train_test_split().
The samples of the dataset are shuffled randomly and then split into the training and test sets according to the size that was defined. You can see that
y has six 0s and six 1s. However, the test set has three 0s out of four items.
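A plain shuffled split gives no guarantee about class balance. Here is a sketch with an assumed balanced toy y (because the data differs from the onscreen example, the exact counts may differ too):

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(12).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])  # six 0s, six 1s

_, _, _, y_test = train_test_split(x, y, test_size=4, random_state=4)

# The class counts in the test set can drift away from the original 50/50 balance
print(np.bincount(y_test, minlength=2))
```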
If you want to approximately keep the proportion of
y values across the training and test sets, then pass
stratify=y. This will enable stratified splitting.
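A sketch of the same split with stratify=y, again using an assumed balanced toy y: each class keeps its 50/50 share in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(12).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])  # six 0s, six 1s

# stratify=y preserves the class proportions in both subsets
_, _, y_train, y_test = train_test_split(
    x, y, test_size=4, random_state=4, stratify=y
)
print(np.bincount(y_train), np.bincount(y_test))  # [4 4] [2 2]
```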
Now, y_train and y_test have the same ratios of 0s and 1s as the original
y array. Stratified splits are desirable in some cases, like when you’re classifying an imbalanced dataset, which is a dataset with a significant difference in the number of samples that belong to distinct classes. Finally, you can turn off data shuffling and random splitting with shuffle=False.
Now you have a split in which the first two-thirds of samples in the original
x and y arrays are assigned to the training set
03:44 and the last third to the test set.
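A sketch of that behavior with shuffle=False (toy arrays; the names and contents are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(12).reshape(-1, 1)
y = np.arange(12)

# With shuffle=False there is no randomness: the first rows go to training,
# the last rows to testing, in the original order
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=4, shuffle=False
)
print(y_train.tolist(), y_test.tolist())
# [0, 1, 2, 3, 4, 5, 6, 7] [8, 9, 10, 11]
```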
There’s no shuffling, and therefore there’s no randomness. In the next section of the course, you’ll see how you can use
train_test_split() when working with supervised machine learning.