Reproducible Results With train_test_split()

Splitting Datasets With scikit-learn and train_test_split() Darren Jones 04:06

00:00 Making your work reproducible. Sometimes, to make your tests reproducible, you need a random split that gives the same output on every function call. You can do that with the parameter random_state.

00:16 The specific value of random_state isn’t important, as long as you use the same one each time. It could be any non-negative integer. You could use an instance of numpy.random.RandomState instead, but that is a more complex approach.
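As a minimal sketch of what a fixed random_state gives you, the snippet below calls train_test_split() twice with the same seed and checks that both calls return the same arrays. The arrays x and y are illustrative assumptions chosen to match the twelve observations described in this lesson, and the seed value 4 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data (an assumption): 12 observations with 2 features each,
# and 12 binary labels containing six 0s and six 1s.
x = np.arange(1, 25).reshape(12, 2)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# Two separate calls with the same integer seed.
first = train_test_split(x, y, random_state=4)
second = train_test_split(x, y, random_state=4)

# Compare x_train, x_test, y_train, and y_test pairwise.
same = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same)
```

If you leave random_state out, each call draws a fresh shuffle, so the two results would usually differ.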

00:29 In the previous example, you used a dataset with twelve observations and got a training sample with nine rows and a test sample with three rows. That’s because you didn’t specify the desired size of the training and test sets.

00:42 By default, 25% of the samples are assigned to the test set. This ratio is generally fine for many applications, but it’s not always what you need. Typically, you’ll want to define the size of the test or training set explicitly, and sometimes you’ll even want to experiment with different values.
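To see the default in action, here is a short sketch that splits twelve observations without passing train_size or test_size. The arrays x and y are illustrative assumptions, not data shown in this transcript.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data (an assumption): 12 observations, 2 features each.
x = np.arange(1, 25).reshape(12, 2)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# No train_size or test_size given, so 25% of the samples go to the test set.
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=4)

print(len(x_train), len(x_test))  # 9 3
```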

01:01 You can do that with the parameters train_size or test_size. Onscreen, you can see train_test_split() being called again, this time with test_size and random_state being specified.

01:20 With this change, you get a different result from before. Earlier, you had a training set with nine items and a test set with three items. Now, with test_size being 4, the training set has eight items and the test set predictably has four items.

01:42 You’d get the same result with test_size=0.33 because 33% of 12 is 3.96, which is rounded up to 4. There’s one more very important difference between the last two examples.
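The sketch below compares the two ways of specifying the test size: an integer for an absolute number of samples and a float for a fraction of the dataset. The arrays x and y are illustrative assumptions standing in for the data used onscreen.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data (an assumption): 12 observations, 2 features each.
x = np.arange(1, 25).reshape(12, 2)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# Integer test_size: exactly 4 samples go to the test set.
by_count = train_test_split(x, y, test_size=4, random_state=4)

# Float test_size: 0.33 * 12 = 3.96, which is rounded up to 4 samples.
by_fraction = train_test_split(x, y, test_size=0.33, random_state=4)

for x_train, x_test, y_train, y_test in (by_count, by_fraction):
    print(len(x_train), len(x_test))  # 8 4 for both calls
```

Because both calls use the same random_state and end up with the same set sizes, they should also select the same rows.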

01:55 You now get the same result each time you run the function. This is because you’ve fixed the random number generator with random_state=4. Onscreen, you can see what’s going on when you call train_test_split().

02:11 The samples of the dataset are shuffled randomly and then split into the training and test sets according to the size that was defined. You can see that y has six 0s and six 1s. However, the test set has three 0s out of four items.

02:28 If you want to approximately keep the proportion of y values through the training and test sets, then pass stratify=y. This will enable stratified splitting.
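Here is a minimal sketch of a stratified split, counting the labels in each set with np.bincount(). The arrays x and y are illustrative assumptions, with y holding six 0s and six 1s as described above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data (an assumption): six 0s and six 1s in y.
x = np.arange(1, 25).reshape(12, 2)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# stratify=y keeps the class proportions of y in both resulting sets.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=4, random_state=4, stratify=y
)

# np.bincount() returns the number of 0s and 1s in each label array.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts, test_counts)  # [4 4] [2 2]
```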

02:58 Now, y_train and y_test have the same ratios of 0s and 1s as the original y array. Stratified splits are desirable in some cases, like when you’re classifying an imbalanced dataset, which is a dataset with a significant difference in the number of samples that belong to distinct classes. Finally, you can turn off data shuffling and random split with shuffle=False.

03:34 Now you have a split in which the first two-thirds of samples in the original x and y arrays are assigned to the training set

03:44 and the last third to the test set.
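As a sketch of the unshuffled split, the snippet below passes shuffle=False and checks that the training set is simply the first eight rows and the test set the last four. The arrays x and y and the test_size value of 0.33 are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data (an assumption): 12 observations, 2 features each.
x = np.arange(1, 25).reshape(12, 2)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# shuffle=False keeps the original order, so no random_state is needed.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, shuffle=False
)

# The split is a plain slice: first 8 rows for training, last 4 for testing.
train_is_head = np.array_equal(x_train, x[:8]) and np.array_equal(y_train, y[:8])
test_is_tail = np.array_equal(x_test, x[8:]) and np.array_equal(y_test, y[8:])
print(train_is_head, test_is_tail)  # True True
```

Note that stratify can't be combined with shuffle=False, so leave it at its default of None here.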

03:53 There’s no shuffling, and therefore there’s no randomness. In the next section of the course, you’ll see how you can use train_test_split() when working with supervised machine learning.
