How to Apply train_test_split()

Splitting Datasets With scikit-learn and train_test_split() Darren Jones 04:23

00:00 Getting started with train_test_split(). You need to import train_test_split() and numpy before you can use them, so let’s start with the import statements.
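As a minimal sketch of those import statements, assuming scikit-learn and NumPy are already installed, and using the conventional np alias (an assumption, since the lesson's exact code isn't reproduced here):

```python
# NumPy is imported under its conventional alias, np.
import numpy as np

# train_test_split() lives in scikit-learn's model_selection module.
from sklearn.model_selection import train_test_split
```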

00:54 All these objects together make up the dataset, and they must be of the same length. In supervised machine learning applications, you’ll typically work with two such sequences: a two-dimensional array with the inputs, typically known as x, and a one-dimensional array of outputs, typically known as y.

01:16 options are the optional keyword arguments that you can use to get the desired behavior. train_size is the number that defines the size of the training set.

01:26 If you provide a float, then it must be in the range 0 to 1, and it will define the share of the dataset you use for training.

01:33 If you provide an int, then it will represent the total number of training samples. The default value is None. test_size is the number that defines the size of the test set.

01:44 It’s very similar to train_size. You should provide either train_size or test_size. If neither is given, then the default share of the dataset that will be used for testing is 0.25, or 25 percent. random_state is the object that controls randomization during splitting.

02:04 It can either be an int or an instance of RandomState. The default value is None. shuffle is a Boolean that determines whether or not to shuffle the dataset before applying the split. stratify is an array-like object that, if not None, determines how to use a stratified split.
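To make these options concrete, here's a rough sketch of how the size, shuffle, and stratify arguments might affect a split. The toy arrays and every variable name below are hypothetical and chosen only for illustration. They aren't the data from the lesson.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 12 samples, 2 features, two balanced classes.
x = np.arange(24).reshape(12, 2)
y = np.array([0] * 6 + [1] * 6)

# Neither size given: 25% of the 12 samples (3) go to the test set.
x_train, x_test, y_train, y_test = train_test_split(x, y)
default_sizes = (len(x_train), len(x_test))

# A float train_size is the share of samples used for training.
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.5)
float_sizes = (len(x_train), len(x_test))

# An int test_size is an absolute number of test samples.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=4)
int_sizes = (len(x_train), len(x_test))

# shuffle=False keeps the original order: the last rows become the test set.
x_train_ordered, x_test_ordered, _, _ = train_test_split(x, y, shuffle=False)

# stratify=y keeps the class proportions of y in both parts.
_, _, y_train_strat, y_test_strat = train_test_split(
    x, y, test_size=4, stratify=y
)

print(default_sizes, float_sizes, int_sizes)
print(x_test_ordered)
print(np.bincount(y_train_strat), np.bincount(y_test_strat))
```

The sizes here don't depend on the random shuffling, only on the arguments, so they stay the same from run to run even though the selected rows may change.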

02:26 Now it’s time to try data splitting. You’ll start by creating a simple dataset to work with. This dataset will contain the inputs in the two-dimensional array x and the outputs in the one-dimensional array y.

02:46 Here, you can see NumPy’s arange() being used, which is extremely convenient for generating arrays based on numerical ranges. You’ll also use .reshape() to modify the shape of the array returned by arange() and get a two-dimensional data structure.
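A dataset like this could be built roughly as follows. The sizes (twelve samples with two features each) and the label values are assumptions made for illustration, so the arrays you see onscreen may differ.

```python
import numpy as np

# Assumed shape: arange() yields 24 consecutive integers, and .reshape()
# turns them into 12 rows (samples) of 2 columns (features).
x = np.arange(1, 25).reshape(12, 2)

# Hypothetical binary outputs, one per row of x.
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

print(x.shape)  # (12, 2)
print(y.shape)  # (12,)
```

Both arrays have the same length along the first axis, which is what train_test_split() requires.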

03:10 Here, you can see the x and y NumPy arrays that were created. You can split both input and output datasets with a single function call, as seen onscreen.

03:27 Given two sequences, such as x and y here, train_test_split() performs the split and returns four sequences, which in this case will be NumPy arrays, in this order: x_train, the training part of the first sequence x.

03:43 x_test, the test part of the first sequence x. y_train, the training part of the second sequence y. And finally, y_test, the test part of the second sequence y.
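The single call and its four return values can be sketched like this. The x and y arrays are assumed stand-ins with twelve samples, not necessarily the ones from the lesson.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed stand-in data: 12 samples, 2 features, hypothetical binary labels.
x = np.arange(1, 25).reshape(12, 2)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# One call splits both arrays and returns four in a fixed order.
x_train, x_test, y_train, y_test = train_test_split(x, y)

# With the default 25% test share, 9 samples train and 3 test.
print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)
```

The rows of x and the entries of y are split with the same indices, so each input stays paired with its output.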

04:00 You probably got different results from the ones you see onscreen. This is because dataset splitting is random by default, and the result differs each time you run the function. However, this often isn’t what you want.

04:14 In the next part of the course, you’ll see how you can modify your code so that you get consistent, reproducible results.
