How to Apply train_test_split()
Getting started with train_test_split(). You need to import train_test_split() and numpy before you can use them, so let’s start with the import statements.
Now that you have both imported, you can use numpy to create a dataset and train_test_split() to split that data into training sets and test sets. You’ll split inputs and outputs at the same time with a single function call. With train_test_split(), you need to provide the sequences that you want to split as well as any optional arguments. It returns a list of NumPy arrays, other sequences, or SciPy sparse matrices if appropriate.
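The import step and a first call might look like the following minimal sketch. It assumes scikit-learn and NumPy are installed. The names inputs, outputs, and parts, and the values in the arrays, are made-up placeholders used only to show what comes back from the call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder inputs and outputs (hypothetical data, not from the course).
inputs = np.array([10, 20, 30, 40])
outputs = np.array([0, 1, 0, 1])

# One call splits both sequences. The result is a list holding a
# training part and a test part for each sequence that was passed in.
parts = train_test_split(inputs, outputs)
print(type(parts).__name__, len(parts))
```

Two sequences go in, so four pieces come out.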
arrays is the sequence of lists, NumPy arrays, pandas DataFrames, or similar array-like objects that hold the data that you want to split. All these objects together make up the dataset, and they must be of the same length. In supervised machine learning applications, you’ll typically work with two such sequences: a two-dimensional array with the inputs, typically known as x, and a one-dimensional array of outputs, typically known as y.
options are the optional keyword arguments that you can use to get the desired behavior.
train_size is the number that defines the size of the training set. If you provide a float, then it must be between 0.0 and 1.0, and it will define the share of the dataset you use for training. If you provide an int, then it will represent the total number of the training samples. The default value is None.
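As a rough illustration of the float and int forms of train_size, here is a small sketch. The 12-sample arrays x_demo and y_demo are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset with 12 samples (not the course's data).
x_demo = np.arange(24).reshape(12, 2)
y_demo = np.arange(12)

# A float is read as the share of samples that goes to the training set.
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, train_size=0.75)
print(len(x_tr), len(x_te))  # 9 3

# An int is read as the absolute number of training samples.
x_tr8, x_te8, y_tr8, y_te8 = train_test_split(x_demo, y_demo, train_size=8)
print(len(x_tr8), len(x_te8))  # 8 4
```

In both calls, whatever isn't assigned to the training set ends up in the test set.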
test_size is the number that defines the size of the test set. It’s very similar to train_size. You should provide either train_size or test_size. If neither is given, then the default share of the dataset that will be used for testing is 0.25.
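To make the default and the two forms of test_size concrete, here is a short sketch. As before, x_demo and y_demo are hypothetical 12-sample arrays chosen so the numbers divide evenly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset with 12 samples (not the course's data).
x_demo = np.arange(24).reshape(12, 2)
y_demo = np.arange(12)

# Neither size given: a quarter of the samples goes to the test set.
_, x_te_default, _, _ = train_test_split(x_demo, y_demo)
print(len(x_te_default))  # 3

# A float is the share of samples used for testing.
_, x_te_half, _, _ = train_test_split(x_demo, y_demo, test_size=0.5)
print(len(x_te_half))  # 6

# An int is the absolute number of test samples.
_, x_te_four, _, _ = train_test_split(x_demo, y_demo, test_size=4)
print(len(x_te_four))  # 4
```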
random_state is the object that controls randomization during splitting. It can either be an int or an instance of RandomState. The default value is None.
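One way to see the effect of random_state is to run the same split twice with the same seed, as in this sketch. The seed value 4 and the arrays x_demo and y_demo are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset with 12 samples (not the course's data).
x_demo = np.arange(24).reshape(12, 2)
y_demo = np.arange(12)

# Passing the same int seed makes the shuffle repeatable.
first = train_test_split(x_demo, y_demo, random_state=4)
second = train_test_split(x_demo, y_demo, random_state=4)

# Compare the four returned arrays pairwise.
same = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same)  # True
```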
shuffle is a Boolean that determines whether or not to shuffle the dataset before applying the split.
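A quick sketch of what turning shuffling off does, assuming the hypothetical arrays x_demo and y_demo below: the samples keep their original order, so the training set is simply the first part of the data and the test set is the rest.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset with 12 samples (not the course's data).
x_demo = np.arange(24).reshape(12, 2)
y_demo = np.arange(12)

# With shuffle=False the split is a plain cut at the 75 percent mark.
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, shuffle=False)
print(y_tr)  # [0 1 2 3 4 5 6 7 8]
print(y_te)  # [ 9 10 11]
```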
stratify is an array-like object that, if not None, determines how to use a stratified split.
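Here is a small sketch of a stratified split. The array labels is a hypothetical set of class labels with eight zeros and four ones. Passing it as stratify asks for the same two-to-one class ratio in both the training and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 12 samples, classes in a 2:1 ratio.
x_demo = np.arange(24).reshape(12, 2)
labels = np.array([0] * 8 + [1] * 4)

# stratify=labels keeps the class proportions in each part of the split.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, labels, test_size=0.25, stratify=labels
)
print(np.bincount(y_tr))  # [6 3]
print(np.bincount(y_te))  # [2 1]
```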
Now it’s time to try data splitting. You’ll start by creating a simple dataset to work with. This dataset will contain the inputs in the two-dimensional array x and the outputs in the one-dimensional array y.
Here, you can see NumPy’s arange() being used, which is extremely convenient for generating arrays based on numerical ranges. You’ll also use .reshape() to modify the shape of the array returned by arange() and get a two-dimensional data structure.
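Creating such a dataset might look like the sketch below. The exact values shown onscreen aren't part of this text, so the range passed to arange(), the 12-by-2 shape, and the contents of y are assumptions chosen for illustration.

```python
import numpy as np

# Assumed example values; the course's exact arrays are not shown in the text.
# arange(1, 25) yields the integers 1 through 24, and reshape(12, 2)
# arranges them as 12 samples with 2 features each.
x = np.arange(1, 25).reshape(12, 2)

# One output value per sample (hypothetical labels).
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

print(x.shape, y.shape)  # (12, 2) (12,)
```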
Here, you can see the x and y NumPy arrays that were created. You can split both input and output datasets with a single function call, as seen onscreen.
Given two sequences, such as x and y, train_test_split() performs the split and returns four sequences, which in this case will be NumPy arrays, in this order:
x_train, the training part of the first sequence, x
x_test, the test part of the first sequence, x
y_train, the training part of the second sequence, y
y_test, the test part of the second sequence, y
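The call and the order of its return values can be sketched as follows. The contents of x and y are assumed example values rather than the exact arrays from the course. Only the shapes are printed, because the samples are shuffled differently on each run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed example dataset: 12 samples, 2 features, one output per sample.
x = np.arange(1, 25).reshape(12, 2)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# Unpack the four returned arrays in the order listed above.
x_train, x_test, y_train, y_test = train_test_split(x, y)

print(x_train.shape, x_test.shape)  # (9, 2) (3, 2)
print(y_train.shape, y_test.shape)  # (9,) (3,)
```

With the default settings, nine of the twelve samples go to training and three to testing.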
You probably got different results from the ones you see onscreen. This is because dataset splitting is random by default, and the result differs each time you run the function. However, this often isn’t what you want.
In the next part of the course, you’ll see how you can modify your code so that you get consistent, reproducible results.