How to Apply train_test_split()
Now that you have both imported, you can use numpy to create a dataset and train_test_split() to split that data into training sets and test sets. You’ll split inputs and outputs at the same time with a single function call. With train_test_split(), you need to provide the sequences that you want to split as well as any optional arguments. It returns a list of NumPy arrays, other sequences, or SciPy sparse matrices if appropriate.
arrays is the sequence of lists, NumPy arrays, pandas DataFrames, or similar array-like objects that hold the data that you want to split. All these objects together make up the dataset, and they must be of the same length. In supervised machine learning applications, you’ll typically work with two such sequences: a two-dimensional array with the inputs, typically known as x, and a one-dimensional array of outputs, typically known as y.
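To make the same-length requirement concrete, here is a minimal sketch. The toy arrays are made up for illustration and are not part of the lesson. It assumes scikit-learn and NumPy are installed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 5 samples with 2 features each, plus 5 outputs
x = np.arange(10).reshape(5, 2)
y = np.arange(5)

# Both arrays have 5 rows, so they can be split together
parts = train_test_split(x, y)
print(len(parts))  # Two pieces (train and test) per input array

# An output array with only 4 items no longer matches the 5 rows of x
try:
    train_test_split(x, np.arange(4))
    mismatch_rejected = False
except ValueError as error:
    mismatch_rejected = True
    print("Rejected:", error)
```

The second call fails because every array that you pass in must have the same number of samples.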
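test_size is the number that defines the size of the test set. It’s very similar to train_size, which defines the size of the training set. You should provide either train_size or test_size. If neither is given, then the default share of the dataset that will be used for testing is 0.25, or 25 percent.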
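As a rough illustration of how the size arguments behave, the sketch below splits a made-up dataset of twelve samples three ways. The data and variable names are assumptions for this example only. The default share of 25 percent is scikit-learn's documented default.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset with 12 samples
x = np.arange(24).reshape(12, 2)
y = np.arange(12)

# No size given: scikit-learn falls back to a 25 percent test share
x_train, x_test, y_train, y_test = train_test_split(x, y)
default_sizes = (len(x_train), len(x_test))

# A float is read as a share of the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)
float_sizes = (len(x_train), len(x_test))

# An int is read as an absolute number of samples
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=4)
int_sizes = (len(x_train), len(x_test))

print(default_sizes, float_sizes, int_sizes)
```

Whichever of the two arguments you leave out is set to the complement of the one you provide.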
random_state is the object that controls randomization during splitting. It can either be an int or an instance of RandomState. The default value is None.
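One way to see what random_state does is to run the same split twice with the same seed. This is a minimal sketch on made-up data. The seed value 4 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset with 12 samples
x = np.arange(24).reshape(12, 2)
y = np.arange(12)

# Two calls with the same integer seed
first = train_test_split(x, y, random_state=4)
second = train_test_split(x, y, random_state=4)

# Compare the four returned arrays pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)
```

With a fixed seed, both calls should place the same samples in the training and test sets, which makes your results reproducible.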
shuffle is a Boolean that determines whether or not to shuffle the dataset before applying the split.
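To illustrate the effect of turning shuffling off, the following sketch uses a made-up dataset whose outputs are simply the row numbers, so you can see which rows end up where.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: y holds the row index of each sample
x = np.arange(24).reshape(12, 2)
y = np.arange(12)

# Without shuffling, the original order is preserved
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=4, shuffle=False
)

print(y_train)  # The first eight rows
print(y_test)   # The last four rows
```

With shuffle=False, the training set is taken from the start of the data and the test set from the end. This can be useful when the order of the samples matters, as it does with time series.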
stratify is an array-like object that, if not None, determines how to use a stratified split.
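A stratified split keeps the class proportions of the array that you pass to stratify. The sketch below assumes a made-up, imbalanced set of labels with eight zeros and four ones. The seed is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: eight samples of class 0, four of class 1
x = np.arange(24).reshape(12, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

# Stratify on y so both subsets keep the 2:1 class ratio
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, stratify=y, random_state=0
)

# Count how many samples of each class landed in each subset
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts, test_counts)
```

Both subsets keep the two-to-one ratio between the classes, which an unstratified random split doesn't guarantee.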
Now it’s time to try data splitting. You’ll start by creating a simple dataset to work with. This dataset will contain the inputs in the two-dimensional array x and the outputs in the one-dimensional array y.
Here, you can see NumPy’s arange() being used, which is extremely convenient for generating arrays based on numerical ranges. You’ll also use .reshape() to modify the shape of the array returned by arange() and get a two-dimensional data structure.
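A dataset like the one described above could be built as follows. The specific numbers, and the choice of twelve samples with two features, are assumptions for illustration rather than values confirmed by this text.

```python
import numpy as np

# arange(1, 25) yields the 24 integers from 1 to 24,
# and reshape(12, 2) arranges them into 12 rows of 2 columns
x = np.arange(1, 25).reshape(12, 2)

# One output value per row of x (arbitrary binary labels)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

print(x.shape)
print(y.shape)
```

Each row of x is one observation, and the item of y at the same position is its output.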
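Given two sequences, such as x and y here, train_test_split() performs the split and returns four sequences, which in this case will be NumPy arrays, in this order:
x_train, the training part of the first sequence
x_test, the test part of the first sequence
y_train, the training part of the second sequence
y_test, the test part of the second sequence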
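Putting this together, a basic call might look like the sketch below. The dataset values are assumed for illustration. Because the split is random, the rows you get will vary from run to run, though the shapes will not.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 12 samples with 2 inputs each, plus 12 outputs
x = np.arange(1, 25).reshape(12, 2)
y = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# Unpack the four returned arrays in the order they are returned
x_train, x_test, y_train, y_test = train_test_split(x, y)

print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)
```

With the default settings, nine of the twelve samples go to the training set and three go to the test set.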
04:00 You probably got different results from the ones you see onscreen. This is because dataset splitting is random by default, and the result differs each time you run the function. However, this often isn’t what you want.