The Importance of Data Splitting

Splitting Datasets With scikit-learn and train_test_split() Darren Jones 03:35

For more information about concepts covered in this lesson, you can check out:

00:00 The importance of data splitting. Supervised machine learning is about creating models that precisely map the given inputs—independent variables, or predictors—to the given outputs—dependent variables, or responses. How you measure the precision of your model depends on the type of problem you’re trying to solve.

00:23 In regression analysis, you typically use the coefficient of determination, root-mean-square error, mean absolute error, or similar quantities.

00:34 For classification problems, you often apply accuracy, precision, recall, F1 score, and related indicators. The acceptable numeric values that measure precision vary from field to field.

00:50 What’s most important to understand is that you usually need unbiased evaluation to properly use these measures, assess the predictive performance of your model, and to validate the model.

01:01 This means that you can’t evaluate the predictive performance of a model with the same data that you’ve used for training. You need to evaluate the model with fresh data that hasn’t been seen by the model before. You can accomplish that by splitting your dataset before you use it.

01:18 Splitting your dataset is essential for an unbiased evaluation of prediction performance. In most cases, it’s enough to split your dataset randomly into three subsets.

01:30 The training set is applied to train, or fit, your model. For example, you use the training set to find the optimal weights or coefficients for linear regression, logistic regression, or neural networks.

02:09 The test set is needed for an unbiased evaluation of the final model. You shouldn’t use it for fitting or validation.

02:49 Underfitted models will likely have poor performance with both training and test sets.

02:56 Overfitting usually takes place when a model has an excessively complex structure and learns both the existing relations amongst data and noise. Such models often have bad generalization capabilities.

03:09 Although they work well with training data, they usually yield poor performance with unseen test data.

03:17 You can find a more detailed explanation of underfitting and overfitting in Real Python’s Linear Regression in Python course, as seen onscreen now.

03:27 Now that you’ve seen why data splitting is important, let’s move on to what’s needed to efficiently split data in Python.

Become a Member to join the conversation.