The Importance of Data Splitting
00:00 The importance of data splitting. Supervised machine learning is about creating models that precisely map the given inputs—independent variables, or predictors—to the given outputs—dependent variables, or responses. How you measure the precision of your model depends on the type of problem you’re trying to solve.
00:23 In regression analysis, you typically use the coefficient of determination, root-mean-square error, mean absolute error, or similar quantities.
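As a minimal sketch of the three regression measures just mentioned, the following standard-library Python computes them by hand. The function name `regression_metrics` and the sample values are hypothetical and chosen only for illustration. In practice you would more likely use a library implementation.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (r2, rmse, mae) for two equal-length sequences of numbers.

    Hypothetical helper for illustration. R^2 is undefined when y_true is
    constant (the total sum of squares would be zero).
    """
    n = len(y_true)
    mean_true = sum(y_true) / n
    residuals = [t - p for t, p in zip(y_true, y_pred)]

    ss_res = sum(r * r for r in residuals)              # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares

    r2 = 1 - ss_res / ss_tot          # coefficient of determination
    rmse = math.sqrt(ss_res / n)      # root-mean-square error
    mae = sum(abs(r) for r in residuals) / n  # mean absolute error
    return r2, rmse, mae


# Made-up observed and predicted responses
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

r2, rmse, mae = regression_metrics(y_true, y_pred)
print(f"R^2={r2:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}")
```

A coefficient of determination close to 1 and errors close to 0 indicate a good fit, but what counts as good enough depends on the problem.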
00:34 For classification problems, you often apply accuracy, precision, recall, F1 score, and related indicators. The acceptable numeric values that measure precision vary from field to field.
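The classification measures can be sketched the same way for a binary problem. This is an illustrative standard-library version, not a library API: the function name `classification_metrics`, the `positive` parameter, and the labels are all assumptions made for the example.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for binary labels.

    Hypothetical helper for illustration. Ratios with a zero denominator
    are reported as 0.0 here, which is one common convention.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up true and predicted labels
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

scores = classification_metrics(y_true, y_pred)
print(scores)
```

Note that precision here is the specific classification metric (the share of predicted positives that are correct), which is narrower than the everyday sense of how precise a model is.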
00:50 What’s most important to understand is that you usually need unbiased evaluation to properly use these measures, assess the predictive performance of your model, and validate the model.
01:01 This means that you can’t evaluate the predictive performance of a model with the same data that you’ve used for training. You need to evaluate the model with fresh data that hasn’t been seen by the model before. You can accomplish that by splitting your dataset before you use it.
01:18 Splitting your dataset is essential for an unbiased evaluation of prediction performance. In most cases, it’s enough to split your dataset randomly into three subsets.
01:30 The training set is applied to train, or fit, your model. For example, you use the training set to find the optimal weights or coefficients for linear regression, logistic regression, or neural networks.
01:45 The validation set is used for unbiased model evaluation during hyperparameter tuning. For example, when you want to find the optimal number of neurons in a neural network or the best kernel for a support vector machine, you experiment with different values. For each considered setting of hyperparameters, you fit the model with the training set and assess its performance with the validation set.
02:09 The test set is needed for an unbiased evaluation of the final model. You shouldn’t use it for fitting or validation.
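The three subsets described above can be produced with a shuffle and two cuts. The sketch below uses only the standard library. The function name `train_val_test_split`, its parameters, and the 60/20/20 proportions are assumptions for illustration and are not prescribed by the lesson.

```python
import random


def train_val_test_split(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Randomly split rows into (train, validation, test) lists.

    Hypothetical helper for illustration. The fixed seed only makes the
    shuffle reproducible.
    """
    if not 0 < val_fraction + test_fraction < 1:
        raise ValueError("validation and test fractions must leave room for training data")

    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # random order removes any ordering bias

    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)

    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


data = list(range(20))  # stand-in for 20 observations
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))
```

With this layout, you would fit each candidate hyperparameter setting on `train`, compare the settings on `val`, and touch `test` only once, for the final model.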
02:19 In less complex cases where you don’t have to tune hyperparameters, it’s okay to work with only the training and test sets. Splitting a dataset might also be important for detecting whether your model suffers from one of two very common problems called underfitting and overfitting. Underfitting is usually the consequence of a model being unable to capture the relations among data. For example, this can happen when trying to represent nonlinear relations with a linear model.
02:49 Underfitted models will likely have poor performance with both training and test sets.
02:56 Overfitting usually takes place when a model has an excessively complex structure and learns both the existing relations among data and the noise. Such models often generalize poorly.
03:09 Although they work well with training data, they usually yield poor performance with unseen test data.
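To make the contrast concrete, here is a small sketch on synthetic data, assuming a quadratic relation with Gaussian noise. A straight line stands in for an underfitted model, and a one-nearest-neighbor lookup that memorizes the training points stands in for an overfitted one. All names (`fit_line`, `fit_nearest_neighbor`, `mse`) and the data are hypothetical and chosen only for illustration.

```python
import random


def mse(model, points):
    """Mean squared error of a one-argument model over (x, y) pairs."""
    return sum((y - model(x)) ** 2 for x, y in points) / len(points)


def fit_line(points):
    """Ordinary least-squares straight line: too simple for curved data."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_nearest_neighbor(points):
    """Memorize the training points and return the y of the closest x."""
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


# Synthetic data: y = x^2 plus noise. The points are drawn independently,
# so taking the first 70 as training data is already a random split.
rng = random.Random(42)
points = []
for _ in range(100):
    x = rng.uniform(-3, 3)
    points.append((x, x * x + rng.gauss(0, 1)))
train, test = points[:70], points[70:]

line = fit_line(train)
memorizer = fit_nearest_neighbor(train)

underfit_train, underfit_test = mse(line, train), mse(line, test)
overfit_train, overfit_test = mse(memorizer, train), mse(memorizer, test)

print(f"line:      train MSE={underfit_train:.2f}  test MSE={underfit_test:.2f}")
print(f"memorizer: train MSE={overfit_train:.2f}  test MSE={overfit_test:.2f}")
```

The line should show a large error on both sets, while the memorizer reproduces the training data exactly and does worse on the test set. Evaluating only on the training data would hide that gap, which is why the held-out set matters.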
03:17 You can find a more detailed explanation of underfitting and overfitting in Real Python’s Linear Regression in Python course, as seen onscreen now.
03:27 Now that you’ve seen why data splitting is important, let’s move on to what’s needed to efficiently split data in Python.