A cascade is a series of waterfalls coming one after another. Computer science borrows the concept to solve a complex problem with a chain of simple units. The problem here is reducing the number of computations needed for each image.
To solve it, Viola and Jones turned their strong classifier (consisting of thousands of weak classifiers) into a cascade where each weak classifier represents one stage. The job of the cascade is to quickly discard non-faces and avoid wasting precious time and computations.
When an image subregion enters the cascade, it is evaluated by the first stage. If that stage evaluates the subregion as positive, meaning that it thinks it’s a face, then the output of the stage is “maybe.”
If a subregion gets a maybe, then it is sent to the next stage of the cascade. If that one gives a positive evaluation, then that’s another maybe, and the subregion is sent to the third stage.
This process is repeated until the subregion has passed through all stages of the cascade. If all classifiers approve the subregion, then it is finally classified as a human face and is presented to the user as a detection.
If, however, the first stage gives a negative evaluation, then the subregion is immediately discarded as not containing a human face. If it passes the first stage but fails the second stage, then it is discarded as well. In short, the subregion can be discarded at any stage of the cascade.
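The maybe-or-discard flow described above can be sketched in a few lines of Python. This is only a toy illustration, not the actual Viola-Jones implementation: the `Stage` type, the `has_feature()` helper, and the dictionary of precomputed feature flags are all hypothetical stand-ins for real Haar-like feature evaluations.

```python
from typing import Callable, Sequence

# Hypothetical type for illustration: a stage takes a subregion and returns
# True for "maybe a face" or False for "definitely not a face".
Stage = Callable[[dict], bool]


def has_feature(name: str) -> Stage:
    """Build a toy stage that checks a precomputed flag on the subregion."""
    return lambda subregion: subregion.get(name, False)


def run_cascade(subregion: dict, stages: Sequence[Stage]) -> bool:
    """Return True only if every stage answers "maybe"."""
    for stage in stages:
        if not stage(subregion):
            return False  # discarded: later stages never run
    return True  # survived every stage: report a detection


stages = [has_feature("eyes"), has_feature("nose"), has_feature("mouth")]

face = {"eyes": True, "nose": True, "mouth": True}
wall = {"eyes": False, "nose": False, "mouth": False}

print(run_cascade(face, stages))  # True
print(run_cascade(wall, stages))  # False
```

The early `return False` is the whole point of the design: for the `wall` subregion, only the first stage is ever evaluated.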
The cascade is designed so that non-faces get discarded very quickly, which saves a lot of time and computational resources. Since every classifier represents a feature of a human face, a positive detection basically says, “Yes, this subregion contains all the features of a human face.” But as soon as one feature is missing, the cascade rejects the whole subregion.
To accomplish this effectively, it’s important to put your best-performing classifiers early in the cascade. In the Viola-Jones algorithm, the eyes and nose bridge classifiers are examples of the best-performing weak classifiers.
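To see why ordering matters, here is a minimal sketch that counts how many classifier evaluations a small batch of subregions costs before and after sorting by weight. The `WeakClassifier` class, the weights, and the toy subregions are all made up for illustration, and the example assumes that a higher weight goes along with rejecting more non-faces.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class WeakClassifier:
    name: str
    weight: float  # hypothetical weight; higher means better performing
    predict: Callable[[dict], bool]


def feature_check(name: str) -> Callable[[dict], bool]:
    """Toy predictor that reads a precomputed flag from the subregion."""
    return lambda subregion: subregion[name]


def order_by_weight(classifiers):
    """Put the highest-weight classifiers first in the cascade."""
    return sorted(classifiers, key=lambda c: c.weight, reverse=True)


def count_evaluations(subregions, cascade) -> int:
    """Count classifier calls, stopping at the first rejection."""
    total = 0
    for subregion in subregions:
        for classifier in cascade:
            total += 1
            if not classifier.predict(subregion):
                break
    return total


unordered = [
    WeakClassifier("mouth", 0.3, feature_check("mouth")),
    WeakClassifier("nose", 0.6, feature_check("nose")),
    WeakClassifier("eyes", 0.9, feature_check("eyes")),
]
ordered = order_by_weight(unordered)

# Toy data: three non-faces that lack eyes, plus one face.
subregions = [
    {"eyes": False, "nose": True, "mouth": True},
    {"eyes": False, "nose": False, "mouth": True},
    {"eyes": False, "nose": True, "mouth": True},
    {"eyes": True, "nose": True, "mouth": True},
]

print(count_evaluations(subregions, unordered))  # 11
print(count_evaluations(subregions, ordered))    # 6
```

With the eyes classifier first, each of the three non-faces is rejected after a single evaluation, so the total drops from 11 to 6 on this toy data.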
00:00 When the Viola-Jones framework is being used to detect faces, a 24 by 24 pixel subregion moves across the image to detect the presence of faces. In order to figure out if a face is present, it uses what’s called a classifier cascade.
00:51 So, if any one of the classifiers is missing, then we can assume that this specific subregion does not contain a face and just move on. This dramatically improves efficiency because it prevents us from scanning for all of the other features in a strong classifier if we already know that one of them is missing. Think of it like this: say the strong classifier we are using is made up of three features—the eyes, the nose, and the mouth.
01:20 If we start our search by looking for the eyes and we don’t find them, then what’s the point of searching for the nose and the mouth? We already know that the face does not exist here, so let’s move on to the next subregion and keep scanning. In order to accomplish this effectively, it’s important that we put our best-performing classifiers, aka the ones with the highest weight, early in our cascade.
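The 24 by 24 pixel subregion that moves across the image can be sketched as a simple generator of window positions. This is a simplified illustration: the function name `sliding_windows()` and the `step` parameter are assumptions made for this example, and scanning the image at multiple scales is left out entirely.

```python
from typing import Iterator, Tuple


def sliding_windows(
    image_width: int,
    image_height: int,
    window: int = 24,
    step: int = 1,  # hypothetical stride, not taken from the lesson
) -> Iterator[Tuple[int, int]]:
    """Yield the (left, top) corner of every window that fits in the image."""
    for top in range(0, image_height - window + 1, step):
        for left in range(0, image_width - window + 1, step):
            yield left, top


# A tiny 26 x 25 pixel image fits 3 horizontal and 2 vertical positions.
positions = list(sliding_windows(26, 25))
print(len(positions))  # 6
print(positions[0])    # (0, 0)
print(positions[-1])   # (2, 1)
```

Even a modest 384 by 288 pixel image yields 361 × 265 = 95,665 positions at a step of one pixel, which is why discarding non-faces after one or two stages matters so much.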