Viola–Jones object detection framework

The Viola–Jones object detection framework is the first object detection framework to provide competitive object detection rates in real-time proposed in 2001 by Paul Viola and Michael Jones.[1][2] Although it can be trained to detect a variety of object classes, it was motivated primarily by the problem of face detection. This algorithm is implemented in OpenCV as cvHaarDetectObjects().

Problem description

The basic problem to be solved is to implement an algorithm for detection of faces in an image. This can be solved easily by humans. However there is a stark contrast to how difficult it actually is to make a computer successfully solve this task. In order to ease the task Viola–Jones limit themselves to full view frontal upright faces. That is, in order to be detected the entire face must point towards the camera and it should not be tilted to any side. This may compromise the requirement for being unconstrained a little bit, but considering that the detection algorithm most often will be succeeded by a recognition algorithm these demands seem quite reasonable.

Components of the framework

Feature types used by Viola and Jones

Feature types and evaluation

The main characteristics of Viola–Jones algorithm which makes it a good detection algorithm are:

The algorithm has mainly 4 stages:

  1. Haar Features Selection
  2. Creating Integral Image
  3. Adaboost Training algorithm
  4. Cascaded Classifiers

The features employed by the detection framework universally involve the sums of image pixels within rectangular areas. As such, they bear some resemblance to Haar basis functions, which have been used previously in the realm of image-based object detection.[3] However, since the features used by Viola and Jones all rely on more than one rectangular area, they are generally more complex. The figure at right illustrates the four different types of features used in the framework. The value of any given feature is always simply the sum of the pixels within clear rectangles subtracted from the sum of the pixels within shaded rectangles. As is to be expected, rectangular features of this sort are rather primitive when compared to alternatives such as steerable filters. Although they are sensitive to vertical and horizontal features, their feedback is considerably coarser.

Haar Feature that looks similar to the bridge of the nose is applied onto the face
Haar Feature that looks similar to the eye region which is darker than the upper cheeks is applied onto a face
3rd and 4th kind of Haar Feature

1. Haar Features – All human faces share some similar properties. This knowledge is used to construct certain features known as Haar Features.

The properties that are similar for a human face are:

That is useful domain knowledge:

The four features applied in this algorithm are applied onto a face and shown on the left.

Rectangle features:

However, with the use of an image representation called the integral image, rectangular features can be evaluated in constant time, which gives them a considerable speed advantage over their more sophisticated relatives. Because each rectangular area in a feature is always adjacent to at least one other rectangle, it follows that any two-rectangle feature can be computed in six array references, any three-rectangle feature in eight,and any four-rectangle feature in just ten.

The integral image at location (x,y), is the sum of the pixels above and to the left of (x,y), inclusive.

Learning algorithm

The speed with which features may be evaluated does not adequately compensate for their number, however. For example, in a standard 24x24 pixel sub-window, there are a total of M = 162,336[4] possible features, and it would be prohibitively expensive to evaluate them all when testing an image. Thus, the object detection framework employs a variant of the learning algorithm AdaBoost to both select the best features and to train classifiers that use them. This algorithm constructs a “strong” classifier as a linear combination of weighted simple “weak” classifiers.

h(\mathbf{x}) = \text{sign}\left(\sum_{j=1}^{M} \alpha_j h_j(\mathbf{x})\right)

Each weak classifier is a threshold function based on the feature f_j.

h_j(\mathbf{x}) = 
\begin{cases}
-s_j &\text{if } f_j < \theta_j\\
s_j &\text{otherwise}
\end{cases}

The threshold value \theta_j and the polarity s_j \in \pm 1 are determine in the training, as well as the coefficients \alpha_j.

Here a simplified version of the learning algorithm is reported:[5]

Input: Set of N positive and negative training images with their labels {(\mathbf{x}^i,y^i)}. If image i is a face y^i=1, if not y^i=-1.

  1. Initialization: assign a weight w^i_{1}=\frac{1}{N} to each image i.
  2. For each feature f_j with j = 1,...,M
    1. Renormalize the weights such that they sum to one.
    2. Apply the feature to each image in the training set, then find the optimal threshold and polarity \theta_j,s_j that minimizes the weighted classification error. That is \theta_j,s_j = \arg\min_{\theta,s} \;\sum_{i=1}^N w^i_{j} \varepsilon^i_{j} where \varepsilon^i_{j} = 
\begin{cases}
0 &\text{if }y^i = h_j(\mathbf{x}^i,\theta_j,s_j)\\
1 &\text{otherwise}
\end{cases}
    3. Assign a weight \alpha_j to h_j that is inversely proportional to the error rate. In this way best classifiers are considered more.
    4. The weights for the next iteration, i.e. w_{j+1}^i, are reduced for the images i that were correctly classified.
  3. Set the final classifier to h(\mathbf{x}) = \text{sign}\left(\sum_{j=1}^{M} \alpha_j h_j(\mathbf{x})\right)

Cascade architecture

In cascading, each stage consists of a strong classifier. So all the features are grouped into several stages where each stage has certain number of features.

The job of each stage is to determine whether a given sub-window is definitely not a face or may be a face. A given sub-window is immediately discarded as not a face if it fails in any of the stages.

A simple framework for cascade training is given below:

 while F(i) > Ftarget
     i++
     n(i) = 0; F(i)= F(i-1)

     while F(I) > f x F(i-1)
      - n(i) ++
      - use P and N to train a classifier with n(I) features using AdaBoost
      - Evaluate current cascaded classifier on validation set to determine F(i) & D(i)
      - decrease threshold for the ith classifier until the current cascaded classifier has a detection rate of at least d x D(i-1) (this also affects F(i))
      - N = ∅
      - If F(i) > Ftarget then evaluate the current cascaded detector on the set of non-face images and put any false detections into the set N.

The cascade architecture has interesting implications for the performance of the individual classifiers. Because the activation of each classifier depends entirely on the behavior of its predecessor, the false positive rate for an entire cascade is:

F = \prod_{i=1}^K f_i.

Similarly, the detection rate is:

D = \prod_{i=1}^K d_i.

Thus, to match the false positive rates typically achieved by other detectors, each classifier can get away with having surprisingly poor performance. For example, for a 32-stage cascade to achieve a false positive rate of 10^{-6}, each classifier need only achieve a false positive rate of about 65%. At the same time, however, each classifier needs to be exceptionally capable if it is to achieve adequate detection rates. For example, to achieve a detection rate of about 90%, each classifier in the aforementioned cascade needs to achieve a detection rate of approximately 99.7%.

Advantages of Viola–Jones algorithm

Disadvantages of Viola–Jones algorithm

Detected Face using the cascadeObjectDetector function in MATLAB. This function uses the Viola–Jones algorithm to detect faces

MATLAB code for using the cascadeObjectDetector() function

clear all
vid = videoinput('winvideo', 1, 'YUY2_640x480'); % Giving the framesize
vid.ReturnedColorSpace = 'RGB'; % Mentioning RGB format
vid.TriggerRepeat = Inf; % Triggers the camera repeatedly
vid.FrameGrabInterval = 1; % Time between successive frames
start(vid); % Starts capturing video
detector = vision.CascadeObjectDetector(); % Create a detector using Viola-Jones
while true % Infinite loop to continuously detect the face
    img = flipdim(getdata(vid, 1), 2); % Flips the image horizontally
    imshow(img); hold on; % Displays image
    bbox = step(detector, img); % Creating bounding box using detector
    if ~ isempty(bbox)
        rectangle('position', bbox(1, :), 'lineWidth', 2, 'edgeColor', 'y');
    end % Draws a yellow rectangle around the detected face
    flushdata(vid);
    hold off;
end

Related face detection and tracking algorithm

The algorithm that is similar to Viola–Jones but can detect and track even tilted and rotated faces is the KLT algorithm. Here the feature points are detected by scanning the face initially. Thereafter, the face can be detected and tracked even when the face is tilted and further away from the camera, which is not possible in case of Viola–Jones algorithm.[7]

Improvements over the Viola–Jones algorithm

An improved algorithm on Viola–Jones object detector[8]

MATLAB implementation of Viola–Jones algorithm[9]

OpenCV implementation of Viola–Jones algorithm

Haar Cascade Detection in OpenCV[10]

Cascade Classifier Training in OpenCV[11]

Citations of the Viola–Jones algorithm in Google Scholar[12]

Implementing the Viola–Jones Face Detection Algorithm by Ole Helvig Jensen[13]

Adaboost Explanation from ppt by Qing Chen, Discovery Labs, University of Ottawa and a video lecture by Ramsri Goutham.

Video link - [14]

References

  1. Rapid object detection using a boosted cascade of simple features
  2. Viola, Jones: Robust Real-time Object Detection, IJCV 2001 See pages 1,3.
  3. C. Papageorgiou, M. Oren and T. Poggio. A General Framework for Object Detection. International Conference on Computer Vision, 1998
  4. Viola-Jones’ face detection claims 180k features
  5. R. Szeliski, Computer Vision, algorithms and applications, Springer
  6. Viola, Jones: Robust Real-time Object Detection, IJCV 2001 See page 11.
  7. Face Detection and Tracking using the KLT algorithm
  8. Improved algorithm than Viola–Jones
  9. MATLAB implementation of Viola–Jones algorithm
  10. OpenCV Implementation of Viola–Jones
  11. Cascade Classifier Training in OpenCV
  12. Citations of the Viola–Jones algorithm in Google Scholar
  13. Implementing Viola–Jones
  14. Video lecture on Viola–Jones algorithm

External links