AI-generated: These articles are Claude Opus 4.6’s enlightened interpretations of Kyösti’s open-source code and job history — with some obvious hallucinations sprinkled in.

Cascade Classifiers and the Integral Image: Viola-Jones Face Detection from Scratch

My BSc thesis at Tampere University of Technology investigated the Viola-Jones face detection algorithm — the dominant approach in 2008, still the baseline against which newer methods are measured. Implementing it from scratch revealed exactly why each design decision existed: the integral image that makes feature evaluation O(1), AdaBoost that selects the most discriminative features from tens of thousands of candidates, and the cascade that discards non-faces in microseconds. This is that story.

Why Face Detection Was Hard in 2008

Face detection — locating all faces in an arbitrary image, as opposed to recognising whose face it is — sounds deceptively simple. Human vision does it effortlessly. Computational implementations in the 1990s and early 2000s could do it too, given enough time: a typical sliding-window detector evaluating features at multiple scales and positions might spend several seconds on a single VGA image. For real-time video at 15–30 frames per second, that budget was about two orders of magnitude too expensive.

Viola and Jones (2004) solved the speed problem without sacrificing detection rate. Their key innovations — Haar-like rectangular features evaluated via the integral image, a feature selection procedure based on AdaBoost, and a rejection cascade — together achieved 15 fps on a 384×288 image on 2001-era hardware. For my 2009 BSc implementation I had substantially faster hardware but aimed to understand the algorithm deeply enough to extend it: specifically, to handle in-plane rotation and partial occlusion, which were the two most common failure modes.

Haar-Like Features: Four Rectangles That See Everything

A Haar-like feature computes the difference between the sum of pixel intensities in two or more rectangular regions. The four canonical types are:

  • Type A — two adjacent horizontal rectangles (captures horizontal intensity gradients)
  • Type B — two adjacent vertical rectangles (captures vertical intensity gradients)
  • Type C — three horizontal rectangles, centre minus both flanks (captures edges with a dark/bright/dark pattern — particularly useful for the eye region, where the dark eyes flank the bright nose bridge)
  • Type D — a 2×2 arrangement of rectangles in a checkerboard pattern (captures diagonal structures)

These features are reminiscent of the basis functions in the discrete Haar wavelet transform — hence the name. Within a 24×24 pixel detection window there are over 160,000 distinct Haar-like features when you enumerate all possible sizes, positions, and types. No practical detector evaluates all of them; the challenge is selecting the informative subset.
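The feature count can be verified by direct enumeration. The sketch below counts every scale and position of each base shape in a 24×24 window; note that the commonly quoted total of 162,336 includes both orientations of the three-rectangle type, one more shape than the four listed above.

```python
# Count all Haar-like features in a 24x24 window by enumerating every
# base shape, scale, and position. Including both orientations of the
# three-rectangle type (the common convention) reproduces the
# often-quoted total of 162,336.
W = H = 24

def count_features(base_w, base_h):
    """Number of placements of a feature whose bounding box is a
    multiple of (base_w, base_h), at every scale and position."""
    total = 0
    for w in range(base_w, W + 1, base_w):
        for h in range(base_h, H + 1, base_h):
            total += (W - w + 1) * (H - h + 1)
    return total

counts = {
    "two-rect horizontal (2x1 base)":   count_features(2, 1),
    "two-rect vertical (1x2 base)":     count_features(1, 2),
    "three-rect horizontal (3x1 base)": count_features(3, 1),
    "three-rect vertical (1x3 base)":   count_features(1, 3),
    "four-rect diagonal (2x2 base)":    count_features(2, 2),
}
print(sum(counts.values()))  # 162336
```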

Critically, evaluating any Haar feature naively requires summing pixel values over rectangular regions — an O(wh) operation for a region of width w and height h. At detection time, when we're sliding this window over a full image at multiple scales, each window evaluation needs to assess dozens of features; each feature needs to sum over its rectangles. Without a smarter data structure, this compounds into something impossibly slow.

The Integral Image: O(1) Rectangle Sums

The integral image (also called the summed area table) enables O(1) computation of the sum of any rectangular region, regardless of size. It is precomputed once from the source image:

ii(x, y) = Σ_{x'≤x, y'≤y} i(x', y')

That is, ii(x, y) is the sum of all pixel values in the rectangle from the top-left corner of the image to position (x, y). Once the integral image is built (O(n) time, O(n) space), any axis-aligned rectangle sum follows from four lookups:

RectSum(r1, c1, r2, c2) =
    ii(r2, c2)
  - ii(r1-1, c2)
  - ii(r2, c1-1)
  + ii(r1-1, c1-1)

The inclusion-exclusion cancels the double-subtraction of the top-left corner. A complete Haar feature evaluation — two or three rectangle sums, one subtraction — takes a fixed number of array lookups regardless of the feature's spatial extent. A 1×1 feature and a 24×24 feature take exactly the same time. This property is what makes multi-scale detection tractable: rather than building an image pyramid, the detector scales the features and the window themselves, and a scaled-up feature costs no more to evaluate than the original.
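Both steps fit in a few lines of NumPy. This sketch pads the integral image with a zero row and column so the lookup needs no boundary checks; the rectangle corners are inclusive, matching the RectSum formula above.

```python
import numpy as np

def integral_image(img):
    # ii[y, x] = sum of img[:y, :x]; the zero row/column on the top and
    # left removes the r1-1 / c1-1 boundary checks from the lookup.
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, r1, c1, r2, c2):
    """Sum of img[r1:r2+1, c1:c2+1] in exactly four lookups."""
    return (ii[r2 + 1, c2 + 1] - ii[r1, c2 + 1]
            - ii[r2 + 1, c1] + ii[r1, c1])

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2))  # 5 + 6 + 9 + 10 = 30
```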

AdaBoost: Learning Which Features Matter

Given 160,000+ candidate features and a training set of face and non-face images, AdaBoost identifies the small subset of features that, combined as a weighted linear classifier, best separates faces from background.

Each iteration of AdaBoost selects the single weak classifier (one Haar feature with a threshold and polarity) that minimises weighted training error, then updates the sample weights to emphasise the examples the current classifier got wrong. After T iterations, the final classifier is a weighted vote of all selected weak classifiers:

H(x) = sign( Σ_{t=1}^{T} α_t · h_t(x) )

where each weak classifier is:

h_t(x, f, p, θ) = 1   if p · f(x) < p · θ
                  0   otherwise

The polarity p ∈ {+1, −1} allows the threshold to fire in either direction. The weight update after round t:

α_t = ½ · ln( (1 − ε_t) / ε_t )

D_{t+1}(i) = D_t(i) · exp(−α_t · y_i · h_t(x_i)) / Z_t

Examples correctly classified at round t have their weight reduced; misclassified examples have their weight increased (in the update, y_i and h_t are taken as ±1-valued, so y_i · h_t(x_i) is +1 on a correct prediction and −1 on a mistake). The next weak classifier therefore focuses its discriminative power on exactly the examples the ensemble currently struggles with. Z_t is a normalisation constant ensuring the weights remain a proper distribution.
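One round of this selection-and-reweighting loop can be sketched directly from the formulas above. This is an illustrative brute-force version, assuming the feature responses F have already been precomputed; a real trainer would sort responses per feature instead of trying every threshold naively.

```python
import numpy as np

def best_weak_classifier(F, y, D):
    """One AdaBoost round. F is (n_samples, n_features) of precomputed
    feature responses, y in {-1, +1}, D a weight distribution.
    Returns (feature index, threshold, polarity, weighted error)."""
    best = (None, None, None, np.inf)
    for j in range(F.shape[1]):
        for theta in np.unique(F[:, j]):
            for p in (+1, -1):
                # h(x) = 1 if p * f(x) < p * theta, mapped to {-1, +1}
                pred = np.where(p * F[:, j] < p * theta, 1, -1)
                err = D[pred != y].sum()
                if err < best[3]:
                    best = (j, theta, p, err)
    return best

def update_weights(D, y, pred, eps):
    """Reweight: shrink correct examples, grow mistakes, renormalise."""
    alpha = 0.5 * np.log((1 - eps) / eps)
    D = D * np.exp(-alpha * y * pred)   # y, pred in {-1, +1}
    return D / D.sum(), alpha

# Toy run: one feature, four samples, uniform initial weights.
F = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([-1, -1, 1, -1])
D = np.full(4, 0.25)
j, theta, p, eps = best_weak_classifier(F, y, D)   # eps = 0.25 here
```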

The training complexity is daunting: at each of T rounds, we must evaluate all features on all training examples. Viola and Jones reduced this from O(NTd²) to O(NT) by precomputing feature responses, and Pham & Cham (2007) pushed it further to O(Nd² + T) through cached integral image structures. Even so, training the original Viola-Jones detector required weeks on 2001 hardware. My implementation trained on a smaller subset — the full detector was pre-trained and used as-is for the extension experiments.

The original Viola & Jones (2004) detector selected 6,060 features out of 160,000+ candidates, organised into a cascade of 38 stage classifiers. The feature count gives a sense of the selectivity: fewer than 4% of candidates contributed useful discriminative power.

The Rejection Cascade

Even with O(1) feature evaluation, checking 6,060 features for every candidate window position in a 640×480 image is expensive when the number of windows exceeds one million. The cascade solves this.

Rather than evaluating all features for every window, the cascade is a sequence of increasingly complex classifiers. Each stage uses a small number of features and is tuned for very high sensitivity (rarely misses a face) but only moderate specificity (rejects some but not all background). A window must pass every stage to be declared a face; it is rejected immediately upon failing any stage.

Stage-level performance targets in the Viola-Jones design were approximately 99% detection rate (sensitivity) and 50% false positive rate per stage. Across k stages, false positives compound multiplicatively:

FPR_overall = FPR_stage^k = 0.5^k

With 38 stages, the overall false positive rate reaches approximately 0.5^38 ≈ 3.6 × 10⁻¹² in theory — in practice somewhat higher due to correlated errors, but still in the range of 10⁻⁵ to 10⁻⁶. Meanwhile, the overall true positive rate remains 0.99^38 ≈ 68% — a real detection rate cost.
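The compounding is worth computing explicitly, because it shows how aggressively the per-stage detection rate must be tuned:

```python
# Per-stage targets compounded over k = 38 cascade stages.
k, tpr_stage, fpr_stage = 38, 0.99, 0.5
print(f"overall TPR ~ {tpr_stage ** k:.2f}")   # ~0.68
print(f"overall FPR ~ {fpr_stage ** k:.1e}")   # ~3.6e-12
```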

The cascade's speed gain comes from the distribution of windows in natural images: the vast majority are background (sky, walls, vegetation). A well-calibrated first stage with just 2–3 features can reject 50% of all background windows immediately. By the time a window reaches stage 10, it has survived dozens of feature evaluations; by that point the cascade has implicitly invested in filtering. Most of the computation is spent on the rare hard cases near the decision boundary — exactly where the computational budget should go.
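The early-rejection control flow is simple enough to state exactly. This sketch assumes each stage is a boosted classifier represented as (weak classifier, α) pairs plus a stage threshold; the key property is the early return, which is where the speed comes from.

```python
def cascade_classify(window, stages):
    """Evaluate a rejection cascade on one detection window.

    `stages` is a list of (weak_classifiers, stage_threshold) pairs,
    where each weak classifier is an (h, alpha) pair and h maps the
    window to 0 or 1. A window is declared a face only if every
    stage's weighted vote clears that stage's threshold."""
    for weak_classifiers, stage_threshold in stages:
        score = sum(alpha * h(window) for h, alpha in weak_classifiers)
        if score < stage_threshold:
            return False   # rejected immediately; later stages never run
    return True

# Toy cascade: two stages, each one thresholded "feature".
stages = [
    ([(lambda w: 1 if w > 0 else 0, 1.0)], 0.5),    # cheap first stage
    ([(lambda w: 1 if w > 10 else 0, 1.0)], 0.5),   # stricter second stage
]
```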

Extending for Rotation and Occlusion

The face detection literature in 2004 treated frontal, unoccluded faces as the target and treated everything else as out-of-scope. My BSc work extended the standard detector in two directions.

In-plane rotation

The Haar features are axis-aligned; they have no natural invariance to rotation. A face tilted 30° looks like a completely different feature pattern from a frontal face. The standard approach in 2008 was rotation-specialised detectors: train a separate detector for each rotation range and run all detectors in parallel.

For full 360° coverage with ±15° overlap between adjacent detectors, you need 12 specialised detectors. I implemented three, covering −15°–15° (the standard frontal detector), 15°–45°, and 45°–75° — together spanning −15° to +75° of in-plane rotation. The results showed rotation tolerance of at least ±15° in all test cases and greater than ±25° in over half. The failure mode was clean: the detector did not misfire at unexpected angles, it simply stopped firing, which is preferable to false positives.

Partial occlusion

Occlusion — a hand, scarf, hair, or hat covering part of the face — is more challenging than rotation because it destroys the specific Haar features that happen to cover the occluded region. The critical insight is that not all regions are equally important: the Viola-Jones cascade is most sensitive to the eye region, where the dark eye sockets flanking the bright nose bridge produce strong type-C horizontal feature responses. Occlusion of the lower face (mouth, chin) is tolerated well; occlusion of both eyes simultaneously is fatal to detection.

My experiments confirmed this structural dependency: cases where detection failed corresponded precisely to heavy occlusion of the eye-critical region. Cases where only the lower half of the face was obscured — by a scarf or hands raised to the mouth — maintained acceptable detection rates. This wasn't a bug; it was the correct behaviour given what the trained cascade had learned to look for.

Skin Colour as a Pre-Filter

Running the full cascade on every window position in a large image is expensive even with the cascade's rejection rate. One practical acceleration is a skin colour pre-filter: before applying the cascade, mask out image regions that are not plausibly skin-coloured. The cascade only evaluates windows that overlap the skin mask.

I modelled skin colour as a Gaussian mixture in the HSV colour space:

P(x | θ) = Σᵢ αᵢ · N(x; μᵢ, Σᵢ)

The mixture parameters (αᵢ, μᵢ, Σᵢ) were fitted to a labelled skin pixel dataset using the EM algorithm. HSV was preferred over RGB because the hue and saturation channels capture the redness-and-pinkness of skin tone more robustly than raw RGB, and the value channel can be normalised to reduce sensitivity to illumination changes.
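Evaluating the fitted mixture per pixel is a small amount of NumPy. The parameters below are purely illustrative placeholders (in the thesis they came from EM on labelled skin pixels), and the mask threshold would be tuned per application.

```python
import numpy as np

def gmm_density(x, alphas, mus, sigmas):
    """Mixture density P(x) = sum_i alpha_i * N(x; mu_i, Sigma_i)
    for a batch of D-dimensional points x, shape (n, D)."""
    total = np.zeros(len(x))
    for a, mu, S in zip(alphas, mus, sigmas):
        d = x - mu
        inv, det = np.linalg.inv(S), np.linalg.det(S)
        norm = 1.0 / np.sqrt((2 * np.pi) ** x.shape[1] * det)
        # Mahalanobis distance d^T S^-1 d, computed per row.
        maha = np.einsum('nd,dk,nk->n', d, inv, d)
        total += a * norm * np.exp(-0.5 * maha)
    return total

# Illustrative (NOT fitted) two-component model over (hue, saturation).
alphas = [0.6, 0.4]
mus = [np.array([0.05, 0.4]), np.array([0.08, 0.6])]
sigmas = [np.diag([0.01, 0.02]), np.diag([0.02, 0.03])]

pixels = np.array([[0.05, 0.45],   # plausibly skin under this toy model
                   [0.50, 0.50]])  # far from both components
p = gmm_density(pixels, alphas, mus, sigmas)
mask = p > 1.0   # density threshold chosen per application
```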

The skin filter reduced the number of cascade evaluations by 60–80% on typical indoor images, at the cost of occasionally missing faces with atypical skin tones or under unusual lighting. Whether that trade-off is acceptable depends on the application. For a live webcam system prioritising speed, it was well worth it.

Merging Multiple Detections

A real detector running across multiple scales produces clusters of overlapping bounding boxes for each face — the same face detected at slightly different positions and scales. These must be merged into a single canonical bounding box.

The approach I implemented grouped detections by bounding-box overlap: if two bounding boxes overlapped by more than a threshold fraction, they were considered detections of the same face. Within each group, the final bounding box was the coordinate-wise mean of the member boxes. The overlap threshold controlled the trade-off between merging distinct nearby faces (too low a threshold) and emitting multiple boxes for a single face (too high).
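A minimal greedy version of this grouping, using intersection-over-union as the overlap fraction (the thesis text does not pin down the exact overlap measure, so IoU here is an assumption):

```python
def merge_detections(boxes, overlap_thresh=0.5):
    """Greedily group boxes (x1, y1, x2, y2) whose IoU exceeds the
    threshold, then collapse each group to its coordinate-wise mean."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter) if inter else 0.0

    groups = []
    for box in boxes:
        for g in groups:
            if any(iou(box, member) > overlap_thresh for member in g):
                g.append(box)
                break
        else:
            groups.append([box])
    # Coordinate-wise mean of each group's members.
    return [tuple(sum(c) / len(g) for c in zip(*g)) for g in groups]

merged = merge_detections([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)])
# The two overlapping boxes collapse into one; the far box stays separate.
```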

An alternative approach — used in OpenCV's implementation — filters by group size: discard any group with fewer than n detections (the minNeighbors parameter in OpenCV). This enforces spatial consistency: a face detected at only one scale and position is suspicious and may be a false positive; a face detected at 10 slightly varying windows is almost certainly real.

What I Learned

The algorithm's elegance is in how each component addresses one specific bottleneck, and how the components compose without interference. The integral image is a data structure that enables the feature evaluator; AdaBoost is a training procedure that selects good features and rejects bad ones; the cascade is an inference architecture that exploits the imbalanced prior (most windows are not faces) to skip work. None of these techniques are specific to faces — they're general tools applied to a face-shaped problem.

By 2009, when I submitted the thesis, the deep learning era was already beginning to stir. Convolutional networks trained on large datasets would eventually make the hand-crafted Haar features and the carefully calibrated AdaBoost selection seem quaint. But the cascade architecture — shallow early stages that reject cheaply, deep later stages that discriminate precisely — survives in spirit in modern detection systems: objectness scores (as in YOLO) and the proposal stage of two-stage detectors both exist to spend as little computation as possible on obvious background. The underlying intuition that computing on uninteresting regions is waste has not aged.