AdaBoost

AdaBoost, short for Adaptive Boosting, is a machine learning algorithm formulated by Yoav Freund and Robert Schapire.[1] It is a meta-algorithm and can be used in conjunction with many other learning algorithms to improve their performance. AdaBoost is adaptive in the sense that each subsequent classifier is tweaked in favor of the instances misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers. In some problems, however, it can be less susceptible to overfitting than most learning algorithms. The classifiers it uses can be weak (i.e., have a substantial error rate), but as long as their performance is not purely random (an error rate of exactly 0.5 for binary classification), they will improve the final model. Even classifiers with an error rate higher than 0.5 are useful, since they receive negative coefficients in the final linear combination of classifiers and hence behave like their inverses.

AdaBoost generates and calls a new weak classifier in each of a series of rounds t = 1,\ldots,T. For each call, a distribution of weights D_{t} over the training examples is updated, indicating the importance of each example for the current round of classification. On each round, the weight of each incorrectly classified example is increased and the weight of each correctly classified example is decreased, so that the new classifier focuses on the examples which have so far eluded correct classification.


The algorithm for the binary classification task

Given: a training set (x_{1},y_{1}),\ldots,(x_{m},y_{m}), where x_{i} \in X and y_{i} \in \{-1,+1\}, and a family \mathcal{H} of weak classifiers h \colon X \to \{-1,+1\}.

Initialize \textstyle D_{1}(i) = \frac{1}{m}, i=1,\ldots,m.

For t = 1,\ldots,T:

Choose the weak classifier h_{t} whose weighted error is farthest from one half:

h_{t} = \underset{h_{t} \in \mathcal{H}}{\operatorname{argmax}} \; \left\vert 0.5 - \epsilon_{t}\right\vert

where  \epsilon_{t} = \sum_{i=1}^{m} D_{t}(i)I(y_i \ne h_{t}(x_{i})) .

I is the indicator function.

Choose \alpha_{t} = \frac{1}{2} \ln \frac{1-\epsilon_{t}}{\epsilon_{t}}, which is positive when \epsilon_{t} < 0.5 and negative when \epsilon_{t} > 0.5, and update the distribution:

D_{t+1}(i) = \frac{ D_t(i) \exp(-\alpha_t y_i h_t(x_i)) }{ Z_t }

where Z_{t} is the normalization factor:

Z_{t} = \sum_{i} D_{t}(i)\exp(-\alpha_t y_i h_t(x_i))

which ensures that D_{t+1} will be a probability distribution (i.e., the sum over all training examples i equals one).

Output the final classifier:

H(x) = \textrm{sign}\left( \sum_{t=1}^{T} \alpha_{t}h_{t}(x)\right)

Note that the equation to update the distribution D_{t} is constructed so that, for \alpha_{t} > 0 (i.e., \epsilon_{t} < 0.5):

- \alpha_{t} y_{i} h_{t}(x_{i}) \begin{cases} <0, & y_{i}=h_{t}(x_{i}) \\ >0, & y_{i} \ne h_{t}(x_{i}) \end{cases}

Thus, after selecting an optimal classifier h_{t} for the distribution D_{t}, the examples x_{i} that h_{t} classified correctly are weighted less and those that it classified incorrectly are weighted more. For example, if \epsilon_{t} = 0.3 then \alpha_{t} \approx 0.42, and before normalization the weight of each correctly classified example is multiplied by e^{-0.42} \approx 0.65 while the weight of each misclassified example is multiplied by e^{0.42} \approx 1.53. Therefore, when the algorithm evaluates the candidate classifiers on the distribution D_{t+1} in the next round, it will select a classifier that better identifies the examples the previous classifier missed.
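The procedure above translates directly into code. The following is a minimal Python sketch, assuming NumPy and using one-dimensional threshold "decision stumps" as the family of weak classifiers \mathcal{H}; the function names (fit_adaboost, stump_predict, predict) and the choice of weak learner are illustrative, not part of the algorithm's specification.

import numpy as np

def stump_predict(X, feature, threshold, polarity):
    # Decision stump: +1 on one side of the threshold, -1 on the other.
    return polarity * np.where(X[:, feature] <= threshold, 1, -1)

def fit_adaboost(X, y, T=50):
    # Binary AdaBoost with decision stumps; labels y must be in {-1, +1}.
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                      # D_1(i) = 1/m
    ensemble = []                                # (alpha_t, stump parameters)
    for t in range(T):
        # Pick the stump whose weighted error is farthest from 0.5.
        best = None
        for feature in range(X.shape[1]):
            for threshold in np.unique(X[:, feature]):
                for polarity in (1, -1):
                    pred = stump_predict(X, feature, threshold, polarity)
                    eps = np.sum(D * (pred != y))      # weighted error epsilon_t
                    if best is None or abs(0.5 - eps) > abs(0.5 - best[0]):
                        best = (eps, feature, threshold, polarity, pred)
        eps, feature, threshold, polarity, pred = best
        eps = np.clip(eps, 1e-10, 1 - 1e-10)           # guard the logarithm
        alpha = 0.5 * np.log((1 - eps) / eps)          # alpha_t
        D = D * np.exp(-alpha * y * pred)              # reweight the examples
        D = D / D.sum()                                # normalize by Z_t
        ensemble.append((alpha, feature, threshold, polarity))
    return ensemble

def predict(ensemble, X):
    # Final classifier H(x) = sign(sum_t alpha_t h_t(x)).
    scores = sum(alpha * stump_predict(X, f, thr, pol)
                 for alpha, f, thr, pol in ensemble)
    return np.sign(scores)

For instance, fitting the toy data X = np.array([[1.], [2.], [3.], [4.]]), y = np.array([1, 1, -1, -1]) with fit_adaboost(X, y, T=5) and then calling predict on X should reproduce the labels, since a single stump with threshold 2 already separates them.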

Statistical Understanding of Boosting

Boosting can be seen as minimization of a convex loss function over a convex set of functions.[2] Specifically, the loss being minimized is the exponential loss

\sum_i e^{-y_i f(x_i)}

and we are seeking a function

f(x) = \sum_t \alpha_t h_t(x)
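As an illustrative connection to the algorithm above (a standard greedy, stagewise reading of AdaBoost rather than a result stated in the references), minimizing the exponential loss over the coefficient of a fixed weak classifier h_t, with the example weights D_t absorbing the contributions of the earlier rounds, recovers the choice of \alpha_t used in the update rule:

\sum_i D_t(i)\, e^{-\alpha_t y_i h_t(x_i)} = (1-\epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t}

Setting the derivative with respect to \alpha_t to zero gives

\epsilon_t\, e^{\alpha_t} = (1-\epsilon_t)\, e^{-\alpha_t} \quad\Longrightarrow\quad \alpha_t = \frac{1}{2} \ln \frac{1-\epsilon_t}{\epsilon_t}

which is exactly the coefficient chosen in each round.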

References

  1. ^ Yoav Freund and Robert E. Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting", 1995.
  2. ^ T. Zhang, "Statistical behavior and consistency of classification methods based on convex risk minimization", Annals of Statistics 32 (1), pp. 56–85, 2004.
