Multiclass classification

Not to be confused with multi-label classification.

In machine learning, multiclass or multinomial classification is the problem of classifying instances into one of the more than two classes (classifying instances into one of the two classes is called binary classification).

While some classification algorithms naturally permit the use of more than two classes, others are by nature binary algorithms; these can, however, be turned into multinomial classifiers by a variety of strategies.

Multiclass classification should not be confused with multi-label classification, where multiple labels are to be predicted for each instance.

General strategies

This section discusses strategies for reducing the problem of multiclass classification to multiple binary classification problems.

One-vs.-rest

The one-vs.-rest^[1]^{:182, 338} (or one-vs.-all, OvA or OvR) strategy involves training a single classifier per class, with the samples of that class as positive samples and all other samples as negatives. This strategy requires the base classifiers to produce a real-valued confidence score for its decision, rather than just a class label; discrete class labels alone can lead to ambiguities, where multiple classes are predicted for a single sample.^[1]^:182^{[note 1]}

In pseudocode, the training algorithm for an OvA learner constructed from a binary classification learner $L$ is as follows:

Inputs:

$L$ , a learner (training algorithm for binary classifiers)
samples $X$
labels $y$ where $y i$ ∈ {1, … $K$ } is the label for the sample $X i$

Output:

a list of classifiers $f k$ for $k$ ∈ {1, …, $K$ }

Procedure:

For each k in {1, …, K}:
- Construct a new label vector $y i = 1$ where $y i = k$ , 0 (or −1) elsewhere
- Apply $L$ to $X$ , $y$ to obtain $f k$

Making decisions means applying all classifiers to an unseen sample $x$ and predicting the label $k$ for which the corresponding classifier reports the highest confidence score:

\hat{y} = \arg\max_{k \in 1 \ldots K} f_k(x)

Although this strategy is popular, it is a heuristic that suffers from several problems. Firstly, the scale of the confidence values may differ between the binary classifiers. Second, even if the class distribution is balanced in the training set, the binary classification learners see unbalanced distributions because typically the set of negatives they see is much larger than the set of positives.^[1]^:338

One-vs.-one

In the one-vs.-one (OvO) reduction, one trains $K (K - 1) / 2$ binary classifiers for a $K$ -way multiclass problem; each receives the samples of a pair of classes from the original training set, and must learn to distinguish these two classes. At prediction time, a voting scheme is applied: all $K (K - 1) / 2$ classifiers are applied to an unseen sample and the class that got the highest number of "+1" predictions gets predicted by the combined classifier.^[1]^:339

Like OvR, OvO suffers from ambiguities in that some regions of its input space may receive the same number of votes.^[1]^:183

Notes

↑ In multi-label classification, OvR is known as binary relevance and the prediction of multiple classes is considered a feature, not a problem.

References

1 2 3 4 5 Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer.

This article is issued from Wikipedia - version of the Friday, May 22, 2015. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.