Binary classification
From Wikipedia, the free encyclopedia
Binary classification is the task of classifying the members of a given set of objects into two groups on the basis of whether they have some property or not. Some typical binary classification tasks are
- medical testing to determine if a patient has certain disease or not (the classification property is the disease)
- quality control in factories; i.e. deciding if a new product is good enough to be sold, or if it should be discarded (the classification property is being good enough)
- deciding whether a page or an article should be in the result set of a search or not (the classification property is the relevance of the article - typically the presence of a certain word in it)
Classification in general is one of the problems studied in computer science, in order to automatically learn classification systems; some methods suitable for learning binary classifiers include the decision trees, Bayesian networks, support vector machines, and neural networks.
Sometimes, classification tasks are trivial. Given 100 balls, some of them red and some blue, a human with normal color vision can easily separate them into red ones and blue ones. However, some tasks, like those in practical medicine, and those interesting from the computer science point-of-view, are far from trivial, and produce also faulty results.
Contents |
[edit] Hypothesis testing
In traditional statistical hypothesis testing, the tester starts with a null hypothesis and an alternative hypothesis, performs an experiment, and then decides whether to reject the null hypothesis in favour of the alternative.
A positive or statistically significant result is one which rejects the null hypothesis. Doing this when the null hypothesis is in fact true - a false positive - is a Type I error; doing this when the null hypothesis is false is a true positive.
A negative or not statistically significant result is one which does not reject the null hypothesis. Doing this when the null hypothesis is in fact false - a false negative - is a Type II error; doing this when the null hypothesis is true is a true negative.
[edit] Evaluation of binary classifiers
- See also: sensitivity and specificity
To measure the performance of a medical test, the concepts sensitivity and specificity are often used; these concepts are readily usable for the evaluation of any binary classifier. Say we test some people for the presence of a disease. Some of these people have the disease, and our test says they are positive. They are called true positives (TP). Some have the disease, but the test claims they don't. They are called false negatives (FN). Some don't have the disease, and the test says they don't - true negatives (TN). Finally, we might have healthy people who have a positive test result false positives (FP). Thus, the number of true positives, false negatives, true negatives, and false positives add up to 100% of the set.
Sensitivity (TPR) is the proportion of people that tested positive of all the positive people tested; that is (true positives) / (true positives + false negatives). It can be seen as the probability that the test is positive given that the patient is sick. The higher the sensitivity, the fewer real cases of diseases go undetected (or, in the case of the factory quality control, the fewer faulty products go to the market).
Specificity (TNR) is the proportion of people that tested negative of all the negative people tested; that is (true negatives) / (true negatives + false positives). As with sensitivity, it can be looked at as the probability that the test is negative given that the patient is not sick. The higher the specificity, the fewer healthy people are labeled as sick (or, in the factory case, the less money the factory loses by discarding good products instead of selling them).
The relationship between sensitivity and specificity, as well as the performance of the classifier, can be visualized and studied using the ROC curve.
In theory, sensitivity and specificity are independent in the sense that it is possible to achieve 100 % in both (for instance, the human classifying the red and blue balls most likely does). In practice, there often is a trade-off, and you can't achieve both. This is because much of the characteristics identified to determine whether a sample gives a positive or negative test may not be as obvious as red or blue colors; they are usually an array. A widely known instance is such as the indicator for checking obesity: Body Mass Index. If high sensitivity is desired, a very low threshold could be set which would consequently declare many as obese. Therefore, the number of true positives increase and false negatives decrease. That is, the sensitivity increases. The disadvantage would then be obvious since the number of false positives also increase due to incorrect testing of some normal people as obese. As a result, specificity decreases.
In addition to sensitivity and specificity, the performance of a binary classification test can be measured with positive (PPV) and negative predictive values (NPV). These are possibly more intuitively clear: the positive prediction value answers the question "how likely it is that I really have the disease, given that my test result was positive?". It is calculated as (true positives) / (true positives + false positives); that is, it is the proportion of true positives out of all positive results. (The negative prediction value is the same, but for negatives, naturally.)
One should note, though, one important difference between the two concepts. That is, sensitivity and specificity are independent from the population in the sense that they don't change depending on what the proportion of positives and negatives tested are. Indeed, you can determine the sensitivity of the test by testing only positive cases. However, the prediction values are dependent on the population.
[edit] Example
As an example, say that you have a test for a disease with 99 % sensitivity and 99 % specificity. Say you test 2000 people, and 1000 of them are sick and 1000 of them are healthy. You are likely to get about 990 true positives, 990 true negatives, and 10 of false positives and negatives each. The positive and negative prediction values would be 99 %, so the people can be quite confident about the result.
Say, however, that of the 2000 people only 100 are really sick. Now you are likely to get 99 true positives, 1 false negative, 1881 true negatives and 19 false positives. Of the 19+99 people tested positive, only 99 really have the disease - that means, intuitively, that given that your test result is positive, there's only 84 % chance that you really have the disease. On the other hand, given that your test result is negative, you can really be reassured: there's only 1 chance in 1881, or 0.05% probability, that you have the disease despite of your test result.
[edit] Measuring a classifier with sensitivity and specificity
Suppose you are training your own classifier, and you wish to measure its performance using the well-accepted sensitivity and specificity. It may be instructive to compare your classifier to a random classifier that flips a coin based on the prevalence of a disease. Suppose that the probability a person has the disease is p and the probability that they do not is q = 1 − p. Suppose then that we have a random classifier that guesses that you have the disease with that same probability p and guesses you do not with the same probability q.
The probability of a true positive is the probability that you have the disease and the random classifier guesses that you do, or p2. With similar reasoning, the probability of a false negative is pq. From the definitions above, the sensitivity of this classifier is p2 / (p2 + pq) = p. With more similar reasoning, we can calculate the specificity as q2 / (q2 + pq) = q.
So, while the measure itself is independent of disease prevalence, the performance of this random classifier depends on disease prevalence. Your classifier may have performance that is like this random classifier, but with a better-weighted coin (higher sensitivity and specificity). So, these measures may be influenced by disease prevalence.