F1 score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 and its worst at 0.

The traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and recall:

F = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.
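
As a minimal illustrative sketch (Python, with hypothetical counts not taken from the source), the balanced F1 score can be computed from counts of true positives, false positives and false negatives:

    def f1_score(tp, fp, fn):
        # precision: correct results divided by all returned results
        precision = tp / (tp + fp)
        # recall: correct results divided by results that should have been returned
        recall = tp / (tp + fn)
        # harmonic mean of precision and recall
        return 2 * precision * recall / (precision + recall)

    # hypothetical example: 8 true positives, 2 false positives, 4 false negatives
    print(f1_score(8, 2, 4))  # precision = 0.8, recall ≈ 0.667, F1 ≈ 0.727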

The general formula for positive real β is:

F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{(\beta^2 \cdot \mathrm{precision}) + \mathrm{recall}}.
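
A corresponding sketch of the general formula, assuming precision and recall have already been computed; the beta parameter is the weight attached to recall:

    def f_beta(precision, recall, beta):
        # weighted harmonic mean; beta > 1 favours recall, beta < 1 favours precision
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    # with beta = 1 this reduces to the balanced F1 score above
    print(f_beta(0.8, 8 / 12, 1.0))  # ≈ 0.727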

The formula in terms of type I and type II errors is:

F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{true\ positive}}{(1 + \beta^2) \cdot \mathrm{true\ positive} + \beta^2 \cdot \mathrm{false\ negative} + \mathrm{false\ positive}}\,.
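
As a consistency check (same hypothetical counts as above), the count-based formulation gives the same value:

    def f_beta_from_counts(tp, fp, fn, beta):
        # equivalent form in terms of true positives, false negatives and
        # false positives; true negatives do not appear at all
        b2 = beta ** 2
        return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

    print(f_beta_from_counts(8, 2, 4, 1.0))  # ≈ 0.727, matching f_beta above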

Two other commonly used F measures are the F_{2} measure, which weights recall higher than precision, and the F_{0.5} measure, which puts more emphasis on precision than recall.
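
A brief numeric illustration of this asymmetry, reusing the f_beta sketch above with hypothetical precision 0.9 and recall 0.5:

    # F_2 is pulled toward the lower recall, F_0.5 toward the higher precision
    print(f_beta(0.9, 0.5, 2.0))  # F_2   ≈ 0.549
    print(f_beta(0.9, 0.5, 0.5))  # F_0.5 ≈ 0.776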

The F-measure was derived so that F_\beta "measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as precision".[1] It is based on van Rijsbergen's effectiveness measure

E = 1 - \left(\frac{\alpha}{P} + \frac{1-\alpha}{R}\right)^{-1}.

Their relationship is F_\beta = 1 - E where \alpha=\frac{1}{1 + \beta^2}.
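
Substituting this value of α into E verifies the relationship (a short check, not part of the original source), with P and R denoting precision and recall:

1 - E = \left(\frac{\alpha}{P} + \frac{1-\alpha}{R}\right)^{-1} = \left(\frac{1}{(1+\beta^2)P} + \frac{\beta^2}{(1+\beta^2)R}\right)^{-1} = \frac{(1+\beta^2) P R}{\beta^2 P + R} = F_\beta.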

Applications

The F-score is often used in the field of information retrieval for measuring search, document classification, and query classification performance.[2] Earlier works focused primarily on the F1 score, but with the proliferation of large-scale search engines, performance goals changed to place more emphasis on either precision or recall,[3] and so F_\beta is seen in wide application.

The F-score is also used in machine learning.[4] Note, however, that the F-measures do not take the true negatives into account, and that measures such as the Matthews correlation coefficient may be preferable for assessing the performance of a binary classifier.
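
As a hedged sketch of this point (hypothetical counts, not from the source), the F1 score is unchanged when only the number of true negatives varies, while the Matthews correlation coefficient is not:

    from math import sqrt

    def mcc(tp, tn, fp, fn):
        # Matthews correlation coefficient uses all four confusion-matrix cells
        denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        return (tp * tn - fp * fn) / denom

    # with tp = 8, fp = 2, fn = 4 the F1 score is 16/22 ≈ 0.727 regardless of tn
    print(mcc(8, 10, 2, 4))    # ≈ 0.51 with 10 true negatives
    print(mcc(8, 1000, 2, 4))  # ≈ 0.73 with 1000 true negatives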

References

  1. ^ van Rijsbergen, C. J. (1979). Information Retrieval (2nd ed.). Butterworth. 
  2. ^ Beitzel, Steven M. (2006). On Understanding and Classifying Web Queries. PhD thesis. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.127.634&rep=rep1&type=pdf. 
  3. ^ X. Li, Y.-Y. Wang, and A. Acero (July 2008). "Learning query intent from regularized click graphs". Proceedings of the 31st SIGIR Conference. 
  4. ^ See, e.g., the evaluation of the CoNLL 2002 shared task.