F1 score

In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results, and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 and its worst at 0.

The traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and recall:

F = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.
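
As a minimal illustrative sketch (Python, with hypothetical counts not taken from the source), the balanced F1 score can be computed from counts of true positives, false positives and false negatives:

    def f1_score(tp, fp, fn):
        # precision: correct results divided by all returned results
        precision = tp / (tp + fp)
        # recall: correct results divided by results that should have been returned
        recall = tp / (tp + fn)
        # harmonic mean of precision and recall
        return 2 * precision * recall / (precision + recall)

    # hypothetical example: 8 true positives, 2 false positives, 4 false negatives
    print(f1_score(8, 2, 4))  # precision = 0.8, recall ≈ 0.667, F1 ≈ 0.727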

The general formula for positive real β is:

F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{(\beta^2 \cdot \mathrm{precision}) + \mathrm{recall}}.
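
A corresponding sketch of the general formula, assuming precision and recall have already been computed; the beta parameter is the weight attached to recall:

    def f_beta(precision, recall, beta):
        # weighted harmonic mean; beta > 1 favours recall, beta < 1 favours precision
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    # with beta = 1 this reduces to the balanced F1 score above
    print(f_beta(0.8, 8 / 12, 1.0))  # ≈ 0.727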

The formula in terms of type I and type II errors is:

F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{true\ positive}}{(1 + \beta^2) \cdot \mathrm{true\ positive} + \beta^2 \cdot \mathrm{false\ negative} + \mathrm{false\ positive}}\,.
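
As a consistency check (same hypothetical counts as above), the count-based formulation gives the same value:

    def f_beta_from_counts(tp, fp, fn, beta):
        # equivalent form in terms of true positives, false negatives and
        # false positives; true negatives do not appear at all
        b2 = beta ** 2
        return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

    print(f_beta_from_counts(8, 2, 4, 1.0))  # ≈ 0.727, matching f_beta above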

Two other commonly used F measures are the F_{2} measure, which weights recall higher than precision, and the F_{0.5} measure, which puts more emphasis on precision than recall.
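
A brief numeric illustration of this asymmetry, reusing the f_beta sketch above with hypothetical precision 0.9 and recall 0.5:

    # F_2 is pulled toward the lower recall, F_0.5 toward the higher precision
    print(f_beta(0.9, 0.5, 2.0))  # F_2   ≈ 0.549
    print(f_beta(0.9, 0.5, 0.5))  # F_0.5 ≈ 0.776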

The F-measure was derived so that F_\beta "measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as precision".[1] It is based on van Rijsbergen's effectiveness measure

E = 1 - \left(\frac{\alpha}{P} + \frac{1-\alpha}{R}\right)^{-1}.

Their relationship is F_\beta = 1 - E where \alpha=\frac{1}{1 + \beta^2}.
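
Substituting this value of α into E verifies the relationship (a short check, not part of the original source), with P and R denoting precision and recall:

1 - E = \left(\frac{\alpha}{P} + \frac{1-\alpha}{R}\right)^{-1} = \left(\frac{1}{(1+\beta^2)P} + \frac{\beta^2}{(1+\beta^2)R}\right)^{-1} = \frac{(1+\beta^2) P R}{\beta^2 P + R} = F_\beta.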

Applications

The F-score is often used in the field of information retrieval for measuring search, document classification, and query classification performance.[2] Earlier works focused primarily on the F1 score, but with the proliferation of large-scale search engines, performance goals changed to place more emphasis on either precision or recall,[3] and so F_\beta is seen in wide application.

The F-score is also used in machine learning.[4] Note, however, that the F-measures do not take the true negatives into account, and that measures such as the Matthews correlation coefficient may be preferable for assessing the performance of a binary classifier.
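
As a hedged sketch of this point (hypothetical counts, not from the source), the F1 score is unchanged when only the number of true negatives varies, while the Matthews correlation coefficient is not:

    from math import sqrt

    def mcc(tp, tn, fp, fn):
        # Matthews correlation coefficient uses all four confusion-matrix cells
        denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        return (tp * tn - fp * fn) / denom

    # with tp = 8, fp = 2, fn = 4 the F1 score is 16/22 ≈ 0.727 regardless of tn
    print(mcc(8, 10, 2, 4))    # ≈ 0.51 with 10 true negatives
    print(mcc(8, 1000, 2, 4))  # ≈ 0.73 with 1000 true negatives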

References

  1. ^ van Rijsbergen, C. J. (1979). Information Retrieval (2nd ed.). Butterworth. 
  2. ^ Beitzel, Steven M. (2006). On Understanding and Classifying Web Queries. PhD thesis. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.127.634&rep=rep1&type=pdf. 
  3. ^ X. Li, Y.-Y. Wang, and A. Acero (July 2008). "Learning query intent from regularized click graphs". Proceedings of the 31st SIGIR Conference. 
  4. ^ See, e.g., the evaluation of the CoNLL 2002 shared task.