Cohen's kappa

Cohen's kappa coefficient (κ) is a statistic that measures inter-rater agreement for qualitative (categorical) items. It is generally thought to be a more robust measure than simple percent agreement, since κ takes into account the possibility of agreement occurring by chance. Cohen's kappa remains controversial because indices of agreement are difficult to interpret. Some researchers have suggested that it is conceptually simpler to evaluate disagreement between items.^[1] See the Limitations section for more detail.

Calculation

Cohen's kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories. The first mention of a kappa-like statistic is attributed to Galton (1892);^[2] see Smeeton (1985).^[3]

The definition of κ is:

\kappa \equiv {\frac {p_{o}-p_{e}}{1-p_{e}}}=1-{\frac {1-p_{o}}{1-p_{e}}},\!

where $p_o$ is the relative observed agreement among raters (identical to accuracy), and $p_e$ is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly seeing each category. If the raters are in complete agreement then $\kappa = 1$. If there is no agreement among the raters other than what would be expected by chance (as given by $p_e$), $\kappa = 0$. The statistic is negative when the observed agreement falls below $p_e$.

With $k$ indexing the categories, $N$ the number of items, and $n_{ki}$ the number of times rater $i$ chose category $k$:

p_{e}={\frac {1}{N^{2}}}\sum _{k}n_{k1}n_{k2}
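The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `cohen_kappa` and the sample ratings are hypothetical, and the function simply refuses the degenerate case $p_e = 1$, where the ratio is undefined.

```python
from collections import Counter


def cohen_kappa(ratings_1, ratings_2):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(ratings_1) != len(ratings_2) or len(ratings_1) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(ratings_1)

    # p_o: fraction of items on which the two raters agree.
    p_o = sum(a == b for a, b in zip(ratings_1, ratings_2)) / n

    # n_k1 and n_k2: how often each rater used each category.
    counts_1 = Counter(ratings_1)
    counts_2 = Counter(ratings_2)

    # p_e = (1 / N^2) * sum_k n_k1 * n_k2 (a missing key counts as 0).
    p_e = sum(counts_1[k] * counts_2[k] for k in counts_1) / n**2

    if p_e == 1:
        raise ZeroDivisionError("kappa is undefined when p_e = 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings, invented for illustration only.
rater_1 = ["Y", "Y", "N", "N"]
rater_2 = ["Y", "N", "N", "N"]
print(cohen_kappa(rater_1, rater_2))  # 0.5
```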

The seminal paper introducing kappa as a new technique was published by Jacob Cohen in the journal Educational and Psychological Measurement in 1960.^[4]

A similar statistic, called pi, was proposed by Scott (1955). Cohen's kappa and Scott's pi differ in terms of how $p_e$ is calculated.

Note that Cohen's kappa measures agreement between two raters only. For a similar measure of agreement (Fleiss' kappa) used when there are more than two raters, see Fleiss (1971). The Fleiss kappa, however, is a multi-rater generalization of Scott's pi statistic, not Cohen's kappa. Kappa is also used to compare performance in machine learning but the directional version known as Informedness or Youden's J statistic is argued to be more appropriate for supervised learning.^[5]

Example

Suppose that you were analyzing data related to a group of 50 people applying for a grant. Each grant proposal was read by two readers, and each reader said either "Yes" or "No" to the proposal. Suppose the counts were as follows, where A and B are the readers, the entries on the main diagonal of the matrix (top left to bottom right) count agreements, and the off-diagonal entries count disagreements:

		B
		Yes	No
A	Yes	a	b
A	No	c	d

e.g.

		B
		Yes	No
A	Yes	20	5
A	No	10	15

The observed proportionate agreement is:

p_{o}={\frac {a+d}{a+b+c+d}}={\frac {20+15}{50}}=0.7

To calculate $p_e$ (the probability of random agreement) we note that:

Reader A said "Yes" to 25 applicants and "No" to 25 applicants. Thus reader A said "Yes" 50% of the time.
Reader B said "Yes" to 30 applicants and "No" to 20 applicants. Thus reader B said "Yes" 60% of the time.

So the expected probability that both would say yes at random is:

p_{\text{Yes}}={\frac {a+b}{a+b+c+d}}\cdot {\frac {a+c}{a+b+c+d}}=0.5\times 0.6=0.3

Similarly:

p_{\text{No}}={\frac {c+d}{a+b+c+d}}\cdot {\frac {b+d}{a+b+c+d}}=0.5\times 0.4=0.2

Overall random agreement probability is the probability that they agreed on either Yes or No, i.e.:

p_{e}=p_{\text{Yes}}+p_{\text{No}}=0.3+0.2=0.5

So now applying our formula for Cohen's Kappa we get:

\kappa ={\frac {p_{o}-p_{e}}{1-p_{e}}}={\frac {0.7-0.5}{1-0.5}}=0.4\!
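The arithmetic of this example can be checked with a short Python sketch that works directly on the table of counts. The function name `kappa_from_table` is hypothetical. Rows are taken to be reader A and columns reader B, as in the table above.

```python
def kappa_from_table(table):
    """Return (p_o, p_e, kappa) for a square table of counts.

    Rows are the first rater's categories, columns the second rater's.
    """
    k = len(table)
    n = sum(sum(row) for row in table)

    # Observed agreement: share of items on the main diagonal.
    p_o = sum(table[i][i] for i in range(k)) / n

    # Marginal totals for each rater.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]

    # Chance agreement: sum over categories of the product of marginals.
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2

    return p_o, p_e, (p_o - p_e) / (1 - p_e)


# The grant example: a=20, b=5, c=10, d=15.
p_o, p_e, kappa = kappa_from_table([[20, 5], [10, 15]])
print(round(p_o, 4), round(p_e, 4), round(kappa, 4))  # 0.7 0.5 0.4
```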

Same percentages but different numbers

A case sometimes considered a problem with Cohen's kappa arises when comparing the kappa values for two pairs of raters who have the same percentage agreement, but where the raters in one pair assign the categories in similar proportions while the raters in the other pair assign them in very different proportions.^[6] For instance, in the following two cases there is equal agreement between A and B (60 out of 100 in both cases), so we would expect the relative values of Cohen's kappa to reflect this. However, calculating Cohen's kappa for each:

		B
		Yes	No
A	Yes	45	15
A	No	25	15

\kappa ={\frac {0.60-0.54}{1-0.54}}=0.1304

		B
		Yes	No
A	Yes	25	35
A	No	5	35

\kappa ={\frac {0.60-0.46}{1-0.46}}=0.2593

we find that it shows greater similarity between A and B in the second case, compared to the first. This is because while the percentage agreement is the same, the percentage agreement that would occur 'by chance' is significantly higher in the first case (0.54 compared to 0.46).
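A small Python sketch makes the role of the chance term visible for these two tables. The helper name `chance_and_kappa` is hypothetical, and the cell labels a, b, c, d follow the layout used in the Example section.

```python
def chance_and_kappa(a, b, c, d):
    """Return (p_o, p_e, kappa) for a 2x2 table with cells a, b / c, d."""
    n = a + b + c + d
    p_o = (a + d) / n
    # Product of the "Yes" marginals plus product of the "No" marginals.
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return p_o, p_e, (p_o - p_e) / (1 - p_e)


for cells in [(45, 15, 25, 15), (25, 35, 5, 35)]:
    p_o, p_e, kappa = chance_and_kappa(*cells)
    print(cells, round(p_o, 2), round(p_e, 2), round(kappa, 4))
# (45, 15, 25, 15) 0.6 0.54 0.1304
# (25, 35, 5, 35) 0.6 0.46 0.2593
```

Both tables have the same observed agreement, so the whole difference in kappa comes from the marginal totals that enter $p_e$.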

Significance and magnitude

Figure: Kappa (vertical axis) and accuracy (horizontal axis) calculated from the same simulated binary data. Each point on the graph is calculated from a pair of judges randomly rating 10 subjects for having a diagnosis of X or not. Note that in this example a kappa of 0 is approximately equivalent to an accuracy of 0.5.

Statistical significance says nothing about how important the magnitude is in a given application, or about what counts as high or low agreement.

Statistical significance for kappa is rarely reported, probably because even relatively low values of kappa can nonetheless be significantly different from zero but not of sufficient magnitude to satisfy investigators.^[7]^:66 Still, its standard error has been described^[8] and is computed by various computer programs.^[9]

If statistical significance is not a useful guide, what magnitude of kappa reflects adequate agreement? Guidelines would be helpful, but factors other than agreement can influence its magnitude, which makes interpretation of a given magnitude problematic. As Sim and Wright noted, two important factors are prevalence (are the codes equiprobable, or do their probabilities vary?) and bias (are the marginal probabilities for the two observers similar or different?). Other things being equal, kappas are higher when codes are equiprobable. On the other hand, kappas are higher when codes are distributed asymmetrically by the two observers. In contrast to probability variations, the effect of bias is greater when kappa is small than when it is large.^[10]^:261–262

Another factor is the number of codes. As the number of codes increases, kappas become higher. Based on a simulation study, Bakeman and colleagues concluded that for fallible observers, values for kappa were lower when codes were fewer. And, in agreement with Sim and Wright's statement concerning prevalence, kappas were higher when codes were roughly equiprobable. Thus Bakeman et al. concluded that "no one value of kappa can be regarded as universally acceptable."^[11]^:357 They also provide a computer program that lets users compute values for kappa specifying number of codes, their probability, and observer accuracy. For example, given equiprobable codes and observers who are 85% accurate, values of kappa are 0.49, 0.60, 0.66, and 0.69 when the number of codes is 2, 3, 5, and 10, respectively.
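The quoted values can be reproduced with a simple closed-form sketch in Python. The error model here is an assumption made for illustration and is not taken from Bakeman et al.'s program: the true codes are equiprobable, the two observers err independently, and a wrong answer is spread uniformly over the remaining codes. The function name `expected_kappa` is hypothetical.

```python
def expected_kappa(num_codes, accuracy):
    """Expected kappa for two independent, equally accurate observers.

    Assumed model (illustrative only): equiprobable true codes, and each
    error lands uniformly on one of the other num_codes - 1 codes.
    """
    if num_codes < 2:
        raise ValueError("need at least two codes")

    # The observers agree if both are right, or if both are wrong and
    # happen to pick the same wrong code.
    p_o = accuracy**2 + (1 - accuracy) ** 2 / (num_codes - 1)

    # Under this model each observer's marginal distribution is uniform,
    # so chance agreement is 1 / num_codes.
    p_e = 1 / num_codes

    return (p_o - p_e) / (1 - p_e)


for k in (2, 3, 5, 10):
    print(k, round(expected_kappa(k, 0.85), 2))
# 2 0.49
# 3 0.6
# 5 0.66
# 10 0.69
```

Under these assumptions the observers are equally accurate in every row, yet kappa rises with the number of codes because chance agreement shrinks.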

Nonetheless, magnitude guidelines have appeared in the literature. Perhaps the first was Landis and Koch,^[12] who characterized values < 0 as indicating no agreement and 0–0.20 as slight, 0.21–0.40 as fair, 0.41–0.60 as moderate, 0.61–0.80 as substantial, and 0.81–1 as almost perfect agreement. This set of guidelines is however by no means universally accepted; Landis and Koch supplied no evidence to support it, basing it instead on personal opinion. It has been noted that these guidelines may be more harmful than helpful.^[13] Fleiss's^[14]^:218 equally arbitrary guidelines characterize kappas over 0.75 as excellent, 0.40 to 0.75 as fair to good, and below 0.40 as poor.
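For readers who still want to attach the Landis and Koch labels to a computed value, the bands can be written as a small lookup in Python. This is only a convenience sketch: the function name `landis_koch_label` is hypothetical, values that fall between the printed bands (for example 0.205) are assigned to the higher band as an assumption, and the caveats above about the arbitrariness of the bands apply in full.

```python
def landis_koch_label(kappa):
    """Map a kappa value to the Landis and Koch descriptive band."""
    if kappa > 1:
        raise ValueError("kappa cannot exceed 1")
    if kappa < 0:
        return "no agreement"
    # Upper bound of each band, checked in increasing order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label


print(landis_koch_label(0.4))     # fair
print(landis_koch_label(0.2593))  # fair
print(landis_koch_label(0.1304))  # slight
```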

Weighted kappa

Weighted kappa lets you count disagreements differently^[15] and is especially useful when codes are ordered.^[7]^:66 Three matrices are involved: the matrix of observed scores, the matrix of expected scores based on chance agreement, and the weight matrix. Weight matrix cells located on the diagonal (upper-left to bottom-right) represent agreement and thus contain zeros. Off-diagonal cells contain weights indicating the seriousness of that disagreement. Often, cells one off the diagonal are weighted 1, those two off 2, etc.

The equation for weighted κ is:

\kappa = 1- \frac{\sum_{i=1}^{k} \sum_{j=1}^{k}w_{ij}x_{ij}} {\sum_{i=1}^{k} \sum_{j=1}^{k}w_{ij}m_{ij}}

where $k$ is the number of codes and $w_{ij}$, $x_{ij}$, and $m_{ij}$ are elements in the weight, observed, and expected matrices, respectively. When diagonal cells contain weights of 0 and all off-diagonal cells weights of 1, this formula produces the same value of kappa as the calculation given above.
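The weighted formula can be sketched in Python as follows. The function name `weighted_kappa` is hypothetical. The expected matrix is assumed to be the usual chance table built from the marginal totals, $m_{ij} = (\text{row}_i \cdot \text{col}_j)/N$, and the default weights are the linear weights $|i-j|$ described above.

```python
def weighted_kappa(observed, weights=None):
    """Weighted kappa for a square table of observed counts.

    weights[i][j] is the disagreement weight for cell (i, j); if omitted,
    linear weights |i - j| are used (0 on the diagonal, 1 one step off, ...).
    """
    k = len(observed)
    n = sum(sum(row) for row in observed)
    if weights is None:
        weights = [[abs(i - j) for j in range(k)] for i in range(k)]

    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Weighted sum over the observed counts x_ij.
    numerator = sum(
        weights[i][j] * observed[i][j] for i in range(k) for j in range(k)
    )
    # Weighted sum over the chance-expected counts m_ij.
    denominator = sum(
        weights[i][j] * row_totals[i] * col_totals[j] / n
        for i in range(k)
        for j in range(k)
    )
    return 1 - numerator / denominator


# With 0/1 weights the grant example gives the unweighted value again.
zero_one = [[0, 1], [1, 0]]
print(round(weighted_kappa([[20, 5], [10, 15]], zero_one), 4))  # 0.4
```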

Kappa maximum

Kappa assumes its theoretical maximum value of 1 only when both observers distribute codes the same, that is, when corresponding row and column sums are identical. Anything less is less than perfect agreement. Still, the maximum value kappa could achieve given unequal distributions helps interpret the value of kappa actually obtained. The equation for κ maximum is:^[16]

\kappa _{{\max }}={\frac {P_{{\max }}-P_{{\exp }}}{1-P_{{\exp }}}}

where, as usual, $P_{\exp}=\sum_{i=1}^{k}P_{i+}P_{+i}$, $P_{\max}=\sum_{i=1}^{k}\min(P_{i+},P_{+i})$, $k$ is the number of codes, $P_{i+}$ are the row probabilities, and $P_{+i}$ are the column probabilities.
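As an illustration, the maximum can be computed from a table of counts with a short Python sketch. The function name `kappa_max` is hypothetical, and the table used is the grant example from earlier in the article.

```python
def kappa_max(table):
    """Largest kappa attainable given the table's row and column totals."""
    k = len(table)
    n = sum(sum(row) for row in table)

    # Row probabilities P_{i+} and column probabilities P_{+i}.
    row_p = [sum(row) / n for row in table]
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    p_exp = sum(r * c for r, c in zip(row_p, col_p))
    # For each code, agreement cannot exceed the smaller of the two marginals.
    p_max = sum(min(r, c) for r, c in zip(row_p, col_p))

    return (p_max - p_exp) / (1 - p_exp)


# Grant example: the readers' marginals differ (50/50 versus 60/40).
print(round(kappa_max([[20, 5], [10, 15]]), 4))  # 0.8
```

For that table the obtained kappa of 0.4 can therefore be read against a ceiling of 0.8 rather than 1.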

Limitations

Kappa is an index that considers observed agreement with respect to a baseline agreement. However, investigators must consider carefully whether Kappa's baseline agreement is relevant for the particular research question. Kappa's baseline is frequently described as the agreement due to chance, which is only partially correct. Kappa's baseline agreement is the agreement that would be expected due to random allocation, given the quantities specified by the marginal totals of the square contingency table. Thus, Kappa = 0 when the observed allocation is apparently random, regardless of the quantity disagreement as constrained by the marginal totals. However, for many applications, investigators should be more interested in the quantity disagreement in the marginal totals than in the allocation disagreement as described by the additional information on the diagonal of the square contingency table. Thus for many applications, Kappa's baseline is more distracting than enlightening. Consider the following example:

Kappa example

Comparison 1
		Reference
		G	R
Comparison	G	1	14
Comparison	R	0	1

The disagreement proportion is 14/16 or 0.875. The disagreement is due to quantity because allocation is optimal. Kappa is 0.01.

Comparison 2
		Reference
		G	R
Comparison	G	0	1
Comparison	R	1	14

The disagreement proportion is 2/16 or 0.125. The disagreement is due to allocation because quantities are identical. Kappa is -0.07.

Here, reporting quantity and allocation disagreement is informative while Kappa obscures information. Furthermore, Kappa introduces some challenges in calculation and interpretation because Kappa is a ratio. It is possible for Kappa’s ratio to return an undefined value due to zero in the denominator. Furthermore, a ratio does not reveal its numerator nor its denominator. It is more informative for researchers to report disagreement in two components, quantity and allocation. These two components describe the relationship between the categories more clearly than a single summary statistic. When predictive accuracy is the goal, researchers can more easily begin to think about ways to improve a prediction by using two components of quantity and allocation, rather than one ratio of Kappa.^[1]
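The two components can be computed for the comparisons above with a short Python sketch. It assumes the usual definitions for a table of proportions: quantity disagreement is half the sum over categories of the absolute difference between row and column totals, and allocation disagreement is half the sum over categories of twice the smaller of the two off-diagonal remainders. The function name `quantity_allocation` is hypothetical.

```python
def quantity_allocation(table):
    """Return (quantity, allocation) disagreement for a square count table."""
    k = len(table)
    n = sum(sum(row) for row in table)
    p = [[cell / n for cell in row] for row in table]

    row_p = [sum(row) for row in p]
    col_p = [sum(p[i][j] for i in range(k)) for j in range(k)]

    # Quantity: mismatch between the two sets of marginal totals.
    quantity = sum(abs(r - c) for r, c in zip(row_p, col_p)) / 2

    # Allocation: off-diagonal mass that could be fixed by swapping items
    # without changing either set of marginal totals.
    allocation = sum(
        2 * min(row_p[g] - p[g][g], col_p[g] - p[g][g]) for g in range(k)
    ) / 2

    return quantity, allocation


print(quantity_allocation([[1, 14], [0, 1]]))  # (0.875, 0.0)
print(quantity_allocation([[0, 1], [1, 14]]))  # (0.0, 0.125)
```

In both comparisons the two components add up to the total disagreement proportion, which is one minus the observed agreement.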

Some researchers have expressed concern over κ's tendency to take the observed categories' frequencies as givens, which can make it unreliable for measuring agreement in situations such as the diagnosis of rare diseases. In these situations, κ tends to underestimate the agreement on the rare category.^[17] For this reason, κ is considered an overly conservative measure of agreement.^[18] Others^[19] contest the assertion that kappa "takes into account" chance agreement. To do this effectively would require an explicit model of how chance affects rater decisions. The so-called chance adjustment of kappa statistics supposes that, when not completely certain, raters simply guess—a very unrealistic scenario.

References

1. Pontius, Robert; Millones, Marco (2011). "Death to Kappa: birth of quantity disagreement and allocation disagreement for accuracy assessment". International Journal of Remote Sensing. 32: 4407–4429.
2. Galton, F. (1892). Finger Prints. Macmillan, London.
3. Smeeton, N.C. (1985). "Early History of the Kappa Statistic". Biometrics. 41: 795. JSTOR 2531300.
4. Cohen, Jacob (1960). "A coefficient of agreement for nominal scales". Educational and Psychological Measurement. 20 (1): 37–46. doi:10.1177/001316446002000104.
5. Powers, David M. W. (2012). "The Problem with Kappa" (PDF). Conference of the European Chapter of the Association for Computational Linguistics (EACL2012) Joint ROBUS-UNSUP Workshop.
6. Gwet, Kilem (May 2002). "Inter-Rater Reliability: Dependency on Trait Prevalence and Marginal Homogeneity" (PDF). Statistical Methods for Inter-Rater Reliability Assessment. 2: 1–10.
7. Bakeman, R.; Gottman, J.M. (1997). Observing interaction: An introduction to sequential analysis (2nd ed.). Cambridge, UK: Cambridge University Press. ISBN 0-521-27593-8.
8. Fleiss, J.L.; Cohen, J.; Everitt, B.S. (1969). "Large sample standard errors of kappa and weighted kappa". Psychological Bulletin. 72: 323–327. doi:10.1037/h0028106.
9. Robinson, B.F.; Bakeman, R. (1998). "ComKappa: A Windows 95 program for calculating kappa and related statistics". Behavior Research Methods, Instruments, and Computers. 30: 731–732. doi:10.3758/BF03209495.
10. Sim, J.; Wright, C.C. (2005). "The Kappa Statistic in Reliability Studies: Use, Interpretation, and Sample Size Requirements". Physical Therapy. 85: 257–268. PMID 15733050.
11. Bakeman, R.; Quera, V.; McArthur, D.; Robinson, B.F. (1997). "Detecting sequential patterns and determining their reliability with fallible observers". Psychological Methods. 2: 357–370. doi:10.1037/1082-989X.2.4.357.
12. Landis, J.R.; Koch, G.G. (1977). "The measurement of observer agreement for categorical data". Biometrics. 33 (1): 159–174. JSTOR 2529310. PMID 843571. doi:10.2307/2529310.
13. Gwet, K. (2010). Handbook of Inter-Rater Reliability (2nd ed.). ISBN 978-0-9708062-2-2.
14. Fleiss, J.L. (1981). Statistical methods for rates and proportions (2nd ed.). New York: John Wiley. ISBN 0-471-26370-2.
15. Cohen, J. (1968). "Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit". Psychological Bulletin. 70 (4): 213–220. PMID 19673146. doi:10.1037/h0026256.
16. Umesh, U.N.; Peterson, R.A.; Sauber, M.H. (1989). "Interjudge agreement and the maximum value of kappa". Educational and Psychological Measurement. 49: 835–850. doi:10.1177/001316448904900407.
17. Viera, Anthony J.; Garrett, Joanne M. (2005). "Understanding interobserver agreement: the kappa statistic". Family Medicine. 37 (5): 360–363.
18. Strijbos, J.; Martens, R.; Prins, F.; Jochems, W. (2006). "Content analysis: What are they talking about?". Computers & Education. 46: 29–48. doi:10.1016/j.compedu.2005.04.002.
19. Uebersax, J.S. (1987). "Diversity of decision-making models and the measurement of interrater agreement" (PDF). Psychological Bulletin. 101: 140–146. doi:10.1037/0033-2909.101.1.140.


This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.