Krippendorff's alpha

Krippendorff's alpha coefficient[1] is a statistical measure of the agreement achieved when coding a set of units of analysis in terms of the values of a variable. Since the 1970s, alpha has been used in content analysis, where textual units are categorized by trained readers; in counseling and survey research, where experts code open-ended interview data into analyzable terms; in psychological testing, where alternative tests of the same phenomena need to be compared; and in observational studies, where unstructured happenings are recorded for subsequent analysis.

Krippendorff’s alpha generalizes several known statistics, often called measures of inter-coder agreement, inter-rater reliability, or reliability of coding given sets of units (as distinct from unitizing), but it also distinguishes itself from statistics that are called reliability coefficients yet are unsuitable to the particulars of coding data generated for subsequent analysis.

Krippendorff’s alpha is applicable to any number of coders, each assigning one value to one unit of analysis, to incomplete (missing) data, to any number of values available for coding a variable, to binary, nominal, ordinal, interval, ratio, polar, and circular metrics (Levels of Measurement), and it adjusts itself to small sample sizes of the reliability data. The virtue of a single coefficient with these variations is that computed reliabilities are comparable across any number of coders, values, metrics, and unequal sample sizes.

Software for calculating Krippendorff’s alpha is available.[2][3][4][5]

Reliability data

Reliability data are generated in a situation in which m ≥ 2 jointly instructed (e.g., by a code book) but independently working coders assign any one of a set of values 1,...,V to a common set of N units of analysis. In their canonical form, reliability data are tabulated in an m-by-N matrix containing the values vij that coder ci has assigned to unit uj. Define mj as the number of values assigned to unit j across all coders. When data are incomplete, mj may be less than m. Reliability data require that values be pairable within units, i.e., mj ≥ 2. The total number of pairable values is n ≤ mN.

To help clarify, here is what the canonical form looks like, in the abstract:

u1 u2 u3 ... uN
c1 v11 v12 v13 ... v1N
c2 v21 v22 v23 ... v2N
c3 v31 v32 v33 ... v3N
... ... ... ... ... ...
cm vm1 vm2 vm3 ... vmN
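For illustration only, the canonical form can be held directly as a matrix with missing entries. The following minimal Python sketch (the variable names are illustrative and not drawn from any particular package) stores such a matrix and counts mj and the pairable values n:

# Reliability data in canonical form: an m-by-N list of lists,
# with None marking a missing value.
reliability_data = [
    [None, None, 2, 1, 3],   # coder c1
    [1,    None, 2, 1, 3],   # coder c2
    [None, None, 2, 1, 4],   # coder c3
]

m = len(reliability_data)       # number of coders
N = len(reliability_data[0])    # number of units

# m_j: number of values assigned to unit j; only units with m_j >= 2 are pairable
m_j = [sum(row[j] is not None for row in reliability_data) for j in range(N)]

# n: total number of pairable values (n <= mN)
n = sum(mj for mj in m_j if mj >= 2)
print(m, N, m_j, n)             # 3 5 [1, 0, 3, 3, 3] 9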

General form of alpha

\alpha = 1-\frac{D_o}{D_e} = 1 - \frac{\textstyle \sum_{u=1}^N \frac{m_u}{n}D_u}{D_e}

where the disagreement

D_u = \frac{1}{m_u (m_u -1)}\sum_{i=1,i' \ne i}^{m} \delta (c_{iu}, c_{i'u})

is the average difference \delta (c_{iu}, c_{i'u}) between two values ciu and ci'u over all mu(mu-1) pairs of values possible within unit u – without reference to coders. The difference function \delta(v,v') depends on the metric of the variable; see below. The observed disagreement

D_o = \sum_{u=1}^{N} \frac{m_u}{n} D_u = \frac{1}{n} \sum_{u=1}^{N} \frac{1}{m_u -1} \sum_{i=1,i'\ne i}^{m} \delta (c_{iu}, c_{i'u})

is the average over all unit-wise disagreements in the data. And the expected disagreement

D_e = \frac{1}{n(n-1)} \sum_{u=1,u'=1}^{N} \sum_{i=1,i'=1}^{m} \delta (c_{iu}, c_{i'u'}), [(i,u) \ne (i',u')]

is the average difference between any two values ciu and ci'u' over all n(n–1) pairs of values possible within the reliability data – without reference to coders or units. In effect, De is the disagreement that is expected when the values used by all coders are randomly assigned to the given set of units.

One interpretation of Krippendorff's alpha is: \alpha = 1 - \frac {D_{within~units~=~in~error}}{D_{within~and~between~units~=~in~total}}

α = 1 indicates perfect reliability.
α = 0 indicates the absence of reliability. Units and the values assigned to them are statistically unrelated.
α < 0 when disagreements are systematic and exceed what can be expected by chance.

In this general form, disagreements Do and De may be conceptually transparent but are computationally inefficient. They can be simplified algebraically, especially when expressed in terms of the visually more instructive coincidence matrix representation of the reliability data.
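As a sketch of this general form, the following Python function (hypothetical names, not a published implementation) computes Do and De directly from all pairs of values as defined above; it is transparent but inefficient compared with the coincidence-matrix route described next. Passing delta(v, v') = 0 if v = v' and 1 otherwise yields the nominal version of alpha.

# data: m-by-N list of lists with None for missing values;
# delta: any difference function with delta(v, v) = 0.
def alpha_general(data, delta):
    # keep only the pairable units (those with at least two values)
    units = []
    for j in range(len(data[0])):
        vals = [row[j] for row in data if row[j] is not None]
        if len(vals) >= 2:
            units.append(vals)
    n = sum(len(vals) for vals in units)

    # observed disagreement D_o: average difference within units
    # (self-pairs contribute 0 because delta(v, v) = 0)
    D_o = sum(
        sum(delta(v, w) for v in vals for w in vals) / (len(vals) - 1)
        for vals in units
    ) / n

    # expected disagreement D_e: average difference over all n(n-1) pairs
    # of values, regardless of units and coders
    pooled = [v for vals in units for v in vals]
    D_e = sum(delta(v, w) for v in pooled for w in pooled) / (n * (n - 1))

    return 1 - D_o / D_e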

Coincidence matrices

A coincidence matrix cross tabulates the n pairable values from the canonical form of the reliability data into a v-by-v square matrix, where v is the number of values available in a variable. Unlike contingency matrices, familiar in association and correlation statistics, which tabulate pairs of values (Cross tabulation), a coincidence matrix tabulates all pairable values. A coincidence matrix omits references to coders and is symmetrical around its diagonal, which contains all perfect matches, viu = vi'u for two coders i and i' , across all units u. The matrix of observed coincidences contains frequencies:

o_{vv'} = \sum_{u=1}^N \frac{\sum_{i \ne i'}^m I(v_{iu}=v)*I(v_{i'u}=v') }{m_u - 1} = o_{v'v},
n_v = \sum_{l=1}^V o_{vl} = \sum_{l=1}^V o_{lv} = \sum_{i=1}^{m} \sum_{j=1}^{N} I(v_{ij} = v), and n = \sum_{v=1}^V n_v = \sum_{l=1,p=1}^V o_{lp},

omitting unpaired values, where I(∘) = 1 if ∘ is true, and 0 otherwise.

Because a coincidence matrix tabulates all pairable values and its contents sum to the total n, when four or more coders are involved, the entries o_{vv'} may be fractions.

The matrix of expected coincidences contains frequencies:

e_{vv'} = \frac{1}{n-1}
\begin{cases}
  n_v(n_v-1)  & \mbox{iff }v\mbox{ = }v' \\
  n_vn_{v'} & \mbox{iff }v\mbox{ ≠ }v'
\end{cases}
=e_{v'v} ,

which sum to the same n_v, n_{v'}, and n as do the o_{vv'}. In terms of these coincidences, Krippendorff's alpha becomes:

\alpha = 1- \frac{D_o}{D_e} = 1 - \frac{\sum_{v=1,v'=1}^{V} o_{vv'} \delta(v,v')}{ \sum_{v=1,v'=1}^{V} e_{vv'} \delta(v,v')} = 1 - \frac{\sum_{v=1,v'=1}^{V} o_{vv'}  \delta (v, v')}{\frac{1}{n-1} \sum_{v=1,v'=1}^{V} n_v n_{v'}~\delta (v,v')}.
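The following Python sketch (illustrative names only) builds the observed coincidence matrix from canonical-form data and evaluates alpha from it; because the common factor 1/n in Do and De cancels, only the sums appearing in the formula above are needed:

from collections import defaultdict

def coincidence_matrix(data):
    # o[(v, v')]: observed coincidences; entries may be fractional because
    # each unit's ordered pairs are divided by m_u - 1
    o = defaultdict(float)
    for j in range(len(data[0])):
        vals = [row[j] for row in data if row[j] is not None]
        if len(vals) < 2:
            continue                       # unpaired values are omitted
        for i, v in enumerate(vals):
            for i2, w in enumerate(vals):
                if i != i2:
                    o[(v, w)] += 1 / (len(vals) - 1)
    return o

def alpha_from_coincidences(o, delta):
    n_v = defaultdict(float)               # marginal totals n_v
    for (v, w), c in o.items():
        n_v[v] += c
    n = sum(n_v.values())
    numerator = sum(c * delta(v, w) for (v, w), c in o.items())
    denominator = sum(n_v[v] * n_v[w] * delta(v, w)
                      for v in n_v for w in n_v) / (n - 1)
    return 1 - numerator / denominator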

Difference functions

Difference functions \delta(v,v')[6] between values v and v' reflect the metric properties (Levels of Measurement) of their variable.

In general:

\delta (v,v') \ge 0

\delta(v,v) = 0

\delta(v,v') =  \delta(v',v)

In particular:

For nominal data  \delta_{nominal}(v,v') =
\begin{cases}
  0 & \mbox{iff }v\mbox{ = }v' \\
  1 & \mbox{iff }v\mbox{ ≠ }v'
\end{cases}
 , where v and v' serve as names.
For ordinal data  \delta_{ordinal}(v,v') = \left ( \sum_{g=v}^{g=v'} n_g - \frac{n_v + n_{v'}}{2} \right )^2, where v and v' are ranks.
For interval data  \delta_{interval}(v,v') = (v - v')^2, where v and v' are interval scale values.
For ratio data  \delta_{ratio}(v,v') = \left ( \frac{v-v'}{v+v'} \right )^2, where v and v' are absolute values.
For polar data  \delta_{polar}(v,v') =  \frac{(v-v')^2}{(v+v'-2v_{min})(2v_{max}-v-v')} , where vmin and vmax define the end points of the polar scale.
For circular data  \delta_{circular}(v,v') = \left ( \sin \left [180 \frac{v-v'}{U} \right ] \right )^2, where the sine function is expressed in degrees and U is the circumference or the range of values in a circle or loop before they repeat. For equal-interval circular metrics, the smallest and largest integer values of this metric are adjacent to each other and U = vlargest − vsmallest + 1.
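These difference functions translate directly into code. The Python sketch below assumes integer values and ranks and, for the ordinal case, a dictionary n_g of rank frequencies taken from the reliability data; all names are illustrative:

import math

def delta_nominal(v, w):
    return 0 if v == w else 1

def delta_interval(v, w):
    return (v - w) ** 2

def delta_ratio(v, w):
    return ((v - w) / (v + w)) ** 2

def delta_ordinal(v, w, n_g):
    # n_g[g]: frequency of rank g in the reliability data
    lo, hi = min(v, w), max(v, w)
    total = sum(n_g.get(g, 0) for g in range(lo, hi + 1))
    return (total - (n_g[v] + n_g[w]) / 2) ** 2

def delta_polar(v, w, v_min, v_max):
    return (v - w) ** 2 / ((v + w - 2 * v_min) * (2 * v_max - v - w))

def delta_circular(v, w, U):
    # sin(180 (v - w) / U) with the argument in degrees
    return math.sin(math.pi * (v - w) / U) ** 2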

Significance

Inasmuch as mathematical statements of the statistical distribution of alpha are always only approximations, it is preferable to obtain alpha’s distribution by bootstrapping.[7][8] Alpha's distribution gives rise to two indices: the confidence limits of a computed alpha at chosen levels of statistical significance, and the probability that alpha fails to reach a chosen minimum reliability.

The minimum acceptable alpha coefficient should be chosen according to the importance of the conclusions to be drawn from imperfect data. When the costs of mistaken conclusions are high, the minimum alpha needs to be set high as well. In the absence of knowledge of the risks of drawing false conclusions from unreliable data, social scientists commonly rely on data with reliabilities α ≥ .800, consider data with 0.800 > α ≥ 0.667 only to draw tentative conclusions, and discard data whose agreement measures α < 0.667.[9]
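For illustration, a plain nonparametric bootstrap over units is sketched below in Python; this simple resampling scheme is not necessarily the algorithm of the cited references, and alpha_fn stands for any function that computes alpha from a canonical-form matrix:

import random

def bootstrap_alpha(data, alpha_fn, B=1000, alpha_min=0.667, seed=0):
    # Resample units with replacement, recompute alpha each time, and report
    # a 95% interval and the probability of falling below alpha_min.
    rng = random.Random(seed)
    N = len(data[0])
    estimates = []
    for _ in range(B):
        cols = [rng.randrange(N) for _ in range(N)]
        resampled = [[row[j] for j in cols] for row in data]
        estimates.append(alpha_fn(resampled))
    estimates.sort()
    ci = (estimates[int(0.025 * B)], estimates[int(0.975 * B)])
    p_below = sum(a < alpha_min for a in estimates) / B
    return ci, p_below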

A misunderstanding of Krippendorff's alpha has become an instructive public controversy.[10]

A computational example

Let the canonical form of reliability data be a 3-coder-by-15 unit matrix with 45 cells:

Units u: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Coder A * * * * * 3 4 1 2 1 1 3 3 * 3
Coder B 1 * 2 1 3 3 4 3 * * * * * * *
Coder C * * 2 1 3 4 4 * 2 1 1 3 3 * 4

Suppose “*” indicates a default category like “cannot code,” “no answer,” or “lacking an observation.” Then, * provides no information about the reliability of data in the four values that matter. Note that units 2 and 14 contain no information and unit 1 contains only one value, which is not pairable within that unit. Thus, these reliability data consist not of mN=45 but of n=26 pairable values, located not in N=15 but in 12 multiply coded units.

The coincidence matrix for these data would be constructed as follows:

o11 = {in u=4}: \textstyle\frac {2}{2-1}+ {in u=10}: \textstyle\frac {2}{2-1}+ {in u=11}: \textstyle\frac {2}{2-1}=6
o13 = {in u=8}: \textstyle\frac {1}{2-1}=1= o31
o22 = {in u=3}: \textstyle\frac {2}{2-1}+ {in u=9}: \textstyle\frac {2}{2-1}=4
o33 = {in u=5}: \textstyle\frac {2}{2-1}+ {in u=6}: \textstyle\frac {2}{3-1}+ {in u=12}: \textstyle\frac {2}{2-1}+ {in u=13}: \textstyle\frac {2}{2-1}=7
o34 = {in u=6}: \textstyle\frac {2}{3-1}+ {in u=15}: \textstyle\frac {1}{2-1}=2= o43
o44 = {in u=7}: \textstyle\frac {6}{3-1}=3
Values v or v':   1   2   3   4   n_v
Value 1           6   0   1   0     7
Value 2           0   4   0   0     4
Value 3           1   0   7   2    10
Value 4           0   0   2   3     5
Frequency n_v'    7   4  10   5    26

In terms of the entries in this coincidence matrix, Krippendorff's alpha may be calculated from:

\alpha_{metric} = 1 - \frac{D_o}{D_e} = 1 - \frac{\sum_{v=1,v'=1}^{V} o_{vv'}  \delta_{metric}(v,v')}{\frac{1}{n-1} \sum_{v=1,v'=1}^{V} n_v n_{v'}~ \delta_{metric}(v,v')}.

For convenience, because products with \delta(v,v) = 0 vanish and \delta(v,v') = \delta(v',v), only the entries in one of the off-diagonal triangles of the coincidence matrix are listed in the following:

\alpha_{metric} = 1 - \frac{1 \delta_{metric}(1,3) + 2 \delta_{metric}(3,4)}{\frac{1}{26-1}(4\cdot7 \delta_{metric} (1,2) + 10\cdot7 \delta_{metric}(1,3) + 5\cdot7 \delta_{metric}(1,4) + 10\cdot4 \delta_{metric}(2,3) +5\cdot4 \delta_{metric}(2,4) + 5\cdot10 \delta_{metric}(3,4))}

Considering that all  \delta_{nominal}(v,v') = 1 when v {\ne}v', for nominal data the above expression yields:

\alpha_{nominal} = 1 - \frac{1+2}{\frac{1}{26-1}(4\cdot7 + 10\cdot7 + 5\cdot7 + 10\cdot4 + 5\cdot4 + 5\cdot10)} =0.691

With  \delta_{interval}(1,2)= \delta_{interval}(2,3)=  \delta_{interval}(3,4) = 1^2, \delta_{interval}(1,3) = \delta_{interval}(2,4)=2^2, and \delta_{interval}(1,4)=3^2, for interval data the above expression yields:

\alpha_{interval} = 1 - \frac{1\cdot2^2+2\cdot1^2}{\frac{1}{26-1}(4\cdot7\cdot1^2+10\cdot7\cdot2^2+5\cdot7\cdot3^2+10\cdot4\cdot1^2+5\cdot4\cdot2^2+5\cdot10\cdot1^2)} = 0.811

Here, \alpha_{interval} > \alpha_{nominal} because disagreements happen to occur largely among neighboring values, visualized by their occurring closer to the diagonal of the coincidence matrix, a condition that \alpha_{interval} takes into account but \alpha_{nominal} does not. When the observed frequencies o_{v \ne v'} are on average proportional to the expected frequencies e_{v \ne v'}, \alpha_{interval} = \alpha_{nominal}.

Comparing alpha coefficients across different metrics can provide clues to how coders conceptualize the metric of a variable.
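Both results can be checked with a few lines of Python working directly from the coincidence matrix above; only one off-diagonal triangle is needed because the factor of 2 cancels in the ratio:

# Marginals and off-diagonal coincidences from the matrix above
n_v = {1: 7, 2: 4, 3: 10, 4: 5}
n = sum(n_v.values())                              # 26
off_diagonal = {(1, 3): 1, (3, 4): 2}              # o_13 = 1, o_34 = 2

def alpha(delta):
    D_o = sum(c * delta(v, w) for (v, w), c in off_diagonal.items())
    D_e = sum(n_v[v] * n_v[w] * delta(v, w)
              for v in n_v for w in n_v if v < w) / (n - 1)
    return 1 - D_o / D_e

nominal  = lambda v, w: 0 if v == w else 1
interval = lambda v, w: (v - w) ** 2

print(round(alpha(nominal), 3))     # 0.691
print(round(alpha(interval), 3))    # 0.811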

Alpha's embrace of other statistics

Krippendorff's alpha brings several known statistics under a common umbrella; each of them has its own limitations but no additional virtues.

Scott’s pi is defined as:
\pi = \frac {P_o - P_e}{1-P_e} where  P_o = \sum_c \frac{o_{cc}}{n}, and P_e = \sum_c \frac{n_c^2}{n^2}
When data are nominal, alpha reduces to a form resembling Scott’s pi:
_{nominal}\alpha = 1 - \frac{D_o}{D_e} = \frac{\textstyle\sum_c o_{cc} - \textstyle\sum_c e_{cc}}{n - \textstyle\sum_c e_{cc}} = \frac{\textstyle\sum_c \frac {o_{cc}}{n} - \textstyle\sum_c \frac{n_c(n_c-1)}{n(n-1)}}{1- \textstyle\sum_c \frac {n_c(n_c-1)}{n(n-1)}}
Scott’s observed proportion of agreement \ P_o appears exactly in alpha’s numerator. Scott’s expected proportion of agreement, \ P_e = \textstyle\sum_c \frac {n_c^2}{n^2}, is asymptotically approximated by \textstyle\sum_c \frac{n_c(n_c-1)}{n(n-1)} when the sample size n is large, and equal when it is infinite. It follows that Scott’s pi is that special case of alpha in which two coders generate a very large sample of nominal data. For finite sample sizes: _{nominal}\alpha  = 1 - \textstyle\frac{n-1}{n} (1-\pi) \ge \pi. Evidently, \lim_{n \to \infty} \ _{nominal}\alpha = \pi. This relation is checked numerically in the sketch at the end of this section.
Fleiss’ kappa, here K, is defined as:
K = \frac{\bar P- \bar P_e}{1-\bar P_e} where \bar P = \frac{1}{N} \sum_{u=1}^N \sum_c \frac {n_{cu}(n_{cu}-1)}{m(m-1)} = \sum_c \frac{o_{cc}}{mN}, and \bar P_e = \sum_c \frac{n_c^2}{(mN)^2}
When sample sizes are finite, K can be seen to perpetrate the inconsistency of obtaining the proportion of observed agreements \bar P by counting matches within the m(m-1) possible pairs of values within u, properly excluding values paired with themselves, while the proportion \bar P_e is obtained by counting matches within all (mN)^2 = n^2 possible pairs of values, effectively including values paired with themselves. It is the latter that introduces a bias into the coefficient. However, just as for pi, when sample sizes become very large this bias disappears and the proportion \textstyle\sum_c \frac{n_c(n_c-1)}{n(n-1)} in _{nominal}\alpha above asymptotically approximates \bar P_e in K. Nevertheless, Fleiss' kappa, or rather K, intersects with alpha in that special situation in which a fixed number of m coders code all of N units (no data are missing), using nominal categories, and the sample size n = mN is very large, theoretically infinite.
Spearman’s rank correlation coefficient rho is defined as:
\rho = 1 - \frac {6 \sum D^2}{N(N^2-1)},
where \textstyle\sum D^2 = \textstyle\sum_{u=1}^N \delta_{ordinal}(c_u, k_u) is the sum of the N differences between one coder’s rank c_u and the other coder’s rank k_u of the same object u. Whereas alpha accounts for tied ranks in terms of their frequencies for all coders, rho averages them in each individual coder's instance. In the absence of ties, \rho's numerator \textstyle\sum D^2=ND_o and \rho's denominator \textstyle\frac{N(N^2-1)}{6}= \frac{n}{n-1} ND_e, where n=2N, which becomes \ ND_e when sample sizes become large. So, Spearman’s rho is that special case of alpha in which two coders rank a very large set of units. Again, _{ordinal}\alpha \ge \rho and \lim_{n \to \infty}\ _{ordinal}\alpha = \rho.
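The finite-sample relation between nominal alpha and Scott’s pi noted above can be verified numerically. The following Python sketch uses an arbitrary, made-up two-coder data set purely for illustration:

# Two coders, complete nominal data; n = 2N pairable values
coder1 = [1, 1, 2, 2, 3, 3, 1, 2, 3, 1]
coder2 = [1, 2, 2, 2, 3, 1, 1, 2, 3, 3]
N = len(coder1)
n = 2 * N

# Scott's pi
P_o = sum(a == b for a, b in zip(coder1, coder2)) / N
counts = {v: coder1.count(v) + coder2.count(v) for v in set(coder1 + coder2)}
P_e = sum(c * c for c in counts.values()) / n ** 2
pi = (P_o - P_e) / (1 - P_e)

# Nominal alpha for the same data
D_o = 1 - P_o
D_e = 1 - sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
alpha_nominal = 1 - D_o / D_e

# alpha = 1 - ((n - 1) / n)(1 - pi), hence alpha >= pi
assert abs(alpha_nominal - (1 - (n - 1) / n * (1 - pi))) < 1e-12
print(pi, alpha_nominal)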

Krippendorff's alpha is more general than any of these special purpose coefficients. It adjusts to varying sample sizes and affords comparisons across a wide variety of reliability data, mostly ignored by the familiar measures.

Coefficients incompatible with alpha and the reliability of coding

Semantically, reliability is the ability to rely on something, here on coded data for subsequent analysis. When a sufficiently large number of coders agree perfectly on what they have read or observed, relying on their descriptions is a safe bet. Judgments of this kind hinge on the number of coders duplicating the process and how representative the coded units are of the population of interest. Problems of interpretation arise when agreement is less than perfect, especially when reliability is absent.

Naming a statistic as one of agreement, reproducibility, or reliability does not make it a valid index of whether one can rely on coded data in subsequent decisions. Its mathematical structure must fit the process of coding units into a system of analyzable terms.

Notes

  1. Krippendorff, K. (2013) pp. 221-250 describes the mathematics of alpha and its use in content analysis since 1969.
  2. Hayes, A. F. & Krippendorff, K. (2007) describe and provide SPSS and SAS macros for computing alpha, its confidence limits and the probability of failing to reach a chosen minimum.
  3. Manual page of the kripp.alpha() function for the platform independent statistics package R
  4. The Alpha resources page.
  5. Matlab code to compute Krippendorff's alpha.
  6. “Computing Krippendorff’s Alpha Reliability,” http://repository.upenn.edu/asc_papers/43/
  7. Krippendorff, K. (2004) pp. 237-238
  8. Hayes, A. F. & Krippendorff, K. (2007)
  9. Krippendorff, K. (2004) pp. 241-243
  10. Brooks, R. “Sweet Jesus I love Bill O’Reilly!” Los Angeles Times, May 4, 2007, http://www.latimes.com/news/opinion/commentary/la-oe-brooks4may04,0,6548272.column?coll=la-home-commentary; Mitchell, R. “Stop calling O’Reilly names.” Los Angeles Times, May 10, 2007, http://www.latimes.com/news/opinion/la-oew-mitchell9may09,0,3143633.story?coll=la-opinion-center; Conway, M., Grabe, M. E., & Grieves, K. "Bill O'Reilly and Krippendorff's Alpha." Los Angeles Times, May 16, 2007, http://www.latimes.com/news/opinion/la-oew-conway16may16,0,3767872.story?coll=la-opinion-center; Conway, M., et al. “Peas in a pod; LA Times op-ed.
  11. Scott, W. A. (1955)
  12. Fleiss, J. L. (1971)
  13. Cohen, J. (1960)
  14. Siegel, S. & Castellan, N. J. (1988), pp. 284-291.
  15. Spearman, C. E. (1904)
  16. Pearson, K. (1901), Tildesley, M. L. (1921)
  17. Krippendorff, K. (1970)
  18. Cohen, J. (1960)
  19. Krippendorff, K. (1978) raised this issue with Joseph Fleiss
  20. Zwick, R. (1988), Brennan, R. L. & Prediger, D. J. (1981), Krippendorff (1978, 2004).
  21. Nunnally, J. C. & Bernstein, I. H. (1994)
  22. Cronbach, L. J. (1951)
  23. Bennett, E. M., Alpert, R. & Goldstein, A. C. (1954)
  24. Goodman, L. A. & Kruskal, W. H. (1954) p. 758
  25. Lin, L. I. (1989)

References

External links