Cross tabulation
From Wikipedia, the free encyclopedia
A cross tabulation (often abbreviated as cross tab) displays the joint distribution of two or more variables. Cross tabulations are usually presented as a contingency table in a matrix format. Whereas a frequency distribution provides the distribution of one variable, a contingency table describes the distribution of two or more variables simultaneously. Each cell shows the number of respondents who gave a specific combination of responses; that is, each cell contains a single cross tabulation.
The following is a fictitious example of a 3 × 2 contingency table. The variable "Wikipedia usage" has three categories: heavy user, light user, and non-user. These categories are all-inclusive, so each column sums to 100%. The other variable, "underpants", has two categories: boxers and briefs. These categories are not all-inclusive, so the rows need not sum to 100%. Each cell gives the percentage of subjects wearing that type of underpants who fall into that Wikipedia usage category.
Wikipedia usage | boxers | briefs |
heavy Wiki user | 70% | 5% |
light Wiki user | 25% | 35% |
non Wiki user | 5% | 60% |
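As a minimal sketch of how such a table might be built, the following Python snippet counts each combination of responses and then converts the counts to column percentages. The `responses` data and the function names `crosstab` and `column_percentages` are hypothetical and chosen only for illustration; they are not taken from the example above.

```python
from collections import Counter

# Hypothetical raw survey responses: (Wikipedia usage, underpants) pairs.
responses = [
    ("heavy", "boxers"), ("heavy", "boxers"), ("light", "boxers"),
    ("light", "briefs"), ("non", "briefs"), ("non", "briefs"),
    ("heavy", "briefs"), ("light", "boxers"),
]


def crosstab(pairs):
    """Count how often each (row category, column category) pair occurs."""
    return Counter(pairs)


def column_percentages(counts):
    """Express each joint count as a percentage of its column total."""
    col_totals = Counter()
    for (_, col), n in counts.items():
        col_totals[col] += n
    return {
        (row, col): 100.0 * n / col_totals[col]
        for (row, col), n in counts.items()
    }


counts = crosstab(responses)
percents = column_percentages(counts)

for (row, col), pct in sorted(percents.items()):
    print(f"{row:>5} | {col:<6} | {pct:.1f}%")
```

Keeping the absolute counts alongside the percentages is a deliberate choice here, since statistics such as chi-square are computed from counts rather than percentages.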
Cross tabs are frequently used because:
- They are easy to understand. They appeal to people who do not want to use more sophisticated measures.
- They can be used with any level of data (nominal, ordinal, interval, or ratio), because cross tabs treat all data as if they were nominal.
- A table can provide greater insight than single statistics.
- They solve the problem of empty or sparse cells.
- They are simple to conduct.
Statistics related to cross tabulations
The following list is not comprehensive.
- Chi-square - This tests the statistical significance of the cross tabulations. Chi-square should not be calculated from percentages; the cross tabs must be converted back to absolute counts before calculating it. Chi-square is also problematic when any cell has an expected frequency of less than five. For an in-depth discussion of this issue see Fienberg, S.E. (1980). "The Analysis of Cross-classified Categorical Data." 2nd Edition. M.I.T. Press, Cambridge, MA.
- Contingency coefficient - This measures the strength of association of the cross tabulations. It is a variant of the phi coefficient computed from the chi-square statistic and the sample size. Values range from 0 (no association) toward 1, although the maximum attainable value is below 1 and depends on the number of rows and columns.
- Cramér's V - This measures the strength of association of the cross tabulations. It is a variant of the phi coefficient that adjusts for the number of rows and columns. Values range from 0 (no association) to 1 (the theoretical maximum possible association).
- Lambda coefficient - This measures the strength of association of the cross tabulations when the variables are measured at the nominal level. Values range from 0 (no association) to 1 (the theoretical maximum possible association). Asymmetric lambda measures the percentage improvement in predicting the dependent variable. Symmetric lambda measures the percentage improvement when prediction is done in both directions.
- Phi coefficient - When both variables are nominal and dichotomous, the phi coefficient measures the degree of association between the two binary variables. It is interpreted in much the same way as the correlation coefficient. Two binary variables are considered positively associated if most of the data fall along the diagonal cells, and negatively associated if most of the data fall off the diagonal.
- Kendall tau:
  - Tau b - This measures the strength of association of the cross tabulations when both variables are measured at the ordinal level. It makes adjustments for ties and is most suitable for square tables. Values range from -1 (100% negative association, or perfect inversion) to +1 (100% positive association, or perfect agreement). A value of zero indicates the absence of association.
  - Tau c - This measures the strength of association of the cross tabulations when both variables are measured at the ordinal level. It makes adjustments for table size and is most suitable for rectangular tables. Values range from -1 (100% negative association, or perfect inversion) to +1 (100% positive association, or perfect agreement). A value of zero indicates the absence of association.
- Gamma - This measures the strength of association of the cross tabulations when both variables are measured at the ordinal level. It makes no adjustment for either table size or ties. Values range from -1 (100% negative association, or perfect inversion) to +1 (100% positive association, or perfect agreement). A value of zero indicates the absence of association.
- Uncertainty coefficient, entropy coefficient or Theil's U
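Several of the statistics listed above can be computed directly from a table of absolute counts. The following Python sketch illustrates the chi-square statistic, the contingency coefficient, Cramér's V, and gamma using their standard textbook formulas. The `observed` counts are hypothetical: they assume 100 respondents per column, which turns the fictitious percentages from the example table into counts. The function names are likewise illustrative. Gamma is included only to show the counting of concordant and discordant pairs, which assumes that the order of the rows and columns is meaningful.

```python
import math

# Hypothetical absolute counts (not percentages).
# Rows: heavy, light, non-user. Columns: boxers, briefs.
observed = [
    [70, 5],
    [25, 35],
    [5, 60],
]


def chi_square(table):
    """Return (chi-square statistic, sample size, smallest expected count)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    min_expected = float("inf")
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected
            min_expected = min(min_expected, expected)
    return chi2, n, min_expected


def contingency_coefficient(chi2, n):
    """Pearson's contingency coefficient: sqrt(chi2 / (chi2 + n))."""
    return math.sqrt(chi2 / (chi2 + n))


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    return math.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))


def gamma(table):
    """Gamma: (C - D) / (C + D) over concordant and discordant pairs."""
    concordant = discordant = 0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            # Pair each cell with every cell in a later row.
            for k in range(i + 1, len(table)):
                for m, n_km in enumerate(table[k]):
                    if m > j:
                        concordant += n_ij * n_km
                    elif m < j:
                        discordant += n_ij * n_km
    return (concordant - discordant) / (concordant + discordant)


chi2, n, min_expected = chi_square(observed)
c = contingency_coefficient(chi2, n)
v = cramers_v(chi2, n, len(observed), len(observed[0]))
g = gamma(observed)

print(f"chi-square = {chi2:.3f}, smallest expected count = {min_expected:.1f}")
print(f"contingency coefficient = {c:.3f}, Cramér's V = {v:.3f}, gamma = {g:.3f}")
```

The smallest expected count is returned alongside the statistic so that the caution about cells with fewer than five expected observations can be checked before the chi-square value is interpreted.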