Multifactor dimensionality reduction

Multifactor dimensionality reduction (MDR) is a data mining approach for detecting and characterizing combinations of attributes or independent variables that interact to influence a dependent or class variable. MDR was designed specifically to identify interactions among discrete variables that influence a binary outcome and is considered a nonparametric alternative to traditional statistical methods such as logistic regression.

The basis of the MDR method is a constructive induction algorithm that converts two or more variables or attributes to a single attribute. This process of constructing a new attribute changes the representation space of the data. The end goal is to create or discover a representation that facilitates the detection of nonlinear or nonadditive interactions among the attributes such that prediction of the class variable is improved over that of the original representation of the data.

Illustrative example

Consider the following simple example using the exclusive OR (XOR) function. XOR is a logical operator that is commonly used in data mining and machine learning as an example of a function that is not linearly separable. The table below represents a simple dataset where the relationship between the attributes (X1 and X2) and the class variable (Y) is defined by the XOR function such that Y = X1 XOR X2.

Table 1

X1 X2 Y
0 0 0
0 1 1
1 0 1
1 1 0

A data mining algorithm would need to discover or approximate the XOR function in order to accurately predict Y using information about X1 and X2. An alternative strategy would be to first change the representation of the data using constructive induction to facilitate predictive modeling. The MDR algorithm changes the representation of the data (X1 and X2) in the following manner. MDR starts by selecting two attributes; in this simple example, X1 and X2 are selected. Each combination of values for X1 and X2 is examined, and the number of times Y=1 and Y=0 occur is counted. In this example, Y=1 occurs zero times and Y=0 occurs once for the combination X1=0 and X2=0. With MDR, the ratio of these counts is computed and compared to a fixed threshold. Here, the ratio of counts is 0/1, which is less than our fixed threshold of 1, so we encode a new attribute (Z) as 0 for this combination. When the ratio is greater than 1, we encode Z as 1; for X1=0 and X2=1, for instance, Y=1 occurs once and Y=0 never occurs, so the count for Y=1 exceeds the count for Y=0 and Z is encoded as 1. This process is repeated for all unique combinations of values for X1 and X2. Table 2 illustrates our new transformation of the data.

Table 2

Z Y
0 0
1 1
1 1
0 0

The data mining algorithm now has much less work to do to find a good predictive function. In fact, in this very simple example, the function Y = Z has a classification accuracy of 1. A nice feature of constructive induction methods such as MDR is the ability to use any data mining or machine learning method to analyze the new representation of the data. Decision trees, neural networks, or a naive Bayes classifier could be used.
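The encoding step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the reference MDR implementation: the function name mdr_transform and its threshold parameter are hypothetical, the counts are compared by multiplication rather than by division so that a combination with no Y=0 rows does not cause a division by zero, and a ratio exactly equal to the threshold is encoded as 0, which is an assumption the text above does not settle.

```python
from collections import Counter

def mdr_transform(rows, labels, threshold=1.0):
    """Collapse each row of attribute values into one binary attribute Z.

    For every unique combination of attribute values, count how often
    Y=1 and Y=0 occur, then encode the combination as 1 if the ratio of
    counts exceeds the threshold and as 0 otherwise.
    """
    ones = Counter()   # combination -> number of rows with Y=1
    zeros = Counter()  # combination -> number of rows with Y=0
    for row, y in zip(rows, labels):
        if y == 1:
            ones[tuple(row)] += 1
        else:
            zeros[tuple(row)] += 1

    def encode(combo):
        # n1 > threshold * n0 is equivalent to n1 / n0 > threshold when
        # n0 > 0, and avoids dividing by zero when n0 == 0.
        # Assumption: a ratio equal to the threshold is encoded as 0.
        return 1 if ones[combo] > threshold * zeros[combo] else 0

    return [encode(tuple(row)) for row in rows]

# The XOR dataset from Table 1.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
Y = [0, 1, 1, 0]

Z = mdr_transform(X, Y)
accuracy = sum(z == y for z, y in zip(Z, Y)) / len(Y)
print(Z, accuracy)
```

On the four rows of Table 1 this reproduces the Z column of Table 2, and the trivial classifier Y = Z predicts every row correctly.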

Data mining with MDR

As illustrated above, the basic constructive induction algorithm in MDR is very simple. However, its implementation for mining patterns from real data can be computationally complex. As with any data mining algorithm, there is always concern about overfitting: data mining algorithms are good at finding seemingly interesting patterns even in completely random data, so it can be hard to tell whether a result is an important signal or just a chance pattern. One approach is to estimate the generalizability of a model to independent datasets using methods such as cross-validation. Models that describe random data typically don't generalize. Another approach is to generate many random permutations of the data to see what the data mining algorithm finds when given the chance to overfit. Permutation testing makes it possible to generate an empirical p-value for a result. Both approaches have been shown to be useful for choosing and evaluating MDR models.
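Both safeguards can be sketched together in Python. The following is a minimal illustration under stated assumptions rather than the procedure used by any particular MDR package: the helper names (fit_mdr, predict, cv_accuracy, permutation_p_value) are hypothetical, the data are synthetic (Y = X1 XOR X2 with roughly 10% of labels flipped), the folds are formed by taking every k-th row, unseen attribute combinations are predicted as 0, and the empirical p-value uses the common (hits + 1) / (permutations + 1) estimate.

```python
import random
from collections import Counter

def fit_mdr(rows, labels, threshold=1.0):
    """Learn a lookup table mapping each attribute combination to 0 or 1."""
    ones, zeros = Counter(), Counter()
    for row, y in zip(rows, labels):
        (ones if y == 1 else zeros)[tuple(row)] += 1
    combos = set(ones) | set(zeros)
    return {c: int(ones[c] > threshold * zeros[c]) for c in combos}

def predict(model, rows):
    # Assumption: combinations never seen in training are predicted as 0.
    return [model.get(tuple(row), 0) for row in rows]

def cv_accuracy(rows, labels, k=5):
    """Average held-out accuracy over k folds (fold i holds out every k-th row)."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(rows)) if i % k != fold]
        test = [i for i in range(len(rows)) if i % k == fold]
        model = fit_mdr([rows[i] for i in train], [labels[i] for i in train])
        preds = predict(model, [rows[i] for i in test])
        correct = sum(p == labels[i] for p, i in zip(preds, test))
        scores.append(correct / len(test))
    return sum(scores) / k

def permutation_p_value(rows, labels, n_perm=200, seed=0):
    """Empirical p-value: how often shuffled labels match or beat the real score."""
    rng = random.Random(seed)
    observed = cv_accuracy(rows, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between attributes and class
        if cv_accuracy(rows, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Synthetic data for illustration only: XOR with about 10% label noise.
rng = random.Random(42)
rows = [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(200)]
labels = [(a ^ b) ^ (rng.random() < 0.1) for a, b in rows]

observed, p_value = permutation_p_value(rows, labels)
print(f"cross-validated accuracy: {observed:.2f}, empirical p-value: {p_value:.3f}")
```

A model that captures a real signal should keep most of its accuracy on the held-out folds, while the same procedure applied to shuffled labels should hover near chance, which is what drives the empirical p-value down.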

Applications

MDR has mostly been applied to detecting gene-gene interactions, or epistasis, in genetic studies of common human diseases such as atrial fibrillation, autism, bladder cancer, breast cancer, cardiovascular disease, hypertension, prostate cancer, schizophrenia, and type II diabetes. However, it can be applied to other domains, such as economics, engineering, and meteorology, where interactions among discrete attributes might be important for predicting a binary outcome.

Software

An open-source, freely available MDR software package is available for download.

References

  • Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet. 2001 Jul;69(1):138-47. PubMed
  • Moore JH, Williams SM. New strategies for identifying gene-gene interactions in hypertension. Ann Med. 2002;34(2):88-95. PubMed
  • Ritchie MD, Hahn LW, Moore JH. Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemiol. 2003 Feb;24(2):150-7. PubMed
  • Hahn LW, Ritchie MD, Moore JH. Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics. 2003 Feb 12;19(3):376-82. PubMed
  • Moore JH. The ubiquitous nature of epistasis in determining susceptibility to common human diseases. Hum Hered. 2003;56(1-3):73-82. PubMed
  • Cho YM, Ritchie MD, Moore JH, Park JY, Lee KU, Shin HD, Lee HK, Park KS. Multifactor-dimensionality reduction shows a two-locus interaction associated with Type 2 diabetes mellitus. Diabetologia. 2004 Mar;47(3):549-54. PubMed
  • Tsai CT, Lai LP, Lin JL, Chiang FT, Hwang JJ, Ritchie MD, Moore JH, Hsu KL, Tseng CD, Liau CS, Tseng YZ. Renin-angiotensin system gene polymorphisms and atrial fibrillation. Circulation. 2004 Apr 6;109(13):1640-6. PubMed
  • Hahn LW, Moore JH. Ideal discrimination of discrete clinical endpoints using multilocus genotypes. In Silico Biol. 2004;4(2):183-94. PubMed
  • Coffey CS, Hebert PR, Ritchie MD, Krumholz HM, Gaziano JM, Ridker PM, Brown NJ, Vaughan DE, Moore JH. An application of conditional logistic regression and multifactor dimensionality reduction for detecting gene-gene interactions on risk of myocardial infarction: the importance of model validation. BMC Bioinformatics. 2004 Apr 30;5:49. PubMed
  • Moore JH. Computational analysis of gene-gene interactions using multifactor dimensionality reduction. Expert Rev Mol Diagn. 2004 Nov;4(6):795-803. PubMed
  • Williams SM, Ritchie MD, Phillips JA 3rd, Dawson E, Prince M, Dzhura E, Willis A, Semenya A, Summar M, White BC, Addy JH, Kpodonu J, Wong LJ, Felder RA, Jose PA, Moore JH. Multilocus analysis of hypertension: a hierarchical approach. Hum Hered. 2004;57(1):28-38. PubMed
  • Bastone L, Reilly M, Rader DJ, Foulkes AS. MDR and PRP: a comparison of methods for high-order genotype-phenotype associations. Hum Hered. 2004;58(2):82-92. PubMed
  • Ma DQ, Whitehead PL, Menold MM, Martin ER, Ashley-Koch AE, Mei H, Ritchie MD, Delong GR, Abramson RK, Wright HH, Cuccaro ML, Hussman JP, Gilbert JR, Pericak-Vance MA. Identification of significant association and gene-gene interaction of GABA receptor subunit genes in autism. Am J Hum Genet. 2005 Sep;77(3):377-88. PubMed
  • Soares ML, Coelho T, Sousa A, Batalov S, Conceicao I, Sales-Luis ML, Ritchie MD, Williams SM, Nievergelt CM, Schork NJ, Saraiva MJ, Buxbaum JN. Susceptibility and modifier genes in Portuguese transthyretin V30M amyloid polyneuropathy: complexity in a single-gene disease. Hum Mol Genet. 2005 Feb 15;14(4):543-53. PubMed
  • Qin S, Zhao X, Pan Y, Liu J, Feng G, Fu J, Bao J, Zhang Z, He L. An association study of the N-methyl-D-aspartate receptor NR1 subunit gene (GRIN1) and NR2B subunit gene (GRIN2B) in schizophrenia with universal DNA microarray. Eur J Hum Genet. 2005 Jul;13(7):807-14. PubMed
  • Wilke RA, Moore JH, Burmester JK. Relative impact of CYP3A genotype and concomitant medication on the severity of atorvastatin-induced muscle damage. Pharmacogenet Genomics. 2005 Jun;15(6):415-21. PubMed
  • Xu J, Lowey J, Wiklund F, Sun J, Lindmark F, Hsu FC, Dimitrov L, Chang B, Turner AR, Liu W, Adami HO, Suh E, Moore JH, Zheng SL, Isaacs WB, Trent JM, Gronberg H. The interaction of four genes in the inflammation pathway significantly predicts prostate cancer risk. Cancer Epidemiol Biomarkers Prev. 2005 Nov;14 (11 Pt 1):2563-8. PubMed
  • Wilke RA, Reif DM, Moore JH. Combinatorial pharmacogenetics. Nat Rev Drug Discov. 2005 Nov;4(11):911-8. PubMed
  • Ritchie MD, Motsinger AA. Multifactor dimensionality reduction for detecting gene-gene and gene-environment interactions in pharmacogenomics studies. Pharmacogenomics. 2005 Dec;6(8):823-34. PubMed
  • Andrew AS, Nelson HH, Kelsey KT, Moore JH, Meng AC, Casella DP, Tosteson TD, Schned AR, Karagas MR. Concordance of multiple analytical approaches demonstrates a complex relationship between DNA repair gene SNPs, smoking and bladder cancer susceptibility. Carcinogenesis. 2006 May;27(5):1030-7. PubMed
  • Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, Barney N, White BC. A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol. 2006 Jul 21;241(2):252-61. PubMed

Further reading

  • R. S. Michalski, "Pattern Recognition as Knowledge-Guided Computer Induction," Department of Computer Science Reports, No. 927, University of Illinois, Urbana, June 1978.