Imputation (genetics)
Imputation in genetics refers to the statistical inference of unobserved genotypes.[1] It is achieved by using known haplotypes in a population, for instance from the HapMap or the 1000 Genomes Project in humans, which allows initially untyped genetic variants to be tested for association with a trait of interest.[2] Genotype imputation therefore helps narrow down the location of likely causal variants in genome-wide association studies.
Context
In genetic epidemiology and quantitative genetics, researchers aim to identify genomic locations whose variation between individuals is associated with variation in traits of interest between the same individuals. Such studies therefore require access to the genetic make-up of a set of individuals. Because sequencing the whole genome of each individual in the study is often too costly, only a subset of the genome can be measured. This often means, first, considering only single-nucleotide polymorphisms (SNPs) and neglecting copy number variants, and second, measuring only SNPs known to be variable enough in the population that they are likely to be variable in the set of individuals under consideration as well. The most informative subset of SNPs is chosen based on the distribution of common genetic variation along the genome, for instance as produced by the HapMap or the 1000 Genomes Project in humans. These SNPs are then used to build a micro-array, which allows each individual in the study to be genotyped at all these SNPs simultaneously.
Motivation
The cost of genotyping increases with the number of SNPs on the micro-array. For a fixed budget, a trade-off therefore exists between the number of SNPs and the number of individuals that can be genotyped. However, the genome is transmitted from parents to offspring over generations in such a way that the genotypes at nearby SNPs are correlated, a phenomenon known as linkage disequilibrium. If we measure the genotypes at two nearby SNPs, we should therefore be able to predict ("impute") the genotypes at the SNPs in between these two "tag" SNPs. The accuracy of prediction depends on the strength of correlation in the region and, more specifically, on the amount of recombination: the more recombination has occurred between two SNPs, the weaker the correlation between their genotypes. In the end, we can genotype a large number of individuals and, after imputation, assess the effect of more SNPs than only those on the micro-array.
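The idea of predicting untyped SNPs from tag SNPs can be illustrated with a minimal sketch. It assumes a small, made-up reference panel of phased haplotypes coded as 0/1 alleles and uses a simple nearest-match rule: the reference haplotypes that agree best with the target at the tag SNPs are averaged to give an expected allele ("dosage") at every SNP. The function name `impute_haplotype` and all data below are hypothetical, and this rule is a deliberate simplification for illustration, not the model used by any published imputation method.

```python
def impute_haplotype(reference, tag_idx, tag_alleles):
    """Toy imputation of one haplotype from a reference panel.

    reference   : list of reference haplotypes, each a list of 0/1 alleles
    tag_idx     : positions of the SNPs that were actually genotyped
    tag_alleles : alleles observed in the target at those positions
    Returns the expected allele (dosage) at every SNP.
    """
    # Count mismatches between the target and each reference haplotype,
    # looking only at the tag SNPs.
    def mismatches(hap):
        return sum(hap[i] != a for i, a in zip(tag_idx, tag_alleles))

    distances = [mismatches(hap) for hap in reference]
    best = min(distances)

    # Keep every reference haplotype that matches the target equally well.
    matches = [hap for hap, d in zip(reference, distances) if d == best]

    # Average the alleles of the best matches at each SNP.
    n_snps = len(reference[0])
    dosage = [sum(hap[j] for hap in matches) / len(matches) for j in range(n_snps)]

    # Observed alleles at the tag SNPs are kept as measured.
    for i, a in zip(tag_idx, tag_alleles):
        dosage[i] = float(a)
    return dosage


# Hypothetical reference panel: 4 haplotypes, 5 SNPs.
reference = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
]

# Only the first and last SNPs (the "tag" SNPs) are on the array.
dosage = impute_haplotype(reference, tag_idx=[0, 4], tag_alleles=[1, 1])
print(dosage)
```

In this toy panel the two haplotypes that match the target at the tag SNPs agree at the second and fourth SNPs but disagree at the third, so the imputed value there is intermediate. This mirrors the point made above: where correlation with the tag SNPs is weaker, the prediction is less certain.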
Statistical models
Designing accurate statistical models for genotype imputation is closely related to the problem of haplotype estimation ("phasing") and is an active area of research.[3]
References
- Scheet, Paul; Stephens, Matthew (2006). "A Fast and Flexible Statistical Model for Large-Scale Population Genotype Data: Applications to Inferring Missing Genotypes and Haplotypic Phase". The American Journal of Human Genetics 78 (4): 629–644. doi:10.1086/502802. PMC 1424677. PMID 16532393.
- Marchini, Jonathan; Howie, Bryan (2010). "Genotype imputation for genome-wide association studies". Nature Reviews Genetics 11 (7): 499–511. doi:10.1038/nrg2796. PMID 20517342.
- Howie, Bryan; Donnelly, Peter; Marchini, Jonathan (2009). "A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies". PLoS Genetics 5 (6). doi:10.1371/journal.pgen.1000529. PMC 2689936. PMID 19543373.