GeneMark
GeneMark, developed in 1993, was the first gene-finding method recognized as an efficient and accurate tool for genome projects. GeneMark was used for annotation of the first completely sequenced bacterium, Haemophilus influenzae, and the first completely sequenced archaeon, Methanococcus jannaschii. The GeneMark algorithm uses species-specific inhomogeneous Markov chain models of protein-coding DNA sequence as well as homogeneous Markov chain models of non-coding DNA. Parameters of the models are estimated from training sets of sequences of known type. The major step of the algorithm computes the a posteriori probability that a sequence fragment carries genetic code in one of six possible frames (including three frames in the complementary DNA strand) or is “non-coding”.
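The posterior computation described above can be illustrated with a minimal sketch. It assumes order-zero models (one nucleotide distribution per codon position for coding DNA, a single distribution for non-coding DNA), equal priors, and made-up toy probabilities; GeneMark itself uses higher-order Markov chains trained on real sequences. All function names, frame labels and parameter values here are hypothetical.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def log_lik_coding(seq, codon_pos_probs, frame):
    """Log-likelihood under a 3-periodic (inhomogeneous) order-zero model.
    `frame` is the codon position (0, 1 or 2) of the first nucleotide."""
    return sum(math.log(codon_pos_probs[(frame + i) % 3][nt])
               for i, nt in enumerate(seq))

def log_lik_noncoding(seq, probs):
    """Log-likelihood under a homogeneous order-zero model."""
    return sum(math.log(probs[nt]) for nt in seq)

def frame_posteriors(seq, codon_pos_probs, noncoding_probs, priors=None):
    """Bayes' rule over seven hypotheses: six coding frames and non-coding."""
    labels = ["+1", "+2", "+3", "-1", "-2", "-3", "non-coding"]
    rc = reverse_complement(seq)
    logs = [log_lik_coding(seq, codon_pos_probs, f) for f in range(3)]
    logs += [log_lik_coding(rc, codon_pos_probs, f) for f in range(3)]
    logs.append(log_lik_noncoding(seq, noncoding_probs))
    if priors is None:
        priors = [1 / 7] * 7  # assumption: all hypotheses equally likely
    logs = [l + math.log(p) for l, p in zip(logs, priors)]
    top = max(logs)  # subtract the maximum to avoid underflow
    weights = [math.exp(l - top) for l in logs]
    total = sum(weights)
    return {lab: w / total for lab, w in zip(labels, weights)}

# Toy parameters, invented for illustration only.
CODON_POS_PROBS = [
    {"A": 0.30, "C": 0.20, "G": 0.40, "T": 0.10},  # codon position 1
    {"A": 0.35, "C": 0.25, "G": 0.15, "T": 0.25},  # codon position 2
    {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},  # codon position 3
]
NONCODING_PROBS = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

posteriors = frame_posteriors("GAC" * 10, CODON_POS_PROBS, NONCODING_PROBS)
best = max(posteriors, key=posteriors.get)
```

In this toy setting the fragment is a repeat of a codon that the coding model favors in the first forward frame, so that hypothesis receives the highest posterior.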
GeneMark.hmm (prokaryotic)
The GeneMark.hmm algorithm was designed to improve gene prediction quality, in particular to improve on GeneMark in finding exact gene starts. The idea was to integrate the GeneMark models into a naturally designed hidden Markov model framework, with gene boundaries modeled as transitions between hidden states. Additionally, a ribosome binding site model is used to make gene start predictions more accurate. Evaluations by different groups showed that GeneMark.hmm is significantly more accurate than GeneMark in exact gene prediction. Since 1998, GeneMark.hmm and its self-training version, GeneMarkS, have been the standard tools for gene identification in new prokaryotic genomic sequences, including metagenomes.
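The idea of reading gene boundaries off hidden-state transitions can be sketched with a plain two-state HMM decoded by the Viterbi algorithm. This is a deliberately reduced illustration, not the GeneMark.hmm model: it assumes only two states ("N" for non-coding, "C" for coding), order-zero emissions, no strand or frame handling, no ribosome binding site model, and invented probabilities. All names are hypothetical.

```python
import math
from itertools import groupby

def viterbi(seq, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for `seq` (log space)."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]])
               for s in states}]
    back = []
    for nt in seq[1:]:
        row, ptr = {}, {}
        for s in states:
            # Best predecessor state for being in `s` at this position.
            prev = max(states,
                       key=lambda r: scores[-1][r] + math.log(trans_p[r][s]))
            row[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                      + math.log(emit_p[s][nt]))
            ptr[s] = prev
        scores.append(row)
        back.append(ptr)
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):  # trace back from the last position
        state = ptr[state]
        path.append(state)
    return path[::-1]

def segments(path):
    """Collapse a state path into (state, start, end) runs, end exclusive."""
    out, pos = [], 0
    for state, run in groupby(path):
        n = len(list(run))
        out.append((state, pos, pos + n))
        pos += n
    return out

# Toy parameters, invented for illustration only.
STATES = ["N", "C"]
START_P = {"N": 0.5, "C": 0.5}
TRANS_P = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
EMIT_P = {
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},  # AT-rich background
    "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},  # GC-rich "gene"
}

seq = "ATATATAT" + "GCGCGCGCGC" + "ATATATAT"
path = viterbi(seq, STATES, START_P, TRANS_P, EMIT_P)
predicted = segments(path)
```

Each change of state in the decoded path marks a predicted boundary, which is how the HMM framing turns the question of gene starts and ends into a decoding problem.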
GeneMark.hmm (eukaryotic)
The next step after developing prokaryotic GeneMark.hmm was to extend the approach to eukaryotic genomes, where accurate prediction of protein-coding exon boundaries presents the major challenge. The HMM architecture of eukaryotic GeneMark.hmm consists of hidden states for initial, internal and terminal exons, introns, intergenic regions and single-exon genes located on both DNA strands. It also includes hidden states for the initiation site, the termination site, and the donor and acceptor splice sites. GeneMark.hmm has been frequently used for annotation of plant and animal genomes.
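The state inventory listed above can be written down as a small transition table. The sketch below covers the forward strand only and works at the level of segments (one entry per exon, intron or site). The set of allowed transitions is an assumption based on the usual structure of a eukaryotic gene, not a transcription of the published GeneMark.hmm topology, and all names are hypothetical.

```python
# Assumed forward-strand topology; the reverse strand would mirror it.
ALLOWED = {
    "intergenic":    {"start_site"},
    "start_site":    {"initial_exon", "single_exon"},
    "initial_exon":  {"donor"},
    "donor":         {"intron"},
    "intron":        {"acceptor"},
    "acceptor":      {"internal_exon", "terminal_exon"},
    "internal_exon": {"donor"},
    "terminal_exon": {"stop_site"},
    "single_exon":   {"stop_site"},
    "stop_site":     {"intergenic"},
}

def is_valid_path(path):
    """Check that every consecutive pair of states is an allowed transition."""
    if not path or any(state not in ALLOWED for state in path):
        return False
    return all(nxt in ALLOWED[cur] for cur, nxt in zip(path, path[1:]))

two_exon_gene = [
    "intergenic", "start_site", "initial_exon", "donor", "intron",
    "acceptor", "terminal_exon", "stop_site", "intergenic",
]
single_exon_gene = [
    "intergenic", "start_site", "single_exon", "stop_site", "intergenic",
]
broken = ["intergenic", "intron", "terminal_exon"]

results = [is_valid_path(p) for p in (two_exon_gene, single_exon_gene, broken)]
```

Restricting transitions this way is what forces any decoded path to describe a structurally consistent gene, for example by ensuring that an intron is always entered through a donor site and left through an acceptor site.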
Heuristic Models
Computational methods of accurate gene finding in DNA sequences require models of protein-coding and non-coding regions derived either from experimentally validated training sets or from large amounts of anonymous DNA sequence. A heuristic method for deriving the parameters of inhomogeneous Markov models of protein-coding regions was proposed in 1999. The heuristic method utilizes the observation that parameters of the Markov models used in GeneMark can be approximated by functions of the sequence G+C content. Therefore, a short DNA sequence sufficient for estimation of the genome G+C content (a fragment longer than 400 nt) is also sufficient for derivation of parameters of the Markov models used in GeneMark and GeneMark.hmm. Models built by the heuristic approach can be used to find genes in small fragments of anonymous prokaryotic genomes, such as metagenomic sequences, as well as in genomes of organelles, viruses, phages and plasmids. This method can also be used for highly inhomogeneous genomes where adjustment of the Markov models to local DNA composition is needed. The heuristic method provides evidence that the mutational pressure that shapes G+C content is the driving force of the evolution of codon usage patterns.
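The dependence of model parameters on G+C content can be sketched as follows. The non-coding model uses the simple assumption that G and C (and likewise A and T) are equally frequent. The coding-position model applies a linear function of G+C content per nucleotide, which is one possible form of the approximating functions; the coefficients shown are placeholders invented for illustration and are not the fitted values from the 1999 paper. All names are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous nucleotides of `seq`."""
    counted = [nt for nt in seq.upper() if nt in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no A, C, G or T")
    return sum(nt in "GC" for nt in counted) / len(counted)

def long_enough(seq, min_length=400):
    """The text above cites fragments longer than 400 nt as sufficient."""
    return len(seq) > min_length

def noncoding_model(gc):
    """Order-zero non-coding model, assuming P(G) = P(C) and P(A) = P(T)."""
    return {"A": (1 - gc) / 2, "C": gc / 2, "G": gc / 2, "T": (1 - gc) / 2}

def linear_position_model(gc, coeffs):
    """Nucleotide frequencies at one codon position as linear functions of GC.
    `coeffs` maps nucleotide -> (slope, intercept); values are clipped to stay
    positive and then renormalized to sum to one."""
    raw = {nt: max(slope * gc + intercept, 1e-6)
           for nt, (slope, intercept) in coeffs.items()}
    total = sum(raw.values())
    return {nt: value / total for nt, value in raw.items()}

# PLACEHOLDER coefficients for the third codon position: made-up numbers,
# NOT the published regression values.
THIRD_POSITION_COEFFS = {
    "A": (-0.8, 0.65),
    "C": (0.9, -0.20),
    "G": (0.9, -0.20),
    "T": (-1.0, 0.75),
}

fragment = "GGCCAATT" * 60  # 480 nt toy fragment
gc = gc_content(fragment)
background = noncoding_model(gc)
third_position = linear_position_model(0.7, THIRD_POSITION_COEFFS)
```

With real fitted coefficients for each codon position, the same two steps (estimate G+C content, then evaluate the functions) would yield a complete parameter set without any species-specific training data.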
Family of gene prediction programs
Bacteria, Archaea and metagenomes
- GeneMark-P
- GeneMark.hmm-P
- GeneMarkS
Eukaryotes
- GeneMark-E
- GeneMark.hmm-E
- GeneMark.hmm-ES
Viruses, phages and plasmids
- Heuristic approach
EST and cDNA
- GeneMark-E
References
GeneMark: Borodovsky M. and McIninch J. "GeneMark: parallel gene recognition for both DNA strands." Computers & Chemistry, 1993, Vol. 17, No. 2, pp. 123-133
GeneMark.hmm: Lukashin A. and Borodovsky M. "GeneMark.hmm: new solutions for gene finding." Nucleic Acids Research, 1998, Vol. 26, No. 4, pp. 1107-1115
Heuristic Models: Besemer J. and Borodovsky M. "Heuristic approach to deriving models for gene finding." Nucleic Acids Research, 1999, Vol. 27, No. 19, pp. 3911-3920
GeneMarkS: Besemer J., Lomsadze A. and Borodovsky M. "GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions." Nucleic Acids Research, 2001, Vol. 29, No. 12, pp. 2607-2618
VIOLIN: Mills R., Rozanov M., Lomsadze A., Tatusova T. and Borodovsky M. "Improving gene annotation in complete viral genomes." Nucleic Acids Research, 2003, Vol. 31, No. 23, pp. 7041-7055
GeneMark Web server: Besemer J. and Borodovsky M. "GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses." Nucleic Acids Research, 2005, Vol. 33, Web Server Issue, pp. W451-454
GeneMark.hmm-ES: Lomsadze A., Ter-Hovhannisyan V., Chernoff Y. and Borodovsky M. "Gene identification in novel eukaryotic genomes by self-training algorithm." Nucleic Acids Research, 2005, Vol. 33, No. 20, pp. 6494-6506