Sequence assembly

From Wikipedia, the free encyclopedia

In bioinformatics, sequence assembly refers to aligning and merging fragments of a much longer DNA sequence in order to reconstruct the original sequence. The fragments are typically pieces of a genome produced by shotgun sequencing, or pieces of gene transcripts (expressed sequence tags, ESTs).

Well-known first-generation sequence assemblers were Phrap, TIGR Assembler, and CAP3, with Phil Green's Phrap the earliest and most popular. Faced with the challenge of assembling much larger genomes, first that of the fruit fly Drosophila melanogaster in 2000 and then the human genome just a year later, scientists developed a new generation of assemblers. The first of these was the Celera Assembler, developed by Gene Myers and colleagues, followed by Arachne, developed at MIT by Serafim Batzoglou and later enhanced by David Jaffe and colleagues. These modern assemblers can handle genomes of 100-300 million base pairs, such as those of the fruit fly and other insects, as well as the roughly 3 billion base pairs of the human genome and other mammalian genomes. Subsequently, several other groups, mostly at the major genome sequencing centers, built large-scale assemblers, and an open source effort known as AMOS was launched to bring together the innovations in genome assembly technology under a single open source framework.

EST assembly differs from genome assembly in several ways. For instance, genomes often contain large amounts of repetitive sequence, mainly in the intergenic regions. Since ESTs represent gene transcripts, they will not contain these repeats. On the other hand, genes sometimes overlap in the genome (sense-antisense transcription), and their transcripts should ideally still be assembled separately. EST assembly is also complicated by features such as (cis-) alternative splicing, trans-splicing, single-nucleotide polymorphisms, recoding, and post-transcriptional modification. These differences make the new-generation genome assemblers less applicable to EST assembly.

Greedy algorithm

Given a set of sequence fragments, the objective is to find the shortest common superstring, that is, the shortest sequence that contains every fragment as a contiguous substring.

  1. calculate pairwise alignments of all fragments
  2. choose two fragments with the largest overlap
  3. merge chosen fragments
  4. repeat steps 2 and 3 until only one fragment is left

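Because each merge is chosen without regard to the merges that follow, the result contains every fragment but is not guaranteed to be the shortest possible solution.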
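
The steps above can be sketched in Python. This is a minimal illustration, not how any of the assemblers named in this article work: it assumes error-free reads, all on the same strand, and it scores overlaps by exact suffix/prefix matching instead of computing real pairwise alignments. The names overlap and greedy_assemble are hypothetical, chosen for this example only.

```python
from itertools import permutations


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str]) -> str:
    """Merge fragments greedily by largest exact overlap (illustrative only)."""
    # Drop duplicates and fragments wholly contained in another fragment;
    # they add nothing to the final sequence.
    frags = [f for f in dict.fromkeys(fragments)
             if not any(f != g and f in g for g in fragments)]

    while len(frags) > 1:
        # Steps 1-2: score every ordered pair and keep the largest overlap.
        # Ties are broken by taking the first pair found.
        best_k, best_i, best_j = -1, 0, 1
        for i, j in permutations(range(len(frags)), 2):
            k = overlap(frags[i], frags[j])
            if k > best_k:
                best_k, best_i, best_j = k, i, j

        # Step 3: merge the chosen pair, writing the shared region only once.
        merged = frags[best_i] + frags[best_j][best_k:]

        # Step 4: replace the two fragments by their merge and repeat.
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)

    return frags[0] if frags else ""


reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
print(greedy_assemble(reads))  # ATTAGACCTGCCGGAATAC

# A case where the greedy choice is not optimal.
tricky = ["GACAC", "CACA", "ACACG"]
print(greedy_assemble(tricky))  # CACAGACACG
```

For the second input, the greedy procedure first merges GACAC and ACACG (overlap 4), after which CACA no longer overlaps anything, giving a 10-character result. The 8-character sequence GACACACG also contains all three fragments, which illustrates why the greedy result may be suboptimal. Recomputing all pairwise overlaps in every round keeps the sketch short but is far too slow for realistic read counts.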
