De novo transcriptome assembly
De novo transcriptome assembly is the method of creating a transcriptome without the aid of a reference genome.
Introduction
As a result of the development of novel sequencing technologies, the years between 2008 and 2012 saw a large drop in the cost of sequencing. Per megabase and per genome, the cost dropped to 1/100,000th and 1/10,000th of the price, respectively.[1] Prior to this, only transcriptomes of organisms that were of broad interest and utility to scientific research were sequenced; however, these newly developed high-throughput sequencing (also called next-generation sequencing) technologies are both cost- and labor-effective, and the range of organisms studied via these methods is expanding.[2] Transcriptomes have since been created for chickpea,[3] planarians,[4] Parhyale hawaiensis,[5] as well as the brains of the Nile crocodile, the corn snake, the bearded dragon, and the red-eared slider, to name just a few.[6]
Examining non-model organisms can provide novel insights into the mechanisms underlying the "diversity of fascinating morphological innovations" that have enabled the abundance of life on planet Earth.[7] In animals and plants, the "innovations" that cannot be examined in common model organisms include mimicry, mutualism, parasitism, and asexual reproduction. De novo transcriptome assembly is often the preferred method for studying non-model organisms, since it is cheaper and easier than building a genome, and reference-based methods are not possible without an existing genome. The transcriptomes of these organisms can thus reveal novel proteins and their isoforms that are implicated in such unique biological phenomena.
De novo vs. reference-based assembly
A set of assembled transcripts allows for initial gene expression studies. Prior to the development of transcriptome assembly computer programs, transcriptome data were analyzed primarily by mapping onto a reference genome. Though genome alignment is a robust way of characterizing transcript sequences, this method is disadvantaged by its inability to account for instances of structural alteration of mRNA transcripts, such as alternative splicing.[8] Since a genome contains the sum of all introns and exons that may be present in a transcript, spliced variants that do not align continuously along the genome may be discounted as actual protein isoforms.
Transcriptome vs. genome assembly
Unlike genome sequence coverage levels – which can vary randomly as a result of repeat content in non-coding intron regions of DNA – transcriptome sequence coverage levels can be directly indicative of gene expression levels. These repeated sequences also create ambiguities in the formation of contigs in genome assembly, while ambiguities in transcriptome assembly contigs usually correspond to spliced isoforms, or minor variation among members of a gene family.[8]
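As a rough illustration of why coverage can track expression, the mean sequencing depth over a contig may be approximated as the number of reads assigned to it multiplied by the read length and divided by the contig length. The sketch below assumes reads of uniform length; the function name and the read counts are hypothetical and are not taken from any particular tool.

```python
def mean_coverage(read_count, read_length, contig_length):
    """Approximate mean depth: total sequenced bases divided by contig length."""
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    return read_count * read_length / contig_length


# Hypothetical example: two contigs of equal length (1,500 bp), 100 bp reads.
# The contig with more assigned reads has proportionally higher coverage,
# which in a transcriptome suggests a more abundant transcript.
high = mean_coverage(read_count=1200, read_length=100, contig_length=1500)
low = mean_coverage(read_count=60, read_length=100, contig_length=1500)
print(high, low)  # prints 80.0 4.0
```

Real expression estimates normalize further (for example, for sequencing depth across samples), so this is only a first approximation.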
Method
RNA-seq
(Main article: RNA-seq)
Once mRNA is extracted and purified from cells, it is sent to a high-throughput sequencing facility, where it is first reverse transcribed to create a cDNA library. This cDNA can then be fragmented into various lengths depending on the platform used for sequencing. Each of the following platforms utilizes a different type of technology to sequence millions of short reads: 454 Sequencing, Illumina, and SOLiD.
Assembly algorithms
The cDNA sequence reads are assembled into transcripts via a short read transcript assembly program. Most likely, some amino acid variations among transcripts that are otherwise similar reflect different protein isoforms. It is also possible that they represent different genes within the same gene family, or even genes that share only a conserved domain, depending on the degree of variation.
A number of assembly programs are available (see Assemblers). Although these programs have been generally successful in assembling genomes, transcriptome assembly presents some unique challenges. Whereas high sequence coverage for a genome may indicate the presence of repetitive sequences (and thus be masked), for a transcriptome it may indicate abundance. In addition, unlike genome sequencing, transcriptome sequencing can be strand-specific, due to the possibility of both sense and antisense transcripts. Finally, it can be difficult to reconstruct and tease apart all splicing isoforms.[9]
Short read assemblers generally use one of two basic algorithms: overlap graphs and de Bruijn graphs.[10] Overlap graphs are utilized for most assemblers designed for Sanger sequenced reads. The overlap between each pair of reads is computed and compiled into a graph, in which each node represents a single sequence read. This algorithm is more computationally intensive than de Bruijn graphs, and is most effective in assembling fewer reads with a high degree of overlap.[10] De Bruijn graphs break reads into k-mers (usually 25–50 bp) and join k-mers that share a (k-1)-base overlap to create contigs. The use of k-mers, which are shorter than the read lengths, reduces the computational intensity of this method.[10]
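The de Bruijn approach can be illustrated with a minimal sketch. It assumes error-free reads from a single strand and a very small k so the example stays readable; real assemblers add error correction, coverage tracking, and handling of branches. The function names and the toy reads are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(set)
    for kmer in kmers:
        graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return graph


def extend_contig(graph, start):
    """Follow the unambiguous path from `start`, appending one base per step."""
    in_degree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_degree[node] += 1
    contig, node, visited = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        # Stop at a merge point or a cycle; those need isoform/repeat logic.
        if in_degree[nxt] != 1 or nxt in visited:
            break
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig


# Three overlapping toy reads drawn from the sequence ATGGCGTGCAAT.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
print(extend_contig(graph, "ATG"))  # prints ATGGCGTGCAAT
```

In a transcriptome, nodes with more than one successor often mark the points where spliced isoforms or gene family members diverge, which is why this sketch stops at branches rather than guessing a path.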
Functional annotation
Functional annotation of the assembled transcripts allows for insight into the particular molecular functions, cellular components, and biological processes in which the putative proteins are involved. Blast2GO (B2G) enables Gene Ontology (GO) based data mining to annotate sequence data for which no GO annotation is available yet. It is a research tool often employed in functional genomics research on non-model species.[11] It works by running BLAST searches of assembled contigs against a non-redundant protein database (at NCBI), then annotating them based on sequence similarity. GOanna is another GO annotation program, specific to animal and agricultural plant gene products, that works in a similar fashion. It is part of AgBase, a curated, publicly accessible suite of computational tools for GO annotation and analysis.[12] Following annotation, KEGG (Kyoto Encyclopedia of Genes and Genomes) enables visualization of metabolic pathways and molecular interaction networks captured in the transcriptome.[13]
In addition to being annotated for GO terms, contigs can also be screened for open reading frames (ORFs) in order to predict the amino acid sequence of proteins derived from these transcripts. Another approach is to annotate protein domains and determine the presence of gene families, rather than specific genes.
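A minimal sketch of ORF screening is given below. It assumes the standard genetic code, uppercase A/C/G/T input, and ORFs defined as running from an ATG to the first in-frame stop codon on either strand; the function names and the default length cutoff of 40 residues are illustrative choices, not the behavior of any specific annotation tool.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_aa=40):
    """Return (strand, start, protein) for each ATG-to-stop ORF of >= min_aa residues.

    `start` is the 0-based offset of the ATG on the strand being scanned.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            protein, start = None, None
            for i in range(frame, len(s) - 2, 3):
                aa = CODON_TABLE.get(s[i:i + 3], "X")  # X for ambiguous codons
                if protein is None:
                    if aa == "M":  # open a new ORF at the first ATG
                        protein, start = ["M"], i
                elif aa == "*":  # stop codon closes the ORF
                    if len(protein) >= min_aa:
                        orfs.append((strand, start, "".join(protein)))
                    protein = None
                else:
                    protein.append(aa)
    return orfs


# Toy contig with a three-residue ORF; min_aa is lowered so it is reported.
print(find_orfs("CCATGGCTTGGTAACC", min_aa=3))  # prints [('+', 2, 'MAW')]
```

ORFs still open at the end of a contig are dropped here; a fuller implementation might report them as partial, since assembled transcripts are often incomplete.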
Verification and quality control
Since a reference genome is not available, the quality of computer-assembled contigs may be verified either by comparing the assembled sequences to the reads used to generate them (reference-free), or by aligning the sequences of conserved gene domains found in mRNA transcripts to transcriptomes or genomes of closely related species (reference-based). Tools such as Transrate[14] and DETONATE[15] allow statistical analysis of assembly quality by these methods. Another method is to design PCR primers for predicted transcripts, then attempt to amplify them from the cDNA library. Often, exceptionally short reads are filtered out. Short sequences (< 40 amino acids) are unlikely to represent functional proteins, as they are unable to fold independently and form hydrophobic cores.[16]
Assemblers
The following is a partial compendium of assembly software that has been used to generate transcriptomes, and has also been cited in scientific literature.
SOAPdenovo-Trans
SOAPdenovo-Trans is a de novo transcriptome assembler derived from the SOAPdenovo2 framework, designed for assembling transcriptomes with alternative splicing and differing expression levels. The assembler provides a more comprehensive way to construct full-length transcript sets compared to SOAPdenovo2.
Velvet/Oases
(Main article: Velvet assembler)
The Velvet algorithm uses de Bruijn graphs to assemble transcripts. In simulations, Velvet can produce contigs up to 50-kb N50 length using prokaryotic data and 3-kb N50 in mammalian bacterial artificial chromosomes (BACs).[17] These preliminary transcripts are transferred to Oases, which uses paired-end read and long read information to build transcript isoforms.[18]
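N50 is the contig length at which contigs of that length or longer contain at least half of all assembled bases. A minimal sketch of the calculation, using hypothetical contig lengths and a hypothetical function name, is:

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # First length at which the cumulative sum reaches half the total.
        if running * 2 >= total:
            return length
    raise ValueError("lengths must not be empty")


# Total is 54 bases; 10 + 9 + 8 = 27 reaches half, so the N50 is 8.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # prints 8
```

For transcriptomes, N50 should be read with caution: transcript lengths vary naturally, so a larger value does not by itself indicate a better assembly.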
Trans-ABySS
ABySS (Assembly By Short Sequences) is a parallel, paired-end sequence assembler. Trans-ABySS is a software pipeline written in Python and Perl for analyzing ABySS-assembled transcriptome contigs. This pipeline can be applied to assemblies generated across a wide range of k values. It first reduces the dataset into smaller sets of non-redundant contigs, and identifies splicing events including exon-skipping, novel exons, retained introns, novel introns, and alternative splicing. The Trans-ABySS algorithms are also able to estimate gene expression levels and to identify potential polyadenylation sites and candidate gene-fusion events.[19]
Trinity
Trinity[20] first divides the sequence data into a number of de Bruijn graphs, each representing transcriptional variations at a single gene or locus. It then extracts full-length splicing isoforms and distinguishes transcripts derived from paralogous genes from each graph separately. Trinity consists of three independent software modules, which are used sequentially to produce transcripts:
- Inchworm assembles the RNA-Seq data into transcript sequences, often generating full-length transcripts for a dominant isoform, but then reports just the unique portions of alternatively spliced transcripts.
- Chrysalis clusters the Inchworm contigs and constructs complete de Bruijn graphs for each cluster. Each cluster represents the full transcriptional complexity for a given gene (or a family or set of genes that share a conserved sequence). Chrysalis then partitions the full read set among these separate graphs.
- Butterfly then processes the individual graphs in parallel, tracing the paths of reads within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms, and teasing apart transcripts that correspond to paralogous genes.[21]
See also
- Transcriptome
- Human-transcriptome database for alternative splicing (H-DBAS)
- UniGene
- Full-parasites
- Exome sequencing
References
- Wetterstrand KA. "DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program Available at: www.genome.gov/sequencingcosts". Genome.gov.
- Surget-Groba Y, Montoya-Burgos JI (2010). "Optimization of de novo transcriptome assembly from next-generation sequencing data". Genome Res. 20 (10): 1432–1440. doi:10.1101/gr.103846.109. PMC 2945192. PMID 20693479.
- Garg R, Patel RK, Tyagi AK, Jain M (2011). "De novo assembly of chickpea transcriptome using short reads for gene discovery and marker identification". DNA Res. 18 (1): 53–63. doi:10.1093/dnares/dsq028. PMC 3041503. PMID 21217129.
- Adamidi C et al. (2011). "De novo assembly and validation of planaria transcriptome by massive parallel sequencing and shotgun proteomics". Genome Res. 21 (7): 1193–1200. doi:10.1101/gr.113779.110. PMC 3129261. PMID 21536722.
- Zeng V et al. (2011). "De novo assembly and characterization of a maternal and developmental transcriptome for the emerging model crustacean Parhyale hawaiensis". BMC Genomics 12: 581. doi:10.1186/1471-2164-12-581. PMC 3282834. PMID 22118449.
- Tzika AC et al. (2011). "Reptilian transcriptome v1.0, a glimpse in the brain transcriptome of five divergent Sauropsida lineages and the phylogenetic position of turtles". EvoDevo 2 (1): 19. doi:10.1186/2041-9139-2-19. PMC 3192992. PMID 21943375.
- Rowan BA, Weigel D, Koenig D (2011). "Developmental genetics and new sequencing technologies: the rise of nonmodel organisms". Developmental Cell 21 (1): 65–76. doi:10.1016/j.devcel.2011.05.021. PMID 21763609.
- Birol I et al. (2009). "De novo transcriptome assembly with ABySS". Bioinformatics 25 (21): 2872–7. doi:10.1093/bioinformatics/btp367. PMID 19528083.
- Martin JA, Wang Z (2011). "Next-generation transcriptome assembly". Nature Reviews Genetics 12: 671–682. doi:10.1038/nrg3068.
- Illumina, Inc. (2010). "De Novo Assembly Using Illumina Reads".
- Conesa A et al. (2005). "Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research". Bioinformatics 21 (18): 3674–3676. doi:10.1093/bioinformatics/bti610. PMID 16081474.
- McCarthy FM et al. (2006). "AgBase: a functional genomics resource for agriculture". BMC Genomics 7: 229. doi:10.1186/1471-2164-7-229. PMC 1618847. PMID 16961921.
- "KEGG PATHWAY Database".
- "Transrate: understand your transcriptome assembly". http://hibberdlab.com/transrate
- Li B et al. (2014). "Evaluation of de novo transcriptome assemblies from RNA-Seq data". Genome Biology 15: 553.
- Karplus K. "pdb-l: Minimum length of Protein Sequence". https://lists.sdsc.edu/pipermail/pdb-l/2011-January/005317.html
- Zerbino DR, Birney E (2008). "Velvet: Algorithms for de novo short read assembly using de Bruijn graphs". Genome Res. 18 (5): 821–829. doi:10.1101/gr.074492.107. PMC 2336801. PMID 18349386.
- "Oases: de novo transcriptome assembler for very short reads".
- "Trans-ABySS: Analyze ABySS multi-k assembled shotgun transcriptome data".
- "Trinity".
- "Trinity RNA-Seq Assembly – software for the reconstruction of full-length transcripts and alternatively spliced isoforms".