BLAST
From Wikipedia, the free encyclopedia
This article may be too technical for a general audience. Please help improve this article by providing more context and better explanations of technical details to make it more accessible, without removing technical details. |
BLAST | |
---|---|
Developed by | Myers, E., Altschul S.F., Gish W., Miller E.W., Lipman D.J., NCBI |
Latest release | 2.2.18 |
OS | UNIX, Linux, Mac, MS-Windows |
Genre | Bioinformatics tool |
License | Public Domain |
Website | ftp://ftp.ncbi.nlm.nih.gov/blast/ |
In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence. The BLAST program was designed by Eugene Myers, Stephen Altschul, Warren Gish, David J. Lipman and Webb Miller at the NIH and was published in J. Mol. Biol. in 1990[1].
Contents |
[edit] Background
BLAST is one of the most widely used bioinformatics programs[2], because it addresses a fundamental problem and the algorithm emphasizes speed over sensitivity. This emphasis on speed is vital to making the algorithm practical on the huge genome databases currently available, although subsequent algorithms can be even faster.
Examples of other questions that researchers use BLAST to answer are
- Which bacterial species have a protein that is related in lineage to a certain protein with known amino-acid sequence?
- Where does a certain sequence of DNA originate?
- What other genes encode proteins that exhibit structures or motifs such as ones that have just been determined?
BLAST is also often used as part of other algorithms that require approximate sequence matching.
The BLAST algorithm and the computer program that implements it were developed by Stephen Altschul, Warren Gish, David Lipman at the U.S. National Center for Biotechnology Information (NCBI), Webb Miller at the Pennsylvania State University, and Gene Myers at the University of Arizona. It is available on the web on the NCBI website. Alternative implementations include WU-BLAST and FSA-BLAST.
The original paper by Altschul, et al.[1] was the most highly cited paper published in the 1990s.[3]
[edit] Input/Output
Input and output conform to the FASTA format.
[edit] Algorithm
To run, BLAST requires a query sequence to search for, and a sequence to search against (or a sequence database containing multiple such sequences)(also called the target sequence). BLAST will find subsequences in the database which are similar to subsequences in the query. In typical usage, the query sequence is much smaller than the database, e.g., the query may be one thousand nucleotides while the database is several billion nucleotides.
BLAST searches for high scoring sequence alignments between the query sequence and sequences in the database using a heuristic approach that approximates the Smith-Waterman algorithm. The exhaustive Smith-Waterman approach is too slow for searching large genomic databases such as GenBank. Therefore, the BLAST algorithm uses a heuristic approach that is less accurate than the Smith-Waterman but over 500 times faster. The speed and relatively good accuracy of BLAST are among the key technical innovation of the BLAST programs.
The original BLAST algorithm can be conceptually divided into three stages.
- In the first stage, BLAST searches for exact matches of a small fixed length W between the query and sequences in the database. For example, given the sequences AGTTAC and ACTTAG and a word length W = 3, BLAST would identify the matching substring TTA that is common to both sequences. These exact matches are known as seeds. By default, W = 11 is used for nucleic seeds. Present versions of BLAST use two exact hits separated by a gapless region as matches in the following alignment step (high-scoring pairs, HSPs).
- In the second stage, BLAST tries to extend the match in both directions, starting at the seed. The ungapped alignment process extends the initial seed match of length W in each direction in an attempt to boost the alignment score. Insertions and deletions are not considered during this stage. For our example, the ungapped alignment between the sequences AGTTAC and ACTTAG centered around the common word TTA would be:
..AGTTAC.. | ||| ..ACTTAG..
- If a high-scoring un-gapped alignment is found, the database sequence passes on to the third stage.
- In the third stage, BLAST performs a gapped alignment between the query sequence and the database sequence using a variation of the Smith-Waterman algorithm. Statistically significant alignments are then displayed to the user.
[edit] Parallel BLAST
Parallel BLAST versions are implemented using MPI and Pthreads, and have been ported to various platforms including Windows, Linux, Solaris, Mac OS X, and AIX. Popular approaches to parallelize BLAST include query distribution, hash table segmentation, computation parallelization, and database segmentation (partition)[citation needed].
[edit] Program
The BLAST program can either be downloaded and run as a command-line utility "blastall" or accessed for free over the web. The BLAST web server, hosted by the NCBI, allows anyone with a web browser to perform similarity searches against constantly updated databases of proteins and DNA that include most of the newly sequenced organisms.
BLAST is actually a family of programs (all included in the blastall executable). These include:
- Nucleotide-nucleotide BLAST (blastn)
- This program, given a DNA query, returns the most similar DNA sequences from the DNA database that the user specifies.
- Protein-protein BLAST (blastp)
- This program, given a protein query, returns the most similar protein sequences from the protein database that the user specifies.
- Position-Specific Iterative BLAST (PSI-BLAST)
- This program is used to find distant relatives of a protein. First, a list of all closely related proteins is created. These proteins are combined into a general "profile" sequence, which summarises significant features present in these sequences. A query against the protein database is then run using this profile, and a larger group of proteins is found. This larger group is used to construct another profile, and the process is repeated.
By including related proteins in the search, PSI-BLAST is much more sensitive in picking up distant evolutionary relationships than a standard protein-protein BLAST. - Nucleotide 6-frame translation-protein (blastx)
- This program compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.
- Nucleotide 6-frame translation-nucleotide 6-frame translation (tblastx)
- This program is the slowest of the BLAST family. It translates the query nucleotide sequence in all six possible frames and compares it against the six-frame translations of a nucleotide sequence database. The purpose of tblastx is to find very distant relationships between nucleotide sequences.
- Protein-nucleotide 6-frame translation (tblastn)
- This program compares a protein query against the all six frame translations of a nucleotide sequence database.
- Large numbers of query sequences (megablast)
- When comparing large numbers of input sequences via the command-line BLAST, "megablast" is much faster than running BLAST multiple times. It concatenates many input sequences together to form a large sequence before searching the BLAST database, then post-analyze the search results to glean individual alignments and statistical values.
[edit] Alternative versions
An extremely fast but considerably less sensitive alternative to BLAST that compares nucleotide sequences to the genome is BLAT (Blast Like Alignment Tool). A version designed for comparing multiple large genomes or chromosomes is BLASTZ.
[edit] Accelerated versions
- There are two main field-programmable gate array (FPGA) implementations of the BLAST algorithm. Progeniq is up to 100x faster than a software implementation running on the same processor[citation needed]. TimeLogic [1] offers a FPGA BLAST package called Tera-BLAST.
- The Mitrion-C Open Bio Project is an ongoing effort to port blast to run on Mitrion FPGAs. It is available on SourceForge.
[edit] References
- ^ a b Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990). "Basic local alignment search tool". J Mol Biol 215 (3): 403-410. doi: . PMID 2231712.
- ^ Casey, RM (2005). "BLAST Sequences Aid in Genomics and Proteomics". Business Intelligence Network.
- ^ Sense from Sequences: Stephen F. Altschul on Bettering BLAST. ScienceWatch July/August 2000.
[edit] See also
- Needleman-Wunsch algorithm
- Smith-Waterman algorithm
- Sequence alignment
- Sequence alignment software
- Sequerome
- eTBLAST
[edit] External links
- NCBI-BLAST website
- NCBI-BLAST Tutorial
- WU-BLAST - The original gapping BLAST with statistics, developed and maintained by Warren Gish at Washington University in St. Louis
- FSA-BLAST - A new, faster but still accurate version of NCBI BLAST based on recently published algorithmic improvements
- NBIC mpiBLAST - Netherlands Bioinformatics Centre, running mpiBLAST
- PatternHunter - An alternative software which provides similar functionality to BLAST while claiming increased speed and sensitivity
- Parallel BLAST - A dual scheduling BLAST tested on the Blue Gene/L
- BLAST HOWTO at the Wikiomics bioinformatics wiki
- A/G BLAST - Implementation for PowerPC G4/G5 processors and Mac OS X, from Apple Computer's Advanced Computation Group and Genentech.
- STRAP The protein workbench STRAP contains a comfortable BLAST front-end with a cache for BLAST results
- KoriBlast is a reliable graphical environment dedicated to sequence data mining. KoriBlast combines Blast searches with advanced data management capabilities and a state-of-the-art graphical user interface.
- Using the Basic Local Alignment Search Tool (BLAST)
Databases supported by Bioinformatic Harvester |
NCBI-BLAST | CDD | Ensembl | Entrez | Flybase | Flymine | GFP-cDNA | Genome_browser | GeneCard | Google_Scholar | GoPubMed | HomoloGene | iHOP | IPI | OMIM | Mitocheck | PSORT | PolyMeta | UniProt | SOURCE | SOSUI | RZPD | Sciencenet | STRING | SMART | ZFIN | |