Warren Gish
Warren Richard Gish | |
---|---|
Nationality | American |
Fields | Bioinformatics |
Institutions |
National Center for Biotechnology Information Washington University in St. Louis Advanced Biocomputing LLC University of California, Berkeley |
Alma mater | University of California, Berkeley |
Thesis | I. SV40 mutants isolated from transformed human cells. II. Methods for sequence analysis (1988) |
Doctoral advisor | Michael Botchan[1] |
Known for | BLAST |
Warren Richard Gish is the owner of Advanced Biocomputing LLC. He joined Washington University in St. Louis as a junior faculty member in 1994, and was a Research Associate Professor of Genetics from 2002 to 2007.[2][3]
Education
Gish completed his Doctor of Philosophy degree at the University of California, Berkeley in 1988.[1]
Research
Gish is known primarily for his contributions to NCBI BLAST,[4][5] the 1996 development of the first practical gapped BLAST package (with statistics) (WU-BLAST), and most recently the AB-BLAST package.
As a Molecular Biology graduate student in 1985, with a view toward rapid identification of restriction enzyme recognition sites in DNA sequence data, Gish independently developed a DFA function library in the C language. (The idea to apply a finite-state machine to this problem was originally suggested by fellow graduate student and BSD UNIX engineer Mike Karels.) Gish's DFA implementation was that of a Mealy machine architecture, which is more compact than an equivalent Moore machine and hence faster as well. Construction of the DFA was O(n), where n is the sum of the lengths of the sequences to be identified. The DFA could then be used to scan target sequences in a single pass with no backtracking in O(m) time, where m is the total length of the target(s). The method of DFA construction was recognized later as representing a consolidation of Algorithms 3 and 4 described by Alfred V. Aho and Margaret J. Corasick.[6]
While working for U.C. Berkeley in December 1986, Gish sped up the FASTP program [7] (later known as FASTA[8]) of William R. Pearson and David J. Lipman by 2- to 3-fold without altering the results. When the performance modifications were communicated to Pearson and Lipman, Gish further suggested that a DFA (rather than a lookup table) would yield faster k-tuple identification and improve the overall speed of the program by perhaps as much as 10% in some cases; however such marginal improvement even in the best case was deemed by the authors to not be worth the added code complexity. Gish also envisioned at this time a centralized search service, wherein all nucleotide sequences from GenBank would be maintained in memory to eliminate I/O bottlenecks—and stored in compressed form to conserve memory—with clients invoking FASTN searches remotely via the Internet.
Gish's earliest contributions to BLAST were made while working at the NCBI, starting in July 1989. Even in early prototypes BLAST was typically much faster than FASTA. Gish instantly recognized the added benefit in this context of using a DFA for word-hit recognition. He quickly morphed his earlier DFA code into a form that was used in all BLAST search modes. Others of his contributions to BLAST include: the use of compressed nucleotide sequences, both as an efficient storage format and as a rapid, native search format; parallel processing; memory-mapped I/O; the use of sentinel bytes and sentinel words at the start and end of sequences to improve the speed of word-hit extension; the first implementations of BLASTX, TBLASTN and TBLASTX; the transparent use of external (plug-in) programs such as seg, xnu, and dust to mask low-complexity regions in query sequences at run time; the NCBI BLAST E-mail Service with optional public key-encrypted communications; the NCBI Experimental BLAST Network Service; the NCBI non-redundant (nr) protein and nucleotide sequence databases, typically updated on a daily basis with all data from GenBank, Swiss-Prot, and the PIR; a BLAST function library used in specialized applications for EST analysis and Entrez data production, as well as in the NCBI BLAST suite version 1.4; and project management for the earliest NCBI Dispatcher for distributed services (inspired by CORBA's Object Request Broker). The NCBI Experimental BLAST Network Service, running the latest BLAST software on SMP hardware against the latest sequence databases, established the NCBI in December 1989, as a convenient, one-stop shop for sequence similarity searching.
At Washington University in St. Louis, Gish developed the first practical BLAST suite of programs to combine rapid gapped sequence alignment with statistical evaluation methods appropriate for gapped alignment scores. The resulting search programs were significantly more sensitive, but only marginally slower than ungapped BLAST, due to novel application of the BLAST dropoff score X during gapped alignment extension. Sensitivity of gapped BLAST was further improved by his novel application of Karlin-Altschul Sum statistics[9] to the evaluation of multiple, gapped alignment scores in all BLAST search modes. Sum statistics were originally (and analytically) developed for the evaluation of multiple, ungapped alignment scores. The empirical use of Sum statistics in the treatment of gapped alignments was validated in collaboration with Stephen Altschul, from 1994-1995. In May 1996, WU-BLAST version 2.0 with gapped alignments was publicly released in the form of a drop-in upgrade for existing users of ungapped NCBI BLAST and WU-BLAST (both at version 1.4, after having forked in 1994). Little NIH funding (average 20% FTE) was received for his WU-BLAST development, starting in November 1995, and ending shortly after the September 1997 release of the NCBI gapped BLAST (“blastall”). As an option to WU-BLAST, Gish implemented a faster, more memory-efficient and more sensitive two-hit BLAST algorithm than is used by the NCBI software. In 1999, Gish added support to WU-BLAST for the Extended Database Format (XDF), the first BLAST database format capable of accurately representing the entire draft sequence of the human genome in full-length chromosome sequence objects. This was also the first time any BLAST package introduced a new database format in a manner transparent to existing users and without abandoning support for prior formats, as a result of abstracting the database I/O functions completely separately from the data analysis functions. WU-BLAST with XDF was the first BLAST suite to support accurate, comprehensive indexed-retrieval of NCBI standard sequence identifiers, to allow users to retrieve individual sequences in part or in whole, natively, translated or reverse-complemented, and able to dump the entire contents of a BLAST database back into human-readable FASTA format. In 2000, unique support for reporting of links (consistent sets of HSPs) was added, along with the ability for users to limit the distance between HSPs allowed in the same set to a biologically relevant length (e.g., the length of the longest intron in the species of interest) and with the distance limitation entering into the calculation of p-values. Between 2001-2003, Gish improved the speed of the DFA code used in WU-BLAST. Gish also proposed multiplexing query sequences to speed up BLAST searches by an order of magnitude or more (MPBLAST); implemented segmented sequences with internal sentinel bytes, in part to aid multiplexing with MPBLAST and in part to aid analysis of segmented query sequences from shotgun sequencing assemblies; and directed use of WU-BLAST as a fast, flexible search engine for accurately identifying and masking genome sequences for repetitive elements and low-complexity sequences (the MaskerAid[10] package for RepeatMasker). With doctoral student Miao Zhang, Gish directed development of EXALIN,[11] which significantly improved the accuracy of spliced alignment predictions, by a novel approach that combined information from donor and acceptor splice site models with information from sequence conservation. Although EXALIN performed full dynamic programming by default, it could optionally utilize the output from WU-BLAST to seed the dynamic programming and speed up the process by about 100-fold with little loss of sensitivity or accuracy.
In 2008, Gish founded Advanced Biocomputing, LLC, where he continues to improve and support the AB-BLAST package.
References
- ↑ 1.0 1.1 Gish, Warren Richard (1988). I. SV40 mutants isolated from transformed human cells. II. Methods for sequence analysis (PhD thesis). University of California, Berkeley.
- ↑ List of publications from Microsoft Academic Search
- ↑ Warren Gish's publications indexed by the DBLP Bibliography Server at the University of Trier
- ↑ Altschul, S.; Gish, W.; Miller, W.; Myers, E.; Lipman, D. (1990). "Basic Local Alignment Search Tool". Journal of Molecular Biology 215 (3): 403–410. doi:10.1016/S0022-2836(05)80360-2. PMID 2231712.
- ↑ Sense from Sequences: Stephen F. Altschul on Bettering BLAST
- ↑ Aho, Alfred V.; Corasick, Margaret J. (June 1975). "Efficient string matching: An aid to bibliographic search". Communications of the ACM 18 (6): 333–340. doi:10.1145/360825.360855.
- ↑ Lipman, DJ; Pearson, WR (1985). "Rapid and sensitive protein similarity searches". Science 227 (4693): 1435–41. doi:10.1126/science.2983426. PMID 2983426.
- ↑ Pearson, W. R.; Lipman, D. J. (1988). "Improved tools for biological sequence comparison". Proceedings of the National Academy of Sciences of the United States of America 85 (8): 2444–2448. doi:10.1073/pnas.85.8.2444. PMC 280013. PMID 3162770.
- ↑ Karlin, S.; Altschul, S. F. (1993). "Applications and statistics for multiple high-scoring segments in molecular sequences". Proceedings of the National Academy of Sciences of the United States of America 90 (12): 5873–5877. doi:10.1073/pnas.90.12.5873. PMC 46825. PMID 8390686.
- ↑ Bedell, J. A.; Korf, I.; Gish, W. (2000). "MaskerAid : A performance enhancement to RepeatMasker". Bioinformatics 16 (11): 1040–1041. doi:10.1093/bioinformatics/16.11.1040. PMID 11159316.
- ↑ Zhang, M.; Gish, W. (2005). "Improved spliced alignment from an information theoretic approach". Bioinformatics 22 (1): 13–20. doi:10.1093/bioinformatics/bti748. PMID 16267086.