Bioinformatics workflow management systems
A bioinformatics workflow management system is a specialized workflow management system designed to compose and execute a series of computational or data manipulation steps, known as a workflow, for bioinformatics applications.
Many workflow systems exist. Some were developed as general-purpose scientific workflow systems for use by scientists across disciplines such as astronomy and earth science.
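As a minimal sketch of what composing and executing a workflow means, the following Python example models a workflow as named steps plus a table of dependencies and runs each step once its upstream steps have finished. It does not reflect how any system listed below is implemented. The step names, the toy sequences, and the `run_workflow` helper are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def load_sequences(_upstream):
    # Hypothetical stand-in for reading sequences from a file or database.
    return {"seq1": "ATGCGC", "seq2": "ATATAT"}


def gc_content(upstream):
    # Fraction of G and C bases in each sequence produced by the "load" step.
    seqs = upstream["load"]
    return {name: (s.count("G") + s.count("C")) / len(s) for name, s in seqs.items()}


def filter_high_gc(upstream):
    # Keep the names of sequences whose GC fraction is at least 0.5.
    return sorted(name for name, gc in upstream["gc"].items() if gc >= 0.5)


# The workflow: each step name maps to a function, and to the steps it depends on.
STEPS = {"load": load_sequences, "gc": gc_content, "filter": filter_high_gc}
DEPENDS_ON = {"load": set(), "gc": {"load"}, "filter": {"gc"}}


def run_workflow(steps, depends_on):
    """Run every step after all of its dependencies and return all results."""
    results = {}
    for name in TopologicalSorter(depends_on).static_order():
        upstream = {dep: results[dep] for dep in depends_on[name]}
        results[name] = steps[name](upstream)
    return results


if __name__ == "__main__":
    print(run_workflow(STEPS, DEPENDS_ON)["filter"])
```

Keeping the dependency table separate from the step functions lets the same steps be rearranged into different workflows, which is roughly the role a graphical workflow composer plays in the systems described below.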
Examples
- Anduril is an open source component-based workflow framework for scientific data analysis developed at the University of Helsinki.[1] Anduril provides an execution engine written in Java, a large number of components for bioinformatics analysis, and the AndurilScript language to create and manage workflows.
- BioBike[2] is a biocomputing platform based upon the KnowOS (Knowledge Operating System) e-science technology. Written entirely in Lisp, KnowOS's main distinguishing feature is "through-the-browser" programmability.
- BioExtract harnesses the power of online informatics tools for creating and customizing workflows. Users can query online sequence data, analyze it using an array of informatics tools, create and share custom workflows for repeated analysis, and save the resulting data and workflows in standardized reports.
- BioManager is a bioinformatic data management and analysis workflow developed by the University of Sydney.
- CellProfiler[3] is open source modular image analysis software developed at the Broad Institute. Capable of handling hundreds of thousands of images, it contains advanced algorithms for image analysis of cell-based assays and is optimized for high-throughput work. The software allows the user to construct a pipeline of individual modules; each module performs an image processing step, such as image loading, object identification, or feature extraction.
- Discovery Net (circa 2000) is one of the earliest examples of scientific workflow systems. It was the winner of the “Most Innovative Data Intensive Application Award” at the ACM SC02 (Supercomputing 2002) conference and exhibition, based on a demonstration of a fully interactive distributed genome annotation pipeline for a Malaria genome case study. The Discovery Net system originated from a £2m EPSRC-funded project with the same name investigating the development of an e-Science platform for scientific discovery from the data generated by a wide variety of high throughput devices at Imperial College London. Many of the features of the system (architecture features, visual front-end, simplified access to remote Web and Grid Services and inclusion of a workflow store) were considered novel at the time, and have since found their way into other academic and commercial systems.
- Ergatis[4] is a web-based system used to create, run, and monitor reusable bioinformatics analysis pipelines. It contains pre-built components for common bioinformatics analysis tasks, such as BLAST searches or storing data in a Chado database. These components can be arranged graphically to create highly configurable pipelines.
- Galaxy[5] is an open source workflow system developed at Penn State and Emory University. Galaxy is available as a free public web server[6] and as downloadable software.[7] Galaxy emphasizes ease of use and the sharing and persistence of analyses.
- GenePattern is a genomic analysis platform developed at the Broad Institute of MIT & Harvard that provides access to more than 150 tools for gene expression analysis, proteomics, SNP analysis, RNA-seq, flow cytometry, and common data processing tasks. A web-based interface provides access to these tools and allows the creation of multi-step analysis pipelines that enable reproducible in silico research.
- Geodise (Grid Enabled Optimisation and Design Search for Engineering) was developed at the University of Southampton.
- Kepler enables scientists in a variety of disciplines, such as biology, ecology and astronomy, to compose and execute workflows. Kepler is based on the Ptolemy II system for heterogeneous, concurrent modeling and design. Ptolemy II was developed by the members of the Ptolemy project at the University of California, Berkeley. Although Ptolemy II was not originally intended for scientific workflows, it provides a mature platform for building and executing workflows, and supports multiple models of computation.
- LONI Pipeline is a Java-based distributed graphical data-analysis environment for constructing, validating, executing and disseminating scientific workflows. As the LONI Pipeline references all data, services and tools as external objects, it directly allows resource interoperability without the need for rebuilding the software.
- Medicel Integrator Workflow is a cluster-enabled bioinformatics workflow design and execution application. It can be used stand-alone or integrated with a biology data warehouse.
- Pegasus is a flexible framework, developed at the Information Sciences Institute at the University of Southern California, that enables the mapping of complex scientific workflows onto the grid.
- Pegasys is software for executing and integrating analyses of biological sequences, developed by the University of British Columbia.
- Pipeline Pilot is Accelrys’ scientific informatics platform. It streamlines data integration and analysis by using a visual programming language (similar to LabVIEW) to build a pipeline that transforms any number of inputs (raw data) into any number of outputs.
- Taverna workbench is an open source workflow system that enables scientists (typically, though not exclusively, in bioinformatics) to compose and execute scientific workflows. It has been developed as part of a £5.5m EPSRC project called myGrid based at the University of Manchester. Independently, other researchers have created programming-by-example workflow development tools that are interoperable with Taverna.[8]
- Triana is an open source problem solving environment developed at Cardiff University that combines an intuitive visual interface with powerful data analysis tools.
- Wildfire is a distributed, Grid-enabled workflow construction and execution environment. It has a graphical user interface for constructing and running workflows. Wildfire borrows user interface features from Jemboss and adds a drag-and-drop interface allowing the user to compose EMBOSS (and other) programs into workflows. For execution, Wildfire uses GEL, the underlying workflow execution engine, which can exploit available parallelism on multiple CPU machines including Beowulf-class clusters and Grids.
- Sight is a web-agent-oriented workflow platform that has extensive means to integrate websites with ordinary web forms and HTML responses (WSDL is also supported). The system has a GUI-based workflow composer that supports modules with multiple ports and allows a module to access data from modules that stand earlier in the workflow. Sight was developed at Ulm University using Java and is currently released under the GPL.
- RetroGuide is a query framework for querying retrospective bioinformatics data.
- UGENE Workflow Designer is an open source visual environment designed for building and executing bioinformatics workflows. The main purpose of the system is to provide a user-friendly GUI for creating computational workflows that can be executed on commodity hardware as well as on high-performance clusters and supercomputers.
- HCDC is an open source workflow system developed at ETH Zurich that is focused on large-scale image-based biological experiments. It includes a large collection of components for multiwell plate handling (96, 384, ...).
- Mobyle is a framework and web portal specifically aimed at the integration of bioinformatics software and databanks. Mobyle is the successor of Pise and the RPBS server, previous systems that provided web environments to define and execute bioinformatics analyses.
- Remora is a web server implemented according to the BioMoby web-service specifications, providing life science researchers with an easy-to-use workflow generator and launcher, a repository of predefined workflows and a survey system.
References
- Ovaska, K.; Laakso, M.; Haapa-Paananen, S.; Louhimo, R.; Chen, P.; Aittomäki, V.; Valo, E.; Núñez-Fontarnau, J. et al. (2010). "Large-scale data integration framework provides a comprehensive view on glioblastoma multiforme". Genome Medicine 2 (9): 65. doi:10.1186/gm186. PMID 20822536.
- Elhai, J.; Taton, A.; Massar, J.; Myers, J. K.; Travers, M.; Casey, J.; Slupesky, M.; Shrager, J. (2009). "BioBIKE: A Web-based, programmable, integrated biological knowledge base". Nucleic Acids Research 37 (Web Server issue): W28–W32. doi:10.1093/nar/gkp354. PMC 2703918. PMID 19433511. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=2703918.
- Kamentsky, L.; Jones, T. R.; Fraser, A.; Bray, M. -A.; Logan, D. J.; Madden, K. L.; Ljosa, V.; Rueden, C. et al. (2011). "Improved structure, function and compatibility for CellProfiler: Modular high-throughput image analysis software". Bioinformatics 27 (8): 1179–1180. doi:10.1093/bioinformatics/btr095. PMC 3072555. PMID 21349861. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=3072555.
- Orvis, J.; Crabtree, J.; Galens, K.; Gussman, A.; Inman, J. M.; Lee, E.; Nampally, S.; Riley, D. et al. (2010). "Ergatis: A web interface and scalable software system for bioinformatics workflows". Bioinformatics 26 (12): 1488–1492. doi:10.1093/bioinformatics/btq167. PMC 2881353. PMID 20413634. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=2881353.
- Goecks, J.; Nekrutenko, A.; Taylor, J.; Galaxy Team, T. (2010). "Galaxy: A comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences". Genome Biology 11 (8): R86. doi:10.1186/gb-2010-11-8-r86. PMC 2945788. PMID 20738864. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=2945788.
- http://usegalaxy.org/
- http://getgalaxy.org/
- Hull, Duncan; Wolstencroft, Katy; Stevens, Robert; Goble, Carole A.; Pocock, Matthew R.; Li, Peter; Oinn, Tom (2006). "Taverna: A tool for building and running workflows of services". Nucleic Acids Research 34 (Web Server issue): W729–W732. doi:10.1093/nar/gkl320. PMC 1538887. PMID 16845108. http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=1538887.
External links
- Oinn, T.; Greenwood, M.; Addis, M.; Alpdemir, M. N.; Ferris, J.; Glover, K.; Goble, C.; Goderis, A. et al. (2006). "Taverna: Lessons in creating a workflow environment for the life sciences". Concurrency and Computation: Practice and Experience 18 (10): 1067–1100. doi:10.1002/cpe.993. This paper reviews some of the above workflow systems.
- Yu, J.; Buyya, R. (2005). "A taxonomy of scientific workflow systems for grid computing". ACM SIGMOD Record 34 (3): 44. doi:10.1145/1084805.1084814. From the ACM SIGMOD Record.
- Portal of a joint European Grid and web-services project called EMBRACE. Provides extensive information and many worked-out bioinformatics examples and web services.
- Galaxy
- GenePattern Website and Reich, M.; Liefeld, T.; Gould, J.; Lerner, J.; Tamayo, P.; Mesirov, J. P. (2006). "GenePattern 2.0". Nature Genetics 38 (5): 500–501. doi:10.1038/ng0506-500. PMID 16642009. (Nature Genetics)
- Curcin, V.; Ghanem, M. (2008). Scientific workflow systems - can one size fit all?. pp. 1–9. doi:10.1109/CIBEC.2008.4786077. Paper in CIBEC'08 comparing multiple workflow systems for bioinformatics applications.
- Workflow technology based Solutions for Bioinformatics
- Mobyle
- Remora