Kepler scientific workflow system

Kepler Scientific Workflow System
Stable release	2.4 / 2013-04-05
Written in	Java
Operating system	Linux, Mac OS X, Windows
Type	Workflow Management System
License	BSD License
Website	kepler-project.org

Kepler is a free software system for designing, executing, reusing, evolving, archiving, and sharing scientific workflows.^[1]^[2]^[3] Kepler's facilities provide process and data monitoring, provenance information, and high-speed data movement. Workflows in general, and scientific workflows in particular, are directed graphs where the nodes represent discrete computational components, and the edges represent paths along which data and results can flow between components.^[4] In Kepler, the nodes are called 'Actors' and the edges are called 'channels'. Kepler includes a graphical user interface for composing workflows in a desktop environment, a runtime engine for executing workflows within the GUI and independently from a command-line, and a distributed computing option that allows workflow tasks to be distributed among compute nodes in a computer cluster or computing grid. The Kepler system principally targets the use of a workflow metaphor for organizing computational tasks that are directed towards particular scientific analysis and modeling goals. Thus, Kepler scientific workflows generally model the flow of data from one step to another in a series of computations that achieve some scientific goal.

Scientific workflow

A scientific workflow is the process of combining data and processes into a configurable, structured set of steps that implement semi-automated computational solutions to a scientific problem. Scientific workflow systems often provide graphical user interfaces to combine different technologies along with efficient methods for using them, and thus increase the efficiency of the scientists.

Access to scientific data

Kepler provides direct access to scientific data that has been archived in many of the commonly used data archives. For example, Kepler provides access to data stored in the Knowledge Network for Biocomplexity (KNB) Metacat server^[5] and described using Ecological Metadata Language. Additional data sources that are supported include data accessible using the DiGIR protocol, the OPeNDAP protocol, GridFTP, JDBC, SRB, and others.

Models of Computation

Kepler differs from many of the other bioinformatics workflow management systems in that it separates the structure of the workflow model from its model of computation, such that different models for the computation of the workflow can be bound to a given workflow graph. Kepler inherits several common models of computation from the Ptolemy system, including Synchronous Data Flow (SDF), Continuous Time (CT), Process Network (PN), and Dynamic Data Flow (DDF), among others.

Hierarchical workflows

Kepler supports hierarchy in workflows, which allows complex tasks to be composed of simpler components. This feature allows workflow authors to build re-usable, modular components that can be saved for use across many different workflows.

Workflow semantics

Kepler provides a model for the semantic annotation of workflow components using terms drawn from an ontology. These annotations support many advanced features, including improved search capabilities, automated workflow validation, and improved workflow editing.^[6]

Sharing workflows

Kepler components can be shared by exporting the workflow or component into a Kepler Archive (KAR) file, which is an extension of the JAR file format from Java. Once a KAR file is created, it can be emailed to colleagues, shared on web sites, or uploaded to the Kepler Component Repository. The Component Repository is centralized system for sharing Kepler workflows that is accessible via both a web portal and a web service interface. Users can directly search for and utilize components from the repository from within the Kepler workflow composition GUI.

Provenance

Provenance is a critical concept in scientific workflows, since it allows scientists to understand the origin of their results, to repeat their experiments, and to validate the processes that were used to derive data products.^[7] In order for a workflow to be reproduced, provenance information must be recorded that indicates where the data originated, how it was altered, and which components and what parameter settings were used. This will allow other scientists to re-conduct the experiment, confirming the results.^[8] Little support exists in current systems to allow end-users to query provenance information in scientifically meaningful ways, in particular when advanced workflow execution models go beyond simple DAGs (as in process networks).^[9]

Kepler history

The Kepler Project was created in 2002 by members of the Science Environment for Ecological Knowledge (SEEK) project ^[3] and the Scientific Data Management (SDM) project. The project was founded by researchers at the National Center for Ecological Analysis and Synthesis (NCEAS) at the University of California, Santa Barbara and the San Diego Supercomputer Center at the University of California, San Diego. Kepler extends Ptolemy II, which is a software system for modeling, simulation, and design of concurrent, real-time, embedded systems developed at UC Berkeley. Collaboration on Kepler quickly grew as members of various scientific disciplines realized the benefits of scientific workflows for analysis and modeling and began contributing to the system. As of 2008, Kepler collaborators come from many science disciplines, including ecology, molecular biology, genetics, physics, chemistry, conservation science, oceanography, hydrology, library science, computer science, and others.

References

↑ Ludäscher B., Altintas I., Berkley C., Higgins D., Jaeger-Frank E., Jones M., Lee E., Tao J., Zhao Y. 2006. Scientific Workflow Management and the Kepler System. Special Issue: Workflow in Grid Systems. Concurrency and Computation: Practice & Experience 18(10): 1039-1065.
↑ Altintas I, Berkley C, Jaeger E, Jones M, Ludäscher B, Mock S. 2004. Kepler: An Extensible System for Design and Execution of Scientific Workflows. Proceedings of the The Future of Grid Data Environments, Global Grid Forum 10.
↑ 3.0 3.1 Michener, William K., James H. Beach, Matthew B. Jones, Bertram Ludaescher, Deana D. Pennington, Ricardo S. Pereira, Arcot Rajasekar, and Mark Schildhauer. 2007. "A Knowledge Environment for the Biodiversity and Ecological Sciences", Journal of Intelligent Information Systems, 29(1): 111-126. doi:10.1007/s10844-006-0034-8
↑ Taylor, I.J.; Deelman, E.; Gannon, D.B.; Shields, M. (Eds.), “Workflows for e-Science: Scientific Workflows for Grids”, 530 p., Springer. ISBN 978-1-84628-519-6.
↑ Jones, Matthew B., C. Berkley, J. Bojilova, M. Schildhauer. 2001. Managing Scientific Metadata. IEEE Internet Computing 5 (5): 59-68.
↑ Berkley, Chad, Shawn Bowers, Matthew B. Jones, Bertram Ludaescher, Mark Schildhauer, Jing Tao. 2005. Incorporating Semantics in Scientific Workflow Authoring. 17th International Conference on Scientific and Statistical Database Management. IEEE Computer Society.
↑ http://twiki.ipaw.info/bin/view/Challenge/WebHome
↑ http://www.adambarker.org/papers/ppam08.pdf
↑ Shawn Bowers, Timothy McPhillips, Bertram Ludascher, Shirley Cohen, Susan B. Davidson 2006. A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows.

External links

Kepler Project website
Kepler Component Repository
Ptolemy II project website
Knowledge Network for Biocomplexity (KNB) Data archive
List of software tools related to workflows on the DataONE website