Kepler scientific workflow system

From Wikipedia, the free encyclopedia

Kepler is a free-software system for designing, executing, and sharing scientific workflows[1][2][3]. Workflows in general, and scientific workflows in particular, are directed graphs in which the nodes represent discrete computational components and the edges represent paths along which data and results flow between components[4]. In Kepler, the nodes are called 'actors' and the edges are called 'channels'. Kepler includes a graphical user interface (GUI) for composing workflows in a desktop environment, a runtime engine for executing workflows either within the GUI or independently from the command line, and a distributed computing option that allows workflow tasks to be distributed among compute nodes in a computer cluster or computing grid. Kepler uses the workflow metaphor to organize computational tasks directed towards particular scientific analysis and modeling goals. Thus, Kepler scientific workflows generally model the flow of data from one step to another in a series of computations that achieve some scientific goal.
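
The graph structure described above can be illustrated with a minimal Python sketch. It is not Kepler code: the `Workflow` class, its methods, and the three actor names are hypothetical, and each actor is assumed to fire exactly once, in topological order.

```python
from collections import defaultdict, deque


class Workflow:
    """A toy directed graph: nodes are actors, edges are channels."""

    def __init__(self):
        self.actors = {}                    # actor name -> callable
        self.channels = defaultdict(list)   # source name -> destination names

    def add_actor(self, name, func):
        self.actors[name] = func

    def connect(self, source, destination):
        self.channels[source].append(destination)

    def run(self):
        """Fire each actor once, in topological order, passing results downstream."""
        incoming = {name: [] for name in self.actors}
        pending = {name: 0 for name in self.actors}
        for source, destinations in self.channels.items():
            for destination in destinations:
                pending[destination] += 1

        # Actors with no incoming channels are ready to fire immediately.
        ready = deque(name for name, count in pending.items() if count == 0)
        results = {}
        while ready:
            name = ready.popleft()
            results[name] = self.actors[name](*incoming[name])
            for destination in self.channels[name]:
                incoming[destination].append(results[name])
                pending[destination] -= 1
                if pending[destination] == 0:
                    ready.append(destination)
        return results


workflow = Workflow()
workflow.add_actor("source", lambda: [1.0, 2.0, 3.0, 4.0])
workflow.add_actor("mean", lambda values: sum(values) / len(values))
workflow.add_actor("report", lambda mean: f"mean = {mean}")
workflow.connect("source", "mean")
workflow.connect("mean", "report")

results = workflow.run()
print(results["report"])
```

Here data flows along two channels, from "source" to "mean" and from "mean" to "report", which is the sense in which a workflow models the movement of data from one computational step to the next.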

Access to scientific data

Kepler provides direct access to scientific data that has been archived in many of the commonly used data archives. For example, Kepler provides access to data stored in the Knowledge Network for Biocomplexity (KNB) Metacat server[5] and described using Ecological Metadata Language. Additional data sources that are supported include data accessible using the DiGIR protocol, the OPeNDAP protocol, GridFTP, JDBC, SRB, and others.

Models of Computation

Kepler differs from many of the other bioinformatics workflow management systems in that it separates the structure of the workflow model from its model of computation, such that different models for the computation of the workflow can be bound to a given workflow graph. Kepler inherits several common models of computation from the Ptolemy system, including Synchronous Data Flow (SDF), Continuous Time (CT), Process Network (PN), and Dynamic Data Flow (DDF), among others.
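
The idea of binding different execution models to one workflow structure can be sketched as follows. This is a simplified illustration and not how the Ptolemy directors are implemented: `PIPELINE`, `run_scheduled`, and `run_as_processes` are hypothetical names, the workflow is assumed to be a simple linear chain, and the two functions only loosely mimic a statically scheduled (SDF-like) and a concurrent, queue-based (PN-like) model.

```python
import queue
import threading

# Workflow structure only: an ordered chain of (name, function) stages,
# each mapping one input token to one output token.
PIPELINE = [
    ("double", lambda x: 2 * x),
    ("increment", lambda x: x + 1),
]
TOKENS = [1, 2, 3]


def run_scheduled(pipeline, tokens):
    """SDF-like: stages fire in a fixed, precomputed order in one thread."""
    outputs = []
    for token in tokens:
        for _name, func in pipeline:
            token = func(token)
        outputs.append(token)
    return outputs


def run_as_processes(pipeline, tokens):
    """PN-like: each stage runs in its own thread and blocks on its input queue."""
    done = object()  # sentinel marking the end of the token stream
    queues = [queue.Queue() for _ in range(len(pipeline) + 1)]

    def stage(func, inbox, outbox):
        while True:
            token = inbox.get()
            if token is done:
                outbox.put(done)
                return
            outbox.put(func(token))

    threads = [
        threading.Thread(target=stage, args=(func, queues[i], queues[i + 1]))
        for i, (_name, func) in enumerate(pipeline)
    ]
    for thread in threads:
        thread.start()
    for token in tokens:
        queues[0].put(token)
    queues[0].put(done)

    outputs = []
    while True:
        token = queues[-1].get()
        if token is done:
            break
        outputs.append(token)
    for thread in threads:
        thread.join()
    return outputs


scheduled = run_scheduled(PIPELINE, TOKENS)
concurrent = run_as_processes(PIPELINE, TOKENS)
print(scheduled, concurrent)
```

The workflow definition (`PIPELINE`) is never modified; only the function that executes it changes, which is the separation the paragraph above describes.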

Hierarchical workflows

Kepler supports hierarchy in workflows, which allows complex tasks to be composed of simpler components. This feature allows workflow authors to build re-usable, modular components that can be saved for use across many different workflows.
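
A rough sketch of this kind of hierarchy, assuming a simplified model in which every component takes one input and returns one output. The `Actor` and `CompositeActor` classes below are hypothetical stand-ins written for this example, not Kepler classes.

```python
class Actor:
    """An atomic component wrapping a single function."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def fire(self, value):
        return self.func(value)


class CompositeActor:
    """A component built from other components, fired in sequence."""

    def __init__(self, name, children):
        self.name = name
        self.children = list(children)

    def fire(self, value):
        for child in self.children:
            value = child.fire(value)
        return value


# A reusable sub-workflow: center a list of numbers, then square each one.
center = Actor("center", lambda xs: [x - sum(xs) / len(xs) for x in xs])
square = Actor("square", lambda xs: [x * x for x in xs])
squared_deviations = CompositeActor("squared_deviations", [center, square])

# The composite is used exactly like an atomic actor inside a larger workflow.
variance = CompositeActor(
    "variance",
    [squared_deviations, Actor("mean", lambda xs: sum(xs) / len(xs))],
)

print(variance.fire([2.0, 4.0, 6.0]))
```

Because a composite exposes the same `fire` interface as an atomic actor, `squared_deviations` could be reused unchanged in any other workflow that needs it.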

Workflow semantics

Kepler provides a model for the semantic annotation of workflow components using terms drawn from an ontology. These annotations support many advanced features, including improved search capabilities, automated workflow validation, and improved workflow editing.[6]

Sharing workflows

Kepler components can be shared by exporting the workflow or component into a Kepler Archive (KAR) file, which is an extension of the JAR file format from Java. Once a KAR file is created, it can be emailed to colleagues, shared on web sites, or uploaded to the Kepler Component Repository. The Component Repository is a centralized system for sharing Kepler workflows that is accessible via both a web portal and a web service interface. Users can search for and use components from the repository directly within the Kepler workflow composition GUI.
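
Since JAR files are ZIP archives that carry a manifest at META-INF/MANIFEST.MF, an archive in this family can be inspected with generic ZIP tooling. The sketch below assumes a KAR file remains a valid ZIP archive. It builds a stand-in archive in memory rather than reading a real KAR, and the workflow entry name and contents are hypothetical.

```python
import io
import zipfile

# Build a stand-in archive in memory. The entry names and contents are
# hypothetical; a real KAR file's layout is defined by Kepler, not by this sketch.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
    archive.writestr("ANOVAtemplate.xml", "<entity name=\"ANOVAtemplate\"/>\n")


def list_archive_entries(file_obj):
    """Return the entry names of a JAR-style (ZIP) archive."""
    with zipfile.ZipFile(file_obj) as archive:
        return archive.namelist()


buffer.seek(0)
entries = list_archive_entries(buffer)
print(entries)
```

To inspect a file on disk, `list_archive_entries` can be given a path instead of the in-memory buffer.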

Kepler History

The Kepler Project was created in 2002 by members of the Science Environment for Ecological Knowledge (SEEK) project[3] and the Scientific Data Management (SDM) project. The project was founded by researchers at the National Center for Ecological Analysis and Synthesis (NCEAS) at the University of California, Santa Barbara and the San Diego Supercomputer Center at the University of California, San Diego. Kepler extends Ptolemy II, a software system developed at UC Berkeley for the modeling, simulation, and design of concurrent, real-time, embedded systems. Collaboration on Kepler quickly grew as members of various scientific disciplines realized the benefits of scientific workflows for analysis and modeling and began contributing to the system. As of 2008, Kepler collaborators come from many science disciplines, including ecology, molecular biology, genetics, physics, chemistry, conservation science, oceanography, hydrology, library science, computer science, and others.

Kepler FAQs

General Q's

Q: My analyses often require the same basic components. How can I create a workflow template that includes these?
A: Create a workflow that includes all the basic components and save it with an intuitive name, such as "ANOVAtemplate.xml". To begin a new workflow based on your ANOVA template, open Kepler, choose Open File from the File menu, navigate to the directory in which you saved ANOVAtemplate.xml, and select it. Then immediately choose Save As... from the File menu and save the workflow under a more specific name, such as "ANOVA_date_project.xml". This leaves ANOVAtemplate.xml unchanged and ready to serve as a template the next time you need it.

Director Q's

Q: Why doesn't my workflow ever finish executing?
A: By default the workflow director's "iterations" are set to 0, which indicates "loop indefinitely." To change this, right-click on the director, choose "Configure Director" and change "iterations" from 0 to 1 for one iteration, or to n for n iterations, then push the "Commit" button.
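
The convention that 0 means "no limit" can be illustrated with a small Python sketch. The `iteration_counter` function is a hypothetical helper written for this example, not part of Kepler.

```python
import itertools


def iteration_counter(iterations):
    """Return an iterator of iteration indices; 0 means 'no limit'."""
    if iterations < 0:
        raise ValueError("iterations must be 0 (unbounded) or a positive integer")
    return itertools.count() if iterations == 0 else iter(range(iterations))


# One iteration: the body runs once and the loop ends.
assert list(iteration_counter(1)) == [0]

# Unbounded: the loop only ends if something else stops it, so cap it here.
first_five = list(itertools.islice(iteration_counter(0), 5))
print(first_five)
```

A workflow left at 0 behaves like the unbounded case: it keeps iterating until it is stopped from outside.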

Q: Why do I get the "SDF scheduler found disconnected actors!" error message?
A: The SDF Director does not expect unconnected workflow components. During workflow development, however, it can be convenient to disconnect one actor and connect another. To make the SDF Director allow this, right-click the director, choose "Configure Director" and check the box beside "allowDisconnectedGraphs", then push the "Commit" button.
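
What "disconnected" means for a workflow graph can be shown with a short Python sketch that groups actors into weakly connected components. This illustrates the concept only and is not the SDF scheduler's actual check. The function name and the example actor and channel lists are hypothetical.

```python
def connected_components(actors, channels):
    """Group actors into weakly connected components (edge direction is ignored)."""
    neighbors = {actor: set() for actor in actors}
    for source, destination in channels:
        neighbors[source].add(destination)
        neighbors[destination].add(source)

    components, seen = [], set()
    for actor in actors:
        if actor in seen:
            continue
        # Depth-first search collecting everything reachable from this actor.
        component, stack = set(), [actor]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(neighbors[current] - component)
        seen |= component
        components.append(component)
    return components


actors = ["Constant", "RExpression", "Display", "SpareDisplay"]
channels = [("Constant", "RExpression"), ("RExpression", "Display")]

components = connected_components(actors, channels)
is_disconnected = len(components) > 1
print(components, is_disconnected)
```

In this example "SpareDisplay" has no channels, so the graph splits into two components, the situation the error message reports.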

RExpression Actor Q's

Q: How do I keep the R coding window of the RExpression actor open while running my workflow?
A: Right-click on the RExpression actor on your workflow and choose "Open Actor" (Ctrl-L) from the menu. When you are finished making changes to your R-script, choose Save (Ctrl-S) from the File menu. Then, push the "Run or Resume" button on the workflow toolbar (Ctrl-R) to run the workflow and see the results of your changes.


Graphing Q's

Q: Must I connect a graphing actor to my RExpression actor in order to see graphical output?
A: No. Right-click on the RExpression actor, choose "Configure Actor" and check the box beside "Automatically display graphics." Kepler will save the graphic as a PDF file in a temporary directory and open your default PDF viewer to display it.

Q: Why are some of my x-axis labels missing?
A: The RExpression actor generates *.png and *.pdf files, with a default width and height of 480x480 pixels. If some of your x-axis labels are long, they may be excluded from the plot. There are several ways to fix this. First, try changing to the other graphics file format: right-click the RExpression actor, choose Configure Actor, click the drop-down box beside Graphics Format, and select the format not currently selected. Re-run your workflow. If that doesn't fix the problem, try changing the dimensions of the graphics file. To do so, right-click the RExpression actor, choose Configure Actor, and change the Number of X pixels in image (or Number of Y pixels in image) to a new value. The default generates a square image. Some other common height:width relationships are y/x=2/3, y/x=1/sqrt(3), and y/x=2/(1+sqrt(5)), the last of which corresponds to the Golden Ratio (x/y=(1+sqrt(5))/2). Of course, there are aesthetic limits to stretching axes, so if none of these remedies works, you can always try abbreviating your x-axis category labels.
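
As a worked example of the ratios above, the following Python sketch computes the image height in pixels for the default 480-pixel width. The names `RATIOS` and `y_pixels` are hypothetical, and rounding to whole pixels is an assumption of this sketch.

```python
import math

DEFAULT_X_PIXELS = 480  # default width quoted in the answer above

# Height-to-width ratios (y/x) listed in the answer above.
RATIOS = {
    "square": 1.0,
    "2/3": 2 / 3,
    "1/sqrt(3)": 1 / math.sqrt(3),
    "2/(1+sqrt(5))": 2 / (1 + math.sqrt(5)),
}


def y_pixels(x_pixels, ratio):
    """Return the image height, rounded to whole pixels, for a given y/x ratio."""
    return round(x_pixels * ratio)


heights = {name: y_pixels(DEFAULT_X_PIXELS, ratio) for name, ratio in RATIOS.items()}
print(heights)
```

With a 480-pixel width, the three non-square ratios give heights of 320, 277, and 297 pixels respectively.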

References

  1. Ludäscher B., Altintas I., Berkley C., Higgins D., Jaeger-Frank E., Jones M., Lee E., Tao J., Zhao Y. 2006. Scientific Workflow Management and the Kepler System. Special Issue: Workflow in Grid Systems. Concurrency and Computation: Practice & Experience 18(10): 1039-1065.
  2. Altintas I, Berkley C, Jaeger E, Jones M, Ludäscher B, Mock S. 2004. Kepler: An Extensible System for Design and Execution of Scientific Workflows. Proceedings of the Future of Grid Data Environments, Global Grid Forum 10.
  3. Michener, William K., James H. Beach, Matthew B. Jones, Bertram Ludaescher, Deana D. Pennington, Ricardo S. Pereira, Arcot Rajasekar, and Mark Schildhauer. 2007. "A Knowledge Environment for the Biodiversity and Ecological Sciences", Journal of Intelligent Information Systems, 29(1): 111-126. doi: 10.1007/s10844-006-0034-8
  4. Taylor, I.J.; Deelman, E.; Gannon, D.B.; Shields, M. (Eds.), "Workflows for e-Science: Scientific Workflows for Grids", 530 p., Springer. ISBN: 978-1-84628-519-6.
  5. Jones, Matthew B., C. Berkley, J. Bojilova, M. Schildhauer. 2001. Managing Scientific Metadata. IEEE Internet Computing 5 (5): 59-68.
  6. Berkley, Chad, Shawn Bowers, Matthew B. Jones, Bertram Ludaescher, Mark Schildhauer, Jing Tao. 2005. Incorporating Semantics in Scientific Workflow Authoring. 17th International Conference on Scientific and Statistical Database Management. IEEE Computer Society.

External Links

Kepler Project website: [[1]]
Kepler Component Repository: [[2]]
Ptolemy II project website: [[3]]
Knowledge Network for Biocomplexity (KNB) Data archive: [[4]]
The Golden Ratio on Wikipedia: [[5]]

See also