Dataflow
From Wikipedia, the free encyclopedia
Dataflow is a term used in computing with several related meanings. It is closely related to message passing.
Software architecture
Dataflow is a software architecture based on the idea that changing the value of a variable should automatically force recalculation of the values of other variables.
Dataflow programming languages embody these principles, with spreadsheets perhaps the most widespread embodiment of dataflow. For example, in a spreadsheet you can specify a cell formula that depends on other cells; when any of those cells is updated, the first cell's value is automatically recalculated. One change can initiate a whole sequence of changes, if one cell depends on another cell, which depends on yet another cell, and so on.
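As an illustrative sketch of this mechanism, the following Python fragment models spreadsheet-style cells. The Cell class and its method names are invented for this example and do not come from any particular library:

    class Cell:
        """A spreadsheet-like cell holding either a plain value or a formula.

        When a cell changes, every cell whose formula reads it is recomputed
        automatically, and the change cascades through the dependency graph.
        (A real implementation would topologically sort the graph to avoid
        recomputing shared dependents twice.)
        """

        def __init__(self, value=None, formula=None):
            self.value = value
            self.formula = formula        # callable producing this cell's value
            self.dependents = set()       # cells to recompute after this one

        def depends_on(self, *cells):
            for c in cells:
                c.dependents.add(self)
            self.recompute()
            return self

        def set(self, value):
            self.value = value
            for d in self.dependents:
                d.recompute()

        def recompute(self):
            self.value = self.formula()
            for d in self.dependents:
                d.recompute()

    a = Cell(2)
    b = Cell(3)
    total = Cell(formula=lambda: a.value + b.value).depends_on(a, b)
    print(total.value)   # 5
    a.set(10)            # changing a forces total to be recalculated
    print(total.value)   # 13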
The dataflow technique is not restricted to recalculating numeric values, as done in spreadsheets. For example, dataflow can be used to redraw a picture in response to mouse movements, or to make a robot turn in response to a change in light level.
One benefit of dataflow is that it can reduce the amount of coupling-related code in a program. For example, without dataflow, if a variable X depends on a variable Y, then whenever Y is changed X must be explicitly recalculated. This means that Y is coupled to X. Since X is also coupled to Y (because X's value depends on Y's value), the program ends up with a cyclic dependency between the two variables. Most good programmers will get rid of this cycle by using an observer pattern, but only at the cost of introducing a non-trivial amount of code. Dataflow improves this situation by making the recalculation of X automatic, thereby eliminating the coupling from Y to X. Dataflow makes implicit a significant amount of code that otherwise would have had to be tediously explicit.
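For contrast, a hand-rolled version of the same dependency without dataflow support might look like the sketch below (names invented for illustration); every piece of code that assigns Y must remember to recalculate X, which is precisely the coupling described above:

    # Without dataflow: whoever updates y must also know about x.
    y = 3
    x = y * 2          # x depends on y

    def set_y(new_value):
        global y, x
        y = new_value
        x = y * 2      # explicit recalculation; forgetting this line is a bug

    set_y(5)
    print(x)           # 10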
Dataflow is also sometimes referred to as reactive programming.
There have been a few programming languages created specifically to support dataflow. In particular, many (if not most) visual programming languages have been based on the idea of dataflow. A good example of a Java-based framework is Pervasive DataRush.
Diagrams
The term dataflow may also be used to refer to the flow of data within a system, and is the name normally given to the arrows in a data flow diagram that represent the flow of data between external entities, processes, and data stores.
Concurrency
A dataflow network is a network of concurrently executing processes or automata that can communicate by sending data over channels (see message passing).
Kahn process networks, named after Gilles Kahn, one of the pioneers of dataflow networks, are a particularly important class of such networks. In a Kahn process network the processes are determinate. This implies that they satisfy the so-called Kahn's principle, which roughly states that each determinate process computes a continuous function from input streams to output streams, and that a network of determinate processes is itself determinate, thus computing a continuous function. Consequently, the behaviour of such networks can be described by a set of recursive equations, which can be solved using fixpoint theory.
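As a rough sketch (not a full implementation of the formal model), a Kahn-style network can be written in Python using threads and blocking FIFO channels. Because each process only performs blocking reads and never tests a channel for emptiness, the output depends only on the input data, not on thread scheduling. All names below are invented for the example:

    import threading
    from queue import Queue

    def producer(out, n):
        for i in range(n):
            out.put(i)
        out.put(None)                    # end-of-stream marker

    def double(inp, out):
        # A determinate process: blocking reads, deterministic computation.
        while (x := inp.get()) is not None:
            out.put(2 * x)
        out.put(None)

    def consumer(inp, result):
        while (x := inp.get()) is not None:
            result.append(x)

    c1, c2 = Queue(), Queue()            # unbounded FIFO channels
    result = []
    threads = [
        threading.Thread(target=producer, args=(c1, 5)),
        threading.Thread(target=double, args=(c1, c2)),
        threading.Thread(target=consumer, args=(c2, result)),
    ]
    for t in threads: t.start()
    for t in threads: t.join()
    print(result)                        # [0, 2, 4, 6, 8] under any scheduling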
The concept of dataflow networks is closely related to another model of concurrency known as the Actor model.
Hardware architecture
Hardware architectures for dataflow were a major topic in computer architecture research in the 1970s and early 1980s. Jack Dennis of MIT pioneered the field of static dataflow architectures, while the Manchester Dataflow Machine and MIT Tagged Token architecture were major projects in dynamic dataflow.
A compiler analyzes a computer program for the data dependencies between operations in order to better optimize the instruction sequences. Normally, the compiled output reflects the results of these optimizations, but the dependency information itself is not recorded in the compiled binary code.
A compiled program for a dataflow machine would keep this dependency information. Instead of using variable names, a dataflow compiler records these dependencies by creating a unique tag for each dependency. Giving each dependency a unique tag exposes every opportunity for parallel execution of non-dependent instructions. Each instruction, along with its tagged operands, would be stored in the compiled binary code.
The compiled program would be loaded into a content-addressable memory (CAM) of the dataflow computer. When all of the tagged operands of an instruction became available (that is, had been previously calculated), the instruction was marked as available for execution by an execution unit. This was known as activating or firing the instruction.
Once the instruction was completed by an execution unit, its output data would be broadcast (with its tag) to the CAM. Any other instructions that were dependent on this particular datum (identified by its tag value) would be updated. In this way, subsequent instructions would be activated.
Instructions would be activated in data order, that is, when all of the required data operands were available. This order can differ from the sequential order envisioned by the human programmer, the programmed order.
The instructions, along with their required data operands, would be transported as packets to the execution units. These packets are often known as instruction tokens. Similarly, data results are transported back to the CAM as data tokens. The packetization of instructions and results allowed for parallel execution of activated instructions on a large scale. Connection networks would deliver the activated instruction tokens to the execution units and return data tokens to the CAM. In contrast to the conventional von Neumann architecture, data tokens are not permanently stored in memory; rather, they are transient messages that exist only while in transit to the instruction storage.
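The firing mechanism described above can be illustrated with a toy software interpreter (structure and names invented for this sketch; real machines did this in hardware). An instruction fires as soon as all of its tagged operands have arrived, regardless of the order in which instructions appear in the program:

    import operator

    # (result_tag, operation, operand_tags); program order is irrelevant
    program = [
        ("t3", operator.add, ("t1", "t2")),   # t3 = t1 + t2
        ("t4", operator.mul, ("t3", "t1")),   # t4 = t3 * t1
        ("t1", lambda: 2, ()),                # constants need no operands
        ("t2", lambda: 5, ()),
    ]

    tokens = {}                               # tag -> value; stands in for the CAM
    waiting = list(program)

    while waiting:
        # fire every instruction whose tagged operands are all available
        ready = [ins for ins in waiting if all(t in tokens for t in ins[2])]
        if not ready:
            raise RuntimeError("deadlock: an operand is never produced")
        for tag, op, operands in ready:
            tokens[tag] = op(*(tokens[t] for t in operands))  # broadcast result
            waiting.remove((tag, op, operands))

    print(tokens["t4"])                       # (2 + 5) * 2 = 14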
Earlier designs that used only instruction addresses as data-dependency tags are called static dataflow machines. These machines could not issue instructions from multiple iterations of a loop (or multiple calls to the same routine) simultaneously, because the simple tags could not differentiate between the iterations (or invocations). Later designs, called dynamic dataflow machines, used more complex tags to extract greater parallelism from these cases.
The research, however, never overcame the problems related to:
- efficiently broadcasting data tokens in a massively parallel system
- efficiently dispatching instruction tokens in a massively parallel system
- building CAMs large enough to hold all of the dependencies of a real program
Instructions and their data dependencies proved to be too fine-grained to be effectively distributed in a large network. That is, the time for the instructions and tagged results to travel through a large connection network was longer than the time to actually do the computations.
Out-of-order execution, the conceptual descendant of dataflow computation, has become the dominant computing paradigm since the 1990s. It is a form of restricted dataflow. This paradigm introduced the idea of an execution window, which follows the sequential order of the program; within the window, however, instructions are allowed to complete in data-dependency order. This is accomplished by hardware that dynamically tags the data dependencies within the window. The logical complexity of dynamically tracking these dependencies restricts out-of-order (OoO) CPUs to a small number of execution units (2-6) and execution window sizes in the range of 32 to 200 instructions, much smaller than envisioned for full dataflow machines.
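The restriction can be sketched as a toy model (not how any particular CPU is implemented): instructions enter a fixed-size window in program order, but within the window any instruction whose operands are ready may complete:

    # (destination, source operands, description)
    program = [
        ("a", [],         "load a"),
        ("b", ["a"],      "b = a + 1"),
        ("c", [],         "load c"),      # independent of a and b
        ("d", ["b", "c"], "d = b + c"),
    ]

    WINDOW = 3
    done = set()
    window, rest = [], list(program)

    while window or rest:
        while rest and len(window) < WINDOW:   # fill the window in program order
            window.append(rest.pop(0))
        fired = [ins for ins in window if all(s in done for s in ins[1])]
        for dest, _, text in fired:
            print("complete:", text)           # "load c" completes before "b = a + 1"
            done.add(dest)
        window = [ins for ins in window if ins not in fired]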
See also
- Data flow diagram
- Dataflow language
- Lazy evaluation
- Complex event processing
- Pure Data
- Flow-based programming (FBP)
External links
- BMDFM: Binary Modular Dataflow Machine.
- Cantata: dataflow visual language for image processing.
- Cells: dataflow extension to the Common Lisp Object System (CLOS).
- PyCells: a Python port of Cells.
- DataRush: dataflow framework for Java.
- Stella: dataflow visual language for dynamic dataflow modeling and simulation.
- Adam and Eve: extension for C++, by Adobe.