Transport triggered architecture
From Wikipedia, the free encyclopedia
Transport triggered architecture (TTA) is an application-specific instruction set processor (ASIP) architecture template that allows easy customization of microprocessor designs.
Structure
TTA processors are built of independent function units and register files, which are connected with transport buses and sockets.
Function unit
Each function unit implements one or more operations, whose functionality ranges from simple integer addition to complex, arbitrary user-defined computation. Operands for operations are transferred through function unit ports.
Each function unit may have an independent pipeline. If a function unit is fully pipelined, a new operation that takes multiple clock cycles to finish can be started in every clock cycle. Alternatively, a pipeline may be such that it does not always accept new operation start requests while an old one is still executing.
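The difference a full pipeline makes can be illustrated with a small simulation. The following Python sketch is a toy model (not part of any TTA toolset; the 3-cycle multiplier and the class name are invented for illustration): a fully pipelined unit accepts one new operation per cycle, and each result emerges a fixed number of cycles later.

```python
from collections import deque

class PipelinedFU:
    """Toy model of a fully pipelined function unit (a hypothetical
    3-cycle multiplier): a new operation can start every clock cycle,
    and each result appears LATENCY cycles after it was started."""
    LATENCY = 3

    def __init__(self):
        # One slot per pipeline stage; None means the stage is empty.
        self.stages = deque([None] * self.LATENCY)

    def clock(self, new_operands=None):
        """Advance one clock cycle, optionally starting a new operation.
        Returns the result leaving the pipeline, or None."""
        finished = self.stages.pop()          # value exiting the last stage
        result = finished[0] * finished[1] if finished else None
        self.stages.appendleft(new_operands)  # new operation enters stage 1
        return result

fu = PipelinedFU()
results = []
for a, b in [(2, 3), (4, 5), (6, 7)]:   # start one operation per cycle
    results.append(fu.clock((a, b)))
for _ in range(PipelinedFU.LATENCY):     # drain the pipeline
    results.append(fu.clock())
print(results)  # [None, None, None, 6, 20, 42]
```

A non-pipelined unit, by contrast, would have to reject (or the program would have to avoid issuing) a new start request until the previous operation had left the unit.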
Data memory access and communication outside the processor are handled by special function units. Function units that implement memory access operations and connect to a memory module are often called load/store units.
Control unit
The control unit is a special function unit that controls the execution of programs. It has access to the instruction memory in order to fetch the instructions to be executed. To allow an executed program to transfer execution (jump) to an arbitrary position in the program, the control unit provides control flow operations. A control unit usually has an instruction pipeline, which consists of stages for fetching, decoding, and executing program instructions.
Register files
Register files contain general-purpose registers, which are used to store variables in programs. Like function units, register files have input and output ports. The number of read and write ports, that is, the number of registers that can be read and written in the same clock cycle, can vary in each register file.
Transport buses and sockets
The interconnect architecture consists of transport buses which are connected to function unit ports by means of sockets. Because connectivity is expensive, it is usual to reduce the number of connections between units (function units and register files). A TTA is said to be fully connected if there is a path from each unit output port to every unit's input ports.
Sockets provide the means for programming TTA processors by allowing the selection of which bus-to-port connections of the socket are enabled at any given time. Thus, the data transports taking place in a clock cycle can be programmed by defining, for each bus, the source and destination socket/port connection to be enabled.
Conditional execution is implemented with the aid of guards. Each data transport can be made conditional by a guard, which is connected to a register (often a 1-bit conditional register) and to a bus. If the value of the guarded register evaluates to false (zero), the data transport programmed for the bus the guard is connected to is squashed, that is, not written to its destination. Unconditional data transports are not connected to any guard and are always executed.
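As a rough illustration, the squashing behavior of a guarded move can be modeled in a few lines of Python (the function and the list-based port representation are invented for this sketch):

```python
def execute_move(bus_value, dest_port, guard_value=None):
    """Model one data transport, optionally guarded.
    guard_value=None models an unconditional move; a move whose guard
    register evaluates to false (zero) is squashed: nothing is written
    to the destination port."""
    if guard_value is None or guard_value:
        dest_port.append(bus_value)   # transport reaches its destination

dest = []                             # destination port, modeled as a list
execute_move(42, dest)                # unconditional: always written
execute_move(7, dest, guard_value=0)  # guard false: move is squashed
execute_move(9, dest, guard_value=1)  # guard true: written
print(dest)  # [42, 9]
```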
TTA customization
Since customization is one motivation for developing TTA processors, a new TTA processor can be created by defining the function units, the operations implemented in each function unit, the register files, the number of registers in each register file, the number of buses, and the connections between units.
Programming
In more traditional processor architectures, a processor is usually programmed by defining the executed operations and their operands. For example, an addition instruction in a RISC architecture could look like the following.
add r3, r1, r2
This operation adds the values of general-purpose registers r1 and r2 and stores the result in register r3. Roughly, the execution of the instruction in the processor results in translating the instruction to control signals which control the interconnection network connections and function units. The interconnection network is used to transfer the current values of registers r1 and r2 to the function unit that is capable of executing the add operation, often called the ALU, as in Arithmetic Logic Unit. Finally, a control signal selects and triggers the addition operation in the ALU, whose result is transferred back to the register r3.
TTA programs do not define the operations, but only the data transports needed to write and read the operand values. The operation itself is triggered by writing data to a triggering operand of the operation. Thus, an operation is executed as a side effect of the triggering data transport. Therefore, executing an addition operation in a TTA requires three data transport definitions, also called moves:
r1 -> add.1
r2 -> add.2
add.3 -> r3
The second move, a write to operand two, triggers the addition operation, which makes the result of the addition available to be read by the next move.
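The three moves above can be mimicked in plain Python to show the side-effect semantics of the triggering write (a toy model; the class and its port methods are made up for illustration):

```python
class TriggeredALU:
    """Toy model of transport-triggered addition: writing the
    triggering operand (port 2) executes the operation as a
    side effect of the data transport."""
    def __init__(self):
        self.op1 = 0
        self.result = 0

    def write_port1(self, value):   # r1 -> add.1 : just stores the operand
        self.op1 = value

    def write_port2(self, value):   # r2 -> add.2 : triggers the addition
        self.result = self.op1 + value

regs = {"r1": 10, "r2": 32, "r3": 0}
alu = TriggeredALU()
alu.write_port1(regs["r1"])   # r1 -> add.1
alu.write_port2(regs["r2"])   # r2 -> add.2  (triggering move)
regs["r3"] = alu.result       # add.3 -> r3
print(regs["r3"])  # 42
```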
Sequential TTA programs are generic sequences of moves between general-purpose registers and operation operands. The moves of sequential code are not scheduled for execution on any particular target architecture. For this reason, sequential programs are sometimes called unscheduled programs.
Parallel TTA programs are defined as sequences of TTA instructions. Each TTA instruction defines a set of moves. A move defines endpoints for a data transport taking place in a transport bus. For instance, a move can state that a data transport from function unit F, port 1, to register file R, port 2, should take place in bus B1. In case there are multiple buses in the target processor, each bus can be utilized in parallel in the same clock cycle. Thus, it is possible to exploit instruction level parallelism by scheduling several data transports in the same instruction.
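As a simplified illustration of packing moves into parallel instructions, the following Python sketch groups a sequential move list onto a fixed number of buses. It is a deliberately naive model: a real TTA instruction scheduler must also honor data dependences, operation latencies, and connectivity constraints, all of which are ignored here.

```python
def schedule_moves(moves, num_buses):
    """Naive scheduler sketch: pack consecutive moves into TTA
    instructions, with at most one move per transport bus per cycle.
    Only the bus-count limit is modeled; dependences are ignored."""
    instructions, current = [], []
    for move in moves:
        if len(current) == num_buses:   # all buses in this cycle are used
            instructions.append(current)
            current = []
        current.append(move)
    if current:
        instructions.append(current)
    return instructions

seq = ["r1 -> add.1", "r2 -> add.2", "add.3 -> r3"]
print(schedule_moves(seq, num_buses=2))
# [['r1 -> add.1', 'r2 -> add.2'], ['add.3 -> r3']]
```

With two buses, the operand move and the triggering move fit into one instruction, so the three sequential moves compress into two instructions.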
Parallel programs are always targeted to some TTA architecture. Consequently, they are also referred to as scheduled programs. Parallel programs are final in the sense that it is possible to generate the program bit image representing the parallel code and run it in a real processor hardware that implements the targeted architecture.
Customizable operation set
One of the customization points of a TTA is the operation set. The designer can add to the target processor a new operation which implements arbitrary functionality. This makes it possible, for example, to convert longer chains of operations into a single custom operation execution.
A short example might clarify this idea. Let us assume that an algorithm includes many subtractions and additions of the same input operands, so the sequential code would include sequences like this:
r1 -> sub.1
r2 -> sub.2
sub.3 -> r3
r1 -> add.1
r2 -> add.2
add.3 -> r4
Now, the designer of the TTA system sees that a piece of code including a sequence like this ranks high in the profiling data, that is, a major part of the execution time is spent running this code. Therefore, the designer decides to create a new custom operation, addsub, which computes both the sum and the difference of the operands it receives, placing the difference in the first output and the sum in the second output. The new custom operation can be used to convert the previous code to the following:
r1 -> addsub.1
r2 -> addsub.2
addsub.3 -> r3
addsub.4 -> r4
Getting rid of two moves might not seem like much, but it can provide bigger savings in the long run if the sequence is executed in a tight loop with only a few instructions. Furthermore, the same optimization strategy of converting sequences of operations into a single custom operation can be applied to chains of operations of virtually arbitrary length.
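Modeled as ordinary code, the hypothetical addsub operation simply produces both results from a single trigger:

```python
def addsub(a, b):
    """Hypothetical custom operation: one trigger yields both the
    difference (first output) and the sum (second output)."""
    return a - b, a + b

r1, r2 = 10, 4
r3, r4 = addsub(r1, r2)   # replaces separate sub and add triggers
print(r3, r4)  # 6 14
```

Because the two input operands are transported once instead of twice, the six moves of the original sequence shrink to four.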
Programmer-visible operation latency
The leading philosophy of TTAs is to move complexity from hardware to software. As a consequence, several additional hazards are exposed to the programmer. One of them is the programmer-visible operation latency of the function units. Timing is entirely the responsibility of the programmer: instructions must be scheduled so that a result is read neither too early nor too late. There is no hardware interlock to stall the processor in case a result is read too early. For example, suppose an architecture has an operation add with a latency of 1 and an operation sub with a latency of 3. When triggering the add operation, the result can be read in the next instruction (next clock cycle), but in the case of sub, one has to wait for two instructions before the result can be read; the result is ready for the third instruction after the triggering instruction.
Reading a result too early yields the result of a previously triggered operation, or, if no operation was triggered previously, an undefined value. On the other hand, the result must be read early enough to ensure that the result of the next operation does not overwrite the current result in the output port.
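This hazard can be demonstrated with a small Python model of an output port that updates only after the operation's latency has elapsed (the class and its interface are invented for this sketch):

```python
class LatencyFU:
    """Toy model of a function unit with programmer-visible latency:
    the output port updates LATENCY cycles after the trigger, and
    reading it earlier returns whatever stale value was there before."""
    def __init__(self, latency):
        self.latency = latency
        self.pending = []        # list of [cycles_left, value]
        self.output_port = None  # undefined until a result arrives

    def trigger(self, a, b):
        self.pending.append([self.latency, a - b])   # a sub-like operation

    def clock(self):
        for item in self.pending:
            item[0] -= 1
        while self.pending and self.pending[0][0] == 0:
            self.output_port = self.pending.pop(0)[1]

fu = LatencyFU(latency=3)    # like the sub operation in the example
fu.trigger(9, 2)
fu.clock()
too_early = fu.output_port   # read 1 cycle after trigger: stale/undefined
fu.clock(); fu.clock()
on_time = fu.output_port     # read 3 cycles after trigger: correct result
print(too_early, on_time)    # None 7
```

In a real TTA the scheduler (or the assembly programmer) must count these cycles for every operation, since no interlock hardware does it at run time.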