TRIPS architecture


TRIPS is a new microprocessor architecture being designed by a team at the University of Texas at Austin in conjunction with IBM. TRIPS uses a new instruction set architecture designed to be easily broken down into large groups of instructions (graphs) that can be run on independent processing elements. The design collects related data into the graphs, attempting to avoid expensive data reads and writes and to keep the data in high-speed memory close to the processing elements. The prototype TRIPS processor contains 16 such elements, and the designers expect this to scale rapidly to 128 in "real world" processors in the near future. Combined with a number of other architectural changes, the TRIPS design aims to reach 1 TFLOPS on a single processor by 2012.[1]


Background

The earliest computers were able to run only a single program at a time without operator intervention. Operators developed various manual workflows to keep the systems busy by queuing up jobs by hand. During the 1960s computer performance grew to the point where these manual workflows were a limiting factor in keeping the system busy. This led to the introduction of batch processing, which queued up many jobs and automatically selected between them, and of time sharing systems, which accepted input from a number of users and ran portions of their tasks in succession quickly enough that the switching was (generally) invisible. In both cases the model of the overall system remained the same: the processor ran a single program at any one time, rapidly switching between them. As these systems became more generalized, the time sharing concept evolved into the modern concept of multitasking.

The operating systems and the processors running them do not really model what the system is conceptually trying to do: run multiple programs in parallel. Modern processors are dedicated to running a single application as quickly as possible, while modern operating systems attempt to switch efficiently between programs. While there have been attempts to build CPUs that directly run multiple programs at the same time, these have not seen widespread use for a variety of reasons.

Processors have nevertheless made dramatic gains in performance by adding instruction-level parallelism as opposed to application-level parallelism. This parallelism has expanded both in "depth", using a technique known as instruction pipelining, and in "breadth", by adding additional functional units. Whereas an early microprocessor could keep only one instruction "in flight" at a time, modern designs can have hundreds, running in assembly-line fashion on several independent units. These techniques allowed dramatic gains in processor performance during the late 1990s and early 2000s.

A limiting factor in this evolution is the ability of the processor to effectively spread the instructions from a single program across multiple functional units. Programs are generally organized to appear "linear" from the programmer's point of view; that is, a program can generally be thought of as a series of instructions that say "do this, now do this". In many cases an instruction depends on the results of previous instructions, which makes the job of selecting independent code fairly difficult. To illustrate the problem, consider the following example of pseudo-code:

c = a + b   # needs a and b
c = c + 5   # needs the new value of c
d = b + c   # needs b and the updated c
e = c + d   # needs c and d

In this case, which is far from uncommon, the instructions are dependent on the results of previous calculations. The first two instructions can actually be started almost in parallel, loading the values of a, b and 5 together and completing the actual additions later. However, the remaining calculations cannot be run until the earlier ones have completed. This sort of data interdependency is quite common, and it dramatically increases the difficulty of adding parallelism to modern CPU designs. Although corporate and university research teams have invested huge amounts of effort in improving these systems, modern processors appear to have reached a hard limit of about four independent streams. Beyond that the CPU simply cannot find enough independence to keep additional units busy, and any additional units go to waste.
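
A compiler or an out-of-order scheduler uncovers whatever parallelism does exist by tracking which values each instruction reads and writes. The following minimal Python sketch is purely illustrative (the tuple encoding and the one-cycle latency are assumptions, not any real scheduler): it computes the earliest cycle at which each of the four instructions above could issue, showing that the chain is essentially serial.

instructions = [
    ("c", {"a", "b"}),   # c = a + b
    ("c", {"c", "5"}),   # c = c + 5
    ("d", {"b", "c"}),   # d = b + c
    ("e", {"c", "d"}),   # e = c + d
]

ready_at = {"a": 0, "b": 0, "5": 0}             # the inputs are available immediately
issue_cycle = []
for dest, sources in instructions:
    cycle = max(ready_at[s] for s in sources)   # wait until every operand is ready
    issue_cycle.append(cycle)
    ready_at[dest] = cycle + 1                  # the result becomes ready one cycle later

print(issue_cycle)   # [0, 1, 2, 3]: each addition must wait for the one before it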

One way around this has been to add different types of units dedicated to different types of data. Floating point units (FPUs), which handle mathematical instructions in a separate set of units, have been a common feature of CPU designs since the mid-1990s. However, typical applications do not use enough floating point math to make more than two or three such units worthwhile. A more recent addition, largely from the 2000s, is SIMD instructions, which handle another class of data used for common media processing such as video and sound. Like floating point, these instructions are not common enough in general use to make more than one or two such units useful in commodity processors. In all of these cases the processor becomes less general purpose as these units are added; a processor developed for common desktop use might have four integer units (ALUs), two or three FPUs, and two or three SIMD units. This mix of units is selected after extensive simulation and testing of common workloads. The same selection might not be nearly as useful in other roles; for instance, a web server would benefit more from integer performance, while a supercomputer application would want more floating point.

Another recent attempt to address these problems has been VLIW designs. These processors are functionally similar to more "classic" designs, but offload the decisions about what can be run in parallel to the compiler rather than the processor itself. The compiler can spend considerably more time studying the program before deciding which instructions are indeed independent, loading them together into a single "wide" instruction. This dramatically lowers the amount of on-chip circuitry needed to schedule the units, freeing area that can be re-used for other purposes such as additional cache memory.
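
As a rough illustration of what a VLIW compiler does at build time (a simplified sketch; the four-slot bundle width and the dependence test are assumptions, not any particular VLIW instruction set), independent operations are packed ahead of time into fixed-width bundles:

WIDTH = 4   # functional units exposed by this hypothetical VLIW machine

def bundle(instructions):
    # Greedily pack (dest, sources) tuples into bundles of at most WIDTH
    # instructions whose inputs do not depend on results produced earlier
    # in the same bundle, preserving program order.
    bundles, current, written = [], [], set()
    for dest, sources in instructions:
        depends = dest in written or written & sources
        if depends or len(current) == WIDTH:
            bundles.append(current)      # close this bundle, start the next one
            current, written = [], set()
        current.append((dest, sources))
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles

# The dependent chain from the earlier example fills four one-instruction
# bundles, leaving three of the four slots empty every time.
print(bundle([("c", {"a", "b"}), ("c", {"c", "5"}),
              ("d", {"b", "c"}), ("e", {"c", "d"})]))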

However, in order to support this, VLIW designs expose the functional units directly to the compiler, which places instructions on the units directly. This means that a compiled program must run on a processor with a fixed number of units. Changing the number of units would require considerable developer support and multiple versions of the same software for different processors. In practice, the amount of trouble this would cause has locked designs into a single implementation, perhaps an even more serious limitation on scalability than that of the traditional designs, which can at least scale down to fewer units without too much trouble.

Since about 2005, designers have increasingly looked at ways to improve performance by adding entire additional processors to the die. This has no effect on the performance of any single application, but given several applications queued up in a multitasking system, this can lead to higher system performance. Developers can use the additional cores to improve the performance of single programs, but this requires complex programming techniques.

EDGE

Key to the TRIPS concept is Explicit Data Graph Execution, or EDGE.[2] EDGE is based on the processor being able to better understand the instruction stream sent to it, seeing it not as a linear stream of individual instructions but as blocks of instructions related to a single task that use isolated data. EDGE attempts to run all of the instructions in such a block as a unit, distributing them internally along with any data they need to process.[3] The compilers examine the code and find blocks of instructions that share information in a specific way. These are then assembled into compiled "hyperblocks" and fed into the CPU. Since the compiler guarantees that these blocks have specific interdependencies between them, the processor can isolate the code in a single functional unit with its own local memory.

In the pseudocode example above, the compiler would notice the interdependencies between the data and compile these instructions into a single hyperblock. Code that did not rely on this data would be compiled into separate hyperblocks. Of course, it is possible that an entire program would use the same data, so the compilers also look for cases where data is handed off to other code and then effectively abandoned by the original block, which is an extremely common access pattern. In this case the compiler still produces two separate hyperblocks, but explicitly encodes the handoff of the data rather than simply leaving it stored in some shared memory location. In doing so, the processor can "see" these communication events and schedule them to run in the proper order. Blocks that have considerable interdependencies are re-arranged by the compiler to spread out the communications and avoid bottlenecking the transport.
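
In very rough terms, the compiler's job is therefore to partition the instruction stream into blocks and to turn every value that crosses a block boundary into an explicit handoff. The Python sketch below is a conceptual illustration only (the block contents and the simple producer/consumer scan are assumptions, not the TRIPS compiler's actual algorithm):

def cross_block_values(blocks):
    # blocks: list of hyperblocks, each a list of (dest, sources) tuples.
    # Returns (producing block, consuming block, value) for every value
    # produced in one block and read in a later one -- the handoffs the
    # compiler must encode explicitly.
    handoffs = []
    for i, producer in enumerate(blocks):
        produced = {dest for dest, _ in producer}
        for j in range(i + 1, len(blocks)):
            consumed = set().union(*(sources for _, sources in blocks[j]))
            for value in sorted(produced & consumed):
                handoffs.append((i, j, value))
    return handoffs

# Two hypothetical hyperblocks: the first computes c, the second consumes it.
block0 = [("c", {"a", "b"}), ("c", {"c", "5"})]
block1 = [("d", {"b", "c"}), ("e", {"c", "d"})]
print(cross_block_values([block0, block1]))   # [(0, 1, 'c')]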

The effect of this change is to greatly increase the isolation of the individual functional units. EDGE processors are limited in parallelism by the capabilities of the compiler, not by the on-chip systems. Whereas modern processors are reaching a plateau at four-wide parallelism, EDGE designs can scale out much wider. They can also scale "deeper", handing off blocks from one unit to another in a chain that is scheduled to reduce contention due to shared values.

TRIPS

The current implementation of the EDGE concept is the TRIPS processor, the Tera-op, Reliable, Intelligently adaptive Processing System. A TRIPS CPU is built by repeating a single basic functional unit as many times as needed. Lacking the scheduling problems of traditional designs and the explicit unit exposure of VLIW designs, TRIPS can be scaled up and down in performance much more easily. For instance, the current TRIPS design can be implemented up to 16-wide, but could just as easily be implemented with a single unit. While a traditional design can scale down to a single unit, its upward scaling is currently limited to about 4-wide and is unlikely to change much. VLIW designs are limited to whatever width was originally selected for the implementation.

The TRIPS design's use of hyperblocks that are loaded en masse also allows for dramatic gains in speculative execution. Whereas a traditional design might have a few hundred instructions to examine for possible scheduling, the TRIPS design has thousands. This leads to greatly improved unit utilization; compared with a typical 4-issue superscalar design, TRIPS can process about three times as many instructions per cycle.

In TRIPS, the individual units are general purpose, allowing any instruction to run on any core. Traditional designs instead use a variety of different types of units to extract more parallelism than the four-wide schedulers would otherwise allow; however, in order to keep all of the units active the instruction stream has to include all of these different types of instructions, which is often not the case. Since TRIPS can be scaled out as wide as needed, the need for different types of units goes away, and a TRIPS processor is just as able to run a single type of instruction as many different types, a capability the designers refer to as a "polymorphic processor".

TRIPS is so flexible in this regard that the developers have suggested it could even replace some custom high-speed designs such as DSPs. Like TRIPS, DSPs gain additional performance by limiting data interdependencies, but unlike TRIPS they do so by allowing only a very limited workflow to run on them. TRIPS would be just as fast as a custom DSP on these workloads, while remaining able to run other workloads at the same time. As the designers have noted, a TRIPS processor is unlikely to replace highly customized designs like the GPUs in modern graphics cards, but it may be able to replace or outperform many lower-performance chips such as those used for media processing.

The reduced reliance on a global register file also results in non-obvious gains. The addition of new circuitry to modern processors has meant that their overall size has remained about the same even as they move to smaller process sizes. As a result the relative distance to the register file has grown, and this limits the possible cycle speed due to communication delays. In EDGE the data is generally more local, or isolated in well-defined inter-core links, eliminating large "cross-chip" delays. This means the individual cores can be run at higher speeds, limited by the signaling time of the much shorter data paths.

The combined effect of these design changes is a large improvement in system performance. The goal is to produce a single-processor system with 1 TFLOPS of performance by 2012. For comparison, a modern Mac Pro using the latest 2-core Intel Xeon can perform about 5 GFLOPS on single applications.[4]

In 2003, the TRIPS team started implementing a prototype chip. Each chip has two complete cores, each one with 16 functional units in a 4-wide, 4-deep arrangement. In the current implementation, the compiler constructs "hyperblocks" of 128 instructions each, and allows the system to keep 8 blocks "in flight" at the same time, for a total of 1,024 instructions per core. The basic design can include up to 32 chips interconnected, approaching 500 GFLOPS.[5]
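
The in-flight window in that description follows directly from the stated figures; a quick check using only the numbers quoted above:

instructions_per_block = 128
blocks_in_flight = 8
units_per_core = 4 * 4                             # the 4-wide, 4-deep grid of functional units

print(instructions_per_block * blocks_in_flight)   # 1024 instructions in flight per core
print(units_per_core)                              # 16 functional units per core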

References
