FPS AP-120B
The FPS AP-120B was a 38-bit, pipeline-oriented array processor manufactured by Floating Point Systems. It was designed to be attached to a host computer such as a DEC PDP-11 as a fast number-cruncher. Data transfer was accomplished using direct memory access.
Processor cycle time was 167 nanoseconds, giving a clock rate of about 6 MHz. Because the adder and the multiplier could each deliver one floating-point result per cycle, a peak capacity of 12 megaflops was claimed for the processor.
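The arithmetic behind these figures can be checked with a short sketch. The variable names are illustrative only; the two input values are the ones quoted above.

```python
CYCLE_TIME_NS = 167      # cycle time quoted in the text
RESULTS_PER_CYCLE = 2    # one result from the adder, one from the multiplier

# 1 / (167 ns) expressed in MHz: 1000 / 167
clock_mhz = 1e3 / CYCLE_TIME_NS
peak_mflops = clock_mhz * RESULTS_PER_CYCLE

print(f"clock ~ {clock_mhz:.2f} MHz, peak ~ {peak_mflops:.2f} Mflops")
```

The results round to the 6 MHz and 12 megaflops quoted. The peak is reached only when both floating-point pipelines deliver a result on every cycle.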
Architecture
The processor was designed around the concept of multiple parallel processing units operating in synchronization. A single 64-bit instruction word was divided into fields, each of which instructed a particular module under the control of the CPU. The modules were as follows:
- 16-bit Arithmetic and Logic Unit (ALU)
- 38-bit Floating Point Adder (FADD) (two stages)
- 38-bit Floating Point Multiplier (FMUL) (three stages)
- Two Data Pad registers for receiving data from memory
The processor had access to dual-interleaved core memory in which odd numbered addresses were stored in one physical bank, and even numbered addresses were stored in the other. This represented an attempt to take advantage of typical sequential fetching of memory words. Fetching sequentially from one physical bank would result in a latency of two instruction cycles before the data was loaded into the destination data pad. Interleaving allowed a sequential access to occur immediately after the previous one. Both accesses took two cycles to complete, but the overlap and dual destination pads maximized the use of the data channel.
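The effect of interleaving can be illustrated with a toy timing model. This is a sketch under stated assumptions, not a description of the actual memory controller: it assumes a bank is busy for two cycles after accepting a request and that at most one request is issued per cycle. The function name `fetch_schedule` is hypothetical.

```python
LATENCY = 2  # cycles from issuing a fetch to data arriving in a data pad

def fetch_schedule(addresses, banks=2):
    """Return (address, issue_cycle, ready_cycle) for each fetch.

    Assumed model: a bank stays busy for LATENCY cycles after accepting
    a request, and at most one request is issued per cycle.
    """
    bank_free_at = [0] * banks
    cycle = 0
    schedule = []
    for addr in addresses:
        bank = addr % banks                      # odd/even address split
        issue = max(cycle, bank_free_at[bank])   # wait if the bank is busy
        bank_free_at[bank] = issue + LATENCY
        schedule.append((addr, issue, issue + LATENCY))
        cycle = issue + 1
    return schedule

sequential = fetch_schedule(range(8))        # alternates between banks
same_bank = fetch_schedule(range(0, 16, 2))  # every address is even

print("sequential, last word ready at cycle", sequential[-1][2])
print("same bank,  last word ready at cycle", same_bank[-1][2])
```

In this model, eight sequential fetches are issued on consecutive cycles, while eight fetches that all hit the same bank can be issued only every second cycle.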
The floating-point arithmetic modules were both multi-stage processors driven by explicit instructions. In the two-stage adder, an assembler instruction such as FADD DX,DY would load values from data pads DX and DY into stage one of the adder. A subsequent FADD instruction was required to present the result at the adder's output. This second FADD could be a dummy with no arguments, or it could be the next calculation in a sequence. In this fashion a stream of FADD operations could be performed in a pipeline, delivering a new result every instruction cycle even though each addition took two cycles.
Similarly the multiplier, a three-stage unit, required one FMUL DX,DY to begin a multiplication, followed by two more FMUL instructions to produce the result. Careful programming of the pipeline allowed the production of one result per cycle, with each calculation taking three cycles in itself.
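The push-through behaviour of both units can be mimicked with a small simulation. This is a simplified sketch: the hypothetical `Pipeline` class computes the whole result when operands enter and then just delays it, whereas the real hardware did part of the work in each stage. Only the timing is meant to match the description above.

```python
import operator

class Pipeline:
    """Toy model of an explicitly driven arithmetic pipeline."""

    def __init__(self, stages, op):
        self.op = op
        self.slots = [None] * stages   # slots[-1] is the unit's output

    def step(self, operands=None):
        """Issue one instruction and return the value now at the output.

        operands=None models a dummy instruction that only pushes
        earlier work one stage further along.
        """
        entering = None if operands is None else self.op(*operands)
        self.slots = [entering] + self.slots[:-1]
        return self.slots[-1]

fadd = Pipeline(2, operator.add)   # two-stage adder
fmul = Pipeline(3, operator.mul)   # three-stage multiplier

pairs = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
sums = [fadd.step(p) for p in pairs] + [fadd.step()]
products = [fmul.step(p) for p in pairs] + [fmul.step(), fmul.step()]

print(sums)      # [None, 3.0, 7.0, 11.0]
print(products)  # [None, None, 2.0, 12.0, 30.0]
```

Once the pipeline is full, a result emerges on every step. The trailing dummy steps flush the last operands through: one for the adder, two for the multiplier.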
For maximum efficiency all calculations were programmed using the assembler language supplied with the hardware. A high-level language resembling Fortran was provided for coordinating tasks and controlling data transfers to and from the host computer.
Lookup tables
In order to support typical applications in signal processing, the hardware was delivered with a pre-calculated lookup table of sine and cosine values. Sines and cosines for angles from 0 to π/2 radians were stored in alternate addresses to take advantage of the interleaving described above. Values for all other angles could be calculated by using one or other of the values from the lookup table, negating if necessary, using well-known rules.
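The "well-known rules" are the quadrant symmetries of sine and cosine. The sketch below assumes a hypothetical table of `N` steps per quarter turn, with sines at even addresses and cosines at odd addresses. The actual table size and addressing of the AP-120B are not given here.

```python
import math

N = 256  # hypothetical number of table steps per quarter turn

# Interleaved table for 0..pi/2: even address -> sine, odd -> cosine.
table = []
for i in range(N + 1):
    angle = i * (math.pi / 2) / N
    table += [math.sin(angle), math.cos(angle)]

def sin_cos(k):
    """Return (sin, cos) of the angle k * (pi/2) / N for any integer k,
    using only first-quadrant table entries, swapped or negated."""
    quadrant, i = divmod(k % (4 * N), N)
    s, c = table[2 * i], table[2 * i + 1]
    if quadrant == 0:
        return s, c
    if quadrant == 1:      # sin(x + pi/2) = cos x, cos(x + pi/2) = -sin x
        return c, -s
    if quadrant == 2:      # sin(x + pi) = -sin x, cos(x + pi) = -cos x
        return -s, -c
    return -c, s           # sin(x + 3pi/2) = -cos x, cos(x + 3pi/2) = sin x

print(sin_cos(3 * N // 2))  # angle 3*pi/4
```

Each lookup reads one adjacent sine and cosine pair, so under this layout the two values sit in different memory banks.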
Typical programming style
The programming style was unusual, being driven by the synchronous parallel processing architecture. The basic philosophy can be summarized as follows:
- Lay out the shortest sequence of instructions for performing one instance of the desired calculation, allowing for two-cycle memory latency, and the driving of the floating-point modules with explicit FADD and FMUL instructions.
- Inspect the sequence to determine the minimum number of instructions forming a loop which will perform the calculation repetitively. This requires attention to resource conflicts. For instance the data bus for moving results around can only move one data word per cycle. Likewise the ALU, used mostly for counting loops and memory addressing, can only be used for one purpose per cycle. This step is typically trial-and-error.
- Conceptually "wrap" the full sequence of instructions around the loop, using FADD and FMUL instructions to drive calculations through the pipelines.
- Before the loop begins, add parallel process initiations as required.
The final item was accomplished as follows. Assume that the entire calculation requires 15 cycles and the minimum loop size is 5 cycles. The first 5 instruction words begin iteration 1 of the calculation. The second 5 words contain the middle of iteration 1 and, in parallel, the beginning of iteration 2, which would usually be a copy of the operations that began iteration 1. The third 5 words contain the final steps of iteration 1, the middle of iteration 2, and the beginning of iteration 3. These last five words form the body of the loop, which repeats until the desired number of data points has been processed.
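The overlap can be tabulated with a short sketch. The function `pipelined_schedule` is hypothetical, and it assumes the total length is an exact multiple of the loop length. It also lists the blocks needed to drain the last iterations, which the description above does not cover.

```python
def pipelined_schedule(total_cycles, loop_cycles, iterations):
    """For each block of loop_cycles instruction words, list the
    (iteration, stage) pairs that execute in parallel within it.
    Iterations, stages and blocks are all numbered from 1."""
    stages = total_cycles // loop_cycles   # assumes an exact multiple
    blocks = []
    for block in range(1, iterations + stages):
        active = [(it, block - it + 1)
                  for it in range(1, iterations + 1)
                  if 1 <= block - it + 1 <= stages]
        blocks.append(active)
    return blocks

schedule = pipelined_schedule(total_cycles=15, loop_cycles=5, iterations=4)
for number, active in enumerate(schedule, start=1):
    print(f"block {number}:", active)
```

With four iterations, blocks 1 and 2 fill the pipeline, and blocks 3 and 4 both have the steady-state shape of the loop body, with stages 3, 2 and 1 of three consecutive iterations. Blocks 5 and 6 drain it. In this model the whole run takes 6 blocks of 5 cycles, or 30 cycles, against 60 cycles if the four iterations ran one after another.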
Application
As an attached processor, the AP-120B was typically used as a cost-effective adjunct to systems such as diagnostic medical imaging equipment.
References
- Roger W. Hockney and C. R. Jesshope, Parallel Computers Two: Architecture, Programming and Algorithms, CRC Press, 1988, ISBN 0-85274-811-6, p. 206 ff.
- FPS had a bibliography of papers.