Joel McCormack

Joel McCormack is the designer of the NCR Corporation version of the p-code machine, a kind of stack machine that was popular in the 1970s as a way to implement new computing architectures and languages such as Pascal and BCPL. The NCR design shares no common architecture with the Pascal MicroEngine designed by Western Digital, but both were meant to execute the UCSD p-System.[1,2]

P-machine Theory

Urs Ammann, a student of Niklaus Wirth, originally presented p-code in his PhD thesis (see Urs Ammann, On Code Generation in a Pascal Compiler, Software: Practice and Experience, Vol. 7, No. 3, 1977, pp. 391–423). The central idea is that a complex software system is coded for a fictitious, minimal computer, or virtual machine, and that this computer is realized on specific real hardware by an interpreter program that is typically small, simple, and quickly developed. Pascal had to be rewritten for every new computer being acquired, so Ammann proposed writing the system once for a virtual architecture. A successful academic implementation of Pascal was the UCSD p-System, developed by Kenneth Bowles, a professor at UCSD, who began a project to build a universal Pascal programming environment on the P-machine architecture for the multitude of computing platforms in use at that time. McCormack was part of a team of undergraduates working on the project.[3] He took this familiarity and experience with him to NCR.
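
The central idea above, a program coded for a fictitious minimal machine and realized by a small interpreter, can be illustrated with a toy example. The following is only a sketch: the four opcodes (PUSH, ADD, MUL, HALT) and their numbering are invented for illustration and are not the UCSD p-code instruction set.

```python
# Hypothetical four-instruction stack machine. The opcode names and
# numbers are illustrative assumptions, not the real p-code encoding.
PUSH, ADD, MUL, HALT = 0, 1, 2, 3


def run(program):
    """Interpret a list of integers as code for the fictitious machine."""
    stack = []
    pc = 0  # index of the next code word to read
    while True:
        op = program[pc]
        pc += 1
        if op == PUSH:
            stack.append(program[pc])  # the operand follows the opcode
            pc += 1
        elif op == ADD:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == MUL:
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right)
        elif op == HALT:
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode {op}")


# Code for (2 + 3) * 4 on the fictitious machine.
program = [PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT]
result = run(program)
```

Porting such a system to a new computer means rewriting only the interpreter loop; the compiled program is unchanged.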

P-machine Design

In 1979 McCormack was hired by NCR right out of college. NCR had already developed a bit-slice implementation of the p-code machine using the Am2900 chip set. This CPU had a myriad of timing and performance problems, so McCormack proposed a total redesign of the processor using a microsequencer based on programmable logic devices. McCormack left NCR to start a company called Volition Systems but continued the work on the CPU as a contractor. The new CPU used an 80-bit-wide microword, which greatly increased the parallelism available in the microcode. Several loops in the microcode were a single instruction long, and many of the simpler p-code ops took one or two microcode instructions. With the wide microword, carefully arranged buses, and incrementing memory address registers, the CPU could execute operations inside the ALU while transferring a memory word directly to the onboard stack, or feed one source into the ALU while sending a previously computed register to the destination bus, all in a single microcycle.

The CPU ran at three different clock speeds (using delay lines for a selectable clock); two bits in the microword selected the cycle time for that instruction. The cycle times were around 130, 150, and 175 nanoseconds. Newer parts from AMD would have allowed a faster 98 ns cycle for the fastest instructions, but AMD did not release a correspondingly faster branch control unit.

There was a separate prefetch/instruction formatting unit (again using stoppable delay-line clocks for synchronization; asynchronous logic allows for skewed timings). It had a 32-bit buffer and could deliver the next operand as a signed byte, unsigned byte, 16-bit word, or "big" operand (the one-or-two-byte format in which 0..127 was encoded as one byte and 128..32767 was encoded as two bytes).
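
The "big" operand format can be sketched in a few lines. The text above gives only the value ranges; the exact bit layout used here (high bit of the first byte set as the two-byte flag, with its remaining seven bits as the high-order part of the value) is an assumption, and the function names decode_big and encode_big are hypothetical.

```python
def decode_big(buf, pos=0):
    """Return (value, next_pos) for a "big" operand starting at buf[pos]."""
    first = buf[pos]
    if first < 0x80:
        # One-byte form: values 0..127.
        return first, pos + 1
    # Two-byte form (assumed layout): the low 7 bits of the first byte
    # are the high-order bits, and the second byte is the low-order byte.
    return ((first & 0x7F) << 8) | buf[pos + 1], pos + 2


def encode_big(value):
    """Encode 0..32767 in the shortest "big" form (same assumed layout)."""
    if not 0 <= value <= 32767:
        raise ValueError("big operand out of range")
    if value <= 127:
        return bytes([value])
    return bytes([0x80 | (value >> 8), value & 0xFF])


small = encode_big(100)   # fits in one byte
large = encode_big(1000)  # needs two bytes
```

In hardware the same test on the first byte would decide whether the formatting unit consumes one or two bytes from its buffer.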

There was an onboard stack of 1024 16-bit words, so that both scalars and sets could be operated on there. The top of the stack was actually kept in one of the AMD 2901's registers, so that simple operations like integer addition took a single cycle.

Since next-address control and the next microcode location were in each wide microword, there was no penalty for any-order execution of the microcode. The design therefore used a table of 256 labels, and the microcode compiler moved the first instruction at each of those labels to the first 256 locations of microcode memory. The only restriction this placed upon the microcode was that if a p-code required more than one microinstruction, then the first microinstruction couldn't have any flow control specified (as it would be filled in with a "goto <rest of microcode for p-code>").
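
The placement rule can be sketched as a small layout routine. This is a simplified model, not the actual NCR microcode compiler: the function name lay_out is hypothetical, microinstructions are plain strings, the opcode numbers 1 and 2 are arbitrary, and continuation words are assumed to be stored contiguously after location 255.

```python
def lay_out(routines):
    """Model microcode memory for the given p-code routines.

    routines maps an opcode (0..255) to its list of microinstructions.
    Locations 0..255 receive the first microinstruction of each p-code;
    the remaining microinstructions are appended after location 255.
    """
    memory = [None] * 256
    for opcode, micro_ops in sorted(routines.items()):
        first, rest = micro_ops[0], micro_ops[1:]
        if not rest:
            memory[opcode] = first  # single-word p-code, may branch freely
            continue
        if "goto" in first:
            raise ValueError("first word of a multi-word p-code needs a free flow-control field")
        target = len(memory)
        # The layout step fills in the jump to the rest of the routine.
        memory[opcode] = f"{first} | goto {target}"
        memory.extend(rest)
    return memory


routines = {
    1: ["tos := tos + stack | inc(sp) | goto fetch"],  # ADI
    2: ["test tos - stack | inc(sp)",                   # EQUI
        "tos := 0 | if ~zero goto fetch",
        "tos := 1 | goto fetch"],
}
memory = lay_out(routines)
```

Because the opcode byte is itself the microcode address, dispatch needs no lookup step: a "goto ubyte" lands directly on the first word of the routine.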

P-machine Architecture

The CPU used the technique of keeping the top word of the stack in one of the AMD 2901 registers. This often resulted in one fewer microinstruction. For example, here are a few p-codes the way they ended up in microcode. Both tos and q are registers, and "|" separates activities that happen in parallel in a single cycle. (The stack doesn't quite operate the way the listing shows: it decrements before data is written to it, and increments after data is read.)

fetch	% Fetch and save in an AMD register the next byte opcode from
	% the prefetch unit, and go to that location in the microcode.
	q := ubyte | goto ubyte

SLDCI	% Short load constant integer (push opcode byte)
	% Push top-of-stack AMD register onto real stack, load
	% the top-of-stack register with the fetched opcode that got us here
	dec(sp) | stack := tos | tos := q | goto fetch

LDCI	% Load constant integer (push opcode word)
	% A lot like SLDCI, except fetch 2-byte word and "push" on stack
	dec(sp) | stack := tos | tos := word | goto fetch

SLDL1	% Short load local variable at offset 1
	% mpd0 is a pointer to local data at offset 0.  Write appropriate
	% data address into the byte-addressed memory-address-register
	mar := mpd0+2
	% Push tos, load new tos from memory
SLDX	dec(sp) | stack := tos | tos := memword | goto fetch

LDL	% Load local variable at offset specified by "big" operand
	r0 := big
	mar := mpd0 + r0 | goto sldx

INCR	% Increment top-of-stack by big operand
	tos := tos + big | goto fetch

ADI	% Add two words on top of stack
	tos := tos + stack | inc(sp) | goto fetch

EQUI	% Top two words of stack equal?
	test tos - stack | inc(sp)
	tos := 0 | if ~zero goto fetch
	tos := 1 | goto fetch
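
The effect of caching the top of the stack in a register can be traced with a small simulation of three routines from the listing above. This is a sketch under stated assumptions: the class and method names are hypothetical, only the microinstructions of each routine body are counted (the shared fetch step is left out), and the memory and prefetch units are not modelled.

```python
class Machine:
    """Toy model of the onboard stack with a cached top-of-stack register."""

    def __init__(self, size=1024):
        self.stack = [0] * size  # onboard stack of 16-bit words
        self.sp = size           # grows downward; empty when sp == size
        self.tos = 0             # top-of-stack word held in a register
        self.cycles = 0          # microinstructions executed in routine bodies

    def push_const(self, value):
        # SLDCI / LDCI: dec(sp) | stack := tos | tos := value
        self.sp -= 1
        self.stack[self.sp] = self.tos
        self.tos = value & 0xFFFF
        self.cycles += 1

    def adi(self):
        # ADI: tos := tos + stack | inc(sp)
        self.tos = (self.tos + self.stack[self.sp]) & 0xFFFF
        self.sp += 1
        self.cycles += 1

    def equi(self):
        # EQUI: test tos - stack | inc(sp), then set tos to 0 or 1.
        equal = self.tos == self.stack[self.sp]
        self.sp += 1
        self.tos = 1 if equal else 0
        # The unequal case exits after two microinstructions, the equal
        # case falls through to a third.
        self.cycles += 3 if equal else 2


m = Machine()
m.push_const(2)
m.push_const(3)
m.adi()          # tos is now 5
m.push_const(5)
m.equi()         # 5 == 5, so tos becomes 1
```

Note that the first push writes the register's initial contents to the stack, so one unused word always sits below the real data in this model.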

This architecture should be compared to the original P-code machine specification as proposed by Niklaus Wirth.

P-machine Performance

The end result was a 9 by 11 inch CPU board that ran the UCSD p-System faster than anything else, by a wide margin. It was 35 to 50 times faster than the LSI-11 interpreter and 7 to 9 times faster than the Western Digital Pascal MicroEngine, which worked by replacing the LSI-11 microcode with p-code microcode. It also ran faster than Niklaus Wirth's Lilith machine, though it lacked the Lilith's bit-mapped graphics capabilities, and it ran at around the same speed as a VAX-11/750 running native code. (But the VAX was hampered by the poor code coming out of the Berkeley Pascal compiler, and was also a 32-bit machine.)

Education

Later Employment

Publications

References

  1. The Pascal Users' Group Newsletter Archive
  2. The UCSD P-system Museum
  3. The UCSD Pascal Reunion website

See also