Mill CPU Architecture

The Mill CPU architecture is a novel belt machine-based architecture for general-purpose computing, which has been under development by Ivan Godard and his startup Mill Computing, Inc. (East Palo Alto, California;[1] formerly Out Of The Box Computing) since ca. 2003. Mill Computing claims it has a "10x single-thread power/performance gain over conventional out-of-order superscalar architectures" but "runs the same programs, without rewrite".[2]

Mill Computing was founded by people who formerly worked together on a family of DSP CPUs, the Philips TriMedia.

Approach

The designers claim that the power and cost improvements come by adapting a DSP-like deeply-pipelined CPU to general-purpose code. The timing hazards from branches and memory access are said to be handled using speculative execution, pipelining and other late-binding but statically-scheduled logic. The claimed improvements in power and area are said to come from eliminating the dynamic optimization hardware: register-renaming, out-of-order execution hazard management and dynamic cache optimization.

Therefore, the Mill architecture is designed as a compiler target for highly-optimizing compilers. The overall plan is to substitute static optimization at compile-time for hardware optimization at run time. To this end, each Mill CPU is designed to have timing and memory-access behavior that is predictable to single cycle times.

VLIW Instructions

Mill uses a VLIW-style encoding to store up to 33 simple operations in wide instruction words. Mill uses two program counters, and every wide instruction is split into two parts. One of the program counters counts backwards.[3] So, the code of every linear instruction block is executed from its middle to outside by two almost independent decoders. Unused operations are deleted by a small fixed-format data item in the center of each instruction. This helps keep the code density up by reducing the incidence of no-operation codes in Mill code. It also permits each function-unit to begin speculative execution of its instruction field, and then discard its result if it has no instruction.
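
The middle-outward decoding order can be illustrated with a toy model. The Python sketch below assumes a block is a flat list of operation slots with an entry index somewhere in the middle. The function name `decode_block`, the `entry` parameter and the slot contents are hypothetical and are not taken from Mill documentation; the real encoding is considerably more involved.

```python
def decode_block(slots, entry):
    """Walk one linear block from its entry point outward.

    `slots` is a flat list of operation names and `entry` is the index
    where the two halves meet. This layout is an illustrative assumption,
    not the actual Mill binary format.
    """
    down_pc = entry - 1   # program counter that counts backwards
    up_pc = entry         # program counter that counts forwards
    issued = []
    while down_pc >= 0 or up_pc < len(slots):
        bundle = []
        if down_pc >= 0:              # first decoder takes the lower half
            bundle.append(slots[down_pc])
            down_pc -= 1
        if up_pc < len(slots):        # second decoder takes the upper half
            bundle.append(slots[up_pc])
            up_pc += 1
        issued.append(tuple(bundle))
    return issued


block = ["br", "load", "add", "mul", "store"]
steps = decode_block(block, entry=2)
print(steps)
```

Each tuple holds what the two decoders pick up in the same step; once one side runs out of slots, only the other side contributes.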

The Belt, a pipelining register system

Godard quotes research claiming that 80% of the values in a CPU's register system are accessed once, 14% are accessed more than once, and 6% are never accessed at all.

Therefore, the Mill uses a novel temporal register-addressing scheme, "the Belt", proposed by Ivan Godard, to greatly reduce the complexity of CPU hardware (specifically the number of internal registers).[4] The belt can be pictured as a moving conveyor belt from which the oldest values "drop off" into oblivion.

Because operands are addressed by their relative position on the belt, the Mill's machine code and assembly language may be harder to read and debug than code written with conventional register names, but few large projects are written in such low-level languages. The elimination of registers avoids complex register renaming schemes.[3]

A novel feature of the belt is that operations can drop multiple results to it. For example, division can produce a quotient and a remainder. Operations with overflow can produce double-wide results.
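
The programmer-visible behavior described so far can be sketched as a small Python model. This is only a sketch under stated assumptions: the class `Belt`, the default belt length of 8, and the helper operations `add` and `divrem` are hypothetical names chosen for illustration, and the model captures the addressing semantics, not the hardware.

```python
from collections import deque


class Belt:
    """Toy fixed-length belt; position 0 is always the newest value."""

    def __init__(self, length=8):           # the length is an assumption
        self.items = deque(maxlen=length)

    def drop(self, *values):
        # New results enter at the front. When the belt is full, the
        # oldest value silently falls off the far end.
        for v in values:
            self.items.appendleft(v)

    def get(self, pos):
        return self.items[pos]


def add(belt, a, b):
    # Operands are named by belt position; no destination is encoded.
    belt.drop(belt.get(a) + belt.get(b))


def divrem(belt, a, b):
    # One operation dropping two results: quotient, then remainder.
    q, r = divmod(belt.get(a), belt.get(b))
    belt.drop(q, r)


belt = Belt(length=4)
belt.drop(17, 5)        # belt (newest first): 5, 17
divrem(belt, 1, 0)      # 17 / 5 -> belt: 2, 3, 5, 17
add(belt, 0, 1)         # 2 + 3  -> belt: 5, 2, 3, 5 (17 has dropped off)
print(list(belt.items))
```

Note that the same value changes position every time something is dropped, which is why operand numbers in belt code are only meaningful relative to the point of use.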

Since Mill instructions do not need to specify a location to store a result, they are smaller by that amount.

Godard says that the belt is not a shift register. Instead it is a semantic representation of the bypass network present in most fast computers: This is the network that intercepts pipelined accesses to registers, routing them directly to the execution units that need the result. The actual number of registers is reasonably small: Those required to pipeline the output of each functional unit, and one for each possible belt item. The small number of actual registers reduces the size, power and complexity of the network to access the registers.

Belt items are accessed by belt position, and "move" by changing their names in a pipeline-safe way. The names are not just belt positions, but also tags for function frames. Just by incrementing the frame tag counter, the belt appears empty to a newly called function.
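
The effect of the frame tag can be sketched in Python as follows. The class `TaggedBelt` and its methods are hypothetical, and the model deliberately omits argument passing and return values, which the text above does not describe.

```python
class TaggedBelt:
    """Toy belt whose entries carry a frame tag (illustrative only)."""

    def __init__(self):
        self.frame = 0        # current frame tag counter
        self.entries = []     # (frame_tag, value) pairs, newest last

    def drop(self, value):
        self.entries.append((self.frame, value))

    def visible(self):
        # Only entries tagged with the current frame are addressable,
        # listed newest first.
        return [v for f, v in reversed(self.entries) if f == self.frame]

    def call(self):
        # Incrementing the tag hides the caller's items; nothing is copied.
        self.frame += 1

    def ret(self):
        # Discard the callee's items and expose the caller's again.
        self.entries = [(f, v) for f, v in self.entries if f < self.frame]
        self.frame -= 1


b = TaggedBelt()
b.drop(1)
b.drop(2)
b.call()
empty_in_callee = b.visible()   # the new function sees an empty belt
b.drop(99)
b.ret()
print(empty_in_callee, b.visible())
```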

The length of the belt is designed so that residence time in the belt equals the time to access the scratchpad, a RAM area used to spill belt items that will be reused.

The belt is the fast, CPU end of a hardware caching system called the "spiller" which moves belt items between subroutines, the scratchpad and much slower memory areas associated with each function iteration's data area. If the bandwidth of the spiller is exceeded, the Mill stalls, waiting for the belt to become consistent.

Use of Metadata

The Mill also assigns metadata to each belt item by the type and success of load operations. The Mill assigns status, a width and vectorization count to each item in the belt. Instructions operate on the item described, and therefore the width and vector count are not part of the instruction coding. If an operation fails, the failure information is hashed and placed in the destination and its metadata for use in debugging.

The Mill also uses the metadata to assist speculative execution and pipelining. For example, if a vector load operation fails, e.g. part of it crosses a protection boundary, the affected parts of that belt entry are marked as "Not a Result" in the metadata. This permits speculatively executed vector code to emulate per-vector-item fault behavior: the "Not a Result" items create a fault only if there is an attempt to store them or to perform some other non-speculative operation on them. If they are never used, no fault is ever created.
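
The deferred-fault behavior can be sketched in Python. This is a minimal model, assuming a per-element boolean flag stands in for the "Not a Result" metadata; the names `Item`, `Fault`, `spec_load`, `spec_add` and `store` are hypothetical and do not correspond to real Mill operations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    value: int = 0
    nar: bool = False      # "Not a Result" flag carried as metadata


class Fault(Exception):
    pass


def spec_load(memory, addr):
    # Speculative load: an out-of-bounds access yields a NaR item
    # instead of faulting immediately.
    if 0 <= addr < len(memory):
        return Item(memory[addr])
    return Item(nar=True)


def spec_add(a, b):
    # Speculable operations simply propagate the NaR flag.
    if a.nar or b.nar:
        return Item(nar=True)
    return Item(a.value + b.value)


def store(memory, addr, item):
    # A store is not speculable, so this is where the fault surfaces.
    if item.nar:
        raise Fault("attempt to store a Not-a-Result item")
    memory[addr] = item.value


mem = [10, 20, 30]
vec = [spec_load(mem, i) for i in range(4)]    # element 3 is out of bounds
sums = [spec_add(x, Item(1)) for x in vec]     # no fault raised here
store(mem, 0, sums[0])                         # fine: stores 11
```

Storing `sums[3]` would raise `Fault`; if that element is never stored, the failed load has no visible effect.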

The Mill's architecture appears able to reduce the size and complexity of pipelined loop code. It uses metadata and speculation to eliminate pipeline set-up and teardown. In the pipeline video, every operation was required to cope with an argument of "not a number" in a sensible way: Arithmetic and bit-wise logical operations produce a NaN if any input is a NaN. Stores and other non-speculable operations do nothing. To run a pipelined loop, the code pushes a group of NaNs on the belt, and then begins to execute the steady-state loop body. As live data iterates in the loop body, the pipeline is initialized. Teardown happens in a parallel way by feeding NaNs to the loop. A crucial invention was to permit operations to insert NaNs on the belt, but only for pipelined loops.
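
A software analogue of this scheme is sketched below in Python. It assumes a three-stage loop (multiply, add, store) and uses a sentinel object, here called `NONE`, in place of the placeholder value the text calls a NaN. All names are hypothetical; the point is only that a single steady-state body serves for start-up, steady state and drain.

```python
NONE = object()   # stand-in for the placeholder described in the text


def times2(x):
    # Speculable operation: a placeholder input gives a placeholder output.
    return NONE if x is NONE else x * 2


def plus1(x):
    return NONE if x is NONE else x + 1


def store(out, x):
    # Non-speculable operation: does nothing when handed a placeholder.
    if x is not NONE:
        out.append(x)


def pipelined(data):
    out = []
    stage1 = stage2 = NONE              # prime the pipeline with placeholders
    feed = list(data) + [NONE, NONE]    # teardown: feed placeholders to drain
    for x in feed:
        # Steady-state body: every stage runs on every iteration.
        store(out, stage2)
        stage2 = plus1(stage1)
        stage1 = times2(x)
    return out


result = pipelined([1, 2, 3])
print(result)
```

The output equals `x * 2 + 1` for each input, with no separate prologue or epilogue code.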

To pipeline nested loops, the Mill treats each loop almost like a subroutine call, with saves and restores of appropriate state.

Lockstep Phased Execution

Another improvement said to open up the instruction-level parallelism is that Mill instructions are phased. Instructions may span several clock cycles, and hold up to 33 operations. Within an instruction, math operations finish first, data rearrangements in the middle, and stores to RAM last. Also, both the operations and even multiple cores operate in statically-predictable prioritized timings.

Family Traits

There are several versions of the Mill CPU in development, spanning Tin (low-end) to Gold (targeted at the high-performance market). The company estimates that a dual-core Gold chip implemented in a 28 nm process may run at 1.2 GHz with a typical TDP of 28 watts and a performance of 79 billion operations per second.[3]
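
As a rough plausibility check, the throughput figure is consistent with a simple peak-rate product. The Python snippet below assumes that each core issues one full instruction of 33 operations every cycle. That is an assumption made here for illustration; the cited source does not state how the estimate was derived.

```python
cores = 2                    # dual-core Gold chip (from the text)
clock_hz = 1.2e9             # 1.2 GHz (from the text)
ops_per_instruction = 33     # "up to 33" operations per instruction

# Assumed peak: every core issues a full instruction on every cycle.
peak_ops_per_second = cores * clock_hz * ops_per_instruction
print(f"{peak_ops_per_second / 1e9:.1f} billion operations per second")
```

Under that assumption the product is about 79.2 billion, which rounds to the quoted figure and suggests it is a theoretical peak, not a measured rate.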

Different versions of the Mill are intended for different markets, and are said to have different instruction set architectures, different numbers of execution units, different pipeline timings and therefore very different binaries. To accommodate these, compilers are required to emit a "specification" which is then recompiled into an executable binary by a recompiler supplied by Mill Computing. In this way, distributable code is adapted to the specifics of the exact model's pipeline, binary encoding, etc.

The development of so many tool sets and CPU designs could be too expensive to be practical. Ivan Godard said that the company's plan is to develop software that accepts a specification for a Mill CPU and then generates the software tools (assembler, compiler backend and simulator) and the Verilog describing the CPU. In a demo video, Mill Computing claimed to show early versions of the software that creates an assembler and simulator. The bulk of the compiler is said to be a port of LLVM, but it was incomplete as of 2014.

Skepticism

Mill Computing (formerly OOTBC) was criticized by Linley Gwennap in 2013 for the absence of a working compiler and the lack of performance estimates on conventional benchmarks.[3]

In the Mill videos, a number of questioners said that other VLIW compiler designers have been unable to locate and use instruction level parallelism (ILP) of more than two. Ivan Godard claims that a sufficiently-wide Mill can find and use an ILP of six or more on common programs.

References

  1. http://investing.businessweek.com/research/stocks/private/snapshot.asp?privcapId=261967066
  2. The Mill CPU Architecture - Specification (8 of 9). 2014-05-24. Retrieved 2014-07-23.
  3. Getting Way Out of the Box // Microprocessor Report, August 5, 2013
  4. http://millcomputing.com/docs/belt/

This article is issued from Wikipedia - version of the Saturday, February 13, 2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.