Cell architecture

From Wikipedia, the free encyclopedia

Constituent elements

Cell BE
architecture
software
development
fabrication
  • PPE
  • SPE
  • EIB
  • XDR
  • FLEX I/O

Patents

U.S. patents, 2004

Corporation                        Filings
International Business Machines    3,248
Matsushita Electric Industrial     1,934
Canon Kabushiki Kaisha             1,805
Hewlett-Packard Development        1,775

IBM holds one of the world's largest patent portfolios. The United States Patent and Trademark Office proclaimed in 2004, "For the twelfth consecutive year, IBM received more patents than any other private sector organization."

A selection of the patents filed by the Cell design team, those most pertinent to the unique conception and features of Cell, is detailed below. This material shows how IBM viewed the novelties of the Cell processor during the initial design period.

U.S. 6,779,049: distributed DMA translation

United States patent 6779049 was granted on August 17, 2004 to Erik R. Altman and ten others, including Dr Peter Hofstee, Cell Chief Scientist and Cell Synergistic Processor Chief Architect. It bears the rather ponderous title "Symmetric multi-processing system with attached processing units being able to access a shared memory without being structurally configured with an address translation mechanism".

The abstract cites one embodiment of the invention as consisting of an SMP system with shared memory and "a plurality of processing elements coupled to the shared memory". In the Cell design inspired by this patent, the processing elements were fleshed out as the PPE core and the eight SPE cores.

The abstract then further defines the nature of a processing element: "Each of the plurality of processing elements comprises a processing unit, a direct memory access controller and a plurality of attached processing units." All nine Cell cores have a DMA controller as this patent describes. For the PPE core, the processing unit is the Power architecture PPU execution engine; for the SPE cores, the attached processing unit is the SPU execution engine.

Finally, it further defines each DMA controller as comprising "an address translation mechanism thereby enabling each associated attached processing unit to access shared memory in a restricted manner without an address translation mechanism". The vague phrase "restricted manner" is key. For the Cell this statement underscores that the predominant memory access mechanism is DMA requests set up by the DMA controller contained within each core. For the SPE cores, the DMA mechanism is in fact the only mechanism for making direct access to system memory. Whether the PPE funnels all varieties of system memory request exclusively through its internal DMA controller is not stipulated by IBM in its Cell overview materials.

This invention stemmed in part from an observation by IBM that GPUs typically achieve higher performance per watt because they are not burdened with address translation overheads on every memory access. GPU working memory was adapted to Cell as the SPE "local store". No address translation is performed when the SPU performs memory operations on local store. Address translation is instead performed by the SPE's internal DMA controller at the granularity of DMA requests.

This significantly reduces the frequency of memory address translations. Instead of one translation per load or store instruction, address translation is performed once per DMA request. A single DMA request can be up to 16 KiB in length. However, Cell is tuned for DMA requests of 128 bytes in length, so this request length is especially common.
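The saving can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes that a conventional design would translate every 16-byte (one 128-bit register) load or store, and that a DMA request never needs more than one translation. Both are simplifying assumptions for the example, not figures taken from the patent, and the function names are hypothetical.

```python
# Illustrative arithmetic only. The 16-byte access width and the
# "one translation per DMA request" rule are assumptions for this sketch.
KIB = 1024


def translations_per_access(total_bytes: int, access_bytes: int = 16) -> int:
    """Translations needed if every load or store were translated."""
    return -(-total_bytes // access_bytes)  # ceiling division


def translations_per_dma(total_bytes: int, dma_bytes: int) -> int:
    """Translations needed if one translation covers a whole DMA request."""
    return -(-total_bytes // dma_bytes)  # ceiling division


working_set = 16 * KIB
per_access = translations_per_access(working_set)
per_128b_dma = translations_per_dma(working_set, 128)
per_max_dma = translations_per_dma(working_set, 16 * KIB)

print(per_access, per_128b_dma, per_max_dma)
```

Under these assumptions, moving 16 KiB costs 1024 translations when translating per access, 128 with 128-byte DMA requests, and a single translation with one maximal request.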

The patent primarily concerns the management and coherence of the distributed TLBs. As an example of the level of the patent, Fig. 6 is a flow chart of seven boxes which flow in sequence with no decision points. Paraphrased extremely roughly, it runs as follows:

Fig. 6, heavily condensed
  1. a processor invalidates a page table entry within its own TLB
  2. the processor issues a TLB invalidated notice
  3. the processor broadcasts the notice to other places
  4. other places search their own TLB for the notified entries
  5. other places invalidate entries of their own as notified
  6. other places issue acknowledgements
  7. the originating processor issues a synchronization notice to other places
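The seven steps above can be sketched as a small simulation. This is a toy model written for this article, not the mechanism claimed in the patent: the `Element` class, the `shootdown` function and the message tuples are all hypothetical names, and the broadcast is modelled as a simple loop.

```python
class Element:
    """A processing element with a private TLB (virtual page -> physical page)."""

    def __init__(self, name, tlb):
        self.name = name
        self.tlb = dict(tlb)

    def handle_notice(self, vpage):
        # Steps 4-6: search the local TLB, invalidate the entry if it is
        # present, then acknowledge.
        self.tlb.pop(vpage, None)
        return ("ack", self.name, vpage)


def shootdown(origin, others, vpage, log):
    """Run the seven-step sequence once for a single virtual page."""
    origin.tlb.pop(vpage, None)               # step 1: invalidate own entry
    notice = ("tlb-invalidated", vpage)       # step 2: issue the notice
    acks = []
    for peer in others:                       # step 3: broadcast the notice
        log.append((notice, peer.name))
        acks.append(peer.handle_notice(vpage))  # steps 4-6 on each peer
    log.append(("sync", origin.name, vpage))  # step 7: synchronization notice
    return acks


ppe = Element("PPE", {0x10: 0x80, 0x11: 0x81})
spes = [Element(f"SPE{i}", {0x10: 0x80}) for i in range(3)]
log = []
acks = shootdown(ppe, spes, 0x10, log)
```

After the call, no element holds a translation for page 0x10, every peer has acknowledged, and the synchronization notice is the last event in the log.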

This is not a reliable legal account; for that you must consult a qualified patent lawyer. The claims largely read on what each of those steps might entail, along with unavoidable elaborations such as handling a range of addresses.

See Also

United States patent 6907477 was granted the following year, on June 14, 2005, also to Erik R. Altman and many others, bearing the less ponderous title "Symmetric multi-processing system utilizing a DMAC to allow address translations for attached processors". It covers the same general idea with new claims and reuses Fig. 6. It focuses more on the address translation mechanism, such as translating a range of virtual addresses into a corresponding range of physical addresses.

U.S. 6,760,819: DMA coherence

United States patent 6760819 was granted to Sang Hoo Dhong and four others on July 6, 2004 with IBM as the assignee under the title "Symmetric multiprocessor coherence mechanism".

This patent concerns cache coherency mechanisms. While less specific to Cell than the distributed DMA address translation (U.S. patent 6779049), it again reveals IBM's preoccupation with efficient multiprocessor coherency mechanisms during the Cell design period.

The abstract describes the patent as involving a mechanism to reduce "the number of coherency busses" associated with "snoop resolution and coherency operations" on a multi-level processor cache. In this approach, the L2 cache contains a copy of the L1 cache tags which permits both sets of tags to be snooped at the L2 cache in one operation, eliminating the need for an L1 cache-coherency bus. The abstract also states that "updates to the coherency states of the copy of the L1 directory are mirrored in the L1 directory and L1 cache" without describing how this information is conveyed between the caches; perhaps on an unnamed bus of less complexity than the coherency bus replaced.
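The idea of snooping both tag sets at the L2 can be sketched as a toy model. Everything here is an assumption made for illustration: the class and method names are hypothetical, the caches are modelled as plain dictionaries, and the mirroring step is a direct assignment because the abstract does not say how the hardware conveys it.

```python
class CacheHierarchy:
    """Toy model: the L2 keeps its own tags plus a copy of the L1 tags."""

    def __init__(self):
        self.l1_tags = {}        # real L1 directory: line address -> state
        self.l2_tags = {}        # L2 directory
        self.l1_copy_at_l2 = {}  # copy of the L1 directory held at the L2

    def fill(self, addr, state="S", in_l1=False):
        """Install a line in the L2 and optionally in the L1 as well."""
        self.l2_tags[addr] = state
        if in_l1:
            self.l1_tags[addr] = state
            self.l1_copy_at_l2[addr] = state

    def snoop_invalidate(self, addr):
        """One snoop at the L2 resolves both levels; no L1 snoop bus is modelled."""
        hit_l2 = addr in self.l2_tags
        hit_l1 = addr in self.l1_copy_at_l2
        if hit_l2:
            self.l2_tags[addr] = "I"
        if hit_l1:
            self.l1_copy_at_l2[addr] = "I"
            # Mirror the state change into the real L1 directory. How the
            # hardware conveys this is not stated in the abstract.
            self.l1_tags[addr] = "I"
        return hit_l2, hit_l1


caches = CacheHierarchy()
caches.fill(0x1000, in_l1=True)
caches.fill(0x2000)
result_both = caches.snoop_invalidate(0x1000)
result_l2_only = caches.snoop_invalidate(0x2000)
result_miss = caches.snoop_invalidate(0x3000)
```

The point of the sketch is that `snoop_invalidate` never consults `l1_tags` to decide whether the L1 holds the line; it reads only the copy kept at the L2.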

It is not clear whether the Cell implementation exploits this technique. IBM was engaged in the design of other cores during the same period (such as the PowerPC cores of the Xbox 360 processor) which might exploit it instead. If it proved workable in practice, IBM may well have adopted it for its subsequent Power core designs featuring multi-level caches.

U.S. 6,820,142: token based DMA

United States patent 6820142 was granted to Peter Hofstee and two others on November 16, 2004 with IBM as the assignee under the title "Token based DMA".

This is another patent pertaining to the problem of orchestrating multiple cores each with their own DMA unit. Without governance, the DMA controllers are capable of overwhelming available resources leading to contention or starvation effects.

In this approach, a "master controller" grants tokens to the processing elements to access "the shared memory for a particular duration of time at a unique deterministic point in time". In this patent, the emphasis is on determinism, a system characteristic most important from the real-time performance perspective. However, this same technique also serves concepts related to determinism, such as priority and fairness.
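A minimal sketch of such a scheme follows. It assumes a round-robin master controller handing out fixed-length time slots; the patent text quoted above does not specify the policy, so the rotation, the slot length and the function name `grant_tokens` are all assumptions for illustration.

```python
from itertools import cycle


def grant_tokens(elements, slot_cycles, n_slots):
    """Round-robin master controller: one token per fixed-length time slot.

    Returns a list of (element, start_cycle, end_cycle) tuples. Each token
    gives one element exclusive use of shared memory for one window.
    """
    schedule = []
    owners = cycle(elements)
    for slot in range(n_slots):
        start = slot * slot_cycles
        schedule.append((next(owners), start, start + slot_cycles))
    return schedule


schedule = grant_tokens(["SPE0", "SPE1", "SPE2"], slot_cycles=100, n_slots=6)
for owner, start, end in schedule:
    print(f"{owner}: cycles {start}-{end}")
```

Because the schedule depends only on the element list and the slot length, each element can know in advance exactly when its window opens, which is the deterministic property the patent emphasizes. A weighted rotation would express priority in the same framework.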

In the actual Cell design, this basic mechanism is greatly refined. The system is partitioned into resource allocation groups (RAGs) down to the granularity of individual memory banks on the XDR memory devices. Note that the use of reservation policy is optional from the perspective of an individual SPE. This mechanism exists to aid the SPE cores in sharing resources effectively if they choose to do so. Untrusted code running on an SPE core is capable of ignoring the "master controller" token mechanism and swamping other elements on the EIB with nuisance transactions.

U.S. 6,785,841: redundant elements

United States patent 6785841 was granted to Chekib Akrout and two others on August 31, 2004 with IBM as the assignee under the title "Processor with redundant logic".

The patent bears directly on Cell from the cost-to-manufacture perspective. The last sentence of the History of Related Art declares "Thus, for conventionally designed processor chips, redundancy has typically not been used with great success. It would be desirable, therefore, to design a processor device with cost effective redundant elements".

The direct bearing on Cell is revealed by the statement "Each of the attached processors may comprise a single instruction multiple data (SIMD) processor such as a vector processor or an array processor", which characterizes the SPU design exactly.

In the design of the Cell EIB, each bus participant has an element identification number. The Summary of Invention concludes "Disabling the non-functional processor may include altering the information in the attached processor ID register while enabling the redundant processor may include programming the processor ID of the redundant processor to the value of the non-functional processor. Disabling the non-functional attached processor may further include electrically disconnecting the attached processor such as by destroying one or more fuseable links."
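The ID substitution described in that passage can be sketched in a few lines. This is a toy model with hypothetical names: physical units are dictionary keys, the ID register is a dictionary value, and blowing a fuse is modelled as clearing a boolean.

```python
def remap_defective(id_registers, enabled, defective, spare):
    """Give the spare unit the logical ID of the defective one.

    id_registers maps physical unit -> logical processor ID (or None).
    enabled maps physical unit -> whether the unit is electrically connected.
    Both inputs are left untouched; updated copies are returned.
    """
    ids = dict(id_registers)
    on = dict(enabled)
    ids[spare] = ids[defective]  # spare answers to the failed unit's ID
    ids[defective] = None        # alter the ID register of the failed unit
    on[defective] = False        # stand-in for destroying a fuseable link
    on[spare] = True
    return ids, on


# Eight working slots plus one spare (unit 8), with unit 3 found defective.
id_registers = {unit: unit for unit in range(8)}
id_registers[8] = None
enabled = {unit: unit < 8 for unit in range(9)}

ids, on = remap_defective(id_registers, enabled, defective=3, spare=8)
active_ids = sorted(ids[u] for u in ids if on[u])
```

After the remap, software still sees logical IDs 0 to 7, although ID 3 is now served by a different physical unit.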

Cell chips sold into the PS3 market will contain seven functional SPE cores. Some sources claim that processors sold into the consumer appliance market, such as HDTV sets made by Toshiba, will have six functional SPE cores. Cell chips with eight functional SPE cores are high-graded into the scientific workstation market.

Note that while the SPE cores are identical one to another, the location of the SPE cores relative to other participants on the EIB does have an impact on the timing and efficiency of EIB transactions; this procedure for eliminating defective SPE cores is not absolutely transparent to the performance of the device in all cases.

U.S. 6,839,828: selective scalar subpath

United States patent 6839828 was granted to Michael Gschwind and two others on January 4, 2005 with IBM as the assignee under the ponderous title "SIMD datapath coupled to scalar/vector/address/conditional data register file with selective subpath scalar processing mode".

On one level this patent says that if you have four beer taps (vector pub) and a customer orders only one beer (scalar customer), you fill the order by using only one beer tap (selective subpath). The rest of the patent governs how to keep the beer cold in the three inactive kegs without burning unnecessary power on refrigeration, which claim 3 depicts as "The processor of claim 1, further including a power savings unit which disables functional units not used for processing a given instruction."

Surprisingly, this is more difficult than it sounds. In one of IBM's early disclosures about Cell, Dr Hofstee admitted that the circuitry involved in managing this decision consumed almost as many resources as it saved, so the designers ended up discarding the mechanism in the inaugural device. The technique might show up again in a future Cell design if the balance of cost to savings improves.

The question arose because of an unusual feature of the SPU processor: the unified register set where all registers are 128 bits wide. For many reasons, these registers are often used to hold word scalars, ignoring the other three vector positions. As an example, all address calculations on the local store are performed using word scalars. In theory it ought to save power to disable execution units for the disused vector elements when operating on scalar values. In practice, for the first Cell device, it did not.
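The following sketch illustrates the idea of a scalar operation on a four-lane register in which only one lane does any work. It is a toy model, not the patented datapath: placing the scalar in lane 0, returning zero from disabled lanes, and the function names are all assumptions made for the example.

```python
LANES = 4            # four 32-bit word slots in one 128-bit register
MASK32 = 0xFFFFFFFF  # results wrap at 32 bits


def simd_add(a, b, enabled=(True, True, True, True)):
    """Add two registers lane by lane, skipping lanes whose unit is disabled.

    Returns the result register and the number of lane units that did work.
    Disabled lanes yield 0 here, which is an arbitrary choice for the sketch.
    """
    result = [((x + y) & MASK32) if on else 0 for x, y, on in zip(a, b, enabled)]
    return result, sum(enabled)


def scalar_add(a, b):
    """Scalar mode: only lane 0 is powered (the lane choice is an assumption)."""
    return simd_add(a, b, enabled=(True, False, False, False))


# A local store address calculation: base + offset, held as word scalars.
base = [0x1000, 0, 0, 0]
offset = [0x40, 0, 0, 0]
address, units_used = scalar_add(base, offset)

vector_sum, vector_units = simd_add([1, 2, 3, 4], [10, 20, 30, 40])
```

In this model the scalar add touches one lane unit instead of four. The sketch deliberately omits the control logic that decides which lanes to disable, which is the part IBM reportedly found too costly in the first device.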

See Also

See also United States patent 6,785,841 which explains how to designate a beer tap that doesn't work (defective element) as being "out of service".

U.S. 6,865,631: reduction of RPC interrupts

United States patent 6865631 was granted to Peter Hofstee and Ravi Nair on March 8, 2005 with IBM as the assignee under the title "Reduction of interrupts in remote procedure calls".

Again we have a patent centered around Cell's novel DMA structure. Not every method of coordinating the SPE cores would be characterized as a remote procedure call (RPC). This concept arises most naturally when the SPE cores are orchestrated by the microtasking kernel approach. Each SPE microtask can be regarded as a procedure call which binds the code and data together.

In prior art, if the subordinate processors receiving the remote procedure call are implemented as separate chips (which is common), the normal mechanism to alert the master process of task completion is to signal an external interrupt. Some processors are specially designed to feature lightweight interrupt handlers taking perhaps a few dozen instructions to handle a simple interrupt request. Other processors, especially very fast processors, are much less agile in handling interrupts; the cost can run to hundreds of processor cycles. It is likely that the Cell PPE, a very fast processor, would take a relatively large performance hit from each interrupt handled. For this reason, as IBM concludes the Background Information section, "It would therefore be desirable to develop an SMP system where the APU(s) do not interrupt the processing unit upon completion of its task(s) in one or more remote procedure calls."

This patent describes an alternative method in which the DMA controllers track task status by monitoring dedicated completion signals. Presumably each SPE has a completion signal line that runs to the PPE DMA controller, though the implementation at the hardware level is not specified by the patent.

Unlike a fast execution core, a DMA controller is extremely agile. Even with many DMA requests enqueued, the DMA controller is usually rate limited by available bandwidth or concurrency. In the Cell design, each DMA controller can have a maximum of two transfers active concurrently; any other enqueued requests are stalled. It is not especially challenging to have the DMA controller monitor signal lines (which the patent strangely depicts as "polling") and awaken dormant transfer requests within the request queue once the signal is received.
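The behaviour described above can be sketched as a toy queue. Only the limit of two concurrently active transfers comes from the text; the class name, the one-step transfer duration and the signal names are assumptions made for illustration, and no real Cell programming interface is modelled.

```python
from collections import deque


class DmaQueue:
    """Toy DMA request queue with at most two concurrently active transfers."""

    MAX_ACTIVE = 2

    def __init__(self):
        self.pending = deque()  # (request name, signal to wait for or None)
        self.active = []        # requests transferring during this step
        self.signals = set()    # completion signals seen so far
        self.done = []          # retired requests, in completion order

    def enqueue(self, name, wait_for=None):
        self.pending.append((name, wait_for))

    def signal(self, line):
        """A completion signal arrives; no processor interrupt is raised."""
        self.signals.add(line)

    def step(self):
        # Retire active transfers (each takes one step in this toy model).
        self.done.extend(self.active)
        self.active = []
        # Start eligible requests in queue order, up to the concurrency limit.
        still_waiting = deque()
        while self.pending:
            name, wait_for = self.pending.popleft()
            ready = wait_for is None or wait_for in self.signals
            if ready and len(self.active) < self.MAX_ACTIVE:
                self.active.append(name)
            else:
                still_waiting.append((name, wait_for))
        self.pending = still_waiting


q = DmaQueue()
for name in ("a", "b", "c"):
    q.enqueue(name)
q.enqueue("fetch-result", wait_for="spe3-done")

q.step()
first_active = list(q.active)
q.step()
q.step()
stalled = [name for name, _ in q.pending]
q.signal("spe3-done")  # the SPE task finishes
q.step()
woken = list(q.active)
q.step()
```

The dormant `fetch-result` request sits in the queue until the signal arrives and is then started by the controller itself, so the processor that enqueued it never services an interrupt.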

This is a highly technical patent which spells out the gory details involved. The most complex scenario presented is where the RPC completion events are distributed into multiple DMA queues. This is an elaborate scenario where a PPE thread delegates a microtask to an SPE core and then delegates the completion logic to the SPE's local DMA controller.

Fundamentally this patent describes an esoteric synchronization mechanism appropriate to a design rich in integrated DMA controllers. The solution does not exist in prior art because the problem did not exist in prior practice. That is not an offhand remark: it is quite possible that the Cell architecture's DMA-centric design was partially motivated by the opportunity to create so many strange new problems with patentable solutions such as this patent describes.

U.S. 6,924,802: SIMD function interpolation

United States patent 6924802 was granted to Gordon Fossum and three others on August 2, 2005 with IBM as the assignee under the title "Efficient function interpolation using SIMD vector permute functionality".

Unlike the majority of patents pertaining to Cell, this patent does not involve the hardware as such. IBM had Cell in mind as a visualization workstation. As a result, they explored ways to exploit Cell to its best advantage. As IBM notes in their Description of Related Art, "Unfortunately, a major bottleneck exists in the calculation and estimation of functions that generate the visual display data. An advance in the calculation and estimation of functions that generate the visual display data would allow for substantial improvement in visual display system performance" citing "representative examples" as including "sin(x), cos(x), log2(x) and exp2(x)" while noting that "many others are involved in the calculation of visual display data."

This patent describes a way to exploit the powerful vector permute instruction in the VMX and SPU instruction sets to manage coefficients to interpolate these functions more rapidly. The patent therefore pertains to running a certain kind of software program on the Cell hardware.

The vector permute instruction did not originate with Cell. It was part of the original AltiVec instruction set designed by Apple Computer, IBM and Motorola (the AIM alliance). The AltiVec trademark is owned solely by Motorola. When the technology sharing agreement ended, IBM did not license any AltiVec technology from Motorola. Instead, IBM reverse-engineered the AltiVec instructions and included them in a larger instruction set termed VMX from which the Cell SIMD instruction sets derive.

What IBM recognizes in this patent is not the originality of vector permute but a particular use of this instruction to enhance the speed of visual presentation.
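To show the general flavour of permute-based coefficient selection, here is a sketch of piecewise-linear interpolation in which a simulated byte permute gathers the coefficients for four inputs at once. It is a generic illustration and should not be read as the method claimed in the patent: the `vperm` function only imitates a 32-byte table lookup in the style of AltiVec, and the eight-segment linear scheme, the table layout and all function names are assumptions.

```python
import struct

SEGMENTS = 8  # eight 4-byte coefficients fill exactly two 16-byte vectors


def vperm(a: bytes, b: bytes, control: bytes) -> bytes:
    """Simulated byte permute: each control byte picks one of the 32 bytes of a+b."""
    table = a + b
    return bytes(table[c & 0x1F] for c in control)


def build_tables(f):
    """Slopes and intercepts of f on eight equal segments of [0, 1), as float32."""
    slopes, intercepts = [], []
    for k in range(SEGMENTS):
        x0, x1 = k / SEGMENTS, (k + 1) / SEGMENTS
        m = (f(x1) - f(x0)) / (x1 - x0)
        slopes.append(m)
        intercepts.append(f(x0) - m * x0)
    return struct.pack(">8f", *slopes), struct.pack(">8f", *intercepts)


def interpolate4(xs, slopes, intercepts):
    """Evaluate four inputs in [0, 1) with one permute per coefficient table."""
    control = bytearray()
    for x in xs:
        k = min(int(x * SEGMENTS), SEGMENTS - 1)    # segment index, 0..7
        control += bytes(range(4 * k, 4 * k + 4))   # the 4 bytes of entry k
    control = bytes(control)
    m = struct.unpack(">4f", vperm(slopes[:16], slopes[16:], control))
    c = struct.unpack(">4f", vperm(intercepts[:16], intercepts[16:], control))
    return [mi * x + ci for mi, x, ci in zip(m, xs, c)]


slopes, intercepts = build_tables(lambda x: 2.0 ** x)
xs = [0.0, 0.3, 0.55, 0.99]
approx = interpolate4(xs, slopes, intercepts)
exact = [2.0 ** x for x in xs]
```

The permute replaces four separate table lookups with a single data-dependent shuffle, so the coefficient fetch needs no branches or scalar loads. With only eight linear segments the approximation of exp2 is coarse; a production routine would use more segments or higher-order polynomials.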

The vector permute instruction can be viewed as a kind of mathematical function, as can the display functions which it is used to interpolate. To the extent that this invention is purely mathematical in nature, this type of patent would be regarded as controversial. Nevertheless, the U.S. Patent Office grants many patents of this nature irrespective of how well they might or might not hold up in court.

It is, however, somewhat unusual to see the use of a computational primitive patented as such.