Talk:Cell microprocessor/1


Power Saving Capability

See above. Does the Cell have this? If so, to what extent? I am interested in this point but couldn't find it addressed. I think it's an important topic, and even if the Cell does not feature something like that, it should be mentioned somewhere. Thank you. Karoschne 23:25, 11 February 2006 (UTC)

Cell in the PS3

There are a number of recent news articles on the web which claim the Cell will be used in the PS3, as Sony originally intended. Unless we can find an official statement from Sony I will assume that the authors of those claims have gotten their info from a quick web search, where the PS3 Cell claim can be found in many places. [1] says clearly that Cell will not be used in the PS3. Thue | talk 21:34, 10 Jul 2004 (UTC)

I called Sony, and they say it will indeed use the Cell chip. Thue | talk 09:37, 21 Jul 2004 (UTC)


Mentioned in the patent but not in the text: The individual APUs are apparently shielded from one another, and so can be used (securely) for downloadable codecs or other bits of code.

256 kilobit or kilobyte memory?

Does each SPU contain 256 kilobits or kilobytes of memory?

Kilobytes (KB). Each SPE has 256 KB of local store.

Moved

I moved this article from Cell chip to Cell (microprocessor) to match standard naming conventions. -- uberpenguin 14:05, 2005 Apr 12 (UTC)

TRON

I have thought about how much Cell (or at least the vision of computing's future as spun by IBM and Sony reps) resembles the TRON Project. Both envision a future where computing has largely returned to being a utility, and both are designs to facilitate a quasi-intelligent network of embedded computers all working in tandem to make the greatest use of their combined resources. Anyone else have any thoughts on this? I'm going to get a few references together and perhaps include some information about these similarities in this Cell article. -- uberpenguin 00:42, 2005 Apr 19 (UTC)

The Cell is a microprocessor (or rather an 8-way microcluster on a chip with a master general CPU), and TRON is an operating system. I don't see any problem with porting TRON to the Cell, but I think Cell would work better with a system specially designed for this chip. As for the similarity to a network of embedded computers - I don't know. Anyway, personally I see Cell rather as a powerful digital signal processor able to process 3D graphics, decompress MPEGs, process audio and video in real time, and be the CPU for supercomputers and clusters, but not as a way to use the processing power of my TV and electronic musical instruments when I need it for some raytracing. -- DariuszT 23:40, 19 Apr 2005 (UTC)
Erm... I was referring to the overall visions of Cell and TRON, not their limited and specific applications. Ken Sakamura designed TRON to be a 'total computing platform,' not just an Operating System (and in actuality, TRON is NOT an OS nor a kernel; it is a platform that includes specifications for several kernels like ITRON, MTRON, CTRON, etc). When you read the ideas of designing a collaborative computer out of several embedded Cell devices, it really smacks of Sakamura's papers on TRON. Also, it is somewhat fallacious to call Cell a DSP; it much more resembles an aggressive vector architecture (a very rough comparison is to SX-6). Anyway, in your last sentence you described perfectly what IBM's vision for Cell is and what Ken Sakamura's vision for TRON was 20 years ago. -- uberpenguin 01:35, 2005 Apr 20 (UTC)

Linux port

Seems like there are Cell-related activities on the Linux mailing list. [2]

"The three companies are now doing a final review of a Cell architecture specification that could be released to software developers by the end of May. It will include details of more than 200 new instructions used in the specialized cores inside Cell. The group also plans to release open-source software libraries for Cell as early as this fall." [3]

Dynamic Branching

Can someone find a source on Cell that says it lacks a dynamic branch predictor on the SPE subcores?

Answer: SPE and PPE are "in-order" processing units. In-order execution typically means no branch prediction. ~ Niall Sandy 11:53pm April 5, 2006

Answer #2: An in-order core tends not to support speculative execution, though many early processors would drop partially executed instructions on the floor to recover from a late detected fault of one kind or another. There's no reason an in-order core can't predict branches dynamically to hint the instruction prefetch logic, without getting ahead of itself by executing those instructions prior to confirming whether the branch was taken/not-taken and correctly/incorrectly predicted. However, the Cell SPUs do no such thing. They support a branch hint mechanism under program control. Because of the way that memory accesses take priority over instruction line fetches, a program with a very heavy stream of memory operations can starve instruction prefetch leading to a big stall when the instruction stream runs dry. It's sometimes necessary for the program stream to issue an explicit ifetch cycle. This was all in the reference which allowed me to deduce VMX floating point capabilities. I also read something about the design of the IBM XL compiler out of Toronto I think with respect to automatic VMX/SPU code generation that touched on these subjects. MaxEnt 05:51, 9 April 2006 (UTC)
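For what it's worth, the program-controlled hinting described above can be driven from C. A minimal sketch, assuming a GCC-style SPU toolchain (e.g. spu-gcc) where __builtin_expect tells the compiler which way a branch usually goes so it can place the branch hint well ahead of it; the function itself is made up for illustration:

 /* Sketch: steering the SPU's software branch-hint mechanism from C.
  * The SPU has no dynamic (hardware) branch predictor, so the compiler
  * uses hints like this to set up instruction prefetch for the likely path. */
 int sum_positive(const int *v, int n)
 {
     int sum = 0;
     for (int i = 0; i < n; i++) {
         if (__builtin_expect(v[i] > 0, 1))  /* hint: usually taken */
             sum += v[i];
     }
     return sum;
 }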

Hypervisor

Where did this information come from? The ability to run multiple OSes simultaneously really isn't a hardware function, and there is no indication that anything within Cell specifically enables this (for example, the POWER-based AS/400s have been able to do the same thing under a software system called LPAR for several years).

In all honesty this article needs some serious fact checking and stylistic cleanup. I'll start looking at it, so please bear with me and discuss any edits I might make that you don't agree with. -- uberpenguin 20:36, 2005 Jun 25 (UTC)

Check http://www.embedded.com/shared/printableArticle.jhtml?articleID=163702001 for information regarding multiple guest OSes.

Ugh... More marketing drivel from Kutaragi... It seems like the more I read from this guy the more I dislike him. That interview makes it seem like the concept of running multiple OSes on a single computer is new or even Sony's own idea... Anyway, where he is talking about this "multiple OS" capability it's as if he's referring to the software and hardware interchangeably, which on its own throws the technical value of that interview into question. In any case, whatever his intentions are it's pretty obvious that this capability is indeed provided in software at some level or another. Do you have any source that provides more accurate details? The article you cite is basically some ranting from Sony's top PR person (aka the CEO) and can hardly be used to create a technically accurate article. That interview alone provides no technical details and suggests that this capability is a software feature anyway (which means that it would not belong in this article). If we can't find some better sources, I don't see any reason to keep this section in the Cell article. -- uberpenguin 01:57, 2005 Jun 27 (UTC)

I don't have any other articles. Maybe we (you) should get rid of the hypervisor section. He does say in the article that no one will have low-level access to the processor like they did with the Emotion Engine, so the Cell might not be as fast as people think it will be. So, we could at least keep the level 0,1,2 in another section. I totally agree with you; Kutaragi has lost his mind, it must be the age and stress from Microsoft. I do remember reading interviews with him during the PS2 launch and he seemed like such a modest guy back then. I don't know whether the multiple-OS capability is built into the hardware or software. Probably software. I just assumed it was the hypervisor from IBM; check out this news PR link http://www-03.ibm.com/chips/news/2004/1129_cell1.html

I understand the logical connection to Hypervisor, but Hypervisor is simply IBM's rebranding of LPAR, which works in conjunction with TIMI on the AS/400s. Even if Cell used a similar system (fairly unlikely that it would use Hypervisor proper) it would most certainly be a software function as is LPAR on the AS/400s. I'm tempted to include this information, but frankly the interview was pretty bad and contained very few useful technical details. I'm removing the section for the time being, but I'll watch for more information on this alleged function and add the section back if more clarification becomes apparent. For now, though, this seems like some nifty software feature that has little to do with the actual Cell microprocessor. -- uberpenguin June 28, 2005 13:45 (UTC)
Many recent IBM PPCs have hardware LPAR support. See IBM's PowerPC Operating Environment Architecture Book III Version 2.02 at http://www-128.ibm.com/developerworks/eserver/library/es-archguide-v2.html Unfortunately, you've already edited this stuff off the page.
Yes, but the documents you are referring to describe POWER4 and POWER5, not cores like PPC970. There is a definite difference between IBM's POWER and PowerPC lines. Cell's PPE most closely resembles the PPC970 core, not a POWER core. Regardless, there isn't nearly enough information on Cell yet to know whether or not its PPE has any of the hardware support for LPAR. Again, it doesn't particularly matter since LPAR / TIMI are part of a soft ISA, not a physical one. -- uberpenguin 03:51, 15 October 2005 (UTC)
The hypervisor is supported in hardware as an extra privileged state. This is in the CBEA (Cell Broadband Engine Architecture) documentation. The difference between POWER and PowerPC is essentially one of branding; IBM now refer to both as "Power Architecture" and both utilise the PowerPC instruction set. The PPE is described as a "Power" core but is a completely new design; it is, however, user-level binary compatible with other POWER/PowerPC cores. The 970 core is quite different from the PPE; it's a modified version of a POWER4 core. - N.Blachford 18th Nov 2005

Local Memory is NOT Cache

Uberpenguin, cache and local memory are NOT one and the same. Cache is local memory with additional control logic that automatically moves data. The local memory on the SPEs has no such control logic, and data must be specifically moved by the programmer. While it may act like cache, local memory is NOT cache. There is a HUGE difference. I believe you missed this distinction in the PlayStation 3 article also. the1physicist 07:30, 26 Jun 2005 (UTC)

And I believe you are making a distinction where there is not one. Since when has the term "cache" necessarily implied branch prediction and the state machines that superscalar processors use to copy data from instruction memory? The "local memories" for each SPE fetch instruction data from main memory in 1024-bit blocks, correct? Does this not sound like the function of cache to you? I don't know of too many sane general definitions for cache which don't cover the local memory in the SPE... These local memory areas cache data for their respective SPEs; as far as I'm concerned, memory that caches data can fairly be called "cache." If you want to make a distinction from the L1 and L2 caches in an average superscalar design, then I can understand, but I see no reason why the local memories cannot be called cache. -- uberpenguin 22:15, 2005 Jun 26 (UTC)
Some clarification. I want to be absolutely clear that I'm NOT implying that the local memory/cache in the Cell's SPEs is identical in operation to superscalar CPU cache. From what I understand these SPE memories are addressable areas, therefore differing from most superscalar CPU caches which are largely transparent to software. You might notice, though, that I keep using the term superscalar specifically. It's not uncommon to see similar caches to those of the SPEs in Cell in vector architectures. The SX-6 provides similar areas (which NEC does call "cache," IIRC) and even the PlayStation 2's emotion engine has similar addressable memory areas that are "scratch memory" for the vector units.
So unless you have other objections, I feel that the issue is thus: I contend that the SPE memory can rightly be called cache due to its function, you contend that it cannot because the method by which that function is performed differs from the method common to superscalar CPU cache. Is this an accurate summary of the issue at hand? -- uberpenguin 22:32, 2005 Jun 26 (UTC)
That is an accurate description of the matter at hand. I think the more important thing is that the people who developed Cell (Sony, Toshiba, and IBM) refer to it as local memory. To call it cache would deny the fact that there is a distinction. I do believe we should take a consensus on this, so everyone cast your vote. Anyhow, given the implementation differences and the nomenclature of STI, I vote for local memory. the1physicist 23:39, 26 Jun 2005 (UTC)
Something else I just thought of: if we don't happen to reach an agreement one way or the other, do you think it would be appropriate to call it local memory and in parentheses say that it functions like cache? the1physicist 23:44, 26 Jun 2005 (UTC)
To be perfectly honest, I don't think there are many people that will respond to this dialogue. So if we cannot reach a consensus, I don't object to calling it local memory if you can show some official stuff from Sony/IBM/Toshiba that specifically refers to it as local memory. Since it is a matter of semantics, we might as well default to the designers' version. -- uberpenguin 23:48, 2005 Jun 26 (UTC)
Alrighty then, here ya go. [4] This is a pdf document written by one of the Cell designers. He uses the terms 'local memory' and 'local storage' somewhat interchangeably, but he never calls it cache. the1physicist 05:08, 27 Jun 2005 (UTC)
A cache is designed to 'cache' frequently used data in a very fast memory near the CPU. Indeed, data transfer from RAM to CPU is in general rather slow, so such a caching system makes the system faster. This means that the cache duplicates data found in RAM. If data is modified in the cache, it'll have to be written back to RAM. A cache may be called local memory, but a local memory is not always a cache. The local memory of an SPE is filled using spufs (see recent references). In fact the memory of the SPE can be seen as a file in the Unix FS. You can read and write to it, then execute its content. They also say there is no page fault, so I suppose the SPE is restricted to its local memory. Therefore, as you can see, it's definitely not a cache. Debackerl
Again, I think you miss the point that it's a cache by function, not by similarity to superscalar CPU instruction cache. I've already stated my case above... I don't see your point regarding spufs; similar filesystems have been developed for several other types of user-addressable memory. Nobody will win a semantics argument, and I already agreed to calling it local memory just by the merit of the designers calling it that. -- uberpenguin June 28, 2005 13:41 (UTC)
It's not a cache DUE to its FUNCTION; its function is to store the program and data. The function of a cache is: "a cache (pronounced kăsh) is a collection of data duplicating original values stored elsewhere or computed earlier, where the original data are expensive (usually in terms of access time) to fetch or compute relative to reading the cache. Once the data are stored in the cache, future use can be made by accessing the cached copy rather than refetching or recomputing the original data, so that the average access time is lower." as the Wikipedia article about 'cache' says. I spoke about spufs because it explains how the local memory is used. Also, a CPU could run without cache; it would be slow, but it'd read everything from RAM all the time. The SPE cannot run without local memory because its program is stored there, but maybe nowhere else. If you have a really big program, I suppose some library or compiler will automatically use spufs to update the local memory's content. (Maybe I misunderstood the article about spufs.) With so many differences I wonder how it could be the same. Debackerl
Superscalar CPU caches cache instructions and data. The local memory of the SPEs is most often used to contain instructions and data for their respective SPE. The function is very similar; the means by which it is accomplished differ.
In any case you are only arguing silly semantics, and I have already changed the article to reflect the local memory phrasing and have pointed out that the local memory does not operate like a common superscalar CPU cache. Spufs is a Linux-specific pseudo-filesystem that only simplifies the access to SPE resources (as procfs greatly simplifies access to certain kernel information). I think you might be confused about some semantics here, but regardless I have backed off using the term cache since it seems to cause confusion... 'Nuff said. -- uberpenguin June 28, 2005 15:27 (UTC)
"silly semantics" my computer science teacher may give us a 3/20 for that. I know you left the sementic. I just wonder why a teletype and a keyboard would be the same for you. Similar function, but not the same device! I know spufs is only an implementation, but even if it would have been implemented as syscalls, noone would ever write to a CPU cache, with the SPE you're forced to. No mater it's using a FS or so. Debackerl
Well that sounds typical since the majority of undergrad CS degrees these days are nothing but programmer training and silly semantics anyway. This is a pointless discussion, the issue was already resolved, so why do you insist on arguing it with me? If you want to believe you are in the right so you can sleep better at night, be my guest. You claim that nobody would ever write to a CPU cache, but perhaps you ignored my comments about vector processor cache earlier. That cache is indeed user-writable and is called cache by its designers. I'm not the only one who feels this way either; look at Hannibal's response to the Blachford article on Ars Technica. He takes a similar stance to mine. This is indeed an issue of silly semantics because the SPE local memory IS cache from a valid POV, and IS NOT cache from another valid POV. You just would like to believe that all things that are called cache are implemented in the same way as a superscalar CPU cache. Have you even worked with any vector computers before? Do you even realize that Cell more or less IS a vector computer? -- uberpenguin June 29, 2005 11:52 (UTC)
Well, in the master's degree at my university, the trouble is that we are not trained enough in coding. We do a lot of theoretical computing, maths, and so on. I recommend you change the cache article to speak about those two caches (conventional, and vector computer). I'm not the only one continuing the conversation; you do it too ;) My goal is not an agreement about the article, just to find out what's wrong (not who is wrong). That's why I think you should edit the cache or CPU cache article. It's useless to argue if you do not want to promote the 2nd definition on Wikipedia.

Debackerl
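For later readers: whatever we call it, the practical difference argued above is easy to show in code. A minimal sketch of SPE-side code filling its local store explicitly, assuming the MFC intrinsics from the IBM SDK's spu_mfcio.h (mfc_get plus the tag-status calls); treat the exact names as a recollection of the SDK, not gospel:

 #include <spu_mfcio.h>

 /* Unlike a hardware cache, nothing lands in SPE local store by itself:
  * the program issues a DMA from an effective address in main memory
  * and then waits for the transfer tag to complete. */
 #define TAG 3

 static float buf[1024] __attribute__((aligned(128)));  /* lives in LS */

 void fetch(unsigned long long ea)  /* main-memory effective address */
 {
     mfc_get((void *)buf, ea, sizeof(buf), TAG, 0, 0);  /* start DMA in */
     mfc_write_tag_mask(1 << TAG);                      /* watch our tag */
     mfc_read_tag_status_all();                         /* block until done */
     /* buf[] is now valid; no control logic moved it automatically. */
 }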

Regarding SpuFs

Can someone read and summarize this source? http://www-128.ibm.com/developerworks/linux/library/pa-cell/index.html


Major edits

I've just made a batch of fairly major edits to the article. In its former state it was quite messy, redundant, and included too much information. I've tried to condense it down a bit without removing anything important. I'm sure some folks will contest some of the changes I've made, so please discuss any issues you may have here rather than just starting to revert. I plan to make further clarifications and stylistic improvements later on. Specifically the references and external links section could use some sorting through and cleaning. Right now the article is highly over-referenced; the same reference might be pointed to three times in a paragraph, which is confusing and messy. Also there are still a few sections that are somewhat dodgy and need to be re-checked for factuality. Remember that this isn't a place for speculation and assumption, nor is it a place for dumping every tidbit possible on this highly publicized subject.

Thanks for your patience in helping to clean up this article and get it up to Wikipedia standards. -- uberpenguin June 28, 2005 15:32 (UTC)

Top500-statement is crap!

"IBM expect to make them run at 3Ghz giving 200 GFLOPS per CPU (or 400 GFLOPS per boards), and to put seven boards in a single rack for a total performance of 2.8 TFLOPS. This is equivalent to the 70th supercomputer in the TOP500 List [...]."

I read those dumb statements all over the web, now even in Wikipedia. :-( Despite the fact that these 200 GFLOPS are highly theoretical, it is a SINGLE PRECISION figure! The Top500 list is based on LINPACK, which is computed in DOUBLE precision! Cell's DP performance is about 10 times less than its SP figure, so could someone please edit that part of the article? You might have noticed that my English is not very good, so I won't do this change... -- Banshee!

I agree wholly, but it's like pulling teeth to make even the most minor changes to this article, much less pull something like that. I'm totally in favor of removing the statement altogether, but I won't do it without a little more input from the people that watch this article. People? -- uberpenguin June 30, 2005 23:16 (UTC)
I checked the source code of LINPACK; it can run in either single or double precision mode. The only page I found on TOP500 about our problem is http://www.top500.org/reports/1994/benrep3/node27.html, which is 10 years old. Debackerl

In any case, adding up the peak performance of all FPUs does not give the LINPACK rating: otherwise, why bother with LINPACK at all? LINPACK, like all other realistic benchmarks, depends on the performance of the whole system, FPUs, pipelines, caches, memory interfaces, cluster interconnects, compiler code quality... -- Karada 1 July 2005 10:38 (UTC)

Correct, that's what I meant by "highly theoretical". Although I have to add that LINPACK barely depends on memory interfaces or cluster interconnects, which is a big problem and makes the Top500 guys (AFAIK) think about other benchmarks. You just can't get all aspects of a high-performance computer into one simple figure (although every manager wants to have that; it fits their simple world) - it's like choosing a car by its maximum speed... But that is a whole different topic. And btw - it is really hard to find LINPACK facts on top500.org, but if IBM is a credible source, have a look at http://www-1.ibm.com/servers/eserver/literature/pdf/linpack.pdf as they mention the double precision fact. I think deleting that whole statement would definitely be a good idea; it is just not professional. --Banshee! 1 July 2005 15:08 (UTC)
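To put rough numbers on the gap being discussed, using figures quoted elsewhere on this page (25.6 GFLOPS SP per SPE and roughly 1.83 GFLOPS DP per SPE at 3.2 GHz), a back-of-envelope sketch, not a benchmark:

 #include <stdio.h>

 /* Rough SP vs DP peak for the eight SPEs at 3.2 GHz, using per-SPE
  * figures quoted elsewhere on this page; approximate by construction. */
 int main(void)
 {
     double sp = 8 * 25.6;  /* GFLOPS, single precision */
     double dp = 8 * 1.83;  /* GFLOPS, double precision */
     printf("SP %.1f vs DP %.1f GFLOPS, ratio ~%.0fx\n", sp, dp, sp / dp);
     return 0;
 }

That prints a ratio of roughly 14x - the order-of-magnitude penalty described above, before LINPACK's system-level effects are even considered.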

Picture - false colors?

Why is the die picture subtitled "false colors"? Any reason for that? BTW I have seen that kind of colors on dies in real life. The colors change when you look at a die from different angles - the reason is dispersion and interference.

The Acronyms

Hey everybody, I'm just really curious about the acronyms stated in the Acronyms section. Can someone with a really big heart define them for me? The thing is, Wikipedia uses REALLY big words that I can't really understand. So can someone break down these definitions for me? Especially the SPE:

  • EIB: Element Interconnect Bus [27]
  • LS: Local Storage (SPE's local memory) [28]
  • MIC: Memory Interface Controller [29]
  • PPE: Power Processor Element [30]
  • SPE: Synergistic Processing Element [31]
  • SPU: Streaming Processor Unit [32]
  • STI: Sony Computer Entertainment Inc., Toshiba Corp., IBM

thx in advance :D KittenKiller 04:31, 13 September 2005 (UTC)KittenKiller

128-bit processor?

I'm not an electrical engineer or anything, but is the Cell a 64-bit or 128-bit processor (think AMD Athlon 64-type bits)? Pardon the ignorance, I'm still learning.

It's both. When dealing with stream/vector processors (and stream processor look-alikes like Cell), it's somewhat useless to slap a general "bit width label" on the thing. -- uberpenguin 19:48, 26 October 2005 (UTC)

it's not "Cell" it's "CBE"

IBM has suddenly started calling Cell "CBE" (or "Cell Broadband Engine," or "Cell BE" -- but never "Cell" alone). What is the policy on honoring the use of this kind of preference for one name over another from a corporation? Redirect ? :)

Cell is a generic name, used by Sony and the press. --Brazil4Linux 09:23, 28 November 2005 (UTC)
Actually, isn't that supposed to be BPA? The PPC kernel fork that was written at IBM cites "Broadband Processor Architecture". On the IBM site they say "CELL - also known as the Broadband Processor Architecture (BPA) - is an innovative solution whose design was based on the analysis of a broad range of workloads...." IBM --Trent Arms 06:36, 7 March 2006 (UTC)

Desktop on a chip

"In other ways the Cell resembles a modern desktop computer on a single chip." What does that even mean? - Shai-kun 12:54, 11 November 2005 (UTC)

I'm not sure exactly how that crept in there, but it's arguable at best and probably misleading... -- uberpenguin 13:57, 11 November 2005 (UTC)
In 2001 or prior to that, when Kutaragi commented on the PS2's sequel, he spoke about supercomputer-like processing power coupled with image sensing ability, which is exactly what happened with EyeToy and HD IP. His comment probably stemmed from the fact that the Cell processor itself can also create 3D graphics.

Blachford Article

I've noticed this pop in and out of this page on a few occasions. Please be aware though that there are two completely different versions of this article. The original was based on the 2002 patent application and is somewhat speculative in various areas. It will appear inaccurate compared to the Cell shown to the public but this is mainly because the Cell design underwent changes after the patent application. The second version (added in July 05) is almost a complete rewrite and is based on the Cell as shown today, it is considerably more accurate and contains little speculation. It was also reviewed prior to publication to ensure accuracy. I will let someone else judge if it should be linked to (I'm the author), however please use the second version if so. If it is considered to be inaccurate please let me know why. - N.Blachford 16th Nov 2005

Help us out with URLs, please ? Wizzy 15:13, 16 November 2005 (UTC)
We try to keep the page as condensed as possible (if you notice, we are constantly removing other links too, not just those to your article) and the current articles linked do a good job of overviewing the architecture already. There are a gazillion articles out there on this thing's microarchitecture, and they can't all be linked here. -- uberpenguin 21:36, 16 November 2005 (UTC)
I'm not worried if it's in or not, just that if it is, the correct version should be used. - N.Blachford 18th Nov 2005

Conflict in PlayStation 3 Specifications

Near the beginning of the article it is mentioned that the PlayStation 3 will have 1 PPE clocked at 3.2 GHz and 7 SPEs (six usable), also at 3.2 GHz. However, under Console videogames the chips are said to run at 2.0 GHz.

The specified clockrate was increased on the design board sometime in 2005. Collabi 23:11, 5 April 2006 (UTC)

Floating Point Performance Claims

The article states that "At 3.2 GHz, each SPE gives a theoretical 25.6 GFLOPS of performance". This seems to directly contradict the previous statement that "An SPE can operate on ... 4 single precision floating-point numbers in a single clock cycle." If an SPE can perform 4 single precision floating point operations per cycle (presumably using a single SIMD instruction), then at 3.2 GHz, its peak performance is 12.8 billion floating point operations per second (GFLOPS), not 25.6. With double-precision floating point calculations, it would be 6.4 GFLOPS.


The SPE can issue two instructions per clock, and has odd/even instruction pipelines. The even pipe handles basic integer / FP arithmetic, and the odd pipe handles permute, load/store and branch instructions. Thus, a maximum throughput of 25.6 GFLOPS, and a maximum arithmetic throughput of 12.8 billion per second. Also, the double-precision units are not optimized like the single-precision units, so utilizing double precision drops throughput by about 10x (an estimated 1.28 billion arithmetic operations per second). See this page on Real World Technologies for more details.--Defaultuser 18:50, 7 March 2006 (UTC)
Highly technical IBM papers I've slogged through indirectly clarified floating point capabilities. See my copious rantings below. MaxEnt 18:57, 8 April 2006 (UTC)
A single multiply-accumulate instruction is probably considered two "ops". —Ryanrs 05:21, 9 May 2006 (UTC)
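Combining the two answers, the arithmetic works out once a fused multiply-add is counted as two operations. A quick sanity check of the quoted figure:

 #include <stdio.h>

 int main(void)
 {
     /* 4 SIMD lanes x 2 ops per fused multiply-add x 3.2 GHz */
     double gflops = 4.0 * 2.0 * 3.2;
     printf("peak SP per SPE: %.1f GFLOPS\n", gflops);  /* 25.6 */
     return 0;
 }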

Octopiler

http://arstechnica.com/news.ars/post/20060225-6265.html http://www.research.ibm.com/journal/sj/451/eichenberger.html

Add this.

local memory

I'm surprised no one has commented on the similarities between the 256 KB local memory on the SPEs and program memory in a Harvard architecture - although not exactly the same thing, part of the implementation is the same. Surely one of the reasons (if not the main reason) for having the SPE local memory is to reduce bandwidth usage to the main memory. Could this be worth including in the text discussing the overall architecture? (possibly in 'architecture compared')
The article, although reasonably good at the moment, seems to give no real reason for Cell's choice of architecture - the data is there but the analysis is lacking. Does anyone agree? HappyVR 20:45, 13 March 2006 (UTC)

Quite so. I had been meaning to redo this article myself, but got distracted with other pursuits. I doubt anybody would object to your reworking parts of it. -- uberpenguin 22:44, 13 March 2006 (UTC)
On a second reading of this page I noticed that the article is quite good. Better than I originally thought - but it could do with 'redoing' for readability. Not an easy task given the amount of (new) technical stuff that needs to be got across. HappyVR 10:36, 14 March 2006 (UTC)

You've got it backwards. The SPE local store is a good example of a von Neumann memory structure. For an example of a Harvard architecture, see the split L1 instruction and data caches in the PPE. —Ryanrs 06:20, 9 May 2006 (UTC)

Er, yes - I've commented that out for deletion (or change, if someone else wants to) - thanks for spotting that - I thought I'd changed that a long time ago. HappyVR 11:26, 14 May 2006 (UTC)

Harvard architecture - local memory

Added a paragraph in 'architecture compared' noting that one sensible programming implementation results in the SPEs operating as a Harvard architecture. My paragraph probably needs tidying up for readability etc., so feel free to do so.


Clarified PPE VMX capabilities

From the document tagged pacellref I deduced the VMX single precision and double precision performance level. It required careful reading to spot the key numbers, which are given in aggregate for the octoSPUs + VMX.

Although the SPU double-precision (DP) floating-point is not as high as the single-precision performance, it is still good. Each SPU is capable of executing two DP instructions every seven cycles. With Fused-Multiply-Add, an SPU can achieve a peak 1.83GFLOPS at 3.2GHz. With eight SPUs and fully pipelined DP floating-point support in the PPE's VMX, the Cell BE is capable of a peak 21.03GFLOPS DP floating-point, compared to a peak of 230.4GFLOPS SP floating point.

This same document insists that the octoSPUs have a peak SP performance of 204.8GFLOPS (all figures quoted at 3.2GHz); subtracting this from 230.4 gives us 25.6, or eight SP operations per clock.

The other subtraction works out to 1.997 DP operations per clock for the fully pipelined VMX. I think we can safely round up.

This doc. doesn't say if the VMX supports SP in an IEEE 754 compliant mode, and I don't read the PPE docs; I'm an SPU man myself. An important capability to note. Anyone? MaxEnt 06:51, 7 April 2006 (UTC)

About VMX / IEEE / single precision - see two sections down - the VMX is IEEE compliant in single precision (so I assume this is the standard, no. 754, obviously). HappyVR 10:03, 8 April 2006 (UTC)
This was clarified in discussions below. The VMX supports a mode bit which enables Java-subset IEEE compliance in single prec. mode. MaxEnt 19:00, 8 April 2006 (UTC)
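For anyone re-deriving the subtractions above, here they are spelled out (all input figures from the pacellref document, at 3.2 GHz):

 #include <stdio.h>

 int main(void)
 {
     double total_sp = 230.4, spu_sp = 204.8;    /* GFLOPS, SP */
     double total_dp = 21.03, spu_dp = 8 * 1.83; /* GFLOPS, DP */
     printf("VMX SP: %.1f GFLOPS = %.1f ops/clock\n",
            total_sp - spu_sp, (total_sp - spu_sp) / 3.2);  /* 25.6 -> 8.0 */
     printf("VMX DP: %.2f GFLOPS = %.3f ops/clock\n",
            total_dp - spu_dp, (total_dp - spu_dp) / 3.2);  /* 6.39 -> ~1.997 */
     return 0;
 }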

Seven of Eight

I just had to add a section by that title. The article pacellref also contains unique information regarding the EIB, including a diagram associating unit IDs to EIB participants, in ring order:

PPE=6, SPE1=7, SPE3=8, SPE5=9, SPE7=10, IOF1=11
MIC=5, SPE0=4, SPE2=3, SPE4=2, SPE6=1, BIF/IOF2=0

This might be too technical for the main article, but it does suggest that knocking out a failed SPE is not fully transparent in software.

Cell Multiprocessor Communication Network: Built for Speed talks about the EIB in great depth (not added to refs section yet). It clearly states that the maximum distance on each ring is six hops, so each physical SPE has a unique view of which other units it can send to on the CW vs CCW rings (which can be detected via concurrency conflicts).

Anyone adding to the EIB portions needs to work through these two references, they paint a detailed picture, but not me tonight. MaxEnt 07:14, 7 April 2006 (UTC)
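To make the hop arithmetic concrete, here is a tiny sketch of the shorter-ring distance between two of the twelve participants (positions numbered 0-11 around the ring; the function and numbering are illustrative, not from IBM):

 #include <stdio.h>

 /* Hops between two of the twelve EIB participants: a transfer rides
  * whichever counter-rotating ring gives the shorter path, so the
  * distance is min(d, 12 - d) and never exceeds six. */
 static int eib_hops(int from, int to)
 {
     int d = (to - from + 12) % 12;  /* distance going one way around  */
     return d < 12 - d ? d : 12 - d; /* the other ring covers the rest */
 }

 int main(void)
 {
     printf("%d %d\n", eib_hops(0, 6), eib_hops(1, 11));  /* 6 and 2 */
     return 0;
 }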

Added a very brief line to the EIB section - relating to the number of 'hops' needed to transfer data between SPEs or other connected elements - thinking about commenting somewhere in the article that SPEs are not absolutely equivalent due to their interaction with the EIB - if it's not already stated somewhere. HappyVR 20:10, 7 April 2006 (UTC)

It's a good point to make. The place to make it, if possible, is where the PS3 Cell is described as having one of the eight SPUs disabled, since people are likely to assume this is more transparent than it is. There will be subtle diffs from one Cell to another depending on which SPU is knocked out because of the rigid hop count semantics. MaxEnt 19:02, 8 April 2006 (UTC)

Added the VMX to SPU comparison table

I've never worked with VMX, but I decided to chance wading through the VMX docs. The doc I found does not define any double prec. operations. Where are they? I think it's useful to include this survey of differences because IBM has a nasty habit of adding up VMX and SPU floating point performance numbers as if the situation was far more symmetric than it really is. It takes real work to make an algorithm perform well in both environments.

I'm adding a lot of material helter skelter. At some point, a larger review of content and org. will be needed. MaxEnt 08:42, 8 April 2006 (UTC)

Double precision ops are not there, you're right - funny - I'd got the impression that the PPE VMX did double precision (I probably just assumed incorrectly). Also, about your table - as I see it, since Java compliance is a subset of IEEE compliance, both VMX and SPE are IEEE compliant - maybe the table should make it clear that the VMX is IEEE SP compliant, or just comment that both are IEEE compliant? HappyVR 10:03, 8 April 2006 (UTC)
I haven't looked very hard yet for docs on the PPE side of the fence. Maybe there's an extension published somewhere I haven't found yet. In one of the paras below I define the Java and non-Java modes on the PPE side. I could be more explicit about the non-IEEE and IEEE modes on the SPU side. It's a battle to keep the table terse enough to read. MaxEnt 10:19, 8 April 2006 (UTC)

First pass EIB rewrite

There are typos, the description is unclear, and I suspect not all the figures provided are accurate. I'll try to get back to this myself soon. It overemphasizes the hop distance as a latency issue. In fact, the speed is very high even in the cases with more hops between the endpoints. The actual issue here is one of concurrency: with too many endpoints too far apart trying to talk at the same time, the EIB fails to achieve as much concurrency as it might if the hops were shorter. MaxEnt 10:16, 8 April 2006 (UTC)

I couldn't bear it so I took a stab at a rewrite. The EIB structure was quite fresh in my mind from yesterday. I packed a lot of information in there. It reads well in sentences but a little thick overall as a section. Removing the reference to the SPU DMA engines would help, but the EIB can't be properly understood without realizing how well it functions relatively unsupervised in background. MaxEnt 11:35, 8 April 2006 (UTC)

Added the Implementation Section

Found quite a good IEEE article laying out many details of the 90nm SPU implementation that conveys a strong sense of "the flesh" of the upcoming Cell processor. For my own purposes I've engaged in some speculation about where IBM might head in future editions if they branch out of the PS3 niche. I kept most of my crazy ideas to myself, citing only the most dreadfully obvious possibilities (faster, cheaper, stronger).

I personally like the breakdown of the SPU by function unit die area (got out my ruler and spreadsheet for that one). To my tastes, this kind of hard core numerical overview takes some of the "crazy" taste out of the general blather (present company excluded). It also leads to a direct presentation of three issues: the relatively rigid pipeline structure (illustrates programming complexity), the speed vs heat trade-off, and the cost reduction vs future enhancement trade-off (presently biased hugely for cost IMHO). Does anyone else enjoy as much detail as I provided?

This was my first significant contribution so I didn't sweat some small stuff. It could use a pass by someone more wiki-cultured. MaxEnt 04:25, 5 April 2006 (UTC)

Ran out of time adding material on the Cell die floorplan to clean up presentation. It will look better when I flesh out the text to spoon around the tables on three sides. Should be back to clean this mess up later today. MaxEnt 18:35, 5 April 2006 (UTC)

Looks good - surprised there's no mention in 'prospects at 65nm' of improvement in double-precision performance - I assume this can be done 'transparently'.

Note as it presently stands 'The two pipes provide different execution units, as shown in the table above' might better read '...the table (right)'
I haven't had a chance to read the pdf yet but should/will. In general I might suggest (or do it myself) moving your addition above 'Possible Applications' in the article to keep 'hardware specs' at the beginning and 'software apps' at the end.
Also the references are out of alignment but that's due to something else - I'll sort that out myself when I have time if no one else does it first. Keep doing it. HappyVR 21:07, 5 April 2006 (UTC)

Thanks for the feedback. Had time to reduce the eye-bleed and modestly improve the text, but have to run out again. I have plenty to say about future enhancements to double precision which I can't yet justify with a reference. The critical data to enable this kind of speculation is the functional unit die areas. First I need to explain why the double precision unit is smaller than the single precision unit. The double prec. unit is IEEE compliant so there is logic for that, plus logic to 'borrow' the single precision unit four times, but the mult unit used is actually located in the single precision unit. The double prec. unit has no multiplier of its own at all. Then you have to look at latency impacts, and I still haven't found the right passage that states whether these numbers are related to a particular core, or whether they are architecturally mandated (which helps the compilers generate a single binary that runs equally well everywhere). It's even possible IBM has not decided this point yet; they might break the architectural mandate for a large enough gain, but not for a small gain. MaxEnt 01:41, 6 April 2006 (UTC)

Artifacts

I'm confronting a bit of an impasse to continue here. I need to return to some of the other primary sources regarding the Cell architecture and experiment with the Cell simulator to see whether the "shared multiplier" effect creates artifacts visible in software. The purpose of this exercise is to establish some clarity in my own mind about the exact dividing line between the implementation and the architectural specification, and which one is ultimately running the show. A VLIW influence would be to have the architecture specify the execution environment. A more conventional approach is to begin with an accessible implementation, generalize the properties that emerge into some architectural principles which the compiler writers, etc. can take as their foundation, and then hold to this model as long as practical as the underlying implementation technology evolves. An interesting point of comparison right now is the upcoming ARM2 architecture which hybridizes the speed/density trade-off instead of forcing programmers (or the runtime) to ping-pong between two discrete models. It could be a few days before I'm back again with more to contribute here. MaxEnt 20:47, 6 April 2006 (UTC)
Surely the answer is that there are no artifacts? (since it complies with a standard) - still, I'll be interested to see what the answer is. I think that a lot of people might be interested in 'number of clock cycles' for specific common instructions, if you're interested. (Doesn't have to be an exhaustive list.) Also I haven't been able to easily find anything about register interlock between odd and even pipes - I assume that if, say, a floating point instr. is being carried out on the even pipe and the next instr. is, say, a 'load/store from/to memory/register' on the odd pipe, then if the same register is used there is a conflict (without pipeline 'interlocking') - probably only worth mentioning if the behaviour is unusual or non-standard. Apologies if I've totally misunderstood the architecture and my comments are gibberish. HappyVR 21:36, 6 April 2006 (UTC)
The kind of artifact I'm looking for is generally known as a hazard, little dark corners where the nicely documented world breaks down due to hidden implementation issues. Instruction streams intercalating SP and DP instructions are not considered to be a common workload, so a hazard might emerge there if someone looks closely enough. It seems to be implied by the IBM docs: if the multiplier is shared, it can't serve two masters at the same time. The Intel P6 was rife with hazards. If too many L1 reads hit the same L1 bank, there were strange delays. And partial register stalls abound on the P6, particularly concerning the flags registers, but you could never figure this out because Intel had put special case logic in there to fast-path the ones that showed up most often on their design sim. IBM did this with the EIB, too. There is a fast-path provided in one case where the request queue is presently empty. Any chip that results from an intensively simulation-driven design is packed with hazards and anti-hazards of this nature. MaxEnt 19:03, 7 April 2006 (UTC)
Yes, clearly in the case of DP and SP instructions being carried out sequentially there might be a 'stall' (probably, or at least a longer execution time). In the worst case (no 'interlocking' on the multiplier) the second instruction will 'overwrite' the first, producing garbage - but as both instructions have only one (the same) pipeline (so one must finish before the other can start) this seems impossible - good luck anyway finding out. It seems that for best performance code should be written or compiled to attempt to use the even and odd pipelines alternately. HappyVR 19:25, 7 April 2006 (UTC)
Absolutely there is in an interlock to prevent anything bad from happening. Correctness of the results is non-negotiable. If a hazard is detected, the execution pipeline will come to a grinding halt until the hazard clears and then things will resume normally. The only visible effect will be a small (undocumented) delay (sometimes referred to as a bubble). MaxEnt 19:09, 8 April 2006 (UTC)

EIB debated

(broken out from discussion above and deindented). Also the EIB is interesting - again it seems that data is rotated stepwise (both ways?) around this circular bus - it seems possible that faster performance could be had by simultaneously presenting the output of each SPE to all connected devices... (a bit like the operation of some caches). However this would greatly increase the complexity of this element - N 'lines of traffic' for N connected devices as well as (probably) the necessity of including an N-deep buffer for each device to prevent locking when two devices send data to one device simultaneously. It certainly seems that Cell is interesting and unique in the same way the Emotion Engine was, only far more so. HappyVR 21:36, 6 April 2006 (UTC)

You're forgetting about the speed of light. All that blather about 11 FO4 design process means that signals can't get very far in one clock cycle (which are very short because the freq. is very high). Certainly not halfway across the chip. Plus driving a signal to many readers increases capacitance on the bus which retards signal propagation, unless you pump more power into it, which is exactly what they are trying not to do. There's so much bandwidth here that a multicast design is hardly necessary. Replicating a chunk of data in one SPU to the other seven SPUs takes three passes, each time doubling the number of SPUs containing the replicated data block and none of the passes need to involve hop lengths greater than three (if the data block originates on one of the four corner SPUs). It's not slow.
The EIB shunts little packets of data around four concentric counter-rotating traffic circles, one step per clock. Each ring can service a maximum of three cars. I'm fairly sure I read that each SPU has only one read and one write path, so a single SPU can only be active on two of the four rings at once (one read, one write). And look at the math: 12 EIB participants, two ports each (one R, one W); four rings supporting up to three concurrent transactions each involving one reader and one writer. Maximum capacity: twelve readers, twelve writers either way you look at it. In theory, a ring with twelve participants could handle six concurrent transactions if all transactions were adjacent. That would never happen anyway, so IBM only coded up enough concurrency control to match bus participant saturation. Providing a peak EIB ring concurrency of three, rather than four, five, or six does have some impact. Communications patterns combining long, six-hops distances with many short adjacent hops can no longer achieve saturation of endpoints (e.g. ring occupancy solutions of the form 2+2+2+6=12 are not possible). It would take far too much logic for the arbiter to find those solutions, and they never occur in practice anyway, and even if they did for some specific workload, the programmers can get in there and juggle the task to SPU assignments to make the problem go away. MaxEnt 19:03, 7 April 2006 (UTC)
Yes (don't want to get involved with speed-of-light issues - hope it goes away) - your section "Seven of Eight" pretty much gives the answer I was interested in - namely that the SPEs are not equivalent in terms of the EIB - and so for 'streaming' between SPEs the 'data receiver' SPE needs to be 'close' (in terms of the EIB) to the 'data sender' SPE for maximum performance. I appreciate your comment about needing more power to get a voltage rise across increased capacitance - it was just an idle suggestion really - I should be getting back to what I normally do. HappyVR 19:41, 7 April 2006 (UTC)
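(The three-pass replication pattern described above is just recursive doubling; a throwaway sketch of the schedule, with hypothetical SPU numbering:)

 #include <stdio.h>

 /* Each pass, every SPU already holding the block sends it to one that
  * doesn't, so coverage doubles: 1 -> 2 -> 4 -> 8 in three passes. */
 int main(void)
 {
     for (int have = 1; have < 8; have *= 2) {
         printf("pass:");
         for (int src = 0; src < have; src++)
             printf(" SPU%d->SPU%d", src, src + have);
         printf("\n");
     }
     return 0;
 }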

Second EIB rehash

Half of what I wrote in the first rehash proved wrong. Adding the essential quotes finally shed some light into the darkness. There is still a lingering question about my use of the word hop; I'm not sure I've ever seen that word used by IBM. Once again the struggle for basic accuracy has taken prominence over ideal structure from the readership perspective. It's so darn hard to sift the conflicting information from the primary source (take a bow, IBM) there's not much mental capacity left for worrying about structure. MaxEnt 12:17, 10 April 2006 (UTC)

double prec multipliers

One more thing: the VMX numbers I dug out last night show that IBM has already implemented double precision multipliers in this process that outperform their SPU cousins by a rough factor of three. It makes me think that IBM had these up their sleeve from an early design concept for the SPU, then removed them to make room to double the LS size when that proved more important (I've read that the large LS was a late decision), then realized how badly DP had suffered and compromised by re-introducing them into the PPE alone. You can see in the references that most of the benchmark refinements that hit the sweet spot (98 percent utilization) are heavy into double buffering, which hammers on effective LS capacity. MaxEnt 19:03, 7 April 2006 (UTC)
Yes - any shortcomings of this design are about saving die space really - I would not like to have been one of the engineers trying to decide what to keep when the space they had to work with filled up. HappyVR 19:41, 7 April 2006 (UTC)
What gets left out in the first cut is usually restored in a subsequent cut. It turns into a question of where you start. I just worked some more numbers. The 14mm2 difference in die size between DD1 and DD2 is far too large to concern the VMX unit alone. Even if I take the existing SPU execution units and add another four copies of the SP unit as dedicated multipliers for a fast DP unit, the numbers don't work. That change only increases the SPU VMX unit by 6mm2, and the entire beefed-out SPU execution engine still falls short of the 14mm2 added during the PPE rework. And hardware multiply doesn't ordinarily scale N^2; N^1.6 is a better estimate. 2^1.6 is approx. three, not four. MaxEnt 21:18, 7 April 2006 (UTC)


Here's another take. IIRC the die thermals are hottest over the 11 FO4 design regions (the EIB among other things is 22 FO4, which accounts for it running at half the clock rate; one of the references partially shows this). The 11 FO4 regions in the SPU are less of a cost-to-manufacture issue because one bad SPU can be knocked out (for the large PS3 market). The 11 FO4 region in the PPE has the potential to become a huge CTM issue. You've only got one and you have to keep it. I'm guessing the other half of the DD2 rework was PPE hot-spot mitigation driven by CTM considerations.


The reason for the PPE VMX beef-out also strikes me as fairly clear. There's a lot of code out there which either A) requires IEEE compliance (regardless of whether it can work in single prec. mode or not); or B) is unsuitable for vectorization. Using the existing SPU VMX units for scalar, IEEE compliant workloads is a terrible thought to contemplate. IBM had to handle that workload better somewhere, and the cheap place to put it was the PPE. It also gives IBM a nice story for their scientific market (aka Mercury). They can tell people (under NDA) "just assume all the VMX units on the 65nm Cell run exactly the same as the PPE VMX does now". I'm speculating wildly here, and I would never put any of this in the main text. MaxEnt 22:29, 7 April 2006 (UTC)

Possible new page

Just thinking, but maybe it's worth considering separate pages for the SPU, PPE and the Cell bus structures - with links from the main Cell page like this:

example


==Sample part of cell==

Main article: specific part of cell

Brief description of element of Cell for readers who don't want/need a lot of info here...

end of example

There's certainly enough stuff there to justify this - and it would help readability as well as addressing some editing issues about putting technical information into the main article. HappyVR 10:53, 8 April 2006 (UTC)

We will certainly need to do something. I think it is still too soon to decide how to cut the baby. If we push away too much of the hard-core technical, the main article will begin to drift off again in the Fantasy Island direction. Or worse, we could have execution units named Ginger and Skipper. On one level Cell is visionary. On another level, it has just as many compromises and obscure difficulties as the devil we know. I somewhat deliberately structured my blurb on VMX to SPU portability to say "porting is easy, except when it isn't, except when it is, except when it isn't, except when it is". I didn't explicitly state in that pass that IBM was counting on the Altivec code base to help prime the SPU pump; but I did make it clear that by the time the dust settled they had done a better job of drawing upon the Altivec conceptual model than the Altivec feature set.
It was an explicit goal of Sony/IBM to design a processor to fit into the gap between desktop systems and dedicated multimedia hardware (such as GPUs). Life in that gap means the programmer needs to ingest a much larger grain of salt when translating theoretical performance numbers into real life. I think the primary balance required of the main article is to convey this chimeric existence where it sometimes looks like a conventional processor, and sometimes doesn't. I'm not yet sure where to make the cut to preserve that sense. MaxEnt 19:36, 8 April 2006 (UTC)

Quadwords

Suggest changing this:
Quadword alignment is alignment on 16B boundaries where the low four address bits are zero.

to this:
Quadword (four times a 32-bit word, or 128 bits) alignment is on 16-byte (128-bit) boundaries.

Or similar - wikipedia does not actually have a page for quadword. To make it clear that in this architecture a word is 32 bits.HappyVR 11:45, 8 April 2006 (UTC)

I hate the term word since it generally means whatever we want it to mean. Word in this case is 32 bits. We shouldn't try to normify quadword usage within the wiki; we just say that our source material defines a word as 32 bits and a quadword as 128 bits. Something like this: "IBM's VMX tech. defines a word as 32 bits and a quadword as 128 bits. Quadword alignment implies alignment on 16B boundaries where the lower four bits are all zeros." Give it a shot, I'm tired of it. I encourage changes. There will be many passes before we decide what works best. MaxEnt 19:43, 8 April 2006 (UTC)
OK, changed the wording - I'm really looking for somewhere in the article to state clearly "In this architecture a word is defined as 32 bits" - perhaps putting "both PPE and SPEs use 32-bit length instructions - consequently in this architecture a word is defined as being 32 bits long." But where? HappyVR 20:23, 8 April 2006 (UTC)
All the IBM/Sony manuals begin with a ten page blob of architectural defs that make my eyes water because I've seen 99% of it before 9999 times. My vote is to quietly slip some normalized vocabulary into the VMX to SPU comparison table. Keep the specialist terminology close to the ground. We're not necessarily writing for people with spare mental buffers to carry around a lot of jargon. Make it prominent enough to notice when it matters to the reader. I'm going to make an edit to the table. See what you think. 20:35, 8 April 2006 (UTC) Done. By the way, it's a huge help to have you bat these things back at me. MaxEnt 20:58, 8 April 2006 (UTC)
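And for readers who want the alignment rule spelled out in code rather than words, a trivial sketch (the helper name is made up):

 #include <stdint.h>
 #include <stdio.h>

 /* A quadword here is 128 bits = 16 bytes, so an address is
  * quadword-aligned exactly when its low four address bits are zero. */
 static int is_quadword_aligned(uintptr_t addr)
 {
     return (addr & 0xF) == 0;
 }

 int main(void)
 {
     printf("%d %d\n", is_quadword_aligned(0x1000),   /* 1: aligned   */
                       is_quadword_aligned(0x1008));  /* 0: unaligned */
     return 0;
 }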

Arbiter circuit

This was commented out:
"However all connected elements can be considered to be the same distance from the memory controller as they are connected in a star fashion to the FlexIO interface via an arbiter circuit."

It's in one of the two references (pdf) that user:MaxEnt supplied in the past week - I assumed that these were reliable - (possibly I've misread - but I don't think so) - if it's still an issue I'll try to point to the exact page the info has come from (when I have some time). HappyVR 11:52, 8 April 2006 (UTC)


OOPS - sorry, yes, I was wrong - I was looking at diagrams too quickly and not reading properly - no star fashion - what a dummy I am. See http://www-128.ibm.com/developerworks/power/library/pa-expert9/ HappyVR 12:01, 8 April 2006 (UTC)

Thanks for tracking that down. It freaked me out. There's a huge blank area in my understanding of the address translation and cache coherence model. For example, what happens on a page fault triggered by a SPU local DMA command? They keep me so busy mapping VMX onto SPU I haven't managed to get that far. I suspect there are some strange details lurking under the hood in the address snoop mechanism. From the ref you just provided (I've seen it before):
No, that's something we came up with as part of this. We wanted to be able to connect the Cell BEs into a multiprocessor configuration, and actually the first protocol that we came up with was the BIF. The EIB is sort of a logical extension of the BIF, so the BIF defined command formats, transaction types, the snooper cache coherence protocol and all of that, and it's also got definitions of physical link layer routing and all the other stuff that's in some ways similar to PCI Express. Some of the same concepts are used with the BIF. The difference with the BIF though is that it's got packets that allow for coherent MP communication.
I feel surrounded by Sand People hidden in the hills.
You were confused by how they drew the arbiter. I left that out of my EIB description entirely. It could have been far worse. The arb might be poking his nose into packet addresses. Hard to say. There's definitely another layer of black magic within the EIB concerning MP coherence that IBM hasn't fully revealed yet to the non-NDA crowd. MaxEnt 19:53, 8 April 2006 (UTC)

[edit] Overview

In the section (2) Architecture, before section (2.1) Power Processing Element I want to insert a sub section 'overview'
Couple of reasons - firstly, I've noticed the article pretty much jumps straight into the dirt - it's easy to miss this because most people coming here will already have a good idea what it's all about - however, to the 'uninitiated' it will be confusing - they'd have to piece together the info to find out what it is all about.
Secondly, an overview might/should help to stitch together section 2 and also prevent having to repeat any concepts in the text.
Thirdly if basic concepts are explained in 'overview' the subsections can jump right in with more detailed info.

I think various things should be included:
It's a 32-bit architecture
the PPE is used as a controller/mediator/task organiser/master
The PPE is compatible with other PowerPC procs.
The SPEs have local memory; they act as slaves (do the heavy lifting, other metaphors)
The EIB connects the SPEs, the PPE and external buses
Other external buses connect to memory, serial devices etc.
The registers (number, length), word size, addressable memory etc.

I would do this but I (too) am getting tired - maybe not today - or maybe someone else will do it first - either way it can wait for now. I assume that no-one would object to a slightly expanded introductory paragraph to section 2? Or would a separate section be better? HappyVR 20:46, 8 April 2006 (UTC)

I'm hugely in favour of sifting introductory concepts upward and maintaining accessibility against the onslaught of complexity. We're faced with describing a dime store supercomputer. Keep the ideas coming. MaxEnt 22:43, 8 April 2006 (UTC)

[edit] Bitness

The bitness of an architecture used to mean either the address space or the register width. In both cases the Cell would not be considered 32-bit. I'm sure the PPC has a virtual address space far larger than 4GB. Also, the general purpose register set in the PPC is mostly 64 bits now IINM, with special purpose registers of 128 bits. The SPU is even more biased toward 128-bit reg width. In scalar mode, the SPU functions as a 32-bit processor. All we can say is that the architecture defines a word as 32 bits. Power originated as a 32-bit arch; there's not much left of that now. MaxEnt 22:43, 8 April 2006 (UTC)

[edit] PPE compat

Absolutely. Binary compatible. Caveat: because Cell lacks OOO, code compiled by compilers attempting to exploit OOO in hardware won't necessarily run well. To run well, existing code needs to be recompiled with a compiler that correctly models the PPE's superscalar constraints. Was Power 5 OOO? I've never been much of a Power guy. MaxEnt 22:43, 8 April 2006 (UTC)

[edit] SPE local memory

Local memory is the fundamental departure from conventional designs. It ties into the distributed DMA controllers, EIB, multiprocessor coherence model, etc etc. It's also what makes the Cell look more like a GPU internally.

[edit] PPE to SPU relationship

Depicting the correct relationship (status hierarchy) between the PPE and the SPUs is tricky. If the system runs an operating system, the OS itself almost certainly runs on the PPE threads. The SPUs have no protection model (priv. modes). The virtual memory model, processes, file system code, etc. are all going to run on the PPE side. By this view, you would say the PPE is running the show. The situation becomes more blurry for a game console where the game takes over and the OS fades into the background for long periods of time. In Linux and BSD, many OS tasks are handled behind the scenes with worker threads (of one variety or another). The SPU processors are sufficiently powerful running in scalar mode that the PPE could delegate OS worker thread tasks to the SPU context. Tasks such as running through VM page lists or lists of disk buffers, deciding what to keep, what to flush, and what to zero out for later. A SPU processor could also handle network packet assembly and disassembly tasks. Or the computations involved in handling software RAID. In fact, these applications all resemble the GPU. High end network cards and disk controllers have long contained coprocessors designed to offload these computations from the main OS. The SPU is unlikely to become involved in deciding which task to schedule next. Those portions of the OS are extremely branchy in nature. And the PPE also ends up doing some heavy lifting of its own: any Altivec kernel that requires Java semantics, or fast double precision.

[edit] Gone to the dogs

Maybe the right image is a dog team that consists of a border collie and eight huskies. Huskies aren't dumb animals, but there are some things you can train a border collie to do that a husky would never be good at. Likewise, a border collie is a hard working dog, but doesn't have the strength or stamina to pull a heavy sled through deep snow. Both animals are general purpose work dogs, yet you have to use them differently to get your kibble's worth. Likewise, while the collie excels at agility (hard turns) it lacks the crunching power of an Opteron (bulldog) or the straight ahead speed of a Pentium IV (greyhound). Actually the PPE is more like a pair of downsized herd dogs: perhaps the Welsh corgi. IBM would like people to think of Cell as two border collies heading eight huskies. On a bad day I might describe the Cell as consisting of two fast-footed corgis up front and eight single-minded scent-crazy beagles bringing up the rear. In any case, the point is that the relationship is not easy to depict in role-bound terminology. MaxEnt 22:43, 8 April 2006 (UTC)

Yes, thanks for all that - I was thinking about a similar analogy for the overview myself - a coachman and eight horses - I'm still thinking about the wording of the overview - trying to avoid a description that is too simple whilst keeping it succinct.
However, it seems there is a double precision problem to sort out first (see below). HappyVR 09:41, 9 April 2006 (UTC)
I'm learning stuff from my coherency trawl that is changing my view. The PPE threads are very clearly in charge of the view of system memory seen by each SPU and I've noticed IBM describe the relationship from PPE to SPU as offload several times now. One doc also stated that the PPE has preemptive task switch ability over any running SPU. If anything goes wrong with how a SPU DMA request tries to access the address space (including translation problems due to TLB miss, etc.) the PPE gets the interrupt to fix the problem. I'm coming to the view that it's a legal guardianship of two PPE parents over eight SPU children, teenagers who are quite independent until they get into more trouble than they can handle themselves. MaxEnt 10:04, 9 April 2006 (UTC)
I hope we continue to work together on this. Every time I edit for accuracy, I end up adding technical terms such as SIMD that might not have been properly introduced yet. Once I say that the PPE has 64 and 128 bit regs and the SPU has only 128 bit regs, I really need to explain that the 128 bit SPU registers are flexible in support of both scalar and non-scalar (SIMD) data types. It's good to tackle the bit-size question early so that the reader understands that the old concept of a 32-bit proc or 64-bit proc is not so simple any more. We now have 32-bit inst., 64 bit scalars, 64 bit pointers, 128 bit SIMD, etc. Ugh, I just checked, the SPU also defines doubleword and quadword scalars. Must fix. MaxEnt 09:53, 10 April 2006 (UTC)
I wouldn't worry about SIMD - I'll change any early instances of SIMD into links so that readers can look the term up themselves, and do the same for other technical terms or acronyms. HappyVR 10:16, 10 April 2006 (UTC)

[edit] Coherency

There is so much to clean up, but I wanted to dig into the coherency model before I take my next cut.

The MFC also contains an atomic unit (ATO), which performs atomic DMA updates for synchronization between software running on various SPEs and the PPE. Atomic DMA commands are similar to PowerPC locking primitives (lwarx/stwcx).
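For readers unfamiliar with those primitives, here is the classic PowerPC load-reserve/store-conditional pattern in GCC inline assembly - a generic PowerPC sketch of lwarx/stwcx usage, not Cell's atomic DMA mechanism itself:

 #include <stdint.h>
 
 /* Atomically add 'delta' to *p and return the new value. The reservation
    taken by lwarx is lost if another processor touches the line, making
    stwcx. fail and the loop retry. PowerPC only; compile with GCC. */
 static inline int32_t atomic_add(volatile int32_t *p, int32_t delta) {
     int32_t t;
     __asm__ __volatile__(
         "1: lwarx   %0,0,%2\n"   /* load word and reserve        */
         "   add     %0,%0,%3\n"  /* compute the new value        */
         "   stwcx.  %0,0,%2\n"   /* store iff reservation held   */
         "   bne-    1b\n"        /* lost the reservation: retry  */
         : "=&r" (t), "+m" (*p)
         : "r" (p), "r" (delta)
         : "cc", "memory");
     return t;
 }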

This is a meaty sidebar:

The Cell BE processor -- on the other hand -- is a 64-bit processor similar to the PowerPC 970 and uses neither BATs nor segment registers. It uses Segment Lookaside Buffers (SLBs) and page tables for its address translation mechanism. There are 64 entries in the Cell BE SLB, and they form part of the process' context. SLB entries map the effective address of process address space to virtual address. Each SLB entry maps a 256MB effective address region, and so 16GB of address space of a process can be mapped at once. If a process address space is greater than 16GB, and if a particular effective address range of the currently running process is not mapped in the SLB, then DSFI/ISFI exceptions are generated. The OS should resolve this by filling in the correct entry in the SLB and replacing the suitable entry. In this way address space > 16GB can be effectively mapped using SLBs.
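The segment arithmetic in that quote is easy to verify (my own back-of-envelope check, not from the article):

 #include <stdint.h>
 #include <stdio.h>
 
 /* Each SLB entry maps a 256 MB (2^28 byte) segment, so the effective
    segment index of an address is ea >> 28, and 64 entries cover
    64 * 2^28 bytes = 16 GB at once. The example address is invented. */
 int main(void) {
     uint64_t ea   = 0x00000004E0001000ULL;  /* example effective address */
     uint64_t esid = ea >> 28;               /* effective segment index   */
     uint64_t gb   = (64ULL << 28) >> 30;    /* GB mappable at once: 16   */
     printf("esid=%llu, mappable=%llu GB\n",
            (unsigned long long)esid, (unsigned long long)gb);
     return 0;
 }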

More vitals from the main text:

The MFC's MMU unit consists of the following:
  • Segment Lookaside Buffer (SLB) (managed through memory mapped input output (MMIO) registers)
  • Translation Lookaside Buffers (TLBs) to cache the DMA page table entries (option for hardware reload from page table or software loading through MMIO registers)
  • Storage Descriptor Register (SDR) which contains the DMA page table pointer (standard hashed PowerPC page table format)
The architectures allow the PPE and all of the MFCs to share a common page table, which enables the application to use their effective addresses directly in DMA operations without any need to locate the real address pages.

System Memory Map:

The Cell BE processor's system memory map consists of six distinct regions, namely SPE local store area, SPE privileged area, PPE core area, IIC (Interrupt controller area), PMD (Power Management and Debug) area, and Implementation Dependant Expansion Area (IDEA).

p.23 corrects a mistake I made describing peak EIB bandwidth. The 204.8 GB/s figure is peak bandwidth as seen by eight SPU processors; there are four other EIB participants that can increase this by another 50%.

  • Each SPU has an AUC which acts as a "4 line cache for shared memory atomic update"
  • BIC stands for Broadband Interface Controller
  • IOF1 is diagrammed as supporting a south bridge
  • IIC is internal interrupt controller; duplicated for each PPE hardware thread
  • IOT is I/O Bus Master Translation: bus addresses to system real addresses, two levels (256MB I/O segments; 4K,64K,1M,16M I/O pages)
  • IOST cache: segment table; I believe HW/SW managed
  • IOPT cache: page table; I believe also HW/SW managed
  • RAG is "resource allocation group"
  • LPAR: can't figure out
  • RT tasks: can't figure out
  • access tokens generated at configurable rate for each alloc. group: 1 per mem bank (x16), 2 per IOIF (x2)
  • MIC does not get a RAG, but each SPE, the PPE, and both IOIF do (11 total)
  • resource overcommit interrupt
  • more than 100 outstanding DMA requests supported
  • HW or SW TLB management (that's not what the other link seems to say)
  • SPE MMU similar to PowerPC MMU: 8 SLBs, 256 TLBs, multiple page sizes, SW/HW page table walk, PT/SLB misses interrupt PPE
  • isolation mode support (ick pooh)
  • PPE threads alternate fetch / dispatch cycles
  • p.36 shows the PPU execution pipes: 2 ops/cycle issue to load/store(1), fixed point(1), branch(1), and VMX/FPU queue(2)
  • VMX/FPU issue queue: 2 ops/cycle to VMX load/store(1), VMX arith/logic(1), FPU arith/logic(1), FPU load/store(1)
  • numbers in parens are the number of ops that can be accepted by the unit in one cycle
  • "exploit all of Cell's 18 asynchronous engines"

MaxEnt 08:20, 9 April 2006 (UTC)

[edit] Hot New Interview about PPU

Posted April 18, interview with Mark Nutter and Max Aguilar:

Confirms my emphasis on the DMA view of system memory. Here are the vital comments to work into the main text:

  • For reference, you may want to look at the MMU description for the SPEs, because that's the important piece, choosing, again, the same address translation machinery and storage protection machinery that the PowerPC core uses. They are fully PowerPC [ISA] compliant, so when you issue MFC/DMAs, it steps through the same address translation and protection mechanism that the PowerPC core steps through.
  • Now if we take and we start with, say, the PowerPC Architecture Book III and we look at that memory model from there, where do we go differently with the Cell BE processor? So that's where to look to define the address translation model.
  • We really don't advise people to memory map local store into the PPE's effective address space
  • You recommend going with the DMA? Aguilar: Yes. The DMA is definitely the way you want to transfer, because when you access the local store it's done through MMIO, very slow compared to the MFC/DMA. dW: That's a big clarification, that the recommended approach is using the DMA approach as opposed to the global memory mapping. (A code sketch of this DMA approach follows this list.)
  • Let's consider for a moment an application that might need to copy the content to the local store from the PPE side. One way it could do that would be to call memcpy on that memory map local storage area, and essentially copy all 256 kilobytes with any other anonymous memory chunk in the system. That would be very slow relative to the DMA engine. It would have, potentially, various side effects -- depending on the memory target where you were copying to, it would potentially displace contents of the L2, and so on. So every load and store to an MMIO region is something to be avoided, if you can do it, and certainly to avoid 256 kilobytes worth of that.
  • Yes. As we mentioned before, MFC/DMA commands targeting effectively addressed memory go through address translation and protection. This is true both for accesses to regular memory and for accesses to memory-mapped I/O. MFC/DMA commands targeting another SPU's local storage area are just like any other memory access, from the MMU's point of view.
  • With the GCC toolchain, what abstractions does it support out of the box? Aguilar: It's a low-level compiler, in the sense that it compiles to the SPU or to the PPE ISA, but it doesn't do any of the higher-level abstraction or tying together of the programming model. MaxEnt 20:41, 19 April 2006 (UTC)
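To illustrate the recommended DMA path described above, here's a minimal SPU-side sketch using the Cell SDK's spu_mfcio.h interface (mfc_get and the tag-status calls are the SDK's names; the buffer, size, and address are invented for illustration):

 #include <spu_mfcio.h>
 
 /* Pull one chunk of effectively-addressed system memory into local store.
    The MFC performs the transfer asynchronously; the SPU could keep
    computing between mfc_get and the tag-status wait. */
 #define CHUNK 16384
 static char buf[CHUNK] __attribute__((aligned(128))); /* EIB-friendly */
 
 void fetch_chunk(uint64_t src_ea) {
     unsigned int tag = 0;
     mfc_get(buf, src_ea, CHUNK, tag, 0, 0); /* queue DMA: LS <- EA      */
     mfc_write_tag_mask(1 << tag);           /* select tag(s) to wait on */
     mfc_read_tag_status_all();              /* block until it completes */
 }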

[edit] Double precision

I've looked and looked but it's not there - I can't find any reference to the PPE, i.e. Altivec, doing double precision - it only goes up to 32-bit single precision. Can this be sorted out? As far as I know, the only evidence for double is an excess in IBM's published double precision performance numbers. Is this right - is it possible to write this off as a printing error/simple mistake that slipped through, and eliminate all refs to PPE double precision? HappyVR 09:41, 9 April 2006 (UTC)

I wouldn't be so hasty. The vast majority of the Cell documentation set is still under NDA IIUC. The doc I got that number from was extremely strong in the accuracy of the numbers provided. There were no slips or mistakes. IBM cheated in some respects on how they released the early docs by referencing many existing PowerPC docs where they could get away with it. I suspect there is a Cell VMX Extension doc not yet released. We can simply indicate in the article that the double prec. capability for the VMX unit is based on a single reliable source as yet unconfirmed by the portions of the formal doc set presently available to the public. That's how I would play this. MaxEnt 09:58, 9 April 2006 (UTC)
The plot thickens. The PPU runs the same instruction set as the PowerPC 970 (aka Apple G5). The G5, however, contains two scalar DP prec. execution units, so there is no need or purpose in supporting double prec. in SIMD format. In the Cell PPU, the VMX and FPU are merged together into some weird hybrid. However, they show FPU instructions as distinct from VMX instructions. I'm not sure which register file is used to supply operands for the FPU issue queue and whether these regs are 64 bits or 128 bits wide. A scalar mul-add to a fully pipelined FPU achieves the rated 6.4 DP GFLOPS, as does a partially pipelined two element SIMD. Whether or not the same execution resources are involved, it now seems clear to me that IBM does not consider the DP instructions to be part of the VMX instruction set. Maybe you were right and there is no SIMD DP, regardless of whether we call it VMX or not. MaxEnt 10:51, 9 April 2006 (UTC)
Ah, that must be it! The PPE has 64-bit floats but they're not VMX - that's why my search under VMX produced nothing - hence the confusion - I knew that my feeling that there was double precision in Power was right and not just vaporous daydreams - it seems there is, but it's not SIMD as the single precision instructions are. Hopefully that's it and the confusion is ended. HappyVR 11:29, 9 April 2006 (UTC)
Although the SPU double-precision (DP) floating-point is not as high as the single-precision performance, it is still good. Each SPU is capable of executing two DP instructions every seven cycles. With Fused-Multiply-Add, an SPU can achieve a peak 1.83GFLOPS at 3.2GHz. With eight SPUs and fully pipelined DP floating-point support in the PPE's VMX, the Cell BE is capable of a peak 21.03GFLOPS DP floating-point, compared to a peak of 230.4GFLOPS SP floating point.
That's directly from three members of the IBM Systems Performance Group tasked with determining what the real Cell hardware might best achieve in the real world. It couldn't be a stronger or more notable source unless Dr Hofstee were the sole author.
This is a perfect example of why your feedback is so helpful. The first time I read this I was immersed in the SPU and considered it completely unremarkable to see a claim that the Cell VMX had the same general capabilities. I didn't know that double prec. on Power 5 is handled by twin scalar units or that the official VMX docs. make no mention of double prec. at all until you raised the issue into my consciousness. Now we need to state this as it lies: the horse has spoken that VMX contains a "fully pipelined double prec. VMX" while noting that the Power 5 VMX that presently serves as the formal VMX documentation is silent on this point. I'll do it in the next day or two unless you get there first. MaxEnt 18:58, 9 April 2006 (UTC)
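As a quick sanity check on the quoted 21.03 figure (my own arithmetic, not from the IBM source):

 #include <stdio.h>
 
 /* 8 SPUs at 1.83 GFLOPS each, plus the claimed fully pipelined PPE
    mul-add (2 flops/cycle at 3.2 GHz = 6.4 GFLOPS). */
 int main(void) {
     double spu_dp = 1.83;       /* GFLOPS per SPU at 3.2 GHz      */
     double ppe_dp = 3.2 * 2.0;  /* one fused mul-add every cycle  */
     printf("peak DP: %.2f GFLOPS\n", 8.0 * spu_dp + ppe_dp); /* ~21.04 */
     return 0;
 }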
Here I am saying this is an excellent source and I immediately find a Homer.
Figure 1. Cell BE Processor Block Diagram shows 11 EIB participants diagramming the read and write links as 25.6 GB/s each, except the IO controller which is labelled with an inbound speed of 35 GB/s and smack in the center the EIB block is captioned EIB (204.8 GB/s). Ouch. That number represents the partition bandwidth for the case where every data stream contains an SPE participant. Secondly, the IO controller is actually two participants where the aggregate outbound is 35 GB/s and the aggregate inbound is 25.6 GB/s. Furthermore, the MIC is shown as supporting 25.6 GB/s on both inbound and outbound links, yet the external memory interface it is bound to has an aggregate in/out bandwidth of 25.6 GB/s.
The abstract EIB bandwidth is 12*25.6 = 307.2 GB/s. Taking into account limitations within the bus participants, and not the EIB itself, a more practical upper bound is 260 GB/s. Again all numbers normalized at 3.2 GHz. There are other IBM sources which depict the aggregate EIB bandwidth as 300 GB/s. MaxEnt 05:23, 10 April 2006 (UTC)
Oww! Oww! Oww! They also get their arrows wrong on the fictional combined IO block, inconsistently putting the chimeric 35 GB/s number on the inbound arrow in one place and the outbound arrow in another. MaxEnt 05:25, 10 April 2006 (UTC)
The mistake is repeated in the text too: The EIB supports a peak bandwidth of 204.8GB/s for intra-chip data transfers among the PPE, the SPEs, and the memory and the I/O interface controllers. MaxEnt 07:54, 10 April 2006 (UTC)

[edit] Overview

Added an overview - it almost certainly will need at least a bit of work later to improve readability etc.
My aim was to introduce the concepts (and some of the acronyms) used in the proc. as well as giving a rough idea of what it is intended for.
This should make it easier for later sections to get right down and start explaining more 'meat, skin and bones' concepts. As a result there may now be a few minor duplications in the text. I'll be coming back and reading through the entire thing slowly in an attempt to check for this sort of thing. Hopefully it's OK for now. HappyVR 11:39, 9 April 2006 (UTC) It definitely needs some paragraph breaks at least. HappyVR 11:42, 9 April 2006 (UTC)

Your change of the expression not autonomous initially to not fully autonomous is subtle. It depends on what you mean by fully and when you ask. As I was reading about the coherence model it was made extremely clear that the SPE units are helpless until the PPE sets up the view of memory, privilege model, and task assignments before kicking things off. At this point, depending on how the PPE elected to set up the SPE units, they can be viewed as almost fully autonomous, or greatly codependent. It's possible to create a software model where the PPE never touches an SPE again. The SPE units are quite capable of pulling tasks off a global task queue themselves, obtaining the code and data, grinding away, then delivering the result to a global completion queue; the PPE might then pick up a signal via some autonomous synch. primitive that the work queue has new results. That might be the full extent of how much the PPE pays attention to the SPE units after the system initialization phase. In other models, it might become almost a full-time job for the PPE threads to respond to management events generated by the helpless SPE units; they can't, for example, manage faults originating from their own TLB or page tables. The two wordings convey different truths. I like the other changes you've made that I looked at first. MaxEnt 05:50, 10 April 2006 (UTC)
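A toy data layout for that work-queue model might look like this (all names and fields are invented for illustration; nothing here is from IBM's docs):

 #include <stdint.h>
 
 /* The PPE posts task descriptors; SPEs claim them via atomic DMA
    updates, fetch the code and data by DMA, and mark them complete. */
 typedef struct {
     uint64_t code_ea;   /* EA of the SPU program image to fetch */
     uint64_t data_ea;   /* EA of the input block                */
     uint32_t data_size; /* bytes of input                       */
     uint32_t status;    /* 0 free, 1 posted, 2 claimed, 3 done  */
 } task_desc_t;
 
 typedef struct {
     uint32_t head;      /* bumped by SPEs via atomic DMA        */
     uint32_t tail;      /* bumped by the PPE as it posts work   */
     task_desc_t ring[64] __attribute__((aligned(128)));
 } work_queue_t;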
Yes, it's not 100% clear. I'll try to find a way to reword it to make the 'slave' status of the SPEs clearer. Perhaps something like 'although the SPEs are completely functional microprocessors they require the PPE to ..(make them do work)..' I'll try to get back to this and construct a better sentence. HappyVR 08:35, 10 April 2006 (UTC)
That's a good direction to take. I was thinking about this. The level of supervision is largely determined by the process priv. model. If the SPU processors are working in offload mode, slurping away at OS internal work queues, all tasks with the same view of memory and the same effective priv. level, the SPU processors can run largely unsupervised. However, if untrusted user code must execute on a SPU, the PPE must do some work to set up the appropriate user process view of memory and enforce other OS protection mechanisms. This is quite fundamental. IBM states repeatedly in the docs that the SPU has no priv. modes. The PPE is entirely responsible for setting up an appropriate view of memory for the level of priv. of the process offloaded. Also, now that I'm looking for it, I see IBM using the word 'offload' in many diff. docs. The diff. is that offloading an OS internal work queue can feed a SPU forever, whereas offloading non-priv. code from userland involves the PPE every time. MaxEnt 09:33, 10 April 2006 (UTC)

One of the passages you edited already contained the phrase PPE 'cores'. I must fix that immediately. There is precisely one PPE core, which supports two PPE threads through the use of SMT (simultaneous multi-threading), which IBM implements very rigidly as the two threads doing opposite things on alternate clock cycles (one decodes while the other executes, and this alternates back and forth every clock cycle). Some other SMT processors (Pentium IV) use a more opportunistic approach: the primary thread goes as fast as it can all the time, while the secondary thread exploits bubbles when the primary thread stalls. In Cell you could almost describe this design as lockstep SMT. It still only counts as one core. MaxEnt 06:29, 10 April 2006 (UTC) Corrected. I hyperlinked SMT but it still reads heavy.

Agreed. I assume that the PPE 'threads' have individual decode and execute units as well as separate 'maths' units? ("While early superscalar CPUs would have two ALUs and a single FPU, a modern design like the PowerPC 970 includes four ALUs and two FPUs, as well as two SIMD units" from superscalar.) But the register set is shared (again, I assume). And the threads 'take turns' due to it being impossible for both to access memory etc.
Documentation on the PPE (in a specific CBEA arch.) is a bit difficult to find - all I have (in short) is that the PPE must comply with Books 1-3 of the Power 'standards', that it must have the 64-bit implementation, and also must have various extra maths instr. If I could assume that the PPE is identical to, say, a PowerPC 970 (plus extra Cell-specific instr.) then it would be a lot easier to expand the 'Power Processing Element' subsection. HappyVR 08:25, 10 April 2006 (UTC)
That's another good change: In documentation relating to Cell a word is always taken to mean 32 bits. But again, the para in front of your edit is weak. I'll clean that up now. MaxEnt 07:57, 10 April 2006 (UTC) Done. Took longer than I figured and I wasn't able to find all the essential numbers. The 16TiB figure was wrong. 2^64 is 16 billion billion, which is 16 exbibytes (EiB). Perhaps the supported virtual address space is 16TiB, but I couldn't find that. MaxEnt 09:25, 10 April 2006 (UTC)
Yes - looking again, the 16TiB was my mistake - it WAS meant to be the same as 2^64. I too am finding it difficult to find exact info on the address range limit - I've definitely not seen anything above 2^64 - I think the PPE is 2^62 (contiguous) up to 2^64 total, from one of the refs in the article. (No 64-bit scalar in the SPE either - I remember looking a long time ago and being disappointed by its absence.)

The edits are good - it's improved. HappyVR 09:43, 10 April 2006 (UTC)

[edit] SPE vs SPU

I'm not sure about introducing the division of SPEs into SPUs and other functional elements in the overview. I would suggest keeping it all under the blanket term SPE. I would also suggest moving:
" Note that the SPU processor can not directly access system memory; the 64-bit memory addresses formed by the SPU must be passed from the SPU processor to the SPE memory flow controller (MFC) to set up a DMA operation within the system address space."
to the SPE section, as it is a bit technical - the overview should address this issue though. HappyVR 10:16, 10 April 2006 (UTC)
It's not entirely our choice. Despite the terms bearing a superficial resemblance, the SPU is a processor core and the SPE is a fully-fledged EIB participant; the SPU has no direct view of system memory, whereas the SPE contains an asynch. DMA controller; etc, etc. We need to say: IBM has somewhat confusingly nested similar acronyms. The same problem exists with PPE threads vs PPE cores. I'm thinking about adding a preamble to the overview concerning the most egregious terminology. MaxEnt 02:47, 11 April 2006 (UTC)
An SPU is part of the SPE though; I thought it might be best to try to keep the overview couched in terms of SPEs, the PPE and buses, and leave any breakdown of the SPE into component parts for later sections. HappyVR 19:45, 11 April 2006 (UTC)
I've made the edits I suggested in the paragraph above. I will, if possible, try to add something to the overview to cover the main memory access method for SPEs. Possibly something about there being an added 'abstraction layer' between SPEs and memory - I will need to read a bit more about this before I can do it. Clearly SPE main memory access appears not to be similar to, say, the typical common processor method. HappyVR 15:43, 10 April 2006 (UTC)

[edit] Harvard architecture and memory map

I'm wondering if (my) previous additions suggesting one programming model where the SPEs operate in a 'Harvard type architecture' are correct - for this to work the SPEs need to be able to access 'main memory' fairly easily. However, looking at memory maps and the MFC has left me, to be honest, none the wiser. Can anyone easily confirm or deny this?
I'm trying to find out (and failing) the extent to which transfers of data from main memory to an SPE can be relatively transparent - for instance - my simplest question - if a memory access to SPE local mem goes outside the SPE local mem range, is it transparently mapped to other devices or... something else? I'm sort of getting the feeling that I might read every IBM and Sony pdf available and still not have the answer.
The text in question is contained in the 'Architecture compared' section. HappyVR 16:25, 10 April 2006 (UTC) Looking closer, it seems the SPE memory map can never be used to examine external memory, which is a disappointment. I've changed the section 'architecture compared' to reflect this - however, if anyone knows that discrete main memory access by SPEs can be fast and simple, please change it back. HappyVR 17:10, 10 April 2006 (UTC)

No, the SPU most definitely cannot access system memory addresses. The SPE, however, can. The SPU queues entries into the local MFC's DMA request queue using the channel interface instructions. These DMA requests are powerful objects: they can refer to lists of memory regions contained as data blocks within the local store. Using these lists, the SPU can queue up hundreds of discrete memory reads and writes that will take place transparently in the background while the SPU continues to crunch numbers.
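Here is what such a scatter/gather list looks like in the SDK's terms (mfc_getl and mfc_list_element_t are the SDK's names; the addresses and sizes are invented). One queued command gathers several discrete regions of system memory into local store:

 #include <spu_mfcio.h>
 
 #define NREGIONS 4
 static mfc_list_element_t list[NREGIONS] __attribute__((aligned(8)));
 static char dest[NREGIONS * 1024] __attribute__((aligned(128)));
 
 void gather(uint64_t ea_base, uint32_t ea_low[NREGIONS]) {
     unsigned int tag = 1;
     for (int i = 0; i < NREGIONS; i++) {
         list[i].notify = 0;         /* no stall-and-notify            */
         list[i].size   = 1024;      /* bytes in this region           */
         list[i].eal    = ea_low[i]; /* low 32 bits of the region's EA */
     }
     /* The high 32 bits of each EA come from ea_base; the list in
        local store supplies the low 32 bits and the sizes. */
     mfc_getl(dest, ea_base, list, sizeof(list), tag, 0, 0);
     mfc_write_tag_mask(1 << tag);
     mfc_read_tag_status_all();      /* wait for all list elements     */
 }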
This is an essential concept in the Cell arch. The Xbox 360 is instead designed as three Power cores, each dual threaded. I think each core has its own L1 i-cache and d-cache, and the L2 cache is shared among the six threads. There are many problems scaling this up. Every load/store on all six threads undergoes address translation, usually on a speed-critical path, using hot transistors. Every translated address must be compared against the L1 cache-line tag set, also a fast and hot critical path. To ensure coherency, many cache line snoops take place in the background. More heat. When the L1 cache misses, another round of cache tag compares in the larger L2 cache. More heat, and this cache is handling misses generated by all six threads. It needs circuits to handle a large number of in-progress transactions. More die area, more heat. There is probably a layer of speculative pre-fetch taking place, because the cache needs to keep ahead of the threads whenever possible. Some of these prefetches are never used. More waste heat. And if you end up missing on the L2 cache, the thread that missed is looking at a stall of many hundreds of clock cycles by the time external memory decodes the address request and returns the data.
What makes the SPU so power efficient is that it doesn't do any of this. The local store is private memory. No coherency tests, no tag matches, no contention (e.g. from L2 cache line fills). Well, some contention from the MFC DMA controller, but not much, since that was designed to fetch full 128 byte rows. The coherency and address translation heat/cost was minimized by making the DMA transaction the unit of coherency resolution, rather than every single pointer load or store. To deal with the increasingly large memory access latency, IBM decided to make the 128 byte memory load/store the optimal granularity for the EIB fabric and then they bolted on top of that nine fully asynch. DMA units with the ability to process scatter/gather lists. The latency to memory increases. It takes more time to queue up the DMA transaction, start the EIB transfer, blast it out to the XDR memory chips, get the data back and return the result.
This is where streaming comes into play. Because the memory controller can have hundreds of outstanding in-flight memory requests in progress, utilization of memory bandwidth can approach 100%. A firehose of bandwidth without paying much of a price in heat.
What you lose in the deal is being able to run simple programs on the SPU processors that access main memory willy-nilly. The programmer must go to the trouble of orchestrating the memory flows explicitly in DMA efficient granularities, and then vectorize the SPU computation as well, and deal with some very weird technicalities in how the multitasking OS is set up. This is why IBM constantly reiterates that the programming model is the biggest hurdle in Cell's adoption. Once the programmer does all this work, Cell achieves amazingly high FLOPS/watt throughput. You could never achieve these peak results on a processor that was paying the price for coherent translation of every pointer load/store.
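The 'vectorize the SPU computation' step mentioned above looks roughly like this with the SDK's SPU intrinsics (spu_splats and spu_madd are real intrinsics; the kernel itself is an invented example):

 #include <spu_intrinsics.h>
 
 /* y = a*x + y over the SPU's 128-bit registers: four floats per
    instruction instead of one. */
 void saxpy_quads(vec_float4 *y, const vec_float4 *x, float a, int nquads) {
     vec_float4 va = spu_splats(a);        /* broadcast a to 4 lanes */
     for (int i = 0; i < nquads; i++)
         y[i] = spu_madd(va, x[i], y[i]);  /* fused mul-add, 4 lanes */
 }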
I've been meaning to get into this. Getting the EIB sorted out was extremely hard work. IBM is keeping a big chunk of the EIB secret. Then I started taking some notes on the coherency model (primarily how the DMA address translation is set up). It didn't help my initial understanding that IBM always leaves out of their EIB diagrams the address-snoop bus and the interrupt request logic. Have you ever had the feeling that the elephant in the room that no one else is talking about is standing on your foot?
Another dark corner is how the PPE threads relate to their DMA controller. When the PPE threads run into an L2 miss, does the memory controller (in hardware) queue up a DMA request using the PPE MFC, just like every other memory request? This request would be an L2 cache line fill. I'm guessing the L2 cache line size is 128 bytes. Is there a fast request path from the PPE threads to the XDR memory for small memory transactions (such as memory marked as uncached)?
Yes, another mystery - I assumed that the PPE has a fast link, as you questioned at the end of the paragraph - it seems highly unlikely that it does not - after all, the SPEs are for streaming; the PPE should run like a 'normal processor'. Couldn't find a definite answer easily though. HappyVR 19:33, 11 April 2006 (UTC)
And then there is this stupid MMIO channel thing to figure out. The MMIO unit changes register settings on the various SPEs. How does it do this? Are these changes propagated on the EIB as system control packets, or is there another set of wires in the EIB left off all the standard diagrams?
Anyways, this is good stuff. We are now getting to the heart of the matter concerning what makes Cell special, interesting, and difficult. Does that help to answer your question? If not, ask again. I think if we can nail down this very complex set of issues regarding coherency, asynch DMA engines, etc. we will have the foundation in place for a very good account of what makes Cell unique. MaxEnt 06:46, 11 April 2006 (UTC)
Yes, it answers. However, I am bemused by the apparent exclusion of discrete memory access for SPEs in addition to the 'block move' instructions that are included. (However, I am also bemused by the apparent fashion for L2 cache, when using this die space for local memory for any given processor would in my opinion give much better results - e.g. reduce main memory bandwidth demand by 2 to 3 times - just my opinion.)
Taking this into account, I was thinking that some mention should be made in the article - probably in the section 'programming models' - of not using the SPEs to run in parallel by cutting up a data block into 8 smaller sections, BUT having all SPEs work on the same data block whilst cutting up the actual program into 8 smaller programs (possibly this is what the 'octopiler' is supposed to do) - perhaps even using the description 'serial multiprocessing' to describe Cell, with obvious reference to the more commonly talked about 'parallel multiprocessing'.
In terms of describing Cell in 'ten words or less' it might be fair to describe Cell simply as a 'PowerPC with additional streaming processing elements' - it's a somewhat less airy description than 'supercomputer on a chip'. HappyVR 19:33, 11 April 2006 (UTC)

[edit] TLB / DMA Patent

I found more information about the EIB last night, including this patent:

It's quite clear from what I said in the section above (in answer to your Harvard question) that my comments strike at the heart of IBM's original thinking about how to meld a GPU into a general-purpose offload cluster.

The text of the patent is explicitly about managing remote TLB caches associated with DMA engines in a plurality of processing elements. It talks explicitly about the advantage for the GPU in not having to do TLB lookups while working within local store data structures. I want to add patents to the references at some point.

I also found a reference which clearly states that IBM includes an address concentrator designed in a tree structure. Soon I'll have to come back and do EIB version 3. MaxEnt 13:43, 11 April 2006 (UTC)

Added section about this patent, first draft. Needs clean-up and links still. MaxEnt 13:14, 12 April 2006 (UTC)

[edit] ieee fp

Just remembered something - SPEs use the IEEE standard single precision floating point storage data format - but apparently do not conform to IEEE 754 standards in terms of 'numerical precision' - i.e. the data format is the same but calculations may not give exactly the same result as the IEEE standard (haven't found what exactly the difference is) - this is mentioned in the CBE documentation itself - I should probably add a minor note once I've convinced myself that this is right. HappyVR 19:40, 11 April 2006 (UTC)

IEEE compliance has to do with esoteric details such as rounding modes, underflow, overflow, infinities, and invalid results. It introduces some strange values such as -infinity, +infinity, -0, +0, and NAN (not a number); there are some strange rules about these strange quantities such as -infinity + -infinity = -infinity, but -infinity + +infinity = NAN (or worse). I'm just going from memory here. There are very complicated reasons why these conventions make sense. For representation, the issue that sometimes shows up is how denormals are handled. I've forgotten everything else I once knew. MaxEnt 13:13, 12 April 2006 (UTC)
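Those corner cases are easy to demonstrate in plain host-side C (nothing Cell-specific here):

 #include <math.h>
 #include <stdio.h>
 
 int main(void) {
     double ninf = -INFINITY, pinf = INFINITY;
     printf("%f\n", ninf + ninf);         /* -inf + -inf = -inf      */
     printf("%d\n", isnan(ninf + pinf));  /* -inf + +inf = NaN: 1    */
     printf("%f %f\n", -0.0, 0.0);        /* signed zeros print apart */
     printf("%d\n", -0.0 == 0.0);         /* ...but compare equal: 1 */
     return 0;
 }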

[edit] Patents

Well, it's quite apparent I just added quite a lot of new material concerning IBM's patents related to Cell technology. This kind of material can be quite dry to explain, so I supplied a certain "raising of the eyebrows" at junctures where that seemed like an appropriate response. I was quite careful not to pass judgement on the validity or merit of the patents themselves, or to veer into a tone potentially construed as a legal interpretation. I attempted to stress the angle of how IBM viewed the patent in the context of Cell wherever possible. I suppose my own POV is that the bad patents do more damage than the good patents are worth; I'm undecided to what extent the patent system could be fixed. I'm far from holding the view that IBM has any nefarious purpose in obtaining these patents. If they don't, they expose themselves to risk from competitors who do. The patent I most respected was the multiple DMA queue synchronization patent. The patent I least respected was the vector permute function inter. patent. Do they want us to use this chip, or waste our lives covering our backsides in patent research? The patent that made me laugh was the vector/scalar subpath patent. I guess a doctoral degree from Caltech is not worth what it used to be: when you leave the room, turn out the lights. Claim 1: method where the light switch is deactivated. Claim 2: method where the bulb is removed from socket. Claim 3: method where bulb is smashed with slingshot. Claim 4: method where nearby power pole is struck by car. Maybe turning out the lights is a novel idea to the eggheads who study there. Was my text too POV? Let me know. MaxEnt 20:16, 12 April 2006 (UTC)

I meant to also comment on formatting. I've put the portions quoted from the patents (average of about four sentences per patent) into bold type. Otherwise the text is just a long sequence of uniform blobs with nothing distinctive for the eye to grab onto. At least with this formatting the eye can run down the page looking for a favorite keyword to decide where to jump in, or not to jump in, as the case may be. Most often I've quoted a statement of sentiment about the problem IBM set out to solve as they expressed it to the patent office. I also quoted the passage about deactivating element IDs as this pertains directly to a high level attribute: the missing SPEs on consumer grade chips. I deliberately chose many of the patents dealing with the DMA structure as I view this as being central to Cell's conception. Other patents were selected to reflect a variety of perspectives. MaxEnt 20:31, 12 April 2006 (UTC)

I don't know if there is a Wiki preference for which patent database is linked. The USPTO links found through number search are more cumbersome. The Freepatent presentation worked for me.

http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/srchnum.htm&r=1&f=G&l=50&s1=6,826,662.WKU.&OS=PN/6,826,662&RS=PN/6,826,662 MaxEnt 09:27, 17 April 2006 (UTC)

[edit] Enough for now

I've accomplished my initial goals here for now. I would have liked to contribute a section on the coherency model, but it was impossible to work from IBM's descriptive material without writing code for the Cell simulator, so I settled for the patent analysis instead. I'll keep an eye on the discussion page but it could be a week or a month before I'm actively contributing again. MaxEnt 09:27, 13 April 2006 (UTC)

[edit] PPE double precision

I, or someone else, should edit this:
"The IBM PPE Vector/SIMD manual does not define operations for double precision floating point, though IBM has published material implying certain double precision performance numbers associated with the Cell PPE VMX technology."

to read something like: the PPE double precision performance numbers are ...; these operations are not performed by the VMX SIMD unit but are implemented in the Power instruction set 'core' - i.e. they are there but are not classed as VMX SIMD instructions - definitely not SIMD anyway? HappyVR 19:17, 13 April 2006 (UTC)

We don't know this. The old G5-centric VMX docs don't necessarily reveal all features of the Cell VMX. The double precision performance numbers cited for the PPE threads and SPEs in aggregate imply one double precision mul-add per PPE clock (counting both threads together). We know that the VMX unit was enhanced between DD1 and DD2. One scenario is that the PPE core contains one fully pipelined double prec. mul-add unit which is shared between both threads, and also shared between the FP execution pipe and the VMX execution pipe. For the VMX pipe with a pair of double prec. values, the mul-add unit would be used twice and take an extra clock cycle to complete a pair of mul-adds. It's also possible that there are two double prec. mul-add units which each accept a new double prec. mul-add every second clock cycle. The fmul instruction on the Pentium Pro was pipelined to accept a new instruction every two cycles, while the fadd was pipelined to take a new instruction on every cycle. It's also possible that each PPE thread running double prec. SIMD floating point can only bind the FP execution unit on alternate clock cycles, meaning that the two threads would not contend for this resource. However, that would mean achieving peak performance would require both PPE threads to be engaged in an FP intensive loop at the same time. If I were IBM, I would have added SIMD double to VMX and pipelined it to use a single double prec. mul-add unit twice. Then it only takes one PPE thread to achieve 6.4 GFLOPS, leaving the other PPE thread free to handle overhead (interrupts, etc). In this model, if both PPE processors tried to run VMX double prec. they would each stall out half the time; less transparent, but better results if used judiciously. You are trying to read too much into IBM's failure to issue an updated Cell-centric VMX specification. MaxEnt 04:21, 14 April 2006 (UTC)
I assumed the fp figure for the PPE was the result of a non-VMX extension PPE instruction - if so, is there the possibility of two double precision fp units per PPE? HappyVR 09:33, 14 April 2006 (UTC)
It's possible to get this performance from a single DP unit if the unit is fully pipelined to handle a mul-add on every clock cycle. This is the cheapest approach for IBM in terms of silicon area, but perhaps this level of pipelining is hard to achieve for freq. reasons. If so, IBM might have included two execution units that are pipelined to accept a new operation every second clock cycle. Simpler to achieve in the freq. domain, but consumes more silicon area; the two unit design might also alleviate hot spots. Whether this unit (or units) is available as part of the regular PowerPC 970 floating point instruction set, or as part of an extended VMX instruction set, or both, is not yet clear to me from any source I've uncovered. I haven't done much coding on the PPE side, so I could have a big hole in the reference set I've consulted. My vote is one fully-pipelined unit accessible from either instruction set, with contention effects if both threads go after it at the same time, but we are not here to vote, are we? It's possible that "fully pipelined" precisely means a new operand every clock cycle. MaxEnt 10:19, 14 April 2006 (UTC)
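For concreteness, the arithmetic behind the two contending scenarios (my own illustration; a fused mul-add counts as two flops):

 3.2 GHz x 1 mul-add/cycle x 2 flops/mul-add           = 6.4 GFLOPS (one fully pipelined unit)
 3.2 GHz x (1 mul-add / 2 cycles) x 2 units x 2 flops  = 6.4 GFLOPS (two half-rate units)

Either arrangement reproduces the published aggregate, so the numbers alone can't distinguish them.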

[edit] VSU Execution Unit

I finally found a good reference for this, searching on the term "VSU", which I noticed in a diagram:

The VSU floating-point execution unit consists of a 32- by 64-bit register file per thread, as well as a ten-stage double-precision pipeline. The VSU vector execution units are organized around a 128-bit dataflow. The vector unit contains four subunits: simple, complex, permute, and single-precision floating point. There is a 32-entry by 128-bit vector register file per thread, and all instructions are 128-bit SIMD with varying element width (2 × 64-bit, 4 × 32-bit, 8 × 16-bit, 16 × 8-bit, and 128 × 1-bit).

The model seems to be a three-operand scalar mul-add every clock cycle. Since each thread only dispatches on alternate clock cycles, each PPU thread achieves 3.2 DP GFLOPS, for 6.4 GFLOPS combined (as we know from another source).

I'll have more material for the main article soon. MaxEnt 04:15, 20 April 2006 (UTC)

[edit] Missing terms

PPU is defined nowhere in the article.

[edit] KiB to KB vandalism

This seems to be a favorite place for IP users to jump in and change something. Someone else reverted this same edit recently. This time around I hyperlinked KiB in each case to make it a little less tantalizing. MaxEnt 02:14, 2 June 2006 (UTC)

There's no need to link each instance. Just revert whenever someone changes it. Fortunately the use of IEC binary prefixes has been supported by the MOS for a while, and the policy has withstood several motions to remove it. Just keep the article in line with MOS and there's little reason to explain your actions to anonymous editors who take no further interest in the article. By the way, you might want to place a space between quantities and units, since that is also proper convention (technically, a non-breaking space). -- uberpenguin @ 2006-06-02 02:33Z

[edit] revert confusing

I reverted the confusing attribute, which came from a masqueraded IP address at the U of Singapore with no justification here on the talk page. I agree that the article could be clarified, but the technology is complex and it's not a simple matter to obtain clarity. Help pinpointing the greatest sources of confusion would be appreciated; an ugly banner is not. MaxEnt 02:14, 2 June 2006 (UTC)

[edit] pass at the intro

I wanted to restore some muscular and lively word choices without edging back toward hype. For $400m I think you can safely refer to it as groundbreaking. Other strong words that give the intro some needed flavour are modest, streamlined, exotic, prowess, challenge and ambition. I expect to become more active with this article again over the next week or so, and this time with more of a view to the article as a whole. MaxEnt 05:36, 2 June 2006 (UTC)

Groundbreaking skirts on POV too much, I think. If only because someone somewhere might disagree. Go ahead and make your edits, but don't be surprised if some of the words you want to use get toned down by other editors... -- uberpenguin @ 2006-06-02 14:49Z
I'd agree that "groundbreaking" is too much POV. The intro as a whole goes over the line somewhat. "Exceptional prowess", "exotic features", etc, are just too much. Also, the terms like "coherent EIB" and "streamlined coprocessing elements" don't seem to get across any definite information. What is coherent about the EIB? What is streamlined about the coprocessing elements? It just sounds like a bunch of sexy adjectives thrown on. Instead of saying groundbreaking, the intro should instead very briefly mention how Cell is different from other architectures. Something like "Cell's design largely differs from x86. While x86 XYZs, Cell ABCs." I'd make such a change myself, but don't know enough about the topic. Also, all this talk about "challenging" really needs to go since you aren't mentioning why it's challenging. Something like "Because Cell's architecture differs from ABC..." and how that challenges software developers would be a lot better than "groundbreaking initiative".
In my view the rest of the article supports these claims. I'm not sure how to pack justification into the intro without making the intro indigestible. I was planning to add refs to the intro in support, but there is so much literature to sift through I didn't have the right ones handy. In my view, the groundbreaking nature of Cell is its most essential attribute: it occupies a niche unlike anything that has come before (console to supercomputer, skipping mostly over the PC for lack of good branch prediction), previous editors of this article did not even twig to the unnatural semi-autonomous relationship between the PPE and SPE processors, it's the first design to hybridise general purpose processors with GPU-like processors in this fashion, and most of the future challenges programming this monster remain to be solved. We have the single prec. floating point capacity of a Cray from five years ago compressed into a 200mm die and "prowess" is regarded as POV. Doesn't that smack of the PC excess of the modern-day liberal arts college? In contrast to, say, the Xbox 360, which breaks no interesting ground that I can think of, how does one convey that this chip is truly a horse of a different colour? Groundbreaking means to "break ground", i.e. to sow your seeds where no one has even sown a crop before. Once upon a time, back when food came from soil, the heavy breathing it conveyed was hard labour, not adolescent sexual excitement. I once read advice to aspiring playwrights: never fear becoming too interesting, there is never any shortage in Hollywood of those who can dull you down. I prefer evocative language as a matter of course; however, if POV has become tainted with the tall-poppy reflex, I'm more than willing to substitute wooden nickels as appropriate. My main concern is to drive the message across that Cell *is* different in many fundamental respects. Between the guy that marked the article "confusing" (which creates a force toward short little declarative sentences) and the people who trim any word that calls attention to itself, it becomes quite the art form to write anything that elicits cortical blood flow. I've always maintained that it is the fault of the writer if the reader sleeps through the proceedings. I guess I'll have to whip out my file and file off any sharp edges that remain on my chainsaw for reasons of general safety. MaxEnt 21:23, 3 June 2006 (UTC)
Alright, I lose. I scanned for existing uses of "groundbreaking" on the web. Of course every nutritional supplement ever bottled is thus labeled (typo for windbreaking?), but also we find superconductors, the blue LED, many discoveries in particle physics and the life sciences, and most amusingly, the Boeing 707. Even with all the patents I supplied, in a POV-hypervigilant context, it's perhaps a little over the top. MaxEnt 21:34, 3 June 2006 (UTC)
I understand your reasons for wanting to use colorful language, and your justifications. Unfortunately technical and encyclopedic writing are not the appropriate places to be using the kind of evocative language more suitable for an editorial or review. The fact is simply that the kind of words you want to use carry too much connotation. Look, if we don't call the Von Neumann architecture, arguably the basis for all modern digital computers, "groundbreaking", we shouldn't use that adjective for Cell either. Put it into perspective, Cell is a new spin on old ideas. By your own admission, Cray was building Cell-like designs decades ago. What we basically have in Cell is a mini super-computer architecture on a die. I'm not in any way discounting that achievement, but calling it "groundbreaking" is giving it more credit than it is due, I'm afraid. Now we're back to my example of the VN architecture. Sure it was nothing short of the foundation of everything we know as a "computer" today, but it wasn't a new idea, it was simply repackaged and implemented by a certain mathematician. It was not groundbreaking because it really didn't break any new ground, it just combined some existing ideas and produced an easily identifiable result. -- uberpenguin @ 2006-06-03 21:45Z
Sounds like you came to a conclusion just before I wrote the previous. Just keep in mind that you must keep your own opinions out of your writing here. Your writing should present things in a factual manner without bias, even if that does sacrifice some engaging diction. -- uberpenguin @ 2006-06-03 21:47Z
You're making too much of my biases. My bias is that ground-breaking as a phrase is rooted in a non-POV metaphor, a point which usage has apparently buried. I would definitely say the Cray was ground-breaking, because it *did* break new ground. The Cell memory coherence model, for which IBM received many patents, is likewise new intellectual turf. I wasn't attempting to imply that IBM had planted the plains of Idaho with Monsanto corn. Well, it's a pervasive American sentiment that ground-breaking is quantized in units the size of Texas. But I strongly object to the notion that I'm biased about the *subject matter*. Our difference concerns usage. I mostly drilled it out and replaced it with specific, narrowly drawn attributes. It bulks out the intro to get the same essence across that could have been cued into the main text with a more evocative jumping-off point. Hey, I enjoy being pushed, I just complain a lot. Now people are going to want less, no doubt. Quantization in units the size of Texas on a radical Atkins diet. MaxEnt 00:19, 4 June 2006 (UTC)
I reviewed my changes as they stand and I'm now satisfied in the axis of getting the essential message across: basic constituent elements, extending range of applications, novelties of approach, software challenges imposed by IBM's departure from design convention. I'm not thrilled with the last sentence, it stubs its toe wrapping up, but I don't see my way clear right now. Where I need feedback now is whether this tries to say too much for someone who is approaching this article for the first time. It's funny to me how ground-breaking in this community is laudatory first and foremost. I thought I defused that by placing it near to the $400m observation. I would most definitely expect to break some ground throwing that kind of cash around on the *design* phase of a project. I wonder how many of my other cues are activating mental mouse-traps before the light goes on, e.g. my use of the word exotic is to tip the casual reader off that he/she is not necessarily expected to recognize the technical terms that follow. Likewise I book-ended modest against prowess. Is it POV that I think the PPE is modest? It's a thin shadow of what a G5 can do. Prowess seems justified when this kind of paper is being published: http://www.cs.berkeley.edu/~samw/projects/cell/CF06.pdf - drooling over the rather slow DP performance, modestly improved (yes, indeed, this paper uses the word "modest"). Cell is a complex topic. If the reader has no agility at all, I fear it's a losing battle. MaxEnt 01:09, 4 June 2006 (UTC)
Removed the stubbed toe. I was thinking we might also have a difference in our bias toward notability. I happen not to regard Cell as notable because of all the money spent / toys produced, but because it represents a new direction (that might not pan out) after the chip design industry reached an inflection point following 20 years of chasing clock frequency as the one true god. If Cell pans out over the next decade or more, it will rupture certain long-standing conventions in how high-performance software is developed/deployed. When clock freq. goes up, up, up, no-one really has to think very hard. Now we have to think again. Carmack thinks Cell sucks. Good for him. LBNL thinks that Cell rocks (or could rock with *modest* improvements). Good for them. So I want the intro paragraph to tap into notability on that axis in particular. That was my objective wading into the lede paragraph. MaxEnt 02:04, 4 June 2006 (UTC)