Talk:Xenos
From Wikipedia, the free encyclopedia
Move
Suggest moving to Xenos (GPU). No assertion of usage prominence over other uses for "Xenos". Would favour Xenos (Greek) under first precedence.--ZayZayEM (talk) 05:55, 10 December 2007 (UTC)
- It would make sense to me, as the concept is the source of the other names. Another possibility, though, would be to point this to the disambiguation page. --Falcorian (talk) 17:49, 10 December 2007 (UTC)
So, how was the vertex rate figured?
In this article, they seem to have come to a "1.5 billion vertices per second" figure from somewhere. I would assume they thought that since a typical polygon is a triangle with three vertices, 1.5 billion vertices would form 500 million polygons. So it seems they worked backwards from the "500 million" figure released by Microsoft, and just assumed 1.5 billion vertices from there.
It's my understanding that typical gpu listings for this figure calculate vertex shader processing by assuming it takes 4 clock cycles to complete the simplest vertex positional transform (matrix * vector). So each vertex shader you have is good for 1/4 of the clock frequency in vertices per second. That's true of all gpus from what I can tell. And since all polygons in a mesh could (theoretically) be sharing a vertex with their neighbors in strips and fans, the theoretical maximum polygon count is sometimes "considered" 1 to 1 with vertex rate.
The problem here is that (being unified) all of Xenos's alus could potentially be processing geometry at once, as would be the case in a z-only pre-pass. But it wouldn't make much sense to list that as a per-second figure of 6 billion. And technically, 1 block of alus devoted to vertex work would be 2 billion vertices per second. But again, even that wouldn't make much sense, for a list of reasons.
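For anyone wanting to check the numbers in the paragraphs above, here is a rough sketch in Python. The 500 MHz clock is the one used in the shader-op figures on this page. The 16-alu block size is an assumption inferred from the 2 billion figure (48 alus split into 3 blocks), and the function name is made up for illustration.

```python
CLOCK_HZ = 500_000_000      # Xenos core clock, as used elsewhere on this page
CYCLES_PER_TRANSFORM = 4    # assumed cost of the simplest matrix * vector transform

def vertex_rate(alus, clock_hz=CLOCK_HZ, cycles=CYCLES_PER_TRANSFORM):
    """Theoretical vertices per second if `alus` units did nothing but transforms."""
    return alus * clock_hz // cycles

all_alus = vertex_rate(48)    # every unified alu on vertex work (z-only pre-pass case)
one_block = vertex_rate(16)   # one block of alus; 16 per block is an assumption

setup_limit = 500_000_000              # Microsoft's quoted polygon set-up figure
unshared_vertices = setup_limit * 3    # 3 unshared vertices per triangle

print(all_alus)           # transform rate, all alus
print(one_block)          # transform rate, one block
print(unshared_vertices)  # where the article's 1.5 billion seems to come from
```

The point of the sketch is only that the 6 billion and 2 billion figures are transform rates, while the 1.5 billion figure is the set-up limit multiplied by 3, which is a different quantity.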
So, Microsoft listed the "set-up" limit in their specifications. That would be the maximum you could actually draw on screen, after backface and occlusion culling, etc. And with a reasonable number of vertex shader instructions (beyond a simple transform), you would avoid reaching that limit.
I'm not sure how it should technically be listed, but isn't the 1.5 billion figure ad-libbed? I believe the traditional "shared vertex" condition is already applied to the 500 million figure. Thus, you would likely never see much more than a few million polygons drawn in any given frame (which I strongly believe to be the case).
Anyone disagree? Agree? Thanks. Edit: I guess it would be better to simply quote the Microsoft figures, but it's just annoying to hear folks claim its vertex ability performs as though it had 4 vertex shaders, when they ignore the difference between a "set-up rate" and a "transform rate". Swapnil 404 (talk) 22:53, 20 December 2007 (UTC)
Double shader performance for the Xenos?
One of the lead engineers who was working on the Xenos (his name escapes me at the moment) stated that the Xenos is capable of 96 billion shader ops per second. That's twice the amount stated by Microsoft. I'm assuming that the pipelines now do 2 vector4 ops and 2 scalar ops, so 4 ops per pipeline, and 48*4*500,000,000=96,000,000,000 shader ops per second. I don't know if this is true, or whether what I said just then made any sense. I'm just wondering if anyone can confirm or disprove this, but if it's right, can you please post it in the article? (If I'm right, I think it may affect everything, and definitely make the flop count per pipeline 20.) —Preceding unsigned comment added by Gears, Gears, Gears (talk • contribs) 02:49, 29 January 2008 (UTC)
- Yeah, I remember when he said that. Feldstein, I think it was. But I don't think he was implying shader ops. Perhaps shader flops: 4 flop madds per cycle, per shader, rather than just listing vector and scalar. It could be that you can issue a vector, scalar, vertex fetch, and texture load, all within one instruction cycle. That would be 4. And it would contrast with Nvidia, because they hadn't decoupled their texturing ops from shader ops, etc., so one stalls the other. Swapnil 404 (talk) 02:00, 23 February 2008 (UTC)
- I remember seeing 64 unified shaders way back before the official releases, was that a downgrade or just speculation? —Preceding unsigned comment added by 201.81.199.213 (talk) 18:02, 30 April 2008 (UTC)
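On the 96 billion figure at the top of this thread, the arithmetic can be sketched as below. It only multiplies pipelines, ops per pipeline per clock, and clock speed. The "2 ops per pipeline" case (one vector plus one scalar) is an assumption about how the Microsoft figure was counted, and the function name is hypothetical.

```python
CLOCK_HZ = 500_000_000   # 500 MHz, from the calculation in the opening comment
PIPELINES = 48           # unified shader pipelines

def shader_ops_per_second(ops_per_pipeline, pipelines=PIPELINES, clock_hz=CLOCK_HZ):
    """Ops per second if every pipeline issues `ops_per_pipeline` each clock."""
    return pipelines * ops_per_pipeline * clock_hz

microsoft_figure = shader_ops_per_second(2)  # assumed: 1 vector + 1 scalar per clock
engineer_figure = shader_ops_per_second(4)   # 4 ops per clock, however they are counted

print(microsoft_figure)
print(engineer_figure)
print(engineer_figure // microsoft_figure)   # ratio between the two counts
```

The sketch says nothing about which 4 things are being counted per clock, which is the open question in the replies above.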
Shader flop or op count
I thought when they said shader op they were referring to shader ALU operations per second. Anyway, the same question applies to the PS3's RSX, because it performs 2 scalar, 2 vector and one fog op per pipe (is a fog op similar to a vector op, but missing a colour value?), so should we change it to shader flops per second for both? Or am I completely wrong? —Preceding unsigned comment added by Gears, Gears, Gears (talk • contribs) 03:52, 25 February 2008 (UTC)
- They were, for Xenos. Just vector + scalar. They didn't consider anything else, like interpolation units, fetch units, etc. I was just suggesting the fetching and loading as possible examples of what Feldstein meant by 4 ops, other than him just misspeaking.
- RSX can't compute a shader in the same cycle it issues a texture fetch. And there are still situations where one alu stalls the other, etc. Xenos has separate logic for such tasks. So on RSX, issuing a texture fetch (and some of its latency) cuts directly into the shader operations, while on Xenos it doesn't.
- Really though, you'd have to specify what is meant by "shader op", as it could be any number of things. (An fp16 normalize could be considered one. I doubt they consider the fog alu an op, as I "think" it's there more for legacy software, and modern games compute fog in the shader itself. And things like those mini-alus seem to be there to meet the shader model 3.0 specification requirements.)
- Just as a base figure, an RSX shader alu can do 4 flop madds, for vector and scalar ops (vector3 + scalar, vector2 + vector2, or vector4, etc.).
- (madd is multiply + add, considered 2 flops)
- So, 2 alus per shader, each capable of 4 flop madds, madd = 2 flops.
- 24 x 2 = 48 alus, and 48 x (4 x 2) = 384 flops per clock.
- http://www.watch.impress.co.jp/game/docs/20060329/3dps309.htm
- But then, you could ask where the vertex shaders are considered in those figures. (could be just a ps slide)
- But it is a slide meant for developers, so no need to go counting every flop you can find to inflate the number, etc. Swapnil 404 (talk) 20:47, 29 February 2008 (UTC)
- Truth is, my knowledge of the subject is more limited than your own, but I would just like to know: should we keep the shader op count on this page at 96 billion per second, or should we change it to something you see as correct? Also, since you seem to be more informed about the Xenos than me, could you work out how many programmable flops there are per clock for the Xenos? Gears, Gears, Gears (talk) 09:12, 5 March 2008 (UTC)
Nah, I wouldn't change it on my own really. I've read the 96 billion quote, but I don't think it implies what's considered raw shaders. There are a bunch of things you could consider as a "shader op". A Microsoft rep has said Xenos could do "160 shader operations per cycle" or more, if you consider the 32 control flow ops, 16 texture fetches, and 16 programmable vertex fetches per clock, and consider that they can all be issued simultaneously, while on RSX they cut directly into shader operations to varying degrees (the first alu in each of RSX's shaders doubles as a tmu for texture calls, for example). And perhaps he's right. But then, there would be other things to consider on RSX as well.
And just for straight "shaders", it'd be just the 48 alus, all vector4 MADD + scalar special function. (The scalar seems to be 1 flop, from a few different places I've read, although a Microsoft rep had calculated it as 2 in one of their flops comparisons.) So, 48 x 4 x 2 = 384 for vector, plus 48 more for scalar: 432 generic shader flops per cycle. If we count the scalar as 2, then it's 480, which matches the Microsoft rep's figure of 240 billion per second. There are flops involved in other operations, but I think most would limit "programmable" to just those. Swapnil 404 (talk) 15:48, 5 March 2008 (UTC)
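The two tallies in the comment above can be written out as a short Python sketch. The split of the "160 operations per cycle" into 48 vector plus 48 scalar alu ops is an assumption; only the 32, 16 and 16 figures are quoted in the comment. The names are invented for illustration.

```python
CLOCK_HZ = 500_000_000
ALUS = 48
VECTOR_FLOPS = 4 * 2    # vector4 MADD: 4 lanes x (multiply + add)

def xenos_flops_per_cycle(scalar_flops):
    """Programmable shader flops per clock, counting the scalar as 1 or 2 flops."""
    return ALUS * (VECTOR_FLOPS + scalar_flops)

# Per-clock "operations" if everything issued in parallel is counted.
ops_per_cycle = {
    "vector alu ops": 48,           # assumed: one per alu
    "scalar alu ops": 48,           # assumed: one per alu
    "control flow ops": 32,
    "texture fetches": 16,
    "programmable vertex fetches": 16,
}

print(sum(ops_per_cycle.values()))           # total "shader operations" per cycle
print(xenos_flops_per_cycle(1))              # scalar counted as 1 flop
print(xenos_flops_per_cycle(2))              # scalar counted as 2 flops (MADD)
print(xenos_flops_per_cycle(2) * CLOCK_HZ)   # flops per second at 500 MHz
```

With the scalar counted as a MADD, the per-second result lines up with the 240 billion figure attributed to the Microsoft rep.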
- Thanks for clearing up all those things, Swapnil, but I would like to know one more thing: could you give a straight programmable shader flop performance comparison (including the vertex pipelines on the RSX) of the Xenos and RSX? If not, then that's ok, but it's definitely four flop MADDs for a vector4 op and 1 flop for a scalar op, right? Gears, Gears, Gears (talk) 07:58, 6 March 2008 (UTC)
Well, I'm sure it depends a lot on what the load is: the ratio of vertex shaders to pixel shaders, the number of texture fetches involved, etc., along with a list of other factors. On paper, it used to be thought that RSX had more raw shader power, while Xenos was more "efficient". Of course, that was before RSX was clocked back to 500 MHz (650 MHz ram), and doesn't consider any other components involved with "shading".
From the folks who've worked directly with the hardware, and would be in a position to know first hand (and are actually willing to talk about it), most have said Xenos > RSX. Especially for vertex shader work (by quite a bit), but it seems perhaps pixel shaders as well in some cases. Code optimized to run really well on RSX could be expected to run ok on Xenos in many instances, but code optimized for Xenos would overwhelm RSX in some areas. (Of course, none of that assumes eventually using Cell to reduce RSX work with pre-culling, etc.) Overall, for gpu shader performance, most give the edge to Xenos to varying degrees.
And a vertex shader is vector4+scalar. Xenos's alus need to be capable of both pixel and vertex work. I would guess the scalar is a single flop. But I guess it could be either, as I've heard it both ways. Swapnil 404 (talk) 01:44, 7 March 2008 (UTC)
Just one last question, and thanks for answering all the rest: what operations in the pipeline make up a pixel shader (or what you could call a general pixel shader)? And can you also tell me if this is right, to the best of your knowledge:
RSX total programmable flops (pixel and vertex pipes) = 24x2x(4x2)+8x(4x2)+8 = 456 shader flops per cycle (or 464 if scalar = 2 flops)
Xenos total programmable flops = 48x(4x2)+48 = 432 shader flops per cycle (or 480 if scalar = 2 flops)
Does the RSX have 24x2 because its instructions are co-issued for the pixel pipelines? Thanks for all the help anyway. Gears, Gears, Gears (talk) 09:58, 7 March 2008 (UTC)
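The two totals in the comment above do check out arithmetically. A minimal Python sketch follows, taking the comment's own breakdown (24 pixel pipes with 2 alus each, 8 vertex shaders with a vector4 MADD plus a scalar) as given rather than as a confirmed spec. The function names are hypothetical.

```python
VEC4_MADD = 4 * 2   # 4 lanes x (multiply + add) = 8 flops

def rsx_flops_per_cycle(scalar_flops):
    pixel = 24 * 2 * VEC4_MADD               # 24 pixel pipes x 2 alus each
    vertex = 8 * (VEC4_MADD + scalar_flops)  # 8 vertex shaders: vector4 + scalar
    return pixel + vertex

def xenos_flops_per_cycle(scalar_flops):
    return 48 * (VEC4_MADD + scalar_flops)   # 48 unified vector4 + scalar alus

for scalar_flops in (1, 2):
    print(scalar_flops,
          rsx_flops_per_cycle(scalar_flops),
          xenos_flops_per_cycle(scalar_flops))
```

These are peak per-clock counts only. They say nothing about texture fetch stalls or utilization, which is where most of the disagreement in this thread lies.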
- Yeah, pretty much. There are two alus tied together in each pipeline. Pixel shaders are typically vector3+scalar. (red, green, blue, alpha)
- An Nvidia pixel shader alu just did 4 flops at a time (madd capable). So, it could do a vector3+scalar, or a vector4, or scalar+scalar, or a vector2+vector2, depending on what needs processing.
- A typical ati gpu had 2 alus per pipe, but only one was madd capable, the other was just an add, and they were just standard "vector3+scalar". Meaning that any time vector2 instructions came up, the alu could only process one at a time, and the other flops went to waste in that cycle. (but I don't think those came up very often)
- The difference for ati was that they had separate logic for issuing texture fetches. So, they don't waste any alu cycles doing that, and could hide fetch latency with flops, by just processing something else until it gets what it needs. Nvidia alus were far more likely to stall waiting for a fetch result.
- Xenos has 48 vector4+scalar alus, all madd, and decoupled from texture fetches and filtering. Swapnil 404 (talk) 21:47, 7 March 2008 (UTC)
- I think you may have already known this, but the Xenos's scalar ALU is MADD, so it's 2 shader flops, so the 240 gigaflops comment for the Xenos is correct. Gears, Gears, Gears (talk) 08:48, 10 March 2008 (UTC)
- Just a thought: since the ops performed are scalar MADD ops, does that mean you could use the scalar ops to perform pixel shader ops over four cycles (to make one pixel shader op)? Gears, Gears, Gears (talk) 06:39, 11 March 2008 (UTC)
That could be. I've heard it described as madd, but also as an add (Watch Impress Japan, I think it was). They implied that they reorganized an ati shader, cut the add vectors in favor of an additional madd, and kept the add scalar as a special function. Something to that effect. I would go with them being madd though, until I found otherwise elsewhere, because I can't find that article. It could have simply been speculation too.
And I would assume that if they're madd capable, they could be scheduled to issue a random scalar that happens to come up, in parallel to the vector. But I have no idea how versatile they are, or if they could process something in pieces like that. My guess would be no, but I wouldn't know for sure. I know that if you had just a vector2 add instruction to be processed, the mul capability of those lanes goes unused that cycle, along with the other two vector madds. It couldn't just do two separate vector2s in parallel, or two vector2 adds and two vector2 multiplies, like the flops numbers would indicate.
8 potential flops in the alu just go unused that cycle, so I don't know how far they'd go to more efficiently use the scalar. I guess it's up to the compiler to vectorize efficiently. That's why G80 went to an all-scalar gpu. Swapnil 404 (talk) 01:25, 12 March 2008 (UTC)
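One way to arrive at the "8 unused flops" in the comment above is sketched below in Python. It assumes the scalar unit is MADD capable and sits idle that cycle; with a 1-flop scalar the count would differ. The helper names are made up.

```python
VECTOR_LANES = 4      # the vector4 part of one alu
FLOPS_PER_MADD = 2    # multiply + add

def peak_flops(scalar_is_madd=True):
    """Peak flops per clock for one vector4 + scalar alu."""
    scalar = FLOPS_PER_MADD if scalar_is_madd else 1
    return VECTOR_LANES * FLOPS_PER_MADD + scalar

def used_flops(width, multiply, add):
    """Flops actually done by one vector instruction of the given width."""
    return width * (int(multiply) + int(add))

def unused_flops(width, multiply, add, scalar_is_madd=True):
    # The scalar unit is assumed idle for this instruction.
    return peak_flops(scalar_is_madd) - used_flops(width, multiply, add)

print(peak_flops())                               # full vector4 MADD + scalar MADD
print(unused_flops(2, multiply=False, add=True))  # a lone vector2 add
```

Under those assumptions a lone vector2 add uses 2 of the 10 peak flops, which is why peak flops numbers overstate what a fixed 4+1 layout delivers on narrow instructions.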
- I was assuming that since the R600's pipes are similar to the Xenos's, and its pipes are vector4+scalar MADD and it can assign the extra scalar to do shader operations (over multiple clock cycles, around 4), it could be possible. That's why they say the R600 has 320 scalar pipelines, 64x5=320 (5 flop madds per cycle per pipe). But what you say also makes sense, so I'll have to get some more proof. Gears, Gears, Gears (talk) 02:44, 13 March 2008 (UTC)
Well, seems R600's are all MIMD, they're not locked at vector4+scalar. They could process 1+1+1+1+1 if needed. (or any other combination) Xenos vectors are SIMD, so they're vector4 with the additional scalar at all times. Like this: http://www.behardware.com/medias/photos_news/00/20/IMG0020142.gif Xenos would be listed as "4+1" if it were listed. They're 5D, but they couldn't break it up in the same way. An R600 alu has 5 separate madd capable flops. With one also capable of other tasks when needed. Scalar in Xenos probably functions as a special function unit as well, for "sin, cos, exp, log, etc." like the R600. http://www.behardware.com/medias/photos_news/00/19/IMG0019979.jpg
"One R600 calculation unit is composed of 5 math units, one of them being able to handle special tasks, and one branch unit." Swapnil 404 (talk) 08:19, 13 March 2008 (UTC)
- I get it, so with the information we have it's a no for now (and probably a no for sure). Gears, Gears, Gears (talk) 03:55, 15 March 2008 (UTC)
Yeah, I would assume not. It'd probably have a lot of work as it is. But it's a console, so they could code "to the metal" if they wanted, so who knows. Swapnil 404 (talk) 17:23, 18 March 2008 (UTC)
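To illustrate the R600 versus Xenos point from a few comments up, here is a toy Python model. It is only a sketch under loud assumptions: the ops are independent scalar MADDs that the compiler could not vectorize, an R600 unit fills all 5 lanes with them, and a Xenos 4+1 alu issues at most one on the vector unit (3 lanes idle) and one on the scalar unit. Real scheduling is certainly more complicated, and the function names are hypothetical.

```python
import math

def r600_scalar_pipelines(units=64, lanes=5):
    """The '320' figure: 64 units x 5 madd-capable lanes each."""
    return units * lanes

def cycles_needed(independent_scalar_ops, slots_per_cycle):
    """Clocks one unit needs if it can issue `slots_per_cycle` such ops per clock."""
    return math.ceil(independent_scalar_ops / slots_per_cycle)

R600_SLOTS = 5    # 1+1+1+1+1, per the comment above
XENOS_SLOTS = 2   # toy assumption: one op on the vector4 unit, one on the scalar

ops = 10
print(r600_scalar_pipelines())
print(cycles_needed(ops, R600_SLOTS))   # clocks on one R600-style unit
print(cycles_needed(ops, XENOS_SLOTS))  # clocks on one 4+1 alu under this toy model
```

On fully vectorized vector4+scalar work both layouts would be busy every clock, so the gap only shows up on code that does not pack neatly into 4+1.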
- I know this doesn't really relate to the Xenos, but could you clear something up? Many people assume that because the Cell can do many fp32 ops (because of its clock speed), it could be about as good at shader operations as the Xenos, RSX, or any other graphics card. But it's not designed to do the calculations that graphics cards do, so would that negate the high clock speed (compared to graphics cards), since it can only do 4 MADD fp32 ops per cycle per SPE? If this is not supposed to be here, please tell me to delete it. Gears, Gears, Gears (talk) 06:27, 1 April 2008 (UTC)
Well, Cell's spes can do 2 flops on up to 4 32-bit pieces of data, since they have 128-bit registers. And clock frequency would help, but there are only 7 spes, compared to a far greater number of shaders. As a straight gpu though, it's been said that Cell isn't nearly as useful as a typical gpu for things like pixel shaders, texturing, etc. But I guess it would depend on what type of floating point processing you were doing, and how being more flexible with regard to ram, branching, etc. could benefit more than just straight flops. The idea of a gpu is to offload certain graphics tasks that are easily processed on simplified processors: powerful, but only at what they were designed to process. Things like real-time lighting calculations haven't progressed as much as other gpu functions have. For example: http://gametomorrow.com/blog/index.php/2005/11/30/gpus-vs-cell/ http://gametomorrow.com/blog/index.php/2007/09/05/cell-vs-g80/ (technically, this is comparing a G80 running a general ray-tracer to Cell running its own version, which I would assume is more tuned to it)
But anyway, Cell in PS3 has other tasks to take care of. One spe is lost to redundancy, one to security/os, and one can be taken at any time by the os. A number of developers were talking about using one for sound (which seems like overkill), and a few others to pre-cull geometry before it gets passed to RSX, and also for ai, collision detection, particles, animation, physics, etc. Cell varies in its efficiency at several of those, compared to other cpus. But at certain other things, it seems extremely fast. You wouldn't get "that" kind of lighting performance, but perhaps they'll have some left over once they get more accustomed to the hardware. Swapnil 404 (talk) 20:07, 11 April 2008 (UTC)
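For a rough sense of scale on the clock speed question, here is a peak-flops sketch in Python. The 3.2 GHz Cell clock is not stated in this discussion and is brought in as an assumption. The 480 flops per cycle for Xenos is the scalar-as-MADD count used earlier on this page. The function name is made up.

```python
SPE_LANES = 4                    # 4 x 32-bit values per 128-bit register
FLOPS_PER_LANE = 2               # multiply + add
CELL_CLOCK_HZ = 3_200_000_000    # assumption: PS3 Cell clock, not given in this thread

XENOS_FLOPS_PER_CYCLE = 480      # 48 alus x (vector4 MADD + scalar MADD)
XENOS_CLOCK_HZ = 500_000_000

def cell_peak_flops(spes, clock_hz=CELL_CLOCK_HZ):
    """Peak single-precision flops per second across `spes` SPEs."""
    return spes * SPE_LANES * FLOPS_PER_LANE * clock_hz

xenos_peak = XENOS_FLOPS_PER_CYCLE * XENOS_CLOCK_HZ

print(cell_peak_flops(7))   # all 7 spes, with nothing reserved
print(xenos_peak)           # Xenos programmable shader peak
```

Even with every spe counted, the higher clock does not make up for having 7 four-wide units against 48 five-wide ones, and peak flops ignore texturing and fetch hardware entirely.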