User talk:MaxDZ8

From Wikipedia, the free encyclopedia

Reply on User comment in Talk:Shader

Thank you too for always maintaining the page. I am also not sure what is best, but we are human and we keep trying to do our best (though it will never happen). Cheers....^^. Wish you good days too.... Draconins 14:34, 3 August 2006 (UTC)

Re: "Interesting Stream GPU" History Removal and rewrite.

I removed the history because some of it is incorrect. To note:

1.) NV2x had programmable control for both Vertex and Fragment operations, not just vertex operations as the history states.
2.) There is no such thing as "RD3xx"; it's the R3xx series (Radeon 9700). Branching support was static only in the fragment pipeline.
3.) While NV4x had conditional branching in the fragment pipeline, its granularity was very large (4K pixels for NV4x, 1K for G7x), limiting the usefulness.
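To illustrate why batch granularity matters for branching, here is a rough sketch, not a model of any real chip. It assumes a simplified SIMD scheme in which pixels are processed in fixed-size batches in scan order, and a batch whose pixels disagree on a branch has to execute both sides. The function name `divergent_fraction` and the test pattern are made up for illustration; only the 4K and 1K batch sizes come from the points above.

```python
def divergent_fraction(mask, batch_size):
    """Fraction of batches whose pixels disagree on the branch.

    mask: per-pixel branch decisions (True = branch taken), in scan order.
    batch_size: number of pixels that share one branch decision.
    Assumption: a divergent batch pays for both sides of the branch.
    """
    batches = 0
    divergent = 0
    for start in range(0, len(mask), batch_size):
        chunk = mask[start:start + batch_size]
        batches += 1
        # A batch is divergent when it contains both True and False.
        if any(chunk) and not all(chunk):
            divergent += 1
    return divergent / batches


# Hypothetical pattern: the branch outcome flips every 1024 pixels.
mask = [(i // 1024) % 2 == 0 for i in range(65536)]

for size in (4096, 1024, 64):
    print(size, divergent_fraction(mask, size))
```

With this particular pattern, every 4096-pixel batch straddles both outcomes, while 1024-pixel and smaller batches line up with the runs and never diverge. Real workloads are two-dimensional and far less regular, so this only shows the direction of the effect.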

The point of the update was to provide a little more structure around the history of graphics processors by linking them to their DirectX functionality (Wiki has plenty of entries on DX so these can be referenced) and to also point out that it's only really Shader Model 3.0 parts that will provide much in the way of usefulness when it comes to Stream Processing outside of standard graphics processing.


Thank you for discussing the issue there.

My proposed version

  • GPUs are recognized as widespread, consumer-grade stream processors. Although they are usually limited in hardware functionality, a great deal of effort is spent on optimizing algorithms for this family of processors, which usually have very high horsepower. Various generations are to be noted from a stream processing point of view.
    1. Pre-NV2x: no explicit support for stream processing. Kernel operations are usually hidden in the API and provide too little flexibility for general use.
    2. NV2x: kernel stream operations are now explicitly under the programmer's control but only for vertex processing (fragments are still using old paradigms). No branching support severely hampers flexibility but some algorithms can be run (notably, low-precision fluid simulation).
    3. RD3xx: increased performance and precision with limited support for branching/looping in both vertex and fragment processing. The model is now flexible enough to cover many purposes, especially on vertex processing level supporting dynamic branching.
    4. NV4x: currently (September 25, 2005) state of the art. Very flexible branching support, although some limitations still exist on the number of operations to be executed and strict recursion depth. Performance is estimated to be from 20 to 44 GFLOPs.

Your proposal, being discussed

  • GPUs are becoming more recognized as widespread, consumer-grade stream processors. GPUs based around Microsoft’s DirectX 8 API, such as ATI’s R200 or NVIDIA’s NV20, introduced some programmability into the fragment pipeline; however, it was so limited that the only use was specifically for 3D graphics related applications. Later, with graphics processors conforming to the DirectX 9 specification, increased programmability was introduced into the pipeline, and the Shader Model 3.0 specification demanded FP32 precision throughout and fully conditional branching in both the Vertex and Fragment pipelines, thus making them more attractive for potential non-graphics uses.

Although several generations of parts conforming to the Shader Model 3.0 specification came from ATI (R520, R580) and NVIDIA (NV40, G70, G71), only ATI’s R580 graphics processor has thus far gained much traction in applications outside of standard 3D graphics, with commercial applications such as those provided by Peakstream, and research with the distributed computing application Folding@Home. This can be attributed not just to its programmable performance, rated at 374 GFLOPs @ 650 MHz, but also to its finer thread sizing, benefiting dynamic branching performance, and its handling of available register space in comparison to other Shader Model 3.0 compliant architectures. [1]


A first difference is in the structure, as you say. Although my version reinforces structure within the page, yours reinforces structure between pages. I agree shader models should be mentioned but I still think the previous structure to be better. This gives the best of both worlds.

I don't think there's a strict need for SM3.0; it makes things much easier, but saying it's a must-have is something I feel is definitely excessive. On the NV1x generation, a few managed to run even full radiosity processes. I definitely disagree on SM3.0 as a need.

On your comment

   
“NV2x had programmable control for both Vertex and Fragment operations, not just vertex operations as the history states.”

Register combiners were definitely programmable and very powerful when coupled with the texture_shader extension but, as you know, it's a long way from NV_vertex_program (vertex programmability). Register combiners do fall in the Pre-NV2x class (it's really more a fixed function pipe, and in fact NV20's functionality was just an overhauled NV_register_combiners from NV1x, so if you're right this means NV1x was also programmable). I think the previous version to be right here.

   
“There is no such thing as "RD3xx"; it's the R3xx series (Radeon 9700). Branching support was static only in the fragment pipeline.”

Minor naming issue. I agree to change it, but I'll still personally go for RD3xx (high-end chip) and RV3xx... (value-driven, like 9600). True for dynamic branching in Vertex processing vs pixel processing. Considering however most interesting decisions happen in the FS, I believe this is a minor issue... I agree this needs to be clarified.

   
“While NV4x had conditional branching in fragment pipeline, its granularity was very large (4K pixels for NV4x, 1K for G7x) limiting the usefulness.”

Negligible performance issue. To the programmer interested in this kind of thing, the really important thing is that it works. Please note the history is actually feature-driven rather than performance-driven. IMHO even with limited performance the thing is indeed useful and would likely turn out to be faster than CPU processing anyway. R5xx chips are just catching up with NV4x-G70 (I don't really care for alpha-to-coverage AA) with improved speed. Not a bad thing but not even an important improvement. I believe this should be kept.

On the edit itself

   
“R200 or NVIDIA’s NV20, introduced some programmability into the fragment pipeline however it was so limited the only use was specifically for 3D graphics related applications...”

Just not true, but likely to be unprovable now in both directions. As said above, the fragment processing was really an improved fixed pipe rather than a programmable one. I also disagree that researchers began to be interested in GPGPU when SM3.0 was released.

   
“only ATI’s R580 graphics processor has thus far gained much traction in applications outside of standard 3D graphics, with commercial applications such as those provided by Peakstream, and research with the distributed computing application Folding@Home”

Just plain wrong. See the NVIDIA developer pages as well as GPU Gems 1 and 2, and various NVSDK examples. ATI's R580 is undoubtedly the first card to use pro apps to provide wow-factor, a questionable marketing strategy.

   
“This can be attributed not just to its programmable performance, rated at 374GFLOPs @ 650MHz, but also its finer thread sizing, benefiting dynamic branching performance, and handling of available register space in comparison to other Shader Model 3.0 compliant architectures.”

http://www.techreport.com/etc/2006q4/stream-computing/index.x?pg=3: plain and simple, the article is off topic. GPU Folding@home is still in beta. I would wait for more accurate estimates in a less controlled environment before this can be declared a winner.

Long story short

I realize your intentions are good and some propositions are definitely an improvement. I still think the previous version to be a better one and I will RV your change 3 days from now. Thank you again for discussing the issue there instead of messing up the talk page.

MaxDZ8 talk 06:48, 13 October 2006 (UTC)


RE (WipEout!)

   
“Register combiners were definitely programmable and very powerful when coupled with the texture_shader extension but, as you know, it's a long way from NV_vertex_program (vertex programmability). Register combiners do fall in the Pre-NV2x class (it's really more a fixed function pipe, and in fact NV20's functionality was just an overhauled NV_register_combiners from NV1x, so if you're right this means NV1x was also programmable). I think the previous version to be right here.”

Although NV1x and NV2x both feature some limited register combiner functionality, NV2x did also extend the programming model slightly with the introduction of FX8 ALUs. NV2x conforms to the (albeit limited) PS1.1 programming model, whereas NV1x’s register combiners didn’t. NV1x was basically a Dot3 product part, while NV2x has PS1.1 capabilities, which go beyond Dot3.

R200 should also get a mention: by supporting PS1.4 it (relatively, in DX8 terms) significantly increased the flexibility and capabilities of the fragment pipeline, as well as introducing FX16 precision ALUs.

Look, do whatever you want with this. I remember there were a few differences but I have no time to check. I still believe this should be in GPGPU however.
MaxDZ8 talk 07:57, 14 October 2006 (UTC)
   
“Minor naming issue. I agree to change it, but I'll still personally go for RD3xx (high-end chip) and RV3xx... (value-driven, like 9600).”

The point I’m making here is that there is no such part as “RD3xx”; it's R300 or R3xx. R300 is Radeon 9700.

I understood this perfectly; in fact, I integrated it.
MaxDZ8 talk


   
“True for dynamic branching in Vertex processing vs pixel processing. Considering however most interesting decisions happen in the FS, I believe this is a minor issue... I agree this needs to be clarified.”

You are correct that the Fragment Shaders are generally more important for GPGPU purposes, which makes it more important in the context of the article to be correct and state that dynamic branching and looping are not available in the fragment shaders on these parts. (Although linking it to DirectX and Shader Model 2.0 will point out the true capabilities).

Which happen to be a mess... this is really meant to be a quick overview fitting a few lines. I believe those would fit better in GPGPU. Let's try to keep the two articles distinct.
MaxDZ8 talk 07:57, 14 October 2006 (UTC)


   
“Negligible performance issue. To the programmer interested in this kind of thing, the really important thing is that it works. Please note the history is actually feature-driven rather than performance-driven. IMHO even with limited performance the thing is indeed useful and would likely turn out to be faster than CPU processing anyway.

R5xx chips are just catching up with NV4x-G70 (I don't really care for alpha-to-coverage AA) with improved speed. Not a bad thing but not even an important improvement. I believe this should be kept.”

Features and performance are always important. However, in applications such as these, how the features are achieved is of vital importance, and is exactly why one architecture is getting more traction than another.

See overview issue. Performance is definitely less important than features in design terms since it's usually evaluated after the design has been chosen. Do I need a citation here?
MaxDZ8 talk 07:57, 14 October 2006 (UTC)

While R5xx chips came later than NV4x/G7x chips and brought, ostensibly, the same feature set, the architecture differences between them directly relate to the applicability in these types of applications. The problem with G7x is that it handles only a single, very large pixel batch per quad in order to hide texture latency, meaning that with longer shaders using more register space the thread size has to decrease (when the register space fills), making it impossible to hide the texture latency. R5xx has a threading mechanism where it juggles up to 128 (fine grain) threads per quad; when register space is used up there are just fewer threads, but there are more likely to be other threads that can be scheduled for ALU work.

See again the overview issue. Again, I believe this has a better place in GPGPU and not here.

Mike Houston points out here how these architectural differences affect performance specifically for GPGPU applications:

“Mike Houston: All GPUs are SIMD, so branching has a performance consequence. We have carefully designed the code to have high branch coherence. The code heavily relies on a tremendous amount of looping in the shader. On ATI, the overhead of looping and branching can be covered with math, and we have lots of math. We run the fragment shaders pretty close to peak for the instruction sequence used, i.e. we can't fully use all the pre-adders on the ALUs. But, I wouldn't say branching is the enabler. I'd say the incredible memory system and threading design is what currently make the X1K often the best architecture for GPGPU. Those allow us to run the fragment engines at close to peak. What ATI can do that NVIDIA can't that is currently important to the folding code being run is that we need to dynamically execute lots of instructions per fragment. On NVIDIA, the shader terminates after 64K instructions and exits with R0->3 in Color[0]->Color[3]. So, on NVIDIA, we have to multi-pass the shader, which crushes the cache coherence and increases our off-chip bandwidth requirements, which then exacerbates the below. The other big thing for us is the way texture latency can be hidden on ATI hardware. With math, we can hide the cost of all texture fetches. We are heavily compute bound by a large margin, and we could actually drive many more ALUs with the same memory system. NVIDIA can't hide the texture latency as well, and perhaps more importantly, even issuing a float4 fetch (which we use almost exclusively to feed the 4-wide vector units) costs 4 cycles. So NVIDIA's cost=ALU+texture+branch, whereas ATI is MAX(ALU, texture, branch).”
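The cost model in the last sentence of the quote can be sketched as follows. This is only an illustration of the two formulas as stated; the function names and the cycle counts are hypothetical, not measurements from either vendor's hardware.

```python
def serialized_cost(alu, texture, branch):
    """Cost when ALU, texture and branch work cannot overlap:
    cost = ALU + texture + branch (the NVIDIA case in the quote)."""
    return alu + texture + branch


def overlapped_cost(alu, texture, branch):
    """Cost when the three kinds of work overlap fully:
    cost = MAX(ALU, texture, branch) (the ATI case in the quote)."""
    return max(alu, texture, branch)


if __name__ == "__main__":
    # Hypothetical cycle counts for one compute-bound shader.
    alu, texture, branch = 100, 40, 10
    print("serialized:", serialized_cost(alu, texture, branch))
    print("overlapped:", overlapped_cost(alu, texture, branch))
```

For a compute-bound shader like the one described, the overlapped cost equals the ALU cost alone, which is what "we can hide the cost of all texture fetches" amounts to in this model.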

Although there will be applications where G7x will outperform a CPU, Mike shows the performance implications the G7x architecture has on GROMACS calculations on page 32 of this presentation - G70’s architecture in this application is providing less performance than a P4 3.0GHz while R520 is 2.5-3.5x faster.

I want to state it clearly: I have no doubt there's a significant performance difference but again, this is on the programming model and it's meant to be an introduction. It would be desirable to put in all the detail we can but unfortunately, human brains can handle 7±2 pieces of information on average, so we need to keep it simple.
MaxDZ8 talk 07:57, 14 October 2006 (UTC)


   
“Just plain wrong. See the NVIDIA developer pages as well as GPU Gems 1 and 2, and various NVSDK examples. ATI's R580 is undoubtedly the first card to use pro apps to provide wow-factor, a questionable marketing strategy.”

I agree that there have been plenty of tests and applications that have experimented with GPUs for general purpose applications; however, the point I’m trying to convey is that it's only now, with Shader Model 3.0 and the specifics of the R5xx architecture, that we are beginning to see some actual commercial and end user applications that are providing useful improvements over CPU processing. I don’t agree with the notion of just dismissing this as marketing.

Mass-marketing didn't happen now because of PS3.0 but because PS3.0 is becoming a commodity. I believe the model "by itself" didn't attract more attention than previous models (normalizing on availability of course). Also, this fits perfectly in the bigger picture.
MaxDZ8 talk 07:57, 14 October 2006 (UTC)
   
“plain and simple, the article is off topic.

GPU Folding@home is still in beta. I would wait for more accurate estimates in a less controlled environment before this can be declared a winner.”

I don’t understand how the article linked is off topic. I thought it was wholly relevant to the topic. Although perhaps some of the material from Mike Houston may be more so?

It is not off topic; it is that this is more on Stream Processing than on GPGPU details. When I first wrote it, I considered not even mentioning GPUs... since I believe GPUs to be the best stream processors available I took a note of it but please, consider disclosing information in a following step for interested people. MaxDZ8 talk 07:57, 14 October 2006 (UTC)

I’d also not be so keen to dismiss Folding just because it’s a beta. Yes, it is a beta, but this is now something that anyone can download and access to see real practical utilization of GPGPU/stream processing on their graphics cards. Stanford / the Folding project are currently getting useful data from the client as well: so far they are receiving 29 TFLOPs of active processing power, nearly a fifth of the processing power of all the active CPUs, but from merely 442 GPUs, each providing over 70x the processing power of an active CPU.
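As a back-of-the-envelope check on the figures above, the arithmetic can be sketched like this. Only the 29 TFLOPs, 442 GPUs, "a fifth" and "70x" come from the comment; reading 70x as a per-unit comparison is an assumption, and the derived CPU totals are merely what those rough figures imply, not reported statistics.

```python
GPU_TOTAL_TFLOPS = 29.0   # from the comment
GPU_COUNT = 442           # from the comment
GPU_SHARE_OF_CPU = 0.2    # "nearly a fifth", treated as exactly 1/5 (assumption)
PER_UNIT_RATIO = 70.0     # "over 70x", read as per GPU vs per CPU (assumption)


def per_gpu_gflops(total_tflops, count):
    """Average throughput of one GPU client, in GFLOPs."""
    return total_tflops * 1000.0 / count


def implied_cpu_total_tflops(gpu_total_tflops, share):
    """CPU total implied if the GPU total is `share` of it."""
    return gpu_total_tflops / share


def implied_cpu_count(gpu_total_tflops, gpu_count, share, ratio):
    """Number of CPUs implied by the share and the per-unit ratio."""
    per_cpu = per_gpu_gflops(gpu_total_tflops, gpu_count) / ratio
    cpu_total = implied_cpu_total_tflops(gpu_total_tflops, share) * 1000.0
    return cpu_total / per_cpu


if __name__ == "__main__":
    print(round(per_gpu_gflops(GPU_TOTAL_TFLOPS, GPU_COUNT), 1), "GFLOPs per GPU")
    print(round(implied_cpu_total_tflops(GPU_TOTAL_TFLOPS, GPU_SHARE_OF_CPU)), "TFLOPs implied CPU total")
    print(round(implied_cpu_count(GPU_TOTAL_TFLOPS, GPU_COUNT,
                                  GPU_SHARE_OF_CPU, PER_UNIT_RATIO)), "CPUs implied")
```

Under these assumptions each GPU averages roughly 65.6 GFLOPs, and the three figures are mutually consistent only if the 70x is understood per unit rather than in total.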

See again overview issue. By the way, who should see what?
MaxDZ8 talk 07:57, 14 October 2006 (UTC)

Tesla Roadster

Hi, I fixed one of your edits on the Tesla Roadster talk page that Ilokjju had vandalized. I want to give you a heads up that I suspect a sock puppeteer (Curaralhos, Ilokjju, DrPersti, ElonMusky, Mu8sky, Rogerstone, Prof nomamescabron, Prof Bujju, and maybe Uramanbfas, Prof Schnitzer, and 216.180.72.14) is editing the page and the article. I also noticed one possible puppet, Maraimo, vandalized Folding@home. Kslays 21:08, 15 December 2006 (UTC)