Single instruction, multiple threads
Single instruction, multiple threads (SIMT) is a parallel execution model, used in some GPGPU platforms, in which multithreading is simulated by SIMD processors. The processors, say p of them, appear to execute many more than p tasks. This is achieved by each processor having multiple "threads" (also called "work-items" or "sequences of SIMD lane operations"), which execute in lock-step and are analogous to SIMD lanes.[1]
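As a concrete illustration, the following is a minimal CUDA sketch (the kernel name `scale`, the data, and the launch parameters are illustrative, not from the source): every thread executes the same instruction stream, and the hardware runs groups of threads in lock-step, much like SIMD lanes.

```cuda
// Minimal SIMT sketch in CUDA. All names and sizes here are illustrative.
__global__ void scale(float *data, float factor, int n) {
    // Every thread runs this same instruction stream; threads are grouped
    // into warps whose lanes fetch and execute each instruction in lock-step.
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
    if (i < n)
        data[i] *= factor;  // each lane operates on its own element
}

int main() {
    const int n = 1 << 20;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    // Launch many more threads than the GPU has physical lanes; the
    // hardware multiplexes them onto the available SIMD-like units.
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```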
SIMT was introduced by Nvidia:[2][3]
[The G80 Nvidia GPU architecture, Tesla] introduced the single-instruction multiple-thread (SIMT) execution model where multiple independent threads execute concurrently using a single instruction.
SIMT is intended to limit instruction-fetch overhead,[4] and is used in modern GPUs (including, but not limited to, those of Nvidia and AMD) in combination with latency hiding to enable high-performance execution despite considerable latency in memory-access operations. With latency hiding, the processor is oversubscribed with computation tasks, so it can switch quickly to another task whenever one would otherwise have to wait on memory. This strategy is comparable to hardware multithreading in CPUs (not to be confused with multi-core).[5]
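As a rough sketch of latency hiding (the kernel and launch figures are illustrative assumptions, describing the typical behaviour of a GPU warp scheduler), a memory-bound kernel is launched with far more threads than there are execution lanes, so the scheduler can issue instructions from one group of threads while another waits on its memory loads:

```cuda
// Illustrative latency-hiding sketch: while one warp stalls on its load
// from `in`, the scheduler switches to another resident warp, keeping
// the execution lanes busy.
__global__ void copy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];  // the memory stall here is hidden by warp switching
}

// Example launch: ~1M threads oversubscribe the few thousand lanes of a GPU.
//   copy<<<4096, 256>>>(d_in, d_out, 4096 * 256);
```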
A downside of SIMT execution is that thread-specific control flow is implemented with "masking", leading to poor utilisation when a processor's threads follow different control-flow paths. For instance, to handle an if-else block in which various threads of a processor take different paths, all threads must actually process both paths (since all threads of a processor always execute in lock-step), but masking is used to disable and enable the threads as appropriate. Masking is avoided when control flow is coherent for the threads of a processor, i.e. they all follow the same path of execution. The masking strategy is what distinguishes SIMT from ordinary SIMD, and has the benefit of inexpensive synchronization between the threads of a processor.[1]
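The effect can be seen in a sketch like the following (a hypothetical CUDA kernel, not from the source): because the branch condition differs between adjacent threads, each warp executes both arms of the if-else, masking off the inactive lanes in each arm.

```cuda
// Divergence sketch: adjacent threads take different branches, so the
// warp serially executes both paths under masks.
__global__ void divergent(int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) {
        out[i] = 2 * i;  // even lanes active; odd lanes masked off
    } else {
        out[i] = 3 * i;  // odd lanes active; even lanes masked off
    }
    // If the condition were uniform across a warp (e.g. based on
    // blockIdx.x), control flow would be coherent and no masking needed.
}
```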
The terminology used by Nvidia's CUDA, AMD's OpenCL, and Hennessy and Patterson's textbook maps as follows:

| Nvidia CUDA | AMD OpenCL | Hennessy & Patterson |
|---|---|---|
| Thread | Work-item | Sequence of SIMD Lane operations |
| Warp | Wavefront | Thread of SIMD Instructions |
| Block | Workgroup | Body of vectorized loop |
| Grid | NDRange | Vectorized loop |
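In CUDA source code, the terms in the first column surface as built-in index variables; a brief sketch (the kernel and variable names are illustrative):

```cuda
// How CUDA exposes the hierarchy in the table: threads within blocks,
// blocks within a grid, with warps scheduled together by the hardware.
__global__ void ids(int *warp_of_thread) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index in the grid
    int warp = threadIdx.x / warpSize;              // warp index within this block
    warp_of_thread[i] = warp;
}

// Example launch: a grid of 8 blocks, each a group of 128 threads.
//   ids<<<8, 128>>>(d_buf);
```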
References
- [1] Michael McCool; James Reinders; Arch Robison (2013). Structured Parallel Programming: Patterns for Efficient Computation. Elsevier. p. 52.
- [2] "NVIDIA Fermi Compute Architecture Whitepaper" (PDF). NVIDIA Corporation. 2009. Retrieved 2014-07-17.
- [3] "NVIDIA Tesla: A Unified Graphics and Computing Architecture". IEEE. 2008. p. 6 (subscription required). Retrieved 2014-08-07.
- [4] Rul, Sean; Vandierendonck, Hans; D’Haene, Joris; De Bosschere, Koen (2010). "An experimental study on performance portability of OpenCL kernels". Symp. Application Accelerators in High Performance Computing (SAAHPC).
- [5] "Advanced Topics in CUDA" (PDF). cc.gatech.edu. 2011. Retrieved 2014-08-28.