Non-Uniform Memory Access


Non-Uniform Memory Access or Non-Uniform Memory Architecture (NUMA) is a computer memory design used in multiprocessors, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors.

NUMA architectures logically follow in scaling from symmetric multiprocessing (SMP) architectures. Their commercial development came in work by Burroughs, Convex Computer (later Hewlett-Packard), SGI, Sequent and Data General during the 1990s. Techniques developed by these companies later featured in a variety of Unix-like operating systems, as well as, to some degree, in Windows NT and later versions of Microsoft Windows.


Basic concept

Architecture of a NUMA system. Notice that the processors are connected to the bus or crossbar by connections of varying thickness/number, showing that different CPUs have different access priorities to memory based on their location.

Modern CPUs operate considerably faster than the main memory they are attached to. In the early days of high-speed computing and supercomputers, the CPU generally ran slower than its memory, until the performance lines crossed in the 1970s. Since then, CPUs, increasingly starved for data, have had to stall while waiting for memory accesses to complete. Many supercomputer designs of the 1980s and 1990s focused on providing high-speed memory access as opposed to faster processors, allowing them to work on large data sets at speeds other systems could not approach.

Limiting the number of memory accesses provided the key to extracting high performance from a modern computer. For commodity processors, this means installing an ever-increasing amount of high-speed cache memory and using increasingly sophisticated algorithms to avoid "cache misses". But the dramatic increase in the size of operating systems and of the applications run on them has generally overwhelmed these caching improvements. Multiprocessor systems make the problem considerably worse: now a system can starve several processors at the same time, notably because only one processor can access the memory at a time.

NUMA attempts to address this problem by providing separate memory for each processor, avoiding the performance hit when several processors attempt to address the same memory. For problems involving data spread across many tasks (common for servers and similar applications), NUMA can improve the performance over a single shared memory by a factor of roughly the number of processors (or separate memory banks).

Of course, not all data ends up confined to a single task, which means that more than one processor may require the same data. To handle these cases, NUMA systems include additional hardware or software to move data between banks. This operation has the effect of slowing down the processors attached to those banks, so the overall speed increase due to NUMA will depend heavily on the exact nature of the tasks run on the system at any given time.
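
The effect of memory placement can be observed directly on NUMA hardware. Below is a minimal sketch using the Linux libnuma API; the node numbers and buffer size are illustrative, and a machine with at least two NUMA nodes is assumed. It pins execution to one node, then times write sweeps over a buffer allocated on the local node versus one allocated on a remote node.

    /*
     * A minimal sketch of NUMA-aware allocation with the Linux libnuma
     * API (compile with: cc -O2 numa_sweep.c -lnuma). Node numbers and
     * buffer size are illustrative.
     */
    #include <numa.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    #define SIZE (256UL << 20)          /* 256 MiB per buffer */

    /* Write over the whole buffer repeatedly; return elapsed seconds. */
    static double sweep(char *buf)
    {
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (int pass = 0; pass < 8; pass++)
            memset(buf, pass, SIZE);
        clock_gettime(CLOCK_MONOTONIC, &b);
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void)
    {
        if (numa_available() < 0 || numa_max_node() < 1) {
            fprintf(stderr, "requires a NUMA system with two or more nodes\n");
            return 1;
        }
        numa_run_on_node(0);                        /* pin execution to node 0 */
        char *local  = numa_alloc_onnode(SIZE, 0);  /* memory on node 0        */
        char *remote = numa_alloc_onnode(SIZE, 1);  /* memory on node 1        */
        if (!local || !remote) {
            fprintf(stderr, "allocation failed\n");
            return 1;
        }
        printf("local  node sweep: %.3f s\n", sweep(local));
        printf("remote node sweep: %.3f s\n", sweep(remote));
        numa_free(local, SIZE);
        numa_free(remote, SIZE);
        return 0;
    }

On typical two-socket systems the remote sweep is measurably slower, reflecting the extra interconnect hop; the numactl command-line tool exposes the same placement controls without code changes.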

Cache coherent NUMA (ccNUMA)

Nearly all CPU architectures use a small amount of very fast non-shared memory known as cache to exploit locality of reference in memory accesses. With NUMA, maintaining cache coherence across shared memory has a significant overhead.

Although simpler to design and build, non-cache-coherent NUMA systems become prohibitively complex to program in the standard von Neumann architecture programming model. As a result, virtually all NUMA computers sold commercially use special-purpose hardware to maintain cache coherence, and are thus classed as "cache-coherent NUMA", or ccNUMA.

Typically, this is achieved by inter-processor communication between cache controllers, which maintain a consistent memory image when more than one cache stores the same memory location. For this reason, ccNUMA performs poorly when multiple processors attempt to access the same memory area in rapid succession. Operating-system support for NUMA attempts to reduce the frequency of this kind of access by allocating processors and memory in NUMA-friendly ways and by avoiding scheduling and locking algorithms that make NUMA-unfriendly accesses necessary.
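
The cost of this kind of contention can be illustrated without any NUMA-specific API. In the sketch below (a 64-byte cache line is assumed; the names and iteration count are illustrative), two threads increment counters that first share a cache line and then sit on separate lines; the shared-line case typically runs several times slower because the coherence protocol must shuttle the line between the processors' caches.

    /*
     * Illustration of coherence traffic ("false sharing"): two threads
     * update adjacent counters, forcing the cache-coherence hardware to
     * bounce the containing line between processors.
     * Compile with: cc -O2 -pthread false_sharing.c
     */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>

    #define N 100000000L

    /* Two counters packed onto one 64-byte cache line ... */
    static _Atomic long together[2] __attribute__((aligned(64)));
    /* ... versus two counters padded onto separate 64-byte lines. */
    static _Atomic long apart[16] __attribute__((aligned(64)));

    static void *bump(void *p)
    {
        _Atomic long *c = p;
        for (long i = 0; i < N; i++)
            atomic_fetch_add_explicit(c, 1, memory_order_relaxed);
        return NULL;
    }

    /* Run two incrementing threads on the given counters; return seconds. */
    static double run(_Atomic long *x, _Atomic long *y)
    {
        pthread_t t1, t2;
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        pthread_create(&t1, NULL, bump, x);
        pthread_create(&t2, NULL, bump, y);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        clock_gettime(CLOCK_MONOTONIC, &b);
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void)
    {
        printf("same cache line:      %.2f s\n", run(&together[0], &together[1]));
        printf("separate cache lines: %.2f s\n", run(&apart[0], &apart[8]));
        return 0;
    }

Operating systems apply the same reasoning at a larger scale: Linux, for example, defaults to a "first touch" policy that places each page on the node of the CPU that first touches it, keeping threads close to their data.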

Current implementations of ccNUMA systems are multiprocessor systems based on the Intel Itanium and AMD Opteron processors. Earlier ccNUMA approaches included systems based on the MIPS processor in SGI systems and the Alpha EV7 processor from Digital Equipment Corporation (DEC).

NUMA vs. cluster computing

One can view NUMA as a very tightly coupled form of cluster computing. The addition of virtual memory paging to a cluster architecture can allow the implementation of NUMA entirely in software where no NUMA hardware exists. However, the inter-node latency of software-based NUMA remains several orders of magnitude greater than that of hardware-based NUMA.


References

This article was originally based on material from the Free On-line Dictionary of Computing, which is licensed under the GFDL.

