Memory-level parallelism

Memory-level parallelism (MLP) is a term in computer architecture referring to the ability to have multiple memory operations pending at the same time, in particular cache misses or translation lookaside buffer (TLB) misses.
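
The definition above gives no formula. One way to put a number on it is to average the count of misses in flight over the cycles in which at least one miss is in flight. The sketch below uses that metric as an assumption. The function name `average_mlp` and the interval representation are illustrative and do not come from any particular tool.

```python
def average_mlp(miss_intervals):
    """Average number of misses in flight, taken over the cycles in which
    at least one miss is in flight.

    miss_intervals: list of (start_cycle, end_cycle) half-open intervals,
    one per cache or TLB miss. This input format is an assumption.
    """
    events = []
    for start, end in miss_intervals:
        events.append((start, +1))  # miss issued
        events.append((end, -1))    # miss data returned
    events.sort()  # at equal times a completion (-1) sorts before an issue (+1)

    busy_cycles = 0   # cycles with at least one miss outstanding
    miss_cycles = 0   # outstanding misses summed over those cycles
    outstanding = 0
    prev_time = 0
    for time, delta in events:
        if outstanding > 0:
            busy_cycles += time - prev_time
            miss_cycles += outstanding * (time - prev_time)
        outstanding += delta
        prev_time = time
    return miss_cycles / busy_cycles if busy_cycles else 0.0


serial = average_mlp([(0, 100), (100, 200)])    # back-to-back misses
overlapped = average_mlp([(0, 100), (0, 100)])  # fully overlapped misses
```

Under this metric, two misses serviced one after the other give a value of 1, and two misses that overlap completely give 2; partial overlap falls in between.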

In a single processor, MLP may be considered a form of instruction-level parallelism (ILP). However, ILP is often conflated with superscalar execution, the ability to execute more than one instruction at the same time. For example, a processor such as the Intel Pentium Pro is five-way superscalar, able to start executing five different microinstructions in a given cycle, while it can handle four different cache misses for up to 20 different load microinstructions at any time. The two capacities are separate properties of the machine.

It is possible to have a machine that is not superscalar but which nevertheless has high MLP.
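
A toy model may make the separation between issue width and miss capacity concrete. The sketch below is a deliberately simplified simulation, not a model of any real processor. The function `run_loads` and its parameters are hypothetical, and every load is assumed to be independent and to miss in the cache.

```python
def run_loads(num_loads, issue_width, miss_buffers, latency):
    """Cycle at which the last of num_loads independent, always-missing
    loads completes.

    issue_width:  loads that may start per cycle (hypothetical parameter)
    miss_buffers: misses that may be outstanding at once (hypothetical)
    latency:      cycles from issue until the data returns
    """
    in_flight = []  # completion cycle of each outstanding miss
    issued = 0
    cycle = 0
    last_done = 0
    while issued < num_loads:
        # Free the buffers of misses whose data has returned by now.
        in_flight = [t for t in in_flight if t > cycle]
        slots = min(issue_width,
                    miss_buffers - len(in_flight),
                    num_loads - issued)
        for _ in range(slots):
            in_flight.append(cycle + latency)
            last_done = cycle + latency
        issued += slots
        cycle += 1
    return last_done


scalar_high_mlp = run_loads(8, issue_width=1, miss_buffers=4, latency=100)
wide_low_mlp = run_loads(8, issue_width=5, miss_buffers=1, latency=100)
```

In this model, the machine that issues one load per cycle but tracks four misses finishes eight loads in 203 cycles, while the five-wide machine with a single miss buffer needs 800 cycles. Once loads miss, the number of misses that can overlap dominates, not the issue width.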

Arguably, a machine that has no ILP (one that is not superscalar and executes one instruction at a time in a non-pipelined manner) but that performs hardware prefetching (as opposed to software, instruction-level prefetching) exhibits MLP but not ILP. Multiple prefetches, and hence multiple memory operations, are outstanding at once, but never more than one instruction. The distinction is easy to miss because instructions are often conflated with operations.
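
The scenario above can be sketched as a small simulation. The next-N-line prefetcher, the function name `simulate`, and the one-cycle execution cost are all assumptions made for illustration. The text does not describe any specific prefetcher.

```python
def simulate(num_lines, latency, prefetch_depth):
    """Non-pipelined machine that loads cache lines 0..num_lines-1 in order,
    one instruction at a time, with a hypothetical next-N-line hardware
    prefetcher. Returns (total_cycles, peak_outstanding_memory_ops).
    """
    ready_at = {}  # cache line -> cycle at which its data arrives
    cycle = 0
    peak = 0
    for line in range(num_lines):
        # Demand access: fetch the line unless it was already requested.
        if line not in ready_at:
            ready_at[line] = cycle + latency
        # The prefetcher requests the following lines on its own.
        last = min(line + 1 + prefetch_depth, num_lines)
        for nxt in range(line + 1, last):
            if nxt not in ready_at:
                ready_at[nxt] = cycle + latency
        outstanding = sum(1 for t in ready_at.values() if t > cycle)
        peak = max(peak, outstanding)
        # Only one instruction is ever in flight: it blocks until its data
        # arrives, then takes one cycle to execute (assumed cost).
        cycle = max(cycle, ready_at[line]) + 1
    return cycle, peak


with_prefetch = simulate(8, latency=100, prefetch_depth=3)
without_prefetch = simulate(8, latency=100, prefetch_depth=0)
```

With a prefetch depth of 3, the model reaches four memory operations outstanding at once and finishes in 205 cycles. With prefetching disabled it never has more than one outstanding and takes 808 cycles. In both runs exactly one instruction is in flight at any time.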

Furthermore, multiprocessor and multithreaded computer systems may be said to exhibit MLP and ILP simply because they run several threads in parallel, but this is not intra-thread, single-process ILP and MLP. Often, however, the terms MLP and ILP are restricted to parallelism extracted from what appears to be non-parallel, single-threaded code.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.
