Memory scrubbing

From Wikipedia, the free encyclopedia

Memory scrubbing consists of reading from each computer memory location, correcting bit errors (if any) with an error-correcting code (ECC), and writing the corrected data back to the same location.[1]

Motivation

Because of the high integration density of modern computer memory chips, individual memory cell structures have become small enough to be vulnerable to cosmic rays and alpha particle emission. Errors caused by these phenomena are called soft errors. They can affect both DRAM-based and SRAM-based memories.

The probability of a soft error at any individual memory bit is very small. However, when combined with

  • the large amount of memory with which computers, especially servers, are now equipped, and
  • uptimes of several months,

the probability of a soft error somewhere in the total installed memory is significant.
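
As a rough illustration of how a tiny per-bit probability adds up, the following Python sketch assumes that bit flips are independent and occur at a constant rate. The function name, the 64 GiB memory size, the three-month uptime, and especially the per-bit rate are hypothetical values chosen for illustration, not measurements.

```python
import math


def prob_at_least_one_error(bits: int, rate_per_bit_hour: float, hours: float) -> float:
    """P(at least one flip) = 1 - exp(-bits * rate * hours) under a Poisson model."""
    expected_flips = bits * rate_per_bit_hour * hours
    # expm1 keeps precision when expected_flips is extremely small.
    return -math.expm1(-expected_flips)


# All values below are assumptions for illustration only.
HYPOTHETICAL_RATE = 1e-15          # flips per bit per hour (made-up figure)
UPTIME_HOURS = 90 * 24             # about three months of uptime
MEMORY_BITS = 64 * 2**30 * 8       # a 64 GiB server

p_single_bit = prob_at_least_one_error(1, HYPOTHETICAL_RATE, UPTIME_HOURS)
p_whole_memory = prob_at_least_one_error(MEMORY_BITS, HYPOTHETICAL_RATE, UPTIME_HOURS)
```

With these assumed inputs, `p_single_bit` is negligible while `p_whole_memory` is well above one half, which is the effect the paragraph describes.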

ECC support

The information in an ECC memory is stored redundantly enough to correct a single-bit error per memory word. ECC memory can therefore support scrubbing of its contents: if the memory controller scans systematically through the memory, single-bit errors can be detected, the erroneous bit can be located using the ECC check bits, and the corrected data can be written back to the memory.
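
The scan, locate, and write-back steps can be sketched in Python with a toy Hamming(7,4) code. This is only an illustrative model: real ECC modules typically use wider codes (for example 64 data bits plus 8 check bits), and the function names `encode`, `syndrome`, and `scrub` are hypothetical.

```python
def encode(nibble: int) -> int:
    """Encode 4 data bits as a 7-bit Hamming codeword (bit i holds position i+1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]   # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]   # covers positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]   # covers positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(b << i for i, b in enumerate(bits))


def syndrome(word: int) -> int:
    """Return 0 for a valid codeword, else the 1-based position of a single flipped bit."""
    s = 0
    for pos in range(1, 8):
        if (word >> (pos - 1)) & 1:
            s ^= pos
    return s


def scrub(memory: list[int]) -> int:
    """One scrub pass: read every word, fix single-bit errors, write the result back."""
    corrected = 0
    for addr, word in enumerate(memory):
        s = syndrome(word)
        if s:
            memory[addr] = word ^ (1 << (s - 1))  # flip the faulty bit back
            corrected += 1
    return corrected
```

Flipping one bit in any stored word and then calling `scrub(memory)` restores the original contents. A plain Hamming(7,4) code cannot tell a double-bit error from a single-bit one and would miscorrect it, which is why scrubbing has to run before a second error accumulates in the same word.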

More detail

Each memory location must be checked periodically, and often enough that multiple bit errors are unlikely to accumulate within the same word between checks. Single-bit errors can be corrected, but multiple-bit errors cannot be corrected by typical (as of 2008) ECC memory modules.
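
The dependence on the scrub interval can be made concrete with a small Python sketch. It assumes bit flips in a word follow a Poisson process; the 72-bit word size and the per-bit rate are hypothetical inputs, and `p_multi_bit` is an illustrative name.

```python
import math


def p_multi_bit(word_bits: int, rate_per_bit_hour: float, interval_hours: float) -> float:
    """Probability that one word collects two or more flips within one scrub interval."""
    mu = word_bits * rate_per_bit_hour * interval_hours  # expected flips per word
    if mu < 1e-3:
        # Leading series terms of 1 - exp(-mu) * (1 + mu); avoids cancellation.
        return mu * mu / 2 - mu**3 / 3
    return 1.0 - math.exp(-mu) * (1.0 + mu)


# Hypothetical inputs: 72-bit words (64 data + 8 check bits), made-up flip rate.
daily = p_multi_bit(72, 1e-15, 24.0)
twice_daily = p_multi_bit(72, 1e-15, 12.0)
```

For small rates the result grows roughly with the square of the interval, so under this model halving the scrub period cuts the per-interval risk of an uncorrectable word to about a quarter.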

So as not to disturb regular memory requests from the CPU and thereby reduce performance, scrubbing is usually done only during idle periods. Because scrubbing consists of normal read and write operations, it may increase the memory's power consumption compared with non-scrubbing operation. For this reason, scrubbing is performed periodically rather than continuously. On many servers, the scrub period can be configured in the BIOS setup program.

Normal memory reads issued by the CPU or DMA devices are also checked for ECC errors, but because of data locality they may be confined to a small range of addresses, leaving other memory locations untouched for a very long time. Those locations can accumulate more than one soft error, whereas scrubbing ensures that the whole memory is checked within a guaranteed time.

On some systems, not only the main memory (DRAM-based) but also the CPU caches (SRAM-based) can be scrubbed. On most systems the scrubbing rates for the two can be set independently. Because a cache is much smaller than the main memory, cache scrubbing does not need to happen as frequently.

Memory scrubbing increases reliability, so it can be classified as a RAS (reliability, availability, and serviceability) feature.

Scrubbing types

Patrol scrub

Patrol scrubbing is a background process in which the memory controller proactively reads through memory, corrects any correctable errors it finds on a memory module, and writes the corrected data back, whether or not the locations have been requested. In one implementation, when patrol scrubbing is enabled, the North Bridge reads and writes back one cache line every 16K cycles, provided there is no delay caused by internal processing. At this rate, roughly 64 GB of memory behind the North Bridge is scrubbed every day.
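
The figure of roughly 64 GB per day can be checked with a short Python calculation. The text gives neither the cache line size nor the clock frequency, so the 64-byte line and the 200 MHz clock below are assumptions chosen because they reproduce the quoted figure; the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def scrubbed_bytes_per_day(clock_hz: float,
                           cycles_per_line: int = 16 * 1024,
                           line_bytes: int = 64) -> float:
    """Bytes covered per day when one cache line is scrubbed every `cycles_per_line` cycles."""
    lines_per_second = clock_hz / cycles_per_line
    return lines_per_second * line_bytes * SECONDS_PER_DAY


def days_for_full_pass(memory_bytes: float, clock_hz: float) -> float:
    """Days needed for the patrol scrubber to visit every byte once."""
    return memory_bytes / scrubbed_bytes_per_day(clock_hz)


# Assumed values: 200 MHz North Bridge clock, 64-byte cache lines.
per_day = scrubbed_bytes_per_day(200e6)
```

Under these assumptions `per_day` is 67.5e9 bytes, about 63 GiB, in line with the "roughly 64 GB" quoted above. A system with more memory behind the same controller would need proportionally longer for a full pass.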

Options on motherboards are usually Enabled or Disabled.

Demand scrub

Demand scrubbing is a process that corrects correctable memory errors found on a memory module during ordinary accesses. When the CPU or an I/O device issues a demand-read command and the data read from memory contains a correctable error, the error is corrected and the corrected data is sent to the requestor (the original source). The memory location is updated as well. Enabling the option turns on demand scrubbing for ECC memory correction.
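
The read path described here can be modelled with a minimal Python sketch. To keep it short, it swaps in a triple-copy majority vote as a stand-in error-correcting code instead of the check bits a real ECC module would use; the class and its methods are hypothetical.

```python
class DemandScrubbedMemory:
    """Toy memory that corrects and writes back on every demand read."""

    def __init__(self, size: int) -> None:
        # Stand-in ECC: each word is stored as three copies.
        self._cells = [[0, 0, 0] for _ in range(size)]
        self.corrected_reads = 0

    def write(self, addr: int, value: int) -> None:
        self._cells[addr] = [value, value, value]

    def inject_fault(self, addr: int, copy: int, bit: int) -> None:
        """Simulate a soft error by flipping one bit in one stored copy."""
        self._cells[addr][copy] ^= 1 << bit

    def read(self, addr: int) -> int:
        a, b, c = self._cells[addr]
        value = (a & b) | (a & c) | (b & c)   # bitwise majority vote
        if not (a == b == c):
            # Demand scrub: update memory with the corrected data.
            self._cells[addr] = [value, value, value]
            self.corrected_reads += 1
        return value                           # corrected data goes to the requestor
```

After a fault is injected at an address, the first `read` of that address returns the original value and repairs the stored copies, so a second read finds nothing left to correct. Addresses that are never read are never repaired this way, which is the gap patrol scrubbing covers.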

Options on motherboards are usually Enabled or Disabled.

Works cited

  1. Ronald K. Burek. "The NEAR Solid-State Data Recorders". Johns Hopkins APL Technical Digest. 1998.