Cache stampede
A cache stampede is a type of cascading failure that can occur when massively parallel computing systems with caching mechanisms come under very high load. This behaviour is sometimes also called dog-piling.
To understand how cache stampedes occur, consider a web server that uses memcached to cache rendered pages for some period of time, to ease system load. Under particularly high load to a single URL, the system remains responsive as long as the resource remains cached, with requests being handled by accessing the cached copy. This minimizes the number of expensive rendering operations.
Under low load, cache misses result in a single recalculation of the rendering operation. The system will continue as before, with average load being kept very low because of the high cache hit rate.
However, under very heavy load, when the cached version of the page expires, there may be sufficient concurrency in the server farm that multiple threads of execution all attempt to render the content of that page simultaneously; none of the concurrent servers knows that the others are doing the same rendering at the same time. If the load is high enough, this may by itself be enough to bring about congestion collapse of the system by exhausting shared resources. Congestion collapse prevents the page from ever being completely re-rendered and re-cached, since every attempt to do so times out. This reduces the cache hit rate to zero and keeps the system continuously in congestion collapse as it attempts to regenerate the resource, for as long as the load remains very heavy.
Examine the following piece of code, which either fetches data from the cache, or generates new data and places it into the cache when nothing is found:
// Fetch the data from the cache
$data = $cache->get("cached_item");

// Check if the data was fetched correctly
if ($cache->getResult() == Cache::NotFound) {
    // Not found: generate the data (expensive)
    $data = generateData();

    // Save the data into the cache. Expires after 30 seconds.
    $cache->set("cached_item", $data, 30);
}

// Display the data
print $data;
Even though the fragment above works, it does not prevent cache stampedes. If the generateData() function takes 3 seconds and we have 10 page views per second, then at least 30 (3 seconds × 10 requests per second) processes will call the generateData() function before the first result lands in the cache. Because this call has a huge impact on the system (which is why we cached this data in the first place), it might bring down the database, the web server, or both when you are very unlucky.
Preventing stampedes

There are multiple ways of preventing stampedes. Two of them are described below, each suited to a different type of data.
Method 1: Separate update process

Instead of having the web server generate the data on demand, let a separate process update the cache. For instance, when you have statistical data that needs to be updated every 5 minutes (but takes a minute to calculate), let a separate process (a cronjob, for instance) calculate this data and update the cache, as in the sketch below. This way the data is always present, and you do not need to worry about stampedes since your clients never generate the data. This works perfectly for global data that does not need to be calculated on the fly. For things like user objects, friends lists, comments and so on, it is probably not a very good method.
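A minimal sketch of such an update process, run from cron every 5 minutes; calculateStatistics() and the key name are hypothetical placeholders:

<?php
// update_stats.php -- run from cron, e.g.: */5 * * * * php update_stats.php
// Recalculates the statistics and refreshes the cache from a separate
// process, so web clients only ever read the cached copy.

$mc = new Memcached();
$mc->addServer("127.0.0.1", 11211);

// Hypothetical, expensive calculation (takes about a minute).
$stats = calculateStatistics();

// Use a TTL comfortably longer than the 5-minute update interval,
// so the item never expires between two cron runs.
$mc->set("site_statistics", $stats, 900);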
Method 2: Locking

Instead of a separate update process creating the cache entries, you can use the normal flow of caching: 1. check the cache, 2. if the item does not exist, generate it. However, we build in a little extra measure to prevent stampedes. This is the code:
function get($key) {
    $result = $this->_mc->get($key);
    if ($this->_mc->getResultCode() == Memcached::RES_SUCCESS) {
        return $result;
    }

    // See if we can create the lock.
    $this->_mc->add("stampede_lock_" . $key, "locked", 30);
    if ($this->_mc->getResultCode() == Memcached::RES_SUCCESS) {
        $this->_setResultCode(Cache::GenerateData);
        return false;
    }

    $this->_setResultCode(Cache::NotFound);
    return false;
}
The code above does the following:
1. It first checks if the cached item is present; if so, it is returned and we are done.
2. If it is not present, it tries to create a "lock". This succeeds if (and only if) nobody has created it before us, since add is an atomic operation that will succeed for only one process, even when multiple processes try at the same time.
3. When we did acquire the lock, we return a Cache::GenerateData response.
4. When we did not acquire the lock, we return a Cache::NotFound response.

Note that with this flow only one process will receive the GenerateData status; all others will receive NotFound. It is up to your client to handle this properly by checking the returned status, as in the sketch below.
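A sketch of how a caller might handle the two possible miss statuses; getFallbackData() and the 300-second data TTL are hypothetical:

$data = $cache->get("cached_item");

if ($cache->getResult() == Cache::GenerateData) {
    // We hold the stampede lock, so we are the only process
    // that regenerates the data.
    $data = generateData();
    $cache->set("cached_item", $data, 300);
} elseif ($cache->getResult() == Cache::NotFound) {
    // Another process is already regenerating the data.
    // Serve a fallback (or stale data), or retry after a short wait.
    $data = getFallbackData();
}

print $data;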
It is important that your stampede lock gets a time-to-live. Make sure this TTL is high enough to actually generate the data within (when it takes 20 seconds to generate the cache, make the lock TTL 30 seconds or so), but never higher than the TTL of the actual object. After those 30 seconds, the lock will automatically be released.
One might be tempted to change the code above: instead of having a lock with a TTL, you delete the lock when setting the data in a set() method. This, however, can trigger a race condition in which multiple processes still generate the data (not at the same time, but right after the lock is removed), as the sketch below illustrates.
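For illustration, the tempting but flawed variant would look something like this (a sketch; the method lives in the same hypothetical wrapper class as get() above):

function set($key, $value, $ttl) {
    $this->_mc->set($key, $value, $ttl);

    // Flawed: a process that got a cache miss just before the set()
    // above can acquire the lock right after this delete() and
    // regenerate the data again, even though fresh data is already
    // present. The TTL-based release avoids this window.
    $this->_mc->delete("stampede_lock_" . $key);
}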