Cascade failure

From Wikipedia, the free encyclopedia

A cascade failure is a series of events on the Internet in which network traffic to or between large sections of the Internet is severely impaired or halted because hardware or software has failed or been disconnected. Like the more general cascading failure found in, for instance, electrical systems, a cascade failure can affect large groups of people and systems.

Causes

A cascade failure is usually caused by the overloading of a single, crucial router or node. The overload takes the node down, even if only briefly, and traffic is rerouted to or through an alternative path. The alternative path then becomes overloaded and goes down in turn, and so on. The failure also affects systems that depend on the node for regular operation.

A cascade failure can also be triggered by taking a node down for maintenance or upgrades.
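
The overload-and-reroute chain described above can be illustrated with a small simulation. This is only a toy sketch, not a model of any real routing protocol: the function name `simulate_cascade`, the node names, the capacities and the rule that a failed node's load is split equally among the survivors are all assumptions made for illustration.

```python
from collections import deque


def simulate_cascade(capacity, load, first_failure):
    """Return the order in which nodes fail once `first_failure` goes down.

    Assumption: the load of a failed node is split equally among all
    nodes that are still up. Real networks redistribute traffic
    according to their routing protocol, not equally.
    """
    load = dict(load)              # work on a copy of the caller's data
    alive = set(capacity)
    failed = []
    pending = deque([first_failure])
    while pending:
        node = pending.popleft()
        if node not in alive:
            continue
        alive.discard(node)
        failed.append(node)
        if not alive:
            break                  # nothing left to take over the traffic
        share = load.pop(node) / len(alive)
        for other in sorted(alive):
            load[other] += share
            if load[other] > capacity[other] and other not in pending:
                pending.append(other)   # overloaded: it goes down next
    return failed


# Hypothetical network: three small nodes and one large backbone node.
capacity = {"A": 100, "B": 100, "C": 100, "D": 400}
load = {"A": 90, "B": 80, "C": 60, "D": 50}
print(simulate_cascade(capacity, load, "A"))   # ['A', 'B', 'C']
```

In this example the failure of A pushes B over its capacity, and the failure of B pushes C over its capacity; only the large node D has enough headroom to absorb the combined traffic and stop the chain.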

Symptoms

The symptoms of a cascade failure are easy to see: packet loss and high network latency, not just to single systems, but to whole sections of a network or the Internet. The high latency and packet loss are caused by nodes that stop operating because of congestion collapse: they are still present in the network, but little or no useful communication passes through them. As a result, routes through these nodes can still be considered valid even though they no longer carry traffic.

If enough routes go down because of a cascade failure, a complete section of the network or the Internet can become unreachable. Although undesirable, this can speed up recovery: connections time out and other nodes give up trying to establish connections to the sections that have been cut off, which decreases the load on the nodes involved.

A common pattern during a cascade failure is a walking failure, in which one section goes down and causes the next section to fail, after which the first section comes back up. This ripple can make several passes through the same sections or connecting nodes before stability is restored.
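
A walking failure, together with the load reduction caused by timed-out connections, can be sketched with two sections that take turns failing. Everything in this sketch is a simplifying assumption made for illustration: the function name `walking_failure`, the two hypothetical sections A and B, recovery after exactly one step, and the `decay` factor that stands in for clients giving up.

```python
def walking_failure(capacity, demand, decay=0.8, max_steps=20):
    """Trace which sections are up at each step of a two-section ripple.

    Assumptions: section A has just failed; an overloaded section is
    down for exactly one step and then recovers; while any section is
    down, demand shrinks by `decay` per step as connections time out.
    """
    up = {"A": False, "B": True}
    history = []
    for _ in range(max_steps):
        serving = [name for name, is_up in up.items() if is_up]
        history.append("".join(serving) or "-")
        per_section = demand / len(serving) if serving else 0.0
        overloaded = per_section > capacity
        if len(serving) == len(up) and not overloaded:
            break                  # every section is up and coping: stable
        if len(serving) < len(up):
            demand *= decay        # clients give up, so the load falls
        # Sections that were down recover; overloaded sections go down.
        up = {name: (not is_up) or (not overloaded)
              for name, is_up in up.items()}
    return history


print(walking_failure(capacity=100, demand=150))   # ['B', 'A', 'B', 'AB']
```

With these numbers the failure walks from B to A and back to B before demand has dropped far enough for both sections to stay up together.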

History

Cascade failures are a relatively recent development, brought about by the massive increase in traffic and the high interconnectivity between systems and networks. The term was first applied in this context in the late 1990s by a Dutch IT professional and has slowly become a relatively common term for this kind of large-scale failure.

Example

An animation demonstrating how a single failure may result in other failures throughout the network.

As an example, consider an overloaded connecting node between a local ISP and its Internet backbone. Initially, the traffic that would normally go through the node is stopped, and systems and users get errors about not being able to reach hosts. Usually, the redundant systems of the ISP respond very quickly and choose another path through a different backbone. This alternative route is longer, with more hops, and therefore passes through more systems that do not normally process the amount of traffic suddenly offered to them. This can cause one or more systems along the alternative route to go down, causing similar problems of their own.

Related systems are also affected in this case. For example, DNS resolution might fail, and because DNS is normally what allows systems to find and connect to one another, its failure can break connections that do not directly involve the systems that went down. This, in turn, may cause seemingly unrelated nodes to develop problems, which can trigger another cascade failure of their own.
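
The way a failure reaches systems that are not directly connected to the failed node can be sketched as a walk over a dependency graph. The system names and the `depends_on` mapping below are hypothetical and serve only as an illustration; the helper `affected_by` is likewise an assumed name, not part of any existing tool.

```python
from collections import deque


def affected_by(depends_on, failed):
    """Return every system that directly or indirectly depends on `failed`.

    `depends_on` maps each system to the systems it needs in order to work.
    """
    # Invert the mapping: for each system, which others rely on it?
    dependents = {}
    for system, needs in depends_on.items():
        for dependency in needs:
            dependents.setdefault(dependency, set()).add(system)

    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for system in dependents.get(current, ()):
            if system not in affected:
                affected.add(system)
                queue.append(system)
    return affected


# Hypothetical dependencies: DNS is reached through one backbone link.
depends_on = {
    "dns": ["backbone-link"],
    "mail": ["dns"],
    "web-shop": ["dns", "database"],
    "database": [],
}
print(sorted(affected_by(depends_on, "backbone-link")))
# ['dns', 'mail', 'web-shop']
```

Here neither the mail server nor the web shop uses the failed link directly, yet both are affected because they rely on DNS.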
