Sorcerer's Apprentice Syndrome
From Wikipedia, the free encyclopedia
Sorcerer's Apprentice Syndrome (SAS) is a serious network protocol flaw discovered in the original versions of TFTP. It is named after the Sorcerer's Apprentice segment of Fantasia, because its operation closely resembles the disaster that befalls the sorcerer's apprentice: the flaw caused every packet in the transfer to be replicated in ever-growing numbers. The problem arose from a known failure mode of the internetwork (packets that are delayed rather than lost), which the protocol designers mistakenly did not take into account; it interacted with several details of TFTP's mechanisms to produce SAS.
Technical background
TFTP operates in simple lock-step: only one packet is ever outstanding at any time, and every packet received by either party causes one packet to be sent in reply (until the transfer terminates). The original TFTP specification required the receiver of any packet to send the appropriate reply packet. Thus the receipt of a data block triggered the sending of an acknowledgement, and the receipt of an acknowledgement triggered the sending of the next data block. This may sound harmless, but it led to disaster.
Like all protocols designed to operate across an unreliable network, TFTP also includes timeouts: when a party does something to which it expects a reply from the other end (e.g. sends it a packet), it starts a timer. If the timer expires before the reply has been received, it takes some action, usually re-sending the original packet.
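The timeout rule can be sketched in a few lines. This is an illustrative sketch only, not code from any TFTP implementation: the function name, the `send` and `wait_for_reply` hooks, and the default timeout and retry values are all assumptions made for the example.

```python
from typing import Callable, Optional


def send_with_timeout(
    packet: bytes,
    send: Callable[[bytes], None],
    wait_for_reply: Callable[[float], Optional[bytes]],
    timeout_s: float = 1.0,
    max_attempts: int = 5,
) -> Optional[bytes]:
    """Send one packet and wait for its reply, re-sending on each timeout.

    `send` and `wait_for_reply` are hypothetical transport hooks:
    `wait_for_reply(t)` returns the reply, or None if `t` seconds pass first.
    """
    for _ in range(max_attempts):
        send(packet)                       # (re)transmit the packet
        reply = wait_for_reply(timeout_s)  # start the timer and wait
        if reply is not None:
            return reply                   # reply arrived before the timer expired
    return None                            # give up after max_attempts timeouts
```

Note that the sketch treats a timeout as evidence of loss. It cannot tell a lost packet from one that is merely late, which is the gap SAS exploits.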
Details of SAS
SAS occurred when a packet was not lost in the internetwork but merely delayed, and was successfully delivered only after a timeout had occurred (on either side).
The timeout caused a second copy of the previous packet to be generated, notionally to replace the 'lost' packet. However, the first copy was not lost, and since, according to the TFTP specification, receipt of any packet always forced the generation of a reply packet, two replies were generated (one to each copy). Those forced the generation of two replies to them, and so on. A typical scenario was as follows:
- Computer S (source) sends data block X to computer D (destination)
- Computer D receives block X, and sends an acknowledgement for X back to S
- The packet containing the acknowledgement for X is delayed in the internetwork
- Computer S times out, and resends data block X to D
- Computer S receives the delayed acknowledgement for X, and sends data block X+1
- Computer D receives the second copy of block X, and sends another acknowledgement for X back to S
- Computer D receives block X+1, and sends an acknowledgement for X+1 back to S
- Computer S receives the second acknowledgement for X, and sends a second copy of data block X+1
- Computer S receives the acknowledgement for X+1, and sends data block X+2
- Computer D receives the second copy of block X+1, and sends another acknowledgement for X+1 back to S
- Computer D receives block X+2, and sends an acknowledgement for X+2 back to S
At this point the situation is stable and repeats: every packet from then on is duplicated (i.e. two identical copies are sent across the internetwork).
Even worse, the increased number of packets crossing the internetwork was likely to cause congestion, which in turn was likely to delay a packet past the timeout yet again. That timeout would generate yet another duplicate, and from then on a third copy of each packet would be sent. At that point the situation would usually snowball, with further copies being generated, hence the name given to this pattern of behaviour.
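The scenario above can be replayed with a small discrete-event simulation. This is a sketch under stated assumptions, not a model of a real TFTP stack: the function name and the timing values (a one-way delay of 1.0, a single acknowledgement delayed by 2.5, a timeout of 3.0) are invented for the example, and only the first acknowledgement for block 1 is delayed.

```python
import heapq
from collections import Counter


def simulate_original_rule(num_blocks=5, delay=1.0, late_ack_delay=2.5, timeout=3.0):
    """Count how many copies of each DATA block the source transmits
    under the pre-fix rule (every received packet triggers a reply)."""
    events = []              # min-heap of (time, tie_breaker, kind, block)
    counter = 0              # tie-breaker so equal-time events pop in FIFO order
    data_sent = Counter()    # block number -> copies put on the wire
    acked = set()            # blocks for which the source has seen an ACK
    late_ack_pending = True  # only the first ACK for block 1 is delayed

    def schedule(time, kind, block):
        nonlocal counter
        heapq.heappush(events, (time, counter, kind, block))
        counter += 1

    def send_data(now, block):
        data_sent[block] += 1
        schedule(now + delay, "DATA", block)     # arrival at the destination
        schedule(now + timeout, "TIMER", block)  # source's retransmission timer

    send_data(0.0, 1)
    while events:
        now, _, kind, block = heapq.heappop(events)
        if kind == "DATA":
            # Destination: every DATA block received triggers an ACK.
            ack_delay = delay
            if block == 1 and late_ack_pending:
                ack_delay, late_ack_pending = late_ack_delay, False
            schedule(now + ack_delay, "ACK", block)
        elif kind == "ACK":
            # Source, pre-fix rule: every ACK received triggers the next block.
            acked.add(block)
            if block < num_blocks:
                send_data(now, block + 1)
        elif kind == "TIMER" and block not in acked:
            send_data(now, block)  # timeout: re-send the unacknowledged block
    return dict(data_sent)
```

With the default values, `simulate_original_rule()` returns `{1: 2, 2: 2, 3: 2, 4: 2, 5: 2}`: one late acknowledgement is enough to double every block for the rest of the transfer, even though no packet was ever lost.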
For a small file, the transfer would complete, and the duplicate packets would eventually drain from the internetwork. If the file were large, however, congestive collapse would result, and only when the transfer failed would the mass of packets drain from the internetwork.
Fixing SAS
The fix to SAS was quite simple: the TFTP specification was modified to indicate that only the first instance of a received acknowledgement would cause the next data block to be sent, thus breaking the retransmission loop. In the new version of the protocol, a block would only be retransmitted on timeout.
This change also makes it possible to simplify the implementation of the receiving end (often a bootstrap program written in a low-level language) by omitting its retransmission timer, since any lost packet causes the sender to time out and retransmit the last packet. Keeping the timer has its benefits, however, such as dealing with lost ACKs more efficiently.
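A minimal sketch of the sender-side rule after the fix follows. The class name, method names, and the `transmissions` log are hypothetical and exist only to make the behaviour observable; a real sender would transmit packets rather than append to a list.

```python
class FixedSender:
    """Sender-side ACK handling after the fix (illustrative names only)."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.current = 1          # block currently awaiting acknowledgement
        self.transmissions = [1]  # log of every block number put on the wire

    def on_ack(self, block):
        # Only the first ACK for the current block advances the transfer;
        # duplicate or stale ACKs are ignored and trigger nothing.
        if block != self.current:
            return
        self.current += 1
        if self.current <= self.num_blocks:
            self.transmissions.append(self.current)

    def on_timeout(self):
        # Retransmission happens only here, never in response to an ACK.
        if self.current <= self.num_blocks:
            self.transmissions.append(self.current)
```

Replaying the delayed-acknowledgement scenario against this sender (`on_timeout()`, then `on_ack(1)` twice, then `on_ack(2)`) logs block 1 twice and every later block once, so the duplication stops at the block where it began.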
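A receiving end without a timer might look like the following sketch. It assumes, beyond what the text above states, that the receiver re-acknowledges a duplicate data block without storing it a second time; the class and method names are hypothetical.

```python
class TimerlessReceiver:
    """Receiver that relies entirely on the sender's timeout (illustrative)."""

    def __init__(self):
        self.expected = 1   # next block number the receiver is waiting for
        self.received = []  # payloads accepted so far, in order

    def on_data(self, block, payload):
        """Handle one DATA block and return the block number to acknowledge."""
        if block == self.expected:
            self.received.append(payload)
            self.expected += 1
        # A duplicate means the earlier ACK was probably lost or late, so the
        # latest accepted block is acknowledged again; the payload is not re-stored.
        return self.expected - 1
```

If an acknowledgement is lost, this receiver simply waits; the sender's timer eventually re-sends the block, and the duplicate triggers a fresh acknowledgement.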
Further reading
- Bob Braden (editor), Requirements for Internet Hosts -- Application and Support, RFC 1123, USC/Information Sciences Institute, October 1989. See section 4.2.