Virtual synchrony

From Wikipedia, the free encyclopedia

Virtual synchrony is a method of data replication for sharing information among programs running on multiple machines connected over the Internet. The term virtual synchrony refers to the fact that applications see the shared data evolve in what seems to be a synchronous manner. This form of synchronization is virtual because the actual situation is more complex than seems to be the case from a programmer's perspective: the synchronous model can sometimes be violated. For example, the messaging system might sometimes report the same events in different orders at different processes, but only if this will still leave the replicas in identical states. The flexibility associated with this limited form of event reordering permits virtual synchrony platforms to achieve extremely high data rates while still preserving very strong fault-tolerance and consistency guarantees.

Contents

[edit] Introduction

Developers of distributed computer systems often need a way to replicate data for sharing between programs running on multiple machines, connected by a network. Virtual synchrony is one of three major technologies for solving this problem. The key idea is to create a form of distributed state machine associated with the replicated data item. Called a process group, these state machines share copies of the data, and updates are delivered as events that occur in the same order at all the copies. If a process fails or crashes, this is reported to the other processes in the group; if a process joins, this is similarly reported, and a state transfer is used to initialize the joining member. An application with lots of data items to replicate might do so by using a single group for the whole set, or it could create different groups for different items, with the former approach used when the items are replicated at the identical places, and the latter being used when the replication patterns differ.

Each process group has a name, visible within the network. A single application program can connect itself to many groups at the same time. In effect, a process group becomes an abstraction for sharing data, coordinating actions, and monitoring other processes.

The term virtual synchrony refers to the fact that applications see the shared data evolve in what seems to be a synchronous manner. This form of synchronization is virtual because the actual situation is more complex than seems to be the case from a programmer's perspective. Like a compiler that sometimes reorders the execution of instructions for higher performance, virtual synchrony sometimes reorders events in ways that improve performance, and yet won't be noticeable to applications.

Virtual synchrony replication is used mostly when applications are replicating information that evolves extremely rapidly. As discussed further below, the kinds of applications that would need this model include multiuser role-playing games, air traffic control systems, stock exchanges, and telecommunication switches. Of course, there are other ways to solve the same problems. For example, most of today's online multiuser role-playing games give users a sense that they are sharing replicated data, but in fact the data lives in a server on a data center, and any information passes through the data centers. Those games probably wouldn't use models like virtual synchrony, at present. However, as they push towards higher and higher data rates, taking the server out of the critical performance path becomes important, and with this step, models such as virtual synchrony become valuable.

Using the virtual synchrony model, it is relatively easy to maintain fault-tolerant replicated data in a consistent state. One can then build all sorts of higher level abstractions over the basic replication mechanisms.

[edit] Three Distributed Data Replication Models

Virtual Synchrony is a popular computing model, closely related to the transactional one-copy serializability model (used mostly in replicated database systems) and the state machine (consensus) model, sometimes known as "Paxos", the name given to the most widely cited state-machine implementation. References for all of these appear at the end of the article.

  • Among these, transactional replication is probably the most widely known model -- most database textbooks discuss it. Yet overheads are very high when using true one-copy serializability, hence the approach to replication has never been a commercial success. Turing Award winner Jim Gray offers some thoughts on this issue in a paper he wrote about "The Dangers of Replication and a Solution". Indeed, few database products support true replication of the sort Gray warns about. Instead, they more often support a form of log-based fault-tolerance that performs well, but can leave inconsistencies (updates can be lost) if a failure occurs just as the log is being transmitted.
  • Virtual synchrony has been widely adopted, and even standardized within as part of the CORBA reference model. There are many real-world systems that use this model, and achieve extremely high performance. On the other hand, virtual synchrony is a technology that can be hard to use correctly -- programmers need some training, and without it, may make mistakes. For this reason, virtual synchrony is not often supported in a form that end-users or programmers would encounter directly.
  • The state machine / Paxos approach is often cited in the literature, but is rarely used in commercial products or other forms of serious systems where performance matters.

[edit] Data Replication and Fault Tolerance

The basic goal for all of these protocols is to replicate data in a distributed system in a manner that makes the replicated entity indistinguishable from a non-replicated object implementing the same interface. For example, if we imagine a simple variable, x, that can be read or written to, a replicated version might consist of some set of replicas x0, x1, ... xn and an associated protocol, such that reads and writes to the replicates are performed in a way that looks indistinguishable from reads and writes to the original variable. When we create a process group, the idea is that each of its members will hold a replica. Updates are delivered to the group members through an event notification interface.

The central difference between the three models is that virtual synchrony assumes that the variable is replicated in memory by a set of processes executing on some collection of machines in a network. Transactional one-copy serializability assumes that the data resides in a collection of transactional databases (on disk), and implements the full transactional ACID properties, with the usual begin/commit/abort interface. State-machine consensus lies somewhere in the middle: the variables are assumed to be persistent (for example they might be stored in files), but are not assumed to have full ACID properties, and access is not assumed to go through a transactional begin/commit/abort interface.

Failures make the problem hard. All three models assume that machines may fail by crashing: a computer halts, or some process on it halts, and other processes sense the failure by timeout. Timeout, of course, is a potentially inaccurate way to sense failures (timeouts always discover true crashes, but sometimes a timeout will trigger for some other reason, such as a transient connectivity problem.) A platform implementing any of these models must provide the programmer with a set of system calls that allows him or her to write code that will continue to respect the model even if these kinds of problems occur. In effect, the platform hides this difficult fault-tolerance problem from the programmer.

[edit] Other uses for virtual synchrony

Virtual synchrony is useful for more than just replicating data, although replication is probably the most common use. Other mechanisms that can be constructed "over" a virtual synchrony platform include:

  • Event notification (also called publish-subscribe). These are interfaces that let applications publish event messages, tagging them with topic names. Applications can subscribe to a topic, or a pattern that matches many topics, specifying a function to be invoked when a matching message is received. The platform matches publishers to subscribers. With group communication, this is done by creating a group to correspond to each topic, or to a set of topics. Each new event is published by multicasting it within the group.
  • Locking. Many systems need some form of locking or synchronization mechanism. Locking can easily be implemented on top of a virtual synchrony subsystem. For example, a system can associate a token with each group, and make the rule that to hold the lock, a process must gain "ownership" of the token. Multicasts are used to request and to pass the token.
  • Fault-tolerance. A group can easily support primary-backup forms of fault-tolerance, in which one process performs actions and a second one stands by as a backup. Even fancier is the "coordinator/cohort" model, in which each request is assigned to a different coordinator process. Other processes in the group are ranked to serve as a primary backup, secondary, etc. Since failures are rare, the effect is that a group with N members can potentially handle N times the compute load. Yet if a failure does occur, the group can transparently handle it.
  • Unbreakable TCP connections. The idea here is that a client can make a TCP connection to a whole group. Using multicast, the state of the server-side TCP connection can be replicated, so that even if one server crashes, others are able to take over its roles, transparently.
  • Cooperative caching. Members of a group can share lists of data they have in their caches. This way, if one process needs an object that some other process has a copy of, the group members can help one-another out and avoid fetching the object from a server that might be distant, overloaded or expensive.
  • Other peer-to-peer mechanisms. It isn't hard to see how a group can implement the functions of a distributed hash table (DHT), since every member knows the identity of every other member. Groups can also be a useful infrastructure for implementing swarm algorithms, like the ones used in BitTorrent.

[edit] Performance

Among the three models, virtual synchrony achieves the highest levels of performance, but this comes at a cost: the fault-tolerance properties of virtual synchrony are weaker than what that achieved in the other models.

The Paxos and transactional models guarantee a higher degree of durability in the presence of crashes. Both models need to first ensure that an update is recorded in a write-ahead log before any process can actually perform the update. This introduces a form of two-phase commit into the protocol, and hence slows things down: first the update is sent and logged, and all members confirm that they have it, and only then is it performed. In contrast, virtual synchrony implementations with in-memory data replication can generally update a replicated variable as soon as a message describing the update reaches the relevant group members. They can stream high rates of updates by packing multiple updates into a single message.

To give some sense of the relative speed, experiments with 4-node replicated variables undertaken on the Isis and Horus systems in the 1980s suggested that virtual synchrony implementations in typical networks were about 100 times faster than state-machine replication using Paxos, and about 1000 to 10,000 times faster than full-fledged transactional one-copy-serializability. Of course, these sorts of order of magnitude numbers are highly sensitive to the implementation and choice of platform, but they also reflect underlying obligations within the protocols used to implement the models. Modern systems like Spread and Quicksilver can achieve data rates of 10,000 multicasts per second or more, and can scale to large networks with huge numbers of groups or processes.

Most distributed computing platforms support one or more of these models. For example, the widely popular CORBA platforms all support transactions and some CORBA products support transactional replication in the one-copy-serializability model. The "CORBA Fault Tolerant Objects standard" is based on the virtual synchrony model. Virtual synchrony was also used in developing the New York Stock Exchange fault-tolerance architecture, the French Air Traffic Control System, the US Navy AEGIS system, IBM's Business Process replication architecture for WebSphere and Microsoft's Windows Clustering architecture for Windows Longhorn enterprise servers. A discussion of some of these projects can be found in "A Review of Experiences with Reliable Multicast". K. P. Birman Software Practice and Experience Vol. 29, No. 9, pp, 741-774, July 1999.

[edit] Essential features of the virtual synchrony model

Virtual synchrony is usually presented to programmers through a simple distributed programming library that supports at least three basic interfaces. First, a process (an executing program) can join a process group. Each group has a name (a bit like a file name, although these names are interpreted within a network, not relative to a disk), and each has a list of members. The join primitive returns some form of handle on the group. The process can then register a handler for incoming events, and can send multicasts to the group.

The basic guarantee associated with the model is that all processes belonging to a group see the same events, in the same order. The platform senses failures (using timeouts) but reports them in a consistent manner to all group members. Multicast messages may be initiated concurrently by multiple senders, but will be delivered in some fixed order selected by the protocols implementing the model.

Events are of several types. First, each received multicast is delivered as an event. But membership changes in the group are also reported through events; these are called new views of the group. Moreover, when a process joins a group, some existing member is asked to create a checkpoint: a message describing the state of the group at the time the process joined. This is reported to the new member as a state transfer event, and is used to initialize the joining process.

For example, suppose that an air traffic control system maintains some group associated with the airplanes flying in sector XYZ over Paris. Each air traffic controller who monitors that sector would have a process running on his or her machine, and these processes would join the XYZ group as they start up. The members would replicate the list of air traffic control plans and tracks associated with sector XYZ. Upon joining, a process would obtain a copy of the state of the sector as of the moment it joined, delivered as a checkpoint through a state transfer event. Loading such a checkpoint is analogous to reading a file that lists the current state of the sector. Later, as events occur that impact the status of the sector, they would be multicast so that all members of the group can see those events. Since each member is in the same state, and receives the same updates, each air traffic controller sees the same sector status and they see it evolve in the same manner. If a failure occurs, the surviving systems can take over roles that were previously held by the crashed one.

[edit] Virtual synchrony in a Pictorial Timeline Model

A good way to understand virtual synchrony is to start by thinking about a genuinely synchronous execution. Figure 1 shows some execution timelines for five processes running concurrently in a network. Time advances from left to right. At the start, p creates a process group and is its only member. Then q joins and with p's help, initializes itself. The heavy arrow denotes the creation of a checkpoint by p, which is copied to q, and then loaded by q. Perhaps this group lists air traffic control state for some sector over Paris. Now t, a non-member, asks the group some question. It sends a message, and the group members cooperate to respond (perhaps they each search half of an ATC database -- after all, each knows that the group has two members and each knows its own rank, so parallel computing becomes easy! Next we see some update messages -- multicasts -- exchanged by p and q. Process r joins the group, but q either crashes or fails. Notice that each event is seen in the identical order by all the members. This makes it easy to track the evolving group state. Some would call this a state machine execution.

What makes a virtually synchronous system virtual rather than real is that the synchronous event ordering is actually an illusion. Figure 2 illustrates one way that this can occur: if the timing in the system isn't perfectly synchronized, messages may be delivered with some delays. The execution still looks synchronous to the members of the groups because the messages still arrive in the same order as they did in a true synchronous execution, but we can easily see that the actual times at which events occurred aren't identical.

But how meaningful is the notion of identical timing in a distributed system? Computers can't maintain perfectly synchronized clocks, so if machine p writes down the time when message m arrived, say 10:00:00.000 and machine q records that same event as happening at time 10:00:00.716, can we be sure that m didn't arrive precisely at the same instant at p and q? Perhaps their clocks are simply off by 0.716 seconds! On the other hand, perhaps their clocks are exactly synchronized and m really did arrive late at q (or early at p). Or perhaps there is some mixture of these issues. Even with GPS clocks, synchronization won't be better than a few milliseconds.

Figure 3 takes all of this even further. Here, events genuinely happen out of order. The point this figure is intended to make is that sometimes, a system can deliver events out of order -- and yet the application might not notice. For example, perhaps two multicasts can be delivered in different orders because they update unrelated data items. By deviating from the synchronous order, virtual synchrony systems gain speed and improve fault-tolerance (they are less likely to experience correlated crashes where some message causes all the members to crash simultaneously).

In virtual synchrony systems, the application programmer signals to the platform what form of ordering is really needed. For example, the programmer might indicate that multicast m updates different data than multicast n. Virtual synchrony software systems make it easy to do this sort of thing, although we won't delve into the details here. Basically, the programmer says "you can deliver messages m and n in any order you like, because my application won't notice". When permitted to do so, the communication system can now save time by not delaying messages under conditions where providing identical delivery order for n and m would have introduced extra cost and thereby slowed down the data rate.

When could we get away with this sort of thing? The answer usually depends on the application. But one good example arises if a group is maintaining data about some collection of objects that tend to be accessed independentally. For example, perhaps the group represents the rooms in a multi-user role-playing game. Users are only in one room at a time, hence multicasts that update data in different rooms can be delivered in different orders. If a user sees one such multicast (e.g. that user happens to be in Sarah's Ice Cream shop when the a message is delivered that causes the telephone to ring), they won't see the other one (because it affected the state of some other room). Returning to our air traffic control example, different groups might represent different sectors of the sky, at which point the same kinds of options arise. A programmer designing such an application will often have simple ways to realize that this is the case, and can then signal this through an appropriate system-call.

Why bother? Well, we gain performance as we relax ordering, and this is one major reason that virtual synchrony is so much faster than state machine replication or transactional replication. Both of those models are much closer to a true synchronous execution, and run slowly because their very strong ordering requirements, lock-step coordination of actions and the need to synchronously write data to disk (at least in the case transactions) slow them down enormously. Paxos doesn't actually require that data be written to disk, but is implemented under the assumption that after a crash, processes will remember what happened prior to the crash, which is pretty much the same thing).

[edit] Failure Semantics

We mentioned that there is a sense in which virtual synchrony is a weaker model than transactional one-copy serializability or state machine consensus in the style of Paxos. Partly this relates to ordering: virtual synchrony often weakens the message delivery ordering to gain performance. As mentioned above, doing so can sometimes increase robustness too. If different copies sometimes process events in different orders (doing so only when this won't have any impact on the ultimate state of the object), the copies may still be somewhat more robust against messages that cause exceptions. After all, many bugs are exquisitely sensitive to the exact sequence of events that a process experiences, so processes that see the same things but in different orders can often survive problems that would be fatal in some specific ordering.

But the other sense in which virtual synchrony is a weaker model relates to exactly what happens when some process crashes. Suppose that process p sends a multicast to a group G, and then p and some member of the group, say q, both crash. No process that remains operational has a copy of the multicast. What should the platform do?

In virtual synchrony, the group continues executing as if no message was ever sent. After all, there is no evidence to the contrary. P and q have both crashed, so they won't behave in a manner inconsistent with the model. Yet it is possible that q received p's message and delivered it to the application right before the crash. So there is a case in which virtual synchrony seems to lie: it behaves as if no message was sent, and yet the crashed processes might actually have exchanged a message.

This never happens in Paxos or transactional systems, which makes them a good match for updating database files on a disk. In both systems, if q later recovers and rejoins the group, any data it collected prior to crashing will still be valid, except to the extent that it missed updates delivered to the other group members while it was down. The cost of this guarantee is, however, quite high. Paxos, and transactional systems, impose a long delay before any process can deliver a message. First, these platforms make sure that the message reaches all of its destinations, asking them to delay the incoming message before delivering it. Only after the first step is completed are recipients told that it is safe to deliver the message to the application. (In one variant on these models, the platform only makes sure that a quorum (a majority) receive the message, but the delay is comparable).

The delay associated with this extra round of communication can have a big impact on performance.

Experience with virtual synchrony shows that for most applications, the weak but fast form of delivery is just fine. For rare cases where stronger guarantees are needed, the application programmer can request that a slower delivery be performed, paying an infrequent higher price, but only when necessary. The resulting performance will be much higher than if the slower, more conservative delivery property was used for every message.

[edit] Systems that support virtual synchrony

Virtual synchrony has been supported by the "Isis Toolkit", the "Horus system", the Transis system, the Totem system, an IBM system called Phoenix, a distributed security key management system called Rampart, the "Ensemble system", the Quicksilver system, and a number of products (including the IBM and Microsoft ones mentioned earlier). At the time of this writing, virtual synchrony toolkits that programmers can use to implement new virtually synchronous applications include the Spread Toolkit, the C-Ensemble system and Quicksilver.

[edit] External links

1. Reliable Distributed Systems: Technologies, Web Services and Applications. K.P. Birman. Springer Verlag (1997). Textbook, covers a broad spectrum of distributed computing concepts, including virtual synchrony.

2. Distributed Systems: Principles and Paradigms (2nd Edition). Andrew S. Tanenbaum, Maarten van Steen (2002). Textbook, covers a broad spectrum of distributed computing concepts, including virtual synchrony.

3. "The process group approach to reliable distributed computing". K.P. Birman, Communications of the ACM (CACM) 16:12 (Dec. 1993). Written for non-experts.

4. "Group communication specifications: a comprehensive study" Gregory V. Chockler, Idit Keidar, Roman Vitenberg. ACM Computing Surveys 33:4 (2001). Introduces a mathematical formalism for these kinds of models, then uses it to compare their expressive power and their failure detection assumptions.

5. "Practical Impact of Group Communication Theory." Andre Schiper. Future Directions in Distributed Computing. Springer Verlag Lecture Notes in Computer Science 2584 (July 2005). A history of the area, assumes familiarity with the general topic.

6. "The part-time parliament". Leslie Lamport. ACM Transactions on Computing Systems (TOCS), 16:2 (1998). Introduces the Paxos implementation of replicated state machines.

7. "A review of experiences with reliable multicast" K. P. Birman. Software, Practice and Experience. 29:9 (July 1999). Includes discussion of the New York and Swiss Stock Exchange, French Air Traffic Control System and several other projects that used virtual synchrony as part of a system that was ultimately deployed (in fact with just a few exceptions, these systems are still heavily used).

8. "Exploiting virtual synchrony in distributed systems". K.P. Birman and T. Joseph. Proceedings of the 11th ACM Symposium on Operating systems principles (SOSP), Austin Texas, Nov. 1987. Earliest use of the term, but probably not the best exposition of the topic.