A global state consists of the state of every node and every channel in the system, where the state of a channel is the sequence of messages "in transit": those that have been sent on that channel but not yet received. We will take a global snapshot of the system by having each node record its own local state, as well as the sequence of messages on each of its incoming channels. The question then becomes when each process should take its local snapshot.
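To make the definition concrete, here is a minimal sketch of a global state as a data structure. It borrows the bank setting used later in this lab; the names GlobalState, node_states, channel_states, and total_money are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GlobalState:
    # Local state of each node; here, a hypothetical account balance per bank.
    node_states: dict[str, int] = field(default_factory=dict)
    # In-transit messages per directed channel (sender, receiver), oldest first.
    channel_states: dict[tuple[str, str], list[int]] = field(default_factory=dict)

    def total_money(self) -> int:
        """Sum every balance plus every transfer that is still in transit."""
        in_transit = sum(sum(msgs) for msgs in self.channel_states.values())
        return sum(self.node_states.values()) + in_transit


# Assumed scenario: A started with 100, sent 10 to B, and B has not received it yet.
state = GlobalState(
    node_states={"A": 90, "B": 100},
    channel_states={("A", "B"): [10], ("B", "A"): []},
)
print(state.total_money())
```

Counting only the node balances here would miss the 10 units on the channel from A to B, which is why channel state has to be part of the snapshot.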
If all processes were synchronized by a single clock, then taking a consistent global snapshot would be trivial: each process could record its state and that of its incoming channels at an agreed-upon time. However, most distributed systems have no centralized clock. Communication and computation steps are not synchronized, and messages are needed to trigger each process to take its local snapshot, so the local snapshots are taken at different physical times. The problem is harder still because we want our snapshots to be not only consistent, but also "recent." For example, we would not find a global snapshot algorithm acceptable if it always reported the global initial state of the system, even though that state is consistent.
We can define an execution of the system as the sequence of communication events (send and receive events) occurring at every process in the system. Now, consider the global state of the system at the physical instant of time when the global snapshot is requested. Let's call this the "before" state. If we could just capture the "before" state, then we'd be finished, but we can't do this because messages continue to flow through the system while the snapshot is being taken, so by the time we notify a process to take a snapshot, its state may have changed. Next, consider the global state of the system at the physical instant of time when the global snapshot is finished (i.e., after every process has taken its local snapshot). Let's call this the "after" state. Similarly, if our snapshot could just capture the "after" state, then we'd be done. However, we can't do that either, because we're not allowed to stop other messages from flowing through the system after we take the local snapshot at a node. Therefore, by the time all nodes have taken their local snapshots, the global system state may have changed.
So, we define a recent and consistent global state as a global state that "could have occurred" between the "before" state and the "after" state. In other words, consider the sequence S of communication events that occurs in the system between the "before" and the "after" state. A recent and consistent global state X is reachable from the "before" state by some subsequence of the events in S. Furthermore, the "after" state is reachable from the state X by the remaining events in S. Notice that this global state didn't necessarily occur at any point in real time, but it "could have occurred" between the "before" and "after" states using the same sequence of communication events.
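The "could have occurred" condition is often restated as a simple check: no message may be recorded as received unless its send is also recorded. The sketch below tests that property for a cut through per-process event histories. This restatement is a standard one rather than something spelled out in this handout, and the event encoding, the cut representation, and the name is_consistent are assumptions made for illustration.

```python
def is_consistent(events, cut):
    """Return True if every receive inside the cut has its send inside the cut.

    events: dict mapping a process name to its ordered list of
            (kind, msg_id) pairs, where kind is "send" or "recv".
    cut:    dict mapping a process name to how many of its events
            (counted from the start) are included in the snapshot.
    """
    sent, received = set(), set()
    for proc, history in events.items():
        for kind, msg_id in history[: cut[proc]]:
            if kind == "send":
                sent.add(msg_id)
            else:
                received.add(msg_id)
    # A message that is sent but not received is merely in transit, which is fine.
    return received <= sent


# Hypothetical two-process execution: P sends m1 to Q, then Q sends m2 to P.
events = {
    "P": [("send", "m1"), ("recv", "m2")],
    "Q": [("recv", "m1"), ("send", "m2")],
}

print(is_consistent(events, {"P": 1, "Q": 0}))  # m1 sent, still in transit
print(is_consistent(events, {"P": 0, "Q": 1}))  # m1 received but never sent
```

The first cut leaves m1 in transit and is consistent; the second records the receipt of m1 without its send, so no ordering of the events in S could have produced it.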
Global snapshots are useful in a wide variety of distributed applications. One application is in distributed databases, for instance a group of bank branches. Another use is deadlock detection: a global snapshot can be examined to see whether the algorithm has made any progress. Termination of a distributed algorithm can be detected in the same way.
In this lab, you will consider the problem of computing global snapshots in a distributed bank application. You will implement bank modules that send money to each other at random intervals.
Read over the entire assignment before starting.
The basic strategy of the algorithm is as follows. When a process initiates a global snapshot, it records its local state and sends a special marker on all of its outgoing channels to signal to its neighbors that a snapshot has been initiated. It then begins remembering all the messages it receives on its incoming channels. When it gets a marker on an incoming channel from neighbor X, it knows that neighbor X has taken a local snapshot that reflects all of the messages X has sent up to the marker. Therefore, the process receiving the marker records the state of that channel (i.e., the messages it has received between recording its local state and receiving the marker on the channel). A process that has not yet recorded its local state when its first marker arrives does so at that point, sends markers on all of its outgoing channels, and records the channel the marker arrived on as empty. After the local state and the states of all incoming channels have been recorded, the local node reports that information. All participants in the global snapshot use the same algorithm, summarized below.
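The marker rules described above can be sketched in a few lines of Python. This is an illustrative, single-threaded simulation under stated assumptions, not the interface required by this lab: the Node class, the MARKER sentinel, the shared network dictionary of FIFO queues, and the demo scenario are all hypothetical.

```python
from collections import deque

MARKER = "MARKER"  # hypothetical sentinel standing in for the special marker


class Node:
    def __init__(self, name, balance, network):
        self.name = name
        self.balance = balance
        self.network = network      # (src, dst) -> deque of in-transit messages
        self.recorded_state = None  # local state saved for the snapshot
        self.channel_state = {}     # src -> messages recorded on channel src -> self
        self.recording = set()      # incoming channels still awaiting a marker

    def incoming(self):
        return [src for (src, dst) in self.network if dst == self.name]

    def outgoing(self):
        return [dst for (src, dst) in self.network if src == self.name]

    def send(self, dst, amount):
        self.balance -= amount
        self.network[(self.name, dst)].append(amount)

    def start_snapshot(self):
        # Record local state, start recording every incoming channel,
        # and send a marker on every outgoing channel.
        self.recorded_state = self.balance
        self.recording = set(self.incoming())
        self.channel_state = {src: [] for src in self.recording}
        for dst in self.outgoing():
            self.network[(self.name, dst)].append(MARKER)

    def receive(self, src):
        msg = self.network[(src, self.name)].popleft()
        if msg == MARKER:
            if self.recorded_state is None:
                self.start_snapshot()    # first marker seen: join the snapshot
            self.recording.discard(src)  # this channel's state is now final
        else:
            self.balance += msg
            if src in self.recording:
                self.channel_state[src].append(msg)

    def snapshot_done(self):
        return self.recorded_state is not None and not self.recording


def demo():
    network = {("A", "B"): deque(), ("B", "A"): deque()}
    a, b = Node("A", 100, network), Node("B", 100, network)
    b.send("A", 10)     # 10 is in transit on B -> A
    a.start_snapshot()  # A records 100 and sends a marker to B
    a.send("B", 20)     # sent after the marker, so not part of the snapshot
    b.receive("A")      # B sees the marker, records 90, sends its own marker
    a.receive("B")      # A receives 10 and records it as channel state
    a.receive("B")      # A receives B's marker and finishes
    return a, b


a, b = demo()
snapshot_total = (a.recorded_state + b.recorded_state
                  + sum(a.channel_state["B"]) + sum(b.channel_state["A"]))
print(snapshot_total)
```

In this scenario the recorded balances (100 and 90) plus the 10 units recorded on the channel from B to A add up to the 200 units the system started with, even though A and B recorded their states at different times. The sketch assumes reliable FIFO channels, which is what lets the marker separate messages sent before a local snapshot from those sent after it.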
K. Mani Chandy and Leslie Lamport, Distributed Snapshots: Determining Global States of Distributed Systems. ACM Transactions on Computer Systems, Vol. 3, No. 1, February, 1985, pp. 63-75.
To receive credit for this lab, you should: