Using Virtual Nodes to Compact Vector Clocks
One hiccup encountered when using vector clocks is that there is no inherent way of reducing the size of the clock. Any node which acts as a causal agent of change has the potential to be recorded in the clock forever, so the clock grows without bound over time. Most systems have a limited number of causal, or lead, nodes providing clock values, which keeps this in check, but sometimes you don't have that luxury.
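To make this concrete, here is a minimal sketch of a vector clock -- the class and method names are mine, not from any particular system. The thing to notice is that every distinct causal node adds a key, and nothing ever removes one:

```java
import java.util.HashMap;
import java.util.Map;

// A minimal vector clock: node name -> logical counter.
public class VectorClock {
    private final Map<String, Integer> entries = new HashMap<>();

    // A node bumps its own counter each time it causes a change, so
    // every distinct causal agent adds a key that never goes away.
    public void increment(String node) {
        entries.merge(node, 1, Integer::sum);
    }

    // Grows without bound as new nodes act on the element.
    public int size() {
        return entries.size();
    }
}
```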
When vector clocks are used to track causality in a storage system, such as in Amazon's Dynamo, it becomes possible to create synchronization points in the history of an element, between storage nodes, if the storage nodes are able to form consensus among themselves on the value of the element at a specific point in its history. In an eventually consistent system, this can be done by a background synchronization and merge algorithm which reconciles acausal changes. Alternately, it could be client resolved, as in Dynamo, but that isn't my problem, so... I digress.
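The merge step for two acausal clocks is just a per-key maximum: the result descends from both inputs. A rough sketch, assuming clocks are plain maps from node name to counter:

```java
import java.util.HashMap;
import java.util.Map;

public class ClockMerge {
    // Merge two acausal (concurrent) clocks by taking the per-key
    // maximum; the merged clock causally descends from both inputs.
    static Map<String, Integer> merge(Map<String, Integer> a,
                                      Map<String, Integer> b) {
        Map<String, Integer> merged = new HashMap<>(a);
        b.forEach((node, counter) -> merged.merge(node, counter, Integer::max));
        return merged;
    }
}
```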
When a storage node believes it has a value at a given clock, where that clock is causally related to the unified value on the other storage nodes holding the element, it can try to achieve consensus about this. If successful, it increments an artificial clock key which we'll call the epoch. The epoch value then subsumes the vector clock entries it covers, shrinking the element's clock.
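Here is roughly what that subsumption could look like. This is a sketch under my own assumptions (a single epoch key, clocks as plain maps); the important parts are the dominance check a node performs before accepting an epoch, and dropping the entries the epoch fully covers:

```java
import java.util.HashMap;
import java.util.Map;

public class EpochCompaction {
    // True if candidate causally precedes (or equals) clock: every
    // entry in candidate appears in clock with a counter at least
    // as large.
    static boolean dominatedBy(Map<String, Integer> candidate,
                               Map<String, Integer> clock) {
        return candidate.entrySet().stream()
                .allMatch(e -> clock.getOrDefault(e.getKey(), 0) >= e.getValue());
    }

    // If the agreed epoch clock is in this node's causal past, record
    // the epoch under an artificial key and drop the entries the epoch
    // fully covers; entries the epoch has not caught up to are kept.
    static Map<String, Integer> compact(Map<String, Integer> clock,
                                        Map<String, Integer> epochClock,
                                        int epochNumber) {
        if (!dominatedBy(epochClock, clock)) {
            return clock; // cannot accept: epoch is not in our history
        }
        Map<String, Integer> compacted = new HashMap<>();
        compacted.put("epoch", epochNumber);
        clock.forEach((node, counter) -> {
            if (counter > epochClock.getOrDefault(node, 0)) {
                compacted.put(node, counter);
            }
        });
        return compacted;
    }
}
```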
To run through an example, let's say we have a system which uses three storage nodes for each element. We don't care exactly how the values are assigned, except to recognize that non-causally related changes can occur. At a given point in time the storage nodes may have the following clocks for an element A:
Node    Clock
red     [red:2, green:1]
blue    [red:2, green:1, blue:2]
green   [red:3, green:1, blue:1]
A Paxos instance may be executed proposing that epoch:1 be [red:2, green:1]. As each node can agree that [red:2, green:1] comes before its own clock, each can accept the epoch value. Upon acceptance, the clocks would become:
Node    Clock
red     [epoch:1]
blue    [epoch:1, blue:2]
green   [epoch:1, red:3, blue:1]
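Running the clocks from the first table through the compact sketch above reproduces this result (the clock values are straight from the example; key order in the printed maps is unspecified):

```java
import java.util.Map;

public class EpochExample {
    public static void main(String[] args) {
        Map<String, Integer> epoch1 = Map.of("red", 2, "green", 1);

        Map<String, Integer> red   = Map.of("red", 2, "green", 1);
        Map<String, Integer> blue  = Map.of("red", 2, "green", 1, "blue", 2);
        Map<String, Integer> green = Map.of("red", 3, "green", 1, "blue", 1);

        // Each node accepts because [red:2, green:1] precedes its clock.
        System.out.println(EpochCompaction.compact(red, epoch1, 1));   // {epoch=1}
        System.out.println(EpochCompaction.compact(blue, epoch1, 1));  // {epoch=1, blue=2}
        System.out.println(EpochCompaction.compact(green, epoch1, 1)); // {epoch=1, red=3, blue=1}
    }
}
```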
Assuming a background reconciliation protocol, a system could apply an appropriate heuristic to decide when to attempt to increment the epoch; a good example would be right after unrelated values have been successfully merged. When it makes sense, and how to back off to older clock values, really depends on the characteristics of the system being designed and how it will be used.
As pointed out in the Dynamo paper, systems where there tend to be a small number of keys in the clock don't generally have this problem. Dynamo avoids it by drawing the clock keys from a small number of likely coordinator nodes:
To this end, Dynamo employs the following clock truncation scheme:
Along with each (node, counter) pair, Dynamo stores a timestamp that
indicates the last time the node updated the data item. When the
number of (node, counter) pairs in the vector clock reaches a
threshold (say 10), the oldest pair is removed from the
clock. Clearly, this truncation scheme can lead to inefficiencies in
reconciliation as the descendant relationships cannot be derived
accurately. However, this problem has not surfaced in production and
therefore this issue has not been thoroughly investigated.
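Read literally, that truncation scheme might look something like the following sketch -- the Entry record and the bookkeeping are my guesses at the shape, not Dynamo's actual code:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ClockTruncation {
    static final int THRESHOLD = 10; // the "say 10" from the paper

    // One (node, counter) pair plus the last time that node updated
    // the data item.
    record Entry(String node, int counter, long lastUpdatedMillis) {}

    // Keep only the most recently updated pairs once the threshold is
    // crossed. This loses causal information: descendant relationships
    // involving a dropped pair can no longer be derived accurately.
    static List<Entry> truncate(List<Entry> clock) {
        if (clock.size() <= THRESHOLD) {
            return clock;
        }
        List<Entry> byRecency = new ArrayList<>(clock);
        byRecency.sort(Comparator.comparingLong(Entry::lastUpdatedMillis).reversed());
        return new ArrayList<>(byRecency.subList(0, THRESHOLD));
    }
}
```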
In a scheme like that, epoch compaction may not make a lot of sense -- the problem just doesn't tend to come up. On the other hand, systems which use something like a user id or session id as a clock key will tend to generate larger clocks. That kind of keying can be useful for providing read-what-I-wrote consistency, but that is another discussion :-)