Angle Brackets and Curly Braces
Cameron Purdy on Dealing with Failure
by Bill Venners
August 14, 2006

Summary
In his weblog, Cameron Purdy suggests that when a distributed system is designed as a multi-cellular organism rather than a collection of individual cells, an application need not deal with the potential for other servers to fail, but rather with it's own potential for failure.

In a conversation with Cameron Purdy, CEO of Tangosol, about distributed systems design, I asked him this question: Why do clients of the Map interface used in Coherence to add objects to a distributed cache need not deal with the potential for failure of the network or other nodes in the distributed system? The reason I asked is that ignoring the potential for failure seems on its surface to be in conflict with A Note on Distributed Computing, by Jim Waldo, et. al. This paper states that "objects that interact in a distributed system need to be dealt with in ways that are intrinsically different from objects that interact in a single address space."

In his email response, which he also published in his weblog as Distributed Systems as Organisms, he identified two ways to approach distributed systems design:

There are two vastly different approaches to distributed software, which I can sum up as traditional and organic. In a traditional distributed system, each server, each process is an isolated unit, like a single-celled organism. It exists as an independent unit that must fend for itself within its environment, and must always assume the worst, because it is the last—likely the only—line of defense for the responsibilities that it carries. For example, it must assume that failure to communicate with another part of the distributed system results in indeterminable conditions, such as in-doubt transactions. As a result, it must consolidate those indeterminate conditions into its own condition, likely preventing it from continuing processing.

Thus, in a traditional distributed system, each node must be prepared to deal with the failure of other nodes, as A Note on Distributed Computing recommends. Purdy also suggests that in the traditional approach, if a node fails, other nodes must wait for the failed node to recover. In other words, "dealing with failure" means waiting on the failed node to recover:

...in a traditional distributed system, the loss of communication to a particular server would cause all other servers that were communicating with that server to wait for the recovery of that server's responsibilities, whether by server migration, service failover, or actual repair (e.g. reboot) of that server.

By contrast, he compares organic systems to multi-cellular organisms:

[Organic systems] represent multi-cellular organisms that are designed to survive the loss of individual cells. In other words, a distributed system is no longer a collection of independent organisms, but is itself an organism composed of any number of servers that can continue processing—without any loss of data or transactions—even when server failure occurs.

In such distributed systems, Purdy claims, individual nodes need not deal with the failure of others because in essence, each node interacts not with other nodes, but with the organism itself. In such systems, the main technical challenge is not dealing with failure of other nodes, but the rapid detection and isolation (by the organism) of a failed node:

...the failure of a server is no longer an exceptional condition, and it affects neither the availability of the overall system nor the state of the data and transactions being managed by the overall system. Thus, an application may still have to deal with the potential for failure, but not the failure of a particular server. Instead, ... an application must deal with the fact that it is on the server that failed, and in exchange, it no longer has to worry about the failure of some other server.

I found Purdy's response quite interesting, but it wasn't what I expected. First of all, I think that between Purdy's traditional and organic categories lies a middle ground. An individual node can deal with the failure of another node not only by waiting for the failed node to recover. It could also go looking for a different node that can perform the same responsibility. In other words, the responsibility for fail over can be with the client rather than the failed node. This is, in fact, what I understand Coherence partitioned cache does if the node responsible for storing an object fails. When I ask my local cache Map for that object, Coherence realizes the primary node is down and goes looking for the backup of that object that it placed on a different node. My theory, therefore, is that that clients of a cache Map need not deal with failure, because the Map does a good enough job of dealing with failure itself. In other words, the application does indeed deal with failure of other nodes, but the part of the application that does so, is the implementation of the cache Map.

Talk Back!

Have an opinion? Readers have already posted 7 comments about this weblog entry. Why not add yours?

RSS Feed

If you'd like to be notified whenever Bill Venners adds a new entry to his weblog, subscribe to his RSS feed.

Digg |

del.icio.us |

About the Blogger

Bill Venners is president of Artima, Inc., publisher of Artima Developer (www.artima.com). He is author of the book, Inside the Java Virtual Machine, a programmer-oriented survey of the Java platform's architecture and internals. His popular columns in JavaWorld magazine covered Java internals, object-oriented design, and Jini. Active in the Jini Community since its inception, Bill led the Jini Community's ServiceUI project, whose ServiceUI API became the de facto standard way to associate user interfaces to Jini services. Bill is also the lead developer and designer of ScalaTest, an open source testing tool for Scala and Java developers, and coauthor with Martin Odersky and Lex Spoon of the book, Programming in Scala.


	Web Artima.com