The Artima Developer Community

Weblogs Forum
Cameron Purdy on Dealing with Failure

7 replies on 1 page. Most recent reply: Aug 18, 2006 9:43 AM by Manik Surtani

Bill Venners

Posts: 2284
Nickname: bv
Registered: Jan, 2002

Cameron Purdy on Dealing with Failure (View in Weblogs)
Posted: Aug 14, 2006 11:26 AM
Summary
In his weblog, Cameron Purdy suggests that when a distributed system is designed as a multi-cellular organism rather than a collection of individual cells, an application need not deal with the potential for other servers to fail, but rather with its own potential for failure.

In a conversation with Cameron Purdy, CEO of Tangosol, about distributed systems design, I asked him this question: Why don't clients of the Map interface that Coherence uses for adding objects to a distributed cache need to deal with the potential for failure of the network or of other nodes in the distributed system? The reason I asked is that ignoring the potential for failure seems, on its surface, to be in conflict with A Note on Distributed Computing, by Jim Waldo, et al. This paper states that "objects that interact in a distributed system need to be dealt with in ways that are intrinsically different from objects that interact in a single address space."
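For concreteness, the kind of client code in question looks roughly like this (a minimal sketch against Coherence's Map-based API; the cache name, key, and value are made up for illustration):

import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;

public class CachePutExample {
    public static void main(String[] args) {
        // A NamedCache is a java.util.Map backed by the distributed cache;
        // "orders" is a made-up cache name for this illustration.
        NamedCache cache = CacheFactory.getCache("orders");

        // A plain Map.put(): nothing in the signature forces the caller to
        // handle network or node failure, which is what prompted the question.
        cache.put("order-42", "some order state");

        System.out.println(cache.get("order-42"));
    }
}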

In his email response, which he also published in his weblog as Distributed Systems as Organisms, he identified two ways to approach distributed systems design:

There are two vastly different approaches to distributed software, which I can sum up as traditional and organic. In a traditional distributed system, each server, each process is an isolated unit, like a single-celled organism. It exists as an independent unit that must fend for itself within its environment, and must always assume the worst, because it is the last—likely the only—line of defense for the responsibilities that it carries. For example, it must assume that failure to communicate with another part of the distributed system results in indeterminable conditions, such as in-doubt transactions. As a result, it must consolidate those indeterminate conditions into its own condition, likely preventing it from continuing processing.

Thus, in a traditional distributed system, each node must be prepared to deal with the failure of other nodes, as A Note on Distributed Computing recommends. Purdy also suggests that in the traditional approach, if a node fails, other nodes must wait for the failed node to recover. In other words, "dealing with failure" means waiting on the failed node to recover:

...in a traditional distributed system, the loss of communication to a particular server would cause all other servers that were communicating with that server to wait for the recovery of that server's responsibilities, whether by server migration, service failover, or actual repair (e.g. reboot) of that server.

By contrast, he compares organic systems to multi-cellular organisms:

[Organic systems] represent multi-cellular organisms that are designed to survive the loss of individual cells. In other words, a distributed system is no longer a collection of independent organisms, but is itself an organism composed of any number of servers that can continue processing—without any loss of data or transactions—even when server failure occurs.

In such distributed systems, Purdy claims, individual nodes need not deal with the failure of others because in essence, each node interacts not with other nodes, but with the organism itself. In such systems, the main technical challenge is not dealing with failure of other nodes, but the rapid detection and isolation (by the organism) of a failed node:

...the failure of a server is no longer an exceptional condition, and it affects neither the availability of the overall system nor the state of the data and transactions being managed by the overall system. Thus, an application may still have to deal with the potential for failure, but not the failure of a particular server. Instead, ... an application must deal with the fact that it is on the server that failed, and in exchange, it no longer has to worry about the failure of some other server.

I found Purdy's response quite interesting, but it wasn't what I expected. First of all, I think that between Purdy's traditional and organic categories lies a middle ground. An individual node can deal with the failure of another node not only by waiting for the failed node to recover; it could also go looking for a different node that can perform the same responsibility. In other words, the responsibility for failover can rest with the client rather than with the failed node. This is, in fact, what I understand the Coherence partitioned cache does if the node responsible for storing an object fails. When I ask my local cache Map for that object, Coherence realizes the primary node is down and goes looking for the backup of that object that it placed on a different node. My theory, therefore, is that clients of a cache Map need not deal with failure, because the Map does a good enough job of dealing with failure itself. In other words, the application does indeed deal with the failure of other nodes, but the part of the application that does so is the implementation of the cache Map.
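In code terms, my theory amounts to something like the following hypothetical wrapper. It is not Coherence's actual implementation, just an illustration of where the failure handling lives:

import java.util.Map;

// Hypothetical illustration only (not Coherence's real implementation):
// a Map-like wrapper whose get() hides node failure from the caller by
// falling back to a backup copy, so the failure handling lives inside
// the "cache Map" rather than in application code.
class FailoverMap<K, V> {
    private final Map<K, V> primary;
    private final Map<K, V> backup;

    FailoverMap(Map<K, V> primary, Map<K, V> backup) {
        this.primary = primary;
        this.backup = backup;
    }

    V get(K key) {
        try {
            return primary.get(key);            // ask the node that owns the key
        } catch (RuntimeException nodeDown) {   // node or network failure surfaces here
            return backup.get(key);             // and is absorbed here, not by the client
        }
    }
}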


Ivan Lazarte

Posts: 91
Nickname: ilazarte
Registered: Mar, 2006

Re: Cameron Purdy on Dealing with Failure Posted: Aug 14, 2006 12:11 PM
From what I remember, this appears to be the philosophy behind the Google File System. "Failure is the norm, not the exception."

Ari Zilka

Posts: 8
Nickname: ikarzali
Registered: Jul, 2006

Re: Cameron Purdy on Dealing with Failure Posted: Aug 15, 2006 6:43 PM
I do believe in the notion of organic systems, but I view the same concept from a workload perspective. Furthermore, I don't think there is a middle ground between organic clusters and inorganic ones. And I assert this because I think you design an application so that it can be clustered or not.

In short, it's the workload that can be partitioned into atoms much smaller than the server doing that work. You partition the work or you don't. And how you partition the work leads either to a good organism that can handle unplanned demand or to an inorganic system that can only handle what it was designed to.

With small atom definitions, servers can do lots of atoms of work per unit-time. And, if an atom is properly defined (as restartable), then the cluster is inherently an organism where units of work flow around the organism toward a path of least resistance.
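A rough sketch of that shape, assuming the atoms are small and restartable, and with a local queue standing in for something shared across the whole cluster (all names here are hypothetical):

import java.util.concurrent.BlockingQueue;

// Illustration only: a local BlockingQueue stands in for a cluster-wide
// queue of small, restartable work "atoms". Each server just pulls the
// next atom; if a server dies mid-atom, the atom can be re-enqueued and
// picked up by whichever server has spare capacity, the path of least
// resistance.
class Worker implements Runnable {
    private final BlockingQueue<Runnable> atoms;

    Worker(BlockingQueue<Runnable> atoms) {
        this.atoms = atoms;
    }

    public void run() {
        try {
            while (true) {
                Runnable atom = atoms.take();   // next small unit of work
                atom.run();                     // restartable, so re-running it is safe
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop cleanly when asked to
        }
    }
}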

Inorganic workloads would be those that are highly deterministic as to where they run and highly contending for particular resources. The atom is roughly equivalent in scale to the server it runs on (meaning serial or sequential processing).

At the end of the day, web sites should have taught us all a lesson in that organic computing (HTTP, load balancing, grids of machines, restartable transactions, initiator of the transaction determines success or failure, etc.) leads to flexible IT. Cameron is spot on. The Web Monsters of the world call their traffic organic / unpredictable. They plan and design for it. And folks like Priceline end up with 100% uptime for 6 years. It is possible. I have seen it.

Bill Venners

Posts: 2284
Nickname: bv
Registered: Jan, 2002

Re: Cameron Purdy on Dealing with Failure Posted: Aug 16, 2006 4:05 PM
Hi Ari,

> I do believe in the notion of organic systems, but I view
> the same concept from a workload perspective.
> Furthermore, I don't think there is a middle ground
> between organic clusters and inorganic ones. And I
> assert this because I think you design an application so
> that it can be clustered or not.
>
I think you are using the terms "organic" differently than Cameron. Also, he doesn't use "inorganic" but "traditional." In Cameron's traditional approach, each server is an isolated, independent unit that must recover from its own failure, and explicitly deal with the failure of other servers, which are themselves isolated, independent units. By contrast, in Cameron's organic approach, multiple servers can perform the same responsibility and can take over existing transactions if one server fails.

I don't yet feel like I have a satisfying answer to my question as to why clients of Coherence (or Terracotta, for that matter) don't need to deal with failure explicitly. First of all, one way an isolated server might perform self-recovery is to have a hot backup and some mechanism for detecting a problem with the primary and switching over to the backup. You might want three servers in there, so one can be down for maintenance while the primary is failing and the secondary is taking over. So in this case, it isn't really one server, it is three servers, all of which can "perform the same responsibility and take over existing transactions if one server fails." So why isn't that organic?

Well, perhaps the reason is that to outside clients of that server (though maybe we should call it a service), it is an isolated server. Perhaps it would be more organic if all such parties were organized as a single organism. However, doesn't that just move the inorganic-ness? Such a super-organism would only be useful if used from the outside, right? And to those clients, the entire super-organism appears as a single server, which is responsible for its own self-recovery.

In contrast, my gut feeling is that if a service is reliable enough, I as a client may decide it is not worth handling the failure case. If I as a server need to provide 99.99% uptime to my clients, and a service I'm using provides 99.999% uptime, then I may be able to safely ignore its failure. When it fails, I fail. For example, if I'm writing code that calls into Coherence's API to add an object to a distributed cache, I just call put on the Map and don't worry about any exception that may be thrown. I trust that the cache API will take care of failure for me to a great enough degree that I don't worry about the potential for failure. If the cache fails, I fail. Same thing if I update a shared object managed by Terracotta: I don't attempt to recover, I just wait for Terracotta to solve the problem.
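A rough back-of-envelope, assuming independent failures and taking those made-up numbers at face value: 0.9999 × 0.99999 ≈ 0.99989, so depending on the 99.999% service still leaves me at roughly 99.99% overall, and the dependency adds only on the order of five minutes of downtime a year. That may well be cheaper than writing and testing failure-handling code around every call.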

The other thing I notice is that not much worry seems to be given to the network itself failing. Part of the problem pointed out by A Note on Distributed Computing was that I can't tell if a failure is due to the other party actually failing or the network going down between us. Perhaps because tools such as Coherence, Terracotta, etc., are created to serve enterprises, there's an assumption that money can and will be spent to make the network itself reliable enough so that programmers don't have to worry about its failure. And that cost would often be justifiably less than the cost of paying programmers to write code that deals with failure everywhere.

Dan Creswell

Posts: 49
Nickname: dancres
Registered: Apr, 2003

Re: Cameron Purdy on Dealing with Failure Posted: Aug 17, 2006 1:05 AM
> Hi Ari,
>
> > I do believe in the notion of organic systems, but I
> view
> > the same concept from a workload perspective.
> > Furthermore, I don't think there is a middle ground
> > between organic clusters and inorganic ones. And I
> > assert this because I think you design an application
> so
> > that it can be clustered or not.
> >
> I think you are using the terms "organic" differently than
> Cameron.

+1

.....

> The other thing I notice is that not much worry seems to
> be given to the network itself failing. Part of the
> problem pointed out by A Note on Distributed
> Computing was that I can't tell if a failure is due to
> the other party actually failing or the network going down
> between us. Perhaps because tools such as Coherence,
> Terracotta, etc., are created to serve enterprises,
> there's an assumption that money can and will be spent to
> make the network itself reliable enough so that
> programmers don't have to worry about its failure. And
> that cost would often be justifiably less than the cost of
> paying programmers to write code that deals with failure
> everywhere.

Two other thoughts:

(1) Enterprises can, in many cases, enforce a very controlled environment, as in a tightly controlled data-centre with robust networking, an admin team, etc., but not everyone can afford that approach.

(2) Dealing with failure needn't be difficult - one way or another you either undo or redo the "operation". A good example of this philosophy can be seen in Google's MapReduce.
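A minimal sketch of the "redo" half of that idea, assuming the operation is idempotent so it can safely be re-run against whichever node is currently healthy (the helper and its names are hypothetical):

import java.util.concurrent.Callable;

// Hypothetical helper for the "redo" case: re-run an idempotent operation
// until it succeeds or the retry budget is exhausted (assumes maxAttempts >= 1).
// The "undo" case would instead roll back whatever partial state the failed
// attempt left behind.
class Redo {
    static <T> T withRetries(Callable<T> operation, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();    // redo the whole operation from the top
            } catch (Exception e) {
                last = e;                   // remember the failure and try again
            }
        }
        throw last;                         // give up after maxAttempts
    }
}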

Ari Zilka

Posts: 8
Nickname: ikarzali
Registered: Jul, 2006

Re: Cameron Purdy on Dealing with Failure Posted: Aug 18, 2006 6:51 AM
> Hi Ari,
>
> > I do believe in the notion of organic systems, but I
> view
> > the same concept from a workload perspective.
> > Furthermore, I don't think there is a middle ground
> > between organic clusters and inorganic ones. And I
> > assert this because I think you design an application
> so
> > that it can be clustered or not.
> >
> I think you are using the terms "organic" differently than
> Cameron. Also, he doesn't use "inorganic" but
> "traditional." In Cameron's traditional approach, each
> server is an isolated, independent unit that must recover
> from its own failure, and explicitly deal with the failure
> of other servers, which are themselves isolated,
> independent units. By contrast, in Cameron's organic
> approach, multiple servers can perform the same
> responsibility and can take over existing transactions if
> one server fails.

Ok. I noticed the overlap of terminology and agree that I shifted from Cameron's intent. Sorry for the confusion. But I think it all goes together to answer people's needs.

1. Organic vs. Traditional system design. Multiple machines can do the same task. Good. Now I have no single point of failure w.r.t. where my business logic can run assuming a few other things.

2. Organic vs. Inorganic workloads. Let's call this simply "restartable." Restartable workloads mean that if I start a transaction on Server A and then decide to restart it on Server B, the system will remain consistent. I need "restartable" with consistency before I can leverage "organic."

3. And this last one is implicit in both #1 and #2: the data needs to be separated from the application. If data is durable, organic servers can fail w/o losing what they were working on. Think of Tangosol or Terracotta as "Network Attached Memory." And think of Microsoft Word as the example. If Word wrote its local RAM to NAM, then you wouldn't hit Ctrl-S all the time, because the machine running Word could never lose the contents of your doc until NAM crashed (and rest assured we can make NAM itself highly available).

A server should be able to write critical pieces of application data to durable NAM. If it can, you get HA, because the NAM is the memory of record: any organic node connecting to NAM gets a consistent view of state. And scalability comes from allowing multiple such organic servers to connect to the same NAM.
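As a rough sketch of the idea (not Tangosol's or Terracotta's actual API, just the shape of it): keep the critical state in a shared map that outlives any one server, and treat local memory as a disposable working copy. The class, key, and method names below are made up for the example.

import java.util.Map;

// Illustration only: "nam" here is any durable, highly available shared
// map (a distributed cache, a clustered heap, etc.).
class Editor {
    private final Map<String, String> nam;                 // the "Network Attached Memory"
    private static final String DOC_KEY = "doc:quarterly-report";

    Editor(Map<String, String> nam) {
        this.nam = nam;
    }

    void edit(String newContents) {
        // Write-through: critical state goes to NAM on every change, so if
        // this server dies, another server can pick up where it left off.
        nam.put(DOC_KEY, newContents);
    }

    String resumeAfterFailover() {
        // A replacement server recovers the state of record from NAM
        // instead of from the dead server's local RAM.
        return nam.get(DOC_KEY);
    }
}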

As for network stability:
1. In a small network, simply buy double the ports and use multiple IPs per machine...per-port costs are cheap
2. In a large / WAN-based network, segment the data into multiple NAM repositories so that you don't have to share it everywhere and, thus, do not require double the WAN links, etc.
3. With a redundant LAN and reasonable data partitioning, you will find that you lose organic nodes frequently but the network only rarely.

If you have organic nodes, restartable transactions, and NAM, restarting the transaction on a new server shouldn't hurt.

Cameron Purdy

Posts: 186
Nickname: cpurdy
Registered: Dec, 2004

Re: Cameron Purdy on Dealing with Failure Posted: Aug 18, 2006 7:20 AM
Ari - you're getting pretty close to where I was headed with the analogy.

As for "Network Attached Memory," I love the term, but I'm scared of the acronym ;-)

Peace.

Manik Surtani

Posts: 1
Nickname: msurtani
Registered: Aug, 2006

Re: Cameron Purdy on Dealing with Failure Posted: Aug 18, 2006 9:43 AM
Hi Ari.

This makes things clearer. You've effectively broken up the 'organic-ness' of Cameron's scenario into:

1) Fault tolerance of the overall system (and I suspect this is the 'middle ground' that Bill speaks of, between traditional and organic architectures) - basically having a backup node able to carry on with the workload.

2) Atomization of the workload into distributable chunks. The ability to scale by adding more resources/dynamically resizing the cluster, and sharing the state of partially processed data.

3) Transparent self healing. (Again, back to Bill's point of the client dealing with a failure vs. the 'organism' dealing with it)

I agree with Bill w.r.t. point 1, but I believe Cameron was speaking about more than just fault tolerance and dealing with failure. All 3 factors above come into play when talking of a distributed system as a multi-cellular organism; it is more than just dealing with failure on a request/response basis. And on that level, I don't see a middle ground.
