Sponsored Link •
In this third installment of Bill Venners' interview with Ken Arnold, the discussion centers around designing distributed systems, including the importance of designing for failure, avoiding the "hell" of state, and choosing recovery strategies.
Ken Arnold has done a lot of design in his day. While at Sun Microsystems, Arnold was one of the original architects of Jini technology and was the original lead architect of JavaSpaces. Prior to joining Sun, Arnold participated in the original Hewlett-Packard architectural team that designed CORBA. While at UC Berkeley, he created the Curses library for terminal-independent screen-oriented programs. In Part I of this interview, which is being published in six weekly installments, Arnold explains why there's no such thing as a perfect design, suggests questions you should ask yourself when you design, and proposes the radical notion that programmers are people. In Part II, Arnold discusses the role of taste and arrogance in design, the value of other people's problems, and the virtue of simplicity. In this third installment, Arnold discusses the concerns of distributed systems design, including the need to expect failure, avoid state, and plan for recovery.
Bill Venners: What is important to keep in mind when you are designing a distributed system?
Ken Arnold: We should start off with some notion of what we mean by distributed system. A distributed system, in the sense in which I take any interest, means a system in which the failure of an unknown computer can screw you.
Failure is not such an important factor for some multicomponent distributed systems. Those systems are tightly controlled; nobody ever adds anything unexpectedly; they are designed so that all components go up and down at the same time. You can create systems like that, but those systems are relatively uninteresting. They are also quite rare.
Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the expectation of failure. Imagine asking people, "If the probability of something happening is one in ten to the thirteenth, how often would it happen?" Your natural human sense would be to answer, "Never." That is an infinitely large number in human terms. But if you ask a physicist, she would say, "All the time. In a cubic foot of air, those things happen all the time." When you design distributed systems, you have to say, "Failure happens all the time." So when you design, you design for failure. It is your number one concern.
Yes, you have to get done what you have to get done, but you have to do it in the context of failure. One reason it is easier to write systems with Jini and RMI (remote method invocation) is because they've taken the notion of failure so seriously. We gave up on the idea of local/remote transparency. It's a nice thought, but so is instantaneous faster-than-light travel. It is demonstrably true that at least so far transparency is not possible.
What does designing for failure mean? One classic problem is partial failure. If I send a message to you and then a network failure occurs, there are two possible outcomes. One is that the message got to you, and then the network broke, and I just couldn't get the response. The other is the message never got to you because the network broke before it arrived. So if I never receive a response, how do I know which of those two results happened? I cannot determine that without eventually finding you. The network has to be repaired or you have to come up, because maybe what happened was not a network failure but you died.
Now this is not a question you ask in local programming. You invoke a method and an object. You don't ask, "Did it get there?" The question doesn't make any sense. But it is the question of distributed computing.
So considering the fact that I can invoke a method on you and not know if it arrives, how does that change how I design things? For one thing, it puts a multiplier on the value of simplicity. The more things I can do with you, the more things I have to think about recovering from. That also means the conceptual cost of having more functionality has a big multiplier. In my nightmares, I'll tell you it's exponential, and not merely a multiplier. Because now I have to ask, "What is the recovery strategy for everything on which I interact with you?" That also implies that you want a limited number of possible recovery strategies.