artima - Article Discussion

Article Discussion

Designing Distributed Systems

View Threaded

Summary: In this third installment of Bill Venners' interview with Ken Arnold, the discussion centers around designing distributed systems, including the importance of designing for failure, avoiding the "hell" of state, and choosing recovery strategies.

14 posts on 1 page.

« Previous 1 Next »

The ability to add new comments in this discussion is temporarily disabled.

Most recent reply: May 11, 2006 9:43 AM by K

Bill

Posts: 409 / Nickname: bv / Registered: January 17, 2002 4:28 PM

Designing Distributed Systems

September 21, 2002 8:59 PM

"Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the expectation of failure. Imagine asking people, "If the probability of something happening is one in ten to the thirteenth, how often would it happen?" Your natural human sense would be to answer, "Never." That is an infinitely large number in human terms. But if you ask a physicist, she would say, "All the time. In a cubic foot of air, those things happen all the time." When you design distributed systems, you have to say, "Failure happens all the time." So when you design, you design for failure. It is your number one concern," says Ken Arnold in this interview:

http://www.artima.com/intv/distrib.html

Posts: 1 / Nickname: lujhuang / Registered: September 25, 2002 7:24 AM

Re: Designing Distributed Systems

September 25, 2002 11:28 AM

I found Arnold is very insightful on the issues that are covered in the discussions/interview

Carfield

Posts: 12 / Nickname: carfield / Registered: September 16, 2002 3:19 PM

Re: Designing Distributed Systems

September 28, 2002 8:59 AM

I have a question about "Idempotency". In my understand, it mean try more times until it success. However, is it possible in reality? There must be cases that you have to abort and do other thing. Then you need to set up a timeout and do the transaction handling.

So I think transaction handling cannot prevent. Then what is the value of "Idempotency"?

Other than this, I think the article is great, especially the discussion of "state"

Ken

Posts: 9 / Nickname: kcrca / Registered: March 8, 2002 4:25 AM

Re: Designing Distributed Systems

September 29, 2002 8:15 AM

Technically idempotency means that results that are equally correct no matter how many times the call is executed. It can make designs much easier.

For example, the transaction join call in the Jini distributed transaction mechanism is idempotent so you can try to join multiple times without harm. This is pretty important -- you cannot require that joining a transaction be done inside a transaction, or you have a chicken and egg problem.

Transactions can be complex, and they are certainly expensive. Idempotency is much simpler, and can work in places where transactions can't. My point is not that idempotency is always better than transactions, but that idempotency is an important design tool that can better solve many problems. Many people instinctively run to transactions immediately, and they should think harder.

I agree that if you need to be able to positively assure either failure or success you need to use a transaction. Idempotency can logically only guarantee you the option of repeatedly trying for success.

Carfield

Posts: 12 / Nickname: carfield / Registered: September 16, 2002 3:19 PM

Re: Designing Distributed Systems

September 29, 2002 8:34 AM

>that idempotency is an important design tool that can better solve many problems. Many people instinctively run to transactions immediately, and they should think harder.

This is my question in fact. I can't think a case that don't need to handle transaction but only with idempotency, can you talk more about this? Could you explain futher about the jini join transaction example?

Ken

Posts: 9 / Nickname: kcrca / Registered: March 8, 2002 4:25 AM

Re: Designing Distributed Systems

September 29, 2002 10:44 AM

I can try. The join problem is this: Suppose you are asked to become a participant in a transaction. You talk to the TransactionManager and ask to join, but you get a remote exception. Are you in?

Maybe you are and maybe you aren't. But if you want to continue to do the work that requires the transaction, you can simply try to join again. The server will just consider redundant joins from the same participant to be successful. Repeat until you are ready to give up, or until you get in.

So far so good, but to really make it work we have to consider another case. Suppose you joined the transaction successfully and then crashed before you could record that information. When you restart, you will not know anything about that transaction. This is OK by itself -- the TransactionManager will eventually ask you to vote on the transaction and you will reply "I don't know about that transaction", causing the manager to abort the transaction.

The problem comes when you are asked to perform another operation under the transaction and therefore innocently try to rejoin. A trivial idempotency would allow you to do so, and then you would be a member but have forgotten the early (pre-crash) stuff. This would be bad.

So for that reason, the join call has a "join state" identifier that you pass along. Whenever you are unsure if you have all previous join state (such as across a crash) you generate a new ID. Now the manager will see a second join from the same potential participant, but with a different join state ID. Now the manager knows that something has gone wrong. It will refuse to let you in, and it will abort the transaction, rolling everyone back to a previous (consistent) state.

The important facts about this are (a) The client's failure to succeed is something it can handle; (b) The server is unharmed by executing a call even if the client doesn't learn about the execution. In this case a client that hasn't crashed can retry or give up as it prefers, and a client that has crashed will be stopped from causing harm. And the server is unharmed because it can always abort the transaction in the future if things go weird.

Another example might be logging: I will tell the server when I do something. If that fails, I will repeat the call, but only a few times. The client knows what to do (retry a few times) and the server can live equally well by ignoring redundant logging and by not getting the logging that happens during a network failure.

Dhruv

Posts: 1 / Nickname: dhruv / Registered: September 30, 2002 1:38 PM

Webservices: Designing Distributed Systems

September 30, 2002 5:41 PM

What happens when developers adopt Web services on a large scale ? Your thoughts imply that Web services based applications will need to be crafted to handle all the problems youve mentioned in distributed systems. Do you have any suggestions ?

Ken

Posts: 9 / Nickname: kcrca / Registered: March 8, 2002 4:25 AM

Re: Webservices: Designing Distributed Systems

September 30, 2002 8:03 PM

Sure, web services are distributed systems. They tend to be fairly large and monolithic to the outside, but inside they can be built of lots of parts. If you don't build them to be prepared for failure, they will fail catastrophically.

I think most people get this at a basic level. It's just that many more people are dealing with these intrinsically distributed systems, but aren't learning much about distributed system design and construction. Like with much of computer science, people are relearning lessons.

So I suppose my major advice is to go learn what's already known about building distributed systems. At least then if you get bitten hard by a problem it will be a novel one.

Frank

Posts: 135 / Nickname: fsommers / Registered: January 19, 2002 7:24 AM

Re: Designing Distributed Systems

September 30, 2002 11:51 PM

I think idempotancy is great (I try to use it whenever I can), but I'd say that transactions might not be the best example to illustrate idempotency. They're a fine example with only one participant, but the real power of transactions (IMO) are as a concurrency control and recovery mechanism, i.e., multiple participants competing for the same resources.

With Jini, the specs define the 2-phase commit as a coordination protocol, and 2-phase locking as default semantics, and the JSK comes with the a transaction coordinator for 2-phase commit. I found that implementing correct transaction semantics (locking and recovery) is non-trivial (read: hard as hell). Do you believe that the JSK should define some utility classes that help implement that behavior? For instance, a lock manager or a recovery manager, at least? What would you recommend to those wanting to implement transaction participant services?

Mark

Posts: 1 / Nickname: distobj / Registered: October 1, 2002 7:50 AM

State Hell

October 1, 2002 0:07 PM

Ken, I was really enjoying your description of "state hell", and still agree with it, but was confused by some of the stuff at the end, including this bit;

"Have as few stateful components as you can."

Stateless interaction == goodness (which you appeared to be talking about above), sure, but there you seem to suggest that this is the same as persistent state within your components. I don't get that. I mean, typically whether a component has state or not is a requirement of the role that the component plays (though roles can be chosen to minimize that, which I agree with).

What I believe is most important is that interactions with components be stateless so that the effects of partial failure are localized (preferably with the message sender for large scale systems). Of course, that doesn't mean you should be running off adding state to components where it's not needed, only that priority wise, I'd rank stateless interaction as more important than stateless components.

Does that jive with your thinking?

Hubert

Posts: 2 / Nickname: hubert / Registered: October 1, 2002 8:16 AM

Re: Designing Distributed Systems

October 1, 2002 1:03 PM

> Technically idempotency means that results that are
> equally correct no matter how many times the call is
> executed. It can make designs much easier.

One of the methods you describe for making something idempotent is by adding an ID to each call: "this is txn ID 75; debit $100 from a/c 234". If the server receives a call with an already processed ID it rejects it. This works fine but does have some wrinkles. For instance, this imposes on the server the need to store sufficient additional state to be able to decide if it has seen that method's ID before. The cheap approach is a lastID attribute on the server. We have now have shared knowledge - i.e. distributed state and an attendent invariant - between the client and the server that must be maintained. Following a crash then the shared counter may well be out of sync so we need a resync protocol. This adds complexity and a host of extra failure modes. When we add multiple clients to the server then we add even more complexity because we need to ensure that each client uses different IDs, so either we need a central ID generator or we need to identify clients.

Some databases use version counting with writeback failures to achieve optimistic locking and this debit example fits the mould well. So, instead of using a client-generated ID use a server-generated one instead. Transform the single debit operation into a getDebitTicket(accNum) operation that returns a "debit request form" with an ID on it, fill in the amount and submit it to the server. If you get a sequence number failure then get another debit ticket (someone else got there in between) and resubmit; if successful then great; if no reply then resubmit the original until you get a reply or you get bored. Variations on the theme with leasing of tickets and out-of-sequence invocations can be created too.

In the end I would say that idempotence is, like statelessness, desirable. However, the cure can be worse than the illness in some cases.

Ken

Posts: 9 / Nickname: kcrca / Registered: March 8, 2002 4:25 AM

Re: State Hell

October 1, 2002 1:14 PM

I was explicitly thinking about components with stored state. People tend to do this without reckoning the real costs because they are used to single-application situations where this is much more controllable.

Yes, stateless interaction is good too. How much can I cover in one interview?-)

Ken

Len

Posts: 1 / Nickname: thinker / Registered: February 7, 2003 9:34 AM

Re: Designing Distributed Systems

February 7, 2003 2:43 PM

My understanding of it is that you only need to perform transaction in case of the timeout, but not in case of success, which should be 99.9...%

Posts: 1 / Nickname: kiz / Registered: May 11, 2006 5:02 AM

Re: Designing Distributed Systems

May 11, 2006 9:43 AM

Please tell me more about the design of a distributed system and the problems that will be encountered in the design. What are the factors that will improve the performance of the system.

14 posts on 1 page.

« Previous 1 Next »