I tend to prefer something called idempotency. You can't solve all problems with it, but basically idempotency says that reinvoking the method will be harmless. It will give you the same result as having invoked it once.
If I want to manipulate a bank account, I send in an operation ID: "This is operation number 75. Please deduct $100 from this bank account." If I get a failure, I just keep sending it until it gets through. If you're the recipient, you say, "Oh, 75. I've seen that one, so I'll ignore it." It is a simple way to deal with partial failure. Basically, recovery is simple retry. Or, potentially, you give up by sending a cancel operation for the ID until that gets through. If you want to do that, though, you're more likely to use transactions so you can abort them.
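The receiving side of that exchange can be sketched roughly as follows. This is only an illustration of the idea, not code from the interview: the `Account` class, its `deduct` method, and the in-memory `seen` table are hypothetical names, and a real service would have to store the seen IDs durably.

```python
class Account:
    """Hypothetical receiver that remembers which operation IDs it has applied."""

    def __init__(self, balance):
        self.balance = balance
        self.seen = {}  # operation ID -> balance recorded when it was first applied

    def deduct(self, op_id, amount):
        # A repeated ID is acknowledged with the recorded result, not applied again.
        if op_id in self.seen:
            return self.seen[op_id]
        self.balance -= amount
        self.seen[op_id] = self.balance
        return self.balance


account = Account(500)
first = account.deduct(75, 100)   # operation 75 arrives
again = account.deduct(75, 100)   # operation 75 arrives a second time
```

Both calls return the same balance, and the account is debited only once, which is what makes blind resending safe for the caller.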
Generally, with idempotency, everybody needs to know how to go forward. But people don't often need to know how to go back. I don't abort a transaction. I just repeatedly try again until I succeed. That means I need to know how to say, "Do this." I don't have to deal with all sorts of ugly recovery most of the time.
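The sender's half of "recovery is simple retry" might look something like the sketch below. Everything named here is an assumption made for illustration: `FlakyLink` simulates a network that delivers the request but loses the reply, `make_receiver` is a stand-in for an idempotent server, and the `max_attempts` bound is an added safety limit that the interview does not mention.

```python
def make_receiver(balance):
    """Hypothetical idempotent receiver: applies each operation ID at most once."""
    state = {"balance": balance, "seen": {}}

    def deduct(op_id, amount):
        if op_id not in state["seen"]:
            state["balance"] -= amount
            state["seen"][op_id] = state["balance"]
        return state["seen"][op_id]

    return deduct


class FlakyLink:
    """Hypothetical transport that loses the reply for the first few attempts."""

    def __init__(self, receiver, failures):
        self.receiver = receiver
        self.failures = failures

    def send(self, op_id, amount):
        result = self.receiver(op_id, amount)  # the request itself did arrive
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("reply lost")
        return result


def send_until_acknowledged(link, op_id, amount, max_attempts=10):
    # Going forward is the only recovery path: resend the same ID until a reply comes back.
    for attempt in range(1, max_attempts + 1):
        try:
            return link.send(op_id, amount), attempt
        except ConnectionError:
            continue
    raise RuntimeError("no acknowledgement after %d attempts" % max_attempts)


link = FlakyLink(make_receiver(500), failures=2)
balance, attempts = send_until_acknowledged(link, 75, 100)
```

In this run the sender cannot tell whether the first two attempts were applied, and it does not need to. The third attempt returns a balance of 400, and the deduction has happened exactly once.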
Now, what happens if failure increases on the network? You start sending messages more often. If that is a problem, for a long distance you can solve it by writing a check and buying more hardware. Hardware is much cheaper than programmers. Other ways of dealing with this tend to increase the system's complexity, requiring more programmers.
Bill Venners: Do you mean transactions?
Ken Arnold: Transactions on everything can increase complexity. I'm just talking about transactions and idempotency now, but other recovery mechanisms exist.
If I just have to try everything twice, if I can simply reject the second request if something has already been done, I can just buy another computer and a better network—up to some limit, obviously. At some point, that's no longer true. But a bigger computer is more reliable and cheaper than another programmer. I tend to like simple solutions and scaling problems that can be solved with checkbooks, even though I am a programmer myself.
Bill Venners: Is there anything in particular about Internet-wide distributed systems or large wide area networks that is different from smaller ones? Dealing with increased latency, for example?
Ken Arnold: Yes, latency has a lot to do with it. When you design anything, local or remote, efficiency is one of the things you think about. Latency is an important issue. Do you make many little calls or one big call? One of the great things about Jini is that, if you can use objects, you can present an API whose natural model underneath deals with latency by batching up requests where it can. It adapts to the latency that it is in. So you can get away from some of it, but latency is a big issue.
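One way to picture an object that hides batching behind a simple API is the sketch below. It does not use Jini or any real library: `RemoteService`, `BatchingProxy`, the `double` operation, and the fixed `batch_size` are all hypothetical, and a proxy that truly adapts to latency would tune the batch size from measured round-trip times rather than take it as a constant.

```python
class RemoteService:
    """Stand-in for a remote server; counts how many round trips it receives."""

    def __init__(self):
        self.round_trips = 0

    def handle_batch(self, requests):
        self.round_trips += 1
        return [x * 2 for x in requests]


class BatchingProxy:
    """Client-side object: callers make many little calls, the wire sees a few big ones."""

    def __init__(self, service, batch_size):
        self.service = service
        self.batch_size = batch_size
        self.pending = []
        self.results = []

    def double(self, x):
        # Queue the request locally; only cross the network when a batch is full.
        self.pending.append(x)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Send whatever is queued in a single round trip.
        if self.pending:
            self.results.extend(self.service.handle_batch(self.pending))
            self.pending = []


service = RemoteService()
proxy = BatchingProxy(service, batch_size=4)
for i in range(10):
    proxy.double(i)
proxy.flush()
```

Ten little calls from the caller turn into three round trips, and the caller's code never mentions batches at all.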
Another issue is of course security. Inside a corporate firewall you say, "We'll do something straightforward, and if somebody is mucking around with it, we'll take them to court." But that is clearly not possible on the Internet; it is a more hostile environment. So you either have to make things not care, which is fine when you don't care if somebody corrupts your data, or you had better make it so they can't corrupt your data. So aside from latency, security is the other piece to think about in widely distributed systems.