If processors and networks are getting faster, why do distributed computing at all? If we just wait, servers will be fast enough to do it all. I don't think so, and here I give some reasons why...
In response to my last posting, Berco Beute asked if faster processors, faster networks, and larger computer capacity would allow all clients to become essentially terminals, with all processing done on the server. Berco was thinking that this might lessen the need for mobile code (which it would), but the stronger conclusion is that this would mean that we really don't need to do distributed computing at all. All our computing can be done in one place, if we are just patient and wait for the machines to get large enough, the processors fast enough, and the networks good enough to allow that sort of concentration.
If this were possible, it would certainly make programming easier...no more messy partial failures to deal with, for example. We get rid of the 7 (or 8) fallacies of distributed computing by simply getting rid of the distributed computing, or at least limiting it to the channel between the client (which becomes essentially a very smart terminal) and the server.
This is the sort of design center that Plan 9 (the Bell Labs system) had. Users would interact via terminals (that looked a lot like Blits) with servers that were stuck away someplace else. This is also a lot like the Sun strategy with SunRays and servers. It does simplify administration, and make programming easier.
But it isn't going to make the need for distributed computing go away. At best, it is a way of putting the problem off for a short period of time; at worst it is just pushing the problem back a level and giving us all an illusion which will bite us soon. The mathematics behind this hope is simply wrong; looking at the trends reinforces the need for distributed computing.
The trends to look at are those described by Moore's law having to do with processors and the trends in network traffic (not speed). Moore's law, we all know, says that the performance of a processor doubles every 18 months (or that the price is cut in half for the same performance). The trend in network traffic, however, is that it doubles every 12 months (or less). So the increase in network traffic is outpacing the increase in processor performance, at the same time that competent processors are becoming cheaper (and therefore being placed out on the edges of the network cloud). It's just math, folks--the processors can't keep up.
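As a rough sanity check on that arithmetic (a sketch only; the doubling periods are the ones cited above, and real-world trends are of course messier):

```python
# Compound growth from a doubling period: a quantity that doubles every
# `doubling_months` has grown 2^(months / doubling_months) after `years`.
def growth(years, doubling_months):
    return 2 ** (12 * years / doubling_months)

for years in (3, 6, 9):
    cpu = growth(years, 18)      # processor performance (Moore's law, 18 months)
    traffic = growth(years, 12)  # network traffic (doubling every 12 months)
    print(f"{years}y: cpu x{cpu:.0f}, traffic x{traffic:.0f}, gap x{traffic/cpu:.0f}")
```

After six years, traffic has grown 64-fold while a single processor has grown only 16-fold; the gap itself doubles every three years, which is why a single central machine falls further and further behind.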
This means that the need for distributed computing is going to increase, not decrease. And part of this need is that more and more different kinds of computing devices, from servers to cell phones to automobiles to refrigerators will be on the network. Humans won't be part of most loops (which is why I worry more about program-to-program distribution) and mobile code is going to be key (an assertion without proof in this log; that will be the subject later).
There is an interesting paper that supports Jim's view that the increase in data flows outpaces the growth in processor capability to process that data. That report is based on experience gleaned from the Sloan Digital Sky Survey project, and it establishes the need for distributed data mining (as opposed to mining data located at a single location): the authors claim that no single data warehouse will contain more than 12% of the world's astronomy research data. Thus, to come up with interesting discoveries, you need to do distributed computing. Similar situations exist in other areas of scientific computing (e.g., particle physics, genetics, drug discovery).
"Astronomical data is growing at an exponential rate: it is doubling approximately every year. The main reason for this trend is Moore's Law, since our data collection hardware consists of computers and detectors based on the same technology as the CPUs and memory. However, it is interesting to note how the exponential trend emerges. Once a new instrument is built, the size of the detector is frozen and from that point on the data just keeps accumulating at a constant rate. Thus, the exponential growth arises from the continuous construction of new facilities with ever better detectors. New instruments emerge ever more frequently, so the growth of data is faster than just the Moore's Law prediction. Therefore while every instrument produces a steady data stream, there is an ever more complex network of facilities with large output data sets over the world.
"How can we cope with this trend? First of all, since the basic pipelines' processing and storage are linearly proportional to the amount of data, the same technology that gives us the larger detectors, will also give us the computers to process and the disks to save the data. On a per project basis the task will probably get easier and easier, the first year will be the most expensive, later it becomes increasingly trivial. On the community level however, the trend is not so clear, as we show below. More and more projects will engage in data intensive projects, and they will have to do much of the data archiving themselves. The integrated costs of hardware and storage over the community will probably increase as time goes on, but only slightly.
"The exponential growth in the data sources and the exponential growth in the individual data sets put a particular burden on the projects. It only makes sense to spend 6 years to build an instrument, if one is ready to use the instrument for at least the same amount of time. This means, that during the lifetime of a 6-year project, the data growing at a linear rate, the mean time the data spends in the project archive before moving to the centralized facility is about 3 years. Turning this around, the data that makes it into the national facilities will be typically 3 years old. As the amount of data is doubling in the world, every year, in 3 years the data grows by 8-fold, thus no central archive will contain more than about 12% of the world's data at any one time. The vast majority of the data and almost all the current data will be decentralized among the data sources. This is a direct consequence of the patterns of data intensive science. These numbers were of course taken from astronomy, the rates may be different for other areas of science, but the main conclusions remain the same."
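The 12% figure at the end of the quoted passage is simple arithmetic, worth spelling out:

```python
# Data doubles every year, and data reaching a central archive is on average
# ~3 years old (half the 6-year project lifetime), so the world's data has
# grown 2^3 = 8-fold since the archived data was collected.
years_old = 3
growth_since_collection = 2 ** years_old          # 8-fold growth
central_fraction = 1 / growth_since_collection
print(central_fraction)  # 0.125 -- no central archive holds more than ~12%
```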
I would like to add another point of view (to follow Jim Waldo's and Frank Sommers' reasoning). It is not only the amount of data that leads to distributed computing. We are also searching for new ways of reaching solutions.
Think about how Object Oriented programming started to change the way we program. It made it possible to break the monolithic functionality of an application into communicating islands of smaller functionality. Objects also hide the complexity behind them. In other words, one can think about the problem more naturally and hierarchically, and we are able to solve more complex problems.
From a hardware point of view, there are physical limits on classical processors. Processors have to take the path of concurrency (parallelism), and they already do. Still, new processors try to pretend that they are simple (single) processors, because programmers require this kind of behaviour.
The truth is that the pure von Neumann architecture is still the basis of even the most modern processors. But they are developing toward something that is more natural for applications composed of objects. In fact, an object-oriented solution can, in the ideal case, be mapped naturally to a distributed system. So the question is not "Why distributed computing?", but rather "How do we make it simpler for a programmer?". Recall the time when people struggled for simpler programming in autocode, assembler, higher-level languages... Think about how higher data abstractions emerged in many languages. The languages, their features, and the abstractions they use were developed much earlier, but only now do you observe their massive use in a programmer's everyday work -- containers and iterators, the list and dictionary data types (just the fragments that come to my mind). They were not widely known in the past. Similarly, new abstractions that simplify the building of distributed applications will emerge.
I personally believe that distributed programming does not necessarily have to be extremely difficult. We only have to find ways to do it correctly and reliably. The hardware must also mature (and become cheap enough) in that sense. In my opinion, many "distributed people" are desperately waiting for new hardware on which to implement their new ideas.
I think that many programmers who have not had a scientific or engineering background may lack an understanding of what makes a problem parallelizable or distributable.
For example, consider the following problem:
BigServer can process up to 10 requests/second. LittlePC can process 1 request/second. Both types of machine can queue pending requests if a request is in process.
What will be the difference in the characteristics of a system built with 1 BigServer, vs. a distributed system built with 10 LittlePCs, if the total offered load is 5 requests per second (assume exponentially distributed interarrival times)?
Many people are at first surprised by the fact that the wait time for a request submitted to the LittlePC system is 10 times longer than the wait time in the system with 1 BigServer, despite the fact that the overall throughput is the same.
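The surprise factor of 10 falls out of basic queueing theory. A sketch, assuming each system behaves as an M/M/1 queue (Poisson arrivals, exponential service) and that the load is split evenly across the ten LittlePCs:

```python
# Mean time in system for an M/M/1 queue with service rate mu and
# arrival rate lam is W = 1 / (mu - lam).
def w_mm1(mu, lam):
    assert lam < mu, "queue must be stable"
    return 1.0 / (mu - lam)

# One BigServer: serves 10 req/s and sees the whole 5 req/s load.
w_big = w_mm1(10.0, 5.0)      # 0.2 s
# Ten LittlePCs: each serves 1 req/s and sees 0.5 req/s.
w_little = w_mm1(1.0, 0.5)    # 2.0 s
print(w_little / w_big)       # 10.0
```

Throughput is identical (5 req/s in, 5 req/s out), but the faster server drains each individual request ten times sooner.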
My point is that distributed systems can work wonders, but even when they behave as intended, there are certain characteristics that may be unavoidable. Once you are off the drawing board and into implementation, things can get worse. Consider EJB's Remote interface. In practice, network time can't just be written off as negligible.
That said, there are many problems where distributed computing makes sense because the I/O time is minimal compared to the time required to process individual chunks, or because BigServer is too expensive!
> My point is that distributed systems can work wonders, but even when they behave as intended, there are certain characteristics that may be unavoidable. Once you are off the drawing board and into implementation, things can get worse. Consider EJB's Remote interface. In practice, network time can't just be written off as negligible.
>
> That said, there are many problems where distributed computing makes sense because the I/O time is minimal compared to the time required to process individual chunks, or because BigServer is too expensive!
I'd expand that and say that sometimes network I/O is a problem but it's one I'm willing to live with. The BigServer approach has certain limits - it may not be possible for me to get enough network traffic into it to take advantage of all its processing power. It may be that I require a "global presence" which isn't dependent on a single machine in one location (after all, networks break and I'd really like to continue to service my customers). It may be that I'd like to migrate data closer to its owner, who just took a plane from the US to Australia.
Thus, whilst network I/O can certainly be considered an issue, it may well not be *the* issue I am most concerned with (in certain environments, it could well be a fact of life that I just have to live with). I.e., I'll trade a little bit of "performance" to gain some other ability.
Agreed, agreed, as long as we're aware that there are inherent differences between the systems, that have nothing to do with I/O and network performance.
The result that the LittlePC system has longer wait times than the BigServer, for the same throughput, is an "on paper" result that assumes zero network latency. It's a consequence of "Little's law", one of the best-known laws of queueing theory.
Real deployments alter the preconditions upon which Little's Law is postulated. Nevertheless, an understanding of the fundamental reason why Little's Law is valid will help one avoid system bottlenecks.
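For the curious, a short simulation (a sketch, reusing the LittlePC numbers from the earlier example and the standard Lindley recursion for a FIFO single-server queue) confirms the on-paper sojourn time 1/(mu - lam); Little's Law then gives the mean number in the system as L = lam * W:

```python
import random

def simulate_mm1(lam, mu, n=200_000, seed=1):
    """Estimate mean time in system W for an M/M/1 queue (one server, FIFO)."""
    rng = random.Random(seed)
    t_arrive = 0.0   # arrival time of the current request
    t_free = 0.0     # time at which the server next becomes free
    total_sojourn = 0.0
    for _ in range(n):
        t_arrive += rng.expovariate(lam)       # Poisson arrivals
        start = max(t_arrive, t_free)          # wait if the server is busy
        t_free = start + rng.expovariate(mu)   # exponential service time
        total_sojourn += t_free - t_arrive     # queueing delay + service
    return total_sojourn / n

lam, mu = 0.5, 1.0                 # one LittlePC's share of the offered load
W = simulate_mm1(lam, mu)          # should be close to 1/(mu - lam) = 2.0
L = lam * W                        # Little's Law: mean number in system, ~1.0
print(W, L)
```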
Skip the 'why we need mobile code proof' and go directly to the - IMHO - heart of the matter - 'how are we going to do it'.
I think Jini is a failure in one respect: security of mobile code. I also think that it will not be possible to add it - I already hardly understand Java's security model, and 'security' and 'hard to understand' are two things you ideally don't want to encounter in a single sentence.
Personally, but that's more an opinion than scientific evidence ;-), I think the capability people might be on the right track - the E programming language (http://www.erights.org) offers mobile code in a capability-based environment (basically the only thing you can 'steal' at the moment is CPU cycles) but lacks the dynamism of Jini (so, it is just as 'broken' as Jini, only in the exactly opposite respect).
It'd be cool to see what a combination would give us.
I'd be willing to just go for the "we've got to do it"; while it is true that needing to do it doesn't mean that you can, it at least means that we need to try. There are many who think that we don't even need to try.
However, I agree with you that security is needed, and that we don't have an answer that is both good and capable of being understood at the moment. I actually think that Jini is no worse off than other systems in this respect; in fact, I think it is better off with the Davis release, since there is a form of security that some can understand (not enough, but some), and so we at least know that it is possible.
Unfortunately, "security" has come to mean far too many things to far too many people, so saying that we need security is a pretty open-ended statement these days. There are basic problems (like "how do I know that the entity trying to do something has the rights to do that?"), there are problems with mobile code ("how do I know that this code isn't going to do something bad?") and problems with communication ("how do I know that the message I got wasn't garbled, and really comes from who it says it comes from?"). These can be related, or not, depending on how you approach the overall "security" question.
So I'll admit that I don't know how to do security, because I'm not sure that there is one thing that is meant by security. There are lots of interesting approaches, but first I'd like to make sure that we know the set of problems that we are trying to solve.