Sun Labs' Ron Goldman notes in a recent interview that as applications become more complex, writing bug-free software becomes an increasingly hard task. He also observes that many applications take advantage of only a fraction of the available computing resources to meet their requirements. Those spare CPU cycles can be dedicated to tasks that support self-monitoring and self-healing, mimicking how biological systems work:
In computing, perhaps 5% of our code deals with exception handling and error correction, which seems like a lot, while 95% tries to get the basic job done. Biology appears to reverse this, with 5% doing the basic metabolism and 95% functioning to make sure that the 5% can do its job. Think about keeping your heart beating -- is that overhead? Or is that a core activity that's part and parcel of who you are? Think of your body doing the work of keeping your mind and brain functioning. That's not overhead.
Likewise, maintaining your computer system's health to make sure that all of its components are functioning is not overhead. That's just what is required to have a robust system.
Goldman notes that automatic garbage collection is already an example of that principle:
When John McCarthy was designing the LISP language, one of the programs he wrote was an elegant algorithm to do symbolic differentiation. He recognized that the code would be using up memory and if it wasn't released that memory would eventually run out. And he deliberately decided that he didn't want to mess up his elegant algorithm for differentiation with a lot of record keeping and bookkeeping for memory, which had nothing to do with the problem he really cared about. So he did something that we're considering doing in a number of other places. He accepted the idea that all programs have bugs and created a system that can repair and clean up unused memory and, in a sense, that can recycle it and make it available.
In order to build self-monitoring and self-healing systems, he suggests, black-box component encapsulation sometimes falls short, and it can be advantageous to be able to "see inside" software:
When we write code we are well advised to follow the principles of encapsulation and information hiding. Otherwise our modules will become very tightly coupled to each other and hard to change. However, when we run a program it can be advantageous to be able to see into it. An obvious example is testing, where the test code may need to check the internal state of a module. We believe that it is important to have visibility into the system in order to assess its health and to make decisions about adjusting it. Visibility consists of continually updated descriptions, for example, of what's inside a system's software components, how a system is currently configured, the overall state of the system, what it's working on, which users use what software in which ways, and so on.
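Goldman's testing example — test or monitoring code that needs to check a module's internal state — can be sketched with plain reflection. This is a minimal illustration of my own, assuming a hypothetical `Worker` class with a private `pendingTasks` counter; nothing here comes from the interview itself:

```java
import java.lang.reflect.Field;

// A module whose internal state is deliberately not part of its public API.
class Worker {
    private int pendingTasks = 7;   // internal state, hidden by design
    void run() { /* ... do work ... */ }
}

// A health check that pierces encapsulation -- for monitoring only --
// to gain the kind of visibility Goldman describes.
class HealthCheck {
    static int pending(Worker w) {
        try {
            Field f = Worker.class.getDeclaredField("pendingTasks");
            f.setAccessible(true);          // open the module for inspection
            return f.getInt(w);
        } catch (ReflectiveOperationException e) {
            return -1;                      // unreadable: report as unhealthy
        }
    }
}
```

The design tension is exactly the one the quote names: the field stays `private` for every collaborator except the monitoring code, which gets a deliberate side door.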
In order to facilitate such self-healing and self-monitoring, objects should expose a richer interface, and even a richer API:
Instead of a rigid, minimalistic API between two modules, [object] [...] can be pattern-recognized or sampled. We are considering shared blackboards with simple textual pattern-matching, extensions of something like Common Lisp's keyword/optional argument lists and calling conventions, or even passing XML documents. The key idea being that, instead of one agent reaching inside another and commanding it to do some function (e.g., a remote procedure call), it instead deposits a request that the second agent can then interpret and deal with as best it can.
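The "deposit a request" style Goldman describes might be sketched, very roughly, as a shared blackboard of tagged requests that a second agent polls and interprets as best it can. All names here (`Blackboard`, `LoggingAgent`) are hypothetical illustrations, not anything from the Sun project:

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Instead of one agent calling another directly (a rigid RPC-style API),
// it deposits a tagged request on a shared blackboard.
class Blackboard {
    private final Queue<Map<String, String>> requests = new ConcurrentLinkedQueue<>();

    void deposit(Map<String, String> request) { requests.add(request); }

    Map<String, String> poll() { return requests.poll(); }
}

// The receiving agent pattern-matches on the request's tags and handles
// what it understands; unrecognized requests are simply ignored, so the
// two agents stay loosely coupled.
class LoggingAgent {
    String handle(Map<String, String> request) {
        if ("log".equals(request.get("action"))) {
            return "logged: " + request.getOrDefault("message", "");
        }
        return "ignored";
    }
}
```

A `Map` of strings stands in for the richer possibilities the quote mentions (keyword argument lists, XML documents); the point is only the indirection: request deposited, then interpreted.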
Many of Goldman's ideas build on self-healing notions from distributed systems, such as Jini. For instance, Jini can be used to facilitate the dynamic loading of exception handlers to deal with new error types in an application.
Do you see self-monitoring and self-healing taking up increasingly important parts of your application?
> Sun Labs' Ron Goldman notes in a recent interview (http://java.sun.com/developer/technicalArticles/Interviews/goldman_qa.html) that as applications become more complex, writing bug-free software becomes an increasingly hard task.
It is not code complexity or code size that creates bugs, but the poor programming languages being used.
> > as applications become more complex, writing bug-free software becomes an increasingly hard task.
>
> It is not code complexity or code size that creates bugs, but the poor programming languages being used.
Maybe it is using a programming language inadequate for the problem's complexity and size.
> It is not code complexity or code size that creates bugs, but the poor programming languages being used.
Selecting a programming language is a question of choosing between tradeoffs. It is very difficult to get a "safe" programming language that can also guarantee hard real-time performance. And even if you find one, it may not be ported to the processor that you need it to run on.
> In computing, perhaps 5% of our code deals with exception handling and error correction, which seems like a lot, while 95% tries to get the basic job done. Biology appears to reverse this, with 5% doing the basic metabolism and 95% functioning to make sure that the 5% can do its job. Think about keeping your heart beating -- is that overhead? Or is that a core activity that's part and parcel of who you are? Think of your body doing the work of keeping your mind and brain functioning. That's not overhead.
I think that is how it should be. If the work is uniform, with few or no exceptions, let a computer handle it. If exceptions are the rule, let people (biology) handle it; they are much better at that. When there are too many exceptions, computer code just becomes a royal mess.
Reminds me of a chapter in the book Peopleware by DeMarco and Lister. Does somebody have that reference handy?
> It is not code complexity or code size that creates bugs, but the poor programming languages being used.
I see, is this when the programming language did not properly understand the requirements?
There is a standard saying about guns killing people; this seems like the inverse, i.e. a "People don't kill people, guns do" argument.
I refer readers to something said by Roger Waters in a documentary of Pink Floyd's Pompeii concert, where they were discussing whether the digital revolution made musicians irrelevant, i.e. anyone who could push a button could now make music. If I recall correctly, Roger Waters modestly said something to the effect that this was not the case.
"programming languages don't create bugs, people do"
> > It is not code complexity or code size that creates bugs, but the poor programming languages being used.
>
> I see, is this when the programming language did not properly understand the requirements?
>
> There is a standard saying about guns killing people; this seems like the inverse, i.e. a "People don't kill people, guns do" argument.
>
> I refer readers to something said by Roger Waters in a documentary of Pink Floyd's Pompeii concert, where they were discussing whether the digital revolution made musicians irrelevant, i.e. anyone who could push a button could now make music. If I recall correctly, Roger Waters modestly said something to the effect that this was not the case.
>
> "programming languages don't create bugs, people do"
>
> -Mike
Let me rephrase: the current crop of mainstream programming languages is not good enough to allow for writing large, bug-free programs. Of course it is people that make bugs, but programming languages should make an effort to minimize those bugs.
I do not think that conscientious software will ever be realized, because the analogy between biological systems and software systems is wrong. In biological systems, a 'bug' is something that breaks down a correct system, whereas in software systems the system is not correct to begin with. And since it is not possible to derive an algorithm from its result, conscientious software will never be realized.
For example, let's take the case of a null pointer: is there an algorithm that finds why the pointer is null and corrects it? There is not. Another example: a wrong array index. Is there an algorithm that can fix a computation so that it produces a correct index value? There is not. It is impossible to correct bugs like these automatically.
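The commenter's null-pointer argument can be made concrete with a toy example (the `User` and `Lookup` names are my own, invented for illustration):

```java
// When nameOf() throws a NullPointerException, no generic algorithm can
// decide what the "repair" should be: substitute a default User? Skip the
// call? Fail loudly? The intended fix depends on what the programmer meant,
// which is exactly the information the running system does not have.
class User {
    final String name;
    User(String name) { this.name = name; }
}

class Lookup {
    static String nameOf(User u) {
        return u.name.toUpperCase(); // NPE if u is null -- but which caller was wrong?
    }
}
```

The counterpoint, of course, is that Goldman's proposal is about containing and diagnosing such failures, not about computing the programmer's intent.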
Although the idea of software that is "self-monitoring and self-healing" is great, I would settle for something that is far, far more doable by mere mortal programmers, and that is to develop software that "can be monitored and can be healed". This has been a passion of mine for many years, but one I find very few people actually practice.
What do I mean by this? Simply put, I mean the practice of intentionally and carefully designing into the software some ability to diagnose problems (and there will always be problems), so that you can then take action to correct either the problem itself or the root cause that triggered it (i.e. a "workaround" to avoid the problem).
And what I suggest need not be complex at all. In fact, I think this goal can be realized in many cases just by careful use of something like log4j. From the very beginning of the code's lifecycle (i.e. when the lines of code are initially typed in), there should be a continual process of asking the question: "if this code does not work right for some reason, what will I need to see in the logfile in order to diagnose the problem?" Then, in answer to that question, put reasonable and informative log.debug statements into the code so that at runtime, if and only if you need to, you can turn on debug logging and get additional, useful information.
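The comment names log4j; a minimal sketch of the same habit, using the JDK's built-in `java.util.logging` so it runs without dependencies, might look like this. The `AccountService` class and `transfer` method are hypothetical examples of mine:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// "Designed-in diagnosability": at the moment the code is written, ask what
// the logfile must contain to diagnose a failure here, and log exactly that.
class AccountService {
    private static final Logger log = Logger.getLogger(AccountService.class.getName());

    boolean transfer(String from, String to, long cents, long balance) {
        if (log.isLoggable(Level.FINE)) {
            // Guarded so the message is only built when debug output is on.
            log.fine("transfer requested: " + from + " -> " + to
                     + ", amount=" + cents + ", balance=" + balance);
        }
        if (cents > balance) {
            // Always record the reason a request was rejected.
            log.warning("transfer rejected: amount " + cents
                        + " exceeds balance " + balance);
            return false;
        }
        return true;
    }
}
```

In log4j the shape is the same: guard with `isDebugEnabled()`, emit with `log.debug(...)`, and flip the logger's level at runtime only when you need the extra detail.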
In my 30+ years of writing software, I have found over and over again that customers of software systems don't mind bugs nearly as much as they mind bugs that take a long time to fix. Designing "diagnosability" into software goes a long way towards reducing the time to analyze and resolve a problem.
My day job is developing a product for monitoring applications. As such, we depend on the various APIs exposed by the applications, or on the logs they create at runtime. Our product can also heal the monitored application if it exposes an interface for that; otherwise a "remote application restart" is the only option. Hardly any of the applications we monitor have a self-healing capability (if they did, we would be out of business), and most have very poor monitoring hooks. Building in monitoring, let alone self-healing capability, is always an afterthought. Typically, when an application includes such capability, it means the product's features have matured enough that stability and diagnostics have become a bigger issue. Oracle 10g finally has some neat diagnostic and (in most cases) self-healing capabilities, or at least provides recommendations to solve issues; it took a long time to get to this stage.

Even though we face these issues every day, we have a hard time justifying monitoring and self-healing features over functional features in our own product. We have started down the path, but still have a long way to go before we get to self-healing. I plan to at least start including monitoring, and maybe self-healing, hooks as a standard practice in our code. Initially I am planning to do it statically (at compile time), but to use AOP (instead of having the object expose the API) to implement the auditing, monitoring, and self-healing features, so that the domain logic remains simple and maintainable.
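The "weave monitoring in with AOP so the domain logic stays plain" idea can be sketched with a JDK dynamic proxy standing in for a full AOP framework like AspectJ. Everything here (`OrderService`, `PlainOrderService`, `Monitor`) is a hypothetical illustration of mine, not the commenter's product:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

// Domain interface and implementation: no monitoring code at all.
interface OrderService {
    int placeOrder(String item);
}

class PlainOrderService implements OrderService {
    public int placeOrder(String item) { return item.length(); } // toy logic
}

// The monitoring "aspect": every call through the proxy is audited
// before being delegated to the real object.
class Monitor implements InvocationHandler {
    final List<String> calls = new ArrayList<>();  // the monitoring record
    private final Object target;

    Monitor(Object target) { this.target = target; }

    public Object invoke(Object proxy, Method m, Object[] args) throws Throwable {
        calls.add(m.getName());                    // audit hook
        return m.invoke(target, args);             // then run the domain logic
    }

    static OrderService monitored(Monitor handler) {
        return (OrderService) Proxy.newProxyInstance(
            OrderService.class.getClassLoader(),
            new Class<?>[] { OrderService.class },
            handler);
    }
}
```

The design point matches the comment: `PlainOrderService` never mentions auditing, so the monitoring (and eventually self-healing) concern can evolve independently of the domain logic.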