With well over five million lines of code used on the latest jetliners, versus fewer than a million on older planes, it's increasingly difficult to detect and fix embedded problems before they surprise pilots.
The article contrasts the discipline of materials science, which is responsible for ensuring the safety of materials, with the less well-understood task of creating very complex software in a fault-tolerant manner:
Mechanical components such as jet engines are just as complex in their own way as computers. But aviation engineers now have an exhaustive understanding of the physical properties of metals, plastics and other materials and they know how to test them together as a system. That helps the industry produce parts that can handle the stresses of wind, turbulence and landing. Such parts almost never fail so long as they're properly maintained and operated.
However, engineers can't predict as easily what kind of stresses might cause a computer program to go haywire. "Software is different," says Gérard Ladier, the senior manager of software engineering at Airbus.
Part of the problem stems not from bugs in individual software modules, but from the subtle interaction of many such modules in an airplane:
Specialists say the biggest problems in aviation software don't stem from bugs in the code of a single program but rather from the interaction between two different parts of a plane's computer system. In extreme cases, foul-ups can lead to sudden loss of control, sometimes not showing up until years after aircraft are introduced into service.
One such example, quoted in the article, was a Malaysia Airlines flight whose automation went completely haywire for about 45 seconds, without giving the pilots a chance to override the automated flight control systems. The incident was caused by the way messages from a recently upgraded flight control component were interpreted:
Boeing's 777 jets started service in 1995 and had never experienced a similar emergency before. According to Boeing and Honeywell, the source of the problem was a revised computer program that had recently been installed on all 777s to fix a minor navigation flaw.
As this example illustrates, the problem with a complex system is that its multiple components evolve independently, subtly altering the behavior of the system as a whole. What are your approaches to ensuring reliability in complex systems that consist of many independently evolving parts?
Software reliability will not come of age until the software industry realizes that values are types.
Inconsistency in a program is simply caused when a set of instructions is invoked with the wrong set of values.
Furthermore, inconsistency is 'encouraged' by using the wrong programming model. What most applications need is a change-driven model where code is invoked as a result of a state change. What most applications get is a poor attempt using object orientation or functional programming, with mediocre results.
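A minimal sketch of the "values are types" idea, in Python. The `Altitude` class and its limits are my own invented illustration: the valid range is checked at construction, so a set of instructions can never be invoked with an out-of-range value.

```python
class Altitude:
    """Altitude in feet; the valid range is part of the type itself."""
    MIN_FT, MAX_FT = -1_000, 60_000  # assumed limits, for illustration only

    def __init__(self, feet: float):
        if not (self.MIN_FT <= feet <= self.MAX_FT):
            raise ValueError(
                f"altitude {feet} ft outside [{self.MIN_FT}, {self.MAX_FT}]")
        self.feet = feet

def set_target_altitude(target: Altitude) -> str:
    # By the time we get here, the value is known to be consistent;
    # no instruction can be invoked with a "wrong" altitude.
    return f"climbing to {target.feet} ft"
```

The design choice is that the constraint lives with the value, not with every caller that happens to use it.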
> Software reliability will not come of age until the software industry realizes that values are types.
>
> Inconsistency in a program is simply caused when a set of instructions is invoked with the wrong set of values.
This assumes that the ranges or constraints of types are always easy to define. But that is not so.
Also, faults can stem from incorrect translation of requirements. The form of expression of the software is not involved.
> This assumes that the ranges or constraints of types are always easy to define. But that is not so.
> Also, faults can stem from incorrect translation of requirements. The form of expression of the software is not involved.
Of course, it goes without saying. And the halting problem has not been solved yet. But a better job can be done, methinks.
> Software reliability will not come of age until the software industry realizes that values are types.
>
> Inconsistency in a program is simply caused when a set of instructions is invoked with the wrong set of values.
>
> Furthermore, inconsistency is 'encouraged' by using the wrong programming model. What most applications need is a change-driven model where code is invoked as a result of a state change.
> What most applications get is a poor attempt using object orientation or functional programming, with mediocre results.
I think these kinds of systems need to work more like Bertrand Meyer proposed in "Object-Oriented Software Construction", with Design by Contract.
The contracts should of course also be used for APIs between components in a system. And I realise that it may sometimes be hard to know which values are ok, or to specify time-dependent rules, but it is a very good start that seems to have been forgotten by many people and projects in the IT industry.
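A rough sketch of Design by Contract in Python (Eiffel supports contracts natively; here they are emulated with assertions). The fuel-transfer scenario and function name are my own invented example.

```python
def transfer_fuel(tank_a: float, tank_b: float, amount: float):
    """Move `amount` of fuel from tank A to tank B, contract-style."""
    # Preconditions: the caller's obligations.
    assert amount >= 0, "amount must be non-negative"
    assert tank_a >= amount, "source tank must hold enough fuel"

    new_a, new_b = tank_a - amount, tank_b + amount

    # Postcondition: the supplier's guarantee - total fuel is conserved.
    assert abs((new_a + new_b) - (tank_a + tank_b)) < 1e-9
    return new_a, new_b
```

A contract violation fails loudly at the component boundary, which is exactly where the article says interaction bugs live. (Note that Python strips `assert` under the `-O` flag, so a production system would use explicit checks instead.)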
> The contracts should of course also be used for APIs between components in a system. And I realise that it may sometimes be hard to know which values are ok, or to specify time-dependent rules, but it is a very good start that seems to have been forgotten by many people and projects in the IT industry.
Oh, I don't think it's been forgotten. I've seen plenty of references out there. On the other hand, I think it's largely been superseded by Test-Driven Development.
Anyway, I think this article demonstrates exactly what's wrong with Test-Driven Development as it's practiced today. The TDD focus is always on unit testing, rather than acceptance testing. Lip service is always paid to acceptance testing, but the focus is *always* on the unit tests. (Unit tests are a good thing, of course. Just incomplete.)
We must find ways to test the whole system, not just individual parts.
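A toy illustration (entirely my own, echoing unit-mismatch accidents like the Mars Climate Orbiter) of why unit tests alone miss interaction bugs: each module below passes its own tests, yet the composed system is wrong because the two modules disagree about units.

```python
def nav_altitude_m() -> float:
    # Navigation module reports altitude in metres.
    return 12_000.0

def autopilot_should_descend(altitude_ft: float) -> bool:
    # Autopilot expects feet; descend when below 35,000 ft.
    return altitude_ft < 35_000

def system_decides() -> bool:
    # Whole-system wiring: metres are fed where feet are expected.
    # Each function is individually correct; only a system-level test
    # exercising this path can catch the unit mismatch.
    return autopilot_should_descend(nav_altitude_m())
```

At 12,000 m (about 39,370 ft) the correct decision is *not* to descend, but `system_decides()` says descend, because 12,000 read as feet is below 35,000. Unit tests on each function in isolation would never notice.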
> Anyway, I think this article demonstrates exactly what's wrong with Test-Driven Development as it's practiced today. The TDD focus is always on unit testing, rather than acceptance testing. Lip service is always paid to acceptance testing, but the focus is *always* on the unit tests. (Unit tests are a good thing, of course. Just incomplete.)
>
> We must find ways to test the whole system, not just individual parts.
One point of the article, as I understood it, was that some systems are just inherently hard to test, especially when the system has a long life. An aircraft may be in service for several decades, and during that time its on-board systems are upgraded independently of each other. The example they mentioned was actually a minor bugfix upgrade to one system that caused a problem with another system.
Even if the contracts are well-specified, and I'd imagine they are in an aircraft system, testing the whole system is still hard.
Another point they brought up is that such bugs are very rare - in fact, they pointed out just how safe air travel has become, partly as a result of better avionics in planes. The problems that do crop up from time to time are very hard to reproduce. The problem is that when such bugs do manifest, they often result in spectacular system failures.
> Anyway, I think this article demonstrates exactly what's wrong with Test-Driven Development as it's practiced today. The TDD focus is always on unit testing, rather than acceptance testing. Lip service is always paid to acceptance testing, but the focus is *always* on the unit tests. (Unit tests are a good thing, of course. Just incomplete.)
>
> We must find ways to test the whole system, not just individual parts.
I entirely agree. I think one reason that unit testing is so popular with developers is that independent testers can be a PITA. That's why they're so useful, of course.
My opinion is that agile methods are not ideal for complex hardware/software system development. There's a greater need to understand the requirements up-front than there is for a software-only system.
/* One point of the article, as I understood it, was that some systems are just inherently hard to test, especially when the system has a long life. An aircraft may be in service for several decades, and during that time its on-board systems are upgraded independently of each other. The example they mentioned was actually a minor bugfix upgrade to one system that caused a problem with another system. */
Ah, they just need to use my particular silver bullet to solve their problems.
When comparing software engineering with electronic engineering during a certain class I was taking, we found that individual electronic components are not fail-safe. Capacitors, for instance, have no mechanism to protect them from over-voltage: send too much voltage in and they blow up. The same goes for parameters like temperature, humidity, polarity, current, tolerance, etc. Individual components are rated for certain parameters; use them outside those ratings and you will most likely break them.
The question then is: How do electronic engineers build reliable circuits out of cheap unreliable components? We concluded that those components are well known and documented and have not changed over the years. The experienced electronic engineer knows what parts can be combined safely.
So maybe you can build reliable systems from unreliable parts and also build unreliable systems from reliable parts. Maybe the key is not if a part is reliable or not but to be aware of its limitations.
Unfortunately, in software engineering most parts are custom and in constant flux, and documentation is very tricky. Too little specification and you will miss some situation. Too much specification and it will start sounding like the instructions for the Holy Hand Grenade of Antioch. I think the most practical solution might be pretty-printing summary results of automated tests, from unit tests all the way to system tests. It would be a kind of "white list" of parameters. That way I can say: as long as you run these parts with these parameters, these are the known results; if not, you are on your own.
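The "white list" idea could be sketched like this: record the parameter values each automated test actually exercised, then summarise the verified envelope per parameter. All class and parameter names here are invented for illustration.

```python
from collections import defaultdict

class TestedRanges:
    """Collects parameter values seen during test runs and reports
    the envelope in which the component's behaviour is verified."""

    def __init__(self):
        self._seen = defaultdict(list)

    def record(self, param: str, value: float):
        self._seen[param].append(value)

    def summary(self) -> dict:
        # For each parameter: (min tested, max tested, sample count).
        return {p: (min(v), max(v), len(v)) for p, v in self._seen.items()}

# Usage: inside a test suite, log every parameter value a test covers.
ranges = TestedRanges()
for voltage in (3.0, 3.3, 3.6):
    ranges.record("supply_voltage", voltage)
```

The summary then doubles as the component's "rating sheet": inside these ranges the results are known; outside them, you are on your own.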