Frank Thoughts
Is Complete Test Coverage Desirable - or Even Attainable?
by Frank Sommers
February 16, 2005
Summary
Testing code is different from testing a system. Code in real-world, production systems has to contend with an ever-changing, often unpredictable environment that renders unit tests an unreliable predictor of system behavior. In the real world, system robustness matters, but writing more tests can produce diminishing returns.

Do unit tests instill in us a false sense of certainty?

That question crossed my mind the other night. Bill Venners (Artima's chief editor) and I were helping a group of developers at a Silicon Valley Patterns Group meeting. Our goal that evening was to build a Jini and JavaSpaces compute grid. Before everyone could go to work on grid code, we needed to get Jini lookup services up and running on an impromptu wireless network assembled in the back room of Hobee's restaurant in Cupertino.

The Jini Starter Kit, which Sun will soon open-source, is as high-quality and thoroughly tested a piece of code as code gets. Indeed, the JSK is used to run high-volume, business-critical transactions at a handful of large corporations. Starting up Jini lookup services with the JSK is typically a snap.

But it wasn't so that night. We struggled for an hour with this normally simple step, adjusting several aspects of the users' local environments: moving files around, deleting directories, checking for multicast capability on network interfaces, and so on. The exercise was frustrating to those who, just a few hours before the meeting, had been able to run Jini lookup services on the very laptops they brought with them. The rigorous testing and QA processes followed by the Jini developers predicted nothing about how well the system would work on our impromptu network that night.

A few days later, Bill and I were sitting just a few yards away from Hobee's, trying to start up a new version of the Artima code. Before checking in that code, I had made sure that all of the more than one hundred unit tests for that module passed. Yet when Bill checked the code out and started up the Web app on his laptop, a subtle configuration issue prevented the app from working as intended. While the code itself was tested, the system relied on configuration options that were defined partially outside the code. The unit tests, again, were no indication of whether the code would run at all in a real-world environment.

Were our tests, or the Jini JSK's tests, flawed? How could we account for environmental exigencies in those tests? How deep should our test coverage go? Should we strive to cover all the permutations of code and its environment in our test suites? Is such complete test coverage of code even attainable?

System Matters

These experiences made me appreciate the distinction between testing code and testing a system. The real world only cares about the system - the actual interaction of all the code in a given piece of software with its environment. Unit tests, on the other hand, mostly test code: Unit tests are proof that a given method, or set of methods, acts in accord with a given set of assertions.

Unit tests are also code. When running a set of unit tests, the code that's being tested and the test code itself become part of the same environment - they are part and parcel of the same system. But if unit tests are part of the system that's being tested, can unit tests prove anything about the system itself?

No less a logician than Kurt Gödel had something to say about this. To be sure, Gödel's concern was formal mathematical proof, not unit testing. But in addressing the false sense of certainty implied in Bertrand Russell's Principia Mathematica, Gödel demonstrated that it is not possible to prove all aspects of a system from within the system itself. In every logical system, there must be axioms - truths that must be taken for granted, and that can be demonstrated true or false only by stepping outside the system.

Such axioms are present in any software system: We must assume that the CPU works as advertised, that the file system behaves as intended, that the compiler produces correct code. Not only can we not test for those assumptions from within the system, we also cannot recover from situations where the axiomatic aspects of the system turn out to be invalid. If any of a system's axioms turn out to be wrong, the system suffers catastrophic failure - failure from which no recovery is possible from within the system itself. In practical terms, you will just have to reboot.

A cardinal aspect of a test plan, then, is to determine a system's axioms, or aspects (not in the AOP sense) that cannot be proven true or false from within the system. Apart from those system axioms, all other aspects of the system can, and should, be covered in the test plan.

The fewer the axioms, the more testable the system. Fewer axioms also mean fewer possibilities for catastrophic failure. But in any system, there will always be conditions that cause complete system failure - CTRL-ALT-DEL will be with us for good. Fully autonomous, infallible systems truly belong in the realms of science fiction and fantasy.

Degrees of Belief

If we accept that there will always be a few aspects of a system that we cannot write tests for, aspects whose correctness we must take for granted, how do we decide on those "axioms"?

Do you write test methods for simple property accessor methods, or do you just assume that the JVM does what it's supposed to do? Do you write a test to ensure that a database, indeed, inserts a record, or do you take that operation for granted? Do you just assume that a network connection can be opened to a host - is that operation a system "axiom"? Do you just assume that a remote RMI call will return as intended, or do you write tests for all sorts of network failures, along with possible recovery code? Finally, do you just assume that a user types a correct piece of data into an HTML form, or do you write tests and error-handling code for that situation?

Clearly, there is a spectrum, and we often make our decisions about our "system axioms" based on our degree of certainty about correctness. Most of us are highly uncertain that every user always enters the right data in a form, so we always write tests for that situation. But most of us are fairly sure that a database can perform an insert just fine, so writing tests for that operation would seem like a waste of time, unless we're testing the database itself.
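
To make one end of that spectrum concrete, here is a minimal JUnit-style sketch of the kind of input-validation test most of us write without hesitation. The FormValidator class and its isValidEmail method are hypothetical, introduced only for illustration; few of us would write the analogous test asserting that a database INSERT really stores a row.

    import junit.framework.TestCase;

    // A test most developers would write without a second thought:
    // user input is something we are highly uncertain about.
    // FormValidator is a hypothetical class used only for illustration.
    public class FormValidatorTest extends TestCase {

        public void testRejectsEmptyEmail() {
            assertFalse(new FormValidator().isValidEmail(""));
        }

        public void testRejectsMalformedEmail() {
            assertFalse(new FormValidator().isValidEmail("not-an-address"));
        }

        public void testAcceptsWellFormedEmail() {
            assertTrue(new FormValidator().isValidEmail("reader@example.com"));
        }
    }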

If our decisions about what to take for granted in a system are based on such degrees of belief, and if tests start where "axioms" end, then the degree to which testing tells us about a system's behavior in the larger operating context of that system also depends on those beliefs.

The Jini code, for instance, assumed that multicast was enabled on all network hosts. The Artima code took a specific configuration for granted, and assumed that that configuration was the one supplied at system startup. We didn't test for that; we just assumed it would always be so. The tests passed, but the system still failed when that condition was not satisfied in a different operating environment.
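
One modest defense against that kind of assumption is a fail-fast check that verifies the externally supplied configuration before the application does any real work. The sketch below is hypothetical - the property names are invented - but it illustrates the idea of failing loudly at startup when an environmental assumption does not hold, rather than failing subtly later.

    import java.io.FileInputStream;
    import java.util.Properties;

    // Hypothetical fail-fast configuration check. The property names
    // ("webapp.dataDir", "webapp.lookupHost") are invented for illustration.
    public class ConfigCheck {

        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.load(new FileInputStream(args[0]));
            require(props, "webapp.dataDir");
            require(props, "webapp.lookupHost");
            System.out.println("Configuration looks sane.");
        }

        private static void require(Properties props, String key) {
            String value = props.getProperty(key);
            if (value == null || value.trim().length() == 0) {
                // Complain at startup rather than failing subtly at run time.
                throw new IllegalStateException("Missing required property: " + key);
            }
        }
    }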

In addition to beliefs, we also have to contend with market pressures when choosing system "axioms." You may know that a remote method call can fail a hundred different ways on the network, but you also know that shipping a product today, as opposed to tomorrow, can lead to a market share gain. So you decide to not test for all those possible network failures, and to take the "correctness" of the network for granted. You hope you get lucky.

Past Behavior

We could improve our degrees of belief about system correctness if we analyzed past system behavior. One way to do this is to rely on experience. Another way is to follow what a good search engine, such as Google, does: the more we use the system, the better the system gets, because it learns from past data to improve its results.

We could instrument code in such a way as to capture failure conditions (e.g., by logging exceptions). We could then tell that, for example, one out of every N remote calls on a proxy interface results in failure, given the typical operating environment of that code. Note that that's real-world data, not just assumptions. We could then assign the complement of that measure - the probability that the call succeeds - to that method call, correlate that information with how often the call is used, and produce a matrix of the results.
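
As a rough sketch of what such instrumentation might look like in Java, a dynamic proxy can count invocations and failures per method on any interface, including a remote one. The wrapper below is an illustration only, not a production design; dividing failures by calls for each method yields the failure rate described above.

    import java.lang.reflect.InvocationHandler;
    import java.lang.reflect.InvocationTargetException;
    import java.lang.reflect.Method;
    import java.lang.reflect.Proxy;
    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch: a dynamic proxy that records, per method, how many
    // invocations were attempted and how many threw an exception. The counts
    // could be logged periodically and turned into per-method failure rates.
    public class FailureRateProxy implements InvocationHandler {

        private final Object target;
        private final Map calls = new HashMap();    // method name -> Integer
        private final Map failures = new HashMap(); // method name -> Integer

        private FailureRateProxy(Object target) {
            this.target = target;
        }

        public static Object wrap(Object target, Class iface) {
            return Proxy.newProxyInstance(
                iface.getClassLoader(), new Class[] { iface },
                new FailureRateProxy(target));
        }

        public Object invoke(Object proxy, Method method, Object[] args)
                throws Throwable {
            increment(calls, method.getName());
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                // The underlying call failed: record it, then rethrow the cause.
                increment(failures, method.getName());
                throw e.getTargetException();
            }
        }

        private synchronized void increment(Map counters, String name) {
            Integer count = (Integer) counters.get(name);
            counters.put(name,
                new Integer(count == null ? 1 : count.intValue() + 1));
        }
    }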

That probability matrix would be a more accurate indicator of the code's actual reliability - or "quality" - than a set of tests that happen to all run fine on a developer's machine. Such information would help developers pinpoint which "system axioms" are valid, and which assumptions prove incorrect in the real world.

With that information, we would not need to strive for complete code coverage, only coverage that leads to a desired quality level. That would, in turn, make us all more productive.

I think it may even be possible to find emergent patterns in code if such probability matrices were shared on the Web. Coverage and testing tools could tap into that database to make better decisions about where to apply unit tests, and about how indicative existing unit tests are of actual code behavior.

That said, I'm curious how others deal with ensuring proper configuration, and how others account for configuration options in tests. What are some of the ways to minimize configuration so as to reduce the chances of something going wrong? Then again, doesn't reducing configuration also reduce flexibility and "agility"?

In general, do you agree with my conclusion that complete test coverage is not desirable, or even attainable? How do you choose what to test and what not to test? How do you pick your "system axioms"?

About the Blogger

Frank Sommers is a Senior Editor with Artima Developer. Prior to joining Artima, Frank wrote the Jiniology and Web services columns for JavaWorld. Frank also serves as chief editor of the Web zine ClusterComputing.org, the IEEE Technical Committee on Scalable Computing's newsletter. Prior to that, he edited the Newsletter of the IEEE Task Force on Cluster Computing. Frank is also founder and president of Autospaces, a company dedicated to bringing service-oriented computing to the automotive software market.

Prior to Autospaces, Frank was vice president of technology and chief software architect at a Los Angeles system integration firm. In that capacity, he designed and developed that company's two main products: A financial underwriting system, and an insurance claims management expert system. Before assuming that position, he was a research fellow at the Center for Multiethnic and Transnational Studies at the University of Southern California, where he participated in a geographic information systems (GIS) project mapping the ethnic populations of the world and the diverse demography of southern California. Frank's interests include parallel and distributed computing, data management, programming languages, cluster and grid computing, and the theoretic foundations of computation. He is a member of the ACM and IEEE, and the American Musicological Society.

This weblog entry is Copyright © 2005 Frank Sommers. All rights reserved.
