Summary
Testing code is different from testing a system. Code in real-world, production systems has to contend with an ever-changing, often unpredictable environment that renders unit tests an unreliable predictor of system behavior. In the real world, system robustness matters, but writing more tests can produce diminishing returns.
Do unit tests instill in us a false sense of certainty?
That's how I felt the other night. Bill Venners (Artima's chief editor) and
I were helping a group of developers at a Silicon Valley Patterns Group meeting. Our
goal that evening was to build a Jini and JavaSpaces compute grid. Before
everyone could go to work on grid code, we needed to get Jini lookup services up and
running on an impromptu wireless network assembled in the back room of Hobee's restaurant in Cupertino.
The Jini Starter Kit, which Sun will soon open-source, is as high-quality and thoroughly tested a piece of code as code gets. Indeed, the JSK is used to run high-volume, business-critical transactions at a handful of large corporations. Starting up Jini lookup services with the JSK is typically a snap.
But it wasn't so that night. We struggled for an hour with this normally simple step, adjusting several aspects of each user's local environment: moving files around, deleting directories, checking for multicast capability on network interfaces, and so on.
The exercise was frustrating to those who, just a few short hours earlier, had been able to run Jini lookup services on the very same laptops they brought with them to the meeting. The rigorous testing and QA processes followed by the Jini
developers predicted nothing about how well the system would work on our impromptu
network that night.
A few days later, Bill and I were sitting just a few yards away from Hobee's, trying
to start up a new version of Artima code. Before checking in that code, I made sure that all of the more than one hundred unit tests for that module passed. Yet, when Bill checked that code out and started up the Web app on his laptop, a subtle configuration issue prevented the app from working as intended. While the code itself was tested, the system relied on configuration options that were defined partially outside the code. The unit tests, again, were no indication of whether the code would run at all in a real-world environment.
Were our tests, or the Jini JSK's tests, flawed? How could we account for
environmental exigencies in those tests? How deep should our test coverage go? Should we strive to cover all the permutations of code and its environment
in our test suites? Is such complete test coverage of code even attainable?
System Matters
These experiences made me appreciate the distinction between testing code and testing
a system. The real world only cares about the system - the actual
interaction of all the code in a given piece of software with its environment. Unit
tests, on the other hand, mostly test code: Unit tests are proof that a given method, or set of methods, acts in accord with a given set of assertions.
Unit tests are also code. When running a set of unit tests, the code that's being
tested and the test code itself become part of the same environment - they are part
and parcel of the same system. But if unit tests are part of the system that's being
tested, can unit tests prove anything about the system itself?
No less a logician than Kurt Gödel had something to say about this. To be sure, Gödel's concern was mathematical proof, not unit testing. But in addressing the false sense of certainty implied in Bertrand Russell's Principia Mathematica, Gödel demonstrated that it is not possible to prove all aspects of a system from within the system itself. In every logical system, there must be axioms - truths that must be taken for granted, and that can be demonstrated true or false only by stepping outside the system.
Such axioms are present in any software system: We must assume that the CPU works as
advertised, that the file system behaves as intended, that the compiler produces
correct code. Not only can we not test for those assumptions from within the system,
we also cannot recover from situations where the axiomatic aspects of the system turn
out to be invalid. If any of a system's axioms turn out to be wrong, the system
suffers catastrophic failure - failure from which no recovery is possible
from within the system itself. In practical terms, you will just have to reboot.
A cardinal aspect of a test plan, then, is to determine a system's axioms: those aspects (not in the AOP sense) that cannot be proven true or false from within the system.
Apart from those system axioms, all other aspects of the system can, and should, be
covered in the test plan.
The fewer the axioms, the more testable the system. Fewer axioms also result in fewer possibilities for catastrophic failure. But in any system, there will always be
conditions that cause complete system failure - CTRL-ALT-DEL will be with us for
good. Fully autonomous, infallible systems truly belong in the realms of science
fiction and fantasy.
Degrees of Belief
If we accept that there will always be a few aspects of a system that we cannot write
tests for, aspects whose correctness we must take for granted, how do we decide on
those "axioms"?
Do you write test methods for simple property accessor methods, or do you just assume
that the JVM does what it's supposed to do? Do you write a test to ensure that a
database, indeed, inserts a record, or do you decide to take that operation for
granted? Do you just assume that a network connection can be opened to a host - is
that operation a system "axiom"? And do you just assume that a remote RMI call will
return as intended, or do you write tests for all sorts of network failures, along with possible recovery code? Finally, do you just assume that a user types a correct piece of data into an HTML form, or do you write tests and error-handling code for that situation?
Clearly, there is a spectrum, and we often make our decisions about our "system axioms" based on our degree of belief in their correctness. Most of us are highly uncertain that every user always enters the right answer in a form, so we always write tests for that case. But most of us are fairly sure that a database can
perform an insert just fine, so writing tests for that operation would seem like a
waste of time, unless we're testing the database itself.
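To make that spectrum concrete, here is a minimal JUnit sketch. The SignupForm class is hypothetical, invented purely for illustration: the validation behavior, which we have little faith in users respecting, gets an explicit test, while the bare accessor test next to it is the kind many teams skip and treat as a "system axiom."

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertFalse;

import org.junit.Test;

// SignupForm is a hypothetical class used only to illustrate the spectrum.
public class SignupFormTest {

    // Low degree of belief that users always enter valid data,
    // so the validation logic gets an explicit test.
    @Test
    public void rejectsMalformedEmailAddress() {
        SignupForm form = new SignupForm();
        form.setEmail("not-an-email");
        assertFalse(form.isValid());
    }

    // A bare accessor exercises little more than the JVM's field access;
    // many teams take that as an "axiom" and never write a test like this one.
    @Test
    public void emailAccessorRoundTrips() {
        SignupForm form = new SignupForm();
        form.setEmail("reader@example.com");
        assertEquals("reader@example.com", form.getEmail());
    }
}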
If our decisions about what to take for granted in a system are based on such degrees of belief, and if tests start where "axioms" end, then the degree to which testing tells us about a system's behavior in the larger operating context of that system is also dependent on those beliefs.
The Jini code, for instance, assumed that multicast is enabled on all network hosts.
The Artima code took a specific configuration for granted, and assumed that that configuration would be the one supplied at system startup. We didn't test for that; we just assumed it would always be so. The tests passed, but the system still failed when that condition was not satisfied in a different operating environment.
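As a sketch of what checking for such an assumption could look like - this is not code from the Jini Starter Kit or from Artima, just an illustration - a startup check could ask the JDK's NetworkInterface class whether any usable interface supports multicast before attempting lookup-service discovery:

import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.Collections;

// Turns the "multicast is available" axiom into an explicit runtime check.
public class MulticastCheck {

    public static boolean multicastAvailable() throws SocketException {
        for (NetworkInterface nic : Collections.list(NetworkInterface.getNetworkInterfaces())) {
            // Look for at least one interface that is up, non-loopback,
            // and reports multicast support.
            if (nic.isUp() && !nic.isLoopback() && nic.supportsMulticast()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws SocketException {
        if (!multicastAvailable()) {
            System.err.println("No multicast-capable network interface found; "
                    + "multicast-based lookup discovery is unlikely to work here.");
        }
    }
}

A check like this doesn't make the assumption testable from within the system in Gödel's sense, but it does surface a violated assumption at startup instead of during an hour of debugging in the back room of a restaurant.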
In addition to beliefs, we also have to contend with market pressures when choosing
system "axioms." You may know that a remote method call can fail a hundred different
ways on the network, but you also know that shipping a product today, as opposed to
tomorrow, can lead to a market share gain. So you decide not to test for all those
possible network failures, and to take the "correctness" of the network for granted.
You hope you get lucky.
Past Behavior
We could improve our degrees of belief about system correctness if we analyzed past system behavior. One way to do this is to rely on experience. But another way is to follow what a search engine such as Google does: the more we use the system, the better it gets, because it learns from past data to improve its results.
We could instrument code in such a way as to capture failure conditions (e.g., by
logging exceptions). We could then tell that, for example, one out of every N remote
calls on a proxy interface results in failure, given the typical operating environment
of that code. Note that that's real-world data, not just assumptions. Then we could assign the complement of that measure - the probability that the call succeeds - to that method call. We could then correlate that information with how often a call is used, and produce a matrix of the results.
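Here is one way such instrumentation might look in Java - a hypothetical sketch, not an existing library: a dynamic proxy wraps a remote interface, counts every call and every failure per method, and can report the observed success rate that would feed such a matrix.

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical instrumentation sketch: records, per method, how many calls
// were attempted and how many failed, so the observed success rate of each
// call can be reported later.
public class FailureRecordingHandler implements InvocationHandler {

    private final Object target;
    private final Map<String, AtomicLong> attempts = new ConcurrentHashMap<String, AtomicLong>();
    private final Map<String, AtomicLong> failures = new ConcurrentHashMap<String, AtomicLong>();

    public FailureRecordingHandler(Object target) {
        this.target = target;
    }

    // Wraps the target in a dynamic proxy that implements the given interface.
    @SuppressWarnings("unchecked")
    public <T> T proxyFor(Class<T> iface) {
        return (T) Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] { iface }, this);
    }

    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        increment(attempts, method.getName());
        try {
            return method.invoke(target, args);
        } catch (InvocationTargetException e) {
            increment(failures, method.getName());
            throw e.getCause();   // record the failure, then let it propagate
        }
    }

    // Observed probability that a call to the named method succeeds.
    public double successRate(String methodName) {
        long attempted = count(attempts, methodName);
        long failed = count(failures, methodName);
        return attempted == 0 ? 1.0 : 1.0 - ((double) failed / attempted);
    }

    private static void increment(Map<String, AtomicLong> map, String key) {
        map.putIfAbsent(key, new AtomicLong());
        map.get(key).incrementAndGet();
    }

    private static long count(Map<String, AtomicLong> map, String key) {
        AtomicLong value = map.get(key);
        return value == null ? 0 : value.get();
    }
}

Wrapping a remote service might then look like catalog = new FailureRecordingHandler(realCatalog).proxyFor(Catalog.class), where Catalog stands in for whatever remote interface the code already uses.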
That probability matrix would be a more accurate indicator of the code's actual
reliability - or "quality" - than just having a set of tests that happen to all run
fine on a developer's machine. Such information would help developers pinpoint which "system axioms" are valid, and which assumptions prove incorrect in the real world.
With that information, we would not need to strive for complete code coverage, only
coverage that leads to a desired quality level. That would, in turn, make us all more
productive.
I think it may even be possible to find emergent patterns in code with a probability
matrix of that code shared on the Web. Coverage and testing tools could tap into that
database to make better decisions about where to apply unit tests, and about how
indicative existing unit tests are of actual code behavior.
That said, I'm curious how others deal with ensuring proper configuration, and how others account for configuration options in tests. What are some of the ways to minimize configuration so as to reduce the chances of something going wrong? Then again, isn't reducing configuration also reducing flexibility and "agility"?
In general, do you agree with my conclusion that complete test coverage is not desirable, or even attainable? How do you choose what to test and what not to test? How do you pick your "system axioms"?