Registered: Nov, 2007
Re: Software Metrics Don't Kill Projects, Moronic Managers Kill Projects
Posted: Dec 19, 2007 1:30 PM
> As to the ACM search, I am curious how the synonym
> searches for construct validity worked out. It could very
> well be that people are describing the same concept with
> different terms; it happens all the time. How did the
> other 48,000 papers check out? I am certain you did good
> research, and that you just abbreviated this description
> to make your point. It would be interesting to hear more
> about how you measured the presence or absence of
> 'construct validity' in the actual approaches taken in all
> these papers.
I searched in a pretty wide variety of ways over several years because I couldn't believe that the concept was so weakly addressed. It doesn't matter what those strategies were because you can always argue that they are insufficient to prove the negative (some other search for some other synonym that I haven't tried could always yield undiscovered gold...) The reason that I report numbers against "construct validity" is that this phrase is widely used across several disciplines. The lack of reference to it is an indicator, in itself, of the disconnect between software engineering measurement researchers and the broader measurement theory community.
The primary way that I have seen construct validity addressed in texts on software measurement (I have taught from several and reviewed several more--perhaps all of the books marketed as suitable as metrics course texts) is indirectly, through the representational theory of measurement. If a metric satisfies all of the requirements of the representational theory (and I haven't seen a serious claim that any of them do, just several critiques of metrics that don't), then it will almost certainly have construct validity. However, the head-on confrontation with the question," what is the underlying attribute we are trying to measure and how does this relate to that?", is almost always buried. I have been repeatedly disappointed by the brevity and shallowness of this discussion in books on software-related metrics that I have taught from or considered teaching from.
Apart from my own searches, I have also challenged practitioner and academic colleagues to help me find better references. Some of my colleagues have worked pretty hard on this (some of them also teach metrics courses). So far, all we've found are representational theory discussions.
Maybe you've found better somewhere. If so, maybe you could share those references with us.
Bob Austin writes in his book about his interviews with some famous software metrics advocates and how disappointed he was with their naivete vis-a-vis measurement theory and measurement risk.
> Another interesting thing you said: on coverage. You ask
> what it means. If you wanted a clearer answer of what it
> measures, I recommend an interesting survey paper by Hong
> Zhu, Software Test Adequacy Criteria (it's in the ACM dl)
> that examined most of the, up-to-that-point work on
> testing adequacy criteria. It seems quite appropriate
> given that coverage is one adequacy criteria that could be
> measured. There are many criteria like def-use paths,
> state coverage, and so on and on and on.
Yes, yes, I've read a lot of that stuff.
Let me define coverage in a simple way. Consider some testable characteristic of a program, and count the number of tests you could run against the program, with respect to that characteristic. Now count how many you have run. That percentage is your coverage measure. You want to count def-use pairs? Go ahead. Statements? Branches? Subpaths of lengths N? If you can count it, you can report coverage against it.
Understand that coverage is not only countable against internal, structural criteria. When I was development manager for a desktop publishing program, our most important coverage measure was percentage of printers tested, from a target pool of mass-market printers. At that time, we had a lot of custom code tied to different printers. Each new working printer reflected a block of capability finally working. It also reflected a barrier to product release being removed, because we weren't going to ship until we worked with our selected test pool. For us, at that time, on that project, knowing that we were at 50% printer coverage was both a meaningful piece of data and a useful focuser of work.
We can measure coverage against assertions of the specification, coverage of individual input/output variables (count the number of variables and for each one, test minima, maxima, out-of-bounds, and special cases), combinatorial coverage (all-pairs, all-triples, all-quadruples, whatever your coverage criterion is for deciding which variables to test to what degree of interaction with other variables).
At a meeting of the Software Test Managers Roundtable, we identified hundreds of potential coverage measures. I listed 101 (just to provide a long sample of the huge space possible) coverage measures in my paper Software Negligence & Testing Coverage, http://www.kaner.com/pdfs/negligence_and_testing_coverage.pdf
So, yes, there is a lot of ambiguity about what "coverage" means.
It is seductive to identify specific attributes as THE attributes of interest, but if you focus your testing on attribute X, you will tend to find certain types of errors and miss other types of errors. Complete coverage against X is not complete coverage. It is just complete coverage against X.
For example, suppose we achieve 100% statement coverage. That means we executed each statement once. In an interpreted language, this is useful because syntax errors are detected in real time (at execution time) and not during compilation. So 100% statement coverage assures that there are no syntax errors (unnecessary assurance in a compiled language, because the compiler does it already). However, it offers no assurance that the program will process special cases correctly, that it will even detect critical special cases (if there are no statements to cover divide-by-zero, you can test every statement and never learn that the program will crash when certain variables take a zero value.) You never learn that the program has no protection against buffer overflows, that it is subject to serious race conditions, that it crashes if connected to an unexpected output device, that it has memory leaks, that it corrupts its stack, that it adds input variables together in ways that don't guard against overflow, and on and on and on.
When you focus programmers / testers on a specific coverage measurement, they optimize their testing for that. As a result, they achieve high coverage on their number but low coverage against the other attributes. Brian Marick has written and talked plenty about the ways in which he saw coverage-focused testing cause organizations to achieve better metrics and worse testing. This is the kind of side effect Bob Austin wrote about, and the kind that almost none of the metrics papers in the ACM/IEEE journals even mention the possibility of.
People often write about their favorite coverage metric as "coverage" rather than "coverage against attribute X" -- but if by "coverage", we want to mean how much of the testing that we could have done that we actually did, then we face the problem that the number of tests for any nontrivial program is essentially infinite, even if you include only distinct tests (two tests are distinct if the program could pass one but fail the other). If we measure coverage against the pool of possible tests rather than against attribute X, our coverage is vanishingly small (any finite number divided by infinity is zero).
> It seems that the whole point with metrics is to put them
> into context, understand the narrow story they tell about
> the system being measured and then make intelligent
> decisions. To throw complexity or coverage out completely
> seems to insist that since we have no perfect answers we
> should give up and go home.
> Your statement about complexity was a further curiosity. I
> think you made a slight equivocation. When someone tells
> you about the complexity as measured by the decision
> points, I hope it is understood by both of you that you
> are using jargon. "Complexity" in this instance only
> references McCabe's work. And hopefully, you both realize
> that within that context it is a measure (or a metric) for
> an aspect of the system that seems to be somewhat
> correlated with defect density (check McCabe's 96 NIST
> report where he points to a couple of projects that saw a
> correlation.) Based on that context, a complexity score is
> possibly a useful thing to know and to use for improving
> the software.
McCabe's metric essentially counts the number of branches in a method. Big deal.
Structural complexity metrics, which are often marketed as "cognitive complexity" metrics, completely ignore the semantics of the code. Semantic complexity is harder to count, so we ignore it.
Yes, structural complexity is one component of the maintainability problem. But so is comprehensibility of variable names, adequacy and appropriateness of comments, coherence of the focus of the method, and the underlying difficulty of the aspect of the world that is being modeled in this piece of code.
Defining a metric focuses us toward optimizing those aspects of our work that are being measured. And taking work / focus away from those aspects that are not being measured. Choosing to use a structural "complexity" metric is a choice about what kinds of things actually make code hard to read, hard to get right, hard to fix, and hard to document.
I've seen some of the correlational studies on structural metrics. Take some really awful code and some really simple code. Those are your anchors. The simple, reliable code has good structural statistics, the awful code is terrible by any measure, and the correlation will show up as positive because of the end points even if the intermediate values are almost random.
If you want to figure out what aspects of programs create complexity, one of the obvious ways is to put code in front of people and assess their reactions. How complex do they think it is? (People can report their level of subjective complexity. Their reports are not perfect, and there are significant practice effects before irrelevant biasing variable get weeded out, but we ask questions like this all the time in psychophysical research and get useful data that drives advances in stereo systems, perfumes, artificial tastes in foods, lighting systems, alarms, etc.) You can also measure how long it takes them to read the code, where duration is measured as the time until they say that they feel like they understand the code. Or you can suggest a specific code change and see how long it takes them to successsfully change the code in that way. We have plenty of simple dependent variables that can be used in a laboratory setting. The research program would crank through different attributes of software, comparing the impacts on the dependent variables. This is the kind of work that can keep a labful of grad students busy for a decade. I'd be surprised if it wasn't fundable (NSF grants). I've been astonished that it hasn't been done, it's so obvious. (Yes, I know, I could do it. But I have too many projects already and not enough time to do them.)
> Later you say, "When we try to manage anything on the
> basis of measurements that have not been carefully
> validated, we are likely to create side effects of
> measurement ...
> There is a lot of propaganda about measurement, starting
> with the fairy tale that "you can't manage what you don't
> measure." (Of course we can. We do it all the time.)"
> So, this seems to contradict itself. If I understood the
> aphorism about managing and measuring, admittedly I
> haven't heard Tom DeMarco say it personally, what I took
> it to mean is that there is an implied "good" after the
> word 'managing'. That is, he was saying, we cannot do a
> good job managing without measuring.
Are you aware that Tom has repeatedly, publicly retracted this comment?
> As to your summary point, I think we agree. It takes a lot
> of thinking to do metrics right. Most people get them
> wrong. We should spend tons of money on research that
> validates metrics. (I am willing to co-write a grant to
> study crap4j if anyone is game?)
> What I disagree with is a perception that metrics are not
> useful, that we are managing just fine without them, and
> that because some people misuse them (over and over again
> no less) that nobody should use them without exorbitant
> expenditures of time and money. It sounds a lot like
> trying to ignore the problem.
I spent a lot of years developing software and consulting to development companies before coming back to universities. Almost no one had metrics programs. Capers Jones claimed that 95% of the software companies he'd studied didn't have metrics programs. I hear time and again that this is because these companies lack the discipline or the smarts. What I heard time and again from my clients was that they abandoned the metrics programs because those programs did more harm than good. It is not that they are ignoring the problem or that they think there is no problem. It is that they have no better alternative to a multidimensional, qualitative assessment, even though that is unreliable, difficult, and inconsistent.
You can cure a head cold by shooting yourself in the head. Some people would prefer to keep the cold.
> We must keep trying to improve our measures by studying
> them, by validating them, and by improving them based on
> that study.
Remarkably little serious research is done on the quality of these measures.
> And without a doubt, it requires a coherent
> approach, and a clear understanding of what is being
> measured -- whether we call it construct validity or
> something else.
> Are we actually in violent agreement?
One of the not-so-amusing cartoons/bumper-stickers/etc. that I see posted on cubicle walls at troubled companies states, "Beatings will continue until morale improves." OK, obviously, morale is a problem and something needs to be done. But beatings are not the solution. In a dark period of the history of psychology, we got so enamored with high tech that we used the high-tech equivalent of beatings (electroshock therapy) to treat depression. It didn't work, but it was such a cool use of technology that we applied this torture to remarkably many people for a remarkably long time.
We have a serious measurement problem in our field. There are all sorts of things we would like to understand and control better. But we don't have the tools and I see dismayingly little effort to create well-validated tools. We have a lot of experience with companies abandoning their metrics programs because the low-quality tools being pushed today have been counterproductive.
We are not in violent agreement.
I see statistics like crap4j as more crappy ways to treat your head cold with a shotgun and I tell people not to rely on them. Instead, I try to help people think through the details of what they are trying to measure (the attributes), why those are critical for them, and how to use a series of converging, often qualitative, measurements to try to get at them. It's not satisfactory, but it's the best that I know.
-- cem kaner