
Weblogs Forum
Software Metrics Don't Kill Projects, Moronic Managers Kill Projects

50 replies on 4 pages. Most recent reply: May 16, 2008 1:38 PM by Thomas Cagley

Anton Litvinenko

Posts: 1
Nickname: shmonder
Registered: Nov, 2007

Re: Software Metrics Don't Kill Projects, Moronic Managers Kill Projects Posted: Nov 9, 2007 7:57 AM
I would like to question the idea of comparing two completely different applications. There has to be some reason for doing so, and that reason changes from case to case; it should dictate the choice of metrics. So no, we can't have a single universal metric, though we can use a combination of several metrics to emphasize or support our conclusions.

On the other hand, comparing people is much more compelling for managers :) So for me the real question sounds like "is there a way to compare two developers working on different projects?"

I am totally convinced that it is possible, but again, it shouldn't be narrowed down to some universal metric. Rather, everybody chooses the indicators that are most important to him or her.

Anton
--
http://sourcekibitzer.org/Bio.ext?sp=l20

Jeroen Wenting

Posts: 88
Nickname: jwenting
Registered: Mar, 2004

Re: Software Metrics Don't Kill Projects, Moronic Managers Kill Projects Posted: Nov 13, 2007 4:54 AM
> > I am not arguing that all software should be held to
> the
> > same standard wrt any given metric. For example, I
> expect
> > that medical applications to be more thoroughly tested
> > than, say, a video game.
>
> Actually, in practice, it would surprise me if this were
> generally true. At least console video games get very
> hard testing for their domain; the general rule for
> allowing a release used to be 100 hours of testing without
> finding any flaw, *after* all flaws found in previous
> testing had been fixed. I suspect that medical software
> that isn't used in life-and-death situations see
> significantly less quality assurance.
>
Of course, software for game consoles has several things that make it easier to test well (and more vital as well):
1) It's a closed environment; there's no myriad of different hardware/software combinations it has to run on top of (and concurrently with).
2) It's distributed on a medium that makes updating it to fix bugs impossible (or nearly so). ROM cartridges and CDs/DVDs need to be physically replaced for every customer; you can't just send them a patch file as a download (modern consoles with embedded hard disks make this somewhat possible, but they're a very recent development).

Medical software (at least most of it), by contrast, runs on PCs, each of which potentially has different hardware and other software (including operating system patch levels) installed.
At the same time, sending the users a patch via email or a download link is easy and cheap.

It's therefore (despite the seemingly more important problem domain in medicine) economically more important for the manufacturer to get game console software correct out of the box than medical software (as long as no patient dies, of course, in which case the liability claims can run into astronomical figures).

Bill Pyne

Posts: 165
Nickname: billpyne
Registered: Jan, 2007

Re: Software Metrics Don't Kill Projects, Moronic Managers Kill Projects Posted: Nov 13, 2007 6:03 AM
> That medical software (at least most of it) is run on PCs,
> each of which has potentially different hardware and other
> software (including operating system patch levels)
> installed.
> At the same time sending the users a patch via email or a
> download link is easy and cheap.
>
> It's therefore (despite the seemingly more important
> problem domain in medicine) more important economically
> for the manufacturer to get that game console software
> correct out of the box than that medical software (as long
> as no patient dies of course, in which case the liability
> claims can run into astronomical figures).

What you say is correct, but I'd like to add a little to it. Medical software is broad. Focusing exclusively on the laboratory system: it consisted of embedded software in the lab equipment that took the actual readings, a "collector" package that ran on a UNIX server, an HL7 message translator running on a UNIX server, a transaction server that took translated HL7 messages, transformed them per physicians' rules, and loaded them into the patient chart database, and the client piece running on PCs. Everything except the embedded software had the luxury of patching.

As far as testing goes, something as simple as changing the measurement unit in the patient chart system for a particular lab test, per physician request, involved a full regression test of the transaction server and client pieces. Every possible transaction was retested, along with all known error conditions and the server's ability to handle them. Physicians' training can prevent data errors from being catastrophic, but physicians can't catch everything, especially not in hour 25 of a 32-hour shift. In a hospital, information in medical systems CANNOT be wrong. The motivation for developers isn't financial (liability); it's watching life and death play out.

Cem Kaner

Posts: 4
Nickname: cemkaner
Registered: Nov, 2007

Re: Software Metrics Don't Kill Projects, Moronic Managers Kill Projects Posted: Nov 20, 2007 10:45 PM
I think the biggest problem with software metrics is that we don't have any.

Consider "coverage" for example. What does "coverage" actually measure? We know how to compute coverage (for simplicity, let's count the percentage of statements tested), but that's just a count. What's the meaning behind this count?

In most fields, measurement starts from an attribute (aka a construct), something we want to measure. For example, we might want to measure productivity, quality, intelligence, aptitude, maintainability, scalability, thoroughness of testing, reliability--these are attributes.

Given an attribute, we use a measuring instrument of some sort to map the "value" of the attribute to a number. The instrument is easy to identify in some cases--think of rulers and voltmeters. Some instruments are more complex--for example, intelligence tests. Some instruments require multiple readings under different circumstances--for example, we might try to measure how loud a sound is, to you, by having you compare it to dozens of other sounds, indicating for each comparison which sound was louder. (If you wear glasses, you've gone through this type of measurement of subjective visual clarity.)

The reading from the instrument is the value that software engineers call "the metric." (There are varying uses of the word "metric"--see wikipedia http://en.wikipedia.org/wiki/Metric)

In most fields that use measurement, the fundamental question is whether the instrument you are using actually measures the construct (or attribute) that you think you are measuring. That concern is called "construct validity."

If you search the ACM's Guide to the Computing Literature (which indexes ACM, IEEE and many other computing publications), there are only 490 papers / books / etc. that include the phrase "construct validity" out of 1,095,884 references searched. There are 48,721 references that refer to "metrics" (only 490 of them mention the "construct validity" of these "metrics"). I read most of the available ACM-Guide-listed papers that mentioned "construct validity" a few years ago (Cem Kaner & Walter P. Bond, "Software engineering metrics: What do they measure and how do we know?" 10th International Software Metrics Symposium (Metrics 2004), Chicago, IL, September 14-16, 2004, http://www.kaner.com/pdfs/metrics2004.pdf) -- of those, most were discussions of social science issues (business, economics, psychology) rather than what we would normally think of as software metrics.

The problem with taking "measurements" when you don't have a clear idea of what attribute you are trying to measure is that you are likely to come up with very precise measurements of something other than the attribute you have sort-of in mind. Consider an example. Suppose you wanted to measure aptitude for algebra. We sample the population and discover a strong correlation between height and the ability to answer algebra questions in a written test. People who measure between 5" and 30" tall (who are, coincidentally, very young and they don't yet know how to read) are particularly algebra-challenged. What are we really measuring?

When people tell me that you can measure the complexity of a program by counting how many IF statements it has (McCabe's metric), I wonder whether they have a clue about the meaning of complexity.
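
(To make the object of that complaint concrete: a rough sketch of the kind of count involved, tallying decision points in a Python function's AST and adding one. McCabe's actual definition works on the control-flow graph, so treat this as an approximation of the idea, not a faithful implementation.)

    # Rough illustration only: approximate a McCabe-style count for one piece
    # of Python source by tallying decision points in its AST and adding 1.
    import ast

    DECISION_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)

    def approx_cyclomatic_complexity(source: str) -> int:
        tree = ast.parse(source)
        decisions = sum(isinstance(node, DECISION_NODES) for node in ast.walk(tree))
        return 1 + decisions

    sample = """
    def classify(x):
        if x < 0:
            return "negative"
        elif x == 0:
            return "zero"
        for _ in range(3):
            if x > 100:
                return "huge"
        return "positive"
    """

    print(approx_cyclomatic_complexity(sample))  # 5: the if, elif, for, and inner if, plus 1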

When people tell me you can measure how thoroughly a program has been tested by computing the percentage of statements tested, I wonder if they have a clue about testing. See "Software negligence and testing coverage." (Keynote address) Software Testing, Analysis & Review Conference, Orlando, FL, p. 313, May 16, 1996. http://www.kaner.com/pdfs/negligence_and_testing_coverage.pdf

When we try to manage anything on the basis of measurements that have not been carefully validated, we are likely to create side effects of measurement. See Bob Austin's doctoral research, published in book form, http://www.amazon.com/Measuring-Managing-Performance-Organizations-Robert/dp/0932633366

There is a lot of propaganda about measurement, starting with the fairy tale that "you can't manage what you don't measure." (Of course we can. We do it all the time.)

Much of this propaganda is moralistic in tone or derisive. People get bullied by this and they conform by using measurement systems that they don't understand. (In many cases, that perhaps no one understands.) The result is predictable. You can blame individual managers. I blame the theorists and consultants who push unvalidated "metrics" on the field. People trust us. When we put defective tools in the hands of executives and managers, it's like putting a loaded gun in the hands of a three-year old and later saying, "guns don't kill people, people kill people." By all means, blame the victim.

Capers Jones wrote in one of his books (maybe many of his books) that 95% of software companies don't use software metrics. Most of the times I hear this quoted, the writer or speaker goes on to deride the laziness and immaturity of our field. Mature, sensible people would be in the 5%, not the great unwashed 95% that won't keep their records.

My experience is a little different. I'm a professor now, but I did a lot of consulting in Sili Valley. I went to company after company that didn't have software measurement systems. But when I talked to their managers / executives, they told me that they had tried a software measurement system, in this company or a previous one. Many of these folks had been involved in MANY software measurement systems. But they had abandoned them. Not because they were too hard, too time consuming, too difficult -- but because, time after time, they did more harm than good. It's one thing to pay a lot of money for something that gives valuable information. It's another thing to pay for golden bullets if all you're going to use them for is shooting holes in your own foot.

It takes years of work to develop valid measurement systems. We are impatient. In our impatience, we too often fund people (some of them charlatans) who push unvalidated tools instead of investing in longer term research that might provide much more useful answers in the future.

Raoul Duke

Posts: 127
Nickname: raoulduke
Registered: Apr, 2006

Re: Software Metrics Don't Kill Projects, Moronic Managers Kill Projects Posted: Nov 29, 2007 3:11 PM
> It takes years of work to develop valid measurement
> systems. We are impatient. In our impatience, we too often
> fund people (some of them charlatans) who push unvalidated
> tools instead of investing in longer term research that
> might provide much more useful answers in the future.

Well said, I think. Do you have any suggestions for less-suckful metrics in the software world?

Peter Booth

Posts: 62
Nickname: alohashirt
Registered: Aug, 2004

Re: Software Metrics Don't Kill Projects, Moronic Managers Kill Projects Posted: Dec 2, 2007 11:05 PM
There is a theme that I am reading in many posts here that seems something like "metrics are imperfect and flawed therefore they are useless." I suspect that this idea says much more about the speaker than metrics per se. Many people appear to be drawn to software development because of comfort with black and white, clear specific realities. The irony is that software development embraces a number of chaotic non-predictable processes and is much less black and white than people think.

Yet, how do you compare systems without metrics? A recent conversation:

ME: So, can you give me an estimate of how large system X is?
DEV: It's large.
ME: How big is large?
DEV: Um, I don't know exactly.
ME: Order of magnitude? Are we talking thousands, tens of thousands, even hundreds of thousands of classes?

I don't understand how someone can work on a system without knowing its size, and whether that size is increasing or decreasing.

Merriodoc Brandybuck

Posts: 225
Nickname: brandybuck
Registered: Mar, 2003

Re: Software Metrics Don't Kill Projects, Moronic Managers Kill Projects Posted: Dec 3, 2007 10:25 AM
> There is a theme that I am reading in many posts here that
> seems something like "metrics are imperfect and flawed
> therefore they are useless." I suspect that this idea says
> much more about the speaker than metrics per se. Many
> people appear to be drawn to software development because
> of comfort with black and white, clear specific realities.
> The irony is that software development embraces a number
> of chaotic non-predictable processes and is much less
> black and white than people think.
>
> Yet - how do you compare systems without metrics. A recent
> conversation:
>
> Me: SO can give me an estimate of how large system X is?"
> DEV: It's large
> ME: How big is large?
> DEV: Um I don't know exactly
> ME: order of magnitude? Are we talking about thousands,
> tens of thousands, even 100s thousands of classes?
>
> I don't understand how someone can work on a system
> without knowing its size, and where the size is increasing
> or decreasing.

In the interest of being a pedantic shmuck, if my system has no classes, is it then small and never growing?

Your token DEV's response should have been to ask "large how?". Lines of code? Cyclomatic complexity? Class count? Do you want just the classes we wrote or all classes that might be called by the program because it runs on the .NET framework? If the system consists of components in a variety of languages, how do you count the parts that may not have any classes?

What does 'large' in and of itself tell you anyway? I currently work on some systems that are large by any measure, yet I don't worry much about their actual size because the pieces are well thought out and pretty well put together, and updates and changes are pretty easy. I've worked on small systems (again, by just about any measure) that made me want to cry because they were fragile and horrible to update.

I think most people's issue with metrics is that they attempt to take something that is, as you say, chaotic, and reduce it to something very black and white. I don't get how you can take something that has been worked on for months or years, run it through some sort of crank, get a number, and have that number carry any real meaning. I think most people imagine the following exchange when it comes to the use of metrics:

Manager: What is the WizzleWub count of the DungBomb project?

Dev: 17

Manager: We were hoping for at least 19. I need to see you in my office...

I can only speak for myself, but my issue with metrics isn't that they are flawed; it's when they are used by people who don't really know what they represent to make some absolute evaluation of a system. That leads to trouble in most cases. And then, instead of making the system better (by fixing problems, adding features, etc.), you are more worried about bringing the WizzleWub count up.

The ultimate measure of any software system is whether it does what it was intended to do. Unfortunately, there isn't any single number or metric that will tell you that.

Raoul Duke

Posts: 127
Nickname: raoulduke
Registered: Apr, 2006

Re: Software Metrics Don't Kill Projects, Moronic Managers Kill Projects Posted: Dec 7, 2007 4:37 PM
> Manager: What is the WizzleWub count of the DungBomb
> project?
>
> Dev: 17
>
> Manager: We were hoping for at least 19. I need to see you
> in my office...

hm, seems the right way to check up on something is basically to have, in effect, a whole bunch of metric values. I mean, if you dig into the code and then make judgments based on experience, you are basically deciding on metric values in your head.

so if the "dashboard" which previously only showed the single WizzleWub value instead also showed umpteen other values, and then even synthesized some sum-up values out of those, would that seem any more reasonable to those who distrust metrics so much?

(metrics seem great to me, yet i also completely agree that any tool can be abused. so if my boss is a jerk who uses nothing but WizzleWub... sucks to be me, for sure.)

Merriodoc Brandybuck

Posts: 225
Nickname: brandybuck
Registered: Mar, 2003

Re: Software Metrics Don't Kill Projects, Moronic Managers Kill Projects Posted: Dec 8, 2007 8:05 AM
> so if the "dashboard" which previously only showed the
> single WizzleWub value instead also showed umpteen other
> values, and then even synthesized some sum-up values out
> of those, would that seem any more reasonable to those who
> distrust metrics so much?
>
That seems very reasonable. It also seems like a lot of work. And it requires experience and first hand knowledge in the trenches to do that well. Two criteria that make it likely not to be adopted by any big organization. Better to have lots of simple, barely meaningful measures that are easy to document and generate so as to get your desired CMMI level certification.

And it would require a manager to let go a little bit. If you already have a manager who can do this, then this isn't a problem. If you have a manager you already have issues with, this is yet one more weapon they can use to bludgeon you with their ignorance and stupidity. Up until the last job I had, which I left in September, I had been blessed with good managers throughout my career. Some of them required metrics, but the metrics were nothing more than a tool. Sometimes they were misapplied, but we were able to take a step back, say "OK, that's interesting information, but it doesn't make much sense or isn't telling us anything useful," and change it.

The only metric I really care about is open bug count. If it is going down, that's good. If it is going up, that's bad. I don't mind my manager holding that against me as long as the source of the defects is kept in mind. There is nothing so frustrating come review time as being penalized for inheriting an old, buggy system. Nothing like having your bug count triple for reasons way beyond your control and then getting hurt for it. Getting punished for other people's sins is no fun. And I've had that happen a couple of times. It stinks.

Robert Evans

Posts: 11
Nickname: bobevans
Registered: Jun, 2003

Re: Software Metrics Don't Kill Projects, Moronic Managers Kill Projects Posted: Dec 19, 2007 1:08 AM
I think coming from first principles is one good way to solve problems (it worked for Einstein). So, having a clear idea of what you are trying to measure is important.

As to the ACM search, I am curious how the synonym searches for construct validity worked out. It could very well be that people are describing the same concept with different terms; it happens all the time. How did the other 48,000 papers check out? I am certain you did good research, and that you just abbreviated this description to make your point. It would be interesting to hear more about how you measured the presence or absence of 'construct validity' in the actual approaches taken in all these papers.

Another interesting thing you said was on coverage. You ask what it means. If you want a clearer answer about what it measures, I recommend an interesting survey paper by Hong Zhu, Software Test Adequacy Criteria (it's in the ACM DL), which examined most of the work on testing adequacy criteria up to that point. It seems quite appropriate given that coverage is one adequacy criterion that can be measured. There are many criteria, like def-use paths, state coverage, and so on and on and on.

It seems that the whole point with metrics is to put them into context, understand the narrow story they tell about the system being measured and then make intelligent decisions. To throw complexity or coverage out completely seems to insist that since we have no perfect answers we should give up and go home.

Your statement about complexity was a further curiosity. I think you made a slight equivocation. When someone tells you about the complexity as measured by the decision points, I hope it is understood by both of you that you are using jargon. "Complexity" in this instance only references McCabe's work. And hopefully, you both realize that within that context it is a measure (or a metric) for an aspect of the system that seems to be somewhat correlated with defect density (check McCabe's '96 NIST report, where he points to a couple of projects that saw a correlation). Based on that context, a complexity score is possibly a useful thing to know and to use for improving the software.

Even if it isn't correlated in perfect lock-step all the time, anyone who has written any substantial software knows the anguish of maintaining really ugly large methods/functions. McCabe is trying to measure something we know is there. Is his measure complete? No. Is it sufficient? No. Is it useful? Yes. It seems to be supported in the studies too. If you have references for field studies that contradict the 96 report, please post them.

Later you say, "When we try to manage anything on the basis of measurements that have not been carefully validated, we are likely to create side effects of measurement ...
There is a lot of propaganda about measurement, starting with the fairy tale that "you can't manage what you don't measure." (Of course we can. We do it all the time.)"

So, this seems to contradict itself. As I understood the aphorism about managing and measuring (admittedly, I haven't heard Tom DeMarco say it personally), there is an implied "good" attached to the word 'managing'. That is, he was saying we cannot do a good job of managing without measuring. On the other hand, I agree that people manage (with no "good" attached) all the time without measuring. You might say that managing without measuring is a derivative form of managing on the basis of measurements that have not been carefully validated.

While we are clearing things up: I got the point you are trying to make, but what does it mean when you quote "guns don't kill people, people kill people" and then add "By all means, blame the victim"? Where in there is the victim blamed?

Traditionally, that argument means that the guns should not be outlawed, but that criminals should be jailed. It's an argument by gun owners to keep their guns legally. How is it blaming the victims?

On that point, you say "putting defective tools in the hands of managers". Let's not forget putting tools that require expertise in the hands of managers. That is as likely to blow up in somebody's face, and is arguably the current state of affairs.

As to your summary point, I think we agree. It takes a lot of thinking to do metrics right. Most people get them wrong. We should spend tons of money on research that validates metrics. (I am willing to co-write a grant to study crap4j if anyone is game?)

What I disagree with is the perception that metrics are not useful, that we are managing just fine without them, and that, because some people misuse them (over and over again, no less), nobody should use them without exorbitant expenditures of time and money. It sounds a lot like trying to ignore the problem.

We must keep trying to improve our measures by studying them, by validating them, and by improving them based on that study. And without a doubt, it requires a coherent approach, and a clear understanding of what is being measured -- whether we call it construct validity or something else.

Are we actually in violent agreement?

Robert Evans

Posts: 11
Nickname: bobevans
Registered: Jun, 2003

Re: Software Metrics Don't Kill Projects, Moronic Managers Kill Projects Posted: Dec 19, 2007 1:44 AM
This was supposed to be a reply to Cem's post. D'oh, no threading.

Alberto Savoia

Posts: 95
Nickname: agitator
Registered: Aug, 2004

Re: Software Metrics Don't Kill Projects, Moronic Managers Kill Projects Posted: Dec 19, 2007 11:48 AM
> I think the biggest problem with software metrics is that
> we don't have any.
...
>
> It takes years of work to develop valid measurement
> systems. We are impatient. In our impatience, we too often
> fund people (some of them charlatans) who push unvalidated
> tools instead of investing in longer term research that
> might provide much more useful answers in the future.

Cem,

I believe that software metrics can be, and often are, abused, misused, overused, confused, etc., and that A LOT more work, experimentation, and research are needed to improve the current state of affairs. Not to mention educating people in the proper ways to use (or not use) metrics.

But if I interpret your post (and position) correctly, the only conclusion I can draw is that - as of today - we should not be using ANY metric. Zero, nada, nyet. Is that the case? Should we burn any and all metrics tools? Remove all code coverage from our IDEs?

Are you suggesting a full moratorium on all metrics until we have invested a few years in "longer term research that might provide much more useful answers in the future"?

And, given that the overwhelming majority of metrics in existence today (all of which are invalid in your opinion) already come from researchers and academics, how can we make sure that THIS TIME we fund the right researchers and academics?

When you look at the body of work in software metrics, you seem to see a bunch of charlatans, incompetent theorists, and consultants who are trying to swindle a bunch of clueless managers who, in turn, are going to abuse their poor programmers with those metrics. I see it a little differently...

There may be some bad apples (as in any field). But for the most part, I see a bunch of people, many of them very smart, who are motivated by a deep desire to understand and improve the way we design, write, and test software. This is a very difficult task, made considerably more arduous by the constantly changing environment (i.e., every few years there are new programming models, languages, styles, etc.). Most of these people are smart enough to realize, and make it clear, that the metrics they are proposing and experimenting with are nowhere near perfect and that no single metric (or even a set of metrics) can tell the whole story. But that does not stop them from experimenting and using those metrics to learn more about them. And frankly, how else are you going to learn about and improve something if you don't experiment with it?

>When we put defective tools in the hands of executives
>and managers, it's like putting a loaded gun in the
>hands of a three-year old and later saying, "guns
>don't kill people, people kill people." By all means,
>blame the victim.

I find this attitude toward executives and managers surprisingly insulting, patronizing, and a gross overgeneralization. There are, for certain, some Dilbertesque managers and executives who will misunderstand and misuse metrics (that's the group that my YouTube video on "Metrics-Based Software Management" pokes fun at). But, based on my experience, most of them have enough sense to see metrics for what they are: a tool that, properly used, will give them and their team some valuable (if not complete, perfect, or infallible) insight.

I have a hard time believing that you would hold such extreme positions; but I re-read your post several times and the only conclusion I can draw is that in your view:

i) As of today, there are ZERO metrics that meet your standard/definition for construct validity.

ii) Putting invalid metrics in the hands of managers and executives is like putting guns in the hands of three-year olds (who will then aim them at innocent developers).

iii) Therefore, we should not use ANY software metrics AT ALL until a group of enlightened researchers (which will probably exclude all the charlatans and incompetent nincompoops responsible for the current crop of metrics) has had sufficient time to perform experiments in a protected environment and might come up with some metrics safe for general use sometime in the future.

Is that right?

Alberto

P.S. Cem, while we might hold different opinions on how to improve/fix the state of software metrics, I believe we share several common goals. I have enormous respect for you, your work, and your passion for software quality and testing (which we share). Not to mention the fact that I really like you on a personal level :-). I hope that this post is interpreted in the spirit in which it was written (i.e. a true desire to confirm my understanding of your position, not poke fun at it) and that we can continue this discussion in a constructive way that will help us (and the readers) gain a better understanding of different positions.

Cem Kaner

Posts: 4
Nickname: cemkaner
Registered: Nov, 2007

Re: Software Metrics Don't Kill Projects, Moronic Managers Kill Projects Posted: Dec 19, 2007 1:30 PM
> As to the ACM search, I am curious how the synonym
> searches for construct validity worked out. It could very
> well be that people are describing the same concept with
> different terms; it happens all the time. How did the
> other 48,000 papers check out? I am certain you did good
> research, and that you just abbreviated this description
> to make your point. It would be interesting to hear more
> about how you measured the presence or absence of
> 'construct validity' in the actual approaches taken in all
> these papers.

I searched in a pretty wide variety of ways over several years because I couldn't believe that the concept was so weakly addressed. It doesn't matter what those strategies were because you can always argue that they are insufficient to prove the negative (some other search for some other synonym that I haven't tried could always yield undiscovered gold...) The reason that I report numbers against "construct validity" is that this phrase is widely used across several disciplines. The lack of reference to it is an indicator, in itself, of the disconnect between software engineering measurement researchers and the broader measurement theory community.

The primary way that I have seen construct validity addressed in texts on software measurement (I have taught from several and reviewed several more--perhaps all of the books marketed as suitable as metrics course texts) is indirectly, through the representational theory of measurement. If a metric satisfies all of the requirements of the representational theory (and I haven't seen a serious claim that any of them do, just several critiques of metrics that don't), then it will almost certainly have construct validity. However, the head-on confrontation with the question, "What is the underlying attribute we are trying to measure, and how does this relate to that?", is almost always buried. I have been repeatedly disappointed by the brevity and shallowness of this discussion in books on software-related metrics that I have taught from or considered teaching from.

Apart from my own searches, I have also challenged practitioner and academic colleagues to help me find better references. Some of my colleagues have worked pretty hard on this (some of them also teach metrics courses). So far, all we've found are representational theory discussions.

Maybe you've found better somewhere. If so, maybe you could share those references with us.

Bob Austin writes in his book about his interviews with some famous software metrics advocates and how disappointed he was with their naivete vis-a-vis measurement theory and measurement risk.

>
> Another interesting thing you said: on coverage. You ask
> what it means. If you wanted a clearer answer of what it
> measures, I recommend an interesting survey paper by Hong
> Zhu, Software Test Adequacy Criteria (it's in the ACM dl)
> that examined most of the, up-to-that-point work on
> testing adequacy criteria. It seems quite appropriate
> given that coverage is one adequacy criteria that could be
> measured. There are many criteria like def-use paths,
> state coverage, and so on and on and on.

Yes, yes, I've read a lot of that stuff.

Let me define coverage in a simple way. Consider some testable characteristic of a program, and count the number of tests you could run against the program, with respect to that characteristic. Now count how many you have run. That percentage is your coverage measure. You want to count def-use pairs? Go ahead. Statements? Branches? Subpaths of lengths N? If you can count it, you can report coverage against it.

Understand that coverage is not only countable against internal, structural criteria. When I was development manager for a desktop publishing program, our most important coverage measure was percentage of printers tested, from a target pool of mass-market printers. At that time, we had a lot of custom code tied to different printers. Each new working printer reflected a block of capability finally working. It also reflected a barrier to product release being removed, because we weren't going to ship until we worked with our selected test pool. For us, at that time, on that project, knowing that we were at 50% printer coverage was both a meaningful piece of data and a useful focuser of work.
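
A minimal sketch of that counting definition in code, using a printer pool like the one above as the criterion (the printer names are made-up placeholders):

    # Pick a criterion, enumerate the countable items for it, record which
    # ones your testing exercised, and report the percentage.
    def coverage_percent(all_items, exercised):
        """Share of countable items (for one chosen criterion) actually exercised."""
        if not all_items:
            return 100.0
        return 100.0 * len(set(all_items) & set(exercised)) / len(set(all_items))

    target_printer_pool = {"LaserWriter", "DeskJet", "DotMatrix-24", "InkJet-XL"}
    printers_tested_so_far = {"LaserWriter", "DeskJet"}

    print(coverage_percent(target_printer_pool, printers_tested_so_far))  # 50.0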

We can measure coverage against assertions of the specification, coverage of individual input/output variables (count the number of variables and for each one, test minima, maxima, out-of-bounds, and special cases), combinatorial coverage (all-pairs, all-triples, all-quadruples, whatever your coverage criterion is for deciding which variables to test to what degree of interaction with other variables).
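
And a sketch of one combinatorial flavor, pairwise ("all-pairs") coverage over a hypothetical configuration space; the variables, values, and tests are invented for illustration:

    # Of all the value pairs across every pair of input variables, what
    # fraction does the test suite exercise?
    from itertools import combinations, product

    domains = {
        "os": ["WinXP", "OSX", "Linux"],
        "browser": ["IE6", "Firefox"],
        "locale": ["en", "de", "ja"],
    }

    tests = [  # each test assigns one value to every variable
        {"os": "WinXP", "browser": "IE6",     "locale": "en"},
        {"os": "OSX",   "browser": "Firefox", "locale": "de"},
        {"os": "Linux", "browser": "IE6",     "locale": "ja"},
    ]

    def pairwise_coverage(domains, tests):
        required = set()
        for (var_a, vals_a), (var_b, vals_b) in combinations(domains.items(), 2):
            required |= {(var_a, a, var_b, b) for a, b in product(vals_a, vals_b)}
        covered = set()
        for test in tests:
            for var_a, var_b in combinations(domains, 2):
                covered.add((var_a, test[var_a], var_b, test[var_b]))
        return 100.0 * len(covered & required) / len(required)

    print(round(pairwise_coverage(domains, tests), 1))  # 3 tests cover 9 of the 21 required pairs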

At a meeting of the Software Test Managers Roundtable, we identified hundreds of potential coverage measures. I listed 101 (just to provide a long sample of the huge space possible) coverage measures in my paper Software Negligence & Testing Coverage, http://www.kaner.com/pdfs/negligence_and_testing_coverage.pdf

So, yes, there is a lot of ambiguity about what "coverage" means.

It is seductive to identify specific attributes as THE attributes of interest, but if you focus your testing on attribute X, you will tend to find certain types of errors and miss other types of errors. Complete coverage against X is not complete coverage. It is just complete coverage against X.

For example, suppose we achieve 100% statement coverage. That means we executed each statement once. In an interpreted language, this is useful because syntax errors are detected in real time (at execution time) and not during compilation. So 100% statement coverage assures that there are no syntax errors (unnecessary assurance in a compiled language, because the compiler does it already). However, it offers no assurance that the program will process special cases correctly, that it will even detect critical special cases (if there are no statements to cover divide-by-zero, you can test every statement and never learn that the program will crash when certain variables take a zero value.) You never learn that the program has no protection against buffer overflows, that it is subject to serious race conditions, that it crashes if connected to an unexpected output device, that it has memory leaks, that it corrupts its stack, that it adds input variables together in ways that don't guard against overflow, and on and on and on.
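
A tiny illustration of that divide-by-zero point (Python used only for brevity; the function is hypothetical):

    def unit_price(total, quantity):
        return total / quantity   # the only statement; any passing test "covers" it

    assert unit_price(10.0, 4) == 2.5   # 100% statement coverage reached here

    # unit_price(10.0, 0)  # ...and this still raises ZeroDivisionError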

When you focus programmers / testers on a specific coverage measurement, they optimize their testing for that. As a result, they achieve high coverage on their number but low coverage against the other attributes. Brian Marick has written and talked plenty about the ways in which he saw coverage-focused testing cause organizations to achieve better metrics and worse testing. This is the kind of side effect Bob Austin wrote about, and the kind that almost none of the metrics papers in the ACM/IEEE journals even mention the possibility of.

People often write about their favorite coverage metric as "coverage" rather than "coverage against attribute X" -- but if by "coverage", we want to mean how much of the testing that we could have done that we actually did, then we face the problem that the number of tests for any nontrivial program is essentially infinite, even if you include only distinct tests (two tests are distinct if the program could pass one but fail the other). If we measure coverage against the pool of possible tests rather than against attribute X, our coverage is vanishingly small (any finite number divided by infinity is zero).


>
> It seems that the whole point with metrics is to put them
> into context, understand the narrow story they tell about
> the system being measured and then make intelligent
> decisions. To throw complexity or coverage out completely
> seems to insist that since we have no perfect answers we
> should give up and go home.
>
> Your statement about complexity was a further curiosity. I
> think you made a slight equivocation. When someone tells
> you about the complexity as measured by the decision
> points, I hope it is understood by both of you that you
> are using jargon. "Complexity" in this instance only
> references McCabe's work. And hopefully, you both realize
> that within that context it is a measure (or a metric) for
> an aspect of the system that seems to be somewhat
> correlated with defect density (check McCabe's 96 NIST
> report where he points to a couple of projects that saw a
> correlation.) Based on that context, a complexity score is
> possibly a useful thing to know and to use for improving
> the software.

McCabe's metric essentially counts the number of branches in a method. Big deal.

Structural complexity metrics, which are often marketed as "cognitive complexity" metrics, completely ignore the semantics of the code. Semantic complexity is harder to count, so we ignore it.

Yes, structural complexity is one component of the maintainability problem. But so is comprehensibility of variable names, adequacy and appropriateness of comments, coherence of the focus of the method, and the underlying difficulty of the aspect of the world that is being modeled in this piece of code.

Defining a metric focuses us toward optimizing those aspects of our work that are being measured. And taking work / focus away from those aspects that are not being measured. Choosing to use a structural "complexity" metric is a choice about what kinds of things actually make code hard to read, hard to get right, hard to fix, and hard to document.

I've seen some of the correlational studies on structural metrics. Take some really awful code and some really simple code. Those are your anchors. The simple, reliable code has good structural statistics, the awful code is terrible by any measure, and the correlation will show up as positive because of the end points even if the intermediate values are almost random.
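
A quick numerical illustration of that anchoring effect, with made-up numbers:

    # The middle of the range is pure noise, but the simple and awful "anchor"
    # modules at the two ends still pull the overall correlation clearly positive.
    import random
    import statistics  # statistics.correlation needs Python 3.10+

    random.seed(1)

    # 20 middling modules with random complexity and random defect counts,
    # plus one trivially simple module and one very complex, very buggy one.
    complexity = [1] + [random.randint(10, 20) for _ in range(20)] + [60]
    defects = [0] + [random.randint(0, 30) for _ in range(20)] + [45]

    print(round(statistics.correlation(complexity, defects), 2))  # clearly positive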

If you want to figure out what aspects of programs create complexity, one of the obvious ways is to put code in front of people and assess their reactions. How complex do they think it is? (People can report their level of subjective complexity. Their reports are not perfect, and there are significant practice effects before irrelevant biasing variables get weeded out, but we ask questions like this all the time in psychophysical research and get useful data that drives advances in stereo systems, perfumes, artificial tastes in foods, lighting systems, alarms, etc.) You can also measure how long it takes them to read the code, where duration is measured as the time until they say that they feel like they understand the code. Or you can suggest a specific code change and see how long it takes them to successfully change the code in that way. We have plenty of simple dependent variables that can be used in a laboratory setting. The research program would crank through different attributes of software, comparing the impacts on the dependent variables. This is the kind of work that can keep a labful of grad students busy for a decade. I'd be surprised if it wasn't fundable (NSF grants). I've been astonished that it hasn't been done; it's so obvious. (Yes, I know, I could do it. But I have too many projects already and not enough time to do them.)

>
> Later you say, "When we try to manage anything on the
> basis of measurements that have not been carefully
> validated, we are likely to create side effects of
> measurement ...
> There is a lot of propaganda about measurement, starting
> with the fairy tale that "you can't manage what you don't
> measure." (Of course we can. We do it all the time.)"
>
> So, this seems to contradict itself. If I understood the
> aphorism about managing and measuring, admittedly I
> haven't heard Tom DeMarco say it personally, what I took
> it to mean is that there is an implied "good" after the
> word 'managing'. That is, he was saying, we cannot do a
> good job managing without measuring.

Are you aware that Tom has repeatedly, publicly retracted this comment?


>
> As to your summary point, I think we agree. It takes a lot
> of thinking to do metrics right. Most people get them
> wrong. We should spend tons of money on research that
> validates metrics. (I am willing to co-write a grant to
> study crap4j if anyone is game?)
>
> What I disagree with is a perception that metrics are not
> useful, that we are managing just fine without them, and
> that because some people misuse them (over and over again
> no less) that nobody should use them without exorbitant
> expenditures of time and money. It sounds a lot like
> trying to ignore the problem.

I spent a lot of years developing software and consulting to development companies before coming back to universities. Almost no one had metrics programs. Capers Jones claimed that 95% of the software companies he'd studied didn't have metrics programs. I hear time and again that this is because these companies lack the discipline or the smarts. What I heard time and again from my clients was that they abandoned the metrics programs because those programs did more harm than good. It is not that they are ignoring the problem or that they think there is no problem. It is that they have no better alternative to a multidimensional, qualitative assessment, even though that is unreliable, difficult, and inconsistent.

You can cure a head cold by shooting yourself in the head. Some people would prefer to keep the cold.
>
> We must keep trying to improve our measures by studying
> them, by validating them, and by improving them based on
> that study.

Remarkably little serious research is done on the quality of these measures.

> And without a doubt, it requires a coherent
> approach, and a clear understanding of what is being
> measured -- whether we call it construct validity or
> something else.
>
> Are we actually in violent agreement?

One of the not-so-amusing cartoons/bumper-stickers/etc. that I see posted on cubicle walls at troubled companies states, "Beatings will continue until morale improves." OK, obviously, morale is a problem and something needs to be done. But beatings are not the solution. In a dark period of the history of psychology, we got so enamored with high tech that we used the high-tech equivalent of beatings (electroshock therapy) to treat depression. It didn't work, but it was such a cool use of technology that we applied this torture to remarkably many people for a remarkably long time.

We have a serious measurement problem in our field. There are all sorts of things we would like to understand and control better. But we don't have the tools and I see dismayingly little effort to create well-validated tools. We have a lot of experience with companies abandoning their metrics programs because the low-quality tools being pushed today have been counterproductive.

We are not in violent agreement.

I see statistics like crap4j as more crappy ways to treat your head cold with a shotgun and I tell people not to rely on them. Instead, I try to help people think through the details of what they are trying to measure (the attributes), why those are critical for them, and how to use a series of converging, often qualitative, measurements to try to get at them. It's not satisfactory, but it's the best that I know.

-- cem kaner

Cem Kaner

Posts: 4
Nickname: cemkaner
Registered: Nov, 2007

Re: Software Metrics Don't Kill Projects, Moronic Managers Kill Projects Posted: Dec 19, 2007 2:17 PM
Alberto Savoia wrote:


> Cem,
>
> I believe that software metrics can, and often are, abuse,
> misused, overused, confused, etc. And that A LOT more
> work, experiments, and research is needed to improve the
> current state of affairs. Not to mention educating people
> in the proper ways to use (or not use) metrics.
>
> But if I interpret your post (and position) correctly, the
> only conclusion I can draw is that - as of today - we
> should not be using ANY metric. Zero, nada, nyet. Is that
> the case? Should we burn any and all metrics tool?
> Remove all code coverage from our IDEs?

Sure, I want to achieve 100% statement/branch coverage of code that I write. Before I had tools, I had to use a yellow highlighter on a code listing (really, we did this at Telenova, a phone company I programmed at for 5 years in the 1980's). It was valuable, but it was very tedious. Having simple coverage monitors in Eclipse makes my life easier.

But I think this is the beginning of testing, not the end of it. And in finite time, I am perfectly willing to trade off some other type of testing against this. I see this as a tool for me as a programmer, not as a management tool. As soon as it turns into a management tool, I want to burn it because of the side effects.

If I was assessing someone's code (as their manager), there are several questions that I'd want to balance, such as:

- does it work?
- what evidence has this person collected to suggest that it works, or works well enough?
- is it a straightforward implementation?
- can I understand the code?
- is it usable (either at the UI level or in its interface to the relevant other parts of the application)?
- how much did this cost and why?
- did s/he consider implementation cost and quality explicitly in the design and implementation? What evidence?

Code coverage (take your pick of metrics) is a tiny part of this picture. The more I focus on it, the more tightly I am hugging one tree in a big forest.

One approach is to combine several simplistic metrics (hug a few trees), but just as coverage is a terribly weak indicator of how well the code has been tested, many of these other metrics are weak indicators of whatever they are supposed to measure. Combining them gives an appearance of much greater strength (dashboards or balanced scorecards are very impressive) but they still provide very little information against the questions I'm asking.

The questions are much more critical than the metrics.

It is common to teach an approach to measurement called Goal / Question / Metric. You define a measurement goal ("I want to understand the productivity of my staff in order to manage my project's costs and schedule better") and then a few questions ("What is the productivity of my staff?" "How would a change in productivity impact my costs?") and then a metric or two per question.

One of the exercises we do in my metrics class is to pick a question and take it seriously. Suppose we really wanted an answer to the question. What kinds of information would we have to collect to get that answer? We often come up with lists of several dozen candidates, some of which are easy to translate to numbers and others that need a more qualitative assessment. It is so very tempting to pick one or two easy ones to calculate, declare that these are a sufficient sample of the space of relevant metrics, and then manage on these. And that temptation is so very dangerous in terms of the side effects.
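
For concreteness, a toy sketch of what one such Goal/Question/Metric breakdown might look like; the candidate measurements listed are invented placeholders, not recommendations:

    # Echoes the example goal above. The point of the classroom exercise is how
    # long, and how qualitative, these lists become once a question is taken seriously.
    gqm = {
        "goal": "Understand staff productivity to manage project cost and schedule better",
        "questions": {
            "What is the productivity of my staff?": [
                "features completed per iteration (needs an agreed definition of 'feature')",
                "rework time per delivered change",
                "qualitative peer assessment of the difficulty of the work taken on",
            ],
            "How would a change in productivity impact my costs?": [
                "cost per delivered change, tracked across several releases",
                "schedule slip attributable to rework rather than new scope (a judgment call)",
            ],
        },
    }

    for question, candidates in gqm["questions"].items():
        print(question)
        for candidate in candidates:
            print("  -", candidate)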

>
> Are you suggesting a full moratorium on all metrics until
> we have invested a few years in "longer term research that
> might provide much more useful answers in the future."?

I am not suggesting a moratorium on management. I am suggesting a moratorium on the hype. I am suggesting a huge increase in the humility index associated with the statistics we collect from our development and testing efforts. I am suggesting a fundamental refocusing on the questions we are trying to answer rather than the statistics we can easily compute that maybe answer maybe some of the questions maybe to some unknown degree with some unconsidered risk of side effects. I am suggesting that we take the risks of side effects more seriously and consider them more explicitly and manage them more thoughtfully. And I am saying that we demand research that is much more focused on the construct and predictive validity of proposed metrics, with stronger empirical evidence--this is hard, but it is hard in every field.


> P.S. Cem, while we might hold different opinions on how to
> improve/fix the state of software metrics, I believe we
> share several common goals. I have enormous respect for
> you , your work, and your passion for software quality and
> testing (which we share.) Not to mention the fact that I
> really like you on a personal level :-). I hope that this
> post is interpreted in the spirit in which it was written
> (i.e. a true desire to confirm my understanding of your
> position, not poke fun at it) and that we can continue
> this discussion in a constructive way that will help us
> (and the readers) gain a better understanding of different
> positions.

Alberto, I wrote my last note (the one on your blog post that follows up to this post), speaking to you by name, because I respect you enough and like you enough to say that I'm disappointed. I've spent a lot of writing hours on this thread--usually I skip blog posts on what I think of as overly simplistic approaches to software measurement, but I put a lot of time into this one because it is your thread. That makes it worth my attention.

-- cem

Alberto Savoia

Posts: 95
Nickname: agitator
Registered: Aug, 2004

Re: Software Metrics Don't Kill Projects, Moronic Managers Kill Projects Posted: Dec 19, 2007 3:56 PM
Cem Kaner wrote:
-----------------------------------------------------------
I am suggesting a huge increase in the humility index associated with the statistics we collect from our development and testing efforts. I am suggesting a fundamental refocusing on the questions we are trying to answer rather than the statistics we can easily compute that maybe answer maybe some of the questions maybe to some unknown degree with some unconsidered risk of side effects. I am suggesting that we take the risks of side effects more seriously and consider them more explicitly and manage them more thoughtfully. And I am saying that we demand research that is much more focused on the construct and predictive validity of proposed metrics, with stronger empirical evidence--this is hard, but it is hard in every field.
------------------------------------------------------------

Cem,

I agree with everything you say in the above paragraph. Believe it or not, the goals of C.R.A.P. when we started were very similar to the ones you state: especially the humility part (although, given my personality, that translates into "let's not take ourselves too seriously"), focusing on specific attributes, collecting data, doing more research, keeping the metric and the thinking behind it open so people can do their own experiments, etc.

Here's some unedited text from one of the earliest C.R.A.P. posts in July of this year:

-----------------------

Below is some of our thinking behind the C.R.A.P. index:

[] We believe that software metrics, in general, are just tools. No single metric can tell the whole story; it’s just one more data point. Metrics are meant to be used by developers, not the other way around – the metric should work for you, you should not have to work for the metric. Metrics should never be an end unto themselves. Metrics are meant to help you think, not to do the thinking for you.

[] We believe that, in order to be useful and become widely adopted, a software metric should be easy to understand, easy to use, and – most importantly – easy to act upon. You should not have to acquire a bunch of additional knowledge in order to use a new metric. If a metric tells you that your inter-class coupling and coherence score (I am making this up) is 3.7, would you know if that’s good or bad? Would you know what you need to do to improve it? Are you even in a position to make the kind of deep and pervasive architectural changes that might be required to improve this number?

[] We believe that the formula for the metric, along with various implementations of the software to calculate the metric should be open-source. We will get things started by hosting a Java implementation of the C.R.A.P. metric (called crap4j) on SourceForge.

[] The way we design, develop, and deploy software changes all the time. We believe that with software metrics, as with software itself, you should plan for, and expect, changes and additions as you gain experience with them. Therefore the C.R.A.P. index will evolve and, hopefully, improve over time. In that spirit, what we present today is version 0.1 and we solicit your input and suggestions for the next version.

[] We believe that a good metric should have a clear and very specific purpose. It should be optimized for that purpose, and it should be used only for that purpose. The more general and generic a metric is, the weaker it is. The C.R.A.P. index focuses on the risk and effort associated with maintaining and changing an existing body of code by people other than the original developers. It should not be abused or misused as a proxy for code quality, evaluating programmers’ skills, or betting on a software company’s stock price.

[] Once the objective for the metric is established, the metric should be designed to measure the major factors that impact that objective and encourage actions that will move the code closer to the desired state with respect to that objective. In the case of C.R.A.P., the objective is to measure and help reduce the risks associated with code changes and software maintenance – especially when such work is to be performed by people other than the original developers. Based on our initial studies and research on metrics with similar aims (e.g., the Maintainability Index from CMU’s Software Engineering Institute) we decided that the formula for version 0.1 of the C.R.A.P. index should be based on method complexity and test coverage.

[] There are always corner cases, special situations, etc., and any metric might misfire on occasion. For example, C.R.A.P. takes into account complexity because there is good research showing that, as complexity increases, the understandability and maintainability of a piece of code decreases and the risk of defects increases. This suggests that measuring code complexity at the method/function level and making an effort to minimize it (e.g. through refactoring) is a good thing. But, based on our experience, there are cases where a single method might be easier to understand, test, and maintain than a refactored version with two or three methods. That’s OK. We know that the way we measure and use complexity is not perfect. We have yet to find a software metric that’s right in all cases. Our goal is to have a metric that’s right in most cases.

...

Software metrics have always been a very touchy topic; they are perfect can-of-worms openers and an easy target. When we started this effort, we knew that we'd be in for a wild ride, a lot of criticism, and lots of conflicting opinions. But I am hopeful that – working together and with an open-source mindset – we can fine-tune the C.R.A.P. index and have a metric that will help reduce the amount of crappy code in the world.

OK. Time for some feedback – preferably of the constructive type so that C.R.A.P. 0.2 will be better than C.R.A.P. 0.1.


----------------

I'd like to think that the above thinking provides evidence of humility on our part, awareness of the many inadequacies of any metric, its potential for misuse, the need to focus on specific attributes (which, for C.R.A.P., is maintainability by developers other than the original developers, not quality), testing the predictive power, etc.
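
To make the "method complexity and test coverage" part concrete for readers following along, here is a minimal sketch of the arithmetic behind a C.R.A.P.-style score for a single method. The 0.1 formula assumed below (complexity squared, times one minus coverage cubed, plus complexity) is not reproduced anywhere in this thread, so treat it as an assumption; crap4j itself computes this from Java bytecode, and the Python is only there to show the arithmetic.

    def crap_score(cyclomatic_complexity: int, coverage_percent: float) -> float:
        # Assumed 0.1 formula: complexity^2 * (1 - coverage)^3 + complexity
        cov = coverage_percent / 100.0
        return cyclomatic_complexity ** 2 * (1.0 - cov) ** 3 + cyclomatic_complexity

    print(round(crap_score(15, 0), 1))    # 240.0 : complex and completely untested
    print(round(crap_score(15, 80), 1))   # 16.8  : same complexity, well tested
    print(round(crap_score(3, 0), 1))     # 12.0  : simple enough that even no tests scores low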

Cem Kaner wrote:

-----------------------------------------------------------
And I am saying that we demand research that is much more focused on the construct and predictive validity of proposed metrics, with stronger empirical evidence--this is hard, but it is hard in every field.
-----------------------------------------------------------

I want that too; but in order to test the construct validity and predictive value, we need to have some metrics to test with, some people willing to use them on their projects (real-world projects), and also willing to share the data as well as their opinion of the metric "readings". You say "this is hard", and I could not agree more, but we gotta start somewhere. The latest version of crap4j offers an embryonic mechanism to encourage data sharing. It's very, VERY primitive and limited at this time, but you can get a flavor of it at:
http://crap4j.org/benchmark/stats/ and use your imagination for how it might be evolved and used.

We don't want to do this work alone. We are looking for other people to push back, propose, and test completely different measures and formulae, etc. That's why all the code is open-source. Of course, it would be great to have a combination of industry and academic people working on "the next generation of metrics". Given how strongly and passionately you feel about the topic, is this something that you (or some of your students/colleagues) might be interested in?

Alberto

P.S.

Cem Kaner wrote:
------------------------------------------------------------
Alberto, I wrote my last note (the one on your blog post that follows up to this post), speaking to you by name, because I respect you enough and like you enough to say that I'm disappointed. I've spent a lot of writing hours on this thread--usually I skip blog posts on what I think of as overly simplistic approaches to software measurement, but I put a lot of time into this one because it is your thread. That makes it worth my attention.
------------------------------------------------------------

Hopefully, reading some of the material in this reply (as well as in previous posts) gives you a bit more context for my last two posts and a better perspective on what we are trying to accomplish.

I also spent a lot of writing hours on these replies for the same reasons you mention (including last night past 11PM - when I told my wife what I was doing she thought I was crazy :-)). I appreciate the respect, return it several fold, and - if at all possible from your end - I would love an opportunity to continue this discussion offline and see if we can find a way to work together, or at least with more awareness of each other, going forward.

Alberto
