While code metrics are frequently used to improve code quality, traditional metrics are not sufficient to determine the true risk behind a code base, argues Coverity's Ben Chelf in an interview with Artima. Coverity recently released its Software Readiness Manager for Java, a tool that aims to combine traditional metrics with additional data, and to correlate all that information in order to create a better gauge of software risk:
We are giving software development managers a set of aggregated and normalized statistics, metrics about their software system. Those metrics accurately describe the risk inherent in that system, risk that can be correlated with the eventual deployment quality of that system.
The key observation behind the product is that while lots of people have measured software before, and there are many different ways to pull data out of a software system, none of the traditional measures can predict anything a priori about the software. Traditional measures include lines of code, cyclomatic complexity, notions of how many comments you have, test coverage, defect detection, static analysis as a metric, and so on.
We found it necessary to combine these statistics and correlate them into higher-level indicators that truly measure what’s going on in a software system. By aggregating, normalizing, and correlating these things, we can compare apples to apples, and compare numbers from one codebase to another.
For example, think of just two metrics: test case coverage and complexity. The way you want to correlate those is that if you have a very complex function, you want to require higher test-case coverage. If you have very high complexity and low test-case coverage, that’s inherently more risky than if you have high complexity and high test coverage. Similarly, if the complexity is not very high, the coverage might not be as important: lower test coverage might be OK for a very simple function.
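To make that idea concrete, here is a minimal sketch of how complexity and test coverage might be folded into a single per-function risk score. The interview does not disclose Coverity's actual formula, so the scaling rule and thresholds below are purely illustrative assumptions; the only point it demonstrates is that the coverage a function "needs" grows with its complexity.

```java
// Illustrative sketch only: the weighting below is a made-up example of the
// idea Chelf describes, namely that required coverage should scale with
// complexity. It is not Coverity's formula.
public final class FunctionRiskExample {

    /**
     * Returns a risk score in [0, 1] for a single function.
     * A complex function with low test coverage scores high;
     * a simple function tolerates lower coverage.
     *
     * @param cyclomaticComplexity cyclomatic complexity of the function
     * @param coverage             fraction of the function covered by tests (0.0-1.0)
     */
    static double riskScore(int cyclomaticComplexity, double coverage) {
        // Hypothetical rule: required coverage grows with complexity,
        // saturating at 100% once complexity reaches 20.
        double requiredCoverage = Math.min(1.0, cyclomaticComplexity / 20.0);
        double shortfall = Math.max(0.0, requiredCoverage - coverage);
        return shortfall; // 0 means coverage meets the (assumed) requirement
    }

    public static void main(String[] args) {
        // Complex and poorly tested: high risk.
        System.out.println(riskScore(30, 0.2)); // 0.8
        // Simple and poorly tested: acceptable.
        System.out.println(riskScore(3, 0.2));  // 0.0
    }
}
```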
We have a few terms to describe the roll-up of these metrics. There’s something we call the volume of the code, which shows how dense the code is. That’s something that takes into account the comments, complexity, and number of lines per function. We also have something called raw complexity: it combines traditional notions of complexity with some additional relevant metrics. We also look at results of static analysis and correlate that to the total number of violations. Another metric tries to gauge the testability of the system: combining test case coverage with complexity, for example, gives you some notion of how well the code base is tested.
Part of this is a question of normalization. You want each of those indicators to be a single number that’s normalized, so that you can compare volume to complexity to static analysis violations, across many projects and code bases. In the raw metrics themselves, things don’t line up that nicely: you can’t necessarily compare the number of defects with code coverage, for example.
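The interview does not say which normalization scheme the product uses, but a simple min-max scaling illustrates the idea: once each raw metric is mapped onto a common 0-1 scale, indicators measured in very different units can sit side by side and be compared across projects.

```java
import java.util.Arrays;

// Illustrative sketch only: one simple way to normalize raw metrics so that
// unlike quantities (defect counts, coverage percentages, complexity) land on
// a common 0-1 scale. Coverity's actual normalization is not described in the
// interview.
public final class MetricNormalizationExample {

    /** Min-max normalization: maps each raw value into [0, 1] relative to the sample. */
    static double[] normalize(double[] raw) {
        double min = Arrays.stream(raw).min().orElse(0.0);
        double max = Arrays.stream(raw).max().orElse(0.0);
        double range = max - min;
        if (range == 0.0) {
            return new double[raw.length]; // all values identical: everything maps to 0
        }
        double[] scaled = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
            scaled[i] = (raw[i] - min) / range;
        }
        return scaled;
    }

    public static void main(String[] args) {
        // Static analysis violation counts for four hypothetical projects.
        double[] violations = {12, 340, 95, 7};
        System.out.println(Arrays.toString(normalize(violations)));
        // Once normalized, this indicator can be compared directly with a
        // normalized complexity or coverage indicator for the same projects.
    }
}
```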
At the highest level, we roll up those indicators into two main indices: overall risk, which is a measure of quality, and unmaintainability, which measures how difficult it will be for the system to evolve over time.
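The interview does not describe how the indicators are combined, so the weights and indicator names in the sketch below are assumptions. It only illustrates the shape of the roll-up: several normalized indicators, each already on a 0-1 scale, are blended into the two top-level indices.

```java
// Illustrative sketch only: hypothetical weights and indicator names, showing
// how normalized indicators might roll up into two top-level indices.
public final class IndexRollupExample {

    /** A hypothetical set of normalized indicators, each already scaled to [0, 1]. */
    record Indicators(double volume, double rawComplexity,
                      double staticAnalysisViolations, double untestability) {}

    /** Overall risk index: weighted toward defect- and test-related indicators. */
    static double overallRisk(Indicators in) {
        return 0.4 * in.staticAnalysisViolations()
             + 0.3 * in.untestability()
             + 0.2 * in.rawComplexity()
             + 0.1 * in.volume();
    }

    /** Unmaintainability index: weighted toward structural indicators. */
    static double unmaintainability(Indicators in) {
        return 0.4 * in.rawComplexity()
             + 0.3 * in.volume()
             + 0.2 * in.untestability()
             + 0.1 * in.staticAnalysisViolations();
    }

    public static void main(String[] args) {
        Indicators hotspot = new Indicators(0.7, 0.9, 0.6, 0.8);
        System.out.printf("overall risk: %.2f%n", overallRisk(hotspot));
        System.out.printf("unmaintainability: %.2f%n", unmaintainability(hotspot));
    }
}
```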
Managers who tried out this product tell us that there is a one-to-one correspondence between the issues they’re finding in the field after they release their code and the hotspots highlighted by our product. There is also a direct correlation between the experience of the development team developing the product and the risk our product highlighted in that piece of code.
While managers had informal notions of these metrics in the past, armed with this data, they are better equipped to discuss deadline and resource allocation issues both with their upper management and with the development team.