Summary
This is the fourth installment in the "Working Effectively With Characterization Tests" series. This time we look at how automation can help you create and/or improve your characterization tests using JUnit Factory, a free, web-based, experimental characterization test generator (and my pet research project).
In part 2 we wrote a set of characterization tests by hand. In part 3 we saw how those tests help you catch unwanted or unintended changes in the behavior of legacy code. In this part I want to have some fun and introduce you to a pet research project of mine: JUnit Factory.
JUnit Factory, JUF for short, is a free web-based characterization test generator for Java. You send JUF some Java code and, if it can make any sense out of it, it sends you back JUnit characterization tests.
JUF is one of Agitar Labs' (http://www.agitar.com/company/agitarlabs.html) initiatives, aimed primarily at test automation researchers and computer science students (it has already been used in a computer science class assignment at Carnegie Mellon University). Anyone can use it, though, with the caveat that they have to be comfortable sending their code to the JUF servers over the Internet.
Before we jump into our example, I'd like to be clear about one thing: I believe that developers should take responsibility for unit testing their code. Ideally, every piece of code should be accompanied by unit tests. I see test automation tools playing a key role in making developers more efficient and effective at testing their own code, but I don't believe in developers abdicating all responsibility for testing and expecting a push-button tool to do all their testing work for them (even if such a tool were possible – which it isn't). In other words, I believe in the developer and the test automation tools working together, each doing what they are best at. This applies to testing both new code and legacy code.
I hope I can give you a flavor of what I mean by this developer + test automation tool cooperation with the examples that follow.
As a reminder, this is the original legacy code that we inherited:
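(The listing below is reconstructed from the behavior that the tests in this article pin down; BQ and BCR are mentioned by name later in the discussion, while the over-quota multiplier names and factoring are placeholders consistent with the tests.)

public class SalesUtil {

    static final double BQ = 10000.00;  // base quota
    static final double BCR = 0.20;     // base commission rate
    static final double OQM1 = 1.5;     // over-quota multiplier, first tier (placeholder name)
    static final double OQM2 = 3.0;     // over-quota multiplier, second tier (placeholder name)

    static double calculateCommissionDue(double totSales) {
        if (totSales <= BQ) {
            return totSales * BCR;
        } else if (totSales <= BQ * 2) {
            return BQ * BCR +
                   (totSales - BQ) * BCR * OQM1;
        } else {
            return BQ * BCR +
                   (totSales - BQ) * BCR * OQM1 +
                   (totSales - BQ * 2) * BCR * OQM2;
        }
    }
}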
And these are the characterization tests that we lovingly hand-crafted for it:
public void testCalculateCommissionDue1() {
    assertEquals(200.0, SalesUtil.calculateCommissionDue(1000.0), 0.01);
}

public void testCalculateCommissionDue2() {
    assertEquals(5000.0, SalesUtil.calculateCommissionDue(20000.0), 0.01);
}

public void testCalculateCommissionDue3() {
    assertEquals(14000.0, SalesUtil.calculateCommissionDue(30000.0), 0.01);
}
Let’s see what kind of characterization tests JUnit Factory comes up with. Since I have downloaded the Eclipse plug-in for JUF, all I have to do is press the JUF Generate Tests button and, a few seconds later, JUF sends back the following tests for the method calculateCommissionDue:
public void testCalculateCommissionDue() throws Throwable {
    double result = SalesUtil.calculateCommissionDue(9999.999);
    assertEquals("result", 1999.9998, result, 1.0E-6);
}

public void testCalculateCommissionDue1() throws Throwable {
    double result = SalesUtil.calculateCommissionDue(20000.001);
    assertEquals("result", 5000.0009, result, 1.0E-6);
}

public void testCalculateCommissionDue2() throws Throwable {
    double result = SalesUtil.calculateCommissionDue(10000.001);
    assertEquals("result", 2000.0003000000002, result, 1.0E-6);
}

public void testCalculateCommissionDue3() throws Throwable {
    double result = SalesUtil.calculateCommissionDue(10000.0);
    assertEquals("result", 2000.0, result, 1.0E-6);
}

public void testCalculateCommissionDue4() throws Throwable {
    double result = SalesUtil.calculateCommissionDue(20000.0);
    assertEquals("result", 5000.0, result, 1.0E-6);
}

public void testCalculateCommissionDue5() throws Throwable {
    double result = SalesUtil.calculateCommissionDue(19999.999);
    assertEquals("result", 4999.9997, result, 1.0E-6);
}

public void testCalculateCommissionDue6() throws Throwable {
    double result = SalesUtil.calculateCommissionDue(0.0);
    assertEquals("result", 0.0, result, 1.0E-6);
}
Hey, these tests look pretty darn good – if I may say so myself – and the price is right. But, of course, I am biased. So let me tell you why I consider these generated tests to be useful, and also how they can be improved.
For one thing, I like the fact that, in addition to testing the basic values (e.g. 10000 and 20000), JUnit Factory applied boundary value analysis (a testing best practice) and created test cases just above and just below the boundary values (e.g. 9999.999 and 10000.001). I should have had those values in my tests.
In addition to boundary value testing, JUF applied a classic testing heuristic and used 0.0 as an input value. I like this because: 1) using zero for a numerical input is always a good testing idea, and 2) it got me thinking and helped me realize that while the existing code does the right thing with 0.0, it will also gladly accept a negative number for totSales and return a negative commission. Now, call me paranoid or over-protective, but this method begs for some input checking.
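One way to add that check without touching the legacy calculation yet is a checked wrapper. Here's a minimal sketch (the wrapper name is hypothetical, and whether to throw, clamp to zero, or do something else is really a business decision):

static double calculateCommissionDueChecked(double totSales) {
    // Reject nonsensical input instead of silently paying a negative commission.
    if (totSales < 0.0) {
        throw new IllegalArgumentException(
                "totSales must be non-negative, got: " + totSales);
    }
    return SalesUtil.calculateCommissionDue(totSales);
}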
By looking at the tests, I also realize that with this code I have a real problem with fractional cents, both in the input and the output. The double data type is not ideal for representing dollars. I knew this all along, but seeing an assertion like the following really drives the point home:
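    assertEquals("result", 2000.0003000000002, result, 1.0E-6);

That's three hundredths of a cent of commission, with a trailing ...0000000002 courtesy of binary floating point – neither belongs anywhere near a real payment.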
As much as I like the 0.0 input, why didn't JUF use some negative values? I would have also liked to see a very large number; that would have helped me think about putting a reality-check upper bound on the input – before we accidentally pay a commission of several million dollars.
Why doesn't JUnit Factory generate the tests I just described? It's not a technical problem; it would be very easy for us to add a heuristic that generates additional tests with negative and/or very large values for the totSales parameter. The answer is that we are trying to find the proper balance between bare minimum and overkill in the number and types of tests JUnit Factory generates. This is one of the aspects that makes it experimental, and there are other default behaviors that are up for debate. A few of the many things we are trying to decide are:
Should we make assertions on private fields? Some people believe that having to assert on private fields is an indication that there’s something wrong with your design. Others believe that testing trumps encapsulation.
If a method has an object parameter, should we always generate a test using a null value? Some think that testing some methods with null is a waste of time – at best: "A null will never make it this far". Others have seen too many unexpected NullPointerExceptions percolate up to the end user, and believe that having such a test might help developers think more carefully about their null-handling behavior (see the sketch after this list).
If we can’t construct an object, should we automatically mock it? How far do we take mocking? Some believe that proper unit tests should make extensive use of mocks. Others believe that mocks are a weapon of last resort since they can hide serious problems between collaborating classes.
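To make the null-value question concrete, a generated test of that kind might look something like this (OrderProcessor and its process method are made up purely for illustration):

public void testProcessThrowsNullPointerException() throws Throwable {
    try {
        new OrderProcessor().process(null);
        fail("Expected NullPointerException to be thrown");
    } catch (NullPointerException ex) {
        // Characterizes the current behavior: a null argument blows up here
        // rather than being rejected with a clearer error message.
    }
}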
All good questions, with strong arguments and proponents on both sides. Why don't you give JUnit Factory a try yourself, with your own sample code, and let us know what you think? The simple web-based demo (http://www.junitfactory.com/demo/) gives you an opportunity to rate the generated tests (1 to 5 stars) and to provide free-form text feedback. For the full JUnit Factory experience, you should download the Eclipse plug-in.
The way I see it, the best way to use automatically generated characterization tests is to treat them as a starting point. These tests get my testing juices flowing and make me think of cases I might otherwise have ignored. But, ultimately, I believe in taking control: I keep the test cases I like (possibly editing them), add my own, and if some generated tests don't make sense or don't apply, I simply delete them from the set.
In this case, I combined my original tests with the generated tests and added a few of my own to characterize behavior for negative and very large input values. Below is the result:
public void testCalculateCommissionDue1() throws Throwable {
    double result = SalesUtil.calculateCommissionDue(-10000.00);
    assertEquals("result", -2000.00, result, 0.01);
}

public void testCalculateCommissionDue2() throws Throwable {
    double result = SalesUtil.calculateCommissionDue(-0.01);
    assertEquals("result", 0.0, result, 0.01);
}

public void testCalculateCommissionDue3() throws Throwable {
    double result = SalesUtil.calculateCommissionDue(0.0);
    assertEquals("result", 0.0, result, 0.01);
}

public void testCalculateCommissionDue4() throws Throwable {
    double result = SalesUtil.calculateCommissionDue(1000.00);
    assertEquals("result", 200.00, result, 0.01);
}

public void testCalculateCommissionDue5() throws Throwable {
    double result = SalesUtil.calculateCommissionDue(9999.99);
    assertEquals("result", 2000.00, result, 0.01);
}

public void testCalculateCommissionDue6() throws Throwable {
    double result = SalesUtil.calculateCommissionDue(10000.0);
    assertEquals("result", 2000.00, result, 0.01);
}

public void testCalculateCommissionDue7() throws Throwable {
    double result = SalesUtil.calculateCommissionDue(10000.01);
    assertEquals("result", 2000.00, result, 0.01);
}

public void testCalculateCommissionDue8() throws Throwable {
    double result = SalesUtil.calculateCommissionDue(19999.99);
    assertEquals("result", 5000.00, result, 0.01);
}

public void testCalculateCommissionDue9() throws Throwable {
    double result = SalesUtil.calculateCommissionDue(20000.0);
    assertEquals("result", 5000.00, result, 0.01);
}

public void testCalculateCommissionDue10() throws Throwable {
    double result = SalesUtil.calculateCommissionDue(20000.01);
    assertEquals("result", 5000.00, result, 0.01);
}

public void testCalculateCommissionDue11() throws Throwable {
    double result = SalesUtil.calculateCommissionDue(999999.99);
    assertEquals("result", 886999.99, result, 0.01);
}
Too many tests? Too few? Just right?
What about the pesky fractional cents problem? Is it acceptable to check the commission to the nearest cent?
Going forward, should we create a DollarAmount class instead of using a double type? Should we throw an exception for negative values or unrealistically large values?
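To make the first question concrete, here is a minimal sketch of what such a class might look like (DollarAmount is hypothetical – nothing in this series defines it yet – and it is built on BigDecimal precisely to avoid double's rounding artifacts):

import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical sketch: an immutable dollar amount, always rounded to whole
// cents, so fractional-cent artifacts cannot creep into commissions.
public final class DollarAmount {

    private final BigDecimal amount; // always kept at scale 2 (whole cents)

    private DollarAmount(BigDecimal amount) {
        this.amount = amount.setScale(2, RoundingMode.HALF_UP);
    }

    public static DollarAmount fromString(String dollars) {
        // Parse from a string (not a double) to avoid binary rounding error.
        return new DollarAmount(new BigDecimal(dollars));
    }

    public DollarAmount plus(DollarAmount other) {
        return new DollarAmount(amount.add(other.amount));
    }

    public DollarAmount times(BigDecimal rate) {
        return new DollarAmount(amount.multiply(rate));
    }

    @Override
    public String toString() {
        return "$" + amount.toPlainString();
    }
}

With a type like this, a commission of $2000.0003 simply cannot exist, and the rounding policy lives in one place instead of being implied by the deltas scattered through the assertions.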
As you can see, tests – even those that are automatically generated – really help you think about current and potential problems. Sometimes they raise a lot of questions – good questions.
That’s it for this installment. I hope you found this detour into JUnit Factory interesting and that it motivated you to experiment with it yourself.
Also, if you have followed this series on characterization tests so far, please let me know what you think of it and what you’d like me to cover next.
I'm afraid I remain unconvinced. Good unit tests make it easier to change your code. I imagine that 6 months after generating these tests, you find out that, say, BQ changes. What effect will the tests have then? Or that your rules for calculating the total change?
Second: good tests have a documentation effect. This is especially useful for the boundary conditions you mention. But neither the names of the test methods nor the assertion string ("result"?! That's not helping!) help the reader of the test understand which boundary conditions are being tested. This also means that, despite the good coverage of the tests, the reader may be skeptical about whether all boundaries are covered. (And I am not sure boundary conditions are that useful to test if you generate the tests from the boundaries!)
So in conclusion, characterization tests may be useful if you need to modify complex legacy code, but they are not a replacement for writing good tests yourself.
With regard to your questions: Right number of tests? I don't know: Have you asked Jester? Having tests generated when using a Money type would be very interesting. I hardly ever use primitive types, and JUF would have problems inferring the relationship between Money.add and Money.subtract, I imagine.
Finally, I agree with you that reading the tests raises a number of interesting questions that could potentially help when working with the code.
> I'm afraid I remain unconvinced. Good unit tests make it easier to change your code. I imagine that 6 months after generating these tests, you find out that, say, BQ changes. What effect will the tests have then? Or that your rules for calculating the total change?
Hi Johannes,
Thank you for your reply – skepticism is a healthy thing.
I assume that in your first question you are suggesting that the assertions should use the variables (e.g. BQ and BCR) instead of the actual values, so that a change in those variables does not break the tests. In other words, instead of:
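    assertEquals("result", 2000.0, result, 1.0E-6);

the test would assert against the constants themselves:

    assertEquals("result", BQ * BCR, result, 1.0E-6);

(taking the generated test for the 10000.0 input as the example, since BQ * BCR works out to exactly 2000.0).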
> Second: good tests have a documentation effect. This is especially useful for the boundary conditions you mention. But neither the names of the test methods nor the assertion string ("result"?! That's not helping!) help the reader of the test understand which boundary conditions are being tested. This also means that, despite the good coverage of the tests, the reader may be skeptical about whether all boundaries are covered. (And I am not sure boundary conditions are that useful to test if you generate the tests from the boundaries!)
One of our research projects is to come up with more descriptive names for the test methods, but it's a non-trivial problem (to say the least).
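For example, instead of testCalculateCommissionDue, something like the following would carry far more meaning (a hand-written illustration, not actual JUF output):

public void testCommissionJustBelowFirstBonusBoundary() throws Throwable {
    double result = SalesUtil.calculateCommissionDue(9999.999);
    assertEquals("commission at the base rate, just below the boundary",
            1999.9998, result, 1.0E-6);
}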
The way to check that the generated tests cover all boundaries is to use a coverage analyzer. Our Eclipse test runner (which is included with the JUnit Factory plug-in) includes built-in code coverage.
> So in conclusion, characterization tests may be useful if you need to modify complex legacy code, but they are not a replacement for writing good tests yourself.
Yes. If the code already came with a great set of tests, it would not benefit as much from characterization tests - and it would not be legacy code in the first place since I like Michael Feathers' definition of legacy code as code without tests.
But, in my experience, even if you have written some manual tests, it's very easy to overlook some interesting and bug-prone error conditions. Most manual tests focus on what is commonly called the "happy path": they use basic values and ignore important corner cases.
Even if you have developed your code test-first, you will often get some additional insight and other interesting tests by running JUnit Factory. I recently recorded a session with Kent Beck where we developed some code using TDD and then used JUnit Factory to discover some interesting test cases that made us change the code. Let me see if this session is already on our website and if not, we'll put it up and let you know.
One thing I am discovering is that if JUF generated exactly the same tests I would have written manually, it would not be nearly as valuable. The fact that the generated tests use input values I would not normally use, and do other strange things, is a plus.
Let's assume that you have two (non-empty) sets of tests, A and B, neither of which is complete. In the extreme case where A == B, one set of tests is redundant – a waste of time. The best situation is when the intersection (in the set-theory sense) of A and B is empty. Another subject worth expanding on – I'll probably blog about it soon.
> With regard to your questions: Right number of tests? I don't know: Have you asked Jester?
Good to know other people are familiar with mutation testing. We are fans of Jeff Offutt (one of the pioneers of mutation testing), and we used mutation testing (MuJava, though, not Jester) in our research and reached some very interesting conclusions. A fascinating topic – I'll try to cover it in another blog soon.
> Having tests generated when using a Money type would be very interesting. I hardly ever use primitive types, and JUF would have problems inferring the relationship between Money.add and Money.subtract, I imagine.
In the next installment, I will probably create a Money type and see how my tests do.
> Finally, I agree with you that reading the tests raises a number of interesting questions that could potentially help when working with the code.
Thanks for recognizing that. It's one of the key benefits of automatically generated characterization tests.
Thanks again for the feedback and interesting discussion. As I mentioned, I consider JUnit Factory an experimental proving ground; we can try all sorts of fun and interesting things when generating tests but we need feedback and suggestions like yours.
Alberto, this is precisely the kind of help I expected Agitator/AgitarOne/JUnitFactory to give me as a TDD practitioner. This is a tremendous help to people with experience writing characterization tests for legacy code. I firmly hope that people new to TDD, who don't have much knowledge of testing, don't delude themselves into thinking that generating characterization tests obviates the need to think about it. They might miss a few things, such as:
- Do the inputs make sense? (Should we test 0? Negative?)
- Do the outputs make sense? (Is 1.50000000000002 bad?)
- Are there any other inputs missing?
In the hands of a seasoned TDD practitioner, though, this looks tremendous. Thanks for a thorough example.
> Alberto, this is precisely the kind of help I expected Agitator/AgitarOne/JUnitFactory to give me as a TDD practitioner. This is a tremendous help to people with experience writing characterization tests for legacy code. I firmly hope that people new to TDD, who don't have much knowledge of testing, don't delude themselves into thinking that generating characterization tests obviates the need to think about it. They might miss a few things, such as:
>
> - Do the inputs make sense? (Should we test 0? Negative?)
> - Do the outputs make sense? (Is 1.50000000000002 bad?)
> - Are there any other inputs missing?
>
> In the hands of a seasoned TDD practitioner, though, this looks tremendous. Thanks for a thorough example.
Thank you JB. Coming from you this means a lot.
It's clear that we have quite a bit of education to do on both sides.
1) The uninitiated in developer testing (TDD or otherwise) need to learn that there is no silver bullet for automated test generation. Some aspects of testing can be automated with great success. Some cannot be automated at all. But most require user interaction and cooperation with the test automation tool to get the full benefit.
2) The already test-infected need to learn that some amount of test automation is not only NOT EVIL (sorry, I'm an ex-Googler :-)), but necessary to A) make sure you have not overlooked anything and B) automate the generation of the more mundane tests so you can focus on the more complicated cases where the tools don't do as well and where human intelligence and creativity are required. When it comes to automated test generation, a lot of TDD/XP practitioners seem to have a strong built-in resistance, but I believe that by shunning it completely they are throwing the baby out with the bathwater.
Perhaps some people would be more open to automated test generation if we started calling it test amplification – since amplification implies some starting input.
I know for a fact that what keeps most programmers away from practicing developer testing is the perceived, or actual, amount of effort they associate with it. Most developers know that testing is good – not just for the project and for the noble feeling of doing something right, but also for their own selfish motives (i.e. less time spent chasing bugs and reworking code). It's just that they don't see themselves spending even as little as 20% of their time writing reusable tests instead of code.
This is where I believe test automation and test amplification will help.
Thanks again for the kind comments and for having an open mind.
You're welcome, Alberto. We can certainly bridge the gap here. I still have one concern: if I amplify my testing with generated characterization tests for legacy code, do I rob myself of feeling the pain of that code? Do I therefore rob myself of learning about the specific design problems in the code?
On the one hand, if I don't understand the design problems, I am less likely to do something about them. On the other hand, with much legacy code, it's almost better /not/ to see the problems, because I can't do anything about them yet, anyhow.
I think this is a great way to break the chicken-and-egg problem with legacy code: I can't refactor without tests and I can't write tests without refactoring. I'd be interested to see generated characterization tests for a more involved call stack.
> On the one hand, if I don't understand the design problems, I am less likely to do something about them. On the other hand, with much legacy code, it's almost better /not/ to see the problems, because I can't do anything about them yet, anyhow.
Here's an idea that I have been flirting with...
Well-designed code is easier to test, right?
If it's also true that it's easier to _generate_ tests for well-designed code, then test generation may be a useful proxy for measuring how well-designed a body of code is.
That's a big 'if' but, in the experiments that I have done so far, I see a dramatic difference in the quality of the generated tests on TDD'd code versus code written with no tests.
So, while it may be true that a test-infected TDDer gets less incremental benefit from JUnit Factory than someone who currently writes no tests at all, the TDDer has to invest less additional time to produce adequate characterization tests.
Conversely, code that is hard to test (or generate tests for) is often poorly designed. If JUnit Factory gets stuck on poorly designed code, the best remedy is often to improve the design by refactoring.
It's currently only a tentative hypothesis of mine, and the results may not support it – but I am having a fun time doing the experiment!