Testing hard algorithms
I developed software to find a maximum common subgraph (MCS) given a
set of molecules represented as a chemical graph. It's called fmcs. My previous three
essays covered the background of the MCS problem, introduced fmcs, and
gave an example of when MCS is used.
What I didn't describe was the mental effort it took to develop this
program. This is the second time I've written code to find the
multiple structure MCS, and both times it took a couple of months and
put my head in a very strange place. You would think the second time
would be easier, but instead it meant I spent more time adding features
and doing things that my first version couldn't begin to handle. (Did
someone just mutter "second system effect"? Pshaw!)
This time around, I also had a better understanding of the development process.
I think I know why it's so much harder than most of the software I
develop. Back in the late 1990s, I almost never wrote automated tests,
and never wrote an extensive test suite. Nowadays I'm pretty thorough,
and use coverage-guided techniques to get good, extensive tests
through some semi-permanent API layer that will allow me to refactor
the internals without changing the tests.
Think about the problem for a moment. How would you write an algorithm
to find the largest common subgraph between two graphs?
Take your time...
You probably came up with a graph-walking algorithm which tries
different match pairings and uses backtracking or lazy evaluation to
search all of graph space. The more mathematically inclined might have
recast this as a maximum clique problem.
In both cases there are a lot of arbitrary decisions to make. Which
pairings should you investigate first? Are there times when you can
prune the search space?
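To make that concrete, here is a toy backtracking sketch. It is only an
illustration of the general idea, not fmcs itself: the graphs are plain
adjacency dicts, and the nested loops are exactly the kind of arbitrary
pairing decision I mean.

    # A toy search for the size of the largest common induced subgraph.
    # Graphs are dicts mapping each node to the set of its neighbors.
    # Illustrative only -- fmcs works on chemical graphs and does far more.
    def mcs_size(g1, g2):
        best = [0]

        def extend(mapping):
            best[0] = max(best[0], len(mapping))
            for n1 in g1:                     # arbitrary decision: which
                if n1 in mapping:             # pairing do we try first?
                    continue
                for n2 in g2:
                    if n2 in mapping.values():
                        continue
                    # The new pair must preserve adjacency with respect
                    # to every pair already in the mapping.
                    if all((m1 in g1[n1]) == (m2 in g2[n2])
                           for m1, m2 in mapping.items()):
                        mapping[n1] = n2
                        extend(mapping)
                        del mapping[n1]       # backtrack and try another

        extend({})
        return best[0]

    triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
    path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
    print(mcs_size(triangle, path))  # prints 2: the shared part is one edge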
There are a lot of heuristics, and some of them are only triggered
under unusual circumstances, after a bunch of combinatorial
possibilities have already been explored. Outside of a few minor
components, I couldn't figure
out how to write unit tests for the code.
I ended up pushing most of the testing into validation testing of the
complete software, which meant I wrote some 1,000+ lines of code
without strong testing. Moreover, I used a new approach, so the
algorithm I was working on doesn't have the track record of the
standard clique-based or backtracking approaches.
So the stress I faced was a combination of not knowing if the
algorithm would work, and not being able to develop enough test cases
to provide some validation during the development.
As a rule of thumb, it's easiest to fix bugs which are caught
early. Unit tests and evolutionary prototyping are two of the
techniques that people use to tighten the feedback loop between
specification, implementation, and validation. I think my stress level
is proportional to the size of the feedback loop.
What testing did I do?
I mean, I did have tests during development. I came up with a few
examples by hand, and I did a substructure search of a large data set
and verified that the MCS code found that substructure. But I know
that's not enough tests. I know this because after six weeks of
development and over 1000 lines of code, I spent another two weeks
doing a large amount of post-development testing, and found several
bugs.
I did most of my tests based on the ChEMBL data: random pairs of
structures; the k=2, k=10, and k=100 nearest neighbors; and the k<=100
neighbors with similarities of at least 0.95, 0.9, and 0.8. At the end,
I also ran tests based on the ChEBI structure ontology.
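For a sense of how such neighbor sets can be built, here's a rough
sketch using RDKit fingerprints. The function, its name, and its
parameters are my illustration here, not the actual test scripts:

    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    def nearest_neighbors(smiles_list, k=10, threshold=0.8):
        # Fingerprint every parseable structure.
        mols = [Chem.MolFromSmiles(s) for s in smiles_list]
        fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 2048)
               for m in mols if m is not None]
        for i, fp in enumerate(fps):
            # Rank everything else by Tanimoto similarity; keep at most
            # k neighbors at or above the threshold.
            scored = sorted(((DataStructs.TanimotoSimilarity(fp, other), j)
                             for j, other in enumerate(fps) if j != i),
                            reverse=True)
            yield i, [j for score, j in scored[:k] if score >= threshold]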
What bugs did I find? I think it's educational to characterize a few of them.
Typo caught by an assertion check
One of the simplest bugs was a poorly formatted string. I used "%d"
when I should have used "%%%d". A bit of jargon for those in the
know: I generate SMARTS strings for the intermediate subgraphs. If
there are more than 9 open rings then the syntax goes from a single
digit closure number to a closure number like "%10". I forgot to
include the '%' for that case, probably because the '%' was
already there for the format string.
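In outline, the function was something like this reconstruction (the
real code differs, but the gist is the same):

    # Sketch of the gist of the bug, not the actual fmcs function.
    # SMARTS ring closures 1-9 are bare digits; 10 and up need a
    # leading '%', as in "%10".
    def closure_label(closure):
        if closure < 10:
            return "%d" % closure
        return "%%%d" % closure   # the bug: this used "%d" instead

    assert closure_label(3) == "3"
    assert closure_label(12) == "%12"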
This wasn't triggered by my random-pairs test nor by my various
similarity-search-based tests. Only when I ran through the ChEBI data
did I get an assertion failure, when RDKit refused to accept my SMARTS
string. That was the first time I had a SMARTS with 10 unclosed
rings.
As it happens, this error could have been caught by one style of unit
testing. It's a four-line function which takes a string and a
number. I didn't test it because I feel that testing directly against
internal functions inhibits refactoring. I prefer to test against
higher-level, more stable APIs.
It's usually easy to come up with high-level test cases which trigger
a certain function call. But not in this case. The MCS search
algorithm, while deterministic, uses so many arbitrarily defined
values that I couldn't on my own come up with a test case with at
least 10 open ring closures. And even if I did, a new search heuristic
might change things so that only 7 open ring closures were needed.
I felt that my validation testing and the assertion check would be
enough to identify if there was a problem. A low-level unit test might
have helped, especially as I still don't have a specific test for that
failure.
Cross-validation testing
The first MCS papers are about as old as I am. Some of the other
implementations are available both at no cost and with no prohibitions
on using them to develop a new MCS algorithm. (Some commercial
companies don't like you using their software to write new software
which competes with theirs, or even using their software for benchmark
comparisons.)
I tested pairs of structures against SMSD and I
did more extensive tests against Indigo's MCS
implementation.
This is cross-validation testing. It's a relatively rare technique
because the cost of producing multiple identical implementations
usually isn't worth the benefits. Even here the results aren't exactly
identical because of differences in how the toolkits perceive
chemistry, and more specifically, aromaticity. I ended up spending a
lot of investigation time staring at cases with different answers and
trying to figure out if it was a chemistry problem or an MCS algorithm
implementation problem.
I found that SMSD had a bug in one of its options, which I
reported. The code had been fixed internally but not pushed to the
outside world. Its default mode and fmcs matched quite well, except
for a couple of chemistry differences. The new version is out - I need
to test it again.
The only problem I found in the Indigo code was a part of their setup
code which didn't check the timeout. That's also been fixed after I
reported it.
The cross-validation with Indigo found problems in my code. For
example, I was often getting smaller MCSes than Indigo. After looking
at them, I figured out that my code didn't correctly handle the case
where a molecule fragmented after a bond was removed because its
bond type wasn't present in all of the structures.
My code usually got the right answer when using highly similar
structures, for obvious reasons. It was only in the random-pair testing
that the problem really stood out.
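The repair has to consider every fragment that's left once the
offending bonds are gone. Here's a rough sketch of that step using
RDKit calls; the function and its name are mine for illustration, not
the fmcs internals:

    from rdkit import Chem

    def candidate_fragments(mol, bad_bond_ids):
        # Remove the bonds whose bond type isn't present in every
        # structure; the molecule may fall apart into fragments.
        emol = Chem.RWMol(mol)
        for bond_id in bad_bond_ids:
            bond = mol.GetBondWithIdx(bond_id)
            emol.RemoveBond(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx())
        # Each connected component is a candidate seed for the MCS
        # search, not just whichever fragment happens to come first.
        return Chem.GetMolFrags(emol.GetMol(), asMols=True,
                                sanitizeFrags=False)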
It's not easy to build a specific test case for this error. I don't
know which of the two fragments will be first - it depends on so many
arbitrary decisions which could change as the software is improved.
Bad heuristics
Most of the implementation contains heuristics. There's a
branch-and-bound search, there's subgraph canonicalization to reduce the number
of substructure tests, and so on. Each of these is supposed to help
make the code faster.
One of the tests takes the current subgraph and the list of "outgoing"
bonds to see how much of the remainder of the graph is accessible for
growth, using that subgraph as a seed. It took a couple of days to
think of the algorithm, write the code, and get it working. I then ran
the timing tests, only to find out it was a mere 1% faster.
After looking at the algorithm a lot, I finally realized that I had
forgotten to exclude additional bonds from consideration. With two
changed lines, the code was 30% faster.
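The idea is roughly a breadth-first walk outward from the seed
subgraph, counting bonds which could still be added. The following is
my hypothetical reconstruction using RDKit-style calls, with the fix
included; the real code is organized differently:

    from collections import deque

    def accessible_bond_count(mol, seed_atom_ids, excluded_bond_ids):
        # Count bonds still reachable from the seed subgraph; the MCS
        # can't grow by more than this, so a small count lets the
        # branch-and-bound search prune early. In this sketch, the fix
        # is seeding seen_bonds with the bonds already ruled out.
        seen_atoms = set(seed_atom_ids)
        seen_bonds = set(excluded_bond_ids)
        queue = deque(seed_atom_ids)
        count = 0
        while queue:
            atom = mol.GetAtomWithIdx(queue.popleft())
            for bond in atom.GetBonds():
                bond_id = bond.GetIdx()
                if bond_id in seen_bonds:
                    continue
                seen_bonds.add(bond_id)
                count += 1
                other = bond.GetOtherAtomIdx(atom.GetIdx())
                if other not in seen_atoms:
                    seen_atoms.add(other)
                    queue.append(other)
        return count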
I think I spent two weeks running test upon test. I tried random
pairs of compounds pulled from ChEMBL, I tried k-nearest neighbors, I
tried all similar structures (up to k=100) at different similarity
thresholds. Even then, I still found a bug when I switched over to
exploring the ChEBI structure ontology.
If unit testing co-exists with development, and integration tests come
downstream from that, then cross-validation is solidly in the
"verification" end of the spectrum. The MCS problem is NP-hard, and
there are a huge number of heuristics.