This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: MCS background
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
For the last couple of months I've been working on a new MCS
implementation. MCS is short for "Maxium Common Substructure", and is
the cheminformatics term for the maximum
common subgraph isomorphism problem. I'll be posting a set of
essays about it, so I'll start with a background piece.
I thought about going into the details of what that means, but the
detailed description was too boring. Here's the short version: given
two structures, find the largest substructure which is common to
both. These can be connected or disconnected substructures; I'm only
concerned with connected substructures. Even then there are still
variations; does "maximum" mean to maximize the number of atoms in the
subgraph, the number of bonds, the number of cycles, or perhaps
maximize some more physical property?
There are also variations in how you define atom and bond
equivalency. Two atoms might match if and only if they are of the same
element type, or you can be more lenient and say that any halogen
matches any other halogen. You can say that bonds only match bonds of
the same bond type, or that aromatic bonds are allowed to match single
or double bonds, or that ring bonds are only allowed to match other
ring bonds.
You can also make requirements on overall structure properties, like
if a ring bond is part of the MCS then it must also be a ring in the
MCS.
That said, the most common MCS is to say that atoms are the same if
and only if the element numbers are the same, and bonds are the same
if they have the same bond type (single, double, triple, and aromatic
types never match each other.) Here's an example of that MCS between
acesulfame
and saccharin:
The MCS problem is NP-hard. As the number
of atoms increases, the run-time should tend to increase
exponentially. An advantage to chemical graphs is that atoms have a
limited valence; few atoms have more than 4 bonds. This makes the MCS
problem a more tractable than the general case. An advantaget to
working in small-molecule chemistry is that N is usually in the 10s
and rarely ever over 100.
So let's make the MCS problem harder (although not in the theoretical
sense; it's still NP hard). Instead of finding the MCS between two
compounds, find the MCS of two or more compounds. This is sometimes
called the "Mulitple MCS problem."
This need often comes up during clustering, where you have have an
algorithm which grouped structures together based on various
properties, including experimental results. A question might be, is
there a structural explanation for this grouping? One way to answer it
is to look at the MCS and see if it's a significant part of the
structures.
The usual solutions
The multiple MCS problem is not new. People have developed algorithms
to find it for several decades. The 2002 review paper by Raymond and
Willett, "Maximum common subgraph isomorphism algorithms for the
matching of chemical structures", JCAMD 16: 521bvariety of solutions, and says "the clique-based approach has been the
most prevalent technique involving MCS-based chemical structure
manipulation." They mostly discuss the pairwise MCS, but there is some
mention of the multiple MCS as well.
A more recent implementation paper is "MultiMCS: a fast algorithm for
the maximum common substructure problem on multiple molecules" by
Hariharan, Janakiraman, Nilakantan, Singh, Varghese, Landrum, and
Schuffenhauer, J Chem Inf Model. 2011 Apr 25;51(4):788-806. Epub 2011
Mar 29. They use the clique-based approach to find all maximal common
substructures between pairs of structures, and use those to find the
maximum common substructure of the set.
I myself have worked on the multiple MCS problem before, when I was a
Bioreason employee in the late 1990s. There I used a back-tracking
solution based on "Backtrack Search Algorithms and the Maximal Common
Subgraph Problem" by McGregor, Software-Practice and Experience,
vol. 12, 23-34 (1982), and also informed by "An algorithm for the
multiple common subgraph problem", Bayada, Simpson, Johnson, and
Laurenco, J Chem Inf Model. 1992 v32, 680-685. It too enumerated
maximal pair-wise solutions to find the MCS for the entire set.
"Maximal" common subgraphs?
You need to be a bit careful when doing a pairwise MCS. It's easy to
think that the MCS for the entire set must be a substructure of the
MCS for any two pairs, but it's almost as easy to come up with a
counter-example. Consider: CCCSSNNN CCCPPPSS NNNPPPOSS. The MCS
between the first two structures is "CCC", between the first and third
is "NNN", and between the second and third is "PPP" while the MCS
between all three is "SS".
Instead, find all the "maximal" common subgraphs between a pair. These
are common subgraphs where it's impossible to add a bond to the
subgraph and still be common in both graphs. Any MCS must be part of
at least one of the maximal subgraphs between a pair of
structures. Once you have a maximal common substruture, find the
maximal common substructures between that and the next molecule. Keep
applying the MCS to successive pairs until done.
One of the nice advantages of this approach is that you get a parital
answer early. If you've gotten to the last pairing and ended up with a
common substructure for the entire set containing 6 atoms and 6 bonds,
then you can quickly reject any future maximal common substructures
which are smaller than that. This is called "branch and bound." The
algorithm explores different branches (in this case, different maximal
common substructure pairings) and sets a bound which can be used to
"prune" the branch, that is exclude further searching along a branch.
That was then, this is now
That was the general approach I did in the late 1990s. It took a lot
of mental strain and concentration for about 6 weeks, but at the end,
it did work. I even released the implementation as PyMCS, under a
Python-like license. You can see remnants of the documentation
thanks to Archive.org, but no one seems to have access to a copy of it
any more.
While it worked, there were some things I didn't like about
it. Molecules with lots of local symmetry cause problems, because
there are a large number of ways to get maximal common substructure
mappings out of it. There are, after all, 12 ways to map a benzene
ring to itself. Also, my implementation used threads, where each
thread generated the maximimal common substructures for a pair. These
days I would use a generator.
Instead, I came up with an alternate solution, based out of ideas I've
been thinking of regarding subgraph enumeration. Enumerate all of the
subgraphs of a reference structure, convert each subgraph into a
subgraph isomorphism query, and test if the subgraph exists in the
other structures. The largest such match is the MCS.
There can be exponentially many subgraphs in a structure, but there
are ways to make it faster. Again, use a branch and bound
technique. If a subgraph doesn't exist in all of the structures then
there's no way that any containing subgraph can exist in all
structures. So if you use a growth-based method to enumerate the
subgraphs, you can test each subgraph and reject the ones which don't
work, as well as any further growth in that direction.
Another optimization, also available in RASCAL and MultiMCS, is to
check the size of the remaining search space. If the current size,
plus the maximum number of atoms or bonds remaining, isn't better than
the current best match, then there's no need to keep on searching.
This ends up finding all of the common substructures, and not just the
maximal ones. There's a few advantage to this approach. You only need
to find one substructure match for each structure, not all of
them. Also, subgraph isomorphism tests are well-optimized core parts
of a cheminformatics toolkit, so that's work I don't have to do.
I didn't read this paper until this evening. It's really neat to see
the similarities between their ideas and mine, as well as the
differences. They omit a pre-processing step which removes obvious
bond mismatches, and they prefer checking all subgraphs of size N
before checking subgraphs of size N+1. I'll need to mull it over, and
dig up the enumeration method of Varkony, Shiloach, and Smith in
"Computer-assisted examination of chemical compounds for structural
similarities", J. Chem. Inf. Comput. Sci. 1979, 19 (2), 104bI can make a more comprehensive comparison. Grrr! And that's only
available from behind the ACS paywall.
In that paper they report that the MCS between acesulfame and
saccharin, in "level 1" (which corresponds to my "topology" option)
took their FORTRAN 77 program 21.15 CPU seconds on a Data General
ECLIPSE-MV/6000. My code on my desktop takes 0.010 seconds, or 2000
times faster. With 25 years between us, it's meaningless to make any
comparison of these other than the normal exclamation that computers
have gotten a lot faster. BTW, my PyMCS code from the late 1990s,
using larger structures, took about 10 seconds on average.