This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: fmcs - find the MCS of a set of compounds
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
My new MCS program is named "fmcs." It uses RDKit for the cheminformatics and is is
available under the BSD 2-clause license. Many thanks to Roche for
funding the project and making it available under a free
license. Their hope and mine is that fmcs will be included someday as
part of RDKit. For now, you can download it from the Bitbucket
site. My hope is that people will fund me to continue developing the
program.
Show me an example
I'll start with a set of 3669 compounds which I know have a benzotriazole
core. I know this because that data set come from doing a substructure
search. If you want to try this yourself, the SD file is part of the
fmcs distribution.
% fmcs sample_files/benzotriazole.sdf --verbose
Loaded 3669 structures from sample_files/benzotriazole.sdf
[#7]:1:[#7]:[#7]:[#6]:2:[#6]:[#6]:[#6]:[#6]:[#6]:1:2 9 atoms 10 bonds (complete search)
Total time 5.89 seconds: load 2.76 fragment 2.66 select 0.42 enumerate 0.06 (MCS found after 3.13)
The core result is the SMARTS pattern on the second line, which
represents the common substructure. The first and third line show text
sent to stderr when the "--verbose" flag is used. In this case, it
took 2.76 seconds to load the 3669 structures, 2.66 seconds to remove
atom-bond-atom patterns which aren't in all of the structures, 0.42
seconds to select the reference structure, and only 0.06 seconds to do
the actual MCS enumeration algorithm. (It's so fast for this case
because the preprocessing stage was enough to identify the MCS.)
Most people can't look at a SMARTS like
"[#7]:1:[#7]:[#7]:[#6]:2:[#6]:[#6]:[#6]:[#6]:[#6]:1:2" and easily
figure out what it means, so I'll use Karen Schomburg's excellent SMARTSviewer to
render it graphically:
as well as RDKit's image renderer to show the match to the first 10
structures of that data file.
Alternate atom and bond comparisons
ignore atom and bond types
By default, the subgraph matches atoms by element and bonds by bond
type. There are other options. The "--compare topology" option lets
any atom match any other atom, and any bond match any other bond.
Here is a graphical comparison of those two patterns:
That's very interesting. It looks like there's a common substructure,
based on the topology, but not when using on the default match
parameters. Why is this?
It's a matter of bond perception. Some of the rings found by the
topology search are aromatic in some structures but not in others, and
some of the bonds are double in one and single in others. I know I
gave PubChem a SMARTS pattern to find those structures. I wonder how
it ended up like it did.
Match atom types but ignore bond types
The "--compare topology" flag is a shortcut for "--atom-compare any
--bond-compare any". You can chose different options for comparing
atom types and bond type. For example, if you want to compare atoms by
element and ignore bond types then you can do "--atom-compare elements
--bond-compare any".
Coming up with a good example of when to use it is left to the student
as an exercise. (Let me know the answer when you figure it out, so I
can include it here.)
Bond modifiers
Chemists like rings. That's because rings are important. A problem
with MCS is that sometimes it matches a long chain of atoms with a
strange routing through a ring system. Chemists don't always like
that.
If you're one of those people, use the "--ring-matches-ring-only"
flag. This make ring atoms match only ring atoms and chain atoms only
match chain atoms.
Even that might be too generous. Sometimes chemists really want to see
a complete ring in the MCS, and not just a partial one. That is, if
the MCS includes a bond which maps to a ring bond in the original
structures, then the MCS bond must also be in a ring.
If that's what you want, use the "--complete-rings-only" option.
When I figure out good uses cases for these, I'll update the primary
documentation. (I'm told that they are important, but I don't have a
good example for you to understand why.)
All that, and more!
I'm still working on the documentation for the newest features, added
during the last couple of weeks. You can hijack the "isotopes" atom
matcher to define your own atom classes, and specify those classes
through a tag in the SD file. Instead of showing the SMARTS as the
result, you can see the fragment as SMILES, or you can save the
results to an SD file.