This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Finding the MCSes for the ChEBI ontology
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
The industrious folks at EBI have been developing ChEBI, which expands to
"Chemical Entities of Biological Interest." Quoting Wikipedia, "[ChEBI] is a database and ontology of molecular entities focused on 'small' chemical compounds, that is part of the Open Biomedical Ontologies effort."
They define several distinct ontologies. One is a chemical structure
ontology. For example, the identifier CHEBI:33567 contains
catecholamine, and a few examples of catecholamines are hexoprenaline
(CHEBI:37950), arbutamine (CHEBI:50580), and L-isoprenaline
(CHEBI:6257). In addition, catecholamine is a catechol (CHEBI:33566),
which in turn is a benzenediol (CHEBI:33570), and so on. A group can
have more than one parent; catecholamine is also a monoamine molecular
messenger (CHEBI:25375).
The end result is a hierarchical structure. At the bottom of the hierarchy
are structures, and each intermediate node is defined so that all of its
children share some common property.
Some of these common properties map directly to a common
substructure. For example, CHEBI:33853
contains phenols, so every compound under that node has "one or more
hydroxy groups attached to a benzene or other arene ring."
However, not all of them do. As Chepelev, Hastings, Ennis, Steinbeck,
and Dumontier pointed out in "Self-organizing
ontology of biochemically relevant small molecules", BMC
Bioinformatics 2012, 13:3, the term "'ester' includes compounds that
conform to C(=O)OC (i.e. carboxylic esters) and C(=S)OC patterns, among others."
Other cases can't even be represented as SMARTS. They give "bicyclic"
as one such example.
Can I find the MCS of all structures in a node in the ontology?
I was curious to see if I could use their data set as a test of fmcs. If each
intermediate node had a machine-readable flag saying whether it is a
purely substructure-based node, and if I could get the size
information, then I could get all the structures underneath it, find
the MCS, and compare my answer to theirs.
Alas, they don't have that annotation information. It's something they
are working on, but I didn't get the impression that it's a high
priority. (I don't see why it should be, either.)
Still, it's an interesting thought - what if I were to generate the
MCS for all nodes, and visualize the results somehow?
It took a bit longer than I thought, but I finally downloaded their
ontology (in OBO format), parsed it, extracted the hierarchy, figured
out the compounds in each node, tossed out the structures that RDKit
couldn't parse, and the nodes which didn't have at least two remaining
structures in them.
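The hierarchy-extraction step can be sketched roughly as follows. This is a minimal illustration, not the code used for the analysis: it assumes only the `[Term]` stanzas and their `id:`, `name:`, and `is_a:` tags matter, the sample text is a hand-written fragment built from the identifiers mentioned above, and the function names `parse_obo_terms` and `structures_by_node` are made up for this example. Which terms count as structures is passed in by the caller, since how that is recorded in the real file is not covered here.

```python
from collections import defaultdict

# Hand-written OBO-style fragment using identifiers mentioned in the post.
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: CHEBI:33570
name: benzenediol

[Term]
id: CHEBI:33566
name: catechol
is_a: CHEBI:33570 ! benzenediol

[Term]
id: CHEBI:25375
name: monoamine molecular messenger

[Term]
id: CHEBI:33567
name: catecholamine
is_a: CHEBI:33566 ! catechol
is_a: CHEBI:25375 ! monoamine molecular messenger

[Term]
id: CHEBI:37950
name: hexoprenaline
is_a: CHEBI:33567 ! catecholamine

[Term]
id: CHEBI:50580
name: arbutamine
is_a: CHEBI:33567 ! catecholamine

[Term]
id: CHEBI:6257
name: L-isoprenaline
is_a: CHEBI:33567 ! catecholamine
"""


def parse_obo_terms(lines):
    """Map each term id to {"name": ..., "parents": [...]} from [Term] stanzas."""
    terms = {}
    in_term = False
    entry = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are kept.
            in_term = (line == "[Term]")
            entry = None
            continue
        if not in_term or ": " not in line:
            continue
        key, value = line.split(": ", 1)
        value = value.split(" ! ", 1)[0].strip()  # drop the trailing comment
        if key == "id":
            entry = terms.setdefault(value, {"name": "", "parents": []})
        elif entry is not None and key == "name":
            entry["name"] = value
        elif entry is not None and key == "is_a":
            entry["parents"].append(value)
    return terms


def structures_by_node(terms, structure_ids):
    """Map each ancestor node to the set of structure ids found beneath it."""
    members = defaultdict(set)
    for sid in structure_ids:
        seen = set()
        stack = list(terms.get(sid, {}).get("parents", []))
        while stack:  # walk upward; a node may have more than one parent
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            members[node].add(sid)
            stack.extend(terms.get(node, {}).get("parents", []))
    return members


terms = parse_obo_terms(SAMPLE_OBO.splitlines())
structures = {"CHEBI:37950", "CHEBI:50580", "CHEBI:6257"}
members = structures_by_node(terms, structures)
# Keep only the nodes with at least two structures under them.
usable = {node: ids for node, ids in members.items() if len(ids) >= 2}
for node in sorted(usable):
    print(node, terms[node]["name"], len(usable[node]))
```

Walking upward from each structure, rather than downward from each node, handles nodes with several parents without any special casing.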
Once that was done, I let my MCS algorithm loose on it. It took about 50
minutes to process. (Well, I had a 15 second timeout on the MCS. I've
found that 15 seconds is usually good enough.)
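For readers who want to try a per-node MCS search with a timeout, here is a hedged sketch using RDKit's `rdFMCS.FindMCS` as a stand-in for the fmcs tool used in this post. The helper name `find_node_mcs` and the example SMILES (dimethyl and diethyl phosphate, plus one deliberately broken string) are my own illustrative choices, not data from the ChEBI run.

```python
from rdkit import Chem
from rdkit.Chem import rdFMCS


def find_node_mcs(smiles_list, timeout=15):
    """Return (smarts, num_atoms, timed_out), or None if fewer than two
    structures survive parsing. `find_node_mcs` is a hypothetical helper."""
    mols = [Chem.MolFromSmiles(smiles) for smiles in smiles_list]
    mols = [mol for mol in mols if mol is not None]  # drop what RDKit cannot parse
    if len(mols) < 2:
        return None
    # `timeout` is in seconds; `canceled` is set when the search hit the limit
    # and the reported MCS is only the best one found so far.
    result = rdFMCS.FindMCS(mols, timeout=timeout)
    return result.smartsString, result.numAtoms, result.canceled


# Illustrative input: two dialkyl phosphates and one unparseable SMILES.
outcome = find_node_mcs(["COP(=O)(O)OC", "CCOP(=O)(O)OCC", "C1CC"])
if outcome is not None:
    smarts, num_atoms, timed_out = outcome
    print(smarts, num_atoms, "timed out" if timed_out else "completed")
```

Returning the timeout flag alongside the SMARTS makes it possible to tell a complete answer from a best-effort one when reviewing the results later.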
I also developed a visualizer for the result, using Karen Schomburg's
SMARTSviewer and Daylight's depictmatch.cgi.
Oooh! More pictures!
Here's a snapshot of one of the successful cases, CHEBI:16648,
which is dialkyl phosphate:
Most of the results aren't as clear-cut. For example, CHEBI:16389
contains the ubiquinones. I found the MCS:
which is nearly right, but the Wikipedia page for Coenzyme_Q10
("Coenzyme Q10, also known as ubiquinone, ...") shows a
methyl attached to the top-most oxygen in this SMARTS depiction. The MCS
lacks it because CHEBI:18238
is a structure in the set which does not have that methyl attached!
Is this methyl important? I don't know. I'm not a chemist, and this
requires expertise I simply don't have.
An oopsie in the oxolanes?
What I do know is that there's a mistake in the oxolanes, CHEBI:26912.
Wikipedia calls this tetrahydrofuran
and says it's a 5-membered ring with the formula (CH2)4O. I would
write it as the SMILES/SMARTS "O1CCCC1".
However, my search finds only "OCCCC"; it doesn't find the
cycle. There shouldn't be a problem with this one so I investigated
further, wondering if it was a bug. It turned out that acetylblasticidin S
(CHEBI:2413)
is considered an oxolane. A quick look at the structure though
shows that it has no 5-membered ring.
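One way to track down this kind of outlier is to test every member of a node against the pattern you expected the MCS to be, and list the ones that fail. The sketch below does that with RDKit's substructure matching. The helper name `members_missing_pattern` is hypothetical, and the input structures are simple stand-ins I picked for illustration, not the actual ChEBI entries.

```python
from rdkit import Chem


def members_missing_pattern(smiles_by_id, smarts):
    """Return the sorted ids whose structure parses but lacks the SMARTS pattern."""
    pattern = Chem.MolFromSmarts(smarts)
    if pattern is None:
        raise ValueError("could not parse SMARTS: %r" % (smarts,))
    missing = []
    for member_id, smiles in smiles_by_id.items():
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue  # unparseable structures are skipped, as in the main analysis
        if not mol.HasSubstructMatch(pattern):
            missing.append(member_id)
    return sorted(missing)


# Illustrative stand-ins: two real oxolane rings and one open chain.
examples = {
    "tetrahydrofuran": "C1CCOC1",
    "2-methyltetrahydrofuran": "CC1CCCO1",
    "1-butanol": "OCCCC",
}
print(members_missing_pattern(examples, "O1CCCC1"))
```

Any id this reports is either a structure that does not belong in the node or a sign that the node is not defined by that substructure at all.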
I think that's an annotation error. BTW, I do not envy the job of
annotator. There's a lot of data to review, and people like me end up
pointing out the mistakes, not the huge amount of work to get all the
other parts right.
Even more pictures... most of the ChEBI ontology!
Do you want to see the output of my full analysis? Do you have a lot
of memory on your computer? If so, download fmcs_chebi.html.bz2. It's
only 7.5 MB but it bzip2 uncompresses to 166 MB. Open fmcs_chebi.html
in your browser, and have fun! (Note: I'll probably delete it after a
month or so.)
BTW: the images are computed on-demand using servers from the
University of Hamburg and from Daylight. I didn't want to show
everything at once since that would put a huge demand on those
servers. Instead, you'll need to press the "Toggle images" button in
order to see the SMARTS and the graphical depiction of the matches.