The Artima Developer Community

Python Buzz Forum
Finding the MCSes for the ChEBI ontology

Andrew Dalke


Andrew Dalke is a consultant and software developer in computational chemistry and biology.
Finding the MCSes for the ChEBI ontology Posted: May 12, 2012 11:09 PM

This post originated from an RSS feed registered with Python Buzz by Andrew Dalke.
Original Post: Finding the MCSes for the ChEBI ontology
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.

The industrious folks at EBI have been developing ChEBI, which expands to "Chemical Entities of Biological Interest." Quoting Wikipedia, "[ChEBI] is a database and ontology of molecular entities focused on 'small' chemical compounds, that is part of the Open Biomedical Ontologies effort."

They define several distinct ontologies. One is a chemical structure ontology. For example, the identifier CHEBI:33567 contains the catecholamines, and a few examples of catecholamines are hexoprenaline (CHEBI:37950), arbutamine (CHEBI:50580), and L-isoprenaline (CHEBI:6257). In addition, catecholamine is a catechol (CHEBI:33566), which in turn is a benzenediol (CHEBI:33570), and so on. A group can have more than one parent; catecholamine is also a monoamine molecular messenger (CHEBI:25375).

The end result is a hierarchical structure. The leaves at the bottom of the hierarchy are structures, and each intermediate node is defined so that all of its children share some common property.
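Because a group can have more than one parent, the hierarchy is a directed acyclic graph and not a strict tree. Here is a minimal sketch of how one might collect the structures under a node. The "is a" links are the ones quoted above; the names `IS_A`, `invert`, and `leaves_under` are mine, for illustration only, and treating every leaf as a structure is a simplification.

```python
from collections import defaultdict

# Child -> parents ("is a") links, taken from the examples in the text.
IS_A = {
    "CHEBI:37950": ["CHEBI:33567"],  # hexoprenaline is a catecholamine
    "CHEBI:50580": ["CHEBI:33567"],  # arbutamine is a catecholamine
    "CHEBI:6257": ["CHEBI:33567"],   # L-isoprenaline is a catecholamine
    "CHEBI:33567": ["CHEBI:33566", "CHEBI:25375"],  # catecholamine has two parents
    "CHEBI:33566": ["CHEBI:33570"],  # catechol is a benzenediol
}


def invert(is_a):
    """Turn child -> parents links into a parent -> children mapping."""
    children = defaultdict(set)
    for child, parents in is_a.items():
        for parent in parents:
            children[parent].add(child)
    return children


def leaves_under(node, children):
    """Return every leaf reachable below `node`.

    The `seen` set makes sure a node reached through two different
    parents is only visited once.
    """
    seen = set()
    leaves = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        kids = children.get(current, ())
        if not kids and current != node:
            leaves.add(current)
        stack.extend(kids)
    return leaves


CHILDREN = invert(IS_A)
```

With these links, `leaves_under("CHEBI:33570", CHILDREN)` and `leaves_under("CHEBI:25375", CHILDREN)` both give the same three catecholamines, since both paths go through CHEBI:33567.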

Some of these common properties map directly to a common substructure. For example, CHEBI:33853 contains phenols, so every compound under that node has "one or more hydroxy groups attached to a benzene or other arene ring."

However, not all of them do. As Chepelev, Hastings, Ennis, Steinbeck, and Dumontier pointed out in "Self-organizing ontology of biochemically relevant small molecules", BMC Bioinformatics 2012, 13:3, the term "'ester' includes compounds that conform to C(=O)OC (i.e. carboxylic esters) and C(=S)OC patterns, among others."

Other cases can't even be represented as SMARTS. They give "bicyclic" as one such example.

Can I find the MCS of all structures in a node in the ontology?

I was curious to see if I could use their data set as a test of fmcs. If each intermediate node had a machine-readable way to tell whether it's a purely substructure-based node, and if I could get the size information, then I could get all the structures underneath it, find the MCS, and compare my answer to theirs.

Alas, they don't have that annotation information. It's something they are working on, but I didn't get the impression that it's a high priority. (I don't see why it should be, either.)

Still, it's an interesting thought - what if I were to generate the MCS for all nodes, and visualize the results somehow?

It took a bit longer than I thought, but I finally downloaded their ontology (in OBO format), parsed it, extracted the hierarchy, figured out the compounds in each node, tossed out the structures that RDKit couldn't parse, and dropped the nodes which didn't have at least two remaining structures in them.
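To give a feel for the parsing and filtering steps, here is a rough sketch and not the code I actually ran. It assumes the usual OBO layout of `[Term]` stanzas with `id:`, `name:`, and `is_a:` tags. The sample text is hand-written for illustration, not copied from the ChEBI file, and `parse_terms`, `usable_nodes`, and `can_parse` are hypothetical names. In practice `can_parse` could be something like a check that RDKit's `Chem.MolFromSmiles` does not return `None`.

```python
# Hand-written sample in OBO stanza layout (an assumption, not real ChEBI data).
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: CHEBI:33566
name: catechol
is_a: CHEBI:33570

[Term]
id: CHEBI:33567
name: catecholamine
is_a: CHEBI:33566 ! catechol
is_a: CHEBI:25375

[Term]
id: CHEBI:37950
name: hexoprenaline
is_a: CHEBI:33567

[Typedef]
id: has_part
name: has part
"""


def parse_terms(text):
    """Return {term id: {"id", "name", "is_a"}} for every [Term] stanza."""
    terms = {}
    term = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("[") and line.endswith("]"):
            # A new stanza begins; anything other than [Term] is skipped.
            term = None
            if line == "[Term]":
                term = {"id": None, "name": None, "is_a": []}
            continue
        if term is None or ": " not in line:
            continue
        tag, value = line.split(": ", 1)
        value = value.split(" ! ", 1)[0].strip()  # drop a trailing "! comment"
        if tag == "id":
            term["id"] = value
            terms[value] = term
        elif tag == "name":
            term["name"] = value
        elif tag == "is_a":
            term["is_a"].append(value)
    return terms


def usable_nodes(members, can_parse, minimum=2):
    """Drop unparseable structures, then nodes left with fewer than `minimum`."""
    kept = {}
    for node, structures in members.items():
        good = [s for s in structures if can_parse(s)]
        if len(good) >= minimum:
            kept[node] = good
    return kept
```

The minimum of two is there because an MCS of a single structure is just that structure, which says nothing about the node.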

Once that was done, I let my MCS algorithm loose on it. It took about 50 minutes to process. (Well, I had a 15 second timeout on each MCS. I've found that 15 seconds is usually good enough.)

I also developed a visualizer for the result, using Karen Schomburg's SMARTSviewer and Daylight's depictmatch.cgi.

Oooh! More pictures!

Here's a snapshot of one of the successful cases, CHEBI:16648, which is dialkyl phosphate:


Most of the results aren't as clear-cut. For example, CHEBI:16389 contains the ubiquinones. I found the MCS:


which is nearly right, but the Wikipedia page for Coenzyme_Q10 ("Coenzyme Q10, also known as ubiquinone, ...") shows a methyl attached to the top-most oxygen in this SMARTS depiction. This is because CHEBI:18238 is a structure in the set which does not have that methyl attached!

Is this methyl important? I don't know. I'm not a chemist, and this requires expertise I simply don't have.

An oopsie in the oxolanes?

What I do know is that there's a mistake in the oxolanes, CHEBI:26912. Wikipedia calls this tetrahydrofuran and says it's a 5-membered ring with the formula (CH2)4O. I would write it as the SMILES/SMARTS "O1CCCC1".

However, my search finds only "OCCCC"; it doesn't find the cycle. There shouldn't be a problem with this one, so I investigated further, wondering if it was a bug. It turned out that acetylblasticidin S (CHEBI:2413) is considered an oxolane. A quick look at the structure, though, shows that it has no 5-membered ring.

I think that's an annotation error. BTW, I do not envy the job of annotator. There's a lot of data to review, and people like me end up pointing out the mistakes, not the huge amount of work to get all the other parts right.

Even more pictures... most of the ChEBI ontology!

Do you want to see the output of my full analysis? Do you have a lot of memory on your computer? If so, download fmcs_chebi.html.bz2. It's only 7.5 MB but it bzip2 uncompresses to 166 MB. Open fmcs_chebi.html in your browser, and have fun! (Note: I'll probably delete it after a month or so.)
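If you would rather not use the command-line bzip2, here is a small sketch using Python's standard `bz2` module. It streams the file in chunks, so the 166 MB never has to sit in memory at once. The function name `decompress_bz2` and the 1 MB chunk size are my own choices for illustration.

```python
import bz2


def decompress_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress `src_path` into `dst_path`.

    Returns the number of uncompressed bytes written.
    """
    written = 0
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            written += len(chunk)
    return written
```

Once the file is downloaded, calling `decompress_bz2("fmcs_chebi.html.bz2", "fmcs_chebi.html")` should leave the HTML file next to the archive.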

BTW: the images are computed on-demand using servers from the University of Hamburg and from Daylight. I didn't want to show everything at once since that would put a huge demand on those servers. Instead, you'll need to press the "Toggle images" button in order to see the SMARTS and the graphical depiction of the matches.

