This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Naming known molecules
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
That runs over and over in your head. Why do you want a
name? You're looking for additional information about a chemical
graph, so what about using a graph search instead of a text search?
Suppose all chemical compounds were stored in a computer as a graph.
To search the database, sketch the compound then do a graph
isomorphism search. Graph isomorphism is slower than a text compare,
so the search could be sped up with filters. E.g., search first for a
matching molecular formula and only do the graph search on the records
which pass the filter.
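
Here is a minimal sketch of that filter-then-match idea, to make it concrete. Everything in it is an assumption for illustration: the Molecule class, the is_isomorphic and search functions, and the two-record toy database are hypothetical, and the matcher is a naive backtracking search rather than what a production system would use. Hydrogens are stored explicitly so that the molecular formula can be read straight off the atom list.

```python
from collections import Counter


class Molecule:
    """A chemical graph: atoms are nodes, bonds are edges with a bond order."""

    def __init__(self, atoms, bonds):
        self.atoms = list(atoms)  # element symbols, e.g. ["C", "O", "H"]
        self.adj = {i: {} for i in range(len(self.atoms))}
        for a, b, order in bonds:
            self.adj[a][b] = order
            self.adj[b][a] = order

    def formula(self):
        # Molecular formula as element counts; the cheap pre-filter.
        return Counter(self.atoms)


def is_isomorphic(m1, m2):
    """Naive backtracking graph isomorphism on labeled graphs."""
    n = len(m1.atoms)
    if n != len(m2.atoms):
        return False
    mapping = {}  # atom index in m1 -> atom index in m2
    used = set()  # atom indices in m2 that are already taken

    def extend(i):
        if i == n:
            return True
        for j in range(n):
            if j in used or m1.atoms[i] != m2.atoms[j]:
                continue
            if len(m1.adj[i]) != len(m2.adj[j]):
                continue
            # Bonds from i to already-mapped atoms must exist in m2
            # between the mapped atoms, with the same bond order.
            if all(m2.adj[j].get(mapping[k]) == order
                   for k, order in m1.adj[i].items() if k in mapping):
                mapping[i] = j
                used.add(j)
                if extend(i + 1):
                    return True
                del mapping[i]
                used.discard(j)
        return False

    return extend(0)


def search(database, query):
    """Return names of records whose graph matches the query graph."""
    target = query.formula()
    return [name for name, mol in database.items()
            if mol.formula() == target      # fast text-like filter
            and is_isomorphic(mol, query)]  # slow graph comparison


# Toy database: two compounds with the same formula, C2H6O.
database = {
    "ethanol": Molecule(
        ["C", "C", "O", "H", "H", "H", "H", "H", "H"],
        [(0, 1, 1), (1, 2, 1), (0, 3, 1), (0, 4, 1), (0, 5, 1),
         (1, 6, 1), (1, 7, 1), (2, 8, 1)]),
    "dimethyl ether": Molecule(
        ["C", "O", "C", "H", "H", "H", "H", "H", "H"],
        [(0, 1, 1), (1, 2, 1), (0, 3, 1), (0, 4, 1), (0, 5, 1),
         (2, 6, 1), (2, 7, 1), (2, 8, 1)]),
}

# The "sketched" query: ethanol again, drawn with a different atom order.
query = Molecule(
    ["O", "C", "C", "H", "H", "H", "H", "H", "H"],
    [(0, 1, 1), (1, 2, 1), (0, 3, 1), (1, 4, 1), (1, 5, 1),
     (2, 6, 1), (2, 7, 1), (2, 8, 1)])

matches = search(database, query)
```

Both records pass the formula filter here, which is exactly why the graph comparison is still needed: only the graph tells ethanol apart from dimethyl ether.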
Hey! That could work! It would be even better if all the chemistry
papers were put into the database, so anyone could look up a paper
given the graph of a compound of interest. Oooh! And if it included
published reactions as well, then people could get pointers on
how to synthesize a compound.
Substance identification is a special strength of CAS. It is widely
known as the CAS Registry, the largest substance identification system
in existence. When a chemical substance, newly encountered in the
literature, is processed by CAS, its molecular structure diagram,
systematic chemical name, molecular formula, and other identifying
information are added to the Registry and it is assigned a unique CAS
Registry Number. Registry now contains records for more than 22
million organic and inorganic substances and more than 34 million
sequences.
They digitize all this information, make it searchable, and license
the technology for others to develop search software for your
computer. Or if you want, you can get it on paper, microfilm, or
CD-ROM. All for a price of somewhere between a few hundred and nearly
30,000 dollars/year depending on who you are and what you want. (Who
says information wants to be free?
Actually, the cost in part reflects the service needed to keep things
up to date with the literature and in part the high barrier to anyone
else reproducing their database; the skills of inexpensive off-shore
chemists notwithstanding.)
They are also a naming service. They assign a new, unique CAS number
for every compound in the database. Ethanol is
CAS# 64-17-5.
You can design your compound database system to store the CAS# as the
primary key. When you need more ethanol -- without the tasty
impurities you'll get from your pub -- ring up your supplier and order
it by CAS#. This helps make sure both parties are talking about the
same thing.
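
If you do key a database on the CAS#, it helps that the number carries its own check digit: the last digit equals the weighted sum of the other digits (weights 1, 2, 3, ... counting from the right) modulo 10. Below is a small sketch of using that to catch typos before a record is stored. The is_valid_cas and register functions and the compounds dict are hypothetical names made up for this example, and it assumes the number is written in the usual hyphenated form, such as 64-17-5 for ethanol.

```python
import re


def is_valid_cas(cas):
    """Check the format and check digit of a hyphenated CAS number."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(d)
                for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


compounds = {}  # hypothetical table: CAS number (primary key) -> name


def register(cas, name):
    """Store a compound under its CAS number, rejecting malformed keys."""
    if not is_valid_cas(cas):
        raise ValueError("not a valid CAS number: %r" % (cas,))
    compounds[cas] = name


register("64-17-5", "ethanol")
```

For ethanol the sum is 7*1 + 1*2 + 4*3 + 6*4 = 45, and 45 modulo 10 is 5, which matches the final digit.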
Problem solved. You can isolate a compound, determine its structure,
get the CAS# and/or its IUPAC name, and look it up in the literature.
Or is it solved?....