This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Naming molecules
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
Suppose you are a physicist. After some analysis in your home-built
NMR machine you've figured out the active ingredient in your vodka has
the following chemical structure:
If you really had no chemistry training at all you probably wouldn't
even include the bonds. A bond is a way of representing electron
density, which can be computed knowing the atoms' positions and
applying some quantum mechanics and computer power. And if you want
to show off, toss in the phrase "Born-Oppenheimer approximation" so
you don't have to worry about treating the nuclei as anything other
than fixed points. (Huh. Given how rarely that phrase is used,
my act of saying "Born-Oppenheimer approximation" may make this page
the top hit for it in search engines.)
Odds are you probably had a chemistry class in high school so you know
about drawing the structure with bonds. Now you want to find out more
about it. But how? Image search is still very immature and there are
many ways to depict the graph, so that's not going to work.
One way is to look for the molecular formula, which is the count of
each type of atom. This structure is
C₂H₆O but a search engine doesn't like subscripts
so try "C2H6O". The first hit is for the
"c2h6o -- happy hour"
mailing list at Georgia Tech, which suggests people already know about
this compound. But it doesn't give you much clue as to what it is.
The next hit is for
dimethyl ether
which has the same molecular formula but looks like
There you see the problem. The molecular formula isn't unique. You
really would like a compound to have one and only one name, and for a
name to refer to only one molecule. After thinking about it some more
you realize that the molecular formula itself could be written several
ways, like H6C2O (lightest element first) or
OC2H6 (heaviest element first). There are six
possible orderings of the three elements.
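
To see where the six comes from, here is a minimal Python sketch that writes out every ordering of the three elements. The `counts` dictionary and the `format_formula` helper are made up for this illustration; they don't come from any chemistry toolkit.

```python
from itertools import permutations

# Element counts for the structure discussed above.
counts = {"C": 2, "H": 6, "O": 1}

def format_formula(order, counts):
    """Write a formula with the elements in the given order.

    A count of 1 is left implicit, as chemists usually write it.
    """
    return "".join(
        element + (str(counts[element]) if counts[element] > 1 else "")
        for element in order
    )

# Three elements can be arranged in 3! = 6 different orders.
orderings = [format_formula(order, counts) for order in permutations(counts)]
print(len(orderings))
print(orderings)
```

The list includes both H6C2O and OC2H6 from above. One common way out is the Hill convention (carbon first, then hydrogen, then everything else alphabetically), which is the ordering that gives C2H6O.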
Searching for the first alternative, you come across a
lecture slide
which says "H6C2O could correspond to both Ethanol (H3CH2COH) and
dimethyl ether (H3COCH3)". Ahh! A clue! Maybe this is called
ethanol. But it's kinda worrying to see the formula written as
H3CH2COH, which is different than the six permutations listed above.
Further searching finds links to sites promoting the commercial use
of ethanol, but not until the
sixth link
do you find some useful chemical information and verification that
you've got the right structure. But it is still disconcerting that
they use the formula CH3CH2OH which
is yet another possibility.
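
All of these spellings do at least agree on which atoms are present. A rough sketch, assuming flat formulas with no parentheses, charges, or isotope labels (the `element_counts` helper is hypothetical, written just for this example), reduces each one to its element counts:

```python
import re
from collections import Counter

def element_counts(formula):
    """Count the atoms in a flat formula such as 'CH3CH2OH'.

    Assumes properly capitalized element symbols, each optionally
    followed by a count. Parentheses and charges are not handled.
    """
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return dict(counts)

# The spellings seen so far, including dimethyl ether's H3COCH3.
spellings = ["C2H6O", "H6C2O", "OC2H6", "H3CH2COH", "CH3CH2OH", "H3COCH3"]
for formula in spellings:
    print(formula, element_counts(formula))
```

Every one of them, dimethyl ether included, comes out as two carbons, six hydrogens, and one oxygen. That is exactly why the counts alone can't tell the two structures apart.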
What are you going to do the next time you want to find information
about a molecule? It seems these things have names, so you look into
that some more and find out that the International Union of Pure and
Applied Chemistry (IUPAC to its friends and enemies alike) has a
huge amount of documentation
related to nomenclature. Using their rules gives a way to assign a
unique name to a molecule.
And look, that page says ethanol is written
C2H5OH. *sigh*.
At its simplest, the IUPAC name for an organic compound contains
these two parts:
- a root indicating how many carbon atoms are in the
  longest continuous chain of carbon atoms.
- a prefix and/or suffix to indicate the family to which the
  compound belongs.
The longest carbon chain is two carbons so it has the root "eth".
There is a single bond between them (that's "single bond" as in a bond
with single bond type, not that there's only one bond between them) so
it's an "ethane". There's an OH on the end which uses the suffix
"ol". Drop the "e" and join them to make "ethanol". Ta-da!
Upon reading that tutorial you realize there's a lot of memorization
of names, and you went into physics because you preferred formulas and
math over names. And because you would rather be electrocuted or
irradiated instead of being around chemical containers with big
warning stickers like "Danger: Bone Seeker".
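
For what it's worth, the two-part rule from the tutorial (a root for the chain length plus a suffix for the family) is easy to sketch in code, and the lookup table hints at where the memorization comes in. This is a toy under heavy assumptions: `ROOTS` and `simple_name` are names invented here, only unbranched chains are considered, and the position of the OH group is ignored, so the result is only unambiguous for one- and two-carbon chains.

```python
# Roots for the first few chain lengths. Real nomenclature has many more.
ROOTS = {1: "meth", 2: "eth", 3: "prop", 4: "but"}

def simple_name(chain_length, family):
    """Name an unbranched alkane or alcohol from its carbon count.

    Toy rule only: no branching, no double bonds, and no position
    numbers for the OH group.
    """
    alkane = ROOTS[chain_length] + "ane"  # all single bonds, e.g. "ethane"
    if family == "alkane":
        return alkane
    if family == "alcohol":
        return alkane[:-1] + "ol"  # drop the "e", add "ol"
    raise ValueError(f"unsupported family: {family}")

print(simple_name(2, "alcohol"))
```

Anything past this point (branches, rings, several functional groups) needs more rules and more table entries, which is the part a physicist would rather hand to software.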
After digging around a bit you realize that
even
trained chemists have problems with names. Chemistry librarians
were worth their weight in platinum in their knowledge of the
arcane magic of finding the right literature references.
Good thing you've got a computer. There is software to help
generate an IUPAC name. But my, the results sure look complicated,
the process is opaque (to non-experts and even non-specialists in a
domain) and there's the fine print that "from time-to-time" some
compounds can't be named because "some classes of compounds may not
yet have systematic nomenclature definitions available."
The names look complicated in part because they derive from a system
originally designed to be pronounceable and to reflect the way that a
chemist understands the system. The result is a name like (from the
ACD/Name example
on that ACD/Labs link -- it's got cool mouseovers!):