Computers -- History of Chemical Nomenclature
Enter the 1950s and the start of commercial, general-purpose
computers. People started to use them to automate many of the
mechanical tasks once done by hand. Looking back 50 years, one
obvious task is assigning chemical names, but there are some
big steps to overcome.
First, can computers be taught nomenclature? Sure, there's all this
talk of 'electronic brains', but is it real or just hype? Second, the
early computers were quite small (in power; definitely not small in
size). One reason it took three years to train a chemist to be a
nomenclaturist is that chemistry is complex. How can all those
rules be encoded in a computer? Third, can computers handle the I/O
appropriate to chemistry?
Luckily, there are useful intermediate steps. Before getting to them,
let me elaborate on the interesting influence of linguistics in
computer science. If you've ever programmed in Fortran you'll notice
it has a different feel than C, Java, Python, Lisp and just about
every other programming language. It seems to lack a certain
sense of structure, but I'm at a loss to give a good, clear
example. Fortran was designed before there was a good theoretical
understanding of languages and parsers.
Chomsky, now best known as a political dissident (that's probably the
best term), formalized in 1956 how we think about syntax in computer
languages. He introduced context-free grammars, which are now used to
describe the syntax of nearly all computer languages. CFGs are
succinctly written in a variation of BNF ("Backus-Naur Form") and can
easily be turned into a parser using a parser-generator like yacc or
ANTLR. There is a CFG for Fortran, but it's very complicated; there
are definite advantages to designing a language to be easy to
parse as a CFG. BNF, by the way, was introduced in 1960 as a way to
describe the Algol programming language.
Linguistics also provides tools to understand existing languages.
Garfield's 1961 PhD thesis,
which I've mentioned favorably in a few of my older essays, used
linguistic tools to determine the morphemes (smallest parts of a
language) in chemical nomenclature and how they fit together. One
result of this analysis, which is the topic of his thesis, was the
realization that the molecular formula for a compound could be
determined directly from the morphemes in the compound name.
This was surprising, even stunning. Previously (meaning for the
previous century), chemists were taught to use the chemical name to
draw the chemical graph, then count the atoms in the graph to get the
molecular formula. This takes time and is error-prone, even for an
expert. His thesis showed that 1) chemical nomenclature has a
well-defined syntax which can be studied with the tools of
linguistics, 2) the results have direct applicability to real needs in
chemistry, and 3) the process can be encoded on a computer, or done by
a relatively untrained person. Chemists were shocked that it was so
easy, and it took a lot of checking and double-checking to verify that
the results really were correct.
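To make the idea concrete, here's a toy sketch in Python (my own
illustration, not Garfield's actual procedure): for the very
restricted case of straight-chain alkane names, two morphemes are
enough to produce the molecular formula without ever drawing the
graph.

    # Toy illustration: a molecular formula computed from name morphemes
    # alone. Covers straight-chain alkanes only; real nomenclature needs
    # far more rules than this.
    STEM_CARBONS = {"meth": 1, "eth": 2, "prop": 3, "but": 4,
                    "pent": 5, "hex": 6, "hept": 7, "oct": 8}

    def alkane_formula(name):
        # the "-ane" morpheme means saturated: n carbons, 2n+2 hydrogens
        assert name.endswith("ane"), "toy code: straight-chain alkanes only"
        n = STEM_CARBONS[name[:-3]]
        return "C%dH%d" % (n, 2 * n + 2)

    print(alkane_formula("ethane"))  # C2H6
    print(alkane_formula("octane"))  # C8H18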
In the 1960s the big pharmas and indexing companies like ISI started
using computers for their chemical informatics requirements.
Computers could do all the things punch card machines could do, and
faster. They could be programmed with increasingly nuanced rules of
chemistry: to compute molecular weight, figure out the molecular
graph, etc., given a WLN or a molecular name. Graphic display systems
became mainstream enough that people could
sketch the molecular graph
on a display instead of on paper, and get a high-quality printout of
the graph. The first commercial system for this was ICI's CROSSBOW,
released in 1969 and written in Cobol (!) and assembler
[#].
It only had text input but did have graph output. Within a few years
there were several available products, including programs which
could take graphical input.
Computers got powerful enough to support new types of searches based
on structure. IUPAC systematic names, WLNs, and other line notations
required years of training in order to produce a unique (or
canonical) compound name. This made it hard for normal
chemists to find a compound by name. But any chemist could sketch out
the molecular graph on the screen, which could easily and
automatically be converted to a connection table listing all the atoms
and bonds. To find a matching compound in the database, the computer
just finds all records with an isomorphic connection table. (In other
words, where the graphs are identical.) Real code would use various
tricks to filter out obvious rejects, because graph isomorphism tests
are slow.
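Here's a minimal sketch of that filtering idea, with a made-up record
format: compare a cheap, order-independent invariant (the element
counts) first, and run the slow isomorphism test only on the
survivors.

    # Sketch of prefiltering: reject candidates whose element counts
    # can't match before running any expensive graph isomorphism test.
    from collections import Counter

    def formula(atoms):
        # an invariant: the same graph always gives the same counts
        return Counter(atoms)

    def candidates(query_atoms, database):
        key = formula(query_atoms)
        return [rec for rec in database if formula(rec["atoms"]) == key]

    db = [
        {"name": "ethanol",        "atoms": ["C", "C", "O"]},
        {"name": "dimethyl ether", "atoms": ["C", "O", "C"]},
        {"name": "methanol",       "atoms": ["C", "O"]},
    ]
    # ethanol and dimethyl ether are isomers with the same heavy-atom
    # counts, so both survive; only the isomorphism test separates them.
    print([r["name"] for r in candidates(["C", "O", "C"], db)])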
Computers can also do substructure and superstructure searches, which
look for compounds where part of the molecule matches the sketched out
query, or vice versa. The older line notations along with paper-based
indices could handle a few of these, but they were basically
precomputed. Computers allowed chemists to search for an arbitrary
substructure. Subgraph isomorphism searches are a bit harder than
graph isomorphism ones, and required cleverer tricks to prefilter out
likely mismatches, but chemists and programmers are clever. I expect
true substructure search appeared after structure search, but I don't
know when. (I say "true" because CROSSBOW did allow searching for WLN
fragment names.)
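For a feel of what a substructure search does underneath, here's a
sketch using networkx, a modern general-purpose graph library standing
in for the chemistry-specific code of the era:

    # Substructure search as subgraph isomorphism: does ethanol (C-C-O)
    # contain a C-O fragment?
    import networkx as nx
    from networkx.algorithms import isomorphism

    mol = nx.Graph()
    mol.add_nodes_from([(0, {"el": "C"}), (1, {"el": "C"}), (2, {"el": "O"})])
    mol.add_edges_from([(0, 1), (1, 2)])

    query = nx.Graph()
    query.add_nodes_from([(0, {"el": "C"}), (1, {"el": "O"})])
    query.add_edge(0, 1)

    # GraphMatcher(big, small) asks whether some subgraph of 'mol' is
    # isomorphic to 'query', matching atoms on their element labels.
    gm = isomorphism.GraphMatcher(mol, query,
                                  node_match=lambda a, b: a["el"] == b["el"])
    print(gm.subgraph_is_isomorphic())  # True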
Computers could also be used for similarity searches. This is mostly
done using
fingerprints,
which are bit strings. A key-based system like MACCS applies a
chemical meaning to each bit: "bit 4 is on only when there is at least
one selenium atom", "bit 5 is on if and only if there is a C=O bond."
A path-based system like Daylight's looks at all paths up to a certain
size and computes a hash value based on the atoms and bonds in the
path, which is used to turn a bit on. The result is an invariant
fingerprint that can be compared to other fingerprints. If two
fingerprints are highly dissimilar then the compounds are different,
and if the fingerprints are similar then the compounds are (hopefully)
likely to be similar. If the similarity value is a metric (or metric-like) then it
can be used to cluster data sets, which gives yet another way
computers can help understand chemistry. (Contact Mesa Analytics if you want to hire
people with expertise on this topic. Tell John I apologize for
borrowing his ladder for so long :)
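Here's a toy path-based fingerprint in that spirit. The adjacency-list
molecule encoding, the 64-bit size, and the path-length cap are all
arbitrary choices of mine; bond types and path direction are ignored.

    # Toy path fingerprint: hash all simple paths of up to PATH_LEN atoms
    # into a small bit vector, then compare with the Tanimoto coefficient.
    import zlib

    NBITS, PATH_LEN = 64, 4

    def paths_from(neighbors, path):
        yield path
        if len(path) < PATH_LEN:
            for j in neighbors[path[-1]]:
                if j not in path:
                    yield from paths_from(neighbors, path + [j])

    def fingerprint(atoms, neighbors):
        bits = 0
        for start in range(len(atoms)):
            for path in paths_from(neighbors, [start]):
                label = "-".join(atoms[i] for i in path)
                # crc32 is a stable hash, so the fingerprint is invariant
                bits |= 1 << (zlib.crc32(label.encode()) % NBITS)
        return bits

    def tanimoto(a, b):
        # bits in common over bits in either; 1.0 means identical
        return bin(a & b).count("1") / bin(a | b).count("1")

    ethanol  = fingerprint(["C", "C", "O"], [[1], [0, 2], [1]])
    methanol = fingerprint(["C", "O"], [[1], [0]])
    print(tanimoto(ethanol, methanol))  # similar, but not identical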
(As another aside: the Levenshtein distance between two words is the
number of character insertions, deletions, and substitutions needed to
transform one into the other. I know of no similar distance measure
for graphs. Do you?
If so, please let me know.)
All of these searches became possible because computers could apply
graph algorithms to the connection table. Even better, computers
could be used to make predictions of physical properties (like the
number of rotatable bonds, flexibility, and CLogP) based only on that
connection table. It's no wonder that by the latter part of the 1970s
people could exclaim that
"Line notations are dead!"
There are some problems with connection tables. They are big. The
most commonly used connection table format comes from MDL. The
example file I have is, according to its filename, aspirin, although I
don't know what the hydrogen is doing there as the 14th atom. The
first three numbers in each atom line are the coordinates. Other
fields in the table store the atom symbol (remember Berzelius?),
formal charge, bond type, chirality, and other atom or bond
properties. What makes this a connection table is the first two
numbers in each line of the second section (starting with 1 2 1).
This is the bond section, and each line connects one atom to another.
It's very hard for people to read a connection table directly, and
tedious even to convert it by hand into a chemical graph.
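If you want to see an actual connection table, a modern toolkit like
RDKit (an anachronism here, of course) can emit the MDL format from a
SMILES; a minimal sketch, assuming RDKit is installed:

    # Generate an MDL connection table for aspirin: an atom block with
    # coordinates and symbols, then bond lines like "1 2 1" (atom, atom,
    # bond type).
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
    AllChem.Compute2DCoords(mol)  # invent 2D coordinates for the atom block
    print(Chem.MolToMolBlock(mol))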
The full SD file for aspirin takes 928 bytes, which is about 40 times
larger than the systematic name of "2-Acetoxybenzoic acid" or the
semi-systematic name of "acetylsalicylic acid," or the trivial name of
"aspirin." The coordinates aren't needed for the topology, so the
connection table could in theory be smaller, but it would still be
nowhere near as compact.
Why is space a problem? In part because there are a lot of known
compounds. By the early 1980s, CAS had 6.3 million
compounds. Uncompressed, that's nearly 6 GB (6.3 million times 928
bytes) just for the connection tables, and even with a good binary
data structure it needed
to live on disk and not in memory. (A Cray supercomputer may have had
that much RAM, but likely not your local VAX.)
Even if space wasn't a concern, a good line notation is nice because
you can stick it in a text field of a relational database, or into a
cell of an Excel spreadsheet. If it's a canonical name, it's easy to
check for duplicate compounds and even, with some line notations,
remove salts and do other simple manipulations right on the string.
And with a really good one, a chemist can convert it to/from a graph
with little problem.
It's actually relatively easy to come up with a line notation. The
standard way is to create a spanning tree for the graph, label each
atom and bond (perhaps with a shortcut notation for things like
carbons and single bonds), and come up with a way to write that graph
as a string, e.g., by using markers for the start of each branch in the
spanning tree and markers which join cycles which were broken to make
the tree.
A good example of this style of line notation is SMILES. A
(non-canonical) SMILES for aspirin is C(=O)(O)c1ccccc1OC(=O)C, which
is only two characters longer than the systematic name. The '(' is the
start of a branch and the ')' closes a branch. Digits are used to
close open rings, so the 1 closes up the benzene ring.
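Here's a toy consistency check of those two devices (my own sketch,
not a real SMILES parser; real SMILES also has %nn two-digit ring
closures and bond symbols before digits, which this ignores):

    # Check that '(' / ')' branches nest and that ring-closure digits
    # pair up: the first occurrence of a digit opens a ring bond and the
    # second occurrence closes it.
    def check_structure(smiles):
        depth, open_rings = 0, set()
        for ch in smiles:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                assert depth >= 0, "unmatched ')'"
            elif ch.isdigit():
                open_rings ^= {ch}  # toggle: open on first, close on second
        assert depth == 0 and not open_rings, "unclosed branch or ring"
        return True

    print(check_structure("C(=O)(O)c1ccccc1OC(=O)C"))  # True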
If you've noticed, I keep mentioning the canonicalization problem.
It's easy to make a line notation, but it's really hard to make a
canonicalization scheme using that notation. That's the big problem
found in all graph-based chemical nomenclatures developed since the
early 1800s, and the only known solution required people to spend a
few years of training to use a nomenclature system. How could a
computer hope to
match that?
The answer, which was very, very clever, is described in
David Weininger, Arthur Weininger and Joseph L. Weininger, "SMILES 2:
Algorithm for Generation of Unique SMILES Notation", Journal of
Chemical Information and Computer Science (JCICS), Vol. 29, No. 2,
pp. 97-101, 1989.
It depends on the
Fundamental Theorem of Arithmetic.
That's the theorem which states that all integers > 1 have a
unique prime decomposition. (It's also the reason why 1 is considered
neither a prime nor a composite: otherwise the FToA wouldn't be true,
or would need a special-case exclusion of 1.)
It's so cool I'm going to tell you how it works. Assign each atom a
number based on all the properties which distinguish it from another
atom; two atoms get the same number if and only if those properties
are the same. It doesn't matter what the numbers are, so long as they
can be compared that way. Renumber them so the atom with the lowest
value is 0, the second lowest is 1, etc. Duplicates are allowed if the
original values were the same. Get the corresponding prime (0->2,
1->3, 2->5, 3->7, etc.). Now replace each atom's number with the
product of the primes of its neighbors. E.g., if an atom is bonded to
two other atoms with primes 5 and 7, then its new value is 35.
Renumber them again so that the lowest value is 0, etc. If the order
hasn't changed, then stop.
This gives every atom a unique number, independent of the input order,
up to symmetry. (I.e., the two carbons in C-O-C will have the same
number.) Root the spanning tree at an atom with the lowest value.
When a branch is possible, take the one to the atom with the lowest
value. Presto - a canonical SMILES! And you might be able to
tell that this solution could only have been done on a computer; no
one would want to work through the number of multiplications needed to
generate it by hand.
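Here's a small sketch of that ranking loop (my simplification of the
published CANGEN algorithm: it pairs each round's old rank with the
new product so earlier distinctions can't be lost, and it skips the
tie-breaking and spanning-tree steps entirely):

    # Iterative prime-product ranking, as described above. neighbors[i]
    # lists the atoms bonded to atom i; invariants are the initial
    # per-atom numbers. PRIMES only covers toy-sized molecules.
    PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]

    def rank(values):
        # dense ranks: lowest value -> 0, next lowest -> 1, ...
        order = {v: i for i, v in enumerate(sorted(set(values)))}
        return [order[v] for v in values]

    def canonical_ranks(invariants, neighbors):
        ranks = rank(invariants)
        while True:
            products = [1] * len(ranks)
            for i, nbrs in enumerate(neighbors):
                for j in nbrs:
                    products[i] *= PRIMES[ranks[j]]
            # refine by (old rank, new product) so no distinction is lost
            new_ranks = rank(list(zip(ranks, products)))
            if new_ranks == ranks:  # order stable: stop
                return ranks
            ranks = new_ranks

    # C-O-C: the two symmetric carbons get the same rank, as expected
    print(canonical_ranks([6, 8, 6], [[1], [0, 2], [1]]))  # [0, 1, 0]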
As it turns out, the original SMILES paper only establishes atom order
and ignores bonds. What happens if the bonds are of different
types? The paper does say that when there are multiple atoms with the
same value, take the one connected by the highest bond type first
(triple over double over single). It's a good approximate solution,
but I think
there are cases where it can fail. I don't know the generally
accepted solution for those cases.
Daylight Chemical Information
Systems is built around SMILES and related languages like SMARTS
(a substructure search language). Because the notation is simple, it
can be entered by people with minimal training, even using a keyboard.
A SMILES can be easily converted by hand into a graph depiction. The
SMILES string is small for a reasonable-sized molecule, so it can
easily be manipulated by perl, Python, Excel, Oracle, or other
tools. It has a simple grammar, so it's easy to write parsers for it.
All-in-all, a nice language.
Most importantly, it meant their tools were fast. Instead of doing a
graph isomorphism search (even with filters) to find an identical
compound, they get the canonical SMILES for the query compound and use
a very fast string match. Instead of reading records from disk, they
can stick the entire chemical database in memory, eliminating slow
disk I/O.
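Here's a sketch of that string-match trick, using RDKit's canonical
SMILES as a modern stand-in for Daylight's canonicalizer:

    # Exact-structure lookup by canonical string match: canonicalize once
    # when building the index, once per query, then it's just a
    # dictionary lookup instead of a graph isomorphism search.
    from rdkit import Chem

    def cansmi(smiles):
        return Chem.MolToSmiles(Chem.MolFromSmiles(smiles))

    db = {cansmi(s): name for s, name in [
        ("CC(=O)Oc1ccccc1C(=O)O", "aspirin"),
        ("CN1C=NC2=C1C(=O)N(C)C(=O)N2C", "caffeine"),
    ]}

    # A differently written (non-canonical) aspirin SMILES, like the one
    # earlier in this post, hits the same key.
    print(db[cansmi("C(=O)(O)c1ccccc1OC(=O)C")])  # aspirin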