This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Exceptions to the graph model
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
Yesterday
I said that many compounds could be represented as a graph. The
computer scientist in you might ask what can't be represented as
graphs. Salts (like good old NaCl) are one. These have ionic bonds
caused by the interactions between strong charges, and not covalent
ones caused by the overlap of orbitals. Salt dissolves so easily
because a charge doesn't care what it interacts with so long as it's
the opposite charge. Water has its own partial charges which love to
join in on the fun, and the result is entropy heaven.
Crystals are infinite in extent. What's the graph of diamond or
graphite?
A graph based on covalent bonds cannot handle metals, which have
delocalized electrons and so no real covalent bonds. There's even
problems when there's only one metal atom, like in the heme of
hemoglobin or photosynthesis. An atom binds to the delocalized ring
of the heme and not to any one atom. There are ways to fake it, but I
recall that Cherwell's tookit supported "ring" as a new graph element,
in addition to atoms and bonds.
And then there's hydrogen gas absorbed in platinum. Not only is Pt a
metal but the hydrogens reversibly bind to the Pt atoms so move about.
Polymers cause more problems. These can have a huge number of
different lengths and branchings
("1020 and more"
for polysaccharides!). While each one can be represented as a
discrete graph, you'll need better than 64 bits to enumerate them all.
Even if you do that, it isn't obvious what you can do with that data.
The nice thing about these exceptions is that they rarely make good
drugs. Good drugs are small, have well-defined shapes, and only come
in one of at most a couple chiral forms. The exceptions are usually
protein-based drugs, like insulin, where non-graph models are more
appropriate.
The usefulness of the molecular graph decreases with the size of the
molecule. Molecules have a three-dimensional structure (called a
conformation) as well as the two-dimensional structure (the
graph). Small molecules usually have one or only a few common
conformations and these can be determined reasonably well directly
from the graph using Corina, Omicron, Concord, or similar programs.
Even when the prediction isn't exactly the minimum energy state,
thermal motion and other flucuations make it likely that the molecule
will be close to a predicted conformation at some point.
The larger the molecule the less likely it will be predicted
correctly, and the less likely it will be ergodic. (That's fancy for
"explore all phase space." That's fancy for "take on all available
conformations.") Huge molecules like proteins are extremely hard to
predict ab initio and are not ergodic. That's the whole protein
folding problem, and the Levinthal paradox is that folding does happen
despite the combinitorial problems.
When a molecule gets that large, its conformation plays a much more
important role. For example, the serine protease works because of the
arrangement of the catalytic triad. You can change a lot of the rest
of the molecule, making a very different graph, and still have a
serine protease so long as the fold is correct.
A better model for this stage is the bioinformatics one, where
proteins and DNA are represented by one-dimensional sequences (a
high-level abstraction of the graph). This allows similarity
searches, which are extremely hard to do for graphs, and supports
concepts like phylogenies to describe the evolutionary relationship
between sequences.