This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: connection matrix vs. node with edges for molecules
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
At the CRBM conference I showed the basic graph-like API that I and
Brian Kelley (and others) found useful for molecular data structures.
It says that atoms have bonds and bonds have atoms, and the molecule
has a list of all atoms and all bonds. Iddo Friedberg asked why we
chose that over a connection matrix. I don't think we answered it
fully, so here's part of my follow-up....
I wanted to follow up on your comments about using a
connection matrix for bond information. Both Brian and I
were against it, but weren't that clear in explaining why.
My comment was that the matrix itself would take up too
much space, which isn't correct since it can be stored as
a sparse matrix, as you pointed out.
Brian gave a different reason - that an atom with a list
of ordered bonds - helps with chirality, but didn't full
explain why. I thought at first I understood his reason,
but now I'm less certain. Reading the SMILES docs at
http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html#RTFToC24
SMILES uses a very general type of chirality specification
based on local chirality. Instead of using a rule-based
numbering scheme to order neighbor atoms of a chiral center,
orientations are based on the order in which neighbors occur
in the SMILES string.
This means Daylight doesn't believe the actual order of
bonds is important, only the order of atoms in the SMILES
string (or more generally, the insertion order of the atom
into the molecule).
If that's the case, and Brian has more experience in dealing
with chirality so he'll make the final call on that, then
that isn't a strong reason against using a sparse matrix.
However, the data model we use is
a molecule has a list of atoms
an atom has a list of bonds / "other" atoms
This model is a representation of a sparse matrix, is it
not? That is, to see if there's a connection between
two atoms, index into the list of atoms in the molecule
to get the list of bonds and search that list for the
other atom.
API-wise, all that's missing in what we propose compared to a matrix
API is the [i][j] lookup, that is, "given an atom, return the bond
to another atom, or None/KeyError if there isn't one." Easy enough to
add, but something I haven't needed often.
In the cases where I do need a connection matrix,
I just create one from the graph. In that way I
don't have to worry about changing the matrix when
I add/delete atoms as it only reflects the connectivity
which existed when I made the matrix. I do think
maintaining a proper, generic sparse distance matrix
in the face of edits is the major reason to use a
representation which more closely fits the graph-like
data model.
The chirality argument says that you may want to change
the order of the bonds for a given atom. With a connection
matrix, the bond order is invariant and it depends on the
order of atoms in the matrix. If I want the bonds
for a given atom to be in a different order (which is
effectively never, and so why I consider this a
weak argument) then I'll need a rearrangement vector
which tells me how to get the atoms in the correct
order. A performance hit and cumbersome code.
(If you do change the bond ordering then the library
may need to automatically reflect those changes so the
chirality stays correct. The toolkits I know, Daylight
and OEChem, don't let users change the bond ordering -
and it's not a feature I've needed - so I don't know
the expected behaviour for this case.)