This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Ontologies of Algorithm Implementations
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
I have difficulties with ontologies. Part of it is a knee-jerk
reaction to deep hierarchical classification schemes, which I don't
understand well enough to be able to explain fully here. Nature is
messy, and definitions have soft, vague edges. The ontologies I've
seen which were more than a controlled vocabulary seem to want sharp
edges, especially if the goal is to apply some sort of machine
reasoning or first-order predicate logic to the result. (But remember,
I'm not going to go into this, only show you my colors.)
One of the ontology terms Janna brought up was "results computed by
algorithm X." I have more concrete arguments against modeling things
this way.
Algorithms are not implementations
I've been working on a pair of fingerprint file
formats. An optional metadata field contains the fingerprint
type. I thought of defining that "MACCS" is used for all programs
which implement the MACCS keys but ran into a few problems.
The biggest is that OpenBabel 2.3.0 and OEChem 1.7.4 contain bugs in
their respective implementations. Briefly, the OpenBabel pattern
definitions used 5-membered rings for the 4-membered ring field, 6-
for the 5-, and so on, while in OEChem the bit for "has more than 1
component" turned out to be "has more than 0 components."
Both groups have been informed. The OpenBabel patterns are fixed
in version control and I assume the OEChem ones will be fixed in
the next release. But I don't know if other errors remain. I do know
there are structures where they give different fingerprints.
That's where I decided to call their respective outputs
"OpenBabel-MACCS/1" and "OpenEye-MACCS/1", replacing "1" with "2" once
the new forms are available.
Therefore, it's important to track not just the algorithm (or
algorithms) in use, but also the programs which implemented the
algorithm.
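Something like this little sketch captures the naming scheme (the function names are my own illustration, not part of any published fingerprint format):

```python
# Sketch of per-implementation fingerprint type labels like
# "OpenBabel-MACCS/1". The helper names are illustrative only.

def fingerprint_type(toolkit, algorithm, revision):
    """Build a type label such as 'OpenBabel-MACCS/1'.

    The revision is bumped whenever the toolkit's implementation of
    the algorithm changes, for example after a bug fix.
    """
    return "%s-%s/%d" % (toolkit, algorithm, revision)

def parse_fingerprint_type(label):
    """Split 'OpenEye-MACCS/2' back into (toolkit, algorithm, revision)."""
    name, revision = label.rsplit("/", 1)
    toolkit, algorithm = name.split("-", 1)
    return toolkit, algorithm, int(revision)

print(fingerprint_type("OpenBabel", "MACCS", 1))   # OpenBabel-MACCS/1
print(parse_fingerprint_type("OpenEye-MACCS/2"))   # ('OpenEye', 'MACCS', 2)
```

The point of the revision number is that "same algorithm, same toolkit" still isn't enough; the label has to change when the implementation does.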
More than one algorithm is involved
RDKit implements one of the MACCS bits as the SMARTS
pattern "c:n". This is an aromatic carbon connected to an aromatic
nitrogen by an aromatic bond. But what does "aromatic" mean?
The Daylight aromaticity model looks for the smallest set of
smallest rings (SSSR). If the number of π electrons in
the ring is equal to 2 mod 4 (Hückel's 4n+2 rule) then the ring is
aromatic. The Daylight toolkit automatically determines aromaticity
on every molecular structure it processes.
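The electron-counting part of that rule is simple enough to sketch (the hard part, which I'm skipping, is perceiving the rings and counting their π electrons in the first place):

```python
def is_huckel_aromatic(pi_electrons):
    """Hückel's 4n+2 rule: a planar ring system is aromatic when its
    pi electron count is 2 mod 4 (that is, 2, 6, 10, ...)."""
    return pi_electrons % 4 == 2

# Benzene has 6 pi electrons; cyclobutadiene has 4.
print(is_huckel_aromatic(6))  # True
print(is_huckel_aromatic(4))  # False
```

Different toolkits agree on this arithmetic; they disagree on which rings to consider and which atoms contribute electrons, which is exactly why the models diverge.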
A number of toolkits (among them OpenBabel and CDK) follow this
approach, but others do not. For example, OpenEye does not force a
specific aromaticity model. The input structure is presumed to be in
the right form and you are free to reperceive aromaticity if you
want, either with your own function or via one of the 5
aromaticity model perceivers included with OEChem.
In the table from the above link you can easily see cases where a
"C-N" is not aromatic in the MDL, Tripos, and MMFF definitions, but is
aromatic in Daylight's and OpenEye's models. Thus, the pattern "c:n"
will be 1 for some models, and 0 for others.
In other words, multiple algorithm implementations are required in
order to reproduce a certain output. Even knowing the program version
(say, "OEChem 1.7.0") isn't enough, since the user can decide which
aromaticity model to use.
More than one algorithm is involved, part 2
The problem isn't limited to the MACCS fingerprints. Even if the
molecular perception were identical, the hash fingerprints can still
give different results. Many of these use the hash to seed a (pseudo)
random number generator, then use the first few (1 to "several")
outputs to populate a fingerprint.
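Here's a sketch of that hash-and-seed scheme. Real toolkits differ in the hash function, the RNG, and how many outputs they take per feature; I'm using CRC32 and Python's Mersenne Twister purely for illustration:

```python
import random
import zlib

def hashed_fingerprint(features, nbits=1024, bits_per_feature=2):
    """Sketch of a hash fingerprint: each feature's hash seeds a
    pseudo-random number generator, and the first few outputs pick
    which bits to set. Hash, RNG, and bit count are all illustrative
    choices; each toolkit makes its own."""
    fp = [0] * nbits
    for feature in features:
        seed = zlib.crc32(feature.encode("utf-8"))
        rng = random.Random(seed)
        for _ in range(bits_per_feature):
            fp[rng.randrange(nbits)] = 1
    return fp
```

Swap in a different RNG (or take a different number of outputs from it) and the same molecule, with the same perceived features, gives a different fingerprint — which is the point.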
This is an example of just how hard it might be to track down the
different factors which go into making a data set. RDKit doesn't even
have a direct way to figure out which version of Boost it was compiled
with.
My solution in the fingerprint format is a metadata field modeled on
the user agent
field in HTTP. Fingerprint generation tools can include multiple
versioned implementation names so as to (hopefully) characterize what
they did.
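In sketch form, the field is just a list of name/version tokens, like a User-Agent header (the component names below are made up for illustration):

```python
def software_field(*components):
    """Build an HTTP User-Agent style software line from
    (name, version) pairs, e.g. 'OEChem/1.7.4 OpenEye-MACCS/1'.
    The component names here are hypothetical examples."""
    return " ".join("%s/%s" % (name, version) for name, version in components)

print(software_field(("OEChem", "1.7.4"), ("OpenEye-MACCS", "1")))
# OEChem/1.7.4 OpenEye-MACCS/1
```

A downstream tool can then at least enumerate which implementations touched the data, even if it can't fully interpret them.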
What is the algorithm?
I was the primary VMD developer in the
mid-90s. People wanted 3D structure alignment, so I implemented the Kabsch
algorithm from Acta Crystallographica (1976). It worked, except for
a few cases which I couldn't figure out.
Someone pointed out to me that Kabsch wrote a followup article in
1978. As I dimly recall, the original algorithm would sometimes
generate a mirror image. The updated paper correctly handled that case.
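As I understand the correction, it comes down to a determinant check (this is a sketch of the idea, not Kabsch's exact formulation): a proper rotation has determinant +1, while the 1976 procedure could hand back a matrix with determinant -1, i.e. a reflection.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def needs_reflection_fix(rotation):
    """The 1978 correction in a nutshell: if the candidate matrix has
    determinant -1 it is a reflection (mirror image), and the sign of
    one singular vector must be flipped before rebuilding the rotation."""
    return det3(rotation) < 0

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
mirror   = [[1, 0, 0], [0, 1, 0], [0, 0, -1]]
print(needs_reflection_fix(identity))  # False
print(needs_reflection_fix(mirror))    # True
```

An implementation that skips this check works fine on most inputs, which is exactly why my original version "worked, except for a few cases."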
Now consider the Indigo cheminformatics toolkit. It implements the Kabsch
algorithm but the reference it gives is only to the 1976
paper. Given the quality of their work, I'll assume it includes the
correction, but which does it use? (I couldn't find "Kabsch" mentioned
in their code so I don't know.)
Or look at the Wikipedia page and see how Cealign
implements "the Kabsch algorithm." I looked in the source code and
found:
# @SUMMARY: -- QKabsch.py. A python implementation of the optimal superposition
# of two sets of vectors as proposed by Kabsch 1976 & 1978.
So when someone says "the Kabsch algorithm", do they mean the 1976
paper or the 1978 paper? Until I updated the Wikipedia page, there was
no mention of the 1978 correction, and the Indigo example lends
strength to my argument that people consider the 1976 paper as being
"the reference" but the implementations include the 1978 correction.
You can easily see how different variations might make things more
complex: what if the correction fixed only a minor point, or was
written by someone else? I recall learning about a math theorem which
goes by 3 letters (the surname initials of its 3 main developers) in
every country except England, where an Englishman's name is added as
a 4th. But these are hypotheticals and I wanted to stay concrete.
How does an ontology try to capture this? If the ontology is complex
enough to handle this, is it still easy enough for the simple cases?
Once there's an agreed upon ontology for this, will there be a new
special case which doesn't fit into the existing ontology?
I don't know, and I don't care enough to work on the ontologies. I
prefer at best well-defined (and nearly non-hierarchical) controlled
vocabularies along with enough background information so that people
can figure out how the bits and pieces go together.
(P.S. There really is no strong point or conclusion to this post. It's
details that Janna asked me to post so it would be externally
referenceable as part of their ontology discussion.)