This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: KNIME and beginners
I gave a presentation at OpenEye's
CUP last week. More precisely, I was assigned a talk with the
title "Evils of KNIME." I don't choose that sort of name, but the CUP
organizers like to be a bit confrontational with presentation
titles. I used my speaking slot as a platform for expressing my views
on dataflow/visual languages. I don't like them, and think their
effectiveness is limited compared to a text language, so I explained
why. Other people do like them and enjoy them. I've asked them why,
and they have some good reasons. My presentation outlined those
responses with some observations of my own, including suggestions for
ways to improve the text-based toolkits so they are more accessible to
"non-programmers."
The next few posts will be based on parts of that talk. Feel free to
leave comments.
Upcoming training classes (pre-announcement)
I ended by pointing out that these are technological solutions. Why
not spend some time training computational chemists to be more
effective at writing software? I provide that sort of
training. If you are interested, email me. I'm pinning down
the dates for a course in Leipzig in mid-May (likely 18-20 May), and
another in Boston in late July. I'll announce them when the dates are
determined. If you want to influence those dates or schedule a course
at your site, let me know.
Sample test case for KNIME
I haven't used KNIME for about two years. That experience was with
KNIME 1.x. People told me that it's gotten better, so I decided it was
time to take a fresh look. Last time I couldn't get it to work on
my Mac. I'm happy to report that things have changed, although there
are still some difficulties with updates.
My test case is simple: print the number of heavy atoms for each
structure in an SD file. Here is how I would do it with pybel:
import pybel
for mol in pybel.readfile("sdf", "benzodiazepine.sdf.gz"):
    print(mol.OBMol.NumHvyAtoms())
It's not as short as I would like because I had to specify "sdf" twice
and because I had to reach down into the underlying OpenBabel
molecule object. Still, it's a lot more succinct than using any of the
base toolkits directly, and a good reference for what a text-based
programming language is capable of when designed for ease of use.
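The repeated "sdf" could in principle be avoided by inferring the format from the filename. Here is a minimal sketch of that idea. The guess_format helper is hypothetical and is not part of pybel; it only assumes that the format name matches the file extension once a compression suffix is removed.

```python
import os

# Hypothetical helper, not part of pybel: infer a format name from a filename.
COMPRESSION_SUFFIXES = (".gz", ".bz2")


def guess_format(filename):
    """Return the lowercase file extension, ignoring a compression suffix."""
    base, ext = os.path.splitext(filename)
    if ext.lower() in COMPRESSION_SUFFIXES:
        # Strip ".gz" or ".bz2" and look at the extension underneath it.
        base, ext = os.path.splitext(base)
    if not ext:
        raise ValueError("cannot guess the format of %r" % (filename,))
    return ext[1:].lower()
```

With a helper like this the loop could call pybel.readfile(guess_format(filename), filename), at the cost of guessing wrong for files with unconventional extensions.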
What molecular properties can I compute? And how do I do it?
The first step was to find out if KNIME could compute the number of
heavy atoms. When I say "KNIME" I mean "the CDK nodes which come with
KNIME" since KNIME is a dataflow-based visual programming language
with support for a number of extension packages, including chemistry
nodes based on the CDK. Schrodinger, Tripos, ChemAxon and likely other
companies provide nodes based on their respective toolkits, but I
don't have a license to those tools. In any case the Mac version of
KNIME doesn't yet support adding new nodes.
The most likely candidate was "Molecular Properties." The help says:
Create new columns holding molecular properties, computed for each
structure. The computations are based on the CDK toolkit and include
logP, molecular weight, number of aromatic bonds, and many others.
What other properties does it compute? I put the node on the workspace
and double clicked on it to bring up the dialog box. The result is:
The dialog cannot be opened for the following reason:
No column in spec compatible to "CDKValue".
Huh? What does that mean?
A Google search for that error message found the same question
from 9 September 2009 although concerning a different node. Bernd
Wiswedel answered:
We obviously need to improve on the error messages. You need to
process the output of the SD reader with the "Molecule to CDK" node,
which will parse the structures into an appropriate format for the
Lipinski node. Reason is that the Lipinski node is contributed from
the CDK plugin, so it needs its desired input format.
What this means is that the inputs need to be set up correctly before I can
see more details. However, it's more complicated than that. If I chain
"SDF Reader" into "Molecule to CDK" into "Molecular Properties",
I still get the same error message when I click on the "Molecular
Properties" box. Double-clicking on the "Molecule to CDK" box gives me:
The dialog cannot be opened for the following reason:
No column in spec compatible to "SdfValue" "SmilesValue" "MolValue" "Mol2Value" or "CMLValue".
It turns out I need to put a valid SD filename in the "SDF Reader" box
(the one with the exclamation point under it) in order to get the
right inputs to "Molecule to CDK", which in turn is needed to see the
"Molecular Properties" dialog.
How accessible is KNIME to first-time users?
Is that really friendly for first-time users? That is, how is a
first-time user supposed to: 1) know which options are available if
they can't open an unconnected node, 2) know which inputs are required
for a node, or for that matter see what outputs are available, 3) know
that the output of the "SDF Reader" needs to be converted by "Molecule to CDK"
before it can be used by the CDK nodes?
Of course all those can be explained in the documentation, and perhaps
they are explained. I admit I haven't read it, but then again the
knime.org documentation doesn't show how to use the CDK nodes. And
should someone have to read the documentation in order to do something
basic like this task? If so, are dataflow systems really any easier
than working with a text-based programming language?
Can't compute the number of heavy atoms?
I looked through the list of properties which could be computed:
Atomic Polarizabilities
Aromatic Atoms Count
Aromatic Bonds Count
Element Count
Bond Polarizabilities
Bond Count
Carbon connectivity index (order 1)
Carbon connectivity index (order 0)
Eccentric Connectivity Index
Fragment Complexity
Hydrogen Bond Acceptors
Hydrogen Bond Donors
Largest Chain
Largest Pi Chain
Petitjean Number
Rotatable Bonds Count
Topological Polar Surface Area
Vertex adjacency information magnitude
Molecular Weight
Zagreb Index
(BTW, it really does have mixed capitalization. Why yes, I am a
nitpicker. How did you guess? ;) )
No "heavy atom count." The next option was to see if there's a way to
get counts based on a SMARTS pattern. Nope, I didn't find
anything.
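For reference, a heavy atom count is just the number of non-hydrogen atoms. The sketch below computes it in plain Python, without any toolkit, under the assumption that every record is a V2000 connection table with the atom count in the first three columns of the counts line and the element symbol in columns 32-34 of each atom line. The function names are made up for this illustration, isotopes written as D or T are treated as hydrogen, and there is no error handling for malformed records.

```python
HYDROGENS = {"H", "D", "T"}


def count_heavy_atoms(record_lines):
    """Count non-hydrogen atoms in one V2000 record given as a list of lines."""
    # Lines 1-3 are the header; line 4 is the counts line.
    num_atoms = int(record_lines[3][0:3])
    atom_lines = record_lines[4:4 + num_atoms]
    # x, y, z take 30 columns, then a space, then a 3-column element symbol.
    symbols = (line[31:34].strip() for line in atom_lines)
    return sum(1 for symbol in symbols if symbol not in HYDROGENS)


def heavy_atom_counts(sd_text):
    """Yield the heavy atom count of each record in the text of an SD file."""
    record = []
    for line in sd_text.splitlines():
        if line.strip() == "$$$$":  # record terminator
            yield count_heavy_atoms(record)
            record = []
        else:
            record.append(line)
```

This only sees atoms that are written out in the file, so implicit hydrogens never enter the count, which is what a heavy atom count wants anyway.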
As far as I can tell, there's no way with the default nodes to do much
of anything with KNIME. I assume there are additional packages which I
can install, but why aren't there more useful CDK nodes as part of the
standard installation? An obvious one to me would be a SMARTS count
pattern matcher, where I could specify the SMARTS pattern, the option
for unique or non-unique match counts, and the output column name.
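To make the unique versus non-unique option concrete: a SMARTS matcher typically reports each match as a tuple of atom indices, and a symmetric pattern can match the same atoms in more than one order. The sketch below assumes the matches have already been found by some toolkit and only shows the counting. The name count_matches is hypothetical, and "unique" is taken to mean "distinct sets of atoms", which is one common convention rather than the definition used by any particular node.

```python
def count_matches(matches, unique=True):
    """Count substructure matches given as tuples of atom indices.

    With unique=False every reported match is counted. With unique=True,
    matches covering the same set of atoms are counted only once.
    """
    if not unique:
        return len(matches)
    return len({frozenset(match) for match in matches})


# Example: the pattern "cc" on benzene, with ring atoms numbered 0-5.
# Each aromatic bond is reported once in each direction.
benzene_cc = [(i, (i + 1) % 6) for i in range(6)]
benzene_cc += [(j, i) for (i, j) in benzene_cc]
```

For this example list, count_matches(benzene_cc, unique=False) gives 12 and count_matches(benzene_cc) gives 6, one per aromatic bond.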
Is my problem because I'm on a Mac? Do Linux users get more nodes? Or
is there something else I'm missing? How would you find the number of
heavy atoms using KNIME? Is there a solution using the default CDK
nodes or do I have to use one of the commercial toolkits?