This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Wrapping Dragon
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
An interesting thing happened recently while doing work for a client.
I needed to make an interface to Dragon,
a program that can compute a large number of chemical descriptors.
It was originally a GUI program for MS Windows, written in Borland
Delphi, but there is now a command-line version for Linux built with
Borland's Kylix.
It's a batch oriented program. It reads configuration data from a
file, including the filenames used for structure input and descriptor
output. The structure input file is actually a list of filenames to
the actual structure files, one filename per line, so there are two
levels of indirection here. The filename '-' for stdin/stdout is not
supported. I wanted to turn this into a stream oriented program so I
could give it one structure at a time. Using multiple "batches" of
one structure at a time caused too much overhead.
This looked like a good chance to use Unix named
pipes, also known as a FIFO for "First In First Out." The mkfifo
function (in the os module) creates a named pipe in the file system.
A named pipe acts like a normal file to the standard open/fopen functions.
One program opens the named pipe for reading and the other opens it
for writing. If a read occurs when there is no data, the reading process
blocks until there is data or the writing process closes the pipe.
Similarly, opening the pipe for writing blocks until another process
has opened it for reading, and writes block once the pipe's buffer is full.
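Here is a minimal sketch of that behavior, using a thread in place of
the second program. The function name and the file name "demo.fifo"
are made up for the illustration, and os.mkfifo is only available on
Unix-like systems.

```python
import os
import tempfile
import threading


def fifo_round_trip(message):
    """Send `message` through a named pipe and return what the reader saw."""
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "demo.fifo")
    os.mkfifo(fifo_path)  # creates the pipe's entry in the file system

    def writer():
        # This open() blocks until the reader has opened the other end.
        with open(fifo_path, "w") as outfile:
            outfile.write(message)
        # Leaving the with-block closes the pipe; the reader sees end-of-file.

    thread = threading.Thread(target=writer)
    thread.start()
    try:
        # The ordinary open() call works on a named pipe. read() returns
        # once the writer has closed its end.
        with open(fifo_path) as infile:
            received = infile.read()
    finally:
        thread.join()
        os.remove(fifo_path)
        os.rmdir(workdir)
    return received
```

Calling fifo_round_trip("hello\n") hands back the same string, after
it has passed through the pipe rather than through a regular file.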
Turning a simple batch program into a stream program is conceptually
easy. Make two named pipes, one for the structure input and the other
for the descriptor output. Tell Dragon to use those pipes instead of
normal files. To compute the Dragon descriptors, write the structure
to a temporary file and write the filename to Dragon's input pipe then
read the descriptors from the output pipe.
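The outline above can be sketched roughly as follows. This is not the
wrapper I wrote for the client. The function toy_batch_program is a
hypothetical stand-in for Dragon that runs in a thread and reports the
length of each structure file instead of real descriptors. The class
name, pipe names, and temporary file names are likewise invented.

```python
import os
import shutil
import tempfile
import threading


def toy_batch_program(list_path, result_path):
    """Hypothetical stand-in for the batch tool: read filenames from
    list_path and write one result line per file to result_path."""
    with open(list_path) as names, open(result_path, "w") as results:
        for line in names:
            name = line.strip()
            if not name:
                continue  # skip blank lines
            with open(name) as structure_file:
                text = structure_file.read()
            results.write(f"{os.path.basename(name)}\t{len(text)}\n")
            results.flush()  # make the result visible to the reader now


class StreamWrapper:
    """Feed one structure at a time to a batch program via two pipes."""

    def __init__(self):
        self.workdir = tempfile.mkdtemp()
        self.list_fifo = os.path.join(self.workdir, "structures.fifo")
        self.result_fifo = os.path.join(self.workdir, "descriptors.fifo")
        os.mkfifo(self.list_fifo)
        os.mkfifo(self.result_fifo)
        self.worker = threading.Thread(
            target=toy_batch_program,
            args=(self.list_fifo, self.result_fifo))
        self.worker.start()
        # Open the pipes in the same order as the batch program does,
        # otherwise both sides wait on each other forever.
        self.requests = open(self.list_fifo, "w")
        self.results = open(self.result_fifo)
        self.count = 0

    def compute(self, structure_text):
        self.count += 1
        path = os.path.join(self.workdir, f"mol{self.count}.txt")
        with open(path, "w") as structure_file:
            structure_file.write(structure_text)
        self.requests.write(path + "\n")
        self.requests.flush()
        line = self.results.readline()  # blocks until the result arrives
        os.remove(path)
        return line.rstrip("\n").split("\t")

    def close(self):
        self.requests.close()  # end-of-file ends the batch program's loop
        self.worker.join()
        self.results.close()
        shutil.rmtree(self.workdir)
```

The stand-in cooperates by flushing each result and reading line by
line. A real batch program makes no such promises.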
Tricking programs like this can be, err, tricky. Dragon (or more
likely the Kylix I/O library) reads a block of text from the input and
extracts lines from the block. This is identical to how Python's
for line in open("filename"): works. If there isn't enough
data for a block then Dragon hangs waiting for more data. As it
happens, Dragon ignores blank lines so I padded the filename with about
1000 extra newlines.
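A small sketch of the padding idea follows. The function name is
invented, and the default block size of 1024 bytes is taken from the
read size mentioned in this essay. Padding to a whole number of blocks
is my assumption about what keeps a block reader from stalling, not
something Dragon documents.

```python
def padded_request(filename, block_size=1024):
    """Return the filename line plus enough blank lines to fill whole blocks."""
    line = (filename + "\n").encode()
    # Pad up to the next multiple of block_size so that a reader which
    # waits for a full block always gets one.
    padding = -len(line) % block_size
    return line + b"\n" * padding
```

For a short filename such as "/tmp/mol1.mol" the result is exactly
1024 bytes long: the filename, its newline, and 1010 blank lines that
the reader ignores.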
I figured that out using the strace command. It's a debugging
tool that lets you see all of the system calls made by a program. In
this case I used strace to see that Dragon was hanging trying to read
1024 bytes from the structure input file handle.
Problems caused by processes blocking for input are pretty common.
Dragon, though, had one condition that was more unusual. Its output
file looks like this:
Dragon version 1
2 2
Name MW AMW SV
Mol1 18.2 18.2 1.4
Mol2 348.4 349.1 65.3
That is, the first line is a version string, the second lists the
number of compounds processed and the number successfully processed.
The remaining lines are tab separated columns with the third line
listing the property names. I'm describing this from memory and I'm
pretty sure I've made a mistake because I think the header names are
really on the 4th line. Still, close enough for this essay.
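For illustration, here is a small parser for the layout as shown in
this essay: version line, counts line, tab-separated column names, then
one tab-separated row per compound. Since that layout is from memory,
treat the line positions as assumptions. The function name and the
SAMPLE text are made up for the example.

```python
def parse_descriptor_output(text):
    """Parse the simplified output layout described in this essay."""
    lines = text.splitlines()
    version = lines[0]
    num_compounds, num_ok = (int(field) for field in lines[1].split())
    column_names = lines[2].split("\t")
    rows = {}
    for line in lines[3:]:
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        # The first column is the compound name, the rest are numbers.
        values = [float(field) for field in fields[1:]]
        rows[fields[0]] = dict(zip(column_names[1:], values))
    return version, num_compounds, num_ok, rows


SAMPLE = "\n".join([
    "Dragon version 1",
    "2 2",
    "\t".join(["Name", "MW", "AMW", "SV"]),
    "\t".join(["Mol1", "18.2", "18.2", "1.4"]),
    "\t".join(["Mol2", "348.4", "349.1", "65.3"]),
]) + "\n"
```

parse_descriptor_output(SAMPLE) returns the version string, the two
counts, and a dictionary mapping each compound name to its descriptors.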
For some strange reason Dragon opens and writes the header several
times. That is, it opens the file, seeks to the beginning, writes the
header, and closes the file, then repeats this process several times
before it starts writing the data. I think it does this once per
descriptor group. My program, reading from the named pipe hooked into
Dragon's output, needed to ignore the multiple closes and wait until
it gets actual data.
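One way to ignore the closes, which is not necessarily how my wrapper
did it, is for the reader to hold its own write handle on the pipe.
The pipe then always has at least one writer, so the other program's
closes never show up as end-of-file. The sketch below assumes header
lines can be recognized by the prefixes in HEADER_PREFIXES, and
toy_writer is a hypothetical stand-in for Dragon's repeated header
writes.

```python
import os
import tempfile
import threading

# Assumption: header lines can be recognized by these prefixes.
HEADER_PREFIXES = ("Dragon version", "Name\t")


def open_fifo_without_eof(path):
    """Open a FIFO for reading so that writers closing never yields EOF."""
    # A non-blocking open returns at once even though no writer exists yet.
    read_fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    # Our own write end keeps the pipe "open for writing" at all times.
    keepalive_fd = os.open(path, os.O_WRONLY)
    os.set_blocking(read_fd, True)  # later reads should wait for data
    return os.fdopen(read_fd), keepalive_fd


def read_data_line(infile):
    """Return the next line that is not a (repeated) header line."""
    while True:
        line = infile.readline()
        if not line:
            raise EOFError("pipe closed before any data arrived")
        if not line.startswith(HEADER_PREFIXES):
            return line


def toy_writer(path):
    """Stand-in writer: rewrites the header a few times, then sends data."""
    header = "Dragon version 1\nName\tMW\n"
    for _ in range(3):
        with open(path, "w") as outfile:
            outfile.write(header)
    with open(path, "w") as outfile:
        outfile.write(header + "Mol1\t18.2\n")


def demo():
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "descriptors.fifo")
    os.mkfifo(path)
    infile, keepalive_fd = open_fifo_without_eof(path)
    writer = threading.Thread(target=toy_writer, args=(path,))
    writer.start()
    try:
        return read_data_line(infile)
    finally:
        writer.join()
        infile.close()
        os.close(keepalive_fd)
        os.remove(path)
        os.rmdir(workdir)
```

The simpler alternative, reopening the pipe every time a read hits
end-of-file, can lose data if the writer opens, writes, and closes in
the gap between the reader's close and its next open.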
Also, if you'll look back at the example output you'll see the second
line reports how many compounds were computed. Dragon can't write
this line until it knows how many structures are in the input and how
many can be processed. What Dragon does instead is write the output
with that line omitted. After all the input has been processed it
renames the output file to a temporary file then copies the temporary
file back into the original filename, inserting the proper second line
into the copy.
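As a rough sketch of those steps as I understand them (this is not
Dragon's actual code, and the function name, the ".tmp" suffix, the
space between the two counts, and the final removal of the temporary
file are all assumptions):

```python
import os


def insert_counts_line(output_path, num_compounds, num_ok):
    """Rename the output, then copy it back with a counts line inserted."""
    temp_path = output_path + ".tmp"  # hypothetical temporary name
    os.rename(output_path, temp_path)
    with open(temp_path) as source, open(output_path, "w") as copy:
        copy.write(source.readline())  # the version line stays first
        copy.write(f"{num_compounds} {num_ok}\n")  # the new second line
        for line in source:  # everything else is copied unchanged
            copy.write(line)
    os.remove(temp_path)
```

Note that the whole file is rewritten just to add one line, so the
cost grows with the size of the output.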
This meant Dragon renamed the named pipe then opened it for input. It
was blocking on the read because there was no data written to that
pipe. My wrapper didn't need the counts so when it was done with
Dragon I had it rename the named pipe, replace it with a normal file
containing only a few lines, and only then close the output to
Dragon's input pipe. Dragon then moved the short file and inserted
the count information in the copy. It's a bit of a dance but it works
and is sufficiently robust.
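The wrapper's half of the dance amounts to something like the sketch
below. The function name and the ".retired" suffix are invented, and
the stub text is left as a parameter because I'm not reproducing the
exact lines here.

```python
import os


def swap_pipe_for_stub(output_path, stub_text):
    """Move the named pipe aside and put a short regular file in its place."""
    retired_path = output_path + ".retired"  # hypothetical name
    os.rename(output_path, retired_path)  # the pipe keeps working, renamed
    with open(output_path, "w") as stub:
        stub.write(stub_text)
    return retired_path
```

Only after this swap does the wrapper close its end of the input pipe,
so that the batch program's final rename and copy operate on the short
regular file rather than on the pipe.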
I was worried about the wrapper's performance. My first version was
very simple because it restarted Dragon once per structure; a batch
size of one. Simple but very slow because the setup and startup costs
are much larger than the time needed to compute the descriptors for a
given compound.
I tried my wrapper code with a moderately sized data set. It was faster
than the normal command-line version. That normally doesn't happen!
With more research and staring at the strace output I found out the
performance bottleneck was disk I/O. There was a small speedup
because the output was read by another program instead of going to a
file. The big part was from the dance to use a named pipe and also
allow Dragon to insert the second line. It takes a good chunk of time
to copy a large file (I suspect the internal Dragon code for the copy
isn't that fast) so when I replace a large output file with a small
file that overhead goes away.