This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: New Cheminformatics Projects
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
I've started two new open projects for cheminformatics and I'm looking
for help in both of them.
Chemistry Toolkit Rosetta
The Chemistry Toolkit Rosetta (CTR) is a set of common cheminformatics tasks
implemented using a variety of different toolkits and approaches. It
is meant primarily as a way for people to understand and compare how
the different APIs work.
Currently there are 16 tasks, 14 of which are well-defined and have at
least one solution (in OpenEye/Python since that's what I know best).
Several also have solutions in Pybel, and there are a couple of RDKit and
CDK solutions as well.
It needs your help. The project started in part because I don't know
RDKit, CDK, or Indigo that well - to say nothing of the commercial
tools available from Symyx, Accelrys, Schrodinger, and others. I know
them a bit better now, but not enough.
Feel free to contribute a solution in your toolkit of choice! Or
provide commentary, feedback, or improve an existing solution. You can
even contribute a new task, if it's characteristic of a frequently
encountered cheminformatics-related problem which several toolkits can
handle.
By the way, I give a big thanks to Noel O'Boyle for his feedback on
the project direction and for his Pybel and Cinfony contributions to
help flesh out CTR before this public announcement.
Chem Fingerprints
The other project I started is called "chem-fingerprints"
or "chemfp" for short. Its goal is to develop a couple of file formats
for cheminformatics fingerprints as well as tools and libraries which
work with those formats.
The main problem it addresses is that there is no widely used
fingerprint format, so each research group or even individual
researcher ends up making a new one, as well as the tools to work with
it. See the use cases for some more detailed examples.
So far I've written a proposal for a line-oriented text format called
"FPS", meant to be easy to generate and parse, and have sketched out a
binary format called "FPB", meant for fast loading at the expense of
some preprocessing.
The FPS format is simple enough that you can likely figure out most of
it from the example given in the specification.
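To illustrate why a line-oriented text format is easy to generate and
parse, here is a minimal Python sketch of a writer and a reader. The
layout it uses is an assumption made for illustration, not the layout
defined in the FPS specification: header lines start with "#" and hold
key=value pairs, and each record line is a hex-encoded fingerprint, a
tab, and an identifier. The function names and the "num_bits" header
key are likewise hypothetical.

```python
import io


def write_fingerprints(records, num_bits, out):
    """Write (identifier, fingerprint_bytes) pairs, one record per line.

    Assumed layout (illustrative only, not the FPS specification):
    '#key=value' header lines, then '<hex fingerprint><TAB><identifier>'.
    """
    out.write(f"#num_bits={num_bits}\n")
    for identifier, fp in records:
        out.write(f"{fp.hex()}\t{identifier}\n")


def read_fingerprints(infile):
    """Return (header_dict, [(identifier, fingerprint_bytes), ...])."""
    header = {}
    records = []
    for line in infile:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("#"):
            # Header line: split '#key=value' at the first '='.
            key, _, value = line[1:].partition("=")
            header[key] = value
            continue
        # Record line: hex fingerprint, tab, identifier.
        hex_fp, _, identifier = line.partition("\t")
        records.append((identifier, bytes.fromhex(hex_fp)))
    return header, records


# Round trip through an in-memory file.
buffer = io.StringIO()
write_fingerprints(
    [("mol1", bytes([0x0F, 0xA0])), ("mol2", bytes([0xFF, 0x01]))],
    num_bits=16,
    out=buffer,
)
buffer.seek(0)
header, records = read_fingerprints(buffer)
```

Because every record sits on its own line, such a file can also be
produced or inspected with ordinary text tools.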
I've developed a set of tools to generate FPS fingerprints from
OpenEye, Open Babel, and RDKit, as well as to extract fingerprints from SD
tags, specifically the CACTVS substructure keys in PubChem. These are
available from the Mercurial repository.
These tools are still in development and, for now, are primarily meant
as a way to get concrete feedback on the specification.
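As a rough idea of what extracting a value from an SD tag involves,
here is a minimal Python sketch. It is not the code from the
repository. It relies only on the general SD file conventions: the
first line of a record is its title, the molfile block ends with
"M  END", each data item starts with a "> <TAG>" line followed by
value lines and a blank line, and "$$$$" ends the record. The function
name, the tag name EXAMPLE_FP, and the tag values are made up for
illustration.

```python
import io


def iter_sd_tags(infile):
    """Yield (title, {tag: value}) for each record of an SD file."""
    title = None
    tags = {}
    current_tag = None
    in_data = False  # becomes True after the molfile block's "M  END"
    for line in infile:
        line = line.rstrip("\r\n")
        if title is None:
            title = line  # first line of a record is its title
        elif line == "$$$$":
            # End of record: emit it and reset the state.
            yield title, {tag: "\n".join(v) for tag, v in tags.items()}
            title, tags, current_tag, in_data = None, {}, None, False
        elif not in_data:
            in_data = line.startswith("M  END")
        elif line.startswith(">") and "<" in line:
            # Data header such as "> <EXAMPLE_FP>": take the text in <...>.
            current_tag = line[line.index("<") + 1 : line.rindex(">")]
            tags[current_tag] = []
        elif current_tag is not None:
            if line:
                tags[current_tag].append(line)
            else:
                current_tag = None  # a blank line ends the data item


# Two toy records with no atoms; the tag values are placeholders.
sample = """\
mol1
  example

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <EXAMPLE_FP>
AAADceB7

> <NOTE>
first

$$$$
mol2
  example

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <EXAMPLE_FP>
AAADccBz

$$$$
"""

results = list(iter_sd_tags(io.StringIO(sample)))
```

A real extraction tool would then decode the tag value into
fingerprint bits, which depends on how the data source encodes it.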
Other tools I would like to develop, perhaps with your help, are
command-line programs for similarity search and substructure filters.
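To make "similarity search" concrete, here is a minimal Python sketch
of a threshold search. It assumes the Tanimoto coefficient as the
similarity measure, which is a common choice for binary fingerprints
but is my assumption here, not something these tools define. The
function names and the sample fingerprints are hypothetical.

```python
def tanimoto(fp1: bytes, fp2: bytes) -> float:
    """Tanimoto similarity: bits set in both / bits set in either."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    both = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    either = sum(bin(a | b).count("1") for a, b in zip(fp1, fp2))
    # Two all-zero fingerprints: report 0.0 (a choice; conventions vary).
    return both / either if either else 0.0


def threshold_search(query, targets, threshold):
    """Return (identifier, score) pairs with score >= threshold,
    sorted from most to least similar."""
    hits = []
    for identifier, fp in targets:
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((identifier, score))
    hits.sort(key=lambda hit: (-hit[1], hit[0]))
    return hits


query = bytes.fromhex("0fa0")
targets = [
    ("mol1", bytes.fromhex("0fa0")),  # identical to the query
    ("mol2", bytes.fromhex("0f00")),  # shares 4 of 6 set bits
    ("mol3", bytes.fromhex("f000")),  # shares no set bits
]
hits = threshold_search(query, targets, threshold=0.5)
```

A linear scan like this is the simplest approach; a dedicated tool
would likely use faster bit counting and a format built for quick
loading.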
I'm also looking for input and feedback on the format definitions, and
for people who want to add support for these formats in their tools.