This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Screening Teaser
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
I have to work on other tasks for the next couple of months (paying
work calls, and I need to prepare for my Python and Django
programming courses for cheminformatics; February 2011 in Leipzig,
spaces still available!) so I don't have time to write up what I did
to optimize the subgraph enumeration algorithm so it's about 3 times
faster.
I do want to give you an idea of one of the directions I'm going. I
want to see if there are better substructure fingerprint filters than
the published ones I know about (MACCS and CACTVS), and I want to use
more automated target-based techniques to find those keys. (Actually,
I want to use query-based techniques but getting access to good query
data sets is hard.)
I can now process about 150 structures per second, so I analyzed the
first 352,710 PubChem structures (those with ids between 1 and 400000)
with k=6 to get 128,284 unique SMARTS patterns with up to 6 atoms. I
removed those patterns which occured in less than 0.1% of the
structures leaving 7,999 SMARTS patterns.
I want to make a substructure filter based only on those SMARTS
patterns. This is different from the MACCS and CACTVS keys which have
fields for "more than 2 carbons" or "halogen". Can I pick, say, 64 of
those SMARTS in order to have good selectivity?
What does "good" mean? I don't have the time to work out the ideas,
but my first thought was to minimize the number of structures which
have the same fingerprint. (I think this minimizes the NxN search
time, but don't quote me on that.)
This is a high-dimensional search in a non-smooth space. After 20
years I finally have a chance to use a genetic algorithm. I downloaded
Pyevolve and spent just
enough time playing with it to get preliminary results. Result? Worse
than a naive algorithm I'll talk about next, and without some nice
characteristics like consistency of keys when the input set
changes. That's often the problem with GAs. Well, there went two days
of CPU time.
My naive approach was to start with the set of all compounds. For each
SMARTS pattern I can break the set up into two parts: those with the
pattern and those without. With a target-based optimization the best I
can do is find the SMARTS which most closely splits the set in
half. Now I have two sets. Repeat the algorithm but this time see how
the SMARTS splits every set and use the sum of the squares of the size
differences as my metric. (I have no idea if that's really the right
metric.)
I let it run for about 6-7 hours and got the following set of SMARTS,
from most divisive to least:
Cc(c)cc
CCO
CN
C=O
CCCC
cN
n
C=C
Cl
CC(C)C
cO
S
COC
cc(c)c
CC(C)O
CC
CN(C)C
O
C1CCCCC1
CCC=O
F
CCCCCC
Br
C(CO)O
C
CccC
CNC
C1CCCC1
NO
P
C(=O)O
cc(c)nc
C(CO)CO
CCC
N
cccc
CCC(C)C=O
C(CN)N
cCc
cc(c)c(c)c
CC(C)(C)C
[I]
C(O)O
CcccC
C=N
CC(=O)C
o
CCNCC
ccncn
C(CCO)CO
[Si]
c(cO)O
CccccC
CCOCC
s
CCC=C(C)C
[As]
CC(C)N
C#C
[Se]
NN
COCCOC
CC(C)C(C)C
c(cCl)cCl
At least they look reasonable.
At that point there were 53,430 fingerprints with at least 2 members,
and at least one fingerprint with 773 structures. I think that means
that an NxN search will need 773*N subgraph isomorphism tests. (It
does not mean that a query needs only 773 tests since a search for
"CC" will hit a lot of structures.)
I wonder how useful they are in practice. If you test it out, let me
know.
These are meant mostly as a teaser. They are not a replacement
for existing substructure keys. It's tuned for a single target data
set and it's made up of the SMARTS which are a subset of SMILES. A
real set of filters should never have a small query which requires
subgraph isomorphism against a large portion of the database. For
example, if the query is [Dy] then these filters will do nothing at
all to help.
Nobelium
BTW, what in the world is CID444218? Its molecular formula is
C10H39Ac10N7No7NpO6PS-3. Seven
atoms of nobelium?
That's not bad for something with a half-life of an hour and where "no
[chemical] compounds exist for nobelium." Nor is that the only PubChem
records with nobelium. Anyone want to place an order for MolPort-006-699-029?
With overnight shipping you'll only need 4kg in order to get 1g the
next day.