This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Optimizing substructure keys
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
Advertisement: This essay describes a research project I've slowly
been working on during the last 18 months. I believe I can do very
fast substructure searches of large compound data sets. I would like
to make it more real, but that requires funding.
Contact me if you want
to collaborate and can pay for it.
There are two related goals in my quest for an inverted index library,
which is unfortunate because what I want for one is not necessarily
the best for the other. One goal is to see if I can do PubChem-scale
substructure searches with "interactive" performance (under 0.1
seconds). I think I can do that with a lot of precomputation based on
substructure enumeration and optimized key development. Based on the
feedback from various people, I think this is best handled within an
existing database, and most probably PostgreSQL.
The first goal requires developing new database substructure search
methods. My other goal is to see if I can optimize substructure keys
within the framework used in most existing substructure search
programs.
The existing approaches use either an explicit substructure-based
fingerprint like the
MACCS or CACTVS/PubChem keys,
or an exhaustive hash-based fingerprint like the
Daylight fingerprints.
The substructure-based ones are hand-optimized and small (under 1000
bits), but don't do well with unusual chemistry. The exhaustive ones
can handle diverse chemistries, but are larger (typically 1024-4096
bits).
I want to see if I can use my
subgraph enumeration approach
to do target- and query-based key optimization. That is, given a set of
structures in the database and a characteristic set of queries, I
would like to develop short, dense fingerprints which can optimize
those queries.
Problem: I lack a query set
I'm not there yet. The biggest problem is that I lack any sort of
representative sample of the types of queries people do. People at
pharmas don't want to contribute their data sets because it's hard to
remove all of the proprietary queries from the dataset. By law,
PubChem is not allowed to release this data.
Do you have a good (1,000+) query set that I can use for a benchmark?
If yes, email me!
So until I get a good query set, I'll have to assume that the targets
themselves are representative queries. This is of course incorrect:
a substructure query will almost by definition be smaller than the
structures in the database. Plus, if someone uses a sketch program to
make the query, they will build it up an atom or ring at a time, so
the first dozen or so queries will likely be small and have lots of
matches.
Optimizing the query keys
At its heart, a substructure query is a subgraph isomorphism
test. This is slow. A substructure filter is a quick test which tries
to reduce the number of required isomorphism tests. For example, if an
"N" (nitrogen) exists in the query but does not exist in the target
then there's no need to do the full isomorphism test.
However, if "N" does not exist in the query then it might still match
a target molecule which has an "N" in it. For example, the query "O"
should still match the target "N#[N+][O-]" (nitrous oxide).
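In fingerprint terms this filter is just a subset test: every feature
present in the query must also be present in the target. A minimal
sketch in Python, using frozensets of feature names (the feature names
here are illustrative, not the actual key definitions):

```python
def may_match(query_features, target_features):
    """Substructure filter: a target can only match if it has every
    feature the query has. This never gives a false negative, but it
    can give false positives, so a surviving target still needs the
    full subgraph isomorphism test."""
    return query_features <= target_features  # subset test

query = frozenset({"O"})
target = frozenset({"N", "O"})        # e.g. nitrous oxide, N#[N+][O-]
print(may_match(query, target))       # True: isomorphism test still needed

query2 = frozenset({"N"})
target2 = frozenset({"C", "O"})
print(may_match(query2, target2))     # False: safely rejected
```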
So if a query feature is "1" then the corresponding target feature
must also be "1"; if the target's feature is "0", the target can be
rejected without the isomorphism test. This gives a simple decision
tree: all compounds with feature i take one branch, and all which
don't take the other. Since I'm assuming that the queries are sampled
from the targets, what I want is the set of features which minimizes
the sizes of the leaf nodes of the decision tree.
Last year I tried a
genetic algorithm
to find this, but it didn't converge well, and the results weren't
pleasing. For example, if I decide that I want to add 32 bits to a
fingerprint, then I have to redo the search from scratch and I end up
with a completely different set of fingerprints.
A greedy optimization algorithm
I came up with a greedy optimization instead. The base of the decision
tree has all of the records in a single node. I find the "most
divisive" feature, defined as the feature which best separates the
tree into two branches, with about half of the records in one branch
and half in the other. Each iteration finds the feature which most
evenly splits all of the current nodes. I use the scoring function:
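The idea can be sketched as follows, assuming (this is my
reconstruction, not necessarily the exact formula) that a feature's
score sums, over all leaf nodes, the imbalance between the "have" and
"have not" branches, with the smallest score winning:

```python
def score(feature_set, leaves):
    """Score one candidate feature: for each leaf node, compare how
    many records have the feature against how many do not. A score of
    0 means every leaf is split exactly in half; lower is better.
    This is an assumed reconstruction of the scoring function.
    `feature_set` is the set of record ids having the feature;
    `leaves` is a list of sets of record ids."""
    total = 0
    for leaf in leaves:
        n_have = len(leaf & feature_set)  # the expensive intersection
        n_have_not = len(leaf) - n_have
        total += abs(n_have - n_have_not)
    return total

# Feature {1, 2} splits the single leaf {1, 2, 3, 4} exactly in half:
print(score(frozenset({1, 2}), [{1, 2, 3, 4}]))  # 0
```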
This does a lot of set intersection tests; about 30 million per
iteration in my benchmark. Actually, since all I care about is the
number of elements in the intersection, and not the set itself, I wish
that Python had a "set.intersection_size()" method.
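Lacking that, one workaround is to represent each set of record ids as
a Python integer bitset, so that & is the intersection and a popcount
gives its size without building an intermediate set. This is only an
illustration of the idea, not the code the timings here were measured
with:

```python
def to_bitset(record_ids):
    """Pack a set of small integer record ids into one Python int."""
    bits = 0
    for i in record_ids:
        bits |= 1 << i
    return bits

def intersection_size(a, b):
    """Popcount of a & b: the intersection size, with no intermediate
    set. (On Python 3.10+, (a & b).bit_count() is faster than
    counting the "1"s in the binary string.)"""
    return bin(a & b).count("1")

a = to_bitset({1, 5, 9})
b = to_bitset({5, 9, 12})
print(intersection_size(a, b))  # 2
```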
I use the scoring function to find the most divisive feature. This is
the one which generates the smallest score. I use that feature to
split all of the current leaves into "haves" and "have nots." As an
optimization, I've decided that if there are fewer than 10 elements in
a node then I don't care about further splitting that node. That's
small enough that I can easily do a subgraph isomorphism test.
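The splitting step itself is plain set algebra. In outline (my sketch
of the step described above, with the 10-record cutoff as an assumed
`min_size` parameter):

```python
def split_leaves(leaves, feature_set, min_size=10):
    """Split every leaf into a "have" and a "have not" child on one
    feature. Children with fewer than `min_size` records are set
    aside: they are small enough for direct subgraph isomorphism
    testing. An assumed sketch, not the original implementation."""
    still_splitting, finished = [], []
    for leaf in leaves:
        for child in (leaf & feature_set, leaf - feature_set):
            if not child:
                continue  # the feature didn't divide this leaf
            if len(child) < min_size:
                finished.append(child)
            else:
                still_splitting.append(child)
    return still_splitting, finished
```

Each later iteration then only scores candidate features against the
leaves still large enough to be worth splitting.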
There are a few other optimizations. I ignore duplicate fingerprints,
and I ignore features with fewer than 10 records in them. Set
intersections are expensive, so I also do a short-circuit evaluation
while computing the scoring function and abort the calculation if the
current score is ever worse than the current best score. I also sort
the nodes so the ones with the largest size go first, which increases
the chance of early short-circuiting.
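Combined, the short-circuit and sorting optimizations look roughly
like this (again an assumed sketch): visit the largest nodes first so
the running score grows quickly, and abandon a candidate feature as
soon as it can no longer beat the best score found so far:

```python
def score_with_cutoff(feature_set, leaves, best_score):
    """Like the basic scoring loop, but gives up early. Since each
    leaf can only add to the total, the feature can be rejected the
    moment the running total reaches `best_score`; visiting the
    largest leaves first makes that happen sooner. An assumed
    sketch, not the original implementation."""
    total = 0
    for leaf in sorted(leaves, key=len, reverse=True):
        n_have = len(leaf & feature_set)
        total += abs(2 * n_have - len(leaf))  # |have - have_not|
        if total >= best_score:
            return None  # cannot beat the current best feature
    return total
```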
I do some other optimizations, but you'll need to
see the code for all
the details. Yesterday I found that PyPy 1.9's set intersection was about
twice as fast as the one in CPython, so that's what I used to run the code.
Preliminary results
I used the same benchmark set
(37 MB 7zip compressed) from
yesterday's essay,
which is a small subset of my full data set. From
previous analysis,
I know it has high local fingerprint diversity, so I figure it's a
good enough start. Here's the subset of output containing the feature
keys and the amount of time it took to find each one:
This says that 353 bits are all that's needed in order to put 202,632
unique fingerprints into bins with at most about 10 unique
fingerprints in a given bin.
I think that's quite amazing given that most people use fingerprints
which are at least twice and usually at least four times that
length.
Also, it took 1.75 hours to generate this, and most of the time is
spent doing len(set1 & set2).
Is it useful?
I really wish I had a good query set to use for tuning - and another
dataset to use for validation. I don't, but Greg Landrum made a
synthetic one for testing the RDKit/PostgreSQL cartridge. Last
December we worked on optimizing the fingerprint filters. I used an
earlier version of this program to generate 54 query bits, which he
fused with his hash-based fingerprint. Greg
reported:
Tacking your most recent 54 bits, interpreted as SMILES, onto the end
of a shorter version of my new fingerprint gives:
[06:10:40] INFO: FINISHED 50001 (41150823 total, 1325699 searched, 1118055 found) in 80.30
[06:10:40] INFO: screenout: 0.03, accuracy: 0.84
The original performance for the shorter fingerprint was:
[05:04:23] INFO: FINISHED 50001 (41150823 total, 1693315 searched, 1118055 found) in 90.90
[05:04:23] INFO: screenout: 0.04, accuracy: 0.66
so those bits definitely help with the screenout.
Putting the bits at the beginning of the fingerprint gets a bit more speed:
[06:14:48] INFO: FINISHED 50001 (41150823 total, 1325699 searched, 1118055 found) in 71.09
[06:14:48] INFO: screenout: 0.03, accuracy: 0.84
I think a 20% speedup is nothing to sneeze at.
I would still like an inverted index library
The greedy algorithm does a lot of set intersections, partitions sets
into subsets, and removes old sets. I feel like this is not what
databases are designed for. That's why I'm looking for an inverted
index library which lets me manage boolean set operations myself. Then
again, I'm not a database person, and I don't know if my belief is true.
I ran my greedy feature selection algorithm until it ran out of data,
at only 353 bits. I want to have 1024 substructure bits. This means I
need to feed my algorithm a lot more data. That's fine - my test set
is only 1/200th of the data I have. However, my implementation running
the test set takes up 2 GB of memory on my 12 GB desktop machine, and
took 6300 seconds (1h45m) to run.
My concern is that a distributed memory implementation of my algorithm
is more complex to write than a shared memory one. The Amazon
high-memory instances have 68 GB of RAM. I estimate I need 400 GB with
Python sets, and under 40 GB with a more compact bitset implemented in
C. This seems eminently doable, which is why I'm looking to see if
someone has already done the work.
Based on my searching, and the lack of pointers by others, the answer
is "no."