This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: The Structure Query Collection
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
Finally! Long time readers might recall my long-term goal of improving
small molecule search performance. There's my chemfp project
for high-performance Tanimoto searches, and I have ideas on how to optimize
substructure keys. But they suffer from a lack of a good
performance benchmark.
Well, I still don't have a benchmark, since that requires
targets. What I do have are several sets of structure queries, in a
form that you can easily use to make your own benchmarks. I call it the
Structure Query
Collection, or SQC for short.
So you want to know if search is faster
There are dozens, if not hundreds of papers on ways to improve search
performance. Let's suppose you have a faster way to do a substructure
search. (By this I mean that someone sketches a compound, which
your search program interprets as a substructure query.) To show
improvement, you need a baseline. The easiest approach is to select a
random set of compounds from the database itself to use as queries. A problem
is that all of the queries will have at least one match - the query
compound itself. Real-world queries will sometimes not find
matches. Perhaps you can partition your data so that some of the
compounds aren't in the database and can be used as queries. What's a good
ratio for that?
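For illustration, a held-out split like that might be sketched as follows. This is a minimal example of my own; the function name and the toy SMILES are made up, not part of any published benchmark:

```python
import random

def split_queries_targets(smiles_list, num_queries, seed=12345):
    """Hold out a random subset of compounds to use as queries;
    the remaining compounds stay in the database as search targets."""
    rng = random.Random(seed)
    shuffled = smiles_list[:]
    rng.shuffle(shuffled)
    return shuffled[:num_queries], shuffled[num_queries:]

# Toy dataset; a real benchmark would use millions of structures.
dataset = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "CCOC", "C1CC1"]
queries, targets = split_queries_targets(dataset, 2)
```

With this split the queries are guaranteed not to be in the target set, so some of them may legitimately find zero matches, which is closer to real-world behavior.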
There's an even more fundamental problem: I think people who do a
substructure search tend to use smaller structures than what's in the
database. If the benchmark queries and targets come from the same set
of compounds, then the queries are probably not characteristic of how
the dataset will be used. This is only a suspicion: am I right?
The search for real-world data
The only way to find out is to look at real-world query sets. I know
people at pharmas who of course have logs from internal searches, but
it's considered confidential information since it may contain
proprietary compound structure data. Even if the dataset was filtered
so that only the public structures are released, there still may be
enough for a competitor to get an idea of the types of structures the
company is interested in. So that's a no.
PubChem by law can't reveal any information about the user-submitted
queries. I contacted ChemSpider, but at the time they didn't have
enough queries to be worthwhile. In general, companies which provide
search don't want to reveal their queries since they get paid by
people who want that information to be secret. And so on.
(BTW, if you have a chemical query data set that you can make
public, email me and
let me know so that I can add it to the collection.)
BindingDB queries
But all is not lost. Last year I talked with Michael Gilson of BindingDB. They make no promises
or set any expectation of privacy for user-submitted data, and he had no
problem with getting me that information. More specifically, he asked
Tiqing Liu to do the data extraction for me.
That meant some munging of the log files, and a bit of back and forth
while I figured out what Tiqing did and he figured out what
information I wanted. The result was a set of SMILES, sketched in
Marvin, for each of three classes of search: 3.3 million for exact
search, 550,000 for similarity search, and 14,000 for substructure
search.
(While I first asked for this data last year, this topic is very much
a background task and I didn't have the time to follow up on it until
a couple of weeks ago.)
SMARTS queries
That covers searches where the input is a sketched structure. What
about SMARTS searches? Rather fewer people sketch or write SMARTS
queries directly. Instead, SMARTS are more often part of the automated
workflow for feature counts and structure alerts. While I would love
to get my hands on real-world user-created SMARTS query sets, workflow
query sets are useful too. Luckily, I know of a few.
A few weeks ago, Ehrlich and Rarey published Systematic
benchmark of substructure search in molecular graphs - From Ullmann to
VF2. This has a set of SMARTS patterns extracted from various
sources. They used it to benchmark two different subgraph isomorphism
implementations. The SMARTS patterns are available in the
supplementary information, although it took me a while to figure out
how to get the patterns.
Greg Landrum had developed a similar benchmark for RDKit, but it was
somewhat hidden away in its source distribution and not many people
knew about it.
For a classical touch, RDKit defines 162 of the 166 MACCS keys as
SMARTS patterns. The MACCS keys are widely used as a cheminformatics
fingerprint, so this could be used to benchmark fingerprint
generation. SQC actually includes the tweaked version of these SMARTS
which I put into chemfp. I changed a few of the SMARTS to be more
cross-toolkit friendly.
The "Structure Query Collection"
I've collected all of these together into one spot, called the Structure Query
Collection. Not a very creative name, I know, but it is
descriptive. There are currently the three BindingDB structure sets
and the three SMARTS query sets.
Each set contains the raw data or relevant subset of the data as I got
it from the respective sources, as well as source details. The raw
data needs processing - at the very least to put the SMILES or SMARTS
into an easily used form - so I also included the details, including
software, for how I extracted and cleaned up the data so it can be
used in a benchmark.
First exploration: substructure query sizes
Earlier I conjectured that structures used as a substructure query
will tend to be smaller than the molecules in the target set. Now I
have the data to test that hypothesis.
I'll use my OpenSMILES-ragel
project, which has a command-line program to count the number of atoms
and bonds in a SMILES string. It does very little SMILES processing,
and mostly assumes that the SMILES are correct beyond the basic syntax
level.
Here I get the number of atoms in each substructure query:
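To give a feel for what that counting step does, here is a deliberately simplistic atom counter in Python. This is my own approximation, not the OpenSMILES-ragel implementation: it only matches bracket atoms and the organic-subset element symbols, and it ignores many SMILES subtleties:

```python
import re

# A bracket atom like [nH] or [NH4+] counts as one atom; otherwise
# match organic-subset element symbols (two-letter symbols first so
# "Cl" isn't split into "C" plus a stray "l").
_ATOM = re.compile(r"\[[^\]]*\]|Br|Cl|B|C|N|O|S|P|F|I|b|c|n|o|s|p")

def count_atoms(smiles):
    """Approximate atom count for a SMILES string."""
    return len(_ATOM.findall(smiles))

print(count_atoms("CC(=O)Cl"))  # 4
```

Ring-closure digits, bond symbols, and branch parentheses simply don't match the pattern, so only the atoms are counted.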
For representative targets, I'll use the same ZINC subset from Ehrlich
and Rarey's supplemental file "supp3.zip", in
datasets/zinc_lead-like_2011-02-12_first100k.smi. This contains some
SMILES which the OpenSMILES grammar doesn't accept (it uses an odd
SMILES dialect I've not seen before), but I can filter out the
failures by removing lines where the number of atoms is "-1".
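The filtering step itself is trivial. Here is a sketch, assuming a whitespace-separated output with the atom count in the first column (the actual smiles_counts output format may differ):

```python
def filter_failures(lines):
    """Drop records whose first field (the atom count) is -1,
    which marks a SMILES the parser rejected."""
    return [line for line in lines if line.split()[0] != "-1"]

# Toy records: count followed by the SMILES it came from.
rows = ["3 CCO", "-1 C1%", "6 c1ccccc1"]
print(filter_failures(rows))  # ['3 CCO', '6 c1ccccc1']
```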
Since I wasn't sure if there was a bias in the '-1' reports, I
double-checked this by changing the failure cases to something which
was no longer a valid SMILES but where smiles_counts would still
report the same number of atoms. First, I check that the
transformation works, and then I redo the count:
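The comparison itself then reduces to averaging the per-record counts. A toy version with made-up numbers (the real averages came from the full BindingDB query set and the ZINC target file):

```python
def mean_count(counts):
    """Average a list of per-record atom counts."""
    return sum(counts) / len(counts)

query_counts = [17, 19, 18]    # made-up numbers, not the real data
target_counts = [19, 21, 20]
print(mean_count(query_counts), mean_count(target_counts))  # 18.0 20.0
```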
So, BindingDB substructure queries average 18 atoms while the ZINC
targets average about 20 atoms. This agrees with my hypothesis, but I
would need to get a copy of BindingDB for better assurance.