This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: ECFP-like fragments in PubChem
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
Previously I posted about unique
fragments in PubChem. That used my molecular
subgraph enumeration algorithm. In this essay I'll report some
results from looking at the unique bit counts from RDKit's
MorganFingerprint algorith, which is an ECFP variant.
My first graph in the previous essay shows that there are about 2
million unique fragments of size up to 7 in PubChem, and that the
second 1/2 of the data files contained few fragments which weren't in
the first 1/2. This suggests that there aren't that many substructures
of size 7, compared to the number of possible structures of size 7,
which is quite curious.
Rather, I expect that the number of unique fragments should level off
with enough molecules. In the simplest case, there are 112 elements
and 5 elements which can be aromatic, for a total of 117 possible
unique atom types. I found 110 of them in my PubChem subset.
Similarly, I found only 1103 unique fragments with two atoms. The
breakdown as a function of fragment length is:
Length 1: 110 unique fragments
Length 2: 1103 unique fragments
Length 3: 4209 unique fragments
Length 4: 19398 unique fragments
Length 5: 86336 unique fragments
Length 6: 364342 unique fragments
Length 7: 1447488 unique fragments
(However, there are 2199 fragments which my SMILES atom count code
didn't parse. I'll need to figure out what caused it to fail.)
How many possible substructures are there of size 7? Assuming only 10
atom types and ignoring cycles and different bond types gives about 10
million. There are 1+1+2+6+21+112+853=996 connected
subgraphs of up to size 7, and I'll guess that 800 of them are
chemically accessible. I'll hazard 2**6 different bond types, so about
500 billion possible substructures. That's over three orders of
magnitude larger than what I see.
It's curious because it means that any substructure-based fingerprint
using up to 7 atoms has at most a small multiple of ~2,000,000
different values. Sure, a hash algorithm might potentially generate
values in the range 0-232, but only a few million of them
will be actually be generated.
(Some algorithms will generate multiple bits per feature, eg, features
with 1 or 2 atoms generates a single bit while features with 3 or more
atoms generates two bits. This acts as a simple weighting
scheme. There's a perfect correlation between those two bits, so I'm
not counting the second one as meaningful.)
Fingerprint statistical models often assume a roughly Bernoulli
process, and the number of unique features is unbounded, though
increasingly rare. This observation suggests that there is actually an
upper limit to the size, which changes the distribution type slightly.
Is this observable in other fingerprints? I used RDKit's
MorganFingerprint algorithm, which is a variation of the ECFP
"extended connectivity fingerprint." I used radii values of 1, 2, 3,
4, and 5, and with the other parameters at their default. Each step
includes increasingly distant information, so should be more diverse.
The following shows the number of unique bits found as a function of
the number of molecules processed. The molecules are ranked by PubChem
id, although it doesn't include all of the PubChem structures since
RDKit couldn't process all of the structures. (It complains about a
number of bad valences.) A more complete analysis would randomize the
structures to remove local coherence effects.
That graph is almost impossible to understand because the dynamic
range is so large. Log and log-log scales don't help either. The
best solution is to normalize by the maximum number for each
graph. That gives:
This sort of curve is a species
discovery curve. It appears to show that the
MorganFingerprint(radius=1) saturates at around 100,000 different bit
values, and radius=2 might saturate at around 3.5 million unique
values. This makes sense, as the radius=2 fingerprint corresponds to
about 6-7 atoms, and I found 2 million unique values. (An average
branching factor of 2.5 gives 2.5**2 or 6.25 atoms. However, a local
branching factor of 3 gives 9 atoms, which adds more unique values to
the fingerprint.)
I'll guess that there's under 40 million unique bits for radius=3 but
it becomes harder to estimate. As the radius increases, the trend in
the diversity of new values clearly gets closer to linear, which means
there's less and less saturation. I can't predict the total number of
unique values for radius=5 because it's still too flat.
The species growth curve is often fit as A(1-exp(-bx)) or
A(log(b*x-1)). The first has a fixed limit, the latter implies there
is no upper bound. This case is somewhere in the middle: for good
chemical reasons, there's a large but finite number of possible ways
to arrange a fixed number of atoms. For the smallest fragment size (1
atom), we are at that limit. For larger sizes, we are nowhere near the
chemical limit, and I think a log fit works best.
Equally obvious, I would need to randomize the input order in order to
get a smoother curve so I could make that prediction. But the end of
the holidays was a couple of days ago and I need to get back to paying
work.