This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Unique fragments in PubChem
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
For reasons I'll get into later, I wanted to get an idea of the
subgraph distribution of PubChem. That is, given my method for molecular
subgraph enumeration, I wanted to generate all subgraphs of up to 7 atoms
and get an idea of how common they are. More specifically, atom
uniqueness depends only on the atomic element and aromaticity, as
assigned by OEChem, and the unique bond categories are
"single-or-aromatic", double, and triple.
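To make the typing scheme concrete, here is a minimal sketch of how atoms and bonds could be reduced to those categories. It is not the code used for this analysis and does not call OEChem. The function names and the (element, aromatic flag) and (bond order, aromatic flag) inputs are assumptions made up for illustration, standing in for whatever the toolkit's perception reports.

```python
def atom_label(element, is_aromatic):
    """Return a SMILES-like label: lowercase for aromatic atoms.

    Hypothetical helper; 'element' is an element symbol such as "C" or "N".
    """
    return element.lower() if is_aromatic else element


def bond_category(order, is_aromatic):
    """Collapse a bond into one of the three categories named above.

    Hypothetical helper; 'order' is the integer bond order (1, 2, or 3).
    """
    if is_aromatic or order == 1:
        return "single-or-aromatic"
    if order == 2:
        return "double"
    if order == 3:
        return "triple"
    raise ValueError("unsupported bond order: %r" % (order,))
```

Under this scheme an aromatic carbon and an aliphatic carbon are different atom types ("c" versus "C"), while a single bond and an aromatic bond are treated as the same bond type.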
Last month I downloaded 2,138 sdf.gz files from PubChem and did
structure perception with OpenEye's OEChem. Starting a couple of weeks
ago, I used my subgraph enumeration algorithm to process 1,724 of
them. For some reason, it stopped at that point. Since it took 7.5
days to process those files, and the data set is already a bit
ungainly, I decided to leave the full analysis for another time and to
not figure out what happened with the processing.
In the 1,724 files are 21,570,907 PubChem records and my enumeration
found 1,925,185 unique substructures.
I kept track of the number of unique fragments per input file and the
running total number of unique fragments over all of the files,
plotted here:
You can see that 50% of the unique fragments are in the first 25% of
the data files and essentially all are found in the first 50% of the
files. (The number does increase after the 1000th file, but it's very
slow.) It's also interesting to see the internal structural diversity
in the different files. I suspect there are some large regions made
from contributed combinatorial libraries.
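The bookkeeping behind those two curves can be sketched in a few lines of Python. This is an illustration only, assuming each input file has already been reduced to a collection of canonical fragment strings. The function name and the toy data are made up, and the real run works on far more data than fits comfortably in a list like this.

```python
def unique_fragment_growth(per_file_fragments):
    """For each file, return (unique fragments in that file,
    running total of unique fragments seen so far)."""
    seen = set()
    rows = []
    for fragments in per_file_fragments:
        in_file = set(fragments)   # unique fragments in this file
        seen |= in_file            # add them to the running set
        rows.append((len(in_file), len(seen)))
    return rows


# Toy input: three "files", each a list of fragment strings.
example = [["C", "O", "CC"], ["C", "N"], ["C", "O"]]
growth = unique_fragment_growth(example)
for per_file, running_total in growth:
    print(per_file, running_total)
```

The first number in each row is the per-file diversity and the second is the cumulative curve, which flattens once new files stop contributing fragments that have not been seen before.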
The unique fragments which occur in the most records are (record count, then fragment):
21387437 C
20195255 O
19959057 c
19892743 cc
19755355 ccc
19457485 cccc
19270867 CC
19015890 ccccc
18599872 cccccc
18488545 c1ccccc1
18386628 N
17672171 Cc
17324074 Ccc
17109361 CN
16985355 Cccc
16533358 C=O
16522121 Ccccc
15993406 Cc(c)c
15759069 Cc(c)cc
15508521 Cccccc
You shouldn't be surprised to see that carbon is found in 21,387,437
of the 21,570,907 structures.
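For reference, a count like this can be built with collections.Counter, as long as each record contributes a given fragment at most once. The sketch below assumes each record has already been turned into a list of fragment strings. The function name and the toy records are hypothetical, not the actual pipeline.

```python
from collections import Counter


def count_records_per_fragment(records):
    """Count how many records contain each fragment.

    A fragment that occurs several times in one record is counted once
    for that record, which is why each record is turned into a set first.
    """
    counts = Counter()
    for fragments in records:
        counts.update(set(fragments))
    return counts


# Toy input: three records, each a list of fragment strings.
records = [["C", "CC", "C"], ["C", "O"], ["c", "cc", "C"]]
counts = count_records_per_fragment(records)
for fragment, n in counts.most_common(1):
    print(n, fragment)
```

Here "C" is reported with a count of 3 even though it appears twice in the first record, because the count is of records, not of occurrences.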
I made a distribution plot of the fragments, where the horizontal axis
is rank order (C, then O, c, cc, and so on). I show it at a few different
scales in order to get a better understanding of the
distribution. It's quite obviously *not* a Zipf distribution.
The vertical axis is the count in millions. You can see that the
10,000th most common substructure is in a very small percentage of the
structures; it's actually 0.5%.
At the other end of the list, 478,278 fragments (24.8%) exist only
once (like C#NF), 251,372 fragments (13.1%) exist twice (like B#[Cr]),
and 132,574 fragments (6.89%) exist thrice. Here are the first 20 values
as a table:
1 478278 # In other words, 478,278 substructures exist only once in the data set
2 251372
3 132574
4 100665
5 67536
6 57500
7 42959
8 37983
9 31750
10 28684
11 24016
12 23169
13 18695
14 17659
15 15501
16 14717
17 13452
18 12500
19 11394
20 11276
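A table like this is a "count of counts": for each n, how many fragments occur in exactly n records. Here is a minimal sketch, assuming a dictionary that maps each fragment to its record count. The function name is made up, and apart from the two examples mentioned above (C#NF once, B#[Cr] twice) the toy values are invented for illustration.

```python
from collections import Counter


def count_of_counts(fragment_counts):
    """Map n -> number of fragments that occur in exactly n records."""
    return Counter(fragment_counts.values())


# Toy data; only the C#NF and B#[Cr] counts come from the text above.
fragment_counts = {"C": 5, "O": 5, "C#NF": 1, "B#[Cr]": 2, "CCF": 1}
histogram = count_of_counts(fragment_counts)
for n in sorted(histogram):
    print(n, histogram[n])

# Share of singletons in the real data, using the totals quoted above:
# 478,278 of 1,925,185 unique fragments, which rounds to 24.8%.
share_once = 100 * 478278 / 1925185
```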