Python Buzz Forum - Unique fragments in PubChem

Andrew Dalke


Andrew Dalke is a consultant and software developer in computational chemistry and biology.


Posted: Dec 24, 2011 8:19 PM

This post originated from an RSS feed registered with Python Buzz by Andrew Dalke.
Original Post: Unique fragments in PubChem
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.

For reasons I'll get into later, I wanted to get an idea of the subgraph distribution of PubChem. That is, using my method for molecular subgraph enumeration, I wanted to generate all subgraphs of up to 7 atoms and see how common they are. More specifically, atom uniqueness depends only on the atomic element and aromaticity, as assigned by OEChem, and the unique bond categories are "single-or-aromatic", double, and triple.
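The atom and bond classes described above can be sketched without any toolkit. This is only an illustration: the function names and the tuple representation are hypothetical, and the aromaticity flags are assumed to come from a toolkit's perception step (OEChem in this post).

```python
def atom_class(element, is_aromatic):
    """Atoms are distinguished only by element and aromaticity."""
    return (element, bool(is_aromatic))


def bond_class(order, is_aromatic):
    """Collapse bonds into the three categories used for fragment uniqueness."""
    # Single and aromatic bonds are deliberately treated as the same category.
    if is_aromatic or order == 1:
        return "single-or-aromatic"
    if order == 2:
        return "double"
    if order == 3:
        return "triple"
    raise ValueError(f"unsupported bond order: {order!r}")


# An aliphatic carbon and an aromatic carbon are different atom classes,
# which is why "C" and "c" are counted as separate fragments.
print(atom_class("C", False), atom_class("C", True))
print(bond_class(1, False), bond_class(1, True), bond_class(2, False))
```

With this scheme, two fragments count as the same only if their atoms and bonds match class for class, so a fragment like Cc records an aliphatic carbon attached to an aromatic one without saying whether the bond is single or aromatic.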

Last month I downloaded 2,138 sdf.gz files from PubChem and did structure perception with OpenEye's OEChem. Starting a couple of weeks ago, I used my subgraph enumeration algorithm to process 1,724 of them. For some reason, it stopped at that point. Since it took 7.5 days to process those files, and the data set is already a bit ungainly, I decided to leave the full analysis for another time and not to figure out what happened with the processing.

The 1,724 files contain 21,570,907 PubChem records, and my enumeration found 1,925,185 unique substructures.

I kept track of the number of unique fragments per input file and the running total number of unique fragments over all of the files, plotted here:

You can see that 50% of the unique fragments are in the first 25% of the data files and essentially all are found in the first 50% of the files. (The number does increase after the 1000th file, but very slowly.) It's also interesting to see the internal structural diversity in the different files. I suspect there are some large regions made from contributed combinatorial libraries.
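The bookkeeping behind that kind of plot might look like the sketch below. It is not the code used for the post: the function names are hypothetical, and the input is assumed to be one collection of fragment strings per file, in file order.

```python
def unique_fragment_growth(fragments_per_file):
    """Return (unique in this file, running total of unique) for each file."""
    seen = set()
    rows = []
    for fragments in fragments_per_file:
        fragments = set(fragments)   # unique fragments within this file
        seen.update(fragments)       # unique fragments over all files so far
        rows.append((len(fragments), len(seen)))
    return rows


def files_needed(rows, fraction):
    """Number of files after which the running total reaches `fraction` of the final total."""
    total = rows[-1][1]
    for file_number, (_, running) in enumerate(rows, 1):
        if running >= fraction * total:
            return file_number
    return len(rows)


# Toy data standing in for the per-file fragment sets.
toy_files = [{"C", "CC"}, {"C", "O"}, {"CC"}, {"N"}]
rows = unique_fragment_growth(toy_files)
print(rows)                     # per-file and running counts
print(files_needed(rows, 0.5))  # files needed to reach half of the final total
```

Dividing the result of `files_needed` by the number of files gives the "50% of the fragments in the first 25% of the files" style of statement.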

The unique fragments found in the largest number of records are:

21387437 C
20195255 O
19959057 c
19892743 cc
19755355 ccc
19457485 cccc
19270867 CC
19015890 ccccc
18599872 cccccc
18488545 c1ccccc1
18386628 N
17672171 Cc
17324074 Ccc
17109361 CN
16985355 Cccc
16533358 C=O
16522121 Ccccc
15993406 Cc(c)c
15759069 Cc(c)cc
15508521 Cccccc

You shouldn't be surprised to see that carbon is found in 21,387,437 of the 21,570,907 structures.
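A table like this one counts records, not occurrences: a fragment that appears several times in one structure still adds only one to its count. Here is a minimal sketch of that counting, assuming each record has already been reduced to a list of fragment strings (the `record_counts` name and the toy records are hypothetical):

```python
from collections import Counter


def record_counts(records):
    """Count, for each fragment, the number of records that contain it."""
    counts = Counter()
    for fragments in records:
        counts.update(set(fragments))  # each fragment counts once per record
    return counts


toy_records = [["C", "CC", "C"], ["C", "O"], ["c", "cc"]]
counts = record_counts(toy_records)
print(counts.most_common(1))

# Share of the processed records that contain at least one aliphatic carbon,
# using the two numbers quoted in the post.
carbon_fraction = 21387437 / 21570907
print(f"{carbon_fraction:.2%}")  # roughly 99.15%
```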

I made a distribution plot of the fragments, where the horizontal axis is rank order (C, then O, c, cc, and so on). I show it at a few different scales in order to get a better understanding of the distribution. It's quite obviously *not* a Zipf distribution.

The vertical axis is the count in millions. You can see that the 10,000th most common substructure is in a very small percentage of the structures; it's actually 0.5%.
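One quick way to see the departure from Zipf is to compare the top of the table with what a Zipf curve would predict. The sketch below assumes the simplest form of Zipf's law, where the count at rank r is the rank-1 count divided by r; a generalized exponent would change the numbers but not the overall picture. The observed values are the first five counts from the table above.

```python
# Record counts for the five most common fragments (C, O, c, cc, ccc).
observed = [21387437, 20195255, 19959057, 19892743, 19755355]


def zipf_prediction(top_count, rank):
    """Expected count at `rank` under Zipf's law with exponent 1 (an assumption)."""
    return top_count / rank


ratios = []
for rank, count in enumerate(observed, 1):
    predicted = zipf_prediction(observed[0], rank)
    ratios.append(count / predicted)
    print(rank, count, round(predicted), round(count / predicted, 2))
```

Under Zipf, the second most common fragment would appear in about half as many records as the first. Here it appears in nearly as many, and the ratio of observed to predicted keeps growing with rank over this range.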

At the other end of the list, 478,278 fragments (24.8%) exist only once (like C#NF), 251,372 fragments (13.1%) exist twice (like B#[Cr]), and 132,574 fragments (6.89%) exist thrice. Here are the first 20 values as a table,

1 478278  # In other words, 478,278 substructures exist only once in the data set
2 251372
3 132574
4 100665
5 67536
6 57500
7 42959
8 37983
9 31750
10 28684
11 24016
12 23169
13 18695
14 17659
15 15501
16 14717
17 13452
18 12500
19 11394
20 11276

and in graphical form.
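This table is a "count of counts": for each record count n, the number of fragments found in exactly n records. Given a mapping from fragment to record count, it takes one more pass with a Counter. The sketch below uses hypothetical toy counts for the mapping, then checks the quoted percentages against the 1,925,185 unique substructures reported earlier.

```python
from collections import Counter


def count_of_counts(fragment_counts):
    """Map n -> number of fragments that occur in exactly n records."""
    return Counter(fragment_counts.values())


# Hypothetical toy counts, not real PubChem numbers.
toy_counts = {"C": 3, "O": 2, "C#NF": 1, "B#[Cr]": 2, "N": 1, "S": 1}
hist = count_of_counts(toy_counts)
for n in sorted(hist):
    print(n, hist[n])

# Percentages for the first three rows of the table in the post.
TOTAL_UNIQUE = 1925185
for n, num_fragments in [(1, 478278), (2, 251372), (3, 132574)]:
    print(n, num_fragments, f"{num_fragments / TOTAL_UNIQUE:.1%}")
```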
