This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: How to compile openCHORD on a Mac
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
In my upcoming
training course I will teach how to develop web applications for
cheminformatics. A few people have signed up already but there are
still slots open for those interested. It will be in Leipzig in the
latter part of May.
One topic will be connecting to a remote relational database to do
chemistry queries. I decided to base it on TJ O'Donnell's openCHORD, which is a
chemistry extension on top of PostgreSQL. TJ quite literally wrote the
book on the
Design and Use of Relational Databases in Chemistry, which uses CHORD for its examples.
I am not a database person. I know enough to be able to use it, and
design schemas. Most databases are designed with a keeper in mind
(called a DBA, for "DataBase Administrator"), who knows and cares a
lot about how the database works. I'm not one of those people. I dread
each time I have to work with a database on my own.
(One exception is SQLite. It's designed to have zero administration,
and because it's an in-process database it doesn't have the hassles of
connecting to a database process that MySQL, PostgreSQL, Oracle and
others have. But it doesn't make sense for pharmas to keep their
corporate data in SQLite.)
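To make that contrast concrete, here's a minimal sketch of what zero-administration looks like with Python's bundled sqlite3 module; the table and data are invented for illustration:

```python
import sqlite3

# No server process, no DBA: the whole database lives in one file,
# or, as here, entirely in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compounds (id INTEGER PRIMARY KEY, smiles TEXT)")
conn.execute("INSERT INTO compounds (smiles) VALUES (?)", ("c1ccccc1O",))
rows = conn.execute("SELECT smiles FROM compounds").fetchall()
print(rows)
conn.close()
```

There is nothing to install, start, or connect to, which is exactly why it's so pleasant for a non-DBA, and also why it's not a fit for shared corporate data.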
TJ is a database person and has been using PostgreSQL for many
years. He made his openCHORD code available. It's designed for someone
like him who knows how to make the pieces go together. I don't have
that experience so I struggled for a while. Here's my step-by-step
process. I hope it helps me next time I need to do this, and you, if
you want to try it out.
Prerequisites
openCHORD uses C and Python code to extend PostgreSQL
functionality. It uses OpenBabel for the chemistry. This means you'll
need to have Python installed (which is always the case on a Mac) as
well as gcc (available in the developer tools). You'll also need to
install OpenBabel, including the
Python interface.
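Before going further it's worth confirming the Python pieces are importable. This is a generic pre-flight helper of my own, not part of openCHORD; "openbabel" is the name the OpenBabel Python bindings install under:

```python
import importlib.util

def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# On a correctly prepared machine this should print an empty list.
print(missing_modules(["openbabel"]))
```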
Install PostgreSQL with server- and client-side Python support
First I needed to install PostgreSQL, along with support for
server-extensions written in Python ("plpythonu") and the Python
client code.
I'm using a Mac. I have MacPorts installed. That's a
package management system for a lot of the freely available Unix
packages. My first step was to ask for the Python client library for
PostgreSQL and see what happens:
% sudo port install py-pygresql
Password:
---> Computing dependencies for py-pygresql
---> Fetching postgresql83
---> Attempting to fetch postgresql-8.3.8.tar.bz2 from http://arn.se.distfiles.macports.org/postgresql83
---> Attempting to fetch postgresql-8.3.8.tar.bz2 from http://ftp7.de.postgresql.org/ftp.postgresql.org/source/v8.3.8/
---> Verifying checksum(s) for postgresql83
---> Extracting postgresql83
...
You can see it installed version 8.3.8. The most recent version is 8.4.3
and the most recent in the 8.3 series is 8.3.10, but 8.3.8 should be
plenty good enough for my needs.
plpython
However! After working through TJ's instructions I got a failure
trying to execute:
sudo -u $owner createdb openchord
where "$owner" is the database superuser account, in my case "dalke",
though it's often "postgres".
It gave me an error which included the message
could not access file "$libdir/plpython"
That's because PostgreSQL needs to be compiled with support for Python
extensions in the server. The default ports installation does not do
that. Instead, I need to request the "+python" install option
% sudo port install -d postgresql83-server +python
---> Computing dependencies for postgresql83-server
---> Fetching postgresql83-server
---> Verifying checksum(s) for postgresql83-server
---> Extracting postgresql83-server
---> Configuring postgresql83-server
---> Building postgresql83-server
---> Staging postgresql83-server into destroot
---> Creating launchd control script
###########################################################
# A startup item has been generated that will aid in
# starting postgresql83-server with launchd. It is disabled
# by default. Execute the following command to start it,
# and to cause it to launch at startup:
#
# sudo launchctl load -w /Library/LaunchDaemons/org.macports.postgresql83-server.plist
###########################################################
---> Installing postgresql83-server @8.3.8_0
To create a database instance, after install do
sudo mkdir -p /opt/local/var/db/postgresql83/defaultdb
sudo chown postgres:postgres /opt/local/var/db/postgresql83/defaultdb
sudo su postgres -c '/opt/local/lib/postgresql83/bin/initdb -D /opt/local/var/db/postgresql83/defaultdb'
To tweak your DBMS, consider increasing kern.sysv.shmmax by adding an increased kern.sysv.shmmax .. to /etc/sysctl.conf
---> Activating postgresql83-server @8.3.8_0
---> Cleaning postgresql83-server
PostgreSQL is now installed. Make sure its installation directory is on
your shell path. If you used MacPorts and installed version 8.3 like
me (and use the same shell I do) then that's:
set path= ( $path /opt/local/lib/postgresql83/bin )
Initialize database storage with initdb
The next step is to initialize storage for the database. I'm going to
put it under ~/teaching/pg .
% initdb -D ~/teaching/pg
The files belonging to this database system will be owned by user "dalke".
This user must also own the server process.
The database cluster will be initialized with locale en_US.UTF-8.
The default database encoding has accordingly been set to UTF8.
The default text search configuration will be set to "english".
creating directory /Users/dalke/teaching/pg ... ok
creating subdirectories ... ok
selecting default max_connections ... 20
selecting default shared_buffers/max_fsm_pages ... 1600kB/20000
creating configuration files ... ok
creating template1 database in /Users/dalke/teaching/pg/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok
WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the -A option the
next time you run initdb.
Success. You can now start the database server using:
postgres -D /Users/dalke/teaching/pg
or
pg_ctl -D /Users/dalke/teaching/pg -l logfile start
All this did was initialize the storage space; it did not start the
database server. I'll do that after I install the openCHORD
extension.
Installing openCHORD
Now that you know the base PostgreSQL installation works, the next
step is to install the openCHORD extension. The source file is named
openchord.tgz.
It unpacks to the local directory so I did:
% curl -O http://www.gnova.com/demos/openchord.tgz
% mkdir openchord
% cd openchord
% tar -xvzf ../openchord.tgz
x amw.sql
x bits.sql
x frowns-core.sql
x makedb
x Makefile
x makenci
x openbabel-core.sql
x openchord.py
x perlmol-core.sql
x public166keys.sql
x sdfloader
x similarity.sql
x splitter.sql
x tpsa.sql
x varbit.c
x varbit.so
x varbit.sql
x vla4.sdf
There are three installation phases:
1. build and install a shared library
2. execute a set of SQL scripts which define the extension functions and database schemas
3. install a Python module
Build and install the shared library
The shared library "varbit.so" defines a few functions to help work
with bitsets, which are a built-in datatype in PostgreSQL. openCHORD
comes with a Makefile but it only works for Linux and is hard-coded
for PostgreSQL 8.2 installed under /usr. I have a Mac, a different
database version, and a different install location. The easiest
solution was to make a platform-specific file named "Makefile.mac"
which contains the following:
# Makefile.mac
# Compile openCHORD for a Mac
INCS = -I/opt/local/include/postgresql83/server
LIBS = -L/opt/local/lib/postgresql83/ -lpq
varbit.so: varbit.c
$(CC) -c varbit.c $(INCS)
$(CC) -bundle -flat_namespace -undefined suppress -o varbit.so varbit.o
Part of openCHORD fails to compile because of a change between
PostgreSQL 8.2 and 8.3. If you try to compile it as-is then you'll see:
% make -f Makefile.mac
cc -c varbit.c -I/opt/local/include/postgresql83/server
varbit.c: In function 'bit_set':
varbit.c:90: error: lvalue required as left operand of assignment
make: *** [varbit.so] Error 1
This is due to what looks like a change to make the macro usable as a
proper l-value. Here's the unified diff of my fix (fixed file first,
original second):
% diff -u varbit.c varbit.c.orig
--- varbit.c 2010-04-11 01:56:37.000000000 +0200
+++ varbit.c.orig 2010-04-11 01:55:50.000000000 +0200
@@ -87,7 +87,7 @@
/* create result bitstring and copy input bitstring to result */
int rlen = VARBITTOTALLEN(alen);
result = (VarBit *) palloc0(rlen);
- SET_VARSIZE(result, rlen);
+ VARATT_SIZEP(result) = rlen;
VARBITLEN(result) = alen;
unsigned char *ap = VARBITS(a);
unsigned char *rp = VARBITS(result);
With that in place,
% make -f Makefile.mac
cc -c varbit.c -I/opt/local/include/postgresql83/server
cc -bundle -flat_namespace -undefined suppress -o varbit.so varbit.o
This shared library must be installed into the PostgreSQL package
library directory. The helper utility "pg_config" reports that
location, so the install is something like:
% sudo cp varbit.so `pg_config --pkglibdir`
Start the database server and verify the installation
Now you can start the database. I'll start it in interactive mode so I
can see the log messages sent to the terminal, then switch to another
terminal window to keep working:
% postgres -D /Users/dalke/teaching/pg
There's also a way to set it up in the background using "pg_ctl" and a
way to have the server start automatically when the computer starts
up, via launchctl. I won't go into them though.
How can you verify that it installed correctly? If you look in
varbit.c you'll see it defines the functions:
Datum nbits_set(PG_FUNCTION_ARGS);
Datum isbit_set(PG_FUNCTION_ARGS);
Datum bit_set(PG_FUNCTION_ARGS);
Datum bit_contains(PG_FUNCTION_ARGS);
I should be able to call them.
I'll start the interactive PostgreSQL client. The default database
name is my username but that database doesn't exist, so I'll tell it
to use the "postgres" database. (You can get a list of all databases
with "psql -l".)
% psql postgres
Welcome to psql 8.3.8, the PostgreSQL interactive terminal.
Type: \copyright for distribution terms
\h for help with SQL commands
\? for help with psql commands
\g or terminate with semicolon to execute query
\q to quit
postgres=#
I have to tell PostgreSQL to create
a new function based on the extension module. The actual SQL text is:
CREATE or REPLACE FUNCTION nbits_set(bit)
RETURNS integer AS 'varbit', 'nbits_set'
LANGUAGE c IMMUTABLE STRICT;
which I copied from 'varbit.sql'. I'll do that at the client prompt:
postgres=# CREATE or REPLACE FUNCTION nbits_set(bit)
postgres-# RETURNS integer AS 'varbit', 'nbits_set'
postgres-# LANGUAGE c IMMUTABLE STRICT;
CREATE FUNCTION
postgres=#
then test it out. The b'1001' syntax is the special PostgreSQL
extension for making a bit string:
postgres=# select nbits_set(b'1001');
 nbits_set
-----------
         2
(1 row)
Go ahead and try other inputs to check that it works.
Now that I know it works, I'll remove the function since I don't
actually need it in the "postgres" database. I'll drop it then prove
that it's no longer present:
postgres=# DROP FUNCTION nbits_set(bit);
DROP FUNCTION
postgres=# select nbits_set(b'1001');
ERROR: function nbits_set(bit) does not exist
LINE 1: select nbits_set(b'1001');
^
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
Run the SQL setup scripts
The next step is to run the various SQL scripts which install the
extension functions and set up some auxiliary tables. These are run by
the script "makedb", but there were some problems with it. For one, it
assumed the "postgres" username exists, which isn't true for my
install, so I've tweaked it a bit to make it work for me. The scripts
it runs are:
varbit.sql - create SQL functions based on the functions defined in varbit.so
openbabel-core.sql - define SQL functions based on OpenBabel, called via Python
bits.sql - additional functions for working with bitsets
public166keys.sql - creates and populates the "public166keys" table with
SMARTS patterns for the 166 bit MACCS keys, and the corresponding fingerprint
generation function
similarity.sql - creates several bitset similarity functions
amw.sql - creates and populates the "amw" table with patterns for calculating
molecular weight, and the corresponding MW calculation function
tpsa.sql - creates and populates the "tpsa" table with patterns for calculating
molecular polar surface area, and the corresponding calculation function
Here I'll run the script, which needs an account name with sufficient privileges:
% bash makedb dalke
CREATE FUNCTION
COMMENT
CREATE FUNCTION
COMMENT
...
CREATE FUNCTION
COMMENT
Did it work? I'll test to see if the nbits_set function was added to
the newly created "openchord" database:
% psql openchord
Welcome to psql 8.3.8, the PostgreSQL interactive terminal.
Type: \copyright for distribution terms
\h for help with SQL commands
\? for help with psql commands
\g or terminate with semicolon to execute query
\q to quit
openchord=# select nbits_set(b'1000');
nbits_set
-----------
1
(1 row)
Success!
What doesn't yet work is the code which calls out to OpenBabel via Python.
openchord=# select amw('c1ccccc1O');
ERROR: plpython: function "count_matches" failed
DETAIL: <type 'exceptions.ImportError'>: No module named openchord
CONTEXT: SQL function "amw" statement 1
To fix that requires the final step.
Install openchord.py
The file "openchord.py" contains a set of support functions used by
the functions created by openbabel-core.sql, most notably parsing
SMILES and SMARTS strings and caching the resulting molecule and
pattern objects. It must be installed on the PYTHONPATH for the
PostgreSQL extensions to be able to use it.
It is possible to install the file by hand, but I would rather let
Python do it for me. I created the following "setup.py"
# setup.py
from distutils.core import setup
setup(name="openchord",
      version='1.0',
      py_modules=['openchord'],
      )
then did the standard
% sudo python setup.py install
The "import openchord" failure from the previous section probably
means the Python runtime embedded in the database server doesn't yet
know the newly installed module is available, so I killed the database
server and restarted it.
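A quick way to double-check which file a given Python will import, useful for confirming the server's interpreter sees the newly installed module, is to ask importlib. This is a generic helper, not part of openCHORD; after the install, probing "openchord" should point into site-packages (I use a stdlib module as the stand-in here):

```python
import importlib.util

def module_origin(name):
    """Return the file a module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None

# Substitute "openchord" to check the freshly installed module.
print(module_origin("gzip"))
```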
openCHORD comes with a script named "makenci" which imports the NCI
data set (NCI-Open_09-03.sdf.gz)
into the openchord database.
The script doesn't work on Macs because it assumes 'zcat' can handle
.gz extensions. I also didn't like how it gets the database username
from the variable USERNAME instead of the command-line. Here are the tweaks:
--- makenci.orig 2010-04-11 03:16:26.000000000 +0200
+++ makenci 2010-04-11 03:16:44.000000000 +0200
@@ -1,5 +1,5 @@
owner=$1
-owner=$USERNAME
+#owner=$USERNAME
if [ "$owner" == "" ] ; then
echo "no owner (user name) specified"
echo "usage: makenci owner"
@@ -8,7 +8,7 @@
# iconv -c removes any chars that cannot be represented as utf-8, which postgres uses
# see iconv --help to convert chars in another encoding
-time zcat NCI-Open_09-03.sdf.gz | iconv -c | perl sdfloader nci | sudo -u "$owner" psql openchord
+time gzcat NCI-Open_09-03.sdf.gz | iconv -c | perl sdfloader nci | sudo -u "$owner" psql openchord
# you may wish to use some of the following SQL commmands on the new schema and tables
#Grant Usage On Schema nci To Public;
If you look at the script it's mostly comments, containing suggestions
of how to work with the data. The core code is this single line:
time gzcat NCI-Open_09-03.sdf.gz | iconv -c | perl sdfloader nci | sudo -u "$owner" psql openchord
which takes the NCI data set, removes non-ASCII characters (with
iconv), then calls "sdfloader" to parse an SD file and generate SQL
commands for bulk data loading, which it passes to the psql client for
transfer to the database server.
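As a rough sketch of what that record-by-record streaming looks like, and this is not TJ's sdfloader, just an illustration of splitting an SD file on its "$$$$" record delimiter, consider:

```python
import gzip
import io

def count_sd_records(fileobj):
    """Count SD-file records, which end with a '$$$$' delimiter line."""
    n = 0
    for line in fileobj:
        if line.strip() == b"$$$$":
            n += 1
    return n

# Build a tiny synthetic two-record .sdf.gz in memory for illustration.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(b"mol1\n  fake header\n$$$$\nmol2\n  fake header\n$$$$\n")
buf.seek(0)
print(count_sd_records(gzip.GzipFile(fileobj=buf)))  # prints 2
```

A real loader would also accumulate the lines of each record and emit an INSERT per molecule, but the streaming shape is the same.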
The documentation warns that the data load takes an hour. It took my
laptop a bit longer than that. Here's what it looked like in
operation:
% bash ./makenci dalke
CREATE SCHEMA
GRANT
CREATE SEQUENCE
CREATE TABLE
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "structure_pkey" for table "structure"
CREATE TABLE
CREATE TABLE
GRANT
GRANT
GRANT
{several minutes with no output}
SET
{much longer period with no output (about 50 minutes)}
INSERT 0 260071
{more waiting}
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
real 68m43.143s
user 2m0.958s
sys 0m10.789s
Tracking down an OpenBabel problem. Not yet successful.
That was unexpected. The database server crashed. The database log says:
LOG: server process (PID 58767) was terminated by signal 11: Segmentation fault
LOG: terminating any other active server processes
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted; last known up at 2010-04-11 16:19:14 CEST
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/4A00E6B8
LOG: unexpected pageaddr 0/45BCA000 in log file 0, segment 76, offset 12361728
LOG: redo done at 0/4CBC9868
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
Based on the outputs, this happened either during canonical SMILES
generation or during fingerprint generation. Databases are designed to
keep the data even in the face of crashes, so figuring out where it
crashed shouldn't be too hard. The relevant code from sdfloader is:
Insert Into $schema.structure (id, name, isosmiles, coords, atoms)
Select id, (molfile_mol(molfile)).* from $schema.sdf;
Update $schema.structure Set cansmiles=cansmiles(isosmiles) Where valid(isosmiles);
Update $schema.structure Set fp=fp(cansmiles) Where valid(cansmiles);
I'll ask how many records there are, how many have isosmiles set, how
many have cansmiles set, and how many have their fingerprints set:
openchord=# select count(*) from nci.structure;
count
--------
260071
(1 row)
openchord=# select count(*) from nci.structure where nci.structure.isosmiles is NULL;
count
-------
0
(1 row)
openchord=# select count(*) from nci.structure where nci.structure.cansmiles is NULL;
count
--------
260071
(1 row)
openchord=# select count(*) from nci.structure where nci.structure.fp is NULL;
count
--------
260071
(1 row)
Huh. It looks like the problem was with canonicalization. Okay, I'll
walk through that manually:
openchord=# select count(*) from nci.structure where openbabel.valid(isosmiles);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Wow! Some sort of bug in openbabel.valid()? Investigating ....
I've been working with TJ on trying to track this down. It only
happens on my Mac, which is running OS X 10.6 and OpenBabel 2.2.3. I
have a reproducible test case which only uses OpenBabel:
import sys
import gzip
import openbabel
f = gzip.open("NCI-Open_09-03.sdf.gz")
f = iter(enumerate(f))
GD = {}
GD['mol'] = dict()
GD['obc'] = openbabel.OBConversion()
GD['obc'].SetInFormat("smi")
GD['nmol'] = 0
GD['maxsmi'] = 10000
n = 0
for lineno, line in f:
    if lineno % 10000 == 0:
        sys.stdout.write("\r %d / %d" % (n, lineno))
        sys.stdout.flush()
    if line.startswith("> <E_SMILES>"):
        lineno, line = next(f)
        smiles = line.strip()
        #mol = openchord.parse_smi(GD, smiles)
        if GD['nmol'] < GD['maxsmi']:
            mol = openbabel.OBMol()
            GD['nmol'] += 1
            #plpy.notice('new mol for %s' % smiles)
        else:
            key,mol = GD['mol'].popitem()
            #plpy.notice('mol reuse %s for %s' % (key,smiles))
        if GD['obc'].ReadString(mol, smiles):
            GD['mol'][smiles] = mol
            # return copy is slower, but safer?
            # return openbabel.OBMol(mol)
        n += 1
meaning it died after reading over 226,282 SMILES and over
460,000,000 lines of the input file. This takes a long time to fail,
which has made it very irritating to track down.
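To separate the suspect caching logic from OpenBabel itself, the allocate-until-full-then-recycle pattern in that loop can be reproduced with plain Python objects. This is my own simplification, not OpenBabel code; FakeMol is a stand-in for OBMol:

```python
class FakeMol(object):
    """Stand-in for openbabel.OBMol; just records the last SMILES parsed."""
    def __init__(self):
        self.smiles = None

def get_mol(GD, smiles, maxsmi=3):
    # Same shape as the loop above: allocate new molecules until the
    # cache holds maxsmi of them, then pop an arbitrary cached molecule
    # and reuse (re-parse into) that object instead.
    if GD['nmol'] < maxsmi:
        mol = FakeMol()
        GD['nmol'] += 1
    else:
        _key, mol = GD['mol'].popitem()
    mol.smiles = smiles          # the "ReadString" step
    GD['mol'][smiles] = mol
    return mol

GD = {'mol': {}, 'nmol': 0}
for smi in ["C", "CC", "CCC", "CCCC", "CCCCC"]:
    get_mol(GD, smi)
print(GD['nmol'], len(GD['mol']))  # prints 3 3
```

The crash happens inside ReadString on a recycled molecule, so the open question is whether reusing an OBMol this way leaves stale state behind; the "return copy is slower, but safer?" comment in the original code suggests object reuse was already a known risk area.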
The stack trace one time (through Apple's crash reporter) looked like:
#0 OpenBabel::OBMol::DestroyBond (this=<value temporarily unavailable, due to optimizations>, bond=0x10541efa0) at mol.cpp:1471
#1 0x0000000101a0fd9f in OpenBabel::OBConversion::GetInStream () at /Users/dalke/ftps/openbabel-2.2.3/include/openbabel/obconversion.h:256
#2 0x0000000101a0fd9f in OpenBabel::SMIBaseFormat::ReadMolecule (this=0x10541efa0, pOb=0x102c83990, pConv=0x10027cc10) at base.h:854
#3 0x00000001013dfe2c in OpenBabel::OBConversion::Read (this=0x10027cc10, pOb=0x102c81140, pin=<value temporarily unavailable, due to optimizations>) at obconversion.cpp:745
#4 0x00000001013e3f4c in OpenBabel::OBConversion::ReadString (this=0x10027cc10, pOb=0x102c81140, input=<value temporarily unavailable, due to optimizations>) at obconversion.cpp:893
warning: .o file "/Users/dalke/ftps/openbabel-2.2.3/scripts/python/build/temp.macosx-10.6-universal-2.6/openbabel_python.o" more recent than executable timestamp in "/Library/Python/2.6/site-packages/_openbabel.so"
warning: Couldn't open object file '/Users/dalke/ftps/openbabel-2.2.3/scripts/python/build/temp.macosx-10.6-universal-2.6/openbabel_python.o'
#5 0x0000000101135d6c in _wrap_OBConversion_ReadString ()
(Both times I truncated the stack once it got into the Python run-time.)
I'm now running it with memory
checks enabled (valgrind does not work on Mac OS X 10.6). That's
very slow so I'll let it run overnight. I'll also see if building
OpenBabel from version control fixes any problems, and email the
OpenBabel list to see if anyone there knows what's going on.