This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Fingerprint File Format
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
(This has nothing at all to do with human fingerprints. I'm talking
about a technique used in chemistry to represent characteristics
of a small molecule as a bit string, so that fast bit operations can be
used instead of slow graph operations. It's often rather like a
Bloom filter
where the hash function is based on molecular substructure.)
SD files and SMILES files are the two most common small molecule
structure formats. PDB is the most common macromolecular structure
format. FASTA and GenBank are similarly common in sequence
databases. But molecular fingerprints don't have a common format.
I propose a new common format for these, called "FingerPrint Format",
or "FPF file", with the standard extension of ".fpf". I'm looking for
feedback and suggestions, and hopefully also uptake by others.
One problem of course is that there are many types of fingerprints and
many ways to represent them. I'm limiting myself to those which are
easily represented as a fixed-length bit string of length between 8
and about 8K bits, and which are dense enough to store the bits directly
as bits rather than with run-length or other encoding.
My goal is to make fingerprint data sets more portable, so that tools
and algorithms developed by one group can more easily be shared and
tested by another.
Use cases
Yes, these are all oriented around my own needs. If it weren't something
I wanted then I would need to be getting paid for this. (Got funding?)
1) One of my clients wanted a 3-nearest-neighbors search (within some
cutoff) of a set of Daylight fingerprints, to be used in part of their
dataflow system. The files were static, and only updated once every 6
months or so. There was no need for a database, so I wrote a one-off
search system for them.
2) I want to write a high-speed Tanimoto similarity search algorithm
which works with a pre-built data structure and makes good use of
modern processors. Most formats are not designed for high performance
and require a preprocessing phase to load the data, and in a
command-line tool the data load time will be the limiting factor, not
the search.
3) I've found it a bit trying to determine the format and bit order of
each fingerprint tool I use, since they are almost all different.
4) I want to develop some screening data sets for substructure queries
and evaluate how effective they are against different inputs.
5) Oh, here's one which others want! Do the N**2 nearest-neighbor
searches of a data set across a set of machines without spending much
time building a fingerprint generation and parsing infrastructure.
Design goals
Here are my design goals:
be able to share fingerprint files between different tools
store "small" dense fingerprints and unique identifiers
fast load times
support for memory-mapped access
word-aligned binary data (with selectable word sizes)
fast lookup from fingerprint match to identifier
input order does not need to be preserved
allow some future extensibility
everything is stored in a single file
allow fingerprints sorted by popcount, for Baldi optimization
portable across different architectures
These are not design goals:
human-readable format
compatibility with an existing fingerprint format
minimize space/compressibility
I haven't decided how important these are:
error detection
preserving input order
linear search of the fingerprint names is acceptable (vs. log time)
support largish data sets (>4 million structures with a 1Kbit fingerprint)
Also, while not a design goal, I would like to have a set of reference
implementations for different types of Tanimoto searches,
substructure filtering, and others. I also want them to be fast enough to
be used in benchmarking new implementations, since I've seen too many
benchmarks in general where the reference baseline was a naive
(meaning "not fast") implementation.
FPF is based on the PNG block structure
I decided to base the format around the PNG specification. The first 8
bytes are a unique signature, and the rest of the file is a set of
blocks. Each block contains the data length as a 4 byte integer,
followed by a 4 byte identifier tag, followed by 'length' many bytes
for the block data, followed by 4 bytes for the CRC checksum.
All integers in this proposal are unsigned 32 bit values and stored in
network byte order, taking up four bytes. That leads to a limitation
in the fingerprint block size which I'll get to later.
FPF uses the same chunk layout, but with a different signature and
with tags which are more appropriate for fingerprints.
The FPF signature as hex bytes is 0x89 0x46 0x50 0x46 0x0D 0x0A 0x1A 0x0A. The subsequence 0x46 0x50 0x46 is "FPF", where PNG uses "PNG".
I keep the PNG's use of a CRC checksum as a way to check for
incomplete or corrupted files. I don't know how useful that is in
practice.
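As a sketch of what writing and reading a block might look like (in
Python, with illustrative function names; following PNG, the CRC here
covers the tag and the data but not the length):

```python
import struct
import zlib

def write_block(stream, tag, data):
    # Block layout: 4-byte big-endian length, 4-byte tag,
    # 'length' bytes of data, then a 4-byte CRC over tag + data.
    assert len(tag) == 4
    stream.write(struct.pack(">I", len(data)))
    stream.write(tag)
    stream.write(data)
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    stream.write(struct.pack(">I", crc))

def read_block(stream):
    header = stream.read(4)
    if not header:
        return None  # clean end of file
    (length,) = struct.unpack(">I", header)
    tag = stream.read(4)
    data = stream.read(length)
    (crc,) = struct.unpack(">I", stream.read(4))
    if crc != zlib.crc32(tag + data) & 0xFFFFFFFF:
        raise ValueError("corrupted %r block" % tag)
    return tag, data
```

The CRC check is what catches a truncated or corrupted file on read.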
FPF block types
FHDR: header block
I haven't figured out what goes in here. Perhaps something which
indicates if the fingerprint type is known to be good/bad for
substructure filtering or comparison?
tEXt (and potentially iTXt, zTXt): text block(s)
These are exactly the same as in PNG: key/value pairs, including the
controlled vocabulary of:
   Title            Short (one line) title or caption for image
   Author           Name of image's creator
   Description      Description of image (possibly long)
   Copyright        Copyright notice
   Creation Time    Time of original image creation
   Software         Software used to create the image
   Disclaimer       Legal disclaimer
   Warning          Warning of nature of content
   Source           Device used to create the image
   Comment          Miscellaneous comment; conversion from GIF comment
Needless to say, but I'll say it anyway: some of these need to be tweaked for fingerprints.
I think FPF only needs tEXt, which is a simple key/value record using
Latin-1. If those should be UTF-8 then look at iTXt or come up with a
new type of text block.
Question: should there be some field which encodes the generation
parameters ("OpenBabel FP2, folded to 32 bits")? That lets someone
pass in structures without having to know the details of the given
data set. Yes, these options are very implementation specific and not
portable.
FDNS: dense fingerprint block
The block format is:
[ integer number of bits per fingerprint ]
[ integer number of bytes per fingerprint (including alignment) ]
[ integer number of bytes used for initial alignment = "initial alignment"]
[ "initial alignment" number of bytes of value 0 ]
[ fingerprint 1, which is "number of bytes per fingerprint" long ]
[ fingerprint 2 ]
[ fingerprint 3 ]
...
[ fingerprint N ]
The fingerprints are stored in binary as a set of bytes. The bytes are
written in little-endian order and the bits of each byte are written
in big-endian order. To get bit i from a fingerprint:
byte_offset = i / 8;
bit_offset = i % 8;
bit = (fingerprint[byte_offset] >> bit_offset) & 1;
(CACTVS uses little-endian for the bytes and the bits, OpenBabel uses
big-endian. I decided this mixture was easier to code.)
The minimum number of bytes per fingerprint is
(bits_per_fingerprint+7)/8. Extra bits used to fill in the remainder
of the last byte must be set to 0. The fingerprint may be padded with
extra 0 bytes in order to make the fingerprint length a multiple of a
word size (typically 32, 64, or 128 bits).
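A sketch of how a single fingerprint could be packed under these rules
(Python; the function name and word-size parameter are only for
illustration). Bit i lands in byte i/8 at bit position i%8, matching
the extraction formula above:

```python
def pack_fingerprint(bits, word_bits=32):
    # 'bits' is a sequence of 0/1 values; bit i goes into byte i//8
    # at position i%8, so bit = (fp[i//8] >> (i%8)) & 1 recovers it.
    num_bytes = (len(bits) + 7) // 8
    word_bytes = word_bits // 8
    # round the per-fingerprint size up to a whole number of words;
    # the padding bytes stay 0 as the format requires
    padded = ((num_bytes + word_bytes - 1) // word_bytes) * word_bytes
    fp = bytearray(padded)
    for i, bit in enumerate(bits):
        if bit:
            fp[i // 8] |= 1 << (i % 8)
    return bytes(fp)
```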
The "initial alignment" part is there to align the fingerprints in the
case of memory-mapped files. It can be judiciously chosen so that the
first byte of the first fingerprint is word aligned or even page
aligned, if that makes a difference.
The number of fingerprints is not stored. It can be calculated based
on the block size, the header size, and the number of bytes per
fingerprint.
By the way, I expect memory alignment to be a downstream issue where
one program can generate a simple fingerprint file, using no extra
alignment bytes, and another program can tweak the alignments to be
more optimal for a given OS and implementation.
POPC: ordered population count offsets block
[ integer offset in FDNS to first fingerprint with popcount=0 ]
[ integer offset " " to first fingerprint with popcount=1 ]
...
[ integer offset " " to first fingerprint with popcount=N ]
[ integer offset in FDNS to the byte after the last fingerprint with popcount=N ]
One way to speed up similarity queries and substructure filters is to
exclude testing fingerprints which trivially cannot be accepted based
on the popcounts of the target and query fingerprints. "Popcount" is
short for "population count" and is the number of set bits in the
fingerprint. For example, in a substructure filter test the target
cannot have a smaller popcount than the query, and Baldi explained
how to reduce the Tanimoto search space.
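To make the bound concrete: the Tanimoto score |A∩B|/|A∪B| can never
exceed min(a,b)/max(a,b), where a and b are the two popcounts, so for
a given similarity threshold only some popcount bins can possibly
contain hits. A small Python sketch (names illustrative):

```python
def candidate_popcounts(query_popcount, num_bits, threshold):
    # Tanimoto(A, B) = |A & B| / |A | B| <= min(a, b) / max(a, b),
    # where a and b are the popcounts, so only bins whose bound
    # reaches the threshold need to be searched at all.
    q = query_popcount
    for t in range(num_bits + 1):
        bound = min(q, t) / float(max(q, t)) if max(q, t) else 1.0
        if bound >= threshold:
            yield t
```

For a query with 10 bits set and a 0.5 threshold, only targets with
popcounts from 5 through 20 can possibly pass.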
FPF stores the popcount information implicitly, by ordering the
fingerprints in the FDNS block so that all fingerprints with
popcount=0 are listed first, then those with popcount=1, then
popcount=2 and so on.
The "POPC" block contains byte offsets to the start and end of each
set of fingerprints with the same popcount. If there are N bits per
fingerprint then there will be N+2 offsets, and the fingerprints with
popcount 0<=i<=N are between offsets POPC[i] and POPC[i+1].
The last offset is redundant and must be identical to the length of
the FDNS block. I include it here because my C implementations of the
algorithms were easier if they had this information, and storing it
in the data file means one less malloc or special case to consider.
If the POPC block is present then the FDNS fingerprints must be
ordered by population count.
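Building the POPC table from a popcount-sorted fingerprint list might
look like this (a Python sketch; the function and parameter names are
only illustrative):

```python
def build_popc_offsets(popcounts, bytes_per_fp, num_bits):
    # 'popcounts' is the popcount of each fingerprint, in the
    # (sorted) FDNS order. Returns num_bits + 2 byte offsets:
    # fingerprints with popcount i lie between offsets[i] and
    # offsets[i + 1], and the final offset is the total data size.
    offsets = [0] * (num_bits + 2)
    index = 0
    for count in range(num_bits + 1):
        offsets[count] = index * bytes_per_fp
        # skip past every fingerprint with this popcount
        while index < len(popcounts) and popcounts[index] == count:
            index += 1
    offsets[num_bits + 1] = len(popcounts) * bytes_per_fp
    return offsets
```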
NAME block - list of fingerprint identifiers
[NUL byte]
[ name 1 ]
[NUL byte]
[ name 2 ]
....
[ name N ]
[NUL byte]
If present, each fingerprint is associated with a single name and vice
versa. The name should be a unique identifier; it must not contain the
NUL character and should not contain any other control characters. (Are
the names Latin-1 or UTF-8 encoded? If UTF-8, does the string also
undergo a canonicalization step? Or are only ASCII names allowed?)
The names are stored in the same order as the fingerprints in the FDNS
block, so that the first name corresponds to the first fingerprint,
the second with the second, and so on. Each name must be NUL
terminated. (However, sorted names would make searching possible in
log time instead of linear. On the other hand, most people will load
the names into a local dictionary, and take the linear hit once.)
The first byte in the NAME block must be 0 (the NUL character). This
is present even if there are no fingerprints and hence no names.
The purpose of the leading NUL character is to simplify text
searches. To find the identifier "ID", construct a search for
NUL+"ID"+NUL and do a linear search. (It also simplifies binary
searches; pick a point p then backtrack to the previous NUL before
doing the test.)
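The linear-search trick might look like this in Python (an
illustrative sketch; the fingerprint index falls out as the number of
NUL bytes before the match, since names contain no NULs):

```python
def find_name_index(name_block, identifier):
    # name_block is the raw NAME data: a leading NUL, then each
    # name followed by its own NUL terminator. Returns the
    # fingerprint index of 'identifier', or -1 if absent.
    target = b"\x00" + identifier + b"\x00"
    pos = name_block.find(target)
    if pos == -1:
        return -1
    # Every name before this one contributed exactly one NUL
    # (plus the leading NUL), so counting NULs in [0, pos)
    # gives the zero-based fingerprint index.
    return name_block.count(b"\x00", 0, pos)
```

Because the target is bracketed by NULs on both sides, it can only
match at a name boundary, never inside a longer name.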
NOFF block - offsets into the NAME block
[integer offset to name for fingerprint 1]
[integer offset to name for fingerprint 2]
[integer offset to name for fingerprint 3]
...
[integer offset to name for the last fingerprint]
This block contains offsets into the NAME block. To get the name
for fingerprint i, go to the ith entry in the table and use it as
the byte offset of the name in the NAME block.
To go the other way requires a binary search. Given the byte offset to
a name in the NAME block, use a binary search over the NOFF block to
find the entry containing that offset. The position of that entry in
the NOFF block is the index of the fingerprint in the FDNS
block.
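Since the NOFF entries are nondecreasing, that reverse lookup is just
a bisect (a Python sketch; names are illustrative):

```python
import bisect

def fingerprint_index_from_name_offset(noff, name_offset):
    # 'noff' holds the byte offset of each name in the NAME block,
    # in fingerprint order, so the list is nondecreasing.
    # bisect_right counts entries <= name_offset; the last such
    # entry is the fingerprint whose name starts at that offset.
    return bisect.bisect_right(noff, name_offset) - 1
```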
Using sorted names means the mapping from identifier to fingerprint
can be done in log time. Using unsorted names requires linear time.
However, I expect most people to load the NAME data into a local
dictionary/hash table, which will take linear time but only once. In
that case, keeping the names in the same sort order as the
fingerprints is much easier to deal with.
Unresolved issues
Sorted or unsorted names?
I'm repeating it here because it's tricky. How well supported should
fingerprint lookup by name be? If the names are in sorted order then I
can find a name in log(n) time, then find the fingerprint index in
log(n) time, but I have to do that for every lookup; building a table
mapping from name to index, meanwhile, requires disentangling the
lookup table, which is a linear operation.
Preserving the fingerprint sort order means even the fastest program
needs O(n) time for each name lookup, but it makes building the
mapping from name to fingerprint very easy.
Scaling
This description used 32-bit unsigned integers as lengths and
offsets. That means the largest fingerprint data set can contain at
most 2**32 bytes. Consider someone using the 881 CACTVS substructure
keys aligned to 128 bits. This uses 896 bits or 112 bytes per
fingerprint, which means the block can hold a bit over 38 million
fingerprints.
Of course, if someone does an 8K fingerprint then that drops to just
over 4 million fingerprints.
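To double-check the arithmetic (a Python sketch; the function name is
illustrative, and the default 128-bit alignment matches the CACTVS
example above):

```python
def max_fingerprints(num_bits, word_bits=128):
    # Capacity of a single block addressed with 32-bit offsets,
    # after padding each fingerprint up to a multiple of word_bits.
    word_bytes = word_bits // 8
    num_bytes = (num_bits + 7) // 8
    padded = ((num_bytes + word_bytes - 1) // word_bytes) * word_bytes
    return 2 ** 32 // padded

# 881 CACTVS keys pad to 112 bytes: a bit over 38 million fingerprints.
# An 8K fingerprint is 1024 bytes: just over 4 million fingerprints.
```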
Is this a realistic concern for the next 10 years? I think most people
don't use fingerprints over 1024 bits, but on the other hand the
databases are currently in the 50 million compound range.
There are three ways to address it. One is to use 64 bit values
instead of 32 bit. Another is to use multiple blocks, with a special
block to indicate the start of a new set of blocks. A third is to
simply have multiple FPF files. One thing to bear in mind is that
multiple sections or multiple files are likely easier for people using
map/reduce strategies.
Storing multiple diverse fingerprint sets in the same file
Substructure fingerprints are likely not good similarity fingerprints
and vice versa. Is there a need to store different fingerprint sets in
the same file?
I don't think so. I think using multiple files works well for that.
Preserving order
A substructure screen may go through several stages. For example, if
the input structure contains a hormone substructure then the results
of a general query filter might be passed through one optimized to
distinguish between different types of hormones, with the hitlist
passed from the first stage to the next.
Using names as the hitlist identifiers does not work that well because
of the expensive lookup cost. It's better to pass around a list of
offsets. But if the FDNS block is reordered then the original sort
order is no longer available. Also, there needs to be a way to report
the match using its input order.
There are two solutions to that: don't sort by popcount, or add a
table which maps from input order to DENS order. (Or possibly two
tables, although the inverse mapping can be constructed in O(N) time
with only one malloc because it's a bucket sort.)
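That inverse construction really is a single O(N) pass (a Python
sketch; the list names are only for illustration):

```python
def invert_order(dens_to_input):
    # dens_to_input[i] is the original input position of the i-th
    # fingerprint in the popcount-sorted block. Scattering each
    # entry into place builds the inverse map in one pass, with
    # one allocation.
    input_to_dens = [0] * len(dens_to_input)
    for dens_index, input_index in enumerate(dens_to_input):
        input_to_dens[input_index] = dens_index
    return input_to_dens
```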
Usefulness of Baldi
The Baldi algorithm suggests optimal searching based on ordering the
search of the popcount bins by the maximum possible similarity of each
bin. (The paper actually suggests allocating an array and sorting, but
as the two sides of the distribution are monotonically decreasing it
can also be done with an iterator doing the equivalent of a merge sort
of the two sides.)
I implemented it but, excepting high similarity searches (perhaps 80%
or so? I did this over a year ago), I didn't notice any real
improvement in Tanimoto searches. I conjectured that non-linear disk
seeks were causing the problem. The memory and especially disk
subsystems are really optimized for forward reads, but the Baldi
algorithm jumps back-and-forth and breaks all the lovely
cache-prefetching going on behind the scenes.
While I'm convinced that the Baldi limits are useful, the reordering
to make it fast ends up making the system more complicated.
BTW, SSDs (Solid State Drives) fix part of the problem by eliminating
seek time, though not cache prefetching. Another solution might be to
store two sets of fingerprints, with the second sorted in reverse
popcount order and on another disk. Then with two threads going
forward... Hmmm...
In any case, Swamidass and Baldi reported good speedups for top-K
searches. I haven't seen their code, but I have seen other search code
which uses a very slow scoring function, so I would like a platform
where it's easier for people to compare against known fast
implementations of different approaches.
Fingerprint density
I said that I'm assuming dense fingerprints, where the bits are
random. Most chemical fingerprints are sparse, with under 20% density
for similarity fingerprints and even less for feature key fingerprints
used for substructure screening.
Perhaps using run-length encoding of the bits in a fingerprint, or
run-length encoding of the inverted index, might be better. However, I
have a lot less experience with that, and storing those data
structures on disk is more complex than I want to deal with.
CPUs are very fast. A test I did a few years ago suggests that disk
and memory access are the limiting factor, and not the algorithm. If
that's the case then it's more worthwhile to make the CPU do extra
work while it's waiting for less data. On the other hand, another test
showed my code was still rather slower than computing the md5 checksum
of the same file. These tests were suggestive, not rigorous.
Streaming output
Each chunk starts off with a length. This leads to several
complications. It's hard to stream everything to a pipe unless the
output size is known in advance, which likely means storing all of the
fingerprints in memory before generating the output.
If the output is a file, not a stream, then it's possible to write
all the fingerprints to the file, seek back to fill in the length
field, and then seek forward to the end again. They wouldn't be
sorted, but a post-processor could sort a file which is too large to
keep in memory. (Which, given that my laptop has 3GB of memory,
doesn't seem a real concern.)
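The seek-back trick might look like this (a Python sketch with
illustrative names; I've left out the trailing CRC to keep it short,
though a real writer would compute it over the tag and data as they
stream):

```python
import struct

def write_block_streaming(f, tag, write_data):
    # Write a placeholder length, let 'write_data' stream the
    # payload, then seek back and patch in the real length.
    length_pos = f.tell()
    f.write(struct.pack(">I", 0))      # placeholder length
    f.write(tag)
    start = f.tell()
    write_data(f)
    end = f.tell()
    f.seek(length_pos)
    f.write(struct.pack(">I", end - start))
    f.seek(end)                        # resume at the block's end
```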
That would also mean keeping the names in memory until the
fingerprints are written, since the above trick only works for a
single block.
If memory is critical (which shouldn't be the case for names, as they
are usually only 10 bytes or so), then an implementation might write
the blocks to the filesystem and merge them later on. Just like the
old days of dealing with tape drives. But really, this seems more a
theoretical concern than something to worry about, since the point of
an FPF file is to use it as a data source for a pipeline, not as part
of the pipeline stream.
Development and future
This is something I've been thinking about for a while but haven't
gotten around to because I'm not working with fingerprints right
now. I am a consultant, so if you're interested in funding me, email
me. I also have other, paying work, so this doesn't have much priority.
I like the general approach of using PNG blocks. I've implemented
parts of the system, and the resulting code has been nice and
clean. The format I sketched out fits in well with the
high-performance C code for Tanimoto search and substructure filter
codes I wrote a couple of years ago. I also like that the format is
relatively future-proof: if someone needs some new data chunk, it
isn't hard to add.
I do need feedback from others. If you have ideas, or comments, or
perhaps know about existing fingerprint formats I should use instead,
then
leave a message.
If there's enough interest in this, I'll set up some sort of project
repository and mailing list for it, so it's up to you to speak up!