This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: In search of an inverted index library
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
I'm looking for a Python library for managing inverted indices. I've
been using Python sets, which are nice and fast, but they take up too
much memory. Lists use 1/10th the space but are a lot slower and
still not compact enough. The most likely solution so far is
Xapian. Are there other options I should consider? If so,
email me or comment on the
Reddit
thread I started on this topic.
Let me explain what I want and what I've done so far.
Sparse feature fingerprints
I'm working with chemical fingerprints. That's not unusual for
me. What's new is that these fingerprints are not fixed length but are
of indefinite length. For each molecule I find all fragments with no
more than 7 atoms, and put the fragment into a canonical
representation, here a canonical fragment SMILES. Each unique
representation gets its own unique integer identifier. The end result
is a fingerprint which looks like:
0:4,1:3,5:2,18:2,28:2,124,47298,53119 254846
The original SMILES for this is CCO.CCO.O=[V].Cl. I picked
this structure because it has a very small fingerprint.
This fingerprint line says that PubChem record 254846 has 8
substructures ("features"). The features are stored along with a count
of the number of times each is found. Here, features '5', '18', and '28'
are found in two different places in the molecule, while features
'124', '47298', and '53119' are found only once. Also, the feature
field and the identifier field are separated by a tab.
I kept track of the original canonical description. Here's the
breakdown:
canonical
bit# description
---- -----------
0 C
1 O
5 CC
18 CO
28 CCO
124 Cl
47298 O=[V]
53119 [V]
With a bit of back and forth you can verify that it's correct.
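As a sketch of how such a line can be decoded (the helper name here is my own, not code from the actual loader): the feature field splits on commas, with an optional ':count' suffix that defaults to 1.

```python
def parse_fingerprint_line(line):
    # Split the tab-separated feature field and record identifier.
    feature_field, record_id = line.rstrip("\n").split("\t")
    counts = {}
    for term in feature_field.split(","):
        if ":" in term:
            # "5:2" means feature 5 occurs twice
            feature, count = term.split(":")
            counts[int(feature)] = int(count)
        else:
            # a bare feature id means a count of 1
            counts[int(term)] = 1
    return record_id, counts

record_id, counts = parse_fingerprint_line(
    "0:4,1:3,5:2,18:2,28:2,124,47298,53119\t254846")
# record_id is "254846"; counts has 8 features, e.g. counts[5] == 2
```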
Overall I have about 25 million records and 2 million canonical
values, using a subset of PubChem's 35 million compounds. The average
fingerprint contains about 300-400 features, though the count ranges from
1 to about 2,000 features. These are "sparse" fingerprints because
the number of features per record divided by the number of
features total is very small (350/2,000,000) - small enough that it's
more efficient to store the list of feature bits (0, 1, 5, etc.) than
it is to store the bit pattern of 0s and 1s.
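To put rough numbers on that, here's the back-of-the-envelope arithmetic (ignoring counts and encoding details): a dense bit pattern over 2 million possible features costs 250 KB per record, while a sparse list of ~350 4-byte feature ids costs under 2 KB.

```python
n_features = 2 * 1000 * 1000   # total distinct features
avg_features = 350             # typical features per record

dense_bytes = n_features // 8        # one bit per possible feature
sparse_bytes = avg_features * 4      # a 4-byte id per feature present

print(dense_bytes, sparse_bytes)     # 250000 1400
```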
The features are very long-tail. Carbon ("C") exists in nearly every
record, in both aromatic and non-aromatic forms. "O", "cc", "ccc", and
"CC" are also very popular. But about 1/2 of the features only exist
in one record. Here's more on the fragment distribution pattern.
Why do I want an inverted index?
Feature fingerprints are old news in cheminformatics. One application
is to improve substructure search performance. Suppose you are
searching for a phenol (structure "c1ccccc1O"). You could do a
substructure isomorphism test with every single record, but that's
slow. Instead, there are some obvious improvements.
You know that if a record matches then it must have at least 7 atoms
and 7 bonds, at least 6 carbons, and at least 1 oxygen. It's easy to
precompute this information, so the set labeled "at least 6 carbons"
contains the set of all records with at least 6 carbons, and so
on. When a query comes in, analyze the structure based on those
predetermined features. This ends up with a list of sets, one for each
feature. Find the intersection of those sets, and you've reduced your
search space, potentially by a lot.
This is exactly when one should use an inverted index.
(There are other optimizations. If the query exists already as one of
the precomputed sets then there's no need for any additional
substructure search.)
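A toy version of this prescreen, with made-up posting sets (the feature names and record ids here are illustrative, not from the real data):

```python
# Hypothetical precomputed sets: feature -> record ids containing it.
index = {
    "at least 6 carbons": {1, 2, 3, 5, 8},
    "at least 1 oxygen":  {2, 3, 4, 8, 9},
    "has benzene ring":   {3, 8, 9},
}

def candidate_records(features, index):
    # Intersect the smallest set first; only the survivors need the
    # expensive subgraph isomorphism test.
    sets = sorted((index[f] for f in features), key=len)
    return set.intersection(*sets)

candidate_records(list(index), index)
# the only candidates are records 3 and 8
```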
Feature "folding"
Most cheminformatics systems "fold" their sparse bits into a fixed bit
length, like 1024 bits. For example, if a feature is assigned the
canonical id of "458793484" then some programs will apply modulo 1024
to that to get the folded bit id "524". This saves a lot of space, at
the cost of extra subgraph isomorphism search time.
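A minimal sketch of that folding step (the 524 value matches the example above):

```python
def fold(feature_id, nbits=1024):
    # Map an unbounded canonical feature id onto a fixed-width
    # fingerprint by taking it modulo the fingerprint length.
    return feature_id % nbits

print(fold(458793484))   # 524
```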
I've been wondering, do I really need to fold the bits? Why can't I
keep all of the information around? My thoughts are informed by Managing Gigabytes, which points
out that there are efficient ways to encode the inverted index. Can I
apply those techniques to this problem?
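One classic technique from Managing Gigabytes is to delta-encode each sorted posting list and store the gaps as variable-length integers. A minimal sketch of that idea (my own toy implementation, not tuned code):

```python
def encode_postings(record_ids):
    # Store the gap between consecutive sorted ids as a base-128
    # varint: 7 data bits per byte, high bit set on continuation bytes.
    out = bytearray()
    prev = 0
    for n in record_ids:
        gap = n - prev
        prev = n
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)

def decode_postings(data):
    ids, cur, shift, prev = [], 0, 0, 0
    for byte in data:
        cur |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # continuation byte; keep accumulating
        else:
            prev += cur         # end of varint; undo the delta
            ids.append(prev)
            cur = shift = 0
    return ids

postings = [3, 10, 200000, 200004]
packed = encode_postings(postings)
# small gaps pack into single bytes; this whole list fits in 6 bytes
```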
Working code
The first step is to establish that there is a problem. I'll use a
subset of 10 files, which represents about 0.5% of my actual data set,
and write some code to handle it. Here's the final code and here's the code with the benchmark data set (37 MB, compressed with 7zip).
The simplest inverted index implementation uses Python sets.
from collections import defaultdict

class InvertedIndex(object):
    def __init__(self):
        # Map from a feature string like "ccO" to an integer feature id
        self.feature_to_id = {}
        # Map from an integer feature id to a set of record ids
        self.inverted_indices = defaultdict(set)

    def get_feature_id(self, feature):
        try:
            return self.feature_to_id[feature]
        except KeyError:
            n = len(self.feature_to_id)
            self.feature_to_id[feature] = n
            return n

    def add_record(self, id, feature_ids):
        for feature_id in feature_ids:
            self.inverted_indices[feature_id].add(id)

    def search(self, features):
        # These are *features*, not *feature_ids*.
        # If a feature wasn't seen before then there are no entries
        # and the lookup raises a KeyError, meaning nothing was found.
        try:
            terms = [self.inverted_indices[self.feature_to_id[feature]]
                     for feature in features]
        except KeyError:
            return set()
        terms.sort(key=len)
        return set.intersection(*terms)
Very simple, and it works pretty well. There's a bit of API trickiness
to get the feature id for a given feature. This simplifies the loader
and handles integer interning. (Interning is outside the scope of this
essay.)
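As an aside, the try/except pattern in get_feature_id can also be written with dict.setdefault, which assigns ids in first-seen order in one line:

```python
feature_to_id = {}

def get_feature_id(feature):
    # setdefault returns the existing id, or inserts the next id
    # (the current dict size) if the feature is new.
    return feature_to_id.setdefault(feature, len(feature_to_id))

print(get_feature_id("C"), get_feature_id("O"), get_feature_id("C"))   # 0 1 0
```

The try/except version avoids computing the default when the key already exists, which matters in a tight loading loop; setdefault is just more compact.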
Second is the code to parse the data files and load the index. The file
format looks like:
#FPC1
#bit-0=13120 C
#bit-1=12986 c
#bit-2=12978 cc
...
#bit-46924=1 COCNCP=O
0:6,1:21,2:21,3:22,4:23,5:2,6,...,36858,36869,37247,37429:4,37431,39942 26350001
0:6,1:21,2:21,3:22,4:23,5:2,6,...,36858,36869,37247,37429:4,37431,39942 26350002
...
0:5,1:12,2:12,3:12,4:12,5:3,6:3,...,25998,32031,32086:2,33182,33998,38357 26374982
There's a version line, some header lines starting "#bit-", and the
fingerprint lines. The fingerprint line uses the sparse fingerprint I
mentioned earlier, followed by a tab, followed by the record
identifier. (Here it's a PubChem id, which is always an integer.)
Note: the bit definitions are not the same across different files. The
header lines show that bit 0 corresponds to the feature "C" for this
file (and it exists in 13120 records), and bit 1 is "c" (in 12986
records), but in another file I see that bit 0 is the feature "cc" and
in a third it's "ccc(c)CCN". So the parser needs to be smart and map
the per-file bit numbers to its own internal fragment identifier.
The details of the parser aren't important to this essay, so they're
left to the implementation.
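A sketch of that remapping step (a hypothetical helper of my own, assuming header lines in the '#bit-N=count feature' form shown above):

```python
def make_bit_remap(header_lines, get_feature_id):
    # Map each per-file bit number to the loader's own global
    # feature id, so postings from different files line up.
    remap = {}
    for line in header_lines:
        bit_part, feature = line.split(None, 1)   # e.g. '#bit-0=13120', 'C'
        bit_number = int(bit_part[len("#bit-"):].split("=")[0])
        remap[bit_number] = get_feature_id(feature.strip())
    return remap

feature_to_id = {}
intern = lambda f: feature_to_id.setdefault(f, len(feature_to_id))

remap1 = make_bit_remap(["#bit-0=13120 C", "#bit-1=12986 c"], intern)
remap2 = make_bit_remap(["#bit-0=5000 c"], intern)
# 'c' gets the same global id (1) in both files, whatever its bit number
```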
Benchmark
I need a benchmark. I'll load
10 data files
(links to the 37 MB benchmark file) to make the inverted indices. I
also need some query data. I reused the parser to extract a subset
(1%) of the records in a fingerprint file. Actually, two fingerprint
files. One timing set reuses one of the data files used to build the
inverted indices, which means that every query will match, while
the other uses a new data file, which should have many fewer matches.
== Sets ==
Load 0 Compound_000000001_000025000.counts.fpc
Load 1 Compound_000025001_000050000.counts.fpc
Load 2 Compound_000050001_000075000.counts.fpc
Load 3 Compound_000075001_000100000.counts.fpc
Load 4 Compound_000100001_000125000.counts.fpc
Load 5 Compound_000125001_000150000.counts.fpc
Load 6 Compound_000150001_000175000.counts.fpc
Load 7 Compound_000175001_000200000.counts.fpc
Load 8 Compound_000200001_000225000.counts.fpc
Load 9 Compound_000225001_000250000.counts.fpc
Loading took 63.0 seconds and 2092896256 memory
100 66575 315.0 per second
200 100909 337.5 per second
all exist: 227 searches took 0.7 seconds (counts 114990)
100 2323 1161.5 per second
200 3865 1360.9 per second
some exist: 230 searches took 0.2 seconds (counts 4048)
Some of this is progress information. The important parts are the load
time (63.0 seconds), the total amount of memory used in the index (2.1
GB), and the search times of 0.7 and 0.2 seconds for the two test
sets. (I use 'psutil'
to get the memory usage; note that the exact meaning is OS dependent.)
Also, the match counts of "114990" and "4048" are useful checks that
each inverted index implementation is correct.
This output shows the problem. I'm using 2.1 GB of memory for 10 data
files. I only have 12 GB on this desktop machine, and remember, the
full data set is 200 times bigger. Do you have a 500 GB machine I can
use?
Use an array instead of a set
Python objects have overhead. For example, a small Python integer on
my desktop requires 24 bytes, while I know I have about 2**25 records
so could easily get away with 4 bytes per record id.
>>> import sys
>>> sys.getsizeof(3)
24
The other 20 bytes handle things like a reference to the Python
integer class object and the reference count.
The Python set data type uses a lot of space because it's implemented
as a hash, and many of the internal slots are empty in order to
improve lookup performance. Quick testing shows that a set uses about
30-40 bytes for each element in the set.
It's not quite as bad as this, because the set elements contain
Python references, which are 8-byte pointers on my 64-bit Python. But
that's still about 42 bytes of space used to store at most 4 bytes of
data. I should be able to do better.
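A quick way to see the difference on your own machine (the exact numbers vary by Python version and platform, but the ratio is roughly tenfold):

```python
import sys
import array

record_ids = list(range(100000))
as_set = set(record_ids)
as_array = array.array("I", record_ids)   # 4-byte unsigned ints

# getsizeof measures the container only; the set additionally
# holds references to full Python int objects.
print(sys.getsizeof(as_set) // len(record_ids))    # tens of bytes per element
print(sys.getsizeof(as_array) // len(record_ids))  # about 4 bytes per element
```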
The usual way "to do better" in Python, when storing a list of C-like
integers, is to use an
'array.array'.
This is similar to a Python list, except that the contents are all of
the same type, so there's no per-item overhead to store the type
information. (Note: pje described this approach on
the Reddit thread.)
The downside is that there's no equivalent to a set.intersection. For
now, I'll convert an array into a set, and do the normal set
intersection. Here's the code:
# Make sure I can handle 4 byte integers
import array
_test = array.array("I", (0,))
try:
    _test[0] = 2**31 + 100
    array_code = "I"
except OverflowError:
    array_code = "L"

def make_unsigned_array():
    return array.array(array_code, ())

class InvertedIndexArray(object):
    def __init__(self):
        # Map from a feature string like "ccO" to an integer feature id
        self.feature_to_id = {}
        # Map from an integer feature id to an array of record ids
        self.inverted_indices = defaultdict(make_unsigned_array)

    def get_feature_id(self, feature):
        try:
            return self.feature_to_id[feature]
        except KeyError:
            n = len(self.feature_to_id)
            self.feature_to_id[feature] = n
            return n

    def add_record(self, id, feature_ids):
        for feature_id in feature_ids:
            self.inverted_indices[feature_id].append(id)

    def search(self, features):
        assert features
        try:
            terms = [self.inverted_indices[self.feature_to_id[feature]]
                     for feature in features]
        except KeyError:
            return set()
        terms.sort(key=len)
        # Set conversion is expensive; it's faster to pass an
        # iterable to intersection_update.
        result = set(terms[0])
        for term in terms[1:]:
            result.intersection_update(term)
            # Short-circuit the evaluation when the result set is empty.
            # Gives a 3x speedup for the "some exist" benchmark, with no
            # performance hit for the "all exist" benchmark.
            if not result:
                break
        return result
How well does it work? Here's the benchmark output using the
array-based inverted index implementation, trimmed somewhat to make it
easier to read:
== Sets ==
Loading took 63.0 seconds and 2092896256 memory
all exist: 227 searches took 0.7 seconds (counts 114990)
some exist: 230 searches took 0.2 seconds (counts 4048)
== Arrays ==
Loading took 51.2 seconds and 165552128 memory
all exist: 227 searches took 41.7 seconds (counts 114990)
some exist: 230 searches took 16.1 seconds (counts 4048)
The new version takes less than 10% of the memory of the old, which is
about what I predicted. It's nice to see the confirmation. It's also a
bit faster to build the data structure. But on the other hand, search
is over 60x slower. At roughly 0.1-0.2 seconds per query, this performance
is not acceptable for the types of highly interactive searches I want to do.
There are of course hybrid solutions. This version spends a lot of
time converting an array into a set, which is then discarded. I convert
the 'C', 'c' and a few other inverted indices a lot, so what I could
do is use an LRU cache for the top 1,000 or so sets.
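A sketch of that hybrid, using Python 3's functools.lru_cache to memoize the array-to-frozenset conversion for the hottest features; the class and method names here are my own, not from the benchmark code:

```python
from functools import lru_cache

class CachedSetIndex:
    def __init__(self, inverted_indices, maxsize=1000):
        self.inverted_indices = inverted_indices   # feature_id -> array
        # Cache the set conversions of the most recently used posting
        # lists, so hot features like 'C' are converted only once.
        self.as_set = lru_cache(maxsize=maxsize)(self._to_set)

    def _to_set(self, feature_id):
        return frozenset(self.inverted_indices[feature_id])

index = CachedSetIndex({0: [1, 2, 3], 1: [2, 3, 4]})
index.as_set(0) & index.as_set(1)   # frozenset({2, 3})
index.as_set(0)                     # a cache hit; no reconversion
```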
But let's get back to my problem. My data set is 200x bigger than the
test set, so simple scaling of 166MB says I'll need 33 GB of RAM in
the best of cases. I really would like this to work on my desktop,
instead of renting some Amazon compute node every time I want to try
things out.
What about pypy?
In pypy, sets now have strategies just like dictionaries. This means,
for example, that a set containing only ints will be more compact (and
faster). This is good news because in previous benchmarking I found
that pypy sets were rather slower than cpython's own sets.
I re-ran the benchmark using pypy 1.7 and 1.9. Here are the results,
first for 1.7:
== Sets ==
Loading took 278.6 seconds and (psutil not installed) memory
all exist: 227 searches took 1.0 seconds (counts 114990)
some exist: 230 searches took 0.2 seconds (counts 4048)
== Arrays ==
Loading took 32.7 seconds and (psutil not installed) memory
all exist: 227 searches took 299.6 seconds (counts 114990)
some exist: 230 searches took 335.6 seconds (counts 4048)
and then for 1.9:
== Sets ==
Loading took 47.9 seconds and (psutil not installed) memory
all exist: 227 searches took 0.5 seconds (counts 114990)
some exist: 230 searches took 0.1 seconds (counts 4048)
== Arrays ==
Loading took 35.4 seconds and (psutil not installed) memory
all exist: 227 searches took 186.2 seconds (counts 114990)
some exist: 230 searches took 72.9 seconds (counts 4048)
The good news is that 1.9 set operations are faster — 5x faster
to load, and 2x faster to search. In fact, the search time is faster
than cpython's! The bad news is it looks like pypy's
intersection_update isn't yet as well optimized. Also, manual
inspection shows that pypy 1.9 uses 2.1 GB of memory, which is about
the same as cpython's. pypy on its own isn't the right solution.
Is there an inverted index library for Python?
Inverted indices are used in all sorts of search engines, and there are
many well-known ways to improve intersection performance and decrease
memory use - for example, Judy arrays and the compressed posting-list
encodings described in Managing Gigabytes. However, I haven't found
good Python bindings for these. (I wrote some Python/Judy bindings
years ago. Unfortunately, they're buggy, and I would have to start from
scratch to get them working again.) So far I've not been able to find
an existing library which handles this for me, much less one with
Python bindings.
Search engine components
Actually, I'm wrong. These are available as part of larger search
engine packages like Lucene and Xapian. Lucene is very well known, but
I would have to mix the Java runtime with the chemistry code that I'm
using from C++. Doable, but it feels like too much weight.
Instead, I downloaded, compiled, and installed Xapian and its Python
bindings and wrote an inverted index based on its API. (Justinsaccount was the
first to suggest this in
the Reddit thread.)
Xapian uses the
filesystem to store the database, so this version uses a lot less
local memory than the previous two inverted indices. I wonder how much
of the data set is stored in memory vs. how much is read from disk. I
imagine there are tuning variables, and it looks like there's an
in-memory option I could also try. I am quite the newbie with this
library.
When I run the benchmark I get this timing information:
Loading took 349.9 seconds and 46465024 memory
all exist: 227 searches took 15.7 seconds (counts 114990)
some exist: 230 searches took 20.7 seconds (counts 4048)
Queries are roughly 20-100x slower than the cpython set implementation,
and even further behind pypy's. But on the other hand, this uses only 90 MB of
RAM, although it does build a 340 MB data file. Simple scaling
suggests that I'll need 65 GB of disk space for the entire data
set. That's quite doable.
The Xapian load time is a lot slower, but as the data is persistent,
reload time from a built database is trivially fast. The documentation
says I can set XAPIAN_FLUSH_THRESHOLD to a larger value for better
indexing performance. Are there other tuning variables I should
consider?
Xapian might be the way to go. Value ranges might
be useful when I think about how to handle fragment counts. I'll need
to do more evaluation. For example, I don't like that it's an order of
magnitude slower than Python sets. After all, Python sets aren't
optimized for this task, so I expected faster search times.
Sadly, there isn't much documentation for Xapian - nowhere near as
much as there is for Lucene. (For example, there appears to be no
page which describes the tuning environment variables. There are
XAPIAN_MAX_CHANGESETS and XAPIAN_FLUSH_THRESHOLD .. are these or
something else relevant for what I want?)
I still want a good inverted index library for Python
Xapian might work well for this project, but there are other projects
I work on which use inverted indices. In one project I try to
optimize the ~1024 fixed patterns used in a traditional feature-based
substructure screen. Both my greedy algorithm and my genetic algorithm
for this used a lot of temporary intersections to evaluate a given
selection. I'm not sure how well it would work on top of a persistent
database.
If you know of a good library, please email me or comment on the
Reddit
thread I started on the topic.