This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Inverted index using Python sets
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
I am working on a problem which is very similar to an inverted
index. What's an
inverted index?
Suppose I want to find a book on "sharks." I could search every book
in the library until I find one on sharks, but that's tedious, and
every time I do a new search I would need to re-search all of the
books again. (Hopefully I would remember some of what I read the first
time!)
Instead, I can invert the task. For every word in every book, make a
list of books which contain the word. That also takes a lot of work
but I only need to do it once. To search for a book about sharks, I
look up the word "shark" and get the small list of candidate books,
which I need to search manually. I still need to search them because
while 20,000 Leagues
Under the Sea mentions sharks, it isn't about sharks, and
neither is a book which mentions a "loan shark" or the "land
shark" skit.
This mapping from each word to the books which contain it is called an
"inverted index."
Compound term searches
I can use it as the basis for more complex queries. I want to find
books which mention "whale shark". If the inverted index contains only
single words then I get the list of books containing the word "shark"
and manually search those for "whale shark", but it would be better if
I combined the list of books containing "whale" and the list of books
containing "shark" to make a new list of those books containing both
"whale" and "shark."
In other words, I find the intersection of those two lists.
An inverted index for letters in words
It's very easy to create an inverted index using Python's
"set type."
Instead of the usual case of searching a book (or document) for words,
I'll show an example of how to search words for letters. On my
computer, the file
"/usr/share/dict/words"
contains 234936 different English words, with one word per line, of
which 233614 are unique after I convert everything to lowercase.
I'll turn that into an inverted index where each letter is mapped to
the set of words which contain that letter:
import collections

inverted_index = collections.defaultdict(set)
for line in open("/usr/share/dict/words"):
    word = line.strip().lower()  # ignore case
    for letter in word:
        inverted_index[letter].add(word)
I'll check the number of entries in the inverted index (there should
be one for each letter), and I'll show the sizes of a couple of them.
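The interactive output didn't survive the feed conversion; the checks would look something like this sketch. It uses a tiny stand-in word list, since the post's counts (144086, 2993, ...) come from the full /usr/share/dict/words:

```python
import collections

# Tiny stand-in word list; the post uses /usr/share/dict/words,
# which is why its counts are much larger.
words = ["shark", "whale", "jam", "jukebox", "banjo"]

inverted_index = collections.defaultdict(set)
for word in words:
    for letter in word:
        inverted_index[letter].add(word)

print(len(inverted_index))       # number of distinct letters seen
print(len(inverted_index["a"]))  # words containing an "a"
print(len(inverted_index["j"]))  # words containing a "j"
```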
This means there are 144086 unique lower-cased words with an "a" or
"A" in them, but only 2993 with a "j" or "J". (From here on I'll only
mention the lower-case letter even when I mean either lower or upper
case.) How many words have both an "a" and a "j" in them?
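The session answering that question is also missing; a sketch of the query on a small stand-in list (with the real dictionary it finds 1724 words):

```python
import collections
import heapq

# Stand-in word list; /usr/share/dict/words gives 1724 matches instead
words = ["jam", "java", "jukebox", "shark", "whale", "ajar", "banjo"]

inverted_index = collections.defaultdict(set)
for word in words:
    for letter in word:
        inverted_index[letter].add(word)

# Intersect the two letter sets with the "&" operator
both = inverted_index["a"] & inverted_index["j"]
print(sorted(both)[:5])          # first few matches, alphabetically
print(heapq.nsmallest(5, both))  # same result, without a full sort
```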
I used
sorted()
because it's a builtin function. However, while it works, it's a bit
wasteful to sort the entire list of 1724 items when I only want the
first 5. If you need something faster, try
heapq.nsmallest
(and
heapq.nlargest).
They can be faster because they only sort the needed subset.
A quick timing test shows that heapq.nsmallest is about 40% faster than sorted()[:5].
Inverted index as a search filter
What about a harder search? Which words contain all 6 vowels
(including y) in alphabetical order? The simplest solution is a linear
search using a regular expression:
>>> import re
>>> sequential_vowels = re.compile("a.*e.*i.*o.*u.*y")
>>> words = [line.strip() for line in open("/usr/share/dict/words")
... if sequential_vowels.search(line)]
>>> len(words)
8
>>> words
['abstemiously', 'adventitiously', 'auteciously', 'autoeciously', 'facetiously',
'pancreaticoduodenostomy', 'paroeciously', 'sacrilegiously']
This works, but if this type of search will occur often, and if
there's enough memory, and if there's a performance need, then it's
easy to speed up using an inverted index.
Filter using the inverted index
I'll split this search into two stages. The first will filter out the
obvious mismatches, and leave a smaller set of candidates for the
second stage.
For the first stage, I'll use the inverted index to find the words
which contains all of the vowels. The inverted index doesn't know the
character order, so once I find the candidates with all of the letters
then I'll use the regular expression test from before.
To find the set of words with all of the vowels, I could continue the
sequence of "&" binary operators as I did earlier, but that gets to be
clumsy when there are six terms. Instead, I'll call
set.intersection()
with the intersection sets as parameters:
>>> vowels = [inverted_index[c] for c in "aeiouy"]
>>> len(set.intersection(*vowels))
670
The variable "vowels" contains a list of sets, "*vowels" in a function
call unpacks that list into individual arguments, and
"set.intersection" creates a new set which is the intersection of all
of the sets in vowels.
("set.intersection" used here is actually an unbound method, and it's
pretty rare to find Python code where using an unbound method makes
sense. The above code is almost identical to
"vowels[0].intersection(*vowels[1:])".)
I used two lines above, mostly for clarity. For myself I would
probably put it into a single line:
>>> len(set.intersection(*(inverted_index[c] for c in "aeiouy")))
670
Yes, the inverted index query can be done in a single line!
Testing the candidates
The first stage reduced the search space from 233614 words to 670. In
the second stage I'll use the regular expression to check which ones
contain the vowels in order.
>>> candidates = set.intersection(*(inverted_index[c] for c in "aeiouy"))
>>> [word for word in candidates if sequential_vowels.search(word)]
['autoeciously', 'adventitiously', 'facetiously', 'abstemiously', 'sacrilegiously',
'auteciously', 'pancreaticoduodenostomy', 'paroeciously']
You can verify that it finds the same matches as before, although
because sets are unordered, the result order has changed.
I've shown that the inverted index can be used to make the code more
complicated. Is it worthwhile? That is, how much faster is the new
code?
My timings show that the old version (which does 233614 regular
expression searches) takes 0.092 seconds to run, while the new one
takes 0.056 seconds. It's about 40% faster.
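The absolute numbers depend on the machine and word list; a sketch of how such a comparison can be timed with timeit, using a small synthetic word list in place of the dictionary (so the times themselves are meaningless, but both approaches must find the same words):

```python
import collections
import re
import timeit

# Synthetic word list standing in for /usr/share/dict/words
words = ["abstemiously", "facetiously", "shark", "whale",
         "jukebox", "education", "sequoia", "banjo"] * 500

sequential_vowels = re.compile("a.*e.*i.*o.*u.*y")

inverted_index = collections.defaultdict(set)
for word in set(words):
    for letter in word:
        inverted_index[letter].add(word)

def linear_search():
    # Brute force: run the regex over every word
    return [w for w in words if sequential_vowels.search(w)]

def index_then_check():
    # Filter with the inverted index, then run the regex on the survivors
    candidates = set.intersection(*(inverted_index[c] for c in "aeiouy"))
    return [w for w in candidates if sequential_vowels.search(w)]

print("linear:  ", timeit.timeit(linear_search, number=20))
print("filtered:", timeit.timeit(index_then_check, number=20))
```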
Use integers as set elements, not strings
It's easy to go faster still. The core of Python's set intersection
works something like this:
new_set = set()
for element in set1:
    if element in set2:
        new_set.add(element)
This requires a hash comparison for every element in set1. If that
passes then there's an equality test in set2, and if that passes then
there's another hash and possible equality test to insert into
new_set.
The set elements are currently strings. String hashing and comparison
are well optimized in Python, but integers are even faster. What if I used
an index into a word list rather than the word itself? The
corresponding code is:
import collections

# Get all of the words into a list of words.
# Ignore words which are the same except for capitalization.
unique_words = set(line.strip().lower() for line in open("/usr/share/dict/words"))
words = sorted(unique_words)

# Map from character to the set of word indices
inverted_index = collections.defaultdict(set)
for i, word in enumerate(words):
    for c in word:
        inverted_index[c].add(i)
I can do the inverted index operations just like before, but the
result is a set of indices into the "words" list. For example:
>>> set.intersection(*(inverted_index[c] for c in "jkx"))
set([99716, 98492])
>>> for i in set.intersection(*(inverted_index[c] for c in "jkx")):
... print words[i]
...
jukebox
jackbox
I'll modify the "sequential_vowels" code to use the index-based
inverted index instead of the string-based version:
>>> candidates = set.intersection(*(inverted_index[c] for c in "aeiouy"))
>>> [words[i] for i in candidates if sequential_vowels.search(words[i])]
['pancreaticoduodenostomy', 'adventitiously', 'abstemiously', 'auteciously', 'sacrilegiously', 'autoeciously', 'facetiously', 'paroeciously']
My timing numbers give 0.029 seconds per search, with nearly all of
the time spent in the set intersection. Remember that the brute-force
linear search takes 0.092 seconds and the original inverted index
takes 0.056 seconds, so switching to integer-based indices brings
another factor of two performance gain. The overall search with an
inverted index is about 3x faster than the original regex-based linear
search.
Order-dependent performance
Python's set.intersection() actually works more like this:
def intersection(*args):
    left = args[0]
    # Perform len(args)-1 pairwise intersections
    for right in args[1:]:
        # Iterate over the smaller set to minimize the number of tests
        if len(left) > len(right):
            left, right = right, left
        # Do the pairwise intersection
        result = set()
        for element in left:
            if element in right:
                result.add(element)
        left = result  # Use as the start for the next intersection
    return left
That is, it does a series of pairwise reductions over the list of
sets. The order of the set operations affects the performance! Recall
that there are only 2993 words with "j" in them. If I search for words
with "m", "a", and "j" in them, as
>>> len(set.intersection(*(inverted_index[c] for c in "maj")))
286
then Python computes the intersection of inverted_index["m"] and
inverted_index["a"], giving an intermediate set with 41148 hits, which
it then intersects with the 2993 "j" elements.
However, if the search order were "jma" then the intersection of "j"
and "m" gives an intermediate set of only 450 elements, which means
only 450 membership tests against inverted_index["a"].
Both give the same answer, but one requires a lot more work than the
other.
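The effect is easy to demonstrate with synthetic sets whose overlaps mimic the letter sets (the sizes are made up for illustration):

```python
# Synthetic stand-ins: "m" and "a" overlap heavily, "j" is small
m = set(range(0, 50000))
a = set(range(25000, 75000))
j = set(range(49000, 50500))

# Large-first: the intermediate m & a is big
print(len(m & a))        # 25000 elements still to test against j
print(len((m & a) & j))  # final answer: 1000

# Small-first: the intermediate j & m is already tiny
print(len(j & m))        # only 1000 elements to test against a
print(len((j & m) & a))  # same final answer: 1000
```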
Here's evidence of just how big that impact is:
>>> import time
>>> def go(letters):
...     t1 = time.time()
...     for i in range(1000):
...         x = len(set.intersection(*(inverted_index[c] for c in letters)))
...     return x, (time.time()-t1)
...
>>> go("jma")
(286, 0.12344503402709961)
>>> go("jam")
(286, 0.2954399585723877)
>>> go("amj")
(286, 6.223098039627075)
>>>
Yes, it's a factor of 50 between the slowest and fastest ordering!
There are some obvious ways to make the Python code faster. The
easiest is to process the sets from smallest size to largest. That was
proposed in Issue3069
on 2008-06-10, but the patch was not integrated into the Python
codebase.
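Until something like that is built in, a caller can apply the same idea by sorting the argument sets by size before handing them to the builtin; a minimal sketch:

```python
def intersect_smallest_first(*sets):
    # Sort by size so the running intersection starts from the smallest set
    ordered = sorted(sets, key=len)
    return set.intersection(*ordered)

# Example: the small set c drives the intersection
a = set(range(1000))
b = set(range(500, 2000))
c = {600, 601, 9999}
print(sorted(intersect_smallest_first(a, b, c)))  # [600, 601]
```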
However, that's not necessarily the best strategy. Suppose I want
words with "q", "u", and "c" in them. There are 3619, 75144, and
85679 words with those letters, respectively, so you might think the
best sort order is "quc". Testing that hypothesis shows otherwise:
there are only 14 words in the data set with a "q" but without a "u",
while there are 2776 words with a "q" and no "c", so
inverted_index["c"] rejects far more of the "q" words than
inverted_index["u"] does.
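Those mismatch counts can be measured directly with set difference; a sketch on a tiny stand-in list (made-up words, not the real dictionary, chosen so the pattern matches the real data):

```python
# Stand-in "q" words; the real dictionary gives counts of 14 and 2776
words = ["quiz", "queen", "qat", "cinq", "acquit", "quick"]

by_letter = {}
for word in words:
    for ch in word:
        by_letter.setdefault(ch, set()).add(word)

q, u, c = by_letter["q"], by_letter["u"], by_letter["c"]
print(len(q - u))  # words with a "q" but no "u"
print(len(q - c))  # words with a "q" but no "c"
```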
Some years ago I came up with a dynamic
algorithm which tries to prefer the set which least matches the
reference set.
For the case of three sets, it simplifies to:
def set_intersection3(*input_sets):
    N = len(input_sets)
    assert N == 3
    min_index = min(range(len(input_sets)), key=lambda x: len(input_sets[x]))
    best_mismatch = (min_index+1)%N
    new_set = set()
    for element in input_sets[min_index]:
        # This failed to match last time; perhaps it's a mismatch this time?
        if element not in input_sets[best_mismatch]:
            continue
        j = 3-best_mismatch-min_index
        # If the element isn't in the set then perhaps this
        # set is a better rejection test for the next input element
        if element not in input_sets[j]:
            best_mismatch = j
        else:
            # The element is in all of the other sets
            new_set.add(element)
    return new_set
while the intersection version which sorts by size is:
def set_intersection_sorted(*input_sets):
    input_sets = sorted(input_sets, key=len)
    new_set = set()
    for element in input_sets[0]:
        if element in input_sets[1]:
            if element in input_sets[2]:
                new_set.add(element)
    return new_set
Here's a head-to-head comparison between the three versions (times in seconds):

pattern   set.intersection   set_intersection3   set_intersection_sorted
quc       0.462              0.852               1.032
qcu       0.312              0.842               1.032
ucq       7.152              0.772               0.962
which shows how (for this case) my dynamic algorithm is less sensitive
to the initial query order and faster than the sorted version. In
fact, if each of the three orderings is equally likely, then the
built-in C code averages about half the speed of the Python code.
Of course it's not possible to simply drop in one of these alternate
algorithms, because set.intersection accepts arbitrary iterables.
These other algorithms would only work for the special case where all
of the inputs are sets.
I've submitted this observation as Issue13653. I'm curious to see what will become of it.