The Artima Developer Community

Ruby Buzz Forum
Doing some n-gram analysis over Ruby's docs

0 replies on 1 page.


Eigen Class

Posts: 358
Nickname: eigenclass
Registered: Oct, 2005

Eigenclass is a hardcore Ruby blog.
Doing some n-gram analysis over Ruby's docs Posted: Dec 24, 2006 5:14 AM

This post originated from an RSS feed registered with Ruby Buzz by Eigen Class.
Original Post: Doing some n-gram analysis over Ruby's docs
Feed Title: Eigenclass
Feed URL: http://feeds.feedburner.com/eigenclass
Feed Description: Ruby stuff --- trying to stay away from triviality.


The first attempts to optimize my pure-Ruby, 200LoC full-text search engine based on suffix arrays (which evolved into the in-progress FTSearch) led me to perform some n-gram analysis over Ruby's core/stdlib docs (as well as a pack of gems).

With a suffix array, locating a word in the corpus involves a binary search of complexity O(log N), N being the total number of suffixes. Since the suffix array (SA) is just a large array of integers representing offsets into the "full text" file (each offset marking the start of a suffix), you need to read the text at the position indicated by the SA entry in order to compare it to the term you're looking for.
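
To make that concrete, here's a minimal sketch of such a lookup, assuming the two files described later (suffixes.dat holding 32-bit offsets, full_text.dat holding the text); the function name and MAX_TERM_SIZE are mine, for illustration only:

# Minimal sketch of a suffix-array lookup: each probe of the binary
# search reads the text at the offset stored in the SA and compares it
# to the search term. File names and MAX_TERM_SIZE are assumptions.
MAX_TERM_SIZE = 64

def find_term(term, sa_io, fulltext_io)
  num_suffixes = sa_io.stat.size / 4          # each SA entry is a 32-bit offset
  lo, hi = 0, num_suffixes - 1
  while lo <= hi
    mid = (lo + hi) / 2
    sa_io.pos = mid * 4
    offset = sa_io.read(4).unpack("V")[0]
    fulltext_io.pos = offset                  # this is the (potential) disk seek
    text = fulltext_io.read(MAX_TERM_SIZE).to_s
    case text[0, term.size] <=> term
    when 0  then return offset                # found a suffix starting with term
    when -1 then lo = mid + 1
    else         hi = mid - 1
    end
  end
  nil
end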

Therefore a corpus like the Linux sources (with some 20 million suffixes) would require about 25 disk seeks even if the suffix array itself were held in memory. At 10 ms per seek, that would be 250 ms when the file isn't cached...
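
That 25 is just the depth of the binary search; a quick back-of-the-envelope check (the 10 ms per uncached seek is the figure assumed above):

# Worst-case lookup cost for a binary search over N suffixes,
# assuming ~10 ms per uncached disk seek.
n_suffixes = 20_000_000
max_seeks  = (Math.log(n_suffixes) / Math.log(2)).ceil   # => 25
puts "#{max_seeks} seeks, ~#{max_seeks * 10} ms"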

The first idea that came to mind to minimize the number of seeks was storing the first n bytes of text right after the offset in the SA:

 ...
 offset(N)    abcdefgh
 offset(N+1)  abcdefgh      (first 8 bytes coincide)
 offset(N+2)  abcdefgi
 ...
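
A sketch of how such an index could be written, under the assumption of a flat file of [offset][first n bytes] records; this is purely illustrative, not FTSearch's actual format:

# Illustrative only: write each 32-bit offset followed by the first
# INLINE_BYTES bytes of its suffix, so most binary-search probes can
# compare against the inlined prefix without seeking into full_text.dat.
INLINE_BYTES = 8

def write_inline_sa(suffix_offsets, fulltext_io, out_io)
  suffix_offsets.each do |offset|
    fulltext_io.pos = offset
    prefix = fulltext_io.read(INLINE_BYTES).to_s.ljust(INLINE_BYTES, "\0")
    out_io.write([offset].pack("V") + prefix)
  end
end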

Ideally, if suffixes covered the whole "character space" uniformly, storing n bytes would mean n * 8 fewer disk seeks. So n = 4, which would double the size of the SA (offsets are 32-bit integers), would be enough to find any word in a corpus under 4GB without a single disk seek when the SA is held in memory.

Of course, the distribution of suffixes is hardly uniform. We tend to repeat a limited number of words instead of writing random byte strings... I did some basic n-gram analysis over the documentation from the core/stdlib and some ~80 gems; character-wise to begin with (since this is what matters for the full-text search engine) and then word-based while I was at it.

This analysis showed that such "inline suffixes" add a large space overhead to the index with relatively little gain, so I adopted a slightly different solution.

Counting n-grams

If you have FastRI and have used it to build a full-text index of your core/stdlib/gem docs, your ~/.fastri-fulltext/ directory will hold a "suffixes.dat" file (just a large array of 32 bit ints) and a "full_text.dat" file which holds the actual indexed documentation.

Collecting the n-grams is easy:

MAX_SUFFIX_SIZE = 64          # read at most 64 bytes of text per suffix
NGRAM_SIZES = [4, 8, 10, 12]  # prefix lengths (in bytes) to analyze

suffixes = []

File.open("full_text.dat", "rb") do |ft|
  File.open("suffixes.dat", "rb") do |f|
    until f.eof?
      # each SA entry is a 32-bit little-endian offset into full_text.dat
      offset = f.read(4).unpack("V")[0]
      ft.pos = offset
      # keep the text up to the first NUL separator
      suffix = ft.read(MAX_SUFFIX_SIZE).split(/\0/)[0]
      suffixes << suffix
    end
  end
end

puts "Total suffixes: #{suffixes.size}"

# ngrams[n] maps each n-byte prefix (n-gram) to its number of occurrences
ngrams = Hash.new{|h,k| h[k] = Hash.new{|h2,k2| h2[k2] = 0} }
all    = {}   # set of distinct (truncated) suffixes
suffixes.each do |suffix|
  all[suffix] = true
  NGRAM_SIZES.each{|n| ngrams[n][suffix[0,n]] += 1 }
end

puts <<EOF

===============================
Character-based n-gram analysis
===============================
#{all.size} unique #{MAX_SUFFIX_SIZE}-grams
A lookup would take on average #{Math.log(suffixes.size)/Math.log(2)} disk seeks.

EOF

With my index, I get:

Total suffixes: 224687
196625 unique 64-grams
A lookup would take on average 17.7775571295372 disk seeks.

n-gram statistics

At this point, ngrams[n] is a hash associating each n-gram with its number of occurrences, and we can try to obtain some interesting statistics:

  • number of suffixes which start with the same n bytes (the same n-gram)
  • mean, median, maximum number of suffixes per n-gram, as well as stddev
  • how many disk seeks we can expect to save on average

The latter is actually the entropy of the n-gram distribution. If there's only one n-gram, knowing the first n bytes doesn't give you any usable info (the entropy is 0) and you don't save any disk seek; all you've done is increase the size of the SA with no gain whatsoever. Now, if there are two n-grams and 50% of the suffixes start with one and 50% with the other, then a binary search will take 1 disk seek less (you just saved 10 ms). For a given n-gram size n, we can save at most n * 8 seeks, and only if there are 2^(n * 8) suffixes, each starting with a different n-gram.

The gain (number of disk seeks saved) can thus be computed as

\sum_{i=1}^{M} -\frac{n_i}{S} \log_2 \frac{n_i}{S}

where there are M distinct n-grams, n_i is the number of suffixes starting with the i-th n-gram, and S = \sum_{i=1}^{M} n_i.
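
A sketch of how those statistics and the gain could be computed from the ngrams hash built above (variable names are mine; the median and stddev here are simple, uncorrected estimates):

# Per-n statistics over ngrams[n], plus the entropy-based estimate of
# the disk seeks saved. Purely illustrative; not the original script.
NGRAM_SIZES.each do |n|
  counts = ngrams[n].values
  total  = counts.inject(0){|s,x| s + x }              # S in the formula above
  mean   = total.to_f / counts.size
  median = counts.sort[counts.size / 2]
  stddev = Math.sqrt(counts.inject(0.0){|s,x| s + (x - mean) ** 2 } / counts.size)
  gain   = counts.inject(0.0) do |acc, n_i|            # - sum (n_i/S) log2(n_i/S)
    p = n_i.to_f / total
    acc - p * Math.log(p) / Math.log(2)
  end
  puts "%2d-grams: %d distinct, mean %.2f, median %d, stddev %.2f, ~%.2f seeks saved" %
       [n, counts.size, mean, median, stddev, gain]
end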


Read more...



