The Artima Developer Community
Ruby Buzz Forum
Indexing faster than Ferret with some algorithmic help (an order of magnitude faster than...

Eigen Class

Posts: 358
Nickname: eigenclass
Registered: Oct, 2005

Eigenclass is a hardcore Ruby blog.
Posted: Nov 26, 2006 7:47 AM

This post originated from an RSS feed registered with Ruby Buzz by Eigen Class.
Original Post: Indexing faster than Ferret with some algorithmic help (an order of magnitude faster than...
Feed Title: Eigenclass
Feed URL: http://feeds.feedburner.com/eigenclass
Feed Description: Ruby stuff --- trying to stay away from triviality.

I've realized that my initial performance comparisons were flawed because the Ferret index included neither the text nor the term vectors. According to Ferret's documentation (and a basic understanding of inverted indices), term vectors are needed for creating search result excerpts and performing phrase searches. Also, since the index based on suffix arrays holds a copy of the original text, the text should be stored in Ferret's index too if the comparison is to be meaningful.
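To make the point about the stored text concrete, here is a minimal sketch of a suffix-array index in Ruby. It is a naive illustration, not FTSearch's code: the class name `TinySuffixIndex` and its methods are made up for this example, and the construction is far too slow for a real corpus. It only shows why such an index carries the original text around, and why phrase searches and excerpts come for free once it does.

```ruby
# Naive suffix-array sketch (hypothetical names, illustration only).
class TinySuffixIndex
  attr_reader :text, :suffixes

  def initialize(text)
    @text = text # the index keeps a full copy of the original text
    # Sort every suffix start offset by the suffix it points to.
    # Quadratic-ish and memory hungry: fine for a demo, not for real data.
    @suffixes = (0...text.size).sort_by { |i| text[i..-1] }
  end

  # Return the sorted offsets at which `pattern` occurs. A multi-word
  # phrase is just a longer pattern, so no term vectors are involved.
  def search(pattern)
    len = pattern.size
    # Binary search for the first suffix whose prefix is >= pattern.
    first = @suffixes.bsearch_index { |i| @text[i, len] >= pattern }
    return [] if first.nil?

    hits = []
    @suffixes[first..-1].each do |i|
      break unless @text[i, len] == pattern
      hits << i
    end
    hits.sort
  end

  # Excerpts are read straight from the stored text.
  def excerpt(offset, width)
    @text[offset, width]
  end
end
```

An inverted index, by contrast, maps terms to documents and needs extra positional data (Ferret's term vectors) to answer the same phrase query or to build an excerpt.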

Meanwhile, I've also discovered that Ferret is much slower than I thought when you actually try to do something with the results, such as getting their URIs (otherwise, all you have is an internal document ID that doesn't tell you anything). In some quick tests, it needed over 0.30 seconds to return 1165 hits when looking for "sprintf" in the Linux sources after a few runs, and over 8 seconds when the cache was cold. I think both figures will be quite easy to beat, but that will come later; I want indexing to be fast to begin with, as I'll be running the indexer often while I develop this.
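As a rough sketch of what "getting the URIs" can involve, the snippet below resolves a hit offset in a concatenated text to the URI of the document that contains it. This is an assumption about one possible design, not a description of how Ferret or FTSearch does it; `DocMap`, `add`, and `uri_for_offset` are hypothetical names.

```ruby
# Hypothetical document map: hit offset -> URI via binary search.
class DocMap
  def initialize
    @starts = [] # start offset of each document in the concatenated text
    @uris   = [] # URI of each document, indexed by document ID
    @size   = 0  # total length of the text added so far
  end

  # Register a document of `length` characters; returns its document ID.
  def add(uri, length)
    @starts << @size
    @uris << uri
    @size += length
    @uris.size - 1
  end

  # Map an offset in the concatenated text back to a URI (nil if out of range).
  def uri_for_offset(offset)
    return nil if offset < 0 || offset >= @size

    # Find the first document starting after the offset, then step back one.
    idx = @starts.bsearch_index { |start| start > offset }
    doc_id = idx ? idx - 1 : @starts.size - 1
    @uris[doc_id]
  end
end
```

With the map held in memory, each lookup is a binary search over an array of integers, so resolving a thousand hits costs very little compared with the search itself.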

I've rewritten the indexer, made it modular (e.g. it can index documents with multiple fields, using different analyzers on each), and then implemented a couple of functions in C: some 150 lines of C, compared to Ferret's >50000. This is the beginning of FTSearch (I'll soon put the darcs repository online).
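The "multiple fields, different analyzers" idea can be sketched as follows. This is not FTSearch's API: `FieldedIndexer`, the two analyzer constants, and the term-to-document-ID hash used for storage are all stand-ins invented for the example (FTSearch itself is built on suffix arrays). The only point is that each field is paired with its own analyzer.

```ruby
# Analyzers are plain callables: text in, array of terms out.
WHITESPACE_ANALYZER = ->(text) { text.split }
SIMPLE_ANALYZER     = ->(text) { text.downcase.scan(/[a-z0-9_]+/) }

# Hypothetical indexer that applies a different analyzer to each field.
class FieldedIndexer
  def initialize(analyzers)
    @analyzers = analyzers # { field_name => analyzer }
    # postings[field][term] => array of document IDs
    @postings = Hash.new do |fields, field|
      fields[field] = Hash.new { |terms, term| terms[term] = [] }
    end
    @doc_count = 0
  end

  # `fields` maps field names to their text; returns the new document ID.
  def add_document(fields)
    doc_id = @doc_count
    @doc_count += 1
    fields.each do |name, text|
      analyzer = @analyzers.fetch(name) # raises KeyError for unknown fields
      analyzer.call(text).uniq.each { |term| @postings[name][term] << doc_id }
    end
    doc_id
  end

  # Document IDs containing `term` in `field` (empty array if none).
  def lookup(field, term)
    @postings.fetch(field, {}).fetch(term, [])
  end
end
```

Here a `uri` field could use the whitespace analyzer so paths stay intact, while a `body` field is lowercased and split into identifier-like tokens.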

Here's how FTSearch compares to Ferret right now:

[Figures: benchmark2.png, times2.png]

Needless to say, this would make it far faster than Lucene, maybe by an order of magnitude, if this still holds*1 for this corpus.
