The Artima Developer Community
Sponsored Link

Python Buzz Forum
other Wide Finder implementations

0 replies.

Welcome Guest
  Sign In

Go back to the topic listing  Back to Topic List Click to reply to this topic  Reply to this Topic Click to search messages in this forum  Search Forum Click for a flat view of this topic  Flat View
Previous Topic   Next Topic
Threaded View: This topic has 0 replies on 1 page
Andrew Dalke

Posts: 291
Nickname: dalke
Registered: Sep, 2003

Andrew Dalke is a consultant and software developer in computational chemistry and biology.
other Wide Finder implementations Posted: Oct 9, 2007 6:34 PM
Reply to this message Reply

This post originated from an RSS feed registered with Python Buzz by Andrew Dalke.
Original Post: other Wide Finder implementations
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
Latest Python Buzz Posts
Latest Python Buzz Posts by Andrew Dalke
Latest Posts From Andrew Dalke's writings

Beating the wide finder topic to death, I compared my code timings with other implementations. These were done with Python 2.5, Ruby 1.8.2 and gcc 4.0.1 using the same machine, timed with "time" and using the wall-clock time. I have a 2.33 GHz MacBook Pro laptop which was not plugged in when I did the tests. These explain why the numbers I get now are bigger than I reported before. In all cases I run the test code twice and report the smallest time of the two, though those never varied by more than 0.01 second. The input test set contains 1,000,000 lines and 200,995,500 bytes, as generated by Fredrik's code.

Implementationtime (s)
Python, mxTextTools0.75
command-line tools (see below)0.82
Python, simple tally plus post-processing filter1.0
C, wf3.c by Ilmari Heikkinen1.2
Python, regex on a mmap1.4
Ruby, parallel wide finder1.7
Ruby, Tim Bray's original code2.3
Python, dehÓra's version with simple speedups3.3
Python, dehÓra's version4.0

It's interesting that two Python programs are faster than the C program, though please do note that it's only that - interesting. It doesn't say much about the absolute performance of a given language. Indeed, since the Python implementation I'm using is in C the performance is not directly based on the language.

Instead, I think it's because those two fast Python programs use search implementations based on variants of Boyer-Moore, while the C code does a character-by-character search, with a few simple performance tweaks and ones which filter using regular expressions end up using a more general-purpose-but-slower tool.

In theory the regular expression engine could look at the pattern and determine that the prefix is a fixed string so use a better search algorithm. It may even do some of that now. But that's a lot of work to implement and maintain, and since most of my regex parsing is on lines under 80 or so characters and I mostly use "match" instead of "search", it's a feature which I know I don't often needed.

I thought about implementing something using flex or ragel or re2c. I even started sketching out some code. I stopped when I realized that those are not search programs - they are tokenizers and expect the current position to match exactly. I don't want to implement some Boyer-Moore-like algorithm myself, especially since I don't need this code, the Python version works very well, and the command-line version is:

egrep 'GET /ongoing/When/[0-9]{3}x/[0-9]{4}/[0-9]{2}/[0-9]{2}/[^ .]+ ' o1000k.ap | 
      awk '{print $7}' | sort | uniq -c | sort -n | tail -10
which runs in 0.82 seconds.

Any Comments?

Read: other Wide Finder implementations


Topic: A personal update Previous Topic   Next Topic Topic: Zope is the father of them all

Sponsored Links



Google
  Web Artima.com   

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use