|
This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
|
Original Post: other Wide Finder implementations
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
|
Latest Python Buzz Posts
Latest Python Buzz Posts by Andrew Dalke
Latest Posts From Andrew Dalke's writings
|
|
Beating the wide finder topic to death, I compared my code timings
with other implementations. These were done with Python 2.5, Ruby
1.8.2 and gcc 4.0.1 using the same machine, timed with "time" and
using the wall-clock time. I have a 2.33 GHz MacBook Pro laptop which
was not plugged in when I did the tests. These explain why the
numbers I get now are bigger than I reported before. In all cases I
run the test code twice and report the smallest time of the two,
though those never varied by more than 0.01 second. The input test
set contains 1,000,000 lines and 200,995,500 bytes, as generated by Fredrik's code.
It's interesting that two Python programs are faster than the C
program, though please do note that it's only that - interesting. It
doesn't say much about the absolute performance of a given language.
Indeed, since the Python implementation I'm using is in C the
performance is not directly based on the language.
Instead, I think it's because those two fast Python programs use
search implementations based on variants of Boyer-Moore, while the C
code does a character-by-character search, with a few simple
performance tweaks and ones which filter using regular expressions
end up using a more general-purpose-but-slower tool.
In theory the regular expression engine could look at the pattern and
determine that the prefix is a fixed string so use a better search
algorithm. It may even do some of that now. But that's a lot of work
to implement and maintain, and since most of my regex parsing is on
lines under 80 or so characters and I mostly use "match" instead of
"search", it's a feature which I know I don't often needed.
I thought about implementing something using flex or ragel or re2c. I
even started sketching out some code. I stopped when I realized that
those are not search programs - they are tokenizers and expect the
current position to match exactly. I don't want to implement some
Boyer-Moore-like algorithm myself, especially since I don't need this
code, the Python version works very well, and the command-line version
is:
egrep 'GET /ongoing/When/[0-9]{3}x/[0-9]{4}/[0-9]{2}/[0-9]{2}/[^ .]+ ' o1000k.ap |
awk '{print $7}' | sort | uniq -c | sort -n | tail -10
which runs in 0.82 seconds.
Any Comments?
Read: other Wide Finder implementations