I'm happy to announce the release of libxml-ruby 1.1.3. Besides including
the usual assortment of new features and bug fixes, this release also
includes a speed boost of roughly 10% to 20%.
The speed boost resulted from RubyInside's recent post summarizing the
performance of Ruby parsers. As expected, libxml-ruby blew away Hpricot and
REXML in pure parsing speed (which of course is a simplistic view of what
matters in an XML processor, but is nevertheless still important).
But it consistently finished a bit behind Nokogiri.
I was a bit surprised by that, since libxml-ruby and Nokogiri both use the
libxml2 library as their parsing engine. Since the specific test cases
almost exclusively tested parsing, the two extensions should have had
identical run times. Since the times were different, the obvious
conclusion was that the two extensions were either using different libxml2
APIs or using different settings. I suspected the latter, but when
investigating performance you never know beforehand.
Not to bore everyone with the nitty-gritty details of using
libxml2, but when looking into the first test, parsing an in-memory
string, it didn't look like there was much difference in API calls.
The next possibility was that xmlDoRead was modifying the libxml2
parser context. Now a libxml2 parser context is a beast of a thing;
for those brave souls who want to take a peek, it's defined in
libxml2's online documentation.
Working through the options one-by-one, I finally found the
culprit, an obscure field in the structure:
int dictNames : Use dictionary names for the tree
What this setting controls is whether libxml2 uses a dictionary
to cache strings it has previously parsed. Caching strings
makes a big difference, so it should be enabled by default.
That is now the case with libxml-ruby 1.1.3 and higher.
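
To make the idea of a string dictionary concrete, here is a minimal
sketch in plain Ruby. It is only an illustration of the caching concept,
not libxml2's actual implementation: the NameDictionary class and its
lookup method are hypothetical names invented for this example. Each
distinct name is stored once, and every later occurrence gets back the
same object instead of a fresh copy.

```ruby
# Illustration only: NameDictionary is a hypothetical class, not part of
# libxml2 or libxml-ruby. It mimics the idea of caching parsed names.
class NameDictionary
  attr_reader :hits, :misses

  def initialize
    @names = {}
    @hits = 0
    @misses = 0
  end

  # Return the single shared copy of +name+, storing it on first sight.
  def lookup(name)
    if (cached = @names[name])
      @hits += 1
      cached
    else
      @misses += 1
      @names[name] = name.dup.freeze
    end
  end
end

dict = NameDictionary.new

# Element names as a parser might encounter them in a repetitive document.
tokens = %w[item title item title item link]
names = tokens.map { |token| dict.lookup(token) }

puts "distinct names stored: #{dict.misses}"
puts "lookups served from the dictionary: #{dict.hits}"
puts "repeated names share one object: #{names[0].equal?(names[2])}"
```

In a document with thousands of repeated element and attribute names,
almost every lookup is served from the dictionary, which is the intuition
behind why turning the setting on helps.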
Rerunning the published benchmarks now shows libxml-ruby and Nokogiri to
have equivalent performance. If you run the tests yourself, beware: the
order in which the extensions are tested changes the results. Whichever
extension is tested first will always be faster, at least on my Fedora 10 box.
I assume that's because the first parser has more memory available to it when
the test begins and therefore invokes Ruby's garbage collector a few fewer
times.
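
One way to reduce the ordering effect, assuming the garbage collector
explanation above holds, is to time every candidate in both orders and
force a collection before each measurement. The sketch below uses only
Ruby's standard Benchmark module. The time_in_both_orders helper and the
two lambdas are hypothetical stand-ins; swap in the real parser calls to
compare actual extensions.

```ruby
require 'benchmark'

# Hypothetical helper: time each candidate in forward and reverse order,
# collecting garbage before every measurement, and sum the elapsed times.
def time_in_both_orders(candidates, runs: 3)
  totals = Hash.new(0.0)
  orders = [candidates.keys, candidates.keys.reverse]

  orders.each do |order|
    order.each do |name|
      GC.start # start each measurement from a freshly collected heap
      totals[name] += Benchmark.realtime do
        runs.times { candidates[name].call }
      end
    end
  end

  totals
end

xml = "<root>#{'<item>text</item>' * 1_000}</root>"

# Stand-ins for real parser invocations (assumptions for this sketch).
candidates = {
  "parser_a" => -> { xml.scan(/<item>/).size },
  "parser_b" => -> { xml.count("<") }
}

totals = time_in_both_orders(candidates)
totals.each { |name, seconds| puts format("%-10s %.6fs", name, seconds) }
```

Because each candidate gets one turn in first position and one in last,
any advantage from running first is shared rather than handed to a single
extension.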