Ruby Buzz Forum - Oniguruma (Ruby with Demon wheels)

Andrew Johnson


Posted: Jan 23, 2006 8:59 PM

This post originated from an RSS feed registered with Ruby Buzz by Andrew Johnson.
Original Post: Oniguruma (Ruby with Demon wheels)
Feed Title: Simple things ...
Feed URL: http://www.siaris.net/index.cgi/index.rss
Feed Description: On programming, problem solving, and communication.

byline: Andrew L Johnson

Oniguruma is a regular expression C library you can use in your own projects under the BSD license, or you can install it as Ruby’s regular expression engine (in which case it falls under the Ruby license). Oniguruma may be translated to English as Demon Wheel (or something along those lines).

Oniguruma is slated to become Ruby’s default regular expression engine, and Ruby 1.9 already has it included. But you don’t have to wait to try it out — it is easily incorporated into 1.8* ruby builds and basically just involves:

1. downloading and unpacking the latest oniguruma sources for Ruby
2. configuring oniguruma with your Ruby source directory
3. making oniguruma (which applies the patches to the Ruby sources)
4. rebuilding and testing your ruby (make clean;make;make test) in the Ruby directory
5. testing oniguruma (make test) in the oniguruma directory

The only danger in doing this is forgetting that oniguruma is not yet standard Ruby and shouldn’t be a dependency in released code. You might want to build both a standard ruby and an oni-ruby (or perhaps guru-ruby).

Oniguruma brings several features to Ruby’s regexen, notably:

positive and negative look-behind
possessive quantifiers (like atomic/independent subexpressions, but written as a quantifier)
named backreferences
callable backreferences

Look-behind and callable backreferences are probably the main reasons you’d want to install oniguruma.
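For readers on a Ruby build that includes Oniguruma, the two smaller features in the list above (named backreferences and possessive quantifiers) might look roughly like this. It is only a sketch: the group names quote and body and the sample strings are made up for illustration.

```ruby
# Named backreference: \k<quote> must match whatever the group "quote" captured,
# so a string opened with a double quote is only closed by a double quote.
quoted = /(?<quote>['"])(?<body>.*?)\k<quote>/
m = quoted.match(%q{say "it's fine" now})
body = m[:body]

# Possessive quantifier: a++ takes every "a" and never gives one back,
# so the trailing "ab" can no longer match.
possessive_fails = ("aaab" =~ /a++ab/).nil?

# Plain greedy a+ backtracks by one "a", so the same string does match.
greedy_matches = !("aaab" =~ /a+ab/).nil?

puts body
puts possessive_fails
puts greedy_matches
```

The possessive form is handy when you know backtracking into the repeated part can never help, since the engine can then fail fast.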

Look-Behinds

Look-ahead assertions have been around for some time, in many regular expression flavors. Look-behind assertions are less prevalent. Oniguruma brings positive and negative look-behind assertions ((?<=…) and (?<!…) respectively) to Ruby. Just like look-ahead assertions, these are zero-width assertions — they match the current position if the assertion about what follows (look-aheads) or precedes (look-behinds) is true. They do not consume any part of the string.

Unlike look-ahead assertions, look-behinds must contain fixed-width patterns which means: no indeterminate quantifiers. However, alternation is allowed at the top level of the look-behind, and the alternations need not be of the same fixed width. Capturing is allowed within positive look-behinds, but not in negative look-behinds (which makes sense).
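A small sketch of the rules just described, assuming a Ruby with Oniguruma available. The sample strings and variable names are invented for the example.

```ruby
text = "cost: $45, qty: 12, tax: $3"

# Positive look-behind: digits that directly follow a "$".
# The "$" itself is not part of the match.
prices = text.scan(/(?<=\$)\d+/)

# Negative look-behind: digits not preceded by "$" or by another digit.
counts = text.scan(/(?<![\$\d])\d+/)

# Top-level alternatives may differ in width ("USD " is 4 characters, "$" is 1),
# as long as each alternative is itself fixed-width.
amounts = "USD 7 and $45".scan(/(?<=USD |\$)\d+/)

# Zero-width: "foo" is checked but not consumed, so only "bar" is replaced.
replaced = "foobar".sub(/(?<=foo)bar/, "BAZ")

puts prices.inspect
puts counts.inspect
puts amounts.inspect
puts replaced
```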

Callable Backreferences

Callable backreferences give us recursively defined regular expressions, which allow one to match/extract arbitrarily nested balanced parentheses (or other delimiters).

  # to match a group of nested unescaped parentheses:

  re = %r/((?<pg>\((?:\\[()]|[^()]|\g<pg>)*\)))/
  s = 'some(stri\)\((()x)(((c)d)e)\))ng'
  mt = s.match re
  puts mt[1]

    ==> (stri\)\((()x)(((c)d)e)\))
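The same idea can be used to pull every top-level balanced group out of a string. The following is a simplified sketch, not a definitive recipe: unlike the pattern above it ignores escaped parentheses, and the names BALANCED, paren, and balanced_groups are hypothetical ones chosen for this example.

```ruby
# \g<paren> calls the group named "paren" recursively, so each nested
# "(...)" is matched by the same sub-pattern.
BALANCED = /(?<paren>\((?:[^()]|\g<paren>)*\))/

def balanced_groups(text)
  # scan returns one single-element array per match (the "paren" group),
  # so flatten leaves a plain list of matched substrings.
  text.scan(BALANCED).flatten
end

groups = balanced_groups("f(a, g(b)) + h(c) + (")
puts groups.inspect
```

The trailing unmatched "(" is skipped because the pattern only succeeds when every opening parenthesis finds its closing partner.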

Difference between Oniguruma and Standard Ruby Regular Expressions

The main behavioral difference I’ve noted between the two regular expression engines involves capturing with zero-length subexpression matches. In the following, sruby is standard ruby, and oruby is compiled with oniguruma:

  $ sruby -e '"abax" =~ /((a)*(b)*)*/; print "#{$&}:#{$1}:#{$2}:#{$3}\n"'
  aba::a:b
  $ oruby -e '"abax" =~ /((a)*(b)*)*/; print "#{$&}:#{$1}:#{$2}:#{$3}\n"'
  aba::a:b

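No difference there, but note that Perl handles this differently (and, IMHO, more correctly). The same Ruby-style print string is reused here, so Perl prints the #{ and } characters literally around each value: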

  $ perl -e '"abax" =~ /((a)*(b)*)*/; print "#{$&}:#{$1}:#{$2}:#{$3}\n"'
  #{aba}:#{}:#{}:#{}

In my mind, with nested capturing such as this I would expect that the contents of $2 and $3 would be substrings (even if only empty strings) of $1, as Perl handles it. However, Ruby isn’t alone: Python and PCRE both handle it as Ruby does.

If this behavior doesn’t seem strange, consider this more obvious example:

  $ sruby -e '"ba" =~ /((a)*(b)*)*/; print "#{$&}:#{$1}:#{$2}:#{$3}\n"'
  ba::a:b
  $ oruby -e '"ba" =~ /((a)*(b)*)*/; print "#{$&}:#{$1}:#{$2}:#{$3}\n"'
  ba::a:b

I understand the interpretation, I just don’t think it is the most correct interpretation to follow.

The difference between Oniguruma and standard Ruby becomes apparent in the following example, when the subexpressions themselves may be zero-length:

  $ sruby -e '"abax" =~ /((a*)*(b*)*)*/; print "#{$&}:#{$1}:#{$2}:#{$3}\n"'
  aba::a:b
  $ oruby -e '"abax" =~ /((a*)*(b*)*)*/; print "#{$&}:#{$1}:#{$2}:#{$3}\n"'
  aba:::
  $ perl -e '"abax" =~ /((a*)*(b*)*)*/; print "#{$&}:#{$1}:#{$2}:#{$3}\n"'
  #{aba}:#{}:#{}:#{}

Here, Oniguruma sides with Perl instead of Ruby, and all the captured subexpressions are the empty string. However, pcre agrees with standard Ruby on this one, and Python won’t even compile the regular expression.

  Versions used in testing:
    sruby       => Ruby 1.8.4 (2006-01-21)
    oruby       => Ruby 1.8.4 (2006-01-21) with Oniguruma 2.5.2
    Perl 5.8.7
    Python 2.4
    pcre 6.3
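To repeat the comparisons above on whatever Ruby build you have at hand, a small script along these lines may be more convenient than the one-liners. It is a sketch only: capture_report and CASES are names invented here, and no expected output is shown because the captures depend on the regular expression engine in use.

```ruby
# Returns the whole match and the first three groups as strings.
# Groups that did not participate are nil, which to_s turns into "",
# mirroring what string interpolation does in the one-liners.
def capture_report(pattern, subject)
  m = pattern.match(subject)
  return nil unless m
  [m[0], m[1], m[2], m[3]].map(&:to_s)
end

# The three pattern/subject pairs used in the comparisons above.
CASES = [
  [/((a)*(b)*)*/,   "abax"],
  [/((a)*(b)*)*/,   "ba"],
  [/((a*)*(b*)*)*/, "abax"],
]

CASES.each do |pattern, subject|
  report = capture_report(pattern, subject)
  puts "#{pattern.inspect} on #{subject.inspect}: #{report.join(':')}"
end
```

Running the same file under both a standard and an Oniguruma-enabled interpreter gives a side-by-side view of any capture differences.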

__END__
