Agile Buzz Forum - ComposedRegex

Articles |
News |
Weblogs |
Books |
Forums

Artima Forums | Articles | Weblogs | Java Answers | News

Sponsored Link •

Agile Buzz Forum
ComposedRegex

0 replies on 1 page.

Welcome Guest
Sign In

Back to Topic List

Reply to this Topic

Search Forum

Threaded View


Previous Topic		Next Topic

Flat View: This topic has 0 replies on 1 page

Martin Fowler

Posts: 1573
Nickname: mfowler
Registered: Nov, 2002

Martin Fowler is an author and loud mouth on software development

ComposedRegex

Posted: Jul 23, 2009 11:28 AM

This post originated from an RSS feed registered with Agile Buzz by Martin Fowler.
Original Post: ComposedRegex Feed Title: Martin Fowler's Bliki Feed URL: http://martinfowler.com/feed.atom Feed Description: A cross between a blog and wiki of my partly-formed ideas on software development	Latest Agile Buzz Posts Latest Agile Buzz Posts by Martin Fowler Latest Posts From Martin Fowler's Bliki

One of the most powerful tools in writing maintainable code is break large methods into well-named smaller methods - a technique Kent Beck refers to as the Composed Method pattern.

People can read your programs much more quickly and accurately if they can understand them in detail, then chunk those details into higher level structures.

--Kent Beck

What works for methods often works for other things as well. One area that I've run into a couple of times where people fail to do this is with regular expressions.

Let's say you have a file full of rules for scoring frequent sleeper points for a hotel chain. The rules all look rather like:

score 400 for 2 nights at Minas Tirith Airport

We need to pull out the points (400) the number of nights (2) and the hotel name (Minas Tirith Airport) for each of these rows.

This is an obvious task for a regex, and I'm sure right now you're thinking - oh yes we need:

      const string pattern = @"^score\s+(\d+)\s+for\s+(\d+)\s+nights?\s+at\s+(.*)";

Then our three values just pop out of the groups.

I don't know whether or not you're comfortable in understanding how that regex works and whether it's correct. If you're like me you have to look at a regex like this and carefully figure out what it's saying. I often find myself counting parentheses so I can see where the groups line up (not actually that hard in this case, but I've seen plenty of others where it's tougher).

You may have read advice to take a pattern like this and to comment it. (Often needs a switch when you turn it into a regex.) That way you can write it like this.

    protected override string GetPattern() {
      const string pattern =
        @"^score
        \s+  
        (\d+)          # points
        \s+
        for
        \s+
        (\d+)          # number of nights
        \s+
        night
        s?             #optional plural
        \s+
        at
        \s+
        (.*)           # hotel name
        ";
  
      return pattern;
    }
  }

This is easier to follow, but comments never quite satisfy me. Occasionally I've been accused of saying comments are bad, and that you shouldn't use them. This is wrong, in both senses. Comments are not bad - but there are often better options. I always try to write code that doesn't need comments, usually by good naming and structure. (I can't always succeed, but I feel I do more often than not.)

People often don't try to structure regexs, but I find it useful. Here's one way of doing this one.

    const string scoreKeyword = @"^score\s+";
    const string numberOfPoints = @"(\d+)";
    const string forKeyword = @"\s+for\s+";
    const string numberOfNights = @"(\d+)";
    const string nightsAtKeyword = @"\s+nights?\s+at\s+";
    const string hotelName = @"(.*)";

    const string pattern =  scoreKeyword + numberOfPoints +
      forKeyword + numberOfNights + nightsAtKeyword + hotelName;

I've broken down the pattern into logical chunks and put them together again at the end. I can now look at that final expression and understand the basic chunks of the expression, diving into the regex for each one to see the details.

Here another alternative that seeks to separate the whitespace to make the actual regexs look more like tokens.

    const string space = @"\s+";
    const string start = "^";
    const string numberOfPoints = @"(\d+)";
    const string numberOfNights = @"(\d+)";
    const string nightsAtKeyword = @"nights?\s+at";
    const string hotelName = @"(.*)";

    const string pattern =  start + "score" + space + numberOfPoints + space +
      "for" + space + numberOfNights + space + nightsAtKeyword + 
       space + hotelName;

I find this makes the individual tokens a bit clearer, but all those space variables makes the overall structure harder to follow. So I prefer the previous one.

But this does raise a question. All of the elements are separated by space, and putting in lots of space variables or \s+ in the patterns feels wet. The nice thing about breaking out the regexs into sub-strings is that I can now use the programming logic to come up with abstractions that suit my particular purpose better. I can write a method that will take sub strings and join them up with whitespace.

    private String composePattern(params String[] arg) {
      return "^" + String.Join(@"\s+", arg);
    }

Using this method, I then have.

    const string numberOfPoints = @"(\d+)";
    const string numberOfNights = @"(\d+)";
    const string hotelName = @"(.*)";

    const string pattern =  composePattern("score", numberOfPoints, 
      "for", numberOfNights, "nights?", "at", hotelName);

You may not use exactly any of these alternative yourself, but I do urge you to think about how to make regular expressions clearer. Code should not need to be figured out, it should be just read.

Read: ComposedRegex

Previous Topic

Next Topic


	Web Artima.com