This post originated from an RSS feed registered with Agile Buzz
by Martin Fowler.
Original Post: ComposedRegex
Feed Title: Martin Fowler's Bliki
Feed URL: http://martinfowler.com/feed.atom
Feed Description: A cross between a blog and wiki of my partly-formed ideas on software development
One of the most powerful tools in writing maintainable code is to
break large methods into well-named smaller methods, a technique
Kent Beck refers to as the Composed Method pattern.
People can read your programs much more quickly and accurately
if they can understand them in detail, then chunk those details
into higher level structures.
What works for methods often works for other things as well. One
area that I've run into a couple of times where people fail to do
this is with regular expressions.
Let's say you have a file full of rules for scoring frequent
sleeper points for a hotel chain. The rules all look rather like:
    score 400 for 2 nights at Minas Tirith Airport
We need to pull out the points (400), the number of nights (2), and
the hotel name (Minas Tirith Airport) for each of these rows.
This is an obvious task for a regex, and I'm sure right now
you're thinking - oh yes we need:
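The one-line pattern itself is missing from this copy of the post. A reconstruction consistent with the commented version shown later might look like the sketch below; the variable names and the way the groups are read are illustrative assumptions, not the original listing.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumed reconstruction: the same pieces as the commented pattern
// later in the post, written on a single line.
const string pattern = @"^score\s+(\d+)\s+for\s+(\d+)\s+nights?\s+at\s+(.*)";

Match match = Regex.Match("score 400 for 2 nights at Minas Tirith Airport", pattern);

// Groups are numbered by their opening parenthesis, left to right.
int points = int.Parse(match.Groups[1].Value);
int nights = int.Parse(match.Groups[2].Value);
string hotel = match.Groups[3].Value;

Console.WriteLine($"{points} points, {nights} nights, {hotel}");
```

Run against the sample rule, this pulls out 400, 2 and Minas Tirith Airport.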
I don't know whether or not you're comfortable in understanding
how that regex works and whether it's correct. If you're like me you
have to look at a regex like this and carefully figure out what it's
saying. I often find myself counting parentheses so I can see where
the groups line up (not actually that hard in this case, but I've
seen plenty of others where it's tougher).
You may have read advice to take a pattern like this and
comment it. (This often needs a switch when you turn it into a regex,
so that the engine ignores whitespace and comments in the pattern.)
That way you can write it like this.
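    protected override string GetPattern() {
        // Whitespace and # comments inside the pattern are only ignored
        // when the regex is built with RegexOptions.IgnorePatternWhitespace.
        const string pattern =
            @"^score
            \s+
            (\d+)   # points
            \s+
            for
            \s+
            (\d+)   # number of nights
            \s+
            night
            s?      # optional plural
            \s+
            at
            \s+
            (.*)    # hotel name
            ";
        return pattern;
    }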
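In .NET the switch in question is `RegexOptions.IgnorePatternWhitespace`, which makes the engine skip unescaped whitespace in the pattern and treat `#` to the end of the line as a comment. Here is a minimal sketch of the commented pattern being used with it, assuming the sample rule from earlier; the variable names are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

const string pattern =
    @"^score
      \s+
      (\d+)   # points
      \s+
      for
      \s+
      (\d+)   # number of nights
      \s+
      night
      s?      # optional plural
      \s+
      at
      \s+
      (.*)    # hotel name
    ";

// Without this option the line breaks and comments in the pattern
// would be treated as literal text to match.
var regex = new Regex(pattern, RegexOptions.IgnorePatternWhitespace);
Match match = regex.Match("score 400 for 2 nights at Minas Tirith Airport");

Console.WriteLine(match.Groups[3].Value);
```

One consequence of this option is that a literal space in the pattern no longer matches a space, which is why the pattern spells out `\s+` everywhere.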
This is easier to follow, but comments never quite satisfy
me. Occasionally I've been accused of saying comments are bad, and
that you shouldn't use them. This is wrong, in both senses.
Comments are not bad - but there are often better options. I always
try to write code that doesn't need comments, usually by good
naming and structure. (I can't always succeed, but I feel I do more
often than not.)
People often don't try to structure regexes, but I find it
useful. Here's one way of doing this one.
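The listing for this version is missing from this copy of the post. Here is a sketch of what such a decomposition might look like, with each keyword chunk carrying its own surrounding whitespace; the constant names are assumptions chosen to line up with the variant shown next.

```csharp
using System;
using System.Text.RegularExpressions;

// Each chunk is a small, named regex. The keyword chunks include
// the whitespace around them (an assumption of this sketch).
const string scoreKeyword = @"^score\s+";
const string numberOfPoints = @"(\d+)";
const string forKeyword = @"\s+for\s+";
const string numberOfNights = @"(\d+)";
const string nightsAtKeyword = @"\s+nights?\s+at\s+";
const string hotelName = @"(.*)";

// The final expression reads as the overall shape of a rule.
const string pattern = scoreKeyword + numberOfPoints +
    forKeyword + numberOfNights + nightsAtKeyword + hotelName;

Match match = Regex.Match("score 400 for 2 nights at Minas Tirith Airport", pattern);
Console.WriteLine(match.Groups[3].Value);
```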
I've broken down the pattern into logical chunks and put them
together again at the end. I can now look at that final expression
and understand the basic chunks of the expression, diving into the
regex for each one to see the details.
Here's another alternative that seeks to separate out the whitespace
to make the actual regexes look more like tokens.
    const string space = @"\s+";
    const string start = "^";
    const string numberOfPoints = @"(\d+)";
    const string numberOfNights = @"(\d+)";
    const string nightsAtKeyword = @"nights?\s+at";
    const string hotelName = @"(.*)";
    const string pattern = start + "score" + space + numberOfPoints + space +
        "for" + space + numberOfNights + space + nightsAtKeyword +
        space + hotelName;
I find this makes the individual tokens a bit clearer, but all
those space variables make the overall structure harder to
follow. So I prefer the previous one.
But this does raise a question. All of the elements are separated
by whitespace, and putting lots of space variables or \s+
into the patterns feels wet. The nice thing about breaking the
regexes out into substrings is that I can now use programming logic
to come up with abstractions that suit my particular purpose
better. I can write a method that takes substrings and joins
them up with whitespace.
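The listing is missing here as well. One way such a method could look, as a sketch: `ComposePattern` is a hypothetical name, and it assumes every rule is anchored at the start of the line and that every pair of adjacent pieces is separated by whitespace.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: anchors at the start of the line and joins
// the pieces with \s+ so callers never write the separator themselves.
static string ComposePattern(params string[] parts) =>
    "^" + string.Join(@"\s+", parts);

const string numberOfPoints = @"(\d+)";
const string numberOfNights = @"(\d+)";
const string hotelName = @"(.*)";

// Not a const: the value comes from a method call.
string pattern = ComposePattern(
    "score", numberOfPoints, "for", numberOfNights, "nights?", "at", hotelName);

Match match = Regex.Match("score 400 for 2 nights at Minas Tirith Airport", pattern);
Console.WriteLine(match.Groups[3].Value);
```

The call site now lists only the tokens of a rule, and the separator rule lives in one place.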
You may not use any of these alternatives exactly yourself, but I
do urge you to think about how to make regular expressions
clearer. Code should not need to be figured out; it should just be read.