The Artima Developer Community
Python Buzz Forum
Expressing Oneself, Fluently

Ben Last

Expressing Oneself, Fluently Posted: Jul 11, 2013 1:09 AM

This post originated from an RSS feed registered with Python Buzz by Ben Last.
Original Post: Expressing Oneself, Fluently
Feed Title: The Law Of Unintended Consequences
Feed URL: http://benlast.livejournal.com/data/rss
Feed Description: The Law Of Unintended Consequences
I use regular expressions[1] in Python often enough that I know many of the character classes and syntax tricks.

I use regular expressions in Python seldom enough that I forget many of the character classes and syntax tricks.

This is annoying, but I have lived with it. Then I came across a little article in Dr Dobb's Journal (which used to be great, many years ago, but is now a mere ghost of its former glory), in which Al Williams writes about (ab)using the C/C++ compiler to allow regular expressions to be written as:

start + space + zero_or_more + any_of("ABC") + literal(":") + group(digit + one_or_more)

rather than

^\s*[ABC]:(\d+)
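
For comparison, here is a small sketch of how the terse form behaves with Python's standard re module. The sample strings are made up purely for illustration.

```python
import re

# The terse pattern from the article, written as a raw string.
pattern = re.compile(r"^\s*[ABC]:(\d+)")

# Hypothetical sample input: leading spaces, a letter from [ABC], a colon, digits.
match = pattern.match("   B:1234 trailing text")
captured = match.group(1) if match else None

# 'D' is not in the character class [ABC], so this does not match.
no_match = pattern.match("D:99")
```

The captured group holds only the digits, which is the part wrapped in group(...) in the fluent version.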

I quite like the idea of fluent syntax (though I appreciate that it's not necessarily so appealing if your native language isn't English), but I spend marginally more time writing Python than C++ these days. Also, I like the idea of trying to build a fluent interface in a functional way. So, I started out by writing some examples of what I would want to be able to do:

start().end() would give the minimal empty string regex "^$". Easy enough.

Or how about

any_number_of().digits().followed_by().dot().then().at_least_one().digit()

You get the general idea: the fluent syntax describes the expression and results in a string that matches what it describes. One really good thing is that it avoids the "backslash plague" that can confuse those new to writing regular expressions in Python.
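
To illustrate the "backslash plague" with nothing but the standard library: the two string literals below describe the same pattern, but the first needs every regex backslash doubled. The pattern and the sample string are arbitrary examples, not taken from grimace.

```python
import re

# Without a raw string, every regex backslash must itself be escaped.
escaped = "^\\w+\\.\\d+$"

# A raw string keeps the backslashes exactly as written.
raw = r"^\w+\.\d+$"

# Both literals produce the same pattern text.
same_pattern = (escaped == raw)

# Hypothetical sample: word characters, a literal dot, then digits.
matches = re.match(raw, "file.42") is not None
```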

This now exists on GitHub at https://github.com/benlast/grimace, and it will do the above and more. Time for some more intense code examples:

The grimace.RE object is our starting point; any method we call on it returns a new RE object. Let's get the regex to match an empty string.

>>> from grimace import RE
>>> print RE().start().end().as_string()
^$

The as_string() call turns the generated expression into a string that can then be used as the argument to the standard Python re module. There's also as_re() which will compile the regular expression for you and return the resulting pattern-matching object.
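
The idea that every call returns a new object can be sketched in a few lines. This is a toy illustration only: the class name MiniRE and its internals are hypothetical and are not grimace's actual implementation, which supports far more.

```python
import re


class MiniRE:
    """Toy fluent regex builder (hypothetical; not grimace's real code)."""

    def __init__(self, parts=()):
        # Fragments live in a tuple so an instance is never modified in place.
        self._parts = tuple(parts)

    def _with(self, fragment):
        # Every fluent call returns a fresh object with one more fragment.
        return MiniRE(self._parts + (fragment,))

    def start(self):
        return self._with("^")

    def end(self):
        return self._with("$")

    def literal(self, text):
        # Escape the text so it is matched verbatim.
        return self._with(re.escape(text))

    def as_string(self):
        return "".join(self._parts)

    def as_re(self):
        return re.compile(self.as_string())


base = MiniRE().start()
empty = base.end()
dotted = base.literal("a.b").end()
```

Because base is never mutated, it can be reused as the prefix for several different expressions, which is one reason to build a fluent interface this way.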


>>> #Extract the extension of a short DOS/FAT32 filename, using a group to capture that part of the string
>>> regex = RE().start().up_to(8).alphanumerics().dot().group(name="ext").up_to(3).alphanumerics().end_group().end().as_string()
>>> print regex
^\w{0,8}\.(?P<ext>\w{0,3})$
>>> #Use the re module to compile the expression and try a match
>>> import re
>>> pattern = re.compile(regex)
>>> pattern.match("abcd.exe").group("ext")
'exe'
>>>
>>> #We can do that even more fluently...
>>> RE().start().group(name="filename").up_to(8).alphanumerics().end_group() \
... .dot().group(name="ext").up_to(3).alphanumerics().end_group() \
... .end().as_re().match("xyz123.doc").groups()
('xyz123', 'doc')
>>>
>>> #The cool example that I wrote out as a use case
>>> print RE().any_number_of().digits().followed_by().dot().then().at_least_one().digit().as_string()
\d*\.\d+
>>>
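
As a quick sanity check, the generated string \d*\.\d+ can be tried against a few inputs with the standard re module. The sample strings here are invented for illustration.

```python
import re

# The pattern string from the use case, written as a raw string.
decimal = re.compile(r"\d*\.\d+")

# Hypothetical samples: fullmatch requires the whole string to fit the pattern.
samples = ["3.14", ".5", "42", "7."]
results = {s: decimal.fullmatch(s) is not None for s in samples}
```

Only the inputs with a dot followed by at least one digit pass, which is what the fluent description says.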


If you're not writing in the Python interpreter directly, you can do cleverer things in code, such as splitting a complex regex over several lines:

# My python module
import re

from grimace import RE

def is_legal_number(number):
    # Match a US/Canadian phone number. The RE() chain is wrapped in parentheses
    # so that we don't have to escape the ends of lines.
    north_american_number_re = (RE().start()
        .literal('(').followed_by().exactly(3).digits().then().literal(')')
        .then().one().literal("-").then().exactly(3).digits()
        .then().one().dash().followed_by().exactly(4).digits().then().end()
        .as_string())
    number_re = re.compile(north_american_number_re)
    return number_re.match(number) is not None
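
For readers who want to compare, here is a hand-written version using only the re module. The pattern is my assumption of what the fluent chain is meant to describe; the exact string grimace emits may differ, and the function name is_legal_number_plain is hypothetical.

```python
import re

# Hand-written pattern assumed to be equivalent to the fluent chain;
# the exact string grimace produces may differ.
NORTH_AMERICAN_NUMBER = re.compile(r"^\(\d{3}\)-\d{3}-\d{4}$")


def is_legal_number_plain(number):
    """Return True if number looks like (123)-456-7890."""
    return NORTH_AMERICAN_NUMBER.match(number) is not None
```

The fluent form is longer, but each step reads as a description, whereas the terse form relies on remembering that parentheses must be escaped to be matched literally.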


There is more to do: control over greedy matching, and ways to express some of the more complex tricks like backreferences and multiple matching subexpressions. And I'll also package it properly for installation via pip. But for now, it's available and it works.
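
For context, this is what greedy control and backreferences look like in the standard re module today. It is only an illustration of the features mentioned, not a preview of any grimace syntax, and the sample text is invented.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, so the match runs to the last '>'.
greedy = re.search(r"<.+>", text).group(0)

# Non-greedy: .+? stops at the first '>' it can.
lazy = re.search(r"<.+?>", text).group(0)

# Backreference: \1 must repeat whatever the first group captured.
paired = re.search(r"<(\w+)>.*?</\1>", text).group(0)
```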

[1] I can't write any article on regular expressions without quoting Jamie Zawinski: Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use