This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: python4ply tutorial, part 3
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
The following is an excerpt from the
python4ply tutorial.
python4ply is a parser for the Python language, written in Python
using PLY; it uses the 'compiler' module from the standard library
to generate bytecode for the Python virtual machine.
Creating regular expression pattern objects
Regular expressions are fun. The first contact I had with them was
through DOS globbing, where "*.*" matched all files with an extension.
Then I started using Unix, and later Archie, which supported
regular expressions. Hmm, that was in 1990. I read the documentation
for regexps but I didn't understand them. Instead I mentally
translated the glob "?" to "." and the glob "*" to ".*".
Luckily for me I was in college and I took a theory of automata
course. I loved that course. It taught me a lot about how to think
about computers as what they are - glorified state machines.
Other programmers also really like regular expressions, and languages
like Perl, Ruby, and Javascript consider them so important that they
are given syntax-level support. Python is not one of those languages, and
someone coming from Ruby, where you can do
# lines.rb
File.open("python_yacc.py").each do |line|
  if line =~ /def (\w+)/
    puts "#{$1}\n"
  end
end
will probably find the corresponding Python both tedious and (because
of the separation between the pattern definition and use) harder to
read:
# lines.py
import re
pattern = re.compile(r"def (\w+)")
for line in open("python_yacc.py"):
    m = pattern.match(line)
    if m is not None:
        print m.group(1)
This code is generally considered the best practice for Python. It
could be made a bit shorter by using re.match instead of the
precompiled pattern, but at the cost of some performance.
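For comparison, here's the shorter variant just mentioned. re.match compiles the pattern on each call but caches the compiled object internally, so the cost is mostly an extra cache lookup per line (the sample lines are mine, for illustration):

```python
import re

# Shorter: skip the explicit re.compile step and let the re module's
# internal cache handle compilation.
names = []
for line in ["def foo(x):", "x = 1", "def bar():"]:
    m = re.match(r"def (\w+)", line)
    if m is not None:
        names.append(m.group(1))
print(names)
```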
I'll give Perl5 regular expressions (as implemented by the 're'
module) first-class syntax support for creating patterns. That will
shorten the code by getting rid of the "import re" and the
"re.compile()" call. Here's how I want pattern creation to look:
pattern = m/def (\w+)/
This new syntax is vaguely modelled after Perl's. It must start with
"m/" and end with a "/" on the same line. Note that my new syntax
might break existing code because
m=12
a=3
i=2
print m/a/i
is already valid.
The new token definition goes before the t_NAME definition, to prevent
the NAME from matching first. This token returns a 2-tuple of the
regular expression pattern as a string, and the flags to pass to
re.compile. I need to pass it back as basic types and not a pattern
object because the bytecode generation only understands the basic
Python types.
import re

_re_flags = {
    "i": re.IGNORECASE,
    "l": re.LOCALE,
    "m": re.MULTILINE,
    "s": re.DOTALL,
    #"x": re.VERBOSE, # not useful in this context
    "u": re.UNICODE,
}

def t_PATTERN(t):
    r"m/[^/]*/[a-z]*"
    m, pattern, opts = t.value.split("/")
    flags = 0
    for c in opts:
        flag = _re_flags.get(c, None)
        if flag is None:
            # I could pin this down to the specific character position
            raise_syntax_error(
                "unsupported pattern modifier %r" % (c,), t)
        flags |= flag
    # test compile to make sure that it's a valid pattern
    try:
        re.compile(pattern, flags)
    except re.error, err:
        # Sadly, re.error doesn't include the error position
        raise_syntax_error(err.message, t)
    t.value = (pattern, flags)
    return t
# This goes after the strings otherwise r"" is seen as the NAME("r")
def t_NAME(t):
    r"[a-zA-Z_][a-zA-Z0-9_]*"
    t.type = RESERVED.get(t.value, "NAME")
    return t
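The flag handling inside t_PATTERN can be exercised on its own. Here's a standalone sketch with the PLY plumbing stripped out; parse_pattern_token and the plain SyntaxError are stand-ins of mine, not python4ply names:

```python
import re

_re_flags = {"i": re.IGNORECASE, "l": re.LOCALE, "m": re.MULTILINE,
             "s": re.DOTALL, "u": re.UNICODE}

def parse_pattern_token(text):
    # text looks like m/<pattern>/<modifiers>; the token regexp
    # guarantees the pattern itself contains no "/"
    m, pattern, opts = text.split("/")
    flags = 0
    for c in opts:
        if c not in _re_flags:
            raise SyntaxError("unsupported pattern modifier %r" % (c,))
        flags |= _re_flags[c]
    re.compile(pattern, flags)   # fail early on a bad pattern
    return pattern, flags

print(parse_pattern_token(r"m/def (\w+)/i"))
```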
This PATTERN will be a new "atom" at the grammar level, which will
correspond to a call to re.compile("pattern", options).
See how I'm using the impossible variable name '_$re_compile'? That's
going to be "re.compile", and I'll use the same trick I did with the
DECIMAL support and insert the AST corresponding to that call.
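In other words, once the atom is wired up, a pattern literal like m/def (\w+)/i behaves just like the ordinary call it compiles down to:

```python
import re

# What m/def (\w+)/i compiles to: a plain re.compile call with the
# flags resolved at parse time.
pattern = re.compile(r"def (\w+)", re.IGNORECASE)
m = pattern.match("DEF T_PATTERN(T):")
print(m.group(1))
```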
These changes make it easier to define a pattern, but not to use it.
As another example of (fake?) Perl envy, I'm going to support its
"=~" match syntax so that the following is valid:
# count_atoms.py
import time
# Count the number of atoms in a PDB file.
# Lines to match look like:
# ATOM   6312  CB  ALA 3 235   24.681  54.463 137.827  1.00 51.30
# HETATM 6333  CA  MYR 4   1    6.722  54.417  88.584  1.00 50.79
count = 0
t1 = time.time()
for line in open("nucleosome.pdb"):
    if line =~ m/(ATOM |HETATM)/:
        count += 1
print count, "atoms in", time.time()-t1, "seconds"
This turned out to be very simple. I need a new token for "=~". Most
of the simple tokens are defined in "python_tokens.py". I added
"EQUALMATCH" to the list of tokens in the place shown here.
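For reference, PLY declares a simple token as a module-level string named t_&lt;TOKEN&gt; whose value is the token's regular expression. A hypothetical sketch of the entry (the actual python_tokens.py line may differ):

```python
import re

# In PLY, a simple token is just a module-level string named after
# the token; the string is the token's regular expression.
t_EQUALMATCH = r"=~"

# The pattern picks "=~" out of the new syntax:
m = re.search(t_EQUALMATCH, "line =~ m/(ATOM |HETATM)/")
print(m.group(0))
```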
Note that this will break legal existing code, like
>>> a=~2
>>> a
-3
>>>
The lexer doesn't need anything else because I've already defined a
PATTERN token.
I need to decide the precedence level of =~. Is it as strong as "**"
or as weak as "or", or some place in between? I decided to make it as
weak as "or", which is defined by the "test" definition. My new
"p_test_4" function turns "line =~ pattern" into the AST for
"pat.search(line) is not None", which looks like this:
>>> from compiler import parse
>>> parse("pat.search(line) is not None")
Module(None, Stmt([Discard(Compare(CallFunc(Getattr(Name('pat'), 'search'),
[Name('line')], None, None), [('is not', Name('None'))]))]))
>>>
And that's it! Well, I could add an optimization in this case and
move the ".search" outside the loop, but that's an exercise left for
the student.
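For the curious, that hoisting optimization amounts to binding the search method once, before the loop, instead of re-resolving the attribute on every iteration. A sketch with made-up input lines:

```python
import re

# Hoist the attribute lookup: bind pattern.search once outside the
# loop, then call the bound method per line.
pattern = re.compile(r"def (\w+)")
search = pattern.search          # hoisted bound method
lines = ["def foo():", "x = 1", "def bar():"]
count = sum(1 for line in lines if search(line) is not None)
print(count)
```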
Now I'll put a toe into evil, just to see how cold it is. I'm going
to add support for
# get_function_names.py
for line in open("python_yacc.py"):
    if line =~ m/def (\w+)/:
        print repr($1)
That is, if the =~ matches then $1, $2, ... will refer to groups 1,
2, and so on. Oh, and while I'm at it, if there's a named group then
$name will retrieve it. And '$' will be the match object itself.
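Under the hood these $-forms are just calls on the ordinary re match object. Here's the mapping I have in mind, demonstrated with an example pattern of my own:

```python
import re

# "$"     -> the match object itself
# "$1"    -> match.group(1)
# "$name" -> match.group("name")
match = re.search(r"def (?P<name>\w+)", "def t_PATTERN(t):")
print(match.group(0))        # the whole match
print(match.group(1))        # what "$1" would mean
print(match.group("name"))   # what "$name" would mean
```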
To make it work I need some way to do assignment in the expression.
Python doesn't really support that except through a hack I don't want
to use, so I'll use another hack and change the bytecode generation
stage.
I created a new AST node called "AssignExpr" which is like an "Assign"
node except that it can be used in an expression. The compiler module
doesn't know about it and it's hard to change the code through
subclassing, so I patch the compiler and its bytecode generation code
so it understands the new node type. These changes are in
"compiler_patches.py" and the patches are done when the module is
imported. Take a look at the module if you want to see what it does.
It doesn't escape my notice that with AssignExpr only a handful of
lines are needed to support assignment in an expression, like
if line = readline():
    print repr(line)
Before you do that yourself, read the Python FAQ
for why Python doesn't support this.
To support the new pattern match syntax I need to make two changes to
python_yacc.py. The first is to import the monkeypatch module:
import compiler_patches
then make the changes to the p_test_4 function to save the match
object to the variable "$".
Does it work? Try this program, which is based on the Ruby code from
the start of this tutorial section, oh so long ago.
# get_function_names.py
for line in open("python_yacc.py"):
    if line =~ m/def (\w+)/:
        # I don't yet have syntax support to get to the special '$'
        # variable so I have to get it from the globals dictionary.
        print repr(globals()["$"].group(1))