This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: python4ply tutorial
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
The following is an excerpt from the
python4ply tutorial.
python4ply is a Python parser for the Python language using PLY and
the 'compiler' module from the standard library to parse Python code
and generate bytecode for the Python virtual machine.
What is python4ply?
python4ply is a Python parser for the Python language. The grammar
definition uses PLY, a parser
system for Python modelled on yacc/lex. The parser rules use the "compiler"
module from the standard library to build a Python AST and to generate
byte code for .pyc files.
You might use python4ply to experiment with variations in the Python
language. The PLY-based lexer and parser are much easier to change
than the C implementation Python itself uses or even the ones written
in Python which are part of the standard library. This tutorial walks
through examples of how to make changes in different levels of the
system.
If you only want access to Python's normal AST, which includes line
numbers and byte position for the code fragments, you should use the
_ast module.
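As a rough illustration of what the standard AST gives you, here is a minimal sketch using the public "ast" module, which in current Python versions re-exports the node classes from _ast. The source string and the "owe_me.py" filename are assumptions made up for this example, and the helper name node_positions is hypothetical.

```python
import ast

# Assumed example source; the filename is only used in error messages.
source = "amount = 10000000\nprint('You owe me', amount, 'dollars')\n"
tree = ast.parse(source, filename="owe_me.py")


def node_positions(tree):
    """Return (node type, line number, column offset) for every positioned node."""
    positions = []
    for node in ast.walk(tree):
        # Only statement and expression nodes carry position attributes.
        if hasattr(node, "lineno"):
            positions.append((type(node).__name__, node.lineno, node.col_offset))
    return positions


positions = node_positions(tree)
for name, lineno, col in positions:
    print(name, lineno, col)
```

Line numbers are 1-based and column offsets are 0-based, so the assignment on the first line is reported at line 1, column 0.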
Reminiscing, fabrications, and warnings
A long time ago I had a class assignment to develop a GUI
using drawpoint and drawtext primitives only. Everything - buttons,
text displays, even the mouse pointer itself - was built on those
primitives. It gave the strange feeling of knowing that GUIs are
completely and utterly fake. There's no there there, and it's only
through a lot of effort that it feels real. Those that aren't as old
and grizzled as I am might get the same feeling with modern web GUIs.
Those fancy sliders and cool UI effects are built on divs and spans
and CSS and a lot of hard work. They aren't really there.
This package gives you the same feeling about Python. It contains a
Python grammar definition for the PLY parser. The file python_lex.py
is the tokenizer, along with some code to synthesize the INDENT,
DEDENT and ENDMARKER tags. The file python_yacc.py is the parser.
The result is an AST compatible with that from the compiler module,
which you can use to generate Python byte code (".pyc" files).
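To see what the synthesized INDENT, DEDENT and ENDMARKER tokens look like, the sketch below uses the standard library's "tokenize" module rather than python_lex.py. It is only an analogy: the token names match, but nothing here is claimed about how python_lex.py produces them. The two-line source string and the helper name token_names are made up for the example.

```python
import io
import tokenize

# Assumed two-line example with one indented block.
source = "if amount:\n    print(amount)\n"


def token_names(source):
    """Return the token type names the standard tokenizer emits for source."""
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type]
            for tok in tokenize.generate_tokens(readline)]


names = token_names(source)
print(names)
```

None of INDENT, DEDENT or ENDMARKER corresponds to a visible character sequence the way a NAME or a NUMBER does. The tokenizer infers them from changes in leading whitespace and from the end of input, which is why a lexer needs extra code to synthesize them.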
There's also a python_grammer.py file which makes a nearly useless
concrete syntax tree. This parser was created by grammar_to_ply.py,
which converts the Python "Grammar" definition into a form that PLY
can more easily understand. I keep it around to make sure that the
rules in python_yacc.py stay correct. You might also find it useful
if you want to port the grammar directly to yacc or some similar
parser system.
What this means is this package gives you, if you put work into it,
the ability to create a Python variant that works on the Python VM, or
if you put a lot of work into it (like the Jython, PyPy, and
IronPython developers), a first step into making your own Python
implementation.
If you think this sounds like a great idea, you're probably wrong.
Down this path lies madness. Making a new language isn't just a
matter of adding a new feature. The parts go together in subtle ways,
and if you tweak the language and someone else tweaks the language a
different way, then you quickly stop being able to talk to each other.
Lisp programmers are probably thinking now that this is just a
half-formed macro system for Python. They are right. Once you have an
AST you can manipulate it in all sorts of ways. But many experienced
Lisp programmers will caution against the siren call of macros. Don't
make a new language unless you know what dangerous waters you can get
into.
On the other hand, it's a lot of fun. Someone has to make the new cool
language for the future so you've got to practice somewhere. And
there are a few times when changing things at the AST or code
generation levels might make good sense.
To byte-compile a Python file (owe_me.py in the examples below), use the
provided "compile.py" script. It is similar to "py_compile.py" from the
standard library.
% python compile.py owe_me.py
Compiling 'owe_me.py'
% ls -l owe_me.pyc
-rw-r--r-- 1 dalke staff 165 Feb 17 19:21 owe_me.pyc
%
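For comparison, this is roughly what the standard library counterpart looks like when called from Python. It is a sketch of py_compile, not of python4ply's compile.py. The contents of owe_me.py are an assumption reconstructed from the output shown in this tutorial, and the code is written for Python 3, where .pyc files normally land in a __pycache__ directory unless "cfile" is given. The helper name byte_compile is hypothetical.

```python
import os
import py_compile
import tempfile

# Assumed contents of owe_me.py, reconstructed from the output shown in the text.
SOURCE = 'amount = 10000000\nprint("You owe me", amount, "dollars")\n'


def byte_compile(source_text, name="owe_me.py"):
    """Write source_text into a temporary directory and byte-compile it.

    Returns the path of the generated .pyc file.
    """
    directory = tempfile.mkdtemp()
    source_path = os.path.join(directory, name)
    with open(source_path, "w") as handle:
        handle.write(source_text)
    # cfile puts the .pyc next to the source; doraise turns compile
    # errors into exceptions instead of messages on stderr.
    return py_compile.compile(source_path, cfile=source_path + "c", doraise=True)


pyc_path = byte_compile(SOURCE)
print("Compiled", pyc_path, os.path.getsize(pyc_path), "bytes")
```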
Running this is a bit tricky because the .pyc file is only used when
the file is imported as a module. The easiest way around that is to
import the module via a command-line call.
% python -c 'import owe_me'
You owe me 10000000 dollars
%
(I thought it would be best to use the '-m' option but that seems to
import the .py file before the .pyc file. Hmm, I should check into
that some more.)
If you want to prove that it's using the .pyc generated by this
"compile.py", try renaming the file
The compile module also supports a '-e' mode, which executes the file
after byte compiling it, instead of saving the byte compiled form to a
file.
% python compile.py -e owe_me.py
You owe me 10000000 dollars
%
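The idea behind a compile-then-execute mode can be sketched with the built-in compile() and exec() functions. This shows the general technique using Python's own compiler, not how compile.py implements '-e'. The source text is the same assumed owe_me.py content inferred from the output above, the helper name compile_and_run is hypothetical, and the code is written for Python 3.

```python
import contextlib
import io

# Assumed contents of owe_me.py, inferred from the output shown in the text.
SOURCE = 'amount = 10000000\nprint("You owe me", amount, "dollars")\n'


def compile_and_run(source_text, filename="owe_me.py"):
    """Byte-compile source_text in memory, execute it, and return what it printed."""
    code = compile(source_text, filename, "exec")
    namespace = {"__name__": "__main__"}
    captured = io.StringIO()
    # Capture stdout so the result can be inspected instead of only displayed.
    with contextlib.redirect_stdout(captured):
        exec(code, namespace)
    return captured.getvalue()


output = compile_and_run(SOURCE)
print(output, end="")
```

No .pyc file is written here. The code object lives only in memory, which is the main difference from the file-producing mode.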
Numbers like 1_000_000 - changing the lexer
Reading "10000000" is tricky, at least for humans. Is that 1 million
or 10 million? You might be envious of Perl, which supports using "_"
as a separator in a number:
% perl
$amount = 10_000_000;
print "You owe me $amount\n";
^D
You owe me 10000000
%
You can change the python4ply grammar to support that. The
tokenization pattern for base-10 numbers is in python_lex.py in the
function "t_DEC_NUMBER":
def t_DEC_NUMBER(t):
    r'[1-9][0-9]*[lL]?'
    t.type = "NUMBER"
    value = t.value
    if value[-1] in "lL":
        value = value[:-1]
        f = long
    else:
        f = int
    t.value = (f(value, 10), t.value)
    return t
Why do I return the 2-tuple of (integer value, original string) in
t.value? The python_yacc.py code contains commented out code where
I'm experimenting with keeping track of the start and end character
positions for each token and expression. PLY by default only tracks
the start position, so I use the string length to get the end
position. I'm also theorizing that it will prove useful for those
doing round-trip conversions and want to keep the number in its
original presentation.
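The end-position trick is easy to show without PLY. The sketch below uses a plain regular expression in place of the lexer and, as a simplifying assumption, leaves out the "l"/"L" suffix. The function name number_tokens and the sample input are hypothetical.

```python
import re

# Simplified stand-in for the lexer rule: base-10 integers, no long suffix.
DEC_NUMBER = re.compile(r"[1-9][0-9]*")


def number_tokens(text):
    """Return ((value, original), start, end) for each decimal number in text."""
    tokens = []
    for match in DEC_NUMBER.finditer(text):
        original = match.group()
        value = (int(original, 10), original)
        start = match.start()
        # Only the start position is tracked; the end position is
        # recovered from the length of the original string.
        end = start + len(value[1])
        tokens.append((value, start, end))
    return tokens


tokens = number_tokens("amount = 10000000 + 25")
for token in tokens:
    print(token)
```

Keeping the original string also matters once the text and the value can differ in length, as they do when a number is written with separators.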
Okay, so change the pattern to allow "_" as a character after the
first digit, like this:
r'[1-9][0-9_]*[lL]?'
then modify the action to remove the underscore character. The new
definition is:
def t_DEC_NUMBER(t):
    r'[1-9][0-9_]*[lL]?'
    t.type = "NUMBER"
    # Drop the separators before converting to a number
    value = t.value.replace("_", "")
    if value[-1] in "lL":
        value = value[:-1]
        f = long
    else:
        f = int
    t.value = (f(value, 10), t.value)
    return t
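The pattern can be tried out on its own before touching the lexer. This is a minimal sketch that applies the same regular expression and the same underscore removal with the "re" module. The function name parse_dec_number is hypothetical, and because the sketch targets Python 3, which has no separate long type, it simply strips the "l"/"L" suffix.

```python
import re

# Same pattern as the modified lexer rule.
DEC_NUMBER = re.compile(r"[1-9][0-9_]*[lL]?")


def parse_dec_number(text):
    """Return (integer value, original text), or None if text does not match."""
    match = DEC_NUMBER.fullmatch(text)
    if match is None:
        return None
    original = match.group()
    digits = original.replace("_", "")
    if digits[-1] in "lL":
        digits = digits[:-1]  # Python 3 ints are unbounded, so drop the suffix
    return (int(digits, 10), original)


for text in ["10_000_000", "1_0__0", "10_", "_10", "0_1"]:
    print(repr(text), "->", parse_dec_number(text))
```

Note how permissive the pattern is. It accepts doubled underscores and a trailing underscore, and only rejects input that does not start with a digit from 1 to 9. A stricter rule would need a tighter pattern.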
To see if it worked I changed owe_me.py to use underscores, and I
changed the value to prove that I'm using the new file instead of some
copy of the old