This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: python4ply tutorial, part 2
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
The following is an excerpt from the
python4ply tutorial.
python4ply is a Python parser for the Python language using PLY and
the 'compiler' module from the standard library to parse Python code
and generate bytecode for the Python virtual machine.
Syntax support for decimal numbers
How about something more complicated? Python's "decimal" module
provides a base-10 floating point numeric type, which is especially
useful for those dealing with money. Here's an obvious limitation of
doing base 10 calculations in base 2. I stole it from the decimal
documentation.
>>> 1.0 % 0.1
0.09999999999999995
>>> import decimal
>>> d = decimal.Decimal("1.0")
>>> d
Decimal("1.0")
>>> d / decimal.Decimal("0.1")
Decimal("10")
>>>
The normal way to create a decimal number is to "import decimal" then
use "decimal.Decimal". I'm going to add grammar-level support so that
"0d12.3" is the same as decimal.Decimal("12.3"). There are a few
complications so I'll walk you through how to do this.
I need a new DECIMAL token type that matches "0[dD][0-9]+(\.[0-9]+)?".
This allows "0d1.23", "0D1" and "0d0.89" but not "0d.2", and not "0d6."
with its trailing dot. Feel free to change that if you want. Bear in
mind possible ambiguities: does "0d1.x" mean the valid "Decimal('1').x"
or the syntax error "Decimal('1.') x"? What about "0d1..sqrt()"?
Designing a new programming language really means having to pay
attention to nuances like this.
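A quick way to check which spellings the pattern accepts, and how the two ambiguous cases split, is to try it with the standard "re" module. This is only a sketch: the helper name "lex_decimal" is hypothetical and not part of python4ply, and PLY applies the pattern as one alternative inside a larger combined pattern rather than on its own.

```python
import re

# The pattern from the text. match() anchors at the start of the input,
# which is how a lexer would try it at the current position.
DECIMAL_PATTERN = re.compile(r"0[dD][0-9]+(\.[0-9]+)?")

def lex_decimal(text):
    """Return (matched_prefix, remaining_text), or None if no DECIMAL starts here.

    Hypothetical helper, for illustration only.
    """
    m = DECIMAL_PATTERN.match(text)
    if m is None:
        return None
    return m.group(0), text[m.end():]

samples = ["0d1.23", "0D1", "0d0.89", "0d.2", "0d6.", "0d1.x", "0d1..sqrt()"]
for sample in samples:
    print("%-12s -> %r" % (sample, lex_decimal(sample)))
```

With this pattern a dot is only absorbed when digits follow it, so "0d1.x" splits into the token "0d1" followed by ".x", and "0d1..sqrt()" leaves two dots behind for the parser to reject.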
The DECIMAL rule is simple, in part because limitations on what can be
saved in the byte code mean the creation of the decimal object must be
deferred until later. Just like with the t_BIN_NUMBER rule, this new
t_DECIMAL rule must go before t_OCT_NUMBER so there's no confusion.
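The rule itself is not reproduced in this excerpt. As a minimal sketch, assuming the rule just strips the "0d" prefix and leaves the digits as a string, it could look like the t_DECIMAL below. The octal pattern, the "Token" class and the "first_token" helper are stand-ins invented for this demo so that it runs without PLY installed; they only imitate the way PLY tries function rules in the order they are defined.

```python
import re

def t_DECIMAL(t):
    r"0[dD][0-9]+(\.[0-9]+)?"
    # Assumption: keep only the digits, e.g. "0d12.3" -> "12.3". The Decimal
    # object itself is created later, when the generated code runs.
    t.value = t.value[2:]
    return t

def t_OCT_NUMBER(t):
    r"0[0-7]*"
    # Assumed pattern, present only to show the ordering problem.
    return t

class Token(object):
    """Minimal stand-in for a PLY token (hypothetical, for this demo only)."""
    def __init__(self, kind, value):
        self.type = kind
        self.value = value

def first_token(rules, text):
    """Try each rule in order and return the first token that matches."""
    for rule in rules:
        m = re.match(rule.__doc__, text)
        if m:
            return rule(Token(rule.__name__[2:], m.group(0)))
    return None

good = first_token([t_DECIMAL, t_OCT_NUMBER], "0d1.0")
bad = first_token([t_OCT_NUMBER, t_DECIMAL], "0d1.0")
print(good.type, good.value)
print(bad.type, bad.value)
```

With the octal rule first, the leading "0" is consumed as an octal number and the rest of the literal is left over, which is the confusion the ordering avoids.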
% python compile.py -e div.py
Traceback (most recent call last):
  File "compile.py", line 76, in <module>
    execfile(args[0])
  File "compile.py", line 43, in execfile
    tree = python_yacc.parse(text, source_filename)
  File "/Users/dalke/src/python4ply-1.0/python_yacc.py", line 2607, in parse
    parse_tree = parser.parse(source, lexer=lexer)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/ply/yacc.py", line 237, in parse
    lookahead = get_token() # Get the next token
  File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 657, in token
    x = self.token_stream.next()
  File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 609, in add_endmarker
    for tok in token_stream:
  File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 534, in synthesize_indentation_tokens
    for token in token_stream:
  File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 493, in annotate_indentation_state
    for token in token_stream:
  File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 435, in create_strings
    for tok in token_stream:
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/ply/lex.py", line 305, in token
    func.__name__, newtok.type),lexdata[lexpos:])
ply.lex.LexError: /Users/dalke/src/python4ply-1.0/python_lex.py:203: Rule 't_DECIMAL' returned an unknown token type 'DECIMAL'
The list of known token type names is given in the 'tokens' variable,
defined at the top of python_lex.py. I'll add "DECIMAL" to the list.
With that change I get a new error message. Whoopie for me!
% python compile.py -e div.py
Traceback (most recent call last):
  File "compile.py", line 76, in <module>
    execfile(args[0])
  File "compile.py", line 43, in execfile
    tree = python_yacc.parse(text, source_filename)
  File "/Users/dalke/src/python4ply-1.0/python_yacc.py", line 2607, in parse
    parse_tree = parser.parse(source, lexer=lexer)
  File "/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/site-packages/ply/yacc.py", line 346, in parse
    tok = self.errorfunc(errtoken)
  File "/Users/dalke/src/python4ply-1.0/python_yacc.py", line 2488, in p_error
    python_lex.raise_syntax_error("invalid syntax", t)
  File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 27, in raise_syntax_error
    _raise_error(message, t, SyntaxError)
  File "/Users/dalke/src/python4ply-1.0/python_lex.py", line 24, in _raise_error
    raise klass(message, (filename, lineno, offset+1, text))
File "div.py", line 3
print "decimal", 0d1.0 % 0d0.1
^
SyntaxError: invalid syntax
That's because the parser doesn't know what to do with a DECIMAL.
What do you think it should do? The ast.Const node only takes a
string or a built-in numeric value. It doesn't take general Python
objects because those can't be marshalled into bytecode.
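The marshalling limit is easy to see directly with the standard "marshal" module, which is what serializes the constants stored in a code object. A small sketch (the variable name "decimal_is_marshallable" is just for this demo):

```python
import decimal
import marshal

# Strings and built-in numbers can be marshalled, so they can be stored
# as constants in byte code.
for value in ["1.0", 1.0, 10]:
    marshal.dumps(value)

# A Decimal instance is a general Python object; marshal refuses it.
try:
    marshal.dumps(decimal.Decimal("1.0"))
    decimal_is_marshallable = True
except ValueError:
    decimal_is_marshallable = False

print(decimal_is_marshallable)
```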
I'll wait a moment for you to think about it.
Thought enough? No? Okay, just a moment more.
This new token should correspond to making a new Decimal object at
that point. You might think you could be more clever than that and
create the decimals during module imports, like I will do for the
regular expression definitions coming later on in this tutorial. That
would make the object creation occur only once, instead of once for
each function call or for every time through a loop. But a decimal
object depends on a global/thread-local context, and if I move the
decimal creation then I might create it in the wrong context.
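To make the context dependence concrete, here is a small sketch in which the same operands give different results under different contexts. The precisions 4 and 10 are arbitrary values picked for the demo.

```python
import decimal

one = decimal.Decimal("1")
three = decimal.Decimal("3")

# Same expression, evaluated under two different thread-local contexts.
with decimal.localcontext() as ctx:
    ctx.prec = 4
    low = one / three

with decimal.localcontext() as ctx:
    ctx.prec = 10
    high = one / three

print(low)
print(high)
```

The effect is strongest for arithmetic. Building a Decimal from a well-formed string keeps every digit whatever the precision is, but the active context still decides how a malformed string is signalled, so the safest place to create the object is where the literal appears.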
To make my life easier, I'm going to import the Decimal class as the
super seekret module variable "_$Decimal". This is a variable name
that can't occur in normal Python (because of the "$") and which is
hidden from "... import *" statements (because of the leading "_").
That way the object creation is mostly a matter of calling
"_$Decimal(s)" in the right place, which I can only do by constructing
the AST myself.
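Both properties of the name can be checked from ordinary Python. In this sketch the module name "decimal_demo" and the attribute "visible" are made up for the demonstration.

```python
import decimal
import sys
import types

# 1. "_$Decimal" cannot be written in ordinary source code.
try:
    compile("_$Decimal", "<demo>", "eval")
    compiles_from_source = True
except SyntaxError:
    compiles_from_source = False

# 2. A module namespace is a dictionary keyed by strings, so the name can
#    still be stored there.
demo_module = types.ModuleType("decimal_demo")
setattr(demo_module, "_$Decimal", decimal.Decimal)
demo_module.visible = 42
sys.modules["decimal_demo"] = demo_module

# 3. "import *" skips names that start with an underscore.
namespace = {}
exec("from decimal_demo import *", namespace)

print(compiles_from_source)
print("visible" in namespace)
print("_$Decimal" in namespace)
```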
What will that look like? I'll use the compiler package to show what
that AST should look like:
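The "compiler" package was removed in Python 3, so as a stand-in here is a sketch using the standard "ast" module, which plays a similar role. This is a swapped-in technique, not what python4ply does: the node names differ from the old compiler.ast ones (Call rather than CallFunc, Constant rather than Const), and the dump text varies between Python versions, so no output is shown.

```python
import ast
import decimal

# What an ordinary call to a function named Decimal parses to.
parsed = ast.parse("Decimal('12.3')", mode="eval")
print(ast.dump(parsed))

# The same shape built by hand, but calling the unspellable name "_$Decimal".
call = ast.Call(
    func=ast.Name(id="_$Decimal", ctx=ast.Load()),
    args=[ast.Constant(value="12.3")],
    keywords=[],
)
tree = ast.fix_missing_locations(ast.Expression(body=call))

# The name is looked up in the globals dictionary at run time, so supplying
# it there is enough for the call to succeed.
code = compile(tree, "<decimal-demo>", "eval")
result = eval(code, {"_$Decimal": decimal.Decimal})
print(repr(result))
```

Building the node by hand is what makes the "$" name possible, since no parser would ever produce it from source text.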
The existing rule definitions are simple because the AST nodes are
designed for Python. Nearly every token type and statement type maps
directly to an AST node. The "locate" function assigns a line number
to each created node, and you can see some of my experimental work
also assigns a start and end byte location.
Here's the new definition for DECIMAL, which is a bit more complex
because I need to call _$Decimal. Remember that I can't simply use an
ast.Const containing a decimal.Decimal because the byte code
generation only supports strings and numbers.
At this point running the code should fail because _$Decimal doesn't
exist.
% python compile.py -e div.py
yacc: Warning. Token 'WS' defined, but not used.
yacc: Warning. Token 'STRING_START_SINGLE' defined, but not used.
yacc: Warning. Token 'STRING_START_TRIPLE' defined, but not used.
yacc: Warning. Token 'STRING_CONTINUE' defined, but not used.
yacc: Warning. Token 'STRING_END' defined, but not used.
/Users/dalke/src/python4ply-1.0/python_yacc.py:2473: Warning. Rule 'encoding_decl' defined, but not used.
yacc: Warning. There are 5 unused tokens.
yacc: Warning. There is 1 unused rule.
yacc: Symbol 'encoding_decl' is unreachable.
yacc: Generating LALR parsing table...
float 0.1
decimal
Traceback (most recent call last):
  File "compile.py", line 76, in <module>
    execfile(args[0])
  File "compile.py", line 48, in execfile
    exec code in mod.__dict__
  File "div.py", line 3, in <module>
    print "decimal", 0d1.0 % 0d0.1
NameError: name '_$Decimal' is not defined
Why are the 'yacc:' messages there? PLY uses a cached parsing table
for better performance. When it notices a change in the grammar it
invalidates the cache and rebuilds the table based on the new grammar.
What you're seeing here are the messages from the rebuild.
Why is the exception there? Because the function call uses _$Decimal
but that name doesn't exist. Why does it report line 3 even though I
only assigned a line number to the ast.CallFunc and not the ast.Name,
which is what actually failed? Because the AST generation code in the
compiler module doesn't always assign line numbers so the byte code
generation step assumes it's the same as the line number for the
previously generated instruction.
For extra credit, why does the following report the error on line 3
instead of line 1?
def p_atom_12(p):
    "atom : DECIMAL"
    decimal_string = p[1]
    p[0] = ast.CallFunc(ast.Name("_$Decimal"),
                        [ast.Const(decimal_string)], None, None)
    locate(p[0], 1)  # Why doesn't this report the error on line 1?
The last bit of magic is to import the Decimal constructor correctly.
The root term in the Python grammar is "file_input". (There's another
root if you're doing an 'eval'.) It has two cases: one for an empty
file and the other for a file that contains statements. The code as distributed
looks like this:
By definition the empty file can't have any decimal literals in it
so I'll only worry about p_file_input_2. But I won't worry much. For
instance, for now I won't worry that the file can contain __future__
statements. These must go before any statement other than the doc
string. (If you really want to worry about that then feel free to
worry. And also worry that in older Pythons "as" and "with" were not
reserved words.)
I'll insert the new import statement as the first statement in the
created module.
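The python4ply code for this step is not shown in this excerpt. As a rough illustration of the same idea with the modern standard "ast" module (a substitute technique, not the tutorial's own p_file_input_2), the sketch below puts the hidden import in front of everything else in a module. "PLACEHOLDER" is a made-up name standing in for the "0d12.3" literal, which an ordinary Python parser cannot read. Like the text, it ignores doc strings and __future__ statements.

```python
import ast
import decimal

# PLACEHOLDER stands in for the 0d12.3 literal.
source = "value = PLACEHOLDER('12.3')\n"
tree = ast.parse(source)

# Point the call at the hidden name, as the DECIMAL rule would have done.
for node in ast.walk(tree):
    if isinstance(node, ast.Name) and node.id == "PLACEHOLDER":
        node.id = "_$Decimal"

# Equivalent of "from decimal import Decimal as _$Decimal", inserted as the
# first statement of the module.
hidden_import = ast.ImportFrom(
    module="decimal",
    names=[ast.alias(name="Decimal", asname="_$Decimal")],
    level=0,
)
tree.body.insert(0, hidden_import)
ast.fix_missing_locations(tree)

namespace = {}
exec(compile(tree, "<decimal-demo>", "exec"), namespace)
print(repr(namespace["value"]))
```

Because the import is an ordinary statement in the module body, it runs once when the module is loaded, while each decimal literal still creates its object at the point where it is evaluated.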