This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: ANTLR rules
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
Previously I showed how to use ANTLR to build an AST from a molecular
formula, then evaluate that AST to calculate the molecular weight. For complex
grammars it's often useful to work with and transform parse trees,
which I'll probably talk about when I get into developing a SMARTS
grammar.
For doing molecular weight calculations though, there's no reason to
generate an intermediate AST. I can calculate the weight during the
parsing by using action rules. Here's an example of using actions in
lexer and parser rules to print something out.
grammar MolecularFormulaWithPrint;

options {
    language=Python;
}

parse_formula : species* EOF;

species
    : ATOM DIGITS? {
print "Species defined", $ATOM.text,
# // My first use of Python's new (in 2.5) ternary operator
print $DIGITS.text if $DIGITS else "default=1" }
    ;

ATOM
    : 'H' { print "H = 1.00794" }
    | 'C' { print "C = 12.001" }
    // Added 'Cl' to see how that interacts with 'C'
    | 'Cl' { print "Cl = 35.453" }
    | 'O' { print "O = 15.999" }
    | 'S' { print "S = 32.06" }
    ;

// I need a local variable name so the rule can refer to the match
DIGITS : count='0' .. '9'+ {print " repeat", $count};

Some notes about this grammar. ANTLR does some parsing of the code
inside of an action block so while you can use '#' for a Python
comment, it interpreted the apostrophe in "Python's" as the start of a
string. To work around that I added the leading '//' so ANTLR really
thought it was a comment.
I added "Cl" as a possible atom type (it wasn't in the previous code)
because I wanted to see how the lexer handles terms with a common
prefix. You can see how in the syntax diagram:
[Figure: syntax diagram for the ATOM rule]
and in the generated lexer:
[Listing: generated lexer code for the ATOM rule]
Man! That's going to be some slow code when I get around to doing
timings.
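One way to picture how a lexer can tell 'C' from 'Cl' is a single character of lookahead: after seeing 'C', peek at the next character and take the longer match if it is 'l'. The sketch below is hand-written Python, not the code ANTLR generates, and the name match_atom is made up for illustration.

```python
# Hand-written illustration only; this is not ANTLR's generated lexer.

def match_atom(text, pos):
    """Return the atom symbol starting at text[pos], preferring 'Cl' over 'C'."""
    ch = text[pos:pos + 1]
    if ch == "C":
        # One character of lookahead decides between 'C' and 'Cl'.
        if text[pos + 1:pos + 2] == "l":
            return "Cl"
        return "C"
    if ch in ("H", "O", "S"):
        return ch
    raise ValueError("no atom at position %d of %r" % (pos, text))

print(match_atom("CCl4", 0))  # the first 'C' is followed by 'C', so it stands alone
print(match_atom("CCl4", 1))  # the second 'C' is followed by 'l'
```

The slices text[pos:pos + 1] and text[pos + 1:pos + 2] return an empty string at the end of the input, so a trailing 'C' needs no special case.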
I'm also showing off the new ternary operator in Python 2.5. For the
record, I'm against it, but because it's present I need to learn when
it's appropriate to use, and I think this is one such case.
print $DIGITS.text if $DIGITS else "default=1"
is the same as
if $DIGITS:
    print $DIGITS.text
else:
    print "default=1"
The DIGITS term is optional, and if it's not present then that
associated variable in Python is None. What this test does is print
the count number if it's present, otherwise prints "default=1",
because 1 is the default count if not explicitly given.
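Outside of a grammar, the same pattern looks like this. It is a small sketch in plain Python where a string or None stands in for the optional DIGITS token; the function name describe_count is hypothetical.

```python
def describe_count(digits_text):
    """digits_text is the matched digits as a string, or None if absent."""
    # Conditional expression: <value if true> if <test> else <value if false>
    return digits_text if digits_text is not None else "default=1"

print(describe_count("3"))
print(describe_count(None))
```

The sketch tests "is not None" explicitly, not plain truthiness, so that an empty string would not be mistaken for a missing token.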
Continuing on to using the new grammar, my driver code is pretty
simple, because I'm not really doing anything except setup and
requesting the parse:
import sys
import antlr3
from MolecularFormulaWithPrintParser import MolecularFormulaWithPrintParser
from MolecularFormulaWithPrintLexer import MolecularFormulaWithPrintLexer

formula = "CH3COOH"
if len(sys.argv) > 1:
    formula = sys.argv[1]

char_stream = antlr3.ANTLRStringStream(formula)
lexer = MolecularFormulaWithPrintLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = MolecularFormulaWithPrintParser(tokens)
parser.parse_formula()

which with the formula "H2SO4" gives:
H = 1.00794
repeat 3
S = 32.06
O = 15.999
repeat 4
Species defined H 3
Species defined S default=1
Species defined O 4
You can see that the lexer actions are executed, at least for this
case, before the parser actions.
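That ordering is what you would expect if the token stream reads every token before the parser consumes any. That is an assumption here, not something checked against the ANTLR runtime. A minimal hand-written sketch of such a two-phase arrangement, with made-up names lex, parse and log:

```python
import re

log = []

def lex(formula):
    """Tokenise the whole formula up front, logging each token as it is made."""
    tokens = []
    # re.findall skips characters it cannot match; good enough for a sketch.
    for symbol, digits in re.findall(r"(Cl|H|C|O|S)(\d*)", formula):
        log.append("lexer: ATOM %s" % symbol)
        tokens.append(("ATOM", symbol))
        if digits:
            log.append("lexer: DIGITS %s" % digits)
            tokens.append(("DIGITS", digits))
    return tokens

def parse(tokens):
    """Walk the already-buffered tokens, logging one line per species."""
    i = 0
    while i < len(tokens):
        symbol = tokens[i][1]
        i += 1
        count = "default=1"
        if i < len(tokens) and tokens[i][0] == "DIGITS":
            count = tokens[i][1]
            i += 1
        log.append("parser: species %s %s" % (symbol, count))

parse(lex("H2SO4"))
print("\n".join(log))
```

Because lex() finishes before parse() starts, every "lexer:" line is logged ahead of every "parser:" line.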
Parser rules can return something
A lexer rule always returns a Token. A parser rule by default returns
a Tree, but I can have it return something else. In this case I want the
atom parser to return the molecular weight rather than the atomic
symbol. (I don't need to do that. I could use a table lookup on the
symbol to get the molecular weight. But the parser already knows
which atom it parsed so it feels needless to do that lookup again. As
a consequence, the parser loses track of the token location, but there
are ways to handle that if needed.)
I need to turn the "ATOM" lexer rule into an "atom" parser rule. In
ANTLR, lexer rule names start with an uppercase letter and parser rule
names start with a lowercase letter, so the conversion is pretty easy
in this case - change the case of the name. It works here because the
pattern in the rule is a string. In
general that doesn't work. For example, I changed DIGITS to a parser
rule and got these warning messages:
warning(200): MWGrammar.g:10:9: Decision can match input such as "'C'"
using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
I don't know what that means, but I decided not to worry much about
it. My general rule will be to keep things in the lexer, because I
understand lexers a lot better than grammars.
I declared that the 'atom' rule sets a 'weight'. The 'float' is
needed because ANTLR supports languages like Java and C++ which need
to know the data type of the value returned. The 'weight' is how
other rules, like 'species', can get the new value, in this case via
$atom.weight. In general an ANTLR rule can declare that it returns
multiple values.
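To make the idea concrete without ANTLR, here is a hand-written Python stand-in for a rule that returns a weight. It is only a sketch: the function name atom and its (weight, next position) return shape are choices made for this illustration, not what ANTLR generates. The weights are the ones used in this post.

```python
def atom(text, pos):
    """Match one atom at text[pos]; return (weight, next_pos).

    Each (symbol, weight) pair plays the role of one alternative in the
    rule, so the weight is known as soon as the symbol is matched.
    """
    # 'Cl' is listed before 'C' so that the longer symbol wins.
    alternatives = (("Cl", 35.453), ("H", 1.00794), ("C", 12.001),
                    ("O", 15.999), ("S", 32.06))
    for symbol, weight in alternatives:
        if text.startswith(symbol, pos):
            return weight, pos + len(symbol)
    raise ValueError("expected an atom at position %d of %r" % (pos, text))

# The caller reads the returned weight, much as 'species' reads $atom.weight.
weight, next_pos = atom("Cl2", 0)
print("weight=%s next_pos=%s" % (weight, next_pos))
```

Returning a tuple is also a rough analogue of a rule that declares more than one return value.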
Using return values from a parser rule
Computing the total molecular weight for a species is very simple.
The only difference in the following is the 'species' rule:
Next I'll make "species" return a value, a float named
"species_weight". But how do I access it inside of parse_formula?
The definition is
parse_formula : species* EOF;
so how do I get an action executed once for every time it matches? The
answer is very elegant. I can attach an action to part of the
expression. Putting the action inside the parentheses, in the form
(species { ...action... })*
will execute the action for each 'species' that matches. That action
is included in the "*" so the match and action are done 0 or more
times. The new grammar is:
The last step is to sum each of the species weights into a total molecular
weight and return that sum. I'm going to rename "parse_formula" to
"calculate_mw" and have it return a "mw", so the rule becomes
Ahh, the default value of 'mw' is None, and I want it to be 0.0. I
want to set the value before any of the other actions run, which I can
do with an "@init" action. That's a special directive to ANTLR.
There's also '@after' for adding code after all of the rule code.
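The need for the initial value is easy to see in plain Python: adding a float to None raises a TypeError. The following is a hand-written sketch of the same accumulation, not the generated parser. The regular expression, the ATOM_WEIGHTS table and the helper species_weight are assumptions made for this illustration.

```python
import re

# Weights as used in this post.
ATOM_WEIGHTS = {"H": 1.00794, "C": 12.001, "Cl": 35.453, "O": 15.999, "S": 32.06}
# 'Cl' comes before 'C' in the alternation so the longer symbol wins.
SPECIES = re.compile(r"(Cl|H|C|O|S)(\d*)")

def species_weight(symbol, digits):
    """Weight of one species: atom weight times count (1 if no digits)."""
    count = int(digits) if digits else 1
    return ATOM_WEIGHTS[symbol] * count

def calculate_mw(formula):
    mw = 0.0  # the job '@init' does; starting from None would fail on '+='
    pos = 0
    while pos < len(formula):  # the match and its action run 0 or more times
        m = SPECIES.match(formula, pos)
        if m is None:
            raise ValueError("unexpected input at position %d of %r" % (pos, formula))
        mw += species_weight(m.group(1), m.group(2))  # the per-match action
        pos = m.end()
    return mw

print("MW is %s" % calculate_mw("H2O"))
```

An empty formula never enters the loop, so the result is the initial 0.0, which matches the first entry in the self-tests.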
With the @init in place, here's the code
and the driver code, which includes some self-tests. (I didn't quite
feel like making it work under unittest or py.test or similar code.)
import sys
import antlr3
from MWGrammarParser import MWGrammarParser
from MWGrammarLexer import MWGrammarLexer

formula = "H2SO4"
if len(sys.argv) > 1:
    formula = sys.argv[1]

def calculate_mw(formula):
    char_stream = antlr3.ANTLRStringStream(formula)
    lexer = MWGrammarLexer(char_stream)
    tokens = antlr3.CommonTokenStream(lexer)
    parser = MWGrammarParser(tokens)
    return parser.calculate_mw()

print "MW is", calculate_mw(formula)

print "Running self-tests"

# Run random tests to validate the parser and results
_mw_table = {
    'H': 1.00794,
    'C': 12.001,
    'Cl': 35.453,
    'O': 15.999,
    'S': 32.06,
    }

# Generate a random molecular formula and calculate
# its molecular weight. Yield the weight and formula.
def _generate_random_formulas():
    import random
    # Using semi-random values so I can check a wide space
    # Possible number of terms in the formula
    _possible_lengths = (1, 2, 3, 4, 5, 10, 53, 104)
    # Possible repeat count for each term
    _possible_counts = tuple(range(12)) + (88, 91, 106, 107, 200, 1234)
    # The available element names
    _element_names = _mw_table.keys()
    for i in range(1000):
        terms = []
        total_mw = 0.0
        # Use a variety of lengths
        for j in range(random.choice(_possible_lengths)):
            symbol = random.choice(_element_names)
            terms.append(symbol)
            count = random.choice(_possible_counts)
            # A count of 1 is sometimes left implicit
            if count == 1 and random.randint(0, 2) == 1:
                pass
            else:
                terms.append(str(count))
            total_mw += _mw_table[symbol] * count
        yield total_mw, "".join(terms)

_selected_formulas = [
    (0.0, ""),
    (1.00794, "H"),
    (1.00794, "H1"),
    (32.06, "S"),
    (12.001+1.00794*4, "CH4"),
    ]

for expected_mw, formula in (_selected_formulas +
                             list(_generate_random_formulas())):
    got_mw = calculate_mw(formula)
    if expected_mw != got_mw:
        raise AssertionError("%r expected %r got %r" %
                             (formula, expected_mw, got_mw))

% python calculate_mw.py H2O
MW is 18.01488
Running self-tests
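The self-test compares floats with "!=", which only holds up if the test and the parser add the terms in the same order. If that ever stopped being true, a tolerance-based comparison is one option. This is a sketch for a modern Python (math.isclose needs 3.5 or later), and the helper name same_weight and its tolerances are arbitrary choices for illustration.

```python
import math

def same_weight(expected, got, rel_tol=1e-9):
    """True if two weights agree to within a small relative tolerance."""
    return math.isclose(expected, got, rel_tol=rel_tol, abs_tol=1e-12)

# The same three terms, summed in a different order.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print("exactly equal: %s, close enough: %s" % (a == b, same_weight(a, b)))
```

The abs_tol argument matters for the empty formula, where both values are 0.0 and a purely relative tolerance has nothing to scale against.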
Andrew Dalke is an independent consultant focusing on
software development for computational chemistry and biology.