|
After saying HelloSablecc I also wanted to try out Antlr, which
is another compiler-compiler for the Java space. As with that entry,
this is just about getting Antlr going with a very simple "hello
world" style grammar. Like SableCC, Antlr is a compiler-compiler tool. It's been around
for a while, and I've run into a few projects that use it. Unlike
SableCC (and the venerable lex/yacc combo) it generates a recursive
descent parser using LL grammars. Compiler heads like to argue about
whether LL or LALR are better, I'll not step into that debate here. My simple case is to parse a file of a list of items like this:
item camera
item laser
Each line has the 'item' keyword followed by a single word for
the name of an item. I shall load each item object into a
configuration object that keeps them all together.
public class Configuration {
private Map<String, Item> items = new HashMap<String, Item>();
public Item getItem(String key) {
return items.get(key);
}
public void addItem(Item arg) {
items.put(arg.getName(), arg);
}
public class Item {
private String name;
public Item(String name) {
this.name = name;
}
Here's a test for that, using the file I showed above.
@Test public void readTwoItems() {
Reader input = null;
try {
input = new FileReader("catalog.txt");
} catch (FileNotFoundException e) {
throw new RuntimeException(e);
}
Configuration config = ParserCommand.parse(input);
assertNotNull(config.getItem("camera"));
assertNotNull(config.getItem("laser"));
assertEquals(2, config.getItems().size());
}
As before - using a compiler-compiler for this problem is silly,
but then is printing "hello world" on a console. For the same reason
as I always write "hello world" with a new environment, I like to
write something dirt simple to just make sure I can get things
working at all before I start doing anything real with it. One hassle with using an compiler-compiler like this is that it
makes the build process more complicated. I have to run antlr on the
grammar file to create java classes for the parser, then include
them in the compilation. So it's time to fight with ant again -
here's the ant target:
<property name = "dir.parser" value = "${dir.gen}/parser"/>
<path id = "path.antlr">
<fileset dir = "${dir.lib}">
<include name = "antlr*.jar"/>
<include name = "stringtemplate*.jar"/>
</fileset>
</path>
<target name = "gen" >
<mkdir dir="${dir.parser}"/>
<java classname="org.antlr.Tool" classpathref="path.antlr" fork = "true" failonerror="true">
<arg value="-o"/>
<arg value="${dir.parser}"/>
<arg value="Catalog.g"/>
</java>
</target>
This generates code into the gen directory. This way
generated code is separate from source code I write myself. Another
target does the compilation
<property name = "dir.build" value = "classes/production/antlrLair"/>
<target name = "compile" depends = "gen">
<mkdir dir="${dir.build}"/>
<javac destdir="${dir.build}" classpathref="classpath">
<src path = "src"/>
<src path = "${dir.gen}"/>
<src path = "test"/>
</javac>
</target>
I can then run the tests with a final target.
<target name = "test" depends="compile">
<junit haltonfailure = "on">
<formatter type="brief"/>
<classpath refid = "classpath"/>
<batchtest todir="${dir.build}" >
<fileset dir = "test" includes = "**/*Test.java"/>
</batchtest>
</junit>
</target>
Antlr works with a grammar file Catalog.g. The
grammar file defines the productions in the grammar and also actions
that the parser takes when it encounters productions. The grammar
file also defines the lexer (you can split them if you want). In
this sense Antlr is more traditional (and flexible) than
SableCC. SableCC doesn't allow actions, instead you generate a parse
tree or AST and walk that with java. Antlr allows arbitrary actions,
or it supports building a tree in the same manner as SableCC. (Antlr
also uses a grammar file to walk the tree.) Since I'm building up a
simple domain model of items and a configuration I'll forgo the
tree building and do all the work in my actions. I'll go through this file in chunks, with descriptions as I go. I
start with a grammar heading grammar Catalog; Antlr supports a number of points at which you can inject code
into the generated parser (instead of the generating a superclass
which SableCC does.) I put package declarations and imports into the
header.
@header{
package parser;
import model.*;
}
@lexer::header {
package parser;
}
The next code injection is to put code into the body of the
generated class. Essentially this adds members to the class, hence
the name of the command.
@members {
public Configuration result = new Configuration();
}Now I can get into the productions of the grammar. I'll do this
top down, since it's a top-down parser. I begin by saying that the
catalog consists of multiple item clauses followed by the end of the
file. catalog : item* EOF; Next I define the item clause as the literal string 'item'
followed by a string. item : 'item' name=STRING
{result.addItem(new Item ($name.text));};
Here I also put in the action, which is to create a new item in
the model with the name set to the value of the string. The code
inside the curlies is java code which is added to the parser after
that term is recognized. I can name elements in the production which
I then refer to in the action. Here I've given the string the name
'name', which makes sense in context even though it makes for an
awkward write-up. The last productions define the lexer elements for string and whitespace.
STRING : ('a'..'z' | 'A'..'Z')+ ;
WS : (' ' |'\t' | '\r' | '\n')+ {skip();} ;
The action of whitespace is to skip (ignore) it. There are a few things that make Antlr easier to work with than
SableCC. Antlr has a nice IDE called AntlrWorks that can plug into
IntelliJ. The tool will give you syntax highlighting and completion
on grammar elements, plot syntax diagrams for your grammar, and
allow you to enter test fragments to parse - displaying the
resulting parse tree. It's a very helpful tool to see what the
parser is doing. However there's no highlighting/completion for the
code inside actions, which is an understandable pain. Another good feature of Antlr is the fact that there is a decent
book on it in the works. The book gives detailed coverage of how
the tool works and useful background on language and compiler
principles. It does assume you're working on a full blown language and
that you'll be generating code - which isn't necessarily so for DSL
work. However the detail it gives looks like it will be invaluable as
I probe deeper. Antlr's actions seem like an easier bet if you want to populate a
model - I'm not sure how useful an intermediate parse tree or AST
would be here. Again further investigation will give me a better
feel. The more complex the language the more useful it is to have an
intermediate tree representation. I like Antlr's flexibility in
allowing you to do actions or tree building with transformations. Inevitably I did have problems even with this simple example. My
biggest blocker was that I originally defined the catalog term as
catalog : item*;, that is without the EOF. I
then got confused because the parser didn't indicate an error when it
got spurious input (like xitem foo). This wasn't helped
by inconsistencies between Antlr and AntlrWorks (the latter did show
an error and older versions of AntlrWorks would handle whitespace
differently too.) (Another big cause of trouble was getting ant and JUnit to work.
I don't want to have to think about the amount of time I've spent over
the years trying to diagnose classpath problems, especially with the infamous "Ant could
not find the task or a class this task relies upon." message.)
|