As I explore parser generator tools for external
DomainSpecificLanguages, I've said HelloAntlr and
HelloSablecc. If you spend much time looking at parser
generators you can't really avoid looking at the old stalwarts lex
and yacc (or their gnu counterparts flex and bison). I want to
explore the way lex and yacc operate, but my C has got too rusty. As
Erich Gamma quipped, I've got too lazy to take out my own
garbage. Fortunately there is an implementation of a yaccish system
for Java, which is just what I need. The Java implementation, like the classic lex and yacc, is two
independent tools: JFlex and CUP. Although they are developed
separately they do provide hooks to work together. As with my previous posts along these lines, this is a
overly-simple example just to get the tools working. I take an input
file which says:
item camera
item laser
and turn them into Item objects inside a configuration using
the following model:
public class Configuration {
  private Map<String, Item> items = new HashMap<String, Item>();
  public Item getItem(String key) {
    return items.get(key);
  }
  public void addItem(Item arg) {
    items.put(arg.getName(), arg);
  }
}

public class Item {
  private String name;
  public Item(String name) {
    this.name = name;
  }
  public String getName() {
    return name;
  }
}
to pass the following test
@Test public void itemsAddedToItemList() {
Reader input = null;
try {
input = new FileReader("rules.txt");
} catch (FileNotFoundException e) {
throw new RuntimeException(e);
}
Configuration config = CatalogParser.parse(input);
assertNotNull(config.getItem("camera"));
assertNotNull(config.getItem("laser"));
}
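The test calls a CatalogParser class that doesn't appear anywhere
else in this post: it's the glue that wires the generated lexer into
the generated parser. A minimal sketch, assuming the generated
classes keep their default names (Lexer from JFlex, parser from CUP)
and relying on the result field I declare later in the grammar's
parser code section:

```java
package parser;

import java.io.Reader;
import model.Configuration;

// Hypothetical glue class - not shown in the post, and it won't
// compile until JFlex and CUP have generated Lexer and parser.
public class CatalogParser {
    public static Configuration parse(Reader input) {
        try {
            parser p = new parser(new Lexer(input)); // CUP's parser wraps the JFlex lexer
            p.parse();       // run the grammar; the embedded actions fill p.result
            return p.result; // the field declared in the parser code section
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```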
The first issue is just to get the build going. As with my
previous examples I want to take the grammar input files and
generate the lexer and parser into a gen
directory. Unlike my previous examples I don't do this directly in
ant, instead I'm using ant to call a ruby script.
--- build.xml
<target name = "gen" >
<exec executable="ruby" failonerror="true">
<arg line = "gen.rb"/>
</exec>
</target>
--- gen.rb
require 'fileutils'
include FileUtils
system "java -cp lib/JFlex.jar JFlex.Main -d gen/parser src/parser/catalog.l"
system "java -jar lib/java-cup-11a.jar src/parser/catalog.y"
%w[parser.java sym.java].each {|f| mv f, 'gen/parser'}
Yes, I know it's a long way around, but with a lot of source
files I'm using the approach in FlexibleAntlrGeneration
to do my generation and I can't be bothered to sort it out in ant as
well. (When I attended CITCON recently,
I was surprised to find out that people were much happier with ant than
I thought. Grumpy me thinks it's a case of Stockholm Syndrome. Even
when less grumpy I'm keeping my eye on things like Raven and BuildR,
which has now got some documentation. I'm so ready to ditch ant.)

You'll notice that CUP puts its output files in the current
directory and I couldn't see how to override that behavior, so I
generated them and moved them with a separate command.

Once I generate the code I then compile and test it with ant.
<target name = "compile" depends = "gen">
<mkdir dir="${dir.build}"/>
<javac destdir="${dir.build}" classpathref="path.compile">
<src path = "${dir.src}"/>
<src path = "${dir.gen}"/>
<src path = "${dir.test}"/>
</javac>
</target>
<target name = "test" depends="compile">
<junit haltonfailure = "on" printsummary="on">
<formatter type="brief"/>
<classpath refid = "path.compile"/>
<batchtest todir="${dir.build}" >
<fileset dir = "test" includes = "**/*Test.java"/>
</batchtest>
</junit>
</target>
Lex and yacc separate the lexer and parser into different
files. Each is generated independently and combined during
compilation.

I'll start with the lexer file (catalog.l). The opening
declares the output file's package and imports.
package parser;
import java_cup.runtime.*;
JFlex uses %% markers to break the file into
sections. The second section consists of various declarations. The
first bit names the output class and tells it to interface with CUP.
%%
%class Lexer
%cup
The next bit is code to be folded into the lexer. Here I define a
function to create Symbol objects - again to hook into CUP.
%{
private Symbol symbol(int type) {
return new Symbol(type, yytext());
}
%}
The
Symbol class is defined in CUP and is part of its runtime jar. There
are various constructors taking different information about the
symbol and where it is.

Next up are some macros to define words and whitespace.
Word = [:jletter:]+
WS = [ \t\r\n]
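As an aside on those predefined classes: the JFlex manual defines
[:jletter:] as the characters for which
Character.isJavaIdentifierStart returns true, which you can poke at
from plain Java:

```java
public class JLetterCheck {
    public static void main(String[] args) {
        // [:jletter:] in JFlex corresponds to Character.isJavaIdentifierStart
        System.out.println(Character.isJavaIdentifierStart('c')); // true
        System.out.println(Character.isJavaIdentifierStart('7')); // false - a digit can't start an identifier
        System.out.println(Character.isJavaIdentifierStart('_')); // true - underscore counts as a letter here
    }
}
```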
The final section is the actual lexer rules. I define one to
return the item keyword and the other to return simple words to the
parser.
%%
"item" {return symbol(sym.K_ITEM);}
{Word} {return symbol(sym.WORD);}
{WS} {/* ignore */}
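For the sample input, the effect of these three rules can be
imitated in a few lines of plain Java - a sketch only, nothing here
is generated code. One subtlety it glosses over: in the real lexer
"item" matches both the first and second rules, and JFlex picks the
keyword rule because it's listed first.

```java
import java.util.ArrayList;
import java.util.List;

public class LexerSketch {
    // Mimic the three JFlex rules: "item" becomes K_ITEM, any other
    // word becomes WORD with its text as payload, whitespace is dropped.
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<String>();
        for (String run : input.trim().split("[ \t\r\n]+")) {
            result.add(run.equals("item") ? "K_ITEM" : "WORD(" + run + ")");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("item camera\nitem laser"));
        // [K_ITEM, WORD(camera), K_ITEM, WORD(laser)]
    }
}
```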
So the lexer will send a stream of K_ITEM and WORD tokens to the
parser.

I define the parser in catalog.y. Again it
starts with package and import declarations.
package parser;
import java_cup.runtime.*;
import model.*;
I'm parsing the data into a configuration object, so I need to
declare a place to put that result. Again this code is copied
directly into the parser object.
parser code {: Configuration result = new Configuration(); :}
With CUP I need to define all the rule elements that I'll use in
the productions.
terminal K_ITEM;
terminal String WORD;
non terminal catalog, item;
The terminals are the tokens I get from the lexer, the non
terminals are the rules I'll build up myself. If I want to get a
payload from the token, I need to declare its type, so WORD is a
String.

The catalog is a list of items. Unlike with antlr or sablecc I
don't have EBNF here, so I can't say item*; instead I need a
recursive rule.
catalog ::= item | item catalog;
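To see what the recursive rule is doing, here's a hand-written
equivalent of the two productions, working over plain strings
instead of real tokens (a sketch of the parsing shape, not what CUP
generates):

```java
import java.util.Arrays;
import java.util.List;

public class CatalogSketch {
    // catalog ::= item | item catalog
    // Match one item; if tokens remain, take the recursive alternative.
    static int catalog(List<String> tokens, int pos) {
        pos = item(tokens, pos);
        if (pos < tokens.size()) {
            pos = catalog(tokens, pos);
        }
        return pos;
    }

    // item ::= K_ITEM WORD
    static int item(List<String> tokens, int pos) {
        if (!tokens.get(pos).equals("K_ITEM")) {
            throw new IllegalStateException("expected K_ITEM");
        }
        System.out.println("item: " + tokens.get(pos + 1)); // the WORD payload
        return pos + 2;
    }

    public static void main(String[] args) {
        catalog(Arrays.asList("K_ITEM", "camera", "K_ITEM", "laser"), 0);
        // prints "item: camera" then "item: laser"
    }
}
```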
The item rule itself contains the embedded action to put the item
into the configuration.
item ::= K_ITEM WORD:w {: parser.result.addItem(new Item(w)); :}
  ;
A little wrinkle to note here is that the actions are put into a
separate class to the parser object, so to get to the result field I
defined earlier I have to use the parser field of the actions
object. I should also mention that once I do much further with this I
start to use an EmbedmentHelper to keep action code simple. People who have used yacc before might notice that I can label
the elements of the rule to refer to them in the actions instead of
the $1, $2 convention used in yacc.
Similarly, if the rule returns a value, CUP uses RESULT
rather than $$.

My memories of lex and yacc are dim, but these tools do seem to
mimic the style of using them pretty well. My biggest beef so far is
the error handling, which caused me much more fuss than antlr did. My
feeling so far is that if you're new to parser generators then antlr
is the better choice (particularly due to its book). However if
you're familiar with lex and yacc then these two are similar enough
to build off that knowledge.