Terence Parr Introduces ANTLR 3.0

by Frank Sommers

May 17, 2007

Summary

Terence Parr today released ANTLR 3.0, the latest version of the popular Java parser generator. In this interview with Artima, Parr discusses the most significant new ANTLR features: a new parsing strategy, a new technique for building syntax trees, integration with StringTemplate, and re-targetable code generation.

Frank Sommers: What differentiates ANTLR 3.0 from its predecessors?

Terence Parr: One of the things that bugged me about previous ANTLR versions was that the source code was built under duress when I was working really hard at a startup, JGuru. I didn't have a lot of time to think, and the inside of the code is dark and scary. With no unit tests, there is no way to alter the software without fear of breaking thousands of grammars out there on the Internet.

I had the right idea in making the lexer use a recursive-descent strategy, except that I didn't have the right technology then to make that work really well. [Editor's note: A lexer is used to recognize meaningful symbols for a grammar from a stream of characters. A recursive-descent strategy follows a set of mutually recursive methods to implement a parser.] The lexers were a bit slow in ANTLR version 2, and in some sense the linear approximate lookahead strategy used there was a little weak. In addition, the license was loose, and I didn't have a contributor record. All of these things provided motivation to start on something new.

One of my biggest motivations for ANTLR 3 was that I wanted something really clean in terms of the source code, a really decent piece of software. I wanted a boat-load of unit tests—I have 800 unit tests right now. I also wanted to have a very clean BSD license, with all the contributors signing certificates of origin.

Version 3.0 retains the mojo of the last version—all the goodness is there, except I fixed a lot of the quirkiness, and got rid of a lot of special cases. I also added a significant number of new features, such as a new parsing strategy that lets you look far ahead when parsing tokens, a new mechanism for building syntax trees, dynamically scoped attributes, integration with the StringTemplate template engine, and new error recovery and error handling mechanisms.

The LL(*) Parsing Strategy

Frank Sommers: Why did you change the parsing strategy in ANTLR 3?

Terence Parr: The new parsing strategy is called LL(*), replacing LL(k). Parsing is typically a two-pass process. Just like when you're reading English, your brain implicitly puts letters between spaces into words, and creates a sequence of words from a sequence of characters. We call that process lexical analysis.

The parser applies grammatical structure on top of that. Grammatical structure can say, for instance, that you have to see an equals sign, followed by a number, followed by a semicolon. The easiest way to describe what that structure looks like is with a grammar that says, literally, "Give me an equals, give me an integer, and give me a semicolon." The grammar applies a structure to the sequence of words or tokens.

The parser implementation I use, and that you would build by hand, is called recursive descent: You create a series of mutually recursive functions that apply grammatical structure. The recursive descent parser is a direct translation, a one-to-one mapping, from a grammar to an implementation. For example, you have a declaration initialization rule:

declInit : '=' INT ';' ;

Then the recursive descent parser would look like:

void declInit() {
  match('=');
  match(INT); // match an integer
  match(';');
}

Both of those say, "Match an equals, then match an integer, and then match a semicolon."
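
[Editor's note: The following is a minimal, self-contained Java sketch of the same idea, not code generated by ANTLR. The class name DeclInitParser, the Kind enum, and the peek, match, and accepts helpers are assumptions made for illustration; a real parser would also receive its tokens from a lexer rather than from a hand-built list.]

```java
import java.util.List;

// Hypothetical hand-written recursive-descent parser for:
//   declInit : '=' INT ';' ;
// All names here are illustrative and are not ANTLR APIs.
class DeclInitParser {
    enum Kind { EQUALS, INT, SEMI, EOF }

    private final List<Kind> tokens;
    private int pos = 0;

    DeclInitParser(List<Kind> tokens) {
        this.tokens = tokens;
    }

    // Current lookahead token, or EOF once the input is exhausted.
    private Kind peek() {
        return pos < tokens.size() ? tokens.get(pos) : Kind.EOF;
    }

    // Consume the next token if it has the expected kind; otherwise fail.
    private void match(Kind expected) {
        if (peek() != expected) {
            throw new IllegalStateException(
                "expected " + expected + " but found " + peek() + " at token " + pos);
        }
        pos++;
    }

    // One method per grammar rule: the body mirrors the rule left to right.
    void declInit() {
        match(Kind.EQUALS);
        match(Kind.INT);
        match(Kind.SEMI);
    }

    // Convenience wrapper: true if the token list starts with a valid declInit.
    static boolean accepts(List<Kind> input) {
        try {
            new DeclInitParser(input).declInit();
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts(List.of(Kind.EQUALS, Kind.INT, Kind.SEMI))); // true
        System.out.println(accepts(List.of(Kind.EQUALS, Kind.SEMI)));           // false
    }
}
```

The one-to-one mapping shows up in declInit(): each element of the rule becomes one match call, in the same order.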

Now, in order to distinguish between LL(k) and LL(*), consider a very simple parser that only has to recognize the difference between an integer and an identifier. You might have a rule that says, "Input is defined to be INT or ID." In rule form, that looks like:

input : ID | INT ;

With one symbol of look-ahead, you can decide which one it is—it's either an ID or an INT. We would call that an LL(1) parser.

Imagine that you wanted to see something different: an ID followed by a semicolon, or an ID followed by a period. In that case, you need two symbols of look-ahead to distinguish between those two cases. In its rule form, that looks like:

input : ID ';'
      | ID '.'
      ;

Now, make that an even worse case: it's either an ID* followed by a semicolon, or it's ID* followed by a period. That means, in effect, a whole bunch of identifiers followed by a semicolon, or a whole bunch of identifiers followed by a period:

input : ID* ';' {System.out.println("alternative 1");}
      | ID* '.' {System.out.println("alternative 2");}
      ;

To distinguish between those alternatives, you need to scan past an arbitrary number of identifiers to see the token that follows. There is no fixed number of look-ahead tokens that will work, because I can always give you one more identifier that you need to see past. LL(k), the typical recursive descent parsing strategy, fails in this case.

LL(*), on the other hand, allows the look-ahead to spin arbitrarily forward in a loop, as opposed to some fixed number of symbols. Instead of looping from i=1..k, LL(*) look-ahead could scan until it runs out of input. LL(*) in ANTLR 3.0 does that in a really tight DFA [Editor's note: Deterministic Finite Automaton], not in the parser, so it's not considered back-tracking.
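
[Editor's note: The difference between a fixed look-ahead depth and a looping look-ahead can be sketched in a few lines of Java. This is an illustration only and does not reproduce the DFA that ANTLR 3.0 generates. The class LookaheadDemo, the Kind enum, and both predict methods are hypothetical names. Each method returns 1 or 2 for the alternative of the ID* rule above that it predicts, or 0 if it cannot decide.]

```java
import java.util.List;

// Hypothetical sketch of the prediction decision for:
//   input : ID* ';'   (alternative 1)
//         | ID* '.'   (alternative 2)
//         ;
class LookaheadDemo {
    enum Kind { ID, SEMI, DOT }

    // LL(k)-style decision: inspect at most k tokens of look-ahead.
    // Returns 0 when the first k tokens are all IDs, so k is not enough.
    static int predictFixedK(List<Kind> tokens, int k) {
        int limit = Math.min(k, tokens.size());
        for (int i = 0; i < limit; i++) {
            if (tokens.get(i) == Kind.SEMI) return 1;
            if (tokens.get(i) == Kind.DOT) return 2;
        }
        return 0;
    }

    // LL(*)-style decision: loop over IDs for as long as they keep coming,
    // then decide on the first token that is not an ID.
    // Returns 0 only if the input ends before such a token appears.
    static int predictLLStar(List<Kind> tokens) {
        int i = 0;
        while (i < tokens.size() && tokens.get(i) == Kind.ID) {
            i++; // stay in the same state while scanning identifiers
        }
        if (i == tokens.size()) return 0;
        return tokens.get(i) == Kind.SEMI ? 1 : 2;
    }
}
```

For the input ID ID ID '.', predictFixedK with k=2 gives up and returns 0, while predictLLStar returns 2. Raising k only moves the problem, since an input with one more identifier defeats it again.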

A common example is when you try to match function declarations versus definitions. A function declaration ends with a semicolon, and a definition would have a left curly brace at the end of the header of the method. With LL(*), you can just scan ahead automatically with a little DFA, looking for that semicolon or the left curly.

In a sense, it's like walking through a maze, and looking at words on the floor of the maze, and having a pass-phrase that tells you which path to take. It's kind of like having a little trained monkey with you that races down both paths and can figure out what's down there, looking for a particular sequence or word that is going to differentiate between these two paths, and then race back, and tell you what it is.

Re-Targetable Code Generation

Frank Sommers: What benefit does StringTemplate bring to ANTLR 3?

Terence Parr: The biggest benefit of using StringTemplate in ANTLR 3 is that it allows ANTLR to easily re-target code. At this time, we support Java, C#, Python, and C. We'll have Ruby, C++, and Objective-C next month. The reason it's so easy to support languages in ANTLR 3 is that I worked really hard to produce a re-targetable code generator.

The thing that distinguishes StringTemplate from other generators is that StringTemplate strictly enforces separation of data and logic from presentation (so-called model-view separation). Although it is a pain sometimes, strict separation guarantees retargetability. If there is no way to put logic or computations inside a template, then there is no way the templates can be part of the program. And therefore, all logic for code generation can be properly encapsulated in a single program entity.

I have a single code generator, and the emitter simply says, "I need a template that tells me how to define a rule." That usually ends up being a function definition. Then I need a template that tells the code generator how to match a token. I just go to a template library, a StringTemplate group, and pull that template out of there, and build a bigger and bigger StringTemplate, until I have one StringTemplate that is the entire output file, and then write that output to a file. Each target is purely a text file that specifies what the templates look like. There is not a single character, a single literal, that gets emitted by the code generator that's a literal in Java code or in some other language. It's all done via templates.
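
[Editor's note: The approach described above can be sketched in plain Java. This sketch does not use the StringTemplate library or its API; simple string substitution stands in for it. The class RetargetableEmitter, the template names "rule" and "matchToken", and the template texts for both targets are assumptions invented for illustration. The emitter logic contains no target-language syntax; everything that ends up in the output comes from the per-target template map.]

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of template-driven, re-targetable code generation.
// Plain string substitution stands in for StringTemplate here.
class RetargetableEmitter {
    // One template group per target language (texts are made up).
    static final Map<String, String> JAVA_TARGET = Map.of(
        "rule", "void <name>() {\n<body>}\n",
        "matchToken", "    match(<token>);\n");

    static final Map<String, String> PYTHON_TARGET = Map.of(
        "rule", "def <name>(self):\n<body>",
        "matchToken", "    self.match(<token>)\n");

    // Look up a template by name and fill its <attribute> holes.
    // Assumes attribute values do not themselves contain <...> holes.
    static String render(Map<String, String> target, String template,
                         Map<String, String> attributes) {
        String text = target.get(template);
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            text = text.replace("<" + e.getKey() + ">", e.getValue());
        }
        return text;
    }

    // Target-independent logic: render one small template per token,
    // then nest the results inside the larger rule template.
    static String emitRule(Map<String, String> target, String ruleName,
                           List<String> tokens) {
        StringBuilder body = new StringBuilder();
        for (String token : tokens) {
            body.append(render(target, "matchToken", Map.of("token", token)));
        }
        return render(target, "rule",
                      Map.of("name", ruleName, "body", body.toString()));
    }
}
```

Calling emitRule with JAVA_TARGET or with PYTHON_TARGET runs exactly the same logic. Supporting another language in this sketch would mean adding another template map, not changing emitRule.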

Party Time

Frank Sommers: Can you tell us about the ANTLR 3.0 launch party?

Terence Parr: The party will be held Tuesday, May 22nd, at 7:00 pm, at the University of San Francisco. There is room for about 30 people. Anyone interested in coming should RSVP to me directly via email. My email address is the domain cs.usfca.edu preceded by parrt and the usual @ sign.

Resources

ANTLR 3.0
http://www.antlr.org/

The Definitive ANTLR Reference: Building Domain-Specific Languages. A new book by Terence Parr.
http://www.pragmaticprogrammer.com/titles/tpantlr/index.html

About the author

Frank Sommers is Editor-in-Chief of Artima Developer. He also serves as chief editor of the IEEE Technical Committee on Scalable Computing's newsletter, and is an elected member of the Jini Community's Technical Advisory Committee. Prior to joining Artima, Frank wrote the Jiniology and Web services columns for JavaWorld.