The Artima Developer Community
Sponsored Link

Let's Reconsider That
Parsing with Tree Constructors
by Michael Feathers
October 7, 2005
Summary
A pattern for the recursive creation of composite objects.

Advertisement

If you haven't browsed the code in Ward Cunningham's FIT framework, you really should. It's extremely elegant well-thought-out code, and it's the quintessential example of an open framework, a framework designed to allow extension everywhere.

One of the patterns Ward uses is one I've never seen named or discussed. Right now, I'm calling it the Tree Constructor Pattern. If you think of a better name or you've seen it written up elsewhere, please let me know.

Here's how it works in Ward's code.

FIT has two primary classes: Fixture, and Parse. Fixture iterates through a representation of an HTML document and runs tests defined within the document. Parse is the representation that Fixture iterates.

Parse is an interesting class. You create a Parse by giving it a string like this:

Parse parse = new Parse(someHMTLString);

The constructor recursively constructs a tree structure of Parse objects that represents the text in the document. The Parse that you construct ends up being the root of a tree that represents the document. You can get from one parse to another by following its more and parts fields, like this:

parse.more.more.parts.text();

When you use tree constructor, you can pack a powerful punch in a single class. You end up with:

  • A single class which represents both a tree and its leaves.
  • A mechanism for automatically creating the subobjects of a parent.

    In Ward's Parse class, the entire tree is populated on construction, but population can be done on demand. Here's an example from an interpreter I was writing in C# a while ago. I have a class called Token. We can construct a Token from a string containing a program fragment:

    Token firstToken = new Token("1 to: 4\n    ");

    Now that we have the firstToken, we can ask for subsequent ones:

    Assert.AreEqual("1", firstToken.Text);
    Assert.AreEqual("to:", firstToken.Next.Text);
    Assert.AreEqual("4", firstToken.Next.Next.Text);

    Here's how the Token class does its work:

    public class Token
    {
    	private string		sourceText = "";
    	private string		text = "";
    	private TokenType	type = TokenType.GrotUnknown;
    	private int		indentationPosition = 0;
    
    	public Token(string sourceText) 
    		: this (sourceText, 0)
    	{
    	}
    
    	public Token(string sourceText, int indentationPosition)
    	{
    		this.indentationPosition = indentationPosition;
    		this.sourceText = sourceText;
    
    		SkipWhitespace();
    		bool dummyResultBecauseCSharpCantHandleBooleanExpressionsCalculatedSolelyForTheirSideEffects
    			= ParseNumericLiteral() 
    			|| ParseIdentifier() 
    			|| ParseDeclarator() 
    			|| ParseNewLine()
    			...
    			|| ParseAssignmentOperator();
    	}
    
            ...
    
    	public Token Next 
    	{
    		get { 
    			return new Token(
    				sourceText.Substring(text.Length), indentationPosition); 
    		}
            }
    
    	public string Text 
    	{
    		get { return text; }
    	}
    
    	public TokenType Type 
    	{
    		get { return type; }
    	}
    }
    

    The act of accessing the Next property creates the next token and returns it. This is inefficient if you have to do it more than once, but it's easy enough to change Next so that it caches.

    Tree Constructor classes have some upsides and some downsides. One upside is simplicity. Tree Constructor is simple for users because they only have to create a single object. All of the child objects are created automatically. It is also rather simple to implement.

    Using the same class for both a tree and its nodes does make some things easier. For instance, converting a tree to some other form can often be handled simply by adding a method to the class.

    string programRepresentation = token.AsString();

    The nice thing is that it works at any level. You can take any token, no matter how far down it is in the tree and send it AsString to get a representation from that level of the tree downward.

    Tree Constructor is a nice pattern, but there are a couple of downsides. Tree Constructor works only when a parent object can know everything that it needs to construct its children. If you have to pass in other objects as helpers, you might want to consider creating a factory class that manages the construction externally.

    Another downside is co-mingled responsibility. Tree Constructor classes typically group together several responsibilities: creation of self, creation of subobjects, access of subobjects, and any operations that apply to the current object. Is this a problem? It can be if those responsibilities vary independently, but in cases where they don't, Tree Constructor is fine.

    The last downside is naming. It can be hard to come up with a class name that is works for both the part and the whole. Ward chose the name Parse for his class, and it is a rather neat word. Parse can be used as both a noun and a verb so code like this:

    Parse parse = new Parse(someHMTLString);

    can be understood as "construct a parse of this string" or "parse this string."

    Despite the slight advantage of the noun/verb ambiguity, Parse does seem to make more sense when applied to the whole (saying "here is a parse of the file") rather than a part (saying "here is a parse of an identifier" when what you have is just an identifier).

    In the example above, I used the name Token for my Tree Constructor class and it does feel a little weird to create a token using a string that contains a program fragment or an entire program. However, after construction, Token is a very accurate name. Each Token really does represent a token. Once you construct the first one, all of the others are there and ready to be used.

    Talk Back!

    Have an opinion? Readers have already posted 4 comments about this weblog entry. Why not add yours?

    RSS Feed

    If you'd like to be notified whenever Michael Feathers adds a new entry to his weblog, subscribe to his RSS feed.

    About the Blogger

    Michael has been active in the XP community for the past five years, balancing his time between working with, training, and coaching various teams around the world. Prior to joining Object Mentor, Michael designed a proprietary programming language and wrote a compiler for it, he also designed a large multi-platform class library and a framework for instrumentation control. When he isn't engaged with a team, he spends most of this time investigating ways of altering design over time in codebases.

    This weblog entry is Copyright © 2005 Michael Feathers. All rights reserved.

  • Sponsored Links



    Google
      Web Artima.com   

    Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use