Weblogs Forum - Parsing with Tree Constructors

Articles |
News |
Weblogs |
Books |
Forums

Artima Forums | Articles | Weblogs | Java Answers | News

Sponsored Link •

Weblogs Forum
Parsing with Tree Constructors

4 replies on 1 page. Most recent reply: Oct 12, 2005 10:39 PM by Mike Stockdale

Welcome Guest
Sign In

Back to Topic List

Reply to this Topic

Search Forum

Threaded View


Previous Topic		Next Topic

Flat View: This topic has 4 replies on 1 page

Michael Feathers

Posts: 448
Nickname: mfeathers
Registered: Jul, 2003

Parsing with Tree Constructors (View in Weblogs)

Posted: Oct 7, 2005 6:29 AM

Summary
A pattern for the recursive creation of composite objects.

If you haven't browsed the code in Ward Cunningham's FIT framework, you really should. It's extremely elegant well-thought-out code, and it's the quintessential example of an open framework, a framework designed to allow extension everywhere.

One of the patterns Ward uses is one I've never seen named or discussed. Right now, I'm calling it the Tree Constructor Pattern. If you think of a better name or you've seen it written up elsewhere, please let me know.

Here's how it works in Ward's code.

FIT has two primary classes: Fixture, and Parse. Fixture iterates through a representation of an HTML document and runs tests defined within the document. Parse is the representation that Fixture iterates.

Parse is an interesting class. You create a Parse by giving it a string like this:

Parse parse = new Parse(someHMTLString);

The constructor recursively constructs a tree structure of Parse objects that represents the text in the document. The Parse that you construct ends up being the root of a tree that represents the document. You can get from one parse to another by following its more and parts fields, like this:

parse.more.more.parts.text();

When you use tree constructor, you can pack a powerful punch in a single class. You end up with:

A single class which represents both a tree and its leaves.

A mechanism for automatically creating the subobjects of a parent.

In Ward's Parse class, the entire tree is populated on construction, but population can be done on demand. Here's an example from an interpreter I was writing in C# a while ago. I have a class called Token. We can construct a Token from a string containing a program fragment:

Token firstToken = new Token("1 to: 4\n    ");

Now that we have the firstToken, we can ask for subsequent ones:

Assert.AreEqual("1", firstToken.Text);
Assert.AreEqual("to:", firstToken.Next.Text);
Assert.AreEqual("4", firstToken.Next.Next.Text);

Here's how the Token class does its work:

public class Token
{
	private string		sourceText = "";
	private string		text = "";
	private TokenType	type = TokenType.GrotUnknown;
	private int		indentationPosition = 0;

	public Token(string sourceText) 
		: this (sourceText, 0)
	{
	}

	public Token(string sourceText, int indentationPosition)
	{
		this.indentationPosition = indentationPosition;
		this.sourceText = sourceText;

		SkipWhitespace();
		bool dummyResultBecauseCSharpCantHandleBooleanExpressionsCalculatedSolelyForTheirSideEffects
			= ParseNumericLiteral() 
			|| ParseIdentifier() 
			|| ParseDeclarator() 
			|| ParseNewLine()
			...
			|| ParseAssignmentOperator();
	}

        ...

	public Token Next 
	{
		get { 
			return new Token(
				sourceText.Substring(text.Length), indentationPosition); 
		}
        }

	public string Text 
	{
		get { return text; }
	}

	public TokenType Type 
	{
		get { return type; }
	}
}

The act of accessing the Next property creates the next token and returns it. This is inefficient if you have to do it more than once, but it's easy enough to change Next so that it caches.

Tree Constructor classes have some upsides and some downsides. One upside is simplicity. Tree Constructor is simple for users because they only have to create a single object. All of the child objects are created automatically. It is also rather simple to implement.

Using the same class for both a tree and its nodes does make some things easier. For instance, converting a tree to some other form can often be handled simply by adding a method to the class.

string programRepresentation = token.AsString();

The nice thing is that it works at any level. You can take any token, no matter how far down it is in the tree and send it AsString to get a representation from that level of the tree downward.

Tree Constructor is a nice pattern, but there are a couple of downsides. Tree Constructor works only when a parent object can know everything that it needs to construct its children. If you have to pass in other objects as helpers, you might want to consider creating a factory class that manages the construction externally.

Another downside is co-mingled responsibility. Tree Constructor classes typically group together several responsibilities: creation of self, creation of subobjects, access of subobjects, and any operations that apply to the current object. Is this a problem? It can be if those responsibilities vary independently, but in cases where they don't, Tree Constructor is fine.

The last downside is naming. It can be hard to come up with a class name that is works for both the part and the whole. Ward chose the name Parse for his class, and it is a rather neat word. Parse can be used as both a noun and a verb so code like this:

Parse parse = new Parse(someHMTLString);

can be understood as "construct a parse of this string" or "parse this string."

Despite the slight advantage of the noun/verb ambiguity, Parse does seem to make more sense when applied to the whole (saying "here is a parse of the file") rather than a part (saying "here is a parse of an identifier" when what you have is just an identifier).

In the example above, I used the name Token for my Tree Constructor class and it does feel a little weird to create a token using a string that contains a program fragment or an entire program. However, after construction, Token is a very accurate name. Each Token really does represent a token. Once you construct the first one, all of the others are there and ready to be used.

Robert C. Martin

Posts: 111
Nickname: unclebob
Registered: Apr, 2003

Re: Parsing with Tree Constructors

Posted: Oct 7, 2005 9:21 AM

There is both elegance and ugliness in Ward's Parse structure. There is no doubt that it is easy to use.


Parse p = new Parse(""<table><tr><td>hi</td></tr></table");

is remarkably elegant. On the other hand,


String name = p.parts.parts.more.parts.more.text();

is close to an abomination. Ward tried to fix this with his "at" functions so that you can say:


String name = p.at(3,2);

instead of


String name = p.parts.parts.more.more;

but that's only a small improvement.

Moreover I find that there's a SRP violation in the whole concept. Parse is deeply tied to HTML, whereas the FIT data need not be so deeply tied. The Parse structure makes it hard to think of non-HTML ways to deal with the Fit data. For example, It might be nice to pass FitNesse wiki-text, or CSV strings, or some other kind of table format to Fit. But the strong coupling between Parse, the Fixtures, and HTML makes this hard.

In hindsight, I wish that the parsing aspects of Fit, and the data presented to the fixtures, were strongly decoupled. The TreeConstructor pattern, though elegant in some ways, does strongly couple both the parsing and the subsequent traversal of the data.

Brian Slesinsky

Posts: 43
Nickname: skybrian
Registered: Sep, 2003

Re: Parsing with Tree Constructors

Posted: Oct 7, 2005 12:10 PM

Using a single type for all nodes in a tree is very compact and clever and reminiscent of the days when we didn't have object-oriented programming and were running lisp on slow machines with little memory. An expression like "parse.more.more.parts.text()" is really just a more readable version of "(get-text (car (cdr (cdr parse))))".

But elegant? More like confusing, as the awkwardness in language shows. The nodes in an abstract syntax tree aren't all the same and a program using a regular type hierarchy, while more verbose, is a lot easier to explain and debug.

Todd Blanchard

Posts: 316
Nickname: tblanchard
Registered: May, 2003

Re: Parsing with Tree Constructors

Posted: Oct 10, 2005 4:45 PM

I wrote a MIME library in 2000 or so in Objective C and this is the approach I took to parsing MIME messages. Basically recursive descent parsing but each object is responsible for parsing its construct.

In my case, I had a large map that mapped constructs to class names and simply contructed an object of the right class. Each class's constructor just knew how to parse/trim its construct, then call the factory method to build the next node out of what's left.

Hardly a new technique I think.

Mike Stockdale

Posts: 2
Nickname: jediwhale
Registered: Oct, 2005

Re: Parsing with Tree Constructors

Posted: Oct 12, 2005 10:39 PM

I have the greatest respect for Ward's creativity and "big ideas" but frankly I was unimpressed by the FIT code. I found it NOT intention revealling and with classes like Parse that violate SRP...

Flat View: This topic has 4 replies on 1 page

Previous Topic

Next Topic


	Web Artima.com