Organic Schemas and Outlier Data

A Conversation with Elliotte Rusty Harold, Part IX

by Bill Venners
October 6, 2003

Summary
Elliotte Rusty Harold talks with Bill Venners about strict versus forgiving XML parsing, dealing with outlier data, and growing schemas organically.

Elliotte Rusty Harold is a prolific author of books about Java and XML, and creator of the popular Java website Cafe au Lait and XML website Cafe con Leche. He contributed to the development of JDOM, a popular XML processing API for Java. His most recent book, Processing XML with Java, shows how to parse, manipulate, and generate XML from Java applications using several XML APIs, including SAX, DOM, and JDOM.

At a meeting of the New York XML SIG in September, 2002, Harold unveiled an XML processing API of his own design: the XOM (XML Object Model) API. On Cafe au Lait and Cafe con Leche, Harold described XOM like this:

Like DOM, JDOM, dom4j, and ElectricXML, XOM is a read/write API that represents XML documents as trees of nodes. Where XOM diverges from these models is that it strives for absolute correctness and maximum simplicity. XOM is based on more than two years' experience with JDOM development, as well as the last year's effort writing Processing XML with Java. While documenting the various APIs I found lots of things to like and not like about all the APIs, and XOM is my effort to synthesize the best features of the existing APIs while eliminating the worst.

In this interview, which is being published in multiple installments, Elliotte Rusty Harold discusses the strengths and weaknesses of the various XML processing APIs for Java, the design problems with existing APIs, and the design philosophy behind XOM.

Strict versus Forgiving XML Parsers

Bill Venners: In your book Processing XML with Java, you wrote:

Invariably sooner or later you will encounter a document that purports to adhere to the implicit schema and indeed is very close to it, but doesn't quite match what you were assuming. Explicit validation is necessary.

To what extent should we strictly require validity when parsing documents? When should we be forgiving of invalid documents and try our best to parse them anyway?

Elliotte Rusty Harold: I'm not sure I believe anymore that explicit validation is necessary. I've become much more liberal in my understanding of XML in the last year. I'm much more willing to accept invalid content than I was. Perhaps ultimately with a fallback to a human being, there are some forms of documents you can pass in that don't adhere to the schema. They don't adhere to the contract, and yet they can still be usefully processed. The big example of this right now is RSS.

There are currently four separate official RSS formats out there. The tools can pretty much handle them all. But guess what? Probably less than half of the RSS feeds out there regularly satisfy any of those four formats. Nonetheless, if you have a basic RSS reader that can find and load items and titles, that behaves sanely when something it wants isn't there rather than just crashing, then you can pretty much handle RSS as it exists, even though the documents don't really adhere to the schemas.
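As an illustration (not part of the interview), here is a minimal sketch of the kind of forgiving reader described above, written against the standard JAXP DOM API. The class name `LenientRssReader`, the method names, and the placeholder strings are hypothetical. The sketch assumes the feed is at least well-formed XML, and it looks for `item`, `title`, and `link` elements by local name only, without validating against any RSS schema.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical lenient feed reader: uses what it finds, tolerates what is missing. */
final class LenientRssReader {

    /** Returns one "title | link" line per item, with placeholders for missing fields. */
    static List<String> summarize(String xml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // lets us match namespaced items by local name
            DocumentBuilder builder = factory.newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Assumption of this sketch: a feed that is not well-formed is rejected.
            throw new IllegalArgumentException("feed is not well-formed XML", e);
        }

        List<String> summaries = new ArrayList<>();
        // "*" matches any namespace, so <item> is found wherever it sits in the tree.
        NodeList items = doc.getElementsByTagNameNS("*", "item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            summaries.add(childText(item, "title", "(untitled)")
                    + " | " + childText(item, "link", "(no link)"));
        }
        return summaries;
    }

    /** Text of the first child element with this local name, or the fallback. */
    private static String childText(Element parent, String localName, String fallback) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && localName.equals(n.getLocalName())) {
                String text = n.getTextContent().trim();
                return text.isEmpty() ? fallback : text;
            }
        }
        return fallback; // the element is absent: carry on instead of crashing
    }
}
```

Calling `LenientRssReader.summarize(feedXml)` on a feed whose second item has no title or link still yields two entries, the second built from the placeholders. The point of the design is that a missing element is an ordinary case with a defined result.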

On the other hand, you may want to use validity tests to help you decide how to process a document. You get a document and check it for validity against a schema. If it is valid, you pass it to process A. If it's not valid according to that schema, then check it against another schema. If it matches that schema, you pass it to process B. Or you transform it to the first schema and pass it to process A. Perhaps you eventually get a document that doesn't match any schema you've seen before, so you send that document off to Joe over in the IT department. You ask Joe to take a look at the document and figure out how to integrate it into your system. And maybe at the same time kick back a message to the submitter saying, "This is going to take a little longer."
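The routing idea above can be sketched with the standard `javax.xml.validation` API. This is an illustration under assumptions, not something from the interview: the class `ValidationRouter`, the `Route` enum, and the use of W3C XML Schema are hypothetical choices, and the option of transforming a document to the first schema is left out.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

/** Hypothetical router: validity decides which process receives a document. */
final class ValidationRouter {

    enum Route { PROCESS_A, PROCESS_B, HUMAN_REVIEW }

    private final Schema schemaA;
    private final Schema schemaB;

    /** Both arguments are W3C XML Schema documents, passed as strings for brevity. */
    ValidationRouter(String xsdA, String xsdB) {
        this.schemaA = compile(xsdA);
        this.schemaB = compile(xsdB);
    }

    /** Tries each schema in turn; anything that matches neither goes to a person. */
    Route route(String xml) {
        if (isValid(schemaA, xml)) return Route.PROCESS_A;
        if (isValid(schemaB, xml)) return Route.PROCESS_B;
        return Route.HUMAN_REVIEW;
    }

    private static Schema compile(String xsd) {
        try {
            SchemaFactory factory =
                    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            return factory.newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("schema does not compile", e);
        }
    }

    private static boolean isValid(Schema schema, String xml) {
        try {
            // With no ErrorHandler set, the validator throws on the first error.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException | IOException e) {
            return false; // invalid or malformed: let the caller try the next schema
        }
    }
}
```

A caller would construct the router once with its two schemas and call `route(document)` for each incoming document. A `HUMAN_REVIEW` result is the cue to forward the document to a person and tell the submitter that processing will take longer.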

Dealing with Outlier Data

Bill Venners: In your book Processing XML with Java, you wrote:

The real world does not always fit into neatly typed categories. There's almost always some outlier data that just doesn't fit the schema.

Elliotte Rusty Harold: Yes, and I found a very interesting example of outlier data. I did not go looking for this. I knew that flat data in comma-delimited files often flips columns, has missing fields and corrupt data. This happens in practice all the time. But the specific case I looked at in that chapter had an even more interesting example: a year that only contained three months.

In the late 70s, the U.S. government shifted its fiscal year forward one quarter. They still had to have budgetary data for the three-month period in between the two fiscal years, so they created a year that contains only three months. It's called the Transitional Quarter. The data goes: 1975, 1976, TQ, 1977. This is the sort of thing you have to deal with in real world data. It happens all the time. The real world is not as simple as the schema you design. There's always outlier data.
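To make the example concrete (this sketch is an illustration, not from the interview or the book), consider a field that a schema would naturally type as an integer year. The hypothetical `FiscalPeriod` class below keeps the label as it appeared in the data and records the period's length, so that "TQ" is a known case rather than a parse failure. The class and field names are assumptions.

```java
/** Hypothetical model of a budget period: usually a year, occasionally not. */
final class FiscalPeriod {

    final String label; // the value as it appeared in the data, trimmed
    final int months;   // 12 for an ordinary fiscal year, 3 for the Transitional Quarter

    private FiscalPeriod(String label, int months) {
        this.label = label;
        this.months = months;
    }

    /** Accepts a four-digit-style year or the outlier "TQ"; rejects anything else. */
    static FiscalPeriod parse(String raw) {
        String label = raw.trim();
        if (label.equalsIgnoreCase("TQ")) {
            return new FiscalPeriod("TQ", 3); // the known outlier gets its own branch
        }
        try {
            Integer.parseInt(label); // a plain int-typed field would stop here on "TQ"
        } catch (NumberFormatException e) {
            // Unknown outliers are reported clearly instead of being guessed at.
            throw new IllegalArgumentException("unrecognized fiscal period: " + raw, e);
        }
        return new FiscalPeriod(label, 12);
    }
}
```

Parsing the sequence 1975, 1976, TQ, 1977 with `FiscalPeriod.parse` gives three 12-month periods and one 3-month period. A value the code has never seen raises an `IllegalArgumentException` that names the offending input.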

Bill Venners: The last thing you write in that paragraph is:

You cannot assume the data actually adheres to its schema, either implicit or explicit.

So what do you do about that? How do you deal with it?

Elliotte Rusty Harold: You verify it. You check to see if each document does adhere to the schema. You don't just take the data in blindly assuming it is valid and then you have your program crash with a NullPointerException because a field is missing. You always check for values that are null. Anything that could go wrong, you should assume will go wrong.

I've noticed that a lot of applications—Microsoft Word, Adobe Illustrator, and various other major programs—occasionally crash because they attempt to open a corrupt file. Something went wrong. The power went down when the file was saved. A zip program mangled the file. Files can get corrupted in many ways. But even though the data in a file is bad, the program should not crash. It should deal sanely with that failure.

One of the nice features of XML processing is that the XML parser can help you deal with corrupt data. Parsers are designed to check for problems, both well-formedness and validity errors. No matter how corrupt a file is, it should not crash the XML parser. The parser will simply notice the corrupted data, report it as a well-formedness error, and throw the appropriate exception. The program can then proceed to handle it. The program shouldn't overwrite memory, or set a positive number to the value -200,000, or anything else that's going to cause real problems. If there is genuine corrupted data when you're using an XML format, the parser will catch it.
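A minimal sketch of that behavior, using the standard JAXP parser, might look like the following. It is an illustration rather than code from the interview, and the class `WellFormednessCheck` and its method name are hypothetical. It checks well-formedness only, not validity.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: turns parser exceptions into a report the program can act on. */
final class WellFormednessCheck {

    /** Returns a description of the first problem, or empty if the input is well-formed. */
    static Optional<String> firstError(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors, so the catch blocks below see them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return Optional.empty();
        } catch (SAXParseException e) {
            // The parser noticed corrupt input and says where it found the problem.
            return Optional.of("line " + e.getLineNumber() + ": " + e.getMessage());
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return Optional.of(String.valueOf(e.getMessage()));
        }
    }
}
```

A program can call `WellFormednessCheck.firstError(text)` before doing any real work, then show the message to the user or skip the file when a problem is reported. Corrupt input ends in a handled exception, not in a crash or in silently wrong data.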

Organic Schema Design

Bill Venners: Do you have any general guidelines for designing an XML schema, designing the data structure?

Elliotte Rusty Harold: The main thing I would say is: grow your documents organically. Try and model the actual content for which you're writing a schema, and see what sort of XML structures come out. Don't start by writing schemas. Start by writing example instance documents, and see what you get.

For example, if you're modeling invoices, pull out a few invoices. Ask yourself, "If I wrote this invoice in XML, what would it look like? That invoice, what would it look like?" If you have a large and representative enough collection of previous documents, in whatever format (paper or electronic), you can get a good start. Then you will gradually discover other documents coming into your system that don't really fit your designs. They have a couple of extra fields. One document has two shipping addresses instead of one, so you figure out how to handle that in your schema. Another document has an address that's in the U.K. instead of in the United States, and that has a very different format. So you adjust the schema.

If you grow your schemas organically, you gradually figure out how the documents are likely to be structured. You don't write down in stone up front that the documents must be structured like this, that all these elements must be present, that these attributes must not be present if something else is present, and so on. You let the actual information drive the design, rather than letting the design constrain what documents you're willing to accept.

Next Week

Come back Monday, October 13 for the first installment of a conversation with C++ creator Bjarne Stroustrup. I know I promised this last week, but one must always keep up some element of surprise. Nevertheless, look for Bjarne next Monday. He will be here, really.

Resources

Elliotte Rusty Harold is author of Processing XML with Java: A Guide to SAX, DOM, JDOM, JAXP, and TrAX, which is available on Amazon.com at:
http://www.amazon.com/exec/obidos/ASIN/020161622X/

XOM, Elliotte Rusty Harold's XML Object Model API:
http://www.cafeconleche.org/XOM/

Cafe au Lait: Elliotte Rusty Harold's site of Java News and Resources:
http://www.cafeaulait.org/

Cafe con Leche: Elliotte Rusty Harold's site of XML News and Resources:
http://www.cafeconleche.org/

JDOM:
http://www.jdom.org/

DOM4J:
http://www.dom4j.org/

SAX, the Simple API for XML Processing:
http://www.saxproject.org/

DOM, the W3C's Document Object Model API:
http://www.w3.org/DOM/

ElectricXML:
http://www.themindelectric.com/exml/

Sparta:
http://sparta-xml.sourceforge.net/

Common API for XML Pull Parsing:
http://www.xmlpull.org/

NekoPull:
http://www.apache.org/~andyc/neko/doc/pull/

Xerces Native Interface (XNI):
http://xml.apache.org/xerces2-j/xni.html

TrAX (Transformation API for XML):
http://xml.apache.org/xalan-j/trax.html

Jaxen (a Java XPath engine):
http://jaxen.org/

RELAX NG:
http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=relax-ng

About the author

Bill Venners is president of Artima Software, Inc. and editor-in-chief of Artima.com. He is author of the book, Inside the Java Virtual Machine, a programmer-oriented survey of the Java platform's architecture and internals. His popular columns in JavaWorld magazine covered Java internals, object-oriented design, and Jini. Bill has been active in the Jini Community since its inception. He led the Jini Community's ServiceUI project that produced the ServiceUI API. The ServiceUI became the de facto standard way to associate user interfaces to Jini services, and was the first Jini community standard approved via the Jini Decision Process. Bill also serves as an elected member of the Jini Community's initial Technical Oversight Committee (TOC), and in this role helped to define the governance process for the community. He currently devotes most of his energy to building Artima.com into an ever more useful resource for developers.