Elliotte Rusty Harold talks with Bill Venners about strict versus forgiving XML parsing, dealing with outlier data, and growing schemas organically.
Elliotte Rusty Harold is a prolific author of books about Java and XML, and creator of the popular Java website Cafe au Lait and XML website Cafe con Leche. He contributed to the development of JDOM, a popular XML processing API for Java. His most recent book, Processing XML with Java, shows how to parse, manipulate, and generate XML from Java applications using several XML APIs, including SAX, DOM, and JDOM.
At a meeting of the New York XML SIG in September, 2002, Harold unveiled an XML processing API of his own design: the XOM (XML Object Model) API. On Cafe au Lait and Cafe con Leche, Harold described XOM like this:
Like DOM, JDOM, dom4j, and ElectricXML, XOM is a read/write API that represents XML documents as trees of nodes. Where XOM diverges from these models is that it strives for absolute correctness and maximum simplicity. XOM is based on more than two years' experience with JDOM development, as well as the last year's effort writing Processing XML with Java. While documenting the various APIs I found lots of things to like and not like about all the APIs, and XOM is my effort to synthesize the best features of the existing APIs while eliminating the worst.
In this interview, which is being published in multiple installments, Elliotte Rusty Harold discusses the strengths and weaknesses of the various XML processing APIs for Java, the design problems with existing APIs, and the design philosophy behind XOM.
Bill Venners: In your book Processing XML with Java, you wrote:
Invariably sooner or later you will encounter a document that purports to adhere to the implicit schema and indeed is very close to it, but doesn't quite match what you were assuming. Explicit validation is necessary.
To what extent should we strictly require validity when parsing documents? When should we be forgiving of invalid documents and try our best to parse them anyway?
Elliotte Rusty Harold: I'm not sure I believe anymore that explicit validation is necessary. I've become much more liberal in my understanding of XML in the last year. I'm much more willing to accept invalid content than I was. There are some forms of documents you can pass in that don't adhere to the schema, perhaps ultimately with a fallback to a human being. They don't adhere to the contract, and yet they can still be usefully processed. The big example of this right now is RSS.
There are currently four separate official RSS formats out there. The tools can pretty much handle them all. But guess what? Probably less than half of the RSS feeds out there regularly satisfy any of those four formats. Nonetheless, if you have a basic RSS reader that can find and load items and titles, that behaves sanely when something it wants isn't there rather than just crashing, then you can pretty much handle RSS as it exists, even though the documents don't really adhere to the schemas.
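As a rough illustration of the kind of forgiving reader Harold describes, here is a minimal sketch using the standard JAXP DOM API. It never validates: it looks for any element whose local name is item, whatever its namespace, and substitutes a placeholder when a title is missing. The class name LenientRssReader, the helper names, and the "(untitled)" placeholder are assumptions made for this example and do not come from the interview. The sketch still requires well-formed XML; it is lenient only about validity.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class: a reader that tolerates missing pieces.
class LenientRssReader {

    // Text of the first child element with this local name, or the fallback
    // when the child is absent or empty.
    static String childText(Element parent, String localName, String fallback) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && localName.equals(n.getLocalName())) {
                String text = n.getTextContent().trim();
                return text.isEmpty() ? fallback : text;
            }
        }
        return fallback;
    }

    // Collects one title per item without validating against any schema.
    static List<String> itemTitles(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // so local names work across RSS flavors
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // "*" matches items in any namespace as well as in none.
            NodeList items = doc.getElementsByTagNameNS("*", "item");
            List<String> titles = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                titles.add(childText((Element) items.item(i), "title", "(untitled)"));
            }
            return titles;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Malformed XML is a different problem from invalid XML.
            throw new IllegalArgumentException("Feed is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String feed = "<rss><channel>"
                + "<item><title>First post</title></item>"
                + "<item><link>http://example.com/2</link></item>"
                + "</channel></rss>";
        System.out.println(itemTitles(feed));
    }
}
```

Matching on local names rather than on a fixed document structure is what lets one small reader cope with several RSS variants at once, at the cost of occasionally accepting something that only looks like an item.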
On the other hand, you may want to use validity tests to help you decide how to process a document. You get a document and check it for validity against a schema. If it is valid, you pass it to process A. If it's not valid according to that schema, then check it against another schema. If it matches that schema, you pass it to process B. Or you transform it to the first schema and pass it to process A. Perhaps you eventually get a document that doesn't match any schema you've seen before, so you send that document off to Joe over in the IT department. You ask Joe to take a look at the document and figure out how to integrate it into your system. And maybe at the same time kick back a message to the submitter saying, "This is going to take a little longer."
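The routing idea above can be sketched with the standard javax.xml.validation API. The two inline W3C XML Schemas, the class name ValidityRouter, and the Route values are all hypothetical, invented for this illustration; real schemas would normally be loaded from files. This is one possible shape for such a dispatcher, not a description of any particular system.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical example class: validity decides where a document goes.
class ValidityRouter {

    enum Route { PROCESS_A, PROCESS_B, HUMAN_REVIEW }

    // Two tiny made-up order formats, standing in for real schemas.
    static final String SCHEMA_A =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='order'><xs:complexType><xs:sequence>"
      + "<xs:element name='sku' type='xs:string'/>"
      + "<xs:element name='quantity' type='xs:positiveInteger'/>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    static final String SCHEMA_B =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='order'><xs:complexType><xs:sequence>"
      + "<xs:element name='item' type='xs:string'/>"
      + "<xs:element name='count' type='xs:positiveInteger'/>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    // Schema objects are thread-safe, so each is compiled once.
    static final Schema A = compile(SCHEMA_A);
    static final Schema B = compile(SCHEMA_B);

    static Schema compile(String xsd) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            return factory.newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalStateException("Schema itself is broken", e);
        }
    }

    // True if the document validates; invalid or malformed input yields false.
    static boolean isValid(Schema schema, String xml) {
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Try each known schema in turn; anything unrecognized goes to a person.
    static Route route(String xml) {
        if (isValid(A, xml)) return Route.PROCESS_A;
        if (isValid(B, xml)) return Route.PROCESS_B;
        return Route.HUMAN_REVIEW;
    }

    public static void main(String[] args) {
        System.out.println(route("<order><sku>X1</sku><quantity>3</quantity></order>"));
        System.out.println(route("<order><item>X1</item><count>3</count></order>"));
        System.out.println(route("<invoice/>"));
    }
}
```

In this sketch a validation failure is an ordinary result that selects the next branch, not an error that stops processing. The alternative Harold mentions, transforming a schema B document into schema A before handing it to process A, would slot in at the PROCESS_B branch.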