The Artima Developer Community

Plain Text and XML
A Conversation with Andy Hunt and Dave Thomas, Part X
by Bill Venners
May 5, 2003

<<  Page 3 of 4  >>


Parsing with Partial Knowledge

Bill Venners: There's something that's been bothering me about the way people have been raving about XML. One of the big claims is that because XML data is self-describing, with data wrapped by tags like <customerid>12345</customerid>, clients can figure things out even for documents that don't strictly adhere to their schema and specification. I hear claims that XML is more flexible, because providers of documents can be sloppy and just add new pieces of data here and there. Clients can just ignore tags they don't recognize and find data even if it is in the wrong place according to the schema.

The Java class file is not XML, but, like XML, it is a data structure and file format. There is a detailed specification for the Java class file that describes all the data and semantics, and also clearly defines the way in which class files can be extended. Providers and consumers of Java class files adhere strictly to the specification. This approach of strict compliance with a specification and schema makes more sense to me.

I like what you have said about self-describing data, but I'm concerned about the leap that some XML enthusiasts seem to make: because the data is self-describing, they conclude that the way in which a particular schema can evolve doesn't have to be clearly specified or followed, because they assume clients will just ignore anything they don't understand.
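As an illustration of the tolerant-client behavior described in this question, here is a minimal Python sketch. It is an assumption-laden example, not something from the interview: the tag names other than `customerid`, the `KNOWN_TAGS` set, and the `read_known_fields` helper are all hypothetical. The client picks up the tags it recognizes wherever they appear in the document and silently skips everything else.

```python
import xml.etree.ElementTree as ET

# Hypothetical set of tags this client understands; anything else is ignored.
KNOWN_TAGS = {"customerid", "name"}


def read_known_fields(xml_text):
    """Collect the text of recognized tags, wherever they sit in the tree."""
    root = ET.fromstring(xml_text.strip())
    fields = {}
    # iter() walks the whole tree in document order, so nesting depth
    # and element position do not matter to this client.
    for element in root.iter():
        if element.tag in KNOWN_TAGS and element.text is not None:
            # Keep the first occurrence of each recognized tag.
            fields.setdefault(element.tag, element.text.strip())
    return fields


# A provider has added <loyaltytier> and wrapped the customer data in a
# new <customer> element that this client was never told about.
document = """
<order>
  <loyaltytier>gold</loyaltytier>
  <customer>
    <customerid>12345</customerid>
    <name>Pat Example</name>
  </customer>
</order>
"""

print(read_known_fields(document))
```

The sketch also shows the risk raised in the question: the client keeps working, but nothing in it says whether the unrecognized `loyaltytier` element was safe to ignore. Only a specification of how the format may evolve can answer that.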

You write in your book, "You can parse a plain text file with only partial knowledge of its format." How often do we lose the format specification? Or is this more about not needing to "read the manual" (the specification) because the data is more user-friendly?

Dave Thomas: Oh no, it's not so you don't have to read the manual. It's that, if all you have is a pile of data, I'm sure you'd much rather have something in there that gives you some hints to the semantics, as well as just the data itself.

Andy Hunt: We mean using partial knowledge of the format in a forensic sense. You want to go back and dig out account numbers. If the data is tagged such that you can see which pieces of data are account numbers, it becomes a much easier job than just having to dig through a bunch of numbers.
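The forensic use Andy describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample dump, its tag names, and the `find_account_numbers` helper are invented for the example. The point is that the code knows one thing about the format, namely that account numbers are wrapped in a tag whose name starts with "acct" or "account", and needs no knowledge of the other fields.

```python
import re

# Assumption: account numbers sit inside a tag whose name starts with
# "acct" or "account" (any case). Nothing else about the format is known.
TAGGED_ACCOUNT = re.compile(
    r"<(acc(?:oun)?t\w*)>\s*([^<]+?)\s*</\1>",
    re.IGNORECASE,
)


def find_account_numbers(raw_text):
    """Return the contents of every account-like tag found in the text."""
    return [match.group(2) for match in TAGGED_ACCOUNT.finditer(raw_text)]


# Hypothetical leftover data from a system whose documentation is gone.
dump = (
    "<rec><xq7>0042</xq7><AccountNo>778-1203</AccountNo><bal>19.50</bal></rec>\n"
    "<rec><acct> 555-0001 </acct><zz>??</zz></rec>"
)

print(find_account_numbers(dump))  # ['778-1203', '555-0001']
```

Without the tags, the same data would be a run of values such as `0042 778-1203 19.50`, and nothing in the file would say which of them is the account number.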

Bill Venners: So the metadata makes the data itself more programmer-friendly. I don't have to go to the manual. It's like there's a miniature, really terse manual in the data itself.

Dave Thomas: Yes, and I think you're also assuming there's a manual.

Bill Venners: Well, that's part of what I'm asking. How often is there no manual?

Dave Thomas: Most of the time there is no manual. If I give you a Word 1 file, where's the manual? If I ship you the output of my stock controller system, where's the manual? If I'm gone, if my program's gone, what are you going to do with that file? There are terabytes of data sitting around in an unusable state, because the software that reads them is gone. Yes, you could probably sit there and reverse engineer it, but it would be a whole lot easier to reverse engineer it if it were self-describing.


