The Artima Developer Community
Organic Schemas and Outlier Data
A Conversation with Elliotte Rusty Harold, Part IX
by Bill Venners
October 6, 2003

<<  Page 2 of 3  >>


Dealing with Outlier Data

Bill Venners: In your book Processing XML with Java, you wrote:

The real world does not always fit into neatly typed categories. There's almost always some outlier data that just doesn't fit the schema.

Elliotte Rusty Harold: Yes, and I found a very interesting example of outlier data. I did not go looking for this. I knew that flat data in comma-delimited files often flips columns, has missing fields and corrupt data. This happens in practice all the time. But the specific case I looked at in that chapter had an even more interesting example: a year that only contained three months.

In the late 70s, the U.S. government shifted its fiscal year forward one quarter. They still had to have budgetary data for the three-month period in between the two fiscal years, so they created a year that contains only three months. It's called the Transitional Quarter. The data goes: 1975, 1976, TQ, 1977. This is the sort of thing you have to deal with in real-world data. It happens all the time. The real world is not as simple as the schema you design. There's always outlier data.
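To make the Transitional Quarter case concrete, here is a minimal Java sketch of reading a period column without assuming every value is a four-digit year. The class name FiscalPeriod and its parse method are hypothetical and are not taken from the book. The sketch accepts "TQ" as a legitimate three-month period and reports anything it does not recognize as absent instead of guessing.

```java
import java.util.Optional;

// Hypothetical helper for illustration only; not code from the book.
final class FiscalPeriod {
    final String label;  // the original text, for example "1976" or "TQ"
    final int months;    // 12 for an ordinary fiscal year, 3 for the Transitional Quarter

    private FiscalPeriod(String label, int months) {
        this.label = label;
        this.months = months;
    }

    // Returns an empty Optional for anything that is neither a
    // four-digit year nor the literal "TQ".
    static Optional<FiscalPeriod> parse(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String s = raw.trim();
        if (s.equals("TQ")) {
            return Optional.of(new FiscalPeriod(s, 3));
        }
        if (s.matches("\\d{4}")) {
            return Optional.of(new FiscalPeriod(s, 12));
        }
        return Optional.empty();
    }
}
```

A caller would then have to decide explicitly what to do with an empty result, which is the point: the outlier is handled on purpose and not by accident.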

Bill Venners: The last thing you write in that paragraph is:

You cannot assume the data actually adheres to its schema, either implicit or explicit.

So what do you do about that? How do you deal with it?

Elliotte Rusty Harold: You verify it. You check to see if each document does adhere to the schema. You don't just take the data in blindly, assuming it is valid, and then have your program crash with a NullPointerException because a field is missing. You always check for values that are null. Anything that could go wrong, you should assume will go wrong.
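One way to picture this kind of checking is the following Java sketch, which uses the standard DOM API. The class name BudgetRecordCheck, the method names, and the element names in the usage note are assumptions made for illustration. The idea is to test for a missing or empty element before using it, and to fail with a descriptive message instead of a NullPointerException.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
final class BudgetRecordCheck {

    // Returns the trimmed text of the first element with the given name,
    // or throws IllegalArgumentException if it is missing or empty.
    static String requiredText(Document doc, String elementName) {
        // NodeList.item returns null when the index is out of range,
        // so a missing element shows up here as null.
        Node node = doc.getElementsByTagName(elementName).item(0);
        if (node == null) {
            throw new IllegalArgumentException("Missing required element: " + elementName);
        }
        String text = node.getTextContent().trim();
        if (text.isEmpty()) {
            throw new IllegalArgumentException("Empty required element: " + elementName);
        }
        return text;
    }

    // Parses a string into a DOM tree, wrapping any parser failure
    // in an unchecked exception so callers get a single error type.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document", e);
        }
    }
}
```

For example, calling requiredText on a document that has a Year element but no Agency element returns the year text for "Year" and throws for "Agency", with a message that names the missing field.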

I've noticed that a lot of applications—Microsoft Word, Adobe Illustrator, and various other major programs—occasionally crash because they attempt to open a corrupt file. Something went wrong. The power went down when the file was saved. A zip program mangled the file. Files can get corrupted in many ways. But even though the data in a file is bad, the program should not crash. It should deal sanely with that failure.

One of the nice features of XML processing is that the XML parser can help you deal with corrupt data. Parsers are designed to check for problems, both well-formedness and validity errors. No matter how corrupt a file is, it should not crash the XML parser. The parser will simply notice the corrupted data, report it as a well-formedness error, and throw the appropriate exception. The program can then proceed to handle it. The program shouldn't overwrite memory, or set a positive number to the value -200,000, or anything else that's going to cause real problems. If there is genuine corrupted data when you're using an XML format, the parser will catch it.
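The behavior described here can be sketched in Java with the standard JAXP DocumentBuilder. The class name WellFormednessCheck and the method firstError are hypothetical. The sketch only shows that malformed input surfaces as a SAXParseException that the program can catch and report, leaving the rest of the program free to continue.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper for illustration only.
final class WellFormednessCheck {

    // Returns an empty Optional if the text parses as well-formed XML,
    // otherwise a short description of the problem the parser reported.
    static Optional<String> firstError(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return Optional.empty();
        } catch (SAXParseException e) {
            // The parser noticed malformed input and says where it stopped.
            return Optional.of("line " + e.getLineNumber() + ": " + e.getMessage());
        } catch (SAXException | IOException | ParserConfigurationException e) {
            // Other parser, I/O, or configuration failures are reported too.
            return Optional.of(e.toString());
        }
    }
}
```

Given the input `<a><b></a>`, where the b element is never closed, firstError returns a description of the error instead of letting the failure propagate. This sketch checks well-formedness only; checking validity against a DTD or schema would need additional parser configuration.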

Copyright © 1996-2018 Artima, Inc. All Rights Reserved.