Pragmatic Programmers Andy Hunt and Dave Thomas talk with Bill Venners about the value of storing persistent data in plain text and the ways they feel XML is being misused.
Andy Hunt and Dave Thomas are the Pragmatic Programmers, recognized internationally as experts in the development of high-quality software. Their best-selling book of software best practices, The Pragmatic Programmer: From Journeyman to Master (Addison-Wesley, 1999), is filled with practical advice on a wide range of software development issues. They also authored Programming Ruby: A Pragmatic Programmer's Guide (Addison-Wesley, 2000), and helped to write the now famous Agile Manifesto.
In this interview, which has been published in ten weekly installments, Andy Hunt and Dave Thomas discuss many aspects of software development.
Bill Venners: In your book, The Pragmatic Programmer, you write, "We believe the best format for storing knowledge persistently is plain text." Why? What are the advantages? What are the costs?
Dave Thomas: Does it ever happen to you that someone sends you a Microsoft Word file?
Bill Venners: It happens all the time.
Dave Thomas: A Word file that you can't open?
Bill Venners: No, because I have Microsoft Word. One of the main reasons I have Word and Excel on my Macintosh is because people send me Word and Excel files all the time and I need to be able to open them.
Dave Thomas: Well that's funny, because I have Word on my Macintosh. I have the very latest Word, and yesterday I received a Word document that it won't open.
Andy Hunt: This problem also happens between Word 97 and later versions for Windows, not just between, say, Word 97 and the Macintosh version of Word.
Dave Thomas: The problem is, once we store data in a non-transparent, inaccessible format, then we need code to read it, and that code disappears. Code is disappearing all the time. You probably can't go to a store and ask for a copy of Word 1, or whatever the first version of Word was called. So we are losing vast quantities of information, because we can no longer read the files.
One of the reasons we advocate using plain text is so information doesn't get lost when the program goes away. Even though a program has gone away, you can still extract information from a plain text document. You may not be able to make the information look like the original program would, but you can get the information out. The process is made even easier if the format of the plain text file is self-describing, such that you have metadata inside the file that you can use to extract the actual semantic meaning of the data in the file. XML is not a particularly good way to do this, but it's currently the plain text transmission medium du jour.
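As a rough illustration of what "self-describing" can mean in practice, here is a minimal Python sketch. The file layout, the field names, and the read_records helper are all assumptions made up for this example, not anything from the interview. The only point is that a header line naming the fields lets a generic tool recover the meaning of each value without the program that wrote the file.

```python
import csv
import io

# Hypothetical record file: the first line names the fields, so the data
# explains itself even if the program that wrote it is long gone.
SAMPLE = """\
name,role,joined
Ada,engineer,1999-04-01
Grace,architect,2000-11-15
"""

def read_records(text):
    """Return each row as a dict keyed by the field names in the header line."""
    return list(csv.DictReader(io.StringIO(text)))

records = read_records(SAMPLE)
for record in records:
    print(record["name"], "joined on", record["joined"])
```

Nothing here depends on the original application: any language with a text reader could do the same job.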
Another reason for using plain text is it allows you to write individual chunks of code that cooperate with each other. One of the classic examples of this is the Unix toolset: a set of small sharp tools that you can join together. You join them by feeding the plain text output of one into the plain text input of the next. There's no concept of trying to make sure the word count program outputs things in a format that's compatible with the next tool in the chain. It's just plain text to plain text, and that's a very powerful way to do it.
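The same idea can be sketched outside the shell. The Python below is only an analogy to a Unix pipeline, with made-up log lines and hypothetical helper names (grep, cut_field, count_unique): each stage takes lines of plain text and produces lines of plain text, so the stages can be chained without knowing anything about one another.

```python
from collections import Counter

def grep(lines, word):
    """Keep only the lines that contain word (a tiny grep)."""
    return (line for line in lines if word in line)

def cut_field(lines, index):
    """Emit one whitespace-separated field from each line (like cut or awk)."""
    return (line.split()[index] for line in lines)

def count_unique(lines):
    """Emit 'count value' lines, most frequent first (like sort | uniq -c | sort -rn)."""
    for value, count in Counter(lines).most_common():
        yield f"{count} {value}"

# Hypothetical log lines, invented for this example.
LOG = [
    "10:01 ERROR disk full",
    "10:02 INFO started",
    "10:03 ERROR disk full",
    "10:04 ERROR timeout",
]

# Text in, text out at every stage, so the tools compose freely.
report = list(count_unique(cut_field(grep(LOG, "ERROR"), 2)))
for line in report:
    print(line)
```

Because every stage speaks plain text, any one of them could be swapped for a different tool without touching the others.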
Andy Hunt: Virtually any program that's going to operate on text of some sort can operate on plain text as the lowest common denominator. Very often you get into a state where you want to work with some program, but its properties file has gotten corrupted such that the program won't even come up to let you change the property. If that file is in some binary format that needs the program itself to fix it, you're hosed. You've catch-22ed yourself right out of existence. If it's in a plain text format, you can go in with any generic tool—a text editor, whatever you like to use to deal with plain text—and fix the problem. So in terms of emergency recovery, or changes in the field, plain text is helpful. It provides another level of insurance.
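To make the recovery scenario concrete, here is a small Python sketch. It assumes a simple key=value properties format and an invented damaged file; the salvage_properties helper is hypothetical. It shows how a generic script, with no help from the original program, can keep the readable settings and point to the lines that need fixing by hand.

```python
def salvage_properties(text):
    """Split a key=value file into readable settings and suspect lines."""
    good, bad = {}, []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comments are harmless
        key, sep, value = stripped.partition("=")
        key = key.strip()
        if sep and key and key.isprintable():
            good[key] = value.strip()
        else:
            bad.append((number, line))  # report it so a person can repair it
    return good, bad

# Hypothetical damaged file: line 2 has been overwritten with junk bytes.
DAMAGED = "window.width=800\n\x00\x07garbage\nwindow.height=600\ntheme=dark\n"

settings, suspect = salvage_properties(DAMAGED)
print("recovered:", settings)
print("needs attention:", suspect)
```

With a binary format, the equivalent repair would usually need the very program that refuses to start.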
Dave Thomas: Earlier in the interview (see Resources), I was talking about putting abstractions into code, specifics into metadata. We will be handing the programs we're writing today down to the next generation of programmers, and the ones after that. They will have to deal with this mess we've left behind. If we give them a load of gibberish consisting of binary data, they're going to have a harder time understanding it. If we give them nice plain text or XML files, it will be a lot easier to understand. Plain text will obviously require less mental energy to figure out.