Looking at XML
The document-style format that looked most promising to me was Moreover.com's XML feed. XML was designed to enable just the kind of software parsing I wanted to do in my Python script. In an XML document, in contrast to one in HTML, information and presentation are cleanly separated. The information contained in the document is marked up in tags that, rather than describing how the information should be presented, hint at the semantic meaning of the information. For example, here's a snippet of XML code from the XML feed at Moreover.com:
<headline_text>Java, XML to survive Sun/Microsoft war</headline_text>
<harvest_time>Apr 12 2000 4:34PM</harvest_time>
Directions on how to present the information contained in the XML document's semantic tags can be defined separately, using a style markup language such as CSS or XSL. In the Moreover.com case, the XML document is intended to be consumed only by programs, not by people, so no style markup is provided. Nevertheless, the primary reason my Python script could parse the XML feed more easily than the HTML feed is that XML is designed to avoid HTML's intermingling of information and presentation.
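To make the point about semantic tags concrete, here is a minimal sketch of how a program might read the two elements shown above. It is not the script described in this article. It uses the xml.etree.ElementTree module from the current Python standard library, and the enclosing <article> element and the article_to_dict helper are assumptions added for illustration, since the snippet above does not show its parent element.

```python
import xml.etree.ElementTree as ET

# The two elements come from the snippet above. The <article> wrapper is
# hypothetical: a parser needs a single root element, and the snippet
# does not show what the feed actually wraps these tags in.
SNIPPET = """<article>
<headline_text>Java, XML to survive Sun/Microsoft war</headline_text>
<harvest_time>Apr 12 2000 4:34PM</harvest_time>
</article>"""


def article_to_dict(xml_text):
    """Map each child tag name of the root element to its text content."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


article = article_to_dict(SNIPPET)
print(article["headline_text"])
print(article["harvest_time"])
```

Because the tag names describe what the data is rather than how it should look, the program can pick out the headline and the harvest time by name, with no presentation markup to work around.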
Settling on tab-separated values
I liked the XML approach, but unfortunately I was unable to figure out quickly enough how to work with XML in Python. All I wanted to do was pass a chunk of XML to some library routine, get back a nice data structure corresponding to the XML document, and use it to effortlessly write out the news page. I was (and still am) on the Python learning curve, and as I was rooting around in the Python documentation looking for my desired library routine, I noticed that Moreover.com also offered a tab-separated value (TSV) feed. At that point I paused and said to myself, "Self, if you just use this TSV feed, then you can get this job done right now." For reasons of speed, therefore, I abandoned my search for the elusive XML-to-data-structure Python library routine and completed my script using the TSV feed.
Here's one line from the TSV feed at Moreover.com. (The single line is split into three lines with \\, and tabs are replaced with \t here, but not in the actual feed.)
Java, XML to survive Sun/Microsoft war\tvnunet.com\ttext\t\\
Java news\t \thttp://www.vnunet.com/\tApr 12 2000 4:34PM\t \t
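A line like this can be taken apart with a single split on the tab character. The sketch below is an illustration rather than the actual script: the sample line is rejoined from the pieces shown above, and the variable names (headline, source, url, harvest_time) and the parse_tsv_line helper are assumptions based on what each field appears to contain, not on any documented column layout for the feed.

```python
# One feed line, rejoined from the sample above with real tab characters.
SAMPLE_LINE = (
    "Java, XML to survive Sun/Microsoft war\tvnunet.com\ttext\t"
    "Java news\t \thttp://www.vnunet.com/\tApr 12 2000 4:34PM\t \t"
)


def parse_tsv_line(line):
    """Split one feed line on tabs and strip whitespace from each field."""
    return [field.strip() for field in line.rstrip("\n").split("\t")]


fields = parse_tsv_line(SAMPLE_LINE)

# The field positions below are guesses read off the sample line.
headline = fields[0]
source = fields[1]
url = fields[5]
harvest_time = fields[6]

print(headline, "|", source, "|", url, "|", harvest_time)
```

Fields that hold only a space in the feed come back as empty strings after stripping, so a script can skip them with a simple truth test.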