Sponsored Link •
Deciding upon a data format
Moreover.com offers its news feeds in several data formats, each available at a different URL. Thus, I next had to decide which data format my script should use for processing.
One data format that I did not choose, but which I'd like to mention here, is HTML. Among other data formats, Moreover.com offers an HTML Webpage full of the latest Java news. The trouble with this approach, of course, is that HTML pages are intended to be consumed by people, not programs. Although the information my Python script needs is contained in an HTML page, the page's markup tags make it difficult for programs like my script to acquire the information. Rather, HTML markup tends to focus on enabling a Web browser to render the information buried in a screen's markup, so that a human user can gaze upon the screen and pull the information into his or her brain.
In HTML, information intermingles freely with directions on presenting that information. For example, here's a snippet of HTML code from the HTML news page at Moreover.com:
<TR BGCOLOR="#ffffff"><TD><FONT FACE="Arial, Helvetica, sans-serif">
<A HREF=http://c.moreover.com/click/here.pl?j6547539 TARGET="_blank"><FONT SIZE="-1" COLOR="#333333"
><B>Java, XML to survive Sun/Microsoft war...</B></FONT></A><BR>
<A HREF=http://www.vnunet.com/ TARGET=_blank>
<FONT SIZE="-2" COLOR="#ff6600">vnunet.com</FONT></A>
<FONT SIZE="-2" COLOR="#ff6600"> Wed Apr 12 09:34:25 GMT-0700 (Pacific Daylight Time) 2000</FONT>
</TD></TR><TR BGCOLOR="#ffffff"><TD BGCOLOR="#ffffff" HEIGHT="5"></TD></TR>
Aside from the trouble of parsing out the information from all this HTML markup, a far more insidious problem exists with the parsed-HTML approach. Given that HTML pages are intended to be rendered by browsers and read by people, Webmasters have no qualms about changing their pages in ways that browsers and people can deal with, but programs cannot. So even if I decided to parse the information out of the HTML, chances are good that eventually Moreover.com's Webmaster would make a change to its Webpages' structure that would break my script.