I decided it was time to validate the HTML on my site, but wanted an integrated solution that would flag problems during the build process.
I generate my website using a local servlet container and JSP pages that convert text source to HTML pages, then I upload all the pages to the server.
Inspired by reading Cleaning Your Web Pages with HTML Tidy, I decided it was about time I had my HTML validated. But I wanted to do it as an integral part of the build process, not as an afterthought. That way, if HTML errors crept into the pages for whatever reason, they would be flagged immediately. It turned out to be extremely easy to do.
First off, I already build my pages locally using a Java program which connects to my local servlet container, asks for each page, and stores the response in a local file. This gives me a dynamic page display process for building my pages, with all the power and flexibility of servlets and JSPs. The result is a set of static pages which I can upload to my internet site, JavaPerformanceTuning.com, so that pages download extremely fast.
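The fetch-and-store step described above can be sketched roughly as follows. This is not my actual build program, just a minimal illustration: the PageFetcher class and the fetchToFile method are hypothetical names, and the URL of each page is assumed to be known to the caller.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration; the names are not from the original build program.
final class PageFetcher {
    /** Requests one page from pageUrl and stores the response body in dest, replacing any old copy. */
    static void fetchToFile(URL pageUrl, Path dest) throws IOException {
        try (InputStream in = pageUrl.openStream()) {
            Files.copy(in, dest, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

A build loop would call fetchToFile once per page, pointing it at the local servlet container (for example a URL on localhost, which is an assumption here), and then validate the file it has just written.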
So all I had to do to add HTML validation was add one method to my build process. Once each page is complete and saved to a local file, the build now calls a new validateHTML(File destinationfile) method.
My validateHTML method calls the "Tidy" executable on the newly created HTML file (Tidy validates and corrects HTML, and is available here). Then I check Tidy's output for anything I'm interested in. If there is a problem, I throw an exception.
I use Process to execute Tidy as an external process. I could read Tidy's stdout and stderr directly from the program, but there is no need: it is much simpler to have Tidy write them to files and then check those files. I don't actually use Tidy's HTML output for my web pages; I'm using it only as a validator. It is worth noting that the W3C has a validator at http://validator.w3.org/ if you only need to check some pages, but in my case I wanted to have all my pages checked each time I rebuilt the site.
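For anyone who prefers ProcessBuilder to a single command string, launching Tidy might look something like the sketch below. The TidyRunner class and its method names are hypothetical; the only Tidy options used are -o (corrected HTML file) and -f (file for warnings and errors), the same two used in my command.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper for illustration; only the -o and -f options come from the article.
final class TidyRunner {
    /** Builds the argument list: corrected HTML goes to fixedOut (-o), messages go to report (-f). */
    static List<String> tidyCommand(String tidyExecutable, File html, File fixedOut, File report) {
        return Arrays.asList(tidyExecutable, "-o", fixedOut.getPath(), "-f", report.getPath(), html.getPath());
    }

    /** Runs Tidy, waits for it to finish, and returns its exit code. */
    static int runTidy(String tidyExecutable, File html, File fixedOut, File report)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(tidyCommand(tidyExecutable, html, fixedOut, report));
        // Pass any remaining console output straight through, so an unread pipe can never block Tidy.
        pb.inheritIO();
        return pb.start().waitFor();
    }
}
```

Passing the arguments as a list rather than one string means a file name containing spaces still reaches Tidy as a single argument.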
I am only interested in the line notification warnings and errors that Tidy emits, so I use a regular expression to detect and parse those lines. In addition, there are some warnings that I don't care to fix at the moment, so I have added the ability to ignore those, either on a per-file basis or globally (see the two entries in the TidyNoficationsToIgnore HashMap for examples).
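Taken on its own, the parse-and-ignore step can be sketched like this. The regular expression is the one I use; the TidyReportFilter class, the reportable method and the use of a Set instead of a HashMap are assumptions made purely for the illustration.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class for illustration; only the regular expression is taken from the article.
final class TidyReportFilter {
    static final Pattern LINE_NOTIFICATION =
        Pattern.compile("^line\\s+(\\d+)\\s+column\\s+(\\d+)\\s+\\-\\s+(.*)$");

    // A plain message is ignored everywhere; "file+message" is ignored for that file only.
    static final Set<String> IGNORED = new HashSet<>();
    static {
        IGNORED.add("Warning: trimming empty <p>");
        IGNORED.add("newsletter013.shtml+Warning: discarding unexpected </p>");
    }

    /** Returns a description of a reportable notification, or null if the line is not one or is ignored. */
    static String reportable(String fileName, String reportLine) {
        Matcher m = LINE_NOTIFICATION.matcher(reportLine);
        if (!m.matches()) {
            return null; // summary or informational line, not a "line N column M - ..." notification
        }
        String message = m.group(3);
        if (IGNORED.contains(message) || IGNORED.contains(fileName + '+' + message)) {
            return null;
        }
        return "line " + m.group(1) + ", column " + m.group(2) + ": " + message;
    }
}
```

With these entries, an empty-paragraph warning is dropped for every file, while the unexpected </p> warning is dropped only for newsletter013.shtml and still reported for any other page.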
Finally, if I do find a problem, I like to print the error and the relevant line from the HTML file so that I can see where it is and what to fix.
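Pulling out the offending line and the one before it needs only a short loop. The sketch below uses a hypothetical LineContext class and assumes the generated pages are UTF-8 encoded, which may not be true of every site.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; UTF-8 is an assumption about the page encoding.
final class LineContext {
    /**
     * Returns {previous line, reported line} for a 1-based line number.
     * An entry is null when the file has no such line (for example, before line 1).
     */
    static String[] around(Path file, int lineNumber) throws IOException {
        String previous = null;
        String current = null;
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            for (int i = 0; i < lineNumber; i++) {
                previous = current;
                current = in.readLine();
            }
        }
        return new String[] { previous, current };
    }
}
```

Tidy reports line numbers starting at 1, so reading lineNumber lines leaves the reported line in the second slot and the line before it in the first.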
Here's the code in case anyone else needs to resolve this problem in a similar way. If you have problems getting Tidy to execute, it's probably a path issue so you might try using the path to the executable in the command, e.g. .\Tidy or ./Tidy
//Note I am putting this code fragment in the public domain
import java.io.*;
import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class HTMLValidator //the class name is just a wrapper so the fragment compiles
{
  public static final Pattern TidyHTMLLineNotification = Pattern.compile("^line\\s+(\\d+)\\s+column\\s+(\\d+)\\s+\\-\\s+(.*)$");
  static HashMap<String, Boolean> TidyNoficationsToIgnore = new HashMap<String, Boolean>();
  static
  {
    TidyNoficationsToIgnore.put("newsletter013.shtml+Warning: discarding unexpected </p>", Boolean.TRUE);
    TidyNoficationsToIgnore.put("Warning: trimming empty <p>", Boolean.TRUE); //always ignore
  }
  public static void validateHTML(File destinationfile)
    throws IOException, InterruptedException
  {
    //Stdout to tt.txt, stderr to t2.txt.
    //tt.txt contains fixed HTML if you want it.
    //t2.txt contains Tidy's warnings and errors
    String command = "Tidy -o tt.txt -f t2.txt " + destinationfile;
    //Run Tidy as an external process and wait until it has written both files
    Runtime.getRuntime().exec(command).waitFor();
    String line;
    try (BufferedReader rdr = new BufferedReader(new FileReader("t2.txt")))
    {
      while( (line = rdr.readLine()) != null)
      {
        //Only interested in lines beginning with "line"
        if (!line.startsWith("line "))
          continue;
        Matcher m = TidyHTMLLineNotification.matcher(line);
        if (!m.matches()) //groups are only available after a successful match
          continue;
        String linenumstr = m.group(1);
        String colnum = m.group(2);
        String message = m.group(3);
        if ( (TidyNoficationsToIgnore.get(message) != Boolean.TRUE) &&
             (TidyNoficationsToIgnore.get(destinationfile.toString()+'+'+message) != Boolean.TRUE) )
        {
          //line number in destinationfile of problem. Read the file
          //and get that line and the line before
          int linenum = Integer.parseInt(linenumstr);
          String l2 = null, l1 = null;
          try (BufferedReader rdr2 = new BufferedReader(new FileReader(destinationfile)))
          {
            for (int i = 0; i < linenum; i++)
            {
              l1 = l2;
              l2 = rdr2.readLine();
            }
          }
          throw new IOException("HTML Validation Problem Identified by Tidy in file " + destinationfile + ": line " +
            linenum + " column " + colnum + " / " + message + System.getProperty("line.separator") + l1 +System.getProperty("line.separator") + l2);
        }
      }
    }
  }
}
Have you got your own solutions to this or other website build problems? Tell us.