I thought people might be interested in some very basic Apache hacking that I use to build websites. The idea: you have a bunch of XML on your hard disk and you want to use XSLT to produce XHTML pages for people to view - on the fly at request time, not with some 'build' process.
The ingredients are these: Apache's HTTP daemon, xsltproc (libxslt), your XML, your XSLT and a programming language such as Bash. I've used Ruby, Python, Bash and Smalltalk for this purpose on occasion, but I always start with Bash and promote up to a 'real' programming language from there.
To start with, we still want to be able to serve up regular content on our website - it's just the XML that needs to be treated differently. So let's say that any time Apache cannot find what it's looking for, it'll throw a 404. Let's catch that 404 and do our own work.
In your .htaccess file, you can add a simple directive:

ErrorDocument 404 /cgi-bin/website.cgi
This rule works for all subdirectories too, which means you can still have a full-featured directory structure. Only missing files - such as all your .html pages - will fall through to the script.
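If your host hasn't already wired up CGI for you, a couple of extra .htaccess lines along these lines can do it - assuming your AllowOverride settings permit them; with a standard ScriptAlias'd cgi-bin you don't need them at all:

Options +ExecCGI
AddHandler cgi-script .cgi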
Now we write the script itself. Make sure website.cgi is executable and recognised as a CGI script in your cgi-bin directory. We need to cover some basics first. For this example, which can obviously be expanded on, we'll just assume the reason someone got here is that they were looking for some of our XML.
So let's turn that .html into an .xml using a regular expression.
DOCUMENT=`echo "$REDIRECT_URL" | sed 's#\.html$#.xml#'`
That's pretty easy. So now we can dump out our XML to begin with: cat $DOCUMENT
That'll fail, naturally; we need to put out all the normal headers first:
echo "Status: 200 OK"
echo "Content-Type: text/html"
echo
Now let's get a bit trickier. What if the .xml file doesn't exist either? We should serve a real 404:
if test -f "$DOCUMENT"; then
    echo "Status: 200 OK"
    echo "Content-Type: text/html"
    echo
    cat "$DOCUMENT"
else
    echo "Status: 404 Not Found"
    echo "Content-Type: text/plain"
    echo
    echo "404 - not found, oops"
fi
We can get fancier and support 304 Not Modified by using the last-modified timestamp of the .xml file and checking $HTTP_IF_MODIFIED_SINCE against it, also putting out Last-Modified in the headers.
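A rough sketch of that check, assuming GNU date (its -r flag reads a file's mtime); a plain string comparison is usually good enough, since browsers echo back the exact Last-Modified value we sent:

LAST_MODIFIED=`date -u -r "$DOCUMENT" '+%a, %d %b %Y %H:%M:%S GMT'`
if test "$HTTP_IF_MODIFIED_SINCE" = "$LAST_MODIFIED"; then
    echo "Status: 304 Not Modified"
    echo
else
    echo "Status: 200 OK"
    echo "Last-Modified: $LAST_MODIFIED"
    echo "Content-Type: text/html"
    echo
    cat "$DOCUMENT"
fi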
One last step, really: which XSL do we use? I do a very basic check inside my XML for an xmlns=, then match that up against an .xsl filename in a file I call lookandfeels.
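Each line of lookandfeels simply maps a namespace to a stylesheet path, separated by a space - these entries are made up for illustration:

http://www.example.com/ns/weblog /var/www/xsl/weblog.xsl
http://www.example.com/ns/photos /var/www/xsl/photos.xsl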
NAMESPACE=`grep 'xmlns=' "$DOCUMENT" | sed 's#.*xmlns="\([^"]*\)".*#\1#'`
STYLESHEET=`grep "$NAMESPACE" "$LOOKANDFEEL" | sed 's#.* ##'`
xsltproc --xinclude "$STYLESHEET" "$DOCUMENT"
So we do that instead of the cat $DOCUMENT, and now we're serving up XHTML using our XSLT and xsltproc.
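Putting the pieces together, a minimal website.cgi might look something like this - prefixing $REDIRECT_URL with $DOCUMENT_ROOT and the location of lookandfeels are my own assumptions about the layout:

#!/bin/bash
# minimal sketch: map the missing .html back to its .xml source
LOOKANDFEEL="$DOCUMENT_ROOT/lookandfeels"
REQUEST=`echo "$REDIRECT_URL" | sed 's#\.html$#.xml#'`
DOCUMENT="$DOCUMENT_ROOT$REQUEST"

if test -f "$DOCUMENT"; then
    # pick a stylesheet by matching the document's namespace in lookandfeels
    NAMESPACE=`grep 'xmlns=' "$DOCUMENT" | sed 's#.*xmlns="\([^"]*\)".*#\1#'`
    STYLESHEET=`grep "$NAMESPACE" "$LOOKANDFEEL" | sed 's#.* ##'`
    echo "Status: 200 OK"
    echo "Content-Type: text/html"
    echo
    xsltproc --xinclude "$STYLESHEET" "$DOCUMENT"
else
    echo "Status: 404 Not Found"
    echo "Content-Type: text/plain"
    echo
    echo "404 - not found"
fi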
There are many things we can do next. For example, we can cache our result to a tmp directory. We then touch both the tmp file and the original document so their timestamps match, and on every subsequent request we check that the two timestamps still match. If they do, we can keep serving our cached version and not run the processor again until the source has changed.
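A rough sketch of that idea - the cache location is my own choice, and I copy the source's timestamp onto the cached copy with touch -r rather than touching both files; test's -nt operator is true when the left file is newer than the right one (or the right one doesn't exist):

CACHE="/tmp/xmlcache$REDIRECT_URL"
CACHEDIR=`dirname "$CACHE"`
mkdir -p "$CACHEDIR"

# regenerate only when the source is newer than the cached copy
if test "$DOCUMENT" -nt "$CACHE"; then
    xsltproc --xinclude "$STYLESHEET" "$DOCUMENT" > "$CACHE"
    touch -r "$DOCUMENT" "$CACHE"
fi
cat "$CACHE"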
We can get more sophisticated and check whether the stylesheet has changed since the transform as well. We can even use grep to look for xsl:import, xsl:include and document() statements and treat them as dependencies too. But to do all that we'd need to record this information in a separate file, and we've really stepped outside the league of nice, simple bash scripting.
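Just to illustrate the idea, carrying on from the caching sketch above and assuming each import or include sits on its own line with an href="..." attribute (document() calls would need similar treatment):

XSLDIR=`dirname "$STYLESHEET"`
STALE=no
for DEP in `grep -E '<xsl:(import|include)' "$STYLESHEET" | sed 's#.*href="\([^"]*\)".*#\1#'`; do
    if test "$XSLDIR/$DEP" -nt "$CACHE"; then
        STALE=yes
    fi
done
# STALE=yes means the cached copy needs regenerating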
Another trick I like is to first check for a .sh file instead of an .xml file. If there's a shell script, run it to get the XML content. This lets me have scripts running behind the XSLT processor: if the script changes, then its content is out of date and it all runs again. A sister script to this one would do the dependency checking.
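A rough take on it, with the temp-file plumbing being my own choice:

SCRIPT=`echo "$DOCUMENT" | sed 's#\.xml$#.sh#'`
TMPXML=""
if test -x "$SCRIPT"; then
    # run the script and treat its output as the document to transform
    TMPXML=/tmp/website.$$.xml
    "$SCRIPT" > "$TMPXML"
    DOCUMENT="$TMPXML"
fi
NAMESPACE=`grep 'xmlns=' "$DOCUMENT" | sed 's#.*xmlns="\([^"]*\)".*#\1#'`
STYLESHEET=`grep "$NAMESPACE" "$LOOKANDFEEL" | sed 's#.* ##'`
xsltproc --xinclude "$STYLESHEET" "$DOCUMENT"
if test -n "$TMPXML"; then rm -f "$TMPXML"; fi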
All of these caching tricks sure sound nice - but do you really need them? Unless you're getting hit with a million hits a day... well, a million hits a day is only 41,666 hits an hour, which is 694 hits a minute, which is about 11 hits a second. Most dynamic applications can serve up 20 documents a second with ease. The stuff I have at work does over a hundred dynamic deliveries a second.
So it seems likely we can serve up a great many hits on a simple computer using this technique and not need to cache at all. Let's double-check with httperf.
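Something along these lines - the hostname and the numbers here are just placeholders, tune them to taste:

httperf --server www.example.com --port 80 --uri /index.html --num-conns 1000 --rate 50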
The results come back at network-saturation level - the kind of load you see when you've been slashdotted or DDoSed - but we came out on top, responding to 40 requests a second. That ain't bad for a small-to-medium website. A large one would obviously need some nice caching.