I thought people might be interested in some very basic Apache hacking that I use to build websites. The idea: you have a bunch of XML on your hard disk and you want to use XSLT to produce XHTML pages for people to view - on the fly at request time, not with some 'build' process.
The ingredients are these: Apache's HTTP daemon, xsltproc (libxslt), your XML, your XSLT and a programming language such as Bash. I've used Ruby, Python, Bash and Smalltalk for this purpose on occasion, but I always start with Bash and promote up to a 'real' programming language from there.
To start with, we still want to be able to serve up regular content on our website - it's just the XML that needs to be treated differently. So let's say that any time Apache cannot find what it's looking for, it'll throw a 404. Let's catch that 404 and do our own work.
In your .htaccess file, you can add a simple directive:

ErrorDocument 404 /cgi-bin/website.cgi
This rule works for all subdirectories too, which means you can still have a full-featured directory structure. Only missing files - such as all your .html pages - will fall through to the script.
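If your host hasn't already wired up CGI for you, a couple of extra .htaccess lines along these lines can do it - assuming your AllowOverride settings permit them; with a standard ScriptAlias'd cgi-bin you don't need them at all:

Options +ExecCGI
AddHandler cgi-script .cgi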
Now we write the script itself. Make sure website.cgi is executable and recognised as a CGI script in your cgi-bin directory. We need to cover some basics first. For this example, which can obviously be expanded on, we'll just assume the reason someone got here is that they were looking for some of our XML.
So let's turn that .html into an .xml using a regular expression.
DOCUMENT=`echo "$REDIRECT_URL" | sed 's#\.html$#.xml#'`
That's pretty easy. So now we can dump out our XML to begin with: cat $DOCUMENT
That'll fail, naturally; we need to put out all the normal headers first:
echo "Status: 200 OK"
echo "Content-Type: text/html"
echo
Now let's get a bit trickier. What if the .xml file doesn't exist either? We should serve a real 404:
if test -f "$DOCUMENT"; then
    echo "Status: 200 OK"
    echo "Content-Type: text/html"
    echo
    cat "$DOCUMENT"
else
    echo "Status: 404 Not Found"
    echo "Content-Type: text/plain"
    echo
    echo "404 - not found, oops"
fi
We can get fancier and support 304 Not Modified by using the last-modified timestamp of the .xml file and checking $HTTP_IF_MODIFIED_SINCE against it, also putting out Last-Modified in the headers.
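A rough sketch of that check, assuming GNU date (its -r flag reads a file's mtime); a plain string comparison is usually good enough, since browsers echo back the exact Last-Modified value we sent:

LAST_MODIFIED=`date -u -r "$DOCUMENT" '+%a, %d %b %Y %H:%M:%S GMT'`
if test "$HTTP_IF_MODIFIED_SINCE" = "$LAST_MODIFIED"; then
    echo "Status: 304 Not Modified"
    echo
else
    echo "Status: 200 OK"
    echo "Last-Modified: $LAST_MODIFIED"
    echo "Content-Type: text/html"
    echo
    cat "$DOCUMENT"
fi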
One last step, really: which XSL do we use? I do a very basic check inside my XML for an xmlns=, then match that up against an .xsl filename in a file I call lookandfeels.
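Each line of lookandfeels simply maps a namespace to a stylesheet path, separated by a space - these entries are made up for illustration:

http://www.example.com/ns/weblog /var/www/xsl/weblog.xsl
http://www.example.com/ns/photos /var/www/xsl/photos.xsl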
NAMESPACE=`grep 'xmlns=' "$DOCUMENT" | sed 's#.*xmlns="\([^"]*\)".*#\1#'`
STYLESHEET=`grep "$NAMESPACE" "$LOOKANDFEEL" | sed 's#.* ##'`
xsltproc --xinclude "$STYLESHEET" "$DOCUMENT"
So we do that instead of the cat $DOCUMENT, and now we're serving up XHTML using our XSLT and xsltproc.
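Putting the pieces together, a minimal website.cgi might look something like this - prefixing $REDIRECT_URL with $DOCUMENT_ROOT and the location of lookandfeels are my own assumptions about the layout:

#!/bin/bash
# minimal sketch: map the missing .html back to its .xml source
LOOKANDFEEL="$DOCUMENT_ROOT/lookandfeels"
REQUEST=`echo "$REDIRECT_URL" | sed 's#\.html$#.xml#'`
DOCUMENT="$DOCUMENT_ROOT$REQUEST"

if test -f "$DOCUMENT"; then
    # pick a stylesheet by matching the document's namespace in lookandfeels
    NAMESPACE=`grep 'xmlns=' "$DOCUMENT" | sed 's#.*xmlns="\([^"]*\)".*#\1#'`
    STYLESHEET=`grep "$NAMESPACE" "$LOOKANDFEEL" | sed 's#.* ##'`
    echo "Status: 200 OK"
    echo "Content-Type: text/html"
    echo
    xsltproc --xinclude "$STYLESHEET" "$DOCUMENT"
else
    echo "Status: 404 Not Found"
    echo "Content-Type: text/plain"
    echo
    echo "404 - not found"
fi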
There are many things we can do next. For example, we can cache our result to a tmp directory. We then touch both the tmp file and the original document so their timestamps match, and on every subsequent request we check that the two timestamps still match. If they do, we can keep serving our cached version and not run the processor again until the source has changed.
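A rough sketch of that idea - the cache location is my own choice, and I copy the source's timestamp onto the cached copy with touch -r rather than touching both files; test's -nt operator is true when the left file is newer than the right one (or the right one doesn't exist):

CACHE="/tmp/xmlcache$REDIRECT_URL"
CACHEDIR=`dirname "$CACHE"`
mkdir -p "$CACHEDIR"

# regenerate only when the source is newer than the cached copy
if test "$DOCUMENT" -nt "$CACHE"; then
    xsltproc --xinclude "$STYLESHEET" "$DOCUMENT" > "$CACHE"
    touch -r "$DOCUMENT" "$CACHE"
fi
cat "$CACHE"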
We can get more sophisticated and check whether the stylesheet has changed since the transform as well. We can even use grep to look for xsl:import, xsl:include and document() statements and treat them as dependencies too. But to do all that we'd need to record this information in a separate file, and we've really stepped outside the league of nice, simple bash scripting.
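Just to illustrate the idea, carrying on from the caching sketch above and assuming each import or include sits on its own line with an href="..." attribute (document() calls would need similar treatment):

XSLDIR=`dirname "$STYLESHEET"`
STALE=no
for DEP in `grep -E '<xsl:(import|include)' "$STYLESHEET" | sed 's#.*href="\([^"]*\)".*#\1#'`; do
    if test "$XSLDIR/$DEP" -nt "$CACHE"; then
        STALE=yes
    fi
done
# STALE=yes means the cached copy needs regenerating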
Another trick I like is to first check for a .sh file instead of an .xml file. If there's a shell script, run it to get the XML content. This lets me have scripts running behind the XSLT processor: if the script changes, then its content is out of date and it all runs again. A sister script to this one would do the dependency checking.
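A rough take on it, with the temp-file plumbing being my own choice:

SCRIPT=`echo "$DOCUMENT" | sed 's#\.xml$#.sh#'`
TMPXML=""
if test -x "$SCRIPT"; then
    # run the script and treat its output as the document to transform
    TMPXML=/tmp/website.$$.xml
    "$SCRIPT" > "$TMPXML"
    DOCUMENT="$TMPXML"
fi
NAMESPACE=`grep 'xmlns=' "$DOCUMENT" | sed 's#.*xmlns="\([^"]*\)".*#\1#'`
STYLESHEET=`grep "$NAMESPACE" "$LOOKANDFEEL" | sed 's#.* ##'`
xsltproc --xinclude "$STYLESHEET" "$DOCUMENT"
if test -n "$TMPXML"; then rm -f "$TMPXML"; fi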
All of these caching tricks sure sound nice - but do you really need them? Unless you're getting hit with a million hits a day... well, a million hits a day is only 41,666 hits an hour, which is 694 hits a minute, which is about 11 hits a second. Most dynamic applications can serve up 20 documents a second with ease. The stuff I have at work does over a hundred dynamic deliveries a second.
So it seems likely we can serve up a great many hits on a simple computer using this technique and not need to cache at all. Let's double-check with httperf.
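Something along these lines - the hostname and the numbers here are just placeholders, tune them to taste:

httperf --server www.example.com --port 80 --uri /index.html --num-conns 1000 --rate 50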
The results come back at network-saturation level - the kind of load you see when you've been slashdotted or DDoSed - but we came out on top, responding to 40 requests a second. That ain't bad for a small-to-medium website. A large one would obviously need some nice caching.