Changing blog software can be hard, as many of us no doubt know. In
a recent post
Bill de hÓra discusses some of the issues exporting his existing blog
to a new blog system. I know the difficulty of doing this sort of
thing, the difficulties of implementing it, and all the broken links left
behind because of it. And it's all stupid. We could make things much
easier if the people writing the software would just let go a
little more and stop demanding that all the information go in some
opaque (to the web) backend data silo (i.e., a database).
For blogs the transition story should be much easier. When you want
to retire your blogging software you should rip your site. Spider the
whole thing down. It's not that much space, even when you have lots
of redundant files.
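Ripping a site can be sketched in a few lines. This is only an illustration, not a finished tool: `rip_site`, `LinkParser`, and the `fetch` callable are hypothetical names, and `fetch(url)` is assumed to return a `(content_type, body_bytes)` pair or `None` for a missing page. The point is that the content type is recorded next to each page, keyed by the full URL including any query string.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkParser(HTMLParser):
    """Collect the targets of href and src attributes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def rip_site(start_url, fetch):
    """Spider every same-host URL reachable from start_url.

    fetch is a hypothetical callable: fetch(url) returns
    (content_type, body_bytes), or None if the page is missing.
    Returns a dict mapping each URL to (content_type, body_bytes).
    """
    host = urlsplit(start_url).netloc
    archive = {}
    seen = set()
    queue = [start_url]
    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        result = fetch(url)
        if result is None:
            continue
        content_type, body = result
        archive[url] = (content_type, body)
        # Only HTML pages are scanned for further links.
        if content_type.startswith("text/html"):
            parser = LinkParser()
            parser.feed(body.decode("utf-8", "replace"))
            for link in parser.links:
                # Resolve relative links and drop #fragments.
                target = urldefrag(urljoin(url, link)).url
                if urlsplit(target).netloc == host:
                    queue.append(target)
    return archive
```

In practice `fetch` would wrap an HTTP client, and the resulting dict would be written to disk in whatever layout keeps the URL and content type together.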
Of course, you probably can't serve the results up with Apache.
Apache just... isn't good enough. For instance, your files may not
have extensions; you have to keep some of the metadata, at least
content type, and there's no clear way to do that in Apache. (There
might be tricks or modules to do this in Apache, I'm not sure.) Also,
the page names might be awkward, like /blog/?p=45 -- Apache
interprets that sort of stuff, and we don't want to interpret it, we
spidered it as a page (regardless of what implements that page),
and we just want to serve it up as a page. That's it.
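To make "just serve it up as a page" concrete, here is a minimal WSGI sketch. It assumes the archive is a plain mapping from the exact request target (path plus query string, uninterpreted) to a `(content_type, body_bytes)` pair; that layout and the name `make_archive_app` are assumptions for illustration, not an existing tool.

```python
def make_archive_app(archive):
    """Return a WSGI app that serves spidered pages verbatim.

    archive maps the exact request target, e.g. "/blog/?p=45", to a
    (content_type, body_bytes) pair recorded when the site was spidered.
    """
    def app(environ, start_response):
        # Rebuild the request target without interpreting the query string.
        key = environ.get("SCRIPT_NAME", "") + environ.get("PATH_INFO", "")
        query = environ.get("QUERY_STRING", "")
        if query:
            key += "?" + query
        page = archive.get(key)
        if page is None:
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"Not found\n"]
        content_type, body = page
        start_response("200 OK", [
            ("Content-Type", content_type),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    return app
```

Any WSGI server can host the result; for local testing, `wsgiref.simple_server.make_server("", 8000, app)` from the standard library is enough.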
But ignore the server portion for now -- we know we can build it if we
want to. You now have an archive you can keep around forever and move
to different hosts, and it doesn't rely on a software stack or any
data formats except the one that matters most: the HTML that people
read.
But it's not perfect. You have a link like /blog/2005/07/my-post,
or even /blog/?p=45. Your new system has a link like
/blog/2007/01/another-post.html. They both share /blog/ --
you can configure your way around some (like ^/blog/200[012345]/),
but for how long? Must you only change blog software on New Year's
Day, or on the first of the month?
When serving files you need a layered approach. First you look for
this legacy content. If the content doesn't exist, you do the normal
lookups, going to your new blogging software. The lookup must go to
the full depth. That is, you don't look at /blog and decide what
to dispatch to. You look at /blog/?p=45 (or whatever) and if you
don't find it you start with the next option (the live software).
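The layered lookup can be sketched as a small WSGI wrapper that tries each application in turn and falls through on a 404. Paste has a cascade module built around the same idea; the version below is a stand-alone simplification (the name `cascade` and its behavior are my own sketch, not Paste's implementation). It buffers each response in memory and does not replay request bodies, so it only suits simple GET traffic.

```python
def cascade(apps):
    """Try each WSGI app in order; return the first non-404 response."""
    def app(environ, start_response):
        for candidate in apps:
            captured = {}

            def capture(status, headers, exc_info=None):
                # Hold the status back until we know it is not a 404.
                captured["status"] = status
                captured["headers"] = headers
                return lambda data: None  # legacy write() callable, ignored

            # Each layer gets its own copy of the full request environ,
            # so the lookup always happens at full depth.
            body = candidate(dict(environ), capture)
            chunks = list(body)
            if hasattr(body, "close"):
                body.close()
            if not captured["status"].startswith("404"):
                start_response(captured["status"], captured["headers"])
                return chunks
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not found\n"]
    return app
```

With the legacy archive listed first, as in `cascade([legacy_app, live_blog_app])` (both names hypothetical), an old URL is answered from the archive and everything else reaches the live software.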
Then you only have to worry about overlaps. And that's not without
concern; /blog/archive/ may exist in both systems. So some management
may be necessary to rename or remove some of that legacy content.
This is a good solution for content that is timely, or archival. If
you have content that is not timely, you are probably using a CMS, and
you really want to move the content into the new system so you can
continue to manage it in a consistent way.
In that case, while it would be nice to keep URIs stable, it might not
be feasible. Which is why all stacks need easily managed redirects,
with feedback taken from 404s in the logs.
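As a rough sketch of what "easily managed redirects" with 404 feedback might look like: a plain dict of old-to-new paths drives a WSGI wrapper, and a small log scan reports which missing paths still need an entry. The names `missing_paths` and `redirector` are hypothetical, and the regular expression assumes access logs in Common Log Format.

```python
import re
from collections import Counter

# Matches the request path and status of a Common Log Format line.
LOG_LINE = re.compile(r'"(?:GET|HEAD) (\S+) [^"]*" (\d{3})\b')


def missing_paths(log_lines, redirects):
    """Count 404s for paths with no redirect yet, most frequent first."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and match.group(2) == "404" and match.group(1) not in redirects:
            counts[match.group(1)] += 1
    return counts.most_common()


def redirector(redirects, fallback):
    """WSGI wrapper: answer known old paths with a 301, else call fallback."""
    def app(environ, start_response):
        path = environ.get("PATH_INFO", "")
        if environ.get("QUERY_STRING"):
            path += "?" + environ["QUERY_STRING"]
        target = redirects.get(path)
        if target is None:
            return fallback(environ, start_response)
        start_response("301 Moved Permanently", [
            ("Location", target),
            ("Content-Length", "0"),
        ])
        return [b""]
    return app
```

Running `missing_paths` over the logs every so often gives a ranked list of broken URLs; adding the worthwhile ones to the dict closes the loop.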
A system set up like this still has some problems. How do you show all
the old entries as well as the new on your archive page? While I'm
not terribly concerned about the styling of old pages, sensible
navigation would be nice -- if you have archive links in your old
pages, it will appear as though you've stopped posting (when in fact
that's just the last date when the page was updated). And you want to
do searches across all your content.
These are also solvable problems, but they require some more
significant changes. We need to move the data that is currently
stored in the model out into the pages themselves. We need smart
spidering, maybe building on Google Sitemaps or other enumerations of
the "interesting" parts of a site. In addition, we need notification of new
content (by pushing out event notifications, maybe in the form of APP). From
there you could build an archive page entirely separate from the blog
software, and capable of reading content from more than one location.
Another, more conservative option is just building some basic
customizability into blog software so that you could copy the static
HTML for your old archives, and the blog software would simply append
this to the archive page.
For styling and navigation I think a pipeline approach is most
appropriate -- we're working on Deliverance to do this, but
it's a tricky problem. For searching... well, that at least is
already easy; reasonable hints along with a third-party search service
should work well (where "reasonable hints" includes not having your
archive page indexed), and if a third-party search won't work then
privately-hosted search services are still likely to be of higher
quality than searches built into blog software.
What annoys me is that technically none of this stuff is difficult
(this is actually the dumbest/easiest way to handle legacy content),
but existing stacks make it difficult. (I'm looking at you,
Apache!) This kind of system design can be more stable than what we
currently have, easier to maintain, more general, and just more web.
This is the kind of stuff I'm trying to tackle with Paste and WSGI and
all these little bits of code fitting together. I might just be
tilting at windmills in this endeavor, as developers both young and old
just love their backend models and want to have total control over
everything (driven in no small part by over-controlling customers,
graphic designers, marketers, and their ilk). And I can't really
expect WSGI to take over the world; it's not going to be the next
Apache. But we are definitely going to give
some of this stuff a serious try.