This post originated from an RSS feed registered with Java Buzz
by Bill de hÓra.
Original Post: Dates in Atom
Feed Title: Bill de hÓra
Feed URL: http://www.dehora.net/journal/atom.xml
Feed Description: FD85 1117 1888 1681 7689 B5DF E696 885C 20D8 21F8
Atom feeds and AtomPub collections are time-ordered data. I think
most people intuitively know that Atom feeds are time ordered, but
perhaps not that they're ordered by update and edit times, or why time is the natural order for Atom-serviced content even when the domain content has other natural orders that make sense. Since it's not
that commonly talked about, I figure it's worth at least one post to
explain why.
Dates in Atom
There's a long (torrid) history of datestamping in
the Atom standards and in feed syndication more generally. When the Atom
format was being designed, some working group members felt you needed 3 dates - an edit date, a publish date and a creation date. Or maybe edited,
updated and published. Or... you get the idea. And as prior art to Atom, Dublin Core had already settled on 3 dates. Anyway, the Atom working group couldn't agree on 3 (really),
but we could identify and agree on 2 meaningful dates -
updated and published. As a result, Atom Entries must have an updated
date, and can have a published date.
Why all the work to naturally order by time? Historically it's because
feeds come from blogs, which are diaries, which are lists of entries
ordered by date. Today it's increasingly for systems reasons, most
importantly, to support cheap synchronisation by clients. What happens
is that the combination of atom:id and atom:updated is enough
information for clients to synchronise new or updated content - they
work from the top of the feed and walk the entries and/or the feed's
previous links until they hit the first atom:id/atom:updated pair that
matches their local Entry cache - sync over. This lowers overall
traffic and the cost of loading data out of persistent storage.
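The walk described above can be sketched roughly as follows. This is a minimal illustration, not a real Atom client: the `Entry` class, the `sync` function and the idea of passing pages in as an iterable are all assumptions made for the example. A real client would fetch and parse each page over HTTP and follow the feed's previous links.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Entry:
    id: str       # value of atom:id
    updated: str  # value of atom:updated, kept as text


def sync(pages: Iterable[List[Entry]], cache: Dict[str, str]) -> List[Entry]:
    """Walk a time-ordered feed newest-first and stop at the first known pair.

    `pages` stands in for the feed document followed by its previous pages
    (hypothetical: pass a generator so later pages are only fetched on demand).
    `cache` maps atom:id to the atom:updated value seen at the last sync.
    """
    changed: List[Entry] = []
    caught_up = False
    for page in pages:
        for entry in page:
            if cache.get(entry.id) == entry.updated:
                # First id/updated pair we already hold: everything after it
                # in a time-ordered feed is older, so the sync is over.
                caught_up = True
                break
            changed.append(entry)
        if caught_up:
            break
    # Record what we picked up so the next sync can stop early too.
    for entry in changed:
        cache[entry.id] = entry.updated
    return changed
```

Because the loop breaks as soon as it finds a matching pair, a lazily generated `pages` sequence is never advanced past the page containing that pair, which is where the traffic saving comes from.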
Dates in AtomPub
AtomPub (RFC5023) added another date. The working group said that
AtomPub collections (feeds you can post content to) should be ordered
by a date called app:edited. Entries in AtomPub collections should contain one app:edited element, and
must not contain more than one.
Ideally this natural ordering would
have been a must-level requirement, but RFC5023 couldn't mandate
that app:edited be universally understood, as that would break Atom's
versioning policies, which say that new elements are 'foreign markup'
and can be optionally processed or must be ignored. In other words
no-one can introduce a new must-understand datum into Atom (RFC4287)
markup and retroactively break the planet's deployed Atom-aware systems
- not even AtomPub (RFC5023). Unless you are unlucky, app:edited works well, even where the feed itself is latently updated.
[By the way, in the "real world", feeds that can act as AtomPub collections
will also appear as being ordered by atom:updated, even though
app:edited is what the spec says you should expose. Some systems will
update on every edit; that's just how they roll.]
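Here is a small sketch of how a client might order entries by app:edited while coping with feeds that only expose atom:updated. The namespace URIs are the ones defined by RFC4287 and RFC5023. The function names, the sample feed and the fallback to atom:updated are assumptions made for illustration, not something either spec prescribes.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "app": "http://www.w3.org/2007/app",
}


def parse_timestamp(text):
    # Older Pythons' fromisoformat() rejects a trailing "Z", so normalise it.
    return datetime.fromisoformat(text.strip().replace("Z", "+00:00"))


def edit_time(entry):
    """Return app:edited if present, else fall back to atom:updated."""
    edited = entry.findall("app:edited", NS)
    if len(edited) > 1:
        raise ValueError("entry must not contain more than one app:edited")
    if edited:
        return parse_timestamp(edited[0].text)
    return parse_timestamp(entry.findtext("atom:updated", namespaces=NS))


def collection_order(feed_xml):
    """Entries of a feed document, most recently edited first."""
    feed = ET.fromstring(feed_xml)
    return sorted(feed.findall("atom:entry", NS), key=edit_time, reverse=True)


# Hypothetical sample feed: the second entry carries app:edited, the first does not.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:app="http://www.w3.org/2007/app">
  <entry>
    <id>urn:example:2</id>
    <updated>2009-01-03T10:00:00Z</updated>
  </entry>
  <entry>
    <id>urn:example:1</id>
    <updated>2009-01-01T10:00:00Z</updated>
    <app:edited>2009-01-05T10:00:00Z</app:edited>
  </entry>
</feed>"""

ordered_ids = [e.findtext("atom:id", namespaces=NS)
               for e in collection_order(SAMPLE_FEED)]
```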
Domain gnarliness
The AtomPub spec doesn't say why app:edited exists, but the following example should help explain why.
Not all domain content is naturally time ordered (there's more to
digital life than blogging). Address and contact books, for example, will
tend to be sorted and presented to a user by some other key, maybe last
name. This is a gnarly case that came up on the Atom protocol list a
while back.
So say my information store has a list of contacts
- and a collection resource for managing those contacts. Generally I'm
not interested in retrieving things by last edit/update; I want
contacts alpha ordered, because my client is a useful application that
happens to use Atom/AtomPub, not some kind of an entry cache. If I'm
using Atom to represent an address book, using atom:updated or
app:edited seems to be the wrong approach for the UI.
The
problem is, not ordering collection entries by update time will result in
inefficient syncing (syncing is probably use case 2 or 3 for a network
address book, hence you tend to see SyncML and address books go hand in
hand).
For example, if I add a new contact with a last name of
"Wordsworth", that will go to the back of the feed and not the front,
where it could be picked up cheaply on the next sync. The client the edit
came from could of course hold onto the recent additions/edits
(essentially acting as a write-through cache) instead of paging back to
"W". But my client just got a bit more complicated. And my other HTTP
connected devices wanting the newest stuff will need to page all the
way back to "W" in the book to sync up. In fact, to be sure, they'll have
to pull the whole book every time they sync. The approach of stopping at
the first matching id/update pair won't work - algorithmically
speaking, syncing will always be a worst case.
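A toy illustration of the "Wordsworth" problem, using made-up data. The contact names, the integer stamps standing in for atom:updated values, and both helper functions are hypothetical. It only shows that the stop-at-first-match shortcut is safe on a time-ordered feed and silently misses changes on an alpha-ordered one, which forces a full scan.

```python
def stop_at_first_match(feed, cache):
    """Cheap sync: collect entries until one matches the local cache."""
    picked_up = []
    for name, updated in feed:
        if cache.get(name) == updated:
            break
        picked_up.append(name)
    return picked_up


def full_scan(feed, cache):
    """Worst-case sync: compare every entry in the book against the cache."""
    return [name for name, updated in feed if cache.get(name) != updated]


# Last name doubles as the id; the integer stands in for an atom:updated value.
book = {"Austen": 1, "Byron": 2, "Keats": 3, "Shelley": 4}
cache = dict(book)       # the client is fully synced at this point
book["Wordsworth"] = 5   # a new contact is added on the server

by_time = sorted(book.items(), key=lambda item: item[1], reverse=True)
by_name = sorted(book.items())

time_ordered_result = stop_at_first_match(by_time, cache)  # reads 2 entries
alpha_shortcut_result = stop_at_first_match(by_name, cache)  # stops at "Austen"
alpha_full_result = full_scan(by_name, cache)  # reads all 5 entries
```

With the alpha-ordered feed the shortcut stops at the very first contact and never sees the new one, so the only safe option is to read the whole book on every sync.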
Eventually
something like the following will happen to deal with the UI being
slow, or concurrent client refreshes pegging the server. A new
"recently added" contacts feed will be added. Or the sort will be
extended to allow by-added/by-updated. Either way, it'll be a
reinvention of AtomPub's app:edited default sorting. In that case we'll
want to move the order-by-last-name feature of the domain/UI into the
implementation detail, perhaps by defining some query params that
provide the user-optimised view of the data (i.e. the one that makes
most sense for the user browsing the content), and keep the time
ordered feed as the protocol default.
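One way the split could look on the server side is sketched below. The `order=name` query parameter, the `collection_view` function and the dictionary shape of a contact are all invented for this example. AtomPub doesn't define any such parameter, so this would be a private contract between client and server.

```python
from urllib.parse import parse_qs

# Hypothetical contact records; "edited" stands in for app:edited.
CONTACTS = [
    {"last_name": "Wordsworth", "edited": "2009-03-01T00:00:00Z"},
    {"last_name": "Austen", "edited": "2009-01-01T00:00:00Z"},
    {"last_name": "Keats", "edited": "2009-02-01T00:00:00Z"},
]


def collection_view(contacts, query_string=""):
    """Return contacts in the order requested by the (hypothetical) query string.

    With no parameters, or an unrecognised `order` value, the protocol default
    applies: most recently edited first. `order=name` gives the UI view.
    """
    params = parse_qs(query_string)
    order = params.get("order", ["edited"])[0]
    if order == "name":
        return sorted(contacts, key=lambda c: c["last_name"].lower())
    # Timestamps in one format and timezone sort correctly as plain text.
    return sorted(contacts, key=lambda c: c["edited"], reverse=True)


default_names = [c["last_name"] for c in collection_view(CONTACTS)]
alpha_names = [c["last_name"] for c in collection_view(CONTACTS, "order=name")]
```

Syncing clients hit the collection with no parameters and get the cheap time-ordered walk, while the address book UI asks for the alpha view explicitly.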
What's happening is that
there are two use cases. One for viewing an address book in an
application (sorted by alpha), and another for adding and syncing
contacts to it, and probably the server needs to provide different
views on the data for each.
Incidentally, an AtomPub client can
work without app:edited sorting (it won't necessarily know the sort
order, unless there's a private contract between client and server),
but it will be inefficient on update. So it seems that in the general
case, even for a domain like an address book, ordering by time is the best
natural sort for an AtomPub collection.
Backend databases
Most people, I think,
use databases to back web sites, and sometimes you'll want to just use
the database primary key to sort the entries. Ordering on the pk is
great because it's FaF (Fast as ****). And if the database is using
autoincrementing keys we'll naturally sort by content creation date.
But there are downsides. For example, this technique won't be optimal
for updates, as they won't be captured in the order-by clause. At the
system level it means that clients will have to start paging more data
to sync up content, which means more load against the DB.
Non-auto-incrementing keys and very possibly split/federated databases
won't support the implicit creation order. And a database wipeout
potentially loses the order of actual creation (who knows how the data
will be reimported and new keys assigned).
What this means is that RDBMS-managed content being served up for
feeds or managed using AtomPub (which will over time trend to being most web
content) will have multiple date columns. An insert time (generally
good for data management anyway) will be very common. But for content
management they'll need an updated column that's indexed, to track
recent changes. You might have a third published date, and maybe an
edit one as well (if you need to distinguish between an update and an
edit), but to let AtomPub clients use and manage the data, an updated date seems to be the
minimum must-have.
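To make the difference concrete, here's a small sketch using SQLite from Python. The table name, column names and sample rows are assumptions for illustration only, and a real schema would differ. It shows an edit to an old row being invisible to a primary-key ordering but surfacing at the top when ordering by an indexed updated column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entry (
    id       INTEGER PRIMARY KEY AUTOINCREMENT,
    title    TEXT NOT NULL,
    inserted TEXT NOT NULL,  -- creation time, handy for data management
    updated  TEXT NOT NULL   -- last change, what feed clients sync against
);
-- Index the column the feed query sorts on.
CREATE INDEX entry_updated_idx ON entry (updated);
""")

# Timestamps in a single format and timezone sort correctly as text.
conn.executemany(
    "INSERT INTO entry (title, inserted, updated) VALUES (?, ?, ?)",
    [
        ("first", "2009-01-01T00:00:00Z", "2009-01-01T00:00:00Z"),
        ("second", "2009-01-02T00:00:00Z", "2009-01-02T00:00:00Z"),
        ("third", "2009-01-03T00:00:00Z", "2009-01-03T00:00:00Z"),
    ],
)

# Someone edits the oldest row.
conn.execute(
    "UPDATE entry SET updated = ? WHERE title = ?",
    ("2009-01-04T00:00:00Z", "first"),
)

# Ordering on the pk reflects creation order only; the edit stays buried.
by_pk = [row[0] for row in
         conn.execute("SELECT title FROM entry ORDER BY id DESC")]

# Ordering on updated brings the edited row to the front of the feed.
by_updated = [row[0] for row in
              conn.execute("SELECT title FROM entry ORDER BY updated DESC")]
```

A client syncing against the pk-ordered feed would have to page all the way to the last row to notice the edit, while the updated-ordered feed puts it first.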