Julian Hyde: "You would think that something called a 'feed' would push content
to subscribers as soon as it arrives, but in fact RSS and the
other feed types in the prototype use a pull protocol. With a pull
protocol, the subscriber needs to continually poll the feed to get the content (typically an XML document a few kilobytes
long), parse the content, and figure out what, if anything, is new
since the last time we polled.
This process soaks up a lot of
network bandwidth and resources for both the provider and the
subscriber, and the cost goes up the more regularly we poll. Typically
the provider has to throttle the feed to prevent their servers from
being overwhelmed. For example, Twitter updates its feed only once per
minute and limits the number of tweets on the page. At times of high
volume, only a small percentage of tweets make it into the feed.
This
may not sound that serious if the content is a Twitter conversation
between friends, or a blog with one or two posts a week. But web feed
protocols are becoming part of the IT infrastructure, and business
users require lower latency, higher throughput and higher availability.
(The existence of services like Gnip is evidence of the need to control the web content chaos.)"
I would like to know how to scale this so that the origin server does not melt down under query load. Let me explain, assuming the origin server is backed by a relational database.
Most people who want efficient real-time feeds are concerned about the bandwidth overhead, or the apparent technical stupidity, of polling the same data over and over. They would just like to receive whatever has changed since the last time they asked. It seems clearly more efficient and better. Let's call this a "bespoke" feed model.
What tends to get forgotten with bespoke feeds is that each client request forces a subselect on the database. This model is not likely to scale nearly as well on the server as resending redundant information and letting the client sort it out locally, however dumb that approach might seem. The Atom format, for example, is designed so that the client can sort it out locally by virtue of the atom:id and atom:updated values.
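As a sketch of what "sorting it out locally" can look like, here is one way a client might deduplicate a redundantly re-served Atom feed using atom:id and atom:updated. The seen-state dictionary and feed snippet are illustrative assumptions, not part of any particular client:

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def new_entries(feed_xml, seen):
    """Return ids of entries not yet seen, or updated since last seen.

    `seen` maps atom:id -> last atom:updated value; it is mutated in
    place. The provider may resend the same entries on every poll;
    this is where the client "sorts it out locally".
    """
    fresh = []
    for entry in ET.fromstring(feed_xml).iter(ATOM + "entry"):
        eid = entry.findtext(ATOM + "id")
        updated = entry.findtext(ATOM + "updated")
        if seen.get(eid) != updated:
            seen[eid] = updated
            fresh.append(eid)
    return fresh
```

On the first poll everything is new; on a redundant re-poll of the same document, nothing is, so the server never needed to compute a per-client delta.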
The alternative polling option people arrive at is to not support bespoke queries but to serve the same redundant data to all clients. Let's call this the "one size fits all" (osfa) feed model. It is the standard approach on the Web for scalable, highly available feed serving. The osfa approach "works" insofar as it assumes a lot of clients are accessing the data and makes a tradeoff, preferring bandwidth overhead to database load. This tradeoff makes more and more sense as the number of clients goes up - anyone who builds database-backed websites quickly learns to reduce the number of calls on the database, be it through query caches, an L2 object cache, caching proxies, and so on. An osfa approach allows the data to be served directly off disk, making it a pure file-serving problem, which is far easier to scale than hitting a relational database.
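One reason osfa scales is that identical bytes for everyone compose with ordinary HTTP caching and conditional GETs. A minimal sketch of the idea, assuming a pre-rendered feed document and ETag-based validation (this is one illustrative way an osfa endpoint could behave, not a reference implementation):

```python
import hashlib

def serve_feed(feed_bytes, if_none_match=None):
    """Serve the same pre-rendered feed document to every client.

    Because every client gets identical bytes, the server can answer
    from disk or a cache, and honour conditional GETs: a client that
    already holds the current version gets a cheap 304 instead of the
    full body. No per-client database query is involved.
    """
    etag = '"%s"' % hashlib.sha1(feed_bytes).hexdigest()
    if if_none_match == etag:
        return 304, b"", etag      # client is up to date; tiny response
    return 200, feed_bytes, etag   # full (redundant) document
```

A client that remembers the ETag from its last poll pays almost nothing when nothing has changed, which is the common case for low-volume feeds.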
So, where does that leave us? Well, I think if you must allow per-client querying for a lot of clients, you need to be sure the server can handle the database load at scale. If you are really worried about bandwidth, then compression is the first obvious thing to do. Another is caching, but that leads to data latency, and if you are asking for "just" the changed data there is a chance you want that data "right now" as well (more on that in a minute). You might also think that sending down less data will be a win - but this really depends on your use case. Replacing one coarse-grained fetch with four fine-grained queries isn't necessarily going to lead to a better user experience or sane usage of the data server, though a client developer might find it convenient not to have to om nom through a larger dataset. If you are familiar with the enterprise .NET/JEE antipattern of fine-grained data access that leads to the use of DTOs, well, fine-grained feeds present similar issues.
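On the compression point: feed XML is extremely repetitive and compresses very well, which is why it's the first lever to pull for bandwidth. A quick stdlib sketch (the sample document is made up purely to show the effect):

```python
import gzip

# A toy stand-in for a repetitive XML feed document.
feed = (b"<entry><id>urn:uuid:0000</id>"
        b"<updated>2009-01-01T00:00:00Z</updated></entry>") * 100

compressed = gzip.compress(feed)

# Repetitive markup shrinks dramatically under gzip.
assert len(compressed) < len(feed)
```

In practice this is just HTTP content negotiation (Accept-Encoding: gzip) rather than anything feed-specific, so it costs little to enable.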
Julian has a suggestion:
"I would like to see the emergence of a genuine 'push' protocol for
web-based content. It doesn't have to be particularly complicated. To
illustrate what I have in mind, here is an example of a simple,
stateless protocol, built using XML over HTTP, like the current feed
formats. A subscriber sends a request
"According to the protocol, the provider sends the results after 10
seconds, or when there are 1000 records to return, whichever occurs
sooner. After it has received a result, the subscriber will typically
ask for the next set of rows with a higher rowtime threshold.
Even
though it is simple, the protocol ensures that data flows efficiently
for feeds of all data rates. For a high volume feed, the 1000 record
limit will be reached before the 10 second timeout, so latency
naturally decreases. For a low volume feed, many requests may time out
and return an empty result; but the 10 second wait limits the number of
requests per minute that the server has to handle."
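To make the batching rule concrete, here is a hedged sketch of the provider side of such a protocol - return when the record limit is reached or the timeout elapses, whichever comes first. The `fetch_since` callable, the polling tick, and the (rowtime, payload) row shape are assumptions drawn from the description above, not Julian's actual code:

```python
import time

def poll(fetch_since, rowtime, max_records=1000, timeout=10.0, tick=0.1):
    """Block until `max_records` rows newer than `rowtime` exist, or
    `timeout` seconds pass; then return whatever has accumulated.

    `fetch_since(rowtime)` is an assumed callable returning rows as
    (rowtime, payload) pairs, ordered by rowtime.
    """
    deadline = time.monotonic() + timeout
    while True:
        rows = fetch_since(rowtime)
        if len(rows) >= max_records or time.monotonic() >= deadline:
            return rows[:max_records]
        time.sleep(tick)

# High-volume case: plenty of rows exist, so the record limit trips
# before the timeout and latency stays low.
rows = poll(lambda t: [(r, "x") for r in range(t + 1, 5)],
            rowtime=0, max_records=3, timeout=0.5)
# returns the first three rows immediately
```

After each response the subscriber advances rowtime to the largest value it received and asks again; a quiet feed simply times out with an empty result, which bounds the request rate.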
The protocol is simple, but it assumes the data server can handle the load of pushing out the data and managing subscription state; it does nothing to manage that part of the architecture. Good client-server protocol designs (where "good" means scaling to large numbers of both) try to avoid or mitigate these kinds of asymmetries.
Back to latency. Many web sites scale on the basis of the data being latent - even a few minutes of staleness can make a huge engineering and operational difference, especially as your application grows beyond a single cluster (or geographic location). IMO the mapreduce pattern scales not just on parallelisation but on the data latency the results are allowed to have (which is why it gets used a lot for log/warehouse analytics and post-hoc querying). So if you demand real-time precision in the data, be aware that this can put stress on your server.
"Real time" requirements in turn might lead you towards a push model, but I think it's reasonable to say that we don't know how to do internet-scale push yet, at least not without creating asymmetries - it's hard to serve a lot of clients you have to send data to, and the problem gets harder as you add things like filtering, or long-held client connections that will have you ripping out those loadbalancers.
For push, I think XEP-0060 is worth looking at, even though (imo) we have yet to learn how to manage mass subscriptions; and if you are interested in systems architecture, Rohit Khare's ARRESTED model.