Efficient API Paging: count down, not up
Bill de hÓra
Hueniverse: "Making an API call is like asking a question and if the question is simple enough, the answer is easy to come by. For example, asking to describe a single user is simple. A single database lookup using a single unique key is fast and generally easy to scale. This is the most optimized use case in relational databases. However, asking for the last 20 messages of all the friends I am following is not a simple question. It is complex because it is really a list of questions batched together. To find the last 20 messages, we must ask:
* Who do I follow?
* What are their last 20 messages?
* Of all the messages found, which are the most recent 20?
* What is the content and other metadata for each of these 20 messages?
This list of questions becomes pretty inefficient with a very large set of friends being followed. However, the data set can be reduced by first finding the top 20 friends with the most recent updates, and only requesting their 20 latest messages. Even with this optimization, we are looking at fetching up to 400 messages in order to display 20. This gets much worse when the request is for the second page of messages (messages 21-40) where we can fetch up to 1600 messages from up to 40 friends in order to display 20 messages. By the 5th page we are fetching up to 10,000 messages to display 20 (which might explain why Twitter temporarily disabled the paging feature of their API)."
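Making the arithmetic in that quote explicit (my reading of it, not Hueniverse's): page N of size 20 can involve up to 20*N candidate friends, each of whom has to be asked for their latest 20*N messages - roughly 400*N² fetches to show 20 items. A tiny sketch:

```java
// My reading of the arithmetic in the quote above (hypothetical helper,
// not Hueniverse's code): to render page N of size 20, up to 20*N friends
// may hold one of the wanted messages, and each must be asked for its
// latest 20*N messages.
public class WorstCaseFetch {
    static long worstCaseFetches(int pageSize, int pageNumber) {
        long candidates = (long) pageSize * pageNumber;
        return candidates * candidates; // 400 for page 1, 1600 for page 2, 10000 for page 5
    }
}
```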
Let's ignore for now the notion of an inbox, which avoids the who-do-I-follow-means-a-SQL-join problem, and look at paging over the network. The way some services expose paging physically ensures each page has to be recomputed every time there's a new addition - new content keeps getting added to page=1, pushing everything else down off the main page and causing a ripple effect over the set of pages. As a result pages can't easily be archived and are tricky to cache - the content of "page=5" changes every time there's new content on the homepage (in Twitter's case new content on the homepage is extremely frequent), because the paging model is a pushdown stack. Here's that stack:
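To make it concrete, here's a minimal sketch of the count-up model (the names and page size are illustrative, not any real API): page 1 is always the newest slice, so publishing one message shifts the contents of every page behind it.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "count up" / pushdown-stack model: page=1 is always the
// newest 20 messages, so publishing one message changes every page.
public class CountUpPaging {
    static final int PAGE_SIZE = 20;
    static final List<String> messages = new ArrayList<>(); // newest-first

    static void publish(String msg) {
        messages.add(0, msg); // pushes every existing message down one slot on every page
    }

    static List<String> page(int n) {
        int from = (n - 1) * PAGE_SIZE;
        int to = Math.min(from + PAGE_SIZE, messages.size());
        return from >= to ? List.of() : messages.subList(from, to);
    }
}
```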
All the messages are pushed down and, logically, every page in the stack gets touched. As a result nothing is long-term cacheable and nothing is physically archivable to disk. It's inherently non-scalable.
Another approach would be to work the paging backwards by counting down. In that model page=1 will be the oldest page, not the newest one. Like this:
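Here's a minimal sketch of the count-down model (again, illustrative names only): archive pages are numbered from the oldest up and never change once written; new messages only ever touch the homepage.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "count down" model: page=1 is the oldest archive page and
// never changes; new messages land only on the homepage, which is allowed
// to grow.
public class CountDownPaging {
    static final int PAGE_SIZE = 20;
    static final List<List<String>> archive = new ArrayList<>(); // index 0 == page=1 (oldest)
    static final List<String> homepage = new ArrayList<>();      // newest-first, grows freely

    static void publish(String msg) {
        homepage.add(0, msg); // only the homepage is touched; archive pages stay frozen
    }

    static List<String> page(int n) {
        return archive.get(n - 1); // stable content: cacheable and archivable
    }
}
```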
Note that we let the homepage grow by one. That stops propagation to the other pages. When the homepage gets too big we can push some messages back to a new page and splice it into the list:
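A sketch of that splice step (a hypothetical helper, continuing the same idea): once the homepage holds twice the archive page size, its oldest messages move onto a brand new archive page, and the existing archive pages are never touched.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the splice step: once the homepage holds twice the archive page
// size, the oldest PAGE_SIZE messages are moved onto a new archive page;
// existing archive pages are never rewritten.
public class Splice {
    static final int PAGE_SIZE = 20;

    static void spliceIfNeeded(List<String> homepage, List<List<String>> archive) {
        if (homepage.size() >= 2 * PAGE_SIZE) {
            // the homepage is newest-first, so its oldest messages sit at the end
            List<String> oldest = homepage.subList(homepage.size() - PAGE_SIZE, homepage.size());
            archive.add(new ArrayList<>(oldest)); // becomes the highest-numbered page
            oldest.clear();                       // homepage shrinks back down
        }
    }
}
```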
Upside? N-2 pages can be archived, even taken off your active database and served as flat files or out of a memcached layer (if a database is what you're using). If you don't mind a long homepage, you can manage the resizing just on the homepage and splice in a new page when the length of the homepage is 2x your archive page size. Downside? Somebody will file a bug that the main pages vary in size.
By the way, if you designed your Data API around Atom and RFC 5005, thus telling everyone "follow the next link" instead of "compute the next link's URL, then follow it", you could switch physical paging models after the fact - the user agent doesn't care about your notion of a linked list; it'll key off the "rel" paging attributes. Whereas if you exposed the counting model directly via URL parameters, all clients would need to be upgraded. It's basic Data API practice to tell clients to follow typed links instead of asking them to compute parameterised URLs.
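For illustration, here's a sketch of a client that pages by following the feed's typed links rather than computing page URLs. The feed URL is made up; the only assumption is an Atom document carrying RFC 5005 rel="next" paging links.

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch of a client that follows typed links instead of computing ?page=N
// URLs. The server can change its physical paging model underneath it.
public class FollowLinks {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        String next = "http://example.org/messages.atom"; // hypothetical entry point
        while (next != null) {
            HttpResponse<byte[]> resp = http.send(
                HttpRequest.newBuilder(URI.create(next)).build(),
                HttpResponse.BodyHandlers.ofByteArray());
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document feed = dbf.newDocumentBuilder()
                .parse(new ByteArrayInputStream(resp.body()));
            // ... process the entries in this page ...
            next = null;
            NodeList links = feed.getElementsByTagNameNS("http://www.w3.org/2005/Atom", "link");
            for (int i = 0; i < links.getLength(); i++) {
                Element link = (Element) links.item(i);
                if ("next".equals(link.getAttribute("rel"))) { // RFC 5005 paged-feed link
                    next = link.getAttribute("href");          // follow it, don't compute it
                    break;
                }
            }
        }
    }
}
```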