Julian Hyde: "You would think that something called a 'feed' would push content
to subscribers as soon as it arrives, but in fact RSS and the
other feed types in the prototype use a pull protocol. With a pull
protocol, the subscriber needs to continually poll the feed to get the content (typically an XML document a few kilobytes
long), parse the content, and figure out what, if anything, is new
since the last time we polled.
This process soaks up a lot of
network bandwidth and resources for both the provider and the
subscriber, and the cost goes up the more regularly we poll. Typically
the provider has to throttle the feed to prevent their servers from
being overwhelmed. For example, Twitter updates its feed only once per
minute and limits the number of tweets on the page. At times of high
volume, only a small percentage of tweets make it into the feed.
This
may not sound that serious if the content is a Twitter conversation
between friends, or a blog with one or two posts a week. But web feed
protocols are becoming part of the IT infrastructure, and business
users require lower latency, higher throughput and higher availability.
(The existence of services like Gnip is evidence of the need to control the web content chaos.)"
I would like to know how to scale this so that the origin server does not melt down under query load. Let me explain, assuming the origin server is backed by a relational database.
Most people who want efficient real-time feeds are concerned about the bandwidth overhead, or the apparent technical stupidity, of polling the same data over and over. They would just like to receive whatever has changed since the last time they asked. It seems clearly more efficient and better. Let's call this a "bespoke" feed model.
What tends to get forgotten with bespoke feeds is that each client request forces a subselect on the database. This model is not likely to scale nearly as well on the server as resending redundant information and letting the client sort it out locally, however dumb that approach might seem. The Atom format, for example, is designed so that the client can sort it out locally by virtue of the atom:id and atom:updated values.
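As a sketch of what "sorting it out locally" can look like, here is one way a client might deduplicate a redundantly re-served Atom feed using atom:id and atom:updated. The seen-state dictionary and feed snippet are illustrative assumptions, not part of any particular client:

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def new_entries(feed_xml, seen):
    """Return ids of entries not yet seen, or updated since last seen.

    `seen` maps atom:id -> last atom:updated value; it is mutated in
    place. The provider may resend the same entries on every poll;
    this is where the client "sorts it out locally".
    """
    fresh = []
    for entry in ET.fromstring(feed_xml).iter(ATOM + "entry"):
        eid = entry.findtext(ATOM + "id")
        updated = entry.findtext(ATOM + "updated")
        if seen.get(eid) != updated:
            seen[eid] = updated
            fresh.append(eid)
    return fresh
```

On the first poll everything is new; on a redundant re-poll of the same document, nothing is, so the server never needed to compute a per-client delta.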
The alternative polling option people arrive at is to not support bespoke queries but to serve the same redundant data to all clients. Let's call this the "one size fits all" (osfa) feed model. It is the standard approach on the Web for scalable, highly available feed serving. The osfa approach "works" insofar as it assumes a lot of clients are accessing the data and makes a tradeoff, preferring bandwidth overhead to database load. This tradeoff makes more and more sense as the number of clients goes up - anyone who builds database-backed websites quickly learns to reduce the number of calls on the database, be it through query caches, an L2 object cache, caching proxies, and so on. An osfa approach allows the data to be served directly off disk, making it a pure file-serving problem, which is far easier to scale than hitting a relational database.
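One reason osfa scales is that identical bytes for everyone compose with ordinary HTTP caching and conditional GETs. A minimal sketch of the idea, assuming a pre-rendered feed document and ETag-based validation (this is one illustrative way an osfa endpoint could behave, not a reference implementation):

```python
import hashlib

def serve_feed(feed_bytes, if_none_match=None):
    """Serve the same pre-rendered feed document to every client.

    Because every client gets identical bytes, the server can answer
    from disk or a cache, and honour conditional GETs: a client that
    already holds the current version gets a cheap 304 instead of the
    full body. No per-client database query is involved.
    """
    etag = '"%s"' % hashlib.sha1(feed_bytes).hexdigest()
    if if_none_match == etag:
        return 304, b"", etag      # client is up to date; tiny response
    return 200, feed_bytes, etag   # full (redundant) document
```

A client that remembers the ETag from its last poll pays almost nothing when nothing has changed, which is the common case for low-volume feeds.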
So, where does that leave us? Well, I think if you must allow per-client querying for a lot of clients, you need to be sure the server can handle the database load at scale. If you are really worried about bandwidth, then compression is the first obvious thing to do. Another is caching, but that leads to data latency, and if you are asking for "just" the changed data there is a chance you want that data "right now" as well (more on that in a minute). You might also think that sending down less data will be a win - but this really depends on your use case. Replacing one coarse-grained fetch with four fine-grained queries isn't necessarily going to lead to a better user experience or sane usage of the data server, though a client developer might find it convenient not to have to om nom through a larger dataset. If you are familiar with the enterprise .NET/JEE antipattern of fine-grained data access that leads to the use of DTOs, well, fine-grained feeds present similar issues.
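On the compression point: feed XML is extremely repetitive and compresses very well, which is why it's the first lever to pull for bandwidth. A quick stdlib sketch (the sample document is made up purely to show the effect):

```python
import gzip

# A toy stand-in for a repetitive XML feed document.
feed = (b"<entry><id>urn:uuid:0000</id>"
        b"<updated>2009-01-01T00:00:00Z</updated></entry>") * 100

compressed = gzip.compress(feed)

# Repetitive markup shrinks dramatically under gzip.
assert len(compressed) < len(feed)
```

In practice this is just HTTP content negotiation (Accept-Encoding: gzip) rather than anything feed-specific, so it costs little to enable.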
Julian has a suggestion:
"I would like to see the emergence of a genuine 'push' protocol for
web-based content. It doesn't have to be particularly complicated. To
illustrate what I have in mind, here is an example of a simple,
stateless protocol, built using XML over HTTP, like the current feed
formats. A subscriber sends a request
"According to the protocol, the provider sends the results after 10
seconds, or when there are 1000 records to return, whichever occurs
sooner. After it has received a result, the subscriber will typically
ask for the next set of rows with a higher rowtime threshold.
Even
though it is simple, the protocol ensures that data flows efficiently
for feeds of all data rates. For a high volume feed, the 1000 record
limit will be reached before the 10 second timeout, so latency
naturally decreases. For a low volume feed, many requests may time out
and return an empty result; but the 10 second wait limits the number of
requests per minute that the server has to handle."
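To make the batching rule concrete, here is a hedged sketch of the provider side of such a protocol - return when the record limit is reached or the timeout elapses, whichever comes first. The `fetch_since` callable, the polling tick, and the (rowtime, payload) row shape are assumptions drawn from the description above, not Julian's actual code:

```python
import time

def poll(fetch_since, rowtime, max_records=1000, timeout=10.0, tick=0.1):
    """Block until `max_records` rows newer than `rowtime` exist, or
    `timeout` seconds pass; then return whatever has accumulated.

    `fetch_since(rowtime)` is an assumed callable returning rows as
    (rowtime, payload) pairs, ordered by rowtime.
    """
    deadline = time.monotonic() + timeout
    while True:
        rows = fetch_since(rowtime)
        if len(rows) >= max_records or time.monotonic() >= deadline:
            return rows[:max_records]
        time.sleep(tick)

# High-volume case: plenty of rows exist, so the record limit trips
# before the timeout and latency stays low.
rows = poll(lambda t: [(r, "x") for r in range(t + 1, 5)],
            rowtime=0, max_records=3, timeout=0.5)
# returns the first three rows immediately
```

After each response the subscriber advances rowtime to the largest value it received and asks again; a quiet feed simply times out with an empty result, which bounds the request rate.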
The protocol is simple, but it assumes the data server can handle the load of pushing out the data and managing subscription state; it does nothing to manage that part of the architecture. Good client-server protocol designs (where "good" means scaling to large numbers of both) try to avoid or mitigate these kinds of asymmetries.
Back to latency. Many web sites scale on the basis of the data being latent - even a few minutes of staleness can make a huge engineering and operational difference, especially as your application grows beyond a single cluster (or geographic location). IMO the mapreduce pattern scales not just on parallelisation but on the data latency the results are allowed to have (which is why it gets used a lot for log/warehouse analytics and post-hoc querying). So if you demand real-time precision in the data, be aware that this can put stress on your server.
"Real time" requirements in turn might lead you towards a push model, but I think it's reasonable to say that we don't know how to do internet-scale push yet, at least not without creating asymmetries - it's hard to serve a lot of clients you have to send data to, and the problem gets harder as you add things like filtering, or long-held client connections that will have you ripping out those loadbalancers.
For push, I think XEP-0060 is worth looking at, even though (imo) we have yet to learn how to manage mass subscriptions; and if you are interested in systems architecture, Rohit Khare's ARRESTED model.