Jack Moffit:"Sorry, Twitter. Until we see some answers, you don’t have data, just a big mouth."
I think Jack Moffitt, always excellent, is being hard on Alex Payne and the Twitter gang. He is criticising Twitter for restricting access to the firehose - the XMPP stream of events, "tweets" in Twitter parlance. Jack alludes to a strategic reason for this, as in Twitter 'own' the data and therefore should own the derivative value obtained from analysing or reorganising it: "I don’t know the exact time that they started pruning the list of consumers of the firehose, but to me it seemed like this starting happening after Summize was acquired or around that time. The logical conclusion from this is that Twitter does not want more interesting things being built on top of its data."
Scale? It's received wisdom that heavy HTTP polling is stupid and wrong, whereas push is more efficient and scales better. The problem is there isn't much science or shared field experience on what it means to run a public XMPP data and notification endpoint with a lot of subscribers. When I say a lot, I mean 250,000 to 1M clients holding open connections to your server(s). Issues I've seen are that load balancing becomes a problem, db access costs dominate login times for clients, and XMPP server clustering isn't as far along as I'd thought it was. Scaling XMPP does not appear to be a commodity problem the way HTTP scaling is - you are back down to looking at whether the servers are using epoll/NIO; whether load balancing should be done by clients (remember the load balancers actually get in the way); how long it takes to log a user in, set up presence, rosters and so on; and what the cluster topology's graph connectivity measure is (S2S doesn't seem to be the answer). It's like being back in 2000 and wistfully reading Dan Kegel's c10k page.
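To make the epoll/NIO point concrete, here's a minimal sketch - purely illustrative Python, nothing to do with any actual XMPP server's code - of readiness-based I/O: one process holding lots of mostly-idle connections and only doing work when a socket becomes readable, rather than burning a thread per connection. Port 5222 is the usual XMPP client port; everything else is made up.

```
# Minimal readiness-based connection holder (illustrative sketch only).
# Python's selectors module uses epoll on Linux, kqueue on BSD/macOS.
import selectors
import socket

sel = selectors.DefaultSelector()

def accept(server_sock):
    conn, _addr = server_sock.accept()
    conn.setblocking(False)
    # Register the client for read-readiness; while idle, the connection
    # costs only a file descriptor and some buffer memory.
    sel.register(conn, selectors.EVENT_READ, handle)

def handle(conn):
    data = conn.recv(4096)
    if not data:                 # client went away
        sel.unregister(conn)
        conn.close()
        return
    # Placeholder: echo instead of XMPP stanzas. A real server would buffer
    # writes and handle partial sends instead of calling sendall() here.
    conn.sendall(data)

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", 5222))   # 5222: standard XMPP client port
server.listen(1024)
server.setblocking(False)
sel.register(server, selectors.EVENT_READ, accept)

while True:
    for key, _mask in sel.select():
        key.data(key.fileobj)    # dispatch to accept() or handle()
```

The echo loop isn't the point; the point is that holding the connections open is the cheap part, and everything that actually hurts - login, presence, rosters, clustering - sits above this layer.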
My suspicion is that services pushing out notifications to a number of subscribers (Sn) where that number is large are not yet a panacea for web polling's scaling issues, because there is a latent asymmetry in the cost of pushing events out as Sn grows, even though push is more performant and lower-latency for smaller values of Sn. And service providers will need to look carefully at graph theory, flooding and gossip/propagation models to get pub/sub notifications to meet web scale delivery - and at that point we'll be half-way to either a peer-to-peer model, or Usenet - take your pick ;)
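A rough back-of-envelope sketch of that asymmetry, with my own toy numbers rather than measurements: direct fan-out means the origin sends Sn copies of every event, while a flooding/gossip relay where each node forwards to a handful of peers caps the per-node send cost but pays for it in extra hops (and thus latency).

```
# Toy comparison of direct fan-out vs a gossip/flooding relay (sketch only).
import math

def direct_fanout(sn: int) -> dict:
    # Origin pushes one copy per subscriber: fine at Sn=1,000, painful at 1M.
    return {"sends_at_origin": sn, "hops": 1}

def gossip_relay(sn: int, fanout: int = 8) -> dict:
    # Each node (origin included) forwards to at most `fanout` peers,
    # so delivery takes roughly log_fanout(Sn) hops instead of one.
    hops = math.ceil(math.log(sn, fanout)) if sn > 1 else 1
    return {"sends_at_origin": fanout, "hops": hops}

for sn in (1_000, 250_000, 1_000_000):
    print(sn, direct_fanout(sn), gossip_relay(sn))
```

At Sn = 1M the origin goes from a million sends per event to eight, at the price of around seven hops - which is exactly the point where the architecture starts to smell like peer-to-peer or Usenet.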