This post originated from an RSS feed registered with Python Buzz
by Phillip Pearson.
Original Post: Personal whole-blogosphere crawlers - still feasible?
Feed Title: Second p0st
Feed URL: http://www.myelin.co.nz/post/rss.xml
Feed Description: Tech notes and web hackery from the guy that brought you bzero, Python Community Server, the Blogging Ecosystem and the Internet Topic Exchange
Back when I started blogging in 2002, several of us were running code that would download the front page of every blog when it updated. There were public sites like DayPop, Blogdex, my Blogging Ecosystem, and later on Technorati, but also one or two private crawlers.
I was wondering this morning: is it still feasible (if not necessarily sensible) to run a private crawler?
Say the blogosphere is generating something like 1.4 million updates a day that you need to pick up. That puts the lower bound on data transfer at something like 2k * 1.4M = 2.8G/day, assuming an average post size of 2k and the existence of a magic way of retrieving posts with no overhead. If you have to download the whole blog front page each time, it could be more like 50k * 1.4M = 70G/day, or just over 2T/month. RSS/Atom feeds should be a little smaller (mine is 36k compared to a 45k index page), and if you're lucky, you'll be able to use RFC 3229 delta encoding to reduce that a bit more.
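If you want to fiddle with the assumptions, here's that back-of-envelope arithmetic as a few lines of Python. The sizes and the 1.4M updates/day figure are just the rough numbers above, and daily_transfer is only an illustrative helper, not anything from a real crawler.

```python
# Back-of-envelope bandwidth estimate for crawling the whole blogosphere.
# Rough assumptions from the post: ~1.4M updates/day, ~2k per post,
# ~50k per front page, ~36k per RSS/Atom feed.

UPDATES_PER_DAY = 1_400_000

def daily_transfer(bytes_per_fetch, fetches_per_day=UPDATES_PER_DAY):
    """Return (GB/day, TB/month) for a given per-fetch payload size."""
    per_day = bytes_per_fetch * fetches_per_day
    return per_day / 1e9, per_day * 30 / 1e12

for label, size in [("posts only (magic, no overhead)", 2_000),
                    ("full front page",                 50_000),
                    ("RSS/Atom feed",                   36_000)]:
    gb_day, tb_month = daily_transfer(size)
    print(f"{label:35s} {gb_day:6.1f} GB/day  {tb_month:4.2f} TB/month")
```

Under those assumptions the feed-sized case comes out at roughly 50G/day, or about 1.5T/month, which is why the hosting allowance below is right on the line.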
So it's looking feasible. The worst case of 70G/day works out to an average of 6-7 megabits/sec around the clock; call it 8 megabits/sec to leave some headroom. Servers from LayeredTech come with a bandwidth limit of 1.5T/month, so if your code is fast enough to keep up that rate and you can take advantage of streams like the Six Apart Update Stream to reduce bandwidth where possible, you might be able to crawl the entire blogosphere on a not-too-expensive server.
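To make the "if you're lucky" part concrete, here's a minimal per-feed fetch sketch (not my old crawler, just an illustration) of the two tricks that keep the per-feed cost down: a conditional GET, and the RFC 3229 "A-IM: feed" request that asks for only the entries you haven't seen. fetch_feed and the saved etag/last-modified values are hypothetical, and it only saves anything where the server actually supports these.

```python
# Sketch of a cheap per-feed fetch: conditional GET plus RFC 3229 "feed"
# delta encoding. Only servers that support these will save you bandwidth.
import urllib.error
import urllib.request

def fetch_feed(url, etag=None, last_modified=None):
    """Fetch a feed, sending whatever validators we saved from the last crawl."""
    req = urllib.request.Request(url)
    req.add_header("A-IM", "feed")            # RFC 3229: send only new entries
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            # 200 = full feed; 226 ("IM Used") = just the entries we're missing
            return (resp.status, resp.read(),
                    resp.headers.get("ETag"),
                    resp.headers.get("Last-Modified"))
    except urllib.error.HTTPError as e:
        if e.code == 304:                     # unchanged since last fetch
            return e.code, b"", etag, last_modified
        raise

# e.g. status, body, etag, modified = fetch_feed("http://www.myelin.co.nz/post/rss.xml")
```

Every feed that answers 304 costs a few hundred bytes instead of 36k, and a 226 response only carries the new posts, which is how the average gets pulled well under that worst-case figure.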