This post originated from an RSS feed registered with Python Buzz
by Phillip Pearson.
Original Post: Personal whole-blogosphere crawlers - still feasible?
Feed Title: Second p0st
Feed URL: http://www.myelin.co.nz/post/rss.xml
Feed Description: Tech notes and web hackery from the guy that brought you bzero, Python Community Server, the Blogging Ecosystem and the Internet Topic Exchange
Back when I started blogging in 2002, several of us were running code that would download the front page of every blog when it updated. There were public sites like DayPop, Blogdex, my Blogging Ecosystem, and later on Technorati, but also one or two private crawlers.
I was wondering this morning: is it still feasible (if not necessarily sensible) to run a private crawler?
Say the blogosphere is generating something like 1.4 million updates a day that you need to pick up. That puts the lower bound on data transfer at something like 2k * 1.4M = 2.8G/day, assuming an average post size of 2k and the existence of a magic way of retrieving posts with no overhead. If you have to download the whole blog front page each time, it could be more like 50k * 1.4M = 70G/day, or just over 2T/month. RSS/Atom feeds should be a little smaller (mine is 36k compared to a 45k index page), and if you're lucky, you'll be able to use RFC 3229 delta encoding to reduce that a bit more.
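If you want to fiddle with the assumptions, here's that back-of-envelope arithmetic as a few lines of Python. The sizes and the 1.4M updates/day figure are just the rough numbers above, and daily_transfer is only an illustrative helper, not anything from a real crawler.

```python
# Back-of-envelope bandwidth estimate for crawling the whole blogosphere.
# Rough assumptions from the post: ~1.4M updates/day, ~2k per post,
# ~50k per front page, ~36k per RSS/Atom feed.

UPDATES_PER_DAY = 1_400_000

def daily_transfer(bytes_per_fetch, fetches_per_day=UPDATES_PER_DAY):
    """Return (GB/day, TB/month) for a given per-fetch payload size."""
    per_day = bytes_per_fetch * fetches_per_day
    return per_day / 1e9, per_day * 30 / 1e12

for label, size in [("posts only (magic, no overhead)", 2_000),
                    ("full front page",                 50_000),
                    ("RSS/Atom feed",                   36_000)]:
    gb_day, tb_month = daily_transfer(size)
    print(f"{label:35s} {gb_day:6.1f} GB/day  {tb_month:4.2f} TB/month")
```

Under those assumptions the feed-sized case comes out at roughly 50G/day, or about 1.5T/month, which is why the hosting allowance below is right on the line.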
So it's looking feasible. The worst case of 70G/day works out to an average of 6-7 megabits/sec around the clock; call it 8 megabits/sec to leave some headroom. Servers from LayeredTech come with a bandwidth limit of 1.5T/month, so if your code is fast enough to keep up that rate and you can take advantage of streams like the Six Apart Update Stream to reduce bandwidth where possible, you might be able to crawl the entire blogosphere on a not-too-expensive server.
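To make the "if you're lucky" part concrete, here's a minimal per-feed fetch sketch (not my old crawler, just an illustration) of the two tricks that keep the per-feed cost down: a conditional GET, and the RFC 3229 "A-IM: feed" request that asks for only the entries you haven't seen. fetch_feed and the saved etag/last-modified values are hypothetical, and it only saves anything where the server actually supports these.

```python
# Sketch of a cheap per-feed fetch: conditional GET plus RFC 3229 "feed"
# delta encoding. Only servers that support these will save you bandwidth.
import urllib.error
import urllib.request

def fetch_feed(url, etag=None, last_modified=None):
    """Fetch a feed, sending whatever validators we saved from the last crawl."""
    req = urllib.request.Request(url)
    req.add_header("A-IM", "feed")            # RFC 3229: send only new entries
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            # 200 = full feed; 226 ("IM Used") = just the entries we're missing
            return (resp.status, resp.read(),
                    resp.headers.get("ETag"),
                    resp.headers.get("Last-Modified"))
    except urllib.error.HTTPError as e:
        if e.code == 304:                     # unchanged since last fetch
            return e.code, b"", etag, last_modified
        raise

# e.g. status, body, etag, modified = fetch_feed("http://www.myelin.co.nz/post/rss.xml")
```

Every feed that answers 304 costs a few hundred bytes instead of 36k, and a 226 response only carries the new posts, which is how the average gets pulled well under that worst-case figure.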