
Python Buzz Forum
Personal whole-blogosphere crawlers - still feasible?

Phillip Pearson

Posts: 1083
Nickname: myelin
Registered: Aug, 2003

Phillip Pearson is a Python hacker from New Zealand
Posted: Aug 20, 2007 4:06 PM

This post originated from an RSS feed registered with Python Buzz by Phillip Pearson.
Original Post: Personal whole-blogosphere crawlers - still feasible?
Feed Title: Second p0st
Feed URL: http://www.myelin.co.nz/post/rss.xml
Feed Description: Tech notes and web hackery from the guy that brought you bzero, Python Community Server, the Blogging Ecosystem and the Internet Topic Exchange


Back when I started blogging in 2002, several of us were running code that would download the front page of every blog when it updated. There were public sites like DayPop, Blogdex, my Blogging Ecosystem, and later on Technorati, but also one or two private crawlers.

I was wondering this morning: is it still feasible (if not necessarily sensible) to run a private crawler?

Dave Sifry's latest State of the Live Web report shows posting volume at about 1.4M posts/day.

This puts the lower bound on data transfer at something like 2k * 1.4M = 2.8G/day, assuming an average post size of 2k, and the existence of a magic way of retrieving posts with no overhead. If you have to download the whole blog front page each time, it could be more like 50k * 1.4M = 70G/day, or just over 2T/month. RSS/Atom feeds should be a little smaller (mine is 36k compared to a 45k index page), and if you're lucky, you'll be able to use RFC 3229 delta encoding to reduce that a bit more.
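
Spelling that arithmetic out (a quick sketch in Python; the 2k and 50k figures are just the rough averages assumed above, and it assumes a 30-day month):

    # Back-of-the-envelope bandwidth estimate for a whole-blogosphere crawler.
    # Assumes roughly 1.4M posts/day (Sifry's figure) and the average sizes guessed above.
    POSTS_PER_DAY = 1400000

    def transfer(bytes_per_post):
        """Return (GB/day, TB/month) for a given average download size per post."""
        per_day = bytes_per_post * POSTS_PER_DAY
        return per_day / 1e9, per_day * 30 / 1e12

    for label, size in [("best case, 2k/post", 2000), ("worst case, 50k/page", 50000)]:
        gb_day, tb_month = transfer(size)
        print("%-22s %6.1f GB/day  %5.2f TB/month" % (label, gb_day, tb_month))
    # -> roughly 2.8 GB/day best case vs 70 GB/day (just over 2T/month) worst case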

So it's looking feasible. Servers from LayeredTech have a bandwidth limit of 1.5T/month, which is a little under the 2T/month worst case, so if you have fast enough code (able to pull on average 8 megabits/sec down your pipe in the worst case) and can take advantage of streams like the Six Apart Update Stream to reduce bandwidth where possible, you might be able to crawl the entire blogosphere on a not-too-expensive server.
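
To make the feed-polling side concrete, here's a rough sketch of the kind of request I mean (using Python's standard urllib.request; the A-IM: feed header is the convention the RFC 3229-for-feeds proposal uses, and not every server will honour it):

    import urllib.request
    import urllib.error

    def fetch_feed(url, etag=None, last_modified=None):
        """Poll one feed politely: conditional GET plus a request for an
        RFC 3229 'feed' delta, so an unchanged feed costs a few hundred bytes
        and a changed one ideally returns only the new entries (status 226)."""
        req = urllib.request.Request(url)
        req.add_header("A-IM", "feed")       # ask for a delta, if the server supports it
        if etag:
            req.add_header("If-None-Match", etag)
        if last_modified:
            req.add_header("If-Modified-Since", last_modified)
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                body = resp.read()           # 226 = delta response, 200 = full feed
                return resp.status, body, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
        except urllib.error.HTTPError as e:
            if e.code == 304:                # nothing new since the last crawl
                return 304, b"", etag, last_modified
            raise

A crawler loop would hang onto the returned ETag and Last-Modified values and send them back on the next pass; most of the savings come from the cheap 304s on blogs that haven't updated.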

