Hadoop up and running... now what to do with it?
I've had a pile of relatively powerful computers (one Athlon 1333, one Athlon XP 2100+ and two Athlon XP 2800+) sitting around doing nothing for the last little while. I've been meaning to hook them all up, along with the various laptops in the house, into some sort of cluster to play around with. I finally got around to installing Hadoop on a couple of them yesterday, so now I have a little mini-cluster here.
Hadoop is interesting; my initial expectation was that it would be a general distributed task/job handling system that happens to support Map/Reduce-type workflows, but from the looks of things it's the other way around - it's built entirely to run Map/Reduce jobs (in particular ones that chew through large numbers of records from log files or databases), and it can possibly be hacked to handle more general distributed jobs.
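To give a sense of the kind of job it handles naturally, here's a minimal sketch using Hadoop Streaming, which runs plain stdin/stdout scripts as the mapper and reducer. The "count log records per key" task and the choice of key (the first whitespace-separated field of each line) are just made up for illustration:

    #!/usr/bin/env python
    # mapper.py: emit "key<TAB>1" for each input line; the key here is the
    # first whitespace-separated field, an arbitrary choice for illustration.
    import sys

    for line in sys.stdin:
        fields = line.split()
        if fields:
            print("%s\t1" % fields[0])

    #!/usr/bin/env python
    # reducer.py: streaming sorts mapper output by key before the reduce step,
    # so all lines for a given key arrive together; just sum the counts.
    import sys

    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key and current_key is not None:
            print("%s\t%d" % (current_key, count))
            count = 0
        current_key = key
        count += int(value)
    if current_key is not None:
        print("%s\t%d" % (current_key, count))

You'd kick it off with something along the lines of hadoop jar contrib/streaming/hadoop-*-streaming.jar -input logs -output counts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (the exact jar path depends on the Hadoop version).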
The one distributed job I want to try to run on it doesn't quite fit the Map/Reduce model, but that's probably just because of how it's structured right now. It's an "embarrassingly parallel" type problem with lots of brute-force testing (machine learning stuff) that boils down to something like average { map { evaluate(best_of(map { test_candidate } find_candidates)) } (split $input) }.
I guess the way to do it is to run it as three consecutive Map/Reduce jobs.
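One plausible way it might break down, written as a local Python sketch rather than actual Hadoop jobs; find_candidates, test_candidate, best_of and evaluate are the stand-in functions from the pseudocode above, and the chunk_id keys are just how I'd imagine threading the per-chunk state through:

    # Job 1: map each input chunk to (chunk_id, (score, candidate)) pairs,
    # then reduce per chunk_id by keeping the best-scoring candidate (best_of).
    def job1_map(chunk_id, chunk):
        for candidate in find_candidates(chunk):
            yield chunk_id, (test_candidate(candidate, chunk), candidate)

    def job1_reduce(chunk_id, scored_candidates):
        best_score, best_candidate = max(scored_candidates, key=lambda sc: sc[0])
        yield chunk_id, best_candidate

    # Job 2: map-only pass that evaluates each chunk's best candidate,
    # re-keying everything under a single key for the final job.
    def job2_map(chunk_id, best_candidate):
        yield "all", evaluate(best_candidate)

    # Job 3: reduce everything under that one key down to a single average.
    def job3_reduce(key, evaluations):
        evaluations = list(evaluations)
        yield key, sum(evaluations) / len(evaluations)

In real Hadoop terms each of these would be its own job, chained by writing one job's output where the next job reads its input.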
An interesting observation: although the model is named Map/Reduce, the initial step of splitting the data up is just as important. I get the impression that a lot of the work that's gone into Hadoop is in intelligently dividing data between workers.
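For contrast, here's the naive version of that split step - hand out roughly equal shares of the input lines to N workers. Hadoop's splitter does considerably more than this: it cuts input at (roughly) HDFS block boundaries and tries to schedule each map task on a node that already holds that block, so the data doesn't have to move.

    # Naive round-robin split of input lines across n_workers chunks.
    def split_input(lines, n_workers):
        chunks = [[] for _ in range(n_workers)]
        for i, line in enumerate(lines):
            chunks[i % n_workers].append(line)
        return chunks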