Hadoop up and running... now what to do with it?
I've had a pile of relatively powerful computers (one Athlon 1333, one Athlon XP 2100+ and two Athlon XP 2800+) sitting around doing nothing for the last little while. I've been meaning to hook them all up, along with the various laptops in the house, into some sort of cluster to play around with. I finally got around to installing Hadoop on a couple of them yesterday, so now I have a little mini-cluster here.
Hadoop is interesting; my initial expectation was that it would be a general distributed task/job handling system that happens to support Map/Reduce-type workflows, but from the looks of things it's the other way around - it's built entirely to run Map/Reduce jobs (in particular ones that chew through large numbers of records from log files or databases), and it can possibly be hacked to handle more general distributed jobs.
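To give a sense of the kind of job it handles naturally, here's a minimal sketch using Hadoop Streaming, which runs plain stdin/stdout scripts as the mapper and reducer. The "count log records per key" task and the choice of key (the first whitespace-separated field of each line) are just made up for illustration:

    #!/usr/bin/env python
    # mapper.py: emit "key<TAB>1" for each input line; the key here is the
    # first whitespace-separated field, an arbitrary choice for illustration.
    import sys

    for line in sys.stdin:
        fields = line.split()
        if fields:
            print("%s\t1" % fields[0])

    #!/usr/bin/env python
    # reducer.py: streaming sorts mapper output by key before the reduce step,
    # so all lines for a given key arrive together; just sum the counts.
    import sys

    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current_key and current_key is not None:
            print("%s\t%d" % (current_key, count))
            count = 0
        current_key = key
        count += int(value)
    if current_key is not None:
        print("%s\t%d" % (current_key, count))

You'd kick it off with something along the lines of hadoop jar contrib/streaming/hadoop-*-streaming.jar -input logs -output counts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (the exact jar path depends on the Hadoop version).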
The one distributed job I want to try to run on it doesn't quite fit the Map/Reduce model, but that's probably just because of how it's structured right now. It's an "embarrassingly parallel" type problem with lots of brute-force testing (machine learning stuff) that boils down to something like average { map { evaluate(best_of(map { test_candidate } find_candidates)) } (split $input) }.
I guess the way to do it is to run it as three consecutive Map/Reduce jobs.
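One plausible way it might break down, written as a local Python sketch rather than actual Hadoop jobs; find_candidates, test_candidate, best_of and evaluate are the stand-in functions from the pseudocode above, and the chunk_id keys are just how I'd imagine threading the per-chunk state through:

    # Job 1: map each input chunk to (chunk_id, (score, candidate)) pairs,
    # then reduce per chunk_id by keeping the best-scoring candidate (best_of).
    def job1_map(chunk_id, chunk):
        for candidate in find_candidates(chunk):
            yield chunk_id, (test_candidate(candidate, chunk), candidate)

    def job1_reduce(chunk_id, scored_candidates):
        best_score, best_candidate = max(scored_candidates, key=lambda sc: sc[0])
        yield chunk_id, best_candidate

    # Job 2: map-only pass that evaluates each chunk's best candidate,
    # re-keying everything under a single key for the final job.
    def job2_map(chunk_id, best_candidate):
        yield "all", evaluate(best_candidate)

    # Job 3: reduce everything under that one key down to a single average.
    def job3_reduce(key, evaluations):
        evaluations = list(evaluations)
        yield key, sum(evaluations) / len(evaluations)

In real Hadoop terms each of these would be its own job, chained by writing one job's output where the next job reads its input.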
An interesting observation: although the model is named Map/Reduce, the initial step of splitting the data up is just as important. I get the impression that a lot of the work that's gone into Hadoop is in intelligently dividing data between workers.
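For contrast, here's the naive version of that split step - hand out roughly equal shares of the input lines to N workers. Hadoop's splitter does considerably more than this: it cuts input at (roughly) HDFS block boundaries and tries to schedule each map task on a node that already holds that block, so the data doesn't have to move.

    # Naive round-robin split of input lines across n_workers chunks.
    def split_input(lines, n_workers):
        chunks = [[] for _ in range(n_workers)]
        for i, line in enumerate(lines):
            chunks[i % n_workers].append(line)
        return chunks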