The Artima Developer Community

Python Buzz Forum
Hadoop up and running... now what to do with it?

Phillip Pearson


Phillip Pearson is a Python hacker from New Zealand
Hadoop up and running... now what to do with it? Posted: Jul 6, 2008 5:03 PM

This post originated from an RSS feed registered with Python Buzz by Phillip Pearson.
Original Post: Hadoop up and running... now what to do with it?
Feed Title: Second p0st
Feed URL: http://www.myelin.co.nz/post/rss.xml
Feed Description: Tech notes and web hackery from the guy that brought you bzero, Python Community Server, the Blogging Ecosystem and the Internet Topic Exchange


I've had a pile of unused but relatively powerful computers (one Athlon 1333, one Athlon XP 2100+ and two Athlon XP 2800+) sitting around doing nothing for a while now. I've been meaning to hook them all up, along with the various laptops in the house, into some sort of cluster to play around with. I finally got around to installing Hadoop on a couple of them yesterday, so now I have a little mini-cluster here.

Hadoop is interesting; my initial expectation was that it would be a general distributed task/job handling system that happens to handle Map/Reduce-style workflows, but from the looks of things it's more the other way around: it's built entirely to run Map/Reduce jobs (in particular, ones that process large numbers of records from log files or databases) and can possibly be hacked to handle general distributed jobs.

The one distributed job I want to try to run on it doesn't quite fit the Map/Reduce model, but that's probably just because of how it's structured right now. It's an "embarrassingly parallel" type problem with lots of brute-force testing (machine learning stuff) that boils down to something like average { map { evaluate(best_of(map { test_candidate } find_candidates)) } (split $input) }.

I guess the way to do it is to run it as three consecutive Map/Reduce jobs:

1. Only Map: $sets_of_candidates = map { find_candidates } (split $input)

2. Map/Reduce: $best_candidates = best_of { map { test } ($sets_of_candidates) }

3. Map/Reduce: $evaluation = average { map { evaluate } ($best_candidates) }
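One way to express step 2 is in the Hadoop-Streaming style, where the mapper scores each candidate and the reducer keeps the best candidate per input chunk. This is a minimal sketch, not real Hadoop API code; `test_candidate` is again a hypothetical stand-in, and each input line is assumed to be a tab-separated `(chunk_id, candidate)` pair:

```python
# Streaming-style sketch of step 2: map scores candidates,
# reduce keeps the best one per chunk. Input lines look like
# "chunk_id<TAB>candidate"; test_candidate is a placeholder.

def mapper(lines, test_candidate):
    """Emit (chunk_id, candidate, score) for each input candidate."""
    for line in lines:
        chunk_id, candidate = line.split("\t")
        yield chunk_id, candidate, test_candidate(candidate)

def reducer(mapped):
    """For each chunk, keep the candidate with the highest score."""
    best = {}
    for chunk_id, candidate, score in mapped:
        if chunk_id not in best or score > best[chunk_id][1]:
            best[chunk_id] = (candidate, score)
    return {chunk_id: cand for chunk_id, (cand, _) in best.items()}


# Usage, with len() playing the role of the scoring function:
lines = ["a\tfoo", "a\tlonger", "b\tx"]
best_candidates = reducer(mapper(lines, len))
```

Steps 1 and 3 would follow the same shape, with step 1 being map-only and step 3's reducer computing an average instead of a maximum.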

An interesting observation is that while the model is named Map/Reduce, the initial step, splitting the data up, is really important too. I get the impression that much of the work that's gone into Hadoop has been in the area of intelligently dividing data between workers.
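Hadoop does this splitting itself (for file input it works from HDFS block boundaries), but the basic idea can be sketched as naive contiguous chunking. This is a hypothetical illustration of the concept, not Hadoop's actual InputSplit logic:

```python
def make_splits(records, workers):
    """Naive splitting: contiguous blocks of records, one per worker.

    Real Hadoop splits by byte ranges aligned to HDFS blocks and
    also tries to schedule each split on a node that holds the data.
    """
    size = -(-len(records) // workers)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]
```

The part this sketch leaves out is exactly where the cleverness lives: sizing splits so workers finish at roughly the same time, and placing work near the data.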




Copyright © 1996-2019 Artima, Inc. All Rights Reserved.