The Artima Developer Community

Java Community News
MapReduce: A Major Step Backwards

6 replies on 1 page. Most recent reply: Jan 24, 2008 8:57 AM by James Watson

Frank Sommers

Posts: 2642
Nickname: fsommers
Registered: Jan, 2002

MapReduce: A Major Step Backwards Posted: Jan 21, 2008 11:56 AM
Summary
In an in-depth opinion piece, database luminaries Mike Stonebraker and David DeWitt compare the recently popular MapReduce distributed computing framework with traditional database techniques.

Arguing that,

Computer science communities tend to be insular and do not read the literature of other communities,

Mike Stonebraker and David DeWitt compare the recently popular MapReduce distributed computing framework with more traditional database techniques in MapReduce: A major step backwards.

Stonebraker and DeWitt themselves invented and described some of the most important parallel and scalable database technologies in use today, and are thus able to provide an informed, in-depth comparison. Their main point addresses some of the hype surrounding MapReduce:

As both educators and researchers, we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications. MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is:

  • A giant step backward in the programming paradigm for large-scale data intensive applications,
  • A sub-optimal implementation, in that it uses brute force instead of indexing
  • Not novel at all—it represents a specific implementation of well known techniques developed nearly 25 years ago
  • Missing most of the features that are routinely included in current DBMS
  • Incompatible with all of the tools DBMS users have come to depend on

DeWitt and Stonebraker start out by providing a concise description of MapReduce:

The basic idea of MapReduce is straightforward. It consists of two programs that the user writes called map and reduce plus a framework for executing a possibly large number of instances of each program on a compute cluster...

The map program reads a set of "records" from an input file, does any desired filtering and/or transformations, and then outputs a set of records of the form (key, data). As the map program produces output records, a "split" function partitions the records into M disjoint buckets by applying a function to the key of each output record. This split function is typically a hash function, though any deterministic function will suffice. When a bucket fills, it is written to disk. The map program terminates with M output files, one for each bucket... In general, there are multiple instances of the map program running on different nodes of a compute cluster.

The second phase of a MapReduce job executes M instances of the reduce program, R_j, 1 ≤ j ≤ M. The input for each reduce instance R_j consists of the files F_{i,j}, 1 ≤ i ≤ N... After being collected by the map-reduce framework, the input records to a reduce instance are grouped on their keys (by sorting or hashing) and fed to the reduce program.

To draw an analogy to SQL, map is like the group-by clause of an aggregate query. Reduce is analogous to the aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute.
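
As a concrete illustration of the pipeline described above, here is a minimal, framework-free Java sketch of a word-count job; all class and method names are invented for this example and are not taken from the paper or from any actual MapReduce implementation:

    import java.util.*;

    // A toy, single-process sketch of the map/split/reduce pipeline described
    // above, using word count as the job. No MapReduce framework is involved;
    // all names are invented for illustration.
    public class WordCountSketch {

        // map: read one input record (here, a line of text) and emit (key, data) pairs.
        static List<Map.Entry<String, Integer>> map(String line) {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String word : line.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    out.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
            return out;
        }

        // split: assign an output record to one of M buckets by hashing its key.
        static int split(String key, int m) {
            return Math.floorMod(key.hashCode(), m);
        }

        // reduce: given one key and all the data values collected for it, aggregate.
        static int reduce(String key, List<Integer> counts) {
            int sum = 0;
            for (int c : counts) sum += c;
            return sum;
        }

        public static void main(String[] args) {
            String[] input = { "the map program reads records", "the reduce program aggregates" };
            int m = 2; // number of buckets, i.e. reduce instances

            // Run map over every input record and partition the output by key.
            List<Map<String, List<Integer>>> buckets = new ArrayList<>();
            for (int j = 0; j < m; j++) buckets.add(new TreeMap<>());
            for (String line : input) {
                for (Map.Entry<String, Integer> rec : map(line)) {
                    buckets.get(split(rec.getKey(), m))
                           .computeIfAbsent(rec.getKey(), k -> new ArrayList<>())
                           .add(rec.getValue());
                }
            }

            // Each "reduce instance" processes one bucket, grouped by key.
            for (int j = 0; j < m; j++) {
                for (Map.Entry<String, List<Integer>> group : buckets.get(j).entrySet()) {
                    System.out.println(group.getKey() + "\t" + reduce(group.getKey(), group.getValue()));
                }
            }
        }
    }

Running it prints each word with its total count; in the SQL analogy above, the key emitted by map plays the role of the group-by attribute and reduce plays the role of the aggregate function.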

Stonebraker and DeWitt point out that MapReduce is frequently touted as a replacement for traditional data management techniques and tools, such as relational databases. The advantage of MapReduce compared with other data management systems is that it provides a high degree of fault tolerance, but, according to the authors, MapReduce ignores the most important lessons learned from hard-won experience:

The database community has learned the following three lessons from the 40 years that have unfolded since IBM first released IMS in 1968:

  • Schemas are good.
  • Separation of the schema from the application is good.
  • High-level access languages are good.

MapReduce has learned none of these lessons and represents a throwback to the 1960s, before modern DBMSs were invented...

The DBMS community learned the importance of schemas, whereby the fields and their data types are recorded in storage. More importantly, the run-time system of the DBMS can ensure that input records obey this schema. This is the best way to keep an application from adding "garbage" to a data set. MapReduce has no such functionality, and there are no controls to keep garbage out of its data sets. A corrupted MapReduce dataset can actually silently break all the MapReduce applications that use that dataset...
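
To make the schema point concrete, here is a small, hypothetical Java sketch (the record layout and every name in it are invented for illustration) of a map-style function reading a shared, schema-less file of tab-separated records; every program that touches the file has to repeat this kind of defensive parsing, and a malformed record simply disappears from the result:

    import java.util.*;

    // Hypothetical sketch only: with no schema enforced when a dataset is written,
    // every program that reads it must guess the record layout and defend against
    // garbage on its own. Records here are assumed to be tab-separated lines of
    // the form: userId, country, amount.
    public class NoSchemaExample {

        // A map-style function that totals 'amount' per country.
        static void map(String line, Map<String, Double> totals) {
            String[] fields = line.split("\t");
            if (fields.length != 3) {
                return; // silently drop malformed records -- or should the whole job fail?
            }
            try {
                totals.merge(fields[1], Double.parseDouble(fields[2]), Double::sum);
            } catch (NumberFormatException e) {
                // A DBMS would have rejected this value at insert time; here every
                // application sharing the file must repeat this check, or break.
            }
        }

        public static void main(String[] args) {
            String[] dataset = {
                "u1\tUS\t12.50",
                "u2\tDE\t7.00",
                "u3\tUS\tnot-a-number"   // the kind of garbage a schema would reject
            };
            Map<String, Double> totals = new TreeMap<>();
            for (String record : dataset) map(record, totals);
            System.out.println(totals); // {DE=7.0, US=12.5} -- the bad record vanished silently
        }
    }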

Stonebraker and DeWitt also point out that MapReduce's lack of indexing can limit its scalability, and that many MapReduce implementations ignore lessons learned over the past thirty years from constructing parallel databases.

Where do you think MapReduce and other grid-like computing techniques fit in an enterprise infrastructure?


Brian Slesinsky

Posts: 43
Nickname: skybrian
Registered: Sep, 2003

Re: MapReduce: A Major Step Backwards Posted: Jan 21, 2008 12:53 PM
I think they're just confused. Of course doing a query using an index is far faster when one exists, but the index has to be built first. MapReduce is a fine tool for building custom indexes.
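
For example, building an inverted index (term -> documents containing it) fits the map/reduce shape directly. A toy single-process sketch, with all names invented for illustration:

    import java.util.*;

    // Toy single-process sketch (all names invented): using the map/reduce shape
    // to build an inverted index, i.e. term -> documents containing that term.
    public class InvertedIndexSketch {

        // map: for each (docId, text) record, emit (term, docId) pairs.
        static List<Map.Entry<String, String>> map(String docId, String text) {
            List<Map.Entry<String, String>> out = new ArrayList<>();
            for (String term : text.toLowerCase().split("\\W+")) {
                if (!term.isEmpty()) out.add(new AbstractMap.SimpleEntry<>(term, docId));
            }
            return out;
        }

        // reduce: for one term, collapse all emitted docIds into a sorted posting list.
        static SortedSet<String> reduce(String term, List<String> docIds) {
            return new TreeSet<>(docIds);
        }

        public static void main(String[] args) {
            String[][] docs = {
                { "doc1", "MapReduce is a fine tool" },
                { "doc2", "a fine index needs building first" }
            };

            // Group the map output by term (the framework's shuffle/sort step).
            Map<String, List<String>> grouped = new TreeMap<>();
            for (String[] doc : docs) {
                for (Map.Entry<String, String> rec : map(doc[0], doc[1])) {
                    grouped.computeIfAbsent(rec.getKey(), k -> new ArrayList<>()).add(rec.getValue());
                }
            }

            // Each reduce call produces one posting list of the index.
            for (Map.Entry<String, List<String>> group : grouped.entrySet()) {
                System.out.println(group.getKey() + " -> " + reduce(group.getKey(), group.getValue()));
            }
        }
    }

The index is what makes fast lookups possible afterward; the map/reduce job is just the batch tool that produces it.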

Morel Xavier

Posts: 73
Nickname: masklinn
Registered: Sep, 2005

Re: MapReduce: A Major Step Backwards Posted: Jan 21, 2008 11:28 PM
The part Stonebraker and DeWitt seem to have missed, which was discussed in depth in responses to their article, and which you also seem to have missed, is that MapReduce is not a database technology, and it's not supposed to be one or to replace one.

Thus, all criticisms of MapReduce based on it not being a good enough database are, well, pretty strange.

See http://typicalprogrammer.com/programming/mapreduce/

Also, the remark about MapReduce's scalability is probably their most bizarre claim. One would think that a technology able to process 20 terabytes a day (the latest figure from Google) would be considered able to scale. And since MapReduce is not a database, indexes don't make sense for it.

Not to mention, as others have pointed out, that MapReduce is the kind of technology that can be used to create those indexes in the first place.

Morel Xavier

Posts: 73
Nickname: masklinn
Registered: Sep, 2005

Re: MapReduce: A Major Step Backwards Posted: Jan 22, 2008 10:43 AM
Another very in-depth article on Good Math, Bad Math explains why DeWitt and Stonebraker pretty much completely missed the point of MapReduce in their rant: http://scienceblogs.com/goodmath/2008/01/databases_are_hammers_mapreduc.php

David Biesack

Posts: 8
Nickname: djb
Registered: Jan, 2008

Re: MapReduce: A Major Step Backwards Posted: Jan 23, 2008 8:42 AM
Note, however, that the Yahoo! Research Pig project is an attempt to build a distributed data/analysis system on top of a MapReduce (Apache/Lucene Hadoop) framework, and it includes a query-like language, Pig Latin. See http://research.yahoo.com/node/90 and http://incubator.apache.org/pig/ . This does not make MapReduce a database system; rather, Pig uses MapReduce for the distribution of operations and the reduction of results. I've not used it, so I do not know whether any of the criticisms of MapReduce apply to Pig.

Mike O'Keefe

Posts: 16
Nickname: kupci2
Registered: Mar, 2005

Re: MapReduce: A Major Step Backwards Posted: Jan 23, 2008 6:56 PM
> I think they're just confused.

Nah, they're not confused. These are some of the top DBMS folks in the industry. DeWitt is a U.W. professor, the kind of top talent that earns U.W. its claim of "Number one database school in the country!", as one Indian grad student proudly told me.
http://pages.cs.wisc.edu/~dewitt/

Michael Stonebraker is the guy behind INGRES.
http://www.ieee.org/portal/pages/about/awards/bios/2005vonneumann.html

I'll lean more towards Rob McCool's comment (is this the Rob McCool?) on the article, where he compares this argument to the one between AST and Torvalds in the early days of Linux, saying "they are both correct, and irrelevant".

"kids, get off the lawn!" hilarious.

James Watson

Posts: 2024
Nickname: watson
Registered: Sep, 2005

Re: MapReduce: A Major Step Backwards Posted: Jan 24, 2008 8:57 AM
> > I think they're just confused.
>
> Nah, they're not confused. These are some of the top DBMS
> folks in the industry.

Talking about something that isn't a DBMS.
