The Artima Developer Community
Java Buzz Forum
Hadoop and Mike Cannon-Brookes on using Lucene for Data rather than Text

Dion Almaer is the Editor-in-Chief for TheServerSide.com, and is an enterprise Java evangelist
Hadoop and Mike Cannon-Brookes on using Lucene for Data rather than Text Posted: Mar 22, 2007 8:58 PM

This post originated from an RSS feed registered with Java Buzz by dion.
Original Post: Hadoop and Mike Cannon-Brookes on using Lucene for Data rather than Text
Feed Title: techno.blog(Dion)
Feed URL: http://feeds.feedburner.com/dion
Feed Description: blogging about life the universe and everything tech


Mike kindly started the presentation with a consumer warning, letting us know in advance that he was going to be pimping JIRA (because this was going to be case study-esque).

These days JIRA uses Lucene for "Generic Data Indexing": fast retrieval of complex data objects. This isn't about text searching for "dog" sorted by relevance. The statistics pages all come back from a Lucene index, not from the DB.
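To make the idea concrete, here is a minimal sketch of "generic data indexing" in plain Java. This is a toy model, not Lucene's API and not JIRA's actual code: each issue is a bag of field/value pairs, and an inverted index maps field:value keys to issue ids, so an aggregate query like a statistics page can be answered from the index without touching the primary store.

```java
import java.util.*;

// Toy inverted index over structured fields (not Lucene's API).
public class FieldIndex {
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // Index one object: every field/value pair becomes a posting list entry.
    public void add(int id, Map<String, String> fields) {
        for (Map.Entry<String, String> e : fields.entrySet()) {
            index.computeIfAbsent(e.getKey() + ":" + e.getValue(),
                                  k -> new TreeSet<>()).add(id);
        }
    }

    // Answer a "statistics page" style count straight from the index.
    public int count(String field, String value) {
        return index.getOrDefault(field + ":" + value,
                                  Collections.emptySet()).size();
    }

    public static void main(String[] args) {
        FieldIndex idx = new FieldIndex();
        idx.add(1, Map.of("status", "open", "priority", "critical"));
        idx.add(2, Map.of("status", "open", "priority", "minor"));
        idx.add(3, Map.of("status", "closed", "priority", "critical"));
        System.out.println(idx.count("status", "open"));       // 2
        System.out.println(idx.count("priority", "critical")); // 2
    }
}
```

Lucene does essentially this (plus analysis, persistence, and scoring) when you store non-text fields in a Document, which is why the pattern scales past plain text search.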

Lucene has a way for you to write your own sort routines via the Sort and SortField classes.

I have seen the "viral Lucene" pattern apply in a variety of projects. You start out using it for /search, and then you see that you can use it for other things. Slowly your DB is doing less, and your Lucene indexes are growing. This is a killer open source project, even if the API is a little weird.

Hadoop: Open Source MapReduce

I had a couple of people ask "why hasn't Google open sourced its MapReduce?" They didn't know about Hadoop:

Hadoop is a framework for running applications on large clusters of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
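The map/reduce paradigm described above can be sketched in a few lines of plain Java. This is a toy word count, not Hadoop's API: the "map" step emits words from each input fragment, and the "reduce" step sums the counts per word. In Hadoop, those fragments would be distributed across cluster nodes and re-executed on failure.

```java
import java.util.*;
import java.util.stream.*;

// Minimal map/reduce sketch (not Hadoop's API): word count over lines.
public class WordCount {
    public static Map<String, Long> mapReduce(List<String> lines) {
        return lines.stream()
            // map: each line fragment -> a stream of (word) emissions
            .flatMap(line -> Arrays.stream(line.split("\\s+")))
            // reduce: group identical words and sum their counts
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = mapReduce(List.of("the cat", "the dog"));
        System.out.println(counts.get("the")); // 2
    }
}
```

Because the map step is independent per fragment and the reduce step only sees grouped values, both stages parallelize naturally, which is what lets the framework spread the work over a cluster and rerun lost fragments.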

The intent is to scale Hadoop up to handling thousands of computers. Hadoop has been tested on clusters of 600 nodes.

Hadoop is a Lucene sub-project that contains the distributed computing platform that was formerly a part of Nutch. This includes the Hadoop Distributed Filesystem (HDFS) and an implementation of map/reduce.

For more information about Hadoop, please see the Hadoop wiki.

Christophe Bisciglia of the open source group has put great effort into University of Washington classes where Hadoop is used in the curriculum.

