Big data adoption has been growing by leaps and bounds over the past few years, which has necessitated new technologies to analyze that data holistically. Individual big data solutions provide their own mechanisms for data analysis, but how do you analyze data that is contained in Hadoop, Splunk, files on a file system, a local database, and so forth?
The answer is that you need an abstraction that can pull data from all of these sources and analyze potentially petabytes of information very rapidly.
Spark is a computational engine that manages tasks across a collection of worker machines in what is called a computing cluster. It provides the necessary abstraction, integrates with a host of different data sources, and analyzes data very quickly. This installment in the Open source Java projects series introduces Spark, describes how to set up a local environment, and demonstrates how to use Spark to derive business value from your data.