This article is sponsored by the Java Community Process.
A key purpose of data mining is either to help explain the past, or to try to predict the future based on past data. To those ends, data-mining techniques help identify patterns in a vast data store, and then build models that concisely represent those patterns. Such models capture the essential characteristics of the underlying data, helping humans gain new insights—and knowledge—from that data.
Data mining differs from other data access mechanisms both in process and technique. "When you formulate theories and test them out, that's deductive reasoning. By contrast, data mining is inductive," says Oracle's Hornick. "You often hear people say, 'Oh, I do data mining.' But they're just doing queries. Even a complex SQL query executing over large amounts of data is merely the extraction of detailed summary data. OLAP [online analytic processing] allows you to do slicing and dicing of data cubes, but that's still not data mining. Data mining gives you the ability to look into the future, to do predictions, to extract information that an individual would not have had any real capacity to discern because of the volume of the data and the complexity in that data."
A typical data-mining system consists of a data-mining engine and a repository that persists the artifacts created in the process, such as the models. The actual data is obtained via a database connection, or even via a filesystem API. A key benefit of the JDM API is that it abstracts the physical components, tasks, and even algorithms of a data-mining system into Java classes.
Building a data-mining model typically starts with identifying recurring patterns in the data, and then distilling those patterns in a way that helps communicate them to humans or other machines. Models can take the form of a graphical representation, a set of equations, a neural network, or even a collection of rules. Models can be applied to new data, or evaluated and refined in the presence of ever larger data sets. The process can be summarized as follows:
Decide what you want to learn. The first, and most important, step is to decide what kinds of new knowledge or insight you want to gain from the data. The more specific you are, the more likely your data-mining process is to succeed.
For instance, you may want to find out what factors led to a successful holiday sales promotion so that you can recommend strategies for future sales events. Or, you may wish to predict which customers are likely to buy a product you are about to introduce. In another instance, you may want to identify potential galaxies from a digital sky image database. Or, you may want to learn what articles to show together on a Web site to improve user experience.
Select and prepare your data. Once you've decided on the objective, you must identify the data that you think may help you achieve those goals. For instance, sales records and lists of advertising placements may be relevant to deciding what factors contributed to sales success. Web server log files, in turn, may be relevant to gaining insight into Web site usage. Initially, you may wish to select only the subset of the available data that you believe is most representative of what you wish to find out. You can later select additional data subsets to improve your initial findings.
Relevant data sets are seldom in a format suited to data mining. Often, you must transform that data, possibly cleaning it (eliminating incomplete records, for instance) and sometimes also preprocessing it. For example, you may need to combine sales data with customer demographic data, or Web log records with user account data.
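As a rough illustration of that preparation step, the following minimal sketch drops incomplete sales records and joins the remainder with demographic data. It uses plain Java collections rather than the JDM API, and the `Sale` and `Row` types, the `prepare` method, and the age-group field are all hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PrepareData {
    /** Hypothetical raw sales record; either field may be missing (null). */
    static final class Sale {
        final String customerId;
        final Double amount;
        Sale(String customerId, Double amount) {
            this.customerId = customerId;
            this.amount = amount;
        }
    }

    /** Hypothetical cleaned and joined row, ready to hand to a mining task. */
    static final class Row {
        final String customerId;
        final double amount;
        final String ageGroup;
        Row(String customerId, double amount, String ageGroup) {
            this.customerId = customerId;
            this.amount = amount;
            this.ageGroup = ageGroup;
        }
    }

    /** Cleans the sales records and combines them with demographic data. */
    static List<Row> prepare(List<Sale> sales, Map<String, String> ageGroupByCustomer) {
        List<Row> rows = new ArrayList<>();
        for (Sale sale : sales) {
            // Cleaning: eliminate incomplete records.
            if (sale.customerId == null || sale.amount == null) {
                continue;
            }
            // Combining: keep only sales with a matching demographic record.
            String ageGroup = ageGroupByCustomer.get(sale.customerId);
            if (ageGroup == null) {
                continue;
            }
            rows.add(new Row(sale.customerId, sale.amount, ageGroup));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Sale> sales = Arrays.asList(
                new Sale("c1", 20.0),
                new Sale(null, 5.0),    // incomplete: no customer
                new Sale("c2", null));  // incomplete: no amount
        Map<String, String> ageGroups = new HashMap<>();
        ageGroups.put("c1", "25-34");
        for (Row row : prepare(sales, ageGroups)) {
            System.out.println(row.customerId + " " + row.amount + " " + row.ageGroup);
        }
    }
}
```

Whether to discard incomplete records or fill in the missing values depends on the data and the objective; discarding is simply the easiest choice to show here.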
Choose and configure the mining tasks. Next, you should decide on the specific data-mining task to perform. For instance, you may wish to cluster together users who visited similar Web pages, and then derive association rules that show how those users and pages are related. Those rules, in turn, can help evaluate, or "score," new Web pages and readers, and decide what links to place on those pages for new visitors. For a sales promotion, the task may be to select the features most relevant to a successful sales event, and then quantify how those features affect sales.
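To make the notion of an association rule more concrete, here is a small sketch that measures a rule of the form "visitors who viewed page A also viewed page B" using the two standard measures, support and confidence. It is plain Java, not JDM API code; the `PageRule` class, its method names, and the page names are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PageRule {
    /** Support: fraction of all sessions that contain both pages. */
    static double support(List<Set<String>> sessions, String a, String b) {
        if (sessions.isEmpty()) {
            return 0.0;
        }
        int both = 0;
        for (Set<String> pages : sessions) {
            if (pages.contains(a) && pages.contains(b)) {
                both++;
            }
        }
        return (double) both / sessions.size();
    }

    /** Confidence: of the sessions that contain page a, the fraction that also contain b. */
    static double confidence(List<Set<String>> sessions, String a, String b) {
        int withA = 0;
        int both = 0;
        for (Set<String> pages : sessions) {
            if (pages.contains(a)) {
                withA++;
                if (pages.contains(b)) {
                    both++;
                }
            }
        }
        return withA == 0 ? 0.0 : (double) both / withA;
    }

    public static void main(String[] args) {
        // Four hypothetical visitor sessions, each a set of viewed pages.
        List<Set<String>> sessions = Arrays.asList(
                new HashSet<>(Arrays.asList("home", "jdm")),
                new HashSet<>(Arrays.asList("home", "jdm", "faq")),
                new HashSet<>(Arrays.asList("home")),
                new HashSet<>(Arrays.asList("faq")));
        // Rule "home -> jdm": 2 of 4 sessions hold both pages, 2 of 3 "home" sessions hold "jdm".
        System.out.println("support    = " + support(sessions, "home", "jdm"));
        System.out.println("confidence = " + confidence(sessions, "home", "jdm"));
    }
}
```

A rule with high support and high confidence is a candidate for deciding which links to show together; a real association-rules algorithm searches for such rules instead of checking one chosen by hand.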
Having selected a mining task, you would then configure it with suitable parameters. In the JDM API, such configuration is specified with settings.
Select and configure the mining algorithms. Settings also allow you to select the algorithm for a mining task. Several algorithms are often available for a given task. They differ not only in the accuracy of their results, but also in the computational resources they require.
Many data-mining tools are able to automatically match algorithms to a desired data-mining objective; for instance, a clustering algorithm to create data clusters, or an association-rules algorithm to identify association rules.
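The idea of a settings object that either names an algorithm or leaves the choice to the system might be sketched as follows. This is not the JDM API: the `TaskSettings` class, its `Task` enum, and the table of default algorithms are hypothetical stand-ins, chosen only to show the pattern.

```java
import java.util.EnumMap;
import java.util.Map;

class TaskSettings {
    /** Hypothetical set of mining tasks. */
    enum Task { CLUSTERING, ASSOCIATION_RULES, CLASSIFICATION }

    // Assumed defaults: one well-known algorithm per task.
    private static final Map<Task, String> DEFAULT_ALGORITHM = new EnumMap<>(Task.class);
    static {
        DEFAULT_ALGORITHM.put(Task.CLUSTERING, "k-means");
        DEFAULT_ALGORITHM.put(Task.ASSOCIATION_RULES, "Apriori");
        DEFAULT_ALGORITHM.put(Task.CLASSIFICATION, "decision tree");
    }

    private final Task task;
    private String algorithm; // null means "let the system choose"

    TaskSettings(Task task) {
        this.task = task;
    }

    /** Explicitly selects an algorithm for this task. */
    TaskSettings useAlgorithm(String name) {
        this.algorithm = name;
        return this;
    }

    /** Returns the explicit choice if there is one, otherwise the default for the task. */
    String resolveAlgorithm() {
        return algorithm != null ? algorithm : DEFAULT_ALGORITHM.get(task);
    }

    public static void main(String[] args) {
        TaskSettings automatic = new TaskSettings(Task.CLUSTERING);
        TaskSettings explicit = new TaskSettings(Task.CLUSTERING).useAlgorithm("hierarchical");
        System.out.println(automatic.resolveAlgorithm());
        System.out.println(explicit.resolveAlgorithm());
    }
}
```

Keeping the algorithm optional lets a non-expert state only the objective, while an expert can still override the system's choice.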
Build your data-mining model. The output of executing a data-mining task is your data-mining model. Ideally, that model is a representation of your data suited to your objective. For instance, the model might be a neural network, a decision tree, or even a set of rules understandable by humans.
Test and refine the models. You might create several models, evaluate the accuracy of each model with past data, and possibly select a "best" model for your purpose. One way to evaluate a model is to apply it to past data whose outcome is already known, and compare the results with those you would obtain without the model's help, for instance by random sampling. Ideally, the model should produce better results than that baseline; the ratio of the two is called "lift" in data-mining argot.
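Lift can be computed directly from counts. The sketch below assumes a promotion scenario in which a model picks a subset of customers to target; the `Lift` class, its parameter names, and the figures in `main` are invented for illustration and do not come from any real data set.

```java
class Lift {
    /**
     * Lift = response rate among the customers the model selected,
     * divided by the response rate across all customers (the random baseline).
     */
    static double lift(int selectedResponders, int selectedSize,
                       int totalResponders, int totalSize) {
        if (selectedSize <= 0 || totalSize <= 0 || totalResponders <= 0) {
            throw new IllegalArgumentException("sizes and total responders must be positive");
        }
        double selectedRate = (double) selectedResponders / selectedSize;
        double baselineRate = (double) totalResponders / totalSize;
        return selectedRate / baselineRate;
    }

    public static void main(String[] args) {
        // Hypothetical past campaign: 125 of 1,000 customers bought (12.5%).
        // The model's top 200 picks contained 50 buyers (25%).
        System.out.println("lift = " + lift(50, 200, 125, 1000));
    }
}
```

A lift of 1 means the model does no better than random selection; in the invented figures above, the model's picks respond at twice the baseline rate.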
Report findings or predict future outcomes. Finally, you could either report your findings—think PowerPoint—or use your data-mining models to predict future results. In some cases, you may build systems that automatically improve their data-mining models with new data, or systems that take actions in the presence of a continuous stream of new information.
The current trend is towards automating as much of this process as possible. "Even those not expert in data mining can reap the benefits of data-mining technologies," says Oracle's Hornick. "In the [JDM] standard, [users may have the] system determine automatic settings. We provide [in the JDM API] functions for classification and clustering, for instance."