The Artima Developer Community
This article is sponsored by the Java Community Process.

Leading-Edge Java
Mine Your Own Data with the JDM API
Exploring the Java Data Mining API
by Frank Sommers
July 7, 2005


Using the JDM API

A typical data-mining project might start with a data analyst, someone very familiar with the problem domain, using JDM to explore which models and model-building parameters work best. The analyst could persist those models in the mining object repository (MOR), which would then become part of the specifications a developer can use to deploy a data-mining solution. The specs might include other artifacts as well, such as scoring techniques or the algorithms to use.

To create a data-mining model, an analyst or developer would follow a few simple steps that map the process described in the previous section to JDM interactions. The steps below are illustrated with code snippets later in the article:

  1. Identify the data you wish to use to build your model—your build data—with a URL that points to that data.

  2. Specify the type of model you want to build, and the parameters to the build process. Such parameters are termed build settings in JDM. The most important build setting is the definition of the data-mining task, such as clustering, classification, or association rules. These tasks are represented by API classes. You may also specify the algorithms to use in those tasks. If you don't care about a specific algorithm, the data-mining engine may select one for you, based on the desired task.

  3. Optionally, you may wish to create a logical representation of your data. That allows you, for instance, to select certain attributes of the physical data, and then map those attributes to logical values. You can specify such mappings in your build settings.

  4. Another optional step allows you to specify the parameters to your data-mining algorithms as well.

  5. Next, you create a build task, and apply to that task the physical data references and the build settings.

  6. The JDM API allows you to verify your settings before running a build task. That lets you catch errors early, since a build task may run for a long time.

  7. Finally, you execute the task. The outcome of that execution is your data model. That model will have a signature—a kind of interface—that describes the possible input attributes for later applying the model to additional data.

Once you've created a model, you can test that model, and then apply it to additional data. Building, testing, and applying the model to additional data is an iterative process that, ideally, yields increasingly accurate models. Those models can then be saved in the MOR, and used either to explain data or to predict the outcome of new data in relation to your data-mining objective.

The following code example, based on the JDM 1.0 specification, creates a model that predicts which customers would purchase a certain product. Suppose that you have a database of existing customers, with each record consisting of many customer attributes. You could build a model with that data to help explain how certain customer or purchase attributes impact the likelihood of a customer purchasing your product. You could later use that model to predict the likelihood of new purchases.

Suppose a customer attribute, purchase, indicates whether a customer has purchased the product, with two possible values: Y and N. Attributes that correspond to such discrete values are termed categorical attributes: They tell which of a given set of categories an item belongs to (buyers and non-buyers, in this case).

For future customers, we would want our model to predict the value of the purchase attribute. In other words, the objective of our data mining is to create a model that classifies customers based on the value of the purchase attribute. Not surprisingly, the data-mining task we will use is classification. The classification task has a target, which is the attribute we try to predict (purchase, in this case). All other customer attributes will be used by the classification as predictors of that target.
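To make the roles of target and predictors concrete, here is a minimal sketch of a classifier over categorical attributes, written in plain Java rather than against the JDM API. It uses a simple naive Bayes count-and-multiply scheme with add-one smoothing. The class name, the homeowner and region attributes, and the training records are all made up for illustration, and a real data-mining engine would do considerably more.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy naive Bayes classifier over categorical attributes (illustrative only, not JDM). */
public class PurchaseClassifierSketch {

    // classCounts.get("Y") = number of training records whose target value is "Y"
    private final Map<String, Integer> classCounts = new HashMap<>();
    // valueCounts.get("Y|region=west") = records with target "Y" and region "west"
    private final Map<String, Integer> valueCounts = new HashMap<>();
    private final String target;
    private int total = 0;

    public PurchaseClassifierSketch(String target) {
        this.target = target;
    }

    /** Build step: count how often each predictor value occurs with each target value. */
    public void build(List<Map<String, String>> records) {
        for (Map<String, String> record : records) {
            String label = record.get(target);
            classCounts.merge(label, 1, Integer::sum);
            total++;
            for (Map.Entry<String, String> e : record.entrySet()) {
                if (!e.getKey().equals(target)) {   // every non-target attribute is a predictor
                    valueCounts.merge(label + "|" + e.getKey() + "=" + e.getValue(), 1, Integer::sum);
                }
            }
        }
    }

    /** Apply step: return the target value with the highest (unnormalized) score. */
    public String predict(Map<String, String> predictors) {
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Integer> c : classCounts.entrySet()) {
            double score = (double) c.getValue() / total;   // prior P(class)
            for (Map.Entry<String, String> e : predictors.entrySet()) {
                String key = c.getKey() + "|" + e.getKey() + "=" + e.getValue();
                int count = valueCounts.getOrDefault(key, 0);
                // Add-one smoothing so an unseen value does not zero out the score;
                // the "+ 2.0" assumes each predictor has two possible values.
                score *= (count + 1.0) / (c.getValue() + 2.0);
            }
            if (score > bestScore) {
                bestScore = score;
                best = c.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        PurchaseClassifierSketch model = new PurchaseClassifierSketch("purchase");
        model.build(List.of(
            Map.of("homeowner", "yes", "region", "west", "purchase", "Y"),
            Map.of("homeowner", "yes", "region", "east", "purchase", "Y"),
            Map.of("homeowner", "no",  "region", "west", "purchase", "N"),
            Map.of("homeowner", "no",  "region", "east", "purchase", "N")));
        System.out.println(model.predict(Map.of("homeowner", "yes", "region", "west")));
    }
}
```

In this sketch the build step corresponds loosely to building a model from build data, and predict corresponds to applying the model to a new record; purchase is the target, and every other attribute acts as a predictor.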

The mining data for this task can be located in a flat file, or in a database table. In either case, the data-mining system needs to know how to map attributes of the physical data to logical attributes. For example, the purchase attribute may be labeled "hasPurchased" in the actual physical data set. In addition, we need to tell the data-mining engine that the purchase attribute is a categorical attribute. You can specify such mappings via the settings supplied to the classification mining task. The code below shows how to map physical data to logical values, and also how to specify data-mining task settings.

The example assumes that you've already obtained a connection to a data-mining engine, perhaps via a JNDI call. A JDM connection is represented by the engine variable, which is of type javax.datamining.resource.Connection. JDM connections are very similar to JDBC connections, with one connection per thread.

Having obtained a connection to the data-mining engine, the next step is to define the physical data you wish to mine. The build data is referenced via a PhysicalDataSet object, which, in turn, loads the data from a file or a database table, referenced with a URL. The JDM specs define the acceptable data types and format of the input file. You would use a PhysicalDataSet both to build your model and, subsequently, to test and evaluate it:

PhysicalDataSetFactory dataSetFactory
    = (PhysicalDataSetFactory) engine.getFactory("javax.datamining.data.PhysicalDataSet");
// Reference the build data by URL
PhysicalDataSet dataSet = 
    dataSetFactory.create(
    "file:///export/data/textFileData.data", 
    true);
// Save the data set under the name the build task will refer to
engine.saveObject("buildData", dataSet, false);

Based on the physical data, we can define a logical data model. In this example, we specify that purchase is a categorical attribute of the data:

LogicalDataFactory logicalFactory
    = (LogicalDataFactory) engine.getFactory("javax.datamining.data.LogicalData");
LogicalData logicalData = logicalFactory.create(dataSet);
LogicalAttributeFactory logicalAttributeFactory = (LogicalAttributeFactory) 
      engine.getFactory("javax.datamining.data.LogicalAttribute");
LogicalAttribute purchase = logicalData.getAttribute("purchase");
purchase.setAttributeType(AttributeType.categorical);
engine.saveObject("logicalData", logicalData, false);
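
Independently of the JDM calls above, the idea of mapping physical columns to logical attributes can be sketched in plain Java. This is only an illustration under stated assumptions: the class, the nested LogicalAttr type, the toLogical method, and the custAge and rowId column names are hypothetical and are not part of the JDM API. Only the hasPurchased-to-purchase mapping comes from the article's example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative mapping of physical column names to logical attributes (not JDM). */
public class AttributeMappingSketch {

    public enum AttributeType { CATEGORICAL, NUMERICAL }

    /** Hypothetical description of one logical attribute: its name and its type. */
    public static final class LogicalAttr {
        public final String name;
        public final AttributeType type;

        public LogicalAttr(String name, AttributeType type) {
            this.name = name;
            this.type = type;
        }
    }

    /** Renames the physical columns of one record to their logical attribute names. */
    public static Map<String, String> toLogical(Map<String, String> physicalRecord,
                                                Map<String, LogicalAttr> mapping) {
        Map<String, String> logical = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : physicalRecord.entrySet()) {
            LogicalAttr attr = mapping.get(e.getKey());
            if (attr != null) {               // unmapped physical columns are left out
                logical.put(attr.name, e.getValue());
            }
        }
        return logical;
    }

    public static void main(String[] args) {
        Map<String, LogicalAttr> mapping = new LinkedHashMap<>();
        mapping.put("hasPurchased", new LogicalAttr("purchase", AttributeType.CATEGORICAL));
        mapping.put("custAge", new LogicalAttr("age", AttributeType.NUMERICAL));

        Map<String, String> physicalRecord = new LinkedHashMap<>();
        physicalRecord.put("hasPurchased", "Y");
        physicalRecord.put("custAge", "42");
        physicalRecord.put("rowId", "17");    // not mapped, so it is dropped

        System.out.println(toLogical(physicalRecord, mapping));
    }
}
```

The point of the separation is that the model is built against logical names such as purchase, so the same settings can be reused when the physical data labels the column differently.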

Next, we specify the settings for building the model. Since the mining task we wish to perform is classification, we create classification settings and specify naive Bayes as the algorithm. This example has the single target attribute of purchase. Note that the algorithm itself accepts settings via an algorithm-specific settings object:

ClassificationSettingsFactory settingsFactory = (ClassificationSettingsFactory)
    engine.getFactory("javax.datamining.supervised.classification.ClassificationSettings");
ClassificationSettings settings = settingsFactory.create();
settings.setTargetAttributeName("purchase");
NaiveBayesSettingsFactory bayesianFactory = (NaiveBayesSettingsFactory)
    engine.getFactory("javax.datamining.algorithm.naivebayes.NaiveBayesSettings");
NaiveBayesSettings bayesSettings = bayesianFactory.create();
bayesSettings.setSingletonThreshold(.01);
bayesSettings.setPairwiseThreshold(.01);
settings.setAlgorithmSettings(bayesSettings);
engine.saveObject("bayesianSettings", settings, false);

Having specified the settings, we create a build task. As an optional step, the JDM API allows you to verify the settings before starting to build the model. We will not handle verification errors in this example:


BuildTaskFactory buildTaskFactory = 
    (BuildTaskFactory) engine.getFactory
    ("javax.datamining.task.BuildTask");
// Arguments: name of the build data, name of the settings, name for the resulting model
BuildTask buildTask = buildTaskFactory.create("buildData", "bayesianSettings", "model");
VerificationReport report = buildTask.verify();
if (report != null) {
    ReportType reportType = report.getReportType();
    // Handle errors here
}

// If there are no errors, save the build task
engine.saveObject("buildTask", buildTask, false);

// Execute the build task
ExecutionHandle handle = engine.execute("buildTask");

// The build may take a long time, so wait for completion
handle.waitForCompletion(Integer.MAX_VALUE); 

Finally, we can access the resulting model:

ExecutionStatus status = handle.getLatestStatus();
if (ExecutionState.success.equals(status.getState())) {
    ClassificationModel model = (ClassificationModel)
        engine.retrieveObject("model", NamedObject.model);
}
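
The verify, execute, and wait sequence used above follows a general pattern: run cheap checks first, start the expensive work asynchronously, then wait on a handle with a timeout. The following is a rough analogy in plain Java using java.util.concurrent, not JDM code. The class name, the verify and buildAndWait methods, the status strings, and the 100-millisecond stand-in for a model build are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Illustrative verify-then-execute pattern with a timed wait (an analogy, not JDM). */
public class VerifyThenExecuteSketch {

    /** Cheap checks that run before the expensive build; returns a list of problems. */
    public static List<String> verify(String dataUri, String targetAttribute) {
        List<String> problems = new ArrayList<>();
        if (dataUri == null || dataUri.isEmpty()) {
            problems.add("no build data specified");
        }
        if (targetAttribute == null || targetAttribute.isEmpty()) {
            problems.add("no target attribute specified");
        }
        return problems;
    }

    /** Verifies the inputs, runs a stand-in build on a worker thread, and waits for it. */
    public static String buildAndWait(String dataUri, String targetAttribute, long timeoutSeconds) {
        List<String> problems = verify(dataUri, targetAttribute);
        if (!problems.isEmpty()) {
            return "rejected: " + problems;      // fail fast, before any long-running work
        }
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<String> handle = executor.submit(() -> {
                Thread.sleep(100);               // stands in for a long-running model build
                return "model for target '" + targetAttribute + "'";
            });
            try {
                return "success: " + handle.get(timeoutSeconds, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                handle.cancel(true);             // give up on the build after the timeout
                return "timed out";
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return "interrupted";
            } catch (ExecutionException e) {
                return "failed: " + e.getCause();
            }
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(buildAndWait("file:///export/data/textFileData.data", "purchase", 5));
        System.out.println(buildAndWait("file:///export/data/textFileData.data", "", 5));
    }
}
```

Checking the outcome before using the result, as the sketch does with its status strings, mirrors the article's check of the execution state before retrieving the model.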

Data mining for the masses

Proposed new features for JDM 2.0 include mining capabilities for time-series data, which is useful in forecasting and anomaly detection, such as in security. Another proposed feature will allow you to mine unstructured text data, and there are plans to extend the JDM's scope to data preparation and transformation as well.

While the JDM 1.0 API has been an approved standard for almost a year, only a handful of products implement portions of the standard at the time of this writing. "Oracle has a product based on JDM 1.0. There are other vendors as well . . . Many commercial vendors are waiting on the sidelines to find demand for data mining, and will then implement [JDM] in their own products," says Hornick.

That demand may not be long in coming, as data mining is becoming an increasingly mainstream data management task. The Basel Committee, established by the central bank board of governors of the Group of Ten countries to provide regulatory oversight and best practices for the world's financial institutions, recently allowed banks to reduce the amount of their mandatory reserves if they can build their own predictive data models to accurately assess the risk of defaults.[4] Data mining is a key component in building such models.

The JDM itself may contribute to increased demand for data mining. An analogy with databases may illustrate the point: Prior to ODBC and JDBC, database access was possible only via proprietary vendor interfaces. Not only did that render database-aware applications dependent on specific database products, but vendors also often charged extra for those interface components. ODBC and JDBC eliminated those barriers, making database access universal and ubiquitous. The JDM API might similarly make data-mining capabilities available to any Java application. JDM spec lead Hornick puts it this way: "Our goal [with the JDM] was to bring data mining to the masses."


Resources

[1] Java Data Mining API 1.0, JSR 73
http://www.jcp.org/en/jsr/detail?id=73

[2] Java Data Mining API 2.0, JSR 247
http://www.jcp.org/en/jsr/detail?id=247

[3] "What is OLAP?" From the OlapReport
http://www.olapreport.com/fasmi.htm

[4] The Basel II banking regulations
http://www.bis.org/publ/bcbsca.htm

[See also]

Michael Lesk, How Much Data is There in the World?
http://www.lesk.com/mlesk/ksg97/ksg.html

Vannevar Bush, As We May Think
http://ccat.sas.upenn.edu/~jod/texts/vannevar.bush.html

MyLifeBits project
http://www.research.microsoft.com/barc/MediaPresence/MyLifeBits.aspx

Advances in Knowledge Discovery and Data Mining, Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, Ramasamy Uthurusamy, editors. MIT Press, 1996.
http://www.amazon.com/exec/obidos/ASIN/0262560976/

Predictive Data Mining: A practical guide, Sholom M. Weiss, Nitin Indurkhya, Morgan Kaufmann, 1997.
http://www.amazon.com/exec/obidos/ASIN/1558604030/

KDnuggets (data mining portal)
http://www.kdnuggets.com/

ACM Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD)
http://www.acm.org/sigs/sigkdd/
