Algorithmic analysis of natural language can enhance many enterprise applications and lead to an improved user experience. In a recent article, Rod Coffin and Matt Smith demonstrate three open-source Java natural-language processing tools.
Much of the input to enterprise applications comes from humans, in the form of natural language. The ability to analyze human-language input algorithmically can not only provide a better user experience but also make an application more effective, for example by automating the classification of user input.
In spite of many decades of research in natural-language processing, relatively few applications take advantage of advances in the field or of the numerous open-source natural-language processing projects. In a recently published article, Bad at Grammar? Cheat with Java Linguistics Tools, Rod Coffin and Matt Smith demonstrate three such Java tools: LingPipe for text classification, OpenNLP for sentence identification, and Inflector for pluralization. The authors introduce text classification as the capability to:
Programmatically identify the language in which a text is written, its topic, sentiment (i.e. inflammatory, reasoned, etc.), or to identify a possible author. Most text classification techniques involve applying statistical methods to a training corpus (a set of known texts used for training systems) to develop a model for determining the most likely category of future text passages.
The article demonstrates text classification with LingPipe by creating a program that automatically determines which of two topics an email message is about.
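The statistical approach described above (train a model on a corpus of known texts, then pick the most likely category for a new text) can be illustrated with a small multinomial naive Bayes classifier. The sketch below is not LingPipe's API and is not taken from the article; the class name `TinyTopicClassifier` and its methods are hypothetical, and the tokenizer and smoothing are deliberately simplistic.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical topic classifier: a naive Bayes sketch, not LingPipe's API. */
final class TinyTopicClassifier {
    // Per-category word frequencies gathered from the training corpus.
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Crude tokenizer (an assumption): lower-case, split on non-alphanumerics.
    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
    }

    /** Adds one known text of the given category to the training corpus. */
    void train(String category, String text) {
        docCounts.merge(category, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(category, c -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);
            totalWords.merge(category, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Returns the category with the highest log-probability for the text. */
    String classify(String text) {
        if (totalDocs == 0) {
            throw new IllegalStateException("classifier has not been trained");
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docCounts.keySet()) {
            // Prior: share of training documents that belong to this category.
            double score = Math.log(docCounts.get(category) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(category);
            int total = totalWords.getOrDefault(category, 0);
            for (String token : tokenize(text)) {
                // Words never seen in training carry no evidence; skip them.
                if (token.isEmpty() || !vocabulary.contains(token)) continue;
                int count = counts.getOrDefault(token, 0);
                // Add-one (Laplace) smoothing avoids zero probabilities.
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }
}
```

To use it, call `train` once per known message with its topic label, then call `classify` on an unseen message. A real system would need a far larger training corpus and a better tokenizer than this sketch assumes.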
While identifying sentences seems easy enough, Coffin and Smith show that there is more to sentence identification than parsing text based on punctuation marks. They introduce OpenNLP, including its SentenceDetector sub-project:
OpenNLP is an umbrella project that includes several projects related to linguistics. The SentenceDetector included with OpenNLP tools uses a maximum entropy ... algorithm that is trained on a corpus of text extracted from the Wall Street Journal.
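To see why punctuation alone is not enough, consider the following sketch. It is not OpenNLP's SentenceDetector and uses no maximum entropy model; the class `NaiveSentenceSplitter`, its methods, and the hand-picked abbreviation list are all hypothetical. It compares a plain punctuation-based split with a version that is patched with a list of known abbreviations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical rule-based splitter; illustrates the problem, not OpenNLP. */
final class NaiveSentenceSplitter {
    // Hand-picked abbreviations (an assumption); any real list is incomplete.
    private static final Set<String> ABBREVIATIONS = new HashSet<>(
            Arrays.asList("dr.", "mr.", "mrs.", "a.m.", "p.m.", "e.g.", "i.e."));

    /** Splits after every '.', '!' or '?' that is followed by whitespace. */
    static List<String> splitOnPunctuation(String text) {
        List<String> pieces = new ArrayList<>();
        for (String piece : text.trim().split("(?<=[.!?])\\s+")) {
            if (!piece.isEmpty()) pieces.add(piece);
        }
        return pieces;
    }

    /** Same rule, but a piece ending in a known abbreviation is merged with the next. */
    static List<String> splitWithAbbreviations(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String piece : splitOnPunctuation(text)) {
            if (current.length() > 0) current.append(' ');
            current.append(piece);
            // Look at the last space-delimited word of the piece.
            String lastWord = piece.substring(piece.lastIndexOf(' ') + 1)
                    .toLowerCase(Locale.ROOT);
            if (!ABBREVIATIONS.contains(lastWord)) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) sentences.add(current.toString());
        return sentences;
    }
}
```

For the input "Dr. Smith arrived at 5 p.m. on Monday. He left early.", `splitOnPunctuation` returns four fragments, while `splitWithAbbreviations` returns the expected two sentences. The patched version still fails on "We met at 5 p.m. Then we left.", which it merges into a single sentence because "p.m." can also end a sentence. Ambiguities of this kind are a reason to prefer a detector trained on a corpus over hand-written rules.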
Finally, the authors introduce Inflector, a java.net project that can pluralize English-language words:
Pluralization of English words is one of those problems where the 80 percent case is easy but the cost for the remaining 20 percent is exponentially more expensive because of English's many irregularities. One Java tool that can perform similar pluralization is the java.net Inflector project... Out of the box, Inflector doesn't handle 100 percent of English irregular words correctly but does provide a framework for handling user-specified pluralizations.
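The split between the easy common case and the irregular remainder can be sketched as a few suffix rules plus a table of user-specified exceptions. This is not Inflector's API; the class `SimplePluralizer` and its methods are hypothetical, and the three rules cover only the most common English patterns.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical pluralizer: suffix rules plus user-specified irregular forms. */
final class SimplePluralizer {
    // Irregular plurals registered by the caller, keyed by lower-case singular.
    private final Map<String, String> irregulars = new HashMap<>();

    /** Registers an exception that overrides the suffix rules. */
    void addIrregular(String singular, String plural) {
        irregulars.put(singular.toLowerCase(Locale.ROOT), plural);
    }

    /** Returns the plural of a singular English noun, as far as the rules allow. */
    String pluralize(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        String irregular = irregulars.get(lower);
        if (irregular != null) return irregular;
        // Sibilant endings take "es": box -> boxes, church -> churches.
        if (lower.matches(".*(s|x|z|ch|sh)")) return word + "es";
        // Consonant + y becomes "ies": city -> cities (but day -> days).
        if (lower.matches(".*[^aeiou]y")) {
            return word.substring(0, word.length() - 1) + "ies";
        }
        // Default rule: append "s".
        return word + "s";
    }
}
```

Without any registered exceptions, `pluralize("child")` falls through to the default rule and produces the incorrect "childs". After `addIrregular("child", "children")` it returns "children". Every irregular word needs its own entry, which is where the cost of the remaining cases comes from.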
Have you used natural-language processing tools in your projects? If so, what do you think of the effectiveness of such tools?