Java Community News - Web-Harvest Project Announces Initial Code Release



Frank Sommers



Posted: Sep 4, 2006 12:10 PM

Summary
Web-Harvest is an open-source screen-scraping tool that helps extract data from Web sites. It uses XSLT, XQuery, and regular expressions, and provides a configurable set of pipelines that process the raw HTML data of a Web site.

The open-source Web-Harvest data-extraction project has announced its initial public release. The Java-based tool allows developers to programmatically extract data from existing Web sites.

While screen-scraping evokes memories of DOS applications, the technique was also the primary means of programmatically extracting information from Web sites in the pre-RSS and Web-services days. Indeed, many large sites, such as eBay, developed their programmatic APIs in response to the increasing number of screen-scraping applications harvesting information from their Web pages.

Web-Harvest uses XSLT, XQuery, and regular expressions to aid data extraction, or screen scraping, from HTML- and XML-based Web sites. It provides a set of configurable pipelines to process each page:

Every extraction procedure in Web-Harvest is user-defined through XML-based configuration files. Each configuration file describes a sequence of processors executing some common task in order to accomplish the final goal. Processors execute in the form of a pipeline. Thus, the output of one processor execution is input to another one.
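The pipeline idea described above, where each processor's output becomes the next processor's input, can be sketched in plain Java. This is only an illustration of the general technique, not Web-Harvest's API or its XML configuration format: the class `PipelineSketch`, its three stages, and the sample HTML are all hypothetical, and a regular expression stands in for the XSLT and XQuery steps a real configuration might use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of a processor pipeline; not part of Web-Harvest.
final class PipelineSketch {

    // Matches href="..." attributes (double-quoted values only), ignoring case.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Stage 1: collapse runs of whitespace so later stages see one-line input.
    static String collapseWhitespace(String html) {
        return html.replaceAll("\\s+", " ").trim();
    }

    // Stage 2: pull every href value out of the markup with a regular expression.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    // Stage 3: keep only absolute http/https links.
    static List<String> keepAbsolute(List<String> links) {
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            if (link.startsWith("http://") || link.startsWith("https://")) {
                kept.add(link);
            }
        }
        return kept;
    }

    // Chains the stages: the output of each one is the input to the next.
    static Function<String, List<String>> buildPipeline() {
        Function<String, String> normalize = PipelineSketch::collapseWhitespace;
        return normalize
                .andThen(PipelineSketch::extractLinks)
                .andThen(PipelineSketch::keepAbsolute);
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">A</a>\n"
                + "<a href=\"/local\">B</a> "
                + "<a HREF = \"https://example.org/b\">C</a>";
        System.out.println(buildPipeline().apply(html));
    }
}
```

Because each stage is an ordinary function, stages can be reordered or swapped without touching the others, which is roughly the flexibility a declarative pipeline configuration aims for.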

While RSS provides a standard format for exposing a site's data in a structured manner, a site's RSS feed may still publish only a subset of the available data. Screen scraping that uses knowledge of a Web page's layout to obtain just the needed data from the page can still come in handy. What Web sites do you obtain data from in that manner?
