The Artima Developer Community
Java Community News
Web-Harvest Project Announces Initial Code Release

Frank Sommers

Posted: Sep 4, 2006 12:10 PM
Summary
Web-Harvest is an open-source screen-scraping tool that helps extract data from Web sites. It uses XSLT, XQuery, and regular expressions, and provides a configurable set of pipelines that process the raw HTML data of a Web site.

The open-source Web-Harvest data extraction tool project announced its initial public release. The Java-based tool allows developers to programmatically extract data from existing Web sites.

While screen scraping evokes memories of DOS applications, the technique was also the primary means of programmatically extracting information from Web sites in the days before RSS and Web services. Indeed, many large sites, such as eBay, developed their programmatic APIs in response to the increasing number of screen-scraping applications harvesting information from their Web pages.

Web-Harvest uses XSLT, XQuery, and regular expressions to aid data extraction, or screen scraping, from HTML and XML-based Web sites. It provides a set of configurable pipelines to process each page:

Every extraction procedure in Web-Harvest is user-defined through XML-based configuration files. Each configuration file describes a sequence of processors executing some common task in order to accomplish the final goal. Processors execute in the form of a pipeline. Thus, the output of one processor execution is input to another one.
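
To make the pipeline idea concrete, here is a minimal sketch in plain Java. It does not use Web-Harvest's API or its XML configuration format, neither of which is shown in this post; the class name, the two processors, and the sample HTML are all hypothetical. It only illustrates the general pattern in which each processor's output becomes the next processor's input, with a regular expression doing the extraction step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a generic processor pipeline, not Web-Harvest code.
public class PipelineSketch {

    // Runs the processors in order; each one receives the previous one's output.
    public static String runPipeline(String input, List<Function<String, String>> processors) {
        String current = input;
        for (Function<String, String> processor : processors) {
            current = processor.apply(current);
        }
        return current;
    }

    // Hypothetical processor: keeps the text of each <li class="price"> element,
    // one per line. Assumes the page uses exactly this markup.
    public static String extractPrices(String html) {
        Pattern pattern = Pattern.compile("<li class=\"price\">(.*?)</li>");
        Matcher matcher = pattern.matcher(html);
        List<String> found = new ArrayList<>();
        while (matcher.find()) {
            found.add(matcher.group(1).trim());
        }
        return String.join("\n", found);
    }

    // Hypothetical processor: removes the dollar sign from the extracted text.
    public static String stripCurrency(String text) {
        return text.replace("$", "");
    }

    public static void main(String[] args) {
        // Made-up sample page fragment.
        String html = "<ul><li class=\"price\"> $10.50 </li>"
                + "<li class=\"name\">Widget</li>"
                + "<li class=\"price\">$7.25</li></ul>";

        List<Function<String, String>> processors = new ArrayList<>();
        processors.add(PipelineSketch::extractPrices);
        processors.add(PipelineSketch::stripCurrency);

        // Prints the two prices, one per line, without the currency symbol.
        System.out.println(runPipeline(html, processors));
    }
}
```

A regular expression is enough for a fixed, simple layout like the sample above; for messier pages, an XPath or XQuery step of the kind Web-Harvest advertises would likely be more robust.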

While RSS provides a standard format for exposing a site's data in a structured manner, a site's RSS feed may still publish only a subset of the available data. Screen scraping that uses knowledge of a Web page's layout to obtain just the needed data from the page can still come in handy. What Web sites do you obtain data from in that manner?


Copyright © 1996-2019 Artima, Inc. All Rights Reserved.