Reverse PageRank from IBM
That theory [of hubs and authorities] suggests that the best way to find information on the Web is to look at the biggest and most popular sites and Web pages. Hubs, for example, are usually defined as Web portals and expert communities. Similarly, the concept of authorities rests on identifying the most important Web pages, including looking at the number and influence of other pages that link to them. The latter concept is mirrored in Google's main algorithm, called PageRank.
IBM applied the same concepts in an early Web data-mining project called Clever, but shortcomings eventually led researchers to turn the theory of hubs and authorities on its head. In short, IBM found that it could excavate more interesting data from pages that the theory of hubs and authorities normally pushed to the bottom of the heap: unstructured pages like discussion boards, Web logs and newsgroups. With that insight, WebFountain was born.
"We're looking at...the low-level grungy pages," said Gruhl.
The rest of the article is a complete dork-idea-fest: there are mentions of NLP, allusions to Ye Olde Semantic Web, reputation systems, and drool-inducing machine talk like the following:
A main cluster consists of 32 eight-server racks running dual 2.4GHz Intel Xeon processors, capable of writing 10GB of data per second to disk. Each rack has 5 terabytes of storage, for a total of 40 terabytes for the system.
The three clusters together currently run a total of 768 processors, and that number is growing fast.
The clusters and storage are migrating to blade servers this year, which will save space and provide a total of 896 processors for data mining and 256 for storage. In total, the system will run 1,152 processors, allowing it to collect and store as many as 8 billion Web pages within 24 hours.
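For the curious, the quoted processor counts do hang together. This is my back-of-the-envelope arithmetic using the article's figures, nothing more:

```java
/** Sanity check on the quoted WebFountain cluster figures. */
public class ClusterMath {
    public static void main(String[] args) {
        int mainCluster = 32 * 8 * 2; // 32 racks x 8 servers x dual Xeons = 512
        int allThree = 768;           // quoted total across the three clusters
        System.out.println("other two clusters hold: " + (allThree - mainCluster)); // 256
        System.out.println("after the blade move:    " + (896 + 256));              // 1,152
    }
}
```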