ICDL crawling

ICDL crawling is an open distributed web crawling technology based on the Website Parse Template (WPT) format.

What is Website Parse Template?

Figure: Distributed ICDL Crawling

Website Parse Template (WPT) is an open, XML-based format that describes the HTML structure of a website’s pages. The WPT format allows web crawlers to generate Semantic Web RDF triples for web pages. WPT is compatible with the existing Semantic Web concepts defined by the W3C (RDF and OWL) and with the UNL specifications.

Distributed ICDL crawling

ICDL crawling parses a website’s content according to the HTML structure templates represented in WPT files.
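As a rough illustration of template-driven parsing, the sketch below reads a small XML template and emits (subject, predicate, object) triples for matching HTML elements. The template layout (`<field tag=… class=… predicate=…/>`), the predicate names, and the function names are all assumptions made for this example; they are not the actual WPT schema, which is not reproduced here.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical template layout for illustration only; NOT the real WPT schema.
TEMPLATE_XML = """
<template>
  <field tag="h1" class="title" predicate="dc:title"/>
  <field tag="span" class="author" predicate="dc:creator"/>
</template>
"""


def load_rules(template_xml):
    """Map (tag, class) pairs from the template to predicate names."""
    root = ET.fromstring(template_xml)
    return {(f.get("tag"), f.get("class")): f.get("predicate")
            for f in root.iter("field")}


class TemplateParser(HTMLParser):
    """Collects a triple for the text of every element the template names."""

    def __init__(self, subject, rules):
        super().__init__()
        self.subject = subject
        self.rules = rules
        self.current = None  # predicate for the element being read, if any
        self.triples = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        self.current = self.rules.get((tag, css_class))

    def handle_data(self, data):
        text = data.strip()
        if self.current and text:
            self.triples.append((self.subject, self.current, text))

    def handle_endtag(self, tag):
        self.current = None


def extract_triples(url, html, template_xml=TEMPLATE_XML):
    """Return the triples found in `html`, using the page URL as subject."""
    parser = TemplateParser(url, load_rules(template_xml))
    parser.feed(html)
    parser.close()
    return parser.triples
```

For a page containing `<h1 class="title">Hello</h1>`, this sketch would yield one triple whose subject is the page URL, whose predicate is `dc:title`, and whose object is `Hello`. Nested markup inside a matched element is not handled; a real template format would need richer selectors.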

Distributed crawling is carried out by an open-source client/server application installed on volunteers’ personal computers. After authentication, the application registers each PC as a Distributed Crawling node. Each crawler periodically receives tasks from the management console to download specified websites, parse their content, and submit the results to the Parsed Content Storage. Crawling runs only while the user’s computer is idle and its Internet connection is not in use.
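The node behaviour described above can be sketched as a simple loop. Everything here is hypothetical: the `Task` type, the `run_node` function, and the five callables it receives stand in for the management console, idle detection, downloading, parsing, and the Parsed Content Storage, none of whose real interfaces are given in this article.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Task:
    """Hypothetical crawl task handed out by the management console."""
    task_id: int
    url: str


def run_node(
    next_task: Callable[[], Optional[Task]],
    is_idle: Callable[[], bool],
    download: Callable[[str], str],
    parse: Callable[[str, str], Any],
    submit: Callable[[int, Any], None],
    max_tasks: int,
) -> int:
    """Process tasks while the machine stays idle; return how many were done."""
    done = 0
    while done < max_tasks:
        if not is_idle():
            break  # the user is active again, so stop crawling
        task = next_task()
        if task is None:
            break  # the console has no more work for this node
        html = download(task.url)
        submit(task.task_id, parse(task.url, html))
        done += 1
    return done
```

Passing the collaborators in as callables keeps the loop independent of any particular network or storage layer, which also makes it easy to exercise with in-memory stand-ins.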

The management console compares the parse results submitted by several crawlers to improve the accuracy of the crawling results. The stored results can then be used by thematic and general search engines with different search algorithms, such as Google, Live, Yahoo!, and Froogle, to perform more accurate web searches.
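The article does not say how the management console compares results. One plausible approach, shown here purely as an assumption, is a majority vote: keep only the triples reported by more than a chosen fraction of the crawlers that parsed the same page. The function name `reconcile` and the `min_agreement` parameter are invented for this sketch.

```python
from collections import Counter


def reconcile(results, min_agreement=0.5):
    """Keep triples reported by more than `min_agreement` of the crawlers.

    `results` holds one collection of triples per crawler for the same page.
    Majority voting is an assumed strategy, not one stated by the source.
    """
    if not results:
        return set()
    # Count each triple once per crawler, even if a crawler repeats it.
    counts = Counter(t for crawler in results for t in set(crawler))
    threshold = min_agreement * len(results)
    return {t for t, n in counts.items() if n > threshold}
```

With three crawlers and the default threshold, a triple must appear in at least two of the three result sets to be kept.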
