ICDL crawling

ICDL crawling is an open, distributed web crawling technology based on the Website Parse Template (WPT) format.

What is Website Parse Template?

Website Parse Template (WPT) is an open, XML-based format that describes the HTML structure of a website's pages. The format allows web crawlers to generate Semantic Web RDF triples for web pages. WPT is compatible with existing Semantic Web concepts defined by the W3C (RDF and OWL) and with the UNL specifications.
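
The idea of turning a structural description of a page into RDF-style triples can be illustrated with a minimal Python sketch. The template layout used here (a `template` element containing `field` elements with `predicate` and `path` attributes) is invented for illustration and is not taken from the WPT specification; the page URL and the function name `extract_triples` are likewise assumptions. The predicates are standard Dublin Core terms.

```python
import xml.etree.ElementTree as ET

# Hypothetical template: these element and attribute names are invented
# for illustration and do not come from the WPT specification.
TEMPLATE = """
<template>
  <field predicate="http://purl.org/dc/elements/1.1/title"
         path=".//h1[@class='title']"/>
  <field predicate="http://purl.org/dc/elements/1.1/creator"
         path=".//span[@class='author']"/>
</template>
"""

# A small, well-formed page. Real-world HTML is often not well-formed XML
# and would need a more tolerant parser than ElementTree.
PAGE = """
<html><body>
  <h1 class="title">Example article</h1>
  <span class="author">Jane Doe</span>
</body></html>
"""

def extract_triples(page_url, page_source, template_source):
    """Return (subject, predicate, object) tuples for each matching field."""
    page = ET.fromstring(page_source)
    template = ET.fromstring(template_source)
    triples = []
    for field in template.findall("field"):
        # ElementTree supports a limited XPath subset, including
        # attribute predicates such as [@class='title'].
        node = page.find(field.get("path"))
        if node is not None and node.text:
            triples.append((page_url, field.get("predicate"), node.text.strip()))
    return triples

triples = extract_triples("http://example.org/article", PAGE, TEMPLATE)
for triple in triples:
    print(triple)
```

Each triple uses the page URL as subject, the predicate named in the template, and the text of the matched element as object. A field whose path matches nothing is skipped.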

Distributed ICDL crawling

ICDL crawling parses website content according to the HTML structure templates described in WPT files.

Distributed crawling is carried out by an open-source client/server application installed on volunteers' personal computers. After authentication, the application registers each PC as a distributed crawling node. The crawler periodically receives tasks from the management console to download specified websites, parse their content, and submit the results to the Parsed Content Storage. Crawling processes are activated when the user's computer is idle and the Internet connection is not in use.
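
The task cycle described above (receive a task, download, parse, submit, and only while idle) can be sketched in Python as follows. This is a simplified in-memory model, not the actual ICDL client or server: the class names `ManagementConsole` and `CrawlerNode`, their methods, and the `fetch`, `parse`, and `is_idle` callables are all assumptions made for illustration, and authentication and networking are omitted.

```python
from collections import deque

class ManagementConsole:
    """In-memory stand-in for the console and the Parsed Content Storage."""

    def __init__(self, urls):
        self.tasks = deque(urls)
        self.storage = {}  # url -> list of (node_id, parsed result)

    def next_task(self):
        # Hand out the next URL, or None when no work is left.
        return self.tasks.popleft() if self.tasks else None

    def submit(self, node_id, url, result):
        self.storage.setdefault(url, []).append((node_id, result))

class CrawlerNode:
    """One volunteer machine; fetch, parse and is_idle are injected callables."""

    def __init__(self, node_id, console, fetch, parse, is_idle):
        self.node_id = node_id
        self.console = console
        self.fetch = fetch
        self.parse = parse
        self.is_idle = is_idle

    def run_once(self):
        """Process at most one task; return True if a task was completed."""
        if not self.is_idle():
            return False  # the machine is in use, so do not take a task
        url = self.console.next_task()
        if url is None:
            return False
        self.console.submit(self.node_id, url, self.parse(self.fetch(url)))
        return True

# Canned pages replace real downloads so the sketch runs offline.
pages = {"http://example.org/a": "alpha beta", "http://example.org/b": "gamma"}
console = ManagementConsole(pages)
node = CrawlerNode("node-1", console, fetch=pages.get, parse=str.split,
                   is_idle=lambda: True)
while node.run_once():
    pass
print(console.storage)
```

Passing the idle check in as a callable keeps the sketch independent of any particular operating system; a real client would have to query system load and network activity.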

The management console compares the parse results submitted by several crawlers to improve the accuracy of the crawling results. The results can be stored for use by thematic and general search engines with different search algorithms, such as Google, Live, Yahoo!, and Froogle, to perform more accurate web searches.
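
The article does not specify how the management console compares results. One plausible approach, shown here purely as an assumption, is a majority vote: the value reported by most crawlers is accepted only if its share of the submissions exceeds a threshold. The function name `reconcile` and the parameter `min_agreement` are hypothetical.

```python
from collections import Counter

def reconcile(results, min_agreement=0.5):
    """Pick the most common result among crawler submissions.

    Returns (value, agreement), where agreement is the share of
    submissions equal to the winning value. The value is None when the
    share does not exceed min_agreement. Results must be hashable.
    """
    if not results:
        return None, 0.0
    value, count = Counter(results).most_common(1)[0]
    agreement = count / len(results)
    return (value if agreement > min_agreement else None), agreement

# Three crawlers parsed the same page title; one produced a truncated value.
value, agreement = reconcile(["Example article", "Example article", "Exampl"])
print(value, round(agreement, 2))

# Two crawlers disagree, so no result is accepted.
print(reconcile(["a", "b"]))
```

A vote of this kind tolerates a minority of faulty or malicious nodes, at the cost of assigning each page to several crawlers.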