Hamshahri Corpus

From Wikipedia, the free encyclopedia

Hamshahri Corpus Logo

The Hamshahri Corpus is based on the newspaper Hamshahri, one of the first online Persian newspapers in Iran. The newspaper has made its archive available to the public through its website [1] since 1996.

The corpus was created by crawling the online news articles on Hamshahri's website and processing the HTML pages into a standard text corpus for modern information retrieval experiments.
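
As an illustration only, the HTML-to-text step might look something like the following minimal Python sketch. It is not the pipeline actually used to build the corpus; the class name TextExtractor and the function html_to_text are hypothetical, and the choice to skip script and style contents is an assumption about what "processing the HTML pages" would involve.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text from an HTML page,
    ignoring the contents of <script> and <style> elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = "<html><body><h1>Title</h1><p>Body &amp; more</p></body></html>"
    print(html_to_text(page))
```

A real crawler would also need to fetch the pages, handle character encodings for Persian text, and strip navigation and other page furniture, none of which is shown here.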

The collection contains more than 160,000 articles covering the following subject categories: politics, city news, economics, reports, editorials, literature, sciences, society, foreign news, sports, etc. Document size varies from short news items (under 1 KB) to rather long articles (e.g. 140 KB), with an average of 1.8 KB.

The corpus is available in several formats for download [2]:

  • Tagged Text: 560 MB
  • SQL Server 2000 tables: 712 MB

See also


External links