Hamshahri Corpus

From Wikipedia, the free encyclopedia

The Hamshahri Corpus is based on the newspaper Hamshahri, one of the first online Persian newspapers in Iran. It has presented its archive to the public through its website [1] since 1996.

The corpus was created by crawling the online news articles on Hamshahri's website and processing the HTML pages into a standard text corpus for modern information retrieval experiments.
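As a rough illustration of the HTML-processing step, the sketch below strips markup from a downloaded page and keeps only the visible text. It is not the pipeline used by the corpus builders, whose details are not described here; the names `TextExtractor` and `html_to_text` are hypothetical, and the choice to skip `<script>` and `<style>` contents is an assumption. Only the Python standard library is used.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents.

    Hypothetical helper for illustration; not part of the Hamshahri tooling.
    """

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        # One fragment per line; a real corpus would likely add its own tags.
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A real crawler would also need to fetch the pages, detect their character encoding, and remove navigation menus and other page furniture, none of which is attempted here.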

The collection contains more than 160,000 articles covering the following subject categories: politics, city news, economics, reports, editorials, literature, sciences, society, foreign news, sports, etc. Document sizes vary from short news items (under 1 KB) to rather long articles (e.g. 140 KB), with an average of 1.8 KB.
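Size figures like these can be recomputed for any set of documents. The sketch below is a minimal example that assumes sizes are measured in UTF-8 bytes with 1 KB = 1024 bytes; the article does not say which encoding or unit convention was used for the reported figures, and the function name `size_stats_kb` is hypothetical.

```python
def size_stats_kb(documents):
    """Return (min, max, mean) document size in KB.

    Assumes UTF-8 encoding and 1 KB = 1024 bytes; both are assumptions,
    not facts about how the corpus statistics were produced.
    """
    sizes = [len(doc.encode("utf-8")) / 1024 for doc in documents]
    if not sizes:
        raise ValueError("no documents given")
    return min(sizes), max(sizes), sum(sizes) / len(sizes)
```

Note that Persian letters occupy two bytes each in UTF-8, so a document's size in bytes is larger than its character count.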

The corpus is available in several formats for download [2]:

Tagged text: 560 MB
SQL Server 2000 tables: 712 MB

External links

The Homepage of Hamshahri Corpus (In English)

This page was last modified 22:44, 23 February 2008 by Wikipedia user Jonsafari. Based on work by Wikipedia user(s) Bonadea, A.aleahmad, Rjwilmsi, SmackBot, Amirieb, Darrudi, Davewild and Alaibot and Anonymous user(s) of Wikipedia.
All text is available under the terms of the GNU Free Documentation License. (See Copyrights for details.)