Persian Today Corpus

From Wikipedia, the free encyclopedia

The Persian Today Corpus or The Persian One-Million-word Corpus (Persian: واژه‌هاي پركاربرد فارسي امروز) is a book written in Persian by Hamid Hassani and published in Tehran, Iran, in 2005. The book is based on a 1,000,000-word corpus that contains 80 “main texts” (over 500 subtexts) of modern Persian, mostly written in the years 1994–2004. By “main texts” the writer means publications such as books, magazines, and newspapers; “subtexts” are the chapters, or the short and long articles and essays, of which those books, magazines, and newspapers are composed. The usefulness of a corpus is judged primarily by its size and the variety of its sources. The Persian Today Corpus is a corpus, not a concordance dictionary. In a corpus, the words appear exactly as used in the source texts.

The first important advantage of a corpus is its efficiency in language description (morphological, lexical, orthographic, and phonetic features, to name a few). The second advantage is that it provides accurate statistics for collecting basic vocabulary and compiling textbooks for language teaching.
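The kind of statistics mentioned above can be illustrated with a minimal sketch. It is not the method used to build the Persian Today Corpus; it only assumes that the text has already been split on whitespace and that each word form is counted exactly as it appears. The function name `word_form_frequencies` and the sample string are hypothetical.

```python
from collections import Counter


def word_form_frequencies(text):
    """Count every whitespace-separated word form exactly as written.

    No stemming or normalization is applied, so inflected forms such as
    plurals are counted as separate entries (an assumption of this sketch).
    """
    tokens = text.split()
    return Counter(tokens)


# Hypothetical sample: "book and city and night"
sample = "كتاب و شهر و شب"
freqs = word_form_frequencies(sample)

# Rank word forms from most to least frequent.
ranked = freqs.most_common()
for rank, (form, count) in enumerate(ranked, start=1):
    print(rank, form, count)
```

Counting surface forms rather than dictionary headwords is consistent with the statement that the words appear exactly as used in the source texts.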

There are different types of corpora: plain corpora, concordance dictionaries, and word indexes. Linguistic corpora have existed for decades, compiled by specialists in research centers, universities, and academies in several countries, especially developed ones. The best-known corpora, such as the Brown Corpus, usually contain around 1,000,000 words, though some corpora are made up of several hundred million words. The most famous corpora are those prepared for English (American and British), some of which, like the British National Corpus, consist of over 100,000,000 words.

Sponsored by the Iran Language Institute (ILI), a learner’s dictionary of Persian is being compiled by another Iranian scholar, Behruz Safarzadeh (in collaboration with Hamid Hassani), and is due to be published in 2006. The dictionary consists of over 5,000 entries, and the above-mentioned 1,000,000-word corpus is the basis for choosing some of the entries and the defining vocabulary. It is expected that the learner’s dictionary, the first corpus-based Persian dictionary, will be welcomed by lovers of Persian around the world.

The following are some Persian words with their original orthography, pronunciation (capital letters mark the stressed syllable in each word), meaning in English, rank, frequency, and usage percentage according to Hassani’s corpus:

1. و <VA/-O> (a conjunction that means and): 49,758 times out of 1,002,394 (4.96%),

2. به <BE> (a preposition that means to, at, in, or with): 32,478 times (3.24%),

3. را <RAA> (a particle serving as a sign of the [definite] direct object): 25,797 times (2.57%),

4. از <AZ> (a preposition that means from, of, since, than, out of, or belonging to): 23,717 times (2.37%),

5. كه <KE> (a conjunction, a pronoun, a relative, or an interrogative that means that, which; who, who?; or used idiomatically): 22,593 times (2.25%),

6. در <DAR> (a preposition that means in, at, on, or within; a noun that means door): 21,671 times (2.16%),

7. اين <IIN> (an adjective or a pronoun that means this): 11,762 times (1.17%),

8. با <BAA> (a preposition that means with or by): 11,611 times (1.16%),

9. است/-ست <AST/-ST> (a verb that means is): 9,837 times (0.981%),

10. آن <AAN> (an adjective or a pronoun that means that; moment): 6,999 times (0.698%)...

30. كار <KAAR> (a noun that means work): 2,535 times (0.253%)...

50. بيرون <biiROON> (an adverb that means out or outside): 1,551 times (0.155%)...

70. هيچ <HIICH> (an adjective, a noun, or an adverb that means any, nothing, ever, at all, or no): 1,277 times (0.127%)...

100. بابا <baaBAA> (a noun that means papa, daddy, dad, or father): 1,005 times (0.1%)...

125. شب <SHAB> (a noun or an adverb that means night): 856 times (0.085%)...

137. ايران <iiRAAN> (the proper noun Iran): 774 times (0.077%)...

142. كتاب <keTAAB> (a noun that means book): 759 times (0.076%)...

150. آنجا/ آنجا <aan-JAA> (an adverb or a pronoun that means there): 726 times (0.072%)...

196. شهر <SHAHR> (a noun that means city or town): 594 times (0.059%)...

210. چشم <CHESHM> (a noun that means eye): 552 times (0.055%)...

376. امروز <emROOZ> (a noun or an adverb that means today): 319 times (0.032%)...

396. كشور <keshVAR> (a noun that means country): 297 times (0.03%)...

476. آمريكا/امريكا <aamriiKAA/emriiKAA> (the proper noun America): 258 times (0.026%)...

545. ده <DAH> (a numeral (adjective/noun) that means ten): 233 times (0.023%)...

838. امام <eMAAM> (a noun that means Imam): 157 times (0.016%)...

879. انگليسي <engeliiSII> (the proper nouns English or British): 149 times (0.015%)...

1000. حسابي <hesaaBII> (an adjective that means good or regular): 133 times (0.013%)...

1150. عسل <aSAL> (a noun that means honey): 116 times (0.011%)...

1500. دروني <darooNII> (an adjective that means internal): 87 times (0.009%)...

1857. ده <DEH> (a noun that means village): 70 times (0.007%)...

2000. ميرساند <MI-resaanad> (a verb that means he/she/it reaches/extends/delivers/supplies/carries): 65 times (0.006%)...

2792. جمعه <jom’E> (a noun or an adverb that means Friday): 43 times (0.004%)...

3000. كلاسها <kelaas-HAA> (a plural noun (a noun + suffix) that means classes): 40 times (0.004%)...

3445. شاهزاده <shaah-zaaDE> (a noun that means prince or princess): 34 times (0.003%)...

4418. جوراب <jooRAAB> (a noun that means socks or stockings): 24 times (0.002%)...

5000. بخت <BAKHT> (a noun that means luck or fortune): 20 times (0.002%)...

5552. ميليمتر <miiliiMETR> (a noun that means millimeter): 18 times (0.002%)...

8000. سووشون <soovaSHOON> (the proper noun Souvashoun, the name of a Persian novel): 10 times (0.001%)...
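The usage percentages in the list follow from dividing each frequency by the corpus total of 1,002,394 given in the first entry. A minimal sketch of that arithmetic is shown below; the function name `usage_percentage` and the dictionary layout are illustrative assumptions, not part of the book.

```python
# Corpus size as stated in the first entry of the list.
TOTAL_TOKENS = 1_002_394


def usage_percentage(count, total=TOTAL_TOKENS):
    """Return the share of the corpus taken by one word form, in percent."""
    return 100 * count / total


# Frequencies copied from the list above, keyed by transliteration.
counts = {"VA/-O": 49_758, "BE": 32_478, "KAAR": 2_535}

for word, count in counts.items():
    print(f"{word}: {usage_percentage(count):.3f}%")
```

For example, 49,758 occurrences out of 1,002,394 come to about 4.96%, which matches the figure listed for the first entry.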