Persian Today Corpus

The Persian Today Corpus, or the Persian One-Million-Word Corpus (Persian: واژه‌هاي پركاربرد فارسي امروز), is a book written in Persian by Hamid Hassani and published in Tehran, Iran, in 2005. The book is based on a 1,000,000-word corpus comprising 80 "main texts" (over 500 subtexts) of modern Persian, mostly written between 1994 and 2004. By "main texts" the author means publications such as books, magazines, and newspapers; "subtexts" are the chapters, articles, and essays, short or long, of which those publications are composed. The usefulness of a corpus is judged primarily by its size and the variety of its sources. The Persian Today Corpus is a corpus, not a concordance dictionary: in a corpus, the words appear exactly as used in the source texts.

The first important advantage of a corpus is its efficiency in language description (morphological, lexical, orthographic, and phonetic features, to name a few). The second advantage is that it provides accurate statistics for collecting basic vocabulary and compiling textbooks for language teaching.
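As a rough illustration of the second advantage, a frequency list can be derived from raw text by counting each word form exactly as it appears. The sketch below is not Hassani's method: it assumes simple whitespace tokenization (a simplification for Persian), and the function name `frequency_list` and the transliterated sample sentence are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def frequency_list(text: str) -> List[Tuple[str, int]]:
    """Count surface word forms and return them from most to least frequent."""
    # Assumption: tokens are separated by whitespace; no lemmatization is done,
    # so every inflected or suffixed form is counted as its own entry.
    counts = Counter(text.split())
    return counts.most_common()


# Hypothetical transliterated sample ("book and notebook and pen").
sample_text = "ketaab va daftar va ghalam"

for word, count in frequency_list(sample_text):
    print(word, count)
```

Because no lemmatization is applied, a plural such as kelaas-HAA would be counted separately from its singular, which matches the idea that a corpus records words as they occur in the source texts.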

There are different types of corpora: plain corpora, concordance dictionaries, and word indexes. Compiled by specialists in research centers, universities, and academies in many countries, especially developed ones, linguistic corpora have existed for decades. The best-known corpora, such as the Brown Corpus, typically contain around 1,000,000 words, though some run to several hundred million. The most famous are those prepared for English (American and British), some of which, like the British National Corpus, contain over 100,000,000 words.

Sponsored by the Iran Language Institute (ILI), a learner's dictionary of Persian is being compiled by another Iranian scholar, Behruz Safarzadeh, in collaboration with Hamid Hassani, and is due to be published in 2008. The dictionary consists of over 5,000 entries; some of the entries and the defining vocabulary were chosen on the basis of the 1,000,000-word corpus described above. It is expected that the learner's dictionary, the first corpus-based Persian dictionary, will be welcomed by enthusiasts of Persian around the world.

The following are some Persian words with their original orthography, pronunciation (capital letters mark the stressed syllable in each word), English meaning, frequency, and usage percentage according to Hassani's corpus:


No. | Word | Pronunciation | Grammatical category (and meaning) | Frequency | Percentage
1 | و | VA/ -O | a conjunction that means and | 49,758 (of 1,002,394) | 4.96%
2 | به | BE | a preposition that means to, at, in, or with | 32,478 | 3.24%
3 | را | RAA | a particle serving as a sign of the [definite] direct object | 25,797 | 2.57%
4 | از | AZ | a preposition that means from, of, since, than, out of, or belonging to | 23,717 | 2.37%
5 | كه | KE | a conjunction, a pronoun, a relative, or an interrogative that means that, which; who, who?; or used idiomatically | 22,593 | 2.25%
6 | در | DAR | a preposition that means in, at, on, or within; a noun that means door | 21,671 | 2.16%
7 | اين | IIN | an adjective or a pronoun that means this | 11,762 | 1.17%
8 | با | BAA | a preposition that means with or by | 11,611 | 1.16%
9 | است /-ست | AST/-ST | a verb that means is | 9,837 | 0.981%
10 | آن | AAN | an adjective or a pronoun that means that, or a noun that means moment | 6,999 | 0.698%
30 | كار | KAAR | a noun that means work | 2,535 | 0.253%
50 | بيرون | biiROON | an adverb that means out or outside | 1,551 | 0.155%
70 | هيچ | HIICH | an adjective, a noun, or an adverb that means any, nothing, ever, at all, or no | 1,277 | 0.127%
100 | بابا | baaBAA | a noun that means papa, daddy, dad, or father | 1,005 | 0.1%
125 | شب | SHAB | a noun or an adverb that means night | 856 | 0.085%
137 | ايران | iiRAAN | the proper noun Iran | 774 | 0.077%
142 | كتاب | keTAAB | a noun that means book | 759 | 0.076%
150 | آنجا / آنجا | aan-JAA | an adverb or a pronoun that means there | 726 | 0.072%
196 | شهر | SHAHR | a noun that means city or town | 594 | 0.059%
210 | چشم | CHESHM | a noun that means eye | 552 | 0.055%
376 | امروز | emROOZ | a noun or an adverb that means today | 319 | 0.032%
396 | كشور | keshVAR | a noun that means country | 297 | 0.03%
476 | آمريكا /امريكا | aamriiKAA/emriiKAA | the proper noun America | 258 | 0.026%
545 | ده | DAH | a numeral (adjective/noun) that means ten | 233 | 0.023%
838 | امام | eMAAM | a noun that means Imam | 157 | 0.016%
879 | انگليسي | engeliiSII | the proper nouns English or British | 149 | 0.015%
1000 | حسابي | hesaaBII | an adjective that means good or regular | 133 | 0.013%
1150 | عسل | aSAL | a noun that means honey | 116 | 0.011%
1500 | دروني | darooNII | an adjective that means internal | 87 | 0.009%
1857 | ده | DEH | a noun that means village | 70 | 0.007%
2000 | ميرساند | MI-resaanad | a verb that means he/she/it reaches/extends/delivers/supplies/carries | 65 | 0.006%
2792 | جمعه | jom’E | a noun or an adverb that means Friday | 43 | 0.004%
3000 | كلاسها | kelaas-HAA | a plural noun (a noun + suffix) that means classes | 40 | 0.004%
3445 | شاهزاده | shaah-zaaDE | a noun that means prince or princess | 34 | 0.003%
4418 | جوراب | jooRAAB | a noun that means socks or stockings | 24 | 0.002%
5000 | بخت | BAKHT | a noun that means luck or fortune | 20 | 0.002%
5552 | ميليمتر | miiliiMETR | a noun that means millimeter | 18 | 0.002%
8000 | سووشون | soovaSHOON | the proper noun Suvashun (the name of a Persian novel written by Simin Daneshvar) | 10 | 0.001%
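The usage percentages listed above appear to follow from dividing each frequency by the corpus total of 1,002,394 running words quoted in the first row. A minimal Python sketch under that assumption is given below; the function name `usage_percentage` is hypothetical, and the source's exact rounding convention is not stated, so small differences in the last digit are possible.

```python
CORPUS_SIZE = 1_002_394  # total running words, as quoted in the first row of the list


def usage_percentage(frequency: int, corpus_size: int = CORPUS_SIZE) -> float:
    """Return the share of the corpus taken up by one word form, in percent."""
    return 100.0 * frequency / corpus_size


# Frequencies copied from the list above (rank: number of occurrences).
sample = {1: 49_758, 2: 32_478, 10: 6_999, 100: 1_005}

for rank, count in sample.items():
    print(f"rank {rank}: {count} occurrences = {usage_percentage(count):.3f}%")
```

Summing such percentages over the top-ranked words gives a rough sense of how much running text a small basic vocabulary covers, which is the kind of statistic a learner's dictionary can draw on.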

See also