Arabic letter frequency

Letter frequencies in text have long been studied for use in cryptanalysis, particularly in frequency analysis.

No exact letter frequency distribution underlies a given language, since all writers write slightly differently. As a rule, texts in different languages that use the Arabic script (e.g. Arabic, Old Turkish, Persian and Urdu) have different letter frequencies, most obviously where a letter is not used at all in a language (e.g. the Persian letters پ, چ, گ are not found in written Arabic texts).
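As a small illustration of the point about unused letters, the Python sketch below counts the three Persian letters named above in a piece of text. The function name `persian_only_letters` is hypothetical, and finding none of these letters does not by itself prove that a text is Arabic.

```python
from collections import Counter

# The three letters the text names as absent from written Arabic.
PERSIAN_ONLY = {"پ", "چ", "گ"}

def persian_only_letters(text):
    """Return a dict counting each Persian-only letter that occurs in text."""
    return dict(Counter(ch for ch in text if ch in PERSIAN_ONLY))

# A non-empty result suggests the text is not written in Arabic.
print(persian_only_letters("گل و پروانه"))
print(persian_only_letters("كتاب جديد"))
```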

Methods that encode the most frequent letters with the shortest symbols were pioneered by telegraph codes and are used in modern data-compression techniques such as Huffman coding.
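To make the idea of "shortest symbols for the most frequent letters" concrete, here is a minimal Huffman coding sketch in Python. The weights in `sample` are made-up illustrative values, not measured Arabic frequencies, and the function name `huffman_code` is an assumption of this example.

```python
import heapq
from itertools import count

def huffman_code(freqs):
    """Return {symbol: bitstring} for a dict mapping symbol -> positive weight."""
    tiebreak = count()  # unique counter so tied weights never compare the dicts
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A single symbol still needs a one-bit code.
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

# Illustrative weights only (hypothetical, not taken from the article's data).
sample = {"ا": 12, "ل": 10, "ن": 5, "ظ": 1}
codes = huffman_code(sample)
for letter, bits in sorted(codes.items(), key=lambda kv: len(kv[1])):
    print(letter, bits)
```

With these weights the heaviest letter receives a one-bit code and the two lightest letters receive three-bit codes, which is the behaviour the paragraph above describes.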

What gets counted in input Arabic text?

The Arabic alphabet consists chiefly of 28 primary letters, which are letters 1 to 28 in Table 1. Written Arabic, however, makes equal use of the eight modified letters listed in positions 29 to 36 of the same table. If these eight modified forms are lumped back into the primary list on the basis of shape or phonetic similarity, the result is as shown in Table 2. For accurate frequency analysis, each of the 36 letters of Table 1 has its frequency counted independently. The tables order the alphabet differently from the Unicode standard, in a sequence the source considers more logical.

Figure 1: Arabic characters that can be produced using the Arabic Letter Keyboard Intellark.
Table 1: The Arabic alphabet. Letters 1 to 28 are the primary letters. Letters 29 to 36 are the modified letters.
Table 2: The Arabic alphabet, with modified letters lumped onto their primary forms.

Although the full set of Arabic characters includes about ten diacritics, as shown in Figure 1, frequency analysis of Arabic characters is concerned only with the frequency of the alphabet letters shown in Table 2, not with the diacritics.
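The counting rule described in this section can be sketched in Python as follows. Two things here are assumptions rather than facts from the tables, which are not reproduced in this text. First, the 36 letters are taken to be the Unicode code points U+0621 to U+063A and U+0641 to U+064A. Second, the `FOLD` mapping is only one plausible way of lumping the eight modified forms onto primary letters. The names `ARABIC_LETTERS`, `FOLD` and `count_letters` are hypothetical.

```python
from collections import Counter

# Assumption: the 36 letters are U+0621..U+063A and U+0641..U+064A.
# This skips tatweel (U+0640) and the diacritics that start at U+064B.
ARABIC_LETTERS = ({chr(cp) for cp in range(0x0621, 0x063B)}
                  | {chr(cp) for cp in range(0x0641, 0x064B)})

# Hypothetical folding of the eight modified forms onto primary letters.
FOLD = {
    "\u0621": "\u0627",  # hamza                  -> alef
    "\u0622": "\u0627",  # alef with madda        -> alef
    "\u0623": "\u0627",  # alef with hamza above  -> alef
    "\u0625": "\u0627",  # alef with hamza below  -> alef
    "\u0624": "\u0648",  # waw with hamza         -> waw
    "\u0626": "\u064A",  # yeh with hamza         -> yeh
    "\u0649": "\u064A",  # alef maksura           -> yeh
    "\u0629": "\u0647",  # teh marbuta            -> heh
}

def count_letters(text, fold=False):
    """Count Arabic letters in text, ignoring diacritics and other characters.

    With fold=True, modified forms are counted under their primary letter.
    """
    counts = Counter()
    for ch in text:
        if ch in ARABIC_LETTERS:
            counts[FOLD.get(ch, ch) if fold else ch] += 1
    return counts

# A vocalized word: four letters and three diacritics.
sample = "\u0623\u064E\u062D\u0652\u0645\u064E\u062F"
print(sum(count_letters(sample).values()))  # 4
```

Calling `count_letters(text)` keeps all 36 letters separate, while `count_letters(text, fold=True)` reports at most 28 distinct letters.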

Sources with over five million letters

The following well-known Arabic sources were used to obtain a sample large enough for meaningful frequency statistics.

Collectively, these sources add up to 3,378 pages, containing 1,297,259 words or 5,122,132 letters. The following two figures show the letter frequency distribution for the counted letters: Figure 2 shows a histogram sorted by Unicode value, and Figure 3 shows the same histogram sorted by frequency.
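The totals above imply some simple averages, and the two sort orders used for the figures are easy to reproduce. The Python sketch below works from the stated totals. The `toy_counts` values are invented for illustration and are not the article's data, and the helper names are hypothetical.

```python
PAGES, WORDS, LETTERS = 3_378, 1_297_259, 5_122_132  # totals stated in the text

letters_per_word = LETTERS / WORDS
words_per_page = WORDS / PAGES
print(round(letters_per_word, 2))  # 3.95
print(round(words_per_page))       # 384

def relative(counts):
    """Convert raw counts to relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {letter: n / total for letter, n in counts.items()}

def by_code_point(counts):
    """Order (letter, count) pairs by Unicode value, as in Figure 2."""
    return sorted(counts.items(), key=lambda kv: ord(kv[0]))

def by_frequency(counts):
    """Order (letter, count) pairs from most to least frequent, as in Figure 3."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# Invented counts, for illustration only.
toy_counts = {"\u0644": 30, "\u0627": 50, "\u0645": 20}  # lam, alef, meem
print(by_code_point(toy_counts))
print(by_frequency(toy_counts))
```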

Qur'an letter and word frequency statistics

When the Qur'an is used as the data source for Arabic letter frequencies, the distribution of letters is much in line with what is reported in Figures 2 and 3 above.[4] The following list highlights statistics particular to one of the most common print editions (the recitation of Hafs through Asim), which is also available online.

A detailed study of letter and word frequencies across the entire Qur'an is provided by Intellaren Articles.[5]

References

  1. Ibn Kathir, Ismail (1???). The Beginning and the End (in Arabic). Retrieved 23 January 2011.
  2. Almubarakfuri, Safiyyurrahman (19??). The Sealed Nectar. Retrieved 24 January 2011.
  3. Ash-shuri, Majdi (19??). Masterpiece of the Bride. Retrieved 24 January 2011.
  4. Madi, Mohsen (2010). "Comparative frequency analysis of Arabic Texts". Intellaren Articles. Retrieved 24 January 2011.
  5. Madi, Mohsen (2010). "Quran Suras Statistics". Intellaren Articles. Retrieved 16 January 2011.

This article is issued from Wikipedia - version of the Sunday, August 10, 2014. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.