Wikipedia:Language recognition chart
This article describes a variety of simple clues one can use to determine what language a document is written in with high accuracy.
[edit] Characters
You can often recognize the language of a text by looking for characters specific to that language. This is frequently more accurate than language recognition software that pays little attention to individual characters.
- ABCDEFGHIJKLMNOPQRSTUVWXYZ (Latin alphabet)
- and no other - English language, Zulu language, Indonesian language, Malay language, Swahili language
- êéë Afrikaans language
- ØÅæøå - Danish language, Norwegian language
- ÖÅæöå - Norwegian language
- ÄÖåäö - Swedish language
- ÐÉÍÓÚÝÞÆÖáðéíóúýþæö - Icelandic language
- Öäö - Finnish language (occasionally ŠšŽž in loanwords as well as Åå in names)
- ÖÕÜäöõü - Estonian language
- àéëï - Dutch language
- êôúû - West Frisian language
- ĉĈĝĜĥĤĵĴŝŜŭŬ - Esperanto
- àâçéèêîïôœùû - French language
- ÀÇÉÈÍÓÒÚËÜÏáàçéèíóòúëüï (· only in Gascon dialect) - Occitan language
- ÖÜäöüß - German language
- àéèìòù - Italian language
- ÉÍÓÚÂÊÔÀãõçáéíóúâêôà (ü Brazilian and k, w and y not in native words) - Portuguese language
- áéíñÑóúü ¡¿ - Spanish language
- ÇÉÈÍÓÒÚÜÏàçéèíóòúüï· - Catalan language
- kñ (c not in native words) - Basque language
- ÊÎÔÛŴŶâêîôûŵŷáéíï - Welsh language
- ÉÍÓÖŐÚÜŰáéíóöőúüű - Hungarian language
- ĂÎÂŞŢăîâşţ - Romanian language
- çÇğĞıİöÖşŞüÜ - Turkish language
- çÊêÎÛû - Kurdish language
- ÁĄĄ́ÉĘĘ́ÍĮĮ́ŁŃ áąą́éęę́íįį́łń (FQRVfqrv not in native words) - Southern Athabaskan languages
- ’ÓǪǪ́ āą̄ēę̄īį̄óōǫǫ́ǭúū - Western Apache language
- 'ÓǪǪ́ óǫǫ́ - Navajo language
- ’ÚŲŲ́ úųų́ - Chiricahua language/Mescalero language
- ąćęłńóśźż Polish language
- ČŠŽ
- and no other - Slovenian language
- ĆĐ - Bosnian language, Croatian language
- ĎÉĚŇÓŘŤÚŮÝáďéěňóřťúůý - Czech language
- ÄĎÉÍĽĹŇÓÔŔŤÚÝáäďéíľĺňóôŕťúý - Slovak language
- ĀĒĢĪĶĻŅŌŖŪāēģīķļņōŗū - Latvian language
- ĄĘĖĮŲŪąęėįųū - Lithuanian language
- ả ạ ấ ầ ẩ ẫ ậ ắ ằ ẳ ẵ ặ đ ₫ ẻ ẹ ế ề ể ễ ệ ỉ ĩ ị ỏ ọ ổ ỗ ộ ơ ớ ờ ở ỡ ợ ủ ụ ư ứ ừ ử ữ ự ỷ ỹ ỵ – most are Vietnamese
- ā ē ī ō ū - May be seen in some Japanese texts in Romaji or transcriptions (see below) or Hawaiian and Māori texts.
- é - Sundanese language
- ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي Arabic script
- Brahmic family of scripts
- Bengali script
- অ আ কা কি কী উ কু ঊ কূ ঋ কৃ এ কে ঐ কৈ ও কো ঔ কৌ ক্ কত্ কং কঃ কঁ ক খ গ ঘ ঙ চ ছ জ ঝ ঞ ট ঠ ড ঢ ণ ত থ দ ধ ন প ফ ব ভ ম য র ৰ ল ৱ শ ষ স হ য় ড় ঢ় ০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯
- Devanāgarī
- अ प आ पा इ पि ई पी उ पु ऊ पू ऋ पृ ॠ पॄ ऌ पॢ ॡ पॣ ऍ पॅ ऎ पॆ ए पे ऐ पै ऑ पॉ ऒ पॊ ओ पो औ पौ क ख ग घ ङ च छ ज झ ञ ट ठ ड ढ ण त थ द ध न प फ ब भ म य र ल ळ व श ष स ह ० १ २ ३ ४ ५ ६ ७ ८ ९ प् पँ पं पः प़ पऽ
- used to write, either along with other scripts or exclusively, several Indian languages including Sanskrit, Hindi, Marathi, Kashmiri, Sindhi, Bihari, Bhili, Konkani, Bhojpuri and Nepali from Nepal.
- Gurmukhi
- ਅਆਇਈਉਊਏਐਓਔਕਖਗਘਙਚਛਜਝਞਟਠਡਢਣਤਥਦਧਨਪਫਬਭਮਯਰਲਲ਼ਵਸ਼ਸਹ
- primarily used to write Punjabi as well as Braj Bhasha, Khariboli (and other Hindustani dialects), Sanskrit and Sindhi.
- Gujarati script
- અઆઇઈઉઊઋઌઍએઐઑઓઔકખગઘઙચછજઝઞટઠડઢણતથદધનપફબભમયરલળવશષસહૠૡૢૣ
- used to write Gujarati and Kachchi
- БДЖИЛПУЦЧШ (Cyrillic alphabet)
- ЙЩЬЮЯ
- ҐЄІЇ - Ukrainian language
- Ъ - Bulgarian language
- ЁЭЫ - Russian language
- Ў, І instead of И - Belarusian language
- ЉЊЏ (Vuk Karadžić's reform)
- ЋЂ - Serbian language
- ЃЌЅ - Macedonian language
- ЅЋѸѲѠЩЪЬҌЮЯѦѪѮѰѴ - Old Church Slavonic
- In Transnistria, Romanian is written in Cyrillic characters
- ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩαβγδεζηθικλμνξοπρςστυφχψω (Greek Alphabet) – Greek language
- אבגדהוזחטיכלמנסעפצקרשת (Hebrew alphabet)
- and maybe some odd dots and lines above, below, or inside characters - Hebrew language
- פֿ; dots/lines below letters appearing only with א,י, and ו - Yiddish
- no dots or lines around the letters, and more than a few words end with א (i.e., they have it at the leftmost position) - Aramaic
- Ladino
- 日本語勉強 - East Asian Languages
- and no other - Chinese language
- with あいうえお Hiragana and/or アイウエオ Katakana - Japanese language
- with characters like 위키백과에 - Korean language
- Vietnamese uses Latin alphabet – see above
- ㄅㄆㄇㄈㄉㄊㄋㄌㄍㄎㄏ etc. -- ㄓㄨㄧㄋㄈㄨㄏㄠ (Zhuyin)
- ㄪㄫㄬ -- not Mandarin
- กขคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรฤฤๅลฦฦๅวศษสหฬอฮ (Thai alphabet) - Thai language
- Ա Բ Գ Դ Ե Զ Է Ը Թ Ժ Ի Լ Խ Ծ Կ Հ Ձ Ղ Ճ Մ Յ Ն Շ Ո Չ Պ Ջ Ռ Ս Վ Տ Ր Ց Ւ Փ Ք Օ Ֆ (Armenian alphabet) - Armenian language
- ა ბ გ დ ე ვ ზ ჱ თ ი კ ლ მ ნ ჲ ო პ ჟ რ ს ტ ჳ უ ფ ქ ღ ყ შ ჩ ც ძ წ ჭ ხ ჴ ჯ ჰ ჵ ჶ ჷ ჸ (Georgian alphabet) - Georgian language
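The script-level clues in this list can be approximated in code. The following is a minimal sketch, assuming that the first word of each letter's Unicode character name (for example "CYRILLIC" or "GREEK") is a good enough label for its script. The function name `dominant_script` is hypothetical, and the approach only narrows a text down to a script, not to a single language.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most frequent script label among the letters of `text`.

    Assumption: the first word of a character's Unicode name
    (e.g. "CYRILLIC SMALL LETTER A") identifies its script.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    # No letters at all: nothing to report.
    return counts.most_common(1)[0][0] if counts else None


print(dominant_script("Привет, мир"))
print(dominant_script("Γειά σου"))
```

Once the script is known, the language-specific letters listed above can be used to narrow the guess further.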
[edit] Latin alphabet (possibly extended)
[edit] Romance languages
Lots of Latin roots.
[edit] French (Français)
- Common words: de, la, le, du, des, il, et;
- Words ending in -x, especially -aux or -eux;
- Many apostrophised contractions, i.e. words beginning with l' or d'
- Accented letters: à â ç è é ê î ô û, rarely ë ï, but never á í ì ó ò ú, and ù only in the word où
[edit] Jèrriais
- Common words: lé, dé, tchi, ès, i', ch'
- "Tch", "dg", "th" and "în" are common character combinations. "ou" is frequently followed by another vowel.
- Many apostrophised short forms, e.g. words beginning with l', d' or r'. é frequently alternates with an apostrophe e.g. c'mîn/quémîn.
[edit] Spanish (Español)
- Characters: ¿ ¡ (inverted question and exclamation marks), ñ
- All vowels (á, é, í, ó, ú) may take an acute accent
- Some words frequently used: de, el, los, la(s), uno(s), una(s), y
- No apostrophised contractions
- Word endings: -o, -a, -ción, -miento, -dad
- Angle quotation marks: « » (though "curly-Q" quotation marks are also used); dialogue often indicated by means of dashes
[edit] Italian (Italiano)
- Almost every word ends in a vowel. Exceptions include non, il, per, con.
- Common one-letter word: è
- Common word: perché
- Letter sequences: gli, gn, sci
- Word endings: -o, -a, -zione, -mento, -tà, -aggio
- Grave accent (e.g., on à) almost always occurs in the last letter of words.
- Geminate consonants (tt, zz, cc, ss, bb, pp, ll, etc) are frequent.
[edit] Catalan (Català)
- Character combination "l·l"
- Word endings: -o, -a, -es, -ció, -tat
- Word beginning: ll-
[edit] Romanian (Română)
- Characters: ă â î ş ţ
- Common words: şi, de, la, a, ai, ale, alor, cu
- Word endings: -a, -ă, -u, -ul, -ului, -ţie (or -ţiune), -ment, -tate
- Double and triple i: copii, copiii
- Note that Romanian is sometimes written online with no diacritics, making it harder to identify
[edit] Portuguese (Português)
- Common one-letter words: a, à, e, é, o
- Common two-letter words: ao, as, às, da, de, do, em, os, ou, um
- Common three-letter words: aos, das, dos, ele, ela, não, por, que, uma, uns
- Common endings: -ção, -ções, -dade
- Common digraphs: nh, lh
- Most singular words end in vowels. Other singular words end in l, m, r, z
- Plural words end in -s
- European Portuguese often uses c before ç and t: acção, acto, etc.
[edit] Walloon (Walon)
- Characters: å, é, è, ê, î, ô, û
- Common digraphs and trigraphs: ai, ae, én, -jh-, tch, oe, -nn-, -nnm-, xh, ou
- Common one-letter words: a, å, e, i, t', l', s', k'
- Common two-letter words: al, ås, li, el, vs, ki, si, pô, pa, po, ni, èn, dj'
- Common three-letter words: dji, nos, vos, les, ses, nén, rén, bén, pol, tel, mel
- Common endings: -aedje, -mint, -xhmint, -ès, -ea, -ou, -owe, -yî, -åcion
- Apostrophes are followed by a space (preferably a non-breaking one), e.g. l' ome instead of l'ome.
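The "common words" clues for the Romance languages lend themselves to a simple count. The sketch below assumes that counting hits against short word lists is enough to separate a few candidates. The word lists are a subset copied from the French, Spanish and Portuguese entries above, and the names `COMMON_WORDS` and `guess_by_common_words` are hypothetical.

```python
import re

# Subsets of the common-word lists given in this chart.
COMMON_WORDS = {
    "French": {"de", "la", "le", "du", "des", "il", "et"},
    "Spanish": {"de", "el", "los", "la", "las", "uno", "unos", "una", "unas", "y"},
    "Portuguese": {"ao", "as", "da", "de", "do", "em", "os", "ou", "um",
                   "que", "uma", "não"},
}


def guess_by_common_words(text):
    """Return the language whose common words occur most often, or None."""
    # Split into runs of letters; digits, punctuation and underscores are dropped.
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in vocab for word in words)
        for lang, vocab in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_by_common_words("le chat et le chien du voisin"))
print(guess_by_common_words("el perro y los gatos"))
```

Words shared between languages (such as "de" and "la") score for every language that lists them, so very short texts can tie; in that case this sketch simply returns the first language with the top score.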
[edit] Germanic languages
[edit] English
- words: an, in, on, the, that, is, are, I (always capitalised)
- letter sequences: th, ch, sh, ough, augh
- word endings: -ing, -tion, -ed, -age, -s, -’s, -’ve, -n’t, -’d
[edit] Dutch (Nederlands)
- letter sequences: ij, ei, doubled vowels, kw, sch
- words: het, op, en, een, voor (and compounds of voor)
- word endings: -tje, -sje, -ing, -en, -lijk
- at the start of words: z-, v-, ge-
- t/m occasionally occurs between two points in time or between numbers (e.g. house numbers).
[edit] West Frisian (Frysk)
- letter sequences: ij, ei, oa
- words: yn
[edit] Afrikaans (Afrikaans)
- words: 'n, deur
[edit] German (Deutsch)
- umlauts (ä, ö, ü), Eszett (ß)
- letter sequences: sch, tsch, tz, ss
- common words: der, die, das, den, dem, des, er, sie, es, ist, ich, du, aber
- common endings: -en, -er, -ern, -st, -ung, -chen
- rare letters: x, y (except in loanwords)
- long compound words
- many capitalised words in the middle of sentences
[edit] Swedish (Svenska)
- common words: och, i, att, det, en, som, är, av, den, på
- long compound words
- letter sequences: stj, sj, skj, tj
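The letter-sequence clues for the Germanic languages can be tallied mechanically. This is a rough sketch, assuming that raw substring counts are informative; the sequences are a subset of those listed above, and `SEQUENCES` and `sequence_hits` are hypothetical names. Because "sch" is listed for both Dutch and German, the counts overlap and should be read only as weak evidence.

```python
# Subsets of the letter sequences given for each language in this chart.
SEQUENCES = {
    "English": ["th", "sh", "ough", "augh"],
    "Dutch": ["ij", "kw", "sch"],
    "German": ["sch", "tsch", "tz"],
    "Swedish": ["stj", "sj", "skj", "tj"],
}


def sequence_hits(text):
    """Count, per language, how often its typical letter sequences occur."""
    lowered = text.lower()
    return {
        lang: sum(lowered.count(seq) for seq in seqs)
        for lang, seqs in SEQUENCES.items()
    }


hits = sequence_hits("vrijheid en schrijven")
print(hits)
print(max(hits, key=hits.get))
```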
[edit] Baltic languages
[edit] Latvian (Latviešu)
- uses diacritics: ā, č, ē, ģ, ī, ķ, ļ, ņ, ō, ŗ, š, ū, ž
- does not have letters: Q, W, X, Y
- extremely rare doubling of vowels
- rare doubling of consonants
- a period (.) after ordinal numbers, e.g. 2005. gads
- common words: "ir", "bija", "tika", "es", "viņš"
[edit] Lithuanian (Lietuvių)
- visual abundance of letters ą, č, ę, ė, į, š, ų, ū, ž
- does not have letters q, w, x, y
- extremely rare doubling of vowels and consonants
- many varying forms (usually endings) of the same word, e.g. namas, namo, namus, namams, etc.
- generally long words (absence of articles and fewer prepositions in comparison to Germanic languages)
- common words: "ir", "yra", "kad", "bet".
[edit] Slavic languages
[edit] Polish (Polski)
- consonant clusters "rz", "sz", "cz", "prz", "trz";
- uses: ą, ę, ć, ś, ł, ó, ż, ź;
- words "i", "w";
- word "się".
[edit] Czech (Čeština)
- visual abundance of letters "ž,š,ů,ě,ř";
- words "je", "v";
- to distinguish from Slovak: does not use ä, ľ, ĺ, ŕ or ô.
[edit] Slovak (Slovenčina)
- visual abundance of letters "ž, š, č";
- uses : ä, ľ, ĺ, ŕ and ô;
- typical suffixes: -cia, -ť;
- to distinguish from Czech: does not use ě, ř or ů.
[edit] Macedonian (Македонски)
- uses: ј, љ, њ, џ, ѓ, ќ, ѕ
[edit] Serbian (Српски)
- uses: ј, љ, њ, џ, ђ, ћ
[edit] Bulgarian (Български)
- uses: ъ, щ, я, ю, й
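The distinguishing Cyrillic letters given in this chart (for Macedonian, Serbian, Ukrainian, Belarusian, Russian and Bulgarian) can be checked in order. The following is a sketch under the assumption that a single distinctive letter is decisive; the rule order and the names `RULES` and `guess_cyrillic_language` are illustrative choices, not part of the chart. Russian is tested before Bulgarian because Russian text can also contain ъ.

```python
# (language, distinctive lowercase letters), checked from top to bottom.
RULES = [
    ("Macedonian", "ѓќѕ"),
    ("Serbian", "ђћ"),
    ("Ukrainian", "ґєї"),
    ("Belarusian", "ў"),
    ("Russian", "ёэы"),
    ("Bulgarian", "ъ"),
]


def guess_cyrillic_language(text):
    """Return the first language whose distinctive letters appear, or None."""
    letters = set(text.lower())
    for language, distinctive in RULES:
        if letters & set(distinctive):
            return language
    return None


print(guess_cyrillic_language("Це їжа"))
print(guess_cyrillic_language("Това е въпрос"))
```

Short texts often contain none of these letters, in which case the function returns None rather than guessing.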
[edit] Celtic languages
[edit] Welsh (Cymraeg)
- letters Ŵ, ŵ used in Welsh
- words y, yr, yn, a, ac, i, o
- letter sequences wy, ch, dd, ff, ll, mh, ngh, nh, ph, rh, th, si
- letters not used: k, q, v, x, z
- letter only used rarely, in loanwords: j
- commonly accented letters: â, ê, î, ô, û, ŵ, ŷ
- word endings: -ion, -au, -wr, -wyr
- y is the most common letter in the language
- w between consonants (w is in fact a vowel in the Welsh language)
- circumflex accent (^) is by far the commonest diacritical mark, although diacritics are often omitted altogether.
[edit] Irish (Gaeilge)
- vowels with acute accents: á é í ó ú
- words beginning with letter sequences bp dt gc bhf
- letter sequences sc cht
[edit] Scottish Gaelic (Gàidhlig)
- vowels with grave accents: à è ì ò ù
- letter sequences sg chd
[edit] Iranian languages
[edit] Kurdish (Kurdî / كوردی)
- The word "xwe" (oneself, myself, yourself etc.) is highly specific (xw combination) and frequent.
- kir
[edit] Finno-Ugric languages
[edit] Finnish (Suomi)
- distinct letters ä and ö; but never õ or ü
- common words: sinä, on
- common endings: -nen, -ka/-kä, -in
- common vowel combinations: ai, uo, ei, ie, oi, yö, äi
- unusually high degree of letter doubling: both vowels and consonants are often geminated, for example aa, ee, ii, kk, ll, ss
[edit] Estonian (Eesti)
- distinct letters: ä, ö, õ and ü; but never ß or å
- f, z, š and ž appear in loanwords and proper names only; the last two are substituted with sh or zh in some texts
- c, q, w, x, y appear in (typically foreign) proper names only
- similar to Finnish, except:
- letter õ is unique to Estonian
- words end in consonants more frequently than in Finnish
- letter d is much more common in Estonian than in Finnish, and in Estonian it is often the last letter of the word, which it never is in Finnish
- common words: ja, on, ei, ta, see
[edit] Hungarian (Magyar)
- letters Ő, Ű, ő and ű unique to Hungarian
- letter combinations: sz, gy, cs, leg‐, ‐obb
- common words: a, az, ez, egy, és, van
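The letter clues for Finnish, Estonian and Hungarian can be combined into a small decision procedure. This sketch assumes the text is already known to be in one of these three languages (õ and ü also occur elsewhere, for example in Portuguese and German), and the function name `guess_finno_ugric` is hypothetical.

```python
import unicodedata


def guess_finno_ugric(text):
    """Guess among Finnish, Estonian and Hungarian from distinctive letters."""
    # NFC makes sure accented letters are single precomposed characters.
    letters = set(unicodedata.normalize("NFC", text).lower())
    if letters & set("őű"):
        return "Hungarian"   # ő and ű are unique to Hungarian
    if "õ" in letters:
        return "Estonian"    # õ is used in Estonian but never in Finnish
    if letters & set("äö") and "ü" not in letters:
        return "Finnish"     # ä/ö without õ or ü
    return None


print(guess_finno_ugric("Hyvää yötä"))
print(guess_finno_ugric("Tere õhtust"))
```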
[edit] Southern Athabaskan languages
- vowels with acute accent, ogonek (nasal hook), or both: á, ą, ą́
- doubled vowels: aa, áá, ąą, ą́ą́
- slashed l: ł
- n with acute accent: ń
- quotation mark: ' or ’
- sequences: dl, tł, tł’, dz, ts’, ií, áa, aá
- may have rather long words
[edit] Western Apache
In addition to the above,
- may use: u or ú
- may use vowels with macron: ā ą̄
- does not use ų
[edit] Navajo
In addition to the above,
- does not use u, ú, or ų
[edit] Chiricahua or Mescalero
In addition to the above,
- uses: u, ú, ų
- does not use o, ó, or ǫ
[edit] Basque (Euskara)
- word ending: -ak
- letter sequence: tx
[edit] Japanese in Romaji (Nihongo/日本語)
- words: "desu", "aru", "suru", esp. at end of sentences;
- word endings: "-masu", "-masen", "-shita";
- letters: nearly 50% vowels (a e i o u);
- letters: no consonants, except "n" and "h", at end of words
- a macron or circumflex may be used to indicate doubled vowels, e.g. Tōkyō
- common words: no, o, wa, de, ni
[edit] Hmong written in Romanized Popular Alphabet
- Almost all written words are quite short (one syllable).
- Syllables (unless they are pronounced with mid tone) end in a tone letter: one of b s j v m g d, leading to apparent "consonant clusters" such as -wj
- w can be the main vowel of a syllable (e.g. tswv)
- Syllables can begin with sequences such as hm-, ntxh-, nq-.
- Syllables ending in double vowels (especially -oo, -ee), possibly followed by a tone letter (as in Hmoob "Hmong").
[edit] Vietnamese (Tiếng Việt)
- Roman characters with many diacritical marks on vowels. See above.
- Almost all written words are quite short (one syllable).
- Words beginning with "ng"
- common words: "cái", "không", "có", "ở"
[edit] VIQR
- The following characters (often in combination) after vowels: ^ ( + ' ` ? ~ .
- DD, Dd, or dd
- The following character before punctuation: \
[edit] VNI
- The digits 1-8 after vowels
- The digit 9 after a D or d
- The following character before numbers: \
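The VNI clues (a digit 1-8 after a vowel, or a 9 after D or d) map directly onto a regular expression. The following is a minimal sketch; the pattern and the name `vni_hits` are illustrative, and ordinary text containing things like model numbers can trigger false positives.

```python
import re

# A vowel followed by a digit 1-8, or d/D followed by 9.
VNI_PATTERN = re.compile(r"[aeiouy][1-8]|d9", re.IGNORECASE)


def vni_hits(text):
    """Count places in `text` that look like VNI tone or letter codes."""
    return len(VNI_PATTERN.findall(text))


print(vni_hits("Tie61ng Vie65t"))
print(vni_hits("Hello world 2024"))
```

A text with many hits relative to its length is a reasonable candidate for VNI-encoded Vietnamese.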
[edit] Telex
- The following characters after vowels: s f r x j
- The following vowels, doubled up: a e o
- The letter "w" after the following characters: a o u
- DD, Dd, or dd
[edit] Chinese, Romanized
[edit] Standard Mandarin
- In general, Mandarin syllables end only in n, ng, r; never in p, t, k, m
[edit] Pinyin
- Words beginning with x, q, zh
- Tone marks on vowels, such as ā, á, ǎ, à
- For convenience while using a computer, these are sometimes substituted with numbers, e.g. a1, a2, a3, a4
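Both Pinyin tone conventions described above, tone marks on vowels and trailing tone numbers, can be looked for at once. This is a sketch assuming precomposed tone-marked vowels; `pinyin_tone_evidence` is a hypothetical name, and acute or grave accents alone also occur in many other languages, so the result is only a hint.

```python
import re
import unicodedata

# Vowels carrying one of the four tone marks (normalized to precomposed form).
TONE_MARKED = set(unicodedata.normalize("NFC", "āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜ"))
# A run of letters followed by a tone number 1-4, e.g. "hao3".
NUMBERED = re.compile(r"\b[a-zü]+[1-4]\b", re.IGNORECASE)


def pinyin_tone_evidence(text):
    """Return (count of tone-marked vowels, count of numbered syllables)."""
    normalized = unicodedata.normalize("NFC", text).lower()
    marked = sum(ch in TONE_MARKED for ch in normalized)
    numbered = len(NUMBERED.findall(normalized))
    return marked, numbered


print(pinyin_tone_evidence("ni3 hao3"))
```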
[edit] Wade-Giles
- Words do not begin with b, d, g
- Words beginning with hs
- Many hyphenated words
- Apostrophes, e.g. t`a, ch`i
[edit] Gwoyeu Romatzyh
- Many unusual vowel combinations such as ae, eei, ii, iee, oou, yy, etc.
- Insertion of r, e.g. arn, erng, etc.
- Words ending in nn, nq
[edit] Standard Cantonese
- In general, Cantonese syllables can end in p, t, k, m, n, ng; never r
[edit] Minnan in Pe̍h-oē-jī
- Many hyphenated words.
- Words can end in p, t, k, m, n, ng, h; never r
- Roman characters with many diacritical marks on vowels. Unlike Vietnamese, each character has at most one such mark.
- Unusual combining characters, namely · (middle dot, always after "o") and | (vertical bar). - (macron) is also common.
[edit] Turkic languages
Note that some Turkic languages like Azeri and Türkmen use a similar Latin alphabet (often Jaŋalif) and similar words, and might be confused with Turkish. Azeri has the letters Əə, Xx and Qq, which are not present in the Turkish alphabet, and Türkmen has Ää, Žž, Ňň and Ýý. Latin characters used uniquely (or nearly uniquely) for Turkic languages: Əə, Ŋŋ, Ɵɵ, Ьь, Ƣƣ, Ğğ, İ, and ı.
[edit] Turkish
[edit] Turkish Alphabet
Lowercase: a b c ç d e f g ğ h ı i j k l m n o ö p r s ş t u ü v y z
Uppercase: A B C Ç D E F G Ğ H I İ J K L M N O Ö P R S Ş T U Ü V Y Z
[edit] Common words
- bir — one, a
- bu — this
- fakat — but
- oldu — was
- şu — that
[edit] Misc.
- Look for word endings. Tense changes in Turkish verbs are created by adding suffixes to the end of the verb. Pluralizations occur by adding -lar and -ler.
- Common Tense Changes: -mış -muş -sun
- Possessivity/person: -im -un -ın -in -iz -dur -tır
- Example: Yapmıştır, "[He] did it"; Yap is the verb stem meaning "to do", -mış indicates the perfect tense, -tır indicates the third person (he/she/it).
- Example: Adalar, "Islands"; Ada is a noun meaning "island", -lar makes it plural.
- Example: Evimiz, "Our house"; Ev is a noun meaning "house", -im indicates the first-person possessor, which -iz then makes plural.
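The plural endings -lar and -ler can be checked programmatically. The sketch below additionally assumes Turkish vowel harmony, which the chart does not spell out: -lar follows a stem whose last vowel is a back vowel (a, ı, o, u) and -ler follows a front vowel (e, i, ö, ü). The name `looks_like_turkish_plural` is hypothetical, and words from other languages (such as "dollar") can pass the test by coincidence.

```python
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")


def looks_like_turkish_plural(word):
    """True if `word` ends in -lar/-ler and the ending matches vowel harmony."""
    # Note: str.lower() does not apply Turkish casing rules for dotted capital İ.
    word = word.lower()
    if len(word) < 4 or word[-3:] not in ("lar", "ler"):
        return False
    stem, suffix = word[:-3], word[-3:]
    # The last vowel of the stem decides which ending is expected.
    for ch in reversed(stem):
        if ch in BACK_VOWELS:
            return suffix == "lar"
        if ch in FRONT_VOWELS:
            return suffix == "ler"
    return False


print(looks_like_turkish_plural("Adalar"))
print(looks_like_turkish_plural("evlar"))
```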
[edit] Azeri
Azeri can be easily recognized by the frequent use of ə. This letter is not used in any other officially recognized modern Latin alphabet. In addition, it uses the letters x and q, which are not used in Turkish.
- Common words: və, ki, ilə, bu, o, isə, görə, da, də
- Frequent use of diacritics: ç, ə, ğ, ı, İ, ö, ş, ü
- Words ending in -lar, -lər, -ın, -in, -da, -də, -dan, -dən
- Words never beginning with ğ or ı
- Words rarely beginning with two or more consonants
- Transliteration of foreign words and names, e.g. Audrey Hepburn = Odri Hepbern
[edit] Chinese
- No spaces
- Arabic numerals (0-9) sometimes used
- Punctuation:
- Period 。(not .)
- Serial comma 、(distinguished from the regular comma ,)
- Ellipsis …… (six dots)
- No hiragana, katakana, or hangul
- May be written vertically
[edit] Simplified Chinese vs Traditional Chinese
Note: Many characters were not simplified. As a result, it is common for a short word or phrase to be identical between Simplified and Traditional, but it is rare for an entire sentence to be identical as well.
Common radicals different between Traditional and Simplified:
- Simplified: 讠钅饣纟门(e.g. 语 银 饭 纪 问)
- Traditional: 訁釒飠糹門(e.g. 語 銀 飯 紀 問)
Common characters different between Traditional and Simplified:
- Simplified: 国 会 这 来 对 开 关 门 时 个 书 长 万 边 东 车 为 儿
- Traditional: 國 會 這 來 對 開 關 門 時 個 書 長 萬 邊 東 車 為 兒
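The paired character lists above can be used to vote on whether a text is Simplified or Traditional. This is a minimal sketch using only the eighteen character pairs listed in this section; the names `SIMPLIFIED`, `TRADITIONAL` and `guess_chinese_variant` are hypothetical, and a short phrase may contain none of these characters.

```python
# The character pairs listed in this section, in the same order.
SIMPLIFIED = set("国会这来对开关门时个书长万边东车为儿")
TRADITIONAL = set("國會這來對開關門時個書長萬邊東車為兒")


def guess_chinese_variant(text):
    """Return "Simplified", "Traditional", or None if the evidence is tied."""
    simplified_hits = sum(ch in SIMPLIFIED for ch in text)
    traditional_hits = sum(ch in TRADITIONAL for ch in text)
    if simplified_hits > traditional_hits:
        return "Simplified"
    if traditional_hits > simplified_hits:
        return "Traditional"
    return None


print(guess_chinese_variant("这个国家"))
print(guess_chinese_variant("這個國家"))
```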
[edit] Standard written Chinese (based on Mandarin) vs written Vernacular Cantonese
Note: Cantonese-speakers live in Mainland China, Hong Kong and Macau, so written Cantonese can be written in either Simplified or Traditional characters.
Common characters in Vernacular Cantonese that do not occur in Mandarin (only characters that are the same between Traditional and Simplified are chosen here):
- 嘅 咗 咁 嚟 啲 唔 佢 乜 嘢
Some of the above characters are not supported in all character encodings, so sometimes the 口 radical on the left is substituted with a "0" or "o", e.g.
- o既 0既
[edit] Japanese
- Katakana (カタカナ) and hiragana (ひらがな) characters mixed with kanji (漢字)
- Few or no spaces
- Arabic numerals (0-9) sometimes used
- Punctuation:
- Period 。
- Comma 、(,also used)
- Quotation marks 「」
- Occasional small letters beside large ones, e.g. しゃ りゅ しょ って シャ リュ ショ ッテ
- Double tick marks appearing at upper right of letters, e.g. で が ず デ ガ ズ
- Empty circles appearing at upper right of letters, e.g. ぱ ぴ パ ピ
- Frequent characters: の を は が
- May be written vertically
[edit] Korean
- Western-style punctuation marks
- Western-style spacing
- Hangul letters, e.g. ㅎ h, ㅇ ng, ㅂ b, etc.
- Hangul letters used to form syllable blocks; e.g. ㅅ s + ㅓ eo + ㅇ ng = 성 seong
- Circles and ellipses are commonplace in Hangul; they are exceedingly rare in Chinese.
- General appearance has relatively uniform complexity, as contrasted with Chinese or Japanese.
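The clues for telling Chinese, Japanese and Korean apart (kana means Japanese, hangul means Korean, Han characters alone suggest Chinese) can be expressed with Unicode code point ranges. The sketch below assumes the common blocks only (Hiragana and Katakana U+3040 to U+30FF, Hangul syllables U+AC00 to U+D7A3 plus jamo blocks, CJK Unified Ideographs U+4E00 to U+9FFF); the name `guess_cjk_language` is hypothetical.

```python
def guess_cjk_language(text):
    """Guess Japanese, Korean or Chinese from the scripts present in `text`."""
    has_kana = any("\u3040" <= ch <= "\u30ff" for ch in text)
    has_hangul = any(
        "\uac00" <= ch <= "\ud7a3"      # precomposed syllable blocks
        or "\u1100" <= ch <= "\u11ff"   # conjoining jamo
        or "\u3130" <= ch <= "\u318f"   # compatibility jamo
        for ch in text
    )
    has_han = any("\u4e00" <= ch <= "\u9fff" for ch in text)
    if has_kana:
        return "Japanese"
    if has_hangul:
        return "Korean"
    if has_han:
        return "Chinese"
    return None


print(guess_cjk_language("日本語を勉強する"))
print(guess_cjk_language("위키백과에"))
```

Kana is tested first because Japanese text normally mixes kana with Han characters, while Chinese text contains neither kana nor hangul.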
[edit] Thai
- Thai language in writing can most easily be identified by its unique alphabet (Thai alphabet):
- Thai alphabet consonants, in order: กขคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรฤฤๅลฦฦๅวศษสหฬอฮ
- No spaces, generally
- Use of double-quotes (" ") and exclamation mark (" ไทย! ") somewhat common, especially in newsprint
- Unique system of diacritics (ไม้เอก, ไม้โท, ไม้ตรี, and ไม้จัตวา), derived from Indic numerals.
- Frequently uses Arabic numerals (0-9), but Thai numerals (๐ ๑ ๒ ๓ ๔ ๕ ๖ ๗ ๘ ๙) are also common.
- Example of Arabic numeral usage: วันอาทิตย์ ที่ 30 ธันวาคม 2550 ("Sunday 30 December 2007")
- Certain vowels located above ( -ิ -ี -ึ -ื ), and others below ( อุ อู ), consonant letters on the line.
[edit] Greek
Modern Greek is written in the Greek alphabet in monotonic, polytonic or atonic form, following either Demotic (Triantafilidis) grammar or Katharevousa grammar. Some people write in Greeklish (Greek in Latin script), which may be visual-based, orthographic, phonetic, or mixed ("messed-up"). The only official forms of the Greek language are monotonic and polytonic.
[edit] Normal Modern Greek (Greek Monotonic)
- words "και", "είναι";
- Each multi-syllable word has one accent/tone mark (oxia): ά έ ή ί ό ύ ώ
- The only other diacritic ever used is the trema: ϊ/ΐ, ϋ/ΰ, etc.
[edit] Ancient or pre-1980s Greek (Greek Polytonic)
- This is Katharevousa or some mixed form of Demotiki (Triantafilidis' grammar) and Katharevousa;
- You will notice several accents/tones. Examples: ~ ` and oxia (looks like 'ί);
- You may also notice these: ΐ, ΰ, ϊ, ϋ, etc.
[edit] Greek Atonic
- Was common in some Greek media (television);
- You will see Greek characters without accents/tones;
- words: "και, ειναι, αυτο".
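The difference between monotonic, polytonic and atonic Greek can be detected from the accent marks alone. This is a sketch assuming Unicode NFD decomposition: a grave, perispomeni (circumflex), breathing mark or iota subscript signals polytonic text, an acute on its own signals monotonic text, and Greek letters with no accents at all signal atonic text. The name `classify_greek_accents` is hypothetical.

```python
import unicodedata

# Combining marks that only occur in polytonic spelling:
# grave, perispomeni, smooth breathing, rough breathing, iota subscript.
POLYTONIC_MARKS = {"\u0300", "\u0342", "\u0313", "\u0314", "\u0345"}
ACUTE = "\u0301"  # tonos/oxia after decomposition


def classify_greek_accents(text):
    """Return "polytonic", "monotonic", "atonic", or None for non-Greek text."""
    decomposed = unicodedata.normalize("NFD", text)
    if not any("\u0370" <= ch <= "\u03ff" for ch in decomposed):
        return None
    marks = set(decomposed)
    if marks & POLYTONIC_MARKS:
        return "polytonic"
    if ACUTE in marks:
        return "monotonic"
    return "atonic"


print(classify_greek_accents("και είναι"))
print(classify_greek_accents("και ειναι αυτο"))
```

The trema (diaeresis) is deliberately ignored, since it appears in both monotonic and polytonic text.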
[edit] Greek in Greeklish
- Automated conversion software for Greeklish->Greek conversion exists. If you notice a Greeklish text it may be useful for the Greek el.wikipedia (after conversion).
- Keep in mind: in Greeklish more than one character may be used for one letter (for example, th for theta).
[edit] Orthographic Greeklish
- words "kai", "einai".
[edit] Phonetic Greeklish
- words "ke", "ine";
- omega appears as o;
- ei, oi appear as i;
- ai appears as e.
[edit] Visual-based Greeklish
- omega (Ω or ω) may appear as W or w;
- epsilon (E) may appear as "3";
- alpha (A) may appear as "4";
- theta (Θ) may appear as "8";
- upsilon (Y) may appear as "\|/";
- gamma (γ) may appear as "y"
- More than one character may be used for one letter.
[edit] Messed-up (Mixed) Greeklish
- words "kai", "eine";
- combines principles of phonetic, visual-based and orthographic Greeklish according to writer's idiosyncrasy;
- The most commonly used form of Greeklish.
[edit] Armenian language
Armenian can be recognised by its unique 38-letter alphabet:
Ա Բ Գ Դ Ե Զ Է Ը Թ Ժ Ի Լ Խ Ծ Կ Հ Ձ Ղ Ճ Մ Յ Ն Շ Ո Չ Պ Ջ Ռ Ս Վ Տ Ր Ց Ւ Փ Ք Օ Ֆ
[edit] Georgian language
Georgian can be recognised by its unique alphabet.
ა ბ გ დ ე ვ ზ ჱ თ ი კ ლ მ ნ ჲ ო პ ჟ რ ს ტ ჳ უ ფ ქ ღ ყ შ ჩ ც ძ წ ჭ ხ ჴ ჯ ჰ ჵ ჶ ჷ ჸ
[edit] Malay and Indonesian
May contain the following:
Prefixes: me-, mem-, memper-, pe-, per-, di-, ke-
Suffixes: -kan, -an, -i
Others (these almost always written in lower case): yang, dan, di, ke
Malay and Indonesian are mutually intelligible to proficient speakers, although translators and interpreters will generally be specialists in one or other language.
Frequent use of the letter 'a' (comparable to the frequency of the English 'e').
[edit] Artificial languages
[edit] Esperanto
- words: de, la, al, kaj
- Six accented letters: ĉ Ĉ ĝ Ĝ ĥ Ĥ ĵ Ĵ ŝ Ŝ ŭ Ŭ
- words ending in o, a, oj, aj, on, an, ojn, ajn, as, os, is, us, u, i, aŭ
[edit] Klingon
- When written in the Latin alphabet Klingon has the unusual property of a distinction in case; "q" and "Q" are different letters, and other letters are either always (e.g. D, I, S) or never (e.g. ch, t, v) written in upper case. This causes a large number of words that look quite strange to people who aren't used to it, for example: "yIDoghQo'", "tlhIngan Hol" (with mixed case).
- The apostrophe is fairly frequent, especially at the end of a word or syllable.
- Common suffixes: -be', -'a'
- Common words: 'oH
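Klingon's mixed-case spelling can be spotted by looking for capitals inside words. The following is a rough sketch, assuming that a lowercase letter or apostrophe directly followed by an uppercase letter is unusual in ordinary prose; the name `mixed_case_words` is hypothetical, and brand names or surnames with internal capitals will also match.

```python
import re

# A lowercase letter or apostrophe immediately followed by an uppercase letter.
INTERNAL_CAPITAL = re.compile(r"[a-z'][A-Z]")


def mixed_case_words(text):
    """Return the whitespace-separated words that contain an internal capital."""
    return [word for word in text.split() if INTERNAL_CAPITAL.search(word)]


print(mixed_case_words("tlhIngan Hol"))
print(mixed_case_words("Hello World"))
```

A high proportion of such words, together with frequent apostrophes, is consistent with the Klingon clues listed above.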
[edit] Lojban
- starts with "ni'o" or ".i" (or "i");
- has many words like "ko'a" "pi'o" etc;
- all lowercase;
- usually no punctuation except for dots;
- may use commas in the middle of words (typically proper nouns).
[edit] External links
- Translated, an online language identifier, 102 languages supported
- Xerox, an online language identifier, 47 languages supported