User:Aleksandar Šušnjar/Serbian Wikipedia's Challenges
- Main copy of the article is currently at sr:Корисник:Asusnjar/Serbian Wikipedia's Challenges. These two copies should be synchronized.
The Serbian-language Wikipedia has an interesting usability challenge stemming from the availability and widespread use of two separate alphabets, Cyrillic and Latin, and from the presence of multiple dialects.
This situation is shared with, or similar to that of, other languages that either do not use Latin script as their primary script or otherwise have multiple scripts available.
These languages have to deal with transliteration issues identical or similar to those discussed here.
On top of this, there is an issue that Wikipedia does not seem to solve entirely and that is common to all languages: snippets of text in one language inserted into text in another. Currently there is no standard way of easily tagging such text to let both machines and people know of it. While the importance of this varies between languages, the possibility of transliteration requires such tagging in order to determine whether to apply transliteration rules, and which ones.
Background
Alphabets
The primary alphabet used is Cyrillic. It is more precise than Latin (as will be explained later) and is taken as the standard alphabet of Serbian Wikipedia. Both alphabets record what is spoken and not necessarily Serbian language text. As such there is generally no "correct way" to spell words - there is a correct way to pronounce them, and properly recorded correct pronunciation results in "correct writing".
As such, Serbian alphabets do not hide pronunciation differences as English does. For example, the English word spelled "tomato" can be pronounced (correctly or not) as tom-uh-toh or tom-ey-tow, which would yield different spellings in either Serbian alphabet:
- tom-uh-toh: "томато" (Cyrillic) and "tomato" (Latin)
- tom-ey-tow: "томејтоу" (Cyrillic) and "tomejtou" (Latin)
This leads to a tendency to transliterate foreign words, including trademarks, into a Serbian pronunciation form. For example, "Microsoft Windows" may be written (not translated) as "Majkrosoft Uindouz" (Latin) or "Мајкрософт Уиндоуз" (Cyrillic).
Cyrillic
The Cyrillic alphabet has thirty letters, each in both uppercase and lowercase form. All letters consist of a single symbol and there are no combinations that require special treatment or pronunciation. It is not entirely phonetic, but it comes close enough that most people do not notice the difference.
The letters (upper and lowercase pairs) are, in proper sort order:
- Аа Бб Вв Гг Дд Ђђ Ее Жж Зз Ии Јј Кк Лл Љљ Мм Нн Њњ Оо Пп Рр Сс Тт Ћћ Уу Фф Хх Цц Чч Џџ Шш
Latin
The Latin alphabet also has thirty letters that correspond one-to-one to the Cyrillic ones. They have, however, a different sort order. Details can be found at Serbian language - Alphabets. The letters are, in Latin sort order:
- Aa Bb Cc Čč Ćć Dd Dž/dž Đđ Ee Ff Gg Hh Ii Jj Kk Ll Lj/lj Mm Nn Nj/nj Oo Pp Rr Ss Šš Tt Uu Vv Zz Žž
Some of the letters are digraphs entered as two separate characters (lj, nj and dž). This causes them to also have a "title case" (in addition to upper and lower case), because both characters, only the leading one, or neither character in the pair may be uppercase:
- LJ / Lj / lj
- NJ / Nj / nj
- DŽ / Dž / dž
Unicode provides special code points for those letters, allowing each to be a single character, but support for them is not widespread and they are, thus, not used. Introducing them would require major changes, as standard Latin computer and typewriter keyboards do not contain these characters (they require them to be typed as two).
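The single-character digraphs mentioned above can be inspected with a short Python sketch. It only relies on the standard unicodedata module; the helper name expand_digraphs is an illustrative assumption, not part of any existing tool.

```python
import unicodedata

# The nine precomposed digraph code points, U+01C4..U+01CC:
# DŽ Dž dž, LJ Lj lj, NJ Nj nj (upper, title and lower case of each).
DIGRAPHS = [chr(cp) for cp in range(0x01C4, 0x01CD)]


def expand_digraphs(text: str) -> str:
    """Replace precomposed digraphs with the usual two-character spelling."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if ch in DIGRAPHS else ch
        for ch in text
    )


if __name__ == "__main__":
    for ch in DIGRAPHS:
        # Category "Lt" marks the title-case forms, "Lu"/"Ll" upper/lower.
        print(f"U+{ord(ch):04X}", unicodedata.category(ch),
              unicodedata.name(ch), "->", expand_digraphs(ch))
```

Compatibility normalization (NFKC) is used here only as a convenient way to get the two-character spelling; going the other way, from two characters to one, is the ambiguous direction discussed next.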
Being (visually) composed of two characters, those letters present a certain ambiguity: whether to read them as a single letter or as two. In the great majority of cases these combinations can be treated as a single letter, but not in all. There are exceptions, for example:
- "nj" in "injekcija" is pronounced as two separate sounds, but as one in "njuška"
- "dž" in "nadživeti" is pronounced as two separate sounds, but as one in "džak"
- "lj" in transcription of Slovenian capital "Ljubljana" is pronounced as two separate sounds (both times), but it is usual to pronounce the name of the city with one sound (both times).
Having an on-line dictionary available to an automated transliterator would help to an extent. However, it is very hard to make a complete Serbian dictionary covering all word forms as there are many, unlike in English. Each word can have many forms (by virtue of conjugation, declension, etc.). Recording pronunciation further complicates the issue because the same word is pronounced slightly differently in different dialects, introducing even more forms.
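A minimal sketch of such a dictionary-assisted transliterator, in Python, might look as follows. The exception table SPLIT_STEMS and the "|" separator are assumptions made for illustration (two stems taken from the examples above, nowhere near complete), and only lower-case single words are handled.

```python
# One-to-one part of the Latin -> Cyrillic mapping (27 single letters).
SINGLE = dict(zip("abvgdđežzijklmnoprstćufhcčš",
                  "абвгдђежзијклмнопрстћуфхцчш"))
# The three digraphs that normally stand for one Cyrillic letter.
DIGRAPH = {"lj": "љ", "nj": "њ", "dž": "џ"}

# Hypothetical exception dictionary: stems in which the pair is really two
# letters. "|" marks where the pair must be split.
SPLIT_STEMS = {"injekcij": "in|jekcij", "nadživ": "nad|živ"}


def lat_to_cyr(word: str) -> str:
    """Transliterate one lower-case Serbian Latin word to Cyrillic."""
    for stem, marked in SPLIT_STEMS.items():
        word = word.replace(stem, marked)
    out, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPH:            # genuine digraph: one Cyrillic letter
            out.append(DIGRAPH[pair])
            i += 2
        elif word[i] == "|":           # split marker: emit nothing
            i += 1
        else:                          # single letter, or copy unknown chars
            out.append(SINGLE.get(word[i], word[i]))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    for w in ("njuška", "injekcija", "džak", "nadživeti"):
        print(w, "->", lat_to_cyr(w))
```

Matching on stems rather than whole words is one way to cover declined forms without listing each of them, at the price of possible false matches inside unrelated words.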
Another transliteration ambiguity is the selection of letter case (upper/title/lower) when transliterating from Cyrillic to Latin. In the vast majority of cases it is straightforward and can be determined from the case of the following letter, for example:
- "ЏАК" becomes either "DŽAK" (typically for titles) or "DžAK" (safer for acronyms)
- "Џак" becomes "Džak"
- "џак" becomes "džak"
This, however, is not always possible, as is the case with acronyms.
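The case rule described above can be sketched in Python as follows. The function name cyr_to_lat and its acronym_safe switch are illustrative assumptions; the switch simply forces the title-case form ("Dž") for upper-case digraphs, which is the safer choice named above for acronyms.

```python
CYR = "абвгдђежзијклљмнњопрстћуфхцчџш"
LAT = "a b v g d đ e ž z i j k l lj m n nj o p r s t ć u f h c č dž š".split()
LOWER = dict(zip(CYR, LAT))   # lower-case Cyrillic -> lower-case Latin


def cyr_to_lat(text: str, acronym_safe: bool = False) -> str:
    """Transliterate Cyrillic to Latin, choosing the case of digraphs."""
    out = []
    for i, ch in enumerate(text):
        lat = LOWER.get(ch.lower())
        if lat is None:                 # not a Serbian Cyrillic letter
            out.append(ch)
        elif ch.islower():
            out.append(lat)
        elif len(lat) == 1:             # ordinary upper-case letter
            out.append(lat.upper())
        else:
            # Upper-case digraph: look at the following character.
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if nxt.isupper() and not acronym_safe:
                out.append(lat.upper())        # e.g. DŽ
            else:
                out.append(lat.capitalize())   # e.g. Dž
    return "".join(out)


if __name__ == "__main__":
    print(cyr_to_lat("ЏАК"), cyr_to_lat("ЏАК", acronym_safe=True),
          cyr_to_lat("Џак"), cyr_to_lat("џак"))
```

Whether a given all-capitals word is a title or an acronym cannot be decided from the letters alone, which is why the choice is left to the caller in this sketch.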
ASCII Latin
Serbian Latin is essentially composed of symbols available in English Latin, with some special forms with "things" on top of the letters (as in Š, Ć, Č, Ž) and Đ. Original and current unavailability of Serbian keyboards, operating system support, etc. and popularization of the Internet led to widespread use of "base letter forms" without those "things on top". Thus, Š becomes S, both Č and Ć become C and Ž becomes Z. Letter Đ takes the approach of dual letters "Lj" and "Nj" in which "j" acts as a "softener" - so Đ becomes yet another "dual letter" - Dj.
Obviously, information is lost, but the human brain is mostly capable of figuring it out from context. However, context may sometimes be insufficient or unavailable. Examples:
- "koza" can either mean "koza" (goat) or "koža" (skin)
- "sisati" can either mean "sisati" (to suck) or "šišati" (to cut (hair))
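ASCII Latin conversion table (only the ASCII column can be recovered from the description above; the "Tanjug" and "Other" columns are not preserved in this copy):
Serbian Latin | ASCII
Š š | S s
Č č | C c
Ć ć | C c
Ž ž | Z z
Đ đ | Dj dj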
Notes: the "Tanjug" standard was used by the Tanjug News Agency for its news feeds when only ASCII was available. The "Other" examples are not standard but are cases of actual empirical use (there are more). Neither needs to be considered for Wikipedia: of all the "incorrect/incomplete" transliterations, ASCII is by far the most common, accounting for more than 99% of all texts available in incorrect transliterations.
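The base-letter folding described in this section, and the information it loses, can be sketched in Python. The helper names to_ascii_latin and collisions are illustrative assumptions; the mapping itself is the one given in the text (Š to S, Č and Ć to C, Ž to Z, Đ to Dj).

```python
from collections import defaultdict

# Base-letter folding as described above; "đ" becomes the dual letter "dj".
ASCII_FOLD = str.maketrans({
    "š": "s", "Š": "S", "č": "c", "Č": "C", "ć": "c", "Ć": "C",
    "ž": "z", "Ž": "Z", "đ": "dj", "Đ": "Dj",
})


def to_ascii_latin(text: str) -> str:
    """Fold Serbian Latin into its ASCII-only ("crippled") form."""
    return text.translate(ASCII_FOLD)


def collisions(words):
    """Group words that become indistinguishable after folding."""
    groups = defaultdict(list)
    for word in words:
        groups[to_ascii_latin(word)].append(word)
    return {key: ws for key, ws in groups.items() if len(ws) > 1}


if __name__ == "__main__":
    print(collisions(["koza", "koža", "sisati", "šišati", "mleko"]))
```

The folding is one-way: nothing in "koza" says whether the original had "z" or "ž", so the reverse step needs context or a dictionary.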
Dialects
Once upon a time, a long time ago, they say there was a special letter, Yat. That letter was pronounced differently in different regions - see South Slavic languages - Rendering of yat. That letter is now gone, but its different pronunciations remain:
- "Е" (Cyrillic) / "E" (Latin) - as in tEn in English
- "(И)ЈЕ" (Cyrillic) / "(I)JE" (Latin) - "(EE)YE" in English
- "И" (Cyrillic) / "I" (Latin) - "E" in English
One example can be Serbian word for milk: млеко/млијеко/млико (Cyrillic) or mleko/mlijeko/mliko (Latin).
The first two dialects are major, and the last one may not be treated as "Serbian" at all but is nevertheless present in the region. All dialects have names based on the pronunciation of the letter + "KAVSKI" suffix:
- ЕКАВСКИ / EKAVSKI
- ИЈЕКАВСКИ / IJEKAVSKI
- ИКАВСКИ / IKAVSKI
Since Serbian alphabets record speech (pronunciation) it is possible for a single article to contain multiple dialects. For example, the most common "ekavski" dialect can be used for the main article text, but it can contain a quote of a person speaking in "ijekavski" dialect.
The difference between dialects is minor enough to not cause any problems for a speaker of any dialect to understand spoken or written word from any other dialect, but having an inappropriate mixture of dialects in the same article (unless for a specific reason) is considered bad practice. This may be somewhat analogous to differences between American and British spellings and consistency desired in a single article but is much more frequent than English spelling differences.
Wikipedia contributors will typically write articles in their own dialect. Such an article is then likely to be edited by a speaker of another dialect, causing inconsistencies. Speakers of one dialect may not be aware of exactly which words and letters are affected and may miss or overdo any "personal conversion" attempts. For example, the following words are not subject to change from one dialect to another, although they may appear to be to an automatic converter:
- десно / desno
- оријентација / orijentacija
- инјекција / injekcija
Unlike spelling differences in English, dialect pronunciation differences are (although relatively minor) very frequent. This means that while cross-dialect readability of text is not an issue, searching for an article is - in English Wikipedia spelling differences are handled with redirects (e.g. Colour vs. Color) but they would be too frequent and also frequently omitted by contributors in Serbian.
So, those two or three dialects coupled with the two official alphabets and the extra "crippled Latin" one potentially give up to six (or nine if the "ikavian" dialect is included) different correct forms of affected words.
Since Serbian alphabets record pronunciation, there are always a few more incorrect but frequent forms such as "инекција" or "ињекција" (must always be "инјекција/injekcija"), "мљеко", etc.
Different Cyrillics
The Cyrillic alphabet, like the Latin one, differs from language to language. Additionally, there are certain letters that share the sound and even the uppercase glyph, for example, but are different in lowercase and/or italics forms. This situation is similar to the Han unification controversy in that the same Unicode code point is used to denote similar but nevertheless different characters. The Unicode consortium's approach was to differentiate letters based on their "meaning" and not appearance, which appropriately results in, for example, two separate codes for Latin letter "A" and Cyrillic letter "А", although they look exactly the same and may very well share the glyph in font files.
Challenges specifically related to Serbian are:
- Regularly available fonts (that come with operating systems) are based on Russian, not Serbian cyrillic. Some have multiple sets of glyphs, but the system of selecting appropriate set is either not available or is "well hidden". Even applications such as "Microsoft Word" do not seem to make use of the language markup of text to perform a proper selection of glyphs and, in case of Cyrillic, only Russian forms appear.
- Lowercase cyrillic letter "б" is somewhat different in Serbian than in Russian (which appears more like a numeral 6), although the difference is not huge (the letter remains recognizable although sometimes confused with lowercase "в" which may have a similar glyph)
- Lowercase Cyrillic letters д, г, п and т have different lowercase italics forms in Russian and Serbian (see table below). The initial reaction of a Serbian reader to the Russian forms of those letters would be that they are, in the same order (note: you'll need a Times-type font to see all the differences, otherwise refer to the table below):
  - д Latin lowercase letter d instead of Cyrillic д
  - г unknown letter resembling a question mark or Latin letter "z", instead of Cyrillic г
  - п either upside-down Cyrillic letter "и" or upside-down Latin letter "u", instead of Cyrillic п
  - т Latin lowercase letter m instead of Cyrillic т
- There are other differences that can be attributed to a particular font style and not to differences in Cyrillics. Uppercase and lowercase letters Д, Л and Љ can either be designed after the shape of letter "П" (rectangular) or letter "А" (triangular, sometimes treated as preferred in Serbian). Whatever the form, they are easily recognizable.
Russian forms can be recognized but not at the first glance and make reading harder and slower. Following table presents major differences:
Serbian uppercase | Image:BDGPT-uppercase-Serbian.png | Note: letter Д is from a slightly different font as its Serbian glyph is not available in Times New Roman |
Russian uppercase | Image:BDGPT-uppercase-russian.png | Stylistic (not necessarily language) differences not affecting readability: letters Д (shown) and Л and Љ (not shown). |
Serbian lowercase | Image:Bdgpt-lowercase-Serbian2.png | Letters б and д are from a very slightly different font as their Serbian glyphs are not available in Times New Roman. Letter б was actually recreated from its correct italics form, for illustrative purposes (its preferred upright form was not found). |
Russian lowercase | Image:Bdgpt-lowercase-russian.png | Letter "б" is slightly different and can sometimes be confused with number 6 or letter "в". Letters д (shown) and л and љ (not shown) shown in different style. Neither affects readability. |
Serbian lowercase italics | Image:Bdgpt-lowercase-italics-Serbian.png | Note: letter б is from a very slightly different font as its Serbian glyph is not available in Times New Roman |
Russian lowercase italics | Image:Bdgpt-lowercase-italics-russian.png | All letters are different. |
Word order
As in many, if not all, inflected languages, the word order (or sentence part order) is not fixed. Consider, for example, the sentence "Anna is talking to Peter". In Serbian it can be expressed as:
- Ана прича Петру - (Anna) (is talking) (to Peter)
- Ана Петру прича - (Anna) (to Peter) (is talking)
- Прича Ана Петру - (Is talking) (Anna) (to Peter)
- Прича Петру Ана - (Is talking) (to Peter) (Anna)
- Петру Ана прича - (To Peter) (Anna) (is talking)
- Петру прича Ана - (To Peter) (is talking) (Anna)
Issues
Articles not found
Keyboards and Computer Newbies
Most Serbian computer users, as in most countries, are "newbies". They have a computer because they use it as a tool for a specific purpose, such as entertainment, internet browsing, telecommunications, etc. They learn how to use those specific features but generally do not know much more. This includes things such as setting the operating system to use domestic keyboards. Yes, plural: one really has to have at least two keyboard mappings, Latin to be able to type web and e-mail addresses, for example, and Cyrillic to type Cyrillic text.
The situation is only worsened by the fact that many (if not most) users actually have US-layout keyboards (as in "standard USA layout" with English letters only printed on the key caps).
This causes a couple of related issues:
- To "arrive" to a web site, such as Serbian Wikipedia, one must use a Latin keyboard, then
- To find a specific Serbian Wikipedia article one must switch the keyboard mapping to Cyrillic and type article's name
In many cases "Latin" keyboard is, in fact "USA Keyboard" and not "Serbian Latin Keyboard". Then, most users being "non-experts" don't know how to switch to (or even configure) additional keyboard layouts and will attempt to type the name of the article in (incorrect) ASCII form without Serbian characters. These are two separate causes why Serbian Wikipedia won't find an article:
- It will not find an article on "Književnost" because the article is named in Cyrillic only - "Књижевност".
- It will not find an article on "Knjizevnost" (notice the use of "z" instead of "ž") even if automatic Latin/Cyrillic conversion is available because of the wrong letter.
Google and possibly other search engines seem to equate all those letters that are "confused with one another". In other words, a search for "koza" will also find articles about "koža". This appears to be an acceptable compromise: although it yields false positive matches, it also allows one to skip switching the keyboard layout.
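The compromise described above amounts to comparing folded "search keys" instead of the text itself. A minimal Python sketch, assuming a hypothetical search_key function (how any real search engine does this is not known here):

```python
CYR = "абвгдђежзијклљмнњопрстћуфхцчџш"
# ASCII key for each Cyrillic letter, in the same order as CYR.
KEY = "a b v g d dj e z z i j k l lj m n nj o p r s t c u f h c c dz s".split()

FOLD = dict(zip(CYR, KEY))
# Serbian Latin letters with diacritics fold to the same ASCII keys.
FOLD.update({"š": "s", "č": "c", "ć": "c", "ž": "z", "đ": "dj"})


def search_key(text: str) -> str:
    """Fold Cyrillic, Latin and ASCII Latin spellings to one lookup key."""
    return "".join(FOLD.get(ch, ch) for ch in text.lower())


if __name__ == "__main__":
    for title in ("Књижевност", "Književnost", "Knjizevnost"):
        print(title, "->", search_key(title))
    # The price of the compromise: distinct words share a key.
    print(search_key("koža") == search_key("koza"))
```

Indexing titles and body text under such keys would let all three spellings reach the same article without any redirects, at the cost of the false positives already mentioned.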
Creating redirect articles can address this issue but at high cost - all articles would require many more redirects (especially when coupled with dialects) than, say English Wikipedia, causing creation and maintenance nightmare. And it does not make the article content searchable.
Without a proper solution, most of the Serbian-speaking population cannot really access or use Wikipedia, simply because they can't get to the article they need.
People who are learning Serbian language or live abroad may have reduced access to Cyrillic (keyboards, fonts and own knowledge) and would appreciate latinic access to Serbian Wikipedia. Although their case is different, the outcome and the issues are the same as mentioned above.
Dialect clash
A user that has solved the keyboard issues may come to Serbian Wikipedia and want to learn more about milk. Since he speaks Ijekavian dialect, he types "млијеко" in the search field. The article will not be found because its name only exists in Ekavian dialect - "млеко".
Again, although this appears to be solvable by redirection articles it is not really the proper solution because of the great number of such redirects required and it does not address article body text search.
In many cases one can attempt to "change the dialect" of the search and type "ekavian" form instead of "ijekavian", for example. But it is not natural, may not be remembered, and the user may not know how to do it properly (may omit some changes or make them where they are not supposed to be done).
To spice the issue further, Wikipedia is open - anyone can write its articles. This means that there are articles in various dialects, which means that there is no one "standard" dialect.
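One way to picture a dialect-aware title lookup is a table of known variant forms rather than a blind letter rule. The Python sketch below is only an illustration: the table EKAVIAN_FORM, the toy ARTICLES index and the function names are all hypothetical, and choosing the Ekavian form as the key is an arbitrary convenience, not a proposal for a "standard" dialect.

```python
# Hypothetical, hand-maintained table: known variant -> key form.
EKAVIAN_FORM = {"млијеко": "млеко", "млико": "млеко"}

# Toy index standing in for article titles.
ARTICLES = {"млеко": "article about milk"}


def title_key(word: str) -> str:
    """Map a known dialect variant to the form used as the lookup key."""
    word = word.lower()
    return EKAVIAN_FORM.get(word, word)


def find_article(query: str):
    return ARTICLES.get(title_key(query))


def naive_to_ekavian(word: str) -> str:
    """Blind rule, shown only to illustrate why it is unsafe."""
    return word.replace("ије", "е")


if __name__ == "__main__":
    print(find_article("млијеко"))            # found through the table
    print(naive_to_ekavian("оријентација"))   # damaged: not a dialect form
```

The blind rule damages words such as "оријентација" that merely contain the same letters, which is why the sketch uses a table, and why such a table would have to be large.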
Web Search Engines
Related to this is an issue of web search engines indexing Serbian Wikipedia content. In Google, for example, users would typically type (ASCII) Latin text they want to find, not the primary Serbian alphabet - Cyrillic. Since Serbian Wikipedia content is in Cyrillic, no Wikipedia articles will be found.
Same as with alphabet choice, cross-dialect search in Google will not report Wikipedia articles.
Word Forms
In many languages (and Serbian as well) the form of the word changes based on many factors through declension, conjugation, etc. English words change very little, if at all, when compared to Serbian. Serbian, for example, has 7 cases causing words to change while English essentially has two with the same effect (Nominative and genitive; not including sentence forms). Furthermore, in English these changes are relatively simple - e.g. appending "'s" to the end of the word. In Serbian this is (usually) more complex and requires the root of the word to be known and still "not safe" - it may need to change itself depending on what is to be "appended" to it.
One of rather interesting examples is the question of how many bags one-inside-another can you describe inside a single sentence by using only a single word in any/all of its forms but without repeating them. Consider the following:
- kesinoj kesi kesine kese kesina kesa
English translation would roughly be:
- of bag's bag, and its bag's bag, bag's bag
Searching for and finding the English sentence by typing "bag" in the search field is easy: essentially, the word appears six times, sometimes just with "'s" after it. Searching for a text that mentions the word "kesa" but not in its nominative form (say we replaced the last word, "kesa", with the Serbian word for "handle") would not find such an article, although the word appears in it five other times (just in the "wrong" case). Notice that, in this example, the other cases of "kesa" don't even begin with "kesa": only the first three letters are common.
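The search failure described above can be reproduced with a few lines of Python. The helper names are illustrative assumptions, and the prefix search is shown only as a crude contrast, not as a recommended stemming method.

```python
import re

# The example sentence with its last word (nominative "kesa") left out.
SENTENCE = "kesinoj kesi kesine kese kesina"


def exact_hits(text: str, word: str):
    """Whole-word search, as a plain title or text search would do."""
    return re.findall(rf"\b{re.escape(word)}\b", text)


def prefix_hits(text: str, stem: str):
    """Crude alternative: any word starting with the given stem."""
    return re.findall(rf"\b{re.escape(stem)}\w*", text)


if __name__ == "__main__":
    print(exact_hits(SENTENCE, "kesa"))    # nothing found
    print(prefix_hits(SENTENCE, "kes"))    # all five inflected forms
```

A prefix search only works when the user knows the common stem and the stem itself does not change, and it also matches unrelated words that happen to share those letters.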
While this does not generally introduce (m)any issues in article titles, full-text search capability is greatly affected. This, like many other challenges presented here, is not a speciality of Serbian language but is common to many.
Word Order
This is very unlikely to represent a problem, but is here for completeness. One may attempt to find an article by name, but use a different order of words (as order of words may not be important). The reason why this is a lesser issue is that there is (unofficially) such a thing as "naturally preferred" order as felt by all speakers and is "subject predicate object" (vs. any other sentence order).
Readability
Alphabet Availability
Serbian Wikipedia is written in primary Serbian alphabet - Cyrillic. Showing Cyrillic on screen requires the presence of appropriate fonts, browsers and operating systems that can show it. Generally speaking this is not at all a problem today - even those not interested in having Cyrillic at all can see it.
However, those learning Serbian or living abroad may not be familiar with Cyrillic and may prefer reading articles in Latin script. Although they may understand the language and know Serbian Latin, they would not be able to read (or find) Serbian Wikipedia content without automated transliteration.
Font (Typographical) Correctness
Unicode solves everything. Right? Wrong. Unicode at least has separate code points for separate letters. Right? Wrong again. While Unicode does a better encoding job than anything previously available, it is not sufficient. For example, various Kanji characters of the same code point have somewhat different renditions in Japanese than in Chinese languages. Turkish has both "dotted" and "dotless" letters "i" (in both upper and lowercase form), but Unicode added code points only for uppercase dotted and lowercase dotless "i" and reuses the ordinary Latin "I" and "i" for the other two, causing lettercase conversion issues (all other languages convert between dotless uppercase and dotted lowercase, whereas in Turkish each has its own counterpart).
It is simply expected from computer users to have their own localized operating systems and fonts installed. But this assumption (or at least "hope") is wrong. The fact of modern life is living and reading multiple languages at once and not having separate computer for each.
Serbian language, like those mentioned above, has some specifics:
- Latin letter "Đ" (Unicode name "D with stroke") looks exactly like "Ð" (Unicode name "ETH") but their lowercase forms are different - "đ" (Serbian) and "ð" (lowercase eth). These two independent letters are sometimes confused. Searching for a wrong one will cause article to not be found.
- Cyrillic letter "Д" has significantly different lowercase italics form than commonly available (even in Serbia) Russian form "д". See #Different Cyrillics for details.
- Cyrillic letter "Г" has significantly different lowercase italics form than commonly available (even in Serbia) Russian form "г". See #Different Cyrillics for details.
- Cyrillic letter "Т" has significantly different lowercase italics form than commonly available (even in Serbia) Russian form "т". See #Different Cyrillics for details.
- Cyrillic letter "П" has a different lowercase italics form than commonly available (even in Serbia) Russian form "п". See #Different Cyrillics for details.
- Cyrillic letter "Б" has a somewhat (although not critically) different lowercase form than commonly available (even in Serbia) Russian form "б". It can, however, be sometimes confused with lowercase "в". See #Different Cyrillics for details.
Those Cyrillic letters share the same code points as Russian Cyrillic but are typographically different in italics form. Adobe fonts have both forms within them, but the appropriate glyph has to be chosen based on the enabled language, a system presently not widely available.
Reading text with the wrong glyphs, especially in the case of letters г and т, essentially becomes a "fill in the blanks" guessing game and is, thus, difficult and slow.
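The confusion between "Đ" and "Ð" mentioned in the list above is easy to check programmatically, because the two letters have different code points even when they look the same. A small Python sketch using the standard unicodedata module (the describe helper is an illustrative name):

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    for ch in "ĐÐđð":
        print(ch, describe(ch))
    # Same appearance in upper case, but different letters:
    print("Đ" == "Ð", "Đ".lower(), "Ð".lower())
```

A search or input form could use such a check to warn about, or silently correct, an eth typed in otherwise Serbian text; whether that is desirable is a separate policy question.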
Different dialect
As stated before, there is an issue of a reader speaking a different dialect than used in an article. Since, in Serbian, writing records speech, it also records differences in dialects. While English writing is, to a large degree, immune to dialects or different pronunciations, and written text generally can not indicate a dialect, Serbian exposes the variety of its dialects in written text as well.
Readability and understandability of text is not affected, as Serbian dialects are completely mutually intelligible. Reading a text not of one's own dialect is analogous to listening to a pre-recorded speech in another dialect - for example, a British person listening to a speech made by an American. The distinction is fully perceived but, generally speaking, does not reduce the "understandability" of the text.
Automatic conversion between dialects is possible but not at all trivial and may require ever-expanding and very complex dictionaries. Speakers of one dialect, naturally, prefer it over another. But there is a different issue here. Contributors also speak various dialects, so any single Wikipedia article may end up being composed of multiple sections written in different dialects. No one wants to impose a single standard (e.g. the dialect of the majority), since having all these dialects is welcome and nice. Yet, reading an article with alternating sentences in alternating dialects may feel like listening to multiple different people reading it, each in a different dialect.
Therefore, automated system of conversion or "consistent representation" would be welcome, although probably not required.
Foreign-Language, Strict-Dialect and/or Strict-Script Text
Articles are predominantly written in one language, but they nevertheless frequently include words, registered or trademarked names, quotes or similar in another. Knowing that a part of the text is not in the "host" language is necessary because otherwise that text may:
- be confused with primary-language word(s) having a different meaning
- be read with primary-language pronunciation rules (e.g. "Microsoft" would be pronounced approx. "Mitsrosoft" if read as if it were Serbian Latin)
- be automatically transliterated when it should not be (e.g. "Microsoft" to "Мицрософт")
- interfere with possible future use of spelling, grammar and related text checkers
Similar to this is the issue of intentionally using a dialect (of the primary language) different than the one in the rest of the article - for example when quoting a speaker of that dialect. If/when any dialect normalization/conversion is to occur, it must avoid such text.
A third form of the same problem is the one related to the script. While most of the article can be in one or either script, parts of it may need to always stay non-transliterated - for example in those articles comparing multiple scripts.
Sort Order
Sort order is different for Cyrillic and Latin alphabets. Cyrillic rules are simple and straightforward - analogous to English alphabet sorting rules - only the letters and order are different. However, there are complex sort rules for Latin alphabet digraphs - for example word "njuška" comes after "nos" because "nj" comes after "n". Ambiguity of Serbian Latin alphabet hits here again - "injekcija" comes a lot before "inje" because "nj" in "injekcija" is actually two letters, not one as in "inje".
Regardless of deciding what comes first, when viewing category listings in either alphabet, the viewer is expecting to see them sorted according to that alphabet's sort order. Should parts of the articles need to be alphabetically sorted, this issue would apply to them as well.
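The Latin sort rules described in this section can be sketched in Python as a sort key that first splits a word into letters (treating digraphs as one letter) and then ranks each letter by its position in the Latin alphabet. The exception table SPLIT_STEMS is a hypothetical, one-entry stand-in for the dictionary such a sort would really need.

```python
ALPHABET = ("a b c č ć d dž đ e f g h i j k l lj m n nj "
            "o p r s š t u v z ž").split()
RANK = {letter: i for i, letter in enumerate(ALPHABET)}

# Hypothetical exceptions: stems where "nj" is two letters ("|" splits it).
SPLIT_STEMS = {"injekcij": "in|jekcij"}


def letters(word: str):
    """Split a Serbian Latin word into letters, digraphs counted as one."""
    word = word.lower()
    for stem, marked in SPLIT_STEMS.items():
        word = word.replace(stem, marked)
    out, i = [], 0
    while i < len(word):
        if word[i] == "|":
            i += 1
        elif word[i:i + 2] in ("dž", "lj", "nj"):
            out.append(word[i:i + 2])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out


def sort_key(word: str):
    # Characters outside the alphabet sort after all letters.
    return [RANK.get(letter, len(ALPHABET)) for letter in letters(word)]


if __name__ == "__main__":
    words = ["njuška", "nos", "inje", "injekcija"]
    print(sorted(words))                 # plain code-point order
    print(sorted(words, key=sort_key))   # Serbian Latin order
```

Plain code-point sorting places "njuška" before "nos" and "inje" before "injekcija"; the key above reverses both pairs, as the rules in this section require.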
Contributing
Keyboards and Alphabets
Contributors may be expected to be, on average, more advanced computer users than average Wikipedia readers. They may be expected to be able to configure, switch and use keyboards at any moment. But, such an expectation does reduce the number of potential minor contributors.
Ability to enter articles in either available proper alphabet (not plain ASCII) would likely be welcome by many, such as those who live abroad or are learning Serbian and for a simple reason of reducing the number of otherwise frequently required keyboard layout switches.
A big issue with this is the lossy transliteration from Cyrillic to Latin, if it ever needs to be performed. Should one edit an article that has been auto-transliterated from Cyrillic to Latin, submitting it back with changes may cause untouched parts of the article to appear to have changed (e.g. нј → nj → њ, as in инјекција → injekcija → ињекција) or cause incorrect automatic transliteration of the new text (see the second half of the same example). A dictionary of exceptions would not only have to exist but would have to be rather complex to support all word forms and appearances in compound words. For example:
Singular declension:
- инјекција
- инјекције
- инјекцији
- инјекцију
- инјекцијо
- инјекцији
- инјекцијом
Plural declension:
- инјекције
- инјекција
- инјекцијама
- инјекције
- инјекције
- инјекцијама
- инјекцијама
Example of compounding can be adding prefix "ко" (used to indicate a second element of a pair) to any of the above forms.
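The round-trip problem and the stem-based exception can be checked against the forms listed above with a short Python sketch. The function names and the single exception stem "in|jekcij" (with "|" marking the split) are assumptions made for illustration; only lower-case words are handled.

```python
CYR = "абвгдђежзијклљмнњопрстћуфхцчџш"
LAT = "a b v g d đ e ž z i j k l lj m n nj o p r s t ć u f h c č dž š".split()
TO_LAT = dict(zip(CYR, LAT))
TO_CYR = {lat: cyr for cyr, lat in TO_LAT.items()}
DIGRAPHS = ("lj", "nj", "dž")


def to_lat(word: str) -> str:
    return "".join(TO_LAT.get(ch, ch) for ch in word)


def to_cyr(word: str, split_stems=()) -> str:
    # Each stem is written with "|" between letters that must stay apart.
    for stem in split_stems:
        word = word.replace(stem.replace("|", ""), stem)
    out, i = [], 0
    while i < len(word):
        if word[i] == "|":
            i += 1
        elif word[i:i + 2] in DIGRAPHS:
            out.append(TO_CYR[word[i:i + 2]])
            i += 2
        else:
            out.append(TO_CYR.get(word[i], word[i]))
            i += 1
    return "".join(out)


# Distinct declined forms from the lists above, plus the "ко" compounds.
FORMS = ["инјекција", "инјекције", "инјекцији", "инјекцију",
         "инјекцијо", "инјекцијом", "инјекцијама"]
ALL_FORMS = FORMS + ["ко" + form for form in FORMS]

naive_ok = [w for w in ALL_FORMS if to_cyr(to_lat(w)) == w]
fixed_ok = [w for w in ALL_FORMS if to_cyr(to_lat(w), ["in|jekcij"]) == w]

if __name__ == "__main__":
    print(len(naive_ok), "of", len(ALL_FORMS), "survive a naive round trip")
    print(len(fixed_ok), "of", len(ALL_FORMS), "survive with the stem rule")
```

In this sketch one stem entry covers every listed form and its compound, but a stem that changes between forms would need several entries, which is part of what makes a real exception dictionary complex.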
In any case, it is considered improper and undesirable to mix scripts (except for inclusion of foreign-language text). Any searching or indexing should not be affected by the script, yet it currently is (if something is written in Cyrillic, it can only be found in Cyrillic; if something is written in Latin, it can only be found in Latin).
Dialects
Contributors naturally type in their own dialect when writing articles. This should not change, especially because Wikipedia is not attempting to impose any one standard over the others. However, unlike with books (or at least their chapters), which may typically be written by a single author and, thus, be consistent in the use of a specific dialect, the situation with Wikipedia is quite different. Many people of different dialects contribute to the same article. Leaving it as is results in a problem already mentioned in both the "Articles not found" and "Readability" sections.
Suggestions: What is Needed
Markup
The first and major thing that needs to be done is to come up with a standard markup that will allow later development of other features needed. Automatic conversion of any kind (e.g. alphabet, dialect or both), indexing and searching is severely challenged without information currently unavailable. Those additional markup standards would likely be very useful for many languages including English and would also help future Wikipedia Machine Translation Project.
Additional markup standards are needed as quickly as possible in order to have existing articles updated sooner and new articles written correctly from the beginning.
What is needed, essentially immediately, is:
Language, dialect and script markup
In HTML, the language can be specified via the lang attribute. There are no provisions to specify the dialect or the script, as HTML primarily does not "care" about this. Wikipedia has so far relatively successfully avoided HTML syntax. Having it simplified here as well, to provide simple markings of even single words, would be welcome. This is needed to control the font used (see typographical issues mentioned earlier), transliteration, translation, etc. It is also needed for automated text readers or pronunciation generators.
Although it seems that the script can be automatically identified based on the segments of Unicode code space used, this is not necessarily true. Some scripts, or at the very least writing standards, may overlap. For example, it is possible to write Serbian Latin digraphs as two separate letters (nj, lj, dž) or by using the specially provided Unicode code points for those digraphs. Since all other letters share numeric codes, it would be hard for a transliterator to identify which standard has been used if no special Unicode digraphs are present: is "nj" in "injekcija" really two separate letters, or is it just that the typist could not use special digraphs? A similar situation may be present in other languages as well, albeit in different forms - e.g. those that have a variety of different Latin transliteration standards.
Pronunciation descriptions, translations and similar
In English Wikipedia, the International Phonetic Alphabet (IPA) is typically used to represent pronunciations. While entirely correct, it is in many cases overkill in Serbian-language articles. Although they can't cover all international phonemes, Serbian Cyrillic and Latin are typically used to describe how (foreign) words are pronounced. IPA can be used only when ultimate precision is needed. In that regard, however, the preferred approach is automatic linking to the appropriate language Wiktionary, which may then include much more and better maintained information. In any case, such text should accompany marked foreign-language text and may be presented in various ways, including, for example, in a tooltip, having minimal adverse effects on the article. Such tooltips may contain not only the pronunciation but also translations that would otherwise be secondary to the article (such as the frequent use of French "Voilà!" or "C'est la vie" in English text).
Strict-alphabet text
Any future automatic transliterator must be able to detect what it should or should not transliterate. Note that this applies to presentation only and not to indexing/searching - one should still be able to find that text by using either alphabet.
Strict-dialect text
Any future automatic dialect converter must be able to detect what it should or should not convert. Note that this applies to presentation only and not to indexing/searching - one should still be able to find that text by using either dialect. This is generally used only in quotes (citations), but also in articles comparing dialects, for example.
Combination/other strict-form text
A combination of both strict alphabet and strict dialect may be required in places, for example when recording exactly what was printed (not spoken).
Markup within other markup
It is important to make this markup also function within other markup, for example, specifying only a part of the link name as foreign text.
Font (Typographical) Correctness
This is really not an issue of Wikipedia itself as much as it is of everything else (operating systems, browsers, etc.). However, there are ways to help. One way is already mentioned: markup. It provides sufficient information for automatic selection of glyphs when, and if, that selection is possible. We have to follow HTML, CSS and browser developments and make sure we start to use any relevant features made available.
"Normal" users can currently solve the problem in two ways. One is to replace (or modify) their existing system fonts (containing the Russian form of Cyrillic) with those containing Serbian glyphs. In this case they lose the ability to see Russian text rendered correctly.
The other way is to use an entirely different font altogether, for example "Serbian Times New Roman" instead of "Times New Roman". Wikipedia styles can be updated to list (prefer) those fonts specific to the marked-up language of any part of the article. Granted, this will only have an effect if visitors actually have those fonts installed; otherwise rendering falls back to the standard font (possibly through other alternatives in between). However, until operating systems, browsers and fonts start fully supporting language-specific differences between glyphs (as Adobe OpenType fonts and systems do), there does not seem to be a better solution.
Searchability
There are many issues causing articles not to be found by either title or content, generally coming down to issues of multiple alphabets, dialects and word forms.
Quick-hack: Front-End Resolution
A quick hack has been suggested: whatever the user types in the search field is pre-processed and automatically translated into a more complex query matching all possible forms.
Take "Microsoft Windows" as an example. If one types it into the search field, the engine should really try to find:
- "Microsoft Windows" (original) OR
- "Мицрософт Windows" (with those words transliterated that can be transliterated)
This, of course, does not cover dialects and word forms. There are no dialect differences here, but without a dictionary the engine might assume that this is "IKAVIAN", so it could also generate the following (in addition to the above):
- "Mecrosoft Windows" (ekavian Latin) OR
- "Mijecrosoft Windows" (ijekavian Latin) OR
- "Мецрософт Windows" (ekavian Cyrillic) OR
- "Мијецрософт Windows" (ijekavian Cyrillic)
To cover word forms it also needs to expand all of them, not knowing the type of the word (noun, verb, etc.). Just assuming that "Microsoft" is a noun and that we need the singular genitive form of it for the (assumed) masculine "Windows", we would also have to add all forms of "Microsoft's Windows" (on top of all those mentioned above):
- "Microsoftov Windows" (original) OR
- "Мицрософтов Windows" (with those words transliterated that can be transliterated) OR
- "Mecrosoftov Windows" (ekavian Latin) OR
- "Mijecrosoftov Windows" (ijekavian Latin) OR
- "Мецрософтов Windows" (ekavian Cyrillic) OR
- "Мијецрософтов Windows" (ijekavian Cyrillic)
... this sufficiently represents the issue without attempting to add plural, other case, other gender or other word type forms - there would be an absolute explosion of forms the search engine needs to search for, which would put significantly greater stress on that component.
Another, less visible but nevertheless important issue is that this system ties the search to a particular language, in this example Serbian. Should one want to search for, say, Russian text appearing within Serbian Wikipedia, the system would fail miserably, treating the Russian query as if it were Serbian and attempting to find all the wrong forms.
Better: Improved Indexing
Instead of dealing with the current "false negatives" (inability to match actually correct articles), an approach that yields some "false positives" (potentially matching even incorrect articles, but finding all correct ones) is preferred. One of the available and common ways of indexing and searching text of multi-alphabet, multi-dialect and/or inflected languages is not to index or search for words from the text directly, but to use variable-length "hash words" instead (unlike conventional fixed-size hash codes).
In a way this is already done to an extent - letter case is said to be ignored by virtue of, actually, indexing and searching only lowercase or uppercase forms. The hash function in this case is the one that calculates either the lowercase or the uppercase form of the input.
We need to extend the function to perform some more transformation (for purposes of normalization) of text. For example:
- Replacing all occurrences of possibly dialect-specific parts of words (such as e, ije, je, i, ...) with a "normalized part", whether we are sure of it or not. For example:
- levi → l#vi
- lijevi → l#vi
- orijentacija → or#ntacija (incorrect)
- Note that this is not as simple as it looks above. Sometimes there are effects on surrounding (preceding) letters, whether correct or not. For example, many people do, in fact, say "ињекција" instead of "инјекција" and would write (search for it) that way. These side-effects must be accounted for.
- Replacing all letters that would lose information when transliterated with their "already lost" form. For example: њ→нј, љ→лј, џ→дз, ђ→дј, ч→ц, ћ→ц, ш→с, ж→з, etc.
- Replacing all letters of secondary alphabet with equivalent letters in the primary alphabet.
- Recognizing and trimming off any common endings (e.g. due to declension or conjugation) to end up with something that may more frequently serve as a root of the word (this would really be done better with a dictionary).
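The normalization steps listed above could be sketched roughly as follows. This is a minimal illustration in Python, not a proposed implementation. The name hash_word and the exact rules are assumptions: the dialect rule only replaces "ије", "је" and "е" (a lone "и" is left alone, so ikavian forms are not covered), side effects on preceding letters are ignored, and trimming of endings is omitted because it really needs a dictionary.

```python
import re

# Latin letters with diacritics, folded to their "already lost" ASCII forms.
LATIN_FOLD = str.maketrans({"đ": "dj", "ž": "z", "š": "s", "č": "c", "ć": "c"})

# Secondary (Latin) alphabet mapped letter by letter onto the primary (Cyrillic) one.
LATIN_TO_CYRILLIC = str.maketrans(
    "abcdefghijklmnoprstuvz",
    "абцдефгхијклмнопрстувз",
)

# Cyrillic letters that carry more information than plain Latin can keep.
CYRILLIC_FOLD = str.maketrans({
    "њ": "нј", "љ": "лј", "џ": "дз", "ђ": "дј",
    "ч": "ц", "ћ": "ц", "ш": "с", "ж": "з",
})

# Possibly dialect-specific parts, longest alternative first.
DIALECT_PART = re.compile("ије|је|е")

def hash_word(word):
    """Return a variable-length "hash word" for a Serbian word in either alphabet."""
    w = word.lower()                                        # ignore letter case
    w = w.translate(LATIN_FOLD).translate(LATIN_TO_CYRILLIC)  # Latin -> lossy Cyrillic
    w = w.translate(CYRILLIC_FOLD)                          # drop information Latin lacks
    return DIALECT_PART.sub("#", w)                         # normalize dialect parts

for w in ["levi", "лијеви", "orijentacija", "инјекција", "ињекција", "Njuška", "њушка"]:
    print(w, "->", hash_word(w))
```

The same function would be applied both to indexed text and to queries, so that ekavian and ijekavian forms, and Latin and Cyrillic spellings, end up under the same key. False positives such as the "orijentacija" case are accepted by design.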
Note that this system is good whatever the combination of language of the encyclopedia, particular article and/or its part or query. One can safely attempt to search for Russian text in Serbian Wikipedia, for example, and be sure to have the same chances of finding the article as if the search is done on Russian Wikipedia. This, of course, requires joint efforts, development and maintenance of Wikipedias of all languages - improvement of searching for any language should be added to all Wikipedias.
Depending on the details of Wikipedia's software architecture (e.g. whether it uses a third-party or its own indexing and search engine), a proper solution can be reached by adding a new common component, the "word hash function". Existing systems usually do not share the normalization routine since it is simple; therefore, one has to be introduced.
Such hash function would, of course, be different for different languages. The trick is to not tie a single Wikipedia to a single hash function - it should depend on the language of the text, not of the entire encyclopaedia, since it may contain snippets of text (and even article titles) in other languages (such as registered trademark names, for example).
If Wikipedia uses a third party full-text indexing and search engine then the same effect can be achieved by "wrapping" it into a layer that instead of giving it actual article text provides transformed version of it (with hash words instead of actual words). Symmetrically, all queries should also be transformed the same way before being passed to the search engine.
Article Titles in Relational Database: Automatic Aliases
Assuming article titles are searched in a relational database and not in a full-text index, a somewhat different system is required (if the underlying database system cannot itself be improved).
Article (title) aliases can be automatically created, updated and deleted whenever the article changes its name. Aliases would be tied to articles differently than standard redirect articles - they would not be articles themselves - just hidden database entries. No human should have the need to see them or manipulate them. They are automatically generated and "attached" to an article (even redirect articles). When that article is renamed, existing set of aliases is automatically removed and replaced with a new one. If the article is deleted, the aliases should be deleted as well.
Real-time transliteration for display
Having all articles in Cyrillic as primary alphabet eases this as transliteration to Latin is straightforward (Cyrillic actually contains more information than needed for Latin) and can be achieved through a simple lookup table:
Cyrillic | Latin | ASCII | Cyrillic | Latin | ASCII
---|---|---|---|---|---
А а | A a | A a | Н н | N n | N n
Б б | B b | B b | Њ њ | Nj nj | Nj nj
В в | V v | V v | О о | O o | O o
Г г | G g | G g | П п | P p | P p
Д д | D d | D d | Р р | R r | R r
Ђ ђ | Đ đ | Dj dj | С с | S s | S s
Е е | E e | E e | Т т | T t | T t
Ж ж | Ž ž | Z z | Ћ ћ | Ć ć | C c
З з | Z z | Z z | У у | U u | U u
И и | I i | I i | Ф ф | F f | F f
Ј ј | J j | J j | Х х | H h | H h
К к | K k | K k | Ц ц | C c | C c
Л л | L l | L l | Ч ч | Č č | C c
Љ љ | Lj lj | Lj lj | Џ џ | Dž dž | Dz dz
М м | M m | M m | Ш ш | Š š | S s
Note: The ASCII column(s) are provided, although ASCII is not an official alphabet and its use is considered bad practice, to show how Unicode decomposition (ignoring marks) works out and how to present the text to those who only have standard ASCII available.
Letter-case conversion of Latin digraphs intentionally always uses a lowercase right (second) character, regardless of context, to facilitate less lossy transliteration, especially for acronyms. That means that:
- "лепљиво" will be transliterated as "lepljivo",
- "Љубовија" will be transliterated as "Ljubovija",
- "ЉУБОВИЈА" will be transliterated as "LjUBOVIJA", and
- "ФКЉ" (football club Ljubovija) as "FKLj" (to avoid confusion with FKLJ, which may be an acronym of four, not three words)
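The lookup table and the digraph case rule above can be sketched in a few lines of Python. The names TO_LATIN and to_latin are chosen for this illustration only; anything not in the table (foreign letters, digits, punctuation) is passed through unchanged.

```python
# Serbian Cyrillic alphabet and the corresponding Latin letters, in Cyrillic order.
CYRILLIC = "абвгдђежзијклљмнњопрстћуфхцчџш"
LATIN = ["a", "b", "v", "g", "d", "đ", "e", "ž", "z", "i", "j", "k", "l", "lj", "m",
         "n", "nj", "o", "p", "r", "s", "t", "ć", "u", "f", "h", "c", "č", "dž", "š"]

TO_LATIN = {}
for cyr, lat in zip(CYRILLIC, LATIN):
    TO_LATIN[cyr] = lat
    # capitalize() uppercases only the first character, so an uppercase
    # digraph letter always becomes "Lj", "Nj" or "Dž", never "LJ", "NJ" or "DŽ".
    TO_LATIN[cyr.upper()] = lat.capitalize()

def to_latin(text):
    """Transliterate Serbian Cyrillic to Latin, leaving other characters as they are."""
    return "".join(TO_LATIN.get(ch, ch) for ch in text)

for word in ["лепљиво", "Љубовија", "ЉУБОВИЈА", "ФКЉ"]:
    print(word, "->", to_latin(word))
```

Because the conversion is a plain per-character lookup, it is cheap enough to run at display time.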
Transliteration of contributed changes
When (if) Serbian Wikipedia allows either alphabet to be used not only for reading but also for contributing to articles, we must make sure that the entire round trip of article text is lossless, to avoid compounding of errors through a series of contributions.
Assume the following:
- All articles use Cyrillic as their base (storage) alphabet (except when strictly specifying non-Cyrillic segments such as foreign text or alphabet comparison articles).
- Visitors can read those articles in either Cyrillic or Latin text. Latin text will be presented by virtue of automatic transliteration.
- While reading an article transliterated to the Latin alphabet, a visitor decides to contribute to the article and hits the "Edit" button.
If this takes them to editing the Cyrillic original then there are no transliteration issues to talk about, but this reduces the number of potential contributors to those who know Cyrillic and know how to configure their operating systems and change keyboard layouts on the fly. Otherwise they will need to edit the transliterated form of the article.
They will make their changes and save them. What happens next is very important and may or may not cause loss of information depending on how it is done. Consider a hypothetical article containing the sentence:
Инјекција је математички појам.
- (English: Injection is a mathematical term)
The contributor notices that the medical meaning of the word "injection" is omitted and wants to change the sentence to say "Injection is both a mathematical and a medical term."
Scenario #1 (bad)
- An editing page containing the automatically transliterated article (in the Latin alphabet) is presented to the user. The sentence reads:
Injekcija je matematički pojam.
- User changes it to:
Injekcija je i matematički i medicinski pojam.
             ^^            ^^^^^^^^^^^^^
- User saves it back. The transliterator now transliterates the whole thing back to Cyrillic for storage:
Ињекција је и математички и медицински појам.
- Notice the first word of the sentence. The word "Инјекција" became "Ињекција" - the two letters "нј" became a single "њ" because of the Latin alphabet's loss of information. Also notice that this loss/error occurred although the contributor did not even touch that part of the sentence.
Scenario #2 (better)
- An editing page containing the automatically transliterated article (in the Latin alphabet) is presented to the user. The sentence reads:
Injekcija je matematički pojam.
- User changes it to:
Injekcija je i matematički i medicinski pojam.
             ^^            ^^^^^^^^^^^^^
- User submits it back. The back-end now re-transliterates the original article to Latin again (or uses a cached copy if available):
Инјекција је математички појам.
↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
Injekcija je matematički pojam.
- Then it compares the submitted chunk against the original chunk and finds that "i" was inserted just before "matematički" and "i medicinski" was inserted after it.
Injekcija je ··matematički ·············pojam.
`----------------------v---------------------'
                 compare with
,----------------------^---------------------\
Injekcija je i matematički i medicinski pojam.
=============!!============!!!!!!!!!!!!!======
            [i ]          [i medicinski]
- It then back-transliterates (from Latin to Cyrillic) only those parts of the text that were changed (inserted) and applies the transliterated changes back to the original (Cyrillic) text:
                      [i ]          [i medicinski]
translit.              |             |
                       v             v
                      [и ]          [и медицински]
apply:                 |             |
                       v             v
          Инјекција је ··математички ·············појам.
result:   Инјекција је и математички и медицински појам.
- The article will now correctly read:
Инјекција је и математички и медицински појам.
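A rough sketch of this "differential transliteration" in Python, using the standard difflib module, is given here for illustration only. It makes several simplifying assumptions: the comparison is done on whitespace-delimited tokens rather than on individual characters, changed tokens are back-transliterated with a naive greedy rule (no dictionary), and strict-script markup is not handled. All function names are made up for the example.

```python
import difflib
import re

CYRILLIC = "абвгдђежзијклљмнњопрстћуфхцчџш"
LATIN = ["a", "b", "v", "g", "d", "đ", "e", "ž", "z", "i", "j", "k", "l", "lj", "m",
         "n", "nj", "o", "p", "r", "s", "t", "ć", "u", "f", "h", "c", "č", "dž", "š"]

TO_LATIN = {}
for cyr, lat in zip(CYRILLIC, LATIN):
    TO_LATIN[cyr] = lat
    TO_LATIN[cyr.upper()] = lat.capitalize()
TO_CYRILLIC = {lat: cyr for cyr, lat in TO_LATIN.items()}

def to_latin(text):
    return "".join(TO_LATIN.get(ch, ch) for ch in text)

def to_cyrillic_naive(text):
    """Greedy Latin-to-Cyrillic: lj, nj and dž are always read as digraphs."""
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in TO_CYRILLIC:
            out.append(TO_CYRILLIC[pair])
            i += 2
        else:
            out.append(TO_CYRILLIC.get(text[i], text[i]))
            i += 1
    return "".join(out)

def tokens(text):
    """Split into alternating runs of whitespace and non-whitespace."""
    return re.findall(r"\s+|\S+", text)

def apply_latin_edit(stored_cyrillic, edited_latin):
    """Merge a Latin-script edit into the stored Cyrillic text, token by token."""
    original = tokens(stored_cyrillic)
    shown = [to_latin(tok) for tok in original]   # what the editor was shown
    edited = tokens(edited_latin)
    matcher = difflib.SequenceMatcher(None, shown, edited, autojunk=False)
    result = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal":
            result.extend(original[a1:a2])        # untouched: keep the stored Cyrillic
        else:
            result.extend(to_cyrillic_naive(tok) for tok in edited[b1:b2])
    return "".join(result)

stored = "Инјекција је математички појам."
edited = "Injekcija je i matematički i medicinski pojam."
print(to_cyrillic_naive(edited))         # Scenario #1: the whole text is re-transliterated
print(apply_latin_edit(stored, edited))  # Scenario #2: only the changes are transliterated
```

Working on tokens avoids having to map character offsets between the two alphabets (one Cyrillic letter may correspond to two Latin ones), at the cost of re-transliterating a whole word whenever any part of it is edited.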
Note that in this process "strict script" markup gains importance. For example, some text may be intended to be in Latin, such as links to articles such as (English) "Unix operating systems", which needs to retain Unix in Latin (as registered name) and transliterate the rest: "Unix оперативни системи".
Something easy to start with
The proper solution requires the use of dictionaries (for new content) and relatively complicated discovery of changes, their transliteration and re-application of the transliterated changes to the original article. Should we agree to develop such a solution, it will take a comparatively long time to do so.
Another, not so nice but nevertheless extremely useful solution exists. Reminder: the only transliteration problem we need to solve is from Latin to Cyrillic and specifically concerns the Latin digraphs lj, nj and dž. When they appear in Latin text it is ambiguous whether they are actually digraphs (single letters composed of two parts) or two separate letters (l followed by j, n followed by j, or d followed by ž).
We can introduce a hinting mechanism that tells the transliterator how to resolve ambiguities. It has to be unobtrusive, simple to use and usable within any part of the article (e.g. within links). There are, generally speaking, the following ways to do this:
- Require the use of the special Unicode digraph code points for digraphs, and separate letters otherwise. Although this would be a good solution in the long run, for now it causes more problems than it solves, as those digraphs are not widely available in fonts, and people are unaware of them and would not use them. It is highly questionable whether those digraphs will ever be used by actual users.
- Use markup. For example, we could either provide markup that tells the transliterator that two letters are supposed to be treated as a digraph or, conversely, that two letters are supposed to be treated as separate. This reduces the readability of text and should only be used when editing. Attempts should also be made to reduce the necessity (occurrence) and size of this markup to a minimum, as it will appear in the middle of the word. Since the combinations lj, nj and dž are more frequently digraphs than separate letters, we can require the markup only for those cases when they are not digraphs.
So what should that markup look like? The best solution is to have it as a single "separator" character to be used in between letters that would otherwise be treated as digraphs. For example:
- njuška (nj as digraph) vs. in•jekcija (n followed by j)
- džak (dž as digraph) vs. nad•živeti (d followed by ž)
Part of transliteration table from Cyrillic to Latin covering potential digraphs would be as follows:
Cyrillic | Condition | Latin |
---|---|---|
Ж ж | not following Д/д | Ž ž |
Ж ж | following Д/д | •Ž •ž |
Ј ј | not following Н/н or Л/л | J j |
Ј ј | following Н/н or Л/л | •J •j |
Л л | always | L l |
Љ љ | always | Lj lj |
Н н | always | N n |
Њ њ | always | Nj nj |
Џ џ | always | Dž dž |
Conversely, the inverse table would be applied with a greedy algorithm (longest match first):
Latin | Cyrillic |
---|---|
N•J | НЈ² |
N•j | Нј² |
n•j | нј² |
n•J | нЈ² |
L•J | ЛЈ² |
L•j | Лј² |
l•j | лј² |
l•J | лЈ² |
D•Ž | ДЖ² |
D•ž | Дж² |
d•ž | дж² |
d•Ž | дЖ² |
NJ | either¹ НЈ (to account for acronyms) or Њ (for titles) |
Nj | Њ |
nj | њ |
nJ | нЈ |
LJ | either¹ ЛЈ (to account for acronyms) or Љ (for titles) |
Lj | Љ |
lj | љ |
lJ | лЈ |
DŽ | either¹ ДЖ (to account for acronyms) or Џ (for titles) |
Dž | Џ |
dž | џ |
dŽ | дЖ |
(any other single Serbian Latin letter) | (directly corresponding Cyrillic letter) |
• (separator) | (nothing - skipped/consumed) |
else | leave as is (not transliterated) |
Note 1: The Cyrillic-to-Latin transliteration table never generates digraphs with both halves uppercase, in order to account for acronyms; treating such pairs as two separate letters (НЈ, ЛЈ, ДЖ) should therefore be the preferred way.
Note 2: A smaller table can be used covering only "undotted" forms and simply ignoring (consuming) those "dots". However, that would not account for some special cases directly. Above table is shown as full for completeness, not to indicate specific optimization of implementation.
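The two tables can be exercised with a small Python sketch. It follows the smaller-table variant from Note 2 (the separator is simply consumed) and the preference from Note 1 (an all-uppercase pair such as NJ is read as two letters). BULLET is assumed as the separator only because the examples above use it; the function names are invented for this illustration.

```python
SEP = "\u2022"  # BULLET, as in the examples above; the final choice is still open

CYRILLIC = "абвгдђежзијклљмнњопрстћуфхцчџш"
LATIN = ["a", "b", "v", "g", "d", "đ", "e", "ž", "z", "i", "j", "k", "l", "lj", "m",
         "n", "nj", "o", "p", "r", "s", "t", "ć", "u", "f", "h", "c", "č", "dž", "š"]

TO_LATIN = {}
for cyr, lat in zip(CYRILLIC, LATIN):
    TO_LATIN[cyr] = lat
    TO_LATIN[cyr.upper()] = lat.capitalize()   # digraphs become Lj, Nj, Dž
TO_CYRILLIC = {lat: cyr for cyr, lat in TO_LATIN.items()}

def to_latin_hinted(text):
    """Cyrillic to Latin, inserting SEP wherever two letters would look like a digraph."""
    out, prev = [], ""
    for ch in text:
        low, before = ch.lower(), prev.lower()
        if (low == "ж" and before == "д") or (low == "ј" and before in ("н", "л")):
            out.append(SEP)
        out.append(TO_LATIN.get(ch, ch))
        prev = ch
    return "".join(out)

def to_cyrillic_hinted(text):
    """Latin to Cyrillic, greedy on digraphs; SEP is consumed and blocks a digraph."""
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if text[i] == SEP:
            i += 1                               # separator: skipped
        elif len(pair) == 2 and pair in TO_CYRILLIC:
            out.append(TO_CYRILLIC[pair])        # lj, Lj, nj, Nj, dž, Dž
            i += 2
        else:
            out.append(TO_CYRILLIC.get(text[i], text[i]))
            i += 1
    return "".join(out)

for word in ["њушка", "инјекција", "џак", "надживети", "ФКЉ"]:
    latin = to_latin_hinted(word)
    print(word, "->", latin, "->", to_cyrillic_hinted(latin))
```

With the hint in place the round trip is lossless for text that originated in Cyrillic; new Latin text typed without separators is still read greedily as digraphs.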
Requirements for the markup character are:
- Must not be obtrusive (as #, $, @, etc. would be). Center dots, dashes or hyphens are preferred.
- Must not be (commonly) needed or used otherwise
- Must not have a clearly defined different function
- Must be available in common fonts and always visible (never omitted or rendered as space or something else)
Some "obvious" candidate characters:
- Unicode "WORD JOINER" (0x2060: ) or "ZERO WIDTH NO-BREAK SPACE" (0xFEFF: ). Perfect as markup (intended function) but either obtrusive (square) or invisible for a reason - it is intended for storage and processing, not for display.
- Unicode "NON-BREAKING HYPHEN" (0x2011: ‑) and/or "HYPHENATION POINT" (0x2027: ‧) - depending on whether line break is allowed in between the letters in question → complicated. Can easily be confused with "minus" -. Questionable visibility and availability in fonts. Otherwise functionally good match.
- Unicode "HYPHEN BULLET" (0x2043: ⁃) - Intended for a completely different purpose. Questionable visibility and availability in fonts.
- Unicode "BULLET" (0x2022: • as used in the example above) - Intended for a completely different purpose. However, it is already commonly available under the edit box in Wikipedia.
- Unicode "HYPHEN" (0x2010: ‐). Can easily be confused with "minus" -. Also commonly used for other purposes.
- Tilde (~) - the only one present on the keyboard and, therefore, the easiest to type, but also commonly used for other purposes; sometimes appears at the top of the character space and not in the middle.
- Unicode "MIDDLE DOT" (0x00B7: ·) - small, commonly available under edit box, but hard to click on and causes weird formatting problems.
Note that this system is only supposed to be used while editing. It would eliminate the immediate need for complicated "differential transliteration" and would also have an added benefit of providing an easy way to automatically build a transliteration dictionary.
Finally, the system does not ever have to be disabled - it can stay as an add-on even when dictionary is available - to handle new content not yet in dictionary.
Web Search Engines
Search engines are probably the main way of getting to Wikipedia content for most people. Yet current issues prevent Wikipedia content from being consistently found by such searches.
Depending on features and capabilities of any particular search engine (/SE/ in further text) the following must be done when they index Wikipedia content:
- Always provide Cyrillic content as stored.
- If /SE/ is not aware of the equivalence between Cyrillic and Latin characters (and most aren't), both Cyrillic and Latin forms have to be given to it.
- If /SE/ does not gracefully ignore the differences between the special Serbian Latin non-ASCII characters (ž, š, č, ć, đ) and their ASCII counterparts, then an additionally transliterated ASCII-only form (using z, s, c, c, d instead) needs to be given to it for any/all articles. Google appears not to need this.
- If /SE/ is not aware and capable of equivalence of Serbian dialects (none are, nor are any ever likely to be) then all of the above-determined forms must be presented for all dialects, most importantly Ekavian and Ijekavian. Note that the "hash words" indexing method mentioned earlier does not help, as we cannot influence the inner workings of /SE/, and the only way to make content reachable through /SE/ is by providing it with a sufficient number of different article renditions.
- Articles with strict-dialect and/or strict-script parts present an interesting problem. The question is whether anyone would search for documents with multiple dialects or whether their queries would always be in a single dialect only. Assuming the latter case (single-dialect queries only), strict-dialect and/or strict-script markup should be ignored when presenting content to /SE/ and the conversion performed nevertheless. In the former case, the conversion is to be avoided for such parts. A third case, a combination, requires two forms of everything mentioned in the previous bullet points to be presented to /SE/ - one with strict parts left as they are and another with them converted nevertheless.
Real-time conversion of dialect
Reading in a different dialect
This is a very complex issue. Without contributors specially marking the text to identify different dialects for each word (something very few can actually do and would nevertheless be very cumbersome) or reverting to ancient use of letter "yat" (again, something very few would actually be able to do), the only way to handle this is automatically. And the only possible way to do it is to have a large, ever-maintained, dictionary that covers roots of the words and their attributes (e.g. conjugation or declension rules, dialect forms, etc.) and a system of looking up any particular word in that dictionary.
While creating such a dictionary is immense work, it is possible and would be very useful. In fact, creating a generic system for such a dictionary, available for use in other languages as well, would give the world an incredibly useful tool that is not currently available (to my knowledge). Better yet, it would be maintained by many speakers of those languages and never become stale.
There are potential problems though. It is theoretically possible that some words are homonyms (or homophones or homographs; there is no distinction with Serbian Cyrillic - if words are homophones they are also homographs and vice versa) in one dialect and not in another. It would be very helpful if someone could research this and check this possibility. Such words introduce ambiguity in dialect conversion and would require context analysis and artificial intelligence to resolve fully automatically.
Assumption, however, is that there aren't too many such cases and that they can be handled "semi-automatically". The engine can help identify "problem words" in articles. A human user would then come and use specific markup to place all possible dialect forms within the article itself. Wikipedia rendition system would, of course, choose only appropriate one when showing the article to a reader, but include the entire specification for editing.
Now, here's the punch line - if we want to have the content searchable by external Web Search Engines in any of the dialects we essentially have to have such a system anyway. Question of using it for display purposes in addition to Web Search Engine indexing becomes a trivial one.
Contributing in a different dialect
Contributing text in a different dialect will undoubtedly occur often, sometimes as a quote and other times as a regular improvement to the article. While quotes should be treated as strict-dialect text and left untouched, other contributions should be normalized to allow consistent searchability and reading of the article.
Recommended approach is the same as for transliteration (only converting changes, not the whole edited chunk) but converting dialects is much more complex than converting script, as described previously.
Sorting
As mentioned before, the sort order is different for the Cyrillic and Latin alphabets. When viewing alphabetically sorted categories, the reader expects to see entries collated in the natural order for the presented alphabet.
The primary alphabet and sort order for Serbian Wikipedia is Cyrillic. The secondary alphabet, Latin, has different sorting rules that require a dictionary to resolve the digraph ambiguity (are nj, lj, and/or dž one or two letters in any given occurrence?). There is some extraordinarily good news, though - having Cyrillic as the primary alphabet eliminates the need for such complex sorting. Essentially, sorting can be performed on Cyrillic entries, just using Latin collation rules (sort order).
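Sorting Cyrillic entries by Latin collation rules can be sketched as a sort key in Python. This is only an illustration: the helper name make_sort_key is invented, and characters outside the Serbian alphabet are simply placed after all Serbian letters, which is cruder than the exceptions discussed in this section.

```python
# The 30 Serbian Cyrillic letters in Cyrillic alphabetical order ...
CYRILLIC_ORDER = "абвгдђежзијклљмнњопрстћуфхцчџш"
# ... and the same letters in the order of their Latin counterparts:
# a b c č ć d dž đ e f g h i j k l lj m n nj o p r s š t u v z ž
LATIN_ORDER = "абцчћдџђефгхијклљмнњопрсштувзж"

def make_sort_key(order):
    """Build a key function that ranks letters by their position in `order`."""
    rank = {ch: i for i, ch in enumerate(order)}
    def key(entry):
        # Unknown characters sort after every letter of the alphabet.
        return [rank.get(ch, len(order) + ord(ch)) for ch in entry.lower()]
    return key

cyrillic_key = make_sort_key(CYRILLIC_ORDER)
latin_key = make_sort_key(LATIN_ORDER)

entries = ["џак", "вода", "ђак", "цвет", "чај", "дрво"]
print(sorted(entries, key=cyrillic_key))  # order for readers using Cyrillic
print(sorted(entries, key=latin_key))     # order for readers using Latin
```

Since each Cyrillic letter is a single character, the key never has to guess whether "nj" is one letter or two: "инјекција" and "ињекција" simply get different keys.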
There are some exceptions that are present in other languages as well - presence of foreign-language or non-primary strict-script entries. In this regard Serbian population seems to prefer the following:
- If a list to be shown in Cyrillic has certain Latin entries, sort them transliterated to Cyrillic even though they may be in a foreign language. Since there are foreign-language Latin letters that cannot be transliterated to Cyrillic, they can be assumed to come after any/all Cyrillic letters, in any order. If an entry starts with such a letter (e.g. Q, W, X or Y) it will get a separate non-Cyrillic section holding all the entries beginning with that letter.
- If a list to be shown in Latin has certain Cyrillic entries, use the same approach as above but use Latin sort order.
- Any entries that begin with or contain letters that are neither Latin nor Cyrillic (e.g. Kanji) should be treated separately and listed at the end of the list in the same way other languages treat such foreign entries.
Note that the alphabet is not the only thing affecting sort order - dialect affects it as well. While the Serbian word for milk comes before mleti in the ekavian dialect (mleko), it comes after it in the ijekavian form (mlijeko).
Also note that categories are not the only place where alphabetically sorted lists may appear - articles may contain tables or lists that have to be sorted as well. To facilitate this it would be best to come up with an appropriate wiki table feature that supports dynamic re-sorting (even at viewing time, by clicking on the column heading, why not?).
Dictionary
A dictionary was mentioned many times in the previous text. It is the centerpiece of a number of solutions. It is also the most complex component.
Note that this is not a typical dictionary designed for human use. Instead, it needs to be a human-language dictionary for machine use (but it can also be exposed to humans very nicely). This simple distinction makes all the difference in the world. While humans are naturally aware of word forms caused by adding prefixes and suffixes, conjugations, declensions, compound words, etc., machines are not. Recognizing the root of a word requires knowledge of that root - it generally cannot be determined automatically.
Take a word "метаподаци" (plural of "metadata") for example. Suppose that there is no entry for it exactly as shown. A human would easily guess that it is actually composed of "мета" (meta) and "подаци" (plural of "data"). Furthermore, the same human would also know that singular form of "подаци" is "податак". A human can then look up "мета" and "податак" separately and find all the details needed (such as differences between dialects, etc).
A machine cannot do it that easily. First, it does not know whether the word is composed of other words at all and, if it is, how to break it - is it "ме" + "таподаци" or "метапо" + "даци" or ... or ... or ... ? Second issue is the form of the word itself - in this case "подаци", which is a plural of "податак". Note that although both words begin with "пода" it is not the root of the word, nor is sufficient to recognize it - in that form it appears more as the Serbian word for floor. How can a machine find or match appropriate dictionary entry?
Perhaps we should look at what tools are available for the German language, as it is famous for its long compound words. But, essentially, the answer is simple:
- It is likely impossible to go straight to the correct dictionary entry. We need to match the entry (or multiple entries) that completely match(es) given form.
- Dictionary has to be aware of various prefixes or how words are continued with other words. With some iteration this information can match the entire compound word. In our example, it would be aware that "meta" can be directly followed, with no modifications, by any other word. Note that the prefix might change its form depending on attributes of following word and that needs to be taken into account.
- Dictionary has to be aware of all word forms, whether by knowing the roots and applying rule-based + exceptions changes to them (better) or simply by storing all available forms. It has to know which forms are available for which context (e.g. dialect) and be able to provide dialect translation.
- For machine use, dictionary does not actually have to include meanings of words. However, classifications of words are useful in otherwise ambiguous situations that require context to be "understood". These classifications are not limited to types of words such as nouns vs. verbs but can go further indefinitely. For example a word "sparrow" can be classified as:
[noun] <--- [species] <--- [animal] <--- [flying animal]
                              A                 A
                              |                 |
                       [small animal] <--- [small flying animal] <--- [bird] <--- [sparrow]
... similarly to how categorization already works in Wikipedia. Having such a dictionary will significantly ease future development of language tools and make dialect conversion possible.
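The matching of a compound word against such a dictionary can be sketched as follows. The toy dictionary FORMS, its two-field entries and the function segment are hypothetical and exist only for this illustration; a real dictionary would derive forms from roots and rules and would also handle prefixes that change form.

```python
# Hypothetical toy dictionary: surface form -> (base form, may be directly
# followed by another word). A real dictionary would hold far richer attributes.
FORMS = {
    "мета": ("мета", True),
    "податак": ("податак", False),
    "подаци": ("податак", False),   # plural of "податак"
    "под": ("под", False),          # "floor"
    "пода": ("под", False),         # a case form of "под"
}

def segment(word):
    """Return every way to cover the whole word with dictionary forms,
    each as a list of (surface form, base form) pairs."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        piece, rest = word[:end], word[end:]
        if piece not in FORMS:
            continue
        base, can_continue = FORMS[piece]
        if rest and not can_continue:
            continue  # only prefix-like entries may be followed by another word
        for tail in segment(rest):
            results.append([(piece, base)] + tail)
    return results

print(segment("метаподаци"))
print(segment("метапо"))  # no complete match, so nothing is returned
```

Only segmentations that match the entire form are accepted, so the partial match on "пода" (floor) is rejected automatically.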
Inter-Wikipedia Integration & Joint Efforts
Many issues and suggestions presented here are common to many languages - searchability, alphabets/scripts, dialects, dictionaries, etc. These issues and their solutions beg for Wikipedia to be further internationalized, not just localized - that is, to support any number of languages in any particular installation (curiously enough, this may not apply to the UI itself but to all the other complicated things). Wouldn't it be nice to be able to:
- search for any-language text quoted in any other language Wikipedia,
- get the translation or definition of particular word in the article simply by clicking or hovering over it (whatever the language is)
- have commonly available linguistic tools backed by an incredible database of encyclopaedic articles
Some of the functionality mentioned here can be used for entirely different purposes in other languages. For example, while learning how to write Hebrew, children write with both consonants and vowel marks included. Eventually they learn to read with those vowel marks omitted, the practice common in literature not specifically made for children. The systems presented here could help automatically restore those vowel marks for those still learning the language (or how to read/write it). The situation is similar with Arabic vowel marks.
To achieve this we need standards and joint development (or at least specification) efforts, in order to fit all or as many languages as possible. Achieving everything listed here is likely not possible in short term, but if we start well, we can eventually reach the goal through a number of planned iterative improvements, starting with highest-priority, most pressing issues first.
See also
- Romanization - Romanization category in English-language Wikipedia, especially Scientific transliteration.
- ISO 9 transliteration standard
- Wikipedia Machine Translation Project
- Niqqud - serious bug affecting Hebrew vowel marks in all Wikimedia projects (in English)
- Norma kodnih rasporeda i srodna pitanja, by Miloš Rančić, in Serbian.
- Cirilica.com - Large library of Cyrillic fonts mostly adjusted for Serbian use (in Serbian)