Uyghur Language Processing on the Web

Dr. Waris Abdukerim Janbaz, Prof. Imad Saleh
Paragraphe Laboratory, University of Paris VIII, France
http://paragraphe.univ-paris8.fr

Abstract

In this paper, we discuss some important issues related to web processing of an agglutinative Turkic language, Uyghur. In particular, we discuss the grassroots efforts on Uyghur Unicode font development, Uyghur character display, font embedding, and Uyghur character input methods in environments without Uyghur support. We also introduce a multiscript conversion application to further the use of the Unicode standard for Uyghur language processing.

Keywords: Unicode, Font, Turkic Language, multiscript, transliteration, Arabic-Script Uyghur, Cyrillic-Script Uyghur, Latin-Script Uyghur.

1. Introduction

The Uyghurs are a Turkic-speaking ethnic group, officially about nine million, living in Central Asia, including today's Xinjiang Uyghur Autonomous Region (hereafter: XUAR, also called Chinese Turkistan) as well as parts of Kazakhstan and urban regions in the Ferghana valley. The official writing system of the XUAR Uyghurs is Arabic-Script Uyghur [1] (hereafter: ASU), whereas Cyrillic-Script Uyghur [2] (hereafter: CSU) is still in use among the Uyghurs of the republics of the former Soviet Union (USSR). The newly introduced transliteration [3], Latin-Script Uyghur [4] (hereafter: LSU), has become widely accepted among Uyghurs and Uyghurologists and is a commonly used standard for the transliteration of both ASU and CSU.

The influence of web publishing started appearing in Uyghur society in the last 10 years. Since the existing platforms supply neither a Uyghur input method nor any font that includes all the glyphs of the ASU alphabet, inputting Uyghur text into interactive web pages (in browsers) and correctly displaying Uyghur characters presented huge difficulties. In spite of the fairly passive attitude of Government authorities to the development of Uyghur information technology, many individuals started creating Uyghur websites using the three above-mentioned scripts. ASU, used by the most populous segment of XUAR Uyghurs, caused special coding problems given that it uses a non-standard set of Arabic-based glyphs.

2. Background

For ASU, before 2002, one or both of the following two methods were very common for web publishing in Uyghur: 1) font downloading; 2) publishing text as images. There is no need to explain the inconvenience of the second method. More interesting but complex problems occurred in the case of the first one.

The major problem came from the fact that every web site owner created and named his/her own fonts, and visitors had to download a specific font (or several fonts) for almost every single website. No one accepted the font name and coding of the others, and no common standard was created. Most of the fonts created during this period replaced either the ASCII characters or the Unicode Arabic characters (0x600-0x6FF) with Uyghur characters, without any agreement on which characters to replace. Since the number of Arabic letters in the code range 0x600-0x6FF is larger than the number of ASU letters, people made different choices as they replaced some Arabic characters with ASU characters. Therefore, the multiplication of font names and the growth of coding differences (for the same glyphs) among the fonts became an obstacle to the development of ASU computer processing and web publishing. A large number of issues regarding non-standard fonts and their use were brought to individual computer scientists in many different ways.

Meanwhile, many of these problems were circumvented by using methods unrelated to the Unicode standard. As a result, web site creators eventually expressed their strong desire to further the use of the Unicode standard for Uyghur language processing.

In June 2002, the author developed the first Uyghur Unicode font and implemented both system-level and browser-level Input Method Editors for Windows. It became a revolutionary accomplishment, owing mostly to the new method and applications that are fully Unicode-compliant (as opposed to occasionally compatible). Hence, a campaign was launched to popularize and adopt the Unicode standard for Uyghur fonts. In this paper, we present the entire process that we have been following and developing for three years. The following subsections cover four major parts of the entire implementation procedure.

3. Uyghur Unicode font developing

Uyghur (ASU) letters were developed on the basis of the Arabic alphabet. The ASU alphabet has 8 vowels [5] and 24 consonants (see annex 1). Uyghur, just like Arabic, is written from right to left, each letter having different shapes depending on its position in a word. The Uyghur letters have initial, median, final and isolated forms; some letters have conjunct forms [6]. In total, the Uyghur alphabet has 126 different glyphs. The 108 basic glyphs [7] of the Uyghur letters have already been accepted by the Unicode Consortium/ISO, and 18 glyphs [8] out of the 20 glyphs for composed forms were added in 1998. Unfortunately, two conjunct median forms (of the Uyghur letters ﺋﯥ and ﺋﻰ), ﺌﯧ [9] and ﺌﯩ [10], are still absent [11] from the Unicode Standard's table [12], Arabic Presentation Forms-A. This gap leaves the Unicode Consortium/ISO repertoire incomplete as it stands, and it has forced people to supplement it by borrowing from FBD1 and FBD2 the "supported hamze", which is then combined with the median′ form of ﺋﯥ and ﺋﻰ to generate two synthetic combined letters.

The 20 conjunct glyphs can also be expressed as a sequence of two existing Unicode glyphs (as is now the case for the two missing conjunct glyphs). But this kind of usage may cause problems such as reducing text input speed, increasing data storage redundancy, and complicating data sorting operations.

The creation of a Unicode-based Uyghur font has become a necessity for the progress of Uyghur information processing, since the existing platforms do not supply any Uyghur font. Existing fonts (both Arabic fonts and other fonts which include Arabic letters) do not include all the necessary shapes of Uyghur letters (see annex 2), and therefore some substitution sequences lead to display problems. For example:

1. ﺋﺎﻟەﻣﺪىﻜﻰ هەﻣﻤە ﺋىﻨﺴﺎن ﻗەﺑىﻪ ﺋەﻣەس
2. ﺋﺎﻟﻪﻣﺪﯨﻜﻰ ھﻪﻣﻤﻪ ﺋﯩﻨﺴﺎﻥ ﻗﻪﺑﯩﻬ ﺋﻪﻣﻪﺱ
(Not all human beings in the world are evil)

The first sentence above is considered an illegal character combination if it uses existing fonts (e.g. Times New Roman, Traditional Arabic), because the cursive shapes of ﻯ, ھ and ﺋﻪ are not correct according to the ASU alphabet (see annex 2). It should appear as in sentence 2, in which the letters use a specific font, UKIJ Tuz Tom. In order to create the right cursive connection forms for Uyghur, it was necessary to take special measures for three problem letters, ﻯ, ھ and ﺋﻪ, and two "glottal stop signs", ﺌ and ﺉ (supported hamze), during the creation of Uyghur fonts. The absence of such measures would make it impossible to display the cursive forms of the three letters correctly in browsers and other application software.

ﻯ [13]: Uyghur letter i as in ishik (ﺋﯩﺸﯩﻚ, door). The 8 different forms are listed in table 1 below. For the initial′ and median′ forms (ﯨ , ﯩ) of this letter we use the initial and median forms of the Arabic letter ﻯ 0649; for the final′ and isolated′ forms (ﻯ , ﻰ) we use the final and isolated forms of the Farsi letter ﻯ 06CC, respectively.

ﺋﻪ [14]: Uyghur letter e as in eyneklerde (ﺋﻪﻳﻨﻪﻛﻠﻪﺭﺩە, in the mirrors). This letter uses the final and isolated glyphs (ﻪ , ﻩ) of the Arabic letter ھ [15] 0647 (h), in the same way as Persian does. This causes a special problem, because the glyphs of Arabic ھ 0647 (h) in the initial and median positions (ھ , ﻬ) correspond to those of Uyghur ھ (h as in ھﯧﻠﯩﻬﻪﻡ hélihem, even now; ﮔﯘﻧﺎھ gunah, sin or offense; ﻗﻪﺑﯩﻬ qebih, odious), which, in turn, has different final and isolated glyphs (ھ , ﻬ). In order to deal with this inconsistency, we have chosen to use 06D5 for the Uyghur letter ﺋﻪ and 06BE for the Uyghur letter ھ.

Table 1
ini.  med.  fin.  iso.  ini.′  med.′  fin.′  iso.′
-     -     ﺎ     ﺍ     -      -      ﯫ      ﯪ
-     -     ﻪ     ﻩ     -      -      ﯭ      ﯬ
-     -     ﻮ     ﻭ     -      -      ﯯ      ﯮ
-     -     ﯘ     ﯗ     -      -      ﯱ      ﯰ
-     -     ﯚ     ﯙ     -      -      ﯳ      ﯲ
-     -     ﯜ     ﯛ     -      -      ﯵ      ﯴ

Notes

[1] See annex 2.
[2] See annex 1.
[3] Using one writing system to represent words in another is called transliteration.
[4] Called Uyghur Kompyutér Yéziqi (UKY) or Uyghur Latin Yéziqi (ULY) in Uyghur, meaning "Uyghur Computer Writing" or "Latin-Script Uyghur". See http://www.ukij.org/teshwiq/UKY_Heqqide(KonaYeziq).htm
[5] The Arabic alphabet has only 3 vowel letters, ﺍ ﻭ ﻱ, which it uses for long vowels. The others are not noted in normal writing. Given its phonetic characteristics, Uyghur notes down all vowels (ﺋﺎ، ﺋﻪ، ﺋﻮ، ﺋﯘ، ﺋﯚ، ﺋﯜ، ﺋﯥ، ﺋﻰ) using derivatives of traditional Arabic letters.
[6] The initial form and, under some circumstances, the median form of all vowels is preceded by one "glottal stop sign", ﺉ or ﺌ (supported hamze), with which they form a common letter (treated by Uyghur as a single letter, see annex 2). ﻝ followed by ﺍ forms ﻼ or ﻻ depending on their position.
[7] See http://www.oyghan.com/images/UyghurUnicodeTable.gif
[8] See Arabic Presentation Forms-A, glyph code range: FBEA-FBFB. See also table 1.
[9] Character name for the Unicode Standard: ARABIC LIGATURE YEH WITH HAMZA ABOVE WITH E MEDIAN FORM. Ex: ﺑﺎﻏﺌﯧﺮﯨﻖ (Baghériq).
[10] Character name for the Unicode Standard: ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA MEDIAN FORM. Ex: ﻗﻪﺗﺌﯩﻲ (certainly, doubtlessly).
[11] The XUAR's delegation members, Prof. Hoshur Islam and Yasin Imin, who submitted the proposal, also admit this fault.
[13] Character name for the Unicode Standard: ARABIC LETTER UIGHUR KAZAKH KIRGHIZ ALEF MAKSURA (represents YEH-shaped letter with no dots in any positional form), 0649.
[14] Character name for the Unicode Standard: ARABIC LETTER (ە).
