processing on the Web

Dr. Waris Abdukerim Janbaz, Prof. Imad Saleh
Paragraphe Laboratory, University of Paris VIII, France
http://paragraphe.univ-paris8.fr

Abstract

In this paper, we discuss some important issues related to the web processing of an agglutinative Turkic language, Uyghur. In particular, we discuss Uyghur Unicode font development, Uyghur character display, font embedding and a Uyghur character input method for environments without Uyghur support. We also introduce a multiscript conversion application that furthers the use of the Unicode standard for Uyghur language processing.

Keywords: Unicode, Font, Turkic Language, multiscript, transliteration, Arabic-Script Uyghur, Cyrillic-Script Uyghur, Latin-Script Uyghur.

1. Introduction

The Uyghurs are a Turkic-speaking ethnic group, officially numbering about nine million, living in Central Asia, including today’s Xinjiang Uyghur Autonomous Region (hereafter: XUAR, also called Chinese Turkistan) as well as parts of Kazakhstan and urban regions in the Ferghana valley.

The official writing system of the XUAR Uyghurs is Arabic-Script Uyghur¹ (hereafter: ASU), whereas Cyrillic-Script Uyghur² (hereafter: CSU) is still in use among the Uyghurs of the republics of the former Soviet Union (USSR). The newly introduced transliteration³, Latin-Script Uyghur⁴ (hereafter: LSU), has become widely accepted among Uyghurs and Uyghurologists and is a commonly used standard for transliterating both ASU and CSU.

The influence of web publishing started appearing in Uyghur society in the last 10 years. Since the existing platforms supply neither a Uyghur input method nor fonts that include all the glyphs of the ASU alphabet, inputting Uyghur text into interactive web pages (in browsers) and correctly displaying Uyghur characters presented huge difficulties. In spite of the fairly passive attitude of government authorities toward the development of Uyghur information technology, many individuals started grassroots efforts to create Uyghur websites using the three above-mentioned scripts. ASU, used by the most populous segment of Uyghurs, those of the XUAR, caused special coding problems because it uses a non-standard set of Arabic-based glyphs.

2. Background

For ASU, before 2002, one or both of the following two methods were very common in Uyghur web publishing: 1) font downloading; 2) publishing text as images. There is no need to explain the inconvenience of the second method. More interesting but complex problems occurred with the first one. The major problem was that every web site owner created and named his or her own fonts, so visitors had to download a specific font (or several fonts) for almost every single website. No one accepted the font names and codings of the others, and no common standard was created. Most of the fonts created during this period replaced either the ASCII characters or the Unicode Arabic characters (0x600-0x6FF) with Uyghur characters, without any agreed replacement scheme. Since the number of Arabic letters in the code range 0x600-0x6FF is larger than the number of ASU letters, people made different choices when they replaced Arabic characters with ASU characters. The multiplication of font names and the growth of coding differences (for the same glyphs) among the fonts therefore became an obstacle to the development of ASU computer processing and web publishing. A large number of issues regarding non-standard fonts and their use were raised, in many different ways, with individual computer scientists. Meanwhile, many of these problems were circumvented by methods unrelated to the Unicode standard. As a result, web site creators eventually expressed a strong desire to make further use of the Unicode standard for Uyghur language processing.

1 See annex 2.
2 See annex 1.
3 Using one writing system to represent words in another is called transliteration.
4 Called Uyghur Kompyutér Yéziqi (UKY) or Uyghur Latin Yéziqi (ULY) in Uyghur, meaning “Uyghur Computer Writing” or “Latin-Script Uyghur”. See http://www.ukij.org/teshwiq/UKY_Heqqide(KonaYeziq).htm

In June 2002, the author developed the first Uyghur Unicode font and implemented both system-level and browser-level Input Method Editors for Windows. This was a breakthrough, owing mostly to the fact that the new method and applications are fully Unicode-compliant (as opposed to occasionally compatible). Hence, a campaign was launched to popularize and adopt the Unicode standard for Uyghur fonts. In this paper, we present the entire process that we have been following and developing for three years. The following sections cover the four major parts of the implementation procedure.

3. Uyghur Unicode font developing

Uyghur (ASU) letters have been developed on the basis of the Arabic alphabet. The ASU alphabet has 8 vowels⁵ and 24 consonants (see annex 1). Uyghur, just like Arabic, is written from right to left, each letter having different shapes depending on its position in a word. The Uyghur letters have initial, median, final and isolated forms; some letters have conjunct forms⁶. In total, the Uyghur alphabet has 126 different glyphs. The 108 basic glyphs⁷ of the Uyghur letters have already been accepted by the Unicode Consortium/ISO, and 18 glyphs⁸ out of the 20 glyphs for composed forms were added in 1998. Unfortunately, two conjunct median forms (of the Uyghur letters ﺋﯥ and ﺋﻰ), ﺌﯧ⁹ and ﺌﯩ¹⁰, are still absent¹¹ from the Unicode Standard’s table¹² Arabic Presentation Forms-A. This gap leaves the Unicode/ISO repertoire incomplete as it stands, and it has forced people to supplement it by borrowing from FBD1 and FBD2 the “supported hamze”, which is then combined with the median′ form of ﺋﯥ and ﺋﻰ to generate two synthetic combined letters.

The 20 conjunct glyphs can also be expressed as a sequence of two existing Unicode glyphs (as is now the case for the two missing conjunct glyphs). But this kind of usage may cause problems such as reduced text input speed, increased data storage redundancy and more complicated data sorting operations.

The creation of a Unicode-based Uyghur font has become a necessity for the progress of Uyghur information processing, since the existing platforms do not supply any Uyghur font. Existing fonts (both Arabic fonts and other fonts which include Arabic letters) do not include all the necessary shapes of Uyghur letters (see annex 2), and therefore some substitution sequences lead to display problems. For example:

1. ﺋﺎﻟەﻣﺪىﻜﻰ هەﻣﻤە ﺋىﻨﺴﺎن ﻗەﺑىﻪ ﺋەﻣەس
2. ﺋﺎﻟﻪﻣﺪﯨﻜﻰ ھﻪﻣﻤﻪ ﺋﯩﻨﺴﺎﻥ ﻗﻪﺑﯩﻬ ﺋﻪﻣﻪﺱ
(Not all human beings in the world are evil)

The first sentence, rendered with existing fonts (e.g. Times New Roman, Traditional Arabic), contains illegal character combinations, because the cursive shapes of ﻯ, ھ and ﺋﻪ are not correct according to the ASU alphabet (see annex 2). It should appear as in sentence 2, which uses a specific font, UKIJ Tuz Tom. In order to create the right cursive connection forms for Uyghur, it was necessary to take special measures for the three problem-letters ﻯ, ھ, ﺋﻪ and the two “glottal stop signs ﺉ, ﺌ” (supported hamze) during the creation of Uyghur fonts. Without such measures it would be impossible to display the cursive forms of the three letters correctly in browsers and other application software.

ﻯ¹³: Uyghur letter i as in ishik (ﺋﯩﺸﯩﻚ, door). The 8 different forms are listed in table 1 below. For the initial′ and median′ forms (ﯨ , ﯩ) of this letter we use the initial and median forms of the Arabic letter ﻯ 0649; for the final′ and isolated′ forms (ﻯ , ﻰ) we use the final and isolated forms of the Farsi letter ﻯ 06CC, respectively.

ﺋﻪ¹⁴: Uyghur letter e as in eyneklerde (ﺋﻪﻳﻨﻪﻛﻠﻪﺭﺩە, in the mirrors). This letter uses the final and isolated glyphs (ﻩ , ﻪ) of the Arabic letter ھ 0647 (h), in the same way as Persian does. This causes a special problem, because the glyphs of Arabic ھ¹⁵ 0647 (h) in the initial and median positions (ھ , ﻬ) correspond to those of Uyghur ھ (h as in ھﯧﻠﯩﻬﻪﻡ hélihem, even now; ﮔﯘﻧﺎھ gunah, sin or offense; ﻗﻪﺑﯩﻬ qebih, odious), which, in turn, has different final and isolated glyphs (ھ , ﻬ). In order to deal with this inconsistency, we have chosen to use 06D5 for the Uyghur letter ﺋﻪ and 06BE for the Uyghur letter ھ.

ini. med. fin. iso. ini.′ med.′ fin.′ iso.′
ﯪ ﯫ ﺎ ﺍ
ﯬ ﯭ ﻪ ﻩ
ﯮ ﯯ ﻮ ﻭ
ﯰ ﯱ ﯘ ﯗ
ﯲ ﯳ ﯚ ﯙ
ﯴ ﯵ ﯜ ﯛ
ﯸ ﺌﯧ ﯷ ﯶ ﯦ ﯧ ﯥ ې
ﯻ ﺌﯩ ﯺ ﯹ ﯨ ﯩ ﻰ ﻯ
ھ ﻬ ﻬ ھ
Table 1. Uyghur vowels and the three problem-letters (the one Arabic character ھ hah has four different basic shapes, which correspond to the four shapes of two different letters in Uyghur).

ﺉ and ﺌ¹⁶: the glottal stop. This is a phoneme which is not listed separately in the ASU alphabet but is still covered by its spelling rules. In Uyghur words, the glottal stop is not as strongly pronounced as it is in Semitic languages or in Uzbek, for example, and it has weakened to become no more than a hiatus. Marked in ASU by a hamza on top of a “tooth”, it usually appears in words of Arabic origin and replaces an original ‘ain (ع) or a hamza (ء) in a median or final position (e.g. ﺋﺎﻟﻪﻡ from Arabic ﻋﺎﻟﹶﻢ, ﺳﺎﺋﻪﺕ from Arabic ﺳﺎﻋَﺔ, ﺧﺎﺋﯩﻦ from Arabic ﺧﺎﺋِﻦ, ﺳﻮﺋﺎﻝ from Arabic ﺳُﺆَال). In initial position, the same sign is considered part of the initial form of a vowel and does not have any phonetic value¹⁷. The two signs correspond to the initial and median forms of the Arabic letter ئ 0626. These Arabic glyphs are not considered different shapes of any independent letter in the Uyghur alphabet (cf. annex 2). Since one glyph of each of the two letters ﺋﯥ and ﺋﻰ (shown in light red in the table above) is still missing in Unicode, we can use a sequence of either of these glyphs (ﺉ or ﺌ) followed by the final, isolated, median′ or final′ forms of the vowels ﺋﯥ and ﺋﻰ (shown in blue in the table above). More precisely, the other conjunct forms can be obtained by combining the Arabic letter ئ 0626 with the respective vowel.

In spite of the above-mentioned limitation (two glyphs instead of one conjunct glyph for ﺋﯥ and ﺋﻰ), these conventions have now been widely accepted by the Uyghur Computer Science Association (UCSA¹⁸) and, at a later date, by the Xinjiang University branch of the 863 Research Group¹⁹.

Once the specificities of those letters are understood, it is easy to create Uyghur fonts using existing font creation software. We also recommend including in any Uyghur font the non-printing format characters ZWNJ (zero width non-joiner; 200C), ZWJ (zero width joiner; 200D), LTR (left to right mark; 200E) and RTL (right to left mark; 200F). The rest of the time-consuming, repetitive font development task is exactly the same as when creating an Arabic font²⁰. Some Uyghur Unicode fonts are available for free at the UCSA website. Our recommended font creation tools are Font Creator²¹ and Fontographer²². Glyph substitutions, positioning lookups, shaping features and the OpenType tables of Arabic fonts can be added with the help of software like Microsoft VOLT.

4. Font embedding and character displaying

Web pages can be rendered without downloading or installing any specific fonts if: 1) the fonts used in the pages are available on the user’s computer, and 2) the browsers provide native support for the fonts and languages used. The second condition has already been met, but unfortunately the first one has not, as no Uyghur fonts are installed with the existing platforms on users’ computers. Therefore, to ensure that Uyghur texts are displayed correctly in web browsers, users must find a way to install on their computers the fonts that are used in the web pages. The same holds true for all the other “forgotten languages” on different platforms. The font installation requirement either causes difficulties for people who don’t have much technical experience, or discourages others from attempting to read the text.

These difficulties can be overcome by embedding fonts into the web pages. When a page is downloaded into a browser via the Hypertext Transfer Protocol, any embedded fonts in the page are also downloaded without any need for the user to intervene. The Microsoft Web Embedding Fonts Tool (WEFT²³) makes it possible to create embedded font objects that can be linked to web pages. The following steps let web page developers create embedded fonts and link them to a web page:
• Create embedded fonts using Microsoft WEFT,
• Prepare the web page using any fonts that are installed on the platform, and
• Link the embedded fonts to the web page.
Microsoft WEFT generates 1) embedded fonts for every web site, with a different extension (.EOT), and 2) a script that links an embedded font to a web page. The disadvantage of the WEFT-generated embedded fonts is that they are compatible only with Internet Explorer. This makes it highly desirable for more effort to be invested in providing cross-platform functionality for this kind of software.

5 The Arabic alphabet has only 3 letters for long vowels: ﺍ, ﻭ, ﻱ. The others are not noted in normal writing. Given its phonetic characteristics, Uyghur notes down all vowels: ﺋﺎ، ﺋﻪ، ﺋﻮ، ﺋﯘ، ﺋﯚ، ﺋﯜ، ﺋﯥ، ﺋﻰ, using derivatives of traditional Arabic letters.
6 The initial form and, under some circumstances, the median form of all vowels is preceded by one “glottal stop sign ﺉ or ﺌ” (supported hamze) with which they form a common letter (treated by Uyghur as a single letter, see annex 2). ﻝ followed by ﺍ forms ﻼ or ﻻ depending on their position.
7 See http://www.oyghan.com/images/UyghurUnicodeTable.gif
8 See Arabic Presentation Forms-A, glyph code range: FBEA–FBFB. See also table 1.
9 Character name in the Unicode Standard: ARABIC LIGATURE YEH WITH HAMZA ABOVE WITH E MEDIAN FORM. Ex: ﺑﺎﻏﺌﯧﺮﯨﻖ (Baghériq).
10 Character name in the Unicode Standard: ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA MEDIAN FORM. Ex: ﻗﻪﺗﺌﯩﻲ (certainly, doubtlessly).
11 The XUAR’s delegation members, Prof. Hoshur Islam and Yasin Imin, who submitted the proposal, also admit this fault. See also Arabic Presentation Forms-A (code range: FBEA–FBFB).
12 http://www.unicode.org/charts/PDF/UFB50.pdf
13 Character name in the Unicode Standard: ARABIC LETTER UIGHUR KAZAKH KIRGHIZ ALEF MAKSURA (represents YEH-shaped letter with no dots in any positional form), 0649.
14 Character name in the Unicode Standard: ARABIC LETTER AE (Uighur, Kazakh, Kirghiz), 06D5 (isolated form is ە).
15 See http://www.unicode.org/standard/where/ , Variant shapes of the Arabic character hah.
16 Character name in the Unicode Standard: ARABIC LETTER YEH WITH HAMZA ABOVE, 0626.
17 It is often said that the decision of Uyghur linguists to add this sign as part of the initial form of letters is a link with the old Uyghur writing system, in which all initial vowels were preceded by a tooth. The Arabic alphabet has 3 letters, ا, و and ي, which can be used to indicate long vowels. Short vowels can be indicated through the use of vowel marks above or under the consonants, but these are dispensed with in normal writing.
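The code point conventions of section 3 can be illustrated with a minimal Python sketch. It is not the authors' font or IME code. The values 06D5 (letter e), 06BE (Uyghur h), 0626 (hamza carrier) and 0649 (letter i) are taken from the text; the code points of the other vowels, the dictionary keys and the helper names `initial_vowel` and `describe` are assumptions made for this illustration.

```python
import unicodedata

# "Supported hamze" carrier that precedes a word-initial vowel (0626 in the text).
HAMZA_CARRIER = "\u0626"

# Base code points of the eight Uyghur vowels, keyed by their LSU spelling.
# 06D5 and 0649 come from the text; the other values are assumed here to be
# the usual Arabic-block letters and should be checked against annex 2.
VOWELS = {
    "a": "\u0627",  # ARABIC LETTER ALEF
    "e": "\u06d5",  # ARABIC LETTER AE
    "o": "\u0648",  # ARABIC LETTER WAW
    "u": "\u06c7",  # ARABIC LETTER U
    "ö": "\u06c6",  # ARABIC LETTER OE
    "ü": "\u06c8",  # ARABIC LETTER YU
    "é": "\u06d0",  # ARABIC LETTER E
    "i": "\u0649",  # ARABIC LETTER ALEF MAKSURA
}

# Uyghur h, kept distinct from Arabic heh (0647) as described in the text.
UYGHUR_H = "\u06be"

# Non-printing format characters recommended for inclusion in a Uyghur font.
FORMAT_CHARS = {
    "ZWNJ": "\u200c",
    "ZWJ": "\u200d",
    "LTR": "\u200e",
    "RTL": "\u200f",
}


def initial_vowel(lsu_vowel):
    """Return the two-code-point sequence hamza carrier + vowel."""
    return HAMZA_CARRIER + VOWELS[lsu_vowel]


def describe(text):
    """List code point and Unicode character name for each character."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


if __name__ == "__main__":
    for line in describe(initial_vowel("e") + UYGHUR_H):
        print(line)
```

Storing a word-initial vowel as two code points leaves the choice of the conjunct glyph to the font's shaping tables, which matches the workaround described for the two conjunct forms that are missing from Arabic Presentation Forms-A.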
5. Creation of a browser-level virtual input method

As mentioned in the introduction, the existing platforms do not supply any system-level Uyghur language input service. Late in 2003, the first system-level Uyghur Unicode IME for Windows was developed by the author and distributed free of charge²⁴. Six months later, the Xinjiang University branch of the 863 Research Group and some individuals joined the Uyghur Unicode popularization campaign by distributing their own Unicode-supporting IMEs. Nevertheless, it still cannot be said that all or even most Uyghur internet users are equipped with Uyghur input tools. Therefore, the browser-level input method still fills a great need, since it enables people to input Uyghur letters into any text input field on a web page without having to install a system-level Uyghur IME. The basic structure of the browser-level Uyghur text input tool is represented in figure 1:

[Figure 1. Workflow of the browser-level inputting method: keyboard and mouse events → Input Uyghur? (yes/no) → capture keyboard and mouse events → code–character mapping → dispatch events → Switch language? (yes/no) → release keyboard and mouse events]

As we can see from the workflow above, once the user selects the Uyghur input option, the “capture keyboard and mouse events” module creates a hook to monitor the keyboard and mouse activities. The “code-char. mapping” module creates a keycode-to-Uyghur-character matrix to get the right Uyghur character that corresponds to the key code (ex: 109 → ﻡ). The “dispatch events” module sends Uyghur characters from the map to the active text input field on a web page. This process repeats itself until the user decides to switch the input language to another one, at which point the “release keyboard and mouse events” module immediately frees the hook. This method has been implemented in JavaScript and VBScript, tested on different browsers and is commonly used on some Uyghur web sites²⁵.

6. Multiscript converting

Because different writing systems (Arabic-Script Uyghur, Cyrillic-Script Uyghur and Latin-Script Uyghur) co-exist for the Uyghur language, research on a conversion tool with which people can switch between the three scripts is called for, with a view to future information sharing. The fact that there is a one-to-one²⁶ correspondence between the letters of these three writing systems is certainly a major helping factor. For better understanding, we take as an example the Uyghur proverb “working for free is better than doing nothing” in the three scripts:

ﺑﯩﻜﺎﺭ ﻳﯜﺭﮔﯩﭽﻪ ﺑﯩﻜﺎﺭ ﺋﯩﺸﻠﻪ
бикар йүргичә бикар ишлә
bikar yürgiche bikar ishle

The following workflow explains the basic conversion process:

[Figure 2. Script converting: source text in source script → pre-processing → character mapping → character converting → disambiguation → Conversion end? (yes/no) → result in destination script]

The functionalities of each module may require some clarification:

Pre-processing: this is an important step in converting. It involves preserving elements that should remain unchanged²⁷ after the conversion. For example, when converting the LSU text “Men Photoshop ni yaxshi körimen” (I love Photoshop) into ASU, we should be able to obtain “ﻣﻪﻥ Photoshop ﻧﻰ ﻳﺎﺧﺸﻰ ﻛﯚﺭﯨﻤﻪﻥ” and vice-versa.

Character mapping: creates an “A_is_B” matrix for every script pair, or three matrices in total.

Character converting: uses the three matrices in order to convert between the different scripts.

Disambiguation: this module is necessary when converting from LSU to ASU and/or CSU, because of spelling mistakes or, more importantly, because of the difficulty of typing the LSU diacritical marks on many keyboards: very commonly, the letters Ö, Ü, É, ö, ü and é are replaced by O, U, E, o, u and e. This may cause fatal errors. For example: öltürüsh (to kill) ↔ olturush (to sit, party), térim yer (cultivable land) ↔ terim yer (who eats my sweat), yétim (orphan) ↔ yetim (spelling mistake). Besides, spelling mistakes due to a poor grasp of LSU rules are a significant problem. All these problems require intensive language processing. This functionality of the multiscript converting tool²⁸ that we have released on the internet is still under development. The following images illustrate our converting tools, which use the above-mentioned methods.

[Image 1. Offline plug-in version for Microsoft Word]

[Image 2. Online demo version]

7. Conclusions and future work

Our work to date has focused mainly on the design and implementation issues related to creating Uyghur Unicode fonts, as well as on the browser-level input method and the multiscript converting application. Judging from user feedback, we feel fairly satisfied with the results of this first ever research on Uyghur language processing.

The embeddable web fonts generated by the third-party software WEFT are compatible only with Internet Explorer. Therefore, we look forward to more efforts by the computer software industry to expand compatibility. We expect to improve the pre-processing module of the converting tool to make it more user-friendly. There are undoubtedly other theoretical issues to resolve, especially in the disambiguation of misspelled LSU words.

Another important problem is that the agglutinative nature of Uyghur, coupled with the associated spelling changes in root words, is a major impediment to developing spell-check functionality. This work is going to be the focus of our attention in the next stage of development.

Finally, we call on software companies not to omit Uyghur from their lists of supported languages in the future.

8. References

[1] Waris A. Janbaz, Online Uyghur Unicode processing technique and its implementation (publication in Chinese), Xinjiang University Press, China, 2002.
[2] Abdurehim, Waris A. Janbaz, Orthographic rules of the Latin-Script Uyghur (in Uyghur), 2004, http://www.ukij.org/teshwiq/UKY_Heqqide(KonaYeziq).htm.
[3] The Unicode Consortium, The Unicode Standard, Version 4.0, Addison-Wesley Professional, ISBN: 0321185781, USA, 2003.
[4] Xinjiang University, Proceedings 2000 International Conference on Multilingual Information Processing, Ürümchi (publication in Chinese), China, 2000.
[5] The Unicode Consortium Website, http://www.unicode.org
[6] Reinhard F. Hahn, Spoken Uyghur, Washington: the University of Washington Press, ISBN: 0-295-97015-4, USA, 1991.

Annex 1: Arabic-Script Uyghur, Cyrillic-Script Uyghur and Latin-Script Uyghur Alphabets

ASU    LSU    CSU
ﺋﺎ    a    а
ﺋﻪ    e    ә
ﺏ    b    б
پ    p    п
ﺕ    t    т
ﺝ    j    җ
چ    ch    ч
ﺥ    x    х
ﺩ    d    д
ﺭ    r    р
ﺯ    z    з
ژ    j (zh)    ж
ﺱ    s    с
ﺵ    sh    ш
ﻍ    gh    ғ
ﻑ    f    ф
ﻕ    q    қ
ﻙ    k    к
گ    g    г
ڭ    ng    ң
ﻝ    l    л
ﻡ    m    м
ﻥ    n    н
ھ    h    һ
ﺋﻮ    o    о
ﺋﯘ    u    у
ﺋﯚ    ö    ө
ﺋﯜ    ü    ү
ۋ    w    в
ﺋﯥ    é    е
ﺋﻰ    i    и
ﻱ    y    й

Additional Cyrillic letters: ы ё ц э ю я

Annex 2: Arabic-Script Uyghur Alphabet with shapes

18 UCSA: the Uyghur Computer Science Association (or UKIJ, Uyghur Kompyutér Ilimi Jem’iyiti, in Uyghur) is a non-profit association, founded by the author in Jan 2004. Web site: http://www.ukij.org
19 A National High-Tech Research Group, financed by the PRC government. The XJU branch is specialized in multilingual software development.
20 See http://www.microsoft.com/typography/OpenType%20Dev/arabic/intro.mspx for more information about developing OpenType Fonts for Arabic Script.
21 http://www.high-logic.com/fontcreator.html
22 http://www.fontlab.com/Font-tools/Fontographer
23 Free software at http://www.microsoft.com/typography/web/embedding/default.htm
24 More than 200,000 downloads counted since Dec 2003 from www.oyghan.com and www.bizuyghur.com/oyghan.
25 See www.ukij.org, www.biliwal.com, www.oyghan.com, www.uyghurdictionary.org etc.
26 The only exception is j (as in jurnal) in LSU.
27 This is the case of hypertext links, HTML tags and proper names.
28 Online demo version is available at http://www.uyghurdictionary.org/tools.asp, offline plug-in version for Microsoft Word is available at http://oyghan.com/OTB/index.html
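The pre-processing, character mapping and character converting modules of section 6 can be sketched in Python for one script pair, LSU to CSU. This is only an illustration under stated assumptions, not the released tool: the table follows the letter correspondence of Annex 1, the function names and the `keep_words` parameter are hypothetical, digraphs are matched greedily, input is lower-cased, the ambiguous j is always mapped to җ, the additional Cyrillic letters (я, ю and so on) are ignored, and no disambiguation is attempted.

```python
import re
import unicodedata

# Hypothetical LSU -> CSU "A_is_B" table, following Annex 1.
LSU_TO_CSU = {
    "ch": "ч", "sh": "ш", "gh": "ғ", "ng": "ң", "zh": "ж",
    "a": "а", "e": "ә", "b": "б", "p": "п", "t": "т", "j": "җ",
    "x": "х", "d": "д", "r": "р", "z": "з", "s": "с", "f": "ф",
    "q": "қ", "k": "к", "g": "г", "l": "л", "m": "м", "n": "н",
    "h": "һ", "o": "о", "u": "у", "ö": "ө", "ü": "ү", "w": "в",
    "é": "е", "i": "и", "y": "й",
}

# Longest keys first, so that digraphs win over single letters.
_TOKEN = re.compile(
    "|".join(sorted(map(re.escape, LSU_TO_CSU), key=len, reverse=True))
)


def _convert_fragment(fragment):
    """Character converting: map every LSU letter found in the table."""
    fragment = unicodedata.normalize("NFC", fragment).lower()
    return _TOKEN.sub(lambda m: LSU_TO_CSU[m.group(0)], fragment)


def lsu_to_csu(text, keep_words=()):
    """Convert LSU text to CSU, leaving protected spans untouched."""
    # Pre-processing: HTML tags, URLs and caller-supplied proper names
    # are protected and copied through unchanged.
    patterns = [r"<[^>]+>", r"https?://\S+"]
    patterns += [r"\b" + re.escape(word) + r"\b" for word in keep_words]
    protected = re.compile("(" + "|".join(patterns) + ")")
    pieces = []
    # With a single capturing group, re.split returns the protected
    # spans at the odd indexes of the resulting list.
    for index, piece in enumerate(protected.split(text)):
        pieces.append(piece if index % 2 else _convert_fragment(piece))
    return "".join(pieces)


if __name__ == "__main__":
    print(lsu_to_csu("bikar yürgiche bikar ishle"))
    print(lsu_to_csu("Men Photoshop ni", keep_words=["Photoshop"]))
```

Applied to the proverb of section 6, the sketch reproduces the CSU line given there. A conversion into ASU would additionally need the hamza rule for word-initial vowels, and a conversion out of hastily typed LSU would need the disambiguation step, which is the hard part and is left out here.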