Automatic Transliteration of Proper Names from Somali to English
Total Page:16
File Type:pdf, Size:1020Kb
Research Article Automatic Transliteration of Proper Names from Somali to English Ahmed Muktar Omar*, Jian Qu and Sumeth Yuenyong School of Information Technology, Shinawatra University 99 Moo 10, Bangtoey, Samkhok, Pathum Thani 12160, Thailand Abstract Transliterating of proper names is the process of converting words from source natural language (such as Somali) to a target natural language (such as English) while maintaining language pronunciation. Proper names and technical words are challenging in bilingual translation systems and also in Cross-Language Information Retrieval (CLIR) applications, due to their absence from most dictionaries. In this paper, we study an automatic transliteration from Somali to English; which is an under-studied problem. Our Somali-English transliteration system uses transliteration rules based on the orthographic mapping of the source language characters to the characters of the target language. We also propose an alignment method that maps the Somali characters when there is no direct matching character to get accurate transliteration in English. Our novel approach particularly enhances Somali-English transliteration. Keywords: Somali-English; Somali Transliteration Table; Grapheme-Based 1. Introduction special characters, although the “ ‘ ” glottal Somali is a Cushitic language which stop stands for the (Arabic Hamza). belongs to the family of Afro-Asiatic Somali also has three digraph (( ظ or ط) dh ,( ﺵ) sh ,( ﺥ) languages (or Hamito-Semitic). The Somali consonants (kh language is similar to Semitic languages such which are based on similar Arabic sounds. as Arabic and Hebrew. It is a mother tongue Somali orthography corresponds mostly to for ethnic Somalis in Greater Somalia and is Roman alphabets except where some by far the most well-documented of all characters are modified for the usage in Cushitic languages [1]. Somali characters, where the letters c and x Somali is the official language of designed to accommodate the voiced and Somalia and Djibouti and a working voiceless pharyngeal fricatives, comparable Somali long .[1] ( ع = and (ʕ ( ح = Language in the Somali regions of Ethiopia to (h and Kenya. Somali uses different writing vowels usually are written by doubling of the systems, and the Latin alphabet has been the vowel itself. official writing system in the Federal The purpose of general transliteration Republic of Somalia and Djibouti since 1972 is not to introduce new sounds to the target [2]. It mostly uses the Roman alphabet language which the target language does not except for “p,v z” without diacritic signs or provide. However, its purpose is to substitute *Correspondence : [email protected] DOI 10.14456/tijsat.2016.26 Thammasat International Journal of Science and Technology Vol.21, No.4, October-December 2016 the original letter to the nearest letter in the transliteration approaches have been studied target language. The concept of long and in the literature, each of which brings out short vowels letters exists in Somali, and various processes in different languages. long vowels are usually indicated by These methods vary by the direction of repeating the vowel itself such as “aa”, “ee”, transliteration, writing systems of different “ii”, “oo”, “uu.” languages, or intended applications. For example, the Somali word Classification of these works is not Soomaaliya usually transliterated to English straightforward. as Somalia by omitting the long vowels. Earlier work has been done for Figure 1 demonstrated a basic word Machine Transliteration grapheme-based transliteration. approaches or phoneme-based approaches. Lee and Choi proposed a source channel model (SCM) a grapheme-based approach Somali S OO M AA L IY A for English-Korean transliteration [3]. They used a direct orthographical mapping from source graphemes to target graphemes. Knight and Graehl proposed Japanese to English back-transliteration using the English S O M A L I A similarity of SCM [4]. Wan and Verspoor modeled a technique to transliterate proper Figure 1. Basic Transliteration. names from English to Chinese using a phonetic procedure [5]. They proposed an In this paper we propose a Somali- algorithm for mapping from English English transliteration system that uses characters to Chinese characters based on transliteration rules based on the heuristics relationships between English orthographic mapping of the source language spelling and pronunciation, and stable characters to the characters of the target relationships between English phonemes and language. We also propose an alignment Chinese characters. method that maps the Somali characters Kang and Kim explored a forward- when there is no direct matching character to transliteration and back-transliteration for get accurate transliteration in English. English-Korean using a direct and pivot The structure of the paper is as method and then they used chunks of follows. In Section II, we describe the phonemes to perform the transliteration and previous study of machine transliteration. In back-transliteration [6]. Kang and Choi also Section III, we describe our character studied an English-Korean back- mapping method for Somali-English transliteration using a decision-tree learning transliteration; we also discuss our [7]. The English-Korean word alignment transliteration rules of Somali proper names procedure they used is similar to Lee and to English. In Section IV, we detail our Choi [3]. experiments, evaluation metrics and the Oh and Choi also studied a model for results we obtained; and Section V concludes English-Korean transliteration using the paper. pronunciation and contextual rules [8]. Their method was composed of two phases: 2. Related Work alignment and transliteration. In their first Transliteration refers to an phase, they aligned an English pronunciation orthographical transformation or phonetic unit (EPU) taken from a pronunciation change across two languages with different phrasebook and aligned it to Korean scripts. Many different generative phonemes to find the possible 18 Thammasat International Journal of Science and Technology Vol.21, No.4, October-December 2016 correspondence between the EPU and phonemes. Virga and Khudanpur presented Somali C EE L - M A C AA N English-Chinese transliteration using a phonetic representation of English names into Chinese to support Cross-Lingual Speech and Text Processing Applications [9]. English E L - M A ‘ A N AbdulJaleel and Larkey proposed a generative statistical transliteration model for English-Arabic transliteration using n- Figure 2. Forward Transliteration. gram methods [10]. The n-gram model Most of the transliteration methods generates strings of Arabic characters from a have been proposed between English and string of English characters. Malik proposed other common languages such as Arabic, a rule based Punjabi machine transliteration Chinese, or Japanese. Somali, not being a by transliterating a word between two scripts common language, is under-studied for both of Punjabi [11]. transliteration systems and cross-language Grapheme-based transliterations information retrieval applications. consider transliteration as an orthographic process rather than phonetic process and maps groups of graphemes/characters in the 3. Mapping and Transliteration rules source language word directly to groups of To align Somali/English characters, graphemes/characters in the target language we use a direct orthographic mapping word [12]. This approach also is known as between the Somali and English characters; a (spelling-based or direct methods) as it character alignment is given in Somali and its directly transforms the source language orthographic equivalent in English to find the graphemes into the target language most probable letters. graphemes without any phonetic knowledge We start by the alignment of the of the source and target languages. identical letters; in most cases, Somali words Instantaneously the phoneme-based methods are longer than their corresponding English require some steps in the transliteration transliterated words. The mapping type is process. However, most of the grapheme- either one-to-one letter or many-to-one letter based methods directly depend on the to avoid null mapping. information that is attainable from the For example, as shown in Figure 3, characters of the words. the Somali word Ceel-cadde is usually Forward transliteration is transliterated into English as El-Adde. transliterating a word as it is written in the source language such as Somali to a foreign Somali C E E L - C A DD E language such as English. For example, forward transliteration of a Somali name “Ceelmacaan” to English is “Elma’an”. Backward transliteration or back- transliteration is transliterating a word from English E L - A DD E its transliterated version back to the language of origin. For example, back-transliteration of “Elma’an” from English to Somali is Figure 3. Missing equivalent letters. “Ceelmacaan”. This example is shown in Figure 2. The drawback of the above direct mapping is the absence of some Somali 19 Thammasat International Journal of Science and Technology Vol.21, No.4, October-December 2016 letters and long vowels in English. This is the Table 1. Somali Transliteration Table. problem addressed by our method. consonant mapping vowel mapping 3.1 Somali Transliteration Table and its Problems Somali English Somali English As can been seen from Table 1, ‘ b [b] a a Somali and English both use Roman t [t] e e alphabets, though Somali has 24 letters, 19 j [j] i i consonant