Research Article Automatic Transliteration of Proper Names from Somali to English

Ahmed Muktar Omar*, Jian Qu and Sumeth Yuenyong School of Information Technology, Shinawatra University 99 Moo 10, Bangtoey, Samkhok, Pathum Thani 12160, Thailand

Abstract Transliterating of proper names is the process of converting words from source natural language (such as Somali) to a target natural language (such as English) while maintaining language pronunciation. Proper names and technical words are challenging in bilingual translation systems and also in Cross-Language Information Retrieval (CLIR) applications, due to their absence from most dictionaries. In this paper, we study an automatic transliteration from Somali to English; which is an under-studied problem. Our Somali-English transliteration system uses transliteration rules based on the orthographic mapping of the source language characters to the characters of the target language. We also propose an alignment method that maps the Somali characters when there is no direct matching character to get accurate transliteration in English. Our novel approach particularly enhances Somali-English transliteration.

Keywords: Somali-English; Somali Transliteration Table; Grapheme-Based

1. Introduction special characters, although the “ ‘ ” glottal Somali is a Cushitic language which stop stands for the (Arabic ). belongs to the family of Afro-Asiatic Somali also has three digraph (( ظ or ط) dh ,( ﺵ) ,( ﺥ) languages (or Hamito-Semitic). The Somali consonants (kh language is similar to Semitic languages such which are based on similar Arabic sounds. as Arabic and Hebrew. It is a mother tongue Somali orthography corresponds mostly to for ethnic Somalis in Greater Somalia and is Roman alphabets except where some by far the most well-documented of all characters are modified for the usage in Cushitic languages [1]. Somali characters, where the letters and Somali is the official language of designed to accommodate the voiced and Somalia and Djibouti and a working voiceless pharyngeal fricatives, comparable Somali long .[1] ( ع = and (ʕ ( ح = Language in the Somali regions of Ethiopia to ( and Kenya. Somali uses different writing vowels usually are written by doubling of the systems, and the has been the vowel itself. official writing system in the Federal The purpose of general transliteration Republic of Somalia and Djibouti since 1972 is not to introduce new sounds to the target [2]. It mostly uses the Roman alphabet language which the target language does not except for “p, ” without signs or provide. However, its purpose is to substitute

*Correspondence : [email protected] DOI 10.14456/tijsat.2016.26 Thammasat International Journal of Science and Technology Vol.21, No.4, October-December 2016 the original to the nearest letter in the transliteration approaches have been studied target language. The concept of long and in the literature, each of which brings out short vowels letters exists in Somali, and various processes in different languages. long vowels are usually indicated by These methods vary by the direction of repeating the vowel itself such as “aa”, “ee”, transliteration, writing systems of different “ii”, “oo”, “uu.” languages, or intended applications. For example, the Somali word Classification of these works is not Soomaaliya usually transliterated to English straightforward. as Somalia by omitting the long vowels. Earlier work has been done for Figure 1 demonstrated a basic word Machine Transliteration grapheme-based transliteration. approaches or phoneme-based approaches. Lee and Choi proposed a source channel model (SCM) a grapheme-based approach Somali OO AA IY A for English-Korean transliteration [3]. They used a direct orthographical mapping from source graphemes to target graphemes. Knight and Graehl proposed Japanese to English back-transliteration using the English S M A L I A similarity of SCM [4]. Wan and Verspoor

modeled a technique to transliterate proper

Figure 1. Basic Transliteration. names from English to Chinese using a phonetic procedure [5]. They proposed an In this paper we propose a Somali- algorithm for mapping from English English transliteration system that uses characters to Chinese characters based on transliteration rules based on the heuristics relationships between English orthographic mapping of the source language spelling and pronunciation, and stable characters to the characters of the target relationships between English phonemes and language. We also propose an alignment Chinese characters. method that maps the Somali characters Kang and Kim explored a forward- when there is no direct matching character to transliteration and back-transliteration for get accurate transliteration in English. English-Korean using a direct and pivot The structure of the paper is as method and then they used chunks of follows. In Section II, we describe the phonemes to perform the transliteration and previous study of machine transliteration. In back-transliteration [6]. Kang and Choi also Section III, we describe our character studied an English-Korean back- mapping method for Somali-English transliteration using a decision-tree learning transliteration; we also discuss our [7]. The English-Korean word alignment transliteration rules of Somali proper names procedure they used is similar to Lee and to English. In Section IV, we detail our Choi [3]. experiments, evaluation metrics and the Oh and Choi also studied a model for results we obtained; and Section V concludes English-Korean transliteration using the paper. pronunciation and contextual rules [8]. Their method was composed of two phases: 2. Related Work alignment and transliteration. In their first Transliteration refers to an phase, they aligned an English pronunciation orthographical transformation or phonetic unit (EPU) taken from a pronunciation change across two languages with different phrasebook and aligned it to Korean scripts. Many different generative phonemes to find the possible

18 Thammasat International Journal of Science and Technology Vol.21, No.4, October-December 2016 correspondence between the EPU and phonemes. Virga and Khudanpur presented Somali C EE L - M A C AA English-Chinese transliteration using a phonetic representation of English names into Chinese to support Cross-Lingual Speech and Text Processing Applications [9]. English L - M A ‘ A N AbdulJaleel and Larkey proposed a generative statistical transliteration model for English-Arabic transliteration using n- Figure 2. Forward Transliteration. gram methods [10]. The n-gram model Most of the transliteration methods generates strings of Arabic characters from a have been proposed between English and string of English characters. Malik proposed other common languages such as Arabic, a rule based Punjabi machine transliteration Chinese, or Japanese. Somali, not being a by transliterating a word between two scripts common language, is under-studied for both of Punjabi [11]. transliteration systems and cross-language Grapheme-based transliterations information retrieval applications. consider transliteration as an orthographic process rather than phonetic process and maps groups of graphemes/characters in the 3. Mapping and Transliteration rules source language word directly to groups of To align Somali/English characters, graphemes/characters in the target language we use a direct orthographic mapping word [12]. This approach also is known as between the Somali and English characters; a (spelling-based or direct methods) as it character alignment is given in Somali and its directly transforms the source language orthographic equivalent in English to find the graphemes into the target language most probable letters. graphemes without any phonetic knowledge We start by the alignment of the of the source and target languages. identical letters; in most cases, Somali words Instantaneously the phoneme-based methods are longer than their corresponding English require some steps in the transliteration transliterated words. The mapping type is process. However, most of the grapheme- either one-to-one letter or many-to-one letter based methods directly depend on the to avoid null mapping. information that is attainable from the For example, as shown in Figure 3, characters of the words. the Somali word Ceel-cadde is usually Forward transliteration is transliterated into English as El-Adde. transliterating a word as it is written in the source language such as Somali to a foreign Somali C E E L - C A DD E language such as English. For example, forward transliteration of a Somali name “Ceelmacaan” to English is “Elma’an”. Backward transliteration or back- transliteration is transliterating a word from English E L - A DD E its transliterated version back to the language of origin. For example, back-transliteration of “Elma’an” from English to Somali is Figure 3. Missing equivalent letters. “Ceelmacaan”. This example is shown in Figure 2. The drawback of the above direct mapping is the absence of some Somali

19 Thammasat International Journal of Science and Technology Vol.21, No.4, October-December 2016 letters and long vowels in English. This is the Table 1. Somali Transliteration Table. problem addressed by our method. consonant mapping vowel mapping 3.1 Somali Transliteration Table and its Problems Somali English Somali English As can been seen from Table 1, ‘ [b] a a Somali and English both use Roman [t] e e alphabets, though Somali has 24 letters, 19 [j] i i consonant monographs and five vowel x [h] o o monographs as well as three consonant [d] u u [r] digraphs and five long vowels. s [s] The mapping in Table 1 maps only c [ʕ] ‘ equivalent letters, and it is not enough to get [g] good transliteration, so we developed Somali [f] Transliteration Table (STT) similar to [q] [k] Buckwalter transliteration table [13]. It l [l] allows us to map the Somali letters with m [m] sounds that are either not present or used n [n] differently in English. In this case, we would [w] h [h] be able to increase the performance of the [y] transliteration. Consonant mapping: Consonants kh kh aa a sh sh ee e can be divided into two constants that have dh dh ii i similar phonetic properties and consonants oo o that are either not present in English or uu u pronounced differently, for example, the Somali “b” letter matches English “b” and Vowel mapping: Somali has five “p” letters. vowel monographs; Somali vowels have one Consonants which are unique to to one correspondence with English vowels. Somali are (c, x and the “ ‘ ”) and However, Somali is different regarding long for the Arabic Hamza, these consonants vowels, which are short vowels repeated frequently occur in words. Somali syllable twice. We map the Somali diphthong vowels structure is based on Consonant Vowel with double English vowels. Consonant (CVC) and clusters of two As shown in Table 1, the total consonants that do not occur at the beginning number of letters in Somali and English are or the end of a word, but only happen at not equal. The Somali letters “C”, and the “ ‘ syllable boundaries. Somali glottal stop “ ’ ” ” glottal stop have no equivalent mapping in or the Arabic Hamza usually is not written English. These letters will never be mapped unless it happens at the border of a syllable in Somali to English transliteration using a or in the middle of the word. direct orthographic mapping. Another problem is the use of long vowels in Somali. We came up with novel dependency rules which address the problem of no direct equivalent letter in Somali to English transliteration.

20 Thammasat International Journal of Science and Technology Vol.21, No.4, October-December 2016

3.2 Dependency rules If “Y” happens in the middle of the Character to character mapping only word after an “E” vowel, but not between is not satisfactory to get an acceptable result two “E” vowels, then “Y” is omitted. for the transliteration. We need to add a Finally, if “Y” occurs in the middle particular dependency or appropriate rules of the word after and “A” vowel, but not for constructing accurate transliteration. between two “A” vowels, then “Y” is Consonants: Somali consonants omitted. are transliterated into their corresponding Long vowels: Somali long vowels English consonants; here we discuss the are twice as long as short vowels and are consonants that are unique to Somali and written as double vowels. how to transliterate them into English. The rules of the Somali long vowels Starting with “C” letter called transliterated by transforming the long “Ceyn” in Somali, when a “C” occurs at the vowels to short vowels. beginning, and the end of a word then “C” For example, if long vowel “AA” will be omitted. occurs in a word it is substituted with short “C” also is omitted when it occurs vowel “A” and the rest of the long vowels between two different vowels, but it is follow the same procedure. replaced with the ‘glottal stop. The exact order of application of “C” is also transliterated into “ ‘ ” these dependency rules can be seen in glottal stop if C appears at the end of mid- Algorithm 1. syllable and the next syllable is a consonant. If “C” occurs at the beginning of a word and is followed by “U” then “C” is omitted and “U” is transliterated into “O.” The letter “X” is assigned to a Somali sound, so there is no equivalent English letter. Therefore, “X” is transliterated into “H” which is the nearest English letter. “X” always transliterated into “H” no matter the position in which it occurs. If “X” occurs in the middle of a word behind the letter “U” then “X” is replaced with “H” and “U” is replaced with “O”. Hamza “ ’ ” is shown only if it occurs between the same vowels, or when it takes place in a single syllable word, but most of the cases are not written. The letter “Y” in Somali is treated as a consonant, but in English it is regarded as both vowel and consonant. If “Y” occurs in the middle of the word and is followed by “I” but not between two “I” vowels, then “Y” is omitted.

If “Y” occurs in the middle of the word after “I”, but not between two “I” vowels, then “Y” is omitted.

21 Thammasat International Journal of Science and Technology Vol.21, No.4, October-December 2016

3.3 Somali transliteration Process Algorithm1: Somali transliteration rules The Somali transliteration architecture and its functionality are Require Somali dependency rules discussed in this section. Figure 4 describes Insert string S (Somali word) the structure of our proposed transliteration Search for pattern in Regex Foreach all dependency rules method, the rules for transliteration, and its If S contain C letter implementation using regular expression While C is matched pattern matching algorithm (Algorithm1). Do If ((C) letter occur initial of a word) && next vowel Somali OOV not "u" Omit C; Else if (vowel is "u") Omit C && replace "u", "o"; Else if (C in between same vowels) or ultimate of syllable) Mapping Replace (" C “," ' ") Else Transliteration Table Omit C; Check for End while rules If S contain X letter While X matched Do if next vowel to X is "u" replace "x" with "h" && "u" Englsih OOV with "o" Else Replace "x" with "h". If S contain long vowels while long vowels "aa", "ee" , Figure 4. Somali Transliteration "ii", "oo" , "uu" is matched Architecture. Do Replace all with their short vowels "a”, "e", "i", "o”, and The system takes Somali text as “u" input and searches for the letters to align each Else letter to its corresponding letter in English as If (Y in mid-word vowel before a pattern to match in the transliteration table is "i" or vowel after Y is "i" which consists of the character mappings and not between two "i") dependency rules. If there are no similar Omit Y; letters in the alignment, it looks for the else if (Y in mid-word vowel dependency rules to check if there is an before is "e" or vowel after Y is "e" not between two "e") applicable rule, which it then applies and Replace "y" , "i" sends to the transliteration unit. The else if (Y in mid-word vowel transliteration unit replaces the matched before is "a" and next letter is pattern to transliterate and then outputs the consonant) word as an English text. Replace "y" , "i" Endwhile Endif No pattern to match End foreach Output S as T (English word)

22 Thammasat International Journal of Science and Technology Vol.21, No.4, October-December 2016

4. Experiments We evaluate the BLEU score using In this section, we describe the the Interactive BLEU tool developed by experimental data which includes the data Madnani, N [15]. The system metric used for testing as well as the data used for measures the n-gram (n=1 to 4) precisions of validation purpose. the hypothesis against the reference.

4.1 Data Set 4.3 Results We obtained 600 Somali words and After the selected input Somali texts English transliteration from SomaliNames they are transliterated into English texts by website [14] which contains bilingual using the Somali Transliteration tool, the instances of the most frequently used Somali transliterated English texts are verified for names with their English transliterations. mistakes and accuracy. Measurement is The data were divided randomly into two accomplished with the help of the sets: 400 names were used for developing the transliterated words retrieved from rules and the remaining 200 words were SomaliNames of Somali and English [16]. chosen as testing for the transliteration rules. The English transliterated words were used Table 2. Results of Somali transliteration. as a reference to verify the correctly Type Accuracy transliterated Somali names.

4.2Evaluation metrics DOM 34.62% The results of Somali transliteration STT 64.73% to English were measured by the number of the transliterated words that correctly STT with DR 96.53% matched the transliterated words obtained from the Somali website divided by the total number of the phrase in the validation set. From the experimental results shown Word accuracy (WA), also known in Table 2, it is clear that the Somali as transliteration accuracy, measures the transliteration (STT) with Dependence rules proportion of transliterations that are correct. that we have developed gives more than 96.53% accuracy. Also, the Table STT Number of correctly transliterated words indicates better outcomes at 64.73% than the 푊퐴 = Total Number of Reference words direct orthographical mapping (DOM) which gave 34.62% accuracy on the Somali names The transliteration accuracy or word list. So our transliteration system accuracy is a measurement of the percentage accomplishes the requirement of of transliterated Somali words to English. transliteration across Somali to English. BLEU (Bilingual Evaluation Understudy) allowances for multiple reference translations, it is used to evaluate many term to term bilingual translation researches [14]. The BLEU technique offers a score between 0 and 1, which is a scale showing how alike the target language word is to the reference word; 0 is the least score and 1 is the best score.

23 Thammasat International Journal of Science and Technology Vol.21, No.4, October-December 2016

iBLEU results Spelling-based approaches are considered to be easier to implement and 98.30% 98.87% 99.15% 96.64% show better performance than the phonetic- 85.58% 88.98% 79.18% based approaches because they do not 77.53% 71.23% depend on pronunciation dictionaries which 62.69% 60.11% may not consist of the pronunciations of all words. Last but not least, we have chosen transliteration rules rather than machine 36.13% learning methods due to the unavailability of a large corpus containing Somali-English paired-words. The transliteration table along with the dependency rules proposed in this paper improved accuracy of transliteration significantly. Unigram Bigram Trigram Four-gram DOM STT STT with DR 6. References [1] Lecarme, J. and Maury, C., A Figure 5. iBLEU Score. Software Tool for Research in Linguistics and Lexicography: Figure 5 demonstrates the results Application to Somali, Computers and given by the BLEU interactive tool to Translation, Vol.2 No, 1, pp.21-36, compare the DOM and our proposed STT 1987. and dependency rules over the 600 proper [2] Andrzejewski, B.W., The Introduction names from the Somali Names data. Using a of a National Orthography for Somali, unigram, bigram 3-gram and 4-gram, DOM School of Oriental and African and STT DR score for all grams shows above Studies, 1974. 0.9664 (almost 1) as the BLEU score is [3] Lee, J.S. and Choi, K.S., English to between 0 to 1. The STT alone scored Korean Statistical Transliteration for 0.6269, nearly twice as high as the DOM. Information Retrieval, Computer We see that the higher the N-gram, Processing of Oriental Languages, the higher the result, with the DOM scores of Vol. 12, No. 1, pp.17-37, 1998. 0.3613, 0.6011, 0.7123 and 0.7753 for [4] Knight, K. and Graehl, J., Machine Unigram Bigram, Trigram and four-gram, Transliteration, Computational respectively. Linguistics, Vol.24, No.4, pp.599-612, 1998. 5. Conclusion [5] Wan, S. and Verspoor, C.M., In this paper, we explored the Automatic English-Chinese Name automatic forward transliteration for Somali Transliteration for Development of proper names to English. This language-pair Multilingual Resources, In is not well studied in any automatic Proceedings of the 17th international transliteration work. All our transliteration conference on Computational rules and alignments procedure presented linguistics-Volume 2, pp. 1352-1356, here are based on a direct orthographic 1998. transformation. Our method avoids the [6] Kang, I.H. and Kim, G., English-to- intermediate phonetic interpretation used in Korean Transliteration Using Multiple phoneme-based methods and reduces the Unbounded Overlapping Phoneme transliteration error rate. chunks, In Proceedings of the 18th conference on Computational 24 Thammasat International Journal of Science and Technology Vol.21, No.4, October-December 2016

linguistics-Volume 1, pp. 418-424, Computational Linguistics and the 2000. 44th Annual Meeting of the [7] Kang, B.J. and Choi, K.S., Two Association for Computational Approaches for the Resolution of Linguistics, pp. 1137-1144, 2006. Word Mismatch Problem Caused by [12] Karimi, S., 2008, Machine English Words and Foreign Words in Transliteration of Proper Names Korean Information between English and Persian, Ph.D. Retrieval, International Journal of Thesis, RMIT University, Melbourne, Computer Processing of Oriental 216p. Languages, Vol.14, No.2, pp.109-131, [13] Buckwalter, T.A., Lexicographic 2001. Notation of Arabic Noun Pattern Morphemes and Their Inflectional [8] Oh, J.H. and Choi, K.S., An English- Features, In Proceedings of the Korean Transliteration Model Using Second Cambridge Conference on Pronunciation and Contextual Rules, Bilingual Computing in Arabic and In Proceedings of the 19th English, pp. 5-7, 1990. international conference on [14] Papineni, K., Roukos, S., Ward, T. and Computational linguistics-Volume 1, Zhu, W.J., BLEU: A Method for pp. 1-7, 2002. Automatic Evaluation of Machine [9] Virga, P. and Khudanpur, S., Translation, In Proceedings of the Transliteration of Proper Names in 40th Annual Meeting on Association Cross-lingual Information Retrieval, for Computational Linguistics, pp. In Proceedings of the ACL 2003 311-318, 2002. Workshop on Multilingual and Mixed- [15] Madnani, N., iBLEU: Interactively Language Named Entity Recognition- Debugging and Scoring Statistical Volume 15, pp. 57-64, 2003. Machine Translation Systems, [10] AbdulJaleel, N. and Larkey, L.S., In Semantic Computing (ICSC), 2011 Statistical Transliteration for English- Fifth IEEE International Arabic Cross Language Information Conference, pp. 213-214, 2011. Retrieval, In Proceedings of the [16] Somali Names List with Their English Twelfth International Conference on Transliteration Retrieved Information and Knowledge Somalinames Directory. Available Management, pp. 139-146, 2003. Source: [11] Malik, M.G., Punjabi Machine http://www.somalinames.com/index.p Transliteration, In Proceedings of the hp/somali-names/soma-names.csv 21st International Conference on

25