An Automatic Greeklish to Greek Transliteration System
All Greek to me! An automatic Greeklish to Greek transliteration system

Aimilios Chalamandaris, Athanassios Protopapas, Pirros Tsiakoulis, Spyros Raptis
Institute for Language and Speech Processing
Epidavrou & Artemidos 6, 15125 Maroussi, Greece
{achalam, protopap, ptsiak, spy}@ilsp.gr

Abstract

This paper presents research on “Greeklish,” that is, a transliteration of Greek using the Latin alphabet, which is used frequently in Greek e-mail communication. Greeklish is not standardized, and a number of competing conventions co-exist, based on personal preferences regarding similarities between Greek and Latin letters in shape, sound, or keyboard position. Our research has led to the development of “All Greek to me!,” the first automatic transliteration system that can cope with any type of Greeklish. In this paper we first present previous research on Greeklish, describing other approaches that have attempted to deal with the same problems. We then provide a brief description of our approach, illustrating the functional flowchart of our system and the main ideas that underlie it. We present measures of system performance, based on about a year’s worth of usage as a public web service, and preliminary research, based on the same corpus, on the use of Greeklish and the trends in preferred Latin-Greek letter mapping. We evaluate the consistency of different transliteration patterns among users as well as the within-user consistency based on coherent principles. Finally, we outline planned future research to further understand the use of Greeklish and to improve “All Greek to me!” so that it functions reliably when embedded in integrated communication platforms bridging e-mail to mobile telephony and ubiquitous connectivity.

1. Introduction and background

The word “Greeklish” is a combination of “Greek” and “English” (Greek-lish) and refers to the transliteration of Greek using the Latin alphabet. This Romanization is used frequently in e-mail communication among Greek-speaking computer users, and its main characteristic is the lack of a standardized table of transliteration mappings. More specifically, Greeklish is a significantly inconsistent manner of transliterating Greek with the Latin alphabet, based on alternative co-existing conventions, which mainly depend on personal preferences regarding similarities between Greek and Latin letters in shape, sound, or even keyboard layout. Before operating systems became fully compatible with the Greek alphabet, Greeklish was the main means of communication among users. Nowadays, even though most operating systems and programs support the Greek character set, Greeklish remains one of the main tools for safe communication via e-mail.

Several studies of Greeklish [3,4] have shown that nearly all Greek-speaking computer users have used Greeklish at least once as a means of communication via e-mail; at the same time, more than 50% of users over 35 years old consider Greeklish to be a necessary evil in everyday computer use. Another important aspect of this Romanization is its difficulty: it has been found that reading a text written in Greeklish demands at least 40% more time and effort than reading the same text in plain Greek, even for experienced users of Greeklish [11]. Greeklish has been an apple of discord in the past [1], and its impact on the actual quality of the content it delivers is also a subject of research by linguists [11,13].

1.1. Types of Greeklish

One of the earliest studies [1] of the Greeklish phenomenon classified the basis for transliteration into three distinct categories:

1. Based on sound resemblance, aiming to represent phonetically the respective Greek text, e.g. the Greek letter /θ/ yields /th/ and the diphthong /αι/ yields /e/.
2. Based on similarities between Greek and Latin letter shapes, e.g. /8/ for the letter /θ/ and /w/ for the letter /ω/.
3. Based on similarities in the keyboard layout, e.g. /u/ for the letter /θ/ and /c/ for the letter /ψ/.

Several other researchers [10,11,12] have agreed with this classification; nevertheless, the validity of this hypothesis has not so far been tested empirically on usage data. In this paper we present a first approach to this question in section 3.

1.2. Approaches to automatic transliteration

Since the appearance of Greeklish, several attempts have been presented in the literature, either as ad hoc approaches to automatic transliteration [14] or as more complete applications with advanced features such as e-mail client services. Most of these applications are distributed freely and are based on a specific, fixed set of transliteration rules, simply replacing every Latin letter with a corresponding Greek letter. Few of these applications make use of regular-expression techniques in order to better cope with different context-dependent patterns [9]. One application particularly worth mentioning is aspell [5], an open source spell checker for Linux environments and the OpenOffice suite. Aspell first maps all Latin characters to Greek ones via a specific mapping set and then applies its conventional method of spell checking and correction.

Our approach, apart from the incorporation of dictionaries, differs from all of the aforementioned ones in three important respects. First, we use an intermediate stage of phonetic representation of all Greeklish words, which provides faster and more robust results than passing directly to Greek characters. Second, we use probabilistic models to decide on the optimal mapping from Latin to Greek characters, as well as on the most probable word. Third, our system can very efficiently handle mixed texts containing Greeklish and non-Greeklish words, using a language identification algorithm.

2. Our Approach

In this section we present the ideas that underlie our approach as implemented in All Greek to me!, developed at ILSP [6]. All Greek to me! is the first automatic transliteration system that can cope with virtually any type of Greeklish and provide orthographically correct Greek text. The following figure shows the general flowchart of the system.

[Figure 1: Flowchart of All Greek to me! system. Stages shown: Input; Greeklish-to-phonemes rules, producing phonetic representations of each Greeklish word; language identification based on trigrams, and pruning (Greeklish or non-Greeklish?); choice of the most probable word according to word probability and orthography, using a Greek lexicon with relative probabilities; Output.]

The first step of its operation is to transcribe Greeklish into all possible phonetic representations using a set of manually defined rules (enriched after initial testing [6]). This intermediate stage helps prune alternatives by means of a phonotactic model for Greek, which at the same time performs language identification [8]. Using trigram probabilities, every phonetic word is assigned a score according to its constituent phonetic sequence. Instances that score below a specific threshold are considered to be non-Greeklish words (that is, foreign words) and are therefore left intact in Latin characters. Instances scoring above the threshold are passed on to the next level and projected onto a large lexicon that contains relative probabilities of appearance in general-purpose Greek texts, such as newspapers or news broadcasts. The projection is based on the phonetic representation; hence, for every word in the lexicon we have also stored the corresponding phonetic sequence. The detailed structure and function of All Greek to me! has been presented in [6].

3. Usage data

In this section we present analyses of usage of the system, which is provided as a free web-based service at ILSP’s official web site [2].

3.1. User distribution

Analysis of the performance is based on data acquired in the ten-month period from January 2005 through October 2005 via the online demo version of our application [2], which limits each conversion request to 255 characters. This sample is very important because it constitutes a large corpus of real-life, unbiased Greeklish, and as such it allows us to derive objective conclusions about our system. The total size of the corpus is 2,095,037 words, including 145,601 unique words. The total number of entries (conversion requests) was 171,698, received from 18,868 unique IP addresses. The latter number does not represent the number of unique users, because an estimated 41% of the users do not have static IP addresses and may therefore be represented in the corpus under alternative IP identities. Users originated in 83 different countries, of which the most frequent are listed in Table 1.

No   Country          Requests %   Unique IPs %
1    Greece           47.28%       53.75%
2    Germany          15.85%       16.89%
3    United Kingdom   11.90%       4.85%
4    United States    7.18%        5.36%
5    Australia        3.61%        2.03%
6    France           1.98%        2.01%
7    Netherlands      1.86%        0.43%
8    Italy            1.62%        1.22%
9    Cyprus           1.39%        1.68%
10   Belgium          1.09%        1.15%
     Rest             6.24%        5.31%

Table 1: Countries of origin of the conversion requests making up the corpus.

With the use of cookies, we estimate that on average 62.3% of the users are frequent users. By the time this paper was written, use of our web service had doubled within a five-month period, exceeding 38,000 different users and more than 5,200,000 converted words.

3.2. Transliteration pattern preferences

A series of hierarchical log-linear models with and without a latent class were constructed in order to test the hypothesis that users of Greeklish tend to prefer one of the three main modes of transliteration (visual, phonetic, keyboard layout). For this test we used transcriptions yielding /η/, /υ/, /ω/, /θ/, and /ου/. These five graphemes are the only ones that easily admit all three modes of transcription and produce distinct outcomes (visual: /n/, /u/, /w/, /8/, /ou/; phonetic: /i/, /i/, /o/, /th/, /u/; keyboard: /h/, /y/, /v/, /u/, /oy/, respectively). Under the assumption that a