Machine Transliteration of Proper Names Between English and Persian
Machine Transliteration of Proper Names between English and Persian

A thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy

Sarvnaz Karimi
BEng. (Hons.), MSc.

School of Computer Science and Information Technology, Science, Engineering, and Technology Portfolio, RMIT University, Melbourne, Victoria, Australia.

March, 2008

Declaration

I certify that except where due acknowledgement has been made, the work is that of the author alone; the work has not been submitted previously, in whole or in part, to qualify for any other academic award; the content of the thesis is the result of work which has been carried out since the official commencement date of the approved research program; and, any editorial work, paid or unpaid, carried out by a third party is acknowledged.

Sarvnaz Karimi
School of Computer Science and Information Technology
RMIT University
March, 2008

Acknowledgements

During my PhD candidature, I have had the privilege to work with great supervisors, Dr Andrew Turpin and Dr Falk Scholer, who offered me their trust and patience. Their assistance, advice, and constructive comments and criticism were invaluable to me. I am proud of all the joint work we did during my candidature. To them I owe my deepest gratitude.

My gratitude also goes to Dr Ali Moeini, my masters degree supervisor, who had been a never-ending source of encouragement, which led to my pursuit of a PhD degree.

Many thanks to all of the volunteers who dedicated their time to construct the bilingual corpora I required for my experiments: Behnaz Abdollahi, Naghmeh Alibeik Zade, Shahrzad Bahiraie, Nahid Bozorgkhou, Azam Jalali, Haleh Jenab, Masoumeh Kargar, Mahshid Mehrtash, Mitra Mirzarezaee, Apameh Pour Bakhshandeh, Sahar Saberi, Sima Tarashion, my dear sisters Nastaran and Nazgol, and many others.
The whole journey of my life as a research student at RMIT was made enjoyable by my friends at the search engine group, in particular, Dayang, Pauline, and Ying.

I am also grateful for the international postgraduate research scholarship (IPRS) provided by the Australian government, and also the financial support from the school of Computer Science and Information Technology, RMIT University.

Last but not the least, I would like to dedicate this thesis to my mother for her understanding, encouragement, and immense support; and to the memory of my father for inspiring my fascination in learning and science, and his support in pursuing my studies.

Sarvnaz Karimi

Credits

Portions of the material in this thesis have previously appeared in the following publications:

• Sarvnaz Karimi, Andrew Turpin, Falk Scholer, English to Persian Transliteration, In Proceedings of 13th Symposium on String Processing and Information Retrieval, Volume 4209 of Lecture Notes in Computer Science, pages 255–266, Glasgow, UK, October 2006.
• Sarvnaz Karimi, Andrew Turpin, Falk Scholer, Corpus Effects on the Evaluation of Automated Transliteration Systems, In Proceedings of The 45th Annual Meeting of the Association for Computational Linguistics, pages 640–647, Prague, Czech Republic, June 2007.
• Sarvnaz Karimi, Falk Scholer, Andrew Turpin, Collapsed Consonant and Vowel Models: New Approaches for English-Persian Transliteration and Back-Transliteration, In Proceedings of The 45th Annual Meeting of the Association for Computational Linguistics, pages 648–655, Prague, Czech Republic, June 2007.
• Sarvnaz Karimi, Effective Transliteration, Australasian Computing Doctoral Consortium, Ballarat, Australia, January 2007.
• Sarvnaz Karimi, Enriching Transliteration Lexicon by Automatic Transliteration Extraction, HCSNet SummerFest’07: Student Posters, The University of New South Wales, Sydney, Australia, December 2007.
• Sarvnaz Karimi, Falk Scholer, Andrew Turpin, Combining Transliteration Systems by Black-Box Output Integration, RMIT University Technical Report TR-08-01, 2008.

This work was supported by the Australian government Endeavour International Postgraduate Research Scholarships program and the Information Storage, Analysis and Retrieval group at RMIT University.

The thesis was written in the Vim editor on Mandriva GNU/Linux, and typeset using the LaTeX 2ε document preparation system.

All trademarks are the property of their respective owners.

Note

Unless otherwise stated, all fractional results have been rounded to the displayed number of decimal figures.

Contents

Abstract
1 Introduction
    1.1 Motivation
    1.2 Challenges in Machine Transliteration
        1.2.1 Script Specifications
        1.2.2 Missing Sounds
        1.2.3 Transliteration Variants
    1.3 Scope of the Thesis
        1.3.1 Transliteration Effectiveness
        1.3.2 Transliteration Evaluation
    1.4 Structure of the Thesis
2 Background
    2.1 Machine Translation
        2.1.1 Text Corpus
        2.1.2 Statistical Machine Translation
            Training Model
            Decoding Model
        2.1.3 Word Alignment
            IBM Model 1
            IBM Model 3
            HMM Alignment
            A Word-Alignment Tool: GIZA++
    2.2 Machine Transliteration
        2.2.1 Transliteration Process
        2.2.2 Transliteration Model
        2.2.3 Bilingual Transliteration Corpus
        2.2.4 Evaluation Metrics
    2.3 Approaches of Generative Transliteration
        2.3.1 Phonetic-based Methods
        2.3.2 Spelling-based Methods
        2.3.3 Hybrid Methods
        2.3.4 Combined Methods
    2.4 Approaches of Transliteration Extraction
    2.5 Methodology
        2.5.1 English-Persian Transliteration Corpus
        2.5.2 Persian-English Transliteration Corpus
        2.5.3 Multi-Transliterator Corpus
        2.5.4 Evaluation Methodology
            Single-Transliterator Metrics
            Multi-Transliterator Metrics
            System Evaluation
            Human Evaluation
    2.6 Summary
3 Transliteration Based on N-grams
    3.1 Introduction
    3.2 N-grams in the Transliteration Process
        3.2.1 Segmentation and Transformation Rule Generation
        3.2.2 Transliteration Generation
    3.3 Implementation Description
        3.3.1 Transformation Rule Selection
        3.3.2 Transliteration Variation Generation
        3.3.3 Ambiguous Alignments
    3.4 Experimental Setup
    3.5 Past Context
    3.6 Future Context
    3.7 Past and Future Context: Symmetrical and Non-Symmetrical
    3.8 Conclusions
4 Transliteration Based on Consonant-Vowel Sequences
    4.1 Introduction
    4.2 Consonants and Vowels: Clues for Word Segmentation
    4.3 Collapsed-Vowel Approach: CV-MODEL1
    4.4 Collapsed-Vowel Approach: CV-MODEL2
    4.5 Collapsed Consonant-Vowel Approach: CV-MODEL3
    4.6 Experimental Setup
    4.7 Comparison of Segmentation Algorithms
    4.8 English to Persian Character Alignment
        4.8.1 Alignment Algorithm
        4.8.2 Impact of Alignment on Transliteration Accuracy
    4.9 English to Persian Back-Transliteration
    4.10 Summary
5 Combining Automated Transliteration Systems
    5.1 Experimental Setup
    5.2 System Combination
        5.2.1 Individual Systems
        5.2.2 Analysis of Individual Outputs
    5.3 Combination Schemes
        5.3.1 Majority Vote in Top-n (MVn)
        5.3.2 Classification-based System Selection
    5.4 Experiments and Discussion
        5.4.1 Effect of Majority Voting
        5.4.2 Effect of Classifier
        5.4.3 Effect of the Combined Voting and Classification
    5.5 Conclusions
6 Corpus Effects on Transliteration Evaluation
    6.1 Introduction
    6.2 Analysis of the Difference of the Single-Transliterator Corpora
        6.2.1 Language-separated Corpora
        6.2.2 Corpus Scale
    6.3 Effect of Evaluation Scheme
    6.4 Effect of Transliterator
        6.4.1 Analysis of Evaluation using Combined Corpora
        6.4.2 Transliterator Consistency
        6.4.3 Inter-Transliterator Agreement and Perceived Difficulty
    6.5 Conclusions
7 Conclusion
    7.1 Effective Machine Transliteration
    7.2 Transliteration Evaluation
    7.3 Future Work
A Abbreviations
B Terminology
C International Phonetic Alphabet (IPA)
D Handcrafted English-Persian and Persian-English Transformation Rules
E A Sample of English-Persian Transliteration Corpus
Bibliography

List of Figures

2.1 A general diagram of components of a statistical