Methods and Tools for Weak Problems of Translation Muhammad Ghulam Abbas Malik
Total Page:16
File Type:pdf, Size:1020Kb
Methods and Tools for Weak Problems of Translation Muhammad Ghulam Abbas Malik To cite this version: Muhammad Ghulam Abbas Malik. Methods and Tools for Weak Problems of Translation. Computer Science [cs]. Université Joseph-Fourier - Grenoble I, 2010. English. tel-00502192 HAL Id: tel-00502192 https://tel.archives-ouvertes.fr/tel-00502192 Submitted on 13 Jul 2010 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Université de Grenoble N° attribué par la bibliothèque /_ /_ /_ /_ /_ /_ /_ /_ /_ / THÈSE pour obtenir le grade de DOCTEUR DE L’UNIVERSITÉ DE GRENOBLE Spécialité: “INFORMATIQUE” (Computer Science) dans le cadre de l’École Doctorale “Mathématiques, Sciences et Technologie de l’Information, Informatique” (Doctoral School “Mathematics, Information Science and Technology, Computer Science”) présentée et soutenue publiquement par Muhammad Ghulam Abbas MALIK le 9 juillet 2010 Méthodes et outils pour les problèmes faibles de traduction (Methods and Tools for Weak Problems of Translation) Thèse dirigée par M. Christian BOITET Directeur de thèse M. Pushpak BHATTACHARYYA Codirecteur de thèse JURY M. Arthur SOUCEMARIANADIN Président Mme Violaine PRINCE Rapporteur M. Patrice POGNAN Rapporteur M. Eric WEHRLI Rapporteur M. Vincent BERMENT Examinateur M. Alain DÉSOULIÈRES Examinateur M. Christian BOITET Directeur M. Pushpak BHATTACHARYYA Codirecteur Préparée au laboratoire GETALP–LIG (CNRS–INPG–UJF–UPMF) Table of Contents Introduction ..................................................................................................................... 1 Part I ...................................................................................................................... 7 Scriptural Translation and other Weak Translation Problems ...................... 7 Chapter 1. Scriptural Translation ................................................................................. 9 1.1. Scriptural Translation, Transliteration and Transcription ................................................... 9 1.2. Machine Transliteration in Natural Language Processing (NLP) ....................................... 9 1.2.1. Grapheme-based models ................................................................................................ 10 1.2.2. Phoneme-based models .................................................................................................. 11 1.2.3. Hybrid and correspondence-based models ..................................................................... 11 1.3. Scriptural Translation ........................................................................................................ 12 1.3.1. Challenges and barriers within the same language ........................................................ 14 1.3.1.1. Scriptural divide .......................................................................................................... 14 1.3.1.2. Under-resourced and under-written languages ........................................................... 15 1.3.1.3. Absence of necessary information .............................................................................. 15 1.3.1.4. Different spelling conventions .................................................................................... 16 1.3.1.5. Transliterational or transcriptional ambiguities .......................................................... 17 1.3.2. Challenges and barriers within dialects .......................................................................... 17 1.3.2.1. Distinctive sound inventories ...................................................................................... 17 1.3.2.2. Lexical divergence and translational ambiguities ....................................................... 18 1.3.3. Challenges and barriers between closely related languages ........................................... 19 1.3.3.1. Characteristics ............................................................................................................. 19 1.3.3.2. Under-resourced dialect or language pairs .................................................................. 19 1.4. Approaches for Scriptural Translation .............................................................................. 20 1.4.1. Direct programming approaches .................................................................................... 20 1.4.2. Finite-state approaches ................................................................................................... 21 1.4.3. Empirical, machine learning and SMT approaches ....................................................... 21 1.4.4. Hybrid approaches ......................................................................................................... 21 1.4.5. Interactive approaches .................................................................................................... 22 Chapter 2. Scriptural Translation Using FSTs and a Pivot UIT .......................... 23 2.1. Scriptural Translation for Indo-Pak Languages ................................................................ 24 2.1.1. Scripts of Indo-Pak languages ........................................................................................ 24 2.1.1.1. Scripts based on the Persio-Arabic script .................................................................... 24 2.1.1.2. Indic scripts ................................................................................................................. 25 2.1.2. Analysis of Indo-Pak scripts for scriptural translation ................................................... 26 2.1.2.1. Vowel analysis ............................................................................................................ 26 2.1.2.2. Consonant analysis ...................................................................................................... 27 2.1.2.2.1. Aspirated consonants ............................................................................................... 27 2.1.2.2.2. Non-aspirated consonants ........................................................................................ 28 2.1.2.3. Diacritics analysis ....................................................................................................... 28 2.1.2.4. Punctuations, digits and special characters analysis ................................................... 29 2.1.3. Indo-Pak scriptural translation problems ....................................................................... 29 2.2. Universal Intermediate Transcription (UIT) ..................................................................... 29 2.2.1. What is UIT? .................................................................................................................. 30 2.2.2. Principles of UIT ............................................................................................................ 31 2.2.3. UIT mappings for Indo-Pak languages .......................................................................... 31 2.3. Finite-state Scriptural Translation model .......................................................................... 32 i 2.3.1. Finite-state transducers ................................................................................................... 32 2.3.1.1. Non-deterministic transducers ..................................................................................... 33 2.3.1.1.1. Character mappings .................................................................................................. 33 2.3.1.1.2. Contextual mappings ................................................................................................ 37 2.3.1.2. Deterministic transducers ............................................................................................ 37 2.3.2. Finite-state model architecture ....................................................................................... 38 2.4. Experiments and results .................................................................................................... 39 2.4.1. Test data ......................................................................................................................... 39 2.4.2. Results and discussion .................................................................................................... 41 2.5. Conclusion ......................................................................................................................... 47 Part II .................................................................................................................. 49 Statistical, Hybrid and Interactive Models for Scriptural Translation ........ 49 Chapter 3. Empirical and Hybrid Methods for Scriptural Translation ............... 51 3.1. SMT Model for Scriptural Translation .............................................................................. 51 3.1.1. Training data .................................................................................................................. 52 3.1.2. Data alignments .............................................................................................................