English to Tamil Machine Translation System Using Parallel Corpus
Total Page:16
File Type:pdf, Size:1020Kb
1 ================================================================= Language in India www.languageinindia.com ISSN 1930-2940 Vol. 19:5 May 2019 India’s Higher Education Authority UGC Approved List of Journals Serial Number 49042 ================================================================ ENGLISH TO TAMIL MACHINE TRANSLATION SYSTEM USING PARALLEL CORPUS Prof. Rajendran Sankaravelayuthan [email protected] Amrita Universiy, Coimbatore Dr. G. Vasuki AVVM Sri Pushpam College, Poondi [email protected] Coimbatore 2019 ================================================================= Language in India www.languageinindia.com ISSN 1930-2940 19:5 May 2019 Prof. Rajendran Sankaravelayuthan and Dr. G. Vasuki English To Tamil Machine Translation System Using Parallel Corpus 2 A FEW WORDS This research material entitled “ENGLISH TO TAMIL MACHINE TRANSLATION SYSTEM USING PARALLEL CORPUS” was lying in my lap since 2013. I was planning to edit and publish it in book form after making necessary modifications. But as I have taken up some academic responsibility in Amrita University, Coimbatore after my retirement from Tamil University, I could not find time to fulfil my mission. So I am presenting it in raw format here. Let it see the light. Kindly bear with me. I am helpless. Statistical machine translation (SMT) is a machine translation paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. The statistical approach contrasts with the rule-based approaches to machine translation as well as with example- based machine translation. Statistical machine translation (SMT) learns how to translate by analyzing existing human translations (known as bilingual text corpora). In contrast to the Rules Based Machine Translation (RBMT) approach that is usually word based, most mondern SMT systems are phrased based and assemble translations using overlap phrases. In phrase-based translation, the aim is to reduce the restrictions of word-based translation by translating whole sequences of words, where the lengths may differ. The sequences of words are called phrases, but typically are not linguistic phrases, but phrases found using statistical methods from bilingual text corpora. Analysis of bilingual text corpora (source and target languages) and monolingual corpora (target language) generates statistical models that transform text from one language to another with that statistical weights are used to decide the most likely translation. RAJENDRAN ================================================================= Language in India www.languageinindia.com ISSN 1930-2940 19:5 May 2019 Prof. Rajendran Sankaravelayuthan and Dr. G. Vasuki English To Tamil Machine Translation System Using Parallel Corpus 3 PAGE CONTENT NOS. Chapter 1 Introduction 10 1.1. Motivation 10 1.2 Issues in the research 12 1.3 Aims and objectives of the work 13 1.4 Methodology 14 1.5. Previous research works 14 1.6 Charecterization 16 1.7. Relevance of the present research work 16 Chapter 2 Survey of MT systems in India and Abroad 17 2.0. Introduction 17 2.1 Machine Translation 18 2.1.1 Machine Translation System for non Indian languages 29 2.1.2 Machine Translation Systems for Indian languages 28 2.2 History of Machine Translation 37 2.3 Need for MT 42 2.4 Problems in MT 43 2.5 Types of Machine Translation Systems 44 2.6 Different Approaches used for Machine Translation 45 2.6.1 Linguistic or Rule-Based Approaches 45 2.6.1.1 Direct MT System 46 2.6.1.2 Interlingua Machine Translation 47 2.6.1.3 Transfer based MT 49 2.6.2 Non-Linguistic Approaches 50 2.6.2.1 Dictionary Based Approach 50 2.6.2.2 Empirical or Corpus Based Approaches 51 2.6.2.2.1 Example Based Approach 51 2.6.2.2.2 Statistical Approach 52 2.6.3 Hybrid Machine Translation Approach 53 2.7 Categories of Machine Translation System 54 2.7.1 Fully Automated Machine Translation System 54 ================================================================= Language in India www.languageinindia.com ISSN 1930-2940 19:5 May 2019 Prof. Rajendran Sankaravelayuthan and Dr. G. Vasuki English To Tamil Machine Translation System Using Parallel Corpus 4 2.7.2 Machine Aided Translation System 55 2.7.3 Terminology Data Banks 55 2.8 Advantages of Statistical Machine Translation over Rule Based 56 Machine Translation 2.9 Applications of Machine Translation 57 2.10 Summary 62 Chapter 3 Creation of Parallel Corpus 63 3.0 Introduction 63 3.1 Pre-Electronic corpus 63 3.2. Corpus in the present day context 63 3.2.1. Sampling and representativeness 64 3.2.2. Finite size 65 3.2.3. Machine-readable form 66 3.2.4. A standard reference 67 3.3. Classification of the corpus 67 3.3.1. Genre of text 68 3.3.2. Nature of data 68 3.3.3. Type of text 69 3.3.4. Purpose of design 70 3.3.5. Nature of application 70 3.3.5.1. Aligned corpus 70 3.3.5.2. Parallel corpus 71 3.3.5.3. Reference corpus 71 3.3.5.4. Comparable corpus 71 3.3.5.5. Opportunistic corpus 72 3.4. Generation of written corpus 72 3.4.1. Size of corpus 72 3.4.2. Representativeness of texts 73 3.4.3. Question of Nativity 73 3.4.4. Determination of target users 75 3.4.5. Selection of time-span 76 3.4.6. Selection of texts type 76 ================================================================= Language in India www.languageinindia.com ISSN 1930-2940 19:5 May 2019 Prof. Rajendran Sankaravelayuthan and Dr. G. Vasuki English To Tamil Machine Translation System Using Parallel Corpus 5 3.4.7. Method of data sampling 77 3.4.8. Method of data input 78 3.4.9. Hardware requirement 79 3.4.10. Management of corpus files 79 3.4.11. Method of corpus sanitation 80 3.4.12. Problem of copy right 80 3.5. Corpus processing 81 3.5.1. Frequency study 81 3.5.2. Word sorting 82 3.5.3. Concordance 82 3.5.4 Lexical Collocation 83 3.5.5 Key Word In Context (KWIC) 83 3.5.6 Local Word Grouping (LWG) 84 3.5.7 Word Processing 84 3.5.8 Tagging 86 3.6. Parallel corpora 86 3.6.1. Parallel corpora types 88 3.6.2. Examples of parallel corpora 89 3.6.3. Applications of parallel corpora 90 3.6.4. Corpora creation in Indian languages 92 3.6.4.1. POS tagged corpora 93 3.6.4.2. Chunked corpora 93 3.6.4.3. Semantically tagged corpora 94 3.6.4.4. Syntactic tree bank 94 3.6.4.5. Sources for parallel corpora 95 3.6.4.6. Tools 95 3.6.5. Creating multilingual parallel corpora for Indian languages 96 3.6.5.1. Creating the source text 98 3.6.5.2. Domain of corpus 98 3.6.5.2.1. Health Domain 98 3.6.5.2.2. Tourism domain 99 3.6.5.3. Data storage, maintenance and dissemination 99 ================================================================= Language in India www.languageinindia.com ISSN 1930-2940 19:5 May 2019 Prof. Rajendran Sankaravelayuthan and Dr. G. Vasuki English To Tamil Machine Translation System Using Parallel Corpus 6 3.6.5.4. Parallel corpus creation 100 3.6.5.5. POS Annotation 100 3.6.5.5.1 POS Tag set 101 3.6.5.5.1.1 Principles for Designing Linguistic Standards for Corpora 101 Annotation 3.6.5.5.2 Super Set of POS Tags 102 3.6.5.5.3 Super Set of POS Tags for Indian Languages 103 3.6.5.5.4 Manual POS Annotation 103 3.6.6. Creation of parallel corpus for the SMT system 103 3.6.6.1. Corpus collection 104 3.6.6.2. Compilation of parallel corpora 105 3.6.6.3. Alignment of the parallel corpus 105 3.6.6.4. Sentence alignment 107 3.6.6.5. Word alignment 108 3.7. Summary 109 Chapter 4 Parallel Structure of English and Tamil Language 110 4.0 Introduction 110 4.1. Parallel sentential structures in English and Tamil 110 4.1.1. Prallel affirmative sentences 117 4.1.2. Parallels in interrogative sentences 119 4.1.2.1. Parallels in yes-no questions 120 4.1.2.2. Parallels of wh-questions 122 4.1.3. Parallels in negative sentences 124 4.1.3.1. Parallels in negation in equative sentences 124 4.1.3.2. Parallels in negation in non-equative sentences 125 4.1.3.3. Parallels in negative pronouns and determiners 125 4.1.4. Parallels in imperative sentence 128 4.2. Parallel clause structures of English and Tamil 130 4.2.1. Parallels in nominal/complement clause 135 4.2.2. Parallels in Adverbial clauses 136 4.2.3. Parallels in Adjectival clauses 141 4.2.4. Parallels in comparative clauses 143 ================================================================= Language in India www.languageinindia.com ISSN 1930-2940 19:5 May 2019 Prof. Rajendran Sankaravelayuthan and Dr. G. Vasuki English To Tamil Machine Translation System Using Parallel Corpus 7 4.2.4.1. Parallels in comparative clause of quality 144 4.2.4.2. Parallels in comparative clause of quantity 144 4.2.4.3. Parallels in comparative clause of adverbs 145 4.2.5. Parallels in co-ordination 146 4.3. Parallel structures of English and Tamil phrases 147 4.3.1. Parallels in noun phrases 147 4.3.1.1. Parallels in demonstratives 147 4.3.1.2.Parallels in quantifiers 148 4.3.1.3. Parallels in genitive phrase 149 4.3.2. Parallel structures in verb phrase 150 4.3.2.1. Parallels in complex verbal forms denoting tense, mood and 151 aspect 4.3.2.2. Parallels in verb patterns 161 4.3.3. Parallels in adjectival phrases 172 4.3.4.