Korean Language Generation in an Interlingua-Based Speech
Total Page:16
File Type:pdf, Size:1020Kb
Korean Language Generation in an Interlingua-based Speech Translation System by Dennis Woojun Yang Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY May 1995 ( Massachusetts Institute of Technology 1995. All rights reserved. AAuthor uthor....... ................................ Department of Electrical E ineering and Computer Science May 26, 1995 CerCer::" -- -- - 'U./ ....... ....... :'"''..................'''- Stephanie Seneff Principal Research Scientist mhesis Supervisor Certifiec .................. Clifford Weinstein Leader of Lincoln/p ech Systems Technology Group \.- . i 1 1. B Thesis Supervisor Acceptedby .--........ ".....Ac e F.R........Morgenthaler Chairmanepartmental Com ittee on Graduate Theses ;:;ASSACHUSETTS INSTITU'E OF TECHNOLOGY AUG 10 1995 LIBRARIES arker EMn Korean Language Generation in an Interlingua-based Speech Translation System by Dennis Woojun Yang Submitted to the Department of Electrical Engineering and Computer Science on May 26, 1995, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering Abstract Group 24 at MIT Lincoln Laboratory has been developing an automatic speech-to- speech translation system for the English-Korean pair. For the machine translation module, an interlingua system has been adopted. This system analyzes the source language text and represents the results of the analysis in a semantic frame, an un- ambiguous textual-meaning propositional representation language, from which the text in the target language is generated. For the language generation component, GENESIS, a language generation system developed at the Spoken Language Sys- tems Group at the Laboratory for Computer Science of Massachusetts Institute of Technology, has been utilized. GENESIS has been used for European languages for general purposes and for Japanese in limited domains. It has also been found to be capable of handling some of the linguistic phenomena that are needed for Korean. However, there exist areas in which GENESIS cannot currently handle Korean gen- eration. This thesis explores the degree to which GENESIS is able to manage Korean language generation. Thesis Supervisor: Stephanie Seneff Title: Principal Research Scientist Thesis Supervisor: Clifford Weinstein Title: Leader of Lincoln Speech Systems Technology Group Acknowledgments More than 8 months of research and a bucket of sweat resulted in this document. Without substantial help and support from the people cited below, the completion of this thesis would have not been possible. I would like to take this opportunity to thank the people who have contributed much to the completion of my Master of Engineering thesis. I wish to express my sincere thanks to my MIT campus supervisor, Dr. Stephanie Seneff. Only with her expertise in natural language processing and her watchful guidance has it been possible for me to start, progress, and finish my thesis. I truely appreciate her technical support and time spent in proofreading my thesis. I gratefully acknowledge the help of my MIT Lincoln Laboratory supervisor, Dr. Cliff Weinstein. His support ranged from providing all the necessary equipment to proofreading my thesis. I would also like to recognize MIT Lincoln Lab for allowing me to use its facilities. A lot of thanks to the Group 24 members of MIT Lincoln Lab for not only helping me do my research, but also for showing me a glimpse of the real world. I would especially like to thank Dinesh Tummala whom I could always count on for guidance on various issues, both technical and personal, and Dr. Youngsuk Lee who shared her Korean expertise with me. Special thanks to my friends, Shinsuk, Hyunwoo, Dongwhan, and Soojung who scored the performance of the output, and to Jinmo, John and professor Victor Zue who proofread my thesis. A lot of thanks and best wishes for Sunyong who showed me her thesis to look at the format, and for Mira who sent over the material about the Korean grammar. And, to my parents who immigrated to America to give me the best education they can, I would like to express my deepest gratitude. Most importantly, I would like to thank God for being with me and keeping me strong in faith throughout the whole process. The work for this thesis was carried out in the fall of 1994 and in the spring of 1995 at MIT Lincoln Laboratory, Lexington, Massachusetts. This document was written and revised in the spring term of 1995 at MIT Lincoln Laboratory. Contents 1 Introduction 10 1.1 Machine translation systems . ...................... 10 1.2 CCLINC............ ...................... 12 1.3 Korean language in CCLINC . ...................... 14 1.4 TINA and GENESIS ..... ...................... 15 1.5 Evaluation procedure ..... ...................... 16 2 Korean language phenomena 17 2.1 Word order rules ................. ............ ..17 2.2 Conjunctive relations. .. .... .... 22 2.3 Verb suffixes . ..... ....... 22 2.3.1 Honorific/polite suffixes ......... ........... ..22 2.3.2 Tense suffixes. .. .. .. .. .. .. 24 2.3.3 Type-defining suffixes . ........... ..24 2.3.4 Conjunctive suffixes . .. ... .. ... .. 25 2.3.5 Gerund suffixes .... ...... .. .. ... .. ... .. 26 3 GENESIS 28 3.1 Semantic frames ........ .................. 28 3.2 Mechanism of GENESIS ... .................. 30 3.2.1 Lexicon ........ .................. 30 3.2.2 Messages ........ .................. 31 3.2.3 Rewrite rules. .................. 32 5 3.3 GENESIS for Korean ........................... 32 3.3.1 Lexicon . ........................... 32 3.3.2 Messages... ........................... 38 3.3.3 Rewrite rules ........................... 41 4 Proposed improvements to GENESIS 44 4.1 Negations and passive voice sentences 44 ........... ....... 4.2 Articles ................ ........... ....... 49 4.3 Styles of speech. ........... ....... 50 4.4 Preposition vs. postposition. ........... ....... 51 4.5 Mapping approach. ........... ....... 52 4.6 Lexical incompatibility. ........... 53 5 Evaluation 54 5.1 Evaluation Procedure ............ .. ... .. .54 5.1.1 Data .................... ..... ....... .54 5.1.2 Method .................. ... .. .. .54 5.1.3 Scores ................... ... .... .. .55 5.2 Analysis ...................... ........... ..55 5.2.1 Insufficient analysis of TINA ....... ........... ..58 5.2.2 Fixable by changing rules of GENESIS .. .. ... .. .59 5.2.3 Require code modification for GENESIS ........... ..60 5.2.4 Other ................... ........... ..60 5.3 Conclusion ..................... .. .. ... .. 61 6 Discussion and future plans 63 A Lexicon for GENESIS 65 B Messages for GENESIS 73 C Rewrite-rules for GENESIS 81 6 D Inflection patterns 89 7 List of Figures 1-1 A typical SSTS ................... 11 1-2 Translation using ILT approach [1] ........ 11 1-3 System structure for multilingual SSTS [6] .... 13 3-1 Parse tree for a sample sentence .......... 29 4-1 Inflections for "Hada" verb ............. 48 5-1 Adequacy score histogram (O indicates unparsed) 56 5-2 Fluency score histogram (O indicates unparsed) 57 D-1 Inflection patterns for V1 "HaDa" verbs . .......... .91 D-2 Inflection patterns for V2 verbs ........... ..93 D-3 Inflection patterns for V3 verbs .......... ..95 D-4 Inflection patterns for V4 verbs .......... ..97 D-5 Inflection patterns for V5 verbs 99 8 List of Tables 3.1 Example lexicon entries for English .............. 31 5.1 Evaluation scores. 55 5.2 Occurrences of each error source ................ 55 A.1 Lexicon file for GENESIS ................... 66 B.1 Messages file for GENESIS ................... 74 C.1 Data files used for rewrite.c................. .. 82 C.2 Program automatically generating korean-rewrite-rules.text . .. 83 C.3 Special.eng. ........................... .. 88 D.1 Words beloging to V1 "HaDa" verbs ............ ..... ..90 D.2 Words beloging to V2 verbs ................. ..... ..92 D.3 Words beloging to V3 verbs ................. ..... ..94 D.4 Words beloging to V4 verbs ................. ..... ..96 D.5 Words beloging to V5 verbs ................. ..... ..98 9 Chapter 1 Introduction 1.1 Machine translation systems Group 24 at MIT Lincoln Laboratory has been developing an automatic speech-to- speech translation system (SSTS) for English and Korean. It has been proposed that the English-Korean automatic SSTS be used by military coalition forces in Korea where the need for communication among Korean and American soldiers has been recognized. However, this proposal is difficult to fulfill due to the large differences between the two languages. The purpose of building the English-Korean automatic SSTS is to help the soldiers communicate in their own respective languages. A typical SSTS works in three phases: speech recognition, language translation, and speech synthesis [1]. The first phase recognizes the speech in the source language (SL) then produces the utterance in text form. The second phase analyzes this utterance and translates it into the target language (TL) in text form. The last phase converts this translation into sound. See figure 1-1. Most machine translation systems developed to date fall into two categories de- pending on how the language translation is approached - transfer and interlingua [1]. Transfer systems involve finding the target language correlates for lexical units and syntactic constructions of the source language, whereas in interlingua systems the SL and TL are