The Application of Explicit Semantic Analysis in Translation Memory Systems
Total Page:16
File Type:pdf, Size:1020Kb
The Application of Explicit Semantic Analysis in Translation Memory Systems Oumai Wang A thesis submitted to Imperial College London for the degree of Doctor of Philosophy School of Professional Development Imperial College London London SW7 2AZ 1 DECLARATION OF ORIGINALITY I confirm that the presented thesis is my work. All referenced works are acknowledged. Parts of the thesis have been presented for publication in advance of submission of the thesis. 2 COPYRIGHT DECLARATION The copyright of this thesis rests with the author and is made available under a Creative Commons Attribution Non-Commercial No Derivatives licence. Researchers are free to copy, distribute or transmit the thesis on the condition that they attribute it, that they do not use it for commercial purposes and that they do not alter, transform or build upon it. For any reuse or redistribution, researchers must make clear to others the licence terms of this work. 3 Abstract Although translation memory systems have become one of the most important computer- assisted translation tools, the development of systems able to retrieve Translation Memory (TM) files on the basis of semantic similarity has hitherto been limited. In this study, we investigate the use of Explicit Semantic Analysis (ESA), a semantic similarity measure that represents meanings in natural language texts by using knowledge bases such as Wikipedia, as a possible solution to this problem. While ESA may be used to improve TM systems, at present the evaluation of semantic processing techniques in the context of TM is not fully developed because the use of semantic similarity measures in TM systems has been limited. The study hence aims to evaluate ESA for its specific application in TM systems. The evaluation is performed within a knowledge management framework as this provides a suitable technical context. A software platform called the ESA Information Retrieval platform was designed to test the performance of ESA in TM system tasks using three different text genres: technical reports, popular scientific articles and financial texts. The aim of the evaluation was not only to improve our understanding of how ESA can be applied to TM systems, but also to examine certain textual factors that may have an impact on their performance. It was found that the use of ESA was able to create different ways of utilising translation suggestions. On the basis of the results obtained, both the existing problems of using ESA in TM systems and the future perspectives of TM systems are discussed. This study not only contributes to our understanding of employing semantic processing techniques in TM systems but also presents a new knowledge management perspective for the development of translation technology. 4 Acknowledgments I would like to express my very great appreciation to my supervisors Dr Mark Shuttleworth and Dr Bettina Bajaj for their supervision, encouragement and support throughout the course of my Ph.D. I am most grateful that Mark allowed me to work with him for three years, and that he was willing to take me on as an MSc teaching assistant in my second year of study and was the co-author of my first paper. I very much appreciate all Bettina’s constructive and critical advice on my thesis. I would like to thank my examiners Dr Felicity Mellor, Professor Ruslan Mitkov and Professor Andrew Rothwell for their time and very important and judicious comments. Special thanks are also due to my proof-reader Susan Peneycad. I really appreciate your help and patience. During my years as a Ph.D. student, I have had the pleasure of sharing in the knowledge of a number of people. Some of them have very patiently but enthusiastically explained various technical and mathematical issues to me. This work would not have been possible without their advice and support. The list includes Zhen Li, Xinyue Fu, Rui Zhang and Ji Qi. I appreciate having had Tuan-Chi Hsieh, Dawning Hoi Ching Leung, Boshuo Guo, Emmanouela Patiniotaki and Albert Feng-shuo Pai as colleagues at Imperial College. Many of my friends who supported me throughout my Ph.D. study also deserve mention. They are Yufei Huang, Juan Ojeda Castillo, Xiang Weng, Yujie Sun, Jianfeng Yu, Ning Niu, Suwen Chen and Peng Zhang. I would like to thank Liam Watson and Naomi Anderson-Eyles for their excellent administrative support. Most importantly, I would also like to thank my beloved family for the support they have provided me through my entire life. You have taught me to be a strong, independent person and have guided me to find the definitions of love, courage and perseverance in my life. 5 List of Abbreviations, Acronyms, and Notations ACA: Aircraft Accident Report ASL: Average Sentence Length ATSL: Average Translation Suggestion Length BLEU: Bilingual Evaluation Understudy CAT: Computer-assisted Translation CR: Conceptually Related Translation Suggestion ESA: Explicit Semantic Analysis FS: Formally Similar Translation Suggestion GVSM: Generalised Vector Space Model IR: Information Retrieval KM/ KMS: Knowledge Management/ Knowledge Management System LCS: Lowest Common Subsumer LSA/LSI: Latent Semantic Analysis/ Latent Semantic Indexing MT: Machine Translation NLP: Natural Language Processing OMCS: Open Mind Common Sense POS: Part of Speech QTR: Queries/Translation suggestions Ratio RN: Reuters-21578 Collection SaaS: Software as a Service SciAm: Scientific American Articles SD: Standard Deviation TERL Translation Edit Rate TF-IDF: Term Frequency-Inverse Document Frequency TM: Translation Memory TMS: Translation Memory System TREC: the Text Retrieval Conference 6 TTR: Type/Token Ratio XML: Extensible Markup Language 푎⃗: Vector R: Matrix 푛 (Summation) ∑푚=1 푎푚 = 푎1 + 푎2 + 푎3 … 푎푛 3 For example, (m 1) (1 1) (2 1) (3 1) 9 m1 푛 (Product) ∏푚=1 푎푚 = 푎1 ∙ 푎2 ∙ 푎3 ∙ 푎4 ∙ … . 푎푛 3 For example, ∏푚=1(푚 + 1) = (1 + 1) ∙ (2 + 1) ∙ (3 + 1) = 24 ∈ (is an element of) If a S, it means that a is an element of the set S. |a| |a| is the absolute value of a. 7 List of Figures and Charts Figure 1.1: Example of a TMS layout ........................................................................................21 Figure 1.2: IR process .................................................................................................................22 Figure 2.1: Wiig’s Knowledge Typology ...................................................................................41 Figure 2.2: Dalkir’s Integrated KM Cycle ..................................................................................45 Figure 2.3: Nonaka’s Knowledge Spiral Model .........................................................................48 Figure 2.4: The transfer and conversion of knowledge in the workflow of a TMS seen as a type of KMS ................................................................................................................................56 Figure 3.1: A fragment of WordNet hierarchy ...........................................................................66 Figure 3.2: A fragment of WordNet hierarchy ...........................................................................67 Figure 3.3: A fragment of WordNet hierarchy with trained probabilistic values .......................68 Figure 3.4: Semantic relations in ConceptNet ............................................................................74 Figure 3.5: A sample of semantic network in HowNet ..............................................................75 Figure 3.6 Extract from the Wikipedia article on 'oxygen' .........................................................89 Figure 3.7: Category page for 'oxygen' .......................................................................................90 Figure 3.8: A basic proposition of 'ozone' ..................................................................................91 Figure 3.9: An example of the language bar on a Wikipedia page .............................................93 Figure 3.10: An example entry for 'oxygen' in BabelNet ...........................................................97 Figure 3.11: Another example entry for 'oxygen' in BabelNet ...................................................97 Figure 4.1: Extract of an RN text ..............................................................................................119 Figure: 4.2: A screenshot of original Scientific American articles ...........................................124 Figure 4.3: A screenshot of an original AAIB aircraft accident report ....................................125 Figure 4.4: A screenshot of an original AAIB aircraft accident report ....................................126 Figure 4.5: A screenshot of TextStats .......................................................................................128 Figure: 4.6: A screenshot of XML2sql .....................................................................................130 Figure: 4.7: A screenshot of the ESA IR platform....................................................................131 Chart 5.1: Distribution of ESA similarity scores for the test collections..................................156 Chart 5.2: The relation between QTR and ranges of query length ...........................................172 Figure 6.1: An example of a parse tree .....................................................................................188 Figure 6.2: An example of a Wikipedia page about ‘微信’ (WeChat) .....................................197 Figure 6.3: The infrastructure of Translation Memory Systems as a type of KMS ..................203