Grenoble INP – Institut National Polytechnique de Grenoble
École Nationale Supérieure d’Informatique, Mathématiques Appliquées et Télécommunications
LIG – Laboratoire d’Informatique de Grenoble
Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole

CARLOS EDUARDO RAMISCH

Multi-word terminology extraction for domain-specific documents

Mémoire de Master
Master 2 de Recherche en Informatique, Option Intelligence Artificielle et Web

Christian BOITET Advisor

Aline VILLAVICENCIO Coadvisor

Grenoble, June 2009

ACKNOWLEDGEMENTS

First of all, for their patience and support, I would like to thank my advisors Aline Villavicencio and Christian Boitet, who were always ready to answer my questions, discuss interesting points of my research and keep me motivated. Without them I would never have come this far. Thank you for sharing with me your knowledge and your experience of this fascinating and challenging science that is Natural Language Processing.

In my academic trajectory, I counted on the teaching teams of two high-quality institutions: the Institute of Informatics of UFRGS1 and ENSIMAG/Télécom of Grenoble INP2. Additionally, I would like to thank the CLT team, especially Marc Dymetman and Lucia Specia, for their advice and guidance during my internship, and moreover for the nice time spent with all XRCE colleagues.

Third, I would like to thank all the course and internship colleagues who, at one point or another, shared with me the joys and sacrifices of academic life. The journey would not have been as much fun without my colleagues from PLN-UFRGS, GETALP and XRCE, and especially my classmates and friends at UFRGS and ENSIMAG.

Fourth, I would like to thank the institutions and people that funded my studies: CNPq, for the scientific initiation scholarship (PIBIC); CAPES, for the international exchange scholarship (Brafitec); Floralis, for the internship at GETALP; and XRCE, for the research internship. Moreover, I would like to thank my parents for their financial support, without which I would not have been able to accomplish this work.

Fifth, thanks to my friends for the fun moments, for their support and reliability. Thanks to my companion Arnaud for his patience when I was too busy, for his help when I was in need and for his unconditional support. Thanks to my parents Ari and Sônia, to my grandparents Gisela and Edgar, and to my sister Renata who, from the very start, provided me with the means for studying and having a successful academic career.

Finally, to all the people that never stopped believing in the quality of my work and in my capacity: thank you! A researcher and computer scientist does not work, as in the caricatures, alone behind his computer in an empty and cold lab. On the contrary, his work is the result of a collective effort. Therefore, I thank all the people that directly or indirectly contributed to the achievement of this work.

1Instituto de Informática, Universidade Federal do Rio Grande do Sul
2École Nationale Supérieure d’Informatique, Mathématiques Appliquées et Télécommunications, Grenoble Institut National Polytechnique

CONTENTS

ABSTRACT

1 INTRODUCTION
1.1 Motivation
1.2 Goals and hypotheses

2 RELATED WORK
2.1 About multi-word expressions
2.2 About multi-word terms
2.3 About machine translation

3 BIOMEDICAL TERMINOLOGY IDENTIFICATION
3.1 The Genia corpus
3.2 Pre-processing
3.2.1 Tokenisation and POS tags
3.2.2 Lemmas and plurals
3.2.3 Acronyms and parentheses
3.3 Part Of Speech patterns
3.4 Association measures
3.5 MT Integration

4 EVALUATION
4.1 Evaluating pre-processing
4.1.1 Re-tokenisation and lemmatisation
4.1.2 Acronym detection evaluation
4.2 Terminology extraction evaluation
4.2.1 POS pattern selection
4.2.2 Association measures evaluation
4.2.3 Corpus nature evaluation
4.2.4 Learning algorithms evaluation
4.2.5 Threshold evaluation
4.2.6 Comparison with related work
4.3 Application-oriented evaluation

5 CONCLUSIONS AND FUTURE WORK

REFERENCES

ANNEXE A RÉSUMÉ ÉTENDU
A.1 Introduction
A.2 Révision bibliographique
A.3 Extraction de la terminologie biomédicale
A.4 Évaluation
A.5 Conclusions

APÊNDICE B RESUMO ESTENDIDO
B.1 Introdução
B.2 Trabalhos relacionados
B.3 Extração terminológica na biomedicina
B.4 Avaliação
B.5 Conclusões

APPENDIX C PART-OF-SPEECH PATTERNS

APPENDIX D PRE-PROCESSING ALGORITHMS

APPENDIX E IMPLEMENTATION

ABSTRACT

One of the greatest challenges of Natural Language Processing is to adapt techniques that work well on general-purpose language to handle domain-specific texts. At the lexical level, it is especially important to be able to deal with terminology and its implications. The goal of this work is to employ multi-word expression identification techniques on the biomedical Genia corpus in order to evaluate to what extent they can be used for Multi-Word Term (MWT) extraction in both mono- and multi-lingual systems. We start with a short motivation, relating our research to cross-lingual information access, and then set the theoretical bases and definitions of the concepts on which we work: multi-word expression, term and multi-word term. Then, we present a non-exhaustive discussion of approaches used in related work for multi-word expression identification, interpretation, classification and application. Some work on domain-specific terminology is also presented, particularly concerning symbolic methods for automatic ontology discovery and semi-automatic MWT extraction based on statistical and morphological cues. A very brief introduction to statistical machine translation shows some aspects that we investigate in our evaluation. Our methodology can be divided into four steps: pre-processing, in which we homogenise the corpus tokenisation and morphology; candidate extraction through selected morphological patterns; candidate filtering through association measures computed from the corpus and from the Web, combined with some machine learning; and application-oriented evaluation on a machine translation system. Our results show that good pre-processing heuristics can drastically improve performance. Additionally, part-of-speech tag pattern selection is domain-dependent and fundamental for higher recall. Among the association measures, the t test seems to perform best. The corpora we used, Genia and the Web, are essentially different and provide good MWT estimators at different rank intervals, suggesting that they should be combined using more sophisticated tools. While neural networks cannot correctly learn a MWT identification model from our features, support vector machines seem to provide a good trade-off between recall and precision, being even more effective when some frequency threshold is applied. Our approach reaches a recall of around 6.4% with 74% precision. Even though recall seems low, it is considerably higher than that of a standard MWE extraction technique called Xtract, which obtains 67% precision at 3.8% recall on the Genia corpus. Finally, we point out some future directions for MWT extraction, including better marginal probability estimators, contingency-table-based methods for arbitrary MWT lengths, multi-class evaluation and collaborative systems to build multi-lingual specialised lexicons. An extended abstract in Portuguese is available in appendix B. An extended abstract in French is available in appendix A.

Keywords: Natural Language Processing, Multiword Expressions, Terminology, Lexical Acquisition, Machine Learning, Statistical Machine Translation, Domain Adaptation.

1 INTRODUCTION

The ultimate goal of Natural Language Processing (NLP) is to make the communication process between humans and computers as natural as asking a friend for a favour. Natural language is perhaps the most natural way for humans to express themselves and corresponds to one of the main components of intelligent behaviour. NLP is therefore often seen as a sub-area of Artificial Intelligence and inherits from it several theories and techniques. The study of Natural Language Processing has several applications that can help to overcome linguistic barriers, not only between humans and computers but also between humans and humans. Examples of the former include applications such as natural language generation, while the latter includes machine translation and cross-lingual information access. The term Natural Language Processing is also sometimes used as a synonym for Computational Linguistics; it is an interdisciplinary area situated at the confluence of Computer Science and Applied Linguistics.

Any intelligent system that would like to interact with human beings should be able to understand and generate natural language. However, language is intrinsically complex and represents a real challenge for Artificial Intelligence. On one hand, human languages are huge in many dimensions: lexicon, syntax rules, morphological rules, semantics, etc. Additionally, linguistic phenomena tend to be arbitrary, involving cultural and historical factors that cannot be taken into account by computational models. On the other hand, language is constantly changing: every day new words are created, old ones are abandoned, new meanings are assigned to old words, etc. Computer systems cannot simply refuse an input as “wrong” because it uses a new word that is not in the lexicon; they should be flexible and learn fast enough to deal with language evolution. Since language is culturally motivated, the correspondences between two or more human languages are far from being one-to-one relations, and the number of living languages in the world is an obstacle in human-human and human-computer interaction. On top of all that, language is not an isolated phenomenon and its interpretation involves general knowledge of the world, as we can see in the classical example sentence time flies like an arrow, which illustrates how much ambiguity natural language contains. In this example, flies can be the main verb of a comparison of time with an arrow, or like can be the main verb, stating that a specific type of fly is fond of arrows.

Technical and scientific knowledge is transmitted through language, and presents some linguistic properties that differ from general-purpose language. Computer systems should be able to deal with specialised language at all levels, from lexicon and morphology to semantics and style. Domain-specific texts represent a large knowledge source expressed in natural language and are therefore of great interest for NLP. Following a descriptive/cognitive approach that directly contrasts with generative language processing, we believe that computer systems should learn this kind of information from text, following so-called data-driven NLP methods.

1.1 Motivation

Imagine a user with a domain-specific information need, e.g. the gathering of scientific papers about recent advances in example-based machine translation. If the user plans to use a general-purpose Web search engine, this is probably going to be a long and complicated task. On one hand, the results retrieved by a search engine like Google often include a great number of irrelevant user-generated pages. To solve this problem, commercial search systems quickly realised that they should create specific indices containing, for instance, selected scientific papers (e.g. Google Scholar1 or Citeseer2). On the other hand, the generic Information Retrieval (IR) techniques used by these tools need to take into account specific domain information, such as terminology. For instance, a Web multi-word keyword search for segmentation techniques should return very different sets of documents depending on whether the domain is Natural Language Processing3 or Image Processing. The same considerations also hold for simple terms: e.g. leaf and tree are very common terms in Botany and in Informatics, but are related to disjoint concepts in each domain. On top of that, multi-lingual access to information is not possible: most of the technical and scientific literature is only accessible in one language (mostly English, for scientific documents), and the translation of domain-specific terminology by the user, a tedious task involving the inquiry of several external sources, is generally done manually before entering the keywords of the query into the search engine text box. In short, nowadays a user needs to translate and add context keywords to his/her query before obtaining a substantial amount of relevant documents, and then translate the relevant retrieved documents and extract the desired information.

However, what our hypothetical user would have liked is to have access, through a domain-specific cross-language Information Retrieval (IR) engine, to relevant documents written in or translated to his own language, with precise automatic translation for the terminological multi-word and simplex lexical units of both query and documents. This example illustrates a common problem that one faces in everyday life when one tries to access the huge amount of information that the World Wide Web made available. In this context, multi-lingual access is obviously a topic of great interest for research and commercial institutions: instead of providing information services to English speakers only, companies would like to reach, for example, the Brazilian public, which could potentially be a very profitable market. Domain-specific multi-lingual access is a particular instance of this problem in which users are experts in a domain and would like to find relevant information (e.g. on the Web) within their specific knowledge area, regardless of the language in which it was originally written. Situations like these have made Cross-Lingual Information Retrieval (CLIR) a very active research area in the last years (Peters et al. 2008).

Although the motivation of this work is strongly related to CLIR, our goal here is not in itself the creation of a multi-lingual IR environment. Nonetheless, we are broadly interested in the relationship between domain-specific terminology and its applications, especially in what concerns the potential benefits of automatic terminology extraction for Machine Translation (MT). The wider goal of this project is to create a collaborative system integrating these two concepts, terminology and multi-lingualism, to provide truly multi-lingual information access. Collaboration in multi-lingual environments has been the focus of recent research and provides a good compromise between automaticity and quality (Huynh et al. 2008). To achieve this goal, in this work we concentrate on the relationship between terminology and multi-lingualism, regardless of specific CLIR techniques or information management and access interfaces.

1http://scholar.google.com
2http://citeseer.ist.psu.edu/
3Segmentation techniques in NLP are often employed for sentence boundary detection and tokenisation.

1.2 Goals and hypotheses

In recent years, there has been increasing interest in the investigation of Multi-Word Expressions (MWE). There is no precise and unique definition of MWE: they are actually a general and task-oriented concept created by Computational Linguists to describe a wide range of distinct but related linguistic phenomena such as noun compounds (traffic light, feu rouge), nominalisations (rendre visite, take a walk), idiomatic expressions (cloud number nine, sem pé nem cabeça), institutionalised phrases (Dear Sir or Madam, ad hoc), determinerless prepositional phrases (at home, à volonté), light verbs (to do the dishes), phrasal verbs (to take off, mitgehen, aufstehen), terms (sistema operacional, Datenbanksystem), etc. In this work, we will adopt the loose definition cited by Sag et al. (2002).

Definition 1.1 (Multi-Word Expression) A Multi-Word Expression (MWE) is a set of two or more words (i.e. an expression) with non-compositional semantics, i.e. for which the sense of the whole expression cannot be directly or completely inferred from the sense of its individual words.

From a textual perspective, the words of a multi-word expression co-occur more often than they would if they were a set of ordinary words co-occurring by chance (Manning and Schütze 1999). Due to their heterogeneous characteristics, MWEs pose a real challenge for current NLP technology. A related and important definition, this time in the context of domain-specific texts, is that of a term. According to Krieger and Finatto (2004), for the expert, terms are the actual representation of knowledge in a specific area, i.e. they are lexical units that stand for abstract concepts of the domain. It is interesting to highlight that the first Terminology researchers believed that there was a 1:1 relationship between terms and concepts, in other words that the semantic relation between them was functional and injective. This assumption has been widely debated and is nowadays regarded as simplistic and reductive, but it still provides an important basis for understanding what exactly is considered a term in the context of this work.

Definition 1.2 (Term) A term is a lexical unit that has an unambiguous meaning when used in a text of a specific domain. Terms are the linguistic representation of the concepts of a knowledge area. Terms are often treated as a single lexical unit even though they might be composed of more than one orthographic word. A set of domain-specific terms constitutes a specialised lexicon or a terminology4 (Krieger and Finatto 2004).

Our definition of term follows the main principles of the General Theory of Terminology, where terms are lexical representations of abstract, language-independent concepts. This theory has been harshly criticised for considering terms as static lexical units, regardless of their discursive contexts. Although considered obsolete by many terminologists, the previous definition is functional enough for our purposes and should fit the processing needs of most current NLP technology. We discuss the advantages and weaknesses of this assumption later in this section.

There are several differences between MWEs and terms, not only epistemological but also pragmatic. First, terms may be either simplex (single-word) or multi-word units like nominal and verbal locutions, whereas MWEs are inherently composed of two or more words. Second, MWEs occur both in technical and scientific language and in general-purpose everyday language, while terms belong only to the former. Even though the sharp distinction between general and specialised communication has been questioned, the difference is important here because the available computational methods to deal with MWEs in general-purpose texts are potentially different from the methods to handle specialised corpora and terminology. Finally, MWEs find theoretical support in general Linguistics while terms are the object of study of Terminology; the two sciences share some properties but are nevertheless distinct. MWEs and terms also have some similar aspects: both have non-conventional semantics, both are a challenge for NLP systems, etc. On the overlap between terms and multi-word expressions lies our object of study. In this work, we are interested in domain-specific terms that are composed of two or more orthographic words, i.e. Multi-Word Terms (MWTs).

4We adopted the convention of writing Terminology with a capital T when referring to the science, and the homonym terminology with a lowercase t when referring to the collection of terms of a specific domain.

Definition 1.3 (Multi-Word Term) A Multi-Word Term (MWT) is a term that is composed of more than one word. The unambiguous semantics of a multi-word term depends on the knowledge area of the concept it describes and cannot be inferred directly from its parts (SanJuan et al. 2005, Frantzi et al. 2000).

Note that this definition is essentially different from what terminologists consider to be a phraseology, as in to initiate a remote digital loopback test. From this point of view, phraseologies are much more related to specialised collocations than to our conception of multi-word term. Specialised phraseology deals with more complex constructions that often involve more than one domain-specific concept, and that are often seen as intermediary entities between terms and institutionalised sentences (e.g. the action of initiate and the term remote digital loopback test). We, on the other hand, consider a multi-word term to be a multi-word lexical representation of an abstract term, sharing with the latter the same properties of monosemy and limited variability, i.e. a multi-word term is, like a single-word term, a lexical manifestation of a specialised concept. Moreover, specialised phraseology allows more flexibility in its syntactic realisations (e.g. nominalisations in to initiate/the initiation of) whereas, in our definition, MWTs are syntactically rigid, even if subject to the same morphological rules of inflection (gender, number) as ordinary lexical units. Although the theoretical bases of our work can be regarded as part of the General Theory of Terminology, we emphasise that in all steps we consider the terms in their occurrence contexts, the technical and scientific text itself.

Krieger and Finatto (2004) point out several common points between the theoretical and practical study of Terminology and Translation. For instance, as the need for specialised translation has increased in recent years, Terminology is becoming more important in the education of professional translators; analogously, techniques for automatic MWT identification play a fundamental role for MT systems that employ some semantic processing. However, while translation professionals quickly realised the importance of terminology for their work, this mutual dependency is not reflected in current NLP systems. Our work does not intend to completely cover this huge gap, but we think that a systematic and application-oriented evaluation of current MWT identification methods and techniques on a domain-specific corpus could indeed help to improve the quality of current MT systems. This approach could not only generate better translations but also (and perhaps most importantly) provide tools for quickly porting a system from one domain to another, since domain adaptation is a real challenge for MT systems.

There are three main reasons why we decided to focus on multi-word terminology instead of general terms: MWTs are frequent, essentially different from single-word terms and a big challenge for NLP. First, Jackendoff (1997) estimates that, in the general lexicon of a native speaker, multi-word units correspond to more than half of the entries. For Krieger and Finatto (2004), when it comes to the specialised lexicon, studies show that more than 70% of the terms are complex lexical units, composed of more than one word. They go further in the characterisation of terms, affirming that terminological units are often nouns and compound nouns (e.g. virus enhancer, T cell receptor), not rarely presenting Greek or Latin prefixes and suffixes (antibody, hepatitis), acronyms (HIV - Human Immunodeficiency Virus, EBV - Epstein-Barr Virus), abbreviations (Europarl - European Parliament) and non-verbal symbols like formulae and equations (H2O, O(n²)).

Second, Natural Language Processing methods that deal with MWTs are essentially different from lexical acquisition techniques for single words. Terminology extraction for single-word units is generally based on the context in which they occur, or on their morphosyntactic and semantic properties. On the other hand, MWE techniques often stay at a shallow level, trying to identify the relations between the words that compose a multi-word term regardless of its context. Since there is, at present, no unified technique able to handle both single-word and multi-word constructions elegantly enough, we decided to focus on the latter.

Third, multi-word phenomena concerning not only the specialised but also the general-purpose lexicon (e.g. collocations and institutionalised phraseology) constitute a big challenge and, in most cases, an obvious weakness of current NLP technology. Problems arise when some degree of semantic interpretation is needed: because of their non-compositional semantics, syntactic constraints and selectional preferences (most MWEs are fixed or semi-fixed constructions), these multi-word elements are often at the root of problems like missing lexical entries, incorrect sentence parses and awkward literal translations. In short, we expect that a systematic evaluation of MWE extraction methods focusing on specialised corpora might potentially improve not only our understanding of these phenomena but also the overall quality of NLP systems that integrate this kind of processing.

In short, correctly dealing with multi-word terminology is one of the problems found in current multi-lingual systems, not only for pure MT on domain-specific texts but also for specialised multi-lingual and cross-lingual information access and retrieval. Therefore, our main goal is to perform a careful and thorough evaluation of current multi-word expression technology applied to a domain-specific corpus, namely the Genia corpus, and, in a second stage, to study potential ways of integrating the extracted multi-word terminology with a Statistical Machine Translation (SMT) system.

The remainder of this document is organised as follows: first, we present a bibliographic review discussing related work on terminology, multi-word expressions and machine translation (chapter 2). Second, we describe in detail the experimental setup and the steps and resources required to achieve our goals (chapter 3). Third, we present a thorough and systematic evaluation of the proposed multi-word terminology extraction methods, as well as an application-oriented evaluation on a Machine Translation system (chapter 4). Finally, we present a discussion of the global results and sketch future directions for our research (chapter 5).

2 RELATED WORK

One of the biggest challenges in domain-specific multi-lingual access is the automatic detection of terminology. From a computational perspective, terminology is a complex phenomenon in natural language, not only because recognising terms in raw text involves the disambiguation of words in their contexts (e.g. collision is a computer networking term in packet collision rate but not in the cars had a collision), but also because it involves the automatic identification of domain-specific multi-word lexical units and their further processing. Multi-word terminological units are a specific type of Multi-Word Expression (MWE), i.e. sequences of words that co-occur frequently or whose semantics cannot be directly derived from the meaning of each of their parts. In the next sections, we present a discussion of related work dealing with generic multi-word expressions as well as some language- and type-specific MWE extraction and classification methods (§ 2.1). Then, we discuss some selected papers on (multi-word) terminology extraction and processing, including some experiments performed on the Genia corpus (§ 2.2). Finally, we give a brief introduction to Statistical Machine Translation (SMT), on which we intend to validate our results (§ 2.3).

2.1 About multi-word expressions

The term “MWE” covers a large set of distinct but related phenomena, ranging from very idiomatic ones like verb-particle constructions (e.g. give up in She gave up dance classes) to highly compositional examples like medical terminology (e.g. blood cell has compositional semantics but collocational occurrence). MWEs are a challenging problem for NLP because of their flexible and heterogeneous nature, that is, they are difficult to identify and to treat without human intervention. In NLP systems, there are two main techniques to deal with multi-word expressions: the words-with-spaces approach and the compositional approach (Sag et al. 2002). The first consists in considering MWEs as a purely lexical phenomenon and thus inserting these entities in system dictionaries as singleton lexemes. According to the classification proposed by Sag et al. (2002), words-with-spaces can handle only fixed MWEs; semi-fixed and variable expressions, sometimes highly productive, cannot be correctly dealt with at the lexical level. The compositional approach, on the other hand, treats the component words of a MWE using standard lexical entries, and thus allows the insertion of MWEs through special constraints on selected grammar rules. However, it is sometimes inadequate for expressions with very limited semantic or syntactic productivity, as it allows the components of a more rigid MWE to have the full flexibility of simplex words, resulting in unnatural variations such as ad extremely hoc or to kick millions of buckets. Although considered a challenging problem, MWE processing is a fundamental requirement of every NLP application performing some degree of semantic interpretation: for instance, for an MT system to translate a domain-specific term (e.g. blood cell), it needs to identify it as an atomic lexical entry and to deal with possible inflection, intervening material and semantic flexibility.

In addition to their heterogeneous characteristics, it is too costly and time-consuming to manually list multi-word expressions one by one in a (multi-lingual) lexicon, as they are too numerous: Jackendoff (1997) estimates that the size of the multi-word lexicon of a native speaker is comparable to the number of simplex words that he/she knows. Given their importance and pervasiveness in language, techniques for the automatic acquisition of MWEs have been proposed using statistical and symbolic approaches, including association measures, distributional methods, syntactic and semantic features and machine learning. Most of the work on MWEs focuses on a specific task1:

• Identification: This task deals with MWEs in vivo, i.e. it is necessary to identify MWE occurrences in real-world corpora. This constitutes a major problem for NLP because simply matching raw text against MWE dictionaries does not suffice to obtain acceptable recall. Given morphological, syntactic and semantic variability, it is necessary to define the bounds of a MWE taking the context into account, through morphosyntactic patterns, shallow or deep syntax and/or statistical properties.

• Interpretation: Two levels of interpretation, syntax and semantics, can be investigated. Discovering the syntactic relations between the component words of a MWE is essential, e.g. for compound nouns where bracketing is fixed although arbitrary. Here, it is also important to classify MWEs according to their idiomaticity level. Finally, semantic interpretation helps to identify the implicit semantic relations between the words and is essential for systems that perform deep linguistic processing.

• Classification: Classification differs from interpretation because it assumes compositional candidates and tries to assign them to pre-defined classes. Machine learning classification and clustering methods are used then to create (hierarchical) semantic associations between MWEs and thus disambiguate them.

• Applications: The previous three tasks are important in a large number of NLP applications, including information extraction, machine translation, information retrieval, etc.

Our work investigates MWE identification and its impact on applications. We will not investigate methods for MWE classification and interpretation. When it comes to MWE identification, most methods first generate a preliminary list of MWE candidates that are, in a second step, filtered automatically and/or manually before further processing. For the first step, shallow approaches like Part Of Speech (POS) tag patterns are often used, but they have low performance when the extracted MWEs are highly flexible. Deeper approaches, on the other hand, can be used for more targeted identification. For example, Seretan (2008) uses the output of a syntactic parser to extract collocation candidates from corpora, obtaining impressive results on several languages. Analogously, Baldwin (2005) uses several levels of syntactic features to generate verb-particle construction candidates from a text, showing that deeper candidate extraction yields higher recall and better precision. An n-gram error mining technique is used by Villavicencio et al. (2007) to extract generic MWE candidates from the British National Corpus. Manning and Schütze (1999) show that POS patterns can have surprisingly good performance on relatively small domain-specific corpora, as we will discuss in the next section. The identification of compound nouns is addressed by Protaziuk et al. (2007), who propose an approach that combines data mining methods with shallow linguistic processing for discovering compounds.

1This classification is freely inspired by the call for papers of MWE09, available at http://multiword.sourceforge.net/PHITE.php?sitesig=CONF&page=CONF_40_MWE_2009___lb__ACL__rb__

Multi-word expression filtering is often a semi-automatic process in which statistical measures help to filter the candidates according to their association strength. The evaluation of this task can be regarded as an IR problem in which true MWEs should be ranked first in a list of candidates. Villavicencio et al. (2007) present a technique based on statistical association measures to classify generic MWE candidates automatically generated by a deep parser. Seretan (2008) equally focuses on association measures to filter MWE candidates in a multi-lingual evaluation environment. Comparative evaluations of association measures have shown that there is no single best measure for extracting multi-word units, but that one should rather select a set of techniques that perform better for a specific MWE type and language (Evert and Krenn 2005, Pearce 2002). Since corpus statistics are often insufficient because of data sparseness, there has been intense research on using the Web as a corpus for collecting the n-gram statistics needed for association measures (Keller et al. 2002, Zhang et al. 2006, Villavicencio 2005, Grefenstette 1999, Kilgarriff and Grefenstette 2003). The combination of these measures using machine learning techniques has also been explored. In some of these experiments, Ramisch et al. (2008), using a single machine learning technique, found that statistical association measures are more useful in language- and type-specific MWE classification than in a generic task. Pecina (2008) has looked at feature selection, using more sophisticated machine learning techniques to select good features among a large list of 84 association measures for the identification of multi-word terms. Some research also combines linguistic and statistical information to distinguish true MWEs among all the putative candidates. Pearce (2001), for example, uses WordNet as the basis for collocation extraction from text, employing synonymy relations to distinguish collocational from anti-collocational candidates (e.g. strong tea vs *powerful tea).

Interpretation of MWEs normally takes into consideration the type and language of the construction. The automatic interpretation of English verb-particle constructions is addressed by Ramisch et al. (2008), who use several resources like corpus statistics, the Web, WordNet and Levin’s verb classes to classify the verb-particles according to their idiomaticity. Nakov and Hearst (2005) address the problem of compound noun bracketing by recursively comparing inter-word adjacency and dependency measures for each partition of the compound. In particular, the semantic classification of compound nouns has received considerable attention. Girju et al. (2005) use a distributional approach to build models for fine-grained semantic classification. In order to restrict the range of possible semantic relationships between the head and the modifier, which is potentially non-finite, several works focus on compound nominalisations where the head has a morphologically related verb and the modifier fills a syntactic argument relation with respect to this verb (Lapata 2002).

There has been little emphasis on MWE applications in real-life NLP systems. Our work tries to explore this gap in domain-specific texts. However, we believe that considerable advances still need to be made to integrate multi-word constructions with current NLP technology.
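To make the candidate filtering step more concrete, here is a minimal sketch (our own illustration, not code from any of the works cited above) that computes two classical bigram association measures, pointwise mutual information and the t-score, from raw corpus counts; the function name and the example counts are invented.

import math

def association_scores(c_xy, c_x, c_y, n):
    """PMI and t-score for a bigram (x, y), assuming c_xy > 0.

    c_xy: co-occurrence count of the bigram
    c_x, c_y: marginal counts of each word
    n: total number of bigrams in the corpus
    """
    observed = c_xy / n                # observed bigram probability
    expected = (c_x / n) * (c_y / n)   # probability expected under independence
    pmi = math.log2(observed / expected)
    # t-score: difference between observed and expected counts,
    # normalised by an approximation of the standard deviation
    t_score = (c_xy - expected * n) / math.sqrt(c_xy)
    return pmi, t_score

# Invented counts: a bigram seen 120 times in a corpus of 500,000 bigrams
print(association_scores(c_xy=120, c_x=400, c_y=900, n=500_000))

Candidates can then be ranked by either score, with strongly associated word pairs appearing at the top of the list.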

2.2 About multi-word terms

Multi-word terminology is not a recent concern of Computational Linguistics. Among the earlier efforts to automate terminographic extraction is the work of Justeson and Katz (1995). They use a set of selected POS tag patterns to extract MWT candidates and then filter them using their raw frequencies in the corpus. Although very simple, this method yields accurate results when high recall is not required, which makes it popular among NLP systems that perform some domain-specific task. Frantzi et al. (2000) present a mixed approach to extract multi-word terminology from text, in which candidates are extracted using shallow POS patterns and then a statistical measure, the C-value, is used to guarantee that the extracted pattern is actually a MWT. The C-value can handle nested terms in a sophisticated way, extracting arbitrarily long terms. They compare their technique with raw frequency filtering, though they do not take into account advances in MWE association measures. Smadja (1993) proposed Xtract for collocation extraction from text, using a combination of n-grams and a mutual information measure. On general-purpose texts, Xtract has a precision of around 80% for identifying collocational units. Although, in principle, it could be used to extract multi-word terminology from domain-specific texts, as its aim is not high recall, it may not be very effective in identifying a significant amount of arbitrary domain-specific MWEs.

From the syntactic point of view, terms of areas like the automotive industry, biomedical articles or chemistry documents often follow a particular pattern of combination of nouns and adjectives, i.e. compound nouns and related constructions. For instance, in the biomedical abstracts of the Genia corpus, the terms are disease and drug names, chemical elements, body parts and other names (Ohta et al. 2002). Thus, techniques for the automatic identification of compound nouns such as the ones described in the previous section, when applied to domain-specific data, could be of great help in multi-word terminological extraction.

Considering MWT interpretation and classification, SanJuan et al. (2005) present a technique to compare an automatically extracted term hierarchy with a manually built ontology for the Genia corpus. They combine three linguistic resources and techniques, namely statistical association, WordNet and clustering, to build a hierarchical model over the manually annotated terms of the corpus. They then compare their model to traditional clustering algorithms and to the manually created ontology of the corpus.

Our work differs from the previous ones in several aspects. First, we do not intend to create a hierarchical structure of the extracted terms. Second, even if we perform shallow POS pattern extraction, we are going to use MWE identification techniques to filter MWT candidates, thus evaluating whether these methods can be applied to domain-specific terminological MWE extraction. Therefore, we do not rely only on raw corpus frequencies to filter the candidates, but also on more sophisticated resources like association measures, Web frequencies and machine learning.
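For reference, the C-value mentioned above for a candidate term a is usually presented as follows (our rendering of the measure of Frantzi et al. (2000), with our own notation): if a is not nested in any longer candidate,

C-value(a) = log2 |a| × f(a),

and otherwise

C-value(a) = log2 |a| × ( f(a) − (1 / P(T_a)) × Σ_{b ∈ T_a} f(b) ),

where |a| is the length of a in words, f(a) its frequency in the corpus, T_a the set of longer candidates that contain a, and P(T_a) the number of such candidates. The subtraction discounts occurrences of a that are only part of longer terms, which is how nested terms are handled.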

2.3 About machine translation

Machine translation technology is nowadays incapable of generating fully automatic translations with both high coverage and high linguistic precision. Therefore, intermediary solutions have been proposed that choose to optimise two of these aspects at the expense of the third. For instance, collaborative translation systems, like the one described by Bey et al. (2006), propose a wide-coverage translation environment where the trade-off between automaticity and quality is customisable. Analogously, Boitet et al. (2008) present a review of software and site localisation techniques, proposing a collaborative gateway for multi-lingual access on the Web.

The field of Statistical Machine Translation (SMT) has received special attention in the last few years for several reasons: very large parallel corpora became available, successful online SMT systems have been launched2 and, with SMT, one can quickly build a translation system for every language pair for which there is enough parallel data, avoiding the costly effort of coding transfer rules by hand. SMT systems learn translation models automatically from parallel data, through probabilistic machine learning algorithms. One of the most popular translation models used to date is the IBM model described by Brown et al. (1993), where the score of a hypothesis (a partial translation) is the result of the combination of several sub-models that include word fertility (the probability of a word being translated as more than one word), the word-to-word translation model and the reordering model (the probability that a sequence of words s1 s2 is translated as t2 t1). According to Lopez (2008), numerous variations of this model have been proposed, and there is in particular a very clear distinction between generative models (such as IBM-4), which estimate the probability of a source sentence given a possible translation (i.e. t = argmax_t p(s | t)), and discriminative models, where the translation is evaluated with respect to a (log-linear) feature function (t = argmax_t p(t | s)).

2For instance, Google claims that they use statistical MT in their translation service. http://www.google.com/intl/en/help/faq_translation.html

Statistical translation models are based on parallel corpora aligned at the word level. However, since manual word alignment of corpora is nearly impossible, given the dimensions of the data, this parameter estimation problem is usually solved through an Expectation-Maximisation (EM) algorithm. The EM algorithm is an unsupervised learning technique that tries to maximise the model parameters based on the probability of the seen data and the model of unseen data (Dempster et al. 1977). This word-alignment approach can be improved with some heuristics: for instance, we can generate alignments in both directions (from source to target language and vice versa) and then only consider their intersection. Afterwards, some rules (which must be tuned according to the task) can be used to expand the alignments and cover all the words.

In terms of MWE identification, one relevant distinction made in SMT systems is that between word-based and phrase-based SMT. Here, the term phrase is employed as a synonym of word sequence, which does not necessarily correspond to a linguistically coherent phrase (e.g. a noun phrase or a prepositional phrase).3 Phrase-based models are able to capture some syntactic restrictions about word adjacency and limited local reordering. In phrase-based models, for instance, information about fixed or semi-fixed multi-word units is easily integrated into the translation model, whereas highly compositional phenomena such as verb-particle constructions are too complex to be handled with this n-gram-based approach. Therefore, the trend in the SMT area seems to be the integration of higher-level information (morphology, syntax) into phrase-based SMT, in so-called factored MT systems.

Few works explore the relationship between word alignment for MT and MWE extraction. Among them, we cite Melamed (1997), who looks into the discovery of non-compositional compounds from parallel corpora using statistical methods. Similarly, Caseli et al. (2009) focus on multi-lingual MWE extraction as a by-product of the word alignment discovery process and carefully evaluate the quality of MWE candidates extracted using n : m word alignments. This work provides an in-depth analysis of some of these methods and aims at shedding some light on MWE identification for improving the quality of MT systems.
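For concreteness, the two modelling styles mentioned above can be written out as follows (our own rendering in standard SMT notation, not a formula taken from this thesis). In the generative (noisy-channel) formulation, the best translation t* of a source sentence s is

t* = argmax_t p(t | s) = argmax_t p(t) × p(s | t),

where p(t) is a target-language model and p(s | t) a translation model, whereas in the log-linear formulation the translation is scored by a weighted sum of feature functions h_i (translation, language and reordering models, among others):

t* = argmax_t Σ_i λ_i × h_i(s, t).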

3A fairly comprehensive evaluation of phrase-based SMT can be found in Koehn et al. (2003).

3 BIOMEDICAL TERMINOLOGY IDENTIFICATION

In this chapter, our goal is to describe the process that we employed in order to deal with terminology and multi-lingualism in a domain-specific corpus of the biomedical domain, the Genia corpus. We will start by describing our corpus (§ 3.1) and then we will explain the four steps that we performed in our evaluation: pre-processing (§ 3.2), Part Of Speech (POS) tag pattern extraction (§ 3.3), learning through association measures (§ 3.4), and MT integration (§ 3.5).

3.1 The Genia corpus

The Genia corpus is a collection of texts retrieved from the Medline library through the MEdical Subject Headings (MeSH) thesaurus, and contains paper abstracts for the keywords “human”, “blood cell” and “transcription factor” (Ohta et al. 2002). After downloading version 3.01, correcting minor tagging errors and eliminating truncated (thus ungrammatical) sentences of the XML file, we analysed its structure. The corpus has several annotations, including Part Of Speech tags and terminology. Table 3.1 summarises the dimensions of the raw corpus, containing 2,000 abstracts with around 18.5K sentences and 490.7K tokens2, yielding an average sentence length of 26.5 tokens per sentence. Compared to generic corpora like Europarl (40M tokens) or the British National Corpus (100M tokens), the Genia corpus is small. However, it contains domain-specific information that makes it a very interesting resource for terminology identification.

Table 3.1: Raw corpus statistics.
Abstracts                                2,000
Sentences                                18,519
Tokens                                   490,752
Average sentence length                  26.5 tokens
Annotated terms                          97,876
Annotated multi-word terms               55,487
Average terms/sentence                   5.29
Average multi-word terms/sentence        3
Average term length                      2.06 tokens
Average multi-word term length           2.86 tokens

The corpus annotation for terminology involves terms taken from a domain-specific ontology

1The Genia corpus is freely available at www-tsujii.is.s.u-tokyo.ac.jp/~genia/topics/Corpus/.
2To avoid ambiguity, we use token to refer to any occurrence of a word form in the corpus. The term word, on the other hand, designates a type, that is, a distinct word form, counted once for all its occurrences in the corpus. Morphological inflection distinguishes one word from another.

(Kim et al. 2006). They include single- and multi-word terms such as disease names (e.g., fibroblastic tumours) and cell types (e.g., primary T lymphocyte). The corpus contains around 97.8K annotated terms (including several instances of the same term), more than half of which are composed of more than one word. This proportion agrees with the theoretical estimations of multi-word terms presented in chapter 1. The annotated terms have around 2 words on average, whereas if we only consider multi-word terms, the average length grows to almost 3 tokens. Terms can be arbitrarily long; the maximum length of a term in the corpus is 23 tokens. On average, each sentence contains around 5 terms, of which 3 are multi-word. This demonstrates how much domain-specific information is present in the text, showing again the importance of correctly handling terminology in any NLP application dealing with domain-specific texts such as our biomedical abstracts.

The algorithms and experiments described in the remainder of this chapter were implemented as a set of scripts, most of them written in Python, awk and bash, dealing with XML and TAB-separated text as intermediary or temporary files. This material is freely available under the GNU GPL at www.inf.ufrgs.br/~ceramisch/tcc.zip. A fragment of the output of our program can be consulted in appendix E.

3.2 Pre-processing

The first step of our MWT extraction method is to pre-process the corpus so that our algorithms can achieve optimal performance with minimum noise. All the steps described in this section are based on heuristics and simple transformation rules. Therefore, they are applied homogeneously to the whole corpus. In later sections, we will separate a training set from a test set, but for the moment there is no point in splitting the corpus.

First, we lowercased3 the corpus and generated a simplified XML format containing abstracts, sentences and multi-word terms, eliminating all single-word terminology annotation. To better understand the issues involved in pre-processing, let us consider the following example sentence taken from the raw corpus:

Example 3.1 We examined alpha A1 (an alpha A-gene product) and alpha B1 and alpha B2 (two alpha B-encoded isomers) for their effects on the GM-CSF promoter.

Figure 3.1(a) shows the same example sentence annotated for multi-word terms in XML format. It is easier to understand the structure of the sentence if we look at the XML tree it represents, as in figure 3.1(b). Note that most of the multi-word terms in the corpus present a highly nested structure: it is not rare to find three or four nesting levels. Since we did not employ an elegant solution for them, we will consider each nested multi-word term as a separate unit, independently of its internal structure.

[Figure 3.1: Example sentence annotated for multi-word terms. (a) Raw XML; (b) Tree structure.]

At this point, one could wonder why we are trying to extract terms from a corpus that is already annotated for terminology. As stated in section 1.2, our goal is to evaluate how well MWE extraction techniques perform in the task of multi-word terminology extraction on domain-specific documents. The annotation will not be used to extract the terms, but only to evaluate them. Without an annotated corpus as gold standard, we would not be able to perform automatic evaluation of the technique. Therefore, we need to make an important distinction between a candidate set and a gold standard.

Definition 3.1 (MWT candidate) A MWT candidate is a sequence of words in the corpus that is possibly (and potentially) a MWT. When we extract a MWT candidate from the corpus, we do not consider term annotation, but only the lexical, syntactic and frequency characteristics of the candidate itself.

3Case information is important in domain-specific texts (e.g. chemical elements NaCl or named entities Bill Gates, March), but can also introduce noise when capitalised words at the beginning of a sentence differ from their occurrences in other positions. Sophisticated lowercasing schemes exist to deal with such phenomena, and could be used in future work.

Definition 3.2 (MWT Gold Standard) The MWT Gold Standard (GS) is the set of all annotated terms in the corpus. In order to be attested, a MWT candidate must be in the gold standard. The GS is used to automatically evaluate the quality of the extracted MWT candidates list.

After pre-processing, the gold standard of the Genia corpus contains 28,340 distinct MWTs. Notice that there are 55,487 occurrences of multi-word terms in the corpus, i.e. terms in the GS might occur more than once in the corpus. On the other hand, after pre-processing and POS pattern extraction, we obtained a set of 63,210 distinct MWT candidates. For any arbitrary set C of MWT candidates, it is possible to calculate precision P(C) and recall R(C) with respect to a given GS (where #(·) denotes the size of a set):

P(C) = #(C ∩ GS) / #(C)    (3.1)

R(C) = #(C ∩ GS) / #(GS)    (3.2)

We will also use the harmonic mean of precision and recall, i.e. the F-measure, to evaluate the quality of a candidate set C. Throughout this and the next chapter, the F-measure is calculated as follows:

F(C) = 1 / ( (1/2) × (1/P(C) + 1/R(C)) ) = 2 × P(C) × R(C) / (P(C) + R(C))

The next sections describe in detail our pre-processing heuristics concerning POS tagging, tokenisation, lemmatisation and acronym removal. It is important, at this point, to warn the reader that some of these steps are domain- and/or language-dependent, i.e. they cannot be directly applied to an arbitrary MWT extraction environment.
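As an illustration of how these measures are computed, the following minimal sketch (our own, not one of the thesis scripts) evaluates a candidate set against a gold standard, both represented as Python sets of normalised strings; the example sets are invented.

def evaluate(candidates, gold_standard):
    """Precision, recall and F-measure of a MWT candidate set against a
    gold standard, following equations (3.1) and (3.2)."""
    true_positives = len(candidates & gold_standard)
    precision = true_positives / len(candidates)
    recall = true_positives / len(gold_standard)
    f_measure = (2 * precision * recall / (precision + recall)) if true_positives else 0.0
    return precision, recall, f_measure

# Invented example: 2 of the 3 candidates are attested in the gold standard
candidates = {"t cell receptor", "binding site", "blood cell count"}
gold_standard = {"t cell receptor", "binding site", "transcription factor"}
print(evaluate(candidates, gold_standard))  # approximately (0.667, 0.667, 0.667)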

3.2.1 Tokenisation and POS tags

A first problem arises from an incompatibility between some Part Of Speech tags and the term annotation. Consider the following fragment extracted from example 3.1, where we used brackets to delimit terms and the special separator “/” for the POS tags:

Example 3.2 [ [ alpha/N a/UKN ] -gene/A product/N ]

In this example, N stands for noun, A for adjective and UKN for unknown. During pre-processing, we replaced the detailed POS annotation of the Genia tag set (e.g. NN, NNS, NNP, ...) with simplified morphosyntactic tags (e.g. N) that are more adapted to our POS pattern selection rules described in section 3.3. The tag transformation rules are listed in appendix C. The term annotation of Genia breaks words on hyphens and slashes, but the POS tag annotation does not. Therefore, the word a-gene/A of example 3.2 is split into two parts, one of which has an unknown tag. Our MWT extraction method uses POS patterns to extract term candidates, so we cannot allow such inconsistencies in our input. A straightforward solution to this problem would be to ignore all sentences containing such noisy patterns. However, after removing all single-word term annotation, there are still 3,152 sentences containing UKN tags, representing 17% of the corpus. Therefore, we employed a simple heuristic that switches the position of the erroneous word and of the term annotation marker, bringing the POS-tagged word inside the term and concatenating it with the preceding or succeeding word. In example 3.2, we generate the following annotation:

Example 3.3 [ [ alpha/N a-gene/A ] product/N ]

With this simple modification, we are able to correct 3,071 tagging problems and only discard 81 sentences at the price of less detailed information in term annotation. Even with this modification, we still encountered leading slashes and dashes in some tokens. Thus, for each token in the original corpus we generated a re-tokenised version eliminating all leading and trailing punctuation signs. We evaluate the impact of re-tokenisation in section 4.1.1.
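A minimal sketch of this re-tokenisation step is given below (our own illustration, not the actual thesis script); it simply strips leading and trailing punctuation from each token, leaving internal punctuation untouched.

import string

def retokenise(token):
    """Strip leading and trailing punctuation signs from a token;
    punctuation inside the token (e.g. internal hyphens) is preserved."""
    return token.strip(string.punctuation)

print(retokenise("-gene"))      # gene
print(retokenise("(gm-csf)"))   # gm-csf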

3.2.2 Lemmas and plurals

The second problem concerns the fact that some candidates show number inflection when they occur in the corpus, i.e. they might be either singular or plural. Ideally, we would like T cell receptor and T cell receptors to be considered as a single MWT candidate. Therefore, we need to lemmatise the latter so that it becomes equal to the former. If we used an off-the-shelf lemmatiser, however, it is likely that we would obtain tokens like bind instead of binding for the compound noun binding site. Another possibility would be to implement a very simple lemmatiser that eliminates every “s” at the end of a word, taking at least regular plurals into account. However, in a domain such as biomedicine, we cannot transform words like virus into viru* or viruses into viruse*. Finally, for greater control of the process and outcome, we implemented some simple yet robust rules to generate the singular form of an arbitrary word. Our singularisation algorithm generates all possible singular forms based on regular and semi-regular English plural terminations and then searches the Web for the singular candidates. The candidate that appears in most pages is then returned, as described by algorithm 2 in appendix D. The algorithm does not cover cases like children → child or criteria → criterion; nonetheless it increases the quality of both the candidates and the GS, as discussed in section 4.1.1.
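The singularisation heuristic can be sketched roughly as follows; this is our own illustration, in which web_page_count stands for a hypothetical wrapper around a search-engine API and the termination rules are a simplified subset (the real procedure is algorithm 2 in appendix D).

def singular_candidates(word):
    """Generate possible singular forms for regular and semi-regular
    English plural terminations, including the word itself (e.g. 'virus')."""
    candidates = [word]                      # the word may already be singular
    if word.endswith("ies"):
        candidates.append(word[:-3] + "y")   # bodies -> body
    if word.endswith("es"):
        candidates.append(word[:-2])         # viruses -> virus
    if word.endswith("s"):
        candidates.append(word[:-1])         # receptors -> receptor
    return candidates

def singularise(word, web_page_count):
    """Return the candidate attested in the largest number of Web pages.
    web_page_count is a hypothetical callable (e.g. a search-engine wrapper)."""
    return max(singular_candidates(word), key=web_page_count)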

3.2.3 Acronyms and parentheses

A third problem appears because biomedical text (and scientific text in general) tends to contain acronyms inserted in the text to avoid useless repetition. For instance, in our text we created acronyms for recurrent terms such as Multi-Word Term (MWT) and Natural Language Processing (NLP). According to stylistic conventions, acronyms are normally placed inside parentheses right after a term or expression. Since we are dealing with multi-word terminology, some of our term candidates do contain such intervening material, as shown in example 3.4:

Example 3.4 (Intervening acronym) [ [ granulocyte-macrophage colony- stimulating factor ] (gm-csf) promoter ]

Such acronyms introduce noise into our results, and they are very frequent: 1,662 out of 54,629 MWT occurrences follow this pattern (around 3% of the GS). In order to remove them, we cannot simply eliminate all the parentheses in the text, since important information could be lost. Instead, we implemented another set of heuristics to detect the acronyms and their corresponding extended forms in the text. Algorithm 1, described in detail in appendix D, generates a list of acronyms and their correspondences from the corpus. The set of attested acronyms is then used to eliminate all of their occurrences inside parentheses. This list can be considered a by-product of our pre-processing steps and will be used again in section 3.4. The algorithm returns a match score that is higher when the acronym corresponds to a series of initials and numbers appearing in the same order as in the extended candidate. If the maximum LCS score is greater than a threshold, the candidate is added to the list of attested acronyms. All the values of the scoring scheme were defined empirically after several trials. It is interesting to notice that case information could be useful at this point to solve ambiguous cases or even to create a fine-grained scoring scheme. However, since we performed case homogenisation in previous steps, this information is not available anymore.

Similar techniques have already been employed to solve a large set of problems in NLP, including multi-word expression extraction itself. Longest common subsequence algorithms have been used to identify MWEs in corpora because they can detect co-occurrence with some flexibility (Duan et al. 2006). The solution implemented here was proposed by the author following some example-based machine translation techniques (Biçici and Dymetman 2008). Such an algorithm could be particularly useful in the automatic evaluation of MWEs, providing fuzzy matches between a candidate and an entry in a dictionary. A didactic and detailed explanation of string matching algorithms can be found in Gusfield (1997). Table 3.2 shows some examples of automatically extracted acronyms, including cases in which the acronym is not a trivial concatenation of the initials of the words that define its meaning (here called the extended form). A detailed evaluation of these heuristics is presented in section 4.1.2.
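A minimal sketch of this kind of LCS-based matching is shown below. It is only an illustration of the idea (score based on the longest common subsequence between the acronym and the initials of the candidate extended form, with an arbitrary threshold), not the exact scoring scheme of algorithm 1.

```python
import re

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings (dynamic programming)."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            if ca == cb:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]

def acronym_score(acronym: str, extended: str) -> float:
    """Fraction of acronym characters found, in order, among the initials of the extended form."""
    acr = re.sub(r"[^a-z0-9]", "", acronym.lower())
    initials = "".join(p[0] for p in re.split(r"[^a-z0-9]+", extended.lower()) if p)
    return lcs_length(acr, initials) / len(acr) if acr else 0.0

# A pair would be attested if the score exceeds an empirically chosen threshold.
print(acronym_score("gm-csf", "granulocyte-macrophage colony-stimulating factor"))  # -> 1.0
```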

Table 3.2: Example of automatically extracted acronyms and their extended forms.

    Acronym   Extended form
    AB        antibody
    AC        adenylyl cyclase
    ACRF      advanced chronic renal failure
    ACTH      adrenocorticotropic hormone
    AD        alzheimer 's disease
    AD        atopic dermatitis
    AD        adenovirus
    AD2       adenovirus 2
    AD3       a and b*
    AD5       adenovirus serotype 5
    3OMG      3 o methyl d glucose
    9CISRA    9 cis retinoic acid
    5'UTR     5' untranslated leader region
    5LOX      5 lipoxygenase

3.3 Part Of Speech patterns

In the previous section, we executed a series of pre-processing steps to potentially improve the quality of our MWT candidates. Starting from a raw XML corpus, we re-tokenised noisy words, replaced and filtered out unknown POS tags, lemmatised plural forms and removed parenthesised acronyms. Now that we have obtained a clean version of the initial corpus, it is time to extract the multi-word term candidates from the text.

Table 3.3: Some selected POS patterns used to extract MWT candidates, with occurrence examples.

    Pattern      Example
    N-N          zinc sulfate
    A-N          malignant tissue
    N-N-N        zinc finger type
    A-N-N        lymphoblastoid cell line
    A-A-N        xenoreactive natural antibody
    A-N-N-N      cyclic adenosine monophosphate signaling
    N-N-N-N      dna binding protein complex
    N-A-N        colostrum inhibitory factor
    A-A-N-N      fetal nucleated blood cell
    N-A-N-N      kappa light chain gene
    N-NUM        nucleotide 46
    A-N-N-N-N    whole cell dexamethasone binding assay
    A-CC-A-N     young and aged subject
    N-N-N-N-N    lymphocyte glucocorticoid receptor binding parameter
    N-N-A-N      laser scanning confocal microscopy

Our set of MWT candidates is composed of word sequences of size n ≥ 2 that follow recurrent Part Of Speech patterns in the corpus. Since the discovery of POS patterns is based on observation of the corpus, at this point we need to set aside part of the data for testing and use the rest for training. The previous steps were based on intuitive heuristics and were thus validated and applied on the entire corpus. Here, however, we are observing and learning from the corpus, meaning that evaluation cannot be performed on the same data, to avoid overfitting. Henceforth, Genia-test refers to the last 100 abstracts of the corpus, i.e. 895 sentences and almost 24K tokens. Genia-train corresponds to the remainder of the original pre-processed corpus and will be used throughout the next sections of the current chapter. We reserve Genia-test until we present the evaluation of our method in chapter 4. To extract MWT candidates from the corpus, we proceeded as follows:

1. Based on term annotation, we extracted the POS patterns of the Genia-train gold standard.

2. We selected a set of 118 patterns that occurred more than 10 times.

3. From these top-recall patterns, we selected those with precision on Genia-train higher than 30%, for a total of 57 patterns.

4. Our candidate list is composed of all n-grams of the corpus that obey these 57 POS patterns.

Figure 3.2: Precision and recall of 118 top-recall POS patterns in Genia-train.

Some of the patterns selected with this procedure are listed in table 3.3⁴. We can observe the variability of the terms in the corpus from the number of different POS patterns they present. There are 1,177 different tag sequences, 700 of which are hapax legomena⁵. Therefore, we initially selected only those appearing more than 10 times, keeping the set of POS patterns reasonably small and avoiding irrelevant noise. At this point, we were extracting 92.57% of the terms with a precision of 16.50%. With those patterns, we then extracted all possible candidates from Genia-train and counted the number of true and false positives. Precision and recall were calculated for different precision thresholds, as depicted in figure 3.2. We can see that, while precision is roughly linear with the threshold, recall drops drastically around 30%. Therefore, from the initial 118 patterns, we kept only the 57 whose precision is above 30%. The output of this step is a list of 60,585 multi-word candidates that obey the selected POS patterns. Among them, we have 22,507 true instances and 38,078 random word combinations, yielding a baseline precision of 31.52%. With pattern selection we thus almost doubled precision at the price of losing 3 points of recall, which now corresponds to 89.43% of the candidates contained in the GS. For a detailed evaluation, please consult section 4.2.1.
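The candidate extraction itself (step 4 of the list above) can be sketched as follows; the pattern set shown here is only a small illustrative subset of the 57 selected patterns, and the data structures are our own choice for the example.

```python
from collections import Counter

# A small illustrative subset of the selected POS patterns (tuples of simplified tags).
SELECTED_PATTERNS = {("N", "N"), ("A", "N"), ("N", "N", "N"), ("A", "N", "N")}

def extract_candidates(sentences):
    """Count all n-grams whose POS-tag sequence matches a selected pattern.

    `sentences` is a list of sentences, each a list of (word, tag) pairs."""
    counts = Counter()
    max_len = max(len(p) for p in SELECTED_PATTERNS)
    for sentence in sentences:
        for start in range(len(sentence)):
            for n in range(2, max_len + 1):
                gram = sentence[start:start + n]
                if len(gram) < n:
                    break
                words, tags = zip(*gram)
                if tags in SELECTED_PATTERNS:
                    counts[words] += 1
    return counts

sentences = [[("t", "N"), ("cell", "N"), ("receptor", "N"), ("binds", "V")]]
print(extract_candidates(sentences))
# -> counts of 1 for ('t', 'cell'), ('t', 'cell', 'receptor') and ('cell', 'receptor')
```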

3.4 Association measures

Once we have obtained a list of candidate MWTs, we would like to filter them based on their association strength. The notion of "association strength" is related to the probability of co-occurrence of the words composing a candidate. In this work, we focus on statistical evidence of strong association, following recent methods for MWE extraction (Evert and Krenn 2005, Ramisch et al. 2008, Pearce 2002, Pecina 2008). By now, it should be clear that we will deal with the frequencies⁶ and probabilities of words in the corpus, and that sparseness will be an obstacle. This is a well-known phenomenon in NLP, related to the Zipfian distribution of words in text and to the so-called "long tail" of a rank plot of the word distribution (Newman 2005, Baayen 2001).

⁴ The complete list of POS patterns can be consulted in appendix C.
⁵ The Greek term hapax legomena is employed here to denote any element (word, POS pattern, MWT, etc.) occurring only once in the data set.
⁶ In this work, we will use the term frequency as a synonym of occurrence count, following NLP reference textbooks (Manning and Schütze 1999).

Figure 3.3: Rank plot of the Genia-train vocabulary.

A simple example is the histogram in figure 3.3, where we plotted, in logarithmic scale, the word frequency distribution of Genia-train according to the rank of each word in the corpus vocabulary.

In order to deal with data sparseness, several works suggest using the World Wide Web as a linguistic resource that can provide word and n-gram frequency values, claiming that it is possible to approximate the occurrence count of a word in general language by looking up how many Web pages contain that word. Approximating word frequency by a document frequency index surely introduces some noise into the word counts, but this effect should be minimised by the size of the Web (Grefenstette 1999, Kilgarriff and Grefenstette 2003, Villavicencio 2005, Keller and Lapata 2003). An interesting problem arises when we try to transform Web frequencies into Web probabilities, because we need to answer the question "How many documents are there on the Web?". Since this is an active research subject, and since we do not require precision (this number acts as a scale factor with the same weight for each candidate), we are happy to use a reasonably high estimation. In the following equations, we assume that the Web contains around 50 billion documents, which is probably an overestimation but "good enough" in this context⁷ (de Kunder 2007).

Despite sparseness, we believe that there is valuable information in the word frequencies of the candidates, so we extracted not only candidate frequencies but also marginal word frequencies from the corpus. Here, we used the list of acronyms generated in section 3.2.3 to expand the frequencies of the words that compose an acronym, e.g. for every instance of hiv, we also increment the counters for the words human, immunodeficiency and virus. This simple substitution is not only semantically coherent (the meaning of the acronym is expressed by the words that compose it) but also diminishes the number of hapax legomena in the vocabulary, from 4,903 to 4,835. We therefore use two sources: the Genia corpus itself, and the Web through the Yahoo! Web Search API⁸. We implemented a simple cache mechanism that stores older searches for reuse while they are valid and that distributes processing over several machines⁹. Even though we use Yahoo!'s search index as an abstraction of the whole Web, we will call it the Web corpus. At this point, then, we have a list of candidates containing the following information (see the record sketch after the list):

• The re-tokenised candidate (with words in the singular form);

• The Part-of-Speech pattern of the candidate;

• The frequency of the candidate and of its words in the Genia-train corpus;

• The frequency of the candidate and of its words in the Web corpus.

⁷ Up-to-date estimation obtained on May 30, 2009 at http://www.worldwidewebsize.com/.
⁸ Available at: http://developer.yahoo.com/download/download.html
⁹ Since Yahoo! limits the service to 5,000 queries per day per IP address, we distributed processing to work around this constraint.
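For concreteness, each entry in this candidate list can be thought of as a record like the one below; the field names are ours, chosen for illustration only.

```python
from dataclasses import dataclass

@dataclass
class CandidateMWT:
    """One MWT candidate with the information listed above (illustrative field names)."""
    words: tuple[str, ...]              # re-tokenised words, in singular form
    pos_pattern: tuple[str, ...]        # e.g. ("N", "N", "N")
    freq_genia: int                     # frequency of the whole candidate in Genia-train
    word_freqs_genia: tuple[int, ...]   # marginal frequencies of each word in Genia-train
    freq_web: int                       # estimated page count of the candidate on the Web
    word_freqs_web: tuple[int, ...]     # estimated page counts of each word on the Web
```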

The validation of the candidate MWTs is done using a set of statistical Association Measures (AM): the raw probability estimated through maximum likelihood (prob), Pointwise Mutual Information (PMI), the Student t-test (t) and the Dice coefficient (Dice). These typical association measures are summarised in table 3.4 and are calculated from the frequencies obtained in both the Genia-train corpus and the Web corpus. Formally, we consider the candidate as a generic n-gram with n word tokens w1 through wn. The frequency of a contiguous sequence wi ... wj in a corpus of size N is denoted by fC(wi ... wj). To measure the strength of the association between the words of the candidate, we compare its observed probability with the marginal probabilities of its component words, expecting the probability of co-occurrence of the words in a MWT to be higher than that of a random combination of n words. The t measure is actually a ranking interpretation of a classical hypothesis test, whose null hypothesis assumes statistical independence, i.e. that in general human language¹⁰ the frequency of a sequence corresponds to the product of the probabilities of its members scaled by the size of the corpus:

$$ H_0 : f_{\emptyset}(w_1 \ldots w_n) = \frac{f(w_1) \cdots f(w_n)}{N^{n-1}} $$

Table 3.4: Statistical measures used to validate MWT candidates.

    Measure   Formula
    prob      f(w1 ... wn) / N
    PMI       log2 [ f(w1 ... wn) / f_∅(w1 ... wn) ]
    t         [ f(w1 ... wn) − f_∅(w1 ... wn) ] / sqrt( f(w1 ... wn) )
    Dice      n · f(w1 ... wn) / Σ_{i=1..n} f(wi)
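The four measures of table 3.4 can be computed directly from the candidate and word frequencies; the sketch below follows the formulas above, with f_∅ the expected frequency under the null hypothesis, and the numbers in the example are toy values chosen for illustration.

```python
from math import log2, sqrt, prod

def expected_freq(word_freqs, corpus_size):
    """Expected frequency under H0: f(w1)...f(wn) / N^(n-1)."""
    n = len(word_freqs)
    return prod(word_freqs) / corpus_size ** (n - 1)

def association_measures(freq, word_freqs, corpus_size):
    """Return prob, PMI, t and Dice for one candidate n-gram."""
    f0 = expected_freq(word_freqs, corpus_size)
    return {
        "prob": freq / corpus_size,
        "pmi": log2(freq / f0),
        "t": (freq - f0) / sqrt(freq),
        "dice": len(word_freqs) * freq / sum(word_freqs),
    }

# Toy example: a bigram occurring 50 times in a 100,000-token corpus,
# whose words occur 300 and 200 times respectively.
print(association_measures(50, (300, 200), 100_000))
```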

Most of the recent advances in statistical MWE extraction also use measures based on contingency tables, i.e. instead of looking only at the marginal frequencies of the words, they also consider their probability of "non-occurrence", building a table with the possible combinations of wi and ¬wi, for all i. Even though such measures are very robust for rare events and do not rely on unjustified assumptions (e.g. the t-test assumes that words are normally distributed whereas they actually follow a power-law distribution), these methods are only suitable for n-grams where n = 2 or n = 3, and cannot be intuitively extended to candidates of arbitrary length. In our case, the Web corpus often generates contingency table incoherences, for instance when a bigram w1w2 has a frequency f while the word w1 appears f1 < f times in the corpus.

Web search engines tend to generate rough approximations of the number of retrieved documents instead of counting precisely how many results are obtained, mostly because users never go through the whole result list. Moreover, previous experiments showed that a contingency hypercube based on Web frequencies would be so noisy as to be no more useful than random scores. Therefore, we focused only on pointwise and heuristic association measures. Future work could explore the impact of adapting contingency-table methods to arbitrary n-grams, using some divide-and-conquer technique. The choice of one or of a set of statistical association measures remains an open problem in MWE extraction. Even though several works focus on evaluating association measures for multi-word unit identification, the results are hard to generalise to different contexts. An interesting discussion of this topic can be found in Seretan (2008).

Once we have obtained the different association measures for the MWT candidates, we combine them using Machine Learning (ML) algorithms, as in Pecina (2008). We use the annotation of the Genia-train part of the corpus as classes, i.e. if a MWT candidate is contained in the gold standard, it has the class MWT; otherwise it belongs to the class NonMWT. Since we only use the annotation of the training part of the corpus, we can evaluate our method on the test part without overfitting the models generated by the learning algorithms.

A wide range of ML algorithms is available to test on our data set. We tried several of them: decision trees, Bayesian algorithms, Support Vector Machines (SVM) and Multi-Layer Perceptrons (MLP). Decision trees provide intuitive models that are easy to interpret and, during our tests, seemed to offer a good cost/benefit ratio in terms of the time taken to build the model and its quality. The algorithm builds a tree by successively subdividing the set of instances according to an entropy gain measure, trying to minimise the variability in the leaves of the tree.

Bayesian learning is especially fit for numeric data such as our AMs, and we thus expect to obtain good results with this technique. Naive Bayesian learning treats each feature as a Gaussian distribution and then tries to estimate the parameters of the distributions, namely mean and variance. We employ a more sophisticated algorithm that builds the entire Bayes net of feature/class dependencies and then uses a heuristic search method to optimise its parameters (transition probabilities).

Finally, SVM and MLP are universal function approximators and are particularly robust on difficult problems (although they are very costly in terms of computational resources). Neural networks are based on simple units, called artificial neurons, that store the model of the data in their synaptic weights. Support vector machines optimise the class separation border through numeric methods, being very flexible about the underlying formalism of the border function. In SVM, the "kernel" corresponds to the type of function that is used as class separator. Common kernels are linear, polynomial and Gaussian functions.

¹⁰ Here we use the statistical notation f_∅(·) for the null hypothesis, while the subscript notation fC(·) is used for the frequency in a corpus C. We use the subscript notation wherever it is important to indicate from which corpus the information comes. For frequencies, if no corpus is specified, we write simply f(·).
A systematic comparison of these learning algorithms, along with the parameters and options used in MWT classification, is presented in section 4.2.4. The model generated in this section is a generic model for MWT identification in text, and its evaluation is presented in section 4.2.4. Moreover, we performed an application-oriented evaluation by integrating the detected MWTs into a statistical MT engine, as described in the next section.
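As an illustration of how the AMs are combined, the sketch below trains a classifier on AM feature vectors and predicts the MWT/NonMWT class of an unseen candidate. It uses scikit-learn, which is not the toolkit used in this work (the experiments were run with WEKA), and the feature values are toy numbers, not data from the corpus.

```python
from sklearn.svm import SVC

# Toy feature vectors: [prob_genia, pmi_genia, t_genia, dice_genia, t_web] (dummy values).
X_train = [[0.0005, 6.4, 7.0, 0.20, 5.1],
           [0.0001, 0.3, 0.4, 0.01, 0.2],
           [0.0004, 5.9, 6.2, 0.18, 4.8],
           [0.0001, 0.5, 0.6, 0.02, 0.3]]
y_train = ["MWT", "NonMWT", "MWT", "NonMWT"]

# A radial (RBF) kernel SVM, one of the learning algorithms compared in this work.
model = SVC(kernel="rbf")
model.fit(X_train, y_train)

X_test = [[0.0003, 5.5, 6.0, 0.15, 4.0]]
print(model.predict(X_test))  # predicted class for an unseen candidate
```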

3.5 MT Integration

The Europarl corpus¹¹ contains more than 1 million sentences from the speeches held in the European Parliament, with each utterance translated into 10 languages. In our experiments, we used only the English-Portuguese (en_pt) portion of the corpus, because our goal is to create a statistical MT system that translates from an English source into a Portuguese target.

¹¹ The Europarl corpus v.3 is freely available at http://www.statmt.org/europarl/

Even though it would be much more interesting to evaluate our technique in a truly multi-lingual environment including languages like German, in which multi-word constructions tend to be lexically marked by the concatenation of their members, there are two reasons, one theoretical and one pragmatic, for the choice of the Portuguese-English language pair. First, a study including different kinds of languages, like German, English and French, would be most interesting if we had a parallel biomedical corpus from which we could extract terminology using multi-lingual techniques based on word-level alignment, similar to the methods proposed by Melamed (1997) and Caseli et al. (2009). Since at present we do not know of the free availability, or even the existence, of such resources, we could not investigate this direction any further. Second, Portuguese is the mother tongue of the author and a language for which we could easily find native speakers to perform the evaluation, making en_pt the natural choice of language pair for the evaluation.

Moses is a statistical machine translation package that is widely used in the NLP community (Koehn et al. 2007). It includes both a training module and a decoder, allowing the whole translation pipeline to be automated. In order to create a standard baseline MT system with the Moses package trained on Europarl, we executed the steps described below, which yield state-of-the-art performance on general-purpose corpora¹²:

• Pre-processing:

1. Sentence-align the raw Europarl corpus using the standard Church and Gale algorithm¹³.
2. Filter out bad alignments, XML entities and very long sentences.
3. Tokenise and lowercase.
4. Separate two sets of 2K sentences for development (dev) and for test (test), leaving the remainder for training (train).

• Training:

1. Generate a language model for the target language (pt).
2. Calculate the word alignments using the EM algorithm implemented in the GIZA++ tool.
3. Generate the phrase table, i.e. the translation model itself (Brown et al. 1993).

• Tuning:

1. Using a Minimum Error Rate Training algorithm, optimise the feature weights on the dev set.

With a general-purpose MT system trained on Europarl, we would like to include external MWT information without re-training the models. In our case, external MWT information is obtained automatically from a mono-lingual corpus (Genia) using the methodology described in the previous sections. Several techniques have been proposed, for instance, to handle German or Swedish compounds, by first splitting the words of the compound and then translating it or, when German is the target language, by merging the words after translation (Nießen and Ney 2004, Nießen et al. 2000). Most approaches (including our own) remove all ambiguity, supposing that the output of the MWT identification step is always correct. However, since we know that this assumption does not hold, future work could explore the association strength of a MWT candidate as a way to integrate some fuzzy degree of "MWTness" directly into a phrase-based MT system, blurring the sharp border between attested MWTs and false MWT candidates.

¹² These steps are described in the WMT08 shared task description, available at http://www.statmt.org/wmt08/baseline.html
¹³ Europarl is only aligned at the paragraph level.

In our evaluation, two types of integration were tested: words-with-spaces and head replacement. In the former, the idea is to force the MT decoder to treat the MWE as a single lexical unit. This avoids classical problems like word-by-word translation, wrong reordering and other side effects of missing lexical entries. Before feeding a sentence to the MT system, we identify the MWTs using the model generated in section 3.4. Then, for each attested MWT, we join its members using a special marker "#". The unit thus simulates a single unknown word and is transliterated to the output. Henceforth, this merging technique will be called words-with-spaces (wws). When Moses decodes the input sentence and generates a Portuguese translation, and in the presence of a domain-specific dictionary or manual translations for the extracted terms, we could replace the multi-word unit by its translation. Such a system could be built using a collaborative MT and corpus management system like SECTra_w (Huynh et al. 2008). In our tests, we simply replaced the "#" marker by a space, keeping the term unchanged in the translation.

The second technique that we employed is the replacement of the term on the source side by its head: for instance, simian epstein barr virus oncogene becomes oncogene and simian epstein barr virus becomes simply virus. Head discovery is a complicated task which is out of the scope of our work, but an interesting approach using the Web is presented by Nakov and Hearst (2005). We do not apply a bracketing dependency or adjacency model to our candidates, but only use the simple observation that English contiguous nominal compounds tend to be right-bracketed, with the head almost always being the last noun of the sequence, if any. We will call this the head merging technique.

This last merging technique has a serious drawback: since the terms are simplified, a lot of important information is lost. The head word acts like a placeholder with some semantics, that is, the reader can discover "about what" the original sentence was talking but not "exactly what it says". Rather than solving the problem, we would display to the user, in the final evaluation interface, the fact that a simplification was made and what the original English term was. Thus, the user gets a better translation for the sentence and can still retrieve the original information by consulting a specialised dictionary. The automatic creation of a specialised multi-lingual dictionary is unfortunately a complicated problem that is outside the scope of this work.

For this specific corpus, we also have a gold standard terminology. Therefore, a third, artificial merging configuration was implemented in which we detect the expressions annotated as MWTs in the original data and then concatenate their words (GS wws) or replace them by their heads (GS head). This configuration is used as a reference against which we compare our output. Table 3.5 shows examples of input (source language) sentences using the different merging configurations.

Table 3.5: Example of source sentence modifications using four merging techniques.

    Input     ras-related gtp-binding proteins and leukocyte signal transduction
    wws       ras-related#gtp-binding#proteins and leukocyte signal#transduction
    head      proteins and leukocyte transduction
    GS wws    ras-related#gtp-binding#proteins and leukocyte#signal#transduction
    GS head   proteins and transduction
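The two merging operations illustrated in table 3.5 can be sketched as follows; we assume the MWT spans have already been identified, and the function and variable names are ours. Only the first MWT span of the example sentence is merged in the usage below.

```python
def merge_wws(tokens, span):
    """Join the tokens of an identified MWT with '#' so the decoder treats it as one unit."""
    start, end = span
    return tokens[:start] + ["#".join(tokens[start:end])] + tokens[end:]

def merge_head(tokens, span):
    """Replace the identified MWT by its head, assumed to be the last word of the sequence."""
    start, end = span
    return tokens[:start] + [tokens[end - 1]] + tokens[end:]

sentence = "ras-related gtp-binding proteins and leukocyte signal transduction".split()
print(" ".join(merge_wws(sentence, (0, 3))))
# ras-related#gtp-binding#proteins and leukocyte signal transduction
print(" ".join(merge_head(sentence, (0, 3))))
# proteins and leukocyte signal transduction
```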

Both techniques follow a recent trend in MT called Textual Entailment (TE). This area studies algorithms to determine whether the meaning of a text or sentence t1 is contained in the meaning of another text or sentence t2, i.e. whether t2 entails t1. This is particularly useful for MT because a system could try to generate a simplified version of a sentence before translating it, probably with some information loss but surely improving the quality of the translation. A good introduction to TE can be found in Quiñonero-Candela et al. (2005). Our technique falls into this category because we try to simplify the input by detecting and removing complicated nested terminology before translation. The choice of the merging technique depends on the target application. For example, wws could be best when the application users have some basic knowledge of the MWTs in English but cannot accept word-by-word translations of them. On the other hand, for non-expert users, less precise but completely translated information might be more useful, so that head is preferred.

4 EVALUATION

In this chapter, our goal is to systematically evaluate the technique described in chapter 3. We therefore analyse each of the steps (pre-processing, candidate extraction, candidate filtering and machine translation integration) using standard measures from information retrieval and machine learning (precision, recall). To evaluate pre-processing, we analyse not only the extracted MWT candidates but also the gold standard. We are interested in obtaining a GS with fewer entries, unifying similar terms through lemmatisation, uniform tokenisation, and acronym detection and removal. With a cleaner GS, we expect to improve the evaluation figures of the candidates.

Several dimensions of terminology extraction are evaluated. First, we measure the accuracy of the candidate selection procedure that uses the POS patterns observed on Genia-train. Second, we evaluate the average precision of each of the Association Measures (AMs) taken individually as an indicator of association strength. Third, we analyse the nature of the corpora from which we draw n-gram frequencies, namely the Genia corpus and the Web. Fourth, we describe the performance of several machine learning algorithms on the selected candidates, using the association measures as feature vectors. Fifth, we show how to improve precision by discarding all candidates that occur less often than an established frequency threshold. Sixth, we compare the output of our method, i.e. the list of attested MWTs, with the list of collocations obtained by a standard method for terminology extraction, Xtract. Finally, we suggest a methodology to evaluate the integration of automatically extracted MWTs into a statistical MT system, comparing different merging and replacement techniques and evaluating the potential benefits of MWT detection for machine translation.

4.1 Evaluating pre-processing

As pre-processing heuristics are not learnt from the corpus, we can evaluate them on the whole corpus, ignoring the split into Genia-train and Genia-test. The evaluation of pre-processing is based on three types of measures:

1. the size of the GS and of the MWT candidate list;

2. the initial precision and recall of the extracted patterns, summarised by the corresponding F-measure;

3. the average frequency of hapax legomena and of the top-100 most frequent candidates.

Two different evaluations are performed, first for uniform tokenisation and lemmatisation of plural/singular (§ 4.1.1) and second for acronym detection and removal (§ 4.1.2).
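The precision, recall and F-measure mentioned in item 2 above, and used throughout this chapter, are computed at the level of candidate and GS entries; the helper below is our own minimal sketch of that computation, not code from the actual experiments.

```python
def precision_recall_f(candidates: set, gold_standard: set):
    """Candidate-level precision, recall and F-measure against the gold standard."""
    true_positives = len(candidates & gold_standard)
    precision = true_positives / len(candidates) if candidates else 0.0
    recall = true_positives / len(gold_standard) if gold_standard else 0.0
    f_measure = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f_measure

candidates = {("t", "cell"), ("zinc", "finger"), ("green", "car")}
gold_standard = {("t", "cell"), ("zinc", "finger"), ("binding", "site")}
print(precision_recall_f(candidates, gold_standard))  # -> (0.666..., 0.666..., 0.666...)
```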

4.1.1 Re-tokenisation and lemmatisation

Re-tokenisation and lemmatisation are simple heuristics to unify inflected variants of MWT candidates and put them in a normalised form. We recall that we only perform simple number lemmatisation, transforming regular plural words into their corresponding singular form. In this section, we refer to these two pre-processing steps as a single normalisation phase. Normalisation can benefit the resulting list of MWT candidates both in terms of recall and in terms of precision. On one hand, we reduce sparseness: two candidates c1 and c2 with frequencies f1 and f2 are considered as different instances when no pre-processing is applied, but if during pre-processing we discover that, for example, c1 is the plural of c2, they are unified in a canonical form with frequency f1 + f2. Hence, we expect to improve recall even though normalisation yields a smaller GS. On the other hand, when we identify that c1 is an annotated MWT, then c2 will also be a term (remember that our definition of MWT does not take the context into account). This increases the probability of an unannotated candidate being considered a MWT and thus improves precision.

Given the simplified noun lemmatisation algorithm of section 3.2.2, we first performed a manual evaluation of our plural/singular transformation heuristics, checking the correctness of the singular forms generated for the list of 1,361 words in the corpus ending in "s"¹. The results show that more than 99% of the tokens were correctly transformed, with problems only for some short words (e.g. doses → *dos, mars → *mar) and irregular cases not handled by the algorithm (e.g. matrices → *matrice).

Table 4.1: Evaluation of re-tokenisation and lemmatisation (norm.) on MWT candidates.

                              Without norm.    With norm.
    MWTs in GS                29,513           28,340
    Candidate MWTs            66,817           63,210
    Precision                 15.56%           37.24%
    Recall                    35.22%           83.06%
    F-measure                 21.58%           51.42%
    Top-100 mean frequency    189.98           205.38
    Hapax legomena            51,169 (77%)     47,770 (76%)

Table 4.1 shows the set of evaluation measures for two different configurations: a first run in which we extracted the 57 selected POS patterns without re-tokenisation and lemmatisation of the candidates (Without norm.), and another run in which the candidates were re-tokenised and their singular and plural forms merged (With norm.). We observe that the size of the gold standard decreased by 3.89%, and the list of candidates is 5.35% smaller. This shows that normalisation generates a more compact candidate list and GS, as we expected, because variants of a candidate are joined together into a single normalised form. From a linguistic perspective, this makes sense, since two realisations of the same term that have different lexical forms actually represent the same concept and should thus be considered as a single unit. However, it is important to note that this merging is done independently of the contexts and only at the lexical level (not at the syntactic, semantic or pragmatic levels), and that we focus on MWT type identification but not on token disambiguation. That is, if a multi-word construction is a term only in some specific context, this will not be captured by our method, and all occurrences of the ambiguous construction will be considered as a multi-word term, even though some of them are not directly related to the concept represented by the homonym term. Nonetheless, this problem

¹ Evaluated manually by a non-native speaker, with the help of on-line resources.

should be minimised by the fact that our corpus contains only biomedical documents and that the nomenclature of a knowledge area is constructed in such a manner that term ambiguity is avoided and that, syntactically, terms do not allow much variation. If we were dealing with two or more domains in the same corpus at the same time, this problem would become more explicit (e.g. virus is a term in Biomedicine and in Informatics, but the two represent different concepts and should be dealt with separately).

When we look at recall and precision, the results are surprisingly good. We go from an initial recall of 35.07% to a recall of more than 83%. Here, we can see the importance of proper tokenisation of the candidates: the first list, without normalisation, contains very high variability in the POS patterns of MWTs. When we apply homogeneous tokenisation to each candidate, the number of POS patterns containing internal punctuation, dashes and slashes decreases. For example, the candidates t-cell and t cell will have the same POS pattern after normalisation and will be correctly captured by the selected rules. Without normalisation, we would probably consider t-cell as a single word, ignoring it as a MWT candidate. These results are unexpected not only because of the large improvement in recall but also because the normalised GS is actually smaller than the non-normalised GS, showing that we improve both the quality of the candidates and the quality of the annotation. If we consider precision, we double the number of true positives in the list, even if the list is smaller, i.e. we filter out many more false positives than in the non-normalised version. In short, normalisation makes the F-measure of the MWT candidates (before any filter) jump from 21.43% to 51.32%.

The last two lines of the table show that normalisation is also effective against data sparseness: the average frequency of the top-100 most frequent candidates increased by 14, while the number of hapax legomena was reduced by almost 5%. Here we can see that, unfortunately, more than two thirds of our candidates appear only once in the corpus, suggesting that statistical AMs computed for these candidates using only the Genia corpus would be not only unreliable but hardly of any help, in other words, useless. Nonetheless, the results depicted in table 4.1 demonstrate the importance and impact of systematic pre-processing of the data before applying any state-of-the-art MWE extraction technique to the raw corpus.

These results bring significant improvement, but should also be interpreted carefully. Re-tokenisation and lemmatisation of the candidates is a heuristic process that involves language-dependent rules. Not only do the rules for singular and plural vary from language to language, but tokenisation may also depend on the conventions of the domain. Removing slashes and dashes might also remove some useful information about words that already share some semantics in the text. Therefore, to avoid the potential loss of important information, in a real-world system these deleted symbols should be kept along with the candidate through the whole processing pipeline and then re-integrated into the final results after MWT detection ends. For some languages, plurals might not be formed by a simple set of rules on the word termination as in English.
For instance, in German most plurals are formed by adding a suffix (-er, -en, -n, -e, -s) to the end of the word, but the plural may also be invariant, and the preceding vowel may or may not be umlauted. This difficulty might be even more evident for languages that have more than two number forms, including trial and paucal (few objects) plurals. Therefore, it is important to compare the cost of implementing such domain- and language-dependent heuristics with the potential improvement in the quality of the extracted MWT candidates. For resource-rich languages, this should not constitute a problem due to the availability of off-the-shelf lemmatisers and tokenisers, as well as of relatively good POS taggers. Other kinds of normalisation, such as unifying the POS tags of gerunds and nouns or handling number and gender inflection of adjectives and nouns, are also good candidates for pre-processing heuristics in English and in other languages.

4.1.2 Acronym detection evaluation

Before extracting MWT candidates, we remove all expressions inside parentheses that look like an acronym.

Table 4.2: Evaluation of Acronym Removal (AR) on MWT candidates.

                              Without AR       With AR
    MWTs in GS                28,601           28,340
    Candidate MWTs            61,401           63,210
    Precision                 37.53%           37.24%
    Recall                    80.58%           83.06%
    F-measure                 51.20%           51.42%
    Top-100 mean frequency    205.05           205.38
    Hapax legomena            46,330 (75%)     47,770 (76%)

Using this simple procedure, we were able to detect and remove more than 1,300 parenthesised acronyms in the corpus, and these are potentially good indicators of terms. To evaluate the impact of the Acronym Removal (AR) step, we proceed as in the previous section, i.e. we analyse the GS and the candidate list in two configurations: without AR and with AR. Table 4.2 shows that, on one hand, the GS size decreased with AR, because around 300 terms that were previously considered as different instances (because of internal parentheses) are now joined in a single entry. On the other hand, as acronyms in parentheses usually occur after (and rarely in the middle of) the terms, around 1,000 acronyms detected in the text do not help to reduce the number of entries in the GS. When we look at the MWT candidates, the list increases by almost 3% because our POS patterns do not contain punctuation signs like parentheses, and are therefore more effective on the acronym-free version of the corpus. Concerning the quality of the extracted patterns, we notice that even if we gained more than 2.5% in recall, precision dropped slightly, resulting in an insignificant increase in the F-measure of the candidate list with AR. We could argue that in candidate extraction, unlike in candidate filtering, recall is more important than precision, since the candidates are going to be filtered in further processing. However, we must be careful with such an analysis, since the small difference in recall does not justify the cost of tuning acronym removal for specific languages or domains.

Another aspect worth evaluating concerns the frequency boost achieved with AR for the marginal probabilities of the candidates in the Genia corpus. Since we have a list of attested acronyms as a by-product of the acronym removal step, every time an acronym occurred, the frequencies of the words in its expanded version were incremented. Analysing the vocabulary of the corpus, we have an average word frequency of 37 occurrences, while the top-100 most frequent words have an average of 2,788 occurrences per word. With frequency extension, these values increase to 47 and 3,566 occurrences on average, respectively. In the last two lines of table 4.2, we can see that the frequencies of the top-100 most frequent candidates remain essentially unchanged, but that there is a significant difference for rare candidates occurring only once in the corpus. As a consequence of the AR step, the number of hapax legomena increases by 1% (4,440). This shows that creating acronym correspondence lists and using them to boost frequencies might be a good option in domain-specific texts where sparseness is a serious problem, but it will also generate more candidates and, therefore, more rare events.

4.2 Terminology extraction evaluation

The goal of this section is to evaluate several factors involved in MWT extraction using the technique proposed here. Until now, we evaluated pre-processing on the whole corpus, but from now on we evaluate models generated from the corpus. Hence, all the evaluations presented in this section were performed on a held-out portion of the corpus, Genia-test, which was not used in the creation of the models. Recall that Genia-test was separated from the remainder of the corpus after pre-processing. It contains the last 100 abstracts of the corpus, i.e. 895 sentences with 23,939 tokens. During the experiments, the annotation contained in Genia-test was not used, to avoid any bias in our evaluation.

4.2.1 POS pattern selection

The patterns of MWT candidates used in this work were based on those suggested by Justeson and Katz (1995) (J&K). However, even though the J&K patterns are generic enough to cover a large number of terminological constructions, they do not take into account the fact that biomedical terms are considerably longer than in other areas and that they often contain non-standard tokens such as numbers. Therefore, using their patterns would yield a maximum recall of 66.71%, i.e. we would be ignoring one third of the true positive candidates, with a precision of 29.58% on Genia-train, as shown in the comparative table 4.3. Please note that these values for recall and precision do not correspond to the output of our method, but to the initial proportion of true positives in the candidate list. That is, recall corresponds to the number of MWTs in the candidate list over the size of the GS, while precision represents the proportion of true MWTs with respect to the size of the candidate list, according to equations 3.1 and 3.2. In this table, we can also verify that selecting all the patterns that occur more than 10 times in the training corpus (f > 10) drastically improves recall at the price of cutting precision in half. Then, in order to improve precision, we decided to select only the patterns that appear more than 10 times in the training corpus and for which the proportion of true positives is greater than 30% (f > 10, p > .3), for a total of 57 patterns, eliminating recurrent noisy patterns that retrieve many more random word combinations than MWT candidates.

Table 4.3: Precision and recall of three POS pattern sets on Genia-train. These values were used in POS pattern selection.

    POS patterns       Precision    Recall
    J&K                29.58%       66.71%
    f > 10             16.50%       92.57%
    f > 10, p > .3     31.52%       89.43%

A small sample of the part-of-speech patterns selected for candidate extraction is presented in table 4.4. For more details, a complete version of this table is included in appendix C. The table shows that the top-5 patterns correspond to J&K. Then, we list combinations of nouns and adjectives of different lengths. Special cases are the patterns N-NUM, containing a number, and A-CC-A-N, containing a conjunction. Since we do not provide sophisticated treatment for conjunctions, such patterns must be extracted in their raw form. For instance, the candidate young and aged subject would ideally be considered as two candidates: young subject and aged subject. Therefore, integrating a syntax-based approach such as the one proposed by Seretan (2008) would lead to better performance on this type of candidate. It is interesting that the pattern N-P-N suggested by Justeson and Katz, i.e. two nouns joined by a preposition (e.g. degrees of freedom), is absent from the table. Even if, from a linguistic point of view, it potentially captures good MWT candidates, as they correspond to the passive form of productive noun compounds and nominalisations, this corpus does not seem to contain enough such candidates to produce relevant results. For comparison purposes, we show the precision and recall values from Genia-train in the first three columns of table 4.4.

Table 4.4: Precision and recall of the most frequent patterns in Genia-train and Genia-test.

    Pattern p     P_train(p)   R_train(p)   ΣR_train(p)   P_test(p)
    N-N           42%          22%          22%           65%
    A-N           31%          19%          41%           77%
    N-N-N         43%          10%          51%           74%
    A-N-N         40%          9%           60%           78%
    A-A-N         32%          4%           64%           88%
    A-N-N-N       38%          3%           67%           83%
    N-N-N-N       39%          3%           70%           85%
    N-A-N         36%          2%           72%           87%
    A-A-N-N       39%          2%           74%           89%
    N-A-N-N       38%          1%           74%           95%
    N-NUM         32%          1%           75%           81%
    A-N-N-N-N     41%          1%           76%           95%
    A-CC-A-N      30%          1%           77%           93%
    N-N-N-N-N     36%          1%           77%           90%
    N-N-A-N       35%          1%           78%           93%

However, as Genia-train was used during POS pattern selection, these values do not correspond to a realistic evaluation of the selected POS patterns. In the last column, the precision of the patterns on Genia-test is very high, between 65% and 95%. Since some of the candidates are not in the GS of the test set but belong to the GS of the training part (and are therefore counted as true candidates), the recall obtained is 200%. This pathological result is a consequence of our fully automatic evaluation process, in which the number of positive candidates with respect to the GS is much greater than the number of manually annotated terms in Genia-test. Manual annotation would probably show that the selected patterns identify most of the MWTs in the corpus, while a large portion of the negative candidates are actually context-dependent terms, i.e. they are in the GS because they might be terms in some contexts, but they might also remain unannotated when they do not constitute a term.

4.2.2 Association measures evaluation

Each Association Measure (AM) provides some information about the association strength between the words of a candidate. Our goal is not to choose "the" best AM, since a machine learning algorithm is going to combine them all to generate the final list of extracted MWTs. It is nonetheless important to understand what each measure tells us about the nature of our MWT candidates. This allows us to evaluate (a) how informative and useful each of the four measures is and (b) how good the frequencies taken from each corpus are for use with AMs.

Figures 4.1, 4.2 and table 4.5 are used to perform this evaluation. They are two ways of visualising the same information, since the average precision in table 4.5 corresponds to the area below the curves of figures 4.1 and 4.2: at each recall level (i.e. at each true positive candidate) we calculate the precision, and we then average all these precisions, exactly as we would do to calculate the area under the curve. The dotted line in both graphics corresponds to the "baseline" precision, i.e. the proportion of true positives in the candidate list. Note that we are using the term "baseline" to refer to the proportion of true positives in the test data (24.16%) and not to the results of another system. This definition of baseline will be used throughout the next sections. All the measures eventually converge to this line since, when we achieve 100% recall, we will have run through the entire list and will have a precision corresponding to the initial proportion of true positives.
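The average precision just described can be computed as sketched below: rank the candidates by a given AM and average the precision measured at each true positive. This is a simplified illustration with toy scores, not the evaluation code used in the experiments.

```python
def average_precision(scored_candidates, gold_standard):
    """Average of the precision values measured at each true positive in the ranked list.

    `scored_candidates` maps each candidate to its AM score; higher means stronger association."""
    ranked = sorted(scored_candidates, key=scored_candidates.get, reverse=True)
    precisions, true_positives = [], 0
    for rank, candidate in enumerate(ranked, 1):
        if candidate in gold_standard:
            true_positives += 1
            precisions.append(true_positives / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

scores = {("t", "cell"): 7.0, ("green", "car"): 5.0, ("zinc", "finger"): 3.0}
print(average_precision(scores, {("t", "cell"), ("zinc", "finger")}))  # (1/1 + 2/3) / 2 ~ 0.83
```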

Figure 4.1: Precision vs. recall in Genia-train for probGenia, PMIGenia, tGenia, DiceGenia, probWeb, PMIWeb, tWeb and DiceWeb. The horizontal dotted line represents baseline precision.

Figure 4.2: Precision vs. recall in Genia-test for probGenia, PMIGenia, tGenia, DiceGenia, probWeb, PMIWeb, tWeb and DiceWeb. The horizontal dotted line represents baseline precision.

The graphics show that the association measures calculated from Genia-test are much noisier than those calculated on the training corpus, because the former is considerably smaller. We will therefore focus our analysis on figure 4.1 and on the first column of table 4.5². First of all, some of the measures have very similar performance. For instance, the first two curves correspond to the t measure both on the Genia corpus and on the Web, confirming that the t-test is very effective for ranking collocational units³, as discussed in related work on MWE extraction (Evert and Krenn 2005, Manning and Schütze 1999). On the test split, however, for recall levels greater than 30%, the precision of the tGenia measure drops drastically, perhaps because at this level frequencies become too sparse for the test to be conclusive. This phenomenon is not observed for the other AMs, though.

The second conclusion that we can draw from these graphics concerns the rectangular-shaped curve corresponding to the raw probability of the candidate in the Genia corpus. In both splits, this measure achieves precision significantly above the baseline, suggesting that we should probably filter out low-frequency candidates by establishing a threshold on the prob measure (please refer to section 4.2.5 for more details). Pointwise Mutual Information performs relatively badly in both splits and corpora, lying below the baseline for Genia-test and slightly above it for the training data. This is a surprising result, as MWTs are related to collocations and PMI is believed to be a good indicator of collocational behaviour (Smadja 1993). However, these results are compatible with recent AM evaluations. Evert and Krenn (2005), for example, reported that the t-test and log-likelihood measures are the best for their data sets, and that PMI does not tell us much about the association strength of multi-word candidates. Since log-likelihood is based on contingency tables, and we cannot compute these for the Web, we are not able to say whether it would be appropriate for biomedical terminology.

Finally, in terms of Web-based AMs, two of the measures have inconsistent behaviour: probWeb and DiceWeb perform well on the test data but not on the training data. Moreover, the Web-based measures tend to have better results on the test split: except for PMI, all the AMs are largely above the baseline (41%, 43% and 44% against a 24% baseline). In section 4.2.3, we evaluate the differences between the two corpora, but this does not completely explain the differences encountered here, and more investigation is needed to determine possible causes for these results. Except for PMI, the evaluation of Web-based AMs is, at this point, inconclusive.

A pertinent question one could ask is: "If the raw probability gives such good precision, why then calculate the other measures?". To try to answer this question, we performed simple experiments using several different ML algorithms on only two attributes, the Genia frequency and the Web frequency. However, none of the resulting models is able to distinguish positive from negative candidates. In this case, the models tend to always guess the same class for a candidate, giving a very large recall for one class (usually the one with the majority of the candidates) but null recall for the other. Moreover, a model that learns from all AMs performs better than a single AM, confirming that although prob is a quite good estimator, the other AMs add complementary information to the MWT detection model. As a last experiment on AMs, we executed a simple automatic best-first attribute selection algorithm on the training set. This data analysis tool is part of the WEKA package and was run over the Genia-train portion of the corpus. The output tells us that the features with the greatest contribution to the model are probGenia, tGenia and tWeb. These results confirm our discussion and validate our evaluation of association measures.

² We note that there is no learning involved in the collection of frequencies.
³ This is the case in spite of the false assumption that words are normally distributed.

Table 4.5: Average precision of the AMs on the MWT candidates for Genia-train and Genia-test.

    AM           Genia-train    Genia-test
    probGenia    46.16%         38.83%
    PMIGenia     40.08%         17.51%
    tGenia       52.90%         31.36%
    DiceGenia    38.84%         22.78%
    probWeb      35.77%         41.49%
    PMIWeb       41.02%         20.52%
    tWeb         45.13%         43.21%
    DiceWeb      37.85%         43.74%
    Baseline     37.15%         24.16%

Figure 4.3: Rank plot of the Genia vocabulary for corpus frequencies in Genia-train and in the Web (log-log scale; rank on the horizontal axis, frequency on the vertical axis).

4.2.3 Corpus nature evaluation

When we evaluated the association measures, we observed that they differ from one corpus to another. The goal of this section is to identify why they differ and what the potential strengths and weaknesses of each corpus are, looking at the Genia corpus and at the Web through Yahoo!'s Web search API. Figure 4.3 shows a rank plot of the vocabulary of the two corpora, in which the horizontal axis corresponds to the rank of each word and the vertical axis to its frequency, using a log-log scale to observe the whole data at once.

There is a flagrant difference between the curves depicted in figure 4.3: while the Genia corpus follows a well-defined power law, the Web corpus seems somehow truncated for the ranks between 1 and 1,000 (the most frequent words). For the former, this Zipfian-like distribution is also found in several language-related studies, where the shape of the distribution follows a power law that can be approximated in more or less detail by some function (Manning and Schütze 1999, Newman 2005, Baayen 2001). For the Web corpus, on closer inspection of the data, we discover that Yahoo!'s API actually returns a precise value of 2,147,483,647 occurrences for all 735 high-frequency words in the vocabulary, corresponding to 5.4% of the corpus (the most visible part of the curve). Thus, it becomes impossible to tell whether a word as frequent as the, which occurs 22,446 times in the Genia corpus, is different from the words department or green, which appear only once in Genia, because all of them occur precisely 2,147,483,647 times on the Web according to Yahoo!. This magic number seems to be a simplification made by the search engine, which returns approximately the number of indexed pages for any query that contains a high-frequency word. But even in this case, unless all the websites in the world decided simultaneously to save the planet, would every page on the Web contain the word green at least once? Therefore, the widely employed assumption that the number of occurrences of a word can be approximated by the number of Web pages that contain it might not be a precise heuristic for high-frequency words.

We also calculated the correlation between the vocabulary ranks of the two corpora. Pearson's correlation for the two corpora is 0.12 while Kendall's τ (more accurate for rank correlation) is 0.21, and these values still reflect the impact of the flat curve of the Web-based data. Both confirm that there is a huge difference between the corpora. These results disagree with Zhang et al. (2006), who showed that traditional and Web corpora may be used interchangeably. While their evaluation was based on n-gram frequencies, our evaluation concentrates on the 1-gram vocabulary of the language, from which we calculate marginal probabilities. Underestimated marginal probabilities will yield imprecise AM values and could be one of the reasons for the performance of the Web-based methods (§ 4.2.2). There are several possible solutions for this problem, and there are probably adequate statistical models to approximate the frequency of top-ranked words based on their frequencies in two different corpora. One could, for example, check whether the frequencies fall in the same rank interval and, if not, approximate the less reliable frequency (Web) by a proportion calculated from the rank of the word in a more reliable corpus (Genia).
Such an adjustment would combine the best of both approaches: traditional corpora are good for high-frequency words but very sparse, whereas the Web can contribute more reliable frequencies for rare words, as it does not suffer as much from sparseness, but it is not a good source for high-frequency words.
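For reference, the two correlation coefficients mentioned above can be computed as in the sketch below; it uses scipy and toy frequency lists, not the actual vocabulary data of the corpora.

```python
from scipy.stats import kendalltau, pearsonr

# Toy frequencies of the same five vocabulary words in the two corpora (dummy values).
genia_freqs = [22446, 1500, 300, 12, 1]
web_freqs = [2147483647, 2147483647, 900000, 50000, 2000]

print(pearsonr(genia_freqs, web_freqs))    # Pearson correlation and p-value
print(kendalltau(genia_freqs, web_freqs))  # Kendall's tau (rank correlation) and p-value
```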

4.2.4 Learning algorithms evaluation

Given that different AMs capture slightly different aspects of the data, instead of selecting or averaging the AM values for each candidate, which would require deciding which AMs are better and which are worse in general, we decided to combine them by learning a model from the training data. This model was evaluated on Genia-test, which contains 895 sentences (24K tokens). We compared four families of ML algorithms; the results are shown in table 4.6 and, alternatively, in figure 4.4. We use the precision and recall on the MWT class as performance measures because, for all learning algorithms, the F-measure of the class NonMWT lies between 80% and 100%. In other words, in terms of MWTs these data sets are unbalanced, as they contain a much larger proportion of NonMWTs than of MWTs, and the learning algorithms obtain a better precision by guessing that a candidate is not a term. For more details on the particularities of each ML algorithm, please refer to Mitchell (1997).

First, we used a classical decision tree algorithm (J48). The most serious drawback of this algorithm (and of others, as we will see) is that it tends to give equal importance to both classes, MWT and NonMWT. Assuming that each is equally important, it classifies 75% of the test instances correctly, with an F-measure of 85.2% for the NonMWT class. However, the algorithm is not able to prioritise performance on the MWT class, which is the most relevant for this work, obtaining a very low recall for it. That is, it suffers from the low-recall problem that we would like to avoid. Moreover, this is an intrinsically numeric task, probably with complex decision borders and therefore not fully suitable for decision trees, as these are designed to deal with finite-value attributes and linearly separable problems. On the other hand, decision trees are easily interpretable and give an indication of the complexity of the problem and of the contribution of the features: the generated tree contains 484 nodes, with most of the leaves filtering on the attribute POSPattern. Closer to the root of the tree are the

attributes tGenia and tYahoo, indicating that they might be good MWT indicators, as suggested in the previous section. Bayesian net learning is specially fit for numeric data. Because Bayesian methods are based on conditional probability, they could provide us with a good model for MWT candidate filtering. We tested the Bayesian net algorithm4 using two different heuristics to search the net for the optimal con- figuration: spanning tree (TAN) and Simulated Annealing (SA). Even though TAN presents slightly betterF-measure, both present comparable results in terms of accuracy. The precision of SA is quite high if compared with other algorithms, while the recall of both algorithms still presents the same problem as with decision trees: Bayesian net with SA classifies 76% of the test instances correctly because it has anF-measure of 86% on the NonMWT class. However, for the purposes of this work, the main interest is in classifying positive instances correctly. Neural networks build implicit data models based on artificial neurons (in our case, perceptrons). We used a Multi-Layer Perceptron (MLP) with 33 neurons in the hidden layer (chosen automatically by WEKA) and a voted perceptron. Results here with two models (MLP and voted perceptron) are very contrasting: while the performance of the MLP is poor, the voted perceptron performs surpris- ingly well, especially on recall. The problem of class weights becomes more obvious for MLP: while class NonMWT hasF-measure of 86%, positive instances haveF-measure of 1% due to very low re- call. On the other hand, the voted perceptron makes several mistakes on the NonMWT class (73% F-measure), and has an overall performance of 64% of correctly classified test instances. However, the overall recall of 63.8% is the best of the ML results. There are several hypotheses for the low per- formance of the MLP. First, the automatically chosen parameters might make the learning algorithm to oscillate so that it never converges to an optimal solution. Second, it is possible that the number of neurons in the hidden layer cannot model the problem correctly. Indeed, additional experiments showed that a MLP with as much as 100 hidden neurons has an F-measure of 13.2%. However, even larger values could be necessary, implying very high computational needs. Finally, it is also possible that the global error-minimisation procedure is not capable of handling highly unbalanced data sets. Further investigation is needed to determine why is the voted perceptron algorithm so efficient on this particular problem whereas a more sophisticated learning algorithm such as MLP performs bad. The last class of algorithms that we used is Support Vector Machines (SVM), which have received increasing attention in the ML community in the last years. Two configurations are tested: polynomial and radial kernels. These parameters change the form of the class border surfaces and will therefore determine how close the model is to the data. Results are interesting both in terms of precision and recall, even though each of the configurations tends to give more importance to either precision or recall. Polynomial kernels have a very good recall (comparable to voted perceptron) whereas radial kernels offer the highest precision, 52.6%. There are two important considerations about SVM: first, they tend to overfit on very difficult problems, creating models with almost as much support vectors as the number of instances in the training set. In the next session, we investigate this hypothesis. 
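Since the experiments above were run with WEKA, the sketch below only illustrates the general setup of combining the POS pattern and the association measures into a single supervised model; the scikit-learn calls and the feature names (probGenia, tYahoo, etc.) are assumptions made for the example, not the exact configuration used in this work.

    # Illustrative sketch: an SVM with a radial (RBF) kernel trained on the POS
    # pattern plus the Genia- and Web-based association measures of each candidate.
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.metrics import precision_recall_fscore_support
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.svm import SVC

    FEATURES = ["POSPattern",                                     # categorical
                "probGenia", "PMIGenia", "tGenia", "DiceGenia",   # Genia AMs
                "probYahoo", "PMIYahoo", "tYahoo", "DiceYahoo"]   # Web AMs

    def train_and_evaluate(train: pd.DataFrame, test: pd.DataFrame):
        """Fit a radial-kernel SVM on the training candidates and report
        precision, recall and F-measure on the MWT class of the test candidates."""
        preprocess = ColumnTransformer([
            ("pos", OneHotEncoder(handle_unknown="ignore"), ["POSPattern"]),
            ("ams", StandardScaler(), FEATURES[1:]),
        ])
        model = make_pipeline(preprocess, SVC(kernel="rbf"))
        model.fit(train[FEATURES], train["isMWT"])
        predicted = model.predict(test[FEATURES])
        p, r, f, _ = precision_recall_fscore_support(
            test["isMWT"], predicted, average="binary")
        return p, r, f

Dropping the Web columns from FEATURES reproduces, in spirit, the ablation discussed below, where only the POS pattern and the four Genia measures are kept.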
Second, even if the polynomial kernel seems to perform better than the radial kernel, we use radial kernels in the next evaluation steps because, when compared to other MWT extraction techniques, it is more important for the purposes of this work to prioritise precision.

The evaluation of ML algorithms also allows us to estimate the importance of Web-based AMs. To do this, we performed an additional test, re-building and re-evaluating the models of the algorithms in table 4.6, but this time using only 5 features: the POS pattern and the 4 AMs calculated on Genia, i.e. these new models ignore all information provided by the Web. The overall results show that the F-measure drops for all algorithms. For example, J48 has a precision of 47.9% and a recall of 19.7% (F-measure = 27.9%), but without Web statistics these values drop to 46.9% precision, 16.4% recall and 24.3% F-measure.

4 A Bayesian net differs from naive Bayes in that the former learns a whole network structure over the features, while the latter assumes the features to be conditionally independent given the class.

Table 4.6: Precision, recall and F-measure of machine learning algorithms on the MWT class (accuracy) for the Genia-test set.

Learning Algorithm         Precision   Recall   F-measure
J48                        47.9%       19.7%    27.9%
Bayes (TAN Search)         48.6%       32.1%    38.6%
Bayes (SA Search)          52.4%       28.9%    37.2%
MLP                        27.3%       0.5%     1.0%
Voted Perceptron           36.3%       63.8%    46.3%
SVM (polynomial kernel)    34.9%       60.1%    44.1%
SVM (radial kernel)        52.6%       35.4%    42.3%

Figure 4.4: Graphical representation of the accuracy of machine learning algorithms on the MWT class for the Genia-test set.

The same is true for the Bayesian net algorithms, whose F-measure drops from 38.6% and 37.2% to 29% and 28.2% for the TAN and SA search strategies respectively. Finally, if we evaluate the importance of Web-based AMs with respect to the radial SVM, precision drops from 52.6% to 42.7% and recall from 35.4% to 14.6%. This shows that the use of the Web as a general-purpose corpus is not only a theoretical tool against sparseness but also an effective way to improve the quality of the acquired MWTs.

For a more detailed analysis of the behaviour of the radial-kernel SVM classifier, table 4.7 shows the resulting confusion matrix. In terms of lexicography, these results indicate that, from a list of 4,685 candidates, the classifier returns 763 candidates from which a human lexicographer or grammar engineer would need to filter out 362 negative instances. In comparison, if no model were applied to the candidates, the work would be considerably more costly, since it would involve the manual analysis of the whole list, in which precision is not 53% but only 24%. On the other hand, with this model 731 true positives are incorrectly classified as instances of the NonMWT class and consequently discarded. While this may seem too high a cost, the performance is actually very good when compared to other MWT extraction techniques (see section 4.2.6). It is also important that these results are taken in context. Although they are lower than the performance obtained in other NLP tasks such as parsing and POS tagging, given the intrinsic complexity of the problem and the current state of the art, the performance of our model is above that of a traditional terminology extraction technique, as we will see in the next sections.

Table 4.7: Confusion matrix for the SVM (radial kernel) classifier. On a test set with 4,685 MWT candidates, 3,592 are correctly classified (main diagonal).

                    Classified as:
                    NonMWT    MWT
NonMWT              3191      362
MWT                 731       401
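The precision and recall reported above follow directly from the cells of table 4.7; the short computation below simply reproduces that arithmetic.

    # Precision and recall of the MWT class recomputed from table 4.7.
    tn, fp = 3191, 362   # NonMWT candidates: correctly kept vs. returned as MWT
    fn, tp = 731, 401    # MWT candidates: wrongly discarded vs. correctly returned

    precision = tp / (tp + fp)          # 401 / 763  ≈ 0.526
    recall    = tp / (tp + fn)          # 401 / 1132 ≈ 0.354
    f_measure = 2 * precision * recall / (precision + recall)   # ≈ 0.423
    accuracy  = (tp + tn) / (tp + tn + fp + fn)                  # 3592 / 4685 ≈ 0.767

    print(f"P={precision:.1%} R={recall:.1%} F={f_measure:.1%} acc={accuracy:.1%}")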

4.2.5 Threshold evaluation

A common technique for improving the precision of MWE extraction methods is to define a strict frequency threshold below which candidates are discarded: precision improves at the expense of a drastic drop in recall. Remember, for example, that at least 75% of our candidates appear only once, so that a frequency threshold as low as 1 would already exclude more than three quarters of the candidates. In this experiment, we tried two different values for the frequency threshold: 1 and 5. Table 4.8 compares three configurations: the first corresponds to the previous experiments, in which no threshold is applied to the candidates; the second excludes all hapax legomena from the candidate list; the last keeps only candidates that appear more than 5 times in the Genia corpus.

Our evaluation is based on the comparison between the "baseline" and the final precision and recall. While precision is measured as usual, considering the size of the candidate list, recall is calculated considering that the whole gold standard of Genia-test contains 2,009 MWTs. The recall of the method therefore does not consider the proportion of true positives with respect to the true candidates in the truncated list; instead, we also take the discarded candidates into account, simulating a classifier that labels every candidate occurring less often than the threshold as NonMWT. This is not only more realistic than simply counting the candidates remaining in the list, but also eases comparison with other MWT extraction methods.

We can see in table 4.8 that there is no optimal high-precision and high-recall solution: precision improves when recall decreases and vice versa. When we apply no frequency threshold, the proportion of true positives found by the classifier grows from the baseline of 24% to 53%, but recall falls below 20%. As we increase the threshold to 1, precision improves slightly and the recall drop caused by the classifier is also minimised: an initial proportion of 23% true positives becomes a final proportion of 21%, while precision improves by more than 8%. One could think that an even greater threshold would yield even better results, but the last column shows that this is not true. With an initial proportion of 73% true positives, the SVM classifier performs slightly better, providing 74% precision, but at the price of extremely low recall, around 6%. In this last configuration, our training set contains around 3.5K instances and our whole test set is as small as 180 instances, of which 131 are true positives. This shows the huge contrast between using no threshold at all (60.5K training instances, 4.6K test instances) and setting a high threshold on MWT candidate occurrences.

It is impossible for the moment to draw precise conclusions about the best threshold, since the desired recall is highly dependent on the final application. While, on the one hand, for building a dictionary it is important to have high precision and to minimise manual effort, on the other hand an MT system might want to cover a wider range of MWTs even if some of them are noisy. Application-oriented evaluation of the proposed technique is discussed in the next section and is based on the results of the threshold evaluation. In any case, discarding only hapax legomena already eliminates considerable noise and improves both precision and recall.

Table 4.8: Comparative evaluation of the radial-kernel SVM on Genia-test with (a) no frequency threshold, (b) candidates excluding hapax legomena and (c) candidates occurring more than five times.

                        No threshold   fGenia > 1   fGenia > 5
# instances (train)     60,585         14,688       3,528
# instances (test)      4,685          940          180
# MWT (test)            1,132          458          131
Baseline precision      24.16%         48.72%       72.78%
Baseline recall         56.35%         22.80%       6.52%
SVM precision           52.56%         56.83%       74.14%
SVM recall              19.96%         20.91%       6.42%
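A small sketch may help to see how the figures in table 4.8 are obtained: candidates below the frequency threshold are treated as if the classifier had labelled them NonMWT, and recall is measured against the 2,009 MWTs of the full Genia-test gold standard. The data layout and function names are assumptions made for the example.

    # Illustrative sketch of the thresholded evaluation: discarded candidates
    # count as NonMWT decisions, and recall is computed over the full gold standard.
    GOLD_SIZE = 2009  # MWTs in the Genia-test gold standard

    def thresholded_scores(candidates, classify_as_mwt, threshold):
        """candidates: iterable of (ngram, genia_frequency, is_true_mwt) triples;
        classify_as_mwt: the trained model's decision function for one n-gram."""
        kept = [c for c in candidates if c[1] > threshold]
        returned = [ngram for ngram, _, _ in kept if classify_as_mwt(ngram)]
        true_pos = sum(1 for ngram, _, is_mwt in kept
                       if is_mwt and classify_as_mwt(ngram))
        precision = true_pos / len(returned) if returned else 0.0
        recall = true_pos / GOLD_SIZE  # every discarded true MWT counts as a miss
        return precision, recall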

4.2.6 Comparison with related work

In order to contextualise our results, we also compare our technique with a standard terminology extraction technique. Since most related work deals with MWEs in general and does not focus on specialised terminology, it is difficult to compare it with our method. Commercial tools, on the other hand, do not perform truly automatic terminology discovery: they generally use some pattern-matching algorithm to compare a manually compiled taxonomy, ontology or terminology with the provided input text. Moreover, most of them are not available in any form and would have to be manually re-implemented, which is outside the scope of this work. Therefore, we compared our technique with the available terminology extraction algorithm Xtract (Smadja 1993). Xtract uses bigram mutual information to discover collocational patterns and then extends the MWEs to arbitrary n-grams. We used the implementation of Xtract integrated in the Dragon Toolkit for NLP and IR (Zhou et al. 2007). This version of Xtract uses POS tags to filter the extracted terms and works on lemmatised lexical units. We ran Xtract on the pre-processed, unannotated versions of both Genia-train and Genia-test; the comparison with our method is fairer on Genia-train, since Genia-test is much smaller and Xtract cannot exploit the annotation of the training part as our method does.

Table 4.9 shows precision and recall for four different configurations: Xtract on the training and test parts, and our technique with two different frequency thresholds. From the whole training corpus, Xtract could only identify 1,558 MWTs, of which 1,041 are true positives. This yields a recall of less than 4%, while precision is almost 67%. When we run Xtract on the test data, the recall is even worse: only 2.44% of the MWTs are retrieved, even though precision reaches 70%. With our technique, we obtain a recall that is ten times larger than that of Xtract, with a drop of only around 10% in precision. In addition, with a strict threshold we obtain a low recall that is still twice as high as the recall of Xtract on the training corpus, whereas our precision is superior to Xtract on any slice of the corpus. While it would be possible to fine-tune Xtract and to improve candidate annotation to potentially obtain better performance (the performance described in the literature is around 80% precision), the same is true of our method. However, unlike for the latter, it may be difficult to improve the recall of Xtract, since it does not use external sources like the Web to overcome sparseness and has internal thresholds that discard candidates that do not occur often enough in the input corpus.

Table 4.9: Comparative performances of (a) Xtract on Genia-train, (b) Xtract on Genia-test, (c) our technique with threshold = 1 and (d) with threshold = 5.

                        Xtract (train)   Xtract (test)   fGenia > 1   fGenia > 5
Extracted candidates    1,558            70              739          174
True positives          1,041            49              420          129
Recall                  3.84%            2.44%           20.91%       6.42%
Precision               66.81%           70%             56.83%       74.14%
F-measure               7.26%            3.62%           30.57%       11.82%

4.3 Application-oriented evaluation

The goal of application-oriented evaluation is to discover whether MWT information can improve Machine Translation (MT) quality. To this end, we presented several heuristics to include terms in the system input through pre-processing and then to treat them after translation, as an additional post-processing step. The system used for translation is Moses, a statistical MT decoder. Current SMT technology works on flat structures, and we believe that our MWTs are well suited to such an approach. However, even if training Moses on the Europarl corpus is considered state of the art for statistical translation, it is a very costly task, both in terms of processing time and disk space, which, added to time constraints, prevented it from being pursued here.5 The alternative of manual evaluation of the results was also ruled out by the unavailability of native-speaker annotators. Despite these difficulties, we describe here some general ideas for the evaluation of MT integration.

First, it is important to determine which integration method (wws or head) is the most efficient at increasing the general readability of the translation. We expect wws to have higher adequacy because, with head replacement, we lose information and linguistic accuracy. On the other hand, it is necessary to determine with which method Moses can generate more natural or grammatical sentences. To evaluate the quality of the translation, one possibility is to ask human annotators to rank the translations obtained with the four configurations described in chapter 3 plus a configuration in which no pre- or post-processing is performed. Each evaluator would annotate a set of 50 sentences of the Genia-test corpus with 250 translations into Portuguese. More languages could be added to the evaluation as more language pairs become available for Moses (but, as explained before, this is conditioned on the available hardware and processing time).

Second, our hypothesis is that MWT information could be helpful during post-editing. In order to evaluate this, one could implement or extend an existing translation interface (e.g. SECTra_w) to show the user the pertinent information concerning the detected terminological unit. Evaluation at this step is related to the evaluation methodology of human-machine interfaces, and should therefore consider dependent variables such as translation time, user comfort and translation quality. Not only could this improve the quality of post-editing (through faster and more accurate translations), but the implicit feedback could also be used to build multilingual terminology resources with the help of professional and volunteer translators.

Since the work reported here is part of a larger research project, we intend to carry out the experiments described above as future work, in order to provide further results that answer our questions concerning the interaction between automatically identified multi-word terminology and the automatic translation of specialised text.
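As a concrete illustration of the two integration heuristics mentioned above, the sketch below applies wws (concatenating the tokens of a detected term) and head (keeping only its head noun) to a tokenised source sentence; the span-based representation of the detected MWTs is an assumption made for the example.

    def wws(tokens, term_spans, sep="#"):
        """Words-with-spaces: join the tokens of each detected term so that Moses
        treats it as a single unit, e.g. signal transduction -> signal#transduction."""
        spans = dict(term_spans)
        out, i = [], 0
        while i < len(tokens):
            if i in spans:
                out.append(sep.join(tokens[i:spans[i]]))
                i = spans[i]
            else:
                out.append(tokens[i])
                i += 1
        return " ".join(out)

    def head(tokens, term_spans):
        """Head replacement: keep only the last (head) noun of each detected term."""
        spans = dict(term_spans)
        out, i = [], 0
        while i < len(tokens):
            out.append(tokens[spans[i] - 1] if i in spans else tokens[i])
            i = spans.get(i, i + 1)
        return " ".join(out)

    sent = "ras-related gtp-binding proteins and leukocyte signal transduction".split()
    terms = [(0, 3), (5, 7)]  # token spans of the two detected MWTs
    print(wws(sent, terms))   # ras-related#gtp-binding#proteins and leukocyte signal#transduction
    print(head(sent, terms))  # proteins and leukocyte transduction

The example sentence and the two expected outputs are the ones used to describe these heuristics in annex A.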

5 One training step, for example, might generate as much as 12 GB of data at once.

5 CONCLUSIONS AND FUTURE WORK

In the work presented here, we performed a systematic evaluation of multi-word terminology extraction on a biomedical corpus. First, we discussed some basic concepts such as the notions of multi-word expression, term and multi-word term. Multi-word expressions are very frequent in human language and correspond to combinations of two or more words that share some syntactic or semantic properties which cannot be totally inferred from the properties of the component words. Terms, on the other hand, are the objects of study of Terminology and correspond to lexical units with unambiguous meanings that relate them to the concepts of a specific knowledge area. Terms may be composed of one or more words; in the latter case we call them, by convention, Multi-Word Terms (MWTs). We also showed some common points between theoretical Terminology and Computational Linguistics approaches, indicating which premises and hypotheses apply to our work.

Related work, be it on terminology or on multi-word expression extraction, tends to focus on a specific type of construction and language. Hence, we analysed related research that deals with general-purpose collocations, compound nouns and phrasal verbs, as well as some studies that compare and evaluate extraction techniques. There are several tasks related to the treatment of multi-word units: identification in text, selection and ranking according to their quality, discovery of syntactic and semantic properties (idiomaticity, selectional preferences, translations), and use in NLP applications. Most of the related work focuses on a single type of construction and a single task, and n-gram approaches rarely work with arbitrary n, usually working on bigrams. There is little work on the application of such techniques to terminology discovery and domain-specific corpora. From this perspective, we believe that the work presented here is quite new and can help fill the gap between general-purpose and domain-specific NLP methods for multi-word units, thus becoming a tool for the domain adaptation of language systems and resources.

The Genia corpus is a collection of biomedical texts annotated for terms. We used the corpus to evaluate MWT extraction techniques and parameters. Terminology is very present in the corpus, and terms tend to be very long, nested, and to contain internal acronyms and wrong POS tags and tokenisation. To improve the quality of the input, we performed three simple pre-processing steps: uniform tokenisation, plural removal (simplified lemmatisation) and acronym removal. The evaluation showed that the first two pre-processing steps can significantly improve the quality of the candidates, being a powerful tool against sparseness and increasing the coverage and quality of the initial POS patterns. The last step, acronym removal, helps to improve recall but does not change the quality of the candidates. Even though it produces slightly better F-measures, it does not seem to be essential in the processing pipeline.

From the raw corpus, we extracted a list of MWT candidates based on flat POS patterns. We believe that better results could be obtained with a more sophisticated candidate-extraction technique, for instance a syntax-based method such as the technique introduced by Seretan (2008) for general collocations or the technique of Baldwin (2005) for verb-particle constructions. However, since our evaluation is automatic and the Genia corpus annotation only captures syntactically fixed terms (i.e. all the terms are contiguous word sequences), we decided to use an n-gram extraction technique.
Moreover, our definition of MWT is essentially different from domain-specific phraseology. Thus, our candidates do not present much flexibility in terms of syntax, only allowing morphological inflection. With our POS pattern selection procedure, we obtain a list of more than 60.5K MWT candidates, of which only 24% are true positive instances, corresponding to almost 90% of the multi-word terms in the corpus.

To try to improve the quality of our MWT candidates, we decided to apply machine learning algorithms using statistical association strength measures as features. We showed that the Genia corpus is sparse: most of the candidates occur only once in the whole training corpus. In order to overcome sparseness, we used the Web to estimate word frequencies based on the number of pages in which they appear. However, during the evaluation we discovered that, while the Web is a powerful tool for low-frequency words, it truncates high frequencies and then becomes of little help. In future work, we would like to explore heuristic or statistical techniques for combining traditional corpora and Web frequencies to obtain better estimators for the association measures. Our evaluation showed that the t test and raw probabilities are good MWT indicators, whereas PMI does not help to distinguish true MWTs from random word combinations. We could not draw any further conclusions for the Dice coefficient and for most Web-based measures, and we would like to investigate this phenomenon in depth in future experiments.

Several learning techniques were tested and evaluated for terminology extraction. Decision trees are easy to understand, but their model is too simple for our data set. Bayesian learning, as well as support vector machines, seems to offer the best results in terms of precision and recall, but whether to choose higher recall or higher precision remains an application-dependent decision. Multi-layer perceptrons are not adequate for our MWT candidates, but a voted perceptron can achieve quite high recall and acceptable precision. Based on our analysis, we decided to use radial-kernel SVMs in the rest of the evaluation. Although it has good overall quality, this learning algorithm is computationally costly and could possibly overfit the model; this hypothesis was not investigated here and could be explored in future research. Finally, we evaluated the impact of frequency thresholds and compared our results with Xtract, a standard technique for collocation extraction. Frequency thresholds may have a catastrophic impact on recall, but they considerably improve precision. Even the most restrictive threshold employed in our experiments yielded better precision and recall than Xtract: we obtained 74.14% precision and 6.42% recall on Genia-test, while Xtract obtained 66.81% precision and 3.84% recall on Genia-train. This shows that a machine learning approach using several association measures and more than one corpus can obtain considerably better results than a method that uses a single association measure and sparse corpus frequencies.

Table 5.1 shows a fragment of the final MWT list generated by our SVM classifier on the fGenia > 5 list. We can observe that most MWTs are compound nouns or contain nouns in their structure, except for the two foreign (Latin) expressions in vitro and in vivo.
The most common length of the extracted MWTs is two words, but some of the terms are 3- or 4-grams (e.g. nuclear factor kappa b). The false positives seem to be part of, or to contain, a term, since in some cases they are very similar to other candidates in the list. Manual annotation and a 3- or 4-class evaluation (MWT, NonMWT, Contains and PartOf) could confirm this hypothesis, and this could be verified in future work. In this sense, the MWTs extracted by our method are very similar to the ones discovered by Xtract, where the most common configuration is a pair of nouns but some longer terms may also occur.

In the future, we would like to conclude and explore the experiments on MT integration, in particular the manual evaluation of translations that take the extracted MWTs into account. This would help draw more pragmatic conclusions about the feasibility of our approach.

At this point, the reader could be asking himself: "But what is the point in automatically building terminology for a corpus that already contains annotated terminology?"

Table 5.1: Example of MWTs extracted from Genia-test. Errors according to the automatic evaluation against our gold standard are marked with an asterisk "*".

Extracted MWT               POS        Extracted MWT                POS
t cell                      N-N        *thromboxane receptor       N-N
nf kappa b                  N-N        peripheral blood            A-N
kappa b                     N-N        nf kappa b                  N-N-N
cell line                   N-N        human t                     A-N
transcription factor        N-N        receptor alpha              N-N
i kappa b                   N-N-N      monocytic cell              A-N
*i kappa                    N-N        binding activity            N-N
*kappa b alpha              N-N-N      stat protein                N-N
*b alpha                    N-N        *receptor gene              N-N
i kappa b alpha             N-N-N-N    *rap1 protein               N-N
gene expression             N-N        protein kinase              N-N
in vitro                    FW-FW      in vivo                     FW-FW
b cell                      N-N        type 1                      N-NUM
tyrosine phosphorylation    N-N        promoter activity           N-N
nk cell                     N-N        beta 1                      N-NUM
t lymphocyte                N-N        *b activation               N-N
nuclear factor              A-N        *thromboxane receptor gene  N-N-N
*nf kappa                   N-N        signal transduction         N-N
binding site                N-N        nuclear extract             A-N

There are two answers to this question, and both are important for a complete comprehension of this work. First, from the beginning our goal was not to create a terminology lexicon but to evaluate the techniques that could be used for doing that. When one evaluates a technique, one can either annotate the output by hand or obtain a gold standard against which the results are compared. Since the former solution is costly in terms of human resources, we adopted an automatic approach to evaluation, hence the need for an annotated corpus.

Of course, we did not use the annotation only for evaluation but also for pattern selection and learning, and this brings us to our second point: generalisation. Pattern selection does not necessarily need annotation, since most domains share some properties in the characteristics of their terminology. Noun-noun and adjective-noun combinations will always be very present, and other variations might also occur. For this step, annotation is not essential, since this part of the technique can be performed on a trial-and-error basis, using heuristic and expert knowledge about the terminology of the domain, and even using the POS patterns of similar domains. For instance, one could use our 57 POS patterns on a pediatrics or cardiology corpus and would probably obtain comparable performances. For the learning step, we assume that our features are totally domain-independent, i.e. we only look at association strength to decide whether a candidate is a good or a bad MWT. Therefore, we believe that the classifiers built here could be widely applied to other domains with comparable results, so that no annotation would be needed to apply this method, for example, on a chemistry or ecology corpus. In other words, we believe that not only our methodology but also the models learned on the annotated biomedical data are general enough to be applied to other languages and domains, requiring only minor modifications. For the moment, this is only a (perhaps optimistic) hypothesis since, given the scope of this work and the need for annotated data, all of the experiments are based on a single corpus. Further work should definitely investigate how these models behave when applied to other domains. Nonetheless, our evaluation is systematic enough to separate the data into training and test sets, and on the latter part (Genia-test) the MWT annotation was completely ignored.
Therefore, we are optimistic that this technique could be adapted to other domains in a straightforward way, contributing to the efficient domain adaptation of language tools and resources such as machine translation and information retrieval. When these tools are able to cope with specialised lexica, we will be able to take a big step toward wide knowledge diffusion and access, which is an important goal of Natural Language Processing itself.

REFERENCES

R. Harald Baayen. Word Frequency Distributions, volume 18 of Text, Speech and Language Technology. Springer, 2001. ISBN 978-0-7923-7017-8.

Timothy Baldwin. Deep lexical acquisition of verb-particle constructions. Computer Speech & Language Special Issue on Multiword Expressions, 19(4):398–414, 2005.

Timothy Baldwin, Emily M. Bender, Dan Flickinger, Ara Kim, and Stephan Oepen. Road-testing the English Resource Grammar over the British National Corpus. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal, May 2004. European Language Resources Association.

Youcef Bey, Christian Boitet, and Kyo Kageura. The TRANSBey prototype: an online collaborative wiki-based CAT environment for volunteer translators. In Third LREC International Workshop on Language Resources for Translation Work, Research & Training (LR4Trans-III), pages 49–54, Genoa, Italy, May 2006.

Ergun Biçici and Marc Dymetman. Dynamic translation memory: Using statistical machine translation to improve translation memory fuzzy matches. In Proceedings of the 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2008), Haifa, Israel, February 2008. Springer.

Christian Boitet, Valérie Bellynck, Mathieu Mangeot, and Carlos Ramisch. Towards higher quality internal and outside multilingualization of web sites. In Pushpak Bhattacharyya, editor, Summer Workshop on Ontology, NLP, Personalization and IE/IR – ONII-08, Bombay, India, 2008.

Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2), 1993. ISSN 0891-2017.

Helena Medeiros Caseli, Carlos Ramisch, Maria das Graças Volpe Nunes, and Aline Villavicencio. Alignment-based extraction of multiword expressions. Language Resources & Evaluation Special Issue on Multiword Expressions (to appear), 2009.

Maurice de Kunder. Geschatte grootte van het geïndexeerde world wide web. Master’s thesis, Uni- versiteit van Tilburg, 2007.

Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977.

Jianyong Duan, Ruzhan Lu, Weilin Wu, Yi Hu, and Yan Tian. A bio-inspired approach for multi-word expression extraction. In Proceedings of the 44th Annual Meeting of the Association for Computational Linguistics (ACL 2006), pages 176–182, Sydney, Australia, July 2006. Association for Computational Linguistics.

Stefan Evert and Brigitte Krenn. Using small random samples for the manual evaluation of statistical association measures. Computer Speech & Language Special Issue on Multiword Expressions, 19(4): 450–466, 2005.

Katerina Frantzi, Sophia Ananiadou, and Hideki Mima. Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries, 3(2):115–130, 2000.

Roxana Girju, Dan Moldovan, Marta Tatu, and Daniel Antohe. On the semantics of noun compounds. Computer Speech & Language Special Issue on Multiword Expressions, 19(4):479–496, 2005.

Gregory Grefenstette. The World Wide Web as a resource for example-based machine translation tasks. In Proceedings of the Twenty-First International Conference on Translating and the Computer, London, UK, November 1999. ASLIB.

Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge, UK, 1997. ISBN 0-521-58519-8.

Cong-Phap Huynh, Christian Boitet, and Hervé Blanchon. SECTra_w.1 : an online collaborative system for evaluating, post-editing and presenting MT translation corpora. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, May 2008. European Language Resources Association.

Ray Jackendoff. Twistin’ the night away. Language, 73:534–59, 1997.

John S. Justeson and Slava M. Katz. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1(1):9–27, 1995.

Frank Keller and Mirella Lapata. Using the web to obtain frequencies for unseen bigrams. Computa- tional Linguistics, 29(3):459–484, 2003. ISSN 0891-2017.

Frank Keller, Maria Lapata, and Olga Ourioupina. Using the Web to overcome data sparseness. In Jan Hajič and Yuji Matsumoto, editors, Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pages 230–237, Philadelphia, USA, July 2002. Association for Computational Linguistics.

Adam Kilgarriff and Gregory Grefenstette. Introduction to the special issue on web as corpus. Com- putational Linguistics, 29(3), 2003. ISSN 0891-2017.

Jin-Dong Kim, Tomoko Ohta, Yuka Teteisi, and Jun’ichi Tsujii. GENIA ontology. Technical report, Tsujii Laboratory, University of Tokyo, 2006.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL 2003), pages 48–54, Edmonton, Canada, 2003. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007), pages 177–180, Prague, Czech Republic, July 2007. Association for Computational Linguistics.

Maria da Graça Krieger and Maria José Bocorny Finatto. Introdução à Terminologia: teoria & prática. Editora Contexto, 2004. ISBN 85-7244-258-8.

Mirella Lapata. The disambiguation of nominalizations. Computational Linguistics, 28(3):357–388, 2002. ISSN 0891-2017.

Adam Lopez. Statistical machine translation. ACM Computing Surveys, 40(3):1–49, 2008. ISSN 0360-0300.

Christopher D. Manning and Hinrich Schütze. Foundations of statistical natural language processing. MIT Press, Cambridge, USA, 1999. ISBN 0-262-13360-1.

I. Dan Melamed. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP-2), Brown University, USA, August 1997. Association for Computational Linguistics.

Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, USA, 1997.

Preslav Nakov and Marti Hearst. Search engine statistics beyond the n-gram: Application to noun compound bracketing. In Ido Dagan and Dan Gildea, editors, Proceedings of the Ninth Conference on Natural Language Learning (CoNLL-2005), University of Michigan, USA, June 2005. Association for Computational Linguistics.

Mark E. J. Newman. Power laws, Pareto distributions and Zipf's law. Contemporary Physics, 46:323–351, 2005. ISSN 0010-7514.

Sonja Nießen and Hermann Ney. Statistical machine translation with scarce resources using morpho-syntactic information. Computational Linguistics, 30(2):181–204, 2004. ISSN 0891-2017.

Sonja Nießen, Franz Joseph Och, Gregor Leusch, and Hermann Ney. An evaluation tool for machine translation: Fast evaluation for machine translation research. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), pages 39–45, Athens, Greece, May 2000. European Language Resources Association.

Tomoko Ohta, Yuka Tateisi, and Jin-Dong Kim. The GENIA corpus: an annotated research abstract corpus in molecular biology domain. In Proceedings of the Second Human Language Technology Conference (HLT 2002), pages 82–86, San Diego, USA, March 2002. Morgan Kaufmann Publishers.

Darren Pearce. Synonymy in collocation extraction. In WordNet and Other Lexical Resources: Applications, Extensions and Customizations (NAACL 2001 Workshop), pages 41–46, June 2001.

Darren Pearce. A comparative evaluation of collocation extraction techniques. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), Las Palmas, Canary Islands, Spain, May 2002. European Language Resources Association.

Pavel Pecina. A machine learning approach to multiword expression extraction. In Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, June 2008.

Carol Peters, Valentin Jijkoun, Thomas Mandl, Henning Müller, Douglas W. Oard, Anselmo Peñas, Vivien Petras, and Diana Santos, editors. Advances in Multilingual and Multimodal Information Retrieval, 8th Workshop of the Cross-Language Evaluation Forum, Revised Selected Papers, volume 5152 of Lecture Notes in Computer Science, Budapest, Hungary, 2008. Springer. ISBN 978-3-540-85759-4.

Grzegorz Protaziuk, Marzena Kryszkiewicz, Henryk Rybinski, and Alexandre Delteil. Discovering compound and proper nouns. In Proceedings of the RSEISP’07 International Conference on Rough Sets and Emerging Intelligent Systems Paradigms, pages 505–515, 2007.

Joaquin Quiñonero-Candela, Ido Dagan, Bernardo Magnini, and Florence d'Alché Buc, editors. The PASCAL Recognising Textual Entailment Challenge, volume 3944, 2005. Springer. ISBN 978-3-540-33427-9.

Carlos Ramisch, Paulo Schreiner, Marco Idiart, and Aline Villavicencio. An evaluation of methods for the extraction of multiword expressions. In Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), pages 50–53, Marrakech, Morocco, June 2008.

Ivan Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), pages 1–15, Mexico City, Mexico, February 2002. Springer.

Eric SanJuan, James Dowdall, Fidelia Ibekwe-SanJuan, and Fabio Rinaldi. A symbolic approach to automatic multiword term structuring. Computer Speech & Language Special Issue on Multiword Expressions, 19(4):524–542, 2005.

Violeta Seretan. Collocation extraction based on syntactic parsing. PhD thesis, University of Geneva, 2008.

Frank A. Smadja. Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177, 1993. ISSN 0891-2017.

Aline Villavicencio. The availability of verb-particle constructions in lexical resources: How much is enough? Computer Speech & Language Special Issue on Multiword Expressions, 19(4):415–432, 2005.

Aline Villavicencio, Valia Kordoni, Yi Zhang, Marco Idiart, and Carlos Ramisch. Validation and evaluation of automatically acquired multiword expressions for grammar engineering. In Jason Eisner, editor, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 1034–1043, Prague, Czech Republic, June 2007. Association for Computational Linguistics.

Yi Zhang, Valia Kordoni, Aline Villavicencio, and Marco Idiart. Automated multiword expression prediction for grammar engineering. In Proceedings of the COLING/ACL Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 36–44, Sydney, Australia, July 2006. Association for Computational Linguistics.

Xiaohua Zhou, Xiaodan Zhang, and Xiaohua Hu. Dragon toolkit: Incorporating auto-learned semantic knowledge into large-scale text retrieval and mining. In Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence - ICTAI 2007, volume 2, pages 197–201, Washington, USA, 2007. IEEE Computer Society.

ANNEX A EXTENDED ABSTRACT: MULTI-WORD TERM EXTRACTION FOR DOMAIN-SPECIFIC DOCUMENTS

A.1 Introduction

One of the major goals of Natural Language Processing (NLP) is to make communication between human beings and computers as simple as asking a friend for a favour. Natural language is an essential component of intelligent behaviour, which leads to the classification of NLP as a subfield of artificial intelligence, from which it inherits many principles and methods. Human language is, for several reasons, hard to model computationally. A significant fraction of human knowledge is expressed through natural language: it is therefore a means of knowledge exchange and lies at the basis of all scientific communication. Consequently, NLP takes great interest in so-called specialised language, or technical and scientific language. One of the characteristics of this specialised language is a lexicon rich in terminological structures, and these are the object of this work. In this context, we believe that terminology can be acquired automatically from domain-specific corpora using data-driven learning techniques.

Consider a fictitious user who needs to find information about example-based machine translation. With a Web search engine, this task can be tedious: not only will most of the results be in English, but it will also be difficult to filter the large amount of (irrelevant) user-generated content. Specialised search engines must take the terminology of the domain into account; for example, the words tree and leaf are terms, but the concepts they represent are not the same in botany and in computer science. In short, today a user with a domain-specific information need expressed in his mother tongue must first consult several auxiliary information sources to translate the keywords, and then the documents returned by the search. Not only on the Web but in general, multilingual access to specialised texts is of great interest because it has applications in all areas of science and in the dissemination of products. The study of methods for multilingual information retrieval is not within the scope of this work, but the example above serves to motivate the integration of domain terms with machine translation.

First of all, we must define three key concepts with which we will work: multi-word expression, scientific and technical term, and multi-word term. The definition of Multi-Word Expression (MWE) is broad, since this term covers many linguistic phenomena such as compound nouns (traffic light, feu rouge), idiomatic expressions (cloud number nine, sem pé nem cabeça), compound terms (système d'exploitation, Datenbanksystem), etc. Terms, unlike the MWEs of everyday language, are linguistic entities related to technical and scientific text. A term may be a single-word unit, whereas an MWE is necessarily composed of several words. In the General Theory of Terminology, it is assumed that there is a functional, one-to-one (1:1) relation between terms and concepts. This idea has, however, been severely criticised for being very reductive and for not taking into account the natural habitat of terms: the text.
Multi-Word Terms (MWTs) are defined as multi-word units that have the status of a term. We emphasise that multi-word terms do not correspond to the notion of domain phraseology. An MWT accepts little variability (morphological, but rarely syntactic), whereas a phraseological unit is highly variable. While the former represents a single concept, it is not rare for a phraseological unit to be a complex structure encompassing more than one concept.

Translation and terminology have many points in common when it comes to specialised translation. Likewise, MWTs and machine translation (MT) should be close fields, despite the current reality of NLP, which reflects this interdependence only weakly. Thus, one of the goals of this work is the automatic detection of terms for the adaptation of computational language resources to several domains. We chose to work only with multi-word terms because (a) they represent around 70% of the terminology of a domain, (b) the computational methods for detecting multi-word and single-word terms are very different, and (c) MWTs represent a major challenge in computational linguistics.

Definition A.1 A Multi-Word Expression (MWE) is a sequence of two or more words whose semantics is non-compositional, that is, the meaning of the phrase cannot be entirely derived from the meaning of the words that compose it (Sag et al. 2002).

Definition A.2 A term is a lexical or multi-lexical unit with an unambiguous meaning when used in a specialised text, so that terminology is the linguistic manifestation of the concepts of a domain (Krieger and Finatto 2004).

Definition A.3 A Multi-Word Term (MWT) is a term composed of more than one word (SanJuan et al. 2005, Frantzi et al. 2000).

A.2 Related work

The work presented here deals with the integration of three NLP subfields: multi-word expressions, terminology and statistical machine translation. MWEs are a challenge for NLP because of their high morphosyntactic variability and their heterogeneous nature. Techniques for MWE treatment are classified as either "words-with-spaces" or compositional approaches. Four tasks are related to MWE processing: the identification of their occurrences in text, the interpretation of their structure (syntax), their classification or clustering (semantics), and their application in other NLP systems. MWE candidates can be identified through shallow morphological patterns (Villavicencio et al. 2007, Manning and Schütze 1999) or by more or less deep methods that take syntactic phenomena into account (Seretan 2008, Baldwin 2005). Candidate filtering can be performed with statistical measures (Evert and Krenn 2005, Villavicencio et al. 2007), often complemented by frequencies obtained from the Web (Keller et al. 2002, Zhang et al. 2006). Symbolic approaches generally employ thesauri and synonym dictionaries (Pearce 2001). Supervised machine learning can also be effective for building MWE identification models (Pecina 2008). The interpretation of the internal structure of MWEs depends on their specific type; for example, Ramisch et al. (2008) classify English phrasal verbs with respect to their idiomaticity, and Nakov and Hearst (2005) work on the bracketing structure of compound nouns. Concerning the classification of MWEs, we cite the work of Lapata (2002), which deals in particular with nominalisations.

A technique commonly used for term extraction is the use of frequent morphological patterns (Justeson and Katz 1995). Frantzi et al. (2000) propose a finer technique for ranking candidates, while Smadja (1993) uses the mutual information of words to detect collocations. Regarding the biomedical terminology of the Genia corpus, SanJuan et al. (2005) perform an automatic semantic classification which is then compared to the domain ontology.

Machine translation (MT) has evolved in recent years towards methods that are either operationally collaborative or statistical. The former are based on the triplet of automaticity, coverage and precision, allowing users to choose which aspects are most important for their application (Bey et al. 2006, Boitet et al. 2008). Statistical translation systems, on the other hand, learn translation probabilities from a parallel corpus (Brown et al. 1993). Word-level alignment is obtained with the unsupervised EM learning algorithm. According to Koehn et al. (2003), statistical MT systems can be of two types: word-based or phrase-based (where phrases are word sequences). In specialised domains, problems may occur when the training corpus differs from the text to be translated, as Baldwin et al. (2004) show for a parsing system. The simultaneous identification of MWTs and word-level alignments makes it possible to extract both multi-word lexical units and their translations (Melamed 1997, Caseli et al. 2009).

A.3 Biomedical terminology extraction

The Genia corpus is a set of 2,000 abstracts from the biomedical domain retrieved with the keywords "human", "blood cells" and "transcription factor" (Ohta et al. 2002). It contains 18.5K sentences and 490.7K tokens, annotated with part-of-speech tags and with terminology, such as names of diseases (fibroblastic tumour) and of cells (primary t lymphocytes). On average, each sentence contains 3 MWTs, each with an average of 3 tokens. This shows that terminology is omnipresent in the corpus. The terms of the Genia corpus tend to be deeply nested. The experiments described below were implemented as a series of scripts in Python, awk and bash, available at www.inf.ufrgs.br/~ceramisch/tcc.zip.

First, the corpus data was pre-processed (SanJuan et al. 2005): all sentences were converted to lower case and a simplified XML version of the corpus was generated. In this version, the term annotation is kept, not to be used during extraction, but to serve as a basis for comparison during the evaluation. On the one hand, an MWT candidate is extracted from the corpus if it obeys certain pre-established norms or rules, independently of the annotation. On the other hand, the gold standard is the set of annotated terms that serve as the basis of comparison for the automatic evaluation of the candidates. The Genia gold standard contains 28,340 distinct MWTs. Consider the example below (the POS of each word is indicated after a slash, terms are between square brackets):

Example A.1 [ [ alpha/N a/UKN ] -gene/A product/N ]

Note that the terminological annotation breaks words at hyphens and at slashes. This is harmful, since one of the two halves of the word is assigned an unknown POS (UKN). As this phenomenon occurs in 17% of the sentences of the corpus, we decided to include the outside token in the expression, so as to improve the quality of the annotation. The result is illustrated by the example below. With this modification, only 81 sentences are discarded for containing words with an unknown POS:

Example A.2 [ [ alpha/N a-gene/A ] product/N ]

In order to take number inflection into account, each realisation of a word had to be reduced to its singular form. A general-purpose lemmatiser might turn the noun binding into the verb bind, whereas a simple rule removing a final "s" would produce lemmas such as virus → viru* or viruses → viruse*. We therefore developed a simple lemmatiser based on rules and on Web frequencies, whose behaviour is described in detail in annex D.

The last pre-processing step concerns a characteristic that is specific to biomedical texts, which tend to insert acronyms between parentheses after recurrent terms. When these parentheses appear inside an MWT, they generate noise for the filters, as in the example [ [ granulocyte-macrophage colony-stimulating factor ] (gm-csf) promoter ]. Around 3% of the terms contain such parentheses, and in order to remove only those containing acronyms, we used an algorithm that combines rules with a longest-common-substring matching procedure, described in annex D. We are thus able to identify the acronyms in the corpus and then to remove those that appear between parentheses. This acronym list (for example AD – alzheimer 's disease, 5LOX – 5 lipoxygenase, AD – adenovirus) is a by-product of the MWT extraction.

Once the corpus has been uniformly pre-processed, the MWT candidates must be extracted. Before anything else, the corpus is split into two parts: Genia-test (100 abstracts, 895 sentences) and Genia-train (the rest). The former is reserved for later use during the evaluation phase. The latter is used to learn not only the morphosyntactic patterns but also the weights given to each association measure during the candidate filtering step that follows extraction. The extraction of morphosyntactic patterns is performed as follows: (1) based on the term annotation of the corpus, we extracted the gold standard of Genia-train; (2) we selected the 118 patterns that appear more than 10 times; (3) among these, we kept only the 57 patterns whose precision (proportion of positive candidates) is above 30%; and (4) we extracted from the corpus the set of n-grams that match these patterns. The values of the frequency and precision thresholds were obtained from empirical observation of the data. The selected patterns include sequences of nouns and adjectives (N-N, A-N, A-N-N), but also foreign words (FW-FW) and verbs (N-V-N). The result of this step is a list of 60,585 MWT candidates, of which 22,507 are positive (31.52%).
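The pattern selection procedure in steps (1)–(4) above can be summarised by the following sketch; the data structures are assumptions made for the illustration (the actual implementation is part of the Python/awk/bash scripts mentioned earlier).

    # Illustrative sketch of POS-pattern selection: keep gold-standard patterns
    # seen more than 10 times whose candidate precision exceeds 30%.
    from collections import Counter

    def select_patterns(gold_pos_sequences, pattern_stats,
                        min_freq=10, min_precision=0.30):
        """gold_pos_sequences: POS-tag tuples of the annotated MWTs of Genia-train,
        e.g. ('N', 'N') or ('A', 'N', 'N');
        pattern_stats: pattern -> (extracted_candidates, true_positive_candidates)."""
        frequent = [p for p, f in Counter(gold_pos_sequences).items() if f > min_freq]
        selected = []
        for pattern in frequent:
            extracted, positives = pattern_stats.get(pattern, (0, 0))
            if extracted and positives / extracted > min_precision:
                selected.append(pattern)
        return selected  # 57 patterns on Genia-train, e.g. N-N, A-N, A-N-N, FW-FW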
After extracting the candidates, we proceed to their filtering. The filters are based on statistical Association Measures (AMs) (Evert and Krenn 2005, Ramisch et al. 2008, Pecina 2008). The probability distribution of the vocabulary of a language is Zipfian and exhibits the phenomenon known as the "long tail". Frequencies therefore become very low, which implies low reliability for the AMs. For this reason, we also used frequencies obtained from the World Wide Web (through a Yahoo! search API), which we consider here as a general-language corpus, henceforth called Web. To estimate the size N of this corpus, we used a rough estimate1, which is simply a linear scaling factor identical for all candidates. The acronym list is used here to increment the frequency of the words taking part in an acronym at each of its occurrences. For each candidate w1 ... wn and for each corpus of size N, the simple frequency f(w1 ... wn) and the marginal frequencies f(w1) ... f(wn) are used to compute the AMs below:

1 50 billion pages, according to http://www.worldwidewebsize.com/

\[
\mathrm{prob} = \frac{f(w_1 \ldots w_n)}{N}
\qquad
\mathrm{PMI} = \log_2 \frac{f(w_1 \ldots w_n)}{f_\emptyset(w_1 \ldots w_n)}
\qquad
t = \frac{f(w_1 \ldots w_n) - f_\emptyset(w_1 \ldots w_n)}{\sqrt{f(w_1 \ldots w_n)}}
\qquad
\mathrm{Dice} = \frac{n \cdot f(w_1 \ldots w_n)}{\sum_{i=1}^{n} f(w_i)}
\]

Here, the null hypothesis assumes independence between the words, that is, that their co-occurrence happens by chance and that its expected frequency is \( f_\emptyset(w_1 \ldots w_n) = f(w_1) \cdots f(w_n) / N^{n-1} \). Measures based on the contingency table, even though they are more robust, are not used here because, for arbitrary n, not only is it difficult to build a frequency hypercube, but noise and inconsistencies (especially in the Web corpus) also prevent its use in practice. Several studies discuss the choice of the best AM; our approach is instead to use machine learning for this task. We thus separate the candidates to be filtered into two classes: MWT and NonMWT. The algorithms used include decision trees (J48), statistical Bayesian classifiers, artificial neural networks (MLP and voted perceptron) and numerical classifiers that act on the decision border, namely support vector machines (SVM).

Once the multi-word terms have been extracted from the corpus, we propose ways to integrate them into the Moses statistical MT toolkit. The system was trained on the Europarl corpus with the language pair English → Portuguese, for which we have native speakers able to evaluate the results. The Europarl corpus was aligned at the sentence level and filtered, tokenisation and case were homogenised, and two subsets were set apart for development and testing. We then generated a language model, the word-to-word alignments and the translation table, and finally learned, through error-rate minimisation, the weights of each feature of the model.

Several techniques have been proposed in the literature to split and recombine compound words in German or Swedish (Nießen and Ney 2004, Nießen et al. 2000). We use a similar method for English terms: consider, as an example, the input sentence ras-related gtp-binding proteins and leukocyte signal transduction. The first approach (wws) consists of concatenating the words that make up a term, e.g. ras-related#gtp-binding#proteins and leukocyte signal#transduction. The second method (head) is to replace the term by its head, assumed to be the last noun of the n-gram, e.g. proteins and leukocyte transduction. The gold standard was also used to evaluate the splitting and recombination of terms in Moses. The head-replacement technique simplifies the original text to ease the work of the MT system, in such a way that the original text entails the simplified text, which is then translated. Semantically, the information conveyed by the translation contains the meaning of the original sentence but is not equivalent to it.
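The four association measures defined at the beginning of this section translate directly into code; below is a minimal sketch assuming that the n-gram frequency, the marginal frequencies and the corpus size are already available.

    # The four association measures, for an n-gram with observed frequency
    # f_ngram, marginal frequencies f_w[i] and corpus size N.
    import math

    def association_measures(f_ngram, f_w, N):
        n = len(f_w)
        f_null = math.prod(f_w) / N ** (n - 1)  # expected frequency under independence
        prob = f_ngram / N
        pmi  = math.log2(f_ngram / f_null)
        t    = (f_ngram - f_null) / math.sqrt(f_ngram)
        dice = n * f_ngram / sum(f_w)
        return {"prob": prob, "PMI": pmi, "t": t, "Dice": dice}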

A.4 Évaluation

La première évaluation concerne le prétraitement, et est donc effectuée sur la totalité du corpus. Nous appellerons ici « normalisation » le processus de lemmatisation et d'homogénéisation de la séparation des mots ; cette étape est suivie par l'Élimination d'Acronymes (EA). L'évaluation manuelle, sur un échantillon de mots, de l'algorithme de transformation pluriel-singulier montre qu'il donne un résultat correct pour 99% des mots. Dans le tableau A.1, nous pouvons vérifier que, vis-à-vis de la configuration sans normalisation, la terminologie standard est 3,89% plus petite, alors que la liste des candidats est réduite de 5,35%. Cela montre que les candidats et termes similaires (par exemple, lorsque l'un est le pluriel de l'autre) sont désormais traités en tant qu'entité unique représentant un même concept.

TAB. A.1 – Évaluation de la normalisation et de l'Élimination d'Acronymes (EA).

                                    Sans norm.      Sans EA         Avec norm. et EA
  TMMs de la term. standard         29.513          28.601          28.340
  Candidats à TMM                   66.817          61.401          63.210
  Précision                         15,56%          37,53%          37,24%
  Rappel                            35,22%          80,58%          83,06%
  F-measure                         21,58%          51,20%          51,42%
  Fréquence moyenne des top-100     189,98          205,05          205,38
  Hapax legomena                    51.169 (77%)    46.330 (75%)    47.770 (76%)

Malgré son aspect réductionniste (le contexte est ignoré), cette approche donne des résultats pratiques intéressants : le rappel augmente significativement avec la normalisation, de 35% à 83% ; la précision est doublée, passant de 15% à 37%. De plus, les fréquences des mots sont moins faibles puisque les candidats similaires sont regroupés et leurs nombres d'occurrences additionnés. Le nombre de hapax legomena (candidats qui n'apparaissent qu'une seule fois) diminue, mais correspond encore aux deux tiers des candidats. D'autres heuristiques de normalisation pourraient être utilisées, comme par exemple l'unification des déclinaisons de genre ou des conjugaisons verbales. Il est néanmoins important de souligner que tout traitement de ce type est fortement dépendant de la langue et du domaine en question.

Toujours dans le même tableau, nous présentons la configuration sans EA. Ici, la terminologie standard est aussi plus petite, mais la réduction n'est pas aussi importante que nous l'espérions (la liste d'acronymes compte 1.300 entrées), probablement parce que la plupart des acronymes apparaissent après les TMMs, et non pas au milieu d'eux. Plus de candidats sont extraits, ce qui augmente le rappel de 2,5% mais diminue légèrement la précision. Il semble aussi que l'EA aide à générer des données moins rares. Les acronymes sont également utilisés pour augmenter la fréquence des mots qu'ils remplacent ; par conséquent, le nombre moyen d'occurrences marginales augmente de 37 à 47. Cependant, l'importance de l'étape d'élimination des acronymes reste discutable, car les améliorations qu'elle apporte sont peu visibles sur le résultat final.
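Pour fixer les idées, voici une esquisse purement hypothétique (noms et données inventés) du regroupement opéré par la normalisation : les variantes ramenées à une même forme normalisée sont fusionnées et leurs fréquences additionnées.

    from collections import Counter

    def normalise(candidate, singular=lambda w: w.rstrip("s")):
        # Simplification grossière : minuscules + heuristique pluriel -> singulier
        return " ".join(singular(w.lower()) for w in candidate.split())

    raw_counts = {"transcription factors": 12, "Transcription factor": 30,
                  "t cells": 8, "t cell": 40}

    merged = Counter()
    for cand, freq in raw_counts.items():
        merged[normalise(cand)] += freq

    print(merged)   # Counter({'t cell': 48, 'transcription factor': 42})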

TAB. A.2 – Précision et rappel des motifs de TMM.

  Motifs morphologiques    Précision    Rappel
  J&K                      29,58%       66,71%
  f > 10                   16,50%       92,57%
  f > 10, p > .3           31,52%       89,43%

Une fois que le prétraitement est terminé, nous voulons évaluer l'étape de sélection des motifs morphologiques des candidats. Quand nous comparons les motifs utilisés par Justeson and Katz (1995) (J&K), ceux qui ont plus de 10 occurrences et ceux qui ont été sélectionnés ( f > 10, p > .3), nous pouvons voir que, d'une part, J&K ont un bon rappel mais une précision très faible sur ce corpus (où les TMMs ont tendance à être longs et imbriqués), et que, d'autre part, les motifs les plus fréquents ont un rappel élevé tandis que leur précision est légèrement supérieure à la moitié de celle de J&K. Cela nous mène aux motifs sélectionnés, qui offrent un bon équilibre, avec la plus grande précision parmi les ensembles de motifs étudiés et un rappel légèrement inférieur à celui de l'ensemble f > 10. Parmi les motifs sélectionnés, nous soulignons (a) des problèmes liés aux conjonctions, comme dans young and aged subject, qui est en fait composé de deux TMMs indépendants,

TAB. A.3 – Précision moyenne (%) des MA statistiques.

                          Genia                               Web
          prob     PMI      t        Dice     prob     PMI      t        Dice     Base
  train   46,16    40,08    52,90    38,84    35,77    41,02    45,13    37,85    37,15
  test    38,83    17,51    31,36    22,78    41,49    20,52    43,21    43,74    24,16

(b) la présence de motifs contenant des chiffres (interleukin 2) et (c) l'absence du motif N-P-N, qui est pourtant considéré par J&K comme un bon indicateur de TMM. Il est intéressant de vérifier que, quand la terminologie standard de l'ensemble d'entraînement est utilisée pour évaluer l'ensemble de test, les motifs sélectionnent environ deux fois plus de termes que l'annotation existante. Ainsi, le rappel atteint 200% alors que, pour chaque motif, la précision oscille entre 65% et 95%. Ces résultats montrent qu'une annotation plus soigneuse pourrait améliorer les valeurs absolues de l'évaluation.

Le tableau A.3 montre la précision moyenne de chacune des mesures d'association ainsi que la précision de la ligne de base (proportion initiale de candidats positifs) pour les fragments d'entraînement et de test. D'après ces valeurs, malgré la supposition (fausse) d'une distribution normale des mots, le test t a une bonne performance sur les deux parties du corpus, ce qui est en accord avec les résultats d'Evert and Krenn (2005), Manning and Schütze (1999). La probabilité simple (prob) semble également un bon indicateur de TMM puisque sa précision moyenne est élevée. Parmi les mesures évaluées, PMI est celle qui présente la plus faible performance, étant même inférieure à la ligne de base sur Genia-test, contrairement à ce qui est indiqué dans Smadja (1993). Cependant, Evert and Krenn (2005) trouvent le même résultat et affirment que le test t et le rapport de vraisemblance logarithmique (mesure qui utilise le tableau de contingence) ont souvent les meilleures performances. Un algorithme simple de sélection d'attributs confirme que les mesures les plus pertinentes sont probGenia, tGenia et tWeb. Quelques incohérences ont été observées dans le comportement des MA sur le corpus Web, puisque les mesures semblent mauvaises sur la partie Genia-train tandis qu'elles sont bonnes sur Genia-test. L'analyse du vocabulaire du corpus Web (figure A.1) montre qu'il existe une explication possible pour ce comportement apparemment arbitraire des mesures sur le Web. L'API de Yahoo! retourne, pour les 735 mots les plus fréquents, un nombre constant d'occurrences (2.147.483.647), de sorte qu'il est impossible de faire la différence entre des mots comme the et green (qui apparaissent respectivement 22.446 fois et 1 fois dans le corpus Genia).

[Figure : histogramme log-log du vocabulaire (fréquence en fonction du rang) de Genia-train et du Web.]

FIG. A.1 – Histogramme du vocabulaire de Genia-train et du Web.

TAB. A.4 – Performance des algorithmes d'apprentissage.

              J48      Bayes(TAN)   Bayes(SA)   MLP      VP       SVM(pol.)   SVM(rad.)
  Précision   47.9%    48.6%        52.4%       27.3%    36.3%    34.9%       52.6%
  Rappel      19.7%    32.1%        28.9%       0.5%     63.8%    60.1%       35.4%

TAB. A.5 – Évaluation comparative de la méthode SVM avec plusieurs seuils et l’algorithme Xtract. Genia-test Genia-train Aucun seuil fGenia > 1 fGenia > 5 Xtract Xtract nb. TMMs extraits 763 739 174 70 1.558 Précision 52,56% 56,83% 74,14% 70% 66,81% Rappel 19,96% 20,91% 6,42% 2,44% 3,84% proximation grossière de la part du moteur de recherche, génère la courbe « cassée » visible dans le graphique, certainement nuisible aux mesures d’association. Ainsi, contrairement à ce que suggèrent Zhang et al. (2006), le coefficient de corrélation de Pearson entre les corpus est de seulement 0,12 alors que le coéfficient τ de Kendall est de 0,21. Dans l’avenir, nous souhaitons étudier une façon d’intégrer les points forts des deux corpus, puisque Genia sous-estime la probabilité des mots peu fréquents, tandis que Web le fait pour les mots les plus fréquents. La performance des algorithmes d’apprentissage artificiel sur le corpus Genia-test est comparée dans le tableau A.4. Seul la classe MWT est considérée parce que, comme les données ne sont pas équilibrées, la classe NonMWT a une F-mesure supérieure à 80% pour tous les algorithmes. L’algo- rithme J48 génère un modèle de données facile à interpréter (un arbre avec 484 nœuds) qui classifie correctement 75% des instances. Les attributs les plus proches de la racine (les plus pertinents) sont les valeurs t dans les deux corpus, tandis que l’attribut qui semble être le moins important est le motif morphologique, généralement dans le nœud qui précède les feuilles. Le problème de cet algorithme (et bien d’autres) est qu’il considère que les deux classes ont le même poids. Dans ce sens, l’apprentissage bayésien est plus efficace, surtout quand la recherche par arbre de dispersion minimal (TAN) est utilisée au lieu de la recherche par recuit simulé (SA). La couverture de ces deux méthodes reste décevante parce qu’elles ont le même problème que l’algorithme J48, i.e. elles donnent un poids égal aux deux classes. Le réseau de neurones multicouche (MLP) avec 33 neurones dans la couche caché a une performance très basse : seulement 5% des candidats positifs sont capturés. L’algorithme Voted Perceptron (VP) a une performance surprenante, avec un très bon rappel sur les candidats positifs. Les causes de ce résultat restent à étudier. Les meilleurs résultats ont été obtenus avec les machine à vecteur de support (SVM) de noyau polynomial et radial. Chacun tend à donner plus d’importance à l’un des aspects : le premier a un bon rappel alors que le dernier a une précision de 52,6%. Le choix entre un noyau polynomial ou radial dépend de l’application.Par exemple, la construction d’un dictionnaire est très onéreuse, de sorte qu’une haute précision diminue la quantité d’effort manuel à fournir. Cependant, une application de TAL peut éventuellement vivre avec des données bruitées à condition que le rappel soit suffisamment grand. Dans lesétapes suivantes, nous avons choisi un SVM à noyau radial, qui renvoie sur l’ensemble de test une liste de 763 instances positives dont 362 sont fausses. Au cas où nous n’aurions pas filtré les données avec un SVM, la proportion de candidats positifs serait de seulement 24%. Malgré l’impression que l’on a, que les valeurs absolues de précision et de rappel sont basses. 62

Bien que les valeurs absolues de précision et surtout de rappel puissent sembler basses, quand nous comparons ces résultats avec le système conventionnel d'extraction d'unités multimot Xtract, nous pouvons voir que la méthode suggérée ici est supérieure. Dans le tableau A.5, nous comparons la performance du classificateur SVM avec plusieurs valeurs de seuil de fréquence sur la liste de candidats. La configuration où nous éliminons les hapax legomena ( fGenia > 1) fait monter à la fois la précision et le rappel, ce qui montre que ces candidats doivent effectivement être enlevés de la liste. Cependant, un seuil plus haut ( fGenia > 5) fait rapidement chuter le rappel. Comme notre méthode utilise les informations présentes dans Genia-train pour extraire les TMMs de Genia-test, nous la comparons avec Xtract sur l'ensemble de test, mais aussi avec Xtract sur l'ensemble d'entraînement, plus grand et donc doté de fréquences plus fiables et moins rares. Dans les deux cas, Xtract se montre très conservateur et génère une liste de TMMs avec une précision assez haute mais un rappel extrêmement bas. Nous observons que, même dans la configuration la plus restrictive, le SVM obtient un résultat global supérieur à celui de Xtract, ce qui montre que la méthode d'extraction de TMMs proposée dans ce travail apporte un gain réel par rapport à un algorithme standard.

Finalement, pour des raisons pragmatiques de temps de traitement et d'espace de stockage, il n'a pas été possible d'achever l'évaluation de l'intégration TMM-TA. L'idée de cette étape est de fournir aux évaluateurs (des locuteurs natifs) une liste de traductions produites par les différentes méthodes d'intégration (wws et head, avec TMMs extraits et avec la terminologie standard). Les traductions seront alors évaluées selon leur fluidité, car leur adéquation sera certainement plus basse, puisque la méthode effectue une simplification de l'entrée. Une interface permettant la visualisation des termes dans l'original et la collaboration à leur traduction sera également créée et soumise à des techniques standard d'évaluation d'interfaces.

A.5 Conclusions

Le travail présenté ici offre une évaluation systématique des techniques d'extraction terminologique sur un corpus biomédical. D'abord, nous avons discuté les hypothèses et les notions de base d'expression et de terme multimot, ainsi que leur caractérisation à travers des théories linguistiques. Plusieurs travaux traitent les expressions multimot d'un type et d'une langue déterminés. La plupart des tâches et des méthodes discutées ont le bigramme comme base, et sur ce point nous croyons que la méthode proposée ici est supérieure, puisque nous sommes capables de traiter les TMMs d'une longueur n quelconque. En somme, notre méthode est composée de quatre étapes : prétraitement, extraction de candidats à travers des motifs morphologiques sélectionnés, filtrage des candidats par une combinaison de mesures d'association et d'algorithmes d'apprentissage artificiel et, en dernier lieu, intégration de la terminologie aux systèmes de TA statistique.

Le tableau A.6 liste quelques expressions extraites de Genia-test avec le SVM nGenia > 5. La plus grande partie des termes sont des noms composés, avec une prédominance des bigrammes. Les exemples négatifs semblent soit appartenir à un terme plus long, soit contenir un terme plus court ; une évaluation plus fine, avec des classes comme Contains et PartOf, pourrait probablement vérifier cette hypothèse. Il est important de souligner que la terminologie extraite ne vise, à aucun moment, à remplacer ou à améliorer l'annotation existant a priori dans le corpus. L'objectif de ce travail n'était pas la génération d'une nouvelle terminologie pour le corpus Genia, parce que celui-ci comporte déjà une telle structure. Au contraire, l'annotation des termes a été utilisée pour atteindre notre objectif initial, qui était d'évaluer (de façon automatisée) les méthodes d'extraction d'unités multimot dans un domaine spécifique.

Dans l'avenir, nous voulons évidemment conclure l'évaluation suggérée ici par l'intégration entre terminologie et traduction automatique, afin d'établir la validité de la méthodologie que nous proposons. De plus, nous voulons étudier de nouvelles façons de combiner les fréquences obtenues dans des corpus de nature différente.

TAB. A.6 – Exemples de TMMs extraits de Genia-test (« * » représente une erreur).

  TMM extrait             Motif      | TMM extrait             Motif
  t cell                  N-N        | thromboxane receptor*   N-N
  nf kappa b              N-N        | peripheral blood        A-N
  kappa b                 N-N        | nf kappa b              N-N-N
  cell line               N-N        | human t                 A-N
  transcription factor    N-N        | receptor alpha          N-N
  i kappa b               N-N-N      | monocytic cell          A-N
  i kappa*                N-N        | binding activity        N-N
  kappa b alpha*          N-N-N      | stat protein            N-N
  b alpha*                N-N        | receptor gene*          N-N
  i kappa b alpha         N-N-N-N    | rap1 protein*

Un des points faibles de l'évaluation que nous avons réalisée est qu'elle est fondée sur un seul corpus d'un domaine unique ; l'objectif des travaux à venir est donc aussi d'étendre ces résultats à d'autres langues et à d'autres domaines. Le modèle et les attributs de filtrage que nous avons discutés sont totalement indépendants de la langue et du domaine. Ainsi, seules les étapes de prétraitement et de sélection de motifs doivent être adaptées à un nouveau contexte d'application. Il sera donc possible d'adapter facilement les outils de TAL au traitement du lexique spécialisé des différents domaines, de manière à aider à la diffusion des connaissances.

APÊNDICE B
RESUMO ESTENDIDO
EXTRAÇÃO DE TERMOS MULTI-PALAVRAS PARA DOCUMENTOS ESPECIALIZADOS

B.1 Introdução

O objetivo maior do Processamento de Linguagem Natural (PLN) é tornar a comunicação entre humanos e computadores tão simples quanto pedir um favor a um amigo. A língua natural é um componente essencial do comportamento inteligente, levando à classificação do PLN como sub-área da inteligência artificial, área da qual o PLN herda muitos dos métodos e princípios. A língua humana, por uma série de fatores, é complexa para a modelagem computacional. Uma fração importante do conhecimento humano é expressa através da linguagem natural: a língua é meio de intercâmbio de conhecimentos e serve para moldar as bases da ciência. Conseqüentemente, sistemas de PLN têm grande interesse na chamada linguagem especializada ou técnico-científica. Uma das características desse tipo de linguagem é seu léxico, rico em estruturas terminológicas, que são o foco desse trabalho. Nesse contexto, acredita-se que a terminologia possa ser adquirida automaticamente através de corpora de domínio específico e técnicas de aprendizado dirigido pelos dados.

Suponhamos um usuário que necessita encontrar informações sobre tradução automática baseada em exemplos. Usando um mecanismo de busca na Web, essa tarefa pode ser tediosa pois, não somente a maioria dos resultados estará em inglês, mas provavelmente será também difícil filtrar a grande quantidade de conteúdo (irrelevante) gerada por usuários. Mecanismos de busca especializados precisam levar em conta a terminologia do domínio: por exemplo, os termos árvore e folha possuem estatuto terminológico, mas os conceitos que representam não são os mesmos em botânica e em informática. Em resumo, com a tecnologia atualmente disponível, o usuário que exprime uma necessidade de informação específica a um domínio usando sua língua mãe precisa interrogar diversas fontes externas para traduzir as palavras-chave e os documentos retornados na busca. Não só na Web, mas de maneira geral, o acesso multi-língüe a textos especializados é de grande interesse, pois possui aplicações em todas as áreas da difusão da ciência e também dos produtos. O estudo de métodos de recuperação de informações multi-língüe não cabe nesse trabalho, porém o exemplo serve de motivação para a problemática da integração dos termos do domínio com a tradução automática.

Primeiramente, é necessário definir três conceitos fundamentais ao trabalho: a expressão multi-palavras, o termo técnico-científico e o termo multi-palavras. A definição de Expressão Multi-Palavras (EMP) é ampla, pois o termo engloba diversos fenômenos distintos como compostos nominais (traffic light, feu rouge), expressões idiomáticas (sem pé nem cabeça, cloud number nine) e termos compostos (sistema operacional, Datenbanksystem). Os termos são, ao contrário das EMPs de propósito geral, fenômenos lingüísticos ligados ao texto técnico-científico. Termos podem ser unidades lexicais únicas, enquanto EMPs são necessariamente compostas de mais de uma palavra. Seguindo a linha da Teoria Geral da Terminologia, acredita-se que existe uma relação 1:1 entre termos e conceitos. Essa concepção, no entanto, foi bastante criticada por ser reducionista e não levar em conta o termo no seu habitat natural: o texto. Nesse trabalho, assume-se que termos são unidades monossêmicas e independentes de contexto. Os Termos Multi-Palavras (TMP) são definidos como locuções que possuem estatuto terminológico. Destaca-se que os mesmos não correspondem ao conceito de fraseologia do domínio.
O primeiro aceita pouca variabilidade (morfológica, raramente sintática), enquanto o segundo é altamente flexível. Enquanto o primeiro representa um único conceito, não é raro que o segundo seja uma estrutura complexa que associe mais de um conceito. A Terminologia e a Tradução possuem diversos pontos em comum no que diz respeito à tradução especializada. Da mesma forma, TMPs e Tradução Automática (TA) deveriam ser áreas bastante ligadas, apesar de a realidade atual em PLN refletir pouco essa interdependência. Por isso, um dos objetivos desse trabalho é a detecção de termos para a adaptação de recursos lingüístico-computacionais a diferentes domínios. Escolheu-se trabalhar apenas com termos multi-palavras porque (a) os termos multi-palavras são representativos de cerca de 70% da terminologia, (b) os métodos computacionais de detecção de termos multi-palavras e mono-palavras são bastante diferentes e (c) os TMPs representam um grande desafio em lingüística computacional.

Definição B.1 Uma Expressão Multi-Palavras (EMP) é um conjunto de duas ou mais palavras com semântica não-composicional, ou seja, o sentido do sintagma não pode ser compreendido totalmente através do sentido de suas componentes (Sag et al. 2002).

Definição B.2 Um termo é uma unidade lexical ou multi-lexical com significado não ambíguo quando empregada em textos especializados, ou seja, a terminologia de um domínio é a representação lingüística dos seus conceitos (Krieger and Finatto 2004).

Definição B.3 Um Termo Multi-Palavras (TMP) é um termo composto por mais de uma palavra (SanJuan et al. 2005, Frantzi et al. 2000).

B.2 Trabalhos relacionados

O trabalho aqui apresentado aborda a integração de três sub-áreas de PLN: expressões multi-palavras, terminologia e tradução estatística. EMPs constituem um desafio para PLN por apresentarem alta variabilidade morfossintática e natureza heterogênea. Técnicas para tratar EMPs são classificadas em “palavras-com-espaços” ou composicionais (Sag et al. 2002). Quatro tarefas estão ligadas ao tratamento dessas estruturas: identificação de ocorrências em textos, interpretação da estrutura (sintática), classificação/agrupamento (semânticos) e aplicação em outros sistemas de PLN. É possível identificar candidatos a EMP através de padrões morfológicos rasos (Villavicencio et al. 2007, Manning and Schütze 1999) ou de sintaxe (profunda) (Seretan 2008, Baldwin 2005). Já a filtragem dos candidatos pode ser feita através de medidas estatísticas (Evert and Krenn 2005, Villavicencio et al. 2007), possivelmente complementadas por freqüências advindas da Web (Keller et al. 2002, Zhang et al. 2006). Abordagens simbólicas empregam geralmente tesauros e dicionários de sinônimos (Pearce 2001). Adicionalmente, aprendizado de máquina supervisionado é por vezes usado para criar modelos de extração de EMPs (Pecina 2008). A interpretação da estrutura interna das EMPs depende de seu tipo específico: por exemplo, Ramisch et al. (2008) classificam verbos frasais com relação à idiomaticidade, ao passo que Nakov and Hearst (2005) tratam da estrutura de aninhamento de nomes compostos. Com relação à classificação de EMPs, cita-se o trabalho de Lapata (2002), que investiga em particular nominalizações.

Uma técnica comumente usada para a extração de termos é a utilização de filtros morfológicos freqüentes (Justeson and Katz 1995). Frantzi et al. (2000) propõem uma técnica mais apurada para a classificação de candidatos, enquanto Smadja (1993) usa a informação mútua das palavras para detectar comportamento colocacional. Considerando apenas termos da área biomédica no corpus Genia, SanJuan et al. (2005) realizam uma classificação semântica automática e a comparam com a ontologia do domínio.

A Tradução Automática (TA) tem evoluído nos últimos anos para métodos colaborativos ou estatísticos. O primeiro é baseado na tripla automaticidade, cobertura e precisão, deixando o usuário escolher qual dos aspectos será favorecido (Bey et al. 2006, Boitet et al. 2008). Já algoritmos estatísticos aprendem probabilidades de tradução a partir de um corpus paralelo (Brown et al. 1993). Os alinhamentos de palavras são obtidos através do algoritmo EM de aprendizado não-supervisionado. Sistemas podem ser baseados em palavras mas também em sintagmas (seqüências de palavras), como descrito em Koehn et al. (2003). Em domínios especializados, sistemas de PLN manifestam problemas quando o texto de treinamento difere do texto de entrada, por exemplo a falta de entradas no léxico de um sistema de análise sintática descrito por Baldwin et al. (2004). A extração de EMPs através de alinhamentos palavra-a-palavra fornece meios de extrair ao mesmo tempo unidades léxicas multi-palavras e suas respectivas traduções (Melamed 1997, Caseli et al. 2009).

B.3 Extração terminológica na biomedicina

O corpus Genia é um conjunto de 2.000 resumos da área biomédica sobre as palavras-chave “humano”, “célula sangüínea” e “fator de transcrição” (Ohta et al. 2002). Ele contém 18,5K frases e 490,7K tokens, anotados com classe morfossintática e também com terminologia, como nomes de doenças (tumor fibroblástico) e células (linfócito t primário). Em média, as frases contêm 3 TMPs, cada um com, em média, 3 tokens, mostrando o quão onipresente é a terminologia do domínio no corpus. Os termos do corpus Genia tendem a ser profundamente aninhados. Os experimentos descritos a seguir são uma série de scripts desenvolvidos em Python, awk e bash, disponíveis em www.inf.ufrgs.br/~ceramisch/tcc.zip.

Primeiramente, os dados do corpus foram pré-processados (SanJuan et al. 2005). Todas as sentenças foram convertidas para minúsculas e uma versão XML simplificada do corpus foi gerada. Nessa versão, a anotação de termos foi mantida, não para ser usada na extração, mas para servir de base de comparação durante a avaliação. Um candidato a TMP é extraído do corpus se obedecer a certos padrões e regras pré-estabelecidos, independentemente de anotação. Já a terminologia padrão é, ao contrário, o conjunto de termos anotados que servirão como base de comparação e avaliação automática. A terminologia padrão do Genia possui 28.340 TMPs distintos. Considere o exemplo abaixo (classe da palavra indicada por “/”, termos entre colchetes):

Exemplo B.1 [ [ alpha/N a/UKN ] -gene/A product/N ]

Veja que a anotação de terminologia quebra a palavra em hífens e barras. Isso é prejudicial pois a meia palavra acaba ficando com classe morfossintática desconhecida (UKN). Como tal problema ocorre em 17% das frases, decidiu-se incluir o token externo na expressão para melhorar a anotação. O resultado é como o exemplo abaixo. Com essa modificação, apenas 81 frases foram descartadas:

Exemplo B.2 [ [ alpha/N a-gene/A ] product/N ]

Para tratar inflexão de número, é necessário converter cada realização de palavra à sua forma singular. Um lematizador de propósito geral poderia converter para bind o substantivo binding, enquanto uma regra simples de eliminação de “s” no fim das palavras levaria a lemas do tipo virus → viru* ou viruses → viruse*. Por isso, foi criado um lematizador simples baseado em regras e em freqüências da Web, cujo funcionamento é descrito no anexo D.

A última etapa de pré-processamento diz respeito a uma característica dos textos biomédicos, que tendem a inserir acrônimos entre parênteses após um termo recorrente. Quando esses parênteses ocorrem no meio de um TMP, introduzem ruído para os filtros, como no exemplo [ [ granulocyte-macrophage colony-stimulating factor ] (gm-csf) promoter ]. Aproximadamente 3% dos termos possuem parênteses. Para remover apenas aqueles que contêm acrônimos, usou-se um algoritmo que mistura regras e correspondência difusa por sub-cadeia máxima, descrito no anexo D. Dessa forma, é possível detectar acrônimos no corpus e então remover somente aqueles que se encontram entre parênteses. Essa lista de acrônimos (e.g. AD – alzheimer 's disease, 5LOX – 5 lipoxygenase, AD – adenovirus) é um produto colateral da extração de TMPs.

Uma vez que o corpus foi uniformemente pré-processado, é necessário proceder à extração de candidatos a TMP. Para isso, antes de mais nada, separaram-se duas partes: Genia-test (100 resumos, 895 frases) e Genia-train (restante do corpus). A primeira foi posta de lado para utilização futura durante a etapa de avaliação. A última parte foi usada para realizar o aprendizado, não só dos padrões morfológicos, mas também dos pesos dados a cada medida de associação durante a etapa de filtragem de candidatos, posterior à extração. A seleção dos padrões morfológicos foi feita da seguinte forma: (1) com base na anotação de termos, extraiu-se o conjunto de padrões morfológicos da terminologia padrão de Genia-train; (2) selecionaram-se os 118 padrões com ocorrência maior do que dez; (3) dentre os últimos, mantiveram-se apenas os 57 com precisão (proporção de candidatos positivos) superior a 30%; e (4) extraíram-se do corpus todos os n-gramas que obedeciam a esses padrões. Os valores de freqüência de corte e de precisão mínima foram obtidos com base em observações empíricas dos dados. Os padrões selecionados incluem seqüências de nomes e adjetivos (N-N, A-N, A-N-N), mas também palavras estrangeiras (FW-FW) e gerúndios (N-V-N). O resultado dessa etapa é uma lista com 60.585 candidatos a TMP que obedecem aos padrões selecionados, dos quais 22.507 são positivos (31,52%).

Tendo extraído os candidatos, procede-se então à sua filtragem. Os filtros são baseados em Medidas de Associação (MA) estatísticas (Evert and Krenn 2005, Ramisch et al. 2008, Pecina 2008). A distribuição de probabilidades do vocabulário da língua é zipfiana, com um fenômeno conhecido como “grande cauda”. Conseqüentemente, os dados de freqüência tornam-se esparsos e as MA apresentam pouca confiabilidade. Para amenizar esse efeito, usaram-se também as freqüências obtidas na World Wide Web (através da API de busca de Yahoo!), que consideramos como um corpus de língua geral doravante denominado Web. Para estimar o tamanho N do corpus usou-se uma estimativa grosseira¹, que é simplesmente um fator linear de escala igual para todos os candidatos. A lista de acrônimos é usada aqui para aumentar, a cada ocorrência de um acrônimo, as freqüências das palavras que o compõem. Para cada candidato w_1 … w_n e para cada corpus de tamanho N, a freqüência simples f(w_1 … w_n) e as freqüências marginais f(w_1), …, f(w_n) são usadas para calcular as MA abaixo:

\[
\mathrm{prob} = \frac{f(w_1 \ldots w_n)}{N}
\qquad
\mathrm{PMI} = \log_2 \frac{f(w_1 \ldots w_n)}{f_\emptyset(w_1 \ldots w_n)}
\]
\[
t = \frac{f(w_1 \ldots w_n) - f_\emptyset(w_1 \ldots w_n)}{\sqrt{f(w_1 \ldots w_n)}}
\qquad
\mathrm{Dice} = \frac{n \cdot f(w_1 \ldots w_n)}{\sum_{i=1}^{n} f(w_i)}
\]

Aqui, a hipótese nula a rejeitar, H_0, supõe independência entre as palavras, ou seja, elas co-ocorrem por acaso e sua freqüência esperada é f_∅(w_1 … w_n) = f(w_1) ⋯ f(w_n) / N^{n−1}. Medidas baseadas em tabela de contingência, apesar de mais robustas, não se aplicam porque, no caso de n arbitrário, não só é difícil construir um hipercubo de freqüências, como também o ruído e as incoerências (no corpus Web) impedem, na prática, sua aplicação. Diversos trabalhos discutem a escolha da melhor medida; o método aqui proposto, porém, consiste em usar aprendizado de máquina para essa tarefa.

150 bilhões de páginas, segundo http://www.worldwidewebsize.com/

Tabela B.1: Avaliação da normalização e Eliminação de Acrônimos (EA).

                                     Sem norm.       Sem EA          Com norm. e EA
  TMPs na terminologia padrão        29.513          28.601          28.340
  Candidatos a TMP                   66.817          61.401          63.210
  Precisão                           15,56%          37,53%          37,24%
  Cobertura                          35,22%          80,58%          83,06%
  F-measure                          21,58%          51,20%          51,42%
  Freqüência média no top-100        189,98          205,05          205,38
  Hapax legomena                     51.169 (77%)    46.330 (75%)    47.770 (76%)

Os candidatos filtrados são separados em duas classes: MWT e NonMWT. Os algoritmos usados incluem árvores de decisão (J48), aprendizado estatístico de classificadores bayesianos, treinamento de redes neurais artificiais (MLP e Voted Perceptron) e classificadores numéricos que atuam sobre a fronteira de decisão, i.e. máquinas de vetor de suporte (SVM).

Uma vez extraídos os termos multi-palavras do corpus, sugerem-se maneiras de integrá-los a um mecanismo de TA estatístico. O sistema escolhido, chamado Moses, foi treinado sobre o corpus Europarl com o par de línguas inglês → português, para o qual se tem falantes nativos dispostos a avaliar os resultados. O corpus Europarl foi alinhado no nível das frases, filtrado, homogeneizado na separação de palavras e na caixa, e então separado em três sub-conjuntos para desenvolvimento, teste e treinamento. Sobre o último, gerou-se um modelo de linguagem, os alinhamentos palavra-a-palavra e a tabela de tradução para, em seguida, treinar por minimização da taxa de erro os pesos de cada atributo do modelo.

Existem diversos métodos propostos para unir e separar componentes em alemão ou sueco (Nießen and Ney 2004, Nießen et al. 2000). Usou-se uma técnica semelhante para os termos em inglês: considera-se como exemplo a frase de entrada ras-related gtp-binding proteins and leukocyte signal transduction. A primeira abordagem (wws) consiste em concatenar as palavras que formam um termo, por exemplo ras-related#gtp-binding#proteins and leukocyte signal#transduction. A segunda abordagem (head) consiste em substituir o termo pelo seu núcleo, inferido como sendo o último substantivo do n-grama, por exemplo proteins and leukocyte transduction. A terminologia padrão também foi usada para avaliar a técnica de união e separação de termos no Moses. A substituição pelo núcleo é uma forma de simplificação do texto original que visa facilitar o trabalho do sistema de TA, de forma que o texto original implica o texto simplificado, que é, por sua vez, traduzido. Semanticamente, a informação transmitida pela tradução contém o significado da frase original (mas não é equivalente a ele).
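Como ilustração, segue um esboço hipotético em Python das duas transformações (wws e head); os nomes de funções e a representação dos termos não correspondem necessariamente aos scripts reais do trabalho.

    def wws(sentence, terms):
        """Concatena com '#' as palavras de cada termo anotado (words-with-spaces)."""
        for term in sorted(terms, key=len, reverse=True):
            sentence = sentence.replace(" ".join(term), "#".join(term))
        return sentence

    def head(sentence, terms):
        """Substitui cada termo pelo seu núcleo, assumido como o último nome do n-grama."""
        for term in sorted(terms, key=len, reverse=True):
            sentence = sentence.replace(" ".join(term), term[-1])
        return sentence

    s = "ras-related gtp-binding proteins and leukocyte signal transduction"
    terms = [["ras-related", "gtp-binding", "proteins"], ["signal", "transduction"]]
    print(wws(s, terms))   # ras-related#gtp-binding#proteins and leukocyte signal#transduction
    print(head(s, terms))  # proteins and leukocyte transduction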

B.4 Avaliação

A primeira avaliação diz respeito ao pré-processamento e é, portanto, realizada sobre a totalidade do corpus. Referir-se-á ao processo de lematização e homogeneização da separação de palavras como normalização, etapa que é seguida da Eliminação de Acrônimos (EA). A avaliação manual de um conjunto de palavras pelo autor mostra que o algoritmo de conversão plural-singular tem precisão de 99%; na tabela B.1 verifica-se que, com relação à configuração sem normalização, a terminologia padrão é 3,89% menor, enquanto a lista de candidatos é 5,35% menor. Isso mostra que candidatos e termos similares (por exemplo, onde um é o plural do outro) passam a ser tratados como uma só entidade, representante do mesmo conceito. Apesar de reducionista (o contexto não é levado em consideração), essa abordagem tem resultados práticos interessantes: a cobertura aumenta consideravelmente com o uso da normalização, de 35% para 83%; a precisão dobra, ou seja, passa de 15% a 37%. Além disso, as freqüências das palavras são menos esparsas, pois candidatos semelhantes são agregados, e suas freqüências, somadas. O número de hapax legomena (ocorrência única) diminui, porém ainda corresponde a dois terços da lista de candidatos. Outras heurísticas de normalização podem ser usadas, como inflexão de gênero ou de verbo, porém é muito importante lembrar que qualquer processamento desse tipo é altamente dependente da língua e do corpus em questão.

Ainda na mesma tabela, vê-se a configuração sem EA. Aqui a terminologia padrão também diminui de tamanho (cerca de 300 termos), porém menos do que o esperado (a lista de acrônimos possui 1.300 entradas), provavelmente porque a maioria dos acrônimos ocorre após, e não no meio, dos TMPs. Extraem-se mais candidatos, aumentando a cobertura em 2,5%, porém a precisão é ligeiramente menor. Tem-se a impressão de que a EA ajuda a gerar freqüências ligeiramente menos esparsas. Também usando os acrônimos para inflar as freqüências das palavras que os compõem, a freqüência marginal média aumenta de 37 para 47 unidades. No entanto, é questionável se esse procedimento é essencial, dado que as melhorias são pequenas sobre o resultado final.

Tabela B.2: Precisão e cobertura dos padrões de TMP.

  Padrões morfológicos    Precisão    Cobertura
  J&K                     29,58%      66,71%
  f > 10                  16,50%      92,57%
  f > 10, p > .3          31,52%      89,43%

Concluído o pré-processamento, quer-se avaliar a etapa de seleção de padrões morfológicos dos candidatos. Quando se comparam os padrões usados por Justeson and Katz (1995) (J&K), os com freqüência maior que 10 e aqueles que foram selecionados ( f > 10, p > .3), percebe-se que J&K possui boa precisão, porém cobertura muito baixa sobre esse corpus (onde os TMPs tendem a ser longos e aninhados). Já os padrões mais freqüentes possuem melhor cobertura, com precisão pouco superior à metade da obtida com J&K. Por isso mesmo é que a lista de padrões selecionados oferece bom equilíbrio entre ambos, com a maior precisão dentre os conjuntos comparados e uma cobertura ligeiramente inferior à de f > 10. Dentre os padrões selecionados, destacam-se (a) problemas com relação à presença de conjunções, como em young and aged subject que, na verdade, corresponde a dois termos independentes, (b) a presença de padrões contendo numerais (interleukin 2) e (c) a ausência do padrão N-P-N, que é considerado por J&K como bom indicador de TMP. É interessante verificar que, quando a terminologia padrão do conjunto de treinamento é usada para avaliar o conjunto de teste, os padrões reconhecem cerca de duas vezes mais termos do que a anotação existente. Dessa forma, a cobertura chega a 200%, enquanto os valores de precisão oscilam entre 65% e 95%, mostrando que uma anotação mais extensa poderia melhorar os resultados.

A tabela B.3 mostra a precisão média de cada uma das medidas de associação, bem como a precisão de base (proporção inicial de candidatos positivos), para os fragmentos de treinamento e de teste. De acordo com os valores apresentados, apesar de assumir palavras com distribuição normal, o teste t possui boa acurácia em ambos os corpora, como é também mostrado por Evert and Krenn (2005), Manning and Schütze (1999).

Tabela B.3: Precisão média (%) das MA estatísticas.

                          Genia                               Web
          prob     PMI      t        Dice     prob     PMI      t        Dice     Base
  train   46,16    40,08    52,90    38,84    35,77    41,02    45,13    37,85    37,15
  test    38,83    17,51    31,36    22,78    41,49    20,52    43,21    43,74    24,16

Tabela B.4: Desempenho dos algoritmos de aprendizado.

              J48      Bayes(TAN)   Bayes(SA)   MLP      VP       SVM(pol.)   SVM(rad.)
  Precisão    47.9%    48.6%        52.4%       27.3%    36.3%    34.9%       52.6%
  Cobertura   19.7%    32.1%        28.9%       0.5%     63.8%    60.1%       35.4%

A probabilidade simples (prob) também parece ser um bom indicador de TMPs. A pior das medidas parece ser PMI, cuja precisão média é inferior à linha de base para Genia-test, ao contrário do que acredita Smadja (1993). No entanto, Evert and Krenn (2005) chegam à mesma conclusão, afirmando ainda que t e a verossimilhança logarítmica (baseada em tabela de contingência) costumam dar bons resultados. Um algoritmo simples de seleção de atributos confirma esses resultados, indicando que as medidas mais relevantes são probGenia, tGenia e tWeb. Algumas incoerências foram observadas no comportamento das MAs sobre o corpus Web, pois as medidas parecem ser ruins em Genia-train, mas boas em Genia-test. A análise do vocabulário de cada um dos corpora (figura B.1) mostra que existe uma possível razão para o comportamento aparentemente arbitrário das medidas da Web. A API de Yahoo! retorna para as 735 palavras mais freqüentes um número constante de ocorrências (2.147.483.647), de forma que se torna impossível diferenciar palavras como the e green (que aparecem, respectivamente, 22.446 vezes e 1 vez no corpus Genia). Essa situação patológica, fruto de uma aproximação grosseira do mecanismo de busca, gera a forma “truncada” do gráfico e certamente prejudica as medidas de associação. Dessa forma, a correlação de Pearson entre os corpora é de 0,12, enquanto o coeficiente τ de Kendall é de 0,21, contrariando o que dizem Zhang et al. (2006). Trabalhos futuros devem incluir maneiras de aproximar as freqüências com base nos pontos fortes dos dois corpora, visto que Genia tem dificuldades com eventos esparsos, enquanto a Web não consegue diferenciar palavras muito freqüentes.

O desempenho dos algoritmos de aprendizado de máquina sobre Genia-test pode ser visto na tabela B.4. Apenas a acurácia da classe MWT é mostrada, pois, como os dados não são balanceados, a classe NonMWT possui acurácia superior a 80% para todos os algoritmos. A árvore de decisão J48 possui um modelo de dados facilmente interpretável (árvore com 484 nodos) que classifica corretamente 75% das instâncias. Os atributos mais próximos da raiz (mais importantes) são os testes t nos dois corpora, enquanto o menos importante parece ser o padrão morfológico, geralmente nos nodos que precedem as folhas da árvore. O problema desse e de outros algoritmos é considerar que ambas as classes possuem o mesmo peso, levando a um baixo desempenho sobre a classe MWT.

[Figura: histograma log-log do vocabulário (freqüência em função do posto) de Genia-train e da Web.]

Figura B.1: Histograma do vocabulário de Genia-train e da Web.

Tabela B.5: Avaliação comparativa de SVM com diferentes limiares e algoritmo Xtract.

                          |                 Genia-test                     | Genia-train
                          | Sem limiar    fGenia > 1   fGenia > 5   Xtract | Xtract
  n.º de TMPs extraídos   | 763           739          174          70     | 1.558
  Precisão                | 52,56%        56,83%       74,14%       70%    | 66,81%
  Cobertura               | 19,96%        20,91%       6,42%        2,44%  | 3,84%

Nesse sentido, o aprendizado bayesiano é mais eficiente, sobretudo com a busca por árvore de dispersão mínima (TAN) ao invés de arrefecimento simulado (SA). A cobertura de ambos ainda é decepcionante e os métodos sofrem do mesmo problema que J48, isto é, dão igual importância a ambas as classes. A rede neural multicamadas (MLP) com 33 neurônios ocultos tem um desempenho muito ruim: captura apenas 5% dos candidatos positivos. O algoritmo Voted Perceptron (VP) tem desempenho surpreendente, com larga cobertura dos candidatos positivos; as razões para tal resultado serão investigadas em trabalhos futuros. Os melhores resultados foram obtidos com máquinas de vetor de suporte (SVM) com núcleo polinomial e radial. Cada um deles privilegia um dos aspectos: o primeiro tem boa cobertura, enquanto o último tem uma precisão de 52,6%. A escolha de um deles depende da aplicação: sendo custosa, por exemplo, a construção de um dicionário, pode-se privilegiar a precisão para diminuir o esforço manual; uma aplicação de PLN, no entanto, pode conviver com ruído desde que a cobertura seja suficientemente ampla. Para as próximas comparações, usou-se o SVM com núcleo radial, que retorna sobre o conjunto de teste 763 instâncias positivas, das quais 362 são falsas. Se o SVM não fosse aplicado, apenas 24% das instâncias seriam verdadeiras positivas.

Não obstante os valores absolutos de precisão e principalmente de cobertura parecerem baixos, é comparando-os com o método de extração de unidades multi-palavras Xtract que se observa o quanto o método sugerido é superior. Na tabela B.5, compara-se o desempenho do SVM usando diversos limiares de freqüência sobre os candidatos, de forma a aumentar a precisão e diminuir a cobertura. A configuração que elimina os hapax legomena ( fGenia > 1) gera cobertura e precisão superiores, mostrando que esses candidatos devem com efeito ser removidos da lista. No entanto, com um limiar alto ( fGenia > 5) a cobertura cai drasticamente. O método por SVMs utiliza as informações de Genia-train para extrair TMPs de Genia-test. Por isso, compara-se não somente o SVM com Xtract sobre o conjunto de teste, mas também com Xtract sobre o conjunto de treinamento, maior e portanto com freqüências mais confiáveis e menos esparsas. Em ambos os casos, porém, Xtract parece ser muito conservador, gerando uma lista de candidatos com alta precisão e cobertura extremamente baixa. Observa-se que, mesmo na configuração mais restritiva, o SVM obtém um desempenho global superior ao do método Xtract, provando que se obteve, com esse trabalho, um ganho real sobre um algoritmo padrão de extração de TMPs.

Finalmente, por razões pragmáticas de tempo de processamento e de espaço de armazenamento disponíveis, não foi possível concluir a avaliação da integração TMP-TA. A idéia, para essa etapa, é que seja fornecida aos avaliadores (falantes nativos) uma lista com as traduções geradas pelos diferentes métodos (wws e head, com TMPs extraídos e com a terminologia padrão). As traduções devem, então, ser ordenadas pela sua fluidez, visto que sua adequação será certamente inferior, já que o método realiza a simplificação da entrada. Uma interface que possibilite a fácil visualização dos termos originais e a colaboração na sua tradução também será criada e submetida a técnicas padronizadas de avaliação de interfaces.

Tabela B.6: Exemplos de TMPs extraídos de Genia-test (asteriscos indicam erros).

  TMP extraído            Padrão     | TMP extraído            Padrão
  t cell                  N-N        | *thromboxane receptor   N-N
  nf kappa b              N-N        | peripheral blood        A-N
  kappa b                 N-N        | nf kappa b              N-N-N
  cell line               N-N        | human t                 A-N
  transcription factor    N-N        | receptor alpha          N-N
  i kappa b               N-N-N      | monocytic cell          A-N
  *i kappa                N-N        | binding activity        N-N
  *kappa b alpha          N-N-N      | stat protein            N-N
  *b alpha                N-N        | *receptor gene          N-N
  i kappa b alpha         N-N-N-N    | *rap1 protein

B.5 Conclusões

O trabalho aqui apresentado mostra uma avaliação sistemática de técnicas de extração terminológica sobre um corpus biomédico. Primeiro, discutiram-se hipóteses e noções básicas de expressões e termos multi-palavras, bem como sua caracterização por teorias lingüísticas. Diversos trabalhos relacionados abordam línguas e tipos específicos de expressões multi-palavras. Algumas tarefas e métodos foram discutidos, sendo que a maioria deles é baseada em bigramas, ponto em que se acredita que o trabalho aqui apresentado é superior, pois é capaz de extrair TMPs com tamanho n qualquer. Resumidamente, o método introduzido nesse trabalho consiste em quatro etapas: pré-processamento, extração de candidatos através de padrões morfológicos selecionados, filtragem de candidatos por medidas de associação juntamente com aprendizado de máquina e, por último, integração com sistemas de TA estatísticos. A tabela B.6 mostra algumas das expressões extraídas com o SVM e nGenia > 5. Elas são, em sua maioria, nomes compostos, com predominância de bigramas. Os exemplos negativos parecem fazer parte de uma expressão maior ou conter uma expressão menor: provavelmente uma avaliação mais fina, com classes como Contains e PartOf, poderia verificar essa hipótese.

É importante sublinhar o fato de que a lista de termos extraída não vem, de maneira alguma, substituir ou melhorar a anotação previamente existente no corpus. O objetivo desse trabalho não é gerar uma nova terminologia para o corpus Genia, pois este já possui tal informação. Ao contrário, essa anotação de termos foi usada para alcançar o objetivo inicial de avaliar (automaticamente) métodos de extração de unidades multi-palavras sobre um domínio específico.

Idéias de trabalhos futuros incluem, evidentemente, a conclusão da avaliação sugerida para a integração entre terminologia e tradução automática, a fim de estabelecer a validade da metodologia aqui proposta. Além disso, quer-se investigar novas formas de combinar freqüências de diversos corpora. Um dos pontos fracos da avaliação é que ela é totalmente baseada em um só corpus; portanto, pretende-se também expandi-la para outros domínios e/ou línguas.

Considera-se que o modelo e os atributos de filtragem aqui discutidos são totalmente independentes de língua e de domínio. Conseqüentemente, apenas as etapas de pré-processamento e seleção de padrões devem ser adaptadas a um novo contexto de aplicação. Dessa forma, poder-se-á adaptar facilmente ferramentas de PLN para processar o léxico especializado dos diferentes domínios, auxiliando na difusão do conhecimento.

APPENDIX C PART-OF-SPEECH PATTERNS

Table C.1: Complete list of the 57 POS-patterns used to extract MWT candidates from the Genia corpus, with occurrence examples.

Pattern            P(p)   R(p)   ∑R(p)   Example
N-N                42%    22%    22%     zinc sulfate
A-N                31%    19%    41%     malignant tissue
N-N-N              43%    10%    51%     zinc finger type
A-N-N              40%    9%     60%     lymphoblastoid cell line
A-A-N              32%    4%     64%     xenoreactive natural antibody
A-N-N-N            38%    3%     67%     cyclic adenosine monophosphate signaling
N-N-N-N            39%    3%     70%     dna binding protein complex
N-A-N              36%    2%     72%     colostrum inhibitory factor
A-A-N-N            39%    2%     74%     fetal nucleated blood cell
N-A-N-N            38%    1%     74%     kappa light chain gene
N-NUM              32%    1%     75%     nucleotide 46
A-N-N-N-N          41%    1%     76%     whole cell dexamethasone binding assay
A-CC-A-N           30%    1%     77%     young and aged subject
N-N-N-N-N          36%    1%     77%     lymphocyte glucocorticoid receptor binding parameter
N-N-A-N            35%    1%     78%     laser scanning confocal microscopy
A-A-A-N            35%    0%     78%     human pulmonary alveolar macrophage
A-A-N-N-N          40%    0%     78%     human peripheral blood mononuclear cell
A-N-A-N            30%    0%     79%     hematopoietic cell specific molecule
N-A-N-N-N          42%    0%     79%     immunoglobulin heavy chain gene cluster
N-A-A-N            36%    0%     79%     vzv immune human donor
N-N-NUM-N          33%    0%     80%     t beta 5 element
A-CC-A-N-N         38%    0%     80%     thymic and peripheral t lymphocyte
N-NUM-N-N          31%    0%     80%     zeta 2 globin promoter
A-V-N              31%    0%     80%     specific binding complex
FW-FW              73%    0%     80%     treponema pallidum
N-N-A-N-N          36%    0%     81%     glucocorticoid receptor positive cell population
A-N-A-N-N          36%    0%     81%     germinal center follicular b cell
A-N-N-N-N-N        39%    0%     81%     human hepatitis b virus x protein
A-A-A-N-N          33%    0%     81%     human myeloid nuclear differentiation antigen
N-A-A-N-N          40%    0%     81%     ig variable heavy chain sequence
N-N-N-A-N          37%    0%     81%     jurkat t cell complementary dna
A-N-N-A-N          41%    0%     81%     nuclear factor kappab inducing kinase
FW-FW-N-N          44%    0%     82%     in vivo footprint analysis
N-N-N-N-N-N        38%    0%     82%     reverse transcription polymerase chain reaction assay
A-A-N-A-N          44%    0%     82%     unfractionated peripheral blood mononuclear cell
A-CC-N-N           33%    0%     82%     biochemical and mutagenesis analysis

Table C.1 (continued).

Pattern            P(p)   R(p)   ∑R(p)   Example
A-N-N-NUM-N        45%    0%     82%     early growth response 1 gene
A-N-NUM-N          31%    0%     82%     human chr 20 centromere
N-NUM-P-NUM        38%    0%     82%     nucleotide +13 to +15
N-N-NUM-N-N        33%    0%     82%     rprl footprint ii repressor site
A-A-N-N-N-N        33%    0%     82%     acute lymphoblastic leukemia cell line reh
A-N-N-N-NUM-N      66%    0%     82%     human immunodeficiency virus type 1 enhancer
N-NUM-N-N-N        46%    0%     82%     interleukin 2 receptor alpha subunit
A-N-N-N-N-N-N      73%    0%     82%     nuclear factor kappa b transcription factor family
A-CC-N-N-N         56%    0%     83%     human and mouse intraspecy hybrid
A-N-A-A-N          34%    0%     83%     human t acute lymphoblastic leukemia
N-N-N-N-NUM        36%    0%     83%     igg fc receptor type i
A-NUM-N-N          42%    0%     83%     last 13 amino acid
A-V-N-N            37%    0%     83%     normal activated b cell
N-N-A-N-N-N        42%    0%     83%     cell cycle regulated histone h4 gene
N-A-N-N-N-N        38%    0%     83%     adipocyte fatty acid binding protein gene
A-N-NUM-N-N        48%    0%     83%     human class ii transplantation gene
A-N-N-N-NUM        48%    0%     83%     vascular cell adhesion molecule 1
N-A-N-NUM          39%    0%     83%     p130 phosphorylated form 3
A-A-A-N-N-N        40%    0%     83%     human acute lymphoblastic t cell line
A-CC-A-N-N-N       44%    0%     83%     murine and human tumor cell line
A-N-N-N-N-NUM      85%    0%     83%     monocytic cell line mono mac 6
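The following minimal sketch (illustrative only; function and variable names are not taken from the actual scripts) shows how candidate n-grams can be matched against such POS patterns:

    # Select every n-gram whose POS-tag sequence matches one of the selected patterns.
    PATTERNS = {"N-N", "A-N", "N-N-N", "A-N-N"}   # subset of Table C.1, shown for brevity

    def extract_candidates(tagged_sentence, patterns=PATTERNS, max_len=7):
        """tagged_sentence: list of (word, simplified_tag) pairs."""
        candidates = []
        for i in range(len(tagged_sentence)):
            for j in range(i + 2, min(i + max_len, len(tagged_sentence)) + 1):
                ngram = tagged_sentence[i:j]
                pattern = "-".join(tag for _, tag in ngram)
                if pattern in patterns:
                    candidates.append(" ".join(w for w, _ in ngram))
        return candidates

    sent = [("zinc", "N"), ("finger", "N"), ("type", "N"), ("of", "P"), ("protein", "N")]
    print(extract_candidates(sent))   # ['zinc finger', 'zinc finger type', 'finger type']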

Table C.2: Simplification rules used to convert the Genia POS tag set to a set of customised generic tags.

Simple tag   Genia tags                                                    Morphosyntactic category
N            NNPS, NNP, NNS, NN, NPS, NP                                   Nouns
A            JJR, JJS, JJ                                                  Adjectives
V            VBD, VBG, VBN, VBP, VBZ, VVD, VVG, VVN, VVP, VVZ,             Verbs
             VHD, VHG, VHN, VHP, VHZ, VV, VB, VH, MD
R            RBR, RBS, WRB, RB                                             Adverbs
P            IN, TO, RP, EX                                                Prepositions and particles
DT           PDT, WDT, DT, CT, XT                                          Determiners
PP           PRP, PRP$, PP, PP$, WP, WP$, POS                              Pronouns
CC           CC, CCS                                                       Conjunctions
PCT          No annotation                                                 General punctuation
FW           FW                                                            Foreign word
UKN          *                                                             Unknown tag
NUM          CD                                                            Numbers
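As an illustration (the dictionary below is only a partial, hypothetical rendering of Table C.2, not the actual conversion script), the simplification can be implemented as a plain mapping:

    SIMPLIFY = {
        "NN": "N", "NNS": "N", "NNP": "N", "NNPS": "N", "NP": "N", "NPS": "N",
        "JJ": "A", "JJR": "A", "JJS": "A",
        "CD": "NUM",
        # ... remaining Genia tags follow Table C.2
    }

    def simplify(tag):
        # Any tag not listed in the table is mapped to UKN
        return SIMPLIFY.get(tag, "UKN")

    print([simplify(t) for t in ("NN", "JJR", "CD", "XYZ")])   # ['N', 'A', 'NUM', 'UKN']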

APPENDIX D PRE-PROCESSING ALGORITHMS

Algorithm 1 Heuristics to detect acronyms between parentheses.
 1: procedure ACRONYM(a, C)
Require: a acronym candidate, C sequence of context words preceding a
Require: POS(a) ≠ NUM, 2 ≤ len(a) ≤ 8
 2:   NORMALISE(a)                          ▷ Eliminate hyphens, dashes, spaces, …
 3:   E ← ∅
 4:   for c ∈ C do
 5:     if c starts with the initial of a then
 6:       E ← E ∪ {c}                       ▷ Possible extended versions of a
 7:     end if
 8:   end for
 9:   return argmax_{e ∈ E} WEIGHTED-LCS(e, a)
10: end procedure

The WEIGHTED-LCS procedure calculates the weighted Longest Common Subsequence between the acronym and the extended candidate. Different positive weights are assigned to M for complete matches (+15) and initial matches (+13), a negative weight for partial matches (−11) and a larger negative weight for non-matches (−13). Symmetric negative weights are given to I and D, for insertions and deletions (−11). If all matching scores are below a threshold (−9), we ignore the acronym because no extended version was considered strong enough. The dynamic programming matrix LCS is filled according to the following recurrence equation:

\[
LCS(i, j) = \max\big(\, LCS(i-1, j-1) + M,\; LCS(i-1, j) + D,\; LCS(i, j-1) + I \,\big)
\]
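As a rough, runnable illustration of this weighted alignment (character-level and simplified: it does not distinguish complete, initial and partial matches as the original heuristics do, and boundary cells are initialised to zero), one could write:

    def weighted_lcs(extended, acronym, M=15, I=-11, D=-11):
        """Dynamic-programming alignment score between an extended form and an acronym."""
        m, n = len(extended), len(acronym)
        lcs = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                match = M if extended[i - 1].lower() == acronym[j - 1].lower() else -13
                lcs[i][j] = max(lcs[i - 1][j - 1] + match,
                                lcs[i - 1][j] + D,
                                lcs[i][j - 1] + I)
        return lcs[m][n]

    print(weighted_lcs("alzheimer s disease", "ad"), weighted_lcs("promoter", "ad"))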

Algorithm 2 Heuristics to find the singular form of a noun.
 1: procedure SINGULAR(w)
Require: w ends with “s”, len(w) > 2
 2:   W_sing ← ∅
 3:   if w ends with “ss” or “is” or “us” then
 4:     return w                                  ▷ access, analysis, virus
 5:   else if w ends with “ies” then
 6:     W_sing ← W_sing ∪ w − “ies” + “y”         ▷ bodies → body
 7:   else if w ends with “oes” then
 8:     W_sing ← W_sing ∪ w − “oes” + “o”         ▷ heroes → hero
 9:   else if w ends with “es” then
10:     W_sing ← W_sing ∪ w − “s”                 ▷ diseases → disease
11:     if w ends with “xes” or “ches” or “shes” or “ses” then
12:       W_sing ← W_sing ∪ w − “es”              ▷ boxes, witches, dishes
13:     end if
14:     if w ends with “ses” then
15:       W_sing ← W_sing ∪ w − “es” + “is”       ▷ analyses → analysis
16:     end if
17:   else
18:     W_sing ← W_sing ∪ w − “s”                 ▷ receptors → receptor
19:   end if
20:   return argmax_{w_sing ∈ W_sing} f_Web(w_sing)   ▷ Search the Web for w_sing
21: end procedure
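A compact Python rendering of Algorithm 2 could look as follows; the Web-frequency lookup is stubbed out here, whereas the original implementation queried the Yahoo! search API:

    def singular(w, web_freq=lambda s: 0):
        if len(w) <= 2 or not w.endswith("s"):
            return w
        if w.endswith(("ss", "is", "us")):
            return w                          # access, analysis, virus
        candidates = set()
        if w.endswith("ies"):
            candidates.add(w[:-3] + "y")      # bodies -> body
        elif w.endswith("oes"):
            candidates.add(w[:-3] + "o")      # heroes -> hero
        elif w.endswith("es"):
            candidates.add(w[:-1])            # diseases -> disease
            if w.endswith(("xes", "ches", "shes", "ses")):
                candidates.add(w[:-2])        # boxes, witches, dishes
            if w.endswith("ses"):
                candidates.add(w[:-2] + "is") # analyses -> analysis
        else:
            candidates.add(w[:-1])            # receptors -> receptor
        return max(candidates, key=web_freq)  # pick the form most frequent on the Web

    toy_freq = {"virus": 100}                 # toy stand-in for Web counts
    print(singular("viruses", lambda s: toy_freq.get(s, 0)),
          singular("receptors"), singular("analysis"))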

APPENDIX E IMPLEMENTATION

Below, we present the output of running the runAll.sh script of our system. It takes the Genia corpus, transforms it into a generic format and then works on simplified XML and TAB-separated files to extract MWT candidate tables and evaluate several aspects discussed in our work. This script calls auxiliary Python and awk scripts that actually perform MWT extraction and evaluation.

0) Obtaining corpus...
* File genia.xml already present!
* File gpml.merged.dtd already present!
1) PRE-PROCESSING
1.1) Cleaning errors...
Raw corpus statistics
* 2000 abstracts
* 18519 sentences
* 490752 tokens
* 97876 terms
* 55487 multiword terms
* 5.28516658567 terms per sentence
* 2.99622009828 multiword terms per sentence
* 2.05631615514 words per term
* 2.86330131382 words per multiword term
1.2) XML format conversion...
1.3) Simplification...
* 3152 sentences have tagging problems (UKN tag)!
* Simplifying tags...
* 3071 tagging problems corrected.
* 81 sentences discarded (tagging problems not corrected).
* Removing redundant MWE tags: iteration 0
* Removing redundant MWE tags: iteration 1
* Removing redundant MWE tags: iteration 2
* Removing redundant MWE tags: FIXED POINT
1.4) Acronym processing...
* Extracting acronyms from text...
* Generating acronym-free version of corpus...
* Removing redundant MWE tags: iteration 0
* Removing redundant MWE tags: iteration 1
* Removing redundant MWE tags: FIXED POINT
* Finalizing pre-processing: generated file "genia-pp.xml"
2) POS PATTERN MATCHING
* Generating test and training splits of pre-processed corpus
2.1) Extract GS multiword terminology from train set...
* Extract GS multiword terminology from test set...
2.2) Tentative patterns precision and recall...
* Extracting tentative candidates based on most frequent patterns
* Selecting max-recall patterns with precision above 0.3...
2.3) Extract selected patterns...

* Train candidates
* Test candidates
* Finalizing pattern extraction: generated files:
  "mwes/cand-select-uniq.txt" and "mwes/cand-select-test-uniq.txt"
3) FREQUENCIES
3.1) From Genia corpus
* Obtaining Genia marginal frequencies (train)...
* Obtaining Genia marginal frequencies (test)...
3.2) From the Web
* Obtaining Web marginal and global frequencies (train)...
* Obtaining Web marginal and global frequencies (test)...
* Correcting frequency incoherencies (train)...
* Correcting frequency incoherencies (test)...
4) MEASURES
* Calculating association measures (train)...
* Calculating association measures (test)...
* Converting files format ".sts" into ".arff"
* Ready for WEKA: "cand.arff" and "cand-test.arff"
5) EVALUATIONS
5.1) Evaluate retokenising and lemmatising...
* Get new GS (WITHOUT retokenisation and lemmatising)...
  - 29513 MWTs
* Get new GS (WITH retokenisation and lemmatising)...
  - 28340 MWTs
* Extracting candidates (WITHOUT retokenisation and lemmatising)...
  - 66817 MWT candidates
  - Precision .1556 Recall .3522 F-measure .2158
  - Hapax legomena 51169
  - Average frequency of top-100 candidates 189.98
* Extracting candidates (WITH retokenisation and lemmatising)...
  - 63210 MWT candidates
  - Precision .3724 Recall .8306 F-measure .5142
  - Hapax legomena 47770
  - Average frequency of top-100 candidates 205.38
5.2) Evaluate acronym detection and removal...
* Get new GS (WITHOUT acronym detection)...
  - 28601 MWTs
* Get new GS (WITH acronym detection)...
  - 28340 MWTs
* Extracting candidates (WITHOUT acronym detection)...
  - 61401 MWT candidates
  - Precision .3753 Recall .8058 F-measure .5120
  - Hapax legomena 46330
  - Average frequency of top-100 candidates 205.05
* Extracting candidates (WITH acronym detection)...
  - 63210 MWT candidates
  - Precision .3724 Recall .8306 F-measure .5142
  - Hapax legomena 47770
  - Average frequency of top-100 candidates 205.38
5.3) Evaluate Association Measures...
* Evaluation on test set:
  Average precision (%):
  prob[Genia]  PMI[Genia]  t[Genia]  Dice[Genia]  prob[Web]  PMI[Web]  t[Web]  Dice[Web]
  38.83        17.51       31.36     22.78        41.49      20.52     43.21   43.74
  (baseline: 24.16%)
* Evaluation on train set:

  Average precision (%):
  prob[Genia]  PMI[Genia]  t[Genia]  Dice[Genia]  prob[Web]  PMI[Web]  t[Web]  Dice[Web]
  46.16        40.08       52.90     38.84        35.77      41.02     45.13   37.85
  (baseline: 37.15%)
* Resulting graphics in "eval/am"
5.4) Evaluate pattern selection...
5.5) Compare with Xtract...
* Calculating Xtract MWEs (train)...
  - Xtract Precision .6681 Xtract Recall .0384
* Calculating Xtract MWEs (test)...
  - Xtract Precision .7000 Xtract Recall .0243
5.6) Compare corpora...
  Corpus1 x Corpus2
  pearson correlation results 0.121811 0.122419 0.000000
  spearman score results 294136053760.000000 -33.836575
  kendall score results 0.208740 36.579624 0.000000
5.7) Evaluate thresholding...
* PLEASE CORRECT MANUALLY THE .ARFF FILES AND RUN WEKA!
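Step 5.6 above compares the Genia and Web frequency distributions. A minimal sketch of such a comparison, using scipy for illustration only (the actual script may rely on a different implementation), could be:

    from scipy.stats import pearsonr, kendalltau

    # Toy counts (illustrative); the constant 2147483647 is the value returned by
    # the Yahoo! API for the most frequent words, as discussed in the evaluation.
    genia_freq = {"the": 22446, "cell": 15000, "green": 1}
    web_freq   = {"the": 2147483647, "cell": 2147483647, "green": 150000000}

    words = sorted(genia_freq)
    x = [genia_freq[w] for w in words]
    y = [web_freq[w] for w in words]

    print("Pearson r  :", pearsonr(x, y)[0])
    print("Kendall tau:", kendalltau(x, y)[0])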