INSTITUTO POLITÉCNICO NACIONAL CENTRO DE INVESTIGACIÓN EN COMPUTACIÓN NATURAL LANGUAGE PROCESSING LABORATORY

WORD SENSE DISAMBIGUATION AND RECOGNIZING TEXTUAL ENTAILMENT WITH STATISTICAL METHODS

A THESIS IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE

BY: Miguel Angel Ríos Gaona

ADVISORS: Dr. Alexander Gelbukh Dr. Sivaji Bandyopadhyay

MEXICO, D.F., 2010

Resumen

En esta tesis estudiamos métodos estadísticos aplicados a la solución de tareas del procesamiento del lenguaje natural. Estas tareas son: la desambiguación de sentidos de las palabras y el reconocimiento de implicación textual. Primero presentamos una nueva medida para la asignación del sentido a una palabra ambigua, basada en una modificación al algoritmo simple de Lesk. Usamos la coocurrencia de las palabras para seleccionar el sentido correcto de una palabra; para la medida se utilizó información estadística encontrada en la Web. En los datos de SemCor nuestro método tiene una cobertura del 100%, lo que significa que siempre da una respuesta, y una precisión de 0.47. En los datos de Senseval-2, nuestra variante del método de Lesk tiene una precisión de 0.45 y superó a varios de los métodos basados en Lesk, alcanzando también una cobertura del 100%. Finalmente propusimos una nueva medida no simétrica de causa-efecto aplicada a la tarea de reconocimiento de implicación textual. En primer lugar se realizaron búsquedas en un corpus de oraciones que contienen el marcador de discurso “porque”. Con estas oraciones se creó un conjunto de pares de causa y efecto. El reconocimiento de la implicación se basa en medir la relación causa-efecto entre el texto y la hipótesis utilizando las frecuencias relativas de las palabras de los pares de causa-efecto. En los resultados superamos el baseline en los tres corpus de prueba de los retos PASCAL de Reconocimiento de Implicación Textual (RTE). La medida mostró ser buena para determinar la clase “verdadero” y menos precisa en la clase “falso”.

Abstract

We study statistical methods based on the use of information retrieved from the Web in an attempt to solve two Natural Language Processing tasks: Word Sense Disambiguation and Recognizing Textual Entailment. For Word Sense Disambiguation, we present a measure for semantic relatedness based on the simple Lesk algorithm. We measure a kind of mutual information between the gloss of each sense of the word and the context of the word: namely, the score of a sense s is the frequency (as the number of webpages found by Google) of the context where the word is substituted by the gloss of the sense s, divided by the frequency of the gloss itself (again, as the number of webpages found by Google). On the SemCor dataset our method has a coverage of 100% (i.e., our method always gives some answer) and an accuracy of 0.47. On the Senseval-2 dataset, our method has an accuracy of 0.45, which outperforms some other Lesk-based methods (again with 100% coverage). For Recognizing Textual Entailment, we propose a new cause-effect non-symmetric measure. First, we search over a large corpus for sentences which contain the discourse marker “because” and create a database of cause-effect pairs. The entailment recognition is based on measuring the probability of a cause-effect relation between the Text and the Hypothesis using the relative frequencies of words from the cause-effect pairs. Our results outperform the baseline system over the three test sets of the PASCAL Recognizing Textual Entailment Challenges (RTE). The measure is good at determining the “true” class, while it is less accurate on the “false” class.

Acknowledgements

I would like to thank my advisors, Dr. Alexander Gelbukh and Dr. Sivaji Bandyopadhyay, for their valuable guidance and advice. I am also grateful to my family and friends for their understanding and support.

Table of Contents

RESUMEN
ABSTRACT
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
1. INTRODUCTION
1.1. Research Problems and Why they are Worthwhile Studying
1.2. Research Methods in Brief
1.3. Goal of the Thesis
1.4. Structure of the Thesis
2. RELATED BIBLIOGRAPHY
2.1. Approaches to Word Sense Disambiguation
2.1.1. Supervised Methods
2.1.2. Unsupervised Methods
2.2. Approaches to Recognizing Textual Entailment
2.2.1. Approaches by Language Levels
2.2.1.1. Lexical Level
2.2.1.2. Syntactic Level
2.2.1.3. Semantic Level
2.2.2. Machine Learning Approaches
3. WORD SENSE DISAMBIGUATION
3.1. Theoretical Framework
3.2. Proposed Method
3.3. Experimental Setting
3.3.1. Data Sets
3.4. Experimental Results
3.5. Conclusion
4. RECOGNIZING TEXTUAL ENTAILMENT
4.1. Theoretical Framework
4.2. Proposed Methods
4.2.1. Causal Non-symmetric Measure
4.2.2. Experimental Setting
4.2.3. Experimental Results
4.2.3.1. Experiment 1
4.2.3.2. Experiment 2
4.2.4. Symmetric and Non-symmetric Meta-classifier
4.2.5. Experimental Design
4.2.6. Experimental Results
4.2.6.1. Experiment 1
4.2.6.2. Experiment 2
4.3. Conclusion
5. CONCLUSIONS
5.1. Contributions
5.2. Publications
REFERENCES

List of Tables

Table 2.1: T-H pairs examples
Table 3.1: Simplified Lesk algorithm
Table 3.2: New statistical measure
Table 3.3: Comparison with previous work
Table 3.4: Comparison with Senseval-2 unsupervised methods
Table 3.5: Comparison with SemEval unsupervised methods
Table 4.1: Contingency matrix
Table 4.2: Non-symmetric similarity measure
Table 4.3: Entailment decision 1
Table 4.4: RTE-1 contingency matrix results
Table 4.5: RTE-1 evaluation measures
Table 4.6: RTE-1 comparison with previous results
Table 4.7: RTE-2 contingency matrix
Table 4.8: RTE-2 evaluation measures
Table 4.9: RTE-2 comparison with previous results
Table 4.10: RTE-3 contingency matrix
Table 4.11: RTE-3 evaluation measures
Table 4.12: RTE-3 comparison with previous results
Table 4.13: Entailment decision 2
Table 4.14: RTE evaluation measures with entailment decision 2 and a threshold of 0.1
Table 4.15: RTE evaluation measures with entailment decision 2 and a threshold of 0.2
Table 4.16: RTE evaluation measures with entailment decision 2 and a threshold of 0.3
Table 4.17: RTE-1 contingency matrix
Table 4.18: RTE-1 comparison with previous results
Table 4.19: RTE-2 contingency table
Table 4.20: RTE-2 evaluation with previous results
Table 4.21: RTE-3 contingency matrix
Table 4.22: RTE-3 comparison with previous results
Table 4.23: RTE-1 contingency table
Table 4.24: RTE-1 evaluation measures
Table 4.25: RTE-2 contingency matrix
Table 4.26: RTE-2 evaluation measures
Table 4.27: RTE-3 contingency table
Table 4.28: RTE-3 evaluation measures
Table 4.29: RTE-1 contingency table
Table 4.30: RTE-1 evaluation measures
Table 4.31: RTE-2 contingency matrix
Table 4.32: RTE-2 evaluation measures
Table 4.33: RTE-3 contingency matrix
Table 4.34: RTE-3 evaluation measures

List of Figures

Figure 3.1: General data flow for the WSD method
Figure 4.1: Cause-effect graph
Figure 4.2: General data flow of our system
Figure 4.3: RTE-1 comparison with previous results by tasks
Figure 4.4: RTE-2 comparison with previous results by tasks
Figure 4.5: RTE-3 comparison with previous results by task
Figure 4.6: RTE-1 comparison with previous results by tasks
Figure 4.7: RTE-2 comparison with previous results by tasks
Figure 4.8: RTE-3 comparison with previous results by tasks
Figure 4.9: RTE-1 meta-classifier coverage
Figure 4.10: RTE-1 comparison with previous results
Figure 4.11: RTE-2 meta-classifier coverage
Figure 4.12: RTE-2 comparison with previous results by tasks
Figure 4.13: RTE-3 meta-classifier coverage
Figure 4.14: RTE-3 comparison with previous results by tasks
Figure 4.15: RTE-1 meta-classifier coverage
Figure 4.16: RTE-1 comparison with previous results
Figure 4.17: RTE-2 meta-classifier coverage
Figure 4.18: RTE-2 comparison with previous results by tasks
Figure 4.19: RTE-3 meta-classifier coverage
Figure 4.20: RTE-3 comparison with previous results by tasks

1. Introduction

In our time most information is encoded in the form of natural language text. Newspapers, magazines, radio, TV and the World Wide Web (WWW) are examples of the most complex information medium in our world: human language. With these resources comes the problem of finding a specific datum in millions of documents. A human reader would take many years to do this task, whereas computers can process the millions of available documents (less accurately than a human) in little time. The idea of giving computers the ability to process human language is as old as the idea of computers themselves (Manning and Schütze, 1999). The goal of Natural Language Processing (NLP) is to design and build software that will analyze, understand, and generate the languages that humans use naturally. This goal is not easy to reach. "Understanding" language means, among other things, knowing what concepts a word or phrase stands for and knowing how to link those concepts together in a meaningful way. Natural language is the system that is easiest for humans to learn and use, and the hardest for a computer to master. Long after machines have proven capable of solving complex mathematical problems with speed and grace, they still fail to master the basics of our spoken and written languages.

Research in NLP has been going on for several decades, dating back to the late 1940s. Machine Translation (MT), the task of translating texts from one natural language to another, was the first computer-based application related to NLP. Early work in MT took the simplistic view that the only differences between languages resided in their vocabularies and the permitted word orders. Systems developed from this perspective simply used dictionary lookup to find appropriate words for translation and reordered the words after translation to fit the word-order rules of the target language, without taking into account the lexical ambiguity inherent in natural language. This produced poor results.

Natural language processing provides both theory and implementations for a range of applications. In fact, any application that utilizes text is a candidate for NLP. The most frequent applications utilizing NLP include the following:

• Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies a user's information need from within large collections (usually stored on computers).

• Information Extraction (IE) focuses on the recognition, tagging, and extraction of certain key elements of information, e.g. persons, companies, locations, and organizations, from large collections of text into a structured representation. These extractions can then be utilized for a range of applications including question answering, visualization, and data mining.

• Question Answering (QA), in contrast to Information Retrieval, which provides a list of potentially relevant documents in response to a user's query, provides the user with either the text of the answer itself or answer-providing passages.

• Summarization (SUM) reduces a larger text into a shorter, yet richly constituted, abbreviated narrative representation of the original document.

• Machine Translation is the application of computers to the task of translating texts from one natural language to another.

• Dialogue Systems are perhaps the omnipresent application of the future in the systems envisioned by large providers of end-user applications. Dialogue systems usually focus on a narrowly defined application, e.g. your refrigerator or home sound system.

The most explanatory method for presenting what actually happens within an NLP system is by means of the "levels of language" approach. Engaging in complex language behavior requires various kinds of knowledge about language:

• Phonetics and Phonology—knowledge about linguistic sounds.

• Morphology—knowledge of the meaningful components of words.

• Syntax—knowledge of the structural relationships between words.

• Semantics—knowledge of meaning.

• Pragmatics—knowledge of the relationship of meaning to the goals and intentions of the speaker.

• Discourse—knowledge about linguistic units larger than a single utterance.

1.1. Research Problems and Why they are Worthwhile Studying

Ambiguity resolution improves the quality of the solution in most NLP tasks. A word is ambiguous if it has multiple senses, i.e., alternative meanings, for example:

• An MT system translates bill from English to Spanish. Should it translate it as pico "bird's beak" or cuenta "invoice"?

• An IR system retrieves all the web pages about cricket: should they be about the sport or about the insect?

• A QA system answers the query "What is George Miller's position on gun control?": is the intended George Miller the psychologist or the congressman?

Word Sense Disambiguation (WSD) is the task of selecting the most appropriate meaning for a polysemous word based on the context in which it occurs. For example, in the phrase "The bank down the street was robbed", the word bank means a financial institution, while in "The city is on the Western bank of Jordan", this word refers to the shore of a river. WSD is an intermediate task (Volk, 2002) and, as seen above, it is used in many applications. Another fundamental phenomenon in language is the variability of semantic expression: the same meaning can be expressed by, or inferred from, different texts. For example, given the query "What does Peugeot manufacture?", a QA system must be able to recognize, or infer, an answer which may be expressed differently from the query; for instance, the text "Chrétien visited Peugeot's newly renovated car factory" entails the hypothesized answer "Peugeot manufactures cars". Recognizing Textual Entailment (RTE) has been proposed as a generic task that captures major semantic inference needs across many natural language processing applications. This task is defined as a directional relationship between a pair of text expressions, denoted by T, the entailing "Text", and H, the entailed "Hypothesis". We say that T entails H if the meaning of H can be inferred from the meaning of T, as would typically be interpreted by people. Moreover, many NLP tasks have strong links to entailment: in SUM, a summary should be entailed by the text; Paraphrase recognition (PP) can be seen as mutual entailment

between a text T and a hypothesis H; in IE, the extracted information should also be entailed by the text; in QA, the answer obtained for a question after the IR process must be entailed by the supporting snippet of text. In this thesis we propose two statistical methods based on the use of information retrieved from the Web in an attempt to solve the WSD task and the RTE task. Our WSD method outperformed most of the WSD approaches, whereas our RTE method only outperformed the baseline method; we therefore developed a meta-classifier based on our method which has competitive performance. The contributions of our work, including publications, are listed in Chapter 5.

1.2. Research Methods in Brief

We propose two statistical approaches based on the use of the Web as a corpus in an attempt to solve the WSD task and the RTE task. For the WSD task we propose a variant of the Lesk algorithm. The Lesk algorithm disambiguates a word by measuring the word overlap of each definition of the ambiguous word against the context of the ambiguous word. We propose a variation of this scheme: instead of measuring the word overlap, we measure the frequency count of the definition of a sense together with the context, and choose the sense with the best score. Most of the approaches for the RTE task consist in measuring the similarity between the Text and the Hypothesis. However, many of these measures are symmetric, while the RTE task is a directional relation between the Text and the Hypothesis. We therefore propose a non-symmetric similarity measure for the RTE task. Our non-symmetric measure is based on finding a causal relation between the Text and the Hypothesis: we measure the causal relation from the frequency counts of words in sentences containing the word because, and decide whether the Text-Hypothesis pair is true or false.
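To make the causal idea concrete, the following is a minimal sketch of how cause-effect word counts could be collected from sentences containing "because" and turned into a non-symmetric score. It is an illustration of the general idea rather than the implementation evaluated in Chapter 4; the tokenization, the direction of the cause-effect split, and the decision threshold mentioned in the final comment are assumptions.

    import re
    from collections import Counter

    def build_cause_effect_counts(sentences):
        # In "X because Y", Y is treated as the cause and X as the effect.
        pair_counts, cause_counts = Counter(), Counter()
        for sentence in sentences:
            parts = re.split(r"\bbecause\b", sentence.lower())
            if len(parts) != 2:
                continue
            effect_words = re.findall(r"[a-z]+", parts[0])
            cause_words = re.findall(r"[a-z]+", parts[1])
            for c in cause_words:
                cause_counts[c] += 1
                for e in effect_words:
                    pair_counts[(c, e)] += 1
        return pair_counts, cause_counts

    def causal_score(text, hypothesis, pair_counts, cause_counts):
        # Non-symmetric score: relative frequency with which words of the
        # Text act as causes of words of the Hypothesis.
        t_words = re.findall(r"[a-z]+", text.lower())
        h_words = re.findall(r"[a-z]+", hypothesis.lower())
        total, pairs = 0.0, 0
        for c in t_words:
            for e in h_words:
                if cause_counts[c] > 0:
                    total += pair_counts[(c, e)] / cause_counts[c]
                pairs += 1
        return total / pairs if pairs else 0.0

    # Illustrative decision rule: judge the T-H pair as "true" when the score
    # exceeds a threshold tuned on development data (the value is assumed).
    # entails = causal_score(T, H, pair_counts, cause_counts) > 0.001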

1.3. Goal of the Thesis

The main goal of our research is to use the Web as a corpus to develop NLP statistical approaches. Our particular goals are:

• Propose a new WSD approach based on a variant of the Lesk algorithm.

• Propose a new RTE approach based on a non-symmetric similarity measure.

1.4. Structure of the Thesis

The thesis is organized as follows:

• Chapter 2. The related literature is reviewed in this chapter.

• Chapter 3. In this chapter we present the method, results, and a comparison with previous work for the WSD approach.

• Chapter 4. The RTE approaches, their results and a comparison with previous work are shown in this chapter.

• Chapter 5. The final conclusions and future work are drawn.

2. Related Bibliography

In this chapter we first outline the main approaches in NLP. Second, we show how statistical approaches are used in WSD, and finally we review the main approaches in RTE. NLP approaches fall roughly into four categories: symbolic, statistical, connectionist, and hybrid. Symbolic and statistical approaches have coexisted since the early days of this field. Connectionist NLP work first appeared in the 1960s. For a long time, symbolic approaches dominated the field. In the 1980s, statistical approaches regained popularity as a result of the availability of critical computational resources and the need to deal with broad, real-world contexts. Connectionist approaches also recovered from earlier criticism by demonstrating the utility of neural networks in NLP. This section examines each of these approaches in terms of their foundations, typical techniques, differences in processing and system aspects, and their robustness, flexibility, and suitability for various tasks. Symbolic approaches perform deep analysis of linguistic phenomena and are based on explicit representation of facts about language through well-understood knowledge representation schemes and associated algorithms. In fact, the description of the levels of language analysis in the preceding section is given from a symbolic perspective. The primary source of evidence in symbolic systems comes from human-developed rules and lexicons. A good example of symbolic approaches is seen in logic or rule-based systems. In logic-based systems, the symbolic structure is usually in the form of logic propositions. Manipulations of such structures are defined by inference procedures that are generally truth preserving. Rule-based systems usually consist of a set of rules, an inference engine, and a workspace or working memory. Knowledge is represented as facts or rules in the rule base. The inference engine repeatedly selects a rule whose condition is satisfied and executes the rule.

Another example of symbolic approaches is semantic networks. First proposed by Quillian (2000) to model associative memory in psychology, semantic networks represent knowledge through a set of nodes that represent objects or concepts and labeled links that represent relations between nodes. The pattern of connectivity reflects semantic organization, that is, highly associated concepts are directly linked whereas moderately or weakly related concepts are linked through intervening concepts. Semantic networks are widely used to represent structured knowledge and have the most connectionist flavor of the symbolic models. Symbolic approaches have been used for a few decades in a variety of research areas and applications such as information extraction, text categorization, ambiguity resolution, and lexical acquisition. Typical techniques include: explanation-based learning, rule-based learning, inductive logic programming, decision trees, conceptual clustering, and the K nearest neighbor algorithm. Statistical approaches employ various mathematical techniques and often use large text corpora to develop approximate generalized models of linguistic phenomena based on actual examples of these phenomena provided by the text corpora, without adding significant linguistic or world knowledge. In contrast to symbolic approaches, statistical approaches use observable data as the primary source of evidence. A frequently used statistical model is the Hidden Markov Model (HMM), inherited from the speech community. An HMM is a finite state automaton that has a set of states with probabilities attached to transitions between states. Although outputs are visible, states themselves are not directly observable, and are thus "hidden" from external observations. Each state produces one of the observable outputs with a certain probability. Statistical approaches have typically been used in tasks such as lexical acquisition, part-of-speech tagging, collocations, statistical machine translation, statistical grammar learning, and so on. Similar to the statistical approaches, connectionist approaches also develop generalized models from examples of linguistic phenomena. What separates connectionism from other statistical methods is that connectionist models combine statistical learning with various theories of representation; thus the connectionist representations allow transformation, inference, and manipulation of logic formulae. In addition, in connectionist systems,

linguistic models are harder to observe due to the fact that connectionist architectures are less constrained than statistical ones. Generally speaking, a connectionist model is a network of interconnected simple processing units with knowledge stored in the weights of the connections between units. Local interactions among units can result in dynamic global behavior, which, in turn, leads to computation. Some connectionist models are called localist models, assuming that each unit represents a particular concept. For example, one unit might represent the concept "mammal" while another unit might represent the concept "whale". Relations between concepts are encoded by the weights of connections between those concepts. Knowledge in such models is spread across the network, and the connectivity between units reflects their structural relationship. Localist models are quite similar to semantic networks, but the links between units are not usually labeled as they are in semantic nets. They perform well at tasks such as word-sense disambiguation, language generation, and limited inference. Other connectionist models are called distributed models. Unlike that in localist models, a concept in distributed models is represented as a function of simultaneous activation of multiple units. An individual unit only participates in a concept representation. These models are well suited for natural language processing tasks such as syntactic parsing, limited domain translation tasks, and associative retrieval. To summarize, symbolic, statistical, and connectionist approaches have exhibited different characteristics, thus some problems may be better tackled with one approach while other problems by another. In some cases, for some specific tasks, one approach may prove adequate, while in other cases, the tasks can get so complex that it might not be possible to choose a single best approach.

2.1. Approaches to Word Sense Disambiguation

The problem of word sense disambiguation has been described as AI-complete, that is, a problem which can be solved only by first resolving all the difficult problems in artificial intelligence (AI), such as the representation of common sense and encyclopedic knowledge (Pradhan et al. 2007).

To address this task, different methods have been used, with various degrees of success. These methods can be classified depending on the type of knowledge they use to accomplish the task. The main statistical approaches to the WSD task are supervised and unsupervised disambiguation.

2.1.1. Supervised Methods

Supervised methods use a labelled training set to solve the task. They have been shown to be the most efficient ones (Pradhan et al. 2007). However, the lack of large sense-tagged corpora limits this kind of method, and it is difficult and expensive to create such corpora manually. Several research projects take a supervised learning approach to WSD (Brown et al. 1991). The goal is to learn to use the surrounding context to determine the sense of an ambiguous word. Often the disambiguation accuracy is strongly affected by the size of the corpus used in the process. Typically, 1000–2500 occurrences of each word are manually tagged in order to create a corpus. From this, about 75% of the occurrences are used for the training phase and the remaining 25% for testing (Mihalcea and Moldovan, 1999). Corpora like interest and line are among the most well studied in the literature. The Interest dataset (a corpus where each occurrence of the word interest is manually marked up with one of its 6 senses) represents the context of an ambiguous word with the part-of-speech tags of three words to the left and right of interest, a morphological feature indicating if interest is singular or plural, an unordered set of frequently occurring keywords that surround interest, local collocations that include interest, and verb-object syntactic relationships. A nearest-neighbour classifier was employed and achieved an accuracy of 0.87 over repeated trials using randomly selected training and test sets. Ng and Lee (1996) and Pedersen et al. (2000) present studies that utilize the original Bruce and Wiebe feature set and include the interest data. The first compares a range of probabilistic model selection methodologies and finds that none outperform the Naive Bayesian classifier, which attains an accuracy of 0.74. The second compares a range of machine learning

algorithms and finds that a decision tree learner (0.78) and a Naïve Bayesian classifier (0.74) are the most accurate. The Line dataset (similarly, a corpus where each occurrence of the word line is marked with one of its 6 senses) was first studied by Leacock (1993), who evaluate the disambiguation accuracy of a Naive Bayesian classifier, a content vector, and a neural network. The context of an ambiguous word is represented by a bag of words (BoW) where the window of context is two sentences wide. When the Naive Bayesian classifier is evaluated, words are not stemmed and capitalization remains. With the content vector and the neural network, words are stemmed and words from a stop-list are removed. They report no significant differences in accuracy among the three approaches; the Naïve Bayesian classifier achieved an accuracy of 0.71, the content vector 0.72, and the neural network 0.76. This dataset was studied again by Mooney (1996), where seven different machine learning methodologies are compared. All learning algorithms represent the context of an ambiguous word using a BoW with a two-sentence window of context. In these experiments words from a stop list are removed, capitalization is ignored, and words are stemmed. The two most accurate methods in this study proved to be a Naive Bayesian classifier (0.72) and a perceptron (0.71). Recently, the Line dataset was revisited by both Towell and Voorhees (1998) and Pedersen (1997). The former take an ensemble approach where the output from two neural networks is combined: one network is based on a representation of local context while the other represents topical context. The latter utilizes a Naive Bayesian classifier. In both cases context is represented by a set of topical and local features. The topical features correspond to the open-class words that occur in a two-sentence window of context. The local features occur within a window of context three words to the left and right of the ambiguous word and include co-occurrence features as well as the PoS of words in this window. These features are represented as local and topical BoW and PoS. (Towell and Voorhees, 1998) report an accuracy of 0.87 while (Pedersen et al. 1997) report an accuracy of 0.84.
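As a concrete illustration of the bag-of-words supervised setting described above, the sketch below trains a Naive Bayesian classifier on sense-labeled occurrences of an ambiguous word using scikit-learn. The toy training contexts and sense labels are invented for illustration; they are not taken from the interest or line corpora.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Toy sense-tagged occurrences of the ambiguous word "line" (invented examples).
    contexts = [
        "please hold the line while I transfer your call",
        "the actor forgot his line during the second act",
        "a long line of customers waited outside the store",
        "the phone line was cut during the storm",
    ]
    senses = ["phone", "text", "queue", "phone"]

    # Bag-of-words features over the context window plus a Naive Bayesian
    # classifier, mirroring the supervised approaches discussed above.
    classifier = make_pipeline(CountVectorizer(), MultinomialNB())
    classifier.fit(contexts, senses)

    print(classifier.predict(["she waited in a line outside the bakery"]))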

2.1.2. Unsupervised Methods

Unsupervised methods are based on unlabeled corpora. This resolves the knowledge acquisition bottleneck, at the cost of low accuracy. Unsupervised approaches often do not use any learning process; they only rely on a lexical resource, like WordNet (Miller, 1991), to carry out the WSD task. The most widely used methods are those based on content vectors. The content-vector approach treats the context of the ambiguous word as a document in IR. A vector in an n-dimensional space (with n the number of words in the context) is therefore associated with each context. Each row in the vector contains a function of the frequency of each word in the context. Many similarity measures between vectors have been tried in WSD, and these measures differ considerably from the measures used in IR. The similarity measures between vectors are used to build sets of the most similar vectors by means of clustering. These sets can be considered as the senses of the ambiguous word. A main contribution to this approach is the algorithm of Schutze (1992) called context-group discrimination. The context-group discrimination algorithm is similar to the method of Brown et al. (1991), which is a supervised method. The main difference between the Schutze method and the Naïve Bayesian classifier of Gale is that the context-group discrimination algorithm first takes a random sample of the parameters and later re-estimates the parameters with an Expectation-Maximization (EM) algorithm. From the random sample of the parameters, the conditional probability of the ambiguous word being used in a particular context is obtained for every context. This categorization is used for training, and EM maximizes the likelihood of the data given the model. Schutze (1992) proposes a method which avoids tagging each occurrence in the training corpus. Using letter fourgrams within a 1001-character window, his method first automatically clusters the words in the text, and each target word is represented by a vector; a sense is then assigned manually to each cluster, rather than to each occurrence. Assigning a sense demands examining 10 to 20 members of each cluster, and each sense may be represented by several clusters. This method reduces the amount of manual intervention but still requires the examination of a hundred or so occurrences for each ambiguous word. More seriously, it is not clear what the senses derived from the clusters

correspond to (Pereira et al. 1993); and they are not in any case directly usable by other systems, since they are derived from the corpus itself. Brown et al. (1991) and Gale et al. (1993) propose the use of bilingual corpora to avoid hand-tagging of training data. Their premise is that different senses of a given word often translate differently in another language (for example, pen in English is stylo in French for its writing implement sense, and enclos for its enclosure sense). By using a parallel aligned corpus, the translation of each occurrence of a word such as sentence can be used to automatically determine its sense. This method has some limitations, since many ambiguities are preserved in the target language (e.g., French souris, English mouse); furthermore, the few available large-scale parallel corpora are very specialized (for example, the Hansard Corpus of Canadian Parliamentary debates), which skews the sense representation. Dagan et al. (1991) and Dagan and Itai (1994) propose a similar method, but instead of a parallel corpus they use two monolingual corpora and a bilingual dictionary. This solves in part the problems of availability and specificity of domain that plague the parallel corpus approach, since monolingual corpora, including corpora from diverse domains and genres, are much easier to obtain than parallel corpora. Other methods attempt to avoid entirely the need for a tagged corpus, such as many of those cited in the section below (e.g., Yarowsky, 1992, who attacks both the tagging and data sparseness problems simultaneously). However, it is likely that, as noted for grammatical tagging (Mérialdo, 1994), even a minimal phase of supervised learning improves radically on the results of unsupervised methods. Research into means to facilitate and optimize tagging is ongoing; for example, an optimization technique called committee-based sample selection has been proposed (Engelson and Dagan, 1996), which, based on the observation that a substantial portion of manually tagged examples contribute little to performance, enables avoiding the tagging of examples that carry more or less the same information. Such methods are promising, although to our knowledge they have not been applied to the problem of lexical disambiguation.
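A minimal sketch of the content-vector idea described in this subsection: contexts of an ambiguous word are represented as vectors and clustered, and each resulting cluster is treated as an induced sense. The toy contexts and the use of k-means (rather than the EM-based context-group discrimination of Schutze) are illustrative assumptions.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Toy contexts of the ambiguous word "bank" (invented for illustration).
    contexts = [
        "the bank approved the loan and opened a new account",
        "deposits at the bank grew after the interest rate cut",
        "we walked along the bank of the river at sunset",
        "the fisherman sat on the muddy bank of the stream",
    ]

    # Represent each context as a vector and cluster; each cluster is then
    # interpreted as one induced sense of the target word.
    vectors = TfidfVectorizer().fit_transform(contexts)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
    for context, label in zip(contexts, labels):
        print(label, context)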

2.2. Approaches to Recognizing Textual Entailment

The definition of entailment in formal semantics (Chierchia & McConnell-Ginet, 2001) is the following: a text T entails another text H if H is true in every circumstance (possible world) in which T is true. This definition imposes a strictness that is inappropriate for many practical NLP systems. The problem is addressed by the notion of applied textual entailment, as defined by Dagan and Glickman (2004), which takes an empirical evaluation approach. By this definition, a text T entails a hypothesis H if, typically, a human reading T would infer that H is most likely true. The advantages of such a perspective for NLP are that the evaluation is performed using a human gold standard, as in other NLP tasks, and that, at the same time, common background knowledge is assumed. Other annotation guidelines for textual entailment, by Dagan and Glickman (2004), are:

• Entailment is a directional relation; the hypothesis must be entailed by the text and not the other way around.

• The hypothesis must be fully entailed by the text and must not include parts which cannot be inferred from it.

• Cases in which the inference is highly probable but not absolutely certain should be judged as true.

• The background knowledge assumed is the typical world knowledge of a normal reader of that kind of text (the news domain); presupposing highly specific knowledge is not acceptable.

Based on the applied textual entailment definition, the PASCAL Network of Excellence started the RTE Challenge (Dagan et al. 2005). A few samples of Text-Hypothesis (T-H) pairs from the first RTE-1 Challenge are shown below.

Table 2.1: T-H pairs examples

TASK | TEXT | HYPOTHESIS | ENTAILMENT
IR | iTunes software has seen strong sales in Europe. | Strong sales for iTunes in Europe. | True
PP | American Airlines began laying off hundreds of flight attendants on Tuesday, after a federal judge turned aside a union's bid to block the job losses. | American Airlines will recall hundreds of flight attendants as it steps up the number of flights it operates. | False
QA | The two suspects belong to the 30th Street gang, which became embroiled in one of the most notorious recent crimes in Mexico: a shootout at the Guadalajara airport in May, 1993, that killed Cardinal Juan Jesus Posadas Ocampo and six others. | Cardinal Juan Jesus Posadas Ocampo died in 1993. | True

From Table 2.1 we can see the main goals of the RTE: to create a dataset of T-H pairs of small text snippets corresponding to the news domain, where examples are manually labeled for entailment. Participating systems were asked to decide for each T-H pair whether T entails H or not, giving a True or False annotation as the system output. For this reason the datasets provided by the RTE Challenge organizers are intended to include typical T-H pairs that correspond to success and failure cases of actual applications, dealing with tasks such as IE, IR, QA and SUM. They are divided into two balanced corpora: Development and Test datasets. The judgments (classifications) produced by the systems were compared to the gold standard. The percentage of matching judgments provides the accuracy of the run (the fraction of correct responses). As a second measure the Confidence-Weighted Score (cws, also known as Average Precision) was computed, for which the judgments of the test examples are sorted by their confidence (in decreasing order). For instance, in the RTE-1 Challenge (Dagan et al. 2004), the dataset was collected from different text processing applications like IR, Comparable Documents (CD), Reading Comprehension (RC), QA, IE, Machine Translation (MT) and PP. The collected examples represent a range of different levels of entailment reasoning, based on lexical, syntactic,

logical and world knowledge. In the end, 567 examples were in the development dataset and 800 in the test dataset, split into True/False examples. The main focus for the RTE-2 Challenge dataset was to provide more "realistic" T-H pairs. The dataset consists of 1600 T-H pairs divided into development and test datasets, each one containing 800 pairs. The organizers focused on four applications: IR, IE, QA and SUM. The RTE-3 Challenge followed the same structure as the previous versions. Something new was introduced: a resource pool, where participants had the possibility to share the same resources. In 2008 the RTE-4 Challenge included the three-way decision among "YES", "NO" and "UNKNOWN" to drive systems to make more precise informational distinctions; a hypothesis being unknown on the basis of a text should be distinguished from a hypothesis being shown false/contradicted by a text. The classic two-way RTE task was also offered, in which the pairs where T entailed H were marked as ENTAILMENT, and those where the entailment did not hold were marked as NO ENTAILMENT. The descriptions of the tasks are presented below. The three-way RTE task is to decide whether:

• T entails H - in which case the pair will be marked as ENTAILMENT.

• T contradicts H - in which case the pair will be marked as CONTRADICTION.

• The truth of H cannot be determined on the basis of T - in which case the pair will be marked as UNKNOWN.

The two-way RTE task is to decide whether:

• T entails H - in which case the pair will be marked as ENTAILMENT.

• T does not entail H - in which case the pair will be marked as NO ENTAILMENT.

The RTE-4 dataset was made of 1000 pairs (300 each for IE and IR, 200 each for SUM and QA).
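To make the evaluation measures mentioned earlier in this section concrete, the following sketch computes the accuracy of a run and its Confidence-Weighted Score, the latter as the average of the running precision after sorting the judgments by decreasing confidence. The toy gold labels, predictions and confidence values are invented for illustration.

    def accuracy(gold, predicted):
        return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

    def confidence_weighted_score(gold, predicted, confidence):
        # Sort judgments by decreasing confidence, then average the running
        # precision at each rank.
        ranked = sorted(zip(gold, predicted, confidence), key=lambda x: -x[2])
        correct, cws = 0, 0.0
        for i, (g, p, _) in enumerate(ranked, start=1):
            correct += (g == p)
            cws += correct / i
        return cws / len(ranked)

    gold = [True, False, True, True]
    predicted = [True, True, True, False]
    confidence = [0.9, 0.4, 0.8, 0.6]
    print(accuracy(gold, predicted), confidence_weighted_score(gold, predicted, confidence))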

2.2.1. Approaches by Language Levels

The RTE approaches can be classified depending on which textual entailment phenomena they address or on the type of representation (levels of language) of the T-H pair. Vanderwende et al. (2005) examine the complete test set of RTE-1 with the purpose of isolating the pairs whose categorization can be accurately predicted based only on syntactic matching. The human annotation indicates that 37% of the entailments are decided merely at the syntactic level; this rises to 49% if the information of a general-purpose thesaurus is additionally added. Bar-Haim et al. (2006) take this idea a step further and annotate 30% of the RTE-1 test set at two strictly defined levels of entailment. Extending Vanderwende et al.'s work, they consider a lexical entailment level, which involves morphological derivations, ontological relations and lexical world knowledge, in addition to a lexical-syntactic level, which, on top of lexical transformations, contains syntactic transformations, paraphrases and coreference. At these levels, 44% of the T-H pairs are decided at the lexical level and 50% at the lexical-syntactic level. Clark et al. (2007) explore the requirements of RTE in a way that differs from the previous approaches in that it is not centered on the basic lexical-syntactic levels of entailment, but instead investigates a wide range of phenomena involving lexical and world knowledge. Clark et al. manually annotate 25% of the positive entailment pairs in RTE-3 for thirteen distinct entailment phenomena. As we can see, there are various levels of the entailment phenomena, classified under four main categories: lexical, syntactic, semantic and logical.

2.2.1.1. Lexical Level

The approach in (Pérez and Alfonseca, 2005) consists in using the BLEU algorithm, which works at the lexical level, to compare T-H pairs. Next, the entailment is judged as true or false according to BLEU's output. Once the algorithm is applied, they observed that the results confirm the use of BLEU as a baseline for the automatic recognition of textual entailments. They showed that a shallow technique can reach around 50% accuracy. In

order to recognize entailments using BLEU, the first decision is to choose whether the candidate text should be considered as part of the entailment (T) or as the hypothesis (H). In order to make this choice, they did a first experiment in which they considered the T part as the reference and the H part as the candidate. This setting has the advantage that the T part is usually longer than the H part and thus the reference would contain more information than the candidate.
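A minimal sketch of this BLEU-based baseline using NLTK's implementation, with T as the reference and H as the candidate, as in the first experiment described above. The smoothing choice and the decision threshold are illustrative assumptions; in practice the threshold would be tuned on development data.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    text = "iTunes software has seen strong sales in Europe".lower().split()
    hypothesis = "Strong sales for iTunes in Europe".lower().split()

    # T is the reference and H is the candidate.
    smoothing = SmoothingFunction().method1  # avoids zero scores on short sentences
    score = sentence_bleu([text], hypothesis, smoothing_function=smoothing)

    THRESHOLD = 0.1  # assumed value; tuned on development data in practice
    print("TRUE" if score >= THRESHOLD else "FALSE", round(score, 3))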

2.2.1.2. Syntactic Level

Graph distance/similarity measures are widely recognized to be powerful tools for matching problems, and they were used with success in RTE-1 by Pazienza, Pennacchiotti and Zanzotto. Objects to be matched (two images, patterns, text and hypothesis in the RTE task, etc.) are represented as graphs, turning the recognition problem into a graph matching task. Following (Dagan and Glickman, 2004), since the hypothesis H and text T may be represented by two syntactic graphs, the textual entailment recognition problem can be reduced to graph similarity measure estimation, although textual entailment has particular properties (Pazienza et al., 2005):

• Unlike classical graph matching problems, it is non-symmetric.

• Node similarity cannot be reduced to the label level (token similarity).

• Similarity should also be estimated considering linguistically motivated graph transformations (nominalization and passivization).

The tree edit distance algorithm (Kouylekov and Magnini, 2005) is applied to the dependency trees of both the text and the hypothesis. If the distance (the cost of the editing operations) between the two trees is below a certain threshold, empirically estimated on the training data, then an entailment relation is assigned between the two texts. According to this approach, the following transformations are allowed (a small sketch of the threshold decision is given after the list):

• Insertion: insert a node from the dependency tree of H into the dependency tree of T. When a node is inserted, it is attached to the dependency relation of the source label.

• Deletion: delete a node N from the dependency tree of T. When N is deleted, all its children are attached to the parent of N. It is not required to explicitly delete the children of N as they are going to be either deleted or substituted on a following step.

• Substitution: change the label of a node N1 in the source tree into a label of a node N2 of the target tree. Substitution is allowed only if the two nodes share the same part-of-speech. In case of substitution, the relation attached to the substituted node is changed with the relation of the new node.
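The sketch below illustrates only the threshold decision described above, using a heavily simplified stand-in for the tree edit cost: every hypothesis node must be covered by an identical text node, by a text node with the same part-of-speech (a substitution), or be inserted. The node representation, the edit costs and the threshold are assumptions; a real implementation would compute the edit distance over full dependency trees.

    # Dependency-tree nodes simplified to (lemma, part_of_speech) pairs.
    text_nodes = [("peugeot", "NNP"), ("renovate", "VBD"), ("car", "NN"), ("factory", "NN")]
    hyp_nodes = [("peugeot", "NNP"), ("manufacture", "VBZ"), ("car", "NN")]

    INSERT_COST, SUBST_COST = 1.0, 0.5

    def edit_cost(text_nodes, hyp_nodes):
        # Crude stand-in for the tree edit cost between T and H.
        cost = 0.0
        for lemma, pos in hyp_nodes:
            if (lemma, pos) in text_nodes:
                continue                      # exact match, no cost
            elif any(p == pos for _, p in text_nodes):
                cost += SUBST_COST            # substitution allowed: same PoS
            else:
                cost += INSERT_COST           # otherwise the node is inserted
        return cost

    THRESHOLD = 1.0  # would be estimated empirically on training data
    print("ENTAILMENT" if edit_cost(text_nodes, hyp_nodes) <= THRESHOLD else "NO ENTAILMENT")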

2.2.1.3. Semantic Level

The system proposed in (Bar-Haim et al. 2006) relies on a relatively deep linguistic analysis, which the authors complement with a shallow component based on word overlap.

The system is based on three main components:

• A linguistic analysis of text and hypothesis based primarily on LFG and Frame Semantics (Baker et al. 1998).

• A computation of a match graph that encodes the semantic overlap between text and hypothesis.

• A statistical entailment decision (Bar-Haim et al. 2006).

Bos and Markert (2005) used several shallow surface features to model the text, hypothesis and their relation to each other. They expected some dependency between the surface string similarity of text and hypothesis and the existence of entailment. This string similarity measure uses only a form of extended word overlap between text and hypothesis, taking into account identity of words, as well as synonymy and morphological derivations revealed by WordNet (Fellbaum, 1998). To introduce an element of robustness into their approach, they used model builders to measure the “distance” from an entailment. The intuition behind this approach is as follows: If H is entailed by T, the model for T+H is not informative compared to the one for T, and hence does not introduce new entities. Put differently, the domain size for T+H would equal the domain size of T. In contrast, if T does not entail H, H normally introduces some new information (except when it contains negated information), and this will be

reflected in the domain size of T+H, which becomes larger than the domain size of T. It turns out that this difference between domain sizes is a useful way of measuring the likelihood of entailment. Large differences are mostly not entailments, small differences usually are. They use a robust wide-coverage CCG-parser (Bos et al. 2004) to generate fine-grained semantic representations for each T-H pair. The semantic representation language is a first-order fragment used in Discourse Representation Theory (DRS) (Kamp and Reyle, 1993); including the recursive DRS structure to cover negation, disjunction, and implication. Given a T-H pair, a theorem prover can be used to find answers to the following conjectures:

• T implies H (shows entailment).

• T+H are inconsistent (shows no entailment).

In RTE-1, five groups used logical provers and offered deep semantic analysis. One system (Raina et al. 2005) transformed the text and hypothesis into logical formulas as in Harabagiu et al. (2000) and calculated the "cost" of proving the hypothesis from the text. In RTE-2 only two systems used logical inference, and one of them achieved the second best result of the edition (Tatu et al. 2006). In RTE-3 the number of systems using logical inference grew to seven, and the first two results used logical inference (Hickl 2007 and Tatu 2007). In RTE-4 nine groups used logical inference in order to identify the entailment relation, and two of them were oriented towards it (Clark and Harrison, 2008) and (Bergmair, 2008).

2.2.2. Machine Learning Approaches

In RTE-1, the number of systems that used machine learning algorithms to determine the result of the entailment relation was considerable. The aim was to use the results offered by these algorithms for answer classification instead of using thresholds established by human experts on training data. The features used by these systems include lexical, semantic and grammatical attributes of verbs, nouns and adjectives, and named entities, and were calculated using the WordNet taxonomy, VerbOcean (Chklovski and Pantel,

2004), a Latent Semantic Indexing technique (Deerwester et al. 1990), or the ROUGE metrics (Lin and Hovy, 2003). Other features, like negation, were identified by inspecting the semantic representation of the text with DRS for the presence of negation operators. These parameters were evaluated by machine learning algorithms such as SVM (Joachims, 2002) or decision tree learners such as C5.0 (Quinlan, 2000), or used binary classifiers like

Bayesian Logistic Regression (BBR) and TiMBL (Daelemans et al. 1998). Starting with RTE-2, the interest in using machine learning grew constantly. Thus, the number of systems that used machine learning for classification increased from seven in RTE-1 to fifteen in RTE-2 and sixteen in RTE-3 and RTE-4. The approaches are varied and their results depend on identifying relevant features. In (Inkpen et al. 2006), matching features are represented by lexical matches (including synonyms and related words), part-of-speech matching and matching of grammatical dependency relations. Mismatch features include negation and numeric mismatches. The MLEnt system (Kozareva, 2006) models lexical and semantic information in the form of attributes and, based on them, proposes 17 features. In (Ferrés and Rodríguez, 2007), the authors computed a set of semantic-based distances between sentences. The system of Montejo-Ráez et al. (2007) used semantic distance between stems, subsequences of consecutive stems and matching. The features identified in (Li et al. 2007) include lexical features, named entities, dependent content word pairs, average distance, negation, and text length.
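To illustrate the feature-based setup described in this subsection, the sketch below derives a few simple match and mismatch features from a T-H pair (word overlap, a length ratio, and a crude negation-mismatch flag) and feeds them to an off-the-shelf classifier. The feature set, the toy training pairs and the choice of a linear SVM are illustrative assumptions, not a reconstruction of any of the cited systems.

    from sklearn.svm import LinearSVC

    NEGATIONS = {"not", "no", "never"}

    def features(text, hypothesis):
        t, h = set(text.lower().split()), set(hypothesis.lower().split())
        overlap = len(t & h) / len(h)                                     # lexical match
        length_ratio = len(h) / len(t)                                    # text length
        neg_mismatch = float(bool(t & NEGATIONS) != bool(h & NEGATIONS))  # negation mismatch
        return [overlap, length_ratio, neg_mismatch]

    # Toy T-H pairs (invented); label 1 = entailment, 0 = no entailment.
    pairs = [
        ("iTunes software has seen strong sales in Europe", "Strong sales for iTunes in Europe", 1),
        ("The company did not report a profit this year", "The company reported a profit", 0),
        ("Peugeot opened a newly renovated car factory", "Peugeot has a car factory", 1),
        ("The match was cancelled because of rain", "The match took place as planned", 0),
    ]
    X = [features(t, h) for t, h, _ in pairs]
    y = [label for _, _, label in pairs]

    classifier = LinearSVC().fit(X, y)
    print(classifier.predict([features("Sales of iTunes grew strongly in Europe",
                                       "Strong sales for iTunes in Europe")]))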

3. Word Sense Disambiguation

In this chapter we first present a new measure for semantic relatedness based on the simple Lesk algorithm. We use word co-occurrences to disambiguate the ambiguous word; the statistical information for the measure is retrieved from the Web. Our experiments are as follows: first over the SemCor corpus, and then a comparison with previous results over the Senseval-2 corpus. Finally, partial conclusions are drawn. An example of an unsupervised method is the original Lesk algorithm (OL) (Lesk, 1986), which disambiguates polysemous words in (short) phrases. The definition, or gloss (from a dictionary), of each sense of an ambiguous word in a phrase is compared to the glosses of every other word in the phrase. Basically, the algorithm selects the set of senses such that their glosses have the largest number of words in common. To tackle the problem of the knowledge acquisition bottleneck in supervised methods, the Web can be used as a lexical resource. The Web has become a source of data for NLP, and WSD is no exception. Many methods use the Web to automatically generate sense-tagged corpora (Martinez, 2003). The Web as a corpus for NLP research (Volk, 2002) has already been used with success in many areas such as question answering (Brill et al. 2001), machine translation (Greffenstete, 1999), and anaphora resolution (Bunescu, 2003).

3.1. Theoretical Framework

Senseval, started in 1998 (Kilgarriff and Rosenzweig, 1998), is tied to the evaluation of WSD systems, producing a set of benchmarks for evaluating WSD system performance and establishing the viability of WSD as a separately evaluable NLP task.

In past versions of Senseval, variants of the Lesk approach were used as baselines. In Senseval-1, most of the systems for disambiguating English words were outperformed by a Lesk variant used as a baseline. On the other hand, at Senseval-2, the Lesk baselines were outperformed by most of the systems in the lexical sample task. The Lesk-based baselines outperform baselines that use simpler algorithms such as random sense assignment, or an algorithm that always chooses the sense which has the most training-corpus instances. The simplified Lesk (SL) algorithm (Kilgarriff and Rosenzweig, 1998) chooses the sense of an ambiguous word w such that its gloss g has the greatest number of words in common with the words around the given word w (the context of w):

Table 3.1: Simplified Lesk algorithm

For each sense s of w do
    weight(s) = sim(c, g(s))
s = argmax weight(s)

Here c is the context of the word w (in the simplest case, just a bag of words within a certain distance from w) and g(s) is the gloss associated with the sense s. The Lesk-plus method (Kilgarriff and Rosenzweig, 1998) also considers a learning process, so it can be compared with supervised systems. For each word in the sentence containing the test item, it tests whether the word occurs in the dictionary entry or corpus instances for each candidate sense. For weighting of the sentences it uses the inverse document frequency (IDF) of a word, computed as -log(p(w)), where p(w) is estimated as the fraction of dictionary "documents" (definitions or examples) which contain the word. The Lesk-plus method does not explicitly represent the relative corpus frequencies of sense tags. Instead, it favours common tags because they have larger context sets, and an arbitrary word in a test-corpus sentence is more likely to occur in the context set of a more common training-corpus sense tag. The original Lesk algorithm relies on glosses found in traditional dictionaries such as the Oxford Advanced Learner's Dictionary. Banerjee and Pedersen (2002) propose a variant of the Lesk algorithm to take advantage of the highly interconnected set of relations

among synonyms that WordNet offers. This variant takes as back-off the glosses of words that are related to the words to be disambiguated. This back-off provides a richer source of information and improves accuracy. It outperforms the baseline methods in the Senseval-2 exercise. Vasilescu et al. (2004) proposed a set of different variants of the Lesk approach. In the first variant, the score assigned to a candidate sense is the number of overlaps between the BoW of that sense and the BoW of the context. A second variant, called WHG (for weighted), also takes into account the length of the description for a given sense. According to Lesk, long descriptions can produce more overlaps than short ones, and thus dominate the decision-making process. Another variant multiplied the number of overlaps for a given candidate sense by the inverse of the logarithm of the description length for this sense. Other variants of the weighting metrics were also proposed, taking into account the distance between a word in the context and the target word, or the frequency of the context word in the language, but they did not bring any significant difference. In Statistical NLP, one commonly receives as a corpus a certain amount of data from a certain domain of interest, without having any say in how it is constructed. In such cases, having more training data is normally more useful than any concerns of balance, and one should simply use all the text that is available. The problem of data sparseness, which is common for much corpus-based work, is especially severe for work in WSD. First, enormous amounts of text are required to ensure that all senses of a polysemous word are represented, given the vast disparity in frequency among senses. The Web is immense, free and available by mouse-click. It contains hundreds of billions of words of text and can be used for all manner of language research. The simplest language use is spell checking. Is it speculater or speculator? Google gives 67 for the first and 82,000 for the latter. Question answered.
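Before turning to the proposed method, the following is a minimal sketch of the overlap-based scoring that the simplified Lesk algorithm and the variants above build on, using WordNet glosses through NLTK. It assumes the NLTK WordNet data is installed and ignores the weighting refinements discussed above.

    from nltk.corpus import wordnet as wn  # requires the NLTK WordNet data

    def simplified_lesk(word, context_words):
        # Score each sense by the overlap between its gloss (plus examples)
        # and the context, as in Table 3.1.
        context = set(w.lower() for w in context_words)
        best_sense, best_score = None, -1
        for sense in wn.synsets(word):
            gloss = set(sense.definition().lower().split())
            for example in sense.examples():
                gloss |= set(example.lower().split())
            score = len(gloss & context)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense

    print(simplified_lesk("bank", "the bank down the street was robbed of its deposits".split()))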

3.2. Proposed Method

We augment the Lesk approach with a measure for semantic relatedness. The measure is based on the hypothesis of a strong relationship between the gloss of a sense and the

context of a word. We measure this relationship by finding the frequencies of co-occurrences between the gloss and the context. We consider the Web as a corpus to find the co-occurrences of the gloss and the context.

Figure 3.1: General data flow for the WSD method.

In Figure 3.1 we show the general scheme of our proposed method. For each word in the text, the method tags the word with a sense from a sense inventory. The method takes its decisions based on an n-gram database and a sense inventory: we use the Web as the n-gram database and WordNet as the sense inventory. Below we show how the new measure can be applied to our method (simple Lesk algorithm):

Table 3.2: New statistical measure

For each word w to be tagged
    For each sense s of w
        g  = gloss of sense s (bag of words)
        e  = examples of sense s (bag of words)
        c  = context of the word w (bag of words)
        d  = g ∪ e
        dc = d ∪ c
        f_g  = web frequency of (d)
        f_gc = web frequency of (dc)
        weight(s) = f_gc / f_g
    s = argmax weight(s)

The web frequency is measured by a query to a web search engine. The weight is the probability of seeing the gloss of a sense in the context of the given word occurrence. The method chooses the sense which maximizes the weight. If various senses have the same weight, then the sense is chosen by a back-off heuristic.
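A sketch of how the measure in Table 3.2 could be realized. The web_hit_count function stands for the number of pages returned by a web search engine for the given bag of words (the experiments below used Google hit counts); it is left as a stub to be replaced with an actual search API, and the use of NLTK's WordNet interface as the sense inventory is an assumption.

    from nltk.corpus import wordnet as wn  # sense inventory, as in Figure 3.1

    def web_hit_count(terms):
        # Placeholder for the number of web pages found by a search engine
        # for the given bag of words; plug in a real search API here.
        raise NotImplementedError

    def disambiguate(word, context_words):
        # weight(s) = web frequency of (gloss + examples + context)
        #           / web frequency of (gloss + examples), as in Table 3.2.
        context = [w.lower() for w in context_words]
        best_sense, best_weight = None, -1.0
        for sense in wn.synsets(word):
            d = sense.definition().lower().split()
            for example in sense.examples():
                d += example.lower().split()
            f_g = web_hit_count(d)
            f_gc = web_hit_count(d + context)
            weight = f_gc / f_g if f_g else 0.0
            if weight > best_weight:
                best_sense, best_weight = sense, weight
        return best_sense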

3.3. Experimental Setting

In this subsection we first give a brief description of the datasets used, then the experimental setting of the proposed measure, and finally a comparison with previous results.

3.3.1. Data Sets

Semcor is a textual corpus in which words are syntactically and semantically tagged. The texts included in Semcor were extracted from the Brown corpus and then linked to senses in the WordNet lexicon. All the words in the corpus have been syntactically tagged using Brill's part-of-speech tagger; the semantic tagging was done manually for all the nouns,

25 verbs, adjectives and adverbs, each of these words being associated with its correspondent WordNet sense. We show above an example of an entry in the Semcor corpus. said The Senseval dataset consists of 4,328 instances each of which contains a sentence with a single target word to be disambiguated, and one or two surrounding sentences that provide additional context. A task in Senseval consists of three types of data: 1) A sense inventory of word-to-sense mappings, with possibly extra information to explain, define, or distinguish the senses (e.g., WordNet); 2) A corpus of manually tagged text or samples of text that acts as the Gold Standard, and that is split into an optional training corpus and test corpus; and 3) An optional sense hierarchy or sense grouping to allow for fine or coarse grained sense distinctions to be used in scoring. The next XML is an example of an entry in Senseval. Once metropolitan ... asking ... Senseval has two variants of the WSD task: All words task participating systems have to disambiguate all words (open-class words) in a set of text, and Lexical sample task, first a sample of words is selected. Then for each sample word, a number of corpus instances are selected.

3.4. Experimental Results

In our preliminary experiments we aimed at the all-words WSD task. For evaluation we used a subset of the first two tagged files of SemCor 1.6: the files br-a01 and br-a02. We used WordNet 2.1 as the sense repository. WordNet is a lexical database where each unique meaning of a word is represented by a synonym set or synset. Each synset has a gloss that defines the concept that it represents. For example, the words car, auto, automobile, and motorcar constitute a single synset that has the following gloss: four wheel motor vehicle, usually propelled by an internal combustion engine. Many glosses have examples of usage associated with them, such as "he needs a car to get to work."
Context is the only means to identify the meaning of a polysemous word; therefore, all work on WSD relies on the context of the target word to provide information for its disambiguation. Most disambiguation work uses the local context of a word occurrence as the primary information source. Local or "micro" context is generally considered to be some small window of words surrounding a word occurrence in a text or discourse, from a few words of context to the entire sentence in which the target word appears. Context is very often regarded as all words or characters falling within some window of the target, with no regard for distance, syntactic, or other relations. Yarowsky (1993) examines different windows of micro-context, including 1-contexts, k-contexts, and word pairs at offsets -1 and -2, -1 and +1, and +1 and +2, and sorts them using a log-likelihood ratio to find the most reliable evidence for disambiguation. Yarowsky observes that the optimal value of k varies with the kind of ambiguity and suggests that local ambiguities need only a window of k = 3 or k = 4. We use the bag-of-words approach: the context is considered as the words in some window surrounding the target word, regarded as a group without consideration of their distance or grammatical relation to the target. We take a symmetric window of ±3 words around the target word as an optimal value for local ambiguities. The web counts were collected using the Google1 search engine. To construct the queries, first we tokenize the sentence, then the target word is replaced by the gloss, and the resulting string is submitted to the search engine as a query. When our method cannot decide on a sense in the argmax function (e.g., two senses have the same weight), the sense is chosen randomly from the top senses (those with the same weight); we refer to this as the random top weight back-off. We used precision and recall to score the system, although the metrics are not completely analogous to Information Retrieval evaluation. Recall (percentage of right answers over all instances in the test set) is the basic measurement of accuracy in this task, because it shows how many correct disambiguations the system achieved overall. Precision

1 http://www.google.com

(percentage of right answers in the set of answered instances) favours systems that are very accurate even if only on the small subset of cases for which they chose to give answers. Resnik and Yarowsky (1997) have shown that it is difficult to compare WSD methods; the distinctions that make the comparison difficult reside in the approach considered (supervised or unsupervised). The preliminary experiments over the SemCor subset obtained an accuracy of 47%. We report only accuracy because no word presented a tie among its sense weights: if a system makes an assignment for every word, precision and recall are the same and can be called accuracy. The Web rarely presents data sparseness, so the method always gives an answer and never falls back to the back-off heuristic. In Table 3.3 we present a comparison of the accuracy of our measure applied to the simple Lesk algorithm against variants of the original Lesk approach. This comparison was carried out over the Senseval-2 data, with the same setting as the experiment over the SemCor subset.

Table 3.3: Comparison with previous work

Method                                                Type         Back-off             Accuracy
Vasilescu et al. 2004                                 simplified   MFS                  0.58
Mihalcea and Tarau 2004                               simplified   RS                   0.47
Our method                                            simplified   Random top senses    0.45
Vasilescu et al. 2004                                 original     MFS                  0.42
Mihalcea and Tarau 2004                               original     RS                   0.35
Extended gloss overlaps (Banerjee and Pedersen 2002)  original     –                    0.31

As can be seen from Table 3.3, the original Lesk (OL) algorithm has a lower performance than the other methods and even than the baseline system. This observation is consistent with Litkowski's (2002) hypothesis that only about one third of the instances can rely on the Lesk-style information (gloss and example) in a disambiguation process. The simplified Lesk (SL) method, which only counts the overlaps between the description of a candidate sense and the words in the context, produces better results. Our Lesk variant outperforms the OL of Banerjee and Pedersen (2002) and the OL (Lesk, 1986) variants (back-off to random sense and to most frequent sense). The SL of Mihalcea and Tarau (2004) performs better than our method with the help of the random sense heuristic. Finally, the SL of Vasilescu et al. (2004) has the best accuracy. However, this method can be considered a supervised method because of its most frequent sense heuristic (it is not clear what its performance would be with the unsupervised method of McCarthy et al. (2004) for determining the predominant sense). When a method cannot make a judgment (e.g., there is no overlap between the gloss and the context in the simple Lesk), the decision is taken by a back-off heuristic. Most of these heuristics choose a random sense or use information from a dictionary; the most frequent sense heuristic chooses the first (or predominant) sense and thus assumes the availability of hand-tagged data. Since our method never reaches the back-off heuristic, in Table 3.4 we present a comparison with the top three unsupervised methods of Senseval-2.

Table 3.4: Comparison with Senseval-2 unsupervised methods

Method Accuracy

Our method 0.45

Senseval–First 0.40

Senseval–Second 0.29

Senseval–Third 0.24

Original Lesk 0.18

The Senseval–First, Senseval–Second, and Senseval–Third results are the top three most accurate fully automatic unsupervised systems in the Senseval-2 exercise. This is the class of systems comparable to our own, since they require no human intervention and do not use any manually created training examples. These results show that our approach was considerably more accurate than all of them. Our method also has the advantage of simplicity and the use of a very limited context window.

Table 3.5: Comparison with SemEval unsupervised methods

Method Accuracy

Radu ION 0.52

Davide Buscaldi 0.46

Our method (Senseval-2) 0.45

Sudip Kumar Naskar 0.40

In Table 3.5 we present a comparison of our method (tested over Senseval-2) with state-of-the-art unsupervised systems of SemEval-2007. Two of the methods outperform ours, but the comparison is not entirely fair because our method was tested over the Senseval-2 corpus. Still, we can see a growing tendency in the precision of unsupervised approaches.

3.5. Conclusion

In this thesis we used a variant of the Lesk algorithm for the WSD task. We proposed a new semantic relatedness measure based on web counts collected with a search engine. We have shown that our variant outperforms some Lesk-based methods as well as the top unsupervised methods of the Senseval-2 exercise. These results are significant because they are based on a very simple algorithm that relies on co-occurrence scores for the senses of a target word. We once more confirmed that the Web can be used as a lexical resource for WSD. In future work we will explore different context windows, as well as linguistically motivated context windows (such as a syntactic unit), and test our method over the SemEval corpus.

4. Recognizing Textual Entailment

In this chapter we propose a new statistical method applied to the RTE task. The method is based on the co-occurrences of words between the T-H pairs, where the co-occurrences are extracted from a cause-effect corpus; thus we decide the entailment using a non-symmetric measure of similarity. First we show the experiments over the RTE-1, RTE-2, and RTE-3 test datasets; as a second experiment we propose a meta-classifier based on symmetric and non-symmetric measures of similarity. Finally we compare our methods with previous work and draw partial conclusions.

4.1. Theoretical Framework

As we saw in Chapter 2, the main methods for RTE are characterized by the level of representation given to the T-H pair. Each type of representation has operations used to establish the entailment decision (e.g., word matching at the lexical level, tree edit distance at the syntactic level). The principal operations are similarity measures between T-H pair representations, but many of these similarity measures are symmetric. A symmetric measure cannot capture some aspects of the T → H relation, because if we invert the entailment relation (i.e., H → T) a symmetric function gives the same score. Therefore methods like that of Tatar et al. (2007) propose non-symmetric similarity measures, used in the RTE-1 Challenge. Glickman uses the definition: T entails H iff P(H | T) > P(H). The probabilities are calculated on the basis of the Web. The accuracy of the system is the best for RTE-1 (0.56).

Another non-symmetric method is that of Kouylekov, who uses the definition: T entails H iff there exists a sequence of transformations applied to T such that H is obtained with a total cost below a certain threshold. The following transformations are allowed: Insertion: insert a node from the dependency tree of H into the dependency tree of T; Deletion: delete a node from the dependency tree of T; Substitution: change a node of T into a node of H. Each transformation has a cost, and the cost of the edit distance between T and H, ed(T, H), is the sum of the costs of all applied transformations. The entailment score of a given pair is calculated as score(T, H) = ed(T, H) / ed(∅, H), where ed(∅, H) is the cost of inserting the entire tree H. If this score is below a learned threshold, the relation T → H holds. The accuracy of the method is 0.56. In (Corley and Mihalcea, 2005) an even "more non-symmetric" criterion is proposed: when the edit distance (a modified Levenshtein distance) satisfies the relation ed(T, H) < ed(H, T), the relation T → H holds. Other authors use a definition which, in terms of representation of knowledge as feature structures, could be formulated as: T entails H iff H subsumes T. That method is again non-symmetric, as the definition used is: T entails H iff H is not informative with respect to T. A method for establishing the entailment relation can also be obtained using the non-symmetric measure of similarity between two texts presented by Corley and Mihalcea

(2005). The authors define the similarity between the texts Ti and Tj with respect to Ti as:

\[
\mathrm{sim}(T_i, T_j)_{T_i} = \frac{\sum_{pos} \sum_{w_k \in ws^{pos}_{T_i}} \mathrm{maxSim}(w_k) \times \mathrm{idf}(w_k)}{\sum_{pos} \sum_{w_k \in ws^{pos}_{T_i}} \mathrm{idf}(w_k)}
\]

Here ws^pos_Ti are the sets of open-class words (nouns, verbs, adjectives and adverbs) with part of speech pos in the text segment Ti. For a word wk with a given pos in Ti, the highest similarity to the words with the same pos in the other text Tj is denoted by maxSim(wk). Starting with this text-to-text similarity metric, a textual entailment recognition system can be derived by applying lexical refutation theory. As the

hypothesis H is less informative than the text T, for a TRUE pair the following relation will hold: sim(T, H)_T < sim(T, H)_H. This relation can be proven using lexical refutation. A sketch of the proof is the following: to prove T → H it is necessary to prove that the set of formulas {T, ¬H} is lexically contradictory (here T and ¬H also denote the sets of disjunctive clauses of T and ¬H). We propose a new non-symmetric measure of similarity based on the co-occurrences of the words of the T-H pair in a cause-effect corpus. We then use the two types of similarity measures (symmetric and non-symmetric) in a meta-classifier.
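A minimal sketch of this directional text-to-text similarity in Python is given below, assuming a word-to-word similarity function word_sim and an idf dictionary are available; for simplicity the sketch ignores the grouping of words by part of speech used in the original formulation.

def directional_sim(ti_words, tj_words, idf, word_sim):
    # sim(Ti, Tj) with respect to Ti: each word of Ti is matched with its
    # most similar word in Tj, weighted by idf, and the total is normalized.
    num = den = 0.0
    for wk in ti_words:
        max_sim = max((word_sim(wk, w) for w in tj_words), default=0.0)
        num += max_sim * idf.get(wk, 1.0)
        den += idf.get(wk, 1.0)
    return num / den if den else 0.0

# For a TRUE pair one would expect the score computed with respect to T
# (directional_sim(T, H, ...)) to be smaller than the one computed with
# respect to H (directional_sim(H, T, ...)).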

4.2. Proposed Methods

Before presenting the new methods we give a brief introduction to the main evaluation measures, which are used to evaluate the performance of the proposed methods. An important recent development in NLP has been the use of much more rigorous standards for the evaluation of systems. It is generally agreed that the ultimate demonstration of success is showing improved performance at an application task, be that spelling correction, summarizing job advertisements, or whatever. Nevertheless, while developing systems, it is often convenient to assess components of the system on some artificial performance score (such as perplexity), improvements in which can be expected to be reflected in better performance of the whole system on an application task. Evaluation in IR makes frequent use of the notions of precision and recall, and their use has crossed over into work on evaluating Statistical NLP models, such as several of the methods discussed in this chapter. For many problems, we have a set of targets (for example, relevant documents, or sentences in which a word has a certain sense) contained within a larger collection. The system then decides on a selected set (documents that it thinks are relevant, or sentences that it thinks contain a certain sense of a word, etc.).

The selected and target groupings can be thought of as two binary variables, whose combinations can be expressed as a 2x2 contingency matrix:

Table 4.1: Contingency matrix

                    Corpus
System         true        false
true            tp          fp
false           fn          tn

The numbers in each box show the frequency or count of the items in each region of the space. The cases accounted for by tp (true positives) and tn (true negatives) are the cases our system got right. The wrongly selected cases in fp are called false positives, false acceptances or Type II errors. The cases in fn that failed to be selected are called false negatives, false rejections or Type I errors. The accuracy is the proportion of true results (both tp and tn) in the population:

\[ \mathrm{accuracy} = \frac{tp + tn}{tp + fp + fn + tn} \]

Precision is defined as the proportion of selected items that the system got right:

\[ \mathrm{precision} = \frac{tp}{tp + fp} \]

Recall is defined as the proportion of the target items that the system selected:

\[ \mathrm{recall} = \frac{tp}{tp + fn} \]

It can be convenient to combine precision and recall into a single measure of overall performance. One way to do this is the F-measure, defined as follows (with precision and recall given equal weight):

\[ F = \frac{2PR}{P + R} \]
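As a small self-check of these formulas, the following Python function computes all four measures from the counts of a contingency matrix (the example counts in the comment are those reported later for RTE-1 in Table 4.4):

def evaluation_measures(tp, fp, fn, tn):
    # Accuracy, precision, recall and F-measure from a 2x2 contingency matrix.
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return accuracy, precision, recall, f_measure

# evaluation_measures(257, 245, 143, 155) gives approximately
# (0.51, 0.51, 0.64, 0.57), matching Table 4.5.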

4.2.1. Causal Non-symmetric Measure

A causal relation is a relation between two events such that one event causes (or enables) the other, as in "hard rain causes flooding" or "taking a train requires buying a ticket". The idea behind the knowledge acquisition is to use connective markers such as "because", "but" and "if" as linguistic cues. However, there is no guarantee that a given connective marker always signals the same type of causal relation. In this thesis we focus our attention on English sentences including the word "because". Consider the following examples, from which one can obtain several observations about the potential sources of causal knowledge.

• The laundry dried well today because it was sunny.

• The laundry dried well, though it was not sunny.

• If it was sunny, the laundry could dry well.

• The laundry dried well because of the sunny weather. → Cause(it is sunny, laundry dries well). A cue phrase is a word, a phrase or a word pattern which connects one event to the other with some relation; the causal relation between events is signalled by the cue phrase. Causal cue phrases are used to connect the cause and effect events. When the events are expressed by noun phrases, the cue phrase connecting them is generally a verb phrase. For example:

• The oral bacteria that cause gum disease appear to be the culprit. The verb "cause" is a cue phrase connecting two events expressed by noun phrases, "oral bacteria" and "gum disease". Several lexical pairs are assumed to carry the causal relation; the lexical pair "bacteria" and "disease" is an example of a causal lexical pair. If the term pair "the oral bacteria" and "gum disease" is causally related, we can infer that the pair "bowel bacteria" and "bowel disease" is causally related. Causal lexical pairs are learned from cause-effect pairs. The causal relation subsumes the cause and the explanation relations of Hobbs (1985). Hobbs's cause relation holds if a discourse segment stating a cause occurs before a discourse segment stating an effect; an explanation relation holds if a discourse segment

stating an effect occurs before a discourse segment stating a cause. The causal relation is encoded by adding a direction: in a graph, it can be represented by a directed arc going from cause to effect (Fig. 4.1).

Figure 4.1: Cause effect graph

The hypothesis behind our method is to treat the T-H pair as a causal relation, where the text T is a cause and the hypothesis H is its effect (i.e., T causes H). The non-symmetric similarity measure is based on the counts of co-occurrences of causal lexical pairs in the C-E pairs extracted from a corpus.

Table 4.2: Non-symmetric similarity measure

For each word ti in T
    For each word hj in H
        ce_j = causal frequency(ti, hj)
        e_j  = causal frequency(hj)
    max_i = max_j (ce_j / e_j)
non-symmetric(T, H) = Σ_i max_i

As we see in Table 4.2, the first causal frequency function is the count of co-occurrences of the words ti and hj related by the cue phrase (for example, in a sentence of the form "h ... because ... t") in the corpus of C-E pairs, and the second causal frequency function is the count of the word hj in the C-E pairs. This gives a non-symmetric score, because the co-occurrence pattern for "T causes H" is not the same as for "H causes T". For each T-H pair the system measures the causal relation between them and then decides whether the pair is true or false according to a certain entailment decision. The main difference between our experiments resides in the strategy used to decide the entailment relation.
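A minimal sketch of the measure in Table 4.2, assuming the cause-effect pairs have already been turned into two hypothetical count tables: pair_count[(cause_word, effect_word)], the number of C-E pairs in which the two words co-occur as cause and effect, and effect_count[effect_word], the number of times the word appears on the effect side:

def non_symmetric(t_words, h_words, pair_count, effect_count):
    # For each word ti of T, take the best causal ratio against the words
    # of H, then sum these maxima (see Table 4.2).
    score = 0.0
    for ti in t_words:
        best = 0.0
        for hj in h_words:
            ce = pair_count.get((ti, hj), 0)   # ti as cause, hj as effect
            e = effect_count.get(hj, 0)        # hj as an effect anywhere
            if e:
                best = max(best, ce / e)
        score += best
    return score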

4.2.2. Experimental Setting

In this subsection we explain in detail some of the blocks in Figure 4.2: first the preprocessing we used to represent the T-H pair, and second the data used to create the C-E pairs. The preprocessing applied to each T-H pair to be tagged is as follows:

• Tokenize.

• Remove stop words. Normally, an early step of processing is to divide the input text into units called tokens, where each token is either a word or something else like a number or a punctuation mark; this process is referred to as tokenization, and the treatment of punctuation varies. Our system simply strips the punctuation out. We consider as a word any object delimited by whitespace, which is the main clue used for tokenization in English (the RTE benchmark is in English). Finally, the system removes any stop words found in a stoplist. Stop words is the name given to words which are filtered out prior to, or after, processing of text. Hans Peter Luhn, one of the pioneers of information retrieval, is credited with coining the phrase and using the concept in his design. The stoplist is controlled by human input and not automated (Manning and Schütze, 1999). Common stop words are the, from and could. These words have important semantic functions in English, but they rarely contribute information if the criterion is a simple word-by-word match (Manning and Schütze, 1999).
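A minimal sketch of this preprocessing, with a deliberately tiny, illustrative stop list (the experiments would use a full stoplist):

import re

STOP_WORDS = {"the", "from", "could", "a", "an", "of", "to", "is"}  # illustrative only

def preprocess(text):
    # Whitespace tokenization, punctuation stripping, stop-word removal.
    tokens = text.lower().split()
    tokens = [re.sub(r"[^\w]", "", t) for t in tokens]
    return [t for t in tokens if t and t not in STOP_WORDS]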


Figure 4.2: General data flow of our system

In Figure 4.2 we show the general data flow of our method, where the data used to collect the frequencies of the causal lexical pairs come from sentences containing the cue phrase because. The sentences were extracted with the Sketch Engine system from a large corpus (ukWaC, from the Sketch Engine, http://www.sketchengine.co.uk/) by searching for sentences which contain the discourse marker because. Finally we split each sentence into two parts: one corresponding to the cause and one corresponding to its effect. The Sketch Engine is a corpus query system which allows the user to view word sketches, thesaurally similar words, and 'sketch differences', as well as the more familiar Corpus Query System (CQS) functions. The word sketches are fully integrated with the concordancing: by clicking on a collocate of interest in the word sketch, the user is taken to a concordance of the corpus evidence giving rise to that

collocate in that grammatical relation. If the user clicks on the word toast in the list of high-salience objects in the sketch for the verb spread, they will be taken to a concordance of contexts where toast (n) occurs as the object of spread (v).
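The cause-effect database can then be built from the extracted sentences. A minimal sketch follows, assuming the "because" sentences are available as a list of strings and reusing the preprocess function above; splitting naively on the first occurrence of the marker is a simplifying assumption.

from collections import Counter

def build_ce_counts(sentences):
    # Each sentence has the form "effect ... because ... cause";
    # count (cause word, effect word) pairs and effect-word occurrences.
    pair_count, effect_count = Counter(), Counter()
    for sentence in sentences:
        if " because " not in sentence:
            continue
        effect_part, cause_part = sentence.split(" because ", 1)
        effects = set(preprocess(effect_part))
        causes = set(preprocess(cause_part))
        for e in effects:
            effect_count[e] += 1
            for c in causes:
                pair_count[(c, e)] += 1
    return pair_count, effect_count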

4.2.3. Experimental Results

As described above, we varied the entailment decision in order to explore differences in the use of our non-symmetric measure:

• Experiment 1: The system penalizes a pair if the H → T relation is greater than the T → H relation.

• Experiment 2: The system determines the entailment decision based on a certain threshold (learned from the corpus). For each experiment we report the following information:

• Contingency matrix.

• Evaluation measures.

• Comparison with previous work.

• Accuracy depending on the task. The experiments are organized by RTE Challenge version (i.e., RTE-1, RTE-2, and RTE-3).

4.2.3.1. Experiment 1

The first experiment uses the non-symmetric measure with the following entailment decision:

Table 4.3: Entailment decision 1

if non-symmetric(T, H) > non-symmetric(H, T) then TRUE else FALSE

In Table 4.3 we see that the entailment decision basically penalizes a T-H pair when the H → T relation is stronger than the T → H relation: when T → H is the stronger direction, the hypothesis H is more probably the effect and the text T the cause, so it is more probable that the text T entails the hypothesis H. First, we present the method applied to RTE-1. The contingency table, Table 4.4, shows how many times the method misclassified the T-H pairs (i.e., fp and fn) and how many times the method got them right (tp and tn). From this table we can obtain measures to evaluate the entailment decision.

Table 4.4: RTE-1 contingency matrix results

                    Corpus
System         true        false
true            257         245
false           143         155

Table 4.4 also shows that our approach tends to say true.

Table 4.5: RTE-1 evaluation measures

Accuracy    Precision    Recall    F-measure
0.51        0.51         0.64      0.57

Table 4.5 shows that this approach obtains a better recall than precision; the entailment decision recovers a large proportion of the true pairs, but at the cost of also selecting many false ones.

Table 4.6: RTE-1 comparison with previous results

Method          Accuracy
GLICKMAN        0.56
LEVENSHTEIN     0.53
C-E             0.51
BLEU            0.49

To compare our approach with previous work we use the accuracy measure (the most common measure in the RTE Challenge). The proposed measure is compared with other non-symmetric measures. We compare our approach with:

• The BLEU algorithm RTE baseline (Pérez and Alfonseca, 2005).

• The probabilistic measure (Glickman et al., 2005).

• The modified Levenshtein measure (Tatar et al., 2007).

Table 4.6 shows the results. The best measure is Glickman's; ours is the lowest of the non-symmetric measures and only outperforms the BLEU algorithm.

Figure 4.3: RTE-1 comparison with previous results by tasks

Overall, the results of our approach were the lowest among the non-symmetric measures. However, if we compare by task, our measure outperforms the other non-symmetric measures in some of the tasks:

• QA.

• IR.

• MT. For RTE-2 the scores did not vary much; the method's tendency to say true is the same (Table 4.7).

Table 4.7: RTE-2 contingency matrix

                    Corpus
System         true        false
true            289         271
false           111         129

Table 4.8: RTE-2 evaluation measures

Accuracy    Precision    Recall    F-measure
0.52        0.51         0.72      0.60

Recall and precision increase in comparison with the previous data set (Table 4.8).

Table 4.9: RTE-2 comparison with previous results

Method      Accuracy
CORLEY      0.58
BLEU        0.53
C-E         0.52

For the RTE-2 we compare our method with:

• The BLEU algorithm.

• The WordNet modified similarity measure (Corley and Mihalcea, 2005). Again, our method only outperforms the baseline.

Figure 4.4: RTE-2 comparison with previous results by tasks

In the comparison by task our method did not outperform the method of Corley and Mihalcea (2005).

In RTE-3 the tendency to say true is the same; in fact, this tendency increases with every data set.

Table 4.10: RTE-3 contingency matrix

                    Corpus
System         true        false
true            303         268
false           107         122

Table 4.11: RTE-3 evaluation measures

Accuracy    Precision    Recall    F-measure
0.53        0.53         0.73      0.61

Again, the results did not vary much from the previous evaluations.

Table 4.12: RTE-3 comparison with previous results

Method      Accuracy
CORLEY      0.63
C-E         0.53
BLEU        0.50

Corley's measure is still the best one; it reaches the highest accuracy over the three data sets.

Figure 4.5: RTE-3 comparison with previous results by task

As with the previous data sets, we could not outperform the modified WordNet measure of Corley and Mihalcea (2005).

4.2.3.2. Experiment 2

The second experiment also uses the non-symmetric measure, with the following entailment decision:

Table 4.13: Entailment decision 2

if non-symmetric(T, H) > threshold then TRUE else FALSE

In Table 4.13 we see that this entailment decision is similar to a symmetric approach: we take a threshold learned from the corpus to decide whether a T-H pair is true or false. First we show the results of using different thresholds:

Table 4.14: RTE evaluation measures with entailment decision 2 and a threshold of 0.1

Dataset    Accuracy    Precision    Recall    F-measure
RTE-1      0.51        0.51         0.80      0.62
RTE-2      0.50        0.50         0.75      0.60
RTE-3      0.53        0.53         0.70      0.63

Table 4.15: RTE evaluation measures with entailment decision 2 and a threshold of 0.2

Dataset    Accuracy    Precision    Recall    F-measure
RTE-1      0.50        0.50         0.56      0.53
RTE-2      0.50        0.50         0.50      0.50
RTE-3      0.53        0.54         0.53      0.53

Table 4.16: RTE evaluation measures with entailment decision 2 and a threshold of 0.3

Dataset    Accuracy    Precision    Recall    F-measure
RTE-1      0.51        0.52         0.40      0.45
RTE-2      0.50        0.51         0.33      0.40
RTE-3      0.52        0.55         0.36      0.43

The threshold used in the comparison experiments was chosen to maximize precision, because in the previous experiments precision was very low.
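A minimal sketch of how such a threshold could be selected on development data, assuming gold labels are available as booleans; the candidate grid matches the thresholds reported above and is otherwise arbitrary:

def choose_threshold(scores, gold, candidates=(0.1, 0.2, 0.3)):
    # Pick the candidate threshold with the highest precision when pairs
    # scoring above it are labelled TRUE.
    best_t, best_p = None, -1.0
    for t in candidates:
        tp = sum(1 for s, g in zip(scores, gold) if s > t and g)
        fp = sum(1 for s, g in zip(scores, gold) if s > t and not g)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        if precision > best_p:
            best_t, best_p = t, precision
    return best_t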

Table 4.17: RTE-1 contingency matrix

                    Corpus
System         true        false
true            161         147
false           239         253

In Table 4.17 we can see that the tendency changes from true to false; this approach is stricter when tagging a T-H pair as "true".

Table 4.18: RTE-1 comparison with previous results

Method          Accuracy
GLICKMAN        0.56
LEVENSHTEIN     0.53
C-E             0.51
BLEU            0.49

Compared to the other measures there are no significant changes in accuracy. In the per-task comparison our method outperforms the others in:

• IE.

• QA.


Figure 4.6: RTE-1 comparison with previous results by tasks

For RTE-2 the tendency to say false continues (Table 4.19).

Table 4.19: RTE-2 contingency table

                    Corpus
System         true        false
true            161         147
false           239         253

Table 4.20: RTE-2 comparison with previous results

Method          Accuracy
GLICKMAN        0.56
LEVENSHTEIN     0.53
C-E             0.50
BLEU            0.49

Our method outperforms the BLEU algorithm but has lower accuracy than the other non-symmetric measures. The threshold approach, which is more similar to the other non-symmetric approaches, did not prove to be good on this data set.


Figure 4.7: RTE-2 comparison with previous results by tasks

In Figure 4.7, as on the other datasets, our method does not outperform the others in any task. For RTE-3 the results did not vary with respect to the other data sets (Tables 4.21 and 4.22, and Figure 4.8).

Table 4.21: RTE-3 contingency matrix

                    Corpus
System         true        false
true            161         147
false           239         253

Table 4.22: RTE-3 comparison with previous results

Method          Accuracy
GLICKMAN        0.56
LEVENSHTEIN     0.53
C-E             0.52
BLEU            0.49


Figure 4.8: RTE-3 comparison with previous results by tasks

These results show the low performance of our measure when used alone. We therefore carried out further experiments on the joint use of the symmetric and non-symmetric measures.

4.2.4. Symmetric and Non-symmetric Meta-classifier

It has been observed for related systems that a combination of separately trained features in the machine learning component can lead to an overall improvement in system performance, in particular if features from a more informed component and shallow ones are combined (Hickl et al. 2006; Bos and Markert, 2006; Butchart 2007). One of the main problems when machine learning classifiers are employed in practice is to determine whether the classifications assigned to new instances are reliable. The meta-classifier approach is one of the simplest approaches to this problem. Given a set of base classifiers, the approach is to learn a meta-classifier that predicts the correctness of each instance classification of the base classifiers. The sources of the meta-training data are the training instances. The meta-label of an instance indicates a reliable classification if the

instance is classified correctly by a base classifier; otherwise, the meta-label indicates an unreliable classification. The meta-classifier plus the base classifiers form one combined classifier. The classification rule of the combined classifier is to assign to an instance the class predicted by a base classifier if the meta-classifier decides that this classification is reliable. Some questions on how to design a meta-classifier are:

• What type of base classifiers do we have to learn for the meta-classifier, and for what type of data?

• What is the role of the accuracy of the base classifiers in the whole scheme?

• How do we represent the meta-data?

• How do we generate the meta-data?

4.2.5. Experimental Design

To answer the previous questions we designed a meta-classifier as follows (a sketch of the combined decision rule is given after the list):

• We used symmetric and non-symmetric measures as base classifiers.

• We chose the best symmetric measure (optimizing accuracy).

• We represented the T-H pairs as a BoW.

• We used as meta-data the RTE Challenge test sets.
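Under these design choices, a minimal sketch of the combined classifier could look as follows; the predict and reliable interfaces are hypothetical stand-ins for the trained base classifiers and meta-classifier:

def combined_predict(pair, base_classifiers, meta_classifier, default=True):
    # Return the prediction of the first base classifier whose classification
    # of this T-H pair the meta-classifier judges reliable.
    for base in base_classifiers:
        if meta_classifier.reliable(pair, base):
            return base.predict(pair)
    return default  # no base classification judged reliable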

4.2.6. Experimental Results

In this subsection we compare the results of the meta-classifier with those of each base classifier. As in the previous part, we carried out two experiments:

• Experiment 1: non-symmetric measure without threshold (base classifier) and symmetric cosine.

• Experiment 2: non-symmetric measure with threshold (base classifier) and symmetric cosine.

4.2.6.1. Experiment 1

The first experiment uses as base classifiers the cosine measure and our non-symmetric measure (Experiment 1: T → H > H → T).

Figure 4.9: RTE-1 meta-classifier coverage

Figure 4.9 shows the percentage coverage of the different base classifiers over the RTE-1 development data set, where A denotes our approach, B denotes the cosine measure, C denotes a correct classification and E a misclassification. For example, AC means that the non-symmetric measure produced a correct classification. Most T-H pairs could be resolved by both the symmetric and the non-symmetric measures, followed by the pairs resolved only by the symmetric measure and, last, those resolved only by the non-symmetric one. Finally, 25.57% of the instances could not be resolved by any measure.

Table 4.23: RTE-1 contingency table

                    Corpus
System         true        false
true            279         247
false           121         153

Our meta-classifier also has the tendency to say true (Table 4.23). From Table 4.24 we see an overall increase in the evaluation measures over the test data set.

Table 4.24: RTE-1 evaluation measures

Accuracy    Precision    Recall    F-measure
0.54        0.53         0.68      0.60

Figure 4.10: RTE-1 comparison with previous results

The meta-classifier only outperforms the other methods in the QA task. The cosine measure by itself outperforms them in the following tasks:

• PP.

• RC.

• IE.

• CD. The non-symmetric measure outperforms them in the following tasks:

• IR.

• MT.

• QA. For the RTE-2 development data set the coverage values are shown in Figure 4.11.


Figure 4.11: RTE-2 meta-classifier coverage

The behaviour of the coverage is similar to that on the RTE-1 development data set: most of the T-H pairs can be resolved by the two paradigms, the symmetric measure and the non-symmetric measure.

Table 4.25: RTE-2 contingency matrix

                    Corpus
System         true        false
true            296         267
false           104         133

Table 4.26: RTE-2 evaluation measures

Accuracy    Precision    Recall    F-measure
0.53        0.52         0.70      0.61


Figure 4.12: RTE-2 comparison with previous results by tasks

Next we present the results over the RTE-3 development data set (coverage) and test data set.

Figure 4.13: RTE-3 meta-classifier coverage

Table 4.27: RTE-3 contingency table

                    Corpus
System         true        false
true            314         230
false            96         160


Table 4.28: RTE-3 evaluation measures

Accuracy    Precision    Recall    F-measure
0.59        0.57         0.76      0.65

Figure 4.14: RTE-3 comparison with previous results by tasks

4.2.6.2. Experiment 2

In this subsection we present the results of the second meta-classifier over the RTE Challenges. On RTE-1 and RTE-2 the results do not differ greatly from Experiment 1, but on RTE-3 the system achieves the best accuracy of all our experiments, 0.61.


Figure 4.15: RTE-1 meta-classifier coverage

Table 4.29: RTE-1 contingency table

                    Corpus
System         true        false
true            235         202
false           165         198

Table 4.30: RTE-1 evaluation measures

Accuracy    Precision    Recall    F-measure
0.54        0.53         0.58      0.56

The results show an improvement when the meta-classifier is used instead of the non-symmetric measure by itself.


Figure 4.16: RTE-1 comparison with previous results

Figure 4.16 shows that the meta-classifier only outperforms the other methods in the QA task. The next tables show the results over the RTE-2 Challenge.

Figure 4.17: RTE-2 meta-classifier coverage

Table 4.31: RTE-2 contingency matrix

                    Corpus
System         true        false
true            227         208
false           173         192

Table 4.32: RTE-2 evaluation measures

Accuracy    Precision    Recall    F-measure
0.52        0.52         0.56      0.54

Figure 4.18: RTE-2 comparison with previous results by tasks

On RTE-3 we achieve the best results for our approach compared to our other experiments; moreover, the RTE-3 results are competitive with those of other participants in the same Challenge.


Figure 4.19: RTE-3 meta-classifier coverage

Table 4.33: RTE-3 contingency matrix

                    Corpus
System         true        false
true            264         163
false           146         227

Table 4.34: RTE-3 evaluation measures

Accuracy    Precision    Recall    F-measure
0.61        0.61         0.64      0.63

Figure 4.20: RTE-3 comparison with previous results by tasks

4.3. Conclusion

We proposed a non-symmetric similarity measure for RTE. Our method is unsupervised and not language dependent. We have shown that our measure has lower accuracy than the state-of-the-art methods but outperforms the RTE baseline. These results are significant because they are based on a very simple algorithm that relies on co-occurrences of causal pairs. We once more confirmed that the Web can be used as a lexical resource, this time for RTE. In future work we will explore the use of different meta-features for the meta-classifier, as well as linguistically motivated meta-features (such as syntactic units), and evaluate our method against RTE machine learning approaches.

5. Conclusions

In this thesis we used a variant of the Lesk algorithm for the WSD task based on web counts collected with a search engine, and we proposed a new non-symmetric similarity measure for RTE based on word counts collected from a corpus of causal pairs. We have shown that our Lesk variant outperforms some Lesk-based methods as well as the top unsupervised methods of the Senseval-2 exercise. These results are significant because they are based on a very simple algorithm that relies on co-occurrence scores for the senses of a target word. We have also shown that our non-symmetric measure has lower accuracy than the state-of-the-art methods but outperforms the RTE baseline.

We once more confirmed that the Web can be used as a lexical resource for WSD and RTE. In future work we will explore the use of a WSD method in an RTE approach. Our hypothesis is based on the Yarowsky algorithm (Yarowsky, 1995), an unsupervised learning algorithm for WSD that uses the "one sense per collocation" and the "one sense per discourse" properties of human languages. From observation, words tend to exhibit only one sense in a given discourse and in a given collocation. Therefore, if we treat the T and H of a pair as parts of the same discourse, they should exhibit the one-sense-per-discourse property, which could be useful for the entailment decision.

5.1. Contributions

The main contributions of this thesis are:

• A variant to the Lesk algorithm for WSD and its software implementation, which outperforms some Lesk based methods and outperforms the top unsupervised methods of the Senseval-2 exercise.

• A new non-symmetric similarity measure for RTE and its software implementation, which outperforms the baseline system.

• A meta-classifier based on a symmetric measure and a non-symmetric measure, which has a competitive accuracy.

• A system for RTE algorithm evaluation.

• A database of the RTE Challenges.

• A database of cause-effect pairs.

5.2. Publications

• Miguel Angel Ríos Gaona, Salvador Godoy Calderón, Alexander Gelbukh. Word Sense Disambiguation with the KORA- Algorithm. Advances in Intelligent and Information Technologies. Special issue of J. Research in Computing Science, ISSN 1870-4069, N 38, 2008, pp. 263–270.

• Miguel Angel Ríos Gaona, Alexander Gelbukh, Sivaji Bandyopadhyay. Web-based Variant of the Lesk Approach to Word Sense Disambiguation. MICAI 2009. Proceedings of 2009 Eighth Mexican International Conference on Artificial Intelligence, ISBN 978-0-7695-3933-1, IEEE CS Press, 2009, pp. 103–107.

• Miguel Angel Ríos Gaona, Alexander Gelbukh, Sivaji Bandyopadhyay. Recognizing Textual Entailment with Statistical Methods. MCPR 2010, 2nd Mexican Conference on Pattern Recognition (to be published).

• Miguel Angel Ríos Gaona, Alexander Gelbukh, Sivaji Bandyopadhyay. Recognizing Textual Entailment Using a Machine Learning Approach. MICAI 2010. Proceedings of 2010 9th Mexican International Conference on Artificial Intelligence (in review).

References

Agirre, E., D. Martinez. (2003). Exploring Automatic Word Sense Disambiguation with Decision Lists and the Web. In: Proc. of the COLING-2000.

Akhmatova, E. (2005). Textual Entailment Resolution via Atomic Propositions. In Proceedings of the First Challenge Workshop Recognising Textual Entailment, Pages 61-64, 33–36 April, 2005, Southampton, U.K.

Bar-Haim, R., I. Dagan, I. Greental, I. Szpektor, and M. Friedman. (2007). Semantic Inference at the Lexical-Syntactic Level for Textual Entailment Recognition. In Proceedings of the ACLPASCAL Workshop on Textual Entailment and Paraphrasing. Pages 1-9. 28-29 June, Prague,Czech Republic.

Bayer, S., J. Burger, L. Ferro, J. Henderson, A. Yeh. (2005). MITRE’s Submissions to the EU Pascal RTE Challenge. In Proceedings of the First PASCAL Challenge Workshop for Recognising Textual Entailment, Pages. 41–44, 11–13 April, 2005, Southampton, U.K.

Banerjee, S. and T. Pedersen. (2002). An adapted Lesk algorithm for word sense disambiguation using WordNet, CICLING 2002.

Bunescu, R.(2003): Associative Anaphora Resolution: A Web-Based Approach. In: Proc. of the EACL-2003 Workshop on the Computational Treatment of Anaphora, Budapest, Hungary.

Brill, E., J. Lin, M. Banko, S. Dumais, A. Ng. (2001): Data-intensive Question Answering. In: Proc. of the Tenth Text Retrieval Conference TREC-2001.

Bergmair, R. (2008). Monte Carlo Semantics: MCPIET at RTE4. In Text Analysis Conference (TAC 2008) Workshop - RTE-4 Track. National Institute of Standards and Technology (NIST). November 17-19, 2008. Gaithersburg, Maryland, USA.

Bos, J. and K. Markert. (2005). Recognising Textual Entailment with Logical Inference. In HLT 05: Proceedings of the conference on Human Language Technology and

Empirical Methods in Natural Language Processing, Pages 628-635, Association for Computational Linguistics, Morristown, NJ, USA.

Bos, J., S. Clark, M. Steedman, J. Curran, and J. Hockenmaier. (2004). Wide-coverage semantic representations from a ccg parser. In Proc of the 20th International Conference on Computational Linguistics; Geneva, Switzerland.

Clark, P., W.R. Murray, J. Thompson, P. Harrison, J. Hobbs, C. Fellbaum. (2007). On the Role of Lexical and World Knowledge in RTE3. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing. Pages 54-59. 28-29 June, Prague, Czech Republic.

Clark, P. and P. Harrison. (2008). Recognizing Textual Entailment with Logical Inference. In Text Analysis Conference (TAC 2008) Workshop - RTE-4 Track. National Institute of Standards and Technology (NIST). November 17-19, 2008. Gaithersburg, Maryland, USA.

Chambers, N., D. Cer, T. Grenager, D. Hall, C. Kiddon, B. MacCartney, M. C. Marneffe, D. Ramage, E. Yeh, C. Manning. (2007). Learning Alignments and Leveraging Natural Logic. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing. Pages165-170. 28-29 June, Prague, Czech Republic.

Chierchia, Gennaro and S. McConnell-Ginet. (2000). Meaning and grammar (2nd ed.): an introduction to semantics. MIT Press, Cambridge, MA, USA.

Chklovski, T. and P. Pantel. (2004). Verbocean: Mining the web for fine-grained semantic verbrelations. In Proceedings of EMNLP 2004, Pages 33–40, Barcelona, Spain, July.Association for Computational Linguistics.

Daelemans, W., J. Zavrel, K. Van der Sloot, and A. Van den Bosch. (1998). Timbl: Tilburg memory based learner, version 1.0, reference guide.

Dagan, I. and O. Glickman. (2004). Probabilistic textual entailment: Generic applied modeling of language variability. PASCAL workshop on Text Understanding and Mining.

Dagan, I., O. Glickman, and B. Magnini. (2005). The pascal recognising textual entailment challenge. Proceedings of the PASCAL Challenges Workshop on Recognising Textual Entailment.

Deerwester, S., S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. (1990). Indexing by Latent Semantic Analysis. In Journal of the American Society for Information Science.

Delmonte, R., S. Tonelli, M. Aldo Piccolino Boniforti, A. Bristot, E. Pianta. (2005). VENSES – a Linguistically-Based System for Semantic Evaluation. In Proceedings of the First PASCAL Challenge Workshop for Recognising Textual Entailment, Pages. 49–52, 11–13 April, 2005,Southampton, U.K.

Cabrio, E., M. Kouylekov, and B. Magnini. (2008). In Text Analysis Conference (TAC 2008) Workshop - RTE-4 Track. National Institute of Standards and Technology (NIST). November 17-19, 2008. Gaithersburg, Maryland, USA.

Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. MIT Press, Cambridge, Mass.

Ferrés, D., and H. Rodríguez. (2007). Machine Learning with Semantic-Based Distances Between Sentences for Textual Entailment. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing. Pages60-65. 28-29 June, Prague, Czech Republic.

Gonzalo, J., F. Verdejo, I. Chugar. (2003). The Web as a Resource for WSD. In: 1st MEANING Workshop, Spain.

Grefenstette, G. (1999): The World Wide Web as a resource for example-based Machine Translation Tasks. In: Proc. of Aslib Conference on Translating and the Computer. London.

Glickman, O., I. Dagan, and M. Koppel. (2006). A lexical alignment model for probabilistic textual entailment. In Quinonero-Candela, J.; Dagan, I.; Magnini, B.; d'Alché-Buc, F. (Eds.) Machine Learning Challenges. Lecture Notes in Computer Science, Vol. 3944, Pages 287-298, Springer.

Harabagiu, S., M. Pasca, and S. Maiorano. (2000). Experiments with Open-Domain Textual Question Answering. COLING 2000.

Herrera, J., A. Peas, and F. Verdejo. (2005). Textual Entailment Recognition Based on Dependency Analysis and WordNet. In Proceedings of the First Challenge Workshop Recognising Textual Entailment, Pages. 21-24, 33–36 April, 2005, Southampton, U.K.

Hickl, A. and J. Bensley. (2007). A Discourse Commitment-Based Framework for Recognising Textual Entailment. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing. Pages 185-190. 28-29 June, Prague, Czech Republic.

Iftene, A., A. Balahur-Dobrescu. (2007). Hypothesis Transformation and Semantic Variability Rules Used in Recognising Textual Entailment. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing. Pages 125-130. 28-29 June, Prague, Czech Republic.

Inkpen, D., D. Kipp, and V. Nastase. (2006). Machine Learning Experiments for Textual Entailment. In Proceedings of the Second Challenge Workshop Recognising Textual Entailment, Pages 17-20, 10 April, 2006, Venice, Italy.

Jijkoun, V. and M. de Rijke. (2005). Recognising Textual Entailment Using Lexical Similarity. Proceedings of the PASCAL Challenges Workshop on Recognising Textual Entailment.

Joachims, T. (2002). Learning to Classify Text Using Support Vector Machines. Kluwer.

Kamp, H. and U. Reyle. (1993). From Discourse to Logic. Introduction to Model theoretic Semantics of Natural Language, Formal Logic and Discourse Representation Theory. Kluwer, Dordrecht, Netherlands.

Kilgarriff, A. and J. Rosenzweig. (2000). Framework and Results for English SENSEVAL, Computers and the Humanities, 34, (pp. 15-48).

Kouylekov, M. and B. Magnini. (2005). Recognising Textual Entailment with Tree Edit Distance Algorithms. In Proceedings of the First Challenge Workshop Recognising Textual Entailment, Pages 17-20, 25–28 April, 2005, Southampton, U.K.

Kozareva, Z. and A. Montoyo. (2006). MLEnt: The Machine Learning Entailment System of the University of Alicante. In Proceedings of the Second Challenge Workshop Recognising Textual Entailment, Pages 17-20, 10 April, 2006, Venice, Italy.

Lesk, M. (1986). Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone, Proceedings of SIGDOC.

Li, B., J. Irwin, E.V. Garcia, and A. Ram. (2007). Machine Learning Based Semantic Inference: Experiments and Observations at RTE-3. In Proceedings of the ACL- PASCAL Workshop on Textual Entailment and Paraphrasing. Pages 159-164. 28-29 June, Prague, Czech Republic.

Litkowski, K. C. (2002). Sense Information for Disambiguation: Confluence of Supervised and Unsupervised Methods, Proceedings of the SIGLEX / SENSEVAL Workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia.

Lin, C. and E. Hovy. (2003). Automatic evaluation of summaries using n-gram co- occurrence statistics. In Proceedings of the Human Technology Conference 2003 (HLT-NAACL-2003).

Manning, C., and H. Schütze. (1999). Foundations of Statistical Natural Language Processing, MIT Press. Cambridge, MA.

McCarthy, D., J. R. Koeling, and J. Carroll. (2004) Finding predominant senses in untagged text. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics. Barcelona, Spain. pp 280–287.

Mihalcea, R., D.I. Moldovan. (1999): An Automatic Method for Generating Sense Tagged Corpora. In: Proc. of the 16th National Conf. on Artificial Intelligence. AAAI Press.

Mihalcea, R., P. Tarau, E. Figa. (2004). PageRank on Semantic Networks with Application to Word Sense Disambiguation, COLING 2004.

Miller, G. 1991. WordNet: An on-line lexical database. International Journal of Lexicography.

Montalvo-Huhn, O., S. Taylor (2008). Textual Entailment - Fitchburg State College. In Text Analysis Conference (TAC 2008) Workshop - RTE-4 Track. National Institute of Standards and Technology (NIST). November 17-19, 2008. Gaithersburg, Maryland, USA.

Pazienza, M.T., M. Pennacchiotti, and F. M. Zanzotto. (2005). Textual Entailment as Syntactic Graph instance: a rule based and a SVM based approach. In Proceedings of the First Challenge workshop Recognising Textual Entailment, Pages 9-12, 25– 28 April, 2005, Southampton, U.K.

Resnik, P., and D. Yarowsky. (1997). A perspective on word sense disambiguation. In Proceedings of ACL Siglex Workshop on Tagging With Lexical Semantics, Why, What and How? Washington DC, April.

Pérez, D. and E. Alfonseca. (2005). Application of the Bleu algorithm for recognising textual entailments. In Proceedings of the First Challenge Workshop Recognising Textual Etailment, Pages 9-12, 11–13 April, 2005, Southampton, U.K.

Quinlan, J.R. (2000). C5.0 Machine Learning Algorithm http://www.rulequest.com.

Raina, R., A. Haghighi, C. Cox, J. Finkel, J. Michels, K. Toutanova, B. MacCartney, M.C. Marneffe, C. Manning, A. Ng. (2005). Robust Textual Inference using Diverse Knowledge Sources. In Proceedings of the First Challenge Workshop Recognising Textual Entailment, Pages 57-60, 11–13 April, 2005, Southampton, U.K.

Gaizauskas, R., Y. Wilks (1998). Information Extraction: Beyond Document Retrieval. University of Sheffield, Computer Science Dept. Memoranda in Computer and Cognitive Science, CS-97-10.

Pradhan, S., E. Dmitriy and M. Palmer. (2007). SemEval-2007 Task 17: English Lexical Sample, SRL and All Words”. Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), pp 87–92, Prague, June 2007. Association for Computational Linguistics.

Santamaria, C., J. Gonzalo, and F. Verdejo. (2003): Automatic Association of WWW Directories to Word Senses. Computational Linguistics (2003), Vol. 3, Issue 3 – Special Issue on the Web as Corpus, 485–502.

Siblini, R., and L. Kosseim. (2008). Using Ontology Alignment for the TAC RTE Challenge. In Text Analysis Conference (TAC 2008) Workshop - RTE-4 Track. National Institute of Standards and Technology (NIST). November 17-19, 2008. Gaithersburg, Maryland, USA.

Tatu, M., and D. Moldovan. (2007). COGEX at RTE3. In Proceedings of the ACL- PASCAL Workshop on Textual Entailment and Paraphrasing. Pages 22-27. 28-29 June, Prague, Czech Republic.

Tatu, M., B. Iles, J. Slavick, A. Novischi, and D. Moldovan. (2006). COGEX at the Second Recognising Textual Entailment Challenge. In Proceedings of the Second Challenge Workshop Recognising Textual Entailment, Pages 17-20, 10 April, 2006, Venice, Italy.

Vanderwende, L., D. Coughlin, and B. Dolan. (2005). What Syntax can Contribute in Entailment Task. In Proceedings of the First Challenge Workshop Recognising Textual Entailment, Pages. 13-16, 11–13 April, 2005, Southampton, U.K.

Vasilescu, F., P. Langlais, and G. Lapalme. (2004).Evaluating variants of the Lesk approach for disambiguating words, LREC 2004.

Volk, M. (2002). Using the Web as Corpus for Linguistic Research. In: Catcher of the Meaning. Pajusalu, R., Hennoste, T. (Eds.). Dept. of General Linguistics 3, University of Tartu, Germany.

Wu, D. (2005). Textual Entailment Recognition Based on Inversion Transduction Grammars. In Proceedings of the First Challenge Workshop Recognising Textual Entailment, Pages 13-16, 37–40 April, 2005, Southampton, U.K.

Yarowsky, D. (1995). Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. ACL 1995.
