A Quantitative Approach to Concept Analysis
Rogelio Nazar

Doctoral Thesis, UPF, 2010

Thesis supervisors:
Dr. Jorge Vivaldi Palatresi (Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra)
Prof. Dr. Leo Wanner (Departament de Tecnologies de la Informació i les Comunicacions, Universitat Pompeu Fabra)

Acknowledgments

First of all, I thank my parents for all their love, protection and education. I also thank my brothers, sisters, nieces and friends whom I left in Argentina, for their support and for forgiving my absence all these years.

I also express my gratitude and appreciation to my advisors, Jorge Vivaldi and Leo Wanner, for the knowledge they shared with me, and for their enthusiasm and patience.

This thesis would have been impossible without the support of the Institut Universitari de Lingüística Aplicada. I am indebted to the three chairs of the Institut during my time there, Teresa Cabré, Teresa Turell and Mercè Lorente, for their intellectual, financial and affective support. This research was supported first through an ADQUA scholarship from the Generalitat de Catalunya and later by a contract with the Ministry of Education of Spain.

Hinrich Schütze provided valuable feedback on the first steps of this thesis during my research stay at the University of Stuttgart, and also agreed to write an external evaluation. Juan Manuel Torres-Moreno also agreed to write an evaluation and to be a member of the jury. I thank as well those who agreed to be members or alternate members of the jury: Teresa Cabré, Guadalupe Aguado, Jean Véronis, Horacio Saggion, Maarten Janssen, Horacio Rodríguez, Mick O'Donnell, Kim Gerdes and Irene Castellón. Jean Véronis also shared with me the dataset needed to replicate his experiments and provided feedback.

I am very much indebted to my friend Chris Norrdin, who corrected many of my grammar errors. He is not to blame for the remaining errors, because the text underwent heavy post-editing.
The same goes for errors of fact or reasoning, which are my own and not my supervisors'.

I benefited from discussions with my colleagues at the Institut; my thanks go to Teresa Cabré, Jaume Llopis, Gabriela Ferraro, Sabela Fernández, Jenny Azarian, Evanthia Triantafyllidou, Marta Sanchez, Maarten Janssen, Teun van Dijk, Irene Renau, Janet deCesaris, Paz Battaner, Vanesa Vidal, Araceli Alonso, Raquel Casesnoves, Rosa Estopa, Lluís De Yzaguirre, Manuel Souto, Albert Morales, Juan Manuel Pérez and many others. Horacio Panella and Francisco Vallejo programmed a nice Flash interface for my system. Ricardo Baeza-Yates provided a corpus and access to the Yahoo BOSS platform. Vanessa Alonso, Sylvie Hochart and Jesus Carrasco provided indispensable tactical support.

A group of specialists in the field of medicine kindly agreed to evaluate the results of the knowledge extraction algorithms presented in this thesis: Antoni Valero, Jaume Franci, Hugo Vitale, Jorge Nazar and Graciela Nazar. A group of terminologists also contributed to the evaluation: Amor Montane, Carles Tebé, Carme Bach, Natalia Seghezzi, Irina Kostina, Alba Coll, Gabriel Reus and Iria da Cunha.

Finally, thanks to Adriana Gorri for always being there and, after all, because it was her idea to come to this beautiful country.

Abstract

The present research studies the distribution of lexis in corpora. Its aim is to inquire into the relations that exist between concepts through the occurrences of the terms that designate them. The initial hypothesis is that it is possible to analyze concepts by studying the contexts of occurrence of the terms, more precisely by taking into account the statistics of term co-occurrence in context windows of n words. The thesis presents a computational model in the form of graphs of term co-occurrence in which each node represents a single-word or multiword term. Given a query term, a graph for that term is derived from a given corpus.
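The graph construction outlined in this abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function name `cooccurrence_graph`, the whitespace tokenizer, the restriction to single-word terms, and the choice to center each context window on an occurrence of the query term are all assumptions made for illustration. Terms that share a window are linked by an edge, and a repeated co-occurrence increases the weight of the edge that already exists.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(texts, query, window=5):
    """Build a weighted co-occurrence graph around `query` (hypothetical helper).

    Returns a Counter that maps an alphabetically ordered term pair (a, b)
    to the number of context windows in which both terms appeared.
    """
    edges = Counter()
    for text in texts:
        tokens = text.lower().split()  # naive tokenizer: an assumption
        for i, token in enumerate(tokens):
            if token != query:
                continue
            # Assumed window: up to `window` tokens on each side of the query.
            context = set(tokens[max(0, i - window): i + window + 1])
            # Link every pair of distinct terms found in the window;
            # a pair seen before gets a heavier edge instead of a new one.
            for a, b in combinations(sorted(context), 2):
                edges[(a, b)] += 1
    return edges


texts = ["the virus infects the cell", "a virus is not a cell"]
graph = cooccurrence_graph(texts, query="virus", window=4)
```

In this toy corpus the pair ("cell", "virus") falls inside a window in both texts, so its edge ends up heavier than that of a pair such as ("infects", "virus"), which appears in only one. A realistic version would also need multiword term detection and some filtering of function words, neither of which is attempted here.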
As texts are analyzed, every time two terms appear together in the same context window, the nodes that represent these terms are connected by an arc or, if they already had one, their connection is strengthened. This graph is presented as a model of learning, and as such it is evaluated with experiments in which a computer program solves tasks that involve some degree of concept analysis. Within the scope of concept analysis, one of those tasks is to tell whether a word or a sequence of words in a given text refers to a specific concept and to derive some of the basic properties of that concept, such as its taxonomic relations. Other tasks are to determine when the same word refers to more than one concept (cases of homonymy or polysemy) and when different words refer to the same concept (cases of synonymy or of equivalence between languages or dialectal varieties).

As a linguistic interpretation of these phenomena, this thesis derives a generalization in the realm of discourse analysis: the properties of the co-occurrence graphs are possible because authors of argumentative texts tend to name some of the basic properties of the concepts that they introduce in discourse. This happens mainly at the beginning of texts, in order to ensure that reader and writer share the same principles. Each author will predicate different information about a given concept, but authors who treat the same topic will tend to start from a common base, and this coincidence will be expressed in the selection of the vocabulary. Because of its cumulative effect, this coincidence in the selection of the vocabulary can be studied by statistical means.

Resumen (Summary)

The present work focuses on the study of the distribution of lexis in corpora, and its purpose is to analyze the relations that exist between concepts through the terms that designate them.
The starting hypothesis is that we can analyze concepts by studying the contexts in which the terms that designate them appear, using for this purpose the statistics of term co-occurrence in context windows of n words. The thesis presents a computational model in the form of term co-occurrence graphs in which the nodes represent single-word or multiword terms. Given an analyzed term, a graph for that term is derived from a corpus. As the texts are analyzed, every time two terms appear together in the same context window, the nodes that represent them are connected by an arc or, if they were already connected, their connection is strengthened. This graph is presented as a model of learning, and as such it is evaluated through experiments in which a computer solves tasks typical of concept analysis. These tasks include determining when a word or sequence of words in a text refers to a defined concept, as well as determining some of the most important properties of that concept, such as its taxonomic relations. Other tasks are to determine when the same word can refer to more than one concept (cases of homonymy or polysemy) or when different words refer to the same concept (cases of synonymy or of equivalence between languages or dialectal varieties). As a linguistic interpretation of these phenomena, this thesis draws a generalization at the level of discourse analysis: the properties of lexical co-occurrence graphs arise from the tendency of authors of argumentative texts to mention some of the most important properties of the concepts they introduce in discourse. This happens above all at the beginning of the discourse, in order to ensure that reader and author share the same principles.
Each author will predicate different information about a given concept, but authors who deal with the same topic will tend to start from a common base, and this coincidence will show in the selection of the vocabulary, which, because of its cumulative effect, can be studied statistically.

Table of Contents

Chapter 1: Introduction
1.1 Basic Definitions
1.2 General Outline of the Approach
1.3 The Contribution of This Thesis to Theoretical Linguistics
1.4 The Contribution of This Thesis to Applied Linguistics
1.5 Structure of This Thesis

Chapter 2: Working Hypotheses
2.1 Main Hypotheses
2.2 Empirical Evidence in Support of the Hypotheses
2.3 Limits of the Hypotheses

Chapter 3: Basic Notions of Concept Analysis
3.1 Further Delimitation of the Meaning of the Term Concept
3.2 Historical Roots
3.3 Modern Semantics
3.4 Cognitive Perspectives
3.5 Neurolinguistic Accounts