The Value of Learning Linguistic Structures in Relation Extraction
Jan Overgoor (5836387)
[email protected]

Bachelor thesis
Credits: 15 EC
Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam, Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisor: Maarten van Someren
Institute of Informatics, Faculty of Science
University of Amsterdam
Science Park 904, 1098 XH Amsterdam

July 9th, 2010

Abstract

The extraction of instances of conceptually known relations is an important sub-task of automatic ontology population. We consider an approach that uses a local alignment kernel to learn to recognize the linguistic dependency paths of these relations, using WordNet as a similarity function between concepts. We evaluate its performance by comparing it with a probabilistic approach based on the co-occurrence of the related entities. Our results show that the structural approach can achieve good results, especially when only a small portion of the data is available.

1 Introduction

Is a question like "What is the capital of France?" answerable without structural knowledge of the sentences that might contain this information, or is a simple model of co-occurrence of the relevant concepts enough? This question exemplifies the main difference between two opposing schools within the field of Information Extraction: one that uses semantic and linguistic knowledge, and one that relies solely on statistical information.

The field of Information Extraction addresses the problem of extracting knowledge from generally unstructured data. This is an essential step in processing human-created content so that it can be used in computer-driven applications such as question-answering systems and semantic search. These are in turn important steps towards fully automated structured-information processing systems capable of qualitative reasoning over what has come to be known as the Semantic Web.

The problem of extracting semantic information from unstructured text is generally considered to consist of two sub-tasks: the recognition and disambiguation of entities, and the identification of relationships between them. An example of a system that performs both tasks is OpenCalais (http://www.opencalais.com/), developed by Thomson Reuters, which, given a selection of unstructured text, returns a set of RDF tags identifying the entities present in it, their types, and some elementary relations between them.

The Semantic Web [13] is the concept of data being semantically interpretable by machines, through the use of ontologies. Of the two ways of realizing this idea, manually constructing ontologies or automatically building them from already produced data, the latter relies heavily on Information Extraction techniques. Building a complete ontology from unstructured data consists of two steps: building the conceptual structure of the ontology, and finding instances of the classes and relations that make up the ontology ('ontology population') [9]. Relation Extraction is a candidate approach for both tasks. Relations can hold between general concepts or between instances. An example of a general relation is given in the sentence "Your leg is part of your body", where 'leg' and 'body' are general categories. A relation can also hold between instances, as in "Leidseplein is a part of Amsterdam", where 'Leidseplein' and 'Amsterdam' are instances. In this report we examine two different tasks associated with the extraction of binary relations R(e_s, e_t).
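Throughout, a relation instance pairs a source entity e_s with a target entity e_t inside a single sentence. As a purely illustrative sketch (the representation and field names are ours, not taken from any of the systems discussed), such an instance could be captured as:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RelationInstance:
    """A binary relation mention R(e_s, e_t) within a single sentence."""
    sentence: str     # the sentence S containing both entities
    source: str       # the source entity e_s, e.g. "Leidseplein"
    target: str       # the target entity e_t, e.g. "Amsterdam"
    source_type: str  # the type T_s of the source entity
    target_type: str  # the type T_t of the target entity
    relation: Optional[str] = None  # the relation label R; unknown before classification

example = RelationInstance(
    sentence="Leidseplein is a part of Amsterdam.",
    source="Leidseplein", target="Amsterdam",
    source_type="Location", target_type="City",
    relation="part-of",
)
```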
The first task, Relation Extraction (RE), refers to finding the right relation between entities that have already been identified in a text. In our case we consider a simplified version that only looks for relations between entities within single sentences. The input consists of a sentence (S), a source and a target entity marked in it (e_s and e_t), and the types of the entities (T_s and T_t). The task is to classify each entity pair as being an instance of a certain relation.

The second task, Related Entity Finding (REF), is to return a set of entities that can be said to have a certain relationship with a given entity. An example question would be "What football players play for Ajax?". The input here is a corpus of text with (named) entities tagged and typed, and a query consisting of a source entity (e_s), a relation (R) and the type of the target entity (T_t). The task is then to return a ranked list of relevant target entities (e_t).

As mentioned earlier, there are two essentially different approaches to the problem of relation extraction in general. The first aims to find semantic relationships between entities through the syntactic structures that generally identify them. This has also been one of the ultimate goals of the field of Natural Language Processing, the results of which are employed, in combination with results from the field of Machine Learning, by this 'semantic' school of relation extraction. However, supervised learning methods require a set of labeled instances as training data, in this case tagged sentences, which is not always available. This becomes especially prominent in large-scale situations with a lot of data [6].

The second school has its roots in the field of Information Retrieval. It aims to identify relations between entities based on their co-occurrence. This family of more lightweight approaches has repeatedly been shown to be more effective than their semantic relatives in a variety of situations [10]. When co-occurrence is used for relation extraction, for example, Recall will be relatively high, as each occurrence of a relation in the content is reflected in the co-occurrence values. At the same time, Precision will be lower, because pure co-occurrence methods do not distinguish between different relations. This is a problem in situations with ambiguous relations, when more than one relation is possible between two types of entities: a rarer relation might not be retrievable amid the high frequencies of a similar but more common one.

This trade-off between Precision and Recall is the central issue for probabilistic approaches in general: how can frequencies over data be used to say something about its semantics? The main question of this report concerns this issue in the context of relation extraction: how much knowledge about the structure of a relation is necessary for it to be extractable? To address this tension we compare the performance of the semantic approach we consider with a statistical approach to the same task.

1.1 Context

One line of research stemming from the semantic school sees relation extraction as supervised classification of relations based on their syntactic structure. The first case of using the syntactic structure of a sentence for relation extraction is found in [12], who analyze the dependency trees of sentences using hand-written patterns to extract causal relationships.
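To give a flavor of what matching hand-written patterns against dependency structure involves, the following is a toy illustration of our own construction (not the actual patterns of [12]), in which the dependency path between two entities is a flat list of labels and "*" acts as a wildcard:

```python
def matches(path, pattern):
    """True if every pattern element equals the corresponding path
    element, treating "*" as a wildcard."""
    return len(path) == len(pattern) and all(
        p in ("*", step) for step, p in zip(path, pattern)
    )

# Dependency path for "Smoking causes cancer":
# smoking <-nsubj- causes -dobj-> cancer
path = ["nsubj", "cause", "dobj"]
causal_pattern = ["nsubj", "cause", "*"]

print(matches(path, causal_pattern))  # True
```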
An early example of relation extraction with machine learning techniques is [20], who devise various kernels to automatically extract the relevant patterns from a set of labeled instances of the relation. A kernel is a function that implicitly compares elements by mapping them to a higher-dimensional space, where they may be linearly separable even when they are not in their original form. An important feature of kernel-based methods is that they do not train a classifier on a fixed set of features, but instead directly compute the similarity between instances. This makes these methods more suitable for relatively high-dimensional problems.

A kernel for comparing dependency trees was first proposed by [8]. One of the problems they encountered was that Recall was low, because the set of negative instances is difficult to characterize due to its heterogeneity. An improved method would therefore first identify candidate relations before trying to perform classification. This line of research was continued by [5] and [4], who propose a kernel that only considers the shortest dependency path between the entities in a sentence. Finally, [11] use a local-alignment kernel to learn to detect the syntactic structures that identify a relation, using different WordNet distance functions as similarity measures between concepts. This last method is elaborated upon in Section 2.

For a co-occurrence-based method we look at [3], which explores the possible utility of co-occurrence models for the task of finding related entities. In addition to using co-occurrence frequencies, they filter the candidate entities by type and use statistical language models of the documents where the entities occur as contextual information. A landmark paper in this field is [18], which uses a probabilistic model to represent an entity by the language that is used to describe it. These 'entity models' contain more structure than a fully probabilistic representation like 'bag of words', but are more loosely defined than, for example, structured database representations. This started a line of research exploring the utility of such semi-structured representations of entities for tasks like named entity recognition and question answering. One effort that used this representation is [17], which used spatial proximity between entities and the terms describing them for the task of finding experts on a specific topic. A different approach to the same task is [1], which associates relevant documents with entities, in this case also experts. The research of [3] stemmed from this, except that they model entity-entity relations directly, instead of entity-document relations. One interesting example of co-occurrence being used alongside other techniques is [9], which uses a semi-supervised learning procedure to find documents that might contain instances of the target relation. These approaches are elaborated on in Section 3.

The Text REtrieval Conference (TREC) organizes yearly information retrieval competitions to provide a standardized evaluation infrastructure for newly developed systems.
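To make the kernel methods discussed in this section more concrete, the following is a minimal sketch of a local alignment (Smith-Waterman style) score between two dependency paths, with a trivial stand-in for the WordNet-based similarity of [11]; it illustrates the general technique only and is not the implementation evaluated in this report:

```python
def local_alignment_score(path_a, path_b, sim, gap=-1.0):
    """Smith-Waterman local alignment over two sequences of tokens.
    `sim(a, b)` scores a substitution; gaps carry a fixed penalty.
    Returns the best local alignment score (>= 0)."""
    n, m = len(path_a), len(path_b)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0.0,
                H[i - 1][j - 1] + sim(path_a[i - 1], path_b[j - 1]),
                H[i - 1][j] + gap,  # gap in path_b
                H[i][j - 1] + gap,  # gap in path_a
            )
            best = max(best, H[i][j])
    return best

# Trivial stand-in for a WordNet similarity between concepts.
def toy_sim(a, b):
    return 2.0 if a == b else -1.0

p1 = ["nsubj", "acquire", "dobj"]
p2 = ["nsubj", "buy", "dobj"]
print(local_alignment_score(p1, p2, toy_sim))  # 3.0
```

Note that a raw Smith-Waterman score like this is not guaranteed to be a valid (positive semi-definite) kernel; local alignment kernels in the literature address this, for instance by summing over all alignments rather than taking only the best one.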