The value of learning linguistic structures in relation extraction

Jan Overgoor 5836387 [email protected]

Bachelor thesis Credits: 15 EC

Bachelor Opleiding Kunstmatige Intelligentie (Bachelor Programme in Artificial Intelligence), University of Amsterdam, Faculty of Science, Science Park 904, 1098 XH Amsterdam

Supervisor: Maarten van Someren, Institute of Informatics, Faculty of Science, Science Park 904, 1098 XH Amsterdam

July 9th, 2010

Abstract

The extraction of instances of conceptually known relations is an important sub-task of automatic ontology population. We consider an approach that uses a local alignment kernel to learn to recognize the linguistic dependency paths that express these relations. This approach uses WordNet as a similarity function between concepts. We evaluate its performance by comparing it with a probabilistic approach based on the co-occurrence of the related entities. Our results show that the alignment-based approach performs well, especially when only a small portion of the training data is available.

1 Introduction

Is a question like "What is the capital of France?" answerable without structural knowledge of the sentences that might contain this information, or is a simple model of co-occurrence of the relevant concepts enough? This question exemplifies the main difference between two opposing schools of approaches in the field of Information Extraction: one that uses semantic and linguistic knowledge, and one that relies solely on statistical information.

The field of Information Extraction addresses the problem of extracting knowledge from generally unstructured data. This is an essential step for processing human-created content so that it can be used in computer-driven applications like question-answering systems and semantic search. These are in turn important steps for the development of fully automated structured-information processing systems which are capable of qualitative reasoning over what has come to be known as the Semantic Web. The problem of extracting semantic information from unstructured text is generally considered to consist of two sub-tasks: the recognition and disambiguation of entities, and the identification of relationships between them. An example system that performs both these tasks is OpenCalais¹, developed by Thomson Reuters, which, given a selection of unstructured text, returns a set of RDF tags identifying the entities present in it, their types and some elementary relations between them.

The Semantic Web [13] is the concept of data being semantically interpretable by machines, with the use of ontologies. Of the two ways of realizing this idea, manually constructing ontologies or automatically building them from already produced data, the latter relies heavily on Information Extraction techniques. Building a complete ontology from unstructured data consists of two steps: building the conceptual structure of the ontology and finding instances of the classes and relations that make up the ontology ('ontology population') [9]. For both tasks Relation Extraction is a candidate approach. Relations can hold between general concepts or between instances. An example of a general relation is given in the sentence "Your leg is part of your body": 'leg' and 'body' are general categories. A relation can also hold between instances, for example "Leidseplein is a part of Amsterdam", where 'Leidseplein' and 'Amsterdam' are instances.

In this report we examine two different tasks associated with the extraction of binary relations (R(e_s, e_t)). The first task, Relation Extraction (RE), refers to finding the right relation between entities that are already identified in a text. In our case we will consider a simplified version which only considers relations between entities in

¹ http://www.opencalais.com/

single sentences. The input consists of a sentence (S), a source and a target entity marked in it (e_s and e_t), and the types of the entities (T_s and T_t). The task here is to classify each entity pair as being an instance of a certain relation. The second task, Related Entity Finding (REF), is to return a set of entities that can be said to have a certain relationship with a given entity. An example question would be "What football players play for Ajax?". The input here is a corpus of text with (named) entities tagged and typed in it, and a query consisting of a source entity (e_s), a relation (R) and the type of the target entity (T_t). The task is then to return a ranked list of relevant target entities (e_t).

As mentioned earlier, there are two essentially different approaches to the problem of relation extraction in general. The first aims to find semantic relationships between entities through the syntactic structures that generally identify them. This has also been one of the ultimate goals of the field of Natural Language Processing, the results of which are employed, in combination with results from the field of Machine Learning, by this 'semantic' school of relation extraction. However, supervised learning methods require a set of labeled instances as training data, in this case tagged sentences, which is not always available. This becomes especially prominent in large-scale situations with a lot of data [6]. The second school has its roots in the field of Information Retrieval. It aims to identify relations between entities based on their co-occurrence. This family of more lightweight approaches has repeatedly been shown to be more effective than their semantic relatives in a variety of situations [10]. When, for example, co-occurrence is used for relation extraction, it follows that Recall will be relatively high, as each occurrence of a relation in the content will be reflected in the co-occurrence values. At the same time the score for Precision will be lower, because pure co-occurrence methods do not distinguish between different relations. This is a problem in situations with ambiguous relations, when more than one relation is possible between two types of entities: a rarer relation might not be retrievable within the high frequencies of a similar, but more common one.

This trade-off between Precision and Recall is the central issue for probabilistic approaches in general: how can the frequencies over data be used to say things about its semantics? The main question of this report is this issue in the context of relation extraction: how much knowledge about the structure of a relation is necessary for it to be extractable? To address this tension we compare the performance of the semantic approach we consider with a statistical approach to the same task.

1.1 Context

One line of research stemming from the semantic school sees relation extraction as supervised classification of relations based on their syntactic structure. The first case of using the syntactic structure of a sentence for relation extraction is found in [12], who analyze the dependency trees of sentences using hand-written patterns to extract causal relationships. An early example of relation extraction with machine learning techniques is [20], who devise various kernels to automatically extract the relevant patterns from a set of labeled instances of the relation. A kernel is a function that implicitly compares elements by their mapping to a higher-dimensional space, where they are linearly separable, whereas they might not have been in their original form. An important feature of kernel-based methods is that they do not train a classifier on a number of features, but do so by directly computing the similarity between instances.

This means that these methods are more suitable for relatively high-dimensional problems.

A kernel for comparing dependency trees was first proposed by [8]. One of the problems they encountered was that Recall was low, because the set of negative instances is difficult to characterize due to its heterogeneity. An improved method would therefore first identify candidate relations before trying to perform classification. This line of research was continued by [5] and [4], who propose a kernel that only considers the shortest dependency path between entities in a sentence. Finally, [11] use a local alignment kernel to learn to detect the syntactic structures that identify a relation. They use different WordNet distance functions as similarity measures between concepts. This last method will be elaborated upon in Section 2.

For a co-occurrence-based method we look at [3], which explores the possible utility of co-occurrence models for the task of finding related entities. In addition to just using co-occurrence frequencies, they filter the candidate entities by type and use statistical language models of the documents where the entities occur as contextual information. A landmark paper in this field is [18], which uses a probabilistic model to represent an entity by the language that is used to describe it. These 'entity models' contain more structure than a fully probabilistic representation like 'bag of words', but are more loosely defined than, for example, structured database representations. This started a line of research exploring the utility of these semi-structured representations of entities for tasks like named entity recognition and question answering. One effort that used this representation was [17], which used spatial proximity between entities and the terms describing them for the task of finding experts for a specific topic. A different approach to the same task was [1], which associates relevant documents to entities, in this case also experts. The research of [3] stemmed from this, except that they model entity-entity relations directly, instead of entity-document relations. One interesting example of co-occurrence being used next to other techniques is [9], which uses a semi-supervised learning procedure to find documents that might contain instances of the target relation. These approaches are elaborated on in Section 3.

The Text REtrieval Conference (TREC) organizes yearly information retrieval competitions to provide a standardized evaluation infrastructure for newly developed systems. Since 2009 they have also included an Entity Track [2], which evaluates the performance of systems on the task of REF. This framework and the results published in [2] are also used by [3] to evaluate their approach. We use the same task as one way of comparing the semantic and statistical approaches. As noted, this is a different task than RE, which is what [11] aims to do. For the two to be comparable on this point, the method therefore has to be extended so that it effectively functions as a question-answering system. This procedure is explained in Section 4. The rest of the report consists of a description and discussion of the experiments we performed to compare the two methods.

2 A Local Alignment Kernel for Dependency Paths

In [11], Katrenko et al. propose a method for the Relation Extraction task described above. Recall that the input consists of a sentence (S) with the source and target entities (e_s and e_t) and their types (T_s and T_t) marked in it. The task is then to classify each sentence as containing a certain relation between the marked entities. For this they use a supervised classification method that has to be trained on a corpus of labeled training instances before it can be used to classify new, unseen instances. In their particular method they require the entities to be marked with their proper WordNet sense, which will be elaborated upon in the next paragraph.

As mentioned before, a kernel-based machine learning method classifies objects in a high-dimensional space, but does so without explicit calculations in that space. Instead, it uses a kernel function between two objects (K(x_i, x_j)) which is equivalent to the dot product of the objects in the high-dimensional space. This has the advantage that the high dimensionality is effectively avoided, while the ability to linearly separate non-linear data sets is maintained. This is implemented in Support Vector Machines (SVMs) [7], which use a kernel to separate the different classes of elements by mapping them to a high-dimensional space. This can be used as a classifier for new instances by mapping them into that same space. There are few requirements for a function to serve as a kernel function: it has to be symmetric, such that K(x_i, x_j) = K(x_j, x_i), and positive semi-definite, such that \sum_{i,j} K(x_i, x_j) c_i c_j \geq 0. This means that one has a lot of freedom in designing a kernel, as well as in the representation used. In Natural Language Processing this is exemplified by the many different ways people have used kernel methods, ranging from kernels for strings and sub-sequences [5] to kernels for whole parse trees, albeit shallow ones [20].

In order to apply a kernel-based supervised learning approach to the RE task, Katrenko et al. [11] propose a local alignment kernel for dependency paths, a different kind of syntactic analysis than regular parsing. Dependency analysis extracts syntactic dependency relations between lexical items in a sentence [15]. Unlike a regular parse, a dependency graph is not a tree and a word can have multiple incoming and outgoing relations. In Figures 1 and 2 the difference between the two is illustrated. The edges in the dependency graph indicate relations between a word (the head) and other words.

Figure 1: Parse tree of "John likes Mary". Figure 2: Dependency graph of "John likes Mary" (edges nsubj and dobj).
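To make the kernel-based classification setup concrete, the following is a minimal sketch of how a custom kernel function plugs into an SVM through a precomputed Gram matrix. It uses scikit-learn purely for illustration (the implementation behind this report used LibSVM), and kernel_fn stands in for any symmetric, positive semi-definite function such as the local alignment kernel discussed below:

    import numpy as np
    from sklearn.svm import SVC

    def train_with_kernel(X_train, y_train, kernel_fn):
        # Build the Gram matrix K[i][j] = kernel_fn(x_i, x_j); the kernel
        # is symmetric, so only the upper triangle is computed explicitly.
        n = len(X_train)
        gram = np.zeros((n, n))
        for i in range(n):
            for j in range(i, n):
                gram[i, j] = gram[j, i] = kernel_fn(X_train[i], X_train[j])
        clf = SVC(kernel='precomputed')
        clf.fit(gram, y_train)
        return clf

    def predict_with_kernel(clf, X_train, X_test, kernel_fn):
        # Each test instance is compared against every training instance,
        # i.e. mapped into the same implicit space the classifier was trained in.
        gram_test = np.array([[kernel_fn(x, x_tr) for x_tr in X_train]
                              for x in X_test])
        return clf.predict(gram_test)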

Following results from the literature, Katrenko et al. aim to (1) perform partial matching of dependency structures, while (2) being able to incorporate prior knowledge in the process. For the first aim they use a string distance measure that searches for similar sub-sequences. They base their kernel on the Smith-Waterman measure [19], an edit-distance measure originally devised for the alignment of molecular sub-sequences. Edit-distance measures calculate the distance between strings as the minimal number of transformations that are necessary to convert one string into the other. The time complexity of calculating the distance between two sequences of lengths n and m with the LA kernel is O(nm). The space complexity is t · (t + 1)/2 (= O(t²)), where t is the number of unique elements in the data set. This is high, as t grows with the number of test instances.
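As an illustration of the underlying dynamic program, here is a minimal sketch of Smith-Waterman local alignment over two token sequences, assuming a substitution score function sim and a linear gap penalty (both assumptions for illustration). The proper LA kernel of [11] aggregates over alignments to remain positive semi-definite, so this shows only the O(nm) core recurrence, not the full kernel:

    def local_alignment_score(seq1, seq2, sim, gap=-1.0):
        # Smith-Waterman: H[i][j] holds the best score of a local alignment
        # ending at positions i of seq1 and j of seq2; scores are clamped
        # at zero so an alignment can start anywhere in either sequence.
        n, m = len(seq1), len(seq2)
        H = [[0.0] * (m + 1) for _ in range(n + 1)]
        best = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                H[i][j] = max(0.0,
                              H[i - 1][j - 1] + sim(seq1[i - 1], seq2[j - 1]),
                              H[i - 1][j] + gap,   # gap in seq2
                              H[i][j - 1] + gap)   # gap in seq1
                best = max(best, H[i][j])
        return best

    # e.g. aligning two dependency paths with a crude exact-match scorer:
    # local_alignment_score(['John', 'nsubj', 'likes'],
    #                       ['Mary', 'dobj', 'likes'],
    #                       lambda a, b: 2.0 if a == b else -1.0)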

2.1 WordNet Distance Measures

The second aim, the possibility of incorporating prior knowledge into the process, is addressed in two ways: for experiments on the domain of biomedical relations they use co-occurrence statistics, while for experiments with generic relations they use different perspectives on the concept of WordNet relatedness. Because the task currently under discussion concerns the relations in a generic data set, we will only discuss the WordNet relatedness measures here.

WordNet² is a large lexical database of the semantic content of English words. In WordNet, a word has different senses that are grouped together in sets of synonyms called synsets [14]. Words are also collected in a taxonomy depicting a hierarchical structure of hypernym (IS-A) relations. This results in a fully connected graph, so that a path can be found between each pair of words. WordNet can thus be used to formulate different perspectives on the semantic distance between words. As in [16], Katrenko et al. consider five different distance measures based on WordNet relatedness. All measures are a composite of the following relations between two concepts c_1 and c_2: the length of the shortest path between them (path(c_1, c_2)), their most specific common subsumer (lcs(c_1, c_2)), and the depth of a node in the hierarchy (depth(c_i)). Two of the five measures, wup and lch, use only this information. The first measure, wup, calculates the conceptual similarity between concepts, using the least common subsumer:

\[ sim_{wup}(c_1, c_2) = \frac{2 \cdot depth(lcs(c_1, c_2))}{path(c_1, lcs(c_1, c_2)) + path(c_2, lcs(c_1, c_2)) + 2 \cdot depth(lcs(c_1, c_2))} \]

A different approach is the lch measure, which calculates the semantic similarity using the maximum depth in the WordNet hierarchy:

\[ sim_{lch}(c_1, c_2) = -\log \frac{path(c_1, c_2)}{2 \cdot \max_{c \in \mathrm{WordNet}} depth(c)} \]

The other three measures use the notion of the information content of a concept: the rarer a concept is, the more information it carries. If the probability of encountering a concept is p(c), calculated, for example, from occurrences in a corpus, then the information content of that concept is IC(c) = -log p(c). The res measure is based solely on applying this notion to the least common subsumer:

\[ sim_{res}(c_1, c_2) = -\log p(lcs(c_1, c_2)) \]

Another measure, jcn, adds the information content of the concepts themselves to the equation by taking the difference between that and the information content of the least common subsumer:

\[ sim_{jcn}(c_1, c_2) = 2 \cdot \log p(lcs(c_1, c_2)) - (\log p(c_1) + \log p(c_2)) \]

² http://wordnet.princeton.edu/

Finally, the lin measure integrates both what the concepts have in common and the difference between them, by scaling twice the information content of the least common subsumer by the sum of the information contents of both concepts:

\[ sim_{lin}(c_1, c_2) = \frac{2 \cdot \log p(lcs(c_1, c_2))}{\log p(c_1) + \log p(c_2)} \]
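All five measures are available off the shelf in NLTK's WordNet interface, which is also the binding used in our own implementation (see Section 5). As a hedged illustration, assuming the NLTK WordNet corpus and the Brown information-content file have been downloaded:

    from nltk.corpus import wordnet as wn
    from nltk.corpus import wordnet_ic

    # Information content estimated over the Brown corpus; required by the
    # three IC-based measures (res, jcn, lin).
    brown_ic = wordnet_ic.ic('ic-brown.dat')

    def wordnet_similarities(synset1, synset2):
        return {
            'wup': synset1.wup_similarity(synset2),
            'lch': synset1.lch_similarity(synset2),
            'res': synset1.res_similarity(synset2, brown_ic),
            'jcn': synset1.jcn_similarity(synset2, brown_ic),
            'lin': synset1.lin_similarity(synset2, brown_ic),
        }

    # e.g. the first noun senses of 'city' and 'settlement':
    print(wordnet_similarities(wn.synsets('city', wn.NOUN)[0],
                               wn.synsets('settlement', wn.NOUN)[0]))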

2.2 Evaluation

Katrenko et al. evaluate their approach on the task of extracting various relations (cause/effect, instrument/agency, product/producer, origin/entity, theme/tool, part/whole, content/container) by applying it to the annotated SemEval-2007 Task 4 data set, which was designed for the same task. The training data consists of labeled sentences with relations, with the target and source entities tagged with their proper WordNet sense. There are 140 training examples per relation, about half of which are positive instances. The test set contains about 80 examples per relation. The input sentences are parsed with the Stanford Parser³, which provides the dependency paths between the source and target entities. Between every 'word' in the vocabulary, including entities, dependency relations and other words, a distance is calculated as follows:

\[ d(x_1, x_2) = \begin{cases} 1 & \text{if } x_1 = x_2 \\ sim_{wn}(x_1, x_2) & \text{if } x_1, x_2 \in \mathrm{WordNet} \\ 0 & \text{otherwise} \end{cases} \]
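In code, this case distinction is one branch per case. The sketch below is a direct transcription, with sim_wn (one of the five measures above) and in_wordnet left as assumed helper functions:

    def d(x1, x2, sim_wn, in_wordnet):
        # Identical tokens match perfectly; WordNet concepts are scored by
        # a similarity measure; anything else does not match at all.
        if x1 == x2:
            return 1.0
        if in_wordnet(x1) and in_wordnet(x2):
            return sim_wn(x1, x2)
        return 0.0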

They use various baselines to compare their performance against, including majority voting, a shortest path kernel, always guessing 'yes', and a random guesser. Apart from the regular performance measures of Accuracy, Precision, Recall and F-score, they performed an experiment with training sets of varying size, to measure the learning rate of their system with the different WordNet measures.

3 Co-Occurrence Models

For the REF task, Bron et al. [3] propose a general architecture that links several components with different functionalities in a sequential order. They do this to be able to compare the impact of each component on the performance of the complete system. After the preprocessing steps of named entity recognition and disambiguation, they perform the related entity finding step by co-occurrence modeling. This results in a ranked list of candidate target entities, of which the top n can be regarded as 'answers' and investigated as such. The ranking is based on the likelihood of the candidates being correct, based on their co-occurrence with the source entity (P(e_t|e_s)), their type (P(T_t|e_t)) and information about the context of the relation (P(R|e_s, e_t)). The actual probabilities are then calculated with:

\[ P(e_t|e_s, T_t, R) = P(e_t|e_s) \cdot P(T_t|e_t) \cdot P(R|e_s, e_t) \]
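As a sketch of how this factored score drives the ranking step (the three probability estimators are assumed callables supplied by the surrounding system; they are not part of any published code of [3]):

    def rank_candidates(e_s, T_t, R, candidates, p_cooc, p_type, p_context):
        # Score each candidate target entity by the product of the three
        # factors and return the candidates best-first.
        scored = [(e_t,
                   p_cooc(e_t, e_s) * p_type(T_t, e_t) * p_context(R, e_s, e_t))
                  for e_t in candidates]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)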

³ http://nlp.stanford.edu/software/lex-parser.shtml

Four different measures to estimate the co-occurrence values are considered: maximum likelihood estimate, χ², point-wise mutual information and log-likelihood ratio. An initial result is that the χ² measure performs best when only the co-occurrence values are used.

To improve performance with respect to the aforementioned Precision/Recall trade-off, Bron et al. employ two additional filters over the pure co-occurrence values. To prevent the inclusion of answers of the wrong type, they filter the candidate answers on their type. The type set of an entity is constructed by retrieving from Wikipedia the categories that the entity is in. To these sets the parent categories are also added, up to a certain level n, for which an approximately optimal value is found to be 6. For example, "Pablo Picasso" returns a set including "Spanish Sculptors" and "Spanish Atheists", but also "Sculptors", "Artists" and "Person". For each entity they let the P(T_t|e_t) value be 1 if there is an intersection between the type sets of e_t and T_t, and 0 if there is no intersection.

The other technique that is used to increase Precision is the inclusion of information about the context around instances of relations. This is done with a probabilistic language model. P(R|e_s, e_t) is then taken to be the product of the probabilities of the terms of the relation, given the collection of documents where both the source and target entity occur:

\[ P(R|e_s, e_t) = \prod_{t \in R} P(t|\theta_{e_s, e_t})^{n(t, R)} \]

The actual values for P(t|θ_{e_s,e_t}) come from estimation on a very large collection of content. Bron et al. work with the ClueWeb Category B corpus, also used in the TREC 2009 competition, to populate their language model. They compare their best results with those of the participants of the Entity Retrieval Task in the TREC 2009 competition [2]. The various systems, optimized for Recall and Precision, perform on a similar level as the actual competitors. The version which uses the maximum likelihood estimate for co-occurrence has the best overall balance between Recall and Precision, with an F-score of 0.34. Within this method, the calculation of the co-occurrence values is by far the most time-intensive step, as the whole corpus has to be considered.
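A hedged sketch of that context factor: in log space the product above becomes a sum over the relation terms, with term_prob standing in for the (smoothed) language-model estimate over documents containing both entities, an assumed helper:

    import math
    from collections import Counter

    def context_log_prob(relation_terms, term_prob):
        # log P(R|e_s, e_t) = sum over distinct terms t of
        # n(t, R) * log P(t|theta), where n(t, R) is the term's frequency
        # in the relation description; term_prob must be smoothed so it
        # never returns zero.
        counts = Counter(relation_terms)
        return sum(n * math.log(term_prob(t)) for t, n in counts.items())

    # e.g. context_log_prob("chefs with a show on".split(), term_prob)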

4 Method

In this section we will discuss the steps necessary to make the statistical and semantic approaches from [3] and [11], respectively, comparable in the same framework. Since both the RE and REF tasks deal with essentially the same problem, namely the identification of a certain relation between two entities, they can be looked at within the same format. Having one framework to handle two different tasks has the advantage that approaches aimed at one of the tasks can be employed for the other, either as a full port or as the application of advances in one task in the domain of the other. As a brief reminder: the RE task refers to having as input a collection of instances of relations, where the source and target entities (e_s and e_t) are marked and typed (T_s and T_t), and the goal is to determine the right relationship (R) between them. The REF task takes as input a source entity (e_s), a relation (R) and the type of the target entity (T_t), and the task is to return a ranked list of target entities (e_t) with which e_s has relation R.

A conceptual equivalence of the two tasks follows when both can be rewritten in terms of the other. The key observation for the RE and REF tasks is that a positive classification of a relation in the RE task is equal to the target entity occurring in the result set of a query for the source entity with the relation. This can be expressed as:

\[ RE(R(e_s, e_t)) \leftrightarrow e_t \in REF(e_s, R, T_t) \]

This means that it is possible to perform the RE task in the REF environment by taking the occurrence of an entity e_t in the result set of a query (e_s, R, T_t) as a sufficient reason for that triplet (R(e_s, e_t)) to be a positive classification. On the other hand, a REF task can be solved by an RE system by classifying over each pair (such that e_t ∈ T_t) and taking the set of positive classifications as the result set. In either case a complete set of known-to-be-correct entities (a 'gold standard') is required to evaluate the performance on the task.
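A minimal sketch of this two-way reduction, assuming ref_system and re_system are callables wrapping the respective approaches:

    def re_via_ref(ref_system, e_s, e_t, R, T_t):
        # RE answered by a REF system: the pair is a positive instance
        # iff the target entity appears in the REF result set.
        return e_t in ref_system(e_s, R, T_t)

    def ref_via_re(re_system, e_s, R, T_t, candidates):
        # REF answered by an RE classifier: classify every candidate of
        # the right type and return the positives as the result set.
        return [e_t for e_t in candidates if re_system(e_s, e_t, R)]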

4.1 Making the LA kernel less supervised

As described in Section 2, the kernel used in [11] uses different WordNet distance functions to calculate the similarity of elements. However, this is only possible because they (1) only consider non-named entities and (2) assume words to be tagged with their proper WordNet sense. When using a free-text corpus for finding generic relations, both assumptions fail to hold: named entities cannot be queried in WordNet directly, because they do not occur in it, and the entities in a generic data set are not tagged with their proper WordNet sense. In order to overcome these issues and still use the WordNet distance, we employ a two-step technique.

The first idea is to use the similarity between the types of the compared entities, as opposed to their direct similarity. Each instance has rdf:type relationships with dBpedia classes, which are usually regular concepts that occur in WordNet. A SPARQL query for the rdf:type of 'Amsterdam', for example, returns: {PopulatedPlace, Settlement, Place, City}. Because only one value can be used, we consider each type×type pair and take the average over their scores. The similarity of two instances of arbitrary type is thus taken as the semantic intersection of their respective dBpedia types, in the form of their Cartesian product.

The second issue, relevant both to non-named entities and to the retrieved types of named entities, which also lack their corresponding WordNet sense, is essentially a double disambiguation issue. Both words return a non-empty set of senses, for which the point-wise similarity scores can be calculated. We take the smallest distance between the senses of both sets, on the assumption that the semantically closest interpretation is also the most likely one, given that the words are encountered together. As a formula:

\[ sim_{best}(i, j) = \min_{s_i \in T_i, s_j \in T_j} \delta(s_i, s_j) \]

We compared this approach with alternatives like taking the average over all distances between both sets of senses, using the distance between the first items in both lists, or using a random pair. However, those approaches performed uniformly worse.

get_distance(e1, e2)
    T1 ← get_dbpedia_types(e1)
    T2 ← get_dbpedia_types(e2)
    sum ← 0
    count ← 0
    for all (t1, t2) : t1 ∈ T1 ∧ t2 ∈ T2 ∧ t1, t2 ∈ WordNet do
        best ← 0
        for all (s1, s2) : s1 ∈ wn_senses(t1) ∧ s2 ∈ wn_senses(t2) do
            distance ← get_wordnet_distance(s1, s2)
            if distance > best then
                best ← distance
            end if
        end for
        sum ← sum + best
        count ← count + 1
    end for
    return sum / count

Figure 3: The get_distance algorithm to calculate the WordNet similarity between generic entities

The complete algorithm that is used to calculate the distance between two instances (get_distance) is shown in Figure 3.
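For concreteness, a Python rendering of the same algorithm, assuming the dBpedia type labels have already been retrieved via SPARQL and using NLTK's WordNet interface for the senses and the similarity measure:

    from nltk.corpus import wordnet as wn

    def get_distance(types1, types2, sim):
        # types1/types2: dBpedia type labels of the two entities;
        # sim: a WordNet similarity function over a pair of synsets.
        total, count = 0.0, 0
        for t1 in types1:
            for t2 in types2:
                senses1, senses2 = wn.synsets(t1), wn.synsets(t2)
                if not senses1 or not senses2:
                    continue  # skip type pairs not known to WordNet
                # Best score over all sense pairs: the semantically
                # closest interpretation of the two types.
                best = max((sim(s1, s2) or 0.0)
                           for s1 in senses1 for s2 in senses2)
                total += best
                count += 1
        return total / count if count else 0.0

    # e.g. get_distance(['City', 'Settlement'], ['Person'],
    #                   lambda a, b: a.wup_similarity(b))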

5 Experimental Set-Up

We evaluate the two approaches by measuring their performance on the task of extracting relations between entities, based on the content found on Wikipedia. By using standard Information Retrieval measures over a number of trials and by investigating the actions taken as if it were a question-answering task, we evaluate the performance on both the RE and REF tasks.

5.1 Content from Wikipedia

We use content mined from Wikipedia⁴ as the data set from which we have to extract relations. The structure of Wikipedia makes it especially suited for research on Information Extraction techniques. While the content is mostly free text, it is augmented with a number of extra features that make processing and interpreting it a lot easier. To start, each entity has its own page with a unique identifier and accompanying text, which can be considered the main source for a description of the entity. References to other entities in the content are hyperlinks to the pages of those entities, which means both that named entities are identified and disambiguated and that there is a representation of which entities are related to each other. Furthermore, entities have IS-A relations with categories, and the categories are themselves nested in a hierarchical structure of super- and sub-sets.

⁴ http://www.wikipedia.org

Finally, a large part of the content of Wikipedia has a fully structured counterpart in dBpedia⁵. This is an RDF repository containing triplets of the format [subject relation object], where entities refer to their Wikipedia pages. It also contains references to other structured and unstructured data sets, making it an important linchpin in the idea of the Semantic Web. This repository can be queried using the query language SPARQL.

We employ Wikipedia as follows. For each page we take the subject of that page, identified by its title, to be the e_s, and each occurrence of an entity on that page to be an e_t with an unknown relation. In order to be able to formulate dependency paths between two entities in the same sentence, we only take into account the sentences where the source entity also occurs. Each page therefore provides a set of instances where the subject of the page is the source entity, the other entity is the target entity, and the sentence where they both occur is the context from which the relation should be extracted. The task thus becomes to retrieve relations from free text on which named entity recognition has been performed.

We selected four relations to be extracted, shown in Table 1, from three different data sets. The first query, "Chefs with a show on Food Network", was discussed in detail to illustrate the behavior of the REF system in [3]. This provides a background for us to compare against. The second relation, 'MovementOf', was taken from [9]. The last two relations, 'BirthPlaceOf' and 'ResidenceOf', are interesting because they are ambiguous: they have the same source and target types, and both relations can hold between the same pairs of entities. To look at how both methods react to this, we use one data set containing the entities relevant to either of them and test the performance of both tasks on it.

Id | Source entity | T_s      | Relation in text                      | Relation     | T_t
1  | Food Network  | Channel  | Chefs with a show on Food Network     | NetworkOf    | Chef
2  | Impressionism | Movement | Artists associated with Impressionism | MovementOf   | Artist
3  | Amsterdam     | City     | People born in Amsterdam              | BirthPlaceOf | Person
4  | Amsterdam     | City     | People that live in Amsterdam         | Residence    | Person

Table 1: The four test topics

For each relation the procedure to create the data set was as follows. A set of correct instances of the relation was taken from dBpedia, which also provides the types of the relevant entities. The content itself was gathered by taking the Wikipedia pages of each entity of either type T_s or T_t. For the REF task only the page of e_s is required, but to train the systems for the RE task, and apply them to the REF task, we need all instances of T_t. From each page the irrelevant information, like tables, images and references, was stripped. The data sets for R3 and R4 were then sent to Marc Bron, who applied his method to them and provided us with his results. For the evaluation of the Local Alignment method of [11] more preprocessing is required. From each page all instances of relations are extracted with the method described in Section 4, and filtered so that only R(e_s, e_t) relations remain. These triplets of e_s, e_t and the accompanying sentence are then handled by an implementation of the Local Alignment method⁶.

⁵ http://dbpedia.org/About

We labeled the extracted instances of relations between known-to-be-correct pairs based on whether or not the relation is evident from the sentence where the entities occur. The other pairs were given a negative label.

The sizes of the resulting data sets are shown in Table 2. For each relation the sizes of the sets of candidate source and target entities (#e_s and #e_t respectively) are listed, as well as the number of instances of the relation in dBpedia (#R∈dB), the gold standards. Also shown in Table 2 are the numbers of examples we managed to extract from the raw data. It is clear that the different relations have a wide variety of instances present in the corpus. The first relation is rare, as only a select number of entities can hold it in the first place. The other three relations all have 'People' as target entities, which means that the set of potential entities is extremely large. In these cases we mined the relevant entities by either narrowing down the target type (Relation #2), or taking the known correct entities plus a random subset of a set size (2000). The last two relations use the same corpus of source and target entities and differ only in the number of correct instances and thus positive examples.

Id | Relation     | #e_s | #e_t   | #R∈dB | #examples | #+ examples
1  | NetworkOf    | 13   | 87     | 23    | 36        | 21
2  | MovementOf   | 59   | > 2000 | 119   | 72        | 44
3  | BirthPlaceOf | 51   | > 2000 | 340   | 132       | 113
4  | Residence    | 51   | > 2000 | 30    | 146       | 5

Table 2: Numbers of the four test topics

5.2 Evaluation

We evaluate the performance of the methods on the RE task with the standard Information Retrieval measures of Precision, Recall, Accuracy and F-score. All four are based on the respective sizes of the sets of correctly positive (tp), correctly negative (tn), incorrectly positive (fp) and incorrectly negative (fn) actions. An action here refers to either a classification or the retrieval of a concept, depending on whether the task is a classification task or an information retrieval one. We take as the gold standard for each relation the set of target entities of the instances of the relation present in dBpedia. Because this set is not as rich as the generic relations in Wikipedia (the problem of filling such a database is exactly one that this report addresses), this restricts the choice of relations we can use. Relations that do exist are, for example, 'LocationOf' and 'BirthPlace', but others, like the basic 'SituatedIn' from 'City' to 'Country', are notably absent. The set of [e_s R e_t] triplets present in dBpedia is thus taken to be a correct and exhaustive set of instances of the relation, with its size being equal to tp + fn.

⁶ The implementation was written in Python 2.6 (http://www.python.org/download/releases/2.6.5/); SWI-Prolog (http://www.swi-prolog.org/) was used for the dependency path finding. Externally used code includes: the NLTK Toolkit (http://www.nltk.org/) for the WordNet binding, SPARQL wrapper (http://sparql-wrapper.sourceforge.net/) for communication with dBpedia, the Stanford Parser version 2010-02-26 (http://nlp.stanford.edu/software/lex-parser.shtml), and LibSVM 2.9 (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) for training and testing of the SVM kernel.

Because we have a complete set of the relations between entities, we can use the same scores for both the RE and the REF tasks. The score for Precision refers to the fraction of retrieved instances that are correct:

\[ Precision = \frac{tp}{tp + fp} \]

The Recall measure stands for the fraction of correct entities that are retrieved, compared to their total set:

\[ Recall = \frac{tp}{tp + fn} \]

The Accuracy score is the overall fraction of correct actions, compared with the whole set of instances:

\[ Accuracy = \frac{tp + tn}{tp + fp + tn + fn} \]

Finally, because the values for Precision and Recall usually trade off against each other, we use the F-score as an indication of the overall performance of the method. The F-score is the harmonic mean of the two:

\[ F\text{-}score = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} \]
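These four measures follow mechanically from the outcome counts; a small helper for reference (the guards against empty denominators are our own convention, not something stated in the definitions above):

    def evaluation_scores(tp, fp, tn, fn):
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        accuracy = (tp + tn) / (tp + fp + tn + fn)
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        return {'precision': precision, 'recall': recall,
                'accuracy': accuracy, 'f-score': f_score}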

6 Results

Here we will present the results of the experiments performed as described in the previous section.

6.1 Relation Extraction

We start off with the results of the experiment, looked at from the RE point of view. We first show the behavior of the LA kernel with different WordNet similarity functions in Table 3. Based on these results, a clear distinction can be made between the similarity functions that use information content (res, jcn and lin) and those that do not (wup, lch). The non-IC measures have a higher Recall by a large margin, while the IC measures are slightly better at Precision. We performed a paired t-test with a 95% confidence level to test whether these differences are statistically significant. It turns out that the improvement of the non-IC measures on Recall is significant, but the difference between the two groups on Precision is not. The best F-score is achieved by lch.

WN function | Accuracy | Precision | Recall | F-score
wup         | 59.29    | 59.97     | 34.82  | 44.06
lch         | 61.90    | 63.00     | 35.21  | 45.17
res         | 67.14    | 70.80     | 29.42  | 41.57
jcn         | 57.14    | 64.96     | 26.92  | 39.06
lin         | 65.38    | 69.13     | 28.84  | 40.70

Table 3: Results of the different WordNet similarity functions on R2

Relation & Method | Accuracy | Precision | Recall | F-score
LA-wup on R1      | 58.33    | 59.33     | 89.02  | 71.20
LA-lch on R2      | 61.90    | 63.00     | 35.31  | 45.17
LA-wup on R3      | 88.46    | 95.45     | 30.35  | 46.04
LA-wup on R4      | 52.92    | 49.28     | 15.47  | 23.57
LA_avg [11]       | 73.09    | 71.48     | 72.30  | 71.60
Co-oc. on R1 [3]  | ..       | 26.23     | 69.82  | 38.13
Co-oc. on R3      | ..       | 30.00     | 22.05  | 24.03
Co-oc. on R4      | ..       | 20.00     | 26.66  | 22.85

Table 4: Results of the methods on the different relations

Table 4 shows the performance of each method on the four different relations. For the LA kernel, the results show the performance of the classification task, with all extracted relations as both test and training instances. The performance is thus strongly affected by the quality of the mining process, i.e., the number of relations that are extracted from the whole corpus, as illustrated in Table 2. Evaluation was done with 10-fold cross-validation. For each relation only the best performing similarity score is displayed, all of them being non-IC measures. We also include the average results reported in [11] for the RE task. We did not include any baseline values, given the uniformly inferior baseline results reported in [11]; the different baseline strategies would have performed even worse on the current task, because of the strong imbalance of the data and the rarity of positive instances.

To compare the results of the LA kernel with those of co-occurrence, we gathered results from two sources. For R3 and R4 we applied an implementation of the method described in [3] to the same data set as the LA kernel. However, the time-consuming nature of this method restricted us from also trying it on the other two relations; the calculation of the co-occurrence values of the terms present in the data set takes the most time, as a much larger data set (Wikipedia) is required. Therefore, for R1 we compare with the results reported in [3]. While these technically use a different data set, in practice they also work with Wikipedia. The Recall and Precision scores are taken from the methods optimized for Recall and Precision respectively. The Precision scores of R3 and R4 are 'P@10', referring to the Precision of the first ten returned answers.

The results of the LA kernel show that the behavior of the system differs between the data sets. In all situations Precision is above 50%, but the difference between the relations can be as high as 45%. R1 is the easiest task, also because the combination 'Chef' and 'TV Channel' is relatively unambiguous, and the LA kernel performs better on it than the results reported in [3]. Performance on R2 is notably lower, especially because of the low Recall. The performance of the LA kernel on R3 and R4 is clearly the result of the system returning only a small number of answers, as opposed to the co-occurrence method, whose larger return set results in a lower Precision rate. Performance on R4 is lower than on R3, except for the Recall of the co-occurrence method.

It should be noted that the achieved performance of the LA kernel includes both the quality of the mining process and the ability to recognize relations in extracted sentences. The evaluated task is thus relation extraction from free text. This is different from the experiments done in [11], who classify given relations in labeled sentences. This means that the Recall scores are notably lower than if they referred to the percentage of relations found among those present in the extracted sentences. If those sets (the last column of Table 2) had been used as the gold standard, all Recall scores would have been over 90% and the F-scores would increase proportionally; LA-wup would achieve a score as high as 93% on R3 in this scenario. However, an important difference between structural and probabilistic methods is the amount of processing that is required to run them.
If there is no functional method of extracting the dependency paths of relations, being able to classify them correctly becomes less valuable. Eventually, the problem is one of distillation. The quality of the extracted instances will improve with more careful natural language processing. More importantly, not all relations in dBpedia are explicitly mentioned in the free-text content of Wikipedia. Many facts are stored in tables and lists, which makes the extraction process essentially an engineering problem.

Finally, we experimented with different sizes of training data to measure the learning rate of the LA kernel with different WordNet functions. In Figure 4 the results on R2 are displayed, next to an aggregation of the learning rate curves presented in [11]. For the results of the LA kernel we took as the 'gold standard' the set of instances of relations that were actually extracted by the mining process, as opposed to the complete set of instances. This is because the important information here is the learning rate of the classifier, not the performance of the system in general. For the numbers from [11] we averaged over the results they present for the different tasks. The graph shows that there is relatively little difference in performance between the situations where only a portion of the training data is available and the situation where all training data is available. This shows that only a small sub-set of the training data is necessary for the LA kernel to achieve good performance: already with 25% of the training data, an F-score of over 60 is reached. The different measures show similar learning curves, with only res taking longer to reach the level of the others. In general, the wup measure leads to the best results.

Figure 4: Learning curves on R2

6.2 Related Entity Finding

To illustrate the behavior of the different approaches in light of the REF task, we present a sample of the answers returned for each query. This gives a qualitative picture, complementing the quantitative RE results. For the co-occurrence methods, we list the top 10 returned answers. This is possible because their result set is a ranked list. For the results of the LA kernel, we rank the answers based on a majority voting procedure over the answers returned by the five different WordNet measures.

   | Bron et al. [3]     | LA_avg
1  | Bobby Flay          | Rachael Ray
2  | Anne Burrell        | Anthony Bourdain
3  | Robert Irvine       | Jamie Oliver
4  | Tyler Florence      | Michael Simon
5  | Aaron McCargo       | Tyler Florance
6  | Mario Batali        | Alton Brown
7  | Sunny Anderson      | Duff Goldman
8  | Guy Fieri           | Sara Moulton
9  | Giada de Laurentiis | Dave Lieberman
10 | Kevin Brauch        | Guy Fieri

Table 5: The first 10 answers for R1, ‘NetworkOf’

The first table, Table 5, shows the answers that the LA kernel returned for R1, next to the results reported in [3]. Together with the results from Table 4, it shows that the co-occurrence methods are more sure of their 'right answers': while their Precision is generally low, the top 10 answers are almost all correct. Tables 6 and 7 show the results for R3 and R4 respectively. The accompanying answers here result from applying the co-occurrence method of Bron et al. to the same data set as the LA kernel method. We did not use type filtering, which shows in the occurrence of answers of the wrong type ('Zutphen') in the results. Contrary to our expectation that the co-occurrence method would return similar answers, the two methods share only one. Because we have no results to compare the returned answers for R2 against, we leave them out.

   | Bron et al.        | LA_avg
1  | Willem Schouten    | Baruch Spinoza
2  | WF Hermans         | Maarten Heijmans
3  | C.H Ver Huell      | Viola Haqi
4  | Jan Mostaert       | Channii
5  | Hendrik Muller     | Sebastiaan Bremer
6  | B vd Helst         | M. van Nieuwkerk
7  | Eberhard vd Laan   | Cornelis Kruseman
8  | Roel van Duijn     | Voila Hagi
9  | Karel Appel        | Jerson A. Ribeiro
10 | Judith Sargentini  | Maarten Heijmans

Table 6: The first 10 answers for R3, 'BirthPlaceOf'

   | Bron et al.        | LA_avg
1  | Jacob van Campen   | Joop vd Ende
2  | S.D. v Hoogstraten | Piet Keizer
3  | Frits Bolkestein   | Michael Mols
4  | B.C. Koekkoek      | Marijke Vos
5  | Zutphen            | Pieter de Graeff
6  | Pieter Nuyts       | Ewout Irrgang
7  | B vd Helst         | Judith Sargentini
8  | Eberhard vd Laan   | Gerrit Bolkestein
9  | Roel v Duijn       |
10 | Karel Appel        |

Table 7: The first 10 answers for R4, 'ResidenceOf'

7 Conclusion

We investigated the use of structural information for the task of relation extraction, by comparing an approach that uses it with one that does not. The structural method uses a local alignment kernel to learn the structure of dependency paths between related entities. We used a framework that allowed us to evaluate the performance of the methods on both the Relation Extraction and Related Entity Finding tasks in one experiment. The data set used consists of Wikipedia pages, containing free text and tagged named entities. For the local alignment method to be able to work with free text, we used a procedure that extracts dependency paths between the entities occurring in a sentence. We also use dBpedia and WordNet to calculate the semantic distance between entities, avoiding the supervised kernel method's need for semantically labeled content.

The results show an almost uniformly superior performance of the LA kernel over the co-occurrence method. In general, the LA kernel performs well on Precision (over 50%). The score for Recall depends strongly on the relation and domain used. However, the Recall scores are strongly impaired by the extraction process, which does not find all relations between pairs that are present in the content. Evaluation of just the classification component results in Recall scores of over 90%. Because of this issue we were unable to properly verify our hypothesis that the LA kernel would perform better on relations that are rare and ambiguous. The co-occurrence method does show a 10% decrease in Precision for the rarer 'ResidenceOf' relation, compared to 'BirthPlaceOf', but so does the LA kernel. However, the latter only extracted instances of one-fifth of the correct relations.

Because the different methods consist of different components, it is hard to compare their functioning in detail. We performed some experiments to determine the impact of using different WordNet measures to compare the semantic distance between terms, and the result was that the measures that do not use information content perform better. It should be noted that the used aggregation of WordNet similarities is a bit superficial in the current context, because only relations between two fixed entity types are considered. There are therefore in practice only two different distances between concepts: 'the same entity type' and 'the other entity type'. The idea might prove useful, however, for other situations where content has to be annotated on a higher level than it is. We also looked at the impact of the size of the training data for the LA kernel, which showed that even when a small amount of training data is used, reasonable results are achieved.

The data requirements for the two methods are as follows. The co-occurrence method requires a large amount of content to calculate the co-occurrence rates; because of the amount of data involved, this is a time-consuming process. The LA method, on the contrary, requires only a small set of labeled data to train itself. However, its method of processing the content takes a lot of time, and any new content has to be processed in the same fashion. Training it is thus much more efficient than the co-occurrence method, but using it on new content is extremely inefficient. Because of its structure and named entity tagging, Wikipedia is a useful environment to evaluate relation extraction approaches in.
Using the same methods in a new environment, like the WARC data sets used in the TREC competitions, would require an extra step of named entity recognition and disambiguation.

The found results do not directly lead to new directions for a possible integration of the two methods. While it would be feasible to integrate linguistic knowledge into the co-occurrence P(e_t|e_s, T_t, R) equation, or to integrate co-occurrence information into the LA kernel, this does not directly follow from the results. It would, however, be an interesting future research direction. The latter idea, adding co-occurrence information to the LA kernel, would probably lead to a doubly inefficient system, as it would require both the calculation of the co-occurrence rates and the heavy linguistic processing to extract the dependency paths.

Beyond this next step, other future research directions can be imagined. An interesting one would be the extraction of n-ary relations from text. News articles about events can often be represented as one relation between a number of different instances. While an n-ary relation is theoretically a compound of the more general binary relation (every n-ary relation can be represented as a set of binary relations), its structure allows for different techniques to be tried. Also, when linguistic relation extraction systems perform sufficiently well, their functioning can be used to analyze the complexity of different relations, for example across different languages. In general, the comparison of semantic and statistical approaches to the same task is a very interesting endeavour, as it might lead to insight about the relation between the statistical occurrence of objects and their meanings. This is especially true for finding relations between entities, as it is intrinsically linked with both co-occurrence, because the entities are named together, and semantic understanding, because there can be different relations between two entities.

References

[1] K. Balog. People search in the enterprise. PhD thesis, University of Amsterdam, 2008.

[2] K. Balog, A.P. de Vries, P. Serdyukov, P. Thomas, and T. Westerveld. Overview of the TREC 2009 Entity track. In The Eighteenth Text REtrieval Conference (TREC 2009) Notebook, 2009.

[3] M. Bron, K. Balog, and M. de Rijke. Related entity finding based on co-occurrence. In Proceedings of the Eighteenth Text REtrieval Conference (TREC 2009), Gaithersburg, MD, 2010.

[4] R.C. Bunescu. Learning for Information Extraction. PhD thesis, The University of Texas, 2007.

[5] R.C. Bunescu and R.J. Mooney. A shortest path dependency kernel for relation extraction. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 724–731. Association for Computational Linguistics, 2005.

[6] P. Cimiano. Ontology learning and population. In Proceedings Dagstuhl Seminar Machine Learning for the Semantic Web, 2005.

[7] C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.

[8] A. Culotta and J. Sorensen. Dependency tree kernels for relation extraction. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, 2004.

[9] V. de Boer, M. van Someren, and B. J. Wielinga. A redundancy-based method for the extraction of relation instances from the web. International Journal of Human-Computer Studies, 65(9):816–831, 2007.

[10] Alon Halevy, Peter Norvig, and Fernando Pereira. The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2):8–12, 2009.

[11] S. Katrenko, P. Adriaans, and M. van Someren. Using Local Alignments for Relation Recognition. Journal of Artificial Intelligence Research, 38:1–48, 2010.

[12] C.S.G. Khoo, S. Chan, and Y. Niu. Extracting causal knowledge from a medical database using graphical patterns. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, volume 38, pages 336–343. ACL, 2000.

[13] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, 284(5):34–43, 2001.

[14] G.A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K.J. Miller. WordNet: An on-line lexical database. International journal of lexicography, 3(4):235–312, 1990.

[15] J. Nivre. Dependency grammar and dependency parsing. MSI report, 5133, 2005.

[16] T. Pedersen, S. Patwardhan, and J. Michelizzi. WordNet::Similarity: measuring the relatedness of concepts. In Proceedings of the Nineteenth National Conference on Artificial Intelligence, pages 1024–1025. AAAI Press, 2004.

[17] D. Petkova and W.B. Croft. Proximity-based document representation for named entity retrieval. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, pages 731–740. ACM, 2007.

[18] H. Raghavan, J. Allan, and A. McCallum. An exploration of entity models, collective classification and relation description. In Proceedings of LinkKDD, 2004.

[19] T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197, 1981.

[20] D. Zelenko, C. Aone, and A. Richardella. Kernel methods for relation extraction. The Journal of Machine Learning Research, 3:1083–1106, 2003.