Learning Information Extraction Patterns Using WordNet
Mark Stevenson and Mark A. Greenwood
Department of Computer Science
University of Sheffield
Sheffield, S1 4DP, UK
[email protected] [email protected]

Abstract

Information Extraction (IE) systems often use patterns to identify relevant information in text but these are difficult and time-consuming to generate manually. This paper presents a new approach to the automatic learning of IE patterns which uses WordNet to judge the similarity between patterns. The algorithm starts with a small set of sample extraction patterns and uses a similarity metric, based on a version of the vector space model augmented with information from WordNet, to learn similar patterns. This approach is found to perform better than a previously reported method which relied on information about the distribution of patterns in a corpus and did not make use of WordNet.

1 Introduction

One of the goals of current research in Information Extraction (IE) is to develop systems which can be easily ported to new domains with the minimum of human intervention. Early IE systems were generally based on knowledge engineering approaches and often proved difficult to adapt to new domains. One approach to this problem is to use machine learning to automatically learn the domain-specific information required to port a system. Soderland [1999] developed an approach that learned rules from text which had been annotated with the information to be extracted. However, the annotated text required for training is often difficult and time-consuming to obtain. An alternative approach is to use weakly supervised learning algorithms; these do not require large amounts of annotated training data and rely on a small set of examples instead. These approaches greatly reduce the burden on the application developer by alleviating the knowledge acquisition bottleneck.

Weakly supervised algorithms have the benefit of requiring only small amounts of annotated training data, but the learning task is more challenging since there are fewer examples of the patterns to be learned. Providing the learning algorithm with access to additional knowledge can compensate for the limited number of annotated examples. The approach we have chosen is to augment an IE pattern learning algorithm with information from WordNet which allows our system to decide when patterns have similar meanings.

The remainder of this paper is organised as follows. We begin by describing the general process of weakly supervised pattern induction and an existing approach, based on the distribution of patterns in a corpus (Section 2). Section 3 introduces a new algorithm that uses WordNet to generalise extraction patterns and Section 4 an implementation of this approach. Section 5 describes an evaluation regime based on the MUC-6 management succession task [MUC, 1995]. The results of an experiment in which several methods for calculating the similarity between extraction patterns are compared are presented in Section 6. Section 7 compares the proposed approach with an existing method.

Petr Sojka, Key-Sun Choi, Christiane Fellbaum, Piek Vossen (Eds.): GWC 2006, Proceedings, pp. 95–102. © Masaryk University, 2005

2 Weakly Supervised Extraction Pattern Learning

We begin by outlining the general process of learning extraction patterns, similar to the approach presented by Yangarber [2003].

1. For a given IE scenario we assume the existence of a set of documents against which the system can be trained. The documents are either relevant (contain the description of an event relevant to the scenario) or irrelevant. However, the documents are not annotated and the algorithm does not have access to this information.

2. This corpus is pre-processed to generate a set of all patterns which could be used to represent sentences contained in the corpus; call this set P. The aim of the learning process is to identify the subset of P representing patterns which are relevant to the IE scenario.

3. The user provides a small set of seed patterns, Pseed, which are relevant to the scenario. These patterns are used to form the set of currently accepted patterns, Pacc, so Pacc ← Pseed. The remaining patterns are treated as candidates for inclusion in the accepted set, forming the set Pcand (= P − Pacc).

4. A function, f, is used to assign a score to each pattern in Pcand based on those which are currently in Pacc. This function assigns a real number to candidate patterns, so ∀c ∈ Pcand, f(c, Pacc) ∈ ℝ. A set of high scoring patterns (based on absolute scores or ranks after the set of patterns has been ordered by scores) are chosen as being suitable for inclusion in the set of accepted patterns. These form the set Plearn.

5. The patterns in Plearn are added to Pacc and removed from Pcand, so Pacc ← Pacc ∪ Plearn and Pcand ← Pcand − Plearn.

6. If a suitable set of patterns has been learned then stop, otherwise go to step 4.

An important choice in the development of such an algorithm is step 4, the process of ranking the candidate patterns; this effectively determines which of the candidate patterns will be learned. Yangarber et al. [2000] chose an approach motivated by the assumption that documents containing a large number of patterns which have already been identified as relevant to a particular IE scenario are likely to contain more relevant patterns. Patterns which occur in these documents far more than others will then receive high scores. This approach can be viewed as being document-centric.

This approach has been shown to successfully acquire useful extraction patterns which, when added to an IE system, improved its performance [Yangarber et al., 2000]. However, it relies on an assumption about the way in which relevant patterns are distributed in a document collection and may learn patterns which tend to occur in the same documents as relevant ones whether or not they are actually relevant. For example, we could imagine an IE scenario in which relevant documents contain a piece of information which is related to, but distinct from, the information we aim to extract. If patterns expressing this information were more likely to occur in relevant documents than irrelevant ones, the document-centric approach would also learn these irrelevant patterns.

Rather than focusing on the documents matched by a pattern, an alternative approach is to rank patterns according to how similar their meanings are to those which are known to be relevant. This approach is motivated by the fact that the same event can be described in different ways in natural language. Once a pattern has been identified as being relevant it is highly likely that its paraphrases and patterns with similar meanings will also be relevant to the same extraction task. This approach also avoids the problem which may be present in the document-centric approach, since patterns which happen to co-occur in the same documents as relevant ones but have different meanings will not be ranked highly.

The approach presented here uses WordNet [Fellbaum, 1998] to determine pattern similarity. Other systems which use WordNet to help with the learning of IE patterns include [Chai and Biermann, 1999; Català et al., 2003], although they used WordNet's hierarchical structure to generalise patterns rather than identify those with similar meanings.

3 Semantic IE Pattern Learning

For these experiments extraction patterns consist of predicate-argument structures, as proposed by Yangarber [2003]. Under this scheme patterns consist of triples representing the subject, verb, and object (SVO) of a clause. The first element is the "semantic" subject (or agent); for example, "John" is a clausal subject in each of the sentences "John hit Bill", "Bill was hit by John", "Mary saw John hit Bill", and "John is a bully". The second element is the verb in the clause and the third the object (patient) or predicate. "Bill" is a clausal object in the first three example sentences and "bully" in the final sentence. When a verb is being used intransitively, the pattern for that clause is restricted to only the first pair of elements.

The filler of each pattern element can be either a lexical item or a semantic category such as person name, country, currency values, numerical expressions etc. In this paper lexical items are represented in lower case and semantic categories are capitalised. For example, in the pattern COMPANY+fired+ceo, fired and ceo are lexical items and COMPANY a semantic category which could match any lexical item belonging to that type.

A vector space model, similar to the ones used in Information Retrieval [Salton and McGill, 1983], is used to represent patterns and a similarity metric defined to identify those with similar meanings. Each pattern can be represented as a set of pattern element-filler pairs. For example, the pattern COMPANY+fired+ceo consists of three pairs: subject_COMPANY, verb_fire and object_ceo.

The similarity between element-filler pairs is symmetric, so wij = wji (W is symmetric). Pairs with different pattern elements (i.e. grammatical roles) are automatically given a similarity score of 0. Diagonal elements of W represent the self-similarity between pairs and have the greatest values. The actual values denoting the similarity between pattern elements are acquired using existing lexical similarity metrics (see Section 4).

Figure 1 gives an example using three patterns which shows how they could be represented as vectors given the set of element-filler pairs forming a basis for the vector space. A similarity matrix with example values is also shown.

Table 1: Similarity values for example patterns using Equation 1 and cosine metric
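The iterative procedure outlined in Section 2 can be sketched in a few lines of Python. This is an illustrative reimplementation, not the authors' code: the scoring function, the per-iteration cutoff, and the stopping criterion are placeholders for the choices the paper discusses under step 4.

```python
def learn_patterns(seed_patterns, all_patterns, score, n_per_iter=4, max_iters=10):
    """Weakly supervised pattern learning (Section 2, steps 1-6).

    seed_patterns: initial relevant patterns, Pseed (step 3)
    all_patterns:  every candidate pattern extracted from the corpus, P (step 2)
    score:         f(candidate, accepted) -> real number (step 4)
    """
    p_acc = set(seed_patterns)           # step 3: Pacc <- Pseed
    p_cand = set(all_patterns) - p_acc   # Pcand = P - Pacc
    for _ in range(max_iters):           # step 6: placeholder stopping criterion
        if not p_cand:
            break
        # step 4: rank candidates by f(c, Pacc) and keep the top n as Plearn
        ranked = sorted(p_cand, key=lambda c: score(c, p_acc), reverse=True)
        p_learn = set(ranked[:n_per_iter])
        # step 5: move the learned patterns from Pcand into Pacc
        p_acc |= p_learn
        p_cand -= p_learn
    return p_acc

# Demo with toy SVO triples and a toy scoring function that simply counts
# accepted patterns sharing the candidate's subject (illustration only).
def shared_subject_score(candidate, accepted):
    return sum(1 for a in accepted if a[0] == candidate[0])

seeds = {("COMPANY", "fire", "ceo")}
corpus_patterns = {("COMPANY", "fire", "ceo"),
                   ("COMPANY", "appoint", "president"),
                   ("dog", "bark")}
learned = learn_patterns(seeds, corpus_patterns, shared_subject_score,
                         n_per_iter=1, max_iters=1)
```

The document-centric method of Yangarber et al. [2000] and the WordNet-based method proposed here differ only in the `score` function; the surrounding loop is the same.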
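The SVO pattern representation of Section 3 maps directly onto a small data structure. A minimal sketch follows; the helper name and tuple encoding are my own, not the paper's.

```python
def pattern_pairs(subject, verb, obj=None):
    """Decompose an SVO extraction pattern into element-filler pairs.

    When the verb is used intransitively the pattern is restricted to the
    first pair of elements, so no object pair is produced (Section 3).
    """
    pairs = [("subject", subject), ("verb", verb)]
    if obj is not None:  # transitive: third element is the object or predicate
        pairs.append(("object", obj))
    return pairs

# COMPANY+fired+ceo -> subject_COMPANY, verb_fire, object_ceo
fired_ceo = pattern_pairs("COMPANY", "fire", "ceo")
# Intransitive use, e.g. "John resigned" -> subject and verb pairs only
resigned = pattern_pairs("john", "resign")
```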
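Section 3 describes representing each pattern as a vector over element-filler pairs and storing pair similarities in a symmetric matrix W. Equation 1 itself is not reproduced in this excerpt, so the sketch below assumes a W-weighted cosine measure, a·W·bᵀ / (|a||b|), and uses invented similarity values in place of the WordNet-derived ones described in Section 4.

```python
import math

def pattern_similarity(a, b, basis, w):
    """Similarity between two patterns, each a set of element-filler pairs.

    basis: ordered list of all element-filler pairs (the vector space basis)
    w:     symmetric matrix; w[i][j] is the similarity of basis pairs i and j,
           0 for pairs with different grammatical roles, largest on the
           diagonal (self-similarity), as described in Section 3.
    """
    va = [1.0 if p in a else 0.0 for p in basis]  # binary pattern vectors
    vb = [1.0 if p in b else 0.0 for p in basis]
    n = len(basis)
    dot = sum(va[i] * w[i][j] * vb[j] for i in range(n) for j in range(n))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(x * x for x in vb))
    return dot / norm if norm else 0.0

# Toy basis and hand-picked similarity values (not WordNet-derived).
basis = [("subject", "COMPANY"), ("verb", "fire"),
         ("verb", "dismiss"), ("object", "ceo")]
w = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.9, 0.0],   # fire ~ dismiss: similar verbs, same role
     [0.0, 0.9, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
fire_ceo = {("subject", "COMPANY"), ("verb", "fire"), ("object", "ceo")}
dismiss_ceo = {("subject", "COMPANY"), ("verb", "dismiss"), ("object", "ceo")}
sim = pattern_similarity(fire_ceo, dismiss_ceo, basis, w)
```

With the identity matrix for W this reduces to the ordinary cosine measure; the off-diagonal entries are what allow near-paraphrases such as COMPANY+fire+ceo and COMPANY+dismiss+ceo to score highly despite sharing no verb filler.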