Relational Learning of Pattern-Match Rules for Information Extraction
Mary Elaine Califf and Raymond J. Mooney
Department of Computer Sciences
University of Texas at Austin
Austin, TX 78712
{mecaliff,mooney}@cs.utexas.edu

Mary Elaine Califf and Raymond J. Mooney (1997) Relational Learning of Pattern-Match Rules for Information Extraction. In T.M. Ellison (ed.) CoNLL97: Computational Natural Language Learning, ACL, pp. 9-15.

Abstract

Information extraction systems process natural language documents and locate a specific set of relevant items. Given the recent success of empirical or corpus-based approaches in other areas of natural language processing, machine learning has the potential to significantly aid the development of these knowledge-intensive systems. This paper presents a system, RAPIER, that takes pairs of documents and filled templates and induces pattern-match rules that directly extract fillers for the slots in the template. The learning algorithm incorporates techniques from several inductive logic programming systems and learns unbounded patterns that include constraints on the words and part-of-speech tags surrounding the filler. Encouraging results are presented on learning to extract information from computer job postings from the newsgroup misc.jobs.offered.

1 Introduction

An increasing amount of information is available in the form of electronic documents. The need to intelligently process such texts makes information extraction (IE), the task of locating specific pieces of data from a natural language document, a particularly useful sub-area of natural language processing (NLP). In recognition of their significance, IE systems have been the focus of DARPA's MUC program (Lehnert and Sundheim, 1991). Unfortunately, IE systems are difficult and time-consuming to build, and the resulting systems generally contain highly domain-specific components, making them difficult to port to new domains.

Recently, several researchers have begun to apply learning methods to the construction of IE systems (McCarthy and Lehnert, 1995; Soderland et al., 1995; Soderland et al., 1996; Riloff, 1993; Riloff, 1996; Kim and Moldovan, 1995; Huffman, 1996). Several symbolic and statistical methods have been employed, but learning is generally used to construct only part of a larger IE system. Our system, RAPIER (Robust Automated Production of Information Extraction Rules), learns rules for the complete IE task. The resulting rules extract the desired items directly from documents without prior parsing or subsequent processing. Using only a corpus of documents paired with filled templates, RAPIER learns unbounded Eliza-like patterns (Weizenbaum, 1966) that utilize limited syntactic information, such as the output of a part-of-speech tagger. Induced patterns can also easily incorporate semantic class information, such as that provided by WordNet (Miller et al., 1993). The learning algorithm was inspired by several Inductive Logic Programming (ILP) systems and primarily consists of a specific-to-general (bottom-up) search for patterns that characterize slot-fillers and their surrounding context.

The remainder of the paper is organized as follows. Section 2 presents background material on IE and relational learning. Section 3 describes RAPIER's rule representation and learning algorithm. Section 4 presents and analyzes results obtained on extracting information from messages posted to the newsgroup misc.jobs.offered. Section 5 discusses related work in applying learning to IE, Section 6 suggests areas for future research, and Section 7 presents our conclusions.

2 Background

2.1 Information Extraction

In information extraction, the data to be extracted from a natural language text is given by a template specifying a list of slots to be filled.
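As a concrete illustration of this task formulation, a filled template can be viewed as a simple mapping from slot names to fillers. The sketch below is our own illustrative example, not code from RAPIER; the slot names follow the job-posting domain used in the paper's experiments.

```python
# Illustration only: a template assigns a filler to each slot.
# Fillers are either values from a pre-specified set or strings
# taken directly from the document.

document = (
    "Leading telecommunications firm in need of an energetic individual "
    "to fill the following position in the Atlanta office: "
    "SOLARIS SYSTEMS ADMINISTRATOR. Salary: 38-44K with full benefits."
)

filled_template = {
    "title": "SOLARIS Systems Administrator",
    "salary": "38-44K",
    "city": "Atlanta",
    "platform": "SOLARIS",
    "area": "telecommunications",
}

# Every filler in this example is a string appearing verbatim
# (up to case) in the source document.
for slot, filler in filled_template.items():
    assert filler.lower() in document.lower(), (slot, filler)
```

The IE task is then: given only the document, produce such a mapping automatically.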
© 1997 Association for Computational Linguistics

Posting from Newsgroup

Telecommunications. SOLARIS Systems Administrator. 38-44K. Immediate need

Leading telecommunications firm in need of an energetic individual to fill the following position in the Atlanta office:

SOLARIS SYSTEMS ADMINISTRATOR
Salary: 38-44K with full benefits
Location: Atlanta Georgia, no relocation assistance provided

Filled Template

computer_science_job
title: SOLARIS Systems Administrator
salary: 38-44K
state: Georgia
city: Atlanta
platform: SOLARIS
area: telecommunications

Figure 1: Sample Message and Filled Template

The slot fillers may be either one of a set of specified values or strings taken directly from the document. For example, Figure 1 shows part of a job posting and the corresponding slots of the filled computer-science job template.

IE can be useful in a variety of domains. The various MUCs have focused on domains such as Latin American terrorism, joint ventures, microelectronics, and company management changes. Others have used IE to track medical patient records (Soderland et al., 1995) or company mergers (Huffman, 1996). A general task considered in this paper is extracting information from postings to USENET newsgroups, such as job announcements. Our overall goal is to extract a database from all the messages in a newsgroup and then use learned query parsers (Zelle and Mooney, 1996) to answer natural language questions such as "What jobs are available in Austin for C++ programmers with only one year of experience?". Numerous other Internet applications are possible, such as extracting information from product web pages for a shopping agent (Doorenbos, Etzioni, and Weld, 1997).

2.2 Relational Learning

Most empirical natural-language research has employed statistical techniques that base decisions on very limited contexts, or symbolic techniques such as decision trees that require the developer to specify a manageable, finite set of features for use in making decisions. Inductive logic programming and other relational learning methods (Birnbaum and Collins, 1991) allow induction over structured examples that can include first-order logical predicates and functions and unbounded data structures such as lists, strings, and trees. Detailed experimental comparisons of ILP and feature-based induction have demonstrated the advantages of relational representations in two language-related tasks: text categorization (Cohen, 1995) and generating the past tense of an English verb (Mooney and Califf, 1995). While RAPIER is not strictly an ILP system, its relational learning algorithm was inspired by ideas from the following ILP systems.

GOLEM (Muggleton and Feng, 1992) is a bottom-up (specific to general) ILP algorithm based on the construction of relative least-general generalizations, rlggs (Plotkin, 1970). The idea of least-general generalizations (LGGs) is, given two items (in ILP, two clauses), finding the least general item that covers the original pair. This is usually a fairly simple computation. Rlggs are the LGGs relative to a set of background relations. Because of the difficulties introduced by non-finite rlggs, background predicates must be defined extensionally. The algorithm operates by randomly selecting several pairs of positive examples and computing the determinate rlggs of each pair. Determinacy constrains the clause to have for each example no more than one possible valid substitution for each variable in the body of the clause. The resulting clause with the greatest coverage of positive examples is selected, and that clause is further generalized by computing the rlggs of the selected clause with new randomly chosen positive examples. The generalization process stops when the coverage of the best clause no longer increases.

The CHILLIN (Zelle and Mooney, 1994) system combines top-down (general to specific) and bottom-up ILP techniques. The algorithm starts with a most specific definition (the set of positive examples) and introduces generalizations which make the definition more compact. Generalizations are created by selecting pairs of clauses in the definition and computing LGGs. If the resulting clause covers negative examples, it is specialized by adding antecedent literals in a top-down fashion. The search for new literals is carried out in a hill-climbing fashion, using an information gain metric for evaluating literals. This is similar to the search employed by FOIL (Quinlan, 1990). In cases where a correct clause cannot be learned with the existing background relations, CHILLIN attempts to construct new predicates which will distinguish the covered negative examples from the covered positives. At each step, a number of possible generalizations are considered; the one producing the greatest compaction of the theory is implemented, and the process repeats. CHILLIN uses the notion of empirical subsumption, which means that as new, more general clauses are added, all of the clauses which are not needed to prove positive examples are removed from the definition.

Pre-filler Pattern:      Filler Pattern:        Post-filler Pattern:
1) word: leading         1) list: len: 2        1) word: [firm, company]
                            tags: [nn, nns]

Figure 2: A Rule Extracting an Area Filler from the Example Document

PROGOL (Muggleton, 1995) also combines bottom-up and top-down search. Using mode declarations provided for both the background predicates and the predicate being learned, it constructs a most specific clause for a random seed example.
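The behavior of a pattern-match rule like the one in Figure 2 can be sketched in code. The following is our own simplified illustration, not RAPIER's implementation; the tag names and data structures are assumptions made for the example.

```python
# Simplified illustration of the Figure 2 rule:
#   pre-filler:  the word 'leading'
#   filler:      a list of at most 2 tokens, each tagged 'nn' or 'nns'
#   post-filler: the word 'firm' or 'company'
# A pattern list matches 0..N tokens; this sketch tries longer
# matches first. Not RAPIER's actual code.

def match_area_rule(tokens, tags):
    """Return the area filler extracted by the rule, or None."""
    for i, word in enumerate(tokens):
        if word.lower() != "leading":            # pre-filler pattern
            continue
        for n in (2, 1, 0):                      # filler list, length 0..2
            filler_tags = tags[i + 1:i + 1 + n]
            if len(filler_tags) < n:
                continue                         # not enough tokens left
            if any(t not in ("nn", "nns") for t in filler_tags):
                continue                         # a token violates the tag constraint
            j = i + 1 + n                        # post-filler position
            if j < len(tokens) and tokens[j].lower() in ("firm", "company"):
                return " ".join(tokens[i + 1:j])
    return None

tokens = ["Leading", "telecommunications", "firm", "in", "need"]
tags   = ["jj",      "nns",                "nn",   "in", "nn"]
print(match_area_rule(tokens, tags))             # -> telecommunications
```

Preferring longer filler matches here is a design choice of this sketch; the representation itself only requires that a list element match no more than its maximum length.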
A pattern list specifies a maximum length N and matches 0 to N words or symbols from the document, each of which must match the list's