Colloquium

An unsupervised method for the extraction of propositional information from text

Simon Dennis*

Institute of Cognitive Science, University of Colorado, Boulder, CO 80301

Recent developments in question-answering systems have demonstrated that approaches based on propositional analysis of source text, in conjunction with formal inference systems, can produce substantive improvements in performance over surface-form approaches [Voorhees, E. M. (2002) in Eleventh Text Retrieval Conference, eds. Voorhees, E. M. & Buckland, L. P., http://trec.nist.gov/pubs/trec11/t11_proceedings.html]. However, such systems are hampered by the need to create broad-coverage knowledge bases by hand, making them difficult to adapt to new domains and potentially fragile if critical information is omitted. To demonstrate how this problem might be addressed, the Syntagmatic Paradigmatic model, a memory-based account of sentence processing, is used to autonomously extract propositional knowledge from unannotated text. The Syntagmatic Paradigmatic model assumes that people store a large number of sentence instances. When trying to interpret a new sentence, similar sentences are retrieved from memory and aligned with the new sentence by using String Edit Theory. The set of alignments can be considered an extensional interpretation of the sentence. Extracting propositional information in this way not only permits the model to answer questions for which the relevant facts are explicitly stated in the text but also allows the model to take advantage of "inference by coincidence," where implicit inference occurs as an emergent property of the mechanism. To illustrate the potential of this approach, the model is tested for its ability to determine the winners of matches as reported on the Association of Tennis Professionals web site.

The closely related fields of question answering and information extraction aim to search large databases of textual material (textbases) to find specific information required by the user (1, 2). As opposed to information retrieval systems, which attempt to identify relevant documents that discuss the topic of the user's information need, information extraction systems return specific information such as names, dates, or amounts that the user requests. Although information retrieval systems (such as Google and Alta Vista) are now in widespread commercial use, information extraction is a much more difficult task and, with some notable exceptions, most current systems are research prototypes. However, the potential significance of reliable information extraction systems is substantial. In military, scientific, and business intelligence gathering, being able to identify specific entities and resources of relevance across documents is crucial. Furthermore, some current information extraction systems now attempt the even more difficult task of providing summaries of relevant information compiled across a document set.

The majority of current information extraction systems are based on surface analysis of text applied to very large textbases. Whereas the dominant approaches in the late 1980s and early 1990s would attempt deep linguistic analysis, proposition extraction, and reasoning, most current systems look for answer patterns within the raw text and apply simple heuristics to extract relevant information (3). Such approaches have been shown to work well when information is represented redundantly in the textbase and when the type of the answer is unambiguously specified by the question and tends to be unique within a given sentence or sentence fragment. Although these conditions often hold for general knowledge questions of the kind found in the Text REtrieval Conference (TREC) Question Answer track, there are many intelligence applications for which they cannot be guaranteed. Often relevant information will be stated only once or may only be inferred and never stated explicitly. Furthermore, the results of the most recent TREC question-answer competition suggest that deep reasoning systems may now have reached a level of sophistication that allows them to surpass the performance possible using surface-based approaches. In the 2002 TREC competition, the POWER ANSWER system (4), which converts both questions and answers into propositional form and uses an inference engine, achieved a confidence weighted score of 0.856, a substantive improvement over the second-placed EXACTANSWER (5), which received a score of 0.691 in the main question-answering task.

A key component in the performance of the POWER ANSWER system is its use of the WORDNET lexical database (6). WORDNET provides a catalog of simple relationships among words, such as synonymy, hypernymy, and part-of relations, that POWER ANSWER uses to supplement its inference system. Despite the relatively small number of relations considered and the difficulties in achieving good coverage in a hand-coded resource, the additional background knowledge provided by WORDNET significantly improves the performance of the system. This fact suggests that further gains may be achieved if automated methods for extracting a broader range of propositional information could be used in place of, or in conjunction with, the WORDNET database.

In recent years, there have been a number of attempts to build systems capable of extracting propositional information from sentences (7-9). For instance, given the sentence:

    Sampras outguns Agassi in US Open Final,

these systems might produce an annotation such as:

    [Winner Sampras] outguns [Loser Agassi] [Loc in US Open Final].

This work has been driven, at least in part, by the availability of semantically labeled corpora such as Penn Treebank (10) and FRAMENET (11). As a consequence, the semantic roles used by the systems are those defined by the corpus annotators. However, deciding on a best set of semantic roles has proven extremely difficult. There are a great many schemes that have been proposed, ranging in granularity from very broad, such as the two-macro-role proposal of ref. 12, through theories that propose nine or 10 roles, such as ref. 13, to much more specific schemes that contain domain-specific slots, such as ORIG_CITY, DEST_CITY, or DEPART_TIME, that are used in practical dialogue understanding systems (14).

This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, "Mapping Knowledge Domains," held May 9-11, 2003, at the Arnold and Mabel Beckman Center of the National Academies of Sciences and Engineering in Irvine, CA.

Abbreviations: SP, Syntagmatic Paradigmatic; SET, String Edit Theory; RT, relational trace; EM, expectation maximization; LTM, long-term memory.

*E-mail: [email protected].

© 2004 by The National Academy of Sciences of the USA

That there is much debate about semantic role sets and that existing systems must commit to a scheme a priori are important limitations of existing work and, I will argue, are consequences of a commitment to intentional semantics. In systems that use intentional semantics, the meanings of representations are defined by their intended use and have no inherent substructure. For instance, the statement "Sampras outguns Agassi" might be represented as:

    Sampras: Winner
    Agassi: Loser

However, the names of the roles are completely arbitrary and carry representational content only by virtue of the inference system in which they are embedded.

Now contrast the above situation with an alternative extensional representation of "Sampras outguns Agassi," in which roles are defined by enumerating exemplars, as follows:

    Sampras: Kuerten, Hewitt
    Agassi: Roddick, Costa

The winner role is represented by the distributed pattern of Kuerten and Hewitt, words chosen because they are the names of people who have filled the "X" slot in a sentence like "X outguns Y" within the experience of the system. Similarly, Roddick and Costa are the names of people who have filled the "Y" slot in such a sentence and form a distributed representation of the loser role.

Note the issue is not just a matter of distributed vs. symbolic representation. The tensor product representation used in the STAR model (15) of analogical reasoning uses distributed representations of the fillers but assigns a unique rank to each role and so is an intentional scheme. By contrast, the temporal binding mechanism proposed by ref. 16 allows for both distributed filler and role vectors and hence could implement extensional semantics.

The use of extensional semantics of this kind has a number of advantages. First, defining a mapping from raw sentences to extensional-meaning representations is much easier than defining a mapping to intentional representations, because it is now necessary only to align sentence exemplars from a corpus with the target sentence. The difficult task of either defining or inducing semantic categories is avoided.

Second, because the role is now represented by a distributed pattern, it is possible for the one role vector to simultaneously represent roles at different levels of granularity. The pattern {Kuerten, Hewitt} could be thought of as a protoagent, an agent, a winner, and a winner of a tennis match, simultaneously. The role vectors can be determined from a corpus during processing, and no commitment to an a priori level of role description is necessary.

Third, extensional representations carry content by virtue of the other locations in the experience of the system where those symbols have occurred. That is, the systematic history of the comprehender grounds the representation. For instance, we might expect systematic overlap between the winner and person-who-is-wealthy roles, because some subset of {Kuerten, Hewitt} may also have occurred in an utterance such as "X is wealthy." These contingencies occur as a natural consequence of the causality being described by the corpus. We will call this type of implicit inference "inference by coincidence" and, as we will see in subsequent sections, the performance of the model is in large part due to this emergent property.
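To make the contrast concrete, the following sketch shows one way an extensional role representation could be assembled from raw sentences. It is illustrative only: the toy corpus and the slot-indexing scheme are assumptions of this example, not part of the model's specification.

# A minimal sketch of extensional role representation, under the
# assumption of a toy corpus and a simple "slot = position relative to
# the verb" indexing scheme. In the SP model itself, roles are never
# named; they are just the sets of words recovered by alignment.

from collections import defaultdict

corpus = [
    "Kuerten outguns Roddick",
    "Hewitt outguns Costa",
]

fillers = defaultdict(set)
for sentence in corpus:
    x, verb, y = sentence.split()
    fillers[(verb, "X")].add(x)   # words seen in the "X outguns Y" X slot
    fillers[(verb, "Y")].add(y)   # words seen in the "X outguns Y" Y slot

# "Sampras outguns Agassi" is then interpreted extensionally by pairing
# each word with the prior fillers of its slot:
print("Sampras:", fillers[("outguns", "X")])   # {'Kuerten', 'Hewitt'}
print("Agassi: ", fillers[("outguns", "Y")])   # {'Roddick', 'Costa'}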
In the next section, we give a brief introduction to String Edit Theory (SET), which is used in the model to identify sentences from the corpus suitable for alignment with the current target and to define how these sentences should align. Next, the steps involved in interpreting a sentence in the model will be outlined. Then, the Tennis News domain that was chosen to test the model is described, and the results are presented. Finally, some factors that remain to be addressed are discussed.

Introduction to SET

SET was popularized in a book entitled Time Warps, String Edits and Macromolecules (17) and has been developed in both the fields of computer science and molecular biology (18-21). As the name suggests, the purpose of SET is to describe how one string, which could be composed of words, letters, amino acids, etc., can be edited to form a second string; that is, what components must be inserted, deleted, or changed to turn one string into another. In the model, SET is used to decide which sentences from a corpus are most like the target sentence, and which tokens within these sentences should align.

As an example, suppose we are trying to align the sentences "Sampras defeated Agassi" and "Kuerten defeated Roddick." The most obvious alignment is that which maps the two sentences to each other in a one-to-one fashion.

    Sampras  defeated  Agassi
       |         |        |                                    [A1]
    Kuerten  defeated  Roddick

In this alignment, we have three edit operations. There is a change of "Sampras" for "Kuerten," a match of "defeated," and a change of "Agassi" for "Roddick." In fact, this alignment can also be expressed as a sequence of edit operations:

    ⟨Sampras, Kuerten⟩
    ⟨defeated, defeated⟩
    ⟨Agassi, Roddick⟩

In SET, sentences do not have to be of the same length to be aligned. If we add "Pete" to the first sentence, we can use a delete to describe one way in which the resulting sentences could be aligned.

    Pete     Sampras  defeated  Agassi
     |          |         |        |                           [A2]
    Kuerten     –     defeated  Roddick

The "–" symbol is used to fill the slot left by a deletion (or an insertion) and can be thought of as the empty word. The corresponding edit operation is denoted by ⟨Sampras, –⟩. Although these alignments may be the most obvious ones, there are many other options.

For instance, in aligning "Sampras defeated Agassi" and "Kuerten defeated Roddick," we could start by deleting "Sampras."

    Sampras  defeated  Agassi     –
       |         |        |       |                            [A3]
       –     Kuerten  defeated  Roddick

Note that "Roddick" is now inserted at the end of the alignment (denoted ⟨–, Roddick⟩).

Alternatively, we could have deleted "Sampras" and then inserted "Kuerten," to give the following.

    Sampras     –     defeated  Agassi
       |        |        |         |                           [A4]
       –     Kuerten  defeated  Roddick

There are a total of 63 ways in which "Sampras defeated Agassi" can be aligned with "Kuerten defeated Roddick," but not all of these alignments are equally likely. Intuitively, alignment A4 seems better than A3, because the word "defeated" is matched. However, this alignment still seems worse than A1 because it requires "Sampras" to be deleted and "Kuerten" to be inserted. A mechanism that produces alignments of sentences should favor those that have many matches and should penalize those that require many insertions and deletions.
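The figure of 63 alignments can be checked with a short recursion: every alignment ends with either a match or change of the two final words, a deletion, or an insertion. The sketch below is not from the paper; it is included only to verify the count.

# Counting the alignments of two strings. Every alignment ends in a
# match/change, a deletion, or an insertion, giving the recursion below.
# (Not from the paper; included only to verify the count of 63.)

from functools import lru_cache

@lru_cache(maxsize=None)
def num_alignments(n: int, m: int) -> int:
    if n == 0 or m == 0:
        return 1                             # only indels remain
    return (num_alignments(n - 1, m - 1)     # match or change
            + num_alignments(n - 1, m)       # delete from first string
            + num_alignments(n, m - 1))      # insert from second string

print(num_alignments(3, 3))   # 63, for "Sampras defeated Agassi" vs.
                              # "Kuerten defeated Roddick"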

To capture these intuitions, edit operations are assigned probabilities. Typically, match probabilities are higher than change probabilities, which are higher than insertion or deletion probabilities. Assuming conditional independence of the edit operations, the probability of an alignment is the product of the probabilities of the edit operations of which it is comprised. Each alignment is an exclusive hypothesis about how the two strings might be aligned, and so the probability that the strings are aligned in one of these ways is the sum of the probabilities of the alignments.

Given that there are an exponential number of alignments among strings, one may be concerned that any algorithm based on SET would be infeasible. However, there exist efficient dynamic programming algorithms that have O(nm) time and space complexity, where n and m are the lengths of the two strings (20).

Gap Probabilities

In the molecular biology literature, it is common to assign a lower probability to an initial insertion or deletion (known collectively as indels) and then higher probabilities to subsequent indels in the same block. As a consequence, alignments that involve long sequences of indels are favored over alignments that have many short sequences (22-24). In the context of macromolecule alignment, increasing the probability of block indels is desirable, because a single mutation event can often lead to a block insertion or deletion. An analogous argument is also applicable in the language case, because it is common for entire phrases or clauses to differentiate otherwise structurally similar sentences.

To illustrate the point, consider aligning the sentences "Sampras defeated Agassi" and "Sampras who defeated Roddick defeated Agassi." Two possible alignments are:

    Sampras    –        –        –      defeated  Agassi
       |       |        |        |         |        |          [A5]
    Sampras   who   defeated  Roddick   defeated  Agassi

and

    Sampras    –     defeated    –        –      Agassi
       |       |        |        |        |        |           [A6]
    Sampras   who   defeated  Roddick  defeated  Agassi

Note that these alignments have the same matches and deletions and so will have the same probabilities as calculated above. However, A5 should be preferred over A6, because it involves the block deletion of a clause. To capture this property, it is assumed that the probability of an initial indel is lower than the probability of a continuing indel. Now A5 will be favored because it involves a single start indel and two subsequent indels, whereas A6 has two start indels and one subsequent indel.† There exists an algorithm that calculates alignment probabilities under this model that retains O(nm) time and space complexity (22).

†Allison, Wallace, and Yee (21) describe this process in terms of a three-state finite-state machine and also generalize beyond the three-state case. Here, however, only the three-state case will be considered.
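The standard dynamic program is sketched below under assumed (untrained) edit probabilities. Each cell accumulates the summed probability of all alignments of the two prefixes, which is what makes the sum over the exponentially many alignments feasible. The three-state refinement that distinguishes start indels from continuing indels is omitted here for brevity.

# A sketch of the O(nm) dynamic program that sums the probabilities of
# all alignments of two strings. The edit probabilities are assumed for
# illustration, not the trained values, and the three-state (block
# indel) refinement described above is omitted.

P_MATCH, P_CHANGE, P_INDEL = 0.9, 0.05, 0.025   # assumed values

def alignment_prob(s, t):
    """Total probability that s aligns with t, summed over alignments."""
    n, m = len(s), len(t)
    # f[i][j] = summed probability of all alignments of s[:i] and t[:j]
    f = [[0.0] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:   # last operation was a match or a change
                sub = P_MATCH if s[i - 1] == t[j - 1] else P_CHANGE
                f[i][j] += f[i - 1][j - 1] * sub
            if i:         # last operation deleted s[i-1]
                f[i][j] += f[i - 1][j] * P_INDEL
            if j:         # last operation inserted t[j-1]
                f[i][j] += f[i][j - 1] * P_INDEL
    return f[n][m]

print(alignment_prob("Sampras defeated Agassi".split(),
                     "Kuerten defeated Roddick".split()))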
We have now completed the overview of SET. In the next section, the way in which the model exploits SET is described.

Description of the Syntagmatic Paradigmatic (SP) Model

In the SP model, sentence processing is characterized as the retrieval of associative constraints from sequential and relational long-term memory (LTM) and the resolution of these constraints in working memory. Sequential LTM contains the sentences from the corpus. Relational LTM contains the extensional representations of the same sentences (see Fig. 1).

Fig. 1. SP architecture. #, empty slot. Ultimately, it will contain the answer to the question.

Creating an interpretation of a sentence/utterance involves the following steps.

Sequential Retrieval. The current sequence of input words is used to probe sequential memory for traces containing similar sequences of words. In the example, traces four and five, "Who did Kuerten beat? Roddick" and "Who did Hewitt beat? Costa," are the closest matches to the target sentence "Who did Sampras beat? #" and are assigned high probabilities (see Fig. 2).

Fig. 2. Sequential retrieval. The traces "Who did Kuerten beat? Roddick" and "Who did Hewitt beat? Costa" are most similar to the input sentence "Who did Sampras beat? #" and are retrieved from sequential LTM. Bold type is used to indicate the traces that are retrieved.

To calculate the retrieval strength of a sequential trace, we take a similar approach to that adopted by the Bayesian models of recognition memory (25, 26), which have proven very successful at capturing a variety of memory effects.

Using the terminology S_k → T to indicate that sequential trace S_k generated the target sentence T, we start with the odds ratio for S_k → T given T and use the Bayes theorem to convert to a likelihood ratio:

$$\frac{P(S_k \to T \mid T)}{P(S_k \not\to T \mid T)} = \frac{P(T \mid S_k \to T)\,P(S_k \to T)}{P(T \mid S_k \not\to T)\,P(S_k \not\to T)} = \frac{P(T \mid S_k \to T)}{P(T \mid S_k \not\to T)} \tag{1}$$

assuming equal priors.

To calculate P(T | S_k → T), we use as our data the possible edit sequences that may have been used to transform the trace into the target sentence. Each edit sequence represents an exclusive hypothesis about how S_k generated T, so the probabilities of these hypotheses can be added to determine P(T | S_k → T):

$$P(T \mid S_k \to T) = \sum_{a_p \in A_k} P(a_p \mid S_k \to T) \tag{2}$$

where a_p is one of the edit sequences relating T and S_k, and A_k is the set of these sequences.

Assuming that the edit operations (match, change, insert, or delete) are sampled independently to create alignments:

$$P(T \mid S_k \to T) = \sum_{a_p \in A_k} \prod_{e_r \in a_p} P(e_r \mid S_k \to T) \tag{3}$$

where e_r is the rth edit operation in alignment a_p. Similarly,

$$P(T \mid S_k \not\to T) = \sum_{a_p \in A_k} \prod_{e_r \in a_p} P(e_r \mid S_k \not\to T). \tag{4}$$

So, rearranging Eq. 1 and substituting in Eqs. 3 and 4,

$$P(S_k \to T \mid T) = \frac{\sum_{a_p \in A_k} \prod_{e_r \in a_p} P(e_r \mid S_k \to T)}{\sum_{a_p \in A_k} \prod_{e_r \in a_p} P(e_r \mid S_k \to T) + \sum_{a_p \in A_k} \prod_{e_r \in a_p} P(e_r \mid S_k \not\to T)}. \tag{5}$$

The expected retrieval probability is P(S_k → T | T) normalized over the traces in memory,

$$\frac{P(S_k \to T \mid T)}{\sum_j P(S_j \to T \mid T)}.$$

Sequential Resolution. The retrieved sequences are then aligned with the target sentence to determine the appropriate set of substitutions for each word (see Fig. 3). Note that the slot adjacent to the "#" symbol contains the pattern {Costa, Roddick}. This pattern represents the role that the answer to the question must fill (i.e., the answer is the loser).

Fig. 3. Sequential resolution. Kuerten and Hewitt align with Sampras, and Roddick and Costa align with the answer slot ("#").

During sequential resolution, we are interested in calculating E_k[P(⟨W_m, T_i⟩ | T)], the expected value of the probability that word T_i in slot i in the target sentence substitutes for the word W_m from the lexicon (W) in the context of sentence T. We can write

$$E_k[P(\langle W_m, T_i \rangle \mid T)] = \sum_{k=1}^{N} \frac{P(S_k \to T \mid T)}{\sum_{i=1}^{N} P(S_i \to T \mid T)}\, P(\langle W_m, T_i \rangle \mid S_k \to T, T) \tag{6}$$

where N is the number of sequential traces in memory. Now we have divided the task into determining the probability that sequential trace k generated the target (which we calculated in the last section) and determining the probability that T_i and S_kj align given that trace k did generate the target.

Calculating the latter is straightforward, now that we have defined how alignments and edit operations are related. Because a given edit operation is either in an alignment or not, we can just add the probabilities of the alignments in which this change occurs and normalize by the probability of all the alignments:

$$P(\langle W_m, T_i \rangle \mid S_k \to T, T) = \sum_{W_m = S_{kj}} P(\langle S_{kj}, T_i \rangle \mid S_k \to T, T) = \sum_{W_m = S_{kj}} \frac{\sum_{a_p \in A_k:\, \langle S_{kj}, T_i \rangle \in a_p} P(a_p \mid S_k \to T)}{\sum_{a_p \in A_k} P(a_p \mid S_k \to T)}. \tag{7}$$

We now have an algorithm with which we can calculate the probabilities of substitution within sentential context.
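The sketch below puts Eqs. 5-7 together on the running example. It enumerates alignments explicitly, which is feasible only for short strings (unlike the dynamic program above), and it assumes a flat background distribution for the edit probabilities under the hypothesis that the trace did not generate the target. Both the probabilities and the background model are illustrative assumptions.

# A sketch of sequential retrieval (Eq. 5) and resolution (Eqs. 6-7) on
# the running example. Alignments are enumerated explicitly, which is
# feasible only for short strings. The generation and background edit
# probabilities are assumed values, not the trained parameters.

from math import prod

GEN = {"match": 0.9, "change": 0.05, "indel": 0.025}  # P(e_r | Sk -> T)
BG  = {"match": 0.1, "change": 0.1,  "indel": 0.1}    # background model

def alignments(s, t):
    """Yield alignments as lists of (trace_word, target_word) pairs,
    with None standing in for the empty word '-'."""
    if not s and not t:
        yield []
        return
    if s and t:
        for rest in alignments(s[1:], t[1:]):
            yield [(s[0], t[0])] + rest
    if s:
        for rest in alignments(s[1:], t):
            yield [(s[0], None)] + rest
    if t:
        for rest in alignments(s, t[1:]):
            yield [(None, t[0])] + rest

def op(pair):
    a, b = pair
    return "indel" if a is None or b is None else (
        "match" if a == b else "change")

def total(s, t, model):          # Eqs. 2-4
    return sum(prod(model[op(p)] for p in ap) for ap in alignments(s, t))

def retrieval(s, t):             # Eq. 5
    g, b = total(s, t, GEN), total(s, t, BG)
    return g / (g + b)

memory = ["Who did Kuerten beat ? Roddick".split(),
          "Who did Hewitt beat ? Costa".split(),
          "Sampras claims title".split()]
target = "Who did Sampras beat ? #".split()

w = [retrieval(s, target) for s in memory]
norm = [x / sum(w) for x in w]   # expected retrieval probabilities

# Sequential resolution (Eqs. 6-7): the distribution over words that
# substitute for the '#' slot, mixed over traces by retrieval weight.
slot = {}
for s, weight in zip(memory, norm):
    z = total(s, target, GEN)
    for ap in alignments(s, target):
        p = prod(GEN[op(x)] for x in ap)
        for a, b in ap:
            if b == "#" and a is not None:
                slot[a] = slot.get(a, 0.0) + weight * p / z
print(sorted(slot.items(), key=lambda kv: -kv[1])[:3])
# mass concentrates on 'Roddick' and 'Costa', the loser role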

Relational Retrieval. The bindings of input words to their corresponding role vectors (the relational representation of the target sentence) are then used to probe relational LTM. In this case, trace one is favored because it involves similar role-filler bindings. That is, it contains a binding of Sampras onto the {Kuerten, Hewitt} pattern, and it also contains the {Roddick, Costa} pattern. Despite the fact that "Sampras defeated Agassi" has a different surface form than "Who did Sampras beat? #," it contains similar relational information and consequently has a high retrieval probability (Fig. 4).

Fig. 4. Relational retrieval. The first relational trace is retrieved, because it contains similar role-filler bindings. Bold type is used to indicate the traces that are retrieved.

As in the sequential case, when interpreting a new target sentence, we will assume that its relational trace (RT) has been generated via a set of edit operations on one of the RTs in memory. Specifically, we assume that each binding in RT, which we will denote RT_i, was generated either by an insert or by taking one of the bindings in the retrieved trace (R_kj) and editing the head word and role vector. Applying the Bayes rule as we did in the sequential case, we get

$$\frac{P(R_k \to RT \mid RT)}{P(R_k \not\to RT \mid RT)} = \frac{P(RT \mid R_k \to RT)\,P(R_k \to RT)}{P(RT \mid R_k \not\to RT)\,P(R_k \not\to RT)} = \frac{P(RT \mid R_k \to RT)}{P(RT \mid R_k \not\to RT)} \tag{8}$$

assuming equal priors. So we must now calculate P(RT | R_k → RT). If RT contains M bindings, each of which is generated by an independent operation,

$$P(RT \mid R_k \to RT) = \prod_i^M P(RT_i \mid R_k \to RT). \tag{9}$$

Furthermore, each binding in RT was generated from one of the bindings in R_k or by an insert, so

$$P(RT_i \mid R_k \to RT) = \sum_j^N P(RT_i, R_{kj} \to RT_i \mid R_k \to RT) + P(RT_i, \mathrm{insert}(RT_i) \mid R_k \to RT) \tag{10}$$

and

$$P(RT \mid R_k \to RT) = \prod_i^M \Bigl[\sum_j^N P(RT_i, R_{kj} \to RT_i \mid R_k \to RT) + P(RT_i, \mathrm{insert}(RT_i) \mid R_k \to RT)\Bigr] = \prod_i^M \Bigl[\sum_j^N P(R_{kj} \to RT_i \mid R_k \to RT)\,P(RT_i \mid R_{kj} \to RT_i, R_k \to RT) + P(\mathrm{insert}(RT_i) \mid R_k \to RT)\Bigr]. \tag{11}$$

Note that if R_kj → RT_i, then R_k → RT, so P(RT_i | R_kj → RT_i, R_k → RT) = P(RT_i | R_kj → RT_i). Also, P(RT_i, insert(RT_i) | R_k → RT) = P(insert(RT_i) | R_k → RT). Now P(RT_i | R_kj → RT_i) is the probability that the head word of RT_i (T_i) substitutes for the head word of R_kj (S_kj), and that the vector of change probabilities of RT_i is an edited version of the R_kj vector. To determine the probability of head-word substitution, the prior substitution probability, P(⟨S_kj, T_i⟩ | S_k → T), can be used. To determine the probability of vector substitution, recall that each of the vectors is comprised of change probabilities. In each case, only one of the words could have substituted for their respective head words, so we can multiply the probability of the trace word (R_kjl) by the probability of the target word (RT_ip) and the probability that the trace word would substitute for the target word, P(⟨W_p, W_l⟩ | S_k → T), to obtain an estimate of the probability that the role vector of R_kj was edited to produce the role vector of RT_i, so

$$P(RT_i \mid R_{kj} \to RT_i) = P(\langle S_{kj}, T_i \rangle \mid S_k \to T) \sum_p \sum_l P(\langle W_p, W_l \rangle \mid S_k \to T)\, RT_{ip}\, R_{kjl} \tag{12}$$

where RT_ip is the pth component of the role vector of RT_i, and R_kjl is the lth component of the role vector of R_kj. The corresponding quantity under the hypothesis that the trace did not generate the target, P(RT_i | R_kj ↛ RT_i), is calculated in an analogous way by using P(⟨S_kj, T_i⟩ | S_k ↛ T) and P(⟨W_p, W_l⟩ | S_k ↛ T).

A similar logic is used to calculate the insertion probability, and

$$P(\mathrm{insert}(RT_i) \mid R_k \to RT) = P(\langle -, T_i \rangle \mid S_k \to T) \sum_p P(\langle -, W_p \rangle \mid S_k \to T)\, RT_{ip}. \tag{13}$$

And finally the retrieval component is

$$P(R_k \to RT \mid RT) = \frac{P(RT \mid R_k \to RT)}{P(RT \mid R_k \to RT) + P(RT \mid R_k \not\to RT)}. \tag{14}$$

As before, the expected retrieval probability is P(R_k → RT | RT) normalized over the traces in memory,

$$\frac{P(R_k \to RT \mid RT)}{\sum_j P(R_j \to RT \mid RT)}.$$

The above algorithm has space and time complexities O(|T| ‖S_k‖ |W|²), where |W| is the size of the vocabulary. Although in principle this is expensive, in practice there is typically a small set of traces that attract the majority of the probability mass. Traces with very low retrieval probabilities are truncated and, as a consequence, there are usually only a few nonzero entries in each role vector.
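The generative story of Eqs. 9-11 can be sketched compactly. In the sketch below, a relational trace is a list of bindings, each pairing a head word with a sparse role vector, and the binding-level probabilities are simplified stand-ins for Eqs. 12 and 13; all numbers are assumed for illustration.

# A sketch of the relational generation model (Eqs. 9-11). A relational
# trace is a list of bindings (head_word, role_vector), with sparse role
# vectors stored as dicts from words to change probabilities. The
# binding-level probabilities are simplified stand-ins for Eqs. 12 and
# 13, and all numbers are assumed for illustration.

P_SUB, P_BIND, P_INS = 0.05, 0.2, 0.01   # assumed edit probabilities

def binding_prob(rt_bind, rk_bind):
    """Stand-in for P(RT_i | R_kj -> RT_i), Eq. 12: a head-word
    substitution term times the overlap of the two role vectors."""
    (t_head, t_role), (k_head, k_role) = rt_bind, rk_bind
    head = 1.0 if t_head == k_head else P_SUB
    overlap = sum(t_role.get(w, 0.0) * p for w, p in k_role.items())
    return head * max(overlap, 1e-6)

def trace_prob(rt, rk):
    """P(RT | R_k -> RT), Eqs. 9-11: a product over target bindings of
    a sum over trace bindings, plus an insertion term."""
    result = 1.0
    for rt_bind in rt:
        s = sum(P_BIND * binding_prob(rt_bind, rk_bind) for rk_bind in rk)
        result *= s + P_INS
    return result

# Target "Who did Sampras beat? #": Sampras carries the winner-like
# pattern, '#' the loser-like pattern (both from sequential resolution).
rt = [("Sampras", {"Kuerten": 0.5, "Hewitt": 0.5}),
      ("#",       {"Roddick": 0.5, "Costa": 0.5})]

r1 = [("Sampras", {"Kuerten": 0.5, "Hewitt": 0.5}),  # "Sampras defeated Agassi"
      ("Agassi",  {"Roddick": 0.5, "Costa": 0.5})]
r2 = [("Schalken", {"Kiefer": 1.0})]                 # an unrelated trace

print(trace_prob(rt, r1) > trace_prob(rt, r2))   # True: r1 is retrieved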

Relational Resolution. Finally, the paradigmatic associations in the retrieved RTs are used to update working memory. In the RT for "Sampras defeated Agassi," "Agassi" is bound to the {Roddick, Costa} pattern. Consequently, there is a strong probability that "Agassi" should align with the "#" symbol, which, as a consequence of sequential retrieval, is also aligned with the {Roddick, Costa} pattern. Note that the model has now answered the question: it was Agassi who was beaten by Sampras (see Fig. 5).

Fig. 5. Relational resolution. Agassi aligns with the answer slot, because it is bound to the {Roddick, Costa} pattern in the retrieved relational trace.

As in the sequential case, we wish to calculate the probability that a given word substitutes for T_i given the relational representation of the target (RT):

$$E_k[P(\langle W_m, T_i \rangle \mid RT)] = \sum_{k=1}^{N} \frac{P(R_k \to RT \mid RT)}{\sum_{i=1}^{N} P(R_i \to RT \mid RT)}\, P(\langle W_m, T_i \rangle \mid R_k \to RT, RT) \tag{15}$$

where P(R_k → RT | RT) was calculated in the last section. To calculate the probability of substitution, we note that a substitution of T_i for S_kj has occurred whenever R_kj → RT_i. As a consequence, the following derivation applies:

$$P(\langle W_m, T_i \rangle \mid R_k \to RT, RT) = \sum_{W_m = S_{kj}} P(\langle S_{kj}, T_i \rangle \mid R_k \to RT, RT) = \sum_{W_m = S_{kj}} \frac{P(RT_i \mid R_{kj} \to RT_i)\,P(R_{kj} \to RT_i \mid R_k \to RT)}{\sum_j P(RT_i \mid R_{kj} \to RT_i)\,P(R_{kj} \to RT_i \mid R_k \to RT) + P(\mathrm{insert}(RT_i) \mid R_k \to RT)}. \tag{16}$$

Combining Sequential and Relational Substitution Probabilities

We now have two procedures by which we can generate estimates of the substitution probabilities of trace and target words, one based on the sequence of words in the target (sequential) and one based on retrieved role-filler bindings (relational). The final question is how the estimates based on these two different sources of information should be combined to arrive at a final set of substitution probabilities. Taking a simple mixture of the information sources, we get

$$P(\langle W_m, T_i \rangle) = \eta\, P(\langle W_m, T_i \rangle \mid T) + (1 - \eta)\, P(\langle W_m, T_i \rangle \mid RT) \tag{17}$$

where η is set at 0.5 for the simulations reported here.‡

To summarize, the model hypothesizes four basic steps. First, the series of words in the target sentence is used to retrieve traces that are similar from sequential LTM. Then, the retrieved sequential traces are aligned with the input sentence to create a relational interpretation of the sentence based on word order. This interpretation is then used to retrieve similar traces from relational LTM. Finally, working memory is updated to reflect the relational constraints retrieved in the previous step.
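A worked instance of Eq. 17 on the running example follows; the two input distributions are invented for illustration.

# A worked instance of Eq. 17: mixing the sequential and relational
# substitution distributions for the answer slot. The two input
# distributions are invented for illustration; eta = 0.5 as in the
# simulations reported here.

eta = 0.5
seq = {"Roddick": 0.45, "Costa": 0.45, "Agassi": 0.10}  # P(<Wm,Ti> | T)
rel = {"Agassi": 0.90, "Roddick": 0.05, "Costa": 0.05}  # P(<Wm,Ti> | RT)

mixed = {w: eta * seq.get(w, 0.0) + (1 - eta) * rel.get(w, 0.0)
         for w in seq.keys() | rel.keys()}
print(max(mixed, key=mixed.get))   # 'Agassi', the model's answer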
Updating Edit Probabilities with Corpus Statistics

The method used to derive the edit probabilities used above is a version of the Expectation Maximization (EM) algorithm (27). EM algorithms involve two steps. In the first step, the expected value of the log likelihood of the data given the current parameters is calculated. That is, we define Q:

$$Q(\theta, \theta^t) = \int_{y \in Y} \log P(C, y \mid \theta)\, P(y \mid C, \theta^t)\, dy \tag{18}$$

where C is the set of sentences in the training corpus, and Y is the set of all possible hidden variables (i.e., trace selections and edit sequences) that could have given rise to the set of traces. θ^t is the set of parameters at the current step.

In the second step, we find the parameters θ that maximize Q. These will be used as the parameters for the next iteration of the algorithm:

$$\theta^{t+1} = \arg\max_\theta Q(\theta, \theta^t). \tag{19}$$

Repeated iterations of the EM algorithm are guaranteed to find a local maximum in the log likelihood (27). In the case of the SP model, the training algorithm reduces to adding the probabilities of the alignments in which each edit operation occurs and normalizing appropriately. Space precludes providing the entire derivation, but it follows the familiar pattern of EM derivations of mixture models (28).
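The reduced update can be sketched directly: weight each alignment by its posterior probability, add that weight to the count of every edit operation the alignment contains, and renormalize. The alignment posteriors below are invented stand-ins for the E step.

# A sketch of the reduced EM update described above. Each alignment
# contributes its posterior probability to the count of every edit
# operation it contains, and the counts are then normalized. The
# alignment posteriors are invented stand-ins for the E step.

from collections import Counter

weighted_alignments = [               # (edit operations, posterior)
    (["match", "match", "change"], 0.7),
    (["match", "indel", "indel", "change"], 0.2),
    (["indel", "indel", "indel", "match"], 0.1),
]

counts = Counter()
for ops, posterior in weighted_alignments:
    for e in ops:
        counts[e] += posterior        # expected count of operation e

total = sum(counts.values())
print({e: round(c / total, 3) for e, c in counts.items()})
# normalized expected counts become the next iteration's parameters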

Although the EM algorithm has proven useful in a wide range of language-learning tasks, optimization of the log likelihood of the data is not always a desirable objective (29). In the case of the SP model, a difficulty arises with the optimization of match probabilities. For low-frequency words, the probability that there will be a match of these words in the corpus can be very small, meaning that the match probabilities tend to zero. This property is particularly undesirable when the match probabilities are used in the relational model. For that reason, only change and indel probabilities were trained in the following evaluation.

The mathematical framework of the SP model has now been outlined. In the next section, we describe the data set used to test the question-answering capabilities of the model.

The Tennis News Domain

A number of criteria were used to select the domain on which to test the model. First, the domain was required to be one for which naturally occurring text was available, because it is important that the model be capable of dealing robustly with the variety of sentences typically found in real text. Also, in real corpora, there are many sentences that do not refer to the facts of interest at all, and the model should be capable of isolating the relevant ones.

Second, we wished to test the model's ability to extract relational information from sentences. Many question-answering systems use type heuristics rather than engaging in relational analysis. For instance, they might determine the date of the running of the Melbourne Cup by looking for sentences containing the term Melbourne Cup and returning any date within these sentences, regardless of the role this date might fill. Although such heuristics are often very successful in practice, there are some questions for which a relational analysis is necessary.

Finally, we were interested in testing the model's ability to take advantage of inference by coincidence and so chose a domain in which the opportunities for such inferences are abundant.

Sixty-nine articles were taken from the Association of Tennis Professionals web site (www.atptennis.com). The articles were written between September 2002 and December 2002 and ranged in length from 134 to 701 words. In total, there are 21,212 words. The documents were manually divided into sentences, and the mean sentence length was 23.7 words.

The tennis domain fulfills each of the criteria. Naturally occurring text is available, and there were many nontarget sentences that the model was required to reject in its search for relevant information. Choosing the winner of a tennis match cannot be done by appealing to simple type heuristics, because relevant source sentences often contain the names of both the winner and the loser, so that the correct answer must be selected from items of the same type. Finally, in sports reporting of this kind, there are often multiple cues, many of which are indirect, that allow the disambiguation of key facts, like who the winner of a match was.

Then 377 questions of the form "Who won the match between X and Y? X" were created. Any result that could be deduced from the article text was included. So, for instance, results that required the resolution of an anaphoric reference from other sentences in the same document were retained. Also, the winner was alternated between the first and second name positions so that the model could not simply repeat the name in the first slot to answer the question.

Results and Discussion

To test the model, the EM algorithm was first used to train edit probabilities, and then each question was presented with the final answer slot vacant (e.g., "Who won the match between Sampras and Agassi? #"). The SP model was invoked to complete the pattern.§

‡I thank an anonymous reviewer who suggested using the explicit mixture model to combine sources.

§The parameters used were P(Match) = 0.95, P(Change) = 0.025, P(Not Match) = 0.35, P(Not Change) = 0.45.

During sequential retrieval, the model was allowed to retrieve any sentence or question from the entire corpus. During relational retrieval, however, only facts derived from the sentences were allowed as retrieval candidates;¶ that is, the factual knowledge embodied by the questions was not permitted to influence the results.

The token with the highest probability in the # slot was assumed to be the answer returned by the model. Fig. 6 shows a breakdown of the number of results in each category after sequential and relational resolution.

Fig. 6. Breakdown of result types after sequential and relational processing.

After relational processing, on ≈67% of occasions the model correctly returned the winner of the match. Twenty-six percent of the time, it incorrectly produced the loser of the match. Five percent of the time, it responded with a player other than either the winner or loser of the match, and on 3% of occasions it committed a type error, responding with a word or punctuation symbol that was not a player's name.

There are a number of ways in which one might seek to establish an appropriate baseline against which to compare these results. Because the model is developed in a pattern-completion framework, it is possible for any symbol in the vocabulary to be returned. There were 2,522 distinct tokens in the corpus, so nominally the chance rate is <1%. However, one might also argue that the chance rate should be related to the number of elements of the appropriate type for a response, that is, the number of names of players. There were 142 distinct players' names and so, by this analysis, the baseline would also be <1%. A further type distinction would be between winners and losers. There were 85 distinct winners, which results in a baseline of just >1%. Note that in any of these cases, the model is performing well above chance.

Another possible model for the decision process against which one might be tempted to compare the performance of the SP model is a winner maximum-likelihood model. In this model, the two players are extracted from the question, and the one that most often fills the winner slot is selected. With this model, performance is 74%. However, it is important to recognize that, to apply this model, one must provide a mechanism by which the relevant contenders are extracted from the sentence and which is capable of deciding what statistics are relevant for making frequency comparisons, decisions that will change on a question-by-question basis. By contrast, the SP model is only given a pattern to complete, and so is not only answering the question but is also extracting the relevant schema within which the question must be answered. In addition, when the SP model is run without relational retrieval or resolution, performance drops from 67% to 8% correct (see Fig. 6), so it would seem that relational processing was critical. Given that the questions were not included in relational memory, performance must have been driven by the statistics of the articles rather than of the questions. Consequently, the comparison against the maximum-likelihood model is somewhat inappropriate.

¶To speed computation, only sentences that contained at least one of the two combatants were considered.
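For concreteness, the winner maximum-likelihood baseline described above amounts to the following sketch. The winner-count table is invented; deriving it from raw text is itself the extraction step that the SP model performs implicitly through pattern completion.

# A sketch of the winner maximum-likelihood baseline. The winner-count
# table is invented for illustration; building it from raw text is
# precisely the mechanism the baseline leaves unspecified.

winner_counts = {"Sampras": 12, "Agassi": 7}   # assumed corpus counts

def ml_baseline(player_x, player_y):
    """Pick whichever of the two players more often fills the winner slot."""
    return max((player_x, player_y),
               key=lambda p: winner_counts.get(p, 0))

print(ml_baseline("Agassi", "Sampras"))   # 'Sampras'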
Issues That Compromised Performance

In examining the types of errors committed by the model, a number of recurring types were evidenced. As mentioned earlier, the use of anaphora is quite common in this corpus. The current model has no mechanism for the resolution of anaphora, which undermines its ability to both isolate the sentences containing the appropriate relational information and select the correct answer token. In addition, a mechanism for isolating appropriate context is necessary. On seven occasions in the current data set, there are sets of players for whom the questions are ambiguous without the use of context to isolate the correct match. In addition, inference by coincidence can sometimes induce an incorrect response. For instance, the model induces that Schalken won the match against Pete Sampras in part on the basis of the sentence "Schalken, from the Netherlands, made his best-ever showing at the US Open last month. . ." However, although having a best-ever showing is indicative of winning, in this case it is misleading, because it was in fact Sampras who defeated Schalken in the semifinals. Finally, the model's lack of sensitivity to sublexical structure creates difficulties, particularly in deriving relational match when possessives are used. There are then many avenues by which one could look to improve performance.

Inference by Coincidence

To assess the contribution that inference by coincidence made to the performance of the model, the sentence with maximal retrieval probability for each query was classified into one of three categories.

The literal category contained those sentences where there was an explicit statement of the result, even if it required some interpretation. For example, when processing the question "Who won the match between Ulihrach and Vicente? Ulihrach," the highest-probability relational trace was "Vicente bounced by Ulihrach," which literally states the result (even if it is necessary for one to interpret "bounced" in this context).

Table 1. Examples of inference by coincidence in the Tennis News domain

Who won the match between Carlsen and Kiefer? Carlsen.
    Kafelnikov now meets Kenneth Carlsen of Denmark in the second round.

Who won the match between Kiefer and Safin? Safin.
    Safin, Kafelnikov surge toward hometown showdown.

Who won the match between Ljubicic and Kutsenko? Ljubicic.
    Sixth seed Davide Sanguinetti of Italy and eighth seed Ivan Ljubicic of Croatia took different paths to their opening-round wins at the president's cup in Tashkent.

Who won the match between Voltchkov and Haas? Voltchkov.
    According to Haas, the injury first arose during Wednesday's match against Sargis Sargsian, and became progressively worse during practice and then the match against Voltchkov.

Who won the match between Srichaphan and Lapentti? Srichaphan.
    Srichaphan has now won two titles in four finals this year.

Who won the match between Mamiit and Coria? Coria.
    Kuerten, Coria withstand heat, set up fiery South American showdown.

Each example shows the question and the sentence that generated the most probable relational trace.

The inference category included those sentences that did not contain a literal statement of the result but that provided some evidence (not necessarily conclusive) for what the result may have been (see Table 1 for examples). For instance, when processing the question "Who won the match between Portas and Sampras? Sampras," the relational trace with the highest retrieval probability was "Sampras claims 14th Grand Slam title." Although this sentence does not explicitly state the result of this match, one can infer that if Sampras won the title, then it is likely that he won this match. Note that this inference does not always follow, because the writer may have made reference to a result from a different tournament, or the question may have come from a different article. However, that Sampras won the title does provide evidence in favor of his having won this match. Unlike a traditional inference system, however, the SP model is making the inference by virtue of the fact that the names of people that appear in statements of the form "X claims title" also tend to appear in the winner slot at the end of the questions.

Finally, the other category included all remaining cases. These contained traces in which both players were mentioned but the sentence could not have been used to conclude who the winner may have been. For example, when the question "Who won the match between Acasuso and Pavel? Acasuso" was presented, the most probable relational trace was "Pavel and Acasuso to clash in Bucharest semis." In addition, this category contains sentences that contradict the correct result. For example, the question "Who won the match between Pavel and Srichaphan? Pavel" produced the relational trace "Pavel, now 38-22 on the year, has reached two semifinals in 2002: Chennai (l. to Srichaphan) and Bucharest (l. to Acasuso)." This situation occurs when a player avenges an earlier loss. In addition, the other category was assigned when the sentence was unrelated to the question. For instance, when the model was presented with the question "Who won the match between Meligeni and Etlis? Etlis," it returned "Kiefer quickly overcame Gaston Etlis of Argentina 6-2, 6-4 on Monday to qualify for the main draw of the Kremlin cup."

Fig. 7 shows the number of most probable relational traces in each category.

Fig. 7. Breakdown of responses based on the accuracy of the response and the type of the most probable relational trace according to the model. "Literal" refers to traces in which the answer was stated explicitly. "Inference" refers to traces in which the answer was not stated, but from which it could be inferred. "Other" refers to traces from which the answer was not derivable. Note that these statistics are for the most probable trace only. The model, however, accumulates information from multiple traces, so it is still possible for it to answer correctly even if the most probable trace does not contain the relevant information.

For an indication of the contribution that inference by coincidence is making to correct responding, consider those correct responses that can be attributed to either literal or inference traces. On 59% of occasions, the model was inferring the answer rather than relying on literal retrieval. Given that in each case a literal statement of the results existed in the corpus, it is significant that inference by coincidence seems to be playing such a crucial role in the performance of the model.

Conclusion

The ability of the SP model to isolate the combatants from arbitrary sentences and to successfully separate winners from losers demonstrates that it is capable of extracting propositional information from text. Using simple retrieval and alignment operations, the model takes advantage of the statistics of word use. Unlike existing work (7, 8, 10), it need make no a priori commitment to particular grammars, heuristics, or sets of semantic roles, and it does not require an annotated corpus on which to train.

Furthermore, the large number of occasions (59%) on which the most probable relational trace was a sentence from which the result could be inferred but not directly derived is an indication that inference by coincidence can play a dominant role in successful question answering and may be a crucial factor in sentence comprehension in general.

I acknowledge the many discussions that have shaped the current work. In particular, I thank Michael Harrington, Michael Humphreys, Peter Kwantes, Andrew Smith, Walter Kintsch, Tom Landauer, Jose Quesada, and Michael Littman for helpful comments and suggestions. I also thank Jose Quesada for his assistance in creating the Tennis News data set. This research was supported by Australian Research Council Grant A00106012, U.S. National Science Foundation Grant EIA-0121201, and U.S. Department of Education Grant R305G020027.

1. Gaizauskas, R. & Wilks, Y. (1998) J. Doc. 54, 70-105.
2. Cowie, J. & Lehnert, W. (1996) Commun. ACM 39, 80-91.
3. Voorhees, E. M. (2002) in Eleventh Text Retrieval Conference (TREC 2002), eds. Voorhees, E. M. & Buckland, L. P., http://trec.nist.gov/pubs/trec11/t11_proceedings.html.
4. Moldovan, D., Harabagiu, S., Girju, R., Morarescu, P., Lacatusu, F., Novischi, A., Badulescu, A. & Bolohan, O. (2002) in Eleventh Text Retrieval Conference (TREC 2002), eds. Voorhees, E. M. & Buckland, L. P., http://trec.nist.gov/pubs/trec11/t11_proceedings.html.
5. Soubbotin, M. M. & Soubbotin, S. M. (2002) in Eleventh Text Retrieval Conference (TREC 2002), eds. Voorhees, E. M. & Buckland, L. P., http://trec.nist.gov/pubs/trec11/t11_proceedings.html.
6. Fellbaum, C. (1998) WORDNET, An Electronic Lexical Database (MIT Press, Cambridge, MA).
7. Blaheta, D. & Charniak, E. (2000) in Proceedings of the 1st Annual Meeting of the North American Chapter of the Association for Computational Linguistics (North Am. Chapter of the Assoc. for Computational Linguistics, Seattle), pp. 234-240.
8. Gildea, D. & Jurafsky, D. (2002) Comput. Ling. 28, 245-288.
9. Palmer, M., Rosenzweig, J. & Cotton, S. (2001) in Proceedings of the First International Conference on Human Language Technology Research, ed. Allan, J. (Morgan Kaufmann, San Francisco).
10. Marcus, M., Kim, G., Marcinkiewicz, M. A., MacIntyre, R., Bies, A., Ferguson, M., Katz, K. & Schasberger, B. (1993) Comput. Ling. 19, 313-330.
11. Fillmore, C. J., Wooters, C. & Baker, C. F. (2001) in Proceedings of the Pacific Asian Conference on Language, Information and Computation (Pacific Asian Conference on Language, Information and Computation, Hong Kong).
12. Van Valin, R. D. (1993) in Advances in Role and Reference Grammar, ed. Van Valin, R. D. (John Benjamins, Amsterdam).
13. Fillmore, C. J. (1971) in 22nd Round Table, Linguistics: Developments of the Sixties—Viewpoints of the Seventies, ed. O'Brien, R. J. (Georgetown Univ. Press, Washington, DC), Vol. 24, pp. 35-56.
14. Stallard, D. (2000) in Proceedings of the 6th Applied Natural Language Processing Conference (ANLP'00) (citeseer.nj.nec.com/stallard00talkntravel.html), pp. 68-75.
15. Halford, G., Wilson, W., Guo, K., Gayler, R., Wiles, J. & Stewart, J. (1994) in Analogical Connections, eds. Holyoak, K. J. & Barnden, J. (Ablex, Norwood, NJ), Vol. 2, pp. 363-415.
16. Hummel, J. & Holyoak, K. J. (1997) Psychol. Rev. 104, 427-466.
17. Sankoff, D. & Kruskal, J. B. (1983) Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison (Addison-Wesley, New York).
18. Sellers, P. H. (1974) J. Combin. Theory 16, 253-258.
19. Levenshtein, V. I. (1965) Dokl. Akad. Nauk SSSR 163, 845-848.
20. Needleman, S. B. & Wunsch, C. D. (1970) J. Mol. Biol. 48, 443-453.
21. Allison, L., Wallace, C. S. & Yee, C. N. (1992) J. Mol. Evol. 35, 77-89.
22. Gotoh, O. (1982) J. Mol. Biol. 162, 705-708.
23. Waterman, M. S., Smith, T. F. & Beyer, W. A. (1976) Adv. Math. 20, 367-387.
24. Waterman, M. S. (1984) Bull. Math. Biol. 46, 473-500.
25. Shiffrin, R. M. & Steyvers, M. (1997) Psychon. Bull. Rev. 4, 145-166.
26. Dennis, S. & Humphreys, M. S. (2001) Psychol. Rev. 108, 452-478.
27. Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977) J. R. Stat. Soc. B 39, 1-38.
28. Bilmes, J. A. (1998) Ph.D. thesis (University of California, Berkeley).
29. Klein, D. & Manning, C. D. (2001) in Proceedings of the Conference on Computational Natural Language Learning, eds. Daelemans, W. & Zajac, R. (Toulouse, France), pp. 113-120.
