
From: AAAI Technical Report SS-00-01. Compilation copyright © 2000, AAAI (www.aaai.org). All rights reserved.

Adaptive User Interfaces for Information Extraction

Peter Vanderheyden and Robin Cohen
Department of Computer Science, University of Waterloo
Waterloo, Ontario, Canada

Abstract

Information extraction systems distill specific items of information from large quantities of text. Current literature in the field of information extraction focuses largely on how different language models and machine learning approaches can improve system accuracy, with little mention of the role of interaction between the user and the system. In this paper, we explore many opportunities for interaction and adaptation during information extraction, and propose a system design to investigate these opportunities further.

Introduction

Much has been written about the flood of information currently available, and the challenge of locating relevant items amid that flood. Several areas of applied research -- information extraction (IE), information retrieval (IR), and knowledge discovery in text (KDT) -- seek to address this challenge. We are specifically interested in the field of information extraction. In this paper, we discuss opportunities for interaction between user and system, and for adaptation by the system to facilitate this interaction. We then develop objectives for designing a system that supports the user effectively during information extraction.

We motivate this discussion by observing that little is mentioned in current literature about the interaction between an information extraction system and a user. In contrast, for example, information retrieval literature discusses interactions between users and computer-based systems as well as between human librarians and library patrons (Twidale & Nichols 1998).

The following will be an ongoing example to which we refer throughout the paper: Person P is concerned that there has been a high number of terrorism cases reported recently [1], and is looking for more information on the matter. What P really wants to know (P's "global information need") is the following: "Is terrorism on the rise?" System S is able to accept many different kinds of queries, and then to search for information in a corpus of documents. How should P and S interact to satisfy P's information need?

[1] The reader may insert any other topic in place of "terrorism" (e.g., management successions, cancer, rap music).

Information Extraction: What Is It?

Information extraction involves "extraction or pulling out of pertinent information from large volumes of texts" (MUC-7 1999). It is often defined by example, in terms of the results the system would return for a given query, and often contrasted with what an information retrieval system might return for a similar query.

The query-result perspective

An information retrieval system generally interprets a query as a string of unrelated words or word phrases [2], matching it against the words in the corpus documents. Output from the IR system consists of either whole documents or segments, deemed relevant to the query words. Thus, for P's question "Is terrorism on the rise?", if the corpus contains a document that mentions a 'rise' in 'terrorism', and assuming P has confidence in the document, such a document will have answered P's question.

[2] Most current information retrieval systems consider only minor variations (e.g., morphological) on the query words, though natural language approaches to IR exist as well (Lewis & Sparck Jones 1996).

An IE system takes as its input a user's query expressed (or expressible) as a sentence template, containing unfilled slots, and relations between those slots. For example, a query to identify terrorism events contained in a document could be written as:

"A terrorism event occurred in which {GroupA} {ActB} {GroupB}, and {ActC} {ObjectC}."

The system then processes each document in a corpus and, for each occurrence in the document of an event that adequately fills the query template, the system returns the filled template (e.g., {ActB}="kidnapped", {GroupB}="10 civilians", etc.). Given P's question, converted into an appropriate template form, the system would ideally return all terrorism events mentioned in the corpus. Thus one possible scenario is that, if the IR system failed to answer P's question satisfactorily, the IE system could be used as a follow-up approach (Gaizauskas & Robertson 1997). The data returned by the IE system could be analyzed (the number of events counted) and plotted as a function of time, to see whether terrorism cases were indeed increasing in frequency.
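As an illustration only (this is not a component of an implemented system, and the class and slot names are ours), a query template and a filled extraction result for the running example might be represented as simple data structures:

from dataclasses import dataclass, field

@dataclass
class QueryTemplate:
    # A sentence template with named, unfilled slots.
    pattern: str     # template text with {Slot} placeholders
    slots: list      # slot names the system should try to fill

@dataclass
class FilledTemplate:
    # One event extracted from one document: a mapping from slots to text spans.
    source_doc: str
    fills: dict = field(default_factory=dict)   # e.g. {"ActB": "kidnapped"}

# The terrorism query from the running example.
terrorism_query = QueryTemplate(
    pattern="A terrorism event occurred in which {GroupA} {ActB} {GroupB}, and {ActC} {ObjectC}.",
    slots=["GroupA", "ActB", "GroupB", "ActC", "ObjectC"],
)

# One result the system might return for one document (values are hypothetical);
# note that a partially filled template may be acceptable, as discussed later.
result = FilledTemplate(
    source_doc="doc-0042",
    fills={"GroupA": "an armed group", "ActB": "kidnapped", "GroupB": "10 civilians"},
)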
Knowledge discovery in text (Feldman & Dagan 1995) begins with the kind of well-structured data often returned by an information extraction system. The system and user then cooperate, the system using statistical calculations and the user using intuition and knowledge of the world, to find correlative patterns among data categories, possibly defining new categories in the process. For example, a KDT system might analyze the terrorist event data returned by the IE query mentioned earlier, to identify patterns of occurrence over a range of years. Thus S would compute the answer to P's question on its own, rather than relying on a document (that P trusts) in the corpus containing the answer. Finding information often does involve trying one approach, getting partial answers, integrating this new knowledge, and trying again with the same approach or with a different one.

The natural language perspective

Some degree of language understanding is required to recognize how the terms in the query template are related to one another. The documents in the corpus may contain either unstructured natural language (e.g., news wire articles, as in the MUCs; e.g., (MUC-6 1995) and (MUC-7 1999)) or semi-structured text (e.g., formulaic reviews of restaurants or movies, with consistent structure; discussed in many of the papers in (AAAI'98-AIII 1998)). When documents contain unstructured natural language, information extraction can be seen as a natural language understanding task restricted to the domain of knowledge relevant to the user's query. The system typically contains a knowledge base of words and phrases, semantic relations, and rules of syntax in the language. How these syntax rules are expressed is very much affected by the specific language model being used (e.g., finite state automata (Appelt et al. 1995); a statistical dependency model (Weischedel 1995)).

Information extraction systems acquire additional knowledge (about the lexicon, semantics, etc.) during the course of the task through learning techniques. Typically, this involves "batch" learning -- learning in large increments -- with no user involvement within each batch, and the specific approach taken will be very much dependent on the language model.

Summary: there's still something missing

A good deal of the process involved in information extraction has yet to be discussed. For example, the design of an information extraction system should consider the kinds of users, queries, and corpora that will be involved. And information extraction may occur not in isolation, but in the context of other activities.
The Information Task In Context

In this section, we briefly mention some of the parameters that will vary from one task to another, from one user to another, or within the span of a single task; this is by no means an exhaustive list. It is not our intention to suggest that any one system should explicitly make use of all of these parameters. However, the designers of a system should at least be aware of them, in order to take into account any that are relevant to the specific design and intended application of a given system. Many parameters mentioned in this section come from (Allen 1996) and are not specific to information extraction, but are relevant to various information tasks.

Information-based variations

What kind of information is the user interested in? What kind of information is contained in the documents? Different kinds of information may not be conducive to the same techniques for representation or interaction. (For example, diagrammatic information is more successfully conveyed in a diagram than as text.)

Why does the user want the information that the system is trying to find? For example, is the user:
- searching for something in particular;
- generally browsing, or exploring, to become acquainted with information available on the topic;
- integrating new information into what the user already knows; or,
- trying to gain new perspectives on what the user already knows?

The user's motivation for performing the IE task will affect how she examines each document and what kind of information she focuses on. As a result, it may affect how the user prefers to interact with the system.

How many solutions to the query is the user looking for? If the user is interested in finding a small number of solutions, then a less accurate (and possibly faster) procedure may suffice. This is in contrast to a user interested in all solutions that occur in the corpus, or in the "best" solution (which may require finding all solutions, then using some criterion to select the best). Similarly, is the user satisfied with finding some of the elements in the query (i.e., in information extraction, this means that partially filled templates are acceptable)? If so, how many and which elements (if they are not weighted equally in importance) are required?

With respect to documents in the corpus, does it matter how documents are related to one another? For example, in the legal domain it is important to relate a given case to possible precedents (Rose & Belew 1991).

User-based variations

How does the user process different kinds of information? In any given situation, different users will vary with respect to:

- what the user perceives, identifies, or focuses on [3];
- how the user interprets the information;
- what courses of action the user will identify, and which one the user will pursue; and,
- what the user knows, and meta-knowledge about what the user doesn't know.

[3] Allen (1996) discusses individuals' "bounded rationality", the fact that people do not always make use of all available information -- i.e., real people are not ideal users.

As well, users will vary in how quickly they learn over the course of a given task, and in what they require in order to be confident of the results. In addition to their abilities, users will also vary with respect to preferences -- how information should be presented, how much information should be presented at once, etc.

Within-task variations

Has the user firmly established the query and domain of interest? An information task can be described in several phases:
- task initiation: recognizing the information need;
- topic selection: identifying the general topic;
- prefocus exploration: investigating information on the general topic;
- focus formulation: deciding on a narrower topic;
- information collection: investigating information on the narrower topic; and,
- search closure: completing the information search.

If the query and domain are not already well defined at the beginning of the task, they will evolve over the course of the task as a result of:
- additional information learned from documents,
- constraints learned about information in documents,
- changes in user needs due to external influences.

How they evolve will depend in part on the user-based variations mentioned above, and may involve broadening, narrowing, replacing, or reorganizing individual elements within the current configuration, or changing the scope of the entire problem definition.

Between-task variations

How is the current query related to previous queries? For example, the current query may be building on the results of earlier queries in the same task (e.g., information extraction), or on the results of other tasks (e.g., information retrieval, or knowledge discovery). Similarly, a query may refer to a domain similar to that of an earlier query; some elements from the earlier task may be relevant, while others may not.

Information Extraction In Context

We now focus our discussion on information extraction specifically, and consider how this task can vary from one time to another. The model of information extraction that we have adopted is the annotation model: the results of analyzing a text are incorporated into that text as annotations (Bird & Liberman 1999); e.g., semantic features are inserted into the document text as SGML markups [4].

[4] This is consistent with a number of the MUC systems (e.g., FASTUS (Appelt et al. 1995) and Alembic Workbench (Aberdeen et al. 1995)), and we believe it is rich enough to also describe most other systems.
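For concreteness, the following sketch shows what such inline annotation might look like; the tag names are hypothetical and the function is only an illustration of the annotation model, not the markup scheme of any particular MUC system.

def annotate(text, spans):
    # Wrap each (phrase, concept) pair found in the text with an SGML-style tag.
    for phrase, concept in spans:
        text = text.replace(phrase, f"<{concept}>{phrase}</{concept}>")
    return text

sentence = "Guerrillas kidnapped 10 civilians near the capital."
spans = [("Guerrillas", "GroupA"), ("kidnapped", "ActB"), ("10 civilians", "GroupB")]

print(annotate(sentence, spans))
# -> <GroupA>Guerrillas</GroupA> <ActB>kidnapped</ActB> <GroupB>10 civilians</GroupB> near the capital.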
Phases of an IE task

A typical information extraction task might be described as a series of five phases or steps. Many of these phases involve elements that are unconstrained in the specifications of the task, and these elements have been italicized; an italicized term X can be read as "some set of X", which may be the same or a different set than an X occurring elsewhere.

1. The user recognizes an information need, then defines and represents a specific problem:
   (a) the user identifies Concepts relevant to the query,
   (b) the user annotates Concepts as they occur in Documents, using Annotations.
2. The user and system explore possible strategies and generate possible intermediate solutions:
   (a) the system derives Rules for identifying Concepts in Documents,
   (b) the user reviews and possibly modifies directly how Rules are defined in the system,
   (c) the system applies Rules and annotates Concepts as they occur in Documents, using Annotations.
3. The user reviews and evaluates Annotations in Documents.
4. The user may choose to repeat the cycle if one of the following conditions holds, or to proceed to 5:
   - the system is missing a relevant concept -- loop back to 1(a) and modify the representation, or loop back to 2(c) and overlook the omission;
   - the system rules are incorrect, but the relevant concept is defined correctly -- loop back to 2(b) and modify the rules, or loop back to 2(c) and overlook the error;
   - the system solution is acceptable for the documents annotated so far, but additional unannotated documents remain -- loop back to 2(c).
5. The system produces a report of final results.
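The cycle in phases 2 through 4 can be read as a simple control loop. The sketch below is only an illustration of that flow; the user and system interfaces it calls are hypothetical, not a description of an implemented system.

def ie_task(user, system, corpus, query):
    # Illustrative control flow for the five phases above (interfaces assumed).
    concepts = user.identify_concepts(query)                # 1(a)
    examples = user.annotate_examples(corpus, concepts)     # 1(b)
    rules = system.derive_rules(concepts, examples)         # 2(a)

    while True:
        rules = user.review_rules(rules)                    # 2(b)
        annotations = system.apply_rules(rules, corpus)     # 2(c)
        problem = user.evaluate(annotations)                # 3

        if problem == "missing concept":                    # 4, first condition
            concepts = user.identify_concepts(query)        # back to 1(a)
            rules = system.derive_rules(concepts, examples)
        elif problem == "incorrect rules":                  # 4, second condition
            continue                                        # back to 2(b)
        elif problem == "unannotated documents remain":     # 4, third condition
            continue                                        # back to 2(c) on the next pass
        else:
            break                                           # solution acceptable

    return system.report(annotations)                       # 5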

Without any specification on how all information extraction systems should set these parameters, developers must make these decisions for each system. For example, the (italicized) elements may vary in:
- number or size -- e.g., how many documents does a user examine before proceeding to the next phase?
- selection -- e.g., which documents does the user examine? are they the same documents as in the previous phase, or novel documents?
- order -- e.g., the system may process a next rule in some fixed sequence, or in order of how often it has been activated in the past; the order may be increasing, decreasing, or random.
- motives -- e.g., in phase 2(b), is the user's motive to identify new rules, to correct current rules, or to compare and contrast rules that overlap?
- control -- e.g., is it the system or the user who constrains each of these elements; if both, then which has priority during conflicts?
- timing -- e.g., when can a parameter be changed, and are these times predetermined or variable?

Proposed design. We argue that these parameters be variable through the course of an IE task, being set initially to some default value. Furthermore, while the user has priority in changing the parameter values, the system should be capable of suggesting values to the user on the basis of its evaluation of data patterns and, if the user does not object, resetting these values and thus adapting the interface to the task conditions.
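To make the proposal concrete, a task parameter with a user-overridable default might be represented as below. This is a sketch under our own naming assumptions, not part of an existing system.

from dataclasses import dataclass

@dataclass
class TaskParameter:
    # One adjustable task parameter: the user has priority, the system may suggest.
    name: str
    value: object
    set_by: str = "default"          # "default", "user", or "system"

    def user_set(self, value):
        self.value, self.set_by = value, "user"

    def system_suggest(self, value, user_objected=False):
        # The system may reset a value it proposed, but never overrides the user.
        if not user_objected and self.set_by != "user":
            self.value, self.set_by = value, "system"

# Example: how many documents to examine before moving to the next phase.
batch_size = TaskParameter("documents_per_phase", 10)
batch_size.system_suggest(25)        # accepted: the user has not set or objected
batch_size.user_set(5)               # the user takes priority from now on
batch_size.system_suggest(50)        # ignored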
Knowledge representation

An information extraction system must maintain an internal representation of the language model for processing document text. This will require such components as a lexicon of part-of-speech and semantic features for words and phrases, and a rule set for syntactic and discourse structures. When documents are written in a semi-structured language, these components can be appropriately simplified.

In addition, a representation of the information contained inside a document will need to be maintained; as relevant concepts are recognized in the document, they will be added to this representation and integrated into it by being related to other concepts already present. Such a semantic model may be maintained either explicitly (e.g., as a semantic network (Morgan et al. 1995)) or implicitly in the system's language components. Relations between documents may also need to be represented.

We have proposed a "tripartite" semantic model, dividing the represented information into a query model, domain model, and corpus model (Vanderheyden & Cohen 1998). Briefly, the domain model is a network of concepts relevant to the domain -- the general subject area relevant to the query. The query model is a sub-network of the domain model, containing only the concepts asked for in the query and thus expected in the results returned by the system; the query model concepts, however, may have additional constraints (e.g., values may be restricted to a subset of those allowed in the corresponding domain concept). Finally, the corpus model connects the domain model to information in documents that the system may have processed (in part or in full) but that is not contained in the domain. As new concepts are identified by the system or user, they will need to be integrated into the system's various language components and related both to the text patterns in which they appear in a document as well as to the concepts already in the semantic model.

Proposed design. At present, we allow concepts to be related to one another vertically in a simple subsumption hierarchy (containing only AND and OR links), and horizontally using simple pre-defined relations (IS-A and HAS-A); all other relations are themselves defined as concepts (e.g., relation:HAS-MORE-THAN-ONE-OF might be defined as relation:HAS-A -> concept:MORE-THAN-ONE -> relation:IS-A). Relations could therefore be involved in subsumption hierarchies as well.

Such a semantic representation provides a clear contrast to representing concepts implicitly using a rule syntax, and allows us to investigate issues related to both interfaces and interaction methods (see below).
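A minimal sketch of this kind of concept representation follows: vertical subsumption links plus pre-defined horizontal relations, with other relations themselves modelled as concepts. The class and instance names are ours and purely illustrative, not an implemented API.

class Concept:
    # A node in the semantic model; relations between concepts are named links,
    # and relations other than IS-A and HAS-A are themselves concepts.
    def __init__(self, name):
        self.name = name
        self.parents = []        # vertical subsumption (AND/OR links simplified here)
        self.relations = []      # horizontal links: (relation name, target concept)

    def subsume_under(self, parent):
        self.parents.append(parent)

    def relate(self, relation, target):
        self.relations.append((relation, target))

# A small domain-model fragment for the terrorism example.
event = Concept("TerrorismEvent")
kidnapping = Concept("Kidnapping")
kidnapping.subsume_under(event)                 # Kidnapping is subsumed by TerrorismEvent

perpetrator = Concept("Perpetrator")
kidnapping.relate("HAS-A", perpetrator)         # pre-defined horizontal relation

# A relation defined as a concept, roughly following the HAS-MORE-THAN-ONE-OF example.
more_than_one = Concept("MORE-THAN-ONE")
has_more_than_one_of = Concept("HAS-MORE-THAN-ONE-OF")
has_more_than_one_of.relate("HAS-A", more_than_one)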

Interface

The IE system stores its knowledge in a variety of repositories and forms, including:
- annotated document text,
- the semantic model,
- rule syntax,
- the lexicon, or lists of known words,
- a listing of query results found so far.

Any of these can be large, with the corpus possibly containing gigabytes of text. It is challenging, therefore, to provide the user with intuitive access to all of this data. In knowledge discovery applications, which generally involve large amounts of numerical data, graphical interfaces are common; however, it may be difficult to visualize textual information graphically. A natural alternative would be to use a natural language interface, though this also presents a number of challenges.

One approach to information retrieval (Rose & Belew 1991) involves a graphical representation of a small amount of textual information; only the concepts deemed most relevant are shown. The user is then able to manipulate these concepts, as well as query them for further information.

Proposed design. We encourage making many views of the system knowledge available and allowing the user to select the preferred view according to personal, task, and information characteristics. And we are encouraged by the interface developed by Rose & Belew, in which the user had access to several of these views in parallel. User-driven navigation enables both purposeful search behaviour as well as opportunistic browsing. Navigation in one view, as well as any modifications made to the system knowledge, require the other views to be updated accordingly. There are some factors, however, that may not allow continuous real-time updates across all views. For example, too many changes occurring at once may be distracting to the user, while also taxing the system performance.
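The parallel-views proposal implies some mechanism for propagating changes in the system knowledge to every open view, with notifications possibly batched or delayed to avoid distracting the user. The following is a bare-bones sketch of that idea, using names of our own invention.

class KnowledgeStore:
    # Shared system knowledge observed by several interface views.
    def __init__(self):
        self.views = []

    def register(self, view):
        self.views.append(view)

    def changed(self, item):
        # Notify every registered view; a real interface might batch or delay
        # these notifications so the user is not flooded with changes.
        for view in self.views:
            view.refresh(item)

class TextView:
    def refresh(self, item):
        print(f"text view: re-render annotations for {item}")

class GraphView:
    def refresh(self, item):
        print(f"graph view: redraw concept node {item}")

store = KnowledgeStore()
store.register(TextView())
store.register(GraphView())
store.changed("Kidnapping")   # both views update in step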
Interaction

The interface provides the user with a window into the system knowledge, allowing the user to learn about:
- document content,
- the current definition of the query and domain models,
- the current mapping from query and domain models to document text.

With this information, the user may see that the system is missing a concept, or a mapping from the concept to the text. Information extraction can be performed as a completely user-driven task, with the system performing only those functions that the user specifies. However, the user would need to be paying attention to a great deal of information in order to perform the task in an efficient manner. Are any of the annotations in the annotated documents incorrect? Are any of the rules of syntax or discourse, or semantic relations, more general or more specific than they should be? Are the documents being examined in the optimal order?

In any task involving this much information, the system should provide as much support to the user's actions and decisions as possible. There is some very interesting and relevant work in the area of "mixed-initiative" planning and system design (we discuss some of this work in (Vanderheyden & Cohen 1999a)), in which system and user cooperate in performing a task, each contributing the skills and knowledge for which they are best suited. While the user has the advantage of high-level knowledge of the user's own information need and (at least to some extent) of the domain, the system can have the advantage of a low-level representation of patterns in the corpus as a result of its ability to process large amounts of text very quickly. Thus, the user and system can, in general, play complementary roles in guiding the task.

The mixed initiative must be monitored. The system and user need to make each other aware of what each is doing (referred to as "context registration"), in order that conflicts can be avoided.

When the system has relevant information, it may choose to "take the initiative" of interacting with the user. There will be constraints on when and how often interaction should occur; not so often as to annoy the user, for example. An opportunity for the system to initiate interaction with the user is when a partial answer to the user's query is identified in a document, or the document offers a critical opportunity for deciding whether a subset of the system's current rules is accurate or not.

Proposed design. When the user is setting task parameters, the system could provide information about relevant patterns in the document text. For example, by performing some lookahead to documents not yet seen by the user, the system could suggest which documents the user should examine next and in what order. Where possible, the system processes information while the user is idle, but without causing undue delay when the user is ready to resume their activity.

Learning and adaptation

While it is important that the system support various opportunities for interaction with the user, it is nevertheless impractical for user and system to interact on every decision. The ability of the system to learn about the documents that it processes, as well as about the user, can help to keep this amount of interaction bounded. For example, information extraction systems use learning methods in order to integrate new words into their lexicon, acquire rules for parsing syntax and discourse, and recognize relations between domain concepts (Wermter, Riloff, & Scheler 1996). A system is thus able to adapt itself autonomously to changes in some aspects of the task.

The general rule of thumb has often been that learning in larger increments (i.e., batch learning) improves accuracy. However, learning in smaller increments (i.e., online learning) allows better adaptivity to changing elements in the task. Furthermore, when the system is able to control how data for learning is selected (i.e., active learning), the requirement for interaction with the user is often further reduced.

Caution must be taken in how the system updates its knowledge bases as a result of patterns identified in the text. The user's information need may be evolving, for example, as the user learns more about the domain, and about what kind of information the system is identifying; this kind of "evolving query" (Vanderheyden & Cohen 1999b) is a natural consequence of an information gathering task. For example, a user may have a query about terrorist activities, asking for the names of perpetrators and the locations of targets; seeing an interim output from the system prompts the user to refine the query to define terrorist activities as only involving certain kinds of weapons.

The query is not the only element of the task that may change over time. Certainly the domain evolves as additional documents are processed. As well, when the corpus is very large or dynamic (e.g., the Web), the corpus itself may be seen as evolving, if rules for mapping text patterns to query items that apply at one time or for one subcorpus no longer apply for another. The idea that task requirements evolve is consistent with much research within the information retrieval community, but in contrast to traditional approaches to information extraction -- e.g., as seen in recent MUC-6 (1995) and MUC-7 (1999) papers -- which implicitly assume that task requirements are static for the duration of the task.

Proposed design. Using machine learning techniques, the system is able to suggest what values are appropriate for task parameters. For example, one of the parameters the system should monitor is how well the documents fit its current rule set. In this way, critical documents can be identified, and the system can decide between alternatives more efficiently while preventing the user from annotating redundant documents that provide it with no new knowledge.
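One way to read the "critical documents" idea is as uncertainty-based selection: ask the user about the unannotated document on which the current rules are least decisive, and skip documents the rules already handle confidently. The scoring function below is our own illustration under assumed interfaces (a per-document confidence score and an annotated flag), not a description of the algorithm of any existing system.

def pick_document_for_user(documents, rules):
    # Return the unannotated document whose rule matches are least decisive.
    # `rules.confidence(doc)` is assumed to return a score in [0, 1]; documents
    # the rules already handle confidently are filtered out so the user is not
    # asked to annotate redundant material.
    candidates = [d for d in documents if not d.annotated]
    informative = [d for d in candidates if rules.confidence(d) < 0.9]
    if not informative:
        return None                      # nothing new to learn from the user
    # The closer a score is to 0.5, the less decisive the current rules are.
    return min(informative, key=lambda d: abs(rules.confidence(d) - 0.5))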

Conclusions

In our continuing work, we would like to show that the interactive approach that we have suggested for information extraction is both possible and practical from the perspectives of the user and of the system.

From the perspective of the user, research in the library sciences and in user-centered design both strongly suggests that an interactive approach is natural and necessary to identify and solve the user's information need, particularly when the user is not completely confident about the topic area, the kinds of information that are available, or the specifics of the system. People are not always effective at expressing their information need, perhaps not having completely formulated it in their own minds.

From the perspective of the system, there is evidence to suggest that interactive approaches to learning perform as well as or better than approaches involving less interaction. However, the most convincing evidence will be a demonstration that this is the case.

In conclusion, and for the purpose of raising discussion, we make the following claims:

Claim 1. The user must be involved in constructing and maintaining the system's knowledge base.

Claim 2. The knowledge represented in the system, the process for gathering information, and the way in which it is presented to the user must be able to change and evolve as the user's needs evolve.

Claim 3. The system and user should be responsible for those aspects of the task that they are each best qualified to perform, and these may change as the task progresses.

We look forward to an interesting symposium!
References

AAAI'98-AIII. 1998. Proceedings of the AAAI'98 Workshop on AI and Information Integration, Madison, Wisconsin, USA.

Aberdeen, J.; Burger, J.; Day, D.; Hirschman, L.; Robinson, P.; and Vilain, M. 1995. MITRE: Description of the Alembic system used for MUC-6. In MUC-6 (1995), 141-155.

Allen, B. L. 1996. Information Tasks: Toward a User-Centered Approach to Information Systems. Library and Information Science. Academic Press: San Diego, CA.

Appelt, D. E.; Hobbs, J. R.; Bear, J.; Israel, D.; Kameyama, M.; Kehler, A.; Martin, D.; Myers, K.; and Tyson, M. 1995. SRI International FASTUS system: MUC-6 test results and analysis. In MUC-6 (1995), 237-248.

Bird, S., and Liberman, M. 1999. Annotation graphs as a framework for multidimensional linguistic data analysis. In Towards Standards and Tools for Discourse Tagging, Proceedings of the Workshop, 1-10. Association for Computational Linguistics.

Feldman, R., and Dagan, I. 1995. Knowledge discovery in textual databases (KDT). In Proceedings of the First International Conference on Knowledge Discovery (KDD-95).

Gaizauskas, R., and Robertson, A. M. 1997. Coupling information retrieval and information extraction: A new text technology for gathering information from the web. In Conference Proceedings of RIAO 97: Computer-Assisted Information Searching on the Internet, volume 1, 356-370.

Lewis, D. D., and Sparck Jones, K. 1996. Natural language processing for information retrieval. Communications of the ACM 39(1):92-101.

Morgan, R.; Garigliano, R.; Callaghan, P.; Poria, S.; Smith, M.; Urbanowicz, A.; Collingham, R.; Costantino, M.; Cooper, C.; and the LOLITA Group. 1995. University of Durham: Description of the LOLITA system as used in MUC-6. In MUC-6 (1995), 71-85.

MUC-6. 1995. Proceedings of the Sixth Message Understanding Conference (MUC-6), Columbia, Maryland, USA. Morgan Kaufmann: San Francisco, USA.

MUC-7. 1999. Proceedings of the Seventh Message Understanding Conference (MUC-7). Available online from the SAIC website at http://www.muc.saic.com/.

Rose, D. E., and Belew, R. K. 1991. Toward a direct-manipulation interface for conceptual information retrieval systems. Greenwood Press: Westport, Connecticut, USA. Chapter 3, 39-53.

Twidale, M. B., and Nichols, D. M. 1998. Designing interfaces to support collaboration in information retrieval. Interacting With Computers 10(2):177-193.

Vanderheyden, P., and Cohen, R. 1998. Information extraction and the casual user. In AAAI'98-AIII (1998), 137-142.

Vanderheyden, P., and Cohen, R. 1999a. Designing a mixed-initiative information extraction system. In Working Notes of the AAAI'99 Workshop on Mixed-Initiative Intelligence, 142-146. Orlando, Florida, USA: AAAI'99-MII.

Vanderheyden, P., and Cohen, R. 1999b. A dynamic interface for a dynamic task: Information extraction and evolving queries. In Working Notes of the AAAI'99 Spring Symposium on Agents With Adjustable Autonomy, 120-123. Stanford, California, USA: AAAI'99-SSAWAA.

Weischedel, R. 1995. BBN: Description of the PLUM system as used for MUC-6. In MUC-6 (1995), 55-69.

Wermter, S.; Riloff, E.; and Scheler, G., eds. 1996. Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing. Lecture Notes in Artificial Intelligence. Springer, Berlin.
