Interactive Information Extraction with Constrained Conditional Random Fields
Total Page:16
File Type:pdf, Size:1020Kb
Interactive Information Extraction with Constrained Conditional Random Fields Trausti Kristjansson Aron Culotta Paul Viola Andrew McCallum Microsoft Research Dept. of Computer Science Microsoft Research Dept. of Computer Science Redmond, Washington University of Mass. Amherst Redmond, Washington University of Mass. Amherst [email protected] [email protected] [email protected] [email protected] Abstract fi elds in a contact database, including last name, fi rst name, address, city, state, phone, etc. As we will show, it is pos- Information Extraction methods can be used to au- sible to create automatic systems which will extract over tomatically “fill-in” database forms from unstructured data such as Web documents or email. State-of-the-art 90% of these fi elds correctly from a diverse set of complex methods have achieved low error rates but invariably sources. Given this low error rate, the first goal of a good make a number of errors. The goal of an interactive user interface is to display the extracted fi elds so that they information extraction system is to assist the user in fill- can be verified rapidly. The second goal is to allow for the ing in database fields while giving the user confidence rapid correction of incorrect fi elds. It is important to realize in the integrity of the data. The user is presented with that the fi elds are extracted as an interdependent set. Given an interactive interface that allows both the rapid verifi- the name “Charles Stanley” it is likely that the fi rst name is cation of automatic field assignments and the correction Charles and the last name is “Stanley.” But, the opposite is of errors. In cases where there are multiple errors, our possible as well. Given the error that the two names have system takes into account user corrections, and immedi- been switched, naive correction systems require two correc- ately propagates these constraints such that other fields are often corrected automatically. tive actions. In the interactive information extraction system Linear-chain conditional random fields (CRFs) have described below, when the user corrects the fi rst name fi eld been shown to perform well for information extraction to be “Stanley,” the system then automatically changes the and other language modelling tasks due to their ability last name fi eld to be “Charles.” We call this capability cor- to capture arbitrary, overlapping features of the input in rection propagation. a Markov model. We apply this framework with two ex- From the perspective of user interface design, there are tensions: a constrained Viterbi decoding which finds the a number of goals, including reducing cognitive load, re- optimal field assignments consistent with the fields ex- ducing the number of user actions (clicks and keystrokes), plicitly specified or corrected by the user; and a mech- and speeding up the data acquisition process. An impor- anism for estimating the confidence of each extracted field, so that low-confidence extractions can be high- tant element that is often overlooked is the confidence the lighted. Both of these mechanisms are incorporated in a user has in the integrity of the data. This is crucial to the novel user interface for form filling that is intuitive and usability of the application, as users are not tolerant of (sur- speeds the entry of data—providing a 23% reduction in prising) errors, and will discontinue the use of an automatic error due to automated corrections. semi-intelligent application if it has corrupted or misclassi- fi ed information. Unfortunately such factors are often hard Introduction to quantify. An interactive form fi lling system is quite different from A recent study showed that as part of the process of gather- the batch processing of data, such as for warehouse data ing and managing information, currently 70 million workers, cleaning (Borkar, Deshmukh, & Sarawagi 2000). In batch or 59% of working adults in the U.S., complete forms on a processing the set of fi elds extracted are determined directly regular basis. Filling in forms is tedious, error-prone and and are optimized for low error rates. In contrast interactive time-consuming. In many cases, the data that is used to pop- information extraction (IIE) puts additional requirementson ulate the fi elds of the form is already available in computer the information extraction system. To facilitate a natural readable form. user experience, the information extraction system must dis- The goal of this work is to reduce the burden on the user play low confi dence fi elds and make optimal use of any cor- to the largest extent possible, while ensuring the integrity rections that the user has made. of the data entered into the system. One typical example There are a number of statistical approaches for infor- is the entry of contact addresses from on-line sources such mation extraction (IE) that are more or less suited to this as email messages or Web pages. There are more than 20 paradigm. The most common engineering approach is to Copyright c 2004, American Association for Artificial Intelli- build a set of regular expressions that extract the fields in gence (www.aaai.org). All rights reserved. question. Regular expressions are a poor match for interac- tive information extraction since they cannot estimate confi- dence, nor can they naturally incorporateuser labels and cor- rections. Maximum Entropy Classifi ers are potentially quite powerful, since they allow for the introduction of arbitrary, potentially dependent, features. Maximum entropy classi- fi er can also estimate the confi dence in decisions. However, each fi eld extracted using a maximum entropy model is es- timated independently. For this reason the potential for cor- rection propagation is minimal. Conditional Random Fields, a generalization both of maximum entropy models and hid- den Markov models, allow for the introduction of arbitrary non-local features and capture the dependencies between la- bels. CRFs have been shown to perform well on information extraction tasks (McCallum & Li 2003; Pinto et al. 2003; McCallum 2003; Sha & Pereira 2003), and are well-suited for interactive information extraction since the confi denceof the labels can be estimated and there is a natural scheme for Figure 1: A user interface for entry of contact information. optimally propagating user corrections. The user interface relies on interactive information extrac- There are two contributions of this paper. The first con- tion. If a user makes a correction, the interactive parser can tribution is the introduction of the interactive information update other fields. Notice that there are 3 possible names extraction framework. This includes a user interface that associated with the address. The user is alerted to the ambi- highlights the label assigned to each field in the unstruc- guity by the color coding. tured document visually while flagging low confi dence la- bels. The interface also allows for rapid correction using “drag and drop.” Finally, the interface supports the propa- case color-coded correspondence is used (e.g. blue for gation of fi eld corrections, so that one correction will often all phone information, and yellow for email addresses). correct many errors. Other options include arrows or floating overlayed tags. The second contribution is a pair of new algorithms for The UI allows for rapid correction. For example, text seg- the estimation of field confidences in CRFs and for the in- ments can easily be grouped into blocks to allow for a corporation of constraints into the Viterbi decoding process. single click-drag-drop. In the contact record at the left, In this case the constraints come from corrections to incor- fields have drop down menus with other candidates for rect fields or from the new field labels added by the user. the fi eld. Alternatively the interface could include “try The remainder of this paper describes each contribution again” buttons next to the fields that cycle through pos- in turn. We then describe a set of experiments in the domain sible alternative extractions for the fi eld until the correct of contact address entry. In these experiments we compare value is found. the performance of several well known algorithms against CRFs. We then investigate the effectiveness of constrained The UI immediately propagates all corrections and addi- Viterbi decoding after correcting the least confi dent error. tions by the “constrained Viterbi” proceduredescribed be- low. User Interaction Models The UI visually alerts the user to fi elds that have low con- Theideaexploredin this paper is that of populatingthe fi elds fi dence. Furthermore, in the unstructured text box, possi- of a contact database, sometimes called a Digital Address ble alternatives may be highlighted (e.g. alternate names Book. With the increase of personal digital devices such as are indicated in orange) personal digital assistants (PDAs), and cell phones, there is Using a well-defi ned probabilistic model, such as CRF’s, increasing demand for better tools for contact entry. we can correctly calculate confidence estimates for each User Interfaces for Information Extraction field assignment. Estimation of confidence scores is dis- cussed in the section “Confi dence Estimation.” Figure 1 shows a user interface that facilitates interactive in- Confidence scores can be utilized in a UI in a number formation extraction. The fi elds to be populated are on the of ways. Field assignments with relatively low confi dence left side, and the source text was pasted by the user into the can be visually marked. If a field assignment has very low right side. The information extraction system extracts text confi dence, and is likely to be incorrect; we may choose not segments from the unstructured text and populates the cor- to fi ll in the fi eld at all. The text that is most likely to be responding fi elds in the contact record.