Interactive Information Extraction with Constrained Conditional Random Fields

Trausti Kristjansson (Microsoft Research, Redmond, Washington)
Aron Culotta (Dept. of Computer Science, University of Massachusetts Amherst)
Paul Viola (Microsoft Research, Redmond, Washington)
Andrew McCallum (Dept. of Computer Science, University of Massachusetts Amherst)

Abstract

Information Extraction methods can be used to automatically "fill-in" database forms from unstructured data such as Web documents or email. State-of-the-art methods have achieved low error rates but invariably make a number of errors. The goal of an interactive information extraction system is to assist the user in filling in database fields while giving the user confidence in the integrity of the data. The user is presented with an interactive interface that allows both the rapid verification of automatic field assignments and the correction of errors. In cases where there are multiple errors, our system takes into account user corrections, and immediately propagates these constraints such that other fields are often corrected automatically.

Linear-chain conditional random fields (CRFs) have been shown to perform well for information extraction and other language modelling tasks due to their ability to capture arbitrary, overlapping features of the input in a Markov model. We apply this framework with two extensions: a constrained Viterbi decoding which finds the optimal field assignments consistent with the fields explicitly specified or corrected by the user; and a mechanism for estimating the confidence of each extracted field, so that low-confidence extractions can be highlighted. Both of these mechanisms are incorporated in a novel user interface for form filling that is intuitive and speeds the entry of data, providing a 23% reduction in error due to automated corrections.

Introduction

A recent study showed that as part of the process of gathering and managing information, 70 million workers, or 59% of working adults in the U.S., currently complete forms on a regular basis. Filling in forms is tedious, error-prone and time-consuming. In many cases, the data that is used to populate the fields of the form is already available in computer-readable form.

The goal of this work is to reduce the burden on the user to the largest extent possible, while ensuring the integrity of the data entered into the system. One typical example is the entry of contact addresses from on-line sources such as email messages or Web pages. There are more than 20 fields in a contact database, including last name, first name, address, city, state, phone, etc. As we will show, it is possible to create automatic systems which will extract over 90% of these fields correctly from a diverse set of complex sources. Given this low error rate, the first goal of a good user interface is to display the extracted fields so that they can be verified rapidly. The second goal is to allow for the rapid correction of incorrect fields.

It is important to realize that the fields are extracted as an interdependent set. Given the name "Charles Stanley" it is likely that the first name is "Charles" and the last name is "Stanley." But the opposite is possible as well. Given the error that the two names have been switched, naive correction systems require two corrective actions. In the interactive information extraction system described below, when the user corrects the first-name field to be "Stanley," the system then automatically changes the last-name field to be "Charles." We call this capability correction propagation.

From the perspective of user interface design, there are a number of goals, including reducing cognitive load, reducing the number of user actions (clicks and keystrokes), and speeding up the data acquisition process. An important element that is often overlooked is the confidence the user has in the integrity of the data. This is crucial to the usability of the application, as users are not tolerant of (surprising) errors, and will discontinue the use of an automatic semi-intelligent application if it has corrupted or misclassified information. Unfortunately such factors are often hard to quantify.

An interactive form filling system is quite different from the batch processing of data, such as for warehouse data cleaning (Borkar, Deshmukh, & Sarawagi 2000). In batch processing the set of fields extracted is determined directly and optimized for low error rates. In contrast, interactive information extraction (IIE) puts additional requirements on the information extraction system. To facilitate a natural user experience, the information extraction system must display low-confidence fields and make optimal use of any corrections that the user has made.

There are a number of statistical approaches for information extraction (IE) that are more or less suited to this paradigm. The most common engineering approach is to build a set of regular expressions that extract the fields in question. Regular expressions are a poor match for interactive information extraction since they cannot estimate confidence, nor can they naturally incorporate user labels and corrections. Maximum entropy classifiers are potentially quite powerful, since they allow for the introduction of arbitrary, potentially dependent, features, and they can also estimate the confidence in decisions. However, each field extracted using a maximum entropy model is estimated independently, so the potential for correction propagation is minimal. Conditional random fields, a generalization both of maximum entropy models and hidden Markov models, allow for the introduction of arbitrary non-local features and capture the dependencies between labels. CRFs have been shown to perform well on information extraction tasks (McCallum & Li 2003; Pinto et al. 2003; McCallum 2003; Sha & Pereira 2003), and are well-suited for interactive information extraction since the confidence of the labels can be estimated and there is a natural scheme for optimally propagating user corrections.

There are two contributions of this paper. The first contribution is the introduction of the interactive information extraction framework. This includes a user interface that visually highlights the label assigned to each field in the unstructured document while flagging low-confidence labels. The interface also allows for rapid correction using "drag and drop." Finally, the interface supports the propagation of field corrections, so that one correction will often correct many errors. The second contribution is a pair of new algorithms for the estimation of field confidences in CRFs and for the incorporation of constraints into the Viterbi decoding process. In this case the constraints come from corrections to incorrect fields or from new field labels added by the user.

The remainder of this paper describes each contribution in turn. We then describe a set of experiments in the domain of contact address entry. In these experiments we compare the performance of several well-known algorithms against CRFs. We then investigate the effectiveness of constrained Viterbi decoding after correcting the least confident error.

Copyright © 2004, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

User Interaction Models

The idea explored in this paper is that of populating the fields of a contact database, sometimes called a Digital Address Book. With the increase of personal digital devices such as personal digital assistants (PDAs) and cell phones, there is increasing demand for better tools for contact entry.

User Interfaces for Information Extraction

Figure 1 shows a user interface that facilitates interactive information extraction. The fields to be populated are on the left side, and the source text was pasted by the user into the right side. The information extraction system extracts text segments from the unstructured text and populates the corresponding fields in the contact record.

Figure 1: A user interface for entry of contact information. The user interface relies on interactive information extraction. If a user makes a correction, the interactive parser can update other fields. Notice that there are 3 possible names associated with the address. The user is alerted to the ambiguity by the color coding.

This user interface is designed with the strengths and weaknesses of the information extraction technology in mind. Some important aspects are:

- The UI displays visual aids that allow the user to quickly verify the correctness of the extracted fields. In this case color-coded correspondence is used (e.g. blue for all phone information, and yellow for email addresses). Other options include arrows or floating overlaid tags.

- The UI allows for rapid correction. For example, text segments can easily be grouped into blocks to allow for a single click-drag-drop. In the contact record at the left, fields have drop-down menus with other candidates for the field. Alternatively the interface could include "try again" buttons next to the fields that cycle through possible alternative extractions for the field until the correct value is found.

- The UI immediately propagates all corrections and additions via the "constrained Viterbi" procedure described below.

- The UI visually alerts the user to fields that have low confidence. Furthermore, in the unstructured text box, possible alternatives may be highlighted (e.g. alternate names are indicated in orange).

Using a well-defined probabilistic model, such as CRFs, we can correctly calculate confidence estimates for each field assignment. Estimation of confidence scores is discussed in the section "Confidence Estimation." Confidence scores can be utilized in a UI in a number of ways. Field assignments with relatively low confidence can be visually marked. If a field assignment has very low confidence and is likely to be incorrect, we may choose not to fill in the field at all; the text that is most likely to be assigned to the field can then be highlighted in the textbox (e.g. in orange). Another related case is when there are multiple text segments that are all equally likely to be classified as, e.g., a name; this can also be visually indicated (as is done in Figure 1).

User Interaction Models

For the purposes of quantitative evaluation we will simulate the behavior of a user during contact record entry, verification, and correction. This allows for a simpler experimental paradigm that can more clearly distinguish the values of the various technical components. A set of user studies will be reported elsewhere.

A large number of user interaction models are possible given the particulars of the interface and information extraction engine. Here we outline the basic models that will be evaluated in the experimental section.

- UIM1: The simplest case. The user is presented with the results of automatic field assignment and has to correct all errors (i.e. no correction-propagation).

- UIM2: Under this model, we assume an initial automatic field assignment, followed by a single randomly-chosen manual correction by the user. We then perform correction-propagation, and the user has to correct all remaining errors manually.

- UIM3: This model is similar to UIM2. We assume an initial automatic field assignment. Next the user is asked to correct the least confident incorrect field: the user is visually alerted to the fields in order of confidence, until an error is found. We then perform correction-propagation and the user then has to correct all remaining errors manually.

- UIMm: The user has to fill in all fields manually.

Performance Evaluation

The goal in designing a new application technology is that users see an immediate benefit in using the technology. Assuming that perfect accuracy is required, benefit is realized if the technology increases the time efficiency of users, or if it reduces the cognitive load, or both. Here we introduce an efficiency measure, called the Expected Number of User Actions, which will be used in addition to standard IE performance measures.

The Expected Number of User Actions: The Expected Number of User Actions (ENUA) measure is defined as the number of user actions (e.g. clicks) required to correctly enter all fields of a record. The ENUA will depend on the user interaction model. To express it we introduce the following notation: P_m(n) is the probability distribution over the number of errors n after m manual corrections. This distribution is represented by the histogram in Figure 2.

Under UIM1, which does not involve correction propagation, the Expected Number of User Actions is

    ENUA_1 = Σ_n n · P_0(n),    (1)

where P_0(n) is the distribution over the number of incorrect fields (see Figure 2). In models UIM2 and UIM3 the Expected Number of User Actions is

    ENUA_{2,3} = (1 − P_0(0)) + Σ_n n · P^{CP}_1(n),    (2)

where P_0(0) is the probability that all fields are correctly assigned initially and P^{CP}_1(n) is the distribution over the number of incorrect fields in a record after one field has been corrected. The distribution P^{CP}_1 will depend on which incorrect field is corrected: a random incorrect field is corrected under UIM2, whereas the least confident incorrect field is corrected under UIM3. The subscript on P^{CP}_1 indicates that correction-propagation is performed once.

Figure 2: Histogram of records binned by how many fields in a record are in error (x-axis: number of incorrect fields in record, 0 to 6; y-axis: number of records, 0 to 600). Solid bars are for CRF before any corrections. The shaded bars show the distribution after one random incorrect field has been corrected (CCRF). These can be used to estimate P_0(n) and P^{CP}_1(n) respectively.
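The ENUA formulas above are simple expectations over the error histograms of Figure 2. The following is a minimal sketch of that computation; the per-record error counts and all function names are our own illustrative inventions, not data from the paper.

```python
# Sketch: ENUA (Eqs. 1-2) from empirical error histograms.
# p0      ~ P_0(n): distribution of errors per record before any correction.
# p_cp1   ~ P^CP_1(n): distribution after one correction plus propagation.
from collections import Counter

def distribution(error_counts):
    """Empirical distribution over the number of incorrect fields per record."""
    total = len(error_counts)
    return {n: c / total for n, c in Counter(error_counts).items()}

def enua_uim1(p0):
    # Eq. 1: the user fixes every incorrect field by hand.
    return sum(n * p for n, p in p0.items())

def enua_uim23(p0, p_cp1):
    # Eq. 2: one action to correct a field (whenever any error exists),
    # then manual correction of whatever errors remain after propagation.
    return (1 - p0.get(0, 0.0)) + sum(n * p for n, p in p_cp1.items())

errors_before = [0, 0, 0, 2, 1, 0, 3, 0, 0, 1]   # toy per-record error counts
errors_after  = [0, 0, 0, 0, 1, 0, 1, 0, 0, 0]   # after one correction + propagation
p0, p1 = distribution(errors_before), distribution(errors_after)
print(enua_uim1(p0))        # expected user actions without propagation
print(enua_uim23(p0, p1))   # expected user actions with propagation
```

On these toy counts the expectation drops from 0.7 actions per record (no propagation) to 0.6 (one correction plus propagation), mirroring the kind of reduction reported in the experiments.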

Constrained Conditional Random Fields

Conditional random fields (Lafferty, McCallum, & Pereira 2001) are undirected graphical models used to calculate the conditional probability of values on designated output nodes given values on designated input nodes. In the special case in which the designated output nodes are linked by edges in a linear chain, CRFs make a first-order Markov independence assumption among output nodes, and thus correspond to finite state machines (FSMs). In this case CRFs can be roughly understood as conditionally-trained hidden Markov models, with additional flexibility to effectively take advantage of complex overlapping features.

Let o = ⟨o_1, o_2, …, o_T⟩ be some observed input data sequence, such as a sequence of words in a document (the values on T input nodes of the graphical model). Let S be a set of FSM states, each of which is associated with a label (such as the label LASTNAME). Let s = ⟨s_1, s_2, …, s_T⟩ be some sequence of states (the values on T output nodes). CRFs define the conditional probability of a state sequence given an input sequence as

    p_Λ(s | o) = (1/Z_o) · exp( Σ_{t=1..T} Σ_k λ_k f_k(s_{t−1}, s_t, o, t) ),    (3)

where Z_o is a normalization factor over all state sequences, f_k(s_{t−1}, s_t, o, t) is an arbitrary feature function over its arguments, and λ_k is a learned weight for each feature function. The normalization factor Z_o involves a sum over an exponential number of different possible state sequences, but because these nodes with unknown values are connected in a graph without cycles (a linear chain in this case), it can be efficiently calculated via dynamic programming. Inference to find the most likely state sequence (very much like the Viterbi algorithm in this case) is also a simple matter of dynamic programming. Maximum a posteriori training of these models is efficiently performed by hill-climbing methods such as conjugate gradient, or its improved second-order cousin, limited-memory BFGS (Sha & Pereira 2003).

In order to facilitate the user interaction model, we need to clamp some of the hidden variables to particular values. Doing so results in the constrained Viterbi algorithm for CRFs, described below.
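Eq. 3 can be made concrete on a toy chain. The feature functions and weights below are invented for illustration (they are not the paper's learned model), and Z_o is enumerated by brute force, which is feasible only for tiny examples; the paper computes it by dynamic programming.

```python
# Sketch of Eq. 3: score a state sequence with toy weighted features and
# normalize by Z_o, here enumerated over all sequences by brute force.
import math
from itertools import product

STATES = ("FIRSTNAME", "LASTNAME")

def log_potential(prev, cur, obs):
    """Toy sum of weighted features lambda_k * f_k(prev, cur, obs)."""
    w = 0.0
    if obs.istitle():
        w += 0.5                    # capitalized token, any state
    if prev == "FIRSTNAME" and cur == "LASTNAME":
        w += 1.0                    # a first name tends to precede a last name
    return w

def score(states, obs):
    total, prev = 0.0, "START"
    for s, o in zip(states, obs):
        total += log_potential(prev, s, o)
        prev = s
    return total

def prob(states, obs):
    # Eq. 3: p(s|o) = exp(score(s, o)) / Z_o
    z = sum(math.exp(score(seq, obs)) for seq in product(STATES, repeat=len(obs)))
    return math.exp(score(tuple(states), obs)) / z

obs = ["Charles", "Stanley"]
best = max(product(STATES, repeat=len(obs)), key=lambda seq: prob(seq, obs))
print(best)
```

Because of the transition feature, the model prefers labelling "Charles Stanley" as (FIRSTNAME, LASTNAME), even though each token looks the same in isolation; this label interdependence is exactly what makes correction propagation possible.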

For HMMs, the Viterbi algorithm (Rabiner 1989) is an efficient dynamic programming solution to the problem of finding the state sequence most likely to have generated the observation sequence. Because CRFs are conditionally trained, the CRF Viterbi algorithm instead finds the most likely state sequence given an observation sequence,

    ŝ = argmax_s p_Λ(s | o).

To avoid an exponential-time search over all possible settings of s, Viterbi stores the probability of the most likely path at time t which accounts for the first t observations and ends in state s. Following the notation of (Rabiner 1989), we define this probability to be δ_t(s), where δ_1(s) is the probability of starting in each state s, and the induction step is given by

    δ_{t+1}(s) = max_{s'} [ δ_t(s') · exp( Σ_k λ_k f_k(s', s, o, t) ) ].    (4)

The recursion terminates in

    p* = max_s δ_T(s).

We can backtrack through the dynamic programming table to recover ŝ = ⟨s_1, …, s_T⟩.

Constrained Viterbi alters Eq. 4 such that ŝ is constrained to pass through some subpath C = ⟨s_q, s_{q+1}, …⟩. These constraints C now define the new induction

    δ_{t+1}(s) = { max_{s'} [ δ_t(s') · exp( Σ_k λ_k f_k(s', s, o, t) ) ]   if s = s_{t+1},
                 { 0                                                        otherwise,    (5)

for all s_{t+1} ∈ C. For time steps not constrained by C, Eq. 4 is used instead.

In the context of interactive form filling, the constraints C correspond to a set of observations (an address field) manually corrected by the user. Upon correction, the system runs Constrained Viterbi to find the best path that conforms to the corrected field. In addition to correcting the field the user indicates, this process may also change the predicted states for observations outside of the corrected field. This is because the recursive formulation in Eq. 5 can affect optimal paths before and after the time steps specified in C.

Confidence Estimation

To estimate the confidence the CRF has in an extracted field, we employ a technique we term Constrained Forward-Backward (Culotta & McCallum 2004). The Forward-Backward algorithm is similar to the Viterbi algorithm: instead of choosing the maximum state sequence, Forward-Backward evaluates all possible state sequences given the observation sequence. The "forward values" α_{t+1}(s) are recursively defined similarly to Eq. 4, except the max is replaced by a summation. Thus we have

    α_{t+1}(s) = Σ_{s'} [ α_t(s') · exp( Σ_k λ_k f_k(s', s, o, t) ) ].    (6)

Furthermore, the recursion terminates to define Z_o in Eq. 3:

    Z_o = Σ_s α_T(s).    (7)

The Constrained Forward-Backward algorithm calculates the probability of any sequence passing through a set of constraints C = ⟨s_q, …, s_r⟩, where now an element of C can be either a positive constraint or a negative constraint. A negative constraint constrains the forward value calculation not to pass through the given state. The calculations of the forward values can be made to conform to C in a manner similar to the Constrained Viterbi algorithm. If α'_{t+1}(s) is the constrained forward value, then Z'_o = Σ_s α'_T(s) is the value of the constrained lattice. Our confidence estimate is equal to the normalized value of the constrained lattice: Z'_o / Z_o.

In the context of interactive form filling, the constraints C correspond to an automatically extracted field. The positive constraints specify the observation tokens labelled inside the field, and the negative constraints specify the boundary of the field. For example, if we use state names B-JOBTITLE and I-JOBTITLE to label tokens that begin and continue a JOBTITLE field, and the system labels observation sequence ⟨o_2, …, o_5⟩ as a JOBTITLE field, then C = ⟨s_2 = B-JOBTITLE, s_3 = … = s_5 = I-JOBTITLE, s_6 ≠ I-JOBTITLE⟩.
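The constrained decoding and confidence machinery above can be sketched in a few lines. This is our own illustrative toy, not the paper's implementation: two states, two tokens, made-up potentials, and only positive constraints (Eq. 5); negative constraints would simply zero out the one disallowed state instead of all others.

```python
# Sketch: constrained Viterbi (Eqs. 4-5) and Constrained Forward-Backward
# confidence Z'_o / Z_o (Eqs. 6-7) on a tiny two-state, two-token chain.
STATES = [0, 1]                      # 0 = FIRSTNAME, 1 = LASTNAME (toy labels)
INIT = [2.0, 1.0]                    # exp(weighted features) for the first token
TRANS = [[[1.0, 3.0],                # TRANS[t][i][j]: potential of moving from
          [2.0, 1.0]]]               # state i to state j at step t+1 (one step here)

def _mask(vec, t, constraints):
    """Zero out states disallowed by a positive constraint at time t (Eq. 5)."""
    if constraints is None or t not in constraints:
        return vec
    return [v if s == constraints[t] else 0.0 for s, v in enumerate(vec)]

def viterbi(constraints=None):
    """constraints: dict time -> required state (a user-corrected field)."""
    delta = _mask(INIT[:], 0, constraints)
    back = []
    for t, trans in enumerate(TRANS, start=1):
        scores = [[delta[i] * trans[i][j] for i in STATES] for j in STATES]
        back.append([max(STATES, key=lambda i: scores[j][i]) for j in STATES])
        delta = _mask([max(scores[j]) for j in STATES], t, constraints)
    path = [max(STATES, key=lambda s: delta[s])]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return path[::-1]

def lattice_value(constraints=None):
    """Forward pass (Eq. 6); the total is Z_o (Eq. 7), or Z'_o if constrained."""
    alpha = _mask(INIT[:], 0, constraints)
    for t, trans in enumerate(TRANS, start=1):
        alpha = _mask([sum(alpha[i] * trans[i][j] for i in STATES)
                       for j in STATES], t, constraints)
    return sum(alpha)

print(viterbi())                                       # unconstrained decoding
print(viterbi({0: 1}))                                 # user clamps token 0 to LASTNAME
print(lattice_value({0: 0, 1: 1}) / lattice_value())   # confidence of a FN,LN field
```

In this toy, the unconstrained best path is (FIRSTNAME, LASTNAME); clamping the first token to LASTNAME flips the second token to FIRSTNAME as well, which is exactly the correction-propagation behavior described above.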

Experiments

For training and testing we collected 2187 contact records (27,560 words) from web pages and emails and hand-labeled 25 classes of data fields.[1] Some data came from pages containing lists of addresses, and about half came from disparate web pages found by searching for valid pairs of city name and zip code.

The features consisted of capitalization features, 24 regular expressions over the token text (e.g. CONTAINSHYPHEN, CONTAINSDIGITS, etc.), character n-grams of length 2-4, and offsets of these features within a window of size 5. We also used 19 lexicons, including "US Last Names," "US First Names," "State names," "Titles/Suffixes," "Job titles," and "Road endings." Feature induction was not used in these experiments.

We implemented five machine learning methods to automatically annotate the contact records. All models were trained on a random subset of 70% of the data and tested on the remaining 30%. Table 1 shows the performance for the two baseline methods. Column 1 lists the token accuracy (the proportion of tokens labeled correctly), and columns 2-4 list the segmentation performance at the field level, where F1 is the harmonic mean of recall and precision. CRF is the conditional random field classifier described earlier; MAXENT is a conditional maximum entropy classifier. These experiments do not include any user feedback. Notice that the token error rate of the CRF system is about 11% lower than that of the MaxEnt system.

              Token Acc.   F1      Prec    Rec
    CRF       89.73        87.23   88.24   86.24
    MAXENT    88.43        84.84   85.09   84.95

Table 1: Token accuracy and field performance for the conditional random field based field extractor and the maximum entropy based field extractor.

[1] The 25 fields are: FIRSTNAME, MIDDLENAME, LASTNAME, NICKNAME, SUFFIX, TITLE, JOBTITLE, COMPANYNAME, DEPARTMENT, ADDRESSLINE, CITY1, CITY2, STATE, COUNTRY, POSTALCODE, HOMEPHONE, FAX, COMPANYPHONE, DIRECTCOMPANYPHONE, MOBILE, PAGER, VOICEMAIL, URL, EMAIL, INSTANTMESSAGE.

In the following sections, we start by discussing results in terms of the Expected Number of User Actions. We then discuss results that highlight the effectiveness of correction-propagation and the utilization of confidence scores, respectively.

User Interaction Evaluation

The standard information retrieval metrics do not adequately capture the performance of an interactive information extraction system. In this paper we have proposed a metric called the Expected Number of User Actions. Table 2 shows the ENUA for the different algorithms and user interaction models. In addition to the CRF and MAXENT algorithms, Table 2 shows results for CCRF, the constrained conditional random field classifier presented in this paper.

                         ENUA    Change
    CRF    - (UIM1)      0.73    baseline
    CCRF   - (UIM2)      0.63    -13.9%
    CCRF   - (UIM3)      0.64    -11.3%
    MAXENT - (UIM1)      0.94    +29.0%
    Manual - (UIMm)      6.31    +770.0%

Table 2: The Expected Number of User Actions (ENUA) to completely enter a contact record. Notice that the constrained CRF with a random corrected field reduces the ENUA by 13.9%.

The baseline user interaction model (UIM1), where IE is used to populate the fields and the user corrects all remaining errors, is expected to require 0.73 user actions per record. Notice that manual entry of records is expected to require on average 6.31 user actions to enter all fields, about 8.6 times more user actions. We see the advantage of correction-propagation in CCRFs when an arbitrary incorrect field is corrected (UIM2), over the baseline without correction propagation (UIM1), in the second row of Table 2: the ENUA drops to 0.63, a relative drop of 13.9%. In comparison, manual entry requires over 10 times more user actions.

Confidence estimation is used in UIM3. Recall that in this user interaction model the system assigns confidence scores to the fields, and the user is asked to correct the least confident incorrect field. Interestingly, correcting a random field (ENUA = 0.63) seems to be slightly more informative for correction-propagation than correcting the least confident erroneous field (ENUA = 0.64). While this may seem surprising, recall that a field will have low confidence if the posterior probability of the competing classes is close to the score for the chosen class. Hence, it only requires a small amount of extra information to boost the posterior for one of the other classes and "flip" the classification. We can imagine a contrived example with two adjacent incorrect fields: here we should correct the more confident of the two to maximize correction propagation, because the field with lower confidence requires a smaller amount of extra information to correct its classification.

Under UIM3, the user may be required to verify a number of correct fields before an incorrect field is found, since the model may have least confidence in correct fields. Another way of assessing the effectiveness of the confidence measure is to ask how effective it is at directing the user to an incorrect token. In our experiments with CCRFs, the number of records that contained one or more errors was 276. The least confident field was truly incorrect in 226 out of those 276 records. Hence, confidence estimation correctly predicts an erroneous field 81.9% of the time. If we instead choose a token at random, then we will choose an incorrect token in 80 out of the 276 records, or 29.0%. In practice, the user does not initially know where the errors are, so confidence estimates can be used effectively to direct the user to incorrect fields.

The ENUA metric does not take into account the time it takes the user to scan the record and find incorrect fields. It is difficult to assess this without extensive user studies, where different strategies and visual cues are compared.

Correction Propagation

To examine the effectiveness of correction propagation, Table 3 shows the token accuracy of CCRF on contact records containing errors in at least two fields. One field is corrected by the user, with the hope that correction propagation will automatically fix errors in other fields. Here, total accuracy is the token accuracy for the entire contact record, and uncorrected accuracy is the token accuracy of tokens not corrected by the user. Note that these accuracies are naturally lower than in Table 1 because we are only examining records with multiple errors.

    total accuracy before correction          72.02
    total accuracy after correction           87.12
    uncorrected accuracy before correction    79.50
    uncorrected accuracy after correction     84.30

Table 3: Correction propagation results for CCRF on sequences with errors in at least two fields. Error reduction is 23% for uncorrected tokens.

These results show that having the user correct one field results in a 23% reduction in error in the remaining fields. This additional error reduction is a boon to users since they do not have to perform these corrections manually.

Confidence Estimation

In the preceding discussion, the goal of IIE has been to correctly fill in all fields of each record. A different scenario arises if we wish to reduce the labelling error rate of a large amount of data but do not need the labelling to be error free. If we have limited man-power, we would like to maximize the efficiency or information gain from the human labeller. This user interaction model assumes that we allow the human labeller to verify or correct a single field in each record before going on to the next record.

As before, the constrained conditional random field model is used, where Constrained Forward-Backward is used to predict the least confident extracted field. If this field is incorrect, then CCRF is supplied with the correct labelling, and correction propagation is performed using Constrained Viterbi. If this field is correct, then no changes are made, and we go on to the next record. The experiments compare the effectiveness of verifying or correcting the least confident field, i.e. CCRF - (L.CONF), with verifying or correcting an arbitrary field, i.e. CCRF - (RANDOM). Finally, CMAXENT is a maximum entropy classifier that estimates the confidence of each field by averaging the posterior probabilities of the labels assigned to each token in the field. As in CCRF, the least confident field is corrected if necessary.

                         Error Reduction   F1      Prec    Rec
    CCRF - (L. CONF.)    56.2%             94.45   94.84   94.06
    CCRF - (RANDOM)      19.2%             89.72   90.72   88.75
    CMAXENT              27.2%             89.4    90.34   88.48

Table 4: Token accuracy and field performance for interactive field labeling. CCRF - (L. CONF.) obtains a 56% reduction in error over CRF, and a 46% reduction in error over CCRF - (RANDOM).

Table 4 shows results after a single field has been verified or corrected. Notice that if a random field is chosen to be verified or corrected, then the token accuracy goes to 91.7%, which is only a 19.2% reduction in error rate. If, however, we verify or correct only the least confident field, the error rate is reduced by 56.18%. This difference illustrates that reliable confidence prediction can increase the effectiveness of a human labeller. Also note that the 56% error reduction CCRF achieves over CRF is substantially greater than the 27% error reduction between CMAXENT and MAXENT.

To explicitly measure the effectiveness of the Constrained Forward-Backward algorithm for confidence estimation, Table 5 displays two evaluation measures: Pearson's r and average precision. Pearson's r is a correlation coefficient ranging from −1 to 1 which measures the correlation between the confidence score of a field and whether or not it is correct. Given a list of extracted fields ordered by their confidence scores, average precision measures the quality of this ordering: we calculate the precision at each point in the ranked list where a correct field is found and then average these values. WORSTCASE is the average precision obtained by ranking all incorrect fields above all correct fields. Both the Pearson's r and average precision results demonstrate the effectiveness of Constrained Forward-Backward for estimating the confidence of extracted fields.

                      Pearson's r   Avg. Precision
    Constrained FB    0.531         98.0
    Random            -0.02         87.99
    WorstCase         -             71.48

Table 5: Constrained Forward-Backward confidence estimation performs substantially better than baseline approaches.

Related Work

This paper is the first of which we are aware that uses interactive information extraction with constraint propagation and confidence prediction to reduce human effort in form-filling. Several other efforts have studied efficient ways to interactively train an extraction system, which would later run without human interaction, for example (Cardie & Pierce 1998; Caruana, Hodor, & Rosenberg 2000).

Methods to visually link related components in a user interface have been explored, for example (Becker & Cleveland 1987; Swayne, Cook, & Buja 1991). The XGobi system uses color coding and "brushing" to indicate associations in various types of high-dimensional data. Many common word processors use visual cues to direct the user's attention to possible errors in spelling and grammar. In (Miller & Myers 2001) the authors use a similar strategy, based on outlier detection.

Others have implemented systems for information extraction from free-text address blocks, though none using an interactive method. Borkar, Deshmukh, & Sarawagi (2000) obtain high accuracy using an HMM on a simpler and more limited set of fields (HouseNum, PO Box, Road, City, State, ZIP), which usually appear in very regular form. Similarly, Bouckaert (2002) extracts the components of author affiliations from articles of a pharmaceutical journal.

Confidence prediction itself is also an under-studied aspect of information extraction, although it has been investigated in document classification (Bennett 2000), speech recognition (Gunawardana, Hon, & Jiang 1998), and machine translation (Gandrabur & Foster 2003). Much of the previous work in confidence estimation for information extraction comes from the active learning literature. For example, Scheffer, Decomain, & Wrobel (2001) derive confidence estimates using hidden Markov models in an information extraction system; however, they do not estimate the confidence of entire fields, only singleton tokens. Their token confidence is estimated by the difference between the probabilities of its first and second most likely labels, whereas our Constrained Forward-Backward (Culotta & McCallum 2004) considers multi-token fields and the full distribution of all suboptimal paths. Scheffer, Decomain, & Wrobel also explore an idea similar to Constrained Forward-Backward to perform Baum-Welch training with partially labelled data, wherein a limited number of labels provide constraints. However, these constraints are again for singleton tokens only. Constrained Viterbi has been used previously in bioinformatics to find sub-optimal alignments of RNA sequences (Zuker 1991).

Conclusion and Future Work

We have introduced a new system for assisting users when entering database records from unstructured data. We exploit CRFs to pre-populate the database fields and allow natural user interaction where the system takes into account any corrections by the user. This is done by correction-propagation using the Constrained Viterbi algorithm in CRFs. Note that correction-propagation can be applied to any relational model. By calculating confidence estimates and highlighting low-confidence field assignments, we help the user spot any incorrect fields. We have shown that both of these methods are quite useful. Using the system, the Expected Number of User Actions per record has been dramatically reduced, from 6.31 for manual entry to 0.63, a more than 10-fold reduction.

Acknowledgments

We would like to thank David Parkinson for providing labelling and parsing tools and valuable discussions. This work was sup-

References

Bennett, P. N. 2000. Assessing the calibration of naive Bayes' posterior estimates. Technical Report CMU-CS-00-155, Computer Science Department, School of Computer Science, Carnegie Mellon University.

Borkar, V. R.; Deshmukh, K.; and Sarawagi, S. 2000. Automatically extracting structure from free text addresses. In Bulletin of the IEEE Computer Society Technical Committee on Data Engineering. IEEE.

Bouckaert, R. 2002. Low level information extraction: a … based approach. In Proc. TextML 2002.

Cardie, C., and Pierce, D. 1998. Proposal for an interactive environment for information extraction. Technical Report TR98-1702.

Caruana, R.; Hodor, P.; and Rosenberg, J. 2000. High precision information extraction. In KDD-2000 Workshop on Text Mining.

Culotta, A., and McCallum, A. 2004. Confidence estimation for information extraction. In Proceedings of the Human Language Technologies Conference (to appear).

Gandrabur, S., and Foster, G. 2003. Confidence estimation for text prediction. In Proceedings of the Conference on Natural Language Learning (CoNLL 2003).

Gunawardana, A.; Hon, H.; and Jiang, L. 1998. Word-based acoustic confidence measures for large-vocabulary speech recognition. In Proc. ICSLP-98, 791–794.

Lafferty, J.; McCallum, A.; and Pereira, F. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning, 282–289. San Francisco, CA: Morgan Kaufmann.

McCallum, A., and Li, W. 2003. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Hearst, M., and Ostendorf, M., eds., HLT-NAACL. Edmonton, Alberta, Canada: Association for Computational Linguistics.

McCallum, A. 2003. Efficiently inducing features of conditional random fields. In Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI03).

Miller, R. C., and Myers, B. A. 2001. Outlier finding: focusing user attention on possible errors. In UIST, 81–90.

Pinto, D.; McCallum, A.; Wei, X.; and Croft, W. B. 2003. Table extraction using conditional random fields. In Proceedings of ACM SIGIR.

Rabiner, L. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2):257–286.

Scheffer, T.; Decomain, C.; and Wrobel, S. 2001. Active hidden Markov models for information extraction.
In Ad- ported in part by the Center for Intelligent Information Retrieval, vances in Intelligent Data Analysis, 4th International Con- the Central Intelligence Agency, the National Security Agency, ference, IDA 2001. the National Science Foundation under NSF grant #IIS-0326249, Sha, F., and Pereira, F. 2003. Shallow parsing with con- SPAWARSYSCEN-SD grant number N66001-02-1-890, and by ditional random fi elds. In Hearst, M., and Ostendorf, M., the Defense Advanced Research Projects Agency, through the De- eds., HLT-NAACL: Main Proceedings, 213–220. Edmon- partment of the Interior, NBC, Acquisition Services Division, un- ton, Alberta, Canada: Association for Computational Lin- der contract #NBCHD030010. guistics. Swayne, D. F.; Cook, D.; and Buja, A. 1991. Xgobi: Inter- References active dynamic data visualization in the x window system. Proceedings of the ASA Section on Statistical Graphics. Becker, R. A., and Cleveland, W. S. 1987. Brushing scat- terplots. Technometrics 29:127–142. Zuker, M. 1991. Suboptimal sequence alignment in molec- ular biology: Alignment with error analysis. Journal of Bennett, P. N. 2000. Assessing the calibration of naive Molecular Biology 221:403–420. bayes’ posterior estimates. Technical Report CMU-CS-00-
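The correction propagation evaluated in the user-interaction experiments is, at its core, Viterbi decoding with user-supplied label constraints. The following is a minimal sketch of that idea, not the authors' implementation: a real CCRF uses learned, feature-based log-potentials, whereas the label set, tokens, and scores below are illustrative toys chosen to reproduce the "Charles Stanley" example from the introduction.

```python
def constrained_viterbi(tokens, labels, emit, trans, constraints=None):
    """Highest-scoring label sequence for a linear-chain model.

    emit[t][y]  : score of label y at position t
    trans[p][y] : score of moving from label p to label y
    constraints : dict mapping position -> required label name
                  (the user's verified or corrected fields)
    """
    constraints = constraints or {}
    n, L = len(tokens), len(labels)
    NEG = float("-inf")

    def allowed(t, y):
        return t not in constraints or constraints[t] == labels[y]

    score = [[NEG] * L for _ in range(n)]  # best path score ending in y at t
    back = [[0] * L for _ in range(n)]     # backpointers for path recovery
    for y in range(L):
        if allowed(0, y):
            score[0][y] = emit[0][y]
    for t in range(1, n):
        for y in range(L):
            if not allowed(t, y):
                continue  # prune paths inconsistent with user corrections
            prev = max(range(L), key=lambda p: score[t - 1][p] + trans[p][y])
            score[t][y] = score[t - 1][prev] + trans[prev][y] + emit[t][y]
            back[t][y] = prev
    y = max(range(L), key=lambda yy: score[n - 1][yy])
    path = [y]
    for t in range(n - 1, 0, -1):
        y = back[t][y]
        path.append(y)
    return [labels[y] for y in reversed(path)]


# Toy two-field example (scores are illustrative, not learned):
labels = ["FIRSTNAME", "LASTNAME"]
tokens = ["Charles", "Stanley"]
emit = [[0.6, 0.4],    # "Charles" looks slightly more like a first name
        [0.5, 0.5]]    # "Stanley" is ambiguous
trans = [[-1e9, 0.0],  # a two-token name has one first and one last name
         [0.0, -1e9]]

print(constrained_viterbi(tokens, labels, emit, trans))
# Unconstrained decoding: ['FIRSTNAME', 'LASTNAME']
print(constrained_viterbi(tokens, labels, emit, trans, {0: "LASTNAME"}))
# One correction propagates: ['LASTNAME', 'FIRSTNAME']
```

Because the transition scores couple adjacent labels, forcing position 0 to LASTNAME automatically re-labels position 1, which is exactly the correction-propagation behaviour the system exploits.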
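The Constrained Forward-Backward confidence estimate discussed in the experiments and related work can be sketched the same way: the confidence of an extracted field is the probability mass of all label paths that agree with that field, i.e. the ratio of a constrained partition function to the full one. This is a self-contained toy sketch under the same illustrative scores as the name example, not the authors' code; `field_confidence` and its inputs are hypothetical names.

```python
import math

def logsumexp(xs):
    m = max(xs)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))

def field_confidence(emit, trans, start, field_label_ids):
    """Confidence of one extracted field via constrained forward recursion.

    Returns P(field) = Z_constrained / Z, where Z sums over all label
    paths and Z_constrained sums only over paths whose labels starting
    at position `start` equal `field_label_ids`.
    """
    n, L = len(emit), len(emit[0])

    def log_z(constraint):
        # forward recursion; `constraint` maps position -> required label id
        alpha = [emit[0][y] if constraint.get(0, y) == y else float("-inf")
                 for y in range(L)]
        for t in range(1, n):
            alpha = [logsumexp([alpha[p] + trans[p][y] for p in range(L)])
                     + emit[t][y]
                     if constraint.get(t, y) == y else float("-inf")
                     for y in range(L)]
        return logsumexp(alpha)

    cons = {start + i: lab for i, lab in enumerate(field_label_ids)}
    return math.exp(log_z(cons) - log_z({}))


# Toy scores for ["Charles", "Stanley"] with labels [FIRSTNAME, LASTNAME]:
emit = [[0.6, 0.4], [0.5, 0.5]]
trans = [[-1e9, 0.0], [0.0, -1e9]]

# Confidence that token 0 ("Charles") is a FIRSTNAME (label id 0):
conf = field_confidence(emit, trans, 0, [0])
```

Low-confidence fields identified this way are the ones highlighted for the user; in the toy model above, the two readings of "Charles" split the probability mass almost evenly, so the field would be flagged for verification.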