<<

PATTERN MATCHING IN THE TEXTRACT EXTRACTION SYSTEM

Tsuyoshi Kitani t Yoshio ]:',riguchi t* Masami Ilara)* Center for Ma.chine Translation Ca.rnegie Mellon University Pittsburgh, PA 15213

Abstract est by locating speciIic expressions defined a.s key words and phrasal patterns obtained by In information extraction systems, pattern coq)us analysis. marchers are widely used to identi~q infof This paper describes a pattern matching mat|on of interest in a scntcncc. In this method that first identifies in a seu- paper, pattern matching in the Tt:XTRACT tence and then links critical pieces of informa- information extraction system is described. tion that map to a p~ttern. The first step in It comprises a conccpt search which |dent|- pattern ln~tching is a searvh applied tics key words representing a concept, and a in the TI';XTRACT system of the TIPSTER template pattern search which identifies pat- Japanese microelectronics and corporate .joint terns of words and phrases. TI'JXTI~,A(;T ventures domains{aacobs 93a], [aacobs 93b]. using thc matcher performed wcll in the In this step, key words representing a concept :I'IPSTER/MUC-5 evahtation. Thc pattern are searched for within a sentence. The second matching architecture is also suitable ]br rapid step is a. template pattern sea~rh applied in the system development across different domains TEXTRACT joint ventures system. A com- of the same . plex pattern to be searched for usually con- sists of a few words and phrases, inste~d of just one word, as in the concept search. The tem- 1 INTRODUCTION plate pattern search recognizes relationships between matched objects in the defined pat- In information extraction systems, finite- tern a.s well a.s recognizing the. concept itself. state pattern matchers are becoming popular l,'rom the viewpoints of system perfof as a means of identifying individu~d pieces of mance and portalfility across domains, the information in a sentence. Pattern matching TIPS'I.'I~;II/MUC-5 evaluatioll J'esults suggest systems for English texts are reported to be that pattern nta.tching described in this paper suits,hie tor achieving a high level of per[br- is all appropriate architecture for information mance with less effort, compared to full pars- extraction from ,lapanese texts. ing architectures[Hobbs et al. 92]. Among seventeen systems presented in the Fifth Message Understanding Conference (MUC-5), 2 TIPSTER/MUC-5 OVERVIEW three systems used a pattern marcher a.s the main component for identifying patterns to be The goal of the TI1)STER/MUC-5 project extracted [MUC-5 93]. A pattern matching ar- sponsored by ARPA is to capture informa- chitecture is appropriate for information ex- tion of interest from English and .lal)aJmse traction fi-om texts in narrow domains since newspal)er artMes about microelectronics and identifying informatkm does not necessarily re- corpora.re joint ventures. 1 A system must quire full understa.nding of the text. The pat- fill a. generic template with information taken tern matcher can extract information of inter- 1Several Al{PA-sponsored sites f¢)rmed tile TIP- tVisiting rese;trcher fron/ NTT Data Communica- ST]'21/informal|on extraclion project. "['he TIPSTER tions Systems Corp., email: tkitani~.rd.nttdM.a.jp sites and other non-sponsored organizations partici- HNTT Data Communiclttions Systems Corp. pated in MUC-5.

1064 fronl the text its. a fully automated fa.sh- ion. The template is composed of several objects, each containing severM slots. Slots ...... Ra'te m ...... I may have pointers as va.lues, where pointers Pre- I....~ :::::::::::::::::::::::i ,::::.:::'~:": :r~:: :: ::::::::::: link related ot)jects. Extracted information processing i :search:: :: i is expected to be stored in an -oriented morphological - concept - concept l database [TIPSTER 92]. analysis identification ldentificqtion I • name - information l In the microelectronics domain, information recognition merging in a I about four specific processes in seiniconduc- sentence / tot manufacturing for microchip fabrication is captured. They are layering, lithography, Discourse~ Template etching, aaM packaging processes. ],ayering, processingI [ generation lithography, and etching a.re wafer fa.brical:ion Lt-information . output processes; packaging is part of tile last stage of merging generation manufacturing. Entities such as manufactu rer, In :t text distributor, and user, in addition to detailed manufacturing information such a.s materials Fig. 1: Architecture of the q'EXTI{.A(~T joint used and the microchip specifications such as ventures sysl, eln wafer size and device speed are also extra.cted in each process. The joint ventures domain focuses on ex tracting entities, i.e. organizations, forming or dissolving joint venture relationshil)s. The information to l)e extracted includes entity in- are. identified I)y the name recognition module. formation such as location, na.tionality, per- Tim segments are g,'Oul)e(l into units which sonnel, and facilities, and joint venture infor- a.re meaningfill in (.It(; l)attern ma.tching pro- mation such as rela.tionshii)s, 1)usiness a.ctivi- cess[Kitani and Mitamura 94]. Most strings ties, capital, and estimated revenue of the joint to be extracted dire.ctly Dora the text at'(.' iden- ventttre. tiffed by MAJESTY and the name recognizer in the l)reprocessor.

3 TEXTRACT ARCHITECTURE The con(;ept search and template pattern search rood u les both identi['31 concepts in a set,- TEXTI/ACT is an informati(m extraction tence. The template pattern sear¢:h also rec- system developed as an optiona.l system of ognizes relationshil)s within the identified in- the GE-CMU SHOGUN system [Jacobs 93a], f'ornuttion in the matched pattern. Details of [aacobs93b]. it processes the TIPSTFI{. the l)attern matching process are described in Japanese domains of microelectronics and col the next section. porate joint ventures. The :I'I';XTllA(VI~ mi- croelectronics system comprises three major The discourse processor links information components: prel)rocessing ~ conceltt search, identified a.t different stages o[" processing. and template generatiol|. In ad(lition to (:on. l"irst, implicit subjects, often use<[ in Japanese celtt search, the "FI!;XrI'IIACT joint ventures sentences, are inherited fronl previous sen- system perfbrms a templ~te pattern search. [t tences, and set'oil(l, company ltatlles are givell is also equipped with a discourse processor, as iltliqlte ltunlbers necessary to accurately rec- shown in Fig. 1. ogMze company relationships throughout tile In the preprocessor, Japanese text is seg- text[Kitani 94]. Concepts identified during mented into primitive: words tagged with their tile pattern matching process are used to se- t)arts of speech by a Japanese segmentor lect an approt)ria.te string and filler' to go into ~ called MAJESTY[Kitani and Mitamura 93], slot. ]?inally, l.he template generation pro(:ess [Kitani 91]. Then, proper norms, along with assembles the extracted information necessary monetary, nulneric, and temporal expressions to creat(.~ the OUtl)nl; descril)ed in Secl, iou 2.

1065 4 PATTERN MATCHING IN ,,> .3.1) :~ :/ <,, tells the matcher that it re- TEXTRACT quires an exact word matching against a word in the text. 4.1 Concept search 4.2 Template pattern search Key words representing the same concept are grouped into a list and used to recognize 4.2.1 Template pattern matcher the concept in a sentence. The list is written in a simple format: (concept-name wordl word2 The teml>late pattern matcher identifies ...). For example, key words tbr recognizing a typical expressions to be extracted from the dissolved joint vent u re con cept can be written text that frequently aplmar in the corpus. The in the following way: patterns are defined as pa.ttern matching rules using regular expressions. (DISSOLVED ~-~j~-~ ~-.~: ~) The pattern matcher is a ffnite-state au- or tomaton sinfilar to the pattern recognizer use.d (DISSOLVED dissolve terminate cancel). in the MUC-4 FASTUS system developed at The concept search module recognizes the con- SRI [I[obbs et al. 92]. /n TEXTRACT, state cept when a word in the fist exists in tile sen- transitions arc driven by segmented words or tence. Using such a simple word list some- grouped units fi'om the prei)rocessor. The tilnes generates an incorrect concept. For ex- matcher identifies all possible patterns of in- ample, a dissolved concept is erroneously iden- terest in the. text that match defined l>atterns. tiffed fl'om an expression "cancel a hotel reser- It must ignore unnecessary words in the pat- vation". IIowever, when processing text in a. tern to perform successfifl pattern matching narrow

1066 a bles in which there is at least one com- (JointVenturel 6 pany llaDle~ @CNAME_PARTNER_SUBJ ~ [ ~:strict:P , sele(:t l)atterns tha.t COllSili[le tilt fewest @CNAME_PARTNERWITH input seglnents (the shortest string :strict:P match), and @SKIP ~.-~:loose:VS) • select patterns that include the lnost (a) A matching pattern for Japanese llliiiil)er of variables and defined words.

(JointVenturel 3 Another important feature of the pa.t:tern @CNAME_PARTNER_SUBJ Iilat(:hor is tha,t rules can be groupe(1 accord- create::V ilig to their COilCel)t. A rule lla.iile "JohltVeli- a joint venture:NP turel" iii Fig. 2, for example, represents a with::P concel)t "JointVelitllre'. Ushlg this group- @CNAME_PARTNER_WITH) big, the best nlatched pattern can be selected (b) A matching pattern for English fl'on-i nlatched patterns of a particular concept group instea.d of choosing from all the matched patterns. This feature enables the discourse and template g(meration processes to look at Fig. 2: A re;etching pattern for (a) ,laf)anese the, best infortnation necessary whet, tilling in a,n(l (b) English a particular slot.

The ill-st field in a i)atteril is the pattern 5 EXAMPLE OF THE INFORMA- nalile followed by the patterll iiuliiber. The tia,ttern nunlber is used to deride whether or TION EXTRACTION PROCESS llOt a, search within a, given strhlg is IleCess.>t£y, This sectkm describes how concepts ~tnd To assure ell]ciency with the pal, l;ern marcher, patterns identifh;d by tile matcher are used for the fiehl designated by the lllunber sliould in- tenll)late filling. Concepts are often useful to clude tlie leant frequent word in the entire pat- fill in the "set fill" ( fi'om a. given set) terll (,,~t~,, for aa,l)anese and :'a joint velil, urc" sh)ts. An entity type slot, for examl)le , ha,s for English in this case). four giw, i : COMPANY, I)EI{SON, 4.2.3 Pattern selection GOVERNM/i;NT, ~tnd OTIIt{;R. The matcher assigns concepts related to each entity type ex- Approxiiila.tely 150 pa.tterns were used to cept ()TflEIL Thus, from the given set, the extract various concepts in the Japanese joint output generator chooses an entity type corre- ventures domain. Several patterns usually sponding to the identified concept. There axe tllatch a. single sentence. Moreover, siuce pal;- ca.ses when discourse processing is necessa.ry terns are often searched using case ma.rkers to link identified concepts and patterns. The slic]i as "~", "7~<", and " ~ ", which frequently f.ollowing text;: "X Inc. created a joint w;ntm'e apt)ea.r ill .]al)a.nese texts, even a sitigie l)a.t - with Y Corp. last yea.r. X announced yester- 1;eri/ Call l[latch the Sellt,ence ill n]oi'e thrill Olle day that it. terminated the venture." in used w{I,y whell severa, I of" tile same ca,se lliarkers ex- to describe the extraction process illustrated ist ill a sentence. However, since the template in Fig. 3. gtmerator aCCellts only the best lnat(;he(I pal;- In the preprocessing, two company na.tnes in tern~ ehoosilig a corre(:tly nla,tehed i)atl,eril is the first sentence "X hic." and "Y Corp." are imlJortant. The selection in (lone by applying identified either I)y MAJESTY or the name three heuristic rules in the following or

7067 (Dissolvedl 2 "X Inc. created a "X announced OCNAMEPARTNER_SUBJ joint venture with yesterday that it dissolve I terminate l cancel::V Y Corp. last year."i terminated the venture." @skip Pre- company1: venture::N "X Inc." "X" I rocesslng company2: company3: @CNAME_PARTNER_WITH). Ip "Y Corp." :::::::::::::::::::::: Then, discourse processing must check if com+

D,ssoLveD" panies identified in this pattern are the same as the current joint venture comi)anies in order to recognize their dissolved relationship.

IVENTURE* &LCorp" 6 OVERALL SYSTEM PERFOR- Discourse "X Inc."(companyl=company3) MANCE process ng } JOINT-VENTURE* y DISSOLVED* A total of 250 newspaper articles, 100 "Y Corp." * concept name about Japanese mi('roelectronics and 150 about Japanese corl)orate joint ventures Fig. 3: FxamI)le of the inforntatiou extraction were provided by ARI)A fl)r nse in the process Tll)SrI'FA/./MU(L5 system evalua.tion. Five microelectronics and six joint ventures sys- tems were presented in the Japanese, system the company name "X" is also identitied by eva.luation a.t MUC-5. 4 Scoring was done in the preprocessor, a Next, the concept "DIS- a, semi-atttomatie nta,nner, rl'he scoring pro- SOLVEi)" is recognized by the key word ter- gram automatically compared the system out- minate in the concept search. (The key word put with answer tetnl)lates created by hu- list is shown in Section 4.11.) After sentence- mat~ analysts, then, when a Mman decision level processing, discourse processing recog- was necessa.ry, analysts instructed the scof nizes that "X" in the second sentence is a ref- ing progratu whether tlt(,, two strings in cora- erence to "X Inc." found in the first sentence. l)arisen were completely matched, pa.rtially Thus, the "DISSOLVI';I)" concept is joined to matched, or unumtched. Finally, it calculated the, joint venture relationship between "X inc." an overall score combined from all the news- and "Y Corp.". in this way, TI!iXTRACT rec- paper article scores. Although various eval- ognizes that the two companies dissolved the nat;ion tnetrics were tneasured in the evalua- ,joint venture. tiou [C[lillchor and Sun(lheim 93], only the fol- SupI)ose that the second sentence is re- lowing error and reeall-l)recision metrics are place(t with another sentence: "Shortly after, discussed in this pa.per. The ha.sic scoring cat- X terminated a contract to supply rice to Z egories use(l are: correct (CO R), partially cor- Corp.". Although it does not mentiot~ the rect (PAR), itlcorrect (INC), ntissing (MIS), dissolved relationship nor anything a hottt "Y and spurious (SPIJ), counted as the tmml)er Corp.", the system incorrectly recognizes the of i)ieces of inl'orma.tion in the system output dissolved joint ventttre rela.tionship between eompa.red to tile possil)le (answer) informa- "X In('." and "V Corp." due to the tion. of the word terminate. When this undesir- (l) 1';1'1;o1'metrics able matching is often seen, more complicated template patterns must be used instead of tile * l';rror 1)el. resl)onse fill (I~l{II): simple word list. A dissolved concept, lbr ex- ample, could he identified using the fo][owing wrong _ INC+ ILdR/2 + ]I'IIS + HPU template pattern: lotal COR+ t)AR+ INC+ MLS'+,5'PU

aWhen it is an unknown word to tim prepro(x~ssor, 4These numbers include TI;;X'FRACT, aA| optional the discourse processor idcnti[ies it IM.er. sb'stmn of (;I'LCMU SIIO(;UN.

7068 TaJ)h, 1: Scores of "rI';XTItACT and two other l.Ol>ra.nking of[iciaJ sysl,e,l,S iit '.I'II>STHL/M U(7-5

'I'I"XTRACT (,i Mi,;) [System A (JMI';) _.58 :it) 38 i4 60 53 n System B (JME) s. C ,Io us ISO.4 I 'rI,;x:rRACT (J Jr) System A (J,IV) System 1} (,],IV) _(J:l .... 51_ .... '2:t ..... IC_!_'l 67_L(irk ....,52 1_ TEXTI{A(/r's scores sul)inil.l.ed to MU(: 5 were unollicial. J M t';: J a.liiHl(+,,S(~ Ill icroelectr()lfics (IoIIt ai It JJV: ,/a.p,+,.lleSe ('o,'l)Ol'al;(~ johi t. v0litil ,'eS ([outai II

,,, Ulldergeneration (lIND): from the '['[F'STEI{/M U(]+5 syst,:utl cwduatiotl re,quits [M U(,-+>' : 93]. u I I,X" IRA' C, I perforn+eIc~ ;tcros:-; dilreretlt domains (>[' the s;,.me possible laliguage. The TI:,XTRAC/r nii,::r(+,+lectrolii(:s system was dew~loped in only three weeks by .. Precision (I'I[E): one person by simply replacing joint venture C01¢, -F I'A R/2 coilce+l)t,~ and key words witll representative acutetal Ilti('rof!l(~(~|.l'Oli](~s COlIC(':])l;.q ~,.,1(1 key words.

+ l)~l[ l:'-mea.slu'e (l)&l{,):

'2 * IU')C * l' tUq 7 CON(JL(JSION IU']C I ]'l~l: lit the ,];'q)alle,qe lilicro(qeC:tl'OlliCB ai]d cor- '.l'he error per l'eSl)oltse lill (I';RI{.) was the of- porate .](>itll, vent, u res rankiNg oNi- ~(;cotidaJ'y e,va.hia.tiolt ,heLlits we, re tlllfi0rgOll- cial s,ystems ;it, the 'I'IILqTER/M U(]-5 system eration (UNI)), overget,eratio,t (OV(',), ;,.td eva]ua,t.ioN. Alt, hough l)erl'oruna,nce of F,.a,IJ,(;rll S,I I)sl;]l;lll;ioll (SU II). The recall, l,l'eciSiOll, ~+tll(1 lua.tching~ tnust be ewdua.ted, Ihe high I)erfor- I:'-Iilea.sti re l]iel+rics we,'e used a.~; u nofli(:ia.I lnet. ma.nce or TEXTI~A(/r suggests that the paL- tics for MUC-5. tern n.atcher worked well in e×tra,cting infof Ta, ble I shows scores of TI:,X'['I{A(:T and ma,tion froul t]ie l, ext;. ']'tie p:ra.nkilig oflicia.l .'-;y ,'-;l, elil+q r, l.akeu ("'I'I'~X'I'IIA(/I"s scores submitie(I to MU(~-5 were s'I+EXTIIA(YI' processed only Japane>~e text, u,official. [I was scored ollicially M'ler the confelX~llCe, wherea.s the two other sysl,enis l~rocessed I)olh I']nglish The official scores showed slight dilfcrences from ulmf D,II([ Jit|)a.nese text. ficiaJ onus.

1069 has not been tested to other than References Japanese. It is expected to work to other lan- guages with some minor modifications given [Chinchor and Sundheim 93] Chinchor, N. that the input is segmented into primitive and Sundheim, B. (1993). MUC-5 Evalua- words tagged with their parts of speech. tion Metrics. Notebook of the Fifth Message The TEXTRACT Japanese microelectron- Undersianding Cor@renee (MUC-5). ics system was developed in only three weeks [IIobbs et al. 92] lIobbs, J., Appelt, 1)., et al. by one person. In spite of its simplicity, it (1992). FASTUS: A System for Extracting showed the high performance. This result hfformation from Natural-Language Text. also suggests that the pattern matching archi- SRI International, Technical Note No. 519. tecture is highly portable across similar do- mains of the same la.nguage, thus facilitating [Jacobs 93a] Jacobs, rapid system development. Developing and P. (1993). TIPSTER/SIIOGUN 18-Month maintaining TEXTRACT's pattern matching Progress Report. Notebook of the TIPSTER based architecture is easier and less complex 18-Month Meeting. than that of a full parsing system, as experi- enced in the early stage of StIOGUN system [Ja.cobs 931)] Jacobs, P. (1993). GE-CMU: l)e- development [aacobs 93b]. scription of the Shogun System Used for MUC-5. Notebook of the Fifth Message Un- Corpus analysis took about half of the de- derstanding Conference (M U6'-5). velopment , since only a KWIC (Key Word In Context) list and a word fi'equency [Kitani 91] Kitani, T. (:199:l). An ()CR tool were used to acquire the concept-word Post-processing Method for Handwritten lists and the template patterns. Using good Japanese Documents. In proceedings of Nat- statistical corpus analysis tools will shorten ural Language Processing PaciJic Rim ,gym- the development time and promise a high per- posium, pp. 38-45. formance. The tools should not only collect patterns of interest with context, but also give [Kitani and Mitamura 93] Kitani, T. and Mi- statistical data to show how well deft ned pat- tamura, T. (1993). A Japanese Preproces- terns are working when they are applied in the sor for Syntactic and Scmalltic Parsing. In system. proceedings of Ninth 11¢1¢E Conferencc on At MUC-5 meeting, P&R F-measure of one Artificial for Applications, pp. of the top-ranking systems was claimed to be 86-92. close to the human perfornlance [Jacobs 93b]. r [Kitani and Mitamnra 94] Kitani, T. and Mi- To match the system performance of a pattern tam ura, T. (199d). A n Accurate Morpholog- matching system to human performallce, the ical Analysis and Proper Name hlentifica- preprocessor must recognize expressions to be tion for Japanese Text Processing. Journal extracted at nearly 100% accuracy given that of Information Processing Society of Japan, other components simply merge information Vol. 35, No. 3, 1)P. 404 - 413. and generate ontput. [Kitani 94] Kitani, T. (1994). Merging hffor- Acknowledgements marion by I)iscourse Processing for lnfof mation Extraction. In proceedings of 7~nth The authors wish to express their apprecia- IEIH¢ Conference on Artificial Intclligence for Applications, pp. 4112-418. tion to aaime Carbonell, who provided the op- portunity to pursue this research a.t the Cen- [MUC-5 93] (:1993). System ter for Machine Translation, Carnegie Mellon Descriptions. Notebook of thc l"iflh Mess'age University. Thanks are also due to 3bxu ko Mi- Undcrstandin.g Cbnfcrcnec (AIUC-5). tamura and Michael Mauldin for their many helpful suggestions. [TIt)STER 92] (1992). Joint Ventnre Tem- plate Fill Rules. Plenary Session Notebook r'l'he human performance was estirmtted to be recall of the TII%5'TER, 12-Month Mcctin 9. and precision of about seventy to eighty.

1070