
PATTERN MATCHING IN THE TEXTRACT INFORMATION EXTRACTION SYSTEM

Tsuyoshi Kitani† Yoshio Eriguchi‡ Masami Hara‡
Center for Machine Translation
Carnegie Mellon University
Pittsburgh, PA 15213

† Visiting researcher from NTT Data Communications Systems Corp., email: tkitani@rd.nttdata.jp
‡ NTT Data Communications Systems Corp.

Abstract

In information extraction systems, pattern matchers are widely used to identify information of interest in a sentence. In this paper, pattern matching in the TEXTRACT information extraction system is described. It comprises a concept search, which identifies key words representing a concept, and a template pattern search, which identifies patterns of words and phrases. TEXTRACT using the matcher performed well in the TIPSTER/MUC-5 evaluation. The pattern matching architecture is also suitable for rapid system development across different domains of the same language.

1 INTRODUCTION

In information extraction systems, finite-state pattern matchers are becoming popular as a means of identifying individual pieces of information in a sentence. Pattern matching systems for English texts are reported to be suitable for achieving a high level of performance with less effort, compared to full parsing architectures [Hobbs et al. 92]. Among seventeen systems presented in the Fifth Message Understanding Conference (MUC-5), three systems used a pattern matcher as the main component for identifying patterns to be extracted [MUC-5 93]. A pattern matching architecture is appropriate for information extraction from texts in narrow domains since identifying information does not necessarily require full understanding of the text. The pattern matcher can extract information of interest by locating specific expressions defined as key words and phrasal patterns obtained by corpus analysis.

This paper describes a pattern matching method that first identifies concepts in a sentence and then links critical pieces of information that map to a pattern. The first step in pattern matching is a concept search applied in the TEXTRACT system of the TIPSTER Japanese microelectronics and corporate joint ventures domains [Jacobs 93a], [Jacobs 93b]. In this step, key words representing a concept are searched for within a sentence. The second step is a template pattern search applied in the TEXTRACT joint ventures system. A complex pattern to be searched for usually consists of a few words and phrases, instead of just one word, as in the concept search. The template pattern search recognizes relationships between matched objects in the defined pattern as well as recognizing the concept itself.

From the viewpoints of system performance and portability across domains, the TIPSTER/MUC-5 evaluation results suggest that pattern matching described in this paper is an appropriate architecture for information extraction from Japanese texts.

2 TIPSTER/MUC-5 OVERVIEW

The goal of the TIPSTER/MUC-5 project sponsored by ARPA is to capture information of interest from English and Japanese newspaper articles about microelectronics and corporate joint ventures.¹ A system must fill a generic template with information taken from the text in a fully automated fashion. The template is composed of several objects, each containing several slots. Slots may have pointers as values, where pointers link related objects. Extracted information is expected to be stored in an object-oriented database [TIPSTER 92].

¹ Several ARPA-sponsored sites formed the TIPSTER information extraction project. The TIPSTER sites and other non-sponsored organizations participated in MUC-5.

In the microelectronics domain, information about four specific processes in semiconductor manufacturing for microchip fabrication is captured. They are layering, lithography, etching, and packaging processes. Layering, lithography, and etching are wafer fabrication processes; packaging is part of the last stage of manufacturing. Entities such as manufacturer, distributor, and user, in addition to detailed manufacturing information such as materials used and the microchip specifications such as wafer size and device speed, are also extracted in each process.

The joint ventures domain focuses on extracting entities, i.e. organizations, forming or dissolving joint venture relationships. The information to be extracted includes entity information such as location, nationality, personnel, and facilities, and joint venture information such as relationships, business activities, capital, and estimated revenue of the joint venture.

3 TEXTRACT ARCHITECTURE

TEXTRACT is an information extraction system developed as an optional system of the GE-CMU SHOGUN system [Jacobs 93a], [Jacobs 93b]. It processes the TIPSTER Japanese domains of microelectronics and corporate joint ventures. The TEXTRACT microelectronics system comprises three major components: preprocessing, concept search, and template generation. In addition to concept search, the TEXTRACT joint ventures system performs a template pattern search. It is also equipped with a discourse processor, as shown in Fig. 1.

[Fig. 1: Architecture of the TEXTRACT joint ventures system. Stages: preprocessing (morphological analysis, name recognition); concept search (concept identification); template pattern search (concept identification, information merging in a sentence); discourse processing (information merging in a text); template generation (output generation).]

In the preprocessor, Japanese text is segmented into primitive words tagged with their parts of speech by a Japanese segmentor called MAJESTY [Kitani and Mitamura 93], [Kitani 91]. Then, proper nouns, along with monetary, numeric, and temporal expressions, are identified by the name recognition module. The segments are grouped into units which are meaningful in the pattern matching process [Kitani and Mitamura 94]. Most strings to be extracted directly from the text are identified by MAJESTY and the name recognizer in the preprocessor.

The concept search and template pattern search modules both identify concepts in a sentence. The template pattern search also recognizes relationships within the identified information in the matched pattern. Details of the pattern matching process are described in the next section.

The discourse processor links information identified at different stages of processing. First, implicit subjects, often used in Japanese sentences, are inherited from previous sentences, and second, company names are given unique numbers necessary to accurately recognize company relationships throughout the text [Kitani 94]. Concepts identified during the pattern matching process are used to select an appropriate string and filler to go into a slot. Finally, the template generation process assembles the extracted information necessary to create the output described in Section 2.

4 PATTERN MATCHING IN TEXTRACT

4.1 Concept search

Key words representing the same concept are grouped into a list and used to recognize the concept in a sentence. The list is written in a simple format: (concept-name word1 word2 ...). For example, key words for recognizing a dissolved joint venture concept can be written in the following way:

(DISSOLVED [Japanese key words illegible in source])

or

(DISSOLVED dissolve terminate cancel).

The concept search module recognizes the concept when a word in the list exists in the sentence. Using such a simple word list sometimes generates an incorrect concept. For example, a dissolved concept is erroneously identified from an expression "cancel a hotel reservation". However, when processing text in a narrow domain, concepts are often identified correctly from the simple list, since key words are usually used in a particular meaning of interest in the domain.

During the Japanese segmentation process in the preprocessor, a key word in the text tends to be divided into a few separate words by MAJESTY when the word is not stored in the dictionary. For example, the compound noun "[illegible]" consists of two words, "[illegible]" (joint venture) and "[illegible]" (dissolve). It is segmented into the two individual nouns using the current MAJESTY dictionary. Thus, when the compound word "[illegible]" is searched for [...]

[...] tells the matcher that it requires an exact word matching against a word in the text.

4.2 Template pattern search

4.2.1 Template pattern matcher

The template pattern matcher identifies typical expressions to be extracted from the text that frequently appear in the corpus. The patterns are defined as pattern matching rules using regular expressions.

The pattern matcher is a finite-state automaton similar to the pattern recognizer used in the MUC-4 FASTUS system developed at SRI [Hobbs et al. 92]. In TEXTRACT, state transitions are driven by segmented words or grouped units from the preprocessor. The matcher identifies all possible patterns of interest in the text that match defined patterns. It must ignore unnecessary words in the pattern to perform successful pattern matching for various expressions.

4.2.2 Pattern matching rules

Fig. 2 shows a defined pattern in which an arbitrary string is represented as "@string" along with its corresponding English pattern.² Specifically, a variable starting with "@CNAME" is called the company-name variable, used where a company name is expected to appear. For example, "@CNAME_PARTNER_SUBJ" matches any string that likely includes at least one company name acting as a joint venture partner and functioning as a subject in the sentence.