Word Sense Disambiguation and Text Segmentation Based on Lexical Cohesion

OKUMURA Manabu, HONDA Takeo
School of Information Science, Japan Advanced Institute of Science and Technology
(Tatsunokuchi, Ishikawa 923-12 Japan)
e-mail: {oku,honda}@jaist.ac.jp

Abstract

In this paper, we describe how word sense ambiguity can be resolved with the aid of lexical cohesion. Lexical cohesion is far easier to identify than reference, because both words in a lexical cohesion relation appear in a text, while one word in a reference relation is a pro-form.

Lexical chains also provide a clue for the determination of segment boundaries of the text[10].

1 Introduction

A text is not a mere set of unrelated sentences. Rather, sentences in a text are about the same thing and connected to each other[10]. Cohesion and coherence are said to contribute to such connection of the sentences. While coherence is a semantic relationship and needs computationally expensive processing for identification, cohesion is a surface relationship among words in a text and more accessible than coherence. Cohesion is roughly classified into reference, conjunction, and lexical cohesion. Except conjunction, which explicitly indicates the relationship between sentences, the other two classes are considered to be similar in that the relationship between sentences is indicated by semantically related words.

In this paper, we first describe how word sense ambiguity can be resolved with the aid of lexical cohesion. During the process of generating lexical chains incrementally, they are recorded in a register in the order of their salience. The salience of lexical chains is based on their recency and length. Since the more salient lexical chain represents the nearby local context, by checking lexical cohesion between the current word and the lexical chains in the order of the salience, in tandem with the generation of lexical chains, we realize incremental word sense disambiguation based on the contextual information that lexical chains reveal. Next, we describe how segment boundaries of a text can be determined with the aid of lexical chains.

Morris and Hirst[10] pointed out the above two important uses of lexical cohesion for discourse analysis and presented a way of computing lexical chains by using Roget's International Thesaurus[15]. However, in spite of their mention of its importance, they did not present a way of word sense disambiguation based on lexical cohesion, and they only showed the correspondences between lexical chains and segment boundaries by their intuitive analysis.

McRoy's work[8] can be considered as one that uses the information of lexical cohesion for word sense disambiguation, but her method does not take into account the necessity of arranging lexical chains dynamically. Moreover, her word sense disambiguation method based on lexical cohesion is not evaluated fully.

In section two we outline what lexical cohesion is. In section three we explain the way of incremental generation of lexical chains in tandem with word sense disambiguation and describe the result of the evaluation of our disambiguation method. In section four we explain the measure of the plausibility of segment boundaries and describe the result of the evaluation of our measure.

2 Lexical Cohesion

Consider the following example, which is the English translation of a fragment of one of the Japanese texts that we use for the experiment later.

    In the universe that continues expanding, a number of stars have
    appeared and disappeared again and again. And about ten billion
    years after the birth of the universe, in the same way as the
    other stars, a primitive galaxy was formed with the primitive
    sun as the center.

The words {universe, star, universe, star, galaxy, sun} seem to be semantically same or related to each other, and they are included in the same category in Roget's International Thesaurus. Like Morris and Hirst, we compute such sequences of related words (lexical chains) by using a thesaurus as the knowledge base, to take into account not only the repetition of the same word but the use of superordinates, subordinates, and synonyms.

We use a Japanese thesaurus, 'Bunrui-goihyo'[1]. Bunrui-goihyo has a similar organization to Roget's: it consists of 798 categories and has a hierarchical structure above this level. For each word, a list of category numbers which corresponds to its multiple word senses is given. We count a sequence of words which are included in the same category as a lexical chain. It might be clear that this task is computationally trivial. Note that we regard only a sequence of words in the same category as a lexical chain, rather than using the complete Morris and Hirst framework with five types of thesaural relations.

The word sense of a word can be determined in its context. For example, in the context {universe, star, universe, star, galaxy, sun}, the word 'earth' has a 'planet' sense, not a 'ground' one. As is clear from this example, lexical chains can be used as a contextual aid to resolve word sense ambiguity[10]. In the generation process of lexical chains, by choosing the lexical chain that the current word is added to, its word sense is determined. Thus, we regard word sense disambiguation as selecting the most likely category number of the thesaurus, similarly to [16].

Earlier we proposed an incremental disambiguation method that uses intrasentential information, such as selectional restrictions and case frames[12]. In the next section, we describe an incremental disambiguation method that uses lexical chains as intersentential (contextual) information.

3 Generation of Lexical Chains

In the last section, we showed that lexical chains can play the role of local context. However, multiple lexical chains might cooccur in portions of a text, and they might vary in their plausibility as local context. For this reason, for lexical chains to function truly as local context, it is necessary to arrange them in the order of a salience that indicates the degree of their plausibility. We base the salience on the following two factors: recency and length. The more recently updated chains are considered to be the more activated context in the neighborhood and are given more salience. The longer chains are considered to be more about the topic in the neighborhood and are given more salience.

By checking lexical cohesion between the current word and the lexical chains in the order of the salience, the lexical chain that is selected to add the current word to determines its word sense and plays the role of local context. Based on this idea, incremental generation of lexical chains realizes incremental word sense disambiguation using the contextual information that lexical chains reveal. During the generation of lexical chains, their salience is also incrementally updated. We think incremental disambiguation[9] is a better strategy, because a combinatorial explosion of the number of total ambiguities might occur if ambiguity is not resolved as early as possible during the analytical process. Moreover, incremental word sense disambiguation is indispensable during the generation of lexical chains if lexical chains are used for incremental analysis, because word sense ambiguity might cause many undesirable lexical chains, and they might degrade the performance of the analysis (in this case, the disambiguation itself).

3.1 The Algorithm

First of all, a Japanese text is automatically segmented into a sequence of words by morphological analysis[11]. From the result of the morphological analysis, candidate words to include in lexical chains are selected. We consider only nouns, verbs, and adjectives, with some exceptions such as nouns in adverbial use and verbs in postpositional use.

Next, lexical chains are formed. Lexical cohesion among candidate words inside a sentence is first checked by using the thesaurus. Here the word sense of the current word might be determined. This preference for lexical cohesion inside a sentence over the intersentential one reflects our observation that the former might be tighter.

After the analysis inside a sentence, candidate words are tried against the lexical chains recorded in the register, in the order of the above salience. The first chain with which the current word has a lexical cohesion relation is selected. The salience of the selected lexical chain gets higher, and then the arrangement in the register is updated. Here not only is the word sense ambiguity of the current word resolved, but the word senses of the ambiguous words in the selected lexical chain can also be determined. Because the lexical chain gets higher salience, the other word senses of the ambiguous words in that chain, which correspond to other lexical chains, can be rejected. Therefore, lexical chains can be used not only as prior context but also as later context for word sense disambiguation. If a candidate word can not be added to an existing lexical chain, new lexical chains for each of its word senses are recorded in the register.

As is clear from the algorithm, rather than the truly incremental method where the register of lexical chains is updated word by word in a sentence, we adopt an incremental method where updates are performed at the end of each sentence, because we regard intrasentential information as more important.

The process of word sense disambiguation using lexical chains is illustrated in Figure 1. The most salient lexical chain is located at the top in the register. In the initial state the word W1 remains ambiguous. When the current unambiguous word W2 is added, the chain b is selected (top-left). The chain b becomes the most salient (top-right). Here the word sense ambiguity of the word W1 in the chain b is resolved (bottom-left). If the word to be added is ambiguous (W3), the word sense corresponding to the more salient lexical chain (ID31) is selected (bottom-right).

3.2 The Evaluation

We apply the algorithm to five texts. Table 1 shows the system's performance. The 'correctness' of the disambiguation is judged by one of the authors. The system's performance is computed as the quotient of the number of correctly disambiguated words by the number of ambiguous words minus the number of wrongly segmented words (morphological analysis errors)³.

Words that remain ambiguous are those that do not form any lexical chains with other words. Except for the errors in the morphological analysis, most of the errors in the disambiguation are caused by being dragged into the wrong context. The average performance is 63.4%. We think the system's performance is promising for the following reasons:

1. Lexical cohesion is not the only knowledge source for word sense disambiguation, and it proves to be useful at least as a source supplementary to our earlier framework that used case frames[12].

2. In fact, higher performance is reported in [16], which uses broader context acquired by training on large corpora, but our method can attain such a tolerable level of performance without any training.

³The accuracy of the morphological analysis will be improved by adding new word entries or the like.
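The register-based algorithm of section 3.1 can be sketched in code. The following is a simplified reading rather than the authors' implementation: the register is updated word by word instead of at sentence ends, the intrasentential preference is omitted, and the toy thesaurus (words and category numbers) is invented for illustration.

```python
# Toy thesaurus: each word maps to its candidate category numbers,
# one per word sense.  Entries and numbers are invented;
# Bunrui-goihyo itself has 798 categories.
THESAURUS = {
    "universe": [52],
    "star":     [52],
    "galaxy":   [52],
    "sun":      [52],
    "earth":    [52, 31],   # ambiguous: 'planet' vs. 'ground' sense
    "ground":   [31],
}

class Chain:
    """A lexical chain: word positions sharing one thesaurus category."""
    def __init__(self, category):
        self.category = category
        self.members = []

def disambiguate(sentences):
    register = []            # chains, most salient first
    senses = {}              # word position -> resolved category
    pos = 0
    for sentence in sentences:
        for word in sentence:
            pos += 1
            cats = THESAURUS.get(word, [])
            chosen = None
            for chain in register:          # checked in salience order
                if chain.category in cats:
                    chosen = chain
                    break
            if chosen is not None:
                chosen.members.append(pos)
                # The reinforced chain resolves the current word and
                # any still-ambiguous earlier member (cf. Figure 1).
                for p in chosen.members:
                    senses[p] = chosen.category
                register.remove(chosen)
                register.insert(0, chosen)  # recency: move to the top
            else:
                # No cohesive chain: open one new chain per sense.
                for c in cats:
                    chain = Chain(c)
                    chain.members.append(pos)
                    register.append(chain)
                if len(cats) == 1:
                    senses[pos] = cats[0]
    return senses

# 'earth' (position 3) is resolved to the 'cosmos'-side category 52
# by the salient chain built from 'universe' and 'star'.
senses = disambiguate([["universe", "star"], ["earth", "ground"]])
```

Moving a reinforced chain to the top of the register implements the recency factor; the length factor would additionally order chains of equal recency and is left out here for brevity.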

[Figure 1 shows four snapshots of the chain register and the list of ambiguous words: chains a-d, with W1[ID11, ID12] ambiguous; the unambiguous W2[ID2] is added to chain b (top-left); chain b becomes the most salient (top-right); W1 in chain b is resolved to ID11 (bottom-left); the ambiguous W3[ID31, ID32] is resolved to the sense of the more salient chain (bottom-right).]

Figure 1: The process of word sense disambiguation
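The register transition of Figure 1 can be replayed with a minimal data structure. The dict layout and the function name below are our assumptions; the chain names and sense IDs follow the figure.

```python
# Chains a-d as in the top-left panel of Figure 1: the ambiguous
# word W1 sits in one chain per candidate sense.
register = [
    {"name": "a", "words": []},
    {"name": "b", "words": [("W1", "ID11")]},
    {"name": "c", "words": [("W1", "ID12")]},
    {"name": "d", "words": []},
]
ambiguous = {"W1": ["ID11", "ID12"]}   # the list of ambiguous words

def reinforce(register, ambiguous, chain_name, word, sense):
    """Add an unambiguous word to a chain: the chain rises to the
    top of the register, every ambiguous word in it is resolved to
    its sense in this chain, and that word's copies in the
    competing chains are rejected."""
    chain = next(c for c in register if c["name"] == chain_name)
    chain["words"].append((word, sense))
    for w, _s in list(chain["words"]):
        if w in ambiguous:
            del ambiguous[w]
            for other in register:
                if other is not chain:
                    other["words"] = [e for e in other["words"] if e[0] != w]
    register.remove(chain)
    register.insert(0, chain)

# W2[ID2] joins chain b: b becomes most salient and W1 -> ID11.
reinforce(register, ambiguous, "b", "W2", "ID2")
```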

text   number of   number of   number of   words that   correctly       system's
       sentences   candidate   ambiguous   remain       disambiguated   performance
                   words       words       ambiguous    words           (%)
No.1       41         481         166          --            126            87.5
No.2       26         197          71          13             32            51.6
No.3       24         212          57          12             34            64.2
No.4       38         433         123          19             71            60.1
No.5       24         163          82          11             42            53.8

Table 1: The performance for the disambiguation
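As a sanity check, the 63.4% average quoted in the text can be recomputed from Table 1, and, since the per-text morphological-error counts are not printed, the denominator that each reported percentage implies can be recovered from the correct-word counts (variable names are ours):

```python
# Columns of Table 1 used here: ambiguous words, correctly
# disambiguated words, and the reported performance (%).
table1 = {
    "No.1": (166, 126, 87.5),
    "No.2": (71,  32,  51.6),
    "No.3": (57,  34,  64.2),
    "No.4": (123, 71,  60.1),
    "No.5": (82,  42,  53.8),
}

# Macro-average of the per-text percentages.
average = sum(p for _a, _c, p in table1.values()) / len(table1)

# performance = correct / (ambiguous - morphological errors), so the
# implied denominator is round(correct * 100 / performance), e.g. 144
# of the 166 ambiguous words of text No.1 were correctly segmented.
implied_denominator = {
    t: round(c * 100 / p) for t, (_a, c, p) in table1.items()
}
```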

However, our salience of lexical chains is, of course, rather naive and must be refined by using other kinds of information, such as the Japanese topical marker 'wa'.

4 Text Segmentation by Lexical Chains

[Figure 2 shows the lexical chains computed for a text as spans over sentence positions 1-24: (1-24), (4-13), (14-18), (8-9), (14-18).]

Figure 2: Lexical chains in the text

The second importance of lexical chains is that they provide a clue for the determination of segment boundaries. Certain spans of sentences in a text form semantic units and are usually called segments. It is crucial to identify the segment boundaries as a first step to construct the structure of a text[2].

4.1 The Measure for Segment Boundaries

4.2 The Evaluation

We try to segment the texts of section 3.2, applying the above measure to the lexical chains that were formed. We pick out three texts (No.3, 4, 5), which are from exam questions of the Japanese language that ask us to partition the texts into a given number of segments. The system's performance is judged by comparison with the segment boundaries marked in an attached model answer. Two more texts (No.6, 7) are also used.

text   given        correct
       boundaries   boundaries
No.3       1            1
No.4       6            3
No.5       1            0
No.6       4            3
No.7       3            1

Table 2: The performance for the segmentation (1)

text   generated    correct      recall   precision
       boundaries   boundaries
No.3       3            1         1.00      0.33
No.4      10            3         0.50      0.30
No.5      --            0         0.00      0.00
No.6       7            3         0.75      0.43
No.7       5            1         0.33      0.20

Table 3: The performance for the segmentation (2)

One of the two additional texts is difficult to segment in itself, because it is written by one of the most difficult writers in Japan, KOBAYASHI Hideo. Table 2 shows that our system gets 8 (1+3+0+3+1) / 15 (1+6+1+4+3) = 53% in the test. From Table 3, the average recall and precision rates are 0.52 and 0.25 respectively. Of course these results are unsatisfactory, but we think this measure for segment boundaries is promising and useful as a preliminary one.

Moreover, this evaluation method is not necessarily adequate, since partitioning into a larger number of smaller segments might be possible and be necessary for the given texts. And so we will have to consider an evaluation method in which the agreement with human subjects is tested in future. However, since human subjects do not always agree with each other on segmentation[6, 4, 14], our evaluation method using the texts in the questions with model answers is considered to be a good simplification.

Several other methods for text segmentation have been proposed. Kozima[7] and Youmans[17] proposed statistical measures (named LCP and VMP respectively), which indicate the plausibility of text points as a segment boundary. Their hills or valleys tend to indicate segment boundaries. However, they only showed the correlation between their measures and segment boundaries by their intuitive analysis of a few sample texts, and so we cannot compare our system's and their performance precisely.

Hearst[5] independently proposes a similar measure for text segmentation and evaluates the performance of her method with precision and recall rates. However, her segmentation method depends heavily on the information of paragraph boundaries and always partitions a text at the points of paragraph boundaries.

Since lexical chains are considered to be different in their degree of contribution to segment boundaries, we are now refining the measure by taking into account their importance. We base the importance of lexical chains on the following two factors:

1. The lexical chains that include more words with the topical marker 'wa' get more importance.

2. The longer lexical chains tend to represent a semantic unit and get more importance.

The start and end points of the more important lexical chains can get the more boundary strength. This refinement of the measure is in process and yields a certain extent of improvement of the system's performance.

5 Conclusion

We showed that lexical cohesion can be used as a knowledge source for word sense disambiguation and text segmentation. We think our method is promising, although only partially successful results have been obtained in the experiments so far. Here we reported some preliminary positive results and made some suggestions for how to improve the method in future. The improvement of the method is now under way. In addition, because the computation of lexical chains depends completely on the thesaurus used, we think the comparison among the results by different thesauri would be insightful and are now planning it. It is also necessary to incorporate other textual information, such as clue words, which can be computationally accessible, to improve the performance.
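Under our reading of the refined measure, the start and end points of chains contribute boundary strength weighted by chain importance. The sketch below approximates importance by chain length alone ('wa'-marking is omitted because the word counts are not given), uses the five chain spans of Figure 2, and recomputes the recall/precision averages of Table 3 as a check; all names are ours.

```python
# The five chains of Figure 2 as (start, end) sentence spans.
chains = [(1, 24), (4, 13), (14, 18), (8, 9), (14, 18)]

def boundary_strength(chains, n_sentences):
    """Strength of the gap between sentence i and i+1: each chain
    start or end adjacent to the gap votes with the chain's
    importance, here approximated by its length alone."""
    strength = {gap: 0.0 for gap in range(1, n_sentences)}
    for start, end in chains:
        weight = end - start + 1
        if start - 1 in strength:   # a chain starts right after the gap
            strength[start - 1] += weight
        if end in strength:         # a chain ends right before the gap
            strength[end] += weight
    return strength

strength = boundary_strength(chains, 24)
# One chain ends and two begin between sentences 13 and 14, so
# that gap collects the most strength.
strongest = max(strength, key=strength.get)

# Recomputing the averages of Table 3: recall = correct / given,
# precision = correct / generated (No.5 contributes 0 to both).
rows = {   # text: (given, generated, correct)
    "No.3": (1, 3, 1),
    "No.4": (6, 10, 3),
    "No.6": (4, 7, 3),
    "No.7": (3, 5, 1),
}
recalls = [c / g for g, _gen, c in rows.values()] + [0.0]
precisions = [c / gen for _g, gen, c in rows.values()] + [0.0]
avg_recall = sum(recalls) / 5
avg_precision = sum(precisions) / 5
```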

References

[1] Bunrui-Goihyo. Shuei Shuppan, 1964. In Japanese.

[2] B.J. Grosz and C.L. Sidner. Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3):175-204, 1986.

[3] M.A.K. Halliday and R. Hasan. Cohesion in English. Longman, 1976.

[4] M.A. Hearst. TextTiling: A quantitative approach to discourse segmentation. Technical Report 93/24, University of California, Berkeley, 1993.

[5] M.A. Hearst. Multi-paragraph segmentation of expository texts. Technical Report 94/790, University of California, Berkeley, 1994.

[6] J. Hirschberg and B. Grosz. Intonational features of local and global discourse structure. In Proc. of the DARPA Workshop on Speech and Natural Language, pages 441-446, 1992.

[7] H. Kozima. Text segmentation based on similarity between words. In Proc. of the 31st Annual Meeting of the Association for Computational Linguistics, pages 286-288, 1993.

[8] S.W. McRoy. Using multiple knowledge sources for word sense discrimination. Computational Linguistics, 18(1):1-30, 1992.

[9] C.S. Mellish. Computer Interpretation of Natural Language Descriptions. Ellis Horwood, 1985.

[10] J. Morris and G. Hirst. Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics, 17(1):21-48, 1991.

[11] Nagao Lab., Kyoto University. Japanese Morphological Analysis System JUMAN Manual Version 1.0, 1993. In Japanese.

[12] M. Okumura and H. Tanaka. Towards incremental disambiguation with a generalized discrimination network. In Proc. of the 8th National Conference on Artificial Intelligence, pages 990-995, 1990.

[13] T. Ookuma. Gengo tan'i toshite no bunshou. Nihongo gaku, 11(4):20-25, 1992. In Japanese.

[14] R.J. Passonneau. Intention-based segmentation: Human reliability and correlation with linguistic cues. In Proc. of the 31st Annual Meeting of the Association for Computational Linguistics, pages 148-155, 1993.

[15] P. Roget. Roget's International Thesaurus, Fourth Edition. Harper and Row Publishers Inc., 1977.

[16] D. Yarowsky. Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. In Proc. of the 14th International Conference on Computational Linguistics, pages 454-460, 1992.

[17] G. Youmans. A new tool for discourse analysis: The vocabulary-management profile. Language, 67:763-789, 1991.
