Word Sense Disambiguation and Text Segmentation Based on Lexical Cohesion

OKUMURA Manabu, HONDA Takeo
School of Information Science, Japan Advanced Institute of Science and Technology
(Tatsunokuchi, Ishikawa 923-12 Japan)
e-mail: {oku,honda}@jaist.ac.jp

Abstract

In this paper, we describe how word sense ambiguity can be resolved with the aid of lexical cohesion. Lexical cohesion is far easier to identify than reference, because both words in a lexical cohesion relation appear in a text, while one word in a reference relation is a pro-form.

Lexical chains also provide a clue for the determination of segment boundaries of the text[10].

1 Introduction

A text is not a mere set of unrelated sentences. Rather, sentences in a text are about the same thing and connected to each other[10]. Cohesion and coherence are said to contribute to such connection of the sentences. While coherence is a semantic relationship and needs computationally expensive processing for identification, cohesion is a surface relationship among words in a text and more accessible than coherence. Cohesion is roughly classified into reference, conjunction, and lexical cohesion. Except conjunction, which explicitly indicates the relationship between sentences, the other two classes are considered to be similar in that the relationship between sentences is indicated by semantically related words.

In this paper, we first describe how word sense ambiguity can be resolved with the aid of lexical cohesion. During the process of generating lexical chains incrementally, they are recorded in a register in the order of their salience. The salience of lexical chains is based on their recency and length. Since the more salient lexical chain represents the nearby local context, by checking lexical cohesion between the current word and the lexical chains in the order of the salience, in tandem with the generation of lexical chains, we realize incremental word sense disambiguation based on the contextual information that lexical chains reveal. Next, we describe how segment boundaries of a text can be determined with the aid of lexical chains.

Morris and Hirst[10] pointed out the above two important uses of lexical cohesion for discourse analysis and presented a way of computing lexical chains by using Roget's International Thesaurus[15]. However, in spite of their mention of its importance, they did not present a way of word sense disambiguation based on lexical cohesion, and they only showed the correspondences between lexical chains and segment boundaries by their intuitive analysis.

McRoy's work[8] can be considered as one that uses the information of lexical cohesion for word sense disambiguation, but her method does not take into account the necessity of arranging lexical chains dynamically. Moreover, her word sense disambiguation method based on lexical cohesion is not evaluated fully.

In section two we outline what lexical cohesion is. In section three we explain the way of incremental generation of lexical chains in tandem with word sense disambiguation and describe the result of the evaluation of our disambiguation method. In section four we explain the measure of the plausibility of segment boundaries and describe the result of the evaluation of our measure.

2 Lexical Cohesion

Consider the following example, which is the English translation of a fragment of one of the Japanese texts that we use for the experiment later.

    In the universe that continues expanding, a number of stars have
    appeared and disappeared again and again. And about ten billion
    years after the birth of the universe, in the same way as the
    other stars, a primitive galaxy was formed with the primitive
    sun as the center.

The words {universe, star, universe, star, galaxy, sun} seem to be semantically same or related to each other, and they are included in the same category in Roget's International Thesaurus. Like Morris and Hirst, we compute such sequences of related words (lexical chains) by using a thesaurus as the knowledge base, to take into account not only the repetition of the same word but the use of superordinates, subordinates, and synonyms.

We use a Japanese thesaurus, 'Bunrui-goihyo'[1]. Bunrui-goihyo has a similar organization to Roget's: it consists of 798 categories and has a hierarchical structure above this level. For each word, a list of category numbers which corresponds to its multiple word senses is given. We count a sequence of words which are included in the same category as a lexical chain. It might be clear that this task is computationally trivial. Note that we regard only a sequence of words in the same category as a lexical chain, rather than using the complete Morris and Hirst framework with five types of thesaural relations.

The word sense of a word can be determined in its context. For example, in the context {universe, star, universe, star, galaxy, sun}, the word 'earth' has a 'planet' sense, not a 'ground' one. As is clear from this example, lexical chains can be used as a contextual aid to resolve word sense ambiguity[10]. In the generation process of lexical chains, by choosing the lexical chain that the current word is added to, its word sense is determined. Thus, we regard word sense disambiguation as selecting the most likely category number of the thesaurus, similarly to [16].

Earlier we proposed an incremental disambiguation method that uses intrasentential information, such as selectional restrictions and case frames[12]. In the next section, we describe an incremental disambiguation method that uses lexical chains as intersentential (contextual) information.

3 Generation of Lexical Chains

In the last section, we showed that lexical chains can play the role of local context. However, multiple lexical chains might cooccur in portions of a text, and they might vary in their plausibility as local context. For this reason, for lexical chains to function truly as local context, it is necessary to arrange them in the order of a salience that indicates the degree of their plausibility. We base the salience on the following two factors: recency and length. The more recently updated chains are considered to be the more activated context in the neighborhood and are given more salience. The longer chains are considered to be more about the topic in the neighborhood and are given more salience.

By checking lexical cohesion between the current word and the lexical chains in the order of the salience, the lexical chain that is selected to add the current word to determines its word sense and plays the role of local context. Based on this idea, incremental generation of lexical chains realizes incremental word sense disambiguation using the contextual information that lexical chains reveal. During the generation of lexical chains, their salience is also incrementally updated. We think incremental disambiguation[9] is a better strategy, because a combinatorial explosion of the number of total ambiguities might occur if ambiguity is not resolved as early as possible during the analytical process. Moreover, incremental word sense disambiguation is indispensable during the generation of lexical chains if lexical chains are used for incremental analysis, because word sense ambiguity might cause many undesirable lexical chains, and they might degrade the performance of the analysis (in this case, the disambiguation itself).

3.1 The Algorithm

First of all, a Japanese text is automatically segmented into a sequence of words by morphological analysis[11]. From the result of the morphological analysis, candidate words to include in lexical chains are selected. We consider only nouns, verbs, and adjectives, with some exceptions such as nouns in adverbial use and verbs in postpositional use.

Next, lexical chains are formed. Lexical cohesion among candidate words inside a sentence is first checked by using the thesaurus. Here the word sense of the current word might be determined. This preference for lexical cohesion inside a sentence over the intersentential one reflects our observation that the former might be tighter.

After the analysis inside a sentence, candidate words are tried against the lexical chains recorded in the register, in the order of the above salience. The first chain with which the current word has a lexical cohesion relation is selected. The salience of the selected lexical chain gets higher, and then the arrangement in the register is updated. Here not only is the word sense ambiguity of the current word resolved, but the word senses of the ambiguous words in the selected lexical chain can also be determined. Because the lexical chain gets higher salience, the other word senses of the ambiguous words in that chain, which correspond to other lexical chains, can be rejected. Therefore, lexical chains can be used not only as prior context but also as later context for word sense disambiguation. If a candidate word can not be added to an existing lexical chain, new lexical chains for each of its word senses are recorded in the register.

As is clear from the algorithm, rather than the truly incremental method where the register of lexical chains is updated word by word in a sentence, we adopt an incremental method where updates are performed at the end of each sentence, because we regard intrasentential information as more important.

The process of word sense disambiguation using lexical chains is illustrated in Figure 1. The most salient lexical chain is located at the top in the register. In the initial state the word W1 remains ambiguous. When the current unambiguous word W2 is added, the chain b is selected (top-left). The chain b becomes the most salient (top-right). Here the word sense ambiguity of the word W1 in the chain b is resolved (bottom-left). If the word to be added is ambiguous (W3), the word sense corresponding to the more salient lexical chain (ID31) is selected (bottom-right).

3.2 The Evaluation

We apply the algorithm to five texts. Table 1 shows the system's performance. The 'correctness' of the disambiguation is judged by one of the authors. The system's performance is computed as the quotient of the number of correctly disambiguated words by the number of ambiguous words minus the number of wrongly segmented words (morphological analysis errors)³.

Words that remain ambiguous are those that do not form any lexical chains with other words. Except for the errors in the morphological analysis, most of the errors in the disambiguation are caused by being dragged into the wrong context. The average performance is 63.4%. We think the system's performance is promising for the following reasons:

1. Lexical cohesion is not the only knowledge source for word sense disambiguation, and it proves to be useful at least as a source supplementary to our earlier framework that used case frames[12].

2. In fact, higher performance is reported in [16], which uses broader context acquired by training on large corpora, but our method can attain such a tolerable level of performance without any training.

³The accuracy of the morphological analysis will be improved by adding new word entries or the like.
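The register-based algorithm of section 3.1 can be sketched in code. The following is a simplified reading rather than the authors' implementation: the register is updated word by word instead of at sentence ends, the intrasentential preference is omitted, and the toy thesaurus (words and category numbers) is invented for illustration.

```python
# Toy thesaurus: each word maps to its candidate category numbers,
# one per word sense.  Entries and numbers are invented;
# Bunrui-goihyo itself has 798 categories.
THESAURUS = {
    "universe": [52],
    "star":     [52],
    "galaxy":   [52],
    "sun":      [52],
    "earth":    [52, 31],   # ambiguous: 'planet' vs. 'ground' sense
    "ground":   [31],
}

class Chain:
    """A lexical chain: word positions sharing one thesaurus category."""
    def __init__(self, category):
        self.category = category
        self.members = []

def disambiguate(sentences):
    register = []            # chains, most salient first
    senses = {}              # word position -> resolved category
    pos = 0
    for sentence in sentences:
        for word in sentence:
            pos += 1
            cats = THESAURUS.get(word, [])
            chosen = None
            for chain in register:          # checked in salience order
                if chain.category in cats:
                    chosen = chain
                    break
            if chosen is not None:
                chosen.members.append(pos)
                # The reinforced chain resolves the current word and
                # any still-ambiguous earlier member (cf. Figure 1).
                for p in chosen.members:
                    senses[p] = chosen.category
                register.remove(chosen)
                register.insert(0, chosen)  # recency: move to the top
            else:
                # No cohesive chain: open one new chain per sense.
                for c in cats:
                    chain = Chain(c)
                    chain.members.append(pos)
                    register.append(chain)
                if len(cats) == 1:
                    senses[pos] = cats[0]
    return senses

# 'earth' (position 3) is resolved to the 'cosmos'-side category 52
# by the salient chain built from 'universe' and 'star'.
senses = disambiguate([["universe", "star"], ["earth", "ground"]])
```

Moving a reinforced chain to the top of the register implements the recency factor; the length factor would additionally order chains of equal recency and is left out here for brevity.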

[Figure 1 shows four snapshots of the chain register and the list of ambiguous words: chains a-d, with W1[ID11, ID12] ambiguous; the unambiguous W2[ID2] is added to chain b (top-left); chain b becomes the most salient (top-right); W1 in chain b is resolved to ID11 (bottom-left); the ambiguous W3[ID31, ID32] is resolved to the sense of the more salient chain (bottom-right).]

Figure 1: The process of word sense disambiguation
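The register transition of Figure 1 can be replayed with a minimal data structure. The dict layout and the function name below are our assumptions; the chain names and sense IDs follow the figure.

```python
# Chains a-d as in the top-left panel of Figure 1: the ambiguous
# word W1 sits in one chain per candidate sense.
register = [
    {"name": "a", "words": []},
    {"name": "b", "words": [("W1", "ID11")]},
    {"name": "c", "words": [("W1", "ID12")]},
    {"name": "d", "words": []},
]
ambiguous = {"W1": ["ID11", "ID12"]}   # the list of ambiguous words

def reinforce(register, ambiguous, chain_name, word, sense):
    """Add an unambiguous word to a chain: the chain rises to the
    top of the register, every ambiguous word in it is resolved to
    its sense in this chain, and that word's copies in the
    competing chains are rejected."""
    chain = next(c for c in register if c["name"] == chain_name)
    chain["words"].append((word, sense))
    for w, _s in list(chain["words"]):
        if w in ambiguous:
            del ambiguous[w]
            for other in register:
                if other is not chain:
                    other["words"] = [e for e in other["words"] if e[0] != w]
    register.remove(chain)
    register.insert(0, chain)

# W2[ID2] joins chain b: b becomes most salient and W1 -> ID11.
reinforce(register, ambiguous, "b", "W2", "ID2")
```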

text   number of   number of   number of   words that   correctly       system's
       sentences   candidate   ambiguous   remain       disambiguated   performance
                   words       words       ambiguous    words           (%)
No.1       41         481         166          --            126            87.5
No.2       26         197          71          13             32            51.6
No.3       24         212          57          12             34            64.2
No.4       38         433         123          19             71            60.1
No.5       24         163          82          11             42            53.8

Table 1: The performance for the disambiguation
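As a sanity check, the 63.4% average quoted in the text can be recomputed from Table 1, and, since the per-text morphological-error counts are not printed, the denominator that each reported percentage implies can be recovered from the correct-word counts (variable names are ours):

```python
# Columns of Table 1 used here: ambiguous words, correctly
# disambiguated words, and the reported performance (%).
table1 = {
    "No.1": (166, 126, 87.5),
    "No.2": (71,  32,  51.6),
    "No.3": (57,  34,  64.2),
    "No.4": (123, 71,  60.1),
    "No.5": (82,  42,  53.8),
}

# Macro-average of the per-text percentages.
average = sum(p for _a, _c, p in table1.values()) / len(table1)

# performance = correct / (ambiguous - morphological errors), so the
# implied denominator is round(correct * 100 / performance), e.g. 144
# of the 166 ambiguous words of text No.1 were correctly segmented.
implied_denominator = {
    t: round(c * 100 / p) for t, (_a, c, p) in table1.items()
}
```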

However, our salience of lexical chains is, of course, rather naive and must be refined by using other kinds of information, such as the Japanese topical marker 'wa'.

4 Text Segmentation by Lexical Chains

[Figure 2 shows the lexical chains computed for a text as spans over sentence positions 1-24: (1-24), (4-13), (14-18), (8-9), (14-18).]

Figure 2: Lexical chains in the text

The second importance of lexical chains is that they provide a clue for the determination of segment boundaries. Certain spans of sentences in a text form semantic units and are usually called segments. It is crucial to identify the segment boundaries as a first step to construct the structure of a text[2].

4.1 The Measure for Segment Boundaries

4.2 The Evaluation

We try to segment the texts of section 3.2, applying the above measure to the lexical chains that were formed. We pick out three texts (No.3, 4, 5), which are from exam questions of the Japanese language that ask us to partition the texts into a given number of segments. The system's performance is judged by comparison with the segment boundaries marked in an attached model answer. Two more texts (No.6, 7) are also used.

text   given        correct
       boundaries   boundaries
No.3       1            1
No.4       6            3
No.5       1            0
No.6       4            3
No.7       3            1

Table 2: The performance for the segmentation (1)

text   generated    correct      recall   precision
       boundaries   boundaries
No.3       3            1         1.00      0.33
No.4      10            3         0.50      0.30
No.5      --            0         0.00      0.00
No.6       7            3         0.75      0.43
No.7       5            1         0.33      0.20

Table 3: The performance for the segmentation (2)

One of the two additional texts is difficult to segment in itself, because it is written by one of the most difficult writers in Japan, KOBAYASHI Hideo. Table 2 shows that our system gets 8 (1+3+0+3+1) / 15 (1+6+1+4+3) = 53% in the test. From Table 3, the average recall and precision rates are 0.52 and 0.25 respectively. Of course these results are unsatisfactory, but we think this measure for segment boundaries is promising and useful as a preliminary one.

Moreover, this evaluation method is not necessarily adequate, since partitioning into a larger number of smaller segments might be possible and be necessary for the given texts. And so we will have to consider an evaluation method in which the agreement with human subjects is tested in future. However, since human subjects do not always agree with each other on segmentation[6, 4, 14], our evaluation method using the texts in the questions with model answers is considered to be a good simplification.

Several other methods for text segmentation have been proposed. Kozima[7] and Youmans[17] proposed statistical measures (named LCP and VMP respectively), which indicate the plausibility of text points as a segment boundary. Their hills or valleys tend to indicate segment boundaries. However, they only showed the correlation between their measures and segment boundaries by their intuitive analysis of a few sample texts, and so we cannot compare our system's and their performance precisely.

Hearst[5] independently proposes a similar measure for text segmentation and evaluates the performance of her method with precision and recall rates. However, her segmentation method depends heavily on the information of paragraph boundaries and always partitions a text at the points of paragraph boundaries.

Since lexical chains are considered to be different in their degree of contribution to segment boundaries, we are now refining the measure by taking into account their importance. We base the importance of lexical chains on the following two factors:

1. The lexical chains that include more words with the topical marker 'wa' get more importance.

2. The longer lexical chains tend to represent a semantic unit and get more importance.

The start and end points of the more important lexical chains can get the more boundary strength. This refinement of the measure is in process and yields a certain extent of improvement of the system's performance.

5 Conclusion

We showed that lexical cohesion can be used as a knowledge source for word sense disambiguation and text segmentation. We think our method is promising, although only partially successful results have been obtained in the experiments so far. Here we reported some preliminary positive results and made some suggestions for how to improve the method in future. The improvement of the method is now under way. In addition, because the computation of lexical chains depends completely on the thesaurus used, we think the comparison among the results by different thesauri would be insightful and are now planning it. It is also necessary to incorporate other textual information, such as clue words, which can be computationally accessible, to improve the performance.
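Under our reading of the refined measure, the start and end points of chains contribute boundary strength weighted by chain importance. The sketch below approximates importance by chain length alone ('wa'-marking is omitted because the word counts are not given), uses the five chain spans of Figure 2, and recomputes the recall/precision averages of Table 3 as a check; all names are ours.

```python
# The five chains of Figure 2 as (start, end) sentence spans.
chains = [(1, 24), (4, 13), (14, 18), (8, 9), (14, 18)]

def boundary_strength(chains, n_sentences):
    """Strength of the gap between sentence i and i+1: each chain
    start or end adjacent to the gap votes with the chain's
    importance, here approximated by its length alone."""
    strength = {gap: 0.0 for gap in range(1, n_sentences)}
    for start, end in chains:
        weight = end - start + 1
        if start - 1 in strength:   # a chain starts right after the gap
            strength[start - 1] += weight
        if end in strength:         # a chain ends right before the gap
            strength[end] += weight
    return strength

strength = boundary_strength(chains, 24)
# One chain ends and two begin between sentences 13 and 14, so
# that gap collects the most strength.
strongest = max(strength, key=strength.get)

# Recomputing the averages of Table 3: recall = correct / given,
# precision = correct / generated (No.5 contributes 0 to both).
rows = {   # text: (given, generated, correct)
    "No.3": (1, 3, 1),
    "No.4": (6, 10, 3),
    "No.6": (4, 7, 3),
    "No.7": (3, 5, 1),
}
recalls = [c / g for g, _gen, c in rows.values()] + [0.0]
precisions = [c / gen for _g, gen, c in rows.values()] + [0.0]
avg_recall = sum(recalls) / 5
avg_precision = sum(precisions) / 5
```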

References

[1] Bunrui-Goihyo. Shuei Shuppan, 1964. In Japanese.

[2] B.J. Grosz and C.L. Sidner. Attention, intentions, and the structure of discourse. Computational Linguistics, 12(3):175-204, 1986.

[3] M.A.K. Halliday and R. Hasan. Cohesion in English. Longman, 1976.

[4] M.A. Hearst. TextTiling: A quantitative approach to discourse segmentation. Technical Report 93/24, University of California, Berkeley, 1993.

[5] M.A. Hearst. Multi-paragraph segmentation of expository texts. Technical Report 94/790, University of California, Berkeley, 1994.

[6] J. Hirschberg and B. Grosz. Intonational features of local and global discourse structure. In Proc. of the DARPA Workshop on Speech and Natural Language, pages 441-446, 1992.

[7] H. Kozima. Text segmentation based on similarity between words. In Proc. of the 31st Annual Meeting of the Association for Computational Linguistics, pages 286-288, 1993.

[8] S.W. McRoy. Using multiple knowledge sources for word sense discrimination. Computational Linguistics, 18(1):1-30, 1992.

[9] C.S. Mellish. Computer Interpretation of Natural Language Descriptions. Ellis Horwood, 1985.

[10] J. Morris and G. Hirst. Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics, 17(1):21-48, 1991.

[11] Nagao Lab., Kyoto University. Japanese Morphological Analysis System JUMAN Manual Version 1.0, 1993. In Japanese.

[12] M. Okumura and H. Tanaka. Towards incremental disambiguation with a generalized discrimination network. In Proc. of the 8th National Conference on Artificial Intelligence, pages 990-995, 1990.

[13] T. Ookuma. Gengo tan'i toshite no bunshou. Nihongo gaku, 11(4):20-25, 1992. In Japanese.

[14] R.J. Passonneau. Intention-based segmentation: Human reliability and correlation with linguistic cues. In Proc. of the 31st Annual Meeting of the Association for Computational Linguistics, pages 148-155, 1993.

[15] P. Roget. Roget's International Thesaurus, Fourth Edition. Harper and Row Publishers Inc., 1977.

[16] D. Yarowsky. Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. In Proc. of the 14th International Conference on Computational Linguistics, pages 454-460, 1992.

[17] G. Youmans. A new tool for discourse analysis: The vocabulary-management profile. Language, 67:763-789, 1991.
