
CATCHING THE CAT

Christer Johansson

Dept. of Linguistics, Lund University, Sweden
email: Christer.Johansson@ling.lu.se

ABSTRACT

Finding useful phrases is important in applications like information retrieval and text-to-speech systems. One of the currently most used statistics is the mutual information ratio. This paper compares the mutual information ratio with a measure that takes temporal ordering into account. Using this modified measure, some local syntactic constraints as well as phrases are captured.

INTRODUCTION

In Alice's Adventures in Wonderland by Lewis Carroll, many of Alice's friends have names that consist of two words, for example: the March Hare, the Mock Turtle, and the Cheshire Cat. The individual words in these combinations, if we ignore capitalisation, might be quite common.

Individual words usually mean different things when they are free. For example, in "The March against Apartheid" and "The March Hare", "march" means totally different things. There is obviously a strong link between "the" and "march", but the link between "march" and "hare" is definitely stronger, at least in Lewis Carroll's text.

The goal of this paper is to propose a statistic that measures the strength of such glue between words in a sampled text. Finding the names of Alice's friends can be done by searching for two adjacent words with initial capital letters.

One use of statistical associations could be to find translatable concepts and phrases that might be expressed with a different number of words in another language. Another possibly interesting use of statistical associations is to predict whether words constitute new or given information in speech. It has been proposed (e.g. Horne & Johansson, 1993) that the stress of words in speech is highly dependent on the informational content of the word. Also, statistical associations are not incompatible with the first stages of the "hypothesis space" proposed by Processability Theory (personal communication with Manfred Pienemann of Sydney University; see also Meisel & al., 1981).

There are different methods of calculating statistical associations. Yang & Chute (1992) showed that a linear least squares mapping of natural language to canonical terms is both feasible and a way of detecting synonyms. Their method does not seem to detect dependencies in the order of words, however. To do this we need a measure that is sensitive to the order between words. In this paper we will use a variant of mutual information that derives from Shannon's theory of information (as discussed in, e.g., Salton & McGill, 1983).
Definitions and assumptions

The definition of a word in a meaningful way is far from easy, but a working definition, for technical purposes, is to assume that a word equals a string of letters. These 'words' are separated by non-letters. The case of letters is ignored, i.e. converted into lower case. For example: "there's" is two 'words': "there" and "s".

A collocation consists of a word and the word that immediately follows. Index 1 will refer to the first word and 2 to the second word. Index 12 will refer to word 1 followed by word 2, and similarly for 21.

Another assumption is that natural language is more predictive in the (left-to-right) temporal order than in the reversed order. This is motivated by the simple observation that speech comes into the system through the ears serially.

For example, consider the French phrase "un bon vin blanc" (lit. "a good wine white"). "Bon" can (relatively often) be followed by "vin", but usually not "vin" by "bon". The same kind of link exists between "vin" and "blanc", but not between "blanc" and "vin". This linking affects the intonation of French phrases, and it may also be that intonation supports these kinds of links. Note that this is not an explanation of either intonation or syntax: we most likely have to consider massive interaction between different modalities of language.
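The working definitions above translate directly into a few lines of code. The sketch below is not taken from the paper; it is a minimal illustration, assuming Python and the hypothetical helper names tokenize and collocations, of how 'words' (maximal runs of letters, lower-cased) and adjacent-word collocations could be extracted.

    import re
    from collections import Counter

    def tokenize(text):
        """A 'word' is a maximal run of letters, converted to lower case;
        "there's" therefore yields the two 'words' "there" and "s"."""
        return re.findall(r"[a-z]+", text.lower())

    def collocations(words):
        """A collocation is a word together with the word that immediately follows."""
        return list(zip(words, words[1:]))

    # Example use on a small snippet of text.
    words = tokenize("The March Hare and the March against Apartheid, there's ...")
    pair_freq = Counter(collocations(words))   # frequency of each ordered pair
    word_freq = Counter(words)                 # frequency of each individual word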

Deriving the measure

The mutual information ratio, µ, provides a rough estimate of the glue between words. It measures, roughly, how much more common a collocation is in a text than can be accounted for by chance. This measure does not assume any ordering between the words making up a collocation, in the sense that the µ-measures of [w1...w2] and [w2...w1] are calculated as if they were unrelated collocations.

The mutual information ratio (in Steier & Belew, 1991) is expressed:

    \mu = \log_2 \frac{p([w_1 \ldots w_2])}{p(w_1)\, p(w_2)}

Formula 1: The mutual information ratio

where 'p' denotes the probability function; p([w1...w2]) is read as "the probability of finding word w2 after word w1".

Adjusting for order between words

We have experimented with the difference in mutual information, Δµ, between the two different orderings of two words making up a collocation. The results indicate that Δµ captures some of the local constraints in a sampled text. Δµ can be expressed:

    \Delta\mu = \log_2 \frac{p([w_1 \ldots w_2])}{p(w_1)\, p(w_2)} - \log_2 \frac{p([w_2 \ldots w_1])}{p(w_1)\, p(w_2)} = \log_2 \frac{F([w_1 \ldots w_2])}{F([w_2 \ldots w_1])}

Formula 2: The difference in mutual information between the two orderings

where F([wx...wy]) denotes the frequency with which wx and wy co-occur in that order in the sample, and F(wx) is the frequency of word wx. Note that the size of the sample cancels in this equation. Note also that this measure is not sensitive to the individual probabilities of the words.

A problem arises when there is no F([w2...w1]). In these cases we have chosen to arbitrarily set F([w2...w1]) to 0.1, with the justification that if the sample were ten times larger we might have found at least one such pair.
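As a concrete illustration (a sketch, not the paper's code), the two formulas can be computed from raw counts as follows, under the stated assumptions: probabilities are estimated as frequencies divided by the sample size N, and a missing reverse pair gets the frequency 0.1. The function names mu and delta_mu are hypothetical.

    import math

    def mu(f12, f1, f2, n):
        """Formula 1: mutual information ratio of a collocation [w1...w2].
        f12 = F([w1...w2]), f1 = F(w1), f2 = F(w2), n = sample size."""
        return math.log2((f12 / n) / ((f1 / n) * (f2 / n)))

    def delta_mu(f12, f21):
        """Formula 2: difference in mutual information between the two orderings;
        the sample size and the individual word frequencies cancel.
        A missing reverse pair is set to 0.1, as if a ten times larger sample
        had contained one occurrence."""
        f21 = f21 if f21 > 0 else 0.1
        return math.log2(f12 / f21)

    # Example: 'caterpillar->the' from Table 2.1, with F12 = 1 and F21 = 26,
    # gives delta_mu(1, 26) = -4.70.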

MATERIAL

The material is Alice's Adventures in Wonderland by Lewis Carroll, available in electronic format via email from the Gutenberg Project. The text contains 27332 words, of which 2576 are unique, making up a total of 14509 unique word pairs. Alice in Wonderland was chosen because it is a well-known text, it contains some phrases that we know are in there (e.g. the Mock Turtle), and it contains a sufficient number of words, and variations of words, to be interesting for the experiment. Studies could be done for other collections of texts, e.g. medical abstracts. As more documents become available, comparisons between documents can be done (Steier & Belew, 1991). This experiment only contains within-text comparisons of phrases for one specific text.

METHOD

For each of the unique words in the text the frequencies of all immediately following words were collected. In this text, no filtering of the text was performed. Some initial experiments were performed with a stoplist, to remove function words and some other common words (see Fox, 1992, for details). Some simple stemming was also tried, e.g. removing 's' and 'ed' from the end of words. Stemming may lead to difficulties in distinguishing compounds from noun-verb complexes. It is not clear if the pros of using stemming outweigh the cons; consequently we decided to work with the raw text. Stoplists and stemming might be more important when the ordinary µ-measure is used.

RESULTS

The collocations were ordered differently by the two measures. The µ was sensitive to individual frequencies, and favoured very low frequency collocations. The Δµ was sensitive to the ordering of the words, and favoured high frequency collocations that only occurred in one order. The quality of the different measures can be seen by comparing the top and last ten collocations in the tables. Tables 1.1 and 2.1 refer to Δµ, and Tables 1.2 and 2.2 refer to µ. The N column gives the rank number of the collocation. Note that the frequencies of the individual words, F1 and F2, are not used to compute Δµ; they are only provided for comparison with the µ-measure. Note also that the numerical values of the µ-measure and the Δµ-measure cannot be directly compared, since they measure slightly different phenomena.
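A minimal sketch (again not the author's code) of the procedure described above: collect, for every unique word, the frequencies of all immediately following words, then rank all observed collocations by µ and by Δµ. It reuses the hypothetical tokenize, mu and delta_mu helpers sketched earlier; the file name is a placeholder.

    from collections import Counter

    def rank_collocations(words):
        """Rank every observed collocation by mu and by delta-mu."""
        n = len(words)
        word_freq = Counter(words)
        pair_freq = Counter(zip(words, words[1:]))  # F([w1...w2]) for each ordered pair

        by_mu = sorted(
            pair_freq,
            key=lambda p: mu(pair_freq[p], word_freq[p[0]], word_freq[p[1]], n),
            reverse=True,
        )
        by_delta_mu = sorted(
            pair_freq,
            key=lambda p: delta_mu(pair_freq[p], pair_freq.get((p[1], p[0]), 0)),
            reverse=True,
        )
        return by_mu, by_delta_mu

    # Usage (placeholder path):
    # words = tokenize(open("alice.txt").read())
    # by_mu, by_delta_mu = rank_collocations(words)
    # by_delta_mu[:10] and by_delta_mu[-10:] correspond to Tables 1.1 and 2.1.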

Table 1.1: The top ten collocations by Δµ

 N   word pair   Δµ     F1    F2    F12   F21
 1   said->the   ?      462   1642  210   ?
 2   of->the     ?      ?     ?     ?     ?
 3   in->a       9.92   369   632   97    0
 4   and->the    9.70   ?     ?     ?     ?
 5   in->the     9.64   369   1642  80    0
 6   to->the     9.43   ?     ?     ?     ?
 7   don->t      9.25   ?     ?     ?     ?
 8   as->she     9.25   ?     ?     ?     ?
 9   a->little   9.20   632   128   ?     ?
 10  she->had    9.20   552   178   ?     ?

Δµ gives a measure of local links between the words. As can be seen from Table 1.1, Δµ captures local constraints: that prepositions are usually followed by a noun phrase, and that 'and' is usually used as a noun co-ordinator (indicated by the high value for 'and->the'). Mitjushin (1992) has proposed similar links on a higher syntactic level, using a rule-based approach. We have deliberately tried to avoid talking about word-classes, since it is misleading at this level of analysis. However, we get many examples of good representatives for word-classes that form collocations.

Table 1.2: The top ten collocations by µ

 N   word pair           µ     F1  F2  F12
 1   wooden->spades      14.7  1   1   1
 2   various->pretexts   14.7  1   1   1
 3   uncommonly->fat     14.7  1   1   1
 4   ?                   14.7  1   1   1
 5   tittered->audibly   14.7  1   1   1
 6   ?                   14.7  1   1   1
 7   tide->rises         14.7  1   1   1
 8   tart->custard       14.7  1   1   1
 9   steam->engine       14.7  1   1   1
 10  ?->dressed          14.7  1   1   1

The flavour of the collocations that µ rates highly is different. As can be seen from Table 1.2, low individual frequencies result in a high µ-value, even if the collocation is unique. This gives an illusion of a semantic relation, which is due to the fact that low frequency words are usually high in content. The µ-measure is useful when we are interested in the correlation between words within and between documents (Steier & Belew, 1991). This notion could be expanded upon to incorporate correlation between any two words in general, and it seems to work well for the µ-measure (Wettler & Rapp, 1989).

The last ten collocations are shown in Table 2.1. Δµ is sensitive to deviation from an expected ordering in the sample. The negative-valued link between these words makes a phrase boundary between the two words probable.

Table 2.1: The last ten collocations by Δµ

 N      word pair         Δµ     F1   F2    F12  F21
 14500  caterpillar->the  -4.70  28   1642  1    26
 14501  mouse->the        ?      ?    ?     ?    ?
 14502  s->it             ?      201  ?     ?    56
 14503  s->that           -5.09  201  315   1    34
 14504  dormouse->the     -5.13  40   1642  1    35
 14505  queen->the        -5.17  75   1642  2    72
 14506  she->and          -5.78  552  ?     ?    ?
 14507  was->she          ?      ?    ?     ?    ?
 14508  m->i              -5.86  63   545   1    58
 14509  was->it           ?      ?    ?     ?    ?

The µ-measure, in contrast, gives some collocations that are intuitively unlikely phrases consisting of high frequency words. In the case of "the->the" there exist 1641 pairs that speak against that pairing, but it is hard to explain this in terms of local syntactic constraints. The negative scores seem to capture possible typographic errors.

Table 2.2: The last ten collocations by µ

 N      word pair  µ      F1    F2    F12
 14500  she->of    -3.37  552   513   1
 14501  to->and    -3.54  729   872   2
 14502  a->i       -3.66  632   545   1
 14503  and->of    -4.03  872   513   1
 14504  i->and     -4.12  545   872   1
 14505  she->and   -4.14  552   872   1
 14506  to->to     -4.28  729   729   1
 14507  and->and   -4.80  872   872   1
 14508  i->the     -5.03  545   1642  1
 14509  the->the   -6.62  1642  1642  1

Particle verbs

Particle verbs are hard to rank high for the µ-measure, because the individual frequencies of the particles are usually devastatingly high, and the frequency of the main verb in particle verb constructions is usually higher than average.
The Δµ is, in general, good at finding such combinations if the order between the two words is fixed (Table 3.1).

Table 3.1: Some verb + particle (or negation) collocations

 word pair   Nµ    NΔµ  µ     Δµ
 did->not    3961  33   6.39  8.13
 seemed->to  6818  54   4.80  7.64
 must->be    4038  58   6.32  7.57
 looked->at  5211  72   5.61  7.41

Finding thematic phrases

But what about finding Alice's friends? Does the Δµ find the phrases that the text is about (thematic phrases)? To test this we chose some of the names of Alice's friends (Table 3.2). We found that Δµ ranks all the checked friends higher (i.e. gives them a lower rank number) than the µ-measure does. This is due to the frequency effects discussed above.

Table 3.2: Alice's Friends

 word pair      Nµ    NΔµ   µ
 mock->turtle   1517  12    8.86
 march->hare    1003  28    9.65
 white->rabbit  1637  47    8.62
 cheshire->cat  1360  473   9.04
 the->queen     8519  831   4.00
 the->dormouse  8841  832   3.86
 ?              8463  2954  4.03

What is lost

There are obviously good phrases that µ rates higher than Δµ. These usually consist of two words that are uncommon in the sample. Some idioms are of this kind. The Δµ needs to find more examples of collocations with the exact ordering between the constituents to rate the collocation high (Table 3.3).

Table 3.3: Some collocations with Nµ < NΔµ

 word pair       Nµ    NΔµ   µ     Δµ
 yer->honour     172   705   12.7  5.32
 young->lady     230   1073  12.4  4.91
 guinea->pigs    398   645   11.6  5.32
 rose->tree      459   1114  11.3  4.91
 fast->asleep    460   1115  11.3  4.91
 note->book      462   2501  11.3  4.32
 raving->mad     597   2500  ?     4.32
 cheshire->cats  1925  4468  8.23  3.32

Characteristics

The µ-measure is good at estimating global correlations in a document or collection of documents (Wettler & Rapp, 1989). This could be used for capturing contextual and pragmatic constraints in a text. Other methods exist that are good, perhaps even better, at capturing for example synonymy.

Adding memory

We have also done some experiments with adding memory to the method. A 'memory' could, for example, extend 10 words after each word. All words following within a distance equal to the size of the memory were collected. Adding a memory allowed the model to detect shared information of words that were further apart (for example "pack of cards" or "boots and shoes").

The memory introduced false collocations: e.g., "grammar->mouse". The context was: "Alice thought this must be the right way of speaking to a mouse: she had never done such a thing before, but she remembered having seen in her brother's Latin Grammar, 'A mouse--of a mouse--to a mouse--a mouse--O mouse!'" This context gave up to 5 collocations for "grammar" followed by "mouse", and therefore rated "grammar->mouse" very high.

Otherwise, words that happened to be near a word without being statistically related to it were usually rated low. With the model with the 'memory', the µ gave clearly better results at finding related phrases than the Δµ.

With the memory, the Δµ ordered the pairs closer to the original raw-frequency ordering the more 'memory' was present. The experiment with the memory was useful because it showed that this was not worth doing for Δµ, but likely worth doing for µ.
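The following sketch is an assumption about how such a 'memory' could be implemented (it is not the paper's code): for each word, every word that follows within a forward window of a given size is counted as an ordered co-occurrence, so pairs such as "pack ... cards" can be detected.

    from collections import Counter

    def windowed_pairs(words, memory=10):
        """Count ordered co-occurrences of words within a forward window
        of `memory` words (the 'memory' variant of the counting step)."""
        pairs = Counter()
        for i, w1 in enumerate(words):
            for w2 in words[i + 1 : i + 1 + memory]:
                pairs[(w1, w2)] += 1
        return pairs

    # With memory = 10, the quoted Latin Grammar passage contributes several
    # ("grammar", "mouse") pairs, which is how the false collocation arose.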
CONCLUSIONS

Possible usefulness

The higher sensitivity to local constraints in the temporal ordering could be used in a parser for finding local phrases. This might also have implications for language acquisition. It could be tested whether language learners make mistakes that could be explained by the statistical connectivity between words. Further research is needed on how the measure of connectivity behaves on phrase boundaries.

Areas where phrase finding could be useful include: text-to-speech (phrase intonation), machine translation (translation of compounds), and information retrieval: phrase transformation of high frequency terms into medium frequency terms with a better discrimination value (Salton & McGill, 1983).

Linear least squares mapping (Yang & Chute, 1992) is one method that has been shown to be promising at capturing very good mappings between, in their case, symptoms and diagnoses. The same technique could be used for mapping a text to its abstract. The drawback of these methods is their inherent parallel structure, which makes it hard to account for the ordering that natural language requires.

The Δµ-measure, on the other hand, is a local measure that seems to capture dependencies in the temporal ordering of the language. It is hard to draw any definite conclusions from the analysis of only one text, but we have seen how the two proposed measures react to the frequencies of individual words, as well as the frequencies of word pairs. Taking into account the ability of Δµ to find dependencies in the temporal ordering, we think it is a more relevant measure than µ for several aspects of natural language processing, but not all.

Acknowledgements

Thanks to the people at my department, especially Barbara Gawronska.

REFERENCES

Belew, R. K., 1989, Adaptive information retrieval: Using a connectionist representation to retrieve and learn about documents. In: Proc. SIGIR 1989, pp. 11-20, Cambridge, MA.

Carroll, L., Alice's Adventures in Wonderland, The Millennium Fulcrum Edition 2.9. Available from Project Gutenberg, Illinois Benedictine College; send the message "send gutenberg catalog" to 'almanac@oes.otwl.edu' for more information.

Fox, C., 1992, Lexical Analysis and Stoplists. In: Frakes, W. B. & Baeza-Yates, R., Information Retrieval, Prentice Hall, NJ.

Horne, M. & Johansson, C., 1991, Lexical Structure and accenting in English and Swedish restricted texts. Working Papers (Dept. of Ling., U. of Lund, Sweden) 38: 97-114.

Horne, M. & Johansson, C., 1993, Computational tracking of 'new' vs. 'given' information: implications for synthesis of intonation. In: Granström, B. & Nord, L. (Eds.), Nordic Prosody VI - papers from a symposium, Almquist & Wiksell International, Stockholm, Sweden.

Meisel, J., Clahsen, H. & Pienemann, M., 1981, On determining developmental stages in second language acquisition. Studies in Second Language Acquisition 3, 2, pp. 109-135.

Mitjushin, L., 1992, High Probability Syntactic Links. Proceedings of the fifteenth International Conference on Computational Linguistics, pp. 930-934.

Salton, G. & McGill, M. J., 1983, Introduction to Modern Information Retrieval, McGraw-Hill Computer Science Series.

Shannon, C. E., 1951, Prediction and Entropy of Printed English. Bell Systems Technical Journal, Vol. 30, No. 1, January 1951, pp. 50-65. (quoted in Salton & McGill)

Steier, A. M. & Belew, R. K., 1991, Exporting phrases: A statistical analysis of topical language. In: Casey, R. & Croft, B. (Eds.), 2nd Symposium on Document Analysis and Information Retrieval.

Wettler, M. & Rapp, R., 1989, A connectionist System to Simulate Lexical Decisions in Information Retrieval. In: Pfeifer & al. (Eds.), Connectionism in Perspective, North-Holland.

Yang, Y. & Chute, C. G., 1992, A Linear Least Squares Fit Method for Information Retrieval from Natural Language Texts. Proceedings of the fifteenth International Conference on Computational Linguistics, pp. 447-453.

Yarowsky, D., 1992, Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora. Proceedings of the fifteenth International Conference on Computational Linguistics, pp. 454-460.

