On the use of automatic tools for large scale semantic analyses of causal connectives
Liesbeth Degand, Université catholique de Louvain, Institute of Linguistics, [email protected]
Wilbert Spooren, Vrije Universiteit, Amsterdam, Language & Communication, [email protected]
Yves Bestgen, Université catholique de Louvain, Faculty of Psychology, [email protected]. ac.be
Abstract

In this paper we describe the (annotation) tools underlying two automatic techniques to analyse the meaning and use of backward causal connectives in large Dutch newspaper corpora. With the help of these techniques, Latent Semantic Analysis and Thematic Text Analysis, the contexts of more than 14,000 connectives were studied. We will focus here on the methods of analysis and on the fairly straightforward (annotation) tools needed to perform the semantic analyses, i.e. POS-tagging, lemmatisation and a thesaurus-like thematic dictionary.

1 Introduction

In ongoing work, we explore the possibility of using large corpora to test hypotheses concerning the linguistic factors that determine the meaning and use of connectives. Of course, corpus-based approaches to connectives are not new, but classically they rely either on fully analysed but relatively small corpora, or on large corpora of which a random set is analysed. The reason for this quantitative restriction is clear: the data analyses are completely hand-based. While these empirical studies are useful from a qualitative point of view, they all suffer from the same quantitative drawback, namely the relatively small number of data (rarely more than 100 occurrences are analysed, mostly only 50). In addition, most of these analyses are still too analyst-dependent, making generalizations and replications difficult. Changing this situation requires handling large corpora exhaustively (with hundreds and even thousands of occurrences of the same linguistic phenomenon) and implementing the analytic procedures so as to make them analyst-independent.

In this paper, we test such a methodology: we took a number of linguistic hypotheses found in the literature on the semantics of causal connectives and tried to replicate the results. The linguistic material we worked on consists of four Dutch backward causal connectives: aangezien ('since'), doordat ('because of the fact that'), omdat ('because') and want ('because'). This choice was motivated by the fact that there has already been quite some linguistic work on this topic, mainly empirically based (Degand, 2001; Degand and Pander Maat, 2003; Pit, 2003) [1]. We have shown elsewhere how linguistic hypotheses concerning the scaling of these connectives in terms of subjectivity and their thematic behaviour could be supported (Bestgen et al., 2003). Since these first results are very encouraging, we would like to focus here on the methods of automatic analysis (Latent Semantic Analysis and Thematic Text Analysis) and on the fairly straightforward (annotation) tools needed to perform the semantic analyses, i.e. POS-tagging, lemmatisation and a thesaurus-like thematic dictionary. We illustrate how the combination of the two techniques of automatic analysis makes it possible to gain deeper insight into the semantic constraints on the use of the connectives studied. In doing so, we test a number of new hypotheses concerning the perspectivizing and polyphonic nature of connectives that remain unconfirmed in the linguistic literature. We also discuss the robustness of the techniques and their reusability in other contexts and other languages.

[1] For lack of space we will not present the linguistic analyses here but will consider them as given.

2 Techniques and Tools

The techniques used have to fulfil two tasks: they are needed to extract the relevant linguistic material from the corpus, that is to say the four connectives with their context of use; and they are used to analyse the retrieved elements in order to test a number of linguistic hypotheses concerning the meaning and use of these connectives. Our main objective is to show that with these techniques only fairly straightforward annotation tools are needed to perform quite profound semantic analyses on massive quantitative data.

2.1 POS-Tagging and the identification of the causal segments

The extraction of the relevant linguistic material was carried out with automatic syntactic analysis techniques. As a basis for our analyses we worked with the first six months of a Dutch newspaper corpus of more than 30 million words [2]. This material was POS-tagged using MBT (Memory Based Tagger) (Daelemans et al., 1996). We then discarded the items with few content words: sports results, television programs, crosswords and puzzles, stock exchange reports, service information from the newspaper editor, etc. We also 'cleaned' the corpus material of irregularities caused by the incompatibility between the source file and the tagging program (mostly nonsense words generated by the program). This eventually led to a data set of approximately 16,500,000 words.

First, the POS-tagging made it possible to segment the corpus into sentences and to label the words grammatically. Second, POS-tagging allowed us to locate and extract the connectives from the sentences in which they occurred. Concretely, we extracted all sentence-length segments on the basis of the tag […]

Connective    Raw frequency    Relative frequency (per million words)
aangezien     248              30
doordat       826              101
omdat         7689             938
want          5621             686

Table 1: Frequencies of the causal connectives in the data set

The extracted sentences were then analysed by means of a series of heuristics to identify the CAUSE (P) and CONSEQUENCE (Q) segments [4]. From a syntactic point of view, the connectives doordat, omdat and aangezien can occur in two basic types of causal constructions: medial ones (Q CONNECTIVE P), see example (1), and preposed ones (CONNECTIVE P, Q), see example (2). The connective want only appears in medial constructions.

(1) Een gezamenlijk beleid is nodig omdat in het najaar in het Japanse Kyoto wereldwijd wordt onderhandeld over het klimaat.
    ‘A common policy is necessary because worldwide negotiations will take place in the autumn in the Japanese city of Kyoto.’

(2) ‘Iedere strenge winter heeft gevolgen voor de kerkorgels', zegt dr. A.J. Gierveld van de Gereformeerde Organistenvereniging. Doordat het hout krimpt, kunnen er kieren ontstaan waardoor lucht ontsnapt.
    ‘Every hard winter has consequences for the church organs”, Dr. A.J. Gierveld of […]
[2] We used the year 1997 of "De Volkskrant", a Dutch national daily newspaper. The corpus is distributed on CD-ROM.
[3] These cases were eliminated in order to be sure of the exact influence of the connective and of the exact contribution of the context.
[4] The connectives under investigation are all so-called backward causal connectives, i.e. they express an underlying causal relation of the type CONSEQUENCE – CAUSE, in which the connective introduces the CAUSE segment.

[…]

a) the position of the connective in the sentence (number and type of words preceding the connective),
b) the number, position and order of finite verbs in the segment,
c) the presence or absence of punctuation markers, especially commas.

For example, a sentence beginning with the connective omdat can either be preposed (P-Q) (example 3), or medial (Q-P), if Q and P are given in different sentences (example 4).

(3) Omdat de verdachte niet eerder was veroordeeld, bleef de gevangenisstraf geheel voorwaardelijk.
    ‘Because the suspect had not been convicted before, the sentence was entirely probational.’

(4) […] ze onvindbaar zijn tussen de ramsj.
    ‘But there are more [TV] programmes that are worth watch[…]

[…] of initial connective, even though a word precedes the connective.

(5) En omdat in Nederland de voertaal nog steeds het Nederlands is, worden de meeste schoolvakken ook in die taal gedoceerd.
    ‘And because Dutch is still the main language in the Netherlands, most subjects are taught in that language.’

This resulted in 21 heuristic rules, the adequacy of which was hand-checked on large samples of the data. In the end, 1.4% of the data were lost because one of the segments was missing or because none of the procedures could identify P and Q. Ultimately we were able to identify the causal segments for 14,181 sentences. Four syntactic environments can be distinguished, involving […] a preposed construction […] corresponds to a construction in which Q and P are linked by a connective within the same sentence (example 1); b) […]
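As an illustration only, the surface cues named in Section 2.1 (the position of the connective, a word preceding it such as En, and commas) can be combined into a toy segmenter. The sketch below is not one of the 21 heuristic rules of the study, which are not reproduced in this paper. The function name split_segments, the LEADING_WORDS list and the first-comma rule are assumptions made for the example, and the finite-verb cues, which would require POS tags, are left out.

```python
import re
from typing import List, NamedTuple, Optional

CONNECTIVES = {"aangezien", "doordat", "omdat", "want"}
# Assumption: words tolerated before a connective that still counts as initial.
LEADING_WORDS = {"en", "maar"}


class Segments(NamedTuple):
    construction: str   # "medial" or "preposed"
    q: Optional[str]    # CONSEQUENCE; None when Q lies in the prior sentence
    p: str              # CAUSE


def _join(tokens: List[str]) -> str:
    """Glue tokens back together and drop stray edge punctuation."""
    return " ".join(tokens).strip(" .,")


def split_segments(sentence: str) -> Optional[Segments]:
    """Toy P/Q identification from connective position and commas."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    lowered = [t.lower() for t in tokens]
    idx = next((i for i, t in enumerate(lowered) if t in CONNECTIVES), None)
    if idx is None:
        return None  # no causal connective in this sentence
    connective = lowered[idx]
    before, after = tokens[:idx], tokens[idx + 1:]
    # Cue (a): the connective opens the sentence, possibly after one word.
    initial = idx == 0 or (idx == 1 and lowered[0] in LEADING_WORDS)
    # Cue (c): a comma closes P in a preposed construction (never for want).
    if initial and connective != "want" and "," in after:
        comma = after.index(",")
        return Segments("preposed", _join(after[comma + 1:]), _join(after[:comma]))
    if initial:
        # Q has to be looked up in the preceding sentence.
        return Segments("medial", None, _join(after))
    return Segments("medial", _join(before), _join(after))
```

Applied to examples (1), (3) and (5), this sketch labels the first as medial and the other two as preposed. A sentence-initial connective without a comma is returned with q set to None, which signals that Q would have to be taken from the preceding sentence.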
Table 2 displays the cosines resulting from the LSA-analysis. Two ANOVAs were performed. The first one had the connectives as independent variable and the semantic proximity between the causal segments as dependent variable. It shows that hypothesis 1 is borne out (F(3, 10505) = 11.36, p < 0.0001): the causal segments related by the (monophonic) connective omdat are semantically closer than the segments related by the (polyphonic) connective want. The results furthermore show that doordat and aangezien should be described as monophonic connectives. The second ANOVA, with the connectives as independent variable and the semantic proximity between the prior and subsequent sentences as dependent variable, confirms hypothesis 2 (F(3, 10505) = 25.75, p < 0.0001): the monophonic connectives aangezien, doordat and omdat go along with topic continuity (or at least semantic proximity) between the sentences prior and subsequent to the causal construction, while this is less the case for the connective want.

To confirm that these results are indeed related to the issue of perspectivisation, this LSA-analysis was complemented with a thematic text analysis to test for the presence vs. absence of perspective indicators. To this end we built a "Perspective" dictionary of perspective-indicating elements (Spooren, 1989) such as intensifiers, emphasisers, attitudinal nouns and adjuncts, etc. (Caenepeel, 1989). The dictionary was composed of two subcategories:

a) communication markers, like (non-ambiguous) verbs and adverbs of saying and thinking, e.g. report, tell, confirm, require, according to, …
b) [attitude markers …]

The idea of the thematic text analysis was to confirm that the break in semantic tightness occurring with want-segments, as revealed by the LSA-analysis, could indeed be interpreted in terms of a perspective shift. We would therefore expect the causal segments related by the connective want to show diverging perspectivisation patterns, and this not to be the case for the segments related by omdat, doordat and aangezien. This is reformulated in hypothesis 3.

Hypothesis 3: If the causal segments are related by the connective want, the Q-segment contains perspective signals, the P-segment does not. The causal segments related by the connectives omdat, doordat, aangezien do not present such a shift.

                       Communication markers           Attitude markers
Connective             Mean Q (SD)     Mean P (SD)     Mean Q (SD)     Mean P (SD)
aangezien (N = 139)    0.173 (0.38)    0.115 (0.32)    0.360 (0.48)    0.273 (0.45)
doordat (N = 699)      0.129 (0.33)    0.104 (0.31)    0.305 (0.46)    0.326 (0.47)
omdat (N = 6747)       0.179 (0.38)    0.162 (0.37)    0.312 (0.46)    0.312 (0.46)
want (N = 5589)        0.175 (0.38)    0.181 (0.38)    0.442 (0.50)    0.394 (0.49)

Table 3: Mean number of perspective markers in P & Q

The results displayed in Table 3 show that the hypothesis is borne out for the subcategory of attitudinal markers: want-segments display a higher number of attitudinal markers in Q than in P (F(1, 5588) = 26.84, p < 0.0001). For the other connectives this is not the case. For the communication markers, the hypothesis is not borne out. Actually, only omdat displays a higher number of communication markers in Q (F(1, 6746) = 6.53, p < 0.01). While this latter result might seem counter to expectation, it actually goes in the direction of prior observations that omdat-relations frequently display the explicit introduction of speech acts (Degand, 2001; Pit, 2003).

All together, these results offer new interesting insights into the discourse environment of (Dutch) causal connectives. On the one hand, we have shown with the LSA analysis that the proximity between Q and P is lower for want-relations than for the other connectives and that this is also the case for the semantic proximity between the sentences prior and subsequent to the causal relations. We therefore concluded that the connective want is a marker of thematic shift. On the other hand, the TTA analysis revealed that the Q-segments in want-relations display a higher number of attitudinal markers. In our view, the presence of these markers leads to the conclusion that the connective want is indeed a marker of perspective shift, i.e. the break in semantic tightness should be interpreted as a perspective break, as has often been suggested in the literature. Furthermore, the additional results for want (absence of communication markers in Q) also suggest that markers expressing the speaker's attitude should be clearly distinguished from those that make the speaker's speech act explicit (verbs of saying) or designate him/her explicitly as the source of the speech act (adverbs like aldus, volgens, … 'according to').

The polyphony/monophony distinction overlaps with the coordination/subordination distinction between want and the other connectives. The question arises which of those two factors is responsible for the results obtained. One route to follow is to compare our results with a language like English, in which the same connective (because) has both monophonic and polyphonic uses, or with a language like French, where a polyphonic connective like puisque is subordinating. The latter topic is the object of ongoing research.

4 Discussion

In this paper we have presented a method for the linguistic investigation of a discourse phenomenon, viz. connectives, giving very satisfying results without necessitating heavy, work-intensive (hand-based) discourse annotation. The research presented is important to the corpus study of discourse phenomena for a number of reasons. The first is that it makes it possible to test linguistic hypotheses about the use of causal connectives on a large scale, whereas previous tests were based on only small corpora and small amounts of data. The second is that the analysis is mostly fully automatic, especially with respect to the coding of the fragments. It is especially this latter feature that should appeal to the linguistic community, and that makes our method more robust. Intercoder reliability is a constant concern of everyone working with corpora to test linguistic hypotheses (Carletta, 1996), and the more so when one is coding for semanto-pragmatic interpretations, as in the case of the analysis of connectives. A third reason is that our method combines two techniques of automatic text analysis, which allows us to formulate and test our hypotheses at a finer grain than is possible with either technique separately. Moreover, hypothesis formulation and testing goes further: we can use the methodology to formulate new hypotheses. An interesting possibility is to use LSA to find neighbours of terms in the dictionary, thus extending the dictionary. A further interesting avenue is to test the linguistic hypotheses for different genres. This brings us to a further possibility, namely to reuse the semantic space for different types of linguistic research. A final possibility is to use the present semantic space for comparative research: how do the present results compare to a similar analysis of French connectives?

Acknowledgements

L. Degand and Y. Bestgen are research fellows of the Belgian National Fund for Scientific Research. This research was supported by grant n° FRFC 2.4535.02 and by a grant (“Action de Recherche concertée”) of the government of the French-language community of Belgium.

References

Berry, M.W. (1992). Large scale singular value computation. International Journal of Supercomputer Application, 6, 13-49.

Berry, M., Do, T., O'Brien, G., Krishna, V., & Varadhan, S. (1993). SVDPACKC: Version 1.0 User's Guide. Tech. Rep. CS-93-194, University of Tennessee, Knoxville, TN, October 1993.

Bestgen, Y. (2002). Détermination de la valence affective de termes dans de grands corpus de textes. Actes du Colloque International sur la Fouille de Texte CIFT'02 (pp. 81-94). Nancy: INRIA.

Bestgen, Y., Degand, L., & Spooren, W. (2003). On the use of automatic techniques to determine the semantics of connectives in large newspaper corpora: an exploratory study. In L. Lagerwerf, W. Spooren, & L. Degand (Eds.), Determination of Information and Tenor in Texts: MAD 2003 (pp. 179-188). Stichting Neerlandistiek VU Amsterdam & Nodus Publikationen Münster.

Biber, D. (1998). Variation across speech and writing. Cambridge: Cambridge University Press.

Brouwers, L. (1997). Het juiste woord, betekeniswoordenboek. 6th ed. (ed. by F. Claes). Antwerpen etc.: Standaard.

Burgess, C., Livesay, K., & Lund, K. (1998). Explorations in Context Space: Words, Sentences, Discourse. Discourse Processes, 25, 211-257.

Caenepeel, M. (1989). Aspect, Temporal Ordering and Perspective in Narrative Fiction. Doctoral Dissertation, University of Edinburgh.

Carletta, J. (1996). Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics, 22(2), 249-254.

Choi, F., Wiemer-Hastings, P., & Moore, J. (2001). Latent Semantic Analysis for Text Segmentation. In L. Lee & D. Harman (Eds.), Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, 109-117.

Daelemans, W., Zavrel, J., Berck, P., & Gillis, S. (1996). MBT: A Memory-Based Part of Speech Tagger-Generator. In E. Ejerhed & I. Dagan (Eds.), Proceedings of the Fourth Workshop on Very Large Corpora (pp. 14-27). Copenhagen, Denmark.

Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41, 391-407.

Degand, L. (2001). Form and Function of Causation. A theoretical and empirical investigation of causal constructions in Dutch. Leuven, Paris, Sterling: Peeters.

Degand, L., & Pander Maat, H. (2003). A contrastive study of Dutch and French causal connectives on the Speaker Involvement Scale. In A. Verhagen & J. van de Weijer (Eds.), Usage based approaches to Dutch (pp. 175-199). Utrecht: LOT.

Foltz, P.W., Kintsch, W., & Landauer, T.K. (1998). The measurement of textual coherence with Latent Semantic Analysis. Discourse Processes, 25, 285-307.

Kintsch, W. (2000). Metaphor comprehension: A computational theory. Psychonomic Bulletin and Review, 7, 257-266.

Kintsch, W. (2001). Predication. Cognitive Science, 25, 173-202.

Landauer, T.K., Foltz, P.W., & Laham, D. (1998). An introduction to Latent Semantic Analysis. Discourse Processes, 25(2, 3), 259-284.

Lebart, L., Salem, A., & Berry, L. (1998). Exploring Textual Data. Kluwer Academic Publisher.

Lemaire, B., Bianco, M., Sylvestre, E., & Noveck, I. (2001). Un modèle de compréhension de textes fondé sur l'analyse de la sémantique latente. In H. Paugam Moisy, V. Nyckees, & J. Caron-Pargue (Eds.), La Cognition entre Individu et Société: Actes du Colloque de l'ARCo (pp. 309-320). Paris: Hermès.

Oversteegen, L. (1997). On the pragmatic nature of causal and contrastive connectives. Discourse Processes, 24, 51-86.

Pennebaker, J.W., Mehl, M.R., & Niederhoffer, K.G. (2003). Psychological aspects of natural language use: Our words, our selves. Annual Review of Psychology, 54, 547-577.

Piérard, S., Degand, L., & Bestgen, Y. (2004). Vers une recherche automatique des marqueurs de la segmentation du discours. Actes des 7es Journées internationales d'Analyse statistique des Données Textuelles. Louvain-la-Neuve.

Pit, M. (2003). How to Express Yourself with a Causal Connective. Subjectivity and Causal Connectives in Dutch, German and French. Amsterdam: Rodopi.

Popping, R. (2000). Computer-assisted text analysis. London: SAGE.

Spooren, W.P.M.S. (1989). Some Aspects of the Form and Interpretation of Global Contrastive Coherence Relations. Unpublished Dissertation, K.U. Nijmegen.

Stone, P.J. (1997). Thematic text analysis: New agendas for analyzing text content. In C.W. Roberts (Ed.), Text Analysis for the Social Sciences: Methods for Drawing Statistical Inferences from Texts and Transcripts (pp. 35-54). Mahwah, NJ: Erlbaum.

van den Bosch, A., & Daelemans, W. (1999). Memory-based morphological analysis. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, ACL'99 (pp. 285-292). New Brunswick, NJ: ACL.