A Statistical Approach for the Lexical Analysis of Large Corpora
Italian title: Metodi statistici per l'analisi lessicale di corpora di grandi dimensioni

Francesco Pauli (1), Arjuna Tuzzi (2)
(1) Dipartimento di Scienze Statistiche, Università di Padova
(2) Dipartimento di Sociologia, Università di Padova

This work is part of a project developed by an interdisciplinary research team of the University of Padova (Cortelazzo & Tuzzi, 2007).

Keywords: End of Year Message, Italian Presidents of Republic, lexicon, specific textual units

1. Introduction

Statistical and linguistic analysis procedures were applied to a corpus of 57 speeches delivered by the nine Presidents of the Italian Republic for the traditional End of Year Message. This work addresses the problem of identifying specific lexical features by analysing the lexical contingency tables that contain the frequencies of words in the sub-corpora (either the 57 speeches or the 9 sub-corpora, each comprising the speeches given by one President).

Most Presidents delivered 7 speeches; the two exceptions are Einaudi (6) and Segni (2). The length of the speeches tends to increase over time (Table 1), the two extremes being Einaudi, the most concise, and Scalfaro, the longest. Transcriptions of the spoken version of the speeches were obtained from www.quirinale.it.

The sizes of the corpus and of the sub-corpora in word-tokens, and the vocabulary sizes in word-types (forms and lemmas), are shown in Table 1. In order to increase the amount of information, the corpus was lemmatised through manual and automatic processes; lemmatisation substantially reduces the number of word-types, from 9786 forms to 6003 lemmas over the whole corpus.

Table 1: Number and length of the Presidents' speeches

President              period   speeches  tokens  types (forms)  types (lemmas)
Luigi Einaudi          1948-55         6    1203            558             464
Giovanni Gronchi       1955-62         7    5822           1794            1388
Antonio Segni          1962-64         2    1795            781             669
Giuseppe Saragat       1964-71         7    8459           2238            1707
Giovanni Leone         1971-78         7    7374           2041            1516
Sandro Pertini         1978-85         7   15545           2743            1903
Francesco Cossiga      1985-92         7   13858           3001            2236
Oscar Luigi Scalfaro   1992-99         7   24634           4133            2811
Carlo Azeglio Ciampi   1999-06         7   12569           2919            2074
Corpus                 1948-06        57   91259           9786            6003

Table 2: Excerpt of specific words for three Presidents (b/h: asterisks marking significance under the bootstrap, b, and the hypergeometric model, h; o: p-value of the homogeneity test across the President's speeches)

Einaudi                       Pertini                       Ciampi
word             b/h  o       word             b/h  o       word             b/h  o
foriero.A        * *  0.715   americano.A      * *  0.277   città.N          * *  0.020
elevare.V        * *  0.193   animo.N          * *  0.021   Euro.N           * *  0.185
patria.N         * *  0.968   anziano.N        * *  0.111   Europa.N         * *  0.663
confortare.V     * *  1.000   ascoltare.V      * *  0.016   europeo.A        * *  0.599
volgere.V        * *  1.000   assassinare.V    * *  0.074   fiducia.N        * *  0.584
concorde.A       * *  0.393   carcere.N        * *  0.636   generazione.N    * *  0.900
borgo.N          * *  0.016   cercare.V        * *  0.058   identità.N       * *  0.421
fecondo.A        * *  0.851   combattere.V     * *  0.003   istituzione.N    * *  0.005
anno.N           * *  0.971   contingente.N    * *  0.000   Italia.N         * *  0.843
soddisfazione.N  * *  0.917   dannato.A        * *  0.596   nostro.A         * *  0.404
casolare.N       * *  0.016   difendere.V      * *  0.006   patria.N         * *  0.169
idealmente.ADV   * *  0.593   disoccupazione.N * *  0.164   secolo.N         * *  0.002
elevazione.N     * *  0.536   dittatura.N      * *  0.742   Unione.N         * *  0.807
apprestare.V     * *  0.779   fame.N           * *  0.003   governare.V      * *  0.581
percorso.N       * *  0.482   fare.V           *    0.072   inno.N           * *  0.048
tutto.PRON       * *  0.160   funerale.N       * *  0.060   provincia.N      * *  0.371
voto.N           * *  0.548   giovane.N        * *  0.000   sogno.N          * *  0.223
sicché.CONJ      * *  0.242   gioventù.N       * *  0.051   stesso.A         * *  0.005
lecito.A         * *  0.715   guerra.N         * *  0.030   su.PREP          * *  0.559
auspici.N        * *  0.125   invece.AVV       * *  0.993   voi.PRON         * *  0.280
riservare.V      * *  0.847   io.PRON          * *  0.729   affidare.V       * *  0.579
affetto.N        * *  0.917   italiano.A       * *  0.000   comunale.A       * *  0.581
comune.A         * *  0.849   italiano.N       *    0.020   confronto.N      * *  0.512
italiano.N       * *  0.953   mila.A           * *  0.333   Risorgimento.N   * *  0.360
cammino.N        * *  0.758   morire.V         * *  0.900   coesione.N       * *  0.041
sereno.A         * *  0.701   napoletano.A     * *  0.050   fondamentale.A   * *  0.045
palpito.N        *    0.593   umanità.N        *    0.837   moglie.N         * *  0.862
ognor.ADV        *    0.784   Spagna.NM        * *  0.168   sessanta.NUM     *    0.045
rigoglio.N       *    0.593   sismico.A        * *  0.001   immigrazione.N   *    0.840
Trieste.NM       *    1.000   Hiroshima.NM     * *  0.002   sindaco.N        *    0.009

2. Methods and results

The first issue we consider is deciding whether a lemma appears homogeneously across the 9 Presidents, or whether it appears mostly in one President, or in a subset of them, and can therefore be considered "specific". This is done by means of a bootstrap calculation (resampling with replacement) and of the hypergeometric model (Lafon, 1980); in Table 2 we report, for three Presidents, the significance status of some lemmas according to the two alternatives (b for the bootstrap, h for the hypergeometric model). The bootstrap method performs better for low-frequency words (Pauli & Tuzzi, 2006).

A lemma cannot be considered "specific" to a President if it is not well spread over all his speeches. For this reason it is relevant to check, for each lemma recognised as "specific" to a President, the homogeneity of its distribution among the speeches of that President (Baayen, 2000). From a technical point of view this problem is no different from the one addressed for the 9 sub-corpora: the President under investigation plays the role of the corpus and his speeches play the role of the sub-corpora. In Table 2 we report the significance levels of some "specific" lemmas (o), where o is a p-value for the null hypothesis of homogeneous distribution.

References

Baayen R.H. (2000) Word Frequency Distributions, Kluwer Academic Pub.
Cortelazzo M.A., Tuzzi A. (2007) Messaggi dal Colle, Marsilio, Venezia.
Lafon P. (1980) Sur la variabilité de la fréquence des formes dans un corpus, Mots, 1, 127-165.
Pauli F., Tuzzi A. (2006) Identifying specific textual units of documents taken from large corpora. Comparing methods, JADT 2006, Presses universitaires de Franche-Comté.
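
As an illustration of the hypergeometric model referred to in Section 2, the following Python sketch computes the probability that a lemma occurring F times in a corpus of T tokens appears at least f times in a sub-corpus of t tokens, if tokens were allocated to the sub-corpus at random. This is only a minimal sketch of the general idea, not the authors' implementation: the function name `hypergeom_upper_tail`, the use of an upper-tail probability alone, and the lemma counts in the example (F = 40, f = 5) are assumptions made for illustration. Only the token counts (91259 for the corpus, 1203 for Einaudi) come from Table 1.

```python
from math import comb


def hypergeom_upper_tail(T, F, t, f):
    """Return P(X >= f) for X ~ Hypergeometric(T, F, t).

    T: number of tokens in the corpus
    F: frequency of the lemma in the corpus
    t: number of tokens in the sub-corpus
    f: observed frequency of the lemma in the sub-corpus
    """
    if not (0 <= F <= T and 0 <= t <= T):
        raise ValueError("need 0 <= F <= T and 0 <= t <= T")
    k_min = max(f, 0)
    k_max = min(F, t)  # the lemma cannot occur more than F or t times
    # Exact integer arithmetic; one division at the end avoids float overflow.
    favourable = sum(
        comb(F, k) * comb(T - F, t - k) for k in range(k_min, k_max + 1)
    )
    return favourable / comb(T, t)


# Hypothetical example: a lemma with 40 occurrences in the whole corpus,
# 5 of which fall in Einaudi's speeches (sizes taken from Table 1).
p = hypergeom_upper_tail(T=91259, F=40, t=1203, f=5)
print(f"upper-tail probability: {p:.3g}")
```

A small probability suggests that the lemma is over-represented in the sub-corpus; the threshold used to declare a lemma "specific" is a separate choice that this sketch does not fix.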