Discriminating the registers and styles in the Modem

George Tambouratzis*, Stella Markantonatou*, Nikolaos Hairetakis*, Marina Vassiliou*, Dimitrios Tambouratzis ^, George Carayannis*

* Institute for Language and Speech Processing Epidavrou & Artemidos 6, 151 25 Maroussi, {giorg_t, marks, nhaire, mvas, gkara} @ilsp.gr

^ Agricultural University of , lera Odos 75, 118 55, Athens, Greece. {[email protected]}

Abstract Baayen et al. (1996), who study the topic of author identification, apply statistical measures This article investigates (a) whether register and methods on syntactic rewrite rules resulting discrimination can successfully exploit linguistic by processing a given set of texts. They report information reflecting the evolution of a that the accuracy thus obtained is higher than language (such as the phenomenon of when applying the same statistical measures to the language) and (b) what kind the original text. On the other hand, Biber of linguistic information and which statistical (1995) uses Multidimensional Analysis coupled techniques may be employed to distinguish with a large number of linguistic features to among individual styles within one register. distinguish amongregisters. The underlying idea Using clustering techniques and features is that, rather than being distinguished on the reflecting the diglossia phenomenon, we have basis of a set of finguistic features, registers are successfully discriminated registers in Modem distinguished on the basis of combinations of Greek. However, diglossia information has not weighted linguistic features, the so-called been shown sufficient to distinguish among "dimensions". individual styles within one register. Instead, a large number of linguistic features need to be This article reports on the discrimination of texts studied with methods such as discriminant in written Modem Greek. The ongoing research analysis in order to obtain a high degree of described here has followed two distinct discrimination accuracy. directions. First, we have tried to distinguish among registers of written Modern Greek. In a 1 Introduction second phase, our research has focused on distinguishing among individual styles within The identification of the language style one register and, more specifically, among characterising the constituent parts of a corpus is speakers of the Greek Parliament. To achieve very important to several appfieations. For that, structural, morphological and part-of-speech example, in information retrieval applications, information is employed. Initially (in section 2) where large corpora of texts need to be searched emphasis is placed on distinguishing among the efficiently, it is useful to have information about different registers used. In section 3, the task of the language style used in each text, to improve author identification is tested with selected the accuracy of the search (Karlgren, 1999). In statistical methods. In both sections, we describe fact, the criteria regarding language style may the set of linguistic features measured, we argue differ for each search and therefore - due to the for the statistical method employed and we large number of texts - there is a requirement to comment on the results. Section 4 contains a perform style categorisation in an automated description of future plans for extending this fine manner. Such systems normally use statistical of research while in section 5 the conclusions of methods to evaluate the properties of given texts. this article are provided. The complexity of the studied properties varies. Kilgarriff (1996) employs mainly the frequency- 2 Distinguishing Registers of-occurrence of words while Karlgren (1999) applies statistical methods primarily on structural To distinguish among registers, we successfully and part-of-speech information. exploited a particular feature of Modem Greek, namely the contrast between Katharevousa and non-classifiable entries), resulting in 11 Demotiki. These are variation.,; of Modern Greek variables expressed as percentages. which correspond (if only roughly) to formal and informal speaking. Katharevoma was the official These variables are more similar to the language of the Greek State until 1979 when it characteristics used by Karlgren (1999), and was replaced by Demotiki. By that time, differ considerably from those used by Kilgarriff Demotiki was the establis]hed language of (1996) and Baayen et al. (1996). For the metrics literature while, in times, it had been the of the first and thkd categories, a custom-built language of elementary education. Compared to program was used running under Linux. This Demotiki, Katharevousa bears an important program calculated all structural and resemblance to manifested morphological metrics for each text in a single explicitly on the morphological level and the use pass and the results were processed with the help of the lexicon. At a second step, we dropped the of a spreadsheet package. The metrics of the Katharevousa-Demotiki approach and relied on second category were calculated using a custom- part-of-speech information, which is often built program in the C programming language. exploited in text categofisafion experiments (for PoS counts were obtained using the ILSP tagger instance, see Biber et al. 1998). Again, we (Papageorgiou et al., 2000) coupled with a obtained satisfactory results. number of custom-built programs to determine the actual frequencies-of-occurrence from the 2.1 Method of work tagged texts. Finally, the STATGRAPHICS package was used for the statistical analysis. The variables used to distinguish among registers may be grouped into the following categories: The dataset selected consisted of examples from 1. Morphological variables: These were verbal three registers: endings quantifying the contrast (i) fiction (364 Kwords - 24 texts), Katharevousa / Demofiki. Although the (ii) texts of academic prose referring to morphological differences between these two historical issues, also referred to as the variations of Greek are not limited to the history register (361 Kwords - 32 texts) and verb paradigm, we focused on the latter since (iii) political speeches obtained from the it better highlights the contrast under proceedings of the Greek parliament consideration (Tainbouratzis et al., 2000). A sessions, also referred to as the parliament total of 230 verbal endings were selected, register (509 Kwords - 12 texts). split into 145 Demotiki and 85 Katharevousa The texts of registers (I) and (II) were retrieved endings (see also the Appendix). These 230 from the ILSP corpus (Gavrilidou et al., 1998), frequencies-of-occurrence were grouped into all of them dating from the period 1991-1999. 12 variables for use in the, statistical analysis. The texts of register (III) were transcripts of the 2. Lexical variables: Certain negation particles Greek Parliament sessions held during the first (ovd~ei¢, otu~rere, oo~o4zo6, dryer)clearly half of 1999. signify a preference for Katharevousa while others (~51Xo~¢, #are, Xcopi¢) are clear This dataset was processed using both seeded indicators of Demotiki. However, the most and unseeded clustering techniques with between frequently used negation particles (tzt,/a/v, 3 and 6 clusters. The unseeded approach ~cv) are not characteristic of either of the two confirmed the existence of distinct natural variations. classes, which correspond to the three registers. 3. Structural macro-features: average sentence The seeded approach confirmed the ability to length, number of commas, dashes and accurately separate these three registers and to brackets (total of 4 variables). cluster their elements together. Initially, a "short" 4. After the completion of the experiments with data vector containing only the 12 morphological variables of type 1-3 (Tambouratzis et al., variables quantifying the Demofiki/Katharevousa 2000), Part-of-Speech (PoS) counts were contrast was used (Tambouratzis et al. 2000), as introduced. The PoS categories were well as a 16-element vector combining structural adjectives, adjunedons, adverbs, articles, and morphological characteristics. The seeds for conjunctions, nouns, pronouns, numerals, the Parliaraent and History registers were chosen particles, verbs and a hold-all category (for randomly. The seeds for the Fiction register were chosen so that at least one of them would not be

36 an "outlier" of the Fiction register. seeded clustering approach had a high degree of Representative results are shown in Table 1 for accuracy, reaching 100% when using 5 clusters. the different vectors and numbers of clusters. In For the 27-element vector with both each case, the classification rate quoted morphological and PoS information, perfect corresponds to the number of text elements clustering has been achieved even with 4 correctly classified (according to the register of clusters. On the other hand, a successful the respective seed). clustering (albeit with a lower level of accuracy) is achieved using only structural and PoS 12-elem. 16-elem. information. 6 clusL 95.6% 98.5% 4 clust. 97.1% 98.5% It should be noted that the lexical variables used, 3 clust. 95.6% 97.1% that is the negation particles, did not contribute at all (Markantonatou et al., 2000). Furthermore, Table 1 - Seeded clustering accuracy as a the system performed almost as well with and function of the cluster number and vector size. without macro-structure features, the difference in accuracy being less than 5%. The vector size was augmented with PoS information, resulting in a 27-element data The parliament texts can be claimed to form a vector. A new set of clustering experiments were register whose patterns are closely positioned in performed using Ward's method with the the pattern space. Of the three registers, the squared Euclidean distance measure to cluster literature one presented the highest degree of the data in an unseeded manner. Finally, a 15- variance, with more than one sub-clusters element data vector was used with PoS and existing as well as outlier elements. This may be structural information but without any explained by the fact that the parliament morphological information. The results obtained proceedings, contrary to literature, undergo (Table 2) show that PoS information improves intensive editing by a small group of specialised the clustering performance. public servants.

2.2 Comments on the Results 3 Distinguishing Styles within One Register Our results strongly suggest that registers of written Modem Greek can be discriminated In this section, we report on our efforts to accurately on the basis of the contrast distinguish among individual styles within one Katharevousa / Demotiki manifested with register. In particular, we intend to distinguish morphological variation. Languages with a among speakers of the Parliament by studying different history may not be suited to such a the transcripts of the speeches of five parliament categorisafion method. This is evident in Biber's members over the period 1997-2000. Each of work (1995) for the English language, where a these speakers belongs to one of the five political variety of grammatical and macro-structural parties that were represented in the Greek linguistic features but no morphological variation parliament over that period. Up to date, the features were employed. It seems then that experiments have been limited to the period corpora of languages which are characterised by 1999-2000. the phenomenon of diglossia, may be successfully categorisable on the basis of 3.1 Method of work morphological information (or other reflexes of diglossia). Such a discrimination method may The number of variables (46 in total) calculated give results as satisfactory as approaches which for each of the five speakers can be grouped as are closer to the Biber (1995) spirit and rely on follows: PoS and structural measures (see Tables 1 and 2). . Morphological variables (20 variables): • Verbal endings expressing the Tables 1 and 2 show that the accuracy of Katharevousa / Demotiki contrast giving clustering reaches approximately 99% while the rise to 12 variables.

~'7 12-elem. 16-elem. 27-elem. 15-elem 6 clust. 95.5% 100.0% 100.0% 100.0% 5 clust. 95.5% 100.0% 100.0% . 89.6% 4 clust. 94.1% 98.5% 100.0% 83.4% 3 clust. 94.1% 98.5% 98.5% 83.4%

Table 2 - Unseeded clu~¢erin£ accuracy as a function of the cluster number and vector size usevL

* the use of infixes (2 variables) in the past registers. Even when only 2 speakers were used, tense forms. the clusters formed involved patterns from both . the person and ntmaber of the verb form speaker classes. (6 variables). The last two types of variable are expressed We experimented with two corpora, Corpus I and as percentages normalised over the number Corpus 11, as described in Table 3. Corpus H is a of verb forms. subset of Corpus I. Each of the speeches 2. Lexical variables (6 variables): included in Corpus II was delivered as an • Negation particles(623, &v,/aft). opening speech Cprotoloyia") at a parliament • Negative words of Katharevousa (ovJei¢, session when at least two of the studied speakers ~iveo). delivered speeches. • Other words which also express the contrast Katharevousa / Demotiki (the An important issue is whether the selected anaphoric pronouns 'o~oio¢' (Kath) and variables are strongly correlated. If indeed strong 'taro' (Dem)), currently resulting in a correlations do exist, these might be used to single variable. reduce the dimensionality of the pattern space. 3. Structural macro-features: average sentence For the purposes of this analysis, the 46 and word length, number of commas, independent variables were used (45 in the case question marks, dashes and brackets, of Corpus II where only "protoioyiai" exist, since resulting in a total of 6 variables. then the order variable is constantly equal to 1). 4. Structural micro-features (other than The number of correlations of all variable pairs lexical): exceeding given thresholds is depicted in Figure • Part-of-Speech counts (10 variables). 1, for both Corpus I and Corpus 11. According • Use of grammatical categories such as to this study, in Corpus IL the percentage of the genitive case with nouns and variable pairs with an absolute value of adjectives (2 variables). correlation exceeding 0.5 is approximately 3%, 5. The year when the speech was presented in indicating a low correlation between the the Parliament and the order of the speech in parameters. Additionally, out of 990 pairs of the daily schedule, that is whether it was the Corpus 11, only a single one has a correlation first speech of the speaker that clay (hereafter exceeding 0.8. The correlations for the same denoted as "protoloyia") or the second, third parameter pairs over the two corpora are similar, etc. (resulting in a total of 2 variables). though as a rule the correlation for Corpus I is 6. The identity of the speaker, denoted as the less that that for Corpus 11, reflecting the larger speaker Signature (1 variable), which was variability of texts in Corpus I. The correlation used to determine the desired classification. study indicated that most of the parameters are Similarly to the clustering experiments, a set of not strongly correlated. Thus, a factor analysis C programs was used to extract automatically the step is not necessary and the application of the values of the aforementioned variables from the diseriminant analysis directly on the original transcripts. Most of these programs rely on variables is justified. measuring the occurrence of di-gram% and more generally n-grams, for letters, words and tagsets, Initially, Corpus I (see Table 3) was processed. thus being straight-forward. In the case of The 46 aforementioned variables were used to speaker identification, Discriminant Analysis generate discriminant functions accurately was used, as the clustering approach did not give recognising the identity of the speaker. To that very good results, indicating that the distinction end, three different approaches were used: among personal styles is weaker than that among (i) the full model: all variables were used to determine the discriminant functions;

38 (ii) the forward model: starting from an the basis of morphological features expressing empty model, variables were introduced in the contrast Katharevousa/Demotiki. This may order to create a reduced model, with a be explained by the fact that these texts undergo small number of variables; intensive editing towards a well-established sub- (iii) the backward model: starting from the full language. This editing homogenises the model, variables were eliminated to create morphological profile of the texts but, of course, a reduced model. does not go as far as homogenising the lexical preferences of the various speakers. That is why, In the cases of the forward and backward contrary to the register-clustering experiments, models, the values of the F parameter to both lexical variables expressing the particular enter and delete a variable were set to 4 while the contrast seem to play a role in discriminating maximum number of steps to generate the model between speakers and why the use of was set to 50. Katharevousa-odented negative particles, which was not important in register discrimination, Year 1999-2000 seems to be of some importance in style Speaker Corpus I Corpus II discrimination. The observation that negative A 92 30 words play a role in style identification is in B 45 24 agreement with the observations of Labb6 (1983) C 33 21 on the French political speech. D 21 16 E 150 36 Structural features have turned out to be important: the average word length, the use of Table 3 - Comparative composition of punctuation and question marks and the use of Corpus I and Corpus 11. certain parts-of-speech such as articles, The performance of this model is improved if: conjunctions, adjuncfions and - especiaUy - I. the order in which each particular speech verbs. Furthermore, the distribution of verbs into was delivered is taken into account: the persons and numbers seems to be important, subset of "protoloyiai" is well-defined and though the exact variables selected differ presents a low variance while the speeches of depending on the exact set of speeches used second or lower order have a higher (these variables are of course complementary). variance. 2. the corpus comprises only sessions where One of the most interesting findings of this more than one speaker has delivered research is that it is important whether the speaker delivers a "protoloyia" or not. speeches. Thus, the more balanced Corpus II (Table 3) presents an improved "Protoloyiai" can be classified at a rate of 95% discrimination performance. while mixed deliveries result in a lower rate, as low as 75%. This may be caused by two factors: 1. "Protoloyiai" represent longer stretches of For these two corpora, the results of the diseriminant analysis are shown in Table 4. The text, which are more characteristic of a given discrimination rate obtained with Corpus II is speaker. 2. Speakers prepare meticulously for their much higher than that for Corpus I. In addition, "protoloyiai" while their other deliveries smaller models, with 8 variables, may be created that correctly classify at least 75% of Corpus II. represent a more spontaneous type of speech, An example of the factors generated and the which tends to contain patterns shared by all manner in which they separate the pattern space the parliament members. is shown in the diagrams of Figure 2. Finally, certain additional patterns are emerging for each of the speakers. Certain speakers (e.g. 3.2 Comments on the Results speaker A) are more consistently recognised than others (e.g. speaker B) while speaker B is similar Though this research is continuing, certain facts to speaker C and speaker D is similar to speaker can be reported with confidence. E. This indicates that additional variables may be required to improve the classification accuracy Within the Greek Parliament Proceedings for all speakers. register, individual styles can not be classified on correlation of variables

• ,e - Corpus I Corpus II

m m

o ~'~_--_~_ : ::= 0.2 0.4 0.6 0.8 -10 correlation level

Figure 1 - Percentage of variable pairs exceedin£ a given level of absolute correlation.

Dataset Model observations full forward backward Corpus I 93.79 % 75.37 % 78.30 % 341 (46) (46) (46) Corpus II 97.64 % 94.49 % 92.91% 127 (45) (13) (20) Corpus II 97.64 %% 87.40 % 79.53 %% 127 (reduced model) (45) (8) (s)

Table 4 - Discrimination rate (the corresponding model size is shown in italics).

4 Future Plans 5 Conclusions

As a next step, frequency of use of certain In this article, ongoing research on register and lemmata shall be imroduced since visual individual style eategorisation of written Modem inspection indicates that they may provide good Greek has been reported. A system has been discriminatory features. We also plan to proposed for the automatic register substitute average lengths (of both words and categorisafion of corpora in Modem Greek sentences) with the distribution of lengths. exploiting the highly inflectional nature of the Furthermore, we intend to introduce certain language. The results have been obtained with a structural measurements such as repetition of relatively constrained set of registers; however structures, chains of nominals and the occurrence their recognition accuracy is remarkably high, of negation within NP phrasal eousdments. exceeding 98% with an unseeded clustering Another possible extension involves the approach using between 3 and 6 clusters. inclusion of the speech topic. As certain speakers' characteristics seem to change through On the front of individual style categorisation, a time, we plan to process the entire corpus of discrimination rate of over 80% was achieved for speeches for the target period 1997/-2000. five speakers within the Greek Parliament Finally, an important issue is the comparison of register. Morphological variables were shown to the results obtained in our experiments to these be of less importance to this task, while lexieal generatedby alternative techniques proposed by and straetural variables seemed to take over. We other researchers. This will allow the deduction are planning to introduce several new lexical and of more accurate conclusions regarding the structural variables in order to achieve better strengths and the weaknesses of the research discrimination rates and to determine strategies. discriminating features of the different styles.

40 Acknowledgements Department of lz~guistics, Faculty of Philosophy of the Aristotelian University of Thessaloniki, May The authors wish to thank the Dept. of Language 2000 (in print/in Greek). Technology Applications and specifically Dr. Papageorgiou, H., Prokopidis, P., Giouli, V. & Harris Papageorgiou and Mr. Prokopis Piperidis, S. (2000) A Unified PoS Tagging Prokopidis in obtaining the lemmatised versions Architecture and its application to Greek. of the parliament transcripts. Additionally, the Proceedings of the 2nd International Conference on authors wish to acknowledge the assistance of Language Resources and Evaluations, Athens, the Secretariat of the in Greece, 31 May - 2 June, Vol. 3, pp. 1455-1462. obtaining the session transcripts. Tambouratzis, G., Markantonatou, S., Hairetakis, N. References & Carayannis, G. (2000) Automatic Style Categorisation of Corpora in the Greek Language. Proceedings of the 2nd International Conference on Baayen, R. H., van Halteren, H. and Tweedie, F. J. Language Resources and Evaluations, Athens, (1996). Outside the cave of shadows: Using Greece, 31 May - 2 June, Vol. 1, pp. 135-140. syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing, Vol. 11, No. 3, pp. 121-131. APPENDIX Characteristics of Katharevousa and Biber, D. (1995) Dimensions of Register Variation: A Demotiki cross-linguistic comparison. Cambridge University Press. Diglossia in Modem Greek is due to the contrast between Katharevousa and Demotiki and is well- Biber, D., Conrad, S. & Reppen, R. (1998) Corpus manifested on the morphological level. Here we Linguistics: Investigating Language Structure and concentrate on verb morphology. Use. Cambridge University Press. Demotiki tends to have words ending with an 'open' syllable. So, 3 r~ Plural verbal endings in -n (1) are Clairis, C. & Babiniofis, G. (1999) Grammar of augmented to -ne (2). Modern Greek - I1 Verbs. Ellinika Grammata, Athens (in Greek). (1) ~ [e'leyan] (Kath ) (=they said) (2) 2b/etw [le'yane] (Dem) (=they said) Gavrilidou, M., Labropoulou P., Papakostopouiou N., In Demotik/, Katharevousa's consonant dusters of Spiliotopoulou S., Nassos N. (1998) Greek Corpus two fricatives or two plosives are convened into Documentation, Parole LE2-4017/10369, WP2.9- clusters of one fricative and one plosive (3) - (4) WP-ATH-1. (Holton et al., 1997, pp. 14). (3) nmtrOtb [pis0o']/n~zortb [pisto'] (=to be Holton D., Mackridge, P. & Philippaki-Warburton, I. convinced) (1997) Greek: A Comprehensive Grammar of the Modem Language. Roufledge, London and New (4) ga2mpOd~[kalifSo']/xczivffrd~ [kalifto'] (--to be York. covered) Certain verb classes exhibit thematic vowel Karlgren, J., (1999) Stylistic Experiments in alternations either following the inflectional paradigm Information Retrieval. In T. Strzalkowski (ed.), of Ancient Greek or Demotiki (5) (Clairis and Natural Language Information Retrieval, pp. 147- Babiniotis, 1999). 166. Dordrecht: Kluwer. (5) e~aprdtraz [eksarta'te] (Kath ) l e~odrakrou [eksartiete] (Dem) (=depends) Kilgarriff, A. (1996). Which words are parfieuiarly characteristic of a text? A survey of statistical Sometimes Deraotiki uses a verbal root, which is similar though not identical to the Katharevousa one approaches. In Proc. AISB Workshop on Language Engineering for Document Analysis and (6). Recognition, Sussex University, April, pp. 33-40. (6) AfioJ [li'o] (Kath)/2fivxo Oi"no] (Dem ) (=to solve) Finally, many verbs inherited from Katharevausa Labb~, D. (1983). Francois Mittermna~ Essai sur le survive in Demotik/, either having an equivalent - discours. La Pens~,e Sauvage, Grenoble. mainly colloquial- (7) or not (8) (Clalris and Babiniotis, 1999). Markantonatou, S. & Tambouratzis, G. (2000) Some (7) rcpodO~uoa [profi'0eme] ( Kath ) / oxoxe6oJ quantitative observations regarding the use of [skope'vo] (Dem) (=I intend to) grammatical negation in Modern Greek. Proceedings of the 21 st Annual Meeting of the (8) ztpo~arapaz [proi'stame] (=supervise)

/11 0 Speak~ A " " 0 SpsmlmrA * v sp--k~el ". . Speaker B • 4 " SpeakeTC • * • + s..~o I "=-. S=__ + SpeakmO • . • -y_~, 0 SpeakerE • ~'~v v 2 ~ , "'- :" % ^ ~.o o°~ ° . • r~ + * o o .~j>,- . , d # + o :{ °° ,o.o ., + ++ 0 .3 o 0o# 0 0 O0

_____~_--_~___~_.~.~--.___._,[email protected],.____ -8 -S -4 -2 0 2 4 6 F=ctm I Fact0r2

6 --.~--~m--..~.~=.--.-..-======.,.-..--- 6~ O SpeakerA 0 Z~ SptakW B I • •~ ~sdcer O Speaker I • ,,peakeTC e 4 • C [ • * 4 + SpeakerO I ~ • + SpeakerD • 0 0 0 SpellkerE [ * 0 SpeakerE e ° • 2 ~_ • o 0 21 0 0 ~ O0 " *. Oo oO;O" : 0 0 o ^o ~ d+o o~O'~. o~ u. ~ o # o +'~,+ ~o + 0 o -2 0 -2 ++ + + ++ o :.~ z~ -4 z~ Z~

-6 -8 -8 -4 -2 0 2 4 .,6 Facle¢ 1 Fact0r2

r 6ram i 0 Spe~kerA I 0 SpsakefA + SpeWer B + 5 z~ SpeakerB e Spe;lkerC • Speaker C + Speaker O ++ 4 + Spe=k~ O + + + 0 Spe~ikorE O Speaks E + 3 + +.#-~ + + .~+÷ + + ÷ '2 3 +, ." 0 0 ++ +÷ e00 ° ."" + O+00 " "t~':. " O 0 ~T O O O ".. ,~ # 0~000 0 :" ""

• 00~o_QP o ~ ~"~- -I

Z~ z~O 0 Q> 0 o -2 -2 0 0 0 ZX 0 Z~ ~0 0 0 ~ t , -3 8 6 4 -2 0 2 4 6 6 -4 -2 0 2 4 Factor I Factor 3

Figure 2 - Discriminant factors plotted against the patterns for corpus I1.

42