Stylometric Analysis of Parliamentary Speeches: Gender Dimension

Justina Mandravickaitė
Vilnius University, Lithuania
Baltic Institute of Advanced Technology, Lithuania
[email protected]

Tomas Krilavičius
Vytautas Magnus University, Lithuania
Baltic Institute of Advanced Technology, Lithuania
[email protected]

Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, pages 102–107, Valencia, Spain, 4 April 2017. © 2017 Association for Computational Linguistics

Abstract

The relation between gender and language has been studied by many authors; however, some uncertainty remains regarding the influence of gender on language usage in the professional environment. Often, the studied data sets are too small, or the texts of individual authors are too short, to capture gender-related differences in language usage successfully. This study draws on a larger corpus of speech transcripts of the Lithuanian Parliament (1990–2013) to explore language differences in political debates by gender via stylometric analysis. The experimental setup consists of stylistic features that indicate lexical style and do not require external linguistic tools, namely the most frequent words, in combination with unsupervised machine learning algorithms. Results show that gender differences in language use remain in the professional environment, not only in the usage of function words and preferred linguistic constructions, but in the presented topics as well.

1 Introduction

The influence of gender on language usage has been extensively studied (Lakoff, 1973; Holmes, 2006; Holmes, 2013; Argamon et al., 2003) without fully reaching a common agreement. Understanding gender differences in the professional environment would assist in creating a more balanced atmosphere (Herring and Paolillo, 2006; Mullany, 2007); however, results on the extent of variation depending on the context of communication in a professional setting are inconclusive (Newman et al., 2008).

Most studies rely on relatively small data sets, or the texts of the individual authors are too short to capture the differences in language due to gender (Newman et al., 2008; Herring and Martinson, 2004). Some results show that gender differences in language depend on the context, e.g., people assume male language in a formal setting and female language in an informal environment (Pennebaker, 2011). We investigate the impact of gender on language use in a professional setting, i.e., transcripts of speeches from the Lithuanian Parliament debates. We study language with respect to style, i.e., male and female styles of language usage, by applying computational stylistics, or stylometry. Stylometry is based on two hypotheses: (1) the human stylome hypothesis, i.e., each individual has a unique style (Van Halteren et al., 2005); (2) the unique style of an individual can be measured (Stamatatos, 2009). Stylometry allows gaining meta-knowledge (Daelemans, 2013), i.e., what can be learned from the text about the author: gender (Luyckx et al., 2006; Argamon et al., 2003; Cheng et al., 2011; Koppel et al., 2002), age (Dahllöf, 2012), psychological characteristics (Luyckx and Daelemans, 2008), political affiliation (Dahllöf, 2012), etc.

As in most studies of gender and language (Yu, 2014; Herring and Martinson, 2004), biological sex was used as the criterion for gender in this study. We compare differences in gender-related language use at the group level (faction). The Lithuanian language allows easy distinction between male and female legislators based on their names in the transcripts.¹

¹ Of course, all information about members of parliament is available on-line.

We investigate several questions: (1) How well do simple stylistic features distinguish the genders of members of the Lithuanian Parliament? (2) Which differences in language use by female and male Lithuanian Parliament members are the selected features and methods able to capture?

2 Data Set

The corpus of parliamentary speeches in the Lithuanian Parliament² is used. It consists of transcripts of parliamentary speeches from March 1990 to December 2013: 10 727 speeches by female members of Parliament (MPs) and 100 181 by male MPs, overall 23 908 302 words (2 357 596 by female MPs and 21 550 706 by male MPs; see Table 1 for the details). Only speeches of at least 100 words, from MPs with at least 200 such speeches, were included in the corpus (Kapočiūtė-Dzikienė and Utka, 2014). This could have diminished the number of female MPs' speeches included in the corpus and in our analysis as well. However, the choice of an unsupervised learning approach downscales the class imbalance problem, i.e., the significant difference in the number of transcribed parliamentary speeches made by female and male MPs.

² The corpus of parliamentary speeches in the Lithuanian Parliament was created in the project “Automatic Authorship Attribution and Author Profiling for the Lithuanian Language” (ASTRA) (No. LIT-8-69), 2014–2015.

MPs by gender    No. of samples    No. of words    No. of unique words
Female                   10 727       2 357 596                 93 611
Male                    100 181      21 550 706                268 030

Table 1: Statistics of Corpus of parliamentary speeches in the Lithuanian Parliament.

Lithuanian is a highly inflective language, i.e., nouns have grammatical gender and number, and semantic relations between them are expressed with 7 cases; adjectives have to match nouns in terms of gender, number and case; verbs have 4 tenses and particles for each of them, with the ending marking tense, person and number; gender and case for the particles are also marked morphologically at the ending. All these features produce a substantial number of inflective forms for one lemma. Thus in order to avoid data sparseness we did not lemmatize the corpus for our experiments.

To reduce the “fingerprint” of individual authorship as much as possible, all the samples were concatenated into two large documents based on gender, and each was then partitioned into 15 parts. Thus for analysis we had 15 samples of parliamentary speech made by female MPs and another 15 samples made by male MPs.

Figure 1: Results with 7000 MFW as features.

Figure 2: Bootstrap Consensus Tree with Canberra and 100–10000 MFW.

3 Stylistic Features and Statistical Measures

We use the most frequent words (MFW) (Burrows, 1992; Hoover, 2007; Eder, 2013b; Rybicki and Eder, 2011; Eder and Rybicki, 2013; Eder, 2013a) as features (usually, they coincide with function words (Hochmann et al., 2010; Sigurd et al., 2004)), because they are considered to be topic-neutral and perform well (Juola and Baayen, 2005; Holmes et al., 2001; Burrows, 2002).

The Stylo package for stylometric analysis in R (Eder et al., 2014) is used for the experiments.

Experiments are performed in batches using different numbers of MFW. First, a raw frequency list of features is generated from the whole corpus; it is then normalized using z-scores, which measure the distance of feature frequencies in the corpus in terms of their proximity to the mean (Hoover, 2004). The z-scores are defined as $z = \frac{A_i - \mu}{\sigma}$, where $A_i$ is the frequency of a feature, $\mu$ is the mean frequency of a certain feature in one document, and $\sigma$ is the standard deviation.

Dissimilarity between the text samples is calculated using selected distances (see below), and a distance matrix is generated. Then, hierarchical clustering is applied to group samples by similarity (Everitt et al., 2011), and dendrograms are used to visualize the results.

Typically Burrows's Delta distance is used for stylometric analysis (Burrows, 2002; Rybicki and Eder, 2011). However, Delta depends on z-scores, the number of documents and the balance of terms in documents, and the length and number of authors (Stamatatos, 2009). While Burrows's Delta is effective for English and German, it is less successful for highly inflective languages, e.g., Latin and Polish (Rybicki and Eder, 2011). Hence we used Eder's Delta, i.e., a modified Burrows's Delta that gives more weight to the frequent features and rescales less frequent ones to avoid random infrequent ones (Eder et al., 2014), with Canberra distance

$$\delta_{(AB)} = \sum_{i=1}^{n} \frac{|A_i - B_i|}{|A_i| + |B_i|}$$

where $n$ is the number of most frequent features, $A$ and $B$ are documents, and $A_i$ and $B_i$ are the frequencies of a given feature in documents $A$ and $B$ in the corpus, respectively (Eder et al., 2014). It was reported to be suitable for inflective languages, although it is sensitive to rare vocabulary (Eder et al., 2014), e.g., words that occurred only once or twice.

The goal is identifying stylistic dissimilarities and mapping the positions of the text samples in relation to each other, not classifying female/male legislators; hence hierarchical clustering with Ward linkage (which minimizes total within-cluster variance (Everitt et al., 2011)) was chosen. Though it is sensitive to changes in the number of features or methods of grouping (Eder, 2013a; Luyckx et al., 2006), in this study it shows stable results. Robustness of the clustering results was examined using a bootstrap procedure (Eder, 2013a). It includes extensions of Burrows's Delta (Argamon, 2008; Eder et al., 2014) and bootstrap consensus trees (Eder, 2013a) as a way to improve the reliability of cluster analysis dendrograms.

Figure 3: Results with 200 MFW (starting at 6800 MFW).

4 Experiments

From 20 to 10 000 most frequent features were used for each experiment. We use hierarchical clustering with Ward linkage and Canberra distance, and visualize the results in dendrograms to map the positions of the samples in relation to each other. We focus on identifying variation in female and male parliamentary speech, and do not analyze smaller clusters and the dynamics inside them. A more detailed investigation of separate features (e.g., specific words, part-of-speech tags or their sequences) that are characteristic of female MPs and male MPs individually is part of future plans, while in this paper we focus on the most frequent words.

Experiments with more MFW (from 7000 up to 9910) successfully separated samples of parlia-