Gender-Based Analysis of Slovene User-Generated Content
Total Page:16
File Type:pdf, Size:1020Kb
GENDER-BASED ANALYSIS OF SLOVENE USER-GENERATED CONTENT Iza Škrjanec Master Thesis Jožef Stefan International Postgraduate School Ljubljana, Slovenia Supervisor: Prof. Dr. Nada Lavrač, Jožef Stefan Institute, Ljubljana, Slovenia, and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia Working Supervisor: Dr. Senja Pollak, Jožef Stefan Institute, Ljubljana, Slovenia Evaluation Board: Assoc. Prof. Dr. Marko Robnik-Šikonja, Chair, Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia Prof. Dr. Dunja Mladenić, Member, Jožef Stefan Institute, Ljubljana, Slovenia, and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia Asst. Prof. Dr. Ana Zwitter Vitez, Member, Faculty of Humanities, University of Pri- morska, Koper, Slovenia, and Faculty of Arts, University of Ljubljana, Ljubljana, Slovenia Iza Škrjanec GENDER-BASED ANALYSIS OF SLOVENE USER-GENERATED CONTENT Master Thesis ANALIZA SLOVENSKIH SPLETNIH UPORABNIŠKIH VSEBIN Z VIDIKA SPOLA Magistrsko delo Supervisor: Prof. Dr. Nada Lavrač Working Supervisor: Dr. Senja Pollak Ljubljana, Slovenia, August 2017 v Acknowledgments First and foremost I would like to thank my supervisor Nada Lavrač and my working supervisor Senja Pollak for their support, guidance, and patience during the two years I spent at the IPS. I believe our interdisciplinary collaboration has been most fruitful and can say I have acquired important skills and knowledge. I would also like to thank the members of the committee, Marko Robnik-Šikonja, Dunja Mladenić, and Ana Zwitter Vitez, for their comments and suggestions. Thank you for helping me improve my research. I am most grateful to the Janes national research project and Darja Fišer, the project leader, for allowing me to use the Twitter and blog data and for including me in the project tasks. There are many people who have contributed to this thesis and I would like to express my gratitude to each and every one of you: Matej Martinc for helping me with the tech- nical aspects of machine learning and kindly answering my n+1 emails; Luka Komidar for providing advice for the selection of statistical tests; Ben Verhoeven for his collaboration and the helpful discussions about the results; Vojko Gorjanc for the reflections on gender and language use; Jaka Čibej and Damjan Popič for the pep talks; and Gašper Pesek for the language check-up. Many thanks go to my friends and colleagues. Finally, I am grateful for the support I received from my parents, my brother, and my sister. And thank you, Timotej, for encouraging me and standing by my side no matter what. vii Abstract Language as a social phenomenon is subject to variation and change. Among the social factors of variation, the gender of speakers has been studied by sociolinguists with regard to the linguistic practices displayed by women and men. Advances in natural language processing have enabled researchers to model linguistic variation in large text corpora using automated approaches. In this Master Thesis we compare the language of women and men in Slovene user- generated content. User-generated content is a fairly new but increasingly popular domain and presents an important language resource due to its size, production in real time, and the informal nature of communication. In our analysis of gender-related variation, we rely on the approaches of computational stylometry, which aims to generalize the writing style into measurable patterns that can be compared between individual authors or groups of authors. We base the experiments on Slovene blog entries and Twitter messages (tweets), which were manually annotated for the authors’ gender. In the first step, we investigate the documents with machine learning techniques. Using a clustering algorithm, we construct topic ontologies of blogs to identify the predominant topics in entries by female and male bloggers. We then present automated methods that explore whether gender-related linguistic variation can provide predictive information to differentiate between women and men automatically. For this, we build two types of gender prediction models: a rule-based and a statistical classifier. The rule-based classification model takes into account the use of referential gender in verb phrases. It provides a baseline accuracy for more complex and time-consuming statistical models, for which we experiment with different features and learning algorithms. The best performing statistical models on tweets and blogs are further analyzed for the features ranked as most informative for female and male authors. Moreover, we present cross-genre experiments where the gender classifier was trained on one of the text genres and tested on the other. In the second step, we analyze the linguistic choices of female and male authors that have less to do with topic and more to do with the writing style. Specifically, we contrast the use of words typical of a writing style (e.g., profane language, emoticons, hedges and intensifiers) by applying the Mann-Whitney test for testing the statistical significance of the differences, and the squared Pearson correlation coefficient as a measure of effect size. The results of our experiments with gender prediction models show that rule-based classifiers work well, but noisy text might worsen their performance. For statistical models, the combination of Support Vector Machine as the learning algorithm and word unigrams as features achieves the best results on both tweets and blogs. The statistical analysis of stylistic variation further shows that the difference between female and male authors in the use of some discourses is statistically significant, though the effect size of gender on this variation is small. The results of cross-genre experiments with statistical models suggest that our tweet-trained model could be used for predicting gender in other Slovene user-generated content. ix Povzetek Jezik je družbeni pojav in je kot tak podvržen variaciji in spremembam. Med družbenimi dejavniki variacije sociolingvisti preučujejo tudi spol govorcev glede na jezikovne prakse, ki jih uporabljajo ženske in moški. Napredek v obdelavi naravnega jezika raziskovalcem omogoča, da modelirajo jezikovno variacijo v obsežnih besedilnih korpusih in z uporabo avtomatskih pristopov. V tem magistrskem delu analiziramo jezik žensk in moških v slovenskih spletnih uporab- niških vsebinah. Spletne uporabniške vsebine so dokaj nova, vendar vedno bolj priljubljena domena, zaradi obsežnosti, objavljanja v realnem času in neformalnega sporazumevanja pa predstavljajo pomemben jezikovni vir. V analizi jezikovne variacije z vidika spola se opiramo na pristope računalniške stilometrije, katere namen je posplošitev sloga pisanja na merljive vzorce, ki jih lahko nato primerjamo glede na posamezne avtorje ali skupine avtor- jev. Naša analiza je osnovana na slovenskih blogovskih zapisih in sporočilih z družbenega omrežja Twitter (tvitih), ki so ročno označeni s podatkom o avtorjevem spolu. V prvem delu naloge besedila analiziramo z metodami strojnega učenja. Z uporabo al- goritma gručenja izdelamo tematske ontologije, da prepoznamo prevladujoče teme v blogo- vskih zapisih moških in žensk. Nato predstavimo avtomatske metode, s katerimi raziščemo, ali lahko s pomočjo razlik v jezikovni rabi avtomatsko prepoznamo spol avtorja besedil. V ta namen zgradimo dva tipa napovednih modelov za spol: model s klasifikacijskimi pravili in statistični model. Model s klasifikacijskimi pravili za razlikovanje med avtorji upošteva rabo referenčnega spola v glagolskih zvezah. Točnost modela s pravili služi kot osnova za primerjavo s kompleksnejšimi in časovno potratnejšimi statističnimi modeli, za katere smo eksperimentirali z različnimi značilkami in algoritmi strojnega učenja. Najboljši napove- dni model za vsakega od besedilnih žanrov nato analiziramo na podlagi značilk, ocenjenih kot najbolj informativne za ženski ali moški razred. Poleg tega predstavimo čezžanrske poskuse, pri katerih napovedni model naučimo na enem žanru in testiramo na drugem. V drugem delu analiziramo jezik moških in žensk glede na razlike, ki niso vezane na temo besedila, pač pa na slog pisanja. Natančneje avtorje primerjamo glede na rabo besed, značilnih za določen slog (npr. kletvice, emotikoni, omejevalci in ojačevalci diskurza), z uporabo Mann-Whitneyjevega testa za testiranje statistične pomembnosti razlik. Za merjenje velikosti učinka uporabimo kvadrirani Pearsonov korelacijski koeficient. Rezultati poskusov so pokazali, da napovedni modeli s klasifikacijskimi pravili uspešno napovedujejo spol avtorja, vendar lahko šum v besedilih poslabša njihovo točnost. Stati- stični model, ki deluje na podlagi besednih unigramov kot značilk in metode podpornih vektorjev kot algoritma, se je izkazal kot najtočnejši tako na tvitih kot blogih. Statistična analiza slogovne variacije je pokazala, da so razlike v rabi nekaterih diskurzov statistično pomembne, vendar je velikost učinka spola na te razlike majhna. Glede na rezultate čez- žanrskih eksperimentov lahko sklepamo, da je naš napovedni model, naučen za tvitih, primeren za napovedovanje spola v drugih žanrih slovenskih spletnih uporabniških vsebin. xi Contents List of Figures xiii List of Tables xv Abbreviations xvii 1 Introduction1 1.1 Gender and Language .............................. 1 1.2 Motivation..................................... 3 1.3 Hypotheses and Goals .............................. 3 1.4 Scientific Contributions ............................. 5 1.5 Organization of the Thesis............................ 5 2 Background