Authorship Attribution and Author Profiling of Lithuanian Literary Texts
Total Page:16
File Type:pdf, Size:1020Kb
Authorship Attribution and Author Profiling of Lithuanian Literary Texts Jurgita Kapociˇ ut¯ e-Dzikien˙ e˙ Andrius Utka Ligita Šarkute˙ Vytautas Magnus University Vytautas Magnus University Kaunas Univ. of Technology K. Donelaicioˇ 58, LT-44248, K. Donelaicioˇ 58, LT-44248, K. Donelaicioˇ 73, LT-44029, Kaunas, Lithuania Kaunas, Lithuania Kaunas, Lithuania [email protected] [email protected] [email protected] Abstract posts, Internet comments, tweets, etc.) the au- thorship analysis is becoming more and more top- In this work we are solving authorship at- ical. In this respect it is important to consider tribution and author profiling tasks (by fo- the anonymity factor, as it allows everyone to ex- cusing on the age and gender dimensions) press their opinions freely, but on the other hand, for the Lithuanian language. This paper opens a gate for different cyber-crimes. Therefore reports the first results on literary texts, the authorship research –which for a long time which we compared to the results, pre- in the past was mainly focused on literary ques- viously obtained with different functional tions of unknown or disputed authorship– drifts to- styles and language types (i.e., parliamen- wards more practical applications in such domains tary transcripts and forum posts). as forensics, security, user targeted services, etc. Available text corpora, linguistic tools, and sophis- Using the Naïve Bayes Multinomial and ticated methods even more accelerate the devel- Support Vector Machine methods we in- opment of the authorship research field, which is vestigated an impact of various stylistic, no longer limited to authorship attribution (when character, lexical, morpho-syntactic fea- identifying who, from a closed-set of candidate tures, and their combinations; the differ- authors, is the actual author of a given anonymous ent author set sizes of 3, 5, 10, 20, 50, text document) only. Other research directions in- and 100 candidate authors; and the dataset volve the author verification task (when deciding sizes of 100, 300, 500, 1,000, 2,000, and if a given text is written by a certain author or 5,000 instances in each class. The high- not); the plagiarism detection task (when search- est 89.2% accuracy in the authorship at- ing for similarities between two different texts or tribution task using a maximum number parts within a single text); the author profiling of candidate authors was achieved with task (when extracting information about author’s the Naïve Bayes Multinomial method and characteristics, typically covering the basic demo- document-level character tri-grams. The graphic dimensions as the age, gender, native lan- highest 78.3% accuracy in the author pro- guage or psychometric traits); etc. In this paper filing task focusing on the age dimension we focus on authorship attribution (AA) and au- was achieved with the Support Vector Ma- thor profiling (AP) problems covering the age and chine method and token lemmas. An ac- gender dimensions. curacy reached 100% in the author profil- ing task focusing on the gender dimension Some researchers claim that in the scenario with the Naïve Bayes Multinomial method when an author makes no efforts to modify his/her and rather small datasets, where various writing style, authorship identification problems lexical, morpho-syntactic, and character can be tackled due to an existing “stylometric fin- feature types demonstrated a very similar gerprint” notion –an individual and uncontrolled performance. habit to express thoughts in certain unique ways, which is kept constant in all writings by the same 1 Introduction author. Van Halteren (2005) even named this phe- nomenon a “human stylome” in analogy to a DNA With the constant influx of anonymous or “genome”. However, Juola (2007) argues that pseudonymous electronic text documents (forum strict implications are not absolutely correct, be- 96 Proceedings of the 5th Workshop on Balto-Slavic Natural Language Processing, pages 96–105, Hissar, Bulgaria, 10–11 September 2015. cause the “genome” is stable, but the writing style ority of the SB methods over the SML techniques, tends to evolve over time. More stable (e.g., gen- e.g., Memory-Based Learning produced better der) and changing (e.g., age, social status, educa- results compared to Naïve Bayes and Decision tion, etc.) demographic characteristics affect the Trees (Zhao and Zobel, 2005); the Delta method writing style, thus making AP a solvable task for surpassed performance levels achieved by the pop- these dimensions. ular Support Vector Machine method (Jockers and With the breakthrough of the Internet era lit- Witten, 2010). However, the SB approaches are erary texts in the authorship research were grad- considered to be more suitable for the problems ually replaced with e-mails (Abbasi and Chen, with a big number of classes and limited training 2008), (de Vel et al., 2001), web forum mes- data, e.g., the Memory-Based Learning method sages (Solorio et al., 2011), online chats (Cristani applied on 145 authors outperformed Support Vec- et al., 2012), (Inches et al., 2013), Internet tor Machines (Luyckx and Daelemans, 2008), ap- blogs (Koppel et al., 2011) or tweets (Sousa-Silva plied on 100,000 candidate authors outperformed et al., 2011; Schwartz et al., 2013), which, in Naïve Bayes, Support Vector Machine and Reg- turn, contributed to a development of Computa- ularized Least Squares Classification (Narayanan tional Linguistic methods able to cope effectively et al., 2012). In our research we have at most with the following problems: short texts, non- 100 candidate authors, 6 age and 2 gender groups, normative language texts, many candidate authors, therefore the SML approaches seem to be the most etc. The discovered advanced techniques helped suitable choice. Besides, many AA and AP tasks to achieve even higher accuracy in AA and AP are solved using the popular Support Vector Ma- tasks on literary texts (under so-called “ideal con- chine method, which in the contemporary compu- ditions”). Although the research on literary texts tational research is considered as the most accu- have lost popularity due to the decrease in demand rate, thus the most suitable technique for differ- of their practical applications, the results obtained ent text classification problems (e.g., superiority on literary texts can still be interesting from the of Support Vector Machine is proved in (Zheng et scientific point of view, as it may perform some al., 2006)). However, a selection of classification kind of a baseline function in the comparative re- method itself is not as important as a proper selec- search. In this respect we believe that our present tion of an appropriate feature type. paper, which focuses on literary texts, will deliver valuable results. Besides, the obtained AA and Starting from Mendenhall (1887) the first sty- AP results will be compared with the results pre- lometric techniques were based on the quanti- viously reported on the Lithuanian language cov- tative features (so-called “style markers”) such ering other functional styles and language types. as a sentence or word length, number of sylla- bles per word, type-token ratio, vocabulary rich- 2 Related Works ness function, lexical repetition, etc. (for review see (Holmes, 1998)). However, these feature types Despite archaic rule-based approaches (attribut- are considered to be suitable only for homoge- ing texts to authors/characteristics depending on neous and long texts (e.g., entire books) and for a set of manually constructed rules) and some the datasets having only a few candidate authors. rare attempts to deal with unlabeled data (Nasir The first modern pioneering work of Mosteller and et al., 2014; Qian et al., 2014) automatic AA Wallace (1963) –who obtained promising AA re- and AP tasks are tackled with Supervised Ma- sults on The Federalist papers with the Bayesian chine Learning (SML) or Similarity-Based (SB) method applied on frequencies of a few dozens techniques (for review see (Stamatatos, 2009)). function words– triggered many posterior exper- In the SML paradigm, texts of known author- iments with various feature types. In the contem- ship/characteristic (training data) are used to con- porary research the most widespread approach is struct a classifier which afterwards attributes to represent text documents as vectors of frequen- anonymous documents. In the SB paradigm, an cies, which elements cover specific layers of lin- anonymous text is attributed to the particular au- guistic information (lexical, morpho-syntactic, se- thor/characteristic whose text is the most simi- mantic, character, etc.). The best feature types are lar according to some calculated similarity mea- determined only after an experimental investiga- sure. The comparative experiments prove superi- tion. 97 Since most AA and AP research works deal solved using the Support Vector Machine method with Germanic languages, providing no recom- with the radial basis (Reicher et al., 2010). The mendations that could work with the morpholog- researchers used 4 datasets (newspaper texts of ically rich, highly inflective, derivationally com- 25 authors, on-line blogs of 22 authors, Croat- plex languages (such as Lithuanian), having rela- ian literature classics of 20 authors, Internet fo- tively free word order in sentences, our focus is rum posts of 19 authors). They tested a big va- on the research done for the Baltic and Slavic lan- riety of features and their combinations: func- guages, which by their nature and characteristics tion words, idf weighted function words, frequen- are the most similar to Lithuanian. cies of coarse-grained part-of-speech tags, fine- The AA experiments for the Polish language grained part-of-speech tags with normalized fre- were performed with the literary texts of 2 au- quencies, part-of-speech tri-grams, part-of-speech thors using the feed-forward multilayer Percep- tri-grams with function words, other features (in- tron method (with one or two hidden layers) and cluding punctuation, frequencies of word lengths, the sigmoid activation function trained by the sentence-length frequency values, etc.).