Overview of PAN 2019: Bots and Gender Profiling, Celebrity Profiling, Cross-Domain Authorship Attribution and Style Change Detection

Walter Daelemans1, Mike Kestemont1, Enrique Manjavacas1, Martin Potthast2, Francisco Rangel3, Paolo Rosso4, Günther Specht5, Efstathios Stamatatos6, Benno Stein7, Michael Tschuggnall5, Matti Wiegmann7, and Eva Zangerle5

Authors are listed in alphabetical order.

1 University of Antwerp, Antwerp, Belgium
2 Leipzig University, Leipzig, Germany, [email protected]
3 Autoritas Consulting, Valencia, Spain
4 Universitat Politècnica de València, Valencia, Spain
5 University of Innsbruck, Innsbruck, Austria
6 University of the Aegean, Samos, Greece
7 Bauhaus-Universität Weimar, Weimar, Germany

http://pan.webis.de

© Springer Nature Switzerland AG 2019. F. Crestani et al. (Eds.): CLEF 2019, LNCS 11696, pp. 402–416, 2019. https://doi.org/10.1007/978-3-030-28577-7_30

Abstract. We briefly report on the four shared tasks organized as part of the PAN 2019 evaluation lab on digital text forensics and authorship analysis. Each task is introduced and motivated, and the results obtained are presented. Altogether, the four tasks attracted 373 registrations, yielding 72 successful submissions. This, and the fact that we continue to invite the submission of software rather than its run output via the TIRA experimentation platform, marks a good start to the second decade of PAN evaluation labs.

1 Introduction

The PAN 2019 evaluation lab organized four shared tasks related to authorship analysis, i.e., the analysis of authors based on their writing style. Two of the tasks addressed the profiling of authors, covering traditional demographics as well as new ones, from two perspectives: (1) whether an author is a bot or a human, and (2) the public personas of celebrities in particular. A third task tackled the most traditional task of authorship analysis, authorship attribution, but from the new angle of attributing authors across different writing domains (i.e., topics). The fourth task addressed the important, yet exceedingly difficult problem of handling multi-author documents: detecting style changes within a given text written by more than one author.

The four tasks continue a series of shared tasks that has been organized for more than a decade now, starting with PAN 2009 [19] and preceded only by two PAN workshops at ECAI 2008 and SIGIR 2007, which laid the foundation for what was to come. Focusing on tasks from the areas of digital text forensics, text reuse, and judging the trustworthiness and ethicality of texts, we have by now assembled benchmarks for more than a dozen different tasks, many of which continue to be used for evaluations throughout the research community. In this paper, each of the following sections gives a brief, condensed overview of one of the four aforementioned tasks, including its motivation and the results obtained.

2 Bots and Gender Profiling

Author profiling aims at classifying authors depending on how language is shared by groups of people. This may allow identifying demographics such as age and gender, which can be of high interest from a marketing, security, and forensics perspective.
The research community has shown an increasing interest in the author profiling shared task over the past years, as evidenced by the growing number of participants.1 Having addressed several aspects of author profiling in social media from 2013 to 2018, the author profiling shared task of 2019 investigates whether the author of a Twitter feed is a bot or a human; in the case of a human, participants were further asked to profile the author's gender. As in previous years, we proposed the task from a multilingual perspective, covering English and Spanish. One of our main objectives was to demonstrate the feasibility of automatically identifying bots, as well as the difficulty of identifying bots that are more elaborate than basic information spreaders.

2.1 Evaluation Framework

To build the PAN-AP-2019 corpus,2 we combined Twitter accounts identified as bots in existing datasets with newly discovered ones, found on the basis of specific search queries. In both cases, a minimum of three annotators had to agree on the annotation, or else the Twitter user was discarded. To annotate gender, we followed the same methodology as in previous editions of the shared task. Table 1 shows the corpus statistics: the corpus is balanced per type (bot/human) and, in the case of humans, also per gender. Each author's feed comprises exactly 100 tweets.

1 In the past seven editions of the author profiling shared task at PAN, we have had 21 (2013 [26]), 10 (2014 [23]), 22 (2015 [20]), 22 (2016 [28]), 22 (2017 [27]), 23 (2018 [25]), and 55 (2019 [22]) participating teams, respectively.
2 We should highlight that we are aware of the legal and ethical issues related to collecting, analyzing, and profiling social media data [21], and that we are committed to legal and ethical compliance in our scientific research and its outcomes.

Table 1. Number of authors per language. The corpus is balanced regarding bots vs. humans, and regarding gender in the case of humans; it contains 100 tweets per author.

Dataset   English (EN)  Spanish (ES)
Training  4,120         3,000
Test      2,640         1,800

Participants were asked to submit two predictions per author: (1) whether the author is a bot or a human, and, in the case of a human, (2) whether the author is male or female. Participants were allowed to approach the task in only one of the two languages, and to address only one of the two subproblems (bots or gender). Classification accuracy has been employed for evaluation: we compute the accuracy for each subproblem and language separately and average the four values to obtain the final ranking.

2.2 Results

This year, 55 teams participated in the shared task. Table 2 shows the overall performance per language and the resulting ranking. The best results were obtained for bot identification (95.95% in English vs. 93.33% in Spanish) and for gender profiling (84.17% in English vs. 81.72% in Spanish). As can be seen, results for bot identification are higher than 90% in some cases, revealing the relative ease of this task. A more in-depth analysis is presented in the overview paper [22], where we show that certain types of bots are not as easy to detect as others, and the risks this entails. The overall best result (88.05%) was obtained by the author of [16], who approached the task with a Support Vector Machine using character and word n-grams as features.
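To make the kind of approach and the scoring scheme concrete, the following is a minimal, illustrative sketch, not the participants' actual code: one linear SVM per subproblem over combined character and word n-gram features, plus the final ranking score as the mean of the four per-subproblem, per-language accuracies. The n-gram ranges and all function names are assumptions introduced for illustration.

```python
# Illustrative sketch only (scikit-learn). The feature ranges below are
# assumptions; the winning approach [16] specifies only "character and
# word n-grams" as features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.svm import LinearSVC

def train_subtask_classifier(feeds, labels):
    """Train one classifier per subproblem (bot vs. human; gender for humans).

    `feeds` is a list of strings, each the concatenation of one author's
    100 tweets; `labels` holds the corresponding subproblem labels.
    """
    model = make_pipeline(
        make_union(
            TfidfVectorizer(analyzer="char", ngram_range=(2, 5)),  # char n-grams
            TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),  # word n-grams
        ),
        LinearSVC(),
    )
    return model.fit(feeds, labels)

def ranking_score(acc_bot_en, acc_bot_es, acc_gender_en, acc_gender_es):
    """Final ranking: mean of the four per-subproblem, per-language accuracies."""
    return (acc_bot_en + acc_bot_es + acc_gender_en + acc_gender_es) / 4

# Sanity check against Table 2: the winning entry's four accuracies
# average to its reported final score of 0.8805.
assert abs(ranking_score(0.9360, 0.9333, 0.8356, 0.8172) - 0.8805) < 5e-5
```

Training a separate classifier per subproblem mirrors the two predictions requested per author; teams addressing only one language or one subproblem would simply omit the corresponding models.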
It is worth mentioning the high performance obtained by the word and character n-gram baselines, even greater than that of word embeddings [12,13] and Low Dimensionality Statistical Embedding (LDSE) [24].

Table 2. Accuracy per subtask and language, and global ranking as average. Rows without a rank are the official baselines.

Ranking  Team                      Bots vs. Human    Gender            Average
                                   EN       ES       EN       ES
 1       Pizarro                   0.9360   0.9333   0.8356   0.8172   0.8805
 2       Srinivasarao & Manu       0.9371   0.9061   0.8398   0.7967   0.8699
 3       Bacciu et al.             0.9432   0.9078   0.8417   0.7761   0.8672
 4       Jimenez-Villar et al.     0.9114   0.9211   0.8212   0.8100   0.8659
 5       Fernquist                 0.9496   0.9061   0.8273   0.7667   0.8624
 6       Mahmood                   0.9121   0.9167   0.8163   0.7950   0.8600
 7       Ipsas & Popescu           0.9345   0.8950   0.8265   0.7822   0.8596
 8       Vogel & Jiang             0.9201   0.9056   0.8167   0.7756   0.8545
 9       Johansson & Isbister      0.9595   0.8817   0.8379   0.7278   0.8517
10       Goubin et al.             0.9034   0.8678   0.8333   0.7917   0.8491
11       Polignano & de Pinto      0.9182   0.9156   0.7973   0.7417   0.8432
12       Valencia et al.           0.9061   0.8606   0.8432   0.7539   0.8410
13       Kosmajac & Keselj         0.9216   0.8956   0.7928   0.7494   0.8399
14       Fagni & Tesconi           0.9148   0.9144   0.7670   0.7589   0.8388
 –       char nGrams (baseline)    0.9360   0.8972   0.7920   0.7289   0.8385
15       Glocker                   0.9091   0.8767   0.8114   0.7467   0.8360
 –       word nGrams (baseline)    0.9356   0.8833   0.7989   0.7244   0.8356
16       Martinc et al.            0.8939   0.8744   0.7989   0.7572   0.8311
17       Sanchis & Velez           0.9129   0.8756   0.8061   0.7233   0.8295
18       Halvani & Marquardt       0.9159   0.8239   0.8273   0.7378   0.8262
19       Ashraf et al.             0.9227   0.8839   0.7583   0.7261   0.8228
20       Gishamer                  0.9352   0.7922   0.8402   0.7122   0.8200
21       Petrik & Chuda            0.9008   0.8689   0.7758   0.7250   0.8176
22       Oliveira et al.           0.9057   0.8767   0.7686   0.7150   0.8165
 –       W2V (baseline)            0.9030   0.8444   0.7879   0.7156   0.8127
23       De La Peña & Prieto       0.9045   0.8578   0.7898   0.6967   0.8122
24       López Santillán et al.

3 Celebrity Profiling

Celebrities are a highly prolific population of Twitter users. They influence public opinion, are role models to their fans and followers, and are sometimes the voices of the disenfranchised. For these reasons, the "rich and famous" have been studied in the social sciences and economics as a matter of course, especially with regard to their presence on social media. Our recent seminal work on celebrity profiling [34] and this task at PAN 2019 introduce this particular group of people to computational linguistics. The task focuses on determining four demographics of celebrities based on their Twitter timelines: