FB-NEWS15: A Topic-Annotated Facebook Corpus for Emotion Detection and Sentiment Analysis

Lucia C. Passaro, Alessandro Bondielli and Alessandro Lenci
CoLing Lab, Dipartimento di Filologia, Letteratura e Linguistica
University of Pisa (Italy)

Abstract

English. In this paper we present the FB-NEWS15 corpus, a new Italian resource for sentiment analysis and emotion detection. The corpus has been built by crawling the Facebook pages of the most important newspapers in Italy and has been organized into topics using LDA. In this work we provide a preliminary analysis of the corpus, including the most debated news in 2015.

Italiano. In questo lavoro presentiamo il corpus FB-NEWS15, un corpus italiano creato per scopi di sentiment analysis ed emotion detection. Il corpus è stato costruito scaricando le pagine Facebook delle maggiori testate giornalistiche in Italia e successivamente organizzato in topic utilizzando LDA. In questo articolo forniamo una analisi preliminare del corpus, e mostriamo le notizie più discusse nel 2015.

1 Introduction

The use of Social Network (SN) platforms like Facebook and Twitter has grown overwhelmingly in recent years. SN are exploited for different purposes, ranging from sharing content among friends and useful contacts to gathering news about different domains such as politics and sports (Ahmad, 2010; Ahmad, 2013; Sheffer and Schultz, 2010). Indeed, many journalists use SN platforms for professional reasons (Oriella, 2013; Hermida, 2013).

Several recent studies provide insights into how the popularity of blogs and other user-generated content has affected the way in which news is consumed and reported. Picard (2009) states that SN platforms provide an easy and affordable way to take part in discussions with larger groups of people and, consequently, the bond between SN and information is becoming increasingly strong. Mass information is gradually moving towards general platforms, and official websites are losing their lead position in providing information. As noted by Newman et al. (2012), even though internet use grew in the years 2009-2012, the same growth is not reflected in the consumption of online newspapers, probably because of the increasing use of SN for spreading and gathering news. If on the one hand this apparent decline of the traditional news platforms may lead to a decline in quality and news coverage (Chyi and Lasorsa, 2002), on the other hand the rise of SN as platforms for spreading news promotes a livelier debate among users (Shah et al., 2005). This issue is central to the present work. In fact, users’ comments very often contain their own opinions about a certain issue. In addition, because of their colloquial style, the comments contain large numbers of words and collocations with highly subjective content, mostly concerning the author’s emotive stance.

Facebook is one of the most popular online SN in the world, with 1 billion active users per month, and it offers the possibility to collect data from people of different ages, educational levels and cultures. From a linguistic point of view, previous studies (Lin and Qiu, 2013) showed that the language on Facebook is more emotional and interpersonal than, for example, the language on Twitter. This is probably due to the fact that on Facebook there is a stronger psychological closeness between author and audience because of the different structure of the two SN (bidirectional vs. unidirectional graphs).

In this paper we present the FB-NEWS15 corpus, a new Italian resource for sentiment analysis and emotion detection. The FB-NEWS15 corpus can be freely downloaded at colinglab.humnet.unipi.it/resources/ under the Creative Commons Attribution License creativecommons.org/licenses/by/2.0.¹

¹ All data collected have been processed anonymously for scientific purposes, without storing personal information.

The debate among users commenting on news and posts on Facebook offers a lot of subjective material for studying the way in which people express their own opinions and emotions about a target event. In fact, in FB-NEWS15 we find linguistic items expressing the whole range of positive and negative emotions. In analyzing a news corpus, however, it is not simple to aggregate the posts on the basis of a certain fact, since several posts relate to the same event. For this reason, we decided to organize the corpus into clusters of topically related news identified with Latent Dirichlet Allocation (LDA; Blei et al., 2003). This approach allows us to infer the most debated news in the corpus and, in a second step, to discover the readers’ sentiment about a particular topic.

The paper is organized as follows: Section 2 describes the creation of the corpus, from crawling (2.1) to linguistic annotation (2.2), and finally provides basic corpus statistics (2.3). Section 3 reports on the automatic topic extraction with LDA.

2 FB-NEWS15

For the creation of the corpus we followed the most important Italian newspapers. Since we were interested in building a corpus as heterogeneous as possible, we decided to focus on major newspapers with different political orientations, and which in general have heterogeneous readers.

Facebook allows users to post statuses, links, photos and videos on their own wall. In general, users can be divided into two macro-categories: People and Pages. People are often individuals, and the interaction with them is usually bidirectional (user A can read what user B publishes if A and B have a friendship relation). Conversely, Pages are typically used to represent organizations, public figures (web stars), companies or, as in our case, newspapers. In this case, the relationship is unidirectional, in the sense that user A can access the timeline of the page P by putting a "Like" on P. Unlike a single user, who usually publishes photos, videos and links about his or her private life, the timeline of a newspaper Facebook page generally contains news titles with a link to the official website of the newspaper, where the user can read the full article. The corpus keeps track of the threefold hierarchical structure of Facebook, which includes the news posts by the newspaper, the users’ comments on the posts and the replies to the comments. In this context, it is clear that the emotive content of the post is often neutral, but the post can inspire long discussions among readers, which can become useful material for sentiment analysis and emotion detection. Figure 1 shows a post, with some of its comments and replies.

Figure 1: Example of post in Facebook with the relative comments and replies.

In order to create FB-NEWS15, we decided to download the timelines of the following newspapers, from 1 January to 31 December 2015: La Repubblica, Corriere della Sera, L’Avvenire, Libero, Il Fatto Quotidiano, Rainews24, Il Giornale, Huffington Post Italia.

2.1 Crawling

Facebook offers developers Application Programming Interfaces (APIs) for creating apps with Facebook’s native functionalities. To develop the crawler, we exploited the Graph API, which provides a simple view of the Facebook social graph by showing the objects in the graph and the connections between them. The Graph API allows us to navigate through the graph of the social network, which is organized into nodes
(Users, Pages, Photos and Comments) and edges (connections such as Friendship or Likes). The graph is navigated through HTTP requests, which may be implemented in any programming language. The native APIs offered by Facebook have some drawbacks: i) the maintenance of the app, since the APIs change over time, making it necessary to update the code of the crawler; ii) only public data can be accessed without requiring the user’s consent; iii) Facebook places limitations on the number of requests in a given period of time. For each post, comment and reply, we stored the message (text), the story (presence of photos and links tags), its timestamp, the type (post, comment, reply), the parent post/comment, and the number of likes, shares and replies (Figure 2).

Un business truffaldino [E ora finitela con l’eco-balla dei controlli sulle emissioni]
(English: A fraudulent business [And now stop it with the eco-hoax of emissions checks])

Figure 2: Example of crawled text.

2.2 Linguistic annotation

A very basic preprocessing phase has been applied to the corpus before linguistic annotation, to replace URLs with the tag URL. The text has subsequently been fed to a pipeline of general-purpose NLP tools. In particular, it has been POS-tagged with the Part-Of-Speech tagger described in Dell’Orletta (2009) and dependency-parsed with the DeSR parser (Attardi et al., 2009). In addition, complex terms like forze dell’ordine (security forces) or toccare il fondo (hit rock bottom) have been identified using the EXTra term extraction tool (Passaro and Lenci, 2015).

2.3 Corpus Analysis

Except for Avvenire and Rainews24, for which we downloaded very little data, the newspapers are represented in the corpus in a balanced way. In general, the number of posts is very low compared to the number of comments and replies. [...] distribution for each Newspaper.

NEWSPAPER              N. OF TEXTS
La Repubblica            4,558,829
Avvenire                    91,824
Il Giornale              3,497,610
Libero                   2,436,246
Il Fatto Quotidiano      4,900,314
Rainews24                  369,834
Huffington Post          1,552,042
Corriere della Sera      3,553,966
OVERALL                 20,960,665

Table 1: Number of texts aggregated by Newspaper in FB-NEWS15.

Table 2 shows the total number of tokens for each page and the average number of texts produced for each post on each page. We can notice that the most followed newspapers on Facebook are Il Fatto Quotidiano and La Repubblica.

NEWSPAPER                   TOKENS   TEXTS/POSTS
La Repubblica           96,059,756        182.61
Avvenire                 2,611,899         12.65
Il Giornale             64,345,260         77.93
Libero                  41,166,457         81.87
Il Fatto Quotidiano     99,025,541        193.33
Rainews24                7,735,908         10.21
Huffington Post         32,587,065         84.06
Corriere della Sera     64,197,579         95.01
OVERALL                407,729,465         94.83

Table 2: Tokens and Texts/Posts ratio for each page.

3 Topics in FB-NEWS15

FB-NEWS15 contains texts referring to a large variety of events. In order to organize the corpus into clusters of thematically related news, we used LDA (Blei et al., 2003). LDA represents documents as random mixtures over latent topics, where each topic is characterized by a distribution over words. These random mixtures express a document’s semantic content, and document similarity can be estimated by looking at how similar the corresponding topic mixtures are. For topic identification we used the Mallet software (McCallum, 2002).
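The mixture view of LDA used in Section 3 can be written compactly. The following is only a sketch in the standard notation of the LDA literature; the symbols \theta_d, \phi_k and K do not appear in the original text and are introduced here purely for illustration:

```latex
p(w \mid d) \;=\; \sum_{k=1}^{K} \theta_{d,k}\,\phi_{k,w},
\qquad \sum_{k=1}^{K} \theta_{d,k} = 1,
\qquad \sum_{w} \phi_{k,w} = 1 .
```

Here \theta_d is the topic mixture of document d, \phi_k is the word distribution of topic k, and K is the number of topics. Under this reading, comparing two documents amounts to comparing their vectors \theta_d.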

Figure 3: Cumulative distribution of posts, comments and replies in FB-NEWS15 for each Newspaper.
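Section 3 notes that document similarity can be estimated by comparing topic mixtures, but the text does not say which measure is used. As a minimal sketch, assuming the Hellinger distance as the comparison (an assumption made here, not the authors' stated choice) and using invented 4-topic mixtures, the idea could look like this:

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions of equal length.

    Returns 0.0 for identical distributions and 1.0 for distributions
    with disjoint support.
    """
    if len(p) != len(q):
        raise ValueError("topic mixtures must have the same number of topics")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


def topic_similarity(p, q):
    """Similarity in [0, 1], defined here as 1 minus the Hellinger distance."""
    return 1.0 - hellinger(p, q)


# Hypothetical 4-topic mixtures for three posts (illustrative values only,
# not taken from FB-NEWS15).
post_a = [0.70, 0.20, 0.05, 0.05]
post_b = [0.65, 0.25, 0.05, 0.05]
post_c = [0.05, 0.05, 0.20, 0.70]

print(topic_similarity(post_a, post_b))  # close to 1: similar topic mixtures
print(topic_similarity(post_a, post_c))  # much lower: different dominant topics
```

Other measures over probability vectors, such as the Jensen-Shannon divergence, would fit the same interface; the choice shown is only one reasonable option.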

3.1 Selecting the vocabulary

Since we were interested in extracting the topics from the news articles, we built the model on the portion of FB-NEWS15 containing the posts published by the newspapers (FB-NEWS15 posts). In particular, we used entropy (Dumais, 1990) as a global term weighting and we selected for training the terms (nouns, adjectives, verbs and complex terms) with a high informative value (threshold fixed to 0.3), while using the remaining words as stopwords in Mallet (McCallum, 2002).

3.2 Extracting topics from posts

In order to determine the most debated topics in 2015, we used LDA to assign 50 topics to the posts in FB-NEWS15 posts and we navigated the graph to assign the topics to the comments and the replies. Later, we restricted the topics associated with a post P to the topics T having a probability higher than the 90th percentile of the topic distribution of P. In this way, each post has been assigned, on average, to 3.06 topics. Finally, comments and replies have inherited the probability of belonging to the topic T from their parent post.

Among the extracted topics, ranked according to the sum of these probabilities, we can find national and foreign politics, terrorism and church, but also food, football, cinema and weather forecasts. We report some topics below, with the number of texts and the relative ranking (i.e., rank 1 is given to the topic with the highest number of texts).

NATIONAL POLITICS (2,516,640 TEXTS, RANK 1): {Renzi, presidente, premier, Mattarella, riforma, Alfano, senato, camera, Boschi, aula} (Renzi, president, Mattarella, reform, Alfano, senate, chamber, Boschi, hall)

SCHOOL (1,707,145 TEXTS, RANK 2): {scuola, giovane, studente, protesta, corso, mancare, sospendere, inglese, spiegare, lezione} (school, young, protest, class, lack, suspend, English, explain, lesson)

CRIME (1,543,735 TEXTS, RANK 7): {uccidere, polizia, arrestare, fermare, sparare, uomo, poliziotto, colpo, ferire, agente} (kill, police, detain, stop, open fire, man, policeman, bump, wound, police officer)

ISIS (1,267,749 TEXTS, RANK 16): {Isis, guerra, siria, minaccia, U.S.A., Libia, colpire, islamico, usare, jihadisti} (Isis, war, Syria, threat, U.S.A., Libya, damage, islamic, use, jihadist)

FOOD (949,520 TEXTS, RANK 40): {mangiare, ricetta, cibo, preparare, consiglio, evitare, perfetto, trucco, salute, semplice} (eat, recipe, food, prepare, advice, avoid, perfect, trick, health, simple)

FOOTBALL (606,560 TEXTS, RANK 50): {seguire la diretta, guardare il video, campo, calcio, serie, Napoli, Milan, segnare, battere, partita} (follow the live, look at the video, football field, football, league, Naples, Milan)

4 Conclusions and ongoing work

As one of the most widespread social networks, Facebook offers the possibility to collect opinionated pieces of text from people of different ages, cultures and education. The composition of FB-NEWS15, in which each comment is explicitly associated with a particular post, allows us to study the differences in readers’ perceptions of a particular topic. Unlike other social media such as Twitter, Facebook contains longer texts including a lot of subjective expressions that are very useful for the construction of sentiment and emotive lexicons.

Starting from previous works (Passaro et al., 2015; Passaro and Lenci, 2016), we plan to use this corpus to build lexical resources for sentiment analysis and emotion detection, which will include both words and complex terms. In addition, we plan to optimize the topic modeling phase and to investigate the possibility of using the extracted topics as a prior for inferring the sentiment orientation of a particular comment.

References

A. Ahmad. 2010. Is twitter a useful tool for journalists? Journal of Media Practice, 11(2):145–155.

A. Ahmad. 2013. What’s in a tweet? Foreign correspondents’ use of social media. Journalism Practice, 7(1):33–46.

G. Attardi, F. Dell’Orletta, M. Simi, and J. Turian. 2009. Accurate dependency parsing with a stacked multilayer perceptron. In EVALITA 2009 Evaluation of NLP and Speech Tools for Italian 2009, LNCS, Reggio Emilia (Italy). Springer.

D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022.

H. Chyi and D. L. Lasorsa. 2002. An explorative study on the market relation between online and print newspapers. Journal of Media Economics, 15(2):91–106.

F. Dell’Orletta. 2009. Ensemble system for part-of-speech tagging. In EVALITA 2009 Evaluation of NLP and Speech Tools for Italian 2009, LNCS, Reggio Emilia (Italy). Springer.

S. T. Dumais. 1990. Enhancing performance in latent semantic indexing (LSI) retrieval. Technical Report TM-ARH-017527.

A. Hermida. 2013. #journalism. Reconfiguring journalism research about twitter, one tweet at a time. Digital Journalism.

H. Lin and L. Qiu. 2013. Two sites, two voices: Linguistic differences between facebook status updates and tweets. In P. L. Patrick Rau, editor, Cross-Cultural Design. Cultural Differences in Everyday Life: 5th International Conference, CCD 2013, Held as Part of HCI International 2013, volume 2, pages 432–440, Las Vegas (USA). Springer Berlin Heidelberg.

A. K. McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.

N. Newman, W. H. Dutton, and G. Blank. 2012. Social media in the changing ecology of news: The fourth and fifth estate in britain. International Journal of Internet Science, 7(1):6–22.

Oriella. 2013. The new normal for news. Have global media changed forever? The 6th Annual Oriella Digital Journalism Survey.

L. C. Passaro and A. Lenci. 2015. Extracting terms with extra. In Proceedings of the EUROPHRAS 2015 Computerised and Corpus-based Approaches to Phraseology: Monolingual and Multilingual Perspectives, pages 188–196, Malaga (Spain).

L. C. Passaro and A. Lenci. 2016. Evaluating context selection strategies to build emotive vector space models. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA), May.

L. C. Passaro, L. Pollacci, and A. Lenci. 2015. Item: A vector space model to bootstrap an italian emotive lexicon. In Proceedings of the second Italian Conference on Computational Linguistics CLiC-it 2015, pages 215–220, Trento (Italy).

R. Picard. 2009. Blogs, tweets, social media, and the news business. Nieman Reports, 63(3):10–12.

D. V. Shah, J. Cho, W. P. Eveland, and N. Kwak. 2005. Information and expression in a digital age. Communication Research, 32(10):531–565.

M. L. Sheffer and B. Schultz. 2010. Paradigm shift or passing fad? Twitter and sports journalism. International Journal of Sport Communication, 3(4):472–484.

CLIC_2016_Proceedings.indd 232 02/12/16 15.04