FB-NEWS15: A Topic-Annotated Facebook Corpus for Emotion Detection and Sentiment Analysis

Lucia C. Passaro, Alessandro Bondielli and Alessandro Lenci
CoLing Lab, Dipartimento di Filologia, Letteratura e Linguistica
University of Pisa (Italy)

Abstract

English. In this paper we present the FB-NEWS15 corpus, a new Italian resource for sentiment analysis and emotion detection. The corpus has been built by crawling the Facebook pages of the most important newspapers in Italy and it has been organized into topics using LDA. In this work we provide a preliminary analysis of the corpus, including the most debated news in 2015.

Italiano. In questo lavoro presentiamo il corpus FB-NEWS15, un corpus italiano creato per scopi di sentiment analysis ed emotion detection. Il corpus è stato costruito scaricando le pagine Facebook delle maggiori testate giornalistiche in Italia e successivamente organizzato in topic utilizzando LDA. In questo articolo forniamo una analisi preliminare del corpus, e mostriamo le notizie più discusse nel 2015.

1 Introduction

The use of Social Network (SN) platforms like Facebook and Twitter has grown overwhelmingly in recent years. SN are exploited for different purposes, ranging from the sharing of content among friends and useful contacts to news-gathering in different domains such as politics and sports (Ahmad, 2010; Ahmad, 2013; Sheffer and Schultz, 2010). Many journalists indeed use SN platforms for professional reasons (Oriella, 2013; Hermida, 2013).

Several recent studies provide insights on how the popularity of blogs and other user-generated content has affected the way in which news is consumed and reported. Picard (2009) states that SN platforms provide an easy and affordable way to take part in discussions with larger groups of people and, consequently, the bond between SN and information is becoming increasingly strong. Mass information is gradually moving towards general platforms, and official websites are losing their lead position in providing information. As noted by Newman et al. (2012), even though the use of the internet grew in the years 2009-2012, the same is not reflected in the consumption of online newspapers, probably because of the increasing use of SN for news diffusion and gathering.

If on the one hand this apparent decline of the traditional news platforms may lead to a decline in quality and news coverage (Chyi and Lasorsa, 2002), on the other hand the rise of SN as platforms to spread news promotes a more fervid debate between users (Shah et al., 2005). This issue is central for the present work. In fact, users' comments very often contain their own opinions about a certain issue. In addition, because of the colloquial style of the comments, they contain large amounts of words and collocations with a highly subjective content, mostly concerning the author's emotive stance.

Facebook is one of the most popular online SN in the world, with 1 billion active users per month, and it offers the possibility to collect data from people of different ages, educational levels and cultures. From a linguistic point of view, previous studies (Lin and Qiu, 2013) demonstrated that the language in Facebook is more emotional and interpersonal compared, for example, to the language in Twitter. Probably, this is due to the fact that in Facebook there is a stronger psychological closeness between the author and the audience because of the different structure (bidirectional vs. unidirectional graphs) of the two SNs.

In this paper we present the FB-NEWS15 corpus, a new Italian resource for sentiment analysis and emotion detection. The FB-NEWS15 corpus can be freely downloaded at colinglab.humnet.unipi.it/resources/ under the Creative Commons Attribution License creativecommons.org/licenses/by/2.0.¹

CLIC_2016_Proceedings.indd 228 02/12/16 15.04

The debate among users commenting on news and posts on Facebook offers a lot of subjective material to study the way in which people express their own opinions and emotions about a target event. In fact, in FB-NEWS15 we find linguistic items expressing the whole range of positive and negative emotions. In analyzing a news corpus, however, it is not simple to aggregate the posts on the basis of a certain fact, since several posts relate to the same event. For this reason, we decided to organize the corpus into clusters of topically related news identified with Latent Dirichlet Allocation (LDA: Blei et al. (2003)). This approach allows us to infer the most debated news in the corpus and, in a second step, to discover the readers' sentiment about a particular topic.

The paper is organized as follows: Section 2 describes the creation of the corpus, from crawling (2.1) to linguistic annotation (2.2), and finally provides basic corpus statistics (2.3). Section 3 reports on the automatic topic extraction with LDA.

¹ All data collected have been processed anonymously for scientific purposes, without storing personal information.

2 FB-NEWS15

For the creation of the corpus we followed the most important Italian newspapers. Since we were interested in building a corpus as heterogeneous as possible, we decided to focus on major newspapers with different political orientations, which in general have heterogeneous readers.

Facebook allows users to post status updates, links, photos and videos on their own wall. In general, users can be divided into two macro-categories: People and Pages. People are often individuals, and the interaction with them is usually bidirectional (user A can read what user B publishes if A and B have a friendship relation). Conversely, Pages are typically used to represent organizations, public figures (web stars), companies or, as in our case, newspapers. In this case, the relationship is unidirectional, in the sense that user A can access the timeline of the page P by putting a "Like" on P. Unlike a single user, who usually publishes photos, videos and links about his or her private life, the timeline of a newspaper's Facebook page in general contains news titles with a link to the official website of the newspaper, where the user can read the full article. The corpus keeps track of the three-fold hierarchical structure of Facebook, which includes the news posts by the newspaper, the users' comments on the posts and the replies to the comments. In this context, it is clear that the emotive content of the post is often neutral, but this post can inspire long discussions among readers, which can become useful material for sentiment analysis and emotion detection. Figure 1 shows a post, with some of its comments and replies.

Figure 1: Example of post in Facebook with the relative comments and replies.

In order to create FB-NEWS15, we decided to download the timeline of the following newspapers, from 1 January to 31 December 2015: La Repubblica, Il Giornale, L'Avvenire, Libero, Il Fatto Quotidiano, Rainews24, Corriere Della Sera, Huffington Post Italia.

2.1 Crawling

Facebook offers developers Application Programming Interfaces (APIs) for creating apps with Facebook's native functionalities. In order to develop the crawler, we exploited the Graph API, which provides a simple view of the Facebook social graph by showing the objects in the graph and the connections between them. The Graph API allows us to navigate through the graph of the social network, which is organized into Nodes (Users, Pages, Photos and Comments) and Edges (Connections such as Friendship or Likes). The graph is navigated by means of HTTP requests, which may be implemented using any programming language. The native APIs offered by Facebook have some drawbacks: i) the maintenance of the app, since the APIs change over time, making it necessary to update the code of the crawler; ii) only public data can be accessed without requiring the user's consent; iii) Facebook places limitations on the number of requests within a given period of time. For each post, comment and reply, we stored the message (text), the story (presence of photos and links tags), its timestamp, the type (post, comment, reply), the parent post/comment, and the number of likes, shares and replies (Figure 2).

    <doc user="<newspaper(string)>"
         id="<id_post(string)>"
         type="post"
         parent_post=""
         parent_comment=""
         date="AAAA-MM-DD HH:MM:SS"
         location=""
         likes="662"
         comments="54"
         shares="322">
    Un business truffaldino [E ora
    finitela con l'eco-balla dei
    controlli sulle emissioni]
    </doc>

Figure 2: Example of crawled text.

The average number of posts for each newspaper is 27,341.25, while for comments and replies it is respectively 2,016,243.38 and 576,498.5. Table 1 shows the number of texts (including posts, comments and replies) in FB-NEWS15 for each newspaper and Figure 2.3 shows their cumulative distribution for each newspaper.

    NEWSPAPER              N. OF TEXTS
    La Repubblica            4,558,829
    Avvenire                    91,824
    Il Giornale              3,497,610
    Libero                   2,436,246
    Il Fatto Quotidiano      4,900,314
    Rainews24                  369,834
    Huffington Post          1,552,042
    Corriere della Sera      3,553,966
    OVERALL                 20,960,665

Table 1: Number of texts aggregated by newspaper in FB-NEWS15.

Table 2 shows the total number of tokens for each page and the average number of texts produced per post for each page. We can notice that the most followed newspapers on Facebook are Il Fatto Quotidiano and La Repubblica.

    NEWSPAPER              TOKENS        TEXTS/POSTS
    La Repubblica          96,059,756         182.61
    Avvenire                2,611,899          12.65
    Il Giornale            64,345,260          77.93
    Libero                 41,166,457          81.87
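
The corpus counts reported for FB-NEWS15 can be cross-checked with simple arithmetic. The sketch below is only an illustration: it copies the per-newspaper text counts from Table 1 (reading the La Repubblica figure, printed as "4558,829", as 4,558,829, which is an assumption) and the three per-newspaper averages quoted in the text, then checks that both routes lead to the reported overall total. The variable and function names are hypothetical and not part of the corpus distribution.

```python
# Number of texts per newspaper, copied from Table 1.
# Assumption: the La Repubblica entry is read as 4,558,829.
texts_per_newspaper = {
    "La Repubblica": 4_558_829,
    "Avvenire": 91_824,
    "Il Giornale": 3_497_610,
    "Libero": 2_436_246,
    "Il Fatto Quotidiano": 4_900_314,
    "Rainews24": 369_834,
    "Huffington Post": 1_552_042,
    "Corriere della Sera": 3_553_966,
}
reported_overall = 20_960_665  # OVERALL row of Table 1

# Per-newspaper averages quoted in the text (posts, comments, replies).
averages = (27_341.25, 2_016_243.38, 576_498.5)


def implied_total(per_page_averages, n_pages):
    """Total number of texts implied by per-page averages.

    Each average is scaled back to a count and rounded, because the
    published averages are themselves rounded to two decimals.
    """
    return sum(round(avg * n_pages) for avg in per_page_averages)


n_pages = len(texts_per_newspaper)
table_total = sum(texts_per_newspaper.values())
average_total = implied_total(averages, n_pages)

print(table_total == reported_overall)    # rows of Table 1 vs. OVERALL
print(average_total == reported_overall)  # averages vs. OVERALL
```

Under these assumptions both comparisons hold, which suggests that the averages in the text and the rows of Table 1 describe the same 20,960,665 texts across the eight pages.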