What to Write? A Topic Recommender for Journalists

Giovanni Stilo and Paola Velardi
Sapienza University of Rome, Italy
{stilo, velardi}@di.uniroma1.it

Alessandro Cucchiarelli and Giacomo Marangoni and Christian Morbidoni
Università Politecnica delle Marche, Italy
{a.cucchiarelli, g.marangoni, c.morbidoni}@univpm.it

Proceedings of the 2017 EMNLP Workshop on Natural Language Processing meets Journalism, pages 19–24, Copenhagen, Denmark, September 7, 2017. © 2017 Association for Computational Linguistics

Abstract

In this paper we present a recommender system, What To Write and Why ($W^3$), capable of suggesting to a journalist, for a given event, the aspects still uncovered in news articles on which the readers focus their interest. The basic idea is to characterize an event according to the echo it receives in online news sources and associate it with the corresponding readers' communicative and informative patterns, detected through the analysis of Twitter and Wikipedia, respectively. Our methodology temporally aligns the results of this analysis and recommends the concepts that emerge as topics of interest from Twitter and Wikipedia, either not covered or poorly covered in the published news articles.

1 Introduction

In a recent study on the use of social media sources by journalists (Knight, 2012) the author concludes that "social media are changing the way news are gathered and researched". In fact, a growing number of readers, viewers and listeners access online media for their news (Gloviczki, 2015). When readers feel involved in news stories they may react by trying to deepen their knowledge of the subject, and/or by comparing their opinions with those of their peers. Stories may then solicit a reader's information and communication needs. The intensity and nature of both needs can be measured on the web, by tracking the impact of news on users' search behavior on online knowledge bases as well as their discussions on popular social platforms. What is more, the online public's reaction to news is almost immediate (Leskovec et al., 2009) and even anticipated, as in the case of planned media events and performances, or for disasters (Lehmann et al., 2012). Assessing the focus, duration and outcomes of news stories on public attention is paramount for both public bodies and media in order to determine the issues around which public opinion forms, and in framing the issues (i.e., how they are being considered) (Brooker and Schaefer, 2005). Furthermore, real-time analysis of public reaction to news items may provide useful feedback to journalists, such as highlighting aspects of a story that need to be further addressed, issues that appear to be of interest to the public but have been ignored, or even to help local newspapers echo international press releases.

The aim of this paper is to present a news media recommender, What to Write and Why ($W^3$), for analyzing the impact of news stories on the readers, and finding aspects, still uncovered in news articles, on which the public has focused its interest. The purpose of $W^3$ is to support journalists in the task of reshaping and extending their coverage of breaking news, by suggesting topics to address when following up on such news items. For example, we have found that a common pattern for news readers is to search Wikipedia for past events of the same type, which is not surprising per se: however, among the many possible similar events, our system is able to identify those that the majority of readers consider (sometimes surprisingly) highly associated with breaking news, e.g., searching for the 2013 CeaseFire program in Baltimore during Egypt's ceasefire proposal in Gaza in July 2014.

2 Methodology

Our methodology consists of five steps, as shown in the workflow of Figure 1:

Figure 1: Workflow of $W^3$. Each of the three streams (Twitter stream, news stream, Wikipedia page views) passes through event detection and intra-source clustering; inter-source alignment follows, then classification of communication needs and information needs, and recommendations based on each.

Step 1. Event detection: We use SAX*, an unsupervised temporal mining algorithm that we introduced in (Stilo and Velardi, 2016), to cluster tokens (words, entities, hashtags, page views) based on the shape similarity of their associated signals $s(t)$. In SAX*, signals observed in temporal windows $L_k$ are first transformed into strings of symbols of an alphabet $\Sigma$; next, strings associated with active tokens (those corresponding to patterns of public attention) are clustered based on their similarity. Each cluster is interpreted as related to an event $e_i$. Clusters are extracted independently from online news ($N$), Twitter messages ($T$) and Wikipedia page views ($W$).

For example, the cluster in Figure 2 shows Wikipedia page views related to the Malaysia Airlines crash in July 2014. We remark that SAX* blindly clusters signals without prior knowledge of the event and its occurrence date, and furthermore, it avoids time-consuming processing of text strings, since it only considers active tokens.

Figure 2: Cluster of normalized time series of Wikipedia page views for the Malaysia Airlines crash in July 2014. Series shown: Siberia_Airlines_Flight_1812, Ukraine, Malaysia_Airlines_Flight_17, 2014_pro-Russian_unrest_in_Ukraine, Surface-to-air_missile, Pan_Am_Flight_103, 2014_Russian_military_intervention_in_Ukraine; the time axis runs from 7/15/14 to 7/24/14.

Step 2. Intra-source clustering: Since clusters are generated in sliding windows $L_k$ of equal length $L$ and temporal increment $\Delta$, clusters referring to the same event but extracted in partly overlapping windows may slightly differ, especially for long-lasting events, when news updates motivate the emergence of new sub-topics and the decay of others. An example is in Figure 3, showing for simplicity a cluster with a unique signal $s(t)$, which we can also interpret as the cluster centroid. The figure also shows the string of symbols associated with the signal in each window (with $\Sigma = \{a, b\}$).

Figure 3: SAX* strings associated with a temporal series $s(t)$ in 5 adjacent or overlapping windows.

For a better characterization of an event, we merge clusters referring to the same event and extracted in adjacent windows, based on their similarity. Merged clusters form meta-clusters, denoted with $m_i^S$, where the index $i$ refers to the event and $S \in \{N, T, W\}$ to the data source. With reference to Figure 3, the signals in windows 1, 2, 3 and 4 would be merged, but not the signal in window 5.

An example from the $T$ dataset is shown in Table 1: note that the first two clusters show that initially Twitter users were concerned mainly about the tragedy (clusters C9 and C5), and only later did their interest focus on political aspects (e.g., Barack Obama, Vladimir Putin in C17 and C18).

Step 3. Inter-source alignment: Next, an alignment algorithm explores possible matches across the three data sources $N$, $T$ and $W$. For any event $e_i$, we eventually obtain three "aligned" meta-clusters $m_i^N$, $m_i^T$ and $m_i^W$, mirroring respectively the media coverage of the considered event and its impact on readers' communication and information needs.

Step 4. Generating a recommendation: The input to our recommender is the news meta-clusters $m_i^N$ related to an event $e_i$ first reported on day $d_0$ and extracted during an interval $I: d_0 \le d \le d_{0+x}$, where $d_{0+x}$ is the day in which the query is performed by the journalist. The system compares the three aligned meta-clusters

Table 1: The Twitter meta-cluster capturing the Malaysia Airlines flight crash event and its composing clusters

Clusters
C9: [tragic, crash, tragedi, Ukraine 1.0, Malaysia Airlines 0.6, Airline 0.66, Malaysia Airlines Flight 17 0.65, Malaysia 0.60, Russia 0.51, Aviation accidents and incidents 0.36, Airliner 0.35, Malaysia Airlines Flight 37 0.28, United States 0.21, Tragedy 0.20, Boeing 777 ...]
C5: [tragic, tragedi, Airline 1.0, Malaysia Airlines Flight 17 0.97, Malaysia Airlines 0.70, Malaysia 0.58, Ukraine 0.40, Twitter 0.39, Gaza Strip 0.32, CNN 0.26, Tragedy 0.25, God 0.24, Airliner 0.22, Israel 0.22, Malaysia Airlines Flight 370 0.22, Netherlands 0.21, ...]
C17: [tragedi, tragic, Malaysia Airlines Flight 17 1.0, Airline 0.89, Malaysia Airlines 0.62, Malaysia 0.54, Gaza Strip 0.42, Twitter 0.38, Ukraine 0.38, Hamas 0.33, Barack Obama 0.32, Israel 0.29, Vladimir Putin 0.27, God 0.26, CNN 0.25, Hell 0.25, Airliner 0.23, Malaysia Airlines Flight 37 0.20, ...]
C18: [tragedi, tragic, Malaysia Airlines Flight 17 1.0, Airline 0.98, Malaysia Airlines 0.80, Tragedy, Malaysia 0.54, Gaza Strip 0.50, Ukraine 0.48, Hamas 0.408, Israel 0.38, Barack Obama 0.7, Twitter 0.37, Vladimir Putin 0.36, CNN 0.32, Airliner 0.28, Malaysia Airlines Flight 370 0.26, Hell 0.252, ...]

Meta-cluster: [tragedi 0.22, tragic 0.22, airline 0.20, malaysia airlines flight 17 0.20, ukraine 0.19, malaysia airlines 0.19, malaysia 0.17, russia 0.129, tragedy 0.12, vladimir putin 0.12, airliner 0.12, crash 0.12, gaza strip 0.11, barack obama 0.11, aviation accidents and incidents 0.11, cnn 0.106, malaysia airlines flight 370 0.10, god 0.10, ...
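The merging of per-window clusters into meta-clusters described in Step 2 can be illustrated with a minimal Python sketch. This section does not specify the similarity measure, the merging threshold or the way meta-cluster weights are computed, so the cosine similarity, the 0.5 threshold, the sum-and-normalize weighting, the toy input weights and all function names below are assumptions made only for illustration, not the authors' implementation.

```python
from math import sqrt
from typing import Dict, List

Cluster = Dict[str, float]  # token -> weight inside one window's cluster


def cosine(a: Cluster, b: Cluster) -> float:
    """Cosine similarity between two sparse token-weight vectors (assumed measure)."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_adjacent(clusters: List[Cluster], threshold: float = 0.5) -> List[List[Cluster]]:
    """Group clusters from consecutive windows while each one stays
    similar enough to its predecessor (threshold is hypothetical)."""
    groups: List[List[Cluster]] = []
    for cluster in clusters:
        if groups and cosine(groups[-1][-1], cluster) >= threshold:
            groups[-1].append(cluster)
        else:
            groups.append([cluster])
    return groups


def meta_cluster(group: List[Cluster]) -> Cluster:
    """Sum token weights over a group and rescale them to sum to 1
    (an assumed weighting scheme), sorted by decreasing weight."""
    totals: Cluster = {}
    for cluster in group:
        for token, weight in cluster.items():
            totals[token] = totals.get(token, 0.0) + weight
    norm = sum(totals.values())
    if norm == 0:
        return {}
    ranked = sorted(totals.items(), key=lambda kv: -kv[1])
    return {token: weight / norm for token, weight in ranked}


# Toy clusters from three consecutive windows (illustrative weights only).
windows = [
    {"ukraine": 1.0, "airline": 0.66, "malaysia airlines": 0.6},
    {"airline": 1.0, "malaysia airlines": 0.7, "ukraine": 0.4},
    {"world cup": 1.0, "germany": 0.8},
]
groups = merge_adjacent(windows)
meta = meta_cluster(groups[0])
```

In this toy run the first two windows share most of their tokens and are merged, while the third shares none and starts a new group, which mirrors the behaviour described for windows 1 to 4 versus window 5 in Figure 3.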
