Summarizing Twitter Streams Related to Pre-Scheduled Events
Total Page:16
File Type:pdf, Size:1020Kb
“Winter is here”: Summarizing Twitter Streams related to Pre-Scheduled Events Anietie Andy Derry Wijaya Daphne Ippolito [email protected] [email protected] [email protected] University of Pennsylvania Boston University University of Pennsylvania Chris Callison-Burch [email protected] University of Pennsylvania Abstract • finally Jon and Dany meet and i’m freaking out #gots7 Pre-scheduled events, such as TV shows and • Daenerys: have you not come to bend the sports games, usually garner considerable at- knee? Jon Snow: i have not Daenerys. tention from the public. Twitter captures large These tweets reflect a part of or the whole scene volumes of discussions and messages related to these events, in real-time. Twitter streams that happened during this time period on the show related to pre-scheduled events are charac- i.e. Jon Snow meeting with Daenerys. terized by the following: (1) spikes in the Monitoring tweet streams - related to an event, volume of published tweets reflect the high- for information related to sub-events can be time lights of the event and (2) some of the pub- consuming partly because of the overwhelming lished tweets make reference to the charac- amount of data, some of which are redundant or ir- ters involved in the event, in the context in relevant to the sub-event. In this paper, we propose which they are currently portrayed in a sub- event. In this paper, we take advantage of a method to summarize tweet streams related to these characteristics to identify the highlights pre-scheduled events. We aim to identify the high- of pre-scheduled events from tweet streams lights of pre-scheduled events from tweet streams and we demonstrate a method to summarize related to the event and automatically summarize these highlights. We evaluate our algorithm on these highlights. Specifically we evaluate our al- tweets collected around 2 episodes of a popu- gorithm on tweets we collected around a popular lar TV show, Game of Thrones, Season 7. fantasy TV show, Game of Thrones. We will make this dataset available to the research community 2. 1 Introduction This paper makes the following contributions: Every week, pre-scheduled events, such as TV • Identify the highlights of pre-scheduled shows and sports games capture the attention of events from tweets streams related to the vast numbers of people. The first episode of the event and identify the character that had the seventh season of Game of Thrones (GOTS7), a most mentions in tweets published during the popular fantasy show on HBO, drew in about 10 highlight. million viewers during its broadcast1. • Identify the context in which this character During the broadcast of a popular pre-scheduled was being discussed in tweets published dur- event, Twitter users generate a huge amount ing the highlight and summarize the highlight of time-stamped tweets expressing their excite- by selecting the tweets that discuss this char- ments/frustrations, opinions, and commenting acter in a similar context. about the characters involved in the event, in the 2 Related Work context in which they are currently portrayed in a sub-event. For example, the following are some Some approaches to summarizing tweets related tweets that were published during a three minute to an event adapt or modify summarization tech- time period of an episode of GOTS7: niques that perform well with documents from • Bend the knee... Jon snow #gots7 news articles and apply these adaptations to tweets. In Sharifi et al.(2010a); Shen et al. 1https://en.wikipedia.org/wiki/Game_ of_Thrones_(season_7)#Ratings 2https://anietieandy.github.io (2013) a graph-based phrase reinforcement algo- highlights from this dataset and summarize these rithm was proposed. In Sharifi et al.(2010b) a hy- highlights. brid TF-IDF approach to extract one-or-multiple- Each episode of GOTS7 lasted approximately sentence summary for each topic was proposed. an hour. We used the Twitter streaming API to col- In Liu et al.(2011) an algorithm is proposed lect time-stamped and temporally ordered tweets that explores a variety of text sources for sum- containing ”#gots7”, a popular hashtag for the marizing twitter topics. In Harabagiu and Hickl show, while each episode was going on. We note (2011) an algorithm is proposed that synthesizes that filtering by hashtag gives us only some of the content from multiple microblog posts on the tweets about the show–we omit tweets that used same topic and uses a generative model which other GOTS7 related hashtags or no hashtags at induces event structures from the text and cap- all. Our dataset consists of the tweet streams for tures how users convey relevant content. In Mar- seven episodes of GOTS7; we collected the fol- cus et al.(2011), a tool called ”Twitnfo” was pro- lowing number of tweets: 32,476, 9,021, 4,532, posed. This tool used the volume of tweets re- 8,521, 6,183, 8,971, and 17,360 from episodes lated to a topic to identify peaks and summarize 1,2,3,4,5,6, and 7 respectively. these events by selecting tweets that contain a de- sired keyword or keywords, and selects frequent 3.1 Highlight Identification terms to provide an automated label for each peak. To identify the highlights of each episode, we In Takamura et al.(2011); Shen et al.(2013) a plot the number of tweets that were published per summarization model based algorithm was pro- minute for each minute of an episode. Since data posed based on the facility location problem. In at the minute level is quite noisy and to smooth out Chakrabarti and Punera(2011) a summarization short-term fluctuations, we calculated the mean of algorithm based on learning an underlying hid- the number of tweets published every 3 minutes den state representation of an event via hidden as shown by the red line in Figure1, which forms Markov models is proposed. Louis and New- peaks in the tweet volume. We observed the fol- man(2012) proposed an algorithm that aggre- lowing: (1) the spikes in the volume of tweets gates tweets into subtopic clusters which are then correspond to some exciting events/scenes during ranked and summarized by a few representative the show and (2) when there is a spike in the vol- tweets from each cluster (Shen et al., 2013). In ume of tweets, the characters involved in the sub- Nichols et al.(2012) an algorithm was proposed events (around that time period) spike in popu- that uses the volume of tweets to identify sub- larity as well in the published tweets. For exam- events, then uses various weighting schemes to ple, in episode 1, when the character Arya Stark perform tweet selection. Li et al.(2017) proposed wore Walder Freys face and poisoned all of house an algorithm for abstractive text summarization Frey, there was a spike in the volume of tweets based on sequence-to-sequence oriented encoder- at this time period; also Arya Stark and Walder decoder model equipped with a deep recurrent Frey spiked in popularity in tweets published in generative decoder. Nallapati et al.(2016) pro- this time period. posed a model using attentional endoder-decoder Several studies have suggested that a peak needs Recurrent Neural Network. to rise above a threshold to qualify it as a high- Our algorithm is different from the previous light in a given event. Hence, similar to Shamma work in that it identifies the character that had the et al.(2009); Gillani et al.(2017), we identify most mentions in tweets published in a highlight highlights of the events by selecting the peaks us- and identifies the context in which this character ing the mean and standard deviation of the peaks was being discussed in this highlight; it then sum- in all the tweets collected around the 7 episodes of marizes the highlight by selecting tweets that dis- GOTS7. cuss this character in a similar context. 3.2 Character Identification 3 Dataset To identify the characters involved in GOTS7, we select all the character names listed in the GOTS7 Our dataset consist of tweets collected around 7 Wikipedia page. It is common for tweets to men- episodes of a popular TV show, GOTS7. We al- tion nicknames or abbreviations rather that char- gorithmically identify points of elevated drama or acter full names. For example, in tweets col- Figure 1: Histogram of number of tweets published per minute during an episode of Game of Thrones. The red line, which forms peaks, shows the mean of the number of tweets published every 3 minutes. The names of the character that had the most mentions during each peak in tweet volume are also shown. lected around GOTS7 episode 1, the character sentence completion and lexical substitution than Sandor Clegane is mentioned 22 times by his full popular context representation of averaged word name and 61 times by his nickname “the hound.” embeddings. Context2vec learns sentential con- Therefore, for each character, we assemble a list text representation around a target word by feed- of aliases consisting of their first name (which ing one LSTM network with the sentence words for GOTS7 characters is unique), and the nick- around the target from left to right, and another names listed in the first paragraph of the charac- from right to left. These left-to-right and right- ter’s Wikipedia article. All characters end up hav- to-left context word embeddings are concatenated ing at most 2 aliases i.e. their first name and/or a and fed into a multi-layer perceptron to obtain nickname. For example, the nicknames for Sandor the embedding of the entire joint sentential con- Clegane are Sandor and the hound.