PRELIMINARY PREPRINT VERSION: DO NOT CITE The AAAI Digital Library will contain the published version some time after the conference.

Twitter Event Summarization by Exploiting Semantic Terms and Graph Network

Quanzhi Li, Qiong Zhang , US Bellevue, WA 98004, USA {[email protected], [email protected]}

Abstract choose only one to three tweets as event summary. Other is a fast communication channel for gathering and applications may want to select more tweets as summary. spreading breaking news, and it generates a large volume of To find the best tweets to represent an event, we first need tweets for most events. Automatically creating a summary for to understand the event characteristics. An event is usually an event is necessary and important. In this study, we explored defined by the 4Ws questions: who, what, where and when two extractive approaches for summarizing events on Twitter. (Mohd 2007). An event tweet usually contains terms corre- The first one exploits the semantic types of event related sponding to these aspects, and these terms can be classified terms, and ranks the tweets based on the score computed from into different semantic classes/types, such as entity names these semantic terms. The second one utilizes a graph convo- (who), location (where), hashtag (topic & what), temporal lutional network built from a tweet relation graph to generate expression (when), verb & noun (what), and mention (who). tweet hidden features for tweet salience estimation. And the Different semantic types have different degrees of im- most salient tweets are selected as the summary of the event. portance for describing an event, and this will have implica- Our experiments show that these two approaches outperform tions on both event detection and event summarization algo- the compared methods. rithms. Previous studies on event summarization from social media have not explicitly exploited this type of information. They do not classify the terms into different semantic clas- Introduction ses; they just treat them equally important in their algo- rithms, e.g. in Hybrid TF-IDF (Inouye and Kalita 2016) and Social media services, such as Twitter, generate a large vol- sumBasic (Vanderwende et al. 2007). In the first approach, ume of content (tweets) for most events. This study aims at we take the semantic type of a term into consideration when exploring new summarization methods for real-time events determining its weight. We hypothesize that classifying detected from Twitter. Text summarization methods can be tweet terms into their corresponding semantic classes, as- classified into two categories. Extractive methods select a signing different weights to them, and then integrating them subset of words, phrases, or sentences from the original doc- together will improve the event summarization perfor- ument to generate a summary. In contrast, abstractive ap- mance. Therefore, in the first approach, we split the term proaches generate a summary by using words and phrases space into groups of terms, or semantic classes. that may not appear in the source text. Our approaches be- In the second approach, we consider the tweet event long to the extractive method category. We choose one or summarization task as a special multi-document summari- more tweets from a event cluster to represent the event. Usu- zation problem, and utilize graph-based neural networks. ally, only one or two tweets are selected. One reason of se- The tweets selected as the summary should be both repre- lecting only a couple of tweets is that many applications pre- sentative and informative. To be representative in an event fer a very short summary, due to their limited UI space or cluster, a tweet needs to convey similar information with other reasons. The proposed approaches have been adapted other tweets in this cluster. We use a tweet relation graph to in three production systems, which need to monitor real- measure how representative a tweet is. In this graph, each time events on different topics. They prefer selecting only node is a tweet, and the edge weight measures how similar one to three tweets per event as summary, in order to have two nodes are. To reflect the informative aspect, we calcu- more events displayed on one screen. In this study, our ex- late a score for each node in the relation graph, using the periment also shows that human annotators usually also semantic terms the tweet contains. A tweet having more

Copyright © 2021, Association for the Advancement of Artificial Intelli- gence (www.aaai.org). All rights reserved. semantic terms usually will be more informative. To better Ling 2016; Cheng and Lapata 2016; Nallapati et al. 2017; utilize the tweet relation information and the representa- See et al. 2017; Cao et al. 2017). Chang et al. (2013; 2016) tional power of neural network, we apply graph convolu- regard Twitter summarization as a supervised classification tional network (GCN) (Kipf and Welling 2017; Yasunaga et task through mining rich social features, such as user influ- al. 2017) on the tweet relation graph. A salience score is es- ence. Wang and Ling (2016) employ encoder-decoder timated for each tweet based on the hidden features gener- RNNs to generate short abstractive summaries for opinions. ated by GCN, and the most salient ones are chosen as the Cheng and Lapata (2016) train an extractive summarization summary. We evaluated the two approaches using 1,000 approach using attention-based encoder-decoder RNNs to events detected from Twitter. sequentially label summary-worth sentences. See et al. (2017) augment the standard attention-based encoder-de- coder RNNs using pointer generator network. SUM- Related Studies MARUNNER (Nallapati et al. 2017) is an abstractive sum- marization model using an RNN-based encoder. Narayan et There are many studies on text summarization, but only a al. (2018) proposed a reinforcement learning-based system few of them focus on social media. One of the most popular trained by globally optimizing the ROUGE score. The extractive summarization methods is the centroid-based ap- model from Zhou et al. (2018) extracts sentences from a proach (Becker et al. 2011; Radev et al. 2004). Becker et al. document by jointly learning to score and select sentences. (2011) studied three centroid-based methods, and their ex- LATENT (Zhang et al., 2018) uses a latent model to directly periment showed that the Centroid method outperformed the maximize the likelihood of human summaries given se- other two, Degree and LexRank. LexRank (Erkan and lected sentences. Liu et al. (2019) use structured attention to Radev 2004) is a graph-based approach, which creates an induce a multi-root dependency tree representation of the adjacency matrix for text units and computes the stationary document while predicting the output summary. Most of distribution considering it as a Markov chain. TextRank these models focus on either abstractive method or single (Mihalcea and Tarau 2004) is another graph-based one, us- document. Applying these sequence-to-sequence ap- ing PageRank to find the highly ranked sentences. proaches to the multi-document summarization task has not sumBasic is a simple but effective method, which mainly been successful. He and Duan (2018) utilize the Twitter net- depends on word frequency (Vanderwende et al. 2007). It work structure to explore whether social relations can help chooses the tweet that has the highest sum of word proba- Twitter summarization. Wang and Zhang (2017) build a bility, which is computed from word frequency. To handle joint model to filter, cluster, and summarize the tweets for the special situation of tweet event, Inouye and Kalita (2016) new events. Their summarization model uses a multi-layer redefine TF-IDF in terms of a hybrid document. In their Hy- perceptron network to estimate a probability score for each brid TF-IDF approach, the TF component of the TF-IDF for- tweet and then rank them. The two approaches we presented mula uses the entire collection of tweets, while the IDF com- in this paper are different from previous studies. ponent treats each tweet as a separate document. Experi- ments from (Inouye and Kalita 2016; Alsaedi et al. 2016) show that Hybrid TF-IDF and sumBasic outperform other Semantic Class Based Approach approaches. Alsaedi et al. (2016) augmented the TF-IDF and Centroid methods by considering the time window of an In the semantic class-based approach, we have two methods event. This method is mainly used for retrospective events. to rank the tweets. We first describe each semantic class, and Mishra and Berberich (2017) propose a novel divergence- then present the two methods, Semantic-A and Semantic-B, based framework that selects excerpts from an initial set of for selecting the best tweets as the summary of an event. pseudo-relevant documents. Their algorithm requires an ex- Given a tweet, the first method (i.e., Semantic-A) integrates plicit user input, and it mainly works for long text. This is different semantic terms together to compute a tweet score, different from our case, which is to summarize tweets with- and the second method (i.e., Semantic-B) uses learning to out using any user query. rank algorithm to decide if a tweet is the top candidate. We The task B of TREC 2018 real-time summarization track call a term belonging to one of the semantic classes de- (Sequiera et al, 2018) asks the participating systems to out- scribed below as “semantic term”. Previous study (Li et al, put a list of tweets ordered by relevance to users’ interests 2017) has used semantic terms to detect event clusters from every day during the evaluation period. This task is different Tweet data streaming. from our case, which asks for the general summary of an event, not specific to a user’s interest. Four teams partici- Semantic Classes pated in this task, and the team having the best performance uses a method adapted from (Chellal et al. 2017), which Named Entity. In an event, “who” is usually defined by ranks the tweets based on their cosine similarity with user’s proper nouns (named entities), such as people or organiza- interest. Except the similarity calculation for user interest tion names. Occasionally, “what” is also represented by relevance, this method is basically the same as the “the ear- named entities. Location is also represented by named enti- liest tweet” approach tested in our experiments. ties, but in this study, we have a separate class for location. Recent text summarization studies on single document The entity extraction model uses three vectors to form the have employed neural networks (Cao et al. 2016; Wang and word representation: the word embedding, a character representation vector generated from Bi-LSTM model, and Tweet Score Normalization. In Equation 2, we do not another character representation generated from CNN consider the length of a tweet. This means it will always bias model. And a word level Bi-LSTM model is used to detect toward longer tweets, which is not desired, although usually the entity label, followed by a CRF layer to exploit the cor- a longer text segment conveys more information. We define relation between labels in local context and the sentence, by a normalization factor, which is the logarithm of the tweet jointly learning the best label chain for a given sentence. Ex- length (number of words, excluding stop words) plus 1, to periment on the English data from CoNLL 2003 (Sang and normalize the tweet score by dividing it by this factor. Meulder 2003) shows that the performance of this approach Another factor we considered for adjusting the tweet is comparable to that of the state-of-the-art algorithms ( score is the retweet information. Retweeting reinforces a and Hovy 2016; Yadav and Bethard 2018; Yadav et al. message, and the number of retweets is an indication of pop- 2018). If an entity is identified as location, then it will be ularity. The following equation shows how the tweet score classified as location. Therefore, in this paper, when we talk is normalized: about named entities, it only refers to people or organization. Mention. Mention (e.g. @realTrump) has been ignored by Z = (3) previous studies, but we found that in many events “who” is represented by a mention. Mentions have the same semantic where Y is the tweet’s score from Equation 2, r is the retweet meaning as named entities, but we put them in a separate value of this tweet, and n is the number of words in this class because of its special usage pattern. tweet. Here we use log(1+r) and log(1+n), to avoid a value Location. Locations are identified by our entity extraction of 0 when r or n is 1. The tweet with the highest score is algorithm described above. selected as the summary of the event. We also tested other Verb. Verbs usually describe the “what” aspect of an equations to deal with the retweet and tweet length prob- event, such as killed and sued. A stop word list is used to lems, and Equation 3 gave us the best performance. filter out the very common words, such as take, do, etc. The features described above convey different types of Noun. Same as verbs, nouns, such as hurricane and bomb- information: syntactic, semantic, co-occurrence, popularity, ing, are also used to describe the “what” aspect of an event. time, and hashtag composition. We hypothesize that inte- Hashtag. Hashtags usually describe the topic of an event, grating them together will improve the performance, since but they may also reflect the “what” aspect, e.g. #bombing. they complement each other to certain degree. Temporal Information. The “when” aspect of an event is usually described by temporal expressions, such as 8am to- Semantic-B: Learning to Rank day or this morning. We use an in-house algorithm to iden- Because of the nature of this task, which is to give an or- tify temporal expressions. dered list of recommended tweets, in the Semantic-B method, we design the integration of different semantic Semantic-A: Semantic Class Integration types as a learning to rank problem. Learning to rank algo- Terms in different semantic classes play different roles in an rithm can rank the candidates according to their relevance or event, so we give them different weights in the summariza- other factors to the target, which is an event summary in our tion algorithm. The terms in the same semantic class will case. It has been used in many general recommendation have the same weight. In addition to the class weight, we tasks, and also in some tweet related tasks. The features we also consider term frequency in the event tweet collection, consider in this approach included all the seven types of se- which is reflected on the term probability distribution de- mantic terms described before, and the two normalization scribed below. Retweet count and tweet length are also con- factors for handling retweet and tweet length, e.g. log(1+r) sidered in our algorithm. and log(1+n) described in Equation 3. For the semantic terms in a tweet, we calculate their feature values by proba- Term Probability Distribution. A probability distribu- bility distributions, using the same equation, Equation 1, de- tion is calculated for each semantic term in the tweet col- scribed in Semantic-A. lection of an event. For term t, its probability is: Putting all the features and factors together, we will have ' �(�) = (1) multiple candidate lists with different candidates or different ( orders. A learning to rank algorithm is used to build a rank- where N is the total number of tweets in this event, and c is ing model based on the training data, and this model is used the total number of tweets that contain this term. on the test data set. The learning to rank algorithm used in Tweet Score Integration. Given a tweet T, we integrate this study is extended from LibLinear (Lee and Lin 2014). the scores of the semantic terms in this tweet to get its score: ∑ ∑ Y = /∈0 +∈. �+ ∙ �(�) (2) Graph Neural Network Based Approach where u is a semantic type belonging to semantic type col- We describe the graph-based neural model in this section. lection S, t is a semantic term in tweet T, wu is the weight for semantic type u, and p(t) is the probability of t. The param- Figure 1 shows the high-level structure for estimating tweet salience. For a given event cluster, a tweet relation graph is eter wu for each semantic class is learned from training data. first built, then GCN networks are applied on the relation graph, and finally a salience score is calculated for each considering the semantic terms it contains, and then use tweet by considering both the tweet hidden state and also the these scores to update the graph edges. Using these terms cluster embedding that represents the whole event cluster. as input, we apply linear regression, which has been used in We describe each component below. (Christensen et al. 2013), to each tweet x to obtain its score, Tweet Representation. We fine-tune the pre-trained s(x). Then for a directed edge of (xi, yj), its weight is trans- BERT language model (Devlin et al. 2019) to generate tweet formed by the score of node xi, and then normalized by the representation. BERT, which is based on a multi-layer bidi- total outgoing edge scores, using this equation: rectional Transformer, is a neural network language model 12��, ��7 8(��) trained on large corpuses of text data. It has generally been wnew(xi , xj) = (4) applied to specific natural language understanding tasks by ∑�9�,..,� 12��, ��7 8(��) fine-tuning a final output layer for each task, and has im- where w(xi , xj) is the original edge score from node xi to xj, proved the state-of-the art performance across the board for s(xi) is the score of node xi, and wnew(xi , xj) is the normalized natural language understanding tasks. In our graph-based score. All edges are then transformed to undirected, the edge model, given a tweet, after removing URLs and retweet weights are updated to the average of the weights of both mentions, its text is tokenized using WordPiece and fed into directions, and finally they all are rescaled to 0 to 1. the BERT model. The [CLS] token vector of the final output Graph Convolutional Network. GCN networks are ap- layer is taken as the tweet representation. This vector is then plied over the tweet relation graph. What GCN does is to used by both the tweet relation graph component and the learn a transformation function f(X,E) to produce hidden GCN component, shown in Figure 1. features for each node, taking the tweet relation graph nodes Tweet Relation Graph. The underlying idea behand uti- and edges as input. This function uses the following layer- lizing a relation graph is that a tweet is important in an event wise propagation rule: cluster if it has strong connections with other important @ @ �(>?@) = �( �F �(>)�(>)) (5) tweets in this event. In this graph, a node is a tweet, and an BCD BCD edge is the relation between two tweets, measured by their similarity. Following previous studies (Erkan and Radev Tweet Salience Estimation. The GCN component learns 2004; Cao et al. 2017), we also use cosine similarity to new tweet embeddings, e.g. (Y1, Y2 , … , Yn) in Figure 1. In measure the connection strength between node, using tweet order to get a global context of an event, a cluster-level em- embeddings. An edge is added between two nodes if the bedding, called C, is generated by averaging the embed- cosine value is greater than the threshold, obtained empiri- dings of all the tweets in this cluster. Then the salience score cally. of tweet xi is computed by using both the cluster embedding

C and the tweet embedding Yi: T Si = f(xi) = v tanh (Ws Yi + Wc C) (6)

where Ws, Wc and v are parameters. Si is then normalized over all the tweets of this cluster, via a softmax operation. Previous study (Yasunaga et al. 2017) has used similar ap- proach for multi-document summarization.

Figure 1: The tweet salience estimation workflow.

Another factor that affects the importance of a tweet on representing an event is how informative the tweet is. As described earlier, an event is usually defined by the 4Ws Figure 2: Screenshot of event summary examples questions. Therefore, we calculate a score for each node by Deployment. Figure 2 shows a preliminary user interface based on unigram overlap, and ROUGE-2, which is based example which displays three events with summary. The on bigram overlap, to compute the F1 score. number next to the summary is the total number of tweets of Algorithms Compared. We compare our approaches to that event, and it can change dynamically as the system de- these methods: tects more tweets for that event. A new panel will pop up to 1. The earliest tweet: this is the first tweet in the event. This show all the tweets of an event if its number is clicked by tweet is the first one describing the event, and so it is a user. special one. 2. The most retweeted tweet: retweeting reinforces a mes- sage and the number of retweets shows its popularity. Experiments and Results 3. sumBasic: previous studies have shown that sumBasic is a simple but effective method (Alsaedi et al. 2016; He and Data Collection Duan 2018). 4. Hybrid TF-IDF: this is the state-of-the-art approach on Previous studies (Alsaedi et al. 2016; Inouye and Kalita 2016) tweet event summarization (Inouye and Kalita 2016). have used the data set created by Inouye and Kalita, which 5. Multi-Perceptron: This method uses a multi-layer percep- has 50 events. Shen et al. (2013) use six NBA games as their tron neural network to estimate a probability score for event data set. Chua and Asur (2013) use four events, built by each tweet and rank them (Wang and Zhang 2017). searching Twitter using some keywords. All the events in The earliest and most retweeted methods are two simple these data sets were not detected at real-time via Twitter’s but intuitive approaches. Previous studies (Alsaedi et al. streaming data; they all belong to retrospective and specified 2016; Inouye and Kalita 2016) already demonstrated that events (Atefeh and Khreich 2013). The latter two data sets are sumBasic and Hybrid TF-IDF outperform other methods, not available to the public. such as LexRank, Mead, TextRank. Therefore, in the exper- Due to the above reasons, we decided to create our own iments, our approaches are compared to these two ap- data set. Tracer is a real-time event detection system proaches, and the Multi-Perceptron method that is based on on Twitter (Liu et al. 2016; Liu et al. 2017). It clusters tweets neural network. The approach proposed by (He and Duan talking about the same story into the same cluster at real-time, 2018) is based on social network structure, and it is not suit- by considering tweet link, entity name, hashtag, etc. Their able for real-time event summarization, since usually it is evaluation shows that the system has the state-of-the-art real- impossible for a summarization model to build the social time event detection/clustering performance. We used the media network structure at real time, due to the real-time events generated by this system in our study, and collected constraint and lack of access to the whole Twitter network. 1,000 events from it. They include both big events, such as Therefore, we did not include this approach in our experi- bombing, and small events, such as an NBA game. ments. Six human annotators were trained in social media news events, and reviewed these events. They were asked to go through all the tweets of an event to select the most repre- Training Settings sentative and informative ones as the “true” summary. Be- Semantic-A Training. We used 700 events and the logistic cause tweets are very short, it was not too time-consuming regression algorithm to learn the weights of semantic types for the annotators to identify the appropriate tweets. The ex- in Equation 2, using a 5-fold cross-validation, and the other perts may choose more than one tweet as the summary. If 300 events as evaluation data. The seven semantic types are multiple tweets are selected for an event, they should not the features, and their values from Equation 1 are the feature convey repetitive message, instead they should represent values. For each of the training events annotated with true different aspects or development stages of the event. Over- summary, we randomly selected 5 tweets from this event as all, 72% of the events have one tweet chosen as summary, the false summary. The coefficient of each semantic type is 27% of them have two or three tweets, and less than 1% of used as the weight in Equation 2. them have more than three tweets. Semantic-B Training. The 700 events mentioned in the last section are also used to train the learning to rank model. Algorithm Performance Comparison If an annotated event has more than one tweet as summary, then they were already ranked by our annotators, and so we Evaluation Metrics. We use both ROUGE (Lin and Hovy can get the ranks between them. For each of the training 2003) and BLEU-4 (Papineni et al. 2002) as the evaluation events, we randomly selected 5 tweets from this event as metrics. ROUGE works by comparing an automatically pro- negative summary examples. These negative tweets have duced system summary against the reference summary. It the same rank among themselves.

basically measures the overlap of N-grams between the sys- Graph-based Approach Training. The graph-based tem and reference summaries. However, it does not tell you model is trained end-to-end. The same 700 events are used much as a metric. To get a good quantitative value, in the as the training and validation data. We fine tune the pre- context of ROUGE, we compute precision and recall using trained BERT-base model for generating tweet representa- the overlaps, and then report the F1-measure. This metric tion. The embedding dimension size is 768. We train the has been used in previous studies (Alsaedi et al. 2016; He model using an Adam optimizer with β1 = 0.9 and β2 = and Duan 2018). In this study, we use ROUGE-1, which is 0.999, and use a dynamic learning rate during the training process. The number of GCN layers, shuffled mini-batch methods, and the differences are statistically significant at size, dropout rate, and learning rate are 3, 25, 0.3 and 0.001, p=0.05 level using t-test (Rice 2006). Table 1 also shows respectively. The graph model is trained by minimizing the that Semantic-B, which is based on the learning to rank al- cross-entropy loss between the predicted tweet salience gorithm, achieves better performance than Semantic-A, score and its ROUGE F1 score: which is based on integrating all the semantic features into � = − ∑ ∑ log(�(� )) �(� ) (7) one final score. Compared to the ROUGE and BLEU values C QR O O reported in some previous studies on other summarization where D is the whole training data, xi is a tweet in a cluster, data sets, the scores in this study may look high. One reason f(xi) is the salience score of xi, and R(xi) is the average F1 is that tweets are very short and the ones in the same event score of ROUGE-1 and ROUGE-2, after normalized over all are similar, which makes the scores higher. the tweets of the cluster, through a softmax operation. Sim- Table 2 presents the result when we can choose more than ilar approaches are also used in previous studies (Wang and one tweet as summary. We can see that the overall perfor- Zhang 2017; Yasunaga et al. 2017; He and Duan 2018). mance increases, compared to selecting only one tweet. This shows that our designed parameters, C1, C2 and P, do im- Method ROUGE F BLEU prove the performance. Earliest 0.511 0.384 Most retweeted 0.540 0.413 Method ROUGE F BLEU Hybrid TF-IDF 0.572 0.437 Earliest 0.522 0.391 SumBasic 0.588 0.448 Most retweeted 0.550 0.423 Multi-Perceptron 0.593 0.450 Hybrid TF-IDF 0.579 0.442 Semantic-A 0.607 0.456 SumBasic 0.596 0.456 Semantic-B 0.622 0.465 Multi-Perceptron 0.603 0.459 Graph based 0.647 0.490 Semantic-A 0.618 0.468 Semantic-B 0.630 0.476 Table 1. Comparison of different algorithms Graph based 0.658 0.498

Number of Tweets Selected as Summary Table 2. Performance when multiple tweets may be chosen Some events may need to have more than one tweet as their summaries. In this case, each tweet represents a different as- Ablation Test. In the GCN based method, when building pect or development stage of the event. In this study, the tweet relation graph, we use the semantic terms to cal- whether multiple tweets should be selected is determined as culate a score for each node, and then use it to update the follow: all the tweets are first ordered in descending order edge weight, in order to include the informative aspect of a based on their scores produced by the corresponding model. tweet in the tweet salience estimation. To show how im- The tweet ranked first is selected as part of the summary portant it is, we did an extra experiment, in which the edge first, and then from the second one, each tweet is checked weight is not updated by the node score in the tweet relation and is also selected as part of the summary if it meets the graph. When computing the tweet salience score, in order following two conditions: 1. Its cosine similarity with any to get a global context of an event, a cluster-level embedding of the already selected tweets is less than a threshold C1. is generated by averaging the embeddings of all the tweets This condition is to make sure that the next selected tweet in this cluster, and the salience score is computed by using talks about a different aspect of the event from the already both the cluster embedding and the tweet embedding. To selected tweets. 2. The percentage of tweets that are similar show how important the cluster level embedding is, we also to this one (measured by cosine similarity, and the threshold conducted an experiment without using the cluster level em- is C2) should be greater than a threshold P. This is to make bedding. sure that this tweet is representative enough. The values Table 3 presents the results of these two experiments, learned from training data are: C1 = 0.68, C2 = 0.81, P = compared to the proposed graph-based model. The results 16%. They are independent of individual summarization al- show that the both the ROUGE and the BLEU scores had a gorithms. The baseline approaches only select one tweet as 3% decline when the node weight is not updated using se- summary by their design. To make the comparison fair, we mantic terms, but it is still slightly better than the two meth- did two comparisons: 1. Each method just selects one tweet ods based on semantic classes. This demonstrated that ex- as summary. 2. To be able to select multiple tweets as sum- plicitly including the semantic terms in node relation crea- mary, the other methods also used the learned values for C1, tion do help in tweet salience estimation. When excluding C2 and P to determine if more tweets should be selected. the cluster level embedding in computing tweet salience score, the performance decreases greatly, from 0.658 to Comparison Result and Discussion 0.620, more than 6%. It shows that the cluster level embed- Table 1 presents the result of the first comparison, based on ding provides very useful information on estimating the average F1 scores of ROUGE-1 & ROUGE-2, and whether a tweet is representative or not. BLEU-4, using the 300 evaluation events. It shows that Se- mantic-B and the GCN based approach outperform other Thomas N Kipf and Max Welling. Semisupervised classification Method ROUGE F BLEU with graph convolutional networks. ICLR 2017. Graph based - without node weight up- 0.637 0.477 Hing-Pei Lee and Chih-Jen Lin. Large-scale linear rankSVM. Neu- date using semantic terms ral Computation 2014 Graph based - without cluster level em- 0.620 0.466 bedding for salience score estimation Quanzhi Li, Armineh Nourbakhsh, Sameena Shah and Xiaomo Graph based 0.658 0.498 Liu, Real-Time Novel Event Detection from Social Media, IEEE ICDE 2017 Table 3. Ablation test result C.-Y. Lin and E. Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. NAACL Xiaomo Liu, Quanzhi Li, Armineh Nourbakhsh, et al. Reuters Conclusion Tracer: A Large Scale System of Detecting & Verifying Real- Time News Events from Twitter, CIKM2016. This paper presents two summarization approaches for events detected from Twitter. The first one utilizes the se- Xiaomo Liu, Armineh Nourbakhsh, Quanzhi Li, et al., Reuters Tracer: Toward Automated News Production Using Large Scale mantic types of event related terms, and ranks the tweets Social Media Data, IEEE BigData 2017. based on the score computed from analyzing these semantic terms. The second one employs GCN and tweet relation Yang Liu, Ivan Titov, and Mirella Lapata. Single document sum- marization as tree induction. NAACL 2019 graph to generate tweet hidden features for tweet salience estimations. Experiments on a large data set show that our Xuezhe Ma and Eduard Hovy, End-to-end Sequence labeling via two approaches outperform other summarization methods. bi-directional LSTM-CNN-CRF, ACL 2016 R. Mihalcea, P. Tarau. TextRank: Bringing order into texts, EMNLP 2004 References A. Mishra and K. Berberich, Event digest: A holistic view on past events, SIGIR 2017. Nasser Alsaedi, Pete Burnap, Omer Rana. Automatic summariza- M. Mohd. 2007. Named entity patterns across news domains. 1st tion of real world events using Twitter. ICWSM 2016 BCS IRSG conference on Future Directions in Information Access F. Atefeh and W. Khreich. A survey of techniques for event detec- Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, et al. Abstrac- tion in Twitter, Computational Intelligence, 2013 tive text summarization using sequence-to-sequence rnns and be- H. Becker, M. Naaman &L. Gravano. Selecting quality Twitter yond. CoNLL 2016 content for events. ICWSM 2011 Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Ranking sen- Freddy C.T. Chua and Sitaram Asur. Automatic summarization of tences for extractive summarization with reinforcement learning. events from social media. ICWSM 2013 NAACL 2018 Ziqiang Cao,Wenjie Li, Sujian Li, and Furu Wei. Improving multi- O. Owoputi, B. O'Connor, C. Dyer, K. Gimpel, N. Schneider, et al. document summarization via text classification. AAAI 2017 Improved part-of-speech tagging for online conversational text Ziqiang Cao, Wenjie Li, Sujian Li, Furu Wei, and Yanran Li. with word clusters. NAACL 2013. Attsum: Joint learning of focusing and summarization with neural Kishore Papineni, Salim Roukos, ToddWard, and Wei- Jing Zhu. attention. COLING 2016 Bleu: a method for automatic evaluation of machine translation. Abdelhamid Chellal, Mohand Boughanem, Bernard Dousset. ACL 2002 Word similarity-based model for tweet stream prospective notifi- D. Radev, T. Allison, S. Blair-Goldensohn, J. Blitzer, et al. Mead cation. ECIR 2017 – a platform for multidocument multilingual text summarization, Yi Chang, Xuanhui Wang, Qiaozhu Mei, Yan Liu, Towards twitter LREC 2004 context summarization with user influence models. WSDM 2013 John A. Rice. 2006. Mathematical Statistics and Data Analysis, Yi Chang, Jiliang Tang, Dawei Yin, Makoto Yamada, and Yan Third Edition Liu, Timeline summarization from social media with life cycle Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the models. IJCAI 2016 point: Summarization with pointer generator networks. ACL 2017 Jianpeng Cheng and Mirella Lapata. Neural summarization by ex- Royal Sequiera, Luchen Tan, and Jimmy Lin, Overview of the tracting sentences and words. ACL 2016. TREC 2018 Real-Time Summarization Track, TREC 2018 Janara Christensen, Mausam, Stephen Soderland, and Oren Chao Shen, Fei Liu, Fuliang Weng and Tao Li. A participant-based Etzioni. Towards coherent multidocument summarization. approach for event summarization using Twitter Streams. NAACL NAACL 2013. 2013. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina L. Vanderwende, H. Suzuki, C. Brockett, and A. Nenkova. Beyond Toutanova. BERT: Pre-training of deep bidirectional transformers SumBasic: Task-focused summarization with sentence simplifica- for language understanding. NAACL 2019 tion and lexical expansion, IPM. G. Erkan and D. Radev. Lexrank: graph-based centrality as sali- Lu Wang, Wang Ling. Neural network abstract generation for ence in text summarization, JAIR 2014 opinions and arguments. NAACL 2016 Ruifang He and Xingyi Duan. Twitter summarization based on so- Zhongqing Wang & Yue Zhang, A neural model for joint event cial network and sparse reconstruction. AAAI 2018. detection and summarization, IJCAI 2017 David Inouye and Jugal K. Kalita. Comparing summarization al- gorithms for multiple post summaries. Social Computing 2016. Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, et al. Graph- based Neural Multi-Document Summarization, CoNLL 2017Conference Name:ACM Woodstock conference Vikas Yadav, Steven Bethard, A Survey on Recent Advances in Named Entity Recognition from Deep Learning models, CoLING 2018 Vikas Yadav, Rebecca Sharp, Steven Bethard, Deep affix features improve neural named entity recognizers. Conference on Lexical and Computational Semantics 2018 Xingxing Zhang, Mirella Lapata, Furu Wei, and Ming Zhou. Neu- ral latent extractive document summarization. EMNLOP 2018 Qingyu Zhou, Nan Yang, Furu Wei, Shaohan Huang, et al. Neural document summarization by jointly learning to score and select sentences. ACL 2019