arXiv:1905.13164v1 [cs.CL] 30 May 2019

Hierarchical Transformers for Multi-Document Summarization

Yang Liu and Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh

Abstract

In this paper, we develop a neural summarization model which can effectively process multiple input documents and distill abstractive summaries. Our model augments a previously proposed Transformer architecture (Liu et al., 2018) with the ability to encode documents in a hierarchical manner. We represent cross-document relationships via an attention mechanism which allows information to be shared, as opposed to simply concatenating text spans and processing them as a flat sequence. Our model learns latent dependencies among textual units, but can also take advantage of explicit graph representations focusing on similarity or discourse relations. Empirical results on the WikiSum dataset demonstrate that the proposed architecture brings substantial improvements over several strong baselines.[1]

[1] Our code and data are available at https://github.com/nlpyang/hiersumm.

1 Introduction

Automatic summarization has enjoyed renewed interest in recent years, thanks to the popularity of neural network models and their ability to learn continuous representations without recourse to preprocessing tools or linguistic annotations. The availability of large-scale datasets (Sandhaus, 2008; Hermann et al., 2015; Grusky et al., 2018) containing hundreds of thousands of document-summary pairs has driven the development of neural architectures for summarizing single documents. Several approaches have shown promising results with sequence-to-sequence models that encode a source document and then decode it into an abstractive summary (See et al., 2017; Celikyilmaz et al., 2018; Paulus et al., 2018; Gehrmann et al., 2018).

Multi-document summarization, the task of producing summaries from clusters of thematically related documents, has received significantly less attention, partly due to the paucity of suitable data for the application of learning methods. High-quality multi-document summarization datasets (i.e., document clusters paired with multiple reference summaries written by humans) have been produced for the Document Understanding and Text Analysis Conferences (DUC and TAC), but are relatively small (in the range of a few hundred examples) for training neural models. In an attempt to drive research further, Liu et al. (2018) tap into the potential of Wikipedia and propose a methodology for creating a large-scale dataset (WikiSum) for multi-document summarization with hundreds of thousands of instances. Wikipedia articles, specifically lead sections, are viewed as summaries of various topics indicated by their title, e.g., "Florence" or "Natural Language Processing". Documents cited in the Wikipedia articles or web pages returned by Google (using the section titles as queries) are seen as the source cluster which the lead section purports to summarize.

Aside from the difficulties in obtaining training data, a major obstacle to the application of end-to-end models to multi-document summarization is the sheer size and number of source documents which can be very large. As a result, it is practically infeasible (given memory limitations of current hardware) to train a model which encodes all of them into vectors and subsequently generates a summary from them. Liu et al. (2018) propose a two-stage architecture, where an extractive model first selects a subset of salient passages, and subsequently an abstractive model generates the summary while conditioning on the extracted subset. The selected passages are concatenated into a flat sequence and the Transformer (Vaswani et al., 2017), an architecture well-suited to language modeling over long sequences, is used to decode the summary.
Although the model of Liu et al. (2018) takes an important first step towards abstractive multi-document summarization, it still considers the multiple input documents as a concatenated flat sequence, being agnostic of the hierarchical structures and the relations that might exist among documents. For example, different web pages might repeat the same content, include additional content, present contradictory information, or discuss the same fact in a different light (Radev, 2000). The realization that cross-document links are important in isolating salient information, eliminating redundancy, and creating overall coherent summaries has led to the widespread adoption of graph-based models for multi-document summarization (Erkan and Radev, 2004; Christensen et al., 2013; Wan, 2008; Parveen and Strube, 2014). Graphs conveniently capture the relationships between textual units within a document collection and can be easily constructed under the assumption that text spans represent graph nodes and edges are semantic links between them.

In this paper, we develop a neural summarization model which can effectively process multiple input documents and distill abstractive summaries. Our model augments the previously proposed Transformer architecture with the ability to encode multiple documents in a hierarchical manner. We represent cross-document relationships via an attention mechanism which allows information to be shared across multiple documents, as opposed to simply concatenating text spans and feeding them as a flat sequence to the model. In this way, the model automatically learns richer structural dependencies among textual units, thus incorporating well-established insights from earlier work. Advantageously, the proposed architecture can easily benefit from information external to the model, i.e., by replacing inter-document attention with a graph-matrix computed on the basis of lexical similarity (Erkan and Radev, 2004) or discourse relations (Christensen et al., 2013).

We evaluate our model on the WikiSum dataset and show experimentally that the proposed architecture brings substantial improvements over several strong baselines. We also find that the addition of a simple ranking module which scores documents based on their usefulness for the target summary can greatly boost the performance of a multi-document summarization system.

2 Related Work

Most previous multi-document summarization methods are extractive, operating over graph-based representations of sentences or passages. Approaches vary depending on how edge weights are computed, e.g., based on cosine similarity with tf-idf weights for words (Erkan and Radev, 2004) or on discourse relations (Christensen et al., 2013), and the specific algorithm adopted for ranking text units for inclusion in the final summary. Several variants of the PageRank algorithm have been adopted in the literature (Erkan and Radev, 2004) in order to compute the importance or salience of a passage recursively based on the entire graph. More recently, Yasunaga et al. (2017) propose a neural version of this framework, where salience is estimated using features extracted from sentence embeddings and graph convolutional networks (Kipf and Welling, 2017) applied over the relation graph representing cross-document links.

Abstractive approaches have met with limited success. A few systems generate summaries based on sentence fusion, a technique which identifies fragments conveying common information across documents and combines these into sentences (Barzilay and McKeown, 2005; Filippova and Strube, 2008; Bing et al., 2015). Although neural abstractive models have achieved promising results on single-document summarization (See et al., 2017; Paulus et al., 2018; Gehrmann et al., 2018; Celikyilmaz et al., 2018), the extension of sequence-to-sequence architectures to multi-document summarization is less straightforward. Apart from the lack of sufficient training data, neural models also face the computational challenge of processing multiple source documents. Previous solutions include model transfer (Zhang et al., 2018; Lebanoff and Liu, 2018), where a sequence-to-sequence model is pretrained on single-document summarization data and fine-tuned on DUC (multi-document) benchmarks, or unsupervised models relying on reconstruction objectives (Ma et al., 2016; Chu and Liu, 2018).

Liu et al. (2018) propose a methodology for constructing large-scale summarization datasets and a two-stage model which first extracts salient information from source documents and then uses a decoder-only architecture (that can attend to very long sequences) to generate the summary. We follow their setup in viewing multi-document summarization as a supervised machine learning problem and for this purpose assume access to large, labeled datasets (i.e., source documents-summary pairs). In contrast to their approach, we use a learning-based ranker and our abstractive model can hierarchically encode the input documents, with the ability to learn latent relations across documents and additionally incorporate information encoded in well-known graph representations.

3 Model Description

We follow Liu et al. (2018) in treating the generation of lead Wikipedia sections as a multi-document summarization task. The input to a hypothetical system is the title of a Wikipedia article and a collection of source documents, while the output is the Wikipedia article's first section. Source documents are webpages cited in the References section of the Wikipedia article and the top 10 search results returned by Google (with the title of the article as the query). Since source documents could be relatively long, they are split into multiple paragraphs by line-breaks. More formally, given title $T$, and $L$ input paragraphs $\{P_1, \cdots, P_L\}$ (retrieved from Wikipedia citations and a search engine), the task is to generate the lead section $D$ of the Wikipedia article.

Figure 1: Pipeline of our multi-document summarization system. $L$ source paragraphs are first ranked and the $L'$-best ones serve as input to an encoder-decoder model which generates the target summary.

Our summarization system is illustrated in Figure 1. Since the input paragraphs are numerous and possibly lengthy, instead of directly applying an abstractive system, we first rank them and summarize the $L'$-best ones. Our summarizer follows the very successful encoder-decoder architecture (Bahdanau et al., 2015), where the encoder encodes the input text into hidden representations and the decoder generates target summaries based on these representations. In this paper, we focus exclusively on the encoder part of the model; our decoder follows the Transformer architecture introduced in Vaswani et al. (2017) and generates a summary token by token while attending to the source input. We also use beam search and a length penalty (Wu et al., 2016) in the decoding process, to generate more fluent and longer summaries.

3.1 Paragraph Ranking

Unlike Liu et al. (2018) who rank paragraphs based on their similarity with the title (using tf-idf-based cosine similarity), we adopt a learning-based approach. A logistic regression model is applied to each paragraph to calculate a score indicating whether it should be selected for summarization. We use two recurrent neural networks with Long Short-Term Memory units (LSTM; Hochreiter and Schmidhuber, 1997) to represent title $T$ and source paragraph $P$:

$$\{u_{t1}, \cdots, u_{tm}\} = \mathrm{lstm}_t(\{w_{t1}, \cdots, w_{tm}\}) \tag{1}$$
$$\{u_{p1}, \cdots, u_{pn}\} = \mathrm{lstm}_p(\{w_{p1}, \cdots, w_{pn}\}) \tag{2}$$

where $w_{ti}$, $w_{pj}$ are word embeddings for tokens in $T$ and $P$, and $u_{ti}$, $u_{pj}$ are the updated vectors for each token after applying the LSTMs.

A max-pooling operation is then used over title vectors to obtain a fixed-length representation $\hat{u}_t$:

$$\hat{u}_t = \mathrm{maxpool}(\{u_{t1}, \cdots, u_{tm}\}) \tag{3}$$

We concatenate $\hat{u}_t$ with the vector $u_{pi}$ of each token in the paragraph and apply a non-linear transformation to extract features for matching the title and the paragraph. A second max-pooling operation yields the final paragraph vector $\hat{p}$:

$$p_i = \tanh(W_1([u_{pi}; \hat{u}_t])) \tag{4}$$
$$\hat{p} = \mathrm{maxpool}(\{p_1, \cdots, p_n\}) \tag{5}$$

Finally, to estimate whether a paragraph should be selected, we use a linear transformation and a sigmoid function:

$$s = \mathrm{sigmoid}(W_2 \hat{p}) \tag{6}$$

where $s$ is the score indicating whether paragraph $P$ should be used for summarization.

All input paragraphs $\{P_1, \cdots, P_L\}$ receive scores $\{s_1, \cdots, s_L\}$. The model is trained by minimizing the cross entropy loss between $s_i$ and ground-truth scores $y_i$ denoting the relatedness of a paragraph to the gold standard summary. We adopt ROUGE-2 recall (of paragraph $P_i$ against gold target text $D$) as $y_i$. In testing, input paragraphs are ranked based on the model predicted scores and an ordering $\{R_1, \cdots, R_L\}$ is generated. The first $L'$ paragraphs $\{R_1, \cdots, R_{L'}\}$ are selected as input to the second abstractive stage.

3.2 Paragraph Encoding

Instead of treating the selected paragraphs as a very long sequence, we develop a hierarchical model based on the Transformer architecture (Vaswani et al., 2017) to capture inter-paragraph relations. The model is composed of several local and global transformer layers which can be stacked freely. Let $t_{ij}$ denote the $j$-th token in the $i$-th ranked paragraph $R_i$; the model takes vectors $x_{ij}^0$ (for all tokens) as input. For the $l$-th transformer layer, the input will be $x_{ij}^{l-1}$, and the output is written as $x_{ij}^l$.

3.2.1 Embeddings

Input tokens are first represented by word embeddings. Let $w_{ij} \in \mathbb{R}^d$ denote the embedding assigned to $t_{ij}$. Since the Transformer is a non-recurrent model, we also assign a special positional embedding $pe_{ij}$ to $t_{ij}$, to indicate the position of the token within the input.

To calculate positional embeddings, we follow Vaswani et al. (2017) and use sine and cosine functions of different frequencies. The embedding $e_p$ for the $p$-th element in a sequence is:

$$e_p[2i] = \sin(p / 10000^{2i/d}) \tag{7}$$
$$e_p[2i+1] = \cos(p / 10000^{2i/d}) \tag{8}$$

where $e_p[i]$ indicates the $i$-th dimension of the embedding vector. Because each dimension of the positional encoding corresponds to a sinusoid, for any fixed offset $o$, $e_{p+o}$ can be represented as a linear function of $e_p$, which enables the model to distinguish relative positions of input elements.

In multi-document summarization, token $t_{ij}$ has two positions that need to be considered, namely $i$ (the rank of the paragraph) and $j$ (the position of the token within the paragraph). Positional embedding $pe_{ij} \in \mathbb{R}^d$ represents both positions (via concatenation) and is added to word embedding $w_{ij}$ to obtain the final input vector $x_{ij}^0$:

$$pe_{ij} = [e_i; e_j] \tag{9}$$
$$x_{ij}^0 = w_{ij} + pe_{ij} \tag{10}$$

3.2.2 Local Transformer Layer

A local transformer layer is used to encode contextual information for tokens within each paragraph. The local transformer layer is the same as the vanilla transformer layer (Vaswani et al., 2017), and is composed of two sub-layers:

$$h = \mathrm{LayerNorm}(x^{l-1} + \mathrm{MHAtt}(x^{l-1})) \tag{11}$$
$$x^l = \mathrm{LayerNorm}(h + \mathrm{FFN}(h)) \tag{12}$$

where LayerNorm is layer normalization proposed in Ba et al. (2016); MHAtt is the multi-head attention mechanism introduced in Vaswani et al. (2017) which allows each token to attend to other tokens with different attention distributions; and FFN is a two-layer feed-forward network with ReLU as hidden activation function.

3.2.3 Global Transformer Layer

A global transformer layer is used to exchange information across multiple paragraphs. As shown in Figure 2, we first apply a multi-head pooling operation to each paragraph. Different heads will encode paragraphs with different attention weights. Then, for each head, an inter-paragraph attention mechanism is applied, where each paragraph can collect information from other paragraphs by self-attention, generating a context vector to capture contextual information from the whole input. Finally, context vectors are concatenated, linearly transformed, added to the vector of each token, and fed to a feed-forward layer, updating the representation of each token with global information.

Multi-head Pooling  To obtain fixed-length paragraph representations, we apply a weighted-pooling operation; instead of using only one representation for each paragraph, we introduce a multi-head pooling mechanism, where for each paragraph, weight distributions over tokens are calculated, allowing the model to flexibly encode paragraphs in different representation subspaces by attending to different words.

Let $x_{ij}^{l-1} \in \mathbb{R}^d$ denote the output vector of the last transformer layer for token $t_{ij}$, which is used as input for the current layer. For each paragraph $R_i$, for head $z \in \{1, \cdots, n_{head}\}$, we first transform the input vectors into attention scores $a_{ij}^z$ and value vectors $b_{ij}^z$. Then, for each head, we calculate a probability distribution $\hat{a}_{ij}^z$ over tokens within the paragraph based on attention scores:

$$a_{ij}^z = W_a^z x_{ij}^{l-1} \tag{13}$$
$$b_{ij}^z = W_b^z x_{ij}^{l-1} \tag{14}$$
$$\hat{a}_{ij}^z = \exp(a_{ij}^z) \Big/ \sum_{j'=1}^{n} \exp(a_{ij'}^z) \tag{15}$$

where $W_a^z \in \mathbb{R}^{1 \times d}$ and $W_b^z \in \mathbb{R}^{d_{head} \times d}$ are weights, $d_{head} = d / n_{head}$ is the dimension of each head, and $n$ is the number of tokens in $R_i$.

We next apply a weighted summation with another linear transformation and layer normalization to obtain vector $head_i^z$ for the paragraph:

$$head_i^z = \mathrm{LayerNorm}\Big(W_c^z \sum_{j=1}^{n} \hat{a}_{ij}^z b_{ij}^z\Big) \tag{16}$$

where $W_c^z \in \mathbb{R}^{d_{head} \times d_{head}}$ is the weight.
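Equations (13)-(16) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the stacked weight layout (one leading axis per head), and the parameter-free layer normalization are assumptions made for brevity, and the weighted sum uses the normalized weights from Equation (15).

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Layer normalization over the last axis (no learned gain or bias; an assumption)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def multi_head_pooling(x, W_a, W_b, W_c):
    """Pool the token vectors of one paragraph into n_head fixed-length vectors.

    x   : (n, d)                    token vectors of paragraph R_i
    W_a : (n_head, d)               per-head score weights, Eq. (13)
    W_b : (n_head, d_head, d)       per-head value weights, Eq. (14)
    W_c : (n_head, d_head, d_head)  per-head output weights, Eq. (16)
    Returns (heads, probs) with shapes (n_head, d_head) and (n, n_head).
    """
    scores = x @ W_a.T                               # (n, n_head), Eq. (13)
    values = np.einsum("zhd,nd->znh", W_b, x)        # (n_head, n, d_head), Eq. (14)
    # Softmax over the tokens of the paragraph, Eq. (15); subtracting the
    # maximum is only for numerical stability and does not change the result.
    scores = scores - scores.max(axis=0, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=0, keepdims=True)
    pooled = np.einsum("nz,znh->zh", probs, values)  # weighted sum over tokens
    heads = np.einsum("zgh,zh->zg", W_c, pooled)     # linear map W_c^z
    return layer_norm(heads), probs                  # Eq. (16)


# Toy usage with random weights (hypothetical sizes).
rng = np.random.default_rng(0)
n, d, n_head = 5, 8, 2
d_head = d // n_head
x = rng.normal(size=(n, d))
heads, probs = multi_head_pooling(
    x,
    rng.normal(size=(n_head, d)),
    rng.normal(size=(n_head, d_head, d)),
    rng.normal(size=(n_head, d_head, d_head)),
)
print(heads.shape)  # (2, 4)
```

Each column of `probs` is one head's distribution over the paragraph's tokens, so different heads can emphasize different words of the same paragraph.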
The model can flexibly incorporate multiple heads, with each paragraph having multiple attention distributions, thereby focusing on different views of the input.

Figure 2: A global transformer layer. Different colors indicate different heads in multi-head pooling and inter-paragraph attention.

Inter-paragraph Attention  We model the dependencies across multiple paragraphs with an inter-paragraph attention mechanism. Similar to self-attention, inter-paragraph attention allows each paragraph to attend to other paragraphs by calculating an attention distribution:

$$q_i^z = W_q^z head_i^z \tag{17}$$
$$k_i^z = W_k^z head_i^z \tag{18}$$
$$v_i^z = W_v^z head_i^z \tag{19}$$
$$context_i^z = \sum_{i'=1}^{m} \frac{\exp(q_i^{zT} k_{i'}^z)}{\sum_{o=1}^{m} \exp(q_i^{zT} k_o^z)} v_{i'}^z \tag{20}$$

where $q_i^z, k_i^z, v_i^z \in \mathbb{R}^{d_{head}}$ are query, key, and value vectors that are linearly transformed from $head_i^z$ as in Vaswani et al. (2017), with weights $W_q^z, W_k^z, W_v^z \in \mathbb{R}^{d_{head} \times d_{head}}$; $context_i^z \in \mathbb{R}^{d_{head}}$ represents the context vector generated by a self-attention operation over all paragraphs. $m$ is the number of input paragraphs. Figure 2 provides a schematic view of inter-paragraph attention.

Feed-forward Networks  We next update token representations with contextual information. We first fuse information from all heads by concatenating all context vectors and applying a linear transformation with weight $W_c \in \mathbb{R}^{d \times d}$:

$$c_i = W_c[context_i^1; \cdots; context_i^{n_{head}}] \tag{21}$$

We then add $c_i$ to each input token vector $x_{ij}^{l-1}$, and feed it to a two-layer feed-forward network with ReLU as the activation function and a highway layer normalization on top:

$$g_{ij} = W_{o2}\,\mathrm{ReLU}(W_{o1}(x_{ij}^{l-1} + c_i)) \tag{22}$$
$$x_{ij}^l = \mathrm{LayerNorm}(g_{ij} + x_{ij}^{l-1}) \tag{23}$$

where $W_{o1} \in \mathbb{R}^{d_{ff} \times d}$ and $W_{o2} \in \mathbb{R}^{d \times d_{ff}}$ are the weights, and $d_{ff}$ is the hidden size of the feed-forward layer. This way, each token within paragraph $R_i$ can collect information from other paragraphs in a hierarchical and efficient manner.

3.2.4 Graph-informed Attention

The inter-paragraph attention mechanism can be viewed as learning a latent graph representation (self-attention weights) of the input paragraphs. Although previous work has shown that similar latent representations are beneficial for downstream NLP tasks (Liu and Lapata, 2018; Kim et al., 2017; Williams et al., 2018; Niculae et al., 2018; Fernandes et al., 2019), much work in multi-document summarization has taken advantage of explicit graph representations, each focusing on different facets of the summarization task (e.g., capturing redundant information or representing passages referring to the same event or entity). One advantage of the hierarchical transformer is that we can easily incorporate graphs external to the model, to generate better summaries.

We experimented with two well-established graph representations which we discuss briefly below. However, there is nothing inherent in our model that restricts us to these; any graph modeling relationships across paragraphs could have been used instead. Our first graph aims to capture lexical relations; graph nodes correspond to paragraphs and edge weights are cosine similarities based on tf-idf representations of the paragraphs. Our second graph aims to capture discourse relations (Christensen et al., 2013); it builds an Approximate Discourse Graph (ADG) (Yasunaga et al., 2017) over paragraphs; edges between paragraphs are drawn by counting (a) co-occurring entities and (b) discourse markers (e.g., however, nevertheless) connecting two adjacent paragraphs (see the Appendix for details on how ADGs are constructed).

We represent such graphs with a matrix $G$, where $G_{ii'}$ is the weight of the edge connecting paragraphs $i$ and $i'$. We can then inject this graph into our hierarchical transformer by simply substituting one of its (learned) heads $z'$ with $G$. Equation (20) for calculating the context vector for this head is modified as:

$$context_i^{z'} = \sum_{i'=1}^{m} \frac{G_{ii'}}{\sum_{o=1}^{m} G_{io}} v_{i'}^{z'} \tag{24}$$

4 Experimental Setup

WikiSum Dataset  We used the scripts and urls provided in Liu et al. (2018) to crawl Wikipedia articles and source reference documents. We successfully crawled 78.9% of the original documents (some urls have become invalid and corresponding documents could not be retrieved). We further removed clone paragraphs (which are exact copies of some parts of the Wikipedia articles); these were paragraphs in the source documents whose bigram recall against the target summary was higher than 0.8. On average, each input has 525 paragraphs, and each paragraph has 70.1 tokens. The average length of the target summary is 139.4 tokens. We split the dataset with 1,579,360 instances for training, 38,144 for validation and 38,205 for test.

For both ranking and summarization stages, we encode source paragraphs and target summaries using subword tokenization with SentencePiece (Kudo and Richardson, 2018). Our vocabulary consists of 32,000 subwords and is shared for both source and target.

Paragraph Ranking  To train the regression model, we calculated the ROUGE-2 recall (Lin, 2004) of each paragraph against the target summary and used this as the ground-truth score. The hidden size of the two LSTMs was set to 256, and dropout (with dropout probability of 0.2) was used before all linear layers. Adagrad (Duchi et al., 2011) with learning rate 0.15 is used for optimization. We compare our ranking model against the method proposed in Liu et al. (2018) who use the tf-idf cosine similarity between each paragraph and the article title to rank the input paragraphs. We take the first $L'$ paragraphs from the ordered paragraph set produced by our ranker and the similarity-based method, respectively. We concatenate these paragraphs and calculate their ROUGE-L recall against the gold target text. The results are shown in Table 1. We can see that our ranker effectively extracts related paragraphs and produces more informative input for the downstream summarization task.

             ROUGE-L Recall
Methods      L'=5    L'=10   L'=20   L'=40
Similarity   24.86   32.43   40.87   49.49
Ranking      39.38   46.74   53.84   60.42

Table 1: ROUGE-L recall against target summary for L'-best paragraphs obtained with tf-idf cosine similarity and our ranking model.

Training Configuration  In all abstractive models, we apply dropout (with probability of 0.1) before all linear layers; label smoothing (Szegedy et al., 2016) with smoothing factor 0.1 is also used. Training is in traditional sequence-to-sequence manner with maximum likelihood estimation. The optimizer was Adam (Kingma and Ba, 2014) with learning rate of 2, $\beta_1 = 0.9$, and $\beta_2 = 0.998$; we also applied learning rate warmup over the first 8,000 steps, and decay as in Vaswani et al. (2017). All transformer-based models had 256 hidden units; the feed-forward hidden size was 1,024 for all layers. All models were trained on 4 GPUs (NVIDIA TITAN Xp) for 500,000 steps. We used gradient accumulation to keep training time for all models approximately consistent. We selected the 5 best checkpoints based on performance on the validation set and report averaged results on the test set.

During decoding we use beam search with beam size 5 and length penalty with $\alpha = 0.4$ (Wu et al., 2016); we decode until an end-of-sequence token is reached.

Comparison Systems  We compared the proposed hierarchical transformer against several strong baselines:

Lead is a simple baseline that concatenates the title and ranked paragraphs, and extracts the first $k$ tokens; we set $k$ to the length of the ground-truth target.

LexRank (Erkan and Radev, 2004) is a widely-used graph-based extractive summarizer; we build a graph with paragraphs as nodes and edges weighted by tf-idf cosine similarity; we run a PageRank-like algorithm on this graph to rank and select paragraphs until the length of the ground-truth summary is reached.

Flat Transformer (FT) is a baseline that applies a Transformer-based encoder-decoder model to a flat token sequence. We used a 6-layer transformer. The title and ranked paragraphs were concatenated and truncated to 600, 800, and 1,200 tokens.

T-DMCA is the best performing model of Liu et al. (2018) and a shorthand for Transformer Decoder with Memory Compressed Attention; they only used a Transformer decoder and compressed the key and value in self-attention with a convolutional layer. The model has 5 layers as in Liu et al. (2018). Its hidden size is 512 and its feed-forward hidden size is 2,048. The title and ranked paragraphs were concatenated and truncated to 3,000 tokens.

Hierarchical Transformer (HT) is the model proposed in this paper. The model architecture is a 7-layer network (with 5 local-attention layers at the bottom and 2 global attention layers at the top). The model takes the title and $L' = 24$ paragraphs as input to produce a target summary, which leads to approximately 1,600 input tokens per instance.

Model                                              ROUGE-1   ROUGE-2   ROUGE-L
Lead                                               38.22     16.85     26.89
LexRank                                            36.12     11.67     22.52
FT (600 tokens, no ranking)                        35.46     20.26     30.65
FT (600 tokens)                                    40.46     25.26     34.65
FT (800 tokens)                                    40.56     25.35     34.73
FT (1,200 tokens)                                  39.55     24.63     33.99
T-DMCA (3,000 tokens)                              40.77     25.60     34.90
HT (1,600 tokens)                                  40.82     25.99     35.08
HT (1,600 tokens) + Similarity Graph               40.80     25.95     35.08
HT (1,600 tokens) + Discourse Graph                40.81     25.95     35.24
HT (train on 1,600 tokens/test on 3,000 tokens)    41.53     26.52     35.76

Table 2: Test set results on the WikiSum dataset using ROUGE F1.

5 Results

Automatic Evaluation  We evaluated summarization quality using ROUGE F1 (Lin, 2004). We report unigram and bigram overlap (ROUGE-1 and ROUGE-2) as a means of assessing informativeness and the longest common subsequence (ROUGE-L) as a means of assessing fluency.

Table 2 summarizes our results. The first block in the table includes extractive systems (Lead, LexRank), the second block includes several variants of Flat Transformer-based models (FT, T-DMCA), while the rest of the table presents the results of our Hierarchical Transformer (HT). As can be seen, abstractive models generally outperform extractive ones. The Flat Transformer achieves best results when the input length is set to 800 tokens, while longer input (i.e., 1,200 tokens) actually hurts performance. The Hierarchical Transformer with 1,600 input tokens outperforms FT, and even T-DMCA when the latter is presented with 3,000 tokens. Adding an external graph also seems to help the summarization process. The similarity graph does not have an obvious influence on the results, while the discourse graph boosts ROUGE-L by 0.16.

We also found that the performance of the Hierarchical Transformer further improves when the model is presented with longer input at test time.[2] As shown in the last row of Table 2, when testing on 3,000 input tokens, summarization quality improves across the board. This suggests that the model can potentially generate better summaries without increasing training time.

[2] This was not the case with the other Transformer models.

Table 3 summarizes ablation studies aiming to assess the contribution of individual components. Our experiments confirmed that encoding paragraph position in addition to token position within each paragraph is beneficial (see row w/o PP), as well as multi-head pooling (w/o MP is a model where the number of heads is set to 1), and the global transformer layer (w/o GT is a model with only 5 local transformer layers in the encoder).

Model        R1      R2      RL
HT           40.82   25.99   35.08
HT w/o PP    40.21   24.54   34.71
HT w/o MP    39.90   24.34   34.61
HT w/o GT    39.01   22.97   33.76

Table 3: Hierarchical Transformer and versions thereof without (w/o) paragraph position (PP), multi-head pooling (MP), and global transformer layer (GT).

Human Evaluation  In addition to automatic evaluation, we also assessed system performance by eliciting human judgments on 20 randomly selected test instances. Our first evaluation study quantified the degree to which summarization models retain key information from the documents following a question-answering (QA) paradigm (Clarke and Lapata, 2010; Narayan et al., 2018). We created a set of questions based on the gold summary under the assumption that it contains the most important information from the input paragraphs. We then examined whether participants were able to answer these questions by reading system summaries alone without access to the gold summary. The more questions a system can answer, the better it is at summarization. We created 57 questions in total varying from two to four questions per gold summary. Examples of questions and their answers are given in Table 5. We adopted the same scoring mechanism used in Clarke and Lapata (2010), i.e., correct answers are marked with 1, partially correct ones with 0.5, and 0 otherwise. A system's score is the average of all question scores.

Our second evaluation study assessed the overall quality of the summaries by asking participants to rank them taking into account the following criteria: Informativeness (does the summary convey important facts about the topic in question?), Fluency (is the summary fluent and grammatical?), and Succinctness (does the summary avoid repetition?). We used Best-Worst Scaling (Louviere et al., 2015), a less labor-intensive alternative to paired comparisons that has been shown to produce more reliable results than rating scales (Kiritchenko and Mohammad, 2017). Participants were presented with the gold summary and summaries generated from 3 out of 4 systems and were asked to decide which summary was the best and which one was the worst in relation to the gold standard, taking into account the criteria mentioned above. The rating of each system was computed as the percentage of times it was chosen as best minus the percentage of times it was selected as worst. Ratings range from −1 (worst) to 1 (best).

Both evaluations were conducted on the Amazon Mechanical Turk platform with 5 responses per hit. Participants evaluated summaries produced by the Lead baseline, the Flat Transformer, T-DMCA, and our Hierarchical Transformer. All evaluated systems were variants that achieved the best performance in automatic evaluations. As shown in Table 4, on both evaluations, participants overwhelmingly prefer our model (HT). All pairwise comparisons among systems are statistically significant (using a one-way ANOVA with post-hoc Tukey HSD tests; p < 0.01). Examples of system output are provided in Table 5.

Model     QA      Rating
Lead      31.59   -0.383
FT        35.69    0.000
T-DMCA    43.14    0.147
HT        54.11    0.237

Table 4: System scores based on questions answered by AMT participants and summary quality rating.

Pentagoet Archeological District

GOLD: The Pentagoet Archeological District is a National Historic Landmark District located at the southern edge of the Bagaduce Peninsula in Castine, Maine. It is the site of Fort Pentagoet, a 17th-century fortified trading post established by fur traders of French Acadia. From 1635 to 1654 this site was a center of trade with the local Abenaki, and marked the effective western border of Acadia with New England. From 1654 to 1670 the site was under English control, after which it was returned to France by the Treaty of Breda. The fort was destroyed in 1674 by Dutch raiders. The site was designated a National Historic Landmark in 1993. It is now a public park.

QA: What is the Pentagoet Archeological District? [a National Historic Landmark District] Where is it located? [Castine , Maine] What did the Abenaki Indians use the site for? [trading center]

LEAD: The Pentagoet Archeological District is a National Historic Landmark District located in Castine, Maine. This district forms part of the traditional homeland of the Abenaki Indians, in particular the Penobscot tribe. In the colonial period, Abenakis frequented the fortified trading post at this site, bartering moosehides, sealskins, beaver and other furs in exchange for European commodities.

FT: "Pentagoet Archeological district" is a National Historic Landmark District located at the southern edge of the Bagaduce Peninsula in Treaty Of Breda. the Pentagoet Archeological district is a National Historic Landmark District located at the southern edge of the Bagaduce Peninsula in Treaty Of Breda. It was listed on the national register of historic places in 1983.

T-DMCA: The Pentagoet Archeological District is a national historic landmark district located in castine , maine . this district forms part of the traditional homeland of the abenaki indians , in particular the Penobscot tribe. The district was listed on the national register of historic places in 1982.

HT: The Pentagoet Archeological district is a National Historic Landmark District located in Castine, Maine. This district forms part of the traditional homeland of the Abenaki Indians, in particular the Penobscot tribe. In

HT the colonial period, Abenaki frequented the fortiﬁed trading post at this site, bartering moosehides, sealskins, beaver and other furs in exchange for European commodities. Melanesian Whistler The Melanesian whistler or Vanuatu whistler (Pachycephala chlorura) is a species of passerine bird in the

OLD whistler family Pachycephalidae. It is found on the Loyalty Islands, Vanuatu, and Vanikoro in the far south-

G eastern Solomons. What is the Melanesian Whistler? [a species of passerine bird in the whistler family Pachycephalidae]

QA Where is it found? [Loyalty Islands , Vanuatu , and Vanikoro in the far south-eastern Solomons] The Australian golden whistler (Pachycephala pectoralis) is a species of bird found in forest, woodland, mallee, mangrove and scrub in Australia (except the interior and most of the north) Most populations are resident, but EAD

L some in south-eastern Australia migrate north during the winter. The Melanesian whistler (P. Caledonica) is a species of bird in the family Muscicapidae. It is endemic to

FT Melanesia.

The Australian golden whistler (Pachycephala chlorura) is a species of bird in the family Pachycephalidae, which is endemic to Fiji. T-DMCA The Melanesian whistler (Pachycephala chlorura) is a species of bird in the family Pachycephalidae, which is

HT endemic to Fiji.

Table 5: GOLD human-authored summaries, questions based on them (answers shown in square brackets) and automatic summaries produced by the LEAD-3 baseline, the Flat Transformer (FT), T-DMCA (Liu et al., 2018) and our Hierarchical Transformer (HT).
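The two human-evaluation scores described in the Human Evaluation paragraph can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. The function names `qa_score` and `best_worst_ratings`, the tuple layout of a judgment, and the example judgments are all hypothetical. The sketch also assumes that the Best-Worst Scaling rating is normalized by the number of times a system was shown to participants, which is one way to obtain ratings between −1 and 1.

```python
from collections import Counter


def qa_score(marks):
    """Average the per-question marks for one system.

    Each mark is 1 (correct), 0.5 (partially correct) or 0 (otherwise).
    """
    if not marks:
        raise ValueError("at least one question mark is required")
    if any(mark not in (0, 0.5, 1) for mark in marks):
        raise ValueError("marks must be 0, 0.5 or 1")
    return sum(marks) / len(marks)


def best_worst_ratings(judgments):
    """Compute a Best-Worst Scaling rating per system.

    `judgments` is a list of (shown_systems, best, worst) tuples.
    ASSUMPTION: counts are normalized by how often a system was shown,
    which keeps every rating in the range [-1, 1].
    """
    shown, best, worst = Counter(), Counter(), Counter()
    for systems, best_system, worst_system in judgments:
        shown.update(systems)
        best[best_system] += 1
        worst[worst_system] += 1
    return {s: (best[s] - worst[s]) / shown[s] for s in shown}


# Hypothetical judgments, for illustration only (not data from the paper).
example_judgments = [
    (("Lead", "FT", "HT"), "HT", "Lead"),
    (("FT", "T-DMCA", "HT"), "HT", "FT"),
]
example_ratings = best_worst_ratings(example_judgments)
```

In this toy example a system that is always picked as best receives a rating of 1, and a system that is never picked as best or worst receives 0.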

6 Conclusions

In this paper we conceptualized abstractive multi-document summarization as a machine learning problem. We proposed a new model which is able to encode multiple input documents hierarchically, learn latent relations across them, and additionally incorporate structural information from well-known graph representations. We have also demonstrated the importance of a learning-based approach for selecting which documents to summarize. Experimental results show that our model produces summaries which are both fluent and informative, outperforming competitive systems by a wide margin. In the future we would like to apply our hierarchical transformer to question answering and related textual inference tasks.

Acknowledgments

We would like to thank Laura Perez-Beltrachini for her help with preprocessing the dataset. This research is supported by a Google PhD Fellowship to the first author. The authors gratefully acknowledge the financial support of the European Research Council (award number 681760).

References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, California.

Regina Barzilay and Kathleen R. McKeown. 2005. Sentence fusion for multidocument news summarization. Computational Linguistics, 31(3):297–327.

Lidong Bing, Piji Li, Yi Liao, Wai Lam, Weiwei Guo, and Rebecca Passonneau. 2015. Abstractive multi-document summarization via phrase selection and merging. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1587–1597, Beijing, China.

Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1662–1675, New Orleans, Louisiana.

Janara Christensen, Mausam, Stephen Soderland, and Oren Etzioni. 2013. Towards coherent multi-document summarization. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1163–1173, Atlanta, Georgia. Association for Computational Linguistics.

Eric Chu and Peter J Liu. 2018. Unsupervised neural multi-document abstractive summarization. arXiv preprint arXiv:1810.05739.

James Clarke and Mirella Lapata. 2010. Discourse constraints for document compression. Computational Linguistics, 36(3):411–441.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.

Günes Erkan and Dragomir R Radev. 2004. Lexrank: Graph-based lexical centrality as salience in text summarization. Journal of artificial intelligence research, 22:457–479.

Patrick Fernandes, Miltiadis Allamanis, and Marc Brockschmidt. 2019. Structured neural summarization. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, Louisiana.

Katja Filippova and Michael Strube. 2008. Sentence fusion via dependency graph compression. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 177–185, Honolulu, Hawaii.

Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4098–4109, Brussels, Belgium.

Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 708–719, New Orleans, Louisiana.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28, pages 1693–1701. Curran Associates, Inc.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.

Yoon Kim, Carl Denton, Luong Hoang, and Alexander M Rush. 2017. Structured attention networks. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of the 4th International Conference on Learning Representations, San Juan, Puerto Rico.

Svetlana Kiritchenko and Saif Mohammad. 2017. Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 465–470, Vancouver, Canada.

Taku Kudo and John Richardson. 2018. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.

Logan Lebanoff and Fei Liu. 2018. Automatic detection of vague words and sentences in privacy policies. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3508–3517, Brussels, Belgium.

Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, Canada.

Yang Liu and Mirella Lapata. 2018. Learning structured text representations. Transactions of the Association for Computational Linguistics, 6:63–75.

Jordan J Louviere, Terry N Flynn, and Anthony Alfred John Marley. 2015. Best-worst scaling: Theory, methods and applications. Cambridge University Press.

Shulei Ma, Zhi-Hong Deng, and Yunlun Yang. 2016. An unsupervised multi-document summarization framework based on neural document model. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1514–1523, Osaka, Japan.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1747–1759, New Orleans, Louisiana.

Vlad Niculae, André F. T. Martins, and Claire Cardie. 2018. Towards dynamic computation graphs via sparse latent structure. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 905–911, Brussels, Belgium.

Daraksha Parveen and Michael Strube. 2014. Multi-document summarization using bipartite graphs. In Proceedings of TextGraphs-9: the workshop on Graph-based Methods for Natural Language Processing, pages 15–24, Doha, Qatar.

Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, Canada.

Dragomir Radev. 2000. A common theory of information fusion from multiple text sources step one: Cross-document structure. In 1st SIGdial Workshop on Discourse and Dialogue, pages 74–83, Hong Kong, China.

Evan Sandhaus. 2008. The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia, 6(12).

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Xiaojun Wan. 2008. An exploration of document impact on graph-based multi-document summarization. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 755–762, Honolulu, Hawaii.

Adina Williams, Andrew Drozdov, and Samuel R. Bowman. 2018. Do latent tree learning models identify meaningful structure in sentences? Transactions of the Association for Computational Linguistics, 6:253–267.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, and Dragomir Radev. 2017. Graph-based neural multi-document summarization. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 452–462, Vancouver, Canada.

Jianmin Zhang, Jiwei Tan, and Xiaojun Wan. 2018. Adapting neural single-document summarization model for abstractive multi-document summarization: A pilot study. In Proceedings of the International Conference on Natural Language Generation.

A Appendix

We describe here how the similarity and discourse graphs discussed in Section 3.2.4 were created. These graphs were added to the hierarchical transformer model as a means to enhance summary quality (see Section 5 for details).

A.1 Similarity Graph

The similarity graph $S$ is based on tf-idf cosine similarity. The nodes of the graph are paragraphs. We first represent each paragraph $p_i$ as a bag of words. Then, we calculate the tf-idf value $v_{ik}$ for each token $t_{ik}$ in a paragraph:

$$v_{ik} = N_w(t_{ik}) \log\left(\frac{N_d}{N_{dw}(t_{ik})}\right) \qquad (25)$$

where $N_w(t)$ is the count of word $t$ in the paragraph, $N_d$ is the total number of paragraphs, and $N_{dw}(t)$ is the total number of paragraphs containing the word. We thus obtain a tf-idf vector for each paragraph. Then, for all paragraph pairs $\langle p_i, p_{i'} \rangle$, we calculate the cosine similarity of their tf-idf vectors and use this as the weight $S_{ii'}$ for the edge connecting the pair in the graph. We remove edges with weights lower than 0.2.

A.2 Discourse Graphs

To build the Approximate Discourse Graph (ADG) $D$, we follow Christensen et al. (2013) and Yasunaga et al. (2017). The original ADG makes use of several complex features. Here, we create a simplified version with only two features (nodes in this graph are again paragraphs).

Co-occurring Entities For each paragraph $p_i$, we extract a set of entities $E_i$ in the paragraph using the spaCy3 NER recognizer. We only use entities with type {PERSON, NORP, FAC, ORG, GPE, LOC, EVENT, WORK_OF_ART, LAW}. For each paragraph pair $\langle p_i, p_{i'} \rangle$, we count $e_{ii'}$, the number of entities with exact match.

Discourse Markers We use the following 36 explicit discourse markers to identify edges between two adjacent paragraphs in a source webpage:

again, also, another, comparatively, furthermore, at the same time, however, immediately, indeed, instead, to be sure, likewise, meanwhile, moreover, nevertheless, nonetheless, notably, otherwise, regardless, similarly, unlike, in addition, even, in turn, in exchange, in this case, in any event, finally, later, as well, especially, as a result, example, in fact, then, the day before

If two paragraphs $\langle p_i, p_{i'} \rangle$ are adjacent in one source webpage and they are connected with one of the above 36 discourse markers, $m_{ii'}$ will be 1, otherwise it will be 0.

The final edge weight $D_{ii'}$ is the weighted sum of $e_{ii'}$ and $m_{ii'}$:

$$D_{ii'} = 0.2 \cdot e_{ii'} + m_{ii'} \qquad (26)$$
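The edge weight of the simplified discourse graph in Section A.2 (0.2 times the number of shared entities, plus 1 for adjacent paragraphs linked by a discourse marker) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation. Entities are passed in as pre-extracted sets rather than obtained from an NER model, and the rule for deciding that two paragraphs are "connected" by a marker (the second paragraph opens with one) is a hypothetical heuristic, since the text does not specify the matching procedure. The names `opens_with_marker` and `discourse_weight` are made up for this sketch.

```python
import re

# The 36 explicit discourse markers listed in Section A.2.
DISCOURSE_MARKERS = [
    "again", "also", "another", "comparatively", "furthermore",
    "at the same time", "however", "immediately", "indeed", "instead",
    "to be sure", "likewise", "meanwhile", "moreover", "nevertheless",
    "nonetheless", "notably", "otherwise", "regardless", "similarly",
    "unlike", "in addition", "even", "in turn", "in exchange",
    "in this case", "in any event", "finally", "later", "as well",
    "especially", "as a result", "example", "in fact", "then",
    "the day before",
]


def opens_with_marker(paragraph):
    """Return True if the paragraph starts with one of the markers.

    ASSUMPTION: "connected with a discourse marker" is approximated by
    a case-insensitive whole-word match at the start of the paragraph.
    """
    text = paragraph.strip().lower()
    return any(
        re.match(re.escape(marker) + r"\b", text)
        for marker in DISCOURSE_MARKERS
    )


def discourse_weight(entities_a, entities_b, adjacent, second_paragraph):
    """Edge weight D = 0.2 * e + m for one paragraph pair.

    e: number of entities shared by the two paragraphs (exact match).
    m: 1 if the paragraphs are adjacent and linked by a marker, else 0.
    """
    e = len(set(entities_a) & set(entities_b))
    m = 1 if adjacent and opens_with_marker(second_paragraph) else 0
    return 0.2 * e + m
```

For instance, two adjacent paragraphs that share one entity, where the second begins with "However", would receive a weight of 0.2 + 1 under this heuristic.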

3 https://spacy.io/api/entityrecognizer
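The construction of the similarity graph in Section A.1 (tf-idf vectors per paragraph, pairwise cosine similarity, and removal of edges with weight below 0.2) can be sketched with the Python standard library. This is a minimal illustration, not the authors' code. It assumes paragraphs are already tokenized, uses the natural logarithm (the base is not stated in the text), and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(paragraphs):
    """Build one sparse tf-idf vector (a dict) per tokenized paragraph.

    v = N_w(t) * log(N_d / N_dw(t)), with the natural log as an ASSUMPTION.
    """
    n_d = len(paragraphs)
    doc_freq = Counter()
    for tokens in paragraphs:
        doc_freq.update(set(tokens))  # count each word once per paragraph
    vectors = []
    for tokens in paragraphs:
        counts = Counter(tokens)
        vectors.append(
            {t: c * math.log(n_d / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v[t] for t, weight in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_graph(paragraphs, threshold=0.2):
    """Return {(i, j): weight} for pairs whose similarity is >= threshold."""
    vectors = tfidf_vectors(paragraphs)
    edges = {}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            weight = cosine(vectors[i], vectors[j])
            if weight >= threshold:
                edges[(i, j)] = weight
    return edges
```

Note that a word occurring in every paragraph gets a tf-idf value of zero under this formula, so it contributes nothing to any edge weight.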