Fine-tune BERT for Extractive Summarization

Yang Liu Institute for Language, Cognition and Computation School of Informatics, University of Edinburgh 10 Crichton Street, Edinburgh EH8 9AB [email protected]

Abstract has reached a bottleneck due to the complex- BERT (Devlin et al., 2018), a pre-trained ity of the task. In this paper, we argue that, Transformer (Vaswani et al., 2017) model, BERT (Devlin et al., 2018), with its pre-training has achieved ground-breaking performance on on a huge dataset and the powerful architecture multiple NLP tasks. In this paper, we describe for learning complex features, can further boost BERTSUM, a simple variant of BERT, for ex- the performance of extractive summarization . tractive summarization. Our system is the state In this paper, we focus on designing differ- of the art on the CNN/Dailymail dataset, out- ent variants of using BERT on the extractive performing the previous best-performed sys- tem by 1.65 on ROUGE-L. The codes to repro- summarization task and showing their results on duce our results are available at https://github. CNN/Dailymail and NYT datasets. We found that com/nlpyang/BertSum a flat architecture with inter-sentence Transformer layers performs the best, achieving the state-of- 1 Introduction the-art results on this task. Single-document summarization is the task of au- tomatically generating a shorter version of a doc- 2 Methodology ument while retaining its most important informa- Let d denote a document containing several sen- tion. The task has received much attention in the tences [sent1, sent2, ··· , sentm], where senti is natural language processing community due to its the i-th sentence in the document. Extractive sum- potential for various information access applica- marization can be defined as the task of assigning a tions. Examples include tools which digest textual label yi ∈ {0, 1} to each senti, indicating whether content (e.g., news, social media, reviews), answer the sentence should be included in the summary. It questions, or provide recommendations. is assumed that summary sentences represent the The task is often divided into two paradigms, most important content of the document. abstractive summarization and extractive summa- rization. In abstractive summarization, target sum- 2.1 Extractive Summarization with BERT maries contains words or phrases that were not in To use BERT for extractive summarization, we the original text and usually require various text require it to output the representation for each rewriting operations to generate, while extractive arXiv:1903.10318v2 [cs.CL] 5 Sep 2019 sentence. However, since BERT is trained as a approaches form summaries by copying and con- masked-language model, the output vectors are catenating the most important spans (usually sen- grounded to tokens instead of sentences. Mean- tences) in a document. In this paper, we focus on while, although BERT has segmentation embed- extractive summarization. dings for indicating different sentences, it only has Although many neural models have been two labels (sentence A or sentence B), instead of proposed for extractive summarization re- multiple sentences as in extractive summarization. cently (Cheng and Lapata, 2016; Nallapati et al., Therefore, we modify the input sequence and em- 2017; Narayan et al., 2018; Dong et al., 2018; beddings of BERT to make it possible for extract- Zhang et al., 2018; Zhou et al., 2018), the im- ing summaries. provement on automatic metrics like ROUGE †Please see https://arxiv.org/abs/1908.08345 for the full Encoding Multiple Sentences As illustrated in and most current version of this paper Figure 1, we insert a [CLS] token before each sen- Input Document [CLS] sent one [SEP] [CLS] 2nd sent [SEP] [CLS] sent again [SEP]

Token Embeddings E[CLS] Esent Eone E[SEP] E[CLS] E2nd Esent E[SEP] E[CLS] Esent Eagain E[SEP]

Inter val Segment Embeddings EA EA EA EA EB EB EB EB EA EA EA EA

Position Embeddings E1 E2 E3 E4 E5 E6 E7 E8 E9 E10 E11 E12

BERT

T1 ...... T2 ...... T3 ......

Summar ization Layer s

Y1 Y2 Y3

Figure 1: The overview architecture of the BERTSUM model. tence and a [SEP] token after each sentence. In to get the predicted score: vanilla BERT, The [CLS] is used as a symbol to ˆ aggregate features from one sentence or a pair of Yi = σ(WoTi + bo) (1) sentences. We modify the model by using mul- where σ is the Sigmoid function. tiple [CLS] symbols to get features for sentences ascending the symbol. Inter-sentence Transformer Instead of a sim- ple sigmoid classifier, Inter-sentence Transformer Interval Segment Embeddings We use inter- applies more Transformer layers only on sen- val segment embeddings to distinguish multiple tence representations, extracting document-level sentences within a document. For senti we will features focusing on summarization tasks from the assign a segment embedding EA or EB condi- BERT outputs: tioned on i is odd or even. For example, for l l−1 l−1 [sent1, sent2, sent3, sent4, sent5] we will assign h˜ = LN(h + MHAtt(h )) (2) [EA,EB,EA,EB,EA]. hl = LN(h˜l + FFN(h˜l)) (3) The vector Ti which is the vector of the i-th [CLS] symbol from the top BERT layer will be where h0 = PosEmb(T ) and T are the sen- used as the representation for senti. tence vectors output by BERT, PosEmb is the function of adding positional embeddings (indi- 2.2 Fine-tuning with Summarization Layers cating the position of each sentence) to T ; LN After obtaining the sentence vectors from BERT, is the layer normalization operation (Ba et al., we build several summarization-specific layers 2016); MHAtt is the multi-head attention oper- stacked on top of the BERT outputs, to capture ation (Vaswani et al., 2017); the superscript l indi- document-level features for extracting summaries. cates the depth of the stacked layer. For each sentence senti, we will calculate the fi- The final output layer is still a sigmoid classi- ˆ nal predicted score Yi. The loss of the whole fier: ˆ ˆ L model is the Binary Classification Entropy of Yi Yi = σ(Wohi + bo) (4) against gold label Y . These summarization layers i where hL is the vector for sent from the top are jointly fine-tuned with BERT. i layer (the L-th layer ) of the Transformer. In Simple Classifier Like in the original BERT pa- experiments, we implemented Transformers with per, the Simple Classifier only adds a linear layer L = 1, 2, 3 and found Transformer with 2 layers on the BERT outputs and use a sigmoid function performs the best. Recurrent Neural Network Although the per two steps, which makes the batch size approx- Transformer model achieved great results on imately equal to 36. Model checkpoints are saved several tasks, there are evidence that Recurrent and evaluated on the validation set every 1,000 Neural Networks still have their advantages, steps. We select the top-3 checkpoints based on especially when combining with techniques in their evaluation losses on the validations set, and Transformer (Chen et al., 2018). Therefore, we report the averaged results on the test set. apply an LSTM layer over the BERT outputs to When predicting summaries for a new docu- learn summarization-specific features. ment, we first use the models to obtain the score To stabilize the training, pergate layer normal- for each sentence. We then rank these sentences ization (Ba et al., 2016) is applied within each by the scores from higher to lower, and select the LSTM cell. At time step i, the input to the LSTM top-3 sentences as the summary. layer is the BERT output Ti, and the output is cal- culated as: Trigram Blocking During the predicting pro- cess, Trigram Blocking is used to reduce redun-   Fi dancy. Given selected summary S and a candidate  Ii  sentence c, we will skip c is there exists a trigram   = LNh(Whhi−1) + LNx(WxTi) (5)  Oi  overlapping between c and S. This is similar to the Gi Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998) but much simpler. Ci = σ(Fi) Ci−1 + σ(Ii) tanh(Gi−1) (6) 3.2 Summarization Datasets h =σ(O ) tanh(LN (C )) (7) i t c t We evaluated on two benchmark datasets, namely the CNN/DailyMail news highlights dataset (Her- where Fi,Ii,Oi are forget gates, input gates, mann et al., 2015) and the New York Times output gates; Gi is the hidden vector and Ci Annotated Corpus (NYT; Sandhaus 2008). The is the memory vector; hi is the output vector; CNN/DailyMail dataset contains news articles and LNh, LNx, LNc are there difference layer normal- ization operations; Bias terms are not shown. associated highlights, i.e., a few bullet points giv- The final output layer is also a sigmoid classi- ing a brief overview of the article. We used the fier: standard splits of Hermann et al.(2015) for train- ing, validation, and testing (90,266/1,220/1,093 Yˆi = σ(Wohi + bo) (8) CNN documents and 196,961/12,148/10,397 Dai- lyMail documents). We did not anonymize enti- 3 Experiments ties. We first split sentences by CoreNLP and pre- In this section we present our implementation, de- process the dataset following methods in See et al. scribe the summarization datasets and our evalua- (2017). tion protocol, and analyze our results. The NYT dataset contains 110,540 articles with abstractive summaries. Following Durrett et al. 3.1 Implementation Details (2016), we split these into 100,834 training and We use PyTorch, OpenNMT (Klein et al., 2017) 9,706 test examples, based on date of publication and the ‘-base-uncased’∗ version of BERT to (test is all articles published on January 1, 2007 or implement the model. BERT and summarization later). We took 4,000 examples from the training layers are jointly fine-tuned. Adam with β1 = 0.9, set as the validation set. We also followed their fil- β2 = 0.999 is used for fine-tuning. Learning rate tering procedure, documents with summaries that schedule is following (Vaswani et al., 2017) with are shorter than 50 words were removed from the warming-up on first 10,000 steps: raw dataset. The filtered test set (NYT50) in- cludes 3,452 test examples. We first split sen- lr = 2e−3 · min(step−0.5, step · warmup−1.5) tences by CoreNLP and pre-process the dataset following methods in Durrett et al.(2016). All models are trained for 50,000 steps on 3 Both datasets contain abstractive gold sum- GPUs (GTX 1080 Ti) with gradient accumulation maries, which are not readily suited to training ∗https://github.com/huggingface/pytorch-pretrained- extractive summarization models. A greedy algo- BERT rithm was used to generate an oracle summary for Model ROUGE-1 ROUGE-2 ROUGE-L PGN∗ 39.53 17.28 37.98 DCA∗ 41.69 19.47 37.92 LEAD 40.42 17.62 36.67 ORACLE 52.59 31.24 48.87 REFRESH∗ 41.0 18.8 37.7 NEUSUM∗ 41.59 19.01 37.98 Transformer 40.90 18.02 37.17 BERTSUM+Classifier 43.23 20.22 39.60 BERTSUM+Transformer 43.25 20.24 39.63 BERTSUM+LSTM 43.22 20.17 39.59

Table 1: Test set results on the CNN/DailyMail dataset using ROUGE F1. Results with ∗ mark are taken from the corresponding papers. each document. The algorithm greedily select sen- • DCA (Celikyilmaz et al., 2018) is the Deep tences which can maximize the ROUGE scores as Communicating Agents, a state-of-the-art ab- the oracle sentences. We assigned label 1 to sen- stractive summarization system with multi- tences selected in the oracle summary and 0 other- ple agents to represent the document as well wise. as hierarchical attention mechanism over the agents for decoding. 4 Experimental Results As illustrated in the table, all BERT-based mod- The experimental results on CNN/Dailymail els outperformed previous state-of-the-art models datasets are shown in Table 1. For comparison, by a large margin. BERTSUM with Transformer we implement a non-pretrained Transformer base- achieved the best performance on all three met- line which uses the same architecture as BERT, but rics. The BERTSUM with LSTM model does not with smaller parameters. It is randomly initial- have an obvious influence on the summarization ized and only trained on the summarization task. performance compared to the Classifier model. The Transformer baseline has 6 layers, the hidden Ablation studies are conducted to show the con- size is 512 and the feed-forward filter size is 2048. tribution of different components of BERTSUM. The model is trained with same settings follow- The results are shown in in Table 2. Interval seg- ing Vaswani et al.(2017). We also compare our ments increase the performance of base model. model with several previously proposed systems. Trigram blocking is able to greatly improve the summarization results. This is consistent to pre- • LEAD is an extractive baseline which uses the vious conclusions that a sequential extractive de- first-3 sentences of the document as a sum- coder is helpful to generate more informative sum- mary. maries. However, here we use the trigram block- ing as a simple but robust alternative. • REFRESH (Narayan et al., 2018) is an extrac- tive summarization system trained by glob- Model R-1 R-2 R-L ally optimizing the ROUGE metric with rein- BERTSUM+Classifier 43.23 20.22 39.60 forcement learning. -interval segments 43.21 20.17 39.57 -trigram blocking 42.57 19.96 39.04 • NEUSUM (Zhou et al., 2018) is the state-of- Table 2: Results of ablation studies of BERTSUM on the-art extractive system that jontly score and CNN/Dailymail test set using ROUGE F1 (R-1 and R- select sentences. 2 are shorthands for unigram and overlap, R-L is the longest common subsequence). • PGN (See et al., 2017), is the Pointer Gener- ator Network, an abstractive summarization The experimental results on NYT datasets are system based on an encoder-decoder archi- shown in Table 3. Different from CNN/Dailymail, tecture. we use the limited-length recall evaluation, fol- lowing Durrett et al.(2016). We truncate the pre- Jianpeng Cheng and Mirella Lapata. 2016. Neural dicted summaries to the lengths of the gold sum- summarization by extracting sentences and words. maries and evaluate summarization quality with In Proceedings of the ACL Conference. ROUGE Recall. Compared baselines are (1) First- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and k words, which is a simple baseline by extract- Kristina Toutanova. 2018. Bert: Pre-training of deep ing first k words of the input article; (2) Full bidirectional transformers for language understand- is the best-performed extractive model in Dur- ing. rett et al.(2016); (3) Deep Reinforced (Paulus Yue Dong, Yikang Shen, Eric Crawford, Herke van et al., 2018) is an abstractive model, using rein- Hoof, and Jackie Chi Kit Cheung. 2018. Banditsum: force learning and encoder-decoder structure. The Extractive summarization as a contextual bandit. In Proceedings of the EMNLP Conference BERTSUM+Classifier can achieve the state-of-the- . art results on this dataset. Greg Durrett, Taylor Berg-Kirkpatrick, and Dan Klein. 2016. Learning-based single-document summariza- Model R-1 R-2 R-L tion with compression and anaphoricity constraints. First-k words 39.58 20.11 35.78 In Proceedings of the ACL Conference. Full∗ 42.2 24.9 - Deep Reinforced∗ 42.94 26.02 - Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Su- BERTSUM+Classifier 46.66 26.35 42.62 leyman, and Phil Blunsom. 2015. Teaching ma- chines to read and comprehend. In Advances in Neu- Table 3: Test set results on the NYT50 dataset using ral Information Processing Systems, pages 1693– ROUGE Recall. The predicted summary are truncated 1701. to the length of the gold-standard summary. Results with ∗ mark are taken from the corresponding papers. Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M Rush. 2017. Opennmt: Open-source toolkit for neural . In arXiv preprint arXiv:1701.02810. 5 Conclusion Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. In this paper, we explored how to use BERT for Summarunner: A recurrent neural network based se- extractive summarization. We proposed the BERT- quence model for extractive summarization of docu- ments. In Proceedings of the AAAI Conference. SUM model and tried several summarization layers can be applied with BERT. We did experiments Shashi Narayan, Shay B Cohen, and Mirella Lapata. on two large-scale datasets and found the BERT- 2018. Ranking sentences for extractive summariza- SUM with inter-sentence Transformer layers can tion with reinforcement learning. In Proceedings of achieve the best performance. the NAACL Conference. Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive sum- References marization. In Proceedings of the ICLR Conference.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hin- Evan Sandhaus. 2008. The New York Times Annotated ton. 2016. Layer normalization. arXiv preprint Corpus. Linguistic Data Consortium, Philadelphia, arXiv:1607.06450. 6(12).

Jaime G Carbonell and Jade Goldstein. 1998. The use Abigail See, Peter J. Liu, and Christopher D. Manning. of mmr and diversity-based reranking for reodering 2017. Get to the point: Summarization with pointer- documents and producing summaries. generator networks. In Proceedings of the ACL Con- ference. Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob abstractive summarization. In Proceedings of the Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz NAACL Conference. Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Pro- Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin cessing Systems, pages 5998–6008. Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Mike Schuster, Zhifeng Chen, Xingxing Zhang, Mirella Lapata, Furu Wei, and Ming et al. 2018. The best of both worlds: Combining Zhou. 2018. Neural latent extractive document sum- recent advances in neural machine translation. In marization. In Proceedings of the EMNLP Confer- Proceedings of the ACL Conference. ence. Qingyu Zhou, Nan Yang, Furu Wei, Shaohan Huang, Ming Zhou, and Tiejun Zhao. 2018. Neural docu- ment summarization by jointly learning to score and select sentences. In Proceedings of the ACL Confer- ence.