Modeling Content Importance for Summarization with Pre-Trained

Modeling Content Importance for Summarization with Pre-trained Language Models Liqiang Xiao1, Lu Wang2, Hao He1;3,∗ Yaohui Jin1;3∗ 1MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University 2Computer Science and Engineering, University of Michigan 3State Key Lab of Advanced Optical Communication System and Network, Shanghai Jiao Tong University [email protected], [email protected] {hehao, jinyh}@sjtu.edu.cn Abstract the large-scale summarization datasets (Nallapati Modeling content importance is an essential et al., 2016; Narayan et al., 2018), data-driven ap- yet challenging task for summarization. Previ- proaches (Nallapati et al., 2017; Paulus et al., 2018; ous work is mostly based on statistical meth- Zhang et al., 2019) have made significant progress. ods that estimate word-level salience, which Yet most of them conduct the information selec- does not consider semantics and larger con- tion implicitly while generating the summaries. It text when quantifying importance. It is thus lacks theory support and is hard to be applied to hard for these methods to generalize to seman- low-resource domains. In another line of work, tic units of longer text spans. In this work, we apply information theory on top of pre- structure features (Zheng and Lapata, 2019), such trained language models and define the con- as centrality, position, and title, are employed as cept of importance from the perspective of in- proxies for importance. However, the information formation amount. It considers both the se- captured by these features can vary in texts of dif- mantics and context when evaluating the im- ferent genres. portance of each semantic unit. With the help To overcome this problem, theory-based meth- of pre-trained language models, it can eas- ods (Louis, 2014; Peyrard, 2019; Lin et al., 2006) ily generalize to different kinds of semantic aim to formalize the concept of importance, and units (n-grams or sentences). Experiments on CNN/Daily Mail and New York Times develop general-purpose systems by modeling the datasets demonstrate that our method can bet- background knowledge of readers. This is based ter model the importance of content than prior on the intuition that humans are good at identifying work based on F1 and ROUGE scores. important content by using their own interpreta- 1 Introduction and Related Work tion of the world knowledge. Theoretical models usually rely on information theory (IT) (Shannon, Text summarization aims to compress long docu- 1948). Louis(2014) uses Dirichlet distribution ment(s) into a concise summary while maintaining to represent the background knowledge and em- the salient information. It often consists of two crit- ploys Bayesian surprise to find novel information. ical subtasks, important information identification Peyrard(2019) instead models the importance with and natural language generation (for abstractive entropy, assuming the important words should be summarization). With the advancements of large frequent in the given document but rare in the back- pre-trained language models (PreTLMs) (Devlin ground. et al., 2019; Yang et al., 2019), state-of-the-art re- However, statistical method is only a rough eval- sults are achieved on both natural language under- uation for informativity, which largely ignores the standing and generation. However, it is still unclear effect of semantic and context. In fact, the infor- how well these large models can estimate “content mation amount of units is not only determined by importance” for a given document. frequency, but also by its semantic meaning, con- Previous studies for modeling importance are text, as well as reader’s background knowledge. In either empirical-based, which implicitly encode addition, bag-of-words approaches are difficult to importance during document summarization, or generalize beyond unigrams due to the sparsity of theory-based, which often lacks support by empiri- n-grams when n is large. cal experiments (Peyrard, 2019). Benefiting from In this paper, we propose a novel and general- ∗Corresponding author purpose approach to model content importance for 3606 Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 3606–3611, November 16–20, 2020. c 2020 Association for Computational Linguistics I(x3x4) P(x3) P(x4) I(x3x4) P(x3) P(x4) I(x3x4|·) Transformer summarization. We employ information theory on top of pre-trained language models, which are ex- PLMs pected to better capture the information amount of P(x3|·) P(x4|·) / semantic units by leveraging their meanings and MLMs PLMs P1 P2 P3 P4 context.P5 We argue/ that important content contains / ➕ ➕ ➕ ➕ information➕ ALMs that cannot be directly inferred from X1 X2 [M] [M]contextX5 and background knowledge. Large pre- MLMs trained language models are suitable for our study / X1 X2 [M] [M] X5 ALMs since they are trained from large-scaled datasets consisting of diverse documents and thus contain- ing a wide range of knowledge. X1 X2 [M] [M] X5 We conduct experiments on popular summa- Figure 1: Information amount evaluation with language rization benchmarks of CNN/Daily Mail and New models. Here we take a subsequence x3x4 as example. York Times corpora, where we show that our pro- [M] denotes mask and PLMs/MLMs/ALMs are three posed method can outperform prior importance different options for language models. I(x3x4|·) = estimation models. We further demonstrate that − log[P (x3|·)P (x4|·)], where conditions for different our method can be adapted to model semantic units models are omitted for brevity. of different scales (n-grams and sentences). P (x j · · · ; x ; x ; ··· ) 2 Methodology e.g., i i−1 i+1 , as the number of combinations of the context can be explosive. In this section, we first estimate the amount of information by using information theory with pre- 2.2 Using Language Models in Information trained language models (§2.1 and §2.2), where we Theory consider both the context and semantic meaning With the development of deep learning, neural lan- of a given text unit. We then propose a formal guage models can efficiently predict the probability definition of importance for text summarization of a specified unit, such as a word or a phrase, from a perspective of information amount (§2.3). given its context, which makes it feasible to calcu- late high-order approximation for the information 2.1 Information Theory amount. Information theory (IT), as invented by Shannon We thus propose to use neural language models (1948), has been used on words to quantify their to replace the statistical models for estimating the “informativity”. Concretely, IT uses the frequency information amount of a given semantic unit. Lan- of semantic units xi to approximate the probability guage models can be categorized as follows, and P (xi) and uses negative logarithm of frequency as we present information estimation method for each the measurement for information, which is called as shown in Fig.1. self-info1: Auto-regressive Language Model (ALM) (Ben- I(x ) = − log P (x ) (1) i 2 i gio et al., 2000) is the most commonly used prob- It approximates the information amount of a unit abilistic model to depict the distribution of lan- (e.g. word) in a given corpus. guage, which is usually referred as unidirection However, traditional IT suffers from the spar- LM (UniLM). Given a sequence of tokens x0:T = n sity problem of longer -grams and also ignores [x0; x1; ··· ; xT ], UniLMs use leftward content to semantics and context. Advanced compression al- estimate the conditional probability for each token: gorithms in IT (Hirschberg and Lelewer, 1992) at- P (xtjx<t) = gUniLM(x<t), where gUniLM denotes tempt to model the context to better estimate the a neural network for language model and x<t rep- information amount. But due to the sparsity, they resents the sequence from x0 to xt−1. Then the can only count up to third-order statistics. Statisti- joint probability of a subsequence is factorized as: cal methods are nearly impossible to reliably calcu- n Y P (x jx ) = P (x jx ) (2) late the probability of xi conditioned on its context, m:n <m t <t t=m 1The unit of information is “bit", with base of 2. In the After applying Eq. (1) to both sides of Eq. (2), rest of this paper, we omit base 2 for brevity. we can obtain the information amount of the subse- 3607 quence conditioned on its context as: low information amount. We further propose a n X notion of importance as the information amount I(x jx ) = I(x jx ) m:n <m t <t (3) conditional on the background knowledge: t=m Imp(x jX − x ;K) = − log P (x jX − x ) Masked Language Model (MLM) is proposed by i i LMK i i (6) Taylor(1953) and combined with pre-training by 2 where X − xi means the context excluding the Devlin et al.(2019) to encode bidirectional context. unit xi from input X and K denotes the knowledge MLM masks a certain number of tokens from the encoded in the pre-trained model. In practice, when input sequence, then predicts these tokens based calculating the importance of a semantic unit, we on the unmasked ones. The conditional proba- first exclude all its occurrences from the input doc- bility of a masked token xt can be estimated as: ument, and let the PreTLMs predict the probability P (xtjx6=t) = gMLM(x6=t), where 6= t indicates that of each occurrence, based on which the information the t-th token is masked. Information amount of a amount is calculated. As the same unit may appear given subsequence of the input is calculated as: at multiple positions in the input, summation is n X used as the final value of information amount. I(xm:njx2=[m:n]) = I(xtjx2=[m;n]) (4) t=m Based on our notion of importance, a summariza- Since MLMs encode both leftward and rightward tion model is to maximize the overall importance of context, intuitively, it can better estimate the infor- a subset x of the input X, with a length constraint, mation of current tokens than UniLMs.

Modeling Content Importance for Summarization with Pre-Trained

Evaluation of Machine Learning Algorithms for Sms Spam Filtering

NLP - Assignment 2

3 Dictionaries and Tolerant Retrieval

N-Gram-Based Machine Translation

Title and Link Description Improving Distributional Similarity With

Arxiv:2006.14799V2 [Cs.CL] 18 May 2021 the Same User Input

Word Senses and Wordnet Lady Bracknell

A Unigram Orientation Model for Statistical Machine Translation

Heafield, K. and A. Lavie. "Voting on N-Grams for Machine Translation

Part I Word Vectors I: Introduction, Svd and Word2vec 2 Natural Language in Order to Perform Some Task

Phrase Based Language Model for Statistical Machine Translation

Phrase-Based Attentions