Conditional Self-Attention for Query-Based Summarization
Yujia Xie* (Georgia Tech), Tianyi Zhou (University of Washington), Yi Mao (Microsoft), Weizhu Chen (Microsoft)
[email protected], [email protected], [email protected], [email protected]
arXiv:2002.07338v1 [cs.CL] 18 Feb 2020

*Work done during an internship at Microsoft.

Abstract

Self-attention mechanisms have achieved great success on a variety of NLP tasks due to their flexibility in capturing dependencies between arbitrary positions in a sequence. For problems such as query-based summarization (Qsumm) and knowledge graph reasoning, where each input sequence is associated with an extra query, explicitly modeling such conditional contextual dependencies can lead to a more accurate solution, yet these dependencies cannot be captured by existing self-attention mechanisms. In this paper, we propose conditional self-attention (CSA), a neural network module designed for conditional dependency modeling. CSA works by adjusting the pairwise attention between input tokens in a self-attention module with the matching score of the inputs to the given query. Thereby, the contextual dependencies modeled by CSA will be highly relevant to the query. We further study variants of CSA defined by different types of attention. Experiments on the Debatepedia and HotpotQA benchmark datasets show that CSA consistently outperforms the vanilla Transformer and previous models on the Qsumm problem.

1 Introduction

Contextual dependency is believed to provide critical information in a variety of NLP tasks. Among the popular neural network structures, convolution is powerful at capturing local dependencies, while LSTM is good at modeling distant relations. Self-attention has recently achieved great success in NLP tasks due to its flexibility of relating two elements in a distance-agnostic manner. For each element, it computes a categorical distribution that reflects the dependency of that element on each of the other elements from the same sequence. The element's context-aware embedding is then computed by weighted averaging of the other elements with these probabilities. Hence, self-attention is powerful for encoding pairwise relationships into contextual representations.

However, higher-level language understanding often relies on more complicated dependencies than pairwise ones. One example is the conditional dependency, which measures how two elements are related given a premise. In NLP tasks such as query-based summarization and knowledge graph reasoning, where inputs are equipped with extra queries or entities, knowing the dependencies conditioned on the given query or entity is extremely helpful for extracting meaningful representations. Moreover, conditional dependencies can be used to build a large relational graph covering all elements, and to represent higher-order relations among multiple (>2) elements. Hence, it is more expressive than the pairwise dependency modeled by self-attention mechanisms.

In this paper, we develop conditional self-attention (CSA) as a versatile module to capture the conditional dependencies within an input sequence. CSA is a composite function that applies a cross-attention mechanism inside an ordinary self-attention module. Given two tokens from the input sequence, it first computes a dependency score for each token with respect to the condition by cross-attention, then scales the tokens by the computed condition-dependency scores, and applies self-attention to the scaled tokens. In this way CSA is capable of selecting tokens that are highly correlated to the condition, and of guiding contextual embedding generation towards those tokens. In addition, when applied to a highly-related token and a loosely-related token, CSA reduces to measuring the global dependency (i.e., single-token importance) of the former. Hence, CSA can capture both local (e.g., pairwise) and global dependency in the same module, and automatically switch between them according to the condition. In contrast, previous works usually need two modules to capture these two types of information.
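To preview how these three steps compose before the precise formulation in Section 3, the following is a toy NumPy sketch; the dot-product scoring, dimensions, and random inputs are illustrative assumptions rather than the model used in the paper.

```python
# Toy preview of the CSA computation described above; all concrete choices here
# (dot-product scoring, sizes, random inputs) are illustrative, not the paper's
# exact parameterization (see Section 3 for that).
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))      # five token embeddings of size 8
c = rng.normal(size=8)           # condition (e.g., query) representation

p = softmax(x @ c)               # 1) cross-attention: dependency of each token on c
h = p[:, None] * x               # 2) scale tokens by their condition-dependency scores
attn = np.stack([softmax(row) for row in h @ h.T])
u = attn @ h                     # 3) self-attention over the scaled tokens
print(u.shape)                   # (5, 8): condition-aware contextual embeddings
```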
We then apply CSA to the query-based summarization (Qsumm) task. Qsumm is a challenging task in which accurate retrieval of relevant contexts and thorough comprehension are both critical for producing high-quality summaries. Given a query, the model tries to generate from a passage a summary pertaining to the query, in either an extractive (i.e., selected from the passage) or abstractive (i.e., generated by machine) manner [1]. We build CSA-Transformer, an attention-only Qsumm model, in Section 4 by introducing CSA into the Transformer architecture (Vaswani et al., 2017). On Debatepedia and HotpotQA [2], CSA-Transformer significantly outperforms baselines on both extractive and abstractive tasks.

[1] Qsumm is related to Question Answering (QA), but its output contains more background information and explanation than the succinct answer returned by QA.
[2] HotpotQA is originally built for multi-hop QA, but the provided supporting facts can be used as targets for Qsumm.

[Figure 1: CSA module. ⊗ is entry-wise multiplication, × is matrix multiplication, + is entry-wise addition, $g_i^{(1)} = W_{sa}^{(1)} h_i$ and $g_j^{(2)} = W_{sa}^{(2)} h_j$ for $i, j \in [n]$ as in Eq. (6). A diagonal mask disables the attention of each token/block to itself before applying softmax.]

2 Background

Attention. Given a sequence $x = [x_1, \dots, x_n]$ (e.g., word embeddings) with $x_i \in \mathbb{R}^{d_e}$ and a context vector $c \in \mathbb{R}^{d_c}$ (e.g., the representation of another sequence or token), attention (Bahdanau et al., 2015) computes an alignment score between $c$ and each $x_i$ by a compatibility function $f(x_i, c)$. A softmax function then transforms the alignment scores $a \in \mathbb{R}^n$ into a categorical distribution $p(z \mid x, c)$ defined as
$$a = [f(x_i, c)]_{i=1}^{n}, \qquad (1)$$
$$p(z = i \mid x, c) = \mathrm{softmax}(a)[i], \quad \forall i \in [n], \qquad (2)$$
where a larger $p(z = i \mid x, c)$ implies that $x_i$ is more relevant to $c$. A context-aware representation $u$ of $x$ is obtained by computing the expectation of sampling a token according to $p(z \mid x, c)$, i.e.,
$$u = \sum_{i=1}^{n} p(z = i \mid x, c)\, x_i = \mathbb{E}_{i \sim p(z \mid x, c)}\left(x_i\right). \qquad (3)$$
Multiplicative attention (Vaswani et al., 2017; Sukhbaatar et al., 2015; Rush et al., 2015) and additive attention (multi-layer perceptron attention) (Bahdanau et al., 2015; Shang et al., 2015) are two commonly used attention mechanisms, with compatibility functions $f(\cdot, \cdot)$ defined respectively as
$$f(x_i, c) = \left\langle W^{(1)} x_i,\, W^{(2)} c \right\rangle, \qquad (4)$$
$$f(x_i, c) = w^{T} \sigma\!\left(W^{(1)} x_i + W^{(2)} c + b\right) + b, \qquad (5)$$
where $W^{(1)} \in \mathbb{R}^{d_W \times d_e}$, $W^{(2)} \in \mathbb{R}^{d_W \times d_c}$, $w \in \mathbb{R}^{d_W}$, and the bias terms $b$ are learnable parameters, and $\sigma(\cdot)$ is an activation function.
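As a concrete illustration of Eq. (1)-(5), the following is a minimal NumPy sketch; the choice of tanh for $\sigma$, the dimensions, and the random weights are illustrative assumptions standing in for learned parameters.

```python
# Minimal sketch of Eq. (1)-(5): alignment scores, the categorical distribution
# p(z | x, c), and the context-aware representation u, with both the multiplicative
# and the additive compatibility functions.
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def multiplicative_score(x_i, c, W1, W2):             # Eq. (4)
    return (W1 @ x_i) @ (W2 @ c)

def additive_score(x_i, c, W1, W2, b, w, b_out):      # Eq. (5), sigma = tanh
    return w @ np.tanh(W1 @ x_i + W2 @ c + b) + b_out

def attend(x, c, score):
    a = np.array([score(x_i, c) for x_i in x])        # Eq. (1)
    p = softmax(a)                                    # Eq. (2)
    return p @ x                                      # Eq. (3): u = sum_i p_i x_i

rng = np.random.default_rng(0)
n, d_e, d_c, d_W = 6, 8, 10, 16
x = rng.normal(size=(n, d_e))                         # token embeddings
c = rng.normal(size=d_c)                              # context vector
W1 = rng.normal(size=(d_W, d_e))
W2 = rng.normal(size=(d_W, d_c))
b, w, b_out = rng.normal(size=d_W), rng.normal(size=d_W), 0.0

u_mult = attend(x, c, lambda xi, cc: multiplicative_score(xi, cc, W1, W2))
u_add = attend(x, c, lambda xi, cc: additive_score(xi, cc, W1, W2, b, w, b_out))
print(u_mult.shape, u_add.shape)                      # both are (d_e,) = (8,)
```

As the next paragraph notes, removing $c$ from these equations yields the S2T variant, while replacing $c$ with another token $x_j$ yields T2T self-attention.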
Self-attention (SA) is a variant of attention that models the pairwise relationships between tokens from the same sequence. One line of work, a.k.a. Token2Token self-attention (T2T) (Hu et al., 2017; Shen et al., 2017), produces a context-aware representation for each token $x_j$ based on its dependency on the other tokens $x_i$ within $x$, by replacing $c$ in Eq. (1)-(3) with $x_j$. Notably, in the Transformer (Vaswani et al., 2017), multi-head attention computes $u$ in multiple subspaces of $x$ and $c$, and concatenates the results as the final representation. Source2Token self-attention (S2T) (Lin et al., 2017; Shen et al., 2017; Liu et al., 2016), on the other hand, explores the per-token importance with respect to a specific task by removing $c$ from Eq. (1)-(3). Its output $u$ is an average of all tokens weighted by their corresponding importance.

3 Conditional Self-Attention

Conditional self-attention computes the self-attention score between any two tokens $x_i$ and $x_j$ based on the representation $c$ of a given condition. Specifically, to incorporate the conditional information, CSA first applies a cross-attention module (multiplicative or additive attention) to compute the dependency scores of $x_i$ and $x_j$ on the query $c$ by Eq. (1)-(2), i.e., $p_i \triangleq p(z = i \mid x, c)$ and $p_j \triangleq p(z = j \mid x, c)$. Inputs $x_i$ and $x_j$ are then scaled by $p_i$ and $p_j$ to obtain $h_i \triangleq p_i x_i$ and $h_j \triangleq p_j x_j$. Finally, an additive self-attention is applied to $h_i$ and $h_j$ with compatibility function
$$f_{csa}(x_i, x_j \mid c) \triangleq w_{sa}^{T} \sigma\!\left(W_{sa}^{(1)} h_i + W_{sa}^{(2)} h_j + b_{sa}\right) + b_{sa}, \qquad (6)$$
resulting in the context-aware embedding $u_j$ for $x_j$:
$$u_j = \sum_{i=1}^{n} \mathrm{softmax}\!\left([f_{csa}(h_i, h_j)]_{i=1}^{n}\right)[i] \times h_i. \qquad (7)$$
We extend the above model by using the multi-head mechanism and position-wise feed-forward networks (PFN) proposed in Vaswani et al. (2017): Eq. (6)-(7) are applied to $K$ subspaces of $h$, $[\Theta^{(k)} h]_{k=1}^{K}$, whose outputs $[u_j^{(k)}]_{k=1}^{K}$ are then concatenated and processed by a linear projection (with $w^{head} \in \mathbb{R}^{K}$) and a PFN layer:
$$u_j = \mathrm{PFN}\!\left(\left[u_j^{(1)}, u_j^{(2)}, \cdots, u_j^{(K)}\right] w^{head}\right). \qquad (8)$$

[Figure 2: CSA-Transformer with block self-attention. Components shown: encoder of the query (self-attention layers, S2T self-attention), encoder of the passage (per-block self-attention layers and S2T self-attention), a CSA/MD-CSA layer, the decoder, and the summary.]

4 Query-based Summarization Model

In Qsumm, each data instance is a triplet $(x, q, s)$ consisting of a passage, a query, and a target summary. [...] from the condition in CSA.

Conditional self-attention layer. Given the encoded query $c$ and the encoded passage $v$, a conditional self-attention module as in Figure 1 (with input $x = v$) produces a sequence of context-and-condition-aware representations $u$.

Decoder. Depending on whether the output summary is abstractive or extractive, we apply different decoders to $u$.
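Pulling Sections 3 and 4 together, the following is a minimal single-head NumPy sketch of the conditional self-attention layer in Eq. (6)-(7), applied to an encoded passage $v$ and an encoded query $c$ as described above. The multiplicative cross-attention, tanh activation, random weights, and stubbed encoder/decoder boundaries are illustrative assumptions; the multi-head projections and PFN of Eq. (8) are omitted for brevity.

```python
# Single-head conditional self-attention following Eq. (6)-(7), used the way the
# "Conditional self-attention layer" paragraph describes: encoded passage v and
# encoded query c go in, condition-aware representations u come out.
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conditional_self_attention(v, c, W1c, W2c, W1s, W2s, b_sa, w_sa):
    """v: (n, d) encoded passage; c: (d_c,) encoded query; returns u: (n, d)."""
    # Cross-attention to the condition, Eq. (1)-(2): p_i = p(z = i | v, c).
    p = softmax((v @ W1c.T) @ (W2c @ c))
    h = p[:, None] * v                                 # h_i = p_i * v_i
    # Additive self-attention compatibility, Eq. (6):
    # f[i, j] = w_sa^T tanh(W1s h_i + W2s h_j + b_sa)
    # (the outer scalar bias shifts every score equally, so it is dropped here).
    left, right = h @ W1s.T, h @ W2s.T                 # each (n, d_W)
    f = np.tanh(left[:, None, :] + right[None, :, :] + b_sa) @ w_sa   # (n, n)
    # Eq. (7): u_j = sum_i softmax_i(f[i, j]) * h_i.
    attn = softmax(f, axis=0)
    return attn.T @ h

rng = np.random.default_rng(0)
n, d, d_c, d_W = 7, 16, 16, 32
v = rng.normal(size=(n, d))          # stand-in for the passage encoder's output
c = rng.normal(size=d_c)             # stand-in for the query encoder's (S2T) output
u = conditional_self_attention(
    v, c,
    W1c=rng.normal(size=(d_W, d)), W2c=rng.normal(size=(d_W, d_c)),
    W1s=rng.normal(size=(d_W, d)), W2s=rng.normal(size=(d_W, d)),
    b_sa=rng.normal(size=d_W), w_sa=rng.normal(size=d_W),
)
print(u.shape)                       # (7, 16): context-and-condition-aware representations
# An extractive or abstractive decoder would consume u from here.
```

Per Section 3, the cross-attention stage may use either the multiplicative or the additive compatibility function from Eq. (4)-(5); swapping it in the sketch above only changes how $p$ is computed.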