Conditional Self-Attention for Query-based Summarization
Yujia Xie∗ (Georgia Tech)   Tianyi Zhou (University of Washington)   Yi Mao (Microsoft)   Weizhu Chen (Microsoft)
[email protected] [email protected] [email protected] [email protected]
∗Work done during an internship at Microsoft.
Abstract

Self-attention mechanisms have achieved great success on a variety of NLP tasks due to their flexibility in capturing dependencies between arbitrary positions in a sequence. For problems such as query-based summarization (Qsumm) and knowledge graph reasoning, where each input sequence is associated with an extra query, explicitly modeling such conditional contextual dependencies can lead to a more accurate solution, which however cannot be captured by existing self-attention mechanisms. In this paper, we propose conditional self-attention (CSA), a neural network module designed for conditional dependency modeling. CSA works by adjusting the pairwise attention between input tokens in a self-attention module with the matching score of the inputs to the given query. Thereby, the contextual dependencies modeled by CSA will be highly relevant to the query. We further study variants of CSA defined by different types of attention. Experiments on the Debatepedia and HotpotQA benchmark datasets show that CSA consistently outperforms the vanilla Transformer and previous models for the Qsumm problem.

1 Introduction

Contextual dependency is believed to provide critical information in a variety of NLP tasks. Among the popular neural network structures, convolution is powerful in capturing local dependencies, while LSTM is good at modeling distant relations. Self-attention has recently achieved great success in NLP tasks due to its flexibility of relating two elements in a distance-agnostic manner. For each element, it computes a categorical distribution that reflects the dependency of that element on each of the other elements in the same sequence. The element's context-aware embedding is then computed as a weighted average of the other elements with these probabilities. Hence, self-attention is powerful for encoding pairwise relationships into contextual representations.

However, higher-level language understanding often relies on more complicated dependencies than the pairwise one. One example is the conditional dependency, which measures how two elements are related given a premise. In NLP tasks such as query-based summarization and knowledge graph reasoning, where inputs are equipped with extra queries or entities, knowing the dependencies conditioned on the given query or entity is extremely helpful for extracting meaningful representations. Moreover, conditional dependencies can be used to build a large relational graph covering all elements, and to represent higher-order relations among multiple (>2) elements. Hence, they are more expressive than the pairwise dependencies modeled by self-attention mechanisms.

In this paper, we develop conditional self-attention (CSA) as a versatile module to capture the conditional dependencies within an input sequence. CSA is a composite function that applies a cross-attention mechanism inside an ordinary self-attention module. Given two tokens from the input sequence, it first computes a dependency score for each token with respect to the condition by cross-attention, then scales the tokens by the computed condition-dependency scores, and applies self-attention to the scaled tokens. In this way, CSA is capable of selecting tokens that are highly correlated with the condition and guiding contextual embedding generation towards those tokens. In addition, when applied to a highly-related token and a loosely-related token, CSA reduces to measuring the global dependency (i.e., single-token importance) of the former.
Hence, CSA can capture both local (e.g., pairwise) and global dependencies in the same module, and automatically switch between them according to the condition. In contrast, previous works usually need two

[Figure: Position-wise Feed-forward Networks (PFN)]
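To make the three-step mechanism described above concrete (cross-attention scoring against the condition, scaling tokens by the scores, and self-attention over the scaled tokens), here is a minimal sketch in Python/NumPy. The function name `conditional_self_attention`, the scaled dot-product scoring against the condition vector, and the projection matrices `Wq`, `Wk`, `Wv` are illustrative assumptions, not the paper's exact parameterization, which defines specific CSA variants by different types of attention.

```python
# Minimal sketch of conditional self-attention (CSA), assuming:
#   (1) cross-attention scores each token against the condition (query),
#   (2) tokens are scaled by their condition-dependency scores,
#   (3) ordinary scaled dot-product self-attention runs on the scaled tokens.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def conditional_self_attention(X, c, Wq, Wk, Wv):
    """
    X  : (n, d) token embeddings of the input sequence
    c  : (d,)   embedding of the condition (e.g., the query in Qsumm)
    Wq, Wk, Wv : (d, d) projections for the self-attention step
    Returns (n, d) condition-aware contextual embeddings.
    """
    n, d = X.shape

    # (1) Cross-attention: condition-dependency score p_i for each token
    #     (a scaled dot product with the condition vector is assumed here).
    p = softmax(X @ c / np.sqrt(d))             # (n,)

    # (2) Scale each token by its condition-dependency score.
    X_scaled = p[:, None] * X                   # (n, d)

    # (3) Standard scaled dot-product self-attention on the scaled tokens.
    Q, K, V = X_scaled @ Wq, X_scaled @ Wk, X_scaled @ Wv
    A = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (n, n) pairwise attention
    return A @ V                                # (n, d)

# Example usage with random inputs (shapes only; weights would be learned).
rng = np.random.default_rng(0)
n, d = 6, 16
X = rng.standard_normal((n, d))
c = rng.standard_normal(d)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = conditional_self_attention(X, c, Wq, Wk, Wv)
print(out.shape)  # (6, 16)
```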