Semantic Matching for Sequence-to-Sequence Learning
Ruiyi Zhang¹, Changyou Chen², Xinyuan Zhang³, Ke Bai¹, Lawrence Carin¹
¹Duke University, ²State University of New York at Buffalo, ³ASAPP Inc.
[email protected]

Abstract

In sequence-to-sequence models, classical optimal transport (OT) can be applied to semantically match generated sentences with target sentences. However, in non-parallel settings, target sentences are usually unavailable. To tackle this issue without losing the benefits of classical OT, we present a semantic matching scheme based on Optimal Partial Transport (OPT). Specifically, our approach partially matches semantically meaningful words between source and partial target sequences. To overcome the difficulty of detecting active regions in OPT (corresponding to the words that need to be matched), we further exploit prior knowledge to perform the partial matching. Extensive experiments are conducted to evaluate the proposed approach, showing consistent improvements on sequence-to-sequence tasks.

1 Introduction

Sequence-to-sequence (Seq2Seq) models are widely used in various natural-language-processing tasks, such as machine translation (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015), text summarization (Rush et al., 2015; Chopra et al., 2016) and image captioning (Vinyals et al., 2015; Xu et al., 2015). Typically, these models are based on an encoder-decoder architecture, with an encoder mapping a source sequence into a latent vector and a decoder translating the latent vector into a target sequence. The goal of a Seq2Seq model is to optimize this encoder-decoder network to generate sequences close to the target; a proper measure of the distance between sequences is therefore crucial for model training.
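As context for the models being discussed, below is a minimal sketch of such an encoder-decoder network, a generic GRU-based Seq2Seq in PyTorch; the layer types and sizes are illustrative choices, not the specific architectures used in this paper:

```python
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder-decoder: source token ids -> latent vector -> target logits."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # Encoder compresses the source sequence into a latent vector h.
        _, h = self.encoder(self.emb(src_ids))
        # Decoder unrolls from h along the (teacher-forced) target sequence.
        dec_out, _ = self.decoder(self.emb(tgt_ids), h)
        return self.out(dec_out)  # per-step vocabulary logits
```

The sequence-level distances discussed next act as training losses on the output of such a network.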
The Wasserstein distance between two text sequences, i.e., the word-mover distance (Kusner et al., 2015), can serve as an effective regularizer for semantic matching in Seq2Seq models (Chen et al., 2019). Classical optimal transport models, however, require that each piece of mass in the source distribution be transported to an equal-weight piece of mass in the target distribution. This requirement is too restrictive for Seq2Seq models, making direct application inappropriate for the following reasons: (i) texts often have different lengths, and not every element in the source text corresponds to an element in the target text; a good example is style transfer, where some words in the source text have no corresponding words in the target text. (ii) It is reasonable to semantically match important words while neglecting some other words, e.g., conjunctions. Moreover, in typical unsupervised models, text data are non-parallel in the sense that pairwise data are unavailable (Sutskever et al., 2014). Thus, both pairwise-information inference and text generation must be performed in the same model with only non-parallel data, and classical OT is not applicable without target text sequences. However, partial target information is often available: for example, the objects detected in an image should be described in its caption, and the content words should be preserved when changing the style of a sentence. Classical OT fails in these cases, but matching can still be performed with optimal partial transport (OPT). Specifically, we exploit the partial target information by partially matching it with the generated texts, where the partial matching is implemented based on lexical information extracted from the texts. We call our method SEmantic PArtial Matching (SEPAM).

To demonstrate the effectiveness of SEPAM, we consider applying it to sequence-prediction tasks where semantic partial matching is needed: (i) in unsupervised text-style transfer, SEPAM can be employed for content preservation via partially matching the input and generated text; (ii) in image captioning, SEPAM can be applied to partially match the objects detected in images with the corresponding captions, for more informative generation; (iii) in table-to-text generation, SEPAM can prevent hallucination (Dhingra et al., 2019) via partially matching tabular key words with generated sentences.

The main contributions of this paper are summarized as follows: (i) A novel semantic matching scheme based on optimal partial transport is proposed. (ii) Our model can be interpreted as incorporating prior knowledge into optimal transport to exploit the structure of natural language, while making the algorithm tractable for real-world tasks. (iii) To demonstrate the versatility of the proposed scheme, we empirically show consistent improvements in style transfer for content preservation, in image captioning for informative image descriptions, and in table-to-text generation for faithful generation.

2 Background

2.1 Optimal Transport

Optimal transport defines distances between probability measures on a domain $\mathcal{X}$ (the word-embedding space in our setting). The optimal transport distance for two probability measures $\mu$ and $\nu$ is defined as (Peyré et al., 2017):

$$\mathcal{D}_c(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \mathbb{E}_{(x,y) \sim \gamma}\left[c(x, y)\right], \tag{1}$$

where $\Pi(\mu, \nu)$ denotes the set of all joint distributions $\gamma(x, y)$ with marginals $\mu(x)$ and $\nu(y)$, and $c(x, y): \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is the cost function for moving $x$ to $y$, e.g., the Euclidean or cosine distance. Intuitively, the optimal transport distance is the minimum cost that $\gamma$ induces in order to transport from $\mu$ to $\nu$. When $c(x, y)$ is a metric on $\mathcal{X}$, $\mathcal{D}_c(\mu, \nu)$ induces a proper metric on the space of probability distributions supported on $\mathcal{X}$, commonly known as the Wasserstein distance (Villani, 2008).

We focus on applying the OT distance to textual data, and therefore only consider OT between discrete distributions. Specifically, consider two discrete distributions $\mu, \nu \in \mathcal{P}(\mathcal{X})$, which can be represented as $\mu = \sum_{i=1}^{n} u_i \delta_{x_i}$ and $\nu = \sum_{j=1}^{m} v_j \delta_{y_j}$, with $\delta_x$ the Dirac function centered on $x$. The weight vectors $u = \{u_i\}_{i=1}^{n} \in \Delta_n$ and $v = \{v_j\}_{j=1}^{m} \in \Delta_m$ belong to the simplex, i.e., $\sum_{i=1}^{n} u_i = \sum_{j=1}^{m} v_j = 1$, as both $\mu$ and $\nu$ are probability distributions. Under such a setting, computing the OT distance defined in (1) can be reformulated as the following minimization problem:

$$W_c(\mu, \nu) = \min_{T \in \Pi(u, v)} \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij} \cdot c(x_i, y_j) = \min_{T \in \Pi(u, v)} \langle T, C \rangle, \tag{2}$$

where $\Pi(u, v) = \{T \in \mathbb{R}_+^{n \times m} \mid T \mathbf{1}_m = u, \; T^\top \mathbf{1}_n = v\}$, $\mathbf{1}_n$ denotes an $n$-dimensional all-one vector, $C$ is the cost matrix given by $C_{ij} = c(x_i, y_j)$, and $\langle T, C \rangle = \mathrm{Tr}(T^\top C)$ represents the Frobenius dot-product.
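To make the discrete formulation (2) concrete, here is a minimal NumPy sketch that builds a cosine cost matrix between two sets of word embeddings and approximates the OT distance. The entropy-regularized Sinkhorn iteration is a standard approximation to the exact minimization in (2), not necessarily the solver used in the paper, and the random embeddings and uniform weights are placeholders for real word vectors:

```python
import numpy as np

def cosine_cost(X, Y):
    """Cost matrix C_ij = 1 - cos(x_i, y_j) between two embedding sets."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return 1.0 - Xn @ Yn.T

def sinkhorn_ot(u, v, C, eps=0.1, n_iters=200):
    """Entropy-regularized approximation of (2): min_{T in Pi(u,v)} <T, C>."""
    K = np.exp(-C / eps)             # Gibbs kernel
    a = np.ones_like(u)
    for _ in range(n_iters):         # alternating marginal projections
        b = v / (K.T @ a)
        a = u / (K @ b)
    T = a[:, None] * K * b[None, :]  # transport plan with marginals (u, v)
    return np.sum(T * C), T

# Toy example: "sentences" of 5 and 7 words in a 50-d embedding space.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(5, 50)), rng.normal(size=(7, 50))
u, v = np.full(5, 1 / 5), np.full(7, 1 / 7)  # uniform word weights
dist, T = sinkhorn_ot(u, v, cosine_cost(X, Y))
print(f"approximate OT distance: {dist:.4f}")
```

With uniform weights over words, this quantity is essentially the word-mover distance of Kusner et al. (2015) mentioned above.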
2.2 Optimal Partial Transport

Optimal partial transport (OPT) was initially studied by Caffarelli and McCann (2010). It is a variant of optimal transport in which only a portion of the mass is transported, in an efficient way. In OPT, the transport problem is defined by generalizing $\gamma$ to a Borel measure:

$$\mathcal{D}_c(\mu, \nu) = \inf_{\gamma \in \Pi_{\leq}(f, g), \; \mathcal{M}(\gamma) = m} \int c(x, y) \, \mathrm{d}\gamma(x, y), \tag{3}$$

where $\Pi_{\leq}(f, g)$ is defined as the set of nonnegative finite Borel measures on $\mathbb{R}^n \times \mathbb{R}^n$ whose first and second marginals are dominated by $f$ and $g$ respectively, i.e., $\gamma(A \times \mathbb{R}^n) \leq \int_A f(x)\,\mathrm{d}x$ and $\gamma(\mathbb{R}^n \times A) \leq \int_A g(y)\,\mathrm{d}y$ for all $A \subset \mathbb{R}^n$; $\mathcal{M}(\gamma) = \gamma(\mathbb{R}^n \times \mathbb{R}^n)$ represents the mass of $\gamma$ in (3); and $m \in [0, \min\{\|f\|_{L^1}, \|g\|_{L^1}\}]$. Here $f$ and $g$ can be considered the maximum marginal measures for $\gamma$. As a result, if $m$ is less than $\min\{\|f\|_{L^1}, \|g\|_{L^1}\}$, then $\gamma$ assigns zero measure to some elements of the space; in other words, the zero-measure elements need not be considered when matching $\mu$ and $\nu$. The elements with non-zero measure are called active regions. A challenge in OPT is how to detect these active regions, and directly optimizing (3) is typically very challenging and computationally expensive. In our setting of text analysis, we propose to leverage prior knowledge to define the active regions, as introduced below.

3 Semantic Matching via OPT

In unsupervised Seq2Seq tasks without pairwise information, naively matching the generated sequence with the weak-supervision information (e.g., the source text in style transfer) renders deficient performance, even though both sentences share similar content. In supervised settings such as table-to-text generation, target and input sequences are of different lengths but are similar in terms of semantic meaning. Motivated by this, we propose a novel technique for semantic partial matching and consider two scenarios: (i) text-to-text matching and (ii) image-to-text matching.

[Figure 1: Semantic Matching between the potential target (top) and the generated texts (bottom).]
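As a sketch of how prior lexical knowledge can stand in for the active regions of (3), the snippet below restricts matching to content words selected by part-of-speech tags and then runs full OT on the restricted sets, reusing cosine_cost and sinkhorn_ot from the earlier sketch. The tag set and the embed lookup are illustrative assumptions, not necessarily the rules or features used by SEPAM:

```python
import numpy as np

# Assumed prior: Penn Treebank tags for nouns, verbs and adjectives mark "active" words.
CONTENT_TAGS = {"NN", "NNS", "NNP", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "JJ"}

def active_region(tokens, tags, embed):
    """Keep only content words; function words (e.g., conjunctions) get zero mass."""
    kept = [tok for tok, tag in zip(tokens, tags) if tag in CONTENT_TAGS]
    X = np.stack([embed(tok) for tok in kept])  # embeddings of kept words
    w = np.full(len(kept), 1.0 / len(kept))     # renormalize mass over kept words
    return X, w

def partial_match_loss(src_tokens, src_tags, gen_tokens, gen_tags, embed):
    """Partial matching as full OT over the prior-defined active regions."""
    Xs, ws = active_region(src_tokens, src_tags, embed)
    Xg, wg = active_region(gen_tokens, gen_tags, embed)
    dist, _ = sinkhorn_ot(ws, wg, cosine_cost(Xs, Xg))  # from the sketch above
    return dist
```

Restricting both sides to their active regions reduces the partial-transport problem over all words to an ordinary balanced OT problem over the words that should actually be matched.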