Semantic Matching for Sequence-to-Sequence Learning

Ruiyi Zhang³, Changyou Chen†, Xinyuan Zhang‡, Ke Bai³, Lawrence Carin³
³Duke University, †State University of New York at Buffalo, ‡ASAPP Inc.
[email protected]

Abstract

In sequence-to-sequence models, classical optimal transport (OT) can be applied to semantically match generated sentences with target sentences. However, in non-parallel settings, target sentences are usually unavailable. To tackle this issue without losing the benefits of classical OT, we present a semantic matching scheme based on Optimal Partial Transport (OPT). Specifically, our approach partially matches semantically meaningful words between source and partial target sequences. To overcome the difficulty of detecting active regions in OPT (corresponding to the words that need to be matched), we further exploit prior knowledge to perform partial matching. Extensive experiments are conducted to evaluate the proposed approach, showing consistent improvements over sequence-to-sequence tasks.

1 Introduction

Sequence-to-sequence (Seq2Seq) models are widely used in various natural-language-processing tasks, such as machine translation (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015), text summarization (Rush et al., 2015; Chopra et al., 2016) and image captioning (Vinyals et al., 2015; Xu et al., 2015). Typically, these models are based on an encoder-decoder architecture, with an encoder mapping a source sequence into a latent vector, and a decoder translating the latent vector into a target sequence. The goal of a Seq2Seq model is to optimize this encoder-decoder network to generate sequences close to the target. Therefore, a proper measure of the distance between sequences is crucial for model training.

The Wasserstein distance between two text sequences, i.e., the word-mover distance (Kusner et al., 2015), can serve as an effective regularizer for semantic matching in Seq2Seq models (Chen et al., 2019). Classical optimal transport requires that each piece of mass in the source distribution is transported to an equal-weight piece of mass in the target distribution. However, this requirement is too restrictive for Seq2Seq models, making direct application inappropriate for the following reasons: (i) texts often have different lengths, and not every element in the source text corresponds to an element in the target text; a good example is style transfer, where some words in the source text do not have corresponding words in the target text. (ii) it is reasonable to semantically match important words while neglecting some other words, e.g., conjunctions. In typical unsupervised models, text data are non-parallel in the sense that pairwise data are unavailable (Sutskever et al., 2014). Thus, both pairwise-information inference and text generation must be performed in the same model with only non-parallel data. Classical OT is not applicable without target text sequences. However, partial target information is available; for example, the detected objects in an image should be described in its caption, and the content words should be preserved when changing the style. OT will fail in these cases, but matching can still be performed by optimal partial transport (OPT). Specifically, we exploit the partial target information by partially matching it with generated texts. The partial matching is implemented based on lexical information extracted from the texts. We call our method SEmantic PArtial Matching (SEPAM).

To demonstrate the effectiveness of SEPAM, we consider applying it to sequence-prediction tasks where semantic partial matching is needed: (i) in unsupervised text-style transfer, SEPAM can be employed for content preservation via partially matching the input and generated text; (ii) in image captioning, SEPAM can be applied to partially match the objects detected in images with the corresponding captions for more informative generation; (iii) in table-to-text generation, SEPAM can prevent hallucination (Dhingra et al., 2019) via partially matching tabular key words with generated sentences.

The main contributions of this paper are summarized as follows: (i) A novel semantic matching scheme based on optimal partial transport is proposed. (ii) Our model can be interpreted as incorporating prior knowledge into optimal transport to exploit the structure of natural language, while making the algorithm tractable for real-world tasks. (iii) In order to demonstrate the versatility of the proposed scheme, we empirically show consistent improvements in style transfer for content preservation, in image captioning for informative image descriptions, and in table-to-text generation for faithful generation.

2 Background

2.1 Optimal Transport

Optimal transport defines distances between probability measures on a domain X (the word-embedding space in our setting). The optimal transport distance for two probability measures µ and ν is defined as (Peyré et al., 2017):

    D_c(µ, ν) = inf_{γ∈Π(µ,ν)} E_{(x,y)∼γ}[c(x, y)] ,    (1)

where Π(µ, ν) denotes the set of all joint distributions γ(x, y) with marginals µ(x) and ν(y); c(x, y): X × X → R is the cost function for moving x to y, e.g., the Euclidean or cosine distance. Intuitively, the optimal transport distance is the minimum cost that γ induces in order to transport from µ to ν. When c(x, y) is a metric on X, D_c(µ, ν) induces a proper metric on the space of probability distributions supported on X, commonly known as the Wasserstein distance (Villani, 2008).

We focus on applying the OT distance to textual data. Therefore, we only consider OT between discrete distributions. Specifically, consider two discrete distributions µ, ν ∈ P(X), which can be represented as µ = Σ_{i=1}^n u_i δ_{x_i} and ν = Σ_{j=1}^m v_j δ_{y_j}, with δ_x the Dirac function centered on x. The weight vectors u = {u_i}_{i=1}^n ∈ Δ_n and v = {v_j}_{j=1}^m ∈ Δ_m belong to the simplex, i.e., Σ_{i=1}^n u_i = Σ_{j=1}^m v_j = 1, as both µ and ν are probability distributions. Under such a setting, computing the OT distance defined in (1) can be reformulated as the following minimization problem:

    W_c(µ, ν) = min_{T∈Π(u,v)} Σ_{i=1}^{n} Σ_{j=1}^{m} T_ij · c(x_i, y_j) = min_{T∈Π(u,v)} ⟨T, C⟩ ,    (2)

where Π(u, v) = {T ∈ R_+^{n×m} | T 1_m = u, T^⊤ 1_n = v}, 1_n denotes an n-dimensional all-one vector, C is the cost matrix given by C_ij = c(x_i, y_j), and ⟨T, C⟩ = Tr(T^⊤ C) represents the Frobenius dot-product.

2.2 Optimal Partial Transport

Optimal partial transport (OPT) was studied initially by Caffarelli and McCann (2010). It is a variant of optimal transport in which only a portion of the mass is transported, in an efficient way. In OPT, the transport problem is defined by generalizing γ to a Borel measure such that:

    D_c(µ, ν) = inf_{γ∈Π_≤(f,g), M(γ)=m} ∫ c(x, y) dγ(x, y) ,    (3)

where Π_≤(f, g) is defined as the set of nonnegative finite Borel measures on R^n × R^n whose first and second marginals are dominated by f and g respectively, i.e., γ(A × R^n) ≤ ∫_A f(x)dx and γ(R^n × A) ≤ ∫_A g(y)dy for all A ⊆ R^n; M(γ) = ∫_{R^n×R^n} dγ represents the mass of γ in (3), and m ∈ [0, min{‖f‖_{L1}, ‖g‖_{L1}}]. Here f and g can be considered as the maximum marginal measures for γ. As a result, if m is less than min{‖f‖_{L1}, ‖g‖_{L1}}, γ assigns zero measure to some elements of the space. In other words, the zero-measure elements need not be considered when matching µ and ν; the elements with non-zero measure are the active regions. A challenge in OPT is how to detect these active regions, and directly optimizing (3) is typically very challenging and computationally expensive. In our setting of text analysis, we propose to leverage prior knowledge to define the active regions, as introduced below.
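As a concrete reference point for the classical OT distance in Eq. (2) (before turning to the partial variant), the snippet below builds a small discrete OT problem between two sentences represented by word embeddings and solves an entropy-regularized approximation with a few Sinkhorn iterations. This is only an illustrative sketch: the embeddings are random stand-ins, uniform weights and the Sinkhorn solver are assumptions made for the example (the method described later in this paper uses the IPOT solver of Xie et al. (2018) instead), and all names are placeholders.

```python
import numpy as np

def sinkhorn_ot(C, a, b, eps=0.05, n_iters=200):
    """Entropy-regularized approximation of Eq. (2): min_T <T, C> with marginals a, b."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iters):             # alternating marginal projections
        u = a / (K @ v)
        v = b / (K.T @ u)
    T = u[:, None] * K * v[None, :]      # approximate transport plan
    return float((T * C).sum()), T

# two "sentences" as sequences of d-dimensional word embeddings (random stand-ins)
rng = np.random.default_rng(0)
E_x = rng.normal(size=(6, 300))          # n = 6 source tokens
E_y = rng.normal(size=(8, 300))          # m = 8 target tokens

# cosine cost matrix C_ij = 1 - cos(x_i, y_j)
Xn = E_x / np.linalg.norm(E_x, axis=1, keepdims=True)
Yn = E_y / np.linalg.norm(E_y, axis=1, keepdims=True)
C = 1.0 - Xn @ Yn.T

# uniform weights u_i = 1/n, v_j = 1/m (both distributions sum to one)
a = np.full(6, 1.0 / 6)
b = np.full(8, 1.0 / 8)

wmd_like_distance, T = sinkhorn_ot(C, a, b)
```

With real pretrained word vectors in place of the random matrices, this reproduces the word-mover style distance used as a regularizer in prior Seq2Seq work.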


3 Semantic Matching via OPT

In unsupervised Seq2Seq tasks without pair-wise information, naively matching the generated sequence with the weak-supervision information (e.g., the source text in style transfer) will yield deficient performance, even though both sentences share similar content. In supervised settings, target and input sequences are of different lengths but are similar in terms of semantic meaning, as in table-to-text generation. Motivated by this, we propose a novel technique for semantic partial matching and consider two scenarios: (i) text-to-text matching and (ii) image-to-text matching.

Figure 1: Semantic matching between the potential target (top) and the generated texts (bottom). The left shows how to partially match two texts with different styles. The right shows how to partially match texts with concepts detected from an image.

Text-to-Text Matching  We consider semantic matching between two sequences in Seq2Seq models, where partial matching is important: i) in the unsupervised setting, such as non-parallel style transfer, partially matching the source and target texts is helpful for content preservation; ii) in the supervised setting, such as table-to-text generation, partially matching the input and target sequences can effectively avoid hallucination, i.e., the text mentioning extra information beyond what is present in the source. Figure 1 shows an example of partial matching, where part-of-speech (POS) tags for each word are exploited to provide prior knowledge. In these cases, directly applying OT will cause an imbalanced-transportation issue or poor performance.

Image-to-Text Matching  Objects and their properties (e.g., colors) are both included in the pair-wise images and captions. Consider the image-to-text matching in Figure 1. It is clear that each object in the image has corresponding words/phrases in the caption. We can consider matching the labels of detected objects in the image to some words in its caption. Note that labels are not in one-to-one correspondence with the text, thus directly applying OT is inappropriate, similar to the case of text-to-text matching.

Different Matching Schemes  Hard matching seeks an exact match between the source and target. Typically, hard matching is too simple to be effective because it ignores semantic similarity, and if we apply classical optimal transport in unsupervised settings it causes imbalanced matching, since some unnecessary words are included in the source and the exact target is unavailable. To tackle this issue, one could directly apply optimal partial transport (OPT) to detect which words have a correspondence and match each such word with its target. However, the detection process is computationally expensive, and is not scalable as a constrained optimization in (3) for real-world tasks. Fortunately, we can exploit syntactic information from the text, and incorporate this information as prior knowledge into OPT to avoid the detection procedure.

3.1 Partial Matching via OPT

We formulate the proposed semantic matching as a partial optimal-transport problem, where only parts of the source and target are matched. Specifically, we incorporate prior knowledge into optimal partial transport (OPT); this prior knowledge helps determine the set of words to match, i.e., M(X), where M(·) is a function giving a set containing the words/phrases to match. The strategy for determining M(·) depends on the task.

OPT distance  To apply the OT distance to text, we first represent a sentence Y with a discrete distribution p_Y = (1/T) Σ_t δ_{e(y_t)} in the semantic space, where a length-normalized point mass is placed at the semantic embedding e_t^y = e(y_t) of each token y_t of the sequence Y. Here e(·) denotes a word-embedding function mapping a token to its d-dimensional feature representation. For two sentences X and Y, we define their OPT distance as:

    W_c(µ, ν) = min_{T∈Π_c(µ,ν)} Σ_{i=1}^{m} Σ_{j=1}^{n} T_ij · c(e_i^x, e_j^y) ,    (4)

where Π_c(µ, ν) is the solution space, and every solution T ∈ Π_c(µ, ν) satisfies T_ij = 0 if x_i ∉ M(X) or y_j ∉ M(Y). Different from classical OPT, the elements in µ or ν to match are explicitly defined by M(·), which represents the prior knowledge. In more detail, the constraint of OPT becomes more specific and does not need any additional optimization procedure. We use the cosine distance as the cost function, c(e^x, e^y) ≜ 1 − (e^x)^⊤ e^y / (‖e^x‖₂ ‖e^y‖₂).

Approximation of OPT  Computing the exact OPT distance is computationally challenging (Figalli, 2010). We bypass the difficulty of active-region detection using lexical information and reformulate it as an OT problem. We then employ the IPOT algorithm (Xie et al., 2018) to obtain an efficient approximation. In practice, we use a keyword mask K, defined as K_ij = 0 if x_i ∉ M(X) and y_j ∉ M(Y); K_ij = +∞ if x_i ∉ M(X) xor y_j ∉ M(Y); and K_ij = 1 otherwise. Hence we define the OPT distance as W_c^p(p_X, p_Y, K) ≜ W_c(p_{M(X)}, p_{M(Y)}). Specifically, IPOT uses proximal gradient descent to solve for the optimal transport matrix:

    T^{(t+1)} = argmin_{T∈Π_c(µ,ν)} { ⟨T, C′⟩ + γ · D_KL(T, T^{(t)}) } ,    (5)

where C′ = K ∘ C, 1/γ > 0 is the generalized step-size, and the generalized KL-divergence D_KL(T, T^{(t)}) is used as the proximity metric. The full approach is summarized as Algorithm 1 in Appendix A.
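A minimal NumPy sketch of this masked partial matching is given below, assuming M(·) keeps nouns and verbs found by NLTK's POS tagger and that word embeddings are supplied by the caller (random stand-ins here). The helper names (content_words, keyword_mask, ipot) are illustrative, not the authors' implementation; the block builds the cosine cost of Eq. (4), the keyword mask K described above, and runs IPOT-style proximal iterations as in Eq. (5).

```python
import numpy as np
import nltk  # assumes 'punkt' and 'averaged_perceptron_tagger' data have been downloaded

def content_words(sentence):
    """Illustrative M(.): keep indices of nouns and verbs according to POS tags."""
    tokens = nltk.word_tokenize(sentence)
    tags = nltk.pos_tag(tokens)
    keep = {i for i, (_, t) in enumerate(tags) if t.startswith(("NN", "VB"))}
    return tokens, keep

def cosine_cost(E_x, E_y):
    """C_ij = 1 - cos(e_i^x, e_j^y) for rows of the two embedding matrices."""
    X = E_x / np.linalg.norm(E_x, axis=1, keepdims=True)
    Y = E_y / np.linalg.norm(E_y, axis=1, keepdims=True)
    return 1.0 - X @ Y.T

def keyword_mask(n, m, keep_x, keep_y, big=1e6):
    """K as in the text: 1 if both tokens are keywords, 0 if neither is,
    and (effectively) +inf if exactly one of them is."""
    K = np.ones((n, m))
    for i in range(n):
        for j in range(m):
            in_x, in_y = i in keep_x, j in keep_y
            if not in_x and not in_y:
                K[i, j] = 0.0
            elif in_x != in_y:          # xor
                K[i, j] = big
    return K

def ipot(C, n_iters=50, beta=1.0):
    """IPOT-style proximal point iterations (Xie et al., 2018); a sketch, not the exact Algorithm 1."""
    n, m = C.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    T = np.outer(a, b)
    G = np.exp(-C / beta)               # proximal kernel
    v = np.ones(m)
    for _ in range(n_iters):
        Q = G * T                       # elementwise product
        u = a / (Q @ v)
        v = b / (Q.T @ u)
        T = np.diag(u) @ Q @ np.diag(v)
    return T

# toy partial-matching distance between a source X and a generation Y
x_tokens, keep_x = content_words("the rooms are so bad and the food is poorly cooked")
y_tokens, keep_y = content_words("rooms were spacious and food was very well cooked")
rng = np.random.default_rng(0)
E_x = rng.normal(size=(len(x_tokens), 300))   # stand-in word embeddings
E_y = rng.normal(size=(len(y_tokens), 300))
C = cosine_cost(E_x, E_y)
Cp = keyword_mask(len(x_tokens), len(y_tokens), keep_x, keep_y) * C   # C' = K o C
T = ipot(Cp)
sepam_distance = float((T * Cp).sum())
```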

Figure 2: Overview of the SEPAM architecture. Left: classical Seq2Seq, i.e., X and Y are pair-wise; L_SEM implements a soft-copying mechanism via semantic partial matching. Right: unsupervised Seq2Seq, i.e., X and Y are non-parallel; L_SEM provides the guidance for G(·) to generate Ỹ relevant to X via semantic partial matching. Solid lines mean gradients are backpropagated in training; dashed lines mean gradients are not backpropagated.

4 Semantic Partial Matching for Text Generation

Assume there are two sets of objects X = {X^(i)}_{i=1}^M and Y = {Y^(j)}_{j=1}^N. We consider a Seq2Seq model where the input is X (for simplicity, we omit the superscript "(i)" when the context is independent of i; this also applies to Y), and the output is a sequence of length T with tokens y_t, i.e., Y = (y_1, y_2, ..., y_T). One typically assigns the probability p(y | Y_{<t}, X) to an observation y at location t and trains the model by maximizing the likelihood of the target sequence.

We consider an encoder-decoder framework in this paper, where a latent vector z is given by an encoder Enc(·) with input X, i.e., z_x = Enc(X). Based on z_x, a decoder G(·) generates a new sentence Ŷ that is expected to be the same as Y. The decoder can be implemented with an LSTM (Hochreiter and Schmidhuber, 1997), a GRU (Cho et al., 2014), or a Transformer (Vaswani et al., 2017). An unsupervised Seq2Seq model considers X and Y as non-parallel, i.e., the pair-wise information is unknown. One typically pretrains the generator with the reconstruction loss:

    L_AE = E_{Y∼𝒴}[−log p(Y | z_y)] ,    (8)

where z_y = Enc(Y). Note that the goal of unsupervised Seq2Seq is to generate a sequence Y given some object X. Hence, we seek to learn the conditional generation distribution p(Y | X), the same as the classical Seq2Seq model. In practice, the generator can be trained by combining the reconstruction loss with a guidance loss containing the information from X. The guidance loss can be defined following SEPAM; other losses usually depend on the task, and we omit their details for clarity.
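As an illustration of the reconstruction pretraining in Eq. (8), here is a compact PyTorch sketch of an embedding plus GRU encoder that produces z_y and a GRU decoder trained with teacher forcing to reproduce Y. The experiments in this paper use a TensorFlow/texar implementation, so this is only a schematic stand-in with placeholder names and sizes.

```python
import torch
import torch.nn as nn

class Seq2SeqAE(nn.Module):
    """Toy encoder-decoder: z_y = Enc(Y), then G(.) reconstructs Y from z_y."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.proj = nn.Linear(hid_dim, vocab_size)

    def forward(self, y):                        # y: (batch, T) token ids
        _, z = self.encoder(self.emb(y))         # z: (1, batch, hid), the latent code
        # teacher forcing: shift the target right by one step (0 acts as a BOS id here)
        dec_in = torch.cat([torch.zeros_like(y[:, :1]), y[:, :-1]], dim=1)
        out, _ = self.decoder(self.emb(dec_in), z)
        return self.proj(out)                    # (batch, T, vocab) logits

vocab_size, batch, T = 1000, 4, 12
model = Seq2SeqAE(vocab_size)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
y = torch.randint(1, vocab_size, (batch, T))     # stand-in token ids

logits = model(y)
# reconstruction loss L_AE = E[-log p(Y | z_y)]
loss_ae = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), y.reshape(-1))
optim.zero_grad()
loss_ae.backward()
optim.step()
```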

Differentiable SEPAM  Note that the SEPAM loss is computed on discrete tokens sampled from the generator, so gradients cannot be backpropagated through them directly. As illustrated in Figure 2, we therefore feed the expected (soft-argmax) word embeddings of the generator outputs, computed with a temperature τ, into the matching loss, where 0 < τ < 1 is the annealing factor. Given two sentences, we denote the generated sequence embeddings as S_g = {ẽ_i^y}_{i=1}^{T} and the partial reference embeddings as S_r = {e_j^x}_{j=1}^{T′}, at the word or phrase level. The cost matrix C is then computed as C_ij = c(ẽ_i^y, e_j^x). The semantic partial matching loss between the reference and the model generation can be computed via the IPOT algorithm:

    L_SEM = W_c^p(S_g, S_r, K) .    (9)
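The sketch below (PyTorch, illustrative only) shows this differentiable pipeline: generator logits are turned into expected soft-argmax embeddings ẽ_t with temperature τ, a cosine cost matrix is computed against the reference keyword embeddings, and a differentiable entropic-OT plan is used here as a stand-in for the masked IPOT solver when forming L_SEM. The exact relaxation, names, and sizes are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def soft_embeddings(logits, emb_matrix, tau=0.5):
    """Expected word embeddings: e~_t = sum_w softmax(logits_t / tau)_w * emb_w."""
    probs = F.softmax(logits / tau, dim=-1)          # (T, vocab)
    return probs @ emb_matrix                        # (T, emb_dim)

def cosine_cost(A, B):
    return 1.0 - F.normalize(A, dim=-1) @ F.normalize(B, dim=-1).T

def sinkhorn_plan(C, eps=0.1, n_iters=50):
    """Differentiable entropic-OT plan, standing in for the masked IPOT of Eq. (5)."""
    n, m = C.shape
    a = torch.full((n,), 1.0 / n)
    b = torch.full((m,), 1.0 / m)
    K = torch.exp(-C / eps)
    u, v = torch.ones(n), torch.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

emb_matrix = torch.randn(1000, 128)                  # shared word-embedding table (stand-in)
logits = torch.randn(12, 1000, requires_grad=True)   # generator outputs for T = 12 steps
ref_keyword_ids = torch.tensor([17, 256, 980])       # ids of the words in M(X) (stand-in)

S_g = soft_embeddings(logits, emb_matrix)            # generated soft embeddings
S_r = emb_matrix[ref_keyword_ids]                    # partial reference embeddings
C = cosine_cost(S_g, S_r)
T_plan = sinkhorn_plan(C)
loss_sem = (T_plan * C).sum()                        # L_SEM as in Eq. (9), up to the solver choice
loss_sem.backward()                                  # gradients flow back into the generator logits
```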

SEPAM Regularization  The SEPAM training objectives discussed above only focus on generating words with specific meanings and do not consider word ordering. To train a proper text-generation model, we propose to combine the SEPAM loss with the likelihood loss L_MLE in supervised settings or L_AE in unsupervised settings. Hence, we have the training objective in unsupervised settings: L = L_AE + λ L_SEM, where λ is the weight of SEPAM to be tuned. A similar objective applies in supervised settings: L = L_MLE + λ L_SEM. In the following, we discuss how to extract and use prior knowledge for partial matching in three downstream tasks.

(i) Non-parallel Style Transfer  Semantic partial matching between the source sentence and the transferred one is helpful for content preservation, as shown in Figure 1. It is usually the case that the content words are nouns or verbs, and style words are adjectives or adverbs. Hence, M(X) and M(Ŷ) are content-word sets, extracted based on the POS tags using NLTK. One can employ this prior knowledge to perform different operations for different words: for content words, we encourage partial matching between the sentences via L_SEM, while for style words, we discourage matching (Hu et al., 2017). More details are provided in Appendix A.1.

(ii) Unsupervised Image Captioning  Visual concepts extracted from an image can be employed for generating relevant captions in the unsupervised setting. Feng et al. (2019) use exact hard matching and REINFORCE to update the captioning model. Here, we apply semantic partial matching to encourage the generation of visual-concept words. Specifically, M(X) contains the visual concepts and M(Ŷ) corresponds to the generated words related to objects (i.e., nouns). This visual-concept regularization can also be applied in the supervised setting, complementing the MLE loss.

(iii) Table-to-Text Generation  Semantic partial matching can prevent hallucination, i.e., the text mentioning extra information beyond what is present in the table (Dhingra et al., 2019). M(Ŷ) extracts nouns from the generated sequence, and M(X) contains the keys in the table; we then use them to compute L_SEM. Semantic partial matching penalizes the generator if extra information exists in the generated text Ŷ.
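To make the three choices of M(·) concrete, the following small sketch (using NLTK POS tags) shows one plausible way to extract the matched sets for each task; the tag sets and helper names are assumptions made for illustration and do not reproduce the exact rules in Appendix A.1.

```python
import nltk  # assumes nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

def words_with_tags(sentence, prefixes):
    """Return the words of `sentence` whose POS tag starts with one of `prefixes`."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return {w.lower() for w, t in tagged if t.startswith(prefixes)}

# (i) non-parallel style transfer: match content words (nouns/verbs), ignore style words
def style_transfer_sets(source, generated):
    return words_with_tags(source, ("NN", "VB")), words_with_tags(generated, ("NN", "VB"))

# (ii) unsupervised image captioning: match detected visual concepts with generated nouns
def captioning_sets(detected_concepts, generated):
    return set(detected_concepts), words_with_tags(generated, ("NN",))

# (iii) table-to-text: match the table keys with generated nouns to penalize hallucination
def table_to_text_sets(table, generated):
    return {str(k).lower() for k in table}, words_with_tags(generated, ("NN",))
```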

5 Related Work

Optimal transport in NLP  Kusner et al. (2015) first applied optimal transport in NLP and proposed the word mover's distance (WMD). The transportation cost is usually defined as the Euclidean distance, and the OT distance is approximated by solving a less-accurate lower bound (Kusner et al., 2015). Based on this, Chen et al. (2018) proposed a feature-mover distance for style transfer. Chen et al. (2019) applied OT to classical sequence-to-sequence learning and formulated it as Wasserstein gradient flows (Zhang et al., 2018). SEPAM moves further and applies OT in both supervised and unsupervised settings (Artetxe et al., 2018; Lample et al., 2017).

Unsupervised Seq2Seq Learning  Different from the standard Seq2Seq model (Sutskever et al., 2014), parallel sentences for different styles are not provided and must be inferred from the data. Unsupervised machine translation (Artetxe et al., 2018; Lample et al., 2017) learns to translate text from one language to another with only two sets of texts in these languages provided. Dai et al. (2019) explores the Transformer model as the generator, instead of classical auto-regressive models. Style transfer (Shen et al., 2017) aims at transferring the styles of texts with non-parallel data. Compared with these tasks, unsupervised image captioning (Feng et al., 2019) is more challenging since images and sentences are in distinct modalities.

Copying Mechanism  This is related to the copy network (Gu et al., 2016), which achieves retrieval-based copying. Li et al. (2018) further proposed a delete-retrieve-generate framework for style transfer. Chen and Bansal (2018) combine abstraction with extraction in text summarization, and achieve state-of-the-art results via reinforced word selection. In this work, we proposed semantic partial matching, which can be regarded as a kind of soft-copying mechanism. Instead of the retrieval-based exact copying used by Gu et al. (2016) and Li et al. (2018), SEPAM considers semantic similarity, and thus ideally delivers a smoother transformation in generation.

6 Experiments

6.1 Demonstration

Comparison between OT and SEPAM  We show two examples of classical OT and SEPAM under two sequence-prediction tasks in Figure 3. The first row shows the heat map of OT and SEPAM on matching two sentences with different styles. SEPAM employs the syntactic information to match selected words, and all the content words are exactly matched. However, some sentiment words in classical OT are still matched, preventing successful style transfer. The second row in Figure 3 shows the comparison on matching a generated sentence with the detected concept set in image captioning. The concepts are perfectly matched with their corresponding words in the caption using SEPAM, while OT includes some noisy matching (light blue). In summary, SEPAM achieves better matching than classical OT, and the matching weights (T) of SEPAM are more sparse.

Figure 3: Optimal matching matrix visualization. A comparison between OT (left column) and SEPAM (right column). The optimal matching matrix of SEPAM is sparse. The horizontal axis shows the generated texts, and the vertical axis shows the partial targets.

Implicit Use of Prior Knowledge  We consider using the attention weights w_t from an LSTM-based text classifier to determine which words to match. As discussed in Wiegreffe and Pinter (2019), a word with a higher attention weight is more important for classification, i.e., more relevant to the style. Figure 4 shows the attention maps of three instances. Hence, we can partially match words with lower attention weights, as they are mostly non-style words. However, empirical results show that this implicit approach performs much worse than the simple rule-based strategy with POS tags.

Figure 4: Attention maps for three Yelp instances. Larger attention weight corresponds to darker color.

6.2 Unsupervised Text-style Transfer

Setup  We use the same data and split method described in Shen et al. (2017). Experiments are implemented with TensorFlow based on texar (Hu et al., 2018). For a fair comparison, we use a model configuration similar to that of Hu et al. (2017) and Yang et al. (2018). A one-layer GRU (Cho et al., 2014) encoder and an LSTM attention decoder (generator) are used. We set the weight of the semantic matching loss to λ = 0.2. Models are trained for a total of 15 epochs, with 10 epochs for pretraining and 5 epochs for fine-tuning.

Metrics  We pretrain a CNN classifier, which achieves an accuracy of 97.4% on the validation set. Based on it, we report the accuracy (ACC) to measure the quality of style control. We further measure content preservation using i) BLEU, which measures the similarity between the original and transferred sentences, and ii) ref-BLEU, which measures content preservation by comparing the transferred sentences with human annotations. Fluency is evaluated with the perplexity (PPL) of generated sentences based on a pretrained language model.

Baselines  We implemented CtrlGen (Hu et al., 2017) and LM (Yang et al., 2018) as our baselines and further added SEPAM on top of these two models for validation. In addition, the conditional variational encoder (CVAE) (Shen et al., 2017) and retrieval-based methods (Li et al., 2018) are added as two baselines.

Analysis  Results are shown in Table 1. It may be observed that combining our proposed method with the corresponding baselines yields similar transfer accuracy and fluency, while maintaining the content better. Specifically, SEPAM shows higher BLEU scores on human-annotated sentences, further validating its effectiveness on content preservation. It is interesting to see that lower BLEU scores with the original sentences do not imply higher BLEU scores with the human annotations in Table 1. Compared with other models, the proposed model shows a better balance among accuracy, fluency and content preservation, achieving the highest ref-BLEU.

Model | ACC (%) | BLEU | ref-BLEU | PPL
CAE (Shen et al., 2017) | 73.9 | 20.7 | 7.8 | 51.6
StyleEmbedding (Fu et al., 2018) | 8.1 | 67.4 | 19.2 | 120.1
MultiDecoder (Fu et al., 2018) | 46.9 | 40.1 | 12.9 | 113.1
Template (Li et al., 2018) | 80.1 | 57.4 | 20.5 | 170.5
DeleteAndRetrieval (Li et al., 2018) | 88.9 | 36.8 | 14.7 | 74.2
CtrlGen (Hu et al., 2017) | 89.0 | 61.4 | 22.3 | 176.8
OT + CtrlGen | 85.4 | 62.9 | 21.7 | 183.7
SEPAM + CtrlGen | 89.1 | 63.7 | 25.9 | 176.4
LM (Yang et al., 2018) | 88.3 | 60.5 | 25.7 | 79.9
OT + LM | 84.8 | 61.8 | 22.8 | 85.1
SEPAM + LM | 88.7 | 62.0 | 28.2 | 76.6

Table 1: Our model and baselines performance on the test dataset with human annotations.

Input: tasted really old , i could n't believe it .
CtrlGen: adds really top , i could gorgeous believe it .
SEPAM + CtrlGen: tasted really surprisingly , i could fantastic believe it .
LM: tasted really great , i could always believe it .
SEPAM + LM: tasted really excellent , i could always believe it .

Input: they do not stock some of the most common parts .
CtrlGen: they do fantastic laughed some of the most common parts .
SEPAM + CtrlGen: they do authentic expertly some of the most fascinating parts .
LM: they do definitely right some of the most cool parts .
SEPAM + LM: they do always stock some of the most amazing parts .

Input: the woman who helped me today was very friendly and knowledgeable .
CtrlGen: the woman who so-so me today was very rude and knowledgeable .
SEPAM + CtrlGen: the woman who helped me today was very rude and knowledgeable .
LM: the woman who ridiculous me today was very rude and knowledgeable .
SEPAM + LM: the woman who helped me today was very rude and stupid .

Table 2: Examples for comparison of different methods on the Yelp dataset.

Human Evaluation  We further conduct human evaluations for the proposed SEPAM using Amazon Mechanical Turk. We randomly sample 100 sentences from the test set and ask 5 different reviewers to provide their rating scores of the models in terms of fluency, style, and content preservation. We require all the workers to be native English speakers, with an approval rate higher than 95% and at least 100 assignments completed. For each sentence, five shuffled samples generated by different models are sequentially shown to a reviewer. Results in Table 3 demonstrate the better performance achieved by SEPAM, especially in terms of content preservation.

Model | Style | Content | Fluency
CAE (Shen et al., 2017) | 3.21 | 2.91 | 2.83
CtrlGen (Hu et al., 2017) | 3.42 | 3.22 | 2.79
LM (Yang et al., 2018) | 3.38 | 3.32 | 3.20
SEPAM + CtrlGen | 3.51 | 3.56 | 2.88
SEPAM + LM | 3.47 | 3.72 | 3.25

Table 3: Human evaluation results on the Yelp dataset.

6.3 Image Captioning

Setup  We consider image captioning using the COCO dataset (Lin et al., 2014), which contains 123,287 images in total; each image is annotated with at least 5 captions. Following Karpathy's split (Karpathy and Fei-Fei, 2015), 113,287 images are used for training and 5,000 are used for validation and testing. We note that the training images are used to build the image set, with all the captions left unused for any training. All the descriptions in the Shutterstock image description corpus are tokenized with a vocabulary size of 18,667 (Feng et al., 2019). The LSTM hidden dimension and the shared latent space dimension are fixed to 512. The weighting hyper-parameters are chosen to make different rewards roughly the same scale; specifically, λ is set to 10. We train our model using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.0001. During the initialization process, we minimize the cross-entropy loss using the Adam optimizer with a learning rate of 0.001. When generating captions in the test phase, we use beam search with a beam size of 3.

Metrics  We report BLEU (Papineni et al., 2002), CIDEr (Vedantam et al., 2015), and METEOR (Banerjee and Lavie, 2005) scores. The results of different methods are shown in Table 5.

Method | BLEU | METEOR | CIDEr | SPICE
Feng et al. (2019) | 38.2 | 27.5 | 22.9 | 6.6
Our implementations:
Hard Matching | 38.5 | 27.2 | 28.2 | 7.8
OT | 39.9 | 28.1 | 28.3 | 7.9
SEPAM | 42.1 | 28.9 | 30.2 | 8.4

Table 5: Performance comparisons of unsupervised captioning on the MSCOCO dataset.

Method | BLEU | METEOR | ROUGE | PARENT / Precision / Recall
Seq2Seq (Wiseman and Rush, 2016) | 22.24 | 19.50 | 39.49 | 43.41 / 49.09 / 41.80
Pointer (See et al., 2017) | 19.32 | 19.88 | 40.68 | 49.52 / 61.73 / 44.09
Structure Aware (Liu et al., 2018) | 22.76 | 20.27 | 39.32 | 46.47 / 51.18 / 46.34
Transformer | 23.48 | 21.89 | 42.50 | 52.60 / 63.20 / 47.90
Transformer + OT | 23.87 | 22.35 | 42.03 | 51.81 / 60.65 / 48.87
Transformer + SEPAM | 24.06 | 22.29 | 42.83 | 53.16 / 62.99 / 48.81

Table 4: Performance comparisons of table-to-text generation on the WikiPerson dataset.

Analysis  Results in Table 5 show consistent improvement of SEPAM over classical OT. Classical OT can improve upon the baseline by generating specific words aligned with the detected visual concepts. However, directly applying it in unsupervised settings suffers from the imbalance issue (Craig, 2014), i.e., the generated texts contain some useless elements without correspondence in the targets. Our proposed SEPAM avoids this problem via partial matching, leading to better performance.

Extension to Supervised Settings  Our proposed SEPAM loss L_SEM can also be applied in a supervised setting as a regularizer with the MLE loss. We apply SEPAM in a captioning model where image features are fed into an LSTM sequence generator with an Att2in attention mechanism (Anderson et al., 2018). We pretrain the captioning model for a maximum of 20 epochs, then use reinforcement learning to train it for another 20 epochs. Testing is done on the best model selected with the validation set. We partially match the tags or visual features of detected objects. Similarly, we see consistent improvement of SEPAM over its baselines.

Method | BLEU | METEOR | ROUGE | CIDEr
Vinyals et al. (2015) | 27.7 | 23.7 | - | 85.5
Gan et al. (2017) | 56.6 | 25.7 | - | 101.2
Lu et al. (2017) | 33.2 | 26.6 | - | 108.5
Chen et al. (2019) | 33.8 | 25.6 | - | 102.9
Our implementations:
MLE | 34.3 | 26.2 | 55.2 | 106.3
Visual + OT | 34.6 | 26.4 | 55.6 | 107.5
Visual + SEPAM | 34.9 | 26.9 | 56.0 | 109.2
Tag + OT | 34.8 | 26.5 | 55.6 | 107.9
Tag + SEPAM | 34.4 | 27.0 | 56.1 | 111.6

Table 6: Performance comparisons of supervised image captioning on the MSCOCO dataset.

6.4 Table-to-Text Generation

Setup  We evaluate SEPAM on table-to-text generation (Lebret et al., 2016; Liu et al., 2018; Wiseman et al., 2018) with the WikiPerson dataset (Wang et al., 2018), preprocessing the training set with a vocabulary size of 50,000. We use a Transformer encoder and decoder, with 8 heads, 3 Transformer blocks, 2048 hidden units in the feed-forward layer, and λ = 0.1. Similarly, the model is first trained with MLE for 20,000 steps and then fine-tuned with L_SEM.

Metrics  For automatic evaluation, we apply the widely used metrics including the standard BLEU(-4) (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005) and ROUGE (Lin, 2015) scores to evaluate generation quality. Following Dhingra et al. (2019), we evaluate hallucination with the PARENT score, which considers both the reference texts and the table content in its evaluation.

Slot values from the input table: Xia Jin; Football player; 14 February 1985; Guizhou Hengfeng F.C.; Chongqing Dangdai Lifan F.C.; Better City F.C.
MLE: Xia Jin (born 14 February 1985 in Guizhou) is a Chinese Football player who currently plays for Guizhou Dangdai Lifan F.C. in the China League One . he joined Guizhou Dangdai Lifan F.C. in the summer of 2010 . he joined Guizhou Dangdai Lifan F.C. in the summer of 2013 . he joined Guizhou Dangdai Lifan F.C. in the summer of 2013 . he joined Guizhou Dangdai Lifan F.C. in the summer of 2013 . he started his professional career with Chongqing Dangdai Lifan F.C..
OT: Xia Jin (born 14 February 1985 in Chongqing) is a Chinese Football player who currently plays for Guizhou Hengfeng F.C. in the China League One . jin started his professional footballer career with Guizhou Hengfeng F.C. . jin would move to China League One side Chongqing Dangdai Lifan F.C. in February 2011 . he would move to China League Two side Chengdu Better City F.C. in January 2012 . he would move to China League Two side Chongqing Dangdai Lifan F.C. in January 2013.
SEPAM: Xia Jin (born 14 February 1985 in Chongqing) is a Chinese Football player who currently plays for Guizhou Hengfeng F.C. in the China League One . Xia Jin started his professional footballer career with Chongqing Dangdai Lifan F.C. in the Chinese Super League . Xia transferred to China League One side Chengdu Better City F.C. .

Table 7: An example of table-to-text generation.

Analysis  Results in Table 4 show consistent improvement of SEPAM over baselines in terms of different evaluation metrics. Table 7 shows an example of table-to-text generation.

MLE hallucinates some information that does not appear in the table. OT alleviates this issue, but still shows hallucination due to the imbalanced-transportation issue. SEPAM generates almost no extra information and covers all the entries in the table.

7 Conclusions

We incorporate prior knowledge into optimal transport to encourage partial sentence matching, by formulating it as an optimal partial transport problem. The proposed SEPAM shows broad applicability and consistent improvements over popular baselines in three downstream tasks: unsupervised style transfer for content preservation, image captioning for informative descriptions, and table-to-text generation for faithful generation. Further, the proposed technique can be regarded as a soft-copying mechanism for Seq2Seq models.

Acknowledgment  The authors would like to thank Zhenyi Wang, Liqun Chen and Siyang Yuan for helpful thoughts and discussions. The Duke University research was funded in part by DARPA, DOE, NSF and ONR.

References

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.
Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018. Unsupervised neural machine translation. In ICLR.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop.
Luis Caffarelli and Robert J. McCann. 2010. Free boundaries in optimal transport and Monge-Ampère obstacle problems. Annals of Mathematics.
Liqun Chen, Shuyang Dai, Chenyang Tao, Haichao Zhang, Zhe Gan, Dinghan Shen, Yizhe Zhang, Guoyin Wang, Ruiyi Zhang, and Lawrence Carin. 2018. Adversarial text generation via feature-mover's distance. In NeurIPS.
Liqun Chen, Zhe Gan, Yu Cheng, Linjie Li, Lawrence Carin, and Jingjing Liu. 2020. Graph optimal transport for cross-domain alignment. In ICML.
Liqun Chen, Yizhe Zhang, Ruiyi Zhang, Chenyang Tao, Zhe Gan, Haichao Zhang, Bai Li, Dinghan Shen, Changyou Chen, and Lawrence Carin. 2019. Improving sequence-to-sequence learning via optimal transport. In ICLR.
Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In ACL.
Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.
Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In NAACL.
Katy Craig. 2014. The exponential formula for the Wasserstein metric. PhD thesis, The State University of New Jersey.
Ning Dai, Jianze Liang, Xipeng Qiu, and Xuanjing Huang. 2019. Style transformer: Unpaired text style transfer without disentangled latent representation. arXiv preprint arXiv:1905.05621.
Bhuwan Dhingra, Manaal Faruqui, Ankur Parikh, Ming-Wei Chang, Dipanjan Das, and William W. Cohen. 2019. Handling divergent reference texts when evaluating table-to-text generation. In ACL.
William Fedus, Ian Goodfellow, and Andrew M. Dai. 2018. MaskGAN: Better text generation via filling in the _. In ICLR.
Yang Feng, Lin Ma, Wei Liu, and Jiebo Luo. 2019. Unsupervised image captioning. In CVPR.
Alessio Figalli. 2010. The optimal partial transport problem. Archive for Rational Mechanics and Analysis.
Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2018. Style transfer in text: Exploration and evaluation. In AAAI.
Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. 2017. Semantic compositional networks for visual captioning. In CVPR.
Hongyu Gong, Suma Bhat, Lingfei Wu, JinJun Xiong, and Wen-mei Hwu. 2019. Reinforcement learning based text style transfer without parallel training corpus. In ACL.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In NIPS.
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In ACL.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.
Zhiting Hu, Haoran Shi, Zichao Yang, et al. 2018. Texar: A modularized, versatile, and extensible toolkit for text generation. arXiv preprint arXiv:1809.00794.
Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Controllable text generation. In ICML.
Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In CVPR.
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In ICLR.
Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In ICML.
Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2017. Unsupervised machine translation using monolingual corpora only. In ICLR.
Rémi Lebret, David Grangier, and Michael Auli. 2016. Neural text generation from structured data with application to the biography domain. In EMNLP.
Juncen Li, Robin Jia, He He, and Percy Liang. 2018. Delete, retrieve, generate: A simple approach to sentiment and style transfer. In NAACL.
Chin-Yew Lin. 2015. ROUGE: A package for automatic evaluation of summaries.
Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, and Ming-Ting Sun. 2017. Adversarial ranking for language generation. In NIPS.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In ECCV.
Tianyu Liu, Kexiang Wang, Lei Sha, Baobao Chang, and Zhifang Sui. 2018. Table-to-text generation by structure-aware seq2seq learning. In AAAI.
Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher. 2017. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR.
Fuli Luo, Peng Li, Jie Zhou, Pengcheng Yang, Baobao Chang, Zhifang Sui, and Xu Sun. 2019. A dual reinforcement learning framework for unsupervised text style transfer. In IJCAI.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.
Gabriel Peyré, Marco Cuturi, et al. 2017. Computational optimal transport. Technical report.
Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W. Black. 2018. Style transfer through back-translation. In ACL.
Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv:1509.00685.
Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In ACL.
Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In NIPS.
Akhilesh Sudhakar, Bhargav Upadhyay, and Arjun Maheswaran. 2019. Transforming delete, retrieve, generate approach for controlled text style transfer. In EMNLP.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.
Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In NIPS.
Alexey Tikhonov, Viacheslav Shibaev, Aleksander Nagaev, Aigul Nugmanova, and Ivan P. Yamshchikov. 2019. Style transfer for texts: Retrain, report errors, compare with rewrites. In EMNLP.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS.
Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In CVPR.
Cédric Villani. 2008. Optimal transport: old and new. Springer Science & Business Media.
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In CVPR.
Qingyun Wang, Xiaoman Pan, Lifu Huang, Boliang Zhang, Zhiying Jiang, Heng Ji, and Kevin Knight. 2018. Describing a knowledge base. arXiv preprint arXiv:1809.01797.
Wenlin Wang, Zhe Gan, Hongteng Xu, Ruiyi Zhang, Guoyin Wang, Dinghan Shen, Changyou Chen, and Lawrence Carin. 2019. Topic-guided variational autoencoders for text generation. In NAACL.

Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In EMNLP.
Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-sequence learning as beam-search optimization. In EMNLP.

Sam Wiseman, Stuart M. Shieber, and Alexander M. Rush. 2018. Learning neural templates for text generation. In EMNLP.

Chen Wu, Xuancheng Ren, Fuli Luo, and Xu Sun. 2019. A hierarchical reinforced sequence operation method for unsupervised text style transfer. In ACL.

Yujia Xie, Xiangfeng Wang, Ruijia Wang, and Hongyuan Zha. 2018. A fast proximal point method for Wasserstein distance. arXiv preprint arXiv:1802.04307.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML.

Zichao Yang, Zhiting Hu, Chris Dyer, Eric P. Xing, and Taylor Berg-Kirkpatrick. 2018. Unsupervised text style transfer using language models as discriminators. In NeurIPS.

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI.

Ruiyi Zhang, Changyou Chen, Chunyuan Li, and Lawrence Carin. 2018. Policy optimization as wasserstein gradient flows. In ICML.

Ruiyi Zhang, Tong Yu, Changyou Chen, and Lawrence Carin. 2019. Text-based interactive recommendation via constraint-augmented reinforcement learning. In NeurIPS.

Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. 2017. Adversarial feature matching for text generation. In ICML.
