Learning Deep Latent Models for Text Sequences
Lei Li, ByteDance AI Lab
4/29/2020
The Rise of New Media Platforms
Toutiao, Helo, Douyin/TikTok

Huge Demand for Automatic Content Generation Technologies
• Automatic news writing
• Author writing-assist tools
  – Title generation and text summarization
• Automatic creative advertisement design
• Dialog robots with response generation
• Translation of content across multiple languages
• Story generation

Automated News Writing
Xiaomingbot is deployed and constantly producing news on social media platforms (TopBuzz & Toutiao).

AI to Improve Writing
Text generation to the rescue!

Outline
1. Overview
2. Learning disentangled latent representation for text
3. Mirror-Generative NMT
4. Multimodal machine writing
5. Summary

Disentangled Latent Representation for Text
VTM [R. Ye, W. Shi, H. Zhou, Z. Wei, Lei Li, ICLR20b]
DSS-VAE [Y. Bao, H. Zhou, S. Huang, Lei Li, L. Mou, O. Vechtomova, X. Dai, J. Chen, ACL19c]

Natural Language Descriptions
Table: name[Sukiyaki], eatType[pub], food[Japanese], price[average], rating[good], area[seattle]
Text: "Sukiyaki is a Japanese restaurant. It is a pub and it has an average cost and good rating. It is based in seattle."

Data to Text Generation (data table -> sentence)
• Medical reports: <key, value> measurements -> "The blood pressure is higher than normal and may expose to the risk of hypertension."
• Fashion product description: Style[long dress], Painting[bamboo ink], Texture[poplin], Feel[smooth] -> "Made of poplin, this long dress has an ink painting of bamboo and feels fresh and smooth."
• Person biography: Name[Sia Kate Isobelle Furler], DoB[12/18/1975], Nationality[Australia], Occupation[Singer, Songwriter] -> "Sia Kate Isobelle Furler (born 18 December 1975) is an Australian singer, songwriter, voice actress and music video director."
[1] The E2E Dataset: New Challenges For End-to-End Generation. https://github.com/tuetschek/e2e-dataset
[2] Can Neural Generators for Dialogue Learn Sentence Planning and Discourse Structuring?. https://nlds.soe.ucsc.edu/sentence-planning-

NLG Problem Setup
• Inference:
  – Given: table data x, as key-position-value triples.
  – e.g. Name: Jim Green => (Name, 0, Jim), (Name, 1, Green)
  – Output: fluent, accurate and diverse text sequences y
• Training:
  – Pairs of table data and text: {⟨x_i, y_i⟩}_{i=1}^N
  – Raw text corpus: {y_j}_{j=1}^M, with M ≫ N

Why is Data-to-Text Hard?
• Desired properties:
  – Accuracy: semantically consistent with the content in the table
  – Diversity: ability to generate infinitely varying utterances
• Scalability: real-time generation, latency, throughput (QPS)
• Training: limited table-text pairs

Previous Idea: Templates
Template: "[name] is a [food] restaurant. It is a [eatType] and it has a [price] cost and [rating] rating. It is in [area]."
Filled: "Sukiyaki is a Japanese restaurant. It is a pub and it has an average cost and good rating. It is in seattle."
But manual creation of templates is tedious.

Our Motivation for Variational Template Machine
• Motivation 1: continuous and disentangled representations for template and content.
• Motivation 2: incorporate a raw text corpus to learn good representations, via the inference network q(template, content | sentence).

Variational Template Machine
VTM [R. Ye, W. Shi, H. Zhou, Z. Wei, Lei Li, ICLR20b]
Input: triples of <field_name, table_position, value>, {x_k^f, x_k^p, x_k^v}_{k=1}^K
Generative process (sketched in code below):
1. Content variable c from p(c|x), a neural net: c = maxpool(tanh(W · [x_k^f, x_k^p, x_k^v] + b))
2. Sample the template variable z ∼ p_0(z), e.g. a Gaussian prior
3. Decode y from [c, z] using another NN (e.g. Transformer)
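The following is a minimal sketch of this three-step generative process, not the authors' released implementation. The embedding layers, dimensions, and the GRU stand-in for the Transformer decoder are illustrative assumptions.

```python
# Hypothetical sketch of the VTM generative process: content c from the table,
# template z from a Gaussian prior, decoding conditioned on [c, z].
import torch
import torch.nn as nn

class VTMSketch(nn.Module):
    def __init__(self, n_fields, n_positions, vocab_size, d=64, d_c=64, d_z=64):
        super().__init__()
        self.f_emb = nn.Embedding(n_fields, d)      # field name   x^f
        self.p_emb = nn.Embedding(n_positions, d)   # table position x^p
        self.v_emb = nn.Embedding(vocab_size, d)    # cell value   x^v
        self.content = nn.Linear(3 * d, d_c)        # W · [x^f; x^p; x^v] + b
        self.d_z = d_z
        self.tok_emb = nn.Embedding(vocab_size, d)
        # stand-in autoregressive decoder; the slide suggests a Transformer
        self.decoder = nn.GRU(d, d_c + d_z, batch_first=True)
        self.out = nn.Linear(d_c + d_z, vocab_size)

    def forward(self, fields, positions, values, y_prev):
        # fields/positions/values: (B, K) table triples; y_prev: (B, T) shifted text
        # 1. content variable c = maxpool(tanh(W · [x^f, x^p, x^v] + b)) over K triples
        triples = torch.cat([self.f_emb(fields), self.p_emb(positions),
                             self.v_emb(values)], dim=-1)           # (B, K, 3d)
        c = torch.tanh(self.content(triples)).max(dim=1).values     # (B, d_c)
        # 2. template variable z ~ p_0(z), a standard Gaussian prior
        z = torch.randn(c.size(0), self.d_z, device=c.device)       # (B, d_z)
        # 3. decode y conditioned on [c, z] (used here as the initial hidden state)
        h0 = torch.cat([c, z], dim=-1).unsqueeze(0)                  # (1, B, d_c+d_z)
        hidden, _ = self.decoder(self.tok_emb(y_prev), h0)           # (B, T, d_c+d_z)
        return self.out(hidden)                                      # logits over vocab
```

Sampling different z while keeping c fixed is what yields different surface templates for the same table content.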
Training VTM
Key idea: disentangle table content and templates while preserving as much information as possible!
Total loss = Reconstruction loss + Information-Preserving loss

Variational Inference
Instead of optimizing the exact but intractable expected likelihood, minimize the (tractable) variational lower bounds.
For a table-text pair ⟨x, y⟩:
  -log p(y|x) = -log ∫ p(y | c(x), z) p(z) dz
  ELBO_p = -E_{q(z|y)}[log p(y | c(x), z)] + KL[q(z|y) || p(z)]
For raw text y:
  -log p(y) = -log ∬ p(y | c, z) p(c) p(z) dz dc
  ELBO_r = -E_{q(z|y) q(c|y)}[log p(y | c, z)] + KL[q(z|y) || p(z)] + KL[q(c|y) || p(c)]

Preserving Content & Template
1. Content-preserving loss:
   l_cp = E_{q(c|y)} |c − f(x)|^2 + D_KL(q(c|y) || p(c))
2. Template-preserving loss:
   l_tp = −E_{q(z|y)}[log p(ỹ | z, x)]
   where ỹ is the text sketch obtained by removing the table entries, i.e. the cross entropy of the variational prediction from templates.
A sketch of the paired-data objective in code follows.
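Below is a minimal sketch of ELBO_p plus the content-preserving loss l_cp, assuming a reparameterized Gaussian q(z|y) and q(c|y) and a standard Gaussian prior. The interfaces encode_table, q_z, q_c, and decode are hypothetical names, not the authors' code.

```python
# Hypothetical sketch of the paired-data loss ELBO_p + l_cp for one batch.
import torch
import torch.nn.functional as F

def gaussian_kl(mu, logvar):
    # KL[ N(mu, diag(exp(logvar))) || N(0, I) ], summed over dimensions
    return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1)

def paired_loss(model, fields, positions, values, y_prev, y_gold):
    """ELBO_p plus the content-preserving loss for table-text pairs (sketch)."""
    c_from_table = model.encode_table(fields, positions, values)      # f(x): (B, d_c)
    mu_z, logvar_z = model.q_z(y_gold)                                 # q(z|y)
    z = mu_z + torch.randn_like(mu_z) * (0.5 * logvar_z).exp()         # reparameterize
    logits = model.decode(c_from_table, z, y_prev)                     # p(y | c(x), z)
    recon = F.cross_entropy(logits.transpose(1, 2), y_gold,
                            reduction="none").sum(dim=-1)              # -log p(y | c, z)
    kl_z = gaussian_kl(mu_z, logvar_z)                                 # KL[q(z|y) || p(z)]
    mu_c, logvar_c = model.q_c(y_gold)                                 # q(c|y)
    c_sample = mu_c + torch.randn_like(mu_c) * (0.5 * logvar_c).exp()
    l_cp = ((c_sample - c_from_table) ** 2).sum(dim=-1) \
           + gaussian_kl(mu_c, logvar_c)                               # content preserving
    return (recon + kl_z + l_cp).mean()
```

The raw-text objective ELBO_r follows the same pattern, except that c is also inferred from y instead of being computed from a table.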
Preserving Template
Ensure the template variable can recover the text sketch.
Table data x: {name[Loch Fyne], eatType[restaurant], food[French], price[below $20]}
Text y: "Loch Fyne is a French restaurant catering to a budget of below $20."
Text sketch ỹ: "<ent> is a <ent> <ent> catering to a budget of <ent>."

Learning with Raw Corpus
• Semi-supervised learning: "back-translate" the raw corpus with q(<c, z>|y) to obtain pseudo-parallel <table, text> pairs and enrich the learning.
• Paired example: {name[Sukiyaki], eatType[pub], food[Japanese], price[average], rating[good], area[seattle]} -> "Sukiyaki is a Japanese restaurant. It is a pub and it has an average cost and good rating. It is in seattle."
• Raw text without a table: "Known for its creative flavours, Holycrab's signatures are the Hokkien crab."

Evaluation Setup
• Tasks:
  – WIKI: generating a short bio from a person profile.
  – SPNLG: generating a restaurant description from attributes.
• Dataset statistics:
  – WIKI: train 84k table-text pairs + 842k raw text; valid 73k pairs + 43k raw text; test 73k pairs.
  – SPNLG: train 14k table-text pairs + 150k raw text; valid 21k pairs; test 21k pairs.
• Evaluation metrics:
  – Quality (accuracy): BLEU score against the ground truth.
  – Diversity: self-BLEU (lower is better).

VTM Produces High-quality and Diverse Text
[Figure: BLEU vs. self-BLEU on WIKI and SPNLG. VTM (with beam-search decoding) lies closest to the ideal corner of high BLEU and low self-BLEU, compared with T2S-beam, T2S-pretrain, and Temp-KN.]

Raw Data and Loss Terms Are Necessary
[Figure: ablation on the Wiki-bio dataset, BLEU (higher is better) vs. self-BLEU (lower is better); removing the raw data or the information-preserving losses degrades both.]

Interpreting VTM
[Figure: template variable projected to 2D.]

VTM Generates Diverse Text
Input data table -> generated texts:
1: "John Ryder (8 August 1889 – 4 April 1977) was an Australian cricketer."
2: "Jack Ryder (born August 9, 1889 in Victoria, Australia) was an Australian cricketer."
3: "John Ryder, also known as the king of Collingwood (8 August 1889 – 4 April 1977) was an Australian cricketer."

Learning Disentangled Representation of Syntax and Semantics
DSS-VAE [Y. Bao, H. Zhou, S. Huang, Lei Li, L. Mou, O. Vechtomova, X. Dai, J. Chen, ACL19c]
DSS-VAE enables learning and transferring sentence-writing styles by separating a syntactic style variable z_syn from a semantic content variable z_sem.
• Syntax provider: "There is an apple on the table."
• Semantic content: "The dog is behind the door."
• DSS-VAE output: "There is a dog behind the door."

Impact
• VTM and its extensions have been applied to multiple online systems on Toutiao, including query-suggestion generation and ads bid-word generation.
• Serving over 100 million active users.
• 10% of query-suggestion phrases come from the generation algorithm.

Part I Takeaway
• Deep latent models enable learning with both table-text pairs and unpaired text, with high accuracy.
• Disentangling approach for model composition.
• Variational techniques to speed up inference.

Outline
1. Overview of Intelligent Information Assistant
2. Learning disentangled latent representation for text
3. Mirror-Generative NMT
4. Multimodal machine writing
5. Summary and Future Directions

Neural Machine Translation
• Neural machine translation (NMT) systems are very good when a large amount of parallel bilingual data is available: x ∼ D_xy, y ∼ D_xy, with a src2tgt TM p(y|x; θ_xy) and a tgt2src TM p(x|y; θ_yx).
• BUT such data is very expensive and non-trivial to obtain:
  – Low-resource language pairs (e.g., English-to-Tamil)
  – Low-resource domains (e.g., social networks)
• Large-scale monolingual data are not fully utilized.

Existing Approaches to Exploit Non-parallel Data
• There are two categories of methods using non-parallel data:
  – Training: back-translation, joint back-translation, dual learning, ...
  – Decoding: interpolation with an external LM, ...
• Still not the best.

So, What Do We Expect?
• A pair of related TMs, src2tgt p(y|x, z; θ_xy) and tgt2src p(x|y, z; θ_yx), sharing a latent z(x, y), so that they can directly boost each other in training.
• A pair of related TM and LM, src2tgt p(y|x, z; θ_xy) and target LM p(y|z; θ_y), so that they can cooperate more effectively for better decoding.
We need a bridge.

Integrating Four Language Skills with MGNMT
MGNMT [Z. Zheng, H. Zhou, S. Huang, L. Li, X. Dai, J. Chen, ICLR 2020a]
MGNMT ties four components together through a shared latent variable z(x, y):
1. Composing sentences in the source language: source LM p(x|z; θ_x)
2. Composing sentences in the target language: target LM p(y|z; θ_y)
3. Translating from source to target: src2tgt TM p(y|x, z; θ_xy)
4. Translating from target to source: tgt2src TM p(x|y, z; θ_yx)
Benefit: utilizing both parallel bilingual data and non-parallel corpora.
[Figure 1: the graphical model of MGNMT, in which the shared latent z(x, y) connects the source LM p(x|z), the target LM p(y|z), the src2tgt TM p(y|x, z), and the tgt2src TM p(x|y, z); contrasted with GNMT.]
A hypothetical sketch of how a TM and an LM sharing z can cooperate at decoding time appears below.
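This sketch only illustrates the cooperation idea named on the "So, What Do We Expect?" slide: scoring candidate translations with a translation model and a target-side language model that condition on a shared latent z. It is not the MGNMT decoding algorithm from the paper; the interfaces infer_z, tm_logprob, and lm_logprob are assumed names.

```python
# Hypothetical illustration of TM-LM cooperation at decoding time (reranking).
from typing import Callable, List, Tuple

def rerank_candidates(
    src: str,
    candidates: List[str],
    infer_z: Callable[[str, str], object],           # approximate posterior q(z | x, y)
    tm_logprob: Callable[[str, str, object], float],  # log p(y | x, z; theta_xy)
    lm_logprob: Callable[[str, object], float],       # log p(y | z; theta_y)
    lm_weight: float = 1.0,
) -> List[Tuple[float, str]]:
    """Score each candidate translation y of src by TM + weighted target LM,
    both conditioned on a shared latent z(x, y), and sort best-first."""
    scored = []
    for y in candidates:
        z = infer_z(src, y)                           # the shared bridge between TM and LM
        score = tm_logprob(y, src, z) + lm_weight * lm_logprob(y, z)
        scored.append((score, y))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```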