Interpretable Machine Learning for Natural Language Generation
NeurIPS 2020 Meetup Beijing
Controllable and Interpretable Machine Learning for Natural Language Generation
Lei Li, ByteDance AI Lab, 12/6/2020

Revolution in Information Creation and Sharing
• New media platforms
• Tremendous improvement in the efficiency and quality of content creation
• Massive distribution of personalized information

AI for Information Creation and Sharing
• Creation: automated news writing, driven by natural language generation
• Sharing globally: machine translation
• Content filtering and misinformation detection: classification, graph neural nets, GANs

Why is NLG important?
Machine writing, question answering, chatbots, and machine translation.

AI to Improve Writing
Text generation to the rescue! Gmail smart compose and smart reply.

Automated News Writing
Xiaomingbot is deployed and constantly producing news on social media platforms (Toutiao & TopBuzz).
[Side-by-side example: a human-written article next to a GPT-3 draft edited by a human.]

A New Working Style for Authors
Human-AI co-authoring.

Outline
1. Motivation and Basics
2. Deep Latent Variable Models
3. Multimodal machine writing: showcase
4. Summary

Modeling a Sequence
"The quick brown fox jumps over the lazy dog ." is represented as x = (x1, x2, x3, x4, x5, x6, x7, x8, x9, x10). The central problem of language modeling is to find the joint probability distribution pθ(x) = pθ(x1, ⋯, xL). There are many ways to represent and learn this joint probability model.

DGM Taxonomy
Deep generative models fit pθ(x) to pdata(x) either by maximum likelihood estimation or by adversarial learning (GAN). Likelihood-based models use an explicit density or an implicit density (GSN). Explicit densities are tractable or intractable, covering auto-regressive factorization (RNN, LSTM, Transformer), Markov factorization (Markov Transformer), parallel factorization (GLAT, NAT), latent variable models (VAE, VTM), energy-based and conditional models (EBM, PM), and constrained generation (CGMH, MHA, TSMH).

[Figure 1: The Transformer - model architecture: stacked encoder and decoder blocks built from multi-head scaled dot-product attention, feed-forward layers, and add & norm connections.]

Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.

Attention: An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

Scaled Dot-Product Attention: We call our particular attention "Scaled Dot-Product Attention". The input consists of queries and keys of dimension dk, and values of dimension dv. We compute the dot products of the query with all keys, divide each by √dk, and apply a softmax function to obtain the weights on the values.
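As an illustration of the scaled dot-product attention just described, here is a minimal NumPy sketch (single head, no batching; the function name and the optional mask argument are illustrative, not from the slides):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one attention head.

    Q: (n_q, d_k) queries, K: (n_k, d_k) keys, V: (n_k, d_v) values.
    mask: optional (n_q, n_k) array, 0 where attending is disallowed.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key compatibilities
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)  # block masked positions
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # weighted sum of values

# Example: 3 queries attending over 4 key-value pairs
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 5))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 5)
```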
Deep Latent Variable Models for Text
• Disentangled Representation Learning for Text Generation [ICLR 20b, ACL 19c]
• Interpretable Deep Latent Representation from Raw Text [ICML 20]
• Mirror Generative Model for Neural Machine Translation [ICLR 20a]

Natural Language Descriptions
Data table: name = Sukiyaki, eatType = pub, food = Japanese, price = average, rating = good, area = seattle.
Description: "Sukiyaki is a Japanese restaurant. It is a pub and it has an average cost and good rating. It is based in seattle."

Data to Text Generation
The task maps a data table of <key, value> pairs to a sentence:
• Medical reports: "The blood pressure is higher than normal and may expose the patient to the risk of hypertension."
• Fashion product descriptions: <style, long dress>, <painting, bamboo ink>, <texture, poplin>, <feel, smooth> → "Made of poplin, this long dress has an ink painting of bamboo and feels fresh and smooth."
• Person biographies: <name, Sia Kate Isobelle Furler>, <DoB, 12/18/1975>, <nationality, Australia>, <occupation, singer/songwriter> → "Sia Kate Isobelle Furler (born 18 December 1975) is an Australian singer, songwriter, voice actress and music video director."

[1] The E2E Dataset: New Challenges For End-to-End Generation. https://github.com/tuetschek/e2e-dataset
[2] Can Neural Generators for Dialogue Learn Sentence Planning and Discourse Structuring? https://nlds.soe.ucsc.edu/sentence-planning-NLG

Previous Idea: Templates
Template: "[name] is a [food] restaurant. It is a [eatType] and it has a [price] cost and [rating] rating. It is in [area]."
Filled with the Sukiyaki table: "Sukiyaki is a Japanese restaurant. It is a pub and it has a average cost and good rating. It is in seattle."
But manual creation of templates is tedious.
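To make the template idea concrete, here is a minimal sketch of this kind of slot filling (plain Python; the template wording follows the slide, the variable names are illustrative):

```python
# Slot-filling template from the slide, applied to the E2E-style record above.
TEMPLATE = ("{name} is a {food} restaurant. It is a {eatType} and it has a "
            "{price} cost and {rating} rating. It is in {area}.")

record = {"name": "Sukiyaki", "eatType": "pub", "food": "Japanese",
          "price": "average", "rating": "good", "area": "seattle"}

print(TEMPLATE.format(**record))
# Sukiyaki is a Japanese restaurant. It is a pub and it has a average cost
# and good rating. It is in seattle.
```

Every record is rendered with the same rigid wording, which is exactly the diversity and maintenance problem that motivates the Variational Template Machine below.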
Our Motivation for Variational Template Machine
• Motivation 1: Continuous and disentangled representations for template and content.
• Motivation 2: Incorporate a raw text corpus to learn a good representation q(template, content | sentence).

Variational Template Machine [R. Ye, W. Shi, H. Zhou, Z. Wei, Lei Li, ICLR20b]
Input: triples of <field_name, table_position, value>, i.e. a table x = {(x_k^f, x_k^p, x_k^v)} for k = 1, ..., K.
1. Compute the content variable c from the table with a neural net p(c|x), e.g. c = maxpool(tanh(W · [x_k^f, x_k^p, x_k^v] + b)).
2. Sample the template variable z ∼ p0(z), e.g. a Gaussian prior.
3. Decode the text y from [c, z] using another neural net (e.g. a Transformer).

Learning with Raw Corpus [R. Ye, W. Shi, H. Zhou, Z. Wei, Lei Li, ICLR20b]
Semi-supervised learning: "back-translate" the raw corpus with q(<c, z> | y) to obtain pseudo-parallel <table, text> pairs, which enriches the learning. For example, the raw sentence "Known for its creative flavours, Holycrab's signatures are the Hokkien crab." has no paired table.

VTM Produces High-quality and Diverse Text
[Plots: BLEU vs. Self-BLEU on WIKI and SPNLG for VTM, T2S-beam, T2S-pretrain, and Temp-KN; VTM lies closest to the ideal corner. VTM uses beam-search decoding.]

VTM Generates Diverse Text [Ye, ..., Lei Li, ICLR20b]
From a single input data table, VTM produces multiple phrasings:
1: "John Ryder (8 August 1889 – 4 April 1977) was an Australian cricketer."
2: "Jack Ryder (born August 9, 1889 in Victoria, Australia) was an Australian cricketer."
3: "John Ryder, also known as the king of Collingwood (8 August 1889 – 4 April 1977) was an Australian cricketer."

Learning Disentangled Representation of Syntax and Semantics
DSS-VAE [Y. Bao, H. Zhou, S. Huang, Lei Li, L. Mou, O. Vechtomova, X. Dai, J. Chen, ACL19c] enables learning and transferring sentence-writing styles: a syntactic variable z_syn captures style and a semantic variable z_sem captures content. For example, with syntax provider "There is an apple on the table" and semantic content "The dog is behind the door", DSS-VAE generates "There is a dog behind the door".

Interpretable Text Generation
A latent structure of dialog actions drives the generator: sample an action, then generate the sentence word by word (x1, x2, x3, ...). Examples: "Remind me about the football game." [action=remind]; "Will it be overcast tomorrow?" [action=request]. The goal is to generate sentences with interpretable factors.

How to Interpret Latent Variables in VAEs?
In a variational auto-encoder (VAE; Kingma & Welling, 2013), a continuous latent variable z generates x, but z is difficult to interpret as discrete factors such as "Will it be humid in New York today?" versus "Remind me about my meeting."

Discrete Variables Could Enhance Interpretability - but one has to do it right!
The Gaussian mixture variational auto-encoder (GM-VAE; Dilokthanakul et al., 2016; Jiang et al., 2017) adds an interpretable structure c → z → x, where c is a discrete mixture component and z is a continuous latent variable, e.g. "Will it be overcast tomorrow?" versus "Remind me about the football game." Why does it suffer from mode-collapse, and how can it be fixed?
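To make the c → z → x structure concrete, here is a minimal sketch of ancestral sampling from a GM-VAE prior (NumPy; the component count, latent size, and prior parameters are illustrative placeholders, and the decoder p(x|z) is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
K, d_z = 4, 16                      # number of discrete components, latent dimension
pi = np.full(K, 1.0 / K)            # categorical prior p(c)
mu = rng.normal(size=(K, d_z))      # per-component Gaussian means (placeholders)
log_std = np.zeros((K, d_z))        # per-component log standard deviations

def sample_prior():
    """Draw (c, z): c ~ Cat(pi), then z ~ N(mu_c, diag(exp(log_std_c))^2)."""
    c = rng.choice(K, p=pi)
    z = mu[c] + np.exp(log_std[c]) * rng.normal(size=d_z)
    return c, z

c, z = sample_prior()
# A decoder network p(x | z) (not shown) would then generate the sentence;
# the discrete component c is meant to capture an interpretable factor
# such as the dialogue action (e.g. "remind" vs. "request").
```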