COMP 786 (Fall 2020)

Natural Language Processing

Week 9: Summarization; Guest Talk; Machine Translation 1

Mohit Bansal

(various slides adapted/borrowed from courses by Dan Klein, Jurafsky & Martin (SLP3), Manning/Socher, and others)

Automatic Document Summarization

Statistical NLP Spring 2011

Lecture 25: Summarization

Dan Klein – UC Berkeley

Single-Document Summarization

Full document to a salient, non-redundant summary of ~100 words

1 Multi-Document Summarization

Several news sources with articles on the same topic (can use overlapping info across articles as a good feature for summarization)

… 27,000+ more

Extractive Summarization

2 Multi-document Summarization

… 27,000+ more

Extractive Summarization

Directly selecting existing sentences from the input document instead of rewriting them

2 Selection

mid-‘90s • Maximum Marginal Relevance • Graph algorithms

Graph-based Extractive Summarization

Stationary distribution represents node centrality

Nodes are sentences (s1, s2, s3, s4)

Edges are similarities

[Mihalcea et al., 2004, 2005; inter alia]
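A minimal sketch of this graph-based idea (TextRank/LexRank style): build a graph whose nodes are sentences and whose edge weights are pairwise similarities, take the stationary distribution of a random walk (PageRank) as sentence centrality, and return the top-ranked sentences. The TF-IDF cosine similarity, library choices, and greedy top-k selection below are illustrative, not the exact formulation of Mihalcea et al.

```python
# Illustrative graph-based extractive summarizer (TextRank-style sketch).
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def graph_summarize(sentences, k=3):
    # Nodes are sentences; edge weights are cosine similarities between them.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    graph = nx.from_numpy_array(sim)
    # Stationary distribution of a random walk over the graph = node centrality.
    scores = nx.pagerank(graph)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Return the k most central sentences, in original document order.
    return [sentences[i] for i in sorted(ranked[:k])]
```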

Selection

mid-‘90s • Maximum Marginal Relevance • Graph algorithms • Word distribution models

w          P_D(w)    P_A(w)
Obama      0.017     ?
speech     0.024     ?
health     0.009     ?
Montana    0.002     ?

Input document distribution P_D(w); summary distribution P_A(w). The summary distribution should approximate the input document distribution (P_A ≈ P_D).
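One common instantiation of matching P_A(w) to P_D(w) is greedy KL-based sentence selection (KLSum-style, cf. Haghighi and Vanderwende, 2009). The slide only names "word distribution models", so the algorithm, tokenization, and smoothing below are an assumed, illustrative sketch:

```python
# Hedged sketch: greedily add the sentence that keeps KL(P_D || P_A) smallest.
import math
from collections import Counter

def kl_divergence(p_doc, summary_counts, total, vocab, smooth=1e-6):
    # KL(P_D || P_A) with simple add-epsilon smoothing of the summary distribution.
    kl = 0.0
    for w in vocab:
        p = p_doc[w]
        q = (summary_counts[w] + smooth) / (total + smooth * len(vocab))
        kl += p * math.log(p / q)
    return kl

def klsum(sentences, length_limit=100):
    doc_counts = Counter(w for s in sentences for w in s.lower().split())
    n = sum(doc_counts.values())
    p_doc = {w: c / n for w, c in doc_counts.items()}
    vocab = list(p_doc)

    summary, summ_counts, summ_len = [], Counter(), 0
    while True:
        best, best_kl = None, None
        for s in sentences:
            words = s.lower().split()
            if s in summary or summ_len + len(words) > length_limit:
                continue
            trial = summ_counts + Counter(words)
            kl = kl_divergence(p_doc, trial, summ_len + len(words), vocab)
            if best_kl is None or kl < best_kl:
                best, best_kl = s, kl
        if best is None:
            return summary
        summary.append(best)
        summ_counts += Counter(best.lower().split())
        summ_len += len(best.lower().split())
```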

5 Selection Maximize Concept Coverage [Gillick and Favre, 2008]

s1: The health care bill is a major test for the Obama administration.
s2: Universal health care is a divisive issue.
s3: President Obama remained calm.
s4: Obama addressed the House on Tuesday.

concept   value
obama       3
health      2
house       1

Length limit: 18 words

summary                  length   value
{s1, s3} (greedy)          17       5
{s2, s3, s4} (optimal)     17       6

[Gillick and Favre, 2009]

Maximize Concept Coverage

Optimization problem: Set Coverage

max_{s ∈ S(D)}  Σ_{c ∈ C(s)} v_c

where v_c is the value of concept c, C(s) is the set of concepts present in summary s, and S(D) is the set of extractive summaries of document set D.

Results:

               Bigram Recall   Pyramid
Baseline            4.00         23.5
2009 system         6.85         35.0

[Gillick and Favre 09]

15 Selection

[Gillick and Favre, 2008]

s1: The health care bill is a major test for the Obama administration.
s2: Universal health care is a divisive issue.
s3: President Obama remained calm.
s4: Obama addressed the House on Tuesday.

concept   value
obama       3
health      2
house       1

Length limit: 18 words

summary                  length   value
{s1, s3} (greedy)          17       5
{s2, s3, s4} (optimal)     17       6

Maximize Concept Coverage: a set coverage optimization problem

max_{s ∈ S(D)}  Σ_{c ∈ C(s)} v_c

where v_c is the value of concept c, C(s) is the set of concepts present in summary s, and S(D) is the set of extractive summaries of document set D.

Results:

               Bigram Recall   Pyramid
Baseline            4.00         23.5
2009 system         6.85         35.0

[Gillick and Favre, 2009]

15 Selection

Maximize Concept Coverage: Integer Linear Program

The maximum coverage model can be solved using an integer linear program with constraints [Gillick, Riedhammer, Favre, Hakkani-Tur, 2008]:

maximize    Σ_i w_i c_i                          (total concept value)
subject to  Σ_j l_j s_j ≤ L                      (summary length limit)
            s_j Occ_ij ≤ c_i       ∀ i, j        (1)
            Σ_j s_j Occ_ij ≥ c_i   ∀ i           (2)
            c_i ∈ {0, 1} ∀ i,  s_j ∈ {0, 1} ∀ j

Here c_i is an indicator for the presence of concept i in the summary, s_j an indicator for the presence of sentence j in the summary, and Occ_ij indicates the occurrence of concept i in sentence j; w_i is the value of concept i, l_j the length of sentence j, and L the length limit. Constraints (1) and (2) maintain consistency between selected sentences and concepts and ensure the logical consistency of the solution: selecting a sentence necessitates selecting all the concepts it contains, and selecting a concept is only possible if it is present in at least one selected sentence.

[Gillick et al., 2008] [Gillick and Favre, 2009]
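A minimal sketch of the maximum-coverage ILP described above, using the PuLP library (the original system used its own ILP solver; the variable names and the default CBC backend here are illustrative):

```python
# Illustrative concept-coverage ILP (Gillick & Favre-style) with PuLP.
import pulp

def concept_coverage_ilp(sentences, sent_concepts, concept_values, length_limit):
    # sentences: list of token lists; sent_concepts[j]: set of concepts in sentence j.
    concepts = sorted({c for cs in sent_concepts for c in cs})
    prob = pulp.LpProblem("concept_coverage", pulp.LpMaximize)
    s = [pulp.LpVariable(f"s_{j}", cat="Binary") for j in range(len(sentences))]
    c = {ci: pulp.LpVariable(f"c_{k}", cat="Binary") for k, ci in enumerate(concepts)}

    # Objective: total value of covered concepts.
    prob += pulp.lpSum(concept_values[ci] * c[ci] for ci in concepts)
    # Summary length limit.
    prob += pulp.lpSum(len(sentences[j]) * s[j] for j in range(len(sentences))) <= length_limit
    # (1) Selecting a sentence selects all concepts it contains.
    for j, cs in enumerate(sent_concepts):
        for ci in cs:
            prob += s[j] <= c[ci]
    # (2) A concept can only be selected if some selected sentence contains it.
    for ci in concepts:
        prob += c[ci] <= pulp.lpSum(s[j] for j, cs in enumerate(sent_concepts) if ci in cs)

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [j for j in range(len(sentences)) if s[j].value() and s[j].value() > 0.5]
```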

Selection

[Gillick and Favre, 2009]

This ILP is tractable for reasonable problems

16 Problems with Extraction

What would a human do? It is therefore unsurprising that Lindsay pleaded not guilty yesterday afternoon to the charges filed against her, according to her publicist.

Beyond Extraction: Compression

If you had to write a concise summary, making effective use of the 100-word limit, you would remove some information from the lengthy sentences in the original article

What would a human do? It is therefore unsurprising that Lindsay pleaded not guilty yesterday afternoon to the charges filed against her, according to her publicist.

[Berg-Kirkpatrick et al., 2011]

19 Sentence Rewriting

[Berg-Kirkpatrick, Gillick, and Klein 11]

Beyond Extraction: Compression (Sentence Rewriting)

Model should learn the subtree deletions/cuts that allow compression

[Berg-Kirkpatrick, Gillick, and Klein, 2011]

Sentence Rewriting

New optimization problem: Safe Deletions

max_{s ∈ S(D)}  Σ_{c ∈ C(s)} v_c  +  Σ_{d ∈ D(s)} v_d

where v_d is the value of deletion d and D(s) is the set of branch-cut deletions made in creating summary s (v_c and C(s) as before).

How do we know how much a given deletion costs?

[Berg-Kirkpatrick, Gillick, and Klein 11]

21 Sentence Rewriting

The new optimization problem maximizes the concept values as well as the safe-deletion values in the candidate summary.

To decide the value/cost of a deletion, we define relevant deletion features and the model learns their weights:

[Berg-Kirkpatrick, Gillick, and Klein, 2011]

21 Beyond Extraction: Compression

Features

Some example features for concept bigrams and cuts/deletions:

Bigram Features f(b):
• COUNT: Bucketed document counts
• STOP: Stop word indicators
• POSITION: First document position indicators
• CONJ: All two- and three-way conjunctions of above
• BIAS: Always one

Cut Features f(c):
• COORD: Coordinated phrase, four versions: NP, VP, S, SBAR
• S-ADJUNCT: Adjunct to matrix verb, four versions: CC, PP, ADVP, SBAR
• REL-C: Relative clause indicator
• ATTR-C: Attribution clause indicator
• ATTR-PP: PP attribution indicator
• TEMP-PP: Temporal PP indicator
• TEMP-NP: Temporal NP indicator
• BIAS: Always one
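A hedged sketch of how these features enter the objective: the value of a candidate summary is the learned weight vector dotted with the summed feature vectors of its covered concept bigrams plus its applied cuts. The dict representation and the toy weights below are illustrative, not the learned weights of Berg-Kirkpatrick et al. (2011):

```python
# Illustrative linear scoring of a candidate summary from bigram and cut features.
def dot(weights, features):
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def summary_score(bigram_feats, cut_feats, w_bigram, w_cut):
    # bigram_feats: list of feature dicts f(b), one per bigram covered by the summary
    # cut_feats:    list of feature dicts f(c), one per subtree deletion applied
    return (sum(dot(w_bigram, f) for f in bigram_feats) +
            sum(dot(w_cut, f) for f in cut_feats))

# Example usage with toy (hypothetical) feature names and weights:
w_b = {"COUNT_3+": 1.2, "STOP": -0.8, "BIAS": 0.1}
w_c = {"REL-C": 0.4, "TEMP-PP": 0.6, "BIAS": -0.2}
score = summary_score([{"COUNT_3+": 1, "BIAS": 1}], [{"REL-C": 1, "BIAS": 1}], w_b, w_c)
```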

[Berg-Kirkpatrick et al., 2011] Neural Abstractive Summarization

Mostly based on sequence-to-sequence RNN models

Later added: attention, coverage, pointer/copy mechanisms, hierarchical encoder/attention, metric-reward RL, etc.

Examples: Rush et al., 2015; Nallapati et al., 2016; See et al., 2017; Paulus et al., 2017

Feature-Augmented Encoder-Decoder

Baseline (Nallapati et al., 2016): a bidirectional GRU-RNN encoder and a unidirectional GRU-RNN decoder with an attention mechanism over the source hidden states and a softmax layer over the target vocabulary, following the NMT model of Bahdanau et al. (2014); the large vocabulary trick (LVT) of Jean et al. (2014) restricts the decoder vocabulary of each mini-batch to the source words of that batch (plus the most frequent target words), shrinking the softmax and speeding up convergence.

Feature-rich encoder: each source word's embedding is concatenated with look-up embeddings for its POS tag, NER tag, and discretized TF and IDF values; the target side uses word-based embeddings only.

Figure 1 (Nallapati et al., 2016): Feature-rich encoder, with one embedding vector each for POS, NER tags and discretized TF and IDF values, concatenated with word-based embeddings as input to the encoder.

[Nallapati et al., 2016]
Generation + Copying (Switching Generator-Pointer)

Keywords or named entities central to the summary may be rare or unseen at training time; instead of emitting an 'UNK' placeholder, the decoder can point to their location in the source document. The decoder is equipped with a 'switch' that decides, at every time-step, between generating a word from the target vocabulary and pointing to (copying) a source word position:

P(s_i = 1) = σ(v^s · (W_h^s h_i + W_e^s E[o_{i-1}] + W_c^s c_i + b^s))

where h_i is the decoder hidden state, E[o_{i-1}] is the embedding of the previous emission, and c_i is the attention-weighted context vector. The pointer distribution is the attention distribution over source word positions, and the pointer value is its arg max. At training time, explicit pointer supervision is provided whenever the summary word is out of the decoder vocabulary; at test time, the model decides automatically at each step whether to generate or to point, based on the estimated switch probability.

Figure 2 (Nallapati et al., 2016): Switching generator/pointer model; when the switch shows 'G' the traditional softmax generator produces a word, and when it shows 'P' the pointer network copies the word from one of the source document positions.

[Nallapati et al., 2016]
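A hedged PyTorch-style sketch of one decoder step of the switching generator/pointer (module names, tensor shapes, and the greedy generate-vs-point decision are illustrative, not the authors' implementation):

```python
# Illustrative switching generator/pointer decoder step (unbatched, for clarity).
import torch
import torch.nn as nn

class SwitchingDecoderStep(nn.Module):
    def __init__(self, hidden, emb, ctx, vocab):
        super().__init__()
        self.switch = nn.Linear(hidden + emb + ctx, 1)   # P(s_i = 1) = sigmoid(v^s . [...])
        self.generator = nn.Linear(hidden + ctx, vocab)  # softmax over target vocabulary

    def forward(self, h_i, prev_emb, c_i, attn_over_source):
        # Probability of generating (switch on) vs. pointing (switch off).
        p_gen = torch.sigmoid(self.switch(torch.cat([h_i, prev_emb, c_i], dim=-1)))
        # Generator distribution over the target vocabulary.
        p_vocab = torch.softmax(self.generator(torch.cat([h_i, c_i], dim=-1)), dim=-1)
        # Pointer distribution: attention over source word positions.
        p_pointer = attn_over_source
        # Greedy test-time decision: generate or point, then take the arg max.
        if p_gen.item() >= 0.5:
            return "generate", p_vocab.argmax(dim=-1)
        return "point", p_pointer.argmax(dim=-1)
```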
Hierarchical Attention

For very long source documents, the model should identify key sentences as well as keywords within those sentences. Two bidirectional RNNs run over the source, one at the word level and one at the sentence level, with attention operating at both levels simultaneously. The word-level attention is re-weighted by the corresponding sentence-level attention and renormalized:

P^a(j) = P_w^a(j) P_s^a(s(j)) / Σ_{k=1}^{N_d} P_w^a(k) P_s^a(s(k))

where P_w^a(j) is the word-level attention at word position j, s(j) is the ID of the sentence containing word j, P_s^a(l) is the sentence-level attention for sentence l, N_d is the number of words in the source document, and P^a(j) is the re-scaled attention used to compute the attention-weighted context vector fed to the decoder. Positional embeddings are concatenated to the sentence-level RNN hidden states to model the positional importance of sentences.

Figure 3 (Nallapati et al., 2016): Hierarchical encoder with hierarchical attention; the word-level attention weights are re-scaled by the corresponding sentence-level attention weights, and sentence-level positional embeddings are concatenated to the corresponding hidden states.

[Nallapati et al., 2016]
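A minimal sketch of the hierarchical re-weighting step above, assuming plain unbatched tensors:

```python
# Illustrative re-scaling of word-level attention by sentence-level attention.
import torch

def rescale_word_attention(p_word, p_sent, sent_id):
    """
    p_word:  (N_d,) word-level attention over document word positions
    p_sent:  (S,)   sentence-level attention over document sentences
    sent_id: (N_d,) long tensor; sent_id[j] = index of the sentence containing word j
    returns  (N_d,) re-scaled attention P^a(j) = Pw(j) * Ps(s(j)) / sum_k Pw(k) * Ps(s(k))
    """
    scaled = p_word * p_sent[sent_id]
    return scaled / scaled.sum()
```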

[See et al., 2017] Pointer-Generator Networks

[See et al., 2017] Pointer-Generator Networks

[See et al., 2017] Coverage for Redundancy Reduction

[See et al., 2017] Guest Talk by Ramakanth Pasunuru:

“Soft, Layer-Specific Multi-Task Summarization with Entailment and Question Generation” (ACL 2018)

“Multi-Reward Reinforced Summarization with Saliency and Entailment” (NAACL 2018)

(20 mins) Topic: Abstractive Summarization with Multi-Task Learning and Reinforcement Learning

(presented by Ramakanth Pasunuru)

1 (slides by Ramakanth Pasunuru)

Multi-Task Learning

2 (slides by Ramakanth Pasunuru) Multi-Task Learning

• Multi-task Learning (MTL) is an inductive transfer mechanism which leverages information from related tasks to improve the primary model’s generalization performance.

• It achieves this goal by training multiple tasks in parallel while sharing representations, where the training signals from the auxiliary tasks can help improve the performance of the primary task.
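A hedged sketch of one way to realize such parallel training: alternate gradient updates across tasks according to mixing ratios, with some parameters shared inside the task models. The task names, ratios, and update interface below are illustrative assumptions, not the setup of any particular paper:

```python
# Illustrative multi-task training loop with mixing ratios.
import random

def train_multitask(tasks, mixing_ratios, num_steps):
    """tasks: dict name -> (model, data_iterator, update_fn); shared parameters live inside the models."""
    names = list(tasks)
    weights = [mixing_ratios[n] for n in names]
    for _ in range(num_steps):
        name = random.choices(names, weights=weights, k=1)[0]
        model, data, update_fn = tasks[name]
        batch = next(data)
        update_fn(model, batch)   # one gradient step on this task's loss

# Usage (hypothetical): the primary task gets most of the updates, auxiliaries fewer.
# train_multitask(tasks, {"summarization": 0.8, "entailment_gen": 0.1, "question_gen": 0.1}, 100000)
```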

3 [Caruana, 1998; Argyriou et al., 2007; Kumar and Daume, 2012; Luong et al., 2016]

Previous Work

Multi-task sequence-to-sequence learning (Luong et al., 2016):
• One-to-many setting: one encoder, multiple decoders (e.g., German translation, English tags for parsing, English unsupervised autoencoding); useful for multi-target translation as in Dong et al. (2015) or between different tasks. Mixing ratios (α) give the proportions of parameter updates allocated to the different tasks.
• Many-to-one setting: multiple encoders, one decoder (e.g., German translation and image captioning into English); handy for tasks in which only the decoder can be shared, and can benefit from monolingual data on the target side.
• Many-to-many setting: multiple encoders and multiple decoders, e.g., translation combined with unsupervised objectives over the source and target languages.

A Joint Many-Task model (Hashimoto et al., 2016): a single end-to-end model predicts increasingly complex tasks at successively deeper layers (word level: POS tagging, chunking; syntactic level: dependency parsing; semantic level: relatedness, entailment), with shortcut connections to lower-level task predictions and a simple regularization term that allows optimizing all model weights without catastrophic interference between tasks.

[Luong et al., 2016] [Hashimoto et al., 2016] [Long and Wang, 2015; Lu et al., 2016; Misra et al., 2016; Kendall et al., 2017; Ruder et al., 2018]

(slides by Ramakanth Pasunuru)

MTL for Summarization

5 (slides by Ramakanth Pasunuru) [Pasunuru, Guo, & Bansal, ACL 2018] MTL for Summarization

Input Document (CNN/DailyMail example): celtic have written to the scottish football association in order to gain an ‘understanding’ of the refereeing decisions during their scottish cup semi-final defeat by inverness on sunday . the hoops were left outraged by referee steven mclean ’s failure to award a penalty or red card for a clear handball in the box by josh meekings to deny leigh griffiths ’s goal-bound shot during the first-half [...]

• An accurate abstractive summary of a document should contain all its salient information and should be logically entailed by the input document.

Ground-truth: celtic were defeated 3-2 after extra-time in the scottish cup semi-final . leigh griffiths had a goal-bound shot blocked by a clear handball . however , no action was taken against offender josh meekings . the hoops have written the sfa for an ’understanding’ of the decision .

See et al. (2017): john hartson was once on the end of a major hampden injustice while playing for celtic . but he can not see any point in his old club writing to the scottish football association over the latest controversy at the national stadium . hartson had a goal wrongly disallowed for offside while celtic were leading 1-0 at the time but went on to lose 3-2 .

Our Baseline: john hartson scored the late winner in 3-2 win against celtic . celtic were leading 1-0 at the time but went on to lose 3-2 . some fans have questioned how referee steven mclean and additional assistant alan muir could have missed the infringement .

Multi-task: celtic have written to the scottish football association in order to gain an ‘ understanding ’ of the refereeing decisions . the hoops were left outraged by referee steven mclean ’s failure to award a penalty or red card for a clear handball in the box by josh meekings . celtic striker leigh griffiths has a goal-bound shot blocked by the outstretched arm of josh meekings .

Figure 3 (Pasunuru, Guo, & Bansal, ACL 2018): example summaries generated by See et al. (2017), our baseline, and the 3-way multi-task model with summarization and both entailment generation and question generation. The See et al. (2017) and baseline outputs include non-entailed words/phrases (e.g., "john hartson") and miss salient information ("hoops", "josh meekings", "leigh griffiths"); the multi-task model covers more salient information and avoids unrelated information.

(slides by Ramakanth Pasunuru) [Pasunuru, Guo, & Bansal, ACL 2018] MTL for Summarization

(Same example document and output summaries as on the previous slide, with the following additional points.)

• We improve these via multi-task learning with auxiliary tasks of question generation and entailment generation.

• Question Generation teaches the summarization model how to look for salient, questioning-worthy details.

• Entailment Generation teaches the model how to rewrite a summary which is a directed-logical subset of the input document.

(slides by Ramakanth Pasunuru) Summarization Model

Figure 2 (See et al., 2017): Baseline sequence-to-sequence model with attention. The model may attend to relevant words in the source text to generate novel words, e.g., to produce the novel word "beat" in the abstractive summary "Germany beat Argentina 2-0" the model may attend to the words "victorious" and "win" in the source text. [See et al., 2017]

Baseline (See et al., 2017): the article tokens w_i are fed one-by-one into a single-layer bidirectional LSTM encoder, producing encoder hidden states h_i; a single-layer unidirectional LSTM decoder with state s_t computes the attention distribution as in Bahdanau et al. (2015):

e_i^t = v^T tanh(W_h h_i + W_s s_t + b_attn),   a^t = softmax(e^t)

The hybrid pointer-generator network copies words from the source text via pointing (Vinyals et al., 2015), which improves accuracy and handling of OOV words while retaining the ability to generate new words. A coverage vector (a novel variant of Tu et al., 2016, from neural machine translation) tracks and controls coverage of the source document and is remarkably effective for eliminating repetition.

[See et al., 2017]
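A hedged sketch of the coverage mechanism: the coverage vector accumulates past attention distributions, is fed into the attention scores, and a coverage loss penalizes re-attending to already-covered source positions. Module and parameter names are illustrative, not the released implementation of See et al. (2017):

```python
# Illustrative coverage-augmented attention (unbatched, for clarity).
import torch
import torch.nn as nn

class CoverageAttention(nn.Module):
    def __init__(self, hidden):
        super().__init__()
        self.W_h = nn.Linear(hidden, hidden, bias=False)
        self.W_s = nn.Linear(hidden, hidden, bias=True)   # bias plays the role of b_attn
        self.w_c = nn.Linear(1, hidden, bias=False)       # extra coverage term
        self.v = nn.Linear(hidden, 1, bias=False)

    def forward(self, enc_states, dec_state, coverage):
        # e_i^t = v^T tanh(W_h h_i + W_s s_t + w_c c_i^t + b_attn)
        scores = self.v(torch.tanh(self.W_h(enc_states)
                                   + self.W_s(dec_state).unsqueeze(0)
                                   + self.w_c(coverage.unsqueeze(-1)))).squeeze(-1)
        attn = torch.softmax(scores, dim=0)               # a^t over source positions
        cov_loss = torch.min(attn, coverage).sum()        # covloss_t = sum_i min(a_i^t, c_i^t)
        return attn, coverage + attn, cov_loss            # next coverage: c^{t+1} = c^t + a^t
```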

(slides by Ramakanth Pasunuru) [Pasunuru, Guo, & Bansal, ACL 2018]

Auxiliary Task: Question Generation

• The task of question generation is to generate a question from a given input sentence, which in turn is related to the skill of being able to find the important salient information to ask questions about the sentence.

• A good summary should also be able to find and extract all the salient information in the given source document, and hence we incorporate such capabilities into our abstractive text summarization model by multi-task learning it with a question generation task, sharing some common parameters/representations.

• Example (Du et al., 2017, "Learning to Ask: Neural Question Generation for Reading Comprehension"):
Sentence: Oxygen is used in cellular respiration and released by photosynthesis, which uses the energy of sunlight to produce oxygen from water.
Questions: What life process produces oxygen in the presence of light? / Photosynthesis uses which energy to form oxygen from water? / From what does photosynthesis get oxygen?

8 [Rajpurkar et al., 2016; Du et al., 2017] Image Credits: Du et al., 2017

(slides by Ramakanth Pasunuru) [Pasunuru, Guo, & Bansal, ACL 2018] Auxiliary Task: Entailment Generation

• Directional, logical-implication relation between two sentences: • Premise: A girl is jumping on skateboard in the middle of a red bridge. • Entailment: The girl does a skateboarding trick. • Contradiction: The girl skates down the sidewalk. • Neutral: The girl is wearing safety equipment. • Premise: A blond woman is drinking from a public fountain. • Entailment: The woman is drinking water. • Contradiction: The woman is drinking coffee. • Neutral: The woman is very thirsty.

9 (slides by Ramakanth Pasunuru) [Pasunuru, Guo, & Bansal, ACL 2018] Auxiliary Task: Entailment Generation

• Directional, logical-implication relation between two sentences: • Premise: A girl is jumping on skateboard in the middle of a red bridge. • Entailment: The girl does a skateboarding trick. • Contradiction: The girl skates down the sidewalk. • Neutral: The girl is wearing safety equipment. • Premise: A blond woman is drinking from a public fountain. • Entailment: The woman is drinking water. • Contradiction: The woman is drinking coffee. • Neutral: The woman is very thirsty.

• The task of entailment generation is to generate a hypothesis which is entailed by (or logically follows from) the given premise as input.

• In summarization, the generation decoder also needs to generate a summary that is entailed by the source document, i.e., does not contain any contradictory or unrelated/extraneous information as compared to the input document.

10 (slides by Ramakanth Pasunuru) [Pasunuru, Guo, & Bansal, ACL 2018] MTL Architecture

[Architecture figure: QG, SG, and EG encoders feed an unshared encoder layer 1 and a shared encoder layer 2; a shared attention distribution feeds a shared decoder layer 1 and an unshared decoder layer 2, ending in the QG, SG, and EG decoders]

• QG stands for Question Generation
• SG stands for Summary Generation
• EG stands for Entailment Generation

11 (slides by Ramakanth Pasunuru) [Pasunuru, Guo, & Bansal, ACL 2018] MTL Architecture (Layer Specific Sharing)

• Belinkov et al. (2017) observed that lower layers of RNN cells in a seq2seq machine translation model learn to represent word structure, while higher layers are more focused on high-level semantic meanings.

• We believe that although these tasks have different training data distributions and low-level representations, they can still benefit from sharing their models' high-level components.

• Thus, we keep the lower-level layer of the 2-layer encoder/decoder of all three tasks unshared, while we share the higher layer across the three tasks.

12 (slides by Ramakanth Pasunuru) [Pasunuru, Guo, & Bansal, ACL 2018] Soft vs. Hard Parameter Sharing

• Hard-sharing: In the most common multi-task learning (hard-sharing) approach, the parameters to be shared are forced to be the same. As a result, gradient information from multiple tasks will directly pass through shared parameters, hence forcing a common space representation for all the related tasks. [Figure: tasks A, B, C with task-specific layers on top of fully shared layers]

• Soft-sharing: We encourage shared parameters to be close in representation space by penalizing their L2 distances. Unlike hard sharing, this approach gives more flexibility for the tasks by only loosely coupling the shared-space representations. [Figure: tasks A, B, C with loosely constrained layers]
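A minimal sketch of the soft-sharing L2 penalty (the pairing of parameters and the coefficient are experimenter choices, assumed here for illustration):

```python
# Illustrative soft parameter sharing: penalize the L2 distance between
# corresponding "shared" parameters of two task-specific modules.
import torch

def soft_sharing_penalty(params_task_a, params_task_b, coeff=1e-4):
    # Assumes the two parameter lists are aligned and have matching shapes.
    penalty = sum(torch.sum((pa - pb) ** 2)
                  for pa, pb in zip(params_task_a, params_task_b))
    return coeff * penalty

# Usage (hypothetical module names):
# total_loss = summarization_loss + entailment_loss + soft_sharing_penalty(
#     summ_encoder_layer2.parameters(), entail_encoder_layer2.parameters())
```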

13 [Duong et al., 2015; Yang & Hospedales, 2017] Image Credits: [7] (slides by Ramakanth Pasunuru) [Pasunuru, Guo, & Bansal, ACL 2018] Results

Models                                          ROUGE-1  ROUGE-2  ROUGE-L  METEOR
PREVIOUS WORK
Seq2Seq (50k vocab) (See et al., 2017)            31.33    11.81    28.83    12.03
Pointer (See et al., 2017)                        36.44    15.66    33.42    15.35
Pointer+Coverage (See et al., 2017) [reported]    39.53    17.28    36.38    18.72
Pointer+Coverage (See et al., 2017) [pretrained]† 38.82    16.81    35.71    18.14
OUR MODELS
Two-Layer Baseline (Pointer+Coverage)             39.56    17.52    36.36    18.17
⊗ + Entailment Generation                         39.84    17.63    36.54    18.61
⊗ + Question Generation                           39.73    17.59    36.48    18.33
⊗ + Entailment Gen. + Question Gen.               39.81    17.64    36.54    18.54

Table: Performance of our multi-task models on the CNN/DailyMail dataset (~300K examples). ROUGE scores are full-length F-1 (as in previous work). All the multi-task improvements are statistically significant over the state-of-the-art baseline.
† As noted on the See et al. (2017) github, their publicly released pretrained model produces these lower scores.

Models                              R-1     R-2     R-L
PREVIOUS WORK
ABS+ (Rush et al., 2015)           29.76   11.88   26.96
RAS-El (Chopra et al., 2016)       33.78   15.97   31.15
lvt2k (Nallapati et al., 2016)     32.67   15.59   30.64
Pasunuru et al. (2017)             32.75   15.35   30.82
OUR MODELS
2-Layer Pointer Baseline           34.26   16.40   32.03
⊗ + Entailment Generation          35.45   17.16   33.19
⊗ + Question Generation            35.48   17.31   32.97
⊗ + Entailment + Question          35.98   17.76   33.63

Table: Summarization results on Gigaword. ROUGE scores are full-length F-1. All the multi-task improvements are statistically significant over the state-of-the-art baseline.

* Our multi-task model is stat. signif. better than the baseline (based on bootstrap test with 100K samples: Efron and Tibshirani, 1994).

(slides by Ramakanth Pasunuru) [Pasunuru, Guo, & Bansal, ACL 2018] Results

Models ROUGE-1 ROUGE-2 ROUGE-L METEOR PREVIOUS WORK Seq2Seq(50k vocab) (See et al., 2017) 31.33 11.81 28.83 12.03 Pointer (See et al., 2017) 36.44 15.66 33.42 15.35 Pointer+Coverage (See et al., 2017) ? 39.53 17.28 36.38 18.72 Pointer+Coverage (See et al., 2017) 38.82 16.81 35.71 18.14 † OUR MODELS Two-Layer Baseline (Pointer+Coverage) 39.56 17.52 36.36 18.17 + Entailment Generation ⌦ 39.84 17.63 36.54 18.61 ⌦ + Question Generation 39.73 17.59 36.48 18.33 ⌦ + Entailment Gen. + Question Gen. 39.81 17.64 36.54 18.54 ⌦ Table 1: CNN/DailyMailTable: Performance summarization of our multi-task results. models ROUGEon CNN/DailyMail scores dataset are full (~300K length examples). F-1 (as previous work). All the multi-task improvementsHuman evaluation: are statistically Multi-task significant model is over better the than state-of-the-art baseline baseline. proval rate greater than 95%, and had at least Models R-1 R-2 R-L PREVIOUS WORK 10,000 approved HITs. For the pairwise model ABS+ (Rush et al., 2015) 29.76 11.88 26.96 comparisons discussed in Sec. 6.2, we showed the RAS-El (Chopra et al., 2016) 33.78 15.97 31.15 annotators the input article, the ground truth sum- lvt2k (Nallapati et al., 2016) 32.67 15.59 30.64 Pasunuru et al. (2017) 32.75 15.35 30.82 mary, and the two model summaries (randomly OUR MODELS shuffled to anonymize model identities) – we then 2-Layer Pointer Baseline 34.26 16.40 32.03 + Entailment Generation⌦ 35.45 17.16 33.19 asked them to choose the better among the two * Our multi-task model is stat. signif. better than baseline (based on bootstrap test with 14 ⌦ + Question100K samples: Generation Efron and Tibshirani,35.48 1994). 17.31 32.97 model summaries or choose ‘Not-Distinguishable’ ⌦ + Entailment + Question 35.98 17.76 33.63 if both summaries are equally good/bad. In- ⌦ Table 2: Summarization results on Gigaword. structions for relevance were defined based on ROUGE scores are full length F-1. All the multi- the summary containing salient/important infor- task improvements are statistically significant over mation from the given article, being correct the state-of-the-art baseline. (i.e., avoiding contradictory/unrelated informa- tion), and avoiding redundancy. Instructions for Multi-Task with Entailment Generation We readability were based on the summary’s fluency, first perform multi-task learning between ab- grammaticality, and coherence. stractive summarization and entailment genera- tion with soft-sharing of parameters as discussed Training Details All our soft/hard and layer- in Sec. 4. Table 1 and Table 2 shows that this specific sharing decisions were made on the val- multi-task setting is better than our strong base- idation/development set. Details of RNN hidden line models and the improvements are statistically state sizes, Adam optimizer, mixing ratios, etc. are significant on all metrics5 on both CNN/DailyMail provided in the supplementary for reproducibility. (p<0.01 in ROUGE-1/ROUGE-L/METEOR and p<0.05 in ROUGE-2) and Gigaword (p<0.01 6 Results on all metrics) datasets, showing that entailment generation task is inducing useful inference skills 6.1 Summarization (Primary Task) Results to the summarization task (also see analysis exam- ples in Sec. 7). Pointer+Coverage Baseline We start from the 3 strong model of See et al. (2017). Table 1 shows Multi-Task with Question Generation For that our baseline model performs better than or multi-task learning with question generation, 4 comparable to See et al. (2017). 
On Gigaword the improvements are statistically significant in dataset, our baseline model (with pointer only, ROUGE-1 (p<0.01), ROUGE-L (p<0.05), and since coverage not needed for this single-sentence METEOR (p<0.01) for CNN/DailyMail and in summarization task) performs better than all pre- all metrics (p<0.01) for Gigaword, compared vious works, as shown in Table 2. to the respective baseline models. Also, Sec. 7

(slides by Ramakanth Pasunuru) [Pasunuru, Guo, & Bansal, ACL 2018]

Results

Table: Performance of our multi-task models on the CNN/DailyMail dataset (~300K examples). Human evaluation: the multi-task model is better than the state-of-the-art baseline.

Human evaluation (pairwise comparisons, 100 samples each):

CNN/DailyMail                     Relevance  Readability  Total
MTL vs. Baseline
  MTL wins                            43          40         83
  Baseline wins                       22          24         46
  Non-distinguishable                 35          36         71
MTL vs. See et al. (2017)
  MTL wins                            39          33         72
  See et al. (2017) wins              29          38         67
  Non-distinguishable                 32          29         61

Gigaword (MTL vs. Baseline)       Relevance  Readability  Total
  MTL wins                            33          32         65
  Baseline wins                       22          22         44
  Non-distinguishable                 45          46         91

Our 3-way multi-task (MTL) model achieves both higher relevance and higher readability scores w.r.t. the baseline. W.r.t. See et al. (2017), our MTL model is higher in relevance (and in total aggregate score) but a bit lower in readability; one potential reason is that the entailment generation auxiliary task encourages the summarization model to rewrite more and to be more abstractive than See et al. (2017).

Generalizability (DUC-2002): models trained on CNN/DailyMail are tested directly on DUC-2002. Our multi-task model achieves statistically significant improvements over the strong baseline and over the pointer-coverage model of See et al. (2017).

Table: Performance of various models on the DUC-2002 test-only setup (567 examples); ROUGE F1 scores.

Models                   R-1     R-2     R-L
See et al. (2017)       34.30   14.25   30.82
Baseline                35.96   15.91   32.92
Multi-Task (EG + QG)    36.73   16.15   33.58

* Our multi-task model is stat. signif. better than the baseline (based on bootstrap test with 100K samples; Efron and Tibshirani, 1994).
(slides by Ramakanth Pasunuru) [Pasunuru, Guo, & Bansal, ACL 2018]

Analysis

Input Document: celtic have written to the scottish football association in order to gain an ‘ understanding ’ of the refereeing decisions during their scottish cup semi-final defeat by inverness on sunday . the hoops were left outraged by referee steven mclean ’s failure to award a penalty or red card for a clear handball in the box by josh meekings to deny leigh griffiths ’ goal-bound shot during the first-half . caley thistle went on to win the game 3-2 after extra-time and denied rory delia ’s men the chance to secure a domestic treble this season . celtic striker leigh griffiths has a goal-bound shot blocked by the outstretched arm of josh meekings . celtic ’s adam matthews -lrb- right -rrb- slides in with a strong challenge on nick ross in the scottish cup semi-final . ‘ given the level of reaction from our supporters and across football , we are duty bound to seek an understanding of what actually happened , ’ celtic said in a statement . they added , ‘ we have not been given any other specific explanation so far and this is simply to understand the circumstances of what went on and why such an obvious error was made . ’ however , the parkhead outfit made a point of congratulating their opponents , who have reached the first-ever scottish cup final in their history , describing caley as a ‘ fantastic club ’ and saying ‘ reaching the final is a great achievement . ’ celtic had taken the lead in the semi-final through defender virgil van dijk ’s curling free-kick on 18 minutes , but were unable to double that lead thanks to the meekings controversy . it allowed inverness a route back into the game and celtic had goalkeeper craig gordon sent off after the restart for scything down marley watkins in the area . greg tansey duly converted the resulting penalty . edward ofere then put caley thistle ahead , only for john guidetti to draw level for the bhoys . with the game seemingly heading for penalties , david raven scored the winner on 117 minutes , breaking thousands of celtic hearts . celtic captain scott brown -lrb- left -rrb- protests to referee steven mclean but the handball goes unpunished . griffiths shows off his acrobatic skills during celtic ’s eventual surprise defeat by inverness . celtic pair aleksandar tonev -lrb- left -rrb- and john guidetti look dejected as their hopes of a domestic treble end .

Ground-truth: celtic were defeated 3-2 after extra-time in the scottish cup semi-final . leigh griffiths had a goal-bound shot blocked by a clear handball . however , no action was taken against offender josh meekings . the hoops have written the sfa for an ’understanding’ of the decision .

See et al. (2017): john hartson was once on the end of a major hampden injustice while playing for celtic . but he can not see any point in his old club writing to the scottish football association over the latest controversy at the national stadium . hartson had a goal wrongly disallowed for offside while celtic were leading 1-0 at the time but went on to lose 3-2 .

Our Baseline: john hartson scored the late winner in 3-2 win against celtic . celtic were leading 1-0 at the time but went on to lose 3-2 . some fans have questioned how referee steven mclean and additional assistant alan muir could have missed the infringement .

Multi-task: celtic have written to the scottish football association in order to gain an ‘ understanding ’ of the refereeing decisions . the hoops were left outraged by referee steven mclean ’s failure to award a penalty or red card for a clear handball in the box by josh meekings . celtic striker leigh griffiths has a goal-bound shot blocked by the outstretched arm of josh meekings .

Figure: Example of summaries generated by See et al. (2017), our baseline, and our 3-way multi-task model with summarization and both entailment generation and question generation. The boxed-red highlighted words/phrases are not present in the input source document in any paraphrased form; the unboxed-green highlighted words/phrases correspond to the salient information. As shown, the outputs from See et al. (2017) and the baseline both include non-entailed words/phrases (e.g., "john hartson") and miss salient information ("hoops", "josh meekings", "leigh griffiths") in their output summaries. Our multi-task model, however, manages to accomplish both, i.e., it covers more of the salient information and also avoids unrelated information.

* Our multi-task models are stat. signif. better than baselines.

15 (slides by Ramakanth Pasunuru)

Reinforcement Learning

16 (slides by Ramakanth Pasunuru) Reinforcement Learning

• Reinforcement Learning (RL) is a training mechanism in which an agent or a policy is allowed to interact with a given environment in order to maximize a reward.

• RL has been successfully applied to many research areas such as continuous control, dialogue systems, and games.

• Recently, a special case of RL, policy-gradient-based reinforcement learning, has been widely applied to text generation problems in NLP through the REINFORCE algorithm.

17 [ Williams, 1992; Tesauro, 1995; White & Sofge, 1992; Singh et al., 2002; Ren et al., 2017; Rennie et al., 2017; Paulus et al., 2018; Chen & Bansal, 2018; Celikyilmaz et al., 2018] (slides by Ramakanth Pasunuru) REINFORCE

LSTM SAMPLER

Reward

Figure: Overview of an LSTM decoder with sampling of words in a sequential fashion to generate a sentence. We measure a reward for the generated sentence w.r.t. the ground-truth and use this reward to update the RL policy (model).

REINFORCE objective and policy gradient (with a baseline b for variance reduction):

  L_RL(θ) = -E_{w^s ~ p_θ}[ r(w^s) ]
  ∇θ L_RL(θ) ≈ -(r(w^s) - b) ∇θ log p_θ(w^s),   w^s ~ p_θ

In practice a mixed loss is used so that the generated text stays readable:

  L_Mixed = γ L_RL + (1 - γ) L_XE,   where γ is a tunable hyperparameter.
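To make this update rule concrete, here is a minimal PyTorch-style sketch (not code from the paper; `decoder`, its `sample`/`greedy`/`xent` methods, and `reward_fn` are hypothetical placeholders) of self-critical REINFORCE combined with the mixed XE+RL objective:

```python
import torch

def reinforce_mixed_loss(decoder, src, tgt, reward_fn, gamma=0.99):
    """Sketch of a mixed XE + self-critical REINFORCE loss.

    Assumed (hypothetical) interface:
      decoder.sample(src) -> (sampled_ids, log_probs)  # stochastic rollout,
                                                       # log_probs: (batch, time)
      decoder.greedy(src) -> baseline_ids              # argmax rollout (baseline b)
      decoder.xent(src, tgt) -> scalar                 # teacher-forced cross-entropy
      reward_fn(hyp_ids, ref_ids) -> (batch,) tensor   # e.g. ROUGE-L per example
    """
    # Stochastic rollout: sample a sequence and keep per-token log-probabilities.
    sampled_ids, log_probs = decoder.sample(src)
    # Baseline rollout: greedy decoding, no gradients needed.
    with torch.no_grad():
        baseline_ids = decoder.greedy(src)

    # Sequence-level rewards for the sample and the baseline.
    r_sample = reward_fn(sampled_ids, tgt)
    r_baseline = reward_fn(baseline_ids, tgt)

    # REINFORCE with baseline: -(r(w^s) - b) * sum_t log p_theta(w^s_t).
    advantage = (r_sample - r_baseline).detach()
    rl_loss = -(advantage * log_probs.sum(dim=1)).mean()

    # Mixing with cross-entropy keeps the generated text readable.
    xe_loss = decoder.xent(src, tgt)
    return gamma * rl_loss + (1.0 - gamma) * xe_loss
```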

18 [Williams, 1992] (slides by Ramakanth Pasunuru)

[Pasunuru & Bansal, NAACL 2018] RL for Abstractive Summarization

We address three important aspects (saliency, directed logical entailment/correctness, and non-redundancy) of a good abstractive text summary via a reinforcement learning approach with two novel reward functions.

19 [See et al., 2017; Paulus et al., 2017; Celikyilmaz et al., 2018] (slides by Ramakanth Pasunuru) [Pasunuru & Bansal, NAACL 2018] RL for Abstractive Summarization

We address three important aspects (saliency, directed logical entailment/correctness, and non-redundancy) of a good abstractive text summary via a reinforcement learning approach with two novel reward functions.

We also introduce a novel and effective multi-reward approach of optimizing multiple rewards simultaneously in alternate multi-task mini-batches.
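As a rough illustration (a sketch under assumed helper names such as `rl_loss`, `rougesal_reward`, and `entail_reward`, not the authors' released code), the alternating scheme simply cycles through the reward functions across mini-batches while every mini-batch updates the same shared parameters:

```python
def train_multi_reward(model, batches, rl_loss, rewards, optimizer):
    """Alternate the reward driving the RL loss across consecutive mini-batches.

    rewards: list of reward functions sharing one model, e.g.
             [rougesal_reward, entail_reward] (hypothetical names).
    rl_loss(model, batch, reward_fn) -> differentiable REINFORCE-style loss.
    """
    for step, batch in enumerate(batches):
        reward_fn = rewards[step % len(rewards)]   # pick the reward for this mini-batch
        loss = rl_loss(model, batch, reward_fn)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Because each reward gets its own mini-batches, there is no need to hand-tune a scaling or weighting between rewards whose magnitudes may differ.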

19 [See et al., 2017; Paulus et al., 2017; Celikyilmaz et al., 2018] (slides by Ramakanth Pasunuru) [Pasunuru & Bansal, NAACL 2018] Reward Functions

Rouge Reward

Based on the primary summarization metric from the ROUGE package (Lin, 2004).

20 (slides by Ramakanth Pasunuru) [Pasunuru & Bansal, NAACL 2018] Reward Functions

Rouge Reward: Based on the primary summarization metric from the ROUGE package (Lin, 2004).

Saliency Reward: Gives higher weight to the important, salient words/phrases when calculating the ROUGE score.

Entailment Reward: Based on whether each sentence of the generated summary is entailed by the ground-truth summary.

20 (slides by Ramakanth Pasunuru) [Pasunuru & Bansal, NAACL 2018] Saliency Reward: ROUGESal

ROUGESal reward gives higher weight to the important, salient words/phrases when calculating the ROUGE score (which by default assumes all words are equally weighted):

• To learn these saliency weights, we train our saliency predictor on {sentence, answer spans} pairs from the popular SQuAD reading comprehension dataset (Rajpurkar et al., 2016) (Wiki domain).

• We treat the human-annotated answer spans for important questions as representative salient information in the document.

• This saliency predictor is run on the ground-truth summary to get an importance weight for each word (used in ROUGE matching); a sketch of the weighted matching follows below.

Figure: Overview of our saliency prediction model (example input "John is playing with a dog", with a binary saliency label predicted for each token).
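As an illustration of how such saliency weights could enter the matching, here is a simplified, unigram-level sketch (the helper `saliency_probs` is hypothetical; the paper's actual ROUGESal uses weighted ROUGE precision/recall/F-1 formulations detailed in its appendix):

```python
def rougesal_like(generated_tokens, reference_tokens, saliency_probs):
    """Saliency-weighted unigram F1 (simplified; no clipping of repeated matches).

    saliency_probs: dict mapping reference tokens to the probability assigned by
    the saliency predictor run on the ground-truth summary.
    """
    # Each reference token carries its predicted saliency weight (default 1.0).
    ref_weights = {tok: saliency_probs.get(tok, 1.0) for tok in reference_tokens}

    matched = sum(ref_weights[tok] for tok in generated_tokens if tok in ref_weights)
    total_ref = sum(ref_weights.values())
    total_gen = sum(ref_weights.get(tok, 1.0) for tok in generated_tokens)

    recall = matched / total_ref if total_ref else 0.0
    precision = matched / total_gen if total_gen else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```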

21 (slides by Ramakanth Pasunuru) [Pasunuru & Bansal, NAACL 2018] Entailment Reward: Entail

• A good summary should be logically entailed by the source document, i.e., have no contradictory/unrelated information. We use an entailment scorer and its multi-sentence, length-normalized extension (to avoid very short sentences achieving misleadingly high entailment scores) as our "Entail" reward.
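A minimal sketch of this sentence-level, length-normalized Entail reward (assuming a hypothetical `entailment_prob` wrapper around a trained NLI classifier such as the Parikh et al. (2016) model; the classifier training details and the exact normalization constraint follow on the next bullets):

```python
def entail_reward(generated_sents, reference_summary, entailment_prob):
    """Average per-sentence entailment probability, scaled by a length ratio.

    entailment_prob(premise, hypothesis) -> P(entailment) from a trained NLI
    classifier (hypothetical wrapper).
    """
    if not generated_sents:
        return 0.0
    # Ground-truth summary as premise, each generated sentence as hypothesis.
    avg_entail = sum(
        entailment_prob(premise=reference_summary, hypothesis=sent)
        for sent in generated_sents
    ) / len(generated_sents)

    # Length normalization discourages very short summaries that trivially entail.
    gen_len = sum(len(sent.split()) for sent in generated_sents)
    ref_len = max(1, len(reference_summary.split()))
    return avg_entail * (gen_len / ref_len)
```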

• We train the entailment classifier (Parikh et al., 2016) on the SNLI and Multi-NLI datasets and calculate the entailment probability score between the ground-truth (GT) summary (as premise) and each sentence of the generated summary (as hypothesis), and use the average score as our Entail reward.

• Finally, we add a length normalization constraint to avoid very short sentences achieving misleadingly high entailment scores:

  Entail' = Entail × (#tokens in generated summary / #tokens in reference summary)

22 (slides by Ramakanth Pasunuru) [Pasunuru & Bansal, NAACL 2018]

Multi-Reward Optimization

• One approach for multi-reward optimization is to use a weighted combination of the rewards, but this has the issue of finding the complex scaling and weight balance among these diverse reward combinations.

23

• To address this issue, we instead introduce a simple multi-reward optimization approach inspired by multi-task learning: the rewards act like different tasks that share all model parameters while having their own optimization function (a different reward function in this case), trained in alternate mini-batches:

  ∇θ L_RL1 = -(r1(w^s) - r1(w^a)) ∇θ log p_θ(w^s)
  ∇θ L_RL2 = -(r2(w^s) - r2(w^a)) ∇θ log p_θ(w^s)

  where r1 and r2 are the two reward functions to be optimized simultaneously, w^s is the sampled summary, and w^a is the baseline summary used for variance reduction.

(slides by Ramakanth Pasunuru) [Pasunuru & Bansal, NAACL 2018]

Results (CNN/Daily Mail)

Results (CNN/Daily Mail)

5 Experimental Setup

5.1 Datasets and Training Details
The CNN/Daily Mail dataset (Hermann et al., 2015; Nallapati et al., 2016) is a collection of online news articles and their summaries. We use the non-anonymous version of the dataset as described in See et al. (2017). For test-only generalization experiments, we use the DUC-2002 single-document summarization dataset.3 For the entailment reward classifier, we use a combination of the full Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) and the recent Multi-NLI corpus (Williams et al., 2017) training datasets. For our saliency prediction model, we use the Stanford Question Answering (SQuAD) dataset (Rajpurkar et al., 2016). All dataset splits and other training details (dimension sizes, learning rates, etc.) for reproducibility are in the appendix.

5.2 Evaluation Metrics
We use the standard ROUGE package (Lin, 2004) and METEOR package (Denkowski and Lavie, 2014) for reporting the results on all of our summarization models. Following previous work (Chopra et al., 2016; Nallapati et al., 2016; See et al., 2017), we use the ROUGE full-length F1 variant.

6 Results

Table 1: Results on CNN/Daily Mail (non-anonymous). * represents previous work on the anonymous version. 'XE': cross-entropy loss, 'RL': reinforce mixed loss (XE+RL). Columns 'R': ROUGE, 'M': METEOR. Final multi-reward RL model improvements are statistically significant over the baseline, ROUGE-RL, and Entail-RL.

Models               R-1    R-2    R-L    M
PREVIOUS WORK
Nallapati (2016)*    35.46  13.30  32.65  -
See et al. (2017)    39.53  17.28  36.38  18.72
Paulus (2017) (XE)*  38.30  14.81  35.49  -
Paulus (2017) (RL)*  39.87  15.82  36.90  -
OUR MODELS
Baseline (XE)        39.41  17.33  36.07  18.27
ROUGE (RL)           39.99  17.72  36.66  18.93
Entail (RL)          39.53  17.51  36.44  20.15
ROUGESal (RL)        40.36  17.97  37.00  19.84
ROUGE+Ent (RL)       40.37  17.89  37.13  19.94
ROUGESal+Ent (RL)    40.43  18.00  37.10  20.02

Baseline Cross-entropy Model: Our abstractive summarization model has attention, pointer-copy, and coverage mechanisms. First, we apply cross-entropy optimization and achieve results comparable to previous work (See et al., 2017).4

ROUGE Rewards: First, using ROUGE-L as the RL reward (shown as ROUGE in Table 1) improves the performance on CNN/Daily Mail in all metrics with statistically significant scores (p < 0.001) as compared to the cross-entropy baseline (and also stat. signif. w.r.t. See et al. (2017)). Similar to Paulus et al. (2017), we use the mixed loss function (XE+RL) for all our reinforcement experiments, to ensure good readability of the generated summaries.

ROUGESal and Entail Rewards: With our novel ROUGESal reward, we achieve stat. signif. improvements in all metrics w.r.t. the baseline as well as w.r.t. the ROUGE-reward results (p < 0.001), showing that saliency knowledge is strongly improving the summarization model. For our Entail reward, we achieve stat. signif. improvements in ROUGE-L (p < 0.001) w.r.t. the baseline and achieve the best METEOR score by a large margin. See Sec. 7 for analysis of the saliency/entailment skills learned by our models.

Multi-Reward Results: Similar to ROUGESal, Entail is a better reward when combined with the complementary phrase-matching metric information in ROUGE; Table 1 shows that the ROUGE+Entail multi-reward combination performs stat. signif. better than ROUGE-reward in ROUGE-1, ROUGE-L, and METEOR (p < 0.001), and better than Entail-reward in all metrics.

Human evaluation: our multi-reward model is better than the baseline.

3 http://www-nlpir.nist.gov/projects/duc/guidelines/2002.html
4 Our baseline is statistically equal to the paper-reported scores of See et al. (2017) (see Table 1) on ROUGE-1 and ROUGE-2, based on the bootstrap test (Efron and Tibshirani, 1994). Our baseline is stat. significantly better (p < 0.001) in all ROUGE metrics w.r.t. the github scores (R-1: 38.82, R-2: 16.81, R-L: 35.71, M: 18.14) of See et al. (2017).

(slides by Ramakanth Pasunuru) [Pasunuru & Bansal, NAACL 2018]


Thank You

Machine Translation

Useful for tons of companies, online traffic, and our international communication!

Statistical Machine Translation

Current statistical machine translation systems
• Source language f, e.g., French
• Target language e, e.g., English
• Probabilistic formulation (using Bayes rule)
We want the best target (English) translation given the source (French) input sentence, hence the probabilistic formulation is:
  ê = argmax_e p(e|f)

Using Bayes rule, we get the following (since p(f) in the denominator is independent of the argmax over e):
  ê = argmax_e p(e|f) = argmax_e p(f|e) p(e) / p(f) = argmax_e p(f|e) p(e)
• Translation model p(f|e), trained on a parallel corpus
• Language model p(e), trained on an English-only corpus (lots of it, for free!)

French → Translation Model p(f|e) → Pieces of English → Language Model p(e) → Decoder (argmax_e p(f|e) p(e)) → Proper English
[Richard Socher CS224d]



Statistical Machine Translation
The first part is known as the 'Translation Model' p(f|e) and is trained on parallel corpora of {f, e} sentence pairs, e.g., from EuroParl or Canadian parliament proceedings in multiple languages.

The second part, p(e), is the 'Language Model' and can be trained on tons more monolingual data, which is much easier to find!
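For intuition, here is a tiny count-based bigram language model p(e) estimated from monolingual English text only; the three-sentence corpus and the smoothing constants are made up purely for illustration.

```python
# A toy add-alpha-smoothed bigram language model p(e), trained from raw
# monolingual English text; corpus and constants are invented examples.

from collections import Counter
import math

corpus = ["i am so hungry", "i am happy", "the program has been implemented"]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(toks[:-1])
    bigrams.update(zip(toks[:-1], toks[1:]))

def log_p(sentence, alpha=0.1, vocab_size=1000):
    # Smoothed bigram log-probability of an English sentence.
    toks = ["<s>"] + sentence.split() + ["</s>"]
    lp = 0.0
    for prev, cur in zip(toks[:-1], toks[1:]):
        lp += math.log((bigrams[(prev, cur)] + alpha) /
                       (unigrams[prev] + alpha * vocab_size))
    return lp

# Fluent word order scores higher than scrambled order under p(e).
print(log_p("i am so hungry"), log_p("hungry so am i"))
```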



Statistical MT
Pioneered at IBM in the early 1990s. Let's make a probabilistic model of translation P(e | f).
Suppose f is 'de rien':
  P(you're welcome | de rien) = 0.45
  P(nothing | de rien) = 0.13
  P(piddling | de rien) = 0.01
  P(underpants | de rien) = 0.000000001
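Using the translation-model numbers from this slide (and made-up language-model numbers, since the slide does not give p(e)), a few lines of Python show the noisy-channel choice argmax_e P(f|e) P(e):

```python
import math

# Translation-model probabilities from the slide; LM values below are invented.
TM = {("de rien", "you're welcome"): 0.45,
      ("de rien", "nothing"): 0.13,
      ("de rien", "piddling"): 0.01,
      ("de rien", "underpants"): 1e-9}
LM = {"you're welcome": 0.02, "nothing": 0.05,
      "piddling": 1e-4, "underpants": 5e-4}

def best_translation(f, candidates):
    # Noisy channel: pick e maximizing log P(f|e) + log P(e).
    return max(candidates,
               key=lambda e: math.log(TM.get((f, e), 1e-12)) + math.log(LM[e]))

print(best_translation("de rien", list(LM)))  # -> "you're welcome"
```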

Statistical Solution
• Parallel Texts
  – Rosetta Stone (Hieroglyphs, Demotic, Greek)
  – Instruction Manuals
  – Hong Kong/Macao Legislation
  – Canadian Parliament Hansards
  – United Nations Reports
  – Official Journal of the European Communities
  – Translated news

A Division of Labor
Spanish/English Bilingual Text → Statistical Analysis → Translation Model P(f|e), 'fidelity'
  (Hmm, every time one sees 'banco', the translation is 'bank' or 'bench' … If it's 'banco de…', it always becomes 'bank', never 'bench'…)
English Text → Statistical Analysis → Language Model P(e), 'fluency'
Spanish → Broken English → English:
  'Que hambre tengo yo' → ('What hunger have I', 'Hungry I am so', 'I am so hungry', 'Have I that hunger', …) → Decoding algorithm argmax_e P(f|e) P(e) → 'I am so hungry'

Statistical Machine Translation
Step 1: Alignment
Goal: know which words or phrases in the source language would translate to what words or phrases in the target language? Hard already!
The first step in traditional machine translation is to find alignments, or translational matchings, between the two sentences, i.e., predict which words/phrases in French align to which words/phrases in English. We can factor the translation model P(f|e) by identifying alignments (correspondences) between words in f and words in e. This is a challenging problem: e.g., some words may not have any alignments (zero-fertility words, not translated), and some target words may be spurious.

Examples: 'Le programme a été mis en application' / 'And the program has been implemented'; 'Le Japon secoué par deux nouveaux séismes' / 'Japan shaken by two new quakes' (illustrating zero-fertility words, spurious words, and a one-to-many alignment).

Alignment examples from Chris Manning/CS224n




Statistical Machine Translation

Step 1: Alignment (harder)
One word in the source sentence might align to several words in the target sentence:

Example: 'Le programme a été mis en application' / 'And the program has been implemented' (a one-to-many alignment).



Statistical Machine Translation

Step 1: Alignment (harder): Many words in the source sentence might align to a single word in the target sentence. Really hard :/

Examples: 'Les pauvres sont démunis' / 'The poor don't have any money' (many-to-one alignment); 'Le reste appartenait aux autochtones' / 'The balance was the territory of the aboriginal people' (many-to-many / phrase alignment).


Alignment as a vector / IBM Model 1 generative story

Alignment as a vector:
• Used in all IBM models
• a is a vector of length J
• Maps indexes j to indexes i
• Each aj ∈ {0, 1, …, I}
• aj = 0 means fj is spurious
• No one-to-many alignments; no many-to-many alignments
• But provides the foundation for phrase-based alignment
Example: 'Mary did not slap the green witch' / 'Maria no daba una botefada a la bruja verde', with alignment vector a = (1, 3, 4, 4, 4, 0, 5, 7, 6); 'Le programme a été mis en application' / 'And the program has been implemented', with aj = 2 3 4 5 6 6 6. We want to learn how to do this.

IBM Model 1 generative story: Want P(f|e). Given English sentence e1, e2, …, eI:
• Choose length J for the French sentence
• For each j in 1 to J:
  – Choose aj uniformly from 0, 1, …, I
  – Choose fj by translating e_aj
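A toy Python version of this generative story and the resulting joint probability P(f, a | e); the translation table and the constant standing in for P(J | I) are invented for illustration.

```python
import random

# Made-up translation table t(f | e): for each English word (plus NULL for
# spurious words), a distribution over French words.
t = {
    "the":     {"le": 0.6, "la": 0.4},
    "program": {"programme": 0.9, "programmes": 0.1},
    "NULL":    {"de": 0.5, "a": 0.5},
}

def generate(e_words, J, rng=random.Random(0)):
    # Model 1 generative story: choose a_j uniformly in {0, ..., I} (0 = NULL),
    # then choose f_j by translating e_{a_j} according to t.
    I = len(e_words)
    f, a = [], []
    for _ in range(J):
        i = rng.randint(0, I)
        e_i = "NULL" if i == 0 else e_words[i - 1]
        words, probs = zip(*t[e_i].items())
        f.append(rng.choices(words, weights=probs)[0])
        a.append(i)
    return f, a

def p_f_a_given_e(f, a, e_words, eps=1.0):
    # P(f, a | e) = eps / (I+1)^J * prod_j t(f_j | e_{a_j})  (uniform alignments)
    I, J = len(e_words), len(f)
    p = eps / (I + 1) ** J
    for f_j, i in zip(f, a):
        e_i = "NULL" if i == 0 else e_words[i - 1]
        p *= t[e_i].get(f_j, 1e-6)
    return p

f, a = generate(["the", "program"], J=3)
print(f, a, p_f_a_given_e(f, a, ["the", "program"]))
```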

IBM Model 1 parameters / Applying Model 1*

P(f, a | e) can be used as a translation model or an alignment model

As a translation model or as an alignment model, e.g.: 'Le programme a été mis en application' / 'And the program has been implemented', with aj = 2 3 4 5 6 6 6.
* Actually, any P(f, a | e), e.g., any IBM model.
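Used as an alignment model, Model 1 is especially convenient: because alignment positions are independent and uniform a priori, the best alignment decomposes into an independent argmax per French position. A minimal sketch with a tiny made-up table t(f | e):

```python
def best_alignment(f_words, e_words, t):
    # argmax_a P(f, a | e) under Model 1: independent argmax per position j.
    a = []
    for f_j in f_words:
        candidates = [("NULL", 0)] + [(e, i + 1) for i, e in enumerate(e_words)]
        _, best_i = max(candidates, key=lambda c: t.get(c[0], {}).get(f_j, 0.0))
        a.append(best_i)  # a_j = 0 means f_j is treated as spurious (NULL)
    return a

# Invented translation probabilities, just to exercise the function.
t = {"the": {"le": 0.6, "la": 0.4}, "program": {"programme": 0.9}, "NULL": {"de": 0.5}}
print(best_alignment(["le", "programme", "de"], ["the", "program"], t))  # -> [1, 2, 0]
```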


Step 1: Alignment (hardest): And finally, many words in the source sentence might align to many words in the target sentence:

Example: 'Le reste appartenait aux autochtones' / 'The balance was the territory of the aboriginal people' (many-to-many / phrase alignment).



Step 1: Alignment
• We could spend an entire lecture on alignment models
• Not only single words, but could use phrases, syntax
• Then consider reordering of translated phrases

Statistical Machine Translation: Translation Process
After learning the word and phrase alignments, the model also needs to figure out the reordering, which is especially important in language pairs with very different orders!
Task: translate this sentence from German into English:

er geht ja nicht nach hause

he does not go home

• Pick a phrase in the input, translate it. (Example from Philipp Koehn)


Statistical Machine Translation: After many steps

After many steps, you get the large 'phrase table'. Each phrase in the source language can have many possible translations in the target language, and hence the search space can be combinatorially large! Translation Options:

er geht ja nicht nach hause

(Figure: translation options per source word/phrase, e.g., 'er' → he / it, 'geht' → is / are / goes / go, 'ja' → yes / of course, 'nicht' → not / do not / does not / is not, 'nach' → after / to / according to, 'hause' → house / home / chamber, plus many multi-word phrase options.)

Many translation options to choose from:
• In the Europarl phrase table, 2727 matching phrase pairs for this sentence
• By pruning to the top 20 per phrase, 202 translation options remain
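The pruning step above is easy to picture in code: keep only the top-k scoring options per source phrase before decoding. The toy phrase-table entries and scores below are invented.

```python
def prune(phrase_table, k=20):
    # Keep only the top-k translation options per source phrase.
    return {src: sorted(opts, key=lambda o: o[1], reverse=True)[:k]
            for src, opts in phrase_table.items()}

phrase_table = {
    "er":    [("he", 0.45), ("it", 0.30), (", it", 0.05)],
    "geht":  [("goes", 0.35), ("is", 0.25), ("go", 0.10)],
    "nicht": [("not", 0.50), ("does not", 0.25), ("is not", 0.15)],
}
print(prune(phrase_table, k=2))
```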

Statistical Machine Translation

Decode: search for the best of many hypotheses, a hard search problem that also includes the language model.
Finally, you decode this hard search problem to find the best translation, e.g., using beam search over the combinatorially many paths through this phrase table (and also include the language model p(e) to rerank).
Decoding: Find Best Path

er geht ja nicht nach hause
(Figure: decoding search lattice; partial hypotheses such as 'he', 'it', 'yes', 'goes', 'does not go', 'home', 'are', 'to' are expanded and scored; backtrack from the highest-scoring complete hypothesis.)
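A compact sketch of a (heavily simplified, monotone) phrase-based beam-search decoder over such a phrase table: hypotheses are extended left to right, scored with translation-model log-probabilities plus a stand-in language-model score, and pruned to a fixed beam. A real decoder would also handle reordering, coverage vectors, and future-cost estimates; the toy phrase table and the uniform LM below are invented.

```python
import math

def decode(src_words, phrase_table, lm_logprob, beam_size=5, max_phrase_len=3):
    # beams[k]: best hypotheses (score, output_words) covering the first k
    # source words; monotone extension only, pruned to `beam_size` per k.
    beams = {0: [(0.0, [])]}
    n = len(src_words)
    for covered in range(n):
        for score, out in list(beams.get(covered, [])):
            for plen in range(1, max_phrase_len + 1):
                if covered + plen > n:
                    break
                src = " ".join(src_words[covered:covered + plen])
                for tgt, tm_prob in phrase_table.get(src, []):
                    new_out = out + tgt.split()
                    # language-model score only for the newly appended words
                    lm = sum(lm_logprob(new_out[i - 1] if i else "<s>", new_out[i])
                             for i in range(len(out), len(new_out)))
                    cand = (score + math.log(tm_prob) + lm, new_out)
                    beams.setdefault(covered + plen, []).append(cand)
        for k in beams:
            beams[k] = sorted(beams[k], reverse=True)[:beam_size]
    _, best_out = max(beams.get(n, [(float("-inf"), ["<failed>"])]))
    return " ".join(best_out)

phrase_table = {                      # invented scores
    "er": [("he", 0.6), ("it", 0.3)],
    "geht": [("goes", 0.4), ("is", 0.3)],
    "ja": [("yes", 0.2)],
    "nicht": [("not", 0.5)],
    "ja nicht": [("does not", 0.3)],
    "geht ja nicht": [("does not go", 0.2)],
    "nach": [("to", 0.3)],
    "hause": [("house", 0.3)],
    "nach hause": [("home", 0.6)],
}
uniform_lm = lambda prev, word: -1.0  # stand-in for log p(e); a real LM reranks here
print(decode("er geht ja nicht nach hause".split(), phrase_table, uniform_lm))
# -> "he does not go home"
```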

Word Alignment: Alignment Model Details

IBM Model 1

IBM Model 1 (Brown et al., 1993)
Alignments: a hidden vector called an alignment specifies which English source word is responsible for each French target word. The first, simplest IBM model treats the alignment probabilities as (roughly) uniform: each aj is equally likely to point to any English position (or NULL), i.e., P(aj = i) = 1/(I+1).

[Brown et al., 1993]

IBM Model 2 (Distortion)
The next, more advanced model captures the notion of 'distortion', i.e., how far from the diagonal the alignment is. Alignments tend to the diagonal (broadly, at least).

Other schemes for biasing alignments towards the diagonal: relative vs. absolute alignment, asymmetric distances, and learning a full multinomial over distances. [Brown et al., 1993]
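One simple way to encode the "alignments tend to the diagonal" bias is a parametric distortion distribution that decays with distance from the diagonal; note this is only a hypothetical stand-in, since IBM Model 2 itself learns a full multinomial over positions.

```python
import math

def distortion(i, j, I, J, lam=4.0):
    # q(a_j = i | j, I, J): probability mass concentrates where i/I is close
    # to j/J, i.e., near the diagonal. Parametric toy, not Model 2's table.
    score = lambda k: math.exp(-lam * abs(k / I - j / J))
    return score(i) / sum(score(k) for k in range(1, I + 1))

print(round(distortion(2, 2, 5, 5), 2), round(distortion(5, 2, 5, 5), 2))
# the near-diagonal position gets much more mass than the far one
```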

IBM Models 1/2: EM Training

Model parameters: translation probabilities (Models 1 and 2); distortion parameters (Model 2 only).
• Start with uniform parameters, including P(fj | null)
• For each sentence in the training corpus:
  – For each French position j:
    · Calculate the posterior over English positions i, e.g., for Model 1: P(aj = i | f, e) = t(fj | ei) / Σ_i' t(fj | ei')  (or just use the best single alignment)
    · Increment the count of word fj with word ei by these amounts
    · Similarly, re-estimate the distortion probabilities for Model 2
• Iterate until convergence
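A compact, runnable sketch of the recipe above for Model 1 (Model 2 would additionally accumulate and re-estimate distortion counts); the three-sentence toy corpus is made up.

```python
# EM training of IBM Model 1 translation probabilities on a toy parallel corpus:
# start uniform (including t(f | NULL)), collect expected counts from the
# per-position posteriors, re-estimate, and iterate.

from collections import defaultdict

corpus = [("the program".split(), "le programme".split()),
          ("the house".split(), "la maison".split()),
          ("the green house".split(), "la maison verte".split())]

f_vocab = {f for _, fs in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform init of t(f | e)

for _ in range(20):                            # EM iterations
    counts = defaultdict(float)
    totals = defaultdict(float)
    for e_words, f_words in corpus:
        e_all = ["NULL"] + e_words
        for f in f_words:                      # each French position j
            z = sum(t[(f, e)] for e in e_all)  # normalizer for the posterior
            for e in e_all:
                p = t[(f, e)] / z              # P(a_j = i | f, e)
                counts[(f, e)] += p            # expected count of (f, e)
                totals[e] += p
    for (f, e), c in counts.items():           # M-step: re-estimate t(f | e)
        t[(f, e)] = c / totals[e]

print(t[("le", "the")], t[("maison", "house")])
```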

[Brown et al., 1993]

The HMM Model

Emissions: word translation probabilities; Transitions: a distribution over jumps between successive alignment positions.

[Vogel et al., 1996]

IBM Models 3/4/5 (Fertility)

Mary did not slap the green witch

n(3|slap):  Mary not slap slap slap the green witch
P(NULL):    Mary not slap slap slap NULL the green witch
t(la|the):  Mary no daba una botefada a la verde bruja
d(j|i):     Mary no daba una botefada a la bruja verde
[from Al-Onaizan and Knight, 1998]
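A toy re-enactment of this generative story in Python; the fertility, translation, and reordering choices are hard-coded to reproduce exactly this one example and are not learned tables.

```python
fertility = {"Mary": 1, "did": 0, "not": 1, "slap": 3, "the": 1, "green": 1, "witch": 1}
translations = {"Mary": ["Mary"], "not": ["no"], "slap": ["daba", "una", "botefada"],
                "the": ["la"], "green": ["verde"], "witch": ["bruja"], "NULL": ["a"]}

def generate(english):
    # 1) Fertility n(phi|e): copy each English word phi times (phi = 0 drops it).
    tokens = [w for w in english for _ in range(fertility[w])]
    # 2) NULL insertion P(NULL): a spurious Spanish word generated from NULL
    #    (position hard-coded for this example).
    tokens = tokens[:5] + ["NULL"] + tokens[5:]
    # 3) Translation t(f|e): each copy is translated independently.
    pools = {w: list(ws) for w, ws in translations.items()}
    spanish = [pools[w].pop(0) for w in tokens]
    # 4) Distortion d(j|i): reorder target positions; here a hard-coded swap of
    #    the last two words to get the noun-adjective order.
    spanish[-1], spanish[-2] = spanish[-2], spanish[-1]
    return " ".join(spanish)

print(generate("Mary did not slap the green witch".split()))
# -> "Mary no daba una botefada a la bruja verde"
```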

Examples: Translation and Fertility

Syntactic Machine Translation

Hiero Rules

Synchronous Tree-Substitution Grammars

[Shieber, 2004; Graehl et al., 2008]

Joint Parsing and Alignment

(Figure: a Chinese-English sentence pair ('… were established in such places as Quanzhou, Zhangzhou, etc.') with parses, word alignments, and bispans b4, b7, b8; the inset shows sample synchronization features, e.g., CoarseSourceTarget⟨phrasal, phrasal⟩:1, FineSourceTarget⟨NP, NP⟩:1, CoarseSourceAlign⟨pos⟩:1, FineSourceAlign⟨NN⟩:1.)

Figure 2: An example of a Chinese-English sentence pair with parses, word alignments, and a subset of the full optimal ITG derivation, including one totally unsynchronized bispan (b4), one partially synchronized bispan (b7), and one fully synchronized bispan (b8). The inset provides some examples of active synchronization features (see Section 4.3) on these bispans. On this example, the monolingual English parser erroneously attached the lower PP to the VP headed by 'established', and the non-syntactic ITG word aligner misaligned 等 to 'such' instead of to 'etc.'. Our joint model corrected both of these mistakes because it was rewarded for the synchronization of the two NPs joined by b8. [Burkett et al., 2011]

We cannot efficiently compute the model expectations in this equation exactly. Therefore we turn next to an approximate inference method.

6 Mean Field Inference
Instead of computing the model expectations from (4), we compute the expectations for each sentence pair with respect to a simpler, fully factored distribution Q(t, a, t') = q(t) q(a) q(t'). Rewriting Q in log-linear form, we have:

Q(t, a, t') ∝ exp( Σ_{n ∈ t} δ_n + Σ_{b ∈ a} δ_b + Σ_{n' ∈ t'} δ_{n'} )

Here, the δ_n, δ_b and δ_{n'} are variational parameters which we set to best approximate our weakly synchronized model from (3):

δ* = argmin_δ KL( Q_δ || P_θ(t, a, t' | s, s') )

Once we have found Q, we compute an approximate gradient by replacing the model expectations with expectations under Q:

E_{Q(t, a, t' | s_i, s'_i)} [ φ(t_i, a, t'_i, s_i, s'_i) ]

Now, we will briefly describe how we compute Q. First, note that the parameters of Q factor along individual source nodes, target nodes, and bispans. The combination of the KL objective and our particular factored form of Q makes our inference procedure a structured mean field algorithm (Saul and Jordan, 1996). Structured mean field techniques are well studied in graphical models, and our adaptation in this section to multiple grammars follows standard techniques (see e.g. Wainwright and Jordan, 2008). Rather than derive the mean field updates, we describe the algorithm (shown in Figure 3) procedurally. Similar to block Gibbs sampling, we iteratively optimize each component (source parse, target parse, and alignment) of the model in turn, conditioned on the others. Where block Gibbs sampling conditions on fixed trees or ITG derivations, our mean field algorithm maintains uncertainty in …

Neural Machine Translation (next week)