Determinantal Beam Search

Clara Meister   Martina Forster   Ryan Cotterell
ETH Zürich   University of Cambridge
[email protected]   [email protected]   [email protected]

Abstract

Beam search is a go-to strategy for decoding neural sequence models. The algorithm can naturally be viewed as a subset optimization problem, albeit one where the corresponding set function does not reflect interactions between candidates. Empirically, this leads to sets often exhibiting high overlap, e.g., strings may differ by only a single word. Yet in use-cases that call for multiple solutions, a diverse or representative set is often desired. To address this issue, we propose a reformulation of beam search, which we call determinantal beam search. Determinantal beam search has a natural relationship to determinantal point processes (DPPs), models over sets that inherently encode intra-set interactions. By posing iterations in beam search as a series of subdeterminant maximization problems, we can turn the algorithm into a diverse subset selection process. In a case study, we use the string subsequence kernel to explicitly encourage n-gram coverage in text generated from a sequence model. We observe that our algorithm offers competitive performance against other diverse set generation strategies in the context of language generation, while providing a more general approach to optimizing for diversity.

1 Introduction

The decoding of neural sequence models is a fundamental component of many tasks in NLP. Yet, many proposed decoding methods aim to produce only a single solution; further, decoding strategies that provide a set, such as beam search, admit high overlap between solutions. Such approaches fail to reflect that for many NLP tasks,1 there can be multiple correct solutions—or that we may desire a diverse set of solutions. As it stands, standard beam search chooses items based purely on individual scores, with no means for encoding interaction between candidates; this is the limitation which we attempt to address in this work.

1 As concrete examples, in machine translation there almost always exist multiple ways to translate a sentence; in story generation, we often seek creative language or multiple options to choose from.

We derive determinantal beam search, a novel generalization of beam search that casts subset selection as a subdeterminant optimization problem. Specifically, we formulate each iteration of beam search as a subdeterminant maximization problem parameterized by a positive semi-definite matrix that encodes interactions between the possible candidates; standard beam search is recovered by a specific diagonal matrix. This framing creates a natural paradigm for taking into account the relationships between candidates during the decoding process, and can thus assign higher scores to diversified sets; we show how this approach relates to k-determinantal point processes (k-DPPs). Given the wealth of research on efficient kernel computation (Rousu and Shawe-Taylor, 2005; Farhan et al., 2017) and DPP inference strategies (Li et al., 2016; Han et al., 2017; Chen et al., 2018), we find the impact on runtime to be quite reasonable in comparison to standard decoding techniques.

In a case study on neural machine translation (NMT), we demonstrate how to make use of the string subsequence kernel (Lodhi et al., 2002) to encode the notion of n-gram diversity in the language generation process, allowing us to derive an elegant diverse beam search. Under this scheme, we observe that determinantal beam search generates more diverse sets than standard beam search with minimal trade-off in terms of BLEU. We see improved performance over stochastic beam search (SBS; Kool et al., 2019), which is reported to encourage diversity, and a slight improvement over Vijayakumar et al. (2018)'s diverse beam search (DBS), while providing a more general approach to optimizing for intra-set diversity.


2 Neural Sequence Models

Neural sequence models are probability distributions p(y | x) over sequences y in an output space Y conditioned on an input x.2 Here we define Y as the set of all valid sequences derived from a vocabulary V that are bookended by distinguished BOS and EOS tokens, indicating the beginning and end of the sequence, respectively. Typically, the sequence length is upper-bounded by some value n_max ∈ Z+, which may depend on x. In this work, we consider locally normalized models, i.e., where p is a probability distribution over V̄ := V ∪ {EOS} conditioned on previously generated tokens y_{<t}.

2 x may be, e.g., a source sentence or an image.
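Local normalization implies the usual chain-rule factorization of the sequence probability. The display below states this generic form; it is our restatement of what the definition above implies, not a reproduction of the paper's numbered equation:

\[
p(\mathbf{y} \mid \mathbf{x}) \;=\; \prod_{t=1}^{|\mathbf{y}|} p(y_t \mid \mathbf{y}_{<t}, \mathbf{x}),
\qquad y_t \in \bar{V}.
\]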

However, as has been noted in the literature, there are a number of issues with both Eq. (2) and (4), i.e., with the single-sequence MAP objective and its set-level generalization. First, as Y may be exponentially large (in |V|) and p is typically non-Markovian, we cannot efficiently search over Y, much less over Y^k. Second, specifically for language generation tasks, these might not be useful objectives.3

3 A number of NLP tasks only take the highest-scoring element of the returned set Y while other tasks utilize the entire set of solutions.

Degenerate Objective. It is important to note that the highest-probability solutions under neural sequence models are not always high-quality; specifically for tasks involving language generation, e.g., machine translation, prior work has shown the tendency of MAP decoding to lead to generic or degenerate solutions (Stahlberg and Byrne, 2019; Meister et al., 2020; Eikema and Aziz, 2020), while superior solutions assigned only slightly lower probability are often overlooked.

2.1 Beam Search

Beam search is the standard heuristic for approximating the decoding objectives above. At each time step t, it selects the set of k candidates with the highest cumulative score:

    Y_0 ← {BOS}
    Y_t ← argmax_{Y'_t ⊆ B_t, |Y'_t| = k}  Σ_{y_{≤t} ∈ Y'_t} log p(y_{≤t} | x)    (5)

where the candidate set B_t is constructed by extending each sequence on the beam with every symbol in V̄:

    B_t = { y_{<t} ◦ y  |  y_{<t} ∈ Y_{t−1},  y ∈ V̄ }    (6)

where ◦ is used to indicate string concatenation. Note that candidates in Y_{t−1} already ending in EOS are simply added directly to B_t, i.e., EOS ◦ EOS = EOS. Under this definition, we have the cardinality constraint |B_t| ≤ |V̄| · k.
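As an illustration of this candidate-set expansion, the sketch below builds B_t from the previous beam by concatenating every unfinished candidate with every symbol of the extended vocabulary, treating EOS as absorbing. Function and variable names are ours, not from the paper's implementation:

```python
def expand_candidates(prev_beam, vocab, eos="<eos>"):
    """Build B_t from Y_{t-1}: extend each candidate with every symbol in
    the extended vocabulary; candidates already ending in EOS are carried
    over unchanged (EOS o EOS = EOS), so |B_t| <= |V-bar| * k."""
    candidates = []
    for seq in prev_beam:
        if seq and seq[-1] == eos:          # finished hypothesis: keep as-is
            candidates.append(seq)
        else:
            for tok in list(vocab) + [eos]:
                candidates.append(seq + [tok])
    return candidates

# Example: a beam of size 2 over a toy vocabulary of 3 words (plus EOS).
beam = [["<bos>", "a"], ["<bos>", "the", "<eos>"]]
print(len(expand_candidates(beam, vocab=["a", "the", "cat"])))  # 4 + 1 = 5
```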

2.2 A Determinantal Reformulation

We now introduce an alternative, equivalent notation for Eq. (5) using matrices and determinants that will shed light on the straightforward generalization of beam search that we present as the primary contribution of this paper. We define a timestep-dependent4 diagonal matrix D ∈ R^{|B_t|×|B_t|}, where we take the diagonal entry to be

    D_ii := p(y_{≤t}^{(i)} | x)    (7)

Here y_{≤t}^{(i)} is the ith candidate in B_t according to a unique mapping of every element y_{≤t} ∈ B_t to an integer between 1 and |B_t|. Furthermore, we use the notation D_{Y_t}, where Y_t ⊆ B_t, to indicate the submatrix that only contains those rows and columns corresponding to the elements of Y_t. We may now rewrite Eq. (5) as

Determinantal Standard Beam Search
    Y_0 ← {BOS}
    Y_t ← argmax_{Y'_t ⊆ B_t, |Y'_t| = k} log det(D_{Y'_t})    (8)

where equivalence follows from the definition of the determinant for diagonal matrices. Formally, Eq. (8) is known as the subdeterminant maximization problem5 (Klee et al., 1995; Ebrahimi et al., 2017), which—as the name suggests—refers to the problem of finding the determinant-maximizing subset of a matrix. While the notation introduced in Eq. (8) may seem contrived, it allows us to perform the subsequent generalization.

4 We have omitted the time-step dependence of D for notational brevity, as it is always clear from context.
5 Albeit with a cardinality constraint.

3 Determinantal Beam Search

We are now in a position to ask the fundamental question of this work: What happens if we replace the diagonal matrix D with a non-diagonal matrix? This substitution allows us to account for interactions between the elements in the beam. Formally, we consider a timestep-dependent positive semi-definite (PSD) matrix D + w · K, where the off-diagonal entries of K indicate the strength of the interactions between candidates. The non-negative weight w ≥ 0 controls the importance of these interactions during the decoding process. In this case, the beam search recursion becomes:

Full Determinantal Beam Search
    Y_0 ← {BOS}
    Y_t ← argmax_{Y'_t ⊆ B_t, |Y'_t| = k} log det(D_{Y'_t} + w · K_{Y'_t})    (9)

Clearly, we recover beam search when w = 0; however, we can now select subsets based additionally on candidate interactions. That is, Eq. (9) now has an interpretation as a diversity objective function (Indyk et al., 2014) when K is chosen wisely. Due to the presence of the log, Eq. (9) is only well defined when the matrix D_Y + w · K_Y is PSD.6

6 To see this, recall that the determinant is the product of the eigenvalues, which are non-negative for PSD matrices, so the determinant of a PSD matrix is non-negative. Note that in the case where any of the eigenvalues of a submatrix are zero, we take log det(·) = −∞.
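To make the w = 0 equivalence concrete, here is a small numerical check (our own illustration, with made-up numbers, not code from the paper): the log-determinant of a diagonal submatrix recovers the usual sum of log-probabilities, while a nonzero w penalizes near-duplicate pairs.

```python
import numpy as np

# Toy beam of 4 candidates and their sequence probabilities.
p = np.array([0.30, 0.25, 0.20, 0.10])
D = np.diag(p)

# Similarity matrix: candidates 0 and 1 are near-duplicates.
K = np.array([[1.0, 0.9, 0.1, 0.0],
              [0.9, 1.0, 0.1, 0.0],
              [0.1, 0.1, 1.0, 0.2],
              [0.0, 0.0, 0.2, 1.0]])

Y = [0, 1]                      # a candidate subset of size k = 2
w = 0.0
score = np.linalg.slogdet(D[np.ix_(Y, Y)] + w * K[np.ix_(Y, Y)])[1]
assert np.isclose(score, np.log(p[Y]).sum())   # Eq. (8) == standard beam search score

w = 0.5                         # with interactions, the near-duplicate pair ...
dup = np.linalg.slogdet(D[np.ix_([0, 1], [0, 1])] + w * K[np.ix_([0, 1], [0, 1])])[1]
div = np.linalg.slogdet(D[np.ix_([0, 2], [0, 2])] + w * K[np.ix_([0, 2], [0, 2])])[1]
print(dup < div)                # ... scores lower than a more diverse pair: True
```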
3.1 Constructing K

One simple way to construct K is as a Gram matrix, where each i, j element of K is computed via a kernel function K : S × S → R that maps two items in a space S to a real number. Specifically, we define K_ij = K(s_i, s_j), where s_i, s_j ∈ S are the ith and jth elements of S, respectively. In slight abuse of notation, we overload the kernel function K to take a set S such that K = K(S) is the kernel matrix resulting from pairwise computation over elements of S.7 Following from Mercer's theorem, the matrix K = K(S) is necessarily PSD and, thus, the matrix D_Y + w · K_Y is PSD for any Y ⊆ S.8

The efficient computation of kernel functions is a well-studied problem—largely due to the prevalence of kernels in various machine learning techniques. For example, trie-based techniques are often employed in the computation of K(S) (Rousu and Shawe-Taylor, 2005), or approximate low-rank kernel matrices can be used in place of K(S) (Si et al., 2017).

7 In the machine learning literature, the term "kernel" is often used to refer to both the function K and the kernel matrix K.
8 To see this, note that the matrix D is necessarily PSD. Since PSD matrices are closed under addition and multiplication by a positive scalar, D + w · K is necessarily PSD. Lastly, any principal submatrix of a PSD matrix is also PSD, which makes D_Y + w · K_Y a PSD matrix.
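A minimal sketch of this Gram-matrix construction follows, using a stand-in similarity function (Jaccard overlap of token sets) rather than the string subsequence kernel of §4; all names are illustrative, not from the paper's code.

```python
import numpy as np

def jaccard_kernel(a, b):
    """Stand-in kernel: Jaccard similarity of the candidates' token sets.
    Any symmetric, PSD-inducing similarity could be plugged in here."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / max(len(sa | sb), 1)

def gram_matrix(candidates, kernel):
    """K_ij = kernel(s_i, s_j) for all candidate pairs (the overloaded K(S))."""
    n = len(candidates)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = kernel(candidates[i], candidates[j])
    return K

cands = [["the", "cat", "sat"], ["the", "cat", "sat", "down"], ["a", "dog", "ran"]]
K = gram_matrix(cands, jaccard_kernel)
print(np.all(np.linalg.eigvalsh(K) >= -1e-9))  # symmetric and PSD for this toy set
```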

3.2 Relation to DPPs

One interpretation of Eq. (9) is as a determinantal point process (DPP). Specifically, it is a k-DPP (Kulesza and Taskar, 2011) in the L-ensemble parameterization, where we have L = D + w · K. This interpretation as a k-DPP gives us a very clear understanding of why Eq. (9) yields a diverse beam search. The diagonal entries encode quality, which tells how "good" each candidate on the beam is, while the off-diagonal entries encode how similar two elements are and, thus, how much they should be repulsed. For an overview of DPPs we refer the reader to Kulesza and Taskar (2012).

3.3 Computing Log-Determinants

Unfortunately, computing the argmax9 in Eq. (9) is an NP-hard problem (Ko et al., 1995). However, as the subdeterminant maximization problem has many applications, there has been much research on efficient algorithms for approximating log-determinants in the context of, e.g., determinantal point processes (Gillenwater et al., 2012; Han et al., 2017).10 One such algorithm uses a first-order approximation of the log-determinant function (Han et al., 2017). The work of Chen et al. (2018) uses a greedy, iterative approach; by updating the Cholesky factorization of the kernel matrix incrementally, the algorithm reduces inference time to O(k^2 |S|) to return k candidates from a set S. Pseudocode for this approach can be found in Chen et al. (2018); pseudocode for the algorithm in log-space—since probabilistic models are often worked with in log-space for numerical stability—can be found in App. A.

9 We may also sample from the k-DPP modeled by Eq. (9) rather than taking the approximate mode; this would only require changing the inference algorithm and can be done in a similarly efficient manner (Li et al., 2016). We focus on deterministic methods in this work as we aim to find the objective-maximizing set.
10 As beam search is already a heuristic approach, such an approximation does not have any theoretical implications for the results of our algorithm.

3.4 Runtime Analysis

We consider the runtime of selecting k candidates at any given time step in the recursion of Eq. (9). At each time step, we must first construct the matrix K. This computation is highly dependent on the set interactions being modeled; as such, let O(c(k)) be a runtime bound for K's computation when our search uses a beam size of k. Once we have constructed our matrix D + w · K, we must next select k items. The set of hypotheses at any time step is at most k|V̄|. While, as discussed in §3.3, finding the size-k subset that exactly maximizes Eq. (9) has exponential runtime, we assume approximate methods are employed. Using the method given by Chen et al. (2018), approximate MAP inference takes O(k^3 |V̄|) time to return k items from a set of size k|V̄|. Thus, the runtime at each iteration of determinantal beam search under these conditions would be O(c(k) + k^3 |V̄|). Note that standard beam search runs in O(k|V̄| log(k|V̄|)) time at each iteration. As k is generally small (≤ 20) and the impact of c(k) can be made reasonable (§3.1), the practical increase in runtime is typically only moderate.
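The following is a naive greedy sketch of the approximate subset selection discussed in §3.3. It recomputes a log-determinant per candidate rather than maintaining the incremental Cholesky factorization of Chen et al. (2018), so it is slower but easier to read; all names are ours.

```python
import numpy as np

def greedy_logdet_subset(D, K, w, k):
    """Greedily build a subset Y maximizing log det(D_Y + w * K_Y).
    Naive reference implementation of approximate MAP inference, roughly
    O(k * |S| * k^3); Chen et al. (2018) achieve O(k^2 |S|) with
    incremental Cholesky updates."""
    L = D + w * K
    selected, remaining = [], list(range(L.shape[0]))
    for _ in range(k):
        best_i, best_score = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            score = logdet if sign > 0 else -np.inf
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
        remaining.remove(best_i)
    return selected

# Usage: Y_t = greedy_logdet_subset(np.diag(p), K, w=0.5, k=2)
```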
4 Case Study: Diverse Beam Search

We now consider the task of language generation, where our vocabulary V̄ is a set of words and Y is the set of all valid strings derived from V̄. When the space of our kernel function is S = B_t, one simple way of modeling interactions is through a string subsequence kernel (Lodhi et al., 2002).

4.1 Computing the String Kernel

The string subsequence kernel, proposed by Lodhi et al. (2002), is a function over two strings s and t computed as:

    K(s, t) = Σ_{u ∈ V^n} Σ_{i : u = s[i]} λ^{l(i)} Σ_{j : u = t[j]} λ^{l(j)}    (10)

where V^n is the set of all finite strings of length n over the alphabet V; i (or j) denotes a vector of indices i = (i_1, ..., i_{|u|}) with 1 ≤ i_1 < · · · < i_{|u|} ≤ |s|; l(i) := i_{|u|} − i_1 + 1 is the length of the substring u in s; and λ ∈ (0, 1] is a decay factor which serves as a penalty for gaps within a compared subsequence.

Direct computation of Eq. (10) is exponential in |V|, but efficient dynamic programs can be utilized: in this work, we employ the trie-based methods of Rousu and Shawe-Taylor (2005) to compute Eq. (10). Under this scheme, the computation of the kernel between two strings s and t is O(n · M · log(max(|s|, |t|))), where n is the chosen subsequence length (a hyperparameter) and M is the number of words that strings s and t have in common. Note that |s|, and thus M, are bounded by the time step t. Further, we can reuse many of the computations between subsequent decoding rounds due to the iterative nature of both beam search and the subsequence kernel computations. Additionally, since the magnitude of Eq. (10) is influenced by the lengths of s and t, we normalize the kernel as follows:

    K_norm(s, t) = K(s, t) / sqrt(K(s, s) · K(t, t))    (11)
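For readability, here is a direct dynamic-programming implementation of Eq. (10) and the normalization of Eq. (11), following the standard recursions of Lodhi et al. (2002) rather than the trie-based method of Rousu and Shawe-Taylor (2005) used in our experiments. It is an O(n · |s| · |t|^2) reference sketch, not the efficient version; names are ours.

```python
def subseq_kernel(s, t, n, lam):
    """Gap-weighted subsequence kernel K_n(s, t) of Eq. (10) for token
    sequences s and t, subsequence length n and decay factor lam."""
    S, T = len(s), len(t)
    # Kp[i][p][q] = K'_i(s[:p], t[:q]) in Lodhi et al.'s notation.
    Kp = [[[1.0 if i == 0 else 0.0 for _ in range(T + 1)]
           for _ in range(S + 1)] for i in range(n)]
    for i in range(1, n):
        for p in range(1, S + 1):
            for q in range(1, T + 1):
                gap = lam * Kp[i][p - 1][q]
                match = sum(Kp[i - 1][p - 1][j - 1] * lam ** (q - j + 2)
                            for j in range(1, q + 1) if t[j - 1] == s[p - 1])
                Kp[i][p][q] = gap + match
    val = 0.0
    for p in range(1, S + 1):
        val += lam ** 2 * sum(Kp[n - 1][p - 1][j - 1]
                              for j in range(1, T + 1) if t[j - 1] == s[p - 1])
    return val

def subseq_kernel_normalized(s, t, n=2, lam=0.3):
    """Length-normalized kernel of Eq. (11)."""
    denom = (subseq_kernel(s, s, n, lam) * subseq_kernel(t, t, n, lam)) ** 0.5
    return subseq_kernel(s, t, n, lam) / denom if denom > 0 else 0.0

print(subseq_kernel_normalized("science is organized knowledge".split(),
                               "wisdom is organized life".split()))
```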

4.2 Integration into DetBS

The string subsequence kernel gives us a straightforward method for decoding diverse sets of strings from language generators. We construct the matrix D + w · K at each decoding step, taking K to be the Gram matrix of the normalized string subsequence kernel over the current candidate set B_t.

5 Experiments

Stochastic Beam Search. Kool et al. (2019) propose stochastic beam search (SBS), a decoding technique that samples without replacement from sequence models according to their distribution over the entire space Y. For random sampling methods such as SBS, it is customary to use a sampling temperature T > 0 at generation time to control for the peakiness of the sampling distribution. This results in the generalized softmax p_T(y | y_{<t}, x) ∝ p(y | y_{<t}, x)^{1/T}.
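As a quick numerical check of this temperature relation (our own illustration): raising a locally normalized distribution to the power 1/T and renormalizing coincides with dividing the underlying logits by T before the softmax.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.2, -1.0])
p = softmax(logits)
T = 0.6

p_T_from_logits = softmax(logits / T)                       # divide logits by T
p_T_from_probs = p ** (1.0 / T) / (p ** (1.0 / T)).sum()    # renormalize p^(1/T)
print(np.allclose(p_T_from_logits, p_T_from_probs))         # True
```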

Figure 1: Averaged n-gram diversity vs. minimum, median, and maximum BLEU score for beam sizes k = 5, 10, 20 on WMT'14 En–Fr and WMT'19 De–En newstest using various decoding strategies. The free parameter for each strategy is either the softmax temperature or the weight of the diversity parameter (see §5.2).

We run experiments on the WMT'14 (Bojar et al., 2014) En–Fr and WMT'19 (Barrault et al., 2019) De–En datasets; for reproducibility, we use the pretrained models made available by fairseq12 (Ott et al., 2019). We evaluate on the newstest set from the respective datasets, each containing 3003 sentences. Further details can be found in App. B.

For determinantal beam search (DetBS), we perform a hyperparameter search (precise details likewise in App. B) over λ and n, the decay factor and subsequence length, respectively. Search is performed for fixed w = 0.1 and k = 10 on validation sets for both languages; we omit a search over the entire space of w, k, λ, n so as to not create an unfair advantage for DetBS in comparison with the other decoding strategies, for which no hyperparameters are tuned. We use subsequence length n = 2 and λ ∈ {0.1, 0.3} for De–En and En–Fr, respectively.

We decode sets of size k ∈ {5, 10, 20} with each strategy, comparing sentence-level BLEU and n-gram coverage d_n averaged across n ∈ {1, 2, 3, 4} in the decoded sets, where we define d_n as

    d_n = (# of unique n-grams in k strings) / (# of n-grams in k strings)    (14)

12 https://github.com/pytorch/fairseq/tree/master/examples/translation
13 For each decoding strategy, we choose the diversity parameter corresponding to the most diverse set that had median BLEU 28.5 ± 0.05.

While d_n has a more natural interpretation as coverage of different n-grams, the above quantity is often referred to as n-gram diversity in the literature, and so we transition to this term for consistency. Following the experimental setup of Kool et al. (2019), we vary sampling temperature T ∈ {0.1, 0.2, ..., 0.8} in the case of beam search and stochastic beam search and diversity weight w ∈ {0.1, 0.2, ..., 0.8} in the case of diverse beam search. For DetBS, we observe that larger sets require a smaller diversity penalty to achieve good n-gram diversity: in Fig. 1 we show results for DetBS with the string subsequence kernel for w ∈ {0.01, 0.02, ..., 0.1, 0.2, 0.3, 0.4} for k = 5, w ∈ {0.01, 0.02, ..., 0.15} for k = 10, and w ∈ {0.01, 0.02, ..., 0.05} for k = 20.14 To observe how BLEU is affected by larger diversity coefficients under DetBS, we explore a finer grain of weights for DetBS in App. C.

14 Recall w = 0 recovers standard beam search with a temperature of T = 1.

5.3 Results

Fig. 1 shows the sentence-level BLEU score and averaged n-gram diversity on the newstest set for different decoding strategies; Tab. 2 shows explicit coverage of 1, 2, 3, 4-grams and averaged across 1, 2, 3, 4-grams for different decoding strategies when BLEU is controlled for.
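A small reference implementation of the n-gram diversity d_n defined in Eq. (14) above; function names are ours, not from the paper's evaluation code.

```python
from collections import Counter

def ngram_diversity(strings, n):
    """d_n of Eq. (14): unique n-grams divided by total n-grams,
    pooled over the k decoded strings (each given as a token list)."""
    ngrams = Counter()
    for toks in strings:
        for i in range(len(toks) - n + 1):
            ngrams[tuple(toks[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

def avg_ngram_diversity(strings, ns=(1, 2, 3, 4)):
    return sum(ngram_diversity(strings, n) for n in ns) / len(ns)

hyps = [s.split() for s in ["a raffle was held", "a raffle was held at the end"]]
print(round(avg_ngram_diversity(hyps), 3))
```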

Source Sentence:
  (1) Zum Abschluss wurde eine Tombola verlost.
  (2) Die Wahrheit zu sagen ist aber kein Verbrechen.

Beam Search (T = 0.6)
  (1) • A raffle was held to close the event. • A raffle was held to conclude the event. • A raffle was held at the end. • At the end a raffle was held. • A raffle was held to close the draw.
  (2) • But telling the truth is not a crime. • Telling the truth is not a crime. • But telling the truth isn't a crime. • But telling the truth is no crime. • However, telling the truth is not a crime.

Diverse Beam Search (w = 0.4)
  (1) • To conclude, a raffle was held. • A raffle was held to close the event. • A raffle was held to close the event. • A raffle was held to close the event. • At the end of the event, a raffle was held.
  (2) • But telling the truth is not a crime. • But telling the truth is not a crime. • But telling the truth is not a crime. • But telling the truth is not a crime. • Telling the truth, however, is not a crime.

Determinantal Beam Search (w = 0.12)
  (1) • Finally, a raffle was held. • A raffle was held at the end. • At the end a raffle was held. • To conclude, a raffle was held. • A raffle was held to close the event.
  (2) • But telling the truth is not a crime. • But telling the truth isn't a crime. • Telling the truth is not a crime. • However, telling the truth is not a crime. • But to tell the truth is not a crime.

Table 1: Generated translations for the given source sentences from the WMT'19 De–En dataset using different decoding strategies. All use a beam size of five.13 We see large overlap in the beam search results, while DBS actually returns several of the same results. In comparison, DetBS returns qualitatively diverse results even for simple sentences.

The 3 lines per decoding strategy in Fig. 1 represent the minimum, median, and maximum sentence-level BLEU score out of the k translation options, averaged across the corpus. We consider median BLEU to be the best metric of set text-quality, as a good diverse decoding algorithm should not completely sacrifice BLEU for the sake of diversity. The plots are analogous to those in Kool et al. (2019).

On both datasets and across different set sizes, results indicate that DetBS generates diverse sets of strings while maintaining high median and maximum BLEU scores. We see similar or higher n-gram diversity in comparison to DBS for the same median BLEU and a notably better n-gram diversity vs. BLEU trade-off than standard beam search and SBS. Further, the highest quality translation (shown by max BLEU) does not appear to be sacrificed when the diversity parameter is increased for DetBS. In contrast, there is a notable drop-off for generation strategies in which diversity is controlled for using temperature. We show samples of generated text in Tab. 1.

6 Related Work

Our work is built upon much of the subset optimization literature in machine learning. We base our algorithm on the subdeterminant maximization problem (Agarwal et al., 2004), which has been used to find core sets—a concept originating in computational geometry concerning the existence of a small, representative set of core items—in data summarization problems (Mirzasoleiman et al., 2013), nearest neighbor search (Abbar et al., 2013) and streaming algorithms (Indyk et al., 2014), inter alia. Informally, we can connect our problem to the notion of decoding a core set from sequence models. To the best of our knowledge, our work is the first to use this concept when decoding sequence models.

Wang and Chan (2019) incorporate DPPs into a reinforcement learning objective to optimize for diverse text when training image captioning models. We optimize for diversity during decoding, rather than training, which makes our methods applicable to out-of-the-box models and allows us to avoid highly hyperparameter-sensitive techniques, like minimum-risk training or reinforcement learning-based algorithms, while achieving the same goal. While the application of our methods at training time is an interesting research direction, we foresee technical challenges corresponding to such approaches that may outweigh their benefits.

                        De–En                                        En–Fr
              BLEU  1-gram  2-gram  3-gram  4-gram  avg.    BLEU  1-gram  2-gram  3-gram  4-gram  avg.
DetBS         26.24   0.11    0.17    0.21    0.26  0.19    23.29   0.11    0.17    0.21    0.25  0.18
DBS           26.52   0.09    0.14    0.18    0.22  0.16    24.26   0.09    0.14    0.17    0.20  0.15
Beam Search   26.15   0.08    0.14    0.19    0.23  0.16    23.29   0.09    0.14    0.18    0.22  0.16
SBS           25.95   0.08    0.14    0.19    0.24  0.16    23.08   0.10    0.16    0.21    0.25  0.18

Table 2: We report the coverage (as defined in Eq. (14)) of 1, 2, 3, 4-grams and averaged across 1, 2, 3, 4-grams, as well as median BLEU, for k = 20 on the newstest dataset. For each decoding strategy, we report metrics on the generated set that has the highest (average) d_n, where we set the constraint that median BLEU for the set is still within 1 point of the highest median BLEU (across decoding strategies and diversity parameters).15

15 In the case that not all strategies had such a set, we instead bounded BLEU by the lowest of the median BLEU across decoding strategies.

As a decoding method, our work is closest to that of Vijayakumar et al. (2018), who propose a variation of beam search (described in §5.3). However, their algorithm lacks theoretical motivation and is not guaranteed to provide a non-overlapping set; the same solution may appear multiple times in the decoded set if the diversity penalty is not large enough, as shown in Tab. 1. Additionally, groups at each time step t must be processed in order, since the score of all hypotheses considered for group g + 1 depends on the hypotheses in groups 1, . . . , g, which creates a large bottleneck under the recommended setting of G = k.

Random sampling strategies for decoding neural sequence models have received much attention in recent years. While techniques such as stochastic beam search and the UniqueRandomizer (Shi et al., 2020) are convenient for creating statistical estimators and have uses in reinforcement learning techniques due to their clear probabilistic interpretation, there are no diversity guarantees for the set of generated sequences.

Tam (2020) likewise adapts beam search, proposing a k-means clustering version that clusters solutions by averaged word embeddings. As distance between averaged word embeddings lacks a clear interpretation, however, it is unclear whether the method can explicitly optimize for any tangible notion of coverage or diversity.

7 Conclusion

We propose determinantal beam search (DetBS): a new way of framing beam search that allows us to optimize set generation for diversity and coverage rather than simply individual scores. Formally, we redefine beam search as an iterative subdeterminant maximization problem where we select the approximately maximizing set according to the PSD matrix parameterizing our score function. This gives us the ability to encode the notion of intra-set diversity into the beam search optimization problem. We discuss and experiment with efficient methods for inference and kernel computation that make DetBS an efficient decoding strategy in practice. We use DetBS in the context of language generation, where we explicitly encourage n-gram coverage through the string subsequence kernel. In our NMT experiments, we find DetBS generates much more diverse sets of strings than standard beam search and stochastic beam search with a small trade-off in median BLEU. We observe competitive performance compared with diverse beam search.

Acknowledgements

We would like to thank the anonymous reviewers for their helpful feedback and recommendations.

Ethical Considerations

While language generation can be used for malicious purposes, e.g., to propagate misinformation or offensive text, we do not foresee any specific ethical concerns with the techniques in this work.

References

Sofiane Abbar, Sihem Amer-Yahia, Piotr Indyk, Sepideh Mahabadi, and Kasturi R. Varadarajan. 2013. Diverse near neighbor problem. In Proceedings of the Twenty-Ninth Annual Symposium on Computational Geometry. Association for Computing Machinery.

Pankaj K. Agarwal, Sariel Har-Peled, and Kasturi R. Varadarajan. 2004. Approximating extent measures of points. Journal of the Association for Computing Machinery, 51(4):606–635.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation. In Proceedings of the Fourth Conference on Machine Translation, pages 1–61. Association for Computational Linguistics.

Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58. Association for Computational Linguistics.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In Proceedings of the International Conference on Learning Representations.

Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, and Vahab S. Mirrokni. 2014. Composable core-sets for diversity and coverage maximization. In Proceedings of the Thirty-Third Association for Computing Machinery SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 100–108. Association for Computing Machinery.

Laming Chen, Guoxin Zhang, and Eric Zhou. 2018. Fast greedy MAP inference for determinantal point process to improve recommendation diversity. In Advances in Neural Information Processing Systems, pages 5622–5633.

J. B. Ebrahimi, D. Straszak, and N. K. Vishnoi. 2017. Subdeterminant maximization via nonconvex relaxations and anti-concentration. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 1020–1031. IEEE Computer Society.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500. Association for Computational Linguistics.

Bryan Eikema and Wilker Aziz. 2020. Is MAP decoding all you need? The inadequacy of the mode in neural machine translation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4506–4520. International Committee on Computational Linguistics.

Muhammad Farhan, Juvaria Tariq, Arif Zaman, Mudassir Shabbir, and Imdad Ullah Khan. 2017. Efficient approximation algorithms for strings kernel based sequence classification. In Advances in Neural Information Processing Systems, volume 30, pages 6935–6945. Curran Associates, Inc.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1243–1252.

Jennifer Gillenwater, Alex Kulesza, and Ben Taskar. 2012. Near-optimal MAP inference for determinantal point processes. In Advances in Neural Information Processing Systems, volume 25, pages 2735–2743. Curran Associates, Inc.

V. Klee, P. Gritzmann, and D. Larman. 1995. Largest j-simplices in n-polytopes. Discrete and Computational Geometry, 13(3-4):477–516.

Chun-Wa Ko, Jon Lee, and Maurice Queyranne. 1995. An exact algorithm for maximum entropy sampling. Operations Research, 43(4):684–691.

Wouter Kool, Herke Van Hoof, and Max Welling. 2019. Stochastic beams and where to find them: The Gumbel-top-k trick for sampling sequences without replacement. In Proceedings of the International Conference on Machine Learning, pages 3499–3508.

Alex Kulesza and Ben Taskar. 2011. k-DPPs: Fixed-size determinantal point processes. In Proceedings of the 28th International Conference on Machine Learning, pages 1193–1200.

Alex Kulesza and Ben Taskar. 2012. Determinantal Point Processes for Machine Learning. Now Publishers Inc.

Chengtao Li, Stefanie Jegelka, and Suvrit Sra. 2016. Efficient sampling for k-determinantal point processes. In Proceedings of Machine Learning Research, volume 51, pages 1328–1337.

Zhifei Li and Jason Eisner. 2009. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 40–51. Association for Computational Linguistics.

Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. 2002. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444.

Clara Meister, Ryan Cotterell, and Tim Vieira. 2020. If beam search is the answer, what was the question? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2173–2185. Association for Computational Linguistics.
Insu Han, Prabhanjan Kambadur, Kyoungsoo Park, and Jinwoo Shin. 2017. Faster greedy MAP inference for determinantal point processes. In Proceedings of Machine Learning Research, volume 70, pages 1384–1393.

Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. 2013. Distributed submodular maximization: Identifying representative elements in massive data. In Advances in Neural Information Processing Systems, volume 26, pages 2049–2057.

Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. 2019. Facebook FAIR's WMT19 news translation task submission. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 314–319. Association for Computational Linguistics.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53. Association for Computational Linguistics.

Juho Rousu and John Shawe-Taylor. 2005. Efficient computation of gapped substring kernels on large alphabets. Journal of Machine Learning Research, 6:1323–1344.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Gregory S. Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, volume 32.

Iulian Vlad Serban, Tim Klinger, Gerald Tesauro, Kartik Talamadupula, Bowen Zhou, Yoshua Bengio, and Aaron Courville. 2017. Multiresolution recurrent neural networks: An application to dialogue response generation. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 3288–3294. AAAI Press.

Kensen Shi, David Bieber, and Charles Sutton. 2020. Incremental sampling without replacement for sequence models. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 8785–8795. PMLR.

Si Si, Cho-Jui Hsieh, and Inderjit S. Dhillon. 2017. Memory efficient kernel approximation. Journal of Machine Learning Research, 18(20):1–32.

Felix Stahlberg and Bill Byrne. 2019. On NMT search errors and model errors: Cat got your tongue? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3356–3362. Association for Computational Linguistics.

Yik-Cheung Tam. 2020. Cluster-based beam search for pointer-generator chatbot grounded by knowledge. Computer Speech and Language, 64:101094.

Ashwin Vijayakumar, Michael Cogswell, Ramprasaath Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2018. Diverse beam search for improved description of complex scenes. In AAAI Conference on Artificial Intelligence, pages 7371–7379.

Qingzhong Wang and Antoni B. Chan. 2019. Towards diverse and accurate image captions via reinforcing determinantal point process. CoRR, abs/1908.04919.

A Log-Space Computations

Algorithm 1: Fast greedy MAP inference with log-space parameterization (Chen et al., 2018). We transform computations according to Li and Eisner (2009).

    Input: L: log of PSD matrix; k: desired set size
     1: function GREEDY_MAP_INFERENCE()
     2:   c_i = [], d_i = L_ii, s_i = []
     3:   j = argmax_{i ∈ S} d_i
     4:   Y_g = {j}
     5:   while |Y_g| != k :
     6:     for i ∈ S \ Y_g :
     7:       s, log_inner ← LOGSUMEXP(c_j, c_i, s_i, s_j)
     8:       FUNC ← LOG_ADD if s < 0 else LOG_MINUS
     9:       if L_ji > log_inner :
    10:         s ← 1
    11:         e_i = FUNC(L_ji, log_inner) − 0.5 · d_j
    12:       else
    13:         s ← −s
    14:         e_i = FUNC(log_inner, L_ji) − 0.5 · d_j
    15:       c_i = [c_i  e_i], d_i = d_i − 2 · e_i, s_i = [s_i  s]
    16:     j = argmax_{i ∈ S \ Y_g} d_i, Y_g = Y_g ∪ {j}
    17:   return Y_g

    18: function LOGSUMEXP(c1, c2, s1, s2)
    19:   s, log_inner ← 1, −∞
    20:   for ⟨c1, c2, s1, s2⟩ ∈ ⟨c1, c2, s1, s2⟩ :
    21:     s′ ← s1 · s2
    22:     FUNC ← LOG_ADD if s == s′ else LOG_MINUS
    23:     if log_inner > c1 + c2 :
    24:       log_inner ← FUNC(log_inner, c1 + c2)
    25:     else
    26:       s = s′
    27:       log_inner ← FUNC(c1 + c2, log_inner)
    28:   return s, log_inner
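Algorithm 1 relies on signed log-space addition and subtraction (the LOG_ADD / LOG_MINUS operations). A possible implementation of these two primitives is sketched below; it is our own illustration in the spirit of Li and Eisner (2009), not the authors' code.

```python
import math

def log_add(a, b):
    """log(exp(a) + exp(b)), computed stably; assumes a >= b."""
    if b == -math.inf:
        return a
    return a + math.log1p(math.exp(b - a))

def log_minus(a, b):
    """log(exp(a) - exp(b)) for a >= b; -inf when the difference underflows."""
    if b == -math.inf:
        return a
    diff = -math.expm1(b - a)          # 1 - exp(b - a) >= 0
    return a + math.log(diff) if diff > 0.0 else -math.inf

# e.g., combining two log-probabilities:
print(log_add(math.log(0.3), math.log(0.2)))    # ~ log(0.5)
print(log_minus(math.log(0.3), math.log(0.2)))  # ~ log(0.1)
```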

B Experimental Setup

Hyperparameters. As we use the string subsequence kernel of §4 in DetBS, there are a number of hyperparameters that can be adjusted beyond the diversity weight w: the decay factor λ indicates the degree to which interior gaps are penalized, and the subsequence length n indicates the length of the considered substrings u. For each language, we perform a search over these two hyperparameters for set size k = 10 and diversity coefficient w = 0.1 on validation sets. We use a grid search over n = [2, 3, 4, 5, 6, 7, 8] and λ = [0.1, 0.3, 0.5, 0.7, 1.0]. We choose the configuration that yields the highest (average n-gram diversity)*BLEU, using this configuration in all subsequent experiments. While there may be better performing hyperparameters under different k and w, we omit searching over the entire space to create a fairer comparison with the other decoding strategies. Interestingly, larger values of n did not improve performance, and were more computationally expensive; small values of n and decay λ appear to offer the best BLEU vs. n-gram diversity trade-off.

    Dataset         n   decay (λ)
    WMT'14 En–Fr    2   0.3
    WMT'19 De–En    2   0.1

Table 3: Best performing configurations found in the search over string subsequence kernel parameters.

Dataset and Model Statistics. We use a convolutional sequence-to-sequence model trained according to Gehring et al. (2017) on the WMT'14 En–Fr dataset.16 Data preprocessing steps, model hyperparameters and baseline performances can be found in their work. We use the pre-trained model checkpoints made available by fairseq at https://github.com/pytorch/fairseq/tree/master/examples/translation. We use a Transformer-based model trained according to Ng et al. (2019) on the WMT'19 De–En dataset.17 Likewise, data preprocessing steps, model hyperparameters and baseline performances can be found in Ng et al. (2019). We similarly use the pretrained model checkpoints made available by fairseq.

16 Available at http://statmt.org/wmt14/translation-task.html
17 Available at http://www.statmt.org/wmt19/translation-task.html

C Additional Results

Figure 2: n-gram diversity vs. minimum, median, and maximum BLEU score for beam sizes k = 5, 10, 20 on WMT'19 De–En newstest using a larger range of the diversity weight w.

Figure 3: n-gram diversity vs. minimum, median, and maximum BLEU score for beam sizes k = 5, 10, 20 on WMT'14 En–Fr newstest using a larger range of the diversity weight w.
