Determinantal Beam Search

Clara Meister   Martina Forster   Ryan Cotterell
ETH Zürich   University of Cambridge
[email protected]   [email protected]   [email protected]

Abstract

Beam search is a go-to strategy for decoding neural sequence models. The algorithm can naturally be viewed as a subset optimization problem, albeit one where the corresponding set function does not reflect interactions between candidates. Empirically, this leads to sets often exhibiting high overlap, e.g., strings may differ by only a single word. Yet in use-cases that call for multiple solutions, a diverse or representative set is often desired. To address this issue, we propose a reformulation of beam search, which we call determinantal beam search. Determinantal beam search has a natural relationship to determinantal point processes (DPPs), models over sets that inherently encode intra-set interactions. By posing iterations in beam search as a series of subdeterminant maximization problems, we can turn the algorithm into a diverse subset selection process. In a case study, we use the string subsequence kernel to explicitly encourage n-gram coverage in text generated from a sequence model. We observe that our algorithm offers competitive performance against other diverse set generation strategies in the context of language generation, while providing a more general approach to optimizing for diversity.

1 Introduction

The decoding of neural sequence models is a fundamental component of many tasks in NLP. Yet, many proposed decoding methods aim to produce only a single solution; further, decoding strategies that provide a set, such as beam search, admit high overlap between solutions. Such approaches fail to reflect that for many NLP tasks,1 there can be multiple correct solutions—or that we may desire a diverse set of solutions. As it stands, standard beam search chooses items based purely on individual scores, with no means for encoding interaction between candidates; this is the limitation which we attempt to address in this work.

1 As concrete examples, in machine translation there almost always exist multiple ways to translate a sentence; in story generation, we often seek creative language or multiple options to choose from.

We derive determinantal beam search, a novel generalization of beam search that casts subset selection as a subdeterminant optimization problem. Specifically, we formulate each iteration of beam search as a subdeterminant maximization problem parameterized by a positive semi-definite matrix that encodes interactions between the possible candidates; standard beam search is recovered by a specific diagonal matrix. This framing creates a natural paradigm for taking into account the relationships between candidates during the decoding process, and can thus assign higher scores to diversified sets; we show how this approach relates to k-determinantal point processes (k-DPPs). Given the wealth of research on efficient kernel computation (Rousu and Shawe-Taylor, 2005; Farhan et al., 2017) and DPP inference strategies (Li et al., 2016; Han et al., 2017; Chen et al., 2018), we find the impact on runtime to be quite reasonable in comparison to standard decoding techniques.

In a case study on neural machine translation (NMT), we demonstrate how to make use of the string subsequence kernel (Lodhi et al., 2002) to encode the notion of n-gram diversity in the language generation process, allowing us to derive an elegant diverse beam search. Under this scheme, we observe that determinantal beam search generates more diverse sets than standard beam search with minimal trade-off in terms of BLEU. We see improved performance over stochastic beam search (SBS; Kool et al., 2019), which is reported to encourage diversity, and a slight improvement over Vijayakumar et al. (2018)'s diverse beam search (DBS), while providing a more general approach to optimizing for intra-set diversity.


2 Neural Sequence Models

Neural sequence models are probability distributions p(y | x) over sequences y in an output space Y conditioned on an input x.2 Here we define Y as the set of all valid sequences derived from a vocabulary V that are bookended by distinguished BOS and EOS tokens, indicating the beginning and end of the sequence, respectively. Typically, the sequence length is upper-bounded by some value n_max ∈ Z+, which may depend on x. In this work, we consider locally normalized models, i.e., where p is a probability distribution over V̄ := V ∪ {EOS} conditioned on previously generated tokens y_{<t}.

2 x may be, e.g., a source sentence or an image.
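Local normalization implies the usual chain-rule factorization of the sequence probability. The display below states this generic form; it is our restatement of what the definition above implies, not a reproduction of the paper's numbered equation:

\[
p(\mathbf{y} \mid \mathbf{x}) \;=\; \prod_{t=1}^{|\mathbf{y}|} p(y_t \mid \mathbf{y}_{<t}, \mathbf{x}),
\qquad y_t \in \bar{V}.
\]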

However, as has been noted in the literature, there are a number of issues with both Eq. (2) and (4), i.e., with the single-sequence MAP objective and its set-level generalization. First, as Y may be exponentially large (in |V|) and p is typically non-Markovian, we cannot efficiently search over Y, much less over Y^k. Second, specifically for language generation tasks, these might not be useful objectives.3

3 A number of NLP tasks only take the highest-scoring element of the returned set Y while other tasks utilize the entire set of solutions.

Degenerate Objective. It is important to note that the highest-probability solutions under neural sequence models are not always high-quality; specifically for tasks involving language generation, e.g., machine translation, prior work has shown the tendency of MAP decoding to lead to generic or degenerate solutions (Stahlberg and Byrne, 2019; Meister et al., 2020; Eikema and Aziz, 2020), while superior solutions assigned only slightly lower probability are often overlooked.

2.1 Beam Search

Beam search is the standard heuristic for approximating the decoding objectives above. At each time step t, it selects the set of k candidates with the highest cumulative score:

    Y_0 ← {BOS}
    Y_t ← argmax_{Y'_t ⊆ B_t, |Y'_t| = k}  Σ_{y_{≤t} ∈ Y'_t} log p(y_{≤t} | x)    (5)

where the candidate set B_t is constructed by extending each sequence on the beam with every symbol in V̄:

    B_t = { y_{<t} ◦ y  |  y_{<t} ∈ Y_{t−1},  y ∈ V̄ }    (6)

where ◦ is used to indicate string concatenation. Note that candidates in Y_{t−1} already ending in EOS are simply added directly to B_t, i.e., EOS ◦ EOS = EOS. Under this definition, we have the cardinality constraint |B_t| ≤ |V̄| · k.
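As an illustration of this candidate-set expansion, the sketch below builds B_t from the previous beam by concatenating every unfinished candidate with every symbol of the extended vocabulary, treating EOS as absorbing. Function and variable names are ours, not from the paper's implementation:

```python
def expand_candidates(prev_beam, vocab, eos="<eos>"):
    """Build B_t from Y_{t-1}: extend each candidate with every symbol in
    the extended vocabulary; candidates already ending in EOS are carried
    over unchanged (EOS o EOS = EOS), so |B_t| <= |V-bar| * k."""
    candidates = []
    for seq in prev_beam:
        if seq and seq[-1] == eos:          # finished hypothesis: keep as-is
            candidates.append(seq)
        else:
            for tok in list(vocab) + [eos]:
                candidates.append(seq + [tok])
    return candidates

# Example: a beam of size 2 over a toy vocabulary of 3 words (plus EOS).
beam = [["<bos>", "a"], ["<bos>", "the", "<eos>"]]
print(len(expand_candidates(beam, vocab=["a", "the", "cat"])))  # 4 + 1 = 5
```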

2.2 A Determinantal Reformulation

We now introduce an alternative, equivalent notation for Eq. (5) using matrices and determinants that will shed light on the straightforward generalization of beam search that we present as the primary contribution of this paper. We define a timestep-dependent4 diagonal matrix D ∈ R^{|B_t|×|B_t|}, where we take the diagonal entry to be

    D_ii := p(y_{≤t}^{(i)} | x)    (7)

Here y_{≤t}^{(i)} is the ith candidate in B_t according to a unique mapping of every element y_{≤t} ∈ B_t to an integer between 1 and |B_t|. Furthermore, we use the notation D_{Y_t}, where Y_t ⊆ B_t, to indicate the submatrix that only contains those rows and columns corresponding to the elements of Y_t. We may now rewrite Eq. (5) as

Determinantal Standard Beam Search
    Y_0 ← {BOS}
    Y_t ← argmax_{Y'_t ⊆ B_t, |Y'_t| = k} log det(D_{Y'_t})    (8)

where equivalence follows from the definition of the determinant for diagonal matrices. Formally, Eq. (8) is known as the subdeterminant maximization problem5 (Klee et al., 1995; Ebrahimi et al., 2017), which—as the name suggests—refers to the problem of finding the determinant-maximizing subset of a matrix. While the notation introduced in Eq. (8) may seem contrived, it allows us to perform the subsequent generalization.

4 We have omitted the time-step dependence of D for notational brevity, as it is always clear from context.
5 Albeit with a cardinality constraint.

3 Determinantal Beam Search

We are now in a position to ask the fundamental question of this work: What happens if we replace the diagonal matrix D with a non-diagonal matrix? This substitution allows us to account for interactions between the elements in the beam. Formally, we consider a timestep-dependent positive semi-definite (PSD) matrix D + w · K, where the off-diagonal entries of K indicate the strength of the interactions between candidates. The non-negative weight w ≥ 0 controls the importance of these interactions during the decoding process. In this case, the beam search recursion becomes:

Full Determinantal Beam Search
    Y_0 ← {BOS}
    Y_t ← argmax_{Y'_t ⊆ B_t, |Y'_t| = k} log det(D_{Y'_t} + w · K_{Y'_t})    (9)

Clearly, we recover beam search when w = 0; however, we can now select subsets based additionally on candidate interactions. That is, Eq. (9) now has an interpretation as a diversity objective function (Indyk et al., 2014) when K is chosen wisely. Due to the presence of the log, Eq. (9) is only well defined when the matrix D_Y + w · K_Y is PSD.6

6 To see this, recall that the determinant is the product of the eigenvalues, which are non-negative for PSD matrices, so the determinant of a PSD matrix is non-negative. Note that in the case where any of the eigenvalues of a submatrix are zero, we take log det(·) = −∞.
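To make the w = 0 equivalence concrete, here is a small numerical check (our own illustration, with made-up numbers, not code from the paper): the log-determinant of a diagonal submatrix recovers the usual sum of log-probabilities, while a nonzero w penalizes near-duplicate pairs.

```python
import numpy as np

# Toy beam of 4 candidates and their sequence probabilities.
p = np.array([0.30, 0.25, 0.20, 0.10])
D = np.diag(p)

# Similarity matrix: candidates 0 and 1 are near-duplicates.
K = np.array([[1.0, 0.9, 0.1, 0.0],
              [0.9, 1.0, 0.1, 0.0],
              [0.1, 0.1, 1.0, 0.2],
              [0.0, 0.0, 0.2, 1.0]])

Y = [0, 1]                      # a candidate subset of size k = 2
w = 0.0
score = np.linalg.slogdet(D[np.ix_(Y, Y)] + w * K[np.ix_(Y, Y)])[1]
assert np.isclose(score, np.log(p[Y]).sum())   # Eq. (8) == standard beam search score

w = 0.5                         # with interactions, the near-duplicate pair ...
dup = np.linalg.slogdet(D[np.ix_([0, 1], [0, 1])] + w * K[np.ix_([0, 1], [0, 1])])[1]
div = np.linalg.slogdet(D[np.ix_([0, 2], [0, 2])] + w * K[np.ix_([0, 2], [0, 2])])[1]
print(dup < div)                # ... scores lower than a more diverse pair: True
```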
3.1 Constructing K

One simple way to construct K is as a Gram matrix, where each i, j element of K is computed via a kernel function K : S × S → R that maps two items in a space S to a real number. Specifically, we define K_ij = K(s_i, s_j), where s_i, s_j ∈ S are the ith and jth elements of S, respectively. In slight abuse of notation, we overload the kernel function K to take a set S such that K = K(S) is the kernel matrix resulting from pairwise computation over elements of S.7 Following from Mercer's theorem, the matrix K = K(S) is necessarily PSD and, thus, the matrix D_Y + w · K_Y is PSD for any Y ⊆ S.8

The efficient computation of kernel functions is a well-studied problem—largely due to the prevalence of kernels in various machine learning techniques. For example, trie-based techniques are often employed in the computation of K(S) (Rousu and Shawe-Taylor, 2005), or approximate low-rank kernel matrices can be used in place of K(S) (Si et al., 2017).

7 In the machine learning literature, the term "kernel" is often used to refer to both the function K and the kernel matrix K.
8 To see this, note that the matrix D is necessarily PSD. Since PSD matrices are closed under addition and multiplication by a positive scalar, D + w · K is necessarily PSD. Lastly, any principal submatrix of a PSD matrix is also PSD, which makes D_Y + w · K_Y a PSD matrix.
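A minimal sketch of this Gram-matrix construction follows, using a stand-in similarity function (Jaccard overlap of token sets) rather than the string subsequence kernel of §4; all names are illustrative, not from the paper's code.

```python
import numpy as np

def jaccard_kernel(a, b):
    """Stand-in kernel: Jaccard similarity of the candidates' token sets.
    Any symmetric, PSD-inducing similarity could be plugged in here."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / max(len(sa | sb), 1)

def gram_matrix(candidates, kernel):
    """K_ij = kernel(s_i, s_j) for all candidate pairs (the overloaded K(S))."""
    n = len(candidates)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = kernel(candidates[i], candidates[j])
    return K

cands = [["the", "cat", "sat"], ["the", "cat", "sat", "down"], ["a", "dog", "ran"]]
K = gram_matrix(cands, jaccard_kernel)
print(np.all(np.linalg.eigvalsh(K) >= -1e-9))  # symmetric and PSD for this toy set
```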

3.2 Relation to DPPs

One interpretation of Eq. (9) is as a determinantal point process (DPP). Specifically, it is a k-DPP (Kulesza and Taskar, 2011) in the L-ensemble parameterization, where we have L = D + w · K. This interpretation as a k-DPP gives us a very clear understanding of why Eq. (9) yields a diverse beam search. The diagonal entries encode quality, which tells how "good" each candidate on the beam is, while the off-diagonal entries encode how similar two elements are and, thus, how much they should be repulsed. For an overview of DPPs we refer the reader to Kulesza and Taskar (2012).

3.3 Computing Log-Determinants

Unfortunately, computing the argmax9 in Eq. (9) is an NP-hard problem (Ko et al., 1995). However, as the subdeterminant maximization problem has many applications, there has been much research on efficient algorithms for approximating log-determinants in the context of, e.g., determinantal point processes (Gillenwater et al., 2012; Han et al., 2017).10 One such algorithm uses a first-order approximation of the log-determinant function (Han et al., 2017). The work of Chen et al. (2018) uses a greedy, iterative approach; by updating the Cholesky factorization of the kernel matrix incrementally, the algorithm reduces inference time to O(k^2 |S|) to return k candidates from a set S. Pseudocode for this approach can be found in Chen et al. (2018); pseudocode for the algorithm in log-space—since probabilistic models are often worked with in log-space for numerical stability—can be found in App. A.

9 We may also sample from the k-DPP modeled by Eq. (9) rather than taking the approximate mode; this would only require changing the inference algorithm and can be done in a similarly efficient manner (Li et al., 2016). We focus on deterministic methods in this work as we aim to find the objective-maximizing set.
10 As beam search is already a heuristic approach, such an approximation does not have any theoretical implications for the results of our algorithm.

3.4 Runtime Analysis

We consider the runtime of selecting k candidates at any given time step in the recursion of Eq. (9). At each time step, we must first construct the matrix K. This computation is highly dependent on the set interactions being modeled; as such, let O(c(k)) be a runtime bound for K's computation when our search uses a beam size of k. Once we have constructed our matrix D + w · K, we must next select k items. The set of hypotheses at any time step is at most k|V̄|. While, as discussed in §3.3, finding the size-k subset that exactly maximizes Eq. (9) has exponential runtime, we assume approximate methods are employed. Using the method given by Chen et al. (2018), approximate MAP inference takes O(k^3 |V̄|) time to return k items from a set of size k|V̄|. Thus, the runtime at each iteration of determinantal beam search under these conditions would be O(c(k) + k^3 |V̄|). Note that standard beam search runs in O(k|V̄| log(k|V̄|)) time at each iteration. As k is generally small (≤ 20) and the impact of c(k) can be made reasonable (§3.1), the practical increase in runtime is typically only moderate.
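The following is a naive greedy sketch of the approximate subset selection discussed in §3.3. It recomputes a log-determinant per candidate rather than maintaining the incremental Cholesky factorization of Chen et al. (2018), so it is slower but easier to read; all names are ours.

```python
import numpy as np

def greedy_logdet_subset(D, K, w, k):
    """Greedily build a subset Y maximizing log det(D_Y + w * K_Y).
    Naive reference implementation of approximate MAP inference, roughly
    O(k * |S| * k^3); Chen et al. (2018) achieve O(k^2 |S|) with
    incremental Cholesky updates."""
    L = D + w * K
    selected, remaining = [], list(range(L.shape[0]))
    for _ in range(k):
        best_i, best_score = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            score = logdet if sign > 0 else -np.inf
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
        remaining.remove(best_i)
    return selected

# Usage: Y_t = greedy_logdet_subset(np.diag(p), K, w=0.5, k=2)
```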
4 Case Study: Diverse Beam Search

We now consider the task of language generation, where our vocabulary V̄ is a set of words and Y is the set of all valid strings derived from V̄. When the space of our kernel function is S = B_t, one simple way of modeling interactions is through a string subsequence kernel (Lodhi et al., 2002).

4.1 Computing the String Kernel

The string subsequence kernel, proposed by Lodhi et al. (2002), is a function over two strings s and t computed as:

    K(s, t) = Σ_{u ∈ V^n} Σ_{i : u = s[i]} λ^{l(i)} Σ_{j : u = t[j]} λ^{l(j)}    (10)

where V^n is the set of all finite strings of length n over the alphabet V; i (or j) denotes a vector of indices i = (i_1, ..., i_{|u|}) with 1 ≤ i_1 < · · · < i_{|u|} ≤ |s|; l(i) := i_{|u|} − i_1 + 1 is the length of the substring u in s; and λ ∈ (0, 1] is a decay factor which serves as a penalty for gaps within a compared subsequence.

Direct computation of Eq. (10) is exponential in |V|, but efficient dynamic programs can be utilized: in this work, we employ the trie-based methods of Rousu and Shawe-Taylor (2005) to compute Eq. (10). Under this scheme, the computation of the kernel between two strings s and t is O(n · M · log(max(|s|, |t|))), where n is the chosen subsequence length (a hyperparameter) and M is the number of words that strings s and t have in common. Note that |s|, and thus M, are bounded by the time step t. Further, we can reuse many of the computations between subsequent decoding rounds due to the iterative nature of both beam search and the subsequence kernel computations. Additionally, since the magnitude of Eq. (10) is influenced by the lengths of s and t, we normalize the kernel as follows:

    K_norm(s, t) = K(s, t) / sqrt(K(s, s) · K(t, t))    (11)
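For readability, here is a direct dynamic-programming implementation of Eq. (10) and the normalization of Eq. (11), following the standard recursions of Lodhi et al. (2002) rather than the trie-based method of Rousu and Shawe-Taylor (2005) used in our experiments. It is an O(n · |s| · |t|^2) reference sketch, not the efficient version; names are ours.

```python
def subseq_kernel(s, t, n, lam):
    """Gap-weighted subsequence kernel K_n(s, t) of Eq. (10) for token
    sequences s and t, subsequence length n and decay factor lam."""
    S, T = len(s), len(t)
    # Kp[i][p][q] = K'_i(s[:p], t[:q]) in Lodhi et al.'s notation.
    Kp = [[[1.0 if i == 0 else 0.0 for _ in range(T + 1)]
           for _ in range(S + 1)] for i in range(n)]
    for i in range(1, n):
        for p in range(1, S + 1):
            for q in range(1, T + 1):
                gap = lam * Kp[i][p - 1][q]
                match = sum(Kp[i - 1][p - 1][j - 1] * lam ** (q - j + 2)
                            for j in range(1, q + 1) if t[j - 1] == s[p - 1])
                Kp[i][p][q] = gap + match
    val = 0.0
    for p in range(1, S + 1):
        val += lam ** 2 * sum(Kp[n - 1][p - 1][j - 1]
                              for j in range(1, T + 1) if t[j - 1] == s[p - 1])
    return val

def subseq_kernel_normalized(s, t, n=2, lam=0.3):
    """Length-normalized kernel of Eq. (11)."""
    denom = (subseq_kernel(s, s, n, lam) * subseq_kernel(t, t, n, lam)) ** 0.5
    return subseq_kernel(s, t, n, lam) / denom if denom > 0 else 0.0

print(subseq_kernel_normalized("science is organized knowledge".split(),
                               "wisdom is organized life".split()))
```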

4.2 Integration into DetBS

The string subsequence kernel gives us a straightforward method for decoding diverse sets of strings from language generators. We construct the matrix D + w · K at each decoding step, taking K to be the Gram matrix of the normalized string subsequence kernel over the current candidate set B_t.

5 Experiments

Stochastic Beam Search. Kool et al. (2019) propose stochastic beam search (SBS), a decoding technique that samples without replacement from sequence models according to their distribution over the entire space Y. For random sampling methods such as SBS, it is customary to use a sampling temperature T > 0 at generation time to control for the peakiness of the sampling distribution. This results in the generalized softmax p_T(y | y_{<t}, x) ∝ p(y | y_{<t}, x)^{1/T}.
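As a quick numerical check of this temperature relation (our own illustration): raising a locally normalized distribution to the power 1/T and renormalizing coincides with dividing the underlying logits by T before the softmax.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.2, -1.0])
p = softmax(logits)
T = 0.6

p_T_from_logits = softmax(logits / T)                       # divide logits by T
p_T_from_probs = p ** (1.0 / T) / (p ** (1.0 / T)).sum()    # renormalize p^(1/T)
print(np.allclose(p_T_from_logits, p_T_from_probs))         # True
```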

Figure 1: Averaged n-gram diversity vs. minimum, median, and maximum BLEU score for beam sizes k = 5, 10, 20 on WMT'14 En–Fr and WMT'19 De–En newstest using various decoding strategies. The free parameter for each strategy is either the softmax temperature or the weight of the diversity parameter (see §5.2).

We run experiments on the WMT'14 (Bojar et al., 2014) En–Fr and WMT'19 (Barrault et al., 2019) De–En datasets; for reproducibility, we use the pretrained models made available by fairseq12 (Ott et al., 2019). We evaluate on the newstest set from the respective datasets, each containing 3003 sentences. Further details can be found in App. B.

For determinantal beam search (DetBS), we perform a hyperparameter search (precise details likewise in App. B) over λ and n, the decay factor and subsequence length, respectively. Search is performed for fixed w = 0.1 and k = 10 on validation sets for both languages; we omit a search over the entire space of w, k, λ, n so as to not create an unfair advantage for DetBS in comparison with the other decoding strategies, for which no hyperparameters are tuned. We use subsequence length n = 2 and λ ∈ {0.1, 0.3} for De–En and En–Fr, respectively.

We decode sets of size k ∈ {5, 10, 20} with each strategy, comparing sentence-level BLEU and n-gram coverage d_n averaged across n ∈ {1, 2, 3, 4} in the decoded sets, where we define d_n as

    d_n = (# of unique n-grams in k strings) / (# of n-grams in k strings)    (14)

12 https://github.com/pytorch/fairseq/tree/master/examples/translation
13 For each decoding strategy, we choose the diversity parameter corresponding to the most diverse set that had median BLEU 28.5 ± 0.05.

While d_n has a more natural interpretation as coverage of different n-grams, the above quantity is often referred to as n-gram diversity in the literature, and so we transition to this term for consistency. Following the experimental setup of Kool et al. (2019), we vary sampling temperature T ∈ {0.1, 0.2, ..., 0.8} in the case of beam search and stochastic beam search and diversity weight w ∈ {0.1, 0.2, ..., 0.8} in the case of diverse beam search. For DetBS, we observe that larger sets require a smaller diversity penalty to achieve good n-gram diversity: in Fig. 1 we show results for DetBS with the string subsequence kernel for w ∈ {0.01, 0.02, ..., 0.1, 0.2, 0.3, 0.4} for k = 5, w ∈ {0.01, 0.02, ..., 0.15} for k = 10, and w ∈ {0.01, 0.02, ..., 0.05} for k = 20.14 To observe how BLEU is affected by larger diversity coefficients under DetBS, we explore a finer grain of weights for DetBS in App. C.

14 Recall w = 0 recovers standard beam search with a temperature of T = 1.

5.3 Results

Fig. 1 shows the sentence-level BLEU score and averaged n-gram diversity on the newstest set for different decoding strategies; Tab. 2 shows explicit coverage of 1, 2, 3, 4-grams and averaged across 1, 2, 3, 4-grams for different decoding strategies when BLEU is controlled for.
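A small reference implementation of the n-gram diversity d_n defined in Eq. (14) above; function names are ours, not from the paper's evaluation code.

```python
from collections import Counter

def ngram_diversity(strings, n):
    """d_n of Eq. (14): unique n-grams divided by total n-grams,
    pooled over the k decoded strings (each given as a token list)."""
    ngrams = Counter()
    for toks in strings:
        for i in range(len(toks) - n + 1):
            ngrams[tuple(toks[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0

def avg_ngram_diversity(strings, ns=(1, 2, 3, 4)):
    return sum(ngram_diversity(strings, n) for n in ns) / len(ns)

hyps = [s.split() for s in ["a raffle was held", "a raffle was held at the end"]]
print(round(avg_ngram_diversity(hyps), 3))
```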

Source Sentence:
  (1) Zum Abschluss wurde eine Tombola verlost.
  (2) Die Wahrheit zu sagen ist aber kein Verbrechen.

Beam Search (T = 0.6)
  (1) • A raffle was held to close the event. • A raffle was held to conclude the event. • A raffle was held at the end. • At the end a raffle was held. • A raffle was held to close the draw.
  (2) • But telling the truth is not a crime. • Telling the truth is not a crime. • But telling the truth isn't a crime. • But telling the truth is no crime. • However, telling the truth is not a crime.

Diverse Beam Search (w = 0.4)
  (1) • To conclude, a raffle was held. • A raffle was held to close the event. • A raffle was held to close the event. • A raffle was held to close the event. • At the end of the event, a raffle was held.
  (2) • But telling the truth is not a crime. • But telling the truth is not a crime. • But telling the truth is not a crime. • But telling the truth is not a crime. • Telling the truth, however, is not a crime.

Determinantal Beam Search (w = 0.12)
  (1) • Finally, a raffle was held. • A raffle was held at the end. • At the end a raffle was held. • To conclude, a raffle was held. • A raffle was held to close the event.
  (2) • But telling the truth is not a crime. • But telling the truth isn't a crime. • Telling the truth is not a crime. • However, telling the truth is not a crime. • But to tell the truth is not a crime.

Table 1: Generated translations for the given source sentences from the WMT'19 De–En dataset using different decoding strategies. All use a beam size of five.13 We see large overlap in the beam search results, while DBS actually returns several of the same results. In comparison, DetBS returns qualitatively diverse results even for simple sentences.

The 3 lines per decoding strategy in Fig. 1 represent the minimum, median, and maximum sentence-level BLEU score out of the k translation options, averaged across the corpus. We consider median BLEU to be the best metric of set text-quality, as a good diverse decoding algorithm should not completely sacrifice BLEU for the sake of diversity. The plots are analogous to those in Kool et al. (2019).

On both datasets and across different set sizes, results indicate that DetBS generates diverse sets of strings while maintaining high median and maximum BLEU scores. We see similar or higher n-gram diversity in comparison to DBS for the same median BLEU and a notably better n-gram diversity vs. BLEU trade-off than standard beam search and SBS. Further, the highest quality translation (shown by max BLEU) does not appear to be sacrificed when the diversity parameter is increased for DetBS. In contrast, there is a notable drop-off for generation strategies in which diversity is controlled for using temperature. We show samples of generated text in Tab. 1.

6 Related Work

Our work is built upon much of the subset optimization literature in machine learning. We base our algorithm on the subdeterminant maximization problem (Agarwal et al., 2004), which has been used to find core sets—a concept originating in computational geometry concerning the existence of a small, representative set of core items—in data summarization problems (Mirzasoleiman et al., 2013), nearest neighbor search (Abbar et al., 2013) and streaming algorithms (Indyk et al., 2014), inter alia. Informally, we can connect our problem to the notion of decoding a core set from sequence models. To the best of our knowledge, our work is the first to use this concept when decoding sequence models.

Wang and Chan (2019) incorporate DPPs into a reinforcement learning objective to optimize for diverse text when training image captioning models. We optimize for diversity during decoding, rather than training, which makes our methods applicable to out-of-the-box models and allows us to avoid highly hyperparameter-sensitive techniques, like minimum-risk training or reinforcement learning-based algorithms, while achieving the same goal. While the application of our methods at training time is an interesting research direction, we foresee technical challenges corresponding to such approaches that may outweigh their benefits.

                        De–En                                        En–Fr
              BLEU  1-gram  2-gram  3-gram  4-gram  avg.    BLEU  1-gram  2-gram  3-gram  4-gram  avg.
DetBS         26.24   0.11    0.17    0.21    0.26  0.19    23.29   0.11    0.17    0.21    0.25  0.18
DBS           26.52   0.09    0.14    0.18    0.22  0.16    24.26   0.09    0.14    0.17    0.20  0.15
Beam Search   26.15   0.08    0.14    0.19    0.23  0.16    23.29   0.09    0.14    0.18    0.22  0.16
SBS           25.95   0.08    0.14    0.19    0.24  0.16    23.08   0.10    0.16    0.21    0.25  0.18

Table 2: We report the coverage (as defined in Eq. (14)) of 1, 2, 3, 4-grams and averaged across 1, 2, 3, 4-grams, as well as median BLEU, for k = 20 on the newstest dataset. For each decoding strategy, we report metrics on the generated set that has the highest (average) d_n, where we set the constraint that median BLEU for the set is still within 1 point of the highest median BLEU (across decoding strategies and diversity parameters).15

15 In the case that not all strategies had such a set, we instead bounded BLEU by the lowest of the median BLEU across decoding strategies.

As a decoding method, our work is closest to that of Vijayakumar et al. (2018), who propose a variation of beam search (described in §5.3). However, their algorithm lacks theoretical motivation and is not guaranteed to provide a non-overlapping set; the same solution may appear multiple times in the decoded set if the diversity penalty is not large enough, as shown in Tab. 1. Additionally, groups at each time step t must be processed in order, since the score of all hypotheses considered for group g + 1 depends on the hypotheses in groups 1, . . . , g, which creates a large bottleneck under the recommended setting of G = k.

Random sampling strategies for decoding neural sequence models have received much attention in recent years. While techniques such as stochastic beam search and the UniqueRandomizer (Shi et al., 2020) are convenient for creating statistical estimators and have uses in reinforcement learning techniques due to their clear probabilistic interpretation, there are no diversity guarantees for the set of generated sequences.

Tam (2020) likewise adapts beam search, proposing a k-means clustering version that clusters solutions by averaged word embeddings. As distance between averaged word embeddings lacks a clear interpretation, however, it is unclear whether the method can explicitly optimize for any tangible notion of coverage or diversity.

7 Conclusion

We propose determinantal beam search (DetBS): a new way of framing beam search that allows us to optimize set generation for diversity and coverage rather than simply individual scores. Formally, we redefine beam search as an iterative subdeterminant maximization problem where we select the approximately maximizing set according to the PSD matrix parameterizing our score function. This gives us the ability to encode the notion of intra-set diversity into the beam search optimization problem. We discuss and experiment with efficient methods for inference and kernel computation that make DetBS an efficient decoding strategy in practice. We use DetBS in the context of language generation, where we explicitly encourage n-gram coverage through the string subsequence kernel. In our NMT experiments, we find DetBS generates much more diverse sets of strings than standard beam search and stochastic beam search with a small trade-off in median BLEU. We observe competitive performance compared with diverse beam search.

Acknowledgements

We would like to thank the anonymous reviewers for their helpful feedback and recommendations.

Ethical Considerations

While language generation can be used for malicious purposes, e.g., to propagate misinformation or offensive text, we do not foresee any specific ethical concerns with the techniques in this work.

References

Sofiane Abbar, Sihem Amer-Yahia, Piotr Indyk, Sepideh Mahabadi, and Kasturi R. Varadarajan. 2013. Diverse near neighbor problem. In Proceedings of the Twenty-Ninth Annual Symposium on Computational Geometry. Association for Computing Machinery.

Pankaj K. Agarwal, Sariel Har-Peled, and Kasturi R. Varadarajan. 2004. Approximating extent measures of points. Journal of the Association for Computing Machinery, 51(4):606–635.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation. In Proceedings of the Fourth Conference on Machine Translation, pages 1–61. Association for Computational Linguistics.

Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58. Association for Computational Linguistics.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In Proceedings of the International Conference on Learning Representations.

Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, and Vahab S. Mirrokni. 2014. Composable core-sets for diversity and coverage maximization. In Proceedings of the Thirty-Third Association for Computing Machinery SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 100–108. Association for Computing Machinery.

Laming Chen, Guoxin Zhang, and Eric Zhou. 2018. Fast greedy MAP inference for determinantal point process to improve recommendation diversity. In Advances in Neural Information Processing Systems, pages 5622–5633.

J. B. Ebrahimi, D. Straszak, and N. K. Vishnoi. 2017. Subdeterminant maximization via nonconvex relaxations and anti-concentration. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pages 1020–1031. IEEE Computer Society.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500. Association for Computational Linguistics.

Bryan Eikema and Wilker Aziz. 2020. Is MAP decoding all you need? The inadequacy of the mode in neural machine translation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4506–4520. International Committee on Computational Linguistics.

Muhammad Farhan, Juvaria Tariq, Arif Zaman, Mudassir Shabbir, and Imdad Ullah Khan. 2017. Efficient approximation algorithms for strings kernel based sequence classification. In Advances in Neural Information Processing Systems, volume 30, pages 6935–6945. Curran Associates, Inc.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1243–1252.

Jennifer Gillenwater, Alex Kulesza, and Ben Taskar. 2012. Near-optimal MAP inference for determinantal point processes. In Advances in Neural Information Processing Systems, volume 25, pages 2735–2743. Curran Associates, Inc.

V. Klee, P. Gritzmann, and D. Larman. 1995. Largest j-simplices in n-polytopes. Discrete and Computational Geometry, 13(3-4):477–516.

Chun-Wa Ko, Jon Lee, and Maurice Queyranne. 1995. An exact algorithm for maximum entropy sampling. Operations Research, 43(4):684–691.

Wouter Kool, Herke Van Hoof, and Max Welling. 2019. Stochastic beams and where to find them: The Gumbel-top-k trick for sampling sequences without replacement. In Proceedings of the International Conference on Machine Learning, pages 3499–3508.

Alex Kulesza and Ben Taskar. 2011. k-DPPs: Fixed-size determinantal point processes. In Proceedings of the 28th International Conference on Machine Learning, pages 1193–1200.

Alex Kulesza and Ben Taskar. 2012. Determinantal Point Processes for Machine Learning. Now Publishers Inc.

Chengtao Li, Stefanie Jegelka, and Suvrit Sra. 2016. Efficient sampling for k-determinantal point processes. In Proceedings of Machine Learning Research, volume 51, pages 1328–1337.

Zhifei Li and Jason Eisner. 2009. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 40–51. Association for Computational Linguistics.

Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. 2002. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444.

Clara Meister, Ryan Cotterell, and Tim Vieira. 2020. If beam search is the answer, what was the question? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2173–2185. Association for Computational Linguistics.
Insu Han, Prabhanjan Kambadur, Kyoungsoo Park, and Jinwoo Shin. 2017. Faster greedy MAP inference for determinantal point processes. In Proceedings of Machine Learning Research, volume 70, pages 1384–1393.

Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. 2013. Distributed submodular maximization: Identifying representative elements in massive data. In Advances in Neural Information Processing Systems, volume 26, pages 2049–2057.

Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. 2019. Facebook FAIR's WMT19 news translation task submission. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 314–319. Association for Computational Linguistics.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53. Association for Computational Linguistics.

Juho Rousu and John Shawe-Taylor. 2005. Efficient computation of gapped substring kernels on large alphabets. Journal of Machine Learning Research, 6:1323–1344.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Gregory S. Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, volume 32.

Iulian Vlad Serban, Tim Klinger, Gerald Tesauro, Kartik Talamadupula, Bowen Zhou, Yoshua Bengio, and Aaron Courville. 2017. Multiresolution recurrent neural networks: An application to dialogue response generation. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 3288–3294. AAAI Press.

Kensen Shi, David Bieber, and Charles Sutton. 2020. Incremental sampling without replacement for sequence models. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 8785–8795. PMLR.

Si Si, Cho-Jui Hsieh, and Inderjit S. Dhillon. 2017. Memory efficient kernel approximation. Journal of Machine Learning Research, 18(20):1–32.

Felix Stahlberg and Bill Byrne. 2019. On NMT search errors and model errors: Cat got your tongue? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3356–3362. Association for Computational Linguistics.

Yik-Cheung Tam. 2020. Cluster-based beam search for pointer-generator chatbot grounded by knowledge. Computer Speech and Language, 64:101094.

Ashwin Vijayakumar, Michael Cogswell, Ramprasaath Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2018. Diverse beam search for improved description of complex scenes. In AAAI Conference on Artificial Intelligence, pages 7371–7379.

Qingzhong Wang and Antoni B. Chan. 2019. Towards diverse and accurate image captions via reinforcing determinantal point process. CoRR, abs/1908.04919.

A Log-Space Computations

Algorithm 1: Fast greedy MAP inference with log-space parameterization (Chen et al., 2018). We transform computations according to Li and Eisner (2009).

    Input: L: log of PSD matrix; k: desired set size
     1: function GREEDY_MAP_INFERENCE()
     2:   c_i = [], d_i = L_ii, s_i = []
     3:   j = argmax_{i ∈ S} d_i
     4:   Y_g = {j}
     5:   while |Y_g| != k :
     6:     for i ∈ S \ Y_g :
     7:       s, log_inner ← LOGSUMEXP(c_j, c_i, s_i, s_j)
     8:       FUNC ← LOG_ADD if s < 0 else LOG_MINUS
     9:       if L_ji > log_inner :
    10:         s ← 1
    11:         e_i = FUNC(L_ji, log_inner) − 0.5 · d_j
    12:       else
    13:         s ← −s
    14:         e_i = FUNC(log_inner, L_ji) − 0.5 · d_j
    15:       c_i = [c_i  e_i], d_i = d_i − 2 · e_i, s_i = [s_i  s]
    16:     j = argmax_{i ∈ S \ Y_g} d_i, Y_g = Y_g ∪ {j}
    17:   return Y_g

    18: function LOGSUMEXP(c1, c2, s1, s2)
    19:   s, log_inner ← 1, −∞
    20:   for ⟨c1, c2, s1, s2⟩ ∈ ⟨c1, c2, s1, s2⟩ :
    21:     s′ ← s1 · s2
    22:     FUNC ← LOG_ADD if s == s′ else LOG_MINUS
    23:     if log_inner > c1 + c2 :
    24:       log_inner ← FUNC(log_inner, c1 + c2)
    25:     else
    26:       s = s′
    27:       log_inner ← FUNC(c1 + c2, log_inner)
    28:   return s, log_inner
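Algorithm 1 relies on signed log-space addition and subtraction (the LOG_ADD / LOG_MINUS operations). A possible implementation of these two primitives is sketched below; it is our own illustration in the spirit of Li and Eisner (2009), not the authors' code.

```python
import math

def log_add(a, b):
    """log(exp(a) + exp(b)), computed stably; assumes a >= b."""
    if b == -math.inf:
        return a
    return a + math.log1p(math.exp(b - a))

def log_minus(a, b):
    """log(exp(a) - exp(b)) for a >= b; -inf when the difference underflows."""
    if b == -math.inf:
        return a
    diff = -math.expm1(b - a)          # 1 - exp(b - a) >= 0
    return a + math.log(diff) if diff > 0.0 else -math.inf

# e.g., combining two log-probabilities:
print(log_add(math.log(0.3), math.log(0.2)))    # ~ log(0.5)
print(log_minus(math.log(0.3), math.log(0.2)))  # ~ log(0.1)
```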

B Experimental Setup

Hyperparameters. As we use the string subsequence kernel of §4 in DetBS, there are a number of hyperparameters that can be adjusted beyond the diversity weight w: the decay factor λ indicates the degree to which interior gaps are penalized, and the subsequence length n indicates the length of the considered substrings u. For each language, we perform a search over these two hyperparameters for set size k = 10 and diversity coefficient w = 0.1 on validation sets. We use a grid search over n = [2, 3, 4, 5, 6, 7, 8] and λ = [0.1, 0.3, 0.5, 0.7, 1.0]. We choose the configuration that yields the highest (average n-gram diversity)*BLEU, using this configuration in all subsequent experiments. While there may be better performing hyperparameters under different k and w, we omit searching over the entire space to create a fairer comparison with the other decoding strategies. Interestingly, larger values of n did not improve performance, and were more computationally expensive; small values of n and decay λ appear to offer the best BLEU vs. n-gram diversity trade-off.

    Dataset         n   decay (λ)
    WMT'14 En–Fr    2   0.3
    WMT'19 De–En    2   0.1

Table 3: Best performing configurations found in the search over string subsequence kernel parameters.

Dataset and Model Statistics. We use a convolutional sequence-to-sequence model trained according to Gehring et al. (2017) on the WMT'14 En–Fr dataset.16 Data preprocessing steps, model hyperparameters and baseline performances can be found in their work. We use the pre-trained model checkpoints made available by fairseq at https://github.com/pytorch/fairseq/tree/master/examples/translation. We use a Transformer-based model trained according to Ng et al. (2019) on the WMT'19 De–En dataset.17 Likewise, data preprocessing steps, model hyperparameters and baseline performances can be found in Ng et al. (2019). We similarly use the pretrained model checkpoints made available by fairseq.

16 Available at http://statmt.org/wmt14/translation-task.html
17 Available at http://www.statmt.org/wmt19/translation-task.html

C Additional Results

Figure 2: n-gram diversity vs. minimum, median, and maximum BLEU score for beam sizes k = 5, 10, 20 on WMT'19 De–En newstest using a larger range of the diversity weight w.

Figure 3: n-gram diversity vs. minimum, median, and maximum BLEU score for beam sizes k = 5, 10, 20 on WMT'14 En–Fr newstest using a larger range of the diversity weight w.
