Best-First Beam Search

Clara Meister    Tim Vieira    Ryan Cotterell∗

ETH Zürich    Johns Hopkins University    ∗University of Cambridge
[email protected]    [email protected]    [email protected]

Abstract

Decoding for many NLP tasks requires an effective heuristic algorithm for approximating exact search because the problem of searching the full output space is often intractable, or impractical in many settings. The default algorithm for this job is beam search—a pruned version of breadth-first search. Quite surprisingly, beam search often returns better results than exact inference due to beneficial search bias for NLP tasks. In this work, we show that the standard implementation of beam search can be made up to 10x faster in practice. Our method assumes that the scoring function is monotonic in the sequence length, which allows us to safely prune hypotheses that cannot be in the final set of hypotheses early on. We devise effective monotonic approximations to popular nonmonotonic scoring functions, including length normalization and mutual information decoding. Lastly, we propose a memory-reduced variant of best-first beam search, which has a similar beneficial search bias in terms of downstream performance, but runs in a fraction of the time.

1 Introduction

Beam search is a common heuristic algorithm for decoding structured predictors (e.g., neural machine translation models and transition-based parsers). Because of the widespread adoption of recurrent neural networks and other non-Markov models, traditional solutions, such as the Viterbi algorithm (Viterbi, 1967), are prohibitively inefficient; this makes beam search a common component of many state-of-the-art NLP systems. Despite offering no formal guarantee of finding the highest-scoring hypothesis under the model, beam search yields impressive performance on a variety of tasks—unexpectedly providing a beneficial search bias over exact search for many tasks (Stahlberg and Byrne, 2019).

Within NLP, most research on beam search has focused on altering the log-probability scoring function to return improved results, for example, higher BLEU scores (Wu et al., 2016; Murray and Chiang, 2018; Shu and Nakayama, 2018; Yang et al., 2018) or a more diverse set of outputs (Vijayakumar et al., 2016). However, little work has been done to speed up beam search itself. Filling this gap, this paper focuses on reformulating beam search in order to make it faster. We propose best-first beam search, a prioritized version of traditional beam search that is up to an order of magnitude faster in practice while still returning the same set of results. We additionally discuss an even faster heuristic version of our algorithm that further limits the number of candidate solutions, leading to a smaller memory footprint while still finding good solutions.

Concretely, we offer a novel interpretation of beam search as an agenda-based algorithm where traditional beam search is recovered by utilizing a length-based prioritization scheme. We prove that a specific best-first prioritization scheme, as in classic A∗ search (Hart et al., 1968), allows for the elimination of paths that will necessarily fall off the beam; for many scoring functions, including standard log-probability scoring, we can still guarantee the same k hypotheses as traditional beam search are returned. Indeed, our algorithm returns beam search's top hypothesis the first time it encounters a complete hypothesis, allowing the program to stop early. Further, we discuss the application of best-first beam search to several popular scoring functions in the literature (He et al., 2016; Li et al., 2016); this demonstrates that we have a general framework for adapting a variety of rescoring methods and alternate objectives to work with our algorithm.

Transactions of the Association for Computational Linguistics, vol. 8, pp. 795–809, 2020. https://doi.org/10.1162/tacl_a_00346. Action Editor: Kristina Toutanova. Submission batch: 2/2020; Revision batch: 6/2020; Published 12/2020. © 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Empirically, we compare best-first beam search to ordinary beam search on two NLP sequence-to-sequence tasks: neural machine translation (NMT) and abstractive summarization (AS). On NMT, we find that our algorithm achieves roughly a 30% speed-up over traditional beam search, with increased gains for larger beams (e.g., ≈ 10x for a beam of 500). We find similar results hold for AS. Finally, we show that our memory-reduced version, which limits the number of active hypotheses, leads to additional speed-ups over best-first beam search across beam sizes while maintaining similar BLEU scores.

2 Sequence Transduction

Our work focuses on sequence-to-sequence transduction: predicting an output sequence given an input sequence. One such task is machine translation, wherein a source-language sentence is mapped ("transduced") to a target-language sentence. While our exposition focuses on sequence-to-sequence prediction, our algorithms are directly applicable to any sequential structured prediction model, such as transition-based parsers (Nivre et al., 2008) and sequence taggers (McCallum et al., 2000; Lafferty et al., 2001).

Notation. Let x = ⟨x1, ..., xNx⟩ be an input sequence of length Nx and, likewise, let y = ⟨y1, ..., yNy⟩ be an output sequence of length Ny. Each yt is an element of V, the set of output tokens.[1] Finally, let Y(x) be the set of all valid output sequences (i.e., complete hypotheses); for the task of language generation, which we focus on experimentally, this set is defined in (3). In other words, every valid sequence begins and ends with distinguished tokens (BOS and EOS, respectively). Furthermore, each sequence has at most length nmax(x)—which is typically dependent on x—a restriction we impose to ensure termination. Some applications may require a stronger coupling between Y(x) and x (e.g., |x| = |y|). We drop the dependence of Y and nmax on x when it is clear from context.

Scoring. We consider a general additively decomposable scoring model of the form

    score(x, y) = Σ_{t=0}^{Ny} score(x, y<t ◦ yt)    (4)

For probabilistic models, decoding typically targets the maximum a posteriori hypothesis

    yMAP def= argmax_{y ∈ Y(x)} p(y | x)    (2)

We note that (5) is the scoring function used for decoding many language generation models.[2]

Beam search. The worst-case running time of exactly computing (1) is exponential in nmax; namely, O(|V|^nmax).[3] Beam search is a commonly used approximation to (1) in NMT and language generation tasks. It is used in many (if not most) state-of-the-art NLP systems (Wu et al., 2016; Serban et al., 2017; Edunov et al., 2018; Yang et al., 2019). Beam search may be understood as a pruned version of the classic path-search algorithm, breadth-first search (BFS), where the breadth is narrowed to the beam size k. Pseudocode is given in Alg. 1.

Although beam search does not solve (1) exactly, it is a surprisingly useful approximation for NLP models. In many settings, beam

[1] BOS and EOS are typically members of V. Often, EOS counts towards the nmax length limit while BOS does not. This is reflected in (3).

[2] To see why, apply exp (an order-preserving transformation): exp(score_s2s(x, y)) = exp(Σ_t log p(yt | y<t, x)) = Π_t p(yt | y<t, x) [...]
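The log-probability score just discussed is monotonically decreasing in hypothesis length: each additional token contributes a term log p(yt | y<t, x) ≤ 0, so a prefix's score can only go down as it is extended. A minimal sketch of this property, using hypothetical per-token probabilities rather than a trained model (`score_s2s` and `prefix_scores` are illustrative helper names, not from the paper):

```python
import math

def score_s2s(log_probs):
    """Additively decomposable log-probability score: the sum of the
    per-token terms log p(y_t | y_<t, x), as in eq. (4)."""
    return sum(log_probs)

def prefix_scores(log_probs):
    """Scores of all prefixes of a hypothesis. Because every term is
    non-positive, the sequence of prefix scores is non-increasing."""
    scores, total = [], 0.0
    for lp in log_probs:
        total += lp
        scores.append(total)
    return scores

# Hypothetical per-token probabilities for one hypothesis.
lps = [math.log(p) for p in (0.5, 0.8, 0.9)]
```

This monotonicity is exactly what later licenses pruning: once a prefix scores below the k-th best complete hypothesis, no extension of it can recover.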

search outperforms exact methods in terms of downstream evaluation (Koehn and Knowles, 2017; Stahlberg and Byrne, 2019). For the remainder of this paper, we will pivot our attention away from exact solutions to (1) to exact solutions to the beam search output.

Definition 2.1. k-optimal hypothesis. We say that a hypothesis is k-optimal if it is the top hypothesis returned by beam search with beam size k.

Algorithm 1 Standard beam search[4]
Input: x: source sentence
       k: maximum beam size
       nmax: maximum hypothesis length
       score(·, ·): scoring function
 1: B0 ← {⟨0, BOS⟩}
 2: for t ∈ {1, ..., nmax − 1}:
 3:   B ← ∅
 4:   for ⟨s, y⟩ ∈ Bt−1:
 5:     if y.last() = EOS:
 6:       B.add(⟨s, y⟩)
 7:       continue
 8:     for y′ ∈ V:
 9:       s ← score(x, y ◦ y′)
10:       B.add(⟨s, y ◦ y′⟩)
11:   Bt ← B.top(k)
12: return B.max()

Algorithm 2 General decoding scheme.[4],[5] Highlighted sections are choice points in the algorithm for which values determine the search strategy. See §3.1 for detailed explanation.
Input: x: source sentence
       nmax: maximum hypothesis length
       score(·, ·): scoring function
       ≻: comparator                 (choice point 1)
       stop(·): stopping criterion   (choice point 2)
       k: maximum beam size          (choice point 3)
       h(·, ·): heuristic function   (choice point 4)
 1: Q ← priority_queue(≻)
 2: Q.push(⟨0, BOS⟩)
 3: POPS ← counter()
 4: while not stop(Q) and not Q.empty():
 5:   ⟨sh, y⟩ ← Q.pop()
 6:   if POPS[|y|] ≥ k or |y| > nmax:
 7:     continue
 8:   POPS[|y|] ← POPS[|y|] + 1
 9:   if y.last() = EOS:
10:     Q.push(⟨sh, y ◦ EOS⟩)
11:   else:
12:     for y′ ∈ V:
13:       s ← score(x, y ◦ y′)
14:       sh ← s + h(x, y ◦ y′)
15:       Q.push(⟨sh, y ◦ y′⟩)
16: return Q.pop() if not Q.empty() else null

3 A∗ Beam Search

We develop a meta-algorithm that is parameterized by several choice points. Our general search algorithm for decoding (Alg. 2) takes an arbitrary prioritization function, stopping criterion, and search heuristic. With certain values of these attributes, we recover many common search algorithms: greedy search, beam search, best-first search (Dijkstra, 1959), and A∗ search (Hart et al., 1968). We propose an alternate prioritization function for beam search that allows for faster decoding while still returning the same k-optimal set of hypotheses.

3.1 Choice Points of Alg. 2

Here we review the components of our meta-algorithm (the highlighted sections in Alg. 2) that can be varied to recover different search strategies:

1. ≻ : Y × Y → {True, False}. A priority queue Q maintains the set of active hypotheses. Elements in this set are ordered according to a generic comparator ≻. When its peek() (or pop()) methods are called, the first element ordered by ≻ is returned (or returned and removed).

2. stop(·) : Collection⟨Y⟩ → {True, False}. The algorithm terminates according to a configurable stopping criterion based on the current set of elements in Q.

[4] Often, the score function is additively decomposable in t, such as (5). Implementations can exploit this fact to make each score evaluation (line 9) O(1) rather than O(t). We did not make this implementation detail explicit in Alg. 1 or Alg. 2 for generality and simplicity.

[5] If the last token of y is the end symbol (e.g., EOS), then y is not expanded any further. One can either regard y as any other hypothesis albeit with y ◦ yt = y, or keep appending EOS (i.e., y ◦ yt = y ◦ EOS) so that time step and length can be regarded as synonymous. We adopt the latter standard for comparability with subsequent algorithms.

Let length-first denote the comparator ⟨sh, y⟩ ≻ ⟨s′h, y′⟩ ⟺ |y| < |y′| or (|y| = |y′| and sh ≥ s′h), and score-first the comparator ⟨sh, y⟩ ≻ ⟨s′h, y′⟩ ⟺ sh > s′h or (sh = s′h and |y| < |y′|).

              Beam Search                       Best-First Beam Search            A∗ Beam Search
1 (≻)         length-first                      score-first                       score-first
2 (stop(Q))   y.last() = EOS ∀ ⟨·, y⟩ ∈ Q       Q.peek().last() = EOS             Q.peek().last() = EOS
3 (k)         beam size                         beam size                         beam size
4 (h)         0                                 0                                 any admissible heuristic

              Breadth-First Search              Best-First Search                 A∗ Search
1 (≻)         length-first                      score-first                       score-first
2 (stop(Q))   y.last() = EOS ∀ ⟨·, y⟩ ∈ Q       Q.peek().last() = EOS             Q.peek().last() = EOS
3 (k)         ∞                                 ∞                                 ∞
4 (h)         0                                 0                                 any admissible heuristic

Table 1: Values at choice points for various search algorithms. Note that any admissible heuristic may be used for variants of A∗ search.
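To make the meta-algorithm concrete, the following is a minimal Python sketch of Alg. 2 instantiated with the best-first beam search choice points from Table 1 (score-first comparator, peek-based stopping, h = 0). The function and toy-model names (`general_decode`, `toy_cond`, `toy_score`) and all probabilities are illustrative assumptions, not the paper's implementation; since Python's `heapq` is a min-heap, the comparator is encoded as a sort key.

```python
import heapq
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"

def general_decode(score, vocab, key, stop, k, n_max, h=lambda x, y: 0.0, x=None):
    """Sketch of Alg. 2. `key(sh, y)` encodes the comparator as a heapq sort
    key (smallest pops first); `stop(q)` is the stopping criterion; at most
    k hypotheses of each length are expanded (choice point 3)."""
    q = [(key(0.0, (BOS,)), 0.0, (BOS,))]
    pops = Counter()
    while q and not stop(q):
        _, sh, y = heapq.heappop(q)
        if pops[len(y)] >= k or len(y) > n_max:
            continue  # pruned: k paths of this length already expanded
        pops[len(y)] += 1
        if y[-1] == EOS:
            heapq.heappush(q, (key(sh, y + (EOS,)), sh, y + (EOS,)))
        else:
            for tok in vocab:
                cand = y + (tok,)
                sh2 = score(x, cand) + h(x, cand)
                heapq.heappush(q, (key(sh2, cand), sh2, cand))
    return heapq.heappop(q)[2] if q else None

# Best-first beam search (Table 1): higher score first, ties -> shorter length.
bf_key = lambda sh, y: (-sh, len(y))
bf_stop = lambda q: q[0][2][-1] == EOS  # stop when the queue's head is complete

# Hypothetical toy model: EOS is unlikely before three tokens are produced.
def toy_cond(prefix, tok):
    table = ({"a": 0.6, "b": 0.3, EOS: 0.1} if len(prefix) < 3
             else {"a": 0.2, "b": 0.1, EOS: 0.7})
    return table[tok]

def toy_score(x, y):
    return sum(math.log(toy_cond(y[:t], y[t])) for t in range(1, len(y)))

best = general_decode(toy_score, ("a", "b", EOS), bf_key, bf_stop, k=2, n_max=5)
```

With these settings the search stops the first time the queue's head is a complete hypothesis, mirroring the early-stopping behavior analyzed in §4.1; swapping in the length-first comparator and the all-complete stopping criterion from Table 1 would recover standard beam search.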

3. k ∈ N>0. Only k paths of a given length are considered. If the algorithm has already encountered k paths of a given length, subsequent paths of that length are not evaluated. If we take k = ∞, we recover unpruned search algorithms.

4. h(·, ·) : X × Y → R. A heuristic function h(x, y) can be used during search to change the priority in which paths are evaluated. We note that with pruning, a heuristic may change the value of the k-optimal hypothesis (see §4.1).

Recovering Beam Search. To recover beam search from Algorithm 2, we use the choice points from Table 1. Explicitly, the comparator prioritizes hypotheses from earlier time steps first, but breaks ties with the hypotheses' scores under the model. We note that while the standard algorithm for beam search does not prioritize by score within a time step, variations of the algorithm use this strategy so they can use early-stopping strategies (Klein et al., 2017; Huang et al., 2017). Beam search terminates once either all hypotheses end in EOS or the queue is empty (i.e., when the k beams have been extended nmax time steps but none end in EOS). In the second case, no complete hypothesis is found. Finally, choosing the heuristic h(x, y) = 0 makes the algorithm a case of standard best-first search.

Note that, while standard beam search returns a set, Alg. 2 only returns the k-optimal hypothesis. This behavior is sufficient for the majority of use cases for beam search. However, if the full set of k hypotheses is desired, the stopping criterion can be changed to evaluate true only when k hypotheses are complete. Under the other beam search settings, this would provably return the same set as beam search (see §4.1).

Recovering A∗. To recover the traditional A∗ search algorithm, we use the comparator that prioritizes hypotheses with a higher score first; ties are broken by hypothesis length. The algorithm terminates when the first item of Q contains an EOS. If we take k = ∞, best-first beam search recovers A∗. Any admissible heuristic may be used for h(x, y).

Definition 3.1. Admissible Heuristic. A heuristic h is admissible if it never overestimates the future cost—or underestimates the future reward—of continuing down a path.

3.2 Best-First Beam Search

In its original form, A∗ search may traverse the entire O(|V|^nmax) graph, which, as discussed earlier, is intractable for many decoding problems. While standard beam search addresses this problem by limiting the search space, it still has computational inefficiencies—namely, we must analyze k hypotheses of a given length (i.e., time step), regardless of how poor their scores may already be, before considering longer hypotheses.

However, prioritization by length is not strictly necessary for finding a k-optimal hypothesis. As is done in A∗, we can use score as the prioritization scheme and still guarantee optimality—or k-optimality—of the paths returned by the algorithm. We define A∗ beam search as the A∗ algorithm where breadth is limited to size k. Further, we define best-first beam search as the case of A∗ beam search when no heuristic is used (see Table 1 [...] reached EOS. We prove the k-optimality of these hypotheses under best-first beam search in §4.1.

3.3 Implementation Details

Standard beam search forms a separate set of active hypotheses for each time step, that is, each Bt is its own set. Once Bt has been narrowed down to the top k, the previous B [...]
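The memory-reduced variant evaluated later (§6.4) caps the agenda at k·γ entries, discarding the lowest-priority hypotheses on overflow. A sketch of such a capacity-limited agenda follows; the class and method names are illustrative assumptions, not the paper's implementation, and a min–max heap (as cited in §4.2) would make `pop_best` logarithmic rather than linear.

```python
import heapq

class BoundedMaxQueue:
    """Bounded agenda sketch: a max-priority queue whose size is capped;
    pushing onto a full queue evicts the lowest-scoring hypothesis."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = []  # min-heap on score: the root is the worst hypothesis

    def push(self, score, hyp):
        if len(self._items) < self.capacity:
            heapq.heappush(self._items, (score, hyp))
        elif score > self._items[0][0]:
            heapq.heapreplace(self._items, (score, hyp))  # evict current worst
        # else: the new hypothesis is worse than everything kept; drop it

    def pop_best(self):
        best = max(self._items)  # O(n) here; a min-max heap gives O(log n)
        self._items.remove(best)
        heapq.heapify(self._items)
        return best

# Hypothetical usage: capacity 2, three pushes -> the worst item is dropped.
q = BoundedMaxQueue(capacity=2)
q.push(1.0, "hyp-a")
q.push(3.0, "hyp-b")
q.push(2.0, "hyp-c")   # evicts (1.0, "hyp-a")
best = q.pop_best()
```

Bounding capacity this way trades a possible increase in search error for a fixed memory footprint, which is the trade-off the experiments in §6.4 quantify.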


We propose a variant of best-first beam search [...]

4 Algorithm Analysis

4.1 Correctness

We show the equivalence of the top hypothesis[6] returned by beam search and best-first beam search when score(·, ·) is monotonically decreasing in t, length-based prioritization is used, and the beam size k is the same for both algorithms. Without loss of generality, we hold x constant in all the following proofs.

Note that we take the terms pop and push from queue terminology. Specifically, "popping a hypothesis" refers to making it past line 7 of Alg. 2, where a hypothesis y is expanded by yt ∈ V. In path search terminology, this would be equivalent to visiting a node and adding the edges from that node as potential paths to explore. Lastly, we refer to the priority queues used by beam search and best-first beam search as Q_BS and Q_A∗, respectively.

Lemma 4.1. Best-first beam search evaluates all hypotheses of a given length t in order of their score.

Proof. We prove the lemma by induction. The [...]

Case 1: score(x, y [...]

Case 2: score(x, y [...]

Case 3: score(x, y′) = score(x, y). The lemma holds if either y or y′ is popped first.

By the principle of induction, Lemma 4.1 holds for all t ∈ N>0.

Lemma 4.2. The first hypothesis that best-first beam search pops that ends in EOS is k-optimal.

Proof. Let y be the first hypothesis popped by best-first beam search ending in EOS. By rules of the priority queue, no other active hypothesis has a higher score than y. Additionally, by monotonicity of the scoring function, no other hypothesis can subsequently have score greater than y. Therefore y must be k-optimal.

Lemma 4.3. If best-first beam search pops a hypothesis, then beam search necessarily pops that same hypothesis.

Proof. We prove the lemma by induction on hypothesis length. The base case holds trivially: For hypotheses of length 0, both best-first beam search and beam search must pop the BOS as it is the only item in the queue after initialization. By the inductive hypothesis, suppose Lemma 4.3 holds for hypotheses of length [...]

[...] all extensions of those y [...] score(x, y≤t+j) ∀i ∈ {1, ..., k}. This implies that for beam search, y≤t+j would not be in the top-k paths at time step t + j since, by Lemma 4.3, paths {y(1)≤t+j, ..., y(k)≤t+j} would also be evaluated by beam search. Therefore y cannot be in H_BS, which is a contradiction.

Case 2: For no time step t + j (j ≥ 0) do we pop k paths. This can only happen if the algorithm stops early, namely, we have found k complete hypotheses y(1), ..., y(k). If this is the case, then by rules of the priority queue, each y(1), ..., y(k) must have score greater than score(x, y [...] score(x, y). This implies y cannot be in H_BS, which is a contradiction.

Non-monotonic Scoring Functions. Non-monotonic scoring functions (Definition 3.2) break the assumptions of §4.1, in which case best-first beam search is not guaranteed to return a k-optimal hypothesis. However, when the scoring [...] may cause other items to be pruned from best-first beam search that otherwise would have remained on the beam in standard beam search.

[7] For monotonic scoring functions, we have U(x, y) = 0.

4.2 Runtime

Theorem 4.6. The runtime of best-first beam search is O(nmax k (|V| log(k) + log(nmax))).

Proof. We pop at most nmax · k items. Each pop requires us to push |V| items. Each push requires log(k) time when the priority queue is implemented with a min–max heap (Atkinson et al., 1986) and incrementally pruned so that it [...] priority queue of priority queues, which requires log(nmax) time. This yields O(nmax k (|V| log(k) + log(nmax))) time.

Theorem 4.7. The runtime of standard beam search is O(nmax k |V| log(k)).

Proof. The proof is the same as Theorem 4.6, but we can forgo the percolation step in the queue of queues because standard beam search proceeds in order of hypothesis length. This yields O(nmax k |V| log(k)).

Although the theoretical bound of best-first beam search has an additional log factor compared with standard beam search, we find this to be negligible in practice. Rather, we find the number of calls to score, the scoring function under our model (e.g., a neural network), is often the bottleneck operation when decoding neural networks (see §6 for empirical evidence). In terms of this metric, the beam search algorithm makes O(k nmax) calls to score, as score is called once for each active hypothesis in B and B may evolve for nmax rounds. The worst-case number of calls to score will be the same as for beam search, which follows from Lemma 4.3.

5 Scoring Functions

Even before the findings of Stahlberg and Byrne (2019), it was well known that the best-scoring hypothesis with respect to the traditional likelihood objective can be far from ideal in practice (Wu et al., 2016; Murray and Chiang, 2018; Yang et al., 2018). For language generation tasks specifically, the results returned by neural models using the standard scoring function are often short and default to high-frequency words (Vinyals and Le, 2015; Shen et al., 2016).

To alleviate such problems, methods that revise hypothesis scores to incorporate preferences for longer, less repetitive, or more diverse options have been introduced and are often used in practice. While most such techniques change the scoring function such that it is no longer monotonic, we can still guarantee the k-optimality of the returned hypothesis for (upper) bounded scoring functions using the methods discussed in §4.1. In the remainder of this section, we present alternate scoring schemes adapted to work with best-first beam search. Additionally, we present several heuristics which, while breaking the k-optimality guarantee, provide another set of decoding strategies worth exploring.

Length Normalization. Length normalization is a widely used hypothesis scoring method that aims to counteract the propensity for shorter sequences to have higher scores under neural models; this is done by normalizing scores by hypothesis length (see Murray and Chiang [2018] for more detail).

For early stopping in beam search with length normalization, Huang et al. (2017) propose bounding the additive length reward as the minimum of a pre-determined optimal sequence length ratio r and the final sequence length Ny:

    score_LN(x, y) = score(x, y) + β · min{r|x|, Ny}    (7)

where β is the scaling parameter for the reward. We note, however, that the same can be done with the maximum sequence length nmax such that the traditional length reward used by He et al. (2016) is recovered:

    score_LN(x, y) = score(x, y) + β · min{nmax, Ny} = score(x, y) + β · Ny    (8)

We formally propose two methods for length normalization. We use the scoring functions in (7) or (8) with either: (1) the following heuristic:

    h(x, y) = 0                        for y.last() = EOS
    h(x, y) = β · max{b − |y|, 0}      for y.last() ≠ EOS    (9)

where b can be r|x| or nmax;[8] or (2) stopping criterion as in (6) albeit with scoring function score_LN and upper-bound function:

    U(x, y) = β · max{0, b − |y|}    (10)

Despite their similarities, these two methods are not guaranteed to return the same results. Whereas the second method will return the same k-optimal hypotheses as beam search, using a heuristic during pruned search means we can no longer guarantee the k-optimality of the results with respect to the scoring function as the heuristic may push hypotheses off of the beam. We present experimental results for both methods in §6.

[8] We enforce r|x| [...]

Mutual Information. Maximum mutual information decoding (Li et al., 2016) aims to alleviate the inherent preference of neural models for high-frequency tokens when using the log-probability

decoding objective. Rather than choosing the hypothesis y to maximize conditional probability with respect to the input x, we instead choose y to maximize pointwise mutual information (PMI):

    PMI(x; y) = log [ p(x, y) / (p(x) p(y)) ]    (11)

Note that (11) is equivalent to log [p(y|x)/p(y)], which can be rewritten as log p(y | x) − log p(y), making the objective additive and thus (11) can conform to (4).

From this last form, we can see how mutual information decoding penalizes high-frequency and generic outputs; the negative p(y) term, as Li et al. (2016) point out, acts as an "anti-language model." One unfortunate side effect of this objective is that ungrammatical and nonsensical outputs, which have probabilities close to 0 under a language model like p(y), end up with high scores because of the second term in the score function. To address this problem, and to upper-bound the scoring function, we propose lower-bounding the language model term by a hyperparameter 1 ≥ ε > 0. We additionally use the strength hyperparameter λ employed by Li et al. (2016):

    score_PMI(x, y) = log p(y | x) − λ log max{p(y), ε}    (12)

Similarly to our methods for length normalization, we can use the scoring function in (12) either with the heuristic:

    h(x, y) = 0                          for y.last() = EOS
    h(x, y) = −λ log ε · (nmax − |y|)    for y.last() ≠ EOS    (13)

or with stopping criterion as in (6) albeit with score_PMI and upper-bound function:

    U(x, y) = −λ log ε · (nmax − |y|)    (14)

Because −λ log ε is the best possible score at any given time step, clearly we can bound the increase in score_PMI by the above function. However, as with our length normalization strategy, we lose the k-optimality guarantee with the heuristic method for mutual information decoding. We present experimental results for both methods in §6.

6 Experiments

We run our algorithm on several language-related tasks that typically use beam search for decoding: NMT and AS. Specifically, experiments are performed on IWSLT'14 De-En (Cettolo et al., 2012), WMT'17 De-En (Bojar et al., 2017), MTTT Fr-En (Duh, 2018), and CNN-DailyMail (Hermann et al., 2015) using both Transformers (Vaswani et al., 2017) and Convolutional sequence-to-sequence models (Gehring et al., 2017).

For reproducibility, we use the data pre-processing scripts provided by fairseq (Ott et al., 2019) and follow their methods for training sequence transduction models. Hyperparameters are set in accordance with previous works. Specifically, on IWSLT'14 and MTTT tasks, we follow the recommended Transformer settings for IWSLT'14 in fairseq,[9] which are based on Vaswani et al. (2017) and Gehring et al. (2017). Hyperparameters for models trained on the WMT task are set following version 3 of the Tensor2Tensor toolkit (Vaswani et al., 2018). We use byte-pair encoding (BPE; Sennrich et al. 2016) for all languages. Vocabulary sizes for WMT and IWSLT'14 are set from recommendations for the respective tasks in fairseq; for the MTTT tasks, vocabulary sizes are tuned on models trained with standard label-smoothing regularization. Similarly, the CNN/DailyMail dataset is pre-processed and uses BPE following the same steps as Lewis et al. (2019); model hyperparameters are likewise copied. Details are available on fairseq's Web site.[10]

We use BLEU (Papineni et al., 2002) (evaluated using SacreBLEU [Post, 2018]) for MT metrics and ROUGE-L (Lin, 2004) for abstractive summarization metrics. We build our decoding framework in SGNMT.[11]

6.1 Running Time

In Table 2, we report values as the average number of calls to the scoring function per input; we do not use wall-clock time as this is heavily dependent on hardware. See Fig. 1 for empirical justification of the correlation between calls to the scoring function and runtime on the hardware our

[9] https://github.com/pytorch/fairseq/tree/master/examples/translation.
[10] https://github.com/pytorch/fairseq/blob/master/examples/bart/README.cnn.md.
[11] https://github.com/ucam-smt/sgnmt.

                  IWSLT'14 De-En                                    MTTT Fr-En                            CNN-DailyMail
                  k=5       k=10      k=100      k=500              k=10      k=100      k=500            k=5       k=10      k=100
                  (35.6)    (35.4)    (34.7)     (7.9)              (33.0)    (9.9)      (1.2)            (31.5)    (30.9)    (29.1)
BF beam search    93 (24%)  169 (36%) 1275 (79%) 1168 (736%)        184 (16%) 867 (138%) 885 (836%)       200 (33%) 305 (43%) 2960 (92%)
Beam search (ES)  107 (7%)  210 (9%)  2047 (12%) 7685 (27%)         196 (9%)  1310 (58%) 4182 (98%)       224 (19%) 357 (22%) 3942 (59%)
Beam search       115       229       2286       9770               214       2066       8281             266       435       5673

Table 2: Average number of calls (rounded to the nearest whole digit) to score, the sequence transduction model, per generated sequence when using different decoding algorithms. Percentages in parentheses are performance improvements over standard beam search. Beam search (ES) refers to the OpenNMT early-stopping method (Klein et al., 2017). All methods provably return the same solution and thus evaluation metrics (the parenthesized scores beneath each beam size) for a given beam size are identical.

Figure 1: Number of calls to scoring function score vs. total sequence generation time. Each point is a decoded sequence. Colors represent different model architectures and shapes signify the decoding algorithm used (beam sizes 3 and 10 are included for each). There is no notable difference in the overhead (time-wise) of best-first beam search and beam search.

IWSLT'14 De-En
k      method      search error   BLEU   # calls
10     shrinking   0%             35.4   229 (0%)
       early       0%             35.4   225 (2%)
       BF BS       −              35.4   169 (36%)
100    shrinking   31.7%          13.2   2278 (0%)
       early       31.7%          13.2   1738 (31%)
       BF BS       −              34.7   1275 (79%)

WMT'17 De-En
10     shrinking   0%             28.6   260 (0%)
       early       0%             28.6   252 (3%)
       BF BS       −              28.6   230 (12%)
100    shrinking   1.7%           26.4   2587 (0%)
       early       1.7%           26.4   2402 (8%)
       BF BS       −              26.9   2046 (26%)

Table 3: BLEU, search error, and average number of calls to score for different stopping criteria. "shrinking" refers to the shrinking beam method of Bahdanau et al. (2015) and "early" refers to the stopping criterion of Huang et al. (2017). Note that neither method is guaranteed to return the same result as standard beam search. Search error and performance increases are with respect to standard beam search.

experiments were run on. For reference, in our experiments, the scoring function took on average > 99% of the total computation time, even with larger beam sizes, when overhead of the search algorithm is most significant.

We find that best-first (BF) beam search leads to significant speed-ups over both traditional beam search and beam search with early stopping, with a performance increase[12] of ≈ 8x for a beam size of 500. We likewise find that best-first beam search offers speed-ups over early stopping methods that are not guaranteed to return the same results as standard beam search (see Table 3).

6.2 Length Normalization

We experiment with both forms of length normalization presented in §5 and provide results in Table 4. We find that both methods, that is, changing the stopping criterion and using a heuristic during search, provide improvements over baseline BLEU scores albeit with different hyperparameter settings; increases are similar to improvements reported by Murray and Chiang (2018). Notably, using a heuristic causes a large percentage of search errors with respect to standard beam search using the same scoring function. However, the difference in results appears to be beneficial in terms of BLEU.

[12] Performance increase is defined as (old − new)/new.
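The two length-normalization variants compared in these experiments can be sketched as follows; the function names and numeric values are hypothetical, with `b` standing for r|x| or nmax as in §5:

```python
def score_ln(base_score, beta, r, x_len, y_len):
    """Sketch of the bounded length reward, eq. (7):
    base score plus beta * min(r * |x|, N_y)."""
    return base_score + beta * min(r * x_len, y_len)

def ln_heuristic(beta, b, y_len, ended):
    """Sketch of eq. (9): the remaining achievable length reward for an
    unfinished hypothesis; zero once the hypothesis has ended in EOS."""
    return 0.0 if ended else beta * max(b - y_len, 0)

# Hypothetical values: base score -2.0, beta = 0.5, ratio r = 1.0, |x| = 3.
uncapped = score_ln(-2.0, beta=0.5, r=1.0, x_len=3, y_len=2)   # reward = 1.0
capped = score_ln(-2.0, beta=0.5, r=1.0, x_len=3, y_len=10)    # reward capped at r|x|
```

The heuristic hands a partial hypothesis the full reward it could still earn, which is exactly why it can reorder the search relative to the stopping-criterion method.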

IWSLT’14 De-En

                      k    β    b        # calls      search error   BLEU
Heuristic             5   0.8   |x|      115 (0%)     40.6%          33.9 +0.3
                     10   1.2   |x|      229 (0%)     54.7%          33.8 +0.5
Stopping Criterion    5   0.5   nmax      73 (58%)    −              33.7 +0.1
                     10   0.5   nmax     130 (76%)    −              33.7 +0.4

MTTT Fr-En

Heuristic             5   0.8   0.7|x|   100 (8%)     16.2%          33.5 +0.2
                     10   1.0   0.7|x|   196 (9%)     25.2%          33.6 +0.6
Stopping Criterion    5   1.0   nmax      65 (66%)    −              34.1 +0.8
                     10   1.2   nmax      88 (143%)   −              34.1 +1.1

Table 4: BLEU, search error, and average number of calls to score for output obtained with the length normalization scoring function on the IWSLT’14 De-En and MTTT Fr-En test sets. Increase in BLEU is over the baseline with no length normalization. Search error and performance increases are with respect to standard beam search decoding using the same scoring function.

                      k    ε      β    # calls      search error   BLEU
Baseline              5    −     .05   115          −              33.2
                     10    −     .05   229          −              33.0
Heuristic             5   .02    .05   129 (0%)     42.7%          33.2
                     10   .02    .05   256 (0%)     42.7%          33.0
Stopping Criterion    5   3e–4   .05   114 (1%)     29.2%          33.2
                     10   5e–5   .05   224 (2%)     26.6%          33.0

Table 5: BLEU scores with the mutual information scoring function on IWSLT’14 De-En. Baseline is PMI decoding with unbounded p(y), that is, ε = 0. Search error is with respect to beam search decoding of the baseline with the same β.

6.3 Mutual Information

We train a language model on the IWSLT dataset and use it to calculate p(y) from (12), as marginalizing over y is intractable (see Li et al. [2016] for further justification). We run experiments using both of the methods discussed in § 5 and present results in Table 5. We find that both methods provide results of equivalent BLEU score compared with the baseline output, namely, results obtained with the unbounded PMI objective and beam search. Again, despite the high search error rate demonstrated by the heuristic method, evaluation metrics are still comparable.

6.4 Memory Usage

We conduct a set of experiments where we limit total queue capacity to k·γ for γ ∈ {1, ..., nmax}, as described in § 3.3, and report the BLEU score of the resulting set of hypotheses.

IWSLT’14 De-En

  k    γ      search error   BLEU        # calls
  5    2      22.7%          35.7 +0.1    43.8 (163%)
       5       4.4%          35.8 +0.2    79.8 (44%)
       nmax   −              35.6         93.0 (24%)
 10    2      22.6%          35.7 +0.3    48.4 (374%)
       5       4.5%          35.6 +0.2   126.9 (81%)
       nmax   −              35.4        169.0 (36%)

WMT’17 De-En

  5    2      29.0%          29.7 +0.2    77.5 (75%)
       5       1.2%          29.5 +0.0   115.8 (12%)
       nmax   −              29.5        118.8 (10%)
 10    2      36.6%          29.5 +0.2    97.3 (165%)
       5       2.6%          29.3 +0.0   230.0 (12%)
       nmax   −              29.3        230.2 (12%)

Table 6: BLEU scores and the number of calls to score on the IWSLT’14 De-En validation set and WMT’17 De-En test set with queue size restricted to k·γ. Note that γ = nmax is the standard best-first beam search algorithm. Performance increases are over standard beam search. Search error is with respect to beam search with the same beam width.

As shown in Table 6, we find that restricting the queue capacity does not harm output quality and, additionally, leads to an even greater increase in runtime performance. For example, runtime for decoding of IWSLT’14 with a beam size of 10 can be improved by > 3x while returning results with better evaluation metrics. We find that improvements are even more pronounced for larger beam sizes. Across beam widths and tasks, we find that search error (with respect to standard beam search) is quite low for γ = 5. Additionally, for smaller γ, the change in BLEU score demonstrates that search error in this context does not necessarily hurt the quality of results.

7 Related Work

Our work is most similar to that of Zhou and Hansen (2005), who propose beam stack search. However, they are focused on exact inference and still evaluate hypotheses in breadth-first order.
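To make the contrast with breadth-first evaluation concrete, the core of a best-first expansion order can be sketched as follows. This is a minimal illustrative sketch, not the authors' pseudocode: the function name, the per-length pruning bookkeeping, and the toy unigram model are all hypothetical, and it assumes the scoring function is monotonically non-increasing in hypothesis length (e.g., summed log-probabilities).

```python
import heapq
import math
from itertools import count

def best_first_beam_search(score, vocab, eos, k, n_max):
    """Pop the single highest-scoring hypothesis first, expand it, and
    allow at most k hypotheses of any given length (the beam)."""
    tiebreak = count()                    # FIFO tie-breaking for equal scores
    queue = [(0.0, next(tiebreak), ())]   # min-heap over negated scores
    popped_at_length = {}                 # hypotheses already popped, per length
    finished = []
    while queue and len(finished) < k:
        neg_score, _, hyp = heapq.heappop(queue)
        n = len(hyp)
        # Beam pruning: a hypothesis outside the top-k of its length is dropped.
        if popped_at_length.get(n, 0) >= k:
            continue
        popped_at_length[n] = popped_at_length.get(n, 0) + 1
        if hyp and hyp[-1] == eos:        # complete hypothesis
            finished.append((-neg_score, hyp))
            continue
        if n >= n_max:                    # length limit reached
            continue
        for tok in vocab:                 # expand by one token
            child = hyp + (tok,)
            heapq.heappush(queue, (-score(child), next(tiebreak), child))
    return finished

# Toy unigram "model": score is a sum of log-probabilities, so it is
# monotonically non-increasing in length.
log_p = {"a": math.log(0.3), "b": math.log(0.1), "</s>": math.log(0.6)}
score = lambda hyp: sum(log_p[t] for t in hyp)
hyps = best_first_beam_search(score, ["a", "b", "</s>"], "</s>", k=2, n_max=5)
# hyps holds the k best complete hypotheses, best first
```

The memory-reduced variant evaluated in § 6.4 would additionally cap the total size of `queue` at k·γ, discarding the lowest-scoring entries whenever the capacity is exceeded.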

Additionally, their algorithm requires O(nmax k) memory; although best-first beam search has the same requirements, we introduce effective methods for reducing them, namely, memory-reduced best-first beam search.

Huang et al. (2017) propose and prove the optimality of an early-stopping criterion for beam search. The authors find in practice, though, that the reduction in computation from their algorithm was generally not significant. We build on this work and introduce additional methods for avoiding unnecessary computation. Our method leads to better performance, as shown in Table 2.

Klein and Manning (2003) use A∗ for PCFG parsing; however, they use the un-pruned version for exact search, which is not applicable for NMT or AS, as the memory requirements of the algorithm are far too large for these tasks. Subsequently, Pauls and Klein (2009) provide a method for pruning this search algorithm, albeit using a threshold rather than explicitly limiting the state space. Huang et al. (2012) also adapt A∗ for a k-best decoding algorithm. Although their methods differ notably from ours, they likewise use pruning techniques that allow for substantial speedups.

Stahlberg and Byrne (2019) create an exact inference algorithm for decoding and use it to analyze the output of neural NMT models. Whereas they likewise utilize the monotonicity of the scoring function to make their method tractable, they do not focus on speed or mimicking the results of standard beam search.

8 Conclusion

We propose best-first beam search, an algorithm that allows for faster decoding while still guaranteeing k-optimality. We provide results on several sequence-to-sequence transduction tasks that show the speed-ups that our algorithm provides over standard beam search for decoding neural models. We adapt several popular alternate scoring functions to best-first beam search and provide a framework that can be used to adapt other scoring methods, such as coverage normalization (Wu et al., 2016) or diverse beam search (Vijayakumar et al., 2016). We also provide a memory-reduced version of our algorithm, which returns competitive results in a fraction of the time needed for standard beam search.

References

M. D. Atkinson, J. R. Sack, N. Santoro, and T. Strothotte. 1986. Min-max heaps and generalized priority queues. Communications of the ACM, 29(10). DOI: https://doi.org/10.1145/6617.6621

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017. Findings of the 2017 conference on machine translation. In Proceedings of the Conference on Machine Translation, Volume 2: Shared Task Papers, Copenhagen, Denmark. DOI: https://doi.org/10.18653/v1/W17-4717

Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the Conference of the European Association for Machine Translation.

Edsger W. Dijkstra. 1959. A note on two problems in connexion with graphs. Numerische Mathematik, 1(1). DOI: https://doi.org/10.1007/BF01386390

Kevin Duh. 2018. The multitarget TED talks task. http://www.cs.jhu.edu/~kevinduh/a/multitarget-tedtalks/

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1045

Jonas Gehring, Michael Auli, David Grangier, and Yann Dauphin. 2017. A convolutional encoder model for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P17-1012

Peter E. Hart, Nils J. Nilsson, and Bertram Raphael. 1968. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2). DOI: https://doi.org/10.1109/TSSC.1968.300136

Wei He, Zhongjun He, Hua Wu, and Haifeng Wang. 2016. Improved neural machine translation with SMT features. In Proceedings of the AAAI Conference on Artificial Intelligence.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems.

Liang Huang, Kai Zhao, and Mingbo Ma. 2017. When to finish? Optimal beam search for neural text generation (modulo beam size). In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D17-1227

Zhiheng Huang, Yi Chang, Bo Long, Jean-Francois Crespo, Anlei Dong, Sathiya Keerthi, and Su-Lin Wu. 2012. Iterative Viterbi A* algorithm for k-best sequential decoding. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jeju Island, Korea. Association for Computational Linguistics.

Dan Klein and Christopher D. Manning. 2003. A* parsing: Fast exact Viterbi parse selection. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. DOI: https://doi.org/10.3115/1073445.1073461

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, Vancouver, Canada. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P17-4012

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, Vancouver. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W17-3204

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning, ICML '01, San Francisco, CA, USA.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv:1910.13461. DOI: https://doi.org/10.18653/v1/2020.acl-main.703

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain. Association for Computational Linguistics.

Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. 2000. Maximum entropy Markov models for information extraction and segmentation. In Proceedings of the International Conference on Machine Learning, ICML '00, San Francisco, CA, USA.

Kenton Murray and David Chiang. 2018. Correcting length bias in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W18-6322

Joakim Nivre, Igor M. Boguslavsky, and Leonid L. Iomdin. 2008. Parsing the SynTagRus treebank of Russian. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), Manchester, UK. Coling 2008 Organizing Committee. DOI: https://doi.org/10.3115/1599081.1599162

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT: Demonstrations. DOI: https://doi.org/10.18653/v1/N19-4009

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. DOI: https://doi.org/10.3115/1073083.1073135

Adam Pauls and Dan Klein. 2009. Hierarchical search for parsing. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Boulder, Colorado. Association for Computational Linguistics. DOI: https://doi.org/10.3115/1620754.1620835

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Conference on Machine Translation: Research Papers. DOI: https://doi.org/10.18653/v1/W18-6319

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P16-1162

Iulian Serban, Tim Klinger, Gerald Tesauro, Kartik Talamadupula, Bowen Zhou, Yoshua Bengio, and Aaron Courville. 2017. Multiresolution recurrent neural networks: An application to dialogue response generation.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum risk training for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P16-1159

Raphael Shu and Hideki Nakayama. 2018. Improving beam search by removing monotonic constraint for neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia. Association for Computational Linguistics.

Felix Stahlberg and Bill Byrne. 2019. On NMT search errors and model errors: Cat got your tongue? In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China. DOI: https://doi.org/10.18653/v1/D19-1331

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems.

Ben Taskar, Carlos Guestrin, and Daphne Koller. 2004. Max-margin Markov networks. In Advances in Neural Information Processing Systems.

Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2Tensor for neural machine translation. CoRR.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.

Tim Vieira, Ryan Cotterell, and Jason Eisner. 2016. Speed-accuracy tradeoffs in tagging with variable-order CRFs and structured sparsity. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. DOI: https://doi.org/10.18653/v1/D16-1206

Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David J. Crandall, and Dhruv Batra. 2016. Diverse beam search: Decoding diverse solutions from neural sequence models. CoRR, abs/1610.02424.

Oriol Vinyals and Quoc V. Le. 2015. A neural conversational model.

Andrew Viterbi. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2). DOI: https://doi.org/10.1109/TIT.1967.1054010

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Gregory S. Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation.

Yilin Yang, Liang Huang, and Mingbo Ma. 2018. Breaking the beam search curse: A study of (re-)scoring methods and stopping criteria for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1342

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems.

Rong Zhou and Eric A. Hansen. 2005. Beam-stack search: Integrating backtracking with beam search. In Proceedings of the International Conference on Automated Planning and Scheduling, ICAPS '05.
