Best-First Beam Search
Clara Meister   Ryan Cotterell   Tim Vieira
ETH Zürich   University of Cambridge   Johns Hopkins University
[email protected]   [email protected]   [email protected]

Abstract

Decoding for many NLP tasks requires a heuristic algorithm for approximating exact search, since the full search space is often intractable, if not simply too large to traverse efficiently. The default algorithm for this job is beam search—a pruned version of breadth-first search—which in practice returns better results than exact inference due to a beneficial search bias. In this work, we show that standard beam search is a computationally inefficient choice for many decoding tasks; specifically, when the scoring function is monotonic in sequence length, other search algorithms can be used to reduce the number of calls to the scoring function (e.g., a neural network), which is often the bottleneck computation. We propose best-first beam search, an algorithm that provably returns the same set of results as standard beam search, albeit in the minimum number of scoring function calls needed to guarantee optimality (modulo beam size). We show that best-first beam search can be used with length normalization and mutual information decoding, among other rescoring functions. Lastly, we propose a memory-reduced variant of best-first beam search, which has a similar search bias in terms of downstream performance but runs in a fraction of the time.

1 Introduction

Beam search is a common heuristic algorithm for decoding structured predictors, e.g., neural machine translation models and transition-based parsers. Due to the widespread adoption of recurrent neural networks and other non-Markov models, traditional dynamic-programming solutions, such as the Viterbi algorithm (Viterbi, 1967), are prohibitively inefficient; this makes beam search a common component of many state-of-the-art NLP systems. Despite offering no formal guarantee of finding the highest-scoring hypothesis under the model, beam search yields impressive performance on a variety of tasks—unexpectedly providing a beneficial search bias over exact search for many tasks (Stahlberg and Byrne, 2019).

Within NLP, most research on beam search has focused on altering the standard log-probability scoring function to return improved results, e.g., higher BLEU scores (Wu et al., 2016; Murray and Chiang, 2018; Shu and Nakayama, 2018; Yang et al., 2018) or a more diverse set of outputs (Vijayakumar et al., 2016). However, little work has been done to speed up beam search itself. Filling this gap, this paper focuses on reformulating beam search in order to make it faster. We propose best-first beam search, a prioritized version of traditional beam search that is up to an order of magnitude faster in practice while still returning the same set of results. We additionally discuss an even faster heuristic version of our algorithm that further limits the number of candidate solutions, leading to a smaller memory footprint while still finding good solutions.

Concretely, we offer a novel interpretation of beam search as an agenda-based algorithm in which traditional beam search is recovered by employing a length-based prioritization scheme. We prove that a specific best-first prioritization scheme, as in classic A* search (Hart et al., 1968), allows for the elimination of paths that will necessarily fall off the beam; for many scoring functions, including standard log-probability scoring, we can still guarantee that the same k hypotheses as traditional beam search are returned. Indeed, our algorithm returns beam search's top hypothesis the first time it encounters a complete hypothesis, allowing the program to stop early. Further, we discuss the application of best-first beam search to several popular scoring functions in the literature (He et al., 2016; Li et al., 2016); this demonstrates that we have a general framework for adapting a variety of rescoring methods and alternate objectives to work with our algorithm.

Empirically, we compare best-first beam search to ordinary beam search on two NLP sequence-to-sequence tasks: neural machine translation (NMT) and abstractive summarization (AS). On NMT, we find that our algorithm achieves roughly a 30% speed-up over traditional beam search, with increased gains for larger beams (e.g., ≈ 10x for a beam of 500). We find that similar results hold for AS. Finally, we show that our memory-reduced version, which limits the number of active hypotheses, leads to additional speed-ups over best-first beam search across beam sizes while maintaining similar BLEU scores.

2 Sequence Transduction

A core operation in structured prediction models is the determination of the highest-scoring output for a given input under a learned scoring model:

$$\mathbf{y}^{\star} \stackrel{\text{def}}{=} \operatorname*{argmax}_{\mathbf{y} \in \mathcal{Y}(\mathbf{x})} \operatorname{score}(\mathbf{x}, \mathbf{y}) \tag{1}$$

where $\mathbf{x}$ is an input and $\mathcal{Y}(\mathbf{x})$ is a set of well-formed outputs for that input. An important example of (1) is maximum a posteriori (MAP) decoding,

$$\mathbf{y}^{\text{MAP}} \stackrel{\text{def}}{=} \operatorname*{argmax}_{\mathbf{y} \in \mathcal{Y}(\mathbf{x})} p(\mathbf{y} \mid \mathbf{x}). \tag{2}$$
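To make (1) concrete, the following minimal brute-force sketch enumerates the language-generation output space formalized in (3) below (token sequences over a small vocabulary, bracketed by BOS and EOS) and takes the argmax. The names here are our own illustrative choices, not code from the paper, and the approach is feasible only for toy vocabularies:

    from itertools import product
    from typing import Callable, Optional, Sequence, Tuple

    Hyp = Tuple[str, ...]

    def exact_decode(
        x: Sequence[str],
        vocab: Sequence[str],
        n_max: int,
        score: Callable[[Sequence[str], Hyp], float],
    ) -> Optional[Hyp]:
        """Brute-force solution of Eq. (1): enumerate all hypotheses
        BOS ◦ v ◦ EOS with |v| < n_max and return the highest-scoring one."""
        best, best_score = None, float("-inf")
        for length in range(n_max):                     # |v| ranges over 0 .. n_max - 1
            for body in product(vocab, repeat=length):  # every string over vocab
                y = ("BOS",) + body + ("EOS",)
                s = score(x, y)
                if s > best_score:
                    best, best_score = y, s
        return best

The nested loops visit on the order of $|\mathcal{V}|^{n_{\max}}$ sequences, which is precisely the worst-case cost quoted under "Beam search" later in this section and the reason practical decoders fall back on heuristics.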
Our work focuses on sequence-to-sequence transduction: predicting an output sequence given an input sequence. One such task is machine translation, wherein a source-language sentence is mapped ("transduced") to a target-language sentence. While our exposition focuses on sequence-to-sequence prediction, our algorithms are directly applicable to any sequential structured prediction model, such as transition-based parsers (Nivre et al., 2008) and sequence taggers (McCallum et al., 2000; Lafferty et al., 2001).

Notation. Let $\mathbf{x} = \langle x_1, \dots, x_{N_x} \rangle$ be an input sequence of length $N_x$ and, likewise, let $\mathbf{y} = \langle y_1, \dots, y_{N_y} \rangle$ be an output sequence of length $N_y$. Each $y_t$ is an element of $\mathcal{V}$, the set of output tokens. Finally, let $\mathcal{Y}(\mathbf{x})$ be the set of all valid output sequences (i.e., complete hypotheses). For the task of language generation, which we focus on experimentally, this set is defined as

$$\mathcal{Y}(\mathbf{x}) \stackrel{\text{def}}{=} \{\text{BOS} \circ \mathbf{v} \circ \text{EOS} \mid \mathbf{v} \in \mathcal{V}^{< n_{\max}(\mathbf{x})}\} \tag{3}$$

where $\circ$ is string concatenation and $\mathcal{V}^{< n_{\max}(\mathbf{x})}$ is the set of all strings over $\mathcal{V}$ of length less than $n_{\max}(\mathbf{x})$. In words, every valid sequence begins and ends with distinguished tokens (BOS and EOS, respectively).[1] Furthermore, each sequence has at most length $n_{\max}(\mathbf{x})$—which is typically dependent on $\mathbf{x}$—a restriction we impose to ensure termination. Some applications may require a stronger coupling between $\mathcal{Y}(\mathbf{x})$ and $\mathbf{x}$ (e.g., $|\mathbf{x}| = |\mathbf{y}|$). We drop the dependence of $\mathcal{Y}$ and $n_{\max}$ on $\mathbf{x}$ when it is clear from context.

[1] BOS and EOS are typically members of $\mathcal{V}$. Often, EOS counts towards the $n_{\max}$ length limit while BOS does not. This is reflected in (3).

Scoring. We consider a general additively decomposable scoring model of the form

$$\operatorname{score}(\mathbf{x}, \mathbf{y}) = \sum_{t=1}^{N_y} \operatorname{score}(\mathbf{x}, \mathbf{y}_{<t} \circ y_t) \tag{4}$$

This framework covers a variety of modeling methodologies, including probabilistic transducers (both globally and locally normalized) and non-probabilistic models such as maximum-margin techniques (Taskar et al., 2004). Most importantly, (4) covers MAP decoding (2) of neural sequence-to-sequence models à la Sutskever et al. (2014):[2]

$$\operatorname{score}_{\text{s2s}}(\mathbf{x}, \mathbf{y}_{<t} \circ y_t) = \log p(y_t \mid \mathbf{y}_{<t}, \mathbf{x}) \tag{5}$$

We note that (5) is the scoring function used for decoding many language generation models.

[2] To see why, apply $\exp$ (an order-preserving transformation): $\exp(\operatorname{score}_{\text{s2s}}(\mathbf{x}, \mathbf{y})) = \exp\big(\sum_{t=1}^{N_y} \log p(y_t \mid \mathbf{y}_{<t}, \mathbf{x})\big) = \prod_{t=1}^{N_y} p(y_t \mid \mathbf{y}_{<t}, \mathbf{x}) = p(\mathbf{y} \mid \mathbf{x})$.
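To illustrate the decomposition, here is a minimal sketch of (4) instantiated with the summands of (5). The callable log_p is a hypothetical stand-in for one forward pass of a trained model; it is our assumption, not an interface from the paper:

    from typing import Callable, Sequence

    def score_s2s(
        x: Sequence[str],
        y: Sequence[str],
        log_p: Callable[[str, Sequence[str], Sequence[str]], float],
    ) -> float:
        """Eq. (4) with the log-probability summands of Eq. (5):
        score(x, y) = sum over t of log p(y_t | y_<t, x)."""
        total = 0.0
        for t in range(1, len(y)):            # y[0] = BOS is conditioned on, not scored
            total += log_p(y[t], y[:t], x)    # log p(y_t | y_<t, x)
        return total

Because each summand conditions only on the prefix $\mathbf{y}_{<t}$, a decoder can extend a hypothesis by one token and update its score with a single additional call to the model; the number of such calls is exactly the quantity best-first beam search aims to minimize.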
Beam search. The worst-case running time of exactly computing (1) is exponential in $n_{\max}$; namely, $O(|\mathcal{V}|^{n_{\max}})$.[3] Beam search is a commonly used approximation to (1) in NMT and language generation tasks. It is used in many (if not most) state-of-the-art NLP systems (Wu et al., 2016; Serban et al., 2017; Edunov et al., 2018; Yang et al., 2019). Beam search may be understood as a pruned version of the classic path-search algorithm, breadth-first search (BFS), where the breadth is narrowed to the beam size k. Pseudocode is given in Alg. 1. Although beam search does not solve (1) exactly, it is a surprisingly useful approximation for NLP models. In many settings, beam search outperforms exact methods in terms of downstream evaluation (Koehn and Knowles, 2017; Stahlberg and Byrne, 2019). For the remainder of this paper, we will pivot our attention away from exact solutions to (1) and toward exact solutions to the beam search output.

[3] This can be improved if, for example, score(·, ·) admits a low-order Markov factorization (Viterbi, 1967; Vieira et al., 2016). We do not discuss that setting in this paper because it limits the scoring model's expressive power.

Algorithm 1 Standard beam search

Input: x: source sentence
       k: maximum beam size
       n_max: maximum hypothesis length
       score(·, ·): scoring function

 1: B_0 ← {⟨0, BOS⟩}
 2: for t ∈ {1, …, n_max − 1}:
 3:     B ← ∅
 4:     for ⟨s, y⟩ ∈ B_{t−1}:
 5:         if y.last() = EOS:
 6:             B.add(⟨s, y⟩)
 7:             continue
 8:         for v ∈ V:
 9:             s ← score(x, y ◦ v)
10:             B.add(⟨s, y ◦ v⟩)
11:     B_t ← B.top(k)
12: return B.max()

Algorithm 2 A* beam search. The comparator, stopping criterion, beam size, and heuristic are choice points in the algorithm whose values determine the search strategy. See § 3.1 for a detailed explanation.

Input: x: source sentence
       n_max: maximum hypothesis length
       score(·, ·): scoring function
       ≺: comparator
       stop(·, ·): stopping criterion
       k: maximum beam size
       h(·, ·): heuristic function

 1: Q ← priority_queue(≺)
 2: Q.push(⟨0, BOS⟩)
 3: POPS ← counter()
 4: while not stop(Q) and not Q.empty():
 5:     ⟨s_h, y⟩ ← Q.pop()
 6:     if POPS[|y|] ≥ k or |y| > n_max:
 7:         continue
 8:     POPS[|y|] ← POPS[|y|] + 1
 9:     if y.last() = EOS:
10:         Q.push(⟨s_h, y ◦ EOS⟩)
11:     else:
12:         for v ∈ V:
13:             s ← score(x, y ◦ v)
14:             Q.push(⟨s + h(x, y ◦ v), y ◦ v⟩)
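As a concrete rendering of the agenda-based view, the sketch below implements Alg. 2 in Python with the heuristic fixed to h = 0 and a best-first comparator, and with one simplification: rather than re-pushing finished hypotheses padded with EOS (line 10) to recover the full beam, it stops at the first complete hypothesis, which under the non-positive log-probability scores of (5) is exactly beam search's top result. The function and variable names are our own, not the authors':

    import heapq
    from collections import Counter
    from typing import Callable, Optional, Sequence, Set, Tuple

    Hyp = Tuple[str, ...]

    def best_first_beam_search(
        x: Sequence[str],
        vocab: Set[str],          # assumed to contain "EOS"
        n_max: int,
        k: int,
        score_step: Callable[[Sequence[str], Hyp, str], float],
    ) -> Optional[Tuple[Hyp, float]]:
        """Sketch of Alg. 2 with h = 0 and best-first (max-score) ordering.
        score_step(x, y, v) is assumed to return log p(v | y, x), as in
        Eq. (5), so path scores never increase with length and the first
        EOS-terminated hypothesis popped is beam search's top hypothesis."""
        agenda = [(0.0, ("BOS",))]       # min-heap over negated scores
        pops = Counter()                 # POPS: number of pops per length
        while agenda:
            neg_s, y = heapq.heappop(agenda)
            if pops[len(y)] >= k or len(y) > n_max:
                continue                 # y would have fallen off the beam
            pops[len(y)] += 1
            if y[-1] == "EOS":
                return y, -neg_s         # early exit with the top hypothesis
            for v in vocab:
                s = -neg_s + score_step(x, y, v)   # additive update, Eq. (4)
                heapq.heappush(agenda, (-s, y + (v,)))
        return None                      # no complete hypothesis within n_max

Swapping the heap priority from (-s, y) to (len(y), -s), so that shorter hypotheses are always popped before longer ones, restores breadth-first exploration and hence recovers standard beam search (Alg. 1); this is the length-based prioritization scheme described in § 1.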