Best-First Beam Search

Clara Meister    Tim Vieira    Ryan Cotterell∗

ETH Zürich    Johns Hopkins University    ∗University of Cambridge
[email protected]    [email protected]    [email protected]

Abstract

Decoding for many NLP tasks requires an effective heuristic algorithm for approximating exact search because the problem of searching the full output space is often intractable, or impractical in many settings. The default algorithm for this job is beam search—a pruned version of breadth-first search. Quite surprisingly, beam search often returns better results than exact inference due to beneficial search bias for NLP tasks. In this work, we show that the standard implementation of beam search can be made up to 10x faster in practice. Our method assumes that the scoring function is monotonic in the sequence length, which allows us to safely prune hypotheses that cannot be in the final set of hypotheses early on. We devise effective monotonic approximations to popular nonmonotonic scoring functions, including length normalization and mutual information decoding. Lastly, we propose a memory-reduced variant of best-first beam search, which has a similar beneficial search bias in terms of downstream performance, but runs in a fraction of the time.

1 Introduction

Beam search is a common heuristic algorithm for decoding structured predictors (e.g., neural machine translation models and transition-based parsers). Because of the widespread adoption of recurrent neural networks and other non-Markov models, traditional solutions, such as the Viterbi algorithm (Viterbi, 1967), are prohibitively inefficient; this makes beam search a common component of many state-of-the-art NLP systems. Despite offering no formal guarantee of finding the highest-scoring hypothesis under the model, beam search yields impressive performance on a variety of tasks—unexpectedly providing a beneficial search bias over exact search for many tasks (Stahlberg and Byrne, 2019).

Within NLP, most research on beam search has focused on altering the log-probability scoring function to return improved results, for example, higher BLEU scores (Wu et al., 2016; Murray and Chiang, 2018; Shu and Nakayama, 2018; Yang et al., 2018) or a more diverse set of outputs (Vijayakumar et al., 2016). However, little work has been done to speed up beam search itself. Filling this gap, this paper focuses on reformulating beam search in order to make it faster. We propose best-first beam search, a prioritized version of traditional beam search that is up to an order of magnitude faster in practice while still returning the same set of results. We additionally discuss an even faster heuristic version of our algorithm that further limits the number of candidate solutions, leading to a smaller memory footprint while still finding good solutions.

Concretely, we offer a novel interpretation of beam search as an agenda-based algorithm where traditional beam search is recovered by utilizing a length-based prioritization scheme. We prove that a specific best-first prioritization scheme, as in classic A∗ search (Hart et al., 1968), allows for the elimination of paths that will necessarily fall off the beam; for many scoring functions, including standard log-probability scoring, we can still guarantee the same k hypotheses as traditional beam search are returned. Indeed, our algorithm returns beam search's top hypothesis the first time it encounters a complete hypothesis, allowing the program to stop early. Further, we discuss the application of best-first beam search to several popular scoring functions in the literature (He et al., 2016; Li et al., 2016); this demonstrates that we have a general framework for adapting a variety of rescoring methods and alternate objectives to work with our algorithm.

Transactions of the Association for Computational Linguistics, vol. 8, pp. 795–809, 2020. https://doi.org/10.1162/tacl_a_00346. Action Editor: Kristina Toutanova. Submission batch: 2/2020; Revision batch: 6/2020; Published 12/2020. © 2020 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Empirically, we compare best-first beam search to ordinary beam search on two NLP sequence-to-sequence tasks: neural machine translation (NMT) and abstractive summarization (AS). On NMT, we find that our algorithm achieves roughly a 30% speed-up over traditional beam search, with increased gains for larger beams (e.g., ≈ 10x for a beam of 500). We find similar results hold for AS. Finally, we show that our memory-reduced version, which limits the number of active hypotheses, leads to additional speed-ups over best-first beam search across beam sizes while maintaining similar BLEU scores.

2 Sequence Transduction

Our work focuses on sequence-to-sequence transduction: predicting an output sequence given an input sequence. One such task is machine translation, wherein a source-language sentence is mapped ("transduced") to a target-language sentence. While our exposition focuses on sequence-to-sequence prediction, our algorithms are directly applicable to any sequential structured prediction model, such as transition-based parsers (Nivre et al., 2008) and sequence taggers (McCallum et al., 2000; Lafferty et al., 2001).

Notation. Let x = ⟨x1, ..., xNx⟩ be an input sequence of length Nx and, likewise, let y = ⟨y1, ..., yNy⟩ be an output sequence of length Ny. Each yt is an element of V, the set of output tokens.[1] Finally, let Y(x) be the set of all valid output sequences (i.e., complete hypotheses); for the task of language generation, which we focus on experimentally, this set is defined in (3). In other words, every valid sequence begins and ends with distinguished tokens (BOS and EOS, respectively). Furthermore, each sequence has at most length nmax(x)—which is typically dependent on x—a restriction we impose to ensure termination. Some applications may require a stronger coupling between Y(x) and x (e.g., |x| = |y|). We drop the dependence of Y and nmax on x when it is clear from context.

Scoring. We consider a general additively decomposable scoring model of the form

    score(x, y) = Σ_{t=0}^{Ny} score(x, y<t ◦ yt)    (4)

For probabilistic models, decoding typically targets the maximum a posteriori hypothesis

    yMAP def= argmax_{y ∈ Y(x)} p(y | x)    (2)

We note that (5) is the scoring function used for decoding many language generation models.[2]

Beam search. The worst-case running time of exactly computing (1) is exponential in nmax; namely, O(|V|^nmax).[3] Beam search is a commonly used approximation to (1) in NMT and language generation tasks. It is used in many (if not most) state-of-the-art NLP systems (Wu et al., 2016; Serban et al., 2017; Edunov et al., 2018; Yang et al., 2019). Beam search may be understood as a pruned version of the classic path-search algorithm, breadth-first search (BFS), where the breadth is narrowed to the beam size k. Pseudocode is given in Alg. 1.

Although beam search does not solve (1) exactly, it is a surprisingly useful approximation for NLP models. In many settings, beam

[1] BOS and EOS are typically members of V. Often, EOS counts towards the nmax length limit while BOS does not. This is reflected in (3).

[2] To see why, apply exp (an order-preserving transformation): exp(score_s2s(x, y)) = exp(Σ_t log p(yt | y<t, x)) = Π_t p(yt | y<t, x) [...]
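The log-probability score just discussed is monotonically decreasing in hypothesis length: each additional token contributes a term log p(yt | y<t, x) ≤ 0, so a prefix's score can only go down as it is extended. A minimal sketch of this property, using hypothetical per-token probabilities rather than a trained model (`score_s2s` and `prefix_scores` are illustrative helper names, not from the paper):

```python
import math

def score_s2s(log_probs):
    """Additively decomposable log-probability score: the sum of the
    per-token terms log p(y_t | y_<t, x), as in eq. (4)."""
    return sum(log_probs)

def prefix_scores(log_probs):
    """Scores of all prefixes of a hypothesis. Because every term is
    non-positive, the sequence of prefix scores is non-increasing."""
    scores, total = [], 0.0
    for lp in log_probs:
        total += lp
        scores.append(total)
    return scores

# Hypothetical per-token probabilities for one hypothesis.
lps = [math.log(p) for p in (0.5, 0.8, 0.9)]
```

This monotonicity is exactly what later licenses pruning: once a prefix scores below the k-th best complete hypothesis, no extension of it can recover.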

search outperforms exact methods in terms of downstream evaluation (Koehn and Knowles, 2017; Stahlberg and Byrne, 2019). For the remainder of this paper, we will pivot our attention away from exact solutions to (1) to exact solutions to the beam search output.

Definition 2.1. k-optimal hypothesis. We say that a hypothesis is k-optimal if it is the top hypothesis returned by beam search with beam size k.

Algorithm 1 Standard beam search[4]
Input: x: source sentence
       k: maximum beam size
       nmax: maximum hypothesis length
       score(·, ·): scoring function
 1: B0 ← {⟨0, BOS⟩}
 2: for t ∈ {1, ..., nmax − 1}:
 3:   B ← ∅
 4:   for ⟨s, y⟩ ∈ Bt−1:
 5:     if y.last() = EOS:
 6:       B.add(⟨s, y⟩)
 7:       continue
 8:     for y′ ∈ V:
 9:       s ← score(x, y ◦ y′)
10:       B.add(⟨s, y ◦ y′⟩)
11:   Bt ← B.top(k)
12: return B.max()

Algorithm 2 General decoding scheme.[4],[5] Highlighted sections are choice points in the algorithm for which values determine the search strategy. See §3.1 for detailed explanation.
Input: x: source sentence
       nmax: maximum hypothesis length
       score(·, ·): scoring function
       ≻: comparator                 (choice point 1)
       stop(·): stopping criterion   (choice point 2)
       k: maximum beam size          (choice point 3)
       h(·, ·): heuristic function   (choice point 4)
 1: Q ← priority_queue(≻)
 2: Q.push(⟨0, BOS⟩)
 3: POPS ← counter()
 4: while not stop(Q) and not Q.empty():
 5:   ⟨sh, y⟩ ← Q.pop()
 6:   if POPS[|y|] ≥ k or |y| > nmax:
 7:     continue
 8:   POPS[|y|] ← POPS[|y|] + 1
 9:   if y.last() = EOS:
10:     Q.push(⟨sh, y ◦ EOS⟩)
11:   else:
12:     for y′ ∈ V:
13:       s ← score(x, y ◦ y′)
14:       sh ← s + h(x, y ◦ y′)
15:       Q.push(⟨sh, y ◦ y′⟩)
16: return Q.pop() if not Q.empty() else null

3 A∗ Beam Search

We develop a meta-algorithm that is parameterized by several choice points. Our general search algorithm for decoding (Alg. 2) takes an arbitrary prioritization function, stopping criterion, and search heuristic. With certain values of these attributes, we recover many common search algorithms: greedy search, beam search, best-first search (Dijkstra, 1959), and A∗ search (Hart et al., 1968). We propose an alternate prioritization function for beam search that allows for faster decoding while still returning the same k-optimal set of hypotheses.

3.1 Choice Points of Alg. 2

Here we review the components of our meta-algorithm (the highlighted sections in Alg. 2) that can be varied to recover different search strategies:

1. ≻ : Y × Y → {True, False}. A priority queue Q maintains the set of active hypotheses. Elements in this set are ordered according to a generic comparator ≻. When its peek() (or pop()) methods are called, the first element ordered by ≻ is returned (or returned and removed).

2. stop(·) : Collection⟨Y⟩ → {True, False}. The algorithm terminates according to a configurable stopping criterion based on the current set of elements in Q.

[4] Often, the score function is additively decomposable in t, such as (5). Implementations can exploit this fact to make each score evaluation (line 9) O(1) rather than O(t). We did not make this implementation detail explicit in Alg. 1 or Alg. 2 for generality and simplicity.

[5] If the last token of y is the end symbol (e.g., EOS), then y is not expanded any further. One can either regard y as any other hypothesis albeit with y ◦ yt = y, or keep appending EOS (i.e., y ◦ yt = y ◦ EOS) so that time step and length can be regarded as synonymous. We adopt the latter standard for comparability with subsequent algorithms.

Let length-first denote the comparator ⟨sh, y⟩ ≻ ⟨s′h, y′⟩ ⟺ |y| < |y′| or (|y| = |y′| and sh ≥ s′h), and score-first the comparator ⟨sh, y⟩ ≻ ⟨s′h, y′⟩ ⟺ sh > s′h or (sh = s′h and |y| < |y′|).

              Beam Search                       Best-First Beam Search            A∗ Beam Search
1 (≻)         length-first                      score-first                       score-first
2 (stop(Q))   y.last() = EOS ∀ ⟨·, y⟩ ∈ Q       Q.peek().last() = EOS             Q.peek().last() = EOS
3 (k)         beam size                         beam size                         beam size
4 (h)         0                                 0                                 any admissible heuristic

              Breadth-First Search              Best-First Search                 A∗ Search
1 (≻)         length-first                      score-first                       score-first
2 (stop(Q))   y.last() = EOS ∀ ⟨·, y⟩ ∈ Q       Q.peek().last() = EOS             Q.peek().last() = EOS
3 (k)         ∞                                 ∞                                 ∞
4 (h)         0                                 0                                 any admissible heuristic

Table 1: Values at choice points for various search algorithms. Note that any admissible heuristic may be used for variants of A∗ search.
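To make the meta-algorithm concrete, the following is a minimal Python sketch of Alg. 2 instantiated with the best-first beam search choice points from Table 1 (score-first comparator, peek-based stopping, h = 0). The function and toy-model names (`general_decode`, `toy_cond`, `toy_score`) and all probabilities are illustrative assumptions, not the paper's implementation; since Python's `heapq` is a min-heap, the comparator is encoded as a sort key.

```python
import heapq
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"

def general_decode(score, vocab, key, stop, k, n_max, h=lambda x, y: 0.0, x=None):
    """Sketch of Alg. 2. `key(sh, y)` encodes the comparator as a heapq sort
    key (smallest pops first); `stop(q)` is the stopping criterion; at most
    k hypotheses of each length are expanded (choice point 3)."""
    q = [(key(0.0, (BOS,)), 0.0, (BOS,))]
    pops = Counter()
    while q and not stop(q):
        _, sh, y = heapq.heappop(q)
        if pops[len(y)] >= k or len(y) > n_max:
            continue  # pruned: k paths of this length already expanded
        pops[len(y)] += 1
        if y[-1] == EOS:
            heapq.heappush(q, (key(sh, y + (EOS,)), sh, y + (EOS,)))
        else:
            for tok in vocab:
                cand = y + (tok,)
                sh2 = score(x, cand) + h(x, cand)
                heapq.heappush(q, (key(sh2, cand), sh2, cand))
    return heapq.heappop(q)[2] if q else None

# Best-first beam search (Table 1): higher score first, ties -> shorter length.
bf_key = lambda sh, y: (-sh, len(y))
bf_stop = lambda q: q[0][2][-1] == EOS  # stop when the queue's head is complete

# Hypothetical toy model: EOS is unlikely before three tokens are produced.
def toy_cond(prefix, tok):
    table = ({"a": 0.6, "b": 0.3, EOS: 0.1} if len(prefix) < 3
             else {"a": 0.2, "b": 0.1, EOS: 0.7})
    return table[tok]

def toy_score(x, y):
    return sum(math.log(toy_cond(y[:t], y[t])) for t in range(1, len(y)))

best = general_decode(toy_score, ("a", "b", EOS), bf_key, bf_stop, k=2, n_max=5)
```

With these settings the search stops the first time the queue's head is a complete hypothesis, mirroring the early-stopping behavior analyzed in §4.1; swapping in the length-first comparator and the all-complete stopping criterion from Table 1 would recover standard beam search.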

3. k ∈ N>0. Only k paths of a given length are considered. If the algorithm has already encountered k paths of a given length, subsequent paths of that length are not evaluated. If we take k = ∞, we recover unpruned search algorithms.

4. h(·, ·) : X × Y → R. A heuristic function h(x, y) can be used during search to change the priority in which paths are evaluated. We note that with pruning, a heuristic may change the value of the k-optimal hypothesis (see §4.1).

Recovering Beam Search. To recover beam search from Algorithm 2, we use the choice points from Table 1. Explicitly, the comparator prioritizes hypotheses from earlier time steps first, but breaks ties with the hypotheses' scores under the model. We note that while the standard algorithm for beam search does not prioritize by score within a time step, variations of the algorithm use this strategy so they can use early-stopping strategies (Klein et al., 2017; Huang et al., 2017). Beam search terminates once either all hypotheses end in EOS or the queue is empty (i.e., when the k beams have been extended nmax time steps but none end in EOS). In the second case, no complete hypothesis is found. Finally, choosing the heuristic h(x, y) = 0 makes the algorithm a case of standard best-first search.

Note that, while standard beam search returns a set, Alg. 2 only returns the k-optimal hypothesis. This behavior is sufficient for the majority of use cases for beam search. However, if the full set of k hypotheses is desired, the stopping criterion can be changed to evaluate true only when k hypotheses are complete. Under the other beam search settings, this would provably return the same set as beam search (see §4.1).

Recovering A∗. To recover the traditional A∗ search algorithm, we use the comparator that prioritizes hypotheses with a higher score first; ties are broken by hypothesis length. The algorithm terminates when the first item of Q contains an EOS. If we take k = ∞, best-first beam search recovers A∗. Any admissible heuristic may be used for h(x, y).

Definition 3.1. Admissible Heuristic. A heuristic h is admissible if it never overestimates the future cost—or underestimates the future reward—of continuing down a path.

3.2 Best-First Beam Search

In its original form, A∗ search may traverse the entire O(|V|^nmax) graph, which, as discussed earlier, is intractable for many decoding problems. While standard beam search addresses this problem by limiting the search space, it still has computational inefficiencies—namely, we must analyze k hypotheses of a given length (i.e., time step), regardless of how poor their scores may already be, before considering longer hypotheses.

However, prioritization by length is not strictly necessary for finding a k-optimal hypothesis. As is done in A∗, we can use score as the prioritization scheme and still guarantee optimality—or k-optimality—of the paths returned by the algorithm. We define A∗ beam search as the A∗ algorithm where breadth is limited to size k. Further, we define best-first beam search as the case of A∗ beam search when no heuristic is used (see Table 1 [...] reached EOS. We prove the k-optimality of these hypotheses under best-first beam search in §4.1.

3.3 Implementation Details

Standard beam search forms a separate set of active hypotheses for each time step, that is, each Bt is its own set. Once Bt has been narrowed down to the top k, the previous B [...]
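The memory-reduced variant evaluated later (§6.4) caps the agenda at k·γ entries, discarding the lowest-priority hypotheses on overflow. A sketch of such a capacity-limited agenda follows; the class and method names are illustrative assumptions, not the paper's implementation, and a min–max heap (as cited in §4.2) would make `pop_best` logarithmic rather than linear.

```python
import heapq

class BoundedMaxQueue:
    """Bounded agenda sketch: a max-priority queue whose size is capped;
    pushing onto a full queue evicts the lowest-scoring hypothesis."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = []  # min-heap on score: the root is the worst hypothesis

    def push(self, score, hyp):
        if len(self._items) < self.capacity:
            heapq.heappush(self._items, (score, hyp))
        elif score > self._items[0][0]:
            heapq.heapreplace(self._items, (score, hyp))  # evict current worst
        # else: the new hypothesis is worse than everything kept; drop it

    def pop_best(self):
        best = max(self._items)  # O(n) here; a min-max heap gives O(log n)
        self._items.remove(best)
        heapq.heapify(self._items)
        return best

# Hypothetical usage: capacity 2, three pushes -> the worst item is dropped.
q = BoundedMaxQueue(capacity=2)
q.push(1.0, "hyp-a")
q.push(3.0, "hyp-b")
q.push(2.0, "hyp-c")   # evicts (1.0, "hyp-a")
best = q.pop_best()
```

Bounding capacity this way trades a possible increase in search error for a fixed memory footprint, which is the trade-off the experiments in §6.4 quantify.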


We propose a variant of best-first beam search [...]

4 Algorithm Analysis

4.1 Correctness

We show the equivalence of the top hypothesis[6] returned by beam search and best-first beam search when score(·, ·) is monotonically decreasing in t, length-based prioritization is used, and the beam size k is the same for both algorithms. Without loss of generality, we hold x constant in all the following proofs.

Note that we take the terms pop and push from queue terminology. Specifically, "popping a hypothesis" refers to making it past line 7 of Alg. 2, where a hypothesis y is expanded by yt ∈ V. In path search terminology, this would be equivalent to visiting a node and adding the edges from that node as potential paths to explore. Lastly, we refer to the priority queues used by beam search and best-first beam search as Q_BS and Q_A∗, respectively.

Lemma 4.1. Best-first beam search evaluates all hypotheses of a given length t in order of their score.

Proof. We prove the lemma by induction. The [...]

Case 1: score(x, y [...]

Case 2: score(x, y [...]

Case 3: score(x, y′) = score(x, y). The lemma holds if either y or y′ is popped first.

By the principle of induction, Lemma 4.1 holds for all t ∈ N>0.

Lemma 4.2. The first hypothesis that best-first beam search pops that ends in EOS is k-optimal.

Proof. Let y be the first hypothesis popped by best-first beam search ending in EOS. By rules of the priority queue, no other active hypothesis has a higher score than y. Additionally, by monotonicity of the scoring function, no other hypothesis can subsequently have score greater than y. Therefore y must be k-optimal.

Lemma 4.3. If best-first beam search pops a hypothesis, then beam search necessarily pops that same hypothesis.

Proof. We prove the lemma by induction on hypothesis length. The base case holds trivially: For hypotheses of length 0, both best-first beam search and beam search must pop the BOS as it is the only item in the queue after initialization. By the inductive hypothesis, suppose Lemma 4.3 holds for hypotheses of length [...]

[...] all extensions of those y [...] score(x, y≤t+j) ∀i ∈ {1, ..., k}. This implies that for beam search, y≤t+j would not be in the top-k paths at time step t + j since, by Lemma 4.3, paths {y(1)≤t+j, ..., y(k)≤t+j} would also be evaluated by beam search. Therefore y cannot be in H_BS, which is a contradiction.

Case 2: For no time step t + j (j ≥ 0) do we pop k paths. This can only happen if the algorithm stops early, namely, we have found k complete hypotheses y(1), ..., y(k). If this is the case, then by rules of the priority queue, each y(1), ..., y(k) must have score greater than score(x, y [...] score(x, y). This implies y cannot be in H_BS, which is a contradiction.

Non-monotonic Scoring Functions. Non-monotonic scoring functions (Definition 3.2) break the assumptions of §4.1, in which case best-first beam search is not guaranteed to return a k-optimal hypothesis. However, when the scoring [...] may cause other items to be pruned from best-first beam search that otherwise would have remained on the beam in standard beam search.

[7] For monotonic scoring functions, we have U(x, y) = 0.

4.2 Runtime

Theorem 4.6. The runtime of best-first beam search is O(nmax k (|V| log(k) + log(nmax))).

Proof. We pop at most nmax · k items. Each pop requires us to push |V| items. Each push requires log(k) time when the priority queue is implemented with a min–max heap (Atkinson et al., 1986) and incrementally pruned so that it [...] priority queue of priority queues, which requires log(nmax) time. This yields O(nmax k (|V| log(k) + log(nmax))) time.

Theorem 4.7. The runtime of standard beam search is O(nmax k |V| log(k)).

Proof. The proof is the same as Theorem 4.6, but we can forgo the percolation step in the queue of queues because standard beam search proceeds in order of hypothesis length. This yields O(nmax k |V| log(k)).

Although the theoretical bound of best-first beam search has an additional log factor compared with standard beam search, we find this to be negligible in practice. Rather, we find the number of calls to score, the scoring function under our model (e.g., a neural network), is often the bottleneck operation when decoding neural networks (see §6 for empirical evidence). In terms of this metric, the beam search algorithm makes O(k nmax) calls to score, as score is called once for each active hypothesis in B and B may evolve for nmax rounds. The worst-case number of calls to score will be the same as for beam search, which follows from Lemma 4.3.

5 Scoring Functions

Even before the findings of Stahlberg and Byrne (2019), it was well known that the best-scoring hypothesis with respect to the traditional likelihood objective can be far from ideal in practice (Wu et al., 2016; Murray and Chiang, 2018; Yang et al., 2018). For language generation tasks specifically, the results returned by neural models using the standard scoring function are often short and default to high-frequency words (Vinyals and Le, 2015; Shen et al., 2016).

To alleviate such problems, methods that revise hypothesis scores to incorporate preferences for longer, less repetitive, or more diverse options have been introduced and are often used in practice. While most such techniques change the scoring function such that it is no longer monotonic, we can still guarantee the k-optimality of the returned hypothesis for (upper) bounded scoring functions using the methods discussed in §4.1. In the remainder of this section, we present alternate scoring schemes adapted to work with best-first beam search. Additionally, we present several heuristics which, while breaking the k-optimality guarantee, provide another set of decoding strategies worth exploring.

Length Normalization. Length normalization is a widely used hypothesis scoring method that aims to counteract the propensity for shorter sequences to have higher scores under neural models; this is done by normalizing scores by hypothesis length (see Murray and Chiang [2018] for more detail).

For early stopping in beam search with length normalization, Huang et al. (2017) propose bounding the additive length reward as the minimum of a pre-determined optimal sequence length ratio r and the final sequence length Ny:

    score_LN(x, y) = score(x, y) + β · min{r|x|, Ny}    (7)

where β is the scaling parameter for the reward. We note, however, that the same can be done with the maximum sequence length nmax such that the traditional length reward used by He et al. (2016) is recovered:

    score_LN(x, y) = score(x, y) + β · min{nmax, Ny} = score(x, y) + β · Ny    (8)

We formally propose two methods for length normalization. We use the scoring functions in (7) or (8) with either: (1) the following heuristic:

    h(x, y) = 0                        for y.last() = EOS
    h(x, y) = β · max{b − |y|, 0}      for y.last() ≠ EOS    (9)

where b can be r|x| or nmax;[8] or (2) stopping criterion as in (6) albeit with scoring function score_LN and upper-bound function:

    U(x, y) = β · max{0, b − |y|}    (10)

Despite their similarities, these two methods are not guaranteed to return the same results. Whereas the second method will return the same k-optimal hypotheses as beam search, using a heuristic during pruned search means we can no longer guarantee the k-optimality of the results with respect to the scoring function as the heuristic may push hypotheses off of the beam. We present experimental results for both methods in §6.

[8] We enforce r|x| [...]

Mutual Information. Maximum mutual information decoding (Li et al., 2016) aims to alleviate the inherent preference of neural models for high-frequency tokens when using the log-probability

decoding objective. Rather than choosing the hypothesis y to maximize conditional probability with respect to the input x, we instead choose y to maximize pointwise mutual information (PMI):

    PMI(x; y) = log [ p(x, y) / (p(x) p(y)) ]    (11)

Note that (11) is equivalent to log [p(y|x)/p(y)], which can be rewritten as log p(y | x) − log p(y), making the objective additive and thus (11) can conform to (4).

From this last form, we can see how mutual information decoding penalizes high-frequency and generic outputs; the negative p(y) term, as Li et al. (2016) point out, acts as an "anti-language model." One unfortunate side effect of this objective is that ungrammatical and nonsensical outputs, which have probabilities close to 0 under a language model like p(y), end up with high scores because of the second term in the score function. To address this problem, and to upper-bound the scoring function, we propose lower-bounding the language model term by a hyperparameter 1 ≥ ε > 0. We additionally use the strength hyperparameter λ employed by Li et al. (2016):

    score_PMI(x, y) = log p(y | x) − λ log max{p(y), ε}    (12)

Similarly to our methods for length normalization, we can use the scoring function in (12) either with the heuristic:

    h(x, y) = 0                          for y.last() = EOS
    h(x, y) = −λ log ε · (nmax − |y|)    for y.last() ≠ EOS    (13)

or with stopping criterion as in (6) albeit with score_PMI and upper-bound function:

    U(x, y) = −λ log ε · (nmax − |y|)    (14)

Because −λ log ε is the best possible score at any given time step, clearly we can bound the increase in score_PMI by the above function. However, as with our length normalization strategy, we lose the k-optimality guarantee with the heuristic method for mutual information decoding. We present experimental results for both methods in §6.

6 Experiments

We run our algorithm on several language-related tasks that typically use beam search for decoding: NMT and AS. Specifically, experiments are performed on IWSLT'14 De-En (Cettolo et al., 2012), WMT'17 De-En (Bojar et al., 2017), MTTT Fr-En (Duh, 2018), and CNN-DailyMail (Hermann et al., 2015) using both Transformers (Vaswani et al., 2017) and Convolutional sequence-to-sequence models (Gehring et al., 2017).

For reproducibility, we use the data pre-processing scripts provided by fairseq (Ott et al., 2019) and follow their methods for training sequence transduction models. Hyperparameters are set in accordance with previous works. Specifically, on IWSLT'14 and MTTT tasks, we follow the recommended Transformer settings for IWSLT'14 in fairseq,[9] which are based on Vaswani et al. (2017) and Gehring et al. (2017). Hyperparameters for models trained on the WMT task are set following version 3 of the Tensor2Tensor toolkit (Vaswani et al., 2018). We use byte-pair encoding (BPE; Sennrich et al. 2016) for all languages. Vocabulary sizes for WMT and IWSLT'14 are set from recommendations for the respective tasks in fairseq; for the MTTT tasks, vocabulary sizes are tuned on models trained with standard label-smoothing regularization. Similarly, the CNN/DailyMail dataset is pre-processed and uses BPE following the same steps as Lewis et al. (2019); model hyperparameters are likewise copied. Details are available on fairseq's Web site.[10]

We use BLEU (Papineni et al., 2002) (evaluated using SacreBLEU [Post, 2018]) for MT metrics and ROUGE-L (Lin, 2004) for abstractive summarization metrics. We build our decoding framework in SGNMT.[11]

6.1 Running Time

In Table 2, we report values as the average number of calls to the scoring function per input; we do not use wall-clock time as this is heavily dependent on hardware. See Fig. 1 for empirical justification of the correlation between calls to the scoring function and runtime on the hardware our

[9] https://github.com/pytorch/fairseq/tree/master/examples/translation.
[10] https://github.com/pytorch/fairseq/blob/master/examples/bart/README.cnn.md.
[11] https://github.com/ucam-smt/sgnmt.

                  IWSLT'14 De-En                                    MTTT Fr-En                            CNN-DailyMail
                  k=5       k=10      k=100      k=500              k=10      k=100      k=500            k=5       k=10      k=100
                  (35.6)    (35.4)    (34.7)     (7.9)              (33.0)    (9.9)      (1.2)            (31.5)    (30.9)    (29.1)
BF beam search    93 (24%)  169 (36%) 1275 (79%) 1168 (736%)        184 (16%) 867 (138%) 885 (836%)       200 (33%) 305 (43%) 2960 (92%)
Beam search (ES)  107 (7%)  210 (9%)  2047 (12%) 7685 (27%)         196 (9%)  1310 (58%) 4182 (98%)       224 (19%) 357 (22%) 3942 (59%)
Beam search       115       229       2286       9770               214       2066       8281             266       435       5673

Table 2: Average number of calls (rounded to the nearest whole digit) to score, the sequence transduction model, per generated sequence when using different decoding algorithms. Percentages in parentheses are performance improvements over standard beam search. Beam search (ES) refers to the OpenNMT early-stopping method (Klein et al., 2017). All methods provably return the same solution and thus evaluation metrics (the parenthesized scores beneath each beam size) for a given beam size are identical.

Figure 1: Number of calls to scoring function score vs. total sequence generation time. Each point is a decoded sequence. Colors represent different model architectures and shapes signify the decoding algorithm used (beam sizes 3 and 10 are included for each). There is no notable difference in the overhead (time-wise) of best-first beam search and beam search.

IWSLT'14 De-En
k      method      search error   BLEU   # calls
10     shrinking   0%             35.4   229 (0%)
       early       0%             35.4   225 (2%)
       BF BS       −              35.4   169 (36%)
100    shrinking   31.7%          13.2   2278 (0%)
       early       31.7%          13.2   1738 (31%)
       BF BS       −              34.7   1275 (79%)

WMT'17 De-En
10     shrinking   0%             28.6   260 (0%)
       early       0%             28.6   252 (3%)
       BF BS       −              28.6   230 (12%)
100    shrinking   1.7%           26.4   2587 (0%)
       early       1.7%           26.4   2402 (8%)
       BF BS       −              26.9   2046 (26%)

Table 3: BLEU, search error, and average number of calls to score for different stopping criteria. "shrinking" refers to the shrinking beam method of Bahdanau et al. (2015) and "early" refers to the stopping criterion of Huang et al. (2017). Note that neither method is guaranteed to return the same result as standard beam search. Search error and performance increases are with respect to standard beam search.

experiments were run on. For reference, in our experiments, the scoring function took on average > 99% of the total computation time, even with larger beam sizes, when overhead of the search algorithm is most significant.

We find that best-first (BF) beam search leads to significant speed-ups over both traditional beam search and beam search with early stopping, with a performance increase[12] of ≈ 8x for a beam size of 500. We likewise find that best-first beam search offers speed-ups over early stopping methods that are not guaranteed to return the same results as standard beam search (see Table 3).

6.2 Length Normalization

We experiment with both forms of length normalization presented in §5 and provide results in Table 4. We find that both methods, that is, changing the stopping criterion and using a heuristic during search, provide improvements over baseline BLEU scores albeit with different hyperparameter settings; increases are similar to improvements reported by Murray and Chiang (2018). Notably, using a heuristic causes a large percentage of search errors with respect to standard beam search using the same scoring function. However, the difference in results appears to be beneficial in terms of BLEU.

[12] Performance increase is defined as (old − new)/new.
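The two length-normalization variants compared in these experiments can be sketched as follows; the function names and numeric values are hypothetical, with `b` standing for r|x| or nmax as in §5:

```python
def score_ln(base_score, beta, r, x_len, y_len):
    """Sketch of the bounded length reward, eq. (7):
    base score plus beta * min(r * |x|, N_y)."""
    return base_score + beta * min(r * x_len, y_len)

def ln_heuristic(beta, b, y_len, ended):
    """Sketch of eq. (9): the remaining achievable length reward for an
    unfinished hypothesis; zero once the hypothesis has ended in EOS."""
    return 0.0 if ended else beta * max(b - y_len, 0)

# Hypothetical values: base score -2.0, beta = 0.5, ratio r = 1.0, |x| = 3.
uncapped = score_ln(-2.0, beta=0.5, r=1.0, x_len=3, y_len=2)   # reward = 1.0
capped = score_ln(-2.0, beta=0.5, r=1.0, x_len=3, y_len=10)    # reward capped at r|x|
```

The heuristic hands a partial hypothesis the full reward it could still earn, which is exactly why it can reorder the search relative to the stopping-criterion method.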

IWSLT’14 De-En

                      k    β    b        # calls      search error   BLEU
Heuristic             5   0.8   |x|      115 (0%)     40.6%          33.9 +0.3
                     10   1.2   |x|      229 (0%)     54.7%          33.8 +0.5
Stopping Criterion    5   0.5   nmax      73 (58%)    −              33.7 +0.1
                     10   0.5   nmax     130 (76%)    −              33.7 +0.4

MTTT Fr-En

Heuristic             5   0.8   0.7|x|   100 (8%)     16.2%          33.5 +0.2
                     10   1.0   0.7|x|   196 (9%)     25.2%          33.6 +0.6
Stopping Criterion    5   1.0   nmax      65 (66%)    −              34.1 +0.8
                     10   1.2   nmax      88 (143%)   −              34.1 +1.1

Table 4: BLEU, search error, and average number of calls to score for output obtained with the length normalization scoring function on the IWSLT’14 De-En and MTTT Fr-En test sets. Increase in BLEU is over the baseline with no length normalization. Search error and performance increases are with respect to standard beam search decoding using the same scoring function.

                      k    ε      β    # calls      search error   BLEU
Baseline              5    −     .05   115          −              33.2
                     10    −     .05   229          −              33.0
Heuristic             5   .02    .05   129 (0%)     42.7%          33.2
                     10   .02    .05   256 (0%)     42.7%          33.0
Stopping Criterion    5   3e–4   .05   114 (1%)     29.2%          33.2
                     10   5e–5   .05   224 (2%)     26.6%          33.0

Table 5: BLEU scores with the mutual information scoring function on IWSLT’14 De-En. Baseline is PMI decoding with unbounded p(y), that is, ε = 0. Search error is with respect to beam search decoding of the baseline with the same β.

6.3 Mutual Information

We train a language model on the IWSLT dataset and use it to calculate p(y) from (12), as marginalizing over y is intractable (see Li et al. [2016] for further justification). We run experiments using both of the methods discussed in § 5 and present results in Table 5. We find that both methods provide results of equivalent BLEU score compared with the baseline output, namely, results obtained with the unbounded PMI objective and beam search. Again, despite the high search error rate demonstrated by the heuristic method, evaluation metrics are still comparable.

6.4 Memory Usage

We conduct a set of experiments where we limit total queue capacity to k·γ for γ ∈ {1, ..., nmax}, as described in § 3.3, and report the BLEU score of the resulting set of hypotheses.

IWSLT’14 De-En

  k    γ      search error   BLEU        # calls
  5    2      22.7%          35.7 +0.1    43.8 (163%)
       5       4.4%          35.8 +0.2    79.8 (44%)
       nmax   −              35.6         93.0 (24%)
 10    2      22.6%          35.7 +0.3    48.4 (374%)
       5       4.5%          35.6 +0.2   126.9 (81%)
       nmax   −              35.4        169.0 (36%)

WMT’17 De-En

  5    2      29.0%          29.7 +0.2    77.5 (75%)
       5       1.2%          29.5 +0.0   115.8 (12%)
       nmax   −              29.5        118.8 (10%)
 10    2      36.6%          29.5 +0.2    97.3 (165%)
       5       2.6%          29.3 +0.0   230.0 (12%)
       nmax   −              29.3        230.2 (12%)

Table 6: BLEU scores and the number of calls to score on the IWSLT’14 De-En validation set and WMT’17 De-En test set with queue size restricted to k·γ. Note that γ = nmax is the standard best-first beam search algorithm. Performance increases are over standard beam search. Search error is with respect to beam search with the same beam width.

As shown in Table 6, we find that restricting the queue capacity does not harm output quality and, additionally, leads to an even greater increase in runtime performance. For example, runtime for decoding of IWSLT’14 with a beam size of 10 can be improved by > 3x while returning results with better evaluation metrics. We find that improvements are even more pronounced for larger beam sizes. Across beam widths and tasks, we find that search error (with respect to standard beam search) is quite low for γ = 5. Additionally, for smaller γ, the change in BLEU score demonstrates that search error in this context does not necessarily hurt the quality of results.

7 Related Work

Our work is most similar to that of Zhou and Hansen (2005), who propose beam stack search. However, they are focused on exact inference and still evaluate hypotheses in breadth-first order.
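To make the contrast with breadth-first evaluation concrete, the core of a best-first expansion order can be sketched as follows. This is a minimal illustrative sketch, not the authors' pseudocode: the function name, the per-length pruning bookkeeping, and the toy unigram model are all hypothetical, and it assumes the scoring function is monotonically non-increasing in hypothesis length (e.g., summed log-probabilities).

```python
import heapq
import math
from itertools import count

def best_first_beam_search(score, vocab, eos, k, n_max):
    """Pop the single highest-scoring hypothesis first, expand it, and
    allow at most k hypotheses of any given length (the beam)."""
    tiebreak = count()                    # FIFO tie-breaking for equal scores
    queue = [(0.0, next(tiebreak), ())]   # min-heap over negated scores
    popped_at_length = {}                 # hypotheses already popped, per length
    finished = []
    while queue and len(finished) < k:
        neg_score, _, hyp = heapq.heappop(queue)
        n = len(hyp)
        # Beam pruning: a hypothesis outside the top-k of its length is dropped.
        if popped_at_length.get(n, 0) >= k:
            continue
        popped_at_length[n] = popped_at_length.get(n, 0) + 1
        if hyp and hyp[-1] == eos:        # complete hypothesis
            finished.append((-neg_score, hyp))
            continue
        if n >= n_max:                    # length limit reached
            continue
        for tok in vocab:                 # expand by one token
            child = hyp + (tok,)
            heapq.heappush(queue, (-score(child), next(tiebreak), child))
    return finished

# Toy unigram "model": score is a sum of log-probabilities, so it is
# monotonically non-increasing in length.
log_p = {"a": math.log(0.3), "b": math.log(0.1), "</s>": math.log(0.6)}
score = lambda hyp: sum(log_p[t] for t in hyp)
hyps = best_first_beam_search(score, ["a", "b", "</s>"], "</s>", k=2, n_max=5)
# hyps holds the k best complete hypotheses, best first
```

The memory-reduced variant evaluated in § 6.4 would additionally cap the total size of `queue` at k·γ, discarding the lowest-scoring entries whenever the capacity is exceeded.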

Additionally, their algorithm requires O(nmax k) memory; although best-first beam search has the same requirements, we introduce effective methods for reducing them, namely, memory-reduced best-first beam search.

Huang et al. (2017) propose and prove the optimality of an early-stopping criterion for beam search. The authors find in practice, though, that the reduction in computation from their algorithm was generally not significant. We build on this work and introduce additional methods for avoiding unnecessary computation. Our method leads to better performance, as shown in Table 2.

Klein and Manning (2003) use A∗ for PCFG parsing; however, they use the un-pruned version for exact search, which is not applicable for NMT or AS, as the memory requirements of the algorithm are far too large for these tasks. Subsequently, Pauls and Klein (2009) provide a method for pruning this search algorithm, albeit using a threshold rather than explicitly limiting the state space. Huang et al. (2012) also adapt A∗ for a k-best decoding algorithm. Although their methods differ notably from ours, they likewise use pruning techniques that allow for substantial speedups.

Stahlberg and Byrne (2019) create an exact inference algorithm for decoding and use it to analyze the output of neural NMT models. Whereas they likewise utilize the monotonicity of the scoring function to make their method tractable, they do not focus on speed or mimicking the results of standard beam search.

8 Conclusion

We propose best-first beam search, an algorithm that allows for faster decoding while still guaranteeing k-optimality. We provide results on several sequence-to-sequence transduction tasks that show the speed-ups that our algorithm provides over standard beam search for decoding neural models. We adapt several popular alternate scoring functions to best-first beam search and provide a framework that can be used to adapt other scoring methods, such as coverage normalization (Wu et al., 2016) or diverse beam search (Vijayakumar et al., 2016). We also provide a memory-reduced version of our algorithm, which returns competitive results in a fraction of the time needed for standard beam search.

References

M. D. Atkinson, J. R. Sack, N. Santoro, and T. Strothotte. 1986. Min-max heaps and generalized priority queues. Communications of the ACM, 29(10). DOI: https://doi.org/10.1145/6617.6621

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017. Findings of the 2017 conference on machine translation. In Proceedings of the Conference on Machine Translation, Volume 2: Shared Task Papers, Copenhagen, Denmark. DOI: https://doi.org/10.18653/v1/W17-4717

Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the Conference of the European Association for Machine Translation.

Edsger W. Dijkstra. 1959. A note on two problems in connexion with graphs. Numerische Mathematik, 1(1). DOI: https://doi.org/10.1007/BF01386390

Kevin Duh. 2018. The multitarget TED talks task. http://www.cs.jhu.edu/~kevinduh/a/multitarget-tedtalks/

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1045

Jonas Gehring, Michael Auli, David Grangier, and Yann Dauphin. 2017. A convolutional encoder model for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P17-1012

Peter E. Hart, Nils J. Nilsson, and Bertram Raphael. 1968. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2). DOI: https://doi.org/10.1109/TSSC.1968.300136

Wei He, Zhongjun He, Hua Wu, and Haifeng Wang. 2016. Improved neural machine translation with SMT features. In Proceedings of the AAAI Conference on Artificial Intelligence.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems.

Liang Huang, Kai Zhao, and Mingbo Ma. 2017. When to finish? Optimal beam search for neural text generation (modulo beam size). In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D17-1227

Zhiheng Huang, Yi Chang, Bo Long, Jean-Francois Crespo, Anlei Dong, Sathiya Keerthi, and Su-Lin Wu. 2012. Iterative Viterbi A* algorithm for k-best sequential decoding. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jeju Island, Korea. Association for Computational Linguistics.

Dan Klein and Christopher D. Manning. 2003. A* parsing: Fast exact Viterbi parse selection. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. DOI: https://doi.org/10.3115/1073445.1073461

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, Vancouver, Canada. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P17-4012

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, Vancouver. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W17-3204

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning, ICML '01, San Francisco, CA, USA.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv:1910.13461. DOI: https://doi.org/10.18653/v1/2020.acl-main.703

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain. Association for Computational Linguistics.

Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. 2000. Maximum entropy Markov models for information extraction and segmentation. In Proceedings of the International Conference on Machine Learning, ICML '00, San Francisco, CA, USA.

Kenton Murray and David Chiang. 2018. Correcting length bias in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W18-6322

Joakim Nivre, Igor M. Boguslavsky, and Leonid L. Iomdin. 2008. Parsing the SynTagRus treebank of Russian. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), Manchester, UK. Coling 2008 Organizing Committee. DOI: https://doi.org/10.3115/1599081.1599162

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT: Demonstrations. DOI: https://doi.org/10.18653/v1/N19-4009

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. DOI: https://doi.org/10.3115/1073083.1073135

Adam Pauls and Dan Klein. 2009. Hierarchical search for parsing. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Boulder, Colorado. Association for Computational Linguistics. DOI: https://doi.org/10.3115/1620754.1620835

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Conference on Machine Translation: Research Papers. DOI: https://doi.org/10.18653/v1/W18-6319

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P16-1162

Iulian Serban, Tim Klinger, Gerald Tesauro, Kartik Talamadupula, Bowen Zhou, Yoshua Bengio, and Aaron Courville. 2017. Multiresolution recurrent neural networks: An application to dialogue response generation.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum risk training for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/P16-1159

Raphael Shu and Hideki Nakayama. 2018. Improving beam search by removing monotonic constraint for neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia. Association for Computational Linguistics.

Felix Stahlberg and Bill Byrne. 2019. On NMT search errors and model errors: Cat got your tongue? In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China. DOI: https://doi.org/10.18653/v1/D19-1331

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems.

Ben Taskar, Carlos Guestrin, and Daphne Koller. 2004. Max-margin Markov networks. In Advances in Neural Information Processing Systems.

Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2Tensor for neural machine translation. CoRR.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.

Tim Vieira, Ryan Cotterell, and Jason Eisner. 2016. Speed-accuracy tradeoffs in tagging with variable-order CRFs and structured sparsity. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. DOI: https://doi.org/10.18653/v1/D16-1206

Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David J. Crandall, and Dhruv Batra. 2016. Diverse beam search: Decoding diverse solutions from neural sequence models. CoRR, abs/1610.02424.

Oriol Vinyals and Quoc V. Le. 2015. A neural conversational model.

Andrew Viterbi. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2). DOI: https://doi.org/10.1109/TIT.1967.1054010

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Gregory S. Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation.

Yilin Yang, Liang Huang, and Mingbo Ma. 2018. Breaking the beam search curse: A study of (re-)scoring methods and stopping criteria for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium. Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/D18-1342

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems.

Rong Zhou and Eric A. Hansen. 2005. Beam-stack search: Integrating backtracking with beam search. In Proceedings of the International Conference on Automated Planning and Scheduling, ICAPS '05.
