Boosting Search Engines with Interactive Agents

Leonard Adolphs†∗ Benjamin Boerschinger‡ Christian Buck‡ Michelle Chen Huebscher‡

Massimiliano Ciaramita‡ Lasse Espeholt‡ Thomas Hofmann† Yannic Kilcher†∗

† ETH, Zurich ‡ Google Research

Abstract

Can machines learn to use a search engine as an interactive tool for finding in- formation? That would have far reaching consequences for making the world’s knowledge more accessible. This paper presents first steps in designing agents that learn meta-strategies for contextual query refinements. Our approach uses machine reading to guide the selection of refinement terms from aggregated search results. Agents are then empowered with simple but effective search operators to exert fine-grained and transparent control over queries and search results. We develop a novel way of generating synthetic search sessions, which leverages the power of transformer-based generative language models through (self-)supervised learning. We also present a reinforcement learning agent with dynamically con- strained actions that can learn interactive search strategies completely from scratch. In both cases, we obtain significant improvements over one-shot search with a strong information retrieval baseline. Finally, we provide an in-depth analysis of the learned search policies.

1 Introduction

Web search is the portal to a vast ecosystem of general and specialized knowledge, designed to support humans in their effort to seek relevant information and make well-informed decisions. Utilizing search as a tool is intuitive, and users quickly learn interactive, language-based search strategies characterized by sequential reasoning, exploration, and synthesis [Rutter et al., 2015, Russell, 2019]. The success of web search relies on machines learning human notions of relevance, but also on the users’ ability to (re-)formulate appropriate queries, grounded in a tacit understanding of strengths and limitations of search engines. Given recent breakthroughs in language models (LM) [Vaswani

arXiv:2109.00527v1 [cs.CL] 1 Sep 2021 et al., 2017, Devlin et al., 2019, Brown et al., 2020] as well as reinforcement learning (RL) [Mnih et al., 2013, Silver et al., 2016, Berner et al., 2019] it seems timely to ask whether, and how, machine learning agents can be trained to interactively use search engines. However, the lack of expert search sessions puts supervised learning out of reach, and RL is often ineffective for solving complex natural language understanding (NLU) tasks. The feasibility of search agents hence remains an open question, which has inspired our research. We pursue a design philosophy in which search agents operate in structured action spaces defined as generative grammars, resulting in policies that are compositional, productive, and semantically transparent. Further domain knowledge is incorporated through the use of well-known models and algorithms from NLU and information retrieval (IR). Most notably, we develop a self-supervised learning scheme for generating high-quality search sessions by exploiting insights from relevance feedback [Rocchio, 1971]. We train a supervised LM search agent based on T5 [Raffel et al., 2020] directly on this data. We also build an RL search agent based on MuZero [Schrittwieser et al., 2020], which performs planning via rule-constrained Monte Carlo tree search and a learned dynamics model. ∗Work carried out in part during internships at Google. Created by joni from the Noun Project

Created by joni Created by joni from the Noun Project from the Noun Project Figure 1: Schematic overview of the agent’s interaction with the search environment (BM25).

Our evaluation is performed on a passage retrieval task for open-domain question answer- ing [Voorhees, 2000]. The T5 agent achieves a relative improvement of 95% in ranking performance over a Lucene BM25 system, running as the search engine. The improvement of the MuZero agent is somewhat lower at 73%; however, both agents learn diverse policies that lead to deep and effective explorations of the search results. We open-source the code and trained checkpoints for both agents.2

2 Self-Supervised Learning for Interactive Search

It has been a powerful vision for more than 20 years to design search engines that are intuitive and simple to use. Despite their remarkable success, search engines are not perfect and may not yield the most relevant result(s) in one shot. This is particularly true for rare and intrinsically difficult queries, which may require interactive exploration by the user to be answered correctly and exhaustively. For instance, Russell [2019] observes that search engines are sensitive to local conditions and phrasing: if one asks Google “How many languages are there in India” the answer is 2, while Bing says 23 and Wolfram Alpha, 400.3 A comprehensive answer can be found by carefully refining the query, borrowing relevant terms (e.g., “dialects”) from the search results. By rephrasing the question as “how many languages or dialects are spoken in India” one gets a single answer clarifying that there are 2 official languages (Hindi and English), 22 major languages and 720 dialects. Contextual query refinement is a familiar technique [Jansen et al., 2009], even among children [Rutter et al., 2015], used to improve search by combining evidence from previous results and background knowledge [Huang and Efthimiadis, 2009]. Such refinements often rely on inspecting result snippets and titles or on skimming the content of top-ranked documents. This process is iterative and may be repeated to produce a sequence of queries q0, q1, . . . , qT until (optimistically) a satisfactory answer is found. It seems natural to mimic this interactive process by a search agent, which learns the basic step of generating a follow-up query from previous queries and their search results. Furthermore, it is noteworthy that power users apply dedicated search operators and sophisticated investigative strategies to solve deep search puzzles [Russell, 2019]. In particular, unary operators for exact term matches (‘+’) and for term exclusions (‘-’) offer a great deal of fine-grained control and transparency and as such are highly effective in the hands of expert users to overcome limitations in document retrieval and ranking. As we show in this paper, these operators are also pivotal in designing interactive agents that learn to use search engines.

2.1 Result Aggregation for Contextual Refinement

Inspired by human experts’ strategies, we suggest using a machine reader (MR, cf. [Rajpurkar et al., 2016]) with a passage scorer (PS) to rank the most relevant text snippets found so far. Specifically, we use a DPR reader [Karpukhin et al., 2020], which builds upon a pre-trained BERT [Devlin et al., 2019] model in order to identify the most promising answer span within each result document d and which also estimates the probability of d containing the (unspecified) answer P(d ANSWER q) [0; 1]. This probability can also be viewed as a PS-score that induces a calibrated ranking3 across| all∈ result

2https://github.com/google-research/google-research/tree/master/muzero and https:// github.com/google-research/language/tree/master/language/search_agents. 3In Russell’s words, this question “needs research”; cf. searchresearch1.blogspot.com.

2 documents within a session. Moreover, the machine reader can extract a context window around the best answer in every document. The default setting in our experiments is to use the 5 top-ranked documents and extract a 70 token window centered at the answer predicted by the reader. In addition, we include the document titles, truncated to 10 tokens. Finally, the query tokens and refinements describing qt are also included. In total this leads to a segmented observation token sequence ot which is truncated to length 512. We then use pre-trained transformer-based language models (LM), BERT or T5, to produce an≤ embedding st from which the search agent will generate the next query. If we denote result sets for qt by t, then we get diagrammatically (see also Figure 1): D   q0, . . . , qt MR/PS LM agent search engine o s q (1)  ↓  t t t+1 ,..., |7−→{z } |7−→{z } |7−→{z } D0 Dt observation encoding generation

We will focus on the case, where qt+1 is obtained from qt through augmentation. This may add a keyword w Σidx, the search index vocabulary, with the usual disjunctive search engine semantics or a structured∈ search term formed by the use of unary operators (‘+’,‘-’) and fields (see below).

2.2 Rocchio Query Expansions

In the absence of training sessions collected from human expert users, we propose to generate synthetic search sessions in a self-supervised manner, making use of a set of question-answer pairs (q, a), which – cf. §5.1 – can also be synthetic. We initialize q0 = q and aim to find a reasonably short sequence of query refinements that make progress towards identifying documents containing the correct answer (i.e. matching the string a). For this purpose, we make use of an NDCG reward function qt t rt [0; 1] (see §4 for details). A query is not further refined, if either t = 4 (maximal length)7→ D or7→ if no∈ score increasing refinement can be found. To create candidate refinements, we make use of the idea of pseudo-relevance feedback as suggested in Rocchio [1971]. An elementary refinement – called a Rocchio expansion – then takes the form

q τ α β q := q ∆q , ∆q := [+ TITLE CONTENT ] w , w Σ := Σ Σ Σ Σ (2) t+1 t t t | − | t t ∈ t t ∪ t ∪ t ∪ t where Σt refers to a set of terms accessible to the agent. By that we mean terms that occur in the top PS-ranked session documents. We use superscripts to refer to the vocabulary of the question (q), titles (τ), answers (α) or bodies (β) of documents in ot. Note that adding terms Σt would make refinements difficult to reproduce for an agent and thus would provide supervision6∈ of low utility. Another aspect of creating sessions as described above has to do with the search complexity of finding optimal sequences of Rocchio expansions. We consider q∗ = q + a as the “ideal” query, whose results define the result vocabulary Σ∗. For efficiency reasons, we further constrain the terms to be added via exact matches or term exclusions by defining respective constrained dictionaries

+ − Σ = Σ Σ∗, Σ = Σ Σ∗ . (3) t t ∩ t t − This means it is possible to upgrade accessible terms to exact matches if they also occur in the ideal result set; and to exclude accessible terms if they are not present in the ideal results. We have found experimentally that this leads to a good trade-off between the quality of Rocchio expansions and the search effort to find them. The search for sequences of Rocchio expansions is done heuristically. Details can be found in §5.1 and the Appendix.

2.3 Self-Supervised T5 Agent

We suggest to train a generative LM agent in a supervised manner by making use of synthetic search sessions generated by Rocchio expansions. We use T5 [Raffel et al., 2020], a pretrained transformer encoder-decoder model which achieves state-of-the-art results on multiple NLU tasks. As a search agent, T5 predicts a single new search expansion from an observed state, i.e., the query, previous expansions, and the top-5 title/document/answer-triples. In the spirit of everything-is-string- prediction, both state and expansions are represented as plain strings. See the Appendix for a full example.

3 Our T5 agent is trained in a purely supervised fashion: we produce multi-step episodes using the Rocchio expansions described in the previous section and treat each step as a single training example. As is common in sequence prediction tasks, we use the cross-entropy loss for optimization. At test time, we start with the initial query and incrementally add new expansions, querying the trained T5 model in every step. We then use the refined query to retrieve new documents and continue until either the set of new documents is empty or we reach the maximum number of 10 steps. Throughout the session, we maintain the top-5 documents among all those retrieved, as detailed in Section 2.1.

3 Reinforcement Learning: MuZero Agent

MuZero [Schrittwieser et al., 2020] is a state-of-the-art search-based agent characterized by a learnable model of the environment dynamics. This allows the use of Monte Carlo tree search (MCTS) to predict the next action while planning, in the absence of an explicit simulator. In the current use case, MuZero aims to anticipate the latent state implied by each action with regard to the results obtained by the search engine. For instance, in the example above, it will predict the effect of adding a certain term (“dialects”), without knowing the exact content that will be returned. This offers a performance advantage of many orders of magnitude as executing a query with the real search engine is resource-intense. Following the notation of Schrittwieser et al. [2020], the agent has the following three learnable components.

• Representation function (BERT & RNN), hθ(ot, a0:t) = [BERT(ot); RNN(a0:t)] produces the first (0-th) embedding of observation ot and the chosen actions up to the current time step a0:t by individually encoding with a BERT model and an RNN, respectively. The results are concatenated to form a single, fixed-size encoding. We omit the t index in the remainder. k−1 k k k • Dynamics function (LSTM), gθ(s , a ) = r , s , which models the environment by predicting the next (latent) state sk and rewardhrk. i k k k k • Prediction function (MLP), fθ(s ) = p , v , which predicts the action scores p and the k k h 0 i value v for state s , produced by gθ, or s , produced by hθ. hθ employs a custom BERT with additional embedding layers to represent the different parts of observations (see Appendix for more details). The output, hθ(ot, a0:t), is the starting point for the k MCTS. At each step in the MCTS simulation, the prediction function fθ(s ) generates the policy scores pk and the state value vk which are used to predict the next action using an Upper Confidence Bound score (UCB) [Kocsis and Szepesvári, 2006]. MuZero alternates game play for data collection and off-policy training. At training time, a batch of episodes is sampled from the replay buffer and all three functions are trained jointly. Each data point in a batch consists of a game at a given position, plus K unroll steps in the future regarding the next actions taken and the target value, reward, and policy stored during self-play. The component functions hθ and fθ are used for the first position, while gθ and fθ are used for the following ones.

3.1 Grammar-Guided Search

Compared to T5, the MuZero agent has a more challenging starting point: its BERT-based repre- sentation function is pre-trained on less data, it has fewer parameters (110M vs. 770M) and no cross-attention mechanism: predictions are conditioned on a single [CLS] vector. Further, it cannot easily exploit supervised or self-supervised data. However, it can more thoroughly explore the space of policies, e.g. independent of the Rocchio expansions. Through many design iterations, we have identified it to be crucial to structure the action space of the MuZero agent and constrain admissible actions and refinement terms dynamically based on context. The motivation is to provide a domain-informed inductive bias that increases the statistical efficiency of learning a policy via RL. We take inspiration from generative, specifically context-free, grammars (CFGs) [Chomsky, 1956] and encode the structured action space as a set of production rules, which will be selected in (fixed) top-down, left-to-right order. A query refinement is generated as follows

Q UQ W Q, U Op Field W, Op + , Field TITLE CONTENT, (4) ⇒ | ⇒ ⇒ | − ⇒ |

4 which allows for adding plain or structured keywords using unary operators. The selection of each refinement term W proceeds in three steps, the first two can be described by the rules

W W q W τ W β W α W idx,W x w Σx, x q, τ, β, α ,W idx w Σidx (5) ⇒ t | t | t | t | t ⇒ ∈ τ ∈ { } ⇒ ∈ which means that the agent first decides on the origin of the refinement term, i.e., the query or the different parts of the top-scored result documents, and afterwards selects the term from the corresponding vocabulary. As the term origin correlates strongly with its usefulness as a refinement term, this allows to narrow down the action space effectively. In fact, our experiments show that the most successful agents choose the smaller vocabularies W α and W q over the larger vocabulary W β (cf. Appendix), and thus control the cardinality of the subsequent action spaces. The agent is forced to pick a term from the larger vocabulary (1.6M terms) of the search index Σidx during MCTS, as there is no observable context to constrain the vocabulary. The third level in the action hierarchy concerns the selection of the terms. We have found it advantageous to make use of subword units; specifically, BERT’s 30k lexical rules involving word pieces, to generate terms sequentially, starting from a term prefix and concatenating one or more suffixes. Note that this part of the generation is context-sensitive, as we restrict the generation to x words present in the vocabulary. We make use of tries to efficiently represent each Στ and amortize computation.

4 The OpenQA Environment

We evaluate search agents in the context of open-domain question answering (Open-QA) [Voorhees, 2000, Chen et al., 2017]. Given a question q, agents search for documents that contain the answer a by interacting with a search engine, the environment, an IR system that mapsD queries to ranked lists of documents. Following common practice, we use Lucene-BM25 with default settings on the English Wikipedia. BM25 has provided the reference probabilistic IR benchmark for decades [Robertson and Zaragoza, 2009], only recently outperformed by neural models [Lee et al., 2019], and the Lucene implementation provides search operators comparable to commercial search engines. The search engine is treated as a black box. In a way analogous to how sparingly users scan search results [Joachims et al., 2007], the agent considers only the top k documents, where k = 5, as discussed in Section 2.1. The score of a results set , S( q) is computed as the Normalized Discounted Cumulative Gain (NDCG) at k D D|

k ! k ! X X S( q) = NDCG ( q) = w rel(d q) / w w−1 = log (i + 1). (6) D| k D| i i| i i 2 i=1 i=1 where rel(d q) = 1, if a d, 0 otherwise. NDCG is one of the most popular metrics in IR because it accounts for| the rank position∈ of relevant documents, the normalized form is comparable across queries, and it discriminates consistently between ranking functions [Wang et al., 2013]. Hence, the score provides a stable signal for training. Based on (6), we define the search step reward r = S( 5 q ) S( 5 q ). (7) t Dt | 0 − Dt−1| 0 5 Scores are always computed based on q0 and use the aggregated top-5 documents based on the PS-score. We train the MuZero agent directly on the reward. The reward is sparse, asD none is issued in between search steps. The T5 agent is trained indirectly on the reward via the induction of synthetic episodes with positive monotonic rewards (cf. §2.2).

5 Experiments

We use the OpenQA-NQ dataset [Lee et al., 2019], a subset of the Natural Questions data, consisting of Google queries paired with answers extracted from Wikipedia [Kwiatkowski et al., 2019]. The data includes 79,168 train questions, 8,757 dev questions and 3,610 for test. We use the provided partitions and Wikipedia dump. Following Lee et al. [2019] we pre-process Wikipedia into blocks of 288 tokens, for a total of 13M passages. We evaluate each system on the top-5 288-token passages returned. Model and checkpoint selection, as well as data analysis are performed on NQ Dev.

5 Table 1: Results on NQ Dev and Test as percentages. PS( 0) refers to a reranking of the BM25 results based on the passage selection score. T5 is finetunedD first on synthetic episodes and then on NQ Train. MZ+T5 is a combination of T5 and MuZero. HR is an estimate of the headroom.

Metric Data BM25 PS( ) MZ T5 MZ+T5 HR D0 NDCG@5 Dev 19.83 22.95 35.29 40.07 41.31 55.08 Devk 22.39 26.14 44.69 48.79 51.64 65.65 Devu 15.61 17.67 19.78 25.66 24.23 37.63 Test 21.51 24.82 37.30 41.92 42.70 54.45 Top-5 Dev 50.47 50.47 58.29 60.35 62.76 72.15 Test 53.76 53.76 60.55 62.38 64.32 72.52 EM Dev 15.31 25.15 26.09 27.55 27.72 34.03 Test 14.79 25.87 26.26 27.78 28.12 33.74

5.1 Synthetic Search Episodes and Synthetic QA Data

We generate synthetic search sessions using Rocchio expansions (§2.2). At each step, we attempt at most M BM25 searches, based on the current observation ot. We sort the terms in ot by Lucene’s IDF score and keep the top N. For the NQ data we set N=M=100. We use the NQ Dev and Test sessions for headroom estimates. For NQ Train, we find 83k Rocchio expansions from 48k questions. We also produce large amounts of synthetic sessions from synthetic QA data. For each Wikipedia passage, we generate synthetic (q, a) pairs applying the procedure proposed in [Alberti et al., 2019], using a dedicated T5 system trained on NQ Train. We filter out generic questions leading to multiple answers or overlapping with any NQ partition. We obtain 4M synthetic pairs. Using a lower quality setting, N=20, M=10 (for high throughput), we find 3M expansions from 1.7M questions. The expansion steps, either from synthetic pairs or from NQ Train, are used to train the T5 agent.

5.2 Agents Training and Inference

4 The DPR reader/selector and MuZero’s hθ function use 12-layer BERT systems. To train the former, we generate for each query in NQ Train 200 candidate passages from our BM25 system, picking one positive and 23 negative passages for each query at random whenever the query is encountered during training (cf. the Appendix for more details). The reader/selector is not trained further. The MuZero implementation is scaled and distributed via an agent-learner setup [Espeholt et al., 2018] in the SEED RL [Espeholt et al., 2020] framework allowing for centralized batching of inference for effective use of accelerators. MuZero is trained on NQ Train for a total of 2.6 million steps ( 10 days) using 500 CPU-based actors and 4 Cloud TPU v2 for inference and training on the learner.≈ 5 For each step, 100 simulations are performed. During training, we limit sessions to a maximum of five steps. The agent also can decide to stop early by selecting a dedicated stop action. The T5 agent6 is first trained on the expansions from the synthetic QA data for 60k steps followed by another 200 steps on the NQ Train expansions. Training took about 1.5 days using 16 Cloud TPU v3. At inference time, we found that fixing the sessions to 10 steps worked best for both T5 and MuZero. We report detailed training configurations and ablations in the Appendix.

5.3 Results

Table 1 summarizes the results on the NQ Dev and Test partitions. Our baseline is BM25 with a NDCG@5 of 21.5 on Test. The same documents are re-ranked effectively (+3 points) by the passage selection score (PS( 0)). Our agents improve performance by large margins. The MuZero agent (MZ) has an NDCGD score of 37.30, a +15.8 absolute improvement (+73% relative). The T5 agent score is 41.92, or +20.4 (+95%). Figure 2a shows that the results are consistent across question types.

4BERT-base, initialized from https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1. 5For details, see https://cloud.google.com/tpu. 6We start from the pretrained T5-Large (770 million parameters) checkpoint.

6 BM25 MZ T5 0.5 MZ 0.4 0.4 T5

0.3 0.3 NDCG 0.2 0.2

0.1 0.1 Frequency in Top-5 Docs

0.0 0.0

1 - 5 6 - 50 the (2%) 51 - 100 > 1000 how (6%) 101 - 250 251 - 500 who (39%)when (21%)what (12%)where (8%) which (2%)other (10%) 501 - 1000 (a) Ranking performance on question types. (b) MuZero and T5 depth of documents explored. Figure 2

Aggregating the results by their PS score improves NDCG further to 42.7 (MZ+T5). While this may seem like a small improvement, we find that the two agents implement partially complementary policies (cf. §5.4): On NQ Dev MuZero improves 818 questions not improved by T5, and T5 improves 1531 not improved by MuZero. This is evidence that multiple policies exist and more successful ones may be learned; the headroom (HR) estimated via oracle document selection on the Rocchio sessions (55 NDCG@5) supports this. Noticeably, T5 also solves 473 questions not solved by the headroom procedure (cf. §5.1), 357 for MuZero. We evaluate also Top-5 accuracy: the queries for which any of the top-5 documents contain the answer. This is a recall metric, not directly captured by NDCG, which is a ranking metric. Still, MuZero improves by 6.8 points, T5 by 8.6, and 10.5 combined. Hence, search agents not only work for partially solved queries but also for unsolved ones. Top-5 accuracy offers an opportunity to compare with neural retrieval methods. Karpukhin et al. [2020, §1] report 65.27 for DPR. Our best results (62.7 on Dev) are just a few points below DPR’s, reinforcing the thesis that modeling the user side of search is a viable research direction. We plan to investigate neural or hybrid retrieval environments in future work. Such architectures introduce the challenge of training neural retrievers allowing search operators or redesigning the action space and data generation without them. Finally, we report Exact Match (EM): the share of questions where the reader extracts the gold answer, and the PS score ranks its passage in position 1. Agents improve, which is positive, but far from state of the art levels. This is due to two reasons. First, the reward does not capture answer span quality. Second, the reader and passage selector are not trained on the agents’ output, only on BM25. We discuss this point further below, in Section 5.5.

5.4 Qualitative Analysis

We run a qualitative analysis on NQ Dev. First, we note that agents perform better on questions whose answer appeared in the training data, Devk (answer known) vs. Devu (unknown) in Table 1. In this respect, agents behave similarly to other popular systems, as pointed out in [Lewis et al., 2021a]. T5, a generative model, is more robust, which is consistent with the findings of Izacard and Grave [2021]. MuZero and T5 learn different policies. MuZero uses "+CONTENT" 54% of the time and "-TITLE" 43%. T5 uses "+CONTENT" 85% of the time, "-CONTENT" 12%, and "+TITLE" 2%. With respect to term sources, MuZero uses primarily answers (57%) and questions (43%). We can only track the source loosely for T5, as it generates without trace. We find its search terms in the document (99%), answer (91%), or title (88%), less often in questions (63%). Even though some rules are used sparsely, they are all valuable. Ablating the grammar of the least used rules leads to worse results. MuZero’s policy resembles filtering, with the aggressive use of "-TITLE" leading to more diverse results. T5’s policy is more subtle, relying mostly on topical expansions. Table 2 shows one representative example of each approach. See the Appendix for more examples, with a discussion.

7Up to 66.8 on NQ Dev for some configurations [Karpukhin et al., 2020, Appendix].

7 Table 2: Examples of MuZero and T5 agents’ strategies. Question when was korea separated into north and south Answer 1945 BM25 (NDCG: 0) 1st Doc Title: relations leader Kim Dae-jung by the Korean CIA. Talks restarted, however, and between 1973 and 1975 there were 10 meetings of the North-South Coordinating Committee at . . . MuZero (NDCG: 100) Query "q" -(title:"korea") -(title:"north") -(title:"separated") +(contents:"1945") -(title:"south") +(title:"korean") -(title:"historical") -(title:"1949") 1950 1st Doc Title: Eighth (BM25 Rank: ’>1000’) At the end of World War II in 1945 , Korea was divided into North Korea and South Korea with North Korea (assisted by the Soviet Union), becoming a communist government . . . Question who cut his hair and lost his strength Answer Samson BM25 (NDCG: 0) 1st Doc Title: Hair Removal . . . grabbed "Resistance" founder (and student-body president) David Harris , cut off his long hair, and shaved his beard. During European witch-hunts of the Medieval . . . T5 (NDCG: 100) Query "q" +(contents:"strength") +(contents:"samson") +(contents:"long") -(contents:"judah") +(contents:"testament") +(contents:"old") -(contents:"book") +(contents:"god") 1st Doc Title: Seduction (BM25 Rank: ’>1000’) . . . describes Delilah seducing Samson who was given great strength by God, but ulti- mately lost his strength when she allowed the Philistines to shave his hair off . . .

MuZero collects on average 21.4 different documents in an episode, 15.6 for T5. Both dig deep into the search results. Figure 2b computes the rank of the documents in the original BM25 results. Both agents have many documents ranking below 1000; however, there are more of these documents for MuZero than T5. This is positive, as one of the main motivations for using methods like RL is behavior discovery; in contrast, T5 can only inherit the exploratory behavior from the Rocchio policy. Both agents’ documents in the example of Table 2 rank below 1000 for BM25.

5.5 Limitations

The least robust component in the current architecture is the observation builder. The DPR system is only trained once at the beginning, on the output of BM25. As a consequence, the reader and passage selector effectively operate out of distribution. This is more evident in the RL agent than T5. MuZero is able to exploit this feature adversarially, pushing the selector where it is more likely to give high scores regardless. For example, for “when did tampa bay lightning win the stanley cup”, MuZero requests terms for multiple years, bringing up irrelevant documents like “Isle of Wight Festival” or “Subiaco Football Club” that are filled with dates and fool the reader/selector. This issue can be addressed by careful reward design and joint training of all components within the agent.8 The latter involves primarily engineering complexity. However, it is non-trivial as, ideally, training of the observation builder would go along with the agent’s training, e.g., interleaving the two as in DQNs [Mnih et al., 2013]. This brings up a broader limitation of RL for real-world tasks: complexity. A remarkable effort was required to design, implement and train an effective RL search agent. Nevertheless, several fundamental NLU problems lend themselves naturally to RL formulations. We hope that, in the future, these will be regarded as teething problems. We have tried several ways of combining supervised and RL signals. Pre-training tasks for MuZero, hybrid supervised and experienced replay buffers fed by RL or heuristic actors, using the Rocchio policy to provide advice during MCTS, using MCTS to select the next token while decoding in T5, and more. Nothing worked convincingly enough yet, but this stands as an important area for future research. As a positive, we feared the T5 agent might suffer from exposure bias. Instead, provided with enough training data, it performed well in sequential prediction.

8There is an interesting side to this, as the agent can poke holes in the task. In OpenQA, all that matters is the answer string. Short and generic, or long and over-specific, answers can lead to counter-intuitive results.

8 We conclude by considering that pre-trained language models of the kind used here have been shown to capture societal biases [Tan and Celis, 2019, Webster et al., 2020]. We have no reason to believe our architectures would exacerbate biases, but the overall problems may persist. We hope that end-to-end optimization methods, as in this proposal, will in the future help address these problems.

6 Related Work

Query optimization is an established topic in IR. Methods range from hand-crafted rules [Lawrence and Giles, 1998] to data-driven transformation patterns [Agichtein et al., 2001]. Narasimhan et al. [2016] use RL for information extraction: the actions of the agent include querying the web with heuristic query templates. Nogueira and Cho [2017] and Buck et al. [2018] use RL-trained QA agents to seek good answers by reformulating questions with seq2seq models. However, these methods are limited to one-step episodes and queries to natural language terms. This type of modeling is closely related to the use of RL for neural machine translation, whose value is currently debated [Choshen et al., 2020, Kiegeland and Kreutzer, 2021]. Das et al. [2019] propose to perform query reformulation in embedding (continuous) space and find that it can outperform the sequence-based approach. Other work at the intersection of IR and RL has investigated bandits methods for news recommenda- tion [Li et al., 2010] and learning to rank [Yue and Joachims, 2009] in the context of the user-search engine interaction where the agent is the system. We consider the search problem from the user perspective. This setup is characterized by imperfect information. The search engine has access to the entire corpus and full documents. Similarly to users, agents only examine the top few snippets. The literature on searchers behavior is vast, e.g., see [Strzelecki, 2020] for an overview of eye-tracking studies. While behavior evolves with interfaces, users keep parsing results fast and frugally, attending to just a few items. From a similar angle, Yuan et al. [2020] offer promising findings on training QA agents with RL for templatized information-gathering and answering actions. Most of the work in language-related RL is otherwise centered on synthetic navigation/arcade environments [Hu et al., 2019]. This line of research shows that RL for text reading can help transfer [Narasimhan et al., 2018] and generalization [Zhong et al., 2020] in synthetic tasks but skirts the challenges of more realistic language-based problems. On the topic of grammars, Neu and Szepesvári [2009] show that Inverse RL can learn parsing algorithms in combination with PCFGs [Salomaa, 1969]. Current work in OpenQA focuses on the search engine side of the task, typically using dense neural passage retrievers based on a dual encoder framework instead of BM25 [Lee et al., 2019, Karpukhin et al., 2020]. Leveraging large pre-trained language models to encode the query and the paragraphs separately led to a performance boost across multiple datasets, not just in the retrieval metrics but also in exact-match score. While Karpukhin et al. [2020] use an extractive reader on the top-k returned paragraphs, Lewis et al. [2020b] further improves using a generative reader (BART [Lewis et al., 2020a]). This design combines the strengths of a parametric memory – the pre-trained LM – with a non-parametric memory – the retrieved Wikipedia passages supplied into the reader’s context. This idea of combining a dense retriever with a generative reader is further refined in Izacard and Grave [2021], which fuses multiple documents in the decoding step. A recent line of work is concerned with constraining the model in terms of the number of parameters or retrieval corpus size while remaining close to state-of-the-art performance [Min et al., 2021]. This effort led to a synthetic dataset of 65 million probably asked questions [Lewis et al., 2021b]. This dataset is used to either do a nearest neighbor search on the question – no learned parameters needed – or train a closed-book generative model without access to a retrieval corpus. Xiong et al. [2021] successfully use relevance feedback by jointly encoding the question and the text of its retrieved results for multi-hop QA.

7 Conclusion

Learning to search is an aspiring goal for AI, touching on key challenges in NLU and ML, with broad societal potential for information accessibility. Our paper provides two main contributions. First, we open up the area of search session research to supervised language modeling. Second, we provide evidence for the ability of RL to discover successful search policies in a challenging task, characterized by sparse rewards and a high-dimensional, compositional action space. Our findings agree with a long-standing tradition in psychology that argues against radical behaviorism – i.e., pure reinforcement-driven learning, from tabula rasa – for language [Chomsky, 1959]. RL

9 agents require a remarkable share of hard-wired domain knowledge. LM-based agents are more powerful, but they need abundant task-specific data for fine tuning. Supplied with the right inductive bias, supervised and RL search agents prove surprisingly effective. Different architectures learn different policies, suggesting broad possibilities in the design space for future work.

Acknowledgments

We would like to thank for their valuable feedback: Robert Baldock, Marc Bellemare, Jannis Bulian, Michelangelo Diligenti, Sylvain Gelly, Thomas Hubert, Rudolf Kadlec, Kenton Lee, Simon Schmitt, Julian Schrittwieser, Pier Giuseppe Sessa, David Silver.

References E. Agichtein, S. Lawrence, and L. Gravano. Learning search engine specific query transformations for question answering. In Proceedings of WWW10, pages 169–178, 2001. C. Alberti, D. Andor, E. Pitler, J. Devlin, and M. Collins. Synthetic QA corpora generation with roundtrip consistency. In Proceedings of the 57th Annual Meeting of the Association for Computa- tional Linguistics, pages 6168–6173, 2019. C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, R. Józefowicz, S. Gray, C. Olsson, J. Pachocki, M. Petrov, H. P. de Oliveira Pinto, J. Raiman, T. Salimans, J. Schlatter, J. Schneider, S. Sidor, I. Sutskever, J. Tang, F. Wolski, and S. Zhang. Dota 2 with large scale deep reinforcement learning. https://arxiv.org/abs/1912.06680, 2019. T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901, 2020. C. Buck, J. Bulian, M. Ciaramita, A. Gesmundo, N. Houlsby, W. Gajewski, and W. Wang. Ask the right questions: Active question reformulation with reinforcement learning. In International Conference on Learning Representations, 2018. D. Chen, A. Fisch, J. Weston, and A. Bordes. Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, 2017. N. Chomsky. Three models for the description of language. IRE Transactions on Information Theory, 2(3):113–124, 1956. N. Chomsky. Review of B. F. Skinner, Verbal Behavior. Language, 39:26–58, 1959. L. Choshen, L. Fox, Z. Aizenbud, and O. Abend. On the weaknesses of reinforcement learning for neural machine translation. In International Conference on Learning Representations, 2020. R. Das, S. Dhuliawala, M. Zaheer, and A. McCallum. Multi-step retriever-reader interaction for scal- able open-domain question answering. In International Conference on Learning Representations, 2019. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional trans- formers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019. L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the 35th International Conference on Machine Learning, pages 1407–1416, 2018.

10 L. Espeholt, R. Marinier, P. Stanczyk, K. Wang, and M. Michalski. Seed rl: Scalable and efficient deep- rl with accelerated central inference. In International Conference on Learning Representations, 2020. H. Hu, D. Yarats, Q. Gong, Y. Tian, and M. Lewis. Hierarchical decision making by generating and following natural language instructions. In Advances in Neural Information Processing Systems, volume 32, 2019. J. Huang and E. Efthimiadis. Analyzing and evaluating query reformulation strategies in web search logs. In CIKM, pages 77–86, 2009. G. Izacard and E. Grave. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021. B. J. Jansen, D. L. Booth, and A. Spink. Patterns of query reformulation during web searching. Journal of the American Society for Information Science and Technology, 60(7):1358–1371, 2009. T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G. Gay. Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM Transactions on Information Systems, 25, 2007. V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-t. Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. S. Kiegeland and J. Kreutzer. Revisiting the weaknesses of reinforcement learning for neural machine translation. In Proceedings of NAACL, 2021. L. Kocsis and C. Szepesvári. Bandit based monte-carlo planning. In Machine Learning: ECML 2006, pages 282–293, 2006. T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, M. Kelcey, J. Devlin, K. Lee, K. N. Toutanova, L. Jones, M.-W. Chang, A. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural questions: a benchmark for question answering research. Transactions of the Association of Computational Linguistics, 2019. S. Lawrence and C. L. Giles. Context and page analysis for improved web search. IEEE Internet Computing, 2, 1998. K. Lee, M.-W. Chang, and K. Toutanova. Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6086–6096, 2019. M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettle- moyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, pages 7871–7880, 07 2020a. P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 9459–9474, 2020b. P. Lewis, P. Stenetorp, and S. Riedel. Question and answer test-train overlap in open-domain question answering datasets. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021a. P. Lewis, Y. Wu, L. Liu, P. Minervini, H. Küttler, A. Piktus, P. Stenetorp, and S. Riedel. Paq: 65 million probably-asked questions and what you can do with them, 2021b. L. Li, W. Chu, J. Langford, and R. Schapire. A contextual-bandit approach to personalized news article. In Proceedings of WWW, 2010.

11 S. Min, J. Boyd-Graber, C. Alberti, D. Chen, E. Choi, M. Collins, K. Guu, H. Hajishirzi, K. Lee, J. Palomaki, C. Raffel, A. Roberts, T. Kwiatkowski, P. Lewis, Y.Wu, H. Küttler, L. Liu, P. Minervini, P. Stenetorp, S. Riedel, S. Yang, M. Seo, G. Izacard, F. Petroni, L. Hosseini, N. D. Cao, E. Grave, I. Yamada, S. Shimaoka, M. Suzuki, S. Miyawaki, S. Sato, R. Takahashi, J. Suzuki, M. Fajcik, M. Docekal, K. Ondrej, P. Smrz, H. Cheng, Y. Shen, X. Liu, P. He, W. Chen, J. Gao, B. Oguz, X. Chen, V. Karpukhin, S. Peshterliev, D. Okhonko, M. Schlichtkrull, S. Gupta, Y. Mehdad, and W. tau Yih. NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned, 2021.

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing atari with deep reinforcement learning. In NIPS Deep Learning Workshop, 2013.

K. Narasimhan, A. Yala, and R. Barzilay. Improving information extraction by acquiring external evidence with reinforcement learning. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016.

K. Narasimhan, R. Barzilay, and T. S. Jaakkola. Deep transfer in reinforcement learning by language grounding. Journal of Artificial Intelligence Research, 63, 2018.

G. Neu and C. Szepesvári. Training parsers by inverse reinforcement learning. Machine Learning, 77, 2009.

R. Nogueira and K. Cho. Task-oriented query reformulation with reinforcement learning. In Proceedings of EMNLP, 2017.

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020.

P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ questions for machine com- prehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, 2016.

S. Robertson and H. Zaragoza. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.

J. J. Rocchio. Relevance feedback in information retrieval. In G. Salton, editor, The Smart retrieval system - experiments in automatic document processing, pages 313–323. Englewood Cliffs, NJ: Prentice-Hall, 1971.

D. M. Russell. The Joy of Search: A Google Insider’s Guide to Going Beyond the Basics. The MIT Press, 2019.

S. Rutter, N. Ford, and P. Clough. How do children reformulate their search queries? Information Research, 20(1), 2015.

A. Salomaa. Probabilistic and weighted grammars. Information and Control, 15:529–544, 1969.

J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. Lillicrap, and D. Silver. Mastering atari, go, chess and shogi by planning with a learned model. Nature, 588(588):604–609, 2020.

D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

A. Strzelecki. Eye-tracking studies of web search engines: A systematic literature review. Information, 11(6), 2020.

Y. C. Tan and L. E. Celis. Assessing social and intersectional biases in contextualized word represen- tations. In Advances in Neural Information Processing Systems, 2019.

12 A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polo- sukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017. E. Voorhees. The trec-8 question answering track report. In TREC, 11 2000. Y. Wang, L. Wang, Y. Li, D. He, and T.-Y. Liu. A theoretical analysis of ndcg type ranking measures. In Proceedings of the 26th Annual Conference on Learning Theory, pages 25–54, 2013. K. Webster, X. Wang, I. Tenney, A. Beutel, E. Pitler, E. Pavlick, J. Chen, and S. Petrov. Measuring and reducing gendered correlations in pre-trained models. https://arxiv.org/abs/2010.06032, 2020. W. Xiong, X. Li, S. Iyer, J. Du, P. Lewis, W. Y. Wang, Y. Mehdad, S. Yih, S. Riedel, D. Kiela, and B. Oguz. Answering complex open-domain questions with multi-hop dense retrieval. In International Conference on Learning Representations, 2021. X. Yuan, J. Fu, M.-A. Côté, Y. Tay, C. Pal, and A. Trischler. Interactive machine comprehension with information seeking agents. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, July 2020. Association for Computational Linguistics. Y. Yue and T. Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In Proceedings of ICML, 2009. V. Zhong, T. Rocktäschel, and E. Grefenstette. Rtfm: Generalising to new environment dynamics via reading. In International Conference on Learning Representations, 2020.

13 Appendix

A Details and Examples for the Grammar-Guided MCTS (cf. §3.1 Main Paper)

o ,a Q WQ UQ STOP {AAAB83icbVBNS8NAEN3Ur1q/qh69LBbBg5SkCnosevFYwX5AE8Jmu2mXbnbD7kQooX/DiwdFvPpnvPlv3LY5aOuDgcd7M8zMi1LBDbjut1NaW9/Y3CpvV3Z29/YPqodHHaMyTVmbKqF0LyKGCS5ZGzgI1ks1I0kkWDca38387hPThiv5CJOUBQkZSh5zSsBKvp+rEC4wCcGfhtWaW3fnwKvEK0gNFWiF1S9/oGiWMAlUEGP6nptCkBMNnAo2rfiZYSmhYzJkfUslSZgJ8vnNU3xmlQGOlbYlAc/V3xM5SYyZJJHtTAiMzLI3E//z+hnEN0HOZZoBk3SxKM4EBoVnAeAB14yCmFhCqOb2VkxHRBMKNqaKDcFbfnmVdBp177LeeLiqNW+LOMroBJ2ic+Sha9RE96iF2oiiFD2jV/TmZM6L8+58LFpLTjFzjP7A+fwBnXuRaA== t t}

AAAB6HicdVDLSgMxFM3UV62vqks3wSK4GjLT0ba7ohuXLdgHtEPJpJk2NpMZkoxQhn6BGxeKuPWT3Pk3pg9BRQ9cOJxzL/feEyScKY3Qh5VbW9/Y3MpvF3Z29/YPiodHbRWnktAWiXksuwFWlDNBW5ppTruJpDgKOO0Ek+u537mnUrFY3OppQv0IjwQLGcHaSM3xoFhCtotqyKtAZJcvLqueZ0jFqblGcWy0QAms0BgU3/vDmKQRFZpwrFTPQYn2Myw1I5zOCv1U0QSTCR7RnqECR1T52eLQGTwzyhCGsTQlNFyo3ycyHCk1jQLTGWE9Vr+9ufiX10t1WPUzJpJUU0GWi8KUQx3D+ddwyCQlmk8NwUQycyskYywx0Sabggnh61P4P2m7tlO23aZXql+t4siDE3AKzoEDKqAObkADtAABFDyAJ/Bs3VmP1ov1umzNWauZY/AD1tsnRaqNQg== a U ⇒ Op Field| W| h AAAB7nicbVBNS8NAEJ3Ur1q/qh69LBZBEEpSBT0WvXisYD+gDWWz3bRLN5uwOxFK6I/w4kERr/4eb/4bt20O2vpg4PHeDDPzgkQKg6777RTW1jc2t4rbpZ3dvf2D8uFRy8SpZrzJYhnrTkANl0LxJgqUvJNoTqNA8nYwvpv57SeujYjVI04S7kd0qEQoGEUrtWk/wwtv2i9X3Ko7B1klXk4qkKPRL3/1BjFLI66QSWpM13MT9DOqUTDJp6VeanhC2ZgOeddSRSNu/Gx+7pScWWVAwljbUkjm6u+JjEbGTKLAdkYUR2bZm4n/ed0Uwxs/EypJkSu2WBSmkmBMZr+TgdCcoZxYQpkW9lbCRlRThjahkg3BW355lbRqVe+yWnu4qtRv8ziKcAKncA4eXEMd7qEBTWAwhmd4hTcncV6cd+dj0Vpw8plj+APn8wftaY9M t+1 ⇒ 0 AAAB73icdVDLSsNAFJ3UV62vqks3g0VwIWHSxlR3RTcuK9gHtCFMppN26OThzKRQQn/CjQtF3Po77vwbJ20FFT1w4XDOvdx7j59wJhVCH0ZhZXVtfaO4Wdra3tndK+8ftGWcCkJbJOax6PpYUs4i2lJMcdpNBMWhz2nHH1/nfmdChWRxdKemCXVDPIxYwAhWWuomHjqDEw955QoynUvLcRyITFSz63Y1J5Z9XrOhZaI5KmCJpld+7w9ikoY0UoRjKXsWSpSbYaEY4XRW6qeSJpiM8ZD2NI1wSKWbze+dwROtDGAQC12RgnP1+0SGQymnoa87Q6xG8reXi395vVQFF27GoiRVNCKLRUHKoYph/jwcMEGJ4lNNMBFM3wrJCAtMlI6opEP4+hT+T9pV06qZ1Vu70rhaxlEER+AYnAIL1EED3IAmaAECOHgAT+DZuDcejRfjddFaMJYzh+AHjLdPJXSPZQ== p0,v0 Op + fAAAB6HicdVDLSsNAFL3xWeur6tLNYBFchaSNqe6Kbly2YB/QhjKZTtqxkwczE6GEfoEbF4q49ZPc+TdO2goqeuDC4Zx7ufceP+FMKsv6MFZW19Y3Ngtbxe2d3b390sFhW8apILRFYh6Lro8l5SyiLcUUp91EUBz6nHb8yXXud+6pkCyObtU0oV6IRxELGMFKS81gUCpbpntpu66LLNOqOjWnkhPbOa86yDatOcqwRGNQeu8PY5KGNFKEYyl7tpUoL8NCMcLprNhPJU0wmeAR7Wka4ZBKL5sfOkOnWhmiIBa6IoXm6veJDIdSTkNfd4ZYjeVvLxf/8nqpCi68jEVJqmhEFouClCMVo/xrNGSCEsWnmmAimL4VkTEWmCidTVGH8PUp+p+0K6ZdNStNp1y/WsZRgGM4gTOwoQZ1uIEGtIAAhQd4gmfjzng0XozXReuKsZw5gh8w3j4BM5WNNQ== sAAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4KkkV9Fj04rGi/YA2ls120y7dbMLuRCihP8GLB0W8+ou8+W/ctjlo64OBx3szzMwLEikMuu63s7K6tr6xWdgqbu/s7u2XDg6bJk414w0Wy1i3A2q4FIo3UKDk7URzGgWSt4LRzdRvPXFtRKwecJxwP6IDJULBKFrp3jy6vVLZrbgzkGXi5aQMOeq90le3H7M04gqZpMZ0PDdBP6MaBZN8UuymhieUjeiAdyxVNOLGz2anTsipVfokjLUthWSm/p7IaGTMOApsZ0RxaBa9qfif10kxvPIzoZIUuWLzRWEqCcZk+jfpC80ZyrEllGlhbyVsSDVlaNMp2hC8xZeXSbNa8c4r1buLcu06j6MAx3ACZ+DBJdTgFurQAAYDeIZXeHOk8+K8Ox/z1hUnnzmCP3A+fwACV42d Field ⇒ −title | contents

AAAB6HicdVDJSgNBEK2JW4xb1KOXxiB4GnomwSS3oBePCZgFkiH0dHqS1p6F7h4hDPkCLx4U8eonefNv7CyCij4oeLxXRVU9PxFcaYw/rNza+sbmVn67sLO7t39QPDzqqDiVlLVpLGLZ84ligkesrbkWrJdIRkJfsK5/dzX3u/dMKh5HN3qaMC8k44gHnBJtpNZ4WCxhG1cr9WodYbtSxjXsGuK65TK+QI6NFyjBCs1h8X0wimkaskhTQZTqOzjRXkak5lSwWWGQKpYQekfGrG9oREKmvGxx6AydGWWEgliaijRaqN8nMhIqNQ190xkSPVG/vbn4l9dPdVDzMh4lqWYRXS4KUoF0jOZfoxGXjGoxNYRQyc2tiE6IJFSbbAomhK9P0f+k49pO2XZblVLjchVHHk7gFM7BgSo04Bqa0AYKDB7gCZ6tW+vRerFel605azVzDD9gvX0CMkiNNA==

g AAACCHicdVDLSgNBEJyNrxhfqx49OBgET2E3CZjcgl48JmIekCxhdjJJhsw+mOlVw7JHL/6KFw+KePUTvPk3TrIRVLSgoajqprvLDQVXYFkfRmZpeWV1Lbue29jc2t4xd/daKogkZU0aiEB2XKKY4D5rAgfBOqFkxHMFa7uT85nfvmZS8cC/gmnIHI+MfD7klICW+uZhD9gtxI0E9y75aAxEyuAGp2ITN5K+mbcK5WKpapdwSipWSiqVKrYL1hx5tEC9b773BgGNPOYDFUSprm2F4MREAqeCJblepFhI6ISMWFdTn3hMOfH8kQQfa2WAh4HU5QOeq98nYuIpNfVc3ekRGKvf3kz8y+tGMKw4MffDCJhP00XDSGAI8CwVPOCSURBTTQiVXN+K6ZhIQkFnl9MhfH2K/yetYsEuFYqNcr52togjiw7QETpBNjpFNXSB6qiJKLpDD+gJPRv3xqPxYrymrRljMbOPfsB4+wR1yppF x Q UQ x ⇒ x | x 1 )

W V V W rAAAB6nicdVDLSgNBEOyNrxhfUY9eBoPgaZndBJPcgl48RjSJkKxhdjKbDJl9MDMrhCWf4MWDIl79Im/+jZOHoKIFDUVVN91dfiK40hh/WLmV1bX1jfxmYWt7Z3evuH/QVnEqKWvRWMTy1ieKCR6xluZasNtEMhL6gnX88cXM79wzqXgc3ehJwryQDCMecEq0ka7lndMvlrCNq5V6tY6wXSnjGnYNcd1yGZ8hx8ZzlGCJZr/43hvENA1ZpKkgSnUdnGgvI1JzKti00EsVSwgdkyHrGhqRkCkvm586RSdGGaAglqYijebq94mMhEpNQt90hkSP1G9vJv7ldVMd1LyMR0mqWUQXi4JUIB2j2d9owCWjWkwMIVRycyuiIyIJ1Sadggnh61P0P2m7tlO23atKqXG+jCMPR3AMp+BAFRpwCU1oAYUhPMATPFvCerRerNdFa85azhzCD1hvn2bLjeI=

AAAB73icdVDLSsNAFJ3UV62vqks3g0VwISHTxlR3RTcuK9gHtCFMppN26OThzKRQQn/CjQtF3Po77vwbJ20FFT1w4XDOvdx7j59wJpVlfRiFldW19Y3iZmlre2d3r7x/0JZxKghtkZjHoutjSTmLaEsxxWk3ERSHPqcdf3yd+50JFZLF0Z2aJtQN8TBiASNYaambeOgMTjzklSuW6Vwix3GgZVo1u25Xc4Ls85oNkWnNUQFLNL3ye38QkzSkkSIcS9lDVqLcDAvFCKezUj+VNMFkjIe0p2mEQyrdbH7vDJ5oZQCDWOiKFJyr3ycyHEo5DX3dGWI1kr+9XPzL66UquHAzFiWpohFZLApSDlUM8+fhgAlKFJ9qgolg+lZIRlhgonREJR3C16fwf9KumqhmVm/tSuNqGUcRHIFjcAoQqIMGuAFN0AIEcPAAnsCzcW88Gi/G66K1YCxnDsEPGG+fKIGPZw== p ,v x ⇒ x | x x 1 1 1

fAAAB6HicdVDLSsNAFL3xWeur6tLNYBFchaSNqe6Kbly2YB/QhjKZTtqxkwczE6GEfoEbF4q49ZPc+TdO2goqeuDC4Zx7ufceP+FMKsv6MFZW19Y3Ngtbxe2d3b390sFhW8apILRFYh6Lro8l5SyiLcUUp91EUBz6nHb8yXXud+6pkCyObtU0oV6IRxELGMFKS81gUCpbpntpu66LLNOqOjWnkhPbOa86yDatOcqwRGNQeu8PY5KGNFKEYyl7tpUoL8NCMcLprNhPJU0wmeAR7Wka4ZBKL5sfOkOnWhmiIBa6IoXm6veJDIdSTkNfd4ZYjeVvLxf/8nqpCi68jEVJqmhEFouClCMVo/xrNGSCEsWnmmAimL4VkTEWmCidTVGH8PUp+p+0K6ZdNStNp1y/WsZRgGM4gTOwoQZ1uIEGtIAAhQd4gmfjzng0XozXReuKsZw5gh8w3j4BM5WNNQ==

AAAB63icbVBNS8NAEJ3Ur1q/qh69LBbBU0mqoMeiF48V7Ae0sWy2k3bp7ibsboRS+he8eFDEq3/Im//GpM1BWx8MPN6bYWZeEAturOt+O4W19Y3NreJ2aWd3b/+gfHjUMlGiGTZZJCLdCahBwRU2LbcCO7FGKgOB7WB8m/ntJ9SGR+rBTmL0JR0qHnJGbSaZR6/UL1fcqjsHWSVeTiqQo9Evf/UGEUskKssENabrubH1p1RbzgTOSr3EYEzZmA6xm1JFJRp/Or91Rs5SZUDCSKelLJmrvyemVBozkUHaKakdmWUvE//zuokNr/0pV3FiUbHFojARxEYke5wMuEZmxSQllGme3krYiGrKbBpPFoK3/PIqadWq3kW1dn9Zqd/kcRThBE7hHDy4gjrcQQOawGAEz/AKb450Xpx352PRWnDymWP4A+fzBzjtjbI= W V V W AAAB73icdVDLSsNAFJ3UV62vqks3g0VwISHTxlR3RTcuK9gHtCFMppN26OThzKRQQn/CjQtF3Po77vwbJ20FFT1w4XDOvdx7j59wJpVlfRiFldW19Y3iZmlre2d3r7x/0JZxKghtkZjHoutjSTmLaEsxxWk3ERSHPqcdf3yd+50JFZLF0Z2aJtQN8TBiASNYaambeOgMTjzklSuW6Vwix3GgZVo1u25Xc4Ls85oNkWnNUQFLNL3ye38QkzSkkSIcS9lDVqLcDAvFCKezUj+VNMFkjIe0p2mEQyrdbH7vDJ5oZQCDWOiKFJyr3ycyHEo5DX3dGWI1kr+9XPzL66UquHAzFiWpohFZLApSDlUM8+fhgAlKFJ9qgolg+lZIRlhgonREJR3C16fwf9KumqhmVm/tSuNqGUcRHIFjcAoQqIMGuAFN0AIEcPAAnsCzcW88Gi/G66K1YCxnDsEPGG+fKIGPZw== p1,v1 s x ⇒ | V w w V U U

B AAAB6HicdVDJSgNBEK2JW4xb1KOXxiB4GnomwSS3oBePCZgFkiH0dHqS1p6F7h4hDPkCLx4U8eonefNv7CyCij4oeLxXRVU9PxFcaYw/rNza+sbmVn67sLO7t39QPDzqqDiVlLVpLGLZ84ligkesrbkWrJdIRkJfsK5/dzX3u/dMKh5HN3qaMC8k44gHnBJtpNZ4WCxhG1cr9WodYbtSxjXsGuK65TK+QI6NFyjBCs1h8X0wimkaskhTQZTqOzjRXkak5lSwWWGQKpYQekfGrG9oREKmvGxx6AydGWWEgliaijRaqN8nMhIqNQ190xkSPVG/vbn4l9dPdVDzMh4lqWYRXS4KUoF0jOZfoxGXjGoxNYRQyc2tiE6IJFSbbAomhK9P0f+k49pO2XZblVLjchVHHk7gFM7BgSo04Bqa0AYKDB7gCZ6tW+vRerFel605azVzDD9gvX0CMkiNNA== g 2 ))

AAACFHicdVDJSgNBEO2Je9yiHr00BkUQwkwSMLmJgnhzwZhAJoSeTiVp0rPQXaOGIR/hxV/x4kERrx68+Td2FkFFHxS8fq+KrnpeJIVG2/6wUlPTM7Nz8wvpxaXlldXM2vqVDmPFocJDGaqaxzRIEUAFBUqoRQqY70moer2joV+9BqVFGFxiP4KGzzqBaAvO0EjNzJ6LcItJZUB33AvR6SJTKryhrmveI+c0oscCZItWB81M1s4V84WyU6BjUrLHpFQqUydnj5AlE5w1M+9uK+SxDwFyybSuO3aEjYQpFFzCIO3GGiLGe6wDdUMD5oNuJKOjBnTbKC3aDpWpAOlI/T6RMF/rvu+ZTp9hV//2huJfXj3GdqmRiCCKEQI+/qgdS4ohHSZEW0IBR9k3hHElzK6Ud5liHE2OaRPC16X0f3KVzzmFXP68mD04nMQxTzbJFtklDtknB+SEnJEK4eSOPJAn8mzdW4/Wi/U6bk1Zk5kN8gPW2ydGf55Z Op Field W

⇒ { | ∈x ∧ AAAB6nicdVDLSgNBEOyNrxhfUY9eBoPgaZndBJPcgl48RjSJkKxhdjKbDJl9MDMrhCWf4MWDIl79Im/+jZOHoKIFDUVVN91dfiK40hh/WLmV1bX1jfxmYWt7Z3evuH/QVnEqKWvRWMTy1ieKCR6xluZasNtEMhL6gnX88cXM79wzqXgc3ehJwryQDCMecEq0ka7lndsvlrCNq5V6tY6wXSnjGnYNcd1yGZ8hx8ZzlGCJZr/43hvENA1ZpKkgSnUdnGgvI1JzKti00EsVSwgdkyHrGhqRkCkvm586RSdGGaAglqYijebq94mMhEpNQt90hkSP1G9vJv7ldVMd1LyMR0mqWUQXi4JUIB2j2d9owCWjWkwMIVRycyuiIyIJ1Sadggnh61P0P2m7tlO23atKqXG+jCMPR3AMp+BAFRpwCU1oAYUhPMATPFvCerRerNdFa85azhzCD1hvn2hPjeM= trie(Σ ).HasSubstring(w) r AAACFHicdVDJSgNBEO2Je9yiHr00BkUQwkwSMLmJgnhzwZhAJoSeTiVp0rPQXaOGIR/hxV/x4kERrx68+Td2FkFFHxS8fq+KrnpeJIVG2/6wUlPTM7Nz8wvpxaXlldXM2vqVDmPFocJDGaqaxzRIEUAFBUqoRQqY70moer2joV+9BqVFGFxiP4KGzzqBaAvO0EjNzJ6LcItJZUB33AvR6SJTKryhrmveI+c0oscCZItWB81M1s4V84WyU6BjUrLHpFQqUydnj5AlE5w1M+9uK+SxDwFyybSuO3aEjYQpFFzCIO3GGiLGe6wDdUMD5oNuJKOjBnTbKC3aDpWpAOlI/T6RMF/rvu+ZTp9hV//2huJfXj3GdqmRiCCKEQI+/qgdS4ohHSZEW0IBR9k3hHElzK6Ud5liHE2OaRPC16X0f3KVzzmFXP68mD04nMQxTzbJFtklDtknB+SEnJEK4eSOPJAn8mzdW4/Wi/U6bk1Zk5kN8gPW2ydGf55Z Op Field W x } V w #w V # 2 B sAAAB7HicbVBNS8NAEJ3Urxq/qh69LBbBU0mqoMeiF48VTFtoY9lsN+3SzSbsboQS+hu8eFDEqz/Im//GTZqDtj4YeLw3w8y8IOFMacf5tipr6xubW9Vte2d3b/+gdnjUUXEqCfVIzGPZC7CinAnqaaY57SWS4ijgtBtMb3O/+0SlYrF40LOE+hEeCxYygrWRPPXYtO1hre40nAJolbglqUOJ9rD2NRjFJI2o0IRjpfquk2g/w1IzwuncHqSKJphM8Zj2DRU4osrPimPn6MwoIxTG0pTQqFB/T2Q4UmoWBaYzwnqilr1c/M/rpzq89jMmklRTQRaLwpQjHaP8czRikhLNZ4ZgIpm5FZEJlphok08egrv88irpNBvuRaN5f1lv3ZRxVOEETuEcXLiCFtxBGzwgwOAZXuHNEtaL9W59LForVjlzDH9gff4Ab5GNxw== ⇒ {trie(Σ| x)∈.HasSubstring(∧ w ) −→ }

hAAAB/3icdVDLSgNBEJyNrxhfUcGLl8GgeHHZTaJJbiEieowhiUISwuxk1gzOPpjpFcOag7/ixYMiXv0Nb/6Nk4egogUNRVU33V1OKLgCy/owEjOzc/MLycXU0vLK6lp6faOpgkhS1qCBCOSlQxQT3GcN4CDYZSgZ8RzBLpzr45F/ccOk4oFfh0HIOh658rnLKQEtddNbfXyA28BuIa6c1Oq4vYdPa41hN52xzKxVsvIFbJm5w6NiPq9JwS5ltWKb1hgZNEW1m35v9wIaecwHKohSLdsKoRMTCZwKNky1I8VCQq/JFWtp6hOPqU48vn+Id7XSw24gdfmAx+r3iZh4Sg08R3d6BPrqtzcS//JaEbjFTsz9MALm08kiNxIYAjwKA/e4ZBTEQBNCJde3YtonklDQkaV0CF+f4v9JM2vaOTN7ns+UK9M4kmgb7aB9ZKMCKqMzVEUNRNEdekBP6Nm4Nx6NF+N10powpjOb6AeMt0+8m5So BERT & GRU (a) The productions of the query grammar: x identifies a

gAAAB+HicdVBNS0JBFJ1nX2YfWi3bDEnQpsf4lNSd1KZFgZGaoCLzxlEH530wc19kD39JmxZFtO2ntOvfNH4EFXXgwuGce7n3HjeUQgMhH1ZiaXlldS25ntrY3NpOZ3Z2GzqIFON1FshANV2quRQ+r4MAyZuh4tRzJb9xR2dT/+aWKy0CvwbjkHc8OvBFXzAKRupm0gN8jNvA7yC+uK5dTrqZLLFJsVAuljGxC3lSIo4hjpPPkxOcs8kMWbRAtZt5b/cCFnncByap1q0cCaETUwWCST5JtSPNQ8pGdMBbhvrU47oTzw6f4EOj9HA/UKZ8wDP1+0RMPa3Hnms6PQpD/dubin95rQj6pU4s/DAC7rP5on4kMQR4mgLuCcUZyLEhlClhbsVsSBVlYLJKmRC+PsX/k4Zj5/K2c1XIVk4XcSTRPjpARyiHiqiCzlEV1RFDEXpAT+jZurcerRfrdd6asBYze+gHrLdPDPCStw== LSTM specific vocabulary induced by the aggregated results at ⇡AAAB6nicdVBNS8NAEJ34WetX1aOXxSJ4CklbTL0VvXisaD+gDWWz3bRLN5uwuxFK6E/w4kERr/4ib/4bt00FFX0w8Hhvhpl5QcKZ0o7zYa2srq1vbBa2its7u3v7pYPDtopTSWiLxDyW3QArypmgLc00p91EUhwFnHaCydXc79xTqVgs7vQ0oX6ER4KFjGBtpNt+wgalsmPXPPfCqaKceF5O6t45cm1ngTIs0RyU3vvDmKQRFZpwrFTPdRLtZ1hqRjidFfupogkmEzyiPUMFjqjys8WpM3RqlCEKY2lKaLRQv09kOFJqGgWmM8J6rH57c/Evr5fqsO5nTCSppoLki8KUIx2j+d9oyCQlmk8NwUQycysiYywx0Sadognh61P0P2lXbLdqV25q5cblMo4CHMMJnIELHjTgGprQAgIjeIAneLa49Wi9WK9564q1nDmCH7DePgHV1Y4t

fAAAB9XicdVDLSgNBEJyNrxhfUY9eBoPgxWU3WTd6C3rxoBDBJEISw+xkNhky+2CmVw1L/sOLB0W8+i/e/BsnD0FFCxqKqm66u7xYcAWW9WFk5uYXFpeyy7mV1bX1jfzmVl1FiaSsRiMRyWuPKCZ4yGrAQbDrWDISeII1vMHp2G/cMql4FF7BMGbtgPRC7nNKQEs3Pj7ALWD3kF6cV0edfMEy3WPbdV1smVbJKTvFMbGdw5KDbdOaoIBmqHby761uRJOAhUAFUappWzG0UyKBU8FGuVaiWEzogPRYU9OQBEy108nVI7ynlS72I6krBDxRv0+kJFBqGHi6MyDQV7+9sfiX10zAP2qnPIwTYCGdLvITgSHC4whwl0tGQQw1IVRyfSumfSIJBR1UTofw9Sn+n9SLpl0yi5dOoXIyiyOLdtAu2kc2KqMKOkNVVEMUSfSAntCzcWc8Gi/G67Q1Y8xmttEPGG+f6MSSJg== MLP time t (index omitted), V (V #) is the BERT wordpiece B −→B prefix (suffix) vocabulary, w denotes the string ending (b) The MuZero search agent architecture. at w, including the preceding wordpieces. Figure A.1

Figure A.1a lists the detailed rules schemata for the query grammar used by the MuZero agent – explained in Section 3.1. An optional STOP rule allows the agent to terminate an episode and return the results collected up to that point. Using the BERT sub-word tokens as vocabulary allows us to generate a large number of words with a total vocabulary size of 30k tokens. ∼ Our implementation of MuZero modifies the MCTS to use the query grammar for efficient exploration. Figure A.1b shows the different network components used during MCTS. Each node expansion is associated with a grammar rule. When the simulation phase is complete, the visit counts collected at the children of the MCTS root node provide the policy π from which the next action at+1 is sampled. Each simulation corresponds to one or more hypothetical follow-up queries (or fragments) resulting from the execution of grammar rules. The MCTS procedure executes Depth-First node expansions, guided by the grammar, to generate a query top-down, left-to-right, in a forward pass. To control the process, we add two data structures to MCTS nodes: a stack γ, and an output buffer ω: γ contains a list of unfinished non-terminals, ω stores the new expansion. The stack is initialized with the start symbol γ = [Q]. The output buffer is reset, ω = [], after each document search. When expanding a node, the non-terminal symbol on the top of γ is popped, providing the left-hand side of the rule associated with the new edge. Then, symbols on the right-hand side of the rule are pushed right-to-left onto γ. When a terminal rule is applied, the terminal symbol is added to ω. The next time γ contains only Q, ω holds the new query expansion term ∆qt to be appended to the previous query qt for search. Figure A.2 illustrates the process. Nodes represent the stack γ and output buffer ω. Edges are annotated with the rule used to expand the node. We illustrate the left-branching expansion. Starting from the top, the symbol "Q" is popped from the stack, and a compatible rule, "Q WQ", is sampled. The symbols "W" and "Q" are added to the stack for later processing. The agent⇒ expands the next node choosing to use the document content vocabulary (W Wβ), then it selects a vocabulary prefix (‘dial’), adding it to the output buffer ω, followed by a vocabulary⇒ suffix (‘ects’). At that point, the stack contains only Q, and the content of ω contains a new expansion, the term ‘dialects’. A latent search step is simulated through MuZero’s gθ sub-network. Then the output buffer ω is reset. After the latent search step, the simulation is forced to use the full trie (W Widx), which includes all terms in the Lucene index. This is necessary since there is no observable⇒ context that can be used to restrict the vocabulary. Instead of simply adding an OR term (Q WQ), the right branch of the example selects an expansion with unary operator and field information⇒ (Q UQ). ⇒

14 γ = Q , ω = { } {} Q WQ ⇒ Q UQ ⇒ γ = W, Q , ω = γ = U, Q , ω = { } {} { } {} W Wβ U Op Field W ⇒ ⇒ γ = Wβ, Q , ω = γ = Op, Field, W, Q , ω = { } {} { } {} β Wβ Vβ W ... ⇒ x γ = Vβ, W , Q , ω = ... { } {} Vβ dial ⇒ β γ = W , Q , ω = dial { } { } β β W V ⇒ β γ = V , Q , ω = dial { } { } β V ects ⇒ γ = Q , ω = dialects { } { } SEARCH Q WQ ⇒ γ = W, Q , ω = { } {} W Widx ⇒ γ = Widx, Q , ω = { } {}

...

... Figure A.2

B Reward Details (cf. §4)

This section provides more details about the reward function used for training the MuZero agent. As described in Eq. 7, the per-step reward is computed as the NDCG score difference of the aggregated top-5 documents. To encourage informative exploration and avoid convergence to sub-optimal, degenerate policies, we include two penalties: one for a search step that leads to an empty set of documents retrieved by the search engine, and the other for an immediately issued ‘stop’ action. We use a default penalty value of -1. We do not impose a penalty on the agent for the number of steps, as we want to encourage the agent to explore a large document space within its limited number of interactions with the search engine. We find that the final agent never uses the dedicated ‘stop’ action but prefers to keep issuing queries until the limit is reached. It learns to circumvent the penalty for the empty results set – without stopping the session early – by choosing frequently occurring terms from the query vocabulary Σq. We ran an ablation experiment without a ‘stop’ action but this seems to slightly hurt performance. We hypothesize that including the ‘stop’ action in training might have a regularization effect. This should be investigated in future work.

C Observation Building Details

This section provides more details and examples about the encoding of observations for both the MuZero and the T5 agent. As described in Section 2.1, the main part of the observation consists of the top-5 documents retrieved so far 5. The documents are sorted according to the PS score and reduced in size by extracting fixed-lengthD snippets around the DPR reader’s predicted answer. Moreover, the corresponding Wikipedia article title is appended to each document snippet. In addition to the top documents, the observation includes the original question and information about any previous

15 refinements. While the main part of the observation is shared between the MuZero and the T5 agent, there are differences in the exact representation. The following two paragraphs give a detailed explanation and example for both agents.

MuZero Agent’s State (cf. §2.1) The MuZero agent uses a custom BERT (initialized from BERT- base) with additional embedding layers to represent the different parts of the observation. It consists of four individual embedding layers as depicted in Figure A.3. At first, the standard layer for the tokens of the query, the current tree, and the current top documents D5. The second layer assigns a type ID to each of the tokens representing if a token is part of the query, the tree, the predicted answer, the context, or the title of a document. The last two layers add scoring information about the tokens as float values. We encode both the inverse document frequency (IDF) of a word and the documents’ passage selection (PS) score. Figure A.1 shows a concrete example of a state used by the MuZero agent.

AAAB73icbVA9SwNBEJ2LXzF+RS1tFoNgFe5ioWXQxjJCviA5wt5mLlmyt3fu7gnhyJ+wsVDE1r9j579xk1yhiQ8GHu/NMDMvSATXxnW/ncLG5tb2TnG3tLd/cHhUPj5p6zhVDFssFrHqBlSj4BJbhhuB3UQhjQKBnWByN/c7T6g0j2XTTBP0IzqSPOSMGit1+0FImgpxUK64VXcBsk68nFQgR2NQ/uoPY5ZGKA0TVOue5ybGz6gynAmclfqpxoSyCR1hz1JJI9R+trh3Ri6sMiRhrGxJQxbq74mMRlpPo8B2RtSM9ao3F//zeqkJb/yMyyQ1KNlyUZgKYmIyf54MuUJmxNQSyhS3txI2pooyYyMq2RC81ZfXSbtW9a6qtYdapX6bx1GEMziHS/DgGupwDw1oAQMBz/AKb86j8+K8Ox/L1oKTz5zCHzifP4abj6A= AAAB/XicbVDLSgMxFL3js9bX+Ni5CRbBVZmpC10WdeGyin1AO5RMeqcNzTxIMkIdir/ixoUibv0Pd/6NaTsLbT0QOJxzb3Jy/ERwpR3n21paXlldWy9sFDe3tnd27b39hopTybDOYhHLlk8VCh5hXXMtsJVIpKEvsOkPryZ+8wGl4nF0r0cJeiHtRzzgjGojde3Djh+Q65ilIUaa3KFKhVZdu+SUnSnIInFzUoIcta791enldzBBlWq7TqK9jErNmcBxsZMqTCgb0j62DY1oiMrLpunH5MQoPRLE0hyTYar+3shoqNQo9M1kSPVAzXsT8T+vnergwst4lKQaIzZ7KEgF0TGZVEF6XCLTYmQIZZKbrIQNqKRMm8KKpgR3/suLpFEpu2flym2lVL3M6yjAERzDKbhwDlW4gRrUgcEjPMMrvFlP1ov1bn3MRpesfOcA/sD6/AEquJUK

QueryAAAB8HicbVA9SwNBEJ2LXzF+RS1tFoNgFe5ioWXQxjIB8yHJEfY2c8mSvb1jd08IR36FjYUitv4cO/+Nm+QKTXww8Hhvhpl5QSK4Nq777RQ2Nre2d4q7pb39g8Oj8vFJW8epYthisYhVN6AaBZfYMtwI7CYKaRQI7ASTu7nfeUKleSwfzDRBP6IjyUPOqLHSYz8ISTNFNR2UK27VXYCsEy8nFcjRGJS/+sOYpRFKwwTVuue5ifEzqgxnAmelfqoxoWxCR9izVNIItZ8tDp6RC6sMSRgrW9KQhfp7IqOR1tMosJ0RNWO96s3F/7xeasIbP+MySQ1KtlwUpoKYmMy/J0OukBkxtYQyxe2thI2poszYjEo2BG/15XXSrlW9q2qtWavUb/M4inAG53AJHlxDHe6hAS1gEMEzvMKbo5wX5935WLYWnHzmFP7A+fwBeZKQMA== LayerAAAB8HicbVA9SwNBEJ2LXzF+RS1tFoNgFe5ioWXQxsIigvmQ5Ah7m7lkye7dsbsnHCG/wsZCEVt/jp3/xk1yhSY+GHi8N8PMvCARXBvX/XYKa+sbm1vF7dLO7t7+QfnwqKXjVDFssljEqhNQjYJH2DTcCOwkCqkMBLaD8c3Mbz+h0jyOHkyWoC/pMOIhZ9RY6bEXhOSOZqj65Ypbdecgq8TLSQVyNPrlr94gZqnEyDBBte56bmL8CVWGM4HTUi/VmFA2pkPsWhpRidqfzA+ekjOrDEgYK1uRIXP198SESq0zGdhOSc1IL3sz8T+vm5rwyp/wKEkNRmyxKEwFMTGZfU8GXCEzIrOEMsXtrYSNqKLM2IxKNgRv+eVV0qpVvYtq7b5WqV/ncRThBE7hHDy4hDrcQgOawEDCM7zCm6OcF+fd+Vi0Fpx85hj+wPn8AVN5kBc= Tree Document Results

AAAB7XicbVA9SwNBEJ2LXzF+RS1tDoNgFe5ioWXQxjJCviA5wt5mLlmzt3vs7gnhyH+wsVDE1v9j579xk1yhiQ8GHu/NMDMvTDjTxvO+ncLG5tb2TnG3tLd/cHhUPj5pa5kqii0quVTdkGjkTGDLMMOxmygkccixE07u5n7nCZVmUjTNNMEgJiPBIkaJsVK7KSco9KBc8areAu468XNSgRyNQfmrP5Q0jVEYyonWPd9LTJARZRjlOCv1U40JoRMywp6lgsSog2xx7cy9sMrQjaSyJYy7UH9PZCTWehqHtjMmZqxXvbn4n9dLTXQTZEwkqUFBl4uilLtGuvPX3SFTSA2fWkKoYvZWl46JItTYgEo2BH/15XXSrlX9q2rtoVap3+ZxFOEMzuESfLiGOtxDA1pA4RGe4RXeHOm8OO/Ox7K14OQzp/AHzucPp3SPLA==

aAAACDHicbVDLSgMxFM3UV62vqks3wSK4KGWmCrosunFZwbZCOwyZNNOGZjJDckcopR/gxl9x40IRt36AO//GpJ2Fth7I5XDOvUnuCVPBNbjut1NYWV1b3yhulra2d3b3yvsHbZ1kirIWTUSi7kOimeCStYCDYPepYiQOBeuEo2vrdx6Y0jyRdzBOmR+TgeQRpwSMFJQrJHCrmNoCtvT6CegqJoG0qrSq7XJr7gx4mXg5qaAczaD8Za6hWcwkUEG07npuCv6EKOBUsGmpl2mWEjoiA9Y1VJKYaX8yW2aKT4zSx1GizJGAZ+rviQmJtR7HoemMCQz1omfF/7xuBtGlP+EyzYBJOn8oygSGBNtkcJ8rRkGMDSFUcfNXTIdEEQomv5IJwVtceZm06zXvrFa/Pa80rvI4iugIHaNT5KEL1EA3qIlaiKJH9Ixe0Zvz5Lw4787HvLXg5DOH6A+czx/8d5hv ,c ,t ,...,a ,c ,t Tokens qAAAB6nicbVBNS8NAEJ34WetX1aOXxSJ4KkkV9Fj04rGi/YA2lM120y7dbOLuRCihP8GLB0W8+ou8+W/ctjlo64OBx3szzMwLEikMuu63s7K6tr6xWdgqbu/s7u2XDg6bJk414w0Wy1i3A2q4FIo3UKDk7URzGgWSt4LRzdRvPXFtRKwecJxwP6IDJULBKFrp/rHn9kplt+LOQJaJl5My5Kj3Sl/dfszSiCtkkhrT8dwE/YxqFEzySbGbGp5QNqID3rFU0YgbP5udOiGnVumTMNa2FJKZ+nsio5Ex4yiwnRHFoVn0puJ/XifF8MrPhEpS5IrNF4WpJBiT6d+kLzRnKMeWUKaFvZWwIdWUoU2naEPwFl9eJs1qxTuvVO8uyrXrPI4CHMMJnIEHl1CDW6hDAxgM4Ble4c2Rzovz7nzMW1ecfOYI/sD5/AEA0I2c 0 0 0 n n n 0 lAAAB+HicbVBNSwMxFHxbv2r96KpHL8EieChltwp6LHrxWMHWQrss2TTbhibZJckKtfSXePGgiFd/ijf/jWm7B20dCAwzb3gvE6WcaeN5305hbX1jc6u4XdrZ3dsvuweHbZ1kitAWSXiiOhHWlDNJW4YZTjupolhEnD5Eo5uZ//BIlWaJvDfjlAYCDySLGcHGSqFb5qFXRb1+YnQV8VCEbsWreXOgVeLnpAI5mqH7ZcMkE1QawrHWXd9LTTDByjDC6bTUyzRNMRnhAe1aKrGgOpjMD5+iU6v0UZwo+6RBc/V3YoKF1mMR2UmBzVAvezPxP6+bmfgqmDCZZoZKslgUZxyZBM1aQH2mKDF8bAkmitlbERlihYmxXZVsCf7yl1dJu17zz2v1u4tK4zqvowjHcAJn4MMlNOAWmtACAhk8wyu8OU/Oi/PufCxGC06eOYI/cD5/ADTmkiY= 0,...,lm

TypeAAAB63icbVDLSgNBEOz1GeMr6tHLYBA8hd140GPQi8cIeUGyhNnJbDJkHsvMrBCW/IIXD4p49Ye8+TfOJnvQxIKGoqqb7q4o4cxY3//2Nja3tnd2S3vl/YPDo+PKyWnHqFQT2iaKK92LsKGcSdq2zHLaSzTFIuK0G03vc7/7RLVhSrbsLKGhwGPJYkawzaWWk4aVql/zF0DrJChIFQo0h5WvwUiRVFBpCcfG9AM/sWGGtWWE03l5kBqaYDLFY9p3VGJBTZgtbp2jS6eMUKy0K2nRQv09kWFhzExErlNgOzGrXi7+5/VTG9+GGZNJaqkky0VxypFVKH8cjZimxPKZI5ho5m5FZII1JtbFU3YhBKsvr5NOvRZc1+qP9WrjroijBOdwAVcQwA004AGa0AYCE3iGV3jzhPfivXsfy9YNr5g5gz/wPn8AG8yORg== AAACF3icbVBLS8NAEN7UV42vqEcvi0XwICWpgh6LetBbBfuANpTNdtMu3TzYnYgl5F948a948aCIV735b9y2OdjWgYXvMcPsfF4suALb/jEKS8srq2vFdXNjc2t7x9rda6gokZTVaSQi2fKIYoKHrA4cBGvFkpHAE6zpDa/GfvOBScWj8B5GMXMD0g+5zykBLXWtstkB9gjp7XXWTUl2gv9QOktB065Vssv2pPAicHJQQnnVutZ3pxfRJGAhUEGUajt2DG5KJHAqWGZ2EsViQoekz9oahiRgyk0nd2X4SCs97EdSvxDwRP07kZJAqVHg6c6AwEDNe2PxP6+dgH/hpjyME2AhnS7yE4EhwuOQcI9LRkGMNCBUcv1XTAdEEgo6SlOH4MyfvAgalbJzWq7cnZWql3kcRXSADtExctA5qqIbVEN1RNETekFv6N14Nl6ND+Nz2low8pl9NFPG1y+kR5+Y AAACHHicbVDLSsNAFJ34rPUVdelmsAgupCStoMuiLnRXwT6gDWEynbRDJw9mbsQS8iFu/BU3LhRx40Lwb5y2WdjWAwPnnnMvd+7xYsEVWNaPsbS8srq2Xtgobm5t7+yae/tNFSWSsgaNRCTbHlFM8JA1gINg7VgyEniCtbzh1dhvPTCpeBTewyhmTkD6Ifc5JaAl16wWu70I1CnuAnuE9PY6c1OSzZR0toTMNUtW2ZoALxI7JyWUo+6aX3oJTQIWAhVEqY5txeCkRAKngmXFbqJYTOiQ9FlH05AETDnp5LgMH2ulh/1I6hcCnqh/J1ISKDUKPN0ZEBioeW8s/ud1EvAvnJSHcQIspNNFfiIwRHicFO5xySiIkSaESq7/iumASEJB51nUIdjzJy+SZqVsV8uVu7NS7TKPo4AO0RE6QTY6RzV0g+qogSh6Qi/oDb0bz8ar8WF8TluXjHzmAM3A+P4F1fSh4A== ID IDAAACAXicbZDLSsNAFIYnXmu8Rd0IboJFcFWSKuiyqAvdVbAXaEOZTE/boZMLMydiCXHjq7hxoYhb38Kdb+O0zUJbfxj4+M85nDm/Hwuu0HG+jYXFpeWV1cKaub6xubVt7ezWVZRIBjUWiUg2fapA8BBqyFFAM5ZAA19Awx9ejuuNe5CKR+EdjmLwAtoPeY8zitrqWPtmG+EB05urrJNOESVAlnWsolNyJrLnwc2hSHJVO9ZXuxuxJIAQmaBKtVwnRi+lEjkTkJntREFM2ZD2oaUxpAEoL51ckNlH2unavUjqF6I9cX9PpDRQahT4ujOgOFCztbH5X62VYO/cS3kYJwghmy7qJcLGyB7HYXe5BIZipIEyyfVfbTagkjLUoZk6BHf25Hmol0vuSal8e1qsXORxFMgBOSTHxCVnpEKuSZXUCCOP5Jm8kjfjyXgx3o2PaeuCkc/skT8yPn8ARwmXbw== IDa, IDc, IDt, ...,IDa, IDc, IDt AAAB8nicbVBNSwMxEM36WetX1aOXYBE8ld0q6LGoB71VsB+wXUo2zbah2WRNZsWy9Gd48aCIV3+NN/+NabsHbX0w8Hhvhpl5YSK4Adf9dpaWV1bX1gsbxc2t7Z3d0t5+06hUU9agSijdDolhgkvWAA6CtRPNSBwK1gqHVxO/9ci04UrewyhhQUz6kkecErCS3wH2BNnt9bj70C2V3Yo7BV4kXk7KKEe9W/rq9BRNYyaBCmKM77kJBBnRwKlg42InNSwhdEj6zLdUkpiZIJuePMbHVunhSGlbEvBU/T2RkdiYURzazpjAwMx7E/E/z08huggyLpMUmKSzRVEqMCg8+R/3uGYUxMgSQjW3t2I6IJpQsCkVbQje/MuLpFmteKeV6t1ZuXaZx1FAh+gInSAPnaMaukF11EAUKfSMXtGbA86L8+58zFqXnHzmAP2B8/kDfL2RYg== q tree

IDFAAAB8HicbVDLSgNBEJz1GeMr6tHLYBA8hd140GNQEb1FNA9JljA76SRD5rHMzAphyVd48aCIVz/Hm3/jJNmDJhY0FFXddHdFMWfG+v63t7S8srq2ntvIb25t7+wW9vbrRiWaQo0qrnQzIgY4k1CzzHJoxhqIiDg0ouHlxG88gTZMyQc7iiEUpC9Zj1FinfR4e3WN76nS0CkU/ZI/BV4kQUaKKEO1U/hqdxVNBEhLOTGmFfixDVOiLaMcxvl2YiAmdEj60HJUEgEmTKcHj/GxU7q4p7QrafFU/T2REmHMSESuUxA7MPPeRPzPayW2dx6mTMaJBUlni3oJx1bhyfe4yzRQy0eOEKqZuxXTAdGEWpdR3oUQzL+8SOrlUnBaKt+Vi5WLLI4cOkRH6AQF6AxV0A2qohqiSKBn9IrePO29eO/ex6x1yctmDtAfeJ8/1ROPxQ== Score idf(l ),...,idf(l ) idf(a ), idf(c ), idf(t ),...,idf(a ), idf(c ), idf(t ) idf(q ) AAACEHicbVDLSgMxFM3UV62vUZdugkVsQcpMFXRZdOOygn1AOwyZNNOGZh4kd8Qy9BPc+CtuXCji1qU7/8a0nUVtPRByOOfem9zjxYIrsKwfI7eyura+kd8sbG3v7O6Z+wdNFSWSsgaNRCTbHlFM8JA1gINg7VgyEniCtbzhzcRvPTCpeBTewyhmTkD6Ifc5JaAl1zztAnuElPf8cUm4VvkMd3sRKH3N60HZNYtWxZoCLxM7I0WUoe6a33oQTQIWAhVEqY5txeCkRAKngo0L3USxmNAh6bOOpiEJmHLS6UJjfKKVHvYjqU8IeKrOd6QkUGoUeLoyIDBQi95E/M/rJOBfOSkP4wRYSGcP+YnAEOFJOrjHJaMgRpoQKrn+K6YDIgkFnWFBh2AvrrxMmtWKfV6p3l0Ua9dZHHl0hI5RCdnoEtXQLaqjBqLoCb2gN/RuPBuvxofxOSvNGVnPIfoD4+sXngucUA== 0 m AAACVHicbZFLSwMxFIUzU6u1aq26dBMsQgtSZqqgy6IblxXsA9oyZNJMG5rJDMkdsQz9kboQ/CVuXJg+Fu3UC4HDd+/hJid+LLgGx/m27Nxefv+gcFg8Oj4pnZbPzjs6ShRlbRqJSPV8opngkrWBg2C9WDES+oJ1/enTot99Y0rzSL7CLGbDkIwlDzglYJBXng6AvUPKR8G8SjyndoM3AM0CWIFRBHqLE09mnTLrlDWvXHHqzrLwrnDXooLW1fLKn2YVTUImgQqidd91YhimRAGngs2Lg0SzmNApGbO+kZKETA/TZShzfG3ICAeRMkcCXtJNR0pCrWehbyZDAhOd7S3gf71+AsHDMOUyToBJuloUJAJDhBcJ4xFXjIKYGUGo4uaumE6IIhTMPxRNCG72ybui06i7t/XGy12l+biOo4Au0RWqIhfdoyZ6Ri3URhR9oB8LWZb1Zf3aOTu/GrWttecCbZVd+gMhtbHy 0 0 0 n n n AAAB+XicbVBNS8NAEN3Urxq/oh69BItQLyWpgh6LXjxWsB/QhrDZbNqlm03cnRRL6D/x4kERr/4Tb/4bt20O2vpg4PHeDDPzgpQzBY7zbZTW1jc2t8rb5s7u3v6BdXjUVkkmCW2RhCeyG2BFORO0BQw47aaS4jjgtBOMbmd+Z0ylYol4gElKvRgPBIsYwaAl37LMPtAnyFkYTauPvnPuWxWn5sxhrxK3IBVUoOlbX/0wIVlMBRCOleq5TgpejiUwwunU7GeKppiM8ID2NBU4psrL55dP7TOthHaUSF0C7Ln6eyLHsVKTONCdMYahWvZm4n9eL4Po2suZSDOggiwWRRm3IbFnMdghk5QAn2iCiWT6VpsMscQEdFimDsFdfnmVtOs196JWv7+sNG6KOMroBJ2iKnLRFWqgO9RELUTQGD2jV/Rm5MaL8W58LFpLRjFzjP7A+PwBkWWS9g== 0

PSAAAB73icbVA9SwNBEJ3zM8avqKXNYhCswl0stAzaWEZiPiA5wt5mLlmyt3fu7gnhyJ+wsVDE1r9j579xk1yhiQ8GHu/NMDMvSATXxnW/nbX1jc2t7cJOcXdv/+CwdHTc0nGqGDZZLGLVCahGwSU2DTcCO4lCGgUC28H4dua3n1BpHssHM0nQj+hQ8pAzaqzUqTdIg8UK+6WyW3HnIKvEy0kZctT7pa/eIGZphNIwQbXuem5i/Iwqw5nAabGXakwoG9Mhdi2VNELtZ/N7p+TcKgMSxsqWNGSu/p7IaKT1JApsZ0TNSC97M/E/r5ua8NrPuExSg5ItFoWpICYms+fJgCtkRkwsoUxxeythI6ooMzaiog3BW355lbSqFe+yUr2vlms3eRwFOIUzuAAPrqAGd1CHJjAQ8Ayv8OY8Oi/Ou/OxaF1z8pkT+APn8wdlu4+L Score PS(d ),...,PS(d )

AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mqoMeiF48t2FpoQ9lsJ+3azSbsboQS+gu8eFDEqz/Jm//GbZuDtj4YeLw3w8y8IBFcG9f9dgpr6xubW8Xt0s7u3v5B+fCoreNUMWyxWMSqE1CNgktsGW4EdhKFNAoEPgTj25n/8IRK81jem0mCfkSHkoecUWOlptsvV9yqOwdZJV5OKpCj0S9/9QYxSyOUhgmqdddzE+NnVBnOBE5LvVRjQtmYDrFrqaQRaj+bHzolZ1YZkDBWtqQhc/X3REYjrSdRYDsjakZ62ZuJ/3nd1ITXfsZlkhqUbLEoTAUxMZl9TQZcITNiYgllittbCRtRRZmx2ZRsCN7yy6ukXat6F9Va87JSv8njKMIJnMI5eHAFdbiDBrSAAcIzvMKb8+i8OO/Ox6K14OQzx/AHzucPemeMuA== AAACDnicbVDLSgMxFM34rPU16tJNsBRakDJTBV0W3bisaB/QDkMmTdvQTGZI7ohl6Be48VfcuFDErWt3/o3pY1FbD4Qczrn3JvcEseAaHOfHWlldW9/YzGxlt3d29/btg8O6jhJFWY1GIlLNgGgmuGQ14CBYM1aMhIFgjWBwPfYbD0xpHsl7GMbMC0lP8i6nBIzk2/k2sEdIq3ejQsd3iqe43YlAm2tOlkXfzjklZwK8TNwZyaEZqr79bebQJGQSqCBat1wnBi8lCjgVbJRtJ5rFhA5Ij7UMlSRk2ksn64xw3igd3I2UORLwRJ3vSEmo9TAMTGVIoK8XvbH4n9dKoHvppVzGCTBJpw91E4EhwuNscIcrRkEMDSFUcfNXTPtEEQomwawJwV1ceZnUyyX3rFS+Pc9VrmZxZNAxOkEF5KILVEE3qIpqiKIn9ILe0Lv1bL1aH9bntHTFmvUcoT+wvn4BZDibDQ== 0 0AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBbBU0mqoMeiF48t2FpoQ9lsJ+3azSbsboQS+gu8eFDEqz/Jm//GbZuDtj4YeLw3w8y8IBFcG9f9dgpr6xubW8Xt0s7u3v5B+fCoreNUMWyxWMSqE1CNgktsGW4EdhKFNAoEPgTj25n/8IRK81jem0mCfkSHkoecUWOlptsvV9yqOwdZJV5OKpCj0S9/9QYxSyOUhgmqdddzE+NnVBnOBE5LvVRjQtmYDrFrqaQRaj+bHzolZ1YZkDBWtqQhc/X3REYjrSdRYDsjakZ62ZuJ/3nd1ITXfsZlkhqUbLEoTAUxMZl9TQZcITNiYgllittbCRtRRZmx2ZRsCN7yy6ukXat6F9Va87JSv8njKMIJnMI5eHAFdbiDBrSAAcIzvMKb8+i8OO/Ox6K14OQzx/AHzucPemeMuA== 0 n

Figure A.3: Schematic illustration of the MuZero search agent’s state for the BERT representation function.

Table A.1: Example state of the MuZero search agent that is the input to the BERT representation function. The ‘Type’ layer encodes the state part information for each token. The ‘IDF’ and ‘PS’ layer are additional layers with float values of the IDF and the PS score of the input tokens, respectively.

Tokens [CLS] who carries the burden of going forward with evidence in a trial Type [CLS] query query query query query query query query query query query query IDF 0.00 0.00 6.77 0.00 7.77 0.00 5.13 5.53 0.00 5.28 0.00 0.00 5.77 ··· PS 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Tokens [SEP] [pos] [content] burden ##s [neg] [title] sometimes [SEP] lit ##igan ##ts [SEP] Type [SEP] tree tree tree tree tree tree tree [SEP] answer answer answer [SEP] IDF 0.00 0.00 0.00 9.64 9.64 0.00 0.00 4.92 0.00 10.64 10.64 10.64 0.00 ··· PS 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 -3.80 -3.80 -3.80 -3.80

Tokens kinds for each party , in different phases of litigation . the burden Type context context context context context context context context context context context context context IDF 7.10 0.00 0.00 4.36 17.41 0.00 4.18 7.46 0.00 7.92 17.41 0.00 7.77 ··· PS -3.80 -3.80 -3.80 -3.80 -3.80 -3.80 -3.80 -3.80 -3.80 -3.80 -3.80 -3.80 -3.80

Tokens suspicion ", " probable cause " ( as for [SEP] evidence [SEP] Type context context context context context context context context context context [SEP] title [SEP] IDF 7.80 17.41 17.41 17.41 7.91 5.41 17.41 17.41 0.00 0.00 0.00 5.28 0.00 ··· PS -12.20 -12.20 -12.20 -12.20 -12.20 -12.20 -12.20 -12.20 -12.20 -12.20 -12.20 -12.20 -12.20

T5 Agent’s State (cf. §2.3) T5 represents the state as a flat string. The input is a concatenation of the original query, zero or more expansions, and five results. For each result, we include the answer given by the reader, the document’s title, and a span centered around the answer. The prediction target is simply the next expansion. See Table A.2 for a full example.

16 Table A.2: Example state (input) and prediction (target) of the T5 agent with linebreaks and emphasis added for readability. In this example we restrict to a 40 token span but use up to 70 tokens in our experiments.

Input Query: ’how many parts does chronicles of narnia have’. Contents must contain: ’lewis’. Contents cannot contain: ’battle’. Answer: ’seven’. Title: ’The Chronicles of Narnia’. Result: ’The Chronicles of Narnia is a series of seven fantasy novels by C. S. Lewis. It is considered a classic of children’s literature and is the author’s best-known work, having sold over 100 million copies in 47 languages. Written by’. Answer: ’seven’. Title: ’The Chronicles of Narnia (film series)’. Result: ’series of films based on "The Chronicles of Narnia", a series of novels by C. S. Lewis. From the seven books, there have been three film adaptations so far – (2005), "" (2008) and "" (2010) – which have grossed over $1.5’. Answer: ’seven’. Title: ’Religion in The Chronicles of Narnia’. Result: ’Religion in The Chronicles of Narnia "The Chronicles of Narnia" is a series of seven fantasy novels for children written by C. S. Lewis. It is considered a classic of children’s literature and is the author’s best-known work, having sold’. Answer: ’seven’. Title: ’The Chronicles of Narnia’. Result: ’always presented in chronological order. Lewis’ s early life has parallels with "The Chronicles of Narnia". At the age of seven , he moved with his family to a large house on the edge of Belfast. Its long hallways and’. Answer: ’Two’. Title: ’The Chronicles of Narnia’. Result: ’although smaller copies can be found in the most recent HarperCollins 2006 hardcover edition of "The Chronicles of Narnia". Two other maps were produced as a result of the popularity of the 2005 film "". One, the "Rose Map’.

Target Contents must contain: ’novels’

D Model and Training Configuration (cf. §5.2)

MuZero The MuZero agent learner, which performs both inference and training, runs on a Cloud TPU v2 with 8 cores which is roughly equivalent to 10 Nvidia P100 GPUs in terms of TFLOPS.9 One core is allocated for training and 7 cores are allocated for inference. We use 500 CPU based actors along with 80 actors dedicated to evaluation. Each agent is trained for 2.6 million steps, with 100 simulations per step, at an approximate speed of 10,000 steps per hour. In total, training takes about 10 days. Hyperparameters are listed in Table A.3.

T5 The T5 agent is trained on 16 Cloud TPU v3, starting from the pre-trained T5-large checkpoint. We follow the suggested hyperparameter choices given by Raffel et al. [2020], see Table A.4. We train for about 1.5 days at roughly 2 steps per second.

DPR Reader and Passage Selector We train the DPR reader (based on BERT) on TPU with a batch size of 192 for up to 10, 000 steps with a learning rate of 5e 5 using the standard BERT fine-tuning schedule with a linear ramp-up schedule for the first 1, 000−steps and a decay schedule for the remainder. We pick the best checkpoint on the NQ dev data, evaluating the actual top-20 results for each query, obtaining a passage selection accuracy of 48.9 and an exact match score of 29.76 after 6, 600 batches. Note that only 65% of the dev data queries can be answered from the top-20

9The Cloud TPU v2 has a peak performance of 180 TFLOPS (https://cloud.google.com/tpu), whereas the Nvidia P100 GPU goes up to 18.7 TFLOPS depending on precision (https://cloud.google.com/ compute/docs/gpus).

17 Table A.3: Hyperparameters for MuZero. Parameter Value Simulations per Step 100 Actor Instances 500 Training TPU Cores 1 Inference TPU Cores 7 Initial Inference Batch Size (hθ) 4 per core Recurrent Inference Batch Size (fθ, gθ) 32 per core LSTM Units (gθ) One layer of 512 Feed-forward Units (fθ) One layer of 32 Training Batch Size 16 Optimizer SGD Learning Rate 1e 4 Weight Decay 1e−5 Discount Factor (γ)− .9 Unroll Length (K) 5 Max. #actions Expanded per Step 100

Table A.4: Hyperparameters for T5. Parameter Value Number of Parameter (T5 large) 770M Encoder/Decoder Layers 24 Feed-forward Units 4096 Number of Heads 16 Batch Size (in tokens) 220 Dropout Rate 0.1 Learning Rate (constant) 0.001 Optimizer AdaFactor Maximum input length (tokens) 512 Maximum target length (tokens) 128 Finetuning steps on synthetic data 61,300 Finetuning steps on NQ Train 200

passages. Restricting to those 65%, span selection performance is at 75 and exact match score at 45.68 points, respectively.

E MuZero Training Procedure (cf. §5.2)

Training the MuZero agent is expensive in terms of time ( 4 days/1M steps) and resources. The training might diverge early on, so we choose a two-step approach:≈ We first run training for 1M steps with several seeds. We then select the two best checkpoints and continue training each with three more seeds. Figure A.4 shows the NDCG score improvement (over the BM25 baseline) on NQ-Dev during training of the MuZero agent for multiple seeds. We can neither do an extensive fine-tuning of the hyperparameters nor provide meaningful variance estimates. However, Figure A.4 gives an idea of the variance during training. To better understand the discrepancy in performance between the best and worst run – grey and blue lines in Figure A.4, respectively – we analyze the policies used by the respective agents in Figures A.5, A.6, and A.7. As can be seen in Figure A.6 using the operators, particularly ‘+contents’ and ‘-title’, is key to a good performance. The mere adding of words by using ‘or’ does not seem to be an effective strategy. The operators provide the agent with powerful actions that can drastically change the set of returned documents, while only appending a term without an operator usually does not greatly impact the results. This effect manifests clearly in the number of unique documents explored

18 during a session shown in Figure A.5: the worst agent explores significantly fewer documents. In terms of words used (Figure A.7), we also see a drastic difference between the best and worst agents. While the worst agent exclusively chooses words from the document’s body Σβ, the best agent prefers words from either the previous answers Σα or the query Σq. The effectiveness of this policy intuitively makes sense as it corresponds to a reduction of the action space to more meaningful terms; typically, words that are part of the query or predicted answer are more important than words that simply occur in the body of the document. We show examples for many of the successful policies learned by the agents in Section H.

Explored Documents

10.0 0.25 best worst 0.20 7.5 G C

D 0.15 N

5.0

Frequency 0.10 2.5 0.05

0.0 0.00 0.0 0.5 1.0 1.5 2.0 2.5 10 20 30 40 Steps 1e6 Unique Number of Documents per Session Figure A.4: NDCG score improvement (over Figure A.5: Histogram of the number of the BM25 baseline) on NQ-Dev during train- unique documents explored per session of the ing of the MuZero agent. The different runs best and worst performing MuZero Agents. are initialized using different random seeds. The colors match the runs from Figure A.4. The worst performing runs are stopped after 1M steps to save resources. Action Type Frequency Action Dictionary Frequency 1.0 1.0 best best 0.8 worst 0.8 worst

0.6 0.6

0.4 0.4 Frequency Frequency

0.2 0.2

0.0 0.0 +contents -title or q Figure A.6: Frequency of the different action Figure A.7: Frequency of the word vocabu- types used by the best and worst performing laries used by the best and worst performing MuZero agents. MuZero agents.

F Ablations for the T5 agent

In Table A.5, we present additional variants of the T5-based agent. T5Synth is trained on the synthetic episodes (cf. §5.1) but not further finetuned on the NQ Train episodes. Note that this setup still uses the NQ Train data to generate the synthetic question-answer pairs that form the basis of the synthetic episodes. T5NQ, on the other hand, is finetuned only on the NQ Train expansions but does not use the synthetic episodes. As before, we select a checkpoint based on the NDCG improvement on NQ Dev. Comparing to T5, the configuration where both synthetic and NQ train episodes are used for finetuning, we notice results are a bit lower for both T5Synth and T5NQ: 40.49 and 41.52 NDCG on NQ Test respectively vs. 41.92 for T5. While T5Synth seems to be overall below T5NQ the model achieves higher scores for ExactMatch (EM), surpassing even the T5 model. This also holds true for the combination with the Muzero agent.

19 Table A.5: Additional results (cf. Table 1) on NQ Dev and Test. T5NQ is a T5 model only finetuned on NQ Train episodes whereas T5Synth is finetuned first only on synthetic episodes. As above, T5 is finetuned on both: first synthetic episodes followed by NQ Train. MZ+T5Synth, MZ+T5NQ, and MZ+T5 are combinations of the respective T5 variant with MuZero (MZ). All results are reported as percentages.

Metric Data MZ T5Synth T5NQ T5 MZ+T5Synth MZ+T5NQ MZ+T5 NDCG@5 Dev 35.29 38.02 39.93 40.07 40.66 40.91 41.31 Devk 44.69 45.66 48.87 48.79 50.70 51.21 51.64 Devu 19.78 25.40 25.16 25.66 24.06 23.89 24.23 Test 37.30 40.49 41.53 41.92 42.38 42.35 42.70 Top-5 Dev 58.29 59.46 59.07 60.35 62.70 61.78 62.76 Test 60.55 62.02 61.69 62.38 64.10 63.68 64.32 EM Dev 26.09 27.94 26.76 27.55 27.96 27.18 27.72 Test 26.26 28.92 27.42 27.78 28.59 27.56 28.12

G Example Search Sessions (cf. §5.4)

Table A.6 shows the sequential search session of the MuZero agent for the example of Table 2. By looking at the step-by-step results and query refinements, the agent’s strategy for this particular example becomes apparent: the agent mostly requests new documents that belong to a different Wikipedia article than the currently highest-scoring one. This is achieved by adding ‘-(title:[term])’, where [term] is a token from the title of the highest-scoring document. This way it effectively explores the document space. At the fourth step, the agent adds the refinement ‘+(contents:1945)’ and thus enforces the retrieved documents to contain the gold answer. Of course, the agent does not have access to the information that this is the gold answer, but it realizes that this must be a good candidate since it is the predicted answer of the highest-scoring document so far.

H Example of Policy Types (cf. §5.4)

This section provides examples of interesting behaviors learned by the MuZero and T5 agents. For each example, we provide the NDCG score of the individual methods and an illustrative snippet of one of the top-5 documents. The answer predicted by the machine reader is colored in blue if correct and red if wrong. For the two agents, we also add the final query with all its refinements. Remember that the final results are the top-5 documents retrieved over the course of an entire session. Hence, it is possible – in fact, very common – that a document in the results set contains a term that is negated by a refinement in the final query. This simply means that the document was retrieved in a step before the negation and made it into the final results due to a high PS score.

H.1 Explore Space of Wikipedia Articles (Negate Title Terms)

The example below illustrates the use of negated title terms by the MuZero agent that we already discussed in the example of Section G. This policy helps explore the space of Wikipedia articles by explicitly negating the title of the currently high-scoring documents. For the T5 agent, we do not see this strategy; instead, it uses ‘+contents’ with topical expansions leading to another mention of the answer but from the same Wikipedia article.

20 Table A.6: Example of the MuZero agent’s search session. It shows step-by-step the refinement of the query together with the top search result obtained by BM25 (1st Doc) and the aggregated best 1 document retrieved so far ( t ). We highlight the predicted answer of the machine reader in color: blue if it matches the gold answer,D red otherwise. Question when was korea separated into north and south Answer 1945 Query when was korea separated into north and south 1st Doc Title: North Korea–South Korea relations PS-Score: -17.76 . . . leader Kim Dae-jung by the Korean CIA. Talks restarted, however, and between 1973 and 1975 there were 10 meetings of the North-South Coordinating Committee at . . . 1 D0 Title: North Korea–South Korea relations PS-Score: -17.76 . . . leader Kim Dae-jung by the Korean CIA. Talks restarted, however, and between 1973 and 1975 there were 10 meetings of the North-South Coordinating Committee at . . . Query when was korea separated into north and south -(title:“korea”) 1st Doc Title: North–South differences in the Korean language PS-Score: -8.10 . . . continued to be used by the North and the South after liberation of Korea in 1945, but with the establishments of the Democratic People’s Republic of Korea and the Republic of Korea in 1948, the two states have taken on differing policies regarding the language. 1 D1 Title: North–South differences in the Korean language PS-Score: -8.10 . . . continued to be used by the North and the South after liberation of Korea in 1945, but with the establishments of the Democratic People’s Republic of Korea and the Republic of Korea in 1948, the two states have taken on differing policies regarding the language. Query when was korea separated into north and south -(title:“korea”) -(title:“north”) 1st Doc Title: South Korean defectors PS-Score: -14.07 However, the exact number of prisoners of war held by North Korea and China has been disputed since 1953, due to unaccounted South Korean soldiers. Several South Koreans defected to the North during the . . . 1 D2 Title: North–South differences in the Korean language PS-Score: -8.10 . . . continued to be used by the North and the South after liberation of Korea in 1945, but with the establishments of the Democratic People’s Republic of Korea and the Republic of Korea in 1948, the two states have taken on differing policies regarding the language. Query when was korea separated into north and south -(title:“korea”) -(title:“north”) +(con- tents:“1945”) 1st Doc Title: South Pyongan Province PS-Score: -8.03 The province was formed in 1896 from the southern half of the former Pyongan Province, remained a province of Korea until 1945, then became a province of North Korea. 1 D3 Title: South Pyongan Province PS-Score: -8.03 The province was formed in 1896 from the southern half of the former Pyongan Province, remained a province of Korea until 1945, then became a province of North Korea. Query when was korea separated into north and south -(title:“korea”) -(title:“north”) +(con- tents:“1945”) -(title:“south”) 1st Doc Title: Eighth United States Army PS-Score: -7.77 At the end of World War II in 1945, Korea was divided into North Korea and South Korea with North Korea (assisted by the Soviet Union), becoming a communist government . . . 1 D4 Title: Eighth United States Army PS-Score: -7.77 At the end of World War II in 1945, Korea was divided into North Korea and South Korea with North Korea (assisted by the Soviet Union), becoming a communist government . . .

21 Question when did locked out of heaven come out Answer October 1, 2012 BM25 (NDCG: 34) 1st Doc Title: Locked Out of Heaven . . . was released as the lead single from the album on October 1, 2012 . The song was written by Mars, Philip Lawrence and Ari Levine of The Smeezingtons . . . MuZero (NDCG: 72) Query "q" +(contents:"1") -(title:"heaven") -(title:"come") -(title:"locked") +(contents:"december") +(contents:"2013") +(contents:"october") +(contents:"2014") +(contents:"november") 3rd Doc Title: Bruno Mars . . . title of the first single, "Locked out of Heaven", released on October 1, 2012 . The lead single from "Unorthodox Jukebox" reached number one . . . T5 (NDCG: 55) Query "q" +(contents:"jukebox") +(contents:"unorthodox") +(contents:"mainstream") +(con- tents:"digitally") +(contents:"2012") +(contents:"album") +(contents:"1") +(contents:"october") +(contents:"song") 2nd Doc Title: Locked Out of Heaven The song was unveiled digitally and to mainstream radio by Atlantic Records on October 1, 2012 , in the United States. It became available for purchase the following day . . .

H.2 Reusing Answer Terms

Both agents learn to reuse terms appearing in candidate answers based on the predictions of the machine reader. In the example below, BM25 has in the top 2 positions documents from which the machine reader extracts the correct answer. Both MuZero and T5 request the presence of both answer terms (with several repetitions for T5). BM25 ranks four documents from the same Wikipedia article in the top 5. Both MuZero and T5 have four different articles in the top 5. In addition to operators, MuZero also cleverly adds a simple topical term (‘disney’). However, the aggressive combination of ‘-’ and ‘+’ answer terms also leads to a high-ranking, loosely relevant document with the correct answer (from the article ‘ Hood (1973 film)’). T5 is more conservative, only negat- ing a content term (‘role’), and brings up in position five a document neither MuZero nor BM25 found.

Question who did the voice of king louie in the original jungle book Answer Louis Prima BM25 (NDCG: 72) 1st Doc Title: King Louie Mowgli, in order to become more human. King Louie was voiced by Louis Prima in the original 1967 film. Initially, the filmmakers considered Louis Armstrong . . . MuZero (NDCG: 100) Query "q" +(contents:"king") +(contents:"louis") -(title:"jungle") -(title:"book") -(title:"original") - (title:"voice") -(title:"louie") +(contents:"prima") disney 5th Doc Title: Robin Hood (1973 Film) Brian Bedford, with the latter being chosen. Meanwhile, Louis Prima was so angered at not being considered for a role that he personally . . . T5 (NDCG: 100) Query "q" +(contents:"prima") +(contents:"louis") +(contents:"film") +(contents:"1967") +(con- tents:"jungle") +(contents:"louis") -(contents:"role") +(contents:"louis") +(contents:"louis") 5th Doc Title: The Jungle Book (1967 Film) . . . which continued the story of the film, and included Phil Harris and Louis Prima voicing their film roles. In the record, Baloo (Harris) is missing Mowgli (Ginny Tyler), . . .

The example below shows MuZero using the ‘answer term’ strategy but for the wrong answer: ‘Bing Crosby’ from BM25 position 2, rather than ‘Lenny Kravitz’ from BM25 position 1. Interestingly, MuZero seems to try to recover towards the end of the episode by requesting the presence of the term ‘lenny’. However, at that point, it is too late. The short list of expansion terms indicates that the session ended because of empty search results.

22 Question who sings are you going to go my way Answer Lenny Kravitz BM25 (NDCG: 34) 2nd Doc Title: Going My Way . . . Congress as being "culturally, historically, or aesthetically significant". Bing Crosby recorded six of the songs for Decca Records and some of them were issued . . . MuZero (NDCG: 34) Query "q" +(contents:"crosby") -(title:"way") +(contents:"bing") -(title:"going") +(contents:"lenny") -(title:"sings") 4th Doc Title: Bing Crosby Sings the Song Hits from Broadway Shows . . . featuring some of the hits from Broadway musicals. Bing Crosby had enjoyed unprecedented success during the 1940s with his discography showing six No. 1 . . . T5 (NDCG: 84) Query "q" +(contents:"lenny") +(contents:"kravitz") +(contents:"hits") +(contents:"medley") +(contents:"released") +(contents:"album") +(contents:"single") +(contents:"song") - (contents:"billboard") 5th Doc Title: Are You Gonna Go My Way Are You Gonna Go My Way is the third studio album by American singer Lenny Kravitz , released on March 9, 1993 by Virgin Records. It was recorded at Waterfront Studios . . .

H.3 Session Structure There seems to be a progression in the search sessions executed by the MuZero agent. This sequential structure appears to be characterized by at least two phases (often repeated). One phase in which the first set of results is established (‘+contents’), followed by another phase in which it uses negation of title terms (‘-title’) to move on to new, different results. Using this strategy in the example below, the MuZero agent brings to the top position a high-quality document ranking lower than position 1000 in the BM25 search results. Conversely, the T5 agent, which is not trained on sequential data but single steps, seems to advance in the session less aggressively by adding similar terms; there is no clear structure, and terms seem to affect the ranking primarily around a more concentrated topic.

Question who sang the song i wish you were here Answer Pink Floyd BM25 (NDCG: 34) 1st Doc Title: Wish You Were Here (Pink Floyd song) "Wish You Were Here" is the title track on Pink Floyd ’s 1975 album "Wish You Were Here". David Gilmour and Roger Waters collaborated to write the music . . . MuZero (NDCG: 100) Query "q" +(contents:"pink") -(title:"wish") +(contents:"david") +(contents:"written") -(title:"song") +(contents:"rick") -(title:"sang") +(contents:"waters") +(contents:"floyd") 1st Doc Title: Scott Weiland . . . his solo songs as well as "Interstate Love Song" and a cover of Pink Floyd’s "Wish You Were Here". In a 2007 interview with "Blender magazine," Weiland . . . T5 (NDCG: 100) Query "q" +(contents:"gilmour") +(contents:"performed") +(contents:"floyd") +(contents:"album") +(contents:"vocal") +(contents:"pink") +(contents:"roger") +(contents:"david") +(con- tents:"written") 1st Doc Title: Shine On You Crazy Diamond "Shine On You Crazy Diamond" is a nine-part Pink Floyd composition written by David Gilmour, Roger Waters, and Rick Wright. It appeared on Pink Floyd ’s 1975 concept album "Wish You Were Here" . . .

H.4 Exploiting the NDCG Score One particularly interesting policy emerges for questions dealing with numbers or dates. In open-domain QA, one typically considers a document relevant if it contains the answer string. For many questions, this is a reasonable approximation. However, for answers that are numbers or dates, this can be very misleading; e.g., for the question of ‘how many episodes the second season of the TV series The Handmaid’s Tale has’, obviously not all Wikipedia paragraphs that contain the number 13 are relevant. Since the MuZero agent’s objective is to maximize its

23 NDCG-based reward, it may try to exploit this artifact of the data. In particular, it can learn that for particular ‘when’ questions with years as the answer, the best strategy is to require many different years in the query. This leads to the discovery of documents that often contain tables with many different years, leading to a high NDCG score as they are likely to contain the gold answer year. However, just because the answer is present in the document does not mean the reader can extract it. In fact, the documents that are pulled up are out-of-distribution for the reader, in the sense that during training, it is likely not exposed enough with non-topical documents containing many dates. This can lead to the reader not being able to extract the correct answer but still assigning the document a high PS score. The following example illustrates the described behavior.

Question when was the last world series won by the yankees Answer 2009 BM25 (NDCG: 0) 1st Doc Title: 1981 World Series The 1981 World Series was the championship series of the 1981 MLB season. It matched the New York Yankees against the Los Angeles Dodgers, marking their third meeting . . . MuZero (NDCG: 100) Query "q" +(contents:"1981") +(contents:"2015") +(contents:"2008") +(contents:"2009") +(con- tents:"2017") +(contents:"2016") -(title:"last") -(title:"won") -(title:"series") 1st Doc Title: Marlatt Hall The event is now held either in the Jardine Tower/Frith Community Center, or in the newly renovated Kramer Dining Center. 1972-1973 1981-1982 1985-1986 1986-1987 1988-1989 . . . 2007-2008 2008-2009 2009-2010 2015- 2016 2016-2017 2017-2018 . . .

In the following example, we can observe a closely related behavior. For the given ‘how-many’ question, both the MuZero and the T5 agent return documents in their first position that are not relevant for the question. Nevertheless, the documents contain the correct answer ‘13’ and the reader can extract them.

Question how many episodes of season two of the handmaids tale are there Answer 13 BM25 (NDCG: 0) 1st Doc Title: The Handmaid’s Tale (TV series) American dystopian drama television series created by Bruce Miller, based on the 1985 novel of the same name by Margaret Atwood. It was ordered by the streaming service Hulu as a straight-to-series order of 10 episodes, for . . . MuZero (NDCG: 100) Query "q" +(contents:"10") -(title:"handmaids") -(title:"tale") -(title:"season") -(title:"two") - (title:"episodes") -(title:"many") +(contents:"13") -(contents:"7") 1st Doc Title: Scandal (TV series) . . . season would likely be thirteen episodes or less, but the renewal of the series after the fall meant that the second season would have two arcs through the season; the first covering the main 13-episode order, with the second . . . T5 (NDCG: 100) Query "q" +(contents:"10") +(contents:"13") +(contents:"episodes") +(contents:"2009") +(con- tents:"second") +(contents:"may") +(contents:"premiered") +(contents:"season") +(con- tents:"third") 1st Doc Title: In May 2012, announced they were picking up a sixth season of "Robot Chicken", which began airing in September 2012. The seventh season premiered on April 13, 2014. Season eight premiered on October 25, 2015. Season nine premiered on December 10, 2017. ...

I Examples of Episode Structure (cf. §5.4)

This section lists and discusses a few interesting examples of episodes (from NQ Dev) focusing on the effect of the individual steps. The following examples show sequences from the original question q0 to the final query qT . At each step, we report the current NDCG@5 score in the parentheses.

24 I.1 MuZero Partially to Perfectly Solved The following two examples show the two phases (sometimes repeating), discussed in Section H.3, that characterize the MuZero policy; first using ‘+contents’ then switching to ‘-title’, likely for further/deeper exploration purposes. The second example also shows the use of simple terms, which the MuZero agent seems to throw in sometimes towards the end of the episode.

q0: "who’s the singer in fall out boy" (33.9) Answer: Patrick Stump

q1:"[ q0] +(contents:pete)" (55.3) q2:"[ q1] +(contents:patrick)" (85.3) q3:"[ q2] +(contents:stump)" (86.8) q4:"[ q3] -(title:boy)" (100) q5:"[ q4] -(title:singer)" (100) q6:"[ q5] -(title:fall)" (100) q7:"[ q6] +(contents:wentz)" (100) q8:"[ q7] -(title:album)" (100) q9:"[ q8] +(title:throman)" (100)

q0: "who ruled egypt at the time of the french invasion" (77.2) Answer: Ottoman Empire

q1:"[ q0] +(contents:empire)" (72.2) q2:"[ q1] +(contents:napoleon)" (100) q3:"[ q2] +(contents:bonaparte)" (100) q4:"[ q3] -(title:french)" (100) q5:"[ q4] -(title:egypt)" (100) q6:"[ q5] -(title:time)" (100) q7:"[ q6] -(title:ruled)" (100) q8:"[ q7] -(title:invasion)" (100) q9:"[ q8] battle" (100)

Unsolved to Partially Solved The example below shows a case where the NDCG score goes from 0 to below 1. These tend to be more challenging questions due to ambiguity or technical content. The MuZero agent seems to use negation to change the subtopic of the documents retrieved, then focusing and zeroing in on a different set of results without further success.

q0: "what major cellular event happen during the s phase of interphase" (0.0) Answer: DNA replication

q1:"[ q0] +(contents:cell)" (0,0) q2:"[ q1] +(contents:death)" (33.9) q3:"[ q2] +(contents:dna)" (68.4) q4:"[ q3] +(contents:replication)" (68.4) q5:"[ q4] -(title:interphase)" (68.4) q6:"[ q5] +(contents:process)" (68.4) q7:"[ q6] +(contents:cytokinesis)" (68.4) q8:"[ q7] -(title:major)" (68.4) q9:"[ q8] -(title:phase)" (68.4)

Unsolved to Perfectly Solved The example below shows a session where the MuZero agent perfectly solves a question in what looks like incremental steps. Interestingly, the question has multiple answers, and the ‘+’/‘-’ alternation seems to coincide with the incremental finding of more answers from the available set: ‘cytosine’ first, then ‘thymine’ and ‘guanine’.

25 q0: "what are the four elements present in dna" (0.0) Answers: [adenine [A], cytosine [C], thymine [T], guanine [G]]

q1:"[ q0] +(contents:four)" (33.9) q2:"[ q1] +(contents:may)" (69.9) q3:"[ q2] +(contents:cytosine)" (72.2) q4:"[ q3] -(title:present)" (72.2) q5:"[ q4] -(title:elements)" (72.2) q6:"[ q5] -(title:dna)" (100) q7:"[ q6] +(contents:thymine)" (100) q8:"[ q7] +(contents:guanine)" (100) q9:"[ q8] +(contents:nucleotides)" (100)

MuZero Wins Over T5 The following example is one out of 818 where MuZero does better than T5 on the NQ dev set (8757 total questions). This seems a particularly hard question, possibly because it is very specific. Notably, it is also not solved by the ‘high-quality’ Rocchio procedure used to train the T5 agent.

q0: "what type of rock is made of particles .05 cm" (0.0) Answer: Sand

q1:"[ q0] +(contents:water)" (33.9) q2:"[ q1] +(contents:sand)" (100) q3:"[ q2] +(contents:rocks)" (100) q4:"[ q3] +(contents:sandstone)" (100) q5:"[ q4] -(title:made)" (100) q6:"[ q5] -(title:type)" (100) q7:"[ q6] -(title:cm)" (100) q8:"[ q7] -(title:particles)" (100) q9:"[ q8] +(contents:feldspar)" (100) This is the corresponding T5 episode:

q0: "what type of rock is made of particles .05 cm" (0.0) Answer: Sand

q1:"[ q0] +(contents:rock)" (0.0) q2:"[ q1] +(contents:hard)" (0.0) q3:"[ q2] +(contents:tunneling)" (0.0) q4:"[ q3] +(contents:tbm)" (0.0)

MuZero Wins Over the HQ Rocchio Policy The following example is one of the 357 questions in NQ Dev, where MuZero outperforms the Rocchio policy. In this case, the Rocchio policy’s NDCG score does not improve from 0. It is an interesting episode as only at the very last step the agent achieves a perfect score when it seems to have identified the key term (‘Valinor’), after exploring less obvious Wikipedia articles about the ‘Lord of the Rings’.

q0: "in lord of the rings where do the elves sail to" (0.0) Answer: Valinor

q1:"[ q0] +(contents:earth)" (0.0) q2:"[ q1] +(contents:middle)" (0.0) q3:"[ q2] -(title:lord)" (33.9) q4:"[ q3] -(title:rings)" (55.3) q5:"[ q4] -(title:sail)" (55.3) q6:"[ q5] -(title:elves)" (55.3) q7:"[ q6] +(contents:close)" (86.8) q8:"[ q7] +(content:valinor)" (100)

26 I.2 T5 Partially to Perfectly Solved The following is a question that the T5 agent improves to the full score. Interestingly, the T5 agent, similarly to MuZero, uses multiple digit strings very closely related to the question: ‘11’ refers to January 11, 1908 – the exact date of the proclamation – and ‘1906’ is the year of the proclamation of the Grand Canyon Game Preserve.

q0: "when did the grand canyon became a national monument" (55.3) Answer: 1908

q1:"[ q0] +(contents:1908)" (86.8) q2:"[ q1] +(contents:11)" (100) q3:"[ q2] +(contents:lands)" (100) q4:"[ q3] +(contents:1906)" (100) q5:"[ q4] +(contents:preserve)" (100) q6:"[ q5] +(contents:act)" (100)

Unsolved to Partially Solved The following are two examples of hard questions which the T5 agent partially solves. The first question is tricky because there can be multiple answers related in complex ways, and the gold answer is extremely simplistic. The second is challenging because of figurative language and ambiguity: a ‘where’ question which is not looking for a typical location, rather an animal body part.

q0: "what are the symptoms of foot and mouth disease in cattle" (0.0) Answer: death

q1:"[ q0] +(contents:fmd)" (0.0) q2:"[ q1] +(contents:fever)" (21.3) q3:"[ q2] +(contents:days)" (21.3)

q0: "where does ribeye come from on a cow" (0.0) Answer: rib section

q1:"[ q0] +(contents:ribeye)" (33.9) q2:"[ q1] +(contents:section)" (55.3) q3:"[ q2] +(contents:rib)" (55.3) q4:"[ q3] +(contents:steak)" (55.3)

Unsolved to Perfectly Solved These are two examples of questions where the NDCG@5 score goes from 0 to 1. The T5 agent expands the query with closely related topical terms.

q0: "when was the first movie made" (0.0) Answer: 1977

q1:"[ q0] +(contents:franchise)" (72.2) q2:"[ q1] +(contents:trilogy)" (86.8) q3:"[ q2] +(contents:prequel)" (100) q4:"[ q3] +(contents:1977)" (100) q5:"[ q4] +(contents:film)" (100) q6:"[ q5] +(contents:lucas)" (100) q7:"[ q6] +(contents:original)" (100) q8:"[ q7] +(contents:saga)" (100) q9:"[ q8] +(contents:opera)" (100)

q0: "who has been playing the longest in the nba" (0.0) Answers: [Kareem Abdul-Jabbar, Kobe Bryant, Robert Parish, Kevin Garnett, Kevin Willis]

q1:"[ q0] +(contents:season)" (33.9) q2:"[ q1] +(contents:kobe)" (100)

27 q3:"[ q2] +(contents:player)" (100) q4:"[ q3] +(contents:seasons)" (100) q5:"[ q4] +(contents:jordan)" (100) q6:"[ q5] +(contents:twelve)" (100) q7:"[ q6] +(contents:kevin)" (100) q8:"[ q7] +(contents:players)" (100)

Additional Wins Over the HQ Rocchio Policy The following is one of the 473 questions where the T5 agent improved NDCG, while the ‘high-quality’ Rocchio procedure could not find a better solution. We note that full NDCG is achieved after enforcing the presence of a morphological variant (voices) of one of the query terms. This is a subtle use of search because the answer (the actor’s name) is more likely to appear as the verb’s subject. It is also noticeable that adding both terms of one of the gold answers, "mandy" and "moore", is not enough to reach a full NDCG score.

q0: "who plays the voice of rapunzel in the movie tangled" (50.8) Answers: [Delaney Rose Stein, Mandy Moore]

q1:"[ q0] +(contents:mandy)" (83.0) q2:"[ q1] +(contents:animated)" (85.3) q3:"[ q2] +(contents:moore)" (86.8) q4:"[ q3] +(contents:disney)" (86.8) q5:"[ q4] +(contents:film)" (86.8) q6:"[ q5] +(contents:voices)" (100) q7:"[ q6] +(contents:princess)" (100) q8:"[ q7] +(contents:animation)" (100)

T5 Wins Over MuZero The following is one example of a T5 win over MuZero (1531 total). The T5 results degrade a bit after the first reformulation, then recover to achieve a full score.

q0: "who killed the wicked witch of the west" (86.8) Answer: Dorothy

q1:"[ q0] +(contents:scarecrow)" (78.6) q2:"[ q1] +(contents:woodman)" (86.8) q3:"[ q2] +(contents:tin)" (86.8) q4:"[ q3] +(contents:wizard)" (100) q5:"[ q4] +(contents:wolves)" (100) q6:"[ q5] +(contents:pack)" (100) q7:"[ q6] +(contents:cowardly)" (100) q8:"[ q7] +(contents:dorothy)" (100) This is the corresponding MuZero episode. The agent immediately obtains a perfect score but then goes off the rails and never recovers.

q0: "who killed the wicked witch of the west" (86.8) Answer: Dorothy

q1:"[ q0] +(contents:king)" (100) q2:"[ q1] +(contents:phantom)" (16.9) q3:"[ q2] -(title:west)" (16.9) q4:"[ q3] -(title:killed)" (16.9) q5:"[ q4] -(title:wicked)" (16.9) q6:"[ q5] +(contents:lion)" (16.9)

28