
From sessions to conversations and back again

Evangelos Kanoulas

Informatics Institute & Amsterdam Business School, University of Amsterdam

March 19, 2021

Slide: Modeling the impact of short- and long-term behavior on search personalization (Bennett et al., SIGIR 2012). A LambdaMART learning-to-rank model re-ranks the top ten results of each query under four feature conditions: Session (recent, within-session activity), Historic (long-term profile activity), Aggregate, and Union. Training and test use non-overlapping users (about 155K user profiles over 10.4M sessions), so the learned patterns also apply to users not seen during training. Figure 4 of the paper reports the average change in MAP over a non-personalized baseline ranker: all four conditions improve over the baseline, the gains grow as more profiling information is used, and Union, which lets the ranker learn per query how to combine the short- and long-term views, improves the most (all gains and pairwise differences are significant with paired t-tests at p < .01). Figure 5 breaks the MAP change out by the position of the query in the session, contrasting Historic with Session-based personalization.

✓ Bennett, P. N., White, R. W., Chu, W., Dumais, S. T., Bailey, P., Borisyuk, F., & Cui, X.: Modeling the impact of short- and long-term behavior on search personalization. SIGIR 2012
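Where the slide above relies on a LambdaMART re-ranker over session and historic features, the sketch below shows how such a model could be approximated with LightGBM's lambdarank objective. The feature blocks, label values, and hyperparameters are illustrative stand-ins, not the configuration used in the cited paper.

```python
# Sketch: re-ranking the top-10 results per query with a LambdaMART-style
# learner, in the spirit of Bennett et al. (SIGIR 2012). LightGBM's
# "lambdarank" objective stands in for the original LambdaMART
# implementation; feature names and labels are illustrative only.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)

n_queries, docs_per_query = 200, 10
n = n_queries * docs_per_query

# Hypothetical feature blocks: "session" = within-session evidence,
# "historic" = long-term profile evidence (Union = both concatenated).
session_features = rng.normal(size=(n, 5))
historic_features = rng.normal(size=(n, 8))
X_union = np.hstack([session_features, historic_features])

# Graded relevance labels and one group (query) per 10 candidate documents.
y = rng.integers(0, 3, size=n)
group_sizes = [docs_per_query] * n_queries

ranker = lgb.LGBMRanker(
    objective="lambdarank",
    n_estimators=50,      # values in the ballpark reported on the slide
    num_leaves=70,
    learning_rate=0.3,
)
ranker.fit(X_union, y, group=group_sizes)

# Re-rank the 10 candidates of one unseen query by predicted score.
X_new = rng.normal(size=(docs_per_query, X_union.shape[1]))
order = np.argsort(-ranker.predict(X_new))
print("re-ranked positions:", order)
```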

What to expect from a search agent?

1. Profile users

2. Account for session history

What is the best available osteoporosis treatment?
(Relate new query to past queries; update belief; Rank List 1)

What are the cons of bisphosphonates?

It is an expensive therapy.

Can it rebuild low bone density?

3. Ask clarifying questions

(Generate and select the most informative question(s); update belief based on the answer)

What is the best available osteoporosis treatment?

Are you looking for a pharmaceutical treatment?

YES

How about combining it with physical exercise?

NO, I am too old for that, except if it is easy to do.

4. Decide Next Action

1. Provide ranking of documents/passages
2. Provide direct answer
3. Ask clarifying question

1. Relevance over items/documents
2. Natural language questions
3. Suggested questions

Decide what is the optimal next action to take
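The candidate actions above imply a decision policy. The following is a minimal sketch, under assumed confidence signals, of how an agent might pick among returning a ranking, answering directly, or asking a clarifying question; the BeliefState fields and thresholds are hypothetical and not taken from any system described in the talk.

```python
# Sketch: choosing the agent's next action from hypothetical confidence
# estimates. The thresholds and fields are illustrative only.
from dataclasses import dataclass

@dataclass
class BeliefState:
    ranking_confidence: float   # how sure we are the top results are relevant
    answer_confidence: float    # how sure we are a single direct answer suffices
    ambiguity: float            # how ambiguous the current request still is

def decide_next_action(belief: BeliefState,
                       clarify_threshold: float = 0.5,
                       answer_threshold: float = 0.8) -> str:
    """Return one of: 'ask_clarifying_question', 'provide_answer', 'provide_ranking'."""
    if belief.ambiguity > clarify_threshold:
        return "ask_clarifying_question"
    if belief.answer_confidence > answer_threshold:
        return "provide_answer"
    return "provide_ranking"

print(decide_next_action(BeliefState(0.6, 0.3, 0.7)))  # ask_clarifying_question
print(decide_next_action(BeliefState(0.9, 0.9, 0.1)))  # provide_answer
print(decide_next_action(BeliefState(0.7, 0.4, 0.2)))  # provide_ranking
```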

1. Profile users

Estimate relevance

Figure 2 of the ColBERT paper: schematic diagrams illustrating query–document matching paradigms in neural IR, contrasting (a) representation-based similarity (e.g., DSSM, SNRM), (b) query–document interaction (e.g., DRMM, KNRM, Conv-KNRM), and (c) all-to-all interaction (e.g., BERT) with (d) the proposed late interaction paradigm (ColBERT). Under late interaction, the query and the document are encoded separately into two sets of contextual embeddings; every query embedding interacts with all document embeddings via a MaxSim operator that computes the maximum similarity (e.g., cosine), and the scalar outputs are summed across query terms. This retains the fine-grained matching of interaction-based models while letting document representations be precomputed offline and retrieved with vector-similarity indexes, so ColBERT serves queries in tens to a few hundreds of milliseconds (over 170x faster than existing BERT-based re-rankers, with 14,000x fewer FLOPs) while remaining more effective than every non-BERT baseline.

✓ C Van Gysel, M de Rijke, E Kanoulas: Neural Vector Spaces for Unsupervised Information Retrieval. TOIS 2018
✓ C Van Gysel, M de Rijke, E Kanoulas: Learning Latent Vector Spaces for Product Search. CIKM 2016
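A minimal sketch of the late-interaction (MaxSim) scoring described above: each query token embedding takes its maximum cosine similarity over all document token embeddings, and these maxima are summed across query tokens. The random matrices stand in for the BERT-based contextual embeddings ColBERT actually uses.

```python
# Sketch: ColBERT-style late interaction scoring with random stand-in embeddings.
import numpy as np

def normalize(rows: np.ndarray) -> np.ndarray:
    return rows / np.linalg.norm(rows, axis=1, keepdims=True)

def late_interaction_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """query_emb: (n_q, dim), doc_emb: (n_d, dim); both L2-normalized so the
    dot product is cosine similarity."""
    sim = query_emb @ doc_emb.T          # (n_q, n_d) token-level similarities
    return float(sim.max(axis=1).sum())  # MaxSim per query token, then sum

rng = np.random.default_rng(0)
q = normalize(rng.normal(size=(8, 128)))   # 8 query token embeddings
docs = [normalize(rng.normal(size=(rng.integers(60, 120), 128))) for _ in range(3)]

# Document embeddings can be computed and indexed offline; only q is encoded
# at query time, then documents are ranked by the late-interaction score.
ranking = sorted(range(len(docs)), key=lambda i: -late_interaction_score(q, docs[i]))
print("ranked doc ids:", ranking)
```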

Challenge: Training complex models

The slide overlays the matching-paradigm diagram with the granularities a conversational system must handle: a question asked in context, with candidate answers ranging from an entire passage to a single sentence or an extractive phrase, drawn from a document list or a large document collection.

Figure 1 of the multi-hop dense retrieval paper gives an overview of the approach. For the question "What was the nickname of Judy Lewis's father?", a question encoder produces a first dense query vector, and maximum inner product search (MIPS) over a dense corpus index retrieves passage P1 (Judy Lewis, the secret biological daughter of actor Clark Gable and actress Loretta Young). The question concatenated with P1 is encoded into a second dense query vector, and a second MIPS step retrieves passage P2 (William Clark Gable, often referred to as "The King of Hollywood"), which answers the question. On HotpotQA and a subset of FEVER (Thorne et al., 2018), this approach improves greatly over traditional linking-based retrieval methods (e.g., a 28% relative gain in Recall@20 for HotpotQA); the better retrieval results also lead to new state-of-the-art downstream results on both datasets, while being 10x faster at inference time.

Multi-hop dense retrieval: problem definition. Given a multi-hop question q and a large text corpus C, the retrieval module needs to retrieve a sequence of passages P_seq = {p_1, p_2, ..., p_n} that provide sufficient information for answering q. Practically, the retriever returns the k best-scoring sequence candidates (k << |C|), with the hope that at least one of them has the desired qualities; k should be small enough for downstream modules to process in reasonable time while maintaining adequate recall, and retrieval must remain efficient enough to handle real-world corpora containing millions of documents.

Model. Based on the sequential nature of the multi-hop retrieval problem, the system solves it in an iterative fashion, modeling the probability of selecting a passage sequence as

P(P_seq | q) = ∏_{t=1}^{n} P(p_t | q, p_1, ..., p_{t-1}),

where for t = 1 retrieval is conditioned only on the original question. At each step, a new query representation is constructed from the previous results, and retrieval is implemented as maximum inner product search over the dense representations of the whole corpus:

P(p_t | q, p_1, ..., p_{t-1}) = exp(⟨p_t, q_t⟩) / Σ_{p ∈ C} exp(⟨p, q_t⟩),  where q_t = g(q, p_1, ..., p_{t-1}) and p_t = h(p_t).

Here ⟨·, ·⟩ is the inner product between the query and passage vectors, and h(·) and g(·) are the passage and query encoders that produce the dense representations. To reformulate the query at step t, the question and the previously retrieved passages are simply concatenated as the input to g(·). Instead of a bi-encoder with separately parameterized encoders, a shared RoBERTa-base encoder is used for both h(·) and g(·), which yields considerable improvements.

Training and inference. The retriever is trained as in Karpukhin et al. (2020): each input query (which at each step consists of the question and previously retrieved passages) is paired with a positive passage and m negative passages to approximate the softmax over all passages.
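A toy sketch of the iterative retrieval loop defined by the equations above: at each hop the query representation q_t is rebuilt from the question plus previously retrieved passages and scored against the corpus by inner product. The hashed bag-of-words encoder and brute-force scoring are stand-ins; a real system would use the shared RoBERTa encoder and a MIPS/ANN index such as FAISS.

```python
# Sketch: the iterative multi-hop dense retrieval loop defined above.
# encode() is a toy hashed bag-of-words projection standing in for the shared
# RoBERTa encoder; the brute-force inner product stands in for a MIPS index.
import numpy as np

DIM = 64
rng = np.random.default_rng(0)
projection = rng.normal(size=(50_000, DIM))  # fixed random token-embedding table

def encode(text: str) -> np.ndarray:
    """Toy dense encoder: sum of hashed-token embeddings, L2-normalized."""
    vec = np.zeros(DIM)
    for tok in text.lower().split():
        vec += projection[hash(tok) % projection.shape[0]]
    return vec / (np.linalg.norm(vec) + 1e-9)

def multi_hop_retrieve(question: str, corpus: list[str], hops: int = 2) -> list[str]:
    """At hop t, build q_t = g(q, p_1, ..., p_{t-1}) by concatenating the question
    with previously retrieved passages, then take the arg-max inner product."""
    passage_matrix = np.stack([encode(p) for p in corpus])  # indexed offline
    chosen: list[int] = []
    for _ in range(hops):
        q_t = encode(" ".join([question] + [corpus[i] for i in chosen]))
        scores = passage_matrix @ q_t                        # inner products
        scores[chosen] = -np.inf                             # do not re-retrieve
        chosen.append(int(np.argmax(scores)))
    return [corpus[i] for i in chosen]

corpus = [
    "Judy Lewis was the secret biological daughter of actor Clark Gable.",
    "William Clark Gable was often referred to as The King of Hollywood.",
    "Saosin released their first EP, Translating the Name, in 2003.",
]
print(multi_hop_retrieve("What was the nickname of Judy Lewis's father?", corpus))
```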

Challenge: Training complex models

Open-domain Complex QA

✓ G Sidiropoulos, N Voskarides, S Vakulenko, E Kanoulas: Analysing Dense Passage Retrieval for Multi-Hop Questions. (under review)

2. Account for session history

What is the best available osteoporosis treatment?
(Relate new query to past queries; update belief; Rank List 1)

What are the cons of bisphosphonates?

It is an expensive therapy.

Can it rebuild low bone density?

User asking questions

• Session-oriented search • Task-oriented search • Conversational search

1. legality alcohol marijuana
2. legality alcohol marijuana history
3. legality alcohol history
4. why is marijuana illegal?
5. is marijuana worse for your health than alcohol?
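The reformulations above suggest a simple lexical view of a session: later queries share and refine terms from earlier ones. Below is a generic sketch of session-based lexical query aggregation with recency-decayed term weights; it only illustrates the idea and is not the specific model of the lexical query modeling paper cited next.

```python
# Sketch: a simple lexical session query model. Terms from earlier queries in
# the session are added to the current query with recency-decayed weights.
from collections import Counter

def session_query_model(session_queries: list[str], decay: float = 0.5) -> dict[str, float]:
    """Return term weights for the expanded current query. The last query has
    weight 1, the one before it `decay`, then `decay`**2, and so on."""
    weights: Counter = Counter()
    for age, query in enumerate(reversed(session_queries)):
        for term in query.lower().split():
            weights[term] += decay ** age
    return dict(weights)

session = [
    "legality alcohol marijuana",
    "legality alcohol marijuana history",
    "legality alcohol history",
    "why is marijuana illegal?",
]
for term, w in sorted(session_query_model(session).items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {w:.2f}")
```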

✓ B Carterette, P Clough, M Hall, E Kanoulas, M Sanderson: The TREC Session Track 2011-2014. SIGIR 2016
✓ C Van Gysel, E Kanoulas, M de Rijke: Lexical Query Modeling in Session Search. ICTIR 2016

User asking questions

• Session-oriented search • Task-oriented search • Conversational search

1. What are the different types of hydroponic systems to choose from, and their pros and cons?
2. What hydroponic systems are used in commercial greenhouses?
3. What material is needed to build a hydroponic system myself?
4. Which plants can be grown in a hydroponics system?
5. How long does it take to grow them?
6. What nutrients to test the water for?
7. …

✓ E Kanoulas, E Yilmaz, R Mehrotra, B Carterette, N Craswell, P Bailey: TREC 2017 Tasks Track Overview. TREC 2017

User asking questions

• Session-oriented search • Task-oriented search • Conversational search

Example conversational dialog (Table 1 of the QuReTeC paper):

Turn 1: who formed saosin?
Turn 2: when was the band founded?
Turn 3: what was their first album?
Turn 4: when was the album released?
Resolved: when was saosin's first album released?
Relevant passage to turn #4: "The original lineup for Saosin, consisting of Burchell, Shekoski, Kennedy and Green, was formed in the summer of 2003. On June 17, the band released their first commercial production, the EP Translating the Name."

Query resolution in conversational search

Task: multi-turn passage retrieval. Given the conversation history and the current turn query, retrieve passages that satisfy the user's information need.
Challenge: the current turn query is often under-specified due to zero anaphora, topic change, or topic return, so context from the conversation history is needed to arrive at a better expression of the current turn query (query resolution). In the example above, context from turns #1 ("saosin"), #2 ("band"), and #3 ("first") must be taken into account to resolve turn #4.

QuReTeC (Query Resolution by Term Classification) models query resolution as binary term classification: for each term appearing in the previous turns of the conversation, decide whether to add it to the current turn query or not. The model is based on bidirectional transformers (BERT).

QuReTeC: training data
Extract relevant terms from gold standard resolutions.
Challenge: gold standard resolutions are not always readily available.
Distant supervision: detect terms that overlap between the conversation history and the passage relevant to the current turn, and label those terms as relevant. Such query-passage relevance labels are often readily available in a collection, either as human annotations or inferred from user interactions.

QuReTeC: architecture
Input: the previous and current turns encoded as a single sequence (a BERT encoder with a term classification layer on top).
Output: a binary label for each term in the previous turns (no labels for current turn terms).
The model is trained with a binary cross-entropy loss. QuReTeC outperforms state-of-the-art query resolution models, the distant supervision method substantially reduces the amount of human-curated training data required, and the model is incorporated in a multi-turn, multi-stage passage retrieval architecture evaluated on the TREC CAsT dataset.

✓ N Voskarides, D Li, P Ren, E Kanoulas, M de Rijke: Query Resolution for Conversational Search with Limited Supervision. SIGIR 2020

5.2.2 Intrinsic evaluation – query resolution. The original QuAC dataset [7] contains dialogues on a single Wikipedia article section regarding people (e.g., early life of a singer). Each dialogue contains up to 12 questions and their corresponding answer spans in the section. It was constructed by asking two crowdworkers (a student and a teacher) to perform an interactive dialogue about a specific topic. Elgohary et al. [15] crowdsourced question resolutions for a subset of the original QuAC dataset [7]. All the questions in the dev and test splits of [15] have gold standard resolutions. We use the dev split for early stopping when training QuReTeC and evaluate on the test set. When training with gold supervision (gold standard query resolutions), we use the train split from [15], which is a subset of the train split of [7]; all the questions therein have gold standard resolutions. Since QuAC is not a passage retrieval collection, we obtain distant supervision labels as described in Section 4.3.

Table 4 provides statistics for the two datasets. First, we observe that the QuAC dataset is much larger than CAsT. Also, QuAC has a larger number of terms on average than CAsT (~97 vs ~40) and a larger negative-positive ratio (~20:1 vs ~40:1). This is because in QuAC the answers to the previous turns are included in the conversation history, whereas in CAsT they are not. For this reason, we expect query resolution on QuAC to be more challenging than on CAsT.

The two query resolution datasets have three main differences. First, the conversations in QuAC are centered around a single Wikipedia article section about people, whereas the conversations in CAsT are centered around an arbitrary topic. Second, the answers of the QuAC questions are spans in the Wikipedia section, whereas the CAsT queries have relevant passages that originate from different Web resources besides Wikipedia. Third, later turns in QuAC do depend on the answers in previous turns, while in CAsT they do not (Section 3.1). Interestingly, we demonstrate in Section 6.1 that despite these differences, training QuReTeC on QuAC generalizes well to the CAsT dataset.

5.3 Evaluation metrics

5.3.1 Extrinsic evaluation – retrieval. We report NDCG@3 (the official TREC CAsT evaluation metric), Recall, MAP, and MRR at rank 1000. We also provide performance metrics averaged per turn to show how retrieval performance varies across turns. We report on statistical significance with a paired two-tailed t-test and mark significant increases at p < 0.01.

5.3.2 Intrinsic evaluation – query resolution. We report on Micro-Precision (P), Micro-Recall (R) and Micro-F1 (F1), i.e., metrics calculated per query and then averaged across all turns and topics. We ignore queries that are the first turn of the conversation when calculating the mean, since we do not predict term labels for those.5

4 The Washington Post collection was also part of the original collection, but it was excluded from the official TREC evaluation process and therefore we do not use it.
5 Note that the first turn in each topic does not need query resolution because there is no conversation history at that point; thus the CAsT test set has 20 (the number of topics) fewer queries than in Table 3.
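The intrinsic metrics above reduce to term-level classification scores per query, averaged over non-first-turn queries. As a minimal illustrative sketch (not the authors' evaluation code), they could be computed as follows:

```python
# Illustrative sketch of the intrinsic query-resolution metrics:
# per-query precision/recall/F1 over resolution terms, averaged over non-first turns.

def prf1(gold_terms, pred_terms):
    """Precision, recall and F1 over the sets of gold and predicted resolution terms."""
    gold, pred = set(gold_terms), set(pred_terms)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

def average_prf1(queries):
    """`queries` is a list of (turn_index, gold_terms, predicted_terms); turn indices start at 1."""
    scores = [prf1(g, p) for turn, g, p in queries if turn > 1]  # skip first turns
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3)) if n else (0.0, 0.0, 0.0)

# Example: turn 2 of a conversation where the gold resolution adds {"saosin", "album"}.
print(average_prf1([(2, {"saosin", "album"}, {"saosin"})]))  # (1.0, 0.5, 0.666...)
```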

• We decompose the task into four sub-tasks (CPU, RPS, STI, RG), and propose a modularized CaSE model that uses a weakly-supervised CCCE loss to identify the supporting tokens and PPG to encourage generating more accurate responses.
• We conduct extensive experiments to show the effectiveness of CaSE and identify room for further improvement on conversations with search engines.

2 DATASET
We next describe the stages involved in creating the SaaC dataset.

2.1 Collecting Conversational Queries
TREC CAsT [12] has already built a collection of conversational query sequences, so we reuse their data to reduce development cost. Here, we briefly recall the process used in collecting the TREC CAsT data. The topics in CAsT are collected from a combination of previous TREC topics (Common Core [9], Session Track [18], etc.), MS MARCO Conversational Sessions, and the CAsT organizers [12]. The organizers ensured that the information needs are complex (requiring multiple rounds of elaboration), diverse (across different information categories), open-domain (not requiring expert domain knowledge to access), and mostly answerable (with sufficient coverage in the passage collection). A description of an example topic is shown in Figure 2. Then, the TREC CAsT organizers created sequences of conversational queries for each turn. They started with the description of the topic and manually formulated the first conversational query. After that, they formulated follow-up queries.

Challenge: From document ranking to answer generation

Information need: The Bronze Age collapse and the transition into a dark age
Tell me about the Bronze Age collapse.

Step 1: rewrite the current query if necessary. Tell me about the Bronze Age collapse.

Step 2: list all supporting spans. 1. Bronze Age Britain had large reserves of tin in... 2. A shortage of tin during the Bronze Age collapse...
Step 3: formulate a short conversational response by summarizing the supporting spans. It may be because of a shortage of tin, that is...
Fig. 2. The process of building the SaaC dataset.

ü P. Ren, Z. Chen, Z. Ren, E. Kanoulas, C. Monz, M. de Rijke. Conversations with Search Engines: Datasets & Baselines, TOIS 2020

Challenge: From document ranking to answer generation
New SCAI'21 Conversational Question Answering Challenge: https://scai.info/

1. Question rewriting; 2. Document retrieval; and 3. Question answering.
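As a rough, hypothetical sketch of how these three stages fit together (the function bodies and example data are placeholders, not the challenge's reference implementation):

```python
# Hypothetical three-stage conversational QA pipeline: rewrite -> retrieve -> answer.
from typing import List

def rewrite(question: str, history: List[str]) -> str:
    """Question rewriting: make the current question self-contained.
    A trivial heuristic stands in for a trained seq2seq rewriter."""
    return question if not history else f"{question} (in the context of: {history[-1]})"

def retrieve(query: str, corpus: List[str], k: int = 3) -> List[str]:
    """Document retrieval: naive term-overlap scoring stands in for BM25 or dense retrieval."""
    q_terms = set(query.lower().split())
    return sorted(corpus, key=lambda d: -len(q_terms & set(d.lower().split())))[:k]

def answer(query: str, passages: List[str]) -> str:
    """Question answering: return the best passage; in practice an extractive/generative reader."""
    return passages[0] if passages else ""

history: List[str] = []
corpus = ["The Bronze Age collapse may be linked to a shortage of tin.",
          "Saosin released their first album in 2006."]
for q in ["Tell me about the Bronze Age collapse.", "What caused it?"]:
    resolved = rewrite(q, history)
    print(answer(resolved, retrieve(resolved, corpus)))
    history.append(q)
```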

ü Anantha, R., Vakulenko, S., Tu, Z., Longpre, S., Pulman, S. and Chappidi, S., 2020. Open-Domain Question Answering Goes Conversational via Question Rewriting. arXiv preprint arXiv:2010.04898.

Challenge: Large-scale dynamic collection

• Automate corpus creation (1h/topic)
• Question generation
• Answer relevance

• User simulator
• Question conditioned on previous answer

ü S. Vakulenko et al., Generating Topic-based Question Sequences for an Automated Evaluation of Conversational Passage Ranking (work in progress)

Challenge: Large-scale dynamic collection

Example question sequence 1:
1. what is the speed of a mako shark
2. what do mako sharks eat
3. where does the longfin mako live
4. where do bull sharks live
5. what kind of teeth do sharks have
6. how long do hammerhead sharks live
7. how long do sandbar sharks live
8. how long do tiger sharks live
9. what sharks are endangered
10. what is the location of the tiger shark
11. what is the biggest great white shark

Example question sequence 2:
1. what are the different types of sharks
2. are sharks endangered if so which species
3. tell me more about tiger sharks
4. what is the largest shark ever to have lived on Earth
5. what's the biggest shark ever caught
6. what about for great whites
7. tell me about mako sharks
8. what are mako shark adaptations
9. where do mako sharks live
10. what do mako sharks eat
11. how do mako sharks compare with tiger sharks for being dangerous

Challenge: Large-scale dynamic collection

Example question sequence 1:
1. what are red blood cells
2. how are red blood cells created
3. how is oxygen transported in blood
4. what is anemia
5. what are the symptoms of anemia
6. can anemia go away
7. what are anemia possible causes
8. what causes low red blood cells
9. causes of low iron levels in blood
10. where are red blood cells found
11. what food is high in iron

Example question sequence 2:
1. does anemia cause fatigue
2. causes of anemia in men
3. can anemia cause death
4. what is normocytic anemia
5. symptoms of anemia in children
6. what is the treatment for iron deficiency anemia
7. red blood cell disorder symptoms

3. Asking clarifying questions

Generate and select the most informative question(s); update the belief based on the answer.

User: What is the best available osteoporosis treatment?
System: Are you looking for a pharmaceutical treatment?
User: YES
System: How about combining it with physical exercise?
User: NO, I am too old for that, except if it is easy to do.

Generate and select clarifying questions

ü J. Zou, Y. Chen, E. Kanoulas, Towards Question-based Recommender Systems. SIGIR 2020
ü J. Zou, E. Kanoulas, Towards Question-Based High-Recall Information Retrieval: Locating the Last Few Relevant Documents for Technology Assisted Reviews, TOIS 2020
ü J. Zou, E. Kanoulas. Learning to Ask: Question-based Sequential Bayesian Product Search. CIKM 2019

The world of entities

ü V. Provatorova, S. Vakulenko, E. Kanoulas, K. Dercksen, J. Hulst: Named Entity Recognition and Linking on Historical Newspapers, CLEF 2020
ü C. Wu, E. Kanoulas, M. de Rijke: Learning entity-centric document representations using an entity facet topic model, IPM 2020
ü C. Wu, E. Kanoulas, M. de Rijke: It all starts with entities: A Salient entity topic model. NLE 2020
ü C. Wu, E. Kanoulas, M. de Rijke, W. Lu: WN-Salience: A Corpus of News Articles with Entity Salience Annotations. LREC 2020

Figure 1: Research framework — question pool construction, offline learning using historical data from each user individually and across users, and the interactive search step, which sequentially selects questions to be asked to the user and updates the user preferences; the focus of this work lies in the two latter parts.

3.1 Question Pool Construction. The proposed method of learning informative questions to ask users, described in detail in Sections 3.2 and 3.3, depends on the availability of a pool of questions on product properties. That is, given a product, the user should be able to answer a question, with reference to the relevant product, with a "yes" or a "no". There are different methods one could employ to construct such a set of questions; for instance, one can use labelled topics [31, 32], extracted keywords [11], item categories and attributes, or extract triplets and generate a rich set of questions based on these triplets [19]. In this work, we take a rudimentary approach. Our assumption is that a user can discriminate between relevance based on the language of the documents (i.e., descriptions and reviews) the products are associated with. We then identify informative terms (instantiated by entities in this work) using the entity linking algorithm TAGME [8], similar to previous research [26, 27]. We assume that these informative terms comprise the most important characteristics of a product, and we generate questions about the presence or absence of such an entity in the product-related documents. For example, from the product description "Apple iPhone XS (64GB), gold color. The most durable glass ever in a smartphone. And a breakthrough dual-camera system. iPhone XS is everything you love about iPhone. Taken to the extreme.", the extracted entities can be "Apple", "iPhone XS", "gold color", "smartphone", "dual-camera system", and "iPhone". We do not use any filter on the annotation scores of the TAGME results, i.e., all annotations are considered, which is also a widely used setting in previous work [18, 26]. The proposed algorithm then asks a sequence of questions of the form "Are you interested in [entity]?" to locate the target product.

3.2 Cross-user Duet Learning. In this section, we describe the training algorithm that jointly learns (a) a belief over the effectiveness of questions (entities) in identifying relevant products, R_t(e), and (b) a system belief over the relevance of the products, P_l(θ). Instead of using the SBS algorithm [25], a data-hungry algorithm that requires a large amount of training data for each user's request, our approach uses a duet learning approach on the given topic t.3 The proposed method updates the system belief over products and the entity effectiveness after every question is answered by the user, performing well using limited and weak signals. The system belief training over products learns the interest of the users over products, while the entity effectiveness training over entities learns the reward, or informativeness, of questions, and thus finds the optimal policy for asking questions. Our algorithm performs two rounds. During the offline phase, the algorithm learns the average users' preference over the products within each product category, and how effective entities are in identifying these products. Then, during the online interactive search phase, the algorithm continues learning product relevance on the basis of the user's answers to the algorithm's questions.

The offline training phase is described in Algorithm 1. We assume that there is a target relevant product d* ∈ D.4 The user preferences over the products are modelled by a (multinomial) probability distribution θ* over products D, and the target product is drawn i.i.d. from this distribution. We also assume that there is a prior belief P_0 over the user preferences θ*, which is a probability density function over all the possible realizations of θ*. The prior system belief P_0 is a Dirichlet distribution with hyper-parameter α_0, which can be set by using any other product search algorithm that measures the lexical or semantic similarity between the query and product documents, or any collaborative filtering method. In this paper, we use an uninformative uniform system belief distribution by setting all α_0's to 1, to isolate the effect of the proposed method. That is, the user preference is initialized to be the same for each product.

During the duet training phase, there is a training set D_t for each topic, which contains all of the training products for this topic. For each product in D_t, we generate a set of questions based on all the entities in the collection, obtain a posterior belief using Bayes' rule after every question for the target training product d is answered, and calculate the reward R(e) for each question (entity). We then take the average reward for each entity to be used in the online interactive search, and obtain the trained system belief P_t(θ), which is also used as a prior belief over products during the interactive search.

System Belief: We learn the system belief from the training data of all users on a certain topic, assuming that the training products on a certain topic are related in the entity embedding space and thus can provide useful guidance. One could also learn user-personalized preferences and entity informativeness; however, we do not do so, hypothesizing that users can buy significantly different products.

Let P_n(θ) be the system's belief over θ* at the n-th question. We compute the user preference θ*_n(d) at the n-th question by

θ*_n(d) = E_{θ~P_n}[θ(d)],  ∀d ∈ D.    (1)

Similar to Wen et al. [25], we model the user preference θ* by a multinomial distribution over products D. The system updates its belief after observing the user answer to a question asked, which is sampled i.i.d. from θ*:

P_{n+1}(θ) ∝ P_n(θ) · θ(d).    (2)

We model the prior P_0 by the conjugate prior of the multinomial distribution, i.e., the Dirichlet distribution, with parameter α. Further, we define the indicator vector Z_l(d) = 1{e_l(d) = e_l(d*)}, where e_l(d) indicates whether the product d contains entity e_l or not. From Bayes' rule, the posterior belief prior to question l is

P_{n+1} = Dir(α + Σ_{j=0..n} Z_j).    (3)

From the properties of the Dirichlet distribution, we have

θ*_n(d) = E_{θ~P_n}[θ(d)] = (α(d) + Σ_{j=0..n} Z_j(d)) / Σ_{d'∈D} (α(d') + Σ_{j=0..n} Z_j(d')),    (4)

where α(d) is the entry of α that corresponds to product d. Therefore, the user preference θ* can be updated by counting and re-normalization.

Question Reward: For the question reward learning, we use historical training data to learn the reward of each entity. We define the following simple reward function, which captures how much the target product rises in the ranking relative to the candidate products in the version space during training:

R(e) = (I_before − I_after) / |U|,    (5)

where I_before is the index of the target product in the ranked list before asking the question about entity e, I_after is its index after asking the question about entity e, and |U| is the number of products in the candidate set U. The ranking list is generated according to the user preference θ* over the products. Note that we use the worst ranking index for a product when there are ties over the user preferences; thus, before the first question, I_before is initialized to the last ranking index, which equals the number of products in the collection.

Algorithm 1 (QSBPS offline learning). Input: a product document collection D, the set of annotated entities E, a set of topics T, and a prior Dirichlet parameter α_0. Output: the system belief P_t(θ) and question rewards R_t(e). For each topic t ∈ T: compute the initial user preference from α_0; for each training product d ∈ D_t and each entity e ∈ E, update the system's belief using Bayes' rule and compute the reward R_d(e) = (I_before − I_after) / |U|. Output the trained system belief P_t(θ) and the average reward per entity, R_t(e) = (1/|D_t|) Σ_{d∈D_t} R_d(e).

After the system belief learning and question reward learning, the model uses its current user preference θ*(d) from the belief P_t(θ) and the estimated reward R_t(e) to derive a policy to find the optimal entity to query.

3.3 QSBPS Algorithm. We introduce two versions of the QSBPS algorithm. The first version assumes that there is no noise in the answers of a user: when an entity appears in the text of the relevant product the user gives a correct positive answer, and when it does not the user gives a correct negative answer (see user simulation in Section 4.1.3). The online interactive learning of our proposed algorithm is provided in Algorithm 2. We first load the trained system belief P_t(θ) and question rewards R_t(e), then compute the user preference with prior belief equal to P_t(θ), and find the optimal entity e_l by Equation 6. Inspired by Wen et al. [25], we extend generalized binary search (GBS) over entities to find the entity that splits the probability mass of predicted product relevance closest to two halves, while also maximizing the question reward for the remaining products:

e_l = argmin_e | Σ_{d∈u_l} (2·1{e(d)=1} − 1) · θ*(d) | − λ · R(e),    (6)

where e_l is the l-th chosen entity, u_l is the set of products in the candidate version space when asking the l-th question, e(d) expresses whether the product d contains the entity e or not, and λ is the weight that trades off the question reward R(e). We ask whether the entity e_l is present in the target product that the user wants to find, d*, observe the reply e_l(d*), and remove e_l from the entity pool. Then we reduce U_l, update the system's belief P_l using Bayes' rule, and recalculate the user preference; i.e., the user preference is updated sequentially by Bayesian learning, which is what "sequential Bayesian search" refers to. Since the user preference θ* is a multinomial distribution over products D and P_t(θ) is a Dirichlet distribution, updating the system belief is performed in a manner similar to Eq. 3.

Algorithm 2 (QSBPS online learning). Input: D, E, T, the number of questions to be asked N_q, the trained system belief P_t(θ) and question rewards R_t(e). Output: the user preference θ*. For each topic: load P_t(θ) and R_t(e), and initialize P_l(θ) ← P_t(θ). While l ≤ N_q and |U_l| > 1: compute the user preference from P_l, find the optimal entity e_l (Eq. 6), ask the question about e_l and observe the reply e_l(d*), remove e_l from the entity pool, set U_{l+1} = U_l \ {i ∈ D : e_l(i) ≠ e_l(d*)}, update the system's belief P_{l+1}(θ) ∝ P_l(θ) · θ(d), and set l ← l + 1.

In Algorithm 2 we assume that users, when presented with an entity, know with 100% confidence whether the entity appears in the target product. To relax this assumption, we also propose a noise-tolerant version of the algorithm: we allow the user to make mistakes and provide the algorithm with the wrong answer. We integrate the probability that the user will give the wrong answer to a question about entity e, h(e), into a new objective function at line 7 of Algorithm 2:

e_l = argmin_e | Σ_{d∈D} (2·1{e(d)=1} − 1) · θ*(d) | + λ₂ · h(e) − λ · R_l(e).    (7)

We observe the noisy answer and update the posterior system belief according to this noisy answer. Intuitively, a question will be chosen not only if it is about an informative entity, but also if this entity is one for which users have good confidence in providing an answer.

3 Topics in this paper are product subcategories, which can represent user queries.
4 In the rest of the paper, "product" and "product document" are used interchangeably.
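To make the sequential Bayesian update and entity selection concrete, here is a minimal, self-contained sketch of the online loop under the no-noise assumption (Eqs. 1–6). It is an illustrative approximation, not the authors' implementation; the toy catalog, entity matrix, rewards, and λ value are made up.

```python
import numpy as np

# Toy sketch of QSBPS-style online questioning (no answer noise).
# Rows: products, columns: entities; E[d, e] = 1 if product d contains entity e.
E = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 1],
              [0, 0, 0, 1]])
n_products, n_entities = E.shape
alpha = np.ones(n_products)            # uninformative Dirichlet prior (all 1's)
counts = np.zeros(n_products)          # accumulated indicator counts Z
reward = np.zeros(n_entities)          # rewards R(e) would be learned offline; zeros here
lam = 0.1                              # trade-off weight for the question reward
target = 2                             # index of the (hidden) target product d*
candidates = set(range(n_products))    # version space U_l
asked = set()

for step in range(n_entities):
    theta = (alpha + counts) / (alpha + counts).sum()   # Eq. (4): posterior mean preference

    def objective(e):
        # Eq. (6): split of the preference mass closest to two halves, minus weighted reward.
        split = sum((2 * E[d, e] - 1) * theta[d] for d in candidates)
        return abs(split) - lam * reward[e]

    e_l = min((e for e in range(n_entities) if e not in asked), key=objective)
    answer = E[target, e_l]                              # simulated truthful yes/no answer
    asked.add(e_l)
    candidates = {d for d in candidates if E[d, e_l] == answer}   # prune version space
    counts += (E[:, e_l] == answer).astype(float)        # Bayesian count update (Eq. 3)
    if len(candidates) == 1:
        break

print("remaining candidates:", candidates)   # should contain the target product
```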

Example: Salomon Trail IV Tight
1. Is the product for women? yes → rank 1753
2. Is it a slim fit product? yes → rank 273
3. Is it a stretch product? yes → rank 8
4. Is it a technology product? no → rank 3

User perspective on clarifying questions

The user study proceeds in four steps:
Step 1: Category selection. Imagine that you want to buy a product. Please select the product category you are most familiar with (e.g., the most frequently purchased category). Example category: Home and Kitchen.
Step 2: Target product assignment. Imagine that you want to buy the target product shown below; please read the product title and description very carefully. Example product title: Hog Wild Fish Sticks (Sold Individually).
Step 3: Find the target product. Answer the algorithmically constructed YES/NO questions according to the title and description of your target product (e.g., Is "rosewood" relevant to the product you are looking for?), and click "stop" when you wish to stop answering. A ranking list of search results is shown after each answer (example results: Hog Wild Fish Sticks; Bone & Rosewood Chopsticks; Fred & Friends Good Fortune Chopsticks; 2pk Green Pot Holders/Trivet Set).
Step 4: Questionnaire.
Q1: Did you find our conversational system helpful towards locating the target product?
Q2: Will you use such a conversational system for product search or recommendation in the future?
Q3: What was your experience using the conversational system?
Q4: How many questions are you willing to answer for locating your target product?
Q5: Why did you click the "Stop" button to stop answering in the last step?
Q6: If you selected "other" in Q5, please specify.
Q7: Are the generated questions easy to answer?

ü Zou, J., Kanoulas, E. and Liu, Y. An Empirical Study on Clarifying Question-Based Systems. CIKM 2020


User perspective on clarifying questions

[Figure: cumulative distribution of the number of clarifying questions users answered — overall cumulative plot and interval histogram; percentage of users vs. number of answered questions.]

ü Zou, J., Kanoulas, E. and Liu, Y. An Empirical Study on Clarifying Question-Based Systems. CIKM 2020

User's ability to answer clarifying questions

[Figure: percentage of correct, "not sure", and incorrect answers, broken down per target product and per user.]

ü Zou, J., Kanoulas, E. and Liu, Y. An Empirical Study on Clarifying Question-Based Systems. CIKM 2020

Table 1: Overview of the dataset. M denotes Metadata only and M&R denotes Metadata & Reviews. Arithmetic mean and standard deviation are indicated wherever applicable.

                             Home & Kitchen      Clothing, Shoes & Jewelry
Number of products           8,192               16,384
Number of description docs   8,192               16,384
Number of reviews            79,938              77,640
Length of documents          70.02 ± 73.82       58.41 ± 61.90
Reviews per product          9.76 ± 52.01        4.74 ± 18.60
Number of topics             729                 833
Products per topic           11.24 ± 31.16       19.67 ± 55.24
Number of entities (M)       232,086             385,727
Number of entities (M&R)     1,483,659           1,408,828
Entities per product (M)     28.33 ± 23.93       23.54 ± 19.25
Entities per product (M&R)   181.11 ± 797.55     85.99 ± 276.30

4 EXPERIMENTS AND ANALYSIS

Through the experiments conducted in this work we aim to answer the following research questions:
RQ1 What is the impact of the number of questions asked and the parameter λ that trades the weight of the question reward?
RQ2 What is the influence of using user reviews along with product descriptions?
RQ3 Does our duet training by using other users' data help?
RQ4 What is the performance when considering noisy answers?
RQ5 How effective is our proposed method for finding the best matching product compared to prior works?

4.1 Experimental Setup

4.1.1 Dataset. In our experiments we use the collection of Amazon products5 [16]. It includes millions of customers, products and reviews. Each product contains rich metadata such as title, product description, product categories, and also reviews from customers. Similar to Van Gysel et al. [22], we use the same four product domains from the Amazon product dataset, but due to limited space we only report two domains in this paper: "Home & Kitchen" and "Clothing, Shoes & Jewelry". Statistics on these two domains are shown in Table 1. The documents associated with every product consist of the product description and the reviews provided by Amazon customers. To construct topics (queries) we use the method employed in Van Gysel et al. [22] and Ai et al. [1], i.e., we use a subcategory title from the above two product domains to form a topic string. Each topic (i.e., subcategory) contains multiple relevant products, and products can be relevant to multiple topics. After that, we remove the topics that contain just a single product, since having a single relevant product does not allow constructing a training set and a test set. Similar to [1], we randomly split the dataset into 70% and 30% subsets, and we use the 70% of the products of each topic (i.e., subcategory) for training. We also use 10% of the data as a validation set, which is used to select the optimal parameters and avoid overfitting.

Regarding the user's error rate for each entity, h(e), we consider two settings; the corresponding experiments are discussed under RQ4. In the first setting, all of the h(e) are simply set to equal values, and we experiment with error rates ranging from 0.1 to 0.5 in steps of 0.1 to explore the performance trend of our model under different error rates; an error rate h(e) of 0.5 means that the user has a 50% probability of giving the wrong answer. In the second setting, given that users are usually more confident in their answers about an entity e if e occurs frequently in the given topic, we define h(e) as a function of the average term frequency (TF) of each entity, which lies in the range (0, 0.5]:

h(e) = 1 / (2 · (1 + TF_avg(e))),    (8)

where TF_avg(e) is the average term frequency of entity e in the given topic. The choice of this function is ad hoc, and any other function of any other characteristic of entities could also be used. Ideally, one should conduct a user study to identify a reasonable error-rate function, the properties of entities that affect the error rate, or even the characteristics of the users that influence the error rate. We leave such a user study as future work.

4.1.2 Evaluation Measures. To quantify the quality of the algorithms, we use Mean Reciprocal Rank (MRR) and average Recall@k (k = 5).

5 http://jmcauley.ucsd.edu/data/amazon/
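A tiny sketch of the second noise setting (Eq. 8) and of simulating a noisy user answer; the entity statistics below are invented for illustration and this is not the authors' simulator:

```python
import random

def error_rate(avg_tf: float) -> float:
    """Eq. (8): h(e) = 1 / (2 * (1 + TF_avg(e))), which lies in (0, 0.5]."""
    return 1.0 / (2.0 * (1.0 + avg_tf))

def noisy_answer(true_answer: bool, avg_tf: float, rng: random.Random) -> bool:
    """Flip the truthful yes/no answer with probability h(e)."""
    return (not true_answer) if rng.random() < error_rate(avg_tf) else true_answer

rng = random.Random(0)
# A frequent entity (avg TF = 9) is rarely answered wrongly; a rare one (avg TF = 0) half the time.
print(error_rate(9.0), error_rate(0.0))          # 0.05 0.5
print([noisy_answer(True, 0.0, rng) for _ in range(5)])
```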

Challenge: How to generate clarifying questions

• Train constrained question generation
• Identify what discriminates documents and use as constraints
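A hypothetical sketch of the second bullet: score terms by how much more frequent they are in one candidate document set than in the rest, and use the top terms as lexical constraints for a (not shown) question generator. This is an illustration of the idea, not the authors' in-progress method; all data and thresholds are made up.

```python
from collections import Counter
import math

def discriminative_terms(target_docs, other_docs, top_k=5):
    """Rank terms by smoothed log-odds of appearing in the target set vs. the other set."""
    tf_target = Counter(w for d in target_docs for w in d.lower().split())
    tf_other = Counter(w for d in other_docs for w in d.lower().split())
    n_t, n_o = sum(tf_target.values()) or 1, sum(tf_other.values()) or 1
    def log_odds(w):
        return math.log((tf_target[w] + 0.5) / n_t) - math.log((tf_other[w] + 0.5) / n_o)
    return sorted(tf_target, key=log_odds, reverse=True)[:top_k]

target = ["waterproof trail running tights for women"]
others = ["cotton t-shirt for men", "leather office shoes"]
print(discriminative_terms(target, others))   # candidate constraint terms for question generation
```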

ü M. Aliannejadi et al.: Ad-hoc generation of clarifying questions (work in progress)

Challenge: Update the belief based on the answer

Table 2: Welch's t-test for the average performance improvement for different answer types

                         Pos >= Neg | Pos >= Neutral | Neg >= Neutral
p-value                  0.879      | 0.017          | 0.001

Table 3: Spearman's rank correlation between ΔNDCG and answer length

                         + answers  | - answers      | neutral answers
correlation coefficient  0.041      | 0.210          | 0.146
p-value                  0.11       | 8.5·10⁻³⁹      | 3.1·10⁻¹⁰

Table 4: Qualitative analysis

Inf. need                                       | Q0                        | CQ                                                                    | Answer | ΔNDCG
Drafting of Declaration of Independence         | all men are created equal | Would you like to learn more about Thomas Jefferson?                 | No     | +18.53
Information about Atari arcade games            | atari                     | Would you like to play atari arcade games online?                    | No     | +15.97
How is a credit score determined?               | credit report             | Would you like to know about the process of disputing credit report? | No     | +13.57
Website of HobbyTown USA                        | hobby stores              | Would you like to know the hours of operation?                       | No     | +11.37
Find the TSA website, offering air travel tips  | air travel information    | Would you like to book a flight?                                      | No     | +8.91

Especially in our case, where an unsupervised term-matching ranking model is used, precision can really benefit the performance, as typically the number of keywords mentioned in the conversation is larger. This finding suggests that devising a way to extract keywords from such information-seeking conversations is a meaningful future direction. In addition, we can conclude that when all question and answer tokens are treated naively, without such distinction, asking more closed-form questions is preferable.

Another interesting aspect is the influence of the user's answer length. With respect to the mixed-initiative setting, answer length can act as a proxy signal for quantifying whether the user is indeed taking back the initiative after the system asks the clarifying question. Figure 2 suggests that when the answer is positive, its length does not influence the performance too much. This is supported by Spearman's rank correlation in Table 3. For very short negative answers (typically just "No"), performance usually decreases, as expected. However, when the user responds with a more elaborate answer, the retrieval performance recovers and even improves in many cases. The correlation between answer length and ΔNDCG is significant and also much higher compared to the rest of the answer types. This indicates that even if the "wrong" clarifying question is asked with respect to the information need, the outcome might turn beneficial, as long as the user understands the importance of their interaction and is willing to elaborate further. In the case of neutral answers, the effects are similar, with lengthier answers appearing to be significantly more helpful.

3.3 Importance of contextual words

In this section, we study RQ3. Our previous analysis indicated that for about 30–35% of the cases where a short negative answer was provided, the conversation round helped the retrieval engine achieve better results (Figure 2). This seems peculiar and counter-intuitive, as one would expect that ignoring a clarification question which the user indicated as irrelevant would almost always benefit the performance. To understand why this happens and how a retrieval engine could possibly benefit from it to improve matching, we perform a qualitative analysis of such queries in Table 4. We observe that, throughout the clarification questions, a number of contextually relevant words appear. For instance, Thomas Jefferson is a very relevant entity with respect to the drafting of the Declaration of Independence and his name regularly appears within the relevant results. Likewise, arcade games and book, flight are relevant words which were previously not mentioned and help improve the ranking. Even further, words such as disputing and hours, operation are much less relevant to the information need, but they appear in the relevant documents because these contain related information too (e.g., the opening hours of a store appear on the main website). This indicates that taking such conversations into account has, to some extent, effects similar to query expansion and can prove helpful, even when the added terms are not strictly relevant to the information need.

To further illustrate the importance of this effect, we compute a list of distinctive important words for each specific information need. To do so, we compute the contribution of each term to the relative entropy between the relevant documents and the entire corpus [6]. We count the number of previously unmentioned words appearing in each conversation turn and investigate their correlation with the improvement in performance (Figure 3). Welch's t-test indicates a statistically significant correlation (p-value = 1.6·10⁻⁹³).
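The term-contribution computation described above can be sketched as follows. This is a minimal illustration assuming bag-of-words language models with a small smoothing constant; the function and variable names are hypothetical and not taken from the paper's implementation.

```python
from collections import Counter
from math import log

def term_contributions(relevant_docs, corpus_docs, eps=1e-9):
    """Per-term contribution to the relative entropy (KL divergence) between
    the relevant-document language model and the corpus language model.
    Each document is assumed to be a list of tokens."""
    rel_counts = Counter(t for doc in relevant_docs for t in doc)
    cor_counts = Counter(t for doc in corpus_docs for t in doc)
    rel_total = sum(rel_counts.values())
    cor_total = sum(cor_counts.values())
    contrib = {}
    for term, count in rel_counts.items():
        p_rel = count / rel_total
        p_cor = cor_counts.get(term, 0) / cor_total + eps  # avoid division by zero
        contrib[term] = p_rel * log(p_rel / p_cor)
    return contrib

# The terms with the largest contribution act as the "distinctive important
# words" of an information need, e.g.:
# top_terms = sorted(contrib, key=contrib.get, reverse=True)[:10]
```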
4 CONCLUSION

In this work, we investigated the effects of clarifying questions in open-domain conversational search. We provided insights on ...

ü A Krasakis, M Aliannejadi, N Voskarides, E Kanoulas: Analysing the Effect of Clarifying Questions on Document Ranking in Conversational Search. ICTIR 2020

4. Mixed Initiative: Complex Strategies

1. Provide ranking of document/passages
2. Provide direct answer
3. Ask clarifying question

1. Relevance over items/documents
2. Natural language questions
3. Suggested questions

Decide what is the optimal next action to take
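As a rough illustration of what such a decision step could look like, the skeleton below maps the agent actions listed above to a toy policy; the action names, thresholds, and input signals are hypothetical and not part of the cited work.

```python
from enum import Enum, auto

class AgentAction(Enum):
    RANK_DOCUMENTS = auto()       # provide a ranking of documents/passages
    DIRECT_ANSWER = auto()        # provide a direct answer
    CLARIFYING_QUESTION = auto()  # ask a clarifying question

def next_action(answer_confidence: float, query_ambiguity: float) -> AgentAction:
    """Toy policy: answer directly when confident, clarify when the query is
    ambiguous, otherwise fall back to returning a ranked list."""
    if answer_confidence > 0.8:
        return AgentAction.DIRECT_ANSWER
    if query_ambiguity > 0.5:
        return AgentAction.CLARIFYING_QUESTION
    return AgentAction.RANK_DOCUMENTS
```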

36 Example of Mixed Initiative

ü M Aliannejadi, et al. Modelling and Measuring Conversational Search: An Analysis of Mixed Initiatives and Conversational Strategies (under review)

37 Conversational search agent interface

1. a voice-only CSA, often via a virtual assistant,
2. a chat-based CSA (that is, in-situ within a platform like Slack or Telegram),
3. an augmented search engine interface, or
4. a multi-modal virtual assistant.

38 User Interaction Model of Conversational Search

ü M Aliannejadi, et al. Modelling and Measuring Conversational Search: An Analysis of Mixed Initiatives and Conversational Strategies (under review)

39 User Interaction Model of Conversational Search

[Diagram: simulation of the user interaction model. User actions: Answer Question, Use Suggested Query, Feedback First, Feedback After]
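On the user side, the interaction model above can be simulated by sampling one of the depicted actions at every turn. The sketch below is a placeholder simulator, with a uniformly random choice standing in for a real user model; the action identifiers are just mnemonic strings.

```python
import random

# User actions taken from the interaction model diagram above
USER_ACTIONS = [
    "answer_question",      # answer the agent's clarifying question
    "use_suggested_query",  # pick one of the suggested queries/questions
    "feedback_first",       # give relevance feedback before anything else
    "feedback_after",       # give relevance feedback after inspecting results
]

def simulate_user_turn() -> str:
    """Placeholder: a real simulator would condition the choice on the
    information need and on the agent's previous action."""
    return random.choice(USER_ACTIONS)
```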

ü M Aliannejadi, et al. Modelling and Measuring Conversational Search: An Analysis of Mixed Initiatives and Conversational Strategies (under review)

40 Evaluating the Cost and Benefit of CS

Cumulative gain:

Cumulative effort:

Rate of gain:
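One way to make the three quantities above concrete is to accumulate per-turn gains and efforts over the first k turns of a conversation. The formulation below is a minimal sketch under that assumption (g_i and e_i denote the gain and effort of turn i) and is not necessarily the exact definition used in the cited work.

```latex
G(k) = \sum_{i=1}^{k} g_i, \qquad
E(k) = \sum_{i=1}^{k} e_i, \qquad
\mathrm{RateOfGain}(k) = \frac{G(k)}{E(k)}
```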

41 Grounding Costs

• Time as an estimate of cost.

• Crowdsourcing:

• Feedback

• Assessment
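If effort is grounded in time, each user action type can be mapped to an average completion time, for example measured by timing crowd workers while they give feedback or assess results, and cumulative effort becomes the sum of those times. The numbers below are purely hypothetical placeholders used to illustrate the bookkeeping.

```python
# Hypothetical average completion times in seconds (placeholder values).
TIME_COST_SECONDS = {
    "read_clarifying_question": 4.0,
    "give_feedback": 7.5,
    "assess_result": 12.0,
}

def cumulative_effort(actions):
    """Cumulative effort of a conversation, estimated as total time spent."""
    return sum(TIME_COST_SECONDS[a] for a in actions)

# Example: effort of one short conversation round
print(cumulative_effort(["read_clarifying_question", "give_feedback", "assess_result"]))
```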

42 Grounding Costs

• Time as an estimate of cost.

43 Analyzing Strategies

44 Analyzing Mixed Initiative

45 Analyzing Mixed Initiative

Table 3. 16 dialogue datasets used in our analysis: 15 public and 1 private (OCLC). The last three rows correspond to the different splits of the Meena dataset, which contains human-human and human-machine dialogues (*) that were collected for a chatbot evaluation. All other dialogue datasets contain only human-human dialogues. Dial. len. – average dialogue length, calculated as the number of utterances. Utt. len. – average utterance length, calculated as the number of words. Utt./turn – average number of utterances per single dialogue turn produced by one of the dialogue participants. The number in brackets is a standard deviation from the mean. Outliers for the different dimensions are highlighted in bold.

Dataset             | Dialogues | Domain    | Type        | Subtype | Modality | Source | Setup     | Dial. len. | Utt. len. | Utt./turn
MSDialog [23]       | 35,500    | tech      | info-seek   | IN      | text     | forum  | natural   | 9 (25)     | 67 (87)   | 1 (3)
MANTIS [22]         | 1,400     | multi     | info-seek   | IN      | text     | forum  | natural   | 4 (1)      | 98 (160)  | 1 (0)
OCLC^a              | 560       | library   | info-seek   | IN      | text     | chat   | natural   | 25 (20)    | 12 (15)   | 1 (1)
Ubuntu [20]         | 1,200,000 | tech      | info-seek   | IN      | text     | chat   | natural   | 6 (8)      | 10 (9)    | 1 (1)
SCSdata [35]        | 37        | web       | info-seek   | IS      | speech   | chat   | simulated | 27 (21)    | 16 (25)   | 1 (0)
MISC [34]           | 110       | web       | info-seek   | IS      | speech   | chat   | simulated | 120 (47)   | 7 (7)     | 2 (2)
ReDial [17]         | 10,000    | movies    | info-seek   | IS      | text     | chat   | simulated | 18 (5)     | 6 (5)     | 1 (0)
CCPE [25]           | 502       | movies    | info-seek   | IS      | text     | chat   | simulated | 23 (6)     | 12 (13)   | 1 (0)
Qulac [?]           | 10,277    | web       | info-seek   | IS      | text     | task^b | simulated | 3 (0)      | 8 (3)     | 1 (0)
QuAC [7]            | 11,600    | Wikipedia | info-seek   | IS      | text     | chat   | simulated | 14 (4)     | 9 (7)     | 1 (0)
MultiWOZ [?]        | 10,000    | multi     | task-orient | TO      | text     | chat   | simulated | 13 (5)     | 13 (6)    | 1 (0)
Meena/Mitsuku* [1]  | 100       | open      | social      | CC      | text     | chat   | simulated | 18 (4)     | 8 (10)    | 1 (0)
Meena/Meena* [1]    | 91        | open      | social      | CC      | text     | chat   | simulated | 19 (5)     | 6 (4)     | 1 (0)
Meena/Human [1]     | 95        | open      | social      | CC      | text     | chat   | simulated | 15 (2)     | 13 (10)   | 1 (0)
DailyDialog [18]    | 11,000    | multi     | social      | CC      | text     | book^c | simulated | 7 (4)      | 13 (10)   | 1 (0)
PersonaH [29]       | 102       | personal  | social      | KG      | text     | chat   | simulated | 12 (0)     | 9 (4)     | 1 (0)
WoW [12]            | 22,000    | Wikipedia | social      | KG      | text     | chat   | simulated | 9 (1)      | 16 (7)    | 1 (0)
OpenDialKG [21]     | 13,800    | Freebase  | social      | KG      | text     | chat   | simulated | 6 (2)      | 12 (6)    | 1 (0)

^a This dataset is not public. It is an intellectual property of OCLC (https://www.oclc.org) and is subject to a licence agreement.
^b In Qulac, participants were not paired to participate in a live conversation but added their responses into an on-line form, given the previous utterance and the information need description as a prompt.
^c These dialogues were extracted from a textbook for English learners, i.e., likely created by a single author.
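The statistics reported in the caption (average dialogue length in utterances, average utterance length in words, and average utterances per turn, each with a standard deviation) can be computed along the following lines. Representing a dialogue as a list of (speaker, utterance) pairs is an assumption made for illustration, not the format of any particular dataset above.

```python
from itertools import groupby
from statistics import mean, pstdev

def dialogue_stats(dialogues):
    """dialogues: list of dialogues, each a list of (speaker, utterance) pairs."""
    dial_lens = [len(d) for d in dialogues]                       # utterances per dialogue
    utt_lens = [len(u.split()) for d in dialogues for _, u in d]  # words per utterance
    # Consecutive utterances by the same speaker form a single dialogue turn.
    utts_per_turn = [
        len(list(run))
        for d in dialogues
        for _, run in groupby(d, key=lambda pair: pair[0])
    ]
    as_mean_sd = lambda xs: f"{mean(xs):.0f} ({pstdev(xs):.0f})"
    return {
        "Dial. len.": as_mean_sd(dial_lens),
        "Utt. len.": as_mean_sd(utt_lens),
        "Utt./turn": as_mean_sd(utts_per_turn),
    }
```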

ü S Vakulenko, E Kanoulas, M de Rijke, An Analysis of Mixed Initiative and Collab. in Information-Seeking Dialogues, SIGIR 2020
ü S Vakulenko, E Kanoulas, M de Rijke, Large Scale Analysis of Mixed Initiative in Information-Seeking Dialogues for Conversational Search (under review)

Analyzing Mixed Initiative

(R) Hey! What kind of movies do you like to watch?
(H) I'm really big on indie romance and dramas
(R) Ok what's your favorite movie?
(R) Staying with that genre, have you seen @88487 or @104253?
(R) Those are two really good ones
(H) When I was a kid I liked horror like @181097
(R) @Misery is really creepy but really good. I only recently got into horror

Fig. 9. Two scatter plots with dialogue datasets compared along the three dimensions of mixed initiative. The top plot shows the relation between Direction and Information. The bottom plot shows the relation between Volume and Information.

ü S Vakulenko, E Kanoulas, M de Rijke, An Analysis of Mixed Initiative and Collab. in Information-Seeking Dialogues, SIGIR 2020
ü S Vakulenko, E Kanoulas, M de Rijke, Large Scale Analysis of Mixed Initiative in Information-Seeking Dialogues for Conversational Search (under review)

Summing up …

Agent
1. Provide ranking of document/passages
2. Provide direct answer
3. Ask clarifying question
1. Relevance over items/documents
2. Natural language questions
3. Suggested questions

Evaluation
1. Simulate mixed initiative
2. Decide costs/gains
3. Develop evaluation measures

48 Thanks!