<<

Recent Advances in Conversational Information Retrieval (CIR) - A review of neural approaches

Jianfeng Gao, Chenyan Xiong, Paul Bennett
SIGIR 2020, July 26, 2020 (Xi’an, China)

Outline

• Part 1: Introduction • A short definition of CIR • Task-oriented dialog and Web search • Research tasks of CIR • Part 2: Conversational QA methods • Part 3: Conversational search methods • Part 4: Overview of public and commercial systems Who should attend this tutorial?

• Whoever wants to understand and develop modern CIR systems that • Can interact with users via multi-turn dialogs • Can answer questions • Can help users search / look up information • Can help users with learning and investigation tasks • … • Focus on neural approaches in this tutorial • Hybrid approaches that combine classical AI methods and deep learning methods are widely used to build real-world systems A short definition of CIR

• A Conversational Information Retrieval (CIR) system is • an information retrieval (IR) system with • a conversational interface which • allows users to interact with the system to seek information • via multi-turn conversations of natural language.

• CIR is a task-oriented dialog system (aka. task-oriented bot) • Complete tasks (e.g., information seeking) via multi-turn conversations of natural language Classical task-oriented dialog system architecture

(Figure: classical task-oriented dialog system architecture. A user utterance such as "Find me a restaurant serving Chinese food" goes through Language Understanding (words → meaning, e.g., intent: get_restaurant, food: chinese), Dialog State Tracking, and a Dialog Policy (action selection) inside the Dialog Manager (DM), which consults the Web or databases (APIs) and produces a response via Language Generation, e.g., "Which area do you prefer?" with intent: ask_slot, slot: area.)

User-System interaction in task-oriented dialog

• User: knows (roughly) what she needs, but not what is available. • System: knows what is available, but not what a user needs • Dialog: a two-way process of knowing each other to make a deal

A (multi-domain) dialog session (DB state and belief state are tracked alongside each turn)

1 H: I am looking for a museum in the center of town. Do you have any suggestion?
2 M: There are 6 museums in the centre of town. How about the Broughton house gallery at 98 king street?
   Belief state: {‘attraction’: {‘type’: ‘museum’, ‘area’: ‘centre’}}
3 H: Nice, can you tell me what is the phone number and address?
4 M: The phone number for the Broughton house gallery is 01223314960 and they are located at 98 king street.
5 H: Ok thanks, can you recommend a restaurant in the same area serving Indian food?
6 M: Curry garden serves Indian food and is located in the centre. Would you like me to make a reservation?
7 H: Sure, please book a table for two people on Thursday at 18:00.
8 M: Ok, I have a table for 2 at 18:00 on Thursday. Your reference number is 1234567890. Is there anything else I can help you with?
   Belief state: {‘restaurant’: {‘food’: ‘indian’, ‘area’: ‘centre’}, ‘booking’: {‘day’: ‘Thursday’, ‘people’: ‘2’, ‘time’: ’18:00’}, ‘attraction’: {‘type’: ‘museum’, ‘area’: ‘centre’}}
9 H: That's it, bye.
10 M: Have a great time and please contact us if you need any further assistance. Goodbye.

[Peng+20] User-system interaction in Web search

• User: knows (roughly) what she needs, but not what is available. • System: knows what is available, but not what a user needs • Generally viewed as a one-way information seeking process • User plays a proactive role to iteratively • issue a query, • inspect search results, • reformulate the query • System plays a passive role to make search more effective • Autocomplete a query • Organize search results (SERP) • Suggest related queries

System should interact with users more actively

• How people search -- Information seeking • Information lookup – short search sessions; • Exploratory search based on a dynamic model, an iterative “sense-making” process where users learn as they search, and adjust their information needs as they see search results.

• Effective information seeking requires interaction btw users and a system that explicitly models the interaction by • Tracking belief state (user intent) • Asking clarification questions • Providing recommendations • Using natural language as input/output

[Hearst+11; Collins-Thompson+ 17; Bates 89] A long definition of CIR - the RRIMS properties

• User Revealment: help users express their information needs • E.g., query suggestion, autocompletion • System Revealment: reveal to users what is available, what it can or cannot do • E.g., recommendation, SERP • Mixed Initiative: system and user both can take initiative (two-way conversation) • E.g., asking clarification questions • Memory: users can reference past statements • State tracking • Set Retrieval: system can reason about the utility of sets of complementary items • Task-oriented, contextual search or QA

[Radlinski&Craswell 17] CIR research tasks (task-oriented dialog modules)

• What we will cover in this tutorial • Conversational Query Understanding (LU, belief state tracking) • Conversational document ranking (database state tracking) • Learning to ask clarification questions (action selection via dialog policy, LG) • Conversational leading suggestions (action selection via dialog policy, LG) • Search result presentation (response generation, LG)

• Early work on CIR [Croft’s keynote at SIGIR-19]

• We start with conversational QA which is a sub-task of CIR Outline

• Part 1: Introduction • Part 2: Conversational QA methods • Conversational QA over knowledge bases • Conversational QA over texts • Part 3: Conversational search methods • Part 4: Case study of commercial systems Conversational QA over Knowledge Bases

• Knowledge bases and QA pairs • C-KBQA system architecture • Semantic parser • Dialog manager • Response generation • KBQA w/o semantic parser • Open benchmarks Knowledge bases

• Relational database • Entity-centric knowledge base • Q: what super-hero from Earth appeared first? • Knowledge Graph • Properties of billions of entities • Relations among them • (relation, subject, object) tuples • Freebase, FB Entity Graph, MS Satori, Google KG etc. • Q: what is Obama’s citizenship? • KGs work with paths while DBs work with sets

[Iyyer+18; Gao+19] Question-Answer pairs

• Simple questions • can be answered from a single tuple • Object? / Subject? / Relation? • Complex questions • require reasoning over one or more tuples • Logical / quantitative / comparative • Sequential QA pairs • A sequence of related pairs • Ellipses, coreference, clarifications, etc.

[Saha+18] C-KBQA system architecture

• Semantic Parser • maps the input + context to a semantic representation (logic form) used to query the KB • e.g., "Find me the Bill Murray’s movie." → Select Movie Where {director = Bill Murray} • Dialog manager • Dialog state tracker: maintain/update the state of the dialog history (e.g., QA pairs, DB state) • Dialog policy: select the next system action (e.g., ask clarification questions, answer) • Response generator • Convert the system action to a natural language response, e.g., the follow-up question "When was it released?" • KB search (Gao+19)

Dynamic Neural Semantic Parser (DynSP)

• Given a question (dialog history) and a table • Q: “which superheroes came from Earth and first appeared after 2009?”

• Generate a semantic parse (SQL-query) • A select statement (answer column) • Zero or more conditions, each contains • A condition column • An operator (=, >, <, argmax etc.) and arguments • Q: Select Character Where {Home World = “Earth”} & {First Appear > “2009”} • A: {Dragonwing, Harmonia}

[Iyyer+18; Andreas+16; Yih+15] Model formulation

• Parsing as a state-action search problem • A state S is a complete or partial parse (an action sequence) • An action A is an operation that extends a parse • Parsing searches for an end state with the highest score • Example: “which superheroes came from Earth and first appeared after 2009?” • (A1) Select-column Character • (A2) Cond-column Home World • (A3) Op-Equal “Earth” • (A2) Cond-column First Appeared • (A5) Op-GT “2009”

(Figures: the types of actions and the number of action instances in each type, where numbers/datetimes are the mentions discovered in the question; and the possible action transitions based on their types, with shaded circles as end states.)

[Iyyer+18; Andreas+16; Yih+15] How to score a state (parse)?

• Beam search to find the highest-scored parse (end state) • $V_\theta(S_t) = V_\theta(S_{t-1}) + \pi_\theta(S_{t-1}, A_t)$, with $V_\theta(S_0) = 0$

• Policy function $\pi_\theta(S, A)$ • Scores an action given the current state • Parameterized using different neural networks, one per action type • E.g., a Select-column action is scored using the semantic similarity between question words (embedding vectors) and column-name words (embedding vectors): $\frac{1}{|W_c|} \sum_{w_c \in W_c} \max_{w_q \in W_q} w_q^\top w_c$ (see the sketch below)
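A minimal sketch (not the authors' code) of the Select-column scoring formula above: every word in the column name takes its best match against the question words, and the matches are averaged. The embeddings are assumed to be pre-computed (e.g., GloVe) NumPy vectors.

```python
import numpy as np

def select_column_score(question_vecs, column_vecs):
    """score = (1/|W_c|) * sum_{w_c in W_c} max_{w_q in W_q} w_q^T w_c"""
    scores = []
    for w_c in column_vecs:                                  # each column-name word
        best = max(float(np.dot(w_q, w_c)) for w_q in question_vecs)
        scores.append(best)
    return sum(scores) / len(scores)
```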

[Iyyer+18; Andreas+16; Yih+15] Model learning

• State value function: $V_\theta(S_t) = \sum_{i=1}^{t} \pi_\theta(S_{i-1}, A_i)$ • An end-to-end trainable, question-specific, neural network model • Weakly supervised setting • Question-answer pairs are available • The correct parse for each question is not available • Issue of delayed (sparse) reward • Reward is only available after we get a (complete) parse and the answer • Approximate (dense) reward • Check the overlap of the answers of a partial parse $A(S)$ with the gold answers $A^*$: $R(S) = \frac{|A(S) \cap A^*|}{|A(S) \cup A^*|}$

[Iyyer+18; Andreas+16; Yih+15] Parameter updates

• Make the state value function $V_\theta$ behave similarly to the reward $R$ • For every state $S$ and its (approximated) reference state $S^*$, define the loss as $\mathcal{L}(S) = \big(V_\theta(S) - V_\theta(S^*)\big) - \big(R(S) - R(S^*)\big)$ • Improve learning efficiency by finding the most violated state $\hat{S}$

(Algorithm figure: for each labeled QA pair, find the best approximated reference state and the most violated state, then update the parameters to reduce the loss above; a sketch follows.)
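A minimal sketch (hypothetical helper names, not the released DynSP code) of the weakly supervised update outlined above: the approximate reward is the answer-overlap formula from the previous slide, the reference state S* is found by reward-guided beam search, the most violated state is found under the current model, and a hinge-style step pushes the value margin toward the reward margin.

```python
import torch

def reward(state, gold_answers):
    # Approximate (dense) reward: R(S) = |A(S) ∩ A*| / |A(S) ∪ A*|,
    # computed over the answers of a (possibly partial) parse.
    pred, gold = set(state.answers()), set(gold_answers)
    return len(pred & gold) / max(len(pred | gold), 1)

def train_step(question, table, gold_answers, model, optimizer):
    # (1) best approximated reference state S*, guided by the reward
    s_star = model.beam_search(question, table,
                               guide=lambda s: reward(s, gold_answers))
    # (2) most violated state S^ under the current model
    s_hat = model.beam_search(question, table,
                              guide=lambda s: model.value(s) - reward(s, gold_answers))
    # (3) make the value margin track the reward margin
    loss = (model.value(s_hat) - model.value(s_star)) \
           - (reward(s_hat, gold_answers) - reward(s_star, gold_answers))
    loss = torch.clamp(loss, min=0.0)   # no update once the margin is satisfied
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```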

[Iyyer+18; Taskar+04] DynSP SQA

• “which superheroes came from Earth and first appeared after 2009?” • (A1) Select-column Character • (A2) Cond-column Home World • (A3) Op-Equal “Earth” • (A2) Cond-column First Appeared • (A5) Op-GT “2009” • “which of them breathes fire?” • (A12) S-Cond-column Powers • (A13) S-Op-Equal “Fire breath”

(Figure: possible action transitions based on their types; shaded circles are end states.) [Iyyer+18; Andreas+16; Yih+15] DynSP for sequential QA (SQA)

• Given a question (history) and a table • Q1: which superheroes came from Earth and first appeared after 2009? • Q2: which of them breathes fire?

• Add subsequent statement (answer column) for sequential QA • Select Character Where {Home World = “Earth”} & {First Appear > “2009”} • A1: {Dragonwing, Harmonia} • Subsequent Where {Powers = “Fire breath”} • A2: {Dragonwing}

[Iyyer+18] Query rewriting approaches to SQA

Q1: When was California founded? A1: September 9, 1850
Q2: Who is its governor? → Who is California governor? A2: Jerry Brown
Q3: Where is Stanford? A3: Palo Alto, California
Q4: Who founded it? → Who founded Stanford? A4: Leland and Jane Stanford
Q5: Tuition costs → Tuition cost Stanford A5: $47,940 USD

[Ren+18; Zhou+20] Dialog Manager – dialog memory for state tracking

Dialog Memory (of the state tracker):
• Entities: {United States, “q”}, {New York City, “a”}, {University of Pennsylvania, “a”}, …
• Predicates: {isPresidentOf}, {placeGraduateFrom}, {yearEstablished}, …

• Action subsequences (partial/complete states), e.g., Set → A4 A15 (a sketch of the memory structure follows)
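A minimal illustrative sketch (not the authors' data structure) of such a dialog memory: entities tagged with their source ("q" for question, "a" for answer), predicates mentioned so far, and action subsequences from earlier parses that later turns can copy or extend.

```python
# Illustrative dialog memory for a C-KBQA state tracker.
dialog_memory = {
    "entities": [("United States", "q"), ("New York City", "a"),
                 ("University of Pennsylvania", "a")],
    "predicates": ["isPresidentOf", "placeGraduateFrom", "yearEstablished"],
    "action_subsequences": [["A4", "A15"]],   # partial/complete states from prior turns
}
```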

[Guo+18] Dialog Manager – policy for next action selection

• A case study of Movie-on-demand • System selects to • Either return answer or ask a clarification question. • What (clarification) question to ask? E.g., movie title, director, genre, actor, release-year, etc.

[Dhingra+17] What clarification question to ask

• Baseline: ask all questions in a randomly sampled order • Ask questions that users can answer • learned from query logs • Ask questions that help reduce the search space • entropy minimization • Ask questions that help complete the task successfully • via agent-user interactions

(Figure: task success rate vs. number of dialogue turns, results on simulated users)

[Wu+15; Dhingra+17; Wen+17; Gao+19] Response Generation

• Convert a “dialog act” to a “natural language response” • Formulated as a seq2seq task in a few-shot learning setting: $p_\theta(\mathbf{x} \mid A) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}, A)$ • Very limited training samples for each task • Approach • Semantically conditioned neural language model • Pre-training + fine-tuning • e.g., semantically conditioned GPT (SC-GPT); a sketch of act-conditioned generation follows
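A minimal sketch, assuming a recent version of the Hugging Face `transformers` library, of the inference-time pattern behind act-conditioned generation: the dialog act is linearized into text, fed to a GPT-2-style model, and the continuation is taken as the surface response. SC-GPT itself is GPT-2 further pre-trained on (dialog act, response) pairs and then fine-tuned per task; the act string below is illustrative only.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")   # replace with an SC-GPT-style checkpoint

# Linearized dialog act (illustrative format).
dialog_act = "inform ( restaurant = curry garden ; food = indian ; area = centre ) &"
inputs = tokenizer(dialog_act, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
print(response)   # the generated natural language response
```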

[Peng+20; Yu+19; Wen+15; Chen+19] SC-GPT

Performance of different response generation models in few-shot setting (50 samples for each task)

[Peng+20; Raffel+19] C-KBQA approaches w/o semantic parser

• Building semantic parsers is challenging • Limited amounts of training data, or • Weak supervision • C-KBQA with no logic-form • Symbolic approach: “look before you hop” • Answer an initial question using any standard KBQA • Form a context subgraph using entities of the initial QA pair • Answer follow-up questions by expanding the context subgraph to find candidate answers • Neural approach • Encode the KB as graphs using a GNN • Select answers from the encoded graph using a pointer network

[Christmann+19; Muller+19] Open Benchmarks

• SQA (sequential question answering) • https://www.microsoft.com/en-us/download/details.aspx?id=54253 • CSQA (complex sequential question answering) • https://amritasaha1812.github.io/CSQA/ • ConvQuestions (conversational question answering over knowledge graphs) • https://convex.mpi-inf.mpg.de/ • CoSQL (conversational text-to-SQL) • https://yale-lily.github.io/cosql • CLAQUA (asking clarification questions in knowledge-based question answering) • https://github.com/msra-nlc/MSParS_V2.0 Conversational QA over Texts

• Tasks and datasets • C-TextQA system architecture • Conversational machine reading comprehension models • Remarks on pre-trained language models for conversational QA QA over text – extractive vs. abstractive QA

[Rajpurkar+16; Nguyen+16; Gao+19] Conversation QA over text: CoQA & QuAC

[Choi+18; Reddy+18] Dialog behaviors in conversational QA

• Topic shift: a question about something previously discussed • Drill down: a request for more info about a topic being discussed • Topic return: asking about a topic again after it has been shifted away from • Clarification: reformulating a question • Definition: asking what is meant by a term

[Yatskar 19] C-TextQA system architecture

• (Conversational) MRC module • Find the answer to a question given the text and the previous QA pairs • Extractive (span) vs. abstractive answers • Dialog manager • Dialog state tracker: maintain/update the state of the dialog history (e.g., previous QA pairs) • Dialog policy: select the next system action (e.g., ask a clarification question, answer) • Response generator • Convert the system action to a natural language response

(Figure: example conversation. History: Q1: what is the story about? A1: young girl and her dog; Q2: What were they doing? A2: set out a trip; current question Q3: Where? is answered by the MRC module as A3: the woods.)

[Huang+19] Neural MRC models for extractive TextQA

• QA as classification given (question, text) • Classify each word in the passage as start/end/outside of the answer span • Encoding: represent each passage word using an integrated context vector that encodes info from • Lexicon/word embedding (context-free) • Passage context • Question context • Conversation context (previous question-answer pairs) • Prediction: from each word's integrated context vector, predict whether it is the start or end position of the answer span (see the sketch below)
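A minimal PyTorch sketch (not a specific published model) of the prediction step: a linear layer maps each passage word's integrated context vector to a start logit and an end logit, and the answer span is read off the two argmaxes.

```python
import torch
import torch.nn as nn

class SpanHead(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.start = nn.Linear(hidden_dim, 1)
        self.end = nn.Linear(hidden_dim, 1)

    def forward(self, context_vectors):            # (batch, passage_len, hidden_dim)
        start_logits = self.start(context_vectors).squeeze(-1)
        end_logits = self.end(context_vectors).squeeze(-1)
        return start_logits, end_logits

# Usage: pick the most likely span (ignoring the start <= end constraint here).
head = SpanHead(hidden_dim=768)
vecs = torch.randn(1, 120, 768)                    # fake integrated context vectors
start_logits, end_logits = head(vecs)
span = (start_logits.argmax(-1).item(), end_logits.argmax(-1).item())
```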

[Rajpurkar+16; Huang+10; Gao+19] Three encoding components • Lexicon embedding e.g., GloVe • represent each word as a low-dim continuous vector • Passage contextual embedding e.g., Bi-LSTM/RNN, ELMo, Self-Attention/BERT • capture context info for each word within a passage • Question contextual embedding e.g., Attention, BERT • fuse question info into each passage word vector

(Figure: the three encoding components applied over the question and the passage.)

[Pennington+14; Melamud+16; Peters+18; Devlin+19] Neural MRC model: BiDAF

(Figure: BiDAF architecture: Lexicon embedding → Passage contextual embedding → Question contextual embedding → Integrated context vectors → Answer prediction.)

[Seo+16] Transformer-based MRC model: BERT

(Figure: BERT-based MRC: Lexicon embedding over the question and passage → Passage contextual embedding (self-attention) and question contextual embedding (inter-attention) → Integrated context vectors → Answer prediction.)

[Devlin+19] Conversational MRC models

• QA as classification given (question, text) • Classify each word in the passage as start/end/outside of the answer span • Encoding: represent each passage word using an integrated context vector that encodes info about • Lexicon/word embedding • Passage context • Question context • Conversation context (previous question-answer pairs) • Prediction: from each word's integrated context vector, predict whether it is the start or end position of the answer span.

A recent review on conversational MRC is [Gupta&Rawat 20] Conversational MRC models

• Prepending conversation history to the current question or passage • Converts conversational QA to single-turn QA • BiDAF++ (BiDAF for C-QA) • Append a feature vector encoding the dialog turn number to the question embedding • Append a feature vector encoding the N previous answer locations to the passage embedding • BERT (or RoBERTa) • Prepend the dialog history to the current question • Use BERT for • context embedding (self-attention) • question/conversation context embedding (inter-attention) • (see the input-construction sketch below)
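A minimal sketch (assuming the Hugging Face `transformers` library) of the "prepend the dialog history" recipe: previous QA pairs are concatenated with the current question as the first segment and the passage as the second segment, then fed to a BERT span-prediction model as in single-turn QA. The example text is illustrative.

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

history = [("what is the story about?", "young girl and her dog"),
           ("what were they doing?", "set out a trip")]
question = "where?"
passage = ("The story is about a young girl and her dog, "
           "who set out a trip into the woods.")

# Flatten the history and prepend it to the current question.
history_text = " ".join(q + " " + a for q, a in history)
encoded = tokenizer(history_text + " " + question, passage,
                    truncation=True, max_length=384, return_tensors="pt")
# encoded["input_ids"] is then passed to a BERT model with a span-prediction head.
```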

[Choi+18; Zhu+19; Ju+19; Devlin+19] FlowQA: explicitly encoding dialog history • Integration Flow (IF) Layer • Given: • Current question 푄푇, and previous questions 푄푡, 푡 < 푇 • For each question 푄푡, integrated context vector of each passage word 푤푡 • Output: • Conversation-history-aware integrated context vector of each passage word • 푤푇 = LSTM(푤1, … , 푤푡, … , 푤푇) • So, the entire integrated context vectors for answering previous questions can be used to answer the current question. • Extensions of IF • FlowDelta explicitly models the info gain thru conversation • GraphFLOW captures conversation flow using a graph neural network • Implementing IF using Transformer with proper attention masks

[Huang+19; Yeh&Chen 19; Chen+19] Remarks on BERT/RoBERTa

• BERT-based models achieve SOTA results on conversational QA/MRC leaderboards. • What BERT learns • BERT rediscovers the classical NLP pipeline in an interpretable way • BERT exploits spurious statistical patterns in datasets instead of learning meaning in the generalizable way that humans do, so • Vulnerable to adversarial attack/tasks (adversarial input perturbation) • Text-QA: Adversarial SQuAD [Jia&Liang 17] • Classification: TextFooler [Jin+20] • Natural language inference: Adversarial NLI [Nie+19] • Towards a robust QA model

[Tenney+19; Nie+ 19; Jin+20; Liu+20] BERT rediscovers the classical NLP pipeline in an interpretable way

• Quantify where linguistic info is captured within the network • Lower layers encode more local syntax • higher layers encode more global complex semantics • A higher center-of-gravity value means that the information needed for that task is captured by higher layers

[Tenney+19] Adversarial examples

              Text-QA    Sentiment Classification
              SQuAD      MR       IMDB     Yelp
Original      88.5       86.0     90.9     97.0
Adversarial   54.0       11.5     13.6     6.6

BERTBASE results

[Jia&Liang 17; Jin+20; Liu+20] Build Robust AI models via adversarial training • Standard Training objective

• Adversarial Training in computer vision: apply small perturbation to input images that maximize the adversarial loss

• Adversarial training for neural language modeling (ALUM): • Perturb word embeddings instead of words • Adopt virtual adversarial training to regularize the standard objective (a simplified sketch follows)
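A simplified, single-step sketch of embedding-space adversarial training in the spirit of ALUM (not the exact published recipe), assuming a Hugging Face-style classifier that accepts `inputs_embeds` and returns `.logits`: perturb the word embeddings in the direction that increases the divergence between clean and perturbed predictions, then add the resulting adversarial loss to the standard training objective.

```python
import torch
import torch.nn.functional as F

def adversarial_loss(model, input_embeds, attention_mask, eps=1e-3):
    # Clean predictions (no gradient needed for the reference distribution).
    with torch.no_grad():
        clean_logits = model(inputs_embeds=input_embeds,
                             attention_mask=attention_mask).logits
    # Random initial perturbation of the embeddings.
    delta = torch.zeros_like(input_embeds).uniform_(-eps, eps).requires_grad_(True)
    adv_logits = model(inputs_embeds=input_embeds + delta,
                       attention_mask=attention_mask).logits
    kl = F.kl_div(F.log_softmax(adv_logits, -1), F.softmax(clean_logits, -1),
                  reduction="batchmean")
    # One ascent step on the perturbation (single-step approximation).
    grad, = torch.autograd.grad(kl, delta)
    delta = eps * grad / (grad.norm() + 1e-8)
    adv_logits = model(inputs_embeds=input_embeds + delta.detach(),
                       attention_mask=attention_mask).logits
    # Virtual adversarial loss, added to the standard objective during training.
    return F.kl_div(F.log_softmax(adv_logits, -1), F.softmax(clean_logits, -1),
                    reduction="batchmean")
```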

[Goodfellow+16; Madry+17; Miyato+18; Liu+20] Generalization and robustness

• Generalization: perform well on unseen data • pre-training • Robustness: withstand adversarial attacks • adversarial training • Can we achieve both? • Past work finds that adversarial training can enhance robustness, but hurts generalization [Raghunathan+19; Min+20] • Applying adversarial pre-training (ALUM) improves both [Liu+20]

[Raghunathan+19; Min+20; Liu+20] Outline

• Part 1: Introduction • Part 2: Conversational QA methods • Part 3: Conversational search methods • Part 4: Case study of commercial systems Conversational Search: Outline • What is conversational search? • A view from TREC Conversational Assistance Track (TREC CAsT) [1]

• Unique Challenges in conversational search. • Conversational query understanding [2]

• How to make search more conversational? • From passive retrieval to active conversation with conversation recommendation [3]

[1] CAsT 2019: The Conversational Assistance Track Overview [2] Few-Shot Generative Conversational Query Rewriting [3] Leading Conversational Search by Suggesting Useful Questions

Why Conversational Search: Ad hoc Search vs. Conversational Search

Keyword-ese Queries vs. Natural Queries

Necessity: • Speech/Mobile Interfaces Opportunities: • More natural and explicit expression of information needs Challenge: • Query understanding & sparse retrieval

Why Conversational Search: Ad hoc Search vs. Conversational Search

Ten Blue-Links vs. Natural Responses

Necessity: • Speech/Mobile Interfaces Opportunities: • Direct & Easier access to information Challenge: • Document understanding; combine and synthesize information

Why Conversational Search: Ad hoc Search vs. Conversational Search

Single-Shot Query vs. Multi-Turn Dialog

Necessity: • N.A. Opportunities: • Serving complex information needs and tasks Challenge: • Contextual Understanding & Memorization

Why Conversational Search: Ad hoc Search vs. Conversational Search

Did you mean the comparison between seed investment and crowdfunding?

Passive Serving vs. Active Engaging

Necessity: • N.A. Opportunities: • Collaborative information seeking & better task assistance Challenge: • Dialog management, less lenient user experience

A View of Current Conversational Search

How does seed investment work?

Conversational Queries (R1)

Search

Documents

Response Synthesis → System Response

A View of Current Conversational Search

How does seed investment work? Tell me more about the difference

Conversational Queries (R1), Conversational Queries (R2) → Contextual Understanding → Context-Resolved Query (“Tell me more about the difference between seed and early stage funding”) → Search over Documents

Response Synthesis → System Response

A View of Current Conversational Search

How does seed investment work? Tell me more about the difference

Conversational Queries (R1), Conversational Queries (R2) → Contextual Understanding → Context-Resolved Query → Search over Documents → Response Synthesis → System Response; plus Conversation Recommendation (e.g., “Are you also interested in learning the different series of investments?”) and Learn to Ask Clarifications (e.g., “Did you mean the difference between seed and early stage?”)

A Simpler View from TREC CAsT 2019 • “Conversational Passage Retrieval/QA”

Input: • Manually written conversational queries • ~20 topics, ~8 turns per topic • Contextually dependent on previous queries
Corpus: • MS MARCO + CAR Answer Passages
Task: • Passage retrieval for conversational queries

(Figure: conversational queries (R1, R2) → Contextual Understanding → Context-Resolved Query → Contextual Search → Answer Passages.)

http://treccast.ai/ TREC CAsT 2019 • An example conversational search session

Title: head and neck cancer
Description: A person is trying to compare and contrast types of cancer in the throat, esophagus, and lungs.
1 What is throat cancer?
2 Is it treatable?
3 Tell me about lung cancer.
4 What are its symptoms?
5 Can it spread to the throat?
6 What causes throat cancer?
7 What is the first sign of it?
8 Is it the same as esophageal cancer?
9 What's the difference in their symptoms?
(Input: manually written conversational queries, ~20 topics, ~8 turns per topic, contextually dependent on previous queries. Corpus: MS MARCO + CAR answer passages. Task: passage retrieval for conversational queries.)

http://treccast.ai/ TREC CAsT 2019 • Challenge: contextual dependency on previous conversation queries

(The same example session is repeated on this slide, highlighting the contextual dependencies between queries.)

http://treccast.ai/ TREC CAsT 2019 • Learn to resolve the contextual dependency

Title: head and neck cancer. Description: A person is trying to compare and contrast types of cancer in the throat, esophagus, and lungs.
Raw query → Manual query provided by CAsT Y1:
1 What is throat cancer? → What is throat cancer?
2 Is it treatable? → Is throat cancer treatable?
3 Tell me about lung cancer. → Tell me about lung cancer.
4 What are its symptoms? → What are lung cancer’s symptoms?
5 Can it spread to the throat? → Can lung cancer spread to the throat?
6 What causes throat cancer? → What causes throat cancer?
7 What is the first sign of it? → What is the first sign of throat cancer?
8 Is it the same as esophageal cancer? → Is throat cancer the same as esophageal cancer?
9 What's the difference in their symptoms? → What's the difference in throat cancer and esophageal cancer's symptoms?

http://treccast.ai/ TREC CAsT 2019: Query Understanding Challenge • Statistics in Y1 Testing Queries

Type (# turns)       Utterance                                  Mention
Pronominal (128)     How do they celebrate Three Kings Day?     they -> Spanish people
Zero (111)           What cakes are traditional?                Null -> Spanish, Three Kings Day
Groups (4)           Which team came first?                     which team -> Avengers, Justice League
Abbreviations (15)   What are the main types of VMs?            VMs -> Virtual Machines

CAsT 2019: The Conversational Assistance Track Overview TREC CAsT 2019: Result Statistics • Challenge from contextual query understanding

Notable gaps between auto and manual runs

CAsT 2019: The Conversational Assistance Track Overview TREC CAsT 2019: Techniques • Techniques used in Query Understanding

(Figure: usage fraction and relative NDCG gains of query-understanding techniques across CAsT Y1 runs, e.g., rules, none, external data, coreference resolution, entity linking, unsupervised methods, deep learning, NLP toolkits, Y1 training data, MS MARCO conversational sessions, and Y1 manual testing data.)

CAsT 2019: The Conversational Assistance Track Overview TREC CAsT 2019: Notable Solutions • Automatic run results • Annotated approaches: GPT-2 generative query rewriting [1], BERT ranking [2]

(Figure: NDCG@3 bar chart of CAsT Y1 automatic runs, e.g., pgbert, pg2bert, clacBase, CFDA_CLIP_RUN1, h2oloo_RUN2, RUCIR, UDInfo, mpi-d5, ilps, galago_rel and indri_ql baselines; the annotated approaches use GPT-2 generative query rewriting [1] and BERT-based ranking [2].)

[1] Vakulenko et al. 2020. Question Rewriting for Conversational Question Answering [2] Lin et al. 2020. Query Reformulation using Query History for Passage Retrieval in Conversational Search Conversational Query Understanding Via Rewriting • Learn to rewrite to a full-grown, context-resolved query

Input: q_1 q_2 … q_i (e.g., "What is throat cancer?" … "What is the first sign of it?") → Output: q_i^* ("What is the first sign of throat cancer?")

Vakulenko et al. 2020. Question Rewriting for Conversational Question Answering Conversational Query Understanding Via Rewriting • Learn to rewrite to a full-grown, context-resolved query

Input: q_1 q_2 … q_i (e.g., "What is throat cancer?" … "What is the first sign of it?") → Output: q_i^* ("What is the first sign of throat cancer?")
• Leverage a pretrained NLG model (GPT-2) [1]

GPT-2 (NLG): q_1 q_2 … q_i “[GO]” → q_i^* (a sketch of this inference pattern follows)
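A minimal sketch (assuming the Hugging Face `transformers` library) of generative query rewriting at inference time: previous turns and the current query are concatenated with separator strings, and a GPT-2 model decodes the self-contained rewrite. The separator strings and the off-the-shelf checkpoint are assumptions for illustration; in practice the model is fine-tuned on (session, rewrite) pairs.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")    # replace with a fine-tuned rewriter

session = ["What is throat cancer?", "Is it treatable?", "What is the first sign of it?"]
prompt = " [SEP] ".join(session) + " [GO] "        # illustrative separators

inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30, num_beams=4,
                     pad_token_id=tokenizer.eos_token_id)
rewrite = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                           skip_special_tokens=True)
print(rewrite)   # ideally: "What is the first sign of throat cancer?"
```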

Vakulenko et al. 2020. Question Rewriting for Conversational Question Answering Conversational Query Understanding Via Rewriting • Learn to rewrite to a full-grown, context-resolved query

Input: q_1 q_2 … q_i → Output: q_i^* (GPT-2 rewriter, as above)
• Concern: limited training data
• CAsT Y1 data: manually written conversational queries; 50 topics, 10 turns per topic; 20 topics with TREC relevance labels
• Hundreds of millions of parameters vs. ~500 manual rewrite labels

Vakulenko et al. 2020. Question Rewriting for Conversational Question Answering Few-Shot Conversational Query Rewriting • Train a conversational query rewriter with the help of ad hoc search data

Ad hoc Search: • Billions of existing search sessions • Lots of high-quality public benchmarks. Conversational Search: • Production scenarios still being explored • Relatively new topic, fewer available data

Yu et al. Few-Shot Generative Conversational Query Rewriting. SIGIR 2020 Few-Shot Conversational Query Rewriting • Leveraging ad hoc search sessions for conversational query understanding

(Figure: mapping ad hoc search sessions to conversational query rounds.)

Few-Shot Conversational Query Rewriting • Leveraging ad hoc search sessions for conversational query understanding

Challenges? • Available only in commercial search engines → approximate sessions available in MS MARCO • Keyword-ese → filter by question words

Few-Shot Conversational Query Rewriting • Leveraging ad hoc search sessions for conversational query understanding

Challenges? • Available only in commercial search engines → approximate sessions available in MS MARCO • Keyword-ese → filter by question words • No explicit context dependency?

Few-Shot Conversational Query Rewriting: Self-Training • Learn to convert ad hoc sessions to conversational query rounds

“Contextualizer”: make ad hoc sessions more conversation-alike

Learn to omit information and add contextual dependency: q_1^* q_2^* … q_i^* → GPT-2 Converter → q_i'

Self-contained Queries “Conversation-alike” Queries

Yu et al. Few-Shot Generative Conversational Query Rewriting. SIGIR 2020 Few-Shot Conversational Query Rewriting: Self-Training • Learn to convert ad hoc sessions to conversational query rounds

“Contextualizer”: make ad hoc sessions more conversation-alike

Learn to omit information and add contextual dependency: q_1^* q_2^* … q_i^* → GPT-2 Converter → q_i'

Self-contained Queries → “Conversation-alike” Queries. Training: • X (self-contained q): manual rewrites of CAsT Y1 conversational sessions • Y (conversation-alike q): raw queries in CAsT Y1 sessions. Inference: • X (self-contained q): ad hoc questions from MS MARCO sessions • Y (conversation-alike q): auto-converted conversational sessions. Model: • Any pretrained NLG model: GPT-2 Small in this case

Yu et al. Few-Shot Generative Conversational Query Rewriting. SIGIR 2020 Few-Shot Conversational Query Rewriting: Self-Training • Leverage the auto-converted conversation-ad hoc session pairs

“Rewriter”: recover the full self-contained queries from conversation rounds

Learn from training data generated by the converter: q_1 q_2 … q_i → GPT-2 Rewriter → q_i^*

“Conversation-alike” Queries Self-contained Queries

Yu et al. Few-Shot Generative Conversational Query Rewriting. SIGIR 2020 Few-Shot Conversational Query Rewriting: Self-Training • Leverage the auto-converted conversation-ad hoc session pairs

“Rewriter”: recover the full self-contained queries from conversation rounds

Learn from training data generated by the converter: q_1 q_2 … q_i → GPT-2 Rewriter → q_i^*

“Conversation-alike” Queries → Self-contained Queries. Training: • X (conversation-alike q): auto-converted from the Contextualizer • Y (self-contained q): raw queries from ad hoc MARCO sessions. Inference: • X (conversation-alike q): CAsT Y1 raw conversational queries • Y (self-contained q): auto-rewritten queries that are more self-contained. Model: • Any pretrained NLG model: another GPT-2 Small in this case

Yu et al. Few-Shot Generative Conversational Query Rewriting. SIGIR 2020 Few-Shot Conversational Query Rewriting: Self-Training • The full “self-learning” loop • Learning to omit information is easier than recovering it • GPT-2 Converter: convert ad hoc sessions to conversation-alike sessions • learn from a few conversational queries with manual rewrites

Self-Contained Conversation Queries Queries

GPT-2 Rewriter: Rewrite conversational queries to self-contained ad hoc queries • learn from the large amount of auto-converted “ad hoc” ↔ “conversation alike” sessions

Many more training signals from the Contextualizer (see the sketch of the loop below)
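A high-level sketch of the self-learning loop, using hypothetical helper names (`finetune_gpt2`, session objects with `manual_rewrites`, `raw_queries`, `queries`): a converter is trained on a handful of manual CAsT rewrites to turn self-contained queries into conversation-alike ones, then applied to large ad hoc (MARCO) sessions to synthesize training pairs for the rewriter, which learns the inverse mapping.

```python
def self_learning_loop(cast_sessions, marco_sessions):
    # Step 1: few-shot converter (self-contained -> conversation-alike).
    converter = finetune_gpt2(inputs=[s.manual_rewrites for s in cast_sessions],
                              targets=[s.raw_queries for s in cast_sessions])
    # Step 2: synthesize "conversation-alike" versions of ad hoc sessions.
    synthetic = [(converter.generate(s.queries), s.queries) for s in marco_sessions]
    # Step 3: rewriter learns the inverse (conversation-alike -> self-contained).
    rewriter = finetune_gpt2(inputs=[conv for conv, _ in synthetic],
                             targets=[orig for _, orig in synthetic])
    return rewriter
```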

Yu et al. Few-Shot Generative Conversational Query Rewriting. SIGIR 2020 Few-Shot Conversational Query Rewriting: Results

(Figure: TREC CAsT Y1 BLEU-2 and NDCG@3 for Raw Query, Coreference resolution, Self-Learned GPT-2, and Oracle rewrites; +7% better generation compared to coreference resolution translates into +12% ranking NDCG, approaching the Y1 best.)

Yu et al. Few-Shot Generative Conversational Query Rewriting. SIGIR 2020 How Few-shot Can Pretrained NLG Models Be? • Five sessions are all they need?

(Figure: BLEU-2 and NDCG@3 on CAsT Y1 vs. the number of training sessions (0-40), comparing cross-validation (CV) and Self-Learn; performance plateaus after only a handful of sessions.)

Yu et al. Few-Shot Generative Conversational Query Rewriting. SIGIR 2020 What is learned? • More about learning the task format, not the semantics • Semantics are mostly in the pretrained weights

(Figure: % of rewriting terms copied and % of rewrites starting with question words vs. the number of training steps (0-150), for CV, Self-Learn, and Oracle.)

Yu et al. Few-Shot Generative Conversational Query Rewriting. SIGIR 2020 Auto-rewritten Examples: Win • Surprisingly good at long-term dependency and group reference

Yu et al. Few-Shot Generative Conversational Query Rewriting. SIGIR 2020 Auto-rewritten Examples: Loss • More “fail to rewrite”

Yu et al. Few-Shot Generative Conversational Query Rewriting. SIGIR 2020 CAsT Y2: More Realistic Conversational Dependencies • More interactions between queries and system responses

(Figure: conversational queries (R1, R2) → contextual understanding → context-resolved query → contextual search → answer passages, with dependencies on previous results.)

Developed by interacting with a BERT-based search engine: http://boston.lti.cs.cmu.edu/boston-2-25/ CAsT Y2: More Realistic Conversational Dependencies • More interactions between queries and system responses

Q1: How did snowboarding begin?
R1: … The development of snowboarding was inspired by skateboarding, surfing and skiing. The first snowboard, the Snurfer, was invented by Sherman Poppen in 1965. Snowboarding became a Winter Olympic Sport in 1998.
Q2: Interesting. That's later than I expected. Who were the winners?
Manual rewrite: Who were the winners of snowboarding events in the 1998 Winter Olympics?
Auto rewrite without considering the response: Who were the winners of the snowboarding contest?

Developed by interacting with a BERT-based search engine: http://boston.lti.cs.cmu.edu/boston-2-25/ From Passive Information Supplier to Active Assistant

(Figure: conversational queries → context-resolved query → passive retrieval over documents → system response.)

From Passive Information Supplier to Active Assistant

(Figure: the same pipeline extended with an active assistant that adds conversation recommendations alongside the system response.)

Rosset et al. Leading Conversational Search by Suggesting Useful Questions Making Search Engines More Conversational • Search is moving from "ten blue links" to conversational experiences

https://sparktoro.com/blog/less-than-half-of-google-searches-now-result-in-a-click/

Making Search Engines More Conversational • Search is moving from "ten blue links" to conversational experiences

Yet most queries are not “conversational”: 1. Users are trained to use keywords 2. Fewer conversational queries 3. Less learning signal 4. Less conversational experience (a “chicken and egg” problem)

https://sparktoro.com/blog/less-than-half-of-google-searches-now-result-in-a-click/

Conversation Recommendation: “People Also Ask” • Promoting more conversational experiences in search engines • E.g., for keyword query "Nissan GTR" • Provide the following questions:

What is Nissan GTR?

How to buy used Nissan GTR in Pittsburgh?

Does Nissan make sports car?

Is Nissan Leaf a good car?

Conversation Recommendation: Challenge • Relevant != Conversation Leading/Task Assistance • Users are less lenient to active recommendations

What is Nissan GTR? [Duplicate]

How to buy used Nissan GTR in Pittsburgh? [Too Specific]

Does Nissan make sports car? [Prequel]

Is Nissan Leaf a good car? [Miss Intent]

Conversation Recommendation: Beyond Relevance • Recommending useful conversations that • Help users complete their information needs • Assist users with their tasks • Provide meaningful explorations

Relevant Relevant & Useful

What is Nissan GTR?

How to buy used Nissan GTR in Pittsburgh?

Does Nissan make sports car?

Is Nissan Leaf a good car?

Usefulness Metric & Benchmark • Manual annotations on Bing (query, conversation recommendation) pairs

Types of non-useful ones. • Crucial for annotation consistency

A higher bar of being useful

https://github.com/microsoft/LeadingConversationalSearchbySuggestingUsefulQuestions Conversation Recommendation Model: Multi-Task BERT • BERT seq2seq in the standard multi-task setting

X → Y (concern)
BERT [CLS] Query [SEP] PAA Question → User Click (click bait? not conversation leading)
BERT [CLS] Query [SEP] PAA Question → Relevance (just related?)
BERT [CLS] Query [SEP] PAA Question → High/Low CTR (click bait #2?)
(see the multi-task sketch below)
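A minimal PyTorch sketch (assuming the Hugging Face `transformers` library, not the production model) of the shared-encoder multi-task setup: one BERT encodes "[CLS] query [SEP] suggestion [SEP]", and separate linear heads predict user click, relevance, and high/low CTR from the [CLS] representation.

```python
import torch.nn as nn
from transformers import BertModel

class MultiTaskSuggestionScorer(nn.Module):
    def __init__(self, name="bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(name)
        hidden = self.encoder.config.hidden_size
        self.click_head = nn.Linear(hidden, 2)       # user click
        self.relevance_head = nn.Linear(hidden, 2)   # relevance
        self.ctr_head = nn.Linear(hidden, 2)         # high/low CTR

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        cls = self.encoder(input_ids=input_ids, attention_mask=attention_mask,
                           token_type_ids=token_type_ids).pooler_output
        return {"click": self.click_head(cls),
                "relevance": self.relevance_head(cls),
                "ctr": self.ctr_head(cls)}
```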

Conversation Recommendation: Session Trajectory • Problem: the previous 3 signals were prone to learning click-bait • We need more information about how users seek new information • Solution: imitate how users issue queries in sessions

Task: classify whether the potential next query was issued by the user • Millions of sessions for imitation learning • BERT [CLS] Session [SEP] Potential Next Query → User Behavior

Example session: “Federal Tax Return”, “Flu Shot Codes 2018”, “Facebook”, “Flu Shot Billing Codes 2018”, “How Much is Flu Shot?” → predict the last query from the session context

Conversation Recommendation: Weak Supervision • Learn to lead the conversation from the queries users search in the next turn

PAA tasks:
BERT [CLS] Query [SEP] PAA Question → User Click
BERT [CLS] Query [SEP] PAA Question → Relevance
BERT [CLS] Query [SEP] PAA Question → High/Low CTR
Weak supervision from sessions:
BERT [CLS] Query [SEP] Potential Next Query → User Behavior (user-provided contents: more exploratory, less constrained by Bing)

Conversation Recommendation: Session Trajectory • What kinds of sessions to learn from?

Randomly Chosen Sessions: Noisy and unfocused People often multi-task in search sessions

“Federal Tax Return”, “Flu Shot Codes 2018”, “Facebook”, “Flu Shot Billing Codes 2018”, “How Much is Flu Shot?” ("These don't belong!": the off-topic queries in the session)

Multi-task Learning: Session Trajectory Imitation • What kinds of sessions to learn from?

"Conversational" Sessions: Subset of queries that all have some coherent relationship to each other

(Figure: session queries linked by GEN-Encoder similarity scores, e.g., 0.89, 0.73, and 0.61 among the flu-shot queries and 0.23 to the off-topic “Federal Tax Return”.)

Zhang et al. Generic Intent Representation in Web Search. SIGIR 2019 Multi-task Learning: Session Trajectory Imitation • What kinds of sessions to learn from?

"Conversational" Sessions: Subset of queries that all have some coherent relationship to each other

1. Treat each session as a graph 2. Edge weights are the "GEN-Encoder similarity" (similarity of query intent vector encodings) 3. Remove edges < 0.4 4. Keep only the largest "connected component" of queries (a sketch follows)
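A minimal sketch of this filtering step, assuming `numpy` and `networkx`; `encode_intent` stands in for the GEN-Encoder. Queries in a session are nodes, edges connect pairs whose intent-embedding cosine similarity is at least 0.4, and only the largest connected component is kept as the "conversational" subset.

```python
import numpy as np
import networkx as nx

def conversational_subset(queries, encode_intent, threshold=0.4):
    vecs = [encode_intent(q) for q in queries]          # query intent embeddings
    graph = nx.Graph()
    graph.add_nodes_from(range(len(queries)))
    for i in range(len(queries)):
        for j in range(i + 1, len(queries)):
            sim = np.dot(vecs[i], vecs[j]) / (np.linalg.norm(vecs[i]) *
                                              np.linalg.norm(vecs[j]))
            if sim >= threshold:                         # keep coherent pairs only
                graph.add_edge(i, j)
    largest = max(nx.connected_components(graph), key=len)
    return [queries[i] for i in sorted(largest)]
```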

Zhang et al. Generic Intent Representation in Web Search. SIGIR 2019 Method: Inductive Weak Supervision • Learn to lead the conversation from the queries users search in the next turn

PAA tasks:
BERT [CLS] Query [SEP] PAA Question → User Click
BERT [CLS] Query [SEP] PAA Question → Relevance
BERT [CLS] Query [SEP] PAA Question → High/Low CTR
Weak supervision from sessions:
BERT [CLS] Query [SEP] "Next Turn Conversation" → User Behavior (user next-turn interaction)

Results: Usefulness • Usefulness on human evaluation / our usefulness benchmark

(Figure: usefulness breakdown of PRODUCTION vs. DEEPSUGGEST suggestions across categories such as duplicate, duplicate w/answer, prequel, too specific, duplicate question, misses intent, and useful; DEEPSUGGEST improves usefulness by +35% over the online production system.)

BERT + Clean Session + Conv Session DeepSuggestion

Results: Online A/B • Online experiment results with a large fraction of Bing online traffic.

Metric                         Relative to Online
Online Click Rate (Top)        +8.90%
Online Click Rate (Bottom)     +6.40%
Online Overall Success Rate    0.05%
Offline Usefulness             35.60%
Offline Relevance              0.50%

Example Conversation Question Recommendations • All from the actual systems

Conversational Search Recap

What is conversational search: • A view from TREC CAsT Y1
What are its unique challenges: • Contextual query understanding
How to make search more conversational: • Recommending useful conversations
Much more to be done!
(Figure: the conversational search pipeline with contextual understanding, context-resolved queries, search, response synthesis, conversation recommendation, and learning to ask clarifications.)

Outline

• Part 1: Introduction • Part 2: Conversational QA methods • Part 3: Conversational search methods • Part 4: Case study of commercial systems Overview of Public and Commercial Systems

• Focus Points • Published systems for conversational IR and related tasks • Historical highlights, recent trends, depth in an exemplar

• Research Platforms and Toolkits

• Application areas • Chatbots • Conversational Search Engines • Productivity-Focused Agents • Device-based Assistants • Hybrid-Intelligence Assistants Research platforms and toolkits for building conversational experiences Common Goals of Toolkits

• Abstract state representation

• Democratize ability to build conversational AI to developers with minimal AI experience

• Provide easy code integration to external APIs, channels, or devices Several Widely used Toolkits

Research • Microsoft Research ConvLab

Research platform for comparing models in a more research-oriented environment. • Macaw: An Extensible Conversational Information Seeking Open Source Platform

Research platform for comparing models in a more research-oriented environment. Development • Google’s Dialogflow Conversational experiences integrated with different engagement platforms with integration with Google’s Cloud Natural Language services. • Facebook’s Wit.ai Supports intent understanding and connection to external REST APIs.. • Alexa Developer Tools Develop new skills for Alexa, devices with Alexa integrated for control, and enterprise-related interactions. • Rasa Provides an open source platform for text and voice based assistants. • Microsoft Power Virtual Agents on Azure Integrates technology from the Conversation Learner to build on top of LUIS and the Azure Bot service and learn from example dialogs Macaw

• Macaw is an open-source platform for conversational information seeking research.

• Macaw is implemented in Python and can be easily integrated with popular deep learning libraries, such as TensorFlow and PyTorch.

Zamani & Craswell, 2019. Macaw supports multi-modal interactions. The modular architecture of Macaw makes it easily extensible (simple configuration; a list of request messages is dispatched to actions).

Action 1: Search • Query Generation: co-reference resolution, query re-writing, generate a language model (or query) • Retrieval Model (Search Engine): Indri, Bing API, BERT re-ranking • Result Generation

Action 2: QA • Query Generation: co-reference resolution, query re-writing, generate a language model (or query) • Retrieval Model: Indri, Bing API, BERT re-ranking • Answer Generation: machine reading comprehension (e.g., DrQA)

Action 3: Commands • Command Processing: identifying the command, command re-writing • Command Execution: command specific • Result Generation

Conversation Learner: Learn from dialogs, emphasize easy correction

User-generated: example conversations used to train the bot

Machine-learned runtime: next-action prediction based on word embeddings & conversational context

Machine teaching UI: for correcting errors and continual improvement

Power Virtual Agent: Combine rule-based and ML-based approaches with machine teaching

Graphical bot creation

Slot-filling capabilities

Part of Microsoft’s Power Platform Chatbots Overview

• Historical Review • Types • Social • Task-oriented Completion • Information bots • Recommendation-focused bots

• Increasingly, bots blend all of these. Both EQ and IQ are seen as key parts of HCI design for chatbots. A few well-known Chatbots

• ELIZA (Weizenbaum, 1966)

• PARRY (Colby et al, 1975)

• ALICE (Wallace, 2009) A few well-known Chatbots

• ELIZA (Weizenbaum, 1966)

• PARRY (Colby et al, 1975)

• ALICE (Wallace, 2009)

Excerpted from Weizenbaum (CACM, 1966). Eliza simulated a Rogerian psychotherapist that primarily echoes back statements as questions. A few well-known Chatbots

• ELIZA (Weizenbaum, 1966)

• PARRY (Colby et al, 1975)

• ALICE (Wallace, 2009)

PARRY was an attempt to simulate a paranoid schizophrenic patient to help understand more complex human conditions. Vint Cerf hooked up ELIZA and PARRY to have a conversation on ARPANET (excerpt from [Cerf, Request for Comments: 439, 1973]) A few well-known Chatbots

• ELIZA (Weizenbaum, 1966)

• PARRY (Colby et al, 1975)

• ALICE (Wallace, 2009)

From transcript of Loebner 2004 Contest of Turing’s Imitation Game where ALICE won the gold medal (as reported in [Shah, 2006] )

Spike Jonze cited ALICE as inspiration for screenplay of Her (Morais, New Yorker, 2013) XiaoIce (“Little Ice”) [Zhou et al, 2018]

• Create an engaging conversation: the journey vs the destination • Most popular social chatbot in the world • Optimize long-term user engagement (Conversation-turns Per Session) • Released in 2014 • More than 660 million active users • Average of 23 CPS

• Available in other countries under other names (e.g. Rinna in Japan) Evolution of Social Connection

Excerpted from Zhou et al, 2018

Building rapport and connection Evolution of Social Connection

Excerpted from Zhou et al, 2018

Implicit information seeking Evolution of Social Connection

Excerpted from Zhou et al, 2018

Encouraging social norms as part of responsible AI Time-sharing Turing Test

• View as a companion and goal is for person to enjoy companionship.

• Empathetic (Cai 2006; Fung et al. 2016) to recognize human emotions and needs, understand context, and respond appropriately in terms of relevant and long-term positive impact of companionship

• Empathetic computing layer recognizes emotion, opinion on topic, interests, and responsible for consistent bot personality etc. Responsible AI and Ethics

• Microsoft Responsible AI: https://www.microsoft.com/en-us/ai/responsible-ai

• Microsoft’s Responsible bots: 10 guidelines for developers of conversational AI • Articulate the purpose of your bot and take special care if your bot will support consequential use cases. • Be transparent about the fact that you use bots as part of your product or service. • Ensure a seamless hand-off to a human where the human-bot exchange leads to interactions that exceed the bot’s competence. • Design your bot so that it respects relevant cultural norms and guards against misuse • Ensure your bot is reliable. • Ensure your bot treats people fairly. • Ensure your bot respects user privacy. • Ensure your bot handles data securely. • Ensure your bot is accessible. • Accept responsibility Key Focus Points for Principles of Responsible AI Design in XiaoIce • Privacy Includes awareness of topic sensitivity in how groups are formed and use of conversations

• Control User-focused control with right to not respond for XiaoIce and potential harm (including a model of breaks and diurnal rhythms to encourage boundaries in usage)

• Expectations Always represent as a bot, help build connections with others, set accurate expectations on capabilities

• Behavioral standards Through filtering and cleaning adhere to common standards of morality and avoid imposing values on others. High-level Guidance to Maintain Responsible AI in XiaoIce

• Aim to achieve and consistently maintain a reliable, sympathetic, affectionate, and wonderful sense of humor in persona of bot.

• Learn from examples of public-facing dialogues specific to culture and local, labeled into desired vs undesired behavior. Driving long-term engagement

• Generic responses yield long-term engagement but lead to user attrition as measured by Number of Active Users (NAU) [Li et al. 2016c; Fang et al. 2017]

Example: “I don’t understand, what do you mean?”

• Topic selection • Contextual relevance and novelty: related to discussion so far but novel • Freshness: Currently in focus in the news or other sources. • Personal Interests: Likely of interest to the user • Popularity: High attention online or in chatbot • Acceptance: Past interaction with topic from other users high Overall Interaction model

• Extensible skill set (200+) which determines mode: General, Music, Travel, Ticket-booking

• Hierarchical Decision-Making governs dialog • Determine current mode using Markov Decision Process (e.g. image of food might trigger Food Recommendation skill) • Prompt or respond • Update

• New information (e.g. particular musical artists of interest) is remembered to help create more engaging dialogue in the future

• Explore (learn more about interests) vs Exploit (engage on known topics of interests and highly probable contextual replies) Chat Styles and Applications of XiaoIce

• Basic chat fuses two styles of chat • IR based chat which uses retrieval from past conversations filtered for appropriateness • Neural based chat which is trained on filtered query-response pairs

• Applications • Powers personal assistants and virtual avatars • Lawson and Tokopedia customer service • Pokemon, Tencent, NetEase chatbots Toward Conversational Search Evolution of Search Engine Result Page

Entity pane for understanding related attributes Evolution of Search Engine Result Page

Instant answers and perspectives Evolution of Search Engine Result Page

Useful follow-up questions once this question is answered Clarification Questions: demonstrate understanding while clarifying

[Zamani et al, WebConf 2020; SIGIR 2020] Contextual Understanding: sample TREC CAsT 2019 topic Contextual Understanding in Search: a variety of attempts … the future? Productivity and Personal-Information Conversational Search DARPA Personal Assistants that Learn (PAL) CALO / RADAR

Key Focus Points • Calendar management [Berry et al, 2003; Berry et al., 2006; Modi et al., 2004] • Dealing with uncertain resources in scheduling [Fink et al., 2006] • Task management [Freed et al. 2008]

From Freed et al. 2008 From PAL to SIRI

• Learnings from the PAL project including CALO/SIRI recognized need for unifying architectures. [Guzzoni et al., 2007]

A “do engine” rather than a “search engine”

From Guzzoni et al, 2007 Device-based Assistants

• Mobile phone based assistants • Includes: Apple’s Siri, Google Assistant, Microsoft’s Cortana • Blends productivity-focused and information focused with voice-related recognition

• Situated speakers and Devices • Amazon Alexa, Google Home, Facebook Portal w/Alexa, etc. • Combines microphone arrays, multi-modal, multi-party devices in addition Hybrid Intelligence

• Mix AI and human computation to achieve an intelligent experience that leverages the best of both worlds and pushes the envelope of what is possible.

• When escalated to human, often serves as a feedback loop for learning.

• Examples: • Facebook’s M • Microsoft’s Calendar.help Calendar.help → Scheduler

• Initially high-precision rules • Unhandled cases handled by low latency human crowdsourcing workflows • Transition flywheel to https://calendar.help [Cranshaw et al., 2017] Current application-oriented research questions

• Long-term evaluation metrics for engagement beyond CPS and NAU (cf. Lowe et al. [2017]; Serban et al. [2017]; Sai et al. [2019]) • Other metrics of social companionship: linguistic accommodation or coordination? • Application to detection: Relationship to the inverse problems of toxicity, bias, etc.

• Aspirational goal-support from assistants

• Best proactivity engagement based on model of interests

• Integrating an understanding of physical environment Challenges for Conversational Interaction

• Human-AI Interaction Design • Goal-directed design: Enable people to express goals flexibly and allow the agent to progress toward those goals. • Gulf of evaluation: Communicate the range of skills of an intelligent agent to users and what is available in current context.

• Conversational Understanding • Grounded Language Generation and Learning: Transform NL intent to action that depends on state and factual correctness. • Extensible Personalized Skills: Support new skills and remember preferences to evaluate changes/updates.

• External World Perception and Resource Awareness • Multi-modality input and reasoning: Integrate observations from modalities including voice, vision, and text. • Identity and interactions: Identify people around and interact with them appropriate to setting. • Physical understanding: Monitor physical situation and intelligently notify for key situations (safety, anomalies, interest). • Constrained scheduling: Support reasoning about limited and bound resources such as space/time constraints, keep knowledge of constraints to deal with updates, etc. Challenges for Conversational Interaction

• Principles & Guarantees • Responsible AI: Evolve best practice and design new techniques as new ethical challenges arise. • Privacy: Reason about data in a privacy aware way (e.g. who is in room and what is sensitive).

• Richer paradigms of supervision and learning • Programming by Demonstration/Synthesis: Turn sequences of actions into higher level macros/scripts that map to NL. • Machine Teaching: Support efficient supervision schemes from a user-facing perspective that also enable resharing with others (especially for previous bullet).

• Advanced Reasoning • Attention: Suspend and resume conversation/task naturally based on listener’s attention. • Emotional Intelligence: Support the emotional and social needs of people to enable responsible AI and multi-party social awareness. • Causal Reasoning: Reason about the impact of taking an action. Upcoming Book (by early 2021) Neural Approaches to Conversational Information Retrieval (The Information Retrieval Series)

Contact Information: Jianfeng Gao https://www.microsoft.com/en-us/research/people/jfgao/ Chenyan Xiong https://www.microsoft.com/en-us/research/people/cxiong/ Paul Bennett https://www.microsoft.com/en-us/research/people/pauben/

Slides: Please check our personal websites.