Mitigation of Data Scarcity Issues for Semantic Classification in a Virtual Patient Dialogue Agent

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Adam Stiff, BS

Graduate Program in Computer Science and Engineering

The Ohio State University

2020

Dissertation Committee:

Eric Fosler-Lussier, Advisor
Michael White
Yu Su

© Copyright by

Adam Stiff

2020

Abstract

We introduce a virtual patient question-answering dialogue system, used for training

medical students to interview real patients, which presents many unique opportunities for

research in linguistics, speech, and dialogue. Among the most challenging research topics at

this point in the system’s development are issues relating to scarcity of training data. We

address three main problems.

The first challenge is that many questions are very rarely asked of the virtual patient, which leaves little data to learn adequate models of these questions. We validate one

approach to this problem, which is to combine a statistical question classification model with a rule-based system, by deploying it in an experiment with live users. Additional work

further improves rare question performance by utilizing a recurrent neural network model with a multi-headed self-attention mechanism. We contribute an analysis of the reasons

for this improved performance, highlighting specialization and overlapping concerns in

independent components of the model.

Another data scarcity problem for the virtual patient project is the challenge of adequately

characterizing questions that are deemed out-of-scope. By definition, these types of questions

are infinite, so this problem is particularly challenging. We contribute a characterization of

the problem as it manifests in our domain, as well as a baseline approach to handling the

issue, and an analysis of the corresponding improvement in performance.

Finally, we contribute a method for improving performance of domain-specific tasks

such as ours, which use off-the-shelf speech recognition outputs as inputs, when no in-domain

speech data is available. This method augments text training data for the downstream task with inferred phonetic representations, to make the downstream task tolerant of speech

recognition errors. We also see performance improvements from sampling simulated errors

to replace the text inputs during training. Future enhancements to the spoken dialogue

capabilities of the virtual patient are also considered.

Dedicated to Elizabeth, without whose support—and patience—

this would not have been possible.

Acknowledgments

Nothing written here will ever be adequate to acknowledge the full depth and breadth

of the myriad of interactions that have contributed to me getting to this point, so I am

embracing imperfection and keeping this short. Just know that if you are reading this, I

undoubtedly have a reason to thank you, and I hope that I have the wisdom to appreciate what that reason is.

I would like to extend special thanks to all of my collaborators whose efforts contributed

to work described in this dissertation. Chapter 3 includes descriptions of some work by

Kellen Maicher, Doug Danforth, Evan Jaffe, Marisa Scholl, and Mike White; Chapter 4

includes contributions from Mike White, Eric Fosler-Lussier, Lifeng Jin, Evan Jaffe, and

Doug Danforth; Chapter 5 was a collaboration with Qi Song and Eric Fosler-Lussier; and

Chapter 6 was coauthored with Prashant Serai and Eric Fosler-Lussier.

Cheers, thanks, and best wishes to the friends made along the way: Deblin, Denis, Peter,

Chaitanya, Joo-Kyung, Prashant, Ahmad, Manirupa, and the innumerable others who have

also provided invaluable insight, support, humor, feedback, or otherwise made this journey

a little more enjoyable.

Finally, I want to express my extreme gratitude to Eric Fosler-Lussier for his mentorship

and efforts in helping/pushing/dragging me over the finish line. Thanks for the guidance when I needed it, for the space to learn how to guide myself, and for the patience that I

probably didn’t deserve.

Vita

May 2020 ...... PhD, Computer Science and Engineering, The Ohio State University, USA.
March 2006 ...... BS, Biochemistry, The Ohio State University, USA.

Publications

Research Publications

Adam Stiff, Qi Song, and Eric Fosler-Lussier. How Self-Attention Improves Rare Class Performance. In Proceedings of the 21st Annual SIGdial Meeting on Discourse and Dialogue. Association for Computational Linguistics, 2020.

Prashant Serai, Adam Stiff, and Eric Fosler-Lussier. End to end speech recognition error prediction with sequence to sequence learning. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6339-6343. IEEE, 2020.

Adam Stiff, Prashant Serai, and Eric Fosler-Lussier. Improving human-computer interaction in low-resource settings with text-to-phonetic data augmentation. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7320-7324. IEEE, 2019.

Deblin Bagchi, Peter Plantinga, Adam Stiff, and Eric Fosler-Lussier. Spectral feature mapping with mimic loss for robust speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5609-5613. IEEE, 2018.

Fields of Study

Major Field: Computer Science and Engineering

Studies in:
Artificial Intelligence: Prof. Eric Fosler-Lussier
Computational Linguistics: Prof. Michael White
Computational Neuropsychology: Prof. Alexander Petrov

Table of Contents

Page

Abstract ...... ii

Dedication ...... iv

Acknowledgments ...... v

Vita ...... vi

List of Tables ...... xi

List of Figures ...... xii

1. Introduction ...... 1

1.1 Overview and Contributions ...... 5

2. Background and Related Work ...... 8

2.1 Issues in Dialogue ...... 9
    2.1.1 Response Generation ...... 10
    2.1.2 Natural Language Understanding ...... 14
    2.1.3 Spoken Dialogue ...... 17
2.2 Text Classification ...... 19
2.3 One Class Classification ...... 22
2.4 Speech Recognition ...... 23
2.5 Other Virtual Patients ...... 24
2.6 Conclusion ...... 25

3. System Overview ...... 26

3.1 Data ...... 26
    3.1.1 Annotation ...... 28
    3.1.2 Core Data Set ...... 30
    3.1.3 Enhanced Data Set ...... 32
    3.1.4 Spoken Data Set ...... 32
    3.1.5 Strict Relabeling ...... 35
3.2 System Description ...... 36
    3.2.1 Front End Client ...... 37
    3.2.2 Back End Web Service ...... 38
    3.2.3 Response Model ...... 39
    3.2.4 Miscellaneous ...... 43
    3.2.5 Spoken Interface Extensions ...... 43
3.3 Challenges and Opportunities ...... 46
3.4 Conclusion ...... 49

4. Hybrid System Deployment and Out-of-Scope Handling ...... 50

4.1 Experiment 1: Baseline Reproduction ...... 51
    4.1.1 Stacked CNN ...... 54
    4.1.2 Training ...... 57
    4.1.3 Baseline ...... 58
    4.1.4 CNN Results ...... 58
    4.1.5 Hybrid System ...... 61
4.2 Experiment 2: Prospective Replication ...... 64
    4.2.1 Experimental Design ...... 65
    4.2.2 Results and Discussion ...... 66
4.3 Experiment 3: Out-of-Scope Improvements ...... 74
    4.3.1 Models ...... 77
    4.3.2 Training and Evaluation ...... 79
    4.3.3 Baseline ...... 82
    4.3.4 Results ...... 83
    4.3.5 Discussion ...... 85
4.4 Conclusion ...... 91

5. How Self-Attention Improves Rare Class Performance ...... 94

5.1 Introduction ...... 95
5.2 Related Work ...... 97
5.3 Task and Data ...... 99
5.4 Experimental Design and Results ...... 100
5.5 Analysis ...... 104
    5.5.1 Why did BERT perform less well? ...... 104
    5.5.2 Analyzing the Self-attention RNN ...... 105
5.6 Conclusion ...... 111

6. Modality Transfer with Text-to-Phonetic Data Augmentation ...... 116

6.1 Introduction ...... 117
6.2 Related Work ...... 119
6.3 Model ...... 120
6.4 Experiments ...... 121
    6.4.1 Data ...... 122
    6.4.2 Experimental details ...... 123
6.5 Results ...... 124
6.6 Discussion ...... 126
6.7 Future Work ...... 127

7. Conclusion and Future Work ...... 129

Appendices 134

A. Label Sets ...... 134

A.1 Strict Relabeling Labels ...... 134

B. Table Definitions ...... 137

List of Tables

Table Page

3.1 An example conversation opening...... 27

3.2 The implemented REST-like API...... 40

3.3 Extensions to the back end API...... 45

4.1 Mean accuracy across ten folds, with standard deviations...... 58

4.2 Features used by the chooser...... 61

4.3 Hybrid system results...... 63

4.4 Summary statistics of collected data...... 66

4.5 Raw accuracies of control and test...... 67

4.6 Top changes in class frequency from core to enhanced data...... 70

4.7 Adjusted accuracies of control and test...... 73

4.8 New features used by the Three-way chooser...... 80

4.9 Development accuracy with additional features ...... 83

4.10 Effects of retraining on test performance ...... 83

4.11 Test results for three-way choosers with retraining ...... 84

5.1 Dev set results comparing different models...... 103

6.1 Test set question classification accuracy...... 125

List of Figures

Figure Page

1.1 The user interface of the Virtual Patient...... 2

3.1 Frequency of labels in the core dataset by rank, with quintiles by color . . . 31

3.2 Overview of client-server architecture...... 40

3.3 System architecture with spoken interface extensions...... 46

4.1 Overview of the stacked CNN architecture...... 53

4.2 Accuracy of the tested models by label quintiles...... 60

4.3 Accuracy of hybrid system by label quintiles...... 72

4.4 A simple illustration of scope and domain boundaries...... 75

4.5 t-SNE plot of Multilabel chooser training data ...... 86

4.6 Quintile accuracies for multilabel components, with baselines ...... 90

5.1 The self-attentive RNN model...... 101

5.2 Quintile accuracies for the tested RNN and CNN baseline ...... 106

5.3 Example inputs with bottleneck attention head representations...... 108

5.4 A rare class borrows representations from two frequent classes...... 109

5.5 Excerpt of dendrogram from agglomerative clustering of full representations of head 3...... 114

5.6 Second excerpt of dendrogram from agglomerative clustering of full representations of head 3...... 115

6.1 Overview of the classification model...... 121

Chapter 1: Introduction

In order to help their patients, physicians need to be able to figure out what the patient’s problem really is. The first line of attack in that process is a very simple one: asking questions. Of course, the circumstances of each medical case are different, so the task of finding the right questions to ask is nontrivial. Medical students are given explicit instruction and training to develop this skill, and must demonstrate proficiency to pass board examinations. Traditionally, this training and testing is performed by hiring an actor to play the part of a patient visiting a doctor to pursue treatment for some ailment (known as a standardized patient). This arrangement introduces some practical problems which are mitigated by the introduction of a virtual patient, which is the basis of this dissertation.

Perhaps the most obvious issues with patient actors are cost and convenience: actors must be paid and trained, so individual student access to the patient actors is necessarily limited, and must be carefully coordinated across a cohort of students. Further, due to the cost, any extra training time that students may need or want might not be available.

More subtle, but of at least equal importance, are the issues of consistency and feedback latency. An actor sitting in an examination room for a full day, to give the same details to dozens of interviewers, will understandably become bored or fatigued. This can lead to interactions which are less instructive for the students scheduled at the end of the day, e.g., a patient providing answers to questions that were not asked, to speed up the interview.

Figure 1.1: The user interface of the Virtual Patient. After the user presses the Enter key, the patient will respond with, “Pretty good, except for this back pain.”

This can also introduce assessment problems: if a student receives important information

they didn’t ask for, the instructor cannot evaluate whether or not they knew to ask. An

actor may also develop clarification strategies over the course of a day, which make earlier

interactions more challenging. More generally, simple human error and variability can

produce substantially different experiences for different students, which is usually not

ideal in instructional situations. Finally, in order for students to receive feedback on their

performance, the interview must be graded by a medical educator, whose time is necessarily

limited. This can introduce delays from when the interview is performed, which is when

feedback would be most valuable for learning.

A well-built virtual patient (i.e., a chatbot with a graphical avatar, see Figure 1.1) can

address all of these issues. Once developed, a virtual patient can be deployed at a low cost, with constant, concurrent availability through a chosen electronic interface. This low cost

also means that the students can practice with more scenarios and repeat interactions, which

is prohibitively expensive with standardized patient actors. The patient can be programmed

to give consistent responses to equivalent questions, and feedback about topic coverage and

other aspects of student performance can be provided immediately following the interview.

Of course, programming a virtual patient introduces significant technical challenges that

paying an actor does not. The graphical interface must be seamless and believable enough

not to distract from the educational goals of the patient interaction, while ideally retaining

features of non-linguistic communication that are important for doctors to respond to,

including expressions of pain, anxiety, etc. Among the most significant challenges, and

chief among the aims of the present work, is the ability of the virtual patient to correctly

identify the natural language question being asked by the student. This not only directly

impacts the ability of the chatbot to correctly answer the question, thus informing further

questioning, but also affects the ability to rapidly evaluate the student’s identification and

coverage of topics that are relevant to the patient’s condition.

The patient that we devised is modeled as a simple agent. The user

always has the initiative, which creates a simple turn-taking dialog in which the patient

simply responds to the user’s questions. This allows us to treat each question as a somewhat

independent input, and we provide fixed responses to hundreds of known questions. The

primary challenge, then, is to map a wide variety of equivalent natural language inputs into

the questions that the patient is programmed to answer. We call this formulation the question

identification problem. One can envision multiple approaches to this problem, including

ranking inputs for matches against known queries, or directly classifying inputs as one of

many known classes. We adopt the latter approach.

The earliest versions of the Virtual Patient (Danforth et al., 2013) were built with a

3D-graphical patient avatar, using a rule-based dialog management engine called ChatScript

(Wilcox, 2019) to handle the necessary natural language understanding task. ChatScript is a

pattern matching engine that automatically performs some input regularization and analysis,

provides a straightforward pattern matching syntax, maintains dialog state, can remember

facts and dialog history, and responds to input with prepared answers or further questions

(Wilcox and Wilcox, 2013). ChatScript-based chatbots have been successful entrants at

the annual Loebner Prize (Bradeško and Mladenić, 2012), but rule-based approaches to

dialog agents do exhibit some drawbacks. It is something of an art form to write patterns

and answers that are both adequately specific and sufficiently general to correctly match the

questioner’s intent. ChatScript provides many advanced features for interpreting input and

formulating responses, but taking advantage of all of them requires ample expertise and is

labor intensive. Furthermore, as the number of authored patterns increases, the potential

for conflicting patterns and interactions multiplies. For these reasons, there is interest in

incorporating a machine learning (ML) approach.

A successful machine learning system can automatically capture some of the variability

of natural language in a limited domain, given sufficient labeled data. The clinical skills

training setting imposes some important constraints on the ML approach, however. Most

significantly, both the users and annotators are, for the purpose of application development,

experts. Students using the system to develop their skills, besides having all of the background

education necessary for admission to medical school, have already received detailed

instruction about how to interview patients by the time they use the virtual patient application.

Users are trained to query the patient about various aspects of the history of the present

illness, as well as the patient’s past medical history, family history and social history. This

level of expertise makes it infeasible to collect large amounts of data from a crowd sourcing

platform such as Amazon’s Mechanical Turk (Kittur et al., 2008). Furthermore, while this is

an important skill to develop, it is far from the only skill taught in a medical curriculum;

students are busy, and opportunities to capture reasonable numbers of interactions with the

application are limited. Annotation of the correct interpretation of each query also requires

medical expertise and familiarity with the set of possible answers programmed into the virtual patient, since subtle linguistic distinctions may have significant medical implications

(consider, “Have you used drugs?” versus, “Are you using drugs?”). Thus, the virtual patient

is situated in a relatively data-scarce problem space.

Exacerbating the general data scarcity issue is a label imbalance issue. As designed,

there are a relatively small number of questions that nearly everyone asks; on the other hand,

there are also a large number of rarely-asked, but perfectly valid, questions that the patient

should be able to answer. That is, the label frequencies exhibit a long tail, as discussed in

Section 3.1.

1.1 Overview and Contributions

This dissertation describes several contributions in support of the Virtual Patient project.

We start by positioning this work relative to similar projects in the literature, and highlighting

the state of the art with respect to some of the specific challenges of the Virtual Patient

domain, in Chapter 2. Chapter 3 provides a detailed description of the data that defines

the domain, as well as the computational systems that constitute both the end product

and the research framework. The development of this data and framework constitutes a

non-trivial contribution to the literature in itself, but the details about the systems

and data also motivate the specific research questions that are addressed in the subsequent

chapters. The unifying theme of these research questions is scarcity of data. How do we

improve performance on classes with few examples? How do we handle negative classes, which can never be fully characterized by finite data? How can we improve performance in

modalities for which we don’t have exemplary data?

In Chapter 4, we contribute a controlled experiment to determine the efficacy of a previously

developed question identification model in a live production setting. This model was

developed to improve rare class performance, and part of the contribution is a reproduction

of the prior results. This is noteworthy in light of the generally acknowledged “replication

crisis” (Hutson, 2018; Wieling et al., 2018), but the value of conducting replication

experiments is further validated in that the outcome of the live experiment highlights the

unanticipated need for a strategy to handle inputs that fall outside of the scope of questions

that the system is designed to answer. Analysis demonstrates that this is a very difficult

problem, but we implement a baseline strategy to serve as a basis for future work.

Chapter 5 implements a model utilizing multi-headed self-attention that rather dramatically

improves the Virtual Patient’s responses to rare questions, and compares this model to

a state-of-the-art contextualized language model that surprisingly underperforms previous

baselines. An additional contribution is an analysis of the internal representations of the successful

model, which highlights the importance of semantic specialization in the attentional

and representational behavior of the individual attention heads. We also show that some

simplifying modifications to the original model maintain sufficient performance while also

facilitating interpretability.

The final contribution of this dissertation is a general method to improve the performance

of specialized text classification tasks, such as the Virtual Patient, that use the output of

a general-domain speech recognition system as input (Chapter 6). The main strategy of

the method is to augment training data with inferred phonetic representations to make

the classifier robust to misrecognitions, but we see additional performance improvements

from a strategy of randomly training on generated erroneous forms of the input sentences.

This allows for the improvement of a Virtual Patient with a spoken interface without any

in-domain speech data.

Finally, we conclude in Chapter 7 with a consideration of future research directions for

the Virtual Patient, primarily focused on pragmatic dialogue issues that show up in the data.

Handling some of these phenomena well has potential to enable more advanced capabilities,

like mixed initiative conversations.

Chapter 2: Background and Related Work

In this chapter, we provide background information and review the current state of the

art in the several subfields that are important for the functioning of a dialogue system such

as the Virtual Patient.

We begin with the issue of dialogue generally. One useful partitioning of concerns

that can be made within the broad category of dialogue is between language understanding

and language generation, since any dialogue agent must accept natural language input and

produce natural language output. There are certainly other ways to break down a discussion

of the mechanics of dialogue agents, and the distinctions are sometimes vague, but this is

most useful for our present aims because of the specific assumptions of the Virtual Patient.

In particular, the Virtual Patient obviates many issues on the generation side by design, so we begin by summarizing some of the main issues and approaches in dialogue generation,

noting why they do not apply, and moving on to the understanding side. We describe the

main challenges and techniques in natural language understanding for dialogue systems,

and position the Virtual Patient as a question-answering agent, which again circumvents

some of the thornier issues in natural language understanding. Finally among the dialogue

issues, since one contribution of this dissertation is a spoken interface for the Virtual Patient, we note that spoken dialogue poses many unique challenges in addition to those handled by

text-based dialogue systems. Accordingly, we review some such problems, with emphasis

on those that have an impact on the Virtual Patient’s performance. While this dissertation does not make direct contributions in spoken dialogue, we outline unique opportunities for future research in the following chapters.

As a question-answering agent, the Virtual Patient allows for a task formulation that amounts to a text classification problem, so we briefly review background and the state of the art for this very broad class of problems. In Chapter 4, we consider the problem of out-of-scope queries, so we also summarize some research in the related subfield of one class classification.

One part of the contributions of this dissertation is the mitigation of speech recognition errors in a spoken interface for the Virtual Patient. Recent advances, mostly in computational power, have essentially commoditized speech recognition, so we mostly take its functioning for granted, but we also provide a very brief overview for background.

Finally, we note that our virtual patient is far from the only patient dialogue simulation system, and describe some similar models.

2.1 Issues in Dialogue

In the most general sense, dialogue is two agents, human or otherwise, using words to interact. The interaction itself presents challenges that need not be addressed for many other tasks, such as parsing, information extraction, speech recognition, etc. From an automated processing standpoint, major dialogue issues include establishing or learning a dialogue policy, generating appropriate responses to a dialogue partner, and handling initiative. These are issues that will be present in most dialogue systems, whether they use a spoken or typewritten interface.

A dialogue policy just describes what to do in a conversation, recognizing that there are various types of dialogue acts, each of which has its own set of appropriate ways to

respond. For example, if someone expresses a greeting, it is usually inappropriate to respond with an apology. Creating a complex automated dialogue agent often involves establishing a

set of task-specific dialogue acts, making a system to recognize them accurately enough,

and learning or writing a policy for how to respond to the various input acts. In the Virtual

Patient domain, the overwhelming majority of inputs are questions from the doctor, meaning

that a very simple dialogue policy of “answer the question” works very well.

Initiative in dialogue refers to the dialogue participant that is recognized by other

participants as having control of the conversation. Here again, the Virtual Patient offers a nice

simplification of the more general dialogue problem, in that the user, as the doctor performing

the interview, always has the initiative. This means that any extraneous difficulties involving

bidding for or ceding initiative can be ignored.

Of course, the greatest challenge for implementing a dialogue agent is understanding

input and producing appropriate responses, which are dealt with in detail in the following

subsections.

2.1.1 Response Generation

It is common in dialogue literature to distinguish between open- and closed-domain

dialogue systems, since the requirements and strategies are different for each. Open-domain

systems basically chat for the sake of chatting; the general goal is to engage a user in

conversation on any topic for as long as possible. The earliest notions of this type of

challenge go back to the “imitation game” first described by Turing (1950), now commonly

known as the Turing Test, which is the basis for the annual Loebner Prize.1 Closed-domain

1https://aisb.org.uk/aisb-events/

systems are more circumscribed, in that they typically only handle content from one or a small number of domains, such as booking a restaurant reservation. Closed domain systems are often deployed as more natural user interfaces for comparatively complicated computing systems, either as a means to control a system (e.g. for booking a flight, buying movie tickets, or controlling “smart” appliances such as thermostats), or to request information from a system. There is often overlap in systems that provide information and allow for user controls; for example, one must find out which flights exist before executing an order to book one.

There are two main approaches in recent literature to producing dialogue responses in data-driven open-domain systems: either generating strings of response text de novo, or retrieving responses from large human dialogue corpora that are appropriate to the input, according to a learned model. Each has advantages: retrieval ensures that responses will be natural and well-formed, while generation, in theory, allows for fully customized responses for unique inputs. In practice, generation poses two notable challenges: it is hard to produce responses that cohere well with the input, and too easy to produce boring responses.

The simplest approach to a retrieval-based agent is to condition the output on the single last input (e.g., Wang et al., 2015), but more sophisticated approaches consider more of the conversational history to find the response that best fits in the conversation. Wu et al. (2017) devise Sequential Matching Networks, a recurrent hierarchical model that considers past context at different levels of granularity, as well as the relationships between past inputs via the recurrent model, to produce a matching score given a candidate response. Extensions of

Sequential Matching Networks improve the performance with attention mechanisms and other heuristic optimizations to emphasize the most important portions of the context for

finding a good match (Zhang et al., 2018; Zhou et al., 2018). A potential source of bias in

retrieval models is that using a large learned model to identify the best matches is intractable for real-time applications at test time, so candidate responses must be filtered out of the larger corpus using some kind of fast, coarse heuristic. A naïve or poorly-tuned heuristic has the potential to remove good matches before the more sophisticated model can identify them.

Earlier efforts at direct generation of responses established recurrent neural network language models as a viable approach to generating responses conditioned on conversational context (e.g., Serban et al., 2015; Sordoni et al., 2015), albeit with the aforementioned challenges of coherence and diversity. This was quickly followed by efforts to improve the diversity of outputs, at first using a mutual information maximization objective (Li et al.,

2016). Later work improved diversity and conversational coherence with reinforcement learning in an adversarial setup (Li et al., 2017a). The adversarial approach also establishes an evaluation method that overcomes some of the limitations of more traditional machine translation evaluations often used to validate dialogue responses (Liu et al., 2016). Very recent work has taken advantage of large Transformer-based (Vaswani et al., 2017) architectures to pre-train generative language models with massive amounts of data for realistic dialogue response generation (Zhang et al., 2019).

Perhaps the simplest and oldest method of producing responses in a dialogue agent is template expansion. The very early ELIZA system used a collection of rules for how to reflect user inputs back to the user, e.g. in the form of questions, to produce plausible dialogue (Weizenbaum, 1966). Closed-domain dialogue agents make templates a more reasonable approach for response production, since a limited number of concerns can reduce the number of rules to define for believable interactions. ChatScript (Wilcox, 2019) is a modern dialogue management system that qualifies as a template-based approach to response

production, because it allows for a system designer to define responses to include variables that can be expanded, for example based on memorized facts that were introduced to the conversational context by the user. While ChatScript has won prizes for producing human-like dialogues (Bradeško and Mladenić, 2012), the process of writing templates is very labor intensive. Other recent approaches that have successfully deployed template-based response production as part of a system include several Alexa Prize participants (see Khatri et al., 2018; Ram et al., 2018).

The Virtual Patient is clearly a closed-domain system; not only does it focus on being a patient, but in the present work, it specifically is only concerned with a particular case presentation of back pain. In addition to that, it has the requirement that very specific information is revealed in response to particular queries, as a means of evaluating whether students know what information they need to extract from their patients. For example, students need to know both whether the patient has any allergies, and (if so) how the patient reacts to those allergens. A human standardized patient that accidentally provides both pieces of information in response to a query of, “Do you have any allergies?” does not test whether the student knew to ask the follow-up question about reactions. Because the output of the virtual patient must respect very specific factual entailment values, and direct language generation methods are not currently effective at that, every response is completely scripted. Some interesting research opportunities exist for future work to incorporate language generation that does respect discrete factual entailment values, as discussed briefly later, but the system described in this dissertation obviates the need for any of the techniques described in this section.

2.1.2 Natural Language Understanding

Since the open-domain dialogue task is often formulated as producing a response

conditioned on prior context, the understanding portion of the task is usually implicit or

latent in whatever model is proposed to generate responses. This leads to a fairly tight

correlation between response generation models and open-domain tasks on the one hand,

and natural language understanding (NLU) models and closed-domain tasks on the other

hand. There is no requirement that either type of model be exclusively used for either task,

but a closed domain is often a side effect of many command-and-control kinds of tasks, for which fine-grained understanding of user inputs is more important than producing engaging

responses. For example, a voice-operated light switch does not need to understand the

diagnostic criteria for diabetes, but it does need to accurately map a variety of language

inputs to a small number of machine instructions that have very specific effects on the

physical world. That said, one could consider voice search, or even search engines generally,

to be examples of open-domain NLU, in that they accept (semi-) natural language about

any topic as input, and produce relevant documents or excerpts as outputs. The impetus that

drives the efforts toward understanding instead of generation is the specificity of the desired

result; the user expects to find specific information, so it is important to fully understand the

request.

As just stated, a closed domain is a fortunate side effect of spoken interfaces for many

command-and-control tasks. However, the variability within the bounds of some domains

does motivate particular approaches to NLU tasks, two of which are known as slot-filling

and semantic parsing.

Slot-filling tasks allow a system designer to define frames that specify all of the information

necessary to execute some action on behalf of the user—for example, in order to

14 book a flight, a system must know (at least) the point of departure, the point of arrival, and

date and time of the flight. Each of these variables is called a slot, which is filled with

information provided by the user. Each of a finite number of user intents, such as finding

a flight or making a reservation, corresponds directly to an associated frame. The system’s

job is to predict the user’s intent, keep track of slot-filler values as they are provided over

multiple dialogue turns, and to elicit values for required, unfilled slots for the intended frame.

Particular challenges include recognizing the linguistic structures that characterize fillers

of particular slots, defining and recognizing entity types that are permitted to fill different

kinds of slots, handling different user intents, handling unexpected or invalid fillers, and

constructing the outputs to confirm and/or request information from the user.
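To make the frame and slot terminology concrete, the sketch below shows one minimal way such a frame might be represented in code. It is an illustration only: the intent name, slot names, and city values are invented here, not taken from any system discussed in this dissertation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

@dataclass
class Frame:
    """A single user intent plus the slots needed to act on it."""
    intent: str
    required_slots: Tuple[str, ...]
    values: Dict[str, Optional[str]] = field(default_factory=dict)

    def missing_slots(self):
        # Slots the dialogue manager still needs to elicit from the user.
        return [slot for slot in self.required_slots if not self.values.get(slot)]

# Hypothetical flight-booking frame; names and fillers are illustrative.
book_flight = Frame(intent="book_flight",
                    required_slots=("origin", "destination", "date", "time"))
book_flight.values.update({"origin": "Columbus", "destination": "Boston"})
print(book_flight.missing_slots())  # -> ['date', 'time']
```

A dialogue manager built around such frames would loop between predicting the intent, filling slots from each user turn, and asking for whatever `missing_slots` still reports.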

An early example of a slot-filling task is the ATIS data set (Moore et al., 1995), which

defines a conversational flight-booking system. A more recent example is the Frames

data set (El Asri et al., 2017), which includes multiple domains and intents (e.g., buying

movie tickets or making restaurant reservations), and others add the challenge of negotiation

(Lewis et al., 2017). The most successful recent approaches have begun using neural network

architectures, and see improvements by tracking latent contextual states (Bapna et al., 2017).

Semantic parsing is a slightly more general formulation than slot-filling, one that seeks

to map all of the content of a natural language input into some equivalent expression in a

formal, unambiguous language, e.g. an SQL database query, or a lambda calculus expression

(Church, 1932). As an example, imagine a natural language interface for a database of world

cities. One might give the system an input of, “Show me all the cities in Germany that have

a population larger than Moscow.” An SQL expert can write an equivalent query, but a

semantic parser can theoretically allow a non-expert to ask the question and get the right

result. Compared to slot-filling tasks, semantic parsing assumes the added complexity of

mapping syntax in one language to syntax in another, rather than assuming a fixed number

of intents/frames that can be operationalized by simply extracting entities from the input.

Still, it essentially requires a closed domain, since the target language must operate on

a finite knowledge base. The most successful current approaches use neural sequence-to-sequence

models (Sutskever et al., 2014), with modifications to include attention and

copying mechanisms (e.g. Finegan-Dollak et al., 2018). Recent task enhancements include

user interaction through clarifying subdialogues to improve parsing performance (Yao et al.,

2019).
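To make the German-cities example above concrete, the sketch below pairs the natural language question with one possible formal target. The table schema, column names, and toy rows are invented for illustration; they are not part of any data set used in this work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, country TEXT, population INTEGER)")
conn.executemany("INSERT INTO cities VALUES (?, ?, ?)", [
    ("Berlin", "Germany", 3_600_000), ("Hamburg", "Germany", 1_800_000),
    ("Munich", "Germany", 1_500_000), ("Moscow", "Russia", 12_500_000),
])

# "Show me all the cities in Germany that have a population larger than Moscow."
query = """
SELECT name FROM cities
WHERE country = 'Germany'
  AND population > (SELECT population FROM cities WHERE name = 'Moscow')
"""
# With this toy data the answer is empty, since no German city outgrows Moscow.
print(conn.execute(query).fetchall())
```

The semantic parser's job is to produce the string held in `query` directly from the English question, without a human SQL author in the loop.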

Perhaps the simplest approach to natural language understanding in a closed domain,

and the one deployed by the Virtual Patient, is to define a limited number of meanings that

are relevant for the domain, and to directly classify each input as one of those meanings.

This approach limits the ability to generalize to new domains, but it can be effective for

some applications. One particularly relevant line of research is another dialogue-based

training simulation, which uses maximum entropy models to directly classify user inputs

to determine which of a fixed number of responses to produce (DeVault et al., 2011a).

This paradigm fits with other so-called question-answering dialogue agents that have a

fixed knowledge base with pre-programmed responses numbering in the hundreds (Traum

et al., 2012). The simplicity of the direct classification approach makes it relatively easy to

establish reasonable baselines, while leaving room to explore more sophisticated approaches

in the future. Later, we address a few topics in the general area of text classification that

apply to the Virtual Patient domain.

16 2.1.3 Spoken Dialogue

Spoken dialogue poses a number of unique problems above and beyond text-based

dialogue or even pure speech recognition. To begin with, spoken English is statistically quite

different from written English, even in a typed chat. As an example, repairs are strategies

that speakers use to correct errors that would not appear in text due to the existence of the

backspace key. Speakers may utter part of a word and restart from a few words prior, use

indicative language like, “I mean...” or disfluencies like, “er,” or, “um,” to indicate that they

are fixing an error, or simply insert the right word after the wrong one. All of these strategies

are easily understood by humans to override the meaning of the erroneous speech, but are

difficult for an automated dialogue system to understand. Furthermore, while typos are fairly

common in typed chats, automatic speech recognition (ASR) systems that provide input to a

dialogue management system will not misspell words, but may recognize the wrong words.

These ASR errors can be innocuous, or catastrophic, depending on the specific errors and

the downstream system.

Turn-taking is another issue in spoken dialogue that is generally much easier to deal with in a typed interface. When a user is done with their turn in a text message, they press

the Enter key; the vocal cues that humans use to give and take turns are much harder to

recognize. Speakers often interrupt each other, known in the literature as barge-in; or they

may start their sentence before the other is entirely finished with theirs. Meanwhile, some

utterances may not require a full change of turn—these are known as backchannels, and

serve to indicate to the speaker that the listener understands or is paying attention. Typical

examples include, “mhmm” or “uh-huh.”

Even ignoring all the reasons and ways that speech can overlap, simply recognizing, in

a timely manner, that a user is finished saying what they want to say is very challenging.

Human dialogue partners typically switch turns over a matter of about 200ms (Schegloff

et al., 1974), but mid-utterance pauses often exceed that, so a simple time limit for turn

changes leads to far too many interruptions. Some of the information about whether or not

a speaker is finished speaking is encoded in the prosody of their speech: the way the tone,

pitch, volume, and rate of speaking change over the course of an utterance.

Section 3.2.5 describes the implementation of a rudimentary end-of-utterance (EOU)

detector to mitigate broken utterances. In Chapter 7, we propose potential improvements,

including detection of certain kinds of dialogue act intents. Previous EOU work has used

language modeling to contribute to the determination of whether an utterance is finished

or not, but efforts to use semantic information like dialogue acts to make end of utterance

determinations have been fairly limited.

Voice Activity Detection and Endpointing have a substantial body of work in the signal

processing literature, but these tasks focus more on analyzing the state of a signal, rather than

uncovering the speaker’s intent. Early efforts at using prosodic cues to pick up on a speaker’s

intent sought to improve on algorithms that simply waited for a detected silence to last a

specific amount of time before calling the utterance complete. This type of naïve baseline

remains in use today due to its ease of implementation (indeed, this is the basic approach we

use now), but leaves much to be desired. The early efforts to improve on this state of affairs

extracted features from the signal aimed at capturing prosodics, and incorporated language

modeling of ASR results, using decision trees or other non-neural methods for models.

These efforts led to substantial improvements in error rates and latency reduction (e.g. Ferrer

et al., 2002).
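For reference, the kind of naive silence-threshold endpointing that this line of work improves upon can be sketched in a few lines. The frame size, energy threshold, and 700 ms patience below are arbitrary illustrative values, not the settings used in our spoken interface.

```python
import numpy as np

def end_of_utterance_index(frames, energy_threshold=1e-4, max_silent_frames=70):
    """Naive endpointer: declare the utterance over after a long enough run of
    low-energy frames, once some speech has been heard.

    `frames` is a sequence of fixed-length audio frames (e.g., 10 ms hops),
    so 70 silent frames corresponds to roughly 700 ms of trailing silence."""
    silent_run, heard_speech = 0, False
    for i, frame in enumerate(frames):
        energy = float(np.mean(np.square(frame)))
        if energy >= energy_threshold:
            heard_speech, silent_run = True, 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= max_silent_frames:
                return i
    return None  # no end detected within the provided audio

# Toy usage: 1 s of noise-like "speech" followed by 1 s of near-silence.
audio_frames = np.concatenate([np.random.randn(100, 160),
                               1e-4 * np.random.randn(100, 160)])
print(end_of_utterance_index(audio_frames))
```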

The use of prosody features continues to be useful, and research in the area has continued

into recent years, even introducing new corpora for the task (Arsikere et al., 2015).

Computational capabilities have now advanced enough to make neural models feasible for

real-time applications, and very recent work has successfully modeled end-of-utterance

detection for a voice search task using an LSTM (Hochreiter and Schmidhuber, 1997) model with filterbank inputs (Shannon et al., 2017). Yet another related use of neural models has

introduced a new flavor to the task: rather than waiting for a minimum-length silence to

classify as ending an utterance or not, the aim is to continuously predict, over a sliding future window of time, the likelihood of a turn change across conversation partners (Skantze, 2017).

This is laying the groundwork for systems that can interrupt a user, provide backchannels,

or deal gracefully with overlaps or otherwise being interrupted. Even more recently, authors

have begun to look at capturing pragmatic information from multiple modalities, including

gaze estimation (Roddy et al., 2018).

Finally, speaker diarization is an issue in some spoken dialogue systems. Diarization as

a task aims to keep track of the individual voices that a system is interacting with, which

can help to determine who to pay attention to. Multiple speakers are not supposed to be

an issue for the Virtual Patient, so this problem is largely ignored in the present work, but

incorporating diarization approaches has potential for some improvements, discussed in

Chapter 7.

2.2 Text Classification

Text classification is nearly as old as computing itself, and to attempt to cover it in

detail would never approach a complete picture; nonetheless, as the predominant strategy

for handling inputs in the Virtual Patient, mentions of the general problem and the most

recent advances in the field are warranted.

One egregiously simplistic view of the task of text classification would frame it as

finding some representation of the text in a fixed-dimensional vector space, and then finding

partitions of that space that correspond to the desired classes, using any of a number of

optimization algorithms. Finding appropriate representations, then, is a large part of the

problem, and the model architectures and algorithms for producing these representations

are the focus of much research. One of the simplest such representations, known as a bag

of words representation, is to map every word in a vocabulary to a dimension in a vector

space, and to represent strings of text with non-zero values for the dimensions corresponding

to the words present in the text. The values can be boolean values indicating the simple

presence of words in the input, or more complex calculations involving frequency of terms

in a corpus. This approach ignores word order effects, but these can be incorporated by

making the vector space dimensions correspond to ordered tuples (N-grams) of words that

appear in the text, with certain approximations for N-grams that were unseen in training.

A common classification algorithm to use with such representations is maximum entropy

classification (Nigam et al., 1999), which estimates the class probability distribution given a

particular input, by learning the “most uniform” distribution over classes that respects the

constraints imposed by the training set.
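A minimal sketch of this bag-of-words pipeline is given below, using scikit-learn's multinomial logistic regression, which is equivalent to a maximum entropy classifier. The handful of training questions and labels are invented for illustration and are far smaller than any real training set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy examples only; real label sets number in the hundreds of classes.
questions = ["where is your pain", "where does it hurt",
             "do you have any allergies", "are you allergic to anything"]
labels = ["pain_location", "pain_location", "allergies", "allergies"]

# Unigram and bigram counts feed the (maximum entropy) classifier.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(questions, labels)
print(model.predict(["where exactly does your back hurt"]))  # expect 'pain_location'
```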

The advent of word vectors marked a significant advance in linguistic representations for

classification tasks. These representations embed word types as points in a high-dimensional

space, so that words with similar meanings appear nearby in the space. These embeddings

are learned from large text corpora, based on the so-called Distributional Hypothesis,

that words that appear in similar contexts should have similar meanings (Rubenstein and

Goodenough, 1965). Among the algorithms for learning these representations for individual words are the Word2Vec algorithm (Mikolov et al., 2013) and GloVe (Pennington et al.,

2014). These word vectors can be aggregated in various ways to represent the sentences

in which they appear, as simply as by taking an average, or by using them as the inputs to

more sophisticated learned models that produce a fixed-dimensional output from a sequence

of word vectors as input. One such aggregation model that we deploy is a Text CNN (Kim,

2014), which learns a large number of filters—alternatively, these can be thought of as

sensors that respond selectively to preferred short paths through the word vector space, each

producing a single real value to indicate how close the input is to their preferred input at

every time step. Each sentence is then represented as the maximal activation of all such

filters before being classified. An alternative aggregator is a recurrent neural model (e.g.,

Hochreiter and Schmidhuber, 1997), which accepts the sequence of word vectors one at a

time, and updates its internal state in response to each input. Then, the final internal state is

used as the representation of the whole sentence for classification (Arevian, 2007).
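A stripped-down PyTorch sketch of the Text CNN aggregation idea is shown below. The vocabulary size, embedding dimension, filter widths, and class count are placeholders, and a real system would initialize the embedding table from pretrained vectors such as Word2Vec or GloVe rather than at random.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Kim (2014)-style classifier: convolve filters over word vectors,
    max-pool each filter over time, then classify the pooled features."""
    def __init__(self, vocab_size=5000, embed_dim=300, num_filters=100,
                 widths=(2, 3, 4), num_classes=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=w) for w in widths])
        self.classify = nn.Linear(num_filters * len(widths), num_classes)

    def forward(self, token_ids):                    # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)    # (batch, embed_dim, seq_len)
        # Each filter fires on short n-gram-like stretches of word vectors;
        # max-pooling keeps its strongest response anywhere in the sentence.
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.classify(torch.cat(pooled, dim=1))

# Two fake tokenized sentences of length 8, mapped to class logits.
logits = TextCNN()(torch.randint(0, 5000, (2, 8)))
print(logits.shape)  # torch.Size([2, 300])
```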

The latest leap forward in representational power for text classification problems has

been the development of contextualized language models, such as BERT (Devlin et al., 2019)

or XLNet (Yang et al., 2019). These models take advantage of pretraining on huge volumes

of text through the use of a Transformer architecture (Vaswani et al., 2017), and they work by

producing a unique, context-dependent representation for every word in a string of text. This way, different senses of the same word can be represented independently. The BERT model

is of particular interest, since we compare it to other methods in Chapter 5. It is pretrained

by optimizing two objectives: the first is to predict random words that have been masked out

of the input, and the second is to predict if two input sentences are successive sentences in

the original text. The model produces both word- and sentence-level representations that are

suitable for fine-tuning to perform downstream classification tasks. The BERT model and

variants have dramatically improved performance on a number of benchmark tasks (Wang

et al., 2019).
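Fine-tuning such a model for a classification task like ours typically amounts to placing a small output layer on top of the pretrained sentence representation and training the whole stack end to end. The sketch below uses the Hugging Face transformers library as one possible implementation; the label count and example questions are placeholders rather than the exact configuration used later in Chapter 5.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Pretrained BERT encoder plus a freshly initialized classification head.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=300)

batch = tokenizer(["where is your back pain", "do you smoke"],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([0, 1])  # placeholder class indices

# One fine-tuning step: loss over the sentence-level representation,
# backpropagated through the entire pretrained model.
outputs = model(**batch, labels=labels)
outputs.loss.backward()
print(outputs.logits.shape)  # torch.Size([2, 300])
```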

2.3 One Class Classification

In Chapter 4, we identify and characterize what we call out-of-scope queries, which pose

a particular challenge to our performance. The related problem of distinguishing in- and out-

of-domain data when only in-domain data is available is well-studied in the literature, and

known by several variants falling under the banner of One-Class Classification, including

Novelty Detection, Anomaly Detection, Outlier Detection, etc. (Khan and Madden, 2009).

Techniques leverage various assumptions about the data in some representation space, e.g.

that it be maximally separated from the origin (Schölkopf et al., 2000), or that the measure

of a hypersphere surrounding it be minimized (Tax and Duin, 1999). Many published

techniques in the text domain make use of large corpora of unlabeled in- and out-of-domain

data to provide a basis for the negative class (e.g., Yu et al., 2004). In the absence of

negative examples, adequately characterizing the boundaries of the positive class requires

a large number of positive examples (Yu, 2005), while, if a large number of unlabeled

examples are available, good results can be achieved by maximizing the number of them

that are classified as negative while constraining positive examples to be correctly classified

(Liu et al., 2002). Neural approaches to one-class classification have been popular more

recently, and these focus on learning a representation distribution that is amenable to outlier

detection, for example using ordinary multiclass classification as a joint auxiliary task for

one-class training with a class compactness objective (Perera and Patel, 2019). Recent work

in a command-and-control spoken dialogue context uses joint out-of-domain and domain

classification objectives along with an externally supplied false acceptance target rate to

improve domain classification accuracy (Kim and Kim, 2018).
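As a concrete point of reference for the classical techniques above, the sketch below fits a one-class SVM on vectors standing in for in-scope questions and then flags an obvious outlier. The random feature vectors and the nu parameter are placeholders; in practice the inputs would be learned sentence representations.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Stand-ins for fixed-dimensional representations of in-scope questions.
in_scope = rng.normal(loc=0.0, scale=1.0, size=(500, 50))

# nu upper-bounds the fraction of training points treated as outliers.
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(in_scope)

# +1 means "consistent with the in-scope training distribution", -1 means outlier.
probes = np.vstack([rng.normal(0.0, 1.0, 50),   # looks in-scope
                    rng.normal(8.0, 1.0, 50)])  # far from the training data
print(detector.predict(probes))  # expected roughly [ 1 -1 ]
```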

2.4 Speech Recognition

In this dissertation, we generally treat speech recognition capabilities as a black box, so while we depend on them, we do not rely on deep knowledge of the internal workings of

any specific system. Nonetheless, a brief overview of speech recognition is warranted, since

it is immediately adjacent to some contributions, and certain terminology may be useful.

In simplified terms, the traditional speech recognition pipeline consists of several models

chained together to produce an output. At the front, an audio signal of speech content

undergoes some unsupervised, deterministic processing to prepare the data for subsequent

analysis. This commonly involves a conversion from the time domain to the frequency

domain through a Fourier transform. Further processing can decorrelate the frequency

domain features and/or aggregate them in perceptually meaningful bins. Importantly, the

outcome of these transformations is a signal consisting of frames of features (distinct from

the semantic frames used in slot-filling dialogue tasks), usually representing 25ms chunks

of audio at 10ms overlapping intervals. These frames serve as input to an acoustic model which produces as output a distribution over all phonemes in the target language being

recognized. A phoneme is a linguistically meaningful sound, for example the /b/ sound in

boy, so the output of the acoustic model is the respective probability of each phoneme being

present in every frame, given the input features. A Hidden Markov Model combined with

a pronunciation lexicon and language model then finds the sequence of words that is most

likely, given the phoneme probabilities produced by the acoustic model.
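The 25 ms / 10 ms framing convention can be made concrete with a short sketch that slices a waveform into overlapping frames and computes crude log-spectral features. The 16 kHz sample rate, window choice, and FFT size are common defaults rather than values tied to any particular recognizer.

```python
import numpy as np

def frame_signal(waveform, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Slice a 1-D waveform into overlapping 25 ms frames every 10 ms."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)       # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(waveform) - frame_len) // hop_len)
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return waveform[idx]                             # (n_frames, frame_len)

# One second of fake audio yields 98 frames of spectral features.
audio = np.random.randn(16000)
frames = frame_signal(audio) * np.hamming(400)       # window each frame
spectra = np.abs(np.fft.rfft(frames, n=512))         # magnitude spectrum per frame
log_features = np.log(spectra ** 2 + 1e-10)          # crude log-spectral features
print(frames.shape, log_features.shape)              # (98, 400) (98, 257)
```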

Within this very broad framework lies a huge experimental space. Acoustic models are

often deep neural models, and researchers began claiming superhuman performance for

certain tasks in recent years (Xiong et al., 2018).2 Novel training objectives have led to

recent improvements in the traditional model (Povey et al., 2016), but there has been much

interest in so-called end-to-end speech recognition, which aims to directly predict the correct

sequence of words, sub-words, or characters, instead of the traditional two stages of acoustic

modeling followed by decoding according to a lexicon and language model (e.g., Watanabe

et al., 2018). Again, most of these details are ancillary to the aims of this dissertation, but

are provided as a point of reference.

2.5 Other Virtual Patients

Dialogue agents in medical settings are not unheard of outside of the present program,

and there are various goals and reasons for having them. Morbini et al. (2012) use an avatar

to explore mixed-initiative conversations in a troop-deployment, counseling-related scenario.

DeVault et al. (2013) use another such agent to assess diagnostic vocal characteristics of

PTSD patients. More recently a team at LIMSI have built a virtual patient with very similar

educational goals to ours, which takes a heavily engineered approach to defining rules and

ontologies that allow for accurate question answering given an authored patient health record

(Campillos-Llanos et al., 2019). This system emphasizes flexibility in deploying patients

to represent a wide array of conditions and histories, at the expense of some naturalness

in responses. Accordingly, they do not make use of machine learning techniques, since

adequate conversational data to represent all possible ailments is not available. Talbot et al.

(2012) provides a relatively recent review and taxonomy of various types of virtual patients

2Skepticism is warranted, since machines still underperform humans in many speech recognition tasks, especially in noisy environments.

and reasons for deploying them. Uniquely, our system’s immediate student feedback has been validated as demonstrating high agreement with expert human graders (Maicher et al.,

2019).

2.6 Conclusion

While the subject overviews provided in this chapter are necessarily brief and superficial, it should be apparent that the machinery of a functional dialogue system comes from a broad collection of subfields in computing and linguistics. The particular simplifications that are allowed by the Virtual Patient domain make some of the complexity unnecessary, but they also present a framework of unique opportunities for research—the very fact that advanced methods for response generation and language understanding are optional is what makes the system a prime target for experimental manipulations of those elements. The following chapter explores this idea more fully, by considering the specific implementation details of the Virtual Patient, and introducing the research contributions of the rest of the dissertation in the context of the unique challenges and opportunities presented by the system.

Chapter 3: System Overview

Having just seen the problems and approaches that contribute to the study of dialogue

agents in general, we devote this chapter to the specifics of how we operationalize these

issues in the Virtual Patient. We begin with a description of the data that defines the domain,

including the processes of its collection and annotation. This data consists of several subsets which were collected at different times, which each have different nuances, and which were

used in different combinations in the experiments that follow this chapter. The description

of the data is followed by a detailed explanation of the architecture of the deployed Virtual

Patient system. The data sets and deployed system constitute contributions in themselves,

but the details in this chapter highlight the unique challenges and opportunities presented by

this project, which inform and motivate the research contributions described in following

chapters.

3.1 Data

The main purpose of the Virtual Patient is to provide reinforcement for classroom

training on patient interviewing skills as part of the first year curriculum for medical

students. Students are trained to elicit general information about the cause for the visit

using open-ended questions, before drilling down into specific details about the complaint,

then acquiring important patient background information such as family history and social

history, etc., all while maintaining appropriate bedside manner. The first several turns of an example conversation are shown in Table 3.1.

Table 3.1: An example conversation opening.

Student: hello!
Patient: hello doctor. i am so glad to see you.
Student: what brings your in today mr. wilkins
Patient: i was hoping you could help me with my back pain, it really hurts! it has been awful. i can barely even tie my shoes.
Student: great i will ask you some detailed questions about your back pain.
Patient: it is.
Student: do you want to discuss anything else
Patient: the occasional need to urinate more frequently.
Student: where is your back pain
Patient: it is mostly in my lower back.
Student: does the pain spread anywhere else
Patient: just my back.
Student: how would you describe your pain
Patient: it’s a dull ache right in the middle of my lower back.

Deploying the Virtual Patient as a component of a curriculum presents the opportunities

both to test the capabilities of the current system and to collect more input data

to use for refining and retraining the system for future deployments. Each school year,

then, presents the opportunity to conduct controlled experiments with the dialogue system, within standard ethical boundaries for educational experiments. The data associated with

the project have naturally grown according to this annual test/collect cycle, as have the

capabilities of the system.

In this section, we name and describe the several collections of the Virtual Patient data which are used in the experiments in the following chapters. The first data set, collected

in 2016 from a ChatScript-based system, is called the core data set, and it formed the

basis for the system that was used to collect the rest of the data. After developing a

ChatScript/machine learning hybrid that performed well on the core data, we sought to validate that system the next year by comparing it to a ChatScript-only system. This resulted

in the collection of the enhanced data set in 2017, which is further partitioned into CS-only

and hybrid subsets, named for the systems used to collect them. Next, we collected a spoken

data set in 2018, after upgrading the software to use a spoken interface. Finally, the core

and enhanced data sets were re-annotated under different criteria, and the union of those

re-annotated data sets is called the strict data set. Additional data has been collected, but

remains incompletely annotated at the time of writing. The complete label set for strict

relabeling of the text data sets described here is available as an example of questions the

patient can answer in Appendix A. We begin with a description of the general collection and

annotation methods.

3.1.1 Annotation

Data are collected by simply logging user interactions with some version of the system,

including every input from the user, as well as the system’s response to each query. After

collection of a corpus of such logs, each dialogue turn is evaluated by an analyst (paid or volunteer at various stages in the project) for whether the system-provided response was

correct or incorrect. A single dataset may have been annotated by one or more such analysts.

After the initial correctness determination, the incorrect turns were passed off to the domain

expert to determine what the correct class should have been.

A label in this project is a canonical sentence whose meaning is equivalent to that of the query

sentence. As discussed in the introduction, this is in some sense a paraphrase identification

task, but since there are a finite number of relevant queries to which the patient can respond,

we treat the question identification problem as a multiclass classification problem. So for

example, the queries:

1. “And since then you have had the constant back pain[?]”

2. “OK, and does it hurt you all the time?”

should both be identified as belonging to the same class, and should produce the same

response starting with, “It is pretty constant, although sometimes it is a little better or a little worse.” This class is labeled with the canonical sentence, “Is the pain constant?”

Preliminary analysis identified a small number of classes that were deemed semantically

equivalent for the purposes of assessing the validity of the data-driven approach, and these were manually collapsed to representative labels. As an example, the two queries:

1. “Tell me about your parents.”

2. “Are your parents still alive?”

are semantically distinct in reality, and the ChatScript engine recognizes them as distinct

inputs. However, by design, they each produce the response, “Both of my parents are alive

and well,” and they are mapped to the same class for the purpose of training the machine

learning system. We refer to this manual re-mapping effort as label collapse. Label collapse

makes the question identification problem somewhat easier by virtue of having fewer subtly

distinguished classes, but also creates additional headaches when comparing performance

across data sets. We further note that the class counts presented below are drawn from the

correct labels, exclusive of the potentially incorrect interpretations that ChatScript may have

made at the time of data collection.

3.1.2 Core Data Set

The core dataset was collected from students interacting with the initial ChatScript-based virtual patient as voluntary, extracurricular practice for an examination with a standardized

patient. Each data point consists of the student’s query sentence, the interpretation of the

query by ChatScript, ChatScript’s response, and the correct label.

This core dataset was originally developed as a follow-up to preliminary work by Jaffe

et al. (2015), which did formulate the question identification task as explicit paraphrase

identification. In this setup, pairwise comparisons between an input and all labels were

performed, ranked, and the best match was returned. To evaluate this approach, certain

classes whose instances were not paraphrases were excluded — chiefly negative classes

defined mainly by an absence of some information as opposed to its presence. These

include a catch-all class for questions about any body system that does not affect the case

presentation, which we call “negative symptoms.” Examples can include non-paraphrases

such as, “Do your fingers ache?” and, “Does your knee hurt?” Another such class consists of

questions that are out-of-scope entirely, for various reasons we discuss in detail in a later

chapter. After development of the dataset for evaluation of the paraphrase identification

approach, it was discovered that the multiclass classification approach was both more

effective with the volume of data available, and faster, making it more conducive to a

real-time deployment. Nonetheless, the exclusion of these nonparaphrastic classes was not

reevaluated, and development was standardized around this dataset. This innocent historical

decision had significant impacts on our replication experiment, as described in Section 4.2.

The original dataset in the above format consists of 94 dialogs, comprising a total of

4,330 queries and responses, representing 359 classes (after label collapse). We refer to this

set as the core dataset.

Figure 3.1: Frequency of labels in the core dataset by rank, with quintiles by color

Since the data are essentially natural dialogs, the occurrence counts of classes are extremely unbalanced, exhibiting a long Zipfian (Zipf, 1949) tail. For example, nearly every interaction will include some variant of, “Is there anything else I can help you with today?” while, “Does moving increase the pain?” is far less common, but clearly clinically significant. A quintile analysis of the core dataset is shown in Figure 3.1, showing the frequency of occurrence of each class. The top quintile consists of only ten classes, while the bottom quintile consists of 256 classes, many of which appear only once in the dataset, but which nevertheless should be recognized and answered correctly. This long tail presents one of the primary challenges of the project.
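For concreteness, the kind of quintile analysis shown in Figure 3.1 can be computed with a few lines of Python; the sketch below is illustrative only, and the function name and boundary handling are not taken from the project code.

from collections import Counter

def label_quintiles(labels):
    """Group classes into quintiles by cumulative share of examples.

    `labels` holds one class label per example; classes are sorted from most
    to least frequent, and each quintile covers roughly 20% of the examples.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    quintiles = {q: [] for q in range(1, 6)}
    cumulative = 0
    for label, count in counts.most_common():
        cumulative += count
        q = min(5, int(5 * cumulative / total) + 1)
        quintiles[q].append(label)
    return quintiles

Applied to the core data set, a grouping of this kind reproduces the skew described above: a handful of classes fill the top quintile, while hundreds of classes fall in the bottom one.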

3.1.3 Enhanced Data Set

Previous work with the Virtual Patient using the core dataset found that a combination

of ChatScript and a Text CNN (Kim, 2014) were effective at classifying the inputs in that

set (Jin et al., 2017). To validate that finding, we deployed their hybrid system in a live,

randomized A/B test with new users against a system using only ChatScript. This work is

described in Chapter 4, and we call the data collected from that experiment the enhanced

data set. That data consists of a further 154 dialogs, comprising 5,296 turns. Since the data were derived from systems with different architectures, we maintained subdivisions of the

enhanced set into CS-only and hybrid subsets, for the data obtained from the ChatScript-only

and hybrid systems, respectively. The enhanced dataset contains 258 classes that were seen

in the core dataset, plus 74 rare classes that were new or had been previously unseen. The

unseen classes comprise 142 turns of the enhanced dataset, or approximately 2.7% of the

data. This data still exhibits a long tail, although there are important shifts in frequency of

some classes relative to the core data set, which are discussed in more detail in Chapter 4.

The CS-only subset consists of 2,497 examples, with the remaining 2,799 examples in the

hybrid subset.

3.1.4 Spoken Data Set

After the deployment of the hybrid system and the associated data collection effort, the

system was adapted to use a spoken interface, which introduced a new set of data collection

challenges. The spoken system uses an off-the-shelf cloud-based speech recognition API to

convert incoming speech audio to text, and the captured text fit easily into the existing logging

framework. However, we also captured the audio for the sake of post-hoc comparisons to

the captured text, which required additional annotation efforts.

One challenge in implementing the spoken interface, described in more detail later, is

simply understanding when a user is finished talking, since our goal of simulating a natural

patient interaction precludes some more robust methods like push-to-talk. The commercial

speech-to-text service provides some facilities for this, namely by including a completeness

flag in the recognition results, but we found it to be a bit too sensitive for our use case, since

users often pause while thinking of what to ask or how to phrase a question. We mitigated

some of this by inserting a wait period before responding to collected text, but errors in the

implementation sometimes left the collected text misaligned with the collected audio. Thus,

the annotation of the spoken data set required an extra alignment step.

In detail, the speech-to-text service indicates the start of speech by returning a recognition

result, and to deal with network latency, we captured at least one second of audio preceding

a speaking event through caching. After the speech-to-text service returned an utterance

completion event, we waited a variable length of time, dependent upon the number of words

in the collected transcript (the fewer the words, the longer the wait). If additional input came

in during the wait, we appended to the captured audio and reset the wait period. After the

collection phase, audio captured this way was sent to a manual transcription service for full verbatim transcription, including disfluencies, repairs, etc. The regular annotation method

described above was used to determine the correct labels for the inputs, but notably, the

automatically captured transcripts from the speech-to-text service were used, not the manual

transcripts.
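A minimal sketch of the wait-and-append heuristic described in this paragraph is given below; the class name, the wait schedule, and the specific durations are illustrative assumptions, since the exact values used in the deployment are not specified here.

import time

class UtteranceCollector:
    """Accumulate speech-to-text results into one utterance, waiting longer
    after short transcripts, which are more likely to be mid-thought."""

    def __init__(self, base_wait=3.0, per_word_discount=0.25, min_wait=0.5):
        # Illustrative timing parameters, not the deployed values.
        self.base_wait = base_wait
        self.per_word_discount = per_word_discount
        self.min_wait = min_wait
        self.pieces = []
        self.deadline = None

    def wait_time(self, n_words):
        # The fewer the words, the longer the wait before finalizing.
        return max(self.min_wait, self.base_wait - self.per_word_discount * n_words)

    def on_eou(self, partial_transcript):
        # Called when the STT service signals a (possibly premature) end of
        # utterance; any new input resets the wait period.
        self.pieces.append(partial_transcript)
        n_words = sum(len(p.split()) for p in self.pieces)
        self.deadline = time.time() + self.wait_time(n_words)

    def is_complete(self):
        return self.deadline is not None and time.time() >= self.deadline

    def finalize(self):
        text = " ".join(self.pieces)
        self.pieces, self.deadline = [], None
        return text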

The system using the spoken interface was deployed as an assignment to first-year

medical students, in which they were asked to interview a middle-aged male patient with

back pain, and fill out an electronic health record for his case. This exercise resulted in the

collection of 251 conversations containing a total of 12,821 automatically captured utterances. We captured 12,326 audio files which were sent out for manual transcription (again,

implementation errors in our rudimentary end-of-utterance detection caused occasional

misalignment in automatic transcripts relative to the audio).

After the automatic transcripts were fully annotated, they were manually aligned to

the manual transcripts. This included repairing a number of automatic transcripts that were interrupted by an erroneous end-of-utterance. This repair usually consisted of simply

concatenating the interrupted automatic transcripts, and annotating the location of the con-

catenation with an “[EOU]” token. The label of the resulting sentence was then reassessed.

A small number of unexplained errors left some audio without any matching automatic

transcripts, and vice-versa; such unmatchable inputs were deleted. A small number of

conversations were judged to be unserious or conducted in such a noisy environment that the

data was not useful; these were removed. The end result is a collection of 11,912 data points

in 231 conversations, each example consisting of audio, a manual transcript, an automatic

transcript, the correct label, the system’s response, and the system’s original interpretation

of the automatic transcript. 450 utterances were either interrupted by erroneous end-of-

utterances, or were fillers that were incorrectly interpreted as complete queries (374 and

118, respectively, with 42 instances including both phenomena). This data set includes 398

classes, a number of which are subdivisions of the No Answer class that reflect common

trends in that subset of the data (i.e., ChatScript templates that should probably be written,

but did not exist at the time of data collection).

3.1.5 Strict Relabeling

The experiments reported in Chapter 5 combine the core data set with the enhanced data

set, both to provide more training data, as well as to allow for testing with a fixed set instead

of using cross-validation. In the course of development with this data, we discovered a

number of mislabeled data points, which were often incorrectly labeled as though ChatScript

had produced the correct response to begin with. Upon closer inspection, we realized that

in many of these cases, ChatScript had actually produced an acceptable response to the

input query, but it had done so by misinterpreting the input. This, in turn, highlighted an

important bias introduced by only manually annotating class labels for points which were

flagged as incorrect: an incorrect interpretation that produced an acceptable response often went uncorrected. An example is the utterance, “We will try to figure out what is up but

first i have a few questions,” which was labeled as the class make a follow up appointment.

The pre-programmed response for that class is simply, “Of course,” which is a perfectly valid response to the input. However, the class I have more questions is a far more accurate

interpretation of the meaning of the original sentence.

Since we treat the problem as a classification problem, our evaluation metrics do not

(and currently cannot, short of human evaluations for every experiment) account for appro-

priateness of the provided response; they assume a single correct answer for every question,

and they are based on the semantics of the question, not on the semantics of the response.

Thus, it is important that the labels reflect a correct interpretation of the input, not the ap-

propriateness of the system’s response. Nonetheless, annotation for an appropriate response

is a valid strategy, so we do not think this invalidates previous work; we just consider this

a more strict annotation strategy that may apply better depending on the goals of a given

experiment.

Accordingly, we reexamined every example in the core and enhanced data sets, by

looking at all members of a class and determining if every input sentence had equivalent

semantics, and reclassifying those that did not. This resulted in the elimination of 43 classes

from the data set, the addition of 3 new classes, and changed labels for approximately 15%

of the data. We held out the hybrid data subset as a test set, which contains only 268 classes,

of which 15 are unseen in either the core or CS-only data sets. The reannotation reduced the

accuracy of ChatScript on the data to approximately 70%. We call the data relabeled in this way the strict relabeling.

3.2 System Description

We now turn to an overview of the components that comprise the Virtual Patient system, which is largely documentary, but facilitates understanding of the experiments presented in

following chapters, and also provides a basis for the structure of this dissertation.

The virtual patient is deployed as a simple client/server architecture, with the client

bearing responsibility for user interface functions, including rendering the animated image

of the patient to the user, displaying responses, and passing the user’s queries to the back

end server. The server, on the other hand, houses the apparatus for actually interpreting the

user’s query, formulating a response, tracking the state of the conversation to the extent that

such tracking is done, storing data for later analysis, and automatic assessment of student

performance. An overview diagram is shown in Figure 3.2. The following subsections

examine the client and server in more detail, followed by a description of the extensions to

both that were required for a spoken interface.

3.2.1 Front End Client

The client software is developed in Unity 3D,3 a multi-platform 3D game development

engine. At various times, the virtual patient client application has been deployed to different

platforms. For the experiments in Chapter 4, it was deployed in a WebGL format4 and

hosted on an Apache web server running on a university computer. This enabled students to

access the virtual patient anywhere that they had access to a and a keyboard.

For the deployment that collected the spoken data set, the client was deployed to iOS. Every

medical student is issued an iPad, so this allowed us to take advantage of consistent hardware

for audio recording.

3D models were acquired from the Unity Asset Store5 and animated using Autodesk

Maya.6 The patient can express a range of emotions, and these are controlled with messages

from the back end service that are sent concurrently with linguistic replies to user queries.

After providing their name for academic assessment purposes, students can interact with the

patient by simply typing a question in the case of the text-based system, or by talking

to the spoken interface deployment. When they are finished with the interview, they can

click a button on the screen to receive a transcript of their session, including feedback about

questions that should have been asked. This transcript comes in the form of a PDF file that

is generated dynamically in the browser with JavaScript, using information provided from

the back end server. Students also have access to an instructional PDF, hosted statically on

the same server as the WebGL application.

3https://unity.com/
4https://www.khronos.org/webgl/
5https://assetstore.unity.com/
6https://www.autodesk.com/products/maya/overview

3.2.2 Back End Web Service

As stated above, while the client application bears responsibility for displaying content to

the user, the back end service is generally responsible for deciding what that content is. Not-

ing that the typewritten chatbot paradigm imposes a regular synchronous request/response

usage pattern, this back end is designed as a typical REST-like web application (Fielding,

2000), communicating over HTTPS. We note that the web service is not actually RESTful,

because it maintains a fair amount of state information about each user’s conversations with

the virtual patient, and therefore, as currently implemented, does not properly scale beyond

installation on a single server. Nonetheless, the single server has proven to have adequate

capacity for its intended educational use; that is, up to 200 users over the course of a week, with a deadline-driven ramp up in concurrent usage. Despite not being truly RESTful, we

made an effort to maintain correct usage of HTTP verbs, return codes, etc. with respect to

state changes when designing the API.

At the time of the experiment described in Section 4.2.1, the web service implemented

the API shown in Table 3.2. Only the specified verbs were implemented. This API was later

extended, as described in the next section.
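To make the request/response pattern concrete, a client-side exchange against this API might look like the following sketch, using the Python requests library purely for illustration; the host name and the JSON field names are assumptions rather than the documented interface.

import requests

BASE = "https://example-host.example.edu"  # placeholder host

# Start a conversation; the service returns an opening greeting and the URL
# of the newly created conversation resource.
resp = requests.post(f"{BASE}/conversations/", json={"student": "example-name"})
conversation_url = resp.json()["url"]       # assumed field name
print(resp.json().get("greeting"))          # assumed field name

# Submit a typed query within that conversation and print the patient's reply.
resp = requests.post(f"{BASE}{conversation_url}query/",
                     json={"text": "where is your back pain"})
print(resp.json()["reply"])                 # assumed field name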

With limited personnel developing the back end service, many design decisions were

made with an eye toward minimizing development time. The stacked CNN and the hybrid

system described in section 4.1 were implemented in PyTorch (Paszke et al., 2017) and

SciKit-learn (Pedregosa et al., 2011), respectively. Therefore, when selecting a framework

for deployment of the back end web service, a premium was placed on those with a Python

programming interface. Among these, Flask7 was selected for having a low programming

7https://palletsprojects.com/p/flask/

overhead and being fairly lightweight. The Flask application was deployed within a gEvent8

container for efficiently handling concurrent traffic with an event-based implementation.

It was selected as a lighter-weight alternative to a more full-featured web server such as

Apache httpd9 with installed modules, which was expected to require more maintenance

effort. For purely logistical reasons, this Flask web service application and associated

gEvent container were hosted on a separate university computer from the one hosting the

Virtual Patient client WebGL application. This separate hosting did not introduce noteworthy

latency effects, but did require a Flask extension to enable cross-origin resource sharing.

The server hosting the web service uses a six-core 2.4GHz AMD Opteron processor with

32GB of memory.
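As a rough illustration of this deployment choice, serving a Flask application inside a gevent container looks like the sketch below; the module layout, port, and the use of the Flask-CORS extension are assumptions, not the project's actual configuration.

from gevent import monkey
monkey.patch_all()  # cooperative I/O so one process can handle concurrent requests

from gevent.pywsgi import WSGIServer
from flask import Flask, jsonify, request
from flask_cors import CORS  # assumed stand-in for the CORS extension mentioned above

app = Flask(__name__)
CORS(app)  # the WebGL client is served from a different machine

@app.route("/conversations/", methods=["POST"])
def create_conversation():
    # In the real service this creates a conversation resource and returns the
    # opening greeting plus the URL of the created resource; values here are illustrative.
    _payload = request.get_json()
    return jsonify({"greeting": "hello doctor.", "url": "/conversations/1/"}), 201

if __name__ == "__main__":
    # Event-based container in place of a heavier Apache httpd deployment.
    WSGIServer(("0.0.0.0", 8080), app).serve_forever()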

An important auxiliary function of the back end service, beyond supporting the edu-

cational objectives of the virtual patient, is to collect data to support the present research

and further improvements to the system. This function is fulfilled with a simple MySQL

database instance (specifically, MariaDB10), using only two tables. SQL defining these

tables is provided in Appendix B. The web service interfaces with this database through the

Flask-MySQLdb extension.
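A minimal sketch of that wiring is shown below; the connection settings, table name, and column names are placeholders (the actual table definitions are those in Appendix B).

from flask import Flask
from flask_mysqldb import MySQL

app = Flask(__name__)
# Placeholder connection settings; real values are deployment-specific.
app.config["MYSQL_HOST"] = "localhost"
app.config["MYSQL_USER"] = "vp_service"
app.config["MYSQL_PASSWORD"] = "change-me"
app.config["MYSQL_DB"] = "virtual_patient"
mysql = MySQL(app)

def log_turn(conversation_id, query_text, response_text):
    """Store one dialogue turn for later annotation.

    Hypothetical schema for illustration only; call from within a request
    handler so a database connection is available."""
    cur = mysql.connection.cursor()
    cur.execute(
        "INSERT INTO turns (conversation_id, query, response) VALUES (%s, %s, %s)",
        (conversation_id, query_text, response_text),
    )
    mysql.connection.commit()
    cur.close()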

8http://www.gevent.org/
9https://httpd.apache.org/
10https://mariadb.org/

Figure 3.2: Overview of client-server architecture.

Table 3.2: The implemented REST-like API.

Endpoint                      Description
/conversations/               POST: Create a new conversation resource with parameters provided in JSON post form. Return JSON containing opening greeting and URL for created resource.
/conversations//              GET: Show the metadata for the conversation specified by
/conversations//query/        POST: Create a new query resource within the conversation specified by , with the query text specified in JSON post form. Return JSON containing reply.

3.2.3 Response Model

The model developed by Jin et al. (2017) consists of a CNN ensemble (the stacked CNN) and the ChatScript engine, both of which provide an independent interpretation of the user’s input. Both of these interpretations are fed into a logistic regressor that chooses which interpretation is most likely correct. The deployed web service, then, must execute the mechanics of accepting an input query from the user, deciding between the ChatScript interpretation and the stacked CNN interpretation, and returning the correct response, as coded by the expert content author within the ChatScript instance. The process for this

decision making is explained in more detail below, but here we continue the overview of the web service by noting the organizational structure of the components of the process. The

trained stacked CNN and logistic regressor are simply embedded in the web service, and

are loaded into memory as soon as the service is started. The stacked CNN, despite being

trained using a GPU, is configured within the service to perform computations using only

the CPU. This is partially due to hardware availability, but also based on the assumption

that the overhead of loading data into the GPU memory would be greater than the benefit

of parallel computation for single examples. Note, though, that we never performed a

rigorous evaluation of this assumption. The ChatScript instance runs in server mode in

an independent process from the web service, which may or may not reside on the same

machine as the web service. The back end service makes calls to the ChatScript instance

over raw socket connections.

When the user submits a question to the client application, the text of the query, along with some metadata, is packaged into a JSON object and sent to the web service via an

HTTP request. The web service unpacks the JSON object and forwards the query text to

the ChatScript engine, which replies with what it believes to be the correct response. By

design, ChatScript does not automatically provide any metadata along with this response

text. Since responses are not unique to an interpretation of a query (e.g. yes/no questions),

and because the interpretation is needed for features that serve as input for the logistic

regression model, we then send the debugging command :why to the ChatScript instance, which reveals information about the response, including the name of the template and the

pattern that matched the input. Due to time constraints and the label collapse issue (see section 3.1.1), during the experiment in which the enhanced data set was collected, this metadata could only be used to determine whether or not ChatScript found a match for the query. This was confirmed to only have a minimal negative effect on accuracy with the core dataset (< 0.1%).

With the ChatScript response to the query adequately characterized, the same input is sent through the stacked CNN, and the output of both systems is sent to the logistic regression model, as described in Section 4.1.5. If the two systems differ, and the chooser determines that the stacked CNN is the better choice, the canonical label sentence for the stacked CNN’s chosen class is sent to the ChatScript instance. Thus, whenever the stacked CNN’s response is chosen, ChatScript sees two volleys. With default settings,

ChatScript is usually quite stateful, removing discussion topics that have already been seen, to prevent unnaturally repetitive responses; we configured the engine to repeat most answers an effectively unlimited number of times, so this double volley strategy is acceptable for those answers. For the few answers that do not accept unlimited repeats, we do not distinguish different replies in our evaluations.

Once the chosen response has been obtained, the synchronous HTTP response is returned to the client. Subjectively, the latency is noticeable compared to a system using only

ChatScript, but is not worse than the normal pace of a spoken conversation. Some of the latency is due to the double volley and network connection to the ChatScript server, but most of the wait is due to the computations of the stacked CNN.
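Putting the pieces of this subsection together, the per-query decision flow can be sketched roughly as follows; the objects and method names are hypothetical stand-ins for the ChatScript socket client, the stacked CNN, and the chooser, and the feature handling is simplified.

def answer_query(query_text, chatscript, stacked_cnn, chooser):
    """Rough sketch of the hybrid decision flow for a single user query."""
    # 1. Ask ChatScript for its reply, then use the :why debugging command to
    #    recover which template and pattern matched, since ChatScript returns
    #    no metadata with the reply itself.
    cs_reply = chatscript.send(query_text)
    cs_match_info = chatscript.send(":why")

    # 2. Run the stacked CNN on the same input.
    cnn_label, cnn_scores = stacked_cnn.classify(query_text)

    # 3. Let the chooser decide which interpretation to trust, based on
    #    features from both systems (see Table 4.2).
    features = {"cnn_label": cnn_label, "cnn_scores": cnn_scores,
                "cs_match": cs_match_info}
    if not chooser.prefers_cnn(features):
        return cs_reply

    # 4. Second volley: send the canonical label sentence for the CNN's class
    #    back through ChatScript, so that response selection and scoring both
    #    remain inside the ChatScript instance.
    canonical_sentence = stacked_cnn.label_sentence(cnn_label)
    return chatscript.send(canonical_sentence)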

3.2.4 Miscellaneous

One of the goals of the virtual patient is to provide rapid feedback to students about

their interviewing technique, which we call scoring, although it might more accurately be

called summary report generation. This is done by classifying queries by subject according

to rules, and reporting the coverage of important subjects. In the present setup, this function

is coded in the ChatScript instance, which requires some accounting for the double volley

design when the CNN’s interpretation is chosen. Score summary reports are generated using

JavaScript code hosted on the same machine hosting the virtual patient client.

Finally, we note that the experiment detailed in section 4.2.1 requires some users to

converse with an agent implemented using ChatScript only; we include a per-conversation

switch that allows the web service to bypass the stacked CNN and chooser when producing

its response.

3.2.5 Spoken Interface Extensions

The addition of the spoken interface to the Virtual Patient client application required a

number of extensions to the text-based version.

As stated above, the client was changed to be deployed as an iOS app, instead of through

a web browser, to take advantage of consistent microphone hardware. The basic strategy for

handling audio input was to use an off-the-shelf cloud-based commercial speech recognition

system to transcribe spoken inputs and forward the transcripts to the existing text-based back

end server. This also required development of a couple of strategies for generating speech to

provide responses on the client side. As described in Section 3.1.4, it was important to also

capture the audio that generated the transcripts, so changes to both the client and server were

necessary to capture and store that data, described below. The shift to deploying on iOS

required minor changes to the scoring report generation. All of these changes are described

below in detail.

For speech-to-text (STT) capabilities, we chose IBM Watson’s cloud service, since they

offer an open-source Unity API that was easy to incorporate into the existing client software.

This decision in turn made it simple to use Watson for text-to-speech (TTS) on the client

as well. The STT library maintains an open connection to the cloud server via websockets,

and returns speech transcripts as they are decoded, with additional metadata, e.g. to indicate

its belief that the end of an utterance (EOU) has been observed. The TTS service is more

atomic, accepting text and returning an audio file. We adapted the STT code to record audio

to send to our back end service for data collection, by continuously caching one second

of data, recording audio at any time the STT service returned results, and prepending the

cached audio. When EOU was received, the recorded audio was encoded as a WAV file on

the client and sent to a new endpoint on the back end web service to store. When our web

service returns a response, the text of that response is simply sent to the TTS service, and

the resulting audio file is played on the client, concurrent with appropriate lip movement

animations for the avatar.

To prevent the STT from recognizing the output of the TTS and triggering a cycle of

self-responses, we simply mute the STT module for the duration of the audio output. The

text interface was changed to show only a bullet glyph for each recognized word, with the

reasoning that listening feedback was helpful, but users would be less inclined to alter their

speech patterns if they could not observe misrecognitions taking place. Captions for patient

responses were also removed, to require the more natural interaction of actually listening to

the patient.

Table 3.3: Extensions to the back end API.

Endpoint                            Description
/config/                            GET: Show the configurable parameters for the client and setup specified in the URL parameters. Allows for easy switching between speech production strategies without redeployment of client.
/score/                             GET: Show the conversation summary for the conversation number specified in the URL parameters.
/conversations//query//audio        POST: Create a new audio file resource (i.e. save the audio) within the conversation specified by , for the existing query specified by .

For work not otherwise described in this dissertation, it was necessary to have a Spanish-

accented English voice for the Virtual Patient, and there were no commercial TTS options

available to handle this requirement. Since all responses are fixed and finite anyway, we

developed an alternative method for producing speech responses, namely, pre-recording

responses, storing them on the back end service, and downloading them to the client upon

startup. After download, a simple lookup table was used to find the appropriate audio file to

play, upon receiving a response to a query from the back end web service. The software was written to make it easy to switch between TTS and pre-recorded audio for producing speech

on the client.

Scoring report generation was changed from PDF generation in the browser to a simple

HTML page, to facilitate cross-platform operability. This involved creating a new endpoint

on the back end web service. The full set of API modifications to accommodate the changes

described in this section is shown in Table 3.3. A diagram of the modified architecture is

shown in Figure 3.3.

Figure 3.3: System architecture with spoken interface extensions.

3.3 Challenges and Opportunities

The systems, data, and goals of the Virtual Patient project present many unique chal-

lenges and opportunities. We briefly consider a few of the most fundamental problems here, which motivate the work in the following chapters. We also introduce a few of the possi-

bilities for future work that are particularly suited to this project, which will be considered

in more detail in Chapter 7. We can categorize these topics under three broad umbrellas:

data scarcity challenges; known issues in dialogue that can be studied particularly well with

our data and system; and more general linguistic phenomena that, if handled well, would

substantially improve the experience for users of the Virtual Patient application.

The data scarcity challenges are the primary motivation for the contributions in this dis-

sertation. Included in this family of problems is the long-tailed class frequency distribution

that is inherent in all of the data sets described in Section 3.1. Since the least frequently

occurring classes are represented in the data by only a handful of examples at most, it is

difficult to learn classification boundaries for those classes. The other side of this coin,

however, is that we have a large number of semantic classes which in many cases have

rather subtle distinctions. We assume these are orthogonal for the sake of producing a single

one of the known responses for the question that was asked, but the reality is that many of

these classes share common linguistic substructures, which implies latent structures in the

label space. This kind of large, implicitly structured label space is rare in text classification

literature. Chapters 4 and 5 describe different machine-learning architectures designed

to handle the long tail problem. The architecture that best handles rare classes seems to succeed by

leveraging some of the latent structure in the label space, apparently learning components of

frequent classes that generalize well to rarer classes.

Another very challenging data scarcity problem in the Virtual Patient data is simply

identifying questions that the system is not programmed to answer. We include this problem

under the banner of data scarcity because the number of sentence meanings that the system

cannot handle is literally infinite, and as such it is impossible to collect enough data to

positively characterize these sentences as a class. Strategies for handling such problems are

found in the literature as instances of the general task of one-class classification, although

our specific variant of the issue reflects additional challenges that are not common in most

existing data sets. We characterize in detail the subtleties of our problem in Chapter 4, and we develop a baseline approach to this difficult problem.

As the performance of the system improves, it naturally becomes desirable to increase

the scope of the capabilities and applications of the system. One such example, described

above, is the expansion of the user interface to operate in the modality of speech instead of

text; another would be the implementation of new patients with different case presentations

and diagnoses. In either case, data scarcity again becomes an issue, as the models trained for

one purpose may not perform as well for the new purpose, even though the separate tasks

may clearly be related. The obvious solution is to train new models for the new tasks, but

by virtue of their novelty, the new tasks lack representative data. In these cases we would

hope to develop strategies for transfer of the knowledge from one domain or modality to

another. Chapter 6 describes such an approach: a novel method of improving performance

in the spoken modality without the benefit of any speech data.

The dialogue issues that surface in the Virtual Patient project have a big practical

impact on the quality of user interactions. End-of-utterance (EOU) detection is a known

problem (Ferrer et al., 2002), and our naïve implementation, discussed above, is inadequate.

Furthermore, our modeling of the interaction as a fixed-initiative conversation with strictly

alternating turns presents its own benefits and disadvantages. On the plus side, it has

produced a very controlled environment to facilitate fairly extensive data collection and

study of interesting linguistic phenomena; however, the data make clear that this format is at

times too inflexible or frustrating for users. Despite the inherently one-sided nature of the

interview interaction, the data contain some examples of basic turn-taking related phenomena

that could pave the way for more complex mixed-initiative conversations. Further, and

related to the strictly alternating turn structure, our decision to mute speech recognition when producing speech limits our ability to handle pragmatic turn-taking phenomena

like overlapping speech and barge-in. The Virtual Patient system presents an excellent

experimental environment for exploring solutions to these problems, since the present work

establishes effective baselines for comparison to new approaches using the same corpora.

Chapter 7 proposes some improvements for these issues.

Finally, the Virtual Patient presents several opportunities for studying more general

linguistic phenomena. Since members of each class are effectively paraphrases of each

other, the data can be used to learn models for generating and ranking paraphrases (Jin

et al., 2018). Coreference errors are common in all existing approaches, since the models

ignore context; straightforward approaches to context management (e.g., Bapna et al.,

2017) would likely yield a substantial benefit to the learned models. Language generation

also presents some compelling opportunities for user experience improvements, with the

possibility of conditioning responses on information introduced in previous turns, generating

backchannels appropriate for particular speech acts, or conditioning responses relating to a

single propositional truth value relative to the polarity of the query about that proposition.

Some of these possibilities are being explored by colleagues.

3.4 Conclusion

In this chapter, we provided detailed descriptions of the data sets that have been collected

in the Virtual Patient domain, as well as a fairly complete account of the software systems

that make up the dialogue agent that medical students interact with. These components

represent a substantial amount of engineering work in themselves, but they are really just the

framework that uniquely allows for the exploration of several interesting research questions.

This dissertation specifically aims to address several of the problems surrounding data

scarcity, and those experiments are described in the following chapters.

Chapter 4: Hybrid System Deployment and Out-of-Scope Handling

In the last chapter, we provided a detailed description of the Virtual Patient software and

the data that defines the domain, as well as laying out many of the challenges that motivate

the rest of the dissertation. In this chapter, we begin to address and further characterize the

data scarcity problems in this domain. We do so in three main parts: the first is a relatively

straightforward reproduction of previous work (Jin et al., 2017). The primary motivation

for the reproduction is actually logistical: it was necessary to train new models in order

to deploy them in the production system described in the previous chapter, which in turn was necessary for the A/B validation test in the second part of this chapter. Despite this

pragmatic incentive for the reproduction work, it was done within a wider scientific context

of common difficulty in reproducing and replicating published experiments. While the

exact result was not reproduced, the results are wholly consistent with prior results, and the

exercise uncovered some bugs in the original work.

The hybrid system described in the first section of this chapter was designed to address

the combined problems of overall data scarcity and rare classes. It seems to achieve that

goal in cross-validation experiments on the core data set, but we wanted to test if it was

better than a ChatScript-only system when deployed with live users. The difference between

a static set of inputs obtained from live users and a dynamic system interacting with live

users may not be obvious, but since this is a dialogue task and successive inputs are not

truly independent (despite our modeling assumptions to the contrary), a different response

can change the course of a conversation. A fixed data set always asks the same questions, whether the answer is correct or not. Given this potential confound, the second experiment

in this chapter is a randomized A/B test to probe the quality of the hybrid system compared

to that of a ChatScript-only system. The main result is that the hybrid system does indeed

significantly outperform the control, but both give unexpectedly low performance compared

to the results of the first experiment. Analysis explains most of this as due to queries that

fall outside the design scope of the agent, many of which should be identified as such and

handled gracefully.

The third experiment in this chapter, then, aims to improve the identification of these

out-of-scope inputs. We implement a baseline approach that achieves modest improvements

on this specific subproblem, but analysis of the results shows both how difficult the problem

is, as well as a somewhat surprising reason that the approach improves the performance on

in-scope queries as well.11

4.1 Experiment 1: Baseline Reproduction

Note again that the work in this section is a reproduction of work published elsewhere

(Jin et al., 2017), but is presented here to provide a complete context for the rest of the work.

We use the same code base and data, with a different random permutation of the data set, while also incorporating data corrections and bug fixes. In particular, we corrected an error

in the calculation of the CS Log Prob feature (described below), which in turn highlighted

approximately two percent of the core data in which the ChatScript response was labeled

inconsistently with the true label set. Chronologically, these corrections took place after

11This work is under review as a submission to the Journal of Natural Language Engineering, Cambridge University Press.

the experiment reported in Section 4.2, but before the reproduction work described in this

section.

As mentioned above, writing the patterns for the ChatScript-based Virtual Patient is

something of an art form. While the ChatScript pattern syntax provides many advanced

features, developing awareness and understanding of these capabilities, and being able to

effectively utilize them, requires substantial expertise and investment of time. The variability

of natural language, particularly the occurrence of paraphrases, often requires multiple rules

that can trigger the same response. Meanwhile, the patterns can conflict with each other,

and of course, the opportunity for conflicts increases as more patterns are added, leading to

diminishing returns in terms of performance improvement per time invested. Furthermore, while ChatScript offers some spelling correction, pattern matching remains brittle in the

face of some typos that would otherwise be correctly overlooked by a human reader.

These challenges motivate the need for machine learning-based models that can learn

the distinctions between classes from data, rather than requiring expensive human authoring.

Given the challenges of the rule-based approach, we seek to determine the effectiveness of a

more data-driven approach. We hypothesize that a machine learning system may offer some

benefits; specifically, the use of word embeddings can capture similarity and relatedness of words (Mikolov et al., 2013), as observed through co-occurrence statistics in a large corpus of

text. This may allow for external data to effectively supplement the relatively small data set

available for the task in an unsupervised way. More direct supervision, by training a model

to perform the question identification task, can also naturally and automatically uncover the

non-conflicting regularities needed to perform the classification task defined by the data set, without requiring manual analysis on a per-pattern basis, as would otherwise be required with the rule-based approach.

Figure 4.1: Overview of the stacked CNN architecture.

Toward this end, we use a convolutional neural network (CNN) on the query text

input (Kim, 2014) to do the question identification, described in detail below. We find

that ensembling multiple CNN models is important for mitigating noise in the data due

to general data sparsity and label imbalance, and ensembling models trained on different

representations of the same input (word or character sequence) offers further benefit. Finally, we show that the error profiles of the ensembled CNN models and the rule-based ChatScript

system have some complementary properties, and that combining these systems with a

simple binary classifier leads to a substantial reduction in error compared to ChatScript

alone.

The training data used in this section is the core data set; the enhanced data set is

collected from the experiment conducted in Section 4.2.

4.1.1 Stacked CNN

The stacked CNN is an ensemble of ensembles of Text CNNs trained on different splits

of the core data set. Later, we show further benefits from choosing between the output of

the stacked CNN and the output of the original rule-based system.

At the highest level, the stacked CNN is making a classification decision based on the

outputs of two ensembles of CNNs operating on different forms of the same input. One of

these two sub-ensembles operates on the sequence of words in the query sentence (word

CNNs), represented as word embeddings (Mikolov et al., 2013); the other sub-ensemble works on the sequence of characters in the same query (character CNNs), again represented

as embeddings, but in a different space than the words. Each sub-ensemble consists of five

CNNs, which we call the sub-models, described in detail below. Each CNN is trained on a

different split of the data, described in Section 4.1.2. See Figure 4.1 for a graphical overview

of the stacked CNN.

4.1.1.1 CNN sub-model

The structure of each CNN is a one-layer convolutional network with max-pooling, which then feeds into a single fully connected layer, whereafter the input is classified as one

of the 359 classes in a final softmax layer.

For the word CNNs, the input is a sequence of k-dimensional word embeddings, mak-

ing the input a T × k matrix of real values for a sentence of length T. We use pretrained

Word2Vec vectors trained on the Google News corpus,12 thus k = 300. We experimented with GloVe.6B vectors (Pennington et al., 2014), but observed a slight performance degrada-

tion. Word embeddings are held fixed during training.

12https://code.google.com/archive/p/word2vec/

Analogously, input to the character CNN is a sequence of randomly initialized 16-

dimensional character embeddings. In contrast to the word embeddings, these are tuned

during training.

Let W be a set of integers, representing a set of kernel widths. The convolutional layer

is a set of m filters each of size w × k, w ∈ W, for a total of m × |W| kernels. These are all

convolved over the time axis of the input (i.e. the length of the sentence) to produce a feature

map for the sentence. Since the kernels span the full length of the input vector, k, they output

a single value per input element, which takes into account a variable amount of context

depending on the kernel width w, leaving the output in R^(T×m×|W|). For the word CNNs, m is 300 and W = {3,4,5}, while for the character CNNs m is 400 and W = {2,3,4,5,6}. The

output of each kernel is passed through a rectified linear unit (Nair and Hinton, 2010).

We use max pooling over the length of the sentence (Collobert et al., 2011) to represent

the sentence in a fixed size. This fixed dimensional output of the convolutional layer is

passed through a final fully connected linear layer containing as many units as there are

classes, and the final classification is determined by a softmax operation over the output of

this layer.
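The sub-model just described corresponds roughly to the PyTorch sketch below (hyperparameters shown for the word CNN); this is a simplified restatement of the architecture, not the project's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """One-layer Text CNN sub-model: convolutions over the word axis,
    max pooling over the sentence length, dropout, then a linear layer."""

    def __init__(self, emb_dim=300, num_filters=300, widths=(3, 4, 5),
                 num_classes=359, dropout=0.5):
        super().__init__()
        # One Conv1d per kernel width; each filter spans the full embedding dimension.
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, num_filters, w) for w in widths)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(widths), num_classes)

    def forward(self, x):
        # x: (batch, T, emb_dim); Conv1d expects (batch, emb_dim, T).
        x = x.transpose(1, 2)
        # ReLU after each convolution, then max pool over the time axis.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        h = self.dropout(torch.cat(pooled, dim=1))
        return self.fc(h)  # softmax is applied by the loss / at prediction time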

We found it beneficial to employ a few regularization strategies. We use 50% dropout

(Srivastava et al., 2014) at the output of the max-pooling layer, and use a variant of the

max-norm constraint employed in Kim (2014). Specifically, where they renormalize a row in

the weight matrix to the specified max norm only if the 2-norm of that row exceeds the max, we observe a benefit from always renormalizing to the specified norm. This renormalization

strategy came from a reimplementation of Kim (2014).13 We set the max norm to 3.0.

13https://github.com/harvardnlp/sent-conv-torch
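Concretely, the always-renormalize variant amounts to something like the following operation on the fully connected layer's weight matrix after each update (a sketch, assuming a PyTorch module such as the one above):

import torch

@torch.no_grad()
def renorm_rows(weight, max_norm=3.0, eps=1e-12):
    """Rescale every row of `weight` to 2-norm `max_norm`, rather than only
    clipping rows whose norm exceeds it."""
    norms = weight.norm(p=2, dim=1, keepdim=True)
    weight.mul_(max_norm / (norms + eps))

# e.g., renorm_rows(model.fc.weight) after each optimizer step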

4.1.1.2 Ensembling

Since our core data set is relatively small, we validate our training progress using cross- validation development sets; since the label frequencies are also very unbalanced, any given

random split can produce significant differences in test performance. We seek to minimize

this variance in model outputs by ensembling multiple CNN sub-models, each trained on

different splits of the data.

For each input form (words and characters), we train five of the CNN sub-models and

combine their outputs with simple majority voting. During evaluation in the training of

sub-ensembles, ties are arbitrarily broken in favor of the class with the lower index.

We combine the outputs of each ensemble of CNNs using stacking (Wolpert, 1992). This

is essentially a weighted linear interpolation of system outputs, where the weights assigned

to each system are trained on the data. Thus the final output of the stacked CNN, ŷ_t, is

    ŷ_t = softmax(α_w ŷ_{e,w} + α_c ŷ_{e,c})    (4.1)

where ŷ_{e,w} is the output of the word ensemble, ŷ_{e,c} is the output of the character ensemble, and α_w and α_c are the trained coefficients for the word and character ensembles, respectively.
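The vote counting and stacking step can be sketched as follows; shapes and names are illustrative, and the sub-model predictions are assumed to be class indices.

import torch
import torch.nn as nn
import torch.nn.functional as F

def vote_counts(predictions, num_classes):
    """predictions: (num_submodels, batch) tensor of predicted class indices.
    Returns (batch, num_classes) unnormalized vote counts for one sub-ensemble."""
    one_hot = F.one_hot(predictions, num_classes).float()  # (S, B, C)
    return one_hot.sum(dim=0)                              # (B, C)

class Stacker(nn.Module):
    """Learned interpolation of the word and character ensembles, as in Eq. (4.1)."""

    def __init__(self):
        super().__init__()
        self.alpha_w = nn.Parameter(torch.tensor(1.0))
        self.alpha_c = nn.Parameter(torch.tensor(1.0))

    def forward(self, votes_word, votes_char):
        return F.softmax(self.alpha_w * votes_word + self.alpha_c * votes_char, dim=1)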

4.1.1.3 Parameters

Any words that do not appear in the set of pretrained embeddings are assigned random values with each dimension drawn from the distribution Unif (−0.25,0.25). Care was taken

to ensure that this distribution falls within the range of the rest of the embeddings, as values

outside this range were seen to reduce performance of the word CNNs.

Convolutional kernels in the word CNNs are initialized with Unif (−0.01,0.01), and

the linear layer is initialized using a Normal distribution with zero mean and variance of

1 × 10^-4. In the character CNNs, weights for both the convolutional kernels and linear layer are initialized by drawing from Unif(−1/√n_in, 1/√n_in), where n_in is the length of

the input, as recommended by Glorot and Bengio (2010). The character embedding matrix

is initialized from N(0,1), and all bias terms (for both the word and character CNNs) are

initialized to zero.

4.1.2 Training

We use Adadelta (Zeiler, 2012) to optimize submodel weights, using recommended

parameters (ρ = 0.9, ε = 1 × 10^-6, initial learning rate = 1.0), and a cross-entropy loss

criterion.

Since data are limited, we use 10-fold cross-validation to train each CNN sub-model

and validate performance; thus, for each of the ten test folds, we train five CNN sub-models,

reporting average performance across folds. Each of the five sub-models in a fold is trained

on a different training/development split of the 90% of data remaining after the test fold is

held out, with the development set comprising the same number of examples as the test set,

and the remaining 80% used for training. Importantly, each training set is supplemented with the label sentences—that is, the canonical example—for each class. This ensures that

no class is completely unseen during training, which is otherwise likely, given the label

imbalance of the data.
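One way to realize this split-and-supplement scheme is sketched below; the shuffling and bookkeeping details are illustrative rather than a reproduction of the original scripts.

import random

def make_fold_splits(examples, canonical_examples, num_folds=10,
                     submodels_per_fold=5, seed=0):
    """Yield (train, dev, test) splits: one held-out test fold, a dev set of the
    same size drawn from the remainder, and a training set supplemented with the
    canonical label sentences so that no class is entirely unseen."""
    rng = random.Random(seed)
    data = examples[:]
    rng.shuffle(data)
    fold_size = len(data) // num_folds
    for i in range(num_folds):
        test = data[i * fold_size:(i + 1) * fold_size]
        rest = data[:i * fold_size] + data[(i + 1) * fold_size:]
        for _ in range(submodels_per_fold):
            rng.shuffle(rest)
            dev, train = rest[:len(test)], rest[len(test):]
            yield train + canonical_examples, dev, test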

We train for 25 epochs with minibatches of size 50, and take the last model with the best

dev set performance for test validation. Note that ensembling the sub-models by majority voting is a non-differentiable function, so each sub-model is trained independently of the

others. The performance of the sub-ensembles is validated by using the single majority

decision of the vote, but the input to the stacking network is the vector of vote counts for all classes, which is effectively an unnormalized distribution. The stacking network is trained by holding the sub-model parameters fixed (again, voting is non-differentiable) and training for another 25 epochs. The optimizer is the same as that used for the sub-models above, i.e. Adadelta with recommended parameters.

Table 4.1: Mean accuracy across ten folds, with standard deviations.

              Single             Ensemble
ChatScript    80.93%             n/a
Baseline      76.93 ± 1.58%      n/a
Word          76.17 ± 1.23%      76.95 ± 1.87%
Character     75.28 ± 2.08%      77.09 ± 1.53%
Stacked       n/a                78.01 ± 1.62%

4.1.3 Baseline

As a baseline comparison for the stacked CNN, we train a simple maximum entropy

(logistic regression) classifier implemented using SciKit-learn (Pedregosa et al., 2011) with

n-grams as input features, which is very similar to the method of DeVault et al. (2011b).

Specifically, we use 1-, 2-, and 3-grams of both the raw word forms and their Snowball-

stemmed (via NLTK; Bird et al., 2009; Porter, 2001) equivalents, as well as 1- through

6-grams of characters. The model is optimized using the stochastic average gradient for up

to 100 epochs. We use the same cross-validation strategy and the same splits as described

above.
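A baseline along these lines can be assembled from standard scikit-learn and NLTK components, as in the sketch below; the exact vectorizer settings are assumptions rather than the original configuration.

from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer

stemmer = SnowballStemmer("english")

def stem_all(texts):
    # Stemmed copy of each query, so word n-grams are also taken over stems.
    return [" ".join(stemmer.stem(w) for w in t.split()) for t in texts]

features = make_union(
    CountVectorizer(ngram_range=(1, 3)),                   # raw word 1- to 3-grams
    make_pipeline(FunctionTransformer(stem_all),
                  CountVectorizer(ngram_range=(1, 3))),    # stemmed word 1- to 3-grams
    CountVectorizer(analyzer="char", ngram_range=(1, 6)),  # character 1- to 6-grams
)

# Stochastic average gradient solver, capped at 100 epochs.
model = make_pipeline(features, LogisticRegression(solver="sag", max_iter=100))

# model.fit(train_queries, train_labels); model.predict(test_queries)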

4.1.4 CNN Results

We report performance in terms of overall accuracy, averaged over the ten folds. Results,

including standard deviations, are summarized in Table 4.1. We present a comparison of the

performance of single models in the cross-validation setup vs. the full ensembles, to clearly

illustrate the benefit of ensembling in our limited data regime. The Word and Character

entries are for the corresponding components of the full stacked CNN. We again note that

these numbers vary slightly from previously published work (Jin et al., 2017) due to data

corrections, different random cross-validation splits, and random model initializations, but

all of the same trends are observed, so we take this as a successful, if not exact, reproduction.

The naïve baseline exhibits surprisingly strong performance, which proves to be difficult

to beat. Indeed, none of the single models are able to do so, which, along with the high vari-

ance across folds, motivates the ensembling approach. We largely attribute the performance

of the baseline to the scarcity of the data, and expect, based on further analysis below, that

more data would widen the gap between it and the stacked CNN.

The ensembling strategy generally seems to boost performance, although the effect is

larger for the character-based models than for the word-based models. Individually, the word

and character ensembles exhibit similar performance to the maximum entropy baseline, but

stacking the word and character sub-ensembles gives a significant gain over the baseline

(p = 0.0074, McNemar’s test). This suggests that the word and character sub-ensembles are

picking up on complementary information.

Some examples suggest that the stacked CNN is able to take advantage of similarity

information that is latent in the word embedding space. For instance, where the maximum

entropy baseline chooses the label, “Do you have any medical problems?” in response

to the query, “Has your mother had any medical conditions?” the stacked CNN correctly

chooses, “Are your parents healthy?” Presumably similar embeddings for “mother” and

“parent” should help here. Another benefit of the stacked CNN seems to be robustness to

Figure 4.2: Accuracy of the tested models by label quintiles.

some particularly egregious misspellings, identifying the correct class for “could youtell me more about the pback pain,” for example.

Despite improving over the baseline, the stacked CNN still fails to outperform ChatScript, highlighting the benefit of rule-based approaches in the face of limited data. However, a closer look at the accuracy on specific classes when grouped by frequency, shown in

Figure 4.2, reveals some important trends. First, the stacked CNN consistently performs slightly better than the maximum entropy baseline at all label frequencies. More interestingly, for the 60% of examples comprising the most frequently seen labels, the learned models outperform ChatScript. For the least frequently seen data, the learned models exhibit a substantial degradation in accuracy, especially relative to ChatScript, whose accuracy on rare labels stays much closer to its accuracy on frequent labels, although it too degrades somewhat at lower label frequencies.

This difference in error profiles between ChatScript and the stacked CNN suggests that some combination of both systems may yield further improvements in accuracy. Indeed, one or both of the systems answer correctly in 92.45% of examples, establishing a rather

Table 4.2: Features used by the chooser.

Feature      Description
Log Prob     The log of the probability of the class chosen by the stacked CNN.
Entropy      The entropy of the distribution over the classes, as determined by
             the stacked CNN.
Confidence   The average over sub-models of the unnormalized score for the chosen
             class.
CNN Label    The one-hot label predicted by the stacked CNN.
CS Label     The one-hot label matched by ChatScript, or zero if there was no match.
CS No_ans    Boolean indicator that is true if and only if ChatScript did not find
             an answer.
CS Log Prob  The log probability of the class chosen by ChatScript, according to
             the stacked CNN.

impressive performance for a hypothetical oracle that can always correctly choose which

system to trust. This oracle performance motivates the development of a simple model that

can decide to override ChatScript’s answer with the stacked CNN’s answer, in the next

section.

4.1.5 Hybrid System

Having seen that ChatScript and the stacked CNN exhibit complementary error profiles, we seek to determine to what extent we can practically benefit from a system utilizing both

approaches. To do this, we construct a simple binary logistic regressor using SciKit-learn

(Pedregosa et al., 2011) to choose either the rule-driven or data-driven prediction, dependent

upon some features that can be extracted from each system. We (unimaginatively) call

the logistic model the chooser, and hereafter refer to the overall system (the stacked CNN

combined with ChatScript using the chooser) as the hybrid system.

Obviously, the correct label is not known at test time, so the chooser must use features of the input to make its decision. However, the chooser does also have access to meta-information about the decisions made by the stacked CNN and ChatScript, such as confidence of the decision, entropy of the distribution of output labels, or agreement between systems. Again note, though, that ChatScript, as a rule-based system, does not provide any statistical information about its decision, so meta-information is more limited there. Based on our experiments with the available features, we found those enumerated in Table 4.2 to be most effective.
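As an illustration, the features of Table 4.2 might be assembled from the stacked CNN's per-class scores and the two systems' predictions roughly as follows; this is a sketch with hypothetical names, and details such as the softmax normalization and the default value when ChatScript has no match are assumptions.

```python
import numpy as np

def chooser_features(cnn_scores, cnn_label, cs_label, num_classes):
    """cnn_scores: the stacked CNN's (unnormalized) per-class scores for one query;
    cs_label: ChatScript's class index, or None if it produced No Answer."""
    probs = np.exp(cnn_scores - np.max(cnn_scores))
    probs /= probs.sum()                              # distribution over classes
    log_prob = np.log(probs[cnn_label])               # Log Prob
    entropy = -np.sum(probs * np.log(probs + 1e-12))  # Entropy
    confidence = cnn_scores[cnn_label]                # Confidence (stand-in for the sub-model average)
    cnn_onehot = np.eye(num_classes)[cnn_label]       # CNN Label
    cs_onehot = np.zeros(num_classes)                 # CS Label (all zeros if no match)
    cs_no_ans = 1.0 if cs_label is None else 0.0      # CS No_ans
    cs_log_prob = 0.0                                 # CS Log Prob (default value is an assumption)
    if cs_label is not None:
        cs_onehot[cs_label] = 1.0
        cs_log_prob = np.log(probs[cs_label])
    return np.concatenate([[log_prob, entropy, confidence, cs_no_ans, cs_log_prob],
                           cnn_onehot, cs_onehot])
```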

For the majority of examples in the data set (over 66%), both systems are correct, so choosing either system would be acceptable; similarly, in approximately 7.5% of examples, either choice would be wrong. Conversely, when both systems agree, they are correct

98% of the time. Therefore, we only train the chooser on the examples where the systems disagree, taking agreed-upon classes as the answer in every case. This does, however, raise the question of which system’s answer to use as the default to be overridden when the chooser deems it appropriate, and further, how the choice of that default affects the accuracy on unseen data. The main difference between the systems in this regard is that ChatScript provides a default “I don’t understand” response, in the event that it does not match any available patterns, asking the user to rephrase. We call this the No Answer response. On the other hand, the stacked CNN, as discussed in Section 3.1, is effectively only trained on positive classes. From a user’s perspective, No Answer is a system failure: the user asked a question and did not receive an appropriate response. We recognize, however, that sometimes receiving no answer is better than receiving an incorrect answer; such an answer could give the illusion of understanding while providing an unintended response to the actual question asked, potentially leading to further unanswerable questions, for example.
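The chooser itself then reduces to a binary logistic regression fit only on the disagreement cases, along the lines of the sketch below, which reuses the feature function sketched above; cs_preds, cnn_preds, cnn_scores, and gold_labels are hypothetical names.

```python
from sklearn.linear_model import LogisticRegression

# Keep only the examples on which ChatScript and the stacked CNN disagree.
disagree = [i for i, (cs, cnn) in enumerate(zip(cs_preds, cnn_preds)) if cs != cnn]

X = [chooser_features(cnn_scores[i], cnn_preds[i], cs_preds[i], num_classes) for i in disagree]
# Target: 1 if the stacked CNN's answer is correct (override the default), 0 otherwise.
y = [1 if cnn_preds[i] == gold_labels[i] else 0 for i in disagree]

chooser = LogisticRegression().fit(X, y)
```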

Table 4.3: Hybrid System results. ChatScript and stacked CNN baselines are repeated from above. “Conf” is using only the Confidence feature from Table 4.2; “base” is using Log Prob, Entropy, and Confidence. CS = ChatScript, CNN = stacked CNN.

System       Features  Default  Accuracy%  No Answer%
ChatScript                      80.93      10.88
stacked CNN                     78.01      0.00
Hybrid       conf      CS       83.58      6.56
Hybrid       base      CS       85.43      5.52
Hybrid       all       CS       89.40      3.23
Hybrid       all       CNN      89.97      0.00
Oracle                          92.45      4.16

For these reasons, we examine the effects of the two different default choices, including

the effect on frequency of No Answer responses. The performance of the hybrid system is

evaluated using ten-fold cross validation.

Results of the hybrid system, including overall response accuracy and the percentage

of No Answer responses, are summarized in Table 4.3. Note that, consistent with previous

tables, this is the question classification accuracy, not the accuracy of the decision of which

system to choose. We include the results of ablating some of the features used by the

chooser. Statistical measures from the stacked CNN alone serve to improve performance

over the ChatScript baseline fairly substantially, but the more detailed information about which classes were chosen by each model, and the extent of the two models’ agreement,

provide a dramatic boost in performance. Using all features while defaulting to ChatScript’s

response constitutes a 44% relative reduction in error from ChatScript alone, and recovers

almost 74% of the oracle performance.

Using the stacked CNN’s response as the default yields a slightly higher overall accuracy,

but since the stacked CNN never gives the No Answer response, the whole system never

does either, in this condition. This means that for the approximately 3% of answers that are

incorrect but have the relatively benign No Answer response with the ChatScript default,

most of them are still incorrect under the stacked CNN default, but potentially misleadingly

so. We assume this trade-off favors using the ChatScript default in further experiments.

As we will show in those experiments, it turns out to be very likely that some No Answer

responses are necessary, so their complete absence is probably undesirable.

4.2 Experiment 2: Prospective Replication

Having established the benefit of a hybrid data- and rule-driven approach on a static data

set, we now seek to probe how effective the hybrid system is in live conversations. In other words, we have seen retrospectively how, all else being equal, a hybrid system might have

improved the performance for a specific set of queries from a specific cohort of medical

students who were actually interacting with a different system. We recognize, however, that

all else would not be equal—better responses, or even just different ones, can affect the

course of a conversation, and that could affect the performance in ways that are difficult to

predict. Therefore, we deployed the hybrid system as part of a controlled experiment in which a new cohort of medical students were randomly assigned to interact with the hybrid

system or a system using only ChatScript. In this way, we can prospectively determine

the efficacy of the system. The production deployment also serves as a verification of the various usability factors that must be considered for modern human-computer interaction,

given the latency and computational demands of the data-driven model.

Even though live interaction introduces unknown variables, we expect that the core data set is representative of the distribution of questions that should be asked of our virtual patient; thus, this is essentially a replication experiment, and we hypothesize that the deployed system should corroborate our main findings from the first experiment. In particular, the hybrid system should be more accurate than ChatScript alone, and we should see comparable performance in terms of absolute accuracy. In the following sections, we describe the experimental design that we use to test our hypotheses, results of the experiment, and a discussion of those results. The architecture of the deployed system used for this experiment is described in detail in Section 3.2.

4.2.1 Experimental Design

Our experiment aims to determine if the performance improvement on the fixed data set translates to users interacting with a system dynamically. To do this, we prepared a simple blinded experiment, where, upon starting a conversation, students are randomly assigned to either use the hybrid system or a ChatScript-only version of the virtual patient, with equal probability. Use of the virtual patient application was completely voluntary, and presented to students as an opportunity to practice their interviewing skills prior to an exam with a standardized patient. Because of the voluntary participation, the population as a whole may reflect some self-selection bias, but the test and control groups should be affected equally, due to the random assignment at the beginning of each conversation.

After the participation period, conversation logs were annotated according to the procedure described in Section 3.1.1. Correctness of the stacked CNN’s response was determined by its match to the annotation of the correct ChatScript response (after collapse of semantically similar classes, see Section 3.1). This seems obvious and is reasonable, but we note

Table 4.4: Summary statistics of collected data.

                     CS-only  Hybrid  Overall
Conversations        74       91      165
Turns  Total         2,497    2,799   5,296
       Mean          33.7     30.7    32.1
       Median        32       30      31.0
       Std Dev       20.8     20.5    20.7

that it introduces some bias in the annotations, since in some cases multiple classes may

provide valid answers to the question asked. A very simple example is the question, “Are you single or married?” ChatScript interprets the question as, “Are you single?” while the

stacked CNN interprets it as, “Are you married?” The separate labels exist for the possi-

bility of producing a response that matches the polarity of the question, but for simplicity,

both correctly produce the response, “I am single.” In such cases, ChatScript’s acceptable

response would always be favored over the stacked CNN’s acceptable response.

4.2.2 Results and Discussion

Data were collected over the course of approximately two weeks, resulting in well over

5,000 individual conversation turns. Some users had multiple conversations with the patient.

After collection, conversations which were manually judged to have been cursory probes of

the system’s capability were removed, resulting in a total of 5,296 turns in 165 conversations, which constitute the enhanced data set. Lengths of individual conversations range from one¹⁴

to 103 turns. Summary statistics are provided in Table 4.4, in which the control group is

labeled as CS-only, the test group is labeled as Hybrid, and statistics for the whole enhanced

¹⁴ Some students quit conversations and later returned to ask single legitimate questions, which were kept in the data.

Table 4.5: Raw accuracies of control and test conditions, and sub-components of the hybrid system (CS = ChatScript, CNN = stacked CNN).

          Control           Test
          Acc%    NoAns%    Acc%    NoAns%
Oracle    73.45   11.69     85.21   5.64
CS        73.45   11.69     74.85   12.22
CNN       —       —         67.56   0.04
Hybrid    —       —         80.03   6.57

data set are provided in the Overall column. Although the Hybrid conversations were three

turns shorter on average, the difference in conversation length between the CS-only and

Hybrid data is not statistically significant (p = 0.38, Mann-Whitney U-test).
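This comparison corresponds to a standard two-sample test on the per-conversation turn counts, e.g. (list names hypothetical):

```python
from scipy.stats import mannwhitneyu

# Turn counts per conversation in each condition.
stat, p_value = mannwhitneyu(cs_only_lengths, hybrid_lengths, alternative="two-sided")
```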

Raw accuracy results are summarized in Table 4.5. Note that the accuracy figures include

some overlap (1.6 percent overall) with the percentage of No Answer responses, despite our

previous assumption that No Answer is always incorrect; this is an important point that we

focus on later. Otherwise, a few things are obvious from the summary results. First, and

most positively, the hybrid (test) system outperforms the ChatScript-only (control) system

significantly (p = 1.9×10⁻¹⁵, one-sided Pearson’s χ²). Second, the stacked CNN performs worse than the ChatScript component of the combined system, but still provides a fairly

large benefit when combined with ChatScript using the chooser. Of the approximately 10%

absolute accuracy improvement that the Oracle results indicate is possible, about half is

recovered by using the chooser. Third—disappointingly—in terms of absolute performance,

all systems seem to perform much worse than hoped for based on the results of the first

experiment. Finally, and somewhat more subtly, the ChatScript component of the combined

system outperforms the system relying only on ChatScript (barely reaching significance at

p = 0.049 in a one-sided Pearson’s χ² test). We provide further analysis of the collected

data below, which aims to explain these phenomena.

Perhaps the most obvious place to begin accounting for the difference in absolute

performance from the first experiment is in differences in the distribution of labels between

the core data set and the enhanced set. The enhanced data set is labeled with 77 ChatScript

patterns that never appeared in the core data set. These appear as the labels in 142 turns. For

the purpose of evaluating the performance of the stacked CNN, we map these into a single

unknown class, since it could never have produced these classes.

Most of the unseen classes originate from an unanticipated team synchronization issue.

As mentioned in Section 3.2, the ChatScript instance and stacked CNN deployment were

developed by separate teams for this experiment, and classes were added to the ChatScript

instance in an effort to improve performance, unbeknownst to the team deploying the stacked

CNN. From the perspective of maximizing performance of the stacked CNN, this would

appear at first blush to be an embarrassing source of what might be called engineering noise

in the model development process. However, the hybrid system answers these questions

correctly 60.7 percent of the time in the test condition (N = 61), with only one instance of

the chooser choosing the stacked CNN when ChatScript answered correctly. This is less

accurate than ChatScript’s average for the least frequent quintile of data, but certainly far

better than the zero percent maximum performance of the stacked CNN on these examples.

Thus, we count this as a serendipitous illustration of the benefit of a hybrid rule- and

learning-based approach to the natural language understanding component of a dialogue

system. It allows us to relatively easily handle completely new classes with a baseline-level

accuracy, without even having to retrain the machine learning components.

An additional source of unseen data was the Negative Symptoms class, which serves to

trigger a default response from the patient when asked about a body system that has no impact on

the patient’s present illness — e.g., “Does your ear itch?” — to which the patient should

respond with something like, “No, no problems with that.” As mentioned in Section 3.1,

these examples had been intentionally removed from the core data set, since members of

this class are not necessarily paraphrases, and a pairwise matching method would not be

expected to succeed. The core data set was developed with this approach in mind, and when

the multiclass classification approach was seen to yield better results, the omission of this

class was just incidentally not reconsidered. Bringing this oversight to light is certainly an

argument in favor of conducting the replication study.

Besides such completely unseen classes, the enhanced data set reflects some substantial

shifts in the frequency distribution of the seen classes. Of the remaining 359 classes in

the core data set, only 261 appear in the enhanced set. To give a sense of the differences,

Table 4.6 shows the ten largest changes in class frequency from the core data set to the

enhanced set.

The No Answer class is by far the biggest difference between the core and enhanced

data sets. As described in Section 4.1, this is the default class, which ChatScript produces when a question strays outside of the dialogue agent’s knowledge base. Similar to the

Negative Symptoms class, it is a negative class, defined by the absence of a match to a

positive class, rather than on any specific content of the query. While the core data set does

contain examples of the No Answer class, it is almost never chosen by the stacked CNN in

practice. Effectively, the only way for the hybrid system to produce a No Answer response

is for the chooser to choose ChatScript when ChatScript fails to match any other pattern.

In the enhanced data set, there are two main sources of No Answer annotations. The first

Table 4.6: Top changes in class frequency by magnitude of difference between core and enhanced data sets. Change is the count of examples of the class in the enhanced set minus the count in the core set.

Label                              Change
no answer                          441
any other problems                 184
what should i call you             -156
unknown                            142
tell me more about your back pain  125
describe the pain                  95
are you taking any medication      80
can you rate the pain              78
how much do you work               -76
are you having any other pain      -75

source is questions which the patient should be able to answer, but which had previously

not been considered, or had otherwise not been prioritized by the content author. We refer

to these types of queries as being currently out-of-scope. The second source is questions which are entirely outside the scope of the educational objective, e.g., “What would you

consider your biggest strengths?” We call these permanently out-of-scope, which generally

overlaps with what would be called out-of-domain in dialogue literature. Note that while we make a conceptual distinction between currently and permanently out-of-scope for the

purpose of discussion, our annotation scheme only identifies out-of-scope queries generally,

and assigns them the No Answer label. Unlike what had been generally assumed in the first

experiment, the No Answer response is the correct response in these situations, not just a

least-risk fallback strategy. But since the system was not trained to identify such situations,

this class accounts for a large volume of the errors. Addressing this issue is the motivation

for our work in Section 4.3, although it turns out to be a very difficult problem to handle

effectively.

The remaining differences in the label distribution between the data sets are harder to

explain definitively. The data in each set is collected from different cohorts of medical

students; as such, they may have received slightly different training, either from natural variation among instructors or from intentional changes in emphasis on particular topics

from one year to the next. The instructions given to students to introduce the software from

one year to the next were also not tightly controlled, so may have made certain queries

seem redundant. For example, it does not make much sense to ask a patient’s name if you

have instructions telling you their name. ChatScript pattern definitions can also change, which may introduce a bias for some labels, given the annotation methods. That is, if two

responses are appropriate to a particular query, and a change to a template suddenly makes

one occur more frequently than the other, the annotation methods will probably not flag that

response as incorrect, and thus the true label will be annotated as whatever the response was.

We can examine accuracy of the hybrid system as a function of label frequency by

quintiles, as we did above in the comparison of ChatScript and the stacked CNN; however,

given the distributional differences between experiments, this raises the question of which

label/quintile assignments to use. A comparison is presented in Figure 4.3. The first chart

shows the accuracy according to the same quintile assignments as used in Experiment 1,

and the second shows a full recalculation of the quintiles according to the data collected in

the enhanced data set. The first chart is broadly consistent with the previous experiment,

although it shows a big drop in performance in the fourth-most-frequent quintile. That the

trend is preserved is reassuring; the models generally exhibit consistent performance on the

same semantic content. The second chart, however, shows an alarming drop in performance

Figure 4.3: Accuracy of the hybrid system and constituent components by label quintiles. “Core dataset quintiles” have the same label membership as in the previous experiment; “raw enhanced dataset quintiles” are the quintiles according to the frequencies observed in Experiment 2. “Fair enhanced dataset quintiles” remove No Answer labels and other unseen labels from consideration.

on the most frequent labels in the enhanced data set. This is consistent with the drop in

performance in the fourth quintile of the core data set labels—the No Answer class is among

the most frequent in the enhanced data set, and is almost always incorrectly identified. In

the core data set, No Answer occurred in the fourth quintile, explaining the performance

drop for that data according to the core data set quintile assignments. If we omit the No

Answer class and other unseen classes from the analysis, but otherwise assign quintiles

according to label frequency in the enhanced data set, we see a performance profile generally

consistent with that shown in Figure 4.2 in the previous experiment, suggesting that the

major distributional differences between the data sets are fairly localized to the occurrence

of those few unseen classes.

Given the expected effects of the distributional differences discussed above, we here

consider the results when omitting certain examples, to facilitate a fairer comparison to

previous results. Primarily we want to see how the system performs when omitting unseen

Table 4.7: Adjusted accuracies of control and test conditions, and sub-components of the combined system.

               Unseen labels omitted                    Unseen and NoAns labels omitted
               Control (N = 2399)  Test (N = 2700)      Control (N = 2157)  Test (N = 2482)
               Acc%     NoAns%     Acc%     NoAns%      Acc%     NoAns%     Acc%     NoAns%
Oracle         74.70    10.38      85.63    5.15        81.08    9.55       91.42    3.87
ChatScript     74.70    10.38      74.89    11.96       81.08    9.55       79.73    11.28
Stacked CNN    —        —          70.04    0.04        —        —          76.15    0.00
Hybrid System  —        —          80.33    6.19        —        —          85.74    5.08

classes (including the Negative Symptoms class) from the result set, and see the magnitude

of the effect of the No Answer class, given its huge change in frequency. These results are

given in Table 4.7. As a point of clarification, note that omissions made for the sake of this

analysis are based on the correct labels for individual examples, not on the responses. While

the overall performance of the hybrid system under the reasoned adjustments does not reach

the same level as the prior experiment, the control experiment is much more consistent with

it, and it is very plausible to attribute the remainder of the difference in the stacked CNN’s

performance to the other distributional differences that manifested in the enhanced data set.

One curious aspect of the results in Table 4.7 is that when omitting the No Answer class,

ChatScript in the control condition outperforms ChatScript in the test condition, where it

had trailed when including all data. This mostly follows from the fact that the No Answer

class is a larger proportion of the CS-only (control) data set.

Another available measure of the quality of the systems is the number of times a user

has to rephrase a query to get a relevant answer. A natural response to a misunderstood

question, whether it manifests as an incoherent reply or a request to rephrase, is to try

again; anecdotally, we indeed find that users most often follow this pattern. Thus, it is

easy to estimate performance in this regard, by simply finding contiguous queries with

the same labels in the data set. Doing so, we find 249 repeated queries in the control

set of 2,497 turns, and 209 repeated queries in the test set of 2,799 turns. This turns

out to be a statistically significant reduction in repeated queries due to the hybrid system

(p = 1.9×10⁻⁶, Pearson’s χ²). This may explain some of the difference in raw performance

of the ChatScript subsystem between the control and test conditions.
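One plausible way to compute this estimate and the associated test is sketched below, assuming each conversation is available as a list of annotated class labels in turn order; the names and the exact form of the contingency table are assumptions.

```python
from scipy.stats import chi2_contingency

def count_repeats(conversations):
    """Count turns whose annotated label repeats the label of the immediately preceding turn."""
    return sum(sum(1 for prev, cur in zip(labels, labels[1:]) if prev == cur)
               for labels in conversations)

control_repeats = count_repeats(control_convs)   # 249 repeated queries in 2,497 turns
test_repeats = count_repeats(test_convs)         # 209 repeated queries in 2,799 turns

table = [[control_repeats, 2497 - control_repeats],
         [test_repeats,    2799 - test_repeats]]
chi2, p, dof, expected = chi2_contingency(table)
```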

4.3 Experiment 3: Out-of-Scope Improvements

The analysis of the previous experiment offers a number of promising avenues for

improving the system going forward. The most obvious deficit that it brought to light is that

out-of-scope questions are far more frequent than the core data set reflects. Furthermore, while we had assumed that a response of “I don’t understand that question” was always,

in some sense, incorrect, it is in fact an appropriate response from a back pain patient to a

query about, e.g., their greatest strengths. The data collected also provide an opportunity to

improve overall performance while examining the interplay between the components of the

hybrid system in the face of additional training data.

In analyzing the results with our domain expert, we discovered that there were subtle

but important distinctions in the types of questions currently unanswered by the system.

The easier distinction to draw is between in-domain and out-of-domain data, which is a well explored problem in dialogue systems; for the latter type of question a default “I don’t

understand” answer suffices. However, there is a more subtle set of questions that are

relevant to the medical domain, but currently not covered by the content in the system; we

deem these out-of-scope. Some types of questions were determined to be not germane to

the particular patient case, and were intentionally left to the default answer (permanently

Figure 4.4: A simple illustration of scope and domain boundaries as discussed in this section.

out-of-scope), whereas other questions were relevant to the case (currently out-of-scope). A simple illustration of the relationships between scope and domain boundaries can be seen in

Figure 4.4, with more discussion below.

The model changes that we propose in this section are mainly focused on improving performance for out-of-scope queries, i.e. the No Answer class, which we are able to do with limited success. The problem of accurately identifying out-of-scope queries is especially challenging for several reasons. For one, the distinction between in- and out-of-scope is sometimes fairly arbitrary in our case, having little to do with the inherent semantics of the question. For example, a patient’s marriage status can be quite relevant to clinical outcomes, and our Virtual Patient can accurately answer the question, “Are you married?” with the response, “I’m single.” However, the question, “Have you ever been married?”—implying the possibility of that single status being due to divorce, another clinically relevant social

status—was deemed by the medical content author to not be worth the development cost to

distinguish, due to subtle syntactic differences, difficult polarity issues, etc., that can arise

in various paraphrases of the two questions. Thus, the boundary between in- and out-of-

scope queries can be quite narrow and convoluted in our application domain. Furthermore,

the in-scope (meta-)class comprises a few hundred distinct query classes, making it fairly

heterogeneous, and thus harder to distinguish from out-of-scope data, particularly given the

relatively small size of the data set and the dearth of adequately similar out-of-scope data.

The related problem of distinguishing in- and out-of-domain data when only in-domain

data is available is well-studied in the literature, and known by several variants falling under

the banner of One-Class Classification. Some of this work is reviewed briefly in Chapter 2.

Text-based techniques utilizing unlabeled data augmentation might be appealing for our

use case, except for the aforementioned narrow, convoluted boundary between in- and

out-of-scope data. We expect that adequately characterizing that boundary would require a

large amount of “in-domain, but currently out-of-scope” data, and such data is hard to come

by. We attempted to augment the Virtual Patient data with several data sets to learn a model

of the out-of-scope boundary, but available data sets are too easy to distinguish from the

Virtual Patient data simply on the basis of obvious syntax and vocabulary features, so those

experiments did not work, and are not presented here. Techniques depending on a robust

characterization of the distribution of the positive class are similarly expected to perform

poorly, due to the heterogeneity and relatively small size of the positive data, also mentioned

above. For these reasons, we attempt to improve recognition of the No Answer class largely within our existing framework, by treating it as just another query class and augmenting our

data set with more representative examples. We recognize that more sophisticated one-class

76 classification techniques may further improve performance over what we show here, but we

generally consider this work the establishment of a baseline for further research.

Thus, we propose some simple changes to our existing system with the aim of improving

our handling of the No Answer class, which we can evaluate with the enhanced data set.

The division of the enhanced data set into the CS-only and hybrid portions leaves us with

a natural partition to use for testing, and we use the hybrid data set to evaluate our model

changes. This leaves us with the CS-only data set to explore the benefits of additional

training data in our hybrid system, which makes the training data more reflective of the

distribution of classes seen in the test set. We call the union of the CS-only and core data

sets the augmented data set.

Architecturally, we focus our attention on the chooser component of the system. We

reason that since the No Answer class is a negative class — defined only by the lack of

relevant content — that the stacked CNN is likely to fail to generalize to the full variability of

the class, absent some sophisticated training techniques. The chooser, on the other hand, has

access to statistics produced by the stacked CNN, and may more easily be able to recognize

the situation that the stacked CNN is not highly confident in any particular positive class.

Concretely, we extend the binary chooser to a three-way choice among the response produced by ChatScript, that of the stacked CNN, and the No Answer class, experimenting with two different labeling schemes.

We also explore additional features for input to the chooser.

4.3.1 Models

We implement two variations on a three-way chooser and evaluate their performance

trade-offs. The first is a fairly straightforward extension of the original model to a multiclass

setup, while the second employs a multilabel setup. The multilabel model was developed in

recognition of the fact that sometimes multiple choices are equally acceptable, since either

the input models or the chooser can produce an appropriate No Answer response. We use

the same logistic regression model from SciKit-learn (Pedregosa et al., 2011) as used in

the previous experiment, using the built-in one-vs-rest classification scheme to handle the

three-way multiclass classification, and using a binary relevance strategy (Boutell et al.,

2004) in the multilabel case, via the OneVsRestClassifier class. We use LIBLINEAR (Fan et al., 2008) as the solver. Details of training and model evaluation are described below.
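In scikit-learn terms, the two variants can be instantiated roughly as follows; hyperparameters beyond the solver are left at their defaults here, which is an assumption.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Three-way exclusive (multiclass) chooser: built-in one-vs-rest over the three choices.
exclusive_chooser = OneVsRestClassifier(LogisticRegression(solver="liblinear"))

# Three-way multilabel chooser: binary relevance, i.e. an independent binary classifier
# per choice; OneVsRestClassifier provides this when fit with a 0/1 indicator matrix
# of shape (n_examples, 3).
multilabel_chooser = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
```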

We experiment with the addition of new input features and particular combinations of

them. In general, the new features attempt to infer additional information about ChatScript’s

“opinion” of the stacked CNN’s output. As discussed in Section 4.1, ChatScript does not

output a probability distribution for its possible responses, so any kind of measurement of

the confidence ChatScript has in its response, or possible alternative responses, must be

inferred in some way. ChatScript decides its output by searching over possible responses,

and returning the first one that matches. The search order is determined implicitly, by

ChatScript’s search priorities and the author’s organization of the patterns. For example, a

pattern within the current topic, as the author defines the topic, will always match before

a pattern in another topic. Each pattern is essentially a regular expression that may match

an input sentence; we call a set of patterns that all produce the same response a class,¹⁵

consistent with the definition of a class learned by the stacked CNN, after the collapse of

certain semantically similar classes; and a topic is an author-defined collection of notionally

related templates.

¹⁵ ChatScript documentation refers to this as a template.

One source of additional information that can be made available from ChatScript is a

ranked list of all of the patterns that would have matched the input, if the search had not

stopped after the first hit. This includes multiple patterns within the same class. Since

patterns and classes are manually authored, we reason that a class with more patterns should

usually be more “mature,” in that more time has been spent developing it. In other words,

a larger number of edge cases in the language that can represent the same semantics have

been identified and encoded as patterns. Accordingly, we expect ChatScript to be more

confident about a match on a pattern within a mature class, and even more confident when

there are multiple matches to the same class. Finally, due to ChatScript’s inbuilt heuristic

priorities, we expect higher ranked matches to correspond to higher confidence. Note that

the No Answer class is always the last item in the list.

Given all of the above, we assign a score to each class that is just the normalized sum of the inverse ranks of pattern matches for that class. That is, we assign a score s_j such that, for the 1-indexed sequence of patterns matching the input M = (p_i) containing the set of classes {c_j}, and the set P_j = {i : C(p_i) = c_j}, where C(p_i) is the class that pattern p_i belongs to:

\[
s_j = \frac{\sum_{i \in P_j} \frac{1}{i}}{\sum_{i=1}^{|M|} \frac{1}{i}} \tag{4.2}
\]

For notational convenience we denote the score for the template chosen by ChatScript as s_0.

Given Equation 4.2, the new features are summarized in Table 4.8. Additional features were explored, but did not show a benefit during development.
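Equation 4.2 and the two derived features of Table 4.8 can be computed directly from ChatScript's ranked match list, as in the following sketch; the list format and class names are hypothetical.

```python
from collections import defaultdict

def class_scores(matched_classes):
    """matched_classes: the class of each matched pattern, in ChatScript's rank order,
    e.g. ["rate_pain", "rate_pain", "describe_pain", "no_answer"] (No Answer is last)."""
    total = sum(1.0 / rank for rank in range(1, len(matched_classes) + 1))
    scores = defaultdict(float)
    for rank, cls in enumerate(matched_classes, start=1):
        scores[cls] += (1.0 / rank) / total     # Equation 4.2
    return dict(scores)

def ranked_list_features(matched_classes, cnn_class):
    scores = class_scores(matched_classes)
    s0 = scores[matched_classes[0]]             # score of ChatScript's chosen (first-matched) class
    cs_ratio = s0 / sum(scores.values())        # CS ratio
    cnn_agreement = scores.get(cnn_class, 0.0)  # CNN agreement
    return cs_ratio, cnn_agreement
```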

4.3.2 Training and Evaluation

While the modifications of the chooser models themselves are straightforward uses

of off-the-shelf software, specific aspects of the training of the models merit detailed

enumeration.

Table 4.8: New features used by the Three-way chooser.

Feature        Description
CS ratio       The ratio of the score of the chosen response to the sum of all
               scores, i.e. s_0 / (∑_j s_j).
CNN agreement  The score s_j corresponding to the class chosen by the stacked CNN,
               if it is present in the list of matches. Otherwise, zero.

Besides augmenting the training data with the extra examples from the CS-only data

set, we add two labels, one each for Unknown classes and Negative Symptoms. These are

largely to facilitate evaluation of the features which measure agreement between the stacked

CNN and ChatScript when ChatScript produces these responses, but the added examples

include items belonging to the new classes. The Negative Symptoms class, while being

incompatible with the early paraphrase identification approach, could reasonably be handled

by the direct classification approach, due to similarities in questions relating generically to

medical conditions and anatomy; treating all unseen classes as a single class for purposes

of training the stacked CNN, however, is more problematic. Ideally, the chooser should

learn to defer to ChatScript when it produces a class unseen by the stacked CNN, but this is

a hypothesis to be verified, since the behavior of the stacked CNN when trained on these

examples is less predictable.

We retrain the stacked CNN on the augmented data using the same tenfold cross-validation scheme, with new splits to accommodate the extra data. The chooser's training input features are calculated from the predictions collected during this cross validation. Chooser inputs for the test set are calculated from a single stacked CNN model

trained on the entire augmented training set, where the test set of course was unseen during

the retraining. We encode the labels for training the chooser using different conventions,

depending on which three-way variant we are training.

In the case of the exclusive configuration, which refines the earlier approach, each

example is labeled as exactly one of ChatScript, stacked CNN, or No Answer being the

correct choice. All instances of the No Answer class label are labeled with the No Answer

choice for the chooser, regardless of whether or not either ChatScript or the stacked CNN

correctly identified it. All examples in which none of the available choices would be correct

are labeled as ChatScript being the correct choice, to take the greatest advantage of the hybrid

system’s ability to handle classes unseen by the stacked CNN.

In the case of the multilabel three-way variant, every label is a binary vector of length

three, with the separate dimensions indicating independently which of the three choices

provides an acceptable answer. Thus, if ChatScript provided an accurate No Answer

response, then the dimensions for both ChatScript and No Answer would be set to one. In

contrast to the exclusive setup, if no choice is correct, we leave the example unlabeled (the

zero vector), based on development performance. To evaluate the multilabel setup at test

time, we predict probabilities for each choice and take the max.
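The two labeling conventions might be encoded as in the sketch below, where CS, CNN, and NOANS index the three choices and no_answer_label is the class index of the No Answer class; the helper names are hypothetical, and the final line reuses the multilabel chooser set up earlier.

```python
import numpy as np

CS, CNN, NOANS = 0, 1, 2

def exclusive_label(gold, cnn_pred, no_answer_label):
    # Exclusive scheme: exactly one choice per example.
    if gold == no_answer_label:
        return NOANS          # regardless of whether either system found it
    if cnn_pred == gold:
        return CNN
    return CS                 # includes examples where no available choice is correct

def multilabel_target(gold, cs_pred, cnn_pred, no_answer_label):
    # Multilabel scheme: mark every acceptable choice; all zeros if nothing is correct.
    target = np.zeros(3)
    if cs_pred == gold:
        target[CS] = 1.0
    if cnn_pred == gold:
        target[CNN] = 1.0
    if gold == no_answer_label:
        target[NOANS] = 1.0
    return target

# Test-time decision for the multilabel chooser: most probable of the three choices.
choices = multilabel_chooser.predict_proba(X_test).argmax(axis=1)
```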

As with the baseline binary chooser, we only train on examples where the component

systems disagree, always taking agreed-upon classes where they exist. We use tenfold cross validation on the augmented data set to measure development performance to find the best

configuration of each three-way variant for testing. Baseline development performance is

taken as the weighted average of both a tenfold cross validation result of the two-way model

trained on the core data set, and the performance of a single model trained on all of the

core data set and evaluated on the CS-only data set. In other words, baseline development

performance is a cross validation result on the entire augmented training set, but where training folds comprise only the core data set.

We test on the entirety of the hybrid data set. As our primary evaluation metric we report class accuracy. Note that this is not the accuracy of the three-way choice, but of the class returned by the chosen system. Since our model adjustments are directed at improving the performance of the No Answer class, we also report the percentage of No Answer responses—irrespective of which system produced them, and whether or not they were chosen correctly—as well as precision and recall on the No Answer class.

4.3.3 Baseline

A brief discussion of the baseline used for evaluation of the test set is warranted. In theory, this should be the same accuracy number as reported in Section 4.2.2, i.e. 80.03 percent accuracy in the unadjusted case. However, for the same reasons our attempted reproduction in Section 4.1 was slightly worse than the original result (Jin et al., 2017), our baseline in the present experiment is slightly lower than what was annotated in the live replication experiment. Between the bug fix, data corrections, and a change in splits resulting from a different random shuffle—see Gorman and Bedrick (2019) for an excellent discussion of the impact of splits on performance—we consider this a justifiable drop in performance without invalidating any conclusions. We further confirm that the changes result in different behavior of the chooser with a simple comparison of the frequency of the choices: the chooser in the live replication experiment chose the stacked CNN in 15.5 percent of turns in the test condition, while the baseline chooser for the present experiment chooses the stacked CNN in 10.8 percent of turns, using the same input sentences.

Table 4.9: Development accuracy with additional features

System            Data set   Base feats  +agree  +agree+ratio
2-way baseline    core       84.41       —       —
2-way baseline    augmented  84.85       84.73   84.72
3-way exclusive   augmented  84.92       84.98   84.99
3-way multilabel  augmented  85.70       85.83   85.83

Table 4.10: Effects of retraining on test performance

         core   aug.
CS Acc   74.85  74.85
CNN Acc  67.27  73.17
Oracle   84.82  86.89

4.3.4 Results

Development results are shown in Table 4.9, with the best performing features for each system (row) highlighted in bold font. Note that the absolute accuracy numbers are not directly comparable to test results due to different label distributions. In general, the additional features add very little to the performance, but we test the features that give the highest accuracy in development. The CS Ratio feature does not increase accuracy for the three-way multilabel system, but it does slightly increase F1 score on the No Answer class relative to the CNN Agreement feature alone (0.164 vs. 0.153), so we take that configuration as the best for running test results.

Retraining the stacked CNN on the augmented data set unsurprisingly improves its performance on the test set, although the resulting boost in oracle performance does not

Table 4.11: Test results for three-way choosers with retraining

                                                       No Answer
System                   Training data  Accuracy  Percent  Precision  Recall
2-way baseline           core           78.78     5.11     .210       .133
2-way baseline           augmented      79.64     7.34     .199       .183
3-way exclusive +feats   augmented      79.92     7.04     .198       .174
3-way multilabel +feats  augmented      81.10     4.14     .328       .174

match (see Table 4.10), implying that much of the improvement overlaps with examples

that ChatScript was already answering correctly. In absolute terms, the stacked CNN perfor-

mance under retraining is much closer to ChatScript as well, which is more consistent with

the results from Section 4.1. This supports the claim that basic differences in the distribution

of class labels were a source of the unexpected performance discrepancy observed in the

live replication experiment.

Test results for the two three-way systems that were identified as having the best

development performance are shown in Table 4.11. We also include the best-performing

binary chooser when using the augmented training data, to isolate the effects of the training

data relative to the baseline and the model enhancements. The retraining alone leads to a

modest improvement in overall accuracy, with a small boost in recall on the No Answer class

and an even smaller drop in precision. The model changes aimed at improving performance

on the No Answer class introduce some trade-offs. The exclusive setup improves over

the baseline on recall, being generally more likely to produce the No Answer response, while the multilabel setup favors precision without improving recall over the exclusive

setup. Both three-way systems exhibit a significant boost in accuracy over the two-way

baseline, particularly the multilabel setup (exclusive: p = 0.009; multilabel: p = 4.6×10⁻⁶, McNemar’s test), and given the No Answer recall numbers, in both cases this is due to

increased performance on the positive classes. We show in the next section that the large

increase in the multilabel system is due to the presence of the unlabeled examples in the

case that no correct choice exists; forcing these examples to belong to a class during training

serves mostly to confuse the true boundaries of that class.

We offer analyses to gain further insight into all of these results in the next section.

4.3.5 Discussion

Despite significant improvements in accuracy using the multilabel chooser, the improve-

ments on the No Answer class can only be described as modest, at best. Some analysis

illuminates the issue a bit better.

As we supposed above, the distinction between in- and out-of-scope proves to be hard

to discover, and we can characterize the problem more fully with quantitative techniques.

The most intuitive illustration comes from a t-SNE (Maaten and Hinton, 2008) plot of the

training data. T-SNE is a stochastic technique that maps high-dimensional data points into a low-dimensional space suitable for visualization, while attempting to preserve distances between points.

Such a plot of the chooser inputs with correct choices labeled can give a good intuition for

the separability of the data, with the understanding that it presents a distorted visualization of

the data space. Figure 4.5 shows the result of applying t-SNE to those points in the chooser

training data where ChatScript and the stacked CNN do not agree. The figure illustrates why

the multilabel approach yields an accuracy improvement, but the result is a bit grim for the

out-of-scope question.
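The visualization itself can be produced with a standard implementation, e.g. as below; the array names and the t-SNE hyperparameters are assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# X_disagree: chooser input features for the training examples where the two systems
# disagree; choice_labels: NumPy array naming the correct choice for each point.
embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_disagree)

for name in ("ChatScript", "CNN", "No Answer", "none correct"):
    mask = choice_labels == name
    plt.scatter(embedded[mask, 0], embedded[mask, 1], s=5, label=name)
plt.legend()
plt.show()
```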

The plot makes it apparent that the No Answer labels are thoroughly enmeshed through-

out the data space, which confirms our intuition that the boundary around these points would

85 Figure 4.5: t-SNE plot of Multilabel chooser training data

be convoluted and narrow. In fact, it is hard to claim that any such boundary really exists in

this representation space. Further, there are several points where the stacked CNN correctly

identifies a No Answer response, but these are often buried in a region that is dominated

by ChatScript choices, making it hard for a chooser model to take advantage of the CNN’s

right answer or the chooser’s ability to override with a No Answer response. Overall, this

goes a long way toward explaining the limited improvements on the No Answer class.

The separation between CNN and ChatScript choices is much more encouraging, how-

ever. This clearly illustrates why the chooser provides such a significant performance boost;

large regions full of high-purity clusters of CNN choices are easy to separate from the

ChatScript choices. We can also clearly see why not labeling points that have no correct

choice (i.e., the labeling scheme in the multilabel scenario) provides such a big benefit

for the overall accuracy. Many of these unlabeled points (gray dots in the figure) cluster

together with the otherwise very pure cluster of CNN data points. If these are treated as

ChatScript labels, this easily separable region of the space becomes much more confusing

for the classifier, which in turn damages the model’s generalization to the test set.

While the multilabel scheme creates higher accuracy by allowing the chooser to trust

the stacked CNN more in the region where it is accurate, the inevitable outcome is that the

system will choose the CNN on many of the occasions when neither subsystem is correct.

This revisits the trade-off that was introduced in Section 4.1: a boost in accuracy also comes with more incorrect replies, instead of the presumably lower risk of a No Answer response.

To quantify the trade-off, consider two types of errors: major errors, where the system

replies to a question that was not asked, but which the user can potentially interpret as the

answer to the question they did ask; and minor errors, in which the system should have

answered the question, but instead replied with, “I don’t understand,” implying that the user

should try again. While the three-way exclusive setup has the higher total error rate at 20.1

percent, its major errors are 14.4 percent compared to 16.1 percent in the multilabel case, which has the lower overall error rate at 18.9 percent. Accordingly, the minor errors are

more than doubled in the exclusive case relative to the multilabel case.
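Concretely, these figures imply minor-error rates of roughly 20.1 − 14.4 ≈ 5.7 percent for the exclusive setup versus 18.9 − 16.1 ≈ 2.8 percent for the multilabel setup.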

Exactly how bad these errors are—especially with respect to the unlabeled examples that

effectively end up as a CNN choice in the multilabel setup—is a more subjective question

that depends on how the CNN actually interpreted the input. In one of the subjectively worst

cases, a user asks, “Okay. So back pain and urinating more frequently. Is that all?” This

should be interpreted as the class, “Any other problems?” ChatScript incorrectly provides

the No Answer response, and the retrained stacked CNN incorrectly interprets the question

as, “Have you noticed pain while urinating?” The programmed reply to this question is, “I

haven’t had any pain like that,” which could be construed as rejecting the back pain issue

among the user’s summary of topics to discuss. In this case, the No Answer response is the

preferable option. A more innocuous, and much more typical, example is the input, “So

it seems to get worse with prolonged use?” According to our annotation, this should be

interpreted as, “Does the pain improve with exercise?” ChatScript again provides the No

Answer response, while the CNN interprets it as, “Is the pain improving?” The multilabel

chooser picks the CNN, and the reply then comes as, “I don’t know if it is getting worse

or not. It pretty much just hurts all the time.” The reply is mostly relevant, but it is also

not adequately responsive to the question. This lack of coherence then is a sufficient cue to

the user that they should try language more focused on the physical activity component of

their question. Other similarly benign confusions include, “How much ibuprofen do you

take?” for, “How much ibuprofen have you taken?” and, “When did the pain start?” for,

“Did anything happen to cause the pain?” Looking at the erroneous differences between the

exclusive and multilabel setups, our subjective impression is that the multilabel system is

more helpful than harmful, suggesting that the so-called major errors quantified above are

not usually catastrophic.

A majority of the No Answer responses in the multilabel condition (65 percent) come

from agreement between the stacked CNN and ChatScript, although of these, 68 percent are

incorrect. The chooser overrides both subsystems to provide 18 percent of the No Answer

responses, with the remainder all coming from choosing the stacked CNN’s response. A

No Answer response from ChatScript is never chosen. The No Answer responses from

the stacked CNN are the most accurate, at 55 percent. The live subjects were surprisingly

focused on questions about the patient’s employment, to a level of detail that was not

clinically relevant. Accordingly, many such questions were out of scope, and trends in the

chooser’s No Answer responses reflect this. Two of the three correct No Answer responses

that came from the chooser override were about back pain symptoms at work, but more

often, the chooser returned No Answer for valid employment questions, such as, “Are you working?” The majority of the benefit to performance on the No Answer class, then, seems

to come from an increased opportunity to trust the stacked CNN’s determination of a No

Answer response, along with the stacked CNN’s improved ability to detect out-of-scope

questions through extra training data.

Finally, we note that even after retraining the stacked CNN with over 50 percent more

data, the hybrid system still outperforms both the rule-based and data-driven components

individually, reconfirming the benefit of combining both approaches in our relatively data-

scarce domain. Figure 4.6 shows another breakdown of model test performance by label

frequency quintiles, this time using quintile assignments derived from the label frequency

of the augmented data set used for training the multilabel model. Results from the binary

89 Figure 4.6: Quintile accuracies for multilabel components, with baselines

chooser baseline replication are included for comparison. Again, the general trend of

the stacked CNN outperforming ChatScript on high-frequency labels persists, and for the

most part this translates to higher performance of the hybrid system than either of the

subcomponents. Notably, though, the hybrid system underperforms the stacked CNN in the

most frequent quintile. The large jump in the accuracy of the stacked CNN over its baseline

in this quintile is mostly driven by a boost in accuracy on the No Answer class (about 36

percent vs. less than two percent), which remains difficult for the chooser to distinguish, as

discussed at length above. We believe that the difficulty of the out-of-scope question largely explains this divergence from the otherwise consistent trend of the hybrid

system outperforming either component, but acknowledge the possibility that the results

may reflect a threshold of training data volume beyond which the stacked CNN is simply

more effective on its own than in combination with ChatScript. Nonetheless, the benefit at

lower frequencies remains very clear.

4.4 Conclusion

This chapter highlights the importance of replicating model development experiments with live subjects, especially in the context that a model has been developed with the

goal of building an interactive tool. Our early work with paraphrase identification biased

our subsequent approach to question classification, and while the assumptions made were

reasonable in their isolated context, the replication presented here forced us to recognize

important issues for live deployment that those assumptions obscured. In particular, the

prevalence of out-of-scope queries was far greater than expected based on our early efforts.

The work done here to address the issue certainly leaves room for improvement, but our

analysis also demonstrates why this is a very difficult problem. The lesson here is not just to

eliminate some particular source of bias in the development of a given dataset; in fact, we

believe such biases are inevitable results of the process of developing and refining research

questions. Rather, we emphasize the importance of validating previous results with live subjects.

The most important positive outcome of the experiments presented here was reinforcing

the benefit of the hybrid approach to natural language understanding in our data-limited

domain. The power of the hybrid approach repeatedly proves to be the complementary

performance of the rule-based and data-driven components by label frequency, even to the

point that labels may be entirely unseen by the data-driven model. Having a baseline level of

performance on unseen classes offers content authors highly desirable flexibility in adding

new classes in the deployed system. A further positive result of our experiments within

the hybrid framework was the insight not to force a choice during training if both rules

and data fail to find a correct answer. Even though it was born out of an analysis aimed at

understanding the difficulty of the out-of-scope problem, shedding light on the shape of the

data in the chooser’s input space resulted in a more informed approach to the hybrid system.

The analysis presented here suggests that the out-of-scope distinction is something of a

separate task from the question classification task we began with, and is somewhat different

than the typical out-of-domain task that dialog systems face. This may suggest either making

the determination prior to the question classification, or in parallel with it. Corresponding

architectural adjustments may involve joint tasks for the stacked CNN or end-to-end neural

models that implicitly incorporate the chooser by conditioning the stacked CNN’s response

on ChatScript’s output. We leave these explorations for future work.

While the complementarity of the rule-based and data-driven models leads to a hybrid

system with reasonably good results, it does raise the question of why the stacked CNN

performs comparatively poorly on infrequent classes, and what could be done to better leverage the available data to improve performance on those classes. The next chapter aims to address this question, noting in particular that many infrequent classes do have linguistic commonalities with more frequent classes that we can try to exploit.

Chapter 5: How Self-Attention Improves Rare Class Performance

The previous chapter repeatedly emphasized the disparate performance of the ChatScript

system and the stacked CNN model with respect to class frequency, and how combining

the two yielded substantial improvements over either individually. Analysis gave insight

about the most effective ways to combine the two models, but reasons for the disappointing

performance of the CNN models on rare classes were not considered in much detail beyond vague notions of data scarcity. In this chapter, we seek to maximize the performance of a

single learned model on rare classes, by leveraging the linguistic structure in more common

classes.

Borrowing intuition from slot-filling paradigms for natural language understanding

tasks, we hypothesized that a classification model that was forced to rely on a collection of

independent underpowered models (by analogy: slots) would encourage those models to

specialize on components of the sentence meanings (fillers) that generalized across classes, while cooperating to make sure that the specialties were diverse enough to ensure adequate

separability of the classes when combined. Such a model could infer the latent structure in

the semantic space based on the top level classifications as well as the observable linguistic

structures of the input, then leverage that structure to improve performance on rare classes when they shared components with frequent classes. This would only be possible because of

the large number of classes in the Virtual Patient domain that often have much in common with each other.

We adapted a simple multi-headed attention model (Lin et al., 2017), with the intuition

that each attention head would act as the analog of a slot. Our initial hypothesis was that

limiting the modeling capacity of the attention heads would encourage each one to focus on

the components that best generalized across classes. That may have been the case, but the

constrained attention heads only suffered in overall performance compared to unconstrained

models, as described later in this chapter. The constrained heads did prove extremely useful

for rapidly visualizing the model’s behavior, and that strongly shaped our analysis.

As comparison, we also tested the powerful BERT model (Devlin et al., 2019) for the

Virtual Patient task. Surprisingly, we found that it underperformed the simpler CNN baseline,

particularly on rare classes, which we largely attribute to insufficient training data for the

classifier to learn to accommodate the high degree of freedom in the BERT representations.

These experiments and analysis comprise the main contributions of this chapter.16

5.1 Introduction

Many semantic classification tasks, of which the Virtual Patient is just one, have seen

a huge boost in performance in recent years (Wang et al., 2019, 2018), thanks to the

power of contextualized language models such as BERT (Devlin et al., 2019), which uses a

Transformer (Vaswani et al., 2017) architecture to produce context-specific word embeddings

for use in downstream classification tasks. These large, data-hungry models are not always well suited to tasks that have a large number of classes or relatively small data sets (Mahabal

16 A condensed form of this chapter was accepted as a short paper to SIGDIAL 2020.

et al., 2019). As thoroughly discussed in previous chapters, the Virtual Patient corpus has

both of these inauspicious properties.

Many of the classes in this task are distinguished in subtle ways, e.g., in degree of

specificity (“Are you married?” vs. “Are you in a relationship?”) or temporal aspect (“Do you [currently] have any medical conditions?” vs. “Have you ever had a serious illness?”).

As discussed in Section 3.1, a few classes are very frequent, but many appear only once

in the data set, with almost three quarters of the classes comprising only 20 percent of the

examples (Jin et al., 2017). Nonetheless, these rare labels are no less important to answer

accurately.

In this chapter, we seek to improve upon the rare class performance of the Text CNN-

based system described in the previous chapter. That approach naïvely treats all classes as

orthogonal, so the semantic similarity of the classes above can be problematic. Ideally, a

model should be able to learn the semantic contributions of common linguistic substructures

from frequent classes, and use that knowledge to improve performance when those structures

appear in infrequent classes.

We hypothesize that multi-headed attention mechanisms may help with this kind of

generalization, because each head is free to specialize, but should be encouraged to do so

cooperatively to maximize performance. Three different methods of utilizing BERT-based

architectures for this task surprisingly did not improve upon the performance of the CNN

models of Jin et al. (2017). In contrast, a very simple RNN equipped with a multi-headed

self-attention mechanism improves performance substantially, especially on rare classes.

We assess the reasons for this using several techniques, chiefly, visualization of severely

constrained intermediate representations from within the network, agglomerative clustering

of full representations, and manipulation of the behavior of the attention module by imposing

additional constraints. We find evidence that independent attention heads: 1) represent the

same concepts similarly when they appear in different classes; 2) learn complementary

information; and 3) may learn to attend to the same word for different reasons. This last

behavior leads to discovery of idiomatic meanings of some words within our domain.

5.2 Related Work

Self-attention, in which a model examines some hidden representation to determine which portions of that representation should be passed along for further processing, became

prominent relatively recently (Lin et al., 2017; Vaswani et al., 2017). These models have

been very successful for some tasks (Wang et al., 2019), but other approaches may work

better for classification tasks with many classes and few examples (Mahabal et al., 2019).

We explore two types of self-attentive models for the virtual patient dialogue task (Danforth

et al., 2013; Jaffe et al., 2015), which has many classes and scarce data. Previous authors

have used memory networks (Weston et al., 2015) to improve performance on rare classes

for this task (Jin et al., 2018).

Despite the contrast presented above, our self-attentive model may share characteristics with the work by Mahabal et al. (2019), as we find that representations of some word tokens

reflect parallel meanings. Mahabal et al. (2019) use sparse vectors to represent words as the

set of lexical syntactic contexts in which they appear (Mahabal et al., 2018), an approach that

they call Category Builder. These representations are intended to simultaneously capture

all of the meanings of a word type, and then use feature selection based on the context of

a given token to perform advanced tasks like analogies or classification. This stands in

contrast to the contextualized language modeling approach of models like BERT, which

use complex architectures to examine a word’s context to produce disambiguated dense

representations. In a way, the sparse Category Builder representations defer decision-making

about the meanings of words, while BERT-like models use extensive pretraining to learn

how to decide which word sense to send downstream. We find that separate attention heads

in our self-attentive RNN sometimes simultaneously reflect distinct senses of the same token

in a single sentence, so it may be learning to produce representations that are similar to

Category Builder in that way. However, it is hard to draw a clear contrast and say that similar

phenomena do not occur internally to BERT, due to its complexity.

The Category Builder (CB) representations are more effective than BERT’s representa-

tions in low-data situations, with BERT requiring a minimum of several dozen examples of

each class to learn to classify more than two classes accurately (Mahabal et al., 2019). The

authors attribute the difference in performance between CB and BERT to the inevitably low

lexical overlap between training and test sets when training data is scarce. The CB represen-

tations allow the classifier to learn which lexical contexts are important for performance, and

then any word that can appear in such contexts can evoke the correct class, even if the word was unseen in training. BERT, on the other hand, likely needs to see multiple examples of

lexical contexts to learn which semantic distinctions are important for the specific task—for

example, a BERT model may very well have knowledge that many sodas are caffeinated,

but without a variety of examples of caffeinated beverages, it may not know to emphasize

that aspect of soda to correctly identify a question about caffeine consumption. Low lexical

overlap is the likely culprit for our difficulties with rare classes, but we attribute much of the

benefit of the Self-attentive RNN to the unique structure of our dataset, which allows for

transfer of knowledge about useful contextual patterns learned from frequent classes to rare

classes.

We present a detailed analysis of our model’s behavior using clustering and visualization techniques; this bears a resemblance to the analysis by Tenney et al. (2019), although they use internal representations to make predictions for linguistic probing tasks, rather than directly examining correlations between representations and individual input tokens. Some authors have provided compelling analyses of the behavior of attention heads in BERT models (Clark et al., 2019; Voita et al., 2019), demonstrating redundancy and specialization in the heads, but also by focusing primarily on external behavior (e.g. attended tokens, effects of ablation on performance) instead of internal representations.

5.3 Task and Data

The task is the same question identification task as defined previously, but this experiment makes use of the strict relabeling of the core and enhanced data sets. We use the core set plus the CS-only subset of the enhanced data as training, and hold out the hybrid data set as a test set (see Section 3.1). We perform tenfold cross-validation on the training set for development, following the training procedures described in Section 4.1.2, in particular, augmenting each training fold with the canonical label sentences to make sure that no class is unseen at test time. Again, the test set only contains 268 classes, but fifteen are unseen in the training data (other than the canonical question). Note that the relabeling means it is not possible to directly compare the results in this chapter with the results in the previous chapter. We compare against baseline results for the relabeled data with matched folds. The baseline in this chapter outperforms the previous work due to operating on cleaner data.

5.4 Experimental Design and Results

We start from a Text-CNN baseline for this task (Jin et al., 2017), utilizing a single

stream system (i.e. without ensembling) for comparisons. This system convolves GloVe word embeddings with 300 filters each of widths 3, 4, and 5; the max of each filter over the

sequence serves as input to a fully connected leaky ReLU layer (Nair and Hinton, 2010),

followed by a softmax layer.
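To make the baseline concrete, the following is a minimal PyTorch sketch of a single-stream Text-CNN of this shape. It is an illustrative re-implementation rather than the exact code of Jin et al. (2017); in particular, the size of the fully connected layer is an assumption, since it is not specified above.

```python
# Minimal sketch of the single-stream Text-CNN baseline (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, num_classes, embed_dim=300,
                 filter_widths=(3, 4, 5), num_filters=300):
        super().__init__()
        # One 1-D convolution per filter width, over pretrained GloVe embeddings.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, kernel_size=w) for w in filter_widths
        )
        feat = num_filters * len(filter_widths)
        # Hidden layer size equal to the pooled feature size is an assumption.
        self.hidden = nn.Linear(feat, feat)
        self.out = nn.Linear(feat, num_classes)

    def forward(self, embeddings):
        # embeddings: (batch, seq_len, embed_dim), already looked up from GloVe
        x = embeddings.transpose(1, 2)                              # (batch, embed_dim, seq_len)
        pooled = [conv(x).max(dim=2).values for conv in self.convs] # max over the sequence
        features = torch.cat(pooled, dim=1)
        hidden = F.leaky_relu(self.hidden(features))
        return self.out(hidden)                                     # softmax applied in the loss
```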

We compare this against two contextual models: the relatively well known Fine-tuned

BERT (Devlin et al., 2019) model, as well as a variant of a simpler RNN model with

self-attention (Lin et al., 2017).17

We follow the recommended procedure for fine-tuning BERT to our task. We used the

uncased base pretrained BERT model as input to a dense layer followed by a softmax for

classification. All parameters were tuned jointly. The hyperparameters selected by grid search were a max sequence length of 16, a batch size of 2, 10 training epochs, and a learning rate

of 2e-5.
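A minimal sketch of this fine-tuning setup is shown below, written against the HuggingFace transformers API; the library calls may differ from the codebase actually used in our experiments, so only the overall structure (pretrained uncased base BERT, a dense layer, and a softmax over classes, all tuned jointly) follows the description above.

```python
# Sketch of the BERT fine-tuning classifier (library choice is an assumption).
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class BertClassifier(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Pooled [CLS] representation feeds the dense layer; softmax lives in the loss.
        return self.classifier(outputs.pooler_output)

# Example encoding at the tuned max sequence length of 16:
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
batch = tokenizer(["do you drink alcohol"], padding="max_length",
                  truncation=True, max_length=16, return_tensors="pt")
```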

Figure 5.1 illustrates our RNN model with self-attention. It is a single-layer BiGRU

(Cho et al., 2014) equipped with a two-layer perceptron that takes hidden states as inputs,

and produces one attention score per attention head, per input step. The BiGRU has hidden

state sizes of 500 in each direction, and the hidden layer in the attention module has 350 tanh

units, which in turn feeds a linear layer that produces scalar scores for each of eight attention

heads. These scores are then softmaxed over the input, and the attention-weighted sum of

the corresponding hidden states serves as the value of the attention head. These values are

concatenated and fed into a fully connected layer with 500 tanh units, and another linear

layer to produce the pre-softmax scores for each class, followed by a softmax output to

17 https://github.com/ExplorerFreda/Structured-Self-Attentive-Sentence-Embedding

Figure 5.1: The self-attentive RNN model.

determine the class. The original model utilizes an orthogonality constraint on the attention vectors for each attention head, but we find that this is detrimental to our task, so we disable

it.

Formally, consider the model input, X = (x_1, x_2, ..., x_n), where x_i is the word embedding

corresponding to the ith word of the input sentence. The sequence of hidden states of the

RNN corresponding to the input, H = (h_1, h_2, ..., h_n), is just

H = BiGRU(X) (5.1)

The attention A is calculated as:

A = softmax(W_{s2} tanh(W_{s1} H^T))    (5.2)

where W_{s1} and W_{s2} are learned parameters, and with the softmax taken over the second

dimension of its input. This produces a 2D matrix representation of the sentence, M, which

is just calculated as

M = AH (5.3)

Then let M_c be a vector consisting of the concatenated rows of M; the final prediction ŷ is

ŷ = softmax(W_{f2} tanh(W_{f1} M_c))    (5.4)

where W_{f1} and W_{f2} are also learned parameters.
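A compact PyTorch sketch of Equations 5.1 through 5.4, with the layer sizes given above (500 BiGRU units per direction, 350 tanh attention units, eight heads, and a 500-unit tanh classifier layer), is shown below. It is an illustrative re-implementation; details such as the absence of bias terms in the attention module follow Lin et al. (2017) rather than being specified here.

```python
# Sketch of the self-attentive BiGRU classifier defined by Eqs. 5.1-5.4.
import torch
import torch.nn as nn

class SelfAttentiveRNN(nn.Module):
    def __init__(self, num_classes, embed_dim=300, hidden=500,
                 attn_hidden=350, num_heads=8):
        super().__init__()
        self.bigru = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.ws1 = nn.Linear(2 * hidden, attn_hidden, bias=False)   # W_{s1}
        self.ws2 = nn.Linear(attn_hidden, num_heads, bias=False)    # W_{s2}
        self.wf1 = nn.Linear(num_heads * 2 * hidden, 500)           # W_{f1}
        self.wf2 = nn.Linear(500, num_classes)                      # W_{f2}

    def forward(self, embeddings):
        # embeddings: (batch, seq_len, embed_dim) GloVe vectors
        H, _ = self.bigru(embeddings)                     # (batch, n, 2*hidden), Eq. 5.1
        scores = self.ws2(torch.tanh(self.ws1(H)))        # (batch, n, heads)
        A = torch.softmax(scores, dim=1).transpose(1, 2)  # softmax over tokens, Eq. 5.2
        M = torch.bmm(A, H)                               # (batch, heads, 2*hidden), Eq. 5.3
        Mc = M.flatten(start_dim=1)                       # concatenated head values
        return self.wf2(torch.tanh(self.wf1(Mc)))         # pre-softmax scores, Eq. 5.4
```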

The RNN is trained using the same fold splits and canonical query augmentation as the

CNN baseline. We use the Adam optimizer (Kingma and Ba, 2014) with default parameters.

Layer weights are initialized uniformly at random in the range [−0.1,0.1], and tokenize

inputs using default SpaCy tokenization (Honnibal and Montani, 2017). GloVe.42B vectors

serve as the inputs, with batch sizes of 20. We train for 40 epochs with an initial learning

rate of 0.001, take the best model, reinitialize an optimizer with learning rate of 2.5 × 10−4,

and train for another 20 epochs, taking the best model of all 60 epochs to test.
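The two-stage schedule can be summarized as in the following sketch, where train_epoch and evaluate are hypothetical helpers standing in for the usual epoch loop and dev-set scoring; model selection keeps the best dev-set model across all 60 epochs, as described above.

```python
# Sketch of the two-stage training schedule (train_epoch/evaluate are hypothetical).
import copy
import torch

def train_two_stage(model, train_loader, dev_loader):
    best_acc, best_state = 0.0, None
    # Stage 1: 40 epochs at the initial learning rate.
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(40):
        train_epoch(model, train_loader, opt)
        acc = evaluate(model, dev_loader)
        if acc > best_acc:
            best_acc, best_state = acc, copy.deepcopy(model.state_dict())
    # Stage 2: restart from the best model with a fresh, smaller-learning-rate optimizer.
    model.load_state_dict(best_state)
    opt = torch.optim.Adam(model.parameters(), lr=2.5e-4)
    for _ in range(20):
        train_epoch(model, train_loader, opt)
        acc = evaluate(model, dev_loader)
        if acc > best_acc:
            best_acc, best_state = acc, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model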

The development set results (top 3 lines of Table 5.1) were a bit surprising to us: while we expected that contextual models would outperform the baseline CNN, fine-tuned BERT

performed comparatively poorly. The Self-attention RNN, however, performed significantly

better than the baseline CNN, which carries over to a smaller degree to the test set (CNN:

76.2% accuracy, 51.9% F1; RNN: 79.1% accuracy, 54.7% F1).18 A breakdown of accuracy

18 We only tested on the baseline and best system here to minimize use of the test set for future work.

System                      Acc. (%)   F1
Baseline CNN                80.7       55.6
BERT Fine-tune              79.8       46.6
Self-attention RNN          82.6       61.4

BERT Static CNN             76.9       49.4
BERT Contextual CNN         75.3       45.2

Mean-pool RNN               81.0       57.2
Bottleneck RNN              80.8       57.2
Orthogonal Attention RNN    80.3       55.3

Table 5.1: Dev set results comparing different models (top, Sec. 5.4), word embeddings (middle, Sec. 5.5.1), and attentional mechanisms (bottom, Sec. 5.5.2).

by class frequency quintiles for the test results is shown in Figure 5.2, to emphasize the

relationship between F1 and rare class performance.

In particular, the BERT model has a very low F1, likely because of the large number of

subtly distinguished classes, the relatively small data set, and the high degree of freedom

in the BERT model. That is, BERT may be representing semantically similar sentences in

nearby regions of the representation space, but with enough variation within those regions

that our training set does not permit enough examples for the classifier to learn good

boundaries for those regions. Alternatively, the masked language modeling task may simply

not induce the grammatical knowledge required to distinguish some classes well.

The success of one attention-based contextual model (Self-attention RNN) and the

failure to improve of another (Fine-tuned BERT) led us to ask two analytical questions:

first, are the BERT representations not as appropriate for the Virtual Patient dialog domain

compared to GloVe embeddings? Second, is there something that we can learn about how

the attention-based method is helping over the CNN (and particularly on F1)?

5.5 Analysis

5.5.1 Why did BERT perform less well?

The difference in accuracy from the baseline CNN model to the BERT fine-tuning result

is fairly small, while the drop in F1 is substantial. Since there are many more infrequent

classes than frequent classes, this suggests that BERT is seriously underperforming in the

least frequent quintiles, and making up for it in the most frequent. That, in turn, supports

the interpretation that small numbers of examples are inadequate to train a classifier to

handle the variation in representations that come out of a contextualized model. This would be consistent with other research showing poor performance of BERT in low-data

regimes (Mahabal et al., 2019). Some of the discrepancy may also be explained by a domain

mismatch. The BERT base model is trained on book and encyclopedia data (Devlin et al.,

2019), to provide long, contiguous sequences of text. In contrast, our inputs are short,

conversational, and full of typos. GloVe.42B, trained on web data (Pennington et al., 2014),

may simply be a better fit for our corpus.

To try to tease apart the contributions of model architecture and learned representations, we utilized two different embeddings within the CNN: the contextual BERT embeddings,

i.e. the full 768-dimensional hidden state from the first layer19 of the BERT model corre-

sponding to each input token, and a static BERT embedding. We collect these static BERT

embeddings by running the training set through the BERT model, and taking the state of the

first layer from the BERT model as the embedding of the corresponding token. We then average

these representations for each word type in the data set, and use that as the input wherever

the word occurs. Note that since BERT is trained with positional embeddings instead of

ordering, representations from this layer likely retain a lot of positional information, which

19 Empirically, and surprisingly, this worked better than other layers we tried.

could be an important source of noise in the averaged representations. Training the CNN is

otherwise the same as in the baseline experiment.
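The static BERT embeddings can be computed roughly as in the sketch below, which averages each word type's first-layer state over the training set. The use of the HuggingFace transformers library, the reading of "first layer" as the first encoder layer, and the treatment of subwords (taking the first wordpiece of each word) are simplifying assumptions.

```python
# Sketch of computing static (type-averaged) BERT embeddings from the training set.
from collections import defaultdict
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()

sums, counts = defaultdict(lambda: torch.zeros(768)), defaultdict(int)

def accumulate(sentence):
    words = sentence.lower().split()
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).hidden_states[1][0]   # first encoder layer, (tokens, 768)
    word_ids = enc.word_ids(0)
    seen = set()
    for pos, wid in enumerate(word_ids):
        if wid is None or wid in seen:             # skip special tokens and later wordpieces
            continue
        seen.add(wid)
        sums[words[wid]] += hidden[pos]
        counts[words[wid]] += 1

# After calling accumulate() over every training sentence:
# static_embeddings = {w: sums[w] / counts[w] for w in sums}
```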

The worst of our BERT-based models is the full contextualized embeddings fed into the

baseline CNN. Since the classification architecture is the same as the baseline, this suggests

that a significant contributor to the reduced performance of the BERT-based models is the

contextualized representations themselves. It seems that stable representations of lexical

items are beneficial for generalizing to unseen sentences when few training examples are

available. Consistent with this, the static BERT CNN result, despite a lower accuracy than

the fine-tuning result, shows a gain in F1. Again, this supports the idea that variation, insofar

as it is not variation that is relevant to the task, is harmful for rare classes, since stable

representations of informative words for those classes help.

5.5.2 Analyzing the Self-attention RNN

One question is how much attention versus recurrency is playing a role in the Self-

attention RNN’s improvements. We replaced the attention mechanism with mean pooling;

Table 5.1 shows that performance is more on par with the CNN, suggesting that the attention

does play a significant role.

To better understand the behavior of the self-attentive RNN, we employ a relatively novel

method of analyzing attention: we insert bottleneck layers of just eight dimensions after

each attention head, with sigmoid activations and no dropout. This adds another nonlinearity

into the model, but reduces the total number of parameters substantially. Visualizations

of the Bottleneck RNN’s behavior on development data are shown in Figure 5.3. Each

attention head is shown with an arbitrary color; the underline of that color under the tokens

of the input shows what the attention head is attending to, with the opacity of the underline

Figure 5.2: Quintile accuracies for the tested RNN and CNN baseline

corresponding to the strength of the attention. The grid representation to the right of each input shows how each head represents what it attends to, again with color and opacity corresponding to head identity and unit activations, respectively. These patterns are arbitrary, but by comparing head-specific patterns across classes and inputs, we can get a sense of how certain model behaviors are consistent or not. Since the final classification must be made entirely on the basis of the bottleneck layer, we assume that the activations in this layer are a complete representation of the input sentence according to the model. Thus, the heads/rows that exhibit similar activation patterns for different sentences are representing their attended tokens similarly. We do note that this interpretation can be misleading, since

the representation that feeds into the bottleneck layer is the attention-mediated hidden state

of the RNN, which, due to the action of the RNN, could be anything. The following analysis

does support the interpretation that the RNN’s representation correlates with the input token,

though. The bottleneck RNN and CNN have similar overall performance (Table 5.1), but

the RNN’s performance on the least frequent classes is still superior.
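A sketch of the bottleneck modification is given below. It assumes the attention-weighted head values M from the model sketch in Section 5.4, and carries over the 500-unit classifier layer as an assumption, since the text above only specifies the eight-dimensional sigmoid bottlenecks themselves.

```python
# Sketch of the bottleneck variant: each head's value is squeezed through an
# 8-unit sigmoid layer before classification (names and classifier size are ours).
import torch
import torch.nn as nn

class BottleneckHeads(nn.Module):
    def __init__(self, num_classes, head_dim=1000, bottleneck=8, num_heads=8):
        super().__init__()
        # One small bottleneck per head; its activations are what get visualized.
        self.bottlenecks = nn.ModuleList(
            nn.Linear(head_dim, bottleneck) for _ in range(num_heads)
        )
        self.wf1 = nn.Linear(num_heads * bottleneck, 500)
        self.wf2 = nn.Linear(500, num_classes)

    def forward(self, M):
        # M: (batch, num_heads, head_dim), the attention-weighted hidden states
        squeezed = [torch.sigmoid(b(M[:, i])) for i, b in enumerate(self.bottlenecks)]
        Mc = torch.cat(squeezed, dim=1)          # the complete input to the classifier
        return self.wf2(torch.tanh(self.wf1(Mc)))
```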

By finding the greatest Jensen-Shannon divergence between predictions made by the

baseline CNN and the RNN, as well as the largest change in class recall between the systems, we can identify interesting cases illlustrating the benefit of the RNN system. One compelling

case is the difference between Do you drink [alcohol]?, Do you drink coffee?, and Do you

drink enough fluid? (classes 85, 86, and 87 in development data). The Do you drink? class

is very frequent, while the other two are in the least frequent quintile. Since drink by itself

implies alcohol, the trigram do you drink is highly predictive of the alcohol class, and the

CNN almost always errs on the other classes.

The RNN, on the other hand, handles this distinction quite well. In all cases, drink is

attended by multiple heads (Figure 5.3), but across the set most of the heads are focused on

representing the verb itself, while the magenta and tan representations in particular (third

and last row, respectively) are representing the object of the drinking. In the absence of an

object, the object-focused head lands on the verb itself, and learns the implicit meaning of

alcohol from the supervision.

Further examination of the bottleneck representations shows some evidence of the model

learning to combine concepts from frequent classes to improve performance on a rare class.

In development data, the rare class Does your back hurt at work? (class 82) sees a large

boost in accuracy from the Self-attentive RNN relative to the baseline CNN (57% vs 14%,

respectively), albeit over only seven examples. Figure 5.4 suggests that the rare class is

Figure 5.3: Example inputs with bottleneck attention head representations. The colored underlines correspond to the foci of the attention heads, with opacity corresponding to attention weights. The activation patterns in the correspondingly-colored rows of the grid representations reflect how the attended tokens are represented by each head. Note that heads consistently attending to “drink” (e.g. yellow and green) have the same or similar representations across classes, while heads attending to the object of drinking (e.g. magenta and tan) have distinct representations for each class; further, the object-focused heads accept the verb as a stand-in for its implicit object when alcohol is not explicitly mentioned.

represented with components of the representations of work and worsening, illustrated by

examples from the classes What do you do for a living? (class 290) and What makes the

pain worse? (class 306), respectively. In particular, the magenta head strongly attends to the

“work” token in the class 82 example, and the representation in that head (and to a lesser

extent, the green one as well) is very similar to the representation of the same word in class

290. Similarly, the red and tan heads in class 82 attend strongly to the “worse” token, which

is similar to the “worse” token from class 306. Obviously this is not perfect; “worse” is

attended strongly in class 82 by the darker blue head (row 2), but the representation is more

dissimilar in the frequent analog, despite having a similar context. Nonetheless, we take the

Figure 5.4: A rare class borrows representations from two frequent classes. Note that the magenta representation of “work” in class 82 is very similar to the magenta representation of the same word in class 290, and that the red and tan representations of “worse” are very similar to the same word in class 306.

overall trend as evidence that the model may be appropriately recombining representations

of frequent patterns to improve rare class performance.

We confirm that some of these behaviors persist in the full model by performing agglom-

erative clustering on the full head representation in the test RNN. A portion of a resulting

dendrogram is shown in Figure 5.5. We see that the head that attends most strongly to

water and coffee also often represents alcohol and drink in the same cluster. Marijuana,

drugs and tobacco (via its associated verb smoke) are also represented nearby, as well

as other, more licit consumables such as medications. We can also see that alcohol is

represented consistently across classes, since members of test classes 93 and 229 (Do you

drink? and How much do you drink?, respectively) are represented in the same cluster

(and overwhelmingly classified correctly).20 Meanwhile, other heads attend to the verbal

meaning of drink (not shown), and encouragingly, these representations cluster nearby to

20 Classes were renumbered from development to test experiments, to accommodate the unseen classes.

similar consumption verbs such as “use” in the context of illegal drugs. This may be expected

due to the pretrained word vectors, but we also observe clusterings of apparently unrelated words in development data like “take” and “on,” which are similarly predictive of questions

about prescribed medication (e.g. “Are you on any prescriptions?”), but which word senses

are unlikely to converge representationally from pretraining on a general domain corpus. We

take this as evidence of the BiGRU’s ability to disambiguate word senses based on context,

especially since we occasionally observe the same word types in different clusters within

the same head.

In Figure 5.6, we can see some very broad concepts associated with tense and aspect

being captured by the same attention head, that generalize across many classes. In particular, we see a cluster containing many instances of ever, had, and past, which appear as members

of several classes. Such temporal distinctions separate many otherwise semantically identical

classes in the data set, such as Do you smoke? vs. Have you ever smoked?. This example

highlights another potential interpretation of the disparity in performance between the CNN

and RNN: in many cases, the smallest unit of generalization is a single word. The smallest

filter size in the CNN has a width of three, so the CNN must observe “you ever [verbed]”

and every meaningful variation for each class dependent on perfect tense. The RNN with

self-attention, on the other hand, can learn that a handful of single words distinguish present

from perfect.

Note, however, that many portions of these dendrograms do not demonstrate tight

clusterings, nor are they as easy to interpret as the neat examples listed here. Yet, given the

high performance of the RNN with self-attention, it would seem that some of these less

clear representations may simply be unimportant for the classification decision. This would

lend more support to the interpretations of Clark et al. (2019) and Voita et al. (2019) that

multiple attention heads each learn a relatively narrow focus, and when not needed, act as a

no-op, or can be ablated or otherwise ignored.

As noted above, the model from which our RNN is derived imposed an orthogonality

constraint that encouraged the attention heads to attend to different parts of the input. This

constraint applies a loss penalty to each example:

P = ||AA^T − I||_F^2    (5.5)

where || · ||_F indicates the Frobenius norm of a matrix, A is the multi-head attention matrix, and I is the identity matrix. Since this is a penalty, the optimization drives AA^T toward the

identity matrix, which encourages each attention head to attend to separate tokens. The

penalty also encourages focusing on as few tokens as possible, since each row of A must sum

to 1. Given the analysis presented above, which shows the benefit of multiple heads attending

to the same words for different reasons, it is unsurprising that this hurts performance for our

model (see Orthogonal Attention RNN in Table 5.1), but the drop in performance does

support the analysis. Perhaps more interestingly, qualitatively we observe that under this

constraint, the attention heads become more or less positional. During training, it is easiest

to reduce the loss by quickly enforcing orthogonality by sorting the heads into a positional

order. This then prevents the more effective semantically organized heads from emerging in

the training.
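For reference, the penalty of Equation 5.5 can be computed as in the following sketch; the weight with which it is added to the classification loss is a hyperparameter not specified here, so penalty_weight below is hypothetical.

```python
# Sketch of the orthogonality penalty of Eq. 5.5 over a batch of attention matrices.
import torch

def orthogonality_penalty(attention):
    # attention: (batch, num_heads, seq_len); each row of A sums to 1 after the softmax
    eye = torch.eye(attention.size(1), device=attention.device)
    gram = torch.bmm(attention, attention.transpose(1, 2))   # A A^T, (batch, heads, heads)
    return ((gram - eye) ** 2).sum(dim=(1, 2)).mean()        # squared Frobenius norm

# loss = cross_entropy(logits, labels) + penalty_weight * orthogonality_penalty(A)
```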

5.6 Conclusion

In some sense, our analysis is unsurprising. Words having the same input representations

should cluster together in model-internal representations, and members of the same class

should similarly cluster. However, we have shown evidence that the self-attentive RNN

does some amount of word sense disambiguation that generalizes across classes, and this

behavior is driven only by semantic classification. From a human perspective, it makes sense

that learning the most generalizable representation should be effective, but it’s not clear

that a model would need to learn those generalizations in order to perform the classification

task. Clearly it benefits from doing so, so it seems the multi-headed self-attention at least

allows for learning these generalizable concepts and the corresponding better optimum. Our

analysis supports the claim that representations learned in frequent classes are transferring

to, and improving performance on, rare classes, and further supports the value of a data set with a large number of subtly distinct classes.

There are some interesting questions and open issues that should be addressed with

future work. Additional experiments should do more to control for parameter counts; in

particular, the mean pooling RNN result is based on a model with far fewer relevant

parameters than the Self-attentive RNN it is compared to. Similarly, parameter counts

should be matched for comparisons of the Bottleneck RNN to the full Self-attentive RNN,

to more robustly characterize the effects of the additional nonlinearity in the bottleneck

model. The Bottleneck representations also seem to reflect something like rudimentary

“concepts,” insofar as similar semantics often cluster together in the representation space.

It would be interesting to examine whether any kind of “metacognitive” processes could

improve performance, for example with deductive or abductive inferences about relationships

between representations across attention heads. If two attention heads are attending to the

same word, can a meta-model determine that one representation is redundant? Would

consolidation of redundant representations release model capacity to improve performance

on other classes? How would such a consolidation process be designed? Can representational

context allow a model to infer domain-specific synonymy? These are very speculative and

aspirational notions, but the direct observation of something resembling concept formation

raises some intriguing possibilities.

The final data scarcity problem that we consider in this dissertation is an issue of

modality transfer: how can we improve the performance of a spoken dialogue agent when

the only in-domain data we have available is from a different modality? In the next chapter, we borrow from the character submodels of the stacked CNN in Chapter 4, which were

developed to achieve robustness to misspellings, to build CNNs that are robust to phonetic

misspellings, i.e. speech recognition errors.


Figure 5.5: Excerpt of dendrogram from agglomerative clustering of full representation of head 3. Leaves are labeled with the maximally attended word for this head, along with the label index of the sentence. Height of branch points corresponds to distance between merged clusters (or leaves), with color differences highlighting clusters with diameters under an arbitrary threshold. Note the cluster that corresponds to the objects of drinking (e.g. “alcohol”), and that it includes “drink” itself, capturing the implicit object.

Figure 5.6: A second excerpt of the same dendrogram in the previous figure. Lines appear in red because all branch points are below the cluster coloring threshold, and many of the attended words are temporal in nature. Note the cluster of words relating to past existence of some event (e.g. “ever,” “had,” and “past”), and that they span several classes.

Chapter 6: Modality Transfer with Text-to-Phonetic Data Augmentation

The previous chapters have largely taken aim at data scarcity problems that are inherent

to the Virtual Patient domain by virtue of its expert user base, or by virtue of the natural

long tail in the frequency of semantic classes—with the exception of the out-of-scope

problem. In this chapter, we consider yet another kind of data scarcity: that brought

about by a change of modality. Because of the unique problems of spoken dialogue (see

Section 2.1.3), we expect that a classification model trained on text inputs will not work

as well on automatically recognized speech inputs. The work in this chapter establishes a

novel approach to addressing this problem, mainly by making the classifier resilient to the

types of errors that are likely to occur in a generic ASR system. It does this by providing

parallel phonetic inputs for the classifier, in addition to the regular recognized words. We

also see a benefit from randomly sampling alternative inputs as training examples, where

the alternatives are generated by a novel error prediction method. The method recovers a

significant proportion of the error volume induced by naïvely using text-trained models on

ASR inputs.21

21 This chapter is a modification of an article that was accepted to ICASSP 2019 (Stiff et al., 2019).

6.1 Introduction

The extension of the text-based Virtual Patient to use a spoken interface presents a

number of problems, as discussed in Section 3.3. A prominent one is the fact that spoken

language is quite different from written language. It uniquely contains fillers, pauses, repairs,

and more, and speakers are prone to be more verbose than writers, since it is very easy for

humans to produce speech. Furthermore, speech recognition is error prone, whether done

by machines or humans, and in automatic speech recognition (ASR) systems, the types of

errors are much different from the errors seen in text-based systems. Where the Stacked

CNN ensembles utilized a character-based input representation to develop robustness to

typos, the expected benefit of such a model to the output of an ASR system would be smaller,

since it will always produce correctly-spelled words. The real problem is that it might not

always produce the correct words. An off-the-shelf ASR system would be expected to

produce a non-trivial number of misrecognitions in the fairly niche Virtual Patient domain,

so we sought to understand how serious this problem might be, and how we might address

it, despite having no in-domain speech data. Note that while this work was developed in the

Virtual Patient context, it has broad applicability to any novel domain-specific task that uses

speech recognition for inputs.

The usual approach to improving the performance of ASR systems deployed to provide

input to downstream tasks in custom domains would be to reduce misrecognitions by training

custom acoustic and/or language models for the application. However, in our unique domain,

the extensive annotated speech training data required to develop high quality models was not

available. In such low- or no-resource cases, it may make sense to deploy a broad-purpose

ASR system, and to just make the downstream task more tolerant of the inevitable ASR

errors. However, without speech transcripts from the general purpose system in the target

domain, it is difficult to know what types of errors to expect, so training a downstream task

to be tolerant of those errors is not straightforward.

In order to increase the robustness of downstream tasks to ASR errors, we propose a

simple method that allows us to leverage existing text data within the domain of interest,

directly inspired by the character CNNs in the Stacked CNN model. In the same way

that character-based CNNs are robust to misspellings, we aim to develop a classifier that

is robust to phonetic misspellings, i.e. ASR misrecognitions. In brief, we infer phonetic

representations of the in-domain data from the text modality using a grapheme-to-phoneme

converter, and instead of building an ensemble on words and the characters that comprise

them, we build an ensemble on words and the (inferred) phonemes that comprise them. The

speech recognizer, lacking domain-specific models, may sometimes produce transcripts of

generally likely words that have no semantic connection to what was said—e.g., “Oreo”

instead of “how are you”—but which share some acoustic similarities. The downstream

model should know what the important semantic distinctions “sound like,” so that if the

ASR produces the wrong words, the downstream model still has a chance to reinterpret the

sounds correctly.

We find that this method recovers a substantial portion of the errors resulting from

naïvely using speech transcripts as input to a model trained only on the text. We are able to

further boost performance by generating alternative versions of the text input that a speech

recognizer is likely to produce in error, and randomly sampling these as alternatives of

the original text during training. Our error generation method does rely on speech data to

determine specific error likelihoods, but importantly, we are able to show a benefit by using

an out-of-domain, general purpose speech corpus. In sum, we are able to boost performance

of a speech-based system in a custom domain, where the only in-domain data comes from a

non-speech modality.

While speech is a more natural modality for doctor-patient interactions, data collection

and availability is challenging, for many reasons outlined in previous chapters. In this study,

the only data available for tuning the speech recognition of the virtual patient actually come

from the typed conversations of previous versions of the patient.

Throughout this text we use typescript to refer to the typewritten conversations with the virtual patient, and speakscript to refer to speech recognition transcripts.

6.2 Related Work

Dialog systems have recently attracted a fair amount of research attention in terms of

both interpretation and generation of natural dialog, e.g. Li et al. (2017b). Besides being

predominantly in the text modality, this work sits in direct contrast to our domain, due to the

need for the virtual patient to present consistent, precise responses, in order to accomplish its

educational objectives. This work focuses solely on interpretation: for language generation we rely on precise predefined answers required by the medical teaching staff for educational

training and assessment.

Dialogue system development in the face of resource constraints has been a challenge

for several groups. Plauché et al. describe methods for language adaptation for speech

dialog systems in a target language with little recorded speech data, by adapting recognition

models as new input is collected (Plauché et al., 2008). In perhaps the closest work to ours,

Sarikaya et al. deploy spoken dialog systems in new domains with little or no resources

(Sarikaya et al., 2005), mining static text resources to develop in-domain language models

to improve the speech recognition performance directly. We are unaware of other work

utilizing in-domain, cross-modal data to improve the compatibility of a downstream model

Of course, this work builds directly upon the Stacked CNN described by Jin et al. (2017), and deploys text CNNs (Kim, 2014) as the downstream classification model.

6.3 Model

The downstream classification model used to identify questions is an ensemble of text

CNNs (Kim, 2014), following Jin et al. (2017) (Figure 6.1). These details are described more completely in Section 4.1, but are summarized again here for convenience. We train two sub-ensembles and combine their output with a stacking network (Wolpert, 1992). The stacking network outputs a weighted sum of its inputs, with weights learned to minimize the error.

Each sub-ensemble is trained on one of the two forms of the input, i.e., inferred phonetic representation and original typescript. The output of each sub-ensemble is determined by majority voting, which empirically performs better than a product of experts (Hinton, 2002) or averaging. Vote tallies of each sub-ensemble serve as input to the stacking network.
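The combination step can be sketched as follows; the parameterization of the stacking network shown here is our simple reading of "a weighted sum of its inputs," and names and tensor shapes are illustrative. The final answer is the argmax of the stacked output.

```python
# Sketch of majority voting within a sub-ensemble and the stacking combination.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 359

def vote_tallies(member_logits):
    # member_logits: list of (batch, NUM_CLASSES) outputs, one per ensemble member
    votes = torch.stack([l.argmax(dim=1) for l in member_logits], dim=1)  # (batch, members)
    return F.one_hot(votes, NUM_CLASSES).sum(dim=1).float()               # per-class vote counts

class StackingNetwork(nn.Module):
    """Weighted sum of the two sub-ensembles' tallies, with one learned weight each."""
    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(2))

    def forward(self, phoneme_tallies, word_tallies):
        return self.weights[0] * phoneme_tallies + self.weights[1] * word_tallies
```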

Both sub-ensembles consist of five convolutional networks, each trained on a different subset of the data. Each of these has a single convolutional layer followed by ReLU activations and max pooling, using dropout of 0.5. This is fed into a single fully-connected linear layer to produce the 359-dimensional softmax output of the network, which is trained using a cross-entropy criterion.

Phoneme-based CNNs take input of one channel of 16-dimensional embeddings (except for the 2-channel condition, see Section 6.4.1), initialized randomly and tuned for the task.

Convolutional layers consist of 400 kernels each of widths 2 through 6 phonemes. Inputs to the word-based CNNs are 300-dimensional pretrained word2vec embeddings (Mikolov et al., 2013), held static during training.

Figure 6.1: Overview of the classification model. Phoneme- and word-based representations are input to ensembles of text CNNs. Output of ensembles is determined by majority voting, and combined in a stacking network to produce final classification output.

Word-based networks use 300 kernels each of widths 3, 4,

and 5 words.

Under sampling conditions, training examples have alternate versions which may be

presented for training instead of the original input. These alternate versions simulate generic

ASR errors (see Section 6.4.1). We first randomly determine whether to choose an alternate;

if an alternative is desired, it is sampled according to the likelihood of its generation.
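A minimal sketch of this sampling step is shown below, assuming each training sentence is stored with its generated alternatives and their generation frequencies (see Section 6.4.1); the data layout and the sample_rate argument are illustrative.

```python
# Sketch of choosing between the original input and a sampled error alternative.
import random

def choose_training_input(original, alternatives, frequencies, sample_rate=0.2):
    """original: str; alternatives: list of str; frequencies: list of generation counts."""
    if alternatives and random.random() < sample_rate:
        # Pick an alternative in proportion to how often the error model produced it.
        return random.choices(alternatives, weights=frequencies, k=1)[0]
    return original
```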

6.4 Experiments

The main experimental variables that we manipulate are the representation of phonemes

used, and the rate at which we randomly sample erroneous alternative forms of the input. In

this section we describe in detail the data used in the experiments, as well as the experimental

conditions.

6.4.1 Data

The core data set is used for training. Spelling errors occur regularly in the set, and are

generally left as-is for the purpose of phonetic inference (although some unpronounceable

special characters were stripped).

Phonetic data is derived from the typescript by looking up the pronunciation in CMUdict

(omitting stress), or using a Phonetisaurus (Novak et al., 2012) grapheme-to-phoneme model

trained on CMUdict for unknown words. We experiment with three variations on the

phonetic representation. The plain phones condition simply concatenates the phoneme

sequences of the constituent words in the sentence in order. The boundary tokens condition

adds a single “boundary phoneme” to the alphabet, which is inserted between words, to

explore the value of word segmentation information in the semantic classification task.

Finally, the 2-channel condition adds word boundary information in a second channel to

the text CNNs comprising the phonetic sub-ensemble. In other words, each phoneme is

represented as both the identity of the phoneme, as well as whether or not that sound is the

start of a word (both encoded as a 16-dimensional embedding vector).
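The three phonetic variants can be derived from typescript roughly as in the sketch below, which uses the NLTK copy of CMUdict (assumed to be downloaded) and a hypothetical g2p() fallback in place of the Phonetisaurus model; the boundary symbol <wb> and the exact placement details are our naming choices.

```python
# Sketch of building the plain-phones, boundary-token, and 2-channel inputs.
import re
from nltk.corpus import cmudict

PRON = cmudict.dict()          # word -> list of pronunciations with stress digits

def phones_for(word):
    if word in PRON:
        pron = PRON[word][0]
    else:
        pron = g2p(word)       # hypothetical fallback, e.g. a Phonetisaurus wrapper
    return [re.sub(r"\d", "", p) for p in pron]    # strip stress markers

def phonetic_inputs(sentence):
    words = sentence.lower().split()
    plain, with_boundaries, two_channel = [], [], []
    for w in words:
        phones = phones_for(w)
        plain.extend(phones)                                   # "plain phones"
        if with_boundaries:
            with_boundaries.append("<wb>")                     # boundary token between words
        with_boundaries.extend(phones)
        # 2-channel: phoneme identity plus a word-start indicator for each phoneme
        two_channel.extend((p, int(i == 0)) for i, p in enumerate(phones))
    return plain, with_boundaries, two_channel
```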

Simulated ASR error alternatives are generated by a method due to Serai et al. Briefly,

the method aims to simulate a neural acoustic model making incorrect predictions, and to

decode the resulting lattice as usual, generating text that is likely to be erroneously produced

by a non-domain specific acoustic model. The technique samples both when to produce an

erroneous phoneme, and which one to produce, if so. The distribution of phoneme choice

is learned from the confusions produced by a trained model, but the posterior probabilities

of erroneous phonemes are determined using the confusability of the original phoneme, to

simulate an over-confident system. This method is used to generate up to 100 alternatives

for a given typescript input sentence, along with the frequency with which each alternative

is produced. Note, again, that use of this method constitutes a dependence on audio data, but

this data need not be (and in the present experiment, is not) from within the target virtual

patient domain.

To evaluate the effects of our data augmentation, we collected a small test set of

speakscript. Six adult, native English-speaking volunteers, three each male and female, read

randomly selected dialogs from the enhanced data set. These read-speech utterances were

fed into the target ASR system to collect the corresponding speakscript. The dataset consists

of 756 transcribed utterances. After spelling correction of the typescript input, word error

rate of the speakscript output was calculated at approximately 10%. Classification accuracy

of the speakscript test set in the combined typescript-trained model was 65.7% (cf. 69.9%

for typescript input). We also generated phonetic forms of the speakscript test data using the

same methods used for typescript.

The test set does have several shortcomings: its size does not admit very many of the

types of errors we would be able to correct with our method; read speech is better-behaved

than spontaneous speech (Nakamura et al., 2008); and it includes some unseen labels.

Nonetheless, it allows an evaluation of our approach.

6.4.2 Experimental details

Models are trained according to descriptions provided by Jin et al. (2017), and again,

are described in more detail in previous chapters, but summarized here. All models in

each sub-ensemble are trained individually, using distinct 90/10 train/dev splits, using the

Adadelta learning rule (Zeiler, 2012), with initial learning rate 1.0. We train each sub-model

for 25 epochs, and keep the model from the epoch with the best dev set performance.

We report accuracy for each sub-ensemble (Phonemes and Words), as well as the

accuracy of the combined system. With the exception of the “all alternatives” condition

(see below), all of the different inferred phonetic representations used the same training and

development splits during training. The training data for every model in the ensemble was

supplemented with a list of the “canonical” sentences for each of the 359 classes. Thus, the

development set for every model was guaranteed to have no unseen classes.

We include the results of an early experiment in which we simply trained the whole

system using all of the available alternative error forms, randomly shuffled, and split 90/10

for each sub-model. This experiment was unsuccessful (see sections 6.5 and 6.6), but was

motivation for implementing the sampling paradigm, so we report its results for comparison.

In addition to altering the phonetic input representation, we experiment with sampling

rates ranging from 0-50% for error alternatives. In experiments using sampling, dev set

examples never use sampled alternatives.

6.5 Results

Results are reported in Table 6.1. Likely due to the small sizes of the training and test

sets, as well as the randomness introduced by sampling, the results exhibit a fair amount

of variation from run to run. Therefore, we report averages over three runs under identical

model parameterizations.

The best-performing combined system is plain phonemes with a sampling rate of 20%;

this recovers approximately 62% of the increased error rate in using speakscript within a

typescript model. Plain phonemes also exhibit the best performance among the non-sampled

conditions. 2-channel bounds give the best performing phoneme sub-ensemble, although

the difference from the baseline comes nowhere close to statistical significance. Combined

Representation           Sampling   Phonemes   Words    Combo
Baseline (typescript)    N/A        (system trained as combination only)    69.9
Baseline (speakscript)   N/A        (system trained as combination only)    65.7
All alternatives         N/A        64.95      65.48    65.74
Plain phonemes           0%         67.15      66.27    67.55
                         5%         66.89      66.76    67.68
                         10%        66.75      66.40    67.73
                         20%        66.75      66.00    68.30*
                         50%        66.36      66.09    67.50
Boundary tokens          0%         66.45      66.09    67.64
                         5%         66.67      66.05    67.86
                         10%        66.58      66.88    67.90
                         20%        65.88      66.76    67.77
                         50%        65.96      66.31    67.99
2-channel bounds         0%         67.37*     66.89*   67.37
                         5%         66.67      66.58    67.59
                         10%        66.48      66.40    67.42
                         20%        66.62      66.89*   68.12
                         50%        67.11      66.36    67.95

Table 6.1: Test set question classification accuracy, reported as the average of three runs. Column maxima are marked with an asterisk. All “Combo” results are a significant improvement over the speakscript baseline using Pearson’s χ2 and the Benjamini-Hochberg multiple tests correction (Benjamini and Hochberg, 1995) with a false discovery rate of 10%.

systems are always at least as good as either of the constituent sub-ensembles, and usually much better, although we don’t test for significance of this difference.

6.6 Discussion

The aforementioned test set issues make it difficult to make sweeping pronouncements

about the results, but we do find some encouraging trends. The two clearest such results

are 1) that even inferred phonetic representations can improve speech recognition input for

downstream tasks, and 2) that sampling generated errors seems to further boost performance,

although we cannot claim statistical significance relative to 0% sampling.

The motivation for sampling in the first place derives from the negative result from the

“all alternatives” condition. In essence, including all alternatives just allowed for serious

overfitting: with only minor variations in the training examples, it overspecialized on the

specific sentences underlying the alternate forms, harming performance on unseen sentences.

This may have been mitigated with smarter stratification of the development sets, but

sampling alternatives also enhances variety in the surface forms for each label without

making the development sets easier.

Because sampling does not seem to benefit solely word-based or phoneme-based systems,

it would seem that sampling encourages diversification across the two sub-ensembles, as the

best combination results usually do not have the best component results. Indeed, the best-

performing individual systems that contribute to the averages shown in the table maintain

this trend (data not shown).

The benefit of word boundary information is less clear: the best-performing model on

average included no word boundary information, and the second best average used the version of word boundaries that is easiest to ignore; however, boundary tokens sometimes

outperform other representations under otherwise equivalent conditions. This speaks to the

need for further experiments with more statistical power.

6.7 Future Work

As this was a pilot study, there are many avenues for improvements to the current work,

as well as new questions identified by the experiments presented. First and foremost is the

need for more data, both to improve generalizability of the models, as well as to put the

results on firmer statistical footing. In the time since the work described in this chapter was completed, we have repeated this experiment using a subset of the spoken data set (see

Section 3.1.4) as a test set, to attempt to improve the statistical power of the results here.

We only included examples where both the correct label and ChatScript’s interpretation were classes seen in the core data set, and tested the configuration that showed the best

results here, i.e. plain phones with 20% sampling. That experiment, as well as sensible

checks, failed to show an improvement over a baseline using a word/character ensemble

on the speakscript. The working hypothesis is that the subsetting operation excluded many

misrecognitions that might have benefitted from phonetic representations, possibly due

to the known biases in the annotation of the data. Ongoing investigations will assess the validity of this hypothesis at first with simple word error rate metrics for the included and

excluded sets. We have also experimented with different methods of generating errorful

input alternatives (Serai et al., 2020), but these also did not improve performance, at least

for the read-speech test set collected here. These approaches may yield fruitful research

once the issues with the spoken data set are better understood. Additional investigations will

need to compare the phonetic augmentation approach to language model customizations

using the in-domain text data. The cloud-based speech recognizer provides straightforward

facilities for such customization, although the mechanics of the customization process are

fairly opaque.

An intriguing question raised by the current study is why random alternative sampling

affords a benefit, and whether the mechanism of the benefit is the same for phone repre-

sentations as for words. One possibility is that sampling is just a form of regularization, which may be borne out by the slight drops in performance for each of the sub-ensembles —

suggesting the need to directly compare to other types of regularization. A further possibility

is just that, by luck, the random samples introduce some of the specific errors seen in the test

set. If this were the case, we might expect less benefit in a broader domain, as confusable

alternatives, “toaster” is probably a safe replacement for “mister” in our domain, but only

because questions about breakfast are irrelevant to the patient’s back pain.

Also of interest are ways in which we might more directly encourage acoustic similarities

to be represented in the input, instead of depending on the distant supervision of the correct

semantic class and generated errors to encourage similarities to emerge. One straightforward

option to try would be to initialize the embedding matrix for phonemes with corresponding

average MFCCs or GMM representations.
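A minimal sketch of that initialization, assuming per-phone average MFCC vectors have been computed offline; the phone_to_avg_mfcc mapping and the zero-padding scheme are illustrative choices rather than a tested recipe.

import torch
import torch.nn as nn

def init_phone_embeddings(phone_to_avg_mfcc, phone_vocab, embed_dim):
    """Build an embedding layer whose rows start from average MFCC vectors.

    Phones without acoustic statistics keep the default random initialization;
    acoustic vectors are zero-padded or truncated to the embedding width.
    """
    embedding = nn.Embedding(len(phone_vocab), embed_dim)
    with torch.no_grad():
        for idx, phone in enumerate(phone_vocab):
            mfcc = phone_to_avg_mfcc.get(phone)
            if mfcc is None:
                continue
            vec = torch.zeros(embed_dim)
            n = min(embed_dim, mfcc.numel())
            vec[:n] = mfcc[:n]
            embedding.weight[idx] = vec
    return embedding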

Finally, we are currently experimenting with a form of knowledge distillation for appli-

cation to this task, in which we seek to minimize the mean squared error between analogous

layers in a high-performing network and a network learning from alternative versions of

the same input. In this way we hope to encourage the representations of the alternatives to

become similar in semantically coherent ways. Initial results have been promising, but did

not surpass the best models presented here.
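A minimal sketch of that distillation objective, assuming the two networks expose analogous hidden layers; the teacher's activations are detached so only the student is updated, and the weighting term is illustrative.

import torch.nn.functional as F

def distillation_loss(student_layers, teacher_layers, student_logits, labels,
                      alpha=0.5):
    """Task loss on the alternative input plus MSE between analogous layers.

    `student_layers` are hidden states computed from an errorful alternative;
    `teacher_layers` are the corresponding states from the high-performing
    network run on the clean input.
    """
    task_loss = F.cross_entropy(student_logits, labels)
    match_loss = sum(
        F.mse_loss(student, teacher.detach())
        for student, teacher in zip(student_layers, teacher_layers)
    ) / len(student_layers)
    return task_loss + alpha * match_loss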

Chapter 7: Conclusion and Future Work

This dissertation presents several contributions to the Virtual Patient project specifically,

and the scientific literature generally.

The first, and central, contribution is an engineered system that meets its pedagogical

goals reasonably well—controlled experiments validated that it achieves approximately

80% accuracy on positive classes, focused on the most frequent classes; subsequent work

demonstrably improved components of the system; and the scoring functionality of the

system has been shown to correlate with human graders (Maicher et al., 2019), although

controlled experiments comparing actual educational outcomes are left to future research.

The Virtual Patient also serves as a framework for data collection and research in various

issues in linguistics, speech, and dialogue. This system sees regular use, and has enabled

novel educational programs not described in this text, such as ethnic bias training for

physicians.

Another contribution was the validation of the hybrid question identification model as a

significant performance boost for our task, relative to a rule-based system by itself. This, in

turn, led to a characterization of the out-of-scope problem in the Virtual Patient domain, as well as a preliminary approach to mitigating it. This work, the bulk of which was presented

in Chapter 4, is currently under review for publication in the Journal of Natural Language

Engineering.

The chief contributions from Chapter 5 are a surprising negative result—that the powerful

BERT model is not a panacea in our domain—as well as the discovery that a different

multi-headed self-attentive model performs particularly well on rare classes in our domain.

The most important contribution is an analysis of why that seems to be the case, with

specialization among attention heads playing an important role, and the uniquely large

number of fine-grained semantic classes likely contributing. A version of this work was

submitted to the SIGDIAL 2020 conference, and is undergoing review.

The final main contribution in this dissertation, presented in Chapter 6, is a fairly

simple method of augmenting training data with inferred phonetic representations to make

downstream tasks robust to speech recognition errors. The pilot work described in that

chapter was accepted to ICASSP 2019, and the expansions outlined in Section 6.7 are

currently in development for potential submission to TASLP.

The Virtual Patient program is rich with opportunities for continued research, and this

dissertation can hopefully serve as a useful roadmap for anyone seeking to explore them.

Besides the extensions and expansions of individual contributions that were mentioned in

the context of their related projects, the Virtual Patient program—the goals, software, data,

participants, etc.—offers several opportunities for continued research. Several of these were

introduced in Section 3.3, and we elaborate on some of them here.

There are many interrelated pragmatic spoken dialogue issues that show up within the

Virtual Patient, which directly or indirectly involve turn-taking. This may be somewhat

surprising, since the agent is modeled as a question-answering agent that never takes

initiative. However, in the same way that this simplifying assumption has been useful for

building a functional system that provides an opportunity to study several data scarcity

issues “in the wild,” it offers an opportunity to build in turn-taking behaviors that are not critical to the basic functioning of the system, and thus easier to experiment with and test.

To elaborate, the term “question-answering dialogue agent,” which has been used throughout this dissertation, is actually a misnomer! While the overwhelming majority of inputs are indeed questions, a substantial number fall into other speech act categories, including fillers, commissives, and acknowledgements, which further break down into different dialogue act intents, such as confirmation of information, or expressions of encouragement, approval, and sympathy. This variety of inputs matters for turn-taking because many of them may not actually cede a turn, or even require a response at all. Fillers are commonly used by speakers to hold a turn; that is, to indicate that they are not done talking. General-purpose speech recognition systems, however, may be designed to suppress some such inputs, because popular downstream tasks such as voice search or keyword detection are not helped by the presence of “umm” and “uh”. The Virtual Patient’s end-of-utterance (EOU) detection would be substantially improved by simply waiting longer to send an input if a filler word was the last thing detected, or even by ignoring such inputs altogether. Similarly, many utterances start with simple acknowledgements of the previous utterance, e.g., “OK. [pause] Do you have any children?” The pause risks triggering an EOU in the current system, but an acknowledgement-aware EOU detector could be made to wait, to great (and measurable) effect. Incorporating these kinds of intent detection into the model could also pave the way for understanding and taking advantage of initiative-relinquishing intents in mixed-initiative agents.
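As a rough illustration of the filler- and acknowledgement-aware waiting strategy sketched above, the following shows one possible end-of-utterance policy. The token lists and timeout values are invented for illustration; they are not the deployed system's settings.

FILLERS = {"um", "umm", "uh", "er", "hmm"}
ACKNOWLEDGEMENTS = {"ok", "okay", "alright", "right", "i see"}

def end_of_utterance(partial_transcript, silence_ms,
                     base_timeout_ms=700, extended_timeout_ms=2000):
    """Decide whether to close the user's turn and send the input onward.

    Waits longer after a trailing filler (a turn-holding cue) or a bare
    acknowledgement, which often precedes the real question after a pause.
    """
    text = partial_transcript.lower().strip()
    words = text.split()
    if not words:
        return False
    if words[-1] in FILLERS or text in ACKNOWLEDGEMENTS:
        return silence_ms >= extended_timeout_ms
    return silence_ms >= base_timeout_ms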

While the EOU detection could be improved by teaching a model about these relatively simple semanto-pragmatic phenomena, the general user experience would also benefit from any efforts to make the turn-taking itself more fluid. Here, there are a few separate, but

related, issues. First, incremental understanding could reduce response latencies: if the

beginning of an input is discriminative enough, a response can be selected before the user

has even completed their utterance. Even if an input remains ambiguous until the very

end, the beginning can rule out many interpretations, potentially speeding up the decision

problem left after the last word is received. This would be very similar to work by DeVault

et al. (2011a), but, distinctly, could be implemented as an extension of the RNN presented

in Chapter 5, potentially using student-teacher learning to train an effective unidirectional

model with an added penalty for later decision-making.
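A minimal sketch of what such an incremental extension might look like: a unidirectional GRU scores the input at every prefix, and an auxiliary loss over all prefixes approximates the penalty on late decisions (the student-teacher component is omitted for brevity). All dimensions, weights, and names are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class IncrementalClassifier(nn.Module):
    """Unidirectional GRU that can commit to a class before the input ends."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))  # (batch, time, hidden)
        return self.out(hidden)                   # per-prefix class logits

def incremental_loss(step_logits, labels, early_weight=0.1):
    """Final-step cross-entropy plus an auxiliary loss over all prefixes.

    Scoring every prefix rewards committing to the correct class as early
    as the input allows, approximating a penalty on late decision-making.
    """
    steps = step_logits.size(1)
    final_loss = F.cross_entropy(step_logits[:, -1], labels)
    prefix_loss = torch.stack([
        F.cross_entropy(step_logits[:, t], labels) for t in range(steps)
    ]).mean()
    return final_loss + early_weight * prefix_loss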

A second potential turn-taking improvement would be to handle barge-in and overspeak.

The major impediment to progress here is the design decision to mute recognition in

the application during patient speaking events. Handling any kind of speech overlap would obviously necessitate disabling the muting, which then changes the basic turn-taking

dynamic into a speaker diarization problem, with suppression of self-recognition as a

downstream or auxiliary task.

Again, the Virtual Patient is an effective test bed for any of these projects, because it basically meets most of its goals in its current state. These kinds of experimental modifications 1) will produce obvious, measurable effects, and 2) remain fairly independent of the core

functionality of the system, which makes experimental manipulation easier. The potential

upside from a user experience perspective is also substantial.

In summary, there is much more depth to the Virtual Patient project than appears at

first glance. It is a simple classification problem; but it is also a unique out-of-domain

problem, and a challenging long tail problem, with an unusually rich semantic class model.

It is a simple turn-alternating agent; except when a user was not really done talking. It

is a simple question answering agent; unless the user just wants to express sympathy or

confirm information. The unique properties of the domain create significant opportunities for ongoing discoveries, particularly for spoken dialogue topics, and hopefully this work serves as a springboard for future research.

Appendix A: Label Sets

A.1 Strict Relabeling Labels

any changes in your diet •any chest pain •any family history of depression •any family history of hypertension •any family history of illness •any heart problems •any heart problems in your family •any otc medication •any other hospitalizations •any other past sexual partners •any other problems •any other sexual partners •any prescription medication •any previous heart problems •any prior surgeries •appetite •are you able to drive •are you able to sit •are you able to stand •are you abused •are you active •are you currently working •are you current on bloodwork •are you current on vaccinations •are you dating anyone •are you depressed •are you divorced •are you exposed to secondhand smoke •are you frustrated •are you happy •are you having any other pain •are you healthy •are you in a relationship •are you married •are you mr. wilkins •are you nervous •are your grandparents living •are your muscles sore •are your parents healthy •are your parents living •are your relatives healthy •are your siblings healthy •are you sexually active •are you sick •are you stressed •are you suicidal •are you sure •are you taking any medication •are you taking any medication for the pain •are you taking any other medication •are you taking any other medication for pain •are you taking supplements •can you care for yourself •can you do normal activities •can you move around with the pain •can you point to the pain •can you rate the pain •describe the frequent urination •describe the furniture you were lifting •describe the pain •did anything happen to cause the frequent urination •did anything happen to cause the pain •did i miss anything •did it happen at work •did the ibuprofen help •did the pain start immediately •did you feel or hear a pop •did you find a parking spot •did you have any trouble finding us •did you hurt your back •did you take ibuprofen today •discussion is confidential •do an exam now •do any positions make the pain worse or better •does anyone in your family have back pain •does any position make the pain worse •does any position relieve the pain •does anything else help the pain •does it hurt to touch your back •does it hurt when not moving •does it hurt when you bend •does moving increase the pain •does that plan sound good •does the frequent urination keep you up at night •does the pain improve with exercise •does the pain increase when you stand •does the pain interfere with work •does the pain keep you up at night •does the pain radiate •does this bother you mentally •does your back hurt when you work •does your hip hurt •do you do any heavy lifting at work •do you drink •do you drink coffee •do you drink enough fluid •do you eat fast food •do you eat fruit •do you eat healthy food •do you exercise •do you feel anxious •do you feel safe at home •do you feel well •do you have a cause for the pain •do you have a doctor •do you have a family •do you have a history of depression •do you have annual exams •do you have any allergies •do you have any associated symptoms •do you have any bladder problems •do you have any bowel problems •do you have any children •do you have any chronic illnesses •do you have any hip pain •do you have any hobbies •do you have any medical problems •do you have any pain •do you have any pain in your hip •do you have any pets •do you have any problems urinating •do you have any problems with your hip •do you have any questions •do you have any trouble sleeping •do you have any weakness •do you have a sexually transmitted infection •do you have a significant other 
•do you have a weapon •do you have exposure to chemicals •do you have frequent urination •do you have headaches •do you have loss of bowel or bladder control •do you have numbness or tingling •do you have pain anywhere else •do you have relatives •do you have siblings •do

you have support •do you have vision changes •do you live alone •do you mind if i call you jim •do you need a note for work •do you prefer men or women •do your body parts hurt •do you see your parents often •do you smoke •do you use any contraception •do you use illegal drugs •family history •good •goodbye •has anyone expressed concern about your drinking •has the frequent urination become worse or better •has the frequent urination stopped you from doing anything •has the inability to work caused problems •has the pain affected your activity •has the pain become worse or better •has your weight changed •have you been able to work •have you been incontinent •have you been nauseous •have you been off work •have you been resting •have you ever been dizzy •have you ever been in an accident •have you ever been in the hospital •have you ever been in the military •have you ever been pregnant before •have you ever had any chronic illnesses •have you ever had any serious illnesses •have you ever had a sexually transmitted infection •have you ever smoked •have you ever taken any other medication •have you had a colonoscopy •have you had an accident •have you had any past bladder problems •have you had a prostate exam •have you had back injury before •have you had back pain before •have you had blood glucose checked •have you had frequent urination before •have you had physical therapy •have you noticed a discharge •have you noticed an itch •have you noticed an odor in your urine •have you noticed any blood in your urine •have you noticed any physical changes •have you noticed pain while urinating •have you seen anyone else •have you tried anything else for the pain •have you tried anything for the frequent urination •have you tried any treatment •have you tried heat or ice •having mood changes •hello •hopefully we can help you •how about your body part •how are you •how did the pain start •how did you get here today •how did your grandparents die •how else is the frequent urination affecting you •how else is the pain affecting you •how far did you get in school •how frequent is the frequent urination •how frequent is the pain •how has the pain changed over time •how has this affected you •how has your day been going •how have you been handling this •how intense is the pain •how is home •how is work •how is your blood pressure •how is your bodily function •how is your body part •how is your cholesterol •how is your diet •how is your family •how is your hip •how is your mood •how is your pain now •how is your social life •how long does the pain last •how long have you been a worker •how long have you been taking the aspirin •how long have you been taking the ibuprofen •how long have you had the pain •how many sexual partners •how much do you drink •how much do you work •how much ibuprofen have you taken •how much sleep •how often do you take the ibuprofen •how often do you take the saw palmetto •how often do you urinate •how old are you •how old are your grandparents •how old are your parents •i am medical student •i am sorry •i have all the information i need •i have more questions •i’ll go report to the doctor •is pain better when you lie down •is pain worse when you lie down •is the frequent urination constant •is the frequent urination improving •is the frequent urination new •is the frequent urination worse in the morning or at night •is the pain constant •is the pain deep •is the pain dull •is the pain improving •is the pain in your upper or lower back •is the pain new •is the pain on one
side •is the pain sharp •is the pain worse in the morning or at night •is there anything else i can help you with •is this the first time you have had this pain •is urination always painful •is your job physically demanding •is your job pleasurable •is your job stressful •i understand completely •i will do my best •i would like to get to know more •my name is bob •name and date of birth •negative symptoms •nice talking with you •nice to meet you •nice to see you •none •ok then •please repeat that •should I call you by a different name •social history •sounds like •tell me about your grandparents •tell me about yourself •tell me about your work •tell me more about the saw palmetto •tell me more about your back pain •thanks •that must be awful •that must be hard •that was nice •the doctor will be in •today we have 15 minutes •to summarize •want to ask about •was the onset of the frequent urination sudden •was the onset of the pain sudden •were you doing anything when the pain started •were you healthy •were you lifting anything •what are your allergy reactions •what bothers you the most •what brings you in today •what can’t you do at work •what caused the past back pain •what concerns you about the pain •what do you do for a living •what do you do for fun •what do you eat each day •what do you think about the buckeyes •what do you think about this weather •what do you think is the problem •what else have you tried for the frequent urination •what else have you tried for the pain •what have you tried for the pain •what is your goal for this visit •what is your name •what is your past medical history •what is your religious preference •what makes the frequent urination better •what makes the frequent urination worse •what makes the pain better •what makes the pain worse •what should i call you •what was the dose •what was the dose of aspirin •what were

the pills called •what would you like me to do •when did the frequent urination start •when did the pain start •when do you have the frequent urination •when do you have the pain •when is the pain most severe •when was the last time you had intercourse •when was the last time you saw a doctor •when was your last bowel movement •when was your last period •when was your last tetanus shot •when were you born •where do you live •where do you work •where is the pain •who do you live with •who prescribed the medicine •who supports you •why do you take the medication •why do you take the supplements •would you like some pain medication •you are welcome •you seem uncomfortable •

Appendix B: Table Definitions

-- One row per conversation with the virtual patient.
CREATE TABLE Conversations (
    Convo_num      int NOT NULL AUTO_INCREMENT,
    Client_ID      varchar(8),
    WS_Version     varchar(16),
    First_name     varchar(255),
    Last_name      varchar(255),
    Patient_choice int,
    Input_method   varchar(8),
    Mic            varchar(8),
    Exp_group      varchar(16),
    Raw_score      TEXT,
    Uuid           varchar(40),
    PRIMARY KEY (Convo_num)
);

-- One row per user query within a conversation, with the system's
-- interpretations and replies; keyed by (Convo_num, Query_num).
CREATE TABLE Queries (
    Convo_num      int NOT NULL,
    Query_num      int NOT NULL,
    Input_text     varchar(510),
    CS_interp      varchar(510),
    CNN_interp     varchar(510),
    CS_init_reply  varchar(510),
    CS_retry_reply varchar(510),
    Choice         varchar(8),
    Audio_path     varchar(255),
    CONSTRAINT PK_Query PRIMARY KEY (Convo_num, Query_num),
    CONSTRAINT FK_Query_Convo FOREIGN KEY (Convo_num)
        REFERENCES Conversations(Convo_num)
);

Bibliography

Garen Arevian. 2007. Recurrent neural networks for robust real-world text classification.

In IEEE/WIC/ACM International Conference on Web Intelligence (WI’07). IEEE, pages

326–329.

Harish Arsikere, Elizabeth Shriberg, and Umut Ozertem. 2015. Enhanced end-of-turn

detection for speech to a personal assistant. In 2015 AAAI Spring symposium series.

Ankur Bapna, Gokhan Tur, Dilek Hakkani-Tur, and Larry Heck. 2017. Sequential dialogue

context modeling for spoken language understanding. In Proceedings of the 18th Annual

SIGdial Meeting on Discourse and Dialogue. pages 103–114.

Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: a practical

and powerful approach to multiple testing. Journal of the Royal statistical society: series

B (Methodological) 57(1):289–300.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with

Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc.

Matthew R Boutell, Jiebo Luo, Xipeng Shen, and Christopher M Brown. 2004. Learning

multi-label scene classification. Pattern recognition 37(9):1757–1771.

Luka Bradeško and Dunja Mladenić. 2012. A survey of chatbot systems through a loebner

prize competition. In Proceedings of Slovenian Language Technologies Society Eighth

Conference of Language Technologies. pages 34–37.

Leonardo Campillos-Llanos, Catherine Thomas, Éric Bilinski, Pierre Zweigenbaum, and

Sophie Rosset. 2019. Designing a virtual patient dialogue system based on terminology-

rich resources: Challenges and evaluation. Natural Language Engineering pages 1–38.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi

Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representa-

tions using rnn encoder–decoder for statistical machine translation. In Proceedings of

the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

pages 1724–1734.

Alonzo Church. 1932. A set of postulates for the foundation of logic. Annals of mathematics

pages 346–366.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. 2019. What

does bert look at? an analysis of bert’s attention. In Proceedings of the 2019 ACL

Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. pages

276–286.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and

Pavel Kuksa. 2011. Natural Language Processing (Almost) from Scratch. Journal of

Machine Learning Research 12:2493–2537.

Douglas Danforth, A. Price, K. Maicher, D. Post, B. Liston, D. Clinchot, C. Ledford, D. Way,

and H. Cronau. 2013. Can virtual standardized patients be used to assess communication

skills in medical students. In Proceedings of the 17th Annual IAMSE Meeting, St. Andrews,

Scotland.

David DeVault, Kallirroi Georgila, Ron Artstein, Fabrizio Morbini, David Traum, Stefan

Scherer, Louis-Philippe Morency, et al. 2013. Verbal indicators of psychological distress

in interactive dialogue with a virtual human. In Proceedings of the SIGDIAL 2013

Conference. pages 193–202.

David DeVault, Kenji Sagae, and David Traum. 2011a. Incremental interpretation and

prediction of utterance meaning for interactive dialogue. Dialogue & Discourse 2(1):143–

170.

David DeVault, Kenji Sagae, and David Traum. 2011b. Incremental interpretation and predic-

tion of utterance meaning for interactive dialogue. Dialogue and Discourse 2(1):143–170.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-

training of deep bidirectional transformers for language understanding. In Proceedings of

the 2019 Conference of the North American Chapter of the Association for Computational

Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pages

4171–4186.

Layla El Asri, Hannes Schulz, Shikhar Sharma, Jeremie Zumer, Justin Harris, Emery Fine,

Rahul Mehrotra, and Kaheer Suleman. 2017. Frames: a corpus for adding memory to

goal-oriented dialogue systems. In Proceedings of the 18th Annual SIGdial Meeting

on Discourse and Dialogue. Association for Computational Linguistics, Saarbrücken,

Germany, pages 207–219. https://doi.org/10.18653/v1/W17-5526.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008.

Liblinear: A library for large linear classification. Journal of machine learning research

9(Aug):1871–1874.

Luciana Ferrer, Elizabeth Shriberg, and Andreas Stolcke. 2002. Is the speaker done yet?

faster and more accurate end-of-utterance detection using prosody. In Seventh Interna-

tional Conference on Spoken Language Processing.

Roy Thomas Fielding. 2000. Rest: architectural styles and the design of network-based

software architectures. Doctoral dissertation, University of California .

Catherine Finegan-Dollak, Jonathan K Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh

Sadasivam, Rui Zhang, and Dragomir Radev. 2018. Improving text-to-sql evaluation

methodology. In Proceedings of the 56th Annual Meeting of the Association for Compu-

tational Linguistics (Volume 1: Long Papers). pages 351–360.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep

feedforward neural networks. In Proceedings of the 13th International Conference on

Artificial Intelligence and Statistics (AISTATS). volume 9, pages 249–256.

Kyle Gorman and Steven Bedrick. 2019. We need to talk about standard splits. In Pro-

ceedings of the 57th Conference of the Association for Computational Linguistics. pages

2786–2791.

Geoffrey E Hinton. 2002. Training products of experts by minimizing contrastive divergence.

Neural computation 14(8):1771–1800.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput.

9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with

Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Matthew Hutson. 2018. Artificial intelligence faces reproducibility crisis. Science

359(6377):725–726. https://doi.org/10.1126/science.359.6377.725.

Evan Jaffe, Michael White, William Schuler, Eric Fosler-Lussier, Alex Rosenfeld, and

Douglas Danforth. 2015. Interpreting questions with a log-linear ranking model in a

virtual patient dialogue system. In Proceedings of the Tenth Workshop on Innovative Use

of NLP for Building Educational Applications. pages 86–96.

Lifeng Jin, David King, Amad Hussein, Michael White, and Douglas Danforth. 2018.

Using paraphrasing and memory-augmented models to combat data sparsity in question

interpretation with a virtual patient dialogue system. In Proceedings of the Thirteenth

Workshop on Innovative Use of NLP for Building Educational Applications. pages 13–23.

Lifeng Jin, Michael White, Evan Jaffe, Laura Zimmerman, and Douglas Danforth. 2017.

Combining cnns and pattern matching for question interpretation in a virtual patient

dialogue system. In Proceedings of the 12th Workshop on Innovative Use of NLP for

Building Educational Applications. pages 11–21.

Shehroz S Khan and Michael G Madden. 2009. A survey of recent trends in one class

classification. In Irish conference on artificial intelligence and cognitive science. Springer,

pages 188–197.

Chandra Khatri, Behnam Hedayatnia, Anu Venkatesh, Jeff Nunn, Yi Pan, Qing Liu, Han

Song, Anna Gottardi, Sanjeev Kwatra, Sanju Pancholi, et al. 2018. Advancing the

state of the art in open domain dialog systems through the alexa prize. arXiv preprint

arXiv:1812.10757 .

Joo-Kyung Kim and Young-Bum Kim. 2018. Joint learning of domain classification and

out-of-domain detection with dynamic class weighting for satisficing false acceptance

rates. Proc. Interspeech 2018 pages 556–560.

Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. Proceedings

of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP

2014) pages 1746–1751.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization.

arXiv preprint arXiv:1412.6980 .

Aniket Kittur, Ed H Chi, and Bongwon Suh. 2008. Crowdsourcing user studies with

mechanical turk. In Proceedings of the SIGCHI conference on human factors in computing

systems. ACM, pages 453–456.

Mike Lewis, Denis Yarats, Yann Dauphin, Devi Parikh, and Dhruv Batra. 2017. Deal

or no deal? end-to-end learning of negotiation dialogues. In Proceedings of the 2017

Conference on Empirical Methods in Natural Language Processing. pages 2443–2453.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-

promoting objective function for neural conversation models. In Proceedings of the

2016 Conference of the North American Chapter of the Association for Computational

Linguistics: Human Language Technologies. pages 110–119.

Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017a. Ad-

versarial learning for neural dialogue generation. In Proceedings of the 2017 Conference

on Empirical Methods in Natural Language Processing. pages 2157–2169.

Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017b.

Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Confer-

ence on Empirical Methods in Natural Language Processing. pages 2157–2169.

Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou,

and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. arXiv preprint

arXiv:1703.03130 .

Bing Liu, Wee Sun Lee, Philip S Yu, and Xiaoli Li. 2002. Partially supervised classification

of text documents. In ICML. Citeseer, volume 2, pages 387–394.

Chia-Wei Liu, Ryan Lowe, Iulian Vlad Serban, Mike Noseworthy, Laurent Charlin, and

Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study

of unsupervised evaluation metrics for dialogue response generation. In Proceedings

of the 2016 Conference on Empirical Methods in Natural Language Processing. pages

2122–2132.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-sne. Journal

of machine learning research 9(Nov):2579–2605.

Abhijit Mahabal, Jason Baldridge, Burcu Karagol Ayan, Vincent Perot, and Dan Roth. 2019.

Text classification with few examples using controlled generalization. In Proceedings of

the 2019 Conference of the North American Chapter of the Association for Computational

Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pages

3158–3167.

Abhijit Mahabal, Dan Roth, and Sid Mittal. 2018. Robust handling of polysemy via

sparse representations. In Proceedings of the Seventh Joint Conference on Lexical and

Computational Semantics. pages 265–275.

Kellen R. Maicher, Laura Zimmerman, Bruce Wilcox, Beth Liston, Holly Cronau, Allison

Macerollo, Lifeng Jin, Evan Jaffe, Michael White, Eric Fosler-Lussier, William Schuler,

David P. Way, and Douglas R. Danforth. 2019. Using virtual standardized patients to

accurately assess information gathering skills in medical students. Medical Teacher

0(0):1–7. PMID: 31230496. https://doi.org/10.1080/0142159X.2019.1616683.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S Corrado, and Jeffrey Dean. 2013. Dis-

tributed Representations of Words and Phrases and their Compositionality. In Advances

in Neural Information Processing Systems 26 (NIPS 2013). pages 3111–3119.

Robert Moore, Douglas Appelt, John Dowding, J Mark Gawron, and Douglas Moran. 1995.

Combining linguistic and statistical knowledge sources in natural-language processing

for atis. In Proceedings of the ARPA Spoken Language Systems Technology Workshop.

pages 261–264.

Fabrizio Morbini, Eric Forbell, David DeVault, Kenji Sagae, David Traum, and Albert Rizzo.

2012. A mixed-initiative conversational dialogue system for healthcare. In Proceedings

of the 13th annual meeting of the special interest group on discourse and dialogue. pages

137–139.

Vinod Nair and Geoffrey E Hinton. 2010. Rectified Linear Units Improve Restricted

Boltzmann Machines. In Proceedings of the 27th International Conference on Machine

Learning (ICML 2010). 3, pages 807–814.

Masanobu Nakamura, Koji Iwano, and Sadaoki Furui. 2008. Differences between acoustic

characteristics of spontaneous and read speech and their effects on speech recognition

performance. Computer Speech & Language 22(2):171–184.

Kamal Nigam, John Lafferty, and Andrew McCallum. 1999. Using maximum entropy for

text classification. In IJCAI-99 workshop on machine learning for information filtering.

Stockholom, Sweden, volume 1, pages 61–67.

Josef R Novak, Nobuaki Minematsu, and Keikichi Hirose. 2012. Wfst-based grapheme-

to-phoneme conversion: Open source tools for alignment, model-building and decoding.

In Proceedings of the 10th International Workshop on Finite State Methods and Natural

Language Processing. pages 45–49.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary

DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic

differentiation in PyTorch. In NIPS Autodiff Workshop.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blon-

del, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,

M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python.

Journal of Machine Learning Research 12:2825–2830.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors

for word representation. In Proceedings of the 2014 conference on empirical methods in

natural language processing (EMNLP). pages 1532–1543.

Pramuditha Perera and Vishal M Patel. 2019. Learning deep features for one-class classifi-

cation. IEEE Transactions on Image Processing .

Madelaine Plauché, Özgür Çetin, and Udhaykumar Nallasamy. 2008. How to build a spoken

dialog system with limited (or no) language resources. In AI in ICT4D. ICFAI University

Press, India.

Martin F Porter. 2001. Snowball: A language for stemming algorithms.

Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar,

Xingyu Na, Yiming Wang, and Sanjeev Khudanpur. 2016. Purely sequence-trained neural

networks for asr based on lattice-free mmi. In Interspeech. pages 2751–2755.

Ashwin Ram, Rohit Prasad, Chandra Khatri, Anu Venkatesh, Raefer Gabriel, Qing Liu, Jeff

Nunn, Behnam Hedayatnia, Ming Cheng, Ashish Nagar, et al. 2018. Conversational ai:

The science behind the alexa prize. arXiv preprint arXiv:1801.03604 .

Matthew Roddy, Gabriel Skantze, and Naomi Harte. 2018. Multimodal continuous turn-

taking prediction using multiscale rnns. In Proceedings of the 20th ACM International

Conference on Multimodal Interaction. ACM, New York, NY, USA, ICMI ’18, pages

186–190. https://doi.org/10.1145/3242969.3242997.

Herbert Rubenstein and John B Goodenough. 1965. Contextual correlates of synonymy.

Communications of the ACM 8(10):627–633.

Ruhi Sarikaya, Agustin Gravano, and Yuqing Gao. 2005. Rapid language model development

using external resources for new spoken dialog domains. In Acoustics, Speech, and Signal

Processing, 2005. Proceedings.(ICASSP’05). IEEE International Conference on. IEEE,

volume 1, pages I–573.

Emanuel Schegloff, Gail Jefferson, and Harvey Sacks. 1974. A simplest systematics for the

organization of turn-taking for conversation. Language 50(4):696–735.

Bernhard Schölkopf, Robert C Williamson, Alex J Smola, John Shawe-Taylor, and John C

Platt. 2000. Support vector method for novelty detection. In Advances in neural informa-

tion processing systems. pages 582–588.

Prashant Serai, Adam Stiff, and Eric Fosler-Lussier. 2020. End to end speech recogni-

tion error prediction with sequence to sequence learning. In ICASSP 2020-2020 IEEE

International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.

Prashant Serai, Peidong Wang, and Eric Fosler-Lussier. Improving speech recognition error prediction for modern and off-the-shelf speech recognizers. Submitted to the 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau.

2015. Hierarchical neural network generative models for movie dialogues. arXiv preprint

arXiv:1507.04808 7(8).

Matt Shannon, Gabor Simko, Shuo-Yiin Chang, and Carolina Parada. 2017. Improved end-

of-query detection for streaming speech recognition. In Interspeech. pages 1909–1913.

Gabriel Skantze. 2017. Towards a general, continuous model of turn-taking in spoken

dialogue using lstm recurrent neural networks. In Proceedings of the 18th Annual SIGdial

Meeting on Discourse and Dialogue. pages 220–230.

Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret

Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to

context-sensitive generation of conversational responses. In Proceedings of the 2015 Con-

ference of the North American Chapter of the Association for Computational Linguistics:

Human Language Technologies. pages 196–205.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdi-

nov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal

of Machine Learning Research 15:1929–1958.

Adam Stiff, Prashant Serai, and Eric Fosler-Lussier. 2019. Improving human-computer

interaction in low-resource settings with text-to-phonetic data augmentation. In ICASSP

2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing

(ICASSP). IEEE, pages 7320–7324.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with

neural networks. In Advances in neural information processing systems. pages 3104–3112.

Thomas B Talbot, Kenji Sagae, Bruce John, and Albert A Rizzo. 2012. Sorting out the virtual

patient: how to exploit artificial intelligence, game technology and sound educational

practices to create engaging role-playing simulations. International Journal of Gaming

and Computer-Mediated Simulations (IJGCMS) 4(3):1–19.

David MJ Tax and Robert PW Duin. 1999. Support vector domain description. Pattern

recognition letters 20(11-13):1191–1199.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. Bert rediscovers the classical nlp

pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational

Linguistics. pages 4593–4601.

David Traum, Priti Aggarwal, Ron Artstein, Susan Foutz, Jillian Gerten, Athanasios Kat-

samanis, Anton Leuski, Dan Noren, and William Swartout. 2012. Ada and grace: Direct

interaction with museum visitors. In International conference on intelligent virtual agents.

Springer, pages 245–251.

AM Turing. 1950. Computing machinery and intelligence. Mind LIX:433–60.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N

Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances

in neural information processing systems. pages 5998–6008.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing

multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

pages 5797–5808.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix

Hill, Omer Levy, and Samuel Bowman. 2019. Superglue: A stickier benchmark for

general-purpose language understanding systems. In Advances in Neural Information

Processing Systems. pages 3261–3275.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman.

2018. Glue: A multi-task benchmark and analysis platform for natural language under-

standing. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and

Interpreting Neural Networks for NLP. pages 353–355.

Mingxuan Wang, Zhengdong Lu, Hang Li, and Qun Liu. 2015. Syntax-based deep matching

of short texts. In Twenty-Fourth International Joint Conference on Artificial Intelligence.

Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno,

Nelson-Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al. 2018.

Espnet: End-to-end speech processing toolkit. Proc. Interspeech 2018 pages 2207–2211.

Joseph Weizenbaum. 1966. Eliza—a computer program for the study of natural language

communication between man and machine. Communications of the ACM 9(1):36–45.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2015. Memory Networks. In ICLR.

pages 1–15.

Martijn Wieling, Josine Rawee, and Gertjan van Noord. 2018. Reproducibility in computa-

tional linguistics: Are we willing to share? Computational Linguistics 44(4):641–649.

https://doi.org/10.1162/coli_a_00330.

Bruce Wilcox. 2019. Chatscript. [Online; accessed 23-July-2019].

https://github.com/ChatScript/ChatScript.

Bruce Wilcox and Sue Wilcox. 2013. Making it real: Loebner-winning chatbot design.

ARBOR Ciencia, Pensamiento y Cultura 189(764):10–3989.

D.H. Wolpert. 1992. Stacked generalization. Neural Networks 5(2):241–259.

Yu Wu, Wei Wu, Chen Xing, Ming Zhou, and Zhoujun Li. 2017. Sequential matching net-

work: A new architecture for multi-turn response selection in retrieval-based chatbots. In

Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics

(Volume 1: Long Papers). pages 496–505.

Wayne Xiong, Lingfeng Wu, Fil Alleva, Jasha Droppo, Xuedong Huang, and Andreas

Stolcke. 2018. The microsoft 2017 conversational speech recognition system. In 2018

IEEE international conference on acoustics, speech and signal processing (ICASSP).

IEEE, pages 5934–5938.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V

Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. In

Advances in neural information processing systems. pages 5754–5764.

Ziyu Yao, Yu Su, Huan Sun, and Wen-tau Yih. 2019. Model-based interactive semantic

parsing: A unified framework and a text-to-sql case study. In Proceedings of the 2019

Conference on Empirical Methods in Natural Language Processing and the 9th Inter-

national Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pages

5450–5461.

Hwanjo Yu. 2005. Single-class classification with mapping convergence. Machine Learning

61(1-3):49–69.

Hwanjo Yu, Jiawei Han, and Kevin Chen-Chuan Chang. 2004. Pebl: Web page classification

without negative examples. IEEE Transactions on Knowledge & Data Engineering

(1):70–81.

Matthew D. Zeiler. 2012. ADADELTA: An Adaptive Learning Rate Method. CoRR.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng

Gao, Jingjing Liu, and Bill Dolan. 2019. Dialogpt: Large-scale generative pre-training

for conversational response generation. arXiv preprint arXiv:1911.00536 .

Zhuosheng Zhang, Jiangtong Li, Pengfei Zhu, Hai Zhao, and Gongshen Liu. 2018. Modeling

multi-turn conversation with deep utterance aggregation. In Proceedings of the 27th

International Conference on Computational Linguistics. pages 3740–3752.

Xiangyang Zhou, Lu Li, Daxiang Dong, Yi Liu, Ying Chen, Wayne Xin Zhao, Dianhai

Yu, and Hua Wu. 2018. Multi-turn response selection for chatbots with deep attention

matching network. In Proceedings of the 56th Annual Meeting of the Association for

Computational Linguistics (Volume 1: Long Papers). pages 1118–1127.

George Kingsley Zipf. 1949. Human behavior and the principle of least effort.
