Combining CNNs and Pattern Matching for Question Interpretation in a Virtual Patient Dialogue System

Lifeng Jin¹, Michael White¹, Evan Jaffe¹, Laura Zimmerman² and Douglas Danforth²
¹Department of Linguistics, ²Department of Family Medicine
The Ohio State University, Columbus, OH, USA

Abstract

For medical students, virtual patient dialogue systems can provide useful training opportunities without the cost of employing actors to portray standardized patients. This work utilizes word- and character-based convolutional neural networks (CNNs) for question identification in a virtual patient dialogue system, outperforming a strong word- and character-based logistic regression baseline. While the CNNs perform well given sufficient training data, the best system performance is ultimately achieved by combining CNNs with a hand-crafted pattern matching system that is robust to label sparsity, providing a 10% boost in system accuracy and an error reduction of 47% as compared to the pattern-matching system alone.

1 Introduction

Standardized Patients (SPs) are actors who play the part of a patient with a specific medical history and pathology. Medical students interact with SPs to train skills like taking a patient history and developing a differential diagnosis. However, SPs are expensive and can behave inconsistently from student to student. A virtual patient dialogue system aims to overcome these issues as well as provide a means of automated evaluation of the medical student's interaction with the patient (see Figure 1).

Figure 1: Virtual Patient avatar used to train medical students

Previous work with a hand-crafted pattern-matching system called ChatScript (Danforth et al., 2009, 2013) used a 3D avatar and allowed students to input questions using text or speech. ChatScript matches input text using hand-written patterns and outputs a scripted response for each input question identified by the system. While pattern matching with ChatScript can achieve relatively high accuracy with sufficient pattern-writing skill and effort, it is unable to take advantage of large amounts of training data, is somewhat brittle regarding misspellings, and is difficult to maintain as new questions and patterns are added.

With an apparent plateau in system performance, this work explores new data-driven methods. In particular, we use convolutional neural networks with both words and characters as input, demonstrating a significant improvement in overall question identification accuracy relative to a strong multiclass logistic regression baseline. Furthermore, inspired by the different error patterns of the ChatScript and CNN systems, we develop a simple system combination using a binary classifier that results in the highest overall performance, achieving a remarkable 47% reduction in error in comparison to the ChatScript system alone. Frequency quantile analysis shows that the hybrid system is able to leverage the relatively higher performance of ChatScript on the infrequent label items, while also taking advantage of the CNN system's superior accuracy where more data is available for training.

2 Related Work

Question identification has been formulated as at least two distinct tasks. Multi-class logistic regression is a standard approach that can take advantage of class-specific features but requires a good amount of training data for each class. A pairwise setup involves a more general binary classification decision which is then made for each label, choosing the highest confidence match.

Early work (Ravichandran et al., 2003) found that treating a question answering task as a maximum entropy re-ranking problem outperformed using the same system as a classifier. DeVault et al. (2011) observed that maximum entropy systems performed well with simple n-gram features. Jaffe et al. (2015) explored a log-linear pairwise ranking model for question identification and found it to outperform a multiclass baseline along the lines of DeVault et al. However, Jaffe et al. (2015) used a much smaller dataset with only about 915 user turns, less than one-fourth as many as in the current dataset. For this larger dataset, multiclass logistic regression outperforms a pairwise ranking model. With no pairwise comparisons, a multiclass classifier is also much faster, lending itself to real-time use.

It is probable that the overall effectiveness of multiclass vs. pairwise approaches depends on the amount of training data; pairwise ranking methods have potential advantages for cross-domain and one-shot learning tasks (Vinyals et al., 2016) where data is sparse or non-existent. In the closely related task of short-answer scoring, Sakaguchi et al. (2015a) found that pairwise methods could be effectively combined with regression-based approaches to improve performance in sparse-data cases.

Other work involving dialogue utterance classification has traditionally required a large amount of data. For example, Suendermann et al. (2009) acquired 500,000 dialogues with over 2 million utterances, observing that statistical systems outperform rule-based ones as the amount of data increases. Crowdsourcing for collecting additional dialogues (Ramanarayanan et al., 2017) could alleviate data sparsity problems for rare categories by providing additional training examples, but this technique is limited to more general domains that do not require special training/skills. In the current medical domain, workers on common crowdsourcing platforms are unlikely to have the expertise required to take a patient's medical history in a natural way, so any data collected with this method would likely suffer quality issues and fail to generalize to real medical student dialogues. Rossen and Lok (2012) have developed an approach for collecting dialogue data for virtual patient systems, but their approach does not directly address the issue that even as the number of dialogues collected increases, there can remain a long tail of relevant but infrequently asked questions.

CNNs have been used to great effect for image identification (Krizhevsky et al., 2012) and are becoming common for natural language processing. In general, CNNs are used for convolution over input language sequences, where the input is often a matrix representing a sequence of word embeddings (Kim, 2014). Intuitively, word embedding kernels are convolving n-grams, ultimately generating features that represent n-grams over word vectors of length equal to the kernel width. CNNs are very popular in systems for tasks like paraphrase detection (Yin and Schütze, 2015; Yin et al., 2016; He et al., 2015), community question answering (Das et al., 2016; Barbosa et al., 2016) and even machine translation (Gehring et al., 2017). Character-based models that embed individual characters as input units are also possible, and have been used for language modeling (Kim et al., 2016) to good effect. It is worth noting that character sequences are more robust to spelling errors and potentially have the same expressive capability as word sequences given long enough character sequences.

3 Dataset

The dataset consists of 94 dialogues of medical students interacting with the ChatScript system. The ChatScript system has been deployed in a medical school to assess students' ability to interact with patients through a text-based interface. The questions typed by the students and the responses given by ChatScript, which are then hand-corrected by annotators, form this dataset. There are 4330 total user turns, with a mean of 46.1 turns per dialogue. Each turn consists of the question the student asked, ChatScript's automatic label (with hand-correction) and the scripted response associated with the label. An example turn could be represented with the tuple (‘hello mr. wilkins, how are you doing today?’, ‘how are you’, ‘well i would be doing pretty well if my back weren’t hurting so badly.’). The task is to predict the label of the asked question.

Figure 2: Label frequency distribution is extremely long-tailed, with few frequent labels and many infrequent labels. Values are shown above quintile boundaries.

There are 359 unique labels, with a mean of 12 instances per label, median of 4, and large standard deviation of 20. Of note, the distribution of labels is extremely long-tailed (Figure 2), with 8 of the most common labels accounting for nearly 20% of the data, while the bottom 20% includes 265 infrequent labels. The most frequent label occurs 156 times.

[...] value for each window.

$$c_i = \sigma(w \cdot x_{i:i+h-1} + b) \qquad (1)$$

In Eq. 1, $b$ is a scalar and $\sigma$ is a non-linearity. The feature map $c \in \mathbb{R}^{|s|-h+1}$ for this kernel is the concatenation of all the feature values from the convolution. In order to maintain fixed dimension for the output, max-over-time pooling (Collobert et al., 2011) is applied to the feature map and the maximum value $\hat{c}$ is extracted from $c$.

Because there are many kernels for each kernel height $h$, the output from a group of kernels with the same height is $o_h = [\hat{c}_1, \hat{c}_2, \ldots, \hat{c}_{n_h}]$, where $n_h$ is the number of kernels for the kernel width $h$. We concatenate all the outputs from all the kernels into a single vector $o \in \mathbb{R}^{N}$ where $N = \sum_h n_h$, and apply a linear transformation with the softmax non-linearity to it as the final fully-connected neural network layer for the CNN.

$$\hat{y} = \mathrm{softmax}(W_l o + b_l) \qquad (2)$$

where $W_l \in \mathbb{R}^{N \times m}$ is the weight matrix of the final layer, $b_l \in \mathbb{R}^{m}$ is the bias term for the final layer, and $m$ is the number of classes that we are trying to predict.

4.1 Regularization

We follow Kim (2014) for regularization strate-
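
The forward pass described by Eqs. 1 and 2 can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. It assumes ReLU for the non-linearity σ (the text only says "a non-linearity"), uses random toy weights and embeddings, and the names `feature_map`, `cnn_forward` and `rand_matrix` are hypothetical. Because the text gives W_l as an N×m matrix, the sketch computes the class scores as oᵀW_l + b_l.

```python
import math
import random


def relu(z):
    # Assumed choice for sigma; the paper text only says "a non-linearity".
    return max(0.0, z)


def feature_map(sentence, kernel, bias):
    """Eq. 1: c_i = sigma(w . x_{i:i+h-1} + b) for every window of h embeddings."""
    h = len(kernel)  # kernel height = number of tokens covered per window
    c = []
    for i in range(len(sentence) - h + 1):
        window = sentence[i:i + h]
        z = sum(w * x
                for w_row, x_row in zip(kernel, window)
                for w, x in zip(w_row, x_row))
        c.append(relu(z + bias))
    return c  # length |s| - h + 1


def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cnn_forward(sentence, kernels_by_height, W_l, b_l):
    """Max-over-time pooling per kernel, concatenation into o, then Eq. 2."""
    o = []
    for h in sorted(kernels_by_height):
        for kernel, bias in kernels_by_height[h]:
            o.append(max(feature_map(sentence, kernel, bias)))  # c-hat
    # W_l is stored as N x m, so class k's score is sum_n o[n] * W_l[n][k] + b_l[k].
    scores = [sum(o[n] * W_l[n][k] for n in range(len(o))) + b_l[k]
              for k in range(len(b_l))]
    return o, softmax(scores)


# Toy setup (all values are illustrative assumptions, not from the paper).
rng = random.Random(0)
d, m = 4, 3  # embedding size and number of classes


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


sentence = rand_matrix(6, d)  # six tokens, each a d-dimensional embedding
kernels_by_height = {h: [(rand_matrix(h, d), 0.1) for _ in range(2)]
                     for h in (2, 3)}
N = sum(len(group) for group in kernels_by_height.values())
W_l = rand_matrix(N, m)
b_l = [0.0] * m

o, y_hat = cnn_forward(sentence, kernels_by_height, W_l, b_l)
print(len(o), len(y_hat), sum(y_hat))
```

Pooling reduces every feature map to a single value, so the length of `o` depends only on the number of kernels and not on the sentence length. This is what lets one fixed-size final layer handle questions of any length.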
