Mitigation of Data Scarcity Issues for Semantic Classification in a Virtual Patient Dialogue Agent

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Adam Stiff, BS

Graduate Program in Computer Science and Engineering

The Ohio State University

2020

Dissertation Committee:

Eric Fosler-Lussier, Advisor
Michael White
Yu Su

© Copyright by

Adam Stiff

2020

Abstract

We introduce a virtual patient question-answering dialogue system, used for training

medical students to interview real patients, which presents many unique opportunities for

research in linguistics, speech, and dialogue. Among the most challenging research topics at

this point in the system’s development are issues relating to scarcity of training data. We

address three main problems.

The first challenge is that many questions are very rarely asked of the virtual patient, which leaves little data to learn adequate models of these questions. We validate one

approach to this problem, which is to combine a statistical question classification model with a rule-based system, by deploying it in an experiment with live users. Additional work

further improves rare question performance by utilizing a recurrent neural network model with a multi-headed self-attention mechanism. We contribute an analysis of the reasons

for this improved performance, highlighting specialization and overlapping concerns in

independent components of the model.

Another data scarcity problem for the virtual patient project is the challenge of adequately

characterizing questions that are deemed out-of-scope. By definition, these types of questions

are infinite, so this problem is particularly challenging. We contribute a characterization of

the problem as it manifests in our domain, as well as a baseline approach to handling the

issue, and an analysis of the corresponding improvement in performance.

Finally, we contribute a method for improving performance of domain-specific tasks

such as ours, which use off-the-shelf speech recognition outputs as inputs, when no in-domain

speech data is available. This method augments text training data for the downstream task with inferred phonetic representations, to make the downstream task tolerant of speech

recognition errors. We also see performance improvements from sampling simulated errors

to replace the text inputs during training. Future enhancements to the spoken dialogue

capabilities of the virtual patient are also considered.

Dedicated to Elizabeth, without whose support—and patience—

this would not have been possible.

Acknowledgments

Nothing written here will ever be adequate to acknowledge the full depth and breadth

of the myriad of interactions that have contributed to me getting to this point, so I am

embracing imperfection and keeping this short. Just know that if you are reading this, I

undoubtedly have a reason to thank you, and I hope that I have the wisdom to appreciate what that reason is.

I would like to extend special thanks to all of my collaborators whose efforts contributed

to work described in this dissertation. Chapter 3 includes descriptions of some work by

Kellen Maicher, Doug Danforth, Evan Jaffe, Marisa Scholl, and Mike White; Chapter 4

includes contributions from Mike White, Eric Fosler-Lussier, Lifeng Jin, Evan Jaffe, and

Doug Danforth; Chapter 5 was a collaboration with Qi Song and Eric Fosler-Lussier; and

Chapter 6 was coauthored with Prashant Serai and Eric Fosler-Lussier.

Cheers, thanks, and best wishes to the friends made along the way: Deblin, Denis, Peter,

Chaitanya, Joo-Kyung, Prashant, Ahmad, Manirupa, and the innumerable others who have

also provided invaluable insight, support, humor, feedback, or otherwise made this journey

a little more enjoyable.

Finally, I want to express my extreme gratitude to Eric Fosler-Lussier for his mentorship

and efforts in helping/pushing/dragging me over the finish line. Thanks for the guidance when I needed it, for the space to learn how to guide myself, and for the patience that I

probably didn’t deserve.

Vita

May 2020 ...... PhD, Computer Science and Engineering, The Ohio State University, USA.
March 2006 ...... BS, Biochemistry, The Ohio State University, USA.

Publications

Research Publications

Adam Stiff, Qi Song, and Eric Fosler-Lussier. How Self-Attention Improves Rare Class Performance. In Proceedings of the 21st Annual SIGdial Meeting on Discourse and Dialogue. Association for Computational Linguistics, 2020.

Prashant Serai, Adam Stiff, and Eric Fosler-Lussier. End to end speech recognition error prediction with sequence to sequence learning. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6339-6343. IEEE, 2020.

Adam Stiff, Prashant Serai, and Eric Fosler-Lussier. Improving human-computer interaction in low-resource settings with text-to-phonetic data augmentation. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7320-7324. IEEE, 2019.

Deblin Bagchi, Peter Plantinga, Adam Stiff, and Eric Fosler-Lussier. Spectral feature mapping with mimic loss for robust speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5609-5613. IEEE, 2018.

Fields of Study

Major Field: Computer Science and Engineering

Studies in:
Artificial Intelligence: Prof. Eric Fosler-Lussier
Computational Linguistics: Prof. Michael White
Computational Neuropsychology: Prof. Alexander Petrov

Table of Contents

Page

Abstract ...... ii

Dedication ...... iv

Acknowledgments ...... v

Vita ...... vi

List of Tables ...... xi

List of Figures ...... xii

1. Introduction ...... 1

1.1 Overview and Contributions ...... 5

2. Background and Related Work ...... 8

2.1 Issues in Dialogue ...... 9
    2.1.1 Response Generation ...... 10
    2.1.2 Natural Language Understanding ...... 14
    2.1.3 Spoken Dialogue ...... 17
2.2 Text Classification ...... 19
2.3 One Class Classification ...... 22
2.4 Speech Recognition ...... 23
2.5 Other Virtual Patients ...... 24
2.6 Conclusion ...... 25

3. System Overview ...... 26

3.1 Data ...... 26
    3.1.1 Annotation ...... 28
    3.1.2 Core Data Set ...... 30
    3.1.3 Enhanced Data Set ...... 32
    3.1.4 Spoken Data Set ...... 32
    3.1.5 Strict Relabeling ...... 35
3.2 System Description ...... 36
    3.2.1 Front End Client ...... 37
    3.2.2 Back End Web Service ...... 38
    3.2.3 Response Model ...... 39
    3.2.4 Miscellaneous ...... 43
    3.2.5 Spoken Interface Extensions ...... 43
3.3 Challenges and Opportunities ...... 46
3.4 Conclusion ...... 49

4. Hybrid System Deployment and Out-of-Scope Handling ...... 50

4.1 Experiment 1: Baseline Reproduction ...... 51
    4.1.1 Stacked CNN ...... 54
    4.1.2 Training ...... 57
    4.1.3 Baseline ...... 58
    4.1.4 CNN Results ...... 58
    4.1.5 Hybrid System ...... 61
4.2 Experiment 2: Prospective Replication ...... 64
    4.2.1 Experimental Design ...... 65
    4.2.2 Results and Discussion ...... 66
4.3 Experiment 3: Out-of-Scope Improvements ...... 74
    4.3.1 Models ...... 77
    4.3.2 Training and Evaluation ...... 79
    4.3.3 Baseline ...... 82
    4.3.4 Results ...... 83
    4.3.5 Discussion ...... 85
4.4 Conclusion ...... 91

5. How Self-Attention Improves Rare Class Performance ...... 94

5.1 Introduction ...... 95
5.2 Related Work ...... 97
5.3 Task and Data ...... 99
5.4 Experimental Design and Results ...... 100
5.5 Analysis ...... 104
    5.5.1 Why did BERT perform less well? ...... 104
    5.5.2 Analyzing the Self-attention RNN ...... 105
5.6 Conclusion ...... 111

6. Modality Transfer with Text-to-Phonetic Data Augmentation ...... 116

6.1 Introduction ...... 117
6.2 Related Work ...... 119
6.3 Model ...... 120
6.4 Experiments ...... 121
    6.4.1 Data ...... 122
    6.4.2 Experimental details ...... 123
6.5 Results ...... 124
6.6 Discussion ...... 126
6.7 Future Work ...... 127

7. Conclusion and Future Work ...... 129

Appendices 134

A. Label Sets ...... 134

A.1 Strict Relabeling Labels ...... 134

B. Table Definitions ...... 137

List of Tables

Table Page

3.1 An example conversation opening...... 27

3.2 The implemented REST-like API...... 40

3.3 Extensions to the back end API...... 45

4.1 Mean accuracy across ten folds, with standard deviations...... 58

4.2 Features used by the chooser...... 61

4.3 Hybrid system results...... 63

4.4 Summary statistics of collected data...... 66

4.5 Raw accuracies of control and test...... 67

4.6 Top changes in class frequency from core to enhanced data...... 70

4.7 Adjusted accuracies of control and test...... 73

4.8 New features used by the Three-way chooser...... 80

4.9 Development accuracy with additional features ...... 83

4.10 Effects of retraining on test performance ...... 83

4.11 Test results for three-way choosers with retraining ...... 84

5.1 Dev set results comparing different models...... 103

6.1 Test set question classification accuracy...... 125

List of Figures

Figure Page

1.1 The user interface of the Virtual Patient...... 2

3.1 Frequency of labels in the core dataset by rank, with quintiles by color . . . 31

3.2 Overview of client-server architecture...... 40

3.3 System architecture with spoken interface extensions...... 46

4.1 Overview of the stacked CNN architecture...... 53

4.2 Accuracy of the tested models by label quintiles...... 60

4.3 Accuracy of hybrid system by label quintiles...... 72

4.4 A simple illustration of scope and domain boundaries...... 75

4.5 t-SNE plot of Multilabel chooser training data ...... 86

4.6 Quintile accuracies for multilabel components, with baselines ...... 90

5.1 The self-attentive RNN model...... 101

5.2 Quintile accuracies for the tested RNN and CNN baseline ...... 106

5.3 Example inputs with bottleneck attention head representations...... 108

5.4 A rare class borrows representations from two frequent classes...... 109

5.5 Excerpt of dendrogram from agglomerative clustering of full representations of head 3...... 114

5.6 Second excerpt of dendrogram from agglomerative clustering of full representations of head 3...... 115

6.1 Overview of the classification model...... 121

Chapter 1: Introduction

In order to help their patients, physicians need to be able to figure out what the patient’s problem really is. The first line of attack in that process is a very simple one: asking questions. Of course, the circumstances of each medical case are different, so the task of finding the right questions to ask is nontrivial. Medical students are given explicit instruction and training to develop this skill, and must demonstrate proficiency to pass board examinations. Traditionally, this training and testing is performed by hiring an actor to play the part of a patient visiting a doctor to pursue treatment for some ailment (known as a standardized patient). This arrangement introduces some practical problems which are mitigated by the introduction of a virtual patient, which is the basis of this dissertation.

Perhaps the most obvious issues with patient actors are cost and convenience: actors must be paid and trained, so individual student access to the patient actors is necessarily limited, and must be carefully coordinated across a cohort of students. Further, due to the cost, any extra training time that students may need or want might not be available.

More subtle, but of at least equal importance, are the issues of consistency and feedback latency. An actor sitting in an examination room for a full day, to give the same details to dozens of interviewers, will understandably become bored or fatigued. This can lead to interactions which are less instructive for the students scheduled at the end of the day, e.g., a patient providing answers to questions that were not asked, to speed up the interview.

Figure 1.1: The user interface of the Virtual Patient. After the user presses the Enter key, the patient will respond with, “Pretty good, except for this back pain.”

This can also introduce assessment problems: if a student receives important information

they didn’t ask for, the instructor cannot evaluate whether or not they knew to ask. An

actor may also develop clarification strategies over the course of a day, which make earlier

interactions more challenging. More generally, simple human error and variability can

produce substantially different experiences for different students, which is usually not

ideal in instructional situations. Finally, in order for students to receive feedback on their

performance, the interview must be graded by a medical educator, whose time is necessarily

limited. This can introduce delays from when the interview is performed, which is when

feedback would be most valuable for learning.

A well-built virtual patient (i.e., a chatbot with a graphical avatar, see Figure 1.1) can

address all of these issues. Once developed, a virtual patient can be deployed at a low cost, with constant, concurrent availability through a chosen electronic interface. This low cost

also means that the students can practice with more scenarios and repeat interactions, which

is prohibitively expensive with standardized patient actors. The patient can be programmed

to give consistent responses to equivalent questions, and feedback about topic coverage and

other aspects of student performance can be provided immediately following the interview.

Of course, programming a virtual patient introduces significant technical challenges that

paying an actor does not. The graphical interface must be seamless and believable enough

not to distract from the educational goals of the patient interaction, while ideally retaining

features of non-linguistic communication that are important for doctors to respond to,

including expressions of pain, anxiety, etc. Among the most significant challenges, and

chief among the aims of the present work, is the ability of the virtual patient to correctly

identify the natural language question being asked by the student. This not only directly

impacts the ability of the chatbot to correctly answer the question, thus informing further

questioning, but also affects the ability to rapidly evaluate the student’s identification and

coverage of topics that are relevant to the patient’s condition.

The patient that we devised is modeled as a simple agent. The user

always has the initiative, which creates a simple turn-taking dialog in which the patient

simply responds to the user’s questions. This allows us to treat each question as a somewhat

independent input, and we provide fixed responses to hundreds of known questions. The

primary challenge, then, is to map a wide variety of equivalent natural language inputs into

the questions that the patient is programmed to answer. We call this formulation the question

identification problem. One can envision multiple approaches to this problem, including

ranking inputs for matches against known queries, or directly classifying inputs as one of

many known classes. We adopt the latter approach.

The earliest versions of the Virtual Patient (Danforth et al., 2013) were built with a

3D-graphical patient avatar, using a rule-based dialog management engine called ChatScript

(Wilcox, 2019) to handle the necessary natural language understanding task. ChatScript is a

pattern matching engine that automatically performs some input regularization and analysis,

provides a straightforward pattern matching syntax, maintains dialog state, can remember

facts and dialog history, and responds to input with prepared answers or further questions

(Wilcox and Wilcox, 2013). ChatScript-based chatbots have been successful entrants at

the annual Loebner Prize (Bradeško and Mladenić, 2012), but rule-based approaches to

dialog agents do exhibit some drawbacks. It is something of an art form to write patterns

and answers that are both adequately specific and sufficiently general to correctly match the

questioner’s intent. ChatScript provides many advanced features for interpreting input and

formulating responses, but taking advantage of all of them requires ample expertise and is

labor intensive. Furthermore, as the number of authored patterns increases, the potential

for conflicting patterns and interactions multiplies. For these reasons, there is interest in

incorporating a machine learning (ML) approach.

A successful machine learning system can automatically capture some of the variability

of natural language in a limited domain, given sufficient labeled data. The clinical skills

training setting imposes some important constraints on the ML approach, however. Most

significantly, both the users and annotators are, for the purpose of application development,

experts. Students using the system to develop their skills, besides having all of the background

education necessary for admission to medical school, have already received detailed

instruction about how to interview patients by the time they use the virtual patient application.

Users are trained to query the patient about various aspects of the history of the present

illness, as well as the patient’s past medical history, family history and social history. This

level of expertise makes it infeasible to collect large amounts of data from a crowd sourcing

platform such as Amazon’s Mechanical Turk (Kittur et al., 2008). Furthermore, while this is

an important skill to develop, it is far from the only skill taught in a medical curriculum;

students are busy, and opportunities to capture reasonable numbers of interactions with the

application are limited. Annotation of the correct interpretation of each query also requires

medical expertise and familiarity with the set of possible answers programmed into the virtual patient, since subtle linguistic distinctions may have significant medical implications

(consider, “Have you used drugs?” versus, “Are you using drugs?”). Thus, the virtual patient

is situated in a relatively data-scarce problem space.

Exacerbating the general data scarcity issue is a label imbalance issue. As designed,

there are a relatively small number of questions that nearly everyone asks; on the other hand,

there are also a large number of rarely-asked, but perfectly valid, questions that the patient

should be able to answer. That is, the label frequencies exhibit a long tail, as discussed in

Section 3.1.

1.1 Overview and Contributions

This dissertation describes several contributions in support of the Virtual Patient project.

We start by positioning this work relative to similar projects in the literature, and highlighting

the state of the art with respect to some of the specific challenges of the Virtual Patient

domain, in Chapter 2. Chapter 3 provides a detailed description of the data that defines

the domain, as well as the computational systems that constitute both the end product

and the research framework. The development of this data and framework constitutes a

non-trivial contribution to the literature in itself, but the details about the systems

and data also motivate the specific research questions that are addressed in the subsequent

chapters. The unifying theme of these research questions is scarcity of data. How do we

improve performance on classes with few examples? How do we handle negative classes, which can never be fully characterized by finite data? How can we improve performance in

modalities for which we don’t have exemplary data?

In Chapter 4, we contribute a controlled experiment to determine the efficacy of a previously

developed question identification model in a live production setting. This model was

developed to improve rare class performance, and part of the contribution is a reproduction

of the prior results. This is noteworthy in light of the generally acknowledged “replication

crisis” (Hutson, 2018; Wieling et al., 2018), but the value of conducting replication

experiments is further validated in that the outcome of the live experiment highlights the

unanticipated need for a strategy to handle inputs that fall outside of the scope of questions

that the system is designed to answer. Analysis demonstrates that this is a very difficult

problem, but we implement a baseline strategy to serve as a basis for future work.

Chapter 5 implements a model utilizing multi-headed self-attention that rather dramatically

improves the Virtual Patient’s responses to rare questions, and compares this model to

a state-of-the-art contextualized language model that surprisingly underperforms previous

baselines. An additional contribution is an analysis of the internal representations of the successful

model, which highlights the importance of semantic specialization in the attentional

and representational behavior of the individual attention heads. We also show that some

simplifying modifications to the original model maintain sufficient performance while also

facilitating interpretability.

The final contribution of this dissertation is a general method to improve the performance

of specialized text classification tasks, such as the Virtual Patient, that use the output of

a general-domain speech recognition system as input (Chapter 6). The main strategy of

the method is to augment training data with inferred phonetic representations to make

the classifier robust to misrecognitions, but we see additional performance improvements

from a strategy of randomly training on generated erroneous forms of the input sentences.

This allows for the improvement of a Virtual Patient with a spoken interface without any

in-domain speech data.

Finally, we conclude in Chapter 7 with a consideration of future research directions for

the Virtual Patient, primarily focused on pragmatic dialogue issues that show up in the data.

Handling some of these phenomena well has potential to enable more advanced capabilities,

like mixed initiative conversations.

Chapter 2: Background and Related Work

In this chapter, we provide background information and review the current state of the

art in the several subfields that are important for the functioning of a dialogue system such

as the Virtual Patient.

We begin with the issue of dialogue generally. One useful partitioning of concerns

that can be made within the broad category of dialogue is between language understanding

and language generation, since any dialogue agent must accept natural language input and

produce natural language output. There are certainly other ways to break down a discussion

of the mechanics of dialogue agents, and the distinctions are sometimes vague, but this is

most useful for our present aims because of the specific assumptions of the Virtual Patient.

In particular, the Virtual Patient obviates many issues on the generation side by design, so we begin by summarizing some of the main issues and approaches in dialogue generation,

noting why they do not apply, and moving on to the understanding side. We describe the

main challenges and techniques in natural language understanding for dialogue systems,

and position the Virtual Patient as a question-answering agent, which again circumvents

some of the thornier issues in natural language understanding. Finally among the dialogue

issues, since one contribution of this dissertation is a spoken interface for the Virtual Patient, we note that spoken dialogue poses many unique challenges in addition to those handled by

text-based dialogue systems. Accordingly, we review some such problems, with emphasis

on those that have an impact on the Virtual Patient’s performance. While this dissertation does not make direct contributions in spoken dialogue, we outline unique opportunities for future research in the following chapters.

As a question-answering agent, the Virtual Patient allows for a task formulation that amounts to a text classification problem, so we briefly review background and the state of the art for this very broad class of problems. In Chapter 4, we consider the problem of out-of-scope queries, so we also summarize some research in the related subfield of one class classification.

One part of the contributions of this dissertation is the mitigation of speech recognition errors in a spoken interface for the Virtual Patient. Recent advances, mostly in computational power, have essentially commoditized speech recognition, so we mostly take its functioning for granted, but we also provide a very brief overview for background.

Finally, we note that our virtual patient is far from the only patient dialogue simulation system, and describe some similar models.

2.1 Issues in Dialogue

In the most general sense, dialogue is two agents, human or otherwise, using words to interact. The interaction itself presents challenges that need not be addressed for many other tasks, such as parsing, information extraction, speech recognition, etc. From an automated processing standpoint, major dialogue issues include establishing or learning a dialogue policy, generating appropriate responses to a dialogue partner, and handling initiative. These are issues that will be present in most dialogue systems, whether they use a spoken or typewritten interface.

A dialogue policy just describes what to do in a conversation, recognizing that there are various types of dialogue acts, each of which has its own set of appropriate ways to

respond. For example, if someone expresses a greeting, it is usually inappropriate to respond with an apology. Creating a complex automated dialogue agent often involves establishing a

set of task-specific dialogue acts, making a system to recognize them accurately enough,

and learning or writing a policy for how to respond to the various input acts. In the Virtual

Patient domain, the overwhelming majority of inputs are questions from the doctor, meaning

that a very simple dialogue policy of “answer the question” works very well.

Initiative in dialogue refers to the dialogue participant that is recognized by other

participants as having control of the conversation. Here again, the Virtual Patient offers a nice

simplification of the more general dialogue problem, in that the user, as the doctor performing

the interview, always has the initiative. This means that any extraneous difficulties involving

bidding for or ceding initiative can be ignored.

Of course, the greatest challenge for implementing a dialogue agent is understanding

input and producing appropriate responses, which are dealt with in detail in the following

subsections.

2.1.1 Response Generation

It is common in dialogue literature to distinguish between open- and closed-domain

dialogue systems, since the requirements and strategies are different for each. Open-domain

systems basically chat for the sake of chatting; the general goal is to engage a user in

conversation on any topic for as long as possible. The earliest notions of this type of

challenge go back to the “imitation game” first described by Turing (1950), now commonly

known as the Turing Test, which is the basis for the annual Loebner Prize.1 Closed-domain

1https://aisb.org.uk/aisb-events/

systems are more circumscribed, in that they typically only handle content from one or a small number of domains, such as booking a restaurant reservation. Closed domain systems are often deployed as more natural user interfaces for comparatively complicated computing systems, either as a means to control a system (e.g. for booking a flight, buying movie tickets, or controlling “smart” appliances such as thermostats), or to request information from a system. There is often overlap in systems that provide information and allow for user controls; for example, one must find out which flights exist before executing an order to book one.

There are two main approaches in recent literature to producing dialogue responses in data-driven open-domain systems: either generating strings of response text de novo, or retrieving responses from large human dialogue corpora that are appropriate to the input, according to a learned model. Each has advantages: retrieval ensures that responses will be natural and well-formed, while generation, in theory, allows for fully customized responses for unique inputs. In practice, generation poses two notable challenges: it is hard to produce responses that cohere well with the input, and too easy to produce boring responses.

The simplest approach to a retrieval-based agent is to condition the output on the single last input (e.g., Wang et al., 2015), but more sophisticated approaches consider more of the conversational history to find the response that best fits in the conversation. Wu et al. (2017) devise Sequential Matching Networks, a recurrent hierarchical model that considers past context at different levels of granularity, as well as the relationships between past inputs via the recurrent model, to produce a matching score given a candidate response. Extensions of

Sequential Matching Networks improve the performance with attention mechanisms and other heuristic optimizations to emphasize the most important portions of the context for

finding a good match (Zhang et al., 2018; Zhou et al., 2018). A potential source of bias in

retrieval models is that using a large learned model to identify the best matches is intractable for real-time applications at test time, so candidate responses must be filtered out of the larger corpus using some kind of fast, coarse heuristic. A naïve or poorly-tuned heuristic has the potential to remove good matches before the more sophisticated model can identify them.

Earlier efforts at direct generation of responses established recurrent neural network language models as a viable approach to generating responses conditioned on conversational context (e.g., Serban et al., 2015; Sordoni et al., 2015), albeit with the aforementioned challenges of coherence and diversity. This was quickly followed by efforts to improve the diversity of outputs, at first using a mutual information maximization objective (Li et al.,

2016). Later work improved diversity and conversational coherence with reinforcement learning in an adversarial setup (Li et al., 2017a). The adversarial approach also establishes an evaluation method that overcomes some of the limitations of more traditional machine translation evaluations often used to validate dialogue responses (Liu et al., 2016). Very recent work has taken advantage of large Transformer-based (Vaswani et al., 2017) architectures to pre-train generative language models with massive amounts of data for realistic dialogue response generation (Zhang et al., 2019).

Perhaps the simplest and oldest method of producing responses in a dialogue agent is template expansion. The very early ELIZA system used a collection of rules for how to reflect user inputs back to the user, e.g. in the form of questions, to produce plausible dialogue (Weizenbaum, 1966). Closed-domain dialogue agents make templates a more reasonable approach for response production, since a limited number of concerns can reduce the number of rules to define for believable interactions. ChatScript (Wilcox, 2019) is a modern dialogue management system that qualifies as a template-based approach to response

production, because it allows for a system designer to define responses to include variables that can be expanded, for example based on memorized facts that were introduced to the conversational context by the user. While ChatScript has won prizes for producing human-like dialogues (Bradeško and Mladenić, 2012), the process of writing templates is very labor intensive. Other recent approaches that have successfully deployed template-based response production as part of a system include several Alexa Prize participants (see Khatri et al., 2018; Ram et al., 2018).

The Virtual Patient is clearly a closed-domain system; not only does it focus on being a patient, but in the present work, it specifically is only concerned with a particular case presentation of back pain. In addition to that, it has the requirement that very specific information is revealed in response to particular queries, as a means of evaluating whether students know what information they need to extract from their patients. For example, students need to know both whether the patient has any allergies, and (if so) how the patient reacts to those allergens. A human standardized patient that accidentally provides both pieces of information in response to a query of, “Do you have any allergies?” does not test whether the student knew to ask the follow-up question about reactions. Because the output of the virtual patient must respect very specific factual entailment values, and direct language generation methods are not currently effective at that, every response is completely scripted. Some interesting research opportunities exist for future work to incorporate language generation that does respect discrete factual entailment values, as discussed briefly later, but the system described in this dissertation obviates the need for any of the techniques described in this section.

2.1.2 Natural Language Understanding

Since the open-domain dialogue task is often formulated as producing a response

conditioned on prior context, the understanding portion of the task is usually implicit or

latent in whatever model is proposed to generate responses. This leads to a fairly tight

correlation between response generation models and open-domain tasks on the one hand,

and natural language understanding (NLU) models and closed-domain tasks on the other

hand. There is no requirement that either type of model be exclusively used for either task,

but a closed domain is often a side effect of many command-and-control kinds of tasks, for which fine-grained understanding of user inputs is more important than producing engaging

responses. For example, a voice-operated light switch does not need to understand the

diagnostic criteria for diabetes, but it does need to accurately map a variety of language

inputs to a small number of machine instructions that have very specific effects on the

physical world. That said, one could consider voice search, or even search engines generally,

to be examples of open-domain NLU, in that they accept (semi-) natural language about

any topic as input, and produce relevant documents or excerpts as outputs. The impetus that

drives the efforts toward understanding instead of generation is the specificity of the desired

result; the user expects to find specific information, so it is important to fully understand the

request.

As just stated, a closed domain is a fortunate side effect of spoken interfaces for many

command-and-control tasks. However, the variability within the bounds of some domains

does motivate particular approaches to NLU tasks, two of which are known as slot-filling

and semantic parsing.

Slot-filling tasks allow a system designer to define frames that specify all of the information

necessary to execute some action on behalf of the user—for example, in order to

14 book a flight, a system must know (at least) the point of departure, the point of arrival, and

date and time of the flight. Each of these variables is called a slot, which is filled with

information provided by the user. Each of a finite number of user intents, such as finding

a flight or making a reservation, corresponds directly to an associated frame. The system’s

job is to predict the user’s intent, keep track of slot-filler values as they are provided over

multiple dialogue turns, and to elicit values for required, unfilled slots for the intended frame.

Particular challenges include recognizing the linguistic structures that characterize fillers

of particular slots, defining and recognizing entity types that are permitted to fill different

kinds of slots, handling different user intents, handling unexpected or invalid fillers, and

constructing the outputs to confirm and/or request information from the user.
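To make the frame and slot terminology concrete, the sketch below shows one minimal way such a frame might be represented in code. It is an illustration only: the intent name, slot names, and city values are invented here, not taken from any system discussed in this dissertation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

@dataclass
class Frame:
    """A single user intent plus the slots needed to act on it."""
    intent: str
    required_slots: Tuple[str, ...]
    values: Dict[str, Optional[str]] = field(default_factory=dict)

    def missing_slots(self):
        # Slots the dialogue manager still needs to elicit from the user.
        return [slot for slot in self.required_slots if not self.values.get(slot)]

# Hypothetical flight-booking frame; names and fillers are illustrative.
book_flight = Frame(intent="book_flight",
                    required_slots=("origin", "destination", "date", "time"))
book_flight.values.update({"origin": "Columbus", "destination": "Boston"})
print(book_flight.missing_slots())  # -> ['date', 'time']
```

A dialogue manager built around such frames would loop between predicting the intent, filling slots from each user turn, and asking for whatever `missing_slots` still reports.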

An early example of a slot-filling task is the ATIS data set (Moore et al., 1995), which

defines a conversational flight-booking system. A more recent example is the Frames

data set (El Asri et al., 2017), which includes multiple domains and intents (e.g., buying

movie tickets or making restaurant reservations), and others add the challenge of negotiation

(Lewis et al., 2017). The most successful recent approaches have begun using neural network

architectures, and see improvements by tracking latent contextual states (Bapna et al., 2017).

Semantic parsing is a slightly more general formulation than slot-filling, one that seeks

to map all of the content of a natural language input into some equivalent expression in a

formal, unambiguous language, e.g. an SQL database query, or a lambda calculus expression

(Church, 1932). As an example, imagine a natural language interface for a database of world

cities. One might give the system an input of, “Show me all the cities in Germany that have

a population larger than Moscow.” An SQL expert can write an equivalent query, but a

semantic parser can theoretically allow a non-expert to ask the question and get the right

result. Compared to slot-filling tasks, semantic parsing assumes the added complexity of

mapping syntax in one language to syntax in another, rather than assuming a fixed number

of intents/frames that can be operationalized by simply extracting entities from the input.

Still, it essentially requires a closed domain, since the target language must operate on

a finite knowledge base. The most successful current approaches use neural sequence-to-sequence

models (Sutskever et al., 2014), with modifications to include attention and

copying mechanisms (e.g. Finegan-Dollak et al., 2018). Recent task enhancements include

user interaction through clarifying subdialogues to improve parsing performance (Yao et al.,

2019).
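To make the German-cities example above concrete, the sketch below pairs the natural language question with one possible formal target. The table schema, column names, and toy rows are invented for illustration; they are not part of any data set used in this work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, country TEXT, population INTEGER)")
conn.executemany("INSERT INTO cities VALUES (?, ?, ?)", [
    ("Berlin", "Germany", 3_600_000), ("Hamburg", "Germany", 1_800_000),
    ("Munich", "Germany", 1_500_000), ("Moscow", "Russia", 12_500_000),
])

# "Show me all the cities in Germany that have a population larger than Moscow."
query = """
SELECT name FROM cities
WHERE country = 'Germany'
  AND population > (SELECT population FROM cities WHERE name = 'Moscow')
"""
# With this toy data the answer is empty, since no German city outgrows Moscow.
print(conn.execute(query).fetchall())
```

The semantic parser's job is to produce the string held in `query` directly from the English question, without a human SQL author in the loop.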

Perhaps the simplest approach to natural language understanding in a closed domain,

and the one deployed by the Virtual Patient, is to define a limited number of meanings that

are relevant for the domain, and to directly classify each input as one of those meanings.

This approach limits the ability to generalize to new domains, but it can be effective for

some applications. One particularly relevant line of research is another dialogue-based

training simulation, which uses maximum entropy models to directly classify user inputs

to determine which of a fixed number of responses to produce (DeVault et al., 2011a).

This paradigm fits with other so-called question-answering dialogue agents that have a

fixed knowledge base with pre-programmed responses numbering in the hundreds (Traum

et al., 2012). The simplicity of the direct classification approach makes it relatively easy to

establish reasonable baselines, while leaving room to explore more sophisticated approaches

in the future. Later, we address a few topics in the general area of text classification that

apply to the Virtual Patient domain.

16 2.1.3 Spoken Dialogue

Spoken dialogue poses a number of unique problems above and beyond text-based

dialogue or even pure speech recognition. To begin with, spoken English is statistically quite

different from written English, even in a typed chat. As an example, repairs are strategies

that speakers use to correct errors that would not appear in text due to the existence of the

backspace key. Speakers may utter part of a word and restart from a few words prior, use

indicative language like, “I mean...” or disfluencies like, “er,” or, “um,” to indicate that they

are fixing an error, or simply insert the right word after the wrong one. All of these strategies

are easily understood by humans to override the meaning of the erroneous speech, but are

difficult for an automated dialogue system to understand. Furthermore, while typos are fairly

common in typed chats, automatic speech recognition (ASR) systems that provide input to a

dialogue management system will not misspell words, but may recognize the wrong words.

These ASR errors can be innocuous, or catastrophic, depending on the specific errors and

the downstream system.

Turn-taking is another issue in spoken dialogue that is generally much easier to deal with in a typed interface. When a user is done with their turn in a text message, they press

the Enter key; the vocal cues that humans use to give and take turns are much harder to

recognize. Speakers often interrupt each other, known in the literature as barge-in; or they

may start their sentence before the other is entirely finished with theirs. Meanwhile, some

utterances may not require a full change of turn—these are known as backchannels, and

serve to indicate to the speaker that the listener understands or is paying attention. Typical

examples include, “mhmm” or “uh-huh.”

Even ignoring all the reasons and ways that speech can overlap, simply recognizing, in

a timely manner, that a user is finished saying what they want to say is very challenging.

Human dialogue partners typically switch turns over a matter of about 200ms (Schegloff

et al., 1974), but mid-utterance pauses often exceed that, so a simple time limit for turn

changes leads to far too many interruptions. Some of the information about whether or not

a speaker is finished speaking is encoded in the prosody of their speech: the way the tone,

pitch, volume, and rate of speaking change over the course of an utterance.

Section 3.2.5 describes the implementation of a rudimentary end-of-utterance (EOU)

detector to mitigate broken utterances. In Chapter 7, we propose potential improvements,

including detection of certain kinds of dialogue act intents. Previous EOU work has used

language modeling to contribute to the determination of whether an utterance is finished

or not, but efforts to use semantic information like dialogue acts to make end of utterance

determinations have been fairly limited.

Voice Activity Detection and Endpointing have a substantial body of work in the signal

processing literature, but these tasks focus more on analyzing the state of a signal, rather than

uncovering the speaker’s intent. Early efforts at using prosodic cues to pick up on a speaker’s

intent sought to improve on algorithms that simply waited for a detected silence to last a

specific amount of time before calling the utterance complete. This type of naïve baseline

remains in use today due to its ease of implementation (indeed, this is the basic approach we

use now), but leaves much to be desired. The early efforts to improve on this state of affairs

extracted features from the signal aimed at capturing prosodics, and incorporated language

modeling of ASR results, using decision trees or other non-neural methods for models.

These efforts led to substantial improvements in error rates and latency reduction (e.g. Ferrer

et al., 2002).
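For reference, the kind of naive silence-threshold endpointing that this line of work improves upon can be sketched in a few lines. The frame size, energy threshold, and 700 ms patience below are arbitrary illustrative values, not the settings used in our spoken interface.

```python
import numpy as np

def end_of_utterance_index(frames, energy_threshold=1e-4, max_silent_frames=70):
    """Naive endpointer: declare the utterance over after a long enough run of
    low-energy frames, once some speech has been heard.

    `frames` is a sequence of fixed-length audio frames (e.g., 10 ms hops),
    so 70 silent frames corresponds to roughly 700 ms of trailing silence."""
    silent_run, heard_speech = 0, False
    for i, frame in enumerate(frames):
        energy = float(np.mean(np.square(frame)))
        if energy >= energy_threshold:
            heard_speech, silent_run = True, 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= max_silent_frames:
                return i
    return None  # no end detected within the provided audio

# Toy usage: 1 s of noise-like "speech" followed by 1 s of near-silence.
audio_frames = np.concatenate([np.random.randn(100, 160),
                               1e-4 * np.random.randn(100, 160)])
print(end_of_utterance_index(audio_frames))
```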

The use of prosody features continues to be useful, and research in the area has continued

into recent years, even introducing new corpora for the task (Arsikere et al., 2015).

Computational capabilities have now advanced enough to make neural models feasible for

real-time applications, and very recent work has successfully modeled end-of-utterance

detection for a voice search task using an LSTM (Hochreiter and Schmidhuber, 1997) model with filterbank inputs (Shannon et al., 2017). Yet another related use of neural models has

introduced a new flavor to the task: rather than waiting for a minimum-length silence to

classify as ending an utterance or not, the aim is to continuously predict, over a sliding future window of time, the likelihood of a turn change across conversation partners (Skantze, 2017).

This is laying the groundwork for systems that can interrupt a user, provide backchannels,

or deal gracefully with overlaps or otherwise being interrupted. Even more recently, authors

have begun to look at capturing pragmatic information from multiple modalities, including

gaze estimation (Roddy et al., 2018).

Finally, speaker diarization is an issue in some spoken dialogue systems. Diarization as

a task aims to keep track of the individual voices that a system is interacting with, which

can help to determine who to pay attention to. Multiple speakers are not supposed to be

an issue for the Virtual Patient, so this problem is largely ignored in the present work, but

incorporating diarization approaches has potential for some improvements, discussed in

Chapter 7.

2.2 Text Classification

Text classification is nearly as old as computing itself, and to attempt to cover it in

detail would never approach a complete picture; nonetheless, as the predominant strategy

for handling inputs in the Virtual Patient, mentions of the general problem and the most

recent advances in the field are warranted.

One egregiously simplistic view of the task of text classification would frame it as

finding some representation of the text in a fixed-dimensional vector space, and then finding

partitions of that space that correspond to the desired classes, using any of a number of

optimization algorithms. Finding appropriate representations, then, is a large part of the

problem, and the model architectures and algorithms for producing these representations

are the focus of much research. One of the simplest such representations, known as a bag

of words representation, is to map every word in a vocabulary to a dimension in a vector

space, and to represent strings of text with non-zero values for the dimensions corresponding

to the words present in the text. The values can be boolean values indicating the simple

presence of words in the input, or more complex calculations involving frequency of terms

in a corpus. This approach ignores word order effects, but these can be incorporated by

making the vector space dimensions correspond to ordered tuples (N-grams) of words that

appear in the text, with certain approximations for N-grams that were unseen in training.

A common classification algorithm to use with such representations is maximum entropy

classification (Nigam et al., 1999), which estimates the class probability distribution given a

particular input, by learning the “most uniform” distribution over classes that respects the

constraints imposed by the training set.
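A minimal sketch of this bag-of-words pipeline is given below, using scikit-learn's multinomial logistic regression, which is equivalent to a maximum entropy classifier. The handful of training questions and labels are invented for illustration and are far smaller than any real training set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy examples only; real label sets number in the hundreds of classes.
questions = ["where is your pain", "where does it hurt",
             "do you have any allergies", "are you allergic to anything"]
labels = ["pain_location", "pain_location", "allergies", "allergies"]

# Unigram and bigram counts feed the (maximum entropy) classifier.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
model.fit(questions, labels)
print(model.predict(["where exactly does your back hurt"]))  # expect 'pain_location'
```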

The advent of word vectors marked a significant advance in linguistic representations for

classification tasks. These representations embed word types as points in a high-dimensional

space, so that words with similar meanings appear nearby in the space. These embeddings

are learned from large text corpora, based on the so-called Distributional Hypothesis,

that words that appear in similar contexts should have similar meanings (Rubenstein and

Goodenough, 1965). Among the algorithms for learning these representations for individual words are the Word2Vec algorithm (Mikolov et al., 2013) and GloVe (Pennington et al.,

2014). These word vectors can be aggregated in various ways to represent the sentences

in which they appear, as simply as by taking an average, or by using them as the inputs to

more sophisticated learned models that produce a fixed-dimensional output from a sequence

of word vectors as input. One such aggregation model that we deploy is a Text CNN (Kim,

2014), which learns a large number of filters—alternatively, these can be thought of as

sensors that respond selectively to preferred short paths through the word vector space, each

producing a single real value to indicate how close the input is to their preferred input at

every time step. Each sentence is then represented as the maximal activation of all such

filters before being classified. An alternative aggregator is a recurrent neural model (e.g.,

Hochreiter and Schmidhuber, 1997), which accepts the sequence of word vectors one at a

time, and updates its internal state in response to each input. Then, the final internal state is

used as the representation of the whole sentence for classification (Arevian, 2007).
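A stripped-down PyTorch sketch of the Text CNN aggregation idea is shown below. The vocabulary size, embedding dimension, filter widths, and class count are placeholders, and a real system would initialize the embedding table from pretrained vectors such as Word2Vec or GloVe rather than at random.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Kim (2014)-style classifier: convolve filters over word vectors,
    max-pool each filter over time, then classify the pooled features."""
    def __init__(self, vocab_size=5000, embed_dim=300, num_filters=100,
                 widths=(2, 3, 4), num_classes=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=w) for w in widths])
        self.classify = nn.Linear(num_filters * len(widths), num_classes)

    def forward(self, token_ids):                    # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)    # (batch, embed_dim, seq_len)
        # Each filter fires on short n-gram-like stretches of word vectors;
        # max-pooling keeps its strongest response anywhere in the sentence.
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.classify(torch.cat(pooled, dim=1))

# Two fake tokenized sentences of length 8, mapped to class logits.
logits = TextCNN()(torch.randint(0, 5000, (2, 8)))
print(logits.shape)  # torch.Size([2, 300])
```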

The latest leap forward in representational power for text classification problems has

been the development of contextualized language models, such as BERT (Devlin et al., 2019)

or XLNet (Yang et al., 2019). These models take advantage of pretraining on huge volumes

of text through the use of a Transformer architecture (Vaswani et al., 2017), and they work by

producing a unique, context-dependent representation for every word in a string of text. This way, different senses of the same word can be represented independently. The BERT model

is of particular interest, since we compare it to other methods in Chapter 5. It is pretrained

by optimizing two objectives: the first is to predict random words that have been masked out

of the input, and the second is to predict if two input sentences are successive sentences in

the original text. The model produces both word- and sentence-level representations that are

suitable for fine-tuning to perform downstream classification tasks. The BERT model and

variants have dramatically improved performance on a number of benchmark tasks (Wang

et al., 2019).
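Fine-tuning such a model for a classification task like ours typically amounts to placing a small output layer on top of the pretrained sentence representation and training the whole stack end to end. The sketch below uses the Hugging Face transformers library as one possible implementation; the label count and example questions are placeholders rather than the exact configuration used later in Chapter 5.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Pretrained BERT encoder plus a freshly initialized classification head.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=300)

batch = tokenizer(["where is your back pain", "do you smoke"],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([0, 1])  # placeholder class indices

# One fine-tuning step: loss over the sentence-level representation,
# backpropagated through the entire pretrained model.
outputs = model(**batch, labels=labels)
outputs.loss.backward()
print(outputs.logits.shape)  # torch.Size([2, 300])
```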

2.3 One Class Classification

In Chapter 4, we identify and characterize what we call out-of-scope queries, which pose

a particular challenge to our performance. The related problem of distinguishing in- and out-

of-domain data when only in-domain data is available is well-studied in the literature, and

known by several variants falling under the banner of One-Class Classification, including

Novelty Detection, Anomaly Detection, Outlier Detection, etc. (Khan and Madden, 2009).

Techniques leverage various assumptions about the data in some representation space, e.g.

that it be maximally separated from the origin (Schölkopf et al., 2000), or that the measure

of a hypersphere surrounding it be minimized (Tax and Duin, 1999). Many published

techniques in the text domain make use of large corpora of unlabeled in- and out-of-domain

data to provide a basis for the negative class (e.g., Yu et al., 2004). In the absence of

negative examples, adequately characterizing the boundaries of the positive class requires

a large number of positive examples (Yu, 2005), while, if a large number of unlabeled

examples are available, good results can be achieved by maximizing the number of them

that are classified as negative while constraining positive examples to be correctly classified

(Liu et al., 2002). Neural approaches to one-class classification have been popular more

recently, and these focus on learning a representation distribution that is amenable to outlier

detection, for example using ordinary multiclass classification as a joint auxiliary task for

one-class training with a class compactness objective (Perera and Patel, 2019). Recent work

in a command-and-control spoken dialogue context uses joint out-of-domain and domain

classification objectives along with an externally supplied false acceptance target rate to

improve domain classification accuracy (Kim and Kim, 2018).
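As a concrete point of reference for the classical techniques above, the sketch below fits a one-class SVM on vectors standing in for in-scope questions and then flags an obvious outlier. The random feature vectors and the nu parameter are placeholders; in practice the inputs would be learned sentence representations.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Stand-ins for fixed-dimensional representations of in-scope questions.
in_scope = rng.normal(loc=0.0, scale=1.0, size=(500, 50))

# nu upper-bounds the fraction of training points treated as outliers.
detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(in_scope)

# +1 means "consistent with the in-scope training distribution", -1 means outlier.
probes = np.vstack([rng.normal(0.0, 1.0, 50),   # looks in-scope
                    rng.normal(8.0, 1.0, 50)])  # far from the training data
print(detector.predict(probes))  # expected roughly [ 1 -1 ]
```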

2.4 Speech Recognition

In this dissertation, we generally treat speech recognition capabilities as a black box, so while we depend on them, we do not rely on deep knowledge of the internal workings of

any specific system. Nonetheless, a brief overview of speech recognition is warranted, since

it is immediately adjacent to some contributions, and certain terminology may be useful.

In simplified terms, the traditional speech recognition pipeline consists of several models

chained together to produce an output. At the front, an audio signal of speech content

undergoes some unsupervised, deterministic processing to prepare the data for subsequent

analysis. This commonly involves a conversion from the time domain to the frequency

domain through a Fourier transform. Further processing can decorrelate the frequency

domain features and/or aggregate them in perceptually meaningful bins. Importantly, the

outcome of these transformations is a signal consisting of frames of features (distinct from

the semantic frames used in slot-filling dialogue tasks), usually representing 25ms chunks

of audio at 10ms overlapping intervals. These frames serve as input to an acoustic model which produces as output a distribution over all phonemes in the target language being

recognized. A phoneme is a linguistically meaningful sound, for example the /b/ sound in

boy, so the output of the acoustic model is the respective probability of each phoneme being

present in every frame, given the input features. A Hidden Markov Model combined with

a pronunciation lexicon and language model then finds the sequence of words that is most

likely, given the phoneme probabilities produced by the acoustic model.
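The 25 ms / 10 ms framing convention can be made concrete with a short sketch that slices a waveform into overlapping frames and computes crude log-spectral features. The 16 kHz sample rate, window choice, and FFT size are common defaults rather than values tied to any particular recognizer.

```python
import numpy as np

def frame_signal(waveform, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Slice a 1-D waveform into overlapping 25 ms frames every 10 ms."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)       # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(waveform) - frame_len) // hop_len)
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return waveform[idx]                             # (n_frames, frame_len)

# One second of fake audio yields 98 frames of spectral features.
audio = np.random.randn(16000)
frames = frame_signal(audio) * np.hamming(400)       # window each frame
spectra = np.abs(np.fft.rfft(frames, n=512))         # magnitude spectrum per frame
log_features = np.log(spectra ** 2 + 1e-10)          # crude log-spectral features
print(frames.shape, log_features.shape)              # (98, 400) (98, 257)
```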

Within this very broad framework lies a huge experimental space. Acoustic models are

often deep neural models, and researchers began claiming superhuman performance for

certain tasks in recent years (Xiong et al., 2018).2 Novel training objectives have led to

recent improvements in the traditional model (Povey et al., 2016), but there has been much

interest in so-called end-to-end speech recognition, which aims to directly predict the correct

sequence of words, sub-words, or characters, instead of the traditional two stages of acoustic

modeling followed by decoding according to a lexicon and language model (e.g., Watanabe

et al., 2018). Again, most of these details are ancillary to the aims of this dissertation, but

are provided as a point of reference.

2.5 Other Virtual Patients

Dialogue agents in medical settings are not unheard of outside of the present program,

and there are various goals and reasons for having them. Morbini et al. (2012) use an avatar

to explore mixed-initiative conversations in a troop-deployment, counseling-related scenario.

DeVault et al. (2013) use another such agent to assess diagnostic vocal characteristics of

PTSD patients. More recently a team at LIMSI have built a virtual patient with very similar

educational goals to ours, which takes a heavily engineered approach to defining rules and

ontologies that allow for accurate question answering given an authored patient health record

(Campillos-Llanos et al., 2019). This system emphasizes flexibility in deploying patients

to represent a wide array of conditions and histories, at the expense of some naturalness

in responses. Accordingly, they do not make use of machine learning techniques, since

adequate conversational data to represent all possible ailments is not available. Talbot et al.

(2012) provides a relatively recent review and taxonomy of various types of virtual patients

2Skepticism is warranted, since machines still underperform humans in many speech recognition tasks, especially in noisy environments.

and reasons for deploying them. Uniquely, our system’s immediate student feedback has been validated as demonstrating high agreement with expert human graders (Maicher et al.,

2019).

2.6 Conclusion

While the subject overviews provided in this chapter are necessarily brief and superficial, it should be apparent that the machinery of a functional dialogue system comes from a broad collection of subfields in computing and linguistics. The particular simplifications that are allowed by the Virtual Patient domain make some of the complexity unnecessary, but they also present a framework of unique opportunities for research—the very fact that advanced methods for response generation and language understanding are optional is what makes the system a prime target for experimental manipulations of those elements. The following chapter explores this idea more fully, by considering the specific implementation details of the Virtual Patient, and introducing the research contributions of the rest of the dissertation in the context of the unique challenges and opportunities presented by the system.

Chapter 3: System Overview

Having just seen the problems and approaches that contribute to the study of dialogue

agents in general, we devote this chapter to the specifics of how we operationalize these

issues in the Virtual Patient. We begin with a description of the data that defines the domain,

including the processes of its collection and annotation. This data consists of several subsets which were collected at different times, which each have different nuances, and which were

used in different combinations in the experiments that follow this chapter. The description

of the data is followed by a detailed explanation of the architecture of the deployed Virtual

Patient system. The data sets and deployed system constitute contributions in themselves,

but the details in this chapter highlight the unique challenges and opportunities presented by

this project, which inform and motivate the research contributions described in following

chapters.

3.1 Data

The main purpose of the Virtual Patient is to provide reinforcement for classroom

training on patient interviewing skills as part of the first year curriculum for medical

students. Students are trained to elicit general information about the cause for the visit

using open-ended questions, before drilling down into specific details about the complaint,

then acquiring important patient background information such as family history and social

history, etc., all while maintaining appropriate bedside manner. The first several turns of an example conversation are shown in Table 3.1.

Table 3.1: An example conversation opening.

Student: hello!
Patient: hello doctor. i am so glad to see you.
Student: what brings your in today mr. wilkins
Patient: i was hoping you could help me with my back pain, it really hurts! it has been awful. i can barely even tie my shoes.
Student: great i will ask you some detailed questions about your back pain.
Patient: it is.
Student: do you want to discuss anything else
Patient: the occasional need to urinate more frequently.
Student: where is your back pain
Patient: it is mostly in my lower back.
Student: does the pain spread anywhere else
Patient: just my back.
Student: how would you describe your pain
Patient: it’s a dull ache right in the middle of my lower back.

Deploying the Virtual Patient as a component of a curriculum presents the opportunities

both to test the capabilities of the current system and to collect more input data

to use for refining and retraining the system for future deployments. Each school year,

then, presents the opportunity to conduct controlled experiments with the dialogue system, within standard ethical boundaries for educational experiments. The data associated with

the project have naturally grown according to this annual test/collect cycle, as have the

capabilities of the system.

In this section, we name and describe the several collections of the Virtual Patient data which are used in the experiments in the following chapters. The first data set, collected

in 2016 from a ChatScript-based system, is called the core data set, and it formed the

basis for the system that was used to collect the rest of the data. After developing a

ChatScript/machine learning hybrid that performed well on the core data, we sought to validate that system the next year by comparing it to a ChatScript-only system. This resulted

in the collection of the enhanced data set in 2017, which is further partitioned into CS-only

and hybrid subsets, named for the systems used to collect them. Next, we collected a spoken

data set in 2018, after upgrading the software to use a spoken interface. Finally, the core

and enhanced data sets were re-annotated under different criteria, and the union of those

re-annotated data sets is called the strict data set. Additional data has been collected, but

remains incompletely annotated at the time of writing. The complete label set for strict

relabeling of the text data sets described here is available as an example of questions the

patient can answer in Appendix A. We begin with a description of the general collection and

annotation methods.

3.1.1 Annotation

Data are collected by simply logging user interactions with some version of the system,

including every input from the user, as well as the system’s response to each query. After

collection of a corpus of such logs, each dialogue turn is evaluated by an analyst (paid or volunteer at various stages in the project) for whether the system-provided response was

correct or incorrect. A single dataset may have been annotated by one or more such analysts.

After the initial correctness determination, the incorrect turns were passed off to the domain

expert to determine what the correct class should have been.

A label in this project is a canonical sentence whose meaning is equivalent to that of the query

sentence. As discussed in the introduction, this is in some sense a paraphrase identification

task, but since there are a finite number of relevant queries to which the patient can respond,

we treat the question identification problem as a multiclass classification problem. So for

example, the queries:

1. “And since then you have had the constant back pain[?]”

2. “OK, and does it hurt you all the time?”

should both be identified as belonging to the same class, and should produce the same

response starting with, “It is pretty constant, although sometimes it is a little better or a little worse.” This class is labeled with the canonical sentence, “Is the pain constant?”

Preliminary analysis identified a small number of classes that were deemed semantically

equivalent for the purposes of assessing the validity of the data-driven approach, and these were manually collapsed to representative labels. As an example, the two queries:

1. “Tell me about your parents.”

2. “Are your parents still alive?”

are semantically distinct in reality, and the ChatScript engine recognizes them as distinct

inputs. However, by design, they each produce the response, “Both of my parents are alive

and well,” and they are mapped to the same class for the purpose of training the machine

learning system. We refer to this manual re-mapping effort as label collapse. Label collapse

makes the question identification problem somewhat easier by virtue of having fewer subtly

distinguished classes, but also creates additional headaches when comparing performance

across data sets. We further note that the class counts presented below are drawn from the

correct labels, exclusive of the potentially incorrect interpretations that ChatScript may have

made at the time of data collection.

3.1.2 Core Data Set

The core dataset was collected from students interacting with the initial ChatScript-based virtual patient as voluntary, extracurricular practice for an examination with a standardized

patient. Each data point consists of the student’s query sentence, the interpretation of the

query by ChatScript, ChatScript’s response, and the correct label.

This core dataset was originally developed as a follow-up to preliminary work by Jaffe

et al. (2015), which did formulate the question identification task as explicit paraphrase

identification. In this setup, pairwise comparisons between an input and all labels were

performed, ranked, and the best match was returned. To evaluate this approach, certain

classes whose instances were not paraphrases were excluded — chiefly negative classes

defined mainly by an absence of some information as opposed to its presence. These

include a catch-all class for questions about any body system that does not affect the case

presentation, which we call “negative symptoms.” Examples can include non-paraphrases

such as, “Do your fingers ache?” and, “Does your knee hurt?” Another such class consists of

questions that are out-of-scope entirely, for various reasons we discuss in detail in a later

chapter. After development of the dataset for evaluation of the paraphrase identification

approach, it was discovered that the multiclass classification approach was both more

effective with the volume of data available, and faster, making it more conducive to a

real-time deployment. Nonetheless, the exclusion of these nonparaphrastic classes was not

reevaluated, and development was standardized around this dataset. This innocent historical

decision had significant impacts on our replication experiment, as described in Section 4.2.

The original dataset in the above format consists of 94 dialogs, comprising a total of

4,330 queries and responses, representing 359 classes (after label collapse). We refer to this

set as the core dataset.

Figure 3.1: Frequency of labels in the core dataset by rank, with quintiles by color

Since the data are essentially natural dialogs, the occurrence counts of classes are extremely unbalanced, exhibiting a long Zipfian (Zipf, 1949) tail. For example, nearly every interaction will include some variant of, “Is there anything else I can help you with today?” while, “Does moving increase the pain?” is far less common, but clearly clinically significant. A quintile analysis of the core dataset is shown in Figure 3.1, showing the frequency of occurrence of each class. The top quintile consists of only ten classes, while the bottom quintile consists of 256 classes, many of which appear only once in the dataset, but which nevertheless should be recognized and answered correctly. This long tail presents one of the primary challenges of the project.
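For concreteness, the kind of quintile analysis shown in Figure 3.1 can be computed with a few lines of Python; the sketch below is illustrative only, and the function name and boundary handling are not taken from the project code.

from collections import Counter

def label_quintiles(labels):
    """Group classes into quintiles by cumulative share of examples.

    `labels` holds one class label per example; classes are sorted from most
    to least frequent, and each quintile covers roughly 20% of the examples.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    quintiles = {q: [] for q in range(1, 6)}
    cumulative = 0
    for label, count in counts.most_common():
        cumulative += count
        q = min(5, int(5 * cumulative / total) + 1)
        quintiles[q].append(label)
    return quintiles

Applied to the core data set, a grouping of this kind reproduces the skew described above: a handful of classes fill the top quintile, while hundreds of classes fall in the bottom one.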

3.1.3 Enhanced Data Set

Previous work with the Virtual Patient using the core dataset found that a combination

of ChatScript and a Text CNN (Kim, 2014) were effective at classifying the inputs in that

set (Jin et al., 2017). To validate that finding, we deployed their hybrid system in a live,

randomized A/B test with new users against a system using only ChatScript. This work is

described in Chapter 4, and we call the data collected from that experiment the enhanced

data set. That data consists of a further 154 dialogs, comprising 5,296 turns. Since the data were derived from systems with different architectures, we maintained subdivisions of the

enhanced set into CS-only and hybrid subsets, for the data obtained from the ChatScript-only

and hybrid systems, respectively. The enhanced dataset contains 258 classes that were seen

in the core dataset, plus 74 rare classes that were new or had been previously unseen. The

unseen classes comprise 142 turns of the enhanced dataset, or approximately 2.7% of the

data. This data still exhibits a long tail, although there are important shifts in frequency of

some classes relative to the core data set, which are discussed in more detail in Chapter 4.

The CS-only subset consists of 2,497 examples, with the remaining 2,799 examples in the

hybrid subset.

3.1.4 Spoken Data Set

After the deployment of the hybrid system and the associated data collection effort, the

system was adapted to use a spoken interface, which introduced a new set of data collection

challenges. The spoken system uses an off-the-shelf cloud-based speech recognition API to

convert incoming speech audio to text, and the captured text fit easily into the existing logging

framework. However, we also captured the audio for the sake of post-hoc comparisons to

the captured text, which required additional annotation efforts.

One challenge in implementing the spoken interface, described in more detail later, is

simply understanding when a user is finished talking, since our goal of simulating a natural

patient interaction precludes some more robust methods like push-to-talk. The commercial

speech-to-text service provides some facilities for this, namely by including a completeness

flag in the recognition results, but we found it to be a bit too sensitive for our use case, since

users often pause while thinking of what to ask or how to phrase a question. We mitigated

some of this by inserting a wait period before responding to collected text, but errors in the

implementation sometimes left the collected text misaligned with the collected audio. Thus,

the annotation of the spoken data set required an extra alignment step.

In detail, the speech-to-text service indicates the start of speech by returning a recognition

result, and to deal with network latency, we captured at least one second of audio preceding

a speaking event through caching. After the speech-to-text service returned an utterance

completion event, we waited a variable length of time, dependent upon the number of words

in the collected transcript (the fewer the words, the longer the wait). If additional input came

in during the wait, we appended to the captured audio and reset the wait period. After the

collection phase, audio captured this way was sent to a manual transcription service for full verbatim transcription, including disfluencies, repairs, etc. The regular annotation method

described above was used to determine the correct labels for the inputs, but notably, the

automatically captured transcripts from the speech-to-text service were used, not the manual

transcripts.
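A minimal sketch of the wait-and-append heuristic described in this paragraph is given below; the class name, the wait schedule, and the specific durations are illustrative assumptions, since the exact values used in the deployment are not specified here.

import time

class UtteranceCollector:
    """Accumulate speech-to-text results into one utterance, waiting longer
    after short transcripts, which are more likely to be mid-thought."""

    def __init__(self, base_wait=3.0, per_word_discount=0.25, min_wait=0.5):
        # Illustrative timing parameters, not the deployed values.
        self.base_wait = base_wait
        self.per_word_discount = per_word_discount
        self.min_wait = min_wait
        self.pieces = []
        self.deadline = None

    def wait_time(self, n_words):
        # The fewer the words, the longer the wait before finalizing.
        return max(self.min_wait, self.base_wait - self.per_word_discount * n_words)

    def on_eou(self, partial_transcript):
        # Called when the STT service signals a (possibly premature) end of
        # utterance; any new input resets the wait period.
        self.pieces.append(partial_transcript)
        n_words = sum(len(p.split()) for p in self.pieces)
        self.deadline = time.time() + self.wait_time(n_words)

    def is_complete(self):
        return self.deadline is not None and time.time() >= self.deadline

    def finalize(self):
        text = " ".join(self.pieces)
        self.pieces, self.deadline = [], None
        return text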

The system using the spoken interface was deployed as an assignment to first-year

medical students, in which they were asked to interview a middle-aged male patient with

back pain, and fill out an electronic health record for his case. This exercise resulted in the

collection of 251 conversations containing a total of 12,821 automatically captured utterances. We captured 12,326 audio files which were sent out for manual transcription (again,

implementation errors in our rudimentary end-of-utterance detection caused occasional

misalignment in automatic transcripts relative to the audio).

After the automatic transcripts were fully annotated, they were manually aligned to

the manual transcripts. This included repairing a number of automatic transcripts that were interrupted by an erroneous end-of-utterance. This repair usually consisted of simply

concatenating the interrupted automatic transcripts, and annotating the location of the con-

catenation with an “[EOU]” token. The label of the resulting sentence was then reassessed.

A small number of unexplained errors left some audio without any matching automatic

transcripts, and vice-versa; such unmatchable inputs were deleted. A small number of

conversations were judged to be unserious or conducted in such a noisy environment that the

data was not useful; these were removed. The end result is a collection of 11,912 data points

in 231 conversations, each example consisting of audio, a manual transcript, an automatic

transcript, the correct label, the system’s response, and the system’s original interpretation

of the automatic transcript. 450 utterances were either interrupted by erroneous end-of-

utterances, or were fillers that were incorrectly interpreted as complete queries (374 and

118, respectively, with 42 instances including both phenomena). This data set includes 398

classes, a number of which are subdivisions of the No Answer class that reflect common

trends in that subset of the data (i.e., ChatScript templates that should probably be written,

but did not exist at the time of data collection).

3.1.5 Strict Relabeling

The experiments reported in Chapter 5 combine the core data set with the enhanced data

set, both to provide more training data, as well as to allow for testing with a fixed set instead

of using cross-validation. In the course of development with this data, we discovered a

number of mislabeled data points, which were often incorrectly labeled as though ChatScript

had produced the correct response to begin with. Upon closer inspection, we realized that

in many of these cases, ChatScript had actually produced an acceptable response to the

input query, but it had done so by misinterpreting the input. This, in turn, highlighted an

important bias introduced by only manually annotating class labels for points which were

flagged as incorrect: an incorrect interpretation that produced an acceptable response often went uncorrected. An example is the utterance, “We will try to figure out what is up but

first i have a few questions,” which was labeled as the class make a follow up appointment.

The pre-programmed response for that class is simply, “Of course,” which is a perfectly valid response to the input. However, the class I have more questions is a far more accurate

interpretation of the meaning of the original sentence.

Since we treat the problem as a classification problem, our evaluation metrics do not

(and currently cannot, short of human evaluations for every experiment) account for appro-

priateness of the provided response; they assume a single correct answer for every question,

and they are based on the semantics of the question, not on the semantics of the response.

Thus, it is important that the labels reflect a correct interpretation of the input, not the ap-

propriateness of the system’s response. Nonetheless, annotation for an appropriate response

is a valid strategy, so we do not think this invalidates previous work; we just consider this

a more strict annotation strategy that may apply better depending on the goals of a given

experiment.

Accordingly, we reexamined every example in the core and enhanced data sets, by

looking at all members of a class and determining if every input sentence had equivalent

semantics, and reclassifying those that did not. This resulted in the elimination of 43 classes

from the data set, the addition of 3 new classes, and changed labels for approximately 15%

of the data. We held out the hybrid data subset as a test set, which contains only 268 classes,

of which 15 are unseen in either the core or CS-only data sets. The reannotation reduced the

accuracy of ChatScript on the data to approximately 70%. We call the data relabeled in this way the strict relabeling.

3.2 System Description

We now turn to an overview of the components that comprise the Virtual Patient system, which is largely documentary, but facilitates understanding of the experiments presented in

following chapters, and also provides a basis for the structure of this dissertation.

The virtual patient is deployed as a simple client/server architecture, with the client

bearing responsibility for user interface functions, including rendering the animated image

of the patient to the user, displaying responses, and passing the user’s queries to the back

end server. The server, on the other hand, houses the apparatus for actually interpreting the

user’s query, formulating a response, tracking the state of the conversation to the extent that

such tracking is done, storing data for later analysis, and automatic assessment of student

performance. An overview diagram is shown in Figure 3.2. The following subsections

examine the client and server in more detail, followed by a description of the extensions to

both that were required for a spoken interface.

3.2.1 Front End Client

The client software is developed in Unity 3D,3 a multi-platform 3D game development

engine. At various times, the virtual patient client application has been deployed to different

platforms. For the experiments in Chapter 4, it was deployed in a WebGL format4 and

hosted on an Apache web server running on a university computer. This enabled students to

access the virtual patient anywhere that they had access to a and a keyboard.

For the deployment that collected the spoken data set, the client was deployed to iOS. Every

medical student is issued an iPad, so this allowed us to take advantage of consistent hardware

for audio recording.

3D models were acquired from the Unity Asset Store5 and animated using Autodesk

Maya.6 The patient can express a range of emotions, and these are controlled with messages

from the back end service that are sent concurrently with linguistic replies to user queries.

After providing their name for academic assessment purposes, students can interact with the

patient by simply typing a question in the case of the text-based system, or by talking

to the spoken interface deployment. When they are finished with the interview, they can

click a button on the screen to receive a transcript of their session, including feedback about

questions that should have been asked. This transcript comes in the form of a PDF file that

is generated dynamically in the browser with JavaScript, using information provided from

the back end server. Students also have access to an instructional PDF, hosted statically on

the same server as the WebGL application.

3https://unity.com/
4https://www.khronos.org/webgl/
5https://assetstore.unity.com/
6https://www.autodesk.com/products/maya/overview

3.2.2 Back End Web Service

As stated above, while the client application bears responsibility for displaying content to

the user, the back end service is generally responsible for deciding what that content is. Not-

ing that the typewritten chatbot paradigm imposes a regular synchronous request/response

usage pattern, this back end is designed as a typical REST-like web application (Fielding,

2000), communicating over HTTPS. We note that the web service is not actually RESTful,

because it maintains a fair amount of state information about each user’s conversations with

the virtual patient, and therefore, as currently implemented, does not properly scale beyond

installation on a single server. Nonetheless, the single server has proven to have adequate

capacity for its intended educational use; that is, up to 200 users over the course of a week, with a deadline-driven ramp up in concurrent usage. Despite not being truly RESTful, we

made an effort to maintain correct usage of HTTP verbs, return codes, etc. with respect to

state changes when designing the API.

At the time of the experiment described in Section 4.2.1, the web service implemented

the API shown in Table 3.2. Only the specified verbs were implemented. This API was later

extended, as described in the next section.
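To make the request/response pattern concrete, a client-side exchange against this API might look like the following sketch, using the Python requests library purely for illustration; the host name and the JSON field names are assumptions rather than the documented interface.

import requests

BASE = "https://example-host.example.edu"  # placeholder host

# Start a conversation; the service returns an opening greeting and the URL
# of the newly created conversation resource.
resp = requests.post(f"{BASE}/conversations/", json={"student": "example-name"})
conversation_url = resp.json()["url"]       # assumed field name
print(resp.json().get("greeting"))          # assumed field name

# Submit a typed query within that conversation and print the patient's reply.
resp = requests.post(f"{BASE}{conversation_url}query/",
                     json={"text": "where is your back pain"})
print(resp.json()["reply"])                 # assumed field name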

With limited personnel developing the back end service, many design decisions were

made with an eye toward minimizing development time. The stacked CNN and the hybrid

system described in section 4.1 were implemented in PyTorch (Paszke et al., 2017) and

SciKit-learn (Pedregosa et al., 2011), respectively. Therefore, when selecting a framework

for deployment of the back end web service, a premium was placed on those with a Python

programming interface. Among these, Flask7 was selected for having a low programming

7https://palletsprojects.com/p/flask/

overhead and being fairly lightweight. The Flask application was deployed within a gEvent8

container for efficiently handling concurrent traffic with an event-based implementation.

It was selected as a lighter-weight alternative to a more full-featured web server such as

Apache httpd9 with installed modules, which was expected to require more maintenance

effort. For purely logistical reasons, this Flask web service application and associated

gEvent container were hosted on a separate university computer from the one hosting the

Virtual Patient client WebGL application. This separate hosting did not introduce noteworthy

latency effects, but did require a Flask extension to enable cross-origin resource sharing.

The server hosting the web service uses a six-core 2.4GHz AMD Opteron processor with

32GB of memory.
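As a rough illustration of this deployment choice, serving a Flask application inside a gevent container looks like the sketch below; the module layout, port, and the use of the Flask-CORS extension are assumptions, not the project's actual configuration.

from gevent import monkey
monkey.patch_all()  # cooperative I/O so one process can handle concurrent requests

from gevent.pywsgi import WSGIServer
from flask import Flask, jsonify, request
from flask_cors import CORS  # assumed stand-in for the CORS extension mentioned above

app = Flask(__name__)
CORS(app)  # the WebGL client is served from a different machine

@app.route("/conversations/", methods=["POST"])
def create_conversation():
    # In the real service this creates a conversation resource and returns the
    # opening greeting plus the URL of the created resource; values here are illustrative.
    _payload = request.get_json()
    return jsonify({"greeting": "hello doctor.", "url": "/conversations/1/"}), 201

if __name__ == "__main__":
    # Event-based container in place of a heavier Apache httpd deployment.
    WSGIServer(("0.0.0.0", 8080), app).serve_forever()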

An important auxiliary function of the back end service, beyond supporting the edu-

cational objectives of the virtual patient, is to collect data to support the present research

and further improvements to the system. This function is fulfilled with a simple MySQL

database instance (specifically, MariaDB10), using only two tables. SQL defining these

tables is provided in Appendix B. The web service interfaces with this database through the

Flask-MySQLdb extension.
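A minimal sketch of that wiring is shown below; the connection settings, table name, and column names are placeholders (the actual table definitions are those in Appendix B).

from flask import Flask
from flask_mysqldb import MySQL

app = Flask(__name__)
# Placeholder connection settings; real values are deployment-specific.
app.config["MYSQL_HOST"] = "localhost"
app.config["MYSQL_USER"] = "vp_service"
app.config["MYSQL_PASSWORD"] = "change-me"
app.config["MYSQL_DB"] = "virtual_patient"
mysql = MySQL(app)

def log_turn(conversation_id, query_text, response_text):
    """Store one dialogue turn for later annotation.

    Hypothetical schema for illustration only; call from within a request
    handler so a database connection is available."""
    cur = mysql.connection.cursor()
    cur.execute(
        "INSERT INTO turns (conversation_id, query, response) VALUES (%s, %s, %s)",
        (conversation_id, query_text, response_text),
    )
    mysql.connection.commit()
    cur.close()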

8http://www.gevent.org/
9https://httpd.apache.org/
10https://mariadb.org/

Figure 3.2: Overview of client-server architecture.

Table 3.2: The implemented REST-like API.

Endpoint                      Description
/conversations/               POST: Create a new conversation resource with parameters provided in JSON post form. Return JSON containing opening greeting and URL for created resource.
/conversations//              GET: Show the metadata for the conversation specified by
/conversations//query/        POST: Create a new query resource within the conversation specified by , with the query text specified in JSON post form. Return JSON containing reply.

3.2.3 Response Model

The model developed by Jin et al. (2017) consists of a CNN ensemble (the stacked CNN) and the ChatScript engine, both of which provide an independent interpretation of the user’s input. Both of these interpretations are fed into a logistic regressor that chooses which interpretation is most likely correct. The deployed web service, then, must execute the mechanics of accepting an input query from the user, deciding between the ChatScript interpretation and the stacked CNN interpretation, and returning the correct response, as coded by the expert content author within the ChatScript instance. The process for this

decision making is explained in more detail below, but here we continue the overview of the web service by noting the organizational structure of the components of the process. The

trained stacked CNN and logistic regressor are simply embedded in the web service, and

are loaded into memory as soon as the service is started. The stacked CNN, despite being

trained using a GPU, is configured within the service to perform computations using only

the CPU. This is partially due to hardware availability, but also based on the assumption

that the overhead of loading data into the GPU memory would be greater than the benefit

of parallel computation for single examples. Note, though, that we never performed a

rigorous evaluation of this assumption. The ChatScript instance runs in server mode in

an independent process from the web service, which may or may not reside on the same

machine as the web service. The back end service makes calls to the ChatScript instance

over raw socket connections.

When the user submits a question to the client application, the text of the query, along with some metadata, is packaged into a JSON object and sent to the web service via an

HTTP request. The web service unpacks the JSON object and forwards the query text to

the ChatScript engine, which replies with what it believes to be the correct response. By

design, ChatScript does not automatically provide any metadata along with this response

text. Since responses are not unique to an interpretation of a query (e.g. yes/no questions),

and because the interpretation is needed for features that serve as input for the logistic

regression model, we then send the debugging command :why to the ChatScript instance, which reveals information about the response, including the name of the template and the

pattern that matched the input. Due to time constraints and the label collapse issue (see section 3.1.1), during the experiment in which the enhanced data set was collected, this metadata could only be used to determine whether or not ChatScript found a match for the query. This was confirmed to only have a minimal negative effect on accuracy with the core dataset (< 0.1%).

With the ChatScript response to the query adequately characterized, the same input is sent through the stacked CNN, and the output of both systems is sent to the logistic regression model, as described in Section 4.1.5. If the two systems differ, and the chooser determines that the stacked CNN is the better choice, the canonical label sentence for the stacked CNN’s chosen class is sent to the ChatScript instance. Thus, whenever the stacked CNN’s response is chosen, ChatScript sees two volleys. With default settings,

ChatScript is usually quite stateful, removing discussion topics that have already been seen, to prevent unnaturally repetitive responses; we configured the engine to repeat most answers an effectively unlimited number of times, so this double volley strategy is acceptable for those answers. For the few answers that do not accept unlimited repeats, we do not distinguish different replies in our evaluations.

Once the chosen response has been obtained, the synchronous HTTP response is returned to the client. Subjectively, the latency is noticeable compared to a system using only

ChatScript, but is not worse than the normal pace of a spoken conversation. Some of the latency is due to the double volley and network connection to the ChatScript server, but most of the wait is due to the computations of the stacked CNN.
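Putting the pieces of this subsection together, the per-query decision flow can be sketched roughly as follows; the objects and method names are hypothetical stand-ins for the ChatScript socket client, the stacked CNN, and the chooser, and the feature handling is simplified.

def answer_query(query_text, chatscript, stacked_cnn, chooser):
    """Rough sketch of the hybrid decision flow for a single user query."""
    # 1. Ask ChatScript for its reply, then use the :why debugging command to
    #    recover which template and pattern matched, since ChatScript returns
    #    no metadata with the reply itself.
    cs_reply = chatscript.send(query_text)
    cs_match_info = chatscript.send(":why")

    # 2. Run the stacked CNN on the same input.
    cnn_label, cnn_scores = stacked_cnn.classify(query_text)

    # 3. Let the chooser decide which interpretation to trust, based on
    #    features from both systems (see Table 4.2).
    features = {"cnn_label": cnn_label, "cnn_scores": cnn_scores,
                "cs_match": cs_match_info}
    if not chooser.prefers_cnn(features):
        return cs_reply

    # 4. Second volley: send the canonical label sentence for the CNN's class
    #    back through ChatScript, so that response selection and scoring both
    #    remain inside the ChatScript instance.
    canonical_sentence = stacked_cnn.label_sentence(cnn_label)
    return chatscript.send(canonical_sentence)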

3.2.4 Miscellaneous

One of the goals of the virtual patient is to provide rapid feedback to students about

their interviewing technique, which we call scoring, although it might more accurately be

called summary report generation. This is done by classifying queries by subject according

to rules, and reporting the coverage of important subjects. In the present setup, this function

is coded in the ChatScript instance, which requires some accounting for the double volley

design when the CNN’s interpretation is chosen. Score summary reports are generated using

JavaScript code hosted on the same machine hosting the virtual patient client.

Finally, we note that the experiment detailed in section 4.2.1 requires some users to

converse with an agent implemented using ChatScript only; we include a per-conversation

switch that allows the web service to bypass the stacked CNN and chooser when producing

its response.

3.2.5 Spoken Interface Extensions

The addition of the spoken interface to the Virtual Patient client application required a

number of extensions to the text-based version.

As stated above, the client was changed to be deployed as an iOS app, instead of through

a web browser, to take advantage of consistent microphone hardware. The basic strategy for

handling audio input was to use an off-the-shelf cloud-based commercial speech recognition

system to transcribe spoken inputs and forward the transcripts to the existing text-based back

end server. This also required development of a couple of strategies for generating speech to

provide responses on the client side. As described in Section 3.1.4, it was important to also

capture the audio that generated the transcripts, so changes to both the client and server were

necessary to capture and store that data, described below. The shift to deploying on iOS

required minor changes to the scoring report generation. All of these changes are described

below in detail.

For speech-to-text (STT) capabilities, we chose IBM Watson’s cloud service, since they

offer an open-source Unity API that was easy to incorporate into the existing client software.

This decision in turn made it simple to use Watson for text-to-speech (TTS) on the client

as well. The STT library maintains an open connection to the cloud server via websockets,

and returns speech transcripts as they are decoded, with additional metadata, e.g. to indicate

its belief that the end of an utterance (EOU) has been observed. The TTS service is more

atomic, accepting text and returning an audio file. We adapted the STT code to record audio

to send to our back end service for data collection, by continuously caching one second

of data, recording audio at any time the STT service returned results, and prepending the

cached audio. When EOU was received, the recorded audio was encoded as a WAV file on

the client and sent to a new endpoint on the back end web service to store. When our web

service returns a response, the text of that response is simply sent to the TTS service, and

the resulting audio file is played on the client, concurrent with appropriate lip movement

animations for the avatar.

To prevent the STT from recognizing the output of the TTS and triggering a cycle of

self-responses, we simply mute the STT module for the duration of the audio output. The

text interface was changed to show only a bullet glyph for each recognized word, with the

reasoning that listening feedback was helpful, but users would be less inclined to alter their

speech patterns if they could not observe misrecognitions taking place. Captions for patient

responses were also removed, to require the more natural interaction of actually listening to

the patient.

Table 3.3: Extensions to the back end API.

Endpoint                            Description
/config/                            GET: Show the configurable parameters for the client and setup specified in the URL parameters. Allows for easy switching between speech production strategies without redeployment of client.
/score/                             GET: Show the conversation summary for the conversation number specified in the URL parameters.
/conversations//query//audio        POST: Create a new audio file resource (i.e. save the audio) within the conversation specified by , for the existing query specified by .

For work not otherwise described in this dissertation, it was necessary to have a Spanish-

accented English voice for the Virtual Patient, and there were no commercial TTS options

available to handle this requirement. Since all responses are fixed and finite anyway, we

developed an alternative method for producing speech responses, namely, pre-recording

responses, storing them on the back end service, and downloading them to the client upon

startup. After download, a simple lookup table was used to find the appropriate audio file to

play, upon receiving a response to a query from the back end web service. The software was written to make it easy to switch between TTS and pre-recorded audio for producing speech

on the client.

Scoring report generation was changed from PDF generation in the browser to a simple

HTML page, to facilitate cross-platform operability. This involved creating a new endpoint

on the back end web service. The full set of API modifications to accommodate the changes

described in this section is shown in Table 3.3. A diagram of the modified architecture is

shown in Figure 3.3.

Figure 3.3: System architecture with spoken interface extensions.

3.3 Challenges and Opportunities

The systems, data, and goals of the Virtual Patient project present many unique chal-

lenges and opportunities. We briefly consider a few of the most fundamental problems here, which motivate the work in the following chapters. We also introduce a few of the possi-

bilities for future work that are particularly suited to this project, which will be considered

in more detail in Chapter 7. We can categorize these topics under three broad umbrellas:

data scarcity challenges; known issues in dialogue that can be studied particularly well with

our data and system; and more general linguistic phenomena that, if handled well, would

substantially improve the experience for users of the Virtual Patient application.

The data scarcity challenges are the primary motivation for the contributions in this dis-

sertation. Included in this family of problems is the long-tailed class frequency distribution

that is inherent in all of the data sets described in Section 3.1. Since the least frequently

occurring classes are represented in the data by only a handful of examples at most, it is

difficult to learn classification boundaries for those classes. The other side of this coin,

however, is that we have a large number of semantic classes which in many cases have

rather subtle distinctions. We assume these are orthogonal for the sake of producing a single

one of the known responses for the question that was asked, but the reality is that many of

these classes share common linguistic substructures, which implies latent structures in the

label space. This kind of large, implicitly structured label space is rare in text classification

literature. Chapters 4 and 5 describe different machine-learning architectures designed

to handle the long tail problem. The architecture that best handles rare classes seems to succeed by

leveraging some of the latent structure in the label space, apparently learning components of

frequent classes that generalize well to rarer classes.

Another very challenging data scarcity problem in the Virtual Patient data is simply

identifying questions that the system is not programmed to answer. We include this problem

under the banner of data scarcity because the number of sentence meanings that the system

cannot handle is literally infinite, and as such it is impossible to collect enough data to

positively characterize these sentences as a class. Strategies for handling such problems are

found in the literature as instances of the general task of one-class classification, although

our specific variant of the issue reflects additional challenges that are not common in most

existing data sets. We characterize in detail the subtleties of our problem in Chapter 4, and we develop a baseline approach to this difficult problem.

As the performance of the system improves, it naturally becomes desirable to increase

the scope of the capabilities and applications of the system. One such example, described

above, is the expansion of the user interface to operate in the modality of speech instead of

text; another would be the implementation of new patients with different case presentations

and diagnoses. In either case, data scarcity again becomes an issue, as the models trained for

one purpose may not perform as well for the new purpose, even though the separate tasks

may clearly be related. The obvious solution is to train new models for the new tasks, but

by virtue of their novelty, the new tasks lack representative data. In these cases we would

hope to develop strategies for transfer of the knowledge from one domain or modality to

another. Chapter 6 describes such an approach: a novel method of improving performance

in the spoken modality without the benefit of any speech data.

The dialogue issues that surface in the Virtual Patient project have a big practical

impact on the quality of user interactions. End-of-utterance (EOU) detection is a known

problem (Ferrer et al., 2002), and our naïve implementation, discussed above, is inadequate.

Furthermore, our modeling of the interaction as a fixed-initiative conversation with strictly

alternating turns presents its own benefits and disadvantages. On the plus side, it has

produced a very controlled environment to facilitate fairly extensive data collection and

study of interesting linguistic phenomena; however, the data make clear that this format is at

times too inflexible or frustrating for users. Despite the inherently one-sided nature of the

interview interaction, the data contain some examples of basic turn-taking related phenomena

that could pave the way for more complex mixed-initiative conversations. Further, and

related to the strictly alternating turn structure, our decision to mute speech recognition when producing speech limits our ability to handle pragmatic turn-taking phenomena

like overlapping speech and barge-in. The Virtual Patient system presents an excellent

experimental environment for exploring solutions to these problems, since the present work

establishes effective baselines for comparison to new approaches using the same corpora.

Chapter 7 proposes some improvements for these issues.

Finally, the Virtual Patient presents several opportunities for studying more general

linguistic phenomena. Since members of each class are effectively paraphrases of each

other, the data can be used to learn models for generating and ranking paraphrases (Jin

et al., 2018). Coreference errors are common in all existing approaches, since the models

ignore context; straightforward approaches to context management (e.g., Bapna et al.,

2017) would likely yield a substantial benefit to the learned models. Language generation

also presents some compelling opportunities for user experience improvements, with the

possibility of conditioning responses on information introduced in previous turns, generating

backchannels appropriate for particular speech acts, or conditioning responses relating to a

single propositional truth value relative to the polarity of the query about that proposition.

Some of these possibilities are being explored by colleagues.

3.4 Conclusion

In this chapter, we provided detailed descriptions of the data sets that have been collected

in the Virtual Patient domain, as well as a fairly complete account of the software systems

that make up the dialogue agent that medical students interact with. These components

represent a substantial amount of engineering work in themselves, but they are really just the

framework that uniquely allows for the exploration of several interesting research questions.

This dissertation specifically aims to address several of the problems surrounding data

scarcity, and those experiments are described in the following chapters.

Chapter 4: Hybrid System Deployment and Out-of-Scope Handling

In the last chapter, we provided a detailed description of the Virtual Patient software and

the data that defines the domain, as well as laying out many of the challenges that motivate

the rest of the dissertation. In this chapter, we begin to address and further characterize the

data scarcity problems in this domain. We do so in three main parts: the first is a relatively

straightforward reproduction of previous work (Jin et al., 2017). The primary motivation

for the reproduction is actually logistical: it was necessary to train new models in order

to deploy them in the production system described in the previous chapter, which in turn was necessary for the A/B validation test in the second part of this chapter. Despite this

pragmatic incentive for the reproduction work, it was done within a wider scientific context

of common difficulty in reproducing and replicating published experiments. While the

exact result was not reproduced, the results are wholly consistent with prior results, and the

exercise uncovered some bugs in the original work.

The hybrid system described in the first section of this chapter was designed to address

the combined problems of overall data scarcity and rare classes. It seems to achieve that

goal in cross-validation experiments on the core data set, but we wanted to test if it was

better than a ChatScript-only system when deployed with live users. The difference between

a static set of inputs obtained from live users and a dynamic system interacting with live

users may not be obvious, but since this is a dialogue task and successive inputs are not

truly independent (despite our modeling assumptions to the contrary), a different response

can change the course of a conversation. A fixed data set always asks the same questions, whether the answer is correct or not. Given this potential confound, the second experiment

in this chapter is a randomized A/B test to probe the quality of the hybrid system compared

to that of a ChatScript-only system. The main result is that the hybrid system does indeed

significantly outperform the control, but both give unexpectedly low performance compared

to the results of the first experiment. Analysis explains most of this as due to queries that

fall outside the design scope of the agent, many of which should be identified as such and

handled gracefully.

The third experiment in this chapter, then, aims to improve the identification of these

out-of-scope inputs. We implement a baseline approach that achieves modest improvements

on this specific subproblem, but analysis of the results shows both how difficult the problem

is, as well as a somewhat surprising reason that the approach improves the performance on

in-scope queries as well.11

4.1 Experiment 1: Baseline Reproduction

Note again that the work in this section is a reproduction of work published elsewhere

(Jin et al., 2017), but is presented here to provide a complete context for the rest of the work.

We use the same code base and data, with a different random permutation of the data set, while also incorporating data corrections and bug fixes. In particular, we corrected an error

in the calculation of the CS Log Prob feature (described below), which in turn highlighted

approximately two percent of the core data in which the ChatScript response was labeled

inconsistently with the true label set. Chronologically, these corrections took place after

11This work is under review as a submission to the Journal of Natural Language Engineering, Cambridge University Press.

the experiment reported in Section 4.2, but before the reproduction work described in this

section.

As mentioned above, writing the patterns for the ChatScript-based Virtual Patient is

something of an art form. While the ChatScript pattern syntax provides many advanced

features, developing awareness and understanding of these capabilities, and being able to

effectively utilize them, requires substantial expertise and investment of time. The variability

of natural language, particularly the occurrence of paraphrases, often requires multiple rules

that can trigger the same response. Meanwhile, the patterns can conflict with each other,

and of course, the opportunity for conflicts increases as more patterns are added, leading to

diminishing returns in terms of performance improvement per time invested. Furthermore, while ChatScript offers some spelling correction, pattern matching remains brittle in the

face of some typos that would otherwise be correctly overlooked by a human reader.

These challenges motivate the need for machine learning-based models that can learn

the distinctions between classes from data, rather than requiring expensive human authoring.

Given the challenges of the rule-based approach, we seek to determine the effectiveness of a

more data-driven approach. We hypothesize that a machine learning system may offer some

benefits; specifically, the use of word embeddings can capture similarity and relatedness of words (Mikolov et al., 2013), as observed through co-occurrence statistics in a large corpus of

text. This may allow for external data to effectively supplement the relatively small data set

available for the task in an unsupervised way. More direct supervision, by training a model

to perform the question identification task, can also naturally and automatically uncover the

non-conflicting regularities needed to perform the classification task defined by the data set, without requiring manual analysis on a per-pattern basis, as would otherwise be required with the rule-based approach.

Figure 4.1: Overview of the stacked CNN architecture.

Toward this end, we use a convolutional neural network (CNN) on the query text

input (Kim, 2014) to do the question identification, described in detail below. We find

that ensembling multiple CNN models is important for mitigating noise in the data due

to general data sparsity and label imbalance, and ensembling models trained on different

representations of the same input (word or character sequence) offers further benefit. Finally, we show that the error profiles of the ensembled CNN models and the rule-based ChatScript

system have some complementary properties, and that combining these systems with a

simple binary classifier leads to a substantial reduction in error compared to ChatScript

alone.

The training data used in this section is the core data set; the enhanced data set is

collected from the experiment conducted in Section 4.2.

4.1.1 Stacked CNN

The stacked CNN is an ensemble of ensembles of Text CNNs trained on different splits

of the core data set. Later, we show further benefits from choosing between the output of

the stacked CNN and the output of the original rule-based system.

At the highest level, the stacked CNN is making a classification decision based on the

outputs of two ensembles of CNNs operating on different forms of the same input. One of

these two sub-ensembles operates on the sequence of words in the query sentence (word

CNNs), represented as word embeddings (Mikolov et al., 2013); the other sub-ensemble works on the sequence of characters in the same query (character CNNs), again represented

as embeddings, but in a different space than the words. Each sub-ensemble consists of five

CNNs, which we call the sub-models, described in detail below. Each CNN is trained on a

different split of the data, described in Section 4.1.2. See Figure 4.1 for a graphical overview

of the stacked CNN.

4.1.1.1 CNN sub-model

The structure of each CNN is a one-layer convolutional network with max-pooling, which then feeds into a single fully connected layer, whereafter the input is classified as one

of the 359 classes in a final softmax layer.

For the word CNNs, the input is a sequence of k-dimensional word embeddings, mak-

ing the input a T × k matrix of real values for a sentence of length T. We use pretrained

Word2Vec vectors trained on the Google News corpus,12 thus k = 300. We experimented with GloVe.6B vectors (Pennington et al., 2014), but observed a slight performance degrada-

tion. Word embeddings are held fixed during training.

12https://code.google.com/archive/p/word2vec/

Analogously, input to the character CNN is a sequence of randomly initialized 16-

dimensional character embeddings. In contrast to the word embeddings, these are tuned

during training.

Let W be a set of integers, representing a set of kernel widths. The convolutional layer

is a set of m filters each of size w × k, w ∈ W, for a total of m × |W| kernels. These are all

convolved over the time axis of the input (i.e. the length of the sentence) to produce a feature

map for the sentence. Since the kernels span the full length of the input vector, k, they output

a single value per input element, which takes into account a variable amount of context

depending on the kernel width w, leaving the output in R^(T×m×|W|). For the word CNNs, m is 300 and W = {3,4,5}, while for the character CNNs m is 400 and W = {2,3,4,5,6}. The

output of each kernel is passed through a rectified linear unit (Nair and Hinton, 2010).

We use max pooling over the length of the sentence (Collobert et al., 2011) to represent

the sentence in a fixed size. This fixed dimensional output of the convolutional layer is

passed through a final fully connected linear layer containing as many units as there are

classes, and the final classification is determined by a softmax operation over the output of

this layer.
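The sub-model just described corresponds roughly to the PyTorch sketch below (hyperparameters shown for the word CNN); this is a simplified restatement of the architecture, not the project's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """One-layer Text CNN sub-model: convolutions over the word axis,
    max pooling over the sentence length, dropout, then a linear layer."""

    def __init__(self, emb_dim=300, num_filters=300, widths=(3, 4, 5),
                 num_classes=359, dropout=0.5):
        super().__init__()
        # One Conv1d per kernel width; each filter spans the full embedding dimension.
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, num_filters, w) for w in widths)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(widths), num_classes)

    def forward(self, x):
        # x: (batch, T, emb_dim); Conv1d expects (batch, emb_dim, T).
        x = x.transpose(1, 2)
        # ReLU after each convolution, then max pool over the time axis.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        h = self.dropout(torch.cat(pooled, dim=1))
        return self.fc(h)  # softmax is applied by the loss / at prediction time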

We found it beneficial to employ a few regularization strategies. We use 50% dropout

(Srivastava et al., 2014) at the output of the max-pooling layer, and use a variant of the

max-norm constraint employed in Kim (2014). Specifically, where they renormalize a row in

the weight matrix to the specified max norm only if the 2-norm of that row exceeds the max, we observe a benefit from always renormalizing to the specified norm. This renormalization

strategy came from a reimplementation of Kim (2014).13 We set the max norm to 3.0.

13https://github.com/harvardnlp/sent-conv-torch
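Concretely, the always-renormalize variant amounts to something like the following operation on the fully connected layer's weight matrix after each update (a sketch, assuming a PyTorch module such as the one above):

import torch

@torch.no_grad()
def renorm_rows(weight, max_norm=3.0, eps=1e-12):
    """Rescale every row of `weight` to 2-norm `max_norm`, rather than only
    clipping rows whose norm exceeds it."""
    norms = weight.norm(p=2, dim=1, keepdim=True)
    weight.mul_(max_norm / (norms + eps))

# e.g., renorm_rows(model.fc.weight) after each optimizer step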

4.1.1.2 Ensembling

Since our core data set is relatively small, we validate our training progress using cross- validation development sets; since the label frequencies are also very unbalanced, any given

random split can produce significant differences in test performance. We seek to minimize

this variance in model outputs by ensembling multiple CNN sub-models, each trained on

different splits of the data.

For each input form (words and characters), we train five of the CNN sub-models and

combine their outputs with simple majority voting. During evaluation in the training of

sub-ensembles, ties are arbitrarily broken in favor of the class with the lower index.

We combine the outputs of each ensemble of CNNs using stacking (Wolpert, 1992). This

is essentially a weighted linear interpolation of system outputs, where the weights assigned

to each system are trained on the data. Thus the final output of the stacked CNN, ŷ_t, is

    ŷ_t = softmax(α_w ŷ_{e,w} + α_c ŷ_{e,c})    (4.1)

where ŷ_{e,w} is the output of the word ensemble, ŷ_{e,c} is the output of the character ensemble, and α_w and α_c are the trained coefficients for the word and character ensembles, respectively.
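The vote counting and stacking step can be sketched as follows; shapes and names are illustrative, and the sub-model predictions are assumed to be class indices.

import torch
import torch.nn as nn
import torch.nn.functional as F

def vote_counts(predictions, num_classes):
    """predictions: (num_submodels, batch) tensor of predicted class indices.
    Returns (batch, num_classes) unnormalized vote counts for one sub-ensemble."""
    one_hot = F.one_hot(predictions, num_classes).float()  # (S, B, C)
    return one_hot.sum(dim=0)                              # (B, C)

class Stacker(nn.Module):
    """Learned interpolation of the word and character ensembles, as in Eq. (4.1)."""

    def __init__(self):
        super().__init__()
        self.alpha_w = nn.Parameter(torch.tensor(1.0))
        self.alpha_c = nn.Parameter(torch.tensor(1.0))

    def forward(self, votes_word, votes_char):
        return F.softmax(self.alpha_w * votes_word + self.alpha_c * votes_char, dim=1)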

4.1.1.3 Parameters

Any words that do not appear in the set of pretrained embeddings are assigned random values with each dimension drawn from the distribution Unif (−0.25,0.25). Care was taken

to ensure that this distribution falls within the range of the rest of the embeddings, as values

outside this range were seen to reduce performance of the word CNNs.

Convolutional kernels in the word CNNs are initialized with Unif (−0.01,0.01), and

the linear layer is initialized using a Normal distribution with zero mean and variance of

1 × 10^-4. In the character CNNs, weights for both the convolutional kernels and linear layer are initialized by drawing from Unif(−1/√n_in, 1/√n_in), where n_in is the length of

the input, as recommended by Glorot and Bengio (2010). The character embedding matrix

is initialized from N(0,1), and all bias terms (for both the word and character CNNs) are

initialized to zero.

4.1.2 Training

We use Adadelta (Zeiler, 2012) to optimize submodel weights, using recommended

parameters (ρ = 0.9, ε = 1 × 10^-6, initial learning rate = 1.0), and a cross-entropy loss

criterion.

Since data are limited, we use 10-fold cross-validation to train each CNN sub-model

and validate performance; thus, for each of the ten test folds, we train five CNN sub-models,

reporting average performance across folds. Each of the five sub-models in a fold is trained

on a different training/development split of the 90% of data remaining after the test fold is

held out, with the development set comprising the same number of examples as the test set,

and the remaining 80% used for training. Importantly, each training set is supplemented with the label sentences—that is, the canonical example—for each class. This ensures that

no class is completely unseen during training, which is otherwise likely, given the label

imbalance of the data.
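One way to realize this split-and-supplement scheme is sketched below; the shuffling and bookkeeping details are illustrative rather than a reproduction of the original scripts.

import random

def make_fold_splits(examples, canonical_examples, num_folds=10,
                     submodels_per_fold=5, seed=0):
    """Yield (train, dev, test) splits: one held-out test fold, a dev set of the
    same size drawn from the remainder, and a training set supplemented with the
    canonical label sentences so that no class is entirely unseen."""
    rng = random.Random(seed)
    data = examples[:]
    rng.shuffle(data)
    fold_size = len(data) // num_folds
    for i in range(num_folds):
        test = data[i * fold_size:(i + 1) * fold_size]
        rest = data[:i * fold_size] + data[(i + 1) * fold_size:]
        for _ in range(submodels_per_fold):
            rng.shuffle(rest)
            dev, train = rest[:len(test)], rest[len(test):]
            yield train + canonical_examples, dev, test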

We train for 25 epochs with minibatches of size 50, and take the last model with the best

dev set performance for test validation. Note that ensembling the sub-models by majority voting is a non-differentiable function, so each sub-model is trained independently of the

others. The performance of the sub-ensembles is validated by using the single majority

decision of the vote, but the input to the stacking network is the vector of vote counts for all classes, which is effectively an unnormalized distribution. The stacking network is trained by holding the sub-model parameters fixed (again, voting is non-differentiable) and training for another 25 epochs. The optimizer is the same as that used for the sub-models above, i.e. Adadelta with recommended parameters.

Table 4.1: Mean accuracy across ten folds, with standard deviations.

              Single             Ensemble
ChatScript    80.93%             n/a
Baseline      76.93 ± 1.58%      n/a
Word          76.17 ± 1.23%      76.95 ± 1.87%
Character     75.28 ± 2.08%      77.09 ± 1.53%
Stacked       n/a                78.01 ± 1.62%

4.1.3 Baseline

As a baseline comparison for the stacked CNN, we train a simple maximum entropy

(logistic regression) classifier implemented using SciKit-learn (Pedregosa et al., 2011) with

n-grams as input features, which is very similar to the method of DeVault et al. (2011b).

Specifically, we use 1-, 2-, and 3-grams of both the raw word forms and their Snowball-

stemmed (via NLTK; Bird et al., 2009; Porter, 2001) equivalents, as well as 1- through

6-grams of characters. The model is optimized using the stochastic average gradient for up

to 100 epochs. We use the same cross-validation strategy and the same splits as described

above.
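A baseline along these lines can be assembled from standard scikit-learn and NLTK components, as in the sketch below; the exact vectorizer settings are assumptions rather than the original configuration.

from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer

stemmer = SnowballStemmer("english")

def stem_all(texts):
    # Stemmed copy of each query, so word n-grams are also taken over stems.
    return [" ".join(stemmer.stem(w) for w in t.split()) for t in texts]

features = make_union(
    CountVectorizer(ngram_range=(1, 3)),                   # raw word 1- to 3-grams
    make_pipeline(FunctionTransformer(stem_all),
                  CountVectorizer(ngram_range=(1, 3))),    # stemmed word 1- to 3-grams
    CountVectorizer(analyzer="char", ngram_range=(1, 6)),  # character 1- to 6-grams
)

# Stochastic average gradient solver, capped at 100 epochs.
model = make_pipeline(features, LogisticRegression(solver="sag", max_iter=100))

# model.fit(train_queries, train_labels); model.predict(test_queries)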

4.1.4 CNN Results

We report performance in terms of overall accuracy, averaged over the ten folds. Results,

including standard deviations, are summarized in Table 4.1. We present a comparison of the

performance of single models in the cross-validation setup vs. the full ensembles, to clearly

illustrate the benefit of ensembling in our limited data regime. The Word and Character

entries are for the corresponding components of the full stacked CNN. We again note that

these numbers vary slightly from previously published work (Jin et al., 2017) due to data

corrections, different random cross-validation splits, and random model initializations, but

all of the same trends are observed, so we take this as a successful, if not exact, reproduction.

The naïve baseline exhibits surprisingly strong performance, which proves to be difficult

to beat. Indeed, none of the single models are able to do so, which, along with the high vari-

ance across folds, motivates the ensembling approach. We largely attribute the performance

of the baseline to the scarcity of the data, and expect, based on further analysis below, that

more data would widen the gap between it and the stacked CNN.

The ensembling strategy generally seems to boost performance, although the effect is

larger for the character-based models than for the word-based models. Individually, the word

and character ensembles exhibit similar performance to the maximum entropy baseline, but

stacking the word and character sub-ensembles gives a significant gain over the baseline

(p = 0.0074, McNemar’s test). This suggests that the word and character sub-ensembles are

picking up on complementary information.

Some examples suggest that the stacked CNN is able to take advantage of similarity

information that is latent in the word embedding space. For instance, where the maximum

entropy baseline chooses the label, “Do you have any medical problems?” in response

to the query, “Has your mother had any medical conditions?” the stacked CNN correctly

chooses, “Are your parents healthy?” Presumably similar embeddings for “mother” and

“parent” should help here. Another benefit of the stacked CNN seems to be robustness to

Figure 4.2: Accuracy of the tested models by label quintiles.

some particularly egregious misspellings, identifying the correct class for “could youtell me more about the pback pain,” for example.

Despite improving over the baseline, the stacked CNN still fails to outperform ChatScript, highlighting the benefit of rule-based approaches in the face of limited data. However, a closer look at the accuracy on specific classes when grouped by frequency, shown in

Figure 4.2, reveals some important trends. First, the stacked CNN consistently performs slightly better than the maximum entropy baseline at all label frequencies. More interestingly, for the 60% of examples comprising the most frequently seen labels, the learned models outperform ChatScript. For the least frequently seen data, the learned models exhibit a substantial degradation in accuracy, especially relative to ChatScript, whose accuracy on rare labels stays much closer to its accuracy on frequent labels, although it too degrades somewhat at lower label frequencies.

This difference in error profiles between ChatScript and the stacked CNN suggests that some combination of both systems may yield further improvements in accuracy. Indeed, one or both of the systems answer correctly in 92.45% of examples, establishing a rather

Table 4.2: Features used by the chooser.

Feature      Description
Log Prob     The log of the probability of the class chosen by the stacked CNN.
Entropy      The entropy of the distribution over the classes, as determined by
             the stacked CNN.
Confidence   The average over sub-models of the unnormalized score for the chosen
             class.
CNN Label    The one-hot label predicted by the stacked CNN.
CS Label     The one-hot label matched by ChatScript, or zero if there was no match.
CS No_ans    Boolean indicator that is true if and only if ChatScript did not find
             an answer.
CS Log Prob  The log probability of the class chosen by ChatScript, according to
             the stacked CNN.

impressive performance for a hypothetical oracle that can always correctly choose which

system to trust. This oracle performance motivates the development of a simple model that

can decide to override ChatScript’s answer with the stacked CNN’s answer, in the next

section.

4.1.5 Hybrid System

Having seen that ChatScript and the stacked CNN exhibit complementary error profiles, we seek to determine to what extent we can practically benefit from a system utilizing both

approaches. To do this, we construct a simple binary logistic regressor using SciKit-learn

(Pedregosa et al., 2011) to choose either the rule-driven or data-driven prediction, dependent

upon some features that can be extracted from each system. We (unimaginatively) call

the logistic model the chooser, and hereafter refer to the overall system (the stacked CNN

combined with ChatScript using the chooser) as the hybrid system.

Obviously, the correct label is not known at test time, so the chooser must use features of the input to make its decision. However, the chooser does also have access to meta-information about the decisions made by the stacked CNN and ChatScript, such as confidence of the decision, entropy of the distribution of output labels, or agreement between systems. Again note, though, that ChatScript, as a rule-based system, does not provide any statistical information about its decision, so meta-information is more limited there. Based on our experiments with the available features, we found those enumerated in Table 4.2 to be most effective.
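As an illustration, the features of Table 4.2 might be assembled from the stacked CNN's per-class scores and the two systems' predictions roughly as follows; this is a sketch with hypothetical names, and details such as the softmax normalization and the default value when ChatScript has no match are assumptions.

```python
import numpy as np

def chooser_features(cnn_scores, cnn_label, cs_label, num_classes):
    """cnn_scores: the stacked CNN's (unnormalized) per-class scores for one query;
    cs_label: ChatScript's class index, or None if it produced No Answer."""
    probs = np.exp(cnn_scores - np.max(cnn_scores))
    probs /= probs.sum()                              # distribution over classes
    log_prob = np.log(probs[cnn_label])               # Log Prob
    entropy = -np.sum(probs * np.log(probs + 1e-12))  # Entropy
    confidence = cnn_scores[cnn_label]                # Confidence (stand-in for the sub-model average)
    cnn_onehot = np.eye(num_classes)[cnn_label]       # CNN Label
    cs_onehot = np.zeros(num_classes)                 # CS Label (all zeros if no match)
    cs_no_ans = 1.0 if cs_label is None else 0.0      # CS No_ans
    cs_log_prob = 0.0                                 # CS Log Prob (default value is an assumption)
    if cs_label is not None:
        cs_onehot[cs_label] = 1.0
        cs_log_prob = np.log(probs[cs_label])
    return np.concatenate([[log_prob, entropy, confidence, cs_no_ans, cs_log_prob],
                           cnn_onehot, cs_onehot])
```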

For the majority of examples in the data set (over 66%), both systems are correct, so choosing either system would be acceptable; similarly, in approximately 7.5% of examples, either choice would be wrong. Conversely, when both systems agree, they are correct

98% of the time. Therefore, we only train the chooser on the examples where the systems disagree, taking agreed-upon classes as the answer in every case. This does, however, raise the question of which system’s answer to use as the default to be overridden when the chooser deems it appropriate, and further, how the choice of that default affects the accuracy on unseen data. The main difference between the systems in this regard is that ChatScript provides a default “I don’t understand” response, in the event that it does not match any available patterns, asking the user to rephrase. We call this the No Answer response. On the other hand, the stacked CNN, as discussed in Section 3.1, is effectively only trained on positive classes. From a user’s perspective, No Answer is a system failure: the user asked a question and did not receive an appropriate response. We recognize, however, that sometimes receiving no answer is better than receiving an incorrect answer; such an answer could give the illusion of understanding while providing an unintended response to the actual question asked, potentially leading to further unanswerable questions, for example.
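The chooser itself then reduces to a binary logistic regression fit only on the disagreement cases, along the lines of the sketch below, which reuses the feature function sketched above; cs_preds, cnn_preds, cnn_scores, and gold_labels are hypothetical names.

```python
from sklearn.linear_model import LogisticRegression

# Keep only the examples on which ChatScript and the stacked CNN disagree.
disagree = [i for i, (cs, cnn) in enumerate(zip(cs_preds, cnn_preds)) if cs != cnn]

X = [chooser_features(cnn_scores[i], cnn_preds[i], cs_preds[i], num_classes) for i in disagree]
# Target: 1 if the stacked CNN's answer is correct (override the default), 0 otherwise.
y = [1 if cnn_preds[i] == gold_labels[i] else 0 for i in disagree]

chooser = LogisticRegression().fit(X, y)
```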

Table 4.3: Hybrid System results. ChatScript and stacked CNN baselines are repeated from above. “Conf” is using only the Confidence feature from Table 4.2; “base” is using Log Prob, Entropy, and Confidence. CS = ChatScript, CNN = stacked CNN.

System       Features  Default  Accuracy%  No Answer%
ChatScript                      80.93      10.88
stacked CNN                     78.01      0.00
Hybrid       conf      CS       83.58      6.56
Hybrid       base      CS       85.43      5.52
Hybrid       all       CS       89.40      3.23
Hybrid       all       CNN      89.97      0.00
Oracle                          92.45      4.16

For these reasons, we examine the effects of the two different default choices, including

the effect on frequency of No Answer responses. The performance of the hybrid system is

evaluated using ten-fold cross validation.

Results of the hybrid system, including overall response accuracy and the percentage

of No Answer responses, are summarized in Table 4.3. Note that, consistent with previous

tables, this is the question classification accuracy, not the accuracy of the decision of which

system to choose. We include the results of ablating some of the features used by the

chooser. Statistical measures from the stacked CNN alone serve to improve performance

over the ChatScript baseline fairly substantially, but the more detailed information about which classes were chosen by each model, and the extent of the two models’ agreement,

provide a dramatic boost in performance. Using all features while defaulting to ChatScript’s

response constitutes a 44% relative reduction in error from ChatScript alone, and recovers

almost 74% of the oracle performance.

Using the stacked CNN’s response as the default yields a slightly higher overall accuracy,

but since the stacked CNN never gives the No Answer response, the whole system never

does either, in this condition. This means that for the approximately 3% of answers that are

incorrect but have the relatively benign No Answer response with the ChatScript default,

most of them are still incorrect under the stacked CNN default, but potentially misleadingly

so. We assume this trade-off favors using the ChatScript default in further experiments.

As we will show in those experiments, it turns out to be very likely that some No Answer

responses are necessary, so their complete absence is probably undesirable.

4.2 Experiment 2: Prospective Replication

Having established the benefit of a hybrid data- and rule-driven approach on a static data

set, we now seek to probe how effective the hybrid system is in live conversations. In other words, we have seen retrospectively how, all else being equal, a hybrid system might have

improved the performance for a specific set of queries from a specific cohort of medical

students who were actually interacting with a different system. We recognize, however, that

all else would not be equal—better responses, or even just different ones, can affect the

course of a conversation, and that could affect the performance in ways that are difficult to

predict. Therefore, we deployed the hybrid system as part of a controlled experiment in which a new cohort of medical students were randomly assigned to interact with the hybrid

system or a system using only ChatScript. In this way, we can prospectively determine

the efficacy of the system. The production deployment also serves as a verification of the various usability factors that must be considered for modern human-computer interaction,

given the latency and computational demands of the data-driven model.

Even though live interaction introduces unknown variables, we expect that the core data set is representative of the distribution of questions that should be asked of our virtual patient; thus, this is essentially a replication experiment, and we hypothesize that the deployed system should corroborate our main findings from the first experiment. In particular, the hybrid system should be more accurate than ChatScript alone, and we should see comparable performance in terms of absolute accuracy. In the following sections, we describe the experimental design that we use to test our hypotheses, results of the experiment, and a discussion of those results. The architecture of the deployed system used for this experiment is described in detail in Section 3.2.

4.2.1 Experimental Design

Our experiment aims to determine if the performance improvement on the fixed data set translates to users interacting with a system dynamically. To do this, we prepared a simple blinded experiment, where, upon starting a conversation, students are randomly assigned to either use the hybrid system or a ChatScript-only version of the virtual patient, with equal probability. Use of the virtual patient application was completely voluntary, and presented to students as an opportunity to practice their interviewing skills prior to an exam with a standardized patient. Because of the voluntary participation, the population as a whole may reflect some self-selection bias, but the test and control groups should be affected equally, due to the random assignment at the beginning of each conversation.

After the participation period, conversation logs were annotated according to the procedure described in Section 3.1.1. Correctness of the stacked CNN’s response was determined by its match to the annotation of the correct ChatScript response (after collapse of semantically similar classes, see Section 3.1). This seems obvious and is reasonable, but we note

Table 4.4: Summary statistics of collected data.

                     CS-only  Hybrid  Overall
Conversations        74       91      165
Turns  Total         2,497    2,799   5,296
       Mean          33.7     30.7    32.1
       Median        32       30      31.0
       Std Dev       20.8     20.5    20.7

that it introduces some bias in the annotations, since in some cases multiple classes may

provide valid answers to the question asked. A very simple example is the question, “Are you single or married?” ChatScript interprets the question as, “Are you single?” while the

stacked CNN interprets it as, “Are you married?” The separate labels exist for the possi-

bility of producing a response that matches the polarity of the question, but for simplicity,

both correctly produce the response, “I am single.” In such cases, ChatScript’s acceptable

response would always be favored over the stacked CNN’s acceptable response.

4.2.2 Results and Discussion

Data were collected over the course of approximately two weeks, resulting in well over

5,000 individual conversation turns. Some users had multiple conversations with the patient.

After collection, conversations which were manually judged to have been cursory probes of

the system’s capability were removed, resulting in a total of 5,296 turns in 165 conversations, which constitute the enhanced data set. Lengths of individual conversations range from one¹⁴

to 103 turns. Summary statistics are provided in Table 4.4, in which the control group is

labeled as CS-only, the test group is labeled as Hybrid, and statistics for the whole enhanced

¹⁴ Some students quit conversations and later returned to ask single legitimate questions, which were kept in the data.

Table 4.5: Raw accuracies of control and test conditions, and sub-components of the hybrid system (CS = ChatScript, CNN = stacked CNN).

          Control           Test
          Acc%    NoAns%    Acc%    NoAns%
Oracle    73.45   11.69     85.21   5.64
CS        73.45   11.69     74.85   12.22
CNN       —       —         67.56   0.04
Hybrid    —       —         80.03   6.57

data set are provided in the Overall column. Although the Hybrid conversations were three

turns shorter on average, the difference in conversation length between the CS-only and

Hybrid data is not statistically significant (p = 0.38, Mann-Whitney U-test).
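This comparison corresponds to a standard two-sample test on the per-conversation turn counts, e.g. (list names hypothetical):

```python
from scipy.stats import mannwhitneyu

# Turn counts per conversation in each condition.
stat, p_value = mannwhitneyu(cs_only_lengths, hybrid_lengths, alternative="two-sided")
```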

Raw accuracy results are summarized in Table 4.5. Note that the accuracy figures include

some overlap (1.6 percent overall) with the percentage of No Answer responses, despite our

previous assumption that No Answer is always incorrect; this is an important point that we

focus on later. Otherwise, a few things are obvious from the summary results. First, and

most positively, the hybrid (test) system outperforms the ChatScript-only (control) system

significantly (p = 1.9×10⁻¹⁵, one-sided Pearson’s χ²). Second, the stacked CNN performs worse than the ChatScript component of the combined system, but still provides a fairly

large benefit when combined with ChatScript using the chooser. Of the approximately 10%

absolute accuracy improvement that the Oracle results indicate is possible, about half is

recovered by using the chooser. Third—disappointingly—in terms of absolute performance,

all systems seem to perform much worse than hoped for based on the results of the first

experiment. Finally, and somewhat more subtly, the ChatScript component of the combined

system outperforms the system relying only on ChatScript (barely reaching significance at

p = 0.049 in a one-sided Pearson’s χ² test). We provide further analysis of the collected

data below, which aims to explain these phenomena.

Perhaps the most obvious place to begin accounting for the difference in absolute

performance from the first experiment is in differences in the distribution of labels between

the core data set and the enhanced set. The enhanced data set is labeled with 77 ChatScript

patterns that never appeared in the core data set. These appear as the labels in 142 turns. For

the purpose of evaluating the performance of the stacked CNN, we map these into a single

unknown class, since it could never have produced these classes.

Most of the unseen classes originate from an unanticipated team synchronization issue.

As mentioned in Section 3.2, the ChatScript instance and stacked CNN deployment were

developed by separate teams for this experiment, and classes were added to the ChatScript

instance in an effort to improve performance, unbeknownst to the team deploying the stacked

CNN. From the perspective of maximizing performance of the stacked CNN, this would

appear at first blush to be an embarrassing source of what might be called engineering noise

in the model development process. However, the hybrid system answers these questions

correctly 60.7 percent of the time in the test condition (N = 61), with only one instance of

the chooser choosing the stacked CNN when ChatScript answered correctly. This is less

accurate than ChatScript’s average for the least frequent quintile of data, but certainly far

better than the zero percent maximum performance of the stacked CNN on these examples.

Thus, we count this as a serendipitous illustration of the benefit of a hybrid rule- and

learning-based approach to the natural language understanding component of a dialogue

system. It allows us to relatively easily handle completely new classes with a baseline-level

accuracy, without even having to retrain the machine learning components.

An additional source of unseen data was the Negative Symptoms class, which serves to

trigger a default response from the patient when asked about a body system that has no impact on

the patient’s present illness — e.g., “Does your ear itch?” — to which the patient should

respond with something like, “No, no problems with that.” As mentioned in Section 3.1,

these examples had been intentionally removed from the core data set, since members of

this class are not necessarily paraphrases, and a pairwise matching method would not be

expected to succeed. The core data set was developed with this approach in mind, and when

the multiclass classification approach was seen to yield better results, the omission of this

class was just incidentally not reconsidered. Bringing this oversight to light is certainly an

argument in favor of conducting the replication study.

Besides such completely unseen classes, the enhanced data set reflects some substantial

shifts in the frequency distribution of the seen classes. Of the remaining 359 classes in

the core data set, only 261 appear in the enhanced set. To give a sense of the differences,

Table 4.6 shows the ten largest changes in class frequency from the core data set to the

enhanced set.

The No Answer class is by far the biggest difference between the core and enhanced

data sets. As described in Section 4.1, this is the default class, which ChatScript produces when a question strays outside of the dialogue agent’s knowledge base. Similar to the

Negative Symptoms class, it is a negative class, defined by the absence of a match to a

positive class, rather than on any specific content of the query. While the core data set does

contain examples of the No Answer class, it is almost never chosen by the stacked CNN in

practice. Effectively, the only way for the hybrid system to produce a No Answer response

is for the chooser to choose ChatScript when ChatScript fails to match any other pattern.

In the enhanced data set, there are two main sources of No Answer annotations. The first

Table 4.6: Top changes in class frequency by magnitude of difference between core and enhanced data sets. Change is the count of examples of the class in the enhanced set minus the count in the core set.

Label                              Change
no answer                          441
any other problems                 184
what should i call you             -156
unknown                            142
tell me more about your back pain  125
describe the pain                  95
are you taking any medication      80
can you rate the pain              78
how much do you work               -76
are you having any other pain      -75

source is questions which the patient should be able to answer, but which had previously

not been considered, or had otherwise not been prioritized by the content author. We refer

to these types of queries as being currently out-of-scope. The second source is questions which are entirely outside the scope of the educational objective, e.g., “What would you

consider your biggest strengths?” We call these permanently out-of-scope, which generally

overlaps with what would be called out-of-domain in dialogue literature. Note that while we make a conceptual distinction between currently and permanently out-of-scope for the

purpose of discussion, our annotation scheme only identifies out-of-scope queries generally,

and assigns them the No Answer label. Unlike what had been generally assumed in the first

experiment, the No Answer response is the correct response in these situations, not just a

least-risk fallback strategy. But since the system was not trained to identify such situations,

this class accounts for a large volume of the errors. Addressing this issue is the motivation

for our work in Section 4.3, although it turns out to be a very difficult problem to handle

effectively.

The remaining differences in the label distribution between the data sets are harder to

explain definitively. The data in each set is collected from different cohorts of medical

students; as such, they may have received slightly different training, either from natural variation among instructors or from intentional changes in emphasis on particular topics

from one year to the next. The instructions given to students to introduce the software from

one year to the next were also not tightly controlled, so may have made certain queries

seem redundant. For example, it does not make much sense to ask a patient’s name if you

have instructions telling you their name. ChatScript pattern definitions can also change, which may introduce a bias for some labels, given the annotation methods. That is, if two

responses are appropriate to a particular query, and a change to a template suddenly makes

one occur more frequently than the other, the annotation methods will probably not flag that

response as incorrect, and thus the true label will be annotated as whatever the response was.

We can examine accuracy of the hybrid system as a function of label frequency by

quintiles, as we did above in the comparison of ChatScript and the stacked CNN; however,

given the distributional differences between experiments, this raises the question of which

label/quintile assignments to use. A comparison is presented in Figure 4.3. The first chart

shows the accuracy according to the same quintile assignments as used in Experiment 1,

and the second shows a full recalculation of the quintiles according to the data collected in

the enhanced data set. The first chart is broadly consistent with the previous experiment,

although it shows a big drop in performance in the fourth-most-frequent quintile. That the

trend is preserved is reassuring; the models generally exhibit consistent performance on the

same semantic content. The second chart, however, shows an alarming drop in performance

Figure 4.3: Accuracy of the hybrid system and constituent components by label quintiles. “Core dataset quintiles” have the same label membership as in the previous experiment; “raw enhanced dataset quintiles” are the quintiles according to the frequencies observed in Experiment 2. “Fair enhanced dataset quintiles” remove No Answer labels and other unseen labels from consideration.

on the most frequent labels in the enhanced data set. This is consistent with the drop in

performance in the fourth quintile of the core data set labels—the No Answer class is among

the most frequent in the enhanced data set, and is almost always incorrectly identified. In

the core data set, No Answer occurred in the fourth quintile, explaining the performance

drop for that data according to the core data set quintile assignments. If we omit the No

Answer class and other unseen classes from the analysis, but otherwise assign quintiles

according to label frequency in the enhanced data set, we see a performance profile generally

consistent with that shown in Figure 4.2 in the previous experiment, suggesting that the

major distributional differences between the data sets are fairly localized to the occurrence

of those few unseen classes.

Given the expected effects of the distributional differences discussed above, we here

consider the results when omitting certain examples, to facilitate a fairer comparison to

previous results. Primarily we want to see how the system performs when omitting unseen

Table 4.7: Adjusted accuracies of control and test conditions, and sub-components of the combined system.

               Unseen labels omitted                    Unseen and NoAns labels omitted
               Control (N = 2399)  Test (N = 2700)      Control (N = 2157)  Test (N = 2482)
               Acc%     NoAns%     Acc%     NoAns%      Acc%     NoAns%     Acc%     NoAns%
Oracle         74.70    10.38      85.63    5.15        81.08    9.55       91.42    3.87
ChatScript     74.70    10.38      74.89    11.96       81.08    9.55       79.73    11.28
Stacked CNN    —        —          70.04    0.04        —        —          76.15    0.00
Hybrid System  —        —          80.33    6.19        —        —          85.74    5.08

classes (including the Negative Symptoms class) from the result set, and see the magnitude

of the effect of the No Answer class, given its huge change in frequency. These results are

given in Table 4.7. As a point of clarification, note that omissions made for the sake of this

analysis are based on the correct labels for individual examples, not on the responses. While

the overall performance of the hybrid system under the reasoned adjustments does not reach

the same level as the prior experiment, the control experiment is much more consistent with

it, and it is very plausible to attribute the remainder of the difference in the stacked CNN’s

performance to the other distributional differences that manifested in the enhanced data set.

One curious aspect of the results in Table 4.7 is that when omitting the No Answer class,

ChatScript in the control condition outperforms ChatScript in the test condition, where it

had trailed when including all data. This mostly follows from the fact that the No Answer

class is a larger proportion of the CS-only (control) data set.

Another available measure of the quality of the systems is the number of times a user

has to rephrase a query to get a relevant answer. A natural response to a misunderstood

question, whether it manifests as an incoherent reply or a request to rephrase, is to try

again; anecdotally, we indeed find that users most often follow this pattern. Thus, it is

easy to estimate performance in this regard, by simply finding contiguous queries with

the same labels in the data set. Doing so, we find 249 repeated queries in the control

set of 2,497 turns, and 209 repeated queries in the test set of 2,799 turns. This turns

out to be a statistically significant reduction in repeated queries due to the hybrid system

(p = 1.9×10⁻⁶, Pearson’s χ²). This may explain some of the difference in raw performance

of the ChatScript subsystem between the control and test conditions.
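One plausible way to compute this estimate and the associated test is sketched below, assuming each conversation is available as a list of annotated class labels in turn order; the names and the exact form of the contingency table are assumptions.

```python
from scipy.stats import chi2_contingency

def count_repeats(conversations):
    """Count turns whose annotated label repeats the label of the immediately preceding turn."""
    return sum(sum(1 for prev, cur in zip(labels, labels[1:]) if prev == cur)
               for labels in conversations)

control_repeats = count_repeats(control_convs)   # 249 repeated queries in 2,497 turns
test_repeats = count_repeats(test_convs)         # 209 repeated queries in 2,799 turns

table = [[control_repeats, 2497 - control_repeats],
         [test_repeats,    2799 - test_repeats]]
chi2, p, dof, expected = chi2_contingency(table)
```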

4.3 Experiment 3: Out-of-Scope Improvements

The analysis of the previous experiment offers a number of promising avenues for

improving the system going forward. The most obvious deficit that it brought to light is that

out-of-scope questions are far more frequent than the core data set reflects. Furthermore, while we had assumed that a response of “I don’t understand that question” was always,

in some sense, incorrect, it is in fact an appropriate response from a back pain patient to a

query about, e.g., their greatest strengths. The data collected also provide an opportunity to

improve overall performance while examining the interplay between the components of the

hybrid system in the face of additional training data.

In analyzing the results with our domain expert, we discovered that there were subtle

but important distinctions in the types of questions currently unanswered by the system.

The easier distinction to draw is between in-domain and out-of-domain data, which is a well explored problem in dialogue systems; for the latter type of question a default “I don’t

understand” answer suffices. However, there is a more subtle set of questions that are

relevant to the medical domain, but currently not covered by the content in the system; we

deem these out-of-scope. Some types of questions were determined to be not germane to

the particular patient case, and were intentionally left to the default answer (permanently

Figure 4.4: A simple illustration of scope and domain boundaries as discussed in this section.

out-of-scope), whereas other questions were relevant to the case (currently out-of-scope). A simple illustration of the relationships between scope and domain boundaries can be seen in

Figure 4.4, with more discussion below.

The model changes that we propose in this section are mainly focused on improving performance for out-of-scope queries, i.e. the No Answer class, which we are able to do with limited success. The problem of accurately identifying out-of-scope queries is especially challenging for several reasons. For one, the distinction between in- and out-of-scope is sometimes fairly arbitrary in our case, having little to do with the inherent semantics of the question. For example, a patient’s marriage status can be quite relevant to clinical outcomes, and our Virtual Patient can accurately answer the question, “Are you married?” with the response, “I’m single.” However, the question, “Have you ever been married?”—implying the possibility of that single status being due to divorce, another clinically relevant social

status—was deemed by the medical content author to not be worth the development cost to

distinguish, due to subtle syntactic differences, difficult polarity issues, etc., that can arise

in various paraphrases of the two questions. Thus, the boundary between in- and out-of-

scope queries can be quite narrow and convoluted in our application domain. Furthermore,

the in-scope (meta-)class comprises a few hundred distinct query classes, making it fairly

heterogeneous, and thus harder to distinguish from out-of-scope data, particularly given the

relatively small size of the data set and the dearth of adequately similar out-of-scope data.

The related problem of distinguishing in- and out-of-domain data when only in-domain

data is available is well-studied in the literature, and known by several variants falling under

the banner of One-Class Classification. Some of this work is reviewed briefly in Chapter 2.

Text-based techniques utilizing unlabeled data augmentation might be appealing for our

use case, except for the aforementioned narrow, convoluted boundary between in- and

out-of-scope data. We expect that adequately characterizing that boundary would require a

large amount of “in-domain, but currently out-of-scope” data, and such data is hard to come

by. We attempted to augment the Virtual Patient data with several data sets to learn a model

of the out-of-scope boundary, but available data sets are too easy to distinguish from the

Virtual Patient data simply on the basis of obvious syntax and vocabulary features, so those

experiments did not work, and are not presented here. Techniques depending on a robust

characterization of the distribution of the positive class are similarly expected to perform

poorly, due to the heterogeneity and relatively small size of the positive data, also mentioned

above. For these reasons, we attempt to improve recognition of the No Answer class largely within our existing framework, by treating it as just another query class and augmenting our

data set with more representative examples. We recognize that more sophisticated one-class

76 classification techniques may further improve performance over what we show here, but we

generally consider this work the establishment of a baseline for further research.

Thus, we propose some simple changes to our existing system with the aim of improving

our handling of the No Answer class, which we can evaluate with the enhanced data set.

The division of the enhanced data set into the CS-only and hybrid portions leaves us with

a natural partition to use for testing, and we use the hybrid data set to evaluate our model

changes. This leaves us with the CS-only data set to explore the benefits of additional

training data in our hybrid system, which makes the training data more reflective of the

distribution of classes seen in the test set. We call the union of the CS-only and core data

sets the augmented data set.

Architecturally, we focus our attention on the chooser component of the system. We

reason that since the No Answer class is a negative class — defined only by the lack of

relevant content — that the stacked CNN is likely to fail to generalize to the full variability of

the class, absent some sophisticated training techniques. The chooser, on the other hand, has

access to statistics produced by the stacked CNN, and may more easily be able to recognize

the situation that the stacked CNN is not highly confident in any particular positive class.

Concretely, we extend the binary chooser to a three-way choice among the response produced by ChatScript, that of the stacked CNN, and the No Answer class, experimenting with two different labeling schemes.

We also explore additional features for input to the chooser.

4.3.1 Models

We implement two variations on a three-way chooser and evaluate their performance

trade-offs. The first is a fairly straightforward extension of the original model to a multiclass

setup, while the second employs a multilabel setup. The multilabel model was developed in

recognition of the fact that sometimes multiple choices are equally acceptable, since either

the input models or the chooser can produce an appropriate No Answer response. We use

the same logistic regression model from SciKit-learn (Pedregosa et al., 2011) as used in

the previous experiment, using the built-in one-vs-rest classification scheme to handle the

three-way multiclass classification, and using a binary relevance strategy (Boutell et al.,

2004) in the multilabel case, via the OneVsRestClassifier class. We use LIBLINEAR (Fan et al., 2008) as the solver. Details of training and model evaluation are described below.
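In scikit-learn terms, the two variants can be instantiated roughly as follows; hyperparameters beyond the solver are left at their defaults here, which is an assumption.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Three-way exclusive (multiclass) chooser: built-in one-vs-rest over the three choices.
exclusive_chooser = OneVsRestClassifier(LogisticRegression(solver="liblinear"))

# Three-way multilabel chooser: binary relevance, i.e. an independent binary classifier
# per choice; OneVsRestClassifier provides this when fit with a 0/1 indicator matrix
# of shape (n_examples, 3).
multilabel_chooser = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
```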

We experiment with the addition of new input features and particular combinations of

them. In general, the new features attempt to infer additional information about ChatScript’s

“opinion” of the stacked CNN’s output. As discussed in Section 4.1, ChatScript does not

output a probability distribution for its possible responses, so any kind of measurement of

the confidence ChatScript has in its response, or possible alternative responses, must be

inferred in some way. ChatScript decides its output by searching over possible responses,

and returning the first one that matches. The search order is determined implicitly, by

ChatScript’s search priorities and the author’s organization of the patterns. For example, a

pattern within the current topic, as the author defines the topic, will always match before

a pattern in another topic. Each pattern is essentially a regular expression that may match

an input sentence; we call a set of patterns that all produce the same response a class,¹⁵

consistent with the definition of a class learned by the stacked CNN, after the collapse of

certain semantically similar classes; and a topic is an author-defined collection of notionally

related templates.

¹⁵ ChatScript documentation refers to this as a template.

One source of additional information that can be made available from ChatScript is a

ranked list of all of the patterns that would have matched the input, if the search had not

stopped after the first hit. This includes multiple patterns within the same class. Since

patterns and classes are manually authored, we reason that a class with more patterns should

usually be more “mature,” in that more time has been spent developing it. In other words,

a larger number of edge cases in the language that can represent the same semantics have

been identified and encoded as patterns. Accordingly, we expect ChatScript to be more

confident about a match on a pattern within a mature class, and even more confident when

there are multiple matches to the same class. Finally, due to ChatScript’s inbuilt heuristic

priorities, we expect higher ranked matches to correspond to higher confidence. Note that

the No Answer class is always the last item in the list.

Given all of the above, we assign a score to each class that is just the normalized sum of the inverse ranks of pattern matches for that class. That is, we assign a score s_j such that, for the 1-indexed sequence of patterns matching the input M = (p_i) containing the set of classes {c_j}, and the set P_j = {i : C(p_i) = c_j}, where C(p_i) is the class that pattern p_i belongs to:

\[
s_j = \frac{\sum_{i \in P_j} \frac{1}{i}}{\sum_{i=1}^{|M|} \frac{1}{i}} \tag{4.2}
\]

For notational convenience we denote the score for the template chosen by ChatScript as s_0.

Given Equation 4.2, the new features are summarized in Table 4.8. Additional features were explored, but did not show a benefit during development.
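Equation 4.2 and the two derived features of Table 4.8 can be computed directly from ChatScript's ranked match list, as in the following sketch; the list format and class names are hypothetical.

```python
from collections import defaultdict

def class_scores(matched_classes):
    """matched_classes: the class of each matched pattern, in ChatScript's rank order,
    e.g. ["rate_pain", "rate_pain", "describe_pain", "no_answer"] (No Answer is last)."""
    total = sum(1.0 / rank for rank in range(1, len(matched_classes) + 1))
    scores = defaultdict(float)
    for rank, cls in enumerate(matched_classes, start=1):
        scores[cls] += (1.0 / rank) / total     # Equation 4.2
    return dict(scores)

def ranked_list_features(matched_classes, cnn_class):
    scores = class_scores(matched_classes)
    s0 = scores[matched_classes[0]]             # score of ChatScript's chosen (first-matched) class
    cs_ratio = s0 / sum(scores.values())        # CS ratio
    cnn_agreement = scores.get(cnn_class, 0.0)  # CNN agreement
    return cs_ratio, cnn_agreement
```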

4.3.2 Training and Evaluation

While the modifications of the chooser models themselves are straightforward uses

of off-the-shelf software, specific aspects of the training of the models merit detailed

enumeration.

Table 4.8: New features used by the Three-way chooser.

Feature        Description
CS ratio       The ratio of the score of the chosen response to the sum of all
               scores, i.e. s_0 / (∑_j s_j).
CNN agreement  The score s_j corresponding to the class chosen by the stacked CNN,
               if it is present in the list of matches. Otherwise, zero.

Besides augmenting the training data with the extra examples from the CS-only data

set, we add two labels, one each for Unknown classes and Negative Symptoms. These are

largely to facilitate evaluation of the features which measure agreement between the stacked

CNN and ChatScript when ChatScript produces these responses, but the added examples

include items belonging to the new classes. The Negative Symptoms class, while being

incompatible with the early paraphrase identification approach, could reasonably be handled

by the direct classification approach, due to similarities in questions relating generically to

medical conditions and anatomy; treating all unseen classes as a single class for purposes

of training the stacked CNN, however, is more problematic. Ideally, the chooser should

learn to defer to ChatScript when it produces a class unseen by the stacked CNN, but this is

a hypothesis to be verified, since the behavior of the stacked CNN when trained on these

examples is less predictable.

We retrain the stacked CNN on the augmented data using the same tenfold cross-validation scheme, with new splits to accommodate the extra data. The chooser's training input features are calculated from the predictions collected during this cross validation. Chooser inputs for the test set are calculated from a single stacked CNN model

trained on the entire augmented training set, where the test set of course was unseen during

the retraining. We encode the labels for training the chooser using different conventions,

depending on which three-way variant we are training.

In the case of the exclusive configuration, which refines the earlier approach, each

example is labeled as exactly one of ChatScript, stacked CNN, or No Answer being the

correct choice. All instances of the No Answer class label are labeled with the No Answer

choice for the chooser, regardless of whether or not either ChatScript or the stacked CNN

correctly identified it. All examples in which none of the available choices would be correct

are labeled as ChatScript being the correct choice, to take the greatest advantage of the hybrid

system’s ability to handle classes unseen by the stacked CNN.

In the case of the multilabel three-way variant, every label is a binary vector of length

three, with the separate dimensions indicating independently which of the three choices

provides an acceptable answer. Thus, if ChatScript provided an accurate No Answer

response, then the dimensions for both ChatScript and No Answer would be set to one. In

contrast to the exclusive setup, if no choice is correct, we leave the example unlabeled (the

zero vector), based on development performance. To evaluate the multilabel setup at test

time, we predict probabilities for each choice and take the max.
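The two labeling conventions might be encoded as in the sketch below, where CS, CNN, and NOANS index the three choices and no_answer_label is the class index of the No Answer class; the helper names are hypothetical, and the final line reuses the multilabel chooser set up earlier.

```python
import numpy as np

CS, CNN, NOANS = 0, 1, 2

def exclusive_label(gold, cnn_pred, no_answer_label):
    # Exclusive scheme: exactly one choice per example.
    if gold == no_answer_label:
        return NOANS          # regardless of whether either system found it
    if cnn_pred == gold:
        return CNN
    return CS                 # includes examples where no available choice is correct

def multilabel_target(gold, cs_pred, cnn_pred, no_answer_label):
    # Multilabel scheme: mark every acceptable choice; all zeros if nothing is correct.
    target = np.zeros(3)
    if cs_pred == gold:
        target[CS] = 1.0
    if cnn_pred == gold:
        target[CNN] = 1.0
    if gold == no_answer_label:
        target[NOANS] = 1.0
    return target

# Test-time decision for the multilabel chooser: most probable of the three choices.
choices = multilabel_chooser.predict_proba(X_test).argmax(axis=1)
```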

As with the baseline binary chooser, we only train on examples where the component

systems disagree, always taking agreed-upon classes where they exist. We use tenfold cross validation on the augmented data set to measure development performance to find the best

configuration of each three-way variant for testing. Baseline development performance is

taken as the weighted average of both a tenfold cross validation result of the two-way model

trained on the core data set, and the performance of a single model trained on all of the

core data set and evaluated on the CS-only data set. In other words, baseline development

performance is a cross validation result on the entire augmented training set, but where training folds comprise only the core data set.

We test on the entirety of the hybrid data set. As our primary evaluation metric we report class accuracy. Note that this is not the accuracy of the three-way choice, but of the class returned by the chosen system. Since our model adjustments are directed at improving the performance of the No Answer class, we also report the percentage of No Answer responses—irrespective of which system produced them, and whether or not they were chosen correctly—as well as precision and recall on the No Answer class.

4.3.3 Baseline

A brief discussion of the baseline used for evaluation of the test set is warranted. In theory, this should be the same accuracy number as reported in Section 4.2.2, i.e. 80.03 percent accuracy in the unadjusted case. However, for the same reasons our attempted reproduction in Section 4.1 was slightly worse than the original result (Jin et al., 2017), our baseline in the present experiment is slightly lower than what was annotated in the live replication experiment. Between the bug fix, data corrections, and a change in splits resulting from a different random shuffle—see Gorman and Bedrick (2019) for an excellent discussion of the impact of splits on performance—we consider this a justifiable drop in performance without invalidating any conclusions. We further confirm that the changes result in different behavior of the chooser with a simple comparison of the frequency of the choices: the chooser in the live replication experiment chose the stacked CNN in 15.5 percent of turns in the test condition, while the baseline chooser for the present experiment chooses the stacked CNN in 10.8 percent of turns, using the same input sentences.

Table 4.9: Development accuracy with additional features

System            Data set   Base feats  +agree  +agree+ratio
2-way baseline    core       84.41       —       —
2-way baseline    augmented  84.85       84.73   84.72
3-way exclusive   augmented  84.92       84.98   84.99
3-way multilabel  augmented  85.70       85.83   85.83

Table 4.10: Effects of retraining on test performance

         core   aug.
CS Acc   74.85  74.85
CNN Acc  67.27  73.17
Oracle   84.82  86.89

4.3.4 Results

Development results are shown in Table 4.9, with the best performing features for each system (row) highlighted in bold font. Note that the absolute accuracy numbers are not directly comparable to test results due to different label distributions. In general, the additional features add very little to the performance, but we test the features that give the highest accuracy in development. The CS Ratio feature does not increase accuracy for the three-way multilabel system, but it does slightly increase F1 score on the No Answer class relative to the CNN Agreement feature alone (0.164 vs. 0.153), so we take that configuration as the best for running test results.

Retraining the stacked CNN on the augmented data set unsurprisingly improves its performance on the test set, although the resulting boost in oracle performance does not

Table 4.11: Test results for three-way choosers with retraining

                                                       No Answer
System                   Training data  Accuracy  Percent  Precision  Recall
2-way baseline           core           78.78     5.11     .210       .133
2-way baseline           augmented      79.64     7.34     .199       .183
3-way exclusive +feats   augmented      79.92     7.04     .198       .174
3-way multilabel +feats  augmented      81.10     4.14     .328       .174

match (see Table 4.10), implying that much of the improvement overlaps with examples

that ChatScript was already answering correctly. In absolute terms, the stacked CNN perfor-

mance under retraining is much closer to ChatScript as well, which is more consistent with

the results from Section 4.1. This supports the claim that basic differences in the distribution

of class labels were a source of the unexpected performance discrepancy observed in the

live replication experiment.

Test results for the two three-way systems that were identified as having the best

development performance are shown in Table 4.11. We also include the best-performing

binary chooser when using the augmented training data, to isolate the effects of the training

data relative to the baseline and the model enhancements. The retraining alone leads to a

modest improvement in overall accuracy, with a small boost in recall on the No Answer class

and an even smaller drop in precision. The model changes aimed at improving performance

on the No Answer class introduce some trade-offs. The exclusive setup improves over

the baseline on recall, being generally more likely to produce the No Answer response, while the multilabel setup favors precision without improving recall over the exclusive

setup. Both three-way systems exhibit a significant boost in accuracy over the two-way

baseline, particularly the multilabel setup (exclusive: p = 0.009; multilabel: p = 4.6×10⁻⁶, McNemar’s test), and given the No Answer recall numbers, in both cases this is due to

increased performance on the positive classes. We show in the next section that the large

increase in the multilabel system is due to the presence of the unlabeled examples in the

case that no correct choice exists; forcing these examples to belong to a class during training

serves mostly to confuse the true boundaries of that class.

We offer analyses to gain further insight into all of these results in the next section.

4.3.5 Discussion

Despite significant improvements in accuracy using the multilabel chooser, the improve-

ments on the No Answer class can only be described as modest, at best. Some analysis

illuminates the issue a bit better.

As we supposed above, the distinction between in- and out-of-scope proves to be hard

to discover, and we can characterize the problem more fully with quantitative techniques.

The most intuitive illustration comes from a t-SNE (Maaten and Hinton, 2008) plot of the

training data. T-SNE is a stochastic technique that maps high-dimensional data points into a low-dimensional space suitable for visualization, while attempting to preserve distances between points.

Such a plot of the chooser inputs with correct choices labeled can give a good intuition for

the separability of the data, with the understanding that it presents a distorted visualization of

the data space. Figure 4.5 shows the result of applying t-SNE to those points in the chooser

training data where ChatScript and the stacked CNN do not agree. The figure illustrates why

the multilabel approach yields an accuracy improvement, but the result is a bit grim for the

out-of-scope question.
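The visualization itself can be produced with a standard implementation, e.g. as below; the array names and the t-SNE hyperparameters are assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# X_disagree: chooser input features for the training examples where the two systems
# disagree; choice_labels: NumPy array naming the correct choice for each point.
embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_disagree)

for name in ("ChatScript", "CNN", "No Answer", "none correct"):
    mask = choice_labels == name
    plt.scatter(embedded[mask, 0], embedded[mask, 1], s=5, label=name)
plt.legend()
plt.show()
```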

The plot makes it apparent that the No Answer labels are thoroughly enmeshed through-

out the data space, which confirms our intuition that the boundary around these points would

85 Figure 4.5: t-SNE plot of Multilabel chooser training data

be convoluted and narrow. In fact, it is hard to claim that any such boundary really exists in

this representation space. Further, there are several points where the stacked CNN correctly

identifies a No Answer response, but these are often buried in a region that is dominated

by ChatScript choices, making it hard for a chooser model to take advantage of the CNN’s

right answer or the chooser’s ability to override with a No Answer response. Overall, this

goes a long way toward explaining the limited improvements on the No Answer class.

The separation between CNN and ChatScript choices is much more encouraging, how-

ever. This clearly illustrates why the chooser provides such a significant performance boost;

large regions full of high-purity clusters of CNN choices are easy to separate from the

ChatScript choices. We can also clearly see why not labeling points that have no correct

choice (i.e., the labeling scheme in the multilabel scenario) provides such a big benefit

for the overall accuracy. Many of these unlabeled points (gray dots in the figure) cluster

together with the otherwise very pure cluster of CNN data points. If these are treated as

ChatScript labels, this easily separable region of the space becomes much more confusing

for the classifier, which in turn damages the model’s generalization to the test set.

While the multilabel scheme creates higher accuracy by allowing the chooser to trust

the stacked CNN more in the region where it is accurate, the inevitable outcome is that the

system will choose the CNN on many of the occasions when neither subsystem is correct.

This revisits the trade-off that was introduced in Section 4.1: a boost in accuracy also comes with more incorrect replies, instead of the presumably lower risk of a No Answer response.

To quantify the trade-off, consider two types of errors: major errors, where the system

replies to a question that was not asked, but which the user can potentially interpret as the

answer to the question they did ask; and minor errors, in which the system should have

answered the question, but instead replied with, “I don’t understand,” implying that the user

should try again. While the three-way exclusive setup has the higher total error rate at 20.1

percent, its major errors are 14.4 percent compared to 16.1 percent in the multilabel case, which has the lower overall error rate at 18.9 percent. Accordingly, the minor errors are

more than doubled in the exclusive case relative to the multilabel case.
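Concretely, these figures imply minor-error rates of roughly 20.1 − 14.4 ≈ 5.7 percent for the exclusive setup versus 18.9 − 16.1 ≈ 2.8 percent for the multilabel setup.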

Exactly how bad these errors are—especially with respect to the unlabeled examples that

effectively end up as a CNN choice in the multilabel setup—is a more subjective question

that depends on how the CNN actually interpreted the input. In one of the subjectively worst

cases, a user asks, “Okay. So back pain and urinating more frequently. Is that all?” This

should be interpreted as the class, “Any other problems?” ChatScript incorrectly provides

the No Answer response, and the retrained stacked CNN incorrectly interprets the question

as, “Have you noticed pain while urinating?” The programmed reply to this question is, “I

haven’t had any pain like that,” which could be construed as rejecting the back pain issue

among the user’s summary of topics to discuss. In this case, the No Answer response is the

preferable option. A more innocuous, and much more typical, example is the input, “So

it seems to get worse with prolonged use?” According to our annotation, this should be

interpreted as, “Does the pain improve with exercise?” ChatScript again provides the No

Answer response, while the CNN interprets it as, “Is the pain improving?” The multilabel

chooser picks the CNN, and the reply then comes as, “I don’t know if it is getting worse

or not. It pretty much just hurts all the time.” The reply is mostly relevant, but it is also

not adequately responsive to the question. This lack of coherence then is a sufficient cue to

the user that they should try language more focused on the physical activity component of

their question. Other similarly benign confusions include, “How much ibuprofen do you

take?” for, “How much ibuprofen have you taken?” and, “When did the pain start?” for,

“Did anything happen to cause the pain?” Looking at the erroneous differences between the

exclusive and multilabel setups, our subjective impression is that the multilabel system is

more helpful than harmful, suggesting that the so-called major errors quantified above are

not usually catastrophic.

A majority of the No Answer responses in the multilabel condition (65 percent) come

from agreement between the stacked CNN and ChatScript, although of these, 68 percent are

incorrect. The chooser overrides both subsystems to provide 18 percent of the No Answer

responses, with the remainder all coming from choosing the stacked CNN’s response. A

No Answer response from ChatScript is never chosen. The No Answer responses from

the stacked CNN are the most accurate, at 55 percent. The live subjects were surprisingly

focused on questions about the patient’s employment, to a level of detail that was not

clinically relevant. Accordingly, many such questions were out of scope, and trends in the

chooser’s No Answer responses reflect this. Two of the three correct No Answer responses

that came from the chooser override were about back pain symptoms at work, but more

often, the chooser returned No Answer for valid employment questions, such as, “Are you working?” The majority of the benefit to performance on the No Answer class, then, seems

to come from an increased opportunity to trust the stacked CNN’s determination of a No

Answer response, along with the stacked CNN’s improved ability to detect out-of-scope

questions through extra training data.

Finally, we note that even after retraining the stacked CNN with over 50 percent more

data, the hybrid system still outperforms both the rule-based and data-driven components

individually, reconfirming the benefit of combining both approaches in our relatively data-

scarce domain. Figure 4.6 shows another breakdown of model test performance by label

frequency quintiles, this time using quintile assignments derived from the label frequency

of the augmented data set used for training the multilabel model. Results from the binary

89 Figure 4.6: Quintile accuracies for multilabel components, with baselines

chooser baseline replication are included for comparison. Again, the general trend of

the stacked CNN outperforming ChatScript on high-frequency labels persists, and for the

most part this translates to higher performance of the hybrid system than either of the

subcomponents. Notably, though, the hybrid system underperforms the stacked CNN in the

most frequent quintile. The large jump in the accuracy of the stacked CNN over its baseline

in this quintile is mostly driven by a boost in accuracy on the No Answer class (about 36

percent vs. less than two percent), which remains difficult for the chooser to distinguish, as

discussed at length above. We believe that the difficulty of the out-of-scope question largely explains this divergence from the otherwise consistent trend of the hybrid

system outperforming either component, but acknowledge the possibility that the results

may reflect a threshold of training data volume beyond which the stacked CNN is simply

more effective on its own than in combination with ChatScript. Nonetheless, the benefit at

lower frequencies remains very clear.

4.4 Conclusion

This chapter highlights the importance of replicating model development experiments with live subjects, especially in the context that a model has been developed with the

goal of building an interactive tool. Our early work with paraphrase identification biased

our subsequent approach to question classification, and while the assumptions made were

reasonable in their isolated context, the replication presented here forced us to recognize

important issues for live deployment that those assumptions obscured. In particular, the

prevalence of out-of-scope queries was far greater than expected based on our early efforts.

The work done here to address the issue certainly leaves room for improvement, but our

analysis also demonstrates why this is a very difficult problem. The lesson here is not just to

eliminate some particular source of bias in the development of a given dataset; in fact, we

believe such biases are inevitable results of the process of developing and refining research

questions. Rather, we emphasize the importance of validating previous results with live subjects.

The most important positive outcome of the experiments presented here was reinforcing

the benefit of the hybrid approach to natural language understanding in our data-limited

domain. The power of the hybrid approach repeatedly proves to be the complementary

performance of the rule-based and data-driven components by label frequency, even to the

point that labels may be entirely unseen by the data-driven model. Having a baseline level of

performance on unseen classes offers content authors highly desirable flexibility in adding

new classes in the deployed system. A further positive result of our experiments within

the hybrid framework was the insight not to force a choice during training if both rules

and data fail to find a correct answer. Even though it was born out of an analysis aimed at

understanding the difficulty of the out-of-scope problem, shedding light on the shape of the

data in the chooser’s input space resulted in a more informed approach to the hybrid system.

The analysis presented here suggests that the out-of-scope distinction is something of a

separate task from the question classification task we began with, and is somewhat different

than the typical out-of-domain task that dialog systems face. This may suggest either making

the determination prior to the question classification, or in parallel with it. Corresponding

architectural adjustments may involve joint tasks for the stacked CNN or end-to-end neural

models that implicitly incorporate the chooser by conditioning the stacked CNN’s response

on ChatScript’s output. We leave these explorations for future work.

While the complementarity of the rule-based and data-driven models leads to a hybrid

system with reasonably good results, it does raise the question of why the stacked CNN

performs comparatively poorly on infrequent classes, and what could be done to better leverage the available data to improve performance on those classes. The next chapter aims to address this question, noting in particular that many infrequent classes do have linguistic commonalities with more frequent classes that we can try to exploit.

Chapter 5: How Self-Attention Improves Rare Class Performance

The previous chapter repeatedly emphasized the disparate performance of the ChatScript

system and the stacked CNN model with respect to class frequency, and how combining

the two yielded substantial improvements over either individually. Analysis gave insight

about the most effective ways to combine the two models, but reasons for the disappointing

performance of the CNN models on rare classes were not considered in much detail beyond vague notions of data scarcity. In this chapter, we seek to maximize the performance of a

single learned model on rare classes, by leveraging the linguistic structure in more common

classes.

Borrowing intuition from slot-filling paradigms for natural language understanding

tasks, we hypothesized that a classification model that was forced to rely on a collection of

independent underpowered models (by analogy: slots) would encourage those models to

specialize on components of the sentence meanings (fillers) that generalized across classes, while cooperating to make sure that the specialties were diverse enough to ensure adequate

separability of the classes when combined. Such a model could infer the latent structure in

the semantic space based on the top level classifications as well as the observable linguistic

structures of the input, then leverage that structure to improve performance on rare classes when they shared components with frequent classes. This would only be possible because of

the large number of classes in the Virtual Patient domain that often have much in common with each other.

We adapted a simple multi-headed attention model (Lin et al., 2017), with the intuition

that each attention head would act as the analog of a slot. Our initial hypothesis was that

limiting the modeling capacity of the attention heads would encourage each one to focus on

the components that best generalized across classes. That may have been the case, but the

constrained attention heads only suffered in overall performance compared to unconstrained

models, as described later in this chapter. The constrained heads did prove extremely useful

for rapidly visualizing the model’s behavior, and that strongly shaped our analysis.

As comparison, we also tested the powerful BERT model (Devlin et al., 2019) for the

Virtual Patient task. Surprisingly, we found that it underperformed the simpler CNN baseline,

particularly on rare classes, which we largely attribute to insufficient training data for the

classifier to learn to accommodate the high degree of freedom in the BERT representations.

These experiments and analysis comprise the main contributions of this chapter.16

5.1 Introduction

Many semantic classification tasks, of which the Virtual Patient is just one, have seen

a huge boost in performance in recent years (Wang et al., 2019, 2018), thanks to the

power of contextualized language models such as BERT (Devlin et al., 2019), which uses a

Transformer (Vaswani et al., 2017) architecture to produce context-specific word embeddings

for use in downstream classification tasks. These large, data-hungry models are not always well suited to tasks that have a large number of classes or relatively small data sets (Mahabal

16 A condensed form of this chapter was accepted as a short paper to SIGDIAL 2020.

et al., 2019). As thoroughly discussed in previous chapters, the Virtual Patient corpus has

both of these inauspicious properties.

Many of the classes in this task are distinguished in subtle ways, e.g., in degree of

specificity (“Are you married?” vs. “Are you in a relationship?”) or temporal aspect (“Do you [currently] have any medical conditions?” vs. “Have you ever had a serious illness?”).

As discussed in Section 3.1, a few classes are very frequent, but many appear only once

in the data set, with almost three quarters of the classes comprising only 20 percent of the

examples (Jin et al., 2017). Nonetheless, these rare labels are no less important to answer

accurately.

In this chapter, we seek to improve upon the rare class performance of the Text CNN-

based system described in the previous chapter. That approach naïvely treats all classes as

orthogonal, so the semantic similarity of the classes above can be problematic. Ideally, a

model should be able to learn the semantic contributions of common linguistic substructures

from frequent classes, and use that knowledge to improve performance when those structures

appear in infrequent classes.

We hypothesize that multi-headed attention mechanisms may help with this kind of

generalization, because each head is free to specialize, but should be encouraged to do so

cooperatively to maximize performance. Three different methods of utilizing BERT-based

architectures for this task surprisingly did not improve upon the performance of the CNN

models of Jin et al. (2017). In contrast, a very simple RNN equipped with a multi-headed

self-attention mechanism improves performance substantially, especially on rare classes.

We assess the reasons for this using several techniques, chiefly, visualization of severely

constrained intermediate representations from within the network, agglomerative clustering

of full representations, and manipulation of the behavior of the attention module by imposing

additional constraints. We find evidence that independent attention heads: 1) represent the

same concepts similarly when they appear in different classes; 2) learn complementary

information; and 3) may learn to attend to the same word for different reasons. This last

behavior leads to discovery of idiomatic meanings of some words within our domain.

5.2 Related Work

Self-attention, in which a model examines some hidden representation to determine which portions of that representation should be passed along for further processing, became

prominent relatively recently (Lin et al., 2017; Vaswani et al., 2017). These models have

been very successful for some tasks (Wang et al., 2019), but other approaches may work

better for classification tasks with many classes and few examples (Mahabal et al., 2019).

We explore two types of self-attentive models for the virtual patient dialogue task (Danforth

et al., 2013; Jaffe et al., 2015), which has many classes and scarce data. Previous authors

have used memory networks (Weston et al., 2015) to improve performance on rare classes

for this task (Jin et al., 2018).

Despite the contrast presented above, our self-attentive model may share characteristics with the work by Mahabal et al. (2019), as we find that representations of some word tokens

reflect parallel meanings. Mahabal et al. (2019) use sparse vectors to represent words as the

set of lexical syntactic contexts in which they appear (Mahabal et al., 2018), an approach that

they call Category Builder. These representations are intended to simultaneously capture

all of the meanings of a word type, and then use feature selection based on the context of

a given token to perform advanced tasks like analogies or classification. This stands in

contrast to the contextualized language modeling approach of models like BERT, which

use complex architectures to examine a word’s context to produce disambiguated dense

representations. In a way, the sparse Category Builder representations defer decision-making

about the meanings of words, while BERT-like models use extensive pretraining to learn

how to decide which word sense to send downstream. We find that separate attention heads

in our self-attentive RNN sometimes simultaneously reflect distinct senses of the same token

in a single sentence, so it may be learning to produce representations that are similar to

Category Builder in that way. However, it is hard to draw a clear contrast and say that similar

phenomena do not occur internally to BERT, due to its complexity.

The Category Builder (CB) representations are more effective than BERT’s representa-

tions in low-data situations, with BERT requiring a minimum of several dozen examples of

each class to learn to classify more than two classes accurately (Mahabal et al., 2019). The

authors attribute the difference in performance between CB and BERT to the inevitably low

lexical overlap between training and test sets when training data is scarce. The CB represen-

tations allow the classifier to learn which lexical contexts are important for performance, and

then any word that can appear in such contexts can evoke the correct class, even if the word was unseen in training. BERT, on the other hand, likely needs to see multiple examples of

lexical contexts to learn which semantic distinctions are important for the specific task—for

example, a BERT model may very well have knowledge that many sodas are caffeinated,

but without a variety of examples of caffeinated beverages, it may not know to emphasize

that aspect of soda to correctly identify a question about caffeine consumption. Low lexical

overlap is the likely culprit for our difficulties with rare classes, but we attribute much of the

benefit of the Self-attentive RNN to the unique structure of our dataset, which allows for

transfer of knowledge about useful contextual patterns learned from frequent classes to rare

classes.

We present a detailed analysis of our model’s behavior using clustering and visualization techniques; this bears a resemblance to the analysis by Tenney et al. (2019), although they use internal representations to make predictions for linguistic probing tasks, rather than directly examining correlations between representations and individual input tokens. Some authors have provided compelling analyses of the behavior of attention heads in BERT models (Clark et al., 2019; Voita et al., 2019), demonstrating redundancy and specialization in the heads, but also by focusing primarily on external behavior (e.g. attended tokens, effects of ablation on performance) instead of internal representations.

5.3 Task and Data

The task is the same question identification task as defined previously, but this experiment makes use of the strict relabeling of the core and enhanced data sets. We use the core set plus the CS-only subset of the enhanced data as training, and hold out the hybrid data set as a test set (see Section 3.1). We perform tenfold cross-validation on the training set for development, following the training procedures described in Section 4.1.2, in particular, augmenting each training fold with the canonical label sentences to make sure that no class is unseen at test time. Again, the test set only contains 268 classes, but fifteen are unseen in the training data (other than the canonical question). Note that the relabeling means it is not possible to directly compare the results in this chapter with the results in the previous chapter. We compare against baseline results for the relabeled data with matched folds. The baseline in this chapter outperforms the previous work due to operating on cleaner data.

5.4 Experimental Design and Results

We start from a Text-CNN baseline for this task (Jin et al., 2017), utilizing a single

stream system (i.e. without ensembling) for comparisons. This system convolves GloVe word embeddings with 300 filters each of widths 3, 4, and 5; the max of each filter over the

sequence serves as input to a fully connected leaky ReLU layer (Nair and Hinton, 2010),

followed by a softmax layer.
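To make the baseline concrete, the following is a minimal PyTorch sketch of a single-stream Text-CNN of this shape. It is an illustrative re-implementation rather than the exact code of Jin et al. (2017); in particular, the size of the fully connected layer is an assumption, since it is not specified above.

```python
# Minimal sketch of the single-stream Text-CNN baseline (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, num_classes, embed_dim=300,
                 filter_widths=(3, 4, 5), num_filters=300):
        super().__init__()
        # One 1-D convolution per filter width, over pretrained GloVe embeddings.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, kernel_size=w) for w in filter_widths
        )
        feat = num_filters * len(filter_widths)
        # Hidden layer size equal to the pooled feature size is an assumption.
        self.hidden = nn.Linear(feat, feat)
        self.out = nn.Linear(feat, num_classes)

    def forward(self, embeddings):
        # embeddings: (batch, seq_len, embed_dim), already looked up from GloVe
        x = embeddings.transpose(1, 2)                              # (batch, embed_dim, seq_len)
        pooled = [conv(x).max(dim=2).values for conv in self.convs] # max over the sequence
        features = torch.cat(pooled, dim=1)
        hidden = F.leaky_relu(self.hidden(features))
        return self.out(hidden)                                     # softmax applied in the loss
```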

We compare this against two contextual models: the relatively well known Fine-tuned

BERT (Devlin et al., 2019) model, as well as a variant of a simpler RNN model with

self-attention (Lin et al., 2017).17

We follow the recommended procedure for fine-tuning BERT to our task. We used the

uncased base pretrained BERT model as input to a dense layer followed by a softmax for

classification. All parameters were tuned jointly. The hyperparameters selected by grid search were a max sequence length of 16, a batch size of 2, 10 training epochs, and a learning rate

of 2e-5.
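A minimal sketch of this fine-tuning setup is shown below, written against the HuggingFace transformers API; the library calls may differ from the codebase actually used in our experiments, so only the overall structure (pretrained uncased base BERT, a dense layer, and a softmax over classes, all tuned jointly) follows the description above.

```python
# Sketch of the BERT fine-tuning classifier (library choice is an assumption).
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class BertClassifier(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # Pooled [CLS] representation feeds the dense layer; softmax lives in the loss.
        return self.classifier(outputs.pooler_output)

# Example encoding at the tuned max sequence length of 16:
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
batch = tokenizer(["do you drink alcohol"], padding="max_length",
                  truncation=True, max_length=16, return_tensors="pt")
```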

Figure 5.1 illustrates our RNN model with self-attention. It is a single-layer BiGRU

(Cho et al., 2014) equipped with a two-layer perceptron that takes hidden states as inputs,

and produces one attention score per attention head, per input step. The BiGRU has hidden

state sizes of 500 in each direction, and the hidden layer in the attention module has 350 tanh

units, which in turn feeds a linear layer that produces scalar scores for each of eight attention

heads. These scores are then softmaxed over the input, and the attention-weighted sum of

the corresponding hidden states serves as the value of the attention head. These values are

concatenated and fed into a fully connected layer with 500 tanh units, and another linear

layer to produce the pre-softmax scores for each class, followed by a softmax output to

17 https://github.com/ExplorerFreda/Structured-Self-Attentive-Sentence-Embedding

Figure 5.1: The self-attentive RNN model.

determine the class. The original model utilizes an orthogonality constraint on the attention vectors for each attention head, but we find that this is detrimental to our task, so we disable

it.

Formally, consider the model input, X = (x_1, x_2, ..., x_n), where x_i is the word embedding

corresponding to the ith word of the input sentence. The sequence of hidden states of the

RNN corresponding to the input, H = (h_1, h_2, ..., h_n), is just

H = BiGRU(X) (5.1)

The attention A is calculated as:

A = softmax(W_{s2} tanh(W_{s1} H^T))    (5.2)

where W_{s1} and W_{s2} are learned parameters, and with the softmax taken over the second

dimension of its input. This produces a 2D matrix representation of the sentence, M, which

is just calculated as

M = AH (5.3)

Then let M_c be a vector consisting of the concatenated rows of M; the final prediction ŷ is

ŷ = softmax(W_{f2} tanh(W_{f1} M_c))    (5.4)

where W_{f1} and W_{f2} are also learned parameters.
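A compact PyTorch sketch of Equations 5.1 through 5.4, with the layer sizes given above (500 BiGRU units per direction, 350 tanh attention units, eight heads, and a 500-unit tanh classifier layer), is shown below. It is an illustrative re-implementation; details such as the absence of bias terms in the attention module follow Lin et al. (2017) rather than being specified here.

```python
# Sketch of the self-attentive BiGRU classifier defined by Eqs. 5.1-5.4.
import torch
import torch.nn as nn

class SelfAttentiveRNN(nn.Module):
    def __init__(self, num_classes, embed_dim=300, hidden=500,
                 attn_hidden=350, num_heads=8):
        super().__init__()
        self.bigru = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.ws1 = nn.Linear(2 * hidden, attn_hidden, bias=False)   # W_{s1}
        self.ws2 = nn.Linear(attn_hidden, num_heads, bias=False)    # W_{s2}
        self.wf1 = nn.Linear(num_heads * 2 * hidden, 500)           # W_{f1}
        self.wf2 = nn.Linear(500, num_classes)                      # W_{f2}

    def forward(self, embeddings):
        # embeddings: (batch, seq_len, embed_dim) GloVe vectors
        H, _ = self.bigru(embeddings)                     # (batch, n, 2*hidden), Eq. 5.1
        scores = self.ws2(torch.tanh(self.ws1(H)))        # (batch, n, heads)
        A = torch.softmax(scores, dim=1).transpose(1, 2)  # softmax over tokens, Eq. 5.2
        M = torch.bmm(A, H)                               # (batch, heads, 2*hidden), Eq. 5.3
        Mc = M.flatten(start_dim=1)                       # concatenated head values
        return self.wf2(torch.tanh(self.wf1(Mc)))         # pre-softmax scores, Eq. 5.4
```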

The RNN is trained using the same fold splits and canonical query augmentation as the

CNN baseline. We use the Adam optimizer (Kingma and Ba, 2014) with default parameters.

Layer weights are initialized uniformly at random in the range [−0.1,0.1], and tokenize

inputs using default SpaCy tokenization (Honnibal and Montani, 2017). GloVe.42B vectors

serve as the inputs, with batch sizes of 20. We train for 40 epochs with an initial learning

rate of 0.001, take the best model, reinitialize an optimizer with learning rate of 2.5 × 10−4,

and train for another 20 epochs, taking the best model of all 60 epochs to test.
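The two-stage schedule can be summarized as in the following sketch, where train_epoch and evaluate are hypothetical helpers standing in for the usual epoch loop and dev-set scoring; model selection keeps the best dev-set model across all 60 epochs, as described above.

```python
# Sketch of the two-stage training schedule (train_epoch/evaluate are hypothetical).
import copy
import torch

def train_two_stage(model, train_loader, dev_loader):
    best_acc, best_state = 0.0, None
    # Stage 1: 40 epochs at the initial learning rate.
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(40):
        train_epoch(model, train_loader, opt)
        acc = evaluate(model, dev_loader)
        if acc > best_acc:
            best_acc, best_state = acc, copy.deepcopy(model.state_dict())
    # Stage 2: restart from the best model with a fresh, smaller-learning-rate optimizer.
    model.load_state_dict(best_state)
    opt = torch.optim.Adam(model.parameters(), lr=2.5e-4)
    for _ in range(20):
        train_epoch(model, train_loader, opt)
        acc = evaluate(model, dev_loader)
        if acc > best_acc:
            best_acc, best_state = acc, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return model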

The development set results (top 3 lines of Table 5.1) were a bit surprising to us: while we expected that contextual models would outperform the baseline CNN, fine-tuned BERT

performed comparatively poorly. The Self-attention RNN, however, performed significantly

better than the baseline CNN, which carries over to a smaller degree to the test set (CNN:

76.2% accuracy, 51.9% F1; RNN: 79.1% accuracy, 54.7% F1).18 A breakdown of accuracy

18 We only tested on the baseline and best system here to minimize use of the test set for future work.

System                      Acc. (%)   F1
Baseline CNN                80.7       55.6
BERT Fine-tune              79.8       46.6
Self-attention RNN          82.6       61.4

BERT Static CNN             76.9       49.4
BERT Contextual CNN         75.3       45.2

Mean-pool RNN               81.0       57.2
Bottleneck RNN              80.8       57.2
Orthogonal Attention RNN    80.3       55.3

Table 5.1: Dev set results comparing different models (top, Sec. 5.4), word embeddings (middle, Sec. 5.5.1), and attentional mechanisms (bottom, Sec. 5.5.2).

by class frequency quintiles for the test results is shown in Figure 5.2, to emphasize the

relationship between F1 and rare class performance.

In particular, the BERT model has a very low F1, likely because of the large number of

subtly distinguished classes, the relatively small data set, and the high degree of freedom

in the BERT model. That is, BERT may be representing semantically similar sentences in

nearby regions of the representation space, but with enough variation within those regions

that our training set does not permit enough examples for the classifier to learn good

boundaries for those regions. Alternatively, the masked language modeling task may simply

not induce the grammatical knowledge required to distinguish some classes well.

The success of one attention-based contextual model (Self-attention RNN) and the

failure to improve of another (Fine-tuned BERT) led us to ask two analytical questions:

first, are the BERT representations not as appropriate for the Virtual Patient dialog domain

compared to GloVe embeddings? Second, is there something that we can learn about how

the attention-based method is helping over the CNN (and particularly on F1)?

5.5 Analysis

5.5.1 Why did BERT perform less well?

The difference in accuracy from the baseline CNN model to the BERT fine-tuning result

is fairly small, while the drop in F1 is substantial. Since there are many more infrequent

classes than frequent classes, this suggests that BERT is seriously underperforming in the

least frequent quintiles, and making up for it in the most frequent. That, in turn, supports

the interpretation that small numbers of examples are inadequate to train a classifier to

handle the variation in representations that come out of a contextualized model. This would be consistent with other research showing poor performance of BERT in low-data

regimes (Mahabal et al., 2019). Some of the discrepancy may also be explained by a domain

mismatch. The BERT base model is trained on book and encyclopedia data (Devlin et al.,

2019), to provide long, contiguous sequences of text. In contrast, our inputs are short,

conversational, and full of typos. GloVe.42B, trained on web data (Pennington et al., 2014),

may simply be a better fit for our corpus.

To try to tease apart the contributions of model architecture and learned representations, we utilized two different embeddings within the CNN: the contextual BERT embeddings,

i.e. the full 768-dimensional hidden state from the first layer19 of the BERT model corre-

sponding to each input token, and a static BERT embedding. We collect these static BERT

embeddings by running the training set through the BERT model, and taking the state of the

first layer from the BERT model as the embedding of the corresponding token. We then average

these representations for each word type in the data set, and use that as the input wherever

the word occurs. Note that since BERT is trained with positional embeddings instead of

ordering, representations from this layer likely retain a lot of positional information, which

19 Empirically, and surprisingly, this worked better than other layers we tried.

could be an important source of noise in the averaged representations. Training the CNN is

otherwise the same as in the baseline experiment.
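The static BERT embeddings can be computed roughly as in the sketch below, which averages each word type's first-layer state over the training set. The use of the HuggingFace transformers library, the reading of "first layer" as the first encoder layer, and the treatment of subwords (taking the first wordpiece of each word) are simplifying assumptions.

```python
# Sketch of computing static (type-averaged) BERT embeddings from the training set.
from collections import defaultdict
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True).eval()

sums, counts = defaultdict(lambda: torch.zeros(768)), defaultdict(int)

def accumulate(sentence):
    words = sentence.lower().split()
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).hidden_states[1][0]   # first encoder layer, (tokens, 768)
    word_ids = enc.word_ids(0)
    seen = set()
    for pos, wid in enumerate(word_ids):
        if wid is None or wid in seen:             # skip special tokens and later wordpieces
            continue
        seen.add(wid)
        sums[words[wid]] += hidden[pos]
        counts[words[wid]] += 1

# After calling accumulate() over every training sentence:
# static_embeddings = {w: sums[w] / counts[w] for w in sums}
```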

The worst of our BERT-based models is the full contextualized embeddings fed into the

baseline CNN. Since the classification architecture is the same as the baseline, this suggests

that a significant contributor to the reduced performance of the BERT-based models is the

contextualized representations themselves. It seems that stable representations of lexical

items are beneficial for generalizing to unseen sentences when few training examples are

available. Consistent with this, the static BERT CNN result, despite a lower accuracy than

the fine-tuning result, shows a gain in F1. Again, this supports the idea that variation, insofar

as it is not variation that is relevant to the task, is harmful for rare classes, since stable

representations of informative words for those classes help.

5.5.2 Analyzing the Self-attention RNN

One question is how much attention versus recurrency is playing a role in the Self-

attention RNN’s improvements. We replaced the attention mechanism with mean pooling;

Table 5.1 shows that performance is more on par with the CNN, suggesting that the attention

does play a significant role.

To better understand the behavior of the self-attentive RNN, we employ a relatively novel

method of analyzing attention: we insert bottleneck layers of just eight dimensions after

each attention head, with sigmoid activations and no dropout. This adds another nonlinearity

into the model, but reduces the total number of parameters substantially. Visualizations

of the Bottleneck RNN’s behavior on development data are shown in Figure 5.3. Each

attention head is shown with an arbitrary color; the underline of that color under the tokens

of the input shows what the attention head is attending to, with the opacity of the underline

Figure 5.2: Quintile accuracies for the tested RNN and CNN baseline

corresponding to the strength of the attention. The grid representation to the right of each input shows how each head represents what it attends to, again with color and opacity corresponding to head identity and unit activations, respectively. These patterns are arbitrary, but by comparing head-specific patterns across classes and inputs, we can get a sense of how certain model behaviors are consistent or not. Since the final classification must be made entirely on the basis of the bottleneck layer, we assume that the activations in this layer are a complete representation of the input sentence according to the model. Thus, the heads/rows that exhibit similar activation patterns for different sentences are representing their attended tokens similarly. We do note that this interpretation can be misleading, since

the representation that feeds into the bottleneck layer is the attention-mediated hidden state

of the RNN, which, due to the action of the RNN, could be anything. The following analysis

does support the interpretation that the RNN’s representation correlates with the input token,

though. The bottleneck RNN and CNN have similar overall performance (Table 5.1), but

the RNN’s performance on the least frequent classes is still superior.
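A sketch of the bottleneck modification is given below. It assumes the attention-weighted head values M from the model sketch in Section 5.4, and carries over the 500-unit classifier layer as an assumption, since the text above only specifies the eight-dimensional sigmoid bottlenecks themselves.

```python
# Sketch of the bottleneck variant: each head's value is squeezed through an
# 8-unit sigmoid layer before classification (names and classifier size are ours).
import torch
import torch.nn as nn

class BottleneckHeads(nn.Module):
    def __init__(self, num_classes, head_dim=1000, bottleneck=8, num_heads=8):
        super().__init__()
        # One small bottleneck per head; its activations are what get visualized.
        self.bottlenecks = nn.ModuleList(
            nn.Linear(head_dim, bottleneck) for _ in range(num_heads)
        )
        self.wf1 = nn.Linear(num_heads * bottleneck, 500)
        self.wf2 = nn.Linear(500, num_classes)

    def forward(self, M):
        # M: (batch, num_heads, head_dim), the attention-weighted hidden states
        squeezed = [torch.sigmoid(b(M[:, i])) for i, b in enumerate(self.bottlenecks)]
        Mc = torch.cat(squeezed, dim=1)          # the complete input to the classifier
        return self.wf2(torch.tanh(self.wf1(Mc)))
```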

By finding the greatest Jensen-Shannon divergence between predictions made by the

baseline CNN and the RNN, as well as the largest change in class recall between the systems, we can identify interesting cases illlustrating the benefit of the RNN system. One compelling

case is the difference between Do you drink [alcohol]?, Do you drink coffee?, and Do you

drink enough fluid? (classes 85, 86, and 87 in development data). The Do you drink? class

is very frequent, while the other two are in the least frequent quintile. Since drink by itself

implies alcohol, the trigram do you drink is highly predictive of the alcohol class, and the

CNN almost always errs on the other classes.

The RNN, on the other hand, handles this distinction quite well. In all cases, drink is

attended by multiple heads (Figure 5.3), but across the set most of the heads are focused on

representing the verb itself, while the magenta and tan representations in particular (third

and last row, respectively) are representing the object of the drinking. In the absence of an

object, the object-focused head lands on the verb itself, and learns the implicit meaning of

alcohol from the supervision.

Further examination of the bottleneck representations shows some evidence of the model

learning to combine concepts from frequent classes to improve performance on a rare class.

In development data, the rare class Does your back hurt at work? (class 82) sees a large

boost in accuracy from the Self-attentive RNN relative to the baseline CNN (57% vs 14%,

respectively), albeit over only seven examples. Figure 5.4 suggests that the rare class is

Figure 5.3: Example inputs with bottleneck attention head representations. The colored underlines correspond to the foci of the attention heads, with opacity corresponding to attention weights. The activation patterns in the correspondingly-colored rows of the grid representations reflect how the attended tokens are represented by each head. Note that heads consistently attending to “drink” (e.g. yellow and green) have the same or similar representations across classes, while heads attending to the object of drinking (e.g. magenta and tan) have distinct representations for each class; further, the object-focused heads accept the verb as a stand-in for its implicit object when alcohol is not explicitly mentioned.

represented with components of the representations of work and worsening, illustrated by

examples from the classes What do you do for a living? (class 290) and What makes the

pain worse? (class 306), respectively. In particular, the magenta head strongly attends to the

“work” token in the class 82 example, and the representation in that head (and to a lesser

extent, the green one as well) is very similar to the representation of the same word in class

290. Similarly, the red and tan heads in class 82 attend strongly to the “worse” token, which

is similar to the “worse” token from class 306. Obviously this is not perfect; “worse” is

attended strongly in class 82 by the darker blue head (row 2), but the representation is more

dissimilar in the frequent analog, despite having a similar context. Nonetheless, we take the

Figure 5.4: A rare class borrows representations from two frequent classes. Note that the magenta representation of “work” in class 82 is very similar to the magenta representation of the same word in class 290, and that the red and tan representations of “worse” are very similar to the same word in class 306.

overall trend as evidence that the model may be appropriately recombining representations

of frequent patterns to improve rare class performance.

We confirm that some of these behaviors persist in the full model by performing agglom-

erative clustering on the full head representation in the test RNN. A portion of a resulting

dendrogram is shown in Figure 5.5. We see that the head that attends most strongly to

water and coffee also often represents alcohol and drink in the same cluster. Marijuana,

drugs and tobacco (via its associated verb smoke) are also represented nearby, as well

as other, more licit consumables such as medications. We can also see that alcohol is

represented consistently across classes, since members of test classes 93 and 229 (Do you

drink? and How much do you drink?, respectively) are represented in the same cluster

(and overwhelmingly classified correctly).20 Meanwhile, other heads attend to the verbal

meaning of drink (not shown), and encouragingly, these representations cluster nearby to

20 Classes were renumbered from development to test experiments, to accommodate the unseen classes.

similar consumption verbs such as “use” in the context of illegal drugs. This may be expected

due to the pretrained word vectors, but we also observe clusterings of apparently unrelated words in development data like “take” and “on,” which are similarly predictive of questions

about prescribed medication (e.g. “Are you on any prescriptions?”), but which word senses

are unlikely to converge representationally from pretraining on a general domain corpus. We

take this as evidence of the BiGRU’s ability to disambiguate word senses based on context,

especially since we occasionally observe the same word types in different clusters within

the same head.

In Figure 5.6, we can see some very broad concepts associated with tense and aspect

being captured by the same attention head, that generalize across many classes. In particular, we see a cluster containing many instances of ever, had, and past, which appear as members

of several classes. Such temporal distinctions separate many otherwise semantically identical

classes in the data set, such as Do you smoke? vs. Have you ever smoked?. This example

highlights another potential interpretation of the disparity in performance between the CNN

and RNN: in many cases, the smallest unit of generalization is a single word. The smallest

filter size in the CNN has a width of three, so the CNN must observe “you ever [verbed]”

and every meaningful variation for each class dependent on perfect tense. The RNN with

self-attention, on the other hand, can learn that a handful of single words distinguish present

from perfect.

Note, however, that many portions of these dendrograms do not demonstrate tight

clusterings, nor are they as easy to interpret as the neat examples listed here. Yet, given the

high performance of the RNN with self-attention, it would seem that some of these less

clear representations may simply be unimportant for the classification decision. This would

lend more support to the interpretations of Clark et al. (2019) and Voita et al. (2019) that

multiple attention heads each learn a relatively narrow focus, and when not needed, act as a

no-op, or can be ablated or otherwise ignored.

As noted above, the model from which our RNN is derived imposed an orthogonality

constraint that encouraged the attention heads to attend to different parts of the input. This

constraint applies a loss penalty to each example:

P = ||AA^T − I||_F^2    (5.5)

where || · ||_F indicates the Frobenius norm of a matrix, A is the multi-head attention matrix, and I is the identity matrix. Since this is a penalty, the optimization drives AA^T toward the

identity matrix, which encourages each attention head to attend to separate tokens. The

penalty also encourages focusing on as few tokens as possible, since each row of A must sum

to 1. Given the analysis presented above, which shows the benefit of multiple heads attending

to the same words for different reasons, it is unsurprising that this hurts performance for our

model (see Orthogonal Attention RNN in Table 5.1), but the drop in performance does

support the analysis. Perhaps more interestingly, qualitatively we observe that under this

constraint, the attention heads become more or less positional. During training, it is easiest

to reduce the loss by quickly enforcing orthogonality by sorting the heads into a positional

order. This then prevents the more effective semantically organized heads from emerging in

the training.
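For reference, the penalty of Equation 5.5 can be computed as in the following sketch; the weight with which it is added to the classification loss is a hyperparameter not specified here, so penalty_weight below is hypothetical.

```python
# Sketch of the orthogonality penalty of Eq. 5.5 over a batch of attention matrices.
import torch

def orthogonality_penalty(attention):
    # attention: (batch, num_heads, seq_len); each row of A sums to 1 after the softmax
    eye = torch.eye(attention.size(1), device=attention.device)
    gram = torch.bmm(attention, attention.transpose(1, 2))   # A A^T, (batch, heads, heads)
    return ((gram - eye) ** 2).sum(dim=(1, 2)).mean()        # squared Frobenius norm

# loss = cross_entropy(logits, labels) + penalty_weight * orthogonality_penalty(A)
```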

5.6 Conclusion

In some sense, our analysis is unsurprising. Words having the same input representations

should cluster together in model-internal representations, and members of the same class

should similarly cluster. However, we have shown evidence that the self-attentive RNN

does some amount of word sense disambiguation that generalizes across classes, and this

behavior is driven only by semantic classification. From a human perspective, it makes sense

that learning the most generalizable representation should be effective, but it’s not clear

that a model would need to learn those generalizations in order to perform the classification

task. Clearly it benefits from doing so, so it seems the multi-headed self-attention at least

allows for learning these generalizable concepts and the corresponding better optimum. Our

analysis supports the claim that representations learned in frequent classes are transferring

to, and improving performance on, rare classes, and further supports the value of a data set with a large number of subtly distinct classes.

There are some interesting questions and open issues that should be addressed with

future work. Additional experiments should do more to control for parameter counts; in

particular, the mean pooling RNN result is based on a model with far fewer relevant

parameters than the Self-attentive RNN it is compared to. Similarly, parameter counts

should be matched for comparisons of the Bottleneck RNN to the full Self-attentive RNN,

to more robustly characterize the effects of the additional nonlinearity in the bottleneck

model. The Bottleneck representations also seem to reflect something like rudimentary

“concepts,” insofar as similar semantics often cluster together in the representation space.

It would be interesting to examine whether any kind of “metacognitive” processes could

improve performance, for example with deductive or abductive inferences about relationships

between representations across attention heads. If two attention heads are attending to the

same word, can a meta-model determine that one representation is redundant? Would

consolidation of redundant representations release model capacity to improve performance

on other classes? How would such a consolidation process be designed? Can representational

context allow a model to infer domain-specific synonymy? These are very speculative and

aspirational notions, but the direct observation of something resembling concept formation

raises some intriguing possibilities.

The final data scarcity problem that we consider in this dissertation is an issue of

modality transfer: how can we improve the performance of a spoken dialogue agent when

the only in-domain data we have available is from a different modality? In the next chapter, we borrow from the character submodels of the stacked CNN in Chapter 4, which were

developed to achieve robustness to misspellings, to build CNNs that are robust to phonetic

misspellings, i.e. speech recognition errors.


Figure 5.5: Excerpt of dendrogram from agglomerative clustering of full representation of head 3. Leaves are labeled with the maximally attended word for this head, along with the label index of the sentence. Height of branch points corresponds to distance between merged clusters (or leaves), with color differences highlighting clusters with diameters under an arbitrary threshold. Note the cluster that corresponds to the objects of drinking (e.g. “alcohol”), and that it includes “drink” itself, capturing the implicit object.

Figure 5.6: A second excerpt of the same dendrogram in the previous figure. Lines appear in red because all branch points are below the cluster coloring threshold, and many of the attended words are temporal in nature. Note the cluster of words relating to past existence of some event (e.g. “ever,” “had,” and “past”), and that they span several classes.

Chapter 6: Modality Transfer with Text-to-Phonetic Data Augmentation

The previous chapters have largely taken aim at data scarcity problems that are inherent

to the Virtual Patient domain by virtue of its expert user base, or by virtue of the natural

long tail in the frequency of semantic classes—with the exception of the out-of-scope

problem. In this chapter, we consider yet another kind of data scarcity: that brought

about by a change of modality. Because of the unique problems of spoken dialogue (see

Section 2.1.3), we expect that a classification model trained on text inputs will not work

as well on automatically recognized speech inputs. The work in this chapter establishes a

novel approach to addressing this problem, mainly by making the classifier resilient to the

types of errors that are likely to occur in a generic ASR system. It does this by providing

parallel phonetic inputs for the classifier, in addition to the regular recognized words. We

also see a benefit from randomly sampling alternative inputs as training examples, where

the alternatives are generated by a novel error prediction method. The method recovers a

significant proportion of the error volume induced by naïvely using text-trained models on

ASR inputs.21

21 This chapter is a modification of an article that was accepted to ICASSP 2019 (Stiff et al., 2019).

6.1 Introduction

The extension of the text-based Virtual Patient to use a spoken interface presents a

number of problems, as discussed in Section 3.3. A prominent one is the fact that spoken

language is quite different from written language. It uniquely contains fillers, pauses, repairs,

and more, and speakers are prone to be more verbose than writers, since it is very easy for

humans to produce speech. Furthermore, speech recognition is error prone, whether done

by machines or humans, and in automatic speech recognition (ASR) systems, the types of

errors are much different from the errors seen in text-based systems. Where the Stacked

CNN ensembles utilized a character-based input representation to develop robustness to

typos, the expected benefit of such a model to the output of an ASR system would be smaller,

since it will always produce correctly-spelled words. The real problem is that it might not

always produce the correct words. An off-the-shelf ASR system would be expected to

produce a non-trivial number of misrecognitions in the fairly niche Virtual Patient domain,

so we sought to understand how serious this problem might be, and how we might address

it, despite having no in-domain speech data. Note that while this work was developed in the

Virtual Patient context, it has broad applicability to any novel domain-specific task that uses

speech recognition for inputs.

The usual approach to improving the performance of ASR systems deployed to provide

input to downstream tasks in custom domains would be to reduce misrecognitions by training

custom acoustic and/or language models for the application. However, in our unique domain,

the extensive annotated speech training data required to develop high quality models was not

available. In such low- or no-resource cases, it may make sense to deploy a broad-purpose

ASR system, and to just make the downstream task more tolerant of the inevitable ASR

errors. However, without speech transcripts from the general purpose system in the target

domain, it is difficult to know what types of errors to expect, so training a downstream task

to be tolerant of those errors is not straightforward.

In order to increase the robustness of downstream tasks to ASR errors, we propose a

simple method that allows us to leverage existing text data within the domain of interest,

directly inspired by the character CNNs in the Stacked CNN model. In the same way

that character-based CNNs are robust to misspellings, we aim to develop a classifier that

is robust to phonetic misspellings, i.e. ASR misrecognitions. In brief, we infer phonetic

representations of the in-domain data from the text modality using a grapheme-to-phoneme

converter, and instead of building an ensemble on words and the characters that comprise

them, we build an ensemble on words and the (inferred) phonemes that comprise them. The

speech recognizer, lacking domain-specific models, may sometimes produce transcripts of

generally likely words that have no semantic connection to what was said—e.g., “Oreo”

instead of “how are you”—but which share some acoustic similarities. The downstream

model should know what the important semantic distinctions “sound like,” so that if the

ASR produces the wrong words, the downstream model still has a chance to reinterpret the

sounds correctly.

We find that this method recovers a substantial portion of the errors resulting from

naïvely using speech transcripts as input to a model trained only on the text. We are able to

further boost performance by generating alternative versions of the text input that a speech

recognizer is likely to produce in error, and randomly sampling these as alternatives of

the original text during training. Our error generation method does rely on speech data to

determine specific error likelihoods, but importantly, we are able to show a benefit by using

an out-of-domain, general purpose speech corpus. In sum, we are able to boost performance

of a speech-based system in a custom domain, where the only in-domain data comes from a

non-speech modality.

While speech is a more natural modality for doctor-patient interactions, data collection

and availability is challenging, for many reasons outlined in previous chapters. In this study,

the only data available for tuning the speech recognition of the virtual patient actually come

from the typed conversations of previous versions of the patient.

Throughout this text we use typescript to refer to the typewritten conversations with the virtual patient, and speakscript to refer to speech recognition transcripts.

6.2 Related Work

Dialog systems have recently attracted a fair amount of research attention in terms of

both interpretation and generation of natural dialog, e.g. Li et al. (2017b). Besides being

predominantly in the text modality, this work sits in direct contrast to our domain, due to the

need for the virtual patient to present consistent, precise responses, in order to accomplish its

educational objectives. This work focuses solely on interpretation: for language generation we rely on precise predefined answers required by the medical teaching staff for educational

training and assessment.

Dialogue system development in the face of resource constraints has been a challenge

for several groups. Plauché et al. describe methods for language adaptation for speech

dialog systems in a target language with little recorded speech data, by adapting recognition

models as new input is collected (Plauché et al., 2008). In perhaps the closest work to ours,

Sarikaya et al. deploy spoken dialog systems in new domains with little or no resources

(Sarikaya et al., 2005), mining static text resources to develop in-domain language models

to improve the speech recognition performance directly. We are unaware of other work

utilizing in-domain, cross-modal data to improve the compatibility of a downstream model

Of course, this work builds directly upon the Stacked CNN described by Jin et al. (2017), and deploys text CNNs (Kim, 2014) as the downstream classification model.

6.3 Model

The downstream classification model used to identify questions is an ensemble of text

CNNs (Kim, 2014), following Jin et al. (2017) (Figure 6.1). These details are described more completely in Section 4.1, but are summarized again here for convenience. We train two sub-ensembles and combine their output with a stacking network (Wolpert, 1992). The stacking network outputs a weighted sum of its inputs, with weights learned to minimize the error.

Each sub-ensemble is trained on one of the two forms of the input, i.e., inferred phonetic representation and original typescript. The output of each sub-ensemble is determined by majority voting, which empirically performs better than a product of experts (Hinton, 2002) or averaging. Vote tallies of each sub-ensemble serve as input to the stacking network.
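The combination step can be sketched as follows; the parameterization of the stacking network shown here is our simple reading of "a weighted sum of its inputs," and names and tensor shapes are illustrative. The final answer is the argmax of the stacked output.

```python
# Sketch of majority voting within a sub-ensemble and the stacking combination.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 359

def vote_tallies(member_logits):
    # member_logits: list of (batch, NUM_CLASSES) outputs, one per ensemble member
    votes = torch.stack([l.argmax(dim=1) for l in member_logits], dim=1)  # (batch, members)
    return F.one_hot(votes, NUM_CLASSES).sum(dim=1).float()               # per-class vote counts

class StackingNetwork(nn.Module):
    """Weighted sum of the two sub-ensembles' tallies, with one learned weight each."""
    def __init__(self):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(2))

    def forward(self, phoneme_tallies, word_tallies):
        return self.weights[0] * phoneme_tallies + self.weights[1] * word_tallies
```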

Both sub-ensembles consist of five convolutional networks, each trained on a different subset of the data. Each of these has a single convolutional layer followed by ReLU activations and max pooling, using dropout of 0.5. This is fed into a single fully-connected linear layer to produce the 359-dimensional softmax output of the network, which is trained using a cross-entropy criterion.

Phoneme-based CNNs take input of one channel of 16-dimensional embeddings (except for the 2-channel condition, see Section 6.4.1), initialized randomly and tuned for the task.

Convolutional layers consist of 400 kernels each of widths 2 through 6 phonemes. Inputs to the word-based CNNs are 300-dimensional pretrained word2vec embeddings (Mikolov et al., 2013), held static during training.

Figure 6.1: Overview of the classification model. Phoneme- and word-based representations are input to ensembles of text CNNs. Output of ensembles is determined by majority voting, and combined in a stacking network to produce final classification output.

Word-based networks use 300 kernels each of widths 3, 4,

and 5 words.

Under sampling conditions, training examples have alternate versions which may be

presented for training instead of the original input. These alternate versions simulate generic

ASR errors (see Section 6.4.1). We first randomly determine whether to choose an alternate;

if an alternative is desired, it is sampled according to the likelihood of its generation.
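A minimal sketch of this sampling step is shown below, assuming each training sentence is stored with its generated alternatives and their generation frequencies (see Section 6.4.1); the data layout and the sample_rate argument are illustrative.

```python
# Sketch of choosing between the original input and a sampled error alternative.
import random

def choose_training_input(original, alternatives, frequencies, sample_rate=0.2):
    """original: str; alternatives: list of str; frequencies: list of generation counts."""
    if alternatives and random.random() < sample_rate:
        # Pick an alternative in proportion to how often the error model produced it.
        return random.choices(alternatives, weights=frequencies, k=1)[0]
    return original
```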

6.4 Experiments

The main experimental variables that we manipulate are the representation of phonemes

used, and the rate at which we randomly sample erroneous alternative forms of the input. In

this section we describe in detail the data used in the experiments, as well as the experimental

conditions.

6.4.1 Data

The core data set is used for training. Spelling errors occur regularly in the set, and are

generally left as-is for the purpose of phonetic inference (although some unpronounceable

special characters were stripped).

Phonetic data is derived from the typescript by looking up the pronunciation in CMUdict

(omitting stress), or using a Phonetisaurus (Novak et al., 2012) grapheme-to-phoneme model

trained on CMUdict for unknown words. We experiment with three variations on the

phonetic representation. The plain phones condition simply concatenates the phoneme

sequences of the constituent words in the sentence in order. The boundary tokens condition

adds a single “boundary phoneme” to the alphabet, which is inserted between words, to

explore the value of word segmentation information in the semantic classification task.

Finally, the 2-channel condition adds word boundary information in a second channel to

the text CNNs comprising the phonetic sub-ensemble. In other words, each phoneme is

represented as both the identity of the phoneme, as well as whether or not that sound is the

start of a word (both encoded as a 16-dimensional embedding vector).
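The three phonetic variants can be derived from typescript roughly as in the sketch below, which uses the NLTK copy of CMUdict (assumed to be downloaded) and a hypothetical g2p() fallback in place of the Phonetisaurus model; the boundary symbol <wb> and the exact placement details are our naming choices.

```python
# Sketch of building the plain-phones, boundary-token, and 2-channel inputs.
import re
from nltk.corpus import cmudict

PRON = cmudict.dict()          # word -> list of pronunciations with stress digits

def phones_for(word):
    if word in PRON:
        pron = PRON[word][0]
    else:
        pron = g2p(word)       # hypothetical fallback, e.g. a Phonetisaurus wrapper
    return [re.sub(r"\d", "", p) for p in pron]    # strip stress markers

def phonetic_inputs(sentence):
    words = sentence.lower().split()
    plain, with_boundaries, two_channel = [], [], []
    for w in words:
        phones = phones_for(w)
        plain.extend(phones)                                   # "plain phones"
        if with_boundaries:
            with_boundaries.append("<wb>")                     # boundary token between words
        with_boundaries.extend(phones)
        # 2-channel: phoneme identity plus a word-start indicator for each phoneme
        two_channel.extend((p, int(i == 0)) for i, p in enumerate(phones))
    return plain, with_boundaries, two_channel
```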

Simulated ASR error alternatives are generated by a method due to Serai et al. Briefly,

the method aims to simulate a neural acoustic model making incorrect predictions, and to

decode the resulting lattice as usual, generating text that is likely to be erroneously produced

by a non-domain specific acoustic model. The technique samples both when to produce an

erroneous phoneme, and which one to produce, if so. The distribution of phoneme choice

is learned from the confusions produced by a trained model, but the posterior probabilities

of erroneous phonemes are determined using the confusability of the original phoneme, to

simulate an over-confident system. This method is used to generate up to 100 alternatives

for a given typescript input sentence, along with the frequency with which each alternative

is produced. Note, again, that use of this method constitutes a dependence on audio data, but

this data need not be (and in the present experiment, is not) from within the target virtual

patient domain.

To evaluate the effects of our data augmentation, we collected a small test set of

speakscript. Six adult, native English-speaking volunteers, three each male and female, read

randomly selected dialogs from the enhanced data set. These read-speech utterances were

fed into the target ASR system to collect the corresponding speakscript. The dataset consists

of 756 transcribed utterances. After spelling correction of the typescript input, word error

rate of the speakscript output was calculated at approximately 10%. Classification accuracy

of the speakscript test set in the combined typescript-trained model was 65.7% (cf. 69.9%

for typescript input). We also generated phonetic forms of the speakscript test data using the

same methods used for typescript.

The test set does have several shortcomings: its size does not admit very many of the

types of errors we would be able to correct with our method; read speech is better-behaved

than spontaneous speech (Nakamura et al., 2008); and it includes some unseen labels.

Nonetheless, it allows an evaluation of our approach.

6.4.2 Experimental details

Models are trained according to descriptions provided by Jin et al. (2017), and again,

are described in more detail in previous chapters, but summarized here. All models in

each sub-ensemble are trained individually, using distinct 90/10 train/dev splits, using the

Adadelta learning rule (Zeiler, 2012), with initial learning rate 1.0. We train each sub-model

for 25 epochs, and keep the model from the epoch with the best dev set performance.

We report accuracy for each sub-ensemble (Phonemes and Words), as well as the

accuracy of the combined system. With the exception of the “all alternatives” condition

(see below), all of the different inferred phonetic representations used the same training and

development splits during training. The training data for every model in the ensemble was

supplemented with a list of the “canonical” sentences for each of the 359 classes. Thus, the

development set for every model was guaranteed to have no unseen classes.

We include the results of an early experiment in which we simply trained the whole

system using all of the available alternative error forms, randomly shuffled, and split 90/10

for each sub-model. This experiment was unsuccessful (see sections 6.5 and 6.6), but was

motivation for implementing the sampling paradigm, so we report its results for comparison.

In addition to altering the phonetic input representation, we experiment with sampling

rates ranging from 0-50% for error alternatives. In experiments using sampling, dev set

examples never use sampled alternatives.

6.5 Results

Results are reported in Table 6.1. Likely due to the small sizes of the training and test

sets, as well as the randomness introduced by sampling, the results exhibit a fair amount

of variation from run to run. Therefore, we report averages over three runs under identical

model parameterizations.

The best-performing combined system is plain phonemes with a sampling rate of 20%;

this recovers approximately 62% of the increased error rate in using speakscript within a

typescript model. Plain phonemes also exhibit the best performance among the non-sampled

conditions. 2-channel bounds give the best performing phoneme sub-ensemble, although

the difference from the baseline comes nowhere close to statistical significance. Combined

Representation           Sampling   Phonemes   Words    Combo
Baseline (typescript)    N/A        (system trained as combination only)    69.9
Baseline (speakscript)   N/A        (system trained as combination only)    65.7
All alternatives         N/A        64.95      65.48    65.74
Plain phonemes           0%         67.15      66.27    67.55
                         5%         66.89      66.76    67.68
                         10%        66.75      66.40    67.73
                         20%        66.75      66.00    68.30*
                         50%        66.36      66.09    67.50
Boundary tokens          0%         66.45      66.09    67.64
                         5%         66.67      66.05    67.86
                         10%        66.58      66.88    67.90
                         20%        65.88      66.76    67.77
                         50%        65.96      66.31    67.99
2-channel bounds         0%         67.37*     66.89*   67.37
                         5%         66.67      66.58    67.59
                         10%        66.48      66.40    67.42
                         20%        66.62      66.89*   68.12
                         50%        67.11      66.36    67.95

Table 6.1: Test set question classification accuracy, reported as the average of three runs. Column maxima are marked with an asterisk. All “Combo” results are a significant improvement over the speakscript baseline using Pearson’s χ2 and the Benjamini-Hochberg multiple tests correction (Benjamini and Hochberg, 1995) with a false discovery rate of 10%.

systems are always at least as good as either of the constituent sub-ensembles, and usually much better, although we don’t test for significance of this difference.

6.6 Discussion

The aforementioned test set issues make it difficult to make sweeping pronouncements

about the results, but we do find some encouraging trends. The two clearest such results

are 1) that even inferred phonetic representations can improve speech recognition input for

downstream tasks, and 2) that sampling generated errors seems to further boost performance,

although we cannot claim statistical significance relative to 0% sampling.

The motivation for sampling in the first place derives from the negative result from the

“all alternatives” condition. In essence, including all alternatives just allowed for serious

overfitting: with only minor variations in the training examples, it overspecialized on the

specific sentences underlying the alternate forms, harming performance on unseen sentences.

This may have been mitigated with smarter stratification of the development sets, but

sampling alternatives also enhances variety in the surface forms for each label without

making the development sets easier.

Because sampling does not seem to benefit solely word-based or phoneme-based systems,

it would seem that sampling encourages diversification across the two sub-ensembles, as the

best combination results usually do not have the best component results. Indeed, the best-

performing individual systems that contribute to the averages shown in the table maintain

this trend (data not shown).

The benefit of word boundary information is less clear: the best-performing model on

average included no word boundary information, and the second best average used the version of word boundaries that is easiest to ignore; however, boundary tokens sometimes

outperform other representations under otherwise equivalent conditions. This speaks to the

need for further experiments with more statistical power.

6.7 Future Work

As this was a pilot study, there are many avenues for improvements to the current work,

as well as new questions identified by the experiments presented. First and foremost is the

need for more data, both to improve generalizability of the models, as well as to put the

results on firmer statistical footing. In the time since the work described in this chapter was completed, we have repeated this experiment using a subset of the spoken data set (see

Section 3.1.4) as a test set, to attempt to improve the statistical power of the results here.

We only included examples where both the correct label and ChatScript’s interpretation were classes seen in the core data set, and tested the configuration that showed the best

results here, i.e. plain phones with 20% sampling. That experiment, as well as sensible

checks, failed to show an improvement over a baseline using a word/character ensemble

on the speakscript. The working hypothesis is that the subsetting operation excluded many

misrecognitions that might have benefitted from phonetic representations, possibly due

to the known biases in the annotation of the data. Ongoing investigations will assess the validity of this hypothesis at first with simple word error rate metrics for the included and

excluded sets. We have also experimented with different methods of generating errorful

input alternatives (Serai et al., 2020), but these also did not improve performance, at least

for the read-speech test set collected here. These approaches may yield fruitful research

once the issues with the spoken data set are better understood. Additional investigations will

need to compare the phonetic augmentation approach to language model customizations

using the in-domain text data. The cloud-based speech recognizer provides straightforward

facilities for such customization, although the mechanics of the customization process are

fairly opaque.

An intriguing question raised by the current study is why random alternative sampling

affords a benefit, and whether the mechanism of the benefit is the same for phone repre-

sentations as for words. One possibility is that sampling is just a form of regularization, which may be borne out by the slight drops in performance for each of the sub-ensembles —

suggesting the need to directly compare to other types of regularization. A further possibility

is just that, by luck, the random samples introduce some of the specific errors seen in the test

set. If this were the case, we might expect less benefit in a broader domain, as confusable

alternatives, “toaster” is probably a safe replacement for “mister” in our domain, but only

because questions about breakfast are irrelevant to the patient’s back pain.

Also of interest are ways in which we might more directly encourage acoustic similarities

to be represented in the input, instead of depending on the distant supervision of the correct

semantic class and generated errors to encourage similarities to emerge. One straightforward

option to try would be to initialize the embedding matrix for phonemes with corresponding

average MFCCs or GMM representations.
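A minimal sketch of that initialization, assuming per-phone average MFCC vectors have been computed offline; the phone_to_avg_mfcc mapping and the zero-padding scheme are illustrative choices rather than a tested recipe.

import torch
import torch.nn as nn

def init_phone_embeddings(phone_to_avg_mfcc, phone_vocab, embed_dim):
    """Build an embedding layer whose rows start from average MFCC vectors.

    Phones without acoustic statistics keep the default random initialization;
    acoustic vectors are zero-padded or truncated to the embedding width.
    """
    embedding = nn.Embedding(len(phone_vocab), embed_dim)
    with torch.no_grad():
        for idx, phone in enumerate(phone_vocab):
            mfcc = phone_to_avg_mfcc.get(phone)
            if mfcc is None:
                continue
            vec = torch.zeros(embed_dim)
            n = min(embed_dim, mfcc.numel())
            vec[:n] = mfcc[:n]
            embedding.weight[idx] = vec
    return embedding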

Finally, we are currently experimenting with a form of knowledge distillation for appli-

cation to this task, in which we seek to minimize the mean squared error between analogous

layers in a high-performing network and a network learning from alternative versions of

the same input. In this way we hope to encourage the representations of the alternatives to

become similar in semantically coherent ways. Initial results have been promising, but did

not surpass the best models presented here.
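A minimal sketch of that distillation objective, assuming the two networks expose analogous hidden layers; the teacher's activations are detached so only the student is updated, and the weighting term is illustrative.

import torch.nn.functional as F

def distillation_loss(student_layers, teacher_layers, student_logits, labels,
                      alpha=0.5):
    """Task loss on the alternative input plus MSE between analogous layers.

    `student_layers` are hidden states computed from an errorful alternative;
    `teacher_layers` are the corresponding states from the high-performing
    network run on the clean input.
    """
    task_loss = F.cross_entropy(student_logits, labels)
    match_loss = sum(
        F.mse_loss(student, teacher.detach())
        for student, teacher in zip(student_layers, teacher_layers)
    ) / len(student_layers)
    return task_loss + alpha * match_loss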

Chapter 7: Conclusion and Future Work

This dissertation presents several contributions to the Virtual Patient project specifically,

and the scientific literature generally.

The first, and central, contribution is an engineered system that meets its pedagogical

goals reasonably well—controlled experiments validated that it achieves approximately

80% accuracy on positive classes, focused on the most frequent classes; subsequent work

demonstrably improved components of the system; and the scoring functionality of the

system has been shown to correlate with human graders (Maicher et al., 2019), although

controlled experiments comparing actual educational outcomes are left to future research.

The Virtual Patient also serves as a framework for data collection and research in various

issues in linguistics, speech, and dialogue. This system sees regular use, and has enabled

novel educational programs not described in this text, such as ethnic bias training for

physicians.

Another contribution was the validation of the hybrid question identification model as a

significant performance boost for our task, relative to a rule-based system by itself. This, in

turn, led to a characterization of the out-of-scope problem in the Virtual Patient domain, as well as a preliminary approach to mitigating it. This work, the bulk of which was presented

in Chapter 4, is currently under review for publication in the Journal of Natural Language

Engineering.

The chief contributions from Chapter 5 are a surprising negative result—that the powerful

BERT model is not a panacea in our domain—as well as the discovery that a different

multi-headed self-attentive model performs particularly well on rare classes in our domain.

The most important contribution is an analysis of why that seems to be the case, with

specialization among attention heads playing an important role, and the uniquely large

number of fine-grained semantic classes likely contributing. A version of this work was

submitted to the SIGDIAL 2020 conference, and is undergoing review.

The final main contribution in this dissertation, presented in Chapter 6, is a fairly

simple method of augmenting training data with inferred phonetic representations to make

downstream tasks robust to speech recognition errors. The pilot work described in that

chapter was accepted to ICASSP 2019, and the expansions outlined in Section 6.7 are

currently in development for potential submission to TASLP.

The Virtual Patient program is rich with opportunities for continued research, and this

dissertation can hopefully serve as a useful roadmap for anyone seeking to explore them.

Besides the extensions and expansions of individual contributions that were mentioned in

the context of their related projects, the Virtual Patient program—the goals, software, data,

participants, etc.—offers several opportunities for continued research. Several of these were

introduced in Section 3.3, and we elaborate on some of them here.

There are many interrelated pragmatic spoken dialogue issues that show up within the

Virtual Patient, which directly or indirectly involve turn-taking. This may be somewhat

surprising, since the agent is modeled as a question-answering agent that never takes

initiative. However, in the same way that this simplifying assumption has been useful for

building a functional system that provides an opportunity to study several data scarcity

issues “in the wild,” it offers an opportunity to build in turn-taking behaviors that are not critical to the basic functioning of the system, and thus easier to experiment with and test.

To elaborate, the term “question-answering dialogue agent,” which has been used throughout this dissertation, is actually a misnomer! While the overwhelming majority of inputs are indeed questions, a substantial number fall into other speech act categories, including fillers, commissives, and acknowledgements, which further break down into different dialogue act intents, such as confirmation of information, or expressions of encouragement, approval, and sympathy. This variety of inputs matters for turn-taking because many of them may not actually cede a turn, or even require a response at all. Fillers are commonly used by speakers to hold a turn; that is, to indicate that they are not done talking. General-purpose speech recognition systems, however, may be designed to suppress some such inputs, because popular downstream tasks such as voice search or keyword detection are not helped by the presence of “umm” and “uh”. The Virtual Patient’s end-of-utterance (EOU) detection would be substantially improved by simply waiting longer to send an input if a filler word was the last thing detected, or even by ignoring such inputs altogether. Similarly, many utterances start with simple acknowledgements of the previous utterance, e.g., “OK. [pause] Do you have any children?” The pause risks triggering an EOU in the current system, but an acknowledgement-aware EOU detector could be made to wait, to great (and measurable) effect. Incorporating these kinds of intent detection into the model could also pave the way for understanding and taking advantage of initiative-relinquishing intents in mixed-initiative agents.
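As a rough illustration of the filler- and acknowledgement-aware waiting strategy sketched above, the following shows one possible end-of-utterance policy. The token lists and timeout values are invented for illustration; they are not the deployed system's settings.

FILLERS = {"um", "umm", "uh", "er", "hmm"}
ACKNOWLEDGEMENTS = {"ok", "okay", "alright", "right", "i see"}

def end_of_utterance(partial_transcript, silence_ms,
                     base_timeout_ms=700, extended_timeout_ms=2000):
    """Decide whether to close the user's turn and send the input onward.

    Waits longer after a trailing filler (a turn-holding cue) or a bare
    acknowledgement, which often precedes the real question after a pause.
    """
    text = partial_transcript.lower().strip()
    words = text.split()
    if not words:
        return False
    if words[-1] in FILLERS or text in ACKNOWLEDGEMENTS:
        return silence_ms >= extended_timeout_ms
    return silence_ms >= base_timeout_ms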

While the EOU detection could be improved by teaching a model about these relatively simple semanto-pragmatic phenomena, the general user experience would also benefit from any efforts to make the turn-taking itself more fluid. Here, there are a few separate, but

related, issues. First, incremental understanding could reduce response latencies: if the

beginning of an input is discriminative enough, a response can be selected before the user

has even completed their utterance. Even if an input remains ambiguous until the very

end, the beginning can rule out many interpretations, potentially speeding up the decision

problem left after the last word is received. This would be very similar to work by DeVault

et al. (2011a), but, distinctly, could be implemented as an extension of the RNN presented

in Chapter 5, potentially using student-teacher learning to train an effective unidirectional

model with an added penalty for later decision-making.
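A minimal sketch of what such an incremental extension might look like: a unidirectional GRU scores the input at every prefix, and an auxiliary loss over all prefixes approximates the penalty on late decisions (the student-teacher component is omitted for brevity). All dimensions, weights, and names are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class IncrementalClassifier(nn.Module):
    """Unidirectional GRU that can commit to a class before the input ends."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))  # (batch, time, hidden)
        return self.out(hidden)                   # per-prefix class logits

def incremental_loss(step_logits, labels, early_weight=0.1):
    """Final-step cross-entropy plus an auxiliary loss over all prefixes.

    Scoring every prefix rewards committing to the correct class as early
    as the input allows, approximating a penalty on late decision-making.
    """
    steps = step_logits.size(1)
    final_loss = F.cross_entropy(step_logits[:, -1], labels)
    prefix_loss = torch.stack([
        F.cross_entropy(step_logits[:, t], labels) for t in range(steps)
    ]).mean()
    return final_loss + early_weight * prefix_loss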

A second potential turn-taking improvement would be to handle barge-in and overspeak.

The major impediment to progress here is the design decision to mute recognition in

the application during patient speaking events. Handling any kind of speech overlap would obviously necessitate disabling the muting, which then changes the basic turn-taking

dynamic into a speaker diarization problem, with suppression of self-recognition as a

downstream or auxiliary task.

Again, the Virtual Patient is an effective test bed for any of these projects, because it basically meets most of its goals in its current state. These kinds of experimental modifications 1) will produce obvious, measurable effects, and 2) remain fairly independent of the core

functionality of the system, which makes experimental manipulation easier. The potential

upside from a user experience perspective is also substantial.

In summary, there is much more depth to the Virtual Patient project than appears at

first glance. It is a simple classification problem; but it is also a unique out-of-domain

problem, and a challenging long tail problem, with an unusually rich semantic class model.

It is a simple turn-alternating agent; except when a user was not really done talking. It

is a simple question answering agent; unless the user just wants to express sympathy or

confirm information. The unique properties of the domain create significant opportunities for ongoing discoveries, particularly for spoken dialogue topics, and hopefully this work serves as a springboard for future research.

Appendix A: Label Sets

A.1 Strict Relabeling Labels

any changes in your diet •any chest pain •any family history of depression •any family history of hypertension •any family history of illness •any heart problems •any heart problems in your family •any otc medication •any other hospitalizations •any other past sexual partners •any other problems •any other sexual partners •any prescription medication •any previous heart problems •any prior surgeries •appetite •are you able to drive •are you able to sit •are you able to stand •are you abused •are you active •are you currently working •are you current on bloodwork •are you current on vaccinations •are you dating anyone •are you depressed •are you divorced •are you exposed to secondhand smoke •are you frustrated •are you happy •are you having any other pain •are you healthy •are you in a relationship •are you married •are you mr. wilkins •are you nervous •are your grandparents living •are your muscles sore •are your parents healthy •are your parents living •are your relatives healthy •are your siblings healthy •are you sexually active •are you sick •are you stressed •are you suicidal •are you sure •are you taking any medication •are you taking any medication for the pain •are you taking any other medication •are you taking any other medication for pain •are you taking supplements •can you care for yourself •can you do normal activities •can you move around with the pain •can you point to the pain •can you rate the pain •describe the frequent urination •describe the furniture you were lifting •describe the pain •did anything happen to cause the frequent urination •did anything happen to cause the pain •did i miss anything •did it happen at work •did the ibuprofen help •did the pain start immediately •did you feel or hear a pop •did you find a parking spot •did you have any trouble finding us •did you hurt your back •did you take ibuprofen today •discussion is confidential •do an exam now •do any positions make the pain worse or better •does anyone in your family have back pain •does any position make the pain worse •does any position relieve the pain •does anything else help the pain •does it hurt to touch your back •does it hurt when not moving •does it hurt when you bend •does moving increase the pain •does that plan sound good •does the frequent urination keep you up at night •does the pain improve with exercise •does the pain increase when you stand •does the pain interfere with work •does the pain keep you up at night •does the pain radiate •does this bother you mentally •does your back hurt when you work •does your hip hurt •do you do any heavy lifting at work •do you drink •do you drink coffee •do you drink enough fluid •do you eat fast food •do you eat fruit •do you eat healthy food •do you exercise •do you feel anxious •do you feel safe at home •do you feel well •do you have a cause for the pain •do you have a doctor •do you have a family •do you have a history of depression •do you have annual exams •do you have any allergies •do you have any associated symptoms •do you have any bladder problems •do you have any bowel problems •do you have any children •do you have any chronic illnesses •do you have any hip pain •do you have any hobbies •do you have any medical problems •do you have any pain •do you have any pain in your hip •do you have any pets •do you have any problems urinating •do you have any problems with your hip •do you have any questions •do you have any trouble sleeping •do you have any weakness •do you have a sexually transmitted infection •do you have a significant other 
•do you have a weapon •do you have exposure to chemicals •do you have frequent urination •do you have headaches •do you have loss of bowel or bladder control •do you have numbness or tingling •do you have pain anywhere else •do you have relatives •do you have siblings •do

you have support •do you have vision changes •do you live alone •do you mind if i call you jim •do you need a note for work •do you prefer men or women •do your body parts hurt •do you see your parents often •do you smoke •do you use any contraception •do you use illegal drugs •family history •good •goodbye •has anyone expressed concern about your drinking •has the frequent urination become worse or better •has the frequent urination stopped you from doing anything •has the inability to work caused problems •has the pain affected your activity •has the pain become worse or better •has your weight changed •have you been able to work •have you been incontinent •have you been nauseous •have you been off work •have you been resting •have you ever been dizzy •have you ever been in an accident •have you ever been in the hospital •have you ever been in the military •have you ever been pregnant before •have you ever had any chronic illnesses •have you ever had any serious illnesses •have you ever had a sexually transmitted infection •have you ever smoked •have you ever taken any other medication •have you had a colonoscopy •have you had an accident •have you had any past bladder problems •have you had a prostate exam •have you had back injury before •have you had back pain before •have you had blood glucose checked •have you had frequent urination before •have you had physical therapy •have you noticed a discharge •have you noticed an itch •have you noticed an odor in your urine •have you noticed any blood in your urine •have you noticed any physical changes •have you noticed pain while urinating •have you seen anyone else •have you tried anything else for the pain •have you tried anything for the frequent urination •have you tried any treatment •have you tried heat or ice •having mood changes •hello •hopefully we can help you •how about your body part •how are you •how did the pain start •how did you get here today •how did your grandparents die •how else is the frequent urination affecting you •how else is the pain affecting you •how far did you get in school •how frequent is the frequent urination •how frequent is the pain •how has the pain changed over time •how has this affected you •how has your day been going •how have you been handling this •how intense is the pain •how is home •how is work •how is your blood pressure •how is your bodily function •how is your body part •how is your cholesterol •how is your diet •how is your family •how is your hip •how is your mood •how is your pain now •how is your social life •how long does the pain last •how long have you been a worker •how long have you been taking the aspirin •how long have you been taking the ibuprofen •how long have you had the pain •how many sexual partners •how much do you drink •how much do you work •how much ibuprofen have you taken •how much sleep •how often do you take the ibuprofen •how often do you take the saw palmetto •how often do you urinate •how old are you •how old are your grandparents •how old are your parents •i am medical student •i am sorry •i have all the information i need •i have more questions •i’ll go report to the doctor •is pain better when you lie down •is pain worse when you lie down •is the frequent urination constant •is the frequent urination improving •is the frequent urination new •is the frequent urination worse in the morning or at night •is the pain constant •is the pain deep •is the pain dull •is the pain improving •is the pain in your upper or lower back •is the pain new •is the pain on one
side •is the pain sharp •is the pain worse in the morning or at night •is there anything else i can help you with •is this the first time you have had this pain •is urination always painful •is your job physically demanding •is your job pleasurable •is your job stressful •i understand completely •i will do my best •i would like to get to know more •my name is bob •name and date of birth •negative symptoms •nice talking with you •nice to meet you •nice to see you •none •ok then •please repeat that •should I call you by a different name •social history •sounds like •tell me about your grandparents •tell me about yourself •tell me about your work •tell me more about the saw palmetto •tell me more about your back pain •thanks •that must be awful •that must be hard •that was nice •the doctor will be in •today we have 15 minutes •to summarize •want to ask about •was the onset of the frequent urination sudden •was the onset of the pain sudden •were you doing anything when the pain started •were you healthy •were you lifting anything •what are your allergy reactions •what bothers you the most •what brings you in today •what can’t you do at work •what caused the past back pain •what concerns you about the pain •what do you do for a living •what do you do for fun •what do you eat each day •what do you think about the buckeyes •what do you think about this weather •what do you think is the problem •what else have you tried for the frequent urination •what else have you tried for the pain •what have you tried for the pain •what is your goal for this visit •what is your name •what is your past medical history •what is your religious preference •what makes the frequent urination better •what makes the frequent urination worse •what makes the pain better •what makes the pain worse •what should i call you •what was the dose •what was the dose of aspirin •what were

the pills called •what would you like me to do •when did the frequent urination start •when did the pain start •when do you have the frequent urination •when do you have the pain •when is the pain most severe •when was the last time you had intercourse •when was the last time you saw a doctor •when was your last bowel movement •when was your last period •when was your last tetanus shot •when were you born •where do you live •where do you work •where is the pain •who do you live with •who prescribed the medicine •who supports you •why do you take the medication •why do you take the supplements •would you like some pain medication •you are welcome •you seem uncomfortable •

Appendix B: Table Definitions

-- One row per conversation with the virtual patient.
CREATE TABLE Conversations (
    Convo_num      int NOT NULL AUTO_INCREMENT,
    Client_ID      varchar(8),
    WS_Version     varchar(16),
    First_name     varchar(255),
    Last_name      varchar(255),
    Patient_choice int,
    Input_method   varchar(8),
    Mic            varchar(8),
    Exp_group      varchar(16),
    Raw_score      TEXT,
    Uuid           varchar(40),
    PRIMARY KEY (Convo_num)
);

-- One row per user query within a conversation, with the system's
-- interpretations and replies; keyed by (Convo_num, Query_num).
CREATE TABLE Queries (
    Convo_num      int NOT NULL,
    Query_num      int NOT NULL,
    Input_text     varchar(510),
    CS_interp      varchar(510),
    CNN_interp     varchar(510),
    CS_init_reply  varchar(510),
    CS_retry_reply varchar(510),
    Choice         varchar(8),
    Audio_path     varchar(255),
    CONSTRAINT PK_Query PRIMARY KEY (Convo_num, Query_num),
    CONSTRAINT FK_Query_Convo FOREIGN KEY (Convo_num)
        REFERENCES Conversations(Convo_num)
);

Bibliography

Garen Arevian. 2007. Recurrent neural networks for robust real-world text classification.

In IEEE/WIC/ACM International Conference on Web Intelligence (WI’07). IEEE, pages

326–329.

Harish Arsikere, Elizabeth Shriberg, and Umut Ozertem. 2015. Enhanced end-of-turn

detection for speech to a personal assistant. In 2015 AAAI Spring symposium series.

Ankur Bapna, Gokhan Tur, Dilek Hakkani-Tur, and Larry Heck. 2017. Sequential dialogue

context modeling for spoken language understanding. In Proceedings of the 18th Annual

SIGdial Meeting on Discourse and Dialogue. pages 103–114.

Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: a practical

and powerful approach to multiple testing. Journal of the Royal statistical society: series

B (Methodological) 57(1):289–300.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with

Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc.

Matthew R Boutell, Jiebo Luo, Xipeng Shen, and Christopher M Brown. 2004. Learning

multi-label scene classification. Pattern recognition 37(9):1757–1771.

Luka Bradeško and Dunja Mladenić. 2012. A survey of chatbot systems through a loebner

prize competition. In Proceedings of Slovenian Language Technologies Society Eighth

Conference of Language Technologies. pages 34–37.

Leonardo Campillos-Llanos, Catherine Thomas, Éric Bilinski, Pierre Zweigenbaum, and

Sophie Rosset. 2019. Designing a virtual patient dialogue system based on terminology-

rich resources: Challenges and evaluation. Natural Language Engineering pages 1–38.

Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi

Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representa-

tions using rnn encoder–decoder for statistical machine translation. In Proceedings of

the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

pages 1724–1734.

Alonzo Church. 1932. A set of postulates for the foundation of logic. Annals of mathematics

pages 346–366.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. 2019. What

does bert look at? an analysis of bert’s attention. In Proceedings of the 2019 ACL

Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. pages

276–286.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and

Pavel Kuksa. 2011. Natural Language Processing (Almost) from Scratch. Journal of

Machine Learning Research 12:2493–2537.

Douglas Danforth, A. Price, K. Maicher, D. Post, B. Liston, D. Clinchot, C. Ledford, D. Way,

and H. Cronau. 2013. Can virtual standardized patients be used to assess communication

skills in medical students. In Proceedings of the 17th Annual IAMSE Meeting, St. Andrews,

Scotland.

David DeVault, Kallirroi Georgila, Ron Artstein, Fabrizio Morbini, David Traum, Stefan

Scherer, Louis-Philippe Morency, et al. 2013. Verbal indicators of psychological distress

in interactive dialogue with a virtual human. In Proceedings of the SIGDIAL 2013

Conference. pages 193–202.

David DeVault, Kenji Sagae, and David Traum. 2011a. Incremental interpretation and

prediction of utterance meaning for interactive dialogue. Dialogue & Discourse 2(1):143–

170.

David DeVault, Kenji Sagae, and David Traum. 2011b. Incremental interpretation and predic-

tion of utterance meaning for interactive dialogue. Dialogue and Discourse 2(1):143–170.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-

training of deep bidirectional transformers for language understanding. In Proceedings of

the 2019 Conference of the North American Chapter of the Association for Computational

Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pages

4171–4186.

Layla El Asri, Hannes Schulz, Shikhar Sharma, Jeremie Zumer, Justin Harris, Emery Fine,

Rahul Mehrotra, and Kaheer Suleman. 2017. Frames: a corpus for adding memory to

goal-oriented dialogue systems. In Proceedings of the 18th Annual SIGdial Meeting

on Discourse and Dialogue. Association for Computational Linguistics, Saarbrücken,

Germany, pages 207–219. https://doi.org/10.18653/v1/W17-5526.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008.

Liblinear: A library for large linear classification. Journal of machine learning research

9(Aug):1871–1874.

Luciana Ferrer, Elizabeth Shriberg, and Andreas Stolcke. 2002. Is the speaker done yet?

faster and more accurate end-of-utterance detection using prosody. In Seventh Interna-

tional Conference on Spoken Language Processing.

Roy Thomas Fielding. 2000. Rest: architectural styles and the design of network-based

software architectures. Doctoral dissertation, University of California .

Catherine Finegan-Dollak, Jonathan K Kummerfeld, Li Zhang, Karthik Ramanathan, Sesh

Sadasivam, Rui Zhang, and Dragomir Radev. 2018. Improving text-to-sql evaluation

methodology. In Proceedings of the 56th Annual Meeting of the Association for Compu-

tational Linguistics (Volume 1: Long Papers). pages 351–360.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep

feedforward neural networks. In Proceedings of the 13th International Conference on

Artificial Intelligence and Statistics (AISTATS). volume 9, pages 249–256.

Kyle Gorman and Steven Bedrick. 2019. We need to talk about standard splits. In Pro-

ceedings of the 57th Conference of the Association for Computational Linguistics. pages

2786–2791.

Geoffrey E Hinton. 2002. Training products of experts by minimizing contrastive divergence.

Neural computation 14(8):1771–1800.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput.

9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with

Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Matthew Hutson. 2018. Artificial intelligence faces reproducibility crisis. Science

359(6377):725–726. https://doi.org/10.1126/science.359.6377.725.

Evan Jaffe, Michael White, William Schuler, Eric Fosler-Lussier, Alex Rosenfeld, and

Douglas Danforth. 2015. Interpreting questions with a log-linear ranking model in a

virtual patient dialogue system. In Proceedings of the Tenth Workshop on Innovative Use

of NLP for Building Educational Applications. pages 86–96.

Lifeng Jin, David King, Amad Hussein, Michael White, and Douglas Danforth. 2018.

Using paraphrasing and memory-augmented models to combat data sparsity in question

interpretation with a virtual patient dialogue system. In Proceedings of the Thirteenth

Workshop on Innovative Use of NLP for Building Educational Applications. pages 13–23.

Lifeng Jin, Michael White, Evan Jaffe, Laura Zimmerman, and Douglas Danforth. 2017.

Combining cnns and pattern matching for question interpretation in a virtual patient

dialogue system. In Proceedings of the 12th Workshop on Innovative Use of NLP for

Building Educational Applications. pages 11–21.

Shehroz S Khan and Michael G Madden. 2009. A survey of recent trends in one class

classification. In Irish conference on artificial intelligence and cognitive science. Springer,

pages 188–197.

Chandra Khatri, Behnam Hedayatnia, Anu Venkatesh, Jeff Nunn, Yi Pan, Qing Liu, Han

Song, Anna Gottardi, Sanjeev Kwatra, Sanju Pancholi, et al. 2018. Advancing the

state of the art in open domain dialog systems through the alexa prize. arXiv preprint

arXiv:1812.10757 .

Joo-Kyung Kim and Young-Bum Kim. 2018. Joint learning of domain classification and

out-of-domain detection with dynamic class weighting for satisficing false acceptance

rates. Proc. Interspeech 2018 pages 556–560.

Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. Proceedings

of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP

2014) pages 1746–1751.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization.

arXiv preprint arXiv:1412.6980 .

Aniket Kittur, Ed H Chi, and Bongwon Suh. 2008. Crowdsourcing user studies with

mechanical turk. In Proceedings of the SIGCHI conference on human factors in computing

systems. ACM, pages 453–456.

Mike Lewis, Denis Yarats, Yann Dauphin, Devi Parikh, and Dhruv Batra. 2017. Deal

or no deal? end-to-end learning of negotiation dialogues. In Proceedings of the 2017

Conference on Empirical Methods in Natural Language Processing. pages 2443–2453.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-

promoting objective function for neural conversation models. In Proceedings of the

2016 Conference of the North American Chapter of the Association for Computational

Linguistics: Human Language Technologies. pages 110–119.

Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017a. Ad-

versarial learning for neural dialogue generation. In Proceedings of the 2017 Conference

on Empirical Methods in Natural Language Processing. pages 2157–2169.

Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017b.

Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Confer-

ence on Empirical Methods in Natural Language Processing. pages 2157–2169.

Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou,

and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. arXiv preprint

arXiv:1703.03130 .

Bing Liu, Wee Sun Lee, Philip S Yu, and Xiaoli Li. 2002. Partially supervised classification

of text documents. In ICML. Citeseer, volume 2, pages 387–394.

Chia-Wei Liu, Ryan Lowe, Iulian Vlad Serban, Mike Noseworthy, Laurent Charlin, and

Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study

of unsupervised evaluation metrics for dialogue response generation. In Proceedings

of the 2016 Conference on Empirical Methods in Natural Language Processing. pages

2122–2132.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-sne. Journal

of machine learning research 9(Nov):2579–2605.

Abhijit Mahabal, Jason Baldridge, Burcu Karagol Ayan, Vincent Perot, and Dan Roth. 2019.

Text classification with few examples using controlled generalization. In Proceedings of

the 2019 Conference of the North American Chapter of the Association for Computational

Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pages

3158–3167.

Abhijit Mahabal, Dan Roth, and Sid Mittal. 2018. Robust handling of polysemy via

sparse representations. In Proceedings of the Seventh Joint Conference on Lexical and

Computational Semantics. pages 265–275.

Kellen R. Maicher, Laura Zimmerman, Bruce Wilcox, Beth Liston, Holly Cronau, Allison

Macerollo, Lifeng Jin, Evan Jaffe, Michael White, Eric Fosler-Lussier, William Schuler,

David P. Way, and Douglas R. Danforth. 2019. Using virtual standardized patients to

accurately assess information gathering skills in medical students. Medical Teacher

0(0):1–7. PMID: 31230496. https://doi.org/10.1080/0142159X.2019.1616683.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S Corrado, and Jeffrey Dean. 2013. Dis-

tributed Representations of Words and Phrases and their Compositionality. In Advances

in Neural Information Processing Systems 26 (NIPS 2013). pages 3111–3119.

Robert Moore, Douglas Appelt, John Dowding, J Mark Gawron, and Douglas Moran. 1995.

Combining linguistic and statistical knowledge sources in natural-language processing

for atis. In Proceedings of the ARPA Spoken Language Systems Technology Workshop.

pages 261–264.

Fabrizio Morbini, Eric Forbell, David DeVault, Kenji Sagae, David Traum, and Albert Rizzo.

2012. A mixed-initiative conversational dialogue system for healthcare. In Proceedings

of the 13th annual meeting of the special interest group on discourse and dialogue. pages

137–139.

Vinod Nair and Geoffrey E Hinton. 2010. Rectified Linear Units Improve Restricted

Boltzmann Machines. In Proceedings of the 27th International Conference on Machine

Learning (ICML 2010). 3, pages 807–814.

Masanobu Nakamura, Koji Iwano, and Sadaoki Furui. 2008. Differences between acoustic

characteristics of spontaneous and read speech and their effects on speech recognition

performance. Computer Speech & Language 22(2):171–184.

Kamal Nigam, John Lafferty, and Andrew McCallum. 1999. Using maximum entropy for

text classification. In IJCAI-99 workshop on machine learning for information filtering.

Stockholom, Sweden, volume 1, pages 61–67.

Josef R Novak, Nobuaki Minematsu, and Keikichi Hirose. 2012. Wfst-based grapheme-

to-phoneme conversion: Open source tools for alignment, model-building and decoding.

In Proceedings of the 10th International Workshop on Finite State Methods and Natural

Language Processing. pages 45–49.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary

DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic

differentiation in PyTorch. In NIPS Autodiff Workshop.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blon-

del, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,

M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python.

Journal of Machine Learning Research 12:2825–2830.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors

for word representation. In Proceedings of the 2014 conference on empirical methods in

natural language processing (EMNLP). pages 1532–1543.

Pramuditha Perera and Vishal M Patel. 2019. Learning deep features for one-class classifi-

cation. IEEE Transactions on Image Processing .

Madelaine Plauché, Özgür Çetin, and Udhaykumar Nallasamy. 2008. How to build a spoken

dialog system with limited (or no) language resources. In AI in ICT4D. ICFAI University

Press, India.

Martin F Porter. 2001. Snowball: A language for stemming algorithms.

Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar,

Xingyu Na, Yiming Wang, and Sanjeev Khudanpur. 2016. Purely sequence-trained neural

networks for asr based on lattice-free mmi. In Interspeech. pages 2751–2755.

Ashwin Ram, Rohit Prasad, Chandra Khatri, Anu Venkatesh, Raefer Gabriel, Qing Liu, Jeff

Nunn, Behnam Hedayatnia, Ming Cheng, Ashish Nagar, et al. 2018. Conversational ai:

The science behind the alexa prize. arXiv preprint arXiv:1801.03604 .

Matthew Roddy, Gabriel Skantze, and Naomi Harte. 2018. Multimodal continuous turn-

taking prediction using multiscale rnns. In Proceedings of the 20th ACM International

Conference on Multimodal Interaction. ACM, New York, NY, USA, ICMI ’18, pages

186–190. https://doi.org/10.1145/3242969.3242997.

Herbert Rubenstein and John B Goodenough. 1965. Contextual correlates of synonymy.

Communications of the ACM 8(10):627–633.

Ruhi Sarikaya, Agustin Gravano, and Yuqing Gao. 2005. Rapid language model development

using external resources for new spoken dialog domains. In Acoustics, Speech, and Signal

Processing, 2005. Proceedings.(ICASSP’05). IEEE International Conference on. IEEE,

volume 1, pages I–573.

Emanuel Schegloff, Gail Jefferson, and Harvey Sacks. 1974. A simplest systematics for the

organization of turn-taking for conversation. Language 50(4):696–735.

Bernhard Schölkopf, Robert C Williamson, Alex J Smola, John Shawe-Taylor, and John C

Platt. 2000. Support vector method for novelty detection. In Advances in neural informa-

tion processing systems. pages 582–588.

Prashant Serai, Adam Stiff, and Eric Fosler-Lussier. 2020. End to end speech recogni-

tion error prediction with sequence to sequence learning. In ICASSP 2020-2020 IEEE

International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE.

Prashant Serai, Peidong Wang, and Eric Fosler-Lussier. Improving speech recognition error prediction for modern and off-the-shelf speech recognizers. Submitted to the 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau.

2015. Hierarchical neural network generative models for movie dialogues. arXiv preprint

arXiv:1507.04808 7(8).

Matt Shannon, Gabor Simko, Shuo-Yiin Chang, and Carolina Parada. 2017. Improved end-

of-query detection for streaming speech recognition. In Interspeech. pages 1909–1913.

Gabriel Skantze. 2017. Towards a general, continuous model of turn-taking in spoken

dialogue using lstm recurrent neural networks. In Proceedings of the 18th Annual SIGdial

Meeting on Discourse and Dialogue. pages 220–230.

Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret

Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to

context-sensitive generation of conversational responses. In Proceedings of the 2015 Con-

ference of the North American Chapter of the Association for Computational Linguistics:

Human Language Technologies. pages 196–205.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdi-

nov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal

of Machine Learning Research 15:1929–1958.

Adam Stiff, Prashant Serai, and Eric Fosler-Lussier. 2019. Improving human-computer

interaction in low-resource settings with text-to-phonetic data augmentation. In ICASSP

2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing

(ICASSP). IEEE, pages 7320–7324.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with

neural networks. In Advances in neural information processing systems. pages 3104–3112.

Thomas B Talbot, Kenji Sagae, Bruce John, and Albert A Rizzo. 2012. Sorting out the virtual

patient: how to exploit artificial intelligence, game technology and sound educational

practices to create engaging role-playing simulations. International Journal of Gaming

and Computer-Mediated Simulations (IJGCMS) 4(3):1–19.

David MJ Tax and Robert PW Duin. 1999. Support vector domain description. Pattern

recognition letters 20(11-13):1191–1199.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. Bert rediscovers the classical nlp

pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational

Linguistics. pages 4593–4601.

David Traum, Priti Aggarwal, Ron Artstein, Susan Foutz, Jillian Gerten, Athanasios Kat-

samanis, Anton Leuski, Dan Noren, and William Swartout. 2012. Ada and grace: Direct

interaction with museum visitors. In International conference on intelligent virtual agents.

Springer, pages 245–251.

AM Turing. 1950. Computing machinery and intelligence. Mind LIX:433–60.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N

Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances

in neural information processing systems. pages 5998–6008.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing

multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

pages 5797–5808.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix

Hill, Omer Levy, and Samuel Bowman. 2019. Superglue: A stickier benchmark for

general-purpose language understanding systems. In Advances in Neural Information

Processing Systems. pages 3261–3275.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman.

2018. Glue: A multi-task benchmark and analysis platform for natural language under-

standing. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and

Interpreting Neural Networks for NLP. pages 353–355.

Mingxuan Wang, Zhengdong Lu, Hang Li, and Qun Liu. 2015. Syntax-based deep matching

of short texts. In Twenty-Fourth International Joint Conference on Artificial Intelligence.

Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno,

Nelson-Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al. 2018.

Espnet: End-to-end speech processing toolkit. Proc. Interspeech 2018 pages 2207–2211.

Joseph Weizenbaum. 1966. Eliza—a computer program for the study of natural language

communication between man and machine. Communications of the ACM 9(1):36–45.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2015. Memory Networks. In ICLR.

pages 1–15.

Martijn Wieling, Josine Rawee, and Gertjan van Noord. 2018. Reproducibility in computa-

tional linguistics: Are we willing to share? Computational Linguistics 44(4):641–649.

https://doi.org/10.1162/coli_a_00330.

Bruce Wilcox. 2019. Chatscript. [Online; accessed 23-July-2019].

https://github.com/ChatScript/ChatScript.

Bruce Wilcox and Sue Wilcox. 2013. Making it real: Loebner-winning chatbot design.

ARBOR Ciencia, Pensamiento y Cultura 189(764):10–3989.

D.H. Wolpert. 1992. Stacked generalization. Neural Networks 5(2):241–259.

Yu Wu, Wei Wu, Chen Xing, Ming Zhou, and Zhoujun Li. 2017. Sequential matching net-

work: A new architecture for multi-turn response selection in retrieval-based chatbots. In

Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics

(Volume 1: Long Papers). pages 496–505.

Wayne Xiong, Lingfeng Wu, Fil Alleva, Jasha Droppo, Xuedong Huang, and Andreas

Stolcke. 2018. The microsoft 2017 conversational speech recognition system. In 2018

IEEE international conference on acoustics, speech and signal processing (ICASSP).

IEEE, pages 5934–5938.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V

Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. In

Advances in neural information processing systems. pages 5754–5764.

Ziyu Yao, Yu Su, Huan Sun, and Wen-tau Yih. 2019. Model-based interactive semantic

parsing: A unified framework and a text-to-sql case study. In Proceedings of the 2019

Conference on Empirical Methods in Natural Language Processing and the 9th Inter-

national Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pages

5450–5461.

Hwanjo Yu. 2005. Single-class classification with mapping convergence. Machine Learning

61(1-3):49–69.

Hwanjo Yu, Jiawei Han, and Kevin Chen-Chuan Chang. 2004. Pebl: Web page classification

without negative examples. IEEE Transactions on Knowledge & Data Engineering

(1):70–81.

Matthew D. Zeiler. 2012. ADADELTA: An Adaptive Learning Rate Method. CoRR.

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng

Gao, Jingjing Liu, and Bill Dolan. 2019. Dialogpt: Large-scale generative pre-training

for conversational response generation. arXiv preprint arXiv:1911.00536 .

Zhuosheng Zhang, Jiangtong Li, Pengfei Zhu, Hai Zhao, and Gongshen Liu. 2018. Modeling

multi-turn conversation with deep utterance aggregation. In Proceedings of the 27th

International Conference on Computational Linguistics. pages 3740–3752.

Xiangyang Zhou, Lu Li, Daxiang Dong, Yi Liu, Ying Chen, Wayne Xin Zhao, Dianhai

Yu, and Hua Wu. 2018. Multi-turn response selection for chatbots with deep attention

matching network. In Proceedings of the 56th Annual Meeting of the Association for

Computational Linguistics (Volume 1: Long Papers). pages 1118–1127.

George Kingsley Zipf. 1949. Human behavior and the principle of least effort.
