Leveraging Multimodal Redundancy for Dynamic Learning, with SHACER — a Speech and Handwriting Recognizer
Leveraging Multimodal Redundancy for Dynamic Learning, with SHACER — a Speech and HAndwriting reCognizER

Edward C. Kaiser
B.A., American History, Reed College, Portland, Oregon (1986)
A.S., Software Engineering Technology, Portland Community College, Portland, Oregon (1996)
M.S., Computer Science and Engineering, OGI School of Science & Engineering at Oregon Health & Science University (2005)

A dissertation submitted to the faculty of the OGI School of Science & Engineering at Oregon Health & Science University in partial fulfillment of the requirements for the degree Doctor of Philosophy in Computer Science and Engineering

April 2007

© Copyright 2007 by Edward C. Kaiser
All Rights Reserved

The dissertation "Leveraging Multimodal Redundancy for Dynamic Learning, with SHACER — a Speech and HAndwriting reCognizER" by Edward C. Kaiser has been examined and approved by the following Examination Committee:

Philip R. Cohen, Professor, Oregon Health & Science University (Thesis Research Adviser)
Randall Davis, Professor, Massachusetts Institute of Technology
John-Paul Hosom, Assistant Professor, Oregon Health & Science University
Peter Heeman, Assistant Professor, Oregon Health & Science University

Dedication

To my wife and daughter.

Acknowledgements

This dissertation in large part owes its existence to the patience, vision, and generosity of my advisor, Dr. Phil Cohen. Without his aid it could not have been finished. I would also like to thank Professor Randall Davis for his interest in this work and his time in reviewing and considering it as the external examiner. I am extremely grateful to Peter Heeman and Paul Hosom for their efforts in participating on my thesis committee. There are many researchers with whom I have worked closely at various stages during my research and to whom I am indebted for sharing their support and their thoughts.
I would like to thank Ron Cole for his early support of my desire to study computer science, and Wayne Ward for many valuable discussions about parsing and grammar compilation early in my time as a doctoral student. During my work on the CALO project I have been blessed with many fruitful collaborations, and I would like to acknowledge my gratitude to Alex Rudnicky and Alan Black for their willingness to help in the development of SHACER. I am also grateful to David Demirdjian for his unfailing friendship and support in many related research endeavors. I have enjoyed being part of both the Center for Human Computer Communication (CHCC) and now of Adapx during my years as a PhD student. My thanks to Paulo Barthelmess, Xiao Huang, Matt Wesson, Xioaguang Li, Sanjeev Kumar, David McGee, Michael Johnston, Andrea Coradini, and others for their efforts over the years, without which my work, and SHACER in particular, would not have been possible. To my fellow PhD students, Rebecca and Raj, I wish all the best for their forthcoming theses. Finally, I would like to thank my wife and daughter for their trust and love over these last years.

I am grateful to the Defense Advanced Research Projects Agency (DARPA) for financially supporting my thesis research through Contract No. NBCHD030010 for the CALO program. I note that any opinions, findings, and conclusions or recommendations expressed in this material are my own and do not necessarily reflect the views of DARPA or the Department of Interior-National Business Center (DOI-NBC). I would also like to thank SRI for letting me work on parts of my thesis as part of the CALO project.

Contents

Dedication . . . iv
Acknowledgements . . . v
Abstract . . . xxvii

1 Introduction . . . 1
1.1 MOTIVATION . . . 1
1.2 THE PROBLEM: NEW LANGUAGE AND THE NEED FOR DYNAMIC LEARNING . . . 5
1.3 MULTIMODAL REDUNDANCY AS A BASIS FOR DYNAMIC LEARNING . . . 8
1.4 LEVERAGING MULTIMODAL REDUNDANCY . . . 10
1.4.1 Problems to be addressed . . . 11
1.4.2 Hypothesis . . . 11
1.4.3 Argument . . . 12

2 Empirical Studies of Multimodal Redundancy . . . 13
2.1 A WORKING HYPOTHESIS OF MULTIMODAL REDUNDANCY . . . 14
2.1.1 Multimodal Complementarity Versus Redundancy . . . 16
2.1.2 Out-Of-Vocabulary Words In Natural Speech Processing Contexts . . . 17
2.1.3 Dynamic Learning Of Out-Of-Vocabulary Words . . . 19
2.1.4 Multimodality In Learning And Teaching . . . 19
2.2 STUDY OF MULTIMODAL REDUNDANCY . . . 20
2.2.1 Term Frequency - Inverse Document Frequency . . . 21
2.2.2 Corpora Description . . . 22
2.2.3 Study Results . . . 25
2.3 DISCUSSION . . . 39
2.3.1 Study Implications . . . 39
2.3.2 Multimodal Understanding of Human-Human Interaction . . . 40
2.4 SUMMARY . . . 40

3 SHACER: Overview . . . 42
3.1 TASK DOMAIN . . . 42
3.1.1 Gantt Charts . . . 42
3.1.2 Meeting Corpus . . . 43
3.1.3 Ambient Cumulative Interface . . . 44
3.2 SHACER PROCESSING . . . 47
3.2.1 Modal Recognition . . . 47
3.2.2 Phonetic Alignment . . . 50
3.2.3 Pronunciation Refinement . . . 51
3.2.4 Integration . . . 55
3.2.5 Learning . . . 55
3.2.6 Understanding Abbreviations . . . 56
3.3 SUMMARY . . . 58

4 Prelude to SHACER: Multimodal New Vocabulary Recognition . . . 61
4.1 HIGH LEVEL GOALS . . . 61
4.1.1 Dynamic Learning . . . 61
4.1.2 The Vocabulary Problem . . . 64
4.2 RELATED WORK . . . 66
4.2.1 Hybrid Fusion for Speech to Phone Recognition . . . 67
4.2.2 Multimodal Semantic Grounding . . . 68
4.3 The MNVR Approach . . . 72
4.3.1 Mutual Disambiguation . . . 73
4.3.2 System Description . . . 74
4.3.3 Two-Level Recursive Transition Network Grammar . . . 76
4.4 EVALUATION: BASELINE PERFORMANCE TEST . . . 81
4.5 CONCLUSION . . . 88
4.5.1 Robotic and Intelligent System Learning From Demonstration . . . 88
4.5.2 Discussion of MNVR Test Results . . . 88
4.6 SUMMARY . . . 90

5 SHACER: Alignment and Refinement . . . 92
5.1 PHONE ENSEMBLE RECOGNITION . . . 94
5.1.1 Phone Recognition in Spoken Document Retrieval . . . 94
5.1.2 SHACER's Hybrid Word/Phone Recognition Approach . . . 97
5.2 INK SEGMENTATION . . . 101
5.2.1 Segmentation Rules . . . 102
5.2.2 Iterative Best-Splitting of Segmented Stroke Groups . . . 103
5.2.3 Correct Segmentation Rates . . . 105
5.3 IDENTIFYING HANDWRITING AND SPEECH REDUNDANCY . . . 105
5.3.1 Word Spotting and Cross-Domain Matching in SHACER . . . 106
5.3.2 Multiparser Alignment Requests Within SHACER . . . 108
5.3.3 Phonetic Articulatory-Features as an Alignment Distance Metric . . . 110
5.3.4 The Mechanics Of Phonetic Alignment . . . 115
5.3.5 Alignment Procedure Modifications for SHACER . . . 119
5.4 REFINEMENT OF HANDWRITING/SPEECH SEGMENT PRONUNCIATION . . . 126
5.4.1 Lattice Term Sequence Extraction . . . 127
5.4.2 Positional Phone-Bigram Modeling . . . 130
5.4.3 Second Pass Phone Recognition . . . 134
5.5 SUMMARY . . . 136

6 SHACER: Integration . . . 137
6.1 INPUT SOURCES . . . 137
6.1.1 Saliency of Various Input Combinations . . . 138
6.2 CHOOSING THE BEST COMBINATIONS OF INPUT INFORMATION . . . 140
6.3 INTEGRATION EXAMPLES . . . 142
6.3.1 Lattice Alignment Fusion: 'Fred Green' . . . 145
6.3.2 Speech/Handwriting Fusion: 'Cindy Black' . . . 148
6.3.3 Speech/Handwriting Fusion: 'Buy Computer' . . . 150
6.3.4 Speech/Handwriting Fusion: Discussion . . . 150
6.4 DISCOVERING HANDWRITING ABBREVIATION SEMANTICS . . . 152
6.4.1 The Role of the Word/Phrase-Spotting Recognizer . . . 152
6.4.2 Comparing Word/Phrase-Spotter Recognition to Handwriting . . . 153
6.5 SUMMARY . . . 157

7 SHACER: Testing . . . 158
7.1 WORKING HYPOTHESIS . . . 158
7.2 MULTIMODAL TURN END-POINTING . . . 158
7.2.1 Fixed-Threshold Wait-Times for Turn Segmentation . . . 159
7.2.2 Parsing and Edge-Splitting . . . 160
7.2.3 Testing Edge-Splitting in Charter . . . 161
7.2.4 Baseline Word and Phone Recognition . . . 163
7.2.5 Word/Phrase Spotter Out-Of-Vocabulary (OOV) Enrollment . . . 165
7.3 TWO PHASE PROCESSING . . . 165
7.4 FILE-BASED ACCUMULATORS FOR LEARNING PERSISTENCE . . . 166
7.4.1 Adding to Speech and Handwriting Dictionaries . . . 166
7.4.2 Expanding the Master Speech Recognizer's Abbreviation Table . . . 167
7.4.3 Biasing Handwriting Recognition . . . 167
7.4.4 Affix Tracking for Mis-Recognition Correction . . . 167
7.5 PLAUSIBILITY TESTING . . . 168
7.5.1 Speech/Handwriting Combination Benefits . . . 168
7.5.2 Testing Articulatory-Feature Based Phonetic Distances . . . 172
7.6 TEST OUTCOMES . . . 175
7.6.1 Results on Development Test Set . . . 175
7.6.2 Results on Held-Out Test Set . . . 177
7.7 SUMMARY . . . 178

8 Conclusion and Future Work . . . 180
8.1 CONCLUSION . . . 180
8.1.1 Specificity of Contribution . . . 180
8.1.2 Timeliness of Contribution . . . 180
8.1.3 Significance of Contribution . . . 181
8.1.4 Contributions . . . 181
8.2 FUTURE WORK . . . 185
8.2.1 Moving Toward Stochastic Decision Processes . . . 185
8.2.2 Better Modal Recognizers . . . 185
8.2.3 Better Structural Understanding of Syntax and Semantics . . . 186
8.2.4 Beyond Handwriting and Speech: MOOVR . . . 187
8.2.5 Better Understanding of Multimodal Redundancy . . . 187

Bibliography . . . 189

A Articulatory Features and Phonetic Distances . . . 207
A.1 ARTICULATORY FEATURE TABLE AND PSEUDO-CODE . . . 207
A.1.1 Articulatory Feature Table . . . 207
A.2 PER PHONE ARTICULATORY FEATURE VALUE TABLE . . . 209
A.3 EXAMPLE PHONE DISTANCE TABLE . . . 211

B SHACER Grammars . . . 213
B.1 GRAMMAR DESCRIPTION AND FORMAT . . . 213
B.2 CONSTRAINED SYLLABARY GRAMMAR . . . 220
B.3 CONSTRAINED PHONE GRAMMAR . . . 220
B.4 WORD PHRASE GRAMMAR . . .