Preliminary Test of a Real-Time, Interactive Silent Speech Interface Based on Electromagnetic Articulograph

Jun Wang
Dept. of Bioengineering
Callier Center for Communication Disorders
University of Texas at Dallas
[email protected]

Ashok Samal
Dept. of Computer Science & Engineering
University of Nebraska-Lincoln
[email protected]

Jordan R. Green
Dept. of Communication Sciences & Disorders
MGH Institute of Health Professions
[email protected]

Abstract

A silent speech interface (SSI) maps articulatory movement data to speech output. Although still in experimental stages, silent speech interfaces hold significant potential for facilitating oral communication in persons after laryngectomy or with other severe voice impairments. Despite the recent efforts on silent speech recognition algorithm development using offline data analysis, online tests of SSIs have rarely been conducted. In this paper, we present a preliminary, online test of a real-time, interactive SSI based on electromagnetic motion tracking. The SSI played back synthesized speech sounds in response to the user's tongue and lip movements. Three English talkers participated in this test, where they mouthed (silently articulated) phrases using the device to complete a phrase-reading task. Among the three participants, 96.67% to 100% of the mouthed phrases were correctly recognized and the corresponding synthesized sounds were played after a short delay. Furthermore, one participant demonstrated the feasibility of using the SSI for a short conversation. The experimental results demonstrated the feasibility and potential of silent speech interfaces based on electromagnetic articulograph for future clinical applications.

1 Introduction

Daily communication is often a struggle for persons who have undergone a laryngectomy, a surgical removal of the larynx due to the treatment of cancer (Bailey et al., 2006). In 2013, about 12,260 new cases of laryngeal cancer were estimated in the United States (American Cancer Society, 2013). Currently, there are only limited treatment options for these individuals, including (1) esophageal speech, which involves oscillation of the esophagus and is difficult to learn; (2) tracheo-esophageal speech, in which a voice prosthesis is placed in a tracheo-esophageal puncture; and (3) the electrolarynx, an external device held on the neck during articulation, which produces a robotic voice quality (Liu and Ng, 2007). Perhaps the greatest disadvantage of these approaches is that they produce abnormal-sounding speech with a fundamental frequency that is low and limited in range. The abnormal voice quality severely affects the social life of people after laryngectomy (Liu and Ng, 2007). In addition, the tracheo-esophageal option requires an additional surgery, which is not suitable for every patient (Bailey et al., 2006). Although research is being conducted on improving the voice quality of esophageal or electrolarynx speech (Doi et al., 2010; Toda et al., 2012), new assistive technologies based on non-audio information (e.g., visual or articulatory information) may be a good alternative approach for providing natural-sounding speech output for persons after laryngectomy.

Visual speech recognition (or automatic lip reading) typically uses an optical camera to obtain lip and/or facial features during speech (including lip contour, color, opening, movement, etc.) and then classifies these features into speech units (Meier et al., 2000; Oviatt, 2003). However, due to the lack of information from the tongue, the primary articulator, visual speech recognition (i.e., using visual information only, without tongue and audio information) may obtain a low accuracy (e.g., 30% - 40% for phoneme classification; Livescu et al., 2007). Furthermore, Wang and colleagues (2013b) showed that any single tongue sensor (from the tongue tip to the tongue body back on the midsagittal line) encodes significantly more information for distinguishing phonemes than do the lips.


However, visual speech recognition is well suited for small-vocabulary applications (e.g., a lip-reading based command-and-control system for home appliances) or for using visual information as an additional source for acoustic speech recognition, referred to as audio-visual speech recognition (Potamianos et al., 2003), because such a system based on a portable camera is convenient in practical use. In contrast, SSIs, with tongue information, have the potential to obtain a high level of silent speech recognition accuracy (without audio information). Currently, the two major obstacles for SSI development are the lack of (a) fast and accurate recognition algorithms and (b) portable tongue motion tracking devices for daily use.

SSIs convert articulatory information into text that drives a text-to-speech synthesizer. Although still in developmental stages (e.g., speaker-dependent recognition, small vocabulary), SSIs even have the potential to provide speech output based on prerecorded samples of the patient's own voice (Denby et al., 2010; Green et al., 2011; Wang et al., 2009). Potential articulatory data acquisition methods for SSIs include ultrasound (Denby et al., 2011; Hueber et al., 2010), surface electromyography electrodes (Heaton et al., 2011; Jorgensen and Dusan, 2010), and electromagnetic articulograph (EMA) (Fagan et al., 2008; Wang et al., 2009, 2012a).

Despite the recent effort on silent speech interface research, online tests of SSIs have rarely been studied. So far, most of the published work on SSIs has focused on the development of silent speech recognition algorithms through offline analysis (i.e., using prerecorded data) (Fagan et al., 2008; Heaton et al., 2011; Hofe et al., 2013; Hueber et al., 2010; Jorgensen and Dusan, 2010; Wang et al., 2009, 2012a, 2012b, 2013c). Ultrasound-based SSIs have been tested online with multiple subjects, and encouraging results were obtained in a phrase-reading task where the subjects were asked to silently articulate sixty phrases (Denby et al., 2011). SSIs based on electromagnetic sensing have only been tested using offline analysis (using prerecorded data) collected from single subjects (Fagan et al., 2008; Hofe et al., 2013), although some work simulated online testing using prerecorded data (Wang et al., 2012a, 2012b, 2013c). Online tests of SSIs using electromagnetic articulograph with multiple subjects are needed to show the feasibility and potential of SSIs for future clinical applications.

In this paper, we report a preliminary, online test of a newly developed, real-time, interactive SSI based on a commercial EMA. EMA tracks articulatory motion by placing small sensors on the surface of the tongue and other articulators (e.g., lips and jaw). EMA is well suited for the early stage of SSI development because it (1) is non-invasive, (2) has a high spatial resolution in motion tracking, (3) has a high sampling rate, and (4) is affordable. In this experiment, participants used the real-time SSI to complete an online phrase-reading task, and one of them had a short conversation with another person. The results demonstrated the feasibility and potential of SSIs based on electromagnetic sensing for future clinical applications.

Figure 1. Design of the real-time silent speech interface.

2 Design

2.1 Major design

Figure 1 illustrates the three-component design of the SSI: (a) real-time articulatory motion tracking using a commercial EMA, (b) online silent speech recognition (converting articulation information to text), and (c) text-to-speech synthesis for speech output.
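To make the three-component decomposition concrete, the sketch below shows one way such a pipeline could be wired together in Python. The class and function names (SilentSpeechInterface, SensorFrame, read_frames, recognize, synthesize) are illustrative placeholders under assumed interfaces, not the authors' implementation or the Wave API.

    # Illustrative pipeline skeleton only; component internals are hypothetical.
    from dataclasses import dataclass
    from typing import Callable, List, Sequence

    @dataclass
    class SensorFrame:
        """One 100 Hz sample: (x, y, z) coordinates per articulatory sensor."""
        positions: Sequence[Sequence[float]]

    class SilentSpeechInterface:
        """Wires (a) motion tracking, (b) silent speech recognition, (c) TTS."""

        def __init__(self,
                     read_frames: Callable[[], List[SensorFrame]],   # (a) EMA stream
                     recognize: Callable[[List[SensorFrame]], str],  # (b) phrase classifier
                     synthesize: Callable[[str], None]):             # (c) text-to-speech
            self.read_frames = read_frames
            self.recognize = recognize
            self.synthesize = synthesize

        def run_once(self) -> str:
            """Record one silently articulated phrase, recognize it, play it back."""
            frames = self.read_frames()      # buffered articulatory movement data
            phrase = self.recognize(frames)  # convert articulation to text
            self.synthesize(phrase)          # audible speech output
            return phrase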


Figure 2. Demo of a participant using the silent speech interface. The left picture illustrates the coordinate system and sensor locations (sensor labels are described in the text); in the right picture, a participant is using the silent speech interface to finish the online test.

The EMA system (Wave Speech Research System, Northern Digital Inc., Waterloo, Canada) was used to track the tongue and lip movements in real time. The sampling rate of the Wave system was 100 Hz, which is adequate for this application (Wang et al., 2012a, 2012b, 2013c). The spatial accuracy of motion tracking using Wave is 0.5 mm (Berry, 2011).

The online recognition component recognized functional phrases from articulatory movements in real time. The recognition component is modular, so that alternative classifiers can easily be swapped in and integrated into the SSI. In this preliminary test, recognition was speaker-dependent, where training and testing data were from the same speakers.

The third component played back either prerecorded or synthesized sounds using a text-to-speech synthesizer (Huang et al., 1997).

2.2 Other designs

A graphical user interface (GUI) is integrated into the silent speech interface for ease of operation. Using the GUI, users can instantly re-train the recognition engine (classifier) when new training samples are available. Users can also switch the output voice (e.g., male or female).

Data transfer through TCP/IP. Data transfer from the Wave system to the recognition unit (software) is accomplished through TCP/IP, the standard data transfer protocol of the Internet. Because the data bandwidth requirement is low (multiple sensors, multiple spatial coordinates per sensor, at a 100 Hz sampling rate), any 3G or faster network connection will be sufficient for future use with wireless data transfer.

Extensible (closed) vocabulary. In the early stage of this development, closed-vocabulary silent speech recognition was used; however, the vocabulary is extensible. Users can add new phrases into the system through the GUI. Adding a new phrase to the vocabulary is done in two steps. The user (the patient) first enters the phrase using a keyboard (keyboard input can also be done by an assistant or speech-language pathologist), and then produces a few training samples for the phrase (a training sample is articulatory data labeled with a phrase). The system automatically re-trains the recognition model, integrating the newly added training samples. Users can delete invalid training samples using the GUI as well.

2.3 Real-time data processing

The tongue and lip movement positional data obtained from the Wave system were processed in real time prior to being used for recognition. This included the calculation of head-independent positions of the tongue and lip sensors and low-pass filtering for removing noise.

The movements of the 6 DOF head sensor were used to calculate the head-independent movements of the other sensors. The Wave system represents object orientation or rotation (denoted by yaw, pitch, and roll in Euler angles) as quaternions, four-dimensional vectors. The quaternion has advantages over Euler angles: for example, it avoids the issue of gimbal lock (where one degree of freedom may be lost in a series of rotations), and it is simpler to achieve smooth interpolation using quaternions than using Euler angles (Dam et al., 1998). Thus, quaternions have been widely used in computer graphics, computer vision, robotics, virtual reality, and flight dynamics (Kuipers, 1999). Given the unit quaternion

    q = (a, b, c, d)                                                      (1)

where a² + b² + c² + d² = 1, a 3 × 3 rotation matrix R can be derived using Equation (2):

        | a² + b² - c² - d²    2bc - 2ad            2bd + 2ac         |
    R = | 2bc + 2ad            a² - b² + c² - d²    2cd - 2ab         |    (2)
        | 2bd - 2ac            2cd + 2ab            a² - b² - c² + d² |

For details of how the quaternion is used in the Wave system, please refer to the Wave Real-Time API manual and sample application (Northern Digital Inc., Waterloo, Canada).
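As a concrete illustration of Equations (1)-(2) and of the preprocessing described above, the following sketch converts a unit quaternion to a rotation matrix, expresses a tongue or lip sensor position in head-centered coordinates, and low-pass filters the resulting traces. It does not reproduce the Wave Real-Time API conventions (axis order, scalar-first vs. scalar-last quaternions); the 20 Hz cutoff is taken from the procedure described later in the paper, and the filter order is an assumption.

    import numpy as np
    from scipy.signal import butter, filtfilt

    def quaternion_to_rotation_matrix(q):
        """Equation (2): rotation matrix for a unit quaternion q = (a, b, c, d)."""
        q = np.asarray(q, dtype=float)
        a, b, c, d = q / np.linalg.norm(q)  # enforce a^2 + b^2 + c^2 + d^2 = 1
        return np.array([
            [a*a + b*b - c*c - d*d, 2*b*c - 2*a*d,         2*b*d + 2*a*c],
            [2*b*c + 2*a*d,         a*a - b*b + c*c - d*d, 2*c*d - 2*a*b],
            [2*b*d - 2*a*c,         2*c*d + 2*a*b,         a*a - b*b - c*c + d*d],
        ])

    def head_independent_position(sensor_xyz, head_xyz, head_quaternion):
        """Remove head translation and rotation from one tongue/lip sensor sample.

        Expressing the sensor in the head sensor's coordinate frame leaves only
        articulatory movement.
        """
        R = quaternion_to_rotation_matrix(head_quaternion)
        offset = np.asarray(sensor_xyz, dtype=float) - np.asarray(head_xyz, dtype=float)
        return R.T @ offset  # inverse of an orthonormal rotation is its transpose

    def lowpass(traces, fs=100.0, cutoff=20.0, order=4):
        """Zero-phase low-pass filtering of positional time series (noise removal)."""
        b, a = butter(order, cutoff / (fs / 2.0))
        return filtfilt(b, a, traces, axis=0)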

3 A Preliminary Online Test

3.1 Participants & Stimuli

Three American English talkers participated in this experiment (two males and one female, with an average age of 25 years and SD of 3.5 years). No history of speech, language, hearing, or cognitive problems was reported.

Sixty phrases that are frequently used in daily life by healthy people and AAC (augmentative and alternative communication) users were used in this experiment. Those phrases were selected from the lists in Wang et al. (2012a) and Beukelman and Gutmann (1999).

3.2 Procedure

Setup

The Wave system tracks the motion of sensors attached to the articulators by establishing an electromagnetic field with a textbook-sized generator. Participants were seated with their head within the calibrated magnetic field (Figure 2, right picture), facing a computer monitor that displays the GUI of the SSI. The sensors were attached to the surface of each articulator using dental glue (PeriAcryl Oral Tissue Adhesive). Prior to the experiment, each subject participated in a three-minute training session (on how to use the SSI), which also helped them adapt to the oral sensors. Previous studies have shown that these sensors do not significantly affect speech output after a short practice (Katz et al., 2006; Weismer and Bunton, 1999).

Figure 2 (left) shows the positions of the five sensors attached to a participant's forehead, tongue, and lips (Green et al., 2003, 2013; Wang et al., 2013a). One 6 DOF (spatial and rotational) head sensor was attached to the nose bridge of a pair of glasses (rather than directly to the forehead skin) to avoid skin artifacts (Green et al., 2007). Two 5 DOF sensors - TT (Tongue Tip) and TB (Tongue Body Back) - were attached on the midsagittal line of the tongue. TT was located approximately 10 mm from the tongue apex (Wang et al., 2011, 2013a). TB was placed as far back as possible, depending on the participant's tongue length (Wang et al., 2013b). Lip movements were captured by attaching two 5 DOF sensors to the vermilion borders of the upper (UL) and lower (LL) lips at midline. The four sensor placements (i.e., TT, TB, UL, and LL) were selected based on literature showing that they achieve recognition accuracy as high as that obtained using more tongue sensors for this application (Wang et al., 2013b).

As mentioned previously, real-time preprocessing of the positional time series was conducted, including subtraction of head movements from the tongue and lip data and noise reduction using a 20 Hz low-pass filter (Green et al., 2003; Wang et al., 2013a). Although the tongue and lip sensors are 5D, only the 3D spatial data (i.e., x, y, and z) were used in this experiment.

Training

The training step was conducted to obtain a few samples for each phrase. The participants were asked to silently articulate all sixty phrases twice at their comfortable speaking rate while the tongue and lip motion was recorded. Thus, each phrase had at least two samples for training. Dynamic Time Warping (DTW) was used as the classifier in this preliminary test because of its rapid execution (Denby et al., 2011), although Gaussian mixture models may also perform well when the number of training samples is small (Broekx et al., 2013).

Training_Algorithm
Let T1, ..., Tn be the sets of training samples for n phrases, where Ti = {Ti,1, ..., Ti,j, ..., Ti,mi} are the mi samples for phrase i.
 1  for i = 1 to n                 // n is the number of phrases
 2    Li = sum(length(Ti,j)) / mi, j = 1 to mi;
 3    T = Ti,1;                    // first sample of phrase i
 4    for j = 2 to mi
 5      (T', T'i,j) = MDTW(T, Ti,j);
 6      T = (T' + T'i,j) / 2;      // amplitude mean
 7      T = time_normalize(T, Li);
 8    end
 9    Ri = T;                      // representative sample for phrase i
10  end
11  Output(R);

Figure 3. Training algorithm using DTW. The function call MDTW() returns the average DTW distances between the corresponding sensors and dimensions of two data samples.
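The DTW comparison underlying Figure 3 and the recognition step described below can be sketched in ordinary Python as follows. This is a simplified stand-in under assumed data shapes: it averages a plain DTW distance over sensor/dimension channels and assigns a test sample to the phrase with the closest stored training sample, rather than building the amplitude-averaged, time-normalized representative sample of Figure 3.

    import numpy as np

    def dtw_distance(x, y):
        """Classic DTW distance between two 1-D time series (no path constraints)."""
        n, m = len(x), len(y)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(x[i - 1] - y[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    def multichannel_dtw(sample_a, sample_b):
        """Average DTW distance over corresponding sensor/dimension channels.

        Each sample is an array of shape (time, channels), e.g. eight channels
        for four sensors (TT, TB, UL, LL) in the y and z dimensions.
        """
        channels = sample_a.shape[1]
        return np.mean([dtw_distance(sample_a[:, k], sample_b[:, k])
                        for k in range(channels)])

    def recognize(test_sample, training_set):
        """Return the phrase whose training samples are closest under DTW.

        training_set maps phrase text -> list of (time, channels) arrays.
        (Figure 3 instead collapses each phrase to one representative sample.)
        """
        best_phrase, best_dist = None, np.inf
        for phrase, samples in training_set.items():
            dist = min(multichannel_dtw(test_sample, s) for s in samples)
            if dist < best_dist:
                best_phrase, best_dist = phrase, dist
        return best_phrase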

DTW is typically used to compare two single-dimensional time series; thus, we calculated the average DTW distance across the corresponding sensors and dimensions of two data samples. DTW was adapted as follows for training.

The training algorithm generated a representative sample based on all available training samples for each phrase. Pseudo-code of the training algorithm is provided in Figure 3 for convenience of description. For each phrase i, MDTW was first applied to the first two training samples, Ti,1 and Ti,2. Sample T is the amplitude mean of the warped samples T'i,1 and T'i,2 (time series) (Lines 5-6). Next, T was time-normalized (stretched) to the average length of all training samples for this phrase (Li), to reduce the effects of duration change caused by DTW warping (Line 7). The procedure continued until the last training sample Ti,mi (mi is the number of training samples for phrase i). The final T was the representative sample for phrase i.

The training procedure can be initiated by pressing a button on the GUI at any time during the use of the SSI.

Testing

During testing, each participant silently articulated the same list of phrases while the SSI recognized each phrase and played the corresponding synthesized sounds. DTW was used to compare the test sample with the representative training sample for each phrase (Ri, Figure 3). The phrase that had the shortest DTW distance to the test sample was recognized. Testing was triggered by pressing a button on the GUI. If a phrase was incorrectly predicted, the participant was allowed to add at most two additional training samples for that phrase.

Figure 2 (right) shows a participant using the SSI during the test. After the participant silently articulated "Good afternoon", the SSI displayed the phrase on the screen and played the corresponding synthesized sound simultaneously.

Finally, one participant used the SSI for a bidirectional conversation with an investigator. Since this prototype SSI has a closed-vocabulary recognition component, the participant had to choose phrases that had been trained. This task was intended to provide a demo of how the SSI can be used for daily communication. The script of the conversation is as follows:

Investigator: Hi DJ, how are you?
Subject: I'm fine. How are you doing?
Investigator: I'm good. Thanks.
Subject: I use a silent speech interface to talk.
Investigator: That's cool.
Subject: Do you understand me?
Investigator: Oh, yes.
Subject: That's good.

4 Results and Discussion

Table 1 lists the performance using the SSI for all three participants in the online phrase-reading task. The three subjects obtained phrase recognition accuracies from 96.67% to 100.00%, with latencies from 1.403 seconds to 3.086 seconds. The high accuracy and relatively short delay demonstrated the feasibility and potential of SSIs based on electromagnetic articulograph.

Subject    Accuracy (%)    Latency (s)    # of Training Samples
S01        100             3.086          2.0
S02        96.67           1.403          2.4
S03        96.67           1.524          3.1

Table 1. Phrase classification accuracy and latency for all three participants.

The order of the participants in the experiment was S01, S02, and then S03. After the experiment with S01, where all three dimensions of data (x, y, and z) were used, we decided to use only y and z for S02 and S03 to reduce the latency. As listed in Table 1, the latencies of S02 and S03 did reduce significantly, because less data was used for online recognition.

Surprisingly, phrase recognition without the x-dimension (left-right) data led to a decrease in accuracy, and more training samples were required, even though prior research suggests that tongue movement in this dimension is not significant during speech in healthy talkers (Westbury, 1994). This observation is supported by participant S01, who had the highest accuracy and needed fewer training samples for each phrase (last column in Table 1); S02 and S03 used data from only the y and z dimensions. Of course, data from more subjects are needed to confirm the potential significance of the x-dimension movement of the tongue for silent speech recognition accuracy.

Data transfer between the Wave machine and the SSI recognition component was done through TCP/IP protocols and in real time. In the future, this design feature will allow the recognition component to run on a smart phone or any wearable device with an Internet connection (cellular or Wi-Fi).
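Because the recognition component communicates with the tracking machine over TCP/IP, a receiver running on a phone or wearable device only needs an ordinary socket client. The sketch below assumes a hypothetical frame layout (a fixed number of little-endian float32 coordinates per 100 Hz sample); the real Wave Real-Time API protocol is documented by the vendor and is not reproduced here.

    import socket
    import struct

    def stream_frames(host, port, n_channels=24):
        """Yield one flat tuple of float32 coordinates per 100 Hz sample.

        The frame layout (n_channels float32 values per sample) is assumed for
        illustration only; consult the Wave Real-Time API manual for the real
        format.
        """
        frame_size = 4 * n_channels
        with socket.create_connection((host, port)) as sock:
            buffer = b""
            while True:
                chunk = sock.recv(4096)
                if not chunk:          # tracking server closed the connection
                    return
                buffer += chunk
                while len(buffer) >= frame_size:
                    frame, buffer = buffer[:frame_size], buffer[frame_size:]
                    yield struct.unpack("<%df" % n_channels, frame)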

In this preliminary test, the individual delays caused by TCP/IP data transfer, online data preprocessing, and classification were not measured and are thus unknown. The delay information may be useful for our future development in which the recognition component is deployed on a smart phone. A further study is needed to obtain and analyze the delay information.

The bidirectional dialogue by one of the participants demonstrated how the SSI can be used in daily conversation. To the best of our knowledge, this is the first conversational demo using an SSI. An informal survey of a few colleagues provided positive feedback. The conversation was smooth, although it was noticeably slower than a conversation between two healthy talkers. Importantly, the voice output quality (determined by the text-to-speech synthesizer) was natural, which strongly supports the major motivation of SSI research: to produce speech with the natural voice quality that current treatments cannot provide. A video demo is available online (Wang, 2014).

The participants in this experiment were young and healthy. It is, however, unknown whether the recognition accuracy would decrease for users after laryngectomy, although a single-patient study showed the accuracy may decrease by 15-20% compared to healthy talkers using an ultrasound-based SSI (Denby et al., 2011). Theoretically, the tongue motion patterns in (silent) speech after the surgery should be no different from those of healthy talkers. In practice, however, some patients after the surgery may be under treatment for swallowing using radioactive devices, which may affect their tongue motion patterns in articulation. Thus, the performance of SSIs may vary and depend on the condition of the patients after laryngectomy. A test of the SSI with multiple participants after laryngectomy is needed to understand the performance of SSIs for those patients under different conditions.

Although a demonstration of daily conversation using the SSI was provided, an SSI based on the non-portable Wave system is currently not ready for practical use. Fortunately, more affordable and portable electromagnetic devices are being developed, as are small handheld or wearable devices (Fagan et al., 2008). Researchers are also testing the efficacy of permanently implantable and wireless sensors (Chen et al., 2012; Park et al., 2012). In the future, those more portable and wireless articulatory motion tracking devices, when they are ready, will be used to develop a portable SSI for practical use.

In this experiment, a simple DTW algorithm was used to compare the training and testing phrases, which is known to be slower than most machine learning classifiers. Thus, in the future, the latency can be significantly reduced by using faster classifiers such as support vector machines (Wang et al., 2013c) or hidden Markov models (Heracleous and Hagita, 2011; King et al., 2007; Rudzicz et al., 2012; Uraga and Hain, 2006).

Furthermore, in this proof-of-concept design, the vocabulary was limited to a small set of phrases, because our design required the whole experiment (including training and testing) to be done in about one hour. Additional work is needed to test the feasibility of open-vocabulary recognition, which will be much more usable for people after laryngectomy or with other severe voice impairments.

5 Conclusion and Future Work

A preliminary, online test of an SSI based on electromagnetic articulograph was conducted. The results were encouraging, revealing high phrase recognition accuracy and short playback latencies among three participants in a phrase-reading task. In addition, a proof-of-concept demo of bidirectional conversation using the SSI was provided, which shows how the SSI can be used for daily communication.

Future work includes: (1) testing the SSI with patients after laryngectomy or with severe voice impairments, (2) integrating phoneme- or word-level (open-vocabulary) recognition using faster machine learning classifiers (e.g., support vector machines or hidden Markov models), and (3) exploring speaker-independent silent speech recognition algorithms by normalizing the articulatory movement across speakers (e.g., due to the anatomical differences of their tongues).

Acknowledgements

This work was in part supported by the Callier Excellence in Education Fund, University of Texas at Dallas, and grants awarded by the National Institutes of Health (R01 DC009890 and R01 DC013547). We would like to thank Dr. Thomas F. Campbell, Dr. William F. Katz, Dr. Gregory S. Lee, Dr. Jennell C. Vick, Lindsey Macy, Marcus Jones, Kristin J. Teplansky, Vedad “Kelly” Fazel, Loren Montgomery, and Kameron Johnson for their support or assistance. We also thank the anonymous reviewers for their comments and suggestions for improving the quality of this paper.

References

American Cancer Society. 2013. Cancer Facts and Figures 2013. American Cancer Society, Atlanta, GA. Retrieved on February 18, 2014.

Bailey, B. J., Johnson, J. T., and Newlands, S. D. 2006. Head and Neck Surgery - Otolaryngology, 4th Ed., Lippincott Williams & Wilkins, Philadelphia, PA, USA, 1779-1780.

Berry, J. 2011. Accuracy of the NDI Wave speech research system, Journal of Speech, Language, and Hearing Research, 54:1295-1301.

Beukelman, D. R., and Gutmann, M. 1999. Generic Message List for AAC users with ALS. http://aac.unl.edu/ALS_Message_List1.htm

Broekx, L., Dreesen, K., Gemmeke, J. F., and Van Hamme, H. 2013. Comparing and combining classifiers for self-taught vocal interfaces, ACL/ISCA Workshop on Speech and Language Processing for Assistive Technologies, 21-28.

Chen, W.-H., Loke, W.-F., Thompson, G., and Jung, B. 2012. A 0.5V, 440uW frequency synthesizer for implantable medical devices, IEEE Journal of Solid-State Circuits, 47:1896-1907.

Dam, E. B., Koch, M., and Lillholm, M. 1998. Quaternions, interpolation and animation. Technical Report DIKU-TR-98/5, University of Copenhagen.

Denby, B., Cai, J., Roussel, P., Dreyfus, G., Crevier-Buchman, L., Pillot-Loiseau, C., Hueber, T., and Chollet, G. 2011. Tests of an interactive, phrasebook-style post-laryngectomy voice-replacement system, the 17th International Congress of Phonetic Sciences, Hong Kong, China, 572-575.

Denby, B., Schultz, T., Honda, K., Hueber, T., Gilbert, J. M., and Brumberg, J. S. 2010. Silent speech interfaces, Speech Communication, 52:270-287.

Doi, H., Nakamura, K., Toda, T., Saruwatari, H., and Shikano, K. 2010. Esophageal speech enhancement based on statistical voice conversion with Gaussian mixture models, IEICE Transactions on Information and Systems, E93-D, 9:2472-2482.

Fagan, M. J., Ell, S. R., Gilbert, J. M., Sarrazin, E., and Chapman, P. M. 2008. Development of a (silent) speech recognition system for patients following laryngectomy, Medical Engineering & Physics, 30(4):419-425.

Green, J. R. and Wang, Y. 2003. Tongue-surface movement patterns during speech and swallowing, Journal of the Acoustical Society of America, 113:2820-2833.

Green, J. R., Wang, J., and Wilson, D. L. 2013. SMASH: A tool for articulatory data processing and analysis, Proc. Interspeech, 1331-1335.

Green, P. D., Khan, Z., Creer, S. M., and Cunningham, S. P. 2011. Reconstructing the voice of an individual following laryngectomy, Augmentative and Alternative Communication, 27(1):61-66.

Heaton, J. T., Robertson, M., and Griffin, C. 2011. Development of a wireless electromyographically controlled electrolarynx voice prosthesis, Proc. of the 33rd Annual Intl. Conf. of the IEEE Engineering in Medicine & Biology Society, Boston, MA, 5352-5355.

Heracleous, P., and Hagita, N. 2011. Automatic recognition of speech without any audio information, Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 2392-2395.

Hofe, R., Ell, S. R., Fagan, M. J., Gilbert, J. M., Green, P. D., Moore, R. K., and Rybchenko, S. I. 2011. Parameter generation for the assistive silent speech interface MVOCA, Proc. Interspeech, 3009-3012.

Hofe, R., Ell, S. R., Fagan, M. J., Gilbert, J. M., Green, P. D., Moore, R. K., and Rybchenko, S. I. 2013. Small-vocabulary speech recognition using a silent speech interface based on magnetic sensing, Speech Communication, 55(1):22-32.

Huang, X. D., Acero, A., Hon, H.-W., Ju, Y.-C., Liu, J., Meredith, S., and Plumpe, M. 1997. Recent improvements on Microsoft's trainable text-to-speech system: Whistler, Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 959-962.

Hueber, T., Benaroya, E.-L., Chollet, G., Denby, B., Dreyfus, G., and Stone, M. 2010. Development of a silent speech interface driven by ultrasound and optical images of the tongue and lips, Speech Communication, 52:288-300.

Jorgensen, C. and Dusan, S. 2010. Speech interfaces based upon surface electromyography, Speech Communication, 52:354-366.

Katz, W., Bharadwaj, S., Rush, M., and Stettler, M. 2006. Influences of EMA receiver coils on speech production by normal and aphasic/apraxic talkers, Journal of Speech, Language, and Hearing Research, 49:645-659.

Kent, R. D., Adams, S. G., and Turner, G. S. 1996. Models of speech production, in Principles of Experimental Phonetics, Ed., Lass, N. J., Mosby: St Louis, MO.

King, S., Frankel, J., Livescu, K., McDermott, E., Richmond, K., and Wester, M. 2007. Speech production knowledge in automatic speech recognition, Journal of the Acoustical Society of America, 121(2):723-742.

Kuipers, J. B. 1999. Quaternions and Rotation Sequences: A Primer with Applications to Orbits, Aerospace, and Virtual Reality, Princeton University Press, Princeton, NJ.

Liu, H., and Ng, M. L. 2007. Electrolarynx in voice rehabilitation, Auris Nasus Larynx, 34(3):327-332.

Livescu, K., Çetin, O., Hasegawa-Johnson, M., King, S., Bartels, C., Borges, N., Kantor, A., et al. 2007. Articulatory feature-based methods for acoustic and audio-visual speech recognition: Summary from the 2006 JHU Summer Workshop, Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 621-624.

Meier, U., Stiefelhagen, R., Yang, J., and Waibel, A. 2000. Towards unrestricted lip reading, International Journal of Pattern Recognition and Artificial Intelligence, 14(5):571-585.

Oviatt, S. L. 2003. Multimodal interfaces, in Human-Computer Interaction Handbook: Fundamentals, Evolving Technologies and Emerging Applications, Eds. Julie A. Jacko and Andrew Sears (Mahwah, NJ: Erlbaum), 286-304.

Park, H., Kiani, M., Lee, H. M., Kim, J., Block, J., Gosselin, B., and Ghovanloo, M. 2012. A wireless magnetoresistive sensing system for an intraoral tongue-computer interface, IEEE Transactions on Biomedical Circuits and Systems, 6(6):571-585.

Potamianos, G., Neti, C., Cravier, G., Garg, A., and Senior, A. W. 2003. Recent advances in the automatic recognition of audio-visual speech, Proceedings of the IEEE, 91(9):1306-1326.

Rudzicz, F., Hirst, G., and Van Lieshout, P. 2012. Vocal tract representation in the recognition of cerebral palsied speech, Journal of Speech, Language, and Hearing Research, 55(4):1190-1207.

Toda, T., Nakagiri, M., and Shikano, K. 2012. Statistical voice conversion techniques for body-conducted unvoiced speech enhancement, IEEE Transactions on Audio, Speech and Language Processing, 20(9):2505-2517.

Uraga, E. and Hain, T. 2006. Automatic speech recognition experiments with articulatory data, Proc. Interspeech, 353-356.

Wang, J., Samal, A., Green, J. R., and Carrell, T. D. 2009. Vowel recognition from articulatory position time-series data, Proc. IEEE Intl. Conf. on Signal Processing and Communication Systems, Omaha, NE, 1-6.

Wang, J., Green, J. R., Samal, A., and Marx, D. B. 2011. Quantifying articulatory distinctiveness of vowels, Proc. Interspeech, Florence, Italy, 277-280.

Wang, J., Samal, A., Green, J. R., and Rudzicz, F. 2012a. Sentence recognition from articulatory movements for silent speech interfaces, Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 4985-4988.

Wang, J., Samal, A., Green, J. R., and Rudzicz, F. 2012b. Whole-word recognition from articulatory movements for silent speech interfaces, Proc. Interspeech, 1327-1330.

Wang, J., Green, J. R., Samal, A., and Yunusova, Y. 2013a. Articulatory distinctiveness of vowels and consonants: A data-driven approach, Journal of Speech, Language, and Hearing Research, 56:1539-1551.

Wang, J., Green, J. R., and Samal, A. 2013b. Individual articulator's contribution to phoneme production, Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, Vancouver, Canada, 7795-89.

Wang, J., Balasubramanian, A., Mojica de La Vega, L., Green, J. R., Samal, A., and Prabhakaran, B. 2013c. Word recognition from continuous articulatory movement time-series data using symbolic representations, ACL/ISCA Workshop on Speech and Language Processing for Assistive Technologies, Grenoble, France, 119-127.

Wang, J. 2014. DJ and his friend: A demo of conversation using a real-time silent speech interface based on electromagnetic articulograph. [Video]. Available: http://www.utdallas.edu/~wangjun/ssi-demo.html

Weismer, G. and Bunton, K. 1999. Influences of pellet markers on speech production behavior: Acoustical and perceptual measures, Journal of the Acoustical Society of America, 105:2882-2891.

Westbury, J. 1994. X-ray Microbeam Speech Production Database User's Handbook. University of Wisconsin-Madison, Madison, Wisconsin.
