
End-to-end neural networks for subvocal recognition

Pol Rosello, Pamela Toman, and Nipun Agarwala
Stanford University
{prosello, ptoman, nipuna1}@stanford.edu

Abstract

Subvocalization is a phenomenon observed while subjects read or think, characterized by involuntary facial and laryngeal muscle movements. By measuring this muscle activity using surface electromyography (EMG), it may be possible to perform automatic speech recognition (ASR) and enable silent, hands-free human-computer interfaces. In our work, we describe the first approach toward end-to-end, session-independent subvocal speech recognition by leveraging character-level recurrent neural networks (RNNs) and the connectionist temporal classification (CTC) loss. We attempt to address challenges posed by a lack of data, including poor generalization, through data augmentation of electromyographic signals, a specialized multi-modal architecture, and regularization. We show results indicating reasonable qualitative performance on test set utterances, and describe promising avenues for future work in this direction.

1 Introduction

Subvocalization is silent internal speech produced while reading or thinking. It is characterized by small movements in facial and laryngeal muscles measurable by surface electromyography (EMG). A successful subvocal speech recognizer would provide a silent, hands-free human-computer interface. Such an interface can add confidentiality to public interactions, improve communication in high-noise environments, and assist people with speech disorders. Although some work has attempted to perform speech recognition on EMG recordings of subvocalization, current word error rates on the order of 10-50% per speaker on the EMG-UKA test corpus of approximately 100 unique words (Wand et al., 2014) are far too high to use subvocal speech recognition in a practical setting.

Approaches to EMG-based speech recognition have so far focused on HMM-GMM models. A hybrid HMM-NN model for phone labeling has also been briefly explored (Wand and Schultz, 2014). However, obtaining ground-truth phone alignments in EMG recordings is much more challenging than with sound. It is also unclear that laryngeal muscle movements can be classified into the same phones that are used for audible speech. To address these challenges, we leverage recent techniques in end-to-end speech recognition that do not require forced alignment models. We also consider large speaker variability and noisy measurements.

We start from a baseline three-layer recurrent neural network using spectrogram features, and try to improve performance through feature engineering and an ensemble of LSTM-based recurrent networks. We also explore a multi-modal RNN architecture that outperforms the mode-dependent ensemble on whispered and silent EMG data, while the ensemble performs best on audible EMG data. Through our experiments in feature engineering, data augmentation, and architecture exploration, we achieve a character error rate (CER) of 0.702 on the EMG-UKA dataset, an improvement over the 0.889 CER of our baseline model.

2 Related Work

Non-audible speech is a focus of ongoing research. The first breakthrough paper on EMG-based speech recognition was Chan et al. in 2001 (Chan et al., 2001), who achieved an average accuracy of 93% on a 10-word vocabulary of English digits. In addition to EMG, researchers have explored automatic speech recognition using data from magnetic sensing (Hofe et al., 2013), radar (Shin and Seo, 2016), and video (Wand et al., 2016).

Inspired by Chan et al., a series of papers on EMG-based subvocalization began being published by Jou, Schultz, Wand, and others beginning in 2007 (Jou et al., 2007). In particular, the Schultz working group has steadily improved on models that use a traditional HMM acoustic architecture with time-domain features of the EMG signal, triphones, phonetic bundling, a language model based on broadcast news trigrams, and lattice rescoring to estimate the most likely hypothesis word sequences. Because EMG data differs from audio data, we believe that the primary contribution of the collaborative work of Schultz and Wand lies in their development of features for EMG data. Based on sampled frames of time, they build a feature with high- and low-frequency components, which they then reduce in dimensionality using LDA (Schultz and Wand, 2010; Wand and Schultz, 2014). A 2014 paper from Wand that develops a neural network architecture for phone labeling (Wand and Schultz, 2014) may perform somewhat better, though direct comparisons are challenging as the datasets and EMG collection devices differ. The current state of the art achieves a word error rate of 9.38% for the best speaker-session combination on a limited set of words; the reported interquartile range is approximately 22% to 45%.

An alternative arm of work by Freitas et al. that attempts to recover text from Portuguese EMG signals achieves a best average performance of 22.5% word error rate, also under a traditional approach (Freitas et al., 2012). The authors find that the nasality of vowels is a primary source of error, and they suspect that the muscles activated in producing nasal vowels are not detected well by the surface EMG. Freitas et al.'s focus on phones aligns with the work by Schultz and Wand on "extracting a set of features which directly represent certain articulatory activities, and, in turn, can be linked to phonetic features" (Wand and Schultz, 2014), and the history of challenges with that approach motivates our application of the connectionist temporal classification approach.

The connectionist temporal classification loss (Graves and Jaitly, 2014) reframes the problem of automatic speech recognition from one in which speech is comprised of phones which have a mapping to text, into one in which speech is decoded directly as text. By feeding a recurrent neural network architecture the speech signal, or features derived from the speech signal, at each time step, the network learns to generate characters of text. With minimal postprocessing to remove duplicated and "blank" characters, the model's predictions map very closely to the character sequence. Because each time step does not need a hard and correct label reflecting the phone being uttered, this approach avoids many of the assumptions that have posed a challenge for traditional HMM-based modeling. The Graves et al. paper introducing CTC showed an approximately 5% improvement in label error rate on TIMIT, from a context-dependent HMM LER of 35.21% to a CTC prefix-search LER of 30.51%, and the performance gap has continued to grow since 2014.
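To make the post-processing step concrete, the following is a minimal sketch (ours, not from the cited work) of the greedy collapse that CTC decoding relies on: consecutive duplicate predictions are merged and the blank symbol is dropped.

```python
def ctc_greedy_collapse(frame_predictions, blank="_"):
    """Collapse per-frame CTC character predictions into text:
    merge consecutive duplicates, then drop the blank symbol."""
    collapsed = []
    previous = None
    for ch in frame_predictions:
        if ch != previous:      # merge runs of the same character
            collapsed.append(ch)
        previous = ch
    return "".join(c for c in collapsed if c != blank)

# Example: "hheel_lloo_" collapses to "hello".
print(ctc_greedy_collapse("hheel_lloo_"))
```
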
3 Approach

We use the public portion of the EMG-UKA electromyography dataset, and we derive four alternative feature types from that data. Because this dataset has poor phoneme-level alignments, we use the character-level CTC problem formulation, which maps recordings directly to textual transcriptions rather than predicting a phone as an intermediary. In contrast to the existing work in the literature, we strive to build a session-independent model that does not retrain for each new EMG session. We experiment with three approaches in this realm: a mode-independent model, an ensemble of mode-dependent models, and a multi-modal model that uses weight sharing to reduce the number of parameters that must be trained.

3.1 Dataset

Our data is the public EMG-UKA trial corpus (Wand et al., 2014). The EMG-UKA trial corpus consists of about two hours of EMG recordings of four subjects reading utterances from broadcast news. EMG recordings are available while subjects read utterances in three modes: audibly, silently, and while whispering. The recordings contain 6 channels of EMG data collected at 600 Hz using multiple leads, a sound recording collected at 1600 Hz, and a transcription of the utterance. While sound recordings of utterances are available, at no point in our work do we use them to train our models.

Each sample in the corpus contains estimated phone and word alignments for the audible and whispered data based on an HMM model of the audio track, and estimated phone and word alignments for the silent data based on a model that maps the HMM results to the silent mode. Our analysis of these forced alignments indicates that they range in quality from excellent to essentially noise, as described in Table 1.[1]

        audible    whispered  silent
word    4.6 (62)   3.8 (65)   3.4 (57)
phone   3.6 (194)  0.8 (194)  0.2 (188)

Table 1: Quality of data labels provided in the corpus on a 0-5 Likert scale. These results indicate that it is inappropriate to use phone-level labels for whispered and silent data. We approximate data quality by averaging qualitative ratings of five utterances selected at random from each mode, and we provide the sample size at each level in parentheses.

[1] The Likert scale used in the alignments analysis is: 5 (excellent: perfect), 4 (good: 1 or 2 errors), 3 (fair: repeated mistaggings but understandable), 2 (poor: 1 or more mid-length subsegments are mistagged), 1 (problematic: not understandable; long-length subsegments are mistagged), 0 (irrelevant: any correctness seems random). The label quality for silent phones was estimated through tells including the presence of plosives, the length of the audio segment for a single phone, and the extent to which the phone-level transcript matched the word-level transcript.

The training dataset consists of 1460 utterances, which we split into a train and a validation set whose transcript sets do not overlap. Each utterance consists of a median of 9 words (IQR 7-11) and 54 characters (IQR 38-67). Within the training set, there are 1145 utterances of 406 unique sentences, split into 711 audible, 187 whispered, and 187 silent examples across the four speakers. Within the validation set, there are 315 utterances of 105 unique sentences, split into 209 audible, 53 whispered, and 53 silent examples. The official test split contains 260 utterances of 10 unique sentences, split into 140 audible, 60 whispered, and 60 silent examples across the four speakers.

We note that individuals subvocalize differently from each other, such that models do not easily generalize across individuals (Wand and Schultz, 2011). The amount of adipose tissue, age and slackness of skin, muscle cross-talk, and the surface nature of non-invasive EMG can also reduce signal quality (Kuiken et al., 2003). Additionally, session-to-session differences in electrode application can result in models that overfit to a single session. A significant challenge of our work is therefore to design a model that can generalize well to unseen speakers, sessions, and utterances despite a considerable lack of data.

While the full EMG-UKA corpus contains 8 hours of recording data rather than the 2 hours available in the trial corpus, it is not publicly available. The authors of the corpus were not reachable for release of the full dataset, despite multiple attempts. Because of this, it is impossible for us to directly compare the performance of our models against prior work on this dataset.
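As an illustration of this splitting procedure, the sketch below groups utterances by transcript before assigning them, so that no sentence text appears in both sets; the "transcript" field name and the validation fraction are placeholder assumptions rather than details from the paper.

```python
import random

def split_by_transcript(utterances, val_fraction=0.2, seed=0):
    """Split utterances so the train and validation transcript sets are disjoint.
    `utterances` is assumed to be a list of dicts with a "transcript" key."""
    transcripts = sorted({u["transcript"] for u in utterances})
    random.Random(seed).shuffle(transcripts)
    held_out = set(transcripts[:int(len(transcripts) * val_fraction)])
    train = [u for u in utterances if u["transcript"] not in held_out]
    val = [u for u in utterances if u["transcript"] in held_out]
    return train, val
```
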
3.2 EMG feature extraction

Traditional features used in ASR such as MFCCs cannot be used for EMG data, since they rely on characteristics specific to sound or its human perception. We implement and explore multiple types of EMG features, derived by splitting the EMG signals into frames of 27 ms, each shifted by 10 ms:

Spectrogram. Spectrogram features reflect the DFT of each frame.

Wand 2015. Wand 2015 features reflect the features described by Wand in his dissertation and other work (Wand, 2015). We separate each EMG channel into a low-frequency and a high-frequency component by low-pass filtering the signal and subtracting the result from the original. We then compute five time-domain features per EMG channel: the first two features are the frame-based time-domain mean and power of the low-frequency signal, and the final three features are the frame-based high-frequency time-domain power, zero-crossing rate, and rectified mean. We stack features from the k frames on either side of a given frame, setting k = 10 as recommended.

Wand 2015 + LDA. We reduce the Wand feature set to the ℓ = 12 dimensions that best discriminate the subphones reflecting the beginning, middle, and end of each frame's phoneme label by applying linear discriminant analysis (LDA).

Wand 2015 + LDA Audible. We reduce the Wand feature set to the ℓ = 12 dimensions that best discriminate the subphones in the audible-mode utterances only.
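The sketch below illustrates the Wand-style time-domain features and context stacking for a single EMG channel, assuming the 600 Hz sampling rate and the 27 ms / 10 ms framing described above; the low-pass cutoff and filter order are our own illustrative choices, not values taken from Wand (2015).

```python
import numpy as np
from scipy.signal import butter, filtfilt

def wand_features(channel, fs=600, frame_ms=27, shift_ms=10, cutoff_hz=10):
    """Five time-domain features per frame for one EMG channel.
    The low-frequency component w comes from low-pass filtering;
    the high-frequency residual is p = x - w."""
    b, a = butter(4, cutoff_hz / (fs / 2), btype="low")
    w = filtfilt(b, a, channel)
    p = channel - w

    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    feats = []
    for start in range(0, len(channel) - frame_len + 1, shift):
        wf = w[start:start + frame_len]
        pf = p[start:start + frame_len]
        zcr = np.sum(np.abs(np.diff(np.sign(pf)))) / (2 * len(pf))
        feats.append([
            wf.mean(),            # low-frequency frame mean
            np.mean(wf ** 2),     # low-frequency frame power
            np.mean(pf ** 2),     # high-frequency frame power
            zcr,                  # high-frequency zero-crossing rate
            np.mean(np.abs(pf)),  # high-frequency rectified mean
        ])
    return np.asarray(feats)

def stack_context(feats, k=10):
    """Stack features from the k frames on either side of each frame."""
    padded = np.pad(feats, ((k, k), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(feats)] for i in range(2 * k + 1)])
```

With k = 10, stack_context(wand_features(channel)) yields a 105-dimensional vector per frame for one channel (5 features x 21 frames).
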
3.3 Baseline architecture

We perform automatic speech recognition from EMG signals using an end-to-end neural network. The baseline model is a three-layer recurrent neural network using LSTM cells with a connectionist temporal classification (CTC) loss function (Graves and Jaitly, 2014), with a hidden size of 256 for all three layers. To prevent exploding gradients, we clip gradients beyond a maximum norm of 10. We use the Adam optimizer with a learning rate of 1e-3. Our baseline uses the spectrogram feature set described in Section 3.2.

3.4 Mode-independent architecture

The mode-independent architecture uses the Wand-LDA feature set, and is otherwise identical to the baseline architecture.
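A minimal sketch of this recurrent CTC setup in PyTorch, under the hyperparameters stated above (three LSTM layers of 256 units, CTC loss, Adam at 1e-3, gradient clipping at norm 10); the framework choice, the input feature dimension, and the character inventory size are our assumptions, not details given in the paper.

```python
import torch
import torch.nn as nn

class BaselineCTC(nn.Module):
    """Three-layer LSTM with a per-frame character projection for CTC."""

    def __init__(self, n_features, n_chars, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_features, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, n_chars + 1)    # +1 for the CTC blank

    def forward(self, x):                             # x: (batch, time, feats)
        out, _ = self.rnn(x)
        return self.proj(out).log_softmax(dim=-1)     # (batch, time, chars+1)

model = BaselineCTC(n_features=128, n_chars=27)       # placeholder sizes
ctc_loss = nn.CTCLoss(blank=0)                        # labels use 1..n_chars
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(feats, feat_lens, targets, target_lens):
    """One optimization step with gradient clipping at max norm 10."""
    log_probs = model(feats).transpose(0, 1)          # CTCLoss expects (T, B, C)
    loss = ctc_loss(log_probs, targets, feat_lens, target_lens)
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=10)
    optimizer.step()
    return loss.item()
```

The mode-dependent ensemble members described next would use the same skeleton with a single 256-unit LSTM layer instead of three.
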

3.5 Mode-dependent architecture ensemble

Since the input utterances are spoken in three different modes (audible, whispered, and silent), it may be unreasonable to expect a single model to generalize well to all three types of speech. We therefore develop an ensemble of three mode-dependent LSTMs, where each LSTM is trained on utterances from only one of the modes. Each LSTM uses a single hidden layer of 256 hidden units. We note that it may be challenging for the ensemble to perform well on unseen data, due to an even more severe lack of training data caused by the splitting of utterances by mode. Our multi-modal architecture attempts to address this problem.

3.6 Multi-modal architecture

Our multi-modal architecture is designed to bridge the two previous models. In recognition of literature that reports EMG signals differ by mode, we introduce a model architecture that includes separate layers of weights for each mode and shares higher layers of weights between modes. This architecture is intended both to improve mode-specific performance and to exploit all training data. It reduces the number of parameters that must be learned purely from the silent and whispered data, and it allows those modes to benefit from information learned from the more extensive audible data. This architecture also allows a single feed-forward network to be used in a production setting for all three modes of data, without any assumption that the modes have identical characteristics. Figure 1 illustrates the architecture. In experiments, we use a single mode-specific hidden layer of 192 units and a single shared hidden layer of 256 units.

Figure 1: To facilitate learning from limited data, we build an architecture with mode-specific low-level layers that share weights at higher levels. The intent is to derive useful shared properties of decoding EMG data from the relatively larger amount of audible data, while simplifying the task for the silent and whispered modes to one of transforming into a shared space.

3.7 Language model beam-search decoding

Since all our architectures are trained at the character level, spelling mistakes are common in decoded utterances. To address this problem, we post-process the top-scoring decoded utterance for a given input with a second beam search step that aims to correct the utterance according to a language model. We first split the decoded utterance into tokens by the blank character, and consider all character-level edits of each token that are an edit distance of two characters or fewer away, including inserting blank tokens. We choose the sequence of tokens that maximizes the probability of the utterance according to a four-gram Kneser-Ney language model pre-trained on the TED-LIUM corpus (Rousseau et al., 2012).
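The sketch below conveys the idea of this post-processing step under several simplifying assumptions: candidates are generated within an edit distance of one rather than two, blank-token insertion is omitted, the token-level search is greedy rather than a true beam search, and the KenLM model path is a placeholder.

```python
import string
import kenlm

LM = kenlm.Model("ted_lium_4gram.arpa")   # placeholder path to a 4-gram LM
ALPHABET = string.ascii_lowercase

def edits1(token):
    """All candidate strings within one character edit of `token`."""
    splits = [(token[:i], token[i:]) for i in range(len(token) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set([token] + deletes + replaces + inserts)

def correct_utterance(decoded):
    """Greedily replace each decoded token with the candidate that
    maximizes the language-model score of the running prefix."""
    corrected = []
    for token in decoded.lower().split():
        best = max(edits1(token),
                   key=lambda c: LM.score(" ".join(corrected + [c]),
                                          bos=True, eos=False))
        corrected.append(best)
    return " ".join(corrected)
```
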
4 Results

We report final results on the architectures using Wand 2015 + LDA features, which performed best empirically. Performance across all approaches was quite similar.

                        Audible  Whispered  Silent  All
Baseline (SI, MI)       0.901    0.886      0.888   0.889
Mode agnostic (SI, MI)  0.703    0.704      0.713   0.702
Mode ensemble (SI, MD)  0.696    0.737      0.718   N/A
Multi-modal (SI, MI)    0.707    0.711      0.717   N/A

Table 2: Character error rates on test sets of EMG data collected during audible, whispered, and silent reading, averaged across sessions, for models trained only on EMG data. Models marked SI are session-independent, while models marked SD are session-dependent; models marked MI or MD are mode-independent and mode-dependent, respectively.

For audible EMG data, the mode ensemble performed best, with an improvement in CER from 0.901 to 0.696. On the entire dataset, the mode-independent architecture reduced CER from 0.889 to 0.702. The multi-modal model did not yield gains, however, suggesting that additional individualized layers might be needed to show further improvement. Our baseline and revised architecture results are shown in Table 2.

Our approaches demonstrate qualitative learning that is not reflected in the quantitative results, as illustrated in Table 3. A major challenge for the models is learning to correctly insert spaces, given that people do not pause between spoken words and that the model only saw 406 unique well-formed sentences. For instance, in the sentence "THE AVERAGE PERSON DON'T REALIZE HOW IMPORTANT HONEY BEES ARE", the model transcribes "HONEY BEES ARE" as the single, phonologically close word "ONIMERAR" rather than as "ONI ME RAR". Unfortunately, our spelling correction module relies in part on approximately correct spacing, and so mistakes like these persist in the quantitative results.

During learning, vowels, liquids, and sonorants tend to be learned first, and they are more likely to be correct. The model usually identifies a few fricatives or stops early as well, while it is still struggling to identify precisely the right vowel to transcribe. For instance, "FOREIGN POLICY" is transcribed as "FI RE POUSCO", and "THE STATE OF FLORIDA" is transcribed as "THO LANTE OF FE IRN".

The model learns the relatively gross movements associated with place of articulation more easily than the finer distinctions of manner of articulation, as in the alveolar fricative-stop combination of "ST" in "STATE" being transcribed as the alveolar approximant "L" in "LANTE". As with Freitas et al. (Freitas et al., 2012), we find that nasality is challenging to correctly detect, with mistakes like "ME" for "BEES", "TEPOROW" for "TOMORROW", and "EMAUTIN" for "THE MOUNTAIN" being common. Similarly, stops and fricatives at nearby places of articulation are commonly confused, as in "FRESIDENT" for "PRESIDENT". The presence of voicing is extremely challenging for the model to disambiguate, as in the unvoiced mistaken transcription "CROUND" for the voiced "GROUND" that we observe during training. Given that no electrodes detect whether the vocal folds are vibrating, the voicing information must be recoverable in this task from an inherent language model in the RNN rather than from the data itself, and again the limited amount of data makes this task challenging.

5 Experiments

To improve performance over the baseline, we ran experiments changing our model architecture, applied L2 regularization, explored different EMG feature sets, and artificially augmented the amount of available data. These experiments are all focused on improving generalizability through reducing the opportunity to fit to noise and improving the amount of data available for appropriately setting the model parameters.

5.1 Number of parameters

We ran experiments with different hidden state dimensionalities and various depths of the recurrent networks. We notice a trend towards overfitting as the number of parameters and the number of layers increase. This might be a result of the limited dataset, which makes it hard to tune a large number of parameters.

THE AVERAGE PERSON DON'T REALIZE HOW IMPORTANT HONEY BEES ARE
THE RAEBEC SAN CEI ALIZE OLD FOLE ONI ME RAR

O PRESIDENT GOES UNCHALLENGED BY FOREIGN POLICY
TE FRESIDENT AS GTELINTESE PALE FIRE POUSCO

HIS PARTY HAS A PUBLIC RELATIONS PROBLEM ON MINIMUM WAGES
YIES APA DTA POUPBLI ICITONS PMPBLEM OD MAN RANTR

PLEASE JOIN US AGAIN TOMORROW
TPLEAS JTON S AGAN TEPOROW

THE STATE OF FLORIDA HAS A TOUGH POLICY
THO LANTE OF FEIRNA OT I AE BULE

Table 3: Sample decodings of EMG signals on the test set for our mode- and session-independent model. In each pair, the target utterance is on top, and our model's decoding is on the bottom. Alignments are performed manually, including adding blanks where necessary.

5.2 Regularization

Because our models are quick to overfit on the limited dataset, we experiment with varying the amount of L2 regularization. L2 regularization punishes models that rely heavily on a handful of features that might be perfectly informative within the small setting of the training data, under the assumption that models that use a variety of clues in producing their outputs are more likely to generalize well. We graph the resulting test set performance in Figure 2; all tested models are taken from the inflection point in the validation loss curve. We find that our measures of CER are so noisy that increasing L2 regularization has no discernible quantitative effect, and perhaps a slight negative qualitative impact.

Figure 2: Increasing L2 regularization has no consistent effect on test set generalization.

5.3 EMG features

Because an early analysis of the corpus indicated that the forced alignments for the whispered and silent phone labels had poor quality (see Table 1), we explore performing LDA on only the audible segments as a means to improve performance. This investigation is motivated by an initial finding that, on mode-dependent tests, the Wand features outperform Wand-LDA by 2.4% for the silent mode and by 0.7% for the whispered mode, which suggests that LDA without audible data is noisy. However, when we limit LDA to use only the audible utterances, which have better phone alignments, we find no gain in performance. From these results, we suspect that the phone-level labels for the silent and whispered data are so self-inconsistent that they do not confuse the LDA model.

5.4 Multi-modal architecture

As reported in Table 2, the multi-modal architecture slightly outperformed the mode ensemble on whispered and silent data, but only outperformed our baseline model on audible data. This suggests that the multi-modal architecture may have overcome some of the mode-dependent whispered and silent models' problems with lack of data. However, large differences in the EMG spectrum across modes may have overpowered the shared layer of the mapping between features and transcriptions, resulting in worse performance than the mode-agnostic architecture.

5.5 Language model beam-search decoding

Our language-model-informed post-processing step described in Section 3.7 improved the average per-utterance CER on the training set. While the original CER after 1000 iterations of training is 0.541, the CER of the same model after applying our post-processing step is 0.511.

This post-processing step did not improve performance on the validation set; CER increased by a few percentage points. As seen in Table 3, this is because decoded tokens are frequently more than two character edits away from the target token. Additionally, the initial splitting into tokens requires that blank characters be inserted at the appropriate places, which is frequently not the case.

5.6 Data augmentation

Because performance appears to be limited by the amount of data, we also explore data augmentation. We implement three augmentation methods that are appropriate for EMG data:

1. We add 50 Hz noise, which reflects the nominal frequency of alternating current in a 230-volt electric power grid (see Figure 3). The noise has a random period offset, and it is added at an amplitude chosen uniformly at random to be no more than 10% of the maximum amplitude.

2. We remove 50 Hz power line noise with a Butterworth bandstop filter that excludes frequencies between 49 and 51 Hz (see Figure 3).

3. We sample consecutive subsequences of words from the known utterances, following the forced alignments provided by the corpus. We choose to re-sample despite the word-level alignment quality reported in Table 1, from a suspicion that the benefit of the larger amount of data available from re-sampling might offset the downside of mistakes in the forced alignments.

Figure 3: As part of data augmentation, we add 50 Hz power line noise (middle) to all channels of the EMG recording (top) at a random amplitude between 0% and 10% of the maximum signal amplitude. The resulting new signal (bottom) is then used in training.

We apply data augmentation stochastically. Utterances are selected for augmentation at a rate of 0.5. Of the utterances selected for augmentation, half are transformed into a subsequence of length two or more. After the potential dropping of part of the utterance, 75% of the examples have noise added and 25% of the examples have noise removed.

Unfortunately, data augmentation has no statistically significant positive effect on generalization. Our experiments on this topic used the mode-dependent models. We found that with data augmentation, the best audible model had worse CER performance (an increase of 0.01 CER). Under a data augmentation protocol that selected utterances for augmentation at a rate of 3.0, the whispered and silent models were also not meaningfully improved: the best whispered model was worse by 0.02 CER, and the best silent model was worse by 0.03 CER. Because the quantitative changes have neutral to negative effects, it appears that data augmentation did not substantially improve performance.

We attribute the lack of improvement to three factors. First, because the test data was collected interspersed with the training data, we suspect that the noise was consistent across both sets, such that generalization to this test data specifically did not benefit from alterations in the noise pattern. Second, we suspect that the forced word alignments were in fact too poor, and that they led to confusion during learning. We find evidence for this explanation in the transcriptions; given an intended crop of "(...) REPUBLICANS ON CAPITOL HILL (TODAY)", the decoded text is "SAID GOBL OL HIL A", such that we suspect the actual crop was "S ON CAPITOL HILL TO-". And finally, although selecting subsequences of utterances produces artificially different sequences, it does not introduce any additional vocabulary items or patterns of English word formation, a primary factor by which the model seems challenged.
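A sketch of the two noise-based augmentations and the stochastic selection described above; the Butterworth filter order, the (channels, samples) layout, and the random number handling are illustrative assumptions, and subsequence cropping is omitted.

```python
import numpy as np
from scipy.signal import butter, filtfilt

RNG = np.random.default_rng(0)

def add_powerline_noise(emg, fs=600, freq=50.0, max_fraction=0.1):
    """Add a 50 Hz sinusoid with random phase at an amplitude of at most
    `max_fraction` of the maximum absolute amplitude of the recording."""
    t = np.arange(emg.shape[1]) / fs
    phase = RNG.uniform(0, 2 * np.pi)
    amplitude = RNG.uniform(0, max_fraction) * np.max(np.abs(emg))
    return emg + amplitude * np.sin(2 * np.pi * freq * t + phase)

def remove_powerline_noise(emg, fs=600, band=(49.0, 51.0), order=4):
    """Suppress 50 Hz power-line noise with a Butterworth band-stop filter."""
    b, a = butter(order, band, btype="bandstop", fs=fs)
    return filtfilt(b, a, emg, axis=1)

def augment(emg, p_select=0.5, p_add=0.75):
    """With probability `p_select`, either add noise (probability `p_add`)
    or filter noise out; subsequence cropping is not shown here."""
    if RNG.random() >= p_select:
        return emg
    if RNG.random() < p_add:
        return add_powerline_noise(emg)
    return remove_powerline_noise(emg)
```
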
6 Conclusion

Our work describes a novel approach to subvocal speech recognition. We decode EMG signals into utterances in an end-to-end fashion by using character-level LSTMs trained with the CTC loss function. Our approach lends itself well to the poor phoneme-level alignments inherent in EMG recordings of silent speech. We show some reasonable qualitative decodings on unseen test-set utterances, although quantitative performance remains far too poor for use in real applications.

Future work in this area could explore using character-level language models during the beam search decoding phase in an online fashion (rather than as a post-processing step), as described in related work in ASR (Maas et al., 2015; Graves and Jaitly, 2014). Particularly in the case of limited data, it is likely beneficial to leverage a language model as a prior that encodes information about the spelling constraints of the English language, instead of expecting the subvocal speech recognizer to learn these rules from a small number of utterances.

Perhaps the most impactful avenue of future research in this area would be the gathering of a large, publicly available dataset of EMG subvocalization recordings. Our experiments demonstrate the need for more than the EMG-UKA trial corpus's two hours of data to train our CTC models, and the full EMG-UKA corpus is not public. However, by describing the first end-to-end approach to subvocal speech recognition, we hope to show that it would be sufficient to simply record the EMG activity of many subjects while reading; the time-consuming task of labeling and aligning utterances at the word or phoneme levels would not be necessary.

References

AD Chan, Kevin Englehart, Bernard Hudgins, and Dennis F Lovely. 2001. Myo-electric signals to augment speech recognition. Medical and Biological Engineering and Computing 39(4):500–504.

Joao Freitas, Antonio Teixeira, and Miguel Sales Dias. 2012. Towards a silent speech interface for Portuguese. In Proc. Biosignals, pages 91–100.

Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In ICML, volume 14, pages 1764–1772.

Robin Hofe, Stephen R Ell, Michael J Fagan, James M Gilbert, Phil D Green, Roger K Moore, and Sergey I Rybchenko. 2013. Small-vocabulary speech recognition using a silent speech interface based on magnetic sensing. Speech Communication 55(1):22–32.

Szu-Chen Stan Jou, Tanja Schultz, and Alex Waibel. 2007. Continuous electromyographic speech recognition with a multi-stream decoding architecture. In Acoustics, Speech and Signal Processing (ICASSP), 2007 IEEE International Conference on, IEEE, volume 4, pages IV–401.

TA Kuiken, MM Lowery, and NS Stoykov. 2003. The effect of subcutaneous fat on myoelectric signal amplitude and cross-talk. Prosthetics and Orthotics International 27(1):48–54.

Andrew L Maas, Ziang Xie, Dan Jurafsky, and Andrew Y Ng. 2015. Lexicon-free conversational speech recognition with neural networks. In HLT-NAACL, pages 345–354.

Anthony Rousseau, Paul Deléglise, and Yannick Estève. 2012. TED-LIUM: an automatic speech recognition dedicated corpus. In LREC.

Tanja Schultz and Michael Wand. 2010. Modeling coarticulation in EMG-based continuous speech recognition. Speech Communication 52(4):341–353.

Young Hoon Shin and Jiwon Seo. 2016. Towards contactless silent speech recognition based on detection of active and visible articulators using IR-UWB radar. Sensors 16(11):1812.

Michael Wand. 2015. Advancing Electromyographic Continuous Speech Recognition: Signal Preprocessing and Modeling. KIT Scientific Publishing.

Michael Wand, Matthias Janke, and Tanja Schultz. 2014. The EMG-UKA corpus for electromyographic speech processing. In The 15th Annual Conference of the International Speech Communication Association (Interspeech).

Michael Wand, Jan Koutník, and Jürgen Schmidhuber. 2016. Lipreading with long short-term memory. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, IEEE, pages 6115–6119.

Michael Wand and Tanja Schultz. 2011. Session-independent EMG-based speech recognition. In Biosignals, pages 295–300.

Michael Wand and Tanja Schultz. 2014. Pattern learning with deep neural networks in EMG-based speech recognition. In Engineering in Medicine and Biology Society (EMBC), 2014 36th Annual International Conference of the IEEE, IEEE, pages 4200–4203.