<<

ACOUSTICALLY GROUNDED EMBEDDINGS FOR IMPROVED -TO-WORD RECOGNITION

Shane Settle? Kartik Audhkhasi† Karen Livescu? Michael Picheny†

? TTI-Chicago †IBM Research AI

ABSTRACT less data (e.g., 300 hours), but performance gaps still remain between A2W and sub-word models. Direct acoustics-to-word (A2W) systems for end-to-end automatic In this work we develop techniques for addressing the challenge speech recognition are simpler to train, and more efficient to decode of infrequent in A2W recognition. In an end-to-end neural with, than sub-word systems. However, A2W systems can have dif- A2W model, the final weight consists of one vector per word in ficulties at training time when data is limited, and at decoding time the , which can be seen as a matrix. Most when recognizing words outside the training vocabulary. To address weights in this large matrix are associated with very few training these shortcomings, we investigate the use of recently proposed acous- examples since most words are rare. In addition, out-of-vocabulary tic and acoustically grounded word embedding techniques in A2W words cannot be predicted at all (unlike in sub-word models). systems. The idea is based on treating the final pre-softmax weight Our approach is to first learn a word embedding model in a matrix of an AWE recognizer as a matrix of word embedding vectors, data-efficient way, and then to use it in two ways: (1) By training and using an externally trained set of word embeddings to improve the recognizer so as to retain similar weights to the pre-trained em- the quality of this matrix. In particular we introduce two ideas: (1) beddings, and (2) for predicting words that are unseen in training. Enforcing similarity at training time between the external embeddings The embedding model learns shared structure between words, and and the recognizer weights, and (2) using the word embeddings at therefore generalizes well to rare or unseen words. Our pre-trained test time for predicting out-of-vocabulary words. Our word embed- embedding model is learned using techniques from recent work on ding model is acoustically grounded, that is it is learned jointly with acoustic word embeddings, which has found that high-quality discrim- acoustic embeddings so as to encode the words’ acoustic-phonetic inative embeddings can be learned from very little data (e.g., ∼100 content; and it is parametric, so that it can embed any arbitrary (po- minutes) [10–13]. In particular, our embedding approach closely tentially out-of-vocabulary) sequence of characters. We find that follows that of [12], which jointly learns an acoustic embedding both techniques improve the performance of an A2W recognizer on function—mapping an acoustic signal to a fixed-dimensional vector— conversational telephone speech. and an acoustically grounded textual embedding function—mapping Index Terms— automatic speech recognition, direct acoustics- a character sequence to a vector. to-word models, connectionist temporal classification, acoustic word We explore a number of variants of these ideas, and find that A2W embeddings, triplet contrastive loss recognition performance is consistently improved by either initial- izing the recognizer’s word embeddings with acoustically grounded 1. INTRODUCTION embeddings or by regularizing toward them. We also introduce a simple method for predicting words outside of the training vocabulary, End-to-end automatic speech recognition (ASR) focuses on replacing which improves performance when the training vocabulary is limited. the modular training approaches of traditional ASR systems with conceptually simpler methods. Instead of requiring separately trained 2. APPROACH acoustic, pronunciation, and language models, neural network-based connectionist temporal classification (CTC) and encoder-decoder 2.1. Acoustics-to-Word Model approaches allow for joint optimization of a single objective. In The acoustics-to-word (A2W) model uses a single recurrent neu- arXiv:1903.12306v1 [cs.CL] 29 Mar 2019 principle such models can map acoustics directly to words. However, ral network, typically a bidirectional long short-term memory net- to achieve performance comparable with traditional methods, these work (BLSTM), trained with the connectionist temporal classification systems are still typically trained to predict sub-word units such (CTC) loss to recognize words from input acoustic sequences. Prior as characters or “wordpieces” [1–4], thereby relying on additional work has found that A2W models either require very large amounts decoders and externally trained language models. of training data [5] or careful training when using limited amounts of Acoustics-to-word (A2W) systems [5–9] jointly model the acous- training data [6,7]. In particular, [7] showed that presenting utterances tic, pronunciation, and language models at the word level under a uni- in increasing order of length, initializing with a phone CTC BLSTM, fied framework. Word-level modeling avoids the need for additional and dropout contributed to significant improvements in the word error decoding, but introduces new challenges. By modeling acoustics at rate (WER). This recipe, when applied to the (intermediate sized) the word level, the system needs to deal with significant variability in 2000-hour Switchboard-Fisher training set, produced a WER on par word duration, and it is challenging to learn to recognize infrequent with several state-of-the-art sub-word based models at the time. words. A2W systems perform well given access to very large amounts Despite some success training competitive A2W models with of training data. For example, [5] trained a 100K-word vocabulary large amounts of data, they lag behind conventional models when A2W model and matched the performance of a state-of-the-art sub- given more limited training data. Furthermore, an A2W model word CTC model, but required 125K hours of training speech. Other is trained with a fixed vocabulary and cannot recognize out-of- recent work [6, 7, 9] has explored techniques for training on much vocabulary (OOV) words. Several prior approaches have been

c 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. dimensional vectors. The acoustic embedding model consists of a stacked BLSTM followed by a sum over the output layer hidden states and a projection to a lower-dimensional vector in