Arxiv:2007.00183V2 [Eess.AS] 24 Nov 2020

WHOLE-WORD SEGMENTAL SPEECH RECOGNITION WITH ACOUSTIC WORD EMBEDDINGS Bowen Shi, Shane Settle, Karen Livescu TTI-Chicago, USA fbshi,settle.shane,[email protected] ABSTRACT Segmental models are sequence prediction models in which scores of hypotheses are based on entire variable-length segments of frames. We consider segmental models for whole- word (“acoustic-to-word”) speech recognition, with the feature vectors defined using vector embeddings of segments. Such models are computationally challenging as the number of paths is proportional to the vocabulary size, which can be orders of magnitude larger than when using subword units like phones. We describe an efficient approach for end-to-end whole-word segmental models, with forward-backward and Viterbi decoding performed on a GPU and a simple segment scoring function that reduces space complexity. In addition, we inves- tigate the use of pre-training via jointly trained acoustic word embeddings (AWEs) and acoustically grounded word embeddings (AGWEs) of written word labels. We find that word error rate can be reduced by a large margin by pre-training the acoustic segment representation with AWEs, and additional Fig. 1. Whole-word segmental model for speech recognition. (smaller) gains can be obtained by pre-training the word pre- Note: boundary frames are not shared. diction layer with AGWEs. Our final models improve over segmental models, where the sequence probability is com- prior A2W models. puted based on segment scores instead of frame probabilities. Index Terms— speech recognition, segmental model, Segmental models have a long history in speech recognition acoustic-to-word, acoustic word embeddings, pre-training research, but they have been used primarily for phonetic recognition or as phone-level acoustic models [11–18]. There has 1. INTRODUCTION also been work on whole-word segmental models for second- Acoustic-to-word (A2W) models for speech recognition map pass rescoring [13, 19, 20], but to our knowledge our approach input acoustic frames directly to words. Unlike conventional is the first to address end-to-end A2W segmental models. subword-based automatic speech recognition (ASR) systems, The key ingredient in our approach is to define the segment arXiv:2007.00183v2 [eess.AS] 24 Nov 2020 A2W models do not require an external lexicon, thus simplify- scores in terms of dot products between vector embeddings of ing training and decoding. Recent work has shown that A2W acoustic segments and a weight layer of written word embed- models can achieve performance competitive with state-of- dings. This form of the model allows for (1) efficient re-use the-art subword-based systems either with large amounts of of feature functions and therefore reduced memory cost and training data [1] or with careful training techniques [2–5]. (2) initialization of the acoustic and written embeddings using Most work on A2W models [1–7] is based on connection- pre-trained acoustic word embeddings (AWEs) and acousti- ist temporal classification (CTC) [8], where the word sequence cally grounded word embeddings (AGWEs), following the probability is defined as the product of frame-level probabil- successful use of such pre-training in prior work on speech ities. In such approaches there is no explicit modeling of recognition [5] and search [21, 22]. We also obtain speed-ups segments of frames corresponding to words. There has also via GPU implementations of the forward-backward and Viterbi been recent work on encoder-decoder A2W models, which can algorithms. We find that pre-trained AWEs provide large gains, focus on ”soft segments” via an attention mechanism [9, 10]. and result in segmental models that outperform the best prior In this paper we propose an approach using whole-word A2W models on conversational telephone speech recognition. 2. SEGMENTAL MODEL FORMULATION network (SRNN) [18, 24, 25]. Frame classifier-based score Segmental models compute the score of a hypothesized label functions use a mapping from input acoustic frames X to sequence as a combination of scores of multi-frame segments frame log-probability vectors P, which are then pooled (via of speech in the sequence, rather than using individual frame mean, sampling, etc.) to get the segment score wt;s;v. This method introduces a multiplicative memory dependence on V , scores (see Figure 1). Let X = fx1; x2; :::; xT g be a se- which is a factor V=D increase in memory overhead over our quence of input acoustic frames and L = fl1; l2; :::; lK g be the output label sequence. A segmentation π with approach. In our case V is the number of words in the vocabu- respect to X and L is defined as a sequence of tuples lary, which is typically ∼ 10 times larger than D and makes this approach extremely (sometimes prohibitively) expensive. f(t1; s1; l1); (t2; s2; l2); :::; (tK ; sK ; lK )g. Each tuple de- (2) 1 SRNNs compute the score w = φT f ([h ; h ; a ]), fines a segment ek consisting of a start timestep tk, an end t;s;v θ t t+s v (2) timestep tk + sk, and a label lk, such that t1 = 0; tK + sK = where fθ is a learned feature function, av is an embedding of T; tk + sk = tk+1; and sk > 0 for all 1 ≤ k ≤ K.A word v, and [u; v] denotes concatenation of u; v. This method segmental model assigns a score wt;s;v to each segment introduces an O(T SDV ) memory overhead, which can again (t; s; v). The score of a segmentation is then defined as quickly make it infeasible for large-vocabulary recognition. P w(π) = (t;s;v)2π wt;s;v. In addition to computational savings, our formulation of segment scores in terms of products of acoustic embeddings 2.1. Segment Score Functions and written word embeddings also has the advantage that As in other recent sequence models, the input acoustic frames these two factors can be pre-trained using methods from prior are first passed through a neural network and encoded into work [5, 26] (see Section 2.4). frame features H = Enc(X) 2 RT ×F , where F denotes the feature dimensionality. In segmental models, however, 2.2. Training these frame features are then used to produce segment scores Segmental models can be trained in a variety of ways [24]. W 2 RT ×S×V , where S and V denote the maximum segment One way, which we adopt here, is to interpret them as proba- size and vocabulary size, respectively, and wt;s;v is the score bilistic models and optimize the marginal log loss under that of segment (t; s; v). Our approach defines segment scores model, which is equivalent to viewing our models as seg- W in terms of dot products between learned representations mental conditional random fields [27]. Under this view, the of variable-length segments and word labels: model assigns probabilities to paths, conditioned on the input (2)T (2) wt;s;v = a fac(Ht:t+s) + b (1) acoustic sequence, by normalizing the path score. Letting v v u(π) U := exp(W), we define p(π) := P as the prob- π2P u(π) where fac is an acoustic segment embedding function map- 0:T Q s×F ability of the segmentation, where u(π) = ut;s;v ping segments Ht:t+s 2 R to fixed-dimensional embed- (t;s;v)2π and P denotes all segmentations of X . We define the dings f (H ) 2 D, a(2) is a row from the matrix 0:T 1:T ac t:t+s R v loss for a given word sequence L and input X as the marginal A(2) 2 V ×D composed of embeddings for all words v in the R log loss, by marginalizing over all possible segmentations: vocabulary, and bv is the bias on word v, which can be inter- X X preted as a log-unigram probability. We define the acoustic L(L; X) = − log u(π) + log u(π) (6) segment embedding function as follows: π2P0:T , π2P0:T B(π)=L (1) (1) fac(Ht:t+s) = ReLU(A G(Ht:t+s) + b ) (2) where B(π) maps π = f(tk; sk; lk)g1≤k≤|πj to its label se- where G is a pooling function chosen between: quence flig1≤k≤|πj. The summations can be efficiently computed with dynamic programming: G(Ht:t+s) = [ht; ht+s] (3) S V 1 s (d) X X X (d) P α := u(π) = ut−s;s;vα G(Ht:t+s) = i=1 ht+i (4) t t−s s π2P0:t s=1 v=1 1 s S (7) P T (n) X X (n) G(Ht:t+s) = i=1 Softmax(g Ht:t+s)iht+i (5) s αt;y := u(π) = ut−s;s;ly αt−s;y−1 π2P0:t s=1 where (3) is concatenation, (4) is mean pooling, and (5) is B(π1:y )=L1:y attention pooling (with learnable parameter g). Equation 1 With α(d) and α(n) computed, the loss value follows directly allows feature sharing, which helps limit the memory needed to (n) (d) compute segment features to O(T SD) and simplifies scoring from L(L; X) = − log αT;jLj +log αT . The last summations (2) (2) in Equations 7 can be efficiently implemented on a GPU. In to matrix multiplication, i.e. Wt;s = A fac(Ht:t+s) + b . (n) (n) Recent work on segmental models has largely used addition, α1:T;y can be computed in parallel given α1:T;y−1 two types of segment score functions: (1) frame classifier- such that the overall time complexity2 of computing the loss based [15, 16, 23, 24] and (2) segmental recurrent neural is O(T log(SV ) + jLj log(S)).

Arxiv:2007.00183V2 [Eess.AS] 24 Nov 2020

Malware Classification with BERT

Aviation Speech Recognition System Using Artificial Intelligence

Arxiv:2002.06235V1

Combining Word Embeddings with Semantic Resources

A Comparison of Online Automatic Speech Recognition Systems and the Nonverbal Responses to Unintelligible Speech

Word-Like Character N-Gram Embedding

Learned in Speech Recognition: Contextual Acoustic Word Embeddings

Knowledge-Powered Deep Learning for Word Embedding

Lecture 26 Word Embeddings and Recurrent Nets

Local Homology of Word Embeddings

Word Embedding, Sense Embedding and Their Application to Word Sense Induction

Synthesis and Recognition of Speech Creating and Listening to Speech