Gated Embeddings in End-to-End Automatic Speech Recognition for Conversational-Context Fusion

Suyoun Kim1, Siddharth Dalmia2 and Florian Metze2
1 Electrical & Computer Engineering
2 Language Technologies Institute, School of Computer Science
Carnegie Mellon University
{suyoung1, sdalmia, fmetze}@andrew.cmu.edu

Abstract

We present a novel conversational-context aware end-to-end speech recognizer based on a gated neural network that incorporates conversational-context/word/speech embeddings. Unlike conventional speech recognition models, our model learns longer conversational-context information that spans across sentences and is consequently better at recognizing long conversations. Specifically, we propose to use text-based external word and/or sentence embeddings (i.e., fastText, BERT) within an end-to-end framework, yielding a significant improvement in word error rate with a better conversational-context representation. We evaluated the models on the Switchboard conversational speech corpus and show that our model outperforms standard end-to-end speech recognition models.

1 Introduction

In a long conversation, there exists a tendency for semantically related words or phrases to reoccur across sentences, or there exists topical coherence. Existing speech recognition systems are built at an individual, isolated utterance level in order to make building systems computationally feasible. However, this may lose important conversational-context information. There have been many studies that have attempted to inject longer context information (Mikolov et al., 2010; Mikolov and Zweig, 2012; Wang and Cho, 2016; Ji et al., 2016; Liu and Lane, 2017; Xiong et al., 2018); however, all of these models were developed on text data for the language modeling task.

There has been recent work that attempted to use conversational-context information within an end-to-end speech recognition framework (Kim and Metze, 2018; Kim et al., 2018; Kim and Metze, 2019). The new end-to-end speech recognition approach (Graves et al., 2006; Graves and Jaitly, 2014; Hannun et al., 2014; Miao et al., 2015; Bahdanau et al., 2015; Chorowski et al., 2015; Chan et al., 2016; Kim et al., 2017) integrates all available information within a single neural network model, which makes fusing conversational-context information possible. However, these methods are limited to encoding only one preceding utterance and learning from a few hundred hours of annotated speech, leading to minimal improvements.

Meanwhile, neural language models, such as fastText (Bojanowski et al., 2017; Joulin et al., 2017, 2016), ELMo (Peters et al., 2018), OpenAI GPT (Radford et al., 2019), and Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019), which encode words and sentences in fixed-length dense vectors (embeddings), have achieved impressive results on various natural language processing tasks. Such general word/sentence embeddings learned on large text corpora (i.e., Wikipedia) have been used extensively and plugged into a variety of downstream tasks, such as question answering and natural language inference (Devlin et al., 2019; Peters et al., 2018; Seo et al., 2017), drastically improving their performance in the form of transfer learning.

In this paper, we create a conversational-context aware end-to-end speech recognizer capable of incorporating conversational context to better process long conversations. Specifically, we propose to exploit external word and/or sentence embeddings trained on massive amounts of text resources (i.e., fastText, BERT) so that the model can learn better conversational-context representations. So far, the use of such pre-trained embeddings has found limited success in the speech recognition task. We also add a gating mechanism to the decoder network that can integrate all the available embeddings (word, speech, conversational-context) efficiently, with increased representational power using multiplicative interactions.

Additionally, we explore a way to train our speech recognition model even with text-only data, in the form of pre-training and joint-training approaches. We evaluate our model on the Switchboard conversational speech corpus (Godfrey and Holliman, 1993; Godfrey et al., 1992) and show that our model outperforms the sentence-level end-to-end speech recognition model. The main contributions of our work are as follows:

• We introduce a contextual gating mechanism to incorporate multiple types of embeddings: word, speech, and conversational-context embeddings.

• We exploit external word (fastText) and/or sentence (BERT) embeddings for learning a better conversational-context representation.

• We perform an extensive analysis of ways to represent the conversational context, in terms of the number of utterances of history and of a sampling strategy that chooses between the generated sentences and the true preceding utterances.

• We explore a way to train the model jointly on text-only data in addition to annotated speech data.

2 Related work

Several recent studies have considered incorporating context information within an end-to-end speech recognizer (Pundak et al., 2018; Alon et al., 2019). In contrast with our method, which uses conversational-context information in a long conversation, their methods use a list of phrases (i.e., play a song) from the reference transcription of specific tasks, such as contact names, song names, voice search, and dictation.

Several recent studies have considered embedding longer context information within an end-to-end framework (Kim and Metze, 2018; Kim et al., 2018; Kim and Metze, 2019). In contrast with our method, which can learn a better conversational-context representation with a gated network that incorporates external word/sentence embeddings from a history of multiple preceding sentences, their methods are limited to learning a conversational-context representation from one preceding sentence in the annotated speech training set.

Gating-based approaches have been used for fusing word embeddings with visual representations in genre classification or image search tasks (Arevalo et al., 2017; Kiros et al., 2018) and for learning different languages in a speech recognition task (Kim and Seltzer, 2018).

3 End-to-End Speech Recognition Models

3.1 Joint CTC/Attention-based encoder-decoder network

We perform end-to-end speech recognition using a joint CTC/Attention-based approach with graphemes as the output symbols (Kim et al., 2017; Watanabe et al., 2017). The key advantage of the joint CTC/Attention framework is that it can address the weaknesses of the two main end-to-end models, Connectionist Temporal Classification (CTC) (Graves et al., 2006) and the attention-based encoder-decoder (Attention) (Bahdanau et al., 2016), by combining the strengths of the two. With CTC, the neural network is trained according to a maximum-likelihood training criterion computed over all possible segmentations of the utterance's sequence of feature vectors into its sequence of labels, while preserving the left-right order between input and output. With attention-based encoder-decoder models, the decoder network can learn the language model jointly, without relying on the conditional independence assumption.

Given a sequence of acoustic feature vectors, x, and the corresponding graphemic label sequence, y, the joint CTC/Attention objective is represented as follows, combining the two objectives with a tunable parameter λ, 0 ≤ λ ≤ 1:

    L = λ L_CTC + (1 − λ) L_att    (1)
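For illustration, here is a minimal PyTorch sketch of the interpolated objective in Eq. (1). It is not the authors' ESPnet implementation; the tensor shapes and the shared index 0 for blank/padding are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def joint_ctc_attention_loss(log_probs_ctc, att_logits, targets,
                             input_lengths, target_lengths, lam=0.2):
    """Interpolate the CTC and attention losses as in Eq. (1).

    log_probs_ctc: (T, B, V) log-probabilities from the encoder for CTC.
    att_logits:    (B, U, V) decoder logits for the attention branch.
    targets:       (B, U) padded label indices (0 used for blank/pad here).
    """
    ctc = F.ctc_loss(log_probs_ctc, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    # Negative log-likelihood of the ground-truth labels, cf. Eq. (3).
    att = F.cross_entropy(att_logits.transpose(1, 2), targets, ignore_index=0)
    return lam * ctc + (1.0 - lam) * att
```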

Each loss to be minimized is defined as the negative log likelihood of the ground-truth character sequence y*, and is computed as:

    L_CTC ≜ − ln Σ_{π ∈ Φ(y)} p(π | x)    (2)

    L_att ≜ − Σ_u ln p(y*_u | x, y*_{1:u−1})    (3)

where π is a label sequence allowing the presence of the blank symbol, Φ is the set of all possible π given the u-length y, and y*_{1:u−1} denotes all the previous labels.

Both the CTC and the attention-based encoder-decoder networks are also used in the inference step. The final hypothesis is the sequence that maximizes a weighted conditional probability of the CTC and attention-based encoder-decoder networks (Hori et al., 2017):

    y* = argmax { γ log p_CTC(y | x) + (1 − γ) log p_att(y | x) }    (4)

3.2 Acoustic-to-Words Models

In this work, we use word units as our model outputs instead of sub-word units. Direct acoustics-to-word (A2W) models train a single neural network to directly recognize words from speech without any sub-word units, pronunciation model, decision tree, or decoder, which significantly simplifies the training and decoding process (Soltau et al., 2017; Audhkhasi et al., 2017, 2018; Li et al., 2018; Palaskar and Metze, 2018). In addition, building A2W models allows learning more semantically meaningful conversational-context representations, and it allows exploiting external resources like word/sentence embeddings, whose unit of representation is generally words. However, A2W models require more training data compared to conventional sub-word models because they need sufficient acoustic training examples per word to train well, and they need to handle out-of-vocabulary (OOV) words. As a way to manage this OOV issue, we first restrict the vocabulary to the 10k most frequently occurring words. We then additionally use single-character units and start-of-OOV (sunk) and end-of-OOV (eunk) tokens so that the model can generate a character sequence by decomposing an OOV word. For example, the OOV word rainstorm is decomposed into (sunk) r a i n s t o r m (eunk), and the model tries to learn such a character sequence rather than generate the OOV token. With this method, we obtained 1.2% - 3.7% relative word error rate (WER) improvements on the evaluation set, which contains 2.9% OOVs.

4 Conversational-context Aware Models

In this section, we describe the A2W model with conversational-context fusion. In order to fuse conversational-context information within the A2W end-to-end speech recognition framework, we extend the decoder sub-network to predict the output additionally conditioned on conversational context, by learning a conversational-context embedding. We encode single or multiple preceding utterance histories into a fixed-length, single vector, then inject it into the decoder network as an additional input at every output step.

Say we have K utterances in a conversation. For the k-th utterance, we have acoustic features x^k = (x_1, ..., x_T) and an output word sequence w^k = (w_1, ..., w_U). At output timestamp u, our decoder generates the probability distribution over words (w^k_u), conditioned on 1) speech embeddings, the attended high-level representation (e^k_speech) generated from the encoder, 2) word embeddings from all the previously seen words (e^k_word), and 3) conversational-context embeddings (e^k_context), which represent the conversational-context information for the current (k-th) utterance prediction:

    e^k_speech = Encoder(x^k)    (5)

    w^k_u ∼ Decoder(e^k_context, e^k_word, e^k_speech)    (6)

We can simply represent such a contextual embedding, e^k_context, by the mean of one-hot word vectors or word distributions, mean(e^{k−1}_{word_1} + ··· + e^{k−1}_{word_U}), from the preceding utterances.

In order to learn and use the conversational context during training and decoding, we serialize the utterances based on their onset times and their conversations, rather than randomly shuffling the data. We shuffle data at the conversation level and create mini-batches that contain only one sentence of each conversation. We fill in "dummy" input/output examples at positions where a conversation ended earlier than the others within the mini-batch, so as not to influence other conversations while passing context to the next batch.
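The following is a small sketch of such conversation-serialized batching. The data structures and the DUMMY placeholder are hypothetical; the actual ESPnet-based implementation may differ.

```python
from itertools import zip_longest

# Placeholder example whose loss is masked out during training.
DUMMY = {"feats": None, "words": None}

def conversation_batches(conversations, batch_size):
    """Yield mini-batches holding the i-th utterance of `batch_size`
    conversations, preserving utterance order within each conversation.

    `conversations` is a list of conversations, each a list of utterance
    dicts sorted by onset time.
    """
    for start in range(0, len(conversations), batch_size):
        group = conversations[start:start + batch_size]
        # zip_longest pads exhausted conversations with DUMMY so the other
        # conversations keep receiving their own running context.
        for batch in zip_longest(*group, fillvalue=DUMMY):
            yield list(batch)

# usage: every batch contains one utterance per conversation, in order
# for batch in conversation_batches(convs, batch_size=16): ...
```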

Figure 1: Conversational-context embedding representations from external word or sentence embeddings.

4.1 External word/sentence embeddings

Learning a better representation of the conversational context is the key to achieving better processing of long conversations. To do so, we propose to encode general word/sentence embeddings pre-trained on large textual corpora within our end-to-end speech recognition framework. Another advantage of using pre-trained embedding models is that we do not need to back-propagate the gradients across contexts, making it easier and faster to update the parameters for learning a conversational-context representation.

There exist many word/sentence embeddings that are publicly available. We can broadly classify them into two categories: (1) non-contextual word embeddings and (2) contextual word embeddings. Non-contextual word embeddings, such as word2vec (Mikolov and Zweig, 2012), GloVe (Pennington et al., 2014), and fastText (Bojanowski et al., 2017), map each word independently of the context of the sentence in which the word occurs. Although they are easy to use, they assume that each word represents a single meaning, which is not true in the real world. Contextualized word embeddings and sentence embeddings, such as deep contextualized word representations (Peters et al., 2018) and BERT (Devlin et al., 2019), encode the complex characteristics and meanings of words in various contexts by jointly training a bidirectional language model. The BERT model proposed a masked language model training approach, enabling it to also learn a good "sentence" representation in order to predict the masked words.

In this work, we explore both types of embeddings to learn conversational-context embeddings, as illustrated in Figure 1. The first method is to use word embeddings, fastText, to generate 300-dimensional embeddings from the 10k-dimensional one-hot vector or distribution over words for each previous word and then merge them into a single context vector, e^k_context. Since we also consider a history of multiple words/utterances, we consider two simple ways to merge multiple embeddings: (1) mean and (2) concatenation. The second method is to use sentence embeddings, BERT, to generate a single 768-dimensional sentence embedding from the 10k-dimensional one-hot vector or distribution over previous words, and then merge it into a single context vector with the same two merging methods. Since our A2W model uses a restricted vocabulary of 10k as our output units, which differs from the external embedding models, we need to handle out-of-vocabulary words. For fastText, words that are missing from the pre-trained embeddings are mapped to samples from a random multivariate normal distribution, with the mean and variance set to the sample mean and variance of the known words. For BERT, we use its provided tokenizer, which generates byte-pair encodings, to handle OOV words.

Using this approach, we can obtain a denser, more informative, fixed-length vector to encode the conversational-context information, e^k_context, to be used in the next (k-th) utterance prediction.
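To make the construction in Figure 1 concrete, here is a minimal sketch (not the authors' code; the dictionary-based fastText lookup, helper names, and fallback details are illustrative assumptions) of building a fixed-length context vector from the preceding utterances, including the random-normal fallback for words missing from the pre-trained fastText vocabulary and the two merging strategies. For the BERT variant, one would instead feed the previous sentence(s) through the pre-trained sentence encoder, using its own byte-pair tokenizer for OOV words, and take the resulting 768-dimensional vector.

```python
import numpy as np

def build_context_embedding(prev_utterances, fasttext_vecs, dim=300, merge="mean"):
    """Encode the n preceding utterances into a single context vector.

    `fasttext_vecs` maps word -> 300-d vector; words missing from the
    pre-trained vocabulary are drawn from N(mu, sigma^2) estimated on the
    known words, as described above.
    """
    known = np.stack(list(fasttext_vecs.values()))
    mu, sigma = known.mean(axis=0), known.std(axis=0)

    def embed_utterance(words):
        vecs = [fasttext_vecs.get(w, np.random.normal(mu, sigma)) for w in words]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    utt_vecs = [embed_utterance(u) for u in prev_utterances]
    if merge == "mean":                   # (1) average the utterance vectors
        return np.mean(utt_vecs, axis=0)
    return np.concatenate(utt_vecs)       # (2) "concat": one long n * dim vector
```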

4.2 Contextual gating

We use a contextual gating mechanism in our decoder network to combine the conversational-context embeddings with the speech and word embeddings effectively. Our gating is contextual in the sense that multiple embeddings compute a gate value that is dependent on the context of the multiple utterances that occur in a conversation. Using these contextual gates can be beneficial for deciding how to weigh the different embeddings: conversational-context, word, and speech embeddings. Rather than merely concatenating conversational-context embeddings (Kim and Metze, 2018), contextual gating can achieve more improvement because of its increased representational power using multiplicative interactions.

Figure 2 illustrates our proposed contextual gating mechanism. Let e_w = e_w(y_{u−1}) be our previous word embedding for a word y_{u−1}, let e_s = e_s(x^k_{1:T}) be a speech embedding for the acoustic features of the current k-th utterance x^k_{1:T}, and let e_c = e_c(s_{k−1−n:k−1}) be our conversational-context embedding for the n preceding utterances s_{k−1−n:k−1}. Then, using a gating mechanism:

    g = σ(e_c, e_w, e_s)    (7)

where σ is a one-hidden-layer DNN with a sigmoid activation, the gated embedding e is calculated as

    e = g ⊙ (e_c, e_w, e_s)    (8)

    h = LSTM(e)    (9)

and fed into the LSTM decoder hidden layer. The output of the decoder, h, is then combined with the conversational-context embedding e_c again with a gating mechanism:

    g = σ(e_c, h)    (10)

    ĥ = g ⊙ (e_c, h)    (11)

Then the next hidden layer takes these gated activations, ĥ, and so on.

Figure 2: Our contextual gating mechanism in the decoder network to integrate three different embeddings from: 1) the conversational context, 2) the previous word, and 3) the current speech.
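The exact fusion operator in Eqs. (7)-(8) is not fully recoverable from the text above, so the following PyTorch sketch shows one plausible reading (an assumption, not the released model): a one-hidden-layer DNN with a sigmoid output computes a gate from the concatenated embeddings, which then scales them elementwise before they enter the decoder LSTM. The second gate of Eqs. (10)-(11) would be an analogous module over (e_c, h).

```python
import torch
import torch.nn as nn

class ContextualGate(nn.Module):
    """One reading of Eqs. (7)-(8): a 1-hidden-layer DNN with sigmoid output
    gates the concatenated embeddings elementwise (a multiplicative
    interaction) before the result is fed to the LSTM decoder (Eq. 9)."""

    def __init__(self, context_dim, word_dim, speech_dim, hidden=300):
        super().__init__()
        total = context_dim + word_dim + speech_dim
        self.gate = nn.Sequential(
            nn.Linear(total, hidden), nn.Tanh(),      # one hidden layer
            nn.Linear(hidden, total), nn.Sigmoid())   # sigmoid gate, Eq. (7)

    def forward(self, e_context, e_word, e_speech):
        cat = torch.cat([e_context, e_word, e_speech], dim=-1)
        g = self.gate(cat)
        return g * cat   # gated embedding e of Eq. (8)
```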

Dataset        # of utterances   # of conversations   avg. # of utterances / conversation
training       192,656           2,402                80
validation     4,000             34                   118
eval (SWBD)    1,831             20                   92
eval (CH)      2,627             20                   131

Table 1: Experimental dataset description. We used 300 hours of the Switchboard conversational corpus. Note that no pronunciation lexicon or Fisher transcription was used.

5 Experiments

5.1 Datasets

To evaluate our proposed conversational end-to-end speech recognition model, we use the Switchboard (SWBD) LDC corpus (97S62) task. We split the 300 hours of the SWBD training set into two parts: 285 hours of data for model training and 5 hours of data for hyper-parameter tuning. We evaluate the model performance on HUB5 Eval2000, which consists of the Callhome English (CH) and Switchboard (SWBD) subsets (LDC2002S09, LDC2002T43). In Table 1, we show the number of conversations and the average number of utterances per conversation.

The audio data is sampled at 16 kHz, and each frame is converted to an 83-dimensional feature vector consisting of 80-dimensional log-mel filterbank coefficients and 3-dimensional pitch features, as suggested in (Miao et al., 2016). The number of our word-level output tokens is 10,038, which includes 47 single-character units as described in Section 3.2. Note that no pronunciation lexicon was used in any of the experiments.

5.2 Training and decoding

For the architecture of the end-to-end speech recognition model, we used joint CTC/Attention end-to-end speech recognition (Kim et al., 2017; Watanabe et al., 2017). As suggested in (Zhang et al., 2017; Hori et al., 2017), the input feature images are reduced to (1/4 × 1/4) along the time-frequency axes by two max-pooling layers in a CNN. The CNN layers are followed by a 6-layer BLSTM with 320 cells. For the attention mechanism, we used a location-based method (Chorowski et al., 2015). For the decoder network, we used a 2-layer LSTM with 300 cells. In addition to the standard decoder network, our proposed models additionally require extra parameters for the gating layers in order to fuse the conversational-context embedding into the decoder network, compared to the baseline.

Model                                   Output Units             Trainable Params   External LM   SWBD (WER%)   CH (WER%)
Prior Models
LF-MMI (Povey et al., 2016)             CD phones                N/A                              9.6           19.3
CTC (Zweig et al., 2017)                Char                     53M                              19.8          32.1
CTC (Sanabria and Metze, 2018)          Char, BPE-{300,1k,10k}   26M                              12.5          23.7
CTC (Audhkhasi et al., 2018)            Word (Phone init.)       N/A                              14.6          23.6
Seq2Seq (Zeyer et al., 2018)            BPE-10k                  150M*                            13.5          27.1
Seq2Seq (Palaskar and Metze, 2018)      Word-10k                 N/A                              23.0          37.2
Seq2Seq (Zeyer et al., 2018)            BPE-1k                   150M*                            11.8          25.7
Our baseline                            Word-10k                 32M                              18.2          30.7
Our Proposed Conversational Model
Gated Contextual Decoder                Word-10k                 35M                              17.3          30.5
+ Decoder Pretrain                      Word-10k                 35M                              16.4          29.5
+ fastText for Word Emb.                Word-10k                 35M                              16.0          29.5
(a) fastText for Conversational Emb.    Word-10k                 34M                              16.0          29.5
(b) BERT for Conversational Emb.        Word-10k                 34M                              15.7          29.2
(b) + Turn number 5                     Word-10k                 34M                              15.5          29.0

Table 2: Comparison of word error rates (WER) on Switchboard 300h between standard end-to-end speech recognition models and our proposed end-to-end speech recognition models with conversational context. (The * mark denotes our estimate of the number of parameters used in the previous work.)

We list the total number of trainable parameters in Table 2.

For the optimization method, we use AdaDelta (Zeiler, 2012) with gradient clipping (Pascanu et al., 2013). We used λ = 0.2 for joint CTC/Attention training (in Eq. 1) and γ = 0.3 for joint CTC/Attention decoding (in Eq. 4). We bootstrap the training of our proposed conversational end-to-end models from the baseline end-to-end models. To decide the best models for testing, we monitor the development accuracy, where we always use the model prediction in order to simulate the testing scenario. At inference, we used a left-right beam search method (Sutskever et al., 2014) with beam size 10 to reduce the computational cost. We adjusted the final score, s(y|x), with a length penalty of 0.5. The models are implemented using the PyTorch deep learning library (Paszke et al., 2017) and the ESPnet toolkit (Kim et al., 2017; Watanabe et al., 2017, 2018).

6 Results

Our results are summarized in Table 2, where we first present the baseline results and then show the improvements from adding each of the individual components discussed in the previous sections, namely gated decoding, pre-training the decoder network, external word embeddings, external conversational embeddings, and increasing the receptive field of the conversational context. Our best model obtains around 15% relative improvement on the SWBD subset and 5% relative improvement on the CallHome subset of the eval2000 dataset.

We start by evaluating our proposed model, which leverages conversational-context embeddings learned from the training corpus, and compare it with a standard end-to-end speech recognition model without conversational-context embeddings. As seen in Table 2, we obtain a performance gain over the baseline by using conversational-context embeddings learned from the training set.

6.1 Pre-training the decoder network

We then observe that pre-training the decoder network can further improve accuracy, as shown in Table 2. Using pre-training of the decoder network, we achieved a 5% relative improvement in WER on the SWBD set. Since we add external parameters to the decoder network to learn conversational-context embeddings, our model requires more effort to learn these additional parameters. To relieve this issue, we used pre-training techniques to first train the decoder network with text-only data. We simply used a mask on top of the Encoder/Attention layer so that we can control the gradients for batches containing text-only data and do not update the Encoder/Attention sub-network parameters.
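A rough sketch of how such gradient masking for text-only batches might look in PyTorch follows. This is our illustrative reading of the trick, with hypothetical encoder/decoder interfaces, not the authors' ESPnet implementation.

```python
import torch

def training_step(batch, encoder, decoder, optimizer):
    """Illustrative text-only pre-training step: batches without audio get a
    zeroed (masked) speech embedding, so the loss depends only on the decoder
    and no gradient flows into the encoder/attention sub-network."""
    optimizer.zero_grad()
    if batch["feats"] is not None:
        e_speech = encoder(batch["feats"])                  # normal joint training
    else:
        # Masked speech input for a text-only batch (hypothetical attribute).
        e_speech = torch.zeros(batch["words"].size(0), decoder.speech_dim)
    loss = decoder.loss(batch["words"], e_speech, batch["context"])
    loss.backward()   # encoder parameters receive no gradient for text-only batches
    optimizer.step()
```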

6.2 Use of word/sentence embeddings

Next, we evaluated the use of pre-trained external embeddings (fastText and BERT). We initially observed that we can obtain a 2.4% relative improvement in WER over the model with decoder pre-training by using fastText for the additional word embeddings to the gated decoder network.

We also extensively evaluated various ways to use fastText/BERT for conversational-context embeddings. Both methods, with fastText and with BERT, show significant improvements over the baseline as well as over the vanilla conversational-context aware model.

Figure 3: The relative improvement in development accuracy over the baseline obtained by using conversational-context embeddings with different numbers of utterances of history and different merging techniques (mean vs. concatenation).

6.3 Conversational-context Receptive Field

We also investigate the effect of the number of utterance histories being encoded. We tried different numbers, N = [1, 5, 9], of utterance histories to learn the conversational-context embeddings. Figure 3 shows the relative improvements in accuracy on the Dev set (Section 5.2) over the baseline "non-conversational" model. We show the improvements for the two different methods of merging the contextual embeddings, namely mean and concatenation. Typically, increasing the receptive field of the conversational context helps improve the model. However, as the number of utterances of history increased, the number of trainable parameters of the concatenation model increased, making it harder for the model to train. This led to a reduction in accuracy.

We also found that using a 5-utterance history with concatenation performed best (15%) on the SWBD set, and using a 9-utterance history with the mean method performed best (5%) on the CH set. We also observed that the improvement diminished when we used a 9-utterance history for the SWBD set, unlike the CH set. One possible explanation is that the conversational context may not be relevant to the current utterance prediction, or the model is overfitting.

6.4 Sampling technique

We also experiment with an utterance-level sampling strategy with various sampling ratios, [0.0, 0.2, 0.5, 1.0]. Sampling techniques have been used extensively in sequence prediction tasks to reduce overfitting (Bengio et al., 2015), by training the model conditioned on tokens generated by the model itself, which is what the model actually does at inference, rather than on the ground-truth tokens. Similar to choosing previous word tokens from the ground truth or from the model output, we apply this idea to choose the previous utterance from the ground truth or from the model output for learning conversational-context embeddings.
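As a minimal illustration (a hypothetical helper, not the paper's code), the utterance-level sampling decision can be written as:

```python
import random

def choose_context_source(prev_reference, prev_hypothesis, sampling_rate=0.2):
    """Utterance-level scheduled sampling: with probability `sampling_rate`,
    build the conversational-context embedding from the model's own decoded
    hypothesis of the previous utterance (what it will see at test time);
    otherwise use the ground-truth transcript."""
    if random.random() < sampling_rate:
        return prev_hypothesis
    return prev_reference
```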

Figure 4: The relative improvement in development accuracy over the 100% sampling rate, which was used in (Kim and Metze, 2018), obtained by using conversational-context embeddings with different sampling rates.

Figure 4 shows the relative improvement in the development accuracy (Section 5.2) over the 1.0 sampling rate, which always chooses the model's output. We found that a sampling rate of 20% performed best.

6.5 Analysis of context embeddings

We develop a scoring function, s(i, j), to check whether our model preserves conversational consistency, in order to validate the accuracy improvement of our approach. The scoring function measures the average of the conversational distances over every pair of consecutive hypotheses generated by a particular model. The conversational distance is calculated as the Euclidean distance, dist(e_i, e_j), between the fixed-length vectors e_i, e_j that represent the model's i-th and j-th hypotheses, respectively. To obtain a fixed-length vector (utterance embedding) for a given model hypothesis, we use a BERT sentence embedding as an oracle. Mathematically, it can be written as

    s(i, j) = (1/N) Σ_{i,j ∈ eval} dist(e_i, e_j)

where i, j is a pair of consecutive hypotheses in the evaluation data eval, N is the total number of (i, j) pairs, and e_i, e_j are the BERT embeddings. In our experiment, we select the pairs of consecutive utterances from the reference that show a lower distance score than the baseline hypotheses.

From this process, we obtained three conversational distance scores: from 1) the reference transcripts, 2) the hypotheses of our vanilla conversational model, which does not use BERT, and 3) the hypotheses of our baseline model. Figure 5 shows the score comparison.

Figure 5: Comparison of the conversational distance score on consecutive utterances of 1) the reference, 2) our proposed conversational end-to-end model, and 3) our end-to-end baseline model.

We found that our proposed model was 7.4% relatively closer to the reference than the baseline. This indicates that our conversational-context embedding leads to improved similarity across adjacent utterances, resulting in better processing of a long conversation.

7 Conclusion

We have introduced a novel method for conversational-context aware end-to-end speech recognition based on a gated network that incorporates word/sentence/speech embeddings. Unlike prior work, our model is trained on conversational datasets to predict a word conditioned on multiple preceding conversational-context representations, and consequently it improves the recognition accuracy of a long conversation. Moreover, our gated network can effectively incorporate text-based external resources, word or sentence embeddings (i.e., fastText, BERT), within an end-to-end framework, so that the whole system can be optimized towards our final objective, speech recognition accuracy. By incorporating external embeddings with a gating mechanism, our model achieves further improvement with a better conversational-context representation. We evaluated the models on the Switchboard conversational speech corpus and show that our proposed model using gated conversational-context embeddings obtains 15% and 5% relative improvements in WER over a baseline model on the Switchboard and CallHome subsets, respectively. Our model was shown to outperform standard end-to-end speech recognition models trained on isolated sentences. This work is easy to scale and can potentially be applied to any speech-related task that can benefit from longer context information, such as spoken dialog systems and sentiment analysis.

Acknowledgments

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. This work also used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC).

References
Uri Alon, Golan Pundak, and Tara N. Sainath. 2019. Contextual speech recognition with difficult negative training examples. In ICASSP 2019, pages 6440-6444. IEEE.
John Arevalo, Thamar Solorio, Manuel Montes-y-Gómez, and Fabio A. González. 2017. Gated multimodal units for information fusion. arXiv preprint arXiv:1702.01992.
Kartik Audhkhasi, Brian Kingsbury, Bhuvana Ramabhadran, George Saon, and Michael Picheny. 2018. Building competitive direct acoustics-to-word models for English conversational speech recognition. In ICASSP 2018, pages 4759-4763. IEEE.
Kartik Audhkhasi, Bhuvana Ramabhadran, George Saon, Michael Picheny, and David Nahamoo. 2017. Direct acoustics-to-word models for English conversational speech recognition. CoRR, abs/1703.07754.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. ICLR.
Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. 2016. End-to-end attention-based large vocabulary speech recognition. In ICASSP 2016, pages 4945-4949. IEEE.
Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171-1179.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135-146.
William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In ICASSP 2016, pages 4960-4964. IEEE.
Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pages 577-585.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL.
John Godfrey and Edward Holliman. 1993. Switchboard-1 release 2 LDC97S62. Linguistic Data Consortium, Philadelphia.
John J. Godfrey, Edward C. Holliman, and Jane McDaniel. 1992. Switchboard: Telephone speech corpus for research and development. In ICASSP-92, volume 1, pages 517-520. IEEE.
Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369-376. ACM.
Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1764-1772.
Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. 2014. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.
Takaaki Hori, Shinji Watanabe, Yu Zhang, and William Chan. 2017. Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM. Interspeech.
Yangfeng Ji, Trevor Cohn, Lingpeng Kong, Chris Dyer, and Jacob Eisenstein. 2016. Document context language models. ICLR (Workshop track).
Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427-431. Association for Computational Linguistics.
Suyoun Kim, Siddharth Dalmia, and Florian Metze. 2018. Situation informed end-to-end ASR for CHiME-5 challenge. CHiME5 workshop.
Suyoun Kim, Takaaki Hori, and Shinji Watanabe. 2017. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In ICASSP 2017, pages 4835-4839. IEEE.
Suyoun Kim and Florian Metze. 2018. Dialog-context aware end-to-end speech recognition. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 434-440. IEEE.
Suyoun Kim and Florian Metze. 2019. Acoustic-to-word models with conversational context information. NAACL.
Suyoun Kim and Michael L. Seltzer. 2018. Towards language-universal end-to-end speech recognition. In ICASSP 2018, pages 4914-4918. IEEE.
Jamie Kiros, William Chan, and Geoffrey Hinton. 2018. Illustrative language understanding: Large-scale visual grounding with image search. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 922-933.
Jinyu Li, Guoli Ye, Amit Das, Rui Zhao, and Yifan Gong. 2018. Advancing acoustic-to-word CTC model. In ICASSP 2018, pages 5794-5798. IEEE.
Bing Liu and Ian Lane. 2017. Dialog context language modeling with recurrent neural networks. In ICASSP 2017, pages 5715-5719. IEEE.
Yajie Miao, Mohammad Gowayyed, and Florian Metze. 2015. EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 167-174. IEEE.
Yajie Miao, Mohammad Gowayyed, Xingyu Na, Tom Ko, Florian Metze, and Alexander Waibel. 2016. An empirical exploration of CTC acoustic models. In ICASSP 2016, pages 2623-2627. IEEE.
Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.
Tomas Mikolov and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. SLT, 12:234-239.
Shruti Palaskar and Florian Metze. 2018. Acoustic-to-word recognition with sequence-to-sequence models. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 397-404. IEEE.
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310-1318.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations.
Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur. 2016. Purely sequence-trained neural networks for ASR based on lattice-free MMI. In Interspeech, pages 2751-2755.
Golan Pundak, Tara N. Sainath, Rohit Prabhavalkar, Anjuli Kannan, and Ding Zhao. 2018. Deep context: End-to-end contextual speech recognition. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 418-425. IEEE.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
Ramon Sanabria and Florian Metze. 2018. Hierarchical multitask learning with CTC. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 485-490. IEEE.
Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603.
Hagen Soltau, Hank Liao, and Hasim Sak. 2017. Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition. Interspeech.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112.
Tian Wang and Kyunghyun Cho. 2016. Larger-context language modelling. ACL.
Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai. 2018. ESPnet: End-to-end speech processing toolkit. In Interspeech, pages 2207-2211.
Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, and Tomoki Hayashi. 2017. Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8):1240-1253.
Wayne Xiong, Lingfeng Wu, Jun Zhang, and Andreas Stolcke. 2018. Session-level language modeling for conversational speech. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2764-2768.
Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.
Albert Zeyer, Kazuki Irie, Ralf Schlüter, and Hermann Ney. 2018. Improved training of end-to-end attention models for speech recognition. Interspeech.
Yu Zhang, William Chan, and Navdeep Jaitly. 2017. Very deep convolutional networks for end-to-end speech recognition. In ICASSP 2017, pages 4845-4849. IEEE.
Geoffrey Zweig, Chengzhu Yu, Jasha Droppo, and Andreas Stolcke. 2017. Advances in all-neural speech recognition. In ICASSP 2017, pages 4805-4809. IEEE.