
Gated Embeddings in End-to-End Speech Recognition for Conversational-Context Fusion

Suyoun Kim1, Siddharth Dalmia2 and Florian Metze2
1Electrical & Computer Engineering
2Language Technologies Institute, School of Computer Science
Carnegie Mellon University
{suyoung1, sdalmia, [email protected]

Abstract

We present a novel conversational-context aware end-to-end speech recognizer based on a gated neural network that incorporates conversational-context/word/speech embeddings. Unlike conventional speech recognition models, our model learns longer conversational-context information that spans across sentences and is consequently better at recognizing long conversations. Specifically, we propose to use text-based external word and/or sentence embeddings (i.e., fastText, BERT) within an end-to-end framework, yielding a significant improvement in word error rate with a better conversational-context representation. We evaluate the models on the Switchboard conversational speech corpus and show that our model outperforms standard end-to-end speech recognition models.

1 Introduction

In a long conversation, semantically related words or phrases tend to reoccur across sentences, and there is topical coherence. Existing speech recognition systems are built at the individual, isolated utterance level in order to make building them computationally feasible. However, this may lose important conversational-context information. There have been many studies that attempt to inject longer context information (Mikolov et al., 2010; Mikolov and Zweig, 2012; Wang and Cho, 2016; Ji et al., 2016; Liu and Lane, 2017; Xiong et al., 2018), but all of these models were developed on text data for the language modeling task.

There has been recent work that attempts to use conversational-context information within an end-to-end speech recognition framework (Kim and Metze, 2018; Kim et al., 2018; Kim and Metze, 2019). The new end-to-end speech recognition approach (Graves et al., 2006; Graves and Jaitly, 2014; Hannun et al., 2014; Miao et al., 2015; Bahdanau et al., 2015; Chorowski et al., 2015; Chan et al., 2016; Kim et al., 2017) integrates all available information within a single neural network model, which makes fusing conversational-context information possible. However, these models are limited to encoding only one preceding utterance, and they learn from only a few hundred hours of annotated speech, leading to minimal improvements.

Meanwhile, neural language models such as fastText (Bojanowski et al., 2017; Joulin et al., 2017, 2016), ELMo (Peters et al., 2018), OpenAI GPT (Radford et al., 2019), and Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019), which encode words and sentences as fixed-length dense vectors (embeddings), have achieved impressive results on various natural language processing tasks. Such general word/sentence embeddings, learned on large text corpora (e.g., Wikipedia), have been used extensively and plugged into a variety of downstream tasks, such as question answering and natural language inference (Devlin et al., 2019; Peters et al., 2018; Seo et al., 2017), to drastically improve their performance in the form of transfer learning.

In this paper, we create a conversational-context aware end-to-end speech recognizer capable of incorporating conversational context to better process long conversations. Specifically, we propose to exploit external word and/or sentence embeddings trained on massive amounts of text (i.e., fastText, BERT) so that the model can learn better conversational-context representations.
So far, the use of such pre-trained embeddings has found limited success in the speech recognition task. We also add a gating mechanism to the decoder network that integrates all of the available embeddings (word, speech, conversational-context) efficiently, with increased representational power from multiplicative interactions. Additionally, we explore ways to train our speech recognition model even with text-only data, in the form of pre-training and joint-training approaches. We evaluate our model on the Switchboard conversational speech corpus (Godfrey and Holliman, 1993; Godfrey et al., 1992) and show that it outperforms a sentence-level end-to-end speech recognition model. The main contributions of our work are as follows:

• We introduce a contextual gating mechanism to incorporate multiple types of embeddings: word, speech, and conversational-context embeddings.

• We exploit external word (fastText) and/or sentence (BERT) embeddings to learn better conversational-context representations.

• We perform an extensive analysis of ways to represent the conversational context, in terms of the number of preceding utterances used and the sampling strategy of using either the generated (decoded) sentences or the true preceding utterances.

• We explore a way to train the model jointly on text-only data in addition to annotated speech data.

2 Related work

Several recent studies have considered incorporating context information within an end-to-end speech recognizer (Pundak et al., 2018; Alon et al., 2019). In contrast with our method, which uses conversational-context information from a long conversation, their methods use a list of phrases (e.g., "play a song") from reference transcriptions of specific tasks: contact names, song names, voice search, and dictation.

Several recent studies have considered exploiting longer context information that spans multiple sentences (Mikolov and Zweig, 2012; Wang and Cho, 2016; Ji et al., 2016; Liu and Lane, 2017; Xiong et al., 2018). In contrast with our method, which uses a single framework for the speech recognition task, their methods were developed on text data for language modeling, and must therefore be integrated with a conventional acoustic model that is built separately, without longer context information.

Several recent studies have considered embedding longer context information within an end-to-end framework (Kim and Metze, 2018; Kim et al., 2018; Kim and Metze, 2019). In contrast with our method, which can learn a better conversational-context representation with a gated network that incorporates external word/sentence embeddings from multiple preceding utterances, their methods are limited to learning a conversational-context representation from one preceding sentence in the annotated speech training set.

Gating-based approaches have been used for fusing word embeddings with visual representations in genre classification and image search tasks (Arevalo et al., 2017; Kiros et al., 2018), and for learning different languages in speech recognition (Kim and Seltzer, 2018).

3 End-to-End Speech Recognition Models

3.1 Joint CTC/Attention-based encoder-decoder network

We perform end-to-end speech recognition using a joint CTC/Attention-based approach with graphemes as the output symbols (Kim et al., 2017; Watanabe et al., 2017). The key advantage of the joint CTC/Attention framework is that it can address the weaknesses of the two main end-to-end models, Connectionist Temporal Classification (CTC) (Graves et al., 2006) and the attention-based encoder-decoder (Attention) (Bahdanau et al., 2016), by combining the strengths of the two. With CTC, the neural network is trained according to a maximum-likelihood criterion computed over all possible segmentations of the utterance's sequence of feature vectors into its sequence of labels, while preserving the left-to-right order between input and output. With attention-based encoder-decoder models, the decoder network can learn the language model jointly, without relying on the conditional independence assumption.

Given a sequence of acoustic feature vectors, x, and the corresponding graphemic label sequence, y, the joint CTC/Attention objective combines the two objectives with a tunable parameter λ, 0 ≤ λ ≤ 1:

$\mathcal{L} = \lambda \mathcal{L}_{\mathrm{CTC}} + (1 - \lambda) \mathcal{L}_{\mathrm{att}}$ (1)

Each loss to be minimized is defined as the negative log likelihood of the ground-truth character sequence y* and is computed as

$\mathcal{L}_{\mathrm{CTC}} \triangleq -\ln \sum_{\pi \in \Phi(y)} p(\pi \mid x)$ (2)

$\mathcal{L}_{\mathrm{att}} \triangleq -\sum_{u} \ln p(y^{*}_{u} \mid x, y^{*}_{1:u-1})$ (3)

where π is a label sequence allowing the presence of the blank symbol, Φ(y) is the set of all possible π given the u-length y, and y*_{1:u−1} denotes all the previous labels.

Both the CTC and the attention-based encoder-decoder networks are also used in the inference step. The final hypothesis is the sequence that maximizes a weighted conditional probability of the CTC and attention-based encoder-decoder networks (Hori et al., 2017):

$y^{*} = \arg\max \{\, \gamma \log p_{\mathrm{CTC}}(y \mid x) + (1 - \gamma) \log p_{\mathrm{att}}(y \mid x) \,\}$ (4)
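To make Eqs. (1)–(4) concrete, the following is a minimal sketch of the combined training loss and decoding score, assuming PyTorch. The function names, tensor shapes, blank/padding conventions, and default weights are illustrative assumptions, not the authors' implementation, and details such as target preparation for the two branches are omitted.

```python
import torch
import torch.nn.functional as F

def joint_ctc_attention_loss(ctc_log_probs, enc_lens, dec_logits,
                             targets, target_lens, lam=0.5, pad_id=-1):
    """Eq. (1): L = lam * L_CTC + (1 - lam) * L_att.

    ctc_log_probs: (T, B, V) log-softmax outputs of the CTC branch.
    dec_logits:    (B, U, V) decoder outputs under teacher forcing.
    targets:       (B, U) ground-truth label ids, padded with pad_id.
    """
    # Eq. (2): CTC marginalizes over all alignments pi in Phi(y);
    # F.ctc_loss computes this sum with the forward algorithm.
    ctc_targets = targets.clamp(min=0)  # padded positions lie beyond target_lens
    loss_ctc = F.ctc_loss(ctc_log_probs, ctc_targets, enc_lens, target_lens,
                          blank=0, zero_infinity=True)
    # Eq. (3): negative log likelihood of y* under the attention decoder,
    # i.e., cross-entropy against the ground-truth sequence.
    loss_att = F.cross_entropy(dec_logits.transpose(1, 2), targets,
                               ignore_index=pad_id)
    return lam * loss_ctc + (1.0 - lam) * loss_att

def joint_decode_score(log_p_ctc, log_p_att, gamma=0.3):
    # Eq. (4): hypotheses are ranked by a weighted sum of the two
    # log-probabilities during decoding.
    return gamma * log_p_ctc + (1.0 - gamma) * log_p_att
```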
3.2 Acoustic-to-Words Models

In this work, we use word units as our model outputs instead of sub-word units. […] obtained 1.2%–3.7% relative word error rate (WER) improvements on an evaluation set containing 2.9% OOVs.

4 Conversational-context Aware Models

In this section, we describe the A2W model with conversational-context fusion. In order to fuse conversational-context information within the A2W end-to-end speech recognition framework, we extend the decoder sub-network to predict the output while additionally conditioning on the conversational context, by learning a conversational-context embedding. We encode single or multiple preceding utterance histories into a fixed-length single vector, and then inject it into the decoder network as an additional input at every output step.
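The gate that performs this fusion is defined later in the paper; as a rough sketch of the idea under stated assumptions (concatenation followed by a sigmoid gate and a tanh projection, with illustrative dimensions), the injection at each output step might look as follows.

```python
import torch
import torch.nn as nn

class GatedEmbeddingFusion(nn.Module):
    """Fuses word, speech, and conversational-context embeddings with a
    learned sigmoid gate (a multiplicative interaction); the exact
    parameterization here is an assumption for illustration."""

    def __init__(self, d_word, d_speech, d_ctx, d_out):
        super().__init__()
        d_in = d_word + d_speech + d_ctx
        self.gate = nn.Linear(d_in, d_out)  # produces the gate g
        self.proj = nn.Linear(d_in, d_out)  # projects the concatenation

    def forward(self, e_word, e_speech, e_ctx):
        h = torch.cat([e_word, e_speech, e_ctx], dim=-1)
        g = torch.sigmoid(self.gate(h))       # per-dimension gate in (0, 1)
        return g * torch.tanh(self.proj(h))   # gate modulates the projection

# At every output step u of the decoder, the same fixed-length vector
# e_ctx (encoding one or more preceding utterances, e.g., with BERT or
# fastText) is injected alongside the word and speech embeddings.
fusion = GatedEmbeddingFusion(d_word=256, d_speech=256, d_ctx=768, d_out=256)
e_w = torch.randn(8, 256)  # embedding of the previously emitted word
e_s = torch.randn(8, 256)  # attended acoustic (speech) embedding
e_c = torch.randn(8, 768)  # conversational-context embedding
fused = fusion(e_w, e_s, e_c)  # fed into the decoder's output computation
```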
Say we have K utterances in a conversation. For the k-th utterance, we have acoustic features (x_1, …, x_T) and an output word sequence (w_1, …, w_U). At output timestep u, our decoder generates the probability distribution over words, w_u^k, conditioned on 1) speech embeddings, attended high-level …