Gated Embeddings in End-to-End Automatic Speech Recognition for Conversational-Context Fusion

Suyoun Kim1, Siddharth Dalmia2 and Florian Metze2
1 Electrical & Computer Engineering
2 Language Technologies Institute, School of Computer Science
Carnegie Mellon University
{suyoung1, sdalmia, fmetze}@andrew.cmu.edu

Abstract

We present a novel conversational-context aware end-to-end speech recognizer based on a gated neural network that incorporates conversational-context/word/speech embeddings. Unlike conventional speech recognition models, our model learns longer conversational-context information that spans across sentences and is consequently better at recognizing long conversations. Specifically, we propose to use text-based external word and/or sentence embeddings (i.e., fastText, BERT) within an end-to-end framework, yielding a significant improvement in word error rate with a better conversational-context representation. We evaluated the models on the Switchboard conversational speech corpus and show that our model outperforms standard end-to-end speech recognition models.

1 Introduction

In a long conversation, there exists a tendency for semantically related words or phrases to reoccur across sentences, or there exists topical coherence. Existing speech recognition systems are built at an individual, isolated utterance level in order to make building systems computationally feasible. However, this may lose important conversational-context information. There have been many studies that have attempted to inject longer context information (Mikolov et al., 2010; Mikolov and Zweig, 2012; Wang and Cho, 2016; Ji et al., 2016; Liu and Lane, 2017; Xiong et al., 2018); however, all of these models were developed on text data for the language modeling task.

There has been recent work that attempted to use conversational-context information within an end-to-end speech recognition framework (Kim and Metze, 2018; Kim et al., 2018; Kim and Metze, 2019). The new end-to-end speech recognition approach (Graves et al., 2006; Graves and Jaitly, 2014; Hannun et al., 2014; Miao et al., 2015; Bahdanau et al., 2015; Chorowski et al., 2015; Chan et al., 2016; Kim et al., 2017) integrates all available information within a single neural network model, which makes fusing conversational-context information possible. However, these methods are limited to encoding only one preceding utterance and learning from a few hundred hours of annotated speech, leading to minimal improvements.

Meanwhile, neural language models, such as fastText (Bojanowski et al., 2017; Joulin et al., 2017, 2016), ELMo (Peters et al., 2018), OpenAI GPT (Radford et al., 2019), and Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019), which encode words and sentences in fixed-length dense vectors (embeddings), have achieved impressive results on various natural language processing tasks. Such general word/sentence embeddings learned on large text corpora (i.e., Wikipedia) have been used extensively and plugged into a variety of downstream tasks, such as question answering and natural language inference (Devlin et al., 2019; Peters et al., 2018; Seo et al., 2017), drastically improving their performance in the form of transfer learning.

In this paper, we create a conversational-context aware end-to-end speech recognizer capable of incorporating conversational context to better process long conversations. Specifically, we propose to exploit external word and/or sentence embeddings trained on massive amounts of text resources (i.e., fastText, BERT) so that the model can learn better conversational-context representations. So far, the use of such pre-trained embeddings has found limited success in the speech recognition task. We also add a gating mechanism to the decoder network that can integrate all the available embeddings (word, speech, conversational-context) efficiently, with increased representational power using multiplicative interactions.

Additionally, we explore a way to train our speech recognition model even with text-only data, in the form of pre-training and joint-training approaches. We evaluate our model on the Switchboard conversational speech corpus (Godfrey and Holliman, 1993; Godfrey et al., 1992) and show that our model outperforms the sentence-level end-to-end speech recognition model. The main contributions of our work are as follows:

• We introduce a contextual gating mechanism to incorporate multiple types of embeddings: word, speech, and conversational-context embeddings.

• We exploit external word (fastText) and/or sentence (BERT) embeddings for learning a better conversational-context representation.

• We perform an extensive analysis of ways to represent the conversational context, in terms of the number of utterances of history and of a sampling strategy that chooses between the generated sentences and the true preceding utterances.

• We explore a way to train the model jointly on text-only data in addition to annotated speech data.

2 Related work

Several recent studies have considered incorporating context information within an end-to-end speech recognizer (Pundak et al., 2018; Alon et al., 2019). In contrast with our method, which uses conversational-context information in a long conversation, their methods use a list of phrases (i.e., play a song) from the reference transcription of specific tasks, such as contact names, song names, voice search, and dictation.

Several recent studies have considered embedding longer context information within an end-to-end framework (Kim and Metze, 2018; Kim et al., 2018; Kim and Metze, 2019). In contrast with our method, which can learn a better conversational-context representation with a gated network that incorporates external word/sentence embeddings from a history of multiple preceding sentences, their methods are limited to learning a conversational-context representation from one preceding sentence in the annotated speech training set.

Gating-based approaches have been used for fusing word embeddings with visual representations in genre classification or image search tasks (Arevalo et al., 2017; Kiros et al., 2018) and for learning different languages in a speech recognition task (Kim and Seltzer, 2018).

3 End-to-End Speech Recognition Models

3.1 Joint CTC/Attention-based encoder-decoder network

We perform end-to-end speech recognition using a joint CTC/Attention-based approach with graphemes as the output symbols (Kim et al., 2017; Watanabe et al., 2017). The key advantage of the joint CTC/Attention framework is that it can address the weaknesses of the two main end-to-end models, Connectionist Temporal Classification (CTC) (Graves et al., 2006) and the attention-based encoder-decoder (Attention) (Bahdanau et al., 2016), by combining the strengths of the two. With CTC, the neural network is trained according to a maximum-likelihood training criterion computed over all possible segmentations of the utterance's sequence of feature vectors into its sequence of labels, while preserving the left-right order between input and output. With attention-based encoder-decoder models, the decoder network can learn the language model jointly, without relying on the conditional independence assumption.

Given a sequence of acoustic feature vectors, x, and the corresponding graphemic label sequence, y, the joint CTC/Attention objective is represented as follows, combining the two objectives with a tunable parameter λ, 0 ≤ λ ≤ 1:

    L = λ L_CTC + (1 − λ) L_att    (1)
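For illustration, here is a minimal PyTorch sketch of the interpolated objective in Eq. (1). It is not the authors' ESPnet implementation; the tensor shapes and the shared index 0 for blank/padding are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def joint_ctc_attention_loss(log_probs_ctc, att_logits, targets,
                             input_lengths, target_lengths, lam=0.2):
    """Interpolate the CTC and attention losses as in Eq. (1).

    log_probs_ctc: (T, B, V) log-probabilities from the encoder for CTC.
    att_logits:    (B, U, V) decoder logits for the attention branch.
    targets:       (B, U) padded label indices (0 used for blank/pad here).
    """
    ctc = F.ctc_loss(log_probs_ctc, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    # Negative log-likelihood of the ground-truth labels, cf. Eq. (3).
    att = F.cross_entropy(att_logits.transpose(1, 2), targets, ignore_index=0)
    return lam * ctc + (1.0 - lam) * att
```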

Each loss to be minimized is defined as the negative log likelihood of the ground-truth character sequence y*, and is computed as:

    L_CTC ≜ − ln Σ_{π ∈ Φ(y)} p(π | x)    (2)

    L_att ≜ − Σ_u ln p(y*_u | x, y*_{1:u−1})    (3)

where π is a label sequence allowing the presence of the blank symbol, Φ is the set of all possible π given the u-length y, and y*_{1:u−1} denotes all the previous labels.

Both the CTC and the attention-based encoder-decoder networks are also used in the inference step. The final hypothesis is the sequence that maximizes a weighted conditional probability of the CTC and attention-based encoder-decoder networks (Hori et al., 2017):

    y* = argmax { γ log p_CTC(y | x) + (1 − γ) log p_att(y | x) }    (4)

3.2 Acoustic-to-Words Models

In this work, we use word units as our model outputs instead of sub-word units. Direct acoustics-to-word (A2W) models train a single neural network to directly recognize words from speech without any sub-word units, pronunciation model, decision tree, or decoder, which significantly simplifies the training and decoding process (Soltau et al., 2017; Audhkhasi et al., 2017, 2018; Li et al., 2018; Palaskar and Metze, 2018). In addition, building A2W models allows learning more semantically meaningful conversational-context representations, and it allows exploiting external resources like word/sentence embeddings, whose unit of representation is generally words. However, A2W models require more training data compared to conventional sub-word models because they need sufficient acoustic training examples per word to train well, and they need to handle out-of-vocabulary (OOV) words. As a way to manage this OOV issue, we first restrict the vocabulary to the 10k most frequently occurring words. We then additionally use single-character units and start-of-OOV (sunk) and end-of-OOV (eunk) tokens so that the model can generate a character sequence by decomposing an OOV word. For example, the OOV word rainstorm is decomposed into (sunk) r a i n s t o r m (eunk), and the model tries to learn such a character sequence rather than generate the OOV token. With this method, we obtained 1.2% - 3.7% relative word error rate (WER) improvements on the evaluation set, which contains 2.9% OOVs.

4 Conversational-context Aware Models

In this section, we describe the A2W model with conversational-context fusion. In order to fuse conversational-context information within the A2W end-to-end speech recognition framework, we extend the decoder sub-network to predict the output additionally conditioned on conversational context, by learning a conversational-context embedding. We encode single or multiple preceding utterance histories into a fixed-length, single vector, then inject it into the decoder network as an additional input at every output step.

Say we have K utterances in a conversation. For the k-th utterance, we have acoustic features x^k = (x_1, ..., x_T) and an output word sequence w^k = (w_1, ..., w_U). At output timestamp u, our decoder generates the probability distribution over words (w^k_u), conditioned on 1) speech embeddings, the attended high-level representation (e^k_speech) generated from the encoder, 2) word embeddings from all the previously seen words (e^k_word), and 3) conversational-context embeddings (e^k_context), which represent the conversational-context information for the current (k-th) utterance prediction:

    e^k_speech = Encoder(x^k)    (5)

    w^k_u ∼ Decoder(e^k_context, e^k_word, e^k_speech)    (6)

We can simply represent such a contextual embedding, e^k_context, by the mean of one-hot word vectors or word distributions, mean(e^{k−1}_{word_1} + ··· + e^{k−1}_{word_U}), from the preceding utterances.

In order to learn and use the conversational context during training and decoding, we serialize the utterances based on their onset times and their conversations, rather than randomly shuffling the data. We shuffle data at the conversation level and create mini-batches that contain only one sentence of each conversation. We fill in "dummy" input/output examples at positions where a conversation ended earlier than the others within the mini-batch, so as not to influence other conversations while passing context to the next batch.
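The following is a small sketch of such conversation-serialized batching. The data structures and the DUMMY placeholder are hypothetical; the actual ESPnet-based implementation may differ.

```python
from itertools import zip_longest

# Placeholder example whose loss is masked out during training.
DUMMY = {"feats": None, "words": None}

def conversation_batches(conversations, batch_size):
    """Yield mini-batches holding the i-th utterance of `batch_size`
    conversations, preserving utterance order within each conversation.

    `conversations` is a list of conversations, each a list of utterance
    dicts sorted by onset time.
    """
    for start in range(0, len(conversations), batch_size):
        group = conversations[start:start + batch_size]
        # zip_longest pads exhausted conversations with DUMMY so the other
        # conversations keep receiving their own running context.
        for batch in zip_longest(*group, fillvalue=DUMMY):
            yield list(batch)

# usage: every batch contains one utterance per conversation, in order
# for batch in conversation_batches(convs, batch_size=16): ...
```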

Figure 1: Conversational-context embedding representations from external word or sentence embeddings.

4.1 External word/sentence embeddings

Learning a better representation of the conversational context is the key to achieving better processing of long conversations. To do so, we propose to encode general word/sentence embeddings pre-trained on large textual corpora within our end-to-end speech recognition framework. Another advantage of using pre-trained embedding models is that we do not need to back-propagate the gradients across contexts, making it easier and faster to update the parameters for learning a conversational-context representation.

There exist many word/sentence embeddings that are publicly available. We can broadly classify them into two categories: (1) non-contextual word embeddings and (2) contextual word embeddings. Non-contextual word embeddings, such as word2vec (Mikolov and Zweig, 2012), GloVe (Pennington et al., 2014), and fastText (Bojanowski et al., 2017), map each word independently of the context of the sentence in which the word occurs. Although they are easy to use, they assume that each word represents a single meaning, which is not true in the real world. Contextualized word embeddings and sentence embeddings, such as deep contextualized word representations (Peters et al., 2018) and BERT (Devlin et al., 2019), encode the complex characteristics and meanings of words in various contexts by jointly training a bidirectional language model. The BERT model proposed a masked language model training approach, enabling it to also learn a good "sentence" representation in order to predict the masked words.

In this work, we explore both types of embeddings to learn conversational-context embeddings, as illustrated in Figure 1. The first method is to use word embeddings, fastText, to generate 300-dimensional embeddings from the 10k-dimensional one-hot vector or distribution over words for each previous word and then merge them into a single context vector, e^k_context. Since we also consider a history of multiple words/utterances, we consider two simple ways to merge multiple embeddings: (1) mean and (2) concatenation. The second method is to use sentence embeddings, BERT, to generate a single 768-dimensional sentence embedding from the 10k-dimensional one-hot vector or distribution over previous words, and then merge it into a single context vector with the same two merging methods. Since our A2W model uses a restricted vocabulary of 10k as our output units, which differs from the external embedding models, we need to handle out-of-vocabulary words. For fastText, words that are missing from the pre-trained embeddings are mapped to samples from a random multivariate normal distribution, with the mean and variance set to the sample mean and variance of the known words. For BERT, we use its provided tokenizer, which generates byte-pair encodings, to handle OOV words.

Using this approach, we can obtain a denser, more informative, fixed-length vector to encode the conversational-context information, e^k_context, to be used in the next (k-th) utterance prediction.
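To make the construction in Figure 1 concrete, here is a minimal sketch (not the authors' code; the dictionary-based fastText lookup, helper names, and fallback details are illustrative assumptions) of building a fixed-length context vector from the preceding utterances, including the random-normal fallback for words missing from the pre-trained fastText vocabulary and the two merging strategies. For the BERT variant, one would instead feed the previous sentence(s) through the pre-trained sentence encoder, using its own byte-pair tokenizer for OOV words, and take the resulting 768-dimensional vector.

```python
import numpy as np

def build_context_embedding(prev_utterances, fasttext_vecs, dim=300, merge="mean"):
    """Encode the n preceding utterances into a single context vector.

    `fasttext_vecs` maps word -> 300-d vector; words missing from the
    pre-trained vocabulary are drawn from N(mu, sigma^2) estimated on the
    known words, as described above.
    """
    known = np.stack(list(fasttext_vecs.values()))
    mu, sigma = known.mean(axis=0), known.std(axis=0)

    def embed_utterance(words):
        vecs = [fasttext_vecs.get(w, np.random.normal(mu, sigma)) for w in words]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    utt_vecs = [embed_utterance(u) for u in prev_utterances]
    if merge == "mean":                   # (1) average the utterance vectors
        return np.mean(utt_vecs, axis=0)
    return np.concatenate(utt_vecs)       # (2) "concat": one long n * dim vector
```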

4.2 Contextual gating

We use a contextual gating mechanism in our decoder network to combine the conversational-context embeddings with the speech and word embeddings effectively. Our gating is contextual in the sense that multiple embeddings compute a gate value that is dependent on the context of the multiple utterances that occur in a conversation. Using these contextual gates can be beneficial for deciding how to weigh the different embeddings: conversational-context, word, and speech embeddings. Rather than merely concatenating conversational-context embeddings (Kim and Metze, 2018), contextual gating can achieve more improvement because of its increased representational power using multiplicative interactions.

Figure 2 illustrates our proposed contextual gating mechanism. Let e_w = e_w(y_{u−1}) be our previous word embedding for a word y_{u−1}, let e_s = e_s(x^k_{1:T}) be a speech embedding for the acoustic features of the current k-th utterance x^k_{1:T}, and let e_c = e_c(s_{k−1−n:k−1}) be our conversational-context embedding for the n preceding utterances s_{k−1−n:k−1}. Then, using a gating mechanism:

    g = σ(e_c, e_w, e_s)    (7)

where σ is a one-hidden-layer DNN with a sigmoid activation, the gated embedding e is calculated as

    e = g ⊙ (e_c, e_w, e_s)    (8)

    h = LSTM(e)    (9)

and fed into the LSTM decoder hidden layer. The output of the decoder, h, is then combined with the conversational-context embedding e_c again with a gating mechanism:

    g = σ(e_c, h)    (10)

    ĥ = g ⊙ (e_c, h)    (11)

Then the next hidden layer takes these gated activations, ĥ, and so on.

Figure 2: Our contextual gating mechanism in the decoder network to integrate three different embeddings from: 1) the conversational context, 2) the previous word, and 3) the current speech.
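The exact fusion operator in Eqs. (7)-(8) is not fully recoverable from the text above, so the following PyTorch sketch shows one plausible reading (an assumption, not the released model): a one-hidden-layer DNN with a sigmoid output computes a gate from the concatenated embeddings, which then scales them elementwise before they enter the decoder LSTM. The second gate of Eqs. (10)-(11) would be an analogous module over (e_c, h).

```python
import torch
import torch.nn as nn

class ContextualGate(nn.Module):
    """One reading of Eqs. (7)-(8): a 1-hidden-layer DNN with sigmoid output
    gates the concatenated embeddings elementwise (a multiplicative
    interaction) before the result is fed to the LSTM decoder (Eq. 9)."""

    def __init__(self, context_dim, word_dim, speech_dim, hidden=300):
        super().__init__()
        total = context_dim + word_dim + speech_dim
        self.gate = nn.Sequential(
            nn.Linear(total, hidden), nn.Tanh(),      # one hidden layer
            nn.Linear(hidden, total), nn.Sigmoid())   # sigmoid gate, Eq. (7)

    def forward(self, e_context, e_word, e_speech):
        cat = torch.cat([e_context, e_word, e_speech], dim=-1)
        g = self.gate(cat)
        return g * cat   # gated embedding e of Eq. (8)
```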

Dataset        # of utterances   # of conversations   avg. # of utterances / conversation
training       192,656           2,402                80
validation     4,000             34                   118
eval (SWBD)    1,831             20                   92
eval (CH)      2,627             20                   131

Table 1: Experimental dataset description. We used 300 hours of the Switchboard conversational corpus. Note that no pronunciation lexicon or Fisher transcription was used.

5 Experiments

5.1 Datasets

To evaluate our proposed conversational end-to-end speech recognition model, we use the Switchboard (SWBD) LDC corpus (97S62) task. We split the 300 hours of the SWBD training set into two parts: 285 hours of data for model training and 5 hours of data for hyper-parameter tuning. We evaluate the model performance on HUB5 Eval2000, which consists of the Callhome English (CH) and Switchboard (SWBD) subsets (LDC2002S09, LDC2002T43). In Table 1, we show the number of conversations and the average number of utterances per conversation.

The audio data is sampled at 16 kHz, and each frame is converted to an 83-dimensional feature vector consisting of 80-dimensional log-mel filterbank coefficients and 3-dimensional pitch features, as suggested in (Miao et al., 2016). The number of our word-level output tokens is 10,038, which includes 47 single-character units as described in Section 3.2. Note that no pronunciation lexicon was used in any of the experiments.

5.2 Training and decoding

For the architecture of the end-to-end speech recognition model, we used joint CTC/Attention end-to-end speech recognition (Kim et al., 2017; Watanabe et al., 2017). As suggested in (Zhang et al., 2017; Hori et al., 2017), the input feature images are reduced to (1/4 × 1/4) along the time-frequency axes by two max-pooling layers in a CNN. The CNN layers are followed by a 6-layer BLSTM with 320 cells. For the attention mechanism, we used a location-based method (Chorowski et al., 2015). For the decoder network, we used a 2-layer LSTM with 300 cells. In addition to the standard decoder network, our proposed models additionally require extra parameters for the gating layers in order to fuse the conversational-context embedding into the decoder network, compared to the baseline.

Model                                   Output Units             Trainable Params   External LM   SWBD (WER%)   CH (WER%)
Prior Models
LF-MMI (Povey et al., 2016)             CD phones                N/A                              9.6           19.3
CTC (Zweig et al., 2017)                Char                     53M                              19.8          32.1
CTC (Sanabria and Metze, 2018)          Char, BPE-{300,1k,10k}   26M                              12.5          23.7
CTC (Audhkhasi et al., 2018)            Word (Phone init.)       N/A                              14.6          23.6
Seq2Seq (Zeyer et al., 2018)            BPE-10k                  150M*                            13.5          27.1
Seq2Seq (Palaskar and Metze, 2018)      Word-10k                 N/A                              23.0          37.2
Seq2Seq (Zeyer et al., 2018)            BPE-1k                   150M*                            11.8          25.7
Our baseline                            Word-10k                 32M                              18.2          30.7
Our Proposed Conversational Model
Gated Contextual Decoder                Word-10k                 35M                              17.3          30.5
+ Decoder Pretrain                      Word-10k                 35M                              16.4          29.5
+ fastText for Word Emb.                Word-10k                 35M                              16.0          29.5
(a) fastText for Conversational Emb.    Word-10k                 34M                              16.0          29.5
(b) BERT for Conversational Emb.        Word-10k                 34M                              15.7          29.2
(b) + Turn number 5                     Word-10k                 34M                              15.5          29.0

Table 2: Comparison of word error rates (WER) on Switchboard 300h between standard end-to-end speech recognition models and our proposed end-to-end speech recognition models with conversational context. (The * mark denotes our estimate of the number of parameters used in the previous work.)

We list the total number of trainable parameters in Table 2.

For the optimization method, we use AdaDelta (Zeiler, 2012) with gradient clipping (Pascanu et al., 2013). We used λ = 0.2 for joint CTC/Attention training (in Eq. 1) and γ = 0.3 for joint CTC/Attention decoding (in Eq. 4). We bootstrap the training of our proposed conversational end-to-end models from the baseline end-to-end models. To decide the best models for testing, we monitor the development accuracy, where we always use the model prediction in order to simulate the testing scenario. At inference, we used a left-right beam search method (Sutskever et al., 2014) with beam size 10 to reduce the computational cost. We adjusted the final score, s(y|x), with a length penalty of 0.5. The models are implemented using the PyTorch deep learning library (Paszke et al., 2017) and the ESPnet toolkit (Kim et al., 2017; Watanabe et al., 2017, 2018).

6 Results

Our results are summarized in Table 2, where we first present the baseline results and then show the improvements from adding each of the individual components discussed in the previous sections, namely gated decoding, pre-training the decoder network, external word embeddings, external conversational embeddings, and increasing the receptive field of the conversational context. Our best model obtains around 15% relative improvement on the SWBD subset and 5% relative improvement on the CallHome subset of the eval2000 dataset.

We start by evaluating our proposed model, which leverages conversational-context embeddings learned from the training corpus, and compare it with a standard end-to-end speech recognition model without conversational-context embeddings. As seen in Table 2, we obtain a performance gain over the baseline by using conversational-context embeddings learned from the training set.

6.1 Pre-training the decoder network

We then observe that pre-training the decoder network can further improve accuracy, as shown in Table 2. Using pre-training of the decoder network, we achieved a 5% relative improvement in WER on the SWBD set. Since we add external parameters to the decoder network to learn conversational-context embeddings, our model requires more effort to learn these additional parameters. To relieve this issue, we used pre-training techniques to first train the decoder network with text-only data. We simply used a mask on top of the Encoder/Attention layer so that we can control the gradients for batches containing text-only data and do not update the Encoder/Attention sub-network parameters.
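A rough sketch of how such gradient masking for text-only batches might look in PyTorch follows. This is our illustrative reading of the trick, with hypothetical encoder/decoder interfaces, not the authors' ESPnet implementation.

```python
import torch

def training_step(batch, encoder, decoder, optimizer):
    """Illustrative text-only pre-training step: batches without audio get a
    zeroed (masked) speech embedding, so the loss depends only on the decoder
    and no gradient flows into the encoder/attention sub-network."""
    optimizer.zero_grad()
    if batch["feats"] is not None:
        e_speech = encoder(batch["feats"])                  # normal joint training
    else:
        # Masked speech input for a text-only batch (hypothetical attribute).
        e_speech = torch.zeros(batch["words"].size(0), decoder.speech_dim)
    loss = decoder.loss(batch["words"], e_speech, batch["context"])
    loss.backward()   # encoder parameters receive no gradient for text-only batches
    optimizer.step()
```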

6.2 Use of word/sentence embeddings

Next, we evaluated the use of pre-trained external embeddings (fastText and BERT). We initially observed that we can obtain a 2.4% relative improvement in WER over the model with decoder pre-training by using fastText for the additional word embeddings to the gated decoder network.

We also extensively evaluated various ways to use fastText/BERT for conversational-context embeddings. Both methods, with fastText and with BERT, show significant improvements over the baseline as well as over the vanilla conversational-context aware model.

Figure 3: The relative improvement in development accuracy over the baseline obtained by using conversational-context embeddings with different numbers of utterances of history and different merging techniques (mean vs. concatenation).

6.3 Conversational-context Receptive Field

We also investigate the effect of the number of utterance histories being encoded. We tried different numbers, N = [1, 5, 9], of utterance histories to learn the conversational-context embeddings. Figure 3 shows the relative improvements in accuracy on the Dev set (Section 5.2) over the baseline "non-conversational" model. We show the improvements for the two different methods of merging the contextual embeddings, namely mean and concatenation. Typically, increasing the receptive field of the conversational context helps improve the model. However, as the number of utterances of history increased, the number of trainable parameters of the concatenation model increased, making it harder for the model to train. This led to a reduction in accuracy.

We also found that using a 5-utterance history with concatenation performed best (15%) on the SWBD set, and using a 9-utterance history with the mean method performed best (5%) on the CH set. We also observed that the improvement diminished when we used a 9-utterance history for the SWBD set, unlike the CH set. One possible explanation is that the conversational context may not be relevant to the current utterance prediction, or the model is overfitting.

6.4 Sampling technique

We also experiment with an utterance-level sampling strategy with various sampling ratios, [0.0, 0.2, 0.5, 1.0]. Sampling techniques have been used extensively in sequence prediction tasks to reduce overfitting (Bengio et al., 2015), by training the model conditioned on tokens generated by the model itself, which is what the model actually does at inference, rather than on the ground-truth tokens. Similar to choosing previous word tokens from the ground truth or from the model output, we apply this idea to choose the previous utterance from the ground truth or from the model output for learning conversational-context embeddings.
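As a minimal illustration (a hypothetical helper, not the paper's code), the utterance-level sampling decision can be written as:

```python
import random

def choose_context_source(prev_reference, prev_hypothesis, sampling_rate=0.2):
    """Utterance-level scheduled sampling: with probability `sampling_rate`,
    build the conversational-context embedding from the model's own decoded
    hypothesis of the previous utterance (what it will see at test time);
    otherwise use the ground-truth transcript."""
    if random.random() < sampling_rate:
        return prev_hypothesis
    return prev_reference
```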

Figure 4: The relative improvement in development accuracy over the 100% sampling rate, which was used in (Kim and Metze, 2018), obtained by using conversational-context embeddings with different sampling rates.

Figure 4 shows the relative improvement in the development accuracy (Section 5.2) over the 1.0 sampling rate, which always chooses the model's output. We found that a sampling rate of 20% performed best.

6.5 Analysis of context embeddings

We develop a scoring function, s(i, j), to check whether our model preserves conversational consistency, in order to validate the accuracy improvement of our approach. The scoring function measures the average of the conversational distances over every pair of consecutive hypotheses generated by a particular model. The conversational distance is calculated as the Euclidean distance, dist(e_i, e_j), between the fixed-length vectors e_i, e_j that represent the model's i-th and j-th hypotheses, respectively. To obtain a fixed-length vector (utterance embedding) for a given model hypothesis, we use a BERT sentence embedding as an oracle. Mathematically, it can be written as

    s(i, j) = (1/N) Σ_{i,j ∈ eval} dist(e_i, e_j)

where i, j is a pair of consecutive hypotheses in the evaluation data eval, N is the total number of (i, j) pairs, and e_i, e_j are the BERT embeddings. In our experiment, we select the pairs of consecutive utterances from the reference that show a lower distance score than the baseline hypotheses.

From this process, we obtained three conversational distance scores: from 1) the reference transcripts, 2) the hypotheses of our vanilla conversational model, which does not use BERT, and 3) the hypotheses of our baseline model. Figure 5 shows the score comparison.

Figure 5: Comparison of the conversational distance score on consecutive utterances of 1) the reference, 2) our proposed conversational end-to-end model, and 3) our end-to-end baseline model.

We found that our proposed model was 7.4% relatively closer to the reference than the baseline. This indicates that our conversational-context embedding leads to improved similarity across adjacent utterances, resulting in better processing of a long conversation.

7 Conclusion

We have introduced a novel method for conversational-context aware end-to-end speech recognition based on a gated network that incorporates word/sentence/speech embeddings. Unlike prior work, our model is trained on conversational datasets to predict a word conditioned on multiple preceding conversational-context representations, and consequently it improves the recognition accuracy of a long conversation. Moreover, our gated network can effectively incorporate text-based external resources, word or sentence embeddings (i.e., fastText, BERT), within an end-to-end framework, so that the whole system can be optimized towards our final objective, speech recognition accuracy. By incorporating external embeddings with a gating mechanism, our model achieves further improvement with a better conversational-context representation. We evaluated the models on the Switchboard conversational speech corpus and show that our proposed model using gated conversational-context embeddings obtains 15% and 5% relative improvements in WER over a baseline model on the Switchboard and CallHome subsets, respectively. Our model was shown to outperform standard end-to-end speech recognition models trained on isolated sentences. This work is easy to scale and can potentially be applied to any speech-related task that can benefit from longer context information, such as spoken dialog systems and sentiment analysis.

Acknowledgments

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. This work also used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC).

References
Uri Alon, Golan Pundak, and Tara N. Sainath. 2019. Contextual speech recognition with difficult negative training examples. In ICASSP 2019, pages 6440-6444. IEEE.
John Arevalo, Thamar Solorio, Manuel Montes-y-Gómez, and Fabio A. González. 2017. Gated multimodal units for information fusion. arXiv preprint arXiv:1702.01992.
Kartik Audhkhasi, Brian Kingsbury, Bhuvana Ramabhadran, George Saon, and Michael Picheny. 2018. Building competitive direct acoustics-to-word models for English conversational speech recognition. In ICASSP 2018, pages 4759-4763. IEEE.
Kartik Audhkhasi, Bhuvana Ramabhadran, George Saon, Michael Picheny, and David Nahamoo. 2017. Direct acoustics-to-word models for English conversational speech recognition. CoRR, abs/1703.07754.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. ICLR.
Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. 2016. End-to-end attention-based large vocabulary speech recognition. In ICASSP 2016, pages 4945-4949. IEEE.
Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171-1179.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135-146.
William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In ICASSP 2016, pages 4960-4964. IEEE.
Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pages 577-585.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL.
John Godfrey and Edward Holliman. 1993. Switchboard-1 release 2 LDC97S62. Linguistic Data Consortium, Philadelphia.
John J. Godfrey, Edward C. Holliman, and Jane McDaniel. 1992. Switchboard: Telephone speech corpus for research and development. In ICASSP-92, volume 1, pages 517-520. IEEE.
Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369-376. ACM.
Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1764-1772.
Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. 2014. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.
Takaaki Hori, Shinji Watanabe, Yu Zhang, and William Chan. 2017. Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM. Interspeech.
Yangfeng Ji, Trevor Cohn, Lingpeng Kong, Chris Dyer, and Jacob Eisenstein. 2016. Document context language models. ICLR (Workshop track).
Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, and Tomas Mikolov. 2016. FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651.
Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427-431. Association for Computational Linguistics.
Suyoun Kim, Siddharth Dalmia, and Florian Metze. 2018. Situation informed end-to-end ASR for CHiME-5 challenge. CHiME5 workshop.
Suyoun Kim, Takaaki Hori, and Shinji Watanabe. 2017. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In ICASSP 2017, pages 4835-4839. IEEE.
Suyoun Kim and Florian Metze. 2018. Dialog-context aware end-to-end speech recognition. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 434-440. IEEE.
Suyoun Kim and Florian Metze. 2019. Acoustic-to-word models with conversational context information. NAACL.
Suyoun Kim and Michael L. Seltzer. 2018. Towards language-universal end-to-end speech recognition. In ICASSP 2018, pages 4914-4918. IEEE.
Jamie Kiros, William Chan, and Geoffrey Hinton. 2018. Illustrative language understanding: Large-scale visual grounding with image search. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 922-933.
Jinyu Li, Guoli Ye, Amit Das, Rui Zhao, and Yifan Gong. 2018. Advancing acoustic-to-word CTC model. In ICASSP 2018, pages 5794-5798. IEEE.
Bing Liu and Ian Lane. 2017. Dialog context language modeling with recurrent neural networks. In ICASSP 2017, pages 5715-5719. IEEE.
Yajie Miao, Mohammad Gowayyed, and Florian Metze. 2015. EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 167-174. IEEE.
Yajie Miao, Mohammad Gowayyed, Xingyu Na, Tom Ko, Florian Metze, and Alexander Waibel. 2016. An empirical exploration of CTC acoustic models. In ICASSP 2016, pages 2623-2627. IEEE.
Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.
Tomas Mikolov and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. SLT, 12:234-239.
Shruti Palaskar and Florian Metze. 2018. Acoustic-to-word recognition with sequence-to-sequence models. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 397-404. IEEE.
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310-1318.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations.
Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur. 2016. Purely sequence-trained neural networks for ASR based on lattice-free MMI. In Interspeech, pages 2751-2755.
Golan Pundak, Tara N. Sainath, Rohit Prabhavalkar, Anjuli Kannan, and Ding Zhao. 2018. Deep context: End-to-end contextual speech recognition. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 418-425. IEEE.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
Ramon Sanabria and Florian Metze. 2018. Hierarchical multitask learning with CTC. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 485-490. IEEE.
Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603.
Hagen Soltau, Hank Liao, and Hasim Sak. 2017. Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition. Interspeech.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112.
Tian Wang and Kyunghyun Cho. 2016. Larger-context language modelling. ACL.
Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai. 2018. ESPnet: End-to-end speech processing toolkit. In Interspeech, pages 2207-2211.
Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, and Tomoki Hayashi. 2017. Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing, 11(8):1240-1253.
Wayne Xiong, Lingfeng Wu, Jun Zhang, and Andreas Stolcke. 2018. Session-level language modeling for conversational speech. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2764-2768.
Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.
Albert Zeyer, Kazuki Irie, Ralf Schlüter, and Hermann Ney. 2018. Improved training of end-to-end attention models for speech recognition. Interspeech.
Yu Zhang, William Chan, and Navdeep Jaitly. 2017. Very deep convolutional networks for end-to-end speech recognition. In ICASSP 2017, pages 4845-4849. IEEE.
Geoffrey Zweig, Chengzhu Yu, Jasha Droppo, and Andreas Stolcke. 2017. Advances in all-neural speech recognition. In ICASSP 2017, pages 4805-4809. IEEE.