Efficient Learning for Spoken Language Understanding Tasks with Word Embedding Based Pre-Training
INTERSPEECH 2015

Yi Luan, Shinji Watanabe, and Bret Harsham
Mitsubishi Electric Research Laboratories
[email protected], {watanabe,harsham}@merl.com

(Y. Luan contributed to this work while doing an internship at MERL. The current affiliation of Y. Luan is the University of Washington.)

Abstract

Spoken language understanding (SLU) tasks such as goal estimation and intention identification from user commands are essential components in spoken dialog systems. In recent years, neural network approaches have shown great success in various SLU tasks. However, one major difficulty of SLU is that the annotation of collected data can be expensive, and this often results in insufficient data being available for a task. The performance of a neural network trained in low-resource conditions is usually inferior because of over-training. To improve performance, this paper investigates the use of unsupervised training methods with large-scale corpora, based on word embeddings and latent topic models, to pre-train the SLU networks. In order to capture long-term characteristics over the entire dialog, we propose a novel Recurrent Neural Network (RNN) architecture. The proposed RNN uses two sub-networks to model the different time scales represented by word and turn sequences. The combination of pre-training and RNN gives us an 18% relative error reduction compared to a baseline system.

Index Terms: spoken language understanding, fine-tuning, semantic embedding, recurrent neural networks, goal estimation

1. Introduction

Spoken language understanding (SLU) in dialog systems extracts semantic information from the output of an automatic speech recognizer (ASR) [1, 2]. The dialog manager (DM) then determines the next machine action given the SLU output. In the last decade, a variety of practical goal-oriented spoken dialog systems have been built for different tasks [3, 4, 5, 6]. This paper focuses on two key tasks in targeted dialog and understanding applications: user intention understanding and user goal estimation.

User intention understanding is the extraction of the intended meaning (called "intention" hereafter) of one user utterance, performed by the SLU module. The dialog manager determines which action to take next based on the result of intention understanding.

User goal estimation is a similar concept, but it estimates the final system action (called "goal" hereafter) that the user wishes to accomplish during the dialog. A dialog usually consists of a series of user utterances and system actions, so goal estimation takes place over a longer time scale than user intention understanding. The system's estimate of the goal may change during the dialog as more information is captured. Goal estimation performance is important since it can help the user reach the correct action more quickly [7].

Intention understanding is framed as a semantic utterance classification problem, while goal estimation is framed as a classification problem over an entire dialog.

Conventional intention and goal estimation approaches use bag-of-words (BoW) features (or bag-of-intentions features in goal estimation) as inputs to standard classification methods such as boosting, support vector machines, and logistic regression. However, one problem with applying BoW features to SLU tasks is that the feature vector tends to be very sparse. Each utterance includes only a dozen words at most (unlike document analysis), so the feature vector sometimes lacks enough semantic information to accurately estimate user intentions or goals. This paper investigates the use of additional semantic information, obtained with word embedding [8] and latent topic modeling based on Latent Dirichlet Allocation (LDA) [9], for intention understanding and goal estimation.

This paper first applies the semantic information as input features, and then uses it as an additional input layer in neural network architectures, following the great success of neural network approaches in various SLU tasks. One of the most successful neural network approaches is based on Deep Belief Networks (DBNs) [10]. DBNs are stacks of Restricted Boltzmann Machines (RBMs), and their parameters are used as initial values when the neural network parameters are estimated by a back-propagation algorithm (see [11] for more details). In the DBN context, the first step of finding initial parameters is called pre-training, and the second step of discriminative network training is called fine-tuning.

Following the success of DBN/DNN training in ASR and image processing, researchers have applied other neural network architectures to SLU, including DBN/DNN [12], Deep Convex Networks [13], Recurrent Neural Networks (RNNs) [14, 15], and Long Short-Term Memory (LSTM) RNNs [16].

However, in applying these techniques to SLU, one major difficulty is that we often have insufficient training data for a task, since the annotation of collected data can be expensive. The performance of a neural network trained in low-resource conditions is usually inferior because of over-training. As mentioned above, our approach uses a word embedding as an additional input layer of a neural network, which mitigates the over-training problem. To accomplish this, we initialize the affine transformation of the first layer with a word embedding matrix estimated from a large-scale general corpus using unsupervised training methods (pre-training). Then, the entire SLU network is trained on the annotated training data, with the word embeddings fine-tuned to the SLU task. This concept has been studied in [14, 17]. We also propose a novel Multi-scale RNN (MSRNN) architecture, pre-trained with word embeddings, in order to capture long-term characteristics over the entire dialog for goal estimation. The proposed MSRNN uses two sub-networks to model the different time scales represented by word and turn sequences. We applied a combination of pre-training and the proposed MSRNN to a route guidance dialog task. The combination gives us an 18% relative error reduction compared to a baseline system.
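The pre-training and fine-tuning scheme described above can be illustrated with a short sketch. The following PyTorch snippet is a minimal illustration, not the authors' implementation: the layer sizes, the random stand-in for the pre-trained embedding matrix, and the single-hidden-layer classifier are assumptions made only to show how the first affine layer is initialized from an embedding matrix and then updated together with the rest of the network.

import torch
import torch.nn as nn

# Hypothetical sizes: a 1000-word vocabulary, 100-dimensional embeddings,
# and 20 intention/goal categories.
vocab_size, embed_dim, num_classes = 1000, 100, 20

# Stand-in for a word embedding matrix pre-trained on a large general corpus
# (e.g. with word2vec); here it is random, for illustration only.
pretrained = torch.randn(vocab_size, embed_dim)

class SLUNet(nn.Module):
    def __init__(self):
        super().__init__()
        # First affine transformation, initialized from the pre-trained
        # embedding matrix instead of from random values (pre-training).
        self.input_layer = nn.Linear(vocab_size, embed_dim, bias=False)
        with torch.no_grad():
            self.input_layer.weight.copy_(pretrained.t())
        self.output_layer = nn.Linear(embed_dim, num_classes)

    def forward(self, bow):
        # bow: (batch, vocab_size) bag-of-words vectors of the utterances
        return self.output_layer(self.input_layer(bow))

# Fine-tuning: every parameter, including the embedding-initialized first
# layer, is updated on the (small) annotated SLU training set.
model = SLUNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
logits = model(torch.rand(8, vocab_size))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, num_classes, (8,)))
loss.backward()
optimizer.step()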
2. Incorporating Semantic information

In this paper, we investigate the performance of importing semantic text embeddings into our SLU tasks. Two kinds of semantic features are trained in an unsupervised fashion using large-scale web data. One is a word-level embedding, which learns contextual information about word sequences with a feed-forward neural network framework. The other is a document-level topic embedding, which models the latent topic information of a given sentence or article. The two kinds of semantic features have respective advantages: word-level embeddings capture local contextual information, while topic embeddings capture corpus-level statistics and thus provide more global information.

2.1. Word embedding

Many current NLP systems use bag-of-words or one-hot word vectors as input, which leads to feature vectors of extremely large dimension. An alternative is a word embedding, which projects the large sparse word feature vector into a low-dimensional, dense vector representation.

There are two main model families for learning word vectors: 1) matrix factorization methods, such as latent semantic analysis (LSA) [18] and LR-MVL [19], and 2) neural network language model (NNLM) based methods, which model a local context window, such as Continuous Bag of Words (CBOW), Skip-gram [8], and others [20, 21]. Most word vector methods rely on the distance or angle between pairs of word vectors as the primary means of evaluating the intrinsic quality of word representations. Mikolov et al. [22] introduced an evaluation scheme based on word analogies, which favors models that produce dimensions of meaning. Among these methods, CBOW and Skip-gram are the current state of the art on the word analogy task. CBOW predicts the current word based on its context, while Skip-gram predicts the surrounding words given the current word.
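As a concrete illustration of the two criteria, the sketch below trains CBOW and Skip-gram embeddings with gensim. The toolkit choice, the toy corpus, and the hyper-parameters are illustrative assumptions (the paper does not specify how its word2vec embeddings were trained); parameter names follow gensim 4.x.

from gensim.models import Word2Vec

# Toy corpus of tokenized utterances; in practice the embeddings are trained
# on a large-scale general web corpus, as described in the paper.
sentences = [
    ["set", "destination", "to", "the", "airport"],
    ["find", "a", "gas", "station", "near", "the", "airport"],
    ["navigate", "to", "the", "nearest", "parking", "lot"],
]

# sg=0 selects CBOW (predict the current word from its context window);
# sg=1 selects Skip-gram (predict surrounding words from the current word).
cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# The learned embedding matrix (one dense row per vocabulary word) is what
# later initializes the first affine layer of the SLU network.
print(cbow.wv["airport"].shape)                    # (100,)
print(skipgram.wv.most_similar("airport", topn=3))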
Our framework uses LDA, which models each sentence or document as a mixture of proportions over latent components with a Dirichlet prior. LDA performs well on large-scale corpora and can be trained efficiently [26]. However, since an LDA embedding is obtained by an iterative inference algorithm (variational EM) or a sampling method, it is hard to fine-tune an LDA embedding within a neural network framework. Therefore, in this paper, we do not fine-tune the LDA features. It could be possible to use a deep unfolding method to fine-tune the LDA inference, such as that presented in [27].

3. Fine-tuning of linear input networks

The baseline method uses a discriminative approach to represent the goal and intention estimation models, which lets us flexibly incorporate various kinds of information via feature engineering. We use multivariate logistic regression to compute P(g | X) for classification target g and feature vector X as

P(g | X) = softmax([WX]_g)    (1)

where [Y]_g denotes the g-th element of vector Y. The softmax function is defined as

softmax(z_m) = e^{z_m} / \sum_k e^{z_k}    (2)

W is a weight matrix to be estimated in the training step. For intention prediction, X is a bag-of-words feature vector and g is an intention category. For the goal estimation task, X is a bag of intentions, including confidence scores for each predicted intention in the dialog history, and g is a goal category. The baseline model can be regarded as a shallow neural network, with only one input layer and one softmax output layer.

In order to import a word2vec embedding into the system, we begin by simply concatenating the embedding X_w with the baseline feature X_b, i.e.,

X = [X_b^T, X_w^T]^T    (3)

Then, for each turn or sentence, X_w is obtained by summing the normalized word2vec features of each word in the turn or sentence:

X_w(i)
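A small NumPy sketch of the baseline classifier in Eqs. (1)-(3) may make the feature construction concrete. The dimensions and random weights are placeholders, and the per-word L2 normalization is an assumption, since the exact summation formula for X_w is cut off in this excerpt.

import numpy as np

def softmax(z):
    # Eq. (2): softmax over class scores (shifted by max(z) for stability)
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(W, x_b, x_w):
    # Eq. (3): concatenate the baseline feature (BoW or bag-of-intentions)
    # with the summed word2vec feature of the current turn or sentence
    x = np.concatenate([x_b, x_w])
    # Eq. (1): multivariate logistic regression, P(g | X) = softmax([WX]_g)
    return softmax(W @ x)

# Placeholder dimensions: 1000 BoW features, 100-dim word2vec vectors,
# and 20 intention/goal categories.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(20, 1000 + 100))

x_b = np.zeros(1000)
x_b[[12, 87, 430]] = 1.0                  # sparse BoW counts for one utterance

word_vecs = rng.normal(size=(3, 100))     # word2vec rows for the 3 words
word_vecs /= np.linalg.norm(word_vecs, axis=1, keepdims=True)
x_w = word_vecs.sum(axis=0)               # summed normalized embeddings

posterior = classify(W, x_b, x_w)
print(posterior.argmax(), round(float(posterior.max()), 3))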