LANGUAGE MODELING WITH NEURAL TRANS-DIMENSIONAL RANDOM FIELDS

Bin Wang, Zhijian Ou

Department of Electronic Engineering, Tsinghua University, Beijing, China. [email protected], [email protected]

ABSTRACT

Trans-dimensional random field language models (TRF LMs) have recently been introduced, where sentences are modeled as a collection of random fields. The TRF approach has been shown to have the advantages of being computationally more efficient in inference than LSTM LMs, with close performance, and of being able to flexibly integrate rich features. In this paper we propose neural TRFs, going beyond the previous discrete TRFs that only use linear potentials with discrete features. The idea is to use nonlinear potentials with continuous features, implemented by neural networks (NNs), in the TRF framework. Neural TRFs combine the advantages of both NNs and TRFs. The benefits of word embedding, nonlinear feature learning and larger context modeling are inherited from the use of NNs. At the same time, the strength of efficient inference by avoiding the expensive softmax is preserved. A number of technical contributions, including employing deep convolutional neural networks (CNNs) to define the potentials and incorporating the joint stochastic approximation (JSA) strategy in the training algorithm, are developed in this work, which enable us to successfully train neural TRF LMs. Various LMs are evaluated in terms of speech recognition WERs by rescoring the 1000-best lists of WSJ'92 test data. The results show that neural TRF LMs not only improve over discrete TRF LMs, but also perform slightly better than LSTM LMs with only one fifth of the parameters and 16x faster inference.

Index Terms— Language modeling, Random field, Stochastic approximation

1. INTRODUCTION

Statistical language models, which estimate the joint probability of words in a sentence, form a crucial component in many applications such as automatic speech recognition (ASR) and machine translation (MT). Recently, neural network language models (NN LMs), which can be either feedforward NNs (FNNs) [1] or recurrent NNs (RNNs) [2, 3], have been shown to surpass classical n-gram LMs. RNNs with Long Short-Term Memory (LSTM) units are particularly popular. Remarkably, both n-gram LMs and NN LMs follow the directed graphical modeling approach, which represents the joint probability in terms of conditionals. In contrast, a new trans-dimensional random field (TRF) LM [4, 5] has recently been introduced in the undirected graphical modeling approach, where sentences are modeled as a collection of random fields and the joint probability is defined in terms of local potential functions. It has been shown that TRF LMs significantly outperform n-gram LMs, and perform close to LSTM LMs while being computationally more efficient (200x faster) in inference (i.e. computing sentence probability).

Although the TRF approach has the capacity to support nonlinear potential functions and rich features, only linear potentials with discrete features (such as word and class n-gram features) are used in the previous TRF models, which limits their performance. The previous TRF models [4, 5] will thus be referred to as discrete TRFs. This limitation is clear when comparing discrete TRF LMs with LSTM LMs. First, LSTM LMs associate with each word in the vocabulary a real-valued feature vector. Such word embedding in continuous vector space creates a notion of similarity between words and achieves a level of generalization that is hard to obtain with discrete features. Discrete TRFs mainly rely on word classing and various orders of discrete features for smoothing parameter estimates. Second, LSTM LMs learn nonlinear interactions between underlying features by use of NNs, while discrete TRF LMs basically are log-linear models. Third, LSTM LMs can model larger contexts than discrete TRF models by using memory cells. Despite these differences, discrete TRF LMs still achieve impressive performance, being close to LSTM LMs. A promising extension is to integrate NNs into the TRF framework, thus eliminating the above limitations of discrete TRFs.

The above analysis motivates us to propose neural trans-dimensional random fields (neural TRFs) in this paper. The idea is to use nonlinear potentials with continuous features, implemented by NNs, in the TRF framework. Neural TRFs combine the advantages of both NNs and TRFs. The benefits of word embedding, nonlinear feature learning and larger context modeling are inherited from the use of NNs. At the same time, the strength of efficient inference by avoiding the expensive softmax is preserved.

We have developed a stochastic approximation (SA) algorithm, called augmented SA (AugSA), with Markov chain Monte Carlo (MCMC) to estimate the model parameters and normalizing constants for discrete TRFs. Note that the log-likelihood of a discrete TRF is concave, guaranteeing training convergence to the global maximum. Fitting neural TRFs is a non-convex optimization problem, which is more challenging. There are a number of technical contributions made in this work, which enable us to successfully train neural TRFs. First, we employ deep convolutional neural networks (CNNs) to define the potential functions. CNNs can be stacked to represent larger and larger contexts, and allow easier gradient propagation than LSTM RNNs. Second, the AugSA training algorithm is extended to train neural TRFs, by incorporating the joint stochastic approximation (JSA) [6] strategy, which has been used to successfully train deep generative models. The JSA strategy is to introduce an auxiliary distribution to serve as the proposal for constructing the MCMC operator for the target distribution. The log-likelihood of the target distribution and the KL-divergence between the target distribution and the auxiliary distribution are jointly optimized. The resulting AugSA plus JSA algorithm is crucial for handling deep CNN features, not only significantly reducing the computation cost for every SA iteration step but also considerably improving SA training convergence. Third, several additional techniques are found to improve the convergence for training neural TRFs, including wider local jumps in MCMC, the Adam optimizer [7], and training set mini-batching.

(This work is supported by NSFC grant 61473168.)

Various LMs are evaluated in terms of speech recognition WERs by rescoring the 1000-best lists of WSJ'92 test data. The neural TRF LM improves over the discrete TRF LM, reducing the WER from 7.92% to 7.60%, with fewer parameters. Compared with state-of-the-art LSTM LMs [8], the neural TRF LM outperforms the small LSTM LM (2 hidden layers and 200 units per layer) with a relative WER reduction of 4.5%, and performs slightly better than the medium LSTM LM (2 hidden layers and 650 units per layer) with only one fifth of the parameters. Moreover, inference with the neural TRF LM is about 16 times faster than with the medium LSTM LM. The average time cost for rescoring a 1000-best list for an utterance in the WSJ'92 test set is about 0.4 seconds vs 6.36 seconds, both using 1 GPU.

In the rest of the paper, we first discuss related work in Section 2. Then we introduce the new neural TRF model in Section 3 and its training algorithm in Section 4. After presenting experimental results in Section 5, conclusions are made in Section 6.

2. RELATED WORK

LM research can be roughly divided into two tracks. The directed graphical modeling approach includes the classic n-gram LMs and various NN LMs. The undirected graphical modeling approach has little prior work, except [4, 5, 9]. A review of the two tracks can be found in [5]. To our knowledge, the TRF work represents the first success in using the undirected graphical modeling approach for language modeling. Starting from discrete TRFs, the main new features of the neural TRFs proposed in this paper are the marriage of random fields and neural networks, and the use of CNNs for feature extraction. In the following, we mainly comment on these two lines of related studies and their connection to our work.

2.1. Marriage of random fields and neural networks

A key strength of NNs is their nonlinear feature learning ability. Random fields (RFs) are powerful for describing interactions among structured random variables. Combining RFs and NNs has been pursued, but most models developed so far are in fact combinations of conditional random fields (CRFs) [10] and NNs, namely "CRFs+NNs". In conventional CRFs, both node potentials and edge potentials are defined as linear functions using discrete indicator features. "CRFs+NNs" has been introduced a few times in prior literature. It was termed Conditional Neural Fields in [11], and later Neural Conditional Random Fields [12] with a slightly different specification of potentials. It also appeared previously in the speech recognition literature [13]. Recently, there are increasingly more studies that use various types of NNs, e.g. FNNs [14], RNNs [15], and LSTM RNNs [16], to extract features as input to CRFs. Remarkably, the general idea in "CRFs+NNs" models is to implement the combination by using NNs to represent the potential functions in a RF. This is in spirit the same as in neural TRFs. However, the algorithms developed in "CRFs+NNs" studies are not applicable to neural TRFs, because the sample spaces of those CRFs are much smaller than that of TRFs.

It is worth pointing out that CRFs can only be used for discriminative tasks, e.g. sequence labeling tasks. In contrast, language modeling is a generative modeling task. A generative random field model is proposed in [17], where the potentials are also defined by CNNs, but it is only for modeling fixed-size images (i.e. fixed-dimensional modeling).

2.2. Convolutional neural networks

Besides their great success in computer vision, CNNs have recently received more attention in language modeling. CNNs over language act as feature detectors and can be stacked hierarchically to capture large contexts, like in computer vision. It is shown in [18] that applying convolutional layers in FNNs performs better than conventional FNN LMs but remains below LSTM LMs. Convolutional layers can also be used within RNNs; e.g. as studied in [19], 1-D convolutional filters of varying widths are applied over characters, whose output is fed to the upper LSTM RNN. Recently, it is shown in [20] that CNN-FNNs with a novel gating mechanism benefit gradient propagation and perform slightly better than LSTM LMs. Similarly, our pilot experiment shows that using the stacked structure of CNNs in neural TRFs allows easier model training than using the recurrent structure of RNNs.

3. MODEL DEFINITION

Throughout, we denote by x^l = (x_1, ..., x_l) a sentence (i.e. word sequence) of length l, ranging from 1 to m. Sentences of length l are assumed to be distributed from an exponential family model:

    p_l(x^l; θ) = (1/Z_l(θ)) e^{ϕ(x^l; θ)}    (1)

where θ indicates the set of parameters, ϕ is the potential function, and Z_l(θ) is the normalization constant of length l, i.e. Z_l(θ) = Σ_{x^l} e^{ϕ(x^l; θ)}. Moreover, assume that length l is associated with a probability π_l for l = 1, ..., m. Therefore, the pair (l, x^l) is jointly distributed as:

    p(l, x^l; θ) = π_l p_l(x^l; θ)    (2)

Different from using linear potentials in discrete TRFs [5], neural TRFs define the potential function ϕ(x^l; θ) by a deep CNN, as described below and shown in Fig. 1.
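To make (1)–(2) concrete, the following sketch (ours, not the authors' code; all names are illustrative) assembles the joint probability from an arbitrary potential ϕ(x^l; θ) and a length distribution π_l. The normalization constant Z_l(θ) sums over all |V|^l sentences of length l, so brute-force evaluation is only possible for toy vocabularies; Section 4 describes how training and inference avoid this enumeration.

```python
import itertools
import math

def log_joint_prob(sentence, phi, log_pi, vocab):
    """Toy evaluation of p(l, x^l; theta) = pi_l * p_l(x^l; theta) in eqs. (1)-(2).
    Only feasible for tiny vocabularies, since Z_l(theta) enumerates all |V|^l
    sequences of length l."""
    l = len(sentence)
    log_weight = phi(sentence)                       # log of exp(phi) is just phi
    # brute-force normalization constant Z_l(theta) = sum_x exp(phi(x^l; theta))
    log_Zl = math.log(sum(math.exp(phi(x))
                          for x in itertools.product(vocab, repeat=l)))
    return log_pi[l] + log_weight - log_Zl

# Usage with a stand-in potential (a real neural TRF uses the deep CNN of Fig. 1):
vocab = ["a", "b", "c"]
phi = lambda x: 0.5 * sum(len(w) for w in x)         # any real-valued potential
log_pi = {1: math.log(0.2), 2: math.log(0.3), 3: math.log(0.5)}
print(log_joint_prob(("a", "b"), phi, log_pi, vocab))
```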

Embedding and projection. First, each word x_i (i = 1, ..., l) in a sentence is mapped to an embedding vector e_i ∈ R^{d_e}. Then a projection layer with rectified linear unit (ReLU) activation is applied to each embedding vector to reduce the dimension, i.e.

    y_i = max{W_p e_i + b_p, 0}, i = 1, ..., l    (3)

where y_i ∈ R^{d_p} is the output of the projection layer of dimension d_p; W_p ∈ R^{d_p×d_e} and b_p ∈ R^{d_p} are the parameters.

CNN-bank. The outputs of the projection layer are fed into a CNN-bank module, which contains a set of 1-D convolutional filters with widths ranging from 1 to K. These filters explicitly model local contextual information (akin to modeling unigrams, bigrams, up to K-grams) [19]. Denote by F ∈ R^{d_p×k} a filter of width k, k = 1, ..., K, and by Y = [y_1, ..., y_l] ∈ R^{d_p×l} the output of the projection layer. We first symmetrically pad zeros to the beginning and end of Y to make it k−1 longer, denoted by Y′ ∈ R^{d_p×(l+k−1)}. Then the convolution is performed between Y′ and filter F, and the output feature map f ∈ R^l is given by

    f[i] = max{⟨Y′[:, i : i+k−1], F⟩, 0}, i = 1, ..., l    (4)

where f[i] is the i-th component of vector f, Y′[:, i : i+k−1] contains the i-th to (i+k−1)-th columns of Y′, and ⟨A, B⟩ is the Frobenius inner product. Such convolution with the above padding scheme is known as half convolution (see http://deeplearning.net/software/theano/library/tensor/nnet/conv.html). The output feature maps from multiple filters with varying widths are spliced together and followed by a max-pooling over time with width 2 and stride 1. Zeros are also padded before the max-pooling to preserve the time dimensionality. Suppose there are w filters for each filter width. The output of the CNN-bank module is Y_b ∈ R^{wK×l}.

CNN-stack. On top of the CNN-bank module, a CNN-stack module consists of a stack of n 1-D convolutional layers to further extract hierarchical features. The outputs of the convolutional layers are combined by a weighted summation, which is similar to the skip-connections in [21]. Let Y_s^j ∈ R^{d_s×l} denote the output of the j-th convolutional layer, j = 1, ..., n. Our experiments use d_s < wK to reduce the dimension and employ half convolution at each CNN layer. The output of the CNN-stack module Y_s ∈ R^{d_s×l} is:

    Y_s[:, i] = Σ_{j=1}^{n} max{0, a_j ∗ Y_s^j[:, i]}, i = 1, ..., l    (5)

where a_j ∈ R^{d_s} are the weights applied to the j-th layer and ∗ is the element-wise multiplication.

Summation over time. Finally, the potential function is defined to take the following value:

    ϕ(x^l; θ) = Σ_{i=1}^{l} λ^T Y_s[:, i] + c    (6)

where λ ∈ R^{d_s} and c ∈ R are the parameters. In summary, θ denotes the collection of all the parameters defined in the deep CNNs.

[Fig. 1: The deep CNN architecture used to define the potential function ϕ(x^l; θ), from bottom to top: word embedding, projection layer, CNN-bank (multiple convolutional filters with varying widths, followed by max-pooling with width 2 and stride 1), CNN-stack (CNN layers 1–3 with weighted summation), summation over time and a linear layer. Shadow areas denote the padded zeros.]
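The architecture above maps fairly directly onto code. The following PyTorch sketch is our own reading of eqs. (3)–(6), with the sizes of Table 3 used as defaults; it is an illustration under those assumptions, not the authors' implementation, and all class and variable names are invented.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNPotential(nn.Module):
    """Sketch of the deep CNN potential phi(x^l; theta) of Section 3 (eqs. (3)-(6))."""
    def __init__(self, vocab_size, d_e=256, d_p=128, K=10, w=128, d_s=128, n_stack=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_e)          # word embedding e_i
        self.proj = nn.Linear(d_e, d_p)                     # projection, eq. (3)
        # CNN-bank: one group of w filters per width k = 1..K
        self.bank = nn.ModuleList(
            [nn.Conv1d(d_p, w, kernel_size=k) for k in range(1, K + 1)])
        # CNN-stack: n layers of width-3 half convolutions
        self.stack = nn.ModuleList(
            [nn.Conv1d(w * K if j == 0 else d_s, d_s, kernel_size=3, padding=1)
             for j in range(n_stack)])
        self.a = nn.Parameter(torch.ones(n_stack, d_s, 1))  # skip weights a_j, eq. (5)
        self.lam = nn.Linear(d_s, 1, bias=False)            # lambda, eq. (6)
        self.c = nn.Parameter(torch.zeros(1))               # bias c, eq. (6)

    def forward(self, x):                                   # x: (batch, l) word ids
        y = F.relu(self.proj(self.embed(x))).transpose(1, 2)     # (batch, d_p, l)
        outs = []
        for k, conv in enumerate(self.bank, start=1):
            y_pad = F.pad(y, ((k - 1) // 2, k // 2))        # half convolution padding
            outs.append(F.relu(conv(y_pad)))                # eq. (4), length preserved
        y = torch.cat(outs, dim=1)                          # splice: (batch, w*K, l)
        y = F.max_pool1d(F.pad(y, (0, 1)), kernel_size=2, stride=1)  # width 2, stride 1
        h, y_s = y, 0
        for j, conv in enumerate(self.stack):
            h = conv(h)                                     # Y_s^j, output of layer j
            y_s = y_s + F.relu(self.a[j] * h)               # weighted summation, eq. (5)
        # summation over time plus linear layer, eq. (6): one potential per sentence
        return self.lam(y_s.transpose(1, 2)).squeeze(-1).sum(dim=1) + self.c
```

Calling `CNNPotential(vocab_size)(x)` on a batch of equal-length word-id sequences returns one potential value ϕ(x^l; θ) per sentence.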

4. MODEL LEARNING

A novel learning algorithm, called augmented SA (AugSA), has been developed to estimate both the model parameters and normalizing constants for discrete TRFs [5]. In this section, the AugSA algorithm is extended to train neural TRFs.

4.1. AugSA plus JSA

In AugSA, we introduce the following joint distribution of (l, x^l):

    p(l, x^l; θ, ζ) = π_l^0 p_l(x^l; θ, ζ) = (π_l^0 / (Z_1(θ) e^{ζ_l})) e^{ϕ(x^l; θ)}    (7)

where ζ = (ζ_1, ..., ζ_m)^T with ζ_1 = 0, and ζ_l is the hypothesized value of the log ratio of Z_l(θ) with respect to Z_1(θ), namely log{Z_l(θ)/Z_1(θ)}, l = 1, ..., m. Z_1(θ) is chosen as the reference value and can be calculated exactly. π_l^0 is the specified length probability used in model training. Note that we set the prior length probability π_l to the empirical length probability in inference.

Denote by D the training set and by D_l the collection of sentences of length l in the training set. The maximum likelihood estimation of the parameters θ and normalization constants ζ can be found by solving the following simultaneous equations [5]:

    E_D[∂ϕ/∂θ] − Σ_{l=1}^{m} (|D_l|/|D|) E_{p_l}[∂ϕ/∂θ] = 0,    (8)

    Σ_{x^l} p(l, x^l; θ, ζ) = π_l^0, l = 1, ..., m,    (9)

where |D_l| is the number of sentences in set D_l, |D| = Σ_{l=1}^{m} |D_l|, E_D is the empirical expectation on the training set D and E_{p_l} is the expectation with respect to the model distribution p_l(x^l; θ, ζ) in (7):

    E_D[∂ϕ/∂θ] = (1/|D|) Σ_{l=1}^{m} Σ_{x^l ∈ D_l} ∂ϕ(x^l; θ)/∂θ,

    E_{p_l}[∂ϕ/∂θ] = Σ_{x^l} p_l(x^l; θ, ζ) ∂ϕ(x^l; θ)/∂θ.

Exactly solving (8) and (9) is infeasible. AugSA is proposed to stochastically solve (8) and (9) in the SA framework [22], which iterates MCMC sampling and parameter update. The convergence of SA has been studied under various conditions [23, 24, 25]. The MCMC sampling in AugSA is implemented by the trans-dimensional mixture sampling (TransMS) algorithm [5] to simulate sentences of different dimensions from the joint distribution p(l, x^l; θ, ζ) in (7).
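In practice both expectations in (8) are replaced by sample averages: the first over a mini-batch of training sentences, the second over the sentences generated by TransMS, reweighted by π̃_l/π_l^0 (this anticipates the update (11) given below). A minimal sketch of that stochastic estimator, assuming a potential module like the hypothetical CNNPotential above, is:

```python
import torch

def neg_sll_surrogate(phi_model, data_batch, sample_batch, sample_weights):
    """Surrogate loss whose gradient w.r.t. theta is minus the stochastic version
    of eq. (8): mean of phi over training sentences minus a weighted mean of phi
    over sentences sampled from the model.  Each batch is a list of 1 x l
    LongTensors because the sentences have different lengths; sample_weights
    holds the ratios pi_tilde_l / pi0_l."""
    data_term = torch.stack([phi_model(x).squeeze(0) for x in data_batch]).mean()
    sample_term = torch.stack(
        [w * phi_model(x).squeeze(0)
         for w, x in zip(sample_weights, sample_batch)]).mean()
    return -(data_term - sample_term)

# One AugSA-style step (optimizer and batches are placeholders):
#   loss = neg_sll_surrogate(model, D_t, B_t, weights)
#   loss.backward(); adam_optimizer.step()      # Adam, as used in the paper
```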

The sampling operations in TransMS are the computational bottleneck in AugSA. Suppose that we use Gibbs sampling to simulate a sentence of a given length, which is the method used in [5] for training discrete TRFs. We need to calculate the conditional distribution p_l(x_i|x_{≠i}) of word x_i for each position i, given all the other words x_{≠i}. This is computationally expensive because calculating p_l(x_i|x_{≠i}) requires enumerating all the possible values of x_i ∈ V and computing the joint probability p_l(x_i, x_{≠i}) for each possible value, where V denotes the vocabulary.

In [5], word classing is introduced to accelerate sampling, which means that each word is assigned to a single class. Through applying Metropolis-Hastings (MH) within Gibbs sampling, we first sample the class by using a reduced model as the proposal, which includes only the features that depend on x_i through its class, and then sample the word. This reduces the computational cost from |V| to |V|/|C|, where |C| denotes the number of classes. However, the computation reduction from using word classing in neural TRFs is not as significant as in discrete TRFs, because the deep CNN potentials in neural TRFs involve a much larger context, which makes the sampling computation with the reduced model still expensive.

To apply AugSA to neural TRFs, we borrow the idea of joint stochastic approximation (JSA) [6], which has been used to successfully train deep generative models. The JSA strategy is to introduce an auxiliary distribution q(l, x^l; µ) with parameter µ to serve as the proposal for constructing the MCMC operator for the target distribution p(l, x^l; θ, ζ) in (7). The log-likelihood of the target distribution and the KL-divergence between the target distribution and the auxiliary distribution are jointly optimized. Therefore, the AugSA plus JSA algorithm is defined as stochastically solving the three simultaneous equations, namely (8) and (9) together with

    ∂/∂µ KL( p(l, x^l; θ, ζ) || q(l, x^l; µ) ) = 0    (10)

At each iteration, the parameter µ of the auxiliary distribution q(l, x^l; µ) is updated together with the parameters θ and normalization constants ζ, and q(l, x^l; µ) is used in TransMS as a proposal distribution (see Section 4.2 for details). In this paper, the auxiliary distribution is implemented by an LSTM RNN.

Moreover, several additional techniques are used to suit the use of nonlinear potentials and to improve the convergence for training neural TRFs. The first technique, called training set mini-batching [4], is that at each iteration, a mini-batch of sentences is randomly selected from the training set and the empirical expectation is calculated over this mini-batch. This is crucial for training neural TRFs because, unlike in discrete TRFs, the gradient of the nonlinear potential function ϕ with respect to θ depends on the parameters θ. Second, the Adam [7] optimizer is used to update the parameters θ, saving the computation cost of estimating the empirical variances.

    Input: training set D
    1: Init the parameters θ^(0) and µ^(0) and set ζ^(0) = (0, log|V|, 2 log|V|, ..., (m−1) log|V|), where |V| is the vocabulary size.
    2: for t = 1, 2, ..., t_max do
    3:   Randomly select K_D sentences from the training set, as D^(t)
    4:   Generate K_B sentences using TransMS in Section 4.2, as B^(t)
    5:   Compute θ^(t) based on (11)
    6:   Compute ζ^(t) based on (12) and (13)
    7:   Compute µ^(t) based on (14)
    8: end for

Fig. 2. The AugSA plus JSA algorithm for training neural TRFs.

The AugSA plus JSA algorithm is summarized in Fig. 2 and detailed as follows. At iteration t, we first randomly select K_D samples from the training set D, denoted by D^(t). Then TransMS (in Section 4.2) is performed to generate K_B samples, denoted by B^(t). The update for the parameters θ is:

    θ^(t) = θ^(t−1) + γ_{θ,t} Adam( E_{D^(t)}[∂ϕ/∂θ] − (1/K_B) Σ_{(l,x^l)∈B^(t)} (π̃_l/π_l^0) ∂ϕ/∂θ )    (11)

where Adam denotes the Adam optimizer, γ_{θ,t} is the learning rate for θ and π̃_l is the empirical probability of length l. Remarkably, the gradient of ϕ with respect to θ can be efficiently computed by backpropagation. The update for the normalization constants ζ is the same as in [5]:

    ζ^(t−1/2) = ζ^(t−1) + γ_{ζ,t} ( δ_1(B^(t))/π_1^0, ..., δ_m(B^(t))/π_m^0 ),    (12)

    ζ^(t) = ζ^(t−1/2) − ζ_1^(t−1/2),    (13)

where ζ_1^(t), the first element of ζ^(t), is set to 0 by (13), γ_{ζ,t} is the learning rate for ζ, and δ_l(B^(t)) is the proportion of length l appearing in B^(t):

    δ_l(B^(t)) = (1/K_B) Σ_{(j,x^j)∈B^(t)} 1(j = l).

The update for the parameters µ of the auxiliary distribution is:

    µ^(t) = µ^(t−1) + γ_{µ,t} Σ_{(l,x^l)∈B^(t)} ∂/∂µ log q(l, x^l; µ),    (14)

where γ_{µ,t} is the learning rate for µ.
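The ζ update in (12)–(13) is simple enough to write down directly. The snippet below is our illustration (names and the list indexing convention are ours): it nudges each log-normalizer estimate by the relative frequency of its length in the sampled batch and then re-anchors ζ_1 at zero.

```python
import math
from collections import Counter

def update_zeta(zeta, sampled_lengths, pi0, gamma_zeta):
    """Normalization-constant update of eqs. (12)-(13).
    zeta            : current log-ratio estimates; zeta[0] corresponds to l = 1
    sampled_lengths : lengths of the K_B sentences generated by TransMS (B^(t))
    pi0             : training length distribution pi0_l; pi0[0] corresponds to l = 1
    gamma_zeta      : learning rate gamma_{zeta,t}, e.g. t ** -0.2 as in Section 5.2"""
    K_B = len(sampled_lengths)
    counts = Counter(sampled_lengths)
    # eq. (12): move zeta_l by delta_l(B^(t)) / pi0_l
    half = [z + gamma_zeta * (counts.get(l + 1, 0) / K_B) / pi0[l]
            for l, z in enumerate(zeta)]
    # eq. (13): subtract the first element so that zeta_1 stays exactly 0
    return [z - half[0] for z in half]

# Example: maximum length m = 5, uniform pi0, a small sampled batch at step t = 100
zeta = [0.0, 9.2, 18.4, 27.6, 36.8]
print(update_zeta(zeta, [2, 2, 3, 5, 1, 2], [0.2] * 5, 100 ** -0.2))
```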
4.2. TransMS with an auxiliary distribution

The trans-dimensional mixture sampling (TransMS) algorithm proposed in [5] is extended for applying AugSA plus JSA. TransMS consists of two steps at each iteration: a local jump between dimensions and a Markov move for a given dimension. First, the auxiliary distribution q(l, x^l; µ) is introduced as the proposal distribution g(·|x^h) in both the local jump and the Markov move. Second, multiple-trial metropolis independence sampling (MTMIS) [26] is used to increase the acceptance rate. Let l^(t−1) and x^(t−1) denote the length and sequence before sampling at iteration t. The algorithm is described as follows.

Step I: Local jump. Assuming l^(t−1) = k, first we draw a new length j ∼ Γ(k, ·), where the jump distribution Γ(k, ·) is defined to be uniform in the neighborhood of k:

    Γ(k, j) = 1 / (min(k + r, m) − max(k − r, 1) + 1)    (15)

if |k − j| ≤ r, and 0 otherwise, where m is the maximum length and r is the jump range, which is set to be one in [5].

If j = k, we retain the observation, i.e. set l^(t) = l^(t−1) and x^(t) = x^(t−1), and perform the next step directly.

If j > k, we first draw a subsequence u^{j−k} of length j − k from a proposal distribution, u^{j−k} ∼ g(·|x^(t−1)). Then we set l^(t) = j and x^(t) = {x^(t−1), u^{j−k}} with probability

    min{ 1, (Γ(j, k)/Γ(k, j)) · p(j, {x^(t−1), u^{j−k}}; θ, ζ) / ( p(k, x^(t−1); θ, ζ) g(u^{j−k}|x^(t−1)) ) },    (16)

where {x^(t−1), u^{j−k}} denotes the sequence of length j whose first k elements are x^(t−1) and whose last j − k elements are u^{j−k}.
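The local jump is a standard Metropolis-Hastings step once Γ, p and g are available as functions. The sketch below is our paraphrase of (15)–(16) for the j ≥ k cases (the shrinking case in (17) below is symmetric); `log_p`, `sample_g` and `log_g` are assumed callables wrapping the unnormalized joint (7) and the LSTM proposal.

```python
import math
import random

def local_jump(x, log_p, sample_g, log_g, m, r=1):
    """One local-jump step of TransMS, eqs. (15)-(16), for the j >= k cases.
    x        : current sentence as a tuple of word ids, length k
    log_p    : log of the unnormalized joint p(l, x^l; theta, zeta) in (7)
    sample_g : draws a continuation of a requested length from the proposal g
    log_g    : log-probability of that continuation under g
    m, r     : maximum length and jump range"""
    k = len(x)
    lo, hi = max(k - r, 1), min(k + r, m)
    j = random.randint(lo, hi)                      # j ~ Gamma(k, .), eq. (15)
    if j <= k:
        return x                                    # j = k: keep; j < k: see eq. (17)
    u = sample_g(x, j - k)                          # propose j - k new words
    # Gamma(j, k) / Gamma(k, j) = |neighborhood of k| / |neighborhood of j|
    log_ratio = math.log(hi - lo + 1) - math.log(min(j + r, m) - max(j - r, 1) + 1)
    log_accept = log_ratio + log_p(x + u) - log_p(x) - log_g(u, x)
    if random.random() < math.exp(min(0.0, log_accept)):     # eq. (16)
        return x + u
    return x
```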

If j < k, we set l^(t) = j and x^(t) = x^(t−1)_{1:j} with probability

    min{ 1, (Γ(j, k)/Γ(k, j)) · p(j, x^(t−1)_{1:j}; θ, ζ) g(x^(t−1)_{j+1:k}|x^(t−1)_{1:j}) / p(k, x^(t−1); θ, ζ) },    (17)

where x^(t−1)_{i:j} denotes the subsequence of x^(t−1) from the i-th position to the j-th position.

Here, the proposal distribution g(·|x^h) is computed using the auxiliary distribution q(l, x^l; µ). For q(l, x^l; µ) implemented by an LSTM RNN as in this paper, calculating g(·|x^h) and sampling from g(·|x^h) can be performed recurrently.

Step II: Markov move. In this step, given the current sequence and fixing the length, we perform block Gibbs sampling from the first position to the last position with block size s. MTMIS is used with the same proposal distribution g(·|x^h) as in the local jump. In our experiments, the block size is s = 5 and the multiple-trial number is M = 10. Denote by (l, x^l) the current length and sequence after the local jump. For positions i = 1, s+1, 2s+1, ..., the Markov move proceeds as follows:

• Generate M i.i.d. samples u^s_(k) ∼ g(·|x^l_{1:i−1}), each of length s, k = 1, ..., M, and compute

    w(u^s_(k)) = p(l, {x^l_{1:i−1}, u^s_(k), x^l_{i+s:l}}; θ, ζ) / g(u^s_(k)|x^l_{1:i−1})

and W = Σ_{k=1}^{M} w(u^s_(k)), where u^s_(k) is a sequence of length s and {x^l_{1:i−1}, u^s_(k), x^l_{i+s:l}} denotes the concatenation of the three subsequences x^l_{1:i−1}, u^s_(k) and x^l_{i+s:l}.

• Draw u^s from the trial set {u^s_(1), ..., u^s_(M)} with probability proportional to w(u^s_(k)).

• Set x^(t+1) = {x^l_{1:i−1}, u^s, x^l_{i+s:l}} with probability

    min{ 1, W / (W − w(u^s) + w(x^l_{i:i+s−1})) }.
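One MTMIS block update can be written compactly once the unnormalized joint and the proposal are exposed as functions. The following is our illustrative paraphrase (all names are ours; weights are shifted by their maximum only for numerical stability):

```python
import math
import random

def mtmis_block_move(x, i, s, M, log_p, sample_g, log_g):
    """One MTMIS update of Step II: resample the block x[i:i+s] given the rest.
    log_p    : log of the unnormalized joint p(l, x^l; theta, zeta) in (7)
    sample_g : draws a length-s block from the proposal given the left context
    log_g    : log-probability of a block under that proposal
    Sequences are tuples of word ids."""
    left, old_block, right = x[:i], x[i:i + s], x[i + s:]
    # generate M i.i.d. trial blocks and their importance weights w(u)
    trials = [sample_g(left, s) for _ in range(M)]
    log_w = [log_p(left + u + right) - log_g(u, left) for u in trials]
    shift = max(log_w)                                   # for numerical stability
    w = [math.exp(lw - shift) for lw in log_w]
    W = sum(w)
    # draw one trial proportionally to w(u)
    idx = random.choices(range(M), weights=w, k=1)[0]
    u, w_u = trials[idx], w[idx]
    # accept with probability min{1, W / (W - w(u) + w(old_block))}
    w_old = math.exp(log_p(x) - log_g(old_block, left) - shift)
    if random.random() < min(1.0, W / (W - w_u + w_old)):
        return left + u + right
    return x
```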

5. EXPERIMENTS

5.1. Experiment setup and baseline LMs

In this section, we compare neural TRF LMs with different LMs on speech recognition. The LM training corpus is the Wall Street Journal (WSJ) portion of Penn Treebank (PTB). Sections 0-20 are used as the training set (about 930 K words), sections 21-22 as the development set (74 K) and sections 23-24 as the test set (82 K). The vocabulary is limited to 10 K words, including a special token ⟨UNK⟩ denoting words not in the vocabulary. This setting is the same as that used in other studies [2, 4, 5, 8]. For evaluation in terms of speech recognition WERs, various LMs obtained using the PTB training and development sets are applied to rescore the 1000-best lists from recognizing WSJ'92 test data (330 utterances). For each utterance, the 1000-best list of candidate sentences is generated by the first-pass recognition using the Kaldi toolkit [27] with a DNN-based acoustic model. The oracle WER of the 1000-best lists is 0.93%.

The baseline LMs include a 5-gram LM with modified Kneser-Ney smoothing [28] (denoted by "KN5"), and three LSTM LMs with 2 hidden layers and 200, 650, 1500 hidden units per layer respectively, which are called small, medium and large LSTMs in [8]. We reproduce the three LSTM LMs in [8] and use them in our rescoring experiments. The 5-gram LM is trained using the SRILM toolkit [29], which automatically adds the beginning-token ⟨s⟩ and end-token ⟨/s⟩ at the beginning and the end of each sentence. When applying the KN5 LM to rescoring, the beginning-token and end-token are also added to each sentence in the 1000-best list.

In contrast, for training LSTM LMs, a tricky practice [8] is to only add the end-token at the end of each sentence, and then concatenate all training sentences. This in fact treats the whole training corpus as a single long sentence. In rescoring, only the end-token is added to each sentence, like in training. But there are three strategies for how the initial hidden state is configured in rescoring:

1. preserve: the final hidden state (i.e. the hidden state which predicts the end-token) of the previous sentence is preserved and used to compute the initial state of the current sentence together with the end-token.

2. reset: the final state of the previous sentence is reset to zero.

3. shuffle: we first shuffle all the candidate sentences of all the testing utterances, and then use the first strategy. This eliminates the unfair use of information across sentences.

LSTMs          PPL    WER-p  WER-r  WER-s
LSTM-2×200     113.9  8.14   7.96   7.80
LSTM-2×650     84.1   7.59   7.66   7.85
LSTM-2×1500    78.7   7.32   7.36   8.18

Table 1. Comparison of three different rescoring strategies for LSTM LMs. "LSTM-2×200/650/1500" indicates the LSTM with 2 hidden layers and 200/650/1500 hidden units per layer. "PPL" denotes the perplexity on the PTB test set. "WER-p", "WER-r" and "WER-s" correspond to the preserve, reset and shuffle strategies respectively.

The WERs of the above strategies are shown in Table 1. It can be seen that the lowest WER is achieved by "LSTM-2×1500", with 2 hidden layers and 1500 hidden units, using the preserve strategy. When we shuffle the sentences, the WER increases significantly, from 7.32 to 8.18, which is even worse than the WER of "LSTM-2×200". This is because in the preserve strategy, the candidate sentences of each utterance are rescored successively. The final hidden state carries relevant information that benefits the prediction of the next candidate sentence, which belongs to the same testing utterance as the current candidate sentence. After shuffling, the relation between adjacent candidate sentences is broken, and the information preserved in the hidden states may mislead the prediction.

In the following, we use "WER-r", the WER obtained by the reset strategy, as the performance measure of the LSTM LMs, since the resulting WERs are independent of the processing order of the candidate sentences and stable. Moreover, this enables a fairer comparison with KN5 and neural TRF LMs, since they do not use any information across sentences.
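For concreteness, the reset strategy amounts to scoring every candidate from a fresh zero state with only the end-token appended. The sketch below is our illustration with a generic PyTorch-style LSTM LM; the interface `lstm_lm(inputs, state) -> (logits, state)` and feeding </s> as the first input token are our assumptions, not the paper's specification.

```python
import torch

def rescore_reset(lstm_lm, nbest, end_id):
    """Score an n-best list with the 'reset' strategy: every candidate starts
    from a zero hidden state, so no information leaks across sentences."""
    scores = []
    for words in nbest:                              # words: list of word ids
        ids = [end_id] + words + [end_id]            # </s> w1 ... wl </s>
        x = torch.tensor([ids[:-1]])                 # inputs
        y = torch.tensor([ids[1:]])                  # prediction targets
        logits, _ = lstm_lm(x, None)                 # None -> zero initial state
        logp = torch.log_softmax(logits, dim=-1)
        scores.append(logp.gather(-1, y.unsqueeze(-1)).sum().item())
    return scores
```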

5.2. Neural TRF LMs in speech recognition

The CNN configuration used in the neural TRF models is shown in Table 3. The AugSA plus JSA algorithm in Fig. 2 is used to train neural TRF LMs on the PTB training corpus. At each iteration, we randomly select K_D = 1000 sentences from the training corpus, and generate K_B = 100 sentences of various lengths using the TransMS algorithm described in Section 4.2. The auxiliary distribution q(l, x^l; µ) is defined as an LSTM LM with 1 hidden layer and 250 hidden units. The learning rates in (11), (12) and (14) are set as γ_{θ,t} = 1/(t + 10^4), γ_{ζ,t} = t^{−0.2} and γ_{µ,t} = 1.0. The length distribution π_l^0 is set as specified in [5].

Model              PPL     WER(%)  #param (M)  Training time            Inference time
KN5                141.2   8.78    2.3         22 seconds (1 CPU)       0.06 seconds (1 CPU)
LSTM-2×200         113.9   7.96    4.6         about 1.7 hours (1 GPU)  6.36 seconds (1 GPU)
LSTM-2×650         84.1    7.66    19.8        about 7.5 hours (1 GPU)  6.36 seconds (1 GPU)
LSTM-2×1500        78.7    7.36    66.0        about 1 day (1 GPU)      9.09 seconds (1 GPU)
discrete TRF [5]   ≥130    7.92    6.4         about 1 day (8 CPUs)     0.16 seconds (1 CPU)
neural TRF         ≥37.4   7.60    4.0         about 3 days (1 GPU)     0.40 seconds (1 GPU)
KN5+LSTM-2×1500    -       7.47    -           -                        -
TRF+LSTM-2×1500    -       7.17    -           -                        -

Table 2. Performance of various LMs. "PPL" is the perplexity on the PTB test set. "WER" is the word error rate on WSJ'92 test data. "#param" is the number of parameters (in millions). "+" denotes log-linear interpolation with equal weights of 0.5. For LSTMs, the WER is obtained using the reset strategy. "Inference time" denotes the average time of rescoring the 1000-best list for each utterance.

word embedding size   256
projection dimension  128
CNN-bank              cnn-k-128, with k ranging from 1 to 10
max-pooling           width 2 and stride 1
CNN-stack             cnn-3-128 → cnn-3-128 → cnn-3-128

Table 3. CNN configuration in neural TRFs. "cnn-k-n" denotes a 1-D CNN with filter width k and output dimension n. "A → B" denotes that the output of layer A is fed into layer B.

All the parameters of the CNN and LSTM are initialized randomly within an interval from -0.1 to 0.1, except for the word embedding of the CNN, which is initialized by running the word2vec toolkit [30] on the PTB training set and updated during training. We stop the training once the smoothed log-likelihood (moving average over 1000 iterations) on the PTB development set does not increase significantly, resulting in 33,000 iterations (about 800 epochs). The negative log-likelihood of the TRF LMs on the PTB test set and the KL-divergence between the model distribution p(l, x^l; θ, ζ) and the auxiliary distribution q(l, x^l; µ) are shown in Fig. 3.

[Fig. 3: (a) The negative log-likelihood on the PTB test set (neural TRF, with KN5, LSTM-2x200 and LSTM-2x650 shown for reference) and (b) the KL-divergence KL(p||q), at each training epoch.]

As the model parameters of neural TRF LMs are estimated stochastically, we cache 10 sets of model parameters from the most recent 10 training epochs. After the training is stopped, we calculate the PPLs over the PTB test set and the LM scores of the 1000-best lists using the 10 cached models. The resulting 10 PPLs are averaged as the final PPL, and the LM scores from the 10 models are averaged and used in rescoring, giving the final WER. This is similar to model combination using sentence-level log-linear interpolation, which reduces the variance of the stochastically estimated models.

The PPLs and WERs of various LMs are shown in Table 2, which prompts several comments. First, as studied in [5], AugSA tends to underestimate the perplexity on the test set; the reported PPLs in Table 2 should be taken as lower bounds of the true PPLs of the TRF models. Second, the neural TRF LM achieves a WER of 7.60, which outperforms the discrete TRF using the features "w+c+ws+cs+wsh+csh+tied" [5] with a relative reduction of 4.0%. Compared with "LSTM-2x200", which has a similar model size, the neural TRF LM achieves a relative WER reduction of 4.5%. Compared with "LSTM-2x650", the neural TRF LM achieves a slightly lower WER with only a fifth of the parameters. The large "LSTM-2x1500" performs slightly better than the neural TRF but with 16.5 times more parameters. Third, we examine how neural TRF LMs and LSTM LMs are complementary to each other. The probabilities of each sentence are log-linearly combined with equal interpolation weights of 0.5. The interpolation of the neural TRF and "LSTM-2x1500" further reduces the WER and achieves the lowest WER of 7.17.

Moreover, inference with neural TRF LMs is much faster than with LSTM LMs. The time cost of using the ten cached neural TRF models to rescore the 1000-best list for each utterance is about 0.4 seconds. Compared with LSTM LMs, the inference of neural TRFs is about 16 times faster than "LSTM-2x200" and "LSTM-2x650", and about 23 times faster than "LSTM-2x1500".
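The score averaging over the 10 cached models and the log-linear interpolation with an LSTM both reduce to simple arithmetic on per-sentence log-scores. The sketch below is our illustration (function and variable names are ours).

```python
def combine_scores(trf_scores_per_model, lstm_scores, weight=0.5):
    """Average the sentence log-scores of the cached neural TRF models, then
    log-linearly interpolate with LSTM log-scores using equal weights (0.5),
    as in the 'TRF+LSTM-2x1500' row of Table 2."""
    n_models = len(trf_scores_per_model)
    n_sent = len(lstm_scores)
    # average over the cached models, sentence by sentence
    trf_avg = [sum(m[i] for m in trf_scores_per_model) / n_models
               for i in range(n_sent)]
    # equal-weight log-linear interpolation of the two LMs
    return [weight * t + (1.0 - weight) * l for t, l in zip(trf_avg, lstm_scores)]

# Example: two cached TRF models, three candidate sentences
print(combine_scores([[-10.0, -12.5, -9.8], [-10.4, -12.1, -10.0]],
                     [-11.0, -13.0, -9.5]))
```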
6. CONCLUSION

In this paper, we further study the TRF approach to language modeling and define neural TRF LMs that combine the advantages of both NNs and TRFs. The following contributions enable us to successfully train neural TRFs.

• Although the nonlinear potential functions could be implemented by arbitrary NNs (FNNs or RNNs), the design of the deep CNN architecture in this paper is found to be important for efficient training.

• The proposed AugSA plus JSA training algorithm is crucial for making the learning of neural TRFs feasible.

• Several additional techniques are found to be useful for training neural TRFs, including wider local jumps in MCMC, the Adam optimizer, and training set mini-batching.

It is worth pointing out that, apart from this success in language modeling, neural TRF models can also be applied to other sequential and trans-dimensional data modeling tasks in general, and also to discriminative modeling tasks, e.g. extending current "CRFs+NNs" models. For language modeling, integrating richer nonlinear and structured features is an important future direction.

7. REFERENCES

[1] Holger Schwenk, "Continuous space language models," Computer Speech & Language, vol. 21, pp. 492–518, 2007.

[2] Tomas Mikolov, Stefan Kombrink, Lukas Burget, Jan H. Cernocky, and Sanjeev Khudanpur, "Extensions of recurrent neural network language model," in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011.

[3] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney, "LSTM neural networks for language modeling," in Proc. Interspeech, 2012, pp. 194–197.

[4] Bin Wang, Zhijian Ou, and Zhiqiang Tan, "Trans-dimensional random fields for language modeling," in Proc. Annu. Meeting of the Association for Computational Linguistics (ACL), 2015.

[5] Bin Wang, Zhijian Ou, and Zhiqiang Tan, "Learning trans-dimensional random fields with applications to language modeling," IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2017.

[6] Haotian Xu and Zhijian Ou, "Joint stochastic approximation learning of Helmholtz machines," International Conference on Learning Representations (ICLR) Workshop Track, 2016.

[7] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv:1412.6980 [cs.LG], 2014.

[8] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals, "Recurrent neural network regularization," arXiv preprint arXiv:1409.2329, 2014.

[9] Ronald Rosenfeld, Stanley F. Chen, and Xiaojin Zhu, "Whole-sentence exponential language models: a vehicle for linguistic-statistical integration," Computer Speech & Language, vol. 15, pp. 55–73, 2001.

[10] John Lafferty, Andrew McCallum, and Fernando Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," Proc. International Conference on Machine Learning (ICML), pp. 282–289, 2001.

[11] Jian Peng, Liefeng Bo, and Jinbo Xu, "Conditional neural fields," in Advances in Neural Information Processing Systems, 2009, pp. 1419–1427.

[12] Thierry Artieres et al., "Neural conditional random fields," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.

[13] Rohit Prabhavalkar and Eric Fosler-Lussier, "Backpropagation training for multilayer conditional random field based phone recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, 2010, pp. 5534–5537.

[14] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa, "Natural language processing (almost) from scratch," Journal of Machine Learning Research, vol. 12, no. Aug, pp. 2493–2537, 2011.

[15] Kaisheng Yao, Baolin Peng, Geoffrey Zweig, Dong Yu, Xiaolong Li, and Feng Gao, "Recurrent conditional random field for language understanding," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, 2014.

[16] Xuezhe Ma and Eduard Hovy, "End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF," arXiv preprint arXiv:1603.01354, 2016.

[17] Jianwen Xie, Yang Lu, Song-Chun Zhu, and Yingnian Wu, "A theory of generative convnet," in International Conference on Machine Learning, 2016.

[18] Ngoc-Quan Pham, German Kruszewski, and Gemma Boleda, "Convolutional neural network language models," in Proc. of EMNLP, 2016.

[19] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush, "Character-aware neural language models," in Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[20] Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier, "Language modeling with gated convolutional networks," arXiv preprint arXiv:1612.08083, 2016.

[21] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," CoRR abs/1609.03499, 2016.

[22] Herbert Robbins and Sutton Monro, "A stochastic approximation method," The Annals of Mathematical Statistics, pp. 400–407, 1951.

[23] Albert Benveniste, Michel Métivier, and Pierre Priouret, Adaptive Algorithms and Stochastic Approximations, New York: Springer, 1990.

[24] Hanfu Chen, Stochastic Approximation and Its Applications, Springer Science & Business Media, 2002.

[25] Zhiqiang Tan, "Optimally adjusted mixture sampling and locally weighted histogram analysis," Journal of Computational and Graphical Statistics, vol. 26, pp. 54–65, 2017.

[26] Jun S. Liu, Monte Carlo Strategies in Scientific Computing, Springer Science & Business Media, 2008.

[27] "Kaldi," http://kaldi.sourceforge.net/.

[28] Stanley F. Chen and Joshua Goodman, "An empirical study of smoothing techniques for language modeling," Computer Speech & Language, vol. 13, pp. 359–394, 1999.

[29] "SRILM - the SRI language modeling toolkit," http://www.speech.sri.com/projects/srilm/.

[30] "word2vec," https://code.google.com/archive/p/word2vec.
