UNIVERSITY of CALIFORNIA Los Angeles Neural Network Based
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Arxiv:1807.06441V1 [Eess.AS] 12 Jul 2018 Etr Saln Hr-Emmmr LT)Ta a Endesigne Been Has That (LSTM) Memory Popu Short-Term Most Long the a (Rnns)
A Comparison of Adaptation Techniques and Recurrent Neural Network Architectures Jan Vanˇek[0000−0002−2639−6731], Josef Mich´alek[0000−0001−7757−3163], Jan Zelinka[0000−0001−6834−9178], and Josef Psutka[0000−0002−0764−3207] University of West Bohemia Univerzitn´ı8, 301 00 Pilsen, Czech Republic {vanekyj,orcus,zelinka,psutka}@kky.zcu.cz Abstract. Recently, recurrent neural networks have become state-of- the-art in acoustic modeling for automatic speech recognition. The long short-term memory (LSTM) units are the most popular ones. However, alternative units like gated recurrent unit (GRU) and its modifications outperformed LSTM in some publications. In this paper, we compared five neural network (NN) architectures with various adaptation and fea- ture normalization techniques. We have evaluated feature-space maxi- mum likelihood linear regression, five variants of i-vector adaptation and two variants of cepstral mean normalization. The most adaptation and normalization techniques were developed for feed-forward NNs and, ac- cording to results in this paper, not all of them worked also with RNNs. For experiments, we have chosen a well known and available TIMIT phone recognition task. The phone recognition is much more sensitive to the quality of AM than large vocabulary task with a complex language model. Also, we published the open-source scripts to easily replicate the results and to help continue the development. Keywords: neural networks · acoustic model · TIMIT · LSTM · GRU · phone recognition · adaptation · i-vectors 1 Introduction Neural Networks (NNs) and deep NNs (DNNs) became dominant in the field of the acoustic modeling several years ago. Simple feed-forward (FF) DNNs were faded away in recent years. -
A Simple RNN-Plus-Highway Network for Statistical Parametric Speech Synthesis
ISSN 1346-5597 NII Technical Report A simple RNN-plus-highway network for statistical parametric speech synthesis Xin Wang, Shinji Takaki, Junichi Yamagishi NII-2017-003E Apr. 2017 A simple RNN-plus-highway network for statistical parametric speech synthesis Xin Wang, Shinji Takaki, Junichi Yamagishi Abstract In this report, we proposes a neural network structure that combines a recurrent neural network (RNN) and a deep highway network. Com- pared with the highway RNN structures proposed in other studies, the one proposed in this study is simpler since it only concatenates a highway network after a pre-trained RNN. The main idea is to use the `iterative unrolled estimation' of a highway network to finely change the output from the RNN. The experiments on the proposed network structure with a baseline RNN and 7 highway blocks demonstrated that this network per- formed relatively better than a deep RNN network with a similar mode size. Furthermore, it took less than half the training time of the deep RNN. 1 Introduction Statistical parametric speech synthesis is widely used in text-to-speech (TTS) synthesis. We assume a TTS system using a pipeline structure in which a front- end derives the information on the pronunciation and prosody of the input text, then the back-end generates the acoustic features based on the output of the front-end module. While various approaches can be used for the back-end acoustic modeling, we focus on the neural-network (NN)-based acoustic models. Suppose the output from the front-end is a sequence of textual feature vectors x1:T = [x1; ··· ; xT ] in T frames, where each xt may include the phoneme identity, pitch accent, and other binary or continuous-valued linguistic features. -
Residual LSTM: Design of a Deep Recurrent Architecture for Distant Speech Recognition
INTERSPEECH 2017 August 20–24, 2017, Stockholm, Sweden Residual LSTM: Design of a Deep Recurrent Architecture for Distant Speech Recognition Jaeyoung Kim1, Mostafa El-Khamy1, Jungwon Lee1 1Samsung Semiconductor, Inc. 4921 Directors Place, San Diego, CA, USA [email protected], [email protected], [email protected] Abstract train more than 100 convolutional layers for image classifica- tion and detection. The key insight in the residual network is to In this paper, a novel architecture for a deep recurrent neural provide a shortcut path between layers that can be used for an network, residual LSTM is introduced. A plain LSTM has an additional gradient path. Highway network [18] is an another internal memory cell that can learn long term dependencies of way of implementing a shortcut path in a feed-forward neural sequential data. It also provides a temporal shortcut path to network. [18] presented successful MNIST training results with avoid vanishing or exploding gradients in the temporal domain. 100 layers. The residual LSTM provides an additional spatial shortcut path Shortcut paths have also been investigated for RNN and from lower layers for efficient training of deep networks with LSTM networks. The maximum entropy RNN (ME-RNN) multiple LSTM layers. Compared with the previous work, high- model [19] has direct connections between the input and out- way LSTM, residual LSTM separates a spatial shortcut path put layers of an RNN layer. Although limited to RNN net- with temporal one by using output layers, which can help to works with a single hidden layer, the perplexity improved by avoid a conflict between spatial and temporal-domain gradient training the direct connections as part of the whole network. -
Dense Recurrent Neural Network with Attention Gate
Under review as a conference paper at ICLR 2018 DENSE RECURRENT NEURAL NETWORK WITH ATTENTION GATE Anonymous authors Paper under double-blind review ABSTRACT We propose the dense RNN, which has fully connections from each hidden state directly to multiple preceding hidden states of all layers. As the density of the connection increases, the number of paths through which the gradient flows is increased. The magnitude of gradients is also increased, which helps to prevent the vanishing gradient problem in time. Larger gradients, however, may cause exploding gradient problem. To complement the trade-off between two problems, we propose an attention gate, which controls the amounts of gradient flows. We describe the relation between the attention gate and the gradient flows by approx- imation. The experiment on the language modeling using Penn Treebank corpus shows dense connections with the attention gate improve the model’s performance of the RNN. 1 INTRODUCTION In order to analyze sequential data, it is important to choose an appropriate model to represent the data. Recurrent neural network (RNN), as one of the model capturing sequential data, has been applied to many problems such as natural language (Mikolov et al., 2013), machine translation (Bahdanau et al., 2014), speech recognition (Graves et al., 2013). There are two main research issues to improve the RNNs performance: 1) vanishing and exploding gradient problems and 2) regularization. The vanishing and exploding gradient problems occur as the sequential data has long-term depen- dency (Hochreiter, 1998; Pascanu et al., 2013). One of the solutions is to add gate functions such as the long short-term memory (LSTM) and gated recurrent unit (GRU). -
LSTM, GRU, Highway and a Bit of Attention: an Empirical Overview for Language Modeling in Speech Recognition
INTERSPEECH 2016 September 8–12, 2016, San Francisco, USA LSTM, GRU, Highway and a Bit of Attention: An Empirical Overview for Language Modeling in Speech Recognition Kazuki Irie, Zoltan´ Tuske,¨ Tamer Alkhouli, Ralf Schluter,¨ Hermann Ney Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Aachen, Germany firie,tuske,alkhouli,schlueter,[email protected] Abstract The success of the LSTM has opened room for creativity in designing neural networks with multiplicative gates. While ear- Popularized by the long short-term memory (LSTM), multi- lier works [12] motivated multiplications to make higher-order plicative gates have become a standard means to design arti- neural networks, recent works use them as a means to control ficial neural networks with intentionally organized information the information flow inside the network. As a result, many con- flow. Notable examples of such architectures include gated re- cepts have been emerged. The gated recurrent unit [13] was current units (GRU) and highway networks. In this work, we proposed as a simpler alternative to the LSTM. The highway first focus on the evaluation of each of the classical gated ar- network [14, 15] has gates to ensure the unobstructed informa- chitectures for language modeling for large vocabulary speech tion flow across the depth of the network. Also, more elemen- recognition. Namely, we evaluate the highway network, lateral tary gating can be found in tensor networks [16], also known as network, LSTM and GRU. Furthermore, the motivation under- the lateral network [17]. lying the highway network also applies to LSTM and GRU. An The objective of this paper is to evaluate the effectiveness of extension specific to the LSTM has been recently proposed with these concepts for language modeling with application to large an additional highway connection between the memory cells of vocabulary speech recognition. -
Articulatory Information and Multiview Features for Large Vocabulary Continuous Speech Recognition
ARTICULATORY INFORMATION AND MULTIVIEW FEATURES FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION Vikramjit Mitra*, Wen Wang, Chris Bartels, Horacio Franco, Dimitra Vergyri University of Maryland, College Park, MD, USA Speech Technology and Research Laboratory, SRI International, Menlo Park, CA, USA [email protected], {wen.wang, chris.bartels, horacio.franco, dimitra.vergyri}@sri.com ABSTRACT acoustic conditions. Advanced variants of DNNs, such as convolutional neural nets (CNNs) [12], recurrent neural This paper explores the use of multi-view features and their nets (RNNs) [13], long short-term memory nets (LSTMs) discriminative transforms in a convolutional deep neural [14], time-delay neural nets (TDNNs) [15, 29], VGG-nets network (CNN) architecture for a continuous large [8], have significantly improved recognition performance, vocabulary speech recognition task. Mel-filterbank energies bringing them closer to human performance [9]. Both and perceptually motivated forced damped oscillator abundance of data and sophistication of deep learning coefficient (DOC) features are used after feature-space algorithms have symbiotically contributed to the maximum-likelihood linear regression (fMLLR) advancement of speech recognition performance. The role transforms, which are combined and fed as a multi-view of acoustic features has not been explored in comparable feature to a single CNN acoustic model. Use of multi-view detail, and their potential contribution to performance gains feature representation demonstrated significant reduction in is unknown. This paper focuses on acoustic features and word error rates (WERs) compared to the use of individual investigates how their selection improves recognition features by themselves. In addition, when articulatory performance using benchmark training datasets: information was used as an additional input to a fused deep Switchboard and Fisher, when evaluated on the NIST 2000 neural network (DNN) and CNN acoustic model, it was CTS test set [2]. -
Long Short-Term Memory Recurrent Neural Network for Short-Term Freeway Traffic Speed Prediction
LONG SHORT-TERM MEMORY RECURRENT NEURAL NETWORK FOR SHORT-TERM FREEWAY TRAFFIC SPEED PREDICTION Junxiong Huang A thesis submitted in partial fulfillment of the requirements for the degree of: Master of Science From the University of Wisconsin – Madison Graduate School Department of Civil and Environmental Engineering April 2018 ii EXECUTIVE SUMMARY Traffic congestions on Riverside Freeway, California, SR 91 usually happen on weekdays from 0:00 to 23:55 of January 3, 2011 to March 16, 2012. Traffic congestions can be detected from average mainline speed, the free flow speed fluctuates around 60 to 70 mph, the congestion speed fluctuate around 30 to 40 mph. This research aims at predicting average mainline speed of the segment which is downstream to the merging road of a mainline and a ramp. The first objective is to determine the independent traffic features which can be used into the experiments, the second objective is to do data processing in order to form the data instances for the experiments, the third objective is to determine the predicted average speed with different time lags and to do the prediction, the fourth objective is to evaluate the performance of predicted average mainline speed. This research data source is the Caltrans Performance Measurement System (PeMS), which is collected in real-time from over 39000 individual detectors. These sensors span the freeway system across all major metropolitan areas of the State of California. PeMS is also an Archived Data User Service (ADUS) that provides over ten years of data for historical analysis. It integrates a wide variety of information from Caltrans and other local agency systems includes traffic detectors, census traffic counts, incidents, vehicle classification, lane closures, weight-in-motion, toll tags and roadway inventory. -
Towards Robust Combined Deep Architecture for Speech Recognition : Experiments on TIMIT
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 4, 2020 Towards Robust Combined Deep Architecture for Speech Recognition : Experiments on TIMIT Hinda DRIDI1, Kais OUNI2 Research Laboratory Smart Electricity & ICT, SE&ICT Lab, LR18ES44 National Engineering School of Carthage, ENICarthage University of Carthage, Tunis, Tunisia Abstract—Over the last years, many researchers have getting ideal ASR systems. Thus, we should open the most engaged in improving accuracies on Automatic Speech relevant research topics to attain the principal goal of Recognition (ASR) task by using deep learning. In state-of-the- Automatic Speech Recognition. In fact, recognizing isolated art speech recognizers, both Long Short-Term Memory (LSTM) words is not a hard task, but recognizing continuous speech is a and Gated Recurrent Unit (GRU) based Reccurent Neural real challenge. Any ASR system has two parts: the language Network (RNN) have achieved improved performances model and the acoustic model. For small vocabularies, compared to Convolutional Neural Network (CNN) and Deep modeling acoustics of individual words is may be easily done Neural Network (DNN). Due to the strong complementarity of and we may get good speech recognition rates. However, for CNN, LSTM-RNN and DNN, they may be combined in one large vocabularies, developing successful ASR systems with architecture called Convolutional Long Short-Term Memory, good speech recognition rates has become an active issue for Deep Neural Network (CLDNN). Similarly we propose to combine CNN, GRU-RNN and DNN in a single deep architecture researchers. As vocabulary size grows, we will not be able to called Convolutional Gated Recurrent Unit, Deep Neural get enough spoken samples of all words and we should model Network (CGDNN). -
Arxiv:1805.04699V4 [Cs.CL] 13 Jun 2019
TED-LIUM 3: Twice as Much Data and Corpus Repartition for Experiments on Speaker Adaptation Fran¸coisHernandez1, Vincent Nguyen1, Sahar Ghannay2, Natalia Tomashenko2, and Yannick Est`eve2 1 Ubiqus, Paris, France [email protected] https://www.ubiqus.com 2 LIUM, University of Le Mans, France [email protected] https://lium.univ-lemans.fr/ Abstract. In this paper, we present TED-LIUM release 3 corpus3 ded- icated to speech recognition in English, which multiplies the available data to train acoustic models in comparison with TED-LIUM 2, by a factor of more than two. We present the recent development on Auto- matic Speech Recognition (ASR) systems in comparison with the two previous releases of the TED-LIUM Corpus from 2012 and 2014. We demonstrate that, passing from 207 to 452 hours of transcribed speech training data is really more useful for end-to-end ASR systems than for HMM-based state-of-the-art ones. This is the case even if the HMM- based ASR system still outperforms the end-to-end ASR system when the size of audio training data is 452 hours, with a Word Error Rate (WER) of 6.7% and 13.7%, respectively. Finally, we propose two repar- titions of the TED-LIUM release 3 corpus: the legacy repartition that is the same as that existing in release 2, and a new repartition, calibrated and designed to make experiments on speaker adaptation. Similar to the two first releases, TED-LIUM 3 corpus will be freely available for the research community. Keywords: Speech recognition · Opensource corpus · Deep learning · Speaker adaptation · TED-LIUM. -
Recurrent Highway Networks
Recurrent Highway Networks Julian Georg Zilly * 1 Rupesh Kumar Srivastava * 2 Jan Koutník 2 Jürgen Schmidhuber 2 Abstract layers usually do not take advantage of depth (Pascanu et al., 2013). For example, the state update from one time step to Many sequential processing tasks require com- the next is typically modeled using a single trainable linear plex nonlinear transition functions from one step transformation followed by a non-linearity. to the next. However, recurrent neural networks with "deep" transition functions remain difficult Unfortunately, increased depth represents a challenge when to train, even when using Long Short-Term Mem- neural network parameters are optimized by means of error ory (LSTM) networks. We introduce a novel the- backpropagation (Linnainmaa, 1970; 1976; Werbos, 1982). oretical analysis of recurrent networks based on Deep networks suffer from what are commonly referred to Geršgorin’s circle theorem that illuminates several as the vanishing and exploding gradient problems (Hochre- modeling and optimization issues and improves iter, 1991; Bengio et al., 1994; Hochreiter et al., 2001), since our understanding of the LSTM cell. Based on the magnitude of the gradients may shrink or explode expo- this analysis we propose Recurrent Highway Net- nentially during backpropagation. These training difficulties works, which extend the LSTM architecture to al- were first studied in the context of standard RNNs where low step-to-step transition depths larger than one. the depth through time is proportional to the length of input Several language modeling experiments demon- sequence, which may have arbitrary size. The widely used strate that the proposed architecture results in pow- Long Short-Term Memory (LSTM; Hochreiter & Schmid- erful and efficient models. -
A 8.93 TOPS/W LSTM Recurrent Neural Network Accelerator Featuring Hierarchical Coarse-Grain Sparsity with All Parameters Stored On-Chip
A 8.93 TOPS/W LSTM Recurrent Neural Network Accelerator Featuring Hierarchical Coarse-Grain Sparsity with All Parameters Stored On-Chip Deepak Kadetotad, Visar Berisha, Chaitali Chakrabarti, Jae-sun Seo School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85287, USA {deepak.kadetotad, visar, chaitali, jaesun.seo}@asu.edu Abstract —Long short-term memory (LSTM) networks are 1H[W 5HFXUUHQW /D\HU /D\HU widely used for speech applications but pose difficulties for KW /670 KW ࢚ ൌ࣌ሺࢃ࢚࢞࢞ ࢃࢎࢎ࢚െ ࢈ሻ efficient implementation on hardware due to large weight storage RW sig- E output gate • + R &HOO moid requirements. We present an energy-efficient LSTM recurrent ࢌ࢚ ൌ࣌ሺࢃ࢞ࢌ࢚࢞ ࢃࢎࢌࢎ࢚െ ࢈ࢌሻ neural network (RNN) accelerator, featuring an algorithm- [W tanh ࢚ ൌ࣌ሺࢃ࢚࢞࢞ ࢃࢎࢎ࢚െ ࢈ሻ KW forget gate hardware co-optimized memory compression technique called &W memory cell ̱ sig- • &W ࢉ࢚ ൌ ܜ܉ܖܐሺࢃ࢞ࢉ࢚࢞ ࢃࢎࢉࢎ࢚െ ࢈ࢉሻ EI + hierarchical coarse-grain sparsity (HCGS). Aided by HCGS- moid K + input gate W ̱ ࢉ࢚ ൌࢌ࢚ ήࢉ࢚െ ࢚ ήࢉ࢚ based block-wise recursive weight compression, we demonstrate [W sig- • LW EL moid + × 8QZHLJKWHG LSTM networks with up to 16 fewer weights while achieving a ࢎ࢚ ൌ࢚ ήܜ܉ܖܐሺࢉ࢚ሻ &RQQHFWLRQ & W [W minimal accuracy loss. The prototype chip fabricated in 65nm LP :HLJKWHG candidate &RQQHFWLRQ tanh memory ࢎ࢚െKLGGHQYHFWRU CMOS achieves 8.93/7.22 TOPS/W for 2-/3-layer LSTM RNNs &RQQHFWLRQ ࢞ LQSXWYHFWRU ZLWKWLPHODJ ࢚ trained with HCGS for TIMIT/TED-LIUM corpora. + ZHLJKWPDWULFHVכࢃࢎכࢃ࢞ KW [W Index Terms—long short-term memory, speech recognition, EF hardware accelerator, weight compression, structured sparsity Fig. 1. Illustration of LSTM cell with computation equations. weights by 16× with minimal accuracy loss. -
Automatic Speech Recognition with Deep Neural Networks for Impaired Speech
Automatic Speech Recognition with Deep Neural Networks for Impaired Speech Cristina Espa~na-Bonet1;2 and Jos´eA. R. Fonollosa1 1 TALP Research Center, Universitat Polit`ecnicade Catalunya, Barcelona, Spain 2 Universit¨atdes Saarlandes, Saarbr¨ucken, Germany [email protected], [email protected] Abstract. Automatic Speech Recognition has reached almost human performance in some controlled scenarios. However, recognition of im- paired speech is a difficult task for two main reasons: data is (i) scarce and (ii) heterogeneous. In this work we train different architectures on a database of dysarthric speech. A comparison between architectures shows that, even with a small database, hybrid DNN-HMM models out- perform classical GMM-HMM according to word error rate measures. A DNN is able to improve the recognition word error rate a 13% for sub- jects with dysarthria with respect to the best classical architecture. This improvement is higher than the one given by other deep neural networks such as CNNs, TDNNs and LSTMs. All the experiments have been done with the Kaldi toolkit for speech recognition for which we have adapted several recipes to deal with dysarthric speech and work on the TORGO database. These recipes are publicly available. Keywords: speech recognition, speaker adaptation, deep learning, neu- ral networks, dysarthria, Kaldi 1 Introduction Automatic speech recognition (ASR) consists on automatically transcribing voice into text. It is not an easy task: one has do deal with noise, differences among speakers and spontaneous speech phenomena among others. For some controlled scenarios where one can minimise the effect of these phenomena, ASR approaches or exceeds the accuracy of humans on several benchmarks [2, 19].