A 8.93 TOPS/W LSTM Recurrent Neural Network Accelerator Featuring Hierarchical Coarse-Grain Sparsity with All Parameters Stored On-Chip

Total Page:16

File Type:pdf, Size:1020Kb

A 8.93 TOPS/W LSTM Recurrent Neural Network Accelerator Featuring Hierarchical Coarse-Grain Sparsity with All Parameters Stored On-Chip A 8.93 TOPS/W LSTM Recurrent Neural Network Accelerator Featuring Hierarchical Coarse-Grain Sparsity with All Parameters Stored On-Chip Deepak Kadetotad, Visar Berisha, Chaitali Chakrabarti, Jae-sun Seo School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85287, USA {deepak.kadetotad, visar, chaitali, jaesun.seo}@asu.edu Abstract —Long short-term memory (LSTM) networks are 1H[W 5HFXUUHQW /D\HU /D\HU widely used for speech applications but pose difficulties for KW /670 KW ࢏࢚ ൌ࣌ሺࢃ࢞࢏࢚࢞ ൅ࢃࢎ࢏ࢎ࢚െ૚ ൅࢈࢏ሻ efficient implementation on hardware due to large weight storage RW sig- E output gate • + R &HOO moid requirements. We present an energy-efficient LSTM recurrent ࢌ࢚ ൌ࣌ሺࢃ࢞ࢌ࢚࢞ ൅ࢃࢎࢌࢎ࢚െ૚ ൅࢈ࢌሻ neural network (RNN) accelerator, featuring an algorithm- [W tanh ࢕࢚ ൌ࣌ሺࢃ࢞࢕࢚࢞ ൅ࢃࢎ࢕ࢎ࢚െ૚ ൅࢈࢕ሻ KW forget gate hardware co-optimized memory compression technique called &W memory cell ̱ sig- • &W ࢉ࢚ ൌ ܜ܉ܖܐሺࢃ࢞ࢉ࢚࢞ ൅ࢃࢎࢉࢎ࢚െ૚ ൅࢈ࢉሻ EI + hierarchical coarse-grain sparsity (HCGS). Aided by HCGS- moid K + input gate W ̱ ࢉ࢚ ൌࢌ࢚ ήࢉ࢚െ૚ ൅࢏࢚ ήࢉ࢚ based block-wise recursive weight compression, we demonstrate [W sig- • LW EL moid + × 8QZHLJKWHG LSTM networks with up to 16 fewer weights while achieving a ࢎ࢚ ൌ࢕࢚ ήܜ܉ܖܐሺࢉ࢚ሻ &RQQHFWLRQ & W [W minimal accuracy loss. The prototype chip fabricated in 65nm LP :HLJKWHG candidate &RQQHFWLRQ tanh memory ࢎ࢚െ૚KLGGHQYHFWRU CMOS achieves 8.93/7.22 TOPS/W for 2-/3-layer LSTM RNNs &RQQHFWLRQ ࢞ LQSXWYHFWRU ZLWKWLPHODJ ࢚ trained with HCGS for TIMIT/TED-LIUM corpora. + ZHLJKWPDWULFHVכࢃࢎכࢃ࢞ KW [W Index Terms—long short-term memory, speech recognition, EF hardware accelerator, weight compression, structured sparsity Fig. 1. Illustration of LSTM cell with computation equations. weights by 16× with minimal accuracy loss. HCGS-based I. INTRODUCTION LSTM accelerator which executes 2-/3-layer LSTMs for real- The emergence of internet of things (IoT) devices that time speech recognition was prototyped in 65nm LP CMOS. require edge computing with limited area and energy has It consumes 1.85/3.42 mW power and achieves 8.93/7.22 garnered intense interest in energy-efficient ASIC accelerators TOPS/W for TIMIT/TED-LIUM corpora. for deep learning applications. The particular challenge of II. LSTM AND HIERARCHICAL COARSE-GRAIN SPARSITY performing on-device automatic speech recognition (ASR) is that LSTMs that show high accuracy suffer from high A. LSTM-based Speech Recognition complexity and require a large number of parameters to be LSTM is a type of recurrent neural network (RNN) that trained and stored [1]. shows state-of-the-art accuracy for speech recognition [1]. Recent works presented methods to reduce the complexity Each layer of a LSTM consists of neurons, which computes the and storage of ASR hardware. Magnitude-based pruning was final output ht through four intermediate results called gates applied to LSTM hardware in [2], resulting in 20× model size (Fig. 1). From the LSTM equations in Fig. 1, we see that the reduction, but element-wise sparsity incurs considerable index weight memory requirement of LSTMs is 8× when compared memory and irregular memory access, hurting both perfor- to MLPs with the same number of neurons per layer. mance and power. To overcome this, structured sparsity tech- LSTM-based speech recognition typically consists of a niques have been proposed with row-/column-wise sparsity for pipeline of a feature extraction module, followed by a LSTM RNNs [3], with block-wise sparsity for multi-layer perceptrons RNN and then by a Viterbi decoder. A commonly used feature (MLPs) [4], and with block-circulant weight matrix for RNNs for speech recognition is feature-space Maximum Likelihood [5] in speech processing applications. However, these works Linear Regression (fMLLR). fMLLR features are extracted exhibit limited weight compression of ∼4× [3], [4] or high from Mel Frequency Cepstral Coefficients (MFCC) features, error rate [5], and have not been implemented in ASIC [2]–[5]. derived typically from 25 ms windows of audio samples with a While recent ASIC designs targeting RNNs focus on improved 10 ms overlap between subsequent windows. The features for energy-efficiency [6], [7], they do not incorporate compression the current window are combined with those of past and future techniques and do not report RNN accuracy for representative windows to provide context and the merged set of features are benchmarks, which are both necessary to accomplish practical inputs to the neural network. In our implementation, we merge ASR on small-form-factor edge devices. five past and five future windows to the current window to In this work, we present a new hierarchical coarse-grain create an input frame with 11 windows, leading to 440 fMLLR sparsity (HCGS) scheme that structurely compresses LSTM features per frame. The output layer of the LSTM consists of +LHUDUFKLFDO&RPSUHVVLRQRI:HLJKWV VW QG /HYHO6HOHFWRU 0HPRU\%DQN /HYHO6HOHFWRU 6HOHFWRU : : : LQSXWQHXURQV 2XWSXW%XIIHU ,1 1 6HOHFWHG1HXURQV )URPFKLSLQ VW QG 6HO08; 6HO08; OHYHO OHYHO 6HOHFWRU î 5(* &%XIIHU SDUDOOHO0$&XQLWV 0$&8QLW ,QSXW%XIIHU /670*DWH &RPSXWDWLRQ +%XIIHU SDUDOOHO0$&XQLWV RXWSXWQHXURQV RXWSXWQHXURQV 287 287 1 1 6HOHFWRU 5(* 1HXURQVRI/D\HU L î 7RFKLSRXW ࢏࢚ ൌ࣌ൌ ࣌൫࢏࢚࢞ ൅࢏൅ ࢏ࢎ࢚െ૚࢚െ૚ ൅࢈൅ ࢈࢏൯ 0HPRU\%DQN )60 ࢌ࢚ ൌ ࣌ሺࢌ࣌ሺࢌ࢚࢞ ൅ࢌ൅ ࢌࢎ࢚െ૚࢚െ૚ ൅࢈൅ ࢈ࢌሻ 1HXURQVRI/D\HU L %LDVHV ࢕࢚ ൌ ࣌ሺ࢕࣌ሺ࢕࢚࢞ ൅࢕൅ ࢕ࢎ࢚െ૚࢚െ૚ ൅࢈൅ ࢈࢕ሻ ࢉ̱ ൌܜ܉ܖܐሺࢉൌ ܜ܉ܖܐሺࢉ ൅ࢉ൅ ࢉ ൅࢈൅ ࢈ ሻ : : : ࢚ ࢚࢞ ࢎ࢚െ૚࢚െ૚ ࢉ ̱ Fig. 2. Illustration of LSTM RNN weight compression featuring the proposed +&*6,QGH[ ࢉ࢚ ൌࢌ࢚ ήࢉ࢚െ૚ ൅࢏࢚ ήࢉ࢚ 0HPRU\%DQN ࢎ ൌ࢕ൌ ࢕ ήܜ܉ܖܐሺࢉή ܜ܉ܖܐሺࢉ ሻ hierarchical coarse-grain sparsity (HCGS). ࢚ ࢚ ࢚ Fig. 3. Overall architecture of the proposed LSTM RNN accelerator. probability estimates that are sent to the Viterbi decoder unit to determine the best sequence of phonemes/words. 1) HCGS Selector: The HCGS selector (Fig. 3, top-left) has two levels, where the first level of selector only enables B. Hierarchical Coarse-Grain Sparsity the propagation of activations associated with larger non-zero The proposed HCGS scheme maintains coarse-grain spar- weights blocks and the second level further filters through the sity while further allowing fine-grain weight connectivity, lead- activations associated with smaller non-zero blocks. For 16× ing to significant area and energy savings. Two-level HCGS is HCGS compression, only 32 activation outputs are required illustrated in Fig. 2, where the first level compresses weights from a total of 512 activations, ensuring only activations (e.g. 4× compression) using a larger block size (e.g. 32×32) corresponding to non-zero weights propagate to the MAC unit, and the remaining weights in the large blocks go through the greatly boosting energy-efficiency. × second level of compression (e.g. 4 ) with a smaller block size 2) Input and Output buffers: An input frame consists of × (e.g. 8 8). The hierarchical structure of block-wise weights is fMLLR features as described in Sec. II-A. The input buffer is randomly selected before the RNN training process starts, and used to store the fMLLR features of an input frame, which is is maintained throughout training and classification phases. streamed in 13-bits at a time over 512 cycles. The output buffer The dropped blocks remain at zero and do not contribute to consists of two identical buffers for double buffering, which the physical memory footprint during both training and classi- enables continuous computation of the LSTM accelerator fication. TIMIT and TED-LIUM corpora are used to train the while streaming out the final layer outputs. Each output buffer RNNs for phoneme and speech recognition, respectively. The employs a HCGS selector and a 6656:416 multiplexer to baseline 3-layer, 512-cell LSTM RNN that performs speech feedback the output of the current layer output to the next recognition for TED-LIUM corpus requires 24 MB of weight layer. The feedback from the output buffer to the input of the memory in floating-point precision. Aided by (1) the proposed MAC facilitates the reuse of the MAC unit. HCGS that reduces the number of weights by 16× and (2) low- 3) H-buffer and C-buffer: The H-buffer and C-buffer store precision (6-bit) representation of weights, the compressed the outputs of the previous frame (ht−1) and cell state (ct−1) parameters of a 3-layer, 512-cell LSTM RNN are reduced to for each layer, respectively. only 288 kB (83× reduction in model size compared to 24 MB). The resultant LSTM network can be fully stored on- 4) MAC Unit: The MAC unit consists of 64 parallel MACs chip, which results in energy-efficient acceleration. (computing vector-matrix multiplications) and the LSTM gate computation module (computing intermediate LSTM gate and III. ARCHITECTURE AND DESIGN OPTIMIZATIONS final output values), which can effectively perform 2,064 oper- ations in each cycle aided by the proposed HCGS compression. A. Hardware Architecture The non-linear activation functions (sigmoid and hyperbolic Fig. 3 shows the overall architecture of the proposed LSTM tangent) are implemented through piece-wise linear modules accelerator. It consists of the HCGS selector, input and output using 20 linear segments. buffers, MAC unit, H-buffer, C-buffer, two memory banks (144 5) Weight/bias storage and global controller: Weights are kB each) for weight storage, bias/index memory bank (8.5 kB), stored in the interleaved fashion as described in Sec. III-B, and the global controller. The proposed architecture facilitates where each memory sub-bank (W1-W3) stores weights cor- the computation of one LSTM cell output per cycle after
Recommended publications
  • Arxiv:1807.06441V1 [Eess.AS] 12 Jul 2018 Etr Saln Hr-Emmmr LT)Ta a Endesigne Been Has That (LSTM) Memory Popu Short-Term Most Long the a (Rnns)
    A Comparison of Adaptation Techniques and Recurrent Neural Network Architectures Jan Vanˇek[0000−0002−2639−6731], Josef Mich´alek[0000−0001−7757−3163], Jan Zelinka[0000−0001−6834−9178], and Josef Psutka[0000−0002−0764−3207] University of West Bohemia Univerzitn´ı8, 301 00 Pilsen, Czech Republic {vanekyj,orcus,zelinka,psutka}@kky.zcu.cz Abstract. Recently, recurrent neural networks have become state-of- the-art in acoustic modeling for automatic speech recognition. The long short-term memory (LSTM) units are the most popular ones. However, alternative units like gated recurrent unit (GRU) and its modifications outperformed LSTM in some publications. In this paper, we compared five neural network (NN) architectures with various adaptation and fea- ture normalization techniques. We have evaluated feature-space maxi- mum likelihood linear regression, five variants of i-vector adaptation and two variants of cepstral mean normalization. The most adaptation and normalization techniques were developed for feed-forward NNs and, ac- cording to results in this paper, not all of them worked also with RNNs. For experiments, we have chosen a well known and available TIMIT phone recognition task. The phone recognition is much more sensitive to the quality of AM than large vocabulary task with a complex language model. Also, we published the open-source scripts to easily replicate the results and to help continue the development. Keywords: neural networks · acoustic model · TIMIT · LSTM · GRU · phone recognition · adaptation · i-vectors 1 Introduction Neural Networks (NNs) and deep NNs (DNNs) became dominant in the field of the acoustic modeling several years ago. Simple feed-forward (FF) DNNs were faded away in recent years.
    [Show full text]
  • Articulatory Information and Multiview Features for Large Vocabulary Continuous Speech Recognition
    ARTICULATORY INFORMATION AND MULTIVIEW FEATURES FOR LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION Vikramjit Mitra*, Wen Wang, Chris Bartels, Horacio Franco, Dimitra Vergyri University of Maryland, College Park, MD, USA Speech Technology and Research Laboratory, SRI International, Menlo Park, CA, USA [email protected], {wen.wang, chris.bartels, horacio.franco, dimitra.vergyri}@sri.com ABSTRACT acoustic conditions. Advanced variants of DNNs, such as convolutional neural nets (CNNs) [12], recurrent neural This paper explores the use of multi-view features and their nets (RNNs) [13], long short-term memory nets (LSTMs) discriminative transforms in a convolutional deep neural [14], time-delay neural nets (TDNNs) [15, 29], VGG-nets network (CNN) architecture for a continuous large [8], have significantly improved recognition performance, vocabulary speech recognition task. Mel-filterbank energies bringing them closer to human performance [9]. Both and perceptually motivated forced damped oscillator abundance of data and sophistication of deep learning coefficient (DOC) features are used after feature-space algorithms have symbiotically contributed to the maximum-likelihood linear regression (fMLLR) advancement of speech recognition performance. The role transforms, which are combined and fed as a multi-view of acoustic features has not been explored in comparable feature to a single CNN acoustic model. Use of multi-view detail, and their potential contribution to performance gains feature representation demonstrated significant reduction in is unknown. This paper focuses on acoustic features and word error rates (WERs) compared to the use of individual investigates how their selection improves recognition features by themselves. In addition, when articulatory performance using benchmark training datasets: information was used as an additional input to a fused deep Switchboard and Fisher, when evaluated on the NIST 2000 neural network (DNN) and CNN acoustic model, it was CTS test set [2].
    [Show full text]
  • Towards Robust Combined Deep Architecture for Speech Recognition : Experiments on TIMIT
    (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 4, 2020 Towards Robust Combined Deep Architecture for Speech Recognition : Experiments on TIMIT Hinda DRIDI1, Kais OUNI2 Research Laboratory Smart Electricity & ICT, SE&ICT Lab, LR18ES44 National Engineering School of Carthage, ENICarthage University of Carthage, Tunis, Tunisia Abstract—Over the last years, many researchers have getting ideal ASR systems. Thus, we should open the most engaged in improving accuracies on Automatic Speech relevant research topics to attain the principal goal of Recognition (ASR) task by using deep learning. In state-of-the- Automatic Speech Recognition. In fact, recognizing isolated art speech recognizers, both Long Short-Term Memory (LSTM) words is not a hard task, but recognizing continuous speech is a and Gated Recurrent Unit (GRU) based Reccurent Neural real challenge. Any ASR system has two parts: the language Network (RNN) have achieved improved performances model and the acoustic model. For small vocabularies, compared to Convolutional Neural Network (CNN) and Deep modeling acoustics of individual words is may be easily done Neural Network (DNN). Due to the strong complementarity of and we may get good speech recognition rates. However, for CNN, LSTM-RNN and DNN, they may be combined in one large vocabularies, developing successful ASR systems with architecture called Convolutional Long Short-Term Memory, good speech recognition rates has become an active issue for Deep Neural Network (CLDNN). Similarly we propose to combine CNN, GRU-RNN and DNN in a single deep architecture researchers. As vocabulary size grows, we will not be able to called Convolutional Gated Recurrent Unit, Deep Neural get enough spoken samples of all words and we should model Network (CGDNN).
    [Show full text]
  • Arxiv:1805.04699V4 [Cs.CL] 13 Jun 2019
    TED-LIUM 3: Twice as Much Data and Corpus Repartition for Experiments on Speaker Adaptation Fran¸coisHernandez1, Vincent Nguyen1, Sahar Ghannay2, Natalia Tomashenko2, and Yannick Est`eve2 1 Ubiqus, Paris, France [email protected] https://www.ubiqus.com 2 LIUM, University of Le Mans, France [email protected] https://lium.univ-lemans.fr/ Abstract. In this paper, we present TED-LIUM release 3 corpus3 ded- icated to speech recognition in English, which multiplies the available data to train acoustic models in comparison with TED-LIUM 2, by a factor of more than two. We present the recent development on Auto- matic Speech Recognition (ASR) systems in comparison with the two previous releases of the TED-LIUM Corpus from 2012 and 2014. We demonstrate that, passing from 207 to 452 hours of transcribed speech training data is really more useful for end-to-end ASR systems than for HMM-based state-of-the-art ones. This is the case even if the HMM- based ASR system still outperforms the end-to-end ASR system when the size of audio training data is 452 hours, with a Word Error Rate (WER) of 6.7% and 13.7%, respectively. Finally, we propose two repar- titions of the TED-LIUM release 3 corpus: the legacy repartition that is the same as that existing in release 2, and a new repartition, calibrated and designed to make experiments on speaker adaptation. Similar to the two first releases, TED-LIUM 3 corpus will be freely available for the research community. Keywords: Speech recognition · Opensource corpus · Deep learning · Speaker adaptation · TED-LIUM.
    [Show full text]
  • Automatic Speech Recognition with Deep Neural Networks for Impaired Speech
    Automatic Speech Recognition with Deep Neural Networks for Impaired Speech Cristina Espa~na-Bonet1;2 and Jos´eA. R. Fonollosa1 1 TALP Research Center, Universitat Polit`ecnicade Catalunya, Barcelona, Spain 2 Universit¨atdes Saarlandes, Saarbr¨ucken, Germany [email protected], [email protected] Abstract. Automatic Speech Recognition has reached almost human performance in some controlled scenarios. However, recognition of im- paired speech is a difficult task for two main reasons: data is (i) scarce and (ii) heterogeneous. In this work we train different architectures on a database of dysarthric speech. A comparison between architectures shows that, even with a small database, hybrid DNN-HMM models out- perform classical GMM-HMM according to word error rate measures. A DNN is able to improve the recognition word error rate a 13% for sub- jects with dysarthria with respect to the best classical architecture. This improvement is higher than the one given by other deep neural networks such as CNNs, TDNNs and LSTMs. All the experiments have been done with the Kaldi toolkit for speech recognition for which we have adapted several recipes to deal with dysarthric speech and work on the TORGO database. These recipes are publicly available. Keywords: speech recognition, speaker adaptation, deep learning, neu- ral networks, dysarthria, Kaldi 1 Introduction Automatic speech recognition (ASR) consists on automatically transcribing voice into text. It is not an easy task: one has do deal with noise, differences among speakers and spontaneous speech phenomena among others. For some controlled scenarios where one can minimise the effect of these phenomena, ASR approaches or exceeds the accuracy of humans on several benchmarks [2, 19].
    [Show full text]
  • Improving Speech Recognition by Revising Gated Recurrent Units
    Improving speech recognition by revising gated recurrent units Mirco Ravanelli1, Philemon Brakel2, Maurizio Omologo1, Yoshua Bengio2 1Fondazione Bruno Kessler, Trento, Italy 2Universite´ de Montreal,´ Montreal,´ Canada [email protected], [email protected], [email protected], [email protected] Abstract Markov Models (HMMs) in DNN/HMM speech recognizers. Training RNNs, however, can be complicated by vanish- Speech recognition is largely taking advantage of deep learning, ing and exploding gradients, which might impair learning long- showing that substantial benefits can be obtained by modern term dependencies [22]. Although exploding gradients can ef- Recurrent Neural Networks (RNNs). The most popular RNNs fectively be tackled with simple clipping strategies [23], the are Long Short-Term Memory (LSTMs), which typically reach vanishing gradient problem requires special architectures to be state-of-the-art performance in many tasks thanks to their abil- properly addressed. A common approach relies on the so- ity to learn long-term dependencies and robustness to vanishing called gated RNNs, whose core idea is to introduce a gating gradients. Nevertheless, LSTMs have a rather complex design mechanism for better controlling the flow of the information with three multiplicative gates, that might impair their efficient through the various time-steps. Within this family of archi- implementation. An attempt to simplify LSTMs has recently tectures, vanishing gradient issues are mitigated by creating ef- led to Gated Recurrent Units (GRUs), which are based on just fective “shortcuts”, in which the gradients can bypass multiple two multiplicative gates. temporal steps. This paper builds on these efforts by further revising GRUs and proposing a simplified architecture potentially more suit- The most popular gated RNNs are Long Short-Term Mem- able for speech recognition.
    [Show full text]
  • Speaker-Aware Training of Lstm-Rnns for Acoustic Modelling
    SPEAKER-AWARE TRAINING OF LSTM-RNNS FOR ACOUSTIC MODELLING Tian Tan 1, Yanmin Qian1;2, Dong Yu3, Souvik Kundu4, Liang Lu5, Khe Chai SIM4, Xiong Xiao6, Yu Zhang7 1Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China 2 Cambridge University Engineering Department, Cambridge, UK 3Microsoft Research, Redmond, USA 4School of Computing, National University of Singapore, Republic of Singapore 5Centre for Speech Technology Research, University of Edinburgh, UK 6Temasek Labs, Nanyang Technological University, Singapore. 7CSAIL, MIT, Cambridge, USA ABSTRACT this context, Long Short-Term Memory (LSTM) [4] is proposed to overcome these problems by introducing separate gate functions to Long Short-Term Memory (LSTM) is a particular type of recurrent control the flow of the information. For acoustic modelling, LSTM- neural network (RNN) that can model long term temporal dynam- RNNs are reported to outperform the DNNs on the large vocabulary ics. Recently it has been shown that LSTM-RNNs can achieve tasks [5, 6]. higher recognition accuracy than deep feed-forword neural net- However, a long standing problem in acoustic modelling is the works (DNNs) in acoustic modelling. However, speaker adaption mismatch between training and test data caused by speakers and en- for LSTM-RNN based acoustic models has not been well inves- vironmental differences. Although DNNs are more robust to the mis- tigated. In this paper, we study the LSTM-RNN speaker-aware match compared to GMMs, significant performance degradation has training that incorporates the speaker information during model been observed [7]. There have been a number of studies to improve training to normalise the speaker variability.
    [Show full text]
  • UNIVERSITY of CALIFORNIA Los Angeles Neural Network Based
    UNIVERSITY OF CALIFORNIA Los Angeles Neural network based representation learning and modeling for speech and speaker recognition A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Electrical and Computer Engineering by Jinxi Guo 2019 c Copyright by Jinxi Guo 2019 ABSTRACT OF THE DISSERTATION Neural network based representation learning and modeling for speech and speaker recognition by Jinxi Guo Doctor of Philosophy in Electrical and Computer Engineering University of California, Los Angeles, 2019 Professor Abeer A. H. Alwan, Chair Deep learning and neural network research has grown significantly in the fields of automatic speech recognition (ASR) and speaker recognition. Compared to traditional methods, deep learning-based approaches are more powerful in learning representation from data and build- ing complex models. In this dissertation, we focus on representation learning and modeling using neural network-based approaches for speech and speaker recognition. In the first part of the dissertation, we present two novel neural network-based methods to learn speaker-specific and phoneme-invariant features for short-utterance speaker verifi- cation. We first propose to learn a spectral feature mapping from each speech signal to the corresponding subglottal acoustic signal which has less phoneme variation, using deep neural networks (DNNs). The estimated subglottal features show better speaker-separation abil- ity and provide complementary information when combined with traditional speech features on speaker verification tasks. Additional, we propose another DNN-based mapping model, which maps the speaker representation extracted from short utterances to the speaker rep- resentation extracted from long utterances of the same speaker. Two non-linear regression models using an autoencoder are proposed to learn this mapping, and they both improve speaker verification performance significantly.
    [Show full text]
  • Speaker Adaptation of Deep Neural Network Acoustic Models Using Gaussian Mixture Model Framework in Automatic Speech Recognition Systems Natalia Tomashenko
    Speaker adaptation of deep neural network acoustic models using Gaussian mixture model framework in automatic speech recognition systems Natalia Tomashenko To cite this version: Natalia Tomashenko. Speaker adaptation of deep neural network acoustic models using Gaussian mixture model framework in automatic speech recognition systems. Neural and Evolutionary Com- puting [cs.NE]. Université du Maine; ITMO University, 2017. English. NNT : 2017LEMA1040. tel-01797231 HAL Id: tel-01797231 https://tel.archives-ouvertes.fr/tel-01797231 Submitted on 22 May 2018 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Natalia TOMASHENKO Mémoire présenté en vue de l’obtention du grade de Docteur de Le Mans Université sous le sceau de l’Université Bretagne Loire en cotutelle avec ITMO University École doctorale : MathSTIC Discipline : Informatique Spécialité : Informatique Unité de recherche : EA 4023 LIUM Soutenue le 1 décembre 2017 Thèse N° : 2017LEMA1040 Speaker adaptation of deep neural network acoustic models using Gaussian mixture model framework in automatic speech
    [Show full text]
  • Speaker Adaptive Training of Deep Neural Network Acoustic Models Using I-Vectors Yajie Miao, Hao Zhang, and Florian Metze
    IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. XX, NO. XX, MONTH YEAR 1 Speaker Adaptive Training of Deep Neural Network Acoustic Models using I-vectors Yajie Miao, Hao Zhang, and Florian Metze Abstract—In acoustic modeling, speaker adaptive training straightforward method is to re-update the SI model, or certain (SAT) has been a long-standing technique for the traditional layers of the model, on the adaptation data of each testing Gaussian mixture models (GMMs). Acoustic models trained with speaker [8], [9]. In [10], [2], [11], the SI-DNN models are SAT become independent of training speakers and generalize better to unseen testing speakers. This paper ports the idea of augmented with additional speaker-dependent (SD) layers that SAT to deep neural networks (DNNs), and proposes a framework are trained on the adaptation data. Also, adaptation can be to perform feature-space SAT for DNNs. Using i-vectors as naturally achieved by training (and decoding) DNN models speaker representations, our framework learns an adaptation using speaker-adaptive (SA) features or features enriched with neural network to derive speaker-normalized features. Speaker speaker-specific information [2], [12], [13]. adaptive models are obtained by fine-tuning DNNs in such a feature space. This framework can be applied to various Another technique closely related with speaker adaptation is feature types and network structures, posing a very general SAT speaker adaptive training (SAT) [14], [15]. When carried out solution. In this work, we fully investigate how to build SAT-DNN in the feature space, SAT performs adaptation on the training models effectively and efficiently.
    [Show full text]
  • Biosignal Sensors and Deep Learning-Based Speech Recognition: a Review
    sensors Review Biosignal Sensors and Deep Learning-Based Speech Recognition: A Review Wookey Lee 1 , Jessica Jiwon Seong 2, Busra Ozlu 3, Bong Sup Shim 3, Azizbek Marakhimov 4 and Suan Lee 5,* 1 Biomedical Science and Engineering & Dept. of Industrial Security Governance & IE, Inha University, 100 Inharo, Incheon 22212, Korea; [email protected] 2 Department of Industrial Security Governance, Inha University, 100 Inharo, Incheon 22212, Korea; [email protected] 3 Biomedical Science and Engineering & Department of Chemical Engineering, Inha University, 100 Inharo, Incheon 22212, Korea; [email protected] (B.O.); [email protected] (B.S.S.) 4 Frontier College, Inha University, 100 Inharo, Incheon 22212, Korea; [email protected] 5 School of Computer Science, Semyung University, Jecheon 27136, Korea * Correspondence: [email protected] Abstract: Voice is one of the essential mechanisms for communicating and expressing one’s intentions as a human being. There are several causes of voice inability, including disease, accident, vocal abuse, medical surgery, ageing, and environmental pollution, and the risk of voice loss continues to increase. Novel approaches should have been developed for speech recognition and production because that would seriously undermine the quality of life and sometimes leads to isolation from society. In this review, we survey mouth interface technologies which are mouth-mounted devices for speech recognition, production, and volitional control, and the corresponding research to de- velop artificial mouth technologies based on various sensors, including electromyography (EMG), electroencephalography (EEG), electropalatography (EPG), electromagnetic articulography (EMA), permanent magnet articulography (PMA), gyros, images and 3-axial magnetic sensors, especially Citation: Lee, W.; Seong, J.J.; Ozlu, B.; with deep learning techniques.
    [Show full text]
  • Applying Convolutional Gated Recurrent Deep Neural Network for Keywords Spotting in Continuous Speech
    International Journal of Scientific & Engineering Research Volume 9, Issue 7, July-2018 1530 ISSN 2229-5518 Applying Convolutional Gated Recurrent Deep Neural Network for keywords spotting in continuous speech Hinda DRIDI, Kais OUNI Abstract— Recently, the task of Keywords Spotting (KWS) in continuous speech has known an increased interest. It has been considered as a very challenging and forward-looking field of speech processing technologies. The KWS systems have been widely used in many applications, like, spoken data retrieval, speech data mining, spoken term detection, telephone routing, etc. In this paper, we propose a two-stage approach for keywords spotting. In first stage, the inputted utterances will be decoded into a phonetic flow. And in second stage the keywords will be detected from this phonetic flow using the Classification and Regression Trees (CARTs). The phonetic decoding of continuous speech is largely taking benefits of deep learning. Very promising performances have been achieved using Deep Neural Network (DNN), Convolutional Neural Network (CNN) and other advanced Recurrent Neural Networks (RNNs) like Long Short Term Memory (LSTM) and Gated Recurrent units (GRU). This paper builds on these efforts and proposes potentially more pertinent architecture by combining CNN, GRU and DNN in a single framework that will be called “Convolutional Recurrent Gated Deep Neural Network” or simply “CNN-GRU-DNN”. The work will be conducted on TIMIT data set. Index Terms— Deep Neural Network, Convolutional Neural Network, Recurrent
    [Show full text]