
Deep Learning for Speech/Language Processing --- machine learning & processing perspectives

Li Deng, Deep Learning Technology Center (DLTC), Microsoft Research, Redmond, USA
Tutorial given at Interspeech, September 6, 2015

Thanks go to many colleagues at DLTC/MSR, collaborating universities, and at Microsoft's engineering groups.

Outline

• Part I: Basics of Machine Learning (Deep and Shallow) and of Signal Processing

• Part II: Speech

• Part III: Language (In case you did not get link to slides, send email to: [email protected])

Reading Material

Books: Bengio, Yoshua (2009). "Learning Deep Architectures for AI".

L. Deng and D. Yu (2014) "Deep Learning: Methods and Applications" http://research.microsoft.com/pubs/209355/DeepLearning-NowPublishing-Vol7-SIG-039.pdf

D. Yu and L. Deng (2014). "Automatic Speech Recognition: A Deep-Learning Approach" (Publisher: Springer).


Reading Material (cont'd)

Wikipedia: https://en.wikipedia.org/wiki/Deep_learning

Papers: G. E. Hinton and R. Salakhutdinov. "Reducing the Dimensionality of Data with Neural Networks." Science 313: 504–507, 2006.

G. E. Hinton, L. Deng, D. Yu, et al. "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The shared views of four research groups," IEEE Signal Processing Magazine, pp. 82–97, November 2012.

G. Dahl, D. Yu, L. Deng, A. Acero. "Context-Dependent Pre-Trained Deep Neural Networks for Large- Vocabulary Speech Recognition". IEEE Trans. Audio, Speech, and Language Processing, Vol 20(1): 30–42, 2012. (plus other papers in the same special issue)

Y. Bengio, A. Courville, and P. Vincent. "Representation Learning: A Review and New Perspectives," IEEE Trans. PAMI, special issue Learning Deep Architectures, 2013.

J. Schmidhuber. “Deep learning in neural networks: An overview,” arXiv, October 2014.

Y. LeCun, Y. Bengio, and G. Hinton. “Deep Learning”, Nature, Vol. 521, May 2015.

J. Bellegarda and C. Monz. "State of the art in statistical methods for language and speech processing," Computer Speech and Language, 2015.

Part I: Machine Learning (Deep/Shallow) and Signal Processing

Machine Learning Basics

- Survey of audience background
- PART I: Basics; PART II: Advanced Topics (tomorrow)

Machine Learning & Deep Learning

[Figure, revised from a slide by Pascal Vincent (2015): deep learning as a sub-field of machine learning, which itself draws on statistics, data analysis, programming, and cognitive science.]

What Is Deep Learning?

Deep learning (deep machine learning, or deep structured learning, or hierarchical learning, or sometimes DL) is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, with complex structures or otherwise, composed of multiple non-linear transformations.

(Slide from: Yoshua Bengio)

SHALLOW vs. DEEP (modified from Y. LeCun & M.A. Ranzato)

[Figure: a taxonomy of machine-learning methods. Shallow: perceptron, SVM, boosting, decision trees, GMM, sparse coding, Bayesian nonparametrics. Deep: (deep) neural nets, RNN, convolutional nets, autoencoders and denoising autoencoders, RBM, DBN, DBM, deep Bayes nets. A second axis separates neural-network models from probabilistic models, and a third distinguishes supervised from unsupervised methods.]

Signal Processing --> Information Processing

[Table: processing tasks (rows) across media types (columns: audio/music, speech, image/video, text/language):
 Coding/compression: audio coding, speech coding, image coding, video coding, document compression/summarization
 Communication: voice over IP, DAB, 4G/5G networks, DVB, home networking, etc.
 Security: multimedia watermarking, encryption, etc.
 Enhancement/analysis: de-noising, source separation, speech enhancement, image/video enhancement, segmentation, feature extraction, grammar checking, text parsing
 Synthesis/rendering: computer music, speech synthesis (text-to-speech), computer graphics, video synthesis, natural language generation
 User interface: multi-modal human-computer interaction (HCI -- input methods)
 Recognition/understanding: auditory scene analysis (computer audition, e.g. melody detection, singer ID), automatic speech/speaker recognition, spoken language understanding, image recognition, computer vision (e.g. 3-D object recognition), video understanding, document recognition, natural language understanding (semantic IE)
 Retrieval/mining: music retrieval, spoken document retrieval, voice/mobile search, image retrieval, video search, text search (information retrieval), speech translation, machine translation
 Social media apps: photo sharing (e.g. flickr), video sharing (e.g. YouTube), blogs, wikis, del.icio.us, 3D/Second Life]

Machine Learning Basics

[Table: machine-learning paradigms, their training inputs, and what is learned (decision function, loss function)]
• Inductive: given training set {(x_i, y_i)}, i in TR, learn a decision function f and apply it to a test set TS
• Transductive: given {(x_i, y_i)}, i in TR, and {x_i}, i in TS, predict the test labels {y_i}, i in TS directly
• Generative: generative model with a generative loss, e.g. L = ln p(x, y; theta)
• Discriminative: discriminative model with a discriminative loss (form varies)
• Supervised: {(x_i, y_i)}, i in TR
• Unsupervised: {x_i}, i in U
• Semi-supervised: {(x_i, y_i)}, i in TR, together with unlabeled {x_i}, i in U
• Active: {(x_i, y_i)}, i in TR, plus unlabeled {x_i}, i in U whose labels {y_i} can be queried
• Adaptive: a trained f_TR plus adaptation data {(x_i, y_i)}, i in AD, producing f_TS
• Multitask: {(x_i, y_i)}, i in TR_k, for all tasks k, producing f_k (or test labels {y_i}, i in TS_k) for all k

Supervised Machine Learning (classification)

Training phase (usually offline):

[Diagram: a training data set -- measurements (features) with associated 'class' labels (shown as colors) -- is fed to a training algorithm, which outputs a learned model (parameters/weights, and sometimes structure).]

Supervised Machine Learning (classification)

Test phase (run time, online):

[Diagram: an input test data point -- measurements (features) only -- is passed through the learned model (structure + parameters) to produce an output: a predicted class label or label sequence (e.g. a sentence).]

A key ML concept: Generalization

[Plot: training-set error and test-set error vs. model capacity (e.g. size, depth); under-fitting at low capacity, over-fitting at high capacity, with best generalization in between.]

• To avoid over-fitting: regularize, make the model simpler, or add more training data
• i.e., move the training objective away from the (empirical) error rate

Generalization – effects of more data

[Plot: the same curves with more training data; over-fitting sets in only at higher model capacity.]

A variety of ML methods

• Decision trees/forests/jungles, Boosting
• Support Vector Machines (SVMs)
• Model-based (graphical models, often generative models: sparse connections with interpretability)
  – model tailored for each new application, incorporating prior knowledge
  – Bayesian statistics exploited to 'invert the model' & infer variables of interest
• Neural Networks (DNN, RNN; dense connections)

The latter two types of methods can be made DEEP: deep generative models and DNNs.

Recipe for (Supervised) Deep Learning with Big Data

[Diagram: when training error is high (i.e., underfitting), use a deeper/bigger model; when test error is much higher than training error (i.e., overfitting), use more data or stronger regularization.]

Contrast with Signal Processing Approaches

• Strong focus on sophisticated objective functions for optimization (e.g. MCE, MWE, MPE, MMI; string-level, super-string-level, ...)
• Can be regarded as "end2end" learning in ASR
• Almost always non-convex optimization (praised by deep-ML researchers)
• Weaker focus on regularization & overfitting. Why?
  – Our ASR community had been using shallow, low-capacity models (e.g., GMM-HMM) for too long
  – Hence less need for overcoming overfitting
• Now, deep models add a new dimension for increasing model capacity
  – Regularization becomes essential for DNN
  – E.g. DBN pre-training, the "dropout" method, STDP spiking neurons, etc.

Deep Neural Net (DNN) Basics --- why the gradient vanishes & how to rescue it

(Shallow) Neural Networks for ASR (prior to the rise of deep learning)

Temporal & Time-Delay (1-D Convolutional) Neural Nets
• 1988: Atlas, Homma, and Marks. "An Artificial Neural Network for Spatio-Temporal Bipolar Patterns, Application to Phoneme Classification," NIPS.
• 1989: Waibel, Hanazawa, Hinton, Shikano, Lang. "Phoneme recognition using time-delay neural networks," IEEE Transactions on Acoustics, Speech and Signal Processing.
Hybrid Neural Nets-HMM
• 1990: Morgan and Bourlard. "Continuous speech recognition using MLP with HMMs," ICASSP.
Recurrent Neural Nets
• 1991: Bengio. "Artificial Neural Networks and their Application to Speech/Sequence Recognition," Ph.D. thesis.
• 1992: Robinson. "A real-time recurrent error propagation network word recognition system," ICASSP.
Neural-Net Nonlinear Prediction
• 1994: Deng, Hassanein, Elmasry. "Analysis of correlation structure for a neural predictive model with applications to speech recognition," Neural Networks, vol. 7, no. 2.
Bidirectional Recurrent Neural Nets
• 1997: Schuster, Paliwal. "Bidirectional recurrent neural networks," IEEE Trans. Signal Processing.
Neural-Net TANDEM
• 2000: Hermansky, Ellis, Sharma. "Tandem connectionist feature extraction for conventional HMM systems," ICASSP.
• 2005: Morgan, Zhu, Stolcke, Sonmez, Sivadas, Shinozaki, Ostendorf, Jain, Hermansky, Ellis, Doddington, Chen, Cretin, Bourlard, Athineos. "Pushing the envelope - aside [speech recognition]," IEEE Signal Processing Magazine, vol. 22, no. 5. --> DARPA EARS Program 2001-2004: Novel Approach I (Novel Approach II: Deep Generative Model)
Bottle-neck Features Extracted from Neural-Nets
• 2007: Grezl, Karafiat, Kontar & Cernocky. "Probabilistic and bottle-neck features for LVCSR of meetings," ICASSP.

One-hidden-layer neural networks

• Starting from (nonlinear) regression:  y_k(x, w) = Σ_j w_kj f_j(x)
• Replace each f_j with a variable z_j, where  z_j = h( Σ_i w_ji x_i )
  and h(·) is a fixed activation function
• The outputs are obtained from  y_k = s( Σ_j w_kj z_j )
  where s(·) is another fixed function
• In all, we have (simplifying biases):  y_k(x, w) = s( Σ_j w_kj h( Σ_i w_ji x_i ) )

Multi-layer neural networks

• Denote all activation functions by h; each unit computes  z_k = h( Σ_j w_kj z_j )  from the units in the layer below
• The sum is over those values of j with instantiated weights w_kj

Unchallenged learning algorithm: Back propagation (BP)

• For regression, we consider a squared-error cost function:
  E(w) = ½ Σ_n Σ_k ( t_nk – y_k(x_n, w) )²
  which corresponds to a Gaussian density p(t|x)
• We can substitute the network function y_k(x_n, w) into E and use a general-purpose optimizer to estimate w, but it is much more efficient to exploit the derivatives of E -- the essence of BP

Learning neural networks

E(w) = ½ Σ_n Σ_k ( t_nk – y_k(x_n, w) )²

• Recall that for linear regression:
  ∂E(w)/∂w_m = - Σ_n ( t_n - y_n ) x_nm
  i.e., the gradient for a weight couples an error signal ( t_n - y_n ) with the input signal x_nm at the two ends of that weight
• We'll use the chain rule of differentiation to derive a similar-looking expression, where
  – local input signals are forward-propagated from the input
  – local error signals are back-propagated from the output

Local signals needed for learning

• For clarity, consider the error for one training case:  E_n = ½ Σ_k ( t_nk – y_k(x_n, w) )²

• To compute En/wji, note that wji appears in only one term of the overall expression, namely

if wji is in the 1st layer, zi is actually input xi • Using the chain rule of differentiation, we have

Weight Local Local error input where signal signal Forward-propagating local input signals

• Forward propagation gives all the a's and z's

Back-propagating local error signals

[Diagram: error signals δ flow backward from the output units (with targets t_1, t_2) toward the inputs.]

• Back-propagation gives all the δ's

Back-propagating error signals

• To compute En/aj (i.e., dj), aj (also called “logit”) appears in

all those expressions ak = Si wki h(ai) that depend on aj • Using the chain rule, we have

• The sum is over k s.t. unit j is connected to unit k and

for each such term, ak/aj = wkj h’(aj)

• Noting that En/ak = dk, we get the back-propagation rule:

• For output units: - Putting the propagations together

• For each training case n, apply forward propagation and back-propagation to compute  ∂E_n/∂w_ji = δ_j z_i  for each weight w_ji
• Sum these over training cases to compute  ∂E/∂w_ji = Σ_n ∂E_n/∂w_ji
• Use these derivatives for steepest-descent learning (too slow for a large set of training data)
• Minibatch learning: after a small set of input samples, use the above gradient to update the weights (so weights are updated more often)

Why gradients tend to vanish for DNN

• Recall BP for an adjacent layer pair:  δ_j = h'(a_j) Σ_k w_kj δ_k
• For sigmoid units:  h'(a_j) = h_j (1 - h_j)
• If there is only one hidden layer, gradients are not likely to vanish
• The problem becomes serious when nets get deep
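To make the forward/backward recipe above concrete, here is a minimal NumPy sketch (not from the original slides; all names and sizes are illustrative) of one gradient step for a one-hidden-layer sigmoid network with squared-error loss:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W1, W2, lr=0.1):
    """One forward + backward pass for a 1-hidden-layer net with squared-error loss."""
    # Forward pass: local input signals
    a1 = W1 @ x            # pre-activations ("logits") of hidden units
    z1 = sigmoid(a1)       # hidden activations
    y  = W2 @ z1           # linear output units
    # Backward pass: local error signals (deltas)
    delta_out = y - t                                 # output deltas for squared error
    delta_hid = z1 * (1 - z1) * (W2.T @ delta_out)    # note the h*(1-h) sigmoid factor
    # Gradient = (local error signal) x (local input signal)
    gW2 = np.outer(delta_out, z1)
    gW1 = np.outer(delta_hid, x)
    return W1 - lr * gW1, W2 - lr * gW2

# toy usage
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(8, 4))
W2 = rng.normal(scale=0.5, size=(3, 8))
x, t = rng.normal(size=4), rng.normal(size=3)
W1, W2 = backprop_step(x, t, W1, W2)
```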

Why gradients tend to vanish for DNN

Upward (forward) pass of a small example network (two hidden layers plus an output layer):
  h¹ = σ( W¹ x + b¹ )
  h² = σ( W² h¹ + b² )
  ŷ  = s( W³ h² + b³ )

Downward (back-propagation) pass of the error signals δ:
  δ³_i = ŷ_i (1 - ŷ_i) · ∂(error)/∂ŷ_i
  δ²_j = h²_j (1 - h²_j) · Σ_i W³_ij δ³_i
  δ¹_k = h¹_k (1 - h¹_k) · Σ_j W²_jk δ²_j

Why gradients tend to vanish for DNN

• To illustrate the problem, let's use the matrix form of error BP:
  δˡ = diag( hˡ ⊙ (1 - hˡ) ) (Wˡ⁺¹)ᵀ δˡ⁺¹
     = diag( hˡ ⊙ (1 - hˡ) ) (Wˡ⁺¹)ᵀ diag( hˡ⁺¹ ⊙ (1 - hˡ⁺¹) ) (Wˡ⁺²)ᵀ δˡ⁺² = ...
• So even if the forward pass is nonlinear, error backprop is a linear process
• It suffers from all problems associated with linear processes
• Many factors of h(1 - h) for sigmoid units
• In addition, many terms of W in the product
• If any sigmoid unit saturates in either direction, its error gradient becomes zero
• If ||W|| < 1, the product will shrink fast for high depths
• If ||W|| > 1, the product may grow fast for high depths

How to rescue it

• Pre-train the DNN by a generative DBN (the solution by 2010)
  – complicated process
• Discriminative pre-training (much easier to do)
• Still random initialization, but with carefully set variance values, e.g.
  – layer-dependent variance values
  – for lower layers (with more terms in the product), make the variance closer to 0 (e.g. < 0.1)
• Use bigger training data to reduce the chance of vanishing gradients in each epoch
• Use ReLU units: only one side of zero gradient, instead of two as for sigmoid units (see the sketch below)
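The toy experiment below (my illustration, not from the slides; layer widths and weight scales are arbitrary) propagates an error signal backward through a stack of random layers and compares sigmoid gating h(1-h) with ReLU gating, showing why ReLU and careful initialization slow the decay:

```python
import numpy as np

rng = np.random.default_rng(0)

def delta_norm_through_depth(depth, width=256, act="sigmoid", w_scale=0.05):
    """Propagate an error signal backward through `depth` random layers and
    report its norm -- a toy illustration of the vanishing-gradient product."""
    delta = rng.normal(size=width)
    for _ in range(depth):
        W = rng.normal(scale=w_scale, size=(width, width))
        a = rng.normal(size=width)                # stand-in pre-activations
        if act == "sigmoid":
            h = 1.0 / (1.0 + np.exp(-a))
            gate = h * (1 - h)                    # at most 0.25 per unit
        else:                                     # ReLU: gradient is 0 or 1
            gate = (a > 0).astype(float)
        delta = gate * (W.T @ delta)
    return np.linalg.norm(delta)

for depth in (1, 5, 10):
    print(depth,
          delta_norm_through_depth(depth, act="sigmoid"),
          delta_norm_through_depth(depth, act="relu"))
```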

The power of understanding root causes!!! (mid 2010 at MSR Redmond)

An alternative way of training NN

• Backprop takes partial derivatives (of E with respect to W, with U held fixed)
• If the output layer is linear, the total (instead of partial) derivative can be computed, with the optimal U expressed as a function of W
• This is the basic learning method for the Deep Convex/Stacking Net (DSN), designed to be easily parallelizable by batch training
• Using the total derivative is equivalent to a coordinate-descent algorithm with an "infinite" step size, achieving the global optimum along the "coordinate" of updating U while fixing W
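A minimal NumPy sketch of the closed-form output-weight step used in DSN-style modules, assuming a squared-error loss and a linear output layer (the small ridge term and all sizes are my additions for numerical stability and illustration):

```python
import numpy as np

def dsn_output_weights(W, X, T, reg=1e-3):
    """Given fixed hidden weights W, inputs X (d x N) and targets T (c x N),
    return the least-squares-optimal linear output weights U:
    U = (H H^T + reg*I)^-1 H T^T, where H = sigmoid(W^T X)."""
    H = 1.0 / (1.0 + np.exp(-(W.T @ X)))           # hidden activations, (k x N)
    U = np.linalg.solve(H @ H.T + reg * np.eye(H.shape[0]), H @ T.T)
    return U                                        # module predictions: U.T @ H

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 500))                      # 20-dim inputs, 500 samples
T = rng.normal(size=(3, 500))                       # 3-dim targets
W = rng.normal(scale=0.1, size=(20, 50))            # 50 hidden units
U = dsn_output_weights(W, X, T)
```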

Deep Stacking Nets

[Figure: a stack of modules; each module has an input layer, a hidden layer with weights W, and a linear output layer with weights U (W1/U1 at the bottom up to W4/U4); the output of each module, together with the raw input, feeds the next module; Wrand marks the randomly initialized portion of each W.]

• Learn the weight matrices U and W in individual modules separately.
• Given W and the linear output layer, U can be expressed as an explicit nonlinear function of W.
• This nonlinear function is used as the constraint in solving the nonlinear least-squares problem for learning W.
• Initialize W with an RBM (bottom layer).
• For higher layers, part of W is initialized with the optimized W from the immediately lower layer and part of it with random numbers.

(Deng, Yu, Platt, ICASSP-2012; Hutchinson, Deng, Yu, IEEE T-PAMI, 2013)

A neat way of learning DSN weights

E = ½ Σ_n || y_n - t_n ||²,  where  y_n = Uᵀ h_n = Uᵀ σ(Wᵀ x_n) = G_n(U, W)

∂E/∂U ∝ H (Uᵀ H - T)ᵀ; setting it to zero gives  U = (H Hᵀ)⁻¹ H Tᵀ = F(W),  where  h_n = σ(Wᵀ x_n)

E = ½ Σ_n || G_n(U, W) - t_n ||²,  subject to  U = F(W)

Use of the Lagrange-multiplier method:
E = ½ Σ_n || G_n(U, W) - t_n ||² + λ || U - F(W) ||
to learn W (and then U)  -->  full derivative in closed form (i.e. no longer a recursion on partial derivatives as in backpropagation)

• Advantages found:
  --- less noise in the gradient than using the chain rule, which ignores the explicit constraint U = F(W)
  --- batch learning is effective, aiding parallel training

How the Brain May Do BackProp

• Canadian Psychology, Vol 44, pp 10-13, 2003.

• Feedback system in biological neural nets • Key roles of STDP (Spike-Time-Dependent Plasticity) --- temporal derivative • To provide a way to encode error derivatives

How the Brain May Do BackProp

• The backprop algorithm requires that feedforward and feedback weights are the same
• This is clearly not true for biological neural nets
• How to reconcile this discrepancy?
• Recent studies showed that using random feedback weights in BP performs close to rigorous BP
• Implications for regularizing BP learning (like dropout, which may not make sense at first glance)

Recurrent Neural Net (RNN) Basics --- why memory decays fast or explodes (1990's) --- and how to rescue it (2010's, in the new deep learning era)

Basic architecture of an RNN

h_t is the hidden layer that carries the information from time 0 to t, where x_t is the input word and y_t is the output tag:

  y_t = SoftMax( U · h_t ),  where  h_t = σ( W · h_{t-1} + V · x_t )

[Figure: the RNN unrolled over time -- inputs x_1 ... x_T feed hidden states h_1 ... h_T through V, each hidden state feeds the next through the recurrent matrix W, and outputs y_1 ... y_T are produced through U.]

Used for slot filling in SLU [Mesnil, He, Deng, Bengio, 2013; Yao, Zweig, Hwang, Shi, Yu, 2013]
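A minimal NumPy sketch (mine, not from the slides) of the forward recurrence written above; matrix sizes and the sigmoid choice are illustrative:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def rnn_forward(X, U, W, V):
    """Run y_t = SoftMax(U h_t), h_t = sigmoid(W h_{t-1} + V x_t) over a sequence X."""
    h = np.zeros(W.shape[0])
    H, Y = [], []
    for x in X:
        h = 1.0 / (1.0 + np.exp(-(W @ h + V @ x)))   # new hidden state
        H.append(h)
        Y.append(softmax(U @ h))                     # per-step tag distribution
    return H, Y

# toy usage: T=5 steps, 10-dim inputs, 16 hidden units, 4 output tags
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(4, 16))
W = rng.normal(scale=0.1, size=(16, 16))
V = rng.normal(scale=0.1, size=(16, 10))
H, Y = rnn_forward(rng.normal(size=(5, 10)), U, W, V)
```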

Back-propagation through time (BPTT)

[Figure: a 3-step unrolled RNN (inputs x_1..x_3, hidden states h_1..h_3, outputs y_1..y_3, weights V, W, U), with the label given at time t = 3.]

1. Forward propagation
2. Generate output
3. Calculate error
4. Back propagation
5. Back propagation through time
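The five steps above, written out as a minimal BPTT sketch (my illustration; it assumes sigmoid hidden units, softmax outputs and a cross-entropy loss at every step, which is one common choice, not necessarily the exact setup of the cited systems):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def bptt(X, targets, U, W, V):
    """Gradients of the per-step cross-entropy loss for the simple RNN above."""
    T = len(X)
    H = [np.zeros(W.shape[0])]                    # forward pass, keep all states
    for t in range(T):
        H.append(1.0 / (1.0 + np.exp(-(W @ H[t] + V @ X[t]))))
    gU, gW, gV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    dh_next = np.zeros(W.shape[0])                # error flowing back from the future
    for t in reversed(range(T)):
        h, h_prev = H[t + 1], H[t]
        dy = softmax(U @ h)
        dy[targets[t]] -= 1.0                     # d(cross-entropy)/d(logits)
        gU += np.outer(dy, h)
        dh = U.T @ dy + dh_next                   # error from this step + the future
        da = dh * h * (1 - h)                     # through the sigmoid
        gW += np.outer(da, h_prev)
        gV += np.outer(da, X[t])
        dh_next = W.T @ da                        # "through time" to the previous step
    return gU, gW, gV

# toy usage
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(4, 16))
W = rng.normal(scale=0.1, size=(16, 16))
V = rng.normal(scale=0.1, size=(16, 10))
gU, gW, gV = bptt(rng.normal(size=(3, 10)), [0, 2, 1], U, W, V)
```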

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Some early attempts to examine difficulties in learning RNNs

• Yoshua Bengio’s Ph.D. thesis at McGill University (1991) • Based on attractor properties of nonlinear dynamic systems • His recent, more intuitive explanation --- in extreme of nonlinearity, discrete functions and gradients vanish or explode:

• An alternative analysis: based on perturbation analysis of nonlinear differential equations (next several slides)

52 • NN for nonlinear sequence prediction (like NN language model used today) • Memory (temporal correlation) proved to be stronger than linear prediction • No GPUs to use; very slow to train with BP; did not make NN big and deep, etc.

• Conceptually easy to make it deep using (state-space) signal processing & graphical models… by moving away from NN…

Why gradients vanish/explode for RNN

• The easiest account is to follow the analysis for DNN • except that “depth” of RNN is much larger: the length of input sequence • Especially serious for speech sequence (not as bad for text input) • Tony Robinson group was the only one that made RNN work for TIMIT phone recognition (1994)

How to rescue it

• Echo state nets (“lazy” approach) – avoiding problems by not training input & recurrent weights – H. Jaeger. “Short term memory in echo state networks”, 2001 • But if you figure out smart ways to train them, you get much better results

(NIPS-WS, 2013)

How to rescue it

• By better optimization
  – Hessian-free method
  – Primal-dual method

How to rescue it

• Use of LSTM (long short-term memory) cells
  – Sepp Hochreiter & Jürgen Schmidhuber (1997). "Long short-term memory." Neural Computation 9 (8): 1735–1780.
  – Many easier-to-read materials, especially after 2013
• The best way so far to train RNNs well
  – Increasingly popular in speech/language processing
  – Attracted big attention from the ASR community at ICASSP-2013's DNN special session (Graves et al.)
  – Huge progress since then
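One step of a standard LSTM cell, written out as a small NumPy sketch (the gate naming and the no-peephole variant are assumptions on my part; the dictionary of weights is just a convenient packaging):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_cell(x, h_prev, c_prev, P):
    """One step of a standard LSTM cell (no peephole connections)."""
    i = sigmoid(P["Wi"] @ x + P["Ri"] @ h_prev + P["bi"])   # input gate
    f = sigmoid(P["Wf"] @ x + P["Rf"] @ h_prev + P["bf"])   # forget gate
    o = sigmoid(P["Wo"] @ x + P["Ro"] @ h_prev + P["bo"])   # output gate
    g = np.tanh(P["Wg"] @ x + P["Rg"] @ h_prev + P["bg"])   # candidate cell input
    c = f * c_prev + i * g          # cell state: additive path, so gradients survive
    h = o * np.tanh(c)              # hidden output
    return h, c

# toy usage: 10-dim input, 16-dim cell
rng = np.random.default_rng(0)
P = {}
for g in "ifog":
    P[f"W{g}"] = rng.normal(scale=0.1, size=(16, 10))
    P[f"R{g}"] = rng.normal(scale=0.1, size=(16, 16))
    P[f"b{g}"] = np.zeros(16)
h, c = lstm_cell(rng.normal(size=10), np.zeros(16), np.zeros(16), P)
```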

61 Many ways to show an LSTM cell

(Slide revised from: Koutnik & Schmidhuber, 2015)

Many ways to show an LSTM cell

Huh?....

Don't worry, we'll come back to this shortly

LSTM Cells in an RNN

LSTM cell unfolding over time

(Jozefowicz, Zaremba, Sutskever, ICML 2015)

Gated Recurrent Unit (GRU) (simpler than LSTM; no output gates)
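For comparison with the LSTM sketch above, one GRU step in the same style (a sketch of the common formulation; note that some papers swap the roles of z and 1-z):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x, h_prev, P):
    """One GRU step: update gate z, reset gate r, no separate cell state or output gate."""
    z = sigmoid(P["Wz"] @ x + P["Rz"] @ h_prev + P["bz"])          # update gate
    r = sigmoid(P["Wr"] @ x + P["Rr"] @ h_prev + P["br"])          # reset gate
    h_tilde = np.tanh(P["Wh"] @ x + P["Rh"] @ (r * h_prev) + P["bh"])
    return (1 - z) * h_prev + z * h_tilde      # interpolate old and candidate state
```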

(Jozefowicz, Zaremba, Sutskever (Google), ICML 2015; Kumar et al. (MetaMind), arXiv, July 2015)

Part II: Speech

72 Deep/Dynamic Structure in Human Speech Production and Perception (part of my tutorial at 2009 NIPS WS)

73 Production & Perception: Closed-Loop Chain

[Figure: the speech chain -- a SPEAKER encodes a message into speech acoustics; a LISTENER, using an internal model of the message, decodes the message; the decoded message feeds back to the speaker, closing the loop.]

Encoder: Two-Stage Production Mechanisms

Phonology (higher level):
• Symbolic encoding of the linguistic message
• Discrete representation by phonological features
• Loosely-coupled multiple feature tiers
• Overcomes the beads-on-a-string phone model
• Theories of distinctive features, feature geometry & articulatory phonology
• Accounts for partial/full sound deletion/modification in casual speech

Phonetics (lower level):
• Converts discrete linguistic features to continuous acoustics
• Mediated by motor control & articulatory dynamics
• Mapping from articulatory variables to the VT (vocal tract) area function to acoustics
• Accounts for co-articulation and reduction (target undershoot), etc.

Encoder: Phonological Modeling

Computational phonology:
• Represent pronunciation variations as a constrained factorial model
• Constraint: from articulatory phonology
• Language-universal representation

[Figure: the phrase "ten themes" / t ε n ө i: m z / represented as overlapping feature tiers, e.g. Tongue Tip and Tongue Body (High/Front, Mid/Front).]

Encoder: Phonetic Modeling

Computational phonetics:
• Segmental factorial HMM for sequential targets in the articulatory or vocal-tract-resonance domain
• Switching trajectory model for target-directed articulatory dynamics
• Switching nonlinear state-space model for dynamics in speech acoustics

[Illustration: speaker's message --> targets --> articulation --> speech acoustics]

Decoder I: Auditory Reception

• Convert speech acoustic waves into an efficient & robust auditory representation
• This processing is largely independent of phonological units
• Involves processing stages in the cochlea (ear), cochlear nucleus, SOC, IC, ..., all the way to A1 cortex
• Principal roles: 1) combat environmental acoustic distortion; 2) detect relevant speech features; 3) provide temporal landmarks to aid decoding
• Key properties: 1) critical-band frequency scale, logarithmic compression; 2) adaptive frequency selectivity, cross-channel correlation; 3) sharp response to transient sounds; 4) modulation in independent frequency bands; 5) binaural noise suppression, etc.

Decoder II: Cognitive Perception

• Cognitive process: recovery of the linguistic message
• Relies on 1) the "internal" model: structural knowledge of the encoder (production system); 2) a robust auditory representation of features; 3) temporal landmarks
• Child speech acquisition is a process that gradually establishes the "internal" model
• Strategy: analysis by synthesis, i.e., probabilistic inference on (deeply) hidden linguistic units using the internal model
• No motor theory: the above strategy requires no articulatory recovery from speech acoustics

Human Speech Perception (decoder)

[This slide restates the Decoder I (auditory reception) bullets above.]

Types of Speech Perception Theories

• Active vs. Passive • Bottom up vs./and Top Down • Autonomous vs. Interactive Active vs. Passive

• Active theories suggest that speech perception and production are closely related
  – Listener knowledge of how sounds are produced facilitates recognition of sounds
• Passive theories emphasize the sensory aspects of speech perception
  – Listeners utilize internal filtering mechanisms
  – Knowledge of vocal tract characteristics plays a minor role, for example when listening in noisy conditions

Bottom-up vs. & Top-down

• Top-down processing works with knowledge a listener has about a language, context, experience, etc.
  – Listeners use stored information about language and the world to make sense of the speech
• Bottom-up processing works in the absence of a knowledge base providing top-down information
  – Listeners receive auditory information, convert it into a neural signal, and process the phonetic feature information

Specific Speech Perception Theories
• Motor Theory
• Acoustic Invariance Theory
• Direct Realism
• TRACE Model (based on neural nets)
• Cohort Theory
• Fuzzy Logic Model of Perception
• Native Language Magnet Theory

Motor Theory
• Postulates that speech is perceived by reference to how it is produced
  – when perceiving speech, listeners access their own knowledge of how phonemes are articulated
  – articulatory gestures (such as rounding or pressing the lips together) are units of perception that directly provide the listener with phonetic information

Liberman, Cooper, Shankweiler, & Studdert- Kennedy, 1967 Acoustic Invariance Theory • Listeners inspect the incoming signal for the so- called acoustic landmarks which are particular events in the spectrum carrying information about gestures which produced them. • Gestures are limited by the capacities of humans’ articulators and listeners are sensitive to their auditory correlates, the lack of invariance simply does not exist in this model. • The acoustic properties of the landmarks constitute the basis for establishing the distinctive features. • Bundles of the distinctive features uniquely specify phonetic segments (phonemes, syllables, words).

Stevens, K.N. (2002). "Toward a model of lexical access based on acoustic landmarks and distinctive features" (PDF). Journal of the Acoustical Society of America 111 (4): 1872–1891. TRACE Model • For example, a listener hears the beginning of bald, and the words bald, ball, bad, bill become active in memory. Then, soon after, only bald and ball remain in competition (bad, bill have been eliminated because the vowel sound doesn't match the input). – Soon after, bald is recognized. • TRACE simulates this process by representing the temporal dimension of speech, allowing words in the lexicon to vary in activation strength, and by having words compete during processing. A Deep/Generative Model of Speech Production/Perception --- Perception as “variational inference”

[Figure: a layered generative model of the speaker -- message --> phonological targets --> articulation --> distortion-free acoustics --> distorted acoustics, with distortion factors and feedback to articulation.]

Deep Generative Models, Variational Inference/Learning, & Applications to Speech

Deep Learning ≈
  Neural Networks: deep in space & time (recurrent, LSTM), deep RNN
+ Generative Models: deep in space & time (dynamic), deep/hidden dynamic models
+ ..., ..., ...

[Table: deep neural nets (DNNs) vs. deep generative models]
• Structure: graphical, info flow bottom-up | graphical, info flow top-down
• Incorporating constraints & domain knowledge: hard | easy
• Semi/unsupervised learning: harder or impossible | easier, at least possible
• Interpretation: harder | easy (generative "story" on data and hidden variables)
• Representation: distributed | localist (mostly); can be distributed also
• Inference/decoding: easy | harder (but note recent progress)
• Scalability/compute: easier (regular computes/GPU) | harder (but note recent progress)
• Incorporating uncertainty: hard | easy
• Empirical goal: classification, feature learning, ... | classification (via Bayes rule), latent-variable inference, ...
• Terminology: neurons, activation/gate functions, weights, ... | random variables, stochastic "neurons", potential functions, parameters, ...
• Learning algorithm: a single, unchallenged algorithm -- BackProp | a major focus of open research; many algorithms, & more to come
• Evaluation: on a black-box score -- end performance | on almost every intermediate quantity
• Implementation: hard (but increasingly easier) | standardized but insights needed
• Experiments: massive, real data | modest, often simulated data
• Parameterization: dense matrices | sparse (often PDFs); can be dense

Example: (Shallow) Generative Model

[Figure: a topic model as a shallow generative model -- a hidden layer of "TOPICS" (war, animals, ..., computers) generates observed words (e.g. "Iraqi", "the", "Matlab").]
(slide revised from: Max Welling)

Another example: Medical Diagnosis

[Figure: a bipartite generative model -- hidden disease nodes generate observed symptom nodes. Inference problem: what is the most probable disease given the symptoms?]

Example: (Deep) Generative Model

[Figure: the deep generative model of speech -- targets --> articulation --> distortion-free acoustics --> distorted acoustics, with distortion factors and feedback to articulation.]

Deep Generative/Graphical Model Inference

• Key issues: – Representation: syntax and semantics (directed/undirected,variables/factors,..) – Inference: computing probabilities and most likely assignments/explanations – Learning: of model parameters based on observed data. Relies on inference! • Inference is NP-hard (incl. approximation hardness) • Exact inference: works for very limited subset of models/structures – E.g., chains or low-treewidth trees • Approximate inference: highly computationally intensive – Deterministic: Variational, loopy belief propagation, expectation propagation – Numerical sampling (Monte Carlo): Gibbs sampling • Variational learning: – EM algorithm – E-step uses variational inference (recent new advances in ML) Variational inference/learning is not trivial: Example

[Figure: the same diseases/symptoms model. Difficulty -- "explaining away": an observation introduces correlations among the nodes in the parent hidden layer.]

Variational EM

Step 1: Maximize the bound with respect to Q
  (E step):  Q^(k+1) = argmax_Q L(Q, θ^(k))
  (Q approximates the true posterior, often by factorizing it)

Step 2: Fix Q, maximize with respect to θ
  (M step):  θ^(k+1) = argmax_θ L(Q^(k+1), θ)

Note: in traditional EM, Q is exact, e.g. posteriors computed by the forward/backward algorithm for HMMs.

98 Statistical HDM Deterministic (hidden dynamic HDM Model)

Bridle, Deng, Picone, Richards, Ma, Kamm, Schuster, Pike, Reagan. Final Report for Workshop on Language Engineering, 99 Johns Hopkins University, 1998. (experiments on Switchboard tasks) ICASSP-2004

Auxiliary function:

102 Surprisingly Good Inference Results for Continuous Hidden States

• By-product: accurately tracking dynamics of resonances (formants) in vocal tract (TIMIT & SWBD). • Best formant tracker by then in speech analysis; used as basis to form a formant database as “ground truth” • We thought we solved the ASR problem, except • “Intractable” for decoding

Deng & Huang. "Challenges in Adopting Speech Recognition," Communications of the ACM, vol. 47, pp. 69-75, 2004.
Deng, Cui, Pruvenok, Huang, Momen, Chen, Alwan. "A Database of Vocal Tract Resonance Trajectories for Research in Speech Processing," ICASSP, 2006.

Deep Generative Models in Speech Recognition (prior to the rise of deep learning)

Segment & Nonstationary-State Models • Digalakis, Rohlicek, Ostendorf. “ML estimation of a stochastic linear system with the EM alg & 1993 application to speech recognition,” IEEE T-SAP, 1993 • Deng, Aksmanovic, Sun, Wu, Speech recognition using HMM with polynomial regression functions 1994 as nonstationary states,” IEEE T-SAP, 1994. Hidden Dynamic Models (HDM) • Deng, Ramsay, Sun. “Production models as a structural basis for automatic speech recognition,” 1997 Speech Communication, vol. 33, pp. 93–111, 1997. 1998 • Bridle et al. “An investigation of segmental hidden dynamic models of speech coarticulation for speech recognition,” Final Report Workshop on Language Engineering, Johns Hopkins U, 1998. • Picone et al. “Initial evaluation of hidden dynamic models on conversational speech,” ICASSP, 1999. 1999 • Deng and Ma. “Spontaneous speech recognition using a statistical co-articulatory model for the 2000 vocal tract resonance dynamics,” JASA, 2000. Structured Hidden Trajectory Models (HTM) • Zhou, et al. “Coarticulation modeling by embedding a target-directed hidden trajectory model into 2003 HMM,” ICASSP, 2003.  DARPA EARS Program 2001-2004: Novel Approach II 2006 • Deng, Yu, Acero. “Structured speech modeling,” IEEE Trans. on Audio, Speech and Language Processing, vol. 14, no. 5, 2006. Switching Nonlinear State-Space Models • Deng. “Switching Dynamic System Models for Speech Articulation and Acoustics,” in Mathematical Foundations of Speech and Language Processing, vol. 138, pp. 115 - 134, Springer, 2003. • Lee et al. “A Multimodal Variational Approach to Learning and Inference in Switching State Space Models,” ICASSP, 2004. 104 Other Deep Generative Models (developed outside speech)

• Sigmoid belief nets & the wake/sleep algorithm (1992)
• Deep belief nets (DBN, 2006) --> start of deep learning
• Totally non-obvious result: stacking many RBMs (undirected) gives not a Deep Boltzmann Machine (DBM, undirected) but a DBN (directed generative model)
• Excellent in generating images & ...
• Similar type of deep generative model to the HDM, but simpler: no temporal dynamics, and with very different parameterization
• Most intriguing property of the DBN: inference is easy (i.e. no need for approximate variational Bayes), due to the "restriction" of connections in the RBM
• Pros/cons analysis --> Hinton coming to MSR, 2009

105 This is a very different kind of deep generative model

(Mohamed, Dahl, Hinton, 2009, 2012) (Deng et al., 2006; Deng & Yu, 2007)

(after adding Backprop to the generative DBN)

Error Analysis

• Deep generative model (HTM): elegant model formulation & knowledge incorporation; strong empirical results (96% TIMIT accuracy with N-best=1001; 75.2% with lattice decoding using monophones); but still very expensive for decoding -- could not ship (very frustrating!)
• DBN/DNN: fast approximate training, but made many new errors on short, undershot vowels -- the 11-frame input window contains too much "noise"

Early Successes of Deep Learning in Speech Recognition

108 Academic-Industrial Collaboration (2009,2010)

• I invited Geoff Hinton to work with me at MSR, Redmond • Well-timed academic-industrial collaboration: – ASR industry searching for new solutions when “principled” deep generative approaches could not deliver – Academia developed deep learning tools (e.g. DBN 2006) looking for applications – Add Backprop to deep generative models (DBN)  DNN (hybrid generative/discriminative) – Advent of GPU computing (Nvidia CUDA library released 2007/08) – Big training data in speech recognition were already available 109 Invitee 1: give me one week to decide …,… Not worth my time to fly to Vancouver for this…

Mohamed, Dahl, Hinton, Deep belief networks for phone recognition, NIPS 2009 Workshop on Deep Learning, 2009 Yu, Deng, Wang, Learning in the Deep-Structured Conditional Random Fields, NIPS 2009 Workshop on Deep Learning, 2009 110 …, …, … Expanding DNN at Industry Scale

• Scale DNN’s success to large speech tasks (2010-2011) – Grew output neurons from context-independent phone states (100-200) to context-dependent ones (1k-30k)  CD-DNN-HMM for Bing Voice Search and then to SWBD tasks – Motivated initially by saving huge MSFT investment in the speech decoder software infrastructure – CD-DNN-HMM also gave much higher accuracy than CI-DNN-HMM – Earlier NNs made use of context only as appended inputs, not coded directly as outputs – Discovered that with large training data Backprop works well without DBN pre-training by understanding why gradients often vanish (patent filed for “discriminative pre-training” 2011)

• Engineering for large speech systems: – Combined expertise in DNN (esp. with GPU implementation) and speech recognition – Collaborations among MSRR, MSRA, academic researchers

• Yu, Deng, Dahl, Roles of Pre-Training and Fine-Tuning in Context-Dependent DBN-HMMs for Real-World Speech Recognition, in NIPS Workshop on Deep Learning, 2010. • Dahl, Yu, Deng, Acero, Large Vocabulary Continuous Speech Recognition With Context-Dependent DBN-HMMS, in Proc. ICASSP, 2011. • Dahl, Yu, Deng, Acero, Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition, in IEEE Transactions on Audio, Speech, and Language Processing (2013 IEEE SPS Best Paper Award) , vol. 20, no. 1, pp. 30-42, January 2012. • Seide, Li, Yu, "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks", Interspeech 2011, pp. 437-440. • Hinton, Deng, Yu, Dahl, Mohamed, Jaitly, Senior, Vanhoucke, Nguyen, Sainath, Kingsbury, Deep Neural Networks for Acoustic Modeling in SpeechRecognition, in IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, November 2012 • Sainath, T., Kingsbury, B., Ramabhadran, B., Novak, P., and Mohamed, A. “Making deep belief networks effective for large vocabulary continuous speech recognition,” Proc. ASRU, 2011. • Sainath, T., Kingsbury, B., Soltau, H., and Ramabhadran, B. “Optimization Techniques to Improve Training Speed of Deep Neural Networks for Large Speech Tasks,” IEEE Transactions on Audio, Speech, and Language Processing, vol.21, no.11, pp.2267-2276, Nov. 2013. • Jaitly, N., Nguyen, P., Senior, A., and Vanhoucke, V. “Application of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition,” Proc. Interspeech, 2012. 112 DNN-HMM (replacing GMM only; longer MFCC/filter-back windows w. no transformation)

Model tied triphone states directly

Many layers of nonlinear feature transformation + SoftMax
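Not on the original slides, but standard practice for the hybrid DNN-HMM described here: the DNN's senone (tied-triphone-state) posteriors are divided by the state priors to obtain scaled likelihoods for HMM decoding. A minimal sketch under that assumption:

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, state_priors, floor=1e-8):
    """Convert DNN senone posteriors p(s|x_t) into scaled likelihoods
    p(x_t|s) proportional to p(s|x_t) / p(s), used as HMM emission scores.
    log_posteriors: (T, S) array of log p(s|x_t); state_priors: (S,) array p(s)."""
    return log_posteriors - np.log(np.maximum(state_priors, floor))

# toy usage: 100 frames, 3000 tied triphone states (senones)
rng = np.random.default_rng(0)
logits = rng.normal(size=(100, 3000))
log_post = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))   # log-softmax
priors = np.full(3000, 1.0 / 3000)        # in practice estimated from forced alignments
emission_scores = scaled_log_likelihoods(log_post, priors)
```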

113 DNN vs. Pre-DNN Prior-Art

Table: TIMIT phone recognition (3 hours of training)
  Pre-DNN: deep generative model -- 24.8% error
  DNN: 5 layers x 2048 -- 23.4% error   (~10% relative improvement)

Table: Voice Search sentence error rate (24-48 hours of training)
  Pre-DNN: GMM-HMM with MPE -- 36.2% error
  DNN: 5 layers x 2048 -- 30.1% error   (~20% relative improvement)

Table: Switchboard word error rate (309 hours of training)
  Pre-DNN: GMM-HMM with BMMI -- 23.6% error
  DNN: 7 layers x 2048 -- 15.8% error   (~30% relative improvement)

For DNN, the more data, the better!

"Scientists See Promise in Deep-Learning Programs" -- John Markoff, The New York Times, November 23, 2012
Rick Rashid in Tianjin, China, October 25, 2012

Deep learning technology enabled speech-to-speech translation A voice recognition program translated a speech given by Richard F. Rashid, Microsoft’s top scientist, into Mandarin Chinese. CD-DNN-HMM

Dahl, Yu, Deng, and Acero. "Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition," IEEE Trans. ASLP, Jan. 2012 (also ICASSP 2011).
Seide et al., Interspeech, 2011.

After no improvement for 10+ years by the research community… …MSR reduced error from ~23% to <13% (and under 7% for Rick Rashid’s S2S demo in 2012)! Impact of deep learning in speech technology

Cortana

(Slide from Bengio, 2015)

In the Academic World

“This joint paper from the major speech recognition laboratories was the first major industrial application of deep learning.”

122 DNN: (Fully-Connected) Deep Neural Networks “DNN for acoustic modeling in speech recognition,” in IEEE SPM, Nov. 2012

First train a stack of N models, each of which has one hidden layer; each model in the stack treats the hidden variables of the previous model as data. Then compose them into a single Deep Belief Network (DBN). Then add outputs and train the DNN with backprop.

ASR Issues and Solutions:
• How to reduce the runtime without accuracy loss? -- SVD
• How to do speaker adaptation with a low footprint? -- SVD-based adaptation
• How to be robust to noise? -- variable-component CNN
• How to reduce the accuracy gap between large and small DNNs? -- teacher-student learning using output posteriors
• How to deal with a large variety of data? -- DNN factorization, mixed-band training
• How to enable languages with limited training data? -- multi-lingual DNN

Microsoft Research 124 More Recent Development of Deep Learning for Speech

125 Chapter 7

126 Innovation: Better Optimization

• Sequence discriminative training for DNN: - Mohamed, Yu, Deng: “Investigation of full-sequence training of deep belief networks for speech recognition,” Interspeech, 2010. - Kingsbury, Sainath, Soltau. “Scalable minimum Bayes risk training of DNN acoustic models using distributed hessian-free optimization,” Interspeech, 2012. - Su, Li, Yu, Seide. “Error back propagation for sequence training of CD deep networks for conversational speech transcription,” ICASSP, 2013. - Vesely, Ghoshal, Burget, Povey. “Sequence-discriminative training of deep neural networks, Interspeech, 2013. • Distributed asynchronous SGD - Dean, Corrado,…Senior, Ng. “Large Scale Distributed Deep Networks,” NIPS, 2012. Input data X - Sak, Vinyals, Heigold, Senior, McDermott, Monga, Mao. “Sequence Discriminative Distributed Training of Long Short-Term Memory Recurrent Neural Networks,” Interspeech,2014.

127 Innovation: Towards Raw Inputs

• Bye-Bye MFCCs (no more cosine transform, Mel-scaling?) - Deng, Seltzer, Yu, Acero, Mohamed, Hinton. “Binary coding of speech using a deep auto-encoder,” Interspeech, 2010. - Mohamed, Hinton, Penn. “Understanding how deep belief networks perform acoustic modeling,” ICASSP, 2012. - Li, Yu, Huang, Gong, “Improving wideband speech recognition using mixed-bandwidth training data in CD-DNN-HMM” SLT, 2012 - Deng, J. Li, Huang, Yao, Yu, Seide, Seltzer, Zweig, He, Williams, Gong, Acero. “Recent advances in deep learning for speech research at Microsoft,” ICASSP, 2013. - Sainath, Kingsbury, Mohamed, Ramabhadran. “Learning filter banks within a deep neural network framework,” ASRU, 2013. • Bye-Bye Fourier transforms? - Jaitly and Hinton. “Learning a better representation of speech sound waves using RBMs,” ICASSP, 2011. - Tuske, Golik, Schluter, Ney. “Acoustic modeling with deep neural networks using raw time signal for LVCSR,” Interspeech, 2014. - Golik et al, “Convolutional NNs for acoustic modeling of raw time signals in LVCSR,” Interspeech, 2015. Input data X - Sainath et al. “Learning the Speech Front-End with Raw Waveform CLDNNs,” Interspeech, 2015 • DNN as hierarchical nonlinear feature extractors: - Seide, Li, Chen, Yu. “Feature engineering in context-dependent deep neural networks for conversational speech transcription, ASRU, 2011. - Yu, Seltzer, Li, Huang, Seide. “Feature learning in deep neural networks - Studies on speech recognition tasks,” ICLR, 2013. - Yan, Huo, Xu. “A scalable approach to using DNN-derived in GMM-HMM based acoustic modeling in LVCSR,” Interspeech, 2013. - Deng, Chen. “Sequence classification using high-level features extracted from deep neural networks,” ICASSP, 2014. 128 Innovation: Transfer/Multitask Learning & Adaptation

Adaptation to speakers & environments (i-vectors)

• Too many references to list & organize 129 Innovation: Better regularization & nonlinearity


130 Innovation: Better architectures

• Recurrent Nets (bi-directional RNN/LSTM) and Conv Nets (CNN) are superior to fully-connected DNNs

• Sak, Senior, Beaufays. “LSTM Recurrent Neural Network architectures for large scale acoustic modeling,” Interspeech,2014.

• Soltau, Saon, Sainath. ”Joint Training of Convolutional and Non-Convolutional Neural Networks,” ICASSP, 2014.

131 Innovation: Ensemble Deep Learning

• Ensembles of RNN/LSTM, DNN, & Conv Nets (CNN) give huge gains: • T. Sainath, O. Vinyals, A. Senior, H. Sak. “Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks,” ICASSP 2015. • L. Deng and John Platt, Ensemble Deep Learning for Speech Recognition, Interspeech, 2014. • G. Saon, H. Kuo, S. Rennie, M. Picheny. “The IBM 2015 English conversational telephone speech recognition system,” arXiv, May 2015. (8% WER on SWB-309h)

132 Innovation: Better learning objectives/methods

• Use of CTC as a new objective in RNN/LSTM with end2end learning drastically simplifies ASR systems
• Predict graphemes or words directly; no pronunciation dictionaries; no context dependency; no decision trees
• Use of "blank" symbols may be equivalent to a special HMM state-tying scheme --> CTC/RNN has NOT replaced the (left-to-right) HMM
• A relative ~8% gain from CTC has been shown by a very limited number of labs
(see the sketch of the CTC label-collapsing rule below)
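A tiny sketch (mine) of the CTC many-to-one collapsing rule B(.) that the "blank" symbol makes possible -- repeated symbols are merged, then blanks are removed -- which is what lets a frame-level path stand for a much shorter label string:

```python
def ctc_collapse(path, blank="_"):
    """CTC mapping B(.): merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# both frame-level paths map to the word "cat"
assert ctc_collapse("cc_aa_t") == "cat"
assert ctc_collapse("_c_aatt_") == "cat"
# the blank also allows genuinely repeated output letters, e.g. "bee"
assert ctc_collapse("b_ee_e") == "bee"
```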

• A. Graves and N. Jaitly. "Towards End-to-End Speech Recognition with Recurrent Neural Networks," ICML, 2014.
• A. Hannun, A. Ng, et al. "DeepSpeech: Scaling up End-to-End Speech Recognition," arXiv, Nov. 2014.
• A. Maas et al. "Lexicon-Free Conversational ASR with NN," NAACL, 2015.
• H. Sak et al. "Learning Acoustic Frame Labeling for ASR with RNN," ICASSP, 2015.
• H. Sak, A. Senior, K. Rao, F. Beaufays. "Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition," Interspeech, 2015.

Innovation: A new paradigm for speech recognition

• Seq2seq learning with attention mechanism (borrowed from NLP-MT)

• W. Chan, N. Jaitly, Q. Le, O. Vinyals. “Listen, attend, and spell,” arXiv, 2015. • J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, Y. Bengio. “Attention-Based Models for Speech Recognition,” arXiv, 2015.
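The heart of these attention-based models is a per-step weighting of encoder states. A minimal dot-product version (my simplification; the cited papers use learned additive/content-based scoring functions):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """Score each encoder state against the current decoder state, normalize
    with a softmax, and return the attention-weighted context vector."""
    scores = encoder_states @ decoder_state        # (T,) alignment scores
    weights = softmax(scores)                      # attention distribution over time
    return weights @ encoder_states, weights       # context vector, weights

# toy usage: 50 encoder frames of dimension 32
rng = np.random.default_rng(0)
enc = rng.normal(size=(50, 32))
ctx, w = attention_context(rng.normal(size=32), enc)
```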

A Perspective on Recent Innovations in ASR

• All the above deep learning innovations are based on supervised, discriminative learning of DNNs and recurrent variants
• Capitalizing on big, labeled data
• Incorporating the monotonic-sequential structure of speech (non-monotonic for language, later)
• Hard to incorporate many other aspects of speech knowledge (e.g. a speech distortion model)
• Hard to do semi- and unsupervised learning
--> Deep generative modeling may overcome such difficulties

Li Deng and Roberto Togneri. Chapter 6: "Deep Dynamic Models for Learning Hidden Representations of Speech Features," pp. 153-196, Springer, December 2014.
Li Deng and Navdeep Jaitly. Chapter 2: "Deep discriminative and generative models for pattern recognition," ~30 pages, in Handbook of Pattern Recognition and Computer Vision: 5th Edition, World Scientific Publishing, Jan 2016.

[Table (repeated from Part I): deep neural nets vs. deep generative models -- structure (bottom-up vs. top-down information flow), incorporating constraints & domain knowledge (hard vs. easy), unsupervised learning (harder or impossible vs. easier), interpretation (harder vs. easier), representation (distributed vs. mostly localist), inference/decoding (easy vs. harder), scalability/compute (easier vs. harder), incorporating uncertainty (hard vs. easy), empirical goal (classification, feature learning vs. classification via Bayes rule, latent-variable inference), terminology (neurons, activations, weights vs. random variables, potential functions, parameters), learning algorithm (the single unchallenged BackProp vs. a major focus of open research), evaluation (black-box end performance vs. almost every intermediate quantity), implementation (hard but increasingly easier vs. standardized but insights needed), experiments (massive real data vs. modest, often simulated data), parameterization (dense matrices vs. sparse, often PDFs).]

Example 1: Interpretable deep learning using deep topic models (NIPS-2015)

137 Recall: (Shallow) Generative Model

[Figure (repeated): a topic model -- hidden "TOPICS" (war, animals, ..., computers) generate observed words ("Iraqi", "the", "Matlab").]
(slide revised from: Max Welling)

Interpretable deep learning: Deep generative model

• Constructing interpretable DNNs based on generative topic models • End-to-end learning by mirror-descent backpropagation  to maximize posterior probability p(y|x) • y: output (win/loss), and x: input feature vector

Mirror Descent Algorithm (MDA)

J. Chen, J. He, Y. Shen, L. Xiao, X. He, J. Gao, X. Song, and L. Deng, “End-to-end Learning of Latent Dirichlet Allocation by Mirror-Descent Back Propagation”, submitted to NIPS2015. Example 2: Unsupervised learning using deep generative model (ACL, 2013)

• Distorted character-string images --> text
• Easier than unsupervised speech --> text
• 47% error reduction over Google's open-source OCR system

Motivated me to think about unsupervised ASR and NLP

140 Power: Character-level LM & generative modeling for unsupervised learning

• “Image” data are naturally “generated” by the model quite accurately (like “computer graphics”) • I had the same idea for unsupervised generative Speech-to-Text in 90’s • Not successful because 1) Deep generative models were too simple for generating speech waves 2) Inference/learning methods for deep generative models not mature then 3) Computers were too slow

141 Deep Generative Model for Image-Text Deep Generative Model for Speech-Text (Berg-Kirkpatrick et al., 2013, 2015) (Deng, 1998; Deng et al, 1997, 2000, 2003, 2006)

L. Deng. "A dynamic, feature-based approach to the interface between phonology & phonetics for speech modeling and recognition," Speech Communication, vol. 24, no. 4, pp. 299-323, 1998.

[Series of side-by-side figures comparing the two generative pipelines (image-to-text: Berg-Kirkpatrick et al., 2013, 2015; speech-to-text: Deng, 1998; Deng et al., 2000, 2003, 2006):
 - both start from a word-level language model, plus (for speech) a feature-level pronunciation model;
 - the speech model adds articulatory dynamics and an articulatory-to-acoustics mapping;
 - inference and learning are easy for image-text (likely no "explaining away" problem) but hard for speech-text (pervasive "explaining away" due to speech dynamics).]

145 • In contrast, articulatory-to-acoustics mapping in ASR is much more complex • During 1997-2000, shallow NNs were used for this as “universal approximator” • Not successful • Now we have better DNN tool • Even RNN/LSTM/CTC tool for dynamic modeling

• Essence: exploit the strong prior of the LM, trained with billions of words of text
• No need to pair the text with acoustics; hence unsupervised learning
• Think of it as using a new objective function based on distribution matching between the LM prior and the distribution of words predicted from the acoustic models
• (The image-generation side is very simple & easy to model accurately.)

Further thoughts on unsupervised ASR

• Deep generative modeling experiments not successful in 90’s • Computers were too slow • Models were too simple from text to speech waves • Still true  need speech scientists to work harder with technologists • And when generative models are not good enough, discriminative models and learning (e.g., RNN) can help a lot • Further, can iterate between the two, like wake-sleep (algorithm) • Inference/learning methods for deep generative models not mature at that time • Only partially true today •  due to recent big advances in machine learning • Based on new ways of thinking about generative graphical modeling motivated by the availability of deep learning tools (e.g. DNN) • A brief review next 147 Advances in Inference Algms for Deep Generative Models Kingma & Welling 2014, Salakhutdinov et al, 2015 ICML-2014 Talk Monday June 23, 15:20

In Track F (Deep Learning II)

“Efficient Gradient Based Inference through Transformations between Bayes Nets and Neural Nets” Other solutions to solve the "large variance problem“ in variational inference: -Variational Bayesian Inference with Stochastic Search [D.M. Blei, M.I. Jordan and J.W. Paisley, 2012] -Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression [T. Salimans and A. Knowles, 2013]. -Black Box Variational Inference. [R. Ranganath, S. Gerrish and D.M. Blei. 2013] -Stochastic Variational Inference [M.D. Hoffman, D. Blei, C. Wang and J. Paisley, 2013] -Estimating or propagating gradients through stochastic neurons. [Y. Bengio, 2013]. -Neural Variational Inference and Learning in Belief Networks. [A. Mnih and K. Gregor, 2014, ICML] -Stochastic backprop & approximation inference in deep generative models [D. Rezende, S. Mohamed, D. Wierstra, 2014] -Semi-supervised learning with deep generative models [K. Kingma, D. Rezende, S. Mohamed, M. Welling, 2014, NIPS] -auto-encoding variational Bayes [K. Kingma, M. Welling, 2014, ICML] -Learning stochastic recurrent networks [Bayer and Osendorfer, 2015 ICLR] -DRAW: A recurrent neural network for image generation. [K. Gregor, Danihelka, Rezende, Wierstra, 2015] - Plus a number of NIPS-2015 papers, to appear. 148 Slide provided by Max Welling (ICML-2014 tutorial) w. my updates on references of 2015 and late 2014 Further thoughts on unsupervised ASR

• Deep generative modeling experiments not successful in 90’s • Computers were too slow • Models were too simple from text to speech waves • Still true  need speech scientists to work harder with technologists • And when generative models are not good enough, discriminative models and learning (e.g., RNN) can help a lot; but to do it? A hint next • Further, can iterate between the two, like wake-sleep (algorithm) • Inference/learning methods for deep generative models not mature at that time • Only partially true today •  due to recent big advances in machine learning • Based on new ways of thinking about generative graphical modeling motivated by the availability of deep learning tools (e.g. DNN) • A brief review next 149 RNN vs. Generative HDM

RNN parameterization:
• W_hh, W_hy, W_xh: all unstructured regular matrices

Generative HDM parameterization:
• W_hh = M(γ_l): sparse system matrix
• W_hy = Ω_l: Gaussian-mixture parameters; MLP
• Λ = t_l

150 Generative HDM

151 RNN vs. Generative HDM

[Figure: RNN ~ DNN, generative HDM ~ DBN]

e.g. generative pre-training of the RNN (analogous to generative DBN pre-training for the DNN); NIPS-2015 paper to appear on simpler dynamic models for a non-ASR application

Better ways of integrating deep generative/discriminative models are possible
- Hint: Example 1, where generative models are used to define the DNN architecture

End of Part II: Speech
- Deep supervised learning shattered ASR via DNN/LSTM
- Deep unsupervised learning may have more impact in the future
- No more low-hanging fruit

Deep Learning also Shattered the Entire Field of Image Recognition and Computer Vision (since 2012)

153 Krizhevsky, Sutskever, Hinton, “ImageNet Classification with Deep Convolutional Neural Networks.” NIPS, Dec. 2012

[Chart: ImageNet top-5 error rates -- the deep CNN from the Univ. of Toronto team (2012), GoogLeNet at 6.67%, down to 4.94%.]

Part III: Language

Moving from perception: phonetic/word recognition, image/gesture recognition, etc to Cognition: memory, attention, reasoning, Q/A, & decision making, etc.

Embedding enables exploring models for these human cognitive functions

155 Embedding Linguistic entities in low-dimensional continuous space and Examples of NLP Applications

f(cat) = the 1-hot word vector (a 1 at the index of "cat" in the vocabulary, 0 elsewhere), which is mapped to a word embedding vector in the semantic space.

Deerwester, Dumais, Furnas, Landauer, Harshman, "Indexing by latent semantic analysis," JASIS 1990
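A tiny NumPy illustration (mine; the vocabulary and dimensions are toy values) of why the 1-hot vector and the embedding matrix together define the word vector -- multiplying by a 1-hot vector just selects one row of the matrix:

```python
import numpy as np

vocab = ["cat", "dog", "chases", "the", "mat"]
V, d = len(vocab), 4                       # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))                # embedding matrix (one row per word)

def one_hot(word):
    v = np.zeros(V)
    v[vocab.index(word)] = 1.0
    return v

# the embedding of "cat" is the row of E selected by its 1-hot vector
assert np.allclose(E.T @ one_hot("cat"), E[vocab.index("cat")])
```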

Microsoft Research 157 NN Word Embedding

Bengio, Ducharme, Vincent, Jauvin. "A neural probabilistic language model," JMLR, 2003.

Scoring:
  Score(w1, w2, w3, w4, w5) = U^T σ( W [f1, f2, f3, f4, f5] + b )
Training (margin ranking loss):
  J = max(0, 1 + S^- − S^+); update the model until S^+ > 1 + S^-
where
  S^+ = Score(w1, w2, w3, w4, w5)
  S^- = Score(w1, w2, w^-, w4, w5)
and <w1, w2, w3, w4, w5> is a valid 5-gram, while <w1, w2, w^-, w4, w5> is a "negative sample" constructed by replacing the word w3 with a random word w^-
  (e.g., a negative example: "cat chills X a mat")

Collobert, Weston, Bottou, Karlen, Kavukcuoglu, Kuksa. "Natural Language Processing (Almost) from Scratch," JMLR 2011.
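The scoring function and margin ranking loss above, written as a short NumPy sketch (my illustration; sizes, the sigmoid choice, and the single corrupted sample are assumptions for brevity):

```python
import numpy as np

def hinge_ranking_loss(score_pos, score_neg):
    """J = max(0, 1 + S_neg - S_pos): push the valid n-gram's score at least 1
    above the score of its corrupted version."""
    return max(0.0, 1.0 + score_neg - score_pos)

def score(window_vectors, U, W, b):
    """Score(w1..w5) = U^T sigma(W [f1;...;f5] + b) over concatenated embeddings."""
    f = np.concatenate(window_vectors)
    h = 1.0 / (1.0 + np.exp(-(W @ f + b)))
    return float(U @ h)

# toy usage: 5-word window, 10-dim embeddings, 20 hidden units
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(20, 50)), rng.normal(size=20), np.zeros(20)
valid = [rng.normal(size=10) for _ in range(5)]
corrupt = list(valid); corrupt[2] = rng.normal(size=10)   # replace the middle word
J = hinge_ranking_loss(score(valid, U, W, b), score(corrupt, U, W, b))
```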

Microsoft Research 159 Word Embedding

[Figure: example word vectors (e.g. "cat", "chases", "is") in the continuous embedding space, where consistent vector offsets capture linguistic regularities.]

Mikolov, Yih, Zweig. "Linguistic Regularities in Continuous Space Word Representations," NAACL 2013.
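The linguistic regularities in the cited paper are tested with vector arithmetic of the "king - man + woman ≈ queen" kind; a minimal sketch (mine) of that analogy test, assuming an embedding matrix E indexed by a vocabulary list:

```python
import numpy as np

def analogy(a, b, c, E, vocab):
    """Return the word whose embedding is closest (by cosine) to v(b) - v(a) + v(c),
    excluding the three query words themselves."""
    idx = {w: i for i, w in enumerate(vocab)}
    target = E[idx[b]] - E[idx[a]] + E[idx[c]]
    target /= np.linalg.norm(target)
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    scores = En @ target
    for w in (a, b, c):
        scores[idx[w]] = -np.inf
    return vocab[int(np.argmax(scores))]

# usage (with a trained E): analogy("man", "king", "woman", E, vocab) -> "queen"
```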

Microsoft Research 160 Continuous Bag-of-Words The CBOW architecture (a) on the left, and the Skip-gram architecture (b) on the right. [Mikolov et al., 2013 ICLR].

Microsoft Research 161 # words

[Figure: a word-embedding lookup -- the embedding matrix has one column per vocabulary word (w1, w2, ..., wN) and "dim" rows, so v(w) is the column for word w.]

• However, for large-scale NL tasks a decomposable, robust representation is preferable
• The vocabulary of real-world big-data tasks can be huge (scalability): >100M unique words in a modern commercial search engine log, and it keeps growing
• New words, misspellings, and word fragments frequently occur (generalizability)

Microsoft Research 162 embedding vector 푊 → 푈 × 푉 embedding vector dim=500 dim=500 SWU embedding word embedding 푈 matrix: 500 × 50퐾 matrix: 500 × 100푀 푊 dim = 50K SWU encoding 푉 matrix dim = 100M dim = 100M 1-hot word vector 1-hot word vector Could go up to extremely large Huang, He, Gao, Deng, Acero, Heck, “Learning deep structured semantic models for web search using clickthrough data,” CIKM, 2013

Microsoft Research 163 Microsoft Research 164 Example: cat → #cat# → #-c-a, c-a-t, a-t-# (w/ word boundary mark #)

v(cat) = Σ_{k=1..K} α_{cat,k} · u_k

where u_k is the embedding vector of letter-trigram LTG(k), K is the total number of letter-trigrams, and α_{cat,k} counts how often LTG(k) (e.g. #-c-a, c-a-t, a-t-#) occurs in the word "cat"; the letter-trigram embedding matrix has dimension (embedding dim) x K.

Two words having the same letter-trigram representation: collision rate ≈ 0.004%
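A minimal sketch (mine) of the letter-trigram "word hashing" step used at the input of these models, following the #cat# example above:

```python
def letter_trigrams(word):
    """Add word-boundary marks and break the word into letter trigrams,
    e.g. cat -> #cat# -> #-c-a, c-a-t, a-t-#."""
    s = "#" + word + "#"
    return [s[i:i + 3] for i in range(len(s) - 2)]

assert letter_trigrams("cat") == ["#ca", "cat", "at#"]
# a word is then represented by the sparse count vector of its trigrams,
# so new words and misspellings still get a usable representation
```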

Microsoft Research 165 Supervised Embeddings for Semantic Modeling with Applications

--- embedding linguistic symbols by backprop --- mining distant supervision signals

Microsoft Research 166 Deep Structured Semantic Model (DSSM) • Build word/phrase, or sentence-level semantic vector representation • Trained by a similarity-driven objective • projecting semantically similar phrases to vectors close to each other • projecting semantically different phrases to vectors far apart

Microsoft Research 167 DSSM for learning semantic embedding

Initialization: Neural networks are initialized with random weights

[Figure: three parallel columns of the same architecture map the inputs s: “racing car”, t+: “formula one”, and t-: “racing to me” to semantic vectors $v_s$, $v_{t^+}$, $v_{t^-}$ (d = 300). Each input is a bag-of-words vector (dim = 100M), mapped by the fixed letter-trigram encoding matrix W1 to a letter-trigram vector (dim = 50K), then through the letter-trigram embedding matrix W2 and hidden layers W3, W4 (d = 500) up to the semantic vector.]

DSSM for learning semantic embedding

Training: compute the cosine similarities between the semantic vectors, $\cos(v_s, v_{t^+})$ and $\cos(v_s, v_{t^-})$, and compute the gradients $\partial/\partial W$ of

$\dfrac{\exp(\cos(v_s, v_{t^+}))}{\sum_{t' \in \{t^+, t^-\}} \exp(\cos(v_s, v_{t'}))}$

[Figure: the same DSSM architecture as above (semantic vectors $v_s$, $v_{t^+}$, $v_{t^-}$ of d = 300; hidden layers W4, W3 of d = 500; letter-trigram embedding matrix W2 over dim = 50K; fixed letter-trigram encoding matrix W1 over the bag-of-words vector of dim = 100M), with inputs s: “racing car”, t+: “formula one”, t-: “racing to me”.]
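A numpy sketch of this training objective (my own illustration; the smoothing factor $\gamma$ appears on a later slide, and real training backpropagates the loss into the network weights W1..W4):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def dssm_prob(v_s, v_t_pos, v_t_negs, gamma=1.0):
    # Softmax over cosine similarities between the source vector and the
    # positive target plus the negative targets
    sims = np.array([cosine(v_s, v_t_pos)] + [cosine(v_s, v) for v in v_t_negs])
    e = np.exp(gamma * sims)
    return e[0] / e.sum()                      # P(t+ | s)

def dssm_loss(v_s, v_t_pos, v_t_negs, gamma=1.0):
    # Maximize P(t+ | s), i.e., minimize its negative log
    return -np.log(dssm_prob(v_s, v_t_pos, v_t_negs, gamma))
```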

DSSM for learning semantic embedding

Runtime:

[Figure: at runtime, the source and target sides use separate learned weights (Ws,1..Ws,4 and Wt,1..Wt,4) in the same architecture (bag-of-words vector dim = 100M, fixed letter-trigram encoding matrix dim = 50K, letter-trigram embedding matrix, hidden layers d = 500, semantic vectors d = 300). Inputs s: “racing car”, t1: “formula one”, t2: “racing to me” are mapped to semantic vectors $v_s$, $v_{t_1}$, $v_{t_2}$, with $v_{t_1}$ close to (“similar”) and $v_{t_2}$ far from (“apart”) $v_s$.]

context <-> word; query <-> clicked-doc; pattern <-> relationship

$P(d^+ \mid q) = \dfrac{\exp(\gamma \cos(q, d^+))}{\sum_{d \in \mathbf{D}} \exp(\gamma \cos(q, d))}$

Many applications of DSSM (many low-hanging fruits): learning semantic similarity between X and Y

Tasks | Source X | Target Y
Word semantic embedding | context | word
Web search | search query | web documents
Query intent detection | search query | user intent
Question answering | pattern / mention (in NL) | relation / entity (in KB)
Machine translation | sentence in language a | translated sentences in language b
Query auto-suggestion | search query | suggested query
Query auto-completion | partial search query | completed query
Apps recommendation | user profile | recommended apps
Distillation of survey feedbacks | feedbacks in text | relevant feedbacks
Automatic image captioning | image | text caption
Image retrieval | text query | images
Natural user interface | command (text / speech / gesture) | actions
Ads selection | search query | ad keywords
Ads click prediction | search query | ad documents
Email analysis: people prediction | email content | recipients, senders
Email search | search query | email content
Email decluttering | email contents | email contents in similar threads
Knowledge-base construction | entity from source | entity fitting desired relationship
Contextual entity search | key phrase / context | entity / its corresponding page
Automatic highlighting | documents in reading | key phrases to be highlighted
Text summarization | long text | summarized short text

Automatic image captioning (MSR system)

[Figure: a computer vision system (deep neural net features + detector models) detects words in the image, e.g., street, signs, light, under, on, pole, stop, red, sign, building, bus, traffic, city.]

Language Model

Caption Generation System: candidate captions, e.g.,
• a red stop sign sitting under a traffic light on a city street
• a stop sign at an intersection on a street
• a stop sign with two street signs on a pole on a sidewalk
• a stop sign at an intersection on a city street
• …
• a stop sign
• a red traffic light

DSSM Model

Semantic Ranking System

Fang, Gupta, Iandola, Srivastava, Deng, Dollar, Gao, He, Mitchell, Platt, Zitnick, Zweig, “From captions to visual concepts and back,” CVPR 2015 (arXiv 2014)

Microsoft System (MSR): Use of DSSM for Global Semantic Matching
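A hedged sketch of the global semantic matching step (my own illustration, not the MSR system; it assumes the image and each candidate caption have already been mapped to vectors by the image-side and text-side networks of a DSSM-style model):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def rerank_captions(image_vec, captions, caption_vecs):
    # Rank candidate captions by cosine similarity to the image embedding,
    # best global semantic match first
    order = sorted(range(len(captions)),
                   key=lambda i: cosine(image_vec, caption_vecs[i]),
                   reverse=True)
    return [captions[i] for i in order]
```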

[Figure: two example images, A and B, each shown with its machine-generated caption and a human caption.]

Competition results (CVPR-2015, June, Boston)

Won 1st Prize at MS COCO Captioning Challenge 2015!

The top teams and the state of the art. The quality of the captions is measured by human judges (a Turing-Test-style evaluation): the metric is the % of captions that pass the Turing Test.

Team | % of captions that pass the Turing Test | Official Rank
MSR | 32.2% | 1st (tie)
Google | 31.7% | 1st (tie)
MSR Captivator | 30.1% | 3rd (tie)
Montreal/Toronto | 27.2% | 3rd (tie)
Berkeley LRCN | 26.8% | 5th
Human | 67.5% | --

Other groups: Baidu/UCLA, Stanford, Tsinghua, etc.
Note: even a human cannot guarantee to pass the Turing Test 100% of the time.

(Repeated slide: “Many applications of DSSM (many low-hanging fruits): learning semantic similarity between X and Y”; see the table earlier.)

http://blogs.bing.com/search/2014/12/10/bing-brings-the-worlds-knowledge-straight-to-you-with-insights-for-office/

Scenario: Contextual search in Microsoft Office/Word

When “Lincoln” is selected, pages about the car company, the movie, or the town in Nebraska will not appear.

Towards Modeling Cognitive Functions: Memory and Attention

- seq-to-seq learning via LSTM with attention mechanism
- memory nets and neural Turing machines
- dynamic memory nets
- from seq2seq to seq2struct and to struct2struct

Microsoft Research 184 Deep “Thought”-Vector Approach to MT

Deep “Thought”-Vector Approach to MT
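A hedged sketch of the thought-vector idea (my own toy code, not the system behind the slide): an encoder RNN compresses the source sentence into a single vector, and a decoder RNN conditioned on that vector generates the target sentence.

```python
import numpy as np

def rnn_step(W, U, b, h, x):
    return np.tanh(W @ h + U @ x + b)

def encode(source_vectors, params):
    # The final hidden state is the "thought vector" summarizing the source sentence
    W, U, b = params
    h = np.zeros(W.shape[0])
    for x in source_vectors:
        h = rnn_step(W, U, b, h, x)
    return h

def greedy_decode(thought, params, out_proj, target_embeddings, start_id, max_len=10):
    # Generate target words one at a time, feeding back the previous word's embedding
    W, U, b = params
    h, token, outputs = thought, start_id, []
    for _ in range(max_len):
        h = rnn_step(W, U, b, h, target_embeddings[token])
        token = int(np.argmax(out_proj @ h))   # greedy choice of the next target word
        outputs.append(token)
    return outputs
```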

Microsoft Research 186 “attention” mechanism used for “softly” selecting relevant input portion from memory

Popular theories of human memory/attention

The attention and memory models discussed so far are far from human memory/attention mechanisms (https://en.wikipedia.org/wiki/Atkinson%E2%80%93Shiffrin_memory_model):

Hopfield nets store (associative) memories as attractors of the dynamic network
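A tiny illustration of that idea (my own sketch): with a Hebbian outer-product rule, stored patterns become attractors, and iterating the network from a corrupted probe falls back to the nearest stored memory.

```python
import numpy as np

def hopfield_train(patterns):
    # Hebbian outer-product rule; patterns are +/-1 vectors
    n = patterns.shape[1]
    W = sum(np.outer(p, p) for p in patterns) / n
    np.fill_diagonal(W, 0.0)
    return W

def hopfield_recall(W, probe, steps=20):
    # Repeated updates drive the state toward a stored attractor
    s = probe.copy()
    for _ in range(steps):
        s = np.sign(W @ s)
        s[s == 0] = 1
    return s

patterns = np.array([[1, -1, 1, -1, 1, -1, 1, -1],
                     [1, 1, 1, 1, -1, -1, -1, -1]])
W = hopfield_train(patterns)
noisy = patterns[0].copy(); noisy[0] *= -1          # corrupt one bit
print(hopfield_recall(W, noisy))                    # recovers the stored pattern
```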

LSTM mainly models short-term memory

LSTM does not model long-term memory well

• LSTM makes short-term memory last via a simple “unit-loop” mechanism, very different from long-term memory in human cognition
• Review of a very recent modeling study on episodic and semantic memories, extending the basic LSTM formulation
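For reference, a minimal single-step LSTM sketch (my own code, with simplified notation): the cell state c is the self-looping “unit” that carries memory forward, gated by forget, input, and output gates.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    # One LSTM step: the cell state c is the "unit loop" that carries
    # memory forward, controlled by the forget/input/output gates
    Wf, Wi, Wo, Wc, bf, bi, bo, bc = params
    z = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ z + bf)                    # forget gate
    i = sigmoid(Wi @ z + bi)                    # input gate
    o = sigmoid(Wo @ z + bo)                    # output gate
    c = f * c_prev + i * np.tanh(Wc @ z + bc)   # self-loop on the cell state
    h = o * np.tanh(c)
    return h, c
```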

Slides from: Coursera, 2014

Going beyond LSTM --- towards more realistic long-term memory (episodic & semantic)

Towards Modeling Cognitive Functions: Memory and Attention

- seq-to-seq learning via LSTM with attention mechanism
- memory nets and neural Turing machines
- dynamic memory nets
- from seq2seq to seq2struct & to struct2struct

Review: embedding in the form of “flat” vectors

• A linguistic or physical entity, or a simple “relation”, is mapped via distributed representations (by a NN) to a low-dimensional continuous-space vector, or embedding

Connectionist Symbol Processing: Special Issue, vol. 46 (1990); PDP book, 1986 (4 articles)

Extension: from “flat” vectors to structures (tree/graph)

• Structured embedding vectors via tensor-product representation

symbolic semantic parse tree (complex relation)

Then, reasoning in the symbolic space (traditional AI) can be beautifully carried out in the continuous space, in human-cognitive and neural-net (i.e., connectionist) terms

Smolensky & Legendre, The Harmonic Mind: From Neural Computation to Optimality-Theoretic Grammar (Volume I: Cognitive Architecture; Volume II: Linguistic Implications), MIT Press, 2006

Rogers & McClelland, Semantic Cognition, MIT Press, 2006

Example: “Few leaders are admired by George Bush” → admire(George Bush, few leaders)

ƒ(s) = cons(ex1(ex0(ex1(s))), cons(ex1(ex1(ex1(s))), ex0(s)))

$W_f = W_{\mathrm{cons}_0}[W_{\mathrm{ex}_1} W_{\mathrm{ex}_0} W_{\mathrm{ex}_1}] + W_{\mathrm{cons}_1}[W_{\mathrm{cons}_0}(W_{\mathrm{ex}_1} W_{\mathrm{ex}_1} W_{\mathrm{ex}_1}) + W_{\mathrm{cons}_1}(W_{\mathrm{ex}_0})]$

[Figure: the isomorphism between the symbolic parse of the passive sentence (with Aux, V, Agent, Patient, “by”) as input and its meaning (logical form, LF) as output, realized in continuous space by the tensor-product mapping above.]
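A hedged sketch of the underlying role-filler binding (my own simplification of tensor-product representations, with hypothetical role and filler vectors): fillers are bound to roles with outer products, superimposed into one matrix, and recovered by unbinding with the role vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Orthonormal role vectors make exact unbinding possible
roles = np.linalg.qr(rng.standard_normal((d, d)))[0]
role_agent, role_patient = roles[:, 0], roles[:, 1]

filler_bush = rng.standard_normal(d)       # filler vector for "George Bush"
filler_leaders = rng.standard_normal(d)    # filler vector for "few leaders"

# Bind each filler to its role with an outer product, and superimpose
T = np.outer(filler_bush, role_agent) + np.outer(filler_leaders, role_patient)

# Unbind: multiplying by a role vector recovers its filler
recovered_agent = T @ role_agent
print(np.allclose(recovered_agent, filler_bush))   # True, since the roles are orthonormal
```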

Slide from Paul Smolensky, 2015

Summary & Perspective

• Speech recognition is the first success example of deep learning at industry scale
• Deep learning is very effective in speech recognition, speech translation (Skype Translator), image recognition (OneDrive image tagging), image captioning, language understanding (Cortana), semantic intelligence, multimodal and multitask learning, web search, advertising, entity search (Insights for MS Office), user and business activity prediction, etc.
• Enabling factors:
  • Big datasets for training deep models
  • Powerful GPGPU computing
  • Innovations in deep learning architectures and algorithms
• How to discover distant supervision signals free from human labeling
• How to build deep learning systems grounded on exploiting such “smart” signals (example: DSSM)
• …

Summary & Perspective

• Speech recognition: all low-hanging fruits are taken
  • i.e., more innovation and hard work needed than before
• Image recognition: most low-hanging fruits are taken
• Natural language: it does not seem there is much low-hanging fruit there
  • i.e., even more innovation and hard work needed than before
• Big data analytics (e.g., user behavior, business activities, etc.): a new frontier
• Small data: deep learning may still win (e.g., 2012 Kaggle drug-discovery competition)
• Perceptual data: deep learning methods always win, and win big
• Be careful: data with adversarial nature; data with odd variability

Issues for “Near” Future of Deep Learning

• For perceptual tasks (e.g., speech, image/video, gesture, etc.)
  • With supervised data: what will be the limit for growing accuracy with respect to increasing amounts of labeled data?
  • Beyond this limit, or when labeled data are exhausted or non-economical to collect, will novel and effective unsupervised deep learning emerge, and what will it be (e.g., deep generative models)?
  • Many new innovations are to come, likely in the area of unsupervised learning
• For cognitive tasks (e.g., natural language, reasoning, knowledge, decision making, etc.)
  • Will supervised deep learning (e.g., MT) beat the non-deep-learning state of the art, as it did in speech/image recognition?
  • How to distill/exploit “distant” supervision signals for supervised deep learning?
  • Will dense vector embeddings be sufficient for language? Do we really need to directly encode and recover the syntactic/semantic structure of language?
  • Even more new innovations are to come, likely in the area of new architectures and learning methods pertaining to distantly supervised learning

The Future of Deep Learning
• Continued rapid progress in language processing methods and applications by both industry and academia
• From image to video processing/understanding
• From supervised learning (huge success already) to unsupervised learning (not much success yet, but ideas abound)
• From perception to cognition
• More exploration of attention modeling
• Combine representation learning with complex knowledge extraction & reasoning
• Modeling human memory functions more faithfully
• Learning to act and control (deep reinforcement learning)
• Successes in business applications will propel more rapid advances in deep learning (positive feedback)

• Auli, M., Galley, M., Quirk, C., and Zweig, G., 2013. Joint language and translation modeling with recurrent neural networks. In EMNLP.
• Auli, M., and Gao, J., 2014. Decoder integration and expected BLEU training for recurrent neural network language models. In ACL.
• Baker, J., Deng, L., Glass, J., Khudanpur, S., Lee, C.-H., Morgan, N., and O'Shaughnessy, D., 2009. Research developments and directions in speech recognition and understanding, Part 1. IEEE Signal Processing Magazine, vol. 26, no. 3, pp. 75-80. (See also the updated MINDS report on speech recognition and understanding.)

• Bengio, Y., 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning, vol. 2.
• Bengio, Y., Courville, A., and Vincent, P., 2013. Representation learning: A review and new perspectives. IEEE Trans. PAMI, vol. 38, pp. 1798-1828.
• Bengio, Y., Ducharme, R., and Vincent, P., 2000. A neural probabilistic language model. In NIPS.

• Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P., 2011. Natural language processing (almost) from scratch. JMLR, vol. 12.
• Dahl, G., Yu, D., Deng, L., and Acero, A., 2012. Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Trans. Audio, Speech, & Language Proc., vol. 20, no. 1, pp. 30-42.
• Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T., and Harshman, R., 1990. Indexing by latent semantic analysis. J. American Society for Information Science, 41(6): 391-407.
• Deng, L., 1998. A dynamic, feature-based approach to the interface between phonology and phonetics for speech modeling and recognition. Speech Communication, vol. 24, no. 4, pp. 299-323. (Related: Computational models for speech production; Switching dynamic system models for speech articulation and acoustics.)

• Deng, L., Ramsay, G., and Sun, D., 1997. Production models as a structural basis for automatic speech recognition. Speech Communication (special issue on speech production modeling), vol. 22, no. 2, pp. 93-112.
• Deng, L., and Ma, J., 2000. Spontaneous speech recognition using a statistical coarticulatory model for the vocal tract resonance dynamics. Journal of the Acoustical Society of America.

• Deep Learning: Methods and Applications.
• Machine Learning Paradigms for Speech Recognition: An Overview.

• Deng, L., and Huang, X., 2004. Challenges in adopting speech recognition. Communications of the ACM, vol. 47, no. 1, pp. 11-13.
• Deng, L., Seltzer, M., Yu, D., Acero, A., Mohamed, A., and Hinton, G., 2010. Binary coding of speech spectrograms using a deep auto-encoder. Interspeech.
• Deng, L., Tur, G., He, X., and Hakkani-Tur, D., 2012. Use of kernel deep convex networks and end-to-end learning for spoken language understanding. Proc. IEEE Workshop on Spoken Language Technologies.
• Deng, L., Yu, D., and Acero, A., 2006. Structured speech modeling. IEEE Trans. on Audio, Speech and Language Processing, vol. 14, no. 5, pp. 1492-1504.
• New types of deep neural network learning for speech recognition and related applications: An overview.
• Use of differential cepstra as acoustic features in hidden trajectory modeling for phonetic recognition.
• Deep stacking networks for information retrieval.
• Deep convex network: A scalable architecture for speech pattern classification.

• A multimodal variational approach to learning and inference in switching state space models.

• Learning with recursive perceptual representations.
