
Deep Learning for Speech/Language Processing --- machine learning & processing perspectives

Li Deng, Deep Learning Technology Center (DLTC), Microsoft Research, Redmond, USA
Tutorial given at Interspeech, September 6, 2015

Thanks go to many colleagues at DLTC/MSR, collaborating universities, and at Microsoft's engineering groups.

Outline

• Part I: Basics of Machine Learning (Deep and Shallow) and of Signal Processing

• Part II: Speech

• Part III: Language (In case you did not get link to slides, send email to: [email protected])

Reading Material

Books: Bengio, Yoshua (2009). "Learning Deep Architectures for AI".

L. Deng and D. Yu (2014) "Deep Learning: Methods and Applications" http://research.microsoft.com/pubs/209355/DeepLearning-NowPublishing-Vol7-SIG-039.pdf

D. Yu and L. Deng (2014). "Automatic Speech Recognition: A Deep-Learning Approach" (Publisher: Springer).


Reading Material (cont'd)

Wikipedia: https://en.wikipedia.org/wiki/Deep_learning

Papers: G. E. Hinton and R. Salakhutdinov. "Reducing the Dimensionality of Data with Neural Networks." Science 313: 504–507, 2006.

G. E. Hinton, L. Deng, D. Yu, et al. "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The shared views of four research groups," IEEE Signal Processing Magazine, pp. 82–97, November 2012.

G. Dahl, D. Yu, L. Deng, A. Acero. "Context-Dependent Pre-Trained Deep Neural Networks for Large- Vocabulary Speech Recognition". IEEE Trans. Audio, Speech, and Language Processing, Vol 20(1): 30–42, 2012. (plus other papers in the same special issue)

Y. Bengio, A. Courville, and P. Vincent. "Representation Learning: A Review and New Perspectives," IEEE Trans. PAMI, special issue Learning Deep Architectures, 2013.

J. Schmidhuber. “Deep learning in neural networks: An overview,” arXiv, October 2014.

Y. LeCun, Y. Bengio, and G. Hinton. “Deep Learning”, Nature, Vol. 521, May 2015.

J. Bellegarda and C. Monz. "State of the art in statistical methods for language and speech processing," Computer Speech and Language, 2015.

Part I: Machine Learning (Deep/Shallow) and Signal Processing

Machine Learning Basics

- Survey of audience background
- PART I: Basics; PART II: Advanced Topics (tomorrow)

Machine Learning & Deep Learning

[Figure, revised from a slide by Pascal Vincent (2015): deep learning as a sub-field of machine learning, which itself draws on statistics, data analysis, programming, and cognitive science.]

What Is Deep Learning?

Deep learning (deep machine learning, or deep structured learning, or hierarchical learning, or sometimes DL) is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, with complex structures or otherwise, composed of multiple non-linear transformations.

(Slide from: Yoshua Bengio)

SHALLOW vs. DEEP (modified from Y. LeCun & M.A. Ranzato)

[Figure: a taxonomy of machine-learning methods. Shallow: perceptron, SVM, boosting, decision trees, GMM, sparse coding, Bayesian nonparametrics. Deep: (deep) neural nets, RNN, convolutional nets, autoencoders and denoising autoencoders, RBM, DBN, DBM, deep Bayes nets. A second axis separates neural-network models from probabilistic models, and a third distinguishes supervised from unsupervised methods.]

Signal Processing --> Information Processing

[Table: processing tasks (rows) across media types (columns: audio/music, speech, image/video, text/language):
 Coding/compression: audio coding, speech coding, image coding, video coding, document compression/summarization
 Communication: voice over IP, DAB, 4G/5G networks, DVB, home networking, etc.
 Security: multimedia watermarking, encryption, etc.
 Enhancement/analysis: de-noising, source separation, speech enhancement, image/video enhancement, segmentation, feature extraction, grammar checking, text parsing
 Synthesis/rendering: computer music, speech synthesis (text-to-speech), computer graphics, video synthesis, natural language generation
 User interface: multi-modal human-computer interaction (HCI -- input methods)
 Recognition/understanding: auditory scene analysis (computer audition, e.g. melody detection, singer ID), automatic speech/speaker recognition, spoken language understanding, image recognition, computer vision (e.g. 3-D object recognition), video understanding, document recognition, natural language understanding (semantic IE)
 Retrieval/mining: music retrieval, spoken document retrieval, voice/mobile search, image retrieval, video search, text search (information retrieval), speech translation, machine translation
 Social media apps: photo sharing (e.g. flickr), video sharing (e.g. YouTube), blogs, wikis, del.icio.us, 3D/Second Life]

Machine Learning Basics

[Table: machine-learning paradigms, their training inputs, and what is learned (decision function, loss function)]
• Inductive: given training set {(x_i, y_i)}, i in TR, learn a decision function f and apply it to a test set TS
• Transductive: given {(x_i, y_i)}, i in TR, and {x_i}, i in TS, predict the test labels {y_i}, i in TS directly
• Generative: generative model with a generative loss, e.g. L = ln p(x, y; theta)
• Discriminative: discriminative model with a discriminative loss (form varies)
• Supervised: {(x_i, y_i)}, i in TR
• Unsupervised: {x_i}, i in U
• Semi-supervised: {(x_i, y_i)}, i in TR, together with unlabeled {x_i}, i in U
• Active: {(x_i, y_i)}, i in TR, plus unlabeled {x_i}, i in U whose labels {y_i} can be queried
• Adaptive: a trained f_TR plus adaptation data {(x_i, y_i)}, i in AD, producing f_TS
• Multitask: {(x_i, y_i)}, i in TR_k, for all tasks k, producing f_k (or test labels {y_i}, i in TS_k) for all k

Supervised Machine Learning (classification)

Training phase (usually offline):

[Diagram: a training data set -- measurements (features) with associated 'class' labels (shown as colors) -- is fed to a training algorithm, which outputs a learned model (parameters/weights, and sometimes structure).]

Supervised Machine Learning (classification)

Test phase (run time, online):

[Diagram: an input test data point -- measurements (features) only -- is passed through the learned model (structure + parameters) to produce an output: a predicted class label or label sequence (e.g. a sentence).]

A key ML concept: Generalization

[Plot: training-set error and test-set error vs. model capacity (e.g. size, depth); under-fitting at low capacity, over-fitting at high capacity, with best generalization in between.]

• To avoid over-fitting: regularize, make the model simpler, or add more training data
• i.e., move the training objective away from the (empirical) error rate

Generalization – effects of more data

[Plot: the same curves with more training data; over-fitting sets in only at higher model capacity.]

A variety of ML methods

• Decision trees/forests/jungles, Boosting
• Support Vector Machines (SVMs)
• Model-based (graphical models, often generative models: sparse connections with interpretability)
  – model tailored for each new application, incorporating prior knowledge
  – Bayesian statistics exploited to 'invert the model' & infer variables of interest
• Neural Networks (DNN, RNN; dense connections)

The latter two types of methods can be made DEEP: deep generative models and DNNs.

Recipe for (Supervised) Deep Learning with Big Data

[Diagram: when training error is high (i.e., underfitting), use a deeper/bigger model; when test error is much higher than training error (i.e., overfitting), use more data or stronger regularization.]

Contrast with Signal Processing Approaches

• Strong focus on sophisticated objective functions for optimization (e.g. MCE, MWE, MPE, MMI; string-level, super-string-level, ...)
• Can be regarded as "end2end" learning in ASR
• Almost always non-convex optimization (praised by deep-ML researchers)
• Weaker focus on regularization & overfitting. Why?
  – Our ASR community had been using shallow, low-capacity models (e.g., GMM-HMM) for too long
  – Hence less need for overcoming overfitting
• Now, deep models add a new dimension for increasing model capacity
  – Regularization becomes essential for DNN
  – E.g. DBN pre-training, the "dropout" method, STDP spiking neurons, etc.

Deep Neural Net (DNN) Basics --- why the gradient vanishes & how to rescue it

(Shallow) Neural Networks for ASR (prior to the rise of deep learning)

Temporal & Time-Delay (1-D Convolutional) Neural Nets
• 1988: Atlas, Homma, and Marks. "An Artificial Neural Network for Spatio-Temporal Bipolar Patterns, Application to Phoneme Classification," NIPS.
• 1989: Waibel, Hanazawa, Hinton, Shikano, Lang. "Phoneme recognition using time-delay neural networks," IEEE Transactions on Acoustics, Speech and Signal Processing.
Hybrid Neural Nets-HMM
• 1990: Morgan and Bourlard. "Continuous speech recognition using MLP with HMMs," ICASSP.
Recurrent Neural Nets
• 1991: Bengio. "Artificial Neural Networks and their Application to Speech/Sequence Recognition," Ph.D. thesis.
• 1992: Robinson. "A real-time recurrent error propagation network word recognition system," ICASSP.
Neural-Net Nonlinear Prediction
• 1994: Deng, Hassanein, Elmasry. "Analysis of correlation structure for a neural predictive model with applications to speech recognition," Neural Networks, vol. 7, no. 2.
Bidirectional Recurrent Neural Nets
• 1997: Schuster, Paliwal. "Bidirectional recurrent neural networks," IEEE Trans. Signal Processing.
Neural-Net TANDEM
• 2000: Hermansky, Ellis, Sharma. "Tandem connectionist feature extraction for conventional HMM systems," ICASSP.
• 2005: Morgan, Zhu, Stolcke, Sonmez, Sivadas, Shinozaki, Ostendorf, Jain, Hermansky, Ellis, Doddington, Chen, Cretin, Bourlard, Athineos. "Pushing the envelope - aside [speech recognition]," IEEE Signal Processing Magazine, vol. 22, no. 5. --> DARPA EARS Program 2001-2004: Novel Approach I (Novel Approach II: Deep Generative Model)
Bottle-neck Features Extracted from Neural-Nets
• 2007: Grezl, Karafiat, Kontar & Cernocky. "Probabilistic and bottle-neck features for LVCSR of meetings," ICASSP.

One-hidden-layer neural networks

• Starting from (nonlinear) regression:  y_k(x, w) = Σ_j w_kj f_j(x)
• Replace each f_j with a variable z_j, where  z_j = h( Σ_i w_ji x_i )
  and h(·) is a fixed activation function
• The outputs are obtained from  y_k = s( Σ_j w_kj z_j )
  where s(·) is another fixed function
• In all, we have (simplifying biases):  y_k(x, w) = s( Σ_j w_kj h( Σ_i w_ji x_i ) )

Multi-layer neural networks

• Denote all activation functions by h; each unit computes  z_k = h( Σ_j w_kj z_j )  from the units in the layer below
• The sum is over those values of j with instantiated weights w_kj

Unchallenged learning algorithm: Back propagation (BP)

• For regression, we consider a squared-error cost function:
  E(w) = ½ Σ_n Σ_k ( t_nk – y_k(x_n, w) )²
  which corresponds to a Gaussian density p(t|x)
• We can substitute the network function y_k(x_n, w) into E and use a general-purpose optimizer to estimate w, but it is much more efficient to exploit the derivatives of E -- the essence of BP

Learning neural networks

E(w) = ½ Σ_n Σ_k ( t_nk – y_k(x_n, w) )²

• Recall that for linear regression:
  ∂E(w)/∂w_m = - Σ_n ( t_n - y_n ) x_nm
  i.e., the gradient for a weight couples an error signal ( t_n - y_n ) with the input signal x_nm at the two ends of that weight
• We'll use the chain rule of differentiation to derive a similar-looking expression, where
  – local input signals are forward-propagated from the input
  – local error signals are back-propagated from the output

Local signals needed for learning

• For clarity, consider the error for one training case:  E_n = ½ Σ_k ( t_nk – y_k(x_n, w) )²

• To compute En/wji, note that wji appears in only one term of the overall expression, namely

if wji is in the 1st layer, zi is actually input xi • Using the chain rule of differentiation, we have

Weight Local Local error input where signal signal Forward-propagating local input signals

• Forward propagation gives all the a's and z's

Back-propagating local error signals

[Diagram: error signals δ flow backward from the output units (with targets t_1, t_2) toward the inputs.]

• Back-propagation gives all the δ's

Back-propagating error signals

• To compute En/aj (i.e., dj), aj (also called “logit”) appears in

all those expressions ak = Si wki h(ai) that depend on aj • Using the chain rule, we have

• The sum is over k s.t. unit j is connected to unit k and

for each such term, ak/aj = wkj h’(aj)

• Noting that En/ak = dk, we get the back-propagation rule:

• For output units: - Putting the propagations together

• For each training case n, apply forward propagation and back-propagation to compute  ∂E_n/∂w_ji = δ_j z_i  for each weight w_ji
• Sum these over training cases to compute  ∂E/∂w_ji = Σ_n ∂E_n/∂w_ji
• Use these derivatives for steepest-descent learning (too slow for a large set of training data)
• Minibatch learning: after a small set of input samples, use the above gradient to update the weights (so weights are updated more often)

Why gradients tend to vanish for DNN

• Recall BP for an adjacent layer pair:  δ_j = h'(a_j) Σ_k w_kj δ_k
• For sigmoid units:  h'(a_j) = h_j (1 - h_j)
• If there is only one hidden layer, gradients are not likely to vanish
• The problem becomes serious when nets get deep
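To make the forward/backward recipe above concrete, here is a minimal NumPy sketch (not from the original slides; all names and sizes are illustrative) of one gradient step for a one-hidden-layer sigmoid network with squared-error loss:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, t, W1, W2, lr=0.1):
    """One forward + backward pass for a 1-hidden-layer net with squared-error loss."""
    # Forward pass: local input signals
    a1 = W1 @ x            # pre-activations ("logits") of hidden units
    z1 = sigmoid(a1)       # hidden activations
    y  = W2 @ z1           # linear output units
    # Backward pass: local error signals (deltas)
    delta_out = y - t                                 # output deltas for squared error
    delta_hid = z1 * (1 - z1) * (W2.T @ delta_out)    # note the h*(1-h) sigmoid factor
    # Gradient = (local error signal) x (local input signal)
    gW2 = np.outer(delta_out, z1)
    gW1 = np.outer(delta_hid, x)
    return W1 - lr * gW1, W2 - lr * gW2

# toy usage
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(8, 4))
W2 = rng.normal(scale=0.5, size=(3, 8))
x, t = rng.normal(size=4), rng.normal(size=3)
W1, W2 = backprop_step(x, t, W1, W2)
```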

Why gradients tend to vanish for DNN

Upward (forward) pass of a small example network (two hidden layers plus an output layer):
  h¹ = σ( W¹ x + b¹ )
  h² = σ( W² h¹ + b² )
  ŷ  = s( W³ h² + b³ )

Downward (back-propagation) pass of the error signals δ:
  δ³_i = ŷ_i (1 - ŷ_i) · ∂(error)/∂ŷ_i
  δ²_j = h²_j (1 - h²_j) · Σ_i W³_ij δ³_i
  δ¹_k = h¹_k (1 - h¹_k) · Σ_j W²_jk δ²_j

Why gradients tend to vanish for DNN

• To illustrate the problem, let's use the matrix form of error BP:
  δˡ = diag( hˡ ⊙ (1 - hˡ) ) (Wˡ⁺¹)ᵀ δˡ⁺¹
     = diag( hˡ ⊙ (1 - hˡ) ) (Wˡ⁺¹)ᵀ diag( hˡ⁺¹ ⊙ (1 - hˡ⁺¹) ) (Wˡ⁺²)ᵀ δˡ⁺² = ...
• So even if the forward pass is nonlinear, error backprop is a linear process
• It suffers from all problems associated with linear processes
• Many factors of h(1 - h) for sigmoid units
• In addition, many terms of W in the product
• If any sigmoid unit saturates in either direction, its error gradient becomes zero
• If ||W|| < 1, the product will shrink fast for high depths
• If ||W|| > 1, the product may grow fast for high depths

How to rescue it

• Pre-train the DNN by a generative DBN (the solution by 2010)
  – complicated process
• Discriminative pre-training (much easier to do)
• Still random initialization, but with carefully set variance values, e.g.
  – layer-dependent variance values
  – for lower layers (with more terms in the product), make the variance closer to 0 (e.g. < 0.1)
• Use bigger training data to reduce the chance of vanishing gradients in each epoch
• Use ReLU units: only one side of zero gradient, instead of two as for sigmoid units (see the sketch below)
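The toy experiment below (my illustration, not from the slides; layer widths and weight scales are arbitrary) propagates an error signal backward through a stack of random layers and compares sigmoid gating h(1-h) with ReLU gating, showing why ReLU and careful initialization slow the decay:

```python
import numpy as np

rng = np.random.default_rng(0)

def delta_norm_through_depth(depth, width=256, act="sigmoid", w_scale=0.05):
    """Propagate an error signal backward through `depth` random layers and
    report its norm -- a toy illustration of the vanishing-gradient product."""
    delta = rng.normal(size=width)
    for _ in range(depth):
        W = rng.normal(scale=w_scale, size=(width, width))
        a = rng.normal(size=width)                # stand-in pre-activations
        if act == "sigmoid":
            h = 1.0 / (1.0 + np.exp(-a))
            gate = h * (1 - h)                    # at most 0.25 per unit
        else:                                     # ReLU: gradient is 0 or 1
            gate = (a > 0).astype(float)
        delta = gate * (W.T @ delta)
    return np.linalg.norm(delta)

for depth in (1, 5, 10):
    print(depth,
          delta_norm_through_depth(depth, act="sigmoid"),
          delta_norm_through_depth(depth, act="relu"))
```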

The power of understanding root causes!!! (mid 2010 at MSR Redmond)

An alternative way of training NN

• Backprop takes partial derivatives (of E with respect to W, with U held fixed)
• If the output layer is linear, the total (instead of partial) derivative can be computed, with the optimal U expressed as a function of W
• This is the basic learning method for the Deep Convex/Stacking Net (DSN), designed to be easily parallelizable by batch training
• Using the total derivative is equivalent to a coordinate-descent algorithm with an "infinite" step size, achieving the global optimum along the "coordinate" of updating U while fixing W
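A minimal NumPy sketch of the closed-form output-weight step used in DSN-style modules, assuming a squared-error loss and a linear output layer (the small ridge term and all sizes are my additions for numerical stability and illustration):

```python
import numpy as np

def dsn_output_weights(W, X, T, reg=1e-3):
    """Given fixed hidden weights W, inputs X (d x N) and targets T (c x N),
    return the least-squares-optimal linear output weights U:
    U = (H H^T + reg*I)^-1 H T^T, where H = sigmoid(W^T X)."""
    H = 1.0 / (1.0 + np.exp(-(W.T @ X)))           # hidden activations, (k x N)
    U = np.linalg.solve(H @ H.T + reg * np.eye(H.shape[0]), H @ T.T)
    return U                                        # module predictions: U.T @ H

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 500))                      # 20-dim inputs, 500 samples
T = rng.normal(size=(3, 500))                       # 3-dim targets
W = rng.normal(scale=0.1, size=(20, 50))            # 50 hidden units
U = dsn_output_weights(W, X, T)
```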

Deep Stacking Nets

[Figure: a stack of modules; each module has an input layer, a hidden layer with weights W, and a linear output layer with weights U (W1/U1 at the bottom up to W4/U4); the output of each module, together with the raw input, feeds the next module; Wrand marks the randomly initialized portion of each W.]

• Learn the weight matrices U and W in individual modules separately.
• Given W and the linear output layer, U can be expressed as an explicit nonlinear function of W.
• This nonlinear function is used as the constraint in solving the nonlinear least-squares problem for learning W.
• Initialize W with an RBM (bottom layer).
• For higher layers, part of W is initialized with the optimized W from the immediately lower layer and part of it with random numbers.

(Deng, Yu, Platt, ICASSP-2012; Hutchinson, Deng, Yu, IEEE T-PAMI, 2013)

A neat way of learning DSN weights

E = ½ Σ_n || y_n - t_n ||²,  where  y_n = Uᵀ h_n = Uᵀ σ(Wᵀ x_n) = G_n(U, W)

∂E/∂U ∝ H (Uᵀ H - T)ᵀ; setting it to zero gives  U = (H Hᵀ)⁻¹ H Tᵀ = F(W),  where  h_n = σ(Wᵀ x_n)

E = ½ Σ_n || G_n(U, W) - t_n ||²,  subject to  U = F(W)

Use of the Lagrange-multiplier method:
E = ½ Σ_n || G_n(U, W) - t_n ||² + λ || U - F(W) ||
to learn W (and then U)  -->  full derivative in closed form (i.e. no longer a recursion on partial derivatives as in backpropagation)

• Advantages found:
  --- less noise in the gradient than using the chain rule, which ignores the explicit constraint U = F(W)
  --- batch learning is effective, aiding parallel training

How the Brain May Do BackProp

• Canadian Psychology, Vol 44, pp 10-13, 2003.

• Feedback system in biological neural nets • Key roles of STDP (Spike-Time-Dependent Plasticity) --- temporal derivative • To provide a way to encode error derivatives

How the Brain May Do BackProp

• The backprop algorithm requires that feedforward and feedback weights are the same
• This is clearly not true for biological neural nets
• How to reconcile this discrepancy?
• Recent studies showed that using random feedback weights in BP performs close to rigorous BP
• Implications for regularizing BP learning (like dropout, which may not make sense at first glance)

Recurrent Neural Net (RNN) Basics --- why memory decays fast or explodes (1990's) --- and how to rescue it (2010's, in the new deep learning era)

Basic architecture of an RNN

h_t is the hidden layer that carries the information from time 0 to t, where x_t is the input word and y_t is the output tag:

  y_t = SoftMax( U · h_t ),  where  h_t = σ( W · h_{t-1} + V · x_t )

[Figure: the RNN unrolled over time -- inputs x_1 ... x_T feed hidden states h_1 ... h_T through V, each hidden state feeds the next through the recurrent matrix W, and outputs y_1 ... y_T are produced through U.]

Used for slot filling in SLU [Mesnil, He, Deng, Bengio, 2013; Yao, Zweig, Hwang, Shi, Yu, 2013]
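A minimal NumPy sketch (mine, not from the slides) of the forward recurrence written above; matrix sizes and the sigmoid choice are illustrative:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def rnn_forward(X, U, W, V):
    """Run y_t = SoftMax(U h_t), h_t = sigmoid(W h_{t-1} + V x_t) over a sequence X."""
    h = np.zeros(W.shape[0])
    H, Y = [], []
    for x in X:
        h = 1.0 / (1.0 + np.exp(-(W @ h + V @ x)))   # new hidden state
        H.append(h)
        Y.append(softmax(U @ h))                     # per-step tag distribution
    return H, Y

# toy usage: T=5 steps, 10-dim inputs, 16 hidden units, 4 output tags
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(4, 16))
W = rng.normal(scale=0.1, size=(16, 16))
V = rng.normal(scale=0.1, size=(16, 10))
H, Y = rnn_forward(rng.normal(size=(5, 10)), U, W, V)
```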

Back-propagation through time (BPTT)

[Figure: a 3-step unrolled RNN (inputs x_1..x_3, hidden states h_1..h_3, outputs y_1..y_3, weights V, W, U), with the label given at time t = 3.]

1. Forward propagation
2. Generate output
3. Calculate error
4. Back propagation
5. Back propagation through time
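The five steps above, written out as a minimal BPTT sketch (my illustration; it assumes sigmoid hidden units, softmax outputs and a cross-entropy loss at every step, which is one common choice, not necessarily the exact setup of the cited systems):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def bptt(X, targets, U, W, V):
    """Gradients of the per-step cross-entropy loss for the simple RNN above."""
    T = len(X)
    H = [np.zeros(W.shape[0])]                    # forward pass, keep all states
    for t in range(T):
        H.append(1.0 / (1.0 + np.exp(-(W @ H[t] + V @ X[t]))))
    gU, gW, gV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
    dh_next = np.zeros(W.shape[0])                # error flowing back from the future
    for t in reversed(range(T)):
        h, h_prev = H[t + 1], H[t]
        dy = softmax(U @ h)
        dy[targets[t]] -= 1.0                     # d(cross-entropy)/d(logits)
        gU += np.outer(dy, h)
        dh = U.T @ dy + dh_next                   # error from this step + the future
        da = dh * h * (1 - h)                     # through the sigmoid
        gW += np.outer(da, h_prev)
        gV += np.outer(da, X[t])
        dh_next = W.T @ da                        # "through time" to the previous step
    return gU, gW, gV

# toy usage
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(4, 16))
W = rng.normal(scale=0.1, size=(16, 16))
V = rng.normal(scale=0.1, size=(16, 10))
gU, gW, gV = bptt(rng.normal(size=(3, 10)), [0, 2, 1], U, W, V)
```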

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Some early attempts to examine difficulties in learning RNNs

• Yoshua Bengio’s Ph.D. thesis at McGill University (1991) • Based on attractor properties of nonlinear dynamic systems • His recent, more intuitive explanation --- in extreme of nonlinearity, discrete functions and gradients vanish or explode:

• An alternative analysis: based on perturbation analysis of nonlinear differential equations (next several slides)

52 • NN for nonlinear sequence prediction (like NN language model used today) • Memory (temporal correlation) proved to be stronger than linear prediction • No GPUs to use; very slow to train with BP; did not make NN big and deep, etc.

• Conceptually easy to make it deep using (state-space) signal processing & graphical models… by moving away from NN…

Why gradients vanish/explode for RNN

• The easiest account is to follow the analysis for DNN • except that “depth” of RNN is much larger: the length of input sequence • Especially serious for speech sequence (not as bad for text input) • Tony Robinson group was the only one that made RNN work for TIMIT phone recognition (1994)

How to rescue it

• Echo state nets (“lazy” approach) – avoiding problems by not training input & recurrent weights – H. Jaeger. “Short term memory in echo state networks”, 2001 • But if you figure out smart ways to train them, you get much better results

(NIPS-WS, 2013)

How to rescue it

• By better optimization
  – Hessian-free method
  – Primal-dual method

How to rescue it

• Use of LSTM (long short-term memory) cells
  – Sepp Hochreiter & Jürgen Schmidhuber (1997). "Long short-term memory." Neural Computation 9 (8): 1735–1780.
  – Many easier-to-read materials, especially after 2013
• The best way so far to train RNNs well
  – Increasingly popular in speech/language processing
  – Attracted big attention from the ASR community at ICASSP-2013's DNN special session (Graves et al.)
  – Huge progress since then
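One step of a standard LSTM cell, written out as a small NumPy sketch (the gate naming and the no-peephole variant are assumptions on my part; the dictionary of weights is just a convenient packaging):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_cell(x, h_prev, c_prev, P):
    """One step of a standard LSTM cell (no peephole connections)."""
    i = sigmoid(P["Wi"] @ x + P["Ri"] @ h_prev + P["bi"])   # input gate
    f = sigmoid(P["Wf"] @ x + P["Rf"] @ h_prev + P["bf"])   # forget gate
    o = sigmoid(P["Wo"] @ x + P["Ro"] @ h_prev + P["bo"])   # output gate
    g = np.tanh(P["Wg"] @ x + P["Rg"] @ h_prev + P["bg"])   # candidate cell input
    c = f * c_prev + i * g          # cell state: additive path, so gradients survive
    h = o * np.tanh(c)              # hidden output
    return h, c

# toy usage: 10-dim input, 16-dim cell
rng = np.random.default_rng(0)
P = {}
for g in "ifog":
    P[f"W{g}"] = rng.normal(scale=0.1, size=(16, 10))
    P[f"R{g}"] = rng.normal(scale=0.1, size=(16, 16))
    P[f"b{g}"] = np.zeros(16)
h, c = lstm_cell(rng.normal(size=10), np.zeros(16), np.zeros(16), P)
```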

61 Many ways to show an LSTM cell

(Slide revised from: Koutnik & Schmidhuber, 2015)

Many ways to show an LSTM cell

Huh?....

Don't worry, we'll come back to this shortly

LSTM Cells in an RNN

LSTM cell unfolding over time

(Jozefowicz, Zaremba, Sutskever, ICML 2015)

Gated Recurrent Unit (GRU) (simpler than LSTM; no output gates)
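For comparison with the LSTM sketch above, one GRU step in the same style (a sketch of the common formulation; note that some papers swap the roles of z and 1-z):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_cell(x, h_prev, P):
    """One GRU step: update gate z, reset gate r, no separate cell state or output gate."""
    z = sigmoid(P["Wz"] @ x + P["Rz"] @ h_prev + P["bz"])          # update gate
    r = sigmoid(P["Wr"] @ x + P["Rr"] @ h_prev + P["br"])          # reset gate
    h_tilde = np.tanh(P["Wh"] @ x + P["Rh"] @ (r * h_prev) + P["bh"])
    return (1 - z) * h_prev + z * h_tilde      # interpolate old and candidate state
```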

(Jozefowicz, Zaremba, Sutskever (Google), ICML 2015; Kumar et al. (MetaMind), arXiv, July 2015)

Part II: Speech

72 Deep/Dynamic Structure in Human Speech Production and Perception (part of my tutorial at 2009 NIPS WS)

73 Production & Perception: Closed-Loop Chain

[Figure: the speech chain -- a SPEAKER encodes a message into speech acoustics; a LISTENER, using an internal model of the message, decodes the message; the decoded message feeds back to the speaker, closing the loop.]

Encoder: Two-Stage Production Mechanisms

Phonology (higher level):
• Symbolic encoding of the linguistic message
• Discrete representation by phonological features
• Loosely-coupled multiple feature tiers
• Overcomes the beads-on-a-string phone model
• Theories of distinctive features, feature geometry & articulatory phonology
• Accounts for partial/full sound deletion/modification in casual speech

Phonetics (lower level):
• Converts discrete linguistic features to continuous acoustics
• Mediated by motor control & articulatory dynamics
• Mapping from articulatory variables to the VT (vocal tract) area function to acoustics
• Accounts for co-articulation and reduction (target undershoot), etc.

Encoder: Phonological Modeling

Computational phonology:
• Represent pronunciation variations as a constrained factorial model
• Constraint: from articulatory phonology
• Language-universal representation

[Figure: the phrase "ten themes" / t ε n ө i: m z / represented as overlapping feature tiers, e.g. Tongue Tip and Tongue Body (High/Front, Mid/Front).]

Encoder: Phonetic Modeling

Computational phonetics:
• Segmental factorial HMM for sequential targets in the articulatory or vocal-tract-resonance domain
• Switching trajectory model for target-directed articulatory dynamics
• Switching nonlinear state-space model for dynamics in speech acoustics

[Illustration: speaker's message --> targets --> articulation --> speech acoustics]

Decoder I: Auditory Reception

• Convert speech acoustic waves into an efficient & robust auditory representation
• This processing is largely independent of phonological units
• Involves processing stages in the cochlea (ear), cochlear nucleus, SOC, IC, ..., all the way to A1 cortex
• Principal roles: 1) combat environmental acoustic distortion; 2) detect relevant speech features; 3) provide temporal landmarks to aid decoding
• Key properties: 1) critical-band frequency scale, logarithmic compression; 2) adaptive frequency selectivity, cross-channel correlation; 3) sharp response to transient sounds; 4) modulation in independent frequency bands; 5) binaural noise suppression, etc.

Decoder II: Cognitive Perception

• Cognitive process: recovery of the linguistic message
• Relies on 1) the "internal" model: structural knowledge of the encoder (production system); 2) a robust auditory representation of features; 3) temporal landmarks
• Child speech acquisition is a process that gradually establishes the "internal" model
• Strategy: analysis by synthesis, i.e., probabilistic inference on (deeply) hidden linguistic units using the internal model
• No motor theory: the above strategy requires no articulatory recovery from speech acoustics

Human Speech Perception (decoder)

[This slide restates the Decoder I (auditory reception) bullets above.]

Types of Speech Perception Theories

• Active vs. Passive • Bottom up vs./and Top Down • Autonomous vs. Interactive Active vs. Passive

• Active theories suggest that speech perception and production are closely related
  – Listener knowledge of how sounds are produced facilitates recognition of sounds
• Passive theories emphasize the sensory aspects of speech perception
  – Listeners utilize internal filtering mechanisms
  – Knowledge of vocal tract characteristics plays a minor role, for example when listening in noisy conditions

Bottom-up vs. & Top-down

• Top-down processing works with knowledge a listener has about a language, context, experience, etc.
  – Listeners use stored information about language and the world to make sense of the speech
• Bottom-up processing works in the absence of a knowledge base providing top-down information
  – Listeners receive auditory information, convert it into a neural signal, and process the phonetic feature information

Specific Speech Perception Theories
• Motor Theory
• Acoustic Invariance Theory
• Direct Realism
• TRACE Model (based on neural nets)
• Cohort Theory
• Fuzzy Logic Model of Perception
• Native Language Magnet Theory

Motor Theory
• Postulates that speech is perceived by reference to how it is produced
  – when perceiving speech, listeners access their own knowledge of how phonemes are articulated
  – articulatory gestures (such as rounding or pressing the lips together) are units of perception that directly provide the listener with phonetic information

Liberman, Cooper, Shankweiler, & Studdert- Kennedy, 1967 Acoustic Invariance Theory • Listeners inspect the incoming signal for the so- called acoustic landmarks which are particular events in the spectrum carrying information about gestures which produced them. • Gestures are limited by the capacities of humans’ articulators and listeners are sensitive to their auditory correlates, the lack of invariance simply does not exist in this model. • The acoustic properties of the landmarks constitute the basis for establishing the distinctive features. • Bundles of the distinctive features uniquely specify phonetic segments (phonemes, syllables, words).

Stevens, K.N. (2002). "Toward a model of lexical access based on acoustic landmarks and distinctive features" (PDF). Journal of the Acoustical Society of America 111 (4): 1872–1891. TRACE Model • For example, a listener hears the beginning of bald, and the words bald, ball, bad, bill become active in memory. Then, soon after, only bald and ball remain in competition (bad, bill have been eliminated because the vowel sound doesn't match the input). – Soon after, bald is recognized. • TRACE simulates this process by representing the temporal dimension of speech, allowing words in the lexicon to vary in activation strength, and by having words compete during processing. A Deep/Generative Model of Speech Production/Perception --- Perception as “variational inference”

[Figure: a layered generative model of the speaker -- message --> phonological targets --> articulation --> distortion-free acoustics --> distorted acoustics, with distortion factors and feedback to articulation.]

Deep Generative Models, Variational Inference/Learning, & Applications to Speech

Deep Learning ≈
  Neural Networks: deep in space & time (recurrent, LSTM), deep RNN
+ Generative Models: deep in space & time (dynamic), deep/hidden dynamic models
+ ..., ..., ...

[Table: deep neural nets (DNNs) vs. deep generative models]
• Structure: graphical, info flow bottom-up | graphical, info flow top-down
• Incorporating constraints & domain knowledge: hard | easy
• Semi/unsupervised learning: harder or impossible | easier, at least possible
• Interpretation: harder | easy (generative "story" on data and hidden variables)
• Representation: distributed | localist (mostly); can be distributed also
• Inference/decoding: easy | harder (but note recent progress)
• Scalability/compute: easier (regular computes/GPU) | harder (but note recent progress)
• Incorporating uncertainty: hard | easy
• Empirical goal: classification, feature learning, ... | classification (via Bayes rule), latent-variable inference, ...
• Terminology: neurons, activation/gate functions, weights, ... | random variables, stochastic "neurons", potential functions, parameters, ...
• Learning algorithm: a single, unchallenged algorithm -- BackProp | a major focus of open research; many algorithms, & more to come
• Evaluation: on a black-box score -- end performance | on almost every intermediate quantity
• Implementation: hard (but increasingly easier) | standardized but insights needed
• Experiments: massive, real data | modest, often simulated data
• Parameterization: dense matrices | sparse (often PDFs); can be dense

Example: (Shallow) Generative Model

[Figure: a topic model as a shallow generative model -- a hidden layer of "TOPICS" (war, animals, ..., computers) generates observed words (e.g. "Iraqi", "the", "Matlab").]
(slide revised from: Max Welling)

Another example: Medical Diagnosis

[Figure: a bipartite generative model -- hidden disease nodes generate observed symptom nodes. Inference problem: what is the most probable disease given the symptoms?]

Example: (Deep) Generative Model

[Figure: the deep generative model of speech -- targets --> articulation --> distortion-free acoustics --> distorted acoustics, with distortion factors and feedback to articulation.]

Deep Generative/Graphical Model Inference

• Key issues: – Representation: syntax and semantics (directed/undirected,variables/factors,..) – Inference: computing probabilities and most likely assignments/explanations – Learning: of model parameters based on observed data. Relies on inference! • Inference is NP-hard (incl. approximation hardness) • Exact inference: works for very limited subset of models/structures – E.g., chains or low-treewidth trees • Approximate inference: highly computationally intensive – Deterministic: Variational, loopy belief propagation, expectation propagation – Numerical sampling (Monte Carlo): Gibbs sampling • Variational learning: – EM algorithm – E-step uses variational inference (recent new advances in ML) Variational inference/learning is not trivial: Example

[Figure: the same diseases/symptoms model. Difficulty -- "explaining away": an observation introduces correlations among the nodes in the parent hidden layer.]

Variational EM

Step 1: Maximize the bound with respect to Q
  (E step):  Q^(k+1) = argmax_Q L(Q, θ^(k))
  (Q approximates the true posterior, often by factorizing it)

Step 2: Fix Q, maximize with respect to θ
  (M step):  θ^(k+1) = argmax_θ L(Q^(k+1), θ)

Note: in traditional EM, Q is exact, e.g. posteriors computed by the forward/backward algorithm for HMMs.

98 Statistical HDM Deterministic (hidden dynamic HDM Model)

Bridle, Deng, Picone, Richards, Ma, Kamm, Schuster, Pike, Reagan. Final Report for Workshop on Language Engineering, 99 Johns Hopkins University, 1998. (experiments on Switchboard tasks) ICASSP-2004

Auxiliary function:

102 Surprisingly Good Inference Results for Continuous Hidden States

• By-product: accurately tracking dynamics of resonances (formants) in vocal tract (TIMIT & SWBD). • Best formant tracker by then in speech analysis; used as basis to form a formant database as “ground truth” • We thought we solved the ASR problem, except • “Intractable” for decoding

Deng & Huang. "Challenges in Adopting Speech Recognition," Communications of the ACM, vol. 47, pp. 69-75, 2004.
Deng, Cui, Pruvenok, Huang, Momen, Chen, Alwan. "A Database of Vocal Tract Resonance Trajectories for Research in Speech Processing," ICASSP, 2006.

Deep Generative Models in Speech Recognition (prior to the rise of deep learning)

Segment & Nonstationary-State Models • Digalakis, Rohlicek, Ostendorf. “ML estimation of a stochastic linear system with the EM alg & 1993 application to speech recognition,” IEEE T-SAP, 1993 • Deng, Aksmanovic, Sun, Wu, Speech recognition using HMM with polynomial regression functions 1994 as nonstationary states,” IEEE T-SAP, 1994. Hidden Dynamic Models (HDM) • Deng, Ramsay, Sun. “Production models as a structural basis for automatic speech recognition,” 1997 Speech Communication, vol. 33, pp. 93–111, 1997. 1998 • Bridle et al. “An investigation of segmental hidden dynamic models of speech coarticulation for speech recognition,” Final Report Workshop on Language Engineering, Johns Hopkins U, 1998. • Picone et al. “Initial evaluation of hidden dynamic models on conversational speech,” ICASSP, 1999. 1999 • Deng and Ma. “Spontaneous speech recognition using a statistical co-articulatory model for the 2000 vocal tract resonance dynamics,” JASA, 2000. Structured Hidden Trajectory Models (HTM) • Zhou, et al. “Coarticulation modeling by embedding a target-directed hidden trajectory model into 2003 HMM,” ICASSP, 2003.  DARPA EARS Program 2001-2004: Novel Approach II 2006 • Deng, Yu, Acero. “Structured speech modeling,” IEEE Trans. on Audio, Speech and Language Processing, vol. 14, no. 5, 2006. Switching Nonlinear State-Space Models • Deng. “Switching Dynamic System Models for Speech Articulation and Acoustics,” in Mathematical Foundations of Speech and Language Processing, vol. 138, pp. 115 - 134, Springer, 2003. • Lee et al. “A Multimodal Variational Approach to Learning and Inference in Switching State Space Models,” ICASSP, 2004. 104 Other Deep Generative Models (developed outside speech)

• Sigmoid belief nets & the wake/sleep algorithm (1992)
• Deep belief nets (DBN, 2006) --> start of deep learning
• Totally non-obvious result: stacking many RBMs (undirected) gives not a Deep Boltzmann Machine (DBM, undirected) but a DBN (directed generative model)
• Excellent in generating images & ...
• Similar type of deep generative model to the HDM, but simpler: no temporal dynamics, and with very different parameterization
• Most intriguing property of the DBN: inference is easy (i.e. no need for approximate variational Bayes), due to the "restriction" of connections in the RBM
• Pros/cons analysis --> Hinton coming to MSR, 2009

105 This is a very different kind of deep generative model

(Mohamed, Dahl, Hinton, 2009, 2012) (Deng et al., 2006; Deng & Yu, 2007)

(after adding Backprop to the generative DBN)

Error Analysis

• Deep generative model (HTM): elegant model formulation & knowledge incorporation; strong empirical results (96% TIMIT accuracy with N-best=1001; 75.2% with lattice decoding using monophones); but still very expensive for decoding -- could not ship (very frustrating!)
• DBN/DNN: fast approximate training, but made many new errors on short, undershot vowels -- the 11-frame input window contains too much "noise"

Early Successes of Deep Learning in Speech Recognition

108 Academic-Industrial Collaboration (2009,2010)

• I invited Geoff Hinton to work with me at MSR, Redmond • Well-timed academic-industrial collaboration: – ASR industry searching for new solutions when “principled” deep generative approaches could not deliver – Academia developed deep learning tools (e.g. DBN 2006) looking for applications – Add Backprop to deep generative models (DBN)  DNN (hybrid generative/discriminative) – Advent of GPU computing (Nvidia CUDA library released 2007/08) – Big training data in speech recognition were already available 109 Invitee 1: give me one week to decide …,… Not worth my time to fly to Vancouver for this…

Mohamed, Dahl, Hinton, Deep belief networks for phone recognition, NIPS 2009 Workshop on Deep Learning, 2009 Yu, Deng, Wang, Learning in the Deep-Structured Conditional Random Fields, NIPS 2009 Workshop on Deep Learning, 2009 110 …, …, … Expanding DNN at Industry Scale

• Scale DNN’s success to large speech tasks (2010-2011) – Grew output neurons from context-independent phone states (100-200) to context-dependent ones (1k-30k)  CD-DNN-HMM for Bing Voice Search and then to SWBD tasks – Motivated initially by saving huge MSFT investment in the speech decoder software infrastructure – CD-DNN-HMM also gave much higher accuracy than CI-DNN-HMM – Earlier NNs made use of context only as appended inputs, not coded directly as outputs – Discovered that with large training data Backprop works well without DBN pre-training by understanding why gradients often vanish (patent filed for “discriminative pre-training” 2011)

• Engineering for large speech systems: – Combined expertise in DNN (esp. with GPU implementation) and speech recognition – Collaborations among MSRR, MSRA, academic researchers

• Yu, Deng, Dahl, Roles of Pre-Training and Fine-Tuning in Context-Dependent DBN-HMMs for Real-World Speech Recognition, in NIPS Workshop on Deep Learning, 2010. • Dahl, Yu, Deng, Acero, Large Vocabulary Continuous Speech Recognition With Context-Dependent DBN-HMMS, in Proc. ICASSP, 2011. • Dahl, Yu, Deng, Acero, Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition, in IEEE Transactions on Audio, Speech, and Language Processing (2013 IEEE SPS Best Paper Award) , vol. 20, no. 1, pp. 30-42, January 2012. • Seide, Li, Yu, "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks", Interspeech 2011, pp. 437-440. • Hinton, Deng, Yu, Dahl, Mohamed, Jaitly, Senior, Vanhoucke, Nguyen, Sainath, Kingsbury, Deep Neural Networks for Acoustic Modeling in SpeechRecognition, in IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, November 2012 • Sainath, T., Kingsbury, B., Ramabhadran, B., Novak, P., and Mohamed, A. “Making deep belief networks effective for large vocabulary continuous speech recognition,” Proc. ASRU, 2011. • Sainath, T., Kingsbury, B., Soltau, H., and Ramabhadran, B. “Optimization Techniques to Improve Training Speed of Deep Neural Networks for Large Speech Tasks,” IEEE Transactions on Audio, Speech, and Language Processing, vol.21, no.11, pp.2267-2276, Nov. 2013. • Jaitly, N., Nguyen, P., Senior, A., and Vanhoucke, V. “Application of Pretrained Deep Neural Networks to Large Vocabulary Speech Recognition,” Proc. Interspeech, 2012. 112 DNN-HMM (replacing GMM only; longer MFCC/filter-back windows w. no transformation)

Model tied triphone states directly

Many layers of nonlinear feature transformation + SoftMax
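Not on the original slides, but standard practice for the hybrid DNN-HMM described here: the DNN's senone (tied-triphone-state) posteriors are divided by the state priors to obtain scaled likelihoods for HMM decoding. A minimal sketch under that assumption:

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, state_priors, floor=1e-8):
    """Convert DNN senone posteriors p(s|x_t) into scaled likelihoods
    p(x_t|s) proportional to p(s|x_t) / p(s), used as HMM emission scores.
    log_posteriors: (T, S) array of log p(s|x_t); state_priors: (S,) array p(s)."""
    return log_posteriors - np.log(np.maximum(state_priors, floor))

# toy usage: 100 frames, 3000 tied triphone states (senones)
rng = np.random.default_rng(0)
logits = rng.normal(size=(100, 3000))
log_post = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))   # log-softmax
priors = np.full(3000, 1.0 / 3000)        # in practice estimated from forced alignments
emission_scores = scaled_log_likelihoods(log_post, priors)
```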

113 DNN vs. Pre-DNN Prior-Art

Table: TIMIT phone recognition (3 hours of training)
  Pre-DNN: deep generative model -- 24.8% error
  DNN: 5 layers x 2048 -- 23.4% error   (~10% relative improvement)

Table: Voice Search sentence error rate (24-48 hours of training)
  Pre-DNN: GMM-HMM with MPE -- 36.2% error
  DNN: 5 layers x 2048 -- 30.1% error   (~20% relative improvement)

Table: Switchboard word error rate (309 hours of training)
  Pre-DNN: GMM-HMM with BMMI -- 23.6% error
  DNN: 7 layers x 2048 -- 15.8% error   (~30% relative improvement)

For DNN, the more data, the better!

"Scientists See Promise in Deep-Learning Programs" -- John Markoff, The New York Times, November 23, 2012
Rick Rashid in Tianjin, China, October 25, 2012

Deep learning technology enabled speech-to-speech translation A voice recognition program translated a speech given by Richard F. Rashid, Microsoft’s top scientist, into Mandarin Chinese. CD-DNN-HMM

Dahl, Yu, Deng, and Acero. "Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition," IEEE Trans. ASLP, Jan. 2012 (also ICASSP 2011).
Seide et al., Interspeech, 2011.

After no improvement for 10+ years by the research community… …MSR reduced error from ~23% to <13% (and under 7% for Rick Rashid’s S2S demo in 2012)! Impact of deep learning in speech technology

Cortana

(Slide from Bengio, 2015)

In the Academic World

“This joint paper from the major speech recognition laboratories was the first major industrial application of deep learning.”

122 DNN: (Fully-Connected) Deep Neural Networks “DNN for acoustic modeling in speech recognition,” in IEEE SPM, Nov. 2012

First train a stack of N models, each of which has one hidden layer; each model in the stack treats the hidden variables of the previous model as data. Then compose them into a single Deep Belief Network (DBN). Then add outputs and train the DNN with backprop.

ASR Issues and Solutions:
• How to reduce the runtime without accuracy loss? -- SVD
• How to do speaker adaptation with a low footprint? -- SVD-based adaptation
• How to be robust to noise? -- variable-component CNN
• How to reduce the accuracy gap between large and small DNNs? -- teacher-student learning using output posteriors
• How to deal with a large variety of data? -- DNN factorization, mixed-band training
• How to enable languages with limited training data? -- multi-lingual DNN

Microsoft Research 124 More Recent Development of Deep Learning for Speech

125 Chapter 7

126 Innovation: Better Optimization

• Sequence discriminative training for DNN: - Mohamed, Yu, Deng: “Investigation of full-sequence training of deep belief networks for speech recognition,” Interspeech, 2010. - Kingsbury, Sainath, Soltau. “Scalable minimum Bayes risk training of DNN acoustic models using distributed hessian-free optimization,” Interspeech, 2012. - Su, Li, Yu, Seide. “Error back propagation for sequence training of CD deep networks for conversational speech transcription,” ICASSP, 2013. - Vesely, Ghoshal, Burget, Povey. “Sequence-discriminative training of deep neural networks, Interspeech, 2013. • Distributed asynchronous SGD - Dean, Corrado,…Senior, Ng. “Large Scale Distributed Deep Networks,” NIPS, 2012. Input data X - Sak, Vinyals, Heigold, Senior, McDermott, Monga, Mao. “Sequence Discriminative Distributed Training of Long Short-Term Memory Recurrent Neural Networks,” Interspeech,2014.

127 Innovation: Towards Raw Inputs

• Bye-Bye MFCCs (no more cosine transform, Mel-scaling?) - Deng, Seltzer, Yu, Acero, Mohamed, Hinton. “Binary coding of speech using a deep auto-encoder,” Interspeech, 2010. - Mohamed, Hinton, Penn. “Understanding how deep belief networks perform acoustic modeling,” ICASSP, 2012. - Li, Yu, Huang, Gong, “Improving wideband speech recognition using mixed-bandwidth training data in CD-DNN-HMM” SLT, 2012 - Deng, J. Li, Huang, Yao, Yu, Seide, Seltzer, Zweig, He, Williams, Gong, Acero. “Recent advances in deep learning for speech research at Microsoft,” ICASSP, 2013. - Sainath, Kingsbury, Mohamed, Ramabhadran. “Learning filter banks within a deep neural network framework,” ASRU, 2013. • Bye-Bye Fourier transforms? - Jaitly and Hinton. “Learning a better representation of speech sound waves using RBMs,” ICASSP, 2011. - Tuske, Golik, Schluter, Ney. “Acoustic modeling with deep neural networks using raw time signal for LVCSR,” Interspeech, 2014. - Golik et al, “Convolutional NNs for acoustic modeling of raw time signals in LVCSR,” Interspeech, 2015. Input data X - Sainath et al. “Learning the Speech Front-End with Raw Waveform CLDNNs,” Interspeech, 2015 • DNN as hierarchical nonlinear feature extractors: - Seide, Li, Chen, Yu. “Feature engineering in context-dependent deep neural networks for conversational speech transcription, ASRU, 2011. - Yu, Seltzer, Li, Huang, Seide. “Feature learning in deep neural networks - Studies on speech recognition tasks,” ICLR, 2013. - Yan, Huo, Xu. “A scalable approach to using DNN-derived in GMM-HMM based acoustic modeling in LVCSR,” Interspeech, 2013. - Deng, Chen. “Sequence classification using high-level features extracted from deep neural networks,” ICASSP, 2014. 128 Innovation: Transfer/Multitask Learning & Adaptation

Adaptation to speakers & environments (i-vectors)

• Too many references to list & organize 129 Innovation: Better regularization & nonlinearity


130 Innovation: Better architectures

• Recurrent Nets (bi-directional RNN/LSTM) and Conv Nets (CNN) are superior to fully-connected DNNs

• Sak, Senior, Beaufays. “LSTM Recurrent Neural Network architectures for large scale acoustic modeling,” Interspeech,2014.

• Soltau, Saon, Sainath. ”Joint Training of Convolutional and Non-Convolutional Neural Networks,” ICASSP, 2014.

131 Innovation: Ensemble Deep Learning

• Ensembles of RNN/LSTM, DNN, & Conv Nets (CNN) give huge gains: • T. Sainath, O. Vinyals, A. Senior, H. Sak. “Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks,” ICASSP 2015. • L. Deng and John Platt, Ensemble Deep Learning for Speech Recognition, Interspeech, 2014. • G. Saon, H. Kuo, S. Rennie, M. Picheny. “The IBM 2015 English conversational telephone speech recognition system,” arXiv, May 2015. (8% WER on SWB-309h)

132 Innovation: Better learning objectives/methods

• Use of CTC as a new objective in RNN/LSTM with end2end learning drastically simplifies ASR systems
• Predict graphemes or words directly; no pronunciation dictionaries; no context dependency; no decision trees
• Use of "blank" symbols may be equivalent to a special HMM state-tying scheme --> CTC/RNN has NOT replaced the (left-to-right) HMM
• A relative ~8% gain from CTC has been shown by a very limited number of labs
(see the sketch of the CTC label-collapsing rule below)
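A tiny sketch (mine) of the CTC many-to-one collapsing rule B(.) that the "blank" symbol makes possible -- repeated symbols are merged, then blanks are removed -- which is what lets a frame-level path stand for a much shorter label string:

```python
def ctc_collapse(path, blank="_"):
    """CTC mapping B(.): merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# both frame-level paths map to the word "cat"
assert ctc_collapse("cc_aa_t") == "cat"
assert ctc_collapse("_c_aatt_") == "cat"
# the blank also allows genuinely repeated output letters, e.g. "bee"
assert ctc_collapse("b_ee_e") == "bee"
```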

• A. Graves and N. Jaitly. "Towards End-to-End Speech Recognition with Recurrent Neural Networks," ICML, 2014.
• A. Hannun, A. Ng, et al. "DeepSpeech: Scaling up End-to-End Speech Recognition," arXiv, Nov. 2014.
• A. Maas et al. "Lexicon-Free Conversational ASR with NN," NAACL, 2015.
• H. Sak et al. "Learning Acoustic Frame Labeling for ASR with RNN," ICASSP, 2015.
• H. Sak, A. Senior, K. Rao, F. Beaufays. "Fast and Accurate Recurrent Neural Network Acoustic Models for Speech Recognition," Interspeech, 2015.

Innovation: A new paradigm for speech recognition

• Seq2seq learning with attention mechanism (borrowed from NLP-MT)

• W. Chan, N. Jaitly, Q. Le, O. Vinyals. “Listen, attend, and spell,” arXiv, 2015. • J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, Y. Bengio. “Attention-Based Models for Speech Recognition,” arXiv, 2015.
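The heart of these attention-based models is a per-step weighting of encoder states. A minimal dot-product version (my simplification; the cited papers use learned additive/content-based scoring functions):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    """Score each encoder state against the current decoder state, normalize
    with a softmax, and return the attention-weighted context vector."""
    scores = encoder_states @ decoder_state        # (T,) alignment scores
    weights = softmax(scores)                      # attention distribution over time
    return weights @ encoder_states, weights       # context vector, weights

# toy usage: 50 encoder frames of dimension 32
rng = np.random.default_rng(0)
enc = rng.normal(size=(50, 32))
ctx, w = attention_context(rng.normal(size=32), enc)
```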

A Perspective on Recent Innovations in ASR

• All the above deep learning innovations are based on supervised, discriminative learning of DNNs and recurrent variants
• Capitalizing on big, labeled data
• Incorporating the monotonic-sequential structure of speech (non-monotonic for language, later)
• Hard to incorporate many other aspects of speech knowledge (e.g. a speech distortion model)
• Hard to do semi- and unsupervised learning
--> Deep generative modeling may overcome such difficulties

Li Deng and Roberto Togneri. Chapter 6: "Deep Dynamic Models for Learning Hidden Representations of Speech Features," pp. 153-196, Springer, December 2014.
Li Deng and Navdeep Jaitly. Chapter 2: "Deep discriminative and generative models for pattern recognition," ~30 pages, in Handbook of Pattern Recognition and Computer Vision: 5th Edition, World Scientific Publishing, Jan 2016.

[Table (repeated from Part I): deep neural nets vs. deep generative models -- structure (bottom-up vs. top-down information flow), incorporating constraints & domain knowledge (hard vs. easy), unsupervised learning (harder or impossible vs. easier), interpretation (harder vs. easier), representation (distributed vs. mostly localist), inference/decoding (easy vs. harder), scalability/compute (easier vs. harder), incorporating uncertainty (hard vs. easy), empirical goal (classification, feature learning vs. classification via Bayes rule, latent-variable inference), terminology (neurons, activations, weights vs. random variables, potential functions, parameters), learning algorithm (the single unchallenged BackProp vs. a major focus of open research), evaluation (black-box end performance vs. almost every intermediate quantity), implementation (hard but increasingly easier vs. standardized but insights needed), experiments (massive real data vs. modest, often simulated data), parameterization (dense matrices vs. sparse, often PDFs).]

Example 1: Interpretable deep learning using deep topic models (NIPS-2015)

137 Recall: (Shallow) Generative Model

[Figure (repeated): a topic model -- hidden "TOPICS" (war, animals, ..., computers) generate observed words ("Iraqi", "the", "Matlab").]
(slide revised from: Max Welling)

Interpretable deep learning: Deep generative model

• Constructing interpretable DNNs based on generative topic models • End-to-end learning by mirror-descent backpropagation  to maximize posterior probability p(y|x) • y: output (win/loss), and x: input feature vector

Mirror Descent Algorithm (MDA)

J. Chen, J. He, Y. Shen, L. Xiao, X. He, J. Gao, X. Song, and L. Deng, “End-to-end Learning of Latent Dirichlet Allocation by Mirror-Descent Back Propagation”, submitted to NIPS2015. Example 2: Unsupervised learning using deep generative model (ACL, 2013)

• Distorted character-string images --> text
• Easier than unsupervised speech --> text
• 47% error reduction over Google's open-source OCR system

Motivated me to think about unsupervised ASR and NLP

140 Power: Character-level LM & generative modeling for unsupervised learning

• “Image” data are naturally “generated” by the model quite accurately (like “computer graphics”) • I had the same idea for unsupervised generative Speech-to-Text in 90’s • Not successful because 1) Deep generative models were too simple for generating speech waves 2) Inference/learning methods for deep generative models not mature then 3) Computers were too slow

141 Deep Generative Model for Image-Text Deep Generative Model for Speech-Text (Berg-Kirkpatrick et al., 2013, 2015) (Deng, 1998; Deng et al, 1997, 2000, 2003, 2006)

L. Deng. "A dynamic, feature-based approach to the interface between phonology & phonetics for speech modeling and recognition," Speech Communication, vol. 24, no. 4, pp. 299-323, 1998.

[Series of side-by-side figures comparing the two generative pipelines (image-to-text: Berg-Kirkpatrick et al., 2013, 2015; speech-to-text: Deng, 1998; Deng et al., 2000, 2003, 2006):
 - both start from a word-level language model, plus (for speech) a feature-level pronunciation model;
 - the speech model adds articulatory dynamics and an articulatory-to-acoustics mapping;
 - inference and learning are easy for image-text (likely no "explaining away" problem) but hard for speech-text (pervasive "explaining away" due to speech dynamics).]

145 • In contrast, articulatory-to-acoustics mapping in ASR is much more complex • During 1997-2000, shallow NNs were used for this as “universal approximator” • Not successful • Now we have better DNN tool • Even RNN/LSTM/CTC tool for dynamic modeling

• Essence: exploit the strong prior of the LM, trained with billions of words of text
• No need to pair the text with acoustics; hence unsupervised learning
• Think of it as using a new objective function based on distribution matching between the LM prior and the distribution of words predicted from the acoustic models
• (The image-generation side is very simple & easy to model accurately.)

Further thoughts on unsupervised ASR

• Deep generative modeling experiments not successful in 90’s • Computers were too slow • Models were too simple from text to speech waves • Still true  need speech scientists to work harder with technologists • And when generative models are not good enough, discriminative models and learning (e.g., RNN) can help a lot • Further, can iterate between the two, like wake-sleep (algorithm) • Inference/learning methods for deep generative models not mature at that time • Only partially true today •  due to recent big advances in machine learning • Based on new ways of thinking about generative graphical modeling motivated by the availability of deep learning tools (e.g. DNN) • A brief review next 147 Advances in Inference Algms for Deep Generative Models Kingma & Welling 2014, Salakhutdinov et al, 2015 ICML-2014 Talk Monday June 23, 15:20

In Track F (Deep Learning II)

“Efficient Gradient Based Inference through Transformations between Bayes Nets and Neural Nets” Other solutions to solve the "large variance problem“ in variational inference: -Variational Bayesian Inference with Stochastic Search [D.M. Blei, M.I. Jordan and J.W. Paisley, 2012] -Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression [T. Salimans and A. Knowles, 2013]. -Black Box Variational Inference. [R. Ranganath, S. Gerrish and D.M. Blei. 2013] -Stochastic Variational Inference [M.D. Hoffman, D. Blei, C. Wang and J. Paisley, 2013] -Estimating or propagating gradients through stochastic neurons. [Y. Bengio, 2013]. -Neural Variational Inference and Learning in Belief Networks. [A. Mnih and K. Gregor, 2014, ICML] -Stochastic backprop & approximation inference in deep generative models [D. Rezende, S. Mohamed, D. Wierstra, 2014] -Semi-supervised learning with deep generative models [K. Kingma, D. Rezende, S. Mohamed, M. Welling, 2014, NIPS] -auto-encoding variational Bayes [K. Kingma, M. Welling, 2014, ICML] -Learning stochastic recurrent networks [Bayer and Osendorfer, 2015 ICLR] -DRAW: A recurrent neural network for image generation. [K. Gregor, Danihelka, Rezende, Wierstra, 2015] - Plus a number of NIPS-2015 papers, to appear. 148 Slide provided by Max Welling (ICML-2014 tutorial) w. my updates on references of 2015 and late 2014 Further thoughts on unsupervised ASR

• Deep generative modeling experiments not successful in 90’s • Computers were too slow • Models were too simple from text to speech waves • Still true  need speech scientists to work harder with technologists • And when generative models are not good enough, discriminative models and learning (e.g., RNN) can help a lot; but to do it? A hint next • Further, can iterate between the two, like wake-sleep (algorithm) • Inference/learning methods for deep generative models not mature at that time • Only partially true today •  due to recent big advances in machine learning • Based on new ways of thinking about generative graphical modeling motivated by the availability of deep learning tools (e.g. DNN) • A brief review next 149 RNN vs. Generative HDM

RNN parameterization:
• W_hh, W_hy, W_xh: all unstructured regular matrices

Generative HDM parameterization:
• W_hh = M(γ_l): sparse system matrix
• W_hy = Ω_l: Gaussian-mixture parameters; MLP
• Λ = t_l

150 Generative HDM

151 RNN vs. Generative HDM

[Figure: RNN ~ DNN, generative HDM ~ DBN]

e.g. generative pre-training of the RNN (analogous to generative DBN pre-training for the DNN); NIPS-2015 paper to appear on simpler dynamic models for a non-ASR application

Better ways of integrating deep generative/discriminative models are possible
- Hint: Example 1, where generative models are used to define the DNN architecture

End of Part II: Speech
- Deep supervised learning shattered ASR via DNN/LSTM
- Deep unsupervised learning may have more impact in the future
- No more low-hanging fruit

Deep Learning also Shattered the Entire Field of Image Recognition and Computer Vision (since 2012)

153 Krizhevsky, Sutskever, Hinton, “ImageNet Classification with Deep Convolutional Neural Networks.” NIPS, Dec. 2012

[Chart: ImageNet top-5 error rates -- the deep CNN from the Univ. of Toronto team (2012), GoogLeNet at 6.67%, down to 4.94%.]

Part III: Language

Moving from perception: phonetic/word recognition, image/gesture recognition, etc to Cognition: memory, attention, reasoning, Q/A, & decision making, etc.

Embedding enables exploring models for these human cognitive functions

155 Embedding Linguistic entities in low-dimensional continuous space and Examples of NLP Applications

f(cat) = the 1-hot word vector (a 1 at the index of "cat" in the vocabulary, 0 elsewhere), which is mapped to a word embedding vector in the semantic space.

Deerwester, Dumais, Furnas, Landauer, Harshman, "Indexing by latent semantic analysis," JASIS 1990
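A tiny NumPy illustration (mine; the vocabulary and dimensions are toy values) of why the 1-hot vector and the embedding matrix together define the word vector -- multiplying by a 1-hot vector just selects one row of the matrix:

```python
import numpy as np

vocab = ["cat", "dog", "chases", "the", "mat"]
V, d = len(vocab), 4                       # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))                # embedding matrix (one row per word)

def one_hot(word):
    v = np.zeros(V)
    v[vocab.index(word)] = 1.0
    return v

# the embedding of "cat" is the row of E selected by its 1-hot vector
assert np.allclose(E.T @ one_hot("cat"), E[vocab.index("cat")])
```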

Microsoft Research 157 NN Word Embedding

Bengio, Ducharme, Vincent, Jauvin. "A neural probabilistic language model," JMLR, 2003.

Scoring:
  Score(w1, w2, w3, w4, w5) = U^T σ( W [f1, f2, f3, f4, f5] + b )
Training (margin ranking loss):
  J = max(0, 1 + S^- − S^+); update the model until S^+ > 1 + S^-
where
  S^+ = Score(w1, w2, w3, w4, w5)
  S^- = Score(w1, w2, w^-, w4, w5)
and <w1, w2, w3, w4, w5> is a valid 5-gram, while <w1, w2, w^-, w4, w5> is a "negative sample" constructed by replacing the word w3 with a random word w^-
  (e.g., a negative example: "cat chills X a mat")

Collobert, Weston, Bottou, Karlen, Kavukcuoglu, Kuksa. "Natural Language Processing (Almost) from Scratch," JMLR 2011.
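The scoring function and margin ranking loss above, written as a short NumPy sketch (my illustration; sizes, the sigmoid choice, and the single corrupted sample are assumptions for brevity):

```python
import numpy as np

def hinge_ranking_loss(score_pos, score_neg):
    """J = max(0, 1 + S_neg - S_pos): push the valid n-gram's score at least 1
    above the score of its corrupted version."""
    return max(0.0, 1.0 + score_neg - score_pos)

def score(window_vectors, U, W, b):
    """Score(w1..w5) = U^T sigma(W [f1;...;f5] + b) over concatenated embeddings."""
    f = np.concatenate(window_vectors)
    h = 1.0 / (1.0 + np.exp(-(W @ f + b)))
    return float(U @ h)

# toy usage: 5-word window, 10-dim embeddings, 20 hidden units
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(20, 50)), rng.normal(size=20), np.zeros(20)
valid = [rng.normal(size=10) for _ in range(5)]
corrupt = list(valid); corrupt[2] = rng.normal(size=10)   # replace the middle word
J = hinge_ranking_loss(score(valid, U, W, b), score(corrupt, U, W, b))
```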

Microsoft Research 159 Word Embedding

[Figure: example word vectors (e.g. "cat", "chases", "is") in the continuous embedding space, where consistent vector offsets capture linguistic regularities.]

Mikolov, Yih, Zweig. "Linguistic Regularities in Continuous Space Word Representations," NAACL 2013.
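The linguistic regularities in the cited paper are tested with vector arithmetic of the "king - man + woman ≈ queen" kind; a minimal sketch (mine) of that analogy test, assuming an embedding matrix E indexed by a vocabulary list:

```python
import numpy as np

def analogy(a, b, c, E, vocab):
    """Return the word whose embedding is closest (by cosine) to v(b) - v(a) + v(c),
    excluding the three query words themselves."""
    idx = {w: i for i, w in enumerate(vocab)}
    target = E[idx[b]] - E[idx[a]] + E[idx[c]]
    target /= np.linalg.norm(target)
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    scores = En @ target
    for w in (a, b, c):
        scores[idx[w]] = -np.inf
    return vocab[int(np.argmax(scores))]

# usage (with a trained E): analogy("man", "king", "woman", E, vocab) -> "queen"
```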

Microsoft Research 160 Continuous Bag-of-Words The CBOW architecture (a) on the left, and the Skip-gram architecture (b) on the right. [Mikolov et al., 2013 ICLR].

Microsoft Research 161 # words

[Figure: a word-embedding lookup -- the embedding matrix has one column per vocabulary word (w1, w2, ..., wN) and "dim" rows, so v(w) is the column for word w.]

• However, for large-scale NL tasks a decomposable, robust representation is preferable
• The vocabulary of real-world big-data tasks can be huge (scalability): >100M unique words in a modern commercial search engine log, and it keeps growing
• New words, misspellings, and word fragments frequently occur (generalizability)

Microsoft Research 162 embedding vector 푊 → 푈 × 푉 embedding vector dim=500 dim=500 SWU embedding word embedding 푈 matrix: 500 × 50퐾 matrix: 500 × 100푀 푊 dim = 50K SWU encoding 푉 matrix dim = 100M dim = 100M 1-hot word vector 1-hot word vector Could go up to extremely large Huang, He, Gao, Deng, Acero, Heck, “Learning deep structured semantic models for web search using clickthrough data,” CIKM, 2013

Microsoft Research 163 Microsoft Research 164 Example: cat → #cat# → #-c-a, c-a-t, a-t-# (w/ word boundary mark #)

v(cat) = Σ_{k=1..K} α_{cat,k} · u_k

where u_k is the embedding vector of letter-trigram LTG(k), K is the total number of letter-trigrams, and α_{cat,k} counts how often LTG(k) (e.g. #-c-a, c-a-t, a-t-#) occurs in the word "cat"; the letter-trigram embedding matrix has dimension (embedding dim) x K.

Two words having the same letter-trigram representation: collision rate ≈ 0.004%
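A minimal sketch (mine) of the letter-trigram "word hashing" step used at the input of these models, following the #cat# example above:

```python
def letter_trigrams(word):
    """Add word-boundary marks and break the word into letter trigrams,
    e.g. cat -> #cat# -> #-c-a, c-a-t, a-t-#."""
    s = "#" + word + "#"
    return [s[i:i + 3] for i in range(len(s) - 2)]

assert letter_trigrams("cat") == ["#ca", "cat", "at#"]
# a word is then represented by the sparse count vector of its trigrams,
# so new words and misspellings still get a usable representation
```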

Microsoft Research 165 Supervised Embeddings for Semantic Modeling with Applications

--- embedding linguistic symbols by backprop --- mining distant supervision signals

Microsoft Research 166 Deep Structured Semantic Model (DSSM) • Build word/phrase, or sentence-level semantic vector representation • Trained by a similarity-driven objective • projecting semantically similar phrases to vectors close to each other • projecting semantically different phrases to vectors far apart

Microsoft Research 167 DSSM for learning semantic embedding

Initialization: Neural networks are initialized with random weights

[Figure: three parallel columns of the same architecture map the inputs s: “racing car”, t+: “formula one”, and t-: “racing to me” to semantic vectors $v_s$, $v_{t^+}$, $v_{t^-}$ (d = 300). Each input is a bag-of-words vector (dim = 100M), mapped by the fixed letter-trigram encoding matrix W1 to a letter-trigram vector (dim = 50K), then through the letter-trigram embedding matrix W2 and hidden layers W3, W4 (d = 500) up to the semantic vector.]

DSSM for learning semantic embedding

Training: compute the cosine similarities between the semantic vectors, $\cos(v_s, v_{t^+})$ and $\cos(v_s, v_{t^-})$, and compute the gradients $\partial/\partial W$ of

$\dfrac{\exp(\cos(v_s, v_{t^+}))}{\sum_{t' \in \{t^+, t^-\}} \exp(\cos(v_s, v_{t'}))}$

[Figure: the same DSSM architecture as above (semantic vectors $v_s$, $v_{t^+}$, $v_{t^-}$ of d = 300; hidden layers W4, W3 of d = 500; letter-trigram embedding matrix W2 over dim = 50K; fixed letter-trigram encoding matrix W1 over the bag-of-words vector of dim = 100M), with inputs s: “racing car”, t+: “formula one”, t-: “racing to me”.]
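A numpy sketch of this training objective (my own illustration; the smoothing factor $\gamma$ appears on a later slide, and real training backpropagates the loss into the network weights W1..W4):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def dssm_prob(v_s, v_t_pos, v_t_negs, gamma=1.0):
    # Softmax over cosine similarities between the source vector and the
    # positive target plus the negative targets
    sims = np.array([cosine(v_s, v_t_pos)] + [cosine(v_s, v) for v in v_t_negs])
    e = np.exp(gamma * sims)
    return e[0] / e.sum()                      # P(t+ | s)

def dssm_loss(v_s, v_t_pos, v_t_negs, gamma=1.0):
    # Maximize P(t+ | s), i.e., minimize its negative log
    return -np.log(dssm_prob(v_s, v_t_pos, v_t_negs, gamma))
```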

DSSM for learning semantic embedding

Runtime:

[Figure: at runtime, the source and target sides use separate learned weights (Ws,1..Ws,4 and Wt,1..Wt,4) in the same architecture (bag-of-words vector dim = 100M, fixed letter-trigram encoding matrix dim = 50K, letter-trigram embedding matrix, hidden layers d = 500, semantic vectors d = 300). Inputs s: “racing car”, t1: “formula one”, t2: “racing to me” are mapped to semantic vectors $v_s$, $v_{t_1}$, $v_{t_2}$, with $v_{t_1}$ close to (“similar”) and $v_{t_2}$ far from (“apart”) $v_s$.]

context <-> word; query <-> clicked-doc; pattern <-> relationship

$P(d^+ \mid q) = \dfrac{\exp(\gamma \cos(q, d^+))}{\sum_{d \in \mathbf{D}} \exp(\gamma \cos(q, d))}$

Many applications of DSSM (many low-hanging fruits): learning semantic similarity between X and Y

Tasks | Source X | Target Y
Word semantic embedding | context | word
Web search | search query | web documents
Query intent detection | search query | user intent
Question answering | pattern / mention (in NL) | relation / entity (in KB)
Machine translation | sentence in language a | translated sentences in language b
Query auto-suggestion | search query | suggested query
Query auto-completion | partial search query | completed query
Apps recommendation | user profile | recommended apps
Distillation of survey feedbacks | feedbacks in text | relevant feedbacks
Automatic image captioning | image | text caption
Image retrieval | text query | images
Natural user interface | command (text / speech / gesture) | actions
Ads selection | search query | ad keywords
Ads click prediction | search query | ad documents
Email analysis: people prediction | email content | recipients, senders
Email search | search query | email content
Email decluttering | email contents | email contents in similar threads
Knowledge-base construction | entity from source | entity fitting desired relationship
Contextual entity search | key phrase / context | entity / its corresponding page
Automatic highlighting | documents in reading | key phrases to be highlighted
Text summarization | long text | summarized short text

Automatic image captioning (MSR system)

[Figure: a computer vision system (deep neural net features + detector models) detects words in the image, e.g., street, signs, light, under, on, pole, stop, red, sign, building, bus, traffic, city.]

Language Model

Caption Generation System: candidate captions, e.g.,
• a red stop sign sitting under a traffic light on a city street
• a stop sign at an intersection on a street
• a stop sign with two street signs on a pole on a sidewalk
• a stop sign at an intersection on a city street
• …
• a stop sign
• a red traffic light

DSSM Model

Semantic Ranking System

Fang, Gupta, Iandola, Srivastava, Deng, Dollar, Gao, He, Mitchell, Platt, Zitnick, Zweig, “From captions to visual concepts and back,” CVPR 2015 (arXiv 2014)

Microsoft System (MSR): Use of DSSM for Global Semantic Matching
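A hedged sketch of the global semantic matching step (my own illustration, not the MSR system; it assumes the image and each candidate caption have already been mapped to vectors by the image-side and text-side networks of a DSSM-style model):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def rerank_captions(image_vec, captions, caption_vecs):
    # Rank candidate captions by cosine similarity to the image embedding,
    # best global semantic match first
    order = sorted(range(len(captions)),
                   key=lambda i: cosine(image_vec, caption_vecs[i]),
                   reverse=True)
    return [captions[i] for i in order]
```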

[Figure: two example images, A and B, each shown with its machine-generated caption and a human caption.]

Competition results (CVPR-2015, June, Boston)

Won 1st Prize at MS COCO Captioning Challenge 2015!

The top teams and the state of the art. The quality of the captions is measured by human judges (a Turing-Test-style evaluation): the metric is the % of captions that pass the Turing Test.

Team | % of captions that pass the Turing Test | Official Rank
MSR | 32.2% | 1st (tie)
Google | 31.7% | 1st (tie)
MSR Captivator | 30.1% | 3rd (tie)
Montreal/Toronto | 27.2% | 3rd (tie)
Berkeley LRCN | 26.8% | 5th
Human | 67.5% | --

Other groups: Baidu/UCLA, Stanford, Tsinghua, etc.
Note: even a human cannot guarantee to pass the Turing Test 100% of the time.

(Repeated slide: “Many applications of DSSM (many low-hanging fruits): learning semantic similarity between X and Y”; see the table earlier.)

http://blogs.bing.com/search/2014/12/10/bing-brings-the-worlds-knowledge-straight-to-you-with-insights-for-office/

Scenario: Contextual search in Microsoft Office/Word

When “Lincoln” is selected, pages about the car company, the movie, or the town in Nebraska will not appear.

Towards Modeling Cognitive Functions: Memory and Attention

- seq-to-seq learning via LSTM with attention mechanism
- memory nets and neural Turing machines
- dynamic memory nets
- from seq2seq to seq2struct and to struct2struct

Microsoft Research 184 Deep “Thought”-Vector Approach to MT

Deep “Thought”-Vector Approach to MT
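A hedged sketch of the thought-vector idea (my own toy code, not the system behind the slide): an encoder RNN compresses the source sentence into a single vector, and a decoder RNN conditioned on that vector generates the target sentence.

```python
import numpy as np

def rnn_step(W, U, b, h, x):
    return np.tanh(W @ h + U @ x + b)

def encode(source_vectors, params):
    # The final hidden state is the "thought vector" summarizing the source sentence
    W, U, b = params
    h = np.zeros(W.shape[0])
    for x in source_vectors:
        h = rnn_step(W, U, b, h, x)
    return h

def greedy_decode(thought, params, out_proj, target_embeddings, start_id, max_len=10):
    # Generate target words one at a time, feeding back the previous word's embedding
    W, U, b = params
    h, token, outputs = thought, start_id, []
    for _ in range(max_len):
        h = rnn_step(W, U, b, h, target_embeddings[token])
        token = int(np.argmax(out_proj @ h))   # greedy choice of the next target word
        outputs.append(token)
    return outputs
```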

Microsoft Research 186 “attention” mechanism used for “softly” selecting relevant input portion from memory

Popular theories of human memory/attention

The attention and memory models discussed so far are far from human memory/attention mechanisms (https://en.wikipedia.org/wiki/Atkinson%E2%80%93Shiffrin_memory_model):

Hopfield nets store (associative) memories as attractors of the dynamic network
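A tiny illustration of that idea (my own sketch): with a Hebbian outer-product rule, stored patterns become attractors, and iterating the network from a corrupted probe falls back to the nearest stored memory.

```python
import numpy as np

def hopfield_train(patterns):
    # Hebbian outer-product rule; patterns are +/-1 vectors
    n = patterns.shape[1]
    W = sum(np.outer(p, p) for p in patterns) / n
    np.fill_diagonal(W, 0.0)
    return W

def hopfield_recall(W, probe, steps=20):
    # Repeated updates drive the state toward a stored attractor
    s = probe.copy()
    for _ in range(steps):
        s = np.sign(W @ s)
        s[s == 0] = 1
    return s

patterns = np.array([[1, -1, 1, -1, 1, -1, 1, -1],
                     [1, 1, 1, 1, -1, -1, -1, -1]])
W = hopfield_train(patterns)
noisy = patterns[0].copy(); noisy[0] *= -1          # corrupt one bit
print(hopfield_recall(W, noisy))                    # recovers the stored pattern
```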

LSTM mainly models short-term memory

LSTM does not model long-term memory well

• LSTM makes short-term memory last via a simple “unit-loop” mechanism, very different from long-term memory in human cognition
• Review of a very recent modeling study on episodic and semantic memories, extending the basic LSTM formulation
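For reference, a minimal single-step LSTM sketch (my own code, with simplified notation): the cell state c is the self-looping “unit” that carries memory forward, gated by forget, input, and output gates.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    # One LSTM step: the cell state c is the "unit loop" that carries
    # memory forward, controlled by the forget/input/output gates
    Wf, Wi, Wo, Wc, bf, bi, bo, bc = params
    z = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ z + bf)                    # forget gate
    i = sigmoid(Wi @ z + bi)                    # input gate
    o = sigmoid(Wo @ z + bo)                    # output gate
    c = f * c_prev + i * np.tanh(Wc @ z + bc)   # self-loop on the cell state
    h = o * np.tanh(c)
    return h, c
```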

Slides from: Coursera, 2014

Going beyond LSTM --- towards more realistic long-term memory (episodic & semantic)

Towards Modeling Cognitive Functions: Memory and Attention

- seq-to-seq learning via LSTM with attention mechanism
- memory nets and neural Turing machines
- dynamic memory nets
- from seq2seq to seq2struct & to struct2struct

Review: embedding in the form of “flat” vectors

• A linguistic or physical entity, or a simple “relation”, is mapped via distributed representations (by a NN) to a low-dimensional continuous-space vector, or embedding

Connectionist Symbol Processing: Special Issue, vol. 46 (1990); PDP book, 1986 (4 articles)

Extension: from “flat” vectors to structures (tree/graph)

• Structured embedding vectors via tensor-product representation

symbolic semantic parse tree (complex relation)

Then, reasoning in the symbolic space (traditional AI) can be beautifully carried out in the continuous space, in human-cognitive and neural-net (i.e., connectionist) terms

Smolensky & Legendre, The Harmonic Mind: From Neural Computation to Optimality-Theoretic Grammar (Volume I: Cognitive Architecture; Volume II: Linguistic Implications), MIT Press, 2006

Rogers & McClelland, Semantic Cognition, MIT Press, 2006

Example: “Few leaders are admired by George Bush” → admire(George Bush, few leaders)

ƒ(s) = cons(ex1(ex0(ex1(s))), cons(ex1(ex1(ex1(s))), ex0(s)))

$W_f = W_{\mathrm{cons}_0}[W_{\mathrm{ex}_1} W_{\mathrm{ex}_0} W_{\mathrm{ex}_1}] + W_{\mathrm{cons}_1}[W_{\mathrm{cons}_0}(W_{\mathrm{ex}_1} W_{\mathrm{ex}_1} W_{\mathrm{ex}_1}) + W_{\mathrm{cons}_1}(W_{\mathrm{ex}_0})]$

[Figure: the isomorphism between the symbolic parse of the passive sentence (with Aux, V, Agent, Patient, “by”) as input and its meaning (logical form, LF) as output, realized in continuous space by the tensor-product mapping above.]
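A hedged sketch of the underlying role-filler binding (my own simplification of tensor-product representations, with hypothetical role and filler vectors): fillers are bound to roles with outer products, superimposed into one matrix, and recovered by unbinding with the role vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Orthonormal role vectors make exact unbinding possible
roles = np.linalg.qr(rng.standard_normal((d, d)))[0]
role_agent, role_patient = roles[:, 0], roles[:, 1]

filler_bush = rng.standard_normal(d)       # filler vector for "George Bush"
filler_leaders = rng.standard_normal(d)    # filler vector for "few leaders"

# Bind each filler to its role with an outer product, and superimpose
T = np.outer(filler_bush, role_agent) + np.outer(filler_leaders, role_patient)

# Unbind: multiplying by a role vector recovers its filler
recovered_agent = T @ role_agent
print(np.allclose(recovered_agent, filler_bush))   # True, since the roles are orthonormal
```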

Slide from Paul Smolensky, 2015

Summary & Perspective

• Speech recognition is the first success example of deep learning at industry scale
• Deep learning is very effective in speech recognition, speech translation (Skype Translator), image recognition (OneDrive image tagging), image captioning, language understanding (Cortana), semantic intelligence, multimodal and multitask learning, web search, advertising, entity search (Insights for MS Office), user and business activity prediction, etc.
• Enabling factors:
  • Big datasets for training deep models
  • Powerful GPGPU computing
  • Innovations in deep learning architectures and algorithms
• How to discover distant supervision signals free from human labeling
• How to build deep learning systems grounded on exploiting such “smart” signals (example: DSSM)
• …

Summary & Perspective

• Speech recognition: all low-hanging fruits are taken
  • i.e., more innovation and hard work needed than before
• Image recognition: most low-hanging fruits are taken
• Natural language: it does not seem there is much low-hanging fruit there
  • i.e., even more innovation and hard work needed than before
• Big data analytics (e.g., user behavior, business activities, etc.): a new frontier
• Small data: deep learning may still win (e.g., 2012 Kaggle drug-discovery competition)
• Perceptual data: deep learning methods always win, and win big
• Be careful: data with adversarial nature; data with odd variability

Issues for “Near” Future of Deep Learning

• For perceptual tasks (e.g., speech, image/video, gesture, etc.)
  • With supervised data: what will be the limit for growing accuracy with respect to increasing amounts of labeled data?
  • Beyond this limit, or when labeled data are exhausted or non-economical to collect, will novel and effective unsupervised deep learning emerge, and what will it be (e.g., deep generative models)?
  • Many new innovations are to come, likely in the area of unsupervised learning
• For cognitive tasks (e.g., natural language, reasoning, knowledge, decision making, etc.)
  • Will supervised deep learning (e.g., MT) beat the non-deep-learning state of the art, as it did in speech/image recognition?
  • How to distill/exploit “distant” supervision signals for supervised deep learning?
  • Will dense vector embeddings be sufficient for language? Do we really need to directly encode and recover the syntactic/semantic structure of language?
  • Even more new innovations are to come, likely in the area of new architectures and learning methods pertaining to distantly supervised learning

The Future of Deep Learning
• Continued rapid progress in language processing methods and applications by both industry and academia
• From image to video processing/understanding
• From supervised learning (huge success already) to unsupervised learning (not much success yet, but ideas abound)
• From perception to cognition
• More exploration of attention modeling
• Combine representation learning with complex knowledge extraction & reasoning
• Modeling human memory functions more faithfully
• Learning to act and control (deep reinforcement learning)
• Successes in business applications will propel more rapid advances in deep learning (positive feedback)

• Auli, M., Galley, M., Quirk, C., and Zweig, G., 2013. Joint language and translation modeling with recurrent neural networks. In EMNLP.
• Auli, M., and Gao, J., 2014. Decoder integration and expected BLEU training for recurrent neural network language models. In ACL.
• Baker, J., Deng, L., Glass, J., Khudanpur, S., Lee, C.-H., Morgan, N., and O'Shaughnessy, D., 2009. Research developments and directions in speech recognition and understanding, Part 1. IEEE Signal Processing Magazine, vol. 26, no. 3, pp. 75-80. (See also the updated MINDS report on speech recognition and understanding.)

• Bengio, Y., 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning, vol. 2.
• Bengio, Y., Courville, A., and Vincent, P., 2013. Representation learning: A review and new perspectives. IEEE Trans. PAMI, vol. 38, pp. 1798-1828.
• Bengio, Y., Ducharme, R., and Vincent, P., 2000. A neural probabilistic language model. In NIPS.

• Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P., 2011. Natural language processing (almost) from scratch. JMLR, vol. 12.
• Dahl, G., Yu, D., Deng, L., and Acero, A., 2012. Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Trans. Audio, Speech, & Language Proc., vol. 20, no. 1, pp. 30-42.
• Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T., and Harshman, R., 1990. Indexing by latent semantic analysis. J. American Society for Information Science, 41(6): 391-407.
• Deng, L., 1998. A dynamic, feature-based approach to the interface between phonology and phonetics for speech modeling and recognition. Speech Communication, vol. 24, no. 4, pp. 299-323. (Related: Computational models for speech production; Switching dynamic system models for speech articulation and acoustics.)

• Deng, L., Ramsay, G., and Sun, D., 1997. Production models as a structural basis for automatic speech recognition. Speech Communication (special issue on speech production modeling), vol. 22, no. 2, pp. 93-112.
• Deng, L., and Ma, J., 2000. Spontaneous speech recognition using a statistical coarticulatory model for the vocal tract resonance dynamics. Journal of the Acoustical Society of America.

• Deep Learning: Methods and Applications.
• Machine Learning Paradigms for Speech Recognition: An Overview.

• Deng, L., and Huang, X., 2004. Challenges in adopting speech recognition. Communications of the ACM, vol. 47, no. 1, pp. 11-13.
• Deng, L., Seltzer, M., Yu, D., Acero, A., Mohamed, A., and Hinton, G., 2010. Binary coding of speech spectrograms using a deep auto-encoder. Interspeech.
• Deng, L., Tur, G., He, X., and Hakkani-Tur, D., 2012. Use of kernel deep convex networks and end-to-end learning for spoken language understanding. Proc. IEEE Workshop on Spoken Language Technologies.
• Deng, L., Yu, D., and Acero, A., 2006. Structured speech modeling. IEEE Trans. on Audio, Speech and Language Processing, vol. 14, no. 5, pp. 1492-1504.
• New types of deep neural network learning for speech recognition and related applications: An overview.
• Use of differential cepstra as acoustic features in hidden trajectory modeling for phonetic recognition.
• Deep stacking networks for information retrieval.
• Deep convex network: A scalable architecture for speech pattern classification.

• A multimodal variational approach to learning and inference in switching state space models.

• Learning with recursive perceptual representations.
