Deep Learning for Web Search and Natural Language Processing

Deep Learning for Web Search and Natural Language Processing Jianfeng Gao Deep Learning Technology Center (DLTC) Microsoft Research, Redmond, USA WSDM 2015, Shanghai, China *Thank Li Deng and Xiaodong He, with whom we participated in the previous ICASSP2014 and CIKM2014 versions of this tutorial Mission of Machine (Deep) Learning Data (collected/labeled) Model (architecture) Training (algorithm) 2 Outline • The basics • Background of deep learning • A query classification problem • A single neuron model • A deep neural network (DNN) model • Potentials and problems of DNN • The breakthrough after 2006 • Deep Semantic Similarity Models (DSSM) for text processing • Recurrent Neural Networks 3 4 Scientists See Promise in Deep-Learning Programs John Markoff November 23, 2012 Rick Rashid in Tianjin, China, October, 25, 2012 Geoff Hinton The universal translator on A voice recognition program translated a speech given by Richard F. “Star Trek” comes true… Rashid, Microsoft’s top scientist, into Chinese. 5 6 Impact of deep learning in speech technology Cortana 7 8 A query classification problem • Given a search query 푞, e.g., “denver sushi downtown” • Identify its domain 푐 e.g., • Restaurant • Hotel • Nightlife • Flight • etc. • So that a search engine can tailor the interface and result to provide a richer personalized user experience 9 A single neuron model • For each domain 푐, build a binary classifier 푇 • Input: represent a query 푞 as a vector of features 푥 = [푥1, … 푥푛] • Output: 푦 = 푃 푐 푞 • 푞 is labeled 푐 is 푃 푐 푞 > 0.5 • Input feature vector, e.g., a bag of words vector • Regards words as atomic symbols: denver, sushi, downtown • Each word is represented as a one-hot vector: 0, … , 0,1,0, … , 0 푇 • Bag of words vector = sum of one-hot vectors • We may use other features, such as n-grams, phrases, (hidden) topics 10 A single neuron model 푥 Output: 푃(푐|푞) Input features 1 푦 = 휎 푧 = 1+exp(−푧) 푛 푧 = 푖=0 푤푖푥푖 • 푤: weight vector to be learned • 푧: weighted sum of input features • 휎: the logistic function • Turn a score to a probability • A sigmoid non-linearlity (activation function), essential in multi-layer/deep neural network models 11 Model training: how to assign 푤 • Training data: a set of 푥 푚 , 푦 푚 pairs 푚={1,2,…,푀} • Input 푥 푚 ∈ 푅푛 • Output 푦 푚 = {0,1} • Goal: learn function 푓: 푥 → 푦 to predict correctly on new input 푥 • Step 1: choose a function family, e.g., • neural networks, logistic regression, support vector machine, in our case 푛 푇 • 푓 푥 = 휎 푖=0 푤푖푥푖 = 휎(푤 푥) • Step 2: optimize parameters 푤 on training data, e.g., • minimize a loss function (mean square error loss) 푀 푚 • min 푚=1 퐿 푤 1 2 • where 퐿(푚) = 푓 푥 푚 − 푦 푚 2 푤 12 Training the single neuron model, 푤 • Stochastic gradient descent (SGD) algorithm • Initialize 푤 randomly 휕퐿 • Update for each training sample until convergence: 푤푛푒푤 = 푤표푙푑 − 휂 휕푤 1 • Mean square error loss: 퐿 = 휎 푤푇푥 − 푦 2 2 휕퐿 • Gradient: = 훿휎′ 푧 푥 휕푤 • 푧 = 푤푇푥 • Error: 훿 = 휎 푧 − 푦 • Derivative of sigmoid 휎′(푧) = 휎 푧 1 − 휎 푧 13 SGD vs. gradient descent • Gradient descent is a batch training algorithm • update 푤 per batch of training samples • goes in steepest descent direction • SGD is noisy descent (but faster per iteration) • Loss function contour plot (Duh 2014) 1 • 푀 휎 푤푇푥 − 푦 2 + 푤 푚=1 2 14 Multi-layer (deep) neural networks Output layer 푦표 = 휎(푤푇푦2) Vector 푤 This is exactly the single neuron model with hidden features. st 2 1 2 hidden layer 푦 = 휎(퐖2푦 ) Projection matrix 퐖2 Feature generation: project raw input st 1 1 hidden layer 푦 = 휎(퐖1푥) features (bag of words) to hidden features (topics). Projection matrix 퐖1 Input features 푥 15 Standard Machine Deep Learning Learning Process Adapted from [Duh 2014] 16 Revisit the activation function: 휎 • Assuming a L-layer neural network • 푦 = 퐖퐿휎 … 휎 퐖2휎 퐖1푥 , where 푦 is the output vector • If 휎 is a linear function, then L-layer neural network is compiled down into a single linear transform • 휎: map scores to probabilities • Useful in prediction as it transforms the neuron weighted sum into the interval [0..1] • Unnecessary for model training except in the Boltzman machine or graphical models 17 Training a two-layer neural net • Training data: a set of 푥 푚 , 푦 푚 pairs 푚={1,2,…,푀} • Input 푥 푚 ∈ 푅푛 • Output 푦 푚 = {0,1} • Goal: learn function 푓: 푥 → 푦 to predict correctly on new input 푥 • 푓 푥 = 휎 푗 푤푗 ∙ 휎( 푖 푤푖푗푥푖) • Optimize parameters 푤 on training data via 푀 푚 • minimize a loss function: min 푚=1 퐿 푤 1 2 • where 퐿(푚) = 푓 푥 푚 − 푦 푚 2 푤 18 Training neural nets: back-propagation • Stochastic gradient descent (SGD) algorithm 휕퐿 • 푤푛푒푤 = 푤표푙푑 − 휂 휕푤 휕퐿 • : sample-wise loss w.r.t. parameters 휕푤 • Need to apply the derivative chain rule correctly • 푧 = 푓 푦 • 푦 = 푔 푥 휕푧 휕푧 휕푦 • = 휕푥 휕푦 휕푥 • A detailed discussion in [Socher & Manning 2013] 19 Simple chain rule 20 [Socher & Manning 2013] Multiple paths chain rule [Socher & Manning 2013] 21 Chain rule in flow graph [Socher & Manning 2013] 22 Training neural nets: back-propagation Assume two outputs (푦1, 푦2) per input 푥, and 1 Loss per sample: 퐿 = 휎 푧 − 푦 2 푘 2 푘 푘 Forward pass: 푦푘 = 휎(푧푘), 푧푘 = 푗 푤푗푘ℎ푗 ℎ푗 = 휎(푧푗), 푧푗 = 푖 푤푖푗푥푖 Derivatives of the weights 휕퐿 휕퐿 휕푧푘 휕( 푗 푤푗푘ℎ푗) = = 훿푘 = 훿푘ℎ푗 휕푤푗푘 휕푧푘 휕푤푗푘 휕푤푗푘 휕퐿 휕퐿 휕푧푗 휕( 푖 푤푖푗푥푖) = = 훿푗 = 훿푗푥푖 휕푤푖푗 휕푧푗 휕푤푖푗 휕푤푖푗 휕퐿 ′ 훿푘 = = 휎 푧푘 − 푦푘 휎 푧푘 휕푧푘 휕퐿 휕푧푘 휕 훿푗 = 푘 = 푘 훿푘 푗 푤푗푘휎 푧푗 = 푘 훿푘푤푗푘 휎′(푧푗) 휕푧푘 휕푧푗 휕푧푗 23 Adapted from [Duh 2014] Training neural nets: back-propagation • All updates involve some scaled error from output × input feature: 휕퐿 ′ • = 훿푘ℎ푗 where 훿푘 = 휎 푧푘 − 푦푘 휎 푧푘 휕푤푗푘 휕퐿 • = 훿푗푥푖 where 훿푗 = 푘 훿푘푤푗푘 휎′(푧푗) 휕푤푖푗 • First compute 훿푘 from output layer, then 훿푗 for other layers and iterate. 훿 푘=푦1 훿푘=푦2 푤31 푤32 훿푗=ℎ3 = 훿푘=푦1푤31 + 훿푘=푦2푤32 휎′(푧푗=ℎ3) 24 Adapted from (Duh 2014) Potential of DNN This is exactly the single neuron model with hidden features. Project raw input features to hidden features (high level representation). 25 [Bengio, 2009] DNN is difficult to training • Vanishing gradient problem in backpropagation 휕퐿 휕퐿 휕푧푗 • = = 훿푗푥푖 휕푤푖푗 휕푧푗 휕푤푖푗 • 훿푗 = 푘 훿푘푤푗푘 휎′(푧푗) • 훿푗 may vanish after repeated multiplication • Scalability problem 26 Many, but NOT ALL, limitations of early DNNs have been overcome better learning algorithms and different nonlinearities. SGD can often allow the training to jump out of local optima due to the noisy gradients estimated from a small batch of samples. SGD effective for parallelizing over many machines with an asynchronous mode • Vanishing gradient problem? Try deep belief net (DBN) to initialize it – Layer-wise pre-training (Hinton et al. 2006) • Scalability problem Computational power due to the use of GPU and large-scale CPU clusters 27 DNN: (Fully-Connected) Deep Neural Networks Hinton, Deng, Yu, etc., DNN for AM in speech recognition, IEEE SPM, 2012 Geoff Hinton Li Deng First train a stack of N models each of Then compose them into Then add outputs which has one hidden layer. Each model in a single Deep Belief and train the DNN Dong Yu the stack treats the hidden variables of the Network. with backprop. previous model as data. 28 CD-DNN-HMM Dahl, Yu, Deng, and Acero, “Context-Dependent Pre- trained Deep Neural Networks for Large Vocabulary Speech Recognition,” IEEE Trans. ASLP, Jan. 2012 After no improvement for 10+ years by the research community… …MSR reduced error from ~23% to <13% (and under 7% for Rick Rashid’s S2S demo)! 29 Deep Convolutional Neural Network for Images CNN: local connections with weight sharing; pooling for translation invariance Image Output [LeCun et al., 1998] 30 A basic module of the CNN Pooling Convolution Image 31 Deep Convolutional NN for Images 2012-2014 Fully connected Fully connected earlier Fully connected Convolution/pooling SVM Convolution/pooling Pooling Convolution/pooling Histogram Oriented Grads Convolution/pooling Image Convolution/pooling Raw Image pixels 32 ImageNet 1K Competition Krizhevsky, Sutskever, Hinton, “ImageNet Classification with Deep Convolutional Neural Networks.” NIPS, Dec. 2012 Deep CNN Univ. Toronto team 33 Gartner hyper cycle graph for NN history 34 [Deng and Yu 2014] Useful Sites on Deep Learning • http://www.cs.toronto.edu/~hinton/ • http://ufldl.stanford.edu/wiki/index.php/UFLDL_Recommended_Readings • http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial (Andrew Ng’s group) • http://deeplearning.net/reading-list/ (Bengio’s group) • http://deeplearning.net/tutorial/ • http://deeplearning.net/deep-learning-research-groups-and-labs/ • Google+ Deep Learning community 35 Outline • The basics • Deep Semantic Similarity Models (DSSM) for text processing • What is DSSM • DSSM for web search ranking • DSSM for recommendation • DSSM for automatic image captioning • Recurrent Neural Networks 36 Computing Semantic Similarity • Fundamental to almost all Web search and NLP tasks, e.g., • Machine translation: similarity between sentences in different languages • Web search: similarity between queries and documents • Problems of the existing approaches • Lexical matching cannot handle language discrepancy. • Unsupervised word embedding or topic models are not optimal for the task of interest. 37 Deep Semantic Similarity Model (DSSM) [Huang et al. 2013; Gao et al. 2014a; Gao et al. 2014b; Shen et al. 2014] • Compute semantic similarity between two text strings X and Y • Map X and Y to feature vectors in a latent semantic space via deep neural net • Compute the cosine similarity between the feature vectors • Also called “Deep Structured Similarity Model” in Huang et al.

Deep Learning for Web Search and Natural Language Processing

Automatic Correction of Real-Word Errors in Spanish Clinical Texts

Unified Language Model Pre-Training for Natural

Analyzing Political Polarization in News Media with Natural Language Processing

Automatic Spelling Correction Based on N-Gram Model

N-Gram-Based Machine Translation

Spell Checker in CET Designer

Automatic Extraction of Personal Events from Dialogue

Training Automatic Transliteration Models on Dbpedia Data

Heafield, K. and A. Lavie. "Voting on N-Grams for Machine Translation

Proceedings of the 55Th Annual Meeting of the Association For

Phrase-Based Attentions

EECS 498-004: Introduction to Natural Language Processing