
Deep Sentence Embedding Using Long Short-Term Memory Networks: Analysis and Application to Information Retrieval

Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, Rabab Ward

arXiv:1502.06922v3 [cs.CL] 16 Jan 2016

Abstract—This paper develops a model that addresses sentence embedding, a hot topic in current natural language processing research, using recurrent neural networks (RNN) with Long Short-Term Memory (LSTM) cells. The proposed LSTM-RNN model sequentially takes each word in a sentence, extracts its information, and embeds it into a semantic vector. Due to its ability to capture long term memory, the LSTM-RNN accumulates increasingly richer information as it goes through the sentence, and when it reaches the last word, the hidden layer of the network provides a semantic representation of the whole sentence. In this paper, the LSTM-RNN is trained in a weakly supervised manner on user click-through data logged by a commercial web search engine. Visualization and analysis are performed to understand how the embedding process works. The model is found to automatically attenuate the unimportant words and detect the salient keywords in the sentence. Furthermore, these detected keywords are found to automatically activate different cells of the LSTM-RNN, where words belonging to a similar topic activate the same cell. As a semantic representation of the sentence, the embedding vector can be used in many different applications. These automatic keyword detection and topic allocation abilities enabled by the LSTM-RNN allow the network to perform document retrieval, a difficult language processing task, where the similarity between the query and documents can be measured by the distance between their corresponding sentence embedding vectors computed by the LSTM-RNN. On a web search task, the LSTM-RNN embedding is shown to significantly outperform several existing state of the art methods. We emphasize that the proposed model generates sentence embedding vectors that are specially useful for web document retrieval tasks. A comparison with a well known general sentence embedding method, the Paragraph Vector, is performed. The results show that the proposed method in this paper significantly outperforms it for the web document retrieval task.

Index Terms—Deep Learning, Long Short-Term Memory, Sentence Embedding.

H. Palangi and R. Ward are with the Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC, V6T 1Z4 Canada (e-mail: {hamidp,rababw}@ece.ubc.ca). L. Deng, Y. Shen, J. Gao, X. He, J. Chen and X. Song are with Microsoft Research, Redmond, WA 98052 USA (e-mail: {deng,jfgao,xiaohe,yeshen,jianshuc,xinson}@microsoft.com).

I. INTRODUCTION

LEARNING a good representation (or features) of input data is an important task in machine learning. In text and language processing, one such problem is learning of an embedding vector for a sentence; that is, to train a model that can automatically transform a sentence to a vector that encodes the semantic meaning of the sentence. While word embedding is learned using a loss function defined on word pairs, sentence embedding is learned using a loss function defined on sentence pairs. In sentence embedding, usually the relationship among words in the sentence, i.e., the context information, is taken into consideration. Therefore, sentence embedding is more suitable for tasks that require computing semantic similarities between text strings. By mapping texts into a unified semantic representation, the embedding vector can be further used for different language processing applications, such as machine translation [1], sentiment analysis [2], and information retrieval [3]. In machine translation, the recurrent neural networks (RNN) with Long Short-Term Memory (LSTM) cells, or the LSTM-RNN, is used to encode an English sentence into a vector, which contains the semantic meaning of the input sentence, and then another LSTM-RNN is used to generate a French (or another target language) sentence from the vector. The model is trained to best predict the output sentence. In [2], a paragraph vector is learned in an unsupervised manner as a distributed representation of sentences and documents, which are then used for sentiment analysis. Sentence embedding can also be applied to information retrieval, where the contextual information is properly represented by the vectors in the same space for fuzzy text matching [3].

In this paper, we propose to use an RNN to sequentially accept each word in a sentence and recurrently map it into a latent space together with the historical information. As the RNN reaches the last word in the sentence, the hidden activations form a natural embedding vector for the contextual information of the sentence. We further incorporate the LSTM cells into the RNN model (i.e., the LSTM-RNN) to address the difficulty of learning long term memory in RNN. The learning of such a model is performed in a weakly supervised manner on the click-through data logged by a commercial web search engine. Although manually labelled data are insufficient in machine learning, logged data with limited feedback signals are massively available due to the widely used commercial web search engines. Limited feedback information such as click-through data provides a weak supervision signal that indicates the semantic similarity between the text on the query side and the clicked text on the document side. To exploit such a signal, the objective of our training is to maximize the similarity between the two vectors mapped by the LSTM-RNN from the query and the clicked document, respectively. Consequently, the learned embedding vectors of the query and clicked document are specifically useful for the web document retrieval task.

An important contribution of this paper is to analyse the embedding process of the LSTM-RNN by visualizing the internal activation behaviours in response to different text inputs. We show that the embedding process of the learned LSTM-RNN effectively detects the keywords, while attenuating less important words in the sentence, automatically by switching on and off the gates within the LSTM-RNN cells. We further show that different cells in the learned model indeed correspond to different topics, and the keywords associated with a similar topic activate the same cell unit in the model. As the LSTM-RNN reads to the end of the sentence, the topic activation accumulates and the hidden vector at the last word encodes the rich contextual information of the entire sentence. For this reason, a natural application of sentence embedding is web search ranking, in which the embedding vector from the query can be used to match the embedding vectors of the candidate documents according to the maximum cosine similarity rule. Evaluated on a real web document ranking task, our proposed method significantly outperforms many of the existing state of the art methods in NDCG scores. Please note that when we refer to a document in this paper we mean the title (headline) of the document.

II. RELATED WORK

Inspired by the word embedding method [4], [5], the authors in [2] proposed an unsupervised learning method to learn a paragraph vector as a distributed representation of sentences and documents, which are then used for sentiment analysis with superior performance. However, the model is not designed to capture the fine-grained sentence structure. In [6], an unsupervised sentence embedding method is proposed with great performance on a large corpus of contiguous text, e.g., the BookCorpus [7]. The main idea is to encode the sentence s(t) and then decode the previous and next sentences, i.e., s(t−1) and s(t+1), using two separate decoders. The encoder and decoders are RNNs with Gated Recurrent Units (GRU) [8]. However, this sentence embedding method is not designed for the document retrieval task, which has a supervision among queries and clicked and unclicked documents. In [9], a Semi-Supervised Recursive Autoencoder (RAE) is proposed and used for sentiment prediction. Similar to our proposed method, it does not need any language specific sentiment parsers. A greedy approximation method is proposed to construct a tree structure for the input sentence. It assigns a vector per word, which can become practically problematic for large vocabularies. It also works both on unlabeled data and supervised sentiment data.

Similar to the recurrent models in this paper, the DSSM [3] and CLSM [10] models, developed for information retrieval, can also be interpreted as sentence embedding methods. However, DSSM treats the input sentence as a bag-of-words and does not model word dependencies explicitly. CLSM treats a sentence as a bag of n-grams, where n is defined by a window, and can capture local word dependencies. Then a max-pooling layer is used to form a global feature vector. The methods in [11] are also convolutional networks for Natural Language Processing (NLP). These models, by design, cannot capture long distance dependencies, i.e., dependencies among words belonging to non-overlapping n-grams. In [12] a Dynamic Convolutional Neural Network (DCNN) is proposed for sentence embedding. Similar to CLSM, DCNN does not rely on a parse tree and is easily applicable to any language. However, different from CLSM where a regular max-pooling is used, in DCNN a dynamic k-max-pooling is used. This means that instead of just keeping the largest entries among word vectors in one vector, the k largest entries are kept in k different vectors. DCNN has shown good performance in sentiment prediction and question type classification tasks. In [13], a convolutional neural network architecture is proposed for sentence matching and has shown great performance in several matching tasks. In [14], Bilingually-constrained Recursive Auto-encoders (BRAE) are proposed to create semantic vector representations for phrases; through experiments it is shown that the proposed method performs well in two end-to-end SMT tasks.

Long short-term memory networks were developed in [15] to address the difficulty of capturing long term memory in RNN. They have been successfully applied to speech recognition, where they achieve state-of-the-art performance [16], [17]. In text analysis, the LSTM-RNN treats a sentence as a sequence of words with internal structures, i.e., word dependencies. It encodes a semantic vector of a sentence incrementally, which differs from DSSM and CLSM. The encoding process is performed left-to-right, word-by-word. At each time step, a new word is encoded into the semantic vector, and the word dependencies embedded in the vector are "updated". When the process reaches the end of the sentence, the semantic vector has embedded all the words and their dependencies, and hence can be viewed as a feature vector representation of the whole sentence. In the machine translation work [1], an input English sentence is converted into a vector representation using an LSTM-RNN, and then another LSTM-RNN is used to generate an output French sentence. The model is trained to maximize the probability of predicting the correct output sentence. In [18], there are two main composition models: the ADD model, which is a bag of words, and the BI model, which is a summation over bi-gram pairs plus a non-linearity. In our proposed model, instead of a simple summation, we have used an LSTM model with letter tri-grams, which keeps valuable information over long intervals (for long sentences) and throws away useless information. In [19], an encoder-decoder approach is proposed to jointly learn to align and translate sentences from English to French using RNNs. The concept of "attention" in the decoder, discussed in that paper, is closely related to how our proposed model extracts keywords on the document side. For further explanations please see section V-A2. In [20] a set of visualizations are presented for RNNs with and without LSTM cells and GRUs. Different from our work, where the target task is sentence embedding for document retrieval, the target tasks in [20] were character level sequence modelling for text characters and source code. Interesting observations about the interpretability of some LSTM cells and statistics of gate activations are presented. In section V-A we show that some of the results of our visualization are consistent with the observations reported in [20]. We also present more detailed visualization specific to the document retrieval task using click-through data, as well as visualizations of how our proposed model can be used for keyword detection.

Different from the aforementioned studies, the method developed in this paper trains the model so that sentences that are paraphrases of each other are close in their semantic embedding vectors — see the description in Sec. IV further ahead. Another reason that the LSTM-RNN is particularly effective for sentence embedding is its robustness to noise. For example, in the web document ranking task, the noise comes from two sources: (i) Not every word in the query / document is equally important, and we only want to "remember" salient words using the limited "memory". (ii) A word or phrase that is important to a document may not be relevant to a given query, and we only want to "remember" related words that are useful to compute the relevance of the document for a given query. We will illustrate the robustness of the LSTM-RNN in this paper. The structure of the LSTM-RNN will also circumvent the serious limitation of using a fixed window size in CLSM. Our experiments show that this difference leads to significantly better results in the web document retrieval task. Furthermore, it has other advantages: it allows us to capture keywords and key topics effectively, and the models in this paper do not need the extra max-pooling layer, required by the CLSM to capture global contextual information; they do so more effectively.

III. SENTENCE EMBEDDING USING RNNS WITH AND WITHOUT LSTM CELLS

In this section, we introduce the model of recurrent neural networks and its long short-term memory version for learning the sentence embedding vectors. We start with the basic RNN and then proceed to LSTM-RNN.

A. The basic version of RNN

The RNN is a type of deep neural network that is "deep" in the temporal dimension and has been used extensively in time sequence modelling [21], [22], [23], [24], [25], [26], [27], [28], [29]. The main idea of using RNN for sentence embedding is to find a dense and low dimensional semantic representation by sequentially and recurrently processing each word in a sentence and mapping it into a low dimensional vector. In this model, the global contextual features of the whole text will be in the semantic representation of the last word in the text sequence — see Figure 1, where x(t) is the t-th word, coded as a 1-hot vector, W_h is a fixed hashing operator similar to the one used in [3] that converts the word vector to a letter tri-gram vector, W is the input weight matrix, W_{rec} is the recurrent weight matrix, y(t) is the hidden activation vector of the RNN, which can be used as a semantic representation of the t-th word, and y(t) associated to the last word x(m) is the semantic representation vector of the entire sentence. Note that this is very different from the approach in [3] where the bag-of-words representation is used for the whole text and no context information is used. This is also different from [10] where a sliding window of a fixed size (akin to an FIR filter) is used to capture local features and a max-pooling layer on the top to capture global features. In the RNN there is neither a fixed-sized window nor a max-pooling layer; rather the recurrence is used to capture the context information in the sequence (akin to an IIR filter).

The mathematical formulation of the above RNN model for sentence embedding can be expressed as

l(t) = W_h x(t)
y(t) = f(W l(t) + W_{rec} y(t − 1) + b)   (1)

where W and W_{rec} are the input and recurrent matrices to be learned, W_h is a fixed word hashing operator, b is the bias vector and f(·) is assumed to be tanh(·).
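The following is a minimal sketch, not the authors' implementation, of how the forward pass in (1) could be computed with NumPy. The letter tri-gram hashing operator W_h is assumed to be supplied as a matrix, and all dimensions are illustrative.

```python
import numpy as np

def rnn_sentence_embedding(X, W_h, W, W_rec, b):
    """Sketch of the forward pass in (1).

    X     : (m, V) one-hot word vectors of an m-word sentence
    W_h   : (d_h, V) fixed letter tri-gram hashing operator
    W     : (d, d_h) input weight matrix
    W_rec : (d, d) recurrent weight matrix
    b     : (d,) bias vector
    Returns y(m), the semantic vector of the whole sentence.
    """
    y = np.zeros(W_rec.shape[0])                 # y(0) = 0
    for t in range(X.shape[0]):
        l_t = W_h @ X[t]                         # l(t) = W_h x(t)
        y = np.tanh(W @ l_t + W_rec @ y + b)     # y(t) = f(W l(t) + W_rec y(t-1) + b)
    return y                                     # hidden activation at the last word
```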

Fig. 1. The basic architecture of the RNN for sentence embedding, where temporal recurrence is used to model the contextual information across words in the text string. The hidden activation vector corresponding to the last word is the sentence embedding vector (blue).

Fig. 2. The basic LSTM architecture used for sentence embedding.

Note that the architecture proposed here for sentence embedding is slightly different from the traditional RNN in that there is a word hashing layer that converts the high dimensional input into a relatively lower dimensional letter tri-gram representation. There is also no per-word supervision during training; instead, the whole sentence has a label. This is explained in more detail in section IV.

B. The RNN with LSTM cells

Although the RNN performs the transformation from the sentence to a vector in a principled manner, it is generally difficult to learn the long term dependency within the sequence due to the vanishing gradients problem. One of the effective solutions for this problem in RNNs is using memory cells instead of neurons, originally proposed in [15] as Long Short-Term Memory (LSTM) and completed in [30] and [31] by adding the forget gate and peephole connections to the architecture.

We use the architecture of the LSTM illustrated in Fig. 2 for the proposed sentence embedding method. In this figure, i(t), f(t), o(t), c(t) are the input gate, forget gate, output gate and cell state vector respectively, W_{p1}, W_{p2} and W_{p3} are peephole connections, W_i, W_{reci} and b_i, i = 1, 2, 3, 4, are input connections, recurrent connections and bias values, respectively, g(·) and h(·) are tanh(·) functions and σ(·) is the sigmoid function. We use this architecture to find y for each word, then use the y(m) corresponding to the last word in the sentence as the semantic vector for the entire sentence.

Considering Fig. 2, the forward pass for the LSTM-RNN model is as follows:

y_g(t) = g(W_4 l(t) + W_{rec4} y(t − 1) + b_4)
i(t) = σ(W_3 l(t) + W_{rec3} y(t − 1) + W_{p3} c(t − 1) + b_3)
f(t) = σ(W_2 l(t) + W_{rec2} y(t − 1) + W_{p2} c(t − 1) + b_2)
c(t) = f(t) ◦ c(t − 1) + i(t) ◦ y_g(t)
o(t) = σ(W_1 l(t) + W_{rec1} y(t − 1) + W_{p1} c(t) + b_1)
y(t) = o(t) ◦ h(c(t))   (2)

where ◦ denotes the Hadamard (element-wise) product. A diagram of the proposed model with more details is presented in section VI of the Supplementary Materials.
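As a minimal sketch of (2), not the authors' code, one cell update could be written as below. The parameter container `P`, the matrix form of the peephole connections, and all shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(l_t, y_prev, c_prev, P):
    """One step of the LSTM-RNN forward pass in (2).
    P holds W1..W4, Wrec1..Wrec4, Wp1..Wp3, b1..b4."""
    y_g = np.tanh(P["W4"] @ l_t + P["Wrec4"] @ y_prev + P["b4"])
    i = sigmoid(P["W3"] @ l_t + P["Wrec3"] @ y_prev + P["Wp3"] @ c_prev + P["b3"])
    f = sigmoid(P["W2"] @ l_t + P["Wrec2"] @ y_prev + P["Wp2"] @ c_prev + P["b2"])
    c = f * c_prev + i * y_g                          # c(t) = f ∘ c(t-1) + i ∘ y_g
    o = sigmoid(P["W1"] @ l_t + P["Wrec1"] @ y_prev + P["Wp1"] @ c + P["b1"])
    y = o * np.tanh(c)                                # y(t) = o(t) ∘ h(c(t))
    return y, c
```

Feeding l(1), ..., l(m) through such a step and keeping the final y gives the sentence embedding y(m) used throughout the paper.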

IV. LEARNING METHOD

To learn a good semantic representation of the input sentence, our objective is to make the embedding vectors for sentences of similar meaning as close as possible, and meanwhile, to make sentences of different meanings as far apart as possible. This is challenging in practice since it is hard to collect a large amount of manually labelled data that give the semantic similarity signal between different sentences. Nevertheless, the widely used commercial web search engines are able to log massive amounts of data with some limited user feedback signals. For example, given a particular query, the click-through information about the user-clicked document among many candidates is usually recorded and can be used as a weak (binary) supervision signal to indicate the semantic similarity between two sentences (on the query side and the document side). In this section, we explain how to leverage such a weak supervision signal to learn a sentence embedding vector that achieves the aforementioned training objective. Please also note that the above objective of making sentences with similar meaning as close as possible is similar to machine translation tasks, where two sentences belong to two different languages with similar meanings and we want to make their semantic representations as close as possible.

We now describe how to train the model to achieve the above objective using the click-through data logged by a commercial search engine. For a complete description of the click-through data please refer to section 2 in [32]. To begin with, we adopt the cosine similarity between the semantic vectors of two sentences as a measure for their similarity:

R(Q, D) = y_Q(T_Q)^T y_D(T_D) / (‖y_Q(T_Q)‖ · ‖y_D(T_D)‖)   (3)

where T_Q and T_D are the lengths of the sentence Q and sentence D, respectively. In the context of training over click-through data, we will use Q and D to denote "query" and "document", respectively. In Figure 3, we show the sentence embedding vectors corresponding to the query, y_Q(T_Q), and all the documents, {y_{D+}(T_{D+}), y_{D1−}(T_{D1−}), ..., y_{Dn−}(T_{Dn−})}, where the subscript D+ denotes the (clicked) positive sample among the documents, and the subscript D_j^− denotes the j-th (un-clicked) negative sample. All these embedding vectors are generated by feeding the sentences into the RNN or LSTM-RNN model described in Sec. III and taking the y corresponding to the last word — see the blue box in Figure 1.

Fig. 3. The click-through signal can be used as a (binary) indication of the semantic similarity between the sentence on the query side and the sentence on the document side. The negative samples are randomly sampled from the training data.

We want to maximize the likelihood of the clicked document given the query, which can be formulated as the following optimization problem:

L(Λ) = min_Λ { −log ∏_{r=1}^{N} P(D_r^+ | Q_r) } = min_Λ ∑_{r=1}^{N} l_r(Λ)   (4)

where Λ denotes the collection of the model parameters; in the regular RNN case, it includes W_{rec} and W in Figure 1, and in the LSTM-RNN case, it includes W_1, W_2, W_3, W_4, W_{rec1}, W_{rec2}, W_{rec3}, W_{rec4}, W_{p1}, W_{p2}, W_{p3}, b_1, b_2, b_3 and b_4 in Figure 2. D_r^+ is the clicked document for the r-th query, P(D_r^+ | Q_r) is the probability of the clicked document given the r-th query, N is the number of query / clicked-document pairs in the corpus and

l_r(Λ) = −log( e^{γ R(Q_r, D_r^+)} / (e^{γ R(Q_r, D_r^+)} + ∑_{j=1}^{n} e^{γ R(Q_r, D_{r,j}^−)}) ) = log( 1 + ∑_{j=1}^{n} e^{−γ Δ_{r,j}} )   (5)

where Δ_{r,j} = R(Q_r, D_r^+) − R(Q_r, D_{r,j}^−), R(·,·) was defined earlier in (3), D_{r,j}^− is the j-th negative candidate document for the r-th query and n denotes the number of negative samples used during training.

The expression in (5) is a logistic loss over Δ_{r,j}. It upper-bounds the pairwise accuracy, i.e., the 0-1 loss. Since the similarity measure is the cosine function, Δ_{r,j} ∈ [−2, 2]. To have a larger range for Δ_{r,j}, we use γ for scaling. It helps to penalize the prediction error more. Its value is set empirically by experiments on a held out dataset.

To train the RNN and LSTM-RNN, we use Back Propagation Through Time (BPTT). The update equations for parameter Λ at epoch k are as follows:

ΔΛ_k = Λ_k − Λ_{k−1}
ΔΛ_k = μ_{k−1} ΔΛ_{k−1} − ε_{k−1} ∇L(Λ_{k−1} + μ_{k−1} ΔΛ_{k−1})   (6)

where ∇L(·) is the gradient of the cost function in (4), ε is the learning rate and μ_k is a momentum parameter determined by the scheduling scheme used for training. The above equations are equivalent to the Nesterov method in [33]. To see why, please refer to appendix A.1 of [34], where the Nesterov method is derived as a momentum method. The gradient of the cost function, ∇L(Λ), is:

∇L(Λ) = −∑_{r=1}^{N} ∑_{j=1}^{n} ∑_{τ=0}^{T} α_{r,j} ∂Δ_{r,j,τ}/∂Λ   (7)

where the inner sum over τ forms one large update, T is the number of time steps over which we unfold the network in time, and

α_{r,j} = −γ e^{−γ Δ_{r,j}} / (1 + ∑_{j=1}^{n} e^{−γ Δ_{r,j}})   (8)

The term ∂Δ_{r,j,τ}/∂Λ in (7) and the error signals for the different parameters of the RNN and LSTM-RNN that are necessary for training are presented in Appendix A. A full derivation of the gradients in both models is presented in section III of the supplementary materials.

To accelerate training by parallelization, we use mini-batch training and one large update instead of incremental updates during back propagation through time. To resolve the gradient explosion problem we use the gradient re-normalization method described in [35], [24]. To accelerate the convergence, we use the Nesterov method [33] and found it effective in training both RNN and LSTM-RNN for sentence embedding.
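The update in (6) combined with the gradient re-normalization of [35] could look roughly as follows; this sketch assumes a single flattened parameter vector and an illustrative clipping threshold, and is not the authors' implementation.

```python
import numpy as np

def nesterov_step(params, velocity, grad_fn, lr, mu, clip_thresh):
    """One update of (6) with gradient re-normalization (clipping by norm)."""
    lookahead = params + mu * velocity          # Lambda_{k-1} + mu * dLambda_{k-1}
    grad = grad_fn(lookahead)                   # gradient of L at the look-ahead point
    norm = np.linalg.norm(grad)
    if norm > clip_thresh:                      # re-normalize exploding gradients
        grad = grad * (clip_thresh / norm)
    velocity = mu * velocity - lr * grad        # dLambda_k
    params = params + velocity                  # Lambda_k = Lambda_{k-1} + dLambda_k
    return params, velocity
```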

We have used a simple yet effective scheduling for μ_k for both RNN and LSTM-RNN models: in the first and last 2% of all parameter updates μ_k = 0.9 and for the other 96% of all parameter updates μ_k = 0.995. We have used a fixed step size for training the RNN and a fixed step size for training the LSTM-RNN. A summary of the training method for LSTM-RNN is presented in Algorithm 1.

Algorithm 1 Training LSTM-RNN for Sentence Embedding
Inputs: Fixed step size ε, scheduling for μ, gradient clip threshold th_G, maximum number of epochs nEpoch, total number of query / clicked-document pairs N, total number of un-clicked (negative) documents for a given query n, maximum sequence length for truncated BPTT T.
Outputs: Two trained models, one on the query side Λ_Q, one on the document side Λ_D.
Initialization: Set all parameters in Λ_Q and Λ_D to small random numbers; i = 0, k = 1.
procedure LSTM-RNN(Λ_Q, Λ_D)
  while i ≤ nEpoch do
    for "first minibatch" → "last minibatch" do
      r ← 1
      while r ≤ N do
        for j = 1 → n do
          Compute α_{r,j}  ▷ use (8)
          Compute α_{r,j} ∑_{τ=0}^{T} ∂Δ_{r,j,τ}/∂Λ_{k,Q}  ▷ use (14) to (44) in Appendix A
          Compute α_{r,j} ∑_{τ=0}^{T} ∂Δ_{r,j,τ}/∂Λ_{k,D}  ▷ use (14) to (44) in Appendix A
          Sum the above terms for Q and D over j
        end for
        Sum the above terms for Q and D over r
        r ← r + 1
      end while
      Compute ∇L(Λ_{k,Q})  ▷ use (7)
      Compute ∇L(Λ_{k,D})  ▷ use (7)
      if ‖∇L(Λ_{k,Q})‖ > th_G then
        ∇L(Λ_{k,Q}) ← th_G · ∇L(Λ_{k,Q}) / ‖∇L(Λ_{k,Q})‖
      end if
      if ‖∇L(Λ_{k,D})‖ > th_G then
        ∇L(Λ_{k,D}) ← th_G · ∇L(Λ_{k,D}) / ‖∇L(Λ_{k,D})‖
      end if
      Compute ΔΛ_{k,Q}  ▷ use (6)
      Compute ΔΛ_{k,D}  ▷ use (6)
      Update: Λ_{k,Q} ← ΔΛ_{k,Q} + Λ_{k−1,Q}
      Update: Λ_{k,D} ← ΔΛ_{k,D} + Λ_{k−1,D}
      k ← k + 1
    end for
    i ← i + 1
  end while
end procedure

V. ANALYSIS OF THE SENTENCE EMBEDDING PROCESS AND PERFORMANCE EVALUATION

To understand how the LSTM-RNN performs sentence embedding, we use visualization tools to analyze the semantic vectors generated by our model. We would like to answer the following questions: (i) How are word dependencies and context information captured? (ii) How does the LSTM-RNN attenuate unimportant information and detect critical information from the input sentence? Or, how are the keywords embedded into the semantic vector? (iii) How are the global topics identified by the LSTM-RNN?

To answer these questions, we train the RNN with and without LSTM cells on the click-through dataset which is logged by a commercial web search engine. The training method has been described in Sec. IV. A description of the corpus is as follows. The training set includes 200,000 positive query / document pairs where only the clicked signal is used as a weak supervision for training the LSTM. The relevance judgement set (test set) is constructed as follows. First, the queries are sampled from a year of search engine logs. Adult, spam, and bot queries are all removed. Queries are de-duped so that only unique queries remain. To reflect a natural query distribution, we do not try to control the quality of these queries. For example, in our query sets, there are around 20% misspelled queries, around 20% navigational queries and 10% transactional queries, etc. Second, for each query, we collect Web documents to be judged by issuing the query to several popular search engines (e.g., Google, Bing) and fetching the top-10 retrieval results from each. Finally, the query-document pairs are judged by a group of well-trained assessors. In this study all the queries are preprocessed as follows. The text is white-space tokenized and lower-cased, numbers are retained, and no stemming/inflection treatment is performed. Unless stated otherwise, in the experiments we used 4 negative samples, i.e., n = 4 in Fig. 3.

We now proceed to perform a comprehensive analysis by visualizing the trained RNN and LSTM-RNN models. In particular, we will visualize the on-and-off behaviors of the input gates, output gates, cell states, and the semantic vectors in the LSTM-RNN model, which reveals how the model extracts useful information from the input sentence and embeds it properly into the semantic vector according to the topic information.

Although we gave the full learning formulae for all the model parameters in the previous section, we will remove the peephole connections and the forget gate from the LSTM-RNN model in the current task. This is because the length of each sequence, i.e., the number of words in a query or a document, is known in advance, and we set the state of each cell to zero at the beginning of a new sequence. Therefore, forget gates are not a great help here. Also, as long as the order of words is kept, the precise timing in the sequence is not of great concern. Therefore, peephole connections are not that important as well. Removing peephole connections and the forget gate will also reduce the amount of training time, since a smaller number of parameters need to be learned.
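For the analysis, a minimal sketch of the simplified cell (no peephole connections, no forget gate, i.e., the forget gate fixed to one) could look as below. It reuses the parameter names of (2) but is an illustrative assumption, not the exact code used in the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simplified_lstm_step(l_t, y_prev, c_prev, P):
    """LSTM step of (2) with peepholes removed and f(t) fixed to one,
    so that c(t) = c(t-1) + i(t) ∘ y_g(t)."""
    y_g = np.tanh(P["W4"] @ l_t + P["Wrec4"] @ y_prev + P["b4"])
    i = sigmoid(P["W3"] @ l_t + P["Wrec3"] @ y_prev + P["b3"])
    c = c_prev + i * y_g
    o = sigmoid(P["W1"] @ l_t + P["Wrec1"] @ y_prev + P["b1"])
    return o * np.tanh(c), c
```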

Fig. 4. Query: "hotels in shanghai". Panels: (a) i(t), (b) c(t), (c) o(t), (d) y(t). Since the sentence ends at the third word, all the values to the right of it are zero (green color).

Fig. 5. Document: "shanghai hotels accommodation hotel in shanghai discount and reservation". Panels: (a) i(t), (b) c(t), (c) o(t), (d) y(t). Since the sentence ends at the ninth word, all the values to the right of it are zero (green color).

A. Analysis

In this section we would like to examine how the information in the input sentence is sequentially extracted and embedded into the semantic vector over time by the LSTM-RNN model.

1) Attenuating Unimportant Information: First, we examine the evolution of the semantic vector and how unimportant words are attenuated. Specifically, we feed the following input sentences from the test dataset into the trained LSTM-RNN model:

• Query: "hotels in shanghai"
• Document: "shanghai hotels accommodation hotel in shanghai discount and reservation"

Activations of the input gate, output gate, cell state and the embedding vector for each cell for the query and document are shown in Fig. 4 and Fig. 5, respectively. The vertical axis is the cell index from 1 to 32, the horizontal axis is the word index from 1 to 10 numbered from left to right in a sequence of words, and the color codes show activation values. From Figs. 4–5, we make the following observations:

• Semantic representation y(t) and cell states c(t) evolve over time. Valuable context information is gradually absorbed into c(t) and y(t), so that the information in these two vectors becomes richer over time, and the semantic information of the entire input sentence is embedded into vector y(t), which is obtained by applying output gates to the cell states c(t).
• The input gates evolve in such a way that they attenuate the unimportant information and detect the important information from the input sentence. For example, in Fig. 5(a), most of the input gate values corresponding to word 3, word 7 and word 9 have very small values (light green-yellow color)¹, which correspond to the words "accommodation", "discount" and "reservation", respectively, in the document sentence. Interestingly, input gates reduce the effect of these three words in the final semantic representation, y(t), such that the semantic similarity between sentences from the query and document sides is not affected by these words.

2) Keywords Extraction: In this section, we show how the trained LSTM-RNN extracts the important information, i.e., keywords, from the input sentences. To this end, we backtrack the semantic representations, y(t), over time. We focus on the 10 most active cells in the final semantic representation. Whenever there is a large enough change in a cell's activation value y(t), we assume an important keyword has been detected by the model.

¹If this is not clearly visible, please refer to Fig. 1 in section I of the supplementary materials. We have adjusted the color bar for all figures to have the same range; for this reason the structure might not be clearly visible. More visualization examples can also be found in section IV of the Supplementary Materials.
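A minimal sketch of this detection criterion is given below, assuming the per-word embeddings y(1), ..., y(m) have been collected into an array; the change threshold is an illustrative assumption, since the paper does not publish code for this step.

```python
import numpy as np

def detect_keywords(Y, top_cells=10, change_thresh=0.05):
    """Backtrack y(t) over time and flag words that cause a large change
    in one of the most active cells of the final semantic vector.

    Y: (m, d) array, row t is y(t) after reading word t (0-based).
    Returns the indices of detected keyword positions.
    """
    active = np.argsort(np.abs(Y[-1]))[-top_cells:]   # most active cells at the last word
    keywords = []
    prev = Y[0, active]                               # the first word is skipped: starting
    for t in range(1, Y.shape[0]):                    # from the zero state, its change is
        change = np.abs(Y[t, active] - prev)          # always large
        if np.any(change > change_thresh):
            keywords.append(t)
        prev = Y[t, active]
    return keywords
```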

We illustrate the result using the above example ("hotels in shanghai"). The evolution of the activations y(t) of the 10 most active cells over time is shown in Fig. 6 for the query and the document sentences². From Fig. 6, we also observe that different words activate different cells. In Tables I–II, we show the number of cells each word activates³. We used a Bidirectional LSTM-RNN to get the results of these tables, where in the first row the LSTM-RNN reads sentences from left to right and in the second row it reads sentences from right to left. In these tables we labelled a word as a keyword if more than 40% of the top 10 active cells in both directions declare it as a keyword. The boldface numbers in the tables show that the number of cells assigned to that word is more than 4, i.e., 40% of the top 10 active cells. From the tables, we observe that the keywords activate more cells than the unimportant words, meaning that they are selectively embedded into the semantic vector.

²Likewise, the vertical axis is the cell index and the horizontal axis is the word index in the sentence.
³Note that before presenting the first word of the sequence, activation values are initially zero so that there is always a considerable change in the cell states after presenting the first word. For this reason, we have not indicated the number of cells detecting the first word as a keyword. Moreover, another keyword extraction example can be found in section IV of the supplementary materials.

Fig. 6. Activation values, y(t), of the 10 most active cells for Query: "hotels in shanghai" and Document: "shanghai hotels accommodation hotel in shanghai discount and reservation". Panels: (a) y(t) top 10 for query, (b) y(t) top 10 for document.

TABLE I
KEYWORDS FOR QUERY: "hotels in shanghai"
Query word | hotels | in | shanghai
Number of assigned cells out of 10, left to right | - | 0 | 7
Number of assigned cells out of 10, right to left | 6 | 0 | -

TABLE II
KEYWORDS FOR DOCUMENT: "shanghai hotels accommodation hotel in shanghai discount and reservation"
Document word | shanghai | hotels | accommodation | hotel | in | shanghai | discount | and | reservation
Number of assigned cells out of 10, left to right | - | 4 | 3 | 8 | 1 | 8 | 5 | 3 | 4
Number of assigned cells out of 10, right to left | 4 | 6 | 5 | 4 | 5 | 1 | 7 | 5 | -

3) Topic Allocation: Now, we further show that the trained LSTM-RNN model not only detects the keywords, but also allocates them properly to different cells according to the topics they belong to. To do this, we go through the test dataset using the trained LSTM-RNN model and search for the keywords that are detected by a specific cell. For simplicity, we use the following simple approach: for each given query we look into the keywords that are extracted by the 5 most active cells of the LSTM-RNN and list them in Table III. Interestingly, each cell collects keywords of a specific topic. For example, cell 26 in Table III extracts keywords related to the topic "food" and cells 2 and 6 mainly focus on the keywords related to the topic "health".

B. Performance Evaluation

1) Web Document Retrieval Task: In this section, we apply the proposed sentence embedding method to an important web document retrieval task for a commercial web search engine. Specifically, the RNN models (with and without LSTM cells) embed the sentences from the query and the document sides into their corresponding semantic vectors, and then compute the cosine similarity between these vectors to measure the semantic similarity between the query and candidate documents.

Experimental results for this task are shown in Table IV using the standard metric mean Normalized Discounted Cumulative Gain (NDCG) [36] (the higher the better) for evaluating the ranking performance of the RNN and LSTM-RNN on a standalone human-rated test dataset. We also trained several strong baselines, such as DSSM [3] and CLSM [10], on the same training dataset and evaluated their performance on the same task. For fair comparison, our proposed RNN and LSTM-RNN models are trained with the same number of parameters as the DSSM and CLSM models (14.4M parameters). Besides, we also include in Table IV two well-known information retrieval (IR) models, BM25 and PLSA, for the sake of benchmarking. The BM25 model uses the bag-of-words representation for queries and documents and is a state-of-the-art document ranking model based on term matching, widely used as a baseline in the IR community. PLSA (Probabilistic Latent Semantic Analysis) is a topic model proposed in [37], which is trained using Maximum A Posteriori estimation [38] on the document side of the same training dataset. We experimented with a varying number of topics from 100 to 500 for PLSA, which gives similar performance, and we report in Table IV the results of using 500 topics. Results for a language model based method, the uni-gram language model (ULM) with Dirichlet smoothing, are also presented in the table.
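For reference, NDCG@k [36] can be computed from graded relevance labels as in the sketch below; the exponential gain and logarithmic discount shown here are common conventions and are an assumption on our part, since the paper only cites [36] for the metric.

```python
import numpy as np

def ndcg_at_k(relevances, k):
    """relevances: graded labels of the candidate documents, listed in the
    ranked order produced by the model (highest model score first)."""
    rel = np.asarray(relevances, dtype=float)[:k]
    dcg = np.sum((2.0 ** rel - 1.0) / np.log2(np.arange(2, rel.size + 2)))
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = np.sum((2.0 ** ideal - 1.0) / np.log2(np.arange(2, ideal.size + 2)))
    return dcg / idcg if idcg > 0 else 0.0
```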

TABLE III
KEYWORDS ASSIGNED TO EACH CELL OF THE LSTM-RNN FOR DIFFERENT QUERIES OF TWO TOPICS, "FOOD" AND "HEALTH". Each row is a query (e.g., "atkins diet lasagna", "canning corn beef hash", "torre de pizza", "fried chicken", "fight sinus infections", "health insurance high blood pressure") and each column is one of the 32 cells; the entries list the keywords extracted by that cell from the query. For example, cell 26 collects keywords of the topic "food" while cells 2 and 6 collect keywords of the topic "health".

To compare the performance of the proposed method with general sentence embedding methods in the document retrieval task, we also performed experiments using two general sentence embedding methods.

1) In the first experiment, we used the method proposed in [2] that generates embedding vectors known as Paragraph Vectors. It is also known as doc2vec. It maps each word to a vector and then uses the vectors representing all words inside a context window to predict the vector representation of the next word. The main idea in this method is to use an additional paragraph token from previous sentences in the document inside the context window. This paragraph token is mapped to vector space using a different matrix from the one used to map the words. A primary version of this method is known as word2vec, proposed in [39]. The only difference is that word2vec does not include the paragraph token.
To use doc2vec on our dataset, we first trained the doc2vec model on both the train set (about 200,000 query-document pairs) and the test set (about 900,000 query-document pairs). This gives us an embedding vector for every query and document in the dataset. We used the following parameters for training:
• min-count=1: minimum number of words per sentence; sentences with fewer words than this will be ignored. We set it to 1 to make sure we do not throw away anything.
• window=5: fixed window size explained in [2]. We used different window sizes; it resulted in only about a 0.4% difference in final NDCG values.
• size=100: feature vector dimension. We used 400 as well but did not get significantly different NDCG values.
• sample=1e-4: the down-sampling ratio for the words that are repeated a lot in the corpus.
• negative=5: the number of noise words, i.e., words used for negative sampling as explained in [2].
• We used 30 epochs of training. We ran an experiment with 100 epochs but did not observe much difference in the results.
• We used gensim [40] to perform the experiments; a sketch of the training call is given below.
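A rough sketch of such a gensim [40] training call with the parameters listed above is shown below; the corpus-loading step and the newer `vector_size`/`epochs` argument names (older gensim versions used `size` and `iter`) are assumptions, not the exact script used in the paper.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def train_doc2vec(texts):
    """texts: list of whitespace-tokenized queries and document titles."""
    corpus = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(texts)]
    model = Doc2Vec(vector_size=100,   # size=100 above
                    window=5,
                    min_count=1,
                    sample=1e-4,
                    negative=5,
                    epochs=30)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    return model

# model.dv[i] (older gensim: model.docvecs[i]) is the embedding of text i, and
# model.wv.most_similar("pizza") lists the most similar words, as used for the
# sanity check described in the text.
```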

To make sure that a meaningful model is trained, we used the trained doc2vec model to find the most similar words to two sample words in our dataset, e.g., the words "pizza" and "infection". The resulting words and corresponding scores are presented in section V of the Supplementary Materials. As observed from the resulting words, the trained model is meaningful and can recognise semantic similarity.
Doc2vec also assigns an embedding vector to each query and document in our test set. We used these embedding vectors to calculate the cosine similarity score between each query-document pair in the test set. We used these scores to calculate the NDCG values reported in Table IV for the Doc2Vec model.
Comparing the results of the doc2vec model with our proposed method for the document retrieval task shows that the proposed method in this paper significantly outperforms doc2vec. One reason for this is that we have used a very general sentence embedding method, doc2vec, for the document retrieval task. This experiment shows that it is not a good idea to use a general sentence embedding method, and that using a better task oriented cost function, like the one proposed in this paper, is necessary.

2) In the second experiment, we used the Skip-Thought vectors proposed in [6]. During training, the skip-thought method gets a tuple (s(t−1), s(t), s(t+1)), encodes the sentence s(t) using one encoder, and tries to reconstruct the previous and next sentences, i.e., s(t−1) and s(t+1), using two separate decoders. The model uses RNNs with Gated Recurrent Units (GRU), which are shown to perform as well as LSTM. In their paper, the authors have emphasized that "Our model depends on having a training corpus of contiguous text". Therefore, training it on our training set, where we barely have more than one sentence in a query or document title, is not fair. However, since their model is trained on 11,038 books from the BookCorpus dataset [7], which includes about 74 million sentences, we can use the trained model as an off-the-shelf sentence embedding method, as the authors have concluded in the conclusion of their paper.
To do this we downloaded their trained models and word embeddings (more than 2GB in size) available from "https://github.com/ryankiros/skip-thoughts". Then we encoded each query and its corresponding document title in our test set as a vector.
We used the combine-skip sentence embedding method, a vector of size 4800 × 1, which is the concatenation of a uni-skip vector, i.e., a unidirectional encoder resulting in a 2400 × 1 vector, and a bi-skip vector, i.e., a bidirectional encoder resulting in a 1200 × 1 vector from the forward encoder and another 1200 × 1 vector from the backward encoder. The authors have reported their best results with the combine-skip encoder.
Using the 4800 × 1 embedding vectors for each query and document we calculated the scores and NDCG for the whole test set, which are reported in Table IV.
The proposed method in this paper performs significantly better than the off-the-shelf skip-thought method for the document retrieval task. Nevertheless, since we used skip-thought as an off-the-shelf sentence embedding method, its result is good. This result also confirms that learning embedding vectors using a model and cost function specifically designed for the document retrieval task is necessary.

As shown in Table IV, the LSTM-RNN significantly outperforms all these models, and exceeds the best baseline model (CLSM) by 1.3% in NDCG@1 score, which is a statistically significant improvement. As we pointed out in Sec. V-A, such an improvement comes from the LSTM-RNN's ability to embed the contextual and semantic information of the sentences into a finite dimension vector. In Table IV, we have also presented the results when different numbers of negative samples, n, are used. Generally, by increasing n we expect the performance to improve. This is because more negative samples result in a more accurate approximation of the partition function in (5). The results of using a Bidirectional LSTM-RNN are also presented in Table IV. In this model, one LSTM-RNN reads queries and documents from left to right, and the other LSTM-RNN reads queries and documents from right to left. Then the embedding vectors from the left-to-right and right-to-left LSTM-RNNs are concatenated to compute the cosine similarity score and the NDCG values.

A comparison between the value of the cost function during training for LSTM-RNN and RNN on the click-through data is shown in Fig. 7. From this figure, we conclude that the LSTM-RNN is optimizing the cost function in (4) more effectively. Please note that all parameters of both models are initialized randomly.

Fig. 7. LSTM-RNN compared to RNN during training: the vertical axis is the logarithmic scale of the training cost, L(Λ), in (4); the horizontal axis is the number of epochs during training.
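A minimal sketch of the scoring step described above — embedding a sentence from both directions, concatenating, and ranking candidate documents by cosine similarity — is shown below; the embedding callables are placeholders for the trained forward and backward models, not the authors' code.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def bidirectional_embed(words, embed_l2r, embed_r2l):
    """Concatenate left-to-right and right-to-left sentence embeddings.
    embed_l2r / embed_r2l: callables returning y(m) for a word sequence."""
    return np.concatenate([embed_l2r(words), embed_r2l(words[::-1])])

def rank_documents(query_words, docs_words, embed_l2r, embed_r2l):
    """Return candidate document indices sorted by cosine similarity to the query."""
    y_q = bidirectional_embed(query_words, embed_l2r, embed_r2l)
    scores = [cosine(y_q, bidirectional_embed(d, embed_l2r, embed_r2l))
              for d in docs_words]
    return np.argsort(scores)[::-1]
```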

TABLE IV
COMPARISONS OF NDCG PERFORMANCE MEASURES (THE HIGHER THE BETTER) OF PROPOSED MODELS AND A SERIES OF BASELINE MODELS, WHERE nhid REFERS TO THE NUMBER OF HIDDEN UNITS, ncell REFERS TO THE NUMBER OF CELLS, win REFERS TO THE WINDOW SIZE, AND n IS THE NUMBER OF NEGATIVE SAMPLES, WHICH IS SET TO 4 UNLESS OTHERWISE STATED. UNLESS STATED OTHERWISE, THE RNN AND LSTM-RNN MODELS ARE CHOSEN TO HAVE THE SAME NUMBER OF MODEL PARAMETERS AS THE DSSM AND CLSM MODELS: 14.4M, WHERE 1M = 10^6. THE BOLDFACE NUMBERS ARE THE BEST RESULTS.

Model | NDCG@1 | NDCG@3 | NDCG@10
Skip-Thought off-the-shelf | 26.9% | 29.7% | 36.2%
Doc2Vec | 29.1% | 31.8% | 38.4%
ULM | 30.4% | 32.7% | 38.5%
BM25 | 30.5% | 32.8% | 38.8%
PLSA (T=500) | 30.8% | 33.7% | 40.2%
DSSM (nhid = 288/96), 2 Layers | 31.0% | 34.4% | 41.7%
CLSM (nhid = 288/96, win=1), 2 Layers, 14.4M parameters | 31.8% | 35.1% | 42.6%
CLSM (nhid = 288/96, win=3), 2 Layers, 43.2M parameters | 32.1% | 35.2% | 42.7%
CLSM (nhid = 288/96, win=5), 2 Layers, 72M parameters | 32.0% | 35.2% | 42.6%
RNN (nhid = 288), 1 Layer | 31.7% | 35.0% | 42.3%
LSTM-RNN (ncell = 32), 1 Layer, 4.8M parameters | 31.9% | 35.5% | 42.7%
LSTM-RNN (ncell = 64), 1 Layer, 9.6M parameters | 32.9% | 36.3% | 43.4%
LSTM-RNN (ncell = 96), 1 Layer, n = 2 | 32.6% | 36.0% | 43.4%
LSTM-RNN (ncell = 96), 1 Layer, n = 4 | 33.1% | 36.5% | 43.6%
LSTM-RNN (ncell = 96), 1 Layer, n = 6 | 33.1% | 36.6% | 43.6%
LSTM-RNN (ncell = 96), 1 Layer, n = 8 | 33.1% | 36.4% | 43.7%
Bidirectional LSTM-RNN (ncell = 96), 1 Layer | 33.2% | 36.6% | 43.6%

VI. CONCLUSIONS AND FUTURE WORK

This paper addresses deep sentence embedding. We propose a model based on long short-term memory to model the long range context information and embed the key information of a sentence in one semantic vector. We show that the semantic vector evolves over time and only takes useful information from any new input. This has been made possible by input gates that detect useless information and attenuate it. Due to the general limitation of available human labelled data, we proposed and implemented training the model with a weak supervision signal using user click-through data of a commercial web search engine.

By performing a detailed analysis on the model, we showed that: 1) the proposed model is robust to noise, i.e., it mainly embeds keywords in the final semantic vector representing the whole sentence, and 2) in the proposed model, each cell is usually allocated to keywords from a specific topic. These findings have been supported using extensive examples. As a concrete sample application of the proposed sentence embedding method, we evaluated it on the important language processing task of web document retrieval. We showed that, for this task, the proposed method outperforms all existing state of the art methods significantly.

This work has been motivated by the earlier successes of deep learning methods in speech [41], [42], [43], [44], [45] and in semantic modelling [3], [10], [46], [47], and it adds further evidence for the effectiveness of these methods. Our future work will further extend the methods to include: 1) using the proposed sentence embedding method for other important language processing tasks for which we believe sentence embedding plays a key role, e.g., the question / answering task; 2) exploiting the prior information about the structure of the different matrices in Fig. 2 to develop a more effective cost function and learning method; and 3) exploiting an attention mechanism in the proposed model to improve the performance and find out which words in the query are aligned to which words of the document.

APPENDIX A
EXPRESSIONS FOR THE GRADIENTS

In this appendix we present the final gradient expressions that are necessary for training the proposed models. Full derivations of these gradients are presented in section III of the supplementary materials.

A. RNN

For the recurrent parameters, Λ = W_{rec} (we have omitted the r subscript for simplicity):

∂Δ_{j,τ}/∂W_{rec} = [δ^{D+}_{y_Q}(t−τ) y_Q(t−τ−1)^T + δ^{D+}_{y_D}(t−τ) y_{D+}(t−τ−1)^T] − [δ^{D_j^−}_{y_Q}(t−τ) y_Q(t−τ−1)^T + δ^{D_j^−}_{y_D}(t−τ) y_{D_j^−}(t−τ−1)^T]   (9)

where D_j^− means the j-th candidate document that is not clicked and

δ_{y_Q}(t−τ−1) = (1 − y_Q(t−τ−1)) ◦ (1 + y_Q(t−τ−1)) ◦ W_{rec}^T δ_{y_Q}(t−τ)   (10)

and the same as (10) for δ_{y_D}(t−τ−1) with the D subscript for the document side model. Please also note that:

δ_{y_Q}(T_Q) = (1 − y_Q(T_Q)) ◦ (1 + y_Q(T_Q)) ◦ (b·c·y_D(T_D) − a·b³·c·y_Q(T_Q))
δ_{y_D}(T_D) = (1 − y_D(T_D)) ◦ (1 + y_D(T_D)) ◦ (b·c·y_Q(T_Q) − a·b·c³·y_D(T_D))   (11)

where

a = y_Q(t = T_Q)^T y_D(t = T_D), b = 1/‖y_Q(t = T_Q)‖, c = 1/‖y_D(t = T_D)‖   (12)

For the input parameters, Λ = W:

∂Δ_{j,τ}/∂W = [δ^{D+}_{y_Q}(t−τ) l_Q(t−τ)^T + δ^{D+}_{y_D}(t−τ) l_{D+}(t−τ)^T] − [δ^{D_j^−}_{y_Q}(t−τ) l_Q(t−τ)^T + δ^{D_j^−}_{y_D}(t−τ) l_{D_j^−}(t−τ)^T]   (13)

A full derivation of BPTT for the RNN is presented in section III of the supplementary materials.
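The error signal at the last time step in (11)–(12), i.e., the derivative of the cosine similarity (3) combined with the tanh nonlinearity, could be computed as in this sketch; it is an illustration of the formulas above, not the authors' code.

```python
import numpy as np

def last_step_deltas(y_q, y_d):
    """Compute delta_yQ(T_Q) and delta_yD(T_D) as in (11)-(12)."""
    a = y_q @ y_d                        # a = y_Q^T y_D
    b = 1.0 / np.linalg.norm(y_q)        # b = 1 / ||y_Q||
    c = 1.0 / np.linalg.norm(y_d)        # c = 1 / ||y_D||
    v_q = b * c * y_d - a * b**3 * c * y_q
    v_d = b * c * y_q - a * b * c**3 * y_d
    delta_q = (1.0 - y_q) * (1.0 + y_q) * v_q   # (1 - y)∘(1 + y) is the tanh derivative
    delta_d = (1.0 - y_d) * (1.0 + y_d) * v_d
    return delta_q, delta_d
```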

B. LSTM-RNN

Starting with the cost function in (4), we use the Nesterov method described in (6) to update the LSTM-RNN model parameters. Here, Λ is one of the weight matrices or bias vectors {W_1, W_2, W_3, W_4, W_{rec1}, W_{rec2}, W_{rec3}, W_{rec4}, W_{p1}, W_{p2}, W_{p3}, b_1, b_2, b_3, b_4} in the LSTM-RNN architecture. The general format of the gradient of the cost function, ∇L(Λ), is the same as (7). By definition of Δ_{r,j}, we have:

∂Δ_{r,j}/∂Λ = ∂R(Q_r, D_r^+)/∂Λ − ∂R(Q_r, D_{r,j}^−)/∂Λ   (14)

We omit the r and j subscripts for simplicity and present ∂R(Q,D)/∂Λ for the different parameters of each cell of the LSTM-RNN in the following subsections. This completes the process of calculating ∇L(Λ) in (7), and then we can use (6) to update the LSTM-RNN model parameters. In the subsequent subsections the vectors v_Q and v_D are defined as:

v_Q = (b·c·y_D(t = T_D) − a·b³·c·y_Q(t = T_Q))
v_D = (b·c·y_Q(t = T_Q) − a·b·c³·y_D(t = T_D))   (15)

where a, b and c are defined in (12). The full derivation of truncated BPTT for the LSTM-RNN model is presented in section III of the supplementary materials.

1) Output Gate: For recurrent connections we have:

∂R(Q,D)/∂W_{rec1} = δ^{rec1}_{y_Q}(t) y_Q(t−1)^T + δ^{rec1}_{y_D}(t) y_D(t−1)^T   (16)

where

δ^{rec1}_{y_Q}(t) = o_Q(t) ◦ (1 − o_Q(t)) ◦ h(c_Q(t)) ◦ v_Q(t)   (17)

and the same as (17) for δ^{rec1}_{y_D}(t) with the subscript D for the document side model. For the input connections, W_1, and peephole connections, W_{p1}, we will have:

∂R(Q,D)/∂W_1 = δ^{rec1}_{y_Q}(t) l_Q(t)^T + δ^{rec1}_{y_D}(t) l_D(t)^T   (18)
∂R(Q,D)/∂W_{p1} = δ^{rec1}_{y_Q}(t) c_Q(t)^T + δ^{rec1}_{y_D}(t) c_D(t)^T   (19)

The derivative for the output gate bias values will be:

∂R(Q,D)/∂b_1 = δ^{rec1}_{y_Q}(t) + δ^{rec1}_{y_D}(t)   (20)

2) Input Gate: For the recurrent connections we have:

∂R(Q,D)/∂W_{rec3} = diag(δ^{rec3}_{y_Q}(t)) ∂c_Q(t)/∂W_{rec3} + diag(δ^{rec3}_{y_D}(t)) ∂c_D(t)/∂W_{rec3}   (21)

where

δ^{rec3}_{y_Q}(t) = (1 − h(c_Q(t))) ◦ (1 + h(c_Q(t))) ◦ o_Q(t) ◦ v_Q(t)
∂c_Q(t)/∂W_{rec3} = diag(f_Q(t)) ∂c_Q(t−1)/∂W_{rec3} + b_{i,Q}(t) y_Q(t−1)^T
b_{i,Q}(t) = y_{g,Q}(t) ◦ i_Q(t) ◦ (1 − i_Q(t))   (22)

In equation (21), δ^{rec3}_{y_D}(t) and ∂c_D(t)/∂W_{rec3} are the same as (22) with the D subscript. For the input connections we will have the following:

∂R(Q,D)/∂W_3 = diag(δ^{rec3}_{y_Q}(t)) ∂c_Q(t)/∂W_3 + diag(δ^{rec3}_{y_D}(t)) ∂c_D(t)/∂W_3   (23)

where

∂c_Q(t)/∂W_3 = diag(f_Q(t)) ∂c_Q(t−1)/∂W_3 + b_{i,Q}(t) x_Q(t)^T   (24)

For the peephole connections we will have:

∂R(Q,D)/∂W_{p3} = diag(δ^{rec3}_{y_Q}(t)) ∂c_Q(t)/∂W_{p3} + diag(δ^{rec3}_{y_D}(t)) ∂c_D(t)/∂W_{p3}   (25)

where

∂c_Q(t)/∂W_{p3} = diag(f_Q(t)) ∂c_Q(t−1)/∂W_{p3} + b_{i,Q}(t) c_Q(t−1)^T   (26)

For the bias values, b_3, we will have:

∂R(Q,D)/∂b_3 = diag(δ^{rec3}_{y_Q}(t)) ∂c_Q(t)/∂b_3 + diag(δ^{rec3}_{y_D}(t)) ∂c_D(t)/∂b_3   (27)

where

∂c_Q(t)/∂b_3 = diag(f_Q(t)) ∂c_Q(t−1)/∂b_3 + b_{i,Q}(t)   (28)

3) Forget Gate: For the recurrent connections we will have:

∂R(Q,D)/∂W_{rec2} = diag(δ^{rec2}_{y_Q}(t)) ∂c_Q(t)/∂W_{rec2} + diag(δ^{rec2}_{y_D}(t)) ∂c_D(t)/∂W_{rec2}   (29)

where

δ^{rec2}_{y_Q}(t) = (1 − h(c_Q(t))) ◦ (1 + h(c_Q(t))) ◦ o_Q(t) ◦ v_Q(t)
∂c_Q(t)/∂W_{rec2} = diag(f_Q(t)) ∂c_Q(t−1)/∂W_{rec2} + b_{f,Q}(t) y_Q(t−1)^T
b_{f,Q}(t) = c_Q(t−1) ◦ f_Q(t) ◦ (1 − f_Q(t))   (30)

For the input connections to the forget gate we will have:

∂R(Q,D)/∂W_2 = diag(δ^{rec2}_{y_Q}(t)) ∂c_Q(t)/∂W_2 + diag(δ^{rec2}_{y_D}(t)) ∂c_D(t)/∂W_2   (31)

where

∂c_Q(t)/∂W_2 = diag(f_Q(t)) ∂c_Q(t−1)/∂W_2 + b_{f,Q}(t) x_Q(t)^T   (32)

For the peephole connections we have:

∂R(Q,D)/∂W_{p2} = diag(δ^{rec2}_{y_Q}(t)) ∂c_Q(t)/∂W_{p2} + diag(δ^{rec2}_{y_D}(t)) ∂c_D(t)/∂W_{p2}   (33)

where

∂c_Q(t)/∂W_{p2} = diag(f_Q(t)) ∂c_Q(t−1)/∂W_{p2} + b_{f,Q}(t) c_Q(t−1)^T   (34)

For the forget gate's bias values we will have:

∂R(Q,D)/∂b_2 = diag(δ^{rec2}_{y_Q}(t)) ∂c_Q(t)/∂b_2 + diag(δ^{rec2}_{y_D}(t)) ∂c_D(t)/∂b_2   (35)

where

∂c_Q(t)/∂b_2 = diag(f_Q(t)) ∂c_Q(t−1)/∂b_2 + b_{f,Q}(t)   (36)

4) Input without Gating (y_g(t)): For the recurrent connections we will have:

∂R(Q,D)/∂W_{rec4} = diag(δ^{rec4}_{y_Q}(t)) ∂c_Q(t)/∂W_{rec4} + diag(δ^{rec4}_{y_D}(t)) ∂c_D(t)/∂W_{rec4}   (37)

where

δ^{rec4}_{y_Q}(t) = (1 − h(c_Q(t))) ◦ (1 + h(c_Q(t))) ◦ o_Q(t) ◦ v_Q(t)
∂c_Q(t)/∂W_{rec4} = diag(f_Q(t)) ∂c_Q(t−1)/∂W_{rec4} + b_{g,Q}(t) y_Q(t−1)^T
b_{g,Q}(t) = i_Q(t) ◦ (1 − y_{g,Q}(t)) ◦ (1 + y_{g,Q}(t))   (38)

For the input connections we have:

∂R(Q,D)/∂W_4 = diag(δ^{rec4}_{y_Q}(t)) ∂c_Q(t)/∂W_4 + diag(δ^{rec4}_{y_D}(t)) ∂c_D(t)/∂W_4   (39)

where

∂c_Q(t)/∂W_4 = diag(f_Q(t)) ∂c_Q(t−1)/∂W_4 + b_{g,Q}(t) x_Q(t)^T   (40)

For the bias values we will have:

∂R(Q,D)/∂b_4 = diag(δ^{rec4}_{y_Q}(t)) ∂c_Q(t)/∂b_4 + diag(δ^{rec4}_{y_D}(t)) ∂c_D(t)/∂b_4   (41)

where

∂c_Q(t)/∂b_4 = diag(f_Q(t)) ∂c_Q(t−1)/∂b_4 + b_{g,Q}(t)   (42)

5) Error signal backpropagation: Error signals are back propagated through time using the following equations:

δ^{rec1}_Q(t−1) = [o_Q(t−1) ◦ (1 − o_Q(t−1)) ◦ h(c_Q(t−1))] ◦ W_{rec1}^T δ^{rec1}_Q(t)   (43)
δ^{reci}_Q(t−1) = [(1 − h(c_Q(t−1))) ◦ (1 + h(c_Q(t−1))) ◦ o_Q(t−1)] ◦ W_{reci}^T δ^{reci}_Q(t), for i ∈ {2, 3, 4}   (44)

REFERENCES

[1] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proceedings of Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[2] Q. V. Le and T. Mikolov, "Distributed representations of sentences and documents," in Proceedings of the 31st International Conference on Machine Learning, 2014, pp. 1188–1196.
[3] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck, "Learning deep structured semantic models for web search using clickthrough data," in Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, ser. CIKM '13. ACM, 2013, pp. 2333–2338.
[4] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Proceedings of Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[5] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.



TABLE V
RNNs WITH & WITHOUT LSTM CELLS FOR THE SAME QUERY: "hotels in shanghai"

Word                                            hotels   in   shanghai
Number of assigned cells out of 10 (LSTM-RNN)      -      0      7
Number of assigned neurons out of 10 (RNN)         -      2      9

Fig. 8. Input gate, i(t), for Document: "shanghai hotels accommodation hotel in shanghai discount and reservation"

SUPPLEMENTARY MATERIAL

APPENDIX B
A MORE CLEAR FIGURE FOR INPUT GATE FOR THE "hotels in shanghai" EXAMPLE

In this section we present a more clear figure for part (a) of Fig. 5 that shows the structure of the input gate for the document side of the "hotels in shanghai" example. As is clearly visible from this figure, the input gate values for most of the cells corresponding to word 3, word 7 and word 9 on the document side of the LSTM-RNN have very small values (light green-yellow color). These correspond to the words "accommodation", "discount" and "reservation" respectively in the document title. Interestingly, the input gates are trying to reduce the effect of these three words in the final representation (y(t)), because the LSTM-RNN model is trained to maximize the similarity between query and document if they are a good match.

APPENDIX C
A CLOSER LOOK AT RNNS WITH AND WITHOUT LSTM CELLS IN THE WEB DOCUMENT RETRIEVAL TASK

In this section we further show examples to reveal the advantage of LSTM-RNN sentence embedding compared to the RNN sentence embedding.

First, we compare the scores assigned by the trained RNN and LSTM-RNN to our "hotels in shanghai" example. On average, each query in our test dataset is associated with 15 web documents (URLs). Each query / document pair has a relevance label which is human generated. These relevance labels are "Bad", "Fair", "Good" and "Excellent". This example is rated as a "Good" match in the dataset. The score for this pair assigned by RNN is "0.8165", while the score assigned by LSTM-RNN is "0.9161". Please note that the score is between 0 and 1. This means that the score assigned by LSTM-RNN corresponds better with the human generated label.

Second, we compare the number of neurons and cells assigned to each word by RNN and LSTM-RNN respectively. To do this, we rely on the 10 most active cells and neurons in the final semantic vectors of both models. Results are presented in Table V and Table VI for query and document respectively. An interesting observation is that RNN sometimes assigns neurons to unimportant words, e.g., 6 neurons are assigned to the word "in" in Table VI.

As another example we consider the query "how to fix bath tub wont turn off". This example is rated as a "Bad" match in the dataset by a human. Note that the score for this pair assigned by RNN is "0.7016", while the score assigned by LSTM-RNN is "0.5944". This shows that the score generated by LSTM-RNN is closer to the human generated label.

The number of neurons and cells assigned to each word by RNN and LSTM-RNN is presented in Table VII and Table VIII for query and document. This is out of the 10 most active neurons and cells in the semantic vectors of RNN and LSTM-RNN. Examples of RNN assigning neurons to unimportant words are 3 neurons to the word "a" and 4 neurons to the word "you" in Table VIII.

APPENDIX D
DERIVATION OF BPTT FOR RNN AND LSTM-RNN

In this appendix we present the full derivation of the gradients for RNN and LSTM-RNN.

A. Derivation of BPTT for RNN

From (4) and (5) we have:
\frac{\partial L(\Lambda)}{\partial \Lambda} = \sum_{r=1}^{N} \frac{\partial l_r(\Lambda)}{\partial \Lambda} = -\sum_{r=1}^{N} \sum_{j=1}^{n} \alpha_{r,j}\,\frac{\partial \Delta_{r,j}}{\partial \Lambda} \quad (45)
where
\alpha_{r,j} = \frac{-\gamma\, e^{-\gamma \Delta_{r,j}}}{1 + \sum_{j=1}^{n} e^{-\gamma \Delta_{r,j}}} \quad (46)
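For concreteness, the weights α_{r,j} of (46) can be computed from the n similarity gaps Δ_{r,j} of a single query as in the sketch below. This is a minimal NumPy illustration of (45)-(46), with γ the smoothing hyperparameter of the loss; it is not the authors' code.

import numpy as np

def alpha_weights(delta, gamma):
    """Eq. (46): alpha_{r,j} = -gamma * exp(-gamma * Delta_{r,j})
                               / (1 + sum_j exp(-gamma * Delta_{r,j}))."""
    e = np.exp(-gamma * np.asarray(delta))   # one value per negative document j
    return -gamma * e / (1.0 + e.sum())

# The total gradient (45) then accumulates -alpha_{r,j} * dDelta_{r,j}/dLambda
# over all clicked query/document pairs r and all negative documents j.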

TABLE VI
RNNs WITH & WITHOUT LSTM CELLS FOR THE SAME DOCUMENT: "shanghai hotels accommodation hotel in shanghai discount and reservation"

Word                                            shanghai  hotels  accommodation  hotel  in  shanghai  discount  and  reservation
Number of assigned cells out of 10 (LSTM-RNN)      -        4          3           8     1      8         5       3       4
Number of assigned neurons out of 10 (RNN)         -       10          7           9     6      8         3       2       6

TABLE VII
RNN VERSUS LSTM-RNN FOR QUERY: "how to fix bath tub wont turn off"

Word                                            how  to  fix  bath  tub  wont  turn  off
Number of assigned cells out of 10 (LSTM-RNN)    -    0    4    7     6     3     5    0
Number of assigned neurons out of 10 (RNN)       -    1   10    4     6     2     7    1

TABLE VIII
RNN VERSUS LSTM-RNN FOR DOCUMENT: "how do you paint a bathtub and what paint should . . . "

Word                                            how  do  you  paint  a  bathtub  and  what  paint  should  . . .
Number of assigned cells out of 10 (LSTM-RNN)    -    1    1    7    0     9      2     3     8      4
Number of assigned neurons out of 10 (RNN)       -    1    4    4    3     7      2     5     4      7
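Tables V-VIII count how many of the 10 most active cells (or neurons) are attributed to each word. The exact attribution rule is not restated in this appendix; a plausible reading, offered only as an assumption, is to take the 10 cells with the largest magnitude in the final semantic vector and assign each of them to the word at whose time step it is most strongly activated:

import numpy as np

def assign_top_cells(Y, top_k=10):
    """Y: array of shape (T, cell_dim); Y[t] is the cell output y(t) after word t.
    Returns, for each word position, how many of the top_k most active cells
    (measured on the final vector y(T)) peak at that word.  Illustrative only."""
    final = np.abs(Y[-1])
    top_cells = np.argsort(final)[-top_k:]                   # most active cells in y(T)
    peak_word = np.argmax(np.abs(Y[:, top_cells]), axis=0)   # word index per top cell
    return np.bincount(peak_word, minlength=Y.shape[0])      # one count per word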

and
\Delta_{r,j} = R(Q_r, D_r^+) - R(Q_r, D_{r,j}^-) \quad (47)

We need to find \frac{\partial \Delta_{r,j}}{\partial \Lambda} for input weights and recurrent weights. We omit the r subscript for simplicity.

1) Recurrent Weights:
\frac{\partial \Delta_j}{\partial W_{rec}} = \frac{\partial R(Q, D^+)}{\partial W_{rec}} - \frac{\partial R(Q, D_j^-)}{\partial W_{rec}} \quad (48)

We divide R(Q,D) into three components:
R(Q,D) = \underbrace{y_Q(t=T_Q)^T y_D(t=T_D)}_{a} \cdot \underbrace{\frac{1}{\|y_Q(t=T_Q)\|}}_{b} \cdot \underbrace{\frac{1}{\|y_D(t=T_D)\|}}_{c} \quad (49)

then
\frac{\partial R(Q,D)}{\partial W_{rec}} = \underbrace{\frac{\partial a}{\partial W_{rec}}\, b\, c}_{D} + \underbrace{a\,\frac{\partial b}{\partial W_{rec}}\, c}_{E} + \underbrace{a\, b\,\frac{\partial c}{\partial W_{rec}}}_{F} \quad (50)

We have
D = \frac{\partial\, y_Q(t=T_Q)^T y_D(t=T_D)\, b\, c}{\partial W_{rec}}
= \frac{\partial\, y_Q(t=T_Q)^T y_D(t=T_D)\, b\, c}{\partial y_Q(t=T_Q)} \cdot \frac{\partial y_Q(t=T_Q)}{\partial W_{rec}} + \frac{\partial\, y_Q(t=T_Q)^T y_D(t=T_D)\, b\, c}{\partial y_D(t=T_D)} \cdot \frac{\partial y_D(t=T_D)}{\partial W_{rec}}
= y_D(t=T_D)\, b\, c\,\frac{\partial y_Q(t=T_Q)}{\partial W_{rec}} + y_Q(t=T_Q)\,(b\, c)^T\,\frac{\partial y_D(t=T_D)}{\partial W_{rec}} \quad (51)

Since f(\cdot) = \tanh(\cdot), using the chain rule we have
\frac{\partial y_Q(t=T_Q)}{\partial W_{rec}} = [(1 - y_Q(t=T_Q)) \circ (1 + y_Q(t=T_Q))]\, y_Q(t-1)^T \quad (52)

and therefore
D = [b\, c\, y_D(t=T_D) \circ (1 - y_Q(t=T_Q)) \circ (1 + y_Q(t=T_Q))]\, y_Q(t-1)^T + [b\, c\, y_Q(t=T_Q) \circ (1 - y_D(t=T_D)) \circ (1 + y_D(t=T_D))]\, y_D(t-1)^T \quad (53)

To find E we use the following basic rule:
\frac{\partial}{\partial x}\|x - a\|_2 = \frac{x - a}{\|x - a\|_2} \quad (54)

Therefore
E = a\, c\,\frac{\partial}{\partial W_{rec}}(\|y_Q(t=T_Q)\|)^{-1} = -a\, c\,(\|y_Q(t=T_Q)\|)^{-2}\,\frac{\partial \|y_Q(t=T_Q)\|}{\partial W_{rec}}
= -a\, c\,(\|y_Q(t=T_Q)\|)^{-2}\,\frac{y_Q(t=T_Q)}{\|y_Q(t=T_Q)\|}\,\frac{\partial y_Q(t=T_Q)}{\partial W_{rec}}
= -[a\, c\, b^3\, y_Q(t=T_Q) \circ (1 - y_Q(t=T_Q)) \circ (1 + y_Q(t=T_Q))]\, y_Q(t-1)^T \quad (55)

F is calculated similarly to (55):
F = -[a\, b\, c^3\, y_D(t=T_D) \circ (1 - y_D(t=T_D)) \circ (1 + y_D(t=T_D))]\, y_D(t-1)^T \quad (56)

Considering (50), (53), (55) and (56) we have:
\frac{\partial R(Q,D)}{\partial W_{rec}} = \delta_{y_Q}(t)\, y_Q(t-1)^T + \delta_{y_D}(t)\, y_D(t-1)^T \quad (57)

where
\delta_{y_Q}(t=T_Q) = (1 - y_Q(t=T_Q)) \circ (1 + y_Q(t=T_Q)) \circ (b\, c\, y_D(t=T_D) - a\, b^3\, c\, y_Q(t=T_Q)),
\delta_{y_D}(t=T_D) = (1 - y_D(t=T_D)) \circ (1 + y_D(t=T_D)) \circ (b\, c\, y_Q(t=T_Q) - a\, b\, c^3\, y_D(t=T_D)) \quad (58)

Equation (58) only unfolds the network one time step; to unfold it over the rest of the time steps using backpropagation we have:
\delta_{y_Q}(t-\tau-1) = (1 - y_Q(t-\tau-1)) \circ (1 + y_Q(t-\tau-1)) \circ W_{rec}^T\,\delta_{y_Q}(t-\tau),
\delta_{y_D}(t-\tau-1) = (1 - y_D(t-\tau-1)) \circ (1 + y_D(t-\tau-1)) \circ W_{rec}^T\,\delta_{y_D}(t-\tau) \quad (59)

where \tau is the number of time steps that we unfold the network over time, running from 0 to T_Q and T_D for queries and documents respectively. Now using (48) we have:
\frac{\partial \Delta_{j,\tau}}{\partial W_{rec}} = [\delta_{y_Q}^{D^+}(t-\tau)\, y_Q(t-\tau-1)^T + \delta_{y_D}^{D^+}(t-\tau)\, y_{D^+}(t-\tau-1)^T] - [\delta_{y_Q}^{D_j^-}(t-\tau)\, y_Q(t-\tau-1)^T + \delta_{y_D}^{D_j^-}(t-\tau)\, y_{D_j^-}(t-\tau-1)^T] \quad (60)

To calculate the final value of the gradient we fold back the network over time and use (45); we will have:
\frac{\partial L(\Lambda)}{\partial W_{rec}} = \underbrace{-\sum_{r=1}^{N}\sum_{j=1}^{n}\sum_{\tau=0}^{T}\alpha_{r,j}\,\frac{\partial \Delta_{r,j,\tau}}{\partial W_{rec}}}_{\text{one large update}} \quad (61)

2) Input Weights: Using a similar procedure we will have the following for the input weights:
\frac{\partial R(Q,D)}{\partial W} = \delta_{y_Q}(t-\tau)\, l_Q(t-\tau)^T + \delta_{y_D}(t-\tau)\, l_D(t-\tau)^T \quad (62)

where
\delta_{y_Q}(t-\tau) = (1 - y_Q(t-\tau)) \circ (1 + y_Q(t-\tau)) \circ (b\, c\, y_D(t-\tau) - a\, b^3\, c\, y_Q(t-\tau)),
\delta_{y_D}(t-\tau) = (1 - y_D(t-\tau)) \circ (1 + y_D(t-\tau)) \circ (b\, c\, y_Q(t-\tau) - a\, b\, c^3\, y_D(t-\tau)) \quad (63)

Therefore:
\frac{\partial \Delta_{j,\tau}}{\partial W} = [\delta_{y_Q}^{D^+}(t-\tau)\, l_Q(t-\tau)^T + \delta_{y_D}^{D^+}(t-\tau)\, l_{D^+}(t-\tau)^T] - [\delta_{y_Q}^{D_j^-}(t-\tau)\, l_Q(t-\tau)^T + \delta_{y_D}^{D_j^-}(t-\tau)\, l_{D_j^-}(t-\tau)^T] \quad (64)

and therefore:
\frac{\partial L(\Lambda)}{\partial W} = \underbrace{-\sum_{r=1}^{N}\sum_{j=1}^{n}\sum_{\tau=0}^{T}\alpha_{r,j}\,\frac{\partial \Delta_{r,j,\tau}}{\partial W}}_{\text{one large update}} \quad (65)

B. Derivation of BPTT for LSTM-RNN

Following from (50), for every parameter \Lambda in the LSTM-RNN architecture we have:
\frac{\partial R(Q,D)}{\partial \Lambda} = \underbrace{\frac{\partial a}{\partial \Lambda}\, b\, c}_{D} + \underbrace{a\,\frac{\partial b}{\partial \Lambda}\, c}_{E} + \underbrace{a\, b\,\frac{\partial c}{\partial \Lambda}}_{F} \quad (66)

and from (51):
D = y_D(t=T_D)\, b\, c\,\frac{\partial y_Q(t=T_Q)}{\partial \Lambda} + y_Q(t=T_Q)\, b\, c\,\frac{\partial y_D(t=T_D)}{\partial \Lambda} \quad (67)

From (55) and (56) we have:
E = -a\, c\, b^3\, y_Q(t=T_Q)\,\frac{\partial y_Q(t=T_Q)}{\partial \Lambda} \quad (68)
F = -a\, b\, c^3\, y_D(t=T_D)\,\frac{\partial y_D(t=T_D)}{\partial \Lambda} \quad (69)

Therefore
\frac{\partial R(Q,D)}{\partial \Lambda} = D + E + F = v_Q\,\frac{\partial y_Q(t=T_Q)}{\partial \Lambda} + v_D\,\frac{\partial y_D(t=T_D)}{\partial \Lambda} \quad (70)

where
v_Q = (b\, c\, y_D(t=T_D) - a\, b^3\, c\, y_Q(t=T_Q))
v_D = (b\, c\, y_Q(t=T_Q) - a\, b\, c^3\, y_D(t=T_D)) \quad (71)
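The quantities a, b, c of (49) and the common factors v_Q, v_D of (71), which every LSTM-RNN parameter gradient reuses through (70), can be computed directly from the two final hidden vectors. A minimal NumPy sketch (not the authors' code):

import numpy as np

def cosine_factors(yQ, yD):
    """yQ, yD: final hidden vectors y_Q(t=T_Q) and y_D(t=T_D).
    Returns a, b, c of eq. (49) and v_Q, v_D of eq. (71)."""
    a = yQ @ yD                        # numerator of the cosine similarity
    b = 1.0 / np.linalg.norm(yQ)       # 1 / ||y_Q||
    c = 1.0 / np.linalg.norm(yD)       # 1 / ||y_D||
    vQ = b * c * yD - a * (b ** 3) * c * yQ    # eq. (71)
    vD = b * c * yQ - a * b * (c ** 3) * yD
    return a, b, c, vQ, vD

# R(Q, D) = a * b * c is the cosine similarity itself; by (70) the gradient of
# R(Q, D) w.r.t. any LSTM parameter is vQ . dyQ/dLambda + vD . dyD/dLambda.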

1) Output Gate: Since \alpha \circ \beta = \mathrm{diag}(\alpha)\beta = \mathrm{diag}(\beta)\alpha, where \mathrm{diag}(\alpha) is a diagonal matrix whose main diagonal entries are the entries of vector \alpha, we have:
\frac{\partial y(t)}{\partial W_{rec1}} = \frac{\partial}{\partial W_{rec1}}(\mathrm{diag}(h(c(t)))\, o(t)) = \underbrace{\frac{\partial\,\mathrm{diag}(h(c(t)))}{\partial W_{rec1}}\, o(t)}_{\text{zero}} + \mathrm{diag}(h(c(t)))\,\frac{\partial o(t)}{\partial W_{rec1}} = o(t) \circ (1 - o(t)) \circ h(c(t))\, y(t-1)^T \quad (72)

Substituting (72) in (70) we have:
\frac{\partial R(Q,D)}{\partial W_{rec1}} = \delta^{rec1}_{y_Q}(t)\, y_Q(t-1)^T + \delta^{rec1}_{y_D}(t)\, y_D(t-1)^T \quad (73)
where
\delta^{rec1}_{y_Q}(t) = o_Q(t) \circ (1 - o_Q(t)) \circ h(c_Q(t)) \circ v_Q(t)
\delta^{rec1}_{y_D}(t) = o_D(t) \circ (1 - o_D(t)) \circ h(c_D(t)) \circ v_D(t) \quad (74)

With a similar derivation for W_1 and W_{p1} we get:
\frac{\partial R(Q,D)}{\partial W_1} = \delta^{rec1}_{y_Q}(t)\, l_Q(t)^T + \delta^{rec1}_{y_D}(t)\, l_D(t)^T \quad (75)
\frac{\partial R(Q,D)}{\partial W_{p1}} = \delta^{rec1}_{y_Q}(t)\, c_Q(t)^T + \delta^{rec1}_{y_D}(t)\, c_D(t)^T \quad (76)

For the output gate bias values we have:
\frac{\partial R(Q,D)}{\partial b_1} = \delta^{rec1}_{y_Q}(t) + \delta^{rec1}_{y_D}(t) \quad (77)

2) Input Gate: Similar to the output gate we start with:
\frac{\partial y(t)}{\partial W_{rec3}} = \frac{\partial}{\partial W_{rec3}}(\mathrm{diag}(o(t))\, h(c(t))) = \underbrace{\frac{\partial\,\mathrm{diag}(o(t))}{\partial W_{rec3}}\, h(c(t))}_{\text{zero}} + \mathrm{diag}(o(t))\,\frac{\partial h(c(t))}{\partial W_{rec3}} = \mathrm{diag}(o(t))\,(1 - h(c(t))) \circ (1 + h(c(t)))\,\frac{\partial c(t)}{\partial W_{rec3}} \quad (78)

To find \frac{\partial c(t)}{\partial W_{rec3}}, assuming f(t) = 1 (we derive the formulation for f(t) \neq 1 from this simple solution), we have:
c(0) = 0
c(1) = c(0) + i(1) \circ y_g(1) = i(1) \circ y_g(1)
c(2) = c(1) + i(2) \circ y_g(2)
\ldots
c(t) = \sum_{k=1}^{t} i(k) \circ y_g(k) = \sum_{k=1}^{t} \mathrm{diag}(y_g(k))\, i(k) \quad (79)

Therefore
\frac{\partial c(t)}{\partial W_{rec3}} = \sum_{k=1}^{t}\Big[\underbrace{\frac{\partial\,\mathrm{diag}(y_g(k))}{\partial W_{rec3}}\, i(k)}_{\text{zero}} + \mathrm{diag}(y_g(k))\,\frac{\partial i(k)}{\partial W_{rec3}}\Big] \quad (80)
= \sum_{k=1}^{t} \mathrm{diag}(y_g(k))\, i(k) \circ (1 - i(k))\, y(k-1)^T \quad (81)

and
\frac{\partial y(t)}{\partial W_{rec3}} = \sum_{k=1}^{t}[\underbrace{o(t) \circ (1 - h(c(t))) \circ (1 + h(c(t)))}_{a(t)} \circ \underbrace{y_g(k) \circ i(k) \circ (1 - i(k))}_{b(k)}]\, y(k-1)^T \quad (82)

But this is expensive to implement; to resolve it we have:
\frac{\partial y(t)}{\partial W_{rec3}} = \underbrace{\sum_{k=1}^{t-1}[a(t) \circ b(k)]\, y(k-1)^T}_{\text{expensive part}} + [a(t) \circ b(t)]\, y(t-1)^T
= \mathrm{diag}(a(t))\sum_{k=1}^{t-1} b(k)\, y(k-1)^T + \mathrm{diag}(a(t))\, b(t)\, y(t-1)^T \quad (83)

Therefore
\frac{\partial y(t)}{\partial W_{rec3}} = [\mathrm{diag}(a(t))]\Big[\frac{\partial c(t-1)}{\partial W_{rec3}} + b(t)\, y(t-1)^T\Big] \quad (84)

For f(t) \neq 1 we have
\frac{\partial y(t)}{\partial W_{rec3}} = [\mathrm{diag}(a(t))]\Big[\mathrm{diag}(f(t))\,\frac{\partial c(t-1)}{\partial W_{rec3}} + b_i(t)\, y(t-1)^T\Big] \quad (85)
where
a(t) = o(t) \circ (1 - h(c(t))) \circ (1 + h(c(t)))
b_i(t) = y_g(t) \circ i(t) \circ (1 - i(t)) \quad (86)

Substituting the above equation in (70) we will have:
\frac{\partial R(Q,D)}{\partial W_{rec3}} = \mathrm{diag}(\delta^{rec3}_{y_Q}(t))\,\frac{\partial c_Q(t)}{\partial W_{rec3}} + \mathrm{diag}(\delta^{rec3}_{y_D}(t))\,\frac{\partial c_D(t)}{\partial W_{rec3}} \quad (87)
where
\delta^{rec3}_{y_Q}(t) = (1 - h(c_Q(t))) \circ (1 + h(c_Q(t))) \circ o_Q(t) \circ v_Q(t)
\frac{\partial c_Q(t)}{\partial W_{rec3}} = \mathrm{diag}(f_Q(t))\,\frac{\partial c_Q(t-1)}{\partial W_{rec3}} + b_{i,Q}(t)\, y_Q(t-1)^T
b_{i,Q}(t) = y_{g,Q}(t) \circ i_Q(t) \circ (1 - i_Q(t)) \quad (88)
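The recursion (85)-(88) keeps one running matrix \partial c(t)/\partial W_{rec3} per sequence instead of summing over all earlier time steps. A minimal NumPy sketch of that bookkeeping for the input gate, assuming the per-row derivative convention used above (each row of W_{rec3} only feeds its own cell); it is not the authors' code:

import numpy as np

def update_dc_dWrec3(dc_dW_prev, f_t, i_t, yg_t, y_prev):
    """One step of eq. (88):
    dc(t)/dW_rec3 = diag(f(t)) dc(t-1)/dW_rec3 + b_i(t) y(t-1)^T,
    with b_i(t) = y_g(t) o i(t) o (1 - i(t)).  All arrays are 1-D of cell size,
    except dc_dW_prev which is (cells, cells)."""
    b_i = yg_t * i_t * (1.0 - i_t)
    return f_t[:, None] * dc_dW_prev + np.outer(b_i, y_prev)

def grad_R_Wrec3(delta_rec3_Q, dcQ_dW, delta_rec3_D, dcD_dW):
    """Eq. (87): combine query and document contributions,
    i.e. diag(delta) times the running derivative matrices."""
    return delta_rec3_Q[:, None] * dcQ_dW + delta_rec3_D[:, None] * dcD_dW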

In equation (87), \delta^{rec3}_{y_D}(t) and \frac{\partial c_D(t)}{\partial W_{rec3}} are the same as (88) with a D subscript. Therefore, the update equations for W_{rec3} are (87), (88) for Q and D, and (6).

With a similar procedure for W_3 we will have the following:
\frac{\partial R(Q,D)}{\partial W_3} = \mathrm{diag}(\delta^{rec3}_{y_Q}(t))\,\frac{\partial c_Q(t)}{\partial W_3} + \mathrm{diag}(\delta^{rec3}_{y_D}(t))\,\frac{\partial c_D(t)}{\partial W_3} \quad (89)
where
\frac{\partial c_Q(t)}{\partial W_3} = \mathrm{diag}(f_Q(t))\,\frac{\partial c_Q(t-1)}{\partial W_3} + b_{i,Q}(t)\, x_Q(t)^T \quad (90)
Therefore, the update equations for W_3 are (89), (90) for Q and D, and (6).

For peephole connections we will have:
\frac{\partial R(Q,D)}{\partial W_{p3}} = \mathrm{diag}(\delta^{rec3}_{y_Q}(t))\,\frac{\partial c_Q(t)}{\partial W_{p3}} + \mathrm{diag}(\delta^{rec3}_{y_D}(t))\,\frac{\partial c_D(t)}{\partial W_{p3}} \quad (91)
where
\frac{\partial c_Q(t)}{\partial W_{p3}} = \mathrm{diag}(f_Q(t))\,\frac{\partial c_Q(t-1)}{\partial W_{p3}} + b_{i,Q}(t)\, c_Q(t-1)^T \quad (92)
Hence, the update equations for W_{p3} are (91), (92) for Q and D, and (6).

Following a similar derivation for the bias values b_3 we will have:
\frac{\partial R(Q,D)}{\partial b_3} = \mathrm{diag}(\delta^{rec3}_{y_Q}(t))\,\frac{\partial c_Q(t)}{\partial b_3} + \mathrm{diag}(\delta^{rec3}_{y_D}(t))\,\frac{\partial c_D(t)}{\partial b_3} \quad (93)
where
\frac{\partial c_Q(t)}{\partial b_3} = \mathrm{diag}(f_Q(t))\,\frac{\partial c_Q(t-1)}{\partial b_3} + b_{i,Q}(t) \quad (94)
The update equations for b_3 are (93), (94) for Q and D, and (6).

3) Forget Gate: For the forget gate, with a similar derivation to the input gate we will have
\frac{\partial y(t)}{\partial W_{rec2}} = [\mathrm{diag}(a(t))]\Big[\mathrm{diag}(f(t))\,\frac{\partial c(t-1)}{\partial W_{rec2}} + b_f(t)\, y(t-1)^T\Big] \quad (95)
where
a(t) = o(t) \circ (1 - h(c(t))) \circ (1 + h(c(t)))
b_f(t) = c(t-1) \circ f(t) \circ (1 - f(t)) \quad (96)

Substituting the above equation in (70) we will have:
\frac{\partial R(Q,D)}{\partial W_{rec2}} = \mathrm{diag}(\delta^{rec2}_{y_Q}(t))\,\frac{\partial c_Q(t)}{\partial W_{rec2}} + \mathrm{diag}(\delta^{rec2}_{y_D}(t))\,\frac{\partial c_D(t)}{\partial W_{rec2}} \quad (97)
where
\delta^{rec2}_{y_Q}(t) = (1 - h(c_Q(t))) \circ (1 + h(c_Q(t))) \circ o_Q(t) \circ v_Q(t)
\frac{\partial c_Q(t)}{\partial W_{rec2}} = \mathrm{diag}(f_Q(t))\,\frac{\partial c_Q(t-1)}{\partial W_{rec2}} + b_{f,Q}(t)\, y_Q(t-1)^T
b_{f,Q}(t) = c_Q(t-1) \circ f_Q(t) \circ (1 - f_Q(t)) \quad (98)
Therefore, the update equations for W_{rec2} are (97), (98) for Q and D, and (6).

For the input weights to the forget gate, W_2, we have
\frac{\partial R(Q,D)}{\partial W_2} = \mathrm{diag}(\delta^{rec2}_{y_Q}(t))\,\frac{\partial c_Q(t)}{\partial W_2} + \mathrm{diag}(\delta^{rec2}_{y_D}(t))\,\frac{\partial c_D(t)}{\partial W_2} \quad (99)
where
\frac{\partial c_Q(t)}{\partial W_2} = \mathrm{diag}(f_Q(t))\,\frac{\partial c_Q(t-1)}{\partial W_2} + b_{f,Q}(t)\, x_Q(t)^T \quad (100)
Therefore, the update equations for W_2 are (99), (100) for Q and D, and (6).

For the peephole connections, W_{p2}, we have
\frac{\partial R(Q,D)}{\partial W_{p2}} = \mathrm{diag}(\delta^{rec2}_{y_Q}(t))\,\frac{\partial c_Q(t)}{\partial W_{p2}} + \mathrm{diag}(\delta^{rec2}_{y_D}(t))\,\frac{\partial c_D(t)}{\partial W_{p2}} \quad (101)
where
\frac{\partial c_Q(t)}{\partial W_{p2}} = \mathrm{diag}(f_Q(t))\,\frac{\partial c_Q(t-1)}{\partial W_{p2}} + b_{f,Q}(t)\, c_Q(t-1)^T \quad (102)
Therefore, the update equations for W_{p2} are (101), (102) for Q and D, and (6).

The update equations for the forget gate bias values, b_2, will be the following equations and (6):
\frac{\partial R(Q,D)}{\partial b_2} = \mathrm{diag}(\delta^{rec2}_{y_Q}(t))\,\frac{\partial c_Q(t)}{\partial b_2} + \mathrm{diag}(\delta^{rec2}_{y_D}(t))\,\frac{\partial c_D(t)}{\partial b_2} \quad (103)
where
\frac{\partial c_Q(t)}{\partial b_2} = \mathrm{diag}(f_Q(t))\,\frac{\partial c_Q(t-1)}{\partial b_2} + b_{f,Q}(t) \quad (104)
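For reference while reading these update equations, the forward recurrence that the gradients above differentiate can be sketched as follows. This is a minimal illustration of a standard peephole LSTM consistent with the notation used here (logistic gates i, f, o; tanh input y_g; c(t) = f \circ c(t-1) + i \circ y_g; y(t) = o \circ h(c(t)) with h = tanh); the exact parameterization and the update rule referred to as (6) are in the main text and are assumed, not reproduced, here.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, y_prev, c_prev, W, W_rec, W_p, b):
    """One forward step in the index convention of the derivation:
    1: output gate, 2: forget gate, 3: input gate, 4: input without gating y_g.
    W, W_rec, b are dicts indexed 1..4; peepholes W_p exist for 1..3 and act
    elementwise.  x_t is the input vector at time t (written l(t) or x(t) above).
    Illustrative sketch only, not the paper's exact equation set."""
    i_t = sigmoid(W[3] @ x_t + W_rec[3] @ y_prev + W_p[3] * c_prev + b[3])
    f_t = sigmoid(W[2] @ x_t + W_rec[2] @ y_prev + W_p[2] * c_prev + b[2])
    yg_t = np.tanh(W[4] @ x_t + W_rec[4] @ y_prev + b[4])
    c_t = f_t * c_prev + i_t * yg_t
    # Output-gate peephole uses the current cell state, matching eq. (76).
    o_t = sigmoid(W[1] @ x_t + W_rec[1] @ y_prev + W_p[1] * c_t + b[1])
    y_t = o_t * np.tanh(c_t)
    return y_t, c_t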

4) Input without Gating (y_g(t)): Gradients for the y_g(t) parameters are as follows:
\frac{\partial y(t)}{\partial W_{rec4}} = [\mathrm{diag}(a(t))]\Big[\mathrm{diag}(f(t))\,\frac{\partial c(t-1)}{\partial W_{rec4}} + b_g(t)\, y(t-1)^T\Big] \quad (105)
where
a(t) = o(t) \circ (1 - h(c(t))) \circ (1 + h(c(t)))
b_g(t) = i(t) \circ (1 - y_g(t)) \circ (1 + y_g(t)) \quad (106)

Substituting the above equation in (70) we will have:
\frac{\partial R(Q,D)}{\partial W_{rec4}} = \mathrm{diag}(\delta^{rec4}_{y_Q}(t))\,\frac{\partial c_Q(t)}{\partial W_{rec4}} + \mathrm{diag}(\delta^{rec4}_{y_D}(t))\,\frac{\partial c_D(t)}{\partial W_{rec4}} \quad (107)
where
\delta^{rec4}_{y_Q}(t) = (1 - h(c_Q(t))) \circ (1 + h(c_Q(t))) \circ o_Q(t) \circ v_Q(t)
\frac{\partial c_Q(t)}{\partial W_{rec4}} = \mathrm{diag}(f_Q(t))\,\frac{\partial c_Q(t-1)}{\partial W_{rec4}} + b_{g,Q}(t)\, y_Q(t-1)^T
b_{g,Q}(t) = i_Q(t) \circ (1 - y_{g,Q}(t)) \circ (1 + y_{g,Q}(t)) \quad (108)
Therefore, the update equations for W_{rec4} are (107), (108) for Q and D, and (6).

For the input weight parameters, W_4, we have
\frac{\partial R(Q,D)}{\partial W_4} = \mathrm{diag}(\delta^{rec4}_{y_Q}(t))\,\frac{\partial c_Q(t)}{\partial W_4} + \mathrm{diag}(\delta^{rec4}_{y_D}(t))\,\frac{\partial c_D(t)}{\partial W_4} \quad (109)
where
\frac{\partial c_Q(t)}{\partial W_4} = \mathrm{diag}(f_Q(t))\,\frac{\partial c_Q(t-1)}{\partial W_4} + b_{g,Q}(t)\, x_Q(t)^T \quad (110)
Therefore, the update equations for W_4 are (109), (110) for Q and D, and (6).

Gradients with respect to the bias values, b_4, are
\frac{\partial R(Q,D)}{\partial b_4} = \mathrm{diag}(\delta^{rec4}_{y_Q}(t))\,\frac{\partial c_Q(t)}{\partial b_4} + \mathrm{diag}(\delta^{rec4}_{y_D}(t))\,\frac{\partial c_D(t)}{\partial b_4} \quad (111)
where
\frac{\partial c_Q(t)}{\partial b_4} = \mathrm{diag}(f_Q(t))\,\frac{\partial c_Q(t-1)}{\partial b_4} + b_{g,Q}(t) \quad (112)
Therefore, the update equations for b_4 are (111), (112) for Q and D, and (6). There are no peephole connections for y_g(t).

APPENDIX E
LSTM-RNN VISUALIZATION

In this appendix we present more examples of LSTM-RNN visualization.

A. LSTM-RNN Semantic Vectors: Another Example

Consider the following example from the test dataset:
• Query: "how to fix bath tub wont turn off"
• Document: "how do you paint a bathtub and what paint should you use yahoo answers" ("yahoo answers" is treated as one word)

Activations of the input gate, output gate, cell state and cell output for each cell, for the query and the document, are presented in Fig. 9 and Fig. 10 respectively, based on a trained LSTM-RNN model.

Three interesting observations from Fig. 9 and Fig. 10:
• The semantic representation y(t) and the cell states c(t) are evolving over time.
• In part (a) of Fig. 10, we observe that the input gate values for most of the cells corresponding to word 3, word 4, word 7 and word 9 on the document side of the LSTM-RNN have very small values (light blue color). These correspond to the words "you", "paint", "and" and "paint" respectively in the document title. Interestingly, the input gates are trying to reduce the effect of these words in the final representation (y(t)) because the LSTM-RNN model is trained to maximize the similarity between query and document if they are a good match.
• y(t) is used as the semantic representation after applying the output gate on the cell states. Note that valuable context information is stored in the cell states c(t).

Fig. 9. Query: "how to fix bath tub wont turn off". Panels: (a) i(t), (b) c(t), (c) o(t), (d) y(t).

Fig. 10. Document: "how do you paint a bathtub and what paint should . . . ". Panels: (a) i(t), (b) c(t), (c) o(t), (d) y(t).

B. Key Word Extraction: Another Example

The evolution of the 10 most active cells over time for the second example is presented in Fig. 11 for the query and Fig. 12 for the document. The number of assigned cells out of the 10 most active cells for each word is presented in Table IX and Table X.

TABLE IX
KEY WORD EXTRACTION FOR QUERY: "how to fix bath tub wont turn off"

Word                                                how  to  fix  bath  tub  wont  turn  off
Number of assigned cells out of 10, Left to Right    -    0    4    7     6     3     5    0
Number of assigned cells out of 10, Right to Left    4    1    6    7     6     7     7    -

TABLE X
KEY WORD EXTRACTION FOR DOCUMENT: "how do you paint a bathtub and what paint should . . . "

Word                                                how  do  you  paint  a  bathtub  and  what  paint  should  . . .
Number of assigned cells out of 10, Left to Right    -    1    1    7    0     9      2     3     8      4
Number of assigned cells out of 10, Right to Left    5    9    5    4    8     4      5     5     9      -

APPENDIX F
DOC2VEC SIMILARITY TEST

To make sure that a meaningful model is trained, we used the trained doc2vec model to find the most similar words to two sample words in our dataset, the words "pizza" and "infection". The resulting words and corresponding scores are as follows:

print(model.most-similar('pizza')):

[(u'recipes', 0.9316294193267822),
(u'recipe', 0.9295548796653748),
(u'food', 0.9250608682632446),
(u'restaurants', 0.9223555326461792),
(u'bar', 0.9191627502441406),
(u'sabayon', 0.916868269443512),
(u'page', 0.9160783290863037),
(u'restaurant', 0.9112323522567749),
(u'house', 0.9104640483856201),
(u'the', 0.9103578925132751)]

print(model.most-similar('infection')):

[(u'infections', 0.9698576927185059),
(u'treatment', 0.9143450856208801),
(u'symptoms', 0.9138627052307129),
(u'disease', 0.9100595712661743),
(u'palpitations', 0.9083651304244995),
(u'pneumonia', 0.9073051810264587),
(u'medical', 0.9043352603912354),
(u'abdomen', 0.9034136533737183),
(u'medlineplus', 0.9032401442527771),
(u'gout', 0.9027985334396362)]

As observed from the resulting words, the trained model is meaningful and can recognise semantic similarity.

APPENDIX G
DIAGRAM OF THE PROPOSED MODEL

To clarify the difference between the proposed method and the general sentence embedding methods, in this section we present a diagram illustrating the training procedure of the proposed model. It is presented in Fig. 13. In this figure n is the number of negative (unclicked) documents. The other parameters in this figure are similar to those used in Fig. 2 and Fig. 3 of the paper.
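The doc2vec sanity check in Appendix F can be reproduced with gensim. A minimal sketch, assuming a corpus of tokenized documents and a recent gensim (4.x) installation; note that the current gensim API spells the call most_similar (with an underscore) on the model's word vectors, whereas the listing above reflects the paper's rendering:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# corpus: an iterable of token lists, e.g. [["hotels", "in", "shanghai"], ...]
def train_and_probe(corpus, probe_words=("pizza", "infection")):
    tagged = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(corpus)]
    # Hyperparameters here are placeholders, not the values used in the paper.
    model = Doc2Vec(tagged, vector_size=100, window=5, min_count=2, epochs=20)
    for w in probe_words:
        if w in model.wv:
            # nearest neighbours of the word vector, as in Appendix F
            print(w, model.wv.most_similar(w, topn=10))
    return model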


Fig. 11. Query: “how to fix bath tub wont turn off ”


Fig. 12. Document: "how do you paint a bathtub and what paint should . . . "

(Fig. 13, upper part: the query Q, the clicked document D+ and the n unclicked documents D1−, . . . , Dn− are each fed to an LSTM; cosine layers produce P(D+|Q), P(D1−|Q), . . . , P(Dn−|Q), and the backpropagated error signals δQ, δD+, δD1−, . . . , δDn− flow back into the corresponding LSTMs. Lower part: one LSTM expanded into its cells and gates, unrolled over the inputs x(1), x(2), . . . with weights W_h, producing the semantic representation y(T) of the document from the cell states c(t).)

Fig. 13. Architecture of the proposed method.
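To make the training procedure sketched in Fig. 13 concrete, the forward scoring step can be written as follows: the query and each candidate document are embedded by the (shared) LSTM, cosine similarities are computed, and a softmax over the clicked and n unclicked documents gives P(D+|Q). This is a minimal NumPy sketch of that computation only; the embedding function, the scaling factor gamma and the choice of negatives are assumptions following the figure, not the authors' code.

import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def posterior_clicked(embed, query, clicked_doc, unclicked_docs, gamma=10.0):
    """P(D+ | Q) as in Fig. 13: softmax over gamma-scaled cosine similarities
    between the query embedding and the clicked / n unclicked document embeddings.
    `embed` maps a word sequence to its LSTM semantic vector y(T)."""
    yQ = embed(query)
    sims = [cosine(yQ, embed(clicked_doc))]
    sims += [cosine(yQ, embed(d)) for d in unclicked_docs]
    scores = gamma * np.array(sims)
    probs = np.exp(scores - scores.max())   # numerically stable softmax
    probs /= probs.sum()
    return probs[0]   # probability assigned to the clicked document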