A Hierarchical Contextual Attention-based GRU Network for Sequential Recommendation

Qiang Cui¹, Shu Wu¹, Yan Huang¹, Liang Wang¹,²
¹Center for Research on Intelligent Perception and Computing, National Laboratory of Pattern Recognition
²Center for Excellence in Brain Science and Intelligence Technology
Institute of Automation, Chinese Academy of Sciences

arXiv:1711.05114v3 [cs.IR] 5 Feb 2018

Abstract

Sequential recommendation is one of the fundamental tasks for Web applications. Previous methods are mostly based on Markov chains with a strong Markov assumption. Recently, recurrent neural networks (RNNs) have become more and more popular and have demonstrated their effectiveness in many tasks. The last hidden state is usually applied as the sequence's representation to make recommendations. Benefiting from the natural characteristics of RNN, the hidden state is a combination of long-term dependency and short-term interest to some degree. However, the monotonic temporal dependency of RNN impairs the user's short-term interest. Consequently, the hidden state is not sufficient to reflect the user's final interest. In this work, to deal with this problem, we propose a Hierarchical Contextual Attention-based GRU (HCA-GRU) network. The first level of HCA-GRU is conducted on the input. We construct a contextual input by using several recent inputs based on the attention mechanism. This can model the complicated correlations among recent items and strengthen the hidden state. The second level is executed on the hidden state. We fuse the current hidden state and a contextual hidden state built by the attention mechanism, which leads to a more suitable representation of the user's overall interest. Experiments on two real-world datasets show that HCA-GRU can effectively generate the personalized ranking list and achieve significant improvement.

1 Introduction

Recently, with the development of the Internet, applications involving sequential scenes have become numerous and multilateral, such as ad click prediction, purchase recommendation and web page recommendation. A user's behaviors in such an application form a sequence in chronological order, and his following behaviors can be predicted by sequential recommendation methods. Taking online shopping for instance, after a user buys an item, the application predicts a list of items that the user might purchase in the near future.

Traditional sequential methods usually focus on Markov chains (Rendle, Freudenthaler, and Schmidt-Thieme 2010; Chen, Wang, and Wang 2015; He and McAuley 2016a). However, the underlying strong Markov assumption has difficulty in constructing effective relationships among various factors. Lately, with the great success of deep learning, recurrent neural network (RNN) methods have become more and more popular. They have achieved promising performance on different tasks, for example next basket recommendation (Yu et al. 2016), location prediction (Liu et al. 2016c), product recommendation (Cui et al. 2016; Hidasi et al. 2016), and multi-behavioral prediction (Liu, Wu, and Wang 2017). These RNN methods mostly employ a common strategy: use the last hidden state of the RNN as the user's final representation and then make recommendations to the user.

In recommender systems, a better recommendation should capture both the long-term user taste and the short-term sequential effect (Rendle, Freudenthaler, and Schmidt-Thieme 2010; Wang et al. 2015); these are referred to as long-term dependency and short-term interest in this work. The hidden state of an RNN has both characteristics in nature and is very suitable for recommendation. With the help of a gated activation function such as the long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) or the gated recurrent unit (GRU) (Cho et al. 2014), an RNN can better capture the long-term dependency. This advantage allows the hidden state to connect previous information for a long time. Due to the recurrent structure and fixed transition matrices, RNN holds the assumption that temporal dependency changes monotonically with the input time steps (Liu, Wu, and Wang 2017): the current item or hidden state is taken to be more significant than the previous one. This monotonic assumption is reasonable for the whole sequence, and it also enables the hidden state to capture the user's short-term interest to some degree.
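For concreteness, the common strategy described above (a GRU run over the item sequence, with the last hidden state serving as the user representation for ranking) might look as follows. This is a minimal sketch under our own assumptions, not the authors' implementation: PyTorch is assumed, and all names (GRURecommender, item_emb, and so on) are hypothetical.

import torch
import torch.nn as nn

class GRURecommender(nn.Module):
    """Baseline strategy: the last GRU hidden state scores candidate items."""
    def __init__(self, num_items: int, d: int = 64):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, d)  # item input vectors x
        self.gru = nn.GRU(d, d, batch_first=True)   # single-layer GRU

    def forward(self, item_seq: torch.Tensor) -> torch.Tensor:
        x = self.item_emb(item_seq)                 # (batch, seq_len, d)
        _, h_last = self.gru(x)                     # h_last: (1, batch, d)
        u = h_last.squeeze(0)                       # user's final representation
        # score every item by inner product; top-k gives the ranking list
        return u @ self.item_emb.weight.t()         # (batch, num_items)

model = GRURecommender(num_items=10000)
scores = model(torch.randint(0, 10000, (2, 15)))    # two sequences of 15 items
top5 = scores.topk(5, dim=-1).indices               # 5 recommendations per user

Note that u is produced by the monotonic recurrence alone; this is exactly the representation the following paragraphs argue is insufficient.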
There is a problem, however: the monotonic assumption of RNN restricts the modeling of the user's short-term interest. Because of this problem, the aforementioned common strategy is not sufficient for better recommendation. First, short-term interest should discover correlations among items within a local range. This makes the recommendation very responsive to coming behaviors based on recent behaviors (Wang et al. 2015). Next, for short-term interest in recommender systems, we cannot say that one item or hidden state must be more significant than the previous one (Liu, Wu, and Wang 2017). The correlations are more complicated, but the monotonic assumption cannot well distinguish the importance of the several recent factors. For example, when a user buys clothes online, he might buy them for someone else; small weights should be assigned to such behaviors. As another example, consider the regular three meals a day as a sequence: the connection may be closer among breakfasts than between breakfast and lunch or dinner. Therefore, the short-term interest should be carefully examined and needs to be integrated with the long-term dependency.

In this paper, we propose a Hierarchical Contextual Attention-based GRU (HCA-GRU) network to address the above problem. It can greatly strengthen the user's short-term interest represented in each hidden state, and it makes a non-linear combination of long-term dependency and short-term interest to obtain the user's overall interest. The first level of our HCA-GRU is conducted on the input to make hidden states stronger. When we input the current item to the GRU, we also input a contextual input containing information of previous inputs. Inspired by a learning framework for word vectors in Natural Language Processing (NLP) (Le and Mikolov 2014), we first collect several recent inputs as the contextual information at each time step. Then the attention mechanism is accordingly introduced to assign appropriate weights and select important inputs; the contextual input is a weighted sum of this context.
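To make this first level concrete: at each time step the w_x most recent inputs are scored by an attention module, and the contextual input x_c is their weighted sum. The sketch below is hedged. The names w_x, r_x, Q_x, a_x and x_c follow Table 1 further down, but the particular scoring function (a tanh projection against a weight vector) and the zero left-padding are our assumptions; the paper's exact equations may differ.

import torch
import torch.nn as nn

class InputContextAttention(nn.Module):
    """First level: build contextual inputs x_c from recent inputs."""
    def __init__(self, d: int, w_x: int):
        super().__init__()
        self.w_x = w_x
        self.Q_x = nn.Linear(d, d, bias=False)      # matrix Q_x
        self.r_x = nn.Parameter(torch.randn(d))     # weight vector r_x

    def forward(self, x_seq: torch.Tensor) -> torch.Tensor:
        """x_seq: (batch, seq_len, d) -> contextual inputs x_c, same shape."""
        batch, seq_len, d = x_seq.shape
        # zero-pad on the left so every step sees exactly w_x recent inputs
        pad = x_seq.new_zeros(batch, self.w_x - 1, d)
        padded = torch.cat([pad, x_seq], dim=1)
        # sliding context windows: (batch, seq_len, w_x, d)
        ctx = padded.unfold(1, self.w_x, 1).permute(0, 1, 3, 2)
        scores = torch.tanh(self.Q_x(ctx)) @ self.r_x  # (batch, seq_len, w_x)
        a_x = torch.softmax(scores, dim=-1)            # attention weights a_x
        x_c = (a_x.unsqueeze(-1) * ctx).sum(dim=2)     # weighted sum of context
        return x_c

The contextual input x_c would then enter the GRU alongside the ordinary input x, through its own transition matrix (V in Table 1).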
The second level of our HCA-GRU is executed on the hidden state to make a fusion of recent hidden states. Each time, the user's overall interest is a combination of the current hidden state and a contextual hidden state. Encouraged by the global context memory initialized by all hidden states with average pooling in a Computer Vision (CV) work (Liu et al. 2017), we use several recent hidden states to construct local contextual information, which is used to establish the current contextual hidden state via attention. Finally, the parameters of HCA-GRU are learned under the Bayesian Personalized Ranking (BPR) framework (Rendle et al. 2009) with the Back Propagation Through Time (BPTT) algorithm (Werbos 1990); a hedged sketch of this second level and of the BPR objective appears below. The main contributions are listed as follows:

• We propose a novel hierarchical GRU network for sequential recommendation which can combine the long-term dependency and the user's short-term interest.

• We propose to employ contextual attention-based modeling to deal with the monotonic assumption of RNN methods. This can automatically focus on critical information, which greatly strengthens the user's short-term interest.

• Experiments on two large real-world datasets reveal that the HCA-GRU network is very effective and outperforms the state-of-the-art methods.

Table 1: Notation

Notation      Explanation
U, I, I^u     set of users, set of items, sequence of items of user u
p, q          positive item, negative item
x̂^t_upq      preference of user u towards items p and q at the t-th time step
d             dimensionality of the input vector
x, h          input and hidden state of GRU
x_c, h_c      contextual input, contextual hidden state
w_x, w_h      window widths
U, W, b       transition matrices and bias of GRU
V             transition matrix for x_c
E, F          embedding matrices for h and h_c
h_o           overall interest combining h and h_c
r_x, Q_x      weight vector and matrix for the input attention
r_h, Q_h      weight vector and matrix for the hidden-state attention
a_x, a_h      attention weights

2 Related Work

… (Liu et al. 2016b). Incorporated with geographical distance and time interval information, RNN gains the state-of-the-art performance for location prediction (Liu et al. 2016c). The RLBL model combines RNN and Log-BiLinear (Mnih and Hinton 2007) to make multi-behavioral prediction (Liu, Wu, and Wang 2017). Furthermore, extensions of RNN such as the long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) and the gated recurrent unit (GRU) (Cho et al. 2014) have been greatly developed, as they can better hold the long-term dependency (Bengio, Simard, and Frasconi 1994). Here, our proposed network relies on GRU.

Contextual information has been proven to be very important in different tasks. When learning word vectors in NLP, the prediction of the next word is based on the sum or concatenation of recent input words (Le and Mikolov 2014). For 3D action recognition, the GCA-LSTM network applies all the hidden states of ST-LSTM (Liu et al. 2016a) to initialize a global context memory by average pooling (Liu et al. 2017). In scene labeling, patches are widely used; the Episodic CAMN model makes a contextualized representation for every referenced patch by using the surrounding patches (Abdulnabi et al. 2017). As the winner of the ImageNet object detection challenge of 2016, GBD-Net employs different resolutions and contextual regions to help identify objects (Zeng et al. 2016). Our contextual modeling is mainly focused on the input and hidden state of GRU, and we do […]
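Pulling the pieces together, the second level and the BPR learning referred to in the introduction might be sketched as follows. This is again a hedged illustration under our own assumptions, not the authors' code: the names w_h, r_h, Q_h, E, F and h_o follow Table 1, while the tanh fusion and the pairwise preference score reflect the general BPR formulation (Rendle et al. 2009) rather than the paper's verbatim equations; BPTT is left to autograd.

import torch
import torch.nn as nn

class HiddenContextFusion(nn.Module):
    """Second level: fuse the current hidden state h with a contextual h_c."""
    def __init__(self, d: int, w_h: int):
        super().__init__()
        self.w_h = w_h
        self.Q_h = nn.Linear(d, d, bias=False)   # matrix Q_h
        self.r_h = nn.Parameter(torch.randn(d))  # weight vector r_h
        self.E = nn.Linear(d, d, bias=False)     # embedding matrix E for h
        self.F = nn.Linear(d, d, bias=False)     # embedding matrix F for h_c

    def forward(self, h_seq: torch.Tensor) -> torch.Tensor:
        """h_seq: (batch, seq_len, d) -> overall interest h_o, same shape."""
        batch, seq_len, d = h_seq.shape
        pad = h_seq.new_zeros(batch, self.w_h - 1, d)
        # windows of the w_h most recent hidden states: (batch, seq_len, w_h, d)
        ctx = torch.cat([pad, h_seq], dim=1).unfold(1, self.w_h, 1).permute(0, 1, 3, 2)
        a_h = torch.softmax(torch.tanh(self.Q_h(ctx)) @ self.r_h, dim=-1)
        h_c = (a_h.unsqueeze(-1) * ctx).sum(dim=2)      # contextual hidden state
        return torch.tanh(self.E(h_seq) + self.F(h_c))  # overall interest h_o

def bpr_loss(h_o: torch.Tensor, pos: torch.Tensor, neg: torch.Tensor) -> torch.Tensor:
    """BPR: preference x_hat = h_o dot (x_p - x_q); loss = -log(sigmoid(x_hat))."""
    x_hat = (h_o * (pos - neg)).sum(dim=-1)         # x̂^t_upq in Table 1
    return -nn.functional.logsigmoid(x_hat).mean()  # backprop (BPTT) via autograd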