
arXiv:1902.05570v4 [cs.IR] 11 Jul 2019

Reinforcement Learning to Optimize Long-term User Engagement in Recommender Systems

Lixin Zou* (Tsinghua University), Long Xia (Data Science Lab, JD.com), Zhuoye Ding (Data Science Lab, JD.com), Jiaxing Song (Tsinghua University), Weidong Liu (Tsinghua University), Dawei Yin (Data Science Lab, JD.com)

* Work performed during an internship at JD.com.

ABSTRACT

Recommender systems play a crucial role in our daily lives. The feed streaming mechanism has been widely used in recommender systems, especially in mobile Apps. The feed streaming setting provides users an interactive manner of recommendation in never-ending feeds. In such a manner, a good recommender system should pay more attention to user stickiness, which is far beyond classical instant metrics and is typically measured by long-term user engagement. Directly optimizing long-term user engagement is a non-trivial problem, as the learning target is usually not available to conventional supervised learning methods. Though reinforcement learning (RL) naturally fits the problem of maximizing long-term rewards, applying RL to optimize long-term user engagement still faces challenges: user behaviors are versatile and difficult to model, since they typically consist of both instant feedback (e.g., clicks) and delayed feedback (e.g., dwell time, revisit); in addition, performing effective off-policy learning is still immature, especially when combining bootstrapping and function approximation.

To address these issues, in this work we introduce an RL framework, FeedRec, to optimize long-term user engagement. FeedRec includes two components: 1) a Q-Network designed with a hierarchical LSTM, which takes charge of modeling complex user behaviors, and 2) an S-Network, which simulates the environment, assists the Q-Network, and avoids the instability of convergence in policy learning. Extensive experiments on synthetic data and a real-world large-scale dataset show that FeedRec effectively optimizes long-term user engagement and outperforms the state of the art.

CCS CONCEPTS
• Information systems → Recommender systems; Personalization; • Theory of computation → Sequential decision making.

KEYWORDS
Reinforcement learning; Long-term user engagement; Recommender system

ACM Reference Format:
Lixin Zou, Long Xia, Zhuoye Ding, Jiaxing Song, Weidong Liu, and Dawei Yin. 2019. Reinforcement Learning to Optimize Long-term User Engagement in Recommender Systems. In The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '19), August 4-8, 2019, Anchorage, AK, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3292500.3330668
1 INTRODUCTION

Recommender systems assist users in information-seeking tasks by suggesting goods (e.g., products, news, services) that best match users' needs and preferences. In recent feed streaming scenarios, users are able to constantly browse items generated by never-ending feeds, such as the news streams in Yahoo News (https://ca.news.yahoo.com/), the social streams in Facebook (https://www.facebook.com/), and the product streams in Amazon (https://www.amazon.com/). Specifically, when interacting with the product streams, the user can click on items and view their details. Meanwhile, (s)he can also skip unattractive items and scroll down, and may even leave the recommender system due to the appearance of many redundant or uninteresting items. Under such circumstances, optimizing clicks is no longer the only golden rule. It is critical to maximize users' satisfaction with their interactions with the streams, which falls into two folds: instant engagement, e.g., click and purchase; and long-term engagement, say stickiness, typically representing users' desire to stay with the streams longer and to open the streams repeatedly [11].

However, most traditional recommender systems only focus on optimizing instant metrics (e.g., click-through rate [12], conversion rate [19]). Moving more deeply into interaction, a good streaming recommender system should not only bring about a higher click-through rate but also keep users actively interacting with the system, which is typically measured by long-term/delayed metrics. Delayed metrics are usually more complicated, including dwell time on the Apps, the depth of page-viewing, the interval time between two visits, and so on. Unfortunately, due to the difficulty of modeling delayed metrics, directly optimizing the delayed metrics is very challenging. While only a few preliminary works [28] start investigating the optimization of some long-term/delayed metrics, a systematic solution to optimize the overall engagement metrics is wanted.

Intuitively, reinforcement learning (RL), which was born to maximize long-term rewards, could be a unified framework to optimize the instant and long-term user engagement.
Applying RL to optimize long-term user engagement is itself a non-trivial problem. As mentioned, long-term user engagement is very complicated (i.e., measured by versatile behaviors, e.g., dwell time and revisit), and it would require a very large number of environment interactions to model such long-term behaviors and to build a recommendation agent effectively. As a result, building a recommender agent from scratch through real online systems would be prohibitively expensive, since numerous interactions with an immature recommendation agent would harm user experiences, or even annoy the users. An alternative is to build a recommender agent offline by making use of the logged data, where off-policy learning methods can mitigate the cost of trial-and-error search. Unfortunately, current methods, including Monte Carlo (MC) and temporal-difference (TD) methods, have limitations for offline policy learning in realistic recommender systems: MC-based methods suffer from the problem of high variance, especially when facing enormous action spaces (e.g., billions of candidate items) in real-world applications; TD-based methods improve efficiency by using bootstrapping techniques in estimation, which, however, is confronted with another notorious problem called the Deadly Triad (i.e., the problem of instability and divergence that arises whenever function approximation, bootstrapping, and offline training are combined [24]). Unfortunately, state-of-the-art methods [33, 34] in recommender systems, which are designed with neural architectures, will inevitably encounter the Deadly Triad problem in offline policy learning.

To overcome the aforementioned issues of complex behaviors and offline policy learning, we here propose an RL-based framework, named FeedRec, to improve long-term user engagement in recommender systems. Specifically, we formalize the feed streaming recommendation as a Markov decision process (MDP) and design a Q-Network to directly optimize the metrics of user engagement.
To avoid the problem of instability of convergence in offline Q-Learning, we further introduce an S-Network, which simulates the environment, to assist the policy learning. In the Q-Network, to capture the information of versatile long-term user behaviors, a fine-grained user behavior chain is modeled by an LSTM, which consists of all raw behaviors, e.g., click, skip, browse, order, dwell, revisit, etc. When modeling such fine-grained user behaviors, two problems emerge: the numbers of specific user actions are extremely imbalanced (i.e., clicks are much fewer than skips) [37], and long-term user behavior is more complicated to represent. We hence further integrate a hierarchical LSTM with a temporal cell into the Q-Network to characterize fine-grained user behaviors.

On the other hand, in order to make effective use of the historical logged data and avoid the Deadly Triad problem in offline Q-Learning, we introduce an environment model, called the S-Network, to simulate the environment and generate simulated user experiences, assisting offline policy learning. We conduct extensive experiments on both a synthetic dataset and a real-world e-commerce dataset. The experimental results show the effectiveness of the proposed algorithm over the state-of-the-art baselines for optimizing long-term user engagement.

Contributions can be summarized as follows:
(1) We propose a reinforcement learning model, FeedRec, to directly optimize user engagement (both instant and long-term user engagement) in feed streaming recommendation.
(2) To model versatile user behaviors, which typically include both instant engagement (e.g., click and order) and long-term engagement (e.g., dwell time, revisit, etc.), a Q-Network with a hierarchical LSTM architecture is presented.
(3) To ensure convergence in off-policy learning, an effective and safe training framework is designed.
(4) The experimental results show that our proposed algorithms outperform the state-of-the-art baselines.

2 RELATED WORK

2.1 Traditional recommender system

Most existing recommender systems try to balance the instant metrics and related factors, i.e., the diversity and the novelty of recommendations. From the perspective of instant metrics, there are numerous works focusing on improving the users' implicit feedback clicks [10, 12, 27], explicit ratings [3, 17, 21], and dwell time on recommended items [30]. In fact, the instant metrics have been criticized as insufficient to measure and represent the real engagement of users. As a supplement, methods intended to enhance the user's satisfaction by recommending diverse items have been proposed [1, 2, 4]. However, all of these works cannot model the iterative interactions with users. Furthermore, none of these works can directly optimize the delayed metrics of long-term user engagement.

2.2 Reinforcement learning based recommender system

Contextual bandit solutions have been proposed to model the interaction with users and handle the notorious explore/exploit dilemma in online recommendation [8, 13, 20, 26, 31]. On one hand, these contextual bandit settings assume that the user's interests remain the same or drift smoothly, which cannot hold under the feed streaming mechanism. On the other hand, although Wu et al. [28] proposed to optimize the delayed revisiting time, there is no systematic solution to optimizing delayed metrics for user engagement. Apart from contextual bandits, a series of MDP-based models [5, 14, 15, 23, 32, 35, 39] have been proposed for the recommendation task. Dulac-Arnold et al. [5] proposed a modified DDPG model to deal with the problem of large discrete action spaces. Recently, Zhao et al. combined page-wise and pairwise ranking technologies with reinforcement learning [33, 34]. Since only the instant metrics are considered, the above methods fail to optimize the delayed metrics of user engagement. In this paper, we propose a systematic MDP-based solution to track users' interest shifts and directly optimize both instant metrics and delayed metrics of user engagement.
3 PROBLEM FORMULATION

3.1 Feed Streaming Recommendation

In the feed streaming recommendation, the recommender system interacts with a user u ∈ U at discrete time steps. At each time step t, the agent feeds an item i_t and receives a feedback f_t from the user, where i_t ∈ I is from the recommendable item set and f_t ∈ F is the user's feedback/behavior on i_t, including clicking, purchasing, skipping, leaving, etc. The interaction process forms a sequence X_t = {u, (i_1, f_1, d_1), ..., (i_t, f_t, d_t)}, with d_t the dwell time on the recommendation, which indicates the user's preferences on the recommendation. Given X_t, the agent needs to generate the item i_{t+1} for the next time step with the goal of maximizing long-term user engagement, e.g., the total clicks or the browsing depth. In this work, we focus on how to improve the expected quality of all items in the feed streaming scenario.

3.2 MDP Formulation of Feed Streams

An MDP is defined by M = <S, A, P, R, γ>, where S is the state space, A is the action space, P : S × A × S → R is the transition function, R : S × A → R is the mean reward function with r(s, a) being the immediate goodness of (s, a), and γ ∈ [0, 1] is the discount factor. A (stationary) policy π : S × A → [0, 1] assigns each state s ∈ S a distribution over actions, where a ∈ A has probability π(a|s). In feed streaming recommendation, <S, A, P> are set as follows (a minimal sketch of this state and action bookkeeping is given after the list):
• State. S is a set of states. We design the state at time step t as the browsing sequence s_t = X_{t-1}. At the beginning, s_1 = {u} just contains the user's information. At time step t, s_t = s_{t-1} ⊕ {(i_{t-1}, f_{t-1}, d_{t-1})} is updated by concatenating the old state s_{t-1} with the tuple of recommended item, feedback, and dwell time (i_{t-1}, f_{t-1}, d_{t-1}).
• Action. A is a finite set of actions. The actions available depend on the state s, denoted as A(s). A(s_1) is initialized with all recalled items. A(s_t) is updated by removing already recommended items from A(s_{t-1}), and the action a_t is the recommended item i_t.
• Transition. P is the transition function, with p(s_{t+1}|s_t, i_t) being the probability of seeing state s_{t+1} after taking action i_t at s_t. In our case, the uncertainty comes from the user's feedback f_t w.r.t. i_t and s_t.
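To make the state and action bookkeeping above concrete, the following minimal Python sketch shows how s_t grows by appending (i_{t-1}, f_{t-1}, d_{t-1}) and how the candidate set A(s_t) shrinks as items are recommended. This is not the authors' implementation; the class name, the tuple layout, and the toy item ids are illustrative assumptions.

```python
class FeedSession:
    """Illustrative bookkeeping for the MDP state s_t and action set A(s_t)."""

    def __init__(self, user, recalled_items):
        self.state = [("user", user)]          # s_1 = {u}: only user information
        self.candidates = set(recalled_items)  # A(s_1): all recalled items

    def step(self, item, feedback, dwell_time):
        """Apply one interaction: s_t = s_{t-1} + {(i, f, d)}, remove i from A."""
        assert item in self.candidates, "an item can only be recommended once"
        self.state.append((item, feedback, dwell_time))
        self.candidates.discard(item)          # A(s_t) = A(s_{t-1}) \ {i_{t-1}}
        return self.state


# Usage: one round of interaction with hypothetical item ids.
session = FeedSession(user="u42", recalled_items=["i1", "i2", "i3"])
session.step("i2", feedback="click", dwell_time=12.5)
print(len(session.candidates))  # 2 candidate actions remain
```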
3.3 User Engagement and Reward Function

As aforementioned, unlike in traditional recommendation, instant metrics (click, purchase, etc.) are not the only measurements of user engagement/satisfaction; long-term engagement is even more important, and it is often measured by delayed metrics, e.g., browsing depth, user revisits, and dwell time on the system. Reinforcement learning provides a way to directly optimize both instant and delayed metrics through the design of reward functions.

The reward function R : S × A → R can be designed in different forms. We here instantiate it linearly by assuming that the user engagement reward r_t(m_t) at each step t is a weighted sum of different metrics:

r_t = ω^⊤ m_t,   (1)

where m_t is a column vector consisting of different metrics and ω is the weight vector. Next, we give some instantiations of the reward function w.r.t. both instant metrics and delayed metrics.

Instant metrics. For instant user engagement, we can have clicks, purchases (in e-commerce), etc. The shared characteristic of instant metrics is that they are triggered instantly by the current action. We here take the click as an example: the number of clicks in the t-th feedback is defined as the click metric, m^c_t = #clicks(f_t).

Delayed metrics. The delayed metrics include browsing depth, dwell time on the system, user revisit, etc. Such metrics are usually adopted for measuring long-term user engagement. The delayed metrics are triggered by previous behaviors, some of which even hold long-term dependencies. We here provide two example reward functions for delayed metrics:

Depth metric. The depth of browsing is a special indicator by which the feed streaming scenario differs from other types of recommendation, due to the infinite scroll mechanism. After viewing the t-th feed, the system should reward this feed if the user remains in the system and scrolls down. Intuitively, the depth metric m^d_t can be defined as m^d_t = #scans(f_t), where #scans(f_t) is the number of scans in the t-th feedback.

Return time metric. The user will use the system more often when (s)he is satisfied with the recommended items. Thus, the interval time between two visits can reflect the user's satisfaction with the system. The return time metric m^r_t can be designed as the reciprocal of this time: m^r_t = β / v^r, where v^r represents the time between two visits and β is a hyper-parameter.

From the above examples (click metric, depth metric, and return time metric), we can see that m_t = [m^c_t, m^d_t, m^r_t]^⊤. Note that in the MDP setting, cumulative rewards are maximized; that is, we are actually optimizing the total browsing depth and the frequency of future visits, which typically represent long-term user engagement.
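As a worked example of Equation (1), the sketch below composes the per-step metric vector m_t = [m^c_t, m^d_t, m^r_t]^⊤ from a feedback record and takes the weighted sum with ω. The feedback field names and the default weight values (borrowed from the experimental setting later in the paper) are illustrative assumptions rather than a fixed API.

```python
import numpy as np

def engagement_reward(feedback, omega=np.array([1.0, 0.005, 0.005]), beta=1.0):
    """r_t = omega^T m_t with m_t = [clicks, scans, beta / return_time]."""
    m_c = feedback["n_clicks"]                                  # instant metric: clicks in f_t
    m_d = feedback["n_scans"]                                   # delayed metric: browsing depth
    m_r = beta / max(feedback["return_time_days"], 1e-3)        # delayed metric: reciprocal return time
    m_t = np.array([m_c, m_d, m_r])
    return float(omega @ m_t)

# Example: one click, five scans, user returned after 2 days.
print(engagement_reward({"n_clicks": 1, "n_scans": 5, "return_time_days": 2.0}))
```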

Figure 1: The architecture of Q-Network.

Figure 2: The architecture of S-Network.

4 POLICY LEARNING FOR RECOMMENDER SYSTEMS

To estimate the future reward (i.e., the future user stickiness), the expected long-term user engagement for recommendation i_t is presented with the Q-value as

Q^π(s_t, i_t) = E_{i_k ∼ π} [ r_t + Σ_{k=1}^{T-t} γ^k r_{t+k} ],   (2)

where the first term is the current reward, the second term is the discounted future reward, and γ is the discount factor that balances the importance of current and future rewards. The optimal Q*(s_t, i_t), having the maximum expected reward achievable by the optimal policy, should follow the optimal Bellman equation [24]:

Q*(s_t, i_t) = E_{s_{t+1}} [ r_t + γ max_{i′} Q*(s_{t+1}, i′) | s_t, i_t ].   (3)

Given Q*, the recommendation i_t is chosen with the maximum Q*(s_t, i_t). Nevertheless, in real-world recommender systems, with enormous numbers of users and items, estimating the action-value function Q*(s_t, i_t) for each state-action pair is infeasible. Hence, it is more flexible and practical to use function approximation, e.g., neural networks, to estimate the action-value function, i.e., Q*(s_t, i_t) ≈ Q(s_t, i_t; θ_q). In practice, neural networks are excellent at tracking the user's interests in recommendation [10, 12, 36]. In this paper, we refer to a neural network function approximator with parameters θ_q as a Q-Network. The Q-Network can be trained by minimizing the mean-squared loss function, defined as follows:

ℓ(θ_q) = E_{(s_t, i_t, r_t, s_{t+1}) ∼ M} [ (y_t − Q(s_t, i_t; θ_q))^2 ],  with  y_t = r_t + γ max_{i_{t+1} ∈ I} Q(s_{t+1}, i_{t+1}; θ_q),   (4)

where M = {(s_t, i_t, r_t, s_{t+1})} is a large replay buffer storing the past feeds, from which samples are taken in mini-batch training. By differentiating the loss function with respect to θ_q, we arrive at the following gradient:

∇_{θ_q} ℓ(θ_q) = E_{(s_t, i_t, r_t, s_{t+1}) ∼ M} [ (r_t + γ max_{i_{t+1}} Q(s_{t+1}, i_{t+1}; θ_q) − Q(s_t, i_t; θ_q)) ∇_{θ_q} Q(s_t, i_t; θ_q) ].   (5)

In practice, it is often computationally efficient to optimize the loss function by stochastic gradient descent, rather than computing the full expectations in the above gradient.
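The sketch below spells out one mini-batch update for the loss in Equation (4) and its gradient in Equation (5). It is a minimal sketch that uses a linear approximator standing in for the neural Q-Network so the gradient has the closed form (y_t − Q) ∇_θ Q = (y_t − Q) φ(s_t, i_t); the feature layout of the toy replay batch is an illustrative assumption.

```python
import numpy as np

def q_value(theta, phi):
    """Linear approximator Q(s_t, i_t; theta) = theta^T phi(s_t, i_t)."""
    return phi @ theta

def q_learning_update(theta, batch, gamma=0.9, lr=0.005):
    """One SGD step on Equation (4); the gradient follows Equation (5)."""
    grad = np.zeros_like(theta)
    for phi_sa, r, next_phis in batch:      # next_phis: features of (s_{t+1}, i') for all candidates i'
        y = r + gamma * max(q_value(theta, p) for p in next_phis)   # bootstrapped target y_t
        td_error = y - q_value(theta, phi_sa)
        grad += -td_error * phi_sa          # gradient of (y_t - Q)^2 / 2 w.r.t. theta
    theta -= lr * grad / len(batch)
    return theta

# Toy replay batch with 3-dimensional state-action features (illustrative only).
rng = np.random.default_rng(0)
batch = [(rng.normal(size=3), 1.0, [rng.normal(size=3) for _ in range(4)]) for _ in range(8)]
theta = q_learning_update(np.zeros(3), batch)
print(theta)
```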
4.1 The Q-Network

The design of the Q-Network is critical to the performance. In long-term user engagement optimization, the user's interactive behaviors are versatile (e.g., not only clicks but also dwell time, revisits, skips, etc.), which makes modeling non-trivial. To effectively optimize such engagement, we have to first harvest the information in these behaviors into the Q-Network.

4.1.1 Raw Behavior Embedding Layer. The purpose of this layer is to take all raw behavior information related to long-term engagement and distill the user's state for further optimization. Given the observation s_t = {u, (i_1, f_1, d_1), ..., (i_{t-1}, f_{t-1}, d_{t-1})}, we let f_t denote any of the possible types of user behavior on i_t, including clicking, purchasing, skipping, leaving, etc., and d_t the dwell time of the behavior. The entire set of items {i_t} is first converted into embedding vectors {i_t}. To represent the feedback information in the item embedding, we project {i_t} into a feedback-dependent space by multiplying the embedding with a projection matrix as follows:

i′_t = F_{f_t} i_t,

where F_{f_t} ∈ R^{H×H} is a projection matrix for a specific feedback f_t. To further model time information, in our work a Time-LSTM [38] is used to track the user state over time as

h_{r,t} = Time-LSTM(i′_t, d_t),   (6)

where Time-LSTM models the dwell time by inducing a time gate controlled by d_t as follows:

g_t = σ(i′_t W_{ig} + σ(d_t W_{gg}) + b_g),
c_t = p_t ⊙ c_{t-1} + e_t ⊙ g_t ⊙ σ(i′_t W_{ic} + h_{t-1} W_{hc} + b_c),
o_t = σ(i′_t W_{io} + d_t W_{do} + h_{t-1} W_{ho} + w_{co} ⊙ c_t + b_o),

where c_t is the memory cell, g_t is the time-dependent gate influencing the memory cell and the output gate, p_t is the forget gate, e_t is the input gate, o_t is the output gate, W_* and b_* are the weight and bias terms, ⊙ is the element-wise product, and σ is the sigmoid function. Given c_t and o_t, the hidden state h_{r,t} is modeled as h_{r,t} = o_t ⊙ σ(c_t).

4.1.2 Hierarchical Behavior Layer. To capture the information of versatile user behaviors, all raw behaviors are sequentially fed into the raw Behavior Embedding Layer indiscriminately. In reality, the numbers of specific user actions are extremely imbalanced (e.g., clicks are far fewer than skips). As a result, directly utilizing the output of the raw Behavior Embedding Layer will cause the Q-Network to lose the information from sparse user behaviors; e.g., purchase information will be buried by skip information. Moreover, each type of user behavior has its own characteristics: a click on an item usually represents the user's current preferences, a purchase may imply a shift of user interest, and the causality of skipping is a little more complex, as it could reflect casual browsing, a neutral attitude, or annoyance.

To better represent the user state, as shown in Figure 1, we propose a hierarchical behavior layer added on top of the raw behavior embedding layer, in which the major user behaviors, such as click, skip, and purchase, are tracked separately with different LSTM pipelines as

h_{k,t} = LSTM-k(h_{r,t})  if f_t is the k-th behavior,

where each type of user behavior (e.g., the k-th behavior) is captured by the corresponding LSTM layer to avoid intensive behavior dominance and to capture behavior-specific characteristics. Finally, the state embedding is formed by concatenating the different user behavior layers and the user profile as

s_t = concat[h_{r,t}, h_{1,t}, ..., h_{K,t}, u],

where u is the embedding vector of the specific user.

4.1.3 Q-value Layer. The approximation of the Q-value is accomplished by an MLP whose input is the dense state embedding and the item embedding:

Q(s_t, i_t; θ_q) = MLP(s_t, i_t).

The value of θ_q is updated by SGD with the gradient calculated as in Equation (5).
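To illustrate how the raw behavior layer, the behavior-specific pipelines, and the Q-value layer fit together, here is a schematic forward pass in NumPy. It deliberately replaces the Time-LSTM and the LSTM-k pipelines with plain single-gate recurrent updates (so the gate equations above are simplified away), and the layer sizes, behavior vocabulary, and single-layer "MLP" are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np

H = 8                                    # embedding / hidden size (illustrative)
BEHAVIORS = ["click", "skip", "purchase"]
rng = np.random.default_rng(1)

# Feedback-dependent projections F_f and stand-in recurrent / output weights.
F = {f: rng.normal(scale=0.1, size=(H, H)) for f in BEHAVIORS}
W_time = rng.normal(scale=0.1, size=(2 * H + 1, H))                     # raw layer, dwell time as extra input
W_beh = {f: rng.normal(scale=0.1, size=(2 * H, H)) for f in BEHAVIORS}  # one pipeline per behavior
W_q = rng.normal(scale=0.1, size=(len(BEHAVIORS) + 3) * H)              # Q-value head on [s_t, i_t]

def q_network(user_emb, interactions, item_emb):
    """interactions: list of (item_embedding, feedback, dwell_time)."""
    h_raw = np.zeros(H)                              # h_{r,t}: raw behavior chain
    h_beh = {f: np.zeros(H) for f in BEHAVIORS}      # h_{k,t}: behavior-specific pipelines
    for emb, feedback, dwell in interactions:
        proj = F[feedback] @ emb                     # i'_t = F_{f_t} i_t
        x = np.concatenate([proj, h_raw, [dwell]])   # dwell time d_t enters the raw cell
        h_raw = np.tanh(x @ W_time)                  # stand-in for Time-LSTM(i'_t, d_t)
        x_k = np.concatenate([h_raw, h_beh[feedback]])
        h_beh[feedback] = np.tanh(x_k @ W_beh[feedback])   # h_{k,t} = LSTM-k(h_{r,t})
    state = np.concatenate([h_raw] + [h_beh[f] for f in BEHAVIORS] + [user_emb])
    return float(np.concatenate([state, item_emb]) @ W_q)  # Q(s_t, i_t) = MLP(s_t, i_t)

# Toy call with random embeddings.
hist = [(rng.normal(size=H), "click", 3.0), (rng.normal(size=H), "skip", 0.5)]
print(q_network(rng.normal(size=H), hist, item_emb=rng.normal(size=H)))
```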
4.2 Off-Policy Learning Task

With the proposed Q-Learning based framework, we can train the parameters of the model through trial-and-error search before learning a stable recommendation policy. However, due to the cost and risk of deploying unsatisfactory policies, it is nearly impossible to train the policy online. An alternative is to train a reasonable policy using the logged data D, collected by a logging policy π_b, before deployment. Unfortunately, the Q-Learning framework in Equation (4) suffers from the Deadly Triad problem [24]: instability and divergence arise whenever function approximation, bootstrapping, and offline training are combined.

To avoid the problem of instability and divergence in offline Q-Learning, we further introduce a user simulator (referred to as the S-Network), which simulates the environment and assists the policy learning. Specifically, in each round of recommendation, aligning with real user feedback, the S-Network needs to generate the user's response f_t, the dwell time d_t, the revisit time v^r, and a binary variable l_t, which indicates whether the user leaves the platform. As shown in Figure 2, the generation of simulated user feedback is accomplished with the S-Network S(θ_s), which is a multi-head neural network. The state-action embedding is designed with the same architecture as in the Q-Network, but with separate parameters. The layer (s_t, i_t) is shared across all tasks, while the other layers (above (s_t, i_t) in Figure 2) are task-specific. As dwell time and the user's feedback are inner-session behaviors, the predictions of f̂_t and d̂_t are calculated as follows:

f̂_t = Softmax(W_f x_f + b_f),
d̂_t = W_d x_f + b_d,
x_f = tanh(W_{xf} [s_t, i_t] + b_{xf}),

where W_* and b_* are the weight and bias terms and [s_t, i_t] is the concatenation of the state and action features. The generation of the revisit time and of leaving the platform (inter-session behaviors) is accomplished as

l̂_t = Sigmoid(x_l^⊤ w_l + b_l),
v̂^r = W_v x_l + b_v,
x_l = tanh(W_{xl} [s_t, i_t] + b_{xl}).

4.3 Simulator Learning

In this process, S(s_t, i_t; θ_s) is refined via mini-batch SGD using the logged data in D. As the logged data is collected via a logging policy π_b, directly using such logged data to build the simulator would cause selection bias. To debias the effect of the logging policy π_b [22], an importance-weighted loss is minimized as follows:

ℓ(θ_s) = (1/n) Σ_{k=1}^{n} Σ_{t=0}^{T-1} γ^t min{ω_{0:t}, c} δ_t(θ_s),   (7)

δ_t(θ_s) = λ_f · Ψ(f̂_t, f_t) + λ_d · (d̂_t − d_t)^2 + λ_l · Ψ(l̂_t, l_t) + λ_v · (v̂^r − v^r)^2,

where n is the total number of trajectories in the logged data, ω_{0:t} = Π_{k=0}^{t} π(i_k|s_k) / π_b(i_k|s_k) is the importance ratio that reduces the disparity between π (the policy derived from the Q-Network, e.g., ϵ-greedy) and π_b, Ψ(·, ·) is the cross-entropy loss function, and c is a hyper-parameter that avoids excessively large importance ratios. δ_t(θ_s) is a multi-task loss that combines two classification losses and two regression losses, and the λ_* are hyper-parameters controlling the importance of the different tasks.

As the policy π derived from the Q-Network constantly changes with the update of θ_q, to stay adaptive to the intermediate policies, the S-Network is also kept updated in accordance with π to obtain the best customized accuracy. Finally, we implement an interactive training procedure, shown in Algorithm 1, which specifies the order in which the updates occur within each iteration.

Algorithm 1: Offline training of FeedRec.
Input: D, ϵ, L, K
Output: θ_q, θ_s
1:  Randomly initialize parameters θ_q, θ_s ← Uniform(−0.1, 0.1);
2:  # Pretraining the S-Network.
3:  for j = 1 : K do
4:      Sample random mini-batches of (s_t, i_t, r_t, s_{t+1}) from D;
5:      Set f_t, d_t, v^r, l_t according to s_t, r_t, s_{t+1};
6:      Update θ_s via mini-batch SGD w.r.t. the loss in Equation (7);
7:  end
8:  # Iterative training of S-Network and Q-Network.
9:  repeat
10:     for j = 1 : N do
11:         # Sampling training samples from the logged data.
12:         Sample (s, i, r, s′) from D and store it in buffer M;
13:         # Sampling training samples by interacting with the S-Network.
14:         l = False;
15:         Sample an initial user u from the user set;
16:         Initialize s = {u};
17:         while l is False do
18:             Sample a recommendation i w.r.t. the ϵ-greedy Q-value;
19:             Execute i; the S-Network responds with f, d, l, v^r;
20:             Set r according to f, d, l, v^r;
21:             Set s′ = s ⊕ {i, f, d};
22:             Store (s, i, r, s′) in buffer M;
23:             Update s ← s′;
24:         end
25:         # Updating the Q-Network.
26:         for j = 1 : L do
27:             Sample random mini-batches of (s_t, i_t, r_t, s_{t+1}) from M;
28:             Update θ_q via mini-batch SGD w.r.t. Equation (5);
29:         end
30:         # Updating the S-Network.
31:         for j = 1 : K do
32:             Sample mini-batches of (s_t, i_t, r_t, s_{t+1}) from M;
33:             Set f, d, l, v^r according to r_t, s_{t+1};
34:             Update θ_s via mini-batch SGD w.r.t. the loss in Equation (7);
35:         end
36:     end
37: until convergence;
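A compact sketch of the importance-weighted simulator loss in Equation (7): for one logged trajectory it caps the cumulative ratio ω_{0:t} at c and combines the two classification terms and two regression terms of δ_t. The step container format, the probability fields, and the cross-entropy helper are illustrative assumptions, not the authors' code.

```python
import numpy as np

def cross_entropy(pred_probs, label_idx, eps=1e-12):
    """Psi(., .): negative log-likelihood of the observed class."""
    return -np.log(pred_probs[label_idx] + eps)

def simulator_loss(steps, gamma=0.9, cap=5.0, lam=(1.0, 1.0, 1.0, 1.0)):
    """Equation (7) for a single trajectory.

    Each step holds the target-policy prob pi, logging prob pi_b,
    predictions (f_hat, d_hat, l_hat, v_hat), and observations (f, d, l, v)."""
    lam_f, lam_d, lam_l, lam_v = lam
    loss, ratio = 0.0, 1.0
    for t, s in enumerate(steps):
        ratio *= s["pi"] / s["pi_b"]                    # omega_{0:t} = prod pi / pi_b
        w = min(ratio, cap)                             # capped importance weight
        delta = (lam_f * cross_entropy(s["f_hat"], s["f"])
                 + lam_d * (s["d_hat"] - s["d"]) ** 2
                 + lam_l * cross_entropy(np.array([1 - s["l_hat"], s["l_hat"]]), s["l"])
                 + lam_v * (s["v_hat"] - s["v"]) ** 2)  # multi-task delta_t
        loss += (gamma ** t) * w * delta
    return loss

# One toy two-step trajectory (all numbers illustrative).
steps = [{"pi": 0.3, "pi_b": 0.2, "f_hat": np.array([0.7, 0.2, 0.1]), "f": 0,
          "d_hat": 2.0, "d": 2.5, "l_hat": 0.1, "l": 0, "v_hat": 4.0, "v": 5.0}] * 2
print(simulator_loss(steps))
```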

5 SIMULATION STUDY

We demonstrate the ability of FeedRec to fit the user's interests by directly optimizing user engagement metrics through simulation. We use synthetic datasets so that we know the "ground truth" mechanism that maximizes user engagement, and we can easily check whether the algorithm can indeed learn the optimal policy to maximize delayed user engagement.

Figure 3: Different distributions of user's interests and browsing depth. The dashed line represents the distribution of scrolling down and entropy (linear in (a) and quadratic in (b)). The color bar shows the interaction iteration in the training phase, from blue to red. The average browsing depth over all users is shown as dots.

5.1 Setting

Formally, we generate M users and N items, each of which is associated with a d-dimensional topic parameter vector ϑ ∈ R^d. For M users (U = {ϑ_u^(1), ..., ϑ_u^(M)}) and N items (I = {ϑ_i^(1), ..., ϑ_i^(N)}), the topic vectors are initialized as

ϑ = ϑ̃ / ‖ϑ̃‖,  where  ϑ̃_k = 1 − κ for the primary topic k  and  ϑ̃_{k′} ∼ U(0, κ) for k′ ≠ k,   (8)

where κ controls how much attention is paid to non-primary topics. Specifically, we set the dimension of the user and item vectors to 10. Once the item vector ϑ_i is initialized, it is kept fixed for the whole simulation. At each time step t, the agent feeds one item ϑ_i from I to one user ϑ_u. The user checks the feed and gives feedback, e.g., click/skip, leave/stay (depth metric), and revisit (return time metric), based on the "satisfaction". Specifically, the probability of a click is determined by the cosine similarity as p(click | ϑ_u, ϑ_i) = ϑ_i^⊤ ϑ_u / (‖ϑ_i‖ ‖ϑ_u‖). For leave/stay and revisit, these feedbacks are related to all the feeds. In the simulation, we assume these feedbacks are determined by the mean entropy of the recommendation list, because many existing works [1, 2, 4] assume that diversity is able to improve the user's satisfaction with the recommendation results. Also, diversity is itself a delayed metric [29, 39], which can verify whether FeedRec is able to optimize delayed metrics or not.

Figure 4: Different distributions of user's interests and interval days between two visits. The dashed line represents the distribution of return time and entropy (linear in (a) and quadratic in (b)). The color bar shows the interaction iteration in the training phase, from blue to red. The average return time over all users is shown as dots.

5.2 Simulation Results

Some existing works [1, 2, 4] assume that diversity is able to improve the user's satisfaction with the recommendation results. Actually, this is an indirect method to optimize user engagement, and diversity plays an instrumental role in achieving this goal. We now verify that the proposed FeedRec framework has the ability to directly optimize user engagement under different forms of diversity. To generate the simulation data, we follow the popular diversity assumption [1, 2, 4]. These works try to enhance diversity to achieve better user engagement. However, it is unclear to what extent diversity leads to the best user engagement; therefore, the pursuit of diversity may not lead to an improvement of user satisfaction.

We assume that there are two types of relationship between user engagement and the diversity of the recommendation list.
1) Linear style. In the linear relationship, higher entropy brings more satisfaction; that is, higher entropy attracts the user to browse more items and to use the system more often. The probability of the user staying with the system after checking the fed items is set as

p(stay | ϑ_1, ..., ϑ_t) = a · E(ϑ_1, ..., ϑ_t) + b,  a > 0,
E(ϑ_1, ..., ϑ_t) = (1 / (t(t−1))) Σ_{m,n ∈ {1,...,t}, m ≠ n} ϑ_m log(ϑ_m / ϑ_n),

where {ϑ_1, ..., ϑ_t} is the list of recommended items, E(ϑ_1, ..., ϑ_t) is the mean entropy of the items, and a and b are used to scale the value into the range (0, 1). The interval days between two visits is set as

v^r = V − d · E(ϑ_{i,1}, ..., ϑ_{i,t}),  V > 0, d > 0,

where V and d are constants that keep v^r positive.

2) Quadratic style. In the quadratic relationship, moderate entropy gives the best satisfaction. The probability of the user staying with the system after checking the fed items is set as

p(stay | ϑ_1, ..., ϑ_t) = exp{ −(E(ϑ_1, ..., ϑ_t) − µ)^2 / σ },

where µ and σ are constants. Similarly, the interval days between two visits is set as

v^r = V (1 − exp{ −(E(ϑ_{i,1}, ..., ϑ_{i,t}) − µ)^2 / σ }),  V > 0.

Following the above process of interaction between "user" and system agent, we generate 1,000 users, 5,000 items, and 2M episodes for training.
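For concreteness, the sketch below generates the kind of synthetic feedback described above: topic vectors from Equation (8), clicks from the cosine similarity between user and item vectors, and a stay probability driven by the mean divergence E(ϑ_1, ..., ϑ_t) of the recommended list under the linear relationship. The scaling constants a and b and the clipping are illustrative assumptions; the quadratic variant would only change `stay_probability`.

```python
import numpy as np

rng = np.random.default_rng(7)

def topic_vector(d=10, kappa=0.2):
    """Equation (8): mass 1 - kappa on a primary topic, U(0, kappa) elsewhere, then normalized."""
    v = rng.uniform(0.0, kappa, size=d)
    v[rng.integers(d)] = 1.0 - kappa
    return v / np.linalg.norm(v)

def click_probability(theta_u, theta_i):
    return float(theta_u @ theta_i) / (np.linalg.norm(theta_u) * np.linalg.norm(theta_i))

def mean_entropy(items):
    """E(theta_1, ..., theta_t): averaged pairwise divergence of the list."""
    t, eps = len(items), 1e-12
    total = sum(float(np.sum(m * np.log((m + eps) / (n + eps))))
                for a, m in enumerate(items) for b, n in enumerate(items) if a != b)
    return total / (t * (t - 1))

def stay_probability(items, a=0.5, b=0.3):
    """Linear style: p(stay) = a * E(list) + b, clipped into (0, 1)."""
    return float(np.clip(a * mean_entropy(items) + b, 0.0, 1.0))

user, feed = topic_vector(), [topic_vector() for _ in range(5)]
print(click_probability(user, feed[0]), stay_probability(feed))
```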

We here report the average browsing depth and return time w.r.t. the different relationships (linear or quadratic) at each training step, where the blue points correspond to earlier training steps and the red points to later training steps. From the results shown in Figure 3 and Figure 4, we can see that no matter which diversity assumption holds, FeedRec is able to converge to the best diversity by directly optimizing the delayed metrics. In (a) of Figure 3 (and also Figure 4), FeedRec discloses that the distribution of the entropy of the recommendation list versus the browsing depth (and also the return time) is linear. As the number of rounds of interaction increases, the user's satisfaction gradually increases; therefore, more items are browsed (and the interval time between two visits becomes shorter). In (b), user engagement is highest at a certain entropy, and higher or lower entropy causes the user's dissatisfaction; therefore, a moderate entropy of the recommendation list attracts the user to browse more items and to use the recommender system more often. The results indicate that FeedRec has the ability to fit different types of distribution between user engagement and the entropy of the recommendation list.

6 EXPERIMENTS ON REAL-WORLD E-COMMERCE DATASET

6.1 Dataset

We collected 17 days of users' accessing logs in July 2018 from an e-commerce platform. Each accessing log contains: the timestamp, the user id u, the user's profile (u_p ∈ R^20), the recommended item's id i_t, the behavior policy's ranking score for the item π_b(i_t|s_t), the user's feedback f_t, and the dwell time d_t. Due to the sparsity of the dataset, we initialized the items' embeddings (i ∈ R^20) with pretrained vectors, learned by modeling users' clicking streams with skip-gram [16]. The user's embedding u is initialized with the user's profile u_p. The return gap is computed as the interval days between consecutive user visits. Table 1 shows the statistics of the dataset.

Table 1: Summary of dataset.

Statistics                                Numerical Value
The number of trajectories                633,345
The number of items                       456,805
The number of users                       471,822
Average/max/min clicks                    2.04 / 99 / 0
Average/max/min dwell time (minutes)      2.4 / 5.3 / 0.5
Average/max/min browsing depth            13.34 / 149 / 1
Average/max/min return time (days)        5.18 / 17 / 0

6.2 Evaluation Setting
Off-line A/B testing. To evaluate RL methods against ground truth, a straightforward way is to evaluate the learned policy through an online A/B test, which, however, could be prohibitively expensive and may hurt user experiences. As suggested by [6, 7], a specific off-policy estimator, NCIS [25], is employed to evaluate the performance of the different recommender agents. The step-wise variant of NCIS is defined as

R̂^π_{step-NCIS} = Σ_{ξ_k ∈ T} Σ_{t=0}^{T-1} ρ̄^k_{0:t} r^k_t / (Σ_{j=1}^{K} ρ̄^j_{0:t}),   (9)

ρ̄_{t1:t2} = min{ c, Π_{t=t1}^{t2} π(a_t|s_t) / π_b(a_t|s_t) },

where ρ̄_{t1:t2} is the max-capped importance ratio, T = {ξ_k} is the set of trajectories ξ_k used for evaluation, and K is the total number of testing trajectories. The numerator of Equation (9) is the capped importance-weighted reward and the denominator is the normalization factor. By setting r_t to different metrics, we can evaluate the policy from different perspectives. To make the experimental results trustworthy and solid, we use the first 15 days of logs as training samples and the last 2 days as testing data; the test data is kept isolated. The training samples are used for policy learning, and the testing data are used for policy evaluation. To ensure small variance and control the bias, we set c to 5 in the experiments.
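The step-wise NCIS estimator of Equation (9) can be computed directly from logged trajectories, as in the sketch below: per-step capped ratios ρ̄_{0:t} are accumulated for every test trajectory, and each step's reward is weighted by its ratio and normalized by the sum of ratios at that step across trajectories. The trajectory container format (equal-length lists of probability/reward tuples) is an illustrative assumption.

```python
import numpy as np

def step_ncis(trajectories, cap=5.0):
    """Equation (9). Each trajectory is a list of (pi, pi_b, reward) tuples
    of equal length T, where pi / pi_b are the action probabilities under
    the evaluated and logging policies."""
    T = len(trajectories[0])
    rho = np.ones((len(trajectories), T))      # rho[k, t] = min{cap, prod_{s<=t} pi / pi_b}
    rewards = np.zeros((len(trajectories), T))
    for k, traj in enumerate(trajectories):
        running = 1.0
        for t, (pi, pi_b, r) in enumerate(traj):
            running *= pi / pi_b
            rho[k, t] = min(running, cap)
            rewards[k, t] = r
    # Normalize each step's weights over all test trajectories, then sum.
    return float(np.sum(rho * rewards / rho.sum(axis=0, keepdims=True)))

# Two toy trajectories of length 3 (numbers illustrative).
logged = [[(0.4, 0.2, 1.0), (0.3, 0.3, 0.0), (0.5, 0.25, 2.0)],
          [(0.1, 0.2, 0.0), (0.2, 0.4, 1.0), (0.3, 0.3, 0.0)]]
print(step_ncis(logged))
```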

The metrics. By setting the reward in Equation (9) to different user engagement metrics, we can estimate a comprehensive set of evaluation metrics. Formally, these metrics are defined as follows:
• Average Clicks per Session: the average cumulative number of clicks over a user visit.
• Average Depth per Session: the average browsing depth at which users interact with the recommender agent.
• Average Return Time: the average number of days between a user's consecutive visits up to a particular time point.

The baselines. We compare our model with state-of-the-art baselines, including both supervised-learning based methods and reinforcement learning based methods.
• FM: Factorization Machines [21] is a strong factoring model, which can be easily implemented with the factorization machine library libFM (http://www.libfm.org).
• NCF: Neural network-based Collaborative Filtering [9] replaces the inner product in the factoring model with a neural architecture to support arbitrary functions learned from data.
• GRU4Rec: a representative approach that utilizes RNNs to learn dynamic representations of customers and items in recommender systems [10].
• NARM: a state-of-the-art approach to personalized trajectory-based recommendation with RNN models [12]. It uses an attention mechanism to determine the relatedness of past purchases in the trajectory for the next purchase.
• DQN: Deep Q-Networks [18] combine Q-learning with deep neural networks. We use the same function approximation as FeedRec and train the neural network with naive Q-learning on the logged dataset.
• DEERs: DEERs [34] is a DQN-based approach for maximizing users' clicks with pairwise training, which considers the user's negative behaviors.
• DDPG-KNN: Deep Deterministic Policy Gradient with KNN [5] is a discrete version of DDPG for dealing with large action spaces, which has been deployed for page-wise recommendation in [33].
• FeedRec: To verify the effect of the different components, experiments are conducted on the following degenerated models: 1) S-Network is purely based on our proposed S-Network, which makes recommendations based on the ranking of the next possible clicked item. 2) FeedRec(C), FeedRec(D), FeedRec(R), and FeedRec(All) are our proposed methods with different metrics as the reward; specifically, they use the clicks, the depth, the return time, and the weighted sum of instant and delayed metrics as the reward function, respectively.

Experimental Setting. The weight ω for the different metrics is set to [1.0, 0.005, 0.005]^⊤. The number of hidden units of the LSTM is set to 50 for both the Q-Network and the S-Network. All the baseline models share the same layer and hidden-node configuration for the neural networks. The buffer size for Q-Learning is set to 10,000 and the batch size to 256. ϵ-greedy is always applied for exploration during learning, but it is annealed with increasing training epochs. The value c for clipping the importance sampling ratio is set to 5. We set the discount factor γ = 0.9. The networks are trained with SGD with a learning rate of 0.005. We used TensorFlow to implement the pipelines and trained the networks on an Nvidia GTX 1080 Ti GPU. All experimental results are averaged over 5 repeated runs.

Table 2: Performance comparison of different agents on JD dataset.

Agents              Average Clicks per Session   Average Depth per Session   Average Return Time
FM                  1.9829                       11.2977                     16.5349
NCF                 1.9425                       11.1973                     18.2746
GRU4Rec             2.1154                       13.8060                     14.0268
NARM                2.3030                       15.3913                     11.0332
DQN                 1.8211                       15.2508                     6.2307
DEERs               2.2773                       18.0602                     5.7363
DDPG-KNN (k=1)      0.6659                       9.8127                      15.4012
DDPG-KNN (k=0.1N)   2.5569                       16.0936                     7.3918
DDPG-KNN (k=N)      2.5090                       14.6689                     14.1648
S-Network           2.5124                       16.1745                     10.1846
FeedRec(C)          2.6194                       18.1204                     6.9640
FeedRec(D)          2.8217                       21.8328                     4.8756
FeedRec(R)          3.7194                       23.4582                     3.9280
FeedRec(All)        4.0321*                      25.5652*                    3.9010*
"*" indicates statistically significant improvements (two-sided t-test with p < 0.01) over the best baseline.

6.3 Experimental Results

Comparison against baselines. We compared FeedRec with the state-of-the-art methods. The results of all methods on the real-world dataset in terms of the three metrics are shown in Table 2. From the results, we can see that FeedRec outperformed all of the baseline methods on all metrics. We conducted significance testing (t-test) on the improvements of our approaches over all baselines; "*" denotes strongly significant differences with p-value < 0.01. The results indicate that the improvements are significant in terms of all of the evaluation measures.

The influence of weight ω. The weight ω controls the relative importance of the different user engagement metrics in the reward function. We examined the effect of the weights ω. Panels (a) and (b) of Figure 5 show the parameter sensitivity of ω w.r.t. the depth metric and the return time metric, respectively. In Figure 5, as the weight in ω for the depth and return time metrics increases, the user browses more items and revisits the application more often (the blue line). Meanwhile, in both (a) and (b), the model achieves the best results on the cumulative clicks metric (the orange line) when the corresponding entry of ω is set to 0.005. Too much weight on these metrics overwhelms the importance of clicks in the reward, which indicates that a moderate weight on depth and return time can indeed improve the performance on cumulative clicks.

Figure 5: The influence of ω on performance.

The effect of S-Network. The notorious Deadly Triad problem causes the danger of instability and divergence in most off-policy learning methods, even the robust Q-Learning. To examine the advantage of our proposed interactive training framework, we compared FeedRec with DQN and DDPG-KNN under the same configuration. In Figure 6, we show the different metrics versus the training iteration. We find that DQN and DDPG-KNN reach a performance peak at around 40 iterations, after which their performance degrades rapidly with increasing iterations (the orange and blue lines). On the contrary, FeedRec achieves better performance on all three metrics, and its performance remains stable at the highest value (the green line). These observations indicate that FeedRec is stable and suitable for off-policy learning of recommendation policies by avoiding the Deadly Triad problem.

Figure 6: Comparison between FeedRec and baselines under offline learning.

The relationship between user engagement and diversity. Some existing works [1, 2, 4] assume that user engagement and diversity are related and intend to increase user engagement by increasing diversity.
Actually, this is an indirect method to optimize user engagement, and the assumption has not been verified. Here, we conducted experiments to see whether FeedRec, which directly optimizes user engagement, has the ability to improve recommendation diversity. For each policy, we sample 300 state-action pairs with importance ratio ρ̄ > 0.01 (a larger value of ρ̄ in Equation (9) implies that the policy more strongly favors such actions) and plot these state-action pairs, as shown in Figure 7. The horizontal axis indicates the diversity among recommended items, and the vertical axis indicates the different types of user engagement (e.g., browsing depth, return time). We can see that the FeedRec policy, learned by directly optimizing user engagement, favors recommending more diverse items. The results verify that optimizing user satisfaction can increase recommendation diversity, and that enhancing diversity is also a means of improving user satisfaction.

Figure 7: The relationship between user engagement and diversity.

7 CONCLUSION

It is critical to optimize long-term user engagement in the recommender system, especially in the feed streaming setting.
Though RL naturally fits the problem of maximizing long-term rewards, there exist several challenges for applying RL to optimizing long-term user engagement: it is difficult to model the omnifarious user feedback (e.g., clicks, dwell time, revisit, etc.), and effective off-policy learning in the recommender system is hard. To address these issues, in this work we introduce an RL-based framework, FeedRec, to optimize long-term user engagement. First, FeedRec leverages hierarchical RNNs, referred to as the Q-Network, to model complex user behaviors. Then, to avoid the instability of convergence in policy learning, an S-Network is designed to simulate the environment and assist the Q-Network. Extensive experiments on both synthetic datasets and a real-world e-commerce dataset have demonstrated the effectiveness of FeedRec for feed streaming recommendation.

REFERENCES
[1] Gediminas Adomavicius and YoungOk Kwon. 2012. Improving aggregate recommendation diversity using ranking-based techniques. TKDE 24, 5 (2012), 896-911.
[2] Azin Ashkan, Branislav Kveton, Shlomo Berkovsky, and Zheng Wen. 2015. Optimal Greedy Diversity for Recommendation. In IJCAI'15. 1742-1748.
[3] Shiyu Chang, Yang Zhang, Jiliang Tang, Dawei Yin, Yi Chang, Mark A Hasegawa-Johnson, and Thomas S Huang. 2017. Streaming recommender systems. In WWW'17. ACM, 381-389.
[4] Peizhe Cheng, Shuaiqiang Wang, Jun Ma, Jiankai Sun, and Hui Xiong. 2017. Learning to Recommend Accurate and Diverse Items. In WWW'17. ACM, 183-192.
[5] Gabriel Dulac-Arnold, Richard Evans, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, and Ben Coppin. 2015. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679 (2015).
[6] Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. 2018. More Robust Doubly Robust Off-policy Evaluation. In ICML'18. 1446-1455.
[7] Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. 2018. Offline A/B testing for Recommender Systems. In WSDM'18. ACM, 198-206.
[8] Li He, Long Xia, Wei Zeng, Zhiming Ma, Yihong Zhao, and Dawei Yin. 2019. Off-policy Learning for Multiple Loggers. In SIGKDD'19. ACM.
[9] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In WWW'17. ACM, 173-182.
[10] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
[11] Mounia Lalmas, Heather O'Brien, and Elad Yom-Tov. 2014. Measuring user engagement. Synthesis Lectures on Information Concepts, Retrieval, and Services 6, 4 (2014), 1-132.
[12] Jing Li, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Tao Lian, and Jun Ma. 2017. Neural Attentive Session-based Recommendation. In CIKM'17. ACM, 1419-1428.
[13] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In WWW'10. ACM, 661-670.
[14] Zhongqi Lu and Qiang Yang. 2016. Partially Observable Markov Decision Process for Recommender Systems. arXiv preprint arXiv:1608.07793 (2016).
[15] Tariq Mahmood and Francesco Ricci. 2009. Improving recommender systems with adaptive conversational strategies. In HT'09. ACM, 73-82.
[16] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781 (2013).
[17] Andriy Mnih and Ruslan R Salakhutdinov. 2008. Probabilistic matrix factorization. In NIPS'08. 1257-1264.
[18] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
[19] Bruno Pradel, Savaneary Sean, Julien Delporte, Sébastien Guérif, Céline Rouveirol, Nicolas Usunier, Françoise Fogelman-Soulié, and Frédéric Dufau-Joel. 2011. A case study in a recommender system based on purchase data. In SIGKDD'11. ACM, 377-385.
[20] Lijing Qin, Shouyuan Chen, and Xiaoyan Zhu. 2014. Contextual combinatorial bandit and its application on diversified online recommendation. In SDM'14. SIAM, 461-469.
[21] Steffen Rendle. 2010. Factorization machines. In ICDM'10. IEEE, 995-1000.
[22] Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. 2016. Recommendations as Treatments: Debiasing Learning and Evaluation. In ICML'16. 1670-1679.
[23] Guy Shani, David Heckerman, and Ronen I Brafman. 2005. An MDP-based recommender system. JMLR 6, Sep (2005), 1265-1295.
[24] Richard S Sutton and Andrew G Barto. 1998. Reinforcement learning: An introduction. Vol. 1. MIT Press, Cambridge.
[25] Adith Swaminathan and Thorsten Joachims. 2015. The self-normalized estimator for counterfactual learning. In NIPS'15. 3231-3239.
[26] Huazheng Wang, Qingyun Wu, and Hongning Wang. 2017. Factorization Bandits for Interactive Recommendation. In AAAI'17. 2695-2702.
[27] Zihan Wang, Ziheng Jiang, Zhaochun Ren, Jiliang Tang, and Dawei Yin. 2018. A path-constrained framework for discriminating substitutable and complementary products in e-commerce. In WSDM'18. ACM, 619-627.
[28] Qingyun Wu, Hongning Wang, Liangjie Hong, and Yue Shi. 2017. Returning is Believing: Optimizing Long-term User Engagement in Recommender Systems. In WWW'17. ACM, 1927-1936.
[29] Long Xia, Jun Xu, Yanyan Lan, Jiafeng Guo, Wei Zeng, and Xueqi Cheng. 2017. Adapting Markov decision process for search result diversification. In SIGIR'17. ACM, 535-544.
[30] Xing Yi, Liangjie Hong, Erheng Zhong, Nanthan Nan Liu, and Suju Rajan. 2014. Beyond clicks: dwell time for personalization. In RecSys'14. ACM, 113-120.
[31] Chunqiu Zeng, Qing Wang, Shekoofeh Mokhtari, and Tao Li. 2016. Online context-aware recommendation with time varying multi-armed bandit. In SIGKDD'16. ACM, 2025-2034.
[32] Xiangyu Zhao, Long Xia, Jiliang Tang, and Dawei Yin. 2018. Deep Reinforcement Learning for Search, Recommendation, and Online Advertising: A Survey. arXiv preprint arXiv:1812.07127 (2018).
[33] Xiangyu Zhao, Long Xia, Liang Zhang, Zhuoye Ding, Dawei Yin, and Jiliang Tang. 2018. Deep reinforcement learning for page-wise recommendations. In RecSys'18. ACM, 95-103.
[34] Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Long Xia, Jiliang Tang, and Dawei Yin. 2018. Recommendations with negative feedback via pairwise deep reinforcement learning. In SIGKDD'18. ACM, 1040-1048.
[35] Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Dawei Yin, Yihong Zhao, and Jiliang Tang. 2017. Deep reinforcement learning for list-wise recommendations. arXiv preprint arXiv:1801.00209 (2017).
[36] Yin Zheng, Bangsheng Tang, Wenkui Ding, and Hanning Zhou. 2016. A neural autoregressive approach to collaborative filtering. In ICML'16. 764-773.
[37] Meizi Zhou, Zhuoye Ding, Jiliang Tang, and Dawei Yin. 2018. Micro behaviors: A new perspective in e-commerce recommender systems. In WSDM'18. ACM, 727-735.
[38] Yu Zhu, Hao Li, Yikang Liao, Beidou Wang, Ziyu Guan, Haifeng Liu, and Deng Cai. 2017. What to do next: Modeling user behaviors by time-lstm. In IJCAI'17. 3602-3608.
[39] Lixin Zou, Long Xia, Zhuoye Ding, Dawei Yin, Jiaxing Song, and Weidong Liu. 2019. Reinforcement Learning to Diversify Top-N Recommendation. In DASFAA'19. Springer, 104-120.