MarlRank: Multi-agent Reinforced Learning to Rank
Shihao Zou* (University of Alberta), Zhonghua Li (Huawei), Mohammad Akbari (University College London), Jun Wang (University College London), Peng Zhang (Tianjin University)

*This work was done when Shihao was with UCL and interned at Huawei.

ABSTRACT

When estimating the relevance between a query and a document, ranking models largely neglect the mutual information among documents. A common wisdom is that if two documents are similar with respect to the same query, they are likely to have similar relevance scores. To address this, we propose a multi-agent reinforced ranking model, named MarlRank. In particular, by considering each document as an agent, we formulate the ranking process as a multi-agent Markov Decision Process (MDP) in which the mutual interactions among documents are incorporated into the ranking process. To compute the ranking list, each document predicts its relevance to a query by considering not only its own query-document features but also the features and actions of its similar documents. By defining the reward as a function of NDCG, we can optimize our model directly on the ranking performance measure. Our experimental results on two LETOR benchmark datasets show that our model has significant performance gains over the state-of-the-art baselines. We also find that NDCG shows an overall increasing trend along with the interaction steps, which demonstrates that the mutual information among documents helps improve the ranking performance.

CCS CONCEPTS

• Information systems → Learning to rank; Novelty in information retrieval.

KEYWORDS

multi-agent reinforcement learning, learning to rank

ACM Reference Format:
Shihao Zou, Zhonghua Li, Mohammad Akbari, Jun Wang, and Peng Zhang. 2019. MarlRank: Multi-agent Reinforced Learning to Rank. In The 28th ACM International Conference on Information and Knowledge Management (CIKM '19), November 3–7, 2019, Beijing, China. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3357384.3358075

1 INTRODUCTION

Learning to rank, which learns a model for ranking documents (items) based on their relevance scores to a given query, is a key task in information retrieval (IR) [18]. Typically, a query-document pair is represented as a feature vector, and many classical models, including SVM [8], AdaBoost [18] and regression trees [2], are applied to learn to predict the relevance. Recently, deep learning models have also been applied to the learning to rank task. For example, CDSSM [13], DeepRank [12] and their variants aim at learning better semantic representations of the query-document pair for more accurate prediction of the relevance. Most previous approaches, however, focus on learning models with only query-document features considered. They do not investigate the effectiveness of the global mutual information among documents for improving ranking performance, which is the focus of this paper.

Reinforcement learning has been applied to a few IR tasks, such as session search [11, 20], where a user's behavior in one session affects document ranking in the next session; this can be formulated as an MDP and optimized using reinforcement learning. Recent works on reinforcement learning to rank model the ranking process as an MDP [6, 17] in which, at each time step, an agent selects one document given its current observation (the ranking position and the unranked documents), and the reward is the improvement of NDCG. This contextual information is limited, however, because documents are passively selected and not allowed to interact with each other; it is precisely these interactions among documents that carry the mutual information which helps improve ranking performance.
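Since NDCG serves both as the reward signal and as the evaluation measure throughout, a minimal reference sketch of NDCG@k may be helpful. The Python snippet below uses the standard exponential-gain, log-discount form of the measure; the function names and interface are ours, not taken from the paper.

import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions of a ranked list."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

Here relevances is the list of graded relevance labels read off in ranked order, so a ranking improves NDCG@k exactly when it moves higher-graded documents into the top k positions.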
Departing from prior approaches, we consider both query-document and document-document relevance for ranking. Our hypothesis comes from the common wisdom that if two documents are very similar with respect to the same query, they are likely to have similar relevance scores. We give a motivational example to illustrate this intuition in Sec. 2.2. Inspired by [10], we construct the ranking process as an interactive multi-agent environment in which each document is treated as an agent. Specifically, our multi-agent environment comprises a query and a set of documents (agents), and the ranking process is an interactive MDP among these agents. At each time step, each agent simultaneously presents its relevance score based on the observation of its own query-related information and the local information of its neighbor agents, and then a global ranking performance reward is given. After several time steps of interaction, all the agents reach an equilibrium state. Their common objective is to find an optimal policy that jointly maximizes the expected accumulative reward.

The contributions of this paper are as follows: (1) we propose to leverage both query-document and document-document relevance for ranking in IR; (2) we formulate the learning to rank problem as a multi-agent reinforced learning framework. To the best of our knowledge, the proposed framework is the first work that models document ranking as an interactive multi-agent reinforced learning process and optimizes the model directly on the ranking performance measure of the retrieval task. The experimental results show that our model outperforms most strong baselines and, interestingly, that NDCG increases along with the interaction steps, which attests that the interactions among documents help improve the ranking performance.
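To illustrate the interaction loop described above, the following is a minimal sketch (in Python/NumPy) of what one simultaneous step could look like. Everything in it (the function names, the mean-pooling of neighbor information, and the generic score_fn standing in for a learned policy) is an illustrative assumption rather than the paper's actual architecture.

import numpy as np

def interaction_step(doc_features, prev_actions, neighbors, score_fn):
    """One simultaneous step of the ranking MDP: every document (agent) produces a new
    relevance score from its own query-document features together with a summary of the
    features and previous actions (scores) of its neighbor documents."""
    new_actions = np.zeros(len(prev_actions))
    for i in range(len(prev_actions)):
        nbrs = neighbors[i]                            # indices of documents similar to i
        nbr_feats = doc_features[nbrs].mean(axis=0)    # pooled neighbor features
        nbr_action = prev_actions[nbrs].mean()         # pooled neighbor scores
        new_actions[i] = score_fn(doc_features[i], nbr_feats, nbr_action)
    # The documents are then ranked by new_actions, and a shared reward based on the
    # NDCG of the resulting ranking is given to all agents.
    return new_actions

Repeating this step lets relevance information propagate among similar documents, which is the behavior behind the observation that NDCG tends to increase over the first few interaction steps.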
2 BACKGROUND

2.1 Related Work

Typically, ranking models can be classified into three categories: point-wise, pair-wise [1, 3, 8] and list-wise [4, 18, 19]. Point-wise ranking models p(d | q; θ) assume that documents are independent of each other when predicting relevance, where θ denotes the parameters of the ranking model. In contrast, pair-wise models regard a pair of documents as an instance and the pair's relevance order as the category in a classification-based ranking, referred to as p(d_i | q, d_j; θ), where d_i and d_j are a pair of documents. Empirical results show that pair-wise ranking models outperform point-wise ones because, intuitively, the pair-wise ranking process takes the mutual information between a pair of documents into account. As for list-wise models, they formulate a ranking list (either a document pair [19] or a document list [4]) as a permutation distribution so that the loss functions become continuous and differentiable; list-wise models can then be optimized directly on ranking performance measures. In essence, however, list-wise models mainly focus on the reformulation of loss functions and, like point-wise and pair-wise models, treat each document or each pair of documents independently when ranking.

Reinforcement learning has shown great success in many tasks, including games [14] and recommender systems [21]. In IR, there are mainly two lines of work formulated as an MDP. One line is session search: in [11, 20], a collaborative search process between the user and the search engine is defined as an MDP. The other line focuses on reinforcement learning to rank, where the ranking process is formulated as an MDP [17]: at each time step, the agent selects one document for the ranking list until all the documents are selected and ranked. A further exploratory algorithm based on Monte Carlo tree search is proposed by [6]. In addition, IRGAN [16] uses policy gradient to optimize its generator. Our model differs from those of [6, 17], whose agent is only responsible for selecting documents. By regarding each document as an agent in a collaborative environment, we include the mutual information among documents in our ranking model.

2.2 A Motivational Example

A motivational example illustrates how our intuition helps improve the ranking performance. In Fig. 1, given a query, we assume there are six documents, three non-relevant {d1, d2, d3} and three relevant ones {d4, d5, d6}. The initial predictions at step 0 are given in Tab. 1, and we assume the naive policy is to average the previous actions (relevance scores) of the target document and its top-2 most similar documents. For example, the action of document d4 at step 1 is the average of the previous actions of documents d4, d5 and d6: a_1^4 = (a_0^4 + a_0^5 + a_0^6) / 3 = 0.63. We can see from Tab. 1 that document d4 finds its neighbor documents d5 and d6, so it believes it should have scores similar to these two neighbors and updates its prediction from 0.1 to 0.63. Finally, NDCG@3 increases after one step of interaction and remains at the best ranking performance afterwards.

Table 1: Relevance score (action) given by each document.
step t                        0      1      2      3
non-relevant   d1 (r = 0)     0      0.37   0.46   0.52
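The naive averaging policy of this example can be written down directly; the following sketch does so in Python. The similarity matrix and the initial scores below are illustrative placeholders (only the d1 row of Tab. 1 and d4's 0.1 to 0.63 update are given above), not the paper's actual Table 1.

import numpy as np

def naive_policy_step(scores, similarity, k=2):
    """Naive policy from the example: each document's new score is the average of its own
    previous score and the previous scores of its top-k most similar documents."""
    scores = np.asarray(scores, dtype=float)
    new_scores = np.zeros(len(scores))
    for i in range(len(scores)):
        sims = similarity[i].copy()
        sims[i] = -np.inf                      # a document is not its own neighbor
        top_k = np.argsort(sims)[-k:]          # indices of the k most similar documents
        new_scores[i] = (scores[i] + scores[top_k].sum()) / (k + 1)
    return new_scores

# Hypothetical numbers: d4 starts at 0.1 and its two nearest neighbors are the relevant
# documents d5 and d6, as in the example; everything else is made up for illustration.
scores0 = np.array([0.0, 0.2, 0.3, 0.1, 0.9, 0.9])   # initial scores for d1..d6
similarity = np.array([
    [1.0, 0.8, 0.7, 0.1, 0.1, 0.1],
    [0.8, 1.0, 0.7, 0.1, 0.1, 0.1],
    [0.7, 0.7, 1.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.9, 0.8],
    [0.1, 0.1, 0.1, 0.9, 1.0, 0.8],
    [0.1, 0.1, 0.1, 0.8, 0.8, 1.0],
])
scores1 = naive_policy_step(scores0, similarity, k=2)
print(round(float(scores1[3]), 2))   # 0.63: d4's update, (0.1 + 0.9 + 0.9) / 3

With these made-up numbers, ranking by the updated scores places d4, d5 and d6 in the top three positions, which is exactly the kind of one-step NDCG@3 improvement the example describes.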