
Learning to Rank using Gradient Descent

Keywords: ranking, gradient descent, neural networks, probabilistic cost functions, internet search

Chris Burges [email protected]
Tal Shaked* [email protected]
Erin Renshaw [email protected]
Microsoft Research, One Microsoft Way, Redmond, WA 98052-6399

Ari Lazier [email protected]
Matt Deeds [email protected]
Nicole Hamilton [email protected]
Greg Hullender [email protected]
Microsoft, One Microsoft Way, Redmond, WA 98052-6399

Appearing in Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005. Copyright 2005 by the author(s)/owner(s).
* Current affiliation: Google, Inc.

Abstract

We investigate using gradient descent methods for learning ranking functions; we propose a simple probabilistic cost function, and we introduce RankNet, an implementation of these ideas using a neural network to model the underlying ranking function. We present test results on toy data and on data from a commercial internet search engine.

1. Introduction

Any system that presents results to a user, ordered by a utility function that the user cares about, is performing a ranking function. A common example is the ranking of search results, for example from the Web or from an intranet; this is the task we will consider in this paper. For this problem, the data consists of a set of queries, and for each query, a set of returned documents. In the training phase, some query/document pairs are labeled for relevance ("excellent match", "good match", etc.). Only those documents returned for a given query are to be ranked against each other. Thus, rather than consisting of a single set of objects to be ranked amongst each other, the data is instead partitioned by query. In this paper we propose a new approach to this problem. Our approach follows (Herbrich et al., 2000) in that we train on pairs of examples to learn a ranking function that maps to the reals (having the model evaluate on pairs would be prohibitively slow for many applications). However, (Herbrich et al., 2000) cast the ranking problem as an ordinal regression problem; rank boundaries play a critical role during training, as they do for several other algorithms (Crammer & Singer, 2002; Harrington, 2003). For our application, given that item A appears higher than item B in the output list, the user concludes that the system ranks A higher than, or equal to, B; no mapping to particular rank values, and no rank boundaries, are needed; to cast this as an ordinal regression problem is to solve an unnecessarily hard problem, and our approach avoids this extra step. We also propose a natural probabilistic cost function on pairs of examples. Such an approach is not specific to the underlying learning algorithm; we chose to explore these ideas using neural networks, since they are flexible (e.g. two layer neural nets can approximate any bounded continuous function (Mitchell, 1997)), and since they are often faster in test phase than competing kernel methods (and test speed is critical for this application); however our cost function could equally well be applied to a variety of machine learning algorithms. For the neural net case, we show that backpropagation (LeCun et al., 1998) is easily extended to handle ordered pairs; we call the resulting algorithm, together with the probabilistic cost function we describe below, RankNet. We present results on toy data and on data gathered from a commercial internet search engine. For the latter, the data takes the form of 17,004 queries, and for each query, up to 1000 returned documents, namely the top documents returned by another, simple ranker. Thus each query generates up to 1000 feature vectors.

Notation: we denote the number of relevance levels (or ranks) by N, the training sample size by m, and the dimension of the data by d.

2. Previous Work

RankProp (Caruana et al., 1996) is also a neural net ranking model. RankProp alternates between two phases: an MSE regression on the current target values, and an adjustment of the target values themselves to reflect the current ranking given by the net. The end result is a mapping of the data to a large number of targets which reflect the desired ranking, which performs better than just regressing to the original, scaled rank values. RankProp has the advantage that it is trained on individual patterns rather than pairs; however it is not known under what conditions it converges, and it does not give a probabilistic model.

(Herbrich et al., 2000) cast the problem of learning to rank as ordinal regression, that is, learning the mapping of an input vector to a member of an ordered set of numerical ranks. They model ranks as intervals on the real line, and consider loss functions that depend on pairs of examples and their target ranks. The positions of the rank boundaries play a critical role in the final ranking function. (Crammer & Singer, 2002) cast the problem in similar form and propose a ranker based on the perceptron ('PRank'), which maps a feature vector x ∈ R^d to the reals with a learned vector w ∈ R^d such that the output of the mapping function is just w · x. PRank also learns the values of N increasing thresholds b_r, r = 1, ..., N, and declares the rank of x to be min_r {w · x − b_r < 0}. (Footnote 1: Actually the last threshold is pegged at infinity.) PRank learns using one example at a time, which is held as an advantage over pair-based methods (e.g. (Freund et al., 2003)), since the latter must learn using O(m²) pairs rather than m examples. However this is not the case in our application; the number of pairs is much smaller than m², since documents are only compared to other documents retrieved for the same query, and since many feature vectors have the same assigned rank. We find that for our task the memory usage is strongly dominated by the feature vectors themselves. Although the linear version is an online algorithm (Footnote 2: The general kernel version is not, since the support vectors must be saved.), PRank has been compared to batch ranking algorithms, and a quadratic kernel version was found to outperform all such algorithms described in (Herbrich et al., 2000). (Harrington, 2003) has proposed a simple but very effective extension of PRank, which approximates finding the Bayes point by averaging over PRank models. Therefore in this paper we will compare RankNet with PRank, kernel PRank, large margin PRank, and RankProp.

(Dekel et al., 2004) provide a very general framework for ranking using directed graphs, where an arc from A to B means that A is to be ranked higher than B (which here and below we write as A ▷ B). This approach can represent arbitrary ranking functions, in particular, ones that are inconsistent, for example A ▷ B, B ▷ C, C ▷ A. We adopt this more general view, and note that for ranking algorithms that train on pairs, all such sets of relations can be captured by specifying a set of training pairs, which amounts to specifying the arcs in the graph. In addition, we introduce a probabilistic model, so that each training pair {A, B} has an associated posterior P(A ▷ B). This is an important feature of our approach, since ranking algorithms often model preferences, and the ascription of preferences is a much more subjective process than the ascription of, say, classes. (Target probabilities could be measured, for example, by measuring multiple human preferences for each pair.) Finally, we use cost functions that are functions of the difference of the system's outputs for each member of a pair of examples, which encapsulates the observation that for any given pair, an arbitrary offset can be added to the outputs without changing the final ranking; again, the goal is to avoid unnecessary learning.

RankBoost (Freund et al., 2003) is another ranking algorithm that is trained on pairs, and which is closer in spirit to our work since it attempts to solve the preference learning problem directly, rather than solving an ordinal regression problem. In (Freund et al., 2003), results are given using decision stumps as the weak learners. The cost is a function of the margin over reweighted examples. Since boosting can be viewed as gradient descent (Mason et al., 2000), the question naturally arises as to how combining RankBoost with our pair-wise differentiable cost function would compare. Due to space constraints we will describe this work elsewhere.

3. A Probabilistic Ranking Cost Function

We consider models where the learning algorithm is given a set of pairs of samples [A, B] in R^d, together with target probabilities P̄_AB that sample A is to be ranked higher than sample B. This is a general formulation: the pairs of ranks need not be complete (in that taken together, they need not specify a complete ranking of the training data), or even consistent. We consider models f : R^d → R such that the rank order of a set of test samples is specified by the real values that f takes; specifically, f(x_1) > f(x_2) is taken to mean that the model asserts that x_1 ▷ x_2.

Denote the modeled posterior P(x_i ▷ x_j) by P_ij, i, j = 1, ..., m, and let P̄_ij be the desired target values for those posteriors. Define o_i ≡ f(x_i) and o_ij ≡ f(x_i) − f(x_j). We will use the cross entropy cost function

    C_ij ≡ C(o_ij) = −P̄_ij log P_ij − (1 − P̄_ij) log(1 − P_ij)        (1)

where the map from outputs to probabilities is modeled using a logistic function (Baum & Wilczek, 1988):

    P_ij ≡ e^{o_ij} / (1 + e^{o_ij})        (2)

C_ij then becomes

    C_ij = −P̄_ij o_ij + log(1 + e^{o_ij})        (3)

Note that C_ij asymptotes to a linear function; for problems with noisy labels this is likely to be more robust than a quadratic cost. Also, when P̄_ij = 1/2 (when no information is available as to the relative rank of the two patterns), C_ij becomes symmetric, with its minimum at the origin. This gives us a principled way of training on patterns that are desired to have the same rank; we will explore this below. We plot C_ij as a function of o_ij in the left hand panel of Figure 1, for the three values P̄ = {0, 0.5, 1}.

3.1. Combining Probabilities

The above model puts consistency requirements on the P̄_ij, in that we require that there exist 'ideal' outputs ō_i of the model such that

    P̄_ij ≡ e^{ō_ij} / (1 + e^{ō_ij})        (4)

where ō_ij ≡ ō_i − ō_j. This consistency requirement arises because if it is not met, then there will exist no set of outputs of the model that give the desired pairwise probabilities. The consistency condition leads to constraints on possible choices of the P̄'s. For example, given P̄_ij and P̄_jk, Eq. (4) gives

    P̄_ik = P̄_ij P̄_jk / (1 + 2 P̄_ij P̄_jk − P̄_ij − P̄_jk)        (5)

This is plotted in the right hand panel of Figure 1, for the case P̄_ij = P̄_jk = P. We draw attention to some appealing properties of the combined probability P̄_ik. First, P̄_ik = P at the three points P = 0, P = 0.5 and P = 1, and only at those points. For example, if we specify that P(A ▷ B) = 0.5 and that P(B ▷ C) = 0.5, then it follows that P(A ▷ C) = 0.5; complete uncertainty propagates. Complete certainty (P = 0 or P = 1) propagates similarly. Finally, confidence (or lack of confidence) builds as expected: for 0 < P < 0.5, P̄_ik < P, and for 0.5 < P < 1.0, P̄_ik > P (for example, if P(A ▷ B) = 0.6 and P(B ▷ C) = 0.6, then P(A ▷ C) > 0.6). These considerations raise the following question: given the consistency requirements, how much freedom is there to choose the pairwise probabilities? We have the following (Footnote 3: A similar argument can be found in (Refregier & Vallet, 1991); however there the intent was to uncover underlying class conditional probabilities from pairwise probabilities; here, we have no analog of the class conditional probabilities.):

Theorem: Given a sample set x_i, i = 1, ..., m, and any permutation Q of the consecutive integers {1, 2, ..., m}, suppose that an arbitrary target posterior 0 ≤ P̄_kj ≤ 1 is specified for every adjacent pair k = Q(i), j = Q(i + 1), i = 1, ..., m − 1. Denote the set of such P̄'s, for a given choice of Q, a set of 'adjacency posteriors'. Then specifying any set of adjacency posteriors is necessary and sufficient to uniquely identify a target posterior 0 ≤ P̄_ij ≤ 1 for every pair of samples x_i, x_j.

Proof: Sufficiency: suppose we are given a set of adjacency posteriors. Without loss of generality we can relabel the samples such that the adjacency posteriors may be written P̄_{i,i+1}, i = 1, ..., m − 1. From Eq. (4), ō_ij is just the log odds:

    ō_ij = log( P̄_ij / (1 − P̄_ij) )        (6)

From its definition as a difference, any ō_jk, j ≤ k, can be computed as Σ_{n=j}^{k−1} ō_{n,n+1}. Eq. (4) then shows that the resulting probabilities indeed lie in [0, 1]. Uniqueness can be seen as follows: for any i, j, P̄_ij can be computed in multiple ways, in that given a set of previously computed posteriors P̄_{i,m_1}, P̄_{m_1,m_2}, ..., P̄_{m_n,j}, P̄_ij can be computed by first computing the corresponding ō_kl's, adding them, and then using Eq. (4). However since ō_kl = ō_k − ō_l, the intermediate terms cancel, leaving just ō_ij, and the resulting P̄_ij is unique. Necessity: if a target posterior is specified for every pair of samples, then by definition for any Q, the adjacency posteriors are specified, since the adjacency posteriors are a subset of the set of all pairwise posteriors. ∎
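To make the cost concrete, the following small Python sketch (our illustration; it is not code from the original experiments) evaluates the pairwise cost of Eq. (3) and its derivative with respect to o_ij, which is what drives the gradient updates of Section 4:

    import math

    def pair_cost(o_ij, p_bar):
        """Cross entropy cost of Eq. (3): C = -Pbar*o + log(1 + exp(o))."""
        return -p_bar * o_ij + math.log(1.0 + math.exp(o_ij))

    def pair_cost_grad(o_ij, p_bar):
        """dC/do_ij = P_ij - Pbar_ij, with P_ij the logistic map of Eq. (2)."""
        p_ij = 1.0 / (1.0 + math.exp(-o_ij))
        return p_ij - p_bar

    # For Pbar = 0.5 the cost is symmetric with its minimum at o_ij = 0,
    # which is how ties are handled in Section 5.2:
    assert abs(pair_cost_grad(0.0, 0.5)) < 1e-12
    # For large o_ij the cost grows only linearly (robust to label noise):
    print(pair_cost(10.0, 0.0))   # ~10.00005

The derivative takes the simple form P_ij − P̄_ij, so a pair whose modeled posterior already matches its target contributes no update.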

[Figure 1 here]

Figure 1. Left: the cost function C_ij as a function of o_i − o_j, for the three values of the target probability P̄ ∈ {0.0, 0.5, 1.0}. Right: combining probabilities: the combined posterior P̄_ik as a function of P = P̄_ij = P̄_jk.
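The propagation properties described above are easy to verify numerically. A short sketch (ours, not part of the paper) evaluating Eq. (5):

    def combine(p_ij, p_jk):
        """Combined posterior Pbar_ik from Eq. (5)."""
        return (p_ij * p_jk) / (1.0 + 2.0 * p_ij * p_jk - p_ij - p_jk)

    # Complete uncertainty and complete certainty propagate exactly:
    assert combine(0.5, 0.5) == 0.5
    assert combine(1.0, 1.0) == 1.0
    assert combine(0.0, 0.0) == 0.0

    # Confidence builds for 0.5 < P < 1:
    print(combine(0.6, 0.6))   # ~0.692, i.e. > 0.6
    # ... and the symmetric effect for 0 < P < 0.5:
    print(combine(0.4, 0.4))   # ~0.308, i.e. < 0.4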

Although the above gives a straightforward method for computing P̄_ij given an arbitrary set of adjacency posteriors, it is instructive to compute the P̄_ij for the special case when all adjacency posteriors are equal to some value P. Then ō_{i,i+1} = log(P/(1 − P)), and ō_{i,i+n} = ō_{i,i+1} + ō_{i+1,i+2} + ... + ō_{i+n−1,i+n} = n ō_{i,i+1} gives

    P_{i,i+n} = Δ^n / (1 + Δ^n)

where Δ ≡ P/(1 − P) is the odds ratio. The expected strengthening (or weakening) of confidence in the ordering of a given pair, as their difference in ranks increases, is then captured by:

Lemma: Let n > 0. Then if P > 1/2, then P_{i,i+n} ≥ P with equality when n = 1, and P_{i,i+n} increases strictly monotonically with n. If P < 1/2, then P_{i,i+n} ≤ P with equality when n = 1, and P_{i,i+n} decreases strictly monotonically with n. If P = 1/2, then P_{i,i+n} = 1/2 for all n.

Proof: Assume that n > 0. Since P_{i,i+n} = 1/(1 + ((1 − P)/P)^n), then for P > 1/2, (1 − P)/P < 1 and the denominator decreases strictly monotonically with n; for P < 1/2, (1 − P)/P > 1 and the denominator increases strictly monotonically with n; and for P = 1/2, P_{i,i+n} = 1/2 by substitution. Finally, if n = 1, then P_{i,i+n} = P by construction. ∎

We end this section with the following observation. In (Hastie & Tibshirani, 1998) and (Bradley & Terry, 1952), the authors consider models of the following form: for some fixed set of events A_1, ..., A_k, pairwise probabilities P(A_i | A_i or A_j) are given, and it is assumed that there is a set of probabilities P̂_i such that P(A_i | A_i or A_j) = P̂_i/(P̂_i + P̂_j). In our model, one might model P̂_i as N exp(o_i), where N is an overall normalization. However the assumption of the existence of such underlying probabilities is overly restrictive for our needs. For example, there exists no underlying P̂_i which reproduces even a simple 'certain' ranking P(A ▷ B) = P(B ▷ C) = P(A ▷ C) = 1.

4. RankNet: Learning to Rank with Neural Nets

The above cost function is quite general; here we explore using it in neural network models, as motivated above. It is useful first to remind the reader of the back-prop equations for a two layer net with q output nodes (LeCun et al., 1998). For the ith training sample, denote the outputs of the net by o_i, the targets by t_i, let the transfer function of each node in the jth layer of nodes be g^j, and let the cost function be Σ_{i=1}^{q} f(o_i, t_i). If α_k are the parameters of the model, then a gradient descent step amounts to δα_k = −η_k ∂f/∂α_k, where the η_k are positive learning rates. The net embodies the function

    o_i = g³( Σ_j w³²_ij g²( Σ_k w²¹_jk x_k + b²_j ) + b³_i ) ≡ g³_i        (7)

where for the weights w and offsets b, the upper indices index the node layer, and the lower indices index the nodes within each corresponding layer. Taking derivatives of f with respect to the parameters gives

    ∂f/∂b³_i = (∂f/∂o_i) g′³_i ≡ Δ³_i        (8)

    ∂f/∂w³²_in = Δ³_i g²_n        (9)

    ∂f/∂b²_m = g′²_m ( Σ_i Δ³_i w³²_im ) ≡ Δ²_m        (10)

    ∂f/∂w²¹_mn = x_n Δ²_m        (11)

where x_n is the nth component of the input.

Turning now to a net with a single output, the above is generalized to the ranking problem as follows. The cost function becomes a function of the difference of the outputs of two consecutive training samples: f(o_2 − o_1). Here it is assumed that the first pattern is known to rank higher than, or equal to, the second (so that, in the first case, f is chosen to be monotonic increasing). Note that f can include parameters encoding the weight assigned to a given pair. A forward prop is performed for the first sample; each node's activation and gradient value are stored; a forward prop is then performed for the second sample, and the activations and gradients are again stored. The gradient of the cost is then ∂f/∂α = (∂o_2/∂α − ∂o_1/∂α) f′. We use the same notation as before but add a subscript, 1 or 2, denoting which pattern is the argument of the given function, and we drop the index on the last layer. Thus, denoting f′ ≡ f′(o_2 − o_1), we have

    ∂f/∂b³ = f′ (g′³_2 − g′³_1) ≡ Δ³_2 − Δ³_1        (12)

    ∂f/∂w³²_m = Δ³_2 g²_2m − Δ³_1 g²_1m        (13)

    ∂f/∂b²_m = Δ³_2 w³²_m g′²_2m − Δ³_1 w³²_m g′²_1m ≡ Δ²_2m − Δ²_1m        (14)

    ∂f/∂w²¹_mn = Δ²_2m g¹_2n − Δ²_1m g¹_1n        (15)

Note that the terms always take the form of the difference of a term depending on x_1 and a term depending on x_2, 'coupled' by an overall multiplicative factor of f′, which depends on both. (Footnote 4: One can also view this as a weight sharing update for a Siamese-like net (Bromley et al., 1993). However Siamese nets use a cosine similarity measure for the cost function, which results in a different form for the update equations.) A sum over weights does not appear because we are considering a two layer net with one output, but for more layers the sum appears as above; thus training RankNet is accomplished by a straightforward modification of back-prop.

5. Experiments on Artificial Data

In this section we report results for RankNet only, in order to validate and explore the approach.

5.1. The Data, and Validation Tests

We created artificial data in d = 50 dimensions by constructing vectors with components chosen randomly in the interval [−1, 1]. We constructed two target ranking functions. For the first, we used a two layer neural net with 50 inputs, 10 hidden units and one output unit, and with weights chosen randomly and uniformly from [−1, 1]. Labels were then computed by passing the data through the net and binning the outputs into one of 6 bins (giving 6 relevance levels). For the second, for each input vector x, we computed the mean of three terms, where each term was scaled to have zero mean and unit variance over the data. The first term was the dot product of x with a fixed random vector. For the second term we computed a random quadratic polynomial by taking consecutive integers 1 through d, randomly permuting them to form a permutation index Q(i), and computing Σ_i x_i x_{Q(i)}. The third term was computed similarly, but using two random permutations to form a random cubic polynomial of the coefficients. The two ranking functions were then used to create 1,000 files with 50 feature vectors each. Thus, as for the search engine task, each file corresponds to 50 documents returned for a single query. Up to 800 files were then used for training, and 100 each for validation and test.

We checked that a net with the same architecture as that used to create the net ranking function (i.e. two layers, ten hidden units), but with first layer weights initialized to zero and second layer weights initialized randomly in [−0.1, 0.1], could learn 1000 train vectors with zero error (this gave 20,382 pairs; for a given query with n_i documents with label i = 1, ..., L, the number of pairs is Σ_{j=2}^{L} n_j Σ_{i=1}^{j−1} n_i). In all our RankNet experiments, the initial learning rate was set to 0.001, and was halved if the average error in an epoch was greater than that of the previous epoch; also, hard target probabilities (1 or 0) were used throughout, except for the experiments in Section 5.2. The number of pairwise errors, and the averaged cost function, were found to decrease approximately monotonically on the training set. The net that gave minimum validation error (9.61%) was saved and used to test on the test set, which gave a 10.01% error rate.
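As a concrete illustration of the training procedure of Section 4, the following sketch (ours, not the authors' implementation; it assumes a tanh hidden layer, a linear output node, hard targets P̄ = 1, and plain gradient descent) performs the two forward props and the pair update of Eqs. (12)-(15):

    import numpy as np

    class TwoLayerRankNet:
        """Minimal RankNet sketch: tanh hidden layer, linear output node.
        Assumed details (ours): tanh transfer, hard target Pbar = 1 for each
        ordered pair, a single fixed learning rate, no pair weighting."""

        def __init__(self, d, h, eta=0.001, seed=0):
            rng = np.random.default_rng(seed)
            self.W1 = rng.uniform(-0.1, 0.1, (h, d))   # first layer weights
            self.b1 = np.zeros(h)
            self.w2 = rng.uniform(-0.1, 0.1, h)        # second layer weights
            self.eta = eta

        def forward(self, x):
            a = np.tanh(self.W1 @ x + self.b1)         # hidden activations (g^2)
            return self.w2 @ a, a                      # output o (linear top node)

        def output_grads(self, x, a):
            # Gradients of the *output* with respect to each parameter.
            dw2 = a
            db1 = self.w2 * (1.0 - a * a)              # back through tanh
            dW1 = np.outer(db1, x)
            return dW1, db1, dw2

        def update_pair(self, x_hi, x_lo):
            """One step on a pair where x_hi is labeled to rank above x_lo."""
            o_hi, a_hi = self.forward(x_hi)
            o_lo, a_lo = self.forward(x_lo)
            # dC/do_ij for Eq. (3) with Pbar = 1: P_ij - 1 (always <= 0).
            dC = 1.0 / (1.0 + np.exp(-(o_hi - o_lo))) - 1.0
            g_hi = self.output_grads(x_hi, a_hi)
            g_lo = self.output_grads(x_lo, a_lo)
            # dC/dalpha = dC/do_ij * (do_hi/dalpha - do_lo/dalpha), Eqs. (12)-(15).
            for name, gh, gl in zip(("W1", "b1", "w2"), g_hi, g_lo):
                setattr(self, name,
                        getattr(self, name) - self.eta * dC * (gh - gl))

    # Usage: net = TwoLayerRankNet(d=50, h=10)
    #        net.update_pair(x_more_relevant, x_less_relevant)

Note that no output-node bias appears: as Eq. (12) shows for a linear top node, such a bias cancels in the difference of the two outputs and so would receive no update.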

Table 1 shows the test error corresponding to minimal validation error for variously sized training sets, for the two tasks, and for a linear net and a two layer net with five hidden units (recall that the random net used to generate the data has ten hidden units). We used validation and test sets of size 5000 feature vectors. The training ran for 100 epochs or until the error on the training set fell to zero. Although the two layer net gives improved performance for the random network data, it does not for the polynomial data; as expected, a random polynomial is a much harder function to learn.

Table 1. Test pairwise % correct for the random network (Net) and random polynomial (Poly) ranking functions.

    Train Size      100     500     2500    12500
    Net, Linear     82.39   88.86   89.91   90.06
    Net, 2 Layer    82.29   88.80   96.94   97.67
    Poly, Linear    59.63   66.68   68.30   69.00
    Poly, 2 Layer   59.54   66.97   68.56   69.27

5.2. Allowing Ties

Table 2 compares results, for the polynomial ranking function, of training on ties, assigning P̄ = 1 for non-ties and P̄ = 0.5 for ties, using a two layer net with 10 hidden units. The numbers of training pairs are shown in parentheses. The table shows the pairwise test error for the network chosen by highest accuracy on the validation set over 100 training epochs. We conclude that, for this kind of data at least, training on ties makes little difference.

Table 2. The effect of training on ties for the polynomial ranking function.

    Train Size      No Ties             All Ties
    100             0.595 (2060)        0.596 (2450)
    500             0.670 (10282)       0.669 (12250)
    1000            0.681 (20452)       0.682 (24500)
    5000            0.690 (101858)      0.688 (122500)

6. Experiments on Real Data

6.1. The Data and Error Metric

We report results on data used by an internet search engine. The data for a given query is constructed from that query and from a precomputed index. Query-dependent features are extracted from the query combined with four different sources: the anchor text, the URL, the document title, and the body of the text. Some additional query-independent features are also used. In all, we use 569 features, many of which are counts. As a preprocessing step we replace the counts by their logs, both to reduce the range and to allow the net to more easily learn multiplicative relationships. The data comprises 17,004 queries for the English / US market, each with up to 1000 returned documents. We shuffled the data and used 2/3 (11,336 queries) for training and 1/6 each (2,834 queries) for validation and testing. For each query, one or more of the returned documents had a manually generated rating, from 1 (meaning 'poor match') to 5 (meaning 'excellent match'). Unlabeled documents were given rating 0. Ranking accuracy was computed using a normalized discounted cumulative gain measure (NDCG) (Jarvelin & Kekalainen, 2000). We chose to compute the NDCG at rank 15, a little beyond the set of documents initially viewed by most users. For a given query q_i, the results are sorted by decreasing score output by the algorithm, and the NDCG is then computed as

    N_i ≡ N̄_i Σ_{j=1}^{15} (2^{r(j)} − 1) / log(1 + j)        (16)

where r(j) is the rating of the jth document, and where the normalization constant N̄_i is chosen so that a perfect ordering gets NDCG score 1. For those queries with fewer than 15 returned documents, the NDCG was computed for all the returned documents. Note that unlabeled documents do not contribute to the sum directly, but will still reduce the NDCG by displacing labeled documents; also note that N_i = 1 is an unlikely event, even for a perfect ranker, since some unlabeled documents may in fact be highly relevant.

The labels were originally collected for evaluation and comparison of top ranked documents, so the 'poor' rating sometimes applied to documents that were still in fact quite relevant. To circumvent this problem, we also trained on randomly chosen unlabeled documents as extra examples of low relevance documents. We chose as many of these as would still fit in memory (2% of the unlabeled training data). This resulted in our training on 384,314 query/document feature vectors, and on 3,464,289 pairs.
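Before turning to the results, here is a compact sketch (ours) of the NDCG-at-15 computation of Eq. (16); note that the base of the logarithm cancels in the normalized score:

    import math

    def ndcg_at_15(ratings):
        """NDCG truncated at rank 15, per Eq. (16).

        `ratings` lists the labels r(j) of the returned documents in the
        order produced by the ranker (index 0 = top). Unlabeled documents
        carry rating 0 and add nothing to the sum, but still displace
        labeled documents, exactly as described in the text."""
        def dcg(rs):
            return sum((2 ** r - 1) / math.log(1 + j)
                       for j, r in enumerate(rs[:15], start=1))
        ideal = dcg(sorted(ratings, reverse=True))   # the perfect ordering
        return dcg(ratings) / ideal if ideal > 0 else 0.0

    # Example: a ranking that places a rating-5 document third.
    print(ndcg_at_15([3, 0, 5, 1, 0, 2]))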

6.2. Results

We trained six systems: for PRank, a linear version, a quadratic kernel version (Crammer & Singer, 2002), and the Online Aggregate PRank - Bayes Point Machine (OAP-BPM), or large margin, version (Harrington, 2003); a single layer net trained with RankProp; and for RankNet, a linear net and a two layer net with 10 hidden units. All tests were performed using a 3GHz machine, and each process was limited to about 1GB of memory. For the kernel PRank model, training was found to be prohibitively slow, with just one epoch taking over 12 hours. Rather than learning with the quadratic kernel and then applying a reduced set method (Burges, 1996), we simply added a further step of preprocessing, taking the features, and every quadratic combination, as a new feature set. Although this resulted in feature vectors in a space of very high (162,734) dimension, it gave a far less complex system than the quadratic kernel. For each test, each algorithm was trained for 100 epochs (or for as many epochs as required so that the training error did not change for ten subsequent epochs), and after each epoch it was tested on the 2,834 query validation set. The model that gave the best validation results was kept, and then used to test on the 2,834 query test set. For large margin PRank, the validation set was also used to choose between three values of the Bernoulli mean, τ = {0.3, 0.5, 0.7} (Harrington, 2003), and to choose the number of perceptrons averaged over; the best validation results were found for τ = 0.3 and 100 perceptrons.

Table 3 collects statistics on the data used; the NDCG results at rank 15 are shown, with 95% confidence intervals, in Table 4. (Footnote 5: We do not report confidence intervals on the validation set since we would still use the mean to decide on which model to use on the test set.) Note that testing was done in batch mode (one query file tested on all models at a time), and so all returned documents for a given query were tested on, and the number of documents used in the validation and test phases are much larger than could be used for training (cf. Table 3). Note also that the fraction of labeled documents in the test set is only approximately 1%, so the low NDCG scores are likely to be due in part to relevant but unlabeled documents being given high rank. Although the difference in NDCG for the linear and two layer nets is not statistically significant at the 5% standard error level, a Wilcoxon rank test shows that the null hypothesis (that the medians are the same) can be rejected at the 16% level. Table 5 shows the results of testing on the training set; comparing Tables 4 and 5 shows that the linear net is functioning at capacity, but that the two layer net may still benefit from more training data. In Table 6 we show the wall clock time for training 100 epochs for each method. The quadratic PRank is slow largely because the quadratic features had to be computed on the fly. No algorithmic speedup techniques (LeCun et al., 1998) were implemented for the neural net training; the optimal net was found at epoch 20 for the linear net and at epoch 22 for the two layer net.

Table 3. Sample sizes used for the experiments.

                Number of Queries   Number of Documents
    Train       11,336              384,314
    Valid       2,834               2,726,714
    Test        2,834               2,715,175

Table 4. Results on the test set. Confidence intervals are the standard error at 95%.

    Mean NDCG       Validation      Test
    Quad PRank      0.379           0.327 ± 0.011
    Linear PRank    0.410           0.412 ± 0.010
    OAP-BPM         0.455           0.454 ± 0.011
    RankProp        0.459           0.460 ± 0.011
    One layer net   0.479           0.477 ± 0.010
    Two layer net   0.489           0.488 ± 0.010

Table 5. Results of testing on the 11,336 query training set.

    Mean NDCG       Training Set
    One layer net   0.479 ± 0.005
    Two layer net   0.500 ± 0.005

Table 6. Training times.

    Model               Train Time
    Linear PRank        0hr 11min
    RankProp            0hr 23min
    One layer RankNet   1hr 7min
    Two layer RankNet   5hr 51min
    OAP-BPM             10hr 23min
    Quad PRank          39hr 52min

7. Discussion

Can these ideas be extended to the kernel learning framework? The starting point is the choice of a suitable cost function and function space (Schölkopf & Smola, 2002). We can again obtain a probabilistic model by writing the objective function as

    F = Σ_{i,j=1}^{m} C(P_ij, P̄_ij) + λ ‖f‖²_H        (17)

where the second (regularization) term is the L2 norm of f in the reproducing kernel Hilbert space H. F differs from the usual setup in that minimizing the first term results in outputs that model posterior probabilities of rank order; it shares the usual setup in the second term. Note that the representer theorem (Kimeldorf & Wahba, 1971; Schölkopf & Smola, 2002) applies to this case also: any solution f* that minimizes (17) can be written in the form

    f*(x) = Σ_{i=1}^{m} α_i k(x, x_i)        (18)

since in the first term on the right of Eq. (17), the modeled function f appears only through its evaluations on training points. One could again certainly minimize Eq. (17) using gradient descent; however, depending on the kernel, the objective function may not be convex. As our work here shows, kernel methods, for large amounts of very noisy training data, must be used with care if the resulting algorithm is to be wieldy.
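To illustrate what such a minimization might look like, here is a sketch (ours; it assumes an RBF kernel, hard targets P̄_ij = 1, and full-batch gradient descent, none of which the paper commits to) of learning the expansion coefficients α_i of Eq. (18):

    import numpy as np

    def rbf_gram(X, gamma=1.0):
        """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)

    def fit_kernel_ranker(X, pairs, lam=0.1, eta=0.01, epochs=100):
        """Gradient descent on Eq. (17), with f = sum_i alpha_i k(., x_i).

        `pairs` holds index pairs (i, j) with x_i known to rank above x_j
        (hard targets). With a positive semidefinite kernel this objective
        is convex in alpha; as the text notes, convexity can fail for other
        kernel choices."""
        K = rbf_gram(X)
        alpha = np.zeros(X.shape[0])
        for _ in range(epochs):
            grad = 2.0 * lam * (K @ alpha)   # gradient of lam * ||f||_H^2
            o = K @ alpha                    # outputs on the training points
            for i, j in pairs:
                p_ij = 1.0 / (1.0 + np.exp(-(o[i] - o[j])))
                # dC/do_ij = P_ij - 1; do_ij/dalpha = K[i] - K[j].
                grad += (p_ij - 1.0) * (K[i] - K[j])
            alpha -= eta * grad
        return alpha

Scoring a new point with Eq. (18) then requires evaluating the kernel against every training sample, which is exactly the test-time cost that makes such methods hard to keep wieldy at this scale.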

8. Conclusions

We have proposed a probabilistic cost for training systems to learn ranking functions using pairs of training examples. The approach can be used for any differentiable function; we explored using a neural network formulation, RankNet. RankNet is simple to train and gives excellent performance on a real world ranking problem with large amounts of data. Comparing the linear RankNet with other linear systems clearly demonstrates the benefit of using our pair-based cost function together with gradient descent; the two layer net gives further improvement. For future work it will be interesting to investigate extending the approach to using other machine learning methods for the ranking function; however evaluation speed and simplicity are critical for such systems.

Acknowledgements

We thank John Platt and Leon Bottou for useful discussions, and Leon Wong and Robert Ragno for their support of this project.

References

Baum, E., & Wilczek, F. (1988). Supervised learning of probability distributions by neural networks. Neural Information Processing Systems (pp. 52-61).

Bradley, R., & Terry, M. (1952). The rank analysis of incomplete block designs 1: The method of paired comparisons. Biometrika, 39, 324-345.

Bromley, J., Bentz, J. W., Bottou, L., Guyon, I., LeCun, Y., Moore, C., Säckinger, E., & Shah, R. (1993). Signature verification using a "Siamese" time delay neural network. Advances in Pattern Recognition Systems using Neural Network Technologies, World Scientific (pp. 25-44).

Burges, C. (1996). Simplified support vector decision rules. Proc. International Conference on Machine Learning (ICML) 13 (pp. 71-77).

Caruana, R., Baluja, S., & Mitchell, T. (1996). Using the future to "sort out" the present: Rankprop and multitask learning for medical risk evaluation. Advances in Neural Information Processing Systems (NIPS) 8 (pp. 959-965).

Crammer, K., & Singer, Y. (2002). Pranking with ranking. NIPS 14.

Dekel, O., Manning, C., & Singer, Y. (2004). Log-linear models for label-ranking. NIPS 16.

Freund, Y., Iyer, R., Schapire, R., & Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933-969.

Harrington, E. (2003). Online ranking/collaborative filtering using the Perceptron algorithm. ICML 20.

Hastie, T., & Tibshirani, R. (1998). Classification by pairwise coupling. NIPS 10.

Herbrich, R., Graepel, T., & Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers, MIT Press (pp. 115-132).

Jarvelin, K., & Kekalainen, J. (2000). IR evaluation methods for retrieving highly relevant documents. Proc. 23rd ACM SIGIR (pp. 41-48).

Kimeldorf, G. S., & Wahba, G. (1971). Some results on Tchebycheffian spline functions. J. Mathematical Analysis and Applications, 33, 82-95.

LeCun, Y., Bottou, L., Orr, G. B., & Müller, K.-R. (1998). Efficient backprop. Neural Networks: Tricks of the Trade, Springer (pp. 9-50).

Mason, L., Baxter, J., Bartlett, P., & Frean, M. (2000). Boosting algorithms as gradient descent. NIPS 12 (pp. 512-518).

Mitchell, T. M. (1997). Machine learning. New York: McGraw-Hill.

Refregier, P., & Vallet, F. (1991). Probabilistic approaches for multiclass classification with neural networks. International Conference on Artificial Neural Networks (pp. 1003-1006).

Schölkopf, B., & Smola, A. (2002). Learning with kernels. MIT Press.