
A New Rank Correlation Coefficient for Information Retrieval

Emine Yilmaz* ([email protected]) and Stephen Robertson* ([email protected]) — *Microsoft Research, 7 JJ Thomson Avenue, Cambridge CB3 0FB, UK
Javed A. Aslam† ([email protected]) — †College of Computer and Information Science, Northeastern University, 360 Huntington Ave, #202 WVH, Boston, MA 02115

ABSTRACT
In the field of information retrieval, one is often faced with the problem of computing the correlation between two ranked lists. The statistic most commonly used to quantify this correlation is Kendall's τ. Often, in the information retrieval community, discrepancies among items having high rankings are more important than those among items having low rankings. The Kendall's τ statistic, however, does not make such distinctions and penalizes errors at high and low rankings equally.
In this paper, we propose a new rank correlation coefficient, AP correlation (τap), that is based on average precision and has a probabilistic interpretation. We show that the proposed statistic gives more weight to the errors at high rankings and has nice mathematical properties which make it easy to interpret. We further validate the applicability of the statistic using experimental data.

Categories and Subject Descriptors: H.3 Information Storage and Retrieval; H.3.4 Systems and Software: Performance Evaluation
General Terms: Experimentation, Measurement, Algorithms
Keywords: Evaluation, Kendall's tau, Average Precision, Rank Correlation

1. INTRODUCTION
Most of the research in the field of information retrieval depends on ranked lists of items. The output of a search engine is a ranked list of documents, the search engines themselves are ranked based on their performance according to different evaluation criteria, the queries submitted to search engines are in turn ranked based on their difficulty, and so on.
Since most of the research in IR is based on ranked lists of items, it is often the case that we need to compare two ranked lists and report the correlation between them. Two of the most commonly used rank correlation statistics are Kendall's τ [7] and the Spearman rank correlation coefficient [15]. The Spearman correlation coefficient is equivalent to the traditional linear correlation coefficient computed on the ranks of items [15]. The Kendall's τ distance between two ranked lists is proportional to the number of pairwise adjacent swaps needed to convert one list into the other.
Kendall's τ has become a standard statistic for comparing two ranked lists. When various methods are proposed to rank items, Kendall's τ is often used to decide which method is better relative to a "gold standard": the higher the correlation between the output ranking of a method and the "gold standard", the better the method is concluded to be. Pairs of rankings whose Kendall's τ values are at or above 0.9 are often considered "effectively equivalent" [13], at least empirically.
For example, Soboroff et al. [12] propose a new method for system evaluation in the absence of relevance judgments and use Kendall's τ to measure the quality of their method. Buckley and Voorhees [2] propose a new evaluation measure, bpref, to evaluate retrieval systems and use Kendall's τ to show that this measure ranks systems similarly to average precision. Similarly, Aslam et al. [1] propose a new method for evaluating retrieval systems with fewer relevance judgments; they compare their method with the depth pooling method by comparing the Kendall's τ correlation of the rankings of systems obtained using each method with the actual rankings of systems. The Kendall's τ statistic has also been used to compare rankings of queries based on their estimated difficulty with the actual ranking of queries [14]. Melucci et al. [8] provide an analysis of the places where Kendall's τ is used in information retrieval.
In most of the places where Kendall's τ is used, authors aim for a Kendall's τ value of 0.9 and conclude that their method produces "good" rankings if they obtain a τ value greater than this threshold [17, 3, 9].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGIR'08, July 20–24, 2008, Singapore.
Copyright 2008 ACM 978-1-60558-164-4/08/07 ...$5.00.

Although Kendall's τ seems to be a reasonable choice for comparing two rankings, there is an important problem with this statistic, at least in the context of IR: Kendall's τ penalizes errors equally wherever they occur in the list. In other words, it does not distinguish errors that occur towards the top of the list from errors towards the bottom of the list [8]. However, in almost all cases in information retrieval we care more about the items that are ranked towards one end of the list (either top or bottom). For example, in TREC, given the rankings of systems, the goal is to identify the best systems. When the goal is to predict query difficulty, it is more important to identify the most difficult queries. Similarly, when comparing the outputs of two search engines, differences towards the top of the two rankings matter more than differences towards the bottom.
As an example of the aforementioned problem, consider eight different retrieval systems, and assume that their actual ranking is <1 2 3 4 5 6 7 8>. Suppose there are two different methods to evaluate these systems with fewer judgments, and one would like to compare the quality of the two methods. Assume that when the first method is used, the systems are ranked as <4 3 1 2 5 6 7 8>, and when the second method is used, they are ranked as <1 2 3 4 8 7 5 6>. The former ranking has the first four systems in inverse order compared with the actual ranking, while the latter has the last four systems in inverse order. The Kendall's τ correlation of each ranking with the actual ranking of the systems is the same (in both cases equal to 0.6429). Hence, based on the Kendall's τ values, the two methods are equivalent in terms of how they rank the systems. Note, however, that in many IR contexts it is much more important to get the "top half" of the list "right" than the "bottom half". Thus, we might well prefer the latter ranking to the former.
In the real world, we are often faced with ranked lists in which there are many mistakes in the rankings of the top (best) items. Figure 1 and Figure 2 show two such cases for TREC 8. In the figures, the leftmost plots show the mean average precision (MAP) values of systems using depth-1 and depth-6 pooling versus the actual MAP values. It can be seen in Figure 1 that the Kendall's τ correlation between the rankings of systems induced by depth-1 MAP values and the actual rankings of systems is 0.733. Hence, based on the Kendall's τ value, there is a positive correlation between the two rankings. However, if we only consider the top (best) 10 systems in TREC 8 (middle plot), it can be seen that these systems are ranked almost in reverse order among themselves. On the other hand, if we only focus on the worst 10 systems (right plot), it can be seen that these systems are ranked almost perfectly with respect to each other. Hence, even though the depth-1 pooling method is quite poor at identifying the best systems, since Kendall's τ does not make any distinction between errors towards the top versus errors towards the bottom, the overall Kendall's τ value is still 0.733. The same behavior can be seen in Figure 2.
(Figure 1 panel Kendall τ values: 0.733 overall, −0.378 for the best 10 systems, 0.955 for the worst 10 systems.)
Many researchers are aware of this flaw in using Kendall's τ [8] and have tried alternatives to it. Voorhees [14] makes use of a new measure for estimating query difficulty by gradually removing the items of interest and comparing how the average values of items change with respect to the average values of the actual items. However, this method cannot be used to measure correlation between ranked lists. To compare rankings of retrieval systems, Wu et al. [16] use an accuracy measure: the total number of items that are common in the top-n of both lists, divided by n, where n is the number of items one is interested in. This approach has the problem that the best possible accuracy value is obtained as long as the top-n of both lists contain the same items, even if the top items are ranked in reverse order in the two lists. Shieh [11] recently devised an extension of Kendall's τ in which errors at different ranks are penalized with different weights. However, this requires assigning arbitrary weights to these errors beforehand, and defining such weights is not easy. Similarly, Fagin et al. [5] proposed an extension of Kendall's τ for comparing top-k lists. Their extension is also based on defining arbitrary penalties when there are errors in rankings. Furthermore, their approach still gives equal weight to errors within the top-k lists and is not very applicable for comparing entire ranked lists while giving more weight to the errors at the top. Haveliwala et al. [6] used the Kruskal–Goodman τ statistic (a statistic very similar to Kendall's τ) to compute correlations in the regions they are interested in (e.g., the top). However, this approach also suffers from the same problem in that it is not very applicable for comparing all the items in the ranked lists at once while giving more weight to the errors at the top.
In this paper, we first show that the problem of evaluating the correlation between two ranked lists is analogous to the problem of evaluating the quality of a search engine, concluding that similar ideas can be used in both cases. We propose a new rank correlation coefficient, AP correlation (τap), that is based on average precision and has a probabilistic interpretation similar to Kendall's τ, while giving more weight to errors nearer the top of the list, as in AP. The proposed statistic has two nice properties: (1) when ranking errors are randomly distributed across the list, the AP correlation value is equal to Kendall's τ in expectation; (2) if there are fewer (more) errors towards the top of the list, the AP correlation value is higher (lower) than the Kendall's τ value, as desired. These two properties make the AP correlation coefficient easy to interpret (by comparison with Kendall's τ) and use. We further demonstrate the applicability of the statistic through experimental data, and we demonstrate with a real example how AP correlation can point out errors that might otherwise be incorrectly ignored when using Kendall's τ.

2. KENDALL'S TAU
The Kendall's τ measure is one of the most commonly used measures for computing the amount of correlation between two rankings. Given two lists of length N, let C be the total number of concordant pairs (pairs that are ranked in the same order in both rankings) and D the total number of discordant pairs (pairs that are ranked in opposite order in the two rankings). Then the Kendall's τ value between the two lists is defined as

τ = (C − D) / (N(N − 1)/2)

Note that, given a ranked list of N objects, there are (N choose 2) = N(N − 1)/2 pairs of items among them (the denominator in the formula). Hence, the Kendall's τ value has the following nice probabilistic interpretation. Consider the following random experiment:
1. Pick a pair of items at random.
2. Return 1 if the pair is ranked in the same order in both lists; otherwise, return 0.
Let p be the expected outcome of this random experiment, i.e., the probability that a randomly chosen pair of items is ranked in the same order in both lists.

Figure 1: (Left) TREC 8 depth-1 pooling MAP vs. actual MAP. (Middle) TREC 8 depth-1 pooling MAP vs. actual MAP for best 10 systems. (Right) TREC 8 depth-1 pooling MAP vs. actual MAP for worst 10 systems.
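The Kendall's τ definition of Section 2, applied to the eight-system example from the introduction, can be sketched as follows (a minimal sketch of ours, not the authors' code; the function name is illustrative):

```python
# Kendall's tau via concordant/discordant pair counts:
# tau = (C - D) / (N(N-1)/2).

from itertools import combinations

def kendall_tau(ranking, actual):
    """Kendall's tau between two rankings of the same items.

    `ranking` and `actual` list the items from first to last place."""
    pos = {item: i for i, item in enumerate(actual)}
    n = len(ranking)
    concordant = discordant = 0
    for a, b in combinations(ranking, 2):  # a appears above b in `ranking`
        if pos[a] < pos[b]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

actual = [1, 2, 3, 4, 5, 6, 7, 8]
# Both example rankings get the same tau (18/28 = 0.6429), although the first
# misranks the best systems and the second misranks the worst ones.
print(kendall_tau([4, 3, 1, 2, 5, 6, 7, 8], actual))  # 0.642857...
print(kendall_tau([1, 2, 3, 4, 8, 7, 5, 6], actual))  # 0.642857...
```

The tie between these two rankings is exactly the behavior the paper sets out to fix.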

(Figure 2 panel Kendall τ values: 0.898 overall, 0.156 for the best 10 systems, 0.909 for the worst 10 systems.)

Figure 2: (Left) TREC 8 depth-6 pooling MAP vs. actual MAP. (Middle) TREC 8 depth-6 pooling MAP vs. actual MAP for best 10 systems. (Right) TREC 8 depth-6 pooling MAP vs. actual MAP for worst 10 systems.

Note that p = C/[N(N − 1)/2], and thus Kendall's τ is

τ = p − (1 − p) = 2p − 1.

As such, Kendall's τ is a monotonic function of the probability that a randomly chosen pair is ordered identically in both rankings. Note that if the two rankings are identical (p = 1), then the Kendall's τ value is 1; if the two rankings perfectly disagree (p = 0), then the Kendall's τ value is −1; and if the two rankings are statistically independent (p = 1/2), then the Kendall's τ value is 0.
In information retrieval, many different evaluation measures have been developed to assess the quality of the output of a search engine (a ranked list of documents). Since the documents retrieved towards the top of the list are more important than the others, evaluation measures that give more importance to the documents retrieved towards the top of the ranking have been proposed.

3. AVERAGE PRECISION
Average precision can be defined as the average of the precisions at relevant documents, where the precisions at unretrieved relevant documents are taken to be zero. It is also known to be an approximation of the area under the precision-recall curve. Since documents that are retrieved at the top of a list are more important than the documents towards the bottom, average precision assigns more weight to errors made towards the top of a ranking than to errors towards the bottom. An intriguing property of average precision is that these weights are assigned such that the value of the measure has a nice probabilistic interpretation.
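The definition of average precision above can be checked numerically. The sketch below (ours, not the authors' code, with a hypothetical ranking) computes AP directly as the mean of the precisions at relevant documents, and also as the exhaustive expectation of a pick-a-relevant-document-then-a-document-above-it experiment; the two agree:

```python
# Average precision two ways: directly, and as the expected outcome of the
# random experiment (pick a relevant document at rank r, pick a document at or
# above r, return its relevance).

def average_precision(relevance, num_relevant):
    """`relevance` is the binary relevance of a ranking, top first.
    Unretrieved relevant documents contribute zero precision."""
    total, seen = 0.0, 0
    for r, rel in enumerate(relevance, start=1):
        if rel:
            seen += 1
            total += seen / r  # precision at rank r
    return total / num_relevant

def ap_as_expectation(relevance, num_relevant):
    """Expected relevance of a random document at or above a randomly chosen
    relevant document's rank (unretrieved relevant documents contribute 0)."""
    total = 0.0
    for r, rel in enumerate(relevance, start=1):
        if rel:
            total += sum(relevance[:r]) / r  # expected relevance of 2nd pick
    return total / num_relevant

ranking = [1, 0, 1, 1, 0, 0]  # hypothetical run; 5 relevant docs exist in all
print(average_precision(ranking, 5))   # 0.48333...
print(ap_as_expectation(ranking, 5))   # identical by construction
```

The expected outcome of the second and third steps is exactly the precision at rank r, which is why the two computations coincide.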
Note that the problem of evaluating the correlation between two ranked lists (when the top items are more important than the others) is similar to the problem of evaluating the quality of a search engine. As such, one can use ideas similar to those used in evaluating the quality of a search engine in order to compare rankings.
Average precision is perhaps the most commonly used evaluation measure. It has a nice probabilistic interpretation that assigns a meaning to the measure. Below, we show that average precision can be defined based on preferences. Based on this preference-based version of average precision and the probabilistic interpretation of the measure, we then propose a new rank correlation coefficient (AP correlation) that also has a nice probabilistic interpretation and that distinguishes between errors made towards the top of a ranked list and errors made towards the bottom.
Recently, Yilmaz and Aslam [17] proposed a new probabilistic interpretation of average precision and used this interpretation to estimate the value of average precision when the relevance judgments are incomplete. According to this interpretation, average precision can be computed as the expected outcome of the following random experiment:
1. Pick a relevant document at random. Let r be the rank of this document.
2. Pick a second document at or above rank r, at random.
3. Return the binary relevance of this second document.
Average precision is the expected outcome of this random experiment: the expected outcome of Steps 2 and 3 is the precision at rank r (which contains a relevant document), and in expectation, Step 1 corresponds to the average of these precisions.
Note that the aforementioned probabilistic interpretation of average precision holds in the case of binary relevance judgments (i.e., a document is either relevant or nonrelevant). It can be altered to hold in the case of preferences as well (i.e., one document is more relevant than another, and so on) as follows:
1. Pick a relevant document at random. Let r be the rank of this document.
2. Pick a second document at or above rank r, at random.
3. Return 1 if the second document is at least as relevant as the first; return 0 otherwise. In other words, return 1 if the documents are in the correct relative order; otherwise, return 0.
Average precision is the expected outcome of this random experiment. Note that the above definition of average precision gives us a way to compute the average precision of a ranked list of documents given the actual ranked list of documents.
As mentioned before, in information retrieval we often face the problem of comparing two ranked lists of items where we care more about the top items than the items towards the bottom, and Kendall's τ does not distinguish between such cases. Average precision itself is such a top-heavy measure. Hence, one could use a rank correlation statistic based on average precision. In the following section, we propose a new rank correlation statistic based on the probabilistic interpretation of average precision, and we show how this new statistic compares with the Kendall's τ statistic.

4. AP CORRELATION COEFFICIENT
One can alter the probabilistic interpretation of average precision [17] slightly to obtain a new rank correlation statistic that gives more importance to the items towards the top of the list. The main idea behind the AP correlation coefficient (τap) is that since the rankings at the top part of the list are more important, given an item, one mainly cares whether this item is ranked correctly with respect to the items above it. Hence, the correlation is based on computing the probability that each item is ranked correctly with respect to the items above it, averaged over all items.
Let list1 and list2 be two lists of items of length N, and suppose list2 is the actual ranking of the items while list1 is a ranking whose correlation with the actual ranking (list2) we would like to compute. Consider the following random experiment:
1. Pick any item from list1, other than the top ranked item, at random.
2. Pick another item from this list that is ranked above the current item, at random.
3. Return 1 if this pair of items is in the same relative order as in list2; otherwise, return 0.
In mathematical terms, the expected outcome of this random experiment can be written as

p′ = 1/(N − 1) · Σ_{i=2}^{N} C(i)/(i − 1)

where C(i) is the number of items above rank i that are correctly ranked with respect to the item at rank i in list1. Note that this random experiment is very similar to the random experiment upon which Kendall's τ is based; the only difference is that instead of comparing an item with any other randomly chosen item, it is compared only with the items ranked above it.
It is easy to see that the value of p′ falls between 0 and 1, where 1 means that all items in list1 are ranked in the same order as in list2, and 0 means that every item is ranked incorrectly with respect to all items above it according to list2 (complete negative correlation). Note that if there are items that are not shared by both lists, then p′ is computed based only on the common items.
We define the AP correlation coefficient as a function of the expected outcome of the above random experiment, in much the same way that Kendall's τ is defined as a function of the outcome of an analogous random experiment, so that its value falls between −1 and +1, the range of values commonly used by correlation coefficients. The AP correlation τap is defined as

τap = p′ − (1 − p′) = 2p′ − 1 = 2/(N − 1) · Σ_{i=2}^{N} [C(i)/(i − 1)] − 1

Figure 3: (Left) AP correlation coefficient vs. Kendall's τ for all possible permutations of a list of length 8. (Right) Symmetrized AP correlation coefficient vs. Kendall's τ for all possible permutations of a list of length 8.

4.1 AP Correlation Coefficient vs. Kendall's tau
Given the two rank correlation statistics, Kendall's τ and the AP correlation coefficient, one might wonder how the values of the two statistics compare for different types of lists. In this section, we show how these two statistics compare when the distribution of errors in the lists changes.

Theorem 1. When the errors are uniformly distributed over the list, Kendall's τ and the AP rank correlation coefficient are equivalent in expectation.

Proof. Suppose we are given an ordered list of items, list1, and a second list with the actual ranking of the items. Suppose that the probability of obtaining a correct pair in list1 is p and all errors are uniformly distributed over the list.
First consider the traditional Kendall's τ value. Based on the given setup, the probability that any randomly picked pair is concordant or discordant is p or 1 − p, respectively. Hence, when the errors are uniformly distributed over the list with probability 1 − p, the value of Kendall's τ is 2p − 1.
Now consider the AP rank correlation coefficient for this setup. Since the errors are uniformly distributed over the entire list with probability 1 − p, at rank i the expected number of concordant items above rank i with respect to the item at rank i is p(i − 1). Hence, the value of the AP rank correlation coefficient can be computed as:

τap = 2/(N − 1) · Σ_{i=2}^{N} C(i)/(i − 1) − 1
    = 2/(N − 1) · Σ_{i=2}^{N} p(i − 1)/(i − 1) − 1
    = 2p − 1
    = τ

Therefore, when the error is uniformly distributed with probability 1 − p over the entire list, the values of Kendall's τ and the AP rank correlation coefficient are the same and both equal 2p − 1. As mentioned before, when the two rankings of items are completely independent of each other, the Kendall's τ value is 0. This corresponds to the case where p = 0.5 and is consistent with the theorem above, since in this case both Kendall's τ and the AP rank correlation coefficient are 0.

Theorem 2. If the probability of error is increasing with rank (more errors towards the bottom of the list than towards the top), then Kendall's τ is always less than the AP rank correlation coefficient. Similarly, if the probability of error is decreasing with rank, then Kendall's τ is always greater than the AP rank correlation coefficient.

Proof. Suppose we are given an ordered list of items, list1, and a second list with the actual ranking of the items. Suppose that the probability of obtaining a correct pair in list1 varies with rank: for an item at rank i, the probability that an item ranked above it is ranked correctly with respect to it is pi.
First, consider the value of the traditional Kendall's τ in this setup. The expected number of items that are ranked above item i and are ranked correctly with respect to item i is pi(i − 1). For each rank i, we can compute the expected number of concordant items above that rank with respect to the item at rank i and sum these values to obtain the total expected number of concordant pairs in the list. Therefore the expected Kendall's τ value is:

E[τ] = [2 · Σ_{i=2}^{N} pi(i − 1)] / [N(N − 1)/2] − 1

Now consider the AP rank correlation coefficient in the same setup. Using the same idea as for Kendall's τ, the expected AP rank correlation coefficient can be written as:

E[τap] = 2/(N − 1) · Σ_{i=2}^{N} pi(i − 1)/(i − 1) − 1 = 2/(N − 1) · Σ_{i=2}^{N} pi − 1

To see how the two statistics change with respect to each other under the probabilities pi, we compute the difference between the AP rank correlation coefficient and Kendall's τ:

E[τap − τ] = 2/(N − 1) · Σ_{i=2}^{N} pi − [2 · Σ_{i=2}^{N} pi(i − 1)] / [N(N − 1)/2]
           = 2/(N − 1) · ( Σ_{i=2}^{N} pi − [2 · Σ_{i=2}^{N} pi(i − 1)] / N )
           = 2/(N(N − 1)) · Σ_{i=2}^{N} (N − 2i + 2) pi

The constant factor 2/(N(N − 1)) can be ignored. Expanding the remaining sum gives

(N − 2)p2 + (N − 4)p3 + (N − 6)p4 + · · · + 2p_{N/2} + 0·p_{N/2+1} − 2p_{N/2+2} − · · · − (N − 4)p_{N−1} − (N − 2)p_N

which can be rewritten as

E[τap − τ] ∝ Σ_{i=2}^{N/2} (N − 2i + 2)(pi − p_{N−i+2})

Note that this summation is always greater than zero when the probabilities pi are decreasing in i and always smaller than zero when they are increasing. Hence, when the probability of error in the list increases (or decreases) as we go lower in the ranking, the AP rank correlation coefficient is always larger (or smaller) than Kendall's τ.

Note that the above proof shows that if the probability of error is decreasing with rank (more errors at the top), then E[τap − τ] < 0 (and vice versa). However, this does not imply that if E[τap − τ] < 0 then the probability of error is necessarily decreasing with rank; in fact, one can identify particular distributions of error for which this statement does not hold. However, in the next sections we show using real data that it is often the case that for lists for which τap ≈ τ, the errors are quite random over the entire list, and for those with τap > τ, there are fewer mistakes towards the top of the list.
The left plot in Figure 3 shows the value of the AP correlation coefficient vs. Kendall's τ for all possible permutations of a list of length 8. It can be seen that for a given value of Kendall's τ, there are many lists with different τap values, and vice versa. Later on, we show that for the lists for which τap ≈ τ, the errors are quite random over the list, and for those with τap > τ, there are fewer mistakes towards the top of the list, as proved. Furthermore, we show that the proposed statistic can produce more desirable orderings of correlations among lists (higher values when the top ranked items are correctly identified, decreasing as errors appear at the top).

4.2 Symmetric AP Correlation Coefficient
Note that the AP correlation coefficient is not a symmetric statistic. It assumes that there is an actual ranked list (list2) and an estimated ranked list (list1), and one computes the correlation of list1 given list2 (τap(list1|list2)). If we were instead to compute the AP correlation of list2 given list1, we would obtain a different value for the statistic. In information retrieval, it is often the case that we would like to compare the correlation of an estimated ranked list against an actual ranked list (e.g., the actual rankings of systems).
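Returning to the τap formula of Section 4, it can be sketched as follows (a sketch of ours; the function name is illustrative). Applied to the introduction's eight-system example, it separates the two rankings that Kendall's τ ties at 0.6429: the ranking with errors at the top scores much lower than the one with errors at the bottom.

```python
# tau_ap = 2/(N-1) * sum_{i=2}^{N} C(i)/(i-1) - 1, where C(i) counts the items
# above rank i in `ranking` that are correctly ordered with respect to the
# item at rank i according to `actual`.

def tau_ap(ranking, actual):
    pos = {item: i for i, item in enumerate(actual)}
    n = len(ranking)
    total = 0.0
    for i in range(1, n):  # 0-based index i corresponds to rank i + 1
        correct = sum(1 for j in range(i) if pos[ranking[j]] < pos[ranking[i]])
        total += correct / i
    return 2 * total / (n - 1) - 1

actual = [1, 2, 3, 4, 5, 6, 7, 8]
print(tau_ap([4, 3, 1, 2, 5, 6, 7, 8], actual))  # ~0.238: errors at the top
print(tau_ap([1, 2, 3, 4, 8, 7, 5, 6], actual))  # ~0.766: errors at the bottom
```

Both rankings have Kendall's τ ≈ 0.643, so τap supplies exactly the top-weighted distinction the paper argues for.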

40 40 40

60 60 60

80 80 80 Depth 1 rank Random rank

100 100 Reversed depth 1 rank 100

Kendall τ = 0.733 Kendall τ = 0.733 120 ap corr = 0.609 120 ap corr = 0.736 120 sym ap corr = 0.575 sym ap corr = 0.715

120 100 80 60 40 20 0 120 100 80 60 40 20 0 120 100 80 60 40 20 0 Actual rank Actual rank Reversed actual rank

Figure 4: (Left) TREC 8 depth-1 pooling ranking of systems vs. actual ranking of systems. (Middle) A random list of size equal to the number of systems submitted to TREC 8, with the same Kendall τ value as depth-1 pooling. (Right) TREC 8 depth-1 pooling ranking of systems vs. actual ranking of systems when the rankings are reversed (best systems become worst and worst systems become best).

Hence, in such situations, AP correlation can be used. However, in some cases one would simply like to compute the correlation between two ranked lists where there is no notion of an "actual" ranking. In such cases, a symmetrized version of the statistic (the symmetrized AP correlation coefficient) can be used. The symmetrized AP correlation coefficient can be computed based on the same idea used in defining the symmetrized version of the KL distance measure [4]:

sym τap(list1, list2) = (τap(list1|list2) + τap(list2|list1)) / 2

In the leftmost plot of Figure 4, where there are more errors towards the top of the list than towards the bottom, it can be seen that the AP correlation is less than Kendall's τ. Similarly, when the top systems are ranked mostly correctly (rightmost plot), the value of the AP correlation coefficient increases and is higher than the Kendall's τ value. Hence, even though all three plots are equivalent in terms of their Kendall's τ value, the proposed correlation coefficient correctly identifies the distinction among these three cases, as desired.
The fact that Kendall's τ gives equal weight to all errors in the rankings also has practical implications for IR research.
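The symmetrized statistic defined above can be sketched as follows (ours, not the authors' code; τap is rederived from the Section 4 formula and the helper names are illustrative):

```python
# Symmetrized AP correlation: average tau_ap with each list treated as the
# "actual" ranking in turn, since tau_ap itself is asymmetric.

def tau_ap(ranking, actual):
    pos = {item: i for i, item in enumerate(actual)}
    n = len(ranking)
    total = sum(
        sum(1 for j in range(i) if pos[ranking[j]] < pos[ranking[i]]) / i
        for i in range(1, n)
    )
    return 2 * total / (n - 1) - 1

def sym_tau_ap(list1, list2):
    """sym tau_ap(list1, list2) = (tau_ap(list1|list2) + tau_ap(list2|list1)) / 2"""
    return (tau_ap(list1, list2) + tau_ap(list2, list1)) / 2

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [4, 3, 1, 2, 5, 6, 7, 8]
print(tau_ap(b, a), tau_ap(a, b))  # generally different: tau_ap is asymmetric
print(sym_tau_ap(a, b))            # symmetric average of the two
```

For this pair the two directed values differ (5/21 vs. 3/7), which is exactly why the symmetrized average is needed when neither list is the "actual" ranking.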
Hence, the symmetrized AP correlation coefficient is the average of the AP correlation coefficients obtained when each list in turn is treated as the actual list.
The right plot in Figure 3 shows the value of the symmetrized AP correlation coefficient vs. Kendall's τ for all possible permutations of a list of length 8. It can be seen that the shape of the curve is similar to that for the AP correlation coefficient (Figure 3, left).

5. EXPERIMENTAL RESULTS
We have shown theoretically that when the errors are randomly distributed over the entire list, τap = τ; when there are more errors towards the top of the list, τap < τ; and when there are fewer errors towards the top of the list, τap > τ. What this means is that this new measure can correctly order correlations among lists (lists with fewer errors towards the top have higher AP correlations, and so on). Figure 4 demonstrates this behavior in practice.
For example, as mentioned before, in information retrieval, pairs of rankings whose Kendall's τ values are at or above 0.9 are often considered effectively equivalent [13]. As also mentioned by Sanderson et al. [10], this threshold has implicitly or explicitly been used in many places to reach conclusions regarding the quality of rankings. The works of Carterette et al. [3], Sanderson et al. [9], and Yilmaz et al. [17] are just a few examples of the many such research papers.
In Sanderson et al. [9], for example, the authors claim that since manual runs identify most of the relevant documents, instead of judging the entire depth pool one could judge only the outputs of manual runs and use these judgments to evaluate the quality of retrieval systems. In order to test their claim, they take a manual run out and form a qrel using only judgments for documents that are retrieved by this run and are also contained in the actual TREC qrels. Then, using these qrels, they evaluate the quality of retrieval systems by mean average precision and compare the induced
rankings of systems with the actual rankings using Kendall's τ. They report that for most of the manual runs, the τ correlations are at or above 0.9, and they therefore conclude that qrels formed from the output of a single manual run can be effectively used to evaluate the quality of retrieval systems.
Figure 4 shows three different rankings with identical Kendall's τ values. The leftmost plot illustrates the ranking of systems according to depth-1 pooling versus the actual ranks of systems in TREC 8 (the top right corner of the plots refers to the top ranked systems). The rightmost plot shows the ranks of the systems in TREC 8 based on depth-1 pooling when the rankings are reversed (worst systems become the best and best systems become the worst); note that the rightmost plot corresponds to rotating the leftmost plot by 180°. The middle plot corresponds to a random ranked list that has random errors along the entire list, such that the Kendall's τ of the list is equal to the Kendall's τ of depth-1 pooling on that TREC. It can be seen that when the errors are randomly distributed over the list (middle plot), the AP correlation coefficient is approximately equal to Kendall's τ, as expected in expectation.
Figure 5 shows an example of these experiments. The upper and lower plots show how qrels formed from the manual runs Brkly26 from TREC 7 and CL99SDopt2 from TREC 8 evaluate retrieval systems in these TRECs, respectively. The x axis in these plots is the actual MAP value for the systems, and the y axis shows the MAP value of the systems using these qrels.
(Figure 5, upper panels — Brkly26: Kendall τ = 0.9555 for automatic runs only, 0.9543 for all runs; AP corr = 0.9394, 0.8639; sym AP corr = 0.9394, 0.8791.)
The leftmost plots show the rankings of systems for only the automatic runs submitted to these TRECs, whereas the

0.37 0.4 0.4

0.36 0.3 0.3 0.35

0.2 0.2 0.34 MAP from manual run MAP from manual run MAP from manual run

0.1 0.1 0.33

0 0 0.32 0 0.1 0.2 0.3 0.4 0.5 0.6 0 0.1 0.2 0.3 0.4 0.5 0.6 0.29 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 Actual MAP Actual MAP Actual MAP Qrels formed from input.CL99SDopt2, automatic systems Qrels formed from input.CL99SDopt2, all systems Qrels formed from input.CL99SDopt2,top 12 systems 0.44 Kendall ! = 0.9396 Kendall ! = 0.9454 ap corr = 0.8758 ap corr = 0.8538 symm ap corr = 0.8733 symm ap corr = 0.8601 0.5 0.5 0.42

0.4 0.4 0.4

0.3 0.3 0.38

0.2 0.2 0.36 MAP from manual run MAP from manual run MAP from manual run

0.1 0.1 0.34

0 0 0.32 0 0.1 0.2 0.3 0.4 0.5 0.6 0 0.1 0.2 0.3 0.4 0.5 0.6 0.32 0.34 0.36 0.38 0.4 0.42 0.44 0.46 0.48 Actual MAP Actual MAP Actual MAP

Figure 5: Rankings of systems induced by qrels formed from manual runs (top) Brkly26 in TREC 7 and (bottom) CL99SDopt2 in TREC 8, compared to actual rankings. middle plot contains rankings for all systems (manual and seen by the the rightmost plot). Surprisingly, the Kendall’s automatic). The rightmost plots focus only on the top 12 τ value in this case is higher than the Kendall’s τ value systems (sorted by the actual mean average precision value) when only automatic runs are included. Once again, the in these TRECs and show how the manual qrels rank these AP correlation correctly distinguishes these cases. top systems as compared to their actual rankings. Both of these examples show that even though the First consider the plots for TREC 7. By looking just Kendall’s τ statistic shows very good correlations, the qrels at the Kendall’s τ values, one would conclude that qrels formed from a single manual system cannot be reliably used formed from this manual run are very effective at evaluating to evaluate the quality of all other systems (including other systems (both when only automatic runs are included — τ manual systems); the AP correlation statistic points out this value of 0.9555 — and when all runs are included — τ value issue more reliably than Kendall’s τ. This problem with the of 0.9543). Although this may be a reasonable conclusion Kendall’s τ statistic was also discussed by Sanderson and when only automatic runs are considered, the conclusion Soboroff in a similar context [10]. is much less valid when all runs are considered, since the To further demonstrate the applicability of the AP corre- topmost systems are not ranked correctly (middle plot). 
As lation statistic against Kendall’s τ, we generated 100 random shown by the rightmost plot, when all runs are included lists of length 50 that have a Kendall’s τ value of 0.9 (using the best system according to these qrels is actually the 11th dynamic programming), and we scatter all such lists versus best system, the 2nd best system is actually the 3rd best the actual ranking (1 2 ... 50), in a manner similar to the system, and so on. Furthermore, although the correlations middle plot of Figure 4 superimposing all 100 scatter plots in the middle plot looks worse than the correlations in the on top of each other. In this way, we can see the trend of left plot, the τ values are very similar. However, the AP error, i.e. which part of the ranking is the error likely to correlation (or symmetrized AP correlation) correctly takes occur at. The left plot in Figure 6 shows the result of this these changes into account and correctly distinguishes these experiment. It can be seen that for a list with Kendall’s two cases, correctly pointing out that the correlation in the τ value of 0.9, the errors are likely to occur at any part of middle plot is worse than in the left plot. the list, hence the top items may not be identified correctly. Similar behavior can be seen for qrels formed from Similarly, to see the trend of error in the lists that have an CL99SDOpt2 in TREC 8. In this case, even when only AP correlation value of 0.9, we generated 100 random lists automatic runs are considered (left plot), the rankings of of length 50 that have an AP correlation value of 0.9. The the top-most systems are not perfect. Therefore, the AP right plot in Figure 6 shows the result of this experiment. It correlation is not so high even for this case where Kendall’s can be seen that the lists that have an AP correlation of 0.9 τ is 0.9396. 
When all runs are included (middle plot), the are very unlikely to have errors towards the top of the list, Kendall’s τ value is 0.9454 although the correlations among and most of the errors tend to be towards the bottom, rein- the top systems become even worse (top system according forcing our conclusion that AP correlation coefficient can be to the actual qrels are identified as the 4th best system us- effectively used in the cases where we are mostly interested ing these qrels, and the best system according to these qrels in identifying these top ranked items. is actually the 7th best system if actual qrels were used as 100 random list 50 items with Kendall τ = 0.900 100 random list 50 items with AP corr = 0.900 0 0 [3] B. Carterette and J. Allan. Incremental test 5 5 collections. In CIKM ’05: Proceedings of the 14th 10 10 15 15 ACM international conference on Information and 20 20 knowledge management, pages 680–687, New York, 25 25
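To make the behavior discussed above concrete, both statistics can be computed directly from their definitions: Kendall's τ from the balance of concordant and discordant pairs, and τap by asking, for each item below the top, what fraction of the items ranked above it truly belong above it. The sketch below is an illustrative Python implementation, not the code used for these experiments; the item lists and the `symmetric_ap` helper are our own. It reproduces the qualitative result: two rankings containing the same single swap, once at the top and once at the bottom, have identical Kendall's τ but different AP correlations.

```python
from itertools import combinations

def kendall_tau(ranking, truth):
    """Kendall's tau between an estimated ranking and the true ranking.
    Both are lists of item ids, best item first."""
    pos = {item: i for i, item in enumerate(truth)}
    n = len(ranking)
    concordant = sum(1 for a, b in combinations(ranking, 2) if pos[a] < pos[b])
    discordant = n * (n - 1) // 2 - concordant
    return (concordant - discordant) / (n * (n - 1) / 2)

def ap_correlation(ranking, truth):
    """tau_ap: average, over each rank i = 2..N, of the fraction of items
    ranked above position i that the true ranking also places above the
    item at position i, rescaled from [0, 1] to [-1, 1]."""
    pos = {item: i for i, item in enumerate(truth)}
    n = len(ranking)
    total = 0.0
    for i in range(1, n):  # 0-based index i corresponds to rank i + 1
        correct = sum(1 for j in range(i) if pos[ranking[j]] < pos[ranking[i]])
        total += correct / i
    return 2 * total / (n - 1) - 1

def symmetric_ap(a, b):
    # symmetrized version: average of the two directed AP correlations
    return (ap_correlation(a, b) + ap_correlation(b, a)) / 2

truth = list(range(10))        # actual ranking: item 0 best, item 9 worst
top_err = [1, 0] + truth[2:]   # one adjacent swap at the very top
bot_err = truth[:8] + [9, 8]   # the same-size swap at the bottom
assert kendall_tau(top_err, truth) == kendall_tau(bot_err, truth)
print(ap_correlation(top_err, truth))  # lower than Kendall's tau
print(ap_correlation(bot_err, truth))  # higher than Kendall's tau
```

Here Kendall's τ is 43/45 ≈ 0.956 for both lists, while τap is 7/9 ≈ 0.778 when the swap is at the top and 79/81 ≈ 0.975 when it is at the bottom, matching the theoretical behavior stated at the start of this section.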

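The text says only that the fixed-τ lists were generated "using dynamic programming", without details. One standard approach, which we assume here purely for illustration, uses the classic inversion-count recurrence: the number of permutations of m items with exactly j inversions is the sum of counts for m−1 items over the inversions the m-th placement can contribute. Sampling positions left to right with weights from this table gives a uniform draw among permutations with exactly k discordant pairs. Since τ = 1 − 4k/(n(n−1)), a τ of exactly 0.9 is not attainable for n = 50 (it would require k = 61.25), so k = 61 (τ ≈ 0.9004) is the nearest choice.

```python
import random

def inversion_dp(n, k):
    # dp[m][j] = number of permutations of m items with exactly j inversions
    dp = [[0] * (k + 1) for _ in range(n + 1)]
    dp[0][0] = 1
    for m in range(1, n + 1):
        for j in range(k + 1):
            dp[m][j] = sum(dp[m - 1][j - t] for t in range(min(m - 1, j) + 1))
    return dp

def sample_with_inversions(n, k, rng=random, dp=None):
    """Uniform random permutation of range(n) with exactly k inverted
    (discordant) pairs relative to the identity ranking."""
    if dp is None:
        dp = inversion_dp(n, k)
    remaining = list(range(n))
    perm, rem = [], k
    for i in range(n):
        m = n - i  # items still to place
        # picking the d-th smallest remaining item creates d new inversions
        weights = [dp[m - 1][rem - d] for d in range(min(m - 1, rem) + 1)]
        d = rng.choices(range(len(weights)), weights=weights)[0]
        perm.append(remaining.pop(d))
        rem -= d
    return perm

def inversions(p):
    return sum(p[i] > p[j] for i in range(len(p)) for j in range(i + 1, len(p)))

n, k = 50, 61  # tau = 1 - 4*61/(50*49) ≈ 0.9004, closest attainable to 0.9
table = inversion_dp(n, k)
lists = [sample_with_inversions(n, k, random.Random(s), table) for s in range(100)]
assert all(inversions(p) == k for p in lists)
```

Superimposing the 100 sampled permutations against the identity ranking (1 2 ... 50) reproduces the kind of scatter shown in the left plot of Figure 6.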
[Figure 6: two scatter plots of actual rank (x axis) versus randomly generated rank (y axis), each superimposing 100 lists of 50 items; left panel with Kendall τ = 0.900, right panel with AP corr = 0.900.]

Figure 6: (Left) 100 randomly generated lists of length 50 that have a Kendall's τ value of 0.9. (Right) 100 randomly generated lists of length 50 that have an AP correlation value of 0.9.

6. CONCLUSIONS

We propose a new rank correlation coefficient that has a nice probabilistic interpretation and more heavily weights the errors towards the top of the ranking. The statistic has the nice property that (1) if errors are uniformly distributed over the entire ranking, the statistic is equal to Kendall's τ in expectation, and (2) if there is more error towards the top/bottom of the list, the value of the statistic is lower/higher than Kendall's τ. We show, theoretically and experimentally through TREC data, the applicability of this measure and conclude that in the context of information retrieval, where the goal is often to identify the best systems while still having a good overall ranking, this statistic is better suited than Kendall's τ for computing the correlation between two rankings.

The statistic as defined in this paper is not symmetric, hence it is not a metric. Although we show how this statistic can be extended to be symmetric, throughout the paper we mainly focus on the behavior of the nonsymmetric version, as it is simpler to analyze. The overall behavior of the symmetric version of the statistic will be analyzed in the future. Furthermore, the Kendall's τ statistic has an associated distribution, enabling one to do hypothesis testing. In the future, we plan to investigate the distribution of the proposed statistics to enable making better inferences about their values.

Acknowledgments

This work was supported by the National Science Foundation under Grant No. IIS-0534482. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

7. REFERENCES

[1] J. A. Aslam, V. Pavlu, and R. Savell. A unified model for metasearch, pooling, and system evaluation. In O. Frieder, J. Hammer, S. Quershi, and L. Seligman, editors, Proceedings of the Twelfth International Conference on Information and Knowledge Management, pages 484–491. ACM Press, November 2003.
[2] C. Buckley and E. M. Voorhees. Retrieval evaluation with incomplete information. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 25–32, New York, NY, USA, 2004. ACM Press.
[3] B. Carterette and J. Allan. Incremental test collections. In CIKM '05: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, pages 680–687, New York, NY, USA, 2005. ACM Press.
[4] T. Cover and J. Thomas. Elements of Information Theory. Wiley, 1991.
[5] R. Fagin, R. Kumar, and D. Sivakumar. Comparing top k lists. In SODA '03: Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 28–36, Philadelphia, PA, USA, 2003. Society for Industrial and Applied Mathematics.
[6] T. H. Haveliwala, A. Gionis, D. Klein, and P. Indyk. Evaluating strategies for similarity search on the web. In WWW '02: Proceedings of the 11th International Conference on World Wide Web, pages 432–442, New York, NY, USA, 2002. ACM.
[7] M. Kendall. A new measure of rank correlation. Biometrika, 30(1–2):81–89, 1938.
[8] M. Melucci. On rank correlation in information retrieval evaluation. SIGIR Forum, 41(1):18–33, 2007.
[9] M. Sanderson and H. Joho. Forming test collections with no system pooling. In SIGIR '04: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 33–40, New York, NY, USA, 2004. ACM.
[10] M. Sanderson and I. Soboroff. Problems with Kendall's tau. In SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 839–840, New York, NY, USA, 2007. ACM.
[11] G. S. Shieh. A weighted Kendall's tau statistic. Statistics & Probability Letters, 39:17–24, 1998.
[12] I. Soboroff, C. Nicholas, and P. Cahan. Ranking retrieval systems without relevance judgments. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 66–73, New Orleans, Louisiana, USA, September 2001. ACM Press.
[13] E. M. Voorhees. Evaluation by highly relevant documents. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 74–82. ACM Press, 2001.
[14] E. M. Voorhees. Overview of the TREC 2004 robust retrieval track. In Proceedings of the Thirteenth Text REtrieval Conference (TREC 2004), 2004.
[15] D. D. Wackerly, W. Mendenhall, and R. L. Scheaffer. Mathematical Statistics with Applications. Duxbury Advanced Series, 2002.
[16] S. Wu and F. Crestani. Methods for ranking information retrieval systems without relevance judgments. In SAC '03: Proceedings of the 2003 ACM Symposium on Applied Computing, pages 811–816, 2003.
[17] E. Yilmaz and J. A. Aslam. Estimating average precision with incomplete and imperfect judgments. In Proceedings of the Fifteenth ACM International Conference on Information and Knowledge Management. ACM Press, November 2006.