WWW 2010 • Full Paper April 26-30 • Raleigh • NC • USA

Visualizing Differences in Web Search Algorithms Using the Expected Weighted Hoeffding Distance

Mingxuan Sun, Georgia Inst. of Technology, Atlanta, GA 30332, [email protected]
Guy Lebanon, Georgia Inst. of Technology, Atlanta, GA 30332, [email protected]
Kevyn Collins-Thompson, Microsoft Research, Redmond, WA 98052, [email protected]

ABSTRACT

We introduce a new dissimilarity function for ranked lists, the expected weighted Hoeffding distance, that has several advantages over current dissimilarity measures for ranked search results. First, it is easily customized for users who pay varying degrees of attention to websites at different ranks. Second, unlike existing measures such as generalized Kendall's tau, it is based on a true metric, preserving meaningful embeddings when visualization techniques like multi-dimensional scaling are applied. Third, our measure can effectively handle partial or missing rank information while retaining a probabilistic interpretation. Finally, the measure can be made computationally tractable and we give a highly efficient algorithm for computing it. We then apply our new metric with multi-dimensional scaling to visualize and explore relationships between the result sets from different search engines, showing how the weighted Hoeffding distance can distinguish important differences in search engine behavior that are not apparent with other rank-distance metrics. Such visualizations are highly effective at summarizing and analyzing insights on which search engines to use, what search strategies users can employ, and how search results evolve over time. We demonstrate our techniques using a collection of popular search engines, a representative set of queries, and frequently used query manipulation methods.

Categories and Subject Descriptors

H.3.3 [Information Systems]: Information Search and Retrieval; G.2.1 [Discrete Mathematics]: Combinatorics—Permutations and combinations

General Terms

Algorithms, Measurement, Experimentation

Keywords

Expected Weighted Hoeffding Distance, Search Algorithm Dissimilarity, Ranking

1. INTRODUCTION

Search engines return ranked lists of documents or websites in response to a query, with the precise forms of the ranked lists depending on the internal mechanisms of the engines. We consider the problems of comparing and visualizing the similarity relationships between different search algorithms. The term search algorithm is an intentionally vague term corresponding to a mechanism for producing ranked lists of websites in response to queries. We focus on the following three interpretations of search algorithms: (a) different search engines, (b) a single search engine subjected to different query manipulation techniques by the user, and (c) a single engine queried across different internal states.

A visualization of relationships among type-(a) search algorithms may be useful as it reveals which search engines users should or should not use. For example, two search engines that output very similar ranked lists may provide redundant information to the user. On the other hand, two search engines that output very dissimilar lists are worth examining more closely. We emphasize that we do not consider here the issue of search quality or relevance. The techniques developed in this paper allow users to understand the similarity relationships among search algorithms. Evaluating retrieval quality is a separate, well-studied problem that is beyond the scope of this paper.

Visualizing relationships among type-(b) search algorithms is useful as it indicates which query manipulation techniques are worth using. For example, a query times square may be manipulated by the user (prior to entering it in the search box) to times AND square. Such manipulations may result in different ranked lists output by the search engine. Understanding the similarities between these ranked lists may provide the user with insights into which manipulations are redundant and which are worth exploring further.

Understanding the relationships among type-(c) search algorithms is useful from the point of view of design and modification of search engines. Search engines are complex programs containing many tunable parameters, each one influencing the formation of the ranked list in a different way. For example, commercial search engines frequently update the internal index, which may result in different ranked lists output on different dates (in response to the same query). It is important for both the users and the search engineers to understand how much the ranked lists differ across consecutive days and how much this difference changes as internal parameters are modified.

There are several techniques for visualizing complex data such as search algorithms. One of the most popular techniques is multidimensional scaling (MDS) [3], which transforms complex high dimensional data s_1, …, s_m into 2-D vectors z_1, …, z_m that are easily visualized by displaying them on a 2-D scatter plot.

Assuming that a suitable dissimilarity measure ρ(s_i, s_j) between the high dimensional data has been identified, MDS computes the 2-D embedding of the high dimensional data s_i ↦ z_i ∈ R^2, i = 1, …, m, that minimizes the spatial distortion caused by the embedding

R(z_1, …, z_m) = \sum_{i<j} ( ρ(s_i, s_j) − ‖z_i − z_j‖ )^2.    (1)

…emphasizes disagreement at top ranks, where the weight function decays linearly with the rank. While much previous work on visualization of search al…


…to ranked lists and s_i(q) represents the ranked list retrieved by s_i in response to the query q.

The function ρ is extended to search algorithms by taking another expectation, this time with respect to queries q sampled from a representative set of queries Q. The expectations ensure that properties (ii) and (v) are satisfied. We derive an efficient closed form for the double expectation that verifies property (iv) in Sec. 3.2 and give a pseudo-code implementation. Formally, we define

ρ(s_i, s_j) = E_{q∼Q}{ρ(s_i(q), s_j(q))} = E_{q∼Q} E_{π∼S(s_i(q))} E_{σ∼S(s_j(q))}{d_w(π, σ)}    (3)

where d_w(π, σ) is a distance between permutations π, σ defined in Section 3.1, E_{π∼S(s_i(q))} E_{σ∼S(s_j(q))} is the expectation with respect to permutations π, σ that are sampled from the sets of all permutations consistent with the ranked lists output by the two algorithms s_i(q), s_j(q) (respectively), and E_{q∼Q} is an expectation with respect to all queries sampled from a representative set of queries Q. In the absence of any evidence to the contrary, we assume a uniform distribution over the set of queries Q and over the sets of permutations consistent with s_i(q), s_j(q).

We proceed with a description of d_w(π, σ) in Section 3.1 and then follow up in Section 3.2 with additional details regarding the expectations in (3) and how to compute them.

3.1 Weighted Hoeffding Distance

The weighted Hoeffding distance is a distance between permutations, here considered as permutations over the n indexed websites in the internet³. The fact that n is extremely large should not bother us at this point as we will derive closed form expressions eliminating any online complexity terms depending on n. For simplicity we refer to the websites using the integers {1, …, n}.

A permutation π over n websites is a bijection from {1, …, n} to itself mapping websites to ranks. That is, π(6) is the rank given to website 6 and π^{-1}(2) is the website that is assigned second rank. A permutation is thus a full ordering over the entire web and we denote the set of all such permutations by S_n. We will represent a permutation by a sorted list of websites from most preferred to least, separated by vertical bars, i.e. π^{-1}(1)|⋯|π^{-1}(n); for example, for n = 5 one permutation ranking item 3 first and item 2 last is 3|5|1|4|2.

Our proposed distance d_w(π, σ) is a variation of the earth mover's distance⁴ [18] on permutations. It may also be regarded as a weighted version of the Hoeffding distance [15]. It is best described as the minimum amount of work needed to transform the permutation π to σ. Work, in this case, is the total amount of work needed to bring each item from its rank in π to its rank in σ, i.e., item r is transported from rank k = π(r) to l = σ(r) (for all r = 1, …, n) requiring w_k + ⋯ + w_{l−1} work (assuming k < l), where w_k is the work required to transport an item from rank k to k + 1. For example, the distance d(1|2|3, 2|1|3) is w_1 + w_1 due to the sequence of moves 1|2|3 → |1,2|3 → 2|1|3. Another example is d(1|2|3, 3|1|2) = w_1 + w_2 + w_2 + w_1 due to the sequence of moves 1|2|3 → |1,2|3 → |1|2,3 → |1,3|2 → 3|1|2.

Formally, the distance may be written as

d_w(π, σ) = \sum_{r=1}^{n} d'_w(π(r), σ(r))    (4)

where

d'_w(u, v) = \begin{cases} \sum_{t=u}^{v−1} w_t & \text{if } u < v \\ d'_w(v, u) & \text{if } u > v \\ 0 & \text{otherwise.} \end{cases}    (5)

The weight vector w = (w_1, …, w_{n−1}) allows differentiating the work associated with moving items across top and bottom ranks. A monotonically decreasing weight vector, e.g., w_t = t^{−q}, t = 1, …, n − 1, q ≥ 0, correctly captures the fact that disagreements in top ranks should matter more than disagreements in bottom ranks [9, 11, 16]. The exponent q is the corresponding rate of decay. A linear or slower rate 0 ≤ q ≤ 1 may be appropriate for persistent search engine users who are not very deterred by low-ranking websites. Choosing q → 0 retrieves a weighting mathematically similar to the log function weighting that is used in NDCG [10] to emphasize top ranks. A quadratic or cubic decay q = 2, 3 may be appropriate for users who do not pay substantial attention to bottom ranks. The weight may be modified to w_t = max(t^{−q} − ε, 0), ε > 0, to capture the fact that many users simply do not look at results beyond a certain rank. While it is possible to select an intuitive value of q, it is more desirable to select one that agrees with user studies. An MDS embedding of permutations using d_w appears in Figure 1 (see Section 4 for more details).

Proposition 1. Assuming w is a positive vector, the weighted Hoeffding distance (4) is a metric.

Proof. Non-negativity d_w(π, σ) ≥ 0 and symmetry d_w(π, σ) = d_w(σ, π) are trivial to show. Similarly it is easy to see that d_w(π, σ) = 0 iff π = σ. The triangle inequality holds as

d_w(π, σ) + d_w(σ, φ) = \sum_{r=1}^{n} [ d'_w(π(r), σ(r)) + d'_w(σ(r), φ(r)) ] ≥ \sum_{r=1}^{n} d'_w(π(r), φ(r)) = d_w(π, φ)    (6)

where the inequality (6) holds due to the positivity of w.

The weighted Hoeffding distance has several nice properties that make it more appropriate for our purposes than other permutation measures. First, it allows customization to different users who pay varying degrees of attention to websites in different ranks (typically higher attention is paid to higher ranks). Standard permutation distances such as Kendall's tau, Spearman's rho, the footrule, Ulam's distance and Cayley's distance treat all ranks uniformly [5]. Second, it is a true metric in contrast to the generalized Kendall's tau [7]. Third, its clear interpretation allows explicit specification of the weight vector based on user studies. Finally, it is computationally tractable to compute the weighted Hoeffding distance as well as its expectation over partially ranked lists corresponding to ρ in (8). Figure 2 summarizes the advantages of our distance over other dissimilarities.

³ There are several ways to define the number of indexed websites in the internet. In any case, this number is very large and is growing continuously. We avoid its dynamic nature and consider it as a fixed number.
⁴ The earth mover's distance between two non-negative valued functions is the minimum amount of work needed to transform one to the other, when the functions are viewed as representing spatial distributions of earth or rubble.
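To make the definition concrete, the following is a minimal Python sketch (ours, not from the paper) of the distance in (4)-(5) for full permutations with the polynomially decaying weights w_t = t^{−q}; the function name and interface are our own choices.

import numpy as np

def weighted_hoeffding(pi, sigma, q=1.0):
    """Minimal sketch of the weighted Hoeffding distance, Eqs. (4)-(5).

    pi, sigma: rankings given as lists of item ids ordered from rank 1 to n
               (the bar notation 3|5|1|4|2 corresponds to [3, 5, 1, 4, 2]).
    q: decay exponent of the assumed weight vector w_t = t**(-q).
    """
    n = len(pi)
    w = np.arange(1, n, dtype=float) ** (-q)       # w_1, ..., w_{n-1}
    cw = np.concatenate(([0.0], np.cumsum(w)))     # cw[t] = w_1 + ... + w_t

    rank_pi = {item: r + 1 for r, item in enumerate(pi)}
    rank_sigma = {item: r + 1 for r, item in enumerate(sigma)}

    total = 0.0
    for item in pi:                                # sum over items, Eq. (4)
        u, v = sorted((rank_pi[item], rank_sigma[item]))
        total += cw[v - 1] - cw[u - 1]             # d'_w(u, v) = w_u + ... + w_{v-1}, Eq. (5)
    return total

# Example from the text: d(1|2|3, 3|1|2) = w1 + w2 + w2 + w1; with q = 1 this is
# 1 + 0.5 + 0.5 + 1 = 3.0
print(weighted_hoeffding([1, 2, 3], [3, 1, 2], q=1.0))

The cumulative-sum array cw is what makes each d'_w lookup constant time, which is the same device used later in the offline precomputation of Proposition 3.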


Figure 1: MDS embedding of permutations over n = 5 websites. The embeddings were computed using the weighted Hoeffding distance with uniform weight function w_t = 1 (left), linear weight function w_t = 1/t (middle) and quadratic weight function w_t = 1/t^2 (right). The permutations starting with 1 and 2 (colored in red) and the permutations starting with 2 and 1 (colored in blue) become more spatially disparate as the rate of weight decay increases. This represents the increased importance assigned to agreement in top ranks as we move from uniform to linear and quadratic decay.

Figure 2: Summary of how different dissimilarities satisfy properties (i)-(v) in Section 3. The compared rows are Kendall/Spearman [5], Fligner-Kendall [7], expected Kendall top-k [6], expected Spearman [14], InverseMeasure [2], NDCG [10], and the expected weighted Hoeffding distance; only the expected weighted Hoeffding distance satisfies all five properties.

3.2 Ranked Lists and Expectations

A ranked list output by a search algorithm forms an ordered list ⟨i_1, …, i_k⟩ of a subset of the websites {i_1, …, i_k} ⊂ {1, …, n}. Different search strategies may result in lists of different sizes but in general k is much smaller than n. In addition to the notation ⟨i_1, …, i_k⟩ we also denote it using the bar notation as

i_1 | i_2 | ⋯ | i_k | i_{k+1}, …, i_n    where {i_{k+1}, …, i_n} = {1, …, n} \ {i_1, i_2, ⋯, i_k}    (7)

indicating that the unranked items {1, …, n} \ {i_1, …, i_k} are ranked after the k items. Partial rankings (7) are not identical to permutations as there is no known preference among the unranked items {1, …, n} \ {i_1, …, i_k}. We therefore omit vertical lines between these items and list them separated by commas, i.e., 3|2|1,4 is equivalent to the ranked list ⟨3, 2⟩ which prefers 3 over 2 and ranks 1 and 4 last without clear preference between them.

It is natural to identify a ranked list ⟨i_1, …, i_k⟩ as a full permutation of the web that is unknown except for the fact that it agrees with the website ranking in ⟨i_1, …, i_k⟩. Denoting the set of permutations whose website ordering does not contradict ⟨i_1, …, i_k⟩ as S(⟨i_1, …, i_k⟩), we have that ⟨i_1, …, i_k⟩ corresponds to a random draw from S(⟨i_1, …, i_k⟩). Assuming lack of additional knowledge, we consider all permutations in S(⟨i_1, …, i_k⟩) as equally likely, resulting in

ρ(⟨i_1, …, i_k⟩, ⟨j_1, …, j_l⟩) = E_{π∼S(⟨i_1,…,i_k⟩), σ∼S(⟨j_1,…,j_l⟩)} d(π, σ) = \frac{1}{(n−k)!(n−l)!} \sum_{π∈S(⟨i_1,…,i_k⟩)} \sum_{σ∈S(⟨j_1,…,j_l⟩)} d(π, σ).    (8)

For example, consider the case of n = 5 with two search strategies returning the following ranked lists ⟨3, 1, 4⟩ = 3|1|4|2,5 and ⟨1, 5⟩ = 1|5|2,3,4. The expected distance is

ρ(3|1|4|2,5, 1|5|2,3,4) = \frac{1}{2·6} ( d(3|1|4|2|5, 1|5|2|3|4) + d(3|1|4|5|2, 1|5|2|3|4) + ⋯ + d(3|1|4|2|5, 1|5|4|3|2) + d(3|1|4|5|2, 1|5|4|3|2) ).    (9)

Expression (8) provides a natural mechanism to incorporate information from partially ranked lists. It is difficult to compare directly two ranked lists ⟨i_1, …, i_k⟩, ⟨j_1, …, j_l⟩ of different sizes. However, the permutations in S(⟨i_1, …, i_k⟩) and S(⟨j_1, …, j_l⟩) are directly comparable to each other as they are permutations over the same set of websites. The expectation (8) aggregates information over such directly comparable events to provide a single interpretable and coherent dissimilarity measure. Figure 3 displays the MDS embedding for partial rankings using the expected ρ (8) (see Section 4 for more details).

The expectation defining ρ in (8) appears to require insurmountable computation as it includes summations over (n − k)!(n − l)! elements with n being the size of the web. However, using techniques similar to the ones developed in [15] we are able to derive the following closed form.

Proposition 2. The following closed form applies to the expected distance over the weighted Hoeffding distance (4):

ρ(⟨i_1, …, i_k⟩, ⟨j_1, …, j_l⟩) = \sum_{r=1}^{n} \bar d(r)    where    (10)

\bar d(r) = \begin{cases} d'_w(u, v) & r ∈ A ∩ B \\ \frac{1}{n−l} \sum_{t=l+1}^{n} d'_w(t, u) & r ∈ A ∩ B^c \\ \frac{1}{n−k} \sum_{t=k+1}^{n} d'_w(t, v) & r ∈ A^c ∩ B \\ \frac{1}{(n−k)(n−l)} \sum_{t=k+1}^{n} \sum_{s=l+1}^{n} d'_w(t, s) & \text{otherwise.} \end{cases}

Above, A = {i_1, …, i_k}, B = {j_1, …, j_l}, and u ∈ {1, …, k}, v ∈ {1, …, l} are the respective ranks of r in {i_1, …, i_k} and {j_1, …, j_l} (if they exist).

Proof. A careful examination of (4) reveals that it may be written in matrix notation:

d(π, σ) = tr(A_π Δ A_σ^T)    (11)

where tr is the trace operator, Δ is the n × n distance matrix with elements Δ_{uv} = d'_w(u, v), and A_π, A_σ are permutation matrices corresponding to the permutations π and σ, i.e., [A_π]_{uv} = 1 iff π(u) = v. Using equation (11), we have

ρ(⟨i_1, …, i_k⟩, ⟨j_1, …, j_l⟩) = \frac{ \sum_{π∈S(⟨i_1,…,i_k⟩)} \sum_{σ∈S(⟨j_1,…,j_l⟩)} tr(A_π Δ A_σ^T) }{ (n−k)!(n−l)! } = tr( \hat M_{⟨i⟩} Δ (\hat M_{⟨j⟩})^T )    (12)


where

\hat M_{⟨i⟩} = \frac{ \sum_{π∈S(⟨i_1,…,i_k⟩)} A_π }{ (n−k)! },    \hat M_{⟨j⟩} = \frac{ \sum_{σ∈S(⟨j_1,…,j_l⟩)} A_σ }{ (n−l)! }.    (13)

Note that the marginal matrices \hat M_{⟨i⟩}, \hat M_{⟨j⟩} have a probabilistic interpretation as their u, v entries represent the probability that item u is ranked at v. Combining (12) with Lemma 1 below completes the proof.

Lemma 1. Let \hat M be the marginal matrix for a top-k ranked list ⟨i_1, …, i_k⟩ with a total of n items as in (13). If r ∈ {i_1, …, i_k} and r = i_s for some s = 1, …, k, then \hat M_{rj} = δ_{js} where δ_{ab} = 1 if a = b and 0 otherwise. If r ∉ {i_1, …, i_k} then \hat M_{rj} = 0 for j = 1, …, k and 1/(n − k) otherwise.

Proof. For a top-k ranking ⟨i_1, …, i_k⟩ out of n items, the size of the set S(⟨i_1, …, i_k⟩) is (n − k)!. Each of the permutations compatible with it has exactly the same top-k ranks. If r ∈ {i_1, …, i_k} and r = i_s for some s = 1, …, k, then the number of permutations compatible with ⟨i_1, …, i_k⟩ that assign rank s to the item is (n − k)!. Similarly, the number of consistent permutations assigning a rank other than s to the item is 0. As a result we have \hat M_{rs} = (n − k)!/(n − k)! = 1 and \hat M_{rj} = 0 for j ≠ s. If r ∉ {i_1, …, i_k}, the number of permutations consistent with the ranked list that assign rank j ∈ {k + 1, …, n} to the item is (n − k − 1)!. Similarly, the number of permutations that assign rank j ∈ {1, …, k} to the item is 0. As a result \hat M_{rj} = 0 for j = 1, …, k, and \hat M_{rj} = (n − k − 1)!/(n − k)! = 1/(n − k) for j = k + 1, …, n.

Figure 3: MDS embedding of ranked lists of varying lengths (k varies) over a total of n = 5 websites. The embeddings were computed using the expected weighted Hoeffding distance (8) with uniform weight function w_t = 1 (left), linear weight function w_t = 1/t (middle) and quadratic weight function w_t = 1/t^2 (right). We observe the same phenomenon that we saw in Figure 1 for permutations. The expected distance (8) separates ranked lists agreeing in their top rankings (denoted by different colors) better as the weights decay faster.

The expected distance (8) may be computed very efficiently, assuming that some combinatorial quantities are precomputed offline. Bounding k, l by a certain number m such that k, l ≤ m ≪ n, the online complexity is O(k + l) and the offline complexity is O(n + m^2). Proposition 3 makes this precise. A pseudo-code description of the distance computation algorithm is given as Algorithm 3.1.

Proposition 3. Let ⟨i_1, …, i_k⟩ and ⟨j_1, …, j_l⟩ be top-k and top-l ranked lists over a total of n items with k, l ≤ m ≪ n. Assuming that d'_w(u, v) in (4) is computable in constant time and space (as is the case for many polynomially decaying weight vectors w), the online space and time complexity is O(k + l). The offline space complexity is O(m^2) and the offline time complexity is O(n + m^2).

Proof. From equation (10), the offline pre-computation requires computing D_{m×m}, D^{E1}_{m×m} and D^{E2}_{m×m}, where D_{uv} = d'_w(u, v), D^{E1}_{kv} = \frac{1}{n−k} \sum_{t=k+1}^{n} d'_w(t, v), and D^{E2}_{kl} = \frac{1}{(n−k)(n−l)} \sum_{t=k+1}^{n} \sum_{s=l+1}^{n} d'_w(t, s). The space complexity for computing these matrices is O(m^2). The time complexity to compute D_{m×m} is O(m^2). Exploiting features of cumulative sums and the matrix D_{m×m} it can be shown that computing D^{E1}_{m×m} requires O(n + m^2) time. Similarly, computing D^{E2}_{m×m} requires O(n + m^2) time. As a result, the total offline complexity is O(m^2) space and O(n + m^2) time. Given the three precomputed matrices, computing the expected distance for two partially ranked lists ⟨i_1, …, i_k⟩ and ⟨j_1, …, j_l⟩ requires O(k + l) time and space. The reasons are that, given two lists of sizes k and l, the time to identify overlapping items is O(k + l), and that for items ranked by at least one engine we need to use the look-up tables no more than k + l times, plus one extra look-up for the items ranked in neither list.

4. SIMULATION STUDY

To evaluate the proposed framework we conducted two sets of analyses. The first analysis (in this section) uses synthetic data to examine properties of the embedding and compare it to alternative methods under controlled settings. The second set of experiments (next section) includes real world search engine data. Its goal is to demonstrate the benefit and applicability of the framework in the context of a real world visualization problem.

We start by examining the embedding of permutations over n = 5 websites. A small number is chosen intentionally for illustrative purposes. We consider two sets of permutations. The first set contains all permutations ranking item 1 first and item 2 second. The second set contains all permutations ranking item 2 first and item 1 second. Figure 1 displays the MDS embedding of these two sets of permutations based on the weighted Hoeffding distance with constant weights w_t = 1 (left), linear weights w_t = t^{−1} (middle) and quadratic weights w_t = t^{−2} (right). The first set of permutations is colored red and the second set blue.
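The setup of Figure 1 is easy to reproduce; the following self-contained Python sketch (our own illustration, not the authors' code) builds the two groups of permutations, computes pairwise weighted Hoeffding distances, and embeds them with off-the-shelf MDS.

import itertools
import numpy as np
from sklearn.manifold import MDS

def dw(pi, sigma, q):
    """Weighted Hoeffding distance (Eq. 4) for permutations given as tuples of items."""
    n = len(pi)
    w = np.arange(1, n, dtype=float) ** (-q)
    cw = np.concatenate(([0.0], np.cumsum(w)))     # cumulative weights
    rank_p = {x: r for r, x in enumerate(pi)}      # 0-based ranks
    rank_s = {x: r for r, x in enumerate(sigma)}
    return sum(abs(cw[rank_s[x]] - cw[rank_p[x]]) for x in pi)

# all permutations of 5 items ranking (1,2) or (2,1) at the top, as in Figure 1
group_12 = [(1, 2) + p for p in itertools.permutations((3, 4, 5))]
group_21 = [(2, 1) + p for p in itertools.permutations((3, 4, 5))]
perms = group_12 + group_21

for q in (0.0, 1.0, 2.0):                          # uniform, linear, quadratic decay
    D = np.array([[dw(a, b, q) for b in perms] for a in perms])
    Z = MDS(n_components=2, dissimilarity="precomputed",
            random_state=0).fit_transform(D)
    # Z[:6] (first group) and Z[6:] (second group) separate more as q grows

Under q = 0 the two groups overlap heavily, while q = 1 and q = 2 pull them apart, matching the qualitative behavior described below for Figure 1.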


Off-line:
1. Specify n, the number of total items, and m, the list length bound.
2. Precompute the matrices D_{m×m}, D^{E1}_{m×m} and D^{E2}_{m×m} (Section 3).
On-line:
3. Call Expected-Weighted-Hoeffding(π, σ) for lists π and σ.

Expected-Weighted-Hoeffding(π, σ)
    k1 ← size(π)
    k2 ← size(σ)
    [π_mark, σ_mark] ← Mark-Rank(π, σ)
    sum ← 0
    for i ← 1 to k1
        if π_mark[i] > 0
            then sum ← sum + D[i, π_mark[i]]
            else sum ← sum + D^{E1}[k2, i]
    count ← 0
    for i ← 1 to k2
        if σ_mark[i] == 0
            then sum ← sum + D^{E1}[k1, i]
                 count ← count + 1
    sum ← sum + (n − k1 − count) · D^{E2}[k1, k2]
    return sum

Mark-Rank(a, b)
    k1 ← size(a), k2 ← size(b)
    a_mark ← zeros(1..k1), b_mark ← zeros(1..k2)
    for i ← 1 to k1
        for j ← 1 to k2
            if a[i] = b[j]
                then a_mark[i] ← j
                     b_mark[j] ← i
    return [a_mark, b_mark]

Algorithm 3.1: Algorithm to compute the expected weighted Hoeffding distance between two ranked lists π and σ. The online complexity of the algorithm as written is O(k1 k2). A slightly more complex algorithm can achieve online complexity O(k1 + k2) as described in Proposition 3.

Table 1: A comparison of the inverse measure [2] with the weighted Hoeffding distance (n = 5). The inverse measure lacks discriminative power as it assigns the same dissimilarity to very different ranked lists.

Distance to 1|2|3|4|5    InverseMeasure    w_t = t^{−1}    w_t = t^{−2}
2                        0.8374            0.6500          0.7539
3                        0.8374            0.7786          0.8589
4                        0.8374            0.8357          0.8901
5                        0.8374            0.8571          0.8988
1|3                      0.2481            0.3048          0.2049
1|4                      0.2481            0.3810          0.2464
1|5                      0.2481            0.4095          0.2581

Table 2: A comparison of the weighted Hoeffding distance with cubic weight decay w_t = t^{−3} for different web sizes n. Increasing n beyond a certain size does not alter the distances between partially ranked lists, indicating a lack of sensitivity to the precise value of n as well as a computational speedup resulting from replacing n by n' ≪ n.

Distance to 1|2|3|4|5    n = 5     n = 10    n = 10^3    n = 10^5    n = 10^7
1|2|3|5|4                0.0117    0.0176    0.0670      0.0698      0.0699
2|1|3|4|5                0.7464    0.6755    0.6660      0.6683      0.6683
1|4|2                    0.1268    0.1362    0.1950      0.1980      0.1981
1                        0.1064    0.1592    0.2656      0.2692      0.2692
2|1                      0.7726    0.7283    0.7515      0.7543      0.7543
5                        0.9395    0.9280    0.9820      0.9851      0.9852
5|4|3|2|1                1.0000    0.9025    0.8727      0.8748      0.8748
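The closed form (10) behind Algorithm 3.1 can also be written out directly. The following Python sketch (ours, not from the paper) computes the tail averages naively for clarity, which is adequate for modest n; Algorithm 3.1 attains the O(k + l) online cost of Proposition 3 by precomputing D, D^{E1} and D^{E2} instead.

import numpy as np

def expected_whd(list_a, list_b, n, q=1.0):
    """Minimal sketch of the expected weighted Hoeffding distance, Eq. (10).

    list_a, list_b: top-k / top-l lists of item ids (position = rank).
    n: assumed total number of items; q: decay exponent for w_t = t**(-q).
    Naive version for clarity; suitable for small n only.
    """
    w = np.arange(1, n, dtype=float) ** (-q)
    cw = np.concatenate(([0.0], np.cumsum(w)))      # cw[t] = w_1 + ... + w_t

    def dprime(u, v):                               # Eq. (5)
        u, v = sorted((u, v))
        return cw[v - 1] - cw[u - 1]

    k, l = len(list_a), len(list_b)
    rank_a = {x: i + 1 for i, x in enumerate(list_a)}
    rank_b = {x: i + 1 for i, x in enumerate(list_b)}

    def avg_tail_b(u):                              # item ranked only in list_a
        return np.mean([dprime(u, t) for t in range(l + 1, n + 1)])

    def avg_tail_a(v):                              # item ranked only in list_b
        return np.mean([dprime(v, t) for t in range(k + 1, n + 1)])

    avg_both = np.mean([dprime(t, s)                # item ranked in neither list
                        for t in range(k + 1, n + 1)
                        for s in range(l + 1, n + 1)])

    union = set(list_a) | set(list_b)
    total = 0.0
    for r in union:
        if r in rank_a and r in rank_b:
            total += dprime(rank_a[r], rank_b[r])
        elif r in rank_a:
            total += avg_tail_b(rank_a[r])
        else:
            total += avg_tail_a(rank_b[r])
    total += (n - len(union)) * avg_both
    return total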

The uniform weight MDS embedding does not pay particular attention to differing websites in the top ranks and so the red and blue permutations are interspersed together. This is also the embedding obtained by using the Kendall's tau distance as in [5, 6, 12]. Moving to linearly and quadratically decaying weights increases the separation between these two groups dramatically. The differences in websites occupying top ranks are emphasized while differences in websites occupying bottom ranks are de-emphasized. This demonstrates the ineffectiveness of using Kendall's tau distance or the uniform weight Hoeffding distance in the context of search engines. The precise form of the weight - linear, quadratic, or a higher decay rate - depends on the degree to which a user pays more attention to higher ranked websites.

The second simulated experiment is similar to the first, but it contains partially ranked lists as opposed to permutations. We form five groups, each one containing partially ranked lists ranking a particular website at the top. Figure 3 displays the MDS embedding of these five sets of partial rankings based on the expected weighted Hoeffding distance ρ(⟨i_1, …, i_k⟩, ⟨j_1, …, j_l⟩) in (8) using constant weights w_t = 1 (left), linear weights w_t = t^{−1} (middle) and quadratic weights w_t = t^{−2} (right). Ranked lists in each of the different groups are displayed in different colors. We observe a similar conclusion with the expected distance over partially ranked lists as we did with the distances over permutations. The five groups are relatively interspersed for uniform weights and get increasingly separated as the rate of weight decay increases. This represents the fact that as the decay rate increases, disagreements in top ranks are emphasized over disagreements at bottom ranks.

We also conducted some comparisons between the weighted Hoeffding distance and alternative distance measures. Table 1 shows how one recently proposed measure, the inverse measure [2], lacks discriminative power, assigning the same dissimilarity to very different ranked lists. Kendall's tau and the other distances proposed in [5, 6] lack the ability to distinguish disagreement in top ranks from disagreement in bottom ranks. In particular, Kendall's tau is identical to our weighted Hoeffding distance with uniform weights (see Figures 1-3 for a demonstration of its inadequacy). NDCG [10] and other precision-recall measures rely on comparing a ranked list to a ground truth of relevant and non-relevant websites. As such they are not symmetric and are inappropriate for computing an MDS embedding based on pairwise distances.

Table 2 shows a comparison of weighted Hoeffding distances with w_t = t^{−3} for different sizes of the web n. It reveals that increasing n beyond a certain size does not alter the distances between partially ranked lists. This indicates a lack of sensitivity to the precise value of n as well as a computational speedup resulting from replacing n by n' ≪ n.

5. SEARCH ENGINE EXPERIMENTS

We discuss in this section three experiments conducted on real world search engine data. In the first experiment we visualize the similarities and differences between nine different search engines: altavista.com, alltheweb.com, ask.com, google.com, lycos.com, live.com, yahoo.com, aol.com, and dogpile.com. We collected 50 popular queries online in each of six different categories: company names, questions⁵, sports, tourism⁶, university names, and celebrity names. These queries form a representative sample of queries Q within each category over which we average the expected distance ρ according to (3). Figure 4 shows several queries for each one of the topic categories. We visualize search result sets within each of the query categories in order to examine whether the discovered similarity patterns are specific to a query category, or are generalizable across many different query types.

⁵ queries from http://answers.yahoo.com
⁶ queries from http://en.wikipedia.org/wiki/Tourism
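As a concrete sketch of how (3) is used in these experiments (our own illustration, not the authors' pipeline), the expected distance of the previous section is averaged over the query sample Q to produce a symmetric engine-by-engine dissimilarity matrix; fetch_results is a hypothetical helper and expected_distance can be any implementation such as the expected_whd sketch above with fixed n and q.

import numpy as np

engines = ["altavista", "alltheweb", "ask", "google", "lycos",
           "live", "yahoo", "aol", "dogpile"]

def engine_dissimilarity(queries, fetch_results, expected_distance):
    """Average the expected weighted Hoeffding distance over a query sample Q (Eq. 3).

    fetch_results(engine, query) -> top-k list of result URLs (hypothetical helper).
    expected_distance(list_a, list_b) -> expected distance between two ranked lists.
    """
    m = len(engines)
    D = np.zeros((m, m))
    for a in range(m):
        for b in range(a + 1, m):
            vals = [expected_distance(fetch_results(engines[a], q),
                                      fetch_results(engines[b], q))
                    for q in queries]
            D[a, b] = D[b, a] = np.mean(vals)   # rho(s_a, s_b) averaged over Q
    return D

The resulting matrix D is what is fed to MDS in the figures that follow.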


Categories            Queries
Tourism               Times Square, Sydney Opera House, Eiffel Tower, Niagara Falls, Disneyland, British Museum, Giza Pyramids
Celebrity Names       Michael Bolton, Michael Jackson, Jackie Chan, Harrison Ford, Halle Berry, Whoopi Goldberg, Robert Zemeckis
Sports                Football, Acrobatics, Karate, Pole Vault, Butterfly Stroke, Scuba Diving, Table Tennis, Beach Volleyball, Marathon
University Names      Georgia Institute of Technology, University of Florida, Virginia Tech, University of California Berkeley
Company               Goldman Sachs, Facebook, Honda, Cisco Systems, Nordstrom, CarMax, Wallmart, American Express, Microsoft
Questions             How are flying buttresses constructed, Does toothpaste expire, How are winners selected for the Nobel Prize
Temporal Queries      AIG Bonuses, G20 major economies, Timothy Geitner, Immigration Policy, NCAA Tournament Schedule

Figure 4: Selected queries from each of the 6 query categories, and from the set used for examining temporal variations.

[Figure 5 panels: company, university, celebrity (top row); questions, sports, tourism (bottom row). Engine legend: 1 altavista, 2 alltheweb, 3 ask, 4 google, 5 lycos, 6 live, 7 yahoo, 8 aol, 9 dogpile.]

Figure 5: MDS embedding of search engine results over 6 sets of representative queries: company names, university names, celebrity names, questions, sports, and tourism. The MDS was based on the expected weighted Hoeffding distance with linear weighting w_t = t^{−1} over the top 100 sites. Circle sizes indicate position variance with respect to within-category queries.


Figure 6: MDS embedding of search engine results over the query category celebrity with different query manipulations. The MDS was computed based on the expected distance (3) with (8) corresponding to the weighted Hoeffding distance with quadratic decaying weights (left panel) and Kendall top k distance [6] (right panel). Each marker represents a combination of one of the 9 search engines and one of the 5 query manipulation techniques. By comparison, the embedding of the Kendall top-k distance (right) lacks discriminative power and results in a loss of information.


5.1 Search Engine Similarities

Figure 5 displays the MDS embedding of each of the nine engines for the six query categories, based on the expected weighted Hoeffding distance (3) with linear weight decay. The quantity ρ was averaged over the 50 representative queries from each category. Each search engine is represented as a circle whose center is the 2-D coordinates obtained from the MDS embedding.

The radii of the circles in Figure 5 were scaled proportionally to the positional stability of the search engine. More precisely, we scaled the radius of the circle corresponding to the i-th search engine proportionally to its distance variance over the 50 queries

stability(i) = \sum_{j: j ≠ i} Var_{q∼Q}{ρ(s_i(q), s_j(q))}.    (14)

Scaling the circles according to (14) provides a visual indication of how much the position will change if one or more queries are deleted from or added to Q. This can also be interpreted as the degree of uncertainty regarding the precise location of the search engines due to the averaging over Q.

Examining Figure 5 reveals several interesting facts. To begin with, there are five distinct clusters. The first and largest one contains the engines altavista, alltheweb, lycos, and yahoo (indicated by the numeric codes 1, 2, 5, 7). These four search engines are clustered together very tightly in all 6 query categories. The second cluster is composed of google and AOL (numeric codes 4, 8), which also appear in very close proximity across all 6 query categories. The remaining three clusters contain individual engines: live, dogpile, and ask.

The clusters in the embedding do in fact mirror the technology relationships that have evolved in the search engine industry. FAST, the company behind alltheweb, bought Lycos and was subsequently bought by Overture, which also bought Altavista⁷. Overture was subsequently bought by the fourth member of the cluster, Yahoo. All four search engines in the first cluster have close proximity in the embedding and yet are dissimilar from the remaining competitors. The second cluster, for Google and AOL, reflects the fact that AOL now relies heavily on Google's web search technology, leading to extremely similar ranked lists.

The remaining engines are quite distinct. Dogpile is a meta-search engine which incorporates the input of the other major search engines. We see that dogpile's results are roughly equidistant from both the Yahoo and Google clusters for all query categories. Figure 5 also shows that dogpile is more similar to the two remaining engines, Live and Ask. Apparently, dogpile emphasizes pages highly ranked by Live and Ask in its meta search more than Google and AOL, and more than Yahoo, Lycos, Altavista, and alltheweb.

We present the similarity structure between the search engines in Figure 7. The figure displays a dendrogram output by standard hierarchical bottom-up clustering. The clustering was computed based on the expected Hoeffding dissimilarity measure for queries sampled from the query category Companies. The dendrogram in the figure confirms the analysis above visually. Altavista, alltheweb, lycos, and yahoo all form a tightly knit cluster. A similar tight cluster contains google and aol. More loosely, it can also be said that google and aol are closer to the remaining search engines (ask, dogpile, and msn) than to the yahoo cluster.

⁷ http://google.blogspace.com/archives/000845.html

Figure 7: Hierarchical clustering dendrogram for the nine search engines over the query category Companies.

Figure 8: The expected distance of different query manipulations from the original query for different search engines.

Engine/Distance    (b) +     (c) " "   (d) and   (e) or
Ask                0.5308    0.6903    0.6424    0.6447
Live               0.5625    0.5639    0.5006    0.5374
Google             0.3553    0.4117    0.5584    0.5500
Yahoo              0.4281    0.4647    0.5777    0.5918
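A minimal sketch (ours, not the authors' code) of how per-query distances can be turned into the stability radii of (14) and a dendrogram like Figure 7. It assumes a precomputed array dist[q, i, j] holding ρ(s_i(q), s_j(q)) for every query and engine pair, and an average linkage, which the paper does not specify.

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

def stability_and_dendrogram(dist, labels):
    """dist: array of shape (num_queries, m, m) with dist[q, i, j] = rho(s_i(q), s_j(q))."""
    D = dist.mean(axis=0)                        # engine-by-engine dissimilarity, Eq. (3)
    V = dist.var(axis=0)                         # Var_q rho(s_i(q), s_j(q))
    stability = V.sum(axis=1) - np.diag(V)       # Eq. (14): sum over j != i
    Z = linkage(squareform(D, checks=False), method="average")   # assumed linkage
    dendrogram(Z, labels=labels)
    return D, stability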
As a meta-search engine which incorporates the input of the other comparison, the right panel shows the MDS based on Kendall’s major search engines. We see that dogpile’s results are top k distance as described by Fagin et al. [6]. Each marker roughly equidistant from both Yahoo and Google clusters in the figure represents the MDS embedding of a particu- for all query categories. Figure 5 also shows that dogpile is lar engine using a particular query manipulation technique more similar to the two remaining engines - Live and Ask. which brings the total number of markers to 9 · 5 = 45. Apparently, dogpile emphasizes pages highly-ranked by Live Comparing the left and right panels shows that visualiz- and Ask in its meta search more than Google and AOL and ing using Kendall’s top k distance [6] lacks discriminative more than Yahoo, Lycos, Altavista, and alltheweb. power. The points in the right panel fall almost on top of We present the similarity structure between the search en- each other limiting their use for visualization purposes. In gines in Figure 7. The figure displays a dendogram output contrast, the points in the left panel (weighted Hoeffding by standard hierarchical bottom up clustering. The clus- distance) differentiate among not only different engines but tering was computed based on the expected Hoeffding simi- also different types of query manipulations. larity measure for queries sampled from the query category In particular, it shows that most search engines produce Companies. The dendogram in the figure confirms the anal- two different clusters of results corresponding to two sets ysis above visually. Altavista, alltheweb, lycos, and yahoo of query manipulation techniques: transformations {(a),(b), all form a tightly knit cluster. A similar tight cluster con- (c)} in one cluster and transformations {(d),(e)} in the other tains google and aol. More loosely it can also be said that cluster. Live and ask form an exception to that rule forming google and aol are closer to the remaining search engines clusters {(a),(d)}, {(b),(c)}, {(e)} (live) and {(a),(b),(d),(e)}, (ask, dogpile, and msn) than the yahoo cluster. {(c)} (ask). Figure 8 shows the query manipulations that produce ranked lists most distinct from the original query: 7http://google.blogspace.com/archives/000845.html (c) for ask and live, (d) for google, and (e) for yahoo.


Figure 9: Top: MDS embedding of search engine results over seven days for a set of queries based on temporal events. The MDS embedding was based on the expected Hoeffding distance with linear weighting w_t = 1/t over the top 50 sites. Bottom: The dissimilarity of Yahoo results over seven days with respect to a reference day (day 2 and day 6) for a set of queries based on temporal events.

5.3 Temporal Variation

In the third experiment, we visualize search result sets created by replicating 7 out of the 9 search engines over 7 consecutive days, resulting in 7 · 7 = 49 search result sets. The search engines were queried on a daily basis during 3/25/2009 - 3/31/2009 and the returned results were embedded in 2-D for visualization. In contrast to the previous two experiments, we used a separate query category which was specifically aimed at capturing time sensitive matters. For example, we ignored tourism queries such as Eiffel Tower due to their time insensitive nature and instead used queries such as Timothy Geitner or AIG bonuses which dominated the news in March 2009. See Fig. 4 for more examples.

The embedded rankings are displayed in Figure 9 (top). The embedding reveals that the yahoo cluster (yahoo, altavista, alltheweb, and lycos) shows a high degree of temporal variability, and in particular a sharp spatial shift on the third day from the bottom region to the top left region. This could be interpreted either as a change in the index, reflecting the dynamic nature of the Web, or an internal change in the retrieval algorithms underlying the engines. The other engines were more stable as their ranked lists changed very little with the temporal variation. Note that as the queries were time sensitive this should not be interpreted as a measure of robustness, but rather as a stability measure for the internal index and ranking mechanisms. Interestingly, Vaughan [20] also reports that Altavista shows temporal jumps with Google being more stable over a set of queries in 2004.

Figure 9 (bottom) shows the expected distance between the yahoo search results across the seven consecutive days and a reference point (2nd and 6th day). As expected, for each one of the two plots, the deviation from the reference date increases monotonically with the temporal difference. The slope of the curve represents the degree of temporal change Δ(s_t, s_{t+τ}) between yahoo at time t and at time t + τ as a function of τ with respect to the reference point t.

5.4 Validation of the MDS Embedding

A central assumption of this study is that MDS embeddings can give a faithful representation of the distances in the original higher-dimensional space. Thus, we now provide a validation of the MDS embedding as a visualization tool, diagnosing whether MDS is providing a reasonable embedding and which MDS variant should be preferable.

The most common tool for validating an MDS embedding is the Shepard plot (Figure 10, top) which displays a scatter plot contrasting the original dissimilarities (on the x axis) and the corresponding distances after the embedding (on the y axis). Points on the diagonal represent zero distortion and a curve that deviates substantially from the diagonal represents substantial distortion. The Shepard plot in Figure 10 (top) corresponds to the metric MDS using the standard stress criterion as described in (2). The plot displays low distortion with a tendency to undervalue dissimilarities in the range [0, 0.5] and to overvalue dissimilarities in the range [0.5, 0.9]. Such a systematic discrepancy between the way small and large distances are captured is undesirable.

An alternative is non-metric MDS, which achieves an embedding by transforming the original dissimilarities into alternative quantities called disparities using a monotonic increasing mapping; the disparities are then approximated by the embedding distances [3]. Doing so preserves the relative ordering of the original dissimilarities and thus (assuming the embedding distances approximate the disparities well) accurately represents the spatial relationships between the points. Figure 10 (bottom) displays the Shepard plot for the same data embedded using non-metric stress MDS with the disparities displayed as a red line. Despite the fact that its numeric distortion is higher, the non-metric MDS is a viable alternative to the metric MDS, and is what was used to generate the figures in this paper.

Figure 10: Shepard plots (embedding distances as a function of the original dissimilarities) for 2D MDS embeddings corresponding to the weighted Hoeffding distance of search engine results over the query category celebrity with different query manipulations. The metric stress MDS [3] (top panel) produces an embedding that has a lower overall distortion than the non-metric stress MDS [3] (bottom panel). The non-metric stress MDS, however, achieves an embedding by transforming the original dissimilarities into alternative quantities called disparities using a monotonic increasing mapping (red line in bottom panel).
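The validation step is straightforward to reproduce with standard tools. The following sketch (ours, using scikit-learn's MDS rather than whatever software the authors used) embeds a precomputed dissimilarity matrix with metric or non-metric MDS and draws a Shepard plot; it does not attempt to draw the disparity curve itself.

import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

def embed_and_shepard(D, metric=False, seed=0):
    """Embed a square dissimilarity matrix D in 2-D and plot a Shepard diagram."""
    mds = MDS(n_components=2, dissimilarity="precomputed",
              metric=metric, random_state=seed, n_init=4)
    Z = mds.fit_transform(D)                       # 2-D coordinates z_1 .. z_m
    orig = squareform(D, checks=False)             # condensed original dissimilarities
    emb = pdist(Z)                                 # corresponding embedding distances
    plt.scatter(orig, emb, s=10)
    plt.plot([0, orig.max()], [0, orig.max()], "k--")   # zero-distortion diagonal
    plt.xlabel("original dissimilarity")
    plt.ylabel("embedded distance")
    return Z

Setting metric=False gives the non-metric variant described above; metric=True gives the stress-minimizing metric MDS of (1)-(2).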
6. DISCUSSION

In this paper we present a framework for visualizing relationships between search algorithms. The framework starts by deriving an expected distance based on the earth mover's distance on permutations and then extends it to partially ranked lists by taking expectations over all permutations consistent with the ranked lists. The expected distance ρ is then averaged over representative queries q ∈ Q to obtain a dissimilarity measure for use in multidimensional scaling embedding. The expected distance has several nice properties including being computationally efficient, customizable through a selection of the weight vector w, and interpretable.

We explore the validity of the framework using a simulation study which indicates that the weighted Hoeffding distance is more appropriate than Kendall's tau and more discriminative than the inverse measure. It is also more appropriate for MDS embedding than non-symmetric precision-recall measures such as NDCG. We also demonstrate the robustness of the proposed distance with respect to the choice of n and its efficient computation with complexity that is linear in the sizes of the ranked lists (assuming some quantities are precomputed offline).

Experiments on search engine data reveal several interesting clusters which are corroborated by examining recent news stories about the web search industry. We demonstrate how to visualize positional stability by scaling the MDS markers proportionally to the total distance variance, and how to visualize the sensitivity of search engines to popular query manipulation techniques. We also use the visualization framework to examine how the search results vary over consecutive days.


Search engines use complex proprietary algorithms containing many parameters that are automatically tuned based on human provided ground truth information. As a result, it is difficult for search engineers to have a detailed understanding of how precisely their search engine works. For external users the problem is even worse as they are not privy to the internal algorithmic details. Our framework provides visual assistance in understanding the relationship between a search algorithm's ranked results and its dependence on internal and external parameters. Such visualization may lead to better designed search engines as the engineers improve their understanding of how search engines depend on the internal parameters. It may also improve the search experience as users understand better the relationship between different engines and their dependency on query manipulation techniques and external parameters such as time.

7. ACKNOWLEDGMENTS

We thank H. Zha and H. Li for helpful comments. This research was funded in part by NSF grant DMS-0907466.

8. REFERENCES

[1] M. Alvo and P. Cabilio. Rank correlation methods for missing data. The Canadian Journal of Statistics, 23(4):345–358, 1995.
[2] J. Bar-Ilan, K. Keenoy, E. Yaari, and M. Levene. User rankings of search engine results. Journal of the American Society for Information Science and Technology, 58(9):1254–1266, 2007.
[3] I. Borg and P. J. F. Groenen. Modern Multidimensional Scaling. Springer, 2nd edition, 2005.
[4] B. Carterette. On rank correlation and the distance between rankings. In Proc. of the 32nd ACM SIGIR Conference, 2009.
[5] D. E. Critchlow. Metric Methods for Analyzing Partially Ranked Data. Lecture Notes in Statistics, volume 34, Springer, 1985.
[6] R. Fagin, R. Kumar, and D. Sivakumar. Comparing top k lists. In Proc. of ACM SODA, 2003.
[7] M. A. Fligner and J. S. Verducci. Distance based ranking models. Journal of the Royal Statistical Society B, 43:359–369, 1986.
[8] M. Gordon and P. Pathak. Finding information on the world wide web: the retrieval effectiveness of search engines. Information Processing and Management, 35(2):141–180, 1999.
[9] L. Granka, T. Joachims, and G. Gay. Eye-tracking analysis of user behavior in www search. In Proc. of the ACM SIGIR conference, pages 478–479, 2004.
[10] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4):422–446, 2002.
[11] T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay. Accurately interpreting clickthrough data as implicit feedback. In Proc. of the ACM SIGIR conference, pages 154–161, 2005.
[12] P. Kidwell, G. Lebanon, and W. S. Cleveland. Visualizing incomplete and partially ranked data. IEEE Transactions on Visualization and Computer Graphics, 14(6):1356–1363, 2008.
[13] J. H. Lee. Combining multiple evidence from different properties of weighting schemes. In Proc. of the 18th ACM SIGIR conference, 1995.
[14] W. Liggett and C. Buckley. Query expansion seen through return order of relevant documents. Technical report, NIST, 2001.
[15] J. I. Marden. Analyzing and modeling rank data. CRC Press, 1996.
[16] A. Moffat and J. Zobel. Rank-biased precision for measurement of retrieval effectiveness. ACM Transactions on Information Systems, 27, 2008.
[17] M. E. Rorvig. Images of similarity: A visual exploration of optimal similarity metrics and scaling properties of TREC topic-document sets. JASIS, 50(8):639–651, 1999.
[18] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision, 40(2), 2000.
[19] A. Spoerri. Metacrystal: A visual interface for meta searching. In Proceedings of ACM CHI, 2004.
[20] L. Vaughan. New measurements for search engine evaluation proposed and tested. Information Processing and Management, 40(4):677–691, 2004.
[21] E. Yilmaz, J. A. Aslam, and S. Robertson. A new rank correlation coefficient for information retrieval. In Proc. of the 31st ACM SIGIR conference, 2008.
