Evaluation of Pseudo Relevance Feedback Techniques for Cross Vertical Aggregated Search

Hermann Ziak and Roman Kern

Know-Center GmbH, Inffeldgasse 13, 8010 Graz, Austria {hziak,rkern}@know-center.at

Abstract. Cross vertical aggregated search is a special form of meta search, where multiple search engines from different domains and of varying behaviour are combined to produce a single search result for each query. Such a setting poses a number of challenges, among them the question of how to best evaluate the quality of the aggregated search results. We devised an evaluation strategy together with an evaluation platform in order to conduct a series of experiments. In particular, we are interested in whether pseudo relevance feedback helps in such a scenario. Therefore we implemented a number of pseudo relevance feedback techniques based on knowledge bases, where the knowledge base is either Wikipedia or a combination of the underlying search engines themselves. While conducting the evaluations we gathered a number of qualitative and quantitative results and gained insights into how different users compare the quality of search result lists. Regarding the pseudo relevance feedback, we found that using Wikipedia as knowledge base generally provides a benefit, except for entity-centric queries, which target single persons or organisations. Our results will help to steer the development of cross vertical aggregated search engines and to guide large scale evaluation strategies, for example using crowd sourcing techniques.

1 Introduction

Today's web users tend to always revert to the same sources of information [6], even though other potentially valuable sources of information exist. These sources are highly specialized in certain topics, but are often left out since they are not familiar to the user. One key aspect to tackle this issue is to devise search methods that keep the user's effort minimal, where meta search serves as a starting point. This is motivated by the goal of improving the public awareness of systems in domains which are considered to be niche areas by the general public, like cultural heritage or science. Meta search is the task of distributing a query to multiple search engines and combining their results into a single result list. In meta search there is usually no strict separation of domains, thus the results are expected to be homogeneous or even redundant, for example results from different web search engines. Vertical search engines, on the other hand, combine results from sources of different domains. In our case these verticals or sources are highly specialized collections, for example medicine, business, history, art or science. These verticals might also differ in the type of items which are retrieved [3] (e.g. images, web pages, textual documents). An example of vertical search is the combination of results from an underlying image search with results from a traditional textual search. In our work we focus on cross vertical aggregated search engines [11], also known as multi domain meta search engines [14], where we do not make any assumptions about the domain of the individual sources. Hence, in such a scenario the challenges [11] of both types of aggregated search engines are inherited. In particular we are dealing with so called uncooperative sources, thus the individual search engines are treated as black boxes.

The overall goal of our work is to gain a profound understanding of how to provide aggregated search results which prove to be useful for the user. This directly addresses the question of how to assess this usefulness, i.e. how to evaluate such a system. The traditional approach for evaluation follows the Cranfield paradigm [22]. Here the retrieval performance is assessed by a fixed set of relevant documents for each query and typically evaluated offline using mean average precision (MAP), normalized discounted cumulative gain (NDCG) or related measures. This type of evaluation does not appear to be appropriate, as it does not capture aspects like diversity, serendipity and the usefulness of long-tail content, which we consider to play an important role in our setting. Furthermore, these indicators are hard to measure since ground truth data is hard to create for cross vertical search systems where sources might be uncooperative. In order to fill this gap, we conducted user centred evaluations to get a better understanding of how users perceive search result lists and how to design evaluations in such a setting. In particular we are interested in how users evaluate longer result sets against each other, not only judging the top documents alone. Therefore we developed a dedicated evaluation tool allowing users to interactively vote for the results which best match their expectations.

The evaluation platform also allows us to evaluate the impact of different retrieval techniques, more specifically the integration of pseudo relevance feedback. In pseudo relevance feedback the search is conducted two times. First the original query is issued and the top search hits are analysed. From those search hits a number of query term candidates are selected and added to the query. The expanded query is then used to generate the final search result list. As an extension to the basic procedure, the source used for the first query might be different from the one used for the second round. Thus different knowledge bases can be studied with respect to how they impact the final results when used for the first search. Before conducting the actual evaluation, we did a preliminary test with a few friendly users to fine tune the evaluation system. In our main evaluation we gathered qualitative insights and quantitative results on the integration of pseudo relevance feedback into the retrieval process, and whether it proves beneficial and helps to diversify the results of specialized sources.
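To make this two-round procedure concrete, the following minimal sketch illustrates pseudo relevance feedback against a generic search interface; the `search` method, the `title` attribute and the parameter defaults are illustrative assumptions, not part of our implementation.

```python
from collections import Counter

def extract_candidate_terms(hits, limit=20):
    # Count term frequencies over the titles of the top hits and keep
    # the most frequent ones as expansion candidates.
    terms = [token.lower() for hit in hits for token in hit.title.split()]
    return [term for term, _ in Counter(terms).most_common(limit)]

def pseudo_relevance_feedback(query, feedback_source, target_sources,
                              top_k=10, num_terms=20):
    # Round 1: issue the original query and analyse the top search hits.
    initial_hits = feedback_source.search(query)[:top_k]
    candidates = extract_candidate_terms(initial_hits, limit=num_terms)
    # Add the selected candidate terms to the original query.
    expanded_query = query + " " + " ".join(candidates)
    # Round 2: the expanded query produces the final search result lists.
    return {source: source.search(expanded_query) for source in target_sources}
```

Note that `feedback_source` may differ from the target sources, which is exactly the setting studied in the remainder of this paper.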
Another outcome of our work is a guideline on how human intelligence tasks have to be designed for large scale evaluations on crowd-sourcing platforms like Amazon Mechanical Turk.

Fig. 1. Overview of the basic architecture of the whole system. In the user context detection component the query is extracted from the current user's context. The cross vertical aggregated search engine is the core of the system, where queries are expanded and distributed to the sources. The source connector is responsible for invoking the source specific API and returning the individual search results, which are finally merged by the result aggregation component.

2 System Overview

Our cross vertical search algorithms are at the core of a bigger system, which is developed within the EEXCESS1 ("Enhancing Europe's eXchange in Cultural Educational and Scientific reSources") project. The code is available under an open-source license2. An overview of the architecture is given in Figure 1. The vision of the project is to recommend high quality content from many sources to platforms and devices which are used on a daily basis, for example in the form of a browser add-on. In a traditional information retrieval setting the user is requested to explicitly state her information need, typically as a query consisting of a few keywords. We consider the case where, in addition to the explicit search capabilities, the information need is not explicitly given. In this case the query is automatically inferred from the current context by the user context detection component [20]. Such a setting is also known as just-in-time information retrieval [18] and has a close connection to the field of recommender systems research. The search result list is continuously updated according to the users' interactions, for example when navigating from one web site to another. Next, the query is processed by the query reformulation step, where the query expansion takes place. Optionally one of the pseudo relevance feedback algorithms is applied to add related query terms to the original query. The query is then fed to all known sources, i.e. all search engines that are registered with the system, via source specific connectors. These source specific connectors then adapt the query to the source specific format and invoke the respective API calls, for example by the use of the Open Search API3. Finally, the results from all sources are collected and aggregated into a single search result list that is presented to the users.

1 http://eexcess.eu
2 http://github.com/EEXCESS/recommender
3 http://www.opensearch.org/
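The interplay of these components can be summarised in a short sketch; the class and method names below are illustrative assumptions and do not reflect the actual EEXCESS code base.

```python
class SourceConnector:
    """Wraps one registered search engine and its source specific API."""

    def __init__(self, name, api_client):
        self.name = name
        self.api_client = api_client  # e.g. a hypothetical OpenSearch client

    def translate(self, query):
        # Source specific query rewriting would happen here.
        return query

    def search(self, query):
        # Adapt the query to the source specific format and invoke the API.
        return self.api_client.query(self.translate(query))


def aggregated_search(query, connectors, expand=None, aggregate=None):
    # Optionally apply one of the pseudo relevance feedback algorithms.
    if expand is not None:
        query = expand(query)
    # Fan the (expanded) query out to all registered sources ...
    per_source_results = [connector.search(query) for connector in connectors]
    # ... and merge the individual result lists into a single one.
    return aggregate(per_source_results)
```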

3 System Details

The automatic query generation poses a set of challenges, as the true information need might be only partially present in these automatically inferred queries and might not cover the user's intent well. One approach to deal with such problems is to diversify the results, as suggested in the literature [19]. Diversification can be achieved by a number of methods, ranging from mining query logs to query reformulation strategies [17]. Other diversification techniques like IA-Select [1] rely on a categorization of the query and the retrieved documents to greedily rearrange the given result lists. In the end, the final presented result should cover all topics of the query in proportion to its categories. Although we considered such approaches, we found that in our meta search environment some of the verticals return insufficient information to do a categorization of the results. For example, digital libraries of images only supply short titles and no additional metadata. Mining query logs also does not apply to our scenario, as our system should also work with uncooperative sources. Language models are another source of information which could provide a benefit in the query expansion stage. One way to obtain a language model of an uncooperative source is probing. Here a number of search requests are issued to the individual sources to collect a sample set of the source's documents. Pass et al. [16] showed that about 30,000 to 60,000 documents are required for every source to get a decent representation of the source's language model. Again, such sampling methods will not work for our system for a number of reasons (e.g. the sources might restrict the number of API calls per day). Therefore we opted for a solution that does not rely on such datasets, namely pseudo relevance feedback with the help of knowledge bases.

3.1 Query Reformulation

We expect the user context detection component to produce short queries, which is an additional motivation to use query expansion as reformulation strategy. For the query expansion we followed the advice found in the literature. We limited the retrieved documents to the ten top-ranked documents, as suggested by Montgomery et al. [15], to extract query expansion term candidates. Out of these documents we extracted the top terms and removed duplicate query terms. There are several suggestions on the number of query terms to be used for query expansion [4]. Harman [7] showed that after a certain amount of added expansion terms there is a drop in precision. Most recommendations vary from ten to twenty. We decided to use the twenty most frequent terms for our evaluation. Next, one needs to define which metadata fields to use when selecting the query expansion candidate terms. Most of our sources provide a description together with the title for each of their search results, while others just return the title. Existing work shows that using the title only might already result in satisfying performance [4]. Therefore we opted to use just the title, even if a description is present for some of the search results. Another problem in pseudo relevance feedback is the so called query drift [12], where the additional terms introduced by the query expansion also cause a semantic drift away from the original user's information need. Shokouhi et al. [21] tackled this problem by running the query expansion separately for each source and using just the respective source to produce the candidate terms. They also pointed out one disadvantage of such a procedure: not all sources might be equally suited to produce expansion terms. In our case some of the sources return very sparse textual information (e.g. sources specialized on visual content with short, similar titles). Here the query expansion algorithm will also pick semantically unrelated terms simply due to the data sparseness. Shokouhi, Azzopardi and Thomas [21] demonstrated that results could benefit from taking only selected sources for query expansion instead of following a global approach where all sources are treated equally. Lynam et al. [13] showed that the extent to which a source is suited to provide query expansion for other sources can be estimated by the performance benefit when it is used on itself. This implies that picking the source which demonstrated the best result in a single search setting could also be a viable option to produce query expansion terms for other sources. This can be extended further when an external knowledge base is used for query expansion which is not actually used for the aggregated search. Finally, our system features three different strategies for query expansion together with a baseline; a sketch of the candidate term extraction and ranking is given after the strategy descriptions below.

No Query Expansion. In this setting the query is not expanded and is sent as-is to the sources; this serves as the baseline.

Multiple Sources. The multiple sources approach takes all sources into account by first retrieving a combined search result list of all sources using no query expansion. This initial aggregated result list is taken as input to compute and rank candidates for the query expansion step. Next, all sources are queried using these query terms.

Single Source. This approach is similar to the multiple sources approach, but takes only a single, selected source into consideration. This source has been selected based on its observed behaviour when used to expand queries applied only to itself.

External Knowledge Base. For this query expansion strategy we used Wikipedia, which has already been used in existing research [8]. Motivated by the assumption raised by Cai et al. [4] regarding the diverse content of web pages, we segmented Wikipedia pages into their main sections and indexed these sections as separate documents. For pseudo relevance feedback we ranked the terms contained in the top hits using the Divergence from Randomness approach [2]. Here, Wikipedia is not part of the sources which contribute to the final search result list.
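As announced above, the following sketch shows how expansion candidates could be extracted from the titles of the ten top-ranked hits and ranked. The Bo1 model is used here as one concrete term weighting scheme from the Divergence from Randomness framework [2]; the exact weighting model and the collection statistics interface are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def expansion_terms(top_hits, collection_term_freq, num_docs, num_terms=20):
    """Rank expansion candidates from the titles of the top hits with a
    Bo1-style DFR weight; `collection_term_freq` maps a term to its
    frequency in the feedback corpus (assumed available, e.g. from the
    Wikipedia section index)."""
    tokens = [t for hit in top_hits
              for t in re.findall(r"[a-z0-9]+", hit["title"].lower())]
    tf_in_feedback_set = Counter(tokens)

    def bo1(term, tf_x):
        # Expected relative frequency of the term in the collection;
        # unseen terms fall back to a count of 1 to avoid division by zero.
        p_n = collection_term_freq.get(term, 1) / num_docs
        return tf_x * math.log2((1.0 + p_n) / p_n) + math.log2(1.0 + p_n)

    ranked = sorted(tf_in_feedback_set.items(),
                    key=lambda item: bo1(*item), reverse=True)
    return [term for term, _ in ranked[:num_terms]]
```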

3.2 Source Specific Query Reformulation

Every source may require a different, dedicated query language. Therefore the query has to be adapted specifically to the sources' capabilities. As a general strategy, we formulated the expanded query as a conjunction of the original query terms followed by the new terms as a disjunction. We expect that this approach will generally produce satisfying results, although tweaking the query reformulation for every source would most likely provide an additional benefit.
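A minimal sketch of this general strategy, emitting a Lucene-style query string; the concrete syntax, and whether the disjunction is mandatory or merely boosts matching results, depend on the individual source and are assumptions here.

```python
def reformulate(original_terms, expansion_terms):
    """Build the expanded query: the original terms form a conjunction,
    the expansion terms a trailing disjunction (one plausible reading
    of the strategy described above)."""
    conjunction = " AND ".join(original_terms)
    disjunction = " OR ".join(expansion_terms)
    if not disjunction:
        return conjunction  # baseline: no query expansion
    return f"({conjunction}) AND ({disjunction})"

# Example:
# reformulate(["euro", "conversion", "rate"], ["currency", "exchange"])
# -> '(euro AND conversion AND rate) AND (currency OR exchange)'
```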

3.3 Result Aggregation

The final stage of our cross vertical aggregated search is the merging of the individual search results. Many different approaches have been proposed in the literature on how to combine search results from different sources [11,3]. For the evaluation we tried to keep the result aggregation as simple as possible to prevent any interference with the query expansion. Therefore we followed a simple round robin based approach. This is additionally motivated by our intention to keep the results deterministic and reproducible. Here the results of all sources are combined by picking the top ranked results of each list in a fixed sequence, i.e. the first result of the first source followed by the first result of the second source, and so on.
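A sketch of this round robin merging step, with plain Python lists standing in for the per-source result lists:

```python
def round_robin_merge(result_lists):
    """Interleave the per-source result lists in a fixed order: first
    result of the first source, first result of the second source, ...,
    then the second results, and so on."""
    merged = []
    longest = max((len(results) for results in result_lists), default=0)
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results):  # shorter lists simply run out
                merged.append(results[rank])
    return merged

# Example: round_robin_merge([["a1", "a2"], ["b1"]]) -> ["a1", "b1", "a2"]
```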

4 Evaluation

The main goal of our evaluation was to arrive at a deeper understanding of how users judge the usefulness of the search results produced by our system. Furthermore, we wanted to assess the impact of our pseudo relevance feedback configurations. Additionally, the evaluation should contribute to the understanding of how to design an evaluation for crowd-sourcing platforms, complementary to existing work in this area [9,10].

Evaluation Platform. We opted to build our own tool to conduct the user based evaluation, as this would allow us to control all parameters of the algorithms. See Figure 2 for a screenshot of the tool. The user is presented with a fixed query together with an optional short description of the query and some background information. A number of different search result lists are presented next to each other. The user then has to compare these search results and decide on a ranking of the lists.

Fig. 2. Screenshot of the evaluation user interface, for the query "euro conversion rate". A total of four different search result lists are presented next to each other. The user has already picked result list #3 (green) as the best and result list #2 (yellow) as the second best result. Third and fourth place are not decided yet.

By clicking on the respective search result list, the user expresses her preference on the ranking of the results. Once the sequence is defined, the user is routed to the next query. All decisions of the user are recorded together with the time consumed for each task. In the design of the tool great attention has been dedicated to keeping the results and behaviour of the tool deterministic and consistent. For example, the search result lists are identical for each user within one evaluation run. At the same time our tool is flexible enough to allow configuring how the search results should look. For example, they may contain an optional preview image, or may be composed of the title alone or a combination of the title plus a description. The individual search result lists are generated by different configurations of the pseudo relevance feedback techniques. The sequence, i.e. which technique comes first and so forth, has been randomly chosen to prevent any bias. The actual algorithm has been recorded by the tool, but not presented to the user. Thus no hints on the way the search result lists were generated are available to the user.
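A sketch of how such a decision, together with the consumed time, could be recorded; the field names are illustrative assumptions and not the tool's actual data model.

```python
import time
from dataclasses import dataclass, field

@dataclass
class RankingJudgement:
    """One recorded decision: how a user ranked the competing result
    lists for a query."""
    user_id: str
    query: str
    configurations: list                       # name per list, hidden from the user
    preferred_order: list = field(default_factory=list)  # e.g. [3, 2, 1, 4]
    seconds_spent: float = 0.0

def record_judgement(user_id, query, configurations, collect_order):
    started = time.time()
    order = collect_order()  # e.g. the clicks collected by the web UI
    return RankingJudgement(user_id, query, list(configurations),
                            order, time.time() - started)
```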

Query Selection. As input our evaluation tool requires a list of queries to be presented to the user. The query list was preselected by us. The decision to predefine queries was made in order to create result lists with a balanced amount of results from each source, which could not have been guaranteed otherwise. Thus we also did not follow the proposal of Diaz et al. [5] to give users the opportunity to submit their own queries. Furthermore, in our system the user is typically not expected to manually define the query terms, as they should be automatically inferred from the current context. The final query terms were chosen from the AOL query log [16] and further individually selected to match the sources, to prevent the search results from being dominated by specific sources.

Table 1. Results of all four query expansion strategies, where the number indicates the accumulated rank, thus lower values are better. Results are also separated into entity-centric queries and topical queries, where the first type of query refers to individuals, organisations, and other types of entities.

Expansion Method    Overall Score   Entity-Centric Queries   Topical Queries
No QE               431             103                      328
Multiple Sources    466             129                      337
Selected Source     427             143                      284
Wikipedia           426             155                      271

Pre-test. Before starting the actual evaluation with our framework we conducted a small pre-test with friendly users. We gathered some insights, which allowed us to fine tune the evaluation tool and the procedure. In short, the three main findings were the following. The results themselves should be uniformly presented; thus, even if some results may provide an additional preview image or a rich textual description, it is preferable to stick with the smallest common denominator. Therefore in the evaluation just the title is displayed for all search results. A second result of the pre-test concerns the number of queries to be evaluated. Initially we had foreseen having our users assess a total of 30 queries. Apparently, the inspection of the search results takes a lot of time, therefore we reduced the number of queries to just 10 for the main evaluation. Finally, it has been observed that some of the sources also returned non-English results. Thus the feedback was to filter out these results and only keep English ones.

Main Evaluation. The main evaluation took place during a computer science conference, where we tried to motivate conference visitors to take part in the evaluation. Of the visitors who were motivated enough to start the evaluation, a total of 20 managed to state their preference for all search result lists for all queries. Curious users were given a short background of the system and a brief introduction to the user interface. Apart from that, no hints were given as to the criteria by which the search result lists should be judged. This has been done in order not to introduce any bias. Comments and feedback from the users were collected in addition to the interactions recorded by the evaluation tool. Generally, none of the participants signalled having problems using the evaluation tool.

5 Results

Given the recordings of our tool and the feedback of the participants we can summarise the results in two ways, qualitatively and quantitatively. Finally, this allows us to provide a guideline on how our evaluation could be improved and how future evaluations should be designed.

5.1 Qualitative Evaluation

Users reported that deciding between results was often hard, since the results appeared to be quite similar in many cases. Krippendorff's alpha as a measure of inter-rater agreement lies between 0.66 and 0.78 for the different configurations. This agreement can only be considered as "fair", corroborating the subjective impression of the participants. This is also caused by the variety in how users conducted the process of comparing search result lists. Some users just focused on the top results, others picked the best overall result set by studying each document in the lists. There does not appear to be a single, uniform strategy for users to assess search result lists. This could also be seen as an indicator why we observed a big variation in evaluation time, between 1 and 5 minutes per query. Further, since the order of the queries was not randomized in our evaluation, studying each document might lead to a higher fatigue of the user and therefore might bias the results. Generally, these results also hint that applying a form of personalisation to the search results could cater for the different assessment types. Some users showed a clear preference for result lists with more general top items, e.g. overview articles. One outcome of the qualitative evaluation therefore suggests designing a system such that the first few search results are of a more general nature. This should maximise the chances that users perceive a search result as an appropriate response to the given query or information need. For our scenario of aggregating multiple sources, it appears advisable to reserve the first few spots of the aggregated search result for items from more generic sources, while allowing more specific sources to populate the remaining result list; a sketch of such an aggregation variant is given below.
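The sketch below illustrates such an aggregation variant under the assumption that sources are labelled as generic or specific; the number of reserved slots is a free parameter chosen for illustration.

```python
def generic_first_merge(generic_lists, specific_lists, reserved_slots=3):
    """Reserve the first few positions for results from the more generic
    sources, then interleave everything that is left round robin style."""
    queues = [list(results) for results in generic_lists + specific_lists]
    generic_queues = queues[:len(generic_lists)]
    merged = []
    # Fill the reserved top spots from the generic sources only.
    while len(merged) < reserved_slots and any(generic_queues):
        for queue in generic_queues:
            if queue and len(merged) < reserved_slots:
                merged.append(queue.pop(0))
    # The remaining results follow the plain round robin scheme.
    while any(queues):
        for queue in queues:
            if queue:
                merged.append(queue.pop(0))
    return merged
```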

5.2 Quantitative Evaluation

Table 1 summarises the results from the recordings of the evaluation tool and compares the four different pseudo relevance feedback strategies. The numbers represent the accumulated rank of the users' ratings, hence a lower number indicates a preference for the respective configuration. Two of the three pseudo relevance feedback strategies yielded better results than the baseline without any query reformulation. The knowledge base setting using Wikipedia for query expansion appears to give the best overall results, followed closely by the selected source strategy, although the distinctions are minimal. However, when inspecting the results in more detail, we made an interesting observation. We discovered that there is a pronounced discrepancy between queries which can be described as entity-centric queries and topical queries. For entity-centric queries one would expect that there is a single, defining Wikipedia page, for example "Michelle Obama". For this kind of query, the query expansion using Wikipedia did not provide any benefit; on the contrary, it had a negative impact. This might be due to the way the query expansion terms are constructed and the fact that the terms used for expansion allow a too large query drift. From the results of the quantitative evaluation, one can conclude that pseudo relevance feedback might help, but not for all configurations and queries. Using Wikipedia as knowledge base demonstrated the best overall performance. For entity-centric queries it is suggested to introduce a query pre-processing step in which the type of a query is inferred, and to enable pseudo relevance feedback just for topical queries.
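For reference, accumulated rank scores of the kind reported in Table 1 can be computed by summing, for each configuration, the rank position every participant assigned to it for every query; the sketch below assumes rankings are stored as ordered lists of configuration names (lower totals indicate preference).

```python
from collections import defaultdict

def accumulated_ranks(judgements):
    """`judgements` is an iterable of rankings, each an ordered list of
    configuration names from best to worst (one per user and query)."""
    totals = defaultdict(int)
    for ranking in judgements:
        for position, configuration in enumerate(ranking, start=1):
            totals[configuration] += position
    return dict(totals)

# Example with two hypothetical judgements:
# accumulated_ranks([["Wikipedia", "No QE", "Selected Source", "Multiple Sources"],
#                    ["Selected Source", "Wikipedia", "No QE", "Multiple Sources"]])
# -> {'Wikipedia': 3, 'No QE': 5, 'Selected Source': 4, 'Multiple Sources': 8}
```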

5.3 Evaluation Guideline

Based on the feedback we received during the evaluation sessions, we collected a number of criteria which could guide future user based evaluations and which complement our findings:

– A clear rating strategy for comparing result lists should be defined prior to the task to prevent different rating schemes.
– Participants reported that it was often hard to decide between four lists, in particular if there were multiple similar lists for different configurations. Therefore only the baseline plus one of the configurations should be compared against each other.
– If one wants to research diversity or serendipity in result lists, the participants should be instructed to compare the entire result set, not just the first few items.
– The amount of queries judged by one user should be selected carefully. The judgement process may take longer than expected and workers tend to get indifferent later in the process, which might lead to randomly chosen results.
– To keep the task short one may consider using only the title of the search result. This requires the title to be informative enough, which might not always be the case.
– Introduce questions where participants have to give insights into their decision making process, similar to Kittur et al. [10]. This should also give the participants the impression that their decisions and answers will be examined closely and thus should help to improve the quality of their answers.

6 Conclusions and Future Work

In our evaluation we found that there is a large variety in how users assess the usefulness of search results when they are not primed with a predefined scheme. Furthermore, users also showed a preference for more general search hits in the top results. Both findings can be exploited to improve future search systems, in particular for aggregated search scenarios. We also found that different techniques for pseudo relevance feedback provide varying benefit. In particular, the use of Wikipedia as knowledge base and carefully selected single source approaches seem to be a sensible choice. In analysing the results, we found that queries should be pre-processed to determine whether they fall into the category of entity-centric or topical queries. Entity-centric queries should then be processed differently from other types of queries. More research is needed to gain a deeper understanding of why this kind of query does not respond well to being expanded, and of what an optimal strategy for query processing looks like. As future work we plan to follow our proposed guidelines in upcoming evaluations, in particular using crowd sourcing techniques. We also plan to extend the evaluation to study the impact of different aggregation methods and to research how to increase the diversity of search results without negatively affecting their precision.

Acknowledgments. The presented work was developed within the EEXCESS project funded by the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement number 600601. The Know-Center is funded within the Austrian COMET Program - Competence Centers for Excellent Technologies - under the auspices of the Austrian Federal Ministry of Transport, Innovation and Technology, the Austrian Federal Ministry of Economy, Family and Youth and by the State of Styria. COMET is managed by the Austrian Research Promotion Agency FFG. We would also like to thank all our test users, who undertook the tedious job of scrutinising vast amounts of search results.

References

1. Agrawal, R., Gollapudi, S., Halverson, A., Ieong, S.: Diversifying search results. In: Proceedings of the Second ACM International Conference on Web Search and Data Mining, WSDM 2009, pp. 5–14. ACM, New York (2009)
2. Amati, G., Van Rijsbergen, C.J.: Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems 20(4), 357–389 (2002). http://doi.acm.org/10.1145/582415.582416
3. Arguello, J., Diaz, F., Callan, J., Carterette, B.: A methodology for evaluating aggregated search results. In: Clough, P., Foley, C., Gurrin, C., Jones, G.J.F., Kraaij, W., Lee, H., Murdock, V. (eds.) ECIR 2011. LNCS, vol. 6611, pp. 141–152. Springer, Heidelberg (2011)
4. Cai, D., Yu, S., Wen, J.R., Ma, W.Y.: Block-based web search. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 456–463. ACM (2004)
5. Diaz, F., Allan, J.: When less is more: Relevance feedback falls short and term expansion succeeds at HARD 2005. Tech. rep., DTIC Document (2006)
6. Gehlen, V., Finamore, A., Mellia, M., Munafò, M.M.: Uncovering the big players of the web. In: Pescapè, A., Salgarelli, L., Dimitropoulos, X. (eds.) TMA 2012. LNCS, vol. 7189, pp. 15–28. Springer, Heidelberg (2012)
7. Harman, D.: Relevance feedback and other query modification techniques (1992)
8. He, B., Ounis, I.: Combining fields for query expansion and adaptive query expansion. Information Processing & Management 43(5), 1294–1307 (2007). http://linkinghub.elsevier.com/retrieve/pii/S0306457306001956
9. Kazai, G., Kamps, J., Koolen, M., Milic-Frayling, N.: Crowdsourcing for book search evaluation: impact of HIT design on comparative system ranking. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, pp. 205–214. ACM, New York (2011). http://doi.acm.org/10.1145/2009916.2009947
10. Kittur, A., Chi, E.H., Suh, B.: Crowdsourcing user studies with Mechanical Turk. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI 2008, pp. 453–456. ACM, New York (2008). http://doi.acm.org/10.1145/1357054.1357127

11. Kopliku, A., Pinel-Sauvagnat, K., Boughanem, M.: Aggregated search: A new information retrieval paradigm. ACM Computing Surveys (CSUR) 46(3), 41 (2014)
12. Lam-Adesina, A.M., Jones, G.J.: Applying summarization techniques for term selection in relevance feedback. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1–9. ACM (2001)
13. Lynam, T.R., Buckley, C., Clarke, C.L., Cormack, G.V.: A multi-system analysis of document and term selection for blind feedback. In: Proceedings of the 13th ACM International Conference on Information and Knowledge Management, pp. 261–269. ACM (2004)
14. Minnie, D., Srinivasan, S.: Meta search engines for information retrieval on multiple domains. In: Proceedings of the International Joint Journal Conference on Engineering and Technology (IJJCET 2011), pp. 115–118 (2011)
15. Montgomery, J., Si, L., Callan, J., Evans, D.: Effect of varying number of documents in blind feedback: analysis of the 2003 NRRC RIA workshop bfnumdocs experiment suite. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2004 (2004)
16. Pass, G., Chowdhury, A., Torgeson, C.: A picture of search. In: Proceedings of the 1st International Conference on Scalable Information Systems, InfoScale 2006. ACM, New York (2006). http://doi.acm.org/10.1145/1146847.1146848
17. Radlinski, F., Dumais, S.: Improving personalized web search using result diversification. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 691–692. ACM (2006)
18. Rhodes, B.J.: Just-in-time information retrieval. Ph.D. thesis, Massachusetts Institute of Technology (2000)
19. Santos, R.L., Macdonald, C., Ounis, I.: Exploiting query reformulations for web search result diversification. In: Proceedings of the 19th International Conference on World Wide Web, pp. 881–890. ACM (2010)
20. Schlötterer, J., Seifert, C., Granitzer, M.: Web-based just-in-time retrieval for cultural content. In: PATCH 2014: Proceedings of the 7th International ACM Workshop on Personalized Access to Cultural Heritage (2014)
21. Shokouhi, M., Azzopardi, L., Thomas, P.: Effective query expansion for federated search. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009, pp. 427–434. ACM, New York (2009). http://doi.acm.org/10.1145/1571941.1572015
22. Voorhees, E.M.: The philosophy of information retrieval evaluation. In: Peters, C., Braschler, M., Gonzalo, J., Kluck, M. (eds.) CLEF 2001. LNCS, vol. 2406, pp. 355–370. Springer, Heidelberg (2002)