Evaluation of Pseudo Relevance Feedback Techniques for Cross Vertical Aggregated Search

Hermann Ziak and Roman Kern

Know-Center GmbH, Inffeldgasse 13, 8010 Graz, Austria {hziak,rkern}@know-center.at

Abstract. Cross vertical aggregated search is a special form of meta search, where multiple search engines from different domains and of varying behaviour are combined to produce a single search result for each query. Such a setting poses a number of challenges, among them the question of how to best evaluate the quality of the aggregated search results. We devised an evaluation strategy together with an evaluation platform in order to conduct a series of experiments. In particular, we are interested in whether pseudo relevance feedback helps in such a scenario. Therefore we implemented a number of pseudo relevance feedback techniques based on knowledge bases, where the knowledge base is either Wikipedia or a combination of the underlying search engines themselves. While conducting the evaluations we gathered a number of qualitative and quantitative results and gained insights into how different users compare the quality of search result lists. Regarding the pseudo relevance feedback, we found that using Wikipedia as knowledge base generally provides a benefit, except for entity-centric queries, which target single persons or organisations. Our results will help to steer the development of cross vertical aggregated search engines and to guide large scale evaluation strategies, for example using crowd sourcing techniques.

1 Introduction

Today's web users tend to always revert to the same sources of information [6], even though other potentially valuable sources of information exist. These sources are highly specialized in certain topics, but are often left out since they are not familiar to the user. One key aspect to tackle this issue is to devise search methods that keep the user's effort minimal, where meta search serves as a starting point. This is motivated by the goal of improving the public awareness of systems in domains which are considered to be niche areas by the general public, like cultural heritage or science. Meta search is the task of distributing a query to multiple search engines and combining their results into a single result list. In meta search there is usually no strict separation of domains, thus the results are expected to be homogeneous or even redundant, for example results from different web search engines. Vertical search engines, on the other hand, combine results from sources of different domains. In our case these verticals or sources are highly specialized collections, for example medicine, business, history, art or science. These verticals might also differ in the type of items which are retrieved [3] (e.g. images, web pages, textual documents). An example of vertical search is the combination of results from an underlying image search with results from a traditional textual search. In our work we focus on cross vertical aggregated search engines [11], also known as multi domain meta search engines [14], where we do not make any assumptions about the domain of the individual sources. Hence, in such a scenario the challenges [11] of both types of aggregated search engines are inherited. In particular we are dealing with so called uncooperative sources, thus the individual search engines are treated as black boxes.

The overall goal of our work is to gain a profound understanding of how to provide aggregated search results which prove to be useful for the user. This directly addresses the question of how to assess this usefulness, i.e. how to evaluate such a system. The traditional approach for evaluation follows the Cranfield paradigm [22]. Here the retrieval performance is assessed by a fixed set of relevant documents for each query and typically evaluated offline using mean average precision (MAP), normalized discounted cumulative gain (NDCG) or related measures. This type of evaluation does not appear to be appropriate, as it does not capture aspects like diversity, serendipity and the usefulness of long-tail content, which we consider to play an important role in our setting. Furthermore, these indicators are hard to measure since ground truth data is hard to create for cross vertical search systems where sources might be uncooperative. In order to fill this gap, we conducted user centred evaluations to get a better understanding of how users perceive search result lists and how to design evaluations in such a setting. In particular we are interested in how users evaluate longer result sets against each other, not only judging the top documents alone. Therefore we developed a dedicated evaluation tool allowing users to interactively vote for the results which best match their expectations.

The evaluation platform also allows us to evaluate the impact of different retrieval techniques, more specifically the integration of pseudo relevance feedback. In pseudo relevance feedback the search is conducted two times. First the original query is issued and the top search hits are analysed. From those search hits a number of query term candidates are selected and added to the query. The expanded query is then used to generate the final search result list. As an extension to the basic procedure, the source used for the first query might be different from the one used for the second round. Thus different knowledge bases can be studied with respect to how they impact the final results when used for the first search. Before conducting the actual evaluation, we did a preliminary test with a few friendly users to fine tune the evaluation system. In our main evaluation we gathered qualitative insights and quantitative results on the integration of pseudo relevance feedback into the retrieval process, and whether it proves beneficial and helps to diversify the results of specialized sources.
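To make this two-round procedure concrete, the following minimal sketch illustrates pseudo relevance feedback against a generic search interface; the `search` method, the `title` attribute and the parameter defaults are illustrative assumptions, not part of our implementation.

```python
from collections import Counter

def extract_candidate_terms(hits, limit=20):
    # Count term frequencies over the titles of the top hits and keep
    # the most frequent ones as expansion candidates.
    terms = [token.lower() for hit in hits for token in hit.title.split()]
    return [term for term, _ in Counter(terms).most_common(limit)]

def pseudo_relevance_feedback(query, feedback_source, target_sources,
                              top_k=10, num_terms=20):
    # Round 1: issue the original query and analyse the top search hits.
    initial_hits = feedback_source.search(query)[:top_k]
    candidates = extract_candidate_terms(initial_hits, limit=num_terms)
    # Add the selected candidate terms to the original query.
    expanded_query = query + " " + " ".join(candidates)
    # Round 2: the expanded query produces the final search result lists.
    return {source: source.search(expanded_query) for source in target_sources}
```

Note that `feedback_source` may differ from the target sources, which is exactly the setting studied in the remainder of this paper.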
Another outcome of our work is a guideline on how human intelligence tasks have to be designed for large scale evaluations on crowd-sourcing platforms like Amazon Mechanical Turk.

Fig. 1. Overview of the basic architecture of the whole system. In the user context detection component the query is extracted from the current user's context. The cross vertical aggregated search engine is the core of the system, where queries are expanded and distributed to the sources. The source connector is responsible for invoking the source specific API and returning the individual search results, which are finally merged by the result aggregation component.

2 System Overview

Our cross vertical search algorithms are at the core of a bigger system, which is developed within the EEXCESS1 ("Enhancing Europe's eXchange in Cultural Educational and Scientific reSources") project. The code is available under an open-source license2. An overview of the architecture is given in Figure 1. The vision of the project is to recommend high quality content from many sources to platforms and devices which are used on a daily basis, for example in the form of a browser add-on. In a traditional information retrieval setting the user is requested to explicitly state her information need, typically as a query consisting of a few keywords. We consider the case where, in addition to the explicit search capabilities, the information need is not explicitly given. In this case the query is automatically inferred from the current context by the user context detection component [20]. Such a setting is also known as just-in-time information retrieval [18] and has a close connection to the field of recommender systems research. The search result list is continuously updated according to the users' interactions, for example when navigating from one web site to another. Next, the query is processed by the query reformulation step, where the query expansion takes place. Optionally one of the pseudo relevance feedback algorithms is applied to add related query terms to the original query. The query is then fed to all known sources, i.e. all search engines that are registered with the system, via source specific connectors. These source specific connectors then adapt the query to the source specific format and invoke the respective API calls, for example by the use of the Open Search API3. Finally, the results from all sources are collected and aggregated into a single search result list that is presented to the users.

1 http://eexcess.eu
2 http://github.com/EEXCESS/recommender
3 http://www.opensearch.org/
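The interplay of these components can be summarised in a short sketch; the class and method names below are illustrative assumptions and do not reflect the actual EEXCESS code base.

```python
class SourceConnector:
    """Wraps one registered search engine and its source specific API."""

    def __init__(self, name, api_client):
        self.name = name
        self.api_client = api_client  # e.g. a hypothetical OpenSearch client

    def translate(self, query):
        # Source specific query rewriting would happen here.
        return query

    def search(self, query):
        # Adapt the query to the source specific format and invoke the API.
        return self.api_client.query(self.translate(query))


def aggregated_search(query, connectors, expand=None, aggregate=None):
    # Optionally apply one of the pseudo relevance feedback algorithms.
    if expand is not None:
        query = expand(query)
    # Fan the (expanded) query out to all registered sources ...
    per_source_results = [connector.search(query) for connector in connectors]
    # ... and merge the individual result lists into a single one.
    return aggregate(per_source_results)
```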

3 System Details

The automatic query generation poses a set of challenges, as the true information need might be only partially present in these automatically inferred queries and might not cover the user's intent well. One approach to deal with such problems is to diversify the results, as suggested in the literature [19]. Diversification can be achieved by a number of methods, ranging from mining query logs to query reformulation strategies [17]. Other diversification techniques like IA-Select [1] rely on a categorization of the query and the retrieved documents to greedily rearrange the given result lists. In the end, the final presented result should cover all topics of the query in proportion to its categories. Although we considered such approaches, we found that in our meta search environment some of the verticals return insufficient information to do a categorization of the results. For example, digital libraries of images only supply short titles and no additional metadata. Mining query logs also does not apply to our scenario, as our system should also work with uncooperative sources. Language models are another source of information which could provide a benefit in the query expansion stage. One way to obtain a language model of an uncooperative source is probing. Here a number of search requests are issued to the individual sources to collect a sample set of the source's documents. Pass et al. [16] showed that about 30,000 to 60,000 documents are required for every source to get a decent representation of the source's language model. Again, such sampling methods will not work for our system for a number of reasons (e.g. the sources might restrict the number of API calls per day). Therefore we opted for a solution that does not rely on such datasets, namely pseudo relevance feedback with the help of knowledge bases.

3.1 Query Reformulation

We expect the user context detection component to produce short queries, which is an additional motivation to use query expansion as reformulation strategy. For the query expansion we followed the advice found in the literature. We limited the retrieved documents to the ten top-ranked documents, as suggested by Montgomery et al. [15], to extract query expansion term candidates. Out of these documents we extracted the top terms and removed duplicate query terms. There are several suggestions on the number of query terms to be used for query expansion [4]. Harman [7] showed that after a certain amount of added expansion terms there is a drop in precision. Most recommendations vary from ten to twenty. We decided to use the twenty most frequent terms for our evaluation. Next, one needs to define which metadata fields to use when selecting the query expansion candidate terms. Most of our sources provide a description together with the title for each of their search results, while others just return the title. Existing work shows that using the title only might already result in satisfying performance [4]. Therefore we opted to use just the title, even if a description is present for some of the search results. Another problem in pseudo relevance feedback is the so called query drift [12], where the additional terms introduced by the query expansion also cause a semantic drift away from the original user's information need. Shokouhi et al. [21] tackled this problem by running the query expansion separately for each source and using just the respective source to produce the candidate terms. They also pointed out one disadvantage of such a procedure: not all sources might be equally suited to produce expansion terms. In our case some of the sources return very sparse textual information (e.g. sources specialized on visual content with short, similar titles). Here the query expansion algorithm will also pick semantically unrelated terms simply due to the data sparseness. Shokouhi, Azzopardi and Thomas [21] demonstrated that results could benefit from taking only selected sources for query expansion instead of following a global approach where all sources are treated equally. Lynam et al. [13] showed that the extent to which a source is suited to provide query expansion for other sources can be estimated by the performance benefit when it is used on itself. This implies that picking the source which demonstrated the best result in a single search setting could also be a viable option to produce query expansion terms for other sources. This can be extended further when an external knowledge base is used for query expansion which is not actually used for the aggregated search. Finally, our system features three different strategies for query expansion together with a baseline; a sketch of the candidate term extraction and ranking is given after the strategy descriptions below.

No Query Expansion. In this setting the query is not expanded and is sent as-is to the sources; this serves as the baseline.

Multiple Sources. The multiple sources approach takes all sources into account by first retrieving a combined search result list of all sources using no query expansion. This initial aggregated result list is taken as input to compute and rank candidates for the query expansion step. Next, all sources are queried using these query terms.

Single Source. This approach is similar to the multiple sources approach, but takes only a single, selected source into consideration. This source has been selected based on its observed behaviour when used to expand queries applied only to itself.

External Knowledge Base. For this query expansion strategy we used Wikipedia, which has already been used in existing research [8]. Motivated by the assumption raised by Cai et al. [4] regarding the diverse content of web pages, we segmented Wikipedia pages into their main sections and indexed these sections as separate documents. For pseudo relevance feedback we ranked the terms contained in the top hits using the Divergence from Randomness approach [2]. Here, Wikipedia is not part of the sources which contribute to the final search result list.
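As announced above, the following sketch shows how expansion candidates could be extracted from the titles of the ten top-ranked hits and ranked. The Bo1 model is used here as one concrete term weighting scheme from the Divergence from Randomness framework [2]; the exact weighting model and the collection statistics interface are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def expansion_terms(top_hits, collection_term_freq, num_docs, num_terms=20):
    """Rank expansion candidates from the titles of the top hits with a
    Bo1-style DFR weight; `collection_term_freq` maps a term to its
    frequency in the feedback corpus (assumed available, e.g. from the
    Wikipedia section index)."""
    tokens = [t for hit in top_hits
              for t in re.findall(r"[a-z0-9]+", hit["title"].lower())]
    tf_in_feedback_set = Counter(tokens)

    def bo1(term, tf_x):
        # Expected relative frequency of the term in the collection;
        # unseen terms fall back to a count of 1 to avoid division by zero.
        p_n = collection_term_freq.get(term, 1) / num_docs
        return tf_x * math.log2((1.0 + p_n) / p_n) + math.log2(1.0 + p_n)

    ranked = sorted(tf_in_feedback_set.items(),
                    key=lambda item: bo1(*item), reverse=True)
    return [term for term, _ in ranked[:num_terms]]
```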

3.2 Source Specific Query Reformulation

Every source may require a different, dedicated query language. Therefore the query has to be adapted specifically to the sources' capabilities. As a general strategy, we formulated the expanded query as a conjunction of the original query terms followed by the new terms as a disjunction. We expect that this approach will generally produce satisfying results, although tweaking the query reformulation for every source would most likely provide an additional benefit.
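A minimal sketch of this general strategy, emitting a Lucene-style query string; the concrete syntax, and whether the disjunction is mandatory or merely boosts matching results, depend on the individual source and are assumptions here.

```python
def reformulate(original_terms, expansion_terms):
    """Build the expanded query: the original terms form a conjunction,
    the expansion terms a trailing disjunction (one plausible reading
    of the strategy described above)."""
    conjunction = " AND ".join(original_terms)
    disjunction = " OR ".join(expansion_terms)
    if not disjunction:
        return conjunction  # baseline: no query expansion
    return f"({conjunction}) AND ({disjunction})"

# Example:
# reformulate(["euro", "conversion", "rate"], ["currency", "exchange"])
# -> '(euro AND conversion AND rate) AND (currency OR exchange)'
```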

3.3 Result Aggregation

The final stage of our cross vertical aggregated search is the merging of the individual search results. Many different approaches have been proposed in the literature on how to combine search results from different sources [11,3]. For the evaluation we tried to keep the result aggregation as simple as possible to prevent any interference with the query expansion. Therefore we followed a simple round robin based approach. This is additionally motivated by our intention to keep the results deterministic and reproducible. Here the results of all sources are combined by picking the top ranked results of each list in a fixed sequence, i.e. the first result of the first source followed by the first result of the second source, and so on.
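A sketch of this round robin merging step, with plain Python lists standing in for the per-source result lists:

```python
def round_robin_merge(result_lists):
    """Interleave the per-source result lists in a fixed order: first
    result of the first source, first result of the second source, ...,
    then the second results, and so on."""
    merged = []
    longest = max((len(results) for results in result_lists), default=0)
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results):  # shorter lists simply run out
                merged.append(results[rank])
    return merged

# Example: round_robin_merge([["a1", "a2"], ["b1"]]) -> ["a1", "b1", "a2"]
```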

4 Evaluation

The main goal of our evaluation was to arrive at a deeper understanding of how users judge the usefulness of the search results produced by our system. Furthermore, we wanted to assess the impact of our pseudo relevance feedback configurations. Additionally, the evaluation should contribute to the understanding of how to design an evaluation for crowd-sourcing platforms, complementary to existing work in this area [9,10].

Evaluation Platform. We opted to build our own tool to conduct the user based evaluation, as this would allow us to control all parameters of the algorithms. See Figure 2 for a screenshot of the tool. The user is presented with a fixed query together with an optional short description of the query and some background information. A number of different search result lists are presented next to each other. The user then has to compare these search results and decide on a ranking of the lists.

Fig. 2. Screenshot of the evaluation user interface, for the query "euro conversion rate". A total of four different search result lists are presented next to each other. The user has already picked result list #3 (green) as the best and result list #2 (yellow) as the second best result. Third and fourth place are not decided yet.

By clicking on the respective search result list, the user expresses her preference on the ranking of the results. Once the sequence is defined, the user is routed to the next query. All decisions of the user are recorded together with the time consumed for each task. In the design of the tool great attention has been dedicated to keeping the results and behaviour of the tool deterministic and consistent. For example, the search result lists are identical for each user within one evaluation run. At the same time our tool is flexible enough to allow configuring how the search results should look. For example, they may contain an optional preview image, or may be composed of the title alone or a combination of the title plus a description. The individual search result lists are generated by different configurations of the pseudo relevance feedback techniques. The sequence, i.e. which technique comes first and so forth, has been randomly chosen to prevent any bias. The actual algorithm has been recorded by the tool, but not presented to the user. Thus no hints on the way the search result lists were generated are available to the user.
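A sketch of how such a decision, together with the consumed time, could be recorded; the field names are illustrative assumptions and not the tool's actual data model.

```python
import time
from dataclasses import dataclass, field

@dataclass
class RankingJudgement:
    """One recorded decision: how a user ranked the competing result
    lists for a query."""
    user_id: str
    query: str
    configurations: list                       # name per list, hidden from the user
    preferred_order: list = field(default_factory=list)  # e.g. [3, 2, 1, 4]
    seconds_spent: float = 0.0

def record_judgement(user_id, query, configurations, collect_order):
    started = time.time()
    order = collect_order()  # e.g. the clicks collected by the web UI
    return RankingJudgement(user_id, query, list(configurations),
                            order, time.time() - started)
```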

Query Selection. As input our evaluation tool requires a list of queries to be presented to the user. The query list was preselected by us. The decision to predefine queries was made in order to create result lists with a balanced amount of results from each source, which could not have been guaranteed otherwise. Thus we also did not follow the proposal of Diaz et al. [5] to give users the opportunity to submit their own queries. Furthermore, in our system the user is typically not expected to manually define the query terms, as they should be automatically inferred from the current context. The final query terms were chosen from the AOL query log [16] and further individually selected to match the sources, to prevent the search results from being dominated by specific sources.

Table 1. Results of all four query expansion strategies, where the number indicates the accumulated rank, thus lower values are better. Results are also separated into entity-centric queries and topical queries, where the first type of query refers to individuals, organisations, and other types of entities.

Expansion Method    Overall Score   Entity-Centric Queries   Topical Queries
No QE               431             103                      328
Multiple Sources    466             129                      337
Selected Source     427             143                      284
Wikipedia           426             155                      271

Pre-test. Before starting the actual evaluation with our framework we conducted a small pre-test with friendly users. We gathered some insights, which allowed us to fine tune the evaluation tool and the procedure. In short, the three main findings were the following. The results themselves should be uniformly presented; thus, even if some results may provide an additional preview image or a rich textual description, it is preferable to stick with the smallest common denominator. Therefore in the evaluation just the title is displayed for all search results. A second result of the pre-test concerns the number of queries to be evaluated. Initially we had foreseen having our users assess a total of 30 queries. Apparently, the inspection of the search results takes a lot of time, therefore we reduced the number of queries to just 10 for the main evaluation. Finally, it has been observed that some of the sources also returned non-English results. Thus the feedback was to filter out these results and only keep English ones.

Main Evaluation. The main evaluation took place during a computer science conference, where we tried to motivate conference visitors to take part in the evaluation. Of the visitors who were motivated enough to start the evaluation, a total of 20 managed to state their preference for all search result lists for all queries. Curious users were given a short background of the system and a brief introduction to the user interface. Apart from that, no hints were given as to the criteria by which the search result lists should be judged. This has been done in order not to introduce any bias. Comments and feedback from the users were collected in addition to the interactions recorded by the evaluation tool. Generally, none of the participants signalled having problems using the evaluation tool.

5 Results

Given the recordings of our tool and the feedback of the participants we can summarise the results in two ways, qualitatively and quantitatively. Finally, this allows us to provide a guideline on how our evaluation could be improved and how future evaluations should be designed.

5.1 Qualitative Evaluation

Users reported that deciding between results was often hard, since the results appeared to be quite similar in many cases. Krippendorff's alpha as a measure of inter-rater agreement lies between 0.66 and 0.78 for the different configurations. This agreement can only be considered as "fair", corroborating the subjective impression of the participants. This is also caused by the variety in how users conducted the process of comparing search result lists. Some users just focused on the top results, others picked the best overall result set by studying each document in the lists. There does not appear to be a single, uniform strategy for users to assess search result lists. This could also be seen as an indicator why we observed a big variation in evaluation time, between 1 and 5 minutes per query. Further, since the order of the queries was not randomized in our evaluation, studying each document might lead to a higher fatigue of the user and therefore might bias the results. Generally, these results also hint that applying a form of personalisation to the search results could cater for the different assessment types. Some users showed a clear preference for result lists with more general top items, e.g. overview articles. One outcome of the qualitative evaluation therefore suggests designing a system such that the first few search results are of a more general nature. This should maximise the chances that users perceive a search result as an appropriate response to the given query or information need. For our scenario of aggregating multiple sources, it appears advisable to reserve the first few spots of the aggregated search result for items from more generic sources, while allowing more specific sources to populate the remaining result list; a sketch of such an aggregation variant is given below.
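The sketch below illustrates such an aggregation variant under the assumption that sources are labelled as generic or specific; the number of reserved slots is a free parameter chosen for illustration.

```python
def generic_first_merge(generic_lists, specific_lists, reserved_slots=3):
    """Reserve the first few positions for results from the more generic
    sources, then interleave everything that is left round robin style."""
    queues = [list(results) for results in generic_lists + specific_lists]
    generic_queues = queues[:len(generic_lists)]
    merged = []
    # Fill the reserved top spots from the generic sources only.
    while len(merged) < reserved_slots and any(generic_queues):
        for queue in generic_queues:
            if queue and len(merged) < reserved_slots:
                merged.append(queue.pop(0))
    # The remaining results follow the plain round robin scheme.
    while any(queues):
        for queue in queues:
            if queue:
                merged.append(queue.pop(0))
    return merged
```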

5.2 Quantitative Evaluation

Table 1 summarises the results from the recordings of the evaluation tool and compares the four different pseudo relevance feedback strategies. The numbers represent the accumulated rank of the users' ratings, hence a lower number indicates a preference for the respective configuration. Two of the three pseudo relevance feedback strategies yielded better results than the baseline without any query reformulation. The knowledge base setting using Wikipedia for query expansion appears to give the best overall results, followed closely by the selected source strategy, although the distinctions are minimal. However, when inspecting the results in more detail, we made an interesting observation. We discovered that there is a pronounced discrepancy between queries which can be described as entity-centric queries and topical queries. For entity-centric queries one would expect that there is a single, defining Wikipedia page, for example "Michelle Obama". For this kind of query, the query expansion using Wikipedia did not provide any benefit; on the contrary, it had a negative impact. This might be due to the way the query expansion terms are constructed and the fact that the terms used for expansion allow a too large query drift. From the results of the quantitative evaluation, one can conclude that pseudo relevance feedback might help, but not for all configurations and queries. Using Wikipedia as knowledge base demonstrated the best overall performance. For entity-centric queries it is suggested to introduce a query pre-processing step in which the type of a query is inferred, and to enable pseudo relevance feedback just for topical queries.
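For reference, accumulated rank scores of the kind reported in Table 1 can be computed by summing, for each configuration, the rank position every participant assigned to it for every query; the sketch below assumes rankings are stored as ordered lists of configuration names (lower totals indicate preference).

```python
from collections import defaultdict

def accumulated_ranks(judgements):
    """`judgements` is an iterable of rankings, each an ordered list of
    configuration names from best to worst (one per user and query)."""
    totals = defaultdict(int)
    for ranking in judgements:
        for position, configuration in enumerate(ranking, start=1):
            totals[configuration] += position
    return dict(totals)

# Example with two hypothetical judgements:
# accumulated_ranks([["Wikipedia", "No QE", "Selected Source", "Multiple Sources"],
#                    ["Selected Source", "Wikipedia", "No QE", "Multiple Sources"]])
# -> {'Wikipedia': 3, 'No QE': 5, 'Selected Source': 4, 'Multiple Sources': 8}
```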

5.3 Evaluation Guideline

Based on the feedback we received during the evaluation sessions, we collected a number of criteria which could guide future user based evaluations and which complement our findings:

– A clear rating strategy for comparing result lists should be defined prior to the task to prevent different rating schemes.
– Participants reported that it was often hard to decide between four lists, in particular if there were multiple similar lists for different configurations. Therefore only the baseline plus one of the configurations should be compared against each other.
– If one wants to research diversity or serendipity in result lists, the participants should be instructed to compare the entire result set, not just the first few items.
– The amount of queries judged by one user should be selected carefully. The judgement process may take longer than expected and workers tend to get indifferent later in the process, which might lead to randomly chosen results.
– To keep the task short one may consider using only the title of the search result. This requires the title to be informative enough, which might not always be the case.
– Introduce questions where participants have to give insights into their decision making process, similar to Kittur et al. [10]. This should also give the participants the impression that their decisions and answers will be examined closely and thus should help to improve the quality of their answers.

6 Conclusions and Future Work

In our evaluation we found that there is a large variety in how users assess the usefulness of search results when they are not primed with a predefined scheme. Furthermore, users also showed a preference for more general search hits in the top results. Both findings can be exploited to improve future search systems, in particular for aggregated search scenarios. We also found that different techniques for pseudo relevance feedback provide varying benefit. In particular, the use of Wikipedia as knowledge base and carefully selected single source approaches seem to be a sensible choice. In analysing the results, we found that queries should be pre-processed to determine whether they fall into the category of entity-centric or topical queries. Entity-centric queries should then be processed differently from other types of queries. More research is needed to gain a deeper understanding of why this kind of query does not respond well to being expanded, and of what an optimal strategy for query processing looks like. As future work we plan to follow our proposed guidelines in upcoming evaluations, in particular using crowd sourcing techniques. We also plan to extend the evaluation to study the impact of different aggregation methods and to research how to increase the diversity of search results without negatively affecting their precision.

Acknowledgments. The presented work was developed within the EEXCESS project funded by the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement number 600601. The Know-Center is funded within the Austrian COMET Program - Competence Centers for Excellent Technologies - under the auspices of the Austrian Federal Ministry of Transport, Innovation and Technology, the Austrian Federal Ministry of Economy, Family and Youth and by the State of Styria. COMET is managed by the Austrian Research Promotion Agency FFG. We would also like to thank all our test users, who undertook the tedious job of scrutinising vast amounts of search results.

References

1. Agrawal, R., Gollapudi, S., Halverson, A., Ieong, S.: Diversifying search results. In: Proceedings of the Second ACM International Conference on Web Search and Data Mining, WSDM 2009, pp. 5–14. ACM, New York (2009)
2. Amati, G., Van Rijsbergen, C.J.: Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems 20(4), 357–389 (2002). http://doi.acm.org/10.1145/582415.582416
3. Arguello, J., Diaz, F., Callan, J., Carterette, B.: A methodology for evaluating aggregated search results. In: Clough, P., Foley, C., Gurrin, C., Jones, G.J.F., Kraaij, W., Lee, H., Murdock, V. (eds.) ECIR 2011. LNCS, vol. 6611, pp. 141–152. Springer, Heidelberg (2011)
4. Cai, D., Yu, S., Wen, J.R., Ma, W.Y.: Block-based web search. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 456–463. ACM (2004)
5. Diaz, F., Allan, J.: When less is more: Relevance feedback falls short and term expansion succeeds at HARD 2005. Tech. rep., DTIC Document (2006)
6. Gehlen, V., Finamore, A., Mellia, M., Munafò, M.M.: Uncovering the big players of the web. In: Pescapè, A., Salgarelli, L., Dimitropoulos, X. (eds.) TMA 2012. LNCS, vol. 7189, pp. 15–28. Springer, Heidelberg (2012)
7. Harman, D.: Relevance feedback and other query modification techniques (1992)
8. He, B., Ounis, I.: Combining fields for query expansion and adaptive query expansion. Information Processing & Management 43(5), 1294–1307 (2007). http://linkinghub.elsevier.com/retrieve/pii/S0306457306001956
9. Kazai, G., Kamps, J., Koolen, M., Milic-Frayling, N.: Crowdsourcing for book search evaluation: impact of HIT design on comparative system ranking. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, pp. 205–214. ACM, New York (2011). http://doi.acm.org/10.1145/2009916.2009947
10. Kittur, A., Chi, E.H., Suh, B.: Crowdsourcing user studies with Mechanical Turk. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI 2008, pp. 453–456. ACM, New York (2008). http://doi.acm.org/10.1145/1357054.1357127

11. Kopliku, A., Pinel-Sauvagnat, K., Boughanem, M.: Aggregated search: A new information retrieval paradigm. ACM Computing Surveys (CSUR) 46(3), 41 (2014)
12. Lam-Adesina, A.M., Jones, G.J.: Applying summarization techniques for term selection in relevance feedback. In: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1–9. ACM (2001)
13. Lynam, T.R., Buckley, C., Clarke, C.L., Cormack, G.V.: A multi-system analysis of document and term selection for blind feedback. In: Proceedings of the 13th ACM International Conference on Information and Knowledge Management, pp. 261–269. ACM (2004)
14. Minnie, D., Srinivasan, S.: Meta search engines for information retrieval on multiple domains. In: Proceedings of the International Joint Journal Conference on Engineering and Technology (IJJCET 2011), pp. 115–118 (2011)
15. Montgomery, J., Si, L., Callan, J., Evans, D.: Effect of varying number of documents in blind feedback: analysis of the 2003 NRRC RIA workshop bfnumdocs experiment suite. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2004 (2004)
16. Pass, G., Chowdhury, A., Torgeson, C.: A picture of search. In: Proceedings of the 1st International Conference on Scalable Information Systems, InfoScale 2006. ACM, New York (2006). http://doi.acm.org/10.1145/1146847.1146848
17. Radlinski, F., Dumais, S.: Improving personalized web search using result diversification. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 691–692. ACM (2006)
18. Rhodes, B.J.: Just-in-time information retrieval. Ph.D. thesis, Massachusetts Institute of Technology (2000)
19. Santos, R.L., Macdonald, C., Ounis, I.: Exploiting query reformulations for web search result diversification. In: Proceedings of the 19th International Conference on World Wide Web, pp. 881–890. ACM (2010)
20. Schlötterer, J., Seifert, C., Granitzer, M.: Web-based just-in-time retrieval for cultural content. In: PATCH 2014: Proceedings of the 7th International ACM Workshop on Personalized Access to Cultural Heritage (2014)
21. Shokouhi, M., Azzopardi, L., Thomas, P.: Effective query expansion for federated search. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009, pp. 427–434. ACM, New York (2009). http://doi.acm.org/10.1145/1571941.1572015
22. Voorhees, E.M.: The philosophy of information retrieval evaluation. In: Peters, C., Braschler, M., Gonzalo, J., Kluck, M. (eds.) CLEF 2001. LNCS, vol. 2406, pp. 355–370. Springer, Heidelberg (2002)