Distributed Multisearch and Resource Selection for the TREC Million Query Track
Chris Fallen and Greg Newby
Arctic Region Supercomputing Center, University of Alaska Fairbanks, Fairbanks, AK 99775

Kylie McCormick
Mount Holyoke College, South Hadley, MA 01075

Abstract

A distributed information retrieval system with resource-selection and result-set merging capability was used to search subsets of the GOV2 document corpus for the 2008 TREC Million Query Track. The GOV2 collection was partitioned into host-name subcollections and distributed to multiple remote machines. The Multisearch demonstration application restricted each search to a fraction of the available sub-collections that was pre-determined by a resource-selection algorithm. Experiment results from topic-by-topic resource selection and aggregate topic resource selection are compared. The sensitivity of Multisearch retrieval performance to variations in the resource selection algorithm is discussed.

1. Overview

The information processing research group at ARSC works on problems affecting the performance of distributed information retrieval applications such as metasearch [1], federated search [2], and collection sampling [3]. An ongoing goal of this research is to guide the selection of standards and reference implementations for Grid Information Retrieval (GIR) applications [4]. Prototype GIR applications developed at ARSC help to evaluate theoretical research and gain experience with the capabilities and limitations of existing APIs, middleware, and security requirements. The TREC experiments provide an additional context to test and develop distributed IR technology.

Prior TREC Terabyte (TB) and Million Query (MQ) Track experiments performed at ARSC have explored the IR performance and search efficiency of result-set merging and ranking across small numbers of heterogeneous systems and large numbers of homogeneous systems. In the 2005 TREC [5] TB Track [6], the ARSC IR group used a variant of the logistic regression merging strategy [7], modified for efficiency, to merge results from a metasearch-style application that searched and merged results from two indexed copies of the GOV2 corpus [8]. One index was constructed with the Lucene Toolkit and the other index was constructed with the Amberfish application [9]. For the 2006 TREC [10] TB Track [11], the GOV2 corpus was partitioned into approximately 17,000 collections by grouping documents with identical URL host names [12]. Each query was searched against every collection, and the ranked results from each collection were merged using the logistic regression algorithm used in the 2005 TB track. The large number of document collections used in the 2006 experiment, coupled with a distribution of collection size (measured in number of documents contained) that spanned five orders of magnitude, introduced significant wall-clock, bandwidth, and IR performance problems. Follow-up work found that the bandwidth performance could be improved somewhat without sacrificing IR performance by using a simple relation based on collection size to truncate the result sets before merging [13].

The host-name partition of the GOV2 corpus used in 2006 was used again for a distributed IR simulation for the 2007 TREC [14] MQ track [15], but the full set of collections could no longer be searched for every query with the resources available, because the resources required to complete the track using a distributed collection approach increase as the product of the number of queries and the number of collections searched rather than just the number of queries alone. Consequently, only a small fraction of the host-name collections, comprising less than 20% of the documents in GOV2, was searched [16]. The collections were chosen according to historical performance in the previous TREC TB Tracks by ranking the collections in decreasing order of the total number of "relevant" relevance judgments (qrels) associated with their respective documents, and the top 100 collections were searched.

For the 2008 MQ track, the ARSC IR group ran several resource-selection experiments on the host-partitioned GOV2 collection. The purpose of each experiment was similar in motivation to the 2007 MQ experiment: quickly identify the collections most likely to be relevant to the queries, then search only the top 50 relevant collections, and finally merge the retrieved results into a single relevance-ranked list. The union of the largest 50 of the 996 host-name collections available in the experiment contains less than 10% of the total documents in GOV2, so each MQ topic search was restricted to about a tenth of the GOV2 collection. Each collection-ranking technique was based on a virtual "big document" representation of each collection relative to a fixed basis of terms. One of two document relevance-ranking algorithms was modified to relevance-rank the document collections or resources before sending the MQ track queries to the Multisearch application: the conventional vector space model (VSM) [17] or Latent Semantic Indexing (LSI) [18]. Prior work on resource selection techniques using variations of standard document retrieval algorithms applied to big-document representations of collections or resources is discussed in [2].
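The 2007-style selection step mentioned above (rank host-name collections by their historical count of "relevant" qrel lines and keep the top 100) can be illustrated with a minimal sketch. This is not the ARSC code: it assumes a standard four-column TREC qrels file (topic, iteration, docno, judgment) and a precomputed docno-to-host lookup table supplied by the caller, and the class and parameter names are illustrative only.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.*;
    import java.util.stream.Collectors;

    // Illustrative sketch: rank host-name collections by the number of "relevant"
    // qrel judgments associated with their documents and keep the top k.
    public class QrelCollectionRanker {

        // qrelsPath: TREC qrels lines of the form "topic iteration docno judgment".
        // docToHost: assumed lookup from GOV2 docno to URL host name (built elsewhere).
        public static List<String> topCollections(String qrelsPath,
                                                  Map<String, String> docToHost,
                                                  int k) throws Exception {
            Map<String, Integer> relevantPerHost = new HashMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader(qrelsPath))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.trim().split("\\s+");
                    if (f.length < 4) continue;
                    int judgment = Integer.parseInt(f[3]);
                    String host = docToHost.get(f[2]);
                    // Count only positive ("relevant") judgments for the host collection.
                    if (judgment > 0 && host != null) {
                        relevantPerHost.merge(host, 1, Integer::sum);
                    }
                }
            }
            // Sort hosts by decreasing relevant-judgment count and keep the top k.
            return relevantPerHost.entrySet().stream()
                    .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                    .limit(k)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
        }
    }

With k = 100 this reproduces the selection used for the 2007 MQ runs; the 2008 experiments replace this static, judgment-based ranking with the per-topic resource selection described in Section 2.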
2. System description

Our goal for the 2008 MQ track was to efficiently search a large number of document collections using the Multisearch distributed IR research system developed at ARSC. Each document collection provides a search service to the Multisearch portal. The bandwidth cost for each query is approximately proportional to the number of document collections available, and this cost is split between the transmission of the query to the document collections and the transmission of the response from each document collection. Therefore, reducing the number of collections searched for each query was important. A resource-selection (collection-selection) algorithm was developed to restrict each search to a small number of presumably relevant document collections.

Figure 1. Schematic of the ARSC Multisearch system. The resource selector is contained in the Multisearch front-end block and the host-name document collections are contained in the data blocks.

2.1 Multisearch

Multisearch 2.0 is coded entirely in Java with Apache Tomcat and OGSA-DAI WSI 2.2 as middleware. Tomcat is a servlet container and a necessary piece of architecture for OGSA-DAI. OGSA-DAI is middleware for grid services, and it was used because it enables quick, easy launching of literally hundreds of backend web services using Axis. OGSA-DAI also provides methods for searching secured information through its structures for grid credentials, although we did not use this capability.

Three different servers were used for the TREC runs. Snowy is a Sun x4500 machine with two dual-core Opteron processors at 2.6 GHz, 16 GB of memory, and 48 TB of disk space, and it runs Ubuntu Linux 7.10. Pileus is an ASL server with two hyperthreaded Xeon 3.6 GHz CPUs, 8 GB of memory, and 7 TB of disk space; it also runs Ubuntu 7.10. Finally, Nimbus is a Sun v20z with two Opteron processors at 2.2 GHz, 12 GB of memory, and 73 GB of disk space, running Fedora Core 9. Snowy hosted the majority of the back ends because of its disk space and speed: it ran 873 back-end services, all loaded from the OGSA-DAI architecture. Pileus hosted the remaining 121 back ends. Nimbus ran the front-end client, so all the back ends had to transfer data back and forth over the network.

A query is sent over the network to any number of backend services, as shown in Figure 1. Each service has at least one Data Service Resource (DSR), although some might have multiple resources available. A DSR provides access to the data within the service, as well as the appropriate methods for getting that data, including a credential system if necessary. In the case of Multisearch, each service was in fact its own search engine: it takes the query and passes it to the DSR, which runs the query against the data and compiles a result set to be returned to the requester.

After the result set is completed, it is sent back to the Multisearch front end over the network. There it is combined with the result sets from the other back ends, which requires a merge algorithm. In our experiments, we used Naïve Merge (or "log merge" in [12]) to bring results together. Essentially, Multisearch takes the separate result sets and attempts to merge them into one final, reasonable result set, which is then returned to the user.
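The front-end control flow just described (fan a query out to many back ends, then merge the returned result sets) can be sketched as follows. This is a simplified stand-in for the OGSA-DAI plumbing rather than the Multisearch source: the SearchBackend interface, the ScoredDoc type, the timeout, and the log-based rescaling used in place of the actual Naïve Merge implementation of [12] are all assumptions for illustration.

    import java.util.*;
    import java.util.concurrent.*;
    import java.util.stream.Collectors;

    // Simplified sketch of the Multisearch front-end flow: send one query to
    // many back ends in parallel, then merge the returned result sets.
    public class FrontEndSketch {

        // Minimal stand-ins for the service and result types (assumed).
        public interface SearchBackend {
            List<ScoredDoc> search(String query) throws Exception;
        }

        public static final class ScoredDoc {
            public final String docId;
            public final double score;
            public ScoredDoc(String docId, double score) { this.docId = docId; this.score = score; }
        }

        // Query every back end concurrently and merge whatever result sets come back.
        public static List<ScoredDoc> searchAll(String query, List<SearchBackend> backends)
                throws InterruptedException {
            ExecutorService pool =
                    Executors.newFixedThreadPool(Math.max(1, Math.min(backends.size(), 32)));
            List<Future<List<ScoredDoc>>> futures = new ArrayList<>();
            for (SearchBackend b : backends) {
                futures.add(pool.submit(() -> b.search(query)));
            }
            List<List<ScoredDoc>> resultSets = new ArrayList<>();
            for (Future<List<ScoredDoc>> f : futures) {
                try {
                    resultSets.add(f.get(30, TimeUnit.SECONDS));
                } catch (ExecutionException | TimeoutException e) {
                    // A slow or failed back end only costs recall; the merge still proceeds.
                }
            }
            pool.shutdown();
            return merge(resultSets);
        }

        // Placeholder merge: rescale each result set by its own top score on a log
        // scale so scores from different back ends are roughly comparable, then sort.
        static List<ScoredDoc> merge(List<List<ScoredDoc>> resultSets) {
            List<ScoredDoc> merged = new ArrayList<>();
            for (List<ScoredDoc> rs : resultSets) {
                if (rs == null || rs.isEmpty() || rs.get(0).score <= 0) continue;
                double top = rs.get(0).score;
                for (ScoredDoc d : rs) {
                    merged.add(new ScoredDoc(d.docId, Math.log(1.0 + d.score) / Math.log(1.0 + top)));
                }
            }
            return merged.stream()
                    .sorted(Comparator.comparingDouble((ScoredDoc d) -> d.score).reversed())
                    .collect(Collectors.toList());
        }
    }

In the deployed system each back end is an OGSA-DAI web service reached over the network rather than a local object, but the fan-out-then-merge control flow is the same.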
2.2 Resource selection

The resource selection component of Multisearch described here can relevance-rank many minimally cooperative document collection servers that provide a typical "search" interface accepting text queries. To build the data set used by the resource selection component, the number of document hits in each collection for each potential query term is required. This data could, at least in principle, be constructed by searching each document collection one basis term (defined below) at a time through the search interface provided by the document collection server, but in this experiment the indexed document collections were accessed directly to build the resource-selector data.

2.2.1 Vector representation of resources

The search resources, or back ends, in this experiment were Lucene indexes built from nearly 1,000 of the largest host-name document collections in the GOV2 corpus. Collection size is defined here as the number of unique documents contained in a document collection. The largest 1,000 host-name collections in GOV2 contain approximately 21 million documents out of a total of 25 million GOV2 documents. For each of the 10,000 topics in the 2008 MQ Track, collections
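To make the big-document representation concrete, the following minimal sketch ranks collections with the VSM variant of the idea: each collection is represented by a vector of document-hit counts over a fixed term basis, and collections are ranked by cosine similarity between that vector and the query vector. The extraction of the hit counts from the per-collection Lucene indexes is assumed to have been done separately, and the class, parameter, and method names are illustrative rather than taken from the Multisearch code.

    import java.util.*;
    import java.util.stream.Collectors;

    // Sketch of VSM-style resource selection over "big document" collection vectors.
    public class ResourceSelectorSketch {

        // termBasis: the fixed list of basis terms.
        // hitCounts: for each collection (host) name, the number of documents in
        //            that collection containing each basis term, in basis order.
        // queryTerms: the terms of one MQ topic, normalized to the same basis.
        public static List<String> rankCollections(List<String> termBasis,
                                                   Map<String, double[]> hitCounts,
                                                   Set<String> queryTerms,
                                                   int k) {
            // Build the query vector over the term basis (1 if the basis term occurs).
            double[] q = new double[termBasis.size()];
            for (int i = 0; i < termBasis.size(); i++) {
                q[i] = queryTerms.contains(termBasis.get(i)) ? 1.0 : 0.0;
            }
            // Keep the k collections whose big-document vectors are closest to the query.
            return hitCounts.entrySet().stream()
                    .sorted(Comparator.comparingDouble(
                            (Map.Entry<String, double[]> e) -> -cosine(q, e.getValue())))
                    .limit(k)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
        }

        static double cosine(double[] a, double[] b) {
            double dot = 0, na = 0, nb = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                na += a[i] * a[i];
                nb += b[i] * b[i];
            }
            return (na == 0 || nb == 0) ? 0.0 : dot / Math.sqrt(na * nb);
        }
    }

With k = 50 this corresponds to the "search only the top 50 collections" restriction described in Section 1; the LSI variant would project the same hit-count vectors onto a reduced basis before computing the similarities.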