From Federated to Aggregated Search
Fernando Diaz, Mounia Lalmas and Milad Shokouhi

Outline
- Introduction and Terminology
- Architecture
- Resource Representation
- Resource Selection
- Result Presentation
- Evaluation
- Open Problems
- Bibliography

Introduction
- What is federated search?
- What is aggregated search?
- Motivations
- Challenges
- Relationships

A classical example of federated search
[Figure: www.theeuropeanlibrary.org, showing one query and the collections to be searched]

A classical example of federated search
[Figure: www.theeuropeanlibrary.org, showing the merged list of results]

Motivation for federated search
- Search a number of independent collections, with a focus on hidden web collections
- Collections are not easily crawlable (and often should not be crawled)
- Access to up-to-date information and data
- Parallel search over several collections
- Effective tool for enterprise and digital library environments

Challenges for federated search
- How to represent collections, so that we know what documents each contains?
- How to select the collection(s) to be searched for relevant documents?
- How to merge results retrieved from several collections, to return one list of results to the user?
- Cooperative environment
- Uncooperative environment

From federated search to aggregated search
- “Federated search on the web”
- A peer-to-peer network connects distributed peers (usually for file sharing), where each peer can be both server and client
- A metasearch engine combines the results of different search engines into a single result list
- Vertical search, also known as aggregated search, adds the top-ranked results from relevant verticals (e.g.
  images, videos, maps) to typical web search results

A classical example of aggregated search
[Figure: an aggregated result page with labelled blocks: Structured Data, News, Homepage, Wikipedia, Real-time results, Video, Twitter]

Motivation for aggregated search
- Increasingly different types of information are available, sought and relevant
  - e.g. news, image, wiki, video, audio, blog, map, tweet
- Search engines allow access to these through so-called verticals
- Two “ways” to search
  - Users can directly search the verticals
  - Or rely on so-called aggregated search

Google universal search 2007: [ … ] search across all its content sources, compare and rank all the information in real time, and deliver a single, integrated set of search results [ … ] will incorporate information from a variety of previously separate sources – including videos, images, news, maps, books, and websites – into a single set of results.
http://www.google.com/intl/en/press/pressrel/universalsearch_20070516.html

Motivation for aggregated search
[Figure: 25K editorially classified queries (Arguello et al, 09)]

Challenges in aggregated search
- Extremely heterogeneous collections
- What is/are the vertical intent(s)?
  - Handling ambiguous (query | vertical) intent
  - Handling non-stationary intent (e.g. news, local)
- How many results to return from each vertical, and where to position them in the result page?
  - Slotting results
  - Users looking at the 1st result page
- Page optimization and its evaluation

Ambiguous non-stationary intent
- Query: Travel, Mollusk, Paul
- Vertical: Wikipedia, News, Image

Recap: Introduction
                            federated search    aggregated search
heterogeneity               low                 high
scale (documents, users)    small               large
user feedback               little              a lot

Terminology
1. federated search, distributed information retrieval, data fusion, aggregated search, universal search, peer-to-peer network
2. resource, vertical, database, collection, source, server, domain, genre
3.
   merging, blending, fusion, aggregation, slotted, tiled

Problem definition
Present the “querier” with a summary of search results from one or more resources.

General architecture
[Figure: the user sends a raw query to a search interface/portal/broker, which sends a query to each of several sources/servers/verticals]

Peer-to-peer network
[Figure: peers and a directory server]

Peer to Peer (P2P) networks
- Broker-based
  - Single centralized broker with document lists shared from peers (e.g. Napster, original version)
- Decentralized
  - Each peer acts as both client and server (e.g. Gnutella v0.4)
- Structure-based
  - Use distributed hash tables (DHT) (e.g. Chord (Stoica et al, 03))
- Hierarchical
  - Use local directory services for routing and merging (e.g. Swapper.NET)

Federated search
[Figure: a query goes to a broker holding Sum A to Sum E; the broker sends the query to Collection A to Collection E and returns merged results]

Federated search
- Also known as a distributed information retrieval (DIR) system
- Provides one portal for searching information from multiple sources
  - corporate intranets, fee-based databases, library catalogues, internet resources, user-specific digital storage
- Funnelback, Westlaw, FedStats, Cheshire, etc (see also http://federatedsearchblog.com/)
http://funnelback.com/pdfs/brochures/enterprise.pdf

Metasearch
[Figure: the user sends a raw query to a metasearch engine, which sends queries to several search engines over the WWW]

Metasearch
- A search engine that queries several different search engines and either combines their results (blended) or displays them separately (non-blended)
- Does not crawl the web but relies on data gathered by other search engines
- Dogpile, Metacrawler, Search.com, etc (see http://www.cryer.co.uk/resources/searchengines/meta.htm)

Aggregated search
[Figure: the user issues the query “Angelina Jolie”; queries go to several verticals and to the WWW index (text), and the results are returned to the user]

Aggregated search
- Specific to a web search engine
- “Increasingly” more than one type of information relevant to an
  information need
  - mostly web page + image, map, blog, etc
- These types of information are indexed and ranked using dedicated approaches (verticals)
- Presenting the results from verticals in an aggregated way is believed to be more useful
- All major search engines are doing some level of aggregated search

Data fusion
[Figure: a query is run over one document collection (GOV2) using different retrieval models (BM25, KL, Inquery) and different document representations (anchor only, title only); merging produces one ranked list of results (e.g. Voorhees et al, 95)]

Data fusion
- Search one collection
- Documents can be indexed in different ways
  - Title index, abstract index, etc (poly-representation)
  - Weighting scheme
- Different retrieval models
- Rankings generated by different retrieval models (or different document representations) are merged to produce the final ranking
- Has often been shown to improve retrieval performance (TREC)

Terminology - Resource
- Source
- Server
- Database
- Collection (federated search)
- Vertical (aggregated search)
- Domain
- Genre

Terminology - Aggregation
- Merging
- Blending
- Fusion
- Slotted
- Tiled

Aggregated search (tiled)
[Figure: http://au.alpha.yahoo.com/]

Aggregated search (tiled)
[Figure: Naver.com]

Aggregated search (slotted)
[Figure]

Others
- Clustering
- Faceted search
- Multi-document summarization
- Document generation
- Entity search
(see special issue, in press, on “Current research in focused retrieval and result aggregation”, Journal of Information Retrieval (Trotman et al, 10))

Yippy, clustering search engine from Vivisimo
[Figure: clusty.com]

Faceted search
[Figure]

Multi-document summarization
[Figure: http://newsblaster.cs.columbia.edu/]

“Fictitious” document generation (Paris et al, 10)
[Figure]

Entity search
[Figure: http://sandbox.yahoo.com/Correlator]

Recap
- Showed the relations between federated search, aggregated search, and others
- Exposed the various terminologies used
- In the rest of the tutorial, we concentrate on federated search and aggregated search
- Focus is on “effective search”

Outline
- Introduction and Terminology
- Architecture
- Resource Representation
- Resource Selection
- Result Presentation
- Evaluation
- Open Problems
- Bibliography

Architecture: what are the general components of federated and aggregated search systems.

Federated search architecture
[Figure]

Aggregated search architecture
- Pre-retrieval aggregation: decide verticals before seeing results
- Post-retrieval aggregation: decide verticals after seeing results
- Pre-web aggregation: decide verticals before seeing web results
- Post-web aggregation: decide verticals after seeing web results

[Figure: post-retrieval, pre-web]

[Figure: pre and post-retrieval, pre-web]

Outline
- Introduction and Terminology
- Architecture
- Resource Representation
- Resource Selection
- Result Presentation
- Evaluation
- Open Problems
- Bibliography

Resource representation: how to represent resources, so that we know what documents each contain.

Resource representation in federated search
(Also known as resource summary/description)

Resource representation
- Cooperative environments
  - Comprehensive term statistics
  - Collection size information
- Uncooperative environments
  - Query-based sampling
  - Collection size estimation

Resource representation (cooperative environments)
- STARTS Protocol (Gravano et al, 97)
  - Source metadata
  - Rich query language

Resource representation (cooperative environments)
- Different types of term statistics (Callan et al, 95; Gravano et al, 94a,b,99; Meng et al, 01; Yuwono and Lee, 97; Xu and Callan, 98; Zobel, 97)
- Anchor-text
  - HARP (Hawking and Thomas, 05)

Resource representation (uncooperative environments)
- Query-based sampling (Callan and Connell, 01)
  - Select a query, probe collection
  - Download the top n documents
  - Select the next query, repeat
[Figure: a query selector issues a query and collects the sampled documents]

Resource representation (uncooperative environments)
- Query selector
  - (Callan and Connell,