Federated Search

Jaime Arguello, INLS 509, November 21, 2016

Up to this point...

• Classic information retrieval
‣ search from a single centralized index
‣ all queries processed the same way

• Federated search
‣ search across multiple distributed collections
‣ a.k.a. resources, search engines, search services, etc.

Motivation

• Some content cannot be crawled and centrally indexed (exposed only via a search interface)
‣ also referred to as "the hidden web"
• Even if crawl-able, we may prefer searchable access to this content via the third party. Why?
‣ content updated locally
‣ unique document representation (e.g., metadata)
‣ customized retrieval

Federated Search Examples (World Wide Science)

• Exhaustive search (across all collections)

[Screenshots: World Wide Science search results across many collections; most results come from a few collections!]

Federated Search Examples (Vertical Aggregation in Web Search)

[Screenshot: a web search results page aggregating verticals (maps, web, images, books).]

Federated Search

• Exhaustive search

[Diagram: a query goes through a portal interface to collections C1...Cn; their results are combined into a single merged ranking.]

Federated Search

• Some collections do not retrieve any results
• Most of the top results come from a few collections

Federated Search

• We can often satisfy the user with results from only a few collections
• Why?

The Cluster Hypothesis (van Rijsbergen, 1979)

• Similar documents are relevant to similar information needs
‣ used in cluster-based retrieval
‣ document score normalization
‣ pseudo-relevance feedback
‣ federated search

Federated Search

• Objective: given a query, predict which few collections have relevant documents and combine their results into a single document ranking


Federated Search

• Resource representation (off-line)
• Resource selection (at query-time)
• Results merging (at query-time)

Federated Search: Resource Representation

• Gathering information about what each resource contains
• What types of information needs does each resource satisfy?

Federated Search: Resource Selection

• Deciding which few resources to search given a user's query

Federated Search: Results Merging

• Combining their results into a single output (merged ranking)


Cooperative vs. Uncooperative

• Cooperative environment
‣ assumption: resources provide accurate and complete information to facilitate selection and merging
‣ centrally designed protocols and APIs
• Uncooperative environment
‣ assumption: resources provide no special support for federated search
‣ only a search interface
• Different environments require different solutions

Resource Representation

• Objective: to gather information about what each resource contains
‣ but, ultimately, to inform resource selection
• Discussion: what sources of evidence could we use to do this?

Resource Representation Using Content

• Term frequencies: selection based on query-collection similarity
• A set of "typical" docs: selection based on the predicted relevance of sampled documents

Resource Representation Using Manually-Issued Queries

• Manually-issued queries: selection based on query-query similarity

Resource Representation Using Previous Retrievals

• Automatically-issued queries: selection based on query-query similarity

Resource Representation Using Content

• Problem: in an uncooperative environment, resources provide only a search interface
• Term frequencies: selection based on query-collection similarity
• A set of 'typical' docs: selection based on the predicted relevance of sampled documents

Query-based Sampling (Callan and Connell, 2001)

• Repeat N times (e.g., N = 100), as sketched below:
1. submit a query to the search engine
2. download a few results (e.g., 4)
3. update the collection representation (e.g., term frequencies)
4. select a new query for sampling (e.g., from the emerging representation)
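A minimal Python sketch of this sampling loop, assuming a hypothetical search_engine.search(query, k) call that returns objects with doc_id and terms fields; these names are illustrative, not part of Callan and Connell's implementation:

import random
from collections import Counter

def query_based_sampling(search_engine, seed_query, n_queries=100, docs_per_query=4):
    # Emerging term-frequency representation of the (uncooperative) resource
    term_counts = Counter()
    seen_doc_ids = set()
    query = seed_query
    for _ in range(n_queries):
        # 1. submit a query; 2. download a few results
        for doc in search_engine.search(query, k=docs_per_query):
            if doc.doc_id in seen_doc_ids:
                continue
            seen_doc_ids.add(doc.doc_id)
            # 3. update the collection representation (term frequencies)
            term_counts.update(doc.terms)
        # 4. select the next probe query from the emerging representation
        query = random.choice(list(term_counts)) if term_counts else seed_query
    return term_counts, seen_doc_ids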

Query-based Sampling

• Discussion: suppose we want to represent resources using term-frequency information; how many samples do we need?
• Hint: Zipf's law states that the number of new terms seen in each additional document decreases exponentially

Query-based Sampling (Callan and Connell, 2001)

[Figure: (a) ctf ratio and (b) Spearman rank correlation vs. number of documents examined (0-500), for CACM (~3K docs), WSJ88 (~40K docs), and TREC-123 (~1M docs). ctf ratio: % of collection term occurrences "covered" by the observed terms.]

• After 500 docs we've seen enough vocabulary to account for about 80-90% of all term occurrences

Query-based Sampling (Callan and Connell, 2001)

[Figure repeated: Spearman rank correlation between the sampled and actual term orderings vs. number of documents examined, for CACM, WSJ88, and TREC-123.]

• The ordering of terms (by frequency) based on sample-set statistics approximates the actual ordering

Query-based Sampling Extensions

• Adaptive sampling: sample until the rate of unseen terms decreases below a threshold (Shokouhi et al., 2006)
‣ slight improvement
• Sampling using (popular) query-log queries
‣ query-log (Shokouhi et al., 2007), resource-specific query-log (Arguello et al., 2009)
• Re-sampling to avoid stale representations
‣ re-sampling according to collection size is a good heuristic (Shokouhi et al., 2007b)

Resource Selection

• Objective: deciding which resources to search given a user's query
• Most prior work casts the problem as resource ranking
‣ given a query, select the k ≪ n collections that produce good merged results
‣ k is given (an interesting research problem)

Resource Selection

• Content-based methods: score resources based on the similarity between the query and content from the resource
‣ large vs. small document models
• Query-similarity methods: score resources based on the effectiveness of previously issued queries that are similar to the query (covered at a high level)


Large Document Models

• Represent each resource (or its samples) as a single "large document" in a large-document index

Large Document Models

1. Given the query, rank the "large documents" using functions adapted from document retrieval
2. Select the top k

Large Document Models

• CORI (Callan et al., 1995):

\mathrm{CORI}_w(C_i) = b + (1 - b) \times \frac{df_{w,i}}{df_{w,i} + 50 + 150 \times \frac{col\_len_i}{avg\_col\_len}} \times \frac{\log\left(\frac{|C| + 0.5}{cf_w}\right)}{\log(|C| + 1.0)}

• adapted from the BM25-style document-scoring function:

P(w|d) = b + (1 - b) \times \frac{tf_w}{tf_w + 0.5 + 1.5 \times \frac{doc\_len}{avg\_doc\_len}} \times \frac{\log\left(\frac{N + 0.5}{df_w}\right)}{\log(N + 1.0)}
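A small Python sketch of CORI-style collection scoring under the formula above; the statistics dictionaries (per-collection df and length, corpus-wide cf, |C|, average collection length) are assumed to come from the representation step, and all names are illustrative:

import math

def cori_score(query_terms, coll, corpus, b=0.4):
    # coll: {'df': {term: df_w_i}, 'length': collection length in terms}
    # corpus: {'cf': {term: number of collections containing the term},
    #          'num_collections': |C|, 'avg_length': average collection length}
    score = 0.0
    for w in query_terms:
        df = coll['df'].get(w, 0)
        cf = corpus['cf'].get(w, 0)
        if df == 0 or cf == 0:
            continue  # skip terms absent from this collection's representation
        T = df / (df + 50 + 150 * coll['length'] / corpus['avg_length'])
        I = math.log((corpus['num_collections'] + 0.5) / cf) / math.log(corpus['num_collections'] + 1.0)
        # Summing per-term beliefs; averaging over |q| would give the same ranking
        score += b + (1 - b) * T * I
    return score

# Rank collections by cori_score(...) and select the top k.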

Large Document Models

• KL-Divergence (Xu and Croft, 1999):

KL_q(C_i) = \sum_{w \in q} P(w|q) \log \frac{P(w|q)}{P(w|C_i)}

• Query Likelihood (Si et al., 2002):

P(q|C_i) = \prod_{w \in q} \left[ \lambda P(w|C_i) + (1 - \lambda) P(w|G) \right]
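A Python sketch of the query-likelihood variant (in log space), assuming per-collection term counts from sampling and a background model G built from all samples; the smoothing parameter λ and the data layout are assumptions:

import math

def log_query_likelihood(query_terms, coll_counts, background_counts, lam=0.5):
    # coll_counts / background_counts: term -> frequency in the collection samples / all samples
    coll_total = sum(coll_counts.values()) or 1
    bg_total = sum(background_counts.values()) or 1
    log_p = 0.0
    for w in query_terms:
        p_coll = coll_counts.get(w, 0) / coll_total      # P(w | C_i)
        p_bg = background_counts.get(w, 0) / bg_total    # P(w | G)
        log_p += math.log(lam * p_coll + (1 - lam) * p_bg + 1e-12)  # epsilon guards log(0)
    return log_p

# Rank collections by log_query_likelihood(...) and select the top k.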

Large Document Models

• Discussion: potential limitations?


Small Document Models: ReDDE (Si and Callan, 2003)

• Combine the samples from all resources (∪_{i=1}^{n} S_i) into a centralized sample index, keeping track of which collection each sample came from

• Given a query, conduct a retrieval from the centralized sample index
• Use a rank-based threshold τ to predict a set of relevant samples

• Assume that each relevant sample represents some number of relevant documents in its original collection
• "Scale up" the sample retrieval using

\mathrm{scale\_factor}(C_i) = \frac{|C_i|}{|S_i|}

1. Score collections by their estimated number of relevant documents
2. Select the top k (see the sketch below)
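A minimal Python sketch of ReDDE-style scoring, assuming the centralized-sample-index retrieval returns (doc_id, collection_id) pairs in rank order and that collection and sample sizes are known from the representation step; the rank threshold and names are illustrative:

from collections import defaultdict

def redde_scores(csi_ranking, coll_sizes, sample_sizes, threshold=100):
    # csi_ranking: [(doc_id, collection_id), ...] best-first from the centralized sample index
    # coll_sizes / sample_sizes: collection_id -> |C_i| / |S_i|
    scores = defaultdict(float)
    # Each sample ranked above the threshold is treated as relevant and votes
    # for its collection, scaled up by |C_i| / |S_i|.
    for _doc_id, cid in csi_ranking[:threshold]:
        scores[cid] += coll_sizes[cid] / sample_sizes[cid]
    return scores

# Select the k collections with the largest estimated number of relevant documents:
# top_k = sorted(scores, key=scores.get, reverse=True)[:k]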

Small Document Models: ReDDE Variants

• ReDDE can be viewed as a voting method: each (predicted) relevant sample is a vote for its collection
• Discussion: potential limitations?
‣ sensitivity to the threshold parameter: samples that are more relevant (i.e., ranked higher) should get more votes (Shokouhi, 2007; Thomas, 2009)
‣ a resource may not retrieve its relevant documents: samples from resources predicted to be more reliable should get more votes (Si and Callan, 2004)
• No ReDDE variant outperforms another across all experimental testbeds

Resource Selection: ReDDE vs. CORI

• ReDDE wins: it never does worse and often does better
• ReDDE outperforms CORI when the collection size distribution is skewed
‣ CORI is biased towards small, topically-focused collections
‣ favors collections that are proportionately relevant
‣ misses large collections with many relevant documents

Resource Selection: Content-Based Methods

• Large document models and small document models: resource relevance as a function of content relevance

Resource Selection: Query-Similarity Methods

• Key assumption: similar queries retrieve similar results
• Select resources based on their expected retrieval effectiveness for the given query
• Requires two components:
1. retrieval effectiveness: a way to determine that a previously seen query produced an effective retrieval from the resource
2. query similarity: a way to predict that a new (unseen) query will retrieve similar results from the resource

Query-Similarity Methods (Voorhees et al., 1995)

• Training phase: did the resource retrieve relevant documents?
• e.g., use human relevance judgements

Query-Similarity Methods (Arguello et al., 2008)

• Training phase: did the resource retrieve relevant documents?
• e.g., use retrievals that merge content from every resource (an exhaustive merge)

Query-Similarity Methods

• Training phase: did the resource retrieve relevant documents for each training query?
• Test phase: were the most similar training queries effective on the resource? (sketched below)

[Diagram: training queries associated with each resource (C1, C3, ...); a new test query is compared against them.]
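One simple way to operationalize both components is nearest-neighbor voting over the training queries. The Python sketch below uses cosine similarity over bag-of-words queries and a similarity-weighted average of the neighbors' effectiveness; these choices are illustrative assumptions, not the exact methods of the cited papers:

import math
from collections import Counter

def cosine(q1, q2):
    # Cosine similarity between two queries represented as lists of terms
    c1, c2 = Counter(q1), Counter(q2)
    dot = sum(c1[t] * c2[t] for t in c1)
    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

def expected_effectiveness(test_query, training_data, k=5):
    # training_data: [(query_terms, effectiveness), ...] for ONE resource, where
    # effectiveness is e.g. the precision of that training query's retrieval there
    neighbors = sorted(training_data, key=lambda qe: cosine(test_query, qe[0]), reverse=True)[:k]
    weights = [cosine(test_query, q) for q, _ in neighbors]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * eff for w, (_, eff) in zip(weights, neighbors)) / total

# Score each candidate resource with expected_effectiveness(...) and select the top k resources.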

Results Merging

• Combining the results from multiple resources (i.e., those selected) into a single merged ranking

Results Merging

• Assumption: an interleaving of documents is a suitable presentation of results

Results Merging: Naive Interleaving

• Merge results heuristically (e.g., round robin); a minimal sketch follows
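A round-robin merge is a few lines of Python; this sketch assumes each selected resource contributes a ranked list of result identifiers:

def round_robin_merge(result_lists):
    # Naive interleaving: rank 1 from each resource, then rank 2 from each, etc.,
    # skipping resources whose result lists are exhausted.
    merged, rank = [], 0
    while any(rank < len(results) for results in result_lists):
        for results in result_lists:
            if rank < len(results):
                merged.append(results[rank])
        rank += 1
    return merged

# round_robin_merge([["a1", "a2"], ["b1"], ["c1", "c2", "c3"]])
# -> ["a1", "b1", "c1", "a2", "c2", "c3"]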

Results Merging: Naive Interleaving

• Problem: rank 7 from one collection (e.g., C1) may be more relevant than rank 3 from another. Why?
• What other option do we have?

Results Merging: Score Normalization

• Scores from different resources are not comparable
• Transform resource-specific scores into resource-general scores

Results Merging: CORI-Merge (Callan et al., 1995)

• Combine resource ranking and document ranking scores

S_C(D) = \frac{S'_i(D) + 0.4 \times S'_i(D) \times S'(C_i)}{1.4}

S'_i(D) = \frac{S_i(D) - S_i(D_{min})}{S_i(D_{max}) - S_i(D_{min})}

S'(C_i) = \frac{S(C_i) - S(C_{min})}{S(C_{max}) - S(C_{min})}
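A Python sketch of CORI-merge scoring; it assumes we already have, per document, its raw score plus the min/max scores in its resource's result list, and per resource its selection score plus the min/max over the selected resources (names are illustrative):

def min_max(x, lo, hi):
    # Map a score into [0, 1]; fall back to 0.5 if all scores are identical (an assumption)
    return (x - lo) / (hi - lo) if hi > lo else 0.5

def cori_merge_score(doc_score, doc_min, doc_max, res_score, res_min, res_max):
    d = min_max(doc_score, doc_min, doc_max)   # S'_i(D)
    c = min_max(res_score, res_min, res_max)   # S'(C_i)
    return (d + 0.4 * d * c) / 1.4             # S_C(D)

# Merge by sorting every returned document by cori_merge_score(...) in descending order.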

Results Merging: SSL (Si and Callan, 2003)

[Diagram: samples from each resource are combined into a centralized sample index.]

• Given the query, retrieve from the centralized sample index, obtaining scores S_C(q, d)
• Assumption: centralized sample index scores are directly comparable
‣ same ranking/scoring algorithm
‣ same IDF values
‣ same document-length normalization

• Each selected collection C_i also returns its own results with resource-specific scores S_i(q, d)
• Objective: given a query, transform C_1's scores into values that are comparable across collections

• Step 1: identify the overlap documents (retrieved both from C_1 and from the centralized sample index)

• Step 2: use these pairs of document scores to learn a linear transformation from C_1 scores to CSI scores

• This is standard linear regression (query- and collection-specific), fit over the overlap documents:

S_C(q, d) = a \times S_i(q, d) + b

\arg\min_{a, b} \sum_{d \in \mathrm{overlap}} \big( a \times S_i(q, d) + b - S_C(q, d) \big)^2
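A compact Python sketch of SSL-style merging; it fits the per-query, per-collection least-squares line on the overlap documents and then maps every document from that collection onto the centralized-sample-index score scale (the data layout and the minimum-overlap check are assumptions):

def fit_line(pairs):
    # Least-squares fit of csi_score ≈ a * collection_score + b over the overlap documents
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    if var_x == 0:
        return 0.0, mean_y                      # degenerate case: constant collection scores
    a = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / var_x
    return a, mean_y - a * mean_x

def ssl_merge(collection_results, csi_scores):
    # collection_results: collection_id -> [(doc_id, S_i(q, d)), ...]
    # csi_scores: doc_id -> S_C(q, d) for documents retrieved from the sample index
    merged = []
    for _cid, results in collection_results.items():
        overlap = [(score, csi_scores[doc]) for doc, score in results if doc in csi_scores]
        if len(overlap) < 2:
            continue                            # not enough overlap to fit a line
        a, b = fit_line(overlap)
        merged.extend((doc, a * score + b) for doc, score in results)
    return sorted(merged, key=lambda pair: pair[1], reverse=True)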

Federated Search Summary

• QBS produces effective collection representations
‣ ~500 docs are enough; doesn't require cooperation
• Small document models > large document models
‣ but both assume an effective retrieval
• Query-similarity methods avoid this by modeling the expected retrieval using previous retrievals
‣ but they require training data... or do they?
• Centralized sample index scores are "resource-general"
‣ learn a regression model to re-score and merge

Vertical Aggregation

[Screenshot: a web search results page aggregating verticals (maps, web, images, books).]

References

J. Arguello, F. Diaz, and J. Callan. (2009). Sources of evidence for vertical selection. In SIGIR.

J. Callan, Z. Lu, and W.B. Croft. (1995). Searching distributed collections with inference networks. In SIGIR.

J. Callan and M. Connell. (2001). Query-based sampling of text databases. In TOIS.

L. Gravano, C. Chang, H. Garcia-Molina, and A. Paepcke. (1997). STARTS: Stanford proposal for Internet meta-searching. In SIGMOD.

L. Si and J. Callan. (2003). Relevant document distribution estimation method for resource selection. In SIGIR.

L. Si, R. Jin, J. Callan, and P. Ogilvie. (2002). Language modeling framework for resource selection and results merging. In CIKM.

M. Shokouhi. (2007). Central rank-based collection selection in uncooperative distributed information retrieval. In ECIR.

M. Shokouhi, M. Baillie, and L. Azzopardi. (2007). Updating collection representations for federated search. In SIGIR.

P. Thomas and M. Shokouhi. (2009). SUSHI: Scoring scaled samples for server selection. In SIGIR.

J. Xu and W. B. Croft. (1999). Cluster-based language models for distributed retrieval. In SIGIR.
