On Scalable Information Retrieval Systems
Total Page:16
File Type:pdf, Size:1020Kb
On Scalable Information Retrieval Systems Ophir Frieder [email protected] www.ir.iit.edu 1 © Ophir Frieder 2002 “Scalable Search” Structured Semi-structured Answer Text, video, etc. Engine 2 © Ophir Frieder 2002 Scalable Information Systems: “Characteristics” n Ingest data from multiple sources – Duplicate document detection n Process multiple type data sources – Structured & unstructured data integration (SIRE) n Use scalable (parallel) technology systems – Parallel SIRE n Integrate retrieved data to yield answers – IIT Mediator 3 © Ophir Frieder 2002 Duplicate Document Detection n Union of data obtained from multiple sources often contains duplicates n Duplicates affect both retrieval effectiveness and retrieval efficiency n Duplicate detection is either syntactic or semantic, where semantic is far more challenging. 4 © Ophir Frieder 2002 What is a Duplicate Document? • Semantic Similarity If a document contains roughly the same semantic content it is a duplicate whether or not it is a precise syntactic match. 5 © Ophir Frieder 2002 Duplicate Detection Techniques Main duplicate detection approaches: – Hash based approaches (syntactic) – Information retrieval techniques – Resemblance S ( A ) Ç S ( B ) r ( A, B ) = S ( A ) È S ( B ) 6 © Ophir Frieder 2002 Duplicate Detection with IR n Using documents as queries, rank all documents in the collection with similar terms – Documents with equivalent weights are duplicates n For each query term, the corresponding posting list entries must be retrieved – for large collections, I/O costs are prohibitive 7 © Ophir Frieder 2002 Duplicate Detection with Resemblance n Calculate the resemblance of each document to every other document with matching features n Divide the document into shingles (X terms) used to create a unique hash n Calculate the resemblance based on hashes rather than terms n N2 comparison approaches not feasible for large collections n Optimizations, filter which shingles to use – E.g., every 25th shingle or a combination of multiple shingles 8 © Ophir Frieder 2002 Issues with Prior Approaches n Hash techniques not resilient to small changes in document representation. n IR techniques - slow for large collections. n Resemblance – documents are clustered into multiple clusters due to partitioning – duplicate classification is difficult. 9 © Ophir Frieder 2002 Combined (I-Match) Algorithm n Tokenize document n Create list of unique tokens n Filter tokens - What to filter? n Create a unique hash of remaining tokens n Search collection for duplicate hashes 10 © Ophir Frieder 2002 Filtration Based On Collection Statistics Hi & Low 25% Low 25% High 25% Mid 50% æ N ö 1. Sort according to idf = logç ÷ è n ø N = Number _ Of _ Documents _ In _ Collection n = Number _ Of _ Documents _Term _ Occurs _ In 2. Filter unwanted components 11 © Ophir Frieder 2002 LA Times Collection n Create random duplicates to test effectiveness. n For every ith word, pick a random number from one to ten. n If the number is higher than the random threshold (call it alpha) then pick a number from 1 to 3. n If the random number chosen is a one then remove the word. n If the number is a two then flip it with a word at position i+1. n If it is a three, add a word (randomly pick one from the term list). n Insert duplicate into the collection. 12 © Ophir Frieder 2002 Document Clusters Formed Document Resemblance Resemblance-Opt Combined LA 123190-0013 9 9 7 LA 123190-0022 6 9 2 LA 123190-0025 9 11 3 LA 123190-0037 10 11 1 LA 123190-0043 8 11 2 LA 123190-0053 10 9 2 LA 123190-0058 7 11 3 LA 123190-0073 6 11 3 LA 123190-0074 11 11 1 LA 123190-0080 9 11 9 Average 8.5 10.4 3.3 I-Match did not produce any false positives while Resemblance did. 13 © Ophir Frieder 2002 Processing Time – 2GB Algorithm MEAN Std Deviation Median Time Time Resemblance 31838.22 807.9 30862.5 Resemblance - Opt 24514.7 1042.1 24475.5 I-Match 3815.8 975.8 3598.8 Syntactic 65 N/A N/A 14 © Ophir Frieder 2002 Scalable Information Systems: “Characteristics” n Ingest data from multiple sources – Duplicate document detection n Process multiple type data sources – Structured & unstructured data integration (SIRE) n Use scalable (parallel) technology systems – Parallel SIRE n Integrate retrieved data to yield answers – IIT Mediator 15 © Ophir Frieder 2002 SIRE Goals § Integrate structured and semi-structured data using a framework that also integrates unstructured data. § Improve accuracy of retrieved results § Support scalability: – data volume – retrieval speeds § Support legacy data 16 © Ophir Frieder 2002 Portability The information retrieval prototype was implemented on the following relational platforms: – NCR Teradata DBC-machines – Microsoft SQL Server – Sybase – Oracle – IBM DB2 and SQL/DS 17 © Ophir Frieder 2002 Relational Inverted Index All inverted index entries <term> <list of documents> e.g., vehicle D1, D3, D4 results in: term docID vehicle D1 vehicle D3 vehicle D4 18 © Ophir Frieder 2002 Text Retrieval Conference (TREC) Sample Document <DOC> <DOCNO> AP881214-0028 </DOCNO> <FILEID>AP-NR-12-14-88 0117EST</FILEID> <FIRST>u i BC-Japan-Stocks 12-14 0027</FIRST> <SECOND>BC-Japan-Stocks,0026</SECOND> <HEAD>Stocks Up In Tokyo</HEAD> <DATELINE>TOKYO (AP) </DATELINE> <TEXT> The Nikkei Stock Average closed at 29,754.73 points up 156.92 points on the Tokyo Stock Exchange Wednesday. </TEXT> </DOC> 19 © Ophir Frieder 2002 Relational Document Representation (Term Processing) DOCUMENT docID docname headline dateline 28 AP881214-0028 Stocks Up In Tokyo TOKYO (AP) INDEX TERM docID termcnt term term df idf 28 1 nikkei average 2265 1.08 28 2 stock closed 2208 1.08 28 1 average exchange 2790 1.00 28 1 closed nikkei 234 2.07 28 2 points points 1627 1.23 28 1 up stock 2674 1.00 28 1 tokyo tokyo 725 1.58 28 1 exchange up 12746 0.30 28 1 wednesday wednesday 6417 0.60 20 © Ophir Frieder 2002 Simplistic Models: Keyword and Boolean Searches 21 © Ophir Frieder 2002 Relational Approach: Keyword Search Techniques § Keyword search select i.docID from INDEX i, QUERY q where i.term = q.term § Keyword search with stop word list select i.docID from INDEX i, QUERY q, STOPLIST s where (i.term = q.term) and (i.term <> s.term) 22 © Ophir Frieder 2002 Relational Approach: Boolean Search Techniques § OR query select docID select docID from INDEX from INDEX where term = term1 where term = term1 OR union term = term2 OR select docID term = term3 OR from INDEX .... where term = term2 term = termN union select docID from INDEX where term = term3 .... union select docID from INDEX where term = termN 23 © Ophir Frieder 2002 Relational Approach: Boolean Search Techniques § AND query select docID select docID from INDEX from INDEX a, INDEX b, INDEX c, ... INDEX N where term = term1 where a.term = term1 AND intersect b.term = term2 AND select docID c.term = term3 AND from INDEX .... where term = term2 n.term = termN AND intersect a.docID = b.docID AND select docID b.docID = c.docID AND from INDEX .... where term = term3 N-1.docID = N.docID .... intersect select docID from INDEX where term = termN 24 © Ophir Frieder 2002 Fixed Join-Count AND Queries Find all documents that contain all of the terms found in the QUERY relation: select i.docID from INDEX i, QUERY q where i.term = q.term group by i.docID having count (distinct (i.term)) = select count(*) from QUERY 25 © Ophir Frieder 2002 TAND Queries Find all documents that contain at least X of the terms found in the QUERY relation: select i.docID from INDEX i, QUERY q where i.term = q.term group by i.docID having count (distinct (i.term)) >= X 26 © Ophir Frieder 2002 Relevance Ranking: Vector Space & Probabilistic Models 27 © Ophir Frieder 2002 Vector Space Model § Term Frequency (tfik): – number of occurrences of term tk in document i § Document Frequency (dfj ): – number of documents which contain tj § Inverse Document Frequency (idfj): – log(d/dfj) where d is the total number of documents § Notes: – idf is a measure of uniqueness of a term across the collection – tf is the frequency of a term in a given document 28 © Ophir Frieder 2002 Vector Space Model: Sample Relational Query List all documents in the order of their similarity coefficient where the coefficient is computed using the dot product. SELECT d.docID, d.docname, SUM(i.termcnt * t.idf * q.termcnt * t.idf) FROM DOCUMENT d, QUERY q, INDEX i, TERM t WHERE q.term = i.term AND q.term = t.term AND d.docID = i.docID GROUP BY d.docID, d.docname ORDER BY 3 DESC 29 © Ophir Frieder 2002 Similarity Coefficients § Several similarity coefficients based on the query vector X and the document vector Y are defined: t Inner Prod uct å xi × yi i=1 t å xiyi Cosine Coefficient i=1 t t 2 2 å xi · å yi i=1 i=1 30 © Ophir Frieder 2002 SQL for “Probabilistic” Similarity Measure num _ terms æ (numdocs-dfi )+.5ö æ 2.2*tfid ö å logç ÷*ç ÷*qtf i=1 è (dfi +.5) ø è.3+(.75*doclength/ avgdoclength)+tfid ø SELECT d.docID, d.docname, SUM( LOG(((NumDocs - t.df) + 0.5) / (t.df + 0.5)) * ((2.2*i.tf) / (.3 + ((.75 * d.DocLen)/AvgDocLen) + i.tf)) * q.termcnt ) FROM INDEX i, TERM t, DOCUMENT d, QUERY q WHERE i.term = t.term AND i.docID = d.docID AND t.term = q.term GROUP BY d.docID, d.docname ORDER BY 3; 31 © Ophir Frieder 2002 Relational Document Representation (Term Processing) DOCUMENT docID docname headline dateline 28 AP881214-0028 Stocks Up In Tokyo TOKYO (AP) INDEX TERM docID termcnt term term df idf 28 1 nikkei average 2265 1.08 28 2 stock closed 2208 1.08 28 1 average exchange 2790 1.00 28 1 closed nikkei 234 2.07 28 2 points points 1627 1.23 28 1 up stock 2674 1.00 28 1 tokyo tokyo 725 1.58 28