On Scalable Information Retrieval Systems

Ophir Frieder
www.ir.iit.edu
© Ophir Frieder 2002

“Scalable Search”
[Figure: diagram with the labels Structured, Semi-structured, Text, video, etc., Engine, and Answer]

Scalable Information Systems: “Characteristics”
• Ingest data from multiple sources
  – Duplicate document detection
• Process multiple type data sources
  – Structured & unstructured data integration (SIRE)
• Use scalable (parallel) technology systems
  – Parallel SIRE
• Integrate retrieved data to yield answers
  – IIT Mediator

Duplicate Document Detection
• Union of data obtained from multiple sources often contains duplicates
• Duplicates affect both retrieval effectiveness and retrieval efficiency
• Duplicate detection is either syntactic or semantic, where semantic is far more challenging.

What is a Duplicate Document?
• Semantic Similarity
  – If a document contains roughly the same semantic content it is a duplicate whether or not it is a precise syntactic match.
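The gap between a precise syntactic match and "roughly the same content" can be made concrete with a small sketch. The helper names and the two sample strings are hypothetical, and the token-overlap score is only a crude stand-in used for illustration: it does not measure semantic similarity and is not a method taken from the slides.

```python
import hashlib


def syntactic_fingerprint(text: str) -> str:
    """Hash of the raw text: equal only for byte-identical documents."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase token sets (illustrative proxy only)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Two hypothetical documents that differ only by a trailing period.
doc_a = "Stocks up in Tokyo as the Nikkei average closed higher"
doc_b = "Stocks up in Tokyo as the Nikkei average closed higher."

# The exact-match test fails even though the content is essentially the same.
same_hash = syntactic_fingerprint(doc_a) == syntactic_fingerprint(doc_b)
overlap = token_overlap(doc_a, doc_b)
```

Here the fingerprints differ while most tokens are shared, which is the situation where a purely syntactic test misses a duplicate.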
Duplicate Detection Techniques
Main duplicate detection approaches:
  – Hash based approaches (syntactic)
  – Information retrieval techniques
  – Resemblance

$$r(A, B) = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|}$$

Duplicate Detection with IR
• Using documents as queries, rank all documents in the collection with similar terms
  – Documents with equivalent weights are duplicates
• For each query term, the corresponding posting list entries must be retrieved
  – for large collections, I/O costs are prohibitive

Duplicate Detection with Resemblance
• Calculate the resemblance of each document to every other document with matching features
• Divide the document into shingles (X terms) used to create a unique hash
• Calculate the resemblance based on hashes rather than terms
• N^2 comparison approaches not feasible for large collections
• Optimizations, filter which shingles to use
  – E.g., every 25th shingle or a combination of multiple shingles

Issues with Prior Approaches
• Hash techniques not resilient to small changes in document representation.
• IR techniques: slow for large collections.
• Resemblance: documents are clustered into multiple clusters due to partitioning; duplicate classification is difficult.

Combined (I-Match) Algorithm
• Tokenize document
• Create list of unique tokens
• Filter tokens (what to filter?)
• Create a unique hash of remaining tokens
• Search collection for duplicate hashes

Filtration Based On Collection Statistics
[Figure: filtration ranges labeled Hi & Low 25%, Low 25%, High 25%, Mid 50%]
1. Sort according to idf:

$$idf = \log\left(\frac{N}{n}\right)$$

   N = number of documents in the collection
   n = number of documents the term occurs in
2. Filter unwanted components

LA Times Collection
• Create random duplicates to test effectiveness.
• For every ith word, pick a random number from one to ten.
• If the number is higher than the random threshold (call it alpha) then pick a number from 1 to 3.
• If the random number chosen is a one then remove the word.
• If the number is a two then flip it with a word at position i+1.
• If it is a three, add a word (randomly pick one from the term list).
• Insert duplicate into the collection.

Document Clusters Formed

Document          Resemblance   Resemblance-Opt   Combined
LA 123190-0013    9             9                 7
LA 123190-0022    6             9                 2
LA 123190-0025    9             11                3
LA 123190-0037    10            11                1
LA 123190-0043    8             11                2
LA 123190-0053    10            9                 2
LA 123190-0058    7             11                3
LA 123190-0073    6             11                3
LA 123190-0074    11            11                1
LA 123190-0080    9             11                9
Average           8.5           10.4              3.3

I-Match did not produce any false positives while Resemblance did.

Processing Time – 2GB

Algorithm           Mean Time   Std Deviation   Median Time
Resemblance         31838.22    807.9           30862.5
Resemblance - Opt   24514.7     1042.1          24475.5
I-Match             3815.8      975.8           3598.8
Syntactic           65          N/A             N/A

Scalable Information Systems: “Characteristics”
• Ingest data from multiple sources
  – Duplicate document detection
• Process multiple type data sources
  – Structured & unstructured data integration (SIRE)
• Use scalable (parallel) technology systems
  – Parallel SIRE
• Integrate retrieved data to yield answers
  – IIT Mediator

SIRE Goals
• Integrate structured and semi-structured data using a framework that also integrates unstructured data.
• Improve accuracy of retrieved results
• Support scalability:
  – data volume
  – retrieval speeds
• Support legacy data

Portability
The information retrieval prototype was implemented on the following relational platforms:
  – NCR Teradata DBC-machines
  – Microsoft SQL Server
  – Sybase
  – Oracle
  – IBM DB2 and SQL/DS

Relational Inverted Index
All inverted index entries
    <term> <list of documents>
e.g.,
    vehicle D1, D3, D4
results in:

term      docID
vehicle   D1
vehicle   D3
vehicle   D4

Text Retrieval Conference (TREC) Sample Document

<DOC>
<DOCNO> AP881214-0028 </DOCNO>
<FILEID>AP-NR-12-14-88 0117EST</FILEID>
<FIRST>u i BC-Japan-Stocks 12-14 0027</FIRST>
<SECOND>BC-Japan-Stocks,0026</SECOND>
<HEAD>Stocks Up In Tokyo</HEAD>
<DATELINE>TOKYO (AP) </DATELINE>
<TEXT>
The Nikkei Stock Average closed at 29,754.73 points up 156.92 points on the Tokyo Stock Exchange Wednesday.
</TEXT>
</DOC>

Relational Document Representation (Term Processing)

DOCUMENT
docID   docname         headline             dateline
28      AP881214-0028   Stocks Up In Tokyo   TOKYO (AP)

INDEX
docID   termcnt   term
28      1         nikkei
28      2         stock
28      1         average
28      1         closed
28      2         points
28      1         up
28      1         tokyo
28      1         exchange
28      1         wednesday

TERM
term        df      idf
average     2265    1.08
closed      2208    1.08
exchange    2790    1.00
nikkei      234     2.07
points      1627    1.23
stock       2674    1.00
tokyo       725     1.58
up          12746   0.30
wednesday   6417    0.60

Simplistic Models: Keyword and Boolean Searches

Relational Approach: Keyword Search Techniques
• Keyword search
    select i.docID
    from INDEX i, QUERY q
    where i.term = q.term
• Keyword search with stop word list
    select i.docID
    from INDEX i, QUERY q
    where i.term = q.term
      and i.term not in (select s.term from STOPLIST s)

Relational Approach: Boolean Search Techniques
• OR query
  Using union:
    select docID from INDEX where term = term1
    union
    select docID from INDEX where term = term2
    union
    select docID from INDEX where term = term3
    ....
    union
    select docID from INDEX where term = termN
  Using OR:
    select docID
    from INDEX
    where term = term1 OR
          term = term2 OR
          term = term3 OR
          ....
          term = termN

Relational Approach: Boolean Search Techniques
• AND query
  Using intersect:
    select docID from INDEX where term = term1
    intersect
    select docID from INDEX where term = term2
    intersect
    select docID from INDEX where term = term3
    ....
    intersect
    select docID from INDEX where term = termN
  Using self-joins:
    select a.docID
    from INDEX a, INDEX b, INDEX c, ... INDEX N
    where a.term = term1 AND
          b.term = term2 AND
          c.term = term3 AND
          ....
          n.term = termN AND
          a.docID = b.docID AND
          b.docID = c.docID AND
          ....
          N-1.docID = N.docID

Fixed Join-Count AND Queries
Find all documents that contain all of the terms found in the QUERY relation:
    select i.docID
    from INDEX i, QUERY q
    where i.term = q.term
    group by i.docID
    having count(distinct(i.term)) = (select count(*) from QUERY)

TAND Queries
Find all documents that contain at least X of the terms found in the QUERY relation:
    select i.docID
    from INDEX i, QUERY q
    where i.term = q.term
    group by i.docID
    having count(distinct(i.term)) >= X

Relevance Ranking: Vector Space & Probabilistic Models

Vector Space Model
• Term Frequency (tf_ik):
  – number of occurrences of term t_k in document i
• Document Frequency (df_j):
  – number of documents which contain t_j
• Inverse Document Frequency (idf_j):
  – log(d/df_j) where d is the total number of documents
• Notes:
  – idf is a measure of uniqueness of a term across the collection
  – tf is the frequency of a term in a given document

Vector Space Model: Sample Relational Query
List all documents in the order of their similarity coefficient where the coefficient is computed using the dot product.
    SELECT d.docID, d.docname,
           SUM(i.termcnt * t.idf * q.termcnt * t.idf)
    FROM DOCUMENT d, QUERY q, INDEX i, TERM t
    WHERE q.term = i.term AND
          q.term = t.term AND
          d.docID = i.docID
    GROUP BY d.docID, d.docname
    ORDER BY 3 DESC

Similarity Coefficients
• Several similarity coefficients based on the query vector X and the document vector Y are defined:

Inner Product

$$\sum_{i=1}^{t} x_i \cdot y_i$$

Cosine Coefficient

$$\frac{\sum_{i=1}^{t} x_i y_i}{\sqrt{\sum_{i=1}^{t} x_i^2 \cdot \sum_{i=1}^{t} y_i^2}}$$

SQL for “Probabilistic” Similarity Measure

$$\sum_{i=1}^{num\_terms} \log\left(\frac{(numdocs - df_i) + 0.5}{df_i + 0.5}\right) \cdot \left(\frac{2.2 \cdot tf_{id}}{0.3 + (0.75 \cdot doclength / avgdoclength) + tf_{id}}\right) \cdot qtf$$

    SELECT d.docID, d.docname,
           SUM( LOG(((NumDocs - t.df) + 0.5) / (t.df + 0.5)) *
                ((2.2 * i.tf) / (.3 + ((.75 * d.DocLen) / AvgDocLen) + i.tf)) *
                q.termcnt )
    FROM INDEX i, TERM t, DOCUMENT d, QUERY q
    WHERE i.term = t.term AND
          i.docID = d.docID AND
          t.term = q.term
    GROUP BY d.docID, d.docname
    ORDER BY 3 DESC;
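The relational queries in these slides can be tried on a toy collection. What follows is a minimal sketch using an in-memory SQLite database, not the prototype described in the deck. The three documents, their term counts, and the idf values are made up for illustration. The INDEX and QUERY relations are renamed `doc_index` and `query_terms` here because INDEX is a reserved word in SQLite, and the threshold `x` is a hypothetical parameter standing in for X.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Schema modeled on the DOCUMENT / INDEX / TERM / QUERY relations (names adapted).
cur.executescript("""
CREATE TABLE document    (docID INTEGER, docname TEXT);
CREATE TABLE doc_index   (docID INTEGER, term TEXT, termcnt INTEGER);
CREATE TABLE term        (term TEXT, idf REAL);
CREATE TABLE query_terms (term TEXT, termcnt INTEGER);
""")

# Made-up toy data: three documents and a two-term query.
cur.executemany("INSERT INTO document VALUES (?, ?)",
                [(1, "D1"), (2, "D2"), (3, "D3")])
cur.executemany("INSERT INTO doc_index VALUES (?, ?, ?)", [
    (1, "stock", 2), (1, "tokyo", 1),
    (2, "stock", 1),
    (3, "tokyo", 3), (3, "nikkei", 1),
])
cur.executemany("INSERT INTO term VALUES (?, ?)",
                [("stock", 1.0), ("tokyo", 2.0), ("nikkei", 3.0)])
cur.executemany("INSERT INTO query_terms VALUES (?, ?)",
                [("stock", 1), ("tokyo", 1)])

# Fixed join-count AND: documents containing every query term.
and_docs = [row[0] for row in cur.execute("""
    SELECT i.docID
    FROM doc_index i, query_terms q
    WHERE i.term = q.term
    GROUP BY i.docID
    HAVING COUNT(DISTINCT i.term) = (SELECT COUNT(*) FROM query_terms)
    ORDER BY i.docID
""")]

# TAND: documents containing at least x of the query terms.
x = 1
tand_docs = [row[0] for row in cur.execute("""
    SELECT i.docID
    FROM doc_index i, query_terms q
    WHERE i.term = q.term
    GROUP BY i.docID
    HAVING COUNT(DISTINCT i.term) >= ?
    ORDER BY i.docID
""", (x,))]

# Vector space ranking: dot product of tf*idf weights, highest score first.
ranked = cur.execute("""
    SELECT d.docID, d.docname,
           SUM(i.termcnt * t.idf * q.termcnt * t.idf) AS score
    FROM document d, query_terms q, doc_index i, term t
    WHERE q.term = i.term AND q.term = t.term AND d.docID = i.docID
    GROUP BY d.docID, d.docname
    ORDER BY score DESC
""").fetchall()
```

With this data only D1 contains both query terms, while D3 ranks first under the dot product because its single matching term has a high count and a high idf.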
