An Evaluation of Standard Retrieval Algorithms and a Binary Neural Approach

Jim Austin and Victoria J. Hodge
Dept. of Computer Science, University of York, UK

This research was supported by an EPSRC studentship.

Contact Address:
Victoria J. Hodge
Dept. of Computer Science
University of York
Heslington
York, UK
Email: vicky@cs.york.ac.uk

Running Title: A Neural Retrieval Approach


An Evaluation of Standard Retrieval Algorithms and a Binary Neural Approach

Abstract

In this paper we evaluate a selection of data retrieval algorithms for storage efficiency, retrieval speed and partial matching capabilities using a large information retrieval dataset. We evaluate standard data structures, for example inverted file lists and hash tables, but also a novel binary neural network that incorporates single-epoch training, superimposed coding and associative matching in a binary matrix. We identify the strengths and weaknesses of the approaches. From our evaluation, the novel neural network approach is superior with respect to training speed and partial match retrieval time. From the results we make recommendations for the appropriate usage of the novel neural approach.

Keywords: Information Retrieval, Algorithm, Binary Neural Network, Correlation Matrix Memory, Word-Document Association, Partial Match, Storage Efficiency, Speed of Training, Speed of Retrieval


Introduction

Many computational implementations require algorithms that are storage efficient, may be rapidly trained with data and allow fast retrieval of selected data. Information Retrieval (IR) requires the storage of massive sets of word-to-document associations. The underlying principle of most IR systems is to retrieve stored data on the basis of queries supplied by the user. The data structure must allow the documents matching the query terms to be retrieved using some form of indexing. This inevitably requires an algorithm that is efficient for storage, allows rapid training of the associations and fast retrieval of documents matching the query terms, and additionally permits partial matching (M of N query terms, where M <= N, N is the total number of query terms and M is the number of those terms that must match). Many methodologies have been posited for storing these associations, including inverted file lists, hash tables, document vectors, superimposed coding techniques and Latent Semantic Indexing (LSI). In this paper we compare Perfect techniques, i.e. those that preserve all word-to-document associations, as opposed to Imperfect methodologies such as LSI.

A representation strategy used in many systems is document vectors. There are various adaptations of the underlying strategy but fundamentally the documents are represented by vectors, with elements representing an attribute for each word in the corpus. The document vectors form a matrix representation of the corpus with one row per document vector and one column per word attribute. For binary document vectors, used in for example Koller & Sahami, the weights are Boolean, so if a word is present in a document the appropriate bit in the document vector is set to 1. The matrix may also be integer-based, as used in Goldszmidt & Sahami, where w_jk represents the number of times word_j is present in document_k. By activating the appropriate columns (words), the documents containing those words may be retrieved from the matrix. LSI decomposes a word-document matrix to produce a meta-level representation of the corpus, with the aim of correlating terms and extracting document topics. LSI reduces the storage using Singular Value Decomposition (SVD) factor analysis. Although this serves to reduce storage, it also discards information and compresses out the word-document associations we need for our evaluation.
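As a brief illustration of the binary document-vector matrix described above, the following minimal sketch (our own example, not code from any of the evaluated systems; the class and method names are hypothetical) stores one Boolean row per document and retrieves the documents containing a given word by scanning that word's column:

    #include <cstddef>
    #include <vector>

    // Sketch of a binary word-document matrix: rows are documents,
    // columns are word attributes (1 = word present in document).
    class DocMatrix {
    public:
        DocMatrix(std::size_t numDocs, std::size_t numWords)
            : bits(numDocs, std::vector<bool>(numWords, false)) {}

        // Set the bit for (document, word) when the word occurs in the document.
        void add(std::size_t doc, std::size_t word) { bits[doc][word] = true; }

        // Activate one word column and collect every document whose bit is set.
        std::vector<std::size_t> docsContaining(std::size_t word) const {
            std::vector<std::size_t> matches;
            for (std::size_t d = 0; d < bits.size(); ++d)
                if (bits[d][word]) matches.push_back(d);
            return matches;
        }

    private:
        std::vector<std::vector<bool> > bits;
    };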

There are alternative hashing strategies to the standard hash structure evaluated in this paper. They are aimed at partial matching but tend to retrieve false positives (documents appear to match that should not) and are thus Imperfect. There may be insufficient dimensions for uniqueness, so many words may hash to the same bits and thus false matches will be retrieved. The extra matches then have to be rechecked for correctness and the false matches eliminated, thus slowing retrieval. These approaches are described in the literature and include address generation hashing and hashing with descriptors. There are also superimposed coding (SIC) techniques described in the literature for partial matching applications, for example one-level superimposed coding and two-level superimposed coding, but again these are Imperfect and tend to over-retrieve.

We wish to avoid the information loss inherent in LSI and also the false positives of SIC, with their intrinsic requirement for a post-match recheck to eliminate false matches. Therefore we concentrate on Perfect lexical techniques. We analyse an inverted file list (a slightly more sophisticated version of which is used in the Google search engine) against hash tables using various hash functions, and against the binary associative memory AURA. We implement a novel binary matrix version of document vectors using AURA, where the word-document associations are added incrementally and superimposed; training therefore requires only a single pass through the word-document association list. In all cases we evaluate the algorithms in their standard form, without any sophisticated improvements, to provide a valid comparison of the approaches. We evaluate the algorithms for storage use, training speed, retrieval speed and partial matching capabilities. Knuth posits that hash tables are superior to inverted file lists with respect to speed, but that the inverted file list uses slightly less memory. Knuth also details superimposed coding techniques, but focuses on approaches with multiple bits set which, as described above, inevitably generate false positives. We implement orthogonal single-bit-set vectors for individual words and documents that produce no false positives. AURA may be used in Imperfect mode, where multiple bits are set to represent words and documents; this reduces the vector length required to store all associations but generates false matches, as described previously and in Knuth. In this paper we focus on using AURA in Perfect mode.

For our evaluation we use the Reuters Newswire text corpus, as the dataset is a standard Information Retrieval benchmark. It is also large, which allows the extraction of a large set of word-to-document associations for a thorough evaluation of the algorithms. We discarded any documents that contained little text and documents that were repetitions. The remaining documents were further tidied to leave lowercase alphabetical characters and six standard punctuation characters only. All alphabetical characters were changed to lower case to ensure matching (i.e. "The" becomes "the"), so that all instances match. We felt numbers and the other punctuation characters added little value to the dataset and are unlikely to be stored in an IR system; we left the six punctuation characters as control verification values. We extracted all remaining words, including the six punctuation symbols, from the documents and derived the word-to-document associations. We created a file with the document ID integers and a list of each word that occurs in that document. The fraction of documents that contain each individual word is shown in the graph (see figure); the fractions range from words occurring in very few documents up to words occurring in most documents. We analyse the three data structures for multiple single query term retrievals and also for partial match retrievals, where documents matching N of M query terms are retrieved. We assume that all words searched for are present; we do not consider error cases in this paper (e.g. words not present or spelling errors). (Footnote: training time, and thus speed, is implementation dependent. To minimise variance we preserve as much similarity between the data structures as possible, particularly during training; see the Data structures section for details.)

Data structures

Inverted File List (IFL)

For the inverted file list compared in this paper we use an array of words, sorted alphabetically and linked to an array of lists. This data structure minimises storage and provides flexibility. The array of words provides an index into the array of lists (see figure and appendix A for the C++ implementation). A word is passed to an indexing function (a binary search through the alphabetically sorted array of words) that returns the position of the word in the array. The document list stored at array position X represents the documents associated with the word at position X in the alphabetic word array. The lists are only as long as the number of documents associated with the words, to minimise storage, yet can easily be extended to incorporate new associations. The document ID is appended to the head of the list for speed; appending at the tail requires a traversal of the entire list and thus slows training. The approach requires minimal storage for the word array (the array need only be as long as the number of words stored, as compared to the hash table below, for example, that requires the word storage array to be kept only partly full), but additions of new words would require the array to be reorganised alphabetically and the pointers to the array of lists updated accordingly. The retrieval of the index into the array of lists (binary search) is O(log n), so the design of the data structure is a trade-off between minimising storage and providing slower access; in comparison, the hash table below is O(1) for access.
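For concreteness, a query against this structure (a hypothetical usage sketch of our own, built on the StringHash inverted file list classes in appendix A) amounts to a binary search for the word followed by writing out its document list:

    // Hypothetical driver for the appendix A inverted file list classes.
    int main() {
        const unsigned numWords = 10000;                // illustrative word-list size
        StringHash ifl(numWords);                       // one document list per word
        ifl.getWords("words.txt");                      // read the alphabetically sorted word list
        ifl.insert("acquisition", "42");                // associate the word with document ID 42
        ifl.writeDocsToFile("out.txt", "acquisition");  // binary search for the word, write its list
        return 0;
    }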

The inverted file list achieves partial matching through the addition of a supplementary data structure: an array of documents and counters. In this paper the documents were identified by integer IDs, so we exploit this to provide an index into an array of counters that count the number of words matched by each document. The counter stored at position ID is the number of times that document ID has been retrieved. During retrieval, each time a document is retrieved from the document list of a query word, the corresponding document counter is incremented. A single pass through the array once retrieval is complete will retrieve the documents that have matched the required number of words (N of M).
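The following sketch illustrates this counter-based partial match (our own minimal example with hypothetical names; it assumes the document lists for the query words have already been retrieved, e.g. via the inverted file list above): each retrieved document increments its counter, and one final pass collects the documents reaching the threshold M.

    #include <cstddef>
    #include <vector>

    // Count how many query words each document matched, then report the
    // documents reaching the partial-match threshold M.
    std::vector<std::size_t> partialMatch(
        const std::vector<std::vector<std::size_t> >& docListsPerQueryWord,
        std::size_t numDocs, std::size_t M) {
        std::vector<std::size_t> counters(numDocs, 0);
        // One document list per query word; each hit increments that document's counter.
        for (std::size_t w = 0; w < docListsPerQueryWord.size(); ++w)
            for (std::size_t k = 0; k < docListsPerQueryWord[w].size(); ++k)
                ++counters[docListsPerQueryWord[w][k]];
        // Single pass over the counter array to collect documents matching at least M words.
        std::vector<std::size_t> matches;
        for (std::size_t docID = 0; docID < numDocs; ++docID)
            if (counters[docID] >= M) matches.push_back(docID);
        return matches;
    }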

For the inverted file list, the training time was the time to input the words into the word array and then the word-document associations into their respective lists. The memory used = word array + array of lists. For the partial match, an additional data structure (an array of document counters) was incorporated, so the memory usage for partial matching is also given.

Hash Table

The hash table employed in this paper is used to maximise the retrieval speed of the documents associated with a particular word. An array of words is again linked to an array of pointers to linked lists (see figure and appendix A for the C++ implementation) to minimise storage and maintain flexibility, while a hash function generates the indices. A hash function determines the location at which a given item is stored, based on some attribute of the item to be stored; this computed location is called the item's hash value. To insert a new item into the hash array it is necessary to compute the hash value of the item. If this location is empty, the new item can be inserted. However, if this location is already occupied (a collision), an alternative strategy must be used. The complexity of an ideal hash operation with no collisions is constant time. The worst-case upper bound for the hash operation complexity is O(n) as the table fills. The price for such a speed-up is memory usage: the hash table used in the paper degrades as the table fills and thus must be kept less than half occupied, otherwise there are too many collisions, degrading retrieval and insertion speed through the additional calculations that must be employed. There are two ways of handling collisions (collision rules): Open Addressing (or Chaining) and Closed Hashing.

Open addressing: each hash location is a pointer to a linked list. All items that hash to a particular location are added to the location's linked list. The approach is not feasible for the data representation required in this paper: we must have a unique hash location for each word to indicate which documents belong to which of the words. We would need to store both the words and the documents in the list to identify the associations. This would waste storage space and slow retrieval, as the algorithm searched through the entire list of all documents in the chain, where many may be associated with the other words in the word chain.

Closed hashing: no more than one item is stored at each hash location; the collision rule generates a succession of locations until an empty array item is discovered. In this paper we use closed hashing.

It is desirable, for efficiency reasons, to have as few collisions as possible. To achieve this we employ an initial hash function that is not predisposed to favour any particular location or set of locations. We want to spread the hash values across the array as evenly as possible, to minimise collisions and prevent clustering. This is most easily achieved if the hash function depends on all parts of the item: computing a large integer value from the item, dividing this integer by the table size and using the remainder as the hash value. We evaluated three common hash functions for strings, where a_1 ... a_n is the character string of length n and a_1 denotes the first character's ASCII value, a_2 the second and so on:

1. the simple sum of the character values, (a_1 + a_2 + ... + a_n) mod tableSize;
2. a sum of the character values weighted by their positions in the string, taken mod tableSize;
3. Horner's rule, given below in C++ code adapted from Sedgewick.

unsigned horner(const char* word) {
    unsigned sum = 0;
    const unsigned factor = 31;      /* the empirically chosen factor value is elided in the text */
    for ( ; *word != '\0'; word++)
        sum = sum * factor + *word;
    return sum % hashTableSize;
}

Horner's rule produced the least number of collisions during insertion of the words into the hash table. We then empirically evaluated Horner's rule and found a factor value (the multiplier applied to the running sum on each iteration) that minimised the collisions during insertion.

We selected a hash table size that was a prime number, to ensure that the modulo function is not followed by an integer with small divisors; if the integer has small factors then the hash algorithm will be predisposed to clustering. We also selected the prime number for the table size as the first prime number greater than double the length of the word list, to ensure that the hash table is less than half full. It also has hashTableSize - 2 prime. This ensures that the hash increment (described below) does not have any small divisors either. If hashTableSize and hashTableSize - 2 are not relative primes, they will at the least have a greatest common divisor gcd > 1 and only 1/gcd of the table will be probed. There are various methods that may be used to generate additional locations if a collision occurs at the hash value location: linear probing, quadratic probing and double hashing.

Linear probing: the next available space is allocated to the item whose hash value caused a collision. The approach tends to cause clustering.

Quadratic probing: this method iteratively generates locations exponential distances apart until a free location is discovered; the offset depends quadratically on the probe number. Again, this method tends to cause secondary clustering and bouncing, and quadratic probing is not guaranteed to find an empty cell even if one exists.

Double hashing is used for our investigations. A far better rule is to compute a displacement directly from the key, using another function of the key, and to add this displacement to the table position until a vacant position is found. The C++ code is given below; sum is the value returned from the Horner function.

unsigned hashIncr(void) {
    /* secondary hash: the exact form is partly elided in the original text;
       a common choice with the twin-prime table sizes described above is shown */
    return 1 + (sum % (hashTableSize - 2));
}

For double hashing, a key-dependent displacement is produced for each item, rather than the single fixed probe sequence of linear or quadratic probing, improving the algorithm's performance as more possible free locations are examined.
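To make the probing concrete, the following sketch (our own illustration; the class and the table layout are simplified relative to the appendix code) shows how the primary hash and the key-dependent increment combine until a free slot is found:

    #include <cstddef>
    #include <string>
    #include <vector>

    // Sketch of closed hashing with double hashing.  tableSize and
    // tableSize - 2 are assumed to be (twin) primes, as discussed in the text.
    class ProbeTable {
    public:
        explicit ProbeTable(unsigned tableSize) : N(tableSize), slots(tableSize) {}

        bool insert(const std::string& word) {
            unsigned sum = horner(word);
            unsigned i = sum % N;                     // primary hash location
            unsigned incr = 1 + (sum % (N - 2));      // displacement from a second function of the key
            for (unsigned probes = 0; probes < N; ++probes) {
                if (slots[i].empty()) { slots[i] = word; return true; }
                if (slots[i] == word) return true;    // already stored
                i = (i + incr) % N;                   // step on by the key-dependent displacement
            }
            return false;                             // table full
        }

    private:
        unsigned horner(const std::string& s) const {
            unsigned sum = 0;
            for (std::size_t k = 0; k < s.size(); ++k)
                sum = sum * 31 + static_cast<unsigned char>(s[k]);   // 31 is illustrative only
            return sum;
        }
        unsigned N;
        std::vector<std::string> slots;
    };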

Again, for partial match, we employ the additional document data structure used for the inverted file list. The training time was the time to input the words into the word hash array and then the word-document associations into their respective lists. The memory used = word array (kept less than half full) + array of lists (less than half full). For the partial match, an additional data structure was incorporated, so the memory usage for partial matching is again given.

Two stage hashing (hash table compact)

The previous hash table used a word array and a list array that were both kept only partly full, thus wasting storage. For comparison, we use the data structures of the previous hash table algorithm but compact the array of lists so that all array locations are used (see figure and appendix A for the C++ implementation). The initial hash array of words stores both the word and an integer identifying the location of the word's list in the list array. When the word string is passed to the hash function, the string is hashed and an integer is retrieved from the hash location, giving the location of the document list in the list array. The documents may then be read from that location. The complexity of the algorithm is identical to the previous one, degrading to O(n) for inserting a word into the word array or retrieving the location of the word or its associated document list. We compare the two data structures for memory use: the additional overhead of storing an integer to point to the word's document list, versus storing only the word but requiring gaps in the array of lists (empty array elements) for the previous hash table.

Again, for partial match, we employ the additional document data structure used for the inverted file list. The training time was the time to input the words into the word hash array and then the word-document associations into their respective lists. The memory used = word array (kept less than half full) + array of lists (completely full). For the partial match, an additional data structure was incorporated, so the memory usage for partial matching is again given.


AURA

AURA is a collection of neural networks that may be implemented in a modular fashion. It utilises Correlation Matrix Memories (CMMs) to map inputs to outputs (see figure) through a learning rule similar to a hash function. The AURA modular neural network uses binary weights and inputs. AURA implements the Hebbian learning rule, reinforcing active connections, to train the neural network. AURA does not suffer from the lengthy training problem of other neural networks: training is a one-pass, single-epoch process, preserving the network's high speed. The technique is a supervised learning method, as the outputs are known and guide the learning process. Storage is efficient in the CMMs, as the matrix size is defined when the CMM is instantiated; new inputs do not require additional memory allocation, as they are overlaid with existing trained patterns. AURA is able to partially match inputs. AURA has been used previously for symbolic, numeric and image data applications.

The words form the inputs and the documents the class patterns (outputs) of the neural network. The words and documents are translated to their respective binary bit vectors by the data-to-binary lexical token converter (see figure). Each word in the word list is represented by a unique m-dimensional orthogonal binary vector with a single bit set, where m equals the number of words in the list. The first word's vector has the first bit set and the last word in the word list has the last vector bit set (see the equation below). The documents are represented likewise with n-dimensional single-bit-set binary vectors, where n equals the number of documents: the first document's vector has the first bit set, through to the last document's vector, which has the last bit set.

bitVector_p = the binary vector with only the p-th bit set, for p = position{words} or p = position{documents}

For example, with m = 5 words, the third word in the list is represented by the vector 00100.


We require a Perfect recall technique, as stated in the Introduction. Orthogonal vectors ensure that the CMM does not return false positives. If more than one bit is set in the input vectors or class patterns then we can get bit clashes, and false positives will be returned from the lexical token converter, as with the alternative hashing and superimposed coding approaches listed in the Introduction.

There are three alternative strategies for representing bit vectors and CMMs in AURA: they may be represented as binary bit vectors (BBVs), compact bit vectors (CBVs) or efficient bit vectors (EBVs). BBVs store all p bits in the bit vector, storing a 0 or 1 as appropriate for each position. CBVs store a list of the locations of the set bits in the bit vector; this is ideal for sparse bit vectors, as only a few positions need to be stored in the list. EBVs enable a switch at a predefined weight (number of bits set): if the vector has a lower or equal weight it is stored as a CBV, but for a higher weight it is stored as a BBV; this enables the most efficient storage implementation to be used for each individual vector. In this paper we use CBVs, as only one bit is set in each word or document bit vector, so this representation is the most storage efficient: only one position is stored in the list. We use efficient CMMs with an empirically derived switch value: if more than the switch number of bits are set, the CMM row is stored as binary, otherwise it is stored as compact. A comparison of the memory usage of the CMM using binary CMMs, compact CMMs and efficient CMMs is provided in the Results section.
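As a rough illustration of these representations (a sketch of our own, not the AURA library interface), a vector can be held either as a list of set-bit positions (compact) or as a full bit array (binary), switching on the weight:

    #include <cstddef>
    #include <vector>

    // Sketch of an "efficient" bit vector: compact (set-bit positions) while the
    // weight is at or below the switch value, binary (full bit array) above it.
    struct EfficientBitVector {
        std::size_t length;
        std::size_t switchWeight;
        std::vector<std::size_t> setPositions;  // compact form
        std::vector<bool> bits;                 // binary form
        bool compact;

        EfficientBitVector(std::size_t len, std::size_t sw)
            : length(len), switchWeight(sw), compact(true) {}

        void setBit(std::size_t p) {            // duplicate sets are not handled in this sketch
            if (compact) {
                setPositions.push_back(p);
                if (setPositions.size() > switchWeight) {   // weight exceeded: switch to binary
                    bits.assign(length, false);
                    for (std::size_t k = 0; k < setPositions.size(); ++k)
                        bits[setPositions[k]] = true;
                    compact = false;
                    setPositions.clear();
                }
            } else {
                bits[p] = true;
            }
        }
    };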

Training the network

The binary patterns representing the tokens are input to the network and the binary patterns for the documents form the outputs of the CMM. The diagram (see figure) shows a CMM after one, two and three patterns have been trained; the input patterns (words) and their respective class patterns (documents) are shown in the figure. The correlation matrix memory (CMM) is set to one wherever an input row (word) bit and an output column (document) bit are both set (see figure). The training process is O(n), as there is one association in the CMM per word-document pair. After storing all word-document associations, the CMM weight w_jk for row j and column k (where \vee and \wedge are logical OR and AND respectively) is given by

w_jk = \bigvee_{all i} ( input^i_j \wedge class^i_k )

where input^i and class^i are the i-th trained word and document vectors.

Recalling from the network

For recall, only the word pattern is applied to the network. The columns are summed,

output_k = \sum_{all j} input_j \cdot w_{jk}

and the output of the network is thresholded to produce a binary output vector (see figure). The vector represents the document trained into the network as matching the input word presented to the network for recall. We use the Willshaw threshold, set to the number of bits set in the input vector, to threshold the output vector (see figure). The Willshaw threshold sets to 1 all the values in the output vector greater than or equal to a predefined threshold value and sets the remainder to 0. We wish to retrieve all outputs that match all inputs: if the input has one bit set, we retrieve all columns that sum to one.

If we wish to retrieve multiple word matches then, rather than serially matching the bit vectors, AURA replicates parallel matching. The bit vectors for the required words are superimposed, forming a single input vector (see the equation below and figure). If the input has two bits set, representing two words, then we retrieve all columns summing to 2, i.e. all documents matching both input words. Thus multiple word matching is O(1) with respect to input presentations, as the inputs are superimposed and input as one vector. For the other data structures evaluated in this paper, retrieving the documents matching N words requires N separate word input presentations, with each word matched in order.

inputVector = \bigvee_{all i} inputVector^i

If only a partial match of the input is required, i.e. only M of the N words (M <= N) in the input must match exactly, then this combinatorial problem is easily resolved due to the superimposition of the inputs. The input is sent to the network and the Willshaw threshold is set at M × B, where B is the number of bits set in each input vector (B = 1 for orthogonal vectors). This generalised combinatorial partial match provides a very efficient mechanism for selecting those documents that best match.
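The following sketch (our own minimal illustration, not the AURA library interface) shows the core of this scheme: training ORs a word-document association into a binary matrix, and recall sums the rows selected by the (possibly superimposed) input and applies the Willshaw threshold, so a threshold of M gives an M-of-N partial match with single-bit-set vectors.

    #include <cstddef>
    #include <vector>

    // Minimal binary Correlation Matrix Memory with orthogonal single-bit-set
    // word and document vectors: one row per word, one column per document.
    class CMM {
    public:
        CMM(std::size_t numWords, std::size_t numDocs)
            : w(numWords, std::vector<bool>(numDocs, false)) {}

        // Train one word-document association: set w[word][doc] (logical OR).
        void train(std::size_t wordPos, std::size_t docPos) { w[wordPos][docPos] = true; }

        // Recall with a superimposed input (the set of query word positions),
        // thresholding the column sums at the Willshaw threshold.
        std::vector<std::size_t> recall(const std::vector<std::size_t>& queryWords,
                                        std::size_t threshold) const {
            std::vector<std::size_t> sums(w.empty() ? 0 : w[0].size(), 0);
            for (std::size_t q = 0; q < queryWords.size(); ++q)        // superimposed input rows
                for (std::size_t doc = 0; doc < sums.size(); ++doc)
                    if (w[queryWords[q]][doc]) ++sums[doc];            // column sums
            std::vector<std::size_t> matches;                          // Willshaw threshold
            for (std::size_t doc = 0; doc < sums.size(); ++doc)
                if (sums[doc] >= threshold) matches.push_back(doc);
            return matches;
        }

    private:
        std::vector<std::vector<bool> > w;
    };

For example, after training the associations for three query words, a single recall with threshold 2 retrieves every document containing at least two of the three words in one presentation.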

Partial matching generates multiple document vector matches superimposed in a single output vector after thresholding, and these outputs must be identified. A list of valid outputs is held in a content-addressable memory and matched in the lexical token converter (binary to data). The outputs are passed to the lexical token converter, which converts the internal token representations back into the input data used by the external interface (binary to data). The time for this process is proportional to the number of bits set in the output vector: there will be one matching document per bit set, for orthogonal single-bit-set vectors.

The training time was the time to input the words into the lexical token converter, input the list of documents into a second lexical token converter and store the word-document associations in the memory matrix. The memory used = word lexical token converter (data to binary) + document lexical token converter (data to binary) + document lexical token converter (binary to data) + memory matrix.


Analyses

All analyses were performed on an SGI Origin; the processor count and clock speed, the MIPS CPU and FPU chip types, the main memory size and the instruction and data cache sizes were taken from the IRIX hinv command.

All data structures were compiled with the CC compiler using -Ofast (fast binaries) and 64-bit compilation; the AURA library requires 64-bit compilation, so we compiled the other data structures likewise for consistency. If the code had been compiled as 32-bit for the inverted file list and hash tables, the memory usage would be less; this is noted in the Analysis section for comparison. The algorithms were run with the command "runon x algorithm", and one process from each data structure was run in parallel. This ensured that each data structure was evaluated simultaneously while the other processors in the Origin were free to furnish other processes; the timings should therefore be more reliable, with less variance.

The following analyses were performed on the data structures:

1. The memory usage for each algorithm was calculated using the C++ sizeof operator.

2. The training time for each algorithm was calculated using the C++ clock() function (a minimal timing sketch is given after this list). The training time was the time to read in the list of words, input them to the data structure as appropriate, and read in each of the word-to-document associations, adding the appropriate links in the data structures. For the CMM, the list of documents was also read into an array prior to adding the associations.

3. Serial Match: the words are matched one at a time, with the matching documents retrieved after each word. The alphabetically sorted list of words was read into an array (included in the timing). For all data structures an identical output (the word to be matched and the list of matching documents) was generated. Two serial match investigations were run on all methods:

(a) Retrieve the documents that match each of the first 100 words in turn (iteratively) from the alphabetically sorted array of all words. As each word is read, the matching documents are retrieved and written to an output file, then the next word is retrieved and matched, and so on. For all data structures the output to the file is identical. We read all words into an array of words and then read the first 100 using a for loop. Obviously we could have just read the first 100 words from the file, but to maintain consistency with the next evaluation we employed the array: the next evaluation retrieves every 50th word from the alphabetical list of all words and hence needs to read in the entire list. This analysis aims to evaluate the access speed of each of the data structures and should favour the hash table, as the first 100 words are less likely to have suffered collisions (the hash table was virtually empty as these words were being inserted), so the retrieval time will be approximately O(1). The entire 100-word match was performed ten times consecutively and the overall average time calculated, as averaging should eliminate any timing fluctuations.

(b) Retrieve the documents that match every 50th word in the word list in turn (iteratively). This matches the documents against a larger set of words spread through the word list. We select every 50th word because the graph of the number of these words in each document is similar to the graph for the first-100-word retrieval (see figures). Again, all words are read from the word file into an array and every 50th word is retrieved from the array in a for loop. The matching documents for each word are written to an output file in turn. This analysis will again evaluate the retrieval speeds and should not favour any data structure, as the words are evenly spread through the alphabetically sorted list of words. The match was performed ten times consecutively and the time averaged over the ten retrievals.

4. Partial Match: this compares the retrieval speed of the CMM with the alternative methods that were implemented for partial matching. For all data structures the output generated is identical, to maintain consistency. The following three partial match evaluations were performed on each method:

(a) Retrieve the documents that match at least a given number of the first 100 words, for increasing thresholds. The figure shows the frequency distribution of the number of the first 100 words occurring in each document. There is only one document matching the highest number of words, so we cease partial matching at this value.

(b) Retrieve the documents that match at least a given number of every 50th word, for increasing thresholds. The figure shows the frequency distribution of the number of words occurring in each document when taking every 50th word of the alphabetical list of all words. Again, there are only a few documents matching the largest numbers of words, so we maintain the same evaluation values as in the first evaluation.

(c) Retrieve the documents that match at least a given number of the most frequently occurring words (the words that occur in at least 20% of the documents), for increasing thresholds. From the graph (see figure), the documents match a wide range of these words.
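As noted in item 2, the training times were measured with the clock() function; the following minimal sketch (our own illustration; trainDataStructure is a hypothetical stand-in for each data structure's training routine) shows the measurement pattern assumed:

    #include <ctime>

    // Time a training routine in seconds using the C/C++ clock() function.
    double timeTraining(void (*trainDataStructure)()) {
        std::clock_t start = std::clock();
        trainDataStructure();           // read words, then add the word-document associations
        std::clock_t end = std::clock();
        return static_cast<double>(end - start) / CLOCKS_PER_SEC;
    }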

Results

Memory Usage

The memory usage for each of the constituent substructures of each methodology is given below, in bytes.

Inverted File List
- List: memory use for all document lists.
- Array: memory use for the array of lists.
- WordTable: memory use for the array of words.
- Memory use for the array of document counters for partial match.
- Memory use for the array of words (to get every 50th word etc.).
- Total memory use.

Hash Table
- List: memory use for all document lists.
- Array: memory use for the array of lists (with empty locations).
- WordTable: memory use for the hash table of words.
- Memory use for the array of document counters for partial match.
- Memory use for the array of words (to get every 50th word etc.).
- Total memory use.

Hash Table Compact
- List: memory use for all document lists (all three data structures have identical memory usage for the lists).
- Array: memory use for the compacted array of lists.
- WordTable: memory use for the hash table of words; this is higher than for the hash table above, as this structure also stores the integer indicating each word's list.
- Memory use for the array of documents for partial match.
- Memory use for the array of words (to get every 50th word etc.).
- Total memory use.

CMM
- Binary (storing the CMM as a binary representation): CMM saturation and memory use in bytes; the memory usage is very high in comparison to the other data structures.
- Compact (storing the CMM as a compact representation): CMM saturation and memory use in bytes; the memory usage is high in comparison to the other data structures.
- Efficient (storing the CMM as an efficient representation with the chosen switch value): CMM saturation and memory use in bytes; the memory usage is much lower than for the compact or binary representations.
- Document decoder (binary to data): memory use.
- Document encoder (data to binary): memory use.
- Word encoder (data to binary): memory use.
- Memory use for the array of words (to retrieve every 50th word etc. and to maintain simple data types for the encoder).
- Total memory use.

Training Times

Twenty training times were noted for each algorithm and the mean time for training was calculated. The table below gives the mean training time in seconds for each algorithm.

Training time (seconds): IFL | Hash Table | Hash Compact | CMM (efficient)

Serial Match

The words are read from the data file and, for each of the selected words in turn, the associated documents are retrieved and written to an output file. Ten times were recorded for each serial match for each method and the mean time calculated. The mean retrieval time in seconds is given in the table below.

Mean retrieval time (seconds): IFL | Hash Table | Hash Compact | CMM (efficient)
First 100 words:
Every 50th word:

Partial Match

A graph is given for each partial match evaluation, with the number of words to be matched ("at least X") on the x-axis and the mean time in seconds for the retrievals on the y-axis. A separate plot is shown for each data structure on each graph.

Partial match on the first 100 words

The graph of the mean time in seconds for the retrieval of M of N matching words from the set of the first 100 words is given in the figure, with one curve for each data structure.

Partial match on every 50th word

The graph of the mean time in seconds for the retrieval of M of N matching words from the set of every 50th word is given in the figure, with one curve for each data structure.

Partial match on words occurring in at least 20% of the documents

The graph of the mean time in seconds for the retrieval of M of N matching words from the set of words present in at least 20% of the documents is given in the figure, with one curve for each data structure.

Analysis

Memory Usage

If we disregard the final array of words from the memory totals of the inverted file list, hash table and hash table compact (the array was only included for consistency), then the memory totals for the four algorithms in 64-bit mode are, in ascending order: IFL, hash table compact, hash table and CMM (excluding the array of document counters). If the first three data structures are compiled as 32-bit then their respective sizes are smaller still. As stated previously, the inverted file list uses the least memory but, as can be seen from the subsequent analyses, is slower for retrieval than the hash tables. The hash table compact, storing an integer per word, is more memory efficient than having empty list array elements in the standard hash table. The CMM has the highest memory usage: it is higher than the IFL when both are compiled in 64-bit form, and higher again when the IFL is compiled as 32-bit and AURA as 64-bit. However, none of the memory usage statistics is significantly larger than the others. We can also see that the efficient CMM has a far lower memory usage than a binary or compact CMM, due to the ability to switch to the most memory-efficient representation for each row.

Training Times

The IFL is the slowest to train, as expected, since each search for the word linked to the document to be trained requires a binary search through the entire array of words (O(log n)) before the document may be appended to the word's list. The training times for the two hash table approaches are similar, as we would expect, with both using the same document insertion procedure: the method calculates the hash location for the word and appends the document to the corresponding list of documents. However, by far the quickest to train is the CMM. The CMM does not require the lengthy search for the word in order to append the document to the word's list; although the hash table is O(1) for the word search initially, if there are collisions it will tend towards O(n). The CMM simply associates the word with the document, so that when the word is input the document will be recalled. The CMM takes only a small fraction of the hash tables' training time, which is a significant difference. The CMM reads in the same data as the others, so the file access can only constitute a minor part of the IFL and hash table times; the majority is spent inserting the words and word-to-document associations into the data structure, whereas this time is minimal for the CMM.

Serial Match Recall

The two hash tables are the quickest for the serial match, as the match for each word will be approximately O(1). The IFL is marginally slower, due to the O(log n) binary search through the word array. The CMM is significantly slower for serial match: for each word, the word has to be translated to a binary bit vector and input to the CMM, the output bit vector retrieved, and the output matched against the documents in the document decoder to retrieve the matching documents. The CMM takes several times longer than the standard hash table for both the first-100-words and the every-50th-word retrievals.

Partial Match Recall

The IFL is slower than the hash tables in all instances of partial matching. We would expect this, due to the O(log n) binary search through the IFL to locate the words prior to finding the associated documents. The two hash tables are both faster than the IFL in all instances and produce very similar results; the differences between the graphs are negligible. The hash table compact does appear slightly faster, although the difference may be ascribed to slight variation in the C++ timing utility and processor operation. The CMM is slower than the hash tables for low-frequency partial match, but for higher-frequency partial match the CMM excels. The time curve for the CMM starts above the hash table curve but falls below it at higher match thresholds on both the first-100-words graph and the every-50th-word graph. For the words present in at least 20% of the documents (see figure), the CMM is faster than the hash table throughout, with the difference increasing as the match threshold rises. The difference will level off and eventually fall, as the CMM curve approaches its lower bound (zero retrieval time) while the retrieval time of the hash table continues to fall, slowly approaching the CMM curve. However, the hash table will only approach the CMM slowly and may never reach it, so the CMM is preferable for partial matching.

In all cases the number of words to be matched affects the times, as we would expect. The every-50th-word match is slower for all data structures than the first-100-words match, because there are more words to be matched in order to retrieve the documents containing the required number of words.

The match operation for the hash tables is very similar in all instances, so for each evaluation (first 100 words, every 50th word and words in at least 20% of the documents) the timing only reduces slightly as the required number of matching words increases. The procedure for the hash table matching varies little: the method reads in the words to be matched one at a time; for each word, the word is hashed to find the associated document list and the documents are retrieved from the list; for each document retrieved, the counter in the array of document counters is incremented to record the retrieval; and finally the array is traversed, writing to a file all documents that reach the number of matches required. The only variation is in the final step, where, as the number of matches required increases, the retrieval quickens because fewer documents are written to file. For the CMM, the match operation timing reduces rapidly as the required number of matching words increases. The initial step is to read in the words to be matched, retrieve their associated binary bit vectors from the lexical token converter, superimpose them and input them to the CMM; this is identical in all instances. To vary the number of matches, the threshold is adjusted. When the threshold is low, the binary bit vector retrieved from the thresholded output will have more bits set. Thus, when this bit vector is matched against the document vectors in the lexical token converter (binary to data), there will be more matching documents and retrieval will be slower than for higher thresholds, where the thresholded output will be relatively sparse, with fewer matches required.

Conclusion

We introduced a novel neural approach for storing word-document associations in an IR system. The neural approach uses single-epoch training, superimposed storage and associative partial matching in a binary matrix, where new associations are added incrementally. We compared this novel approach to existing standard data structures for storing word-document associations.

For repeated serial single-word matching, the hash table is the most efficient methodology: the storage is slightly higher than the word array but the retrieval speed is much higher.

For partial matching the CMM performs best. Although the memory usage is higher (but not significantly so), the retrieval speed is superior: the greater the number of words to be matched at each retrieval, the more superior the CMM is.

We recommend the CMM for Information Retrieval word-document association storage, as the approach is superior with respect to training time and retrieval time, with only a slightly higher memory usage. The superimposition of the input vectors allows one-shot multiple word retrieval and partial match. We also note that more word-to-document associations could be added to the existing association CMM without significantly increasing the memory usage, due to the superimposed storage. The CMM could therefore be used in an incremental system, although we do not evaluate incremental data structures here. For the other data structures evaluated, additional associations would increase the memory use of the list array, with one additional list node for each additional association. If additional words were added, the word arrays would need to be incremented by two locations per word for the hash tables, to keep the array less than half full. The word array would need to be extended by a single location for the IFL, but the new word would need to be inserted in the correct alphabetical location and not just appended to the end of the array.


References

A. V. Aho and J. D. Ullman. Optimal Partial-Match Retrieval When Fields are Independently Specified. ACM Transactions on Database Systems.

S. Alwis and J. Austin. A Neural Network Architecture for Trademark Image Retrieval. In International Workshop on Artificial Neural Networks, Spain.

L. Ammeraal. Algorithms and Data Structures in C++. John Wiley & Sons, Chichester, England.

J. Austin. Distributed associative memories for high speed symbolic reasoning. In R. Sun and F. Alexandre, editors, IJCAI Working Notes of the Workshop on Connectionist-Symbolic Integration: From Unified to Hybrid Approaches, Montreal, Quebec.

J. Austin, J. Kennedy and K. Lees. A Neural Architecture for Fast Rule Matching. In Artificial Neural Networks and Expert Systems Conference.

S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In 7th International World Wide Web Conference.

T. H. Cormen, C. E. Leiserson and R. L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, MA.

S. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas and R. A. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science.

M. Goldszmidt and M. Sahami. A Probabilistic Approach to Full-Text Document Clustering. Technical Report, SRI International.

J. Kennedy. The Design of a Scalable and Applications Independent Platform for Binary Neural Networks. PhD thesis, Department of Computer Science, University of York, Heslington, York, UK.

D. E. Knuth. The Art of Computer Programming, volume 3: Sorting and Searching. Addison-Wesley, Reading, MA.

D. Koller and M. Sahami. Hierarchically Classifying Documents Using Very Few Words. In Proceedings of the Fourteenth International Conference on Machine Learning (ICML), San Francisco, CA. Morgan Kaufmann.

K. Ramamohanarao, J. W. Lloyd and J. Thom. Partial Match Retrieval using Hashing and Descriptors. ACM Transactions on Database Systems.

C. S. Roberts. Partial Match Retrieval via the Method of Superimposed Codes. Proceedings of the IEEE.

R. Sacks-Davis and K. Ramamohanarao. A Two Level Superimposed Coding Scheme for Partial Match Retrieval. Information Systems.

R. Sedgewick. Algorithms in C++. Addison-Wesley, Reading, MA.

P. Zhou and J. Austin. A Binary Correlation Matrix Memory k-NN Classifier. In International Conference on Artificial Neural Networks, Sweden.

A Appendix

All routines are adapted from the hash tables in the cited textbooks.


Array of lists

Used in the word array, hash array and hash compact implementations.

#include <cstdio>
#include <cstring>

struct elem { char* name; elem* next; };

class list {
public:
    list() { pStart = NULL; }
    void ListInsert(const char* s);
    void writeList(const char* wordFile);
private:
    elem* pStart;
};

void list::ListInsert(const char* s) {
    /* insert a document ID at the head of the list
       (the original also checked FindPosition(s) to avoid duplicate entries) */
    elem* p = new elem;
    int len = strlen(s);
    p->name = new char[len + 1];
    strcpy(p->name, s);
    p->next = pStart;
    pStart = p;
}

void list::writeList(const char* wordFile) {
    /* write to file the entire list: all docs that match a particular word */
    FILE* F = fopen(wordFile, "w");
    elem* p = pStart;
    while (p) {
        fprintf(F, "Doc is %s ", p->name);
        p = p->next;
    }
    fprintf(F, "\n");
    fclose(F);
}

Inverted File List Word Array

/* wordArray and wordLabel are assumed to be defined elsewhere (globals in the original) */

class StringHash {
public:
    StringHash(unsigned len) : N(len) {
        a = new list[len];      /* array of lists of document IDs */
    }
    void insert(const char* s, const char* doc) {
        /* add a doc ID to a word's list; the location is found by "hashing"
           (a binary search for) the word */
        a[hash(s)].ListInsert(doc);
    }
    void writeDocsToFile(const char* wordFile, const char* s) {
        /* write the documents associated with a particular word */
        a[hash(s)].writeList(wordFile);
    }
    void getWords(const char* wordFile);
    unsigned hash(const char* s) const;
private:
    unsigned N;
    list* a;
};

void StringHash::getWords(const char* wordFile) {
    /* read in a file of words to initialise the array of words */
    int count = 0;
    FILE* F = fopen(wordFile, "r");
    while (!feof(F)) {
        fscanf(F, "%s", wordLabel);
        strcpy(wordArray[count++], wordLabel);
    }
    fclose(F);
}

unsigned StringHash::hash(const char* s) const {
    /* find the location of a word in the alphabetically sorted array (binary search) */
    unsigned sum = 0;
    int left = 0, right = N - 1, middle = 0;
    while (right - left > 1) {
        middle = (right + left) / 2;
        (strcmp(s, wordArray[middle]) < 0) ? (right = middle) : (left = middle);
    }
    if (strcmp(s, wordArray[middle]) == 0) sum = middle;
    if (strcmp(s, wordArray[left]) == 0)   sum = left;
    if (strcmp(s, wordArray[right]) == 0)  sum = right;
    return sum;
}


Hash Table

Hash table of words (the word array is twice the length of the word list).

#include <cstdlib>
#include <cstring>
#include <iostream>

enum { Failure = 0, Success = 1 };

struct elem { char* name; };

class HashTable {
public:
    HashTable(unsigned len);
    unsigned hash(const char* s);
    unsigned hashIncr(void) const { return 1 + (sum % (N - 2)); }   /* secondary hash increment */
    int h(const char* t, unsigned& i) const;
    void insert(const char* s);
    unsigned getPos(const char* s);
    int getCollisions(void) const { return collisionCounter; }
private:
    elem* a;
    unsigned N;
    unsigned sum;
    int collisionCounter;
};

HashTable::HashTable(unsigned len) {
    /* initialise the hash table: twice the number of words, so less than half full */
    N = len + len;
    a = new elem[N];
    for (unsigned i = 0; i < N; i++) a[i].name = 0;
    collisionCounter = 0;
}

unsigned HashTable::hash(const char* s) {
    /* Horner's hash function (the multiplicative factor is elided in the original) */
    const unsigned factor = 31;
    for (sum = 0; *s != '\0'; s++)
        sum = sum * factor + *s;
    return sum % N;
}

int HashTable::h(const char* t, unsigned& i) const {
    /* secondary hash function: probe by the key-dependent increment until the
       word (or an empty slot) is found */
    unsigned count = 0, incr = hashIncr();
    while (a[i].name != 0 && strcmp(a[i].name, t) != 0) {
        if (++count > N) return Failure;
        i = (i + incr) % N;
    }
    return Success;
}

void HashTable::insert(const char* s) {
    /* insert a word into the hash table */
    unsigned i = hash(s);
    if (a[i].name != 0 && strcmp(a[i].name, s) != 0) collisionCounter++;
    if (h(s, i) == Failure) { std::cout << "Hash table full\n"; exit(1); }
    a[i].name = new char[strlen(s) + 1];
    strcpy(a[i].name, s);
}

unsigned HashTable::getPos(const char* s) {
    /* return the position of a particular word */
    unsigned i = hash(s);
    if (h(s, i) == Success && a[i].name != 0)
        return i;
    else
        return N;   /* not found */
}

Array of lists

Used in both the hash table and the hash table compact implementations.

/* wordLabel is assumed to be defined elsewhere, as in the inverted file list code above */

class StringHash {
public:
    /* Standard hash table implementation: one list array location per word hash array location */
    StringHash(unsigned len) : N(len) {
        a = new list[len];
        wordTable = new HashTable(len);
    }
    /* Hash table compact implementation: compact list array of length len,
       word hash array of length hlen */
    StringHash(unsigned len, unsigned hlen) : N(len) {
        a = new list[len];
        wordTable = new HashTable(hlen);
    }
    ~StringHash() { delete[] a; }
    void writeDocsToFile(const char* wordFile, const char* s) {
        /* write all docs associated with a particular word */
        a[hash(s)].writeList(wordFile);
    }
    void insert(const char* s, const char* doc) {
        /* insert a doc ID into a particular word's list */
        a[hash(s)].ListInsert(doc);
    }
    int getAllCollisions(void);
    void getWords(const char* wordFile);
    unsigned hash(const char* s) const;
private:
    unsigned N;
    list* a;
    HashTable* wordTable;
};

int StringHash::getAllCollisions(void) {
    /* return the number of collisions */
    return wordTable->getCollisions();
}

void StringHash::getWords(const char* wordFile) {
    /* read in all words and insert them into the word hash table */
    FILE* F = fopen(wordFile, "r");
    while (!feof(F)) {
        fscanf(F, "%s", wordLabel);
        wordTable->insert(wordLabel);
    }
    fclose(F);
}

unsigned StringHash::hash(const char* s) const {
    /* return the position of a word's document list */
    return wordTable->getPos(s);
}

Hash Table Compact

All routines are identical to the hash table except those given below.

/* listNum is an additional int member of HashTable in this variant */

struct elem { char* name; int listPos; };

HashTable::HashTable(unsigned len) {
    /* initialise the hash table */
    N = len + len;
    a = new elem[N];
    for (unsigned i = 0; i < N; i++) a[i].name = 0;
    collisionCounter = 0;
    listNum = 0;        /* set the list ID counter to 0 */
}

void HashTable::insert(const char* s) {
    /* insert a word in the hash table and add an integer to identify the
       word's list in the compact list array */
    unsigned i = hash(s);
    if (a[i].name != 0 && strcmp(a[i].name, s) != 0) collisionCounter++;
    if (h(s, i) == Failure) { std::cout << "Hash table full\n"; exit(1); }
    a[i].name = new char[strlen(s) + 1];
    strcpy(a[i].name, s);
    a[i].listPos = listNum++;   /* the location for the word's list in the list array */
}

unsigned HashTable::getPos(const char* s) {
    /* return the position of a particular word's list
       (NB this is the list position, not the word's position in the hash table) */
    unsigned i = hash(s);
    if (h(s, i) == Success && a[i].name != 0)
        return a[i].listPos;
    else
        return N;   /* not found */
}


B Figure Legends

Figure 1: Diagram showing the fraction of all documents that contain each word. Each word has an integer ID from its position in the alphabetically sorted list of all words.

Figure 2: Diagram showing the inverted file list data structure. We implemented two separate linked data structures to preserve similarity between this data structure and the hash table structure. The speed of training and retrieval is dependent on implementation; by maintaining similarity we attempt to eliminate as many differences as possible, to permit comparisons between the different data structures.

Figure 3: Diagram showing the hash table data structure.

Figure 4: Diagram showing the hash table compact data structure. The integer location of the word's list is stored with the word in the first data structure and may then be used to access the contents of the list.

Figure 5: Diagram of the AURA modular system.

Figure 6: Diagram showing three stages of network training.

Figure 7: Diagram showing system recall. The input pattern has one bit set, so the CMM is thresholded at 1.

Figure 8: Diagram showing superpositioning of the input word vectors.

Figure 9: Diagram showing the number of the first 100 words from the list of all words occurring in each document. Each document has an integer ID.

Figure 10: Diagram showing the number of words counted in each document when taking every 50th word from the list of all words. Each document has an integer ID.

Figure 11: Diagram showing the number of frequent words (those in at least 20% of the documents) occurring in each document. Each document has an integer ID. NB the figure has three lower-frequency troughs in one section of the document ID range; this is due to the documents in that section of the Reuters dataset having fewer words.

Figure 12: Graph of the retrieval time for M of N matching with the first 100 words from the list of all words.

Figure 13: Graph of the retrieval times for M of N matching when taking every 50th word from the list of all words.

Figure 14: Graph of the retrieval times for M of N matching with frequent words (those in at least 20% of the documents).

Figure 15: Graph of the speedup of the CMM versus the other three data structures when retrieving frequent words (those in at least 20% of the documents).


C Figures

[Figure 1: graph of the word frequency; x-axis: words (by integer ID); y-axis: fraction of documents containing each word.]

[Figure 2: alphabetically ordered array of words linked to an array of pointers to lists.]

[Figure 3: hash array of words indexing into an array of pointers to lists.]

[Figure 4: hash array of words with pointers to list array locations, indexing into an array of lists.]

[Figure 5: the AURA modular system; data-to-binary lexical token converters for words and documents, superimposition of vectors, the correlation matrix memory, and a binary-to-data lexical token converter producing the list of matched documents on recall.]

[Figure 6: input patterns and class patterns being trained into the network in three stages.]

[Figure 7: recall; activation with one input bit set, thresholded at 1 to give the output pattern.]

[Figure 8: superimposition of input word vectors.]

[Figure 9: first 100 words from the alphabetical list of all words; x-axis: documents; y-axis: number of these words in each document.]

[Figure 10: every 50th word from the alphabetical list of all words; x-axis: documents; y-axis: number of these words in each document.]

[Figure 11: words occurring in at least 20% of the documents; x-axis: documents; y-axis: number of these words in each document.]

[Figure 12: first 100 words from the alphabetical list of all words; x-axis: more than X matching words; y-axis: time (secs); plots for CMM, hash array, word array and hash compact.]

[Figure 13: every 50th word from the alphabetical list of all words; x-axis: more than X matching words; y-axis: time (secs); plots for CMM, hash array, word array and hash compact.]

[Figure 14: words occurring in at least 20% of the documents; x-axis: more than X matching words; y-axis: time (secs); plots for CMM, hash array, word array and hash compact.]

[Figure 15: speedup of the CMM versus the other algorithms; x-axis: more than X matching words; y-axis: speedup (time for data structure / time for CMM); plots for the IFL, hash array and hash compact.]